TRENDS delivers Hadoop courses in partnership with Koenig Solutions as an authorized Cloudera Training Partner.

Hadoop Developer with Spark

Course Details

Duration: 4 Days

Prerequisites: 

This course is best suited to developers and engineers who have programming experience. Knowledge of Java is strongly recommended and is required to complete the hands-on exercises. 

Course Description:

The Hadoop Developer with Spark certification prepares students to build robust data processing applications using Apache Hadoop. After completing this course, students will be able to understand workflow execution and work with the APIs, executing joins and writing MapReduce code. The course provides a realistic practice environment for the problems Hadoop developers face in the real world. With Big Data skills in high demand, companies across the globe seek Hadoop certification and expertise: Big Data analytics is a priority for many large organizations and helps them improve performance, so professionals with Hadoop skills are needed throughout the industry.

Course Objectives: 

•    Distribute, store, and process data in a Hadoop cluster
•    Write, configure, and deploy Apache Spark applications on a Hadoop cluster
•    Use the Spark shell for interactive data analysis
•    Process a live data stream with Spark Streaming
•    Process and query structured data using Spark SQL
•    Use Flume and Kafka to ingest data for Spark Streaming
 
Intended Audience: 

•    Developers
•    Engineers
•    Security Officers
•    Any professional with programming experience and basic familiarity with SQL and Linux commands
 
Course Outline:

Introduction to Apache Hadoop and the Hadoop Ecosystem 

•    Apache Hadoop Overview  
•    Data Ingestion and Storage  
•    Data Processing  
•    Data Analysis and Exploration  
•    Other Ecosystem Tools 
•    Introduction to the Hands-On Exercises 

Apache Hadoop File Storage 

•    Apache Hadoop Cluster Components  
•    HDFS Architecture 
•    Using HDFS 

Distributed Processing on an Apache Hadoop Cluster 

•    YARN Architecture  
•    Working With YARN 

Apache Spark Basics 

•    What is Apache Spark?  
•    Starting the Spark Shell 
•    Using the Spark Shell  
•    Getting Started with Datasets and DataFrames 
•    DataFrame Operations 
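
For a feel of this module, here is a minimal spark-shell sketch (assuming Spark 2.x; the shell predefines spark, and the input path is hypothetical):

    // Launched with `spark-shell`; `spark: SparkSession` is predefined
    val people = spark.read.json("/user/training/people.json") // hypothetical file
    people.printSchema()                 // inspect the inferred schema
    people.select("name", "age").show(5) // a basic DataFrame operation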

Working with DataFrames and Schemas 

•    Creating DataFrames from Data Sources  
•    Saving DataFrames to Data Sources 
•    DataFrame Schemas  
•    Eager and Lazy Execution 
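
A short sketch of defining a schema explicitly, reading lazily, and saving to a different format (paths hypothetical, Spark 2.x assumed):

    import org.apache.spark.sql.types._

    // An explicit schema avoids a separate inference pass over the data
    val schema = StructType(Seq(
      StructField("name", StringType),
      StructField("age",  IntegerType)))

    // Reading is lazy: nothing executes until an action or a write
    val df = spark.read.schema(schema).csv("/user/training/people.csv")
    df.write.mode("overwrite").parquet("/user/training/people_parquet")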

Analyzing Data with DataFrame Queries 

•    Querying DataFrames Using Column Expressions  
•    Grouping and Aggregation Queries  
•    Joining DataFrames
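
Column expressions, a grouped aggregation, and a join might look like this in the shell (both inputs are hypothetical):

    import org.apache.spark.sql.functions._

    val people   = spark.read.json("/user/training/people.json") // hypothetical
    val accounts = spark.read.parquet("/user/training/accounts") // hypothetical

    val adults = people.where($"age" >= 18)                // column expression
    adults.groupBy($"age").agg(count("*").as("n")).show()  // grouped aggregation
    people.join(accounts, people("name") === accounts("owner")).show()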

RDD Overview 

•    RDD Overview  
•    RDD Data Sources  
•    Creating and Saving RDDs 
•    RDD Operations 
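
Creating and saving RDDs, sketched with hypothetical paths:

    // RDDs come from files or from in-memory collections via the SparkContext
    val lines = spark.sparkContext.textFile("/user/training/weblogs/*")
    val nums  = spark.sparkContext.parallelize(1 to 1000)

    nums.take(5)                                        // action: Array(1, 2, 3, 4, 5)
    lines.saveAsTextFile("/user/training/weblogs_copy") // action: writes the RDD out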

Transforming Data with RDDs 

•    Writing and Passing Transformation Functions  
•    Transformation Execution  
•    Converting Between RDDs and DataFrames 
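
Passing transformation functions and moving between RDDs and DataFrames, under the same assumptions:

    // Named and anonymous transformation functions; both are lazy
    def toUpper(s: String): String = s.toUpperCase
    val lines  = spark.sparkContext.textFile("/user/training/weblogs/*")
    val errors = lines.map(toUpper).filter(_.contains("ERROR"))

    import spark.implicits._
    val df  = errors.toDF("line") // RDD[String] -> DataFrame
    val rdd = df.rdd              // DataFrame   -> RDD[Row]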

Aggregating Data with Pair RDDs 

•    Key-Value Pair RDDs  
•    Map-Reduce  
•    Other Pair RDD Operations 
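
The classic word-count pattern with a pair RDD, as a sketch:

    val lines = spark.sparkContext.textFile("/user/training/weblogs/*") // hypothetical
    val counts = lines
      .flatMap(_.split("\\s+")) // split lines into words
      .map(word => (word, 1))   // key-value pair RDD
      .reduceByKey(_ + _)       // map-reduce style aggregation by key
    counts.take(10).foreach(println)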

Querying Tables and Views with Apache Spark SQL 

•    Querying Tables in Spark Using SQL  
•    Querying Files and Views  
•    The Catalog API 
•    Comparing Spark SQL, Apache Impala, and Apache Hive-on-Spark
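
Registering a view and querying it with SQL, plus a peek at the Catalog API (input hypothetical):

    val people = spark.read.json("/user/training/people.json") // hypothetical
    people.createOrReplaceTempView("people")

    spark.sql("SELECT name FROM people WHERE age BETWEEN 13 AND 19").show()
    spark.catalog.listTables().show() // tables and views known to this session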

Working with Datasets in Scala

•    Datasets and DataFrames  
•    Creating Datasets  
•    Loading and Saving Datasets 
•    Dataset Operations 
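
A typed Dataset sketch (JSON integers arrive as Long, hence the field type; the path is hypothetical):

    case class Person(name: String, age: Long)
    import spark.implicits._

    // A Dataset is typed: field names and types are checked at compile time
    val ds = spark.read.json("/user/training/people.json").as[Person]
    ds.filter(p => p.age > 30).map(_.name).show()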

Writing, Configuring, and Running Apache Spark Applications 

•    Writing a Spark Application 
•    Building and Running an Application  
•    Application Deployment Mode  
•    The Spark Application Web UI 
•    Configuring Application Properties
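
A minimal stand-alone application, sketched for illustration (the object name and arguments are hypothetical):

    import org.apache.spark.sql.SparkSession

    object WordCountApp {
      def main(args: Array[String]): Unit = {
        // In an application, the SparkSession is created explicitly
        val spark = SparkSession.builder.appName("WordCountApp").getOrCreate()
        spark.read.textFile(args(0)).rdd
          .flatMap(_.split("\\s+"))
          .map((_, 1))
          .reduceByKey(_ + _)
          .saveAsTextFile(args(1))
        spark.stop()
      }
    }

Packaged into a JAR, it would be launched with spark-submit; the exact master and deploy-mode flags depend on the cluster, for example: spark-submit --master yarn --deploy-mode cluster --class WordCountApp app.jar <in> <out>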

Distributed Processing 

•    Review: Apache Spark on a Cluster  
•    RDD Partitions  
•    Example: Partitioning in Queries  
•    Stages and Tasks 
•    Job Execution Planning  
•    Example: Catalyst Execution Plan  
•    Example: RDD Execution Plan     
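
A brief look at partitions and execution plans (path hypothetical):

    val logs = spark.read.textFile("/user/training/weblogs/*")
    println(logs.rdd.getNumPartitions) // one task per partition in each stage

    // explain() prints the Catalyst physical plan, including shuffle stages
    logs.groupBy("value").count().explain()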

Distributed Data Persistence 

•    DataFrame and Dataset Persistence  
•    Persistence Storage Levels  
•    Viewing Persisted RDDs 
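
Persisting data that is reused across actions, as a sketch; persisted datasets also appear on the Storage tab of the application web UI:

    import org.apache.spark.storage.StorageLevel

    val people = spark.read.json("/user/training/people.json") // hypothetical
    people.persist(StorageLevel.MEMORY_AND_DISK) // spill to disk if memory is short
    people.count()   // first action materializes the cache
    people.count()   // later actions reuse the persisted data
    people.unpersist()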

Common Patterns in Apache Spark Data Processing 

•    Common Apache Spark Use Cases  
•    Iterative Algorithms in Apache Spark 
•    Machine Learning  
•    Example: k-means 
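
As a sketch of the k-means example, using the DataFrame-based MLlib API (the points input and its columns are hypothetical):

    import org.apache.spark.ml.clustering.KMeans
    import org.apache.spark.ml.feature.VectorAssembler

    val points = spark.read.parquet("/user/training/points") // hypothetical: columns x, y
    // k-means is iterative: each pass over the data refines the cluster centers
    val features = new VectorAssembler()
      .setInputCols(Array("x", "y"))
      .setOutputCol("features")
      .transform(points)
    val model = new KMeans().setK(3).setSeed(1L).fit(features)
    model.clusterCenters.foreach(println)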

Apache Spark Streaming: Introduction to DStreams 

•    Apache Spark Streaming Overview  
•    Example: Streaming Request Count  
•    DStreams  
•    Developing Streaming Applications 
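
A minimal DStream sketch (the socket source stands in for a real ingest channel):

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // A DStream is a sequence of RDDs, one per batch interval
    val ssc   = new StreamingContext(spark.sparkContext, Seconds(2))
    val lines = ssc.socketTextStream("localhost", 9999) // hypothetical test source
    lines.count().print() // per-batch record count
    ssc.start()
    ssc.awaitTermination()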

Apache Spark Streaming: Processing Multiple Batches 

•    Multi-Batch Operations  
•    Time Slicing  
•    State Operations  
•    Sliding Window Operations  
•    Preview: Structured Streaming 
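
Window and state operations need a checkpoint directory; continuing the previous streaming sketch:

    ssc.checkpoint("/user/training/checkpoints") // hypothetical path

    // Count records seen in the last 30 seconds, recomputed every 10 seconds
    val windowed = lines.countByWindow(Seconds(30), Seconds(10))
    windowed.print()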

Apache Spark Streaming: Data Sources 

•    Streaming Data Source Overview  
•    Apache Flume and Apache Kafka Data Sources  
•    Example: Using a Kafka Direct Data Source
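
A direct (receiverless) Kafka stream, sketched with the spark-streaming-kafka-0-10 integration and the ssc from the earlier sketches (broker, topic, and group id are hypothetical):

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.streaming.kafka010._

    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "broker1:9092",
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "example-group")

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](Seq("weblogs"), kafkaParams))
    stream.map(_.value).count().print() // records per batch, straight from Kafka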