Big Data, Hadood and Spark Training & Certification

Course Duration75 Hrs.
Course ModeInstructor Led Training
Course Fee₹ 9700

About The Course

AICouncil certified Training and certification program will let you understand Big Data Hadoop and Spark with around 15 real-time industry oriented project development. This course has been designed with MapReduce, Hive, Pig, Sqoop, Oozie and Flume and work with Amazon EC2 for cluster setup, Spark framework and RDD, Scala and Spark SQL, Machine Learning using Spark, Spark Streaming, etc. It’s a training which is designed by industry experts to let you prepared for Cloudera CCA Spark and Hadoop Developer Certification (CCA175) and current job requirements. Your certificate will be industry recognized as Hadoop developer, Hadoop administrator, Hadoop testing and analytics with Apache Spark.

Key Features

Instructor–led training

Highly interactive instructor-led training

Free lifetime access to recorded classes

Get lifetime access of all recored classes in your profile

Regular assignment and assessments

Real-time projects after every module

Lifetime accessibility

Lifetime access and free upgrade to the latest version

3 Years of technical support

Lifetime 24/7 technical support and query resolution

Globally Recognized Certification

Get global industry-recognized certifications

Highlights

How to write applications using Hadoop and YARN?
How to create pseudo-node and multi-node clusters on Amazon EC2?
Implementing and Hands on experience on HDFS, MapReduce, Hive, Pig, Oozie, Sqoop, Flume, ZooKeeper and HBase, Spark, Spark SQL, Streaming, Data Frame, RDD, GraphX and MLlib
Hadoop administration activities and testing applications
ETL tool configuration like Pentaho/Talend with Mapreduce, Hive and Pig.
How to work with Avro data formats?
How to work and deploy real-time projects based out of Hadoop and Apache Spark?

Mode of Learning and Duration

Online

Weekdays – 7 to 8 weeks
Weekend – 10 to 12 weeks
FastTrack – 5 to 6 weeks

Offline

Weekdays – 7 to 8 weeks
Weekend – 10 to 12 weeks
FastTrack – 5 to 6 weeks

Course Agenda

Getting Started with BigData & Hadoop

Introduction to Big Data and Hadoop
Introduction to Big Data Big Data Analytics
What is Big Data?
Various shell commands in Hadoop
Understanding configuration files in Hadoop
Installing single node cluster with Cloudera Manager and understanding Spark, Scala, Sqoop, Pig and Flume
What is Big Data and where does Hadoop fit in,
Two important Hadoop ecosystem components, namely, MapReduce and HDFS,
In-depth Hadoop Distributed File System – Replications, Block Size, Secondary Name node, High Availability
YARN – resource manager and node manager

Hands-on (Introduction):

HDFS working mechanism,
Data replication process,
Determine the size of the block,
Understanding a data node and name node

Data Ingestion into Big Data Systems and ETL

Data Ingestion Into Big Data Systems and Etl Data Ingestion
Data Ingestion
Apache Sqoop
Sqoop and Its Uses
Sqoop Processing
Sqoop Import
Process Sqoop Connectors
Importing and Exporting Data from MySQL to HDFS
How ETL tools work in Big Data industry,
Introduction to ETL and data warehousing,
Working with prominent use cases of Big Data in ETL industry and end-to-end ETL PoC showing Big Data integration with ETL tool

Hands-on (ETL Connectivity with Hadoop):

Connecting to HDFS from ETL tool and moving data from Local system to HDFS
Moving data from DBMS to HDFS
Working with Hive with ETL Tool
Creating MapReduce job in ETL tool

Understanding MapReduce and Pig with Distributed Processing

Working mechanism of MapReduce,
Mapping and reducing stages in MR,
Input Format,
Output Format,
Partitioners,
Combiners
Shuffle and Sort
Pig Distributed Processing in Mapreduce
Word Count Example
Map Execution Phases
Map Execution Distributed
Two Node Environment Mapreduce Jobs
Hadoop Mapreduce Job Work Interaction
Setting Up the Environment for Mapreduce Development
Set of Classes Creating a New Project
Advanced Mapreduce Data Types in Hadoop Output formats in Mapreduce
Using Distributed Cache
Joins in Mapreduce
Replicated Join
Introduction to Pig
Components of Pig
Pig Data Model
Pig Interactive Modes
Pig Operations
Various Relations Performed by Developers

Hands-on (MapReduce and Pig with Distributed Processing):

Analyzing Sales Data and Solving Kpis Using Pig Practice
Analyzing Web Log Data Using Mapreduce
Write a Word Count program in MapReduce,
Write a Custom Partitioner,
MapReduce Combiner,
Run a job in a local job runner,
Deploying unit test,
Map side join v/s reduce side join,
Tool runner,
Use counters,
Dataset joining with map side and reduce side joins

Apache Hive

Introduction to HIVE
Detailed Mapreduce Hive Architecture
Comparing Hive with Pig and RDBMS,
Working with Hive Query Language,
Creation of database, table,
Group by and other clauses,
Various types of Hive tables,
HCatalog, storing the Hive Results,
Hive partitioning and Buckets
SQL over Hadoop
Interfaces to Run Hive
Beeline from Command Line
Hive Metastore
Hive DDL and DML
Creating New Table Data Types
Validation of Data
File Format Types
Data Serialization
Hive Table and Avro
Schema Hive Optimization
Partitioning
Bucketing and Sampling Non-Partitioned Table Data
Insertion Dynamic Partitioning in Hive Bucketing
What Do Buckets Do?
Hive Analytics UDF and UDAF
Other Functions of Hive,

Hands-on (Apache Hive):

Real-time Analysis and Data Filtration
Data Representation and Import Using Hive
Database creation in Hive,
Dropping a database,
Hive table creation,
Change the database,
Data loading,
Hive table creation,
Dropping and altering table,
Pulling data by writing Hive queries with filter conditions,
Table partitioning in Hive and Group by clause.

Advanced Hive and Impala

Indexing in Hive,
Map Side Join in Hive,
Working with complex data types,
Hive User-defined Functions,
Introduction to Impala,
Comparing Hive with Impala,
Detailed architecture of Impala

Hands-on (Hive and Impala):

How to work with Hive queries,
Process of joining table and writing indexes,
External table and sequence table deployment
Data storage in a different table

Flume, Sqoop and NoSQL Databases HBase

Apache Flume
Flume Model Scalability in Flume Components in Flume’s Architecture
Configuring Flume Components
Apache Sqoop introduction
Overview
Importing and exporting data
Performance improvement with Sqoop
Sqoop limitations
Introduction to Flume and understanding the architecture of Flume
What is HBase and the CAP theorem
NoSQL Databases
HBase NoSQL Introduction
Demo: Yarn Tuning
Hbase Overview
Hbase Architecture Data Model
Connecting to HBase
Practice Project: HBase Shell

Hands-on (Flume, Sqoop and NoSQL):

Working with Flume to generate Sequence Number and consuming it
Using the Flume Agent to consume the Twitter data
Using AVRO to create Hive Table, AVRO with Pig
Creating Table in HBase and deploying Disable
Scan and Enable Table

Functional Programming and Scala

Basics of Functional Programming and Scala
Introduction to Scala
Scala Installation
Functional Programming
Programming with Scala
Basic Literals and Arithmetic Programming
Logical Operators Type
Inference Classes
Objects and Functions in Scala
Type Inference Functions
Anonymous Function and Class Collections
Types of Collections
Five Types of Collections
Operations on List Scala REPL
Features of Scala REPL Key

Apache Spark Framework

Apache Spark
Next-Generation Big Data Framework
History of Spark Limitations of Mapreduce in Hadoop
Introduction to Apache Spark
Components of Spark Application of In-memory Processing
Hadoop Ecosystem vs Spark
Advantages of Spark
Spark Architecture
Spark Cluster in Real World
Running a Scala Programs in Spark Shell
Setting Up Execution Environment in IDE
Spark Web UI
Key Takeaways Knowledge Check

Writing Spark Applications Using Scala

Using Scala for writing Apache Spark applications
Detailed study of Scala, and the need for Scala
The concept of object oriented programming
Executing the Scala code
Various classes in Scala like Getters, Setters, Constructors, Abstract, Extending Objects, Overriding Methods
The Java and Scala interoperability
The concept of functional programming and anonymous functions
Bobsrockets package and comparing the mutable and immutable collections
Scala REPL
Lazy Values
Control Structures in Scala
Directed Acyclic Graph (DAG)
first Spark application using SBT/Eclipse
Spark Web UI
Spark in Hadoop ecosystem.

Hands-on (Spark Applications):

Writing Spark application using Scala
Understanding the robustness of Scala for Spark real-time analytics operation

RDD in Spark

Introduction to Spark RDD
RDD in Spark
Creating Spark RDD
Pair RDD
RDD Operations
Spark Transformation and Action
Storage Levels Lineage and DAG
Need for DAG Debugging in Spark
Partitioning in Spark
Scheduling in Spark Shuffling in Spark
Sort Shuffle
Aggregating Data with Paired RDD
Spark Application with Data Written Back to HDFS and Spark UI
Changing Spark Application Parameters
Handling Different File Formats
Spark RDD with Real-world Application
Optimizing Spark Jobs

Hands-on (RDD):

Deploy RDD with HDFS
Using the in-memory dataset
Using file for RDD
Define the base RDD from external file
Deploying RDD via transformation
Using the Map and Reduce functions and working on word count and count log severity

Processing Data Frames and Spark SQL

Spark SQL
Processing DataFrames
SQL in Spark for working with structured data processing
Spark SQL Architecture Dataframes
Handling Various Data Formats
Implement Various Dataframe Operations
UDF and UDAF Interoperating With RDDs
Process Dataframe Using SQL Query
RDD vs Dataframe vs Dataset
Spark SQL JSON support
Working with XML data and parquet files
Creating Hive Context
Writing Data Frame to Hive
How to read a JDBC file
Significance of a Spark Data Frame
How to create a Data Frame
What is schema manual inferring
How to work with CSV files
JDBC table reading
Data conversion from Data Frame to JDBC
Spark SQL user-defined functions
Shared variable and accumulators
How to query and transform data in Data Frames
How Data Frame provides the benefits of both Spark RDD and Spark SQL and deploying Hive on Spark as the execution engine

Hands-on (Data Frames and Spark SQL):

Data querying and transformation using Data Frames
Finding out the benefits of Data Frames over Spark SQL and Spark RDD

Machine Learning Using Spark (MLlib)

Introduction to Spark MLlib
Big Data With Spark
Role of Data Scientist and Data Analyst in Big Data
Analytics in Spark
understanding various algorithms
what is Spark iterative algorithm
Spark graph processing analysis
Machine Learning
Supervised Learning
Classification of Linear SVM
Linear Regression
Unsupervised Clustering K-means
Reinforcement Learning
Semi-supervised Learning
Overview of Mlib
Spark variables like shared and broadcast variables and what are accumulators, various ML algorithms supported by MLlib
Linear Regression, Logistic Regression, Decision Tree, Random Forest, K-means clustering techniques, building a Recommendation Engine

Hands-on (MLlib):

Building a Recommendation Engine

Spark Streaming

Introduction to Spark streaming
Architecture of Spark streaming
Working with the Spark streaming program
Data Processing Architectures
Real-time Data Processing
Writing Spark Streaming
Processing data using Spark streaming
Requesting count and DStream
Multi-batch and sliding window operations
Working with advanced data sources
Introduction to Spark Streaming
Features of Spark Streaming
Spark Streaming workflow
Initializing StreamingContext
Discretized Streams (DStreams), Input DStreams and Receivers
Transformations on DStreams
Output Operations on DStreams
Windowed Operators and why it is useful
Important Windowed Operators
Stateful Operators
Join Operations
Stream-dataset Join
Windowing of Real-time Data Processing Streaming Sources
Processing Twitter Streaming Data
Structured Spark Streaming Use Case
Banking Transactions
Structured Streaming Architecture Model and Its Components
Output Sinks Structured Streaming
APIs Constructing Columns in Structured Streaming Windowed Operations on Event-time Use Cases
Streaming Pipeline

Hands-on (Spark Streaming):

Twitter Sentiment Analysis
Streaming using netcat server
Kafka-Spark Streaming and Spark-Flume Streaming

Apache Flume and Apache Kafka Integration

Why Kafka
What is Kafka
Kafka architecture
Kafka workflow
Configuring Kafka cluster & basic operations
Kafka monitoring tools
Integrating Apache Flume and Apache Kafka

Hands-on (Flume & Kafka):

Configuring Single Node Single Broker Cluster
Configuring Single Node Multi Broker Cluster
Producing and consuming messages
Integrating Apache Flume and Apache Kafka.

Spark GraphX

Spark GraphX
Introduction to Graph
GraphX in Spark
GraphX Operators
Join Operators
GraphX Parallel System
Algorithms in Spark
Pregel API
Use Case of GraphX
GraphX Vertex Predicate
Page Rank Algorithm

Hadoop Administration – Multi-node Cluster Setup Using Amazon EC2

Create a 4-node Hadoop cluster setup
Running the MapReduce Jobs on the Hadoop cluster
Successfully running the MapReduce code and working with the Cloudera Manager setup

Hands-on (Cluster Setup):

The method to build a multi-node Hadoop cluster using an Amazon EC2 instance and working with the Cloudera Manager

Hadoop Administration – Cluster Configuration

Overview of Hadoop configuration
Importance of Hadoop configuration file
Various parameters and values of configuration
HDFS parameters and MapReduce parameters
Setting up the Hadoop environment
Include and Exclude configuration files
Administration and maintenance of name node
Data node directory structures and files
What is a File system image and understanding Edit log.

Hands-on (Cluster Configuration):

The process of performance tuning in MapReduce

Hadoop Administration – Maintenance, Monitoring and Troubleshooting

Introduction to the checkpoint procedure
Name node failure and how to ensure the recovery procedure
Safe Mode, Metadata and Data backup
Various potential problems and solutions
What to look for and how to add and remove nodes

Hands-on Exercise (Hadoop Administration):

How to go about ensuring the MapReduce File System Recovery for different scenarios
JMX monitoring of the Hadoop cluster
How to use the logs and stack traces for monitoring and troubleshooting?
Using the Job Scheduler for scheduling jobs in the same cluster
Getting the MapReduce job submission flow
FIFO schedule and getting to know the Fair Scheduler and its configuration

Project Discussion and Cloudera Certification Preparation:

Hadoop project solution discussion
Preparing for the Cloudera certifications
Tips for cracking Hadoop interview questions
Real world Project development and Deployment

Projects

1. Analysis of Airline Service

Use the data of airlines services in India in terms of routes covered and operational airports to analyse list of operating airports in India with maximum stops and minimum stops. As well as finding out the territory with highest number of airports and active airlines in cross ponding territory. This analysis will lead to match the demand of airline service requirement in a particular area.

2. Movie data analysis to find out top grossing and rated movies

Here you will use IMDB movie rating data set to analyse top rated movies using MapReduce program. Main highlight of the program is the use of Apache PIG and Apache Hive with Mapreduce for analysing, warehousing and querying the data.

3. Speed up Table Data Partitioning using Apache Hive

This project will be implemented with Hive table data partitioning to ensure right partitioning helps to read the data, deploy it on the HDFS and run the MapReduce jobs at a much faster rate. You can do Data partition in multiple ways to deploy single SQL execution in Dynamic partition and bucketing of data.

4. Deploy ETL for data analysis activities

In this project you will connect Pentaho with Hadoop ecosystem which works well with HDFS, HBase, Oozie and ZooKeeper. You will connect the Hadoop cluster with Pentaho data integration, analytics, Pentaho server and report designer. This project will develop an experience with Working knowledge of ETL and Business Intelligence along with Configuring Pentaho to work with Hadoop distribution.

5. Working with Incremental Data

In this project you will have hands on experience with bringing daily data into Hadoop Distributed File system. We have transaction data which is daily recorded/stored in the RDBMS. Now this data is transferred everyday into HDFS for further Big Data Analytics. You will work on live Hadoop YARN cluster which is a part of the Hadoop ecosystem that lets Hadoop to decouple from MapReduce and deploy more competitive processing and wider array of applications.

6. Setup Hadoop Real time cluster on Amazon EC2

With this project you will know how to work on real world Hadoop multi-node cluster setup in a distributed environment. You will get a complete demonstration of working with various Hadoop cluster master and slave nodes, installing Java as a prerequisite for running Hadoop, installation of Hadoop and mapping the nodes in the Hadoop cluster. You will be mostly focused over multimode clustering on Amazon EC2 and deployment of MapReduce job on Hadoop cluster.

7. Deriving insights from web log data using Web Log Analytics

In this project you will be focused over making sense of all web log data to derive meaning full insights from it. Here you need to work with loading server data which includes various URLs visited, cookie data, user demographics, location, date and time of web service access, etc into Hadoop cluster using various techniques. you will transport the data using Apache Flume or Kafka, workflow and data cleansing using MapReduce, Pig or Spark. The insight thus derived can be used for analyzing customer behavior and predict buying patterns.

8. Data Exploration and making sense of Wikipedia data using Spark SQL

This project will let you use Spark SQL tool for analysing the WikiPedia Data with hands on experience in integrating Spark SQL for various applications like batch analysis, Machine Learning, visualizing and processing of data and ETL processes, along with real-time analysis of data.

Certification

Career Support

We have a dedicated team which is taking care of our learners learning objectives.

FAQ

What are Prerequisites?

There is no such prerequisite if you are enrolling for Master’s Course as everything will start from scratch. Whether you are a working IT professional or a fresher you will find a course well planned and designed to incorporate trainee from various professional backgrounds.

What sort of support can I expect?

AI Council offers 24/7 query resolution, you can raise a ticket with a dedicated support team and expect a revert within 24 Hrs. Email support can resolve all your query but if still it wasn’t resolved then we can schedule one-on-one session with our instructor or dedicated team. You can even contact our support after completing the training as well. There are no limits on number of tickets raised.

Which are the different training modes provided by AI council??

AI council provide two different modes for training one can choose for instructor lead training or learning with prerecorded video on demand. We also offer faculty development programs for college and schools. apart from this corporate training for organization/companies to enhance and update technical skills of the employees. We have highly qualified trainers who are working in the training industry from a very long time and have delivered the sessions and training for top colleges/schools and companies.

What if I fail to understand the topic or have doubt in the topic delivered?

We are providing a 24/7 assistance for the ease of the student. Any query can be raised through the interface itself as well as can be communicated through email also. If someone is facing difficulties with above methods mentioned above we can arrange a one on one session with the trainer to help you with difficulties faced in learning. You can raise the query throughout the total training period as well as after the completion of the training.

What kind of projects are included as part of the training?

AI Council offers you the latest, appropriate and most importantly the real-world projects throughout your training period. This makes student to gain industry level experience and converting the learning’s into solution to create the projects. Each Training Module is having Task or projects designed for the students so that you can evaluate your learning’s. You will be working on projects related to different industries such as marketing, e-commerce, automation, sales etc.

Do AI Council provide any job assistance?

Yes, we do provide the job assistance so that a learner can apply for a job directly after the completion of the training. We have tied-ups with companies so when required we refers our students to those companies for interviews. Our team will help you to build a good resume and will trained you for your job interview.

How one can be awarded with AI Council verified certificate?

After the successful completion of the training program and the submission of assignments/quiz, projects you have to secure at least B grade in qualifying exam, AI Council certified certificate will be awarded to you. Every certificate will be having a unique number through which same can be verified on our site.

Is there any guarantee of job through job Assistance?

To be very professional and transparent No, we don’t guarantee the job. the job assistance will help to provide you an opportunity to grab a dream job. The selection totally depends upon the performance of the candidate in the interview and the demand of the recruiter.

Do the courses offered are instructor led or self-paced?

Our most of the programs are having both the modes of training i.e. instructor led and self-paced. One can choose any of the modes depending upon their work schedule. We provide flexibility to choose the type of training modes. While registering for courses you will be asked to submit your preference to select any of the modes. If any of the course is not offered in both modes so you can check in which mode, the training is going on and then you can register for the same. In any case if you feel you need any other training mode you can contact our team.

Can I enroll in multiple courses at a time?

Yes, definitely you can opt for multiple courses at a time. We provide flexible timings. If you are having a desire for learning different topics while continuing with your daily hectic schedule our course timing and modes will help you a lot to carry on the learning’s.

When I will get the access to the course, if I am registering today?

Whenever you are enrolling in any of the courses we will send the notification you on your contact details. You will be provided with unique registration id and after successful enrollment all of the courses will be added to your account profile on our website.AI Council provides lifetime access to course content whenever needed.

What is a Major/Capstone Project?

A Capstone project is an outcome of the culminating learning throughout the academic years. It is the final project that represents your knowledge, efforts in the field of educational learning. It can be chosen by the mentor or by the students to come with a solution.

Is it compulsory to submit the capstone project?

Yes, for obtaining the certificate of diploma programmer you have to submit the capstone project.

₹ 9700

Choose Supports and Services (Optional)

Professional Networking Management Skill
Resume Building
Portfolio Website Development
Lifetime Job Assistance Program

GET REGISTERED CHECKOUT