Big Data, Hadoop and Spark Training & Certification

  • Course Duration: 75 Hrs.
  • Course Mode: Instructor-Led Training
  • Course Fee: ₹ 9700

About The Course

This AICouncil-certified training and certification program helps you understand Big Data, Hadoop and Spark through around 15 real-time, industry-oriented development projects. The course covers MapReduce, Hive, Pig, Sqoop, Oozie and Flume, cluster setup on Amazon EC2, the Spark framework and RDDs, Scala and Spark SQL, machine learning with Spark, Spark Streaming, and more. Designed by industry experts, it prepares you for the Cloudera CCA Spark and Hadoop Developer Certification (CCA175) and for current job requirements. Your certificate is industry-recognized for roles in Hadoop development, Hadoop administration, Hadoop testing, and analytics with Apache Spark.

Key Features

Instructor-led training

Highly interactive instructor-led training

Free lifetime access to recorded classes

Get lifetime access to all recorded classes in your profile

Regular assignment and assessments

Real-time projects after every module

Lifetime accessibility

Lifetime access and free upgrade to the latest version

3 Years of technical support

Lifetime 24/7 technical support and query resolution

Globally Recognized Certification

Get global industry-recognized certifications


  • How to write applications using Hadoop and YARN?
  • How to create pseudo-node and multi-node clusters on Amazon EC2?
  • Hands-on implementation experience with HDFS, MapReduce, Hive, Pig, Oozie, Sqoop, Flume, ZooKeeper, HBase, Spark, Spark SQL, Spark Streaming, DataFrames, RDDs, GraphX and MLlib
  • Hadoop administration activities and testing applications
  • ETL tool configuration (e.g. Pentaho/Talend) with MapReduce, Hive and Pig
  • How to work with Avro data formats?
  • How to work and deploy real-time projects based out of Hadoop and Apache Spark?

Mode of Learning and Duration

  • Weekdays – 7 to 8 weeks
  • Weekend – 10 to 12 weeks
  • FastTrack – 5 to 6 weeks


Course Agenda

  • Introduction to Big Data and Hadoop
  • Introduction to Big Data and Big Data Analytics
  • What is Big Data?
  • Various shell commands in Hadoop
  • Understanding configuration files in Hadoop
  • Installing single node cluster with Cloudera Manager and understanding Spark, Scala, Sqoop, Pig and Flume
  • What is Big Data and where does Hadoop fit in
  • Two important Hadoop ecosystem components, namely MapReduce and HDFS
  • In-depth Hadoop Distributed File System: Replication, Block Size, Secondary NameNode, High Availability
  • YARN: Resource Manager and Node Manager
  • HDFS working mechanism
  • Data replication process
  • Determining the size of the block
  • Understanding a data node and name node
  • Data Ingestion into Big Data Systems and ETL
  • Data Ingestion
  • Apache Sqoop
  • Sqoop and Its Uses
  • Sqoop Processing
  • Sqoop Import
  • Process Sqoop Connectors
  • Importing and Exporting Data from MySQL to HDFS
  • How ETL tools work in the Big Data industry
  • Introduction to ETL and data warehousing
  • Working with prominent use cases of Big Data in the ETL industry and an end-to-end ETL PoC showing Big Data integration with an ETL tool
  • Connecting to HDFS from ETL tool and moving data from Local system to HDFS
  • Moving data from DBMS to HDFS
  • Working with Hive with ETL Tool
  • Creating MapReduce job in ETL tool
  • Working mechanism of MapReduce
  • Mapping and reducing stages in MR
  • Input Format
  • Output Format
  • Partitioners
  • Combiners
  • Shuffle and Sort
  • Distributed Processing in MapReduce
  • Word Count Example
  • Map Execution Phases
  • Map Execution Distributed
  • MapReduce Jobs in a Two-Node Environment
  • Hadoop MapReduce Job Work Interaction
  • Setting Up the Environment for MapReduce Development
  • Set of Classes; Creating a New Project
  • Advanced MapReduce: Data Types in Hadoop, Output Formats in MapReduce
  • Using Distributed Cache
  • Joins in Mapreduce
  • Replicated Join
  • Introduction to Pig
  • Components of Pig
  • Pig Data Model
  • Pig Interactive Modes
  • Pig Operations
  • Various Relations Performed by Developers
  • Practice: Analyzing Sales Data and Solving KPIs Using Pig
  • Analyzing Web Log Data Using MapReduce
  • Write a Word Count program in MapReduce
  • Write a Custom Partitioner
  • MapReduce Combiner
  • Run a job in a local job runner
  • Deploying unit tests
  • Map-side join vs reduce-side join
  • Tool runner
  • Use counters
  • Dataset joining with map-side and reduce-side joins
  • Introduction to HIVE
  • Detailed Hive Architecture
  • Comparing Hive with Pig and RDBMS
  • Working with Hive Query Language
  • Creation of databases and tables
  • Group By and other clauses
  • Various types of Hive tables
  • HCatalog; storing the Hive results
  • Hive partitioning and buckets
  • SQL over Hadoop
  • Interfaces to Run Hive
  • Beeline from Command Line
  • Hive Metastore
  • Hive DDL and DML
  • Creating New Table Data Types
  • Validation of Data
  • File Format Types
  • Data Serialization
  • Hive Table and Avro Schema
  • Hive Optimization
  • Partitioning
  • Bucketing and Sampling
  • Non-Partitioned Table Data Insertion
  • Dynamic Partitioning in Hive
  • Bucketing: What Do Buckets Do?
  • Hive Analytics UDF and UDAF
  • Other Functions of Hive
  • Real-time Analysis and Data Filtration
  • Data Representation and Import Using Hive
  • Database creation in Hive
  • Dropping a database
  • Hive table creation
  • Changing the database
  • Data loading
  • Dropping and altering tables
  • Pulling data by writing Hive queries with filter conditions
  • Table partitioning in Hive and the Group By clause
  • Indexing in Hive
  • Map-side Join in Hive
  • Working with complex data types
  • Hive User-defined Functions
  • Introduction to Impala
  • Comparing Hive with Impala
  • Detailed architecture of Impala
  • How to work with Hive queries
  • Process of joining tables and writing indexes
  • External table and sequence table deployment
  • Data storage in a different table
  • Apache Flume
  • Flume Model
  • Scalability in Flume
  • Components in Flume's Architecture
  • Configuring Flume Components
  • Apache Sqoop introduction
  • Overview
  • Importing and exporting data
  • Performance improvement with Sqoop
  • Sqoop limitations
  • Introduction to Flume and understanding the architecture of Flume
  • What is HBase and the CAP theorem
  • NoSQL Databases
  • HBase NoSQL Introduction
  • Demo: YARN Tuning
  • HBase Overview
  • HBase Architecture and Data Model
  • Connecting to HBase
  • Practice Project: HBase Shell
  • Working with Flume to generate Sequence Number and consuming it
  • Using the Flume Agent to consume the Twitter data
  • Using AVRO to create Hive Table, AVRO with Pig
  • Creating Table in HBase and deploying Disable
  • Scan and Enable Table
  • Basics of Functional Programming and Scala
  • Introduction to Scala
  • Scala Installation
  • Functional Programming
  • Programming with Scala
  • Basic Literals and Arithmetic Programming
  • Logical Operators Type
  • Inference Classes
  • Objects and Functions in Scala
  • Type Inference Functions
  • Anonymous Function and Class Collections
  • Types of Collections
  • Five Types of Collections
  • Operations on List
  • Scala REPL
  • Features of Scala REPL
  • Apache Spark
  • Next-Generation Big Data Framework
  • History of Spark; Limitations of MapReduce in Hadoop
  • Introduction to Apache Spark
  • Components of Spark
  • Application of In-memory Processing
  • Hadoop Ecosystem vs Spark
  • Advantages of Spark
  • Spark Architecture
  • Spark Cluster in Real World
  • Running Scala Programs in the Spark Shell
  • Setting Up Execution Environment in IDE
  • Spark Web UI
  • Key Takeaways and Knowledge Check
  • Using Scala for writing Apache Spark applications
  • Detailed study of Scala, and the need for Scala
  • The concept of object-oriented programming
  • Executing the Scala code
  • Various classes in Scala like Getters, Setters, Constructors, Abstract, Extending Objects, Overriding Methods
  • The Java and Scala interoperability
  • The concept of functional programming and anonymous functions
  • Bobsrockets package and comparing the mutable and immutable collections
  • Scala REPL
  • Lazy Values
  • Control Structures in Scala
  • Directed Acyclic Graph (DAG)
  • First Spark application using SBT/Eclipse
  • Spark Web UI
  • Spark in Hadoop ecosystem.
  • Writing Spark application using Scala
  • Understanding the robustness of Scala for Spark real-time analytics operation
  • Introduction to Spark RDD
  • RDD in Spark
  • Creating Spark RDD
  • Pair RDD
  • RDD Operations
  • Spark Transformation and Action
  • Storage Levels
  • Lineage and DAG
  • Need for DAG
  • Debugging in Spark
  • Partitioning in Spark
  • Scheduling in Spark
  • Shuffling in Spark
  • Sort Shuffle
  • Aggregating Data with Paired RDD
  • Spark Application with Data Written Back to HDFS and Spark UI
  • Changing Spark Application Parameters
  • Handling Different File Formats
  • Spark RDD with Real-world Application
  • Optimizing Spark Jobs
  • Deploy RDD with HDFS
  • Using the in-memory dataset
  • Using file for RDD
  • Define the base RDD from external file
  • Deploying RDD via transformation
  • Using the Map and Reduce functions and working on word count and count log severity
  • Spark SQL
  • Processing DataFrames
  • SQL in Spark for working with structured data processing
  • Spark SQL Architecture
  • DataFrames
  • Handling Various Data Formats
  • Implementing Various DataFrame Operations
  • UDF and UDAF
  • Interoperating with RDDs
  • Processing DataFrames Using SQL Queries
  • RDD vs DataFrame vs Dataset
  • Spark SQL JSON support
  • Working with XML data and parquet files
  • Creating Hive Context
  • Writing Data Frame to Hive
  • How to read a JDBC file
  • Significance of a Spark Data Frame
  • How to create a Data Frame
  • What is a schema and manual schema inference
  • How to work with CSV files
  • JDBC table reading
  • Data conversion from Data Frame to JDBC
  • Spark SQL user-defined functions
  • Shared variable and accumulators
  • How to query and transform data in Data Frames
  • How Data Frame provides the benefits of both Spark RDD and Spark SQL and deploying Hive on Spark as the execution engine
  • Data querying and transformation using Data Frames
  • Finding out the benefits of Data Frames over Spark SQL and Spark RDD
  • Introduction to Spark MLlib
  • Big Data With Spark
  • Role of Data Scientist and Data Analyst in Big Data
  • Analytics in Spark
  • Understanding various algorithms
  • What is a Spark iterative algorithm
  • Spark graph processing analysis
  • Machine Learning
  • Supervised Learning
  • Classification with Linear SVM
  • Linear Regression
  • Unsupervised Clustering K-means
  • Reinforcement Learning
  • Semi-supervised Learning
  • Overview of MLlib
  • Spark variables (shared and broadcast variables), accumulators, and the various ML algorithms supported by MLlib
  • Linear Regression, Logistic Regression, Decision Tree, Random Forest and K-means clustering techniques
  • Building a Recommendation Engine
  • Introduction to Spark streaming
  • Architecture of Spark streaming
  • Working with the Spark streaming program
  • Data Processing Architectures
  • Real-time Data Processing
  • Writing Spark Streaming
  • Processing data using Spark streaming
  • Requesting count and DStream
  • Multi-batch and sliding window operations
  • Working with advanced data sources
  • Introduction to Spark Streaming
  • Features of Spark Streaming
  • Spark Streaming workflow
  • Initializing StreamingContext
  • Discretized Streams (DStreams), Input DStreams and Receivers
  • Transformations on DStreams
  • Output Operations on DStreams
  • Windowed Operators and why they are useful
  • Important Windowed Operators
  • Stateful Operators
  • Join Operations
  • Stream-dataset Join
  • Windowing of Real-time Data Processing
  • Streaming Sources
  • Processing Twitter Streaming Data
  • Structured Spark Streaming Use Case
  • Banking Transactions
  • Structured Streaming Architecture Model and Its Components
  • Output Sinks
  • Structured Streaming APIs
  • Constructing Columns in Structured Streaming
  • Windowed Operations on Event-time
  • Use Cases
  • Streaming Pipeline
  • Twitter Sentiment Analysis
  • Streaming using netcat server
  • Kafka-Spark Streaming and Spark-Flume Streaming
  • Why Kafka
  • What is Kafka
  • Kafka architecture
  • Kafka workflow
  • Configuring Kafka cluster & basic operations
  • Kafka monitoring tools
  • Integrating Apache Flume and Apache Kafka
  • Configuring Single Node Single Broker Cluster
  • Configuring Single Node Multi Broker Cluster
  • Producing and consuming messages
  • Spark GraphX
  • Introduction to Graph
  • GraphX in Spark
  • GraphX Operators
  • Join Operators
  • GraphX Parallel System
  • Algorithms in Spark
  • Pregel API
  • Use Case of GraphX
  • GraphX Vertex Predicate
  • Page Rank Algorithm
  • Create a 4-node Hadoop cluster setup
  • Running the MapReduce Jobs on the Hadoop cluster
  • Successfully running the MapReduce code and working with the Cloudera Manager setup
  • The method to build a multi-node Hadoop cluster using an Amazon EC2 instance and working with the Cloudera Manager
  • Overview of Hadoop configuration
  • Importance of Hadoop configuration file
  • Various parameters and values of configuration
  • HDFS parameters and MapReduce parameters
  • Setting up the Hadoop environment
  • Include and Exclude configuration files
  • Administration and maintenance of name node
  • Data node directory structures and files
  • What is a file system image (fsimage) and understanding the edit log
  • The process of performance tuning in MapReduce
  • Introduction to the checkpoint procedure
  • Name node failure and how to ensure the recovery procedure
  • Safe Mode, Metadata and Data backup
  • Various potential problems and solutions
  • What to look for and how to add and remove nodes
  • How to go about ensuring the MapReduce File System Recovery for different scenarios
  • JMX monitoring of the Hadoop cluster
  • How to use the logs and stack traces for monitoring and troubleshooting?
  • Using the Job Scheduler for scheduling jobs in the same cluster
  • Getting the MapReduce job submission flow
  • FIFO schedule and getting to know the Fair Scheduler and its configuration
  • Hadoop project solution discussion
  • Preparing for the Cloudera certifications
  • Tips for cracking Hadoop interview questions
  • Real world Project development and Deployment
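One of the agenda items above is writing a Word Count program in MapReduce. The three phases it exercises can be sketched in plain Python; this is a toy simulation of map, shuffle-and-sort and reduce, not actual Hadoop code:

```python
from collections import defaultdict

def map_phase(line):
    # Mapper: emit a (word, 1) pair for every word in the input line
    return [(word.lower(), 1) for word in line.split()]

def shuffle_and_sort(pairs):
    # Shuffle: group all values by key, as Hadoop does between map and reduce
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_phase(key, values):
    # Reducer: sum the counts for each word
    return (key, sum(values))

lines = ["big data with hadoop", "spark and hadoop"]
mapped = [pair for line in lines for pair in map_phase(line)]
results = dict(reduce_phase(k, v) for k, v in shuffle_and_sort(mapped))
print(results["hadoop"])  # "hadoop" appears in both lines, so the count is 2
```

The same logic, written against Hadoop's Mapper and Reducer classes in Java, is what the course exercise deploys on a real cluster.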



Use data on airline services in India, covering routes and operational airports, to analyse the operating airports in India with the maximum and minimum stops, and to find the territory with the highest number of airports along with the active airlines in the corresponding territory. This analysis helps match airline service supply to the demand in a particular area.
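A toy sketch of this territory aggregation in plain Python; the field layout and sample rows are assumptions for illustration, not the actual dataset:

```python
from collections import Counter

# Assumed simplified records: (airport_name, territory, active_airlines)
airports = [
    ("Airport A", "Delhi", 12),
    ("Airport B", "Maharashtra", 9),
    ("Airport C", "Maharashtra", 4),
    ("Airport D", "Delhi", 7),
    ("Airport E", "Kerala", 3),
    ("Airport F", "Maharashtra", 5),
]

# Count airports per territory and find the territory with the most airports
per_territory = Counter(territory for _, territory, _ in airports)
top_territory, airport_count = per_territory.most_common(1)[0]

# Sum the active airlines in that territory
active = sum(a for _, t, a in airports if t == top_territory)
print(top_territory, airport_count, active)
```

In the actual project the same group-and-count is expressed over the full dataset with MapReduce or Hive rather than in-memory Python.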

Here you will use the IMDB movie-rating dataset to analyse top-rated movies with a MapReduce program. The main highlight of the project is the use of Apache Pig and Apache Hive alongside MapReduce for analysing, warehousing and querying the data.
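The average-rating computation at the heart of this project can be sketched in plain Python over assumed (movie, rating) tuples; at scale the same aggregation is expressed in Hive or Pig:

```python
from collections import defaultdict

# Assumed (movie, rating) records; the real project reads these from HDFS
ratings = [
    ("Movie X", 5), ("Movie X", 4),
    ("Movie Y", 3), ("Movie Y", 5), ("Movie Y", 4),
    ("Movie Z", 2),
]

# Average rating per movie: accumulate a running [sum, count] per title
totals = defaultdict(lambda: [0, 0])
for movie, rating in ratings:
    totals[movie][0] += rating
    totals[movie][1] += 1

averages = {m: s / c for m, (s, c) in totals.items()}
top = sorted(averages, key=averages.get, reverse=True)
print(top[0])  # Movie X has the highest average (4.5)
```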

This project is implemented with Hive table data partitioning to show how the right partitioning helps you read the data, deploy it on HDFS and run MapReduce jobs at a much faster rate. You can partition data in multiple ways, using a single SQL execution with dynamic partitioning and bucketing of data.
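The idea behind dynamic partitioning can be illustrated in plain Python: each distinct value of the partition column gets its own directory, so a query that filters on that column only reads one directory. The paths and records here are illustrative, not Hive's actual implementation:

```python
import os
import tempfile

# Illustrative rows: (partition-column value, payload)
records = [("2024-01-01", "txn1"), ("2024-01-02", "txn2"), ("2024-01-01", "txn3")]

base = tempfile.mkdtemp()
for day, payload in records:
    # Each distinct partition value becomes its own directory,
    # mirroring Hive's .../table/day=2024-01-01/ layout on HDFS
    part_dir = os.path.join(base, f"day={day}")
    os.makedirs(part_dir, exist_ok=True)
    with open(os.path.join(part_dir, "data.txt"), "a") as f:
        f.write(payload + "\n")

# "Partition pruning": reading one day touches only one directory
with open(os.path.join(base, "day=2024-01-01", "data.txt")) as f:
    day1_rows = f.read().splitlines()
print(day1_rows)
```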

In this project you will connect Pentaho with the Hadoop ecosystem, which works well with HDFS, HBase, Oozie and ZooKeeper. You will connect the Hadoop cluster with Pentaho Data Integration, analytics, the Pentaho server and Report Designer. The project builds working knowledge of ETL and Business Intelligence, along with configuring Pentaho to work with a Hadoop distribution.

In this project you will get hands-on experience bringing daily data into the Hadoop Distributed File System. Transaction data is recorded daily in an RDBMS and transferred every day into HDFS for further Big Data analytics. You will work on a live Hadoop YARN cluster; YARN is the part of the Hadoop ecosystem that decouples Hadoop from MapReduce, enabling more competitive processing and a wider array of applications.
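This daily transfer is what Sqoop's incremental import automates. Its core idea, remembering the last imported key and fetching only newer rows, can be sketched in plain Python; the table layout and column names here are assumptions:

```python
# Simulated RDBMS table of transactions, keyed by an increasing id
transactions = [
    {"id": 1, "amount": 100},
    {"id": 2, "amount": 250},
    {"id": 3, "amount": 75},
]

def incremental_import(rows, last_value):
    # Like `sqoop import --incremental append --check-column id --last-value N`:
    # only rows whose id exceeds the stored watermark are transferred,
    # and the watermark is advanced for tomorrow's run
    new_rows = [r for r in rows if r["id"] > last_value]
    new_watermark = max((r["id"] for r in new_rows), default=last_value)
    return new_rows, new_watermark

batch, watermark = incremental_import(transactions, last_value=1)
print(len(batch), watermark)
```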

With this project you will learn how to work on a real-world Hadoop multi-node cluster setup in a distributed environment. You will get a complete demonstration of working with the various Hadoop cluster master and slave nodes, installing Java as a prerequisite for running Hadoop, installing Hadoop and mapping the nodes in the Hadoop cluster. The focus is on multi-node clustering on Amazon EC2 and deploying a MapReduce job on the Hadoop cluster.

In this project you will focus on making sense of web log data to derive meaningful insights from it. You will load server data, including URLs visited, cookie data, user demographics, location, and the date and time of web service access, into the Hadoop cluster using various techniques: transporting the data with Apache Flume or Kafka, and handling workflow and data cleansing with MapReduce, Pig or Spark. The insights derived can be used to analyse customer behaviour and predict buying patterns.
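A minimal Python sketch of the cleansing and URL-count step; the log format here is an assumed, simplified version of a real access log:

```python
from collections import Counter

# Assumed simplified log lines: "ip date url"
log_lines = [
    "10.0.0.1 2024-05-01 /home",
    "10.0.0.2 2024-05-01 /products",
    "10.0.0.1 2024-05-02 /home",
    "bad-line-without-enough-fields",
]

def parse(line):
    # Cleansing step: drop malformed lines instead of failing the whole job
    parts = line.split()
    return parts[2] if len(parts) == 3 else None

url_hits = Counter(url for url in map(parse, log_lines) if url)
print(url_hits.most_common(1))  # /home is the most visited URL
```

In the project itself this parse-and-count runs as a MapReduce, Pig or Spark job over data Flume or Kafka has landed in HDFS.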

This project lets you use Spark SQL to analyse Wikipedia data, with hands-on experience integrating Spark SQL into applications such as batch analysis, machine learning, data visualization and processing, and ETL, along with real-time analysis of data.
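The kind of batch query this project runs in Spark SQL can be sketched with plain SQL over an in-memory table, here using Python's built-in sqlite3 as a stand-in for Spark SQL, with made-up page-view rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (page TEXT, views INTEGER)")
conn.executemany(
    "INSERT INTO page_views VALUES (?, ?)",
    [("Apache_Spark", 500), ("Hadoop", 300), ("Apache_Spark", 200)],
)

# Aggregate views per page, as a Spark SQL batch-analysis query would
top = conn.execute(
    "SELECT page, SUM(views) AS total FROM page_views "
    "GROUP BY page ORDER BY total DESC LIMIT 1"
).fetchone()
print(top)
```

In Spark the same statement runs via `spark.sql(...)` over a DataFrame registered as a temporary view, distributed across the cluster.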



Career Support

We have a dedicated team taking care of our learners' learning objectives.


There is no prerequisite if you are enrolling for the Master's Course, as everything starts from scratch. Whether you are a working IT professional or a fresher, you will find the course well planned and designed to accommodate trainees from various professional backgrounds.

AI Council offers 24/7 query resolution; you can raise a ticket with a dedicated support team and expect a reply within 24 hours. Email support can resolve most queries, but if a query remains unresolved we can schedule a one-on-one session with our instructor or dedicated team. You can also contact our support after completing the training. There is no limit on the number of tickets raised.
AI Council provides two different modes of training: instructor-led training, or learning with pre-recorded video on demand. We also offer faculty development programs for colleges and schools, as well as corporate training for organizations/companies to enhance and update their employees' technical skills. We have highly qualified trainers who have worked in the training industry for a very long time and have delivered sessions and training for top colleges/schools and companies.
We provide 24/7 assistance for the ease of the student. Any query can be raised through the interface itself or communicated through email. If someone is facing difficulties with the methods mentioned above, we can arrange a one-on-one session with the trainer to help with difficulties faced in learning. You can raise queries throughout the training period as well as after completion of the training.
AI Council offers you the latest, most appropriate and, most importantly, real-world projects throughout your training period. This lets students gain industry-level experience and convert their learning into solutions while building the projects. Each training module has tasks or projects designed so that you can evaluate your learning. You will work on projects related to different industries such as marketing, e-commerce, automation, sales, etc.
Yes, we provide job assistance so that a learner can apply for a job directly after completing the training. We have tie-ups with companies, and when required we refer our students to those companies for interviews. Our team will help you build a good resume and will train you for your job interview.
After successful completion of the training program and submission of the assignments, quizzes and projects, you must secure at least a B grade in the qualifying exam; the AI Council certified certificate will then be awarded to you. Every certificate has a unique number through which it can be verified on our site.
To be professional and transparent: no, we don't guarantee a job. Job assistance helps give you an opportunity to grab your dream job; selection depends entirely on the candidate's performance in the interview and the requirements of the recruiter.
Most of our programs offer both modes of training, i.e. instructor-led and self-paced. You can choose either mode depending on your work schedule; we provide the flexibility to choose your training mode. While registering for courses you will be asked to submit your preference. If a course is not offered in both modes, you can check which mode the training runs in and register for that. If you feel you need a different training mode, you can contact our team.
Yes, you can definitely opt for multiple courses at a time. We provide flexible timings, so if you want to learn different topics while continuing with your daily schedule, our course timings and modes will help you carry on learning.
Whenever you enroll in any course, we will send a notification to your contact details. You will be given a unique registration ID, and after successful enrollment all of your courses will be added to your account profile on our website. AI Council provides lifetime access to course content whenever needed.
A capstone project is the outcome of the culminating learning across the academic years. It is the final project that represents your knowledge and efforts in your field of study. It can be chosen by the mentor or by the students to come up with a solution.
Yes, to obtain the certificate of the diploma program you have to submit the capstone project.