Introduction to Apache Spark
Certified Apache Spark and Scala Training – DataFlair
Agenda
• Before Spark
• Need for Spark
• What is Apache Spark?
• Goals
• Why Spark?
• RDD & its Operations
• Features of Spark
Before Spark
Batch Processing
Stream Processing
Interactive Processing
Graph Processing
Machine Learning
Need For Spark
• Need for a powerful engine that can process data in real time (streaming) as well as in batch mode
• Need for a powerful engine that can respond in sub-second time and perform in-memory analytics
• Need for a powerful engine that can handle diverse workloads:
  – Batch
  – Streaming
  – Interactive
  – Graph
  – Machine Learning
What is Apache Spark?
Apache Spark is a powerful open source engine which can handle:
– Batch processing
– Real-time (stream) processing
– Interactive processing
– Graph processing
– Machine learning (iterative) processing
– In-memory processing
Introduction to Apache Spark
Lightning-fast cluster computing tool
General purpose distributed system
Provides APIs in Scala, Java, Python, and R
History
• 2009: Introduced by UC Berkeley
• 2010: Open sourced
• 2013: Donated to Apache
• 2014: Became a top-level Apache project; world record in sorting
• 2015: Most active project at Apache
Sort Record (Src: Databricks)

                          Hadoop MapReduce     Spark
Data Size                 102.5 TB             100 TB
Time Taken                72 min               23 min
No. of nodes              2100                 206
No. of cores              50400 physical       6592 virtualized
Cluster disk throughput   3150 GB/s            618 GB/s
Network                   Dedicated 10 Gbps    Virtualized 10 Gbps
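To put the sort-record figures above in perspective, the short calculation below derives the sort rate of each run, both for the whole cluster and per node. It is only an arithmetic sketch on the numbers reported in the comparison; the variable and function names are made up for illustration.

```python
# Figures taken from the sort-record comparison above (source: Databricks).
runs = {
    "Hadoop MapReduce": {"data_tb": 102.5, "minutes": 72, "nodes": 2100},
    "Spark": {"data_tb": 100.0, "minutes": 23, "nodes": 206},
}


def tb_per_minute(run):
    """Overall sort rate of the whole cluster, in TB per minute."""
    return run["data_tb"] / run["minutes"]


def tb_per_minute_per_node(run):
    """Sort rate normalised by the number of nodes."""
    return tb_per_minute(run) / run["nodes"]


hadoop, spark = runs["Hadoop MapReduce"], runs["Spark"]

# How much faster the Spark run was, for the cluster as a whole and per node.
speedup = tb_per_minute(spark) / tb_per_minute(hadoop)
per_node_ratio = tb_per_minute_per_node(spark) / tb_per_minute_per_node(hadoop)

print(f"Cluster-level speedup: {speedup:.2f}x")
print(f"Per-node speedup: {per_node_ratio:.1f}x")
```

On these figures the Spark run is about 3x faster at cluster level and about 31x faster per node. The two clusters used different hardware (physical vs. virtualized), so the per-node ratio is indicative only.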
Goals
• Easy to combine batch, streaming, and interactive computations (One Stack to Rule them all)
• Easy to develop sophisticated algorithms
• Compatible with existing open source ecosystem
Why Spark?
• Up to 100x faster than Hadoop MapReduce for in-memory workloads.
• In-memory computation.
  – Hadoop MapReduce: Disk → Operation 1 → Disk → Operation 2 → Disk → … → Operation n → Disk
  – Spark: Disk → Operation 1 → Operation 2 → … → Operation n → Disk
• Language support: Scala, Java, Python and R.
• Supports real-time and batch processing.
  – Input data stream → Spark Streaming → Batches of input data → Spark Engine → Batches of processed data
• Lazy operations: optimize the job before execution.
• Supports multiple transformations and actions.
  – RDD1 → Transformation 1 map() → RDD2 → Transformation 2 filter() → RDD3 → Action (collect) → Result
• Compatible with Hadoop; can process existing Hadoop data.
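The streaming flow listed above (input data stream, batches of input data, engine, batches of processed data) can be pictured with a small plain-Python sketch. This is not Spark Streaming's API: micro_batches, process_batch and the sample events are hypothetical names and data, used only to show how a stream might be cut into small batches that an ordinary batch computation then handles.

```python
from collections import defaultdict


def micro_batches(events, batch_interval):
    """Group (timestamp, value) events into batches of batch_interval seconds."""
    batches = defaultdict(list)
    for timestamp, value in events:
        batch_index = int(timestamp // batch_interval)
        batches[batch_index].append(value)
    # Return the non-empty batches in time order.
    return [batches[index] for index in sorted(batches)]


def process_batch(batch):
    """Stand-in for the engine step: any ordinary batch computation."""
    return sum(batch)


# Hypothetical input stream: (arrival time in seconds, value).
events = [(0.2, 1), (0.9, 2), (1.1, 3), (2.5, 4), (2.7, 5)]

input_batches = micro_batches(events, batch_interval=1.0)
processed = [process_batch(batch) for batch in input_batches]

print(input_batches)  # [[1, 2], [3], [4, 5]]
print(processed)      # [3, 3, 9]
```

Because every batch goes through the same function, the same logic can serve both streaming and batch inputs, which is the point of the "real-time and batch" item.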
Spark Architecture
Spark Nodes
• Master Node (Master)
• Slave Nodes (Worker)
Basic Spark Architecture
[Diagram: one unit of Work divided into multiple Sub Work units]
Resilient Distributed Dataset (RDD)
• RDD is a simple and immutable collection of objects (Obj1, Obj2, Obj3, ..., Obj n).
• RDD can contain any type of (Scala, Java, Python and R) objects.
• Each RDD is split up into different partitions (Partition1, Partition2, ..., Partition6), which may be computed on different nodes of the cluster.
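The partitioning idea can be illustrated with a toy, single-machine sketch in plain Python. It does not use Spark; split_into_partitions is a hypothetical helper, and the "nodes" are simulated by processing each slice independently before combining the partial results.

```python
def split_into_partitions(records, num_partitions):
    """Split a sequence into num_partitions contiguous, nearly equal slices."""
    records = tuple(records)  # tuples are immutable, like an RDD's contents
    size, extra = divmod(len(records), num_partitions)
    partitions = []
    start = 0
    for index in range(num_partitions):
        # The first `extra` partitions take one additional record each.
        end = start + size + (1 if index < extra else 0)
        partitions.append(records[start:end])
        start = end
    return partitions


partitions = split_into_partitions(range(1, 11), 3)

# Each partition can be processed on its own (on a cluster, on different nodes).
partial_sums = [sum(partition) for partition in partitions]
total = sum(partial_sums)

print(partitions)    # [(1, 2, 3, 4), (5, 6, 7), (8, 9, 10)]
print(partial_sums)  # [10, 18, 27]
print(total)         # 55
```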
Resilient Distributed Dataset (RDD)
[Diagram: Employee-data.txt is stored as blocks B1 to B12 on the Hadoop Cluster; "Create RDD" produces an RDD made of Partition-1, Partition-2, Partition-3, Partition-4, Partition-5, ...]
RDD Operations
• Transformations
• Actions
• Persistence
RDD Operations – Transformation
Transformation:
• A set of operations that define how an RDD should be transformed
• Creates a new RDD from the existing one to process the data
• Lazy evaluation: computation doesn't start until an action is invoked
• E.g. map, flatMap, filter, union, groupBy, etc.
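Lazy evaluation is easier to see in code. The sketch below is a toy model in plain Python, not Spark's implementation: ToyRDD is a hypothetical class whose map and filter only record what should be done and return a new object, while collect (standing in for an action) actually runs the recorded steps.

```python
class ToyRDD:
    """A toy, single-machine stand-in for an RDD (not Spark's API)."""

    def __init__(self, source, steps=()):
        self._source = source  # the original data, never modified
        self._steps = steps    # recorded transformations, not yet executed

    def map(self, fn):
        # Transformation: returns a NEW ToyRDD, nothing is computed yet.
        return ToyRDD(self._source, self._steps + (("map", fn),))

    def filter(self, predicate):
        # Transformation: also lazy.
        return ToyRDD(self._source, self._steps + (("filter", predicate),))

    def collect(self):
        # Action: only now are the recorded steps applied to the data.
        items = iter(self._source)
        for kind, fn in self._steps:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)


calls = []  # records every real invocation of double()


def double(x):
    calls.append(x)
    return x * 2


rdd1 = ToyRDD([1, 2, 3, 4])
rdd2 = rdd1.map(double)               # Transformation 1
rdd3 = rdd2.filter(lambda x: x > 4)   # Transformation 2

calls_before_action = len(calls)      # still 0: nothing has run
result = rdd3.collect()               # the action triggers the work

print(calls_before_action)  # 0
print(result)               # [6, 8]
```

Deferring the work until the action is what gives an engine the chance to look at the whole chain and optimize it before execution.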
RDD Operations – Action
Action:
• Triggers job execution.
• Returns the result or writes it to storage.
• E.g. count, collect, reduce, take, etc.
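As a rough analogy for the actions named above, the plain-Python sketch below builds a lazy pipeline with a generator and then forces it in four different ways. The pipeline function and the variable names are hypothetical; only the idea (an action consumes the pipeline and returns a concrete value) carries over to Spark.

```python
from functools import reduce
from itertools import islice


def pipeline():
    """A lazy chain: keep the even numbers from 1..10, then square them."""
    numbers = range(1, 11)
    return (n * n for n in numbers if n % 2 == 0)


# Each "action" below consumes a fresh pipeline and returns a concrete value.
count = sum(1 for _ in pipeline())              # like count
total = reduce(lambda a, b: a + b, pipeline())  # like reduce
first_two = list(islice(pipeline(), 2))         # like take(2)
collected = list(pipeline())                    # like collect

print(count)      # 5
print(total)      # 220
print(first_two)  # [4, 16]
print(collected)  # [4, 16, 36, 64, 100]
```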
RDD Operations – Persistence
Persistence:
• Spark allows caching/persisting an entire dataset in memory
• Caches the RDD in memory for future operations
[Diagram: Cache, Primary Storage]
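The benefit of persistence can be sketched with a toy class in plain Python. ToyDataset and expensive_square are hypothetical; the cache method name merely mirrors the RDD method of the same name. Without caching, every collect recomputes the data; with caching, the first result is kept in memory and reused.

```python
compute_log = []  # records every time the expensive function really runs


def expensive_square(x):
    compute_log.append(x)
    return x * x


class ToyDataset:
    """Toy dataset that recomputes on every collect() unless cached."""

    def __init__(self, source, fn):
        self._source = list(source)
        self._fn = fn
        self._persist = False
        self._cached = None

    def cache(self):
        # Mark the dataset so its first computed result is kept in memory.
        self._persist = True
        return self

    def collect(self):
        if self._cached is not None:
            return list(self._cached)  # served from memory, no recomputation
        result = [self._fn(x) for x in self._source]
        if self._persist:
            self._cached = result
        return list(result)


plain = ToyDataset([1, 2, 3], expensive_square)
plain.collect()
plain.collect()
calls_without_cache = len(compute_log)  # computed twice

compute_log.clear()
cached = ToyDataset([1, 2, 3], expensive_square).cache()
first = cached.collect()
second = cached.collect()
calls_with_cache = len(compute_log)     # computed once, reused afterwards

print(calls_without_cache)  # 6
print(calls_with_cache)     # 3
```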
RDD Operations
• Transformations (map(), flatMap()…): create a new RDD from the parent RDD based on custom business logic; the chain of RDDs forms the RDD lineage
• Actions (saveAsTextFile(), count()…): return output to the Driver or export data to a storage system after computation, producing the Result
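The lineage idea, where each RDD remembers which parent it came from and how, can be sketched as a linked chain in plain Python. LineageNode is a hypothetical class, not part of Spark; it only shows that a result can be rebuilt at any time by replaying the recorded chain from the parent data.

```python
class LineageNode:
    """Toy dataset that remembers its parent and how it was derived."""

    def __init__(self, name, parent=None, fn=None, data=None):
        self.name = name
        self.parent = parent
        self.fn = fn
        self.data = data  # only the root node holds actual data

    def compute(self):
        # Replay the chain from the root every time it is asked for.
        if self.parent is None:
            return list(self.data)
        return self.fn(self.parent.compute())

    def lineage(self):
        # Walk back through the parents to describe how this node was built.
        chain = []
        node = self
        while node is not None:
            chain.append(node.name)
            node = node.parent
        return " <- ".join(chain)


parent = LineageNode("parent", data=["a b", "c"])
words = LineageNode(
    "flatMap", parent, lambda rows: [w for row in rows for w in row.split()]
)
upper = LineageNode("map", words, lambda rows: [w.upper() for w in rows])

first_result = upper.compute()
recomputed = upper.compute()  # e.g. after a computed result was lost

print(upper.lineage())  # map <- flatMap <- parent
print(first_result)     # ['A', 'B', 'C']
```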
Features of Spark
• Speed: 100x faster than Hadoop
• Memory Management: automatic memory management
• Fault Tolerance: recovers automatically
• Duplicate Elimination: process every record exactly once
• Processing: diverse processing platform
• Window Criteria: time-based window criteria
Thank You