Introduction to Apache Spark

Certified Apache Spark and Scala Training – DataFlair

Agenda

• Before Spark
• Need for Spark
• What is Apache Spark?
• Goals
• Why Spark?
• RDD & its Operations
• Features of Spark

Before Spark

Each type of workload typically needed its own processing engine:

• Batch Processing
• Stream Processing
• Interactive Processing
• Graph Processing
• Machine Learning

Need for Spark

• Need for a powerful engine that can process data in real time (streaming) as well as in batch mode
• Need for a powerful engine that can respond in sub-second time and perform in-memory analytics
• Need for a powerful engine that can handle diverse workloads:
  – Batch
  – Streaming
  – Interactive
  – Graph
  – Machine Learning

What is Apache Spark?

Apache Spark is a powerful open source engine which can handle:

– Batch processing
– Real-time (stream) processing
– Interactive processing
– Graph processing
– Machine Learning (iterative)
– In-memory processing

Introduction to Apache Spark

• Lightning-fast cluster computing tool
• General-purpose distributed system
• Provides APIs in Scala, Java, Python, and R

History

• 2009: Introduced by UC Berkeley
• 2010: Open sourced
• 2013: Donated to Apache
• 2014: Became a top-level project at Apache
• 2014: World record in sorting
• 2015: Most active project at Apache

Sort Record

                          Hadoop MapReduce     Spark
Data size                 102.5 TB             100 TB
Time taken                72 min               23 min
No. of nodes              2100                 206
No. of cores              50400 physical       6592 virtualized
Cluster disk throughput   3150 GB/s            618 GB/s
Network                   Dedicated 10 Gbps    Virtualized 10 Gbps

Source: Databricks
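The figures in the sort record table can be turned into rough ratios. The calculation below is only a back-of-the-envelope sketch using the numbers quoted in the table. The variable names and the "TB per node per minute" measure are illustrative assumptions, and since the two runs used different hardware and slightly different data sizes, the ratios are indicative rather than a controlled comparison.

```python
# Figures copied from the sort record table (source: Databricks).
hadoop = {"data_tb": 102.5, "minutes": 72, "nodes": 2100}
spark = {"data_tb": 100.0, "minutes": 23, "nodes": 206}

# How much less wall-clock time and how many fewer machines the Spark run used.
time_ratio = hadoop["minutes"] / spark["minutes"]
node_ratio = hadoop["nodes"] / spark["nodes"]


def tb_per_node_minute(run):
    """Data sorted per node per minute: a crude per-machine efficiency measure."""
    return run["data_tb"] / (run["nodes"] * run["minutes"])


efficiency_ratio = tb_per_node_minute(spark) / tb_per_node_minute(hadoop)

print(f"Time: {time_ratio:.1f}x less")                         # Time: 3.1x less
print(f"Nodes: {node_ratio:.1f}x fewer")                       # Nodes: 10.2x fewer
print(f"Per-node throughput: {efficiency_ratio:.1f}x higher")  # Per-node throughput: 31.1x higher
```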

Goals

• Easy to combine batch, streaming, and interactive computations
• Easy to develop sophisticated algorithms
• Compatible with existing open source ecosystem

Figure: One stack to rule them all (Batch, Interactive, Streaming)

Why Spark?

• Up to 100x faster than Hadoop MapReduce for in-memory workloads.
• In-memory computation.

Figure: in Hadoop MapReduce, every operation reads from and writes to disk (Disk -> Operation 1 -> Disk -> Operation 2 -> Disk -> … -> Operation n -> Disk). In Spark, data is read from disk once, the operations run in memory, and the result is written at the end (Disk -> Operation 1 -> Operation 2 -> … -> Operation n -> Disk).

Why Spark?

• Up to 100x faster than Hadoop MapReduce for in-memory workloads.
• In-memory computation.
• Language support for Scala, Java, Python, and R.
• Supports real-time and batch processing.

Figure: Input data stream -> Spark Streaming -> Batches of input data -> Spark Engine -> Batches of processed data
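The streaming figure describes cutting an input stream into small batches that the batch engine then processes one by one. A minimal plain-Python sketch of that idea follows. It does not use Spark's API: `micro_batches` and `engine` are hypothetical names, and batching by record count is a simplifying assumption (Spark Streaming cuts batches by time interval).

```python
from itertools import islice


def micro_batches(stream, batch_size):
    """Cut a (possibly endless) input stream into small fixed-size batches."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def engine(batch):
    """Stand-in for the batch engine: here it simply upper-cases each record."""
    return [record.upper() for record in batch]


input_stream = ["a", "b", "c", "d", "e"]
processed = [engine(batch) for batch in micro_batches(input_stream, batch_size=2)]
print(processed)  # [['A', 'B'], ['C', 'D'], ['E']]
```

The same `engine` function handles every batch, which is the point of the design: one engine serves both batch and streaming workloads.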

Why Spark?

• Up to 100x faster than Hadoop MapReduce for in-memory workloads.
• In-memory computation.
• Language support for Scala, Java, Python, and R.
• Supports real-time and batch processing.
• Lazy operations: the job is optimized before execution.
• Support for multiple transformations and actions.

Figure: RDD1 -> Transformation 1: map() -> RDD2 -> Transformation 2: filter() -> RDD3 -> Action: collect -> Result
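The map, filter, collect chain in the figure can be imitated on a single machine to show what "lazy" means. The sketch below is a toy model, not Spark's implementation: `ToyRDD` and the `log` list are hypothetical and exist only to show that no work happens until the action is called.

```python
class ToyRDD:
    """Hypothetical single-machine stand-in for an RDD; not Spark's API."""

    def __init__(self, source, log):
        self._source = source  # zero-argument function returning an iterator
        self._log = log        # records when work actually happens

    def map(self, fn):
        # Transformation: only describes the work, nothing runs yet.
        def run():
            for item in self._source():
                self._log.append(f"map({item})")
                yield fn(item)
        return ToyRDD(run, self._log)

    def filter(self, predicate):
        # Transformation: also lazy.
        def run():
            for item in self._source():
                if predicate(item):
                    yield item
        return ToyRDD(run, self._log)

    def collect(self):
        # Action: triggers execution of the whole chain.
        return list(self._source())


log = []
rdd1 = ToyRDD(lambda: iter([1, 2, 3, 4]), log)
rdd2 = rdd1.map(lambda x: x * 10)       # Transformation 1
rdd3 = rdd2.filter(lambda x: x > 20)    # Transformation 2
work_before_action = list(log)          # snapshot before any action runs
result = rdd3.collect()                 # Action

print(work_before_action)  # []
print(result)              # [30, 40]
```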

Why Spark?

• Up to 100x faster than Hadoop MapReduce for in-memory workloads.
• In-memory computation.
• Language support for Scala, Java, Python, and R.
• Supports real-time and batch processing.
• Lazy operations: the job is optimized before execution.
• Support for multiple transformations and actions.
• Compatible with Hadoop; can process existing Hadoop data.

Spark Architecture


Spark Nodes

• Master Node: Master
• Slave Nodes: Worker

Basic Spark Architecture

Figure: Work is divided into Sub Work.

Resilient Distributed Dataset (RDD)

• An RDD is a simple, immutable collection of objects.
• An RDD can contain any type of Scala, Java, Python, or R object.
• Each RDD is split into partitions, which may be computed on different nodes of the cluster.

Figure: an RDD holding objects Obj1, Obj2, Obj3, ..., Obj n, divided into Partition1 to Partition6.
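The partitioning idea can be sketched without a cluster: split a collection into partitions, compute on each partition independently, then combine the partial results. This is only an illustration in plain Python. The function names are hypothetical, and the round-robin split is an assumption made for simplicity; how Spark actually assigns records to partitions depends on the data source and the partitioner.

```python
def split_into_partitions(items, num_partitions):
    """Round-robin split of a collection into num_partitions lists."""
    partitions = [[] for _ in range(num_partitions)]
    for index, item in enumerate(items):
        partitions[index % num_partitions].append(item)
    return partitions


def compute_partition(partition):
    """Work done independently on one partition (here: a partial sum)."""
    return sum(partition)


data = list(range(1, 13))                    # the numbers 1..12
partitions = split_into_partitions(data, 6)  # six partitions, as in the figure

# Each partial sum could be computed on a different node.
partial_sums = [compute_partition(p) for p in partitions]
total = sum(partial_sums)

print(partitions)  # [[1, 7], [2, 8], [3, 9], [4, 10], [5, 11], [6, 12]]
print(total)       # 78
```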

Resilient Distributed Dataset (RDD)

Figure: Employee-data.txt is stored on the Hadoop cluster as blocks B1 to B12. Creating an RDD from the file produces partitions Partition-1, Partition-2, Partition-3, Partition-4, Partition-5, ...

RDD Operations

• Transformations
• Actions
• Persistence

RDD Operations – Transformation

Transformation:

• A set of operations that define how an RDD should be transformed
• Creates a new RDD from the existing one to process the data
• Lazy evaluation: computation doesn't start until an action is invoked
• E.g. map, flatMap, filter, union, groupBy, etc.
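To make the listed transformations concrete, the sketch below shows what each one means using ordinary Python collections. It is an analogy, not Spark code: these Python versions run eagerly on one machine, whereas Spark's transformations are lazy and distributed. The sample data and variable names are made up for illustration.

```python
from collections import defaultdict
from itertools import chain

lines = ["spark is fast", "spark is general"]

# map: exactly one output element per input element.
lengths = [len(line) for line in lines]

# flatMap: each input element may produce several output elements.
words = list(chain.from_iterable(line.split() for line in lines))

# filter: keep only the elements that satisfy a predicate.
long_words = [word for word in words if len(word) > 4]

# union: elements of both collections, duplicates kept.
combined = words + ["spark"]

# groupBy: collect elements under a key (here: the first letter).
groups = defaultdict(list)
for word in words:
    groups[word[0]].append(word)

print(lengths)       # [13, 16]
print(words)         # ['spark', 'is', 'fast', 'spark', 'is', 'general']
print(long_words)    # ['spark', 'spark', 'general']
print(dict(groups))  # {'s': ['spark', 'spark'], 'i': ['is', 'is'], 'f': ['fast'], 'g': ['general']}
```

Every step builds a new collection and leaves `lines` untouched, which mirrors how a transformation creates a new RDD instead of modifying the existing one.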

RDD Operations – Action

Action:

• Triggers job execution.
• Returns the result or writes it to storage.
• E.g. count, collect, reduce, take, etc.

RDD Operations – Persistence

Persistence:

• Spark allows caching (persisting) an entire dataset in memory.
• The cached RDD stays in memory for future operations.

Figure: the cache is held in primary storage.
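Why caching helps can be shown with a small counter: without persistence, every operation recomputes the dataset; with persistence, it is computed once and reused. The class below is a hypothetical toy, not Spark's API, and the `persist` flag and `compute_calls` counter are assumptions introduced for the illustration.

```python
class ToyDataset:
    """Hypothetical stand-in for a dataset that may be kept in memory."""

    def __init__(self, compute, persist=False):
        self._compute = compute   # function that builds the data
        self._persist = persist   # keep the result in memory after first use?
        self._cached = None
        self.compute_calls = 0    # how many times the data was rebuilt

    def _materialize(self):
        if self._persist and self._cached is not None:
            return self._cached   # served from memory, no recomputation
        self.compute_calls += 1
        data = self._compute()
        if self._persist:
            self._cached = data
        return data

    def count(self):
        return len(self._materialize())

    def total(self):
        return sum(self._materialize())


def build_squares():
    return [x * x for x in range(5)]


uncached = ToyDataset(build_squares)
uncached.count()
uncached.total()

cached = ToyDataset(build_squares, persist=True)
cached.count()
cached.total()

print(uncached.compute_calls, cached.compute_calls)  # 2 1
```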

RDD Operations

Parent RDD -> Transformations -> RDD (RDD lineage) -> Actions -> Result

• Transformations (map(), flatMap()…): create a new RDD based on custom business logic.
• Actions (saveAsTextFile(), count()…): return output to the driver or export data to a storage system after computation.
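RDD lineage means each RDD remembers which parent it came from and which transformation produced it, so lost data can be rebuilt by replaying those steps. A minimal sketch of that idea follows. `LineageRDD` is a hypothetical class, it supports only a map-style step, and data loss is simulated by clearing the stored results.

```python
class LineageRDD:
    """Hypothetical sketch: a dataset that remembers how it was derived."""

    def __init__(self, source=None, parent=None, fn=None):
        self.source = source  # raw input, only set on the root dataset
        self.parent = parent  # dataset this one was derived from
        self.fn = fn          # per-element function applied to the parent
        self.data = None      # materialized result; may be lost

    def map(self, fn):
        # Record the step instead of running it.
        return LineageRDD(parent=self, fn=fn)

    def compute(self):
        # Rebuild from the parent whenever the materialized data is missing.
        if self.data is None:
            if self.parent is None:
                self.data = list(self.source)
            else:
                self.data = [self.fn(x) for x in self.parent.compute()]
        return self.data

    def lineage(self):
        steps = []
        node = self
        while node is not None:
            steps.append("source" if node.parent is None else "map")
            node = node.parent
        return list(reversed(steps))


base = LineageRDD(source=[1, 2, 3])
doubled = base.map(lambda x: x * 2)
plus_one = doubled.map(lambda x: x + 1)

first = list(plus_one.compute())

# Simulate losing the computed data, then recover it from the lineage.
plus_one.data = None
doubled.data = None
recovered = plus_one.compute()

print(plus_one.lineage())  # ['source', 'map', 'map']
print(first, recovered)    # [3, 5, 7] [3, 5, 7]
```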

Features of Spark

• Speed: up to 100x faster than Hadoop MapReduce
• Memory Management: automatic memory management
• Fault Tolerance: recovers automatically
• Processing: diverse processing platform
• Duplicate Elimination: processes every record exactly once
• Window Criteria: time-based window criteria
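A time-based window criterion groups records by the time interval they fall into. The sketch below illustrates sliding windows over timestamped events in plain Python. It is not Spark code: the function name is hypothetical, and the `window_length` and `slide_interval` parameters are assumptions modeled loosely on the window length and sliding interval used in stream processing.

```python
def sliding_windows(events, window_length, slide_interval):
    """Group (timestamp, value) events into time-based windows.

    Each window covers [start, start + window_length), and a new window
    starts every slide_interval time units, beginning at time 0.
    """
    if not events:
        return []
    last_time = max(timestamp for timestamp, _ in events)
    windows = []
    start = 0
    while start <= last_time:
        end = start + window_length
        values = [value for timestamp, value in events if start <= timestamp < end]
        windows.append((start, end, values))
        start += slide_interval
    return windows


events = [(0, "a"), (1, "b"), (2, "c"), (3, "d"), (4, "e")]
windows = sliding_windows(events, window_length=3, slide_interval=2)
for window in windows:
    print(window)
# (0, 3, ['a', 'b', 'c'])
# (2, 5, ['c', 'd', 'e'])
# (4, 7, ['e'])
```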

Thank You

DataFlair /DataFlairWS /c/DataFlairWS
