
Hadoop Tutorial

Certified Big Data & Hadoop Training – DataFlair


Agenda

• Introduction to Hadoop
• Hadoop nodes & daemons
• Hadoop Architecture
• Characteristics
• Hadoop Features



What is Hadoop?

The technology that empowers Yahoo, Facebook, Twitter, Walmart and others.



What is Hadoop? An open source framework that allows distributed processing of large datasets across a cluster of commodity hardware.



What is Hadoop? An Open Source framework that allows distributed processing of large datasets across a cluster of commodity hardware.

Open Source
• Source code is freely available
• It may be redistributed and modified



What is Hadoop? An open source framework that allows Distributed Processing of large datasets across a cluster of commodity hardware.

Distributed Processing
• Data is processed in a distributed manner on multiple nodes / servers
• Multiple machines process the data independently



What is Hadoop? An open source framework that allows distributed processing of large datasets across a Cluster of commodity hardware.

Cluster
• Multiple machines connected together
• Nodes are connected via LAN



What is Hadoop? An open source framework that allows distributed processing of large datasets across a cluster of Commodity Hardware.

Commodity Hardware
• Economical / affordable machines
• Typically low-performance hardware



What is Hadoop?
• Open source framework written in Java
• Inspired by Google's MapReduce programming model as well as its file system (GFS)
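To make the MapReduce model concrete, here is a minimal word-count mapper sketch using the standard org.apache.hadoop.mapreduce API. It is an illustration added for this tutorial, not part of the original slides: it emits a (word, 1) pair for every word, and the framework groups the pairs by word so a reducer can sum the counts.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits (word, 1) for every word in an input line; the framework groups
// the pairs by word, and a reducer sums the counts per word.
public class WordCountMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    StringTokenizer tokens = new StringTokenizer(line.toString());
    while (tokens.hasMoreTokens()) {
      word.set(tokens.nextToken());
      context.write(word, ONE);
    }
  }
}
```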



Hadoop History

• 2002 – Doug Cutting started working on Nutch
• 2003 – Google published the GFS paper
• 2004 – Google published the MapReduce paper
• 2005 – Doug Cutting added DFS & MapReduce to Nutch
• 2006 – Development of Hadoop started as a Lucene sub-project
• 2007 – 4 TB of image archives were converted on Hadoop over 100 EC2 instances
• 2008 – Hadoop became a top-level Apache project and defeated a supercomputer as the fastest system to sort a terabyte of data; Facebook launched Hive, SQL support for Hadoop
• 2009 – Doug Cutting joined Cloudera


Hadoop Components

Hadoop consists of three key parts: HDFS (storage), MapReduce (processing) and YARN (resource management).



Hadoop Nodes

Nodes are of two types:
• Master Node
• Slave Node


Hadoop Daemons

• Master Node runs the Resource Manager (YARN) and the NameNode (HDFS)
• Slave Node runs the Node Manager (YARN) and the DataNode (HDFS)
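As a hedged illustration of how clients reach these master daemons (the host name "master" and port 9000 are assumptions, not from the slides), two standard configuration properties point a client at the NameNode and the ResourceManager:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

// Hypothetical host name "master": fs.defaultFS addresses the NameNode,
// yarn.resourcemanager.hostname locates the ResourceManager.
public class ClusterClient {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://master:9000");       // NameNode (master node)
    conf.set("yarn.resourcemanager.hostname", "master");  // ResourceManager (master node)

    FileSystem fs = FileSystem.get(conf);  // HDFS client handle, bound to the NameNode URI
    System.out.println("Home directory: " + fs.getHomeDirectory());
  }
}
```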


Basic Hadoop Architecture

[Diagram] The USER submits work to the MASTER(S); the master splits the work into sub-works and distributes them across the slaves (e.g. 100 slaves), which process the sub-works in parallel.
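In MapReduce terms, the "work" the user submits is a job and the "sub-works" are the tasks the framework creates from it. The driver sketch below (reusing the WordCountMapper shown earlier; input and output paths come from the command line) is an added illustration of that submission step, not code from the original deck:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Submits one job ("work") to the cluster; the framework splits the input
// into map tasks ("sub-works") that run in parallel on the slave nodes.
public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCountDriver.class);
    job.setMapperClass(WordCountMapper.class);  // mapper from the earlier sketch
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));    // input data on HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```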


Hadoop Characteristics

• Open Source
• Distributed Processing
• Fault Tolerance
• Easy to Use
• Reliability
• Economic
• Scalability
• High Availability


Open Source

• Source code is freely available
• Can be redistributed
• Can be modified

Benefits of open source: free, interoperable, transparent, no vendor lock-in, affordable, community-driven.


Distributed Processing

• Data is processed in a distributed manner on the cluster
• Multiple nodes in the cluster process the data independently

[Diagram] Centralized Processing vs. Distributed Processing


Fault Tolerance

• Failures of nodes are recovered automatically
• The framework takes care of hardware failures as well as task failures


Reliability

• Data is reliably stored on the cluster of machines despite machine failures
• Failure of nodes doesn't cause data loss

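A minimal sketch of why node failure does not lose data: HDFS keeps several copies of every block, controlled by the dfs.replication property (3 by default), so a block on a failed DataNode can still be read from, and re-replicated to, other nodes. The file path below is hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Each HDFS block of this file is stored on multiple DataNodes, so losing
// one node does not lose the data.
public class ReplicatedWrite {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.setInt("dfs.replication", 3);  // default replication factor; shown explicitly

    FileSystem fs = FileSystem.get(conf);
    try (FSDataOutputStream out = fs.create(new Path("/demo/sample.txt"))) {  // hypothetical path
      out.writeUTF("stored as replicated blocks across DataNodes");
    }
  }
}
```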


High Availability

• Data is highly available and accessible despite hardware failure
• There is no downtime for the end-user application due to data unavailability


Scalability

• Vertical Scalability – new hardware can be added to existing nodes
• Horizontal Scalability – new nodes can be added on the fly


Economic

• No need to purchase a costly license
• No need to purchase costly hardware

Open Source + Commodity Hardware = Economic


Easy to Use

• Distributed computing challenges are handled by the framework
• The client just needs to concentrate on business logic


Data Locality

• Move computation to data instead of data to computation
• Data is processed on the nodes where it is stored (see the sketch after the diagram note below)

[Diagram] Traditional approach: data is shipped from the storage servers to the app servers that run the algorithm. Hadoop approach: the algorithm is shipped to the servers where the data already resides.
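To see data locality from the client side, the following hedged sketch lists which DataNodes hold each block of a file; YARN uses the same block-location information to schedule map tasks on, or close to, those nodes. The file path is taken from the command line.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Prints the DataNodes that hold each block of the given file; the scheduler
// uses the same information to run tasks where the data is stored.
public class BlockLocations {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path file = new Path(args[0]);  // e.g. the file written in the earlier sketch
    FileStatus status = fs.getFileStatus(file);
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation block : blocks) {
      System.out.println("offset " + block.getOffset()
          + " -> hosts " + String.join(", ", block.getHosts()));
    }
  }
}
```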


Summary

• Every day we generate 2.3 trillion GB of data
• Hadoop handles huge volumes of data efficiently
• Hadoop uses the power of distributed computing
• HDFS & YARN are two main components of Hadoop
• It is highly fault tolerant, reliable & available


Thank You DataFlair /DataFlairWS

/c/DataFlairWS

