Introduction to the Hadoop Eco-System
Diploma in Big Data and Analytics


Agenda

In this session, you will learn about:
• What is Hadoop?
• Why Hadoop?
• Advantages of Hadoop
• History of Hadoop
• Key Characteristics of Hadoop
• Hadoop 1.0 & 2.0 Eco-System
• Hadoop Use Cases
• Where Does Hadoop Fit?
• Traditional vs. Hadoop Architecture
• RDBMS vs. Hadoop
• When to Use or Not Use Hadoop
• Hadoop Opportunities


What is Hadoop?

A solution to the Big Data problem:
• A free, Java-based framework that allows for the distributed processing of large data sets.
• Processes data across clusters of commodity computers using a simple programming model (MapReduce; see the sketch below).
• An Apache project, inspired by Google's MapReduce and Google File System papers.
• A fault-tolerant, reliable, open-source system.
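The slides describe the programming model only at a high level, so here is a minimal sketch of it: the classic word-count job, written against the standard org.apache.hadoop.mapreduce API. It is an illustration, not part of the original material; input and output directories are passed as command-line arguments.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: runs in parallel on every split of the input,
  // emitting (word, 1) for each word it sees.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: receives all counts for one word and sums them.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // pre-aggregate on the map side
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Such a job would typically be packaged into a jar and launched on the cluster with the "hadoop jar" command; the framework handles splitting the input, scheduling tasks across nodes, and re-running tasks on failure.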




Why Hadoop?

Why is Big Data Technology Needed?
• 90% of the data in the world today has been created in the last two years alone.
• 80% of the data is unstructured or exists in widely varying structures that are difficult to analyze.
• Structured formats have some limitations with respect to handling large quantities of data.
• It is difficult to integrate information distributed across multiple systems.


Why Hadoop?

Additional Advantages
• Most business users do not know what should be analyzed.
• Potentially valuable data is dormant or discarded.
• A lot of information has a short useful lifespan.
• It is too expensive to justify the integration of large volumes of unstructured data.
• Context adds meaning to the existing information.


Advantages of Hadoop

Why is Big Data Technology Appealing?
• Runs a number of applications on distributed systems with thousands of nodes, involving petabytes of data.
• Has a distributed file system, called the Hadoop Distributed File System (HDFS), which enables fast data transfer among the nodes (see the sketch below).
• Helps to manage and process huge amounts of data cost-efficiently.
• Analyzes data in its native form, which may be unstructured, structured, or streaming.
• Captures data from fast-happening events in real time.
• Handles the failure of isolated nodes and of the tasks assigned to those nodes.
• Turns data into actionable insights.
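As a concrete, hedged illustration of the HDFS point above, the sketch below loads a local file into HDFS using Hadoop's standard FileSystem API. The NameNode URI and both file paths are placeholder values, not part of the original slides.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal sketch: copy a local file into HDFS via the FileSystem API.
public class HdfsLoadSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://namenode:9000"); // placeholder NameNode address
    FileSystem fs = FileSystem.get(conf);
    // HDFS splits the file into large blocks and replicates each block
    // across DataNodes, which is what enables fast, parallel data transfer.
    fs.copyFromLocalFile(new Path("/tmp/events.log"),       // local source (placeholder)
                         new Path("/data/raw/events.log")); // HDFS destination (placeholder)
    fs.close();
  }
}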


History of Hadoop



Hadoop’s Key Characteristics

• Reliability: Provides a reliable, fault-tolerant, shared data storage and analysis system.
• Scalability: Offers very high, linear scalability.
• Flexibility: Can process structured, semi-structured, and unstructured data.
• Economical: Works on inexpensive commodity hardware.
• Robust: Well suited to meet the analytical needs of developers.


Hadoop is Reliable

Why is Hadoop Reliable?
• Data is automatically replicated at two other locations, so each block exists on three nodes by default.
• Even if two systems collapse, the file is still available on a third system.
• The system automatically reallocates work from a failed node to another location.
• The level of replication is configurable, as the sketch below illustrates.
• The result is a high level of fault tolerance.
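To show how the replication level is configured, here is a small sketch using real HDFS settings: dfs.replication is the standard cluster-wide property (its default is 3), and FileSystem.setReplication changes the factor for an individual file. The file path and the values chosen are illustrative only.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: configuring HDFS replication. dfs.replication is normally set
// cluster-wide in hdfs-site.xml; 3 is the HDFS default.
public class ReplicationSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("dfs.replication", "3"); // default replication for new files
    FileSystem fs = FileSystem.get(conf);
    // Raise replication for a single, frequently read file (placeholder path).
    fs.setReplication(new Path("/data/raw/events.log"), (short) 5);
    fs.close();
  }
}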


Scalable Development Environment



Flexibility in Data Processing

Hadoop Brings Flexibility to Data Processing
• One of the biggest challenges organizations have had in the past is handling unstructured data.
• Let’s face it: only 20% of the data in any organization is structured, while the rest is unstructured, and its value has largely been ignored due to a lack of technology to analyze it.
• Hadoop manages data whether it is structured or unstructured, encoded or formatted, or of any other type.
• Hadoop brings value to the table by making unstructured data useful in the decision-making process.

Sources include application data, machine data, social data, and enterprise data.


Hadoop is Very Cost-Effective
• Hadoop generates cost benefits by bringing massively parallel computing to commodity servers, resulting in a substantial reduction in the cost per terabyte of storage. This, in turn, makes it reasonable to model all your data.
• Apache Hadoop was developed to help Internet-based companies deal with prodigious volumes of data.
• According to some analysts, the cost of a Hadoop data management system, including hardware, software, and other expenses, comes to about $1,000 per terabyte, roughly one-fifth to one-twentieth the cost of other data management technologies.


Hadoop Ecosystem is Robust

Why is Hadoop Considered Robust?
• Meets the analytical needs of developers and of small to large organizations.
• Projects such as MapReduce, Hive, HBase, Apache Pig, Sqoop, and Flume cater to a variety of data-processing needs.


The Hadoop 1.0 Eco-System



The Hadoop 2.0 Eco-System



Hadoop Use Cases



Where Does Hadoop Fit?

Web and E-tailing
• Recommendation engines
• Ad targeting
• Search quality
• Abuse and click-fraud detection

Telecommunications
• Customer churn prevention
• Network performance optimization
• Calling Data Record (CDR) analysis
• Analyzing networks to predict failure

Government
• Fraud detection & cyber security
• Welfare schemes
• Justice


Where Does Hadoop Fit? (Contd)

Healthcare & Life Sciences
• Health information exchange
• Gene sequencing
• Serialization
• Healthcare service quality improvements
• Drug safety

Banks and Financial Services
• Modeling true risk
• Threat analysis
• Fraud detection
• Trade surveillance
• Credit scoring and analysis

Retail
• Point-of-sale transaction analysis
• Customer churn analysis
• Sentiment analysis


Leading Brands Using Hadoop

Source: https://wiki.apache.org/hadoop/PoweredBy


Traditional Data Analytics Architecture

Data flow: Instrumentation → Collection (mostly append) → Storage-only grid (original raw data) → ETL compute grid → RDBMS (aggregated data) → BI reports + interactive apps.

Limitations:
• Can’t explore the original raw, high-fidelity data.
• Moving data to compute doesn’t scale.
• Premature death of data.


Hadoop Data Analytics Architecture

Data flow: Instrumentation → Collection (mostly append) → Hadoop (storage + compute grid) → RDBMS (aggregated data) → BI reports + interactive apps.

Benefits:
• Data exploration and advanced analytics directly on the raw data.
• Scalable throughput for ETL and aggregation.
• Data is alive forever.


RDBMS v/s Hadoop

RDBMS                                     | Hadoop
------------------------------------------|--------------------------------------------------
Data processing efficiency in Gigabytes+  | Data processing efficiency in Petabytes+
Mostly proprietary                        | Open-source framework
One project with multiple components      | Eco-system suite of (mostly) Java-based projects
Designed for client-server architecture   | Designed to support distributed architecture
High usage requires high-end servers      | Designed to run on commodity hardware
Costly                                    | Cost-efficient
Legacy procedures                         | High fault tolerance


RDBMS v/s Hadoop (Contd)

RDBMS                                | Hadoop
-------------------------------------|--------------------------------------------------
Relies on the OS file system         | Based on a distributed file system (HDFS)
Needs structured data                | Very good support for unstructured data
Needs to follow defined constraints  | Flexible, evolvable, and fast
Stable products                      | Still evolving
Real-time read/write (OLTP)          | Suitable for batch processing (OLAP)
Arbitrary inserts and updates        | Sequential writes
Supports ACID transactions           | Supports BASE
Schema required on write             | Schema required on read
Repeated reads and writes            | Write once, read repeatedly


When to Use Hadoop?

Hadoop can be used in various scenarios, including the following:
• Analytics
• Search
• Data retention
• Log-file processing (see the sketch after this list)
• Analysis of text, image, audio, and video content
• Recommendation systems, e.g. on e-commerce websites
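As a hedged sketch of the log-file-processing use case above: a map-only MapReduce job that keeps just the lines containing "ERROR". The class name and the matching rule are illustrative assumptions; setting the number of reducers to zero is the standard way to make mappers write their filtered output directly, which suits pure filtering at scale.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ErrorLogFilter {

  // Mapper: emit the whole log line if it contains "ERROR", drop it otherwise.
  public static class ErrorMapper
      extends Mapper<Object, Text, Text, NullWritable> {
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      if (value.toString().contains("ERROR")) {
        context.write(value, NullWritable.get());
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "error log filter");
    job.setJarByClass(ErrorLogFilter.class);
    job.setMapperClass(ErrorMapper.class);
    job.setNumReduceTasks(0); // map-only: no shuffle or reduce phase needed
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(NullWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}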


When Not to Use Hadoop?

Hadoop may not be the right fit in the following situations:
• Low-latency or near real-time data access.
• A large number of small files to be processed.
• Multiple-writer scenarios, or scenarios requiring arbitrary writes or updates within files.


Opportunities on Hadoop

• Dice.com – nearly 2,500 jobs across the US
• Indeed.com – over 13,200 jobs across the US

Job Type         | Job Functions                                    | Skills
-----------------|--------------------------------------------------|--------------------------------------------------
Hadoop Developer | Develops MapReduce jobs; designs data warehouses | Java, scripting, Linux
Hadoop Admin     | Manages Hadoop clusters; designs data pipelines  | Linux administration, network management, experience managing large clusters of machines
Data Scientist   | Data mining; uncovering hidden knowledge in data | Math, data-mining algorithms
Business Analyst | Analyzes data                                    | Pig, Hive, HSQL, familiarity with other BI tools


Quiz - Time

Identify what does not characterize Hadoop:

A. An open-source Apache project
B. A distributed data-processing framework
C. Highly secured
D. Highly reliable & redundant


Quiz - Time

A bank transaction database file at an external location needs to be analyzed by Hadoop. Which tool is used to load this data into Hadoop for processing?

A. HDFS
B. MapReduce
C. Flume
D. Sqoop


Summary

• Hadoop is an Apache project to handle Big Data.
• It is a framework that supports distributed processing of large data sets in a cluster.
• Key characteristics:
  • Reliability
  • Scalability: a scalable development environment.
  • Flexibility: makes a high percentage of data available for BI and advanced analytics.
  • Cost efficiency & fault tolerance


Thank you

Mumbai | Bangalore | Pune | Chennai | Jaipur


