Diploma in Big Data and Analytics
Introduction to Hadoop Eco-System
Agenda
In this session, you will learn about:
• What is Hadoop?
• Why Hadoop?
• Advantages of Hadoop
• History of Hadoop
• Key Characteristics of Hadoop
• Hadoop 1.0 & 2.0 Eco-System
• Hadoop Use Cases
• Where Does Hadoop Fit?
• Traditional vs. Hadoop Architecture
• RDBMS vs. Hadoop
• When to Use or Not Use Hadoop?
• Hadoop Opportunities
What is Hadoop?
A solution to the Big Data problem.
• A free, Java-based framework that allows for the distributed processing of large data sets.
• Processes data across clusters of commodity computers using a simple programming model.
• An Apache project, inspired by Google's MapReduce and Google File System papers.
• A fault-tolerant, reliable, open-source system.
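The "simple programming model" mentioned above is MapReduce: you supply a map function and a reduce function, and Hadoop handles the distribution, scheduling, and fault recovery. Below is a minimal sketch of the classic word-count job (illustrative, not part of the original slides):

```java
// Classic MapReduce word count - a minimal sketch of the Hadoop programming model.
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: emit (word, 1) for every word in this node's input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce: sum the counts that all mappers emitted for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) sum += val.get();
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The job would be packaged as a JAR and launched with the input and output HDFS directories as arguments; the framework splits the input across the cluster and reruns failed tasks automatically.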
Why Hadoop?
Why is Big Data Technology Needed?
• 90% of the data in the world today has been created in the last two years alone.
• 80% of the data is unstructured or exists in widely varying structures, which are difficult to analyze.
• Structured formats have some limitations with respect to handling large quantities of data.
• It is difficult to integrate information distributed across multiple systems.
Why Hadoop?
Additional Challenges
• Most business users do not know what should be analyzed.
• Potentially valuable data is dormant or discarded.
• A lot of information has a short useful lifespan.
• It is too expensive to justify the integration of large volumes of unstructured data.
• Context adds meaning to existing information.
Advantages of Hadoop
Why is Big Data Technology Appealing?
• Runs applications on distributed systems with thousands of nodes, involving petabytes of data.
• Has a distributed file system, called the Hadoop Distributed File System (HDFS), which enables fast data transfer among the nodes.
• Manages and processes huge amounts of data cost-efficiently.
• Analyzes data in its native form, which may be unstructured, structured, or streaming.
• Captures data from fast-happening events in real time.
• Handles the failure of isolated nodes and the tasks assigned to such nodes.
• Turns data into actionable insights.
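To make the HDFS point concrete, here is a small sketch using the standard org.apache.hadoop.fs.FileSystem Java API to write a file into the cluster and read it back. The NameNode URI and file path are illustrative assumptions, not values from these slides:

```java
// Minimal HDFS read/write sketch using the standard FileSystem API.
// The fs.defaultFS URI and the file path below are illustrative assumptions.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsHello {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://namenode:8020"); // assumed NameNode address

    try (FileSystem fs = FileSystem.get(conf)) {
      Path path = new Path("/user/demo/hello.txt");

      // Write: HDFS splits the file into blocks and replicates them across DataNodes.
      try (FSDataOutputStream out = fs.create(path, true)) {
        out.write("Hello, HDFS!\n".getBytes(StandardCharsets.UTF_8));
      }

      // Read the file back; the client is served from available replicas.
      try (BufferedReader in = new BufferedReader(
          new InputStreamReader(fs.open(path), StandardCharsets.UTF_8))) {
        System.out.println(in.readLine());
      }
    }
  }
}
```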
History of Hadoop
Hadoop’s Key Characteristics
• Reliability: Provides a reliable, fault-tolerant, shared data storage and analysis system.
• Scalability: Offers very high linear scalability.
• Flexibility: Can process structured, semi-structured, and unstructured data.
• Economical: Works on inexpensive commodity hardware.
• Robust: Well suited to meet the analytical needs of developers.
Hadoop is Reliable
Why is Hadoop Reliable?
• Data automatically gets replicated at two other locations, and the level of replication is configurable.
• When a node fails, the system automatically reallocates its work to another location.
• Even if two systems collapse, the file is still available on at least a third system.
• The result is a high level of fault tolerance.
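As a small illustration (a sketch with an assumed NameNode address and file path, not from the slides), the replication level can be set for newly created files via the standard dfs.replication property, or changed for an existing file through the Java API:

```java
// Sketch: controlling the HDFS replication factor from Java.
// "dfs.replication" is the standard HDFS setting; the URI and path are assumptions.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://namenode:8020"); // assumed NameNode address
    conf.setInt("dfs.replication", 3);                // default for files this client creates

    try (FileSystem fs = FileSystem.get(conf)) {
      Path path = new Path("/user/demo/important.dat");
      // Raise the replication factor of an existing file to 5 copies.
      boolean requested = fs.setReplication(path, (short) 5);
      System.out.println("Replication change requested: " + requested);
    }
  }
}
```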
Scalable Development Environment
Flexibility in Data Processing
Hadoop Brings Flexibility in Data Processing
• One of the biggest challenges organizations faced in the past was handling unstructured data.
• Only about 20% of the data in any organization is structured; the rest is unstructured, and its value has largely been ignored for lack of technology to analyze it.
• Hadoop manages data whether it is structured or unstructured, encoded or formatted, or of any other type.
• Hadoop brings value to the table by making unstructured data useful in the decision-making process.
Data sources: application data, machine data, social data, and enterprise data.
Hadoop is Very Cost Effective
• Hadoop generates cost benefits by bringing massively parallel computing to commodity servers, resulting in a substantial reduction in the cost per terabyte of storage, which in turn makes it reasonable to model all your data.
• Apache Hadoop was developed to help Internet-based companies deal with prodigious volumes of data.
• According to some analysts, the cost of a Hadoop data management system, including hardware, software, and other expenses, comes to about $1,000 a terabyte, roughly one-fifth to one-twentieth the cost of other data management technologies.
Hadoop Ecosystem is Robust
Why is Hadoop Considered Robust?
• Meets the analytical needs of developers and of small to large organizations.
• Delivers to a variety of data processing needs.
• Includes projects such as MapReduce, Hive, HBase, Apache Pig, Sqoop, and Flume.
The Hadoop 1.0 Eco-System
The Hadoop 2.0 Eco-System
Hadoop Use Cases
Where Does Hadoop Fit?
Web and e-tailing
• Recommendation engines
• Ad targeting
• Search quality
• Abuse and click-fraud detection
Telecommunications
• Customer churn prevention
• Network performance optimization
• Calling Data Record (CDR) analysis
• Analyzing the network to predict failure
Government
• Fraud detection & cyber security
• Welfare schemes
• Justice
Where Does Hadoop Fit? (Contd.)
Healthcare & Life Sciences
• Health information exchange
• Gene sequencing
• Serialization
• Healthcare service quality improvements
• Drug safety
Banks and Financial Services
• Modeling true risk
• Threat analysis
• Fraud detection
• Trade surveillance
• Credit scoring and analysis
Retail
• Point-of-sale transaction analysis
• Customer churn analysis
• Sentiment analysis
Leading Brands using Hadoop
Source: https://wiki.apache.org/hadoop/PoweredBy
Traditional Data Analytics Architecture
Data flow: instrumentation → mostly-append collection → storage-only grid (original raw data) → ETL compute grid → RDBMS (aggregated data) → BI reports + interactive apps.
Limitations:
• Can't explore the original raw, high-fidelity data.
• Moving data to compute doesn't scale.
• Premature death of data.
Hadoop Data Analytics Architecture
Data flow: instrumentation → mostly-append collection → Hadoop (storage + compute grid) → RDBMS (aggregated data) → BI reports + interactive apps.
Benefits:
• Data exploration & advanced analytics directly on the raw data.
• Scalable throughput for ETL and aggregation.
• Data is alive forever.
RDBMS vs. Hadoop

RDBMS                                      | Hadoop
Data processing efficiency in gigabytes+   | Data processing efficiency in petabytes+
Mostly proprietary                         | Open-source framework
One project with multiple components       | Ecosystem suite of (mostly) Java-based projects
Designed for client-server architecture    | Designed to support distributed architecture
High usage requires high-end servers       | Designed to run on commodity hardware
Costly                                     | Cost-efficient
Legacy procedures                          | High fault tolerance
RDBMS vs. Hadoop (Contd.)

RDBMS                                      | Hadoop
Relies on the OS file system               | Based on a distributed file system (HDFS)
Needs structured data                      | Very good support for unstructured data
Needs to follow defined constraints        | Flexible, evolvable, and fast
Stable products                            | Still evolving
Real-time read/write (OLTP)                | Suitable for batch processing (OLAP)
Arbitrary insert and update                | Sequential write
Supports ACID transactions                 | Supports BASE
Schema required on write                   | Schema required on read
Repeated read and write                    | Write once, read repeatedly
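The schema-on-read row deserves a concrete illustration (a sketch, not from the slides). In an RDBMS, the schema is enforced when data is loaded; in Hadoop, raw text lands in HDFS as-is and structure is imposed by the code that reads it. The CSV layout and field names below are assumptions made for the example:

```java
// Schema-on-read sketch: raw CSV lines stored in HDFS are parsed only when read.
// Assumed record layout: "timestamp,userId,amount" - nothing enforced this at write time.
import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TransactionMapper
    extends Mapper<LongWritable, Text, Text, DoubleWritable> {

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    String[] fields = line.toString().split(",");
    if (fields.length != 3) {
      return; // schema-on-read: malformed rows are skipped at read time, not rejected at load
    }
    double amount;
    try {
      amount = Double.parseDouble(fields[2]);
    } catch (NumberFormatException e) {
      return; // another schema-on-read escape hatch for dirty data
    }
    // Emit (userId, amount) so a reducer can total spend per user.
    context.write(new Text(fields[1]), new DoubleWritable(amount));
  }
}
```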
When to Use Hadoop?
Hadoop can be used in various scenarios, including the following:
• Analytics
• Search
• Data retention
• Log file processing
• Analysis of text, image, audio, & video content
• Recommendation systems, as in e-commerce websites
When Not to Use Hadoop?
Hadoop may not be the right fit in the following situations:
• Low-latency or near real-time data access.
• A large number of small files to be processed.
• Multiple-writer scenarios, or scenarios requiring arbitrary writes or writes between files.
Opportunities on Hadoop
• Dice.com – nearly 2,500 jobs across the US
• Indeed.com – over 13,200 jobs across the US

Job Type         | Job Functions                                       | Skills
Hadoop Developer | Develops MapReduce jobs, designs data warehouses    | Java, scripting, Linux
Hadoop Admin     | Manages the Hadoop cluster, designs data pipelines  | Linux administration, network management, experience managing large clusters of machines
Data Scientist   | Data mining and uncovering hidden knowledge in data | Math, data mining algorithms
Business Analyst | Analyzes data                                       | Pig, Hive, HSQL, familiarity with other BI tools
Quiz Time
Which of the following does not characterize Hadoop?
A. An open-source Apache project
B. Distributed data processing framework
C. Highly secured
D. Highly reliable & redundant
Quiz Time
A bank transaction database file from an external location needs to be analyzed by Hadoop. Which tool is used to load this data into Hadoop for processing?
A. HDFS
B. MapReduce
C. Flume
D. Sqoop
Summary
• Hadoop is an Apache project to handle Big Data.
• It is a framework that supports distributed processing of large data sets in a cluster.
• Key characteristics:
  • Reliability
  • Scalability: a scalable development environment
  • Flexibility: makes a high percentage of data available for BI & advanced analytics
  • Cost efficiency & fault tolerance
Thank you
Mumbai | Bangalore | Pune | Chennai | Jaipur