Diploma in Big Data and Analytics
Introduction to Hadoop Eco-System
Agenda
In this session, you will learn about:
• What is Hadoop?
• Why Hadoop?
• Advantages of Hadoop
• History of Hadoop
• Key Characteristics of Hadoop
• Hadoop 1.0 & 2.0 Eco-System
• Hadoop Use Cases
• Where Does Hadoop Fit?
• Traditional vs. Hadoop Architecture
• RDBMS vs. Hadoop
• When to Use or Not Use Hadoop?
• Hadoop Opportunities
What is Hadoop?
A solution to the Big Data problem.
• A free, Java-based framework that allows for the distributed processing of large data sets.
• Processes data across clusters of commodity computers using a simple programming model.
• An Apache project, inspired by Google's MapReduce and Google File System papers.
• A fault-tolerant, reliable, open-source system.
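The "simple programming model" mentioned above is MapReduce: you supply a map function and a reduce function, and Hadoop handles the distribution, scheduling, and fault recovery. Below is a minimal sketch of the classic word-count job (illustrative, not part of the original slides):

```java
// Classic MapReduce word count - a minimal sketch of the Hadoop programming model.
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: emit (word, 1) for every word in this node's input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce: sum the counts that all mappers emitted for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) sum += val.get();
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The job would be packaged as a JAR and launched with the input and output HDFS directories as arguments; the framework splits the input across the cluster and reruns failed tasks automatically.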
Why Hadoop?
Why is Big Data Technology Needed?
• 90% of the data in the world today has been created in the last two years alone.
• 80% of the data is unstructured or exists in widely varying structures, which are difficult to analyze.
• Structured formats have some limitations with respect to handling large quantities of data.
• It is difficult to integrate information distributed across multiple systems.
Why Hadoop?
Additional Challenges
• Most business users do not know what should be analyzed.
• Potentially valuable data is dormant or discarded.
• A lot of information has a short useful lifespan.
• It is too expensive to justify the integration of large volumes of unstructured data.
• Context adds meaning to existing information.
Advantages of Hadoop
Why is Big Data Technology Appealing?
• Runs applications on distributed systems with thousands of nodes, involving petabytes of data.
• Has a distributed file system, called the Hadoop Distributed File System (HDFS), which enables fast data transfer among the nodes.
• Manages and processes huge amounts of data cost-efficiently.
• Analyzes data in its native form, which may be unstructured, structured, or streaming.
• Captures data from fast-happening events in real time.
• Handles the failure of isolated nodes and the tasks assigned to such nodes.
• Turns data into actionable insights.
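To make the HDFS point concrete, here is a small sketch using the standard org.apache.hadoop.fs.FileSystem Java API to write a file into the cluster and read it back. The NameNode URI and file path are illustrative assumptions, not values from these slides:

```java
// Minimal HDFS read/write sketch using the standard FileSystem API.
// The fs.defaultFS URI and the file path below are illustrative assumptions.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsHello {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://namenode:8020"); // assumed NameNode address

    try (FileSystem fs = FileSystem.get(conf)) {
      Path path = new Path("/user/demo/hello.txt");

      // Write: HDFS splits the file into blocks and replicates them across DataNodes.
      try (FSDataOutputStream out = fs.create(path, true)) {
        out.write("Hello, HDFS!\n".getBytes(StandardCharsets.UTF_8));
      }

      // Read the file back; the client is served from available replicas.
      try (BufferedReader in = new BufferedReader(
          new InputStreamReader(fs.open(path), StandardCharsets.UTF_8))) {
        System.out.println(in.readLine());
      }
    }
  }
}
```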
History of Hadoop
Hadoop’s Key Characteristics
• Reliability: Provides a reliable, fault-tolerant, shared data storage and analysis system.
• Scalability: Offers very high linear scalability.
• Flexibility: Can process structured, semi-structured, and unstructured data.
• Economical: Works on inexpensive commodity hardware.
• Robust: Well suited to meet the analytical needs of developers.
Hadoop is Reliable
Why is Hadoop Reliable?
• Data automatically gets replicated at two other locations, and the level of replication is configurable.
• When a node fails, the system automatically reallocates its work to another location.
• Even if two systems collapse, the file is still available on at least a third system.
• The result is a high level of fault tolerance.
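As a small illustration (a sketch with an assumed NameNode address and file path, not from the slides), the replication level can be set for newly created files via the standard dfs.replication property, or changed for an existing file through the Java API:

```java
// Sketch: controlling the HDFS replication factor from Java.
// "dfs.replication" is the standard HDFS setting; the URI and path are assumptions.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://namenode:8020"); // assumed NameNode address
    conf.setInt("dfs.replication", 3);                // default for files this client creates

    try (FileSystem fs = FileSystem.get(conf)) {
      Path path = new Path("/user/demo/important.dat");
      // Raise the replication factor of an existing file to 5 copies.
      boolean requested = fs.setReplication(path, (short) 5);
      System.out.println("Replication change requested: " + requested);
    }
  }
}
```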
Scalable Development Environment
Flexibility in Data Processing
Hadoop Brings Flexibility in Data Processing
• One of the biggest challenges organizations faced in the past was handling unstructured data.
• Only about 20% of the data in any organization is structured; the rest is unstructured, and its value has largely been ignored for lack of technology to analyze it.
• Hadoop manages data whether it is structured or unstructured, encoded or formatted, or of any other type.
• Hadoop brings value to the table by making unstructured data useful in the decision-making process.
Data sources: application data, machine data, social data, and enterprise data.
Hadoop is Very Cost Effective
• Hadoop generates cost benefits by bringing massively parallel computing to commodity servers, resulting in a substantial reduction in the cost per terabyte of storage, which in turn makes it reasonable to model all your data.
• Apache Hadoop was developed to help Internet-based companies deal with prodigious volumes of data.
• According to some analysts, the cost of a Hadoop data management system, including hardware, software, and other expenses, comes to about $1,000 a terabyte, roughly one-fifth to one-twentieth the cost of other data management technologies.
Hadoop Ecosystem is Robust
Why is Hadoop Considered Robust?
• Meets the analytical needs of developers and of small to large organizations.
• Delivers to a variety of data processing needs.
• Includes projects such as MapReduce, Hive, HBase, Apache Pig, Sqoop, and Flume.
The Hadoop 1.0 Eco-System
The Hadoop 2.0 Eco-System
Hadoop Use Cases
Where Does Hadoop Fit?
Web and e-tailing
• Recommendation engines
• Ad targeting
• Search quality
• Abuse and click-fraud detection
Telecommunications
• Customer churn prevention
• Network performance optimization
• Calling Data Record (CDR) analysis
• Analyzing the network to predict failure
Government
• Fraud detection & cyber security
• Welfare schemes
• Justice
Where Does Hadoop Fit? (Contd.)
Healthcare & Life Sciences
• Health information exchange
• Gene sequencing
• Serialization
• Healthcare service quality improvements
• Drug safety
Banks and Financial Services
• Modeling true risk
• Threat analysis
• Fraud detection
• Trade surveillance
• Credit scoring and analysis
Retail
• Point-of-sale transaction analysis
• Customer churn analysis
• Sentiment analysis
Leading Brands using Hadoop
Source: https://wiki.apache.org/hadoop/PoweredBy
Traditional Data Analytics Architecture
Data flow: instrumentation → mostly-append collection → storage-only grid (original raw data) → ETL compute grid → RDBMS (aggregated data) → BI reports + interactive apps.
Limitations:
• Can't explore the original raw, high-fidelity data.
• Moving data to compute doesn't scale.
• Premature death of data.
Hadoop Data Analytics Architecture
Data flow: instrumentation → mostly-append collection → Hadoop (storage + compute grid) → RDBMS (aggregated data) → BI reports + interactive apps.
Benefits:
• Data exploration & advanced analytics directly on the raw data.
• Scalable throughput for ETL and aggregation.
• Data is alive forever.
RDBMS vs. Hadoop

RDBMS                                      | Hadoop
Data processing efficiency in gigabytes+   | Data processing efficiency in petabytes+
Mostly proprietary                         | Open-source framework
One project with multiple components       | Ecosystem suite of (mostly) Java-based projects
Designed for client-server architecture    | Designed to support distributed architecture
High usage requires high-end servers       | Designed to run on commodity hardware
Costly                                     | Cost-efficient
Legacy procedures                          | High fault tolerance
RDBMS vs. Hadoop (Contd.)

RDBMS                                      | Hadoop
Relies on the OS file system               | Based on a distributed file system (HDFS)
Needs structured data                      | Very good support for unstructured data
Needs to follow defined constraints        | Flexible, evolvable, and fast
Stable products                            | Still evolving
Real-time read/write (OLTP)                | Suitable for batch processing (OLAP)
Arbitrary insert and update                | Sequential write
Supports ACID transactions                 | Supports BASE
Schema required on write                   | Schema required on read
Repeated read and write                    | Write once, read repeatedly
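The schema-on-read row deserves a concrete illustration (a sketch, not from the slides). In an RDBMS, the schema is enforced when data is loaded; in Hadoop, raw text lands in HDFS as-is and structure is imposed by the code that reads it. The CSV layout and field names below are assumptions made for the example:

```java
// Schema-on-read sketch: raw CSV lines stored in HDFS are parsed only when read.
// Assumed record layout: "timestamp,userId,amount" - nothing enforced this at write time.
import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TransactionMapper
    extends Mapper<LongWritable, Text, Text, DoubleWritable> {

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    String[] fields = line.toString().split(",");
    if (fields.length != 3) {
      return; // schema-on-read: malformed rows are skipped at read time, not rejected at load
    }
    double amount;
    try {
      amount = Double.parseDouble(fields[2]);
    } catch (NumberFormatException e) {
      return; // another schema-on-read escape hatch for dirty data
    }
    // Emit (userId, amount) so a reducer can total spend per user.
    context.write(new Text(fields[1]), new DoubleWritable(amount));
  }
}
```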
When to Use Hadoop?
Hadoop can be used in various scenarios, including the following:
• Analytics
• Search
• Data retention
• Log file processing
• Analysis of text, image, audio, & video content
• Recommendation systems, as in e-commerce websites
When Not to Use Hadoop?
Hadoop may not be the right fit in the following situations:
• Low-latency or near real-time data access.
• A large number of small files to be processed.
• Multiple-writer scenarios, or scenarios requiring arbitrary writes or writes between files.
Opportunities on Hadoop
• Dice.com – nearly 2,500 jobs across the US
• Indeed.com – over 13,200 jobs across the US

Job Type         | Job Functions                                       | Skills
Hadoop Developer | Develops MapReduce jobs, designs data warehouses    | Java, scripting, Linux
Hadoop Admin     | Manages the Hadoop cluster, designs data pipelines  | Linux administration, network management, experience managing large clusters of machines
Data Scientist   | Data mining and uncovering hidden knowledge in data | Math, data mining algorithms
Business Analyst | Analyzes data                                       | Pig, Hive, HSQL, familiarity with other BI tools
Quiz Time
Which of the following does not characterize Hadoop?
A. An open-source Apache project
B. Distributed data processing framework
C. Highly secured
D. Highly reliable & redundant
Quiz Time
A bank transaction database file from an external location needs to be analyzed by Hadoop. Which tool is used to load this data into Hadoop for processing?
A. HDFS
B. MapReduce
C. Flume
D. Sqoop
Summary
• Hadoop is an Apache project to handle Big Data.
• It is a framework that supports distributed processing of large data sets in a cluster.
• Key characteristics:
  • Reliability
  • Scalability: a scalable development environment
  • Flexibility: makes a high percentage of data available for BI & advanced analytics
  • Cost efficiency & fault tolerance
Thank you
Mumbai | Bangalore | Pune | Chennai | Jaipur