Cloudera Online Training


Cloudera Certified Developer for Apache Hadoop (CCDH)



Who We Are
Mission: To help organizations profit from their data

How We Do It: We deliver relevant products and services.
• A distribution of Apache Hadoop that is tested, certified and supported
• Comprehensive support and professional service offerings
• A suite of management software for Hadoop operations
• Training and certification programs for developers, administrators, managers and data scientists

Credentials: The Apache Hadoop experts.
• Number 1 distribution of Apache Hadoop in the world
• Largest contributor to the open source Hadoop ecosystem
• More committers on staff than any other company
• More than 100 customers across a wide variety of industries
• Strong growth in revenue and new accounts

Technical Team: Unmatched knowledge and experience.
• Founders, committers and contributors to Hadoop
• A wealth of experience in the design and delivery of production software

Leadership: Strong executive team with proven abilities.
• Mike Olson, CEO
• Kirk Dunn, COO
• Charles Zedlewski, VP, Product
• Mary Rorabaugh, CFO
• Jeff Hammerbacher, Chief Scientist
• Amr Awadallah, VP Engineering
• Doug Cutting, Chief Architect
• Omer Trajman, VP, Customer Solutions


Users of Cloudera
• Financial
• Web
• Telecom
• Media
• Retail & Consumer

https://www.pass4sureexam.com/ccD-410.html


What is Apache Hadoop?
Hadoop is a platform for data storage and processing that is…
• Scalable
• Fault tolerant
• Open source

CORE HADOOP COMPONENTS
• Hadoop Distributed File System (HDFS): File Sharing & Data Protection Across Physical Servers
• MapReduce: Distributed Computing Across Physical Servers

Flexibility
• A single repository for storing, processing & analyzing any type of data
• Not bound by a single schema

Scalability
• Scale-out architecture divides workloads across multiple nodes
• Flexible file system eliminates ETL bottlenecks

Low Cost
• Can be deployed on commodity hardware
• Open source platform guards against vendor lock-in


What Makes Hadoop Different?
• Ability to scale out to petabytes in size using commodity hardware
• Processing (MapReduce) jobs are sent to the data, versus shipping the data to be processed
• Hadoop doesn’t impose a single data format, so it can easily handle structured, semi-structured and unstructured data
• Manages fault tolerance and data replication automatically


Why the Need for Hadoop?
1.8 trillion gigabytes of data was created in 2011…
• More than 90% is unstructured data
• Approx. 500 quadrillion files
• Quantity doubles every 2 years

[Chart: gigabytes of data created (in billions), 2005 to 2015, structured vs. unstructured data; vertical axis from 0 to 10,000. Source: IDC 2011]


Hadoop Use Cases

Advanced Analytics Application   Industry         Data Processing Application
Social Network Analysis          Web              Clickstream Sessionization
Content Optimization             Media            Clickstream Sessionization
Network Analytics                Telco            Mediation
Loyalty & Promotions Analysis    Retail           Data Factory
Fraud Analysis                   Financial        Trade Reconciliation
Entity Analysis                  Federal          SIGINT
Sequencing Analysis              Bioinformatics   Genome Mapping


Hadoop in the Enterprise

[Diagram. Labels: OPERATORS, ENGINEERS, ANALYSTS, BUSINESS USERS, CUSTOMERS; Management Tools, IDE’s, BI / Analytics, Enterprise Reporting, Web Application; Enterprise Data Warehouse; Logs, Files, Web Data, Relational Databases]


What is CDH?
Cloudera’s Distribution Including Apache Hadoop (CDH) is an enterprise-ready distribution of Hadoop that is…
• 100% Apache open source
• Contains all components needed for deployment
• Fully documented and supported
• Released on a reliable schedule

Fastest Path to Success
• No need to write your own scripts or do integration testing on different components
• Works with a wide range of operating systems, hardware, databases and data warehouses

Stable and Reliable
• Extensive Cloudera QA systems, software & processes
• Proven at scale in dozens of enterprise environments
• Tested & run in production at scale

Community Driven
• Incorporates only main-line components from the Apache Hadoop ecosystem – no forks or proprietary underpinnings
• FREE


Cloudera’s Commitment to the Open Source Community

Component    Cloudera Committers   Cloudera Founder   2011 Commits (Rank)
Common       6                     Yes                #1
HDFS         6                     Yes                #2
MapReduce    5                     Yes                #1
HBase        2                     No                 #2
Zookeeper    1                     Yes                #2
Oozie        1                     Yes                #1
Pig          0                     No                 #3
Hive         1                     No                 #2
Sqoop        2                     Yes                #1
Flume        3                     Yes                #1
Hue          3                     Yes                #1
Snappy       2                     No                 #1
Bigtop       8                     Yes                #1
Avro         4                     Yes                #1
Whirr        2                     Yes                #1


Components of CDH
• User Interface: HUE
• Workflow: APACHE OOZIE
• File System Mount: FUSE-DFS
• Scheduling: APACHE OOZIE
• Languages / Compilers: APACHE PIG, APACHE HIVE
• Data Integration: APACHE FLUME, APACHE SQOOP
• Fast Read/Write Access: APACHE HBASE
• Coordination: APACHE ZOOKEEPER


Hadoop Distributed File System
Block Size = 64MB
Replication Factor = 3
Cost is $400-$500/TB

[Diagram: a file split into numbered blocks 1 to 5, each replicated across HDFS nodes]
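The block size and replication factor above determine how many blocks a file occupies and how much raw disk it consumes. The following is a minimal sketch of that arithmetic, not part of the original slides; the function name `hdfs_footprint` and the 300 MB file size are hypothetical, and it assumes the final partial block only occupies the bytes actually written.

```python
import math

BLOCK_SIZE_MB = 64       # block size stated on the slide
REPLICATION_FACTOR = 3   # replication factor stated on the slide


def hdfs_footprint(file_size_mb):
    """Estimate block count and raw storage for one file (hypothetical helper)."""
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    return {
        "blocks": blocks,                                    # logical blocks
        "block_replicas": blocks * REPLICATION_FACTOR,       # copies stored cluster-wide
        "raw_storage_mb": file_size_mb * REPLICATION_FACTOR  # disk consumed across all replicas
    }


example = hdfs_footprint(300)  # hypothetical 300 MB file
print(example)
```

Under these assumptions a 300 MB file needs five blocks, which lines up with the five numbered blocks in the diagram.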


Components of Hadoop
• NameNode
  – Holds all metadata for HDFS
  – Needs to be a highly reliable machine
    • RAID drives – typically RAID 10
    • Dual power supplies
    • Dual network cards – Bonded
  – The more memory the better – typically 36GB to 64GB
• Secondary NameNode
  – Provides checkpointing for the NameNode. The same hardware as the NameNode should be used


Components of Hadoop
• DataNodes
  – Hardware will depend on the specific needs of the cluster
  – No RAID needed; JBOD (just a bunch of disks) is used
  – Typical ratio is:
    • 1 hard drive
    • 2 cores
    • 4GB of RAM
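The typical ratio above can be turned into a quick sizing calculation. This is only an illustrative sketch of the 1 drive : 2 cores : 4GB rule of thumb from the slide; the function name `datanode_resources` and the 12-drive example are assumptions, and real hardware choices depend on the cluster's workload.

```python
def datanode_resources(hard_drives):
    """Scale cores and RAM from the drive count using the slide's rule of thumb."""
    if hard_drives < 1:
        raise ValueError("a DataNode needs at least one hard drive")
    return {
        "hard_drives": hard_drives,
        "cores": 2 * hard_drives,   # 2 cores per drive
        "ram_gb": 4 * hard_drives,  # 4GB of RAM per drive
    }


spec = datanode_resources(12)  # hypothetical 12-drive DataNode
print(spec)
```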



Networking
• One of the most important things to consider when setting up a Hadoop cluster
• Typically a top-of-rack switch is used with Hadoop, together with a core switch
• Be careful not to oversubscribe the backplane of the switch!


Map
• Records from the data source (lines out of files, rows of a database, etc.) are fed into the map function as key/value pairs: e.g., (filename, line).
• map() produces one or more intermediate values along with an output key from the input.

[Diagram: Map Tasks emit (key 1, values), (key 2, values) and (key 3, values); the Shuffle Phase gathers the (key 1, int. values) pairs; a Reduce Task produces the final (key, values)]


Reduce
• After the map phase is over, all the intermediate values for a given output key are combined together into a list
• reduce() combines those intermediate values into one or more final values for that same output key

[Diagram: Map Tasks emit (key 1, values), (key 2, values) and (key 3, values); the Shuffle Phase gathers the (key 1, int. values) pairs; a Reduce Task produces the final (key, values)]
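The map, shuffle and reduce steps described above can be imitated in a few lines of ordinary Python. This is a single-process sketch of the idea using word count as the job, not Hadoop's API; the names `map_fn`, `shuffle` and `reduce_fn` and the sample records are made up for illustration.

```python
from collections import defaultdict


def map_fn(filename, line):
    """Map: turn one (filename, line) record into intermediate (word, 1) pairs."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: collect all intermediate values for each output key into a list."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_fn(key, values):
    """Reduce: combine the list of intermediate values into one final value."""
    return key, sum(values)


# Hypothetical input records in the (filename, line) form mentioned on the Map slide.
records = [("a.txt", "the quick brown fox"), ("b.txt", "the lazy dog and the fox")]

intermediate = [pair for name, line in records for pair in map_fn(name, line)]
word_counts = dict(reduce_fn(key, values) for key, values in shuffle(intermediate).items())
print(word_counts)
```

In a real cluster the map calls run in parallel on the nodes holding the data and the shuffle moves data over the network; the sketch only shows how the keys and values flow between the phases.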


MapReduce Execution



Sqoop
SQL to Hadoop
• Tool to import/export any JDBC-supported database into Hadoop
• Transfer data between Hadoop and external databases or EDW
• High performance connectors for some RDBMS
• Developed at Cloudera


Flume
Distributed, reliable, available service for efficiently moving large amounts of data as it is produced
• Suited for gathering logs from multiple systems
• Inserting them into HDFS as they are generated
Design goals
• Reliability, Scalability, Manageability, Extensibility
Developed at Cloudera


Flume: high-level architecture
• Master sends configuration to all Agents
• Configurable levels of reliability; guarantee delivery in event of failure
• Deployable, centrally administered
• Optionally pre-process incoming data: perform transformations, suppressions, metadata enrichment
• Flexibly deploy decorators at any step to improve performance, reliability or security
• Writes to multiple HDFS file formats (text, sequence, JSON, Avro, others)
• Parallelized writes across many collectors – as much write throughput as

[Diagram: a MASTER configures several Agents, which feed Processors and Collector(s); decorators shown include encrypt, compress and batch]


HBase
Column-family store. Based on design of Google BigTable
• Provides interactive access to information
• Holds extremely large datasets (multi-TB)
• Constrained access model
  – (key, value) lookup
  – Limited transactions (only one row)


HBase



Hive
SQL-based data warehousing application
• Language is SQL-like
• Supports SELECT, JOIN, GROUP BY, etc.
• Features for analyzing very large data sets
• Partition columns, Sampling, Buckets
• Example:
  SELECT s.word, s.freq, k.freq
  FROM shakespeare s JOIN kjv k ON (s.word = k.word)
  WHERE s.freq >= 5;
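To see what a join-and-filter query of this shape returns, the same SQL can be tried against a small in-memory database. This sketch swaps in Python's built-in SQLite for Hive, so it only illustrates the query semantics, not Hive itself; the table names `shakespeare` and `kjv` are assumptions (the second table name is not legible on the slide) and the sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Two hypothetical word-frequency tables with made-up rows.
conn.executescript("""
    CREATE TABLE shakespeare (word TEXT, freq INTEGER);
    CREATE TABLE kjv (word TEXT, freq INTEGER);
    INSERT INTO shakespeare VALUES ('the', 50), ('thou', 12), ('zounds', 3);
    INSERT INTO kjv VALUES ('the', 80), ('thou', 9), ('begat', 7);
""")

# Same shape as the slide's query: join on word, keep words frequent in the first table.
rows = conn.execute("""
    SELECT s.word, s.freq, k.freq
    FROM shakespeare s JOIN kjv k ON (s.word = k.word)
    WHERE s.freq >= 5
    ORDER BY s.word
""").fetchall()
conn.close()

print(rows)
```

Only words present in both tables survive the join, and the `WHERE` clause then drops any whose frequency in the first table is below 5.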



Pig
Data-flow oriented language – “Pig Latin”
• Datatypes include sets, associative arrays, tuples
• High-level language for routing data, allows easy integration of Java for complex tasks
• Example:
  emps = LOAD 'people.txt' AS (id, name, salary);
  rich = FILTER emps BY salary > 100000;
  srtd = ORDER rich BY salary DESC;
  STORE srtd INTO 'rich_people.txt';
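For readers who do not know Pig Latin, each statement in the example above corresponds to one ordinary data-flow step. The sketch below mirrors those four steps in plain Python on an in-memory stand-in for `people.txt`; the sample rows are invented, and the tab delimiter is assumed because Pig's default loader reads tab-separated fields.

```python
import csv
import io

# Hypothetical contents of people.txt: id, name, salary separated by tabs.
people_txt = "1\talice\t150000\n2\tbob\t90000\n3\tcarol\t120000\n"

# LOAD 'people.txt' AS (id, name, salary)
emps = [
    (int(emp_id), name, int(salary))
    for emp_id, name, salary in csv.reader(io.StringIO(people_txt), delimiter="\t")
]

# FILTER emps BY salary > 100000
rich = [emp for emp in emps if emp[2] > 100000]

# ORDER rich BY salary DESC
srtd = sorted(rich, key=lambda emp: emp[2], reverse=True)

# STORE srtd INTO 'rich_people.txt' (written to a string here instead of a file)
output = io.StringIO()
csv.writer(output, delimiter="\t", lineterminator="\n").writerows(srtd)
print(output.getvalue())
```

Pig compiles such a script into MapReduce jobs that run across the cluster, whereas this sketch runs in a single process.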



Oozie
Oozie is a workflow/coordination service to manage data processing jobs for Hadoop


Zookeeper
Zookeeper is a distributed consensus engine
• Provides well-defined concurrent access semantics:
  – Leader election
  – Service discovery
  – Distributed locking / mutual exclusion
  – Message board / mailboxes


Pipes and Streaming
Multi-language connector libraries for MapReduce
• Write native-code MapReduce in C++
• Write MapReduce passes in any scripting language, including
  – Perl
  – Python
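With Streaming, the mapper and reducer are ordinary programs that read lines on standard input and write tab-separated key/value lines on standard output, and the framework sorts the mapper output by key before the reducer sees it. The sketch below shows that line protocol with a maximum-per-key job; the input format ("year temperature"), the function names and the sample data are all hypothetical, and the local `sorted` call merely stands in for the framework's sort.

```python
from itertools import groupby


def mapper(lines):
    """Emit 'key<TAB>value' lines, as a Streaming mapper would print to stdout."""
    for line in lines:
        year, temperature = line.split()
        yield f"{year}\t{temperature}"


def reducer(sorted_lines):
    """Consume key-sorted 'key<TAB>value' lines and emit the maximum value per key."""
    parsed = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for year, group in groupby(parsed, key=lambda pair: pair[0]):
        yield f"{year}\t{max(int(value) for _, value in group)}"


# Hypothetical input records: "year temperature".
demo_input = ["1949 111", "1950 22", "1949 78", "1950 0"]

map_output = sorted(mapper(demo_input))    # stand-in for the framework's sort by key
reduce_output = list(reducer(map_output))
print(reduce_output)
```

In an actual job the two functions would live in separate scripts that loop over `sys.stdin`, and those scripts would be handed to the Hadoop Streaming jar through its `-mapper` and `-reducer` options.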



FUSE-DFS
Allows mounting of HDFS volumes via the Linux FUSE file system
• Does allow easy integration with other systems for data import/export
• Does not imply HDFS can be used as a general-purpose file system


Hadoop Security
• Authentication is secured by Kerberos v5 and integrated with LDAP
• Hadoop server can ensure that users and groups are who they say they are
• Job Control includes Access Control Lists, which means Jobs can specify who can view logs, counters, configurations and who can modify a job
• Tasks now run as the user who launched the job


Cloudera Enterprise
Cloudera Enterprise makes open source Hadoop enterprise-easy
• Simplify and Accelerate Hadoop Deployment
• Reduce Adoption Costs and Risks
• Lower the Cost of Administration
• Increase the Transparency and Control of Hadoop
• Leverage the Experience of Our Experts

CLOUDERA ENTERPRISE COMPONENTS
• Cloudera Manager: End-to-End Management Application for Apache Hadoop
• Production-Level Support: Our Team of Experts On Call to Help You Meet Your SLAs

EFFECTIVENESS: Ensuring You Get Value From Your Hadoop Deployment
EFFICIENCY: Enabling You to Affordably Run Hadoop in Production


Cloudera Manager The industry’s first

for Apache Hadoop

the Apache Hadoop stack HDFS

MAPREDUCE

Automates the of Apache Hadoop

HBASE DISCOVER

ZOOKEEPER


OOZIE

HUE

DIAGNOSE

ACT

OPTIMIZE


Cloudera Enterprise Including Cloudera Support

Feature: Benefit
• Flexible Support Windows: Choose from 8x5 or 24x7 options to meet SLA requirements
• Configuration Checks: Verify that your Hadoop cluster is fine-tuned for your environment
• Issue Resolution and Escalation Processes: Proven processes ensure that support cases get resolved with maximum efficiency
• Comprehensive Knowledgebase: Browse through hundreds of Articles and Tech Notes to expand upon your knowledge of Apache Hadoop
• Certified Connectors: Connect your Apache Hadoop cluster to your existing data analysis tools such as IBM Netezza and Revolution Analytics
• Notification of New Developments and Events: Stay up to speed with what’s going on in the Apache Hadoop community


Cloudera University
Public and Private Training to Enable Your Success

Class: Description
• Developer Training & Certification (4 Days): Hands-on training and certification for developers who want to analyze their data but are new to Apache Hadoop
• System Administrator Training & Certification (3 Days): Hands-on training and certification for administrators who will be responsible for setting up, configuring and monitoring an Apache Hadoop cluster
• HBase Training (2 Days): Covers the HBase architecture, data model, and Java API as well as some advanced topics and best practices
• Analyzing Data with Hive and Pig (2 Days): Hive and Pig training is designed for people who have a basic understanding of how Apache Hadoop works and want to utilize these languages for analysis of their data
• Essentials for Managers (1 Day): Provides decision-makers the information they need to know about Apache Hadoop, answering questions such as “when is Hadoop appropriate?”, “what are people using Hadoop for?” and “what do I need to know about choosing Hadoop?”


Cloudera Consulting Services
Put Our Expertise To Work For You. Cloudera’s team of Solutions Architects provides guidance and hands-on expertise to address unique enterprise challenges.

Service: Description
• Use Case Discovery: Assess the appropriateness and value of Hadoop for your organization
• New Hadoop Deployment: Set up and configure high performance, production-ready Hadoop clusters
• Proof of Concept: Verify the prototype functionality and project feasibility for a new Hadoop cluster
• Production Pilot: Deploy your first production-level project using Hadoop
• Process and Team Development: Define the requirements and processes for creating a new Hadoop team
• Hadoop Deployment Certification: Perform periodic health checks to certify and tune up existing Hadoop clusters


Journey of the Cloudera Customer
• Discover the Benefits of Apache Hadoop: Flexibility to store and mine all types of data
• Cloudera’s Distribution: The fastest, surest path to success with Apache Hadoop
• Subscribe to Cloudera Enterprise: Simplify and accelerate Apache Hadoop deployment


Cloudera in Production

[Diagram. Labels: Cloudera Services (Consulting Services, Cloudera University); Cloudera Enterprise (Cloudera Management Suite, Cloudera Support); OPERATORS, ENGINEERS, ANALYSTS, BUSINESS USERS, CUSTOMERS; Management Tools, IDE’s, BI / Analytics, Enterprise Reporting, Web Application; Enterprise Data Warehouse; Cloudera’s Distribution Including Apache Hadoop (CDH) & SCM Express; Logs, Files, Operational Rules Engines, Web Data, Relational Databases]


Get Hadoop

Cloudera helps you profit from all your data.

+1 (888) 789-1488
sales@cloudera.com
cloudera.com
twitter.com/cloudera
facebook.com/cloudera


Cloudera Manager The application that:

Hadoop management

Manages the

Manages and monitors the

Incorporates comprehensive

Has


built-in



Cloudera Manager Key

and ONLY CLOUDERA

• Installs the complete Hadoop stack in minutes. The simple, wizard-based interface guides you through the steps.
• Gives you complete, end-to-end visibility and control over your Hadoop cluster from a single interface

ONLY CLOUDERA

• Set server roles, configure services and manage security across the cluster
• Gracefully start, stop and restart services as needed

ONLY CLOUDERA

Maintains a complete record of configuration changes for SOX compliance ONLY CLOUDERA

ONLY CLOUDERA

• Monitors dozens of service performance metrics and alerts you when you approach critical thresholds
• Gather, view and search Hadoop logs collected from across the cluster
• Scans Hadoop logs for irregularities and warns you before they impact the cluster


Cloudera Manager Key

and ONLY CLOUDERA

• Establishes the time context globally for almost all views
• Correlates jobs, activities, logs, system changes, configuration changes and service metrics along a single timeline to simplify diagnosis

ONLY CLOUDERA

ONLY CLOUDERA

Takes a snapshot of the cluster state and automatically sends it to Cloudera support to assist with resolution

• Creates and aggregates relevant Hadoop events pertaining to system health, log messages, user services and activities and makes them available for alerting and searching
• Generates email alerts when certain events occur

ONLY CLOUDERA

• Visualize current and historical disk usage by user, group and directory
• Track MapReduce activity on the cluster by job or user
• View information pertaining to hosts in your cluster including status, resident memory, virtual memory and roles


Two Editions:

                                FREE EDITION   ENTERPRISE EDITION**
Max Number of Nodes Supported   50             Unlimited

Features:
• Automated Deployment
• Host-Level Monitoring
• Secure Communication Between Server & Agents
• Configuration Management
• Manage HDFS, MapReduce, HBase, Hue, Oozie & Zookeeper
• Audit Trails
• Start/Stop/Restart Services
• Add/Restart/Decommission Role Instances
• Configuration Versioning & History
• Support for Kerberos
• Service Monitoring
• Proactive Health Checks
• Status & Health Summary
• Intelligent Log Management
• Events Management & Alerts
• Activity Monitoring
• Operational Reporting
• Global Time Control
• Support Integration

** Part of the Cloudera Enterprise subscription


View Service Health and Performance



Get Host-Level Snapshots



Monitor and Diagnose Cluster Workloads



Gather, View and Search Hadoop Logs



Track Events From Across the Cluster



Run Reports on System Performance & Usage



New in Cloudera Manager 3.7
• Proactive Health Checks: Monitors dozens of service performance metrics and alerts you when you approach critical thresholds
• Intelligent Log Management: Gathers and scans Hadoop logs for irregularities and warns you before they impact the cluster
• Global Time Control: Correlates jobs, activities, logs, system changes, configuration changes and service metrics along a single timeline to simplify diagnosis
• Support Integration: Takes a snapshot of the cluster state and automatically sends it to Cloudera support to assist with resolution
• Event Management: Creates and aggregates relevant Hadoop events pertaining to system health, log messages, user services and activities and makes them available for alerting and searching
• Alerts: Generates email alerts when certain events occur
• Audit Trails: Maintains a complete record of configuration changes for SOX compliance
• Operational Reporting: Visualize current and historical disk usage by user, group and directory and track MapReduce activity on the cluster by job or user

[Slide badges: ONLY CLOUDERA appears next to seven of the eight features]


Cloudera Support


Our

on call to help you meet your SLAs

Feature: Benefit
• Flexible Support Windows: Choose from 8x5 or 24x7 options to meet SLA requirements
• Configuration Checks: Verify that your Hadoop cluster is fine-tuned for your environment
• Issue Resolution and Escalation Processes: Proven processes ensure that support cases get resolved with maximum efficiency
• Comprehensive Knowledgebase: Browse through hundreds of Articles and Tech Notes to expand upon your knowledge of Apache Hadoop
• Certified Connectors: Connect your Apache Hadoop cluster to your existing data analysis tools such as IBM Netezza, Revolution Analytics, and MicroStrategy
• Proactive Notification of New Developments and Events: Stay up to speed with what’s going on in the Apache Hadoop community


Cloudera Enterprise
The Fastest Path to Success Running Apache Hadoop in Production.

Why Cloudera Enterprise?
• Apache Hadoop is a distributed system that presents unique operational challenges
• The fixed cost of managing an internal patch and release infrastructure is prohibitive
• Apache Hadoop skills and expertise are scarce
• It’s challenging to track consistently to community development efforts

Only Cloudera Enterprise
• Has a management application that supports the full lifecycle of operationalizing Apache Hadoop
• Has production support backed by the Apache committers
• Has the depth of experience supporting hundreds of production Apache Hadoop clusters


Hadoop Distributed File System Block Size = 64MB Replication Factor = 3

Cost is $400-$500/TB



MapReduce: Distributed Processing



Thank you.
