Cloudera Certified Developer for Apache Hadoop (CCDH)
Who We Are
Mission: To help organizations profit from their data

How We Do It: We deliver relevant products and services.
• A distribution of Apache Hadoop that is tested, certified and supported
• Comprehensive support and professional service offerings
• A suite of management software for Hadoop operations
• Training and certification programs for developers, administrators, managers and data scientists

Credentials: The Apache Hadoop experts.
• Number 1 distribution of Apache Hadoop in the world
• Largest contributor to the open source Hadoop ecosystem
• More committers on staff than any other company
• More than 100 customers across a wide variety of industries
• Strong growth in revenue and new accounts

Technical Team: Unmatched knowledge and experience.
• Founders, committers and contributors to Hadoop
• A wealth of experience in the design and delivery of production software
• Jeff Hammerbacher, Chief Scientist
• Amr Awadallah, VP Engineering
• Doug Cutting, Chief Architect
• Omer Trajman, VP, Customer Solutions

Leadership: Strong executive team with proven abilities.
• Mike Olson, CEO
• Kirk Dunn, COO
• Charles Zedlewski, VP, Product
• Mary Rorabaugh, CFO
Users of Cloudera
Financial
Web
Telecom
Media
https://www.pass4sureexam.com/ccD-410.html
Retail & Consumer
What is Apache Hadoop?
Hadoop is a platform for data storage and processing that is scalable, fault tolerant and open source.

CORE HADOOP COMPONENTS
• Hadoop Distributed File System (HDFS): File Sharing & Data Protection Across Physical Servers
• MapReduce: Distributed Computing Across Physical Servers

Flexibility
• A single repository for storing, processing & analyzing any type of data
• Not bound by a single schema

Scalability
• Scale-out architecture divides workloads across multiple nodes
• Flexible file system eliminates ETL bottlenecks

Low Cost
• Can be deployed on commodity hardware
• Open source platform guards against vendor lock-in
What Makes Hadoop Different?
• Ability to scale out to petabytes in size using commodity hardware
• Processing (MapReduce) jobs are sent to the data versus shipping the data to be processed
• Hadoop doesn’t impose a single data format, so it can easily handle structured, semi-structured and unstructured data
• Manages fault tolerance and data replication automatically
Why the Need for Hadoop?
• 1.8 trillion gigabytes of data was created in 2011
• More than 90% is unstructured data
• Approx. 500 quadrillion files
• Quantity doubles every 2 years
[Chart: gigabytes of data created (in billions), 2005 to 2015, scale 0 to 10,000, structured data vs. unstructured data. Source: IDC 2011]
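The "doubles every 2 years" figure can be turned into a rough projection. The sketch below assumes pure exponential growth starting from the 2011 number on the slide; the function name and the smooth-growth assumption are illustrative and are not taken from the IDC source.

```python
def projected_data_gb(year, base_year=2011, base_gb=1.8e12, doubling_years=2):
    """Project gigabytes created in `year`, assuming the volume doubles
    every `doubling_years` starting from the slide's 2011 figure."""
    return base_gb * 2 ** ((year - base_year) / doubling_years)


for year in (2011, 2013, 2015):
    print(year, f"{projected_data_gb(year) / 1e12:.1f} trillion GB")
```

Under this assumption the 2015 value is four times the 2011 value, which stays inside the 0 to 10,000 billion GB range of the chart.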
Hadoop Use Cases

Use Case (ADVANCED ANALYTICS)    Industry          Use Case (DATA PROCESSING)
Social Network Analysis          Web               Clickstream Sessionization
Content Optimization             Media             Clickstream Sessionization
Network Analytics                Telco             Mediation
Loyalty & Promotions Analysis    Retail            Data Factory
Fraud Analysis                   Financial         Trade Reconciliation
Entity Analysis                  Federal           SIGINT
Sequencing Analysis              Bioinformatics    Genome Mapping
Hadoop in the Enterprise
[Diagram] Users and their tools: Operators (Management Tools), Engineers (IDEs), Analysts (BI / Analytics), Business Users (Enterprise Reporting), Customers (Web Application). Connected system: Enterprise Data Warehouse. Data sources: Logs, Files, Web Data, Relational Databases.
What is CDH?
Cloudera’s Distribution Including Apache Hadoop (CDH) is an enterprise-ready distribution of Hadoop that is 100% Apache open source, contains all components needed for deployment, is fully documented and supported, and is released on a reliable schedule.

Fastest Path to Success
• No need to write your own scripts or do integration testing on different components
• Works with a wide range of operating systems, hardware, databases and data warehouses

Stable and Reliable
• Extensive Cloudera QA systems, software & processes
• Tested & run in production at scale
• Proven at scale in dozens of enterprise environments

Community Driven
• Incorporates only main-line components from the Apache Hadoop ecosystem, with no forks or proprietary underpinnings

FREE
Cloudera’s Commitment to the Open Source Community

Component    Cloudera Committers    Cloudera Founder    2011 Commits (Rank)
Common       6                      Yes                 #1
HDFS         6                      Yes                 #2
MapReduce    5                      Yes                 #1
HBase        2                      No                  #2
Zookeeper    1                      Yes                 #2
Oozie        1                      Yes                 #1
Pig          0                      No                  #3
Hive         1                      No                  #2
Sqoop        2                      Yes                 #1
Flume        3                      Yes                 #1
Hue          3                      Yes                 #1
Snappy       2                      No                  #1
Bigtop       8                      Yes                 #1
Avro         4                      Yes                 #1
Whirr        2                      Yes                 #1
Components of CDH
• User Interface: HUE
• Workflow: APACHE OOZIE
• File System Mount: FUSE-DFS
• Scheduling: APACHE OOZIE
• Languages / Compilers: APACHE PIG, APACHE HIVE
• Data Integration: APACHE FLUME, APACHE SQOOP
• Fast Read/Write Access: APACHE HBASE
• Coordination: APACHE ZOOKEEPER
Hadoop Distributed File System
• Block Size = 64MB
• Replication Factor = 3
• Cost is $400-$500/TB
[Diagram: a file split into blocks 1 to 5, with each block replicated across the nodes of the HDFS cluster]
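To make the block size and replication factor concrete, here is a minimal sketch of the arithmetic. The 300 MB file size is a made-up example and the function name is illustrative; only the 64MB block size and replication factor of 3 come from the slide.

```python
import math

BLOCK_SIZE_MB = 64        # block size from the slide
REPLICATION_FACTOR = 3    # replication factor from the slide


def hdfs_footprint(file_size_mb):
    """Return (blocks, block replicas, raw MB consumed) for a single file."""
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    replicas = blocks * REPLICATION_FACTOR
    # The final block only occupies the bytes it actually holds,
    # so raw usage is the file size times the replication factor.
    raw_mb = file_size_mb * REPLICATION_FACTOR
    return blocks, replicas, raw_mb


print(hdfs_footprint(300))  # hypothetical 300 MB file
```

A 300 MB file needs five blocks, which lines up with the five numbered blocks in the diagram.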
Components of Hadoop
• NameNode
  – Holds all metadata for HDFS
  – Needs to be a highly reliable machine
    • RAID drives – typically RAID 10
    • Dual power supplies
    • Dual network cards – bonded
  – The more memory the better – typically 36GB to 64GB
• Secondary NameNode
  – Provides checkpointing for the NameNode. The same hardware as the NameNode should be used
Components of Hadoop
• DataNodes
  – Hardware will depend on the specific needs of the cluster
  – No RAID needed, JBOD (just a bunch of disks) is used
  – Typical ratio is:
    • 1 hard drive
    • 2 cores
    • 4GB of RAM
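The DataNode ratio of 1 hard drive to 2 cores to 4GB of RAM can be applied as simple scaling. This is only a sketch: the function name and the 12-drive example are hypothetical, and real sizing depends on the workload.

```python
def datanode_spec(num_drives, cores_per_drive=2, ram_gb_per_drive=4):
    """Scale cores and RAM with the drive count using the 1:2:4 ratio."""
    return {
        "drives": num_drives,
        "cores": num_drives * cores_per_drive,
        "ram_gb": num_drives * ram_gb_per_drive,
    }


print(datanode_spec(12))  # hypothetical node with 12 drives
```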
Networking
• One of the most important things to consider when setting up a Hadoop cluster
• Typically a top-of-rack switch is used for each rack, connected to a core switch
• Be careful not to oversubscribe the backplane of the switch!
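One way to check for oversubscription is to compare the bandwidth the ports can offer with what the switch can actually carry. The sketch below uses made-up numbers (48 ports at 1 Gbps against a 32 Gbps backplane); the function name and all figures are assumptions for illustration, not vendor data.

```python
def oversubscription_ratio(port_count, port_gbps, capacity_gbps):
    """Ratio of total offered port bandwidth to switching capacity.
    A value above 1.0 means the ports can offer more traffic than
    the switch can carry at once."""
    return (port_count * port_gbps) / capacity_gbps


ratio = oversubscription_ratio(port_count=48, port_gbps=1, capacity_gbps=32)
print(f"{ratio:.2f}:1")
```

The same calculation applies to rack uplinks by passing the combined uplink bandwidth as the capacity.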
Map
• Records from the data source (lines out of files, rows of a database, etc.) are fed into the map function as key-value pairs: e.g., (filename, line).
• map() produces one or more intermediate values along with an output key from the input.
[Diagram: the Map Task emits (key 1, values), (key 2, values), (key 3, values); the Shuffle Phase gathers the intermediate values for key 1; the Reduce Task outputs the final (key, values)]
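As an illustration of the map step, here is a minimal word-count map function in plain Python. It is a sketch of the idea rather than Hadoop's Java API; the function name and the sample input are made up.

```python
def map_fn(filename, line):
    """Take one (filename, line) record and emit an intermediate
    (word, 1) pair for every word in the line."""
    for word in line.lower().split():
        yield word, 1


pairs = list(map_fn("notes.txt", "the quick the lazy"))
print(pairs)
```

Here the input key (the filename) is ignored and each word becomes an output key, which is typical for word counting.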
Reduce
• After the map phase is over, all the intermediate values for a given output key are combined together into a list
• reduce() combines those intermediate values into one or more final values for that same output key
[Diagram: the Map Task emits (key 1, values), (key 2, values), (key 3, values); the Shuffle Phase gathers the intermediate values for key 1; the Reduce Task outputs the final (key, values)]
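The grouping and reduce steps can be sketched in plain Python as follows. This simulates the shuffle in memory on a made-up list of intermediate pairs; it is not Hadoop's API, and the function names are illustrative.

```python
from collections import defaultdict


def shuffle(pairs):
    """Group intermediate values by output key, as the shuffle phase does."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_fn(key, values):
    """Combine the list of intermediate values for one key into a final value."""
    return key, sum(values)


intermediate = [("the", 1), ("quick", 1), ("the", 1), ("lazy", 1)]
results = [reduce_fn(k, v) for k, v in sorted(shuffle(intermediate).items())]
print(results)
```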
MapReduce Execution
Sqoop
SQL to Hadoop
• Tool to import/export any JDBC-supported database into Hadoop
• Transfer data between Hadoop and external databases or EDW
• High performance connectors for some RDBMS
• Developed at Cloudera
Flume
• Distributed, reliable, available service for efficiently moving large amounts of data as it is produced
  – Suited for gathering logs from multiple systems
  – Inserting them into HDFS as they are generated
• Design goals
  – Reliability, Scalability, Manageability, Extensibility
• Developed at Cloudera
Flume: high-level architecture
• The Master sends configuration to all Agents
• Configurable levels of reliability
• Guarantee delivery in event of failure
• Deployable, centrally administered
• Optionally pre-process incoming data: perform transformations, suppressions, metadata enrichment
• Writes to multiple HDFS file formats (text, sequence, JSON, Avro, others)
• Parallelized writes across many collectors – as much write throughput as
• Flexibly deploy decorators at any step to improve performance, reliability or security
[Diagram: Agents send data through Processors (encrypt, compress, batch) to the Collector(s), all configured by the MASTER]
HBase
Column-family store. Based on the design of Google Bigtable
• Provides interactive access to information
• Holds extremely large datasets (multi-TB)
• Constrained access model
  – (key, value) lookup
  – Limited transactions (only one row)
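The constrained access model can be pictured with a toy in-memory store. This is only a conceptual sketch in Python: the class and method names are invented for illustration and are not the HBase client API.

```python
class ToyColumnFamilyStore:
    """Toy model: each row key maps to cells addressed by (family, qualifier)."""

    def __init__(self):
        self._rows = {}  # row key -> {(family, qualifier): value}

    def put_row(self, row_key, cells):
        """Apply several cell updates to ONE row as a single step.
        There is deliberately no method that updates multiple rows together."""
        updated = dict(self._rows.get(row_key, {}))
        updated.update(cells)
        self._rows[row_key] = updated  # swap in the whole row at once

    def get(self, row_key, family, qualifier):
        """(key, value) style lookup of one cell; returns None if absent."""
        return self._rows.get(row_key, {}).get((family, qualifier))


store = ToyColumnFamilyStore()
store.put_row("user42", {("info", "name"): "Ada", ("stats", "logins"): 7})
print(store.get("user42", "info", "name"))
```

Lookups always start from a row key, which mirrors the "(key, value) lookup" bullet, and updates are scoped to a single row.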
HBase
Hive
SQL-based data warehousing application
• Language is SQL-like
  – Supports SELECT, JOIN, GROUP BY, etc.
• Features for analyzing very large data sets
  – Partition columns, Sampling, Buckets
Example:
SELECT s.word, s.freq, k.freq
FROM shakespeare s JOIN kjv k ON (s.word = k.word)
WHERE s.freq >= 5;
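To see what a join of this shape returns, the sketch below runs an equivalent query against an in-memory SQLite database from Python. SQLite stands in for Hive here purely for illustration; the table contents are invented and the table names are only illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE shakespeare (word TEXT, freq INTEGER);
    CREATE TABLE kjv (word TEXT, freq INTEGER);
    -- invented sample frequencies
    INSERT INTO shakespeare VALUES ('the', 25848), ('thou', 5485), ('zounds', 3);
    INSERT INTO kjv VALUES ('the', 62394), ('thou', 4890), ('selah', 74);
""")

# Words present in both tables whose first-table frequency is at least 5
rows = conn.execute("""
    SELECT s.word, s.freq, k.freq
    FROM shakespeare s JOIN kjv k ON (s.word = k.word)
    WHERE s.freq >= 5
    ORDER BY s.word
""").fetchall()
print(rows)
```

Words that appear in only one table drop out of the result, because the join keeps only matching rows.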
Pig
Data-flow oriented language – “Pig Latin”
• Datatypes include sets, associative arrays, tuples
• High-level language for routing data, allows easy integration of Java for complex tasks
Example:
emps = LOAD 'people.txt' AS (id, name, salary);
rich = FILTER emps BY salary > 100000;
srtd = ORDER rich BY salary DESC;
STORE srtd INTO 'rich_people.txt';
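For readers new to Pig, the same load, filter, order data flow can be sketched in plain Python. The sample records are invented, and tab-separated input is assumed here; this mirrors the logic only and does not run on Hadoop.

```python
import csv
import io

# Invented stand-in for people.txt (tab-separated: id, name, salary)
people_txt = "1\talice\t150000\n2\tbob\t90000\n3\tcarol\t120000\n"

# LOAD ... AS (id, name, salary)
emps = [
    (int(emp_id), name, int(salary))
    for emp_id, name, salary in csv.reader(io.StringIO(people_txt), delimiter="\t")
]

# FILTER emps BY salary > 100000
rich = [emp for emp in emps if emp[2] > 100000]

# ORDER rich BY salary DESC
srtd = sorted(rich, key=lambda emp: emp[2], reverse=True)

print(srtd)
```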
Oozie
Oozie is a workflow/coordination service to manage data processing jobs for Hadoop
Zookeeper
Zookeeper is a distributed consensus engine
• Provides well-defined concurrent access semantics:
  – Leader election
  – Service discovery
  – Distributed locking / mutual exclusion
  – Message board / mailboxes
Pipes and Streaming
Multi-language connector libraries for MapReduce
• Write native-code MapReduce in C++
• Write MapReduce passes in any scripting language, including
  – Perl
  – Python
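A Streaming job exchanges tab-separated key and value lines over standard input and output, and the reducer receives its input sorted by key. The sketch below shows a word-count mapper and reducer in Python and simulates the pipeline locally; the function names and sample lines are illustrative, and a real job would read sys.stdin and be submitted through the Hadoop Streaming jar.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' line per word, as a streaming mapper would print."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    """Sum counts per word; relies on the input arriving sorted by key."""
    pairs = (line.rstrip("\n").split("\t") for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


if __name__ == "__main__":
    # Local simulation: sorted() stands in for the shuffle and sort
    sample = ["to be or not", "to be"]
    for output_line in reducer(sorted(mapper(sample))):
        print(output_line)
```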
FUSE-DFS
• Allows mounting of HDFS volumes via the Linux FUSE file system
  – Does allow easy integration with other systems for data import/export
  – Does not imply HDFS can be used as a general-purpose file system
Hadoop Security
• Authentication is secured by Kerberos v5 and integrated with LDAP
• Hadoop server can ensure that users and groups are who they say they are
• Job Control includes Access Control Lists, which means jobs can specify who can view logs, counters and configurations, and who can modify a job
• Tasks now run as the user who launched the job
Cloudera Enterprise
Cloudera Enterprise makes open source Hadoop enterprise-easy
• Simplify and Accelerate Hadoop Deployment
• Reduce Adoption Costs and Risks
• Lower the Cost of Administration
• Increase the Transparency and Control of Hadoop
• Leverage the Experience of Our Experts

CLOUDERA ENTERPRISE COMPONENTS
• Cloudera Manager: End-to-End Management Application for Apache Hadoop
• Production-Level Support: Our Team of Experts On Call to Help You Meet Your SLAs

EFFECTIVENESS: Ensuring You Get Value From Your Hadoop Deployment
EFFICIENCY: Enabling You to Affordably Run Hadoop in Production
Cloudera Manager The industry’s first
for Apache Hadoop
the Apache Hadoop stack HDFS
MAPREDUCE
Automates the of Apache Hadoop
HBASE DISCOVER
ZOOKEEPER
OOZIE
HUE
DIAGNOSE
ACT
OPTIMIZE
Cloudera Enterprise Including Cloudera Support

Feature: Benefit
• Flexible Support Windows: Choose from 8x5 or 24x7 options to meet SLA requirements
• Configuration Checks: Verify that your Hadoop cluster is fine-tuned for your environment
• Issue Resolution and Escalation Processes: Proven processes ensure that support cases get resolved with maximum efficiency
• Comprehensive Knowledgebase: Browse through hundreds of Articles and Tech Notes to expand upon your knowledge of Apache Hadoop
• Certified Connectors: Connect your Apache Hadoop cluster to your existing data analysis tools such as IBM Netezza and Revolution Analytics
• Notification of New Developments and Events: Stay up to speed with what’s going on in the Apache Hadoop community
Cloudera University
Public and Private Training to Enable Your Success

Class: Description
• Developer Training & Certification (4 Days): Hands-on training and certification for developers who want to analyze their data but are new to Apache Hadoop
• System Administrator Training & Certification (3 Days): Hands-on training and certification for administrators who will be responsible for setting up, configuring and monitoring an Apache Hadoop cluster
• HBase Training (2 Days): Covers the HBase architecture, data model, and Java API as well as some advanced topics and best practices
• Analyzing Data with Hive and Pig (2 Days): Hive and Pig training is designed for people who have a basic understanding of how Apache Hadoop works and want to utilize these languages for analysis of their data
• Essentials for Managers (1 Day): Provides decision-makers the information they need to know about Apache Hadoop, answering questions such as “when is Hadoop appropriate?”, “what are people using Hadoop for?” and “what do I need to know about choosing Hadoop?”
Cloudera Consulting Services
Put Our Expertise To Work For You. Cloudera’s team of Solutions Architects provides guidance and hands-on expertise to address unique enterprise challenges.

Service: Description
• Use Case Discovery: Assess the appropriateness and value of Hadoop for your organization
• New Hadoop Deployment: Set up and configure high performance, production-ready Hadoop clusters
• Proof of Concept: Verify the prototype functionality and project feasibility for a new Hadoop cluster
• Production Pilot: Deploy your first production-level project using Hadoop
• Process and Team Development: Define the requirements and processes for creating a new Hadoop team
• Hadoop Deployment Certification: Perform periodic health checks to certify and tune up existing Hadoop clusters
Journey of the Cloudera Customer
• Discover the Benefits of Apache Hadoop: Flexibility to store and mine all types of data
• Cloudera’s Distribution: The fastest, surest path to success with Apache Hadoop
• Subscribe to Cloudera Enterprise: Simplify and accelerate Apache Hadoop deployment
Cloudera in Production
[Diagram] Cloudera Services: Consulting Services, Cloudera University. Users and their tools: Operators (Management Tools), Engineers (IDEs), Analysts (BI / Analytics), Business Users (Enterprise Reporting), Customers (Web Application). Cloudera Enterprise: Cloudera Management Suite, Cloudera Support. Platform: Cloudera’s Distribution Including Apache Hadoop (CDH) & SCM Express. Connected systems: Enterprise Data Warehouse, Operational Rules Engines. Data sources: Logs, Files, Web Data, Relational Databases.
Get Hadoop
Cloudera helps you profit from all your data.
+1 (888) 789-1488 sales@cloudera.com
cloudera.com
twitter.com/cloudera
facebook.com/cloudera
Cloudera Manager The application that:
Hadoop management
Manages the
Manages and monitors the
Incorporates comprehensive
Has
built-in
Cloudera Manager Key
and ONLY CLOUDERA
Installs the complete Hadoop stack in minutes. The simple, wizard-based interface guides you through the steps. Gives you complete, end-to-end visibility and control over your Hadoop cluster from a single interface
ONLY CLOUDERA
Set server roles, configure services and manage security across the cluster
Gracefully start, stop and restart services as needed
ONLY CLOUDERA
Maintains a complete record of configuration changes for SOX compliance ONLY CLOUDERA
ONLY CLOUDERA
Monitors dozens of service performance metrics and alerts you when you approach critical thresholds Gather, view and search Hadoop logs collected from across the cluster Scans Hadoop logs for irregularities and warns you before they impact the cluster
Cloudera Manager Key
and ONLY CLOUDERA
Establishes the time context globally for almost all views Correlates jobs, activities, logs, system changes, configuration changes and service metrics along a single timeline to simplify diagnosis
ONLY CLOUDERA
ONLY CLOUDERA
Takes a snapshot of the cluster state and automatically sends it to Cloudera support to assist with resolution
Creates and aggregates relevant Hadoop events pertaining to system health, log messages, user services and activities and makes them available for alerting and searching
Generates email alerts when certain events occur
ONLY CLOUDERA
Visualize current and historical disk usage by user, group and directory Track MapReduce activity on the cluster by job or user
View information pertaining to hosts in your cluster including status, resident memory, virtual memory and roles
Two Editions: FREE EDITION and ENTERPRISE EDITION**
Max Number of Nodes Supported: 50 (Free Edition), Unlimited (Enterprise Edition)
Features compared:
• Automated Deployment
• Host-Level Monitoring
• Secure Communication Between Server & Agents
• Configuration Management
• Manage HDFS, MapReduce, HBase, Hue, Oozie & Zookeeper
• Audit Trails
• Start/Stop/Restart Services
• Add/Restart/Decommission Role Instances
• Configuration Versioning & History
• Support for Kerberos
• Service Monitoring
• Proactive Health Checks
• Status & Health Summary
• Intelligent Log Management
• Events Management & Alerts
• Activity Monitoring
• Operational Reporting
• Global Time Control
• Support Integration
** Part of the Cloudera Enterprise subscription
View Service Health and Performance
Get Host-Level Snapshots
Monitor and Diagnose Cluster Workloads
Gather, View and Search Hadoop Logs
Track Events From Across the Cluster
Run Reports on System Performance & Usage
New in Cloudera Manager 3.7
• Proactive Health Checks: Monitors dozens of service performance metrics and alerts you when you approach critical thresholds
• Intelligent Log Management: Gathers and scans Hadoop logs for irregularities and warns you before they impact the cluster
• Global Time Control: Correlates jobs, activities, logs, system changes, configuration changes and service metrics along a single timeline to simplify diagnosis
• Support Integration: Takes a snapshot of the cluster state and automatically sends it to Cloudera support to assist with resolution
• Event Management: Creates and aggregates relevant Hadoop events pertaining to system health, log messages, user services and activities and makes them available for alerting and searching
• Alerts: Generates email alerts when certain events occur
• Audit Trails: Maintains a complete record of configuration changes for SOX compliance
• Operational Reporting: Visualize current and historical disk usage by user, group and directory and track MapReduce activity on the cluster by job or user
Seven of these features carry the ONLY CLOUDERA badge.
Cloudera Support
Our
on call to help you meet your SLAs
Feature: Benefit
• Flexible Support Windows: Choose from 8x5 or 24x7 options to meet SLA requirements
• Configuration Checks: Verify that your Hadoop cluster is fine-tuned for your environment
• Issue Resolution and Escalation Processes: Proven processes ensure that support cases get resolved with maximum efficiency
• Comprehensive Knowledgebase: Browse through hundreds of Articles and Tech Notes to expand upon your knowledge of Apache Hadoop
• Certified Connectors: Connect your Apache Hadoop cluster to your existing data analysis tools such as IBM Netezza, Revolution Analytics, and MicroStrategy
• Proactive Notification of New Developments and Events: Stay up to speed with what’s going on in the Apache Hadoop community
Cloudera Enterprise
The Fastest Path to Success Running Apache Hadoop in Production.

Why Cloudera Enterprise?
• Apache Hadoop is a distributed system that presents unique operational challenges
• The fixed cost of managing an internal patch and release infrastructure is prohibitive
• Apache Hadoop skills and expertise are scarce
• It’s challenging to track consistently to community development efforts

Only Cloudera Enterprise
• Has a management application that supports the full lifecycle of operationalizing Apache Hadoop
• Has production support backed by the Apache committers
• Has the depth of experience supporting hundreds of production Apache Hadoop clusters
Hadoop Distributed File System Block Size = 64MB Replication Factor = 3
Cost is $400-$500/TB
MapReduce: Distributed Processing
Thank you.