ABSTRACT
In 2001, META Group (now Gartner) published a report by Doug Laney that documented the first analysis of Big Data challenges. This document is an attempt to understand those challenges and the solutions on offer. It has been prepared from multiple sources, which are quoted accordingly.
By: Smarak Das [EMP Id: 391485]
DECODING THE BIG DATA HYPE From A Layman’s Perspective
Teradata Certified Database Administrator & Technical Specialist
INDEX
(a) Fundamentals Of Big Data
- Characteristics Of Big Data
- Warehouse Vs Hadoop
- Use Cases Of Big Data
- Big Data Demographics
(b) All About Hadoop
- HDFS
- Assumptions & Goals
- DataNodes & NameNodes
- Data Replication & Integrity
- Data Blocks Organization & Pipelining
- File Permission Guide
(c) Basics Of MapReduce
(d) Hadoop Common Components
- Pig & PigLatin
- HIVE
- JAQL
- FLUME
- ZooKeeper
- Oozie
- Lucene
- Avro
Fundamentals Of Big Data

You Are Part Of It Every Day

Wikipedia defines Big Data as "the collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications." The term "Big Data" is a misnomer, since it implies that pre-existing data is somehow small (it isn't) or that the only challenge is its sheer size (size is one challenge, but there are often more). In short, Big Data applies to information that can't be processed or analyzed using traditional tools.

The effects of e-commerce, social media, the rise in mergers and acquisitions, increasing collaboration and partnerships, etc. are driving enterprises to a higher level of consciousness about how data is managed at its most basic level. Today, every organization has access to a wealth of information, yet it doesn't know how to get value out of it, because the data is sitting in its rawest form or in a semi-structured or unstructured format; as a result, organizations don't even know whether it's worth keeping. Organizations are also getting overwhelmed by the volume of data generated, the variety of data available, and the velocity at which data arrives. Companies have the ability to store anything, and they are generating data like never before; yet as this potential gold mine of data piles up, the percentage of data the business can process is going down.

Today's world is changing. Through instrumentation and sensors, we are able to track and sense more things, and if we can track and sense it, we tend to store it. Big Data is a game changer for the overall effectiveness of your data centers because of its potential as a powerful tool in your information management repertoire. The practice and tools of Big Data and Data Science don't stand alone in the data ecosystem; they rely on the usability of data and on a platform for future discovery and innovation. As Big Data grows through 2014, we will see greater acceptance of Big Data, more maturity as an industry, and wider adoption across industries.
Characteristics Of Big Data

Three characteristics define Big Data: VOLUME, VARIETY, and VELOCITY.
Source: Google Images
Fig: 3Vs Of Big Data

Other "V"s have been proposed as Big Data characteristics, "Veracity" and "Variability" being two of them. However, we shall concentrate on "Volume", "Variety", and "Velocity". Each of these characteristics introduces its own set of complexities concerning data processing. All 03 features have created the need for a new class of capabilities to augment the way things are done today, providing a better line of sight and control over our existing knowledge domains and the ability to act on them. Together, how effectively these 03 challenges are handled will decide the efficiency and success of any Big Data initiative.
VOLUME: The volume of data is exploding. The year 2000 held 800,000 Petabytes of stored data, and by 2020 we expect to reach 35 Zettabytes. Twitter generates 7 TB of data every day, while Facebook generates 10 TB. If the figures of Twitter and Facebook didn't amaze you, the list below cites volumes well beyond the realm of anything accumulated before:
(a) Large Hadron Collider: generated 15 PB of data.
(b) YouTube: 72 hours of video per hour.
(c) Human Genomics: 7,000 PB.
(d) Large Synoptic Survey Telescope: 30 TB of images per day.
(e) Annual email traffic (excluding spam): 300+ PB.
These numbers will be out of date by the time this document is prepared, and further outdated by the time you read it. Today, we are storing everything: environmental data, financial data, medical data, surveillance data, clickstream data, and the list goes on and on. The term "Big Data" means organizations are facing massive volumes of data, and this volume is increasing every day. As the amount of data available to the business rises, the percentage of data it can process, understand, and analyze declines, creating a Blind Zone. This Blind Zone creates uncertainty about the value of all the captured but as-yet-unexplored data.
Source: Google Images
Fig: Discrepancy between Data Storage & Data Analysis
VARIETY: Of all the "V"s, Variety holds the most potential for exploitation. While not everybody has huge volumes of data like Twitter, Facebook, or eBay, even small and medium-scale enterprises have multiple data sources that can be integrated for organizational benefit. With the explosion of sensors, smart phones, and social collaboration technologies, data in an organization is becoming complex, because it includes not only structured data suitable for relational databases, but also raw, unstructured, and semi-structured data. Traditional systems struggle to store and perform the analytics required to gain understanding from these varieties of data. Only 20 percent of today's data is traditional; the remaining 80 percent of the world's data is unstructured or, at most, semi-structured. Videos and pictures aren't easily stored in relational databases. Variety is harder to grasp and analyze than "bigger" (Volume) or "faster" (Velocity). To capitalize on the Big Data opportunity, enterprises must be able to analyze all types of data, both relational and non-relational: text, sensor data, audio, video, transactional, and more.
Figure: Share Of Structured & Un-Structured Data
Figure: Difference Between Variety of Data
Source: Google Images
Source: Relational Source
VELOCITY: Velocity refers to the speed with which data is stored and retrieved. Earlier, the typical process was to fire a batch job against data and wait for the results to arrive. This worked because the incoming data rate was slower than the batch processing rate. Today, data is streaming into the server in real time and has a very short shelf life. As such, the need to analyze data-in-motion, rather than data-at-rest, is critical. Sometimes the competitive advantage for an organization is decided by identifying a trend, problem, or opportunity a few minutes, or even a few seconds, before someone else. There is a difference between "How many people live in London?" and "How many people are currently in London?". Dealing effectively with Big Data requires performing analytics against the volume and variety of data while it is still in motion, not just after it is at rest.
Source: Scale DB
Fig: Velocity & Big Data
Warehouse Vs. Hadoop (The Versus Thing)

Gartner, in its 2014 Data Warehouse Database Management Systems Magic Quadrant, said, "Entering 2014, the hype around replacing the data warehouse gives way to the more sensible strategy of augmenting it." Data in a warehouse goes through multiple quality rigors: cleansing, enrichment, matching, glossary, metadata, master data management, modeling, and other services before it's ready for analysis. This is a very expensive and time-consuming process. Having said that, the business realizes that the data in the data warehouse is required for reporting and BI purposes, which is essential to its functioning. Hence, the "high compute per byte" (high computation cost) is associated with the "high value per byte" warehouse data. The difference between traditional BI analytics and Big Data analytics is shown in the figure below:
Source: Storage Networking Industry Association (SNIA)
Fig: Big Data Is Different From Business Intelligence
In contrast, Big Data repositories rarely undergo the full quality rigors applied to data warehouse data. While Hadoop data might seem to be of "low value per byte", it also carries a "low compute per byte" cost. With the volume and velocity of today's data, we cannot afford to cleanse and document every piece of data properly, because it's not economical. The data in Hadoop might sit for a while before analysis, and when its value is discovered, it might migrate to the data warehouse after passing through the quality rigors associated with warehouse data.

03 major considerations for Big Data technologies:
(a) Big Data solutions are ideal for analyzing not only raw structured data, but semi-structured and unstructured data as well.
(b) Big Data solutions are ideal when all of the data needs to be analyzed, rather than just a sample of the data.
(c) Big Data solutions are ideal for iterative and exploratory analysis, when the business measures on the data are not predetermined.

Hadoop is not meant for high-performance interactive use, nor does it support database features like schemas, indexes, optimizers, data structures, data models, etc.
Source: Google Images
Source: Storage Networking Industry Association (SNIA)
Fig: Business Requirement Case Study

Big Data solutions aren't a replacement for traditional and existing warehouse solutions. Data bound for the analytic warehouse has to be cleansed, documented, and trusted before it's neatly placed into the warehouse. A Big Data solution gives up some of that formality and strictness of data. The next figure, provided by data warehouse and Big Data market leader Teradata (Gartner 2014), explains the best approach by workload and data type, i.e. when to use what:

Legend:
(a) STABLE SCHEMA: Financial analysis, OLAP, enterprise-wide BI, reporting, active intelligence, etc.
(b) EVOLVING SCHEMA: Interactive data discovery, web clickstream, social feeds, set-top box analysis, sensor logs, JSON, etc.
(c) FORMAT, NO SCHEMA: Image processing, audio/video storage and refining, storage and batch transformation.
Source: Teradata
Fig: Teradata Aster [When To Use What] Mapping As Per Requirement

Your information platform shouldn't go into the future without these two important entities working together, because a cohesive analytic solution delivers premium results.
Source: SAS Best Practices 2013
Fig: Big Data & Data Warehouse Together = Premium Results
Use Cases Of Big Data

The early companies to embrace Big Data were Google, LinkedIn, Facebook, eBay, etc. These companies didn't have to reconcile or integrate Big Data with traditional sources of data and perform analytics across them, as they were built around Big Data from the beginning. For these companies, Big Data could stand alone, Big Data analytics could be the only focus of analytics, and a Big Data technology architecture could be the only architecture. However, large and well-established businesses should integrate their Big Data technologies with everything else going on in the company, i.e. analytics on Big Data should co-exist with analytics on other types of data. Below, we list 05 instances of Big Data use by popular companies:
Source: International Institute for Analytics Study Sponsored by SAS May 2013
Source: International Institute for Analytics Study Sponsored by SAS May 2013
Source: International Institute for Analytics Study Sponsored by SAS May 2013
Source: International Institute for Analytics Study Sponsored by SAS May 2013
Source: International Institute for Analytics Study Sponsored by SAS May 2013
Big Data Demographics

NVP [NewVantage Partners] conducted a survey in 2013 on Big Data, with the participating companies shown in the figure below. The survey participants included Chief Information Officers, Chief Analytics and Risk Officers, Chief Technology Officers, Chief Marketing Officers, Senior Line-of-Business Executives (EVP/SVP), Chief Architects, and Heads of Big Data and Analytics.
Source: NewVantage Partners (NVP) Big Data Executive Survey 2013
(a) Big Data Acceptance In Terms Of Industry [Financial Services' usage of Big Data stems from the development of highly sophisticated customer analytics and predictive behavior models, fraud detection, risk analytics, etc. Health Care & Life Sciences firms are at a nascent stage of Big Data adoption, with only 17% of LS & HC executives reporting Big Data systems operational in production, compared with 33% for Financial Services.]
Source: NewVantage Partners (NVP) Big Data Executive Survey 2013
(b) Big Data Initiative Status [In 2012, 85% of executives indicated they were embarking on initial forays into Big Data initiatives. In 2013, 91% of executives were planning or had embarked on a Big Data initiative. Also, 68% of executives reported investments of more than $1MM in Big Data initiatives.]
Source: NewVantage Partners (NVP) Big Data Executive Survey 2013
(c) Big Data Initiatives Being Planned By Organizations [The most significant driver for Big Data initiatives was enhancing analytics power and capabilities as a means to compete more successfully and operate more efficiently, cited by 70% of executives. Effective integration of existing data, be it structured, semi-structured, or unstructured, was the driving factor for 69% of executives.]
Source: NewVantage Partners (NVP) Big Data Executive Survey 2013
(d) Primary Focus Of Big Data Analysis Initiative
Source: NewVantage Partners (NVP) Big Data Executive Survey 2013
Most Big Data initiatives do not currently require an ROI payback analysis to justify the investment, with 50% of respondents describing it as a long-term strategic investment. The figure below shows whether an ROI analysis was conducted before the Big Data investment was approved:
Source: NewVantage Partners (NVP) Big Data Executive Survey 2013
(e) Factors Critical To Business Adoption Of Big Data Initiatives [Executive sponsorship is the most critical factor driving Big Data initiatives.]
Source: NewVantage Partners (NVP) Big Data Executive Survey 2013
(f) Primary Business Benefit Expected From Big Data Analysis
Source: NewVantage Partners (NVP) Big Data Executive Survey 2013
(g) Big Data Solutions Used
Source: NewVantage Partners (NVP) Big Data Executive Survey 2013
(h) Analytics & Visualization Solutions Used
Source: NewVantage Partners (NVP) Big Data Executive Survey 2013
All About Hadoop

The problem is obvious: there is a staggering amount of data of various formats and types lying around in every enterprise, with more and more data being added to the repository every moment. But enterprises aren't sure whether to continue storing the data, whether to analyze it, or whether there is any value in it at all. It would not be wrong to say that Big Data is the culmination of technological advancements that have spiraled the volume, variety, and velocity of data beyond the organization's grasp. People and organizations have attempted to tackle this problem from many different angles. The angle currently leading the pack in terms of popularity is an open source project called Hadoop.
Source: Storage Networking Industry Association (SNIA)
Fig: Hadoop Adoption

Hadoop is a top-level project of the Apache Software Foundation, written in Java. Hadoop can be defined as a computing environment built on top of a distributed, clustered file system that was designed specifically for very large-scale data operations.
Hadoop is based on Google's work on the MapReduce programming paradigm. Unlike traditional systems, Hadoop is designed to scan through large data sets and produce its results through a highly scalable, distributed batch processing system. Hadoop is not about speed-of-thought response times, real-time warehousing, or blazing transactional speeds, as mentioned in the Warehouse vs. Hadoop section; it is about discovery, and about making the once near-impossible possible from a scalability and analytics perspective. The Hadoop project has 03 major components:
(a) Hadoop Distributed File System (HDFS)
(b) Hadoop MapReduce
(c) Hadoop Common
Source: Google Images
Fig: Hadoop Base Components
One of the key aspects of Hadoop is the redundancy built into it. Hadoop recognizes that failure is the norm rather than the exception. As mentioned earlier, Hadoop achieves its performance and scalability via commodity hardware (ready-made, easily available, inexpensive hardware). It is well known that commodity hardware will fail, especially when you have large numbers of machines. But the redundancy built into Hadoop provides fault tolerance and the capability for Hadoop to heal itself. This allows Hadoop to scale workloads out across large clusters of inexpensive machines to work on Big Data problems.
Hadoop Distributed File System [HDFS]

The Hadoop Distributed File System [HDFS] is a distributed file system designed to run on commodity hardware. HDFS is highly fault tolerant and is designed to be deployed on low-cost hardware. Data in a Hadoop cluster is broken down into smaller pieces called blocks and distributed across the cluster. In this way, the map and reduce functions can be executed on smaller subsets of your larger data sets, which provides the scalability needed for Big Data processing.
Source: Apache Hadoop Org
Fig: Hadoop HDFS Architecture

The goal of Hadoop is to use commonly available servers in a very large cluster, where each server has a set of inexpensive internal disk drives. As the probability of commodity hardware failure is high, Hadoop has built-in fault tolerance and fault-compensation capabilities. Data is divided into blocks, and copies of these blocks are stored on other servers in the Hadoop cluster. That is, an individual file is actually stored as smaller blocks spread across several servers in the entire cluster.
In Hadoop, a file is broken down into "n" blocks. Each of these "n" blocks is replicated across 03 servers by default [the block size and replication factor can be customized on a per-file basis, e.g. a development Hadoop cluster needn't have any replication]. Coordination among all the servers has a significant overhead, so the ability to process large chunks of data locally helps improve both performance and communication overhead. For example, imagine a file containing all the employee IDs of an organization. This file is divided into, say, 03 parts (Block 1, Block 2, and Block 3) and is stored across multiple servers in a Hadoop cluster.
Source: VMware’s Networking and Security Business Unit [Brad Hedlund]
Fig: Block Replication Across HDFS

Here, Block 1 is replicated as Block 1` and Block 1``. The same applies to Block 2 and Block 3. With the default replication factor of 3, each block is replicated thrice.
This redundancy has 02 major advantages:
(a) High availability.
(b) It allows Hadoop to break work into smaller chunks and run those jobs on all the servers in the cluster for better scalability.
Assumptions & Goals

(a) Hardware Failure: Hardware failure is the norm rather than the exception. As HDFS uses thousands of commodity hardware components, and each component has a non-trivial probability of failure, some component of HDFS is always at fault. However, this failure is expected and is part of Hadoop's design.

(b) Streaming Data Access: HDFS is designed for batch processing rather than interactive use. It is not a general-purpose distributed file system dealing with stand-alone data; applications need streaming access to their data sets.

(c) Large Data Sets: A typical file in Hadoop is gigabytes to terabytes in size. HDFS is built to support huge data sets.
(d) Simple Coherency Model: HDFS applications follow a write-once-read-many access model for files. A file, once created, written, and closed, need not be changed. This assumption greatly simplifies data coherency issues. There is a plan to support appending to files in a future release of Hadoop.

(e) "Moving Computation To Data, Rather Than Data To Computation": Hadoop believes that moving computation to the data is much faster than moving data to the computation. This is especially true for large data sets, as moving huge data sets to the application greatly increases network bandwidth usage.
(f) Portability: HDFS is designed to work on any platform supporting Java. With highly portable Java at the helm, Hadoop promotes widespread adoption as the platform of choice for large-data-set applications.
DataNodes & NameNodes

Hadoop has a master/slave architecture. All of Hadoop's data placement is managed by a special server called the NAMENODE. Each Hadoop cluster has one NameNode server assigned to it. This server keeps track of all the data in HDFS. All the NameNode's information is held in memory, which allows quick response times to storage manipulation and read requests. In other words, the NameNode deals with CLUSTER METADATA. The obvious scenario that comes to mind is the Single Point of Failure (SPOF), with all these details stored on one server; hence, it is advisable to choose more robust server components for the NameNode than for the other servers. The initial versions of Hadoop had only one NameNode server. Hadoop version 0.21 introduced the capability of a Backup Node, which acts as a cold standby for the NameNode.
Source: Storage Networking Industry Association (SNIA)
Fig: Hadoop Master Slave Architecture
DataNodes manage the storage attached to each node in the cluster. Usually, one DataNode is assigned per node. Internally, when a file is divided into multiple blocks, each block is assigned to multiple DataNodes [03 by default]. Overall, the NameNode handles namespace operations like opening, closing, and renaming files, in addition to determining the mapping of blocks to DataNodes. The DataNodes are responsible for servicing read and write requests.
Source: Google Images
Fig: NameNode & DataNode Functionality

When you fire a job for inserting data into HDFS or for retrieving data from HDFS, Hadoop has the responsibility of communicating with the NameNode for the necessary information, be it the storage location and replication details for an INSERT operation, or the servers from which the data for a SELECT operation is to be fetched. In other words, every Hadoop operation references the NameNode, directly or indirectly. Hadoop isn't UNIX (POSIX) compliant, which means the familiar commands for copying, deleting, inserting, opening, moving, etc. are available in slightly different forms with HDFS. To work around this, we can either develop our own Java applications to perform some of these functions, or use the Hadoop components readily available from the Apache Software Foundation.
Data Replication & Integrity

HDFS is designed to store huge files. Each file is divided into many blocks, with all blocks the same size except the last, and all these blocks are replicated for fault tolerance. The block size and replication factor are configurable per file, and the replication factor of a file can also be changed later. The NameNode makes all decisions regarding block replication. It also receives a HeartBeat and a BlockReport from each DataNode: a HeartBeat signifies that the DataNode is functioning properly, while the BlockReport contains the list of all blocks on that DataNode.
Source: VMware’s Networking and Security Business Unit [Brad Hedlund]
Fig: HeartBeat & BlockReport

The placement of the first replica is crucial. Optimizing block replica placement requires a lot of tuning and optimization. Hadoop uses a Rack Awareness policy for replication; a group of nodes forms a rack, and by default the replication factor is 03. A simple policy would be to place one replica per rack. This would ensure even distribution of blocks, but it increases the cost of writes, as each write needs to transfer blocks to multiple racks.
Hadoop's Rack Awareness policy is to put the first replica on one node in the local rack, another on a different node in the same local rack, and the last on a different node in a different rack. One third of the replicas are on one node, two thirds of the replicas are on one rack, and the other third are evenly distributed across the remaining racks. The need for block re-replication may arise for many reasons: a DataNode may become unavailable, a replica may become corrupted, a hard disk on a DataNode may fail, or the replication factor of a file may be increased. The block information held on each DataNode is sent to the NameNode periodically via the BlockReport. It is quite possible for a fetched block of data to arrive corrupted. Whenever a block of a file is stored on a DataNode, a checksum is computed for data integrity. This checksum is used to validate data blocks whenever data is retrieved from a DataNode by a client.
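As noted above, the replication factor of an existing file can be changed after the fact. Below is a minimal sketch using the Hadoop Java FileSystem API; the file path and target value are illustrative, and the call simply asks the NameNode to schedule re-replication.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());        // cluster settings come from the classpath configuration
        Path file = new Path("/data/employee_ids.txt");             // illustrative HDFS file
        short current = fs.getFileStatus(file).getReplication();    // current per-file replication factor
        System.out.println("Replication before: " + current);
        fs.setReplication(file, (short) 2);                         // ask the NameNode to re-replicate down to 2 copies
        fs.close();
    }
}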
Data Blocks Organization & Pipelining

When a client requests a file write, the request doesn't reach the NameNode immediately. In fact, the HDFS client caches the data in a temporary local file until the accumulated data amounts to more than one HDFS block. At that point, the client contacts the NameNode. The NameNode inserts the file name into the file system hierarchy, allocates a data block for it, and responds to the client with the identity of the DataNode and the destination data block. The client then flushes the block of data from the local temporary file to the specified DataNode. When the file is closed, the remaining un-flushed data in the temporary local file is transferred to the DataNode, and the client tells the NameNode that the file is closed. At this point, the NameNode commits the file creation operation into a persistent store. If the NameNode dies before the file is closed, the file is lost.
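From the application's point of view, the entire write path above is hidden behind the output stream returned by the Java FileSystem API. A minimal sketch, assuming the standard Hadoop client libraries are on the classpath; the path and contents are illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FSDataOutputStream out = fs.create(new Path("/data/employee_ids.txt"));
        out.writeBytes("E1001\nE1002\nE1003\n");   // buffered client-side and shipped to DataNodes in block-sized packets
        out.close();                               // flush remaining data; the NameNode commits the file
        fs.close();
    }
}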
Source: Storage Networking Industry Association (SNIA)
Figure: Pipelined Flow of HDFS Read Operation
Explanation of the HDFS file read operation:
(a) The HDFS client requests a file to be read for its operation.
(b) The NameNode is contacted for the file information. The information sought is the addresses of the file's blocks across the DataNodes.
(c) The NameNode fetches the block information based on the latest BlockReport sent by each DataNode.
(d) The NameNode returns this information, ordering the replica locations of each block by ascending distance from the reader. Example: assume a file has 2 blocks (B1 and B2) with a replication factor of 02. Node 1 contains B1 and B2`; Node 2 contains B2 and B1`. When the NameNode delivers the block ordering, it lists the node containing the closest replica of each block first, followed by the other nodes holding replicas of that block, ordered by distance. For the above example: B1 (Node 1, Node 2), B2 (Node 2, Node 1).
(e) The input stream fetches the data in the order specified, choosing the first node listed for each block and checking the checksum to verify data correctness. If the checksum is correct, the block from the first node is used; otherwise, the second node's copy of the block is used.
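On the client side, this read path is driven through the FSDataInputStream returned by the Java FileSystem API; the replica selection and checksum verification described above happen underneath it. A minimal sketch with an illustrative path:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FSDataInputStream in = fs.open(new Path("/data/employee_ids.txt"));   // NameNode supplies ordered block locations
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);   // bytes are streamed from the closest healthy replica of each block
            }
        }
        fs.close();
    }
}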
FIG: WRITE PIPELINE

In the figure, the normal line represents communication between the client and HDFS, the solid line represents data transfer, and the dotted line represents acknowledgements.
Time t0: the client requests the WRITE operation.
Time t1-t2: packets of data are sent from the client to HDFS in block-sized units.
Time t2: the CLOSE signal is sent.
Time t3: Hadoop saves the file.
Source: Storage Networking Industry Association (SNIA)
Figure: Pipelined Flow of HDFS Write Operation
The pipelined approach described above for the WRITE operation is further illustrated in the figure below:
Source: VMware’s Networking and Security Business Unit [Brad Hedlund]
Fig: HDFS Write Operation

The policy of putting the first 2 replicas in the same rack and the 3rd replica in another rack follows the Hadoop Rack Awareness policy explained in the "Data Replication & Integrity" section.
HDFS File Permission Guide

The Hadoop Distributed File System (HDFS) implements a permissions model for files and directories that shares much with the POSIX model. Each file and directory is associated with an owner and a group. The file or directory has separate permissions for the user that is the owner, for other users that are members of the group, and for all other users. For files, the r permission is required to read the file, and the w permission is required to write or append to the file. For directories, the r permission is required to list the contents of the directory, the w permission is required to create or delete files or directories, and the x permission is required to access a child of the directory. Each client process that accesses HDFS has a two-part identity composed of the user name and the groups list. Whenever HDFS must do a permissions check for a file or directory foo accessed by a client process:
(a) If the user name matches the owner of foo, then the owner permissions are tested;
(b) Else, if the group of foo matches any member of the groups list, then the group permissions are tested;
(c) Otherwise, the other permissions of foo are tested.
If a permissions check fails, the client operation fails.
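The check order above can be expressed as a short Java sketch. The FsPermission and FsAction classes are part of the Hadoop API, but the method below and its inputs are purely illustrative; in reality the check is performed inside the NameNode.

import java.util.Arrays;
import java.util.List;
import org.apache.hadoop.fs.permission.FsAction;
import org.apache.hadoop.fs.permission.FsPermission;

public class PermissionCheckSketch {
    // Illustrative re-statement of the documented check order, not a real HDFS call.
    static boolean isPermitted(String user, List<String> groups,
                               String owner, String group,
                               FsPermission perm, FsAction requested) {
        if (user.equals(owner)) {
            return perm.getUserAction().implies(requested);    // (a) owner permissions
        } else if (groups.contains(group)) {
            return perm.getGroupAction().implies(requested);   // (b) group permissions
        }
        return perm.getOtherAction().implies(requested);       // (c) other permissions
    }

    public static void main(String[] args) {
        FsPermission perm = new FsPermission((short) 0750);    // rwxr-x---
        boolean ok = isPermitted("smarak", Arrays.asList("analysts"),
                                 "hdfs", "analysts", perm, FsAction.READ);
        System.out.println(ok);                                // true, via the group bits
    }
}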
Basics Of MapReduce

MapReduce is the heart of Hadoop. It is the programming paradigm that allows for massive scalability across hundreds or thousands of servers in a Hadoop cluster. The term MapReduce refers to two separate and distinct tasks: Map and Reduce. The "Map" takes data as input and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs). The "Reduce" takes the output of "Map" as its input and combines those tuples into a smaller set of tuples. As the name MapReduce suggests, the "Reduce" job is always performed after the "Map" job. For example, consider 3 files containing animal names, where we wish to count the number of occurrences of each animal. First, the input files are split into blocks. For each block, a Map task calculates the number of occurrences of each animal. The Shuffle stage of MapReduce takes the output of the Map tasks and directs it to the appropriate Reduce tasks for consolidation and delivery of the final result.
Source: Google Images
Fig: MapReduce Operation
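For the animal-count example above, a minimal MapReduce sketch using the Hadoop Java API is shown below. The class names and input format (whitespace-separated animal names) are assumptions for illustration; it follows the same pattern as the classic WordCount program.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class AnimalCount {
    // Map: emit (animal, 1) for every animal name found in a line.
    public static class AnimalMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text animal = new Text();
        public void map(Object key, Text value, Context ctx) throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    animal.set(token);
                    ctx.write(animal, ONE);   // (key, value) pair handed to the shuffle
                }
            }
        }
    }
    // Reduce: sum up the 1s for each animal.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context ctx) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            ctx.write(key, new IntWritable(sum));
        }
    }
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "animal count");
        job.setJarByClass(AnimalCount.class);
        job.setMapperClass(AnimalMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory with the animal files
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory for the counts
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}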
In Hadoop, every MapReduce program is called a "JOB". A job is executed by breaking it down into smaller pieces called "TASKS". Every Hadoop cluster runs a program called the "JOBTRACKER". The JobTracker communicates with the NameNode to find out where all the data required by the submitted job lives across the cluster, and it breaks the job into map and reduce tasks. These tasks are then scheduled on the servers where the data exists. It is quite possible for a task to be scheduled on a server where the required data isn't available locally [since each block is replicated across only 3 servers by default]. In that case, the server will ask for the necessary data to be transferred across the network interconnect in order to perform its task. As this is not very efficient, the JobTracker tries to avoid it and attempts to schedule tasks on the servers where the data actually resides. Another program, called the "TASKTRACKER", is responsible for monitoring the status of every running task. If any task fails, the failure is reported back to the JobTracker, which then reschedules the task. We can decide how many times a failed task will be retried before the entire job is cancelled.
Source: Storage Networking Industry Association (SNIA)
Fig: MapReduce Basic Concepts
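The retry limit mentioned above is an ordinary job configuration setting. A minimal sketch, assuming the Hadoop 2.x property names mapreduce.map.maxattempts and mapreduce.reduce.maxattempts (older releases used mapred.map.max.attempts and mapred.reduce.max.attempts):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class RetryConfigSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setInt("mapreduce.map.maxattempts", 2);      // allow one retry per failed map task
        conf.setInt("mapreduce.reduce.maxattempts", 2);   // likewise for reduce tasks
        Job job = Job.getInstance(conf, "animal count with limited retries");
        System.out.println(job.getConfiguration().get("mapreduce.map.maxattempts"));
    }
}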
SHUFFLE & COMBINER are 02 features of Hadoop used in MapReduce. Shuffle takes the output of the map tasks and directs it to the appropriate reduce tasks. If we wish to perform some aggregation or other transformation on the output of the map tasks before it is sent to the reduce tasks, we can use a Combiner. The greater the number of reduce tasks, the more overhead is incurred, but overall performance can improve.
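Continuing the AnimalCount sketch from the Basics Of MapReduce section, a Combiner is wired in through the job driver with one extra line. Reusing SumReducer as the combiner is an assumption that happens to be safe here only because summing counts is associative and commutative; it is not a general rule.

job.setCombinerClass(SumReducer.class);   // pre-aggregate map output locally before the shuffle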
Source: Storage Networking Industry Association (SNIA)
Fig: MapReduce & HDFS Together
Hadoop Common Components

Hadoop Common Components are a set of libraries that support the various Hadoop subprojects. As mentioned before, Hadoop isn't UNIX (POSIX) compliant. To interact with HDFS, we need to use the file system shell command interface, /bin/hdfs dfs <args>, where args represents the command arguments.
Source: Google Images
Fig: HDFS Shell Commands
Application Development In Hadoop

To use Hadoop, we need Java to develop the MapReduce programs that interact with HDFS. Programmers also need to develop and maintain MapReduce applications for business processes that require long, pipelined processing. To abstract away some of the complexity of the Hadoop programming model, several application development languages have emerged that run on top of Hadoop. The popular ones are Pig, Hive, and JAQL.
Pig & PigLatin

Pig was developed by Yahoo! to allow people using Hadoop to focus on analyzing data rather than on writing mapper and reducer programs. Just as the animal pig eats almost anything, the Pig programming language is designed to handle any kind of data. Pig is made up of two components: PigLatin, the language, and the Pig runtime environment in which PigLatin programs are executed. The relationship between PigLatin and the Pig runtime environment is similar to that between a Java application and the JVM.
Source: Google Images
Fig: Pig & PigLatin

The commonly used commands in Pig are:
(a) LOAD: Before writing PigLatin programs to access data in HDFS, we need to specify which data in HDFS will be used. For this purpose, we use the LOAD 'file' command (where 'file' refers to either an HDFS file or a directory). If a directory is specified, all the files in that directory are loaded into the PigLatin program.
(b) TRANSFORM: The transformation logic is where all the data manipulation occurs. You can use FILTER to remove rows, JOIN to join two sets of data files, GROUP to aggregate data, ORDER to order results, and so on. For example, to calculate the count of employees per manager belonging to the HealthCare ISU, with the input file located in HDFS under the Employee directory (assumed here to be comma-delimited):

L = LOAD 'hdfs://node/Employee' USING PigStorage(',') AS (EmployeeID:chararray, ManagerID:chararray, Vertical:chararray);
FL = FILTER L BY Vertical == 'HC';
G = GROUP FL BY ManagerID;
RT = FOREACH G GENERATE group, COUNT(FL.EmployeeID);
DUMP RT;

(c) DUMP & STORE: If DUMP or STORE isn't specified, the output of a Pig program isn't displayed. To display the output on the screen, use the DUMP command. To redirect the output to a file, use the STORE command. After the PigLatin program is written, the Pig runtime environment translates it into a set of map and reduce tasks and runs them under the covers on your behalf.
HIVE

Although Pig is a very powerful and simple language to understand and use, the downside is that it is something new to learn and master. Hive was developed by Facebook with the intention of providing a runtime Hadoop support structure that allows anyone fluent in SQL to leverage the power of Hadoop. Their creation is HQL (the Hive Query Language). HQL statements are broken down into MapReduce jobs by the Hive service and executed across the Hadoop cluster. For example, using the same scenario as above of calculating the count of employees per manager, the following Hive code creates a table, populates it, and then queries it:
CREATE TABLE Employee (EMPID BIGINT, MGRID BIGINT, VERTICAL STRING)
COMMENT 'The Above Is The Employee Table'
STORED AS SEQUENCEFILE;

LOAD DATA INPATH 'hdfs://node/Employee' INTO TABLE Employee;

SELECT MGRID, COUNT(EMPID)
FROM Employee
WHERE VERTICAL = 'HC'
GROUP BY MGRID;
JAQL

JAQL was developed by IBM and allows processing of both structured and non-traditional data. JAQL was inspired by many programming languages, such as LISP, SQL, XQuery, and Pig.
Source: Google Images
Fig: Comparison of Pig, Hive & JAQL
GETTING YOUR DATA INTO HADOOP

One of the biggest challenges with Hadoop is that it is not UNIX (POSIX) compliant. To get data into and out of Hadoop (HDFS), we need to use the Hadoop shell commands:
(a) copyFromLocal: Copy a file from the local file system into HDFS.
(b) copyToLocal: Copy a file from HDFS into the local file system.

hdfs dfs -copyFromLocal /user/dir/file hdfs://s1.n1.com/dir/hdfsfile
hdfs dfs -copyToLocal hdfs://s1.n1.com/dir/hdfsfile /user/dir/file

These commands are executed through the HDFS shell program, which is itself a Java application. The shell uses the Java APIs for getting data into and out of HDFS.
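The same transfers can also be done programmatically. A minimal sketch using the Hadoop Java FileSystem API; the NameNode address, the fs.defaultFS property name (Hadoop 2.x; older releases used fs.default.name), and the paths are illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCopySketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://s1.n1.com:8020");       // illustrative NameNode address
        FileSystem fs = FileSystem.get(conf);
        fs.copyFromLocalFile(new Path("/user/dir/file"),         // local source
                             new Path("/dir/hdfsfile"));         // HDFS destination
        fs.copyToLocalFile(new Path("/dir/hdfsfile"),            // HDFS source
                           new Path("/user/dir/file.copy"));     // local destination
        fs.close();
    }
}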
FLUME

Flume is an Apache project for flowing data from a source into your Hadoop environment. In Flume, there are 03 main entities:
(a) SOURCE: A Source can be any data source. Flume has many predefined source adapters. Some adapters allow anything coming off a TCP port to enter the flow. A number of text-file source adapters give you the granular control to grab a specific file and feed into the flow any new data written to that file. Look to Flume when you want data to flow from many sources.
(b) SINK: A Sink is the target of a specific operation. There are 03 types of Sinks in Flume. One Sink is basically the final flow destination, known as a COLLECTOR TIER EVENT SINK; this is where you land a flow into the HDFS file system. Another Sink type is the AGENT TIER EVENT SINK, which is used when you want the sink to be the input source of another operation; when using such Sinks, Flume ensures that an acknowledgement is sent for arriving data. The final Sink type is the BASIC SINK, which can be a text file, a console display, a simple HDFS path, etc.
(c) DECORATOR: A Decorator is an operation on the stream that can transform the stream in some manner, be it compressing it or adding or removing information. Very complex, enterprise-class transformations of the kind done by IBM Information Server aren't achievable with Flume's decorators.
Hadoop is more than just a single project; it is an ecosystem of projects aimed at simplifying, managing, coordinating, and analyzing large sets of data. Such projects are listed in the following sections.
ZOOKEEPER

ZooKeeper is an open source Apache project that provides a centralized infrastructure and services enabling synchronization across a cluster. Imagine a Hadoop cluster spanning 500 or more servers: there is a need for centralized management of the entire cluster in terms of name services, group services, synchronization services, configuration management, and more. ZooKeeper allows cross-node synchronization and ensures that tasks across the cluster are serialized or synchronized. A very large Hadoop cluster can be supported by multiple ZooKeeper servers.
Fig: Apache Zookeeper
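A minimal sketch of using the ZooKeeper Java client for the kind of shared configuration described above; the ensemble address and znode path are illustrative, and error handling is omitted.

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZooKeeperSketch {
    public static void main(String[] args) throws Exception {
        // Connect to an illustrative ensemble member with a 3-second session timeout.
        ZooKeeper zk = new ZooKeeper("zk1.example.com:2181", 3000, event -> {});
        // Publish a small piece of shared configuration as a znode.
        zk.create("/demo-config", "v1".getBytes(),
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        // Any other client in the cluster can now read the same value.
        byte[] value = zk.getData("/demo-config", false, null);
        System.out.println(new String(value));
        zk.close();
    }
}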
OOZIE

Sometimes many jobs need to be chained together to create a complex application. Oozie is an open source project that simplifies workflow and coordination between jobs. It provides users with the ability to define actions and the dependencies between them; Oozie then schedules the actions to execute once the required dependencies have been met. A workflow in Oozie is defined via a DAG (Directed Acyclic Graph), in which all the tasks and dependency points are specified without any loops. The figure below shows an example Oozie workflow, where the nodes represent actions and control-flow operations.
LUCENE

Lucene is an extremely popular open source Apache project for text search. Lucene predates Hadoop and has been a top-level Apache project since 2005. If you have searched on the Internet, it's very likely that you have used Lucene without knowing it. In a nutshell, if you wish to search for text in a large file or a set of documents, Lucene breaks the documents into text fields and builds an index on those fields. The index is the key component of Lucene, as it is the basis of its rapid search capabilities.
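A minimal indexing-and-search sketch with the Lucene Java API, assuming a recent Lucene release (5.x or later, where IndexWriterConfig takes only an Analyzer); the index path and sample text are illustrative.

import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class LuceneSketch {
    public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.open(Paths.get("/tmp/lucene-index"));
        StandardAnalyzer analyzer = new StandardAnalyzer();

        // Index one document: Lucene tokenizes the text field and builds an inverted index on its terms.
        IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer));
        Document doc = new Document();
        doc.add(new TextField("body", "Hadoop stores large files as replicated blocks", Field.Store.YES));
        writer.addDocument(doc);
        writer.close();

        // Search the index for a term.
        IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(dir));
        TopDocs hits = searcher.search(new QueryParser("body", analyzer).parse("hadoop"), 10);
        System.out.println("Matches: " + hits.totalHits);
    }
}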
AVRO

Avro is an Apache project that provides data serialization services.