Decoding The Big Data Hype


ABSTRACT: In 2001, META Group (now Gartner) published a report by analyst Doug Laney that documented the first analysis of Big Data challenges. This document is an attempt to understand those challenges and the solutions offered. It has been prepared from multiple sources, which are cited accordingly.

By: Smarak Das [EMP Id: 391485]

DECODING THE BIG DATA HYPE From A Layman’s Perspective

Teradata Certified Database Administrator & Technical Specialist



INDEX

(a) Fundamentals Of Big Data
    - Characteristics Of Big Data
    - Warehouse Vs Hadoop
    - Use Cases Of Big Data
    - Big Data Demographics

(b) All About Hadoop
    - HDFS
    - Assumptions & Goals
    - DataNodes & NameNodes
    - Data Replication & Integrity
    - Data Blocks Organization & Pipelining
    - File Permission Guide

(c) Basics Of MapReduce

(d) Hadoop Common Components
    - Pig & PigLatin
    - HIVE
    - JAQL
    - FLUME
    - ZooKeeper
    - Oozie
    - Lucene
    - Avro


Fundamentals Of Big Data

You Are Part Of It Every Day. Wikipedia defines Big Data as "the collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications." The term "Big Data" is a misnomer, since it implies that pre-existing data is somehow small (it isn't) or that the only challenge is its sheer size (size is one challenge, but there are often more). In short, Big Data applies to information that can't be processed or analyzed using traditional tools.

The effects of e-commerce, social media, the rise in mergers and acquisitions, increasing collaboration and partnerships, and so on are driving enterprises to higher levels of consciousness about how data is being managed at its most basic level. Today, every organization has access to a wealth of information, yet most don't know how to get value out of it because it sits in its rawest form or in a semi-structured or unstructured format; as a result, they don't even know whether it is worth keeping. Organizations are also getting overwhelmed by the volume of data generated, the variety of data available, and the velocity at which data arrives. Companies can store anything and are generating data like never before, yet as this potential gold mine of data piles up, the percentage of data the business can actually process is going down.

Today's world is changing. Through instrumentation and sensors, we are able to track and sense more things, and if we can track and sense it, we tend to store it. Big Data is a game changer for the overall effectiveness of your data centers because of its potential as a powerful tool in your information management repertoire. The practice and tools of Big Data and Data Science don't stand alone in the data ecosystem; they rely on the usability of data and on a platform for future discovery and innovation. As Big Data grows through 2014, we will see greater acceptance of Big Data, maturity as an industry, and adoption across industries.


Characteristics Of Big Data

Three characteristics define Big Data: VOLUME, VARIETY, and VELOCITY.

Source: Google Images

Fig: 3Vs Of Big Data

Other "V"s have been proposed as Big Data characteristics, with "Veracity" and "Variability" being two common additions. However, we shall concentrate on "Volume", "Variety", and "Velocity". Each of these characteristics introduces its own set of complexities concerning data processing. All three have created the need for a new class of capabilities to augment the way things are done today, to provide a better line of sight and control over our existing knowledge domains and the ability to act on them. Together, how effectively these three challenges are handled will decide the efficiency and success of any Big Data initiative.



VOLUME: The volume of data is exploding. The year 2000 held an estimated 800,000 petabytes of data, and by 2020 we expect to reach 35 zettabytes. Twitter generates 7 TB of data every day, while Facebook generates 10 TB. If the Twitter and Facebook figures didn't amaze you, the list below holds numbers far beyond any volume previously accumulated:

(a) Large Hadron Collider: roughly 15 PB of data generated per year.
(b) YouTube: 72 hours of video uploaded per minute.
(c) Human genomics: 7,000 PB.
(d) Large Synoptic Survey Telescope: 30 TB of images per day.
(e) Annual email traffic (excluding spam): 300+ PB.

These numbers will be out of date by the time this document is finished, and further outdated by the time you read it. Today, we are storing everything: environmental data, financial data, medical data, surveillance data, click-stream data, and the list goes on. The term "Big Data" means organizations are facing massive volumes of data, and this volume increases every day. As the amount of data available to the business rises, the percentage of data it can process, understand, and analyze declines, creating a Blind Zone. This Blind Zone creates uncertainty about the value of all the captured but as-yet-unexplored data.

Source: Google Images

Fig: Discrepancy between Data Storage & Data Analysis



VARIETY: Of all the "V"s, Variety holds the greatest potential for exploitation. While not every organization has huge volumes of data like Twitter, Facebook, or eBay, even medium and small-scale enterprises have multiple data sources that can be integrated for organizational benefit. With the explosion of sensors, smart phones, and social collaboration technologies, data in an organization is becoming complex, because it includes not only structured data suited to relational databases, but also raw, unstructured, and semi-structured data. Traditional systems struggle to store these varieties of data and to perform the analytics required to understand them. Only 20 percent of today's data is traditional; the remaining 80 percent of the world's data is unstructured, or semi-structured at best. Videos and pictures aren't easily stored in relational databases. Variety is harder to grasp and analyze than "bigger" (Volume) or "faster" (Velocity). To capitalize on the Big Data opportunity, enterprises must be able to analyze all types of data, both relational and non-relational: text, sensor data, audio, video, transactional data, and more.

Figure: Share Of Structured & Un-Structured Data

Figure: Difference Between Variety of Data

Source: Google Images

Source: Relational Source



VELOCITY: Velocity refers to the speed at which data arrives and must be stored or retrieved. Earlier, the typical process was to fire a batch job against the data and wait for the results to arrive. That worked because the incoming data rate was slower than the batch processing rate. Today, data streams into the server in real time and has a very short shelf life, so the need to analyze data-in-motion, rather than data-at-rest, is critical. Sometimes, the competitive advantage for an organization is decided by identifying a trend, problem, or opportunity a few minutes or even seconds before someone else. There is a difference between "How many people live in London?" and "How many people are currently in London?". Dealing effectively with Big Data requires performing analytics against the volume and variety of data while it is still in motion, not just after it is at rest.

Source: Scale DB

Fig: Velocity & Big Data


Warehouse Vs. Hadoop (The Versus Thing)

Gartner, in its 2014 Data Warehouse Database Management Systems Magic Quadrant, said: "Entering 2014, the hype around replacing the data warehouse gives way to the more sensible strategy of augmenting it." Data in a warehouse goes through multiple quality rigors: cleansing, enrichment, matching, glossary, metadata, master data management, modeling, and other services before it is ready for analysis. This is a very expensive and time-consuming process. That said, the business realizes that the data in the data warehouse is required for reporting and BI, which is essential to its functioning. Hence, the "high compute per byte" (high computation cost) is associated with the "high value per byte" of warehouse data. The difference between traditional BI analytics and Big Data analytics is shown in the figure below:

Source: Storage Networking Industry Association (SNIA)

Fig: Big Data Is Different From Business Intelligence



In contrast, Big Data repositories rarely undergo the full quality rigors applied to data warehouse data. While Hadoop data might seem to be of "low value per byte", it also carries a "low compute per byte" cost. With the volume and velocity of today's data, we cannot afford to cleanse and document every piece of data properly, because it would not be economical. The data in Hadoop might sit for a while before analysis, and when its value is discovered, it might migrate to the data warehouse after passing through the quality rigors associated with warehouse data. Three major considerations favor Big Data technologies:

(a) Big Data solutions are ideal for analyzing not only raw data, but structured, semi-structured, and unstructured data as well.
(b) Big Data solutions are ideal when all the data needs to be analyzed, rather than a sample of it.
(c) Big Data solutions are ideal for iterative and exploratory analysis, when the business measures on the data are not predetermined.

Hadoop is not meant for high-performance interactive use, nor does it support database features such as schemas, indexes, optimizers, data structures, and data models.

Source: Google Images


Source: Storage Networking Industry Association (SNIA)

Fig: Business Requirement Case Study

Big Data solutions aren't a replacement for traditional, existing warehouse solutions. Data bound for the analytic warehouse has to be cleansed, documented, and trusted before it is neatly placed into the warehouse, whereas a Big Data solution gives up some of that formality and strictness. The next figure, provided by data warehouse and Big Data market leader Teradata (Gartner 2014), explains the best approach by workload and data type, i.e., when to use what. Legend:

(a) STABLE SCHEMA: financial analysis, OLAP, enterprise-wide BI, reporting, active intelligence, etc.
(b) EVOLVING SCHEMA: interactive data discovery, web clickstream, social feeds, set-top box analysis, sensor logs, JSON, etc.
(c) FORMAT, NO SCHEMA: image processing, audio/video storage and refining, storage and batch transformation.


Source: Teradata

Fig: Teradata Aster [When To Use What] Mapping As Per Requirement

Your information platform shouldn't go into the future without these two important entities working together, because the outcome of a cohesive analytic solution delivers premium results.

Source: SAS Best Practices 2013

Fig: Big Data & Data Warehouse Together = Premium Results


Use Cases Of Big Data

The early companies to embrace Big Data were Google, LinkedIn, Facebook, eBay, and the like. These companies didn't have to reconcile or integrate Big Data with traditional sources of data and perform analytics on both, because they were built around Big Data from the beginning. For them, Big Data could stand alone, Big Data analytics could be the only focus of analytics, and Big Data technology architecture could be the only architecture. However, large and well-established businesses must integrate their Big Data technologies with everything else going on in the company, i.e., analytics on Big Data has to co-exist with analytics on other types of data. Below, we list five instances of Big Data use by popular companies:

Source: International Institute for Analytics Study Sponsored by SAS May 2013


Source: International Institute for Analytics Study Sponsored by SAS May 2013

Source: International Institute for Analytics Study Sponsored by SAS May 2013


Source: International Institute for Analytics Study Sponsored by SAS May 2013


Source: International Institute for Analytics Study Sponsored by SAS May 2013


Big Data Demographics

NVP (NewVantage Partners) conducted a survey in 2013 on Big Data adoption with the participating companies shown in the figure below. The survey participants included Chief Information Officers, Chief Analytics and Risk Officers, Chief Technology Officers, Chief Marketing Officers, Senior Line-of-Business Executives (EVP/SVP), Chief Architects, and Heads of Big Data and Analytics.

Source: NewVantage Partners (NVP) Big Data Executive Survey 2013



(a) Big Data Acceptance By Industry [Financial Services' use of Big Data stems from the development of highly sophisticated customer analytics and predictive behavior models, fraud detection, risk analytics, etc. Health Care & Life Sciences firms are at a nascent stage of Big Data adoption, with only 17% of LS & HC executives reporting Big Data systems operational in production, compared with 33% for Financial Services.]

Source: NewVantage Partners (NVP) Big Data Executive Survey 2013

(b) Big Data Initiative Status [In 2012, 85% of executives indicated they were embarking on initial forays into Big Data initiatives. In 2013, 91% of executives were planning or had embarked on a Big Data initiative. Also, 68% of executives reported an investment of more than $1MM in Big Data initiatives.]

Source: NewVantage Partners (NVP) Big Data Executive Survey 2013



(c) Big Data Initiatives Being Planned By Organizations [The most significant driver for Big Data initiatives was enhancing analytical power and capabilities as a means to compete more successfully and operate more efficiently, cited by 70% of executives. Effective integration of existing data, whether structured, unstructured, or semi-structured, was the driving factor for 69% of executives.]

Source: NewVantage Partners (NVP) Big Data Executive Survey 2013



(d) Primary Focus Of Big Data Analysis Initiatives

Source: NewVantage Partners (NVP) Big Data Executive Survey 2013

Most Big Data initiatives do not currently require an ROI payback analysis to justify the investment, with 50% of respondents describing Big Data as a long-term strategic investment. The figure below shows whether an ROI analysis was conducted before approving the Big Data investment:

Source: NewVantage Partners (NVP) Big Data Executive Survey 2013



(e) Factors Critical To Business Adoption Of Big Data Initiatives [Executive sponsorship is the most critical factor driving Big Data initiatives.]

Source: NewVantage Partners (NVP) Big Data Executive Survey 2013

(f) Primary Business Benefit Expected By Big Data Analysis

Source: NewVantage Partners (NVP) Big Data Executive Survey 2013



(g) Big Data Solutions Used

Source: NewVantage Partners (NVP) Big Data Executive Survey 2013

(h) Analytics & Visualization Solutions Used

Source: NewVantage Partners (NVP) Big Data Executive Survey 2013


All About Hadoop

The problem is obvious: there is a staggering amount of data of various formats and types lying around in every enterprise, with more being added to the repository every moment. Yet enterprises aren't sure whether to continue storing the data, whether to analyze it, or whether there is any value in it at all. It would not be wrong to say that Big Data is the culmination of technological advances that have spiraled the volume, variety, and velocity of data beyond the organization's grasp. People and organizations have attempted to tackle this problem from many different angles. The angle currently leading the pack in terms of popularity is an open source project called Hadoop.

Source: Storage Networking Industry Association (SNIA)

Fig: Hadoop Adoption

Hadoop is a top-level project of the Apache Software Foundation, written in Java. We can define Hadoop as a computing environment built on top of a distributed, clustered file system that was designed specifically for very large-scale operations.


Hadoop is based on Google's work on the MapReduce programming paradigm. Unlike traditional systems, Hadoop is designed to scan through large data sets and produce its results through a highly scalable, distributed batch processing system. Hadoop is not about speed-of-thought response times, real-time warehousing, or blazing transactional speeds, as mentioned in the Warehouse vs. Hadoop section. It is about discovery, and about making the once near-impossible possible from a scalability and analytics perspective. The Hadoop project has three major components:

(a) Hadoop Distributed File System (HDFS)
(b) Hadoop MapReduce
(c) Hadoop Common

Source: Google Images

Fig: Hadoop Base Components

One of the key features of Hadoop is the redundancy built into it. Hadoop recognizes that failure is the norm rather than the exception. As mentioned earlier, Hadoop achieves its performance and scalability with commodity hardware (ready-made, easily available, inexpensive hardware). It is well known that commodity hardware will fail, especially when you have large numbers of machines, but the redundancy built into Hadoop provides fault tolerance and the capability of Hadoop to heal itself. This allows Hadoop to scale workloads out across large clusters of inexpensive machines to work on Big Data problems.


Hadoop Distributed File System [HDFS]

The Hadoop Distributed File System [HDFS] is a distributed file system designed to run on commodity hardware. HDFS is highly fault tolerant and is designed to be deployed on low-cost hardware. Data in a Hadoop cluster is broken down into smaller pieces called blocks and distributed across the cluster. In this way, the map and reduce functions can be executed on smaller subsets of your larger data sets, and this provides the scalability needed for Big Data processing.

Source: Apache Hadoop Org

Fig: Hadoop HDFS Architecture

The goal of Hadoop is to use commonly available servers in a very large cluster, where each server has a set of inexpensive internal disk drives. Because the probability of a commodity hardware failure is high, Hadoop has built-in fault tolerance and fault-compensation capabilities. Data is divided into blocks, and copies of these blocks are stored on other servers in the Hadoop cluster; that is, an individual file is actually stored as smaller blocks replicated across several servers in the entire cluster.


In Hadoop, a file is broken down into "n" blocks, and each of these "n" blocks is replicated across three servers by default. [The number of blocks per file (via the block size) and the replication factor can be customized on a per-file basis; for example, a development Hadoop cluster needn't have any replication.] Coordination among all the servers carries significant overhead, so the ability to process large chunks of data locally helps improve both performance and communication overhead. For example, imagine a file holding all the Employee IDs of an organization. This file is divided into, say, three parts (Block 1, Block 2, and Block 3) and is stored across multiple servers in the Hadoop cluster.

Source: VMware’s Networking and Security Business Unit [Brad Hedlund]

Fig: Block Replication Across HDFS

Here, Block 1 is replicated as Block 1` and Block 1``, and the same applies to Block 2 and Block 3. With the default replication factor of 3, each block is stored three times in total.

This redundancy has two major advantages: (a) high availability, and (b) it allows Hadoop to break work into smaller chunks and run those smaller jobs on all the servers in the cluster for better scalability.


Assumptions & Goals

(a) Hardware Failure: Hardware failure is the norm rather than the exception. Since HDFS runs on thousands of commodity hardware components, and each component has a non-trivial probability of failure, some component of HDFS is always faulty. This failure is expected and is part of Hadoop's design.

(b) Streaming Data Access: HDFS is designed for batch processing rather than interactive use by users. It is not a general-purpose distributed file system; applications that run on HDFS need streaming access to their data sets.

(c) Large Data Sets: A typical file in Hadoop is terabytes in size. HDFS is built to support huge data sets.

(d) Simple Coherency Model: HDFS applications follow a write-once-read-many access model for files. A file, once created, written, and closed, need not be changed. This assumption greatly simplifies data coherency issues. There is a plan to support appending to files in a future release of Hadoop.

(e) "Moving Computation To The Data, Rather Than Data To The Computation": Hadoop assumes that moving computation to the data is much cheaper than moving data to the computation, and this is especially true for large data sets; moving huge data sets to the application greatly increases network bandwidth usage.

(f) Portability: HDFS is designed to run on any platform that supports Java. With highly portable Java at the helm, HDFS can be widely adopted as a platform of choice for applications with large data sets.


DataNodes & NameNodes

Hadoop has a master/slave architecture. All of Hadoop's data placement is managed by a special server called the NAMENODE, and each Hadoop cluster has one NameNode server assigned to it. This server keeps track of all the data in HDFS, and all of the NameNode's information is held in memory, which allows quick response times to storage manipulation and read requests. In other words, the NameNode deals with CLUSTER METADATA. The obvious concern that comes to mind is the Single Point of Failure (SPOF) created by storing all these details on one server; hence it is advisable to choose more robust server components for the NameNode than for the other servers. The initial versions of Hadoop had only one NameNode server. Hadoop version 0.21 added the capability of a Backup Node, which acts as a cold standby for the NameNode.

Source: Storage Networking Industry Association (SNIA)

Fig: Hadoop Master Slave Architecture



DataNodes manage the storage attached to each node in the cluster. Usually, one DataNode runs per node. Internally, when a file is divided into multiple blocks, each block is assigned to multiple DataNodes (three by default). Overall, the NameNode handles namespace operations such as opening, closing, and renaming files, in addition to determining the mapping of blocks to DataNodes. The DataNodes are responsible for serving read and write requests.

Source: Google Images

Fig: NameNode & DataNode Functionality

When you fire a job to insert data into HDFS or to retrieve data from HDFS, Hadoop has the responsibility of communicating with the NameNode for the necessary information, be it the storage location and replication details for a write, or the servers from which the data for a read is to be fetched. In other words, every Hadoop operation references the NameNode, directly or indirectly. Hadoop isn't UNIX (POSIX) compliant, which means the familiar commands for copying, deleting, inserting, opening, and moving files are available only in slightly different forms with HDFS. To work around this, we can either develop our own Java applications to perform some of these functions, or use the Hadoop components readily available from the Apache Software Foundation.
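
As a small illustration of the Java route (not taken from the original document), the sketch below lists a directory through the org.apache.hadoop.fs.FileSystem API instead of a shell command. The directory path is a made-up example, and the configuration is assumed to come from the cluster's core-site.xml and hdfs-site.xml on the classpath.

// Minimal sketch: a Java application talking to HDFS through the FileSystem API.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListHdfsDirectory {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS and other settings from the cluster config files.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Roughly the HDFS equivalent of "ls" on a POSIX file system.
        for (FileStatus status : fs.listStatus(new Path("/user/demo"))) {
            System.out.printf("%s\t%d bytes\treplication=%d%n",
                    status.getPath(), status.getLen(), status.getReplication());
        }
        fs.close();
    }
}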



Data Replication & Integrity

HDFS is designed to store huge files. Each file is divided into blocks, all of the same size except the last one, and all of these blocks are replicated for fault tolerance. The block size and replication factor are configurable per file, and the replication factor can also be changed after the file is created. The NameNode makes all decisions regarding block replication. It also receives a Heartbeat and a BlockReport from each DataNode: a Heartbeat signifies that the DataNode is functioning properly, while the BlockReport contains the list of all blocks on that DataNode.
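
As an aside on the per-file configurability mentioned above, here is a minimal sketch using the org.apache.hadoop.fs.FileSystem API; the NameNode address and file path are assumptions made purely for the example, not values from the document.

// Sketch: create a file with a custom replication factor, then change it later.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020");   // assumed NameNode address
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/demo/employee.txt");     // illustrative path
        short replication = 2;                                // instead of the default 3
        try (FSDataOutputStream out = fs.create(file, true,
                conf.getInt("io.file.buffer.size", 4096), replication,
                fs.getDefaultBlockSize(file))) {
            out.writeUTF("391485,HC\n");
        }

        // The replication factor can also be changed after the file exists.
        fs.setReplication(file, (short) 3);
        fs.close();
    }
}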

Source: VMware’s Networking and Security Business Unit [Brad Hedlund]

Fig: Heartbeat & Block Report

The placement of replicas is crucial, and optimizing replica placement requires a lot of tuning. Hadoop uses a Rack Awareness policy for replication: a group of nodes forms a rack, and by default the replication factor is 3. A simple policy would be to place one replica per rack. That would ensure an even distribution of blocks, but it increases the cost of writes, as every write needs to transfer blocks to multiple racks.


Hadoop's Rack Awareness policy is to put the first replica on one node in the local rack, the second on a different node in the same local rack, and the last on a node in a different rack. One third of the replicas are on one node, two thirds of the replicas are in one rack, and the remaining third are distributed across the remaining racks. The need for block re-replication may arise for many reasons: a DataNode may become unavailable, a replica may become corrupted, a hard disk on a DataNode may fail, or the replication factor of a file may be increased. Block information from each DataNode is sent to the NameNode periodically via the Heartbeat and BlockReport messages. It is also quite possible for a fetched block of data to arrive corrupted. Whenever a block of a file is stored on a DataNode, a checksum is computed for data integrity; this checksum is used to validate data blocks whenever data is retrieved from a DataNode by the client.


Data Blocks Organization & Pipelining

When a client requests a file write, the request doesn't reach the NameNode immediately. Instead, the HDFS client caches the data in a temporary local file until the accumulated data exceeds one HDFS block size. At that point, the client contacts the NameNode, which inserts the file name into the file system hierarchy and allocates a data block for it. The NameNode responds to the client with the identity of the DataNode and the destination data block, and the client flushes the block of data from the local temporary file to the specified DataNode. When the file is closed, the remaining un-flushed data in the temporary local file is transferred to the DataNode, and the client then tells the NameNode that the file is closed. At this point, the NameNode commits the file creation operation to persistent storage. If the NameNode dies before the file is closed, the file is lost.

Source: Storage Networking Industry Association (SNIA)

Figure: Pipelined Flow of HDFS Read Operation



Explanation of the HDFS file read operation:

(a) The HDFS client requests that a file be read for its operation.

(b) The NameNode is contacted for file information; the information sought is the addresses of the file's blocks across the DataNodes.

(c) The NameNode fetches the block information based on the latest BlockReport sent by each DataNode.

(d) The NameNode returns this information, ordering the DataNodes for each block by ascending distance from the reading client. Example: assume a file has two blocks (B1 and B2) with a replication factor of 2. Node 1 contains B1 and B2`; Node 2 contains B2 and B1`. When the NameNode delivers the block ordering, it places the node containing the nearest replica first, followed by the other replica-carrying nodes ordered by distance. For the above example: B1 (Node 1, Node 2); B2 (Node 2, Node 1).

(e) The input stream fetches the data in the order specified, choosing the first node listed for each block and checking the checksum to verify data correctness. If the checksum is correct, the block from the first node is used; otherwise the second node's copy is used.
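
A minimal client-side sketch of this read path is shown below (the file path is an assumption for illustration). The FSDataInputStream returned by open() performs the NameNode lookup, replica selection, and checksum verification described above under the covers.

// Sketch: reading a file from HDFS through the Java API.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadHdfsFile {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/demo/employee.txt");   // illustrative path

        try (FSDataInputStream in = fs.open(file);
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);   // blocks are checksum-verified as they arrive
            }
        }
        fs.close();
    }
}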


FIG: WRITE PIPELINE

In the figure, the normal line represents communication between the client and HDFS, the solid line represents data transfer, and the dotted line represents acknowledgements. Time t0: the client requests a WRITE operation. Time t1-t2: packets of data are sent from the client to HDFS in block-sized units. Time t2: the CLOSE signal is sent. Time t3: Hadoop saves the file.

Source: Storage Networking Industry Association (SNIA)

Figure: Pipelined Flow of HDFS Write Operation



The pipelined approach to the WRITE operation described above is illustrated in the figure below:

Source: VMware’s Networking and Security Business Unit [Brad Hedlund]

Fig: HDFS Write Operation

The policy of putting the first two replicas in the same rack and the third replica in another rack follows the Hadoop Rack Awareness policy explained in the "Data Replication & Integrity" section.


HDFS File Permission Guide

The Hadoop Distributed File System (HDFS) implements a permissions model for files and directories that shares much with the POSIX model. Each file and directory is associated with an owner and a group. The file or directory has separate permissions for the user that is the owner, for other users that are members of the group, and for all other users. For files, the r permission is required to read the file and the w permission is required to write or append to the file. For directories, the r permission is required to list the contents of the directory, the w permission is required to create or delete files or directories, and the x permission is required to access a child of the directory. Each client process that accesses HDFS has a two-part identity composed of the user name and the groups list. Whenever HDFS must do a permissions check for a file or directory foo accessed by a client process: (a) if the user name matches the owner of foo, the owner permissions are tested; (b) else, if the group of foo matches any member of the groups list, the group permissions are tested; (c) otherwise, the other permissions of foo are tested. If a permissions check fails, the client operation fails.
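
The decision order above can be summarized in a few lines of Java. This is only a hypothetical sketch of the logic, not the actual HDFS implementation; the method name and the numeric permission bits are invented for illustration.

// Hypothetical illustration of the HDFS permission-check order; not HDFS source code.
import java.util.Set;

final class PermissionCheck {
    enum Action { READ, WRITE, EXECUTE }

    static boolean isAllowed(String user, Set<String> userGroups,
                             String owner, String group,
                             int ownerBits, int groupBits, int otherBits,
                             Action action) {
        final int bits;
        if (user.equals(owner)) {
            bits = ownerBits;            // (a) owner permissions are tested
        } else if (userGroups.contains(group)) {
            bits = groupBits;            // (b) group permissions are tested
        } else {
            bits = otherBits;            // (c) other permissions are tested
        }
        // r = 4, w = 2, x = 1, as in the familiar POSIX bit layout.
        int mask = (action == Action.READ) ? 4 : (action == Action.WRITE) ? 2 : 1;
        return (bits & mask) != 0;
    }
}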


Basics Of MapReduce

MapReduce is the heart of Big Data processing in Hadoop. It is the programming paradigm that allows for massive scalability across hundreds or thousands of servers in a Hadoop cluster. The term MapReduce refers to two separate and distinct tasks: Map and Reduce. The Map task takes data as input and converts it into another set of data in which individual elements are broken down into tuples (key/value pairs). The Reduce task takes the output of the Map as its input and combines those tuples into a smaller set of tuples. As the name MapReduce suggests, the Reduce job is always performed after the Map job. For example, consider three files containing animal names, where we wish to count the number of occurrences of each animal. First, the input is split into blocks. For each block, a Map task counts the occurrences of each animal. The Shuffle phase of MapReduce then takes the output of the Map tasks and directs it to the appropriate Reduce tasks for consolidation and delivery of the final result. (A minimal Java sketch of this example follows the figure below.)

Source: Google Images

Fig: MapReduce Operation
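
The following is a minimal, hedged sketch of the animal-count example using the standard org.apache.hadoop.mapreduce API. The class names and input format are illustrative assumptions: the input is assumed to hold one animal name per line, and the input/output paths are passed on the command line.

// Sketch: count the occurrences of each animal name across the input files.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class AnimalCount {

    // Map: each input line is one animal name; emit the pair (animal, 1).
    public static class AnimalMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text animal = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            String name = value.toString().trim();
            if (!name.isEmpty()) {
                animal.set(name);
                context.write(animal, ONE);
            }
        }
    }

    // Reduce: the shuffle groups all the 1s for the same animal; sum them.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "animal count");
        job.setJarByClass(AnimalCount.class);
        job.setMapperClass(AnimalMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. /user/demo/animals
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // e.g. /user/demo/animal-counts
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}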



In Hadoop, every MapReduce program is called a JOB. A job is executed by breaking it down into smaller pieces called TASKS. Every Hadoop cluster runs a program called the JOBTRACKER. The JobTracker communicates with the NameNode to find out where all the data required by the submitted job lives across the cluster, and it breaks the job into map and reduce tasks. These tasks are then scheduled on the servers where the data exists. It is possible for a task to be scheduled on a server where the required data isn't available (each piece of data is replicated across only three servers by default); in that case, the server asks for the necessary data to be transferred across the network interconnect to perform the task. As this is not very efficient, the JobTracker tries to avoid it and attempts to schedule tasks on the servers where the data actually resides. Another program, the TASKTRACKER, is responsible for monitoring the status of every running task. If any task fails, the failure is reported back to the JobTracker, which reschedules the task. We can decide how many times a failed task will be retried before the entire job is cancelled.

Source: Storage Networking Industry Association (SNIA)

Fig: MapReduce Basic Concepts



SHUFFLE and COMBINER are two MapReduce features of Hadoop. The Shuffle takes the output from the map tasks and directs it to the reduce tasks. If we wish to perform some aggregation or other transformation on the output of the map tasks before it is sent to the reduce tasks, we can use a Combiner. The greater the number of reduce tasks, the more the scheduling overhead, but the better the overall parallelism and performance.
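
Continuing the illustrative AnimalCount sketch from the previous section, wiring in a combiner and choosing a reducer count is a one-line change each. Because summing counts is associative, the reducer class can double as the combiner; the reducer count of 4 is just an example value, and this snippet assumes the AnimalCount classes defined earlier.

// Sketch: adding a combiner and an explicit reducer count to the earlier job.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

public class AnimalCountWithCombiner {
    public static Job configure() throws Exception {
        Job job = Job.getInstance(new Configuration(), "animal count with combiner");
        job.setJarByClass(AnimalCount.class);
        job.setMapperClass(AnimalCount.AnimalMapper.class);
        // Combiner: pre-aggregate map output locally so the shuffle moves less data.
        job.setCombinerClass(AnimalCount.SumReducer.class);
        job.setReducerClass(AnimalCount.SumReducer.class);
        // More reducers means more parallelism in the reduce phase, at the cost of task overhead.
        job.setNumReduceTasks(4);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        return job;
    }
}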

Source: Storage Networking Industry Association (SNIA)

Fig: MapReduce & HDFS Together


Hadoop Common Components

Hadoop Common is a set of libraries that support the various Hadoop subprojects. As mentioned before, Hadoop isn't UNIX (POSIX) compliant, so to interact with HDFS we use the file system shell interface, /bin/hdfs dfs <args>, where <args> represents the command arguments.

Source: Google Images

Fig: HDFS Shell Commands


Application Development In Hadoop

To use Hadoop directly, we need to write MapReduce programs in Java to interact with HDFS, and programmers must develop and maintain MapReduce applications for business problems that require long, pipelined processing. To abstract away some of the complexity of the Hadoop programming model, several application development languages have emerged that run on top of Hadoop. The popular ones are Pig, ZooKeeper, Hive, JAQL, etc.

Pig & PigLatin

Pig was developed by Yahoo! to allow people using Hadoop to focus on analyzing data rather than on writing mapper and reducer programs. Just as the animal pig eats anything, the Pig programming language is designed to handle any kind of data. Pig is made up of two components: PigLatin, the language, and the Pig runtime environment in which PigLatin programs are executed. The relationship between PigLatin and the Pig runtime environment is similar to that between a Java application and the JVM.

Source: Google Images

Fig: Pig & PigLatin

The commonly used commands in Pig are:

(a) LOAD: Before writing a PigLatin program to access data in HDFS, we need to specify which data in HDFS will be used. For this purpose we use the LOAD 'file' command (where 'file' refers to either an HDFS file or a directory). If a directory is specified, all the files in that directory are loaded into the PigLatin program.


(b) TRANSFORM: The transformation logic is where all the data manipulation occurs. You can use FILTER to remove rows, JOIN to join two sets of data files, GROUP to aggregate data, ORDER to order results, and so on. For example, to calculate the count of employees per manager belonging to the HealthCare ISU, with the input file located in HDFS under the Employee directory (assumed here to be comma-delimited with EmployeeID, ManagerID, and Vertical columns):

L = LOAD 'hdfs://node/Employee' USING PigStorage(',') AS (EmployeeID:chararray, ManagerID:chararray, Vertical:chararray);
FL = FILTER L BY Vertical == 'HC';
G = GROUP FL BY ManagerID;
RT = FOREACH G GENERATE group, COUNT(FL.EmployeeID);

(c) DUMP & STORE: If DUMP or STORE isn't specified, the output of a Pig program isn't displayed. To display the output on the screen, use the DUMP command; to redirect the output to a file, use the STORE command. Once the PigLatin program is written, the Pig runtime environment translates it into a set of map and reduce tasks and runs them under the covers on your behalf.

HIVE

Although Pig is a powerful and simple language to understand and use, the downside is that it is something new to learn and master. HIVE was developed by Facebook with the intention of providing a runtime Hadoop support structure that allows anyone fluent in SQL to leverage the power of Hadoop. Their creation is HQL, the Hive Query Language. HQL statements are broken down into MapReduce jobs by the Hive service and executed across the Hadoop cluster. For example, using the same scenario as above of calculating the count of employees per manager ID, the following Hive code creates a table, populates it, and then queries it:


CREATE TABLE Employee (EMPID BIGINT, MGRID BIGINT, VERTICAL STRING)
COMMENT 'The Employee Table'
STORED AS SEQUENCEFILE;

LOAD DATA INPATH 'hdfs://node/Employee' INTO TABLE Employee;

SELECT MGRID, COUNT(EMPID)
FROM Employee
WHERE VERTICAL = 'HC'
GROUP BY MGRID;

JAQL

JAQL was developed by IBM and allows the processing of both structured and non-traditional data. JAQL was inspired by many programming languages, including LISP, SQL, XQuery, and Pig.

Source: Google Images

Fig: Comparison of Pig, Hive & JAQL


GETTING YOUR DATA INTO HADOOP

One of the biggest challenges with Hadoop is that it is not UNIX (POSIX) compliant. To get your data into Hadoop (HDFS), we use the corresponding Hadoop shell commands:

(a) copyFromLocal: copy a file from the local file system into HDFS.
(b) copyToLocal: copy a file from HDFS into the local file system.

hdfs dfs -copyFromLocal /user/dir/file hdfs://s1.n1.com/dir/hdfsfile
hdfs dfs -copyToLocal hdfs://s1.n1.com/dir/hdfsfile /user/dir/file

These commands are executed through the HDFS shell program, which is itself a Java application. The shell uses the Java APIs for getting data into and out of HDFS.
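
As a hedged sketch of the same two copies done through that Java API, the snippet below reuses the paths from the shell example above; the NameNode authority s1.n1.com is taken from that example and is otherwise an assumption.

// Sketch: the Java equivalents of -copyFromLocal and -copyToLocal.
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCopy {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(new URI("hdfs://s1.n1.com"), conf);

        // Equivalent of: hdfs dfs -copyFromLocal /user/dir/file hdfs://s1.n1.com/dir/hdfsfile
        fs.copyFromLocalFile(new Path("/user/dir/file"),
                             new Path("hdfs://s1.n1.com/dir/hdfsfile"));

        // Equivalent of: hdfs dfs -copyToLocal hdfs://s1.n1.com/dir/hdfsfile /user/dir/file
        fs.copyToLocalFile(new Path("hdfs://s1.n1.com/dir/hdfsfile"),
                           new Path("/user/dir/file"));

        fs.close();
    }
}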

FLUME

Flume is an Apache project for flowing data from a source into your Hadoop environment. In Flume, there are three main entities:

(a) SOURCE: A source can be any data origin, and Flume has many predefined source adapters. Some adapters allow anything coming off a TCP port to enter the flow. A number of text-file source adapters give you granular control to grab a specific file and feed into the flow any new data written to that file. Look to Flume when you want data to flow into Hadoop from many sources.

(b) SINK: A sink is the target of a specific operation, and there are three types of sinks in Flume. One is the final flow destination, known as the COLLECTOR TIER EVENT SINK; this is where a flow lands in the HDFS file system. Another is the AGENT TIER EVENT SINK, used when you want the sink to be the input source of another operation; with such sinks, Flume ensures that an acknowledgement is sent for arriving data. The final type is the BASIC SINK, which can be a text file, a console display, a simple HDFS path, etc.



(c) DECORATOR: A decorator is an operation on the stream that can transform the stream in some manner, be it compressing it or adding or removing information. Very complex, enterprise-class transformations of the kind performed by IBM Information Server aren't achievable with Flume's decorators.

Hadoop is more than a single project; it is an ecosystem of projects aimed at simplifying, managing, coordinating, and analyzing large sets of data. Some of these projects are listed in the following sections.

ZOOKEEPER

ZooKeeper is an open source Apache project that provides a centralized infrastructure and services enabling synchronization across a cluster. Imagine a Hadoop cluster spanning 500 or more servers: there is a need for centralized management of the entire cluster in terms of name services, group services, synchronization services, configuration management, and more. ZooKeeper enables this cross-node synchronization and ensures that tasks across the cluster are serialized or synchronized. A very large Hadoop cluster can be supported by multiple ZooKeeper servers.

Fig: Apache Zookeeper
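
As a purely illustrative sketch of what "synchronization services" means in practice, the few lines below use the ZooKeeper Java client to create an ephemeral, sequential znode, the building block commonly used for distributed locks and leader election. The connection string and znode name are assumptions made for the example.

// Sketch: create an ephemeral, sequential znode (basis of a simple distributed lock).
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkLockSketch {
    public static void main(String[] args) throws Exception {
        // Connect to an assumed ZooKeeper ensemble member with a 3000 ms session timeout.
        ZooKeeper zk = new ZooKeeper("zk1.example.com:2181", 3000, event -> { });

        // An ephemeral znode disappears when this session dies, and its sequence
        // number gives a cluster-wide ordering that clients can use to coordinate.
        String node = zk.create("/demo-lock-", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
        System.out.println("Created " + node);

        zk.close();
    }
}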


OOZIE

Sometimes many jobs need to be chained together to create a complex application. Oozie is an open source project that simplifies workflow and coordination between jobs. It provides users with the ability to define actions and the dependencies between them; Oozie then schedules actions to execute when their required dependencies have been met. A workflow in Oozie is defined as a DAG (Directed Acyclic Graph), in which all the tasks and dependency points are specified without any loops. The figure below shows an example of an Oozie workflow, where the nodes represent actions and control-flow operations.

LUCENE

Lucene is an extremely popular open source Apache project for text search. Lucene predates Hadoop and has been a top-level Apache project since 2005. If you have searched on the Internet, it's very likely that you have used Lucene without knowing it. In a nutshell, if you wish to search for text in a large file or a set of documents, Lucene breaks the documents into text fields and builds an index on those fields. The index is the key component of Lucene, as it is the basis of its rapid search capabilities.

AVRO

Avro is an Apache project offering data serialization.
