Technical Overview

In the era of big data, data is produced in very large quantities from many sources. Data is transforming business, social interactions, and the future of our society. With so much data comes great opportunity for many parties, from businesses and organizations to governments: the information extracted from it can be used to analyze and even predict many aspects of an operation, such as customer sentiment, fraud, and more. It is no longer enough for a company to rely on the limited data in its traditional data warehouse. It needs to enrich that data with new forms of multi-structured data such as web logs, social media, text, graphics, email, audio, machine-generated data, and much more.

Hadoop is a framework for running applications on large clusters built of commodity hardware. The Hadoop framework transparently provides applications with both reliability and data motion. Hadoop implements a computational paradigm named MapReduce, in which an application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster. In addition, it provides a distributed file system (HDFS) that stores data on the compute nodes. Both MapReduce and HDFS are designed so that node failures are handled automatically by the framework.

However, big data adoption is often hindered by the complexity of implementation and the programming skill required to design MapReduce processes. HGrid247 is a graphical workflow designer that eases the pain of workflow design and implementation on Hadoop. HGrid247 helps organizations accelerate time to value by reducing the complexity of big data implementation.
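HGrid247 generates this kind of code so that developers do not have to write it by hand. For reference, a minimal hand-written MapReduce job against the standard Hadoop Java API looks like the classic word-count sketch below (these are the usual textbook classes, not actual HGrid247 output):

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Classic word count: the mapper emits a (word, 1) pair per token and
    // the reducer sums the counts per word. Hadoop runs many copies of each
    // fragment across the cluster and re-executes them on node failure.
    public class WordCount {

      public static class TokenizerMapper
          extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);          // emit (word, 1)
          }
        }
      }

      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values,
                              Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable v : values) {
            sum += v.get();                    // total count for this word
          }
          context.write(key, new IntWritable(sum));
        }
      }
    }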

HGrid247 Benefits

Easy: HGrid247's ease of use and intuitive drag-and-drop interface assist ETL designers and programmers, simplifying data integration and reducing its time and complexity.

No Coding: Instead of writing Java programs, MapReduce jobs, and scripts, HGrid247 empowers developers to design and develop big data jobs using visual tools, resulting in greater team productivity and reducing the need for specialized skills.

Robust: A wealth of robust and proven functionality is now available in HGrid247, all of it tested and implemented in production environments.

Simple: Additional functions can be easily added as UDFs in Java; there is no need to learn a new language or script (see the sketch after this list).

Powerful: Statistical and data mining operators based on the Weka and RapidMiner libraries support churn prediction, fraud detection, cross-selling, and more.
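This overview does not show HGrid247's actual UDF interface, so the following is only an illustrative sketch of the kind of small, self-contained Java class such a function would be; the class name, method name, and normalization rule here are all hypothetical:

    // Hypothetical UDF sketch: HGrid247's real UDF contract may differ.
    // Record-level logic like this (normalizing subscriber phone numbers)
    // is the kind of function typically packaged as a Java UDF.
    public class NormalizeMsisdnUdf {

      // Strips non-digit characters and rewrites a leading "0" to the
      // "62" country code (assumes Indonesian numbers).
      public String evaluate(String rawNumber) {
        if (rawNumber == null) {
          return null;
        }
        String digits = rawNumber.replaceAll("[^0-9]", "");
        if (digits.startsWith("0")) {
          digits = "62" + digits.substring(1);
        }
        return digits;
      }
    }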



[Architecture diagram: the Visual Workflow Designer and Operational Data Monitoring tools sit on top of the HGrid247 framework and libraries (Data Preparation Library, ETL Library, Predictive Modeling Library, and other HDFS libraries); an auto code generator turns workflows into jobs that run on the Hadoop cluster over HDFS.]

HGrid247 provides a comprehensive set of ETL functions together with data preparation and predictive modeling libraries. Developers can easily create a logical ETL process and flow by clicking and dragging in the HGrid247 Workflow Designer. HGrid247 generates MapReduce code from these graphs and compiles it into a MapReduce application. The compiled code runs on the Hadoop cluster and takes advantage of all of Hadoop's capabilities.
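Because the output of the code generator is an ordinary compiled Hadoop application, it can run on any Hadoop cluster. What that amounts to is sketched below with a generic hand-written driver (again illustrative, not actual generated code) that wires the word-count mapper and reducer from the earlier sketch into a job and submits it:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Generic Hadoop driver: configures a job with a mapper and reducer
    // and submits it to whatever cluster the Configuration points at.
    public class WordCountDriver {
      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCount.TokenizerMapper.class);
        job.setCombinerClass(WordCount.IntSumReducer.class);  // pre-aggregate on map side
        job.setReducerClass(WordCount.IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input dir
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output dir
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }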


Benchmark and Case Study

We have implemented Hadoop for many of our customers, and those systems have been in production for several years. We have also performed POCs and benchmarks to show our clients the competitive advantage of a Hadoop implementation. HGrid247 has been very useful in helping us achieve high productivity in these projects and POCs.

BENCHMARK

We conducted a POC to compare the performance of Hadoop against the existing production process of a telecommunications client. We used PC-class machines as the Hadoop cluster, with 1 manager node and 9 data nodes. Each node has an AMD Phenom X4 quad-core 2.3 GHz CPU, 8 GB of RAM, and a 1 TB hard disk.

BENCHMARK RESULTS

Mediation Scenario
    HGrid247:        5 hours 15 minutes
    Existing system: 18 hours
    3 times faster (342.86% speed-up)

Tax Processing Scenario
    HGrid247:        45 minutes
    Existing system: 7 hours
    9 times faster (933.33% speed-up)

1. Mediation Scenario
Scenario: Read binary data from the landing server, convert it to ASCII format, and apply transformations and filtering. The output is an ASCII file ready to distribute to other systems.
Data source specification:
    Data type:         IN call data records
    Input type:        binary files
    Input file size:   ~1 TB
    Number of records: 1,072,944,903
Result:
    Processing time: 5 hours 15 minutes (compared to 18 hours in the existing system)

2. Tax Processing Scenario
Scenario: Read ASCII files from the landing server, then apply transformations, filtering, and aggregation. The output is a summary ASCII file ready for use in the tax report.
Data source specification:
    Data type:         IN call data records
    Input type:        ASCII files
    Input file size:   57.3 GB
    Number of records: 1,130,903,153
Result:
    Processing time: 45 minutes (compared to 7 hours in the existing system)
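The speed-up percentages above follow directly from the ratio of existing processing time to HGrid247 processing time:

    Mediation:      18 h / 5.25 h (5 h 15 min) = 3.43, i.e. 342.86% speed-up
    Tax processing: 420 min / 45 min (7 h / 45 min) = 9.33, i.e. 933.33% speed-up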


DATA WAREHOUSING CASE STUDY

[Architecture diagram: multiple data source systems feed a gateway server, the Hadoop cluster, and the enterprise data warehouse.]

One of our clients, a telecommunications service provider, needed to implement a data warehouse to obtain up-to-date profiles of their subscribers. They needed to process their subscribers' internet access data to provide detailed and summary data to support reports and customer analytics. The complete data warehouse system is integrated with a Teradata database.

Hadoop Cluster Specification:
    Hadoop distribution: Cloudera Hadoop (CDH 4.3.0)
    Operating system:    CentOS 6.4 x86_64
    Number of nodes:     2 name nodes and 42 data nodes
    Data input:          subscribers' internet events (URL hits) and usage records
    Input type:          CSV
    Output type:         Teradata tables


Powered by:

Segitiga Emas Business Park Unit No. 6, Jl. Dr. Satrio Kav-6, Jakarta 12940, Indonesia. Tel. +62 21 579511 32 (hunting). Fax +62 21 579511 28.

