IJIRST – International Journal for Innovative Research in Science & Technology | Volume 3 | Issue 03 | August 2016 | ISSN (online): 2349-6010
Analyzing Big Data using Hadoop

Tanuja A
Department of Computer Science & Engineering
VTU Belgaum, India

Swetha Ramana D
Department of Computer Science & Engineering
VTU Belgaum, India
Abstract
The major challenge of Big Data is extracting useful information from terabytes of raw data and analyzing the extracted information, which is essential for decision making. The proposed system addresses this challenge in three units. The first, the Data Acquisition Unit (DAU), filters the collected data; the second, the Data Processing Unit (DPU), processes the data by extracting the useful information; and the third, the Analysis and Decision Unit (ADU), analyses the reduced data and supports decision making.
Keywords: Big Data, MapReduce, ADU, DPU, DAU, Hadoop

I. INTRODUCTION
The amount of data generated every day is growing tremendously, giving rise to Big Data. Advances in Big Data sensing and computing technology are revolutionizing the way remote data is collected, processed, analyzed, and managed [8]. Big Data is typically generated by online transactions, video/audio, email, click streams, logs, posts, social network data, scientific data, remote-access sensory data, mobile phones, and their applications [12], [13]. Storing and analysing data gathered from remote sources in a traditional database is not feasible, as such databases cannot handle unstructured and streaming data. This has driven a transition from traditional databases to Big Data tools, which can analyse such data effectively. Hadoop is a tool used by many organizations for managing and analysing data. Hadoop executes work in parallel across large clusters of small machines (nodes), which makes execution faster; and since the data is distributed among the nodes, node failures can be handled easily. MapReduce is the programming model for distributed processing on Hadoop. It consists of two functions: the Map function takes its input as key/value pairs and splits the data across several nodes for processing, and the Reduce function combines the results of the Map function. The architecture and algorithm are implemented using Hadoop.

This paper is organized as follows. Section II reviews the related literature. Sections III and IV describe the existing and proposed systems, Section V presents the architecture of the system, and Section VI gives the conclusion and future enhancements.
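To make the Map and Reduce roles concrete, the following Java sketch shows the classic word-count example written against Hadoop's MapReduce API. It is an illustrative sketch only, not part of the proposed system: the Map function emits an intermediate (word, 1) pair for every token in its input split, and the Reduce function sums the pairs collected for each word.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {

        // Map: emit an intermediate (word, 1) pair for every token in the split.
        public static class TokenizerMapper
                extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reduce: sum the counts gathered for each word across all splits.
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            @Override
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }
    }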
II. LITERATURE SURVEY

The literature survey is an important step in the software development process. Before developing the tools, it is necessary to assess the economic strength and the time factor. Once programmers design the structure of the tools, they require considerable external support, which can come from senior programmers, websites, or books.

Reference Paper [1]:
Paper Title: Transitioning from Relational Databases to Big Data
Originators: Sangeeta Bansal, Dr. Ajay Rana
Techniques & Procedures: Hadoop & Relational Database Management System
Description: This paper gives a clear differentiation between a relational database management system and Hadoop, which led us to choose the Hadoop tool. Hadoop can operate on larger data at lower cost, as it runs on small commodity nodes, and its parallel processing makes execution faster, which is not possible in the traditional system. Hadoop can process multimedia data such as audio, video, and images in addition to text, whereas a relational database management system can store and process only text data.
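As a small illustration of this flexibility, the sketch below stores a text log and a video file in HDFS through the same FileSystem API call; the file paths are hypothetical, chosen only for the example. HDFS treats every file as an opaque byte stream, so no schema is imposed on the content.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsIngest {
        public static void main(String[] args) throws Exception {
            // Connects to the cluster described in the local Hadoop configuration.
            FileSystem fs = FileSystem.get(new Configuration());

            // Hypothetical paths: a text log and a video are ingested identically,
            // since HDFS does not interpret file contents.
            fs.copyFromLocalFile(new Path("/data/logs/server.log"),
                                 new Path("/bigdata/raw/server.log"));
            fs.copyFromLocalFile(new Path("/data/media/sample.mp4"),
                                 new Path("/bigdata/raw/sample.mp4"));
            fs.close();
        }
    }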
Reference Paper [2]:
Paper Title: MapReduce: Simplified Data Processing on Large Clusters
Originators: Jeffrey Dean and Sanjay Ghemawat
Techniques & Procedures: MapReduce
Description: In this paper, the authors perform a series of operations on large datasets. Each operation consists of a Map and a Reduce function. The Map function splits the input data, which is in the form of key/value pairs, into intermediate key/value pairs, and the Reduce function merges the results for each intermediate key. MapReduce jobs run on large clusters of small machines; the machines run in parallel, but this parallelization is hidden from the user, making the environment easy to use. Moreover, optimization is done locally: the intermediate data is read and written on the host machine.
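A job driver such as the following sketch shows how this local optimization is commonly exposed in Hadoop: registering the reducer as a combiner aggregates intermediate pairs on each host machine before they cross the network. The driver reuses the WordCount mapper and reducer sketched in Section I; input and output paths are taken from the command line.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCountDriver.class);

            job.setMapperClass(WordCount.TokenizerMapper.class);
            // The combiner runs the reduce logic on each node's local map output,
            // shrinking the intermediate data before the shuffle phase.
            job.setCombinerClass(WordCount.IntSumReducer.class);
            job.setReducerClass(WordCount.IntSumReducer.class);

            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }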