
IDL - International Digital Library Of Technology & Research Volume 1, Issue 5, May 2017

Available at: www.dbpublications.org

International e-Journal For Technology And Research-2017

Web Oriented FIM for Large Scale Datasets Using Hadoop

Mrs. Supriya C
PG Scholar, Department of Computer Science and Engineering
C.M.R.I.T, Bangalore, Karnataka, India
supriyakuppur@gmail.com

Abstract: In large scale datasets, existing parallel algorithms for mining frequent itemsets balance the load by distributing the enormous data across a collection of computers. However, we identify a performance issue in these existing mining algorithms [1]. To handle this problem, we introduce a data partitioning approach using the MapReduce programming model. In our proposed system, we also introduce a new structure, the frequent itemset ultrametric tree (FIU-tree), in place of conventional FP-trees. Experimental results show that eliminating redundant transactions improves performance by reducing computing loads.

Keywords: frequent itemset, MapReduce, data partitioning, parallel computing, load balancing

1 INTRODUCTION

Big data is an emerging technology in the modern world. It refers to amounts of data so large that they are hard to process using traditional data processing techniques or software. The major challenges in big data are storing, distributing, searching, visualizing, querying, and updating such data. Data analysis is another big concern that needs attention when dealing with big data. Big data involves data produced by many different types of sources and applications, such as social media and online auctions. Data is differentiated into 3 major types: structured, unstructured, and semi-structured. Big data is also characterized by 3 major V's, Volume, Velocity, and Variety, which give a clear notion of what big data is.

Nowadays data is growing very fast. For example, many hospitals store trillions of data points of ECG data, and Twitter alone collects around 170 million items of temporal data and, every now and then, serves as many as 200 million queries per day. The most important limitations of existing systems concern handling larger datasets: our databases can handle only structured data, not varieties of data, and they lack fault tolerance and scalability. That is why big data plays an important role these days.


Compared with traditional systems, modern distributed systems try to achieve high efficiency and scalability when distributed data is executed on large-scale clusters. Many algorithms built on Hadoop have been defined to process FIM, aiming at balancing the load by distributing it equally [4] among nodes. When such data is divided into different parts, the connections between the data must be maintained; otherwise it leads to poor data locality, and in parallel it increases data-shuffling costs and network overhead. In order to improve data locality, we introduce a parallel FIM technique in which the bulk of the data is distributed across Hadoop clusters.

In this paper we implement FIM on Hadoop [10] clusters using the MapReduce framework. The project aims to boost the performance of parallel FIM on Hadoop clusters, and this is achieved with the help of Map and Reduce jobs.

2 OBJECTIVES

The main goal of the project is to eliminate redundant transactions on Hadoop nodes to improve performance, which is achieved by reducing the computing and networking load. It mainly gives attention to grouping highly significant transactions into one data partition. In the area of big data processing, the MapReduce framework has been used to develop parallel data mining algorithms, some FIM and FP-growth [3] based, some ARM based. Considering bulky datasets, a single machine is not able to handle everything, so the data needs to be distributed and processed in parallel among clusters of nodes, which is a foremost challenge. To handle this scenario we need to design a distributed storage system. In big data, this is provided by Hadoop, a system that stores and processes big data. It includes two important techniques: HDFS (for storing big data) and the MapReduce framework (for processing big data). Big data processing deals with three different stages: data ingestion, data storage, and data analysis.

If data is distributed, it is tough to find the locality of files for bigger datasets. A better solution to this problem is to follow a Master-Slave architecture, in which a single machine acts as the 'Master' and the remaining machines are treated as 'Slaves'. The Master knows the locations of the files stored on the different Slave machines, so whenever a client sends a request, the Master machine processes it by finding the requested file on one of the underlying Slave machines. Hadoop follows the same architecture.
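The paper describes this lookup only architecturally, but it is directly visible through Hadoop's Java client API. As a minimal sketch (our illustration, not part of the proposed system; the dataset path is hypothetical), the following program asks the Master (the HDFS NameNode) which Slave machines hold the blocks of a file:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Ask the Master (NameNode) which Slave (DataNode) machines hold
// each block of a file. The path below is a hypothetical example.
public class BlockLocator {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();       // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);           // client handle to HDFS
        Path file = new Path("/data/transactions.dat"); // hypothetical dataset path
        FileStatus status = fs.getFileStatus(file);
        BlockLocation[] blocks =
                fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation b : blocks) {
            // every block reports the Slave hosts storing a replica
            System.out.println(b.getOffset() + " -> "
                    + String.join(",", b.getHosts()));
        }
        fs.close();
    }
}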

3 METHODOLOGY

Traditional mining algorithms [2] are not enough to handle large datasets, so we introduce a new data partitioning technique. Parallel computing [7] is one more method we introduce here, to process the redundant transactions in parallel, so that we can achieve better performance compared with the traditional mining algorithms.


In the proposed system we consider both the old parallel mining algorithm and the new mining algorithm on Hadoop, and measure how much processing time each system takes. Hadoop gives us good modules to achieve this, and the whole system is depicted briefly in Fig 3.1. The methodology proceeds in four steps.

Fig 3.1 System Architecture: High Level View

Step 1, scan the transaction DB: First we scan the transaction database to count each item; the items that survive form the frequent 1-itemsets. Each result is a <key, value> pair.

Step 2, organize the frequent 1-itemsets into Flist: The frequent 1-itemsets are sorted in decreasing order of frequency; the sorted list is called Flist.
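The paper gives no code for Steps 1 and 2, so the following is a sketch of how the Step 1 counting pass could look as a Hadoop MapReduce job in Java, assuming one transaction per input line with space-separated items; the class names and the minimum-support threshold are our assumptions:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Step 1 sketch: mappers emit <item, 1> for every item occurrence;
// the reducer sums the counts and keeps items whose support reaches
// MIN_SUPPORT, yielding the frequent 1-itemsets as <item, count> pairs.
public class OneItemsetCount {

    public static class ItemMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text item = new Text();

        @Override
        protected void map(LongWritable offset, Text transaction, Context ctx)
                throws IOException, InterruptedException {
            for (String tok : transaction.toString().split("\\s+")) {
                if (tok.isEmpty()) continue;
                item.set(tok);
                ctx.write(item, ONE);
            }
        }
    }

    public static class SupportReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private static final int MIN_SUPPORT = 2;  // hypothetical threshold

        @Override
        protected void reduce(Text item, Iterable<IntWritable> counts, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) sum += c.get();
            if (sum >= MIN_SUPPORT) {
                ctx.write(item, new IntWritable(sum)); // frequent 1-itemset
            }
        }
    }
}

Because only the frequent 1-itemsets survive this job, its output is far smaller than the raw transaction database, so the Step 2 sort into Flist can be done in memory on the driver.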

Step 3, build the FIU-tree: This step is performed with two Map and Reduce phases.

• Mapper: Using the Flist obtained in Step 2, the mappers process each transaction against the Flist and produce output as a set of <key, value> pairs, which the framework groups into group-dependent sub-datasets (see the sketch after this list).

• Reducer: Each reducer instance is assigned one or more group-dependent sub-datasets and processes them one by one. For each sub-dataset, the reducer instance builds a local FP-tree; during the recursive mining process it outputs the discovered patterns (a sketch of such a local tree follows Step 4).
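The map phase of Step 3 is described only at a high level. One common way to form the group-dependent sub-datasets, borrowed from parallelized FP-growth [7] rather than taken from the paper, is sketched below; the group count, the hard-coded Flist, and all names are illustrative assumptions:

import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Step 3 map-phase sketch: each transaction is pruned to its frequent
// items, rewritten as Flist ranks, and emitted at most once per group,
// so each reducer receives a self-contained group-dependent sub-dataset.
public class GroupMapper
        extends Mapper<LongWritable, Text, IntWritable, Text> {

    private static final int NUM_GROUPS = 4;            // hypothetical group count
    private final Map<String, Integer> rank = new HashMap<>();

    @Override
    protected void setup(Context ctx) {
        // A real job would load the Flist from Step 2's output (for
        // example via the distributed cache); a tiny hard-coded ranking
        // keeps this sketch self-contained.
        String[] flist = {"a", "b", "c", "d", "e"};
        for (int i = 0; i < flist.length; i++) rank.put(flist[i], i);
    }

    @Override
    protected void map(LongWritable offset, Text transaction, Context ctx)
            throws IOException, InterruptedException {
        // keep only frequent items, ordered by Flist rank
        List<Integer> ranks = new ArrayList<>();
        for (String tok : transaction.toString().split("\\s+")) {
            Integer r = rank.get(tok);
            if (r != null) ranks.add(r);
        }
        Collections.sort(ranks);

        // walk from the least frequent item backwards, emitting each
        // prefix to that item's group exactly once per transaction
        boolean[] emitted = new boolean[NUM_GROUPS];
        for (int i = ranks.size() - 1; i >= 0; i--) {
            int group = ranks.get(i) % NUM_GROUPS;
            if (emitted[group]) continue;
            emitted[group] = true;
            StringBuilder prefix = new StringBuilder();
            for (int j = 0; j <= i; j++) {
                if (j > 0) prefix.append(' ');
                prefix.append(ranks.get(j));
            }
            ctx.write(new IntWritable(group), new Text(prefix.toString()));
        }
    }
}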

Step 4, accumulate: The outputs generated in Step 3 are combined to produce the final result.
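To make the reducer side of Step 3 concrete before its outputs are combined, here is a minimal sketch of the local prefix-sharing tree each reducer could build. The paper's FIU-tree differs in its details, so this stand-in is a plain FP-tree with illustrative names:

import java.util.HashMap;
import java.util.Map;

// Minimal per-group structure a reducer could build: transactions in
// Flist order share prefixes, and each node counts how many
// transactions pass through it. Mining then walks this tree recursively.
public class LocalFpTree {

    static final class Node {
        final Map<String, Node> children = new HashMap<>();
        int count = 0;
    }

    private final Node root = new Node();

    // Insert one transaction whose items are already in Flist order.
    public void insert(String[] itemsInFlistOrder) {
        Node cur = root;
        for (String item : itemsInFlistOrder) {
            cur = cur.children.computeIfAbsent(item, k -> new Node());
            cur.count++;               // path count accumulates support
        }
    }

    public static void main(String[] args) {
        LocalFpTree tree = new LocalFpTree();
        tree.insert(new String[] {"a", "b", "c"});
        tree.insert(new String[] {"a", "b", "d"});  // shares the a-b prefix
    }
}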

4 IMPLEMENTATION

In this project, we try to show how to achieve a better performance measure by comparing the existing parallel mining algorithm with the data partitioning system, using some clustering algorithms. First we load the large dataset into HDFS [6]; once it is uploaded, it goes to the main web server where the parallel FIM [5] application is running. Based on the minimum support, the application partitions the data between two different servers and runs two MapReduce jobs. Finally, the results are sent back to the main server, which conducts another Map and Reduce job to mine further frequent itemsets. Thus, in total, we run three MapReduce jobs.
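A driver for this three-job pipeline could be structured as follows. This sketch chains the earlier example classes under hypothetical HDFS paths; it is not the project's actual driver, and the second and third jobs are only indicated:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical driver: each job's output directory feeds the next job.
public class FimDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Job 1: frequent 1-itemset counting (Steps 1 and 2)
        Job count = Job.getInstance(conf, "1-itemset-count");
        count.setJarByClass(FimDriver.class);
        count.setMapperClass(OneItemsetCount.ItemMapper.class);
        count.setReducerClass(OneItemsetCount.SupportReducer.class);
        count.setOutputKeyClass(Text.class);
        count.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(count, new Path("/data/transactions.dat"));
        FileOutputFormat.setOutputPath(count, new Path("/out/flist"));
        if (!count.waitForCompletion(true)) System.exit(1);

        // Job 2: group-dependent partitioning and local mining (Step 3)
        // Job 3: accumulating the partial results (Step 4)
        // ...configured the same way, reading /out/flist and /out/groups...
    }
}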

5 OUTCOMES

Bringing together the new parallel mining algorithm and data partitioning yields better performance compared with traditional mining algorithms such as Apriori and MLFPT [9], as showcased in the graphs below (Fig 5.1 and Fig 5.2).



Fig 5.1 Effects of minimum support

Fig 5.2 Speed-up performance

CONCLUSION AND FUTURE SCOPE

In any area we consider, a huge number of records can be generated in a fraction of a second. For processing such information, Apache Hadoop provides different frameworks such as MapReduce. With traditional parallel mining algorithms for frequent itemset mining, processing such data takes more time, and system performance and load balancing were major challenges. This experiment introduces a new parallel mining algorithm, FIUT, using the MapReduce programming paradigm; it divides the input data across multiple Hadoop nodes and performs parallel mining to generate the frequent itemsets. This data partitioning technique not only improves the performance of the system but also balances the load. In future, the approach can be validated with another emerging technology from the Apache ecosystem, Apache Spark [6]. It is a cluster computing technology [8] that is faster than MapReduce. It uses Python as a programming language, whereas MapReduce uses Java, and Python requires fewer lines of code; overall, Spark improves processing speed.

ACKNOWLEDGEMENT

I would also like to thank Mrs. Swathi, Assoc. Professor and HOD, Department of Computer Science and Engineering, CMRIT, Bangalore, who shared her opinions and experiences, through which I received the information crucial for the project.

REFERENCES

[1] Osmar R. Zaïane, Mohammad El-Hajj, Paul Lu. "Fast Parallel Association Rule Mining without Candidacy Generation." IEEE, Canada, 2001. ISBN 0-7695-1119-8.

[2] CH. Sekhar, S. Reshma Anjum. "Cloud Data Mining based on Association Rule." International Journal of Computer Science and Information Technology, Vol. 5(2), pp. 2091-2094, Andhra Pradesh, 2014. ISSN 0975-9646.

[3] Arkan A. G. Alhamodi, Songfeng Lu, Yahya E. A. Alsalhi. "An Enhanced FP-Growth based on MapReduce for Mining Association Rules." IJDKP, Vol. 6, China, 2016.


[4] Vrushali Ubarhande, Alina Madalina Popescu, Horacio González-Vélez. "Novel Data-Distribution Technique for Hadoop in Heterogeneous Cloud Environments." International Conference on Complex, Intelligent and Software Intensive Systems, Ireland, 2015. ISBN 978-1-4799-8870-9.

[5] Yaron Gonen, Ehud Gudes. "An Improved MapReduce Algorithm for Mining Closed Frequent Itemsets." International Conference on Software Science, Technology and Engineering, Israel, 2016. ISBN 978-1-5090-1018-9.

[6] Ankush Verma, Ashik Hussain Mansuri, Dr. Neelesh Jain. "Big Data Management Processing with Hadoop MapReduce and Spark Technology: A Comparison." CDAN, Rajasthan, 2016.

[7] Adetokunbo Makanju, Zahra Farzanyar, Aijun An, Nick Cercone, Zane Zhenhua Hu, Yonggang Hu. "Deep Parallelization of Parallel FP-Growth Using Parent-Child MapReduce." IEEE, Canada, 2016.

[8] Feng Zhang, Yunlong Ma, Min Liu. "A Distributed Frequent Itemset Mining Algorithm Using Spark for Big Data Analytics." Springer, New York, 2015.

[9] Bhagyashri Waghamare, Bharat Tidke. "Review: Association Rule for Distributed Data." ISCSCN, India. ISSN 2249-5789.

[10] Hamoud Alshammari, Jeongkyu Lee, Hassan Bajwa. "H2Hadoop: Improving Hadoop Performance using the Metadata of Related Jobs." IEEE, 2015 (manuscript TCC-2015-11-0399).
