IDL - International Digital Library Of Technology & Research Volume 1, Issue 5, May 2017
Available at: www.dbpublications.org
International e-Journal For Technology And Research-2017
Web Oriented FIM for Large Scale Datasets using Hadoop

Mrs. Supriya C
PG Scholar, Department of Computer Science and Engineering
C.M.R.I.T, Bangalore, Karnataka, India
supriyakuppur@gmail.com
Abstract: In large-scale datasets, existing parallel mining algorithms mine frequent itemsets by distributing the enormous data across a collection of computers to balance the load, but these algorithms suffer from performance problems [1]. To handle this problem, we introduce a data partitioning approach based on the MapReduce programming model. In the proposed system we also introduce a new structure, the frequent itemset ultrametric tree (FIU-tree), in place of conventional FP-trees. Experimental results show that eliminating redundant transactions improves performance by reducing computing loads.
Keywords: Frequent itemset, MapReduce, data partitioning, parallel computing, load balance

1 INTRODUCTION

Big data is an emerging technology in the modern world. It refers to volumes of data so large that they are hard to process using traditional data processing techniques or software. The major challenges in big data are safekeeping, distribution, searching, visualization, querying, and updating such data. Data analysis is another big concern when dealing with big data, which is produced by many different kinds of applications such as social media and online auctions. Data is divided into three major types: structured, unstructured, and semi-structured. Big data is also characterized by three V's, Volume, Velocity, and Variety, which give a clear notion of what big data is. Data is now growing very fast; for example, many hospitals store trillions of data points of ECG data, and Twitter alone collects around 170 million items of temporal data and serves as many as 200 million queries per day. The most important limitations of existing systems lie in handling larger datasets: conventional databases can handle only structured data, and fall short on variety of data, fault tolerance, and scalability. That is why big data plays such an important role these days.
Compared with traditional systems, modern distributed systems try to achieve high efficiency and scalability when the distributed data is processed on large-scale clusters. Many algorithms have been designed to process FIM on Hadoop, aiming to balance the load by distributing it equally [4] among the nodes. When such data is divided into different parts, the connections between those parts must be maintained; otherwise it leads to poor data locality and, in parallel, increases data-shuffling costs and network overhead. To improve data locality, we introduce a parallel FIM technique in which the bulk of the data is distributed across Hadoop clusters. In this paper, FIM is implemented on Hadoop clusters [10] using the MapReduce framework. The project aims to boost the performance of parallel FIM on Hadoop clusters, which is achieved with the help of Map and Reduce jobs.

2 OBJECTIVES

The main goal of the project is to eliminate redundant transactions on the Hadoop nodes to improve performance, which is achieved by reducing the computing and networking load. Particular attention is given to grouping highly significant transactions into one data partition. In the area of big data processing, the MapReduce framework has been used to develop parallel data mining algorithms, including FIM, some FP-growth [3] based and some ARM based. For bulky datasets, a single machine is not able to handle everything, so the data needs to be distributed and processed in parallel among a cluster of nodes, which is a foremost challenge. To handle this scenario we need to design a distributed storage system. In big data this is provided by a system called Hadoop, which stores and processes big data. It includes two important techniques: HDFS (for storing big data) and the MapReduce framework (for processing big data). Big data processing deals with three different phases: data ingestion, data storage, and data analysis.

If data is distributed, it is tough to find the locality of files within bigger datasets. A better solution to this problem is to follow a Master-Slave architecture, in which a single machine acts as the "Master" and the remaining machines are treated as "Slaves". The Master knows the location of every file stored on the different Slave machines, so whenever a client sends a request, the Master machine processes it by locating the requested file on one of the underlying Slave machines. Hadoop follows the same architecture.
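As a concrete illustration of this Master-Slave interaction, the sketch below reads a file through Hadoop's public FileSystem API: the client names only the file, the NameNode (the Master) resolves which DataNodes (the Slaves) hold its blocks, and the stream then pulls the data from them. This is a minimal sketch, not code from the original system; the NameNode address and file path are placeholders.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsReadExample {
        public static void main(String[] args) throws Exception {
            // fs.defaultFS points at the NameNode (the "Master").
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://master:9000"); // placeholder address

            FileSystem fs = FileSystem.get(conf);

            // The client only names the file; the NameNode resolves which
            // DataNodes ("Slaves") hold its blocks, and the stream reads
            // the data directly from them.
            Path file = new Path("/data/transactions.txt"); // placeholder path
            try (BufferedReader reader =
                     new BufferedReader(new InputStreamReader(fs.open(file)))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    System.out.println(line);
                }
            }
            fs.close();
        }
    }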
3 METHODOLOGY

Traditional mining algorithms [2] are not enough to handle large datasets, so we have introduced a new data partitioning technique. Parallel computing [7] is one more method introduced here, used to detect the redundant transactions in parallel, so that we can achieve better performance than the traditional mining algorithms.
In the proposed system, we compare the old parallel mining algorithms and the new mining algorithm on Hadoop to show how much processing time each system requires. Hadoop gives us the modules needed to achieve this, and the whole system is depicted briefly in Fig 3.1.

[Fig 3.1 System Architecture: High Level View]

Step 1, Scan the transaction DB: First we scan the transaction database to retrieve the frequent itemsets, called the frequent 1-itemsets. Each itemset is represented as a key and value pair.

Step 2, Organize the frequent 1-itemsets into an F-list: The frequent 1-itemsets are sorted in decreasing order of frequency; the sorted list is called the F-list.
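Steps 1 and 2 together amount to a parallel frequency count followed by a sort. The sketch below is a minimal illustration of the counting job using the standard Hadoop MapReduce API, assuming transactions are stored one per line with space-separated items; the class names and the minimum-support constant are illustrative, not taken from the paper.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Mapper: emits <item, 1> for every item in a transaction line.
    public class ItemCountMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text item = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                item.set(itr.nextToken());
                context.write(item, ONE);
            }
        }
    }

    // Reducer: sums the counts; items below minimum support are dropped,
    // and the survivors are the frequent 1-itemsets of Step 1.
    class ItemCountReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private static final int MIN_SUPPORT = 3; // illustrative threshold

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values,
                              Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            if (sum >= MIN_SUPPORT) {
                context.write(key, new IntWritable(sum));
            }
        }
    }

Sorting the surviving <item, count> pairs by descending count then yields the F-list of Step 2.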
Step 3, Build the FIU-tree: This step runs two Map and Reduce phases.

• Mapper: The mappers process the F-list obtained in Step 2 and produce output as a set of <key, value> pairs.

• Reducer: Each reducer instance is assigned one or more group-dependent sub-datasets, which it processes one by one. For each sub-dataset, the reducer instance builds a local FP-tree, and during the recursive mining process it outputs the discovered patterns (a sketch of this tree-building step follows Step 4 below).

Step 4, Accumulate: The outcomes generated in Step 3 are combined to produce the final result.
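The paper does not list the reducer's tree-building code, so the following is a small hypothetical sketch of that step: a node type holding an item, a count, and a child map, into which F-list-ordered transactions are inserted so that shared prefixes share nodes.

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Hypothetical local tree node, as built by each reducer instance.
    public class FPNode {
        final String item;
        int count;
        final Map<String, FPNode> children = new HashMap<>();

        FPNode(String item) {
            this.item = item;
        }

        // Inserts one transaction whose items are already sorted
        // in F-list (descending frequency) order.
        void insert(List<String> transaction) {
            FPNode node = this;
            for (String it : transaction) {
                node = node.children.computeIfAbsent(it, FPNode::new);
                node.count++;
            }
        }

        public static void main(String[] args) {
            FPNode root = new FPNode(null);
            root.insert(List.of("f", "c", "a", "m"));
            root.insert(List.of("f", "c", "a", "b"));
            root.insert(List.of("f", "b"));
            // The shared prefix (f, c, a) now shares nodes with count > 1;
            // recursively mining this tree emits the frequent patterns.
        }
    }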
4 IMPLEMENTATION

In this project we show how to achieve a better performance measure by comparing the existing parallel mining algorithm with the data partitioning system, using clustering algorithms. First, the large dataset is loaded into HDFS [6] and then uploaded to the main web server, where the parallel FIM [5] application is running. Based on the minimum support, the application partitions the data between two different servers and runs two MapReduce jobs. Finally, the results are sent back to the main server, which conducts another Map and Reduce job to mine further frequent itemsets. Thus, three Map and Reduce jobs are run in total.
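The chained execution can be pictured as a driver that runs the jobs in sequence, each stage reading the previous stage's output directory. The sketch below wires up the first stage with the standard Job API, reusing the illustrative mapper and reducer from Section 3; all paths, names, and the threshold are placeholders, not values from the paper.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class FimDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.setInt("fim.min.support", 3); // illustrative threshold

            // Job 1: count items and keep the frequent 1-itemsets.
            Job count = Job.getInstance(conf, "frequent-1-itemsets");
            count.setJarByClass(FimDriver.class);
            count.setMapperClass(ItemCountMapper.class);
            count.setReducerClass(ItemCountReducer.class);
            count.setOutputKeyClass(Text.class);
            count.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(count, new Path("/data/transactions"));
            FileOutputFormat.setOutputPath(count, new Path("/out/flist"));
            if (!count.waitForCompletion(true)) System.exit(1);

            // Jobs 2 and 3 would be configured the same way, each reading
            // the previous stage's output: /out/flist -> /out/partitions
            // -> /out/frequent-itemsets (partition building, then the
            // final mining and accumulation pass).
        }
    }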
5 OUTCOMES

Bringing together the new parallel mining algorithm and data partitioning yields better performance than traditional mining algorithms such as Apriori and MLFPT [9], as showcased in the graphs below.

[Fig 5.1 Effects of minimum support]

[Fig 5.2 Speed-up performance]

CONCLUSION AND FUTURE SCOPE

In whichever area we consider, a huge volume of records is generated in a fraction of a second, and Apache Hadoop provides frameworks such as MapReduce for processing such data. With traditional parallel algorithms for frequent itemset mining, processing such data takes more time, and system performance and load balancing were major challenges. This experiment introduces a new parallel mining algorithm, FIUT, using the MapReduce programming paradigm; it divides the input data across multiple Hadoop nodes and mines the partitions in parallel to generate the frequent itemsets. This data partitioning technique not only improves the performance of the system but also balances the load. In future the approach can be validated on another emerging Apache technology, Apache Spark [6]. Spark is a cluster computing technology [8] that is faster than MapReduce. Spark programs are commonly written in Python, whereas MapReduce programs are written in Java; Python requires far less code to write, and Spark's in-memory execution improves processing speed.
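Although Spark is commonly programmed in Python, it also exposes Scala and Java APIs; for consistency with the earlier sketches, the following minimal Java version of the item-counting stage shows how the whole count becomes one in-memory pipeline rather than a disk-bound MapReduce job. The paths and threshold are placeholders.

    import java.util.Arrays;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;

    public class SparkItemCount {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("spark-item-count");
            try (JavaSparkContext sc = new JavaSparkContext(conf)) {
                int minSupport = 3; // placeholder threshold
                JavaPairRDD<String, Integer> frequent = sc
                    .textFile("hdfs:///data/transactions") // placeholder path
                    .flatMap(line -> Arrays.asList(line.split(" ")).iterator())
                    .mapToPair(item -> new Tuple2<>(item, 1))
                    .reduceByKey(Integer::sum)
                    .filter(pair -> pair._2() >= minSupport);
                frequent.saveAsTextFile("hdfs:///out/frequent-items");
            }
        }
    }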
ACKNOWLEDGEMENT

I would like to thank Mrs. Swathi, Associate Professor and HOD, Department of Computer Science and Engineering, CMRIT, Bangalore, who shared her opinions and experiences, through which I received the information crucial for this project.

REFERENCES
[1] Osmar R. Zaïane, Mohammad El-Hajj, Paul Lu. Fast Parallel ARM without Candidacy Generation. Canada: IEEE, 2001. 0-7695-1119-8.
[2] CH. Sekhar, S. Reshma Anjum. Cloud Data Mining based on Association Rule. Andhra Pradesh: International Journal of Computer Science and Information Technology, 2014, Vol. 5(2), pp. 2091-2094. 0975-9646.
[3] Arkan A. G. Alhamodi, Songfeng Lu, Yahya E. A. Alsalhi. An Enhanced FP-Growth based on MapReduce for Mining Association Rules. China: IJDKP, 2016, Vol. 6.
[4] Vrushali Ubarhande, Alina Madalina Popescu, Horacio González-Vélez. Novel Data-Distribution Technique for Hadoop in Heterogeneous Cloud Environments. Ireland: International Conference on Complex, Intelligent and Software Intensive Systems, 2015, Vol. 15. 978-1-4799-8870-9.
[5] Yaron Gonen, Ehud Gudes. An Improved MapReduce Algorithm for Mining Closed Frequent Itemsets. Israel: International Conference on Software Science, Technology and Engineering, 2016. 978-1-5090-1018-9.
[6] Ankush Verma, Ashik Hussain Mansuri, Dr. Neelesh Jain. Big Data Management Processing with Hadoop MapReduce and Spark Technology: A Comparison. 16, Rajasthan: CDAN, 2016.
[7] Adetokunbo Makanju, Zahra Farzanyar, Aijun An, Nick Cercone, Zane Zhenhua Hu, Yonggang Hu. Deep Parallelization of Parallel FP-Growth Using Parent-Child MapReduce. Canada: IEEE, 2016.
[8] Feng Zhang, Yunlong Ma, Min Liu. A Distributed Frequent Itemset Mining Algorithm Using Spark for Big Data Analytics. New York: Springer, 2015.
[9] Bhagyashri Waghamare, Bharat Tidke. Review: Association Rule for Distributed Data. India: IJCSCN. 2249-5789.
[10] Hamoud Alshammari, Jeongkyu Lee, Hassan Bajwa. H2Hadoop: Improving Hadoop Performance Using the Metadata of Related Jobs. TCC-2015-11-0399, s.l.: IEEE, 2015.