Integrated Intelligent Research (IIR)
International Journal of Data Mining Techniques and Applications Volume: 07, Issue: 01, June 2018, Page No.100-110 SSN: 2278-2419
Application of Utility Mining using Frequent Itemset and Association Rules: A Survey T.Indhumathy1, T.Velmurugan2 1 Research Scholar, 2Associate Professor PG and Research Department of Computer Science, D. G. Vaishnav College, Arumbakkam, Chennai, India E-Mail: tindhu_m3@yahoo.com, velmurugan_dgvc@yahoo.co.in Abstract: Mining on data reveals patterns that provide useful information for analysis, decision making and forecasting in various domains. Association Rule Mining (ARM) identifies patterns on itemsets which are either frequent or have interesting relationship amongst them based on strong rules and conceptually form a basis for Frequent Itemset mining (FIM) problems. FIM extracts binary values from transaction databases to identify frequently bought items but provides insufficient information for identifying infrequent items that generate maximum profit. So a latter problem, High utility itemsets (HUI) mining was developed to focus on the itemsets that generate huge profit to the business. Even though HUI is related to Business Intelligence, its application extends to Web Server Logs, Biological Gene Databases, Network Traffic Measurements and many other fields. This paper presents a survey on the algorithms from different aspects and perspectives based on Utility mining, Frequent Itemset generation and Association Rule Mining. Keywords: Association Rule mining, Frequent Itemset Mining, Maximum Profit, High Utility Mining I. INTRODUCTION Originating from machine learning and artificial intelligence, Data mining is a powerful technology that extracts hidden effective information automatically from large data repositories, named as knowledge discovery in databases (KDD). Data Mining is a conglomeration of different disciplines such as statistics, database systems, information science, and business intelligence to name a few. Depending upon the type of data, various techniques and mathematical algorithms such as fuzzy set theory, neural networks, inductive logical programming, signal processing, image analysis, bioinformatics, web technology, computer graphics, business, economics and psychology may be applied. Upon the application of a specified techniques or retrospective tools on large volumes of data, otherwise called information processing (i.e., collection, extraction, warehousing, analysis), hidden meaningful patterns are revealed. These patterns allow businesses to make proactive knowledge driven decisions that were traditionally too time consuming to resolve. Data mining tasks are categorized as predictive which focus and infer on the current data and descriptive which work on the general properties of data.Utility mining, FIM and ARM are predictive, revealing useful interesting patterns and relationship among the data related not only to Business Intelligence but also to Medical Science, Security, Telecommunications, Census Analysis and Text Analysis. Different kinds of frequent patterns such as Itemsets, Subsequences or Substructures are extracted mainly from transactional or relational databases. To uncover interesting associations, correlations and recurring relationship amongst them, the output representation such as classification rules, association rules, expressive rules or any other strategy is adopted and different techniques to generate these patterns are developed. The problem of mining association rules was implemented as AIS algorithm by Rakesh Agrawal, Tomasz Imieliński and Arun Swami in 1993 [1]. The researchers use basket data (Market Basket Analysis also known as Affinity Analysis) referring to point-of-sale transaction and analyses customer buying habits by finding association between the different items. It eventually helps the retailers develop marketing strategies by gaining insight into frequently bought items. With a measure of support and confidence, the algorithm decomposes the problem into two major subtasks. 1) To find all the frequent itemsets that satisfy the minimum support threshold is the first task. 2) The second task is Rule Generation, where all the high-confidence rules known as strong rules from the frequent itemsets are extracted.
100
Integrated Intelligent Research (IIR)
International Journal of Data Mining Techniques and Applications Volume: 07, Issue: 01, June 2018, Page No.100-110 SSN: 2278-2419
An enhancement of AIS became Apriori, the basic algorithm which was proposed by R.Agrawal and R. Shrikant in 1994 [2] that uses the downward closure property. Numerous scanning of the database, generating huge number of candidates and monotonous calculation of support counting for candidates are some of the limitations of the Apriori and to overcome these, numerous other algorithms were proposed for mining association rules to generate frequent itemsets. But the frequency of an itemset alone is insufficient for business investigation because it only detects the number of transactions in the database that contain the itemset. Thereafter the theoretical analysis done by Hong Yao et al.,[3] assume that the utilities of itemsets may differ, describe the concept of utility mining and define transaction utility and external utility as two types of utilities for items. Further as the support measure reflects only the statistical correlation of items, Yao and Hamilton [4] put forth utility mining approach algorithms which also reflects the semantic significance of the items allowing the user to express the significance, interests or preferences of itemsets as significance values. This paper is structured as follows. Section 2 explores about the use of High Utility Mining. The effective use of Frequent Itemset mining research topics are given in section 3. Section 4 discusses the applications of Association Rule Mining in detail. Finally, section 5 concludes the research work. II.
APPLICATIONS OF HIGH UTILITY MINING
In Frequent itemset mining frequently purchased items indicated as binary values for the measure of support are scoured from the database providing insufficient information to take decisions. Constraints to itemsets and assigning different weights to items and have been proposed by researchers to subdue the limitations of Frequent Itemset mining. The concept of share measure was proposed that had an impact on the quantity sold which was based on the cost or profit of an itemset to overcome the shortcomings of support. But these proposals addressed the issues only to a certain extent. So utility mining approach focusing on both the objective and the semantic values of items was put forward to further research. To extract the semantic significance from the utility constraints (inconvertible constraints) which is not been handled by the previous theories and techniques, UMining and UMining_H were proposed by Hong Yao and Howard J. Hamilton [4]. UMining finds all itemsets with utility values higher than minimum utility. Based on a heuristic pruning strategy UMining_H typically finds most itemsets with utility values higher than minimum utility. Experiments reveal the algorithms to be effective in extracting itemsets of high utility. Simplified computation in calculating the utility and a reduction in the candidate set is presented by Ying Liu et al., implemented as Two-Phase algorithm[5]. In Phase I, transaction-level utility mining is implemented where the itemsets of high utilities are added to the candidate sets and to filter out the overestimated item sets only one database scan is performed in Phase II. Experiments reveal that all the itemsets which have high utilities are identified, scanning only a few times, utilizing less space with lower cost. Generating huge number of potentially high utility itemsets degrades the performance of the algorithms especially for long transactions. To overcome this issue Vincent S. Tseng et al.,[6] came up with UP-Growth and UP-Growth+. To generate high utility itemsets these algorithms prune the candidate itemsets with several effective strategies. Through the implementation of the data structure (UP-tree), two scans of database are enough to maintain and generate HUI information. The claims have been proved by the experiments conducted on real and synthetic data and outperform other algorithms substantially in terms of runtime, especially for databases with lots of long transactions. Unil Yun et al., propose an algorithm named MU-Growth containing two strategies[7] and a tree structure, named MIQ-Tree, which captures database information with a single-pass. Performance evaluation shows that MU-Growth decreases the number of candidates, outperforming the other current tree-based algorithms with overestimated methods in terms of runtime with a similar memory usage. C. F. Ahmed et al., implements an interactive and incremental tree structures to overcome the disadvantages of the previous HUP algorithms in terms of repeatedly changing minimum threshold [8]. Incremental HUP Lexicographic Tree (IHUPL-Tree) arranges the items according to lexicographic order without any restructuring. IHUP Transaction Frequency Tree (IHUP TF-Tree), the second tree structure arranges the items according to the transaction frequency which ultimately obtains a compact size. IHUP-Transaction-Weighted Utilization Tree
101
Integrated Intelligent Research (IIR)
International Journal of Data Mining Techniques and Applications Volume: 07, Issue: 01, June 2018, Page No.100-110 SSN: 2278-2419
(IHUPTWU-Tree) is the third tree that reduces the mining time based on the TWU value of items in descending order. Experiments on different datasets show that the tree structures are very efficient and scalable for incremental and interactive HUP mining. Sudip Bhattacharya et al., present a literature review on the algorithms and its current state of research in high utility itemset mining [9]. Discussions from different perspectives on HUM with appropriate references were presented. MortezaZihayat and Aijun An propose T-HUDS to find potential top-n number of high utility itemsets over sliding windows of a data streamusing a compressed tree structure [10] called HUDS-tree. It prunes the search space effectively by using the new utility estimation model. Several strategies were proposed for initializing and dynamically adjusting the minimum utility threshold ultimately extracting all the top-n high utility itemsets. With respect to execution time and memory storage the T-HUDS notably outperforms two stateof-the-art high utility itemset algorithms. Two efficient sliding window-based algorithms MHUI-BIT and MHUI-TID elaborated as Mining High-Utility Itemsets based on BITvector and Mining High-Utility Itemsets based on TIDlist respectively were proposed by Hua-Fu Li et al., for mining high-utility itemsets from data streams[11]. To improve the efficiency of extracting high-utility itemsets that has positive profits, a lexicographical tree-based summary data structure named LexTree-2HTU is developed. An analysis of the results show that the proposed algorithms outperform the existing approaches for discovering high-utility itemsets from data streams over sliding windows. To mine utility itemsets that has negative item profits, adapted approaches of MHUI-BIT and MHUITID were proposed and experiments prove that they are efficient over stream transaction-sensitive sliding windows. Yu-Chiang Li et al., propose IIDS (Isolated Items Discarding Strategy) [12], for better performance by reducing candidates which is applicable to any of the existing level-wise utility mining. Applying IIDS to two efficient share mining methods ShFSM and DCG, FUM and DCG+ were implemented respectively. The performance of FUM and DCG+ is verified on real and synthetic datasets which proves to be more efficient than that of ShFSM and DCG. Thus, IIDS is an useful and practical procedure for utility mining. As against the two-phase horizontal approach in most high utility mining algorithms, a single phase algorithm which uses vertical database approach [13] is proposed by S. RenuDeepti and B. Srivani. This approach avoids exhaustive search which is expensive and time consuming and to efficiently prune the search space, two strategies based on u-list structure and item pair co-existence map are used. Comparing the proposed algorithm with the two-phase algorithms UP-Growth, UP-Growth+ and IHUP on different databases, results prove that the proposed algorithm outperform with respect to memory usage and runtime Wei Song et al.,[14] propose an efficient algorithm to compute high utility itemsets, called IHUI-Mine (Index High Utility Itemsets Mine) which is an extension of subsume index. A new property is implemented to speed up the computation of transaction-weighted utilization. The scanning of the entire database is avoided by representing the database in bitmaps where the candidate`s real utility is verified from the recorded transactions. The analysis and tests conducted on synthetic and real datasets prove that the proposed algorithm outperforms existing state-of-the-art algorithms. The study in rare itemsets is crucial for many fields to identify abnormalities and thereby prevent losses and recover from them or avoid unnecessary situations. Genetical engineering, biological, environmental, marine and medical sciences are some of the fields which consider rare itemset mining of great importance. The approach towards rare itemsts mining by L.Szathmary et al.,[15] takes up significant substructures from the mining space and divides them into two steps, i.e., frequent itemset part traversal and rare itemset listing. They propose two algorithms for step one, a naŃ—ve and an optimized one, respectively, and another algorithm for step two. They also provide some empirical evidence about the performance gains due to the optimized traversal. Least association rules are used to detect abnormalities,infrequent events,unusual patterns and exceptional cases which are practically necessary to either avoid the differences or to improve the condition in certain applications like pollution,network intruders,machine critical faulty, abnormal learning problems, and many more. There are however very few techniques which are dedicated to find least association rules. A new measurement called Critical Relative Support (CRS) was proposed by Zailani Abdullah et al., to mine critical least association rules from educational
102
Integrated Intelligent Research (IIR)
International Journal of Data Mining Techniques and Applications Volume: 07, Issue: 01, June 2018, Page No.100-110 SSN: 2278-2419
context[16]. The dataset comprises of students’ examination result and to reveal only the significant rules this approach reduces up to 98% of uninterested association rules.HUI mining algorithms provide a huge support to enhance the practical usefulness when combined appropriately with medical science. One such enhancement was proposed by A. Saranya and D. K. Hanirex where UP growth algorithm is used with Genetic Algorithm[17]. Potentially High Utility Itemsets is extracted using UP growth algorithm and for optimization and generation of itemsets the genetic algorithm was used. On comparing with the existing algorithm, the proposed approach performs better in terms of memory utilization. Using pattern growth approach to mine high utility itemsets Alva Erwin et al., propose a new algorithm called CTU-Mine [18]. Most of the utility mining is based on the candidate-generation and-test approach suitable for sparse data sets with short patterns as stated by a recent research, so CTU-Mine was proposed to mine dense data sets or long patterns. The algorithm was tested on several dense data sets and proved to work efficiently when compared with the recent algorithms. Usually Utility mining on the dataset and its processing focus on centralized systems, so to implement the concepts in distributed system, K. Subramanian, P. Kandhasamy and S. Subramanian [19] uses Fast Utility Mining (FUM) algorithm in a sophisticated manner. The main idea is to extract high utility itemsets from the clients and detect the overall utility value at the server. Experiments reveal that the execution time is reduced to a great extent as compared to the time taken in a centralized system. To extract the desired number of itemsets and to overcome the issue of setting minimum utility appropriately, an algorithm named TKU [20] was proposed by Cheng Wei Wu et al.,. A new framework called top-k high utility itemset mining is implemented where k is the desired number of high utility itemsets to be mined. Further the challenges arising from the absence of anti-monotone property and the requirement of lossless results are solved. Novel strategies to prune the search space are also implemented. Experimental done on real and synthetic datasets show that TKU has excellent performance and scalability. III.
FREQUENT ITEMSET MINING
The basic algorithms that address the issue of FIM have undergone a lot of variations. Novel ideas and data structural techniques have been implemented to overcome the disadvantages of the basic algorithms. Few papers have been discussed below from different aspects of implementations of FIM. For attaining maximum performance wrt execution time and memory usage, Christian Borgelt [21] uses preďŹ x tree representation of the needed counters to implement Apriori and uses a doubly recursive scheme to count the transactions. To represent transactions lists, sparse bit matrices are used for the implementation of Eclat and drain off closed and maximal item sets. GÓ§sta Grahne and Jianfei Zhu present a novel FP-array technique that is efficient and incorporates various optimization techniques for FP-tree-based algorithms that reduces the need to traverse FP-trees, thus improving performance [22] working especially well for sparse data sets. Experimental results show that the methods are the fastest when compared with existing algorithms. When the data sets are sparse they consume memory but fast when the minimum support is low. They consume less memory than other methods when the data sets are dense. C. Lucchese et al., presents kDCI a scalable algorithm, an enhancement of DCI (Direct Count & Intersect)[23]. To mine effectively and for adapting to the features of the specific dataset and the computing platform, multiple heuristics and efficient data structures are used for both short and long patterns from a variety of datasets. The algorithm was conducted on a wide range of experiments on synthetic and real-world datasets. The results infer that kDCI performances are not over-fitted to a special case, and its high performance is maintained on datasets with different characteristics. Lars Schmidt and Thieme take up the algorithm Eclat to analyse its features and prove that they work better than the other best algorithms especially for dense datasets [24]. Taking Eclat as the base algorithm the authors analyze its effectiveness by combining the features from algorithms like Apriori, fpgrowth, lcm and the new algorithms present at that time. Experiments reveal that the performance of Eclat stands out when compared to other algorithms in competition. Zhihong Deng and Zhonghui Wang propose PPV [25] for extracting frequency patterns rapidly. Based on a data structure called Node-lists obtained from coding prefix-tree called PPC-tree, it is a vertical algorithm and the efficiency is achieved with three techniques.
103
Integrated Intelligent Research (IIR)
International Journal of Data Mining Techniques and Applications Volume: 07, Issue: 01, June 2018, Page No.100-110 SSN: 2278-2419
Comparing the algorithm with FP-growth, Eclat and dEclat - the last two being important vertical algorithms on a number of databases, the experimental results show that PPV is efficient than FP-growth, Eclat, and dEclat. Hung Long Nguyen present an enhanced algorithm named AWFIMiner [26] reflecting the variation of weights in items rather than fixed weights. Using adaptive weights on datasets experiments reveal the efficiency and the scalability of AWFIMiner to be high. In addition AWFIMiner is suitable for mining itemsets on data streams because it requires one single database scan only. Mohamed A. Gawwad et al.,introduced a new technique to work with big datasets for FIM called POBPA, that could work in distributed environment [27]. The itemsets are mined using Buddy Prima as a basic algorithm combined with the calculation using Greatest Common Divisor. It could be implemented using map-reduce technique or Spark to parallelize the FIM without the need to generate candidate itemsets and avoids any communication overhead between the participated nodes. For single node operation it works in a parallel fashion in the hardware. It was successfully applied on different size transactions DBs and compared with two well-known algorithms: FP-Growth and Parallel Apriori with different support levels. The experiments showed that the proposed algorithm is time efficient when compared to both algorithms especially with datasets having huge number of items. Manju presents a new approach based on graph and mines frequent itemsets without repeatedly scanning the database in less time and in a straight forward way[28]. Irrespective of support level and to find the largest most frequent itemset this proposed approach performs just one scan of the database in first phase and draws a graph in which edges are labeled with the respective transactions ids. In second phase, a table is constructed which has all distinct labels and the corresponding itemsets. The largest most frequent itemset is selected from this table according to the given selection criterion. Karam Gouda and Mohammed J. Zakiz, present GenMax, to mine the exact set of maximal patterns. It is a backtrack search based algorithm [29], using progressive focusing, an innovative technique to eliminate non-maximal itemsets, perform maximality checking, and diffset propagation to perform fast frequency computation. Comparison with previous work indicates that GenMax is a highly efficient method. QinghuaZou et al., proposed a technique called SmartMiner to gather and pass tail (of a node) information seeking to determine the next node during the mining process thus improving the performance of mining MFI (Maximal frequent itemsets) [30]. The algorithm uses an augmented dynamic reordering heuristic considering the tail information. Compared with Mafia and GenMax, SmartMiner generates a much smaller search tree, requires a smaller number of support counting, and does not require superset checking and yields an order of magnitude improvement in speed. Mohammed J. Zaki andChing-Jui Hsiao present CHARM, an efficient algorithm for mining all frequent closed itemsets [31] using a dual itemset-tidset tree making hybrid searches that efficiently skips many levels. A technique called diffsets is used to reduce the amount of storage and a fast hash-based approach to remove any “non-closedâ€? sets found during computations. Experiments on real and synthetic databases show that CHARM is better than the previous methods and is scalable linearly. Nicolas Pasquier et al., proposed a new algorithm A-Close, that uses closure technique to find frequent closed itemsets [32]. Instead of using the subset lattice it is constructed by reducing the search space to the closed itemset lattice. The set of all frequent closed itemsets is sufficient to figure out a reduced set of association rules, thus limiting the number of rules produced without loss of information. Experiments showed that the approach is very valuable for dense and correlated data that represent an important part of existing databases. Guimei Liu et al., proposes an adaptive pattern growth algorithm, called AFOPT, [33] where it differs in several dimensions, overcoming the disadvantages from the existing pattern growth algorithms. It demonstrated good performance on all tested datasets described and extended the algorithm to mine closed and maximal frequent itemsets. Vijayarani.S and Sathya. P present a research paper to discover frequent item sets in data streams [34] using Éclat and RARM elaborated as Rapid Association Rule mining algorithm. The performance of RARM algorithm when compared to Eclat is better that was proved through experiments on appropriate datesets. Sanjay Patel and K. Kotecha proposes prime number mechanism [35] which is used to form the graph such that ACO can be easily applied. To solve combinatorial optimization problems the authors suggest applying ACO (Ant
104
Integrated Intelligent Research (IIR)
International Journal of Data Mining Techniques and Applications Volume: 07, Issue: 01, June 2018, Page No.100-110 SSN: 2278-2419
Colony Optimization) being one of the Computational Intelligence techniques as one of the most common metaheuristic approach. To represent the problem for frequent pattern mining in Graph form is the basic requirement of ACO, which will be applicable in future for applying ACO. The present status of frequent pattern mining is provided by Jiawei Han et al., with a brief scenario and discuss a few research directions [36]. They state that there are still some challenging research issues that need to be solved although the scope of data analysis have broadened and a deep impact on data mining methodologies and applications is achieved. IV.
RESEARCHES ON ASSOCIATION RULE MINING
ARM helps to uncover relationships hidden in large data sets. Different association rules express different regularities that underlie the dataset and they generally predict different things. And many association rules can be derived from even a tiny dataset, interest is restricted to those that generate frequent itemsets. Given below are different variations of the rules analyzed through algorithms. H. Mannila et al., study the properties of association rule discovery in relations and give an improved version [37] of the basic algorithm as proposed in [1]. The enhanced algorithm obtains the information out of the previous passes, and irrelevant candidate rules are eliminated. Experiments on a university course enrollment database indicate that the method outperforms the previous one by a factor of 5. The authors take up two views of the problem to prove that sampling is effective to find the association rules by studying the theoretical aspects of the problem that these rules hold in a relation. Ashok Savasere et al., present an efficient algorithm called Partition that finds association rules [38] with two scans of database. This is significant because the I/O and CPU overhead are greatly reduced especially for large databases. Experiments reveal that the reduction of overhead for CPU is by a factor of four and the reduction of I/O is significantly great when compared to the previous algorithms. To adapt association rules in a dynamic database Shichao Zhang et al., present a paper where weights are assigned a higher value to the candidates when they are added dynamically. This is done to maintain the frequent itemsets as they increase. In addition the model upgrades to frequent itemsets from infrequent itemsets and vice versa. The entire database need not be scanned again and again when the itemsets are added dynamically due to this novel feature presents in this paper. This novel approach proves to be effective through experiments conducted on appropriate datasets. To reduce communication and improve the running time in a distributed environment Ebrahim Ansari Chelche et al., proposes an algorithm which uses hash filtering and Trie data structure [40]. These techniques were proposed previously for centralized setting and here they are adopted in the FDM algorithm as one of the wellknown distributed association rules mining algorithm. Experimental evaluations on different sort of distributed data show the effect of using these adopted techniques.Ling Zhou and Stephen Yauproposes two simple, practical and effective schemes called Hash-based Scheme and Matrix-based Scheme to mine association rules among rare items[41]. The algorithm can also be applied to frequent items with bounded length. C. H. Cai, in his thesis [42] figures that the analysts may want to mine the rules based onthe importance of the products, items or attributes. So he generalizes this idea to the case where the items are given weights to reflectthe importance to the users. As the downward closure property of the supportmeasure in the mining of association rules no longer exists, previous algorithmscannot be applied. The author makes use of a metric, called support bounds,in the mining of binary and quantitative (fuzzy) association rules and introduce the simple samplemethod and the data maintenance method, basedon the statistical approach, to mine the rules. Experiments show that by using the weights, we can prune the items without any interest at an earlier step, andhence saving time in mining of the association rules. GuoqiQian et al., propose an algorithm [43] with a speculative search that finds the most valuable association rules from datasets. Combined with the features of Apriori, the algorithm runs on the database randomly and works even if the database is small. Taking data samples from both real and synthetic dataset they prove that the most important association rules are extracted. While mining association rules there are possibilities of choosing values which lead to seemingly correct rules but are actually false. To avoid and control these false positives Guimei Liu et al., [44] present three approaches.
105
Integrated Intelligent Research (IIR)
International Journal of Data Mining Techniques and Applications Volume: 07, Issue: 01, June 2018, Page No.100-110 SSN: 2278-2419
The testing proves that false positives are controlled efficiently by these approaches. In addition to that, rules that are seemingly true are avoided to a great extent. Sujni Paul presents a paper on [45] an Optimized Distributed Association Rule mining algorithm for geographically distributed data which is used in parallel and distributed environment so that it reduces communication costs. The response time is calculated in this environment using XML data. Sotiris Kotsiantis and Dimitris Kanellopoulos provide the foundation of basic concepts related to mining association rules and provide a survey on the existing techniques [46] of association rule mining. This paper bring about the important theoretical issues present at that time and provides the researcher with an insight for further investigations on the unexplored topics. Table 1: Comparative Analysis Results
Paper Ref. No
Author Name
[6]
Vincent S. Tseng et al.,
[7] [10] [11] [12] [13] [14] [15] [18] [19] [20]
[22] [23] [24] [23] [26] [27] [29]
Unil Yun et al., MortezaZihayat and Aijun An Hua-Fu Li et al., Yu-Chiang Li et al., S. RenuDeepti and B. Srivani, Wei Song et al., L.Szathmary et al., Alva Erwin et al., Cheng Wei Wu et al., Kannimuthu Subramanian et al., GÓ§staGrahne and Jianfei Zhu C. Lucchese et al., Lars SchmidtThieme Zhi-Hong Deng Hung Long Nguyen Mohamed A. Gawwad et al., Karam Gouda and Mohammed J.
1.648
Minimum Support Utility % 0.1
Less than 500 500
1.837 832.801 75
0.01 -
60
-
0.2
30 590
1000 -
0.2 0.02
7.144
170.79
20
IHUI-Mine Apriori-Rare MRG-Exp CTU-Mine TKU
175 11.47 15.91 1000 Less than 300
150 -
2.0 10
-
0.1 0.01
FUM-D
46.77
-
1
FPgrowth* FPMax* FPClose* kDCI Eclat SOPH
500 800 500 9 10^2.0
450 450 450 -
2 2 2 0.01 0.01
PPV AWFIMiner
700 280
14.72
3 1
POBRA
24
-
30
GenMax
10
-
0.1
Proposed Method 1) UP-Growth 2) UP-Growth+, MU-Growth T-HUDS MHUI-BIT and MHUI-TID, Isolated Items Discarding Strategy U-Vertical Algorithm
106
Execution time Increases with increasing t
Memory usage
Integrated Intelligent Research (IIR)
[30] [31] [32] [33] [34] [35] [37] [38] [39] [40]
[41] [43]
Zakiz, QinghuaZou et al., Mohammed J. Zaki andChingJui Hsiao Nicolas Pasquier et al., Guimei Liu et al., S. Vijayarani and P. Sathya Sanjay Patel and K. Kotecha H. Mannila et al., Ashok Savasere et al., Shichao Zhang et al., Ebrahim Ansari Chelche et al.,
Ling Zhou and Stephen Yau Sujni Paul
International Journal of Data Mining Techniques and Applications Volume: 07, Issue: 01, June 2018, Page No.100-110 SSN: 2278-2419 SmartMiner
6
-
1
CHARM
8
0.39
0.1
A-Close
25
-
0.5
AFOPT Max AFOPT Close RARM
5 6 23
-
0.1 0.1 -
Prime_Based_Graph method Offline Candidate Determination (OCD) Partition algorithm
140
-
2
550
-
10
12
-
1
Weighted technique
1000
-
2
Distributed Hash Filtering Technique And Distributed Transcation Trimming MBS HBS Optimized Distributed Association Mining Algorithm
30
-
0.00121
-
2
-
0.2
658 35.98 75
Frequent itemset mining is an essential task that is used in many application domains including various data warehouses and other information repositories. Initially the discovery of frequent patterns had multiple issues so modifications, variations, adaptations and improvements in the algorithms, data structure etc have had a great impact in generating these patterns effectively. To avoid multiple issues in generating frequent patterns different approaches like vertical scanning were implemented and weights were added to find the real value of the itemsets. Algorithms to work for distributed databases and data streams were successfully implemented by some authors. Maximal frequent itemsets, closed frequent itemsets and generation of rare itemset were proposed to find the maximum possible frequent itemsets, to overcome the issues of memory wastages and to analyze the importance of infrequent items. The formulation of association rules from different perspectives taken for study by the authors proved to be of importance for the future development of FIM related not only to the respective application but also liable to other fields as well. Efficient way of finding association rules by applying various techniques like hashing, matrix, sampling, graphs, fuzzy logic etc were implemented. These helped to reduce the I/O and CPU overheads and the performance of the algorithms were improved a great deal in comparison to the older methods. Algorithms to obtain the utility of itemsets were upgraded using variations in tree structures which simplified the computation and reduced the number of scans. Ample proof has been given to support the claims of the methods. Strategies for pruning candidates were proven to be effective and outperformed the algorithms taken for comparison. Top-k high utility itemsets in datastreams were obtained and algorithms implemented to work over sliding windows performed effectively in terms of runtime and memory usage. Algorithms to work in a
107
Integrated Intelligent Research (IIR)
International Journal of Data Mining Techniques and Applications Volume: 07, Issue: 01, June 2018, Page No.100-110 SSN: 2278-2419
distributed environment, scanning for rare itemsets to be used in the field of genetics and medical science were also implemented successfully. V. CONCLUSION The concept of High utility mining developed from FIM that includes Market Basket Analysis, rare, closed, minimal and maximal mining. The algorithms Apriori, Eclat, DCI were enhanced with the modification in the data structures to be utilized in several applications. These modifications helped to achieve a satisfactory level of improvement in terms of memory space, execution time and scalability. Algorithms on High utility mining imbibe various qualities which were analyzed and experiments were conducted on synthetic and real datasets by the researchers. Generally algorithmic refinements were done to overcome the difficulties of anti-monotone properties and the application of different techniques were adapted to work for specific purposes thereby contributing to HUI. Among the algorithms utilized for the purpose of HUI, the algorithm MU-Growth is better in terms of memory usage and execution time while FPMax* and FPClose* works well when it comes to maximal and Closed itemset mining. References [1] R. Agrawal, T. Imielinski, and A. Swami, “Mining association rules between sets of items in large databases”, Proc.of ACM SIGMOD Int. Conf. Manag. data, vol. 22,pp. 207–216, 1993. [2] R. Agrawal and R. Srikant, “Fast Algorithms for Mining Association Rules”, Organization, vol. 1215, pp. 487–499, 1994. [3] H. Yao, H. J. Hamilton, and C. J. Butz, “A Foundational Approach to Mining Itemset Utilities from Databases.”,Proc. 2004 Soc. Ind. Appl. Math. Int. Conf. Data Min., pp. 482–486, 2004. [4] H. Yao and H. J. Hamilton, “Mining itemset utilities from transaction databases”,Data Knowl. Eng., vol. 59, no. 3, pp. 603–626, 2006. [5] Y. Liu, J. Li, W.-K. Liao, A. Choudhary, and Y. Shi, “High Utility Itemsets Mining”,Int. J. Inf. Technol. Decis. Mak., vol. 9, no. 6, pp. 905–934, 2010. [6] V. S. Tseng, B. E. Shie, C. W. Wu, and P. S. Yu, “Efficient algorithms for mining high utility itemsets from transactional databases”,IEEE Trans. Knowl. Data Eng., vol. 25, no. 8, pp. 1772–1786, 2013. [7] U. Yun, H. Ryang, and K. H. Ryu, “High utility itemset mining with techniques for reducing overestimated utilities and pruning candidates”,Expert Syst. Appl., vol. 41, no. 8, pp. 3861–3878, 2014. [8] C. F. Ahmed, S. K. Tanbeer, B. S. Jeong, and Y. K. Lee, “Efficient tree structures for high utility pattern mining in incremental databases”,IEEE Trans. Knowl. Data Eng., vol. 21, no. 12, pp. 1708–1721, 2009. [9] S. Bhattacharya and D. Dubey, “High Utility Itemset Mining”,Ijetae.Com, vol. 2, no. 8, 2012. [10] M. Zihayat and A. An, “Mining top- k high utility patterns over data streams”, 2013. [11] H. F. Li, H. Y. Huang, and S. Y. Lee, “Fast and memory efficient mining of high-utility itemsets from data streams: With and without negative item profits”,Knowl. Inf. Syst., vol. 28, no. 3, pp. 495–522, 2011. [12] Y. C. Li, J. S. Yeh, and C. C. Chang, “Isolated items discarding strategy for discovering high utility itemsets”,Data Knowl. Eng., vol. 64, no. 1, pp. 198–217, 2008. [13] S. R. Deepti and B. Srivani, “Efficient Algorithm for Mining High Utility Itemsets from Large Datasets Using Vertical Approach”,IOSR J. Comput. Eng., vol. 18, no. 4, pp. 68–74, 2016. [14] W. Song, Z. Zhang, and J. Li, “A high utility itemset mining algorithm based on subsume index”,Knowl. Inf. Syst., vol. 49, no. 1, pp. 315–340, 2016. [15] L. Szathmary et al., “Towards Rare Itemset Mining”,19th IEEEInternational Conference on Tools with Artificial Intelligence., pp.305-312,2007. [16] Z. Abdullah, T. Herawan, N. Ahmad, and M. M. Deris, “Mining significant association rules from educational data using critical relative support approach”,Procedia - Soc. Behav. Sci., vol. 28, pp. 97–101, 2011. [17] A. Saranya and D. K. Hanirex, “Extraction of High Utility Itemsets using Utility Pattern with Genetic Algorithm from OLTP System”, International Journal on Recent and Innovation Trends in Computing and Communication., vol. 3, Issue. 3,pp. 1326–1331, 2015. [18] A. Erwin, R. P. Gopalan, and N. R. Achuthan, “CTU-Mine: An Efficient High Utility Itemset Mining Algorithm Using the Pattern Growth Approach”,7th IEEE Int. Conf. Comput. Inf. Technol. (CIT 2007), pp. 71–76, 2007.
108
Integrated Intelligent Research (IIR)
International Journal of Data Mining Techniques and Applications Volume: 07, Issue: 01, June 2018, Page No.100-110 SSN: 2278-2419
[19] K. Subramanian, P. Kandhasamy, and S. Subramanian, “a Novel Approach To Extract High Utility Itemsets From Distributed Databases”,Comput. Informatics, vol. 31, pp. 1597–1615, 2012. [20] C. W. Wu, B.-E. Shie, V. S. Tseng, and P. S. Yu, “Mining top-K high utility itemsets”,Proc. 18th ACM SIGKDD Int. Conf. Knowl. Discov. data Min. - KDD ’12, p. 78, 2012. [21] C. Borgelt, “Efficient Implementations of Apriori and Eclat”,FIMT 03 Proc. IEEE ICDM Work. Freq. itemset Min. implementations, 2003. [22] G. Grahne and J. Zhu, “Fast algorithms for frequent itemset mining using FP-trees”,IEEE Trans. Knowl. Data Eng., vol. 17, no. 10, pp. 1347–1362, 2005. [23] C. Lucchese, R. Perego, and F. Silvestri, “kDCI : a Multi-Strategy Algorithm for Mining Frequent Sets.”, Fimi, 2003. [24] L. Schmidt-Thieme, “Algorithmic Features of Eclat.”, in FIMI ’04, Proceedings of the IEEE ICDM Workshop on Frequent Itemset Mining Implementations, 2004. [25] Z. Deng and Z. Wang, “A new fast vertical method for mining frequent patterns”,Int. J. Comput. Intell. Syst., vol. 3, no. 6, pp. 733–744, 2010. [26] H. L. Nguyen, “An Efficient Algorithm for Mining Weighted Frequent Itemsets Using Adaptive Weights”,Int. J. Intell. Syst. Appl., vol. 7, no. 11, pp. 41–48, 2015. [27] M. A. Gawwad, M. F. Ahmed, and M. B. Fayek, “Frequent Itemset Mining for Big Data Using Greatest Common Divisor Technique”,Data Sci. J., vol. 16, no. 0, pp. 1–10, 2017. [28] Manju, “Graph Based Approach for Finding Frequent Itemsets to Discover Association Rules”,IJIET.,vol. 3, no. 2, pp. 269–274, 2013. [29] K. Gouda and M. J. Zaki, “Efficiently mining maximal frequent itemsets”,Proc. 2001 IEEE Int. Conf. Data Min., vol. 3, 2001. [30] Q. Zou, W. W. Chu, and B. Lu, “SmartMiner: A Depth First Algorithm Guided by Tail Information for Mining Maximal Frequent Itemsets.”,Proc. 2002 IEEE Int. Conf. Data Min., pp. 570–578, 2002. [31] M. J. Zaki and C.-J. Hsiao, “CHARM : An Efficient Algorithm for Closed Itemset Mining”, Data Min. Knowl. Discov., vol. 15, pp. 457–473, 2001. [32] N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal, “Discovering Frequent Closed Itemsets for Association Rules”,Icdt, pp. 398–416, 1999. [33] G. Liu, H. Lu, J. Yu, W. Wang, and X. Xiao, “AFOPT: An Efficient Implementation of Pattern Growth Approach.”,Fimi, 2003. [34] S. Vijayarani and M. P. Sathya, “An Efficient Algorithm for Mining Frequent Items in Data Streams”, International Journal of Innovative Research in Computer and Communication Engineering., vol. 1, no. 3, pp. 742–747, 2013. [35] S. Patel and K. Kotecha, “A Prime Number based framework for Frequent Pattern Mining”, IJCSC.,vol.5, pp. 50-60, 2014. [36] J. Han, H. Cheng, D. Xin, and X. Yan, “Frequent pattern mining: Current status and future directions”,Data Min. Knowl. Discov., vol. 15, no. 1, pp. 55–86, 2007. [37] H. Mannila, H. Toivonen, and A. I. Verkamo, “Efficient Algorithms for Discovering Association Rules”,AAAI Work. Knowl. Discov. Databases KDD94, vol. 118, no. 1–4, pp. 181–192, 1994. [38] A. Savasere, E. Omiecinski, and S. Navathe, “An efficient algorithm for mining association rules in large databases”,Proc 1995 Int Conf Very Large Data Bases VLDB95, vol. 5, no. 1, pp. 432–444, 1995. [39] S. Zhang, C. Zhang, and X. Yan, “Post-mining: Maintenance of association rules by weighting”,Inf. Syst., vol. 28, no. 7, pp. 691–707, 2003. [40] E. A. Chelche, M. H. Sadreddini, and G. H. Dastghaybifard, “Using Candidate Hashing and Transaction Trimming in Distributed Frequent Itemset Mining”,World Applied Sciences Journal., vol. 9, no. 12, pp. 1353–1358, 2010. [41] L. Zhou and S. Yau, “Efficient association rule mining among both frequent and infrequent items”,Comput. Math. with Appl., vol. 54, no. 6, pp. 737–749, 2007. [42] C. H. Cai, A. W. C. Fu, C. H. Cheng, and W. W. Kwong, “Mining Association Rules with Weighted Items”,Int. Database Eng. Appl. Symp., pp. 68–77, 1998. [43] G. Qian, C. R. Rao, X. Sun, and Y. Wu, “Boosting association rule mining in large datasets via Gibbs sampling”,Proc. Natl. Acad. Sci., vol. 113, no. 18, pp.4958–4963, 2016.
109
Integrated Intelligent Research (IIR)
International Journal of Data Mining Techniques and Applications Volume: 07, Issue: 01, June 2018, Page No.100-110 SSN: 2278-2419
[44] L. Liu, G.; Zhang, H.; Wong, “Controlling False Positives in Association Rule Mining”, Proc. VLDB Endow., pp. 145–156, 2011. [45] S. Paul, “An Optimized Distributed Association Rule Mining Algorithm in Parallel and Distributed Data Mining with XML Data for Improved Response Time”,Int. J. Comput. Sci. Inf. Technol., vol. 2, no. 2, pp. 90–103, 2010. [46] S. Kotsiantis and D. Kanellopoulos, “Association Rules Mining: A Recent Overview”,GESTS International Transactions on Computer Science and Engineering., Vol. 32, No.1, pp. 71–82, 2006.
110