IJIRST –International Journal for Innovative Research in Science & Technology| Volume 1 | Issue 7 | December 2014 ISSN (online): 2349-6010
Mining Sequences – Approaches and Analysis
Manika Verma, Assistant Professor, Department of Computer Science
Dr. Devarshi Mehta, Associate Professor
Kadi Sarva Vishwavidyalaya, Gandhinagar, Gujarat, India
GLS Institute of Computer Technology, Ahmedabad, India
Dr. Vishal Dahiya, Associate Professor, Indus University, Ahmedabad, India
Krupa Mehta, Assistant Professor, GLS Institute of Computer Technology, Ahmedabad, India
Abstract
Sequential pattern mining discovers, from a database of sequences, the sequential patterns whose support meets a user-specified minimum, where the support of a pattern is the number of sequences that contain it. Each sequence in the database is a list of transactions ordered by transaction time, and each transaction is a set of items. Closed sequential pattern mining has the same capability as sequential pattern mining but generates and stores fewer redundant patterns, which makes it more economical. This paper presents the approaches and key features of the algorithms ClaSP, CM-ClaSP, CloSpan and BIDE, which mine closed sequential patterns, and of the algorithms GSP, SPADE, PrefixSpan, SPAM and LAPIN, which mine sequential patterns. It shows that the number of sequences generated by closed sequential pattern mining is much smaller than the number generated by sequential pattern mining. The algorithms are compared on the total time required to find frequent sequences, the number of frequent sequences generated and the maximum memory required.
Keywords: Sequential Pattern Mining, Closed Sequential Pattern Mining.
I. INTRODUCTION
Sequential pattern mining is applied in areas such as market and customer behavior analysis, web log analysis, pattern discovery in protein sequences, detection of tandem repeats in DNA sequences, and mining XML query access patterns for caching [3][2]. Several mining methods have been studied, including general sequential pattern mining, closed sequential pattern mining and constraint-based sequential pattern mining [8][9][10][13]. In recent years many studies have argued that, when identifying frequent patterns, mining only the closed patterns rather than all frequent patterns leads to better efficiency [3].
Fig. 1: Approaches and Algorithm Used For Sequential Pattern Mining
All rights reserved by www.ijirst.org
Mining Sequences – Approaches and Analysis (IJIRST/ Volume 1 / Issue 7 / 047)
II. ALGORITHMS FOR SEQUENTIAL PATTERN MINING
A. Horizontal Database Format Algorithms:
For a customer dataset in horizontal database format, the data is first sorted by CustomerID and then by transaction time, which yields the transformed database. Mining is then carried out using a breadth-first approach [1].
Table - I Summary of Apriori-Based Algorithm For Mining Sequential Patterns (Candidate Generation: Horizontal Database Format)

| Algorithm | Author | Year | Key Features |
|---|---|---|---|
| Apriori (All, Some, Dynamic Some) | Agrawal and Srikant | 1995 | Mining frequent sequences using the Apriori property [8] |
| Generalised Sequential Patterns (GSP) | Srikant and Agrawal | 1996 | Max/min gap time constraints, sliding window size, finds items across different levels of a taxonomy (is-a hierarchy) [9] |
| Sequential Pattern mIning with Regular expressIon consTraints (SPIRIT) | Minos N. Garofalakis, Rajeev Rastogi, Kyuseok Shim | 1999 | Mining sequences that satisfy user-specified regular expression constraints [10] |
| A framework for frequent sequence mining under generalized regular expression constraints: Regular Expression-Highly Adaptive Constrained Local Extractor (RE-Hackle) | Albert-Lorincz and Boulicaut | 2003 | Explores only the candidate space spanned by the regular expression, and prunes it at each level [12] |
| A scalable algorithm Maximal Sequential Patterns using Sampling (MSPS) | Congnan Luo, Chung S.M | 2004 | A sampling technique is used to identify long frequent sequences earlier, instead of enumerating all their subsequences [11] |
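The horizontal transformation described in this subsection can be sketched as follows. This is a minimal illustration, not taken from any of the cited algorithms: the record layout `(customer_id, time, items)` and the function name `to_sequence_database` are assumptions made for the example.

```python
from collections import defaultdict


def to_sequence_database(records):
    """Group (customer_id, time, items) records into one time-ordered
    sequence of itemsets per customer (hypothetical record layout)."""
    # Sort by customer first, then by transaction time.
    ordered = sorted(records, key=lambda r: (r[0], r[1]))
    sequences = defaultdict(list)
    for customer_id, _time, items in ordered:
        sequences[customer_id].append(frozenset(items))
    return dict(sequences)


# Illustrative transactions, deliberately out of order.
records = [
    (2, 10, {"a"}),
    (1, 30, {"c"}),
    (1, 10, {"a", "b"}),
    (2, 20, {"b", "c"}),
]
db = to_sequence_database(records)
```

After the transformation each customer maps to a single sequence, which is the form a breadth-first, level-wise miner would scan once per candidate length.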
B. Vertical Database Format
The data is organized in vertical format, where the rows of the database consist of (sequence ID, time-stamp) pairs associated with an event. The id-list for each event is therefore easy to build, and frequent sequences are generated through simple temporal joins of id-lists [1]. Yang et al. [2002] recognized that the vertical database format is advantageous when patterns are long and the data is memory resident, and that it makes generating and counting candidates easier [2]. Mining is carried out using depth-first traversal on the vertical database format.
Table - II Summary of Apriori-Based Algorithm For Mining Sequential Patterns (Candidate Generation: Vertical Database Format)

| Algorithm Name | Author | Year | Key Features |
|---|---|---|---|
| Sequential PAttern Discovery using Equivalence classes (SPADE) | Zaki | 2001 | All sequences are discovered in only 3 database scans [13] |
| Sequential PAttern Mining (SPAM) | Ayres et al. | 2002 | Vertical bitmap representation, incrementally outputs new frequent item-sets in an online fashion [14] |
| Cache-based Constrained Sequence Miner (CCSM) | Orlando et al. | 2004 | k-way intersections, cache stores information for future use [17] |
| LAst Position INduction Sequential PAttern Mining (LAPIN-SPAM) | Yang and Kitsuregawa | 2005 | Uses SPAM and last position, and largely reduces the search space [16] |
| LAst Position INduction (LAPIN) | Yang, Wang and Kitsuregawa | 2007 | Uses last position to judge whether a frequent k-length sequential pattern can be extended to length k+1 [15] |
| RE-SPaM | Leticia I. Gómez et al. | 2008 | Built on constraints (i.e., conditions over attributes of complex items) rather than over atomic items |
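The id-list idea behind the vertical format can be illustrated with a short sketch. It is a simplified illustration under stated assumptions, not the implementation of SPADE or any other algorithm in Table II: positions stand in for time-stamps, the join handles only the "first event, then second event" case, and the names `build_id_lists`, `temporal_join` and `support` are hypothetical.

```python
from collections import defaultdict


def build_id_lists(db):
    """Map each item to its id-list: (sequence id, position) pairs."""
    id_lists = defaultdict(list)
    for sid, sequence in db.items():
        for pos, itemset in enumerate(sequence):
            for item in itemset:
                id_lists[item].append((sid, pos))
    return id_lists


def temporal_join(first, second):
    """Id-list of the pattern 'first, then second': keep occurrences of
    the second event that come after some occurrence of the first event
    in the same sequence."""
    earliest = {}
    for sid, pos in first:
        earliest[sid] = min(pos, earliest.get(sid, pos))
    return [(sid, pos) for sid, pos in second
            if sid in earliest and pos > earliest[sid]]


def support(id_list):
    """Number of distinct sequences that appear in an id-list."""
    return len({sid for sid, _ in id_list})


# Toy database: sequence id -> list of itemsets (illustrative data).
db = {
    1: [{"a"}, {"b"}, {"a", "c"}],
    2: [{"b"}, {"a"}],
    3: [{"a", "b"}, {"c"}],
}
ids = build_id_lists(db)
a_then_c = temporal_join(ids["a"], ids["c"])
```

Once the id-lists exist, the support of a longer pattern is obtained from the joined id-list alone, without rescanning the original database.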
C. Pattern Growth Algorithm for mining Sequential Pattern:
It was recognized early that the number of candidate sets generated by Apriori-based algorithms is very large. For example, to find a frequent sequence of 100 elements, about $2^{100} \approx 10^{30}$ candidates have to be generated. The frequent pattern growth approach removes the candidate generation and pruning steps of Apriori-type algorithms. It does so by compressing the database representing the frequent sequences into a frequent pattern tree, which is then divided into a set of projected databases that are mined separately [Han et al. 2000] [1]. Compared with Apriori-type algorithms, pattern growth algorithms are generally more complex to develop, test and maintain, but they are faster when dealing with large volumes of data.

Table - III Summary of Pattern Growth Algorithm For Mining Sequential Patterns

| Algorithm Name | Author | Year | Key Features |
|---|---|---|---|
| FREquEnt pattern-projected Sequential PAtterN mining (FreeSpan) | Han et al. | 2000 | Projected sequence database [18][1] |
| PREFIX-projected Sequential PAtterN mining (PrefixSpan) | Pei et al. | 2001 | Projected prefix database, reduces candidate subsequence generation [19][1] |
| Sequential pattern mining with Length-decreasing suPport (SLPMiner) | Seno and Karypis | 2002 | Length-decreasing support constraint; uses a database projection approach and pruning methods to reduce the search space [20] |
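The candidate count quoted in this subsection can be checked with a short calculation. As a simplifying assumption, the frequent sequence is treated as 100 distinct single items, so that every non-empty selection of its elements is a subsequence that an Apriori-style miner would also have to generate:

```latex
\[
\sum_{k=1}^{100} \binom{100}{k} \;=\; 2^{100} - 1 \;\approx\; 1.27 \times 10^{30}
\]
```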
D. Closed Frequent Sequential Patterns:
Closed sequential pattern mining [Yan et al. 2003] follows the candidate generation and test paradigm and stores the generated candidate sets in a hash-indexed result tree, on which post-pruning is conducted to produce the closed set of frequent sequences [1]. Closed sub-sequences are those sequences that have no super-sequence with the same support [6].

Table - IV Summary of Closed Sequential Pattern Mining Algorithms

| Algorithm Name | Author | Year | Key Description |
|---|---|---|---|
| Closed Sequential pattern mining (CloSpan) | Yan et al. | 2003 | Mines frequent closed subsequences only, i.e. those having no super-sequence with the same support [21] |
| BI-Directional Extension based frequent closed sequence mining (BIDE) | Wang and Han | 2004 | Without candidate maintenance; pruning by BackScan and optimization by ScanSkip [22] |
| CE-MINER | Yi-Cheng Chen et al. | 2011 | Mining closed patterns from time-interval based data [3] |
| ClaSP | Gomariz et al. | 2013 | Mining frequent closed sequential patterns in temporal transaction data using search space pruning methods with a vertical database [23] |
| CM-ClaSP | Fournier-Viger et al. | 2014 | Fast vertical mining of sequential patterns using co-occurrence information [24] |
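The closedness definition used in this subsection can be made concrete with a small post-hoc filter. This sketch only illustrates the definition: it compares every pair of patterns and is not how CloSpan, BIDE or ClaSP prune their search spaces. The pattern encoding (tuples of frozensets) and the function names are assumptions of the example.

```python
def is_subsequence(pattern, sequence):
    """True if the itemsets of pattern are contained, in order, in
    distinct itemsets of sequence (greedy left-to-right matching)."""
    pos = 0
    for itemset in pattern:
        while pos < len(sequence) and not itemset <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True


def closed_patterns(frequent):
    """Keep the patterns that have no proper super-sequence with the
    same support. `frequent` maps pattern -> support."""
    closed = {}
    for p, sup in frequent.items():
        absorbed = any(
            q != p and sup_q == sup and is_subsequence(p, q)
            for q, sup_q in frequent.items()
        )
        if not absorbed:
            closed[p] = sup
    return closed


a, b = frozenset({"a"}), frozenset({"b"})
# Illustrative supports, not taken from the paper's experiments.
frequent = {(a,): 3, (b,): 2, (a, b): 2}
closed = closed_patterns(frequent)
```

Here the pattern ⟨(b)⟩ is dropped because ⟨(a),(b)⟩ contains it and has the same support, while ⟨(a)⟩ is kept because its only super-sequence has a lower support.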
E. Constraint Based Sequential Pattern Mining:
Closed sequential pattern mining mines frequent patterns, and closed sequential sub-sequences are those patterns that have no super-sequence with the same support. Using constraints in closed sequential pattern mining reduces the search space and memory usage.

Table - V Summary of Constraint Based Closed Sequential Pattern Mining Algorithms

| Algorithm Name | Author | Year | Key Description |
|---|---|---|---|
| Mining Closed Sequences with Constraint Based on BIDE Algorithm | Shyamala S & Sathya T | 2012 | Pushing constraints into mining closed sequences [25] |
| IC-BIDE | Hiromasa Takei et al. | 2013 | Closed sequential pattern mining based on intensity constraint |
III. ANALYSIS
SPMF is an open-source data mining library written in Java, specialized in pattern mining. It is distributed under the GPL v3 license. Algorithms available in SPMF were applied to a dataset. The resulting statistics are displayed in Tables VII and VIII, and the charts for the data were created in Excel. The tables and charts below compare the sequential pattern mining algorithms with the closed sequential pattern mining algorithms. The algorithms were executed on a text file named ContextPrefixSpan, of size 136 bytes, containing a sequence database. The purpose of the analysis is to show that closed sequential pattern mining generates fewer redundant sequences than sequential pattern mining. The file is included as a test file in the SPMF package itself and is shown below.

Table - VI Sequence Database: ContextPrefixSpan

| ID | Sequences |
|---|---|
| S1 | (1),(1 2 3),(1 3),(4),(3 6) |
| S2 | (1 4),(3),(2 3),(1 5) |
| S3 | (5 6),(1 2),(4 6),(3),(2) |
| S4 | (5),(7),(1 6),(3),(2),(3) |

Table - VII Comparison Among Sequential Pattern Mining Algorithm's Attributes Considering Min Sup=0.4

| Attributes | GSP | SPADE | PrefixSpan | SPAM (Max Pattern Length 4) | LAPIN (Max Pattern Length 4) |
|---|---|---|---|---|---|
| Total Time | 16 ms | 16 ms | 16 ms | 16 ms | 16 ms |
| Frequent Sequences Count | 53 | 53 | 53 | 53 | 53 |
| Max Memory (MB) | 7.23007 | 6.2833 | 4.9123 | 6.1158 | 7.1062 |

Join Count: 185 (reported for a single algorithm).

Table - VIII Comparison Among Closed Sequential Pattern Mining Algorithm's Statistics Considering Min Sup=0.4

| Attributes | CLOSPAN | BIDE | CLASP | CM_CLASP |
|---|---|---|---|---|
| Total Time | 62 ms | 31 ms | 31 ms | 15 ms |
| Frequent Sequences Count | 17 | 17 | 17 | 17 |
| Max Memory (MB) | 4.3204 | 6.287 | 7.895 | 4.5262 |

Join Count: 145, 84 (reported for two of the four algorithms).
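To make the support threshold used in this analysis concrete, the following sketch counts the support of a few patterns in the ContextPrefixSpan database of Table VI. It is an illustration only and does not call SPMF. The conversion of the relative minimum support 0.4 into an absolute count by rounding up is an assumption, and the helper names `contains` and `support` are hypothetical.

```python
import math

# The four sequences of Table VI; each sequence is a list of itemsets.
DB = [
    [{1}, {1, 2, 3}, {1, 3}, {4}, {3, 6}],
    [{1, 4}, {3}, {2, 3}, {1, 5}],
    [{5, 6}, {1, 2}, {4, 6}, {3}, {2}],
    [{5}, {7}, {1, 6}, {3}, {2}, {3}],
]


def contains(sequence, pattern):
    """True if pattern's itemsets occur, in order, inside distinct
    itemsets of sequence (greedy left-to-right matching)."""
    pos = 0
    for itemset in pattern:
        while pos < len(sequence) and not itemset <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True


def support(pattern, db=DB):
    """Number of sequences in db that contain the pattern."""
    return sum(contains(seq, pattern) for seq in db)


# Assumed conversion: relative support 0.4 over 4 sequences, rounded up.
min_count = math.ceil(0.4 * len(DB))

sup_1_then_3 = support([{1}, {3}])       # item 1 followed later by item 3
sup_1_and_2 = support([{1, 2}])          # items 1 and 2 in one transaction
sup_7 = support([{7}])                   # item 7 alone
```

Under this assumption a pattern must occur in at least two of the four sequences to be frequent, so ⟨(1),(3)⟩ and ⟨(1 2)⟩ qualify while ⟨(7)⟩ does not.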
Fig. 2: Varying Performance, # of Frequent Sequences and Memory Used for Dataset ContextPrefixSpan with Minimum Support = 0.4. Panels: (A) Sequential Pattern Mining Algorithms; (B) Closed Sequential Pattern Mining Algorithms. Plotted series: Total Time, Frequent Sequences Count, Max Memory (MB).
Fig. 3: Varying Performance, # of Frequent Sequences and Memory for Dataset ContextPrefixSpan Considering Min Support = 0.4. Panel: (A) Sequential Pattern Mining and Closed Sequential Pattern Mining Algorithms. Plotted series: Total Time, Frequent Sequences Count, Max Memory (MB).
IV. CONCLUSION
This paper presents the approaches, key features and an analysis of various sequential pattern mining algorithms and closed sequential pattern mining algorithms. Using the SPMF library, the sequential pattern mining algorithms were applied to the ContextPrefixSpan file. GSP consumed approximately 1 MB more memory than SPADE, 2.3 MB more than PrefixSpan, 1.1 MB more than SPAM and 0.12387 MB more than LAPIN, while every algorithm took the same time (16 ms) and generated the same number of frequent sequences (53). Thus, for mining sequential patterns, PrefixSpan performs best in terms of memory usage while taking the same time and generating the same number of frequent sequences as GSP, SPADE, SPAM and LAPIN. For the closed sequential pattern mining algorithms, the execution time of CM-ClaSP was approximately 47 ms less than that of CloSpan and approximately 16 ms less than that of BIDE and ClaSP, and all four algorithms generated the same number of frequent sequences (17). The memory used by CloSpan and CM-ClaSP was approximately 2 MB less than that of BIDE and approximately 3 MB less than that of ClaSP. Thus CM-ClaSP performs best in execution time, with memory usage close to that of CloSpan (the lowest), while generating the same number of sequences as CloSpan, BIDE and ClaSP. The sequential pattern mining algorithms GSP, SPADE, PrefixSpan, SPAM and LAPIN generate roughly three times as many frequent sequences (53) as the closed sequential pattern mining algorithms CloSpan, BIDE, ClaSP and CM-ClaSP (17). Thus the large number of redundant sequences generated in sequential pattern mining is substantially reduced in closed sequential pattern mining.
REFERENCES
[1] Carl H. Mooney, John F. Roddick, "Sequential Pattern Mining – Approaches and Algorithms", ACM Computing Surveys (CSUR), Volume 45, Issue 2, February 2013.
[2] Kuo-Yu Huang, Chia-Hui Chang, Jiun-Hung Tung, Cheng-Tao Ho, "COBRA: Closed Sequential Pattern Mining Using Bi-phase Reduction Approach", in Proc. DaWaK, 2006, pp. 280-291.
[3] Yi-Cheng Chen, Wen-Chih Peng and Suh-Yin Lee, "CEMiner - An Efficient Algorithm for Mining Closed Patterns from Time-Interval Based Data", in Proc. ICDM, 2011, pp. 121-130.
[4] http://www.ibm.com/developerworks/library/ba-data-mining-techniques/
[5] http://www.philippe-fournier-viger.com/spmf/
[6] Ke Wang, Yabo Xu, Jeffrey Xu Yu, "Scalable Sequential Pattern Mining for Biological Sequences", in Proc. CIKM, 2004, pp. 178-187.
[7] Vincent S. Tseng, Eric Hsueh-Chan Lu, Cheng-Hsien Huang, "Mining Temporal Mobile Sequential Patterns in Location-Based Services Environments", in Proc. ICPADS, 2007, pp. 1-8.
[8] Rakesh Agrawal, Ramakrishnan Srikant, "Mining Sequential Patterns", in Proc. ICDE, 1995, pp. 3-14.
[9] Ramakrishnan Srikant, Rakesh Agrawal, "Mining Sequential Patterns: Generalizations and Performance Improvements", in Proc. EDBT, 1996, pp. 3-17.
[10] Minos N. Garofalakis, Rajeev Rastogi, Kyuseok Shim, "SPIRIT: Sequential Pattern Mining with Regular Expression Constraints", in Proc. VLDB, 1999, pp. 223-234.
[11] Congnan Luo, Chung S.M, "A Scalable Algorithm Maximal Sequential Patterns Using Sampling (MSPS)", in Proc. ICTAI, 2004, pp. 156-165.
[12] Hunor Albert-Lorincz and Jean-François Boulicaut, "A Framework for Frequent Sequence Mining under Generalized Regular Expression Constraints", in Proc. KDID, 2003, pp. 2-16.
[13] Mohammed J. Zaki, "SPADE: An Efficient Algorithm for Mining Frequent Sequences", Machine Learning, Vol. 42, No. 1-2, Feb. 2001.
[14] Jay Ayres, Johannes Gehrke, Tomi Yiu, and Jason Flannick, "Sequential Pattern Mining Using a Bitmap Representation", in Proc. ACM SIGKDD, 2002, pp. 429-435.
[15] Zhenglu Yang, Yitong Wang, and Masaru Kitsuregawa, "LAPIN: Effective Sequential Pattern Mining Algorithms by Last Position Induction for Dense Databases", in Proc. DASFAA, 2007, pp. 1020-1023.
[16] Zhenglu Yang, Masaru Kitsuregawa, "LAPIN-SPAM: An Improved Algorithm for Mining Sequential Pattern", in Proc. ICDEW, 2005, pp. 1222.
[17] Orlando, S., Perego, R., and Silvestri, "A New Algorithm for Gap Constrained Sequence Mining", in Proc. SAC, 2004, pp. 540-547.
[18] Jiawei Han, Jian Pei, Behzad Mortazavi, "FreeSpan: Frequent Pattern-Projected Sequential Pattern Mining", in Proc. ACM SIGKDD, 2000, pp. 355-359.
[19] Jian Pei, Jiawei Han, Behzad Mortazavi-Asl, Helen Pinto, Qiming Chen, Umeshwar Dayal, Mei-Chun Hsu, "PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth", in Proc. ICDE, 2001, pp. 215.
[20] Masakazu Seno, George Karypis, "SLPMiner: An Algorithm for Finding Frequent Sequential Patterns Using Length Decreasing Support Constraint", in Proc. ICDM, 2002, pp. 418-425.
[21] Xifeng Yan, Jiawei Han, Ramin Afshar, "CloSpan: Mining Closed Sequential Patterns in Large Datasets", in Proc. SDM, 2003.
[22] Jianyong Wang, Jiawei Han, "BIDE: Efficient Mining of Frequent Closed Sequences", in Proc. ICDE, 2004, pp. 79-90.
[23] Antonio Gomariz, Manuel Campos, Roque Marin and Bart Goethals, "ClaSP: An Efficient Algorithm for Mining Frequent Closed Sequences", in Proc. PAKDD, 2013, pp. 55-61.
[24] Philippe Fournier-Viger, Antonio Gomariz, Manuel Campos, Rincy Thomas, "Fast Vertical Mining of Sequential Patterns Using Co-occurrence Information", in Proc. PAKDD, 2014, pp. 40-52.
[25] Shyamala S, Sathya T, "Mining Closed Sequences with Constraint Based on BIDE Algorithm", in Proc. ICCI, 2012, pp. 1-5.
[26] Gomez L.I., Vaisman A.A., "RE-SPaM: Using Regular Expressions for Sequential Pattern Mining in Trajectory Databases", in Proc. ICDMW, 2008, pp. 395-398.
[27] Takei H., Yamana H., "IC-BIDE: Intensity Constraint-Based Closed Sequential Pattern Mining for Coding Pattern Extraction", in Proc. AINA, 2013, pp. 976-983.