Paper id 2620143


International Journal of Research in Advent Technology, Vol.2, No.6, June 2014 E-ISSN: 2321-9637

Minimizing the Repeated Database Scan Using an Efficient Frequent Pattern Mining Algorithm in Web Usage Mining
Devinder Kaur1, Ravneet Kaur2
Student of Master Technology1, Assistant Professor2
Department of Computer Science and Engineering1, 2
Sri Guru Granth Sahib World University, Fatehgarh Sahib, Punjab, India

Abstract: Data mining is the process of discovering new patterns and knowledge from large datasets. Web mining applies data mining techniques to extract useful knowledge and interesting patterns from the World Wide Web. Web data include web documents, hyperlinks between documents, and usage logs of web sites. Web usage data capture the identity and origin of web users along with their surfing behaviour at a web site. The aim of discovering frequent patterns in web log data is to obtain information about the navigational behaviour of users. Mining frequent patterns from web log data can help to optimise the structure of a web site and improve the performance of web servers. In this paper we first investigate the process used to find maximal forward references. We then propose a new approach that modifies this process through a backward scan algorithm for frequent pattern mining in web usage mining. The algorithm scans the web log database backward, starting from the longest candidate level, and is expected to reduce time and space complexity.

Index Terms: Data mining, World Wide Web, traversal pattern.

1. INTRODUCTION
Data mining is the process of discovering interesting knowledge from large amounts of data. Knowledge discovery in databases is the non-trivial process of identifying valid, potentially useful, and ultimately understandable patterns in data [1]. Web mining combines two active research areas: data mining and the World Wide Web (WWW). It can be broadly defined as the discovery of useful information from the WWW, that is, the application of data mining techniques to find interesting and potentially useful knowledge in web documents, web services, and other web data.

Web mining has three types: web content mining, web structure mining, and web usage mining. It is normally expected that either the hyperlink structure of the web or the web log data or both have been used in the mining process [2].

Table 1 Relationship between Content, Structure and Usage mining
Type | Structure | Forms | Object | Collection
Usage | Accessing | Click | Behaviour | Logs
Content | Pages | Text | Index | Pages
Structure | Map | Hyperlinks | Map | Hyperlinks

2. LITERATURE REVIEW
Hemant Kumar Singh and Brijendra Singh [3] surveyed web content mining, web structure mining, and web usage mining. Their paper reviews past and current work in each of these three types of web mining, and compares and summarises various web data mining methods and their applications. Jaideep Srivastava et al. [4] describe the phases of web usage mining, namely preprocessing, pattern discovery, and pattern analysis, together with its applications. They provide a detailed taxonomy of the work in this area, including research as well as commercial offerings.

Paweł Weichbroth et al. [5] describe the problem of mining access patterns from web logs efficiently. A novel data structure, called the Web access pattern tree, or WAP-tree for short, is developed for efficient mining of access patterns from pieces of logs. The Web access pattern tree stores highly compressed, critical information for access pattern mining and facilitates the development of novel algorithms for mining access patterns in large sets of log pieces. Renata Ivancsy and Istvan Vajk [6] showed that discovering frequent patterns in web log data yields information about the navigational behaviour of users. The different patterns in web log mining are page sets, page sequences, and page graphs. These can be used for advertising purposes, for creating dynamic user profiles, and so on. Yan Li et al. [7] proposed a path completion algorithm that efficiently appends the lost information and improves the reliability of access data for further web usage mining calculations.

Ming-Syan Chen et al. [8] proposed a data mining algorithm for mining path traversal patterns in a distributed information-providing environment where documents or objects are linked together to facilitate interactive access. Their solution procedure consists of two steps. First, an algorithm converts the original sequence of log data into a set of maximal forward references. Second, another algorithm determines the frequent traversal patterns, i.e. large reference sequences, from the maximal forward references obtained. Jianhan Zhu et al. [9] applied Markov chains to model user navigational behaviour. They proposed a method for constructing a Markov model of a web site based on past visitor behaviour. The Markov model is then used to make link predictions that assist new users in navigating the web site.

Wang Tong and He Pi-lian [10] offer an improved algorithm based on the original AprioriAll algorithm. The new algorithm adds the User-id property at every step of producing the candidate set and at every step of scanning the database, and uses it to decide whether an item in the candidate set should be put into the large set that is used to produce the next candidate set. Hengshan Wang et al. [11] introduced two prevalent data mining algorithms, Fpgrowth and PrefixSpan, into web usage mining (WUM). Maximum Forward Path (MFP) is also used in their web usage mining model during sequential pattern mining along with PrefixSpan, so as to reduce the interference of "false visits" caused by the browser cache and improve the mining of frequent traversal paths.

Sandeep Singh Rawat and Lakshmi Rajamani [12] proposed a custom-built Apriori algorithm, based on the classical Apriori algorithm, for effective pattern analysis. Ankit R. Kharwar et al. [13] implement the high-level process of web usage mining using the basic association rule algorithm called Apriori. They find association rules from the server log that are useful in many applications such as web page caching, marketing, and targeted advertising. Rahul Mishra and Abha Choubey [14] proposed the FP-growth algorithm for obtaining frequent access patterns from web log data and providing valuable information about the user's interests.

3. WEB MINING CHALLENGES
Today the World Wide Web is a popular and interactive medium for distributing information. Because web data are huge, diverse, dynamic, and unstructured, web mining research faces many challenges. Information users may encounter the following challenges when interacting with the web.
1. Finding relevant information: People either browse or use a search service when they want to find specific information on the web. Today's search tools suffer from low precision, which is due to the irrelevance of many of the search results and makes it difficult to find the relevant information. Another problem is low recall, which is due to the inability to index all the information available on the web.
2. Creating new knowledge out of the information available on the web: This problem is essentially a sub-problem of the one above. Whereas the problem above is a query-triggered (retrieval-oriented) process, this one is a data-triggered process that presumes a collection of web data is already available and extracts potentially useful knowledge from it.
3. Personalization of information: When people interact with the web, they differ in the contents and presentations they prefer.
4. Learning about consumers or individual users: This problem concerns what customers do and want. It includes sub-problems such as customising information for the intended consumers or even personalising it for an individual user, as well as problems related to web site design, management, and marketing [15].

4. BASIC PROBLEM OF FREQUENT ITEMSET
Association rule mining first finds all sets of items (itemsets) whose support is greater than the minimum support, and then uses these large itemsets to generate the desired rules whose confidence is greater than the minimum confidence. An algorithm for finding association rules, named AIS, was proposed by R. Agrawal et al. in 1993. There are several algorithms for finding frequent itemsets. All the variations of the Apriori algorithm have their own advantages and disadvantages.
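The two measures named above can be illustrated with a minimal Python sketch. It is not taken from the paper: the function names, the threshold values, and the choice to treat each session as an unordered set of single-letter pages are assumptions made only for illustration.

```python
def support(sessions, pages):
    """Fraction of sessions that contain every page in `pages`."""
    wanted = set(pages)
    return sum(wanted <= set(s) for s in sessions) / len(sessions)


def confidence(sessions, antecedent, consequent):
    """Support of antecedent and consequent together, divided by
    the support of the antecedent alone (assumed to be non-zero)."""
    both = set(antecedent) | set(consequent)
    return support(sessions, both) / support(sessions, antecedent)


def is_strong_rule(sessions, antecedent, consequent, min_sup, min_conf):
    """A rule is kept when both measures exceed their minimum values."""
    both = set(antecedent) | set(consequent)
    return (support(sessions, both) > min_sup
            and confidence(sessions, antecedent, consequent) > min_conf)


# Hypothetical sessions: each string lists the pages seen in one visit.
sessions = ["ABE", "ADH", "ACG", "ACG", "AD", "ACG"]

# Illustrative thresholds only; the paper does not fix these values.
MIN_SUP, MIN_CONF = 0.4, 0.8

print(support(sessions, "CG"))                           # 3 of 6 sessions
print(is_strong_rule(sessions, "C", "G", MIN_SUP, MIN_CONF))
print(is_strong_rule(sessions, "A", "C", MIN_SUP, MIN_CONF))
```

Under these assumptions the rule C to G is kept, because every session containing C also contains G, while A to C is rejected, because only half of the sessions containing A go on to C.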

Algorithms | Storage Structure | Advantages | Disadvantages
Apriori | Array based | Any subset of a frequent itemset is also a frequent itemset. | Multiple scans have to be done on the database. Its time and memory complexity is very large.
Apriori Tid | Array based | The number of entries may be smaller than the number of transactions in the database. | The time and space complexity is also very large.
FP tree | Tree based | Scans the database only twice. It has interactive rule mining. | It seems to be difficult to use incrementally. It requires less memory and more execution time.
Custom built Apriori | Array based | Based on the old Apriori algorithm. It gives efficient and effective pattern analysis. | It requires more memory and more execution time.

5. PROPOSED APPROACH
The proposed algorithm is based on frequent pattern mining using web log data. Its basic objective is to obtain the maximum traversal path of the users from the web log database. It scans the web log database backward, starting from the longest candidate level, which is intended to make the pattern mining process more effective.
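For context, the maximal forward references that the approach works with can be derived from a raw click sequence in the way the literature review attributes to Chen et al. [8]: a forward path is emitted whenever the user moves back to a page already on the current path. The Python sketch below is a minimal illustration under stated assumptions, not the authors' implementation. The function name, the single-letter page names, and the sample click sequence are hypothetical.

```python
def maximal_forward_references(clicks):
    """Split one user's click sequence into maximal forward references.

    `clicks` is an iterable of single-letter page names (an assumption
    made so that a path can be written as a plain string).
    """
    path = []               # current forward path from the session start
    moving_forward = True   # False while the user is backtracking
    references = []
    for page in clicks:
        if page in path:
            # Backward reference: the user returned to an earlier page.
            if moving_forward:
                references.append("".join(path))
            # Drop everything after the revisited page.
            del path[path.index(page) + 1:]
            moving_forward = False
        else:
            path.append(page)
            moving_forward = True
    # The last forward run also ends a maximal forward reference.
    if moving_forward and path:
        references.append("".join(path))
    return references


# Hypothetical session: A -> B -> E, back to B, back to A, then C -> G.
print(maximal_forward_references("ABEBACG"))
```

In this hypothetical session the user backtracks twice, so the sketch reports the two forward paths ABE and ACG.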

Algorithm : Backward Scan.
Input : User traversal pattern.
Output : Maximal forward reference.
Step 1 : Input the data set and the minimum threshold value.
Step 2 : Calculate the length k of the longest itemset in the data set.
Step 3 : Starting from the longest itemset length k:
  a. Generate the candidates of level k.
  b. Calculate the count of each candidate.
  c. If the count of any candidate is greater than the minimum threshold value, print the result. Otherwise, move to the next shorter level (k - 1) and repeat Step 3.
Step 4 : Exit.
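The steps above can be sketched in Python as follows. This is a hedged reading of the pseudocode rather than the authors' code. The function names are hypothetical, the tree edges are inferred from the candidates listed in the paper's illustrative example, and counting a candidate whenever it appears as a contiguous run inside a transaction is an assumption, since the paper does not define the matching rule.

```python
def paths_of_length(tree, root, k):
    """Return every path that starts at `root` and visits exactly k pages."""
    paths = [root]
    for _ in range(k - 1):
        paths = [p + child for p in paths for child in tree.get(p[-1], [])]
    return paths


def backward_scan(tree, root, log, min_threshold):
    """Scan candidate levels from the longest transaction length downward."""
    longest = max(len(t) for t in log)                 # Step 2
    for k in range(longest, 0, -1):                    # Step 3
        candidates = paths_of_length(tree, root, k)    # Step 3a
        # Step 3b: a candidate is counted when it occurs inside a
        # transaction as a contiguous run (assumed matching rule).
        counts = {c: sum(c in t for t in log) for c in candidates}
        # Step 3c: stop at the first level with a frequent candidate.
        frequent = {c: n for c, n in counts.items() if n > min_threshold}
        if frequent:
            return k, counts, frequent
    return 0, {}, {}                                   # nothing was frequent


# Tree edges inferred from the example candidates ABE, ACF, ACG and ADH.
tree = {"A": ["B", "C", "D"], "B": ["E"], "C": ["F", "G"], "D": ["H"]}
# The six transactions of the paper's example web log database.
log = ["ABE", "ADH", "ACG", "ACG", "AD", "ACG"]

level, counts, frequent = backward_scan(tree, "A", log, min_threshold=1)
print(level, counts, frequent)
```

With the example data the scan stops at level 3, because ACG already occurs three times there, so no shorter level needs to be generated or counted.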

Illustrative Example
The steps of the proposed algorithm are explained below using a user traversal pattern. The web site has a tree-like structure in which pages are represented as nodes, denoted by letters, and hyperlinks are represented by arrows.

Min threshold = 1

[Fig 1 User Traversal Pattern: tree of web pages with nodes A, B, C, D, E, F, G and H]

TID | Database
100 | ABE
200 | ADH
300 | ACG
400 | ACG
500 | AD
600 | ACG
Table 2 Web log database

The above database is generated from the user traversal pattern graph in order to obtain the maximal forward references.
Step 1 : Generate the candidates from the graph, together with the transaction id and the nodes, denoted by letters, that are accessed during each transaction.
Step 2 : Determine the length of the longest candidate level from the graph. Here the longest transactions contain three pages, so the scan starts at level 3.

length
ABE
ACF
ACG
ADH
Table 3 Candidate length

Step 3 : After the candidate level has been generated, count the number of occurrences of each candidate in the web log database and check whether the count satisfies the minimum threshold value.

length | count
ABE | 1
ACF | 0
ACG | 3
ADH | 1
Table 4 Length count

The table above shows the count of each candidate. The maximum candidate count is 3, for ACG, which is therefore the maximum traversal path of the users. If no candidate had satisfied the threshold, the level 2 candidates would be generated from the graph. The proposed backward scan algorithm is a modified approach to frequent pattern mining in web usage mining and is intended to improve on the efficiency of traditional Apriori algorithms. Its time and space requirements are expected to be lower because it minimises the repeated database scans needed for frequent pattern mining in web usage mining. The maximal forward reference can be obtained easily with this algorithm.

6. CONCLUSION
In this paper, data mining and the categories of web mining have first been discussed. Frequent pattern mining algorithms have also been analysed along with their advantages and disadvantages. An algorithm has been proposed for frequent pattern mining in web usage mining that is intended to be more efficient than traditional Apriori algorithms. The backward scan algorithm first scans the web log database and obtains the length of the longest candidate level. It then counts the occurrences of each candidate, checks each count against the minimum threshold value, and obtains the maximum forward reference from the candidate counts. The new approach requires a minimum of repeated database scans for frequent pattern mining in web usage mining, which is expected to reduce execution time and space.

REFERENCES

[1] Pang-Ning Tan, "Introduction to Data Mining", Addison Wesley, 2006.
[2] Monika Yadav and Pradeep Mittal, "Web Mining: An Introduction", International Journal of Advanced Research in Computer Science and Software Engineering, vol. 3, issue 3, March 2013.
[3] Hemant Kumar Singh and Brijendra Singh, "Web Data Mining Research: A Survey", in Proceedings of IEEE International Conference on Computational Intelligence and Computing Research (ICCIC), pp. 1-10, December 2010.
[4] Jaideep Srivastava, Robert Cooley, Mukund Deshpande, Pang-Ning Tan, "Web Usage Mining: Discovery and Application of Usage Pattern from Web Data", ACM SIGKDD, vol. 1, issue 2, pp. 1-12, January 2000.
[5] Paweł Weichbroth, Mieczysław Owoc, Michał Pleszkun, "Web User Navigation Patterns Discovery from WWW Server Log Files", in Proceedings of IEEE Federated Conference on Computer Science and Information Systems, pp. 1171-1176, 2012.
[6] Paweł Weichbroth, Mieczysław Owoc, Michał Pleszkun, "Web User Navigation Patterns Discovery from WWW Server Log Files", in Proceedings of IEEE Federated Conference on Computer Science and Information Systems, pp. 1171-1176, 2012.
[7] Yan Li, BoQin Feng and Qinjiao Mao, "Research on Path Completion Technique in Web Usage Mining", in Proceedings of IEEE International Symposium on Computer Science and Computational Technology (ISCSCT), vol. 1, pp. 554-559, December 2008.
[8] Ming-Syan Chen, Jong Soo Park, and Philip S. Yu, "Efficient Data Mining for Path Traversal Pattern", IEEE Transactions on Knowledge and Data Engineering, vol. 10, no. 2, pp. 209-221, April 1998.
[9] Jianhan Zhu, Jun Hong, and John G. Hughes, "Using Markov Chain for Link Prediction in Adaptive Web Sites", in Proceedings of First International Conference of Computing in an Imperfect World, pp. 60-73, 2002.
[10] Wang Tong and He Pi-lian, "Web Log Mining by an Improved AprioriAll Algorithm", in Proceedings of Second World Enformatika Conference, 2005.
[11] Hengshan Wang, Cheng Yang and Hua Zeng, "Design and Implementation of a Web Usage Mining Model Based On Fpgrowth and Prefixspan", Communications of the IIMA, vol. 6, no. 2, 2006.
[12] Sandeep Singh Rawat and Lakshmi Rajamani, "Web Discovering Potential User Browsing Behaviors Using Custom built Apriori Algorithm", International Journal of Computer Science & Information Technology, vol. 2, no. 4, August 2010.
[13] Ankit R. Kharwar, Viral Kapadia, Nilesh Parajapati, Premal Patel, "Implementation Apriori Algorithm on Web Serve Log", in Proceedings of National Conference on Recent Trends in Engineering & Technology, May 2011.
[14] Rahul Mishra and Abha Choubey, "Discovery of Frequent Patterns from Web Log Data by Using FPGrowth Algorithm for Web Usage Mining", International Journal of Advanced Research in Computer Science and Software Engineering, vol. 2, issue 9, September 2012.
[15] Raymond Kosala and Hendrik Blockeel, "Web Mining Research: A Survey", ACM SIGKDD, vol. 2, pp. 1-15, June 2000.
[16] D. Jayalatchumy and P. Thambidurai, "Web Mining Research Issues in Future Directions: A Survey", IOSR Journal of Computer Engineering, vol. 14, issue 3, pp. 20-27, October 2013.
