Tutorial Paper Proc. of Int. Conf. on Advances in Information Technology and Mobile Communication 2013
Advance Clustering Technique Based on Markov Chain for Predicting Next User Movement
Harish Kumar (1), Dr. Anil Kumar Solanki (2)
(1) PhD Scholar, Mewar University; (2) Professor, BIT Jhansi. Email: harishtaluja@gmail.com
Abstract - Aim: According to a recent survey, India is one of the leading countries in the world for technical and management education, with student numbers growing at roughly 45% per annum. Advances in technology have a marked effect on the education system and help in upgrading higher education; several universities and colleges already use such technologies, and the weblog is one of them. The main aim of this paper is to represent web logs using a clustering technique for predicting the next user movement and analyzing user behavior. The paper centres on web log clustering based on Markov chain results and presents an approach to clustering web site users and predicting their behavior on the next visit. Methodology: Web usage data from approximately 14 engineering colleges is used, and an advanced clustering approach is presented after optimizing the other clustering approaches. Results: User behavior is predicted with the help of the advanced clustering approach based on FPCM and k-means, and the proposed algorithm is used to mine and predict users' preferred paths. Existing approaches for predicting user behavior are not sufficient because of their sensitivity to noise; with the help of the proposed advance clustering method (ACM), noise is reduced and a more accurate result for predicting user behavior is obtained. Approach Implementation: The algorithm was implemented in MATLAB, DTRG and Java. The experimental results validate the method's effectiveness in comparison with some previous studies. Keywords - Markov chain, web logs, clustering, FPCM (Fuzzy Possibilistic C-Means algorithm), k-means algorithm.
I. INTRODUCTION
A recent study by Google found that Indians are just behind Americans when it comes to searching online about educational institutions and courses. According to the survey, the details of which were released by the online search giant, over 45% of Indian students use the internet to research education [10]. This generates massive data related to students' interactions with educational web sites, in the form of web logs or server log files. This research focuses on web log analysis and on methods for processing this web data. Finding hidden information in web log data is called web usage mining, which is a part of data mining. Data mining and knowledge discovery is a research discipline involving the study of techniques to search for patterns in large collections of data. The application of data mining techniques to the web, called web data mining, was a natural step, and it is now the focus of an increasing number of researchers. Web usage mining consists of three phases: preprocessing, pattern discovery, and pattern analysis. After these three phases the user can find the required usage patterns and use this information for specific needs. The reliability of previously developed methods for finding similar patterns is only up to 50%. Zidrina introduced a mutual approach which takes users' browsing history and the text of link anchors to analyse users' behavior. Tanasa proposed several approaches for extracting sequential patterns with low support from web usage data; these were instantiated in concrete methods such as "Cluster & Discover" and "Divide & Discover". The aim of all the previous research on discovering similar patterns in web log data is to obtain information about the navigational behavior of users. Web usage mining, from the data mining aspect, is the task of applying data mining techniques to discover usage patterns from web data in order to understand and better serve the needs of users navigating on the web. Its aim here is to extract useful information from educational weblogs; these data patterns are then used to analyze user behavior. The objective of this work is to generate similar patterns with the help of a Markov chain, using web log data preparation methods, data mining algorithms for prediction and classification tasks, and web text mining. The key target of the paper is to develop methods that improve the knowledge discovery steps on web log data and reveal new prospects to the data analyst. To forecast the next user movement effectively, this study presents a web-based recommendation system for predicting the next user movement, named WebAstro; according to our findings, WebAstro also helps in web site reorganization. While performing web log analysis, it was discovered that insufficient attention has been paid to the web log data cleaning process. By reducing the number of redundant records, the data mining process becomes much more effective and faster. Therefore a new cleaning framework was introduced which keeps only the records that correspond to real user clicks. This cleaning method, named Duster, performs "query based" cleaning. The clean data is used for designing a web graph. The web graphs are modeled in the form of a Markov chain, and a new friend function is generated for calculating the probability used for next-page prediction and behavior analysis [8][9]. The k-means clustering algorithm is used for predicting user behavior, and its advanced soft counterpart, Fuzzy C-Means (FCM), is a well-known soft clustering algorithm that allows for overlapping clusters [1]. Overlapping clusters can be useful in applications where the restriction imposed by crisp clustering, which forces the assignment of every object to a unique cluster, is not practical. This paper emphasizes the k-means and FCM algorithms for clustering the web navigation patterns of an educational site of NCR colleges.
II. RELATED WORK
G. Sudhamathy et al. [1] presented a survey of web clustering algorithms. She gives a brief overview of the fuzzy clustering algorithm, the Temporal Cluster Migration Matrices algorithm and the PSO-based clustering algorithm, and finds that the temporal cluster migration matrices approach categorizes web users into different clusters and studies their cluster migration behavior over a period of time. The fuzzy clustering approach can be applied to study aspects of e-commerce web sites, starting from ranking users based on their visit time and visit frequency, while the PSO optimization technique applied to web session clustering identifies more accurate clustering sessions. After this analysis she concludes that the fuzzy clustering algorithm is simple, effective and practical to apply. J. Vellingiri et al. [2] proposed a fuzzy possibilistic c-means algorithm for clustering in web usage mining to predict user behavior. In recent times, C-Means has been found to be superior owing to its embedded fuzzy logic; in a noisy environment, however, the FCM memberships often do not correspond well to the degree of belonging of the data and may be inexact. Their paper uses a clustering algorithm called fuzzy-possibilistic C-Means (FPCM), which integrates extended partition entropy and inter-class resemblance computed from the fuzzy set point of view. The approach uses FPCM to determine user behavior, since it needs only the membership matrix and the possibilistic matrix, and is free from heavy distance computation. Tasawar et al. [3] proposed a connectivity-based clustering approach for web usage mining (WUM), using agglomerative and divisive clustering. Swarm-based web session clustering helps in many ways to manage web resources effectively, for example through web personalization, schema modification, website modification and web server performance. The paper proposes web session clustering at the second (preprocessing) level of web usage mining: the framework covers the data preprocessing steps needed to prepare the web log data and to convert categorical web log data into numerical data, a session vector is obtained from the web data, and swarm optimization can then be applied to cluster the web log data. The hierarchical-cluster-based approach enhances existing web session techniques with more structured information about user sessions. Vinita et al. [4] proposed using the learning capabilities of neural networks to classify web traffic data. The discovery of useful knowledge, user information and server access patterns allows web-based organizations to mine user access patterns, which helps in future development, maintenance planning and in targeting more rigorous advertising campaigns aimed at groups of users. As the popularity of the web continues to increase, there is a growing need for tools and techniques that improve its overall usefulness. She proposes using the k-means algorithm to reduce the computational load of the neural network by reducing the input set of samples: the input dataset is clustered with k-means, and only discriminative samples from the resulting clustering schema are used for the learning process. Chu et al. [5] proposed a two-level prediction model based on Markov models and the Bayesian theorem. The prediction result can be used for personalization, building better websites, promotion, gathering marketing information, and forecasting market trends. The Markov model is assumed to be a probability model by which users' browsing behavior can be predicted at the category level, while the Bayesian theorem is applied to present and infer users' browsing behavior at the web page level. With the Markov model the system can effectively filter the possible category of websites, and the Bayesian theorem helps to predict websites accurately. R. Khanchana et al. [6] proposed a modified version of Lee's prediction model based on Markov models and the Bayesian theorem. She focuses on the preprocessing step and amends the prediction step: a hierarchical agglomerative clustering algorithm is applied to browsing patterns to obtain several user clusters, and the cluster data is projected as a cluster view in place of the global view. As a result, the author presents an altered prediction model in which view selection is used to match users' browsing patterns, improving forecasting accuracy.
III. METHODOLOGY
A. Web Log File
Web Mining: Web mining may be classified into three categories, namely weblog mining, web content mining, and web structure mining.
Fig. 1. Categorization of Web Data mining
Web content mining (WCM) is used to find useful information in the content of web pages [4], e.g. free and semi-structured data such as HTML code, pictures, and various uploaded files. Web structure mining (WSM) is used to generate a structural summary of the web site and its pages [7][11]; it tries to discover the link structure of the hyperlinks at the inter-document level, whereas web content mining mainly focuses on the structure within a document. Web usage mining (WUM) is applied to the data generated by visits to a web site, especially the data contained in web log files; this paper highlights and discusses research issues involved in web usage data mining. In web usage mining (WUM), or web log mining, users' behavior or interests are revealed by applying data mining techniques to the web logs. Web log files are of different types:
1. Access Log File: records information about which files are requested from the web server. It is located in the www/logs/ directory.
2. Agent Log File: records information about the web clients that make requests on the server.
3. Referer Log File: records the URL that the web browser had been viewing immediately before making the request on the server. This is particularly useful for determining where requests on the web server come from and which websites are referring traffic to the server. It is located in the www/logs/ directory and called the referer log file.
4. Error Log File: records information about failed requests to the server. If someone tries to access a file on the server that does not exist, the server automatically generates an error message, and each of these messages is recorded in the error log. It is located in the www/logs/ directory and called the error log file.
The three main sources of web log files are 1. the client log file, 2. the proxy log file and 3. the server log file. A log file contains the following fields: the client's host name or its IP address; the client id (generally empty and represented by a "-"); the user login (if applicable); the date and time of the request; the operation type (GET, POST, HEAD, etc.); the requested resource name; the request status; the requested page size; the user agent (a string identifying the browser and the operating system used); and the referrer of the request, which is the URL of the web page containing the link that the user followed to get to the current page. User behavior can be best analyzed from the client log file, because logs collected on the client side are much more reliable and accurate than server log files and proxy log files.
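As an illustration of the log fields listed above, the following is a minimal Java sketch that splits one access-log entry into its fields and applies the kind of cleaning filter used later by WebAstro (discarding image, stylesheet and script requests). The field layout, the class and method names, and the sample line are assumptions made for illustration, not the exact format of the logs used in this study.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Minimal sketch: split one access-log line into the fields described above
// (host, identity, user, timestamp, request, status, bytes, referrer, user agent).
// The combined-log-style regex and the class/field names are illustrative assumptions.
public class LogEntryParser {

    private static final Pattern LINE = Pattern.compile(
        "^(\\S+) (\\S+) (\\S+) \\[([^\\]]+)\\] \"(\\S+) (\\S+) [^\"]*\" (\\d{3}) (\\S+) \"([^\"]*)\" \"([^\"]*)\"$");

    // Requests for static resources are treated as noise, as in the cleaning step of WebAstro.
    private static final Pattern NOISE = Pattern.compile(
        ".*\\.(jpe?g|gif|png|css|js|ico)(\\?.*)?$", Pattern.CASE_INSENSITIVE);

    public static String[] parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return null;                      // malformed entry: skip it
        }
        String resource = m.group(6);
        if (NOISE.matcher(resource).matches()) {
            return null;                      // image/css/js request: not a real user click
        }
        // host, user, timestamp, method, resource, status, referrer
        return new String[] { m.group(1), m.group(3), m.group(4),
                              m.group(5), resource, m.group(7), m.group(9) };
    }

    public static void main(String[] args) {
        String sample = "10.0.0.1 - - [12/Mar/2013:10:15:32 +0530] "
            + "\"GET /admissions.html HTTP/1.1\" 200 5124 "
            + "\"http://example.edu/index.html\" \"Mozilla/5.0\"";
        String[] fields = parse(sample);
        System.out.println(fields == null ? "filtered out" : String.join(" | ", fields));
    }
}

Lines that fail to parse or that request static resources are dropped before any further processing, which is the same intuition behind the Duster cleaning step described in the introduction.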
An extended log file contains a sequence of lines of ASCII characters terminated by either the sequence LF or CRLF. Log file generators should follow the line termination convention for the platform on which they are executed, and analyzers should accept either form. Each line may contain either a directive or an entry. Entries consist of a sequence of fields relating to a single HTTP transaction [8]. Fields are separated by whitespace; the use of tab characters for this purpose is encouraged. If a field is unused in a particular entry, a dash "-" marks the omitted field. Directives record information about the logging process itself; lines beginning with the # character contain directives. The following directives are defined:
Version: <integer>.<integer> - the version of the extended log file format used [7][8]; this draft defines version 1.0.
Fields: [<specifier>...] - specifies the fields recorded in the log.
Software: <string> - identifies the software which generated the log.
Start-Date: <date> <time> - the date and time at which the log was started.
End-Date: <date> <time> - the date and time at which the log was finished.
Date: <date> <time> - the date and time at which the entry was added.
Remark: <text> - comment information; data recorded in this field should be ignored by analysis tools.
A sample web log is shown in Figure 2.
Fig. 2. Web logs
B. Markov's Model
The pages and hyperlinks of the World-Wide Web may be viewed as nodes and arcs in a directed graph. The relationship between sites and pages indicated by these hyperlinks gives rise to what is called a web graph: viewed as a purely mathematical object, each page forms a node in this graph and each hyperlink forms a directed edge from one node to another. The traces users leave behind are called navigation patterns, and they can be used to decide the next likely web page request based on statistically significant correlations. If a sequence occurs very frequently, it indicates the most likely traversal pattern, and when such patterns occur sequentially, Markov chains can be used to represent the navigation patterns of the web site [8][9]. Important properties of the Markov chain are: 1. the Markov chain is successful in sequence matching and generation; 2. the Markov model depends on the previous state; 3. the Markov chain model is generative; 4. the Markov chain is a discrete-time stochastic process. The Markov chain model is assumed to be a probability model and is used to predict the probability of the next link chosen when viewing a web page, taking into account the trail followed to reach that page. Our measure of the summarization ability of the model answers a question we have often been asked about the adequacy of Markov models in representing user web trails. We use three types of Markov model.
1. First Order Markov Model: Suppose we have a state space S = {S1, S2, ..., Sn}; the state at time t is St, and the transition probability is Pi,j. In a first order Markov chain the probability of a state depends only on the previous state; for example, the probability of state j depends on the previous state i. The transition probabilities are therefore expressed as
Pi,j = Pr(St = j | St-1 = i)    (1)
If we consider the states at different instants of time t, a state can be written as S(t). If T represents the number of states in a sequence, then ST = {S1, S3, S5, S1} (for T = 4). The model uses the transition probability given by
P(Sj(t+1) | Si(t)) = Pij    (2)
TABLE I. USER NAVIGATION PATTERNS AND THEIR FREQUENCIES

Navigation Pattern    Occurrence
S A B C D T           4
S E F G T             8
S B C E F T           4
S A C D T             4
S B C D T             6
S A C E T             14
S B C T               4
S D F G T             2
S D F T               10
S D T                 12
S B C D F T           6
S E F T               2
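To make the first-order model concrete, the following is a minimal Java sketch that estimates the transition probabilities of equation (1) from weighted navigation patterns such as those in Table I (each pattern counted with its occurrence) and then predicts the most probable next page from a given state. The class name, the data encoding and the prediction helper are illustrative assumptions, not the paper's implementation.

import java.util.HashMap;
import java.util.Map;

// Minimal sketch: estimate first-order transition probabilities P(i -> j)
// from navigation patterns weighted by their occurrence counts (Table I),
// and predict the most probable next page from a given state.
public class FirstOrderMarkov {

    private final Map<String, Map<String, Integer>> counts = new HashMap<>();

    // Add one navigation pattern, e.g. {"S","A","C","E","T"} observed 14 times.
    public void addPattern(String[] pattern, int occurrences) {
        for (int t = 1; t < pattern.length; t++) {
            counts.computeIfAbsent(pattern[t - 1], k -> new HashMap<>())
                  .merge(pattern[t], occurrences, Integer::sum);
        }
    }

    // P(St = j | St-1 = i), estimated as N(i,j) / sum over k of N(i,k).
    public double transitionProbability(String i, String j) {
        Map<String, Integer> row = counts.get(i);
        if (row == null) return 0.0;
        int total = row.values().stream().mapToInt(Integer::intValue).sum();
        return total == 0 ? 0.0 : row.getOrDefault(j, 0) / (double) total;
    }

    // Most probable next state from i (the predicted next page request).
    public String predictNext(String i) {
        Map<String, Integer> row = counts.get(i);
        if (row == null) return null;
        return row.entrySet().stream()
                  .max(Map.Entry.comparingByValue())
                  .map(Map.Entry::getKey).orElse(null);
    }

    public static void main(String[] args) {
        FirstOrderMarkov model = new FirstOrderMarkov();
        model.addPattern(new String[] {"S", "A", "B", "C", "D", "T"}, 4);
        model.addPattern(new String[] {"S", "A", "C", "E", "T"}, 14);
        model.addPattern(new String[] {"S", "A", "C", "D", "T"}, 4);
        // ... the remaining rows of Table I would be added the same way.
        System.out.println("P(C | A) = " + model.transitionProbability("A", "C"));
        System.out.println("next after A: " + model.predictNext("A"));
    }
}

With only the three patterns shown, the sketch gives P(C | A) = 18/22 and predicts C as the next page after A, which is the kind of next-page prediction the first-order model provides.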
2. Second Order Transition Probabilistic Model: We let Pi,k j be the second-order transition probability, that is, the probability of the transition (Ak, Aj) given that the previous transition that occurred was (Ai, Ak). The second-order probabilities are estimated from the observed transition counts as
Pi,k j = N(Ai, Ak, Aj) / N(Ai, Ak)    (5)
where N(Ai, Ak, Aj) is the number of times the path (Ai, Ak, Aj) occurs and N(Ai, Ak) is the number of times the transition (Ai, Ak) occurs; the resulting model is shown in Fig. 3. We consider the same navigation patterns used in the previous paper. With this model we found some problems: for example, state C does not accurately reflect its actual probability. The accuracy of the transition probabilities out of a state can be increased by separating its in-paths.
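A minimal sketch of how the estimate in equation (5) could be computed, by keying the counts on the previous transition (Ai, Ak) instead of a single state; the class and method names are illustrative assumptions.

import java.util.HashMap;
import java.util.Map;

// Minimal sketch for equation (5): second-order transition probabilities,
// i.e. counts keyed on the previous transition (Ai, Ak) instead of one state.
public class SecondOrderMarkov {

    // key "Ai->Ak" maps to the counts of the pages Aj that followed that transition
    private final Map<String, Map<String, Integer>> pairCounts = new HashMap<>();

    public void addPattern(String[] pattern, int occurrences) {
        for (int t = 2; t < pattern.length; t++) {
            String pair = pattern[t - 2] + "->" + pattern[t - 1];
            pairCounts.computeIfAbsent(pair, k -> new HashMap<>())
                      .merge(pattern[t], occurrences, Integer::sum);
        }
    }

    // P((Ak, Aj) | (Ai, Ak)) = N(Ai, Ak, Aj) / N(Ai, Ak)
    public double probability(String ai, String ak, String aj) {
        Map<String, Integer> row = pairCounts.get(ai + "->" + ak);
        if (row == null) return 0.0;
        int total = row.values().stream().mapToInt(Integer::intValue).sum();
        return total == 0 ? 0.0 : row.getOrDefault(aj, 0) / (double) total;
    }
}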
Fig. 3. Second Order Markov Model
3. Nth Order Markov Model: The Nth order Markov model solves the above problems. Pi,j n is the probability that the state is j at time t given that the state was i at time t-n. The n-order transition probability of the Markov model is denoted by
Pi,j n = Pr{St = j | St-n = i}    (6)

C. Bayesian Theorem
Bayes' theorem is a theorem of probability. It can be seen as a way of understanding how the probability that a theory is true is affected by a new piece of evidence. Bayesian networks (BNs), also known as belief networks, belong to the family of probabilistic graphical models (GMs) [5]. The graphical structure represents knowledge about an uncertain domain: each node represents a random variable, while the edges between the nodes represent probabilistic dependencies among the corresponding random variables. These conditional dependencies in the graph are often estimated by using known statistical and computational methods. The theorem has been used in a wide variety of contexts; here it is used to predict the most probable next request of the user. Assume that X and Y are two events in a sample space S. Then
P(X|Y) = P(Y|X) P(X) / P(Y)    (7)
In equation (7), X stands for a theory or hypothesis that we are interested in testing, and Y represents a new piece of evidence that seems to confirm or disconfirm the theory. P(X|Y) is the probability that X is true given that Y is true; this is a conditional probability, the probability that one proposition is true provided that another proposition is true. P(X) represents our best estimate of the probability of the next user page request before taking the evidence into account, and is known as the prior probability of X. What we want to discover is P(X|Y), the posterior probability of X, that is, the probability assigned to X after taking the new piece of evidence Y into account. To calculate it we need, in addition to the prior probability P(X), two further conditional probabilities indicating how probable our piece of evidence is depending on whether the theory is or is not true; these are P(Y|X) and P(Y|~X), where ~X is the negation of X, i.e. the proposition that X is false.
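As a small numerical illustration of equation (7), the following sketch computes the posterior probability that a user's next request is a particular page given an observed piece of evidence; all of the probability values are invented for illustration only.

// Small numerical illustration of equation (7): posterior = P(Y|X) * P(X) / P(Y),
// with P(Y) expanded as P(Y|X)P(X) + P(Y|~X)P(~X). All numbers are illustrative.
public class BayesExample {

    static double posterior(double priorX, double likelihoodYGivenX, double likelihoodYGivenNotX) {
        double evidence = likelihoodYGivenX * priorX + likelihoodYGivenNotX * (1.0 - priorX);
        return likelihoodYGivenX * priorX / evidence;
    }

    public static void main(String[] args) {
        // X: "the next request is the admissions page"; Y: "the current page is the fees page".
        double priorX = 0.20;            // P(X): prior probability of the next page
        double pYGivenX = 0.60;          // P(Y|X)
        double pYGivenNotX = 0.10;       // P(Y|~X)
        System.out.printf("P(X|Y) = %.3f%n", posterior(priorX, pYGivenX, pYGivenNotX));
        // Prints P(X|Y) = 0.600: the evidence triples the probability of the admissions page.
    }
}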
Fig. 4. WebAstro Block Diagram
Experimental Methodology: The following WebAstro procedure (Fig. 4) is used for cleaning and analysis, for predicting user behavior and for web site reorganization.
Step 1: Read the web logs from the web log database (web server log file).
Step 2: Apply the DUSTER algorithm for refining the web logs: clean HTML, XML, CSS and other tags from the web logs; remove all jpeg, jpg and gif requests; delete stop words such as "and", "an", "is"; keep the reduced log file in a separate folder named WEBASTRO.
Step 3: Sort the clean and refined web logs on the basis of the date and time of the visits.
Step 4: Prepare separate tables based on the following fields: 1. user IP table (user identification table); 2. page navigation table (transaction identification table); 3. duration table (session identification table).
Step 5: Normalize the data tables.
Step 6: Initialize the IPADDRESS counter to zero (0). Check whether the IP address is already in the IP table; if yes, increment the IPADDRESS counter by one, else insert the IP address into the IP table.
Step 7: Initialize the PAGEVISIT counter to zero (0). Check whether the page address is in the page navigation table; if yes, increment the PAGEVISIT counter by one, else the page is invalid and step 7 is repeated.
Step 8: Prepare the transaction matrix, the similarity matrix and the relevance matrix from steps 4 to 7, until all data sets are in matrix form.
Step 9: Apply the k-means clustering algorithm to the refined test data set and generate the clusters. Let X = (X1, X2, X3, ..., Xn) be the set of n distinct users who visit P distinct pages in session Si, let Xi be a specific user and K the number of web pages visited by user Xi in the session. Select another user Xj from the set such that Xj also belongs to session Si. If Xi and Xj belong to the same session, they have a common interest in that web session; then Session_count = Session_count + 1 (increment the session counter by 1) and generate the matrix VISITij, recording the number of times page i is visited by web user j. Similarly, Page_count = Page_count + 1 (increment the page counter by 1) and generate the matrix of the ith page visited by the jth user, and Time_count = Time_count + 1 (increment the time counter by 1) and generate the matrix of the time spent by a user on a web page. Assign the initial mean values for the clusters and plot the clusters from the specified matrices on the basis of session membership, pages visited and time spent on the page. Set the threshold value δ for the centroid and calculate the distance between the different clusters.
Step 10: Apply fuzzy c-means clustering to the refined test data set and generate the clusters. Consider an unlabelled pattern set X = (X1, X2, X3, ..., Xn). The objective function used to calculate the within-group sum of squares (WGSS) is
min Jm(U, W) = sum over i = 1..N and j = 1..C of (µij)^m d²ij,  with d²ij = ||Xi - Cj||²
where N is the number of patterns in X, C is the number of clusters, W is the cluster centre vector, U is the membership matrix whose elements µij give the degree of membership of Xi in cluster j, m is any real number greater than 1, and Cj is the d-dimensional centre of cluster j.
Step 11: Find the optimized solution and predict the user behavior on the basis of the cluster results, the cluster density and the cluster distances, and compare with the Markov prediction model and the Bayesian (two-level) model.
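Step 10 relies on the standard fuzzy c-means update equations that minimise the objective Jm above (this is plain FCM, not the full FPCM variant of [2]). The following is a minimal, self-contained Java sketch of those updates for points in the session/page/time feature space; the class name, the choice m = 2, the iteration count and the toy data are illustrative assumptions.

import java.util.Random;

// Minimal sketch of the standard fuzzy c-means updates behind step 10:
// membership update u[i][j] = 1 / sum over k of (d_ij / d_ik)^(2/(m-1)),
// centre update     c[j]    = sum_i u[i][j]^m * x[i] / sum_i u[i][j]^m.
// Plain FCM only (not FPCM); m, the iteration count and the data are illustrative.
public class FuzzyCMeans {

    public static double[][] cluster(double[][] x, int c, double m, int iterations) {
        int n = x.length, d = x[0].length;
        double[][] u = new double[n][c];
        Random rnd = new Random(42);
        for (int i = 0; i < n; i++) {                 // random initial memberships, rows sum to 1
            double sum = 0;
            for (int j = 0; j < c; j++) { u[i][j] = rnd.nextDouble() + 1e-6; sum += u[i][j]; }
            for (int j = 0; j < c; j++) u[i][j] /= sum;
        }
        double[][] centres = new double[c][d];
        for (int it = 0; it < iterations; it++) {
            for (int j = 0; j < c; j++) {             // centre update
                double[] num = new double[d];
                double den = 0;
                for (int i = 0; i < n; i++) {
                    double w = Math.pow(u[i][j], m);
                    den += w;
                    for (int k = 0; k < d; k++) num[k] += w * x[i][k];
                }
                for (int k = 0; k < d; k++) centres[j][k] = num[k] / den;
            }
            for (int i = 0; i < n; i++) {             // membership update
                for (int j = 0; j < c; j++) {
                    double dij = Math.max(distance(x[i], centres[j]), 1e-12);
                    double sum = 0;
                    for (int k = 0; k < c; k++) {
                        double dik = Math.max(distance(x[i], centres[k]), 1e-12);
                        sum += Math.pow(dij / dik, 2.0 / (m - 1.0));
                    }
                    u[i][j] = 1.0 / sum;
                }
            }
        }
        return u;                                      // final membership matrix U
    }

    private static double distance(double[] a, double[] b) {
        double s = 0;
        for (int k = 0; k < a.length; k++) s += (a[k] - b[k]) * (a[k] - b[k]);
        return Math.sqrt(s);
    }

    public static void main(String[] args) {
        // toy session/page/time feature vectors (illustrative values only)
        double[][] data = { {1, 2, 30}, {1, 3, 25}, {8, 9, 300}, {9, 8, 280} };
        double[][] u = cluster(data, 2, 2.0, 50);
        for (double[] row : u) System.out.printf("%.2f  %.2f%n", row[0], row[1]);
    }
}

The returned membership matrix U plays the same role as the µij values in the objective above: a user session can belong to more than one cluster with different degrees, which is what makes the soft approach more tolerant of noisy log data than crisp k-means.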
Fig. 5. User Visit per hour Graph
Fig. 6. Page view Graph
D. Experimental Result
For evaluating the proposed technique, the database is selected from 14 colleges of Northern Indian universities and engineering colleges, in the form of web logs. The program is implemented in MATLAB and in Java, and only one week's database is taken here for the experimental results. We also check the complexity of the algorithm to show that the output of our approach is up to the mark and more efficient than the other approaches. The database contains a total of 256789 records across the web log files, approximately 4503 visits per file. Before cleaning, the size of a single file is approximately 1.288 MB, and after cleaning all fields its size is reduced to about 498 KB. The proposed approach is developed in Java, and the clustering technique is applied to the test data set in MATLAB. After the final optimization we find that our approach is simpler and more refined than the other approaches, and that it gives more effective results for user behavior analysis.
Fig. 7. Page visit Graph

Fig. 8. Cluster Generation based on user identification

CONCLUSION AND FUTURE WORKS
The web is one of the main sources of information. The results are based on the evaluation of the web log files of 14 colleges on busy and on normal working days. After the evaluation we find that the fuzzy logic approach defines the clusters more accurately and provides more accurate results, and that the prediction model based on the Markov chain and the Bayesian theorem is more accurate when combined with fuzzy clustering than with k-means clustering. For future work we intend to explore the use of these techniques in automated software for predicting users' next visits. This helps in analyzing user behavior and understanding the nature of user navigation, and the proposed approach helps in modifying the web site on the basis of user interest.

REFERENCES
[1] G. Sudhamathy, C. J. Venkateswaran, "Web log clustering approaches - a survey", IJCSE, ISSN 0975-3397, Vol. 3, No. 7, July 2011.
[2] J. Vellingiri, S. Chenthur Pandian, "Fuzzy Possibilistic C-Means Algorithm for Clustering on Web Usage Mining to Predict the User Behavior", European Journal of Scientific Research, ISSN 1450-216X, Vol. 58, No. 2, 2011, pp. 222-230.
[3] Hussain Tasawar, Asghar Sohail and Fong Simon, "A hierarchical cluster based preprocessing methodology for Web Usage Mining", 6th International Conference on Advanced Information Management and Service (IMS), pp. 472-477, 2010.
[4] Vinita Shrivastava, Neetesh Gupta, "Performance Improvement of Web Usage Mining by Using Learning Based K-Mean Clustering", International Journal of Computer Science and its Applications, ISSN 2250-3765.
[5] Chu-Hui Lee, Yu-Hsiang Fu, "Two level prediction model for user's browsing behavior", Proceedings of the International MultiConference of Engineers and Computer Scientists 2008, Vol. I, IMECS 2008, 19-21 March 2008, Hong Kong.
[6] R. Khanchana and M. Punithavalli, "Web Usage Mining for Predicting Users' Browsing Behaviors by using FPCM Clustering", IACSIT International Journal of Engineering and Technology, Vol. 3, No. 5, October 2011.
[7] Harish Kumar, Anil Kumar Solanki, "Effective Cleaning of Educational Web Site Usage Patterns and Predicting their Next Visit", International Journal of Computer Applications (0975-8887), Vol. 53, No. 4, September 2012.
[8] Harish Kumar, Anil Kumar Solanki, "Analysis of Educational Web Pattern Using Adaptive Markov Chain For Next Page Access Prediction", International Journal of Computer Science and Information Security, Vol. 9, No. 7, July 2011.
[9] Bindu Madhuri, Anand Chandulal J., Ramya K., Phanidra M., "Analysis of Users' Web Navigation Behavior using GRPA with Variable Length Markov Chains", IJDKP.2011.1201.
[10] B. Ramesh Babu, R. Jeyshankar, "Websites of central universities in India: A webometric analysis", DESIDOC Journal of Library and Information Technology, Vol. 30, No. 4, July 2010.
[11] Harish Kumar, Anil Kumar Solanki, "Clustering algorithms employed in web usage mining: An overview", INDIACom-2011, ISSN 0973-7529, ISBN 978-93-80544-00-7.

AUTHOR PROFILE:
Harish Kumar has completed his M.Tech (IT) from Guru Gobind Singh Indraprastha University, Delhi. He is currently pursuing his Ph.D from Mewar University, Chittorgarh.
Prof. (Dr.) Anil Kumar Solanki did his PhD in CSE from Bundelkhand University. He has published a good number of papers in national and international journals.