Mrs. V. SUJATHA et al. / (IJAEST) INTERNATIONAL JOURNAL OF ADVANCED ENGINEERING SCIENCES AND TECHNOLOGIES Vol No. 1, Issue No. 2, 112 - 117
AN APPROACH TO USER NAVIGATION PATTERN BASED ON ANT BASED CLUSTERING AND CLASSIFICATION USING DECISION TREES
Mrs. V. SUJATHA 1*, Dr. PUNITHAVALLI 2
1* Computer Science Department, CMS College of Science and Commerce, Coimbatore, India, sujatha.padmakumar@rediffmail.com
2 Computer Application Department, SNS Arts and Science College women’s, Coimbatore, India, mpunitha_srcw@yahoo.co.in
Abstract: Web Usage Mining (WUM) is the automatic discovery of user access patterns from web servers. Organizations collect large volumes of data in their daily operations, generated automatically by web servers and collected in server access logs. WUM can also provide information on how to restructure a website so that it delivers its service effectively. This paper presents how to mine the secondary data (web logs) derived from the users' interaction with web pages during a certain period of web sessions. First, an ant-based clustering algorithm is applied to the pre-processed log files to extract frequent patterns, which are displayed in an interpretable format; second, a decision tree method is used to find and predict the user's navigation behavior. Two phases are used: the offline phase is based on ant-based clustering and the online phase is based on decision trees. The experimental results indicate that the approach can improve the quality of clustering for user navigation patterns in web usage mining systems. These results can be used for predicting a user's next request in large web sites.

Keywords - Web usage mining, web mining, web log files, classification and navigation pattern

I. INTRODUCTION
The term web mining was coined by Etzioni in 1996 to signify the use of data mining techniques to automatically discover web documents and services, uncover general patterns on the web and observe user behavior (viewing, bookmarking and browsing history). Web mining is the process of finding out what users are looking for on the internet. Some users might be looking at only textual data, whereas others might be interested in multimedia data. Web mining is classified into three categories: web content mining, web structure mining and web usage mining.

Web usage mining focuses on techniques that could predict user behavior while the user interacts with the web. As mentioned before, the mined data in this category are the secondary data on the web that result from this interaction. These data can range very widely, but generally they are classified into usage data that reside in the web client, the proxy server and the web servers. The aim of understanding the navigation preferences of visitors is to enhance the quality of electronic commerce (e-commerce) services, to personalize web portals or to improve the web structure and web server performance. Web usage mining has three stages: the first stage is preprocessing, the next stage is pattern discovery and the last stage is pattern analysis.
ISSN: 2230-7818
@ 2010 http://www.ijaest.iserp.org. All rights Reserved.
Page 112
Fig 1: General Architecture for Web Usage Mining

II. WEB USAGE MINING ARCHITECTURE

A. Preprocessing
Pre-processing "consists of converting the usage, content, and structure information contained in the various available data sources into the data abstractions necessary for pattern discovery". This step can be broken into at least four sub-steps: Data Cleaning, User Identification, Session Identification and Formatting. In the data cleaning step, unneeded data are deleted from the raw data in the web log files. At least two log file formats exist: Common Log File format (CLF) and Extended Log File format (see [16] for more details). Our university log file consists of these fields: Date, Time, Client IP address, Method, URI stem, Protocol status, Bytes sent, Protocol version, Host, User Agent and Referrer.

B. Pattern Discovery
Statistical Analysis, such as frequency analysis, mean, median, etc.
Clustering of users helps to discover groups of users with similar navigation patterns (to provide personalized Web content).
Classification is the technique of mapping a data item into one of several predefined classes.
Association Rules discover correlations among pages accessed together by a client.
Sequential Patterns extract frequently occurring inter-session patterns, such that the presence of a set of items is followed by another item in time order.
Dependency Modeling determines whether there are any significant dependencies among the variables in the Web.

C. Pattern Analysis
Pattern Analysis is the final stage of WUM (Web Usage Mining), which involves the validation and interpretation of the mined patterns.
Validation: to eliminate the irrelevant rules or patterns and to extract the interesting rules or patterns from the output of the pattern discovery process.
Interpretation: the output of mining algorithms is mainly in mathematical form and is not suitable for direct human interpretation.

III. RELATED WORK
Identifying Web browsing strategies is a crucial step in Website design and evaluation, and requires approaches that provide information on both the extent of any particular type of user behavior and the motivations for such behavior [9]. Pattern discovery from web data is the key component of web mining, and it draws together algorithms and techniques from several research areas. Baraglia and Palmerini (2002) proposed a WUM system called SUGGEST that provides useful information to ease web user navigation and to optimize web server performance. Liu and Keselj (2007) proposed a novel approach to automatically classifying web user navigation patterns and predicting users’ future requests, and Mobasher (2003) presented a Web Personalizer system which provides dynamic recommendations, as a list of hypertext links, to users. Jespersen et al. (2002) [10] proposed a hybrid approach for analyzing visitor click sequences. Jalali et al. (2008a [7] and 2008b [8]) proposed a system for discovering user navigation patterns using a graph partitioning model, in which an undirected graph based on connectivity between each pair of Web pages was considered and weights were assigned to the edges of the graph. Dixit and Gadge (2010) [5] presented another user navigation pattern mining system based on graph partitioning. An undirected graph based on connectivity between Referrer and URI pages was presented, along with a preprocessing method for raw web log files and a formula for assigning weights to the edges of the undirected graph. Ant-based clustering, owing to its flexibility and self-organization, has been applied in a variety of areas, from problems arising in e-commerce to circuit design, and from text mining to web mining (Jianbin et al., 2000). This section has reviewed the works proposed in this area with particular emphasis on web usage mining, clustering and classification. The present work is another attempt to propose a hybrid system that uses clustering and classification methods to discover users’ navigation patterns and analyze them from the server’s web log file.

IV. METHODOLOGY
The refined web log files are given as input to the ant-based clustering algorithm to find user behavior patterns; classification using decision trees is then applied to predict the user’s next request in large web sites. The hybrid system improves the quality of clustering for user navigation patterns in web usage mining systems.

Figure 2: Offline & Online phase

A. Offline phase of the architecture
This phase consists of two major modules: Data Pretreatment and Navigation Pattern Mining. It starts with the primary web-log preprocessing (data pretreatment) to extract user navigation sessions from the dataset, and then applies the clustering algorithm to mine navigational patterns.

B. Online phase of the architecture
During the online phase, when a new request arrives at the server, the requested URL and the session to which the user belongs are identified, the underlying knowledge base is updated, and a list of suggestions is appended to the requested page [6].

C. Prediction Engine
The main objective of the prediction engine in this part of the architecture is to classify user navigation patterns and predict users’ future requests.

D. Ant-based Clustering
In ant-based clustering and sorting, two related types of natural ant behavior are modeled. When clustering, ants gather items to form heaps. When sorting, ants discriminate between different kinds of items and spatially arrange them according to their properties. Lumer and Faieta proposed an ant-based data clustering algorithm (shown in Figure 3), which resembles the ant behavior described in [4].
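The pick-up and drop rules usually associated with Lumer and Faieta's algorithm can be sketched as follows. This is a minimal illustration under stated assumptions, not a reproduction of the procedure in Figure 3: the Jaccard dissimilarity between sessions, the constants k1, k2 and alpha, and all function names are choices made here for the example. The sample sessions are page-id sets taken from the navigation patterns reported in Section V.

```python
def jaccard_distance(a, b):
    """Dissimilarity between two sessions given as sets of page ids (assumed measure)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def neighborhood_density(item, neighbors, alpha=0.5, cells=9):
    """Local density f(i): average similarity of `item` to the items lying in the
    ant's neighborhood. `cells` is the number of grid cells in that neighborhood
    (9 for a 3x3 window); `alpha` scales the dissimilarity. Both are assumptions."""
    total = sum(1.0 - jaccard_distance(item, other) / alpha for other in neighbors)
    return max(0.0, total / cells)


def pick_probability(f, k1=0.1):
    """An unladen ant picks up an item with high probability when f is small."""
    return (k1 / (k1 + f)) ** 2


def drop_probability(f, k2=0.15):
    """A laden ant drops its item with high probability when f is large."""
    return 2.0 * f if f < k2 else 1.0


# An isolated session is almost surely picked up and never dropped where it stands.
isolated = neighborhood_density({1, 8, 13, 17}, [])

# A session surrounded by eight identical sessions stays in place.
crowded = neighborhood_density({1, 8, 13, 17}, [{1, 8, 13, 17}] * 8)

print(pick_probability(isolated), drop_probability(isolated))
print(pick_probability(crowded), drop_probability(crowded))
```

Repeating these two decisions for many ants over many steps moves similar sessions toward the same grid regions, which is the heap-forming behavior described above.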
Figure 3: Ant based algorithm

E. Decision Trees
Decision trees are used in classification and prediction. They are a simple yet powerful way of knowledge representation. The models produced by decision trees are represented in the form of a tree structure. A leaf node indicates the class of the examples. Instances are classified by sorting them down the tree from the root node to a leaf node.

Input: training samples, represented by discrete attributes; the set of candidate attributes, attribute-list.
Output: set of classes
Method:
1. Create a node N;
2. If samples are all of the same class C, then return N as a leaf node labeled with the class C;
3. If attribute-list is empty, then return N as a leaf node labeled with the most common class in samples (majority voting);
4. Select test-attribute, the attribute among attribute-list with the highest information gain ratio;
5. Label node N with test-attribute;
6. For each known value ai of test-attribute:
7. Grow a branch from node N for the condition test-attribute = ai;
8. Let si be the set of samples in samples for which test-attribute = ai;
9. If si is empty then
10. Attach a leaf labeled with the most common class in samples;
11. Else attach the node returned by generate decision-tree
Figure 4: Classification using decision trees

V. EXPERIMENTAL EVALUATION
In order to test the effectiveness of the proposed system, a server web log data file was obtained. The system was tested with data collected over 90 days; for ease of discussion, the experiments presented here are from one day, that is, data collected on 29-12-2009. As mentioned in Section II, the preprocessing is conducted in four steps, namely (i) Cleaning, (ii) User Identification, (iii) Session Identification and (iv) Formatting.

Figure 5: clusters group
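As a rough illustration of the four preprocessing steps listed in this section, the sketch below runs them over a handful of already-parsed log records. It is not the authors' implementation. The record keys, the list of ignored file suffixes, the use of (client IP, user agent) to identify a user, the 30-minute session timeout and the sample URIs are all assumptions made for the example; only the date and the IP addresses come from the experiment described here.

```python
from datetime import datetime, timedelta

IGNORED_SUFFIXES = (".gif", ".jpg", ".jpeg", ".png", ".css", ".js", ".ico")  # assumed
SESSION_TIMEOUT = timedelta(minutes=30)  # assumed inactivity threshold


def clean(records):
    """Step (i): keep successful GET requests for pages, dropping embedded resources."""
    return [r for r in records
            if r["method"] == "GET"
            and 200 <= r["status"] < 300
            and not r["uri"].lower().endswith(IGNORED_SUFFIXES)]


def identify_users(records):
    """Step (ii): group time-ordered requests per user, keyed by (IP, user agent)."""
    users = {}
    for r in sorted(records, key=lambda r: r["time"]):
        users.setdefault((r["ip"], r["agent"]), []).append(r)
    return users


def identify_sessions(user_records):
    """Step (iii): start a new session whenever the gap exceeds the timeout."""
    sessions, current = [], []
    for r in user_records:
        if current and r["time"] - current[-1]["time"] > SESSION_TIMEOUT:
            sessions.append(current)
            current = []
        current.append(r)
    if current:
        sessions.append(current)
    return sessions


def format_sessions(records):
    """Step (iv): emit one (ip, [uri, ...]) pair per session."""
    out = []
    for (ip, _agent), reqs in identify_users(clean(records)).items():
        for session in identify_sessions(reqs):
            out.append((ip, [r["uri"] for r in session]))
    return out


def rec(ip, hhmmss, uri, status=200, method="GET", agent="Mozilla"):
    """Build one hypothetical parsed log record for 29-12-2009."""
    t = datetime.strptime("2009-12-29 " + hhmmss, "%Y-%m-%d %H:%M:%S")
    return {"ip": ip, "time": t, "uri": uri, "status": status,
            "method": method, "agent": agent}


records = [
    rec("116.68.91.110", "10:00:00", "/index.html"),
    rec("116.68.91.110", "10:00:01", "/logo.gif"),               # removed by cleaning
    rec("117.204.97.156", "10:02:00", "/missing.html", status=404),  # removed by cleaning
    rec("117.204.97.156", "10:03:00", "/index.html"),
    rec("116.68.91.110", "10:05:00", "/courses.html"),
    rec("116.68.91.110", "11:00:00", "/index.html"),             # new session after 55 min
]

sessions = format_sessions(records)
for ip, pages in sessions:
    print(ip, pages)
```

In a full pipeline the URIs would then be mapped to page identifiers, giving user profiles in the form used by the clustering step.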
A. Output

Figure 8: Effect of cleaning step on raw web log file

S.No.  IP Address        User Profile              Unique Pages
1      116.68.91.110     1 15 3 8 15 17            {1, 15, 3, 8, 17}
2      117.204.97.156    1 8 3 11 15 6 1 17 23 6   {1, 6, 3, 11, 15, 17, 23}
3      118.94.8.197      1 2 8 6 17 2              {1, 2, 8, 6, 17}
4      119.27.62.254     1 4 9 11 23               {1, 4, 9, 11, 23}
5      121.242.52.2      1 8 13 1 17               {1, 8, 13, 17}
6      122.178.146.123   1 4 11 15 4               {1, 4, 11, 15}
Figure 6: Extracted navigation patterns

NP number  Navigational Pattern
1          (P1, P15, P3, P8, P17)
2          (P1, P6, P3, P11, P15, P17, P23)
3          (P1, P2, P8, P6, P17)
4          (P1, P4, P9, P11, P23)
5          (P1, P8, P13, P17)
6          (P1, P4, P11, P15)
Figure 7: Navigation pattern generated by clustering algorithm

Figure 9: interested user & non interested user

VI. CONCLUSION
This paper presented a new method to extract navigational patterns from web logs. The work focused on grouping the frequently accessed patterns of interested users. It assists web site designers in improving the performance of the web site by giving preference to the patterns navigated by regular, interested users. After the clustering is completed, alignment processing is applied to the extracted sequences in each cluster to extract a representative for each cluster. A classification algorithm is used in the online phase to predict the user's future requests.

VII. REFERENCES
[1] Abraham. Natural Computation for Business Intelligence from Web Usage Mining, Proceeding of Seventh International Symposium on Symbolic and Numeric Algorithms for Scientific Computing (SYNAC2005), pp. 3-11, 2005.
[2] Baraglia, R. and Palmerini, P. (2002) SUGGEST: A web usage mining system, Proc. of IEEE Int’l Conf. on Information Technology: Coding and Computing, P. 282.
[3] Clark, L., Ting, I.H., Kimble, C., Wright, P. and Kudenko, D. (2006) Combining ethnographic and clickstream data to identify user Web browsing strategies, Information Research, Vol. 11, No. 2.
[4] Deneubourg, J.L., Goss, S., Franks, N., Sendova-Franks, A., Detrain, C. and Chretien, L. (1990) The Dynamics of Collective Sorting: Robot-Like Ants and Ant-Like Robots, From Animals to Animats, Proc. of the 1st Int. Conf. on Simulation of Adaptive Behaviour, Pp. 356-363.
[5] Dixit, D. and Gadge, J. (2010) A New Approach for Clustering of Navigation Patterns of Online Users, International Journal of Engineering Science and Technology, Vol. 2, No. 6, Pp. 1670-1676.
[6] Handl, J. and Meyer, B. (2002) Improved ant-based clustering and sorting in a document retrieval interface, Proceedings of the Seventh International Conference on Parallel Problem Solving from Nature, Vol. 2439 of LNCS, Springer-Verlag, Berlin, Germany, Pp. 913-923.
[7] Jalali, M., Mustapha, M., Mamat, A. and Sulaiman, M.N.B. (2008a) A new clustering approach based on graph partitioning for navigation patterns mining, 9th International Conference on Pattern Recognition, Pp. 1-4.
[8] Jalali, M., Mustapha, N., Mamat, A., Sulaiman, N.B. (2008b) Web user navigation pattern mining approach based on graph partitioning algorithm, Journal of Theoretical and Applied Information Technology, Pp. 1125-1131.
[9] Jalali, M., Mustapha, N., Sulaiman, N.B. and Mamat, A. (2008c) A web usage mining approach based on LCS algorithm in online predicting recommendation systems, 12th International Conference Information Visualization, IEEE Computer Society, Pp. 302-307.
[10] Jespersen S.E., Thorhauge J., and Bach T. (2002) A Hybrid Approach to Web Usage Mining, Data Warehousing and Knowledge Discovery, LNCS 2454, Y. Kambayashi, W. Winiwarter, M. Arikawa (Eds.), Pp. 73-82.