International Journal of Computer Trends and Technology (IJCTT) – volume 9 number 2– Mar 2014
A Survey of Preprocessing Method for Web Usage Mining Process
Harmit Kaur1, Hardeep Singh2
1 (Department of CSE, Lovely Professional University, India)
2 (Department of ECE, Lovely Professional University, India)

Abstract
The number of web applications and the number of their users are both increasing rapidly. As the number of users grows, the size of the log files grows as well. The information stored in log files cannot be used directly for analysis, so preprocessing of log files is necessary to improve the quality of the web usage mining process. Preprocessing of log data improves the performance of the other two steps, pattern discovery and pattern analysis. Preprocessing involves data cleaning, user identification, session identification, and path completion. This paper surveys different preprocessing techniques and identifies the better techniques for improving performance.

Keywords – web usage mining, web server log, preprocessing of log file

1. Introduction
Data mining is the process of finding useful information in large databases. It is a knowledge discovery process that uses different techniques to extract knowledge from a database. Web data mining is an application of data mining: it is the process of extracting information from the web. Web data mining is categorized into three types: web content mining, web structure mining, and web usage mining [1]. Web content mining is the extraction of content from web sources; there are different web sources from which a user can get information. Web content mining is divided into two types, text mining and multimedia mining. Web structure mining is the process of discovering the link structure of the web. Many tools are available for retrieving information from web pages, but they ignore the valuable information contained in web links. Web usage mining is the process of extracting the behavior of users by applying data mining techniques. It deals with information that is used to understand the behavior of users interacting with a web site. This information is used to improve the structure of the website, improve performance, and provide fast and reliable access to users. Web usage mining is divided into three phases: preprocessing of log data, pattern discovery, and pattern analysis. Preprocessing of log files is a complex task, but it improves the quality of the other two steps, pattern discovery and pattern analysis.

This paper is organized as follows. Section 2 reviews the literature on web usage mining. Sections 3 and 4 describe the sources and attributes of log files. Section 5 explains the formats of log files. Section 6 describes the web usage mining process. Section 7 describes applications of web usage mining. Section 8 gives the conclusion.

2. Literature review
The paper [1] describes the preprocessing techniques used in data cleaning, data filtering, path completion, user identification, session identification, and web session clustering. The authors describe the different sources of log files, log file formats, preprocessing techniques, the algorithms applied, and the data support for the preprocessing phase, and they survey the techniques used in that phase. In the paper [2], web log data preprocessing is divided into the steps of log consolidation, data cleaning, user identification, and transaction identification. Log consolidation is the first step, in which the logs from different servers are combined in one place for data cleaning. The next step is data cleaning, which is divided into two parts: the first is page element cleaning, in which files with the extensions .gif, .jpeg, and .jpg are removed, and the second
ISSN: 2231-2803
http://www.ijcttjournal.org
type is the cleaning of other information, such as files with the extensions .css, .xsl, .xsd, and .dll. User identification involves the IP address, the user agent, and the referrer page. Transaction identification breaks a large transaction into several smaller ones or combines small transactions into a larger one; it is done using reference length and maximal forward reference.

The paper by Theint Aye [3] explains the web log cleaning process for mining log data. It gives an overview of the web usage mining process and of the preprocessing phase, and it presents two algorithms, one for data cleaning and one for field extraction. After preprocessing, the data is converted into a structured form, and an algorithm is then applied to mine information from it. In the data cleaning algorithm, image and multimedia files with extensions such as .jpg, .css, and .gif are removed based on the URL, but some irrelevant data cannot be removed by this algorithm. In the field extraction algorithm, a table is created, and the data is converted into a structured format and stored in the table.

The paper [4] explains data preprocessing of log files. Preprocessing starts with data fusion and cleaning, in which data from different sources is combined and irrelevant entries are removed. User identification and session identification are then performed; for session identification, two methods are given, time oriented and structure oriented. Finally, path completion is performed using graphs.
The paper [6] introduces two preprocessing algorithms, the first for data cleaning and the second for data reduction. In the data cleaning algorithm, records with the extensions .jpg, .gif, and .css are removed, but records with irrelevant status codes are not, so status-code filtering could be added in an improved algorithm. The data reduction algorithm identifies sessions and removes incomplete session entries.
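The extension-based cleaning described in [6], together with the status-code filtering noted above as a possible improvement, can be sketched as follows. This is a minimal illustration rather than the published algorithm: the record layout (a dict with `url` and `status` keys), the function names, and the choice to keep only 2xx responses are assumptions made for the example.

```python
import posixpath
from urllib.parse import urlparse

# Extensions named in the surveyed cleaning algorithms (images and style sheets).
IRRELEVANT_EXTENSIONS = {".jpg", ".jpeg", ".gif", ".css"}


def is_relevant(record):
    """Return True if a log record should survive data cleaning."""
    # Strip any query string before looking at the file extension.
    path = urlparse(record["url"]).path
    extension = posixpath.splitext(path)[1].lower()
    if extension in IRRELEVANT_EXTENSIONS:
        return False
    # Assumption: only successful (2xx) requests are kept for analysis.
    return 200 <= record["status"] < 300


def clean_log(records):
    """Drop records for embedded resources and unsuccessful requests."""
    return [record for record in records if is_relevant(record)]


# Illustrative records, not taken from a real log.
sample = [
    {"url": "/index.html", "status": 200},
    {"url": "/images/logo.gif", "status": 200},
    {"url": "/style.css?v=2", "status": 200},
    {"url": "/missing.html", "status": 404},
]
cleaned = clean_log(sample)
```

Keeping the extension set and the status rule in one predicate makes it easy to extend the filter, for example with the .xsl, .xsd, and .dll extensions mentioned in [2].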
3. Sources of log files
There are several sources of log files for the preprocessing phase. Some of them are described below.

3.1 Client log file
Client log files are the most authentic and accurate source for depicting user behavior. They store information about the client and the client's clicks. The information takes the form of a one-to-many relationship: the file records the websites visited by a particular user.

3.2 Proxy log file
Proxy log files are more complex because they store information in the form of a many-to-many relationship: one user can visit many sites, and many users can visit one site. A proxy server sits between client browsers and web servers and forwards requests from multiple clients to multiple web servers.

3.3 Server log file
Server log files store information in the form of a many-to-one relationship: many users can visit one website. User behavior can be captured from web server log files, which are accurate and reliable for the web usage mining process. The server stores information in several logs; the common ones are the access log, error log, referrer log, and agent log.

The access log records all clicks, hits, and accesses made by users. The behavior and interests of users are captured in this file and mined from it.

The error log records all errors of the website. For example, when a user opens a page by clicking a link and the page is not found, an error message such as "Error 404: file not found" is displayed. Error logs help improve the quality of the website by optimizing its links.

The referrer log records information about the referrer, which is the page from which the user jumps to a new page through a hyperlink.

The agent log contains information about the browser and operating system. This information helps in user identification, and the web site designer can also use it to make the website more compatible with the most used browsers and operating systems.
4. Attributes of log file
A log file stores information in a number of attributes. Some of the most important are the following [1].

Client IP: the IP address of the client machine from which the user browses the website.
Date and time: the date on which the user made the access, stored in the format YYYY-MM-DD, and the time, stored in the format HH:MM:SS.
Server client status: the status code returned by the server, such as 200 or 404.
User agent: the browser type, browser version, and operating system used by the client when accessing the website.
Referrer: the previous page from which the client jumps to the new web page or website. It links the current page or website to the previous one.
Server client bytes: the number of bytes sent by the server to the client.
Client server bytes: the number of bytes sent by the client to the server.
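As an illustration of how these attributes appear in practice, the sketch below reads a log in the W3C extended style, where a `#Fields:` directive names the columns. The field names (`c-ip`, `sc-status`, `cs(User-Agent)`, `cs(Referer)`, `sc-bytes`, `cs-bytes`) follow that format; the function name, the HH:MM:SS time layout, and the sample lines are assumptions made for the example, not data from the surveyed papers.

```python
from datetime import datetime


def parse_extended_log(lines):
    """Parse W3C-extended-style log lines into a list of dicts keyed by field name."""
    fields = []
    records = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#"):
            # Directive lines; only "#Fields:" matters for column names.
            if line.startswith("#Fields:"):
                fields = line[len("#Fields:"):].split()
            continue
        values = line.split()
        if len(values) != len(fields):
            continue  # skip malformed entries
        record = dict(zip(fields, values))
        if "date" in record and "time" in record:
            # Combine the separate date and time attributes into one timestamp.
            record["timestamp"] = datetime.strptime(
                record["date"] + " " + record["time"], "%Y-%m-%d %H:%M:%S"
            )
        records.append(record)
    return records


# Illustrative log content (addresses from the documentation range 192.0.2.0/24).
sample_log = """#Version: 1.0
#Fields: date time c-ip cs(User-Agent) cs(Referer) sc-status sc-bytes cs-bytes
2014-03-01 10:15:32 192.0.2.10 Mozilla/5.0+(Windows+NT+6.1) - 200 5120 310
2014-03-01 10:15:40 192.0.2.10 Mozilla/5.0+(Windows+NT+6.1) http://example.com/index.html 404 512 298
this line is malformed
""".splitlines()

records = parse_extended_log(sample_log)
```

Each resulting record exposes the client IP, status, user agent, referrer, and byte counts by name, which is the structured form the later preprocessing steps work on.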
5. Formats of log files
Several log file formats are available for capturing the behavior and activities of users on a website. Log files are text files, stored with the .txt extension in ASCII format. They are used to monitor user behavior and to gather feedback from the client side. Common formats for storing log records are the common log format, the Microsoft IIS log format, and the extended log format [2].

5.1 Common log format
The common log format stores the user's activities in a fixed set of attributes, such as IP address, date and time, duration, referrer log, and HTTP status. It is a standardized format [5].

5.2 Microsoft IIS log format
The Microsoft IIS log format is a customizable format. It has some additional attributes compared with the common log format, so it stores extra information about user behavior and records more data about the user's accesses [2].

5.3 Extended log file format
The extended log format is customizable. It is more flexible and can be customized according to user requirements. Additional attributes such as HTTP cookie, HTTP version, and HTTP user agent are used to capture user behavior.

6. Web usage mining process
The web usage mining process involves three steps: preprocessing of the log file, pattern discovery, and pattern analysis [7]. Its outcome is used for improving the structure of the web site and for personalization. Preprocessing is an important phase because log files cannot be used directly for analysis. The preprocessing phase is reported to take about 80% of the time of the whole process.

6.1 Preprocessing of web log data
Preprocessing is an essential step of the web usage mining process, and a web log file is its input. The quality of the web usage mining process depends not only on the sources of log data but also on preprocessing, which determines the quality of the other two steps, pattern discovery and pattern analysis. The preprocessing phase involves data cleaning, user identification, session identification, and path completion.

6.1.1 Data cleaning
Data cleaning is the first step in the preprocessing phase. It is the process of removing irrelevant records from log files. Records for files with the extensions .gif, .jpeg, and .jpg are removed [6]. All entries that are not required for the analysis are removed in this phase, including records for sound and graphics files. Some other files, such as .css and .xsl files, are also removed.
Fig. 1. Web log preprocessing
6.1.2 User identification
User identification is an important phase in the preprocessing of log files. It is the process of identifying the users who access the website: given n records in the server log file, user identification assigns a user to each record. Users are identified based on the client IP address, user registration, and other methods [3]. The common method relies on the IP address. If two records have different IP addresses, they belong to different users. If they have the same IP address but different user agents (a different operating system or browser type), they belong to different users. If the operating system and browser type are also the same, a record with a null referrer page is treated as a new user, while a record with a non-null referrer page may belong to the same user. This method is applicable only when IP addresses are static. User identification is a difficult and complex task, and because of proxy servers it cannot give fully accurate results. The result of the web usage mining process depends on user identification.

6.1.3 Session identification
A session is a sequence of page views by a single user during one visit. Session identification is the process of identifying the activity of each user in the log file during a period of time. It can be done in two ways, time oriented and structure oriented [5]. The time-oriented approach depends on the request times in the server log file and has two variants: in the first, the time between the first and the last request of a session is at most t; in the second, the time between one request and the next request is at most t. Structure-oriented session identification is based on the referrer page: a session is identified from the pages that the user reached by following links.
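The user identification heuristic and the two time-oriented session rules can be sketched as follows. This is an illustrative reading of the descriptions above, not an implementation from the cited papers: the record keys (`ip`, `agent`, `referrer`), the function names, and the threshold values used in the example (30 minutes for the session duration, 10 minutes for the gap) are assumptions, since the text only names a threshold t.

```python
from datetime import datetime, timedelta


def identify_users(records):
    """Label each record with a user id based on IP, user agent, and referrer."""
    current_user = {}  # (ip, agent) -> user id currently tied to this pair
    next_id = 0
    labelled = []
    for record in records:
        key = (record["ip"], record["agent"])
        # New IP or new agent: different user.
        # Same IP and agent but null referrer: treated as a new user.
        if key not in current_user or record["referrer"] is None:
            current_user[key] = next_id
            next_id += 1
        labelled.append({**record, "user": current_user[key]})
    return labelled


def split_sessions(requests, max_duration=None, max_gap=None):
    """Split one user's time-ordered (timestamp, url) requests into sessions.

    max_duration bounds the time from the first request of a session;
    max_gap bounds the time between consecutive requests.
    """
    sessions = []
    for timestamp, url in requests:
        if sessions:
            current = sessions[-1]
            too_long = (max_duration is not None
                        and timestamp - current[0][0] > max_duration)
            too_idle = (max_gap is not None
                        and timestamp - current[-1][0] > max_gap)
            if not (too_long or too_idle):
                current.append((timestamp, url))
                continue
        sessions.append([(timestamp, url)])  # start a new session
    return sessions


# Illustrative records, not taken from a real log.
labelled = identify_users([
    {"ip": "192.0.2.10", "agent": "Firefox", "referrer": None},
    {"ip": "192.0.2.10", "agent": "Firefox", "referrer": "/index.html"},
    {"ip": "192.0.2.10", "agent": "Chrome", "referrer": "/index.html"},
    {"ip": "192.0.2.11", "agent": "Firefox", "referrer": None},
])

start = datetime(2014, 3, 1, 10, 0)
requests = [(start + timedelta(minutes=8 * i), "/page%d.html" % i) for i in range(5)]
by_duration = split_sessions(requests, max_duration=timedelta(minutes=30))
by_gap = split_sessions(requests, max_gap=timedelta(minutes=10))
```

With requests eight minutes apart, the gap rule keeps all five requests in one session, while the duration rule starts a second session once more than 30 minutes have passed since the first request; the two variants can therefore disagree on the same log.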
6.1.4 Path completion
Path completion is the process of completing the access path of a user by means of the URL and the referrer page. The completed access path reflects the full structure of the pages, and the links between them, that the user accessed. Graphs are used to represent the user's access paths and the path completion process: each node represents a web page or website, and the edges represent the links between the pages or websites.
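One common way to use the referrer for path completion is sketched below. It rests on an assumption that the text does not state explicitly: when the referrer of a request is not the page requested just before it, the user is taken to have stepped back through earlier pages (for example with the browser's back button, served from cache), and those missing page views are re-inserted. The function names and the sample session are hypothetical.

```python
def complete_path(session):
    """Rebuild a user's full path from (url, referrer) pairs in request order."""
    path = []
    for url, referrer in session:
        if (path and referrer is not None
                and referrer != path[-1] and referrer in path):
            # Index of the most recent visit to the referrer page.
            last_seen = len(path) - 1 - path[::-1].index(referrer)
            # Re-insert the pages walked back through, ending at the referrer.
            path.extend(reversed(path[last_seen:-1]))
        path.append(url)
    return path


def build_graph(path):
    """Represent a completed path as a graph: page -> set of pages reached from it."""
    graph = {}
    for source, target in zip(path, path[1:]):
        graph.setdefault(source, set()).add(target)
    return graph


# Illustrative session: D is reached from B, although C was requested just before.
session = [("A", None), ("B", "A"), ("C", "B"), ("D", "B")]
path = complete_path(session)
graph = build_graph(path)
```

In this example the logged sequence A, B, C, D becomes A, B, C, B, D, because the step back from C to B never reached the server, and the resulting graph has B linked to both C and D.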
6.2 Pattern discovery
In pattern discovery, knowledge is discovered by classifying and clustering user activities. Classification is a technique for assigning data items to predefined classes; the data items in a particular class have the same properties [1].

6.3 Pattern analysis
Pattern analysis is the final step in the web usage mining process. It is the analysis of the information obtained from pattern discovery and can be done using OLAP tools.
7. Applications of web usage mining
Web usage mining has many applications. Some important ones are:
Personalization
Web site evaluation
System improvement

Personalization
Personalization is an important application of web usage mining: when a user interacts with a website, the website presents information according to the user's requirements. Personalization is one of the most widely researched areas in web usage mining. Adaptive web sites change their organization and presentation according to the preferences of the users accessing them. Web agent based systems are used for web personalization, and Amazon.com uses a similar technique.

Web site evaluation
Web site evaluation determines the modifications needed in the contents and link structure of a web site. The technique is to model users' navigation patterns and compare them with the patterns expected by the site designer.
8. Conclusion
Preprocessing of web log data is a necessary step in the web usage mining process. It improves the performance of the other two steps, pattern discovery and pattern analysis. Log files store their records as text with the .txt extension, and the server log file is the source of log data for the web usage mining process. Many preprocessing techniques exist; older preprocessing techniques can be combined with newer ones to improve the performance of the web usage mining process.
9. References
[1] Dafa-Alla, Mirghani A. Eltahir, and Anour F. A. (2013), "Extracting Knowledge from Web Server Logs Using Web Usage Mining", 2013 International Conference on Computing, Electrical and Electronic Engineering (ICCEEE).
[2] Tasawar Hussain, Dr. Asghar, and Dr. Masood (2010), "Preprocessing Techniques in Web Log Mining".
[3] Ma Shu Yue, Liu Wen Cai, and Wang Shuo (2010), "The Study on the Preprocessing in Web Log Mining".
[4] Theint Theint Aye (2011), "Web Log Cleaning for Mining of Web Usage Patterns", University of Computer Studies, Mandalay.
[5] Marathe Dagadu Mitharam (2012), "Preprocessing in Web Usage Mining".
[6] Navin Kumar Tyagi, A. K. Solanki, and Sanjay Tya (2010), "An Algorithmic Approach for Preprocessing in Web Usage Mining", International Journal of Information Technology and Knowledge Management, July-December 2010, Volume 2, No. 2, pp. 279-283.
[7] Jia Li (2013), "Research of Analysis of User Behavior Based on Web Log", 2013 International Conference on Computational and Information Sciences.
[8] Sitaramulu, K. Sudheer Reddy M. Kantha Reddy V. (2013), "An effective Data Preprocessing method for Web"