Fatemeh Dashti et al. / (IJAEST) INTERNATIONAL JOURNAL OF ADVANCED ENGINEERING SCIENCES AND TECHNOLOGIES Vol No. 1, Issue No. 1, 016 - 022
ISSN: 2230-7818
Optimizing the data search results in web using Genetic Algorithm
Fatemeh Dashti
Sama technical and vocational training school, Islamic Azad University, Tabriz Branch, Tabriz, Iran, fatemehdashti@yahoo.com
Solmaz Abdollahi Zad
Islamic Azad University – Sardrud Branch, Tabriz, Iran, Sk_abdolahi@yahoo.com
Abstract- One of the most important problems of web data search is recombining the words of a query, because query reformulation plays an effective role in search engine results. Web users try to build queries using their experience. What matters here is to develop a method that lets web users get their desired results without having to reformulate their query repeatedly. In this paper, the authors present a method that uses a genetic algorithm in a distributed way, according to the users' favorites, to optimize the query sent to the search engine and, ultimately, the quality of the result pages.
Keywords- Search results, context, genetic algorithm, search engine
I. INTRODUCTION
Nowadays, with the continuous development of the Internet, anyone can easily obtain information from it. However, although providing information on the Internet is now straightforward, the rapid growth of the World Wide Web has created a problem of information overload [1,2,8]. To help users find the information they need efficiently, different types of search engines, such as Yahoo, Google, and Infoseek, have been developed during the last few years. Search engines are useful tools for collecting and indexing web pages. After receiving a user-specified query, a search engine uses its internal strategy to narrow the vast amount of Internet information down to a certain range and retrieves the web pages that match the query [1,4]. However, retrieving sufficient relevant information online is difficult for many people, because they search with too few keywords and search engines cannot infer their real intent from those keywords. As a result, users cannot find the specific information they really need [2,4,6]. Accessing topical information through existing search engines requires the formulation of appropriate queries, which is highly challenging [5,7]. The selection of an appropriate query is therefore an optimization problem whose goal is to obtain, automatically, the best query for retrieving the information from the web [7]. This paper presents a method based on a distributed genetic algorithm that addresses the run-time problem of previous systems while also optimizing result quality.
One of the main features of this algorithm is its dynamic termination criterion, which avoids additional calculations in obtaining the optimized result without decreasing the probability of reaching it. This significantly reduces the run time, so the user waits only a short time to see the results. Another feature of the system is that the major part of the algorithm runs in parallel on independent web servers.
II. RELATED WORKS
Genetic algorithms are applied extensively in information retrieval (IR). Gordon [3,9] proposed a genetic algorithm based approach for document indexing. In his formulation, a keyword represents a gene, a document's list of keywords represents a chromosome, and a collection of relevant documents judged by a user represents the population. The population then evolves through generations and eventually finds a set of keywords which, in terms of the fitness function, best describes the documents. Petry et al. [10] applied genetic algorithms to a weighted IR system. In their design, a weighted Boolean query was modified to improve the recall rate and the precision rate. They found that the form of the fitness function had a significant effect on IR performance. Yang and Korfhage [12] used relevance feedback to develop adaptive retrieval methods based on genetic algorithms and the vector space model. They reported the effect of adopting genetic algorithms in large databases, the impact of genetic operators, and the parallel searching capability of genetic algorithms. Chen et al. [13,14] used the best first search algorithm and the genetic algorithm to develop a web spider system. They concluded that the genetic algorithm spider did not outperform the best first search spider, but they found both results to be comparable and complementary. Nick and Themis [15] employed a genetic algorithm in an intelligent agent system that recommends web pages directly to users. To help the intelligent agent learn a user's interests, the user is asked to provide some example web pages of interest in advance. Picarougne et al. [16] also developed a web spider system, called GeniMiner. They claimed that genetic search can be valuable when (1) the user can wait for a longer time than in standard search engines; (2) queries are more complex or more precise than a list of keywords. However, users usually expect quick responses from search engines. Vrajitoru [17] used a genetic algorithm to improve performance in IR systems. To prevent the classical crossover operator from producing fewer offspring than parents, the author used two crossover points instead of one and treated the two input individuals differently. Many studies adopt GAs to solve the user relevance feedback problem, which is one of the applications of IR [18-20].
In the method given by Caramia [2], an evolutionary algorithm is used. In every run of the algorithm, the query words are recombined according to the user's opinions and sent to the search engine, and the results are shown to the user, who selects the related pages by inspecting the result context. The next generation of queries is created from the pages the user selected, and this process continues until the user is satisfied with the search results. The major problem of this system is that the user has to interact with it extensively, inspecting the result list and judging its content in each run of the algorithm, which is very time-consuming and undesirable for the user. In addition, any minor mistake by the user diverts the algorithm from its main evolutionary direction and moves it away from the optimized result.
Finally, in 2007, Cecchini suggested a new automatic query formulation system using the genetic algorithm [7]. This system optimizes the queries by generating a large number of queries over several generations without user interaction. The major problem of this system is its long run time, since most of its work consists of sending queries to the search engine and receiving the results, which is a time-consuming process.
These studies found that the fitness function plays an important role in improving the precision rate. Thus, the design of the fitness function considers the documents that were retrieved [6]. Note that, in this paper, we compare our proposed method with the Cecchini method in both run time and result quality.
III. CHOICE OF GENETIC ALGORITHM
There are a number of reasons why we choose a GA in our method and why GAs are appropriate for the problem of context-based Web search [7]:
- Context-based Web search as an optimization problem. Generating high-quality queries for context-based search on the Web can be regarded as an optimization problem. The search space of the problem is defined as the set of possible queries that can be presented to a search engine. The objective function to be optimized is based on the effectiveness of a query in retrieving relevant material when presented to a search engine. Depending on the system goals, a measure of query effectiveness can be defined using traditional IR notions such as precision and recall, or other customized performance evaluation metrics.
- High-dimensional space. The query space is a high-dimensional space, where each possible term accounts for a new dimension. Problems of this kind cannot be solved effectively using analytical methods but are natural for GAs.
- Suboptimal solution. Successful Web search requires the formulation of high-quality queries even if the formulated queries are not the optimal ones. GAs do not guarantee the identification of optimal solutions but are usually successful in finding near-optimal ones.
- Multiple solutions. Each one of multiple sets of Web pages can represent a satisfactory result for a context-based search. Therefore, we may be interested in finding many high-quality queries rather than a single one. GAs can be naturally used for multimodal relevance optimization.
- Exploration and exploitation. Finding good combinations of query terms requires exploring different directions of the thematic-context space. This exploration must be independent of the initial population of queries, and it may require going beyond the initial set of terms by incorporating novel terms. Such a search process can be performed effectively by applying the genetic operators of crossover and mutation. In addition, the exploitation of the most promising combinations of terms is naturally induced by the selection mechanism.
IV. PROPOSED METHOD
In this paper, an effort has been made to optimize the formulation of query words by a genetic algorithm so that users get the most related search engine results. The following sections explain the details of the designed algorithm.
A. System Architecture
Fig 1 shows the general outline of the system. The system contains several web servers whose main job is to interact with search engines through the web. They also perform those parts of the system calculation that can run in parallel. There is also a central server that monitors the web servers and performs the main system functions.
Figure 1. The general outline of designed system
First, the central server randomly generates the initial population according to the users' favorites and the search context. Every individual of this population represents a query. The central server then divides the generated queries equally among the web servers.
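As an illustration of this first step, the following Python sketch builds a random initial population and deals the queries out to the web servers. The function names, the sample vocabulary and the round-robin split are assumptions made for the example; the population size of 60, the 4 web servers and the 32-word limit are the values the paper reports in its later sections.

```python
import random

MAX_QUERY_WORDS = 32  # upper bound on query length given later in the paper


def initial_population(favorites, context_words, pop_size, rng):
    """Build pop_size random queries from the user's favorites and the search context."""
    vocabulary = sorted(set(favorites) | set(context_words))
    population = []
    for _ in range(pop_size):
        # Query length is random, limited by the vocabulary and the 32-word bound.
        length = rng.randint(1, min(MAX_QUERY_WORDS, len(vocabulary)))
        population.append(rng.sample(vocabulary, length))
    return population


def split_among_servers(queries, n_servers):
    """Deal the queries out round-robin so server loads differ by at most one."""
    return [queries[i::n_servers] for i in range(n_servers)]


rng = random.Random(0)  # fixed seed only to make the sketch repeatable
population = initial_population(
    ["genetic", "algorithm"],               # hypothetical user favorites
    ["web", "search", "query", "ranking"],  # hypothetical search context
    60,
    rng,
)
shares = split_among_servers(population, 4)
print([len(share) for share in shares])  # [15, 15, 15, 15]
```

A round-robin split is only one way to divide the queries equally; any partition into equally sized shares would serve the same purpose.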
Each web server sends its queries to the search engine and, for each query, gets the addresses of the first l result pages together with the paragraph of each page that the search engine presents as the part most similar to the query.
Then the web servers split the paragraph of every link into words, delete the pronouns and prepositions, and record the remaining words and the number of times each is repeated in the Profile Store. This number is used to calculate the word weight: words that occur more often are considered more important and receive more weight. A weight is allocated to every word because the words of a query do not play the same role in expressing the user's purpose. Weighting the query words determines the value of each word within the query, and the word weights are also used to score the result pages.
Note that the words in the Profile Store are used by the mutation operator to generate new queries and to expand the search space.
As explained, for each result link the search engine selects and shows the paragraph of the page that is most similar to the query. The web servers must now select the k pages most similar to the query out of the l pages. To do this, they rank the list of links again according to page quality and keep the first k pages for the subsequent operations.
Similarly, the web servers use the page content, after removing HTML tags and additional words, to calculate the similarity between the user's intended concept and the page.
Moreover, once the web servers have the full text of the linked pages, they extract the links to other web pages and build the pages' link structure, because authoritative web pages are linked from many other web pages, and more links to a web page mean that its content has more authority [11]. Using the link structure, the web servers can recognize authoritative sites and give them higher scores.
Another factor is the number of times a web page is repeated in the result lists of the current query generation, because this reflects the importance of the page [11]. The web servers therefore use it as a factor in page scoring.
Thus three factors enter the score of each page: its similarity to the user's concept, the number of links to the page, and the number of times the page is repeated in the result lists of the current query generation. After a score has been calculated for every page, the average score of all pages is taken as the fitness of the query and is submitted to the central server. The central server produces the next generation through genetic operations, and this process continues until the genetic algorithm terminates. Finally, the best query among all generations, with the k pages of its result list, is presented to the user.
B. Ranking Results Page List
After receiving the results from the search engine, the web servers select the first l pages and rank them again according to their similarity to the sent query. Equation (1) is used for this purpose [8].
$$Sim(Q, D) = \sum_{i=1}^{t} w_i \times tf_i \times \log\left(\frac{N}{df_i}\right) \qquad (1)$$
Where:
Q: query
D: result web page
N: the number of result pages
$tf_i$: the number of occurrences of the i-th query word
$df_i$: the number of pages which include the i-th word of the query
$w_i$: the weight of the i-th query word
t: the number of query words
After this equation has been calculated for each result page, the result list is ranked and the first k pages are selected from the l pages, so that the next system operations can be performed on these k pages.
C. Pages link structure
Authoritative web pages are linked from many other web pages, so more links to a web page mean that its content has more authority [11]. The web servers therefore discover the link structure of the web pages so that they can recognize authoritative sites and give them higher scores. For this purpose, the web servers keep URLs and link information. Each URL is identified by a DocID in a <DocID,URL> record, and each link is stored as a <DocID_src,DocID_des> record, as in Fig. 2. The first field is the ID of the source page and the second is the ID of the destination page.
Figure 2. Pages link structure
To find the pages that link to a particular page w1, the DocID of w1 is first obtained from URL_TABLE, and then all link records whose DocID_des equals that DocID are fetched. The DocID_src field of the fetched records gives the DocIDs of the pages that link to w1.
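The lookup just described, and the authority degree that this subsection goes on to define in equations (2) and (3), can be sketched in Python as follows. The link set below is hypothetical: the graph of Fig. 2 is not reproduced in the text, so the links were chosen only so that the computed degrees match the values the paper reports for α = 3/2. The helper names, the example URLs and the cap on the number of rounds are likewise assumptions.

```python
ALPHA = 1.5  # the value of alpha used in the paper's worked example

# Hypothetical tables in the <DocID, URL> and <DocID_src, DocID_des> shapes.
URL_TABLE = {i: f"http://example.org/page{i}" for i in range(1, 8)}
LINK_TABLE = [(1, 2), (1, 3), (7, 4), (2, 5), (3, 6), (7, 6)]


def pages_linking_to(url, url_table, link_table):
    """Return the DocIDs of the pages that link to the page with the given URL."""
    doc_id = next(d for d, u in url_table.items() if u == url)
    return [src for src, des in link_table if des == doc_id]


def authority_degrees(url_table, link_table, alpha=ALPHA, max_rounds=100):
    """Repeat the update of eq. (3) until no degree changes (or max_rounds is hit)."""
    in_links = {d: [] for d in url_table}
    for src, des in link_table:
        in_links[des].append(src)
    degree = {d: 1.0 for d in url_table}  # eq. (2) fixes pages without in-links at 1
    for _ in range(max_rounds):
        updated = {
            d: alpha * sum(degree[s] for s in srcs) if srcs else 1.0
            for d, srcs in in_links.items()
        }
        if updated == degree:
            break
        degree = updated
    return degree


print(pages_linking_to(URL_TABLE[6], URL_TABLE, LINK_TABLE))  # [3, 7]
print(authority_degrees(URL_TABLE, LINK_TABLE))
# {1: 1.0, 2: 1.5, 3: 1.5, 4: 1.5, 5: 2.25, 6: 3.75, 7: 1.0}
```

The round cap is a safeguard added for the sketch: on a link graph that contains cycles, the update with α greater than 1 need not settle, a case the text does not discuss.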
To determine the link authority degree of the pages, the web servers first assign an authority degree of 1 to the pages that no other page links to:
$$L_y = 1 \ \text{for } y \in \theta, \qquad y \in \theta \iff \nexists\, x \mid x \to y \qquad (2)$$
The authority degree is then updated recursively via equation (3). This updating process continues until no authority degree changes.
$$L_j = \alpha \sum_{i \in N,\; i \to j} L_i \qquad (3)$$
For example, for the link structure of Fig 2 with $\alpha = \frac{3}{2}$, the authority degrees are:
$$L_1 = L_7 = 1, \quad L_2 = L_3 = L_4 = 1.5, \quad L_5 = 2.25, \quad L_6 = 3.75$$
D. Profile Store
After selecting the first k pages out of the l pages, the web servers process, for each result page, the paragraph that the search engine presented as the part of the page most similar to the query.
First, they split the paragraph into words, delete the pronouns and prepositions, save the remaining words in the Profile Store, and assign each word a weight according to its number of repetitions.
The Profile Store is used in the mutation operation of the genetic algorithm to expand the search domain by adding new query structures.
Figure 3. Profile Store
E. Population and chromosomes representation method
The queries formulated for the search engine make up the search space. Search engines accept any combination of words up to 32 words [7]. As a result, the chromosome length varies between 1 and 32 words, and each chromosome represents a query. First, the central server assigns a random weight to each of the user's favorites and saves them in the Profile Store. It then generates the random queries of the initial population according to the user's content and the words of the Profile Store and divides them equally among the web servers.
F. Fitness Function
The fitness of each query depends on the quality of its result pages. To calculate page quality, three independent expressions are used. The first expression (t1) measures the similarity of the page to the sent query and is obtained from equation (4). In this equation, the weight of each query word stored in the Profile Store is used, because the words forming a query do not have the same value and some words are more important than others. The weight of each word is multiplied by the number of its repetitions in the result page, and the sum is divided by the total weight of the query words to obtain the similarity of the page to the sent query.
$$Sim(Q, D) = \frac{\sum_{i=1}^{t} w_{qi} \times w_{Di}}{\sum_{i=1}^{t} w_{qi}} \qquad (4)$$
Where:
Q: query
D: result web page
t: the number of query words
$w_{qi}$: the weight of the i-th query word
$w_{Di}$: the number of repetitions of the i-th query word in the result page
The second expression (t2) is the link authority degree (explained in Section IV.C), and the third (t3) is the number of times the web page is repeated in the result lists of the current query generation, because this reflects the importance of the page. The score of each web page is obtained through the following equation.
$$Score(D) = \alpha \cdot t_1 + \beta \cdot t_2 + \gamma \cdot t_3 \qquad (5)$$
α, β and γ are the expression coefficients, which balance the expressions in page scoring. Finally, the fitness value of each query is calculated as the average of its result page scores.
$$Fitness\text{-}Function(Q) = \frac{\sum_{i=1}^{k} Score(D_i)}{k} \qquad (6)$$
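Equations (4) to (6) can be combined into a short scoring routine. The Python sketch below is only illustrative: the paper does not state the values of the coefficients α, β and γ, so they default to 1.0 here, and the word weights, word counts and the t2 and t3 values are made-up inputs.

```python
def similarity(query_weights, page_word_counts):
    """Equation (4): weighted word repetitions divided by the total query weight."""
    total_weight = sum(query_weights.values())
    if total_weight == 0:
        return 0.0  # guard added for the sketch; the paper does not cover this case
    weighted = sum(
        weight * page_word_counts.get(word, 0)
        for word, weight in query_weights.items()
    )
    return weighted / total_weight


def page_score(t1, t2, t3, alpha=1.0, beta=1.0, gamma=1.0):
    """Equation (5): weighted sum of similarity, authority degree and repetition count."""
    return alpha * t1 + beta * t2 + gamma * t3


def query_fitness(page_scores):
    """Equation (6): mean score of the k result pages of a query."""
    return sum(page_scores) / len(page_scores)


# Hypothetical query word weights and two hypothetical result pages.
weights = {"genetic": 3, "search": 1}
t1_first = similarity(weights, {"genetic": 2, "search": 4, "web": 7})  # (3*2 + 1*4) / 4
t1_second = similarity(weights, {"search": 4})                         # (1*4) / 4
scores = [page_score(t1_first, 1.5, 2.0), page_score(t1_second, 1.0, 1.0)]
print(t1_first, t1_second, scores, query_fitness(scores))
# 2.5 1.0 [6.0, 3.0] 4.5
```

Because the three expressions have different ranges, the choice of coefficients decides which factor dominates the score; the equal coefficients above are a placeholder, not the authors' setting.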
In equation (6), k is the number of result pages and $D_i$ is the i-th result web page.
G. Cross Over operation
In the designed genetic algorithm, a percentage of the individuals is carried over to the next generation through elitism, and the other individuals are produced by a two-point crossover operator. First, the parents are selected from the current generation by the tournament method and are combined. Since the chromosomes vary in length, their lengths are first equalized: the shorter chromosome is padded with X (don't care) symbols at its end. Two points are then selected randomly along their length, the chromosomes are broken at those points and recombined, and finally the offspring are restored to standard form.
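A minimal Python sketch of the padded two-point crossover follows. It assumes that restoring the offspring to "standard form" means removing the X padding, which the text does not spell out, and that no real query word equals "X". Tournament selection and elitism are left out, and the function name and sample queries are hypothetical.

```python
import random

PAD = "X"  # the "don't care" symbol used to equalize chromosome lengths


def two_point_crossover(parent_a, parent_b, rng):
    """Pad the shorter parent, swap the segment between two random cut points, strip padding."""
    length = max(len(parent_a), len(parent_b))
    a = parent_a + [PAD] * (length - len(parent_a))
    b = parent_b + [PAD] * (length - len(parent_b))
    i, j = sorted(rng.sample(range(length + 1), 2))  # two distinct cut points, i < j
    child_a = a[:i] + b[i:j] + a[j:]
    child_b = b[:i] + a[i:j] + b[j:]
    return (
        [word for word in child_a if word != PAD],
        [word for word in child_b if word != PAD],
    )


rng = random.Random(7)  # fixed seed only to make the sketch repeatable
parent_a = ["genetic", "algorithm"]
parent_b = ["web", "search", "query", "optimization"]
child_a, child_b = two_point_crossover(parent_a, parent_b, rng)
print(child_a, child_b)
```

Because the two children only exchange a segment, every word of the parents ends up in exactly one child, so the operator recombines existing terms without introducing new ones.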
Figure 4. Cross Over operator
H. Mutation operator
Two types of mutation are used. In the first type, one of the query words is selected randomly and deleted. This decreases the query size and, as a result, increases the variety of the search result pages, since more keywords narrow the result domain of a query.
In the second type, a word is selected randomly from the Profile Store and replaces one of the query words. This opens a new result domain, expands the search space and, as a result, adds variety to the results.
Figure 5. Two types of mutation
I. Genetic algorithm's termination criteria
We assume that most users want quick responses from search engines, even if these responses are suboptimal rather than globally optimal. If we produce many generations of chromosomes, we perform extra, ineffective calculations without any further significant improvement. On the other hand, if we produce only a few generations, the desired results may not be reached. Deciding on the number of generations is therefore difficult [9].
In the proposed method, the termination criterion of the genetic algorithm is determined dynamically, based on the improvement of the fitness value in every generation [9]. We first give several definitions. If the maximal fitness value of the chromosomes at the current generation is equal to that of the previous generation, we say that the current generation has made no improvement. If the number of consecutive generations (NG) without improvement is large, there is only a slim chance of making further improvement. For generation g we define the improvement ratio $I_g$ as follows:
$$I_g = \frac{f_g - f_{g-1}}{f_{g-1}}, \quad g > 1 \qquad (7)$$
where $f_g$ and $f_{g-1}$ are the average fitness values of all chromosomes at generations g and g-1. Let $\bar{I}_g$ be the average of all improvement ratios:
$$\bar{I}_g = \mathrm{avg}(I_j), \quad 1 \le j \le g \qquad (8)$$
Generation g is said to make no significant improvement if $I_g \le \bar{I}_g$, that is, if the improvement ratio at the current generation is not larger than the average improvement ratio of all generations so far. We then define the allowable NG without significant improvement for generation g as follows:
$$MG_g = \frac{I_g}{\bar{I}_g} \times Pop\_size \times \frac{\sigma_g}{\bar{\sigma}_g} \qquad (9)$$
where $\bar{I}_g > 0$ and $\bar{\sigma}_g > 0$; $Pop\_size$ is the initial population size; $\sigma_g$ is the standard deviation of all $I_j$ ($1 \le j \le g$); and $\bar{\sigma}_g$ is the average of $\sigma_j$ ($1 \le j \le g$). As indicated in the above definition, the allowable NG without significant improvement is determined dynamically on the basis of the improvement ratio at the current generation and the history of improvement. If either $I_g / \bar{I}_g$ or $\sigma_g / \bar{\sigma}_g$ is large, there is significant progress in the fitness value at generation g. This implies that the probability of further progress is also high and that the probability of having reached the optimal solution at this generation is low. We now define the termination criterion as follows: the genetic algorithm is terminated whenever the consecutive NG without improvement is larger than the allowable NG without significant improvement at the current generation. We take this as a sign that further computation would not yield much progress.
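The termination test built from equations (7) to (9) can be sketched in Python as below. Several choices are assumptions rather than statements from the paper: the population standard deviation (`statistics.pstdev`) is used for σ, the search simply continues while the averages in equation (9) are not yet positive, "no improvement" is counted on the maximal fitness as in the definition above, and all fitness values are taken to be positive so that the division in equation (7) is defined.

```python
from statistics import mean, pstdev


def improvement_ratios(avg_fitness):
    """Equation (7) for generations 2..g; avg_fitness[0] belongs to generation 1."""
    return [(cur - prev) / prev for prev, cur in zip(avg_fitness, avg_fitness[1:])]


def allowable_generations(ratios, pop_size):
    """Equation (9); returns None while the averages are not yet positive."""
    sigmas = [pstdev(ratios[: n + 1]) for n in range(len(ratios))]
    mean_ratio, mean_sigma = mean(ratios), mean(sigmas)  # eq. (8) and its sigma analogue
    if mean_ratio <= 0 or mean_sigma <= 0:
        return None
    return (ratios[-1] / mean_ratio) * pop_size * (sigmas[-1] / mean_sigma)


def should_terminate(avg_fitness, best_fitness, pop_size):
    """True when the run of generations without improvement exceeds the allowance."""
    if len(avg_fitness) < 3:
        return False
    allowed = allowable_generations(improvement_ratios(avg_fitness), pop_size)
    if allowed is None:
        return False
    stalled = 0  # consecutive latest generations whose best fitness did not change
    for prev, cur in zip(reversed(best_fitness[:-1]), reversed(best_fitness[1:])):
        if cur != prev:
            break
        stalled += 1
    return stalled > allowed


# Hypothetical histories: fitness rises for two generations and then stalls.
print(should_terminate([1.0, 2.0, 3.0, 3.0, 3.0], [2, 4, 5, 5, 5], 60))  # True
# Still improving in the latest generation, so the search continues.
print(should_terminate([1.0, 1.5, 3.0], [2, 3, 6], 60))  # False
```

In the first history the latest improvement ratio is zero, so the allowance of equation (9) collapses to zero and two stalled generations are enough to stop; in the second the allowance is large and nothing has stalled.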
V. SIMULATION
The proposed method has been simulated in the C# language and examined several times with various inputs.
In each run, 4 web servers, a population of 60 chromosomes, a crossover probability of 0.7 and a mutation probability of 0.3 were used.
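To make these settings concrete, the Python sketch below applies the two mutation types of subsection H to a population of 60 queries with the stated mutation probability of 0.3. The simulation itself was written in C#, so this is not the authors' code. The equal chance of choosing either mutation type, the rule that a one-word query is never shortened, and the flat word list standing in for the Profile Store are assumptions made for the example.

```python
import random

MUTATION_PROB = 0.3  # mutation probability reported for the simulation


def mutate(query, profile_store, rng):
    """Return a mutated copy of query using one of the two mutation types."""
    query = list(query)
    if len(query) > 1 and rng.random() < 0.5:
        # Type 1: delete a randomly selected query word.
        del query[rng.randrange(len(query))]
    else:
        # Type 2: replace a random query word with a random Profile Store word.
        query[rng.randrange(len(query))] = rng.choice(profile_store)
    return query


def mutate_population(population, profile_store, rng, prob=MUTATION_PROB):
    """Mutate each query with probability prob; the other queries are copied unchanged."""
    return [
        mutate(query, profile_store, rng) if rng.random() < prob else list(query)
        for query in population
    ]


rng = random.Random(42)  # fixed seed only to make the sketch repeatable
profile_store = ["genetic", "algorithm", "query", "ranking", "web"]  # hypothetical words
population = [["web", "search"], ["search"], ["genetic", "search", "engine"]] * 20
mutated = mutate_population(population, profile_store, rng)
print(len(mutated))  # 60
```

Type 1 can only shorten a query and type 2 keeps its length, so queries never grow under mutation; longer queries arise only from crossover or from the initial population.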
The initial population is created randomly from the entered content and the user's favorites, and the query length is selected randomly between 1 and 32. Fig 6 shows the performance of the genetic algorithm. For each generation, the average fitness of all chromosomes was calculated, and the results were then averaged over 10 independent runs. The number of generations created varies between runs; its average is 23.
Figure 6. The average fitness of chromosomes in 20 generations
The results show that the designed genetic algorithm is able to increase chromosome fitness considerably as new generations are created. As a result, it offers pages of high quality that are more similar to the user's need.
VI. COMPARISON WITH THE PREVIOUS SIMILAR METHOD
The proposed method has some advantages over the previous similar algorithm (the Cecchini algorithm). One advantage is that the system is designed in a distributed form, which decreases the program run time and hence the user wait time. Fig 7 shows the comparison: the two systems were run 10 times with the same inputs, and the results are shown in a diagram.
Figure 7. The comparison between the proposed algorithm and the similar previous algorithm run time
The fitness function is another feature of the system. It is made up of three expressions and is more effective than the fitness function of the previous similar method. In the proposed method, the two types of mutation, the Profile Store construction method and the other policies lead to more effective results than the previous method, and determining the number of generations dynamically also plays an important role in increasing system efficiency.
A comparison between the result quality of the proposed algorithm and that of the previous similar algorithm, made using the cosine similarity measure, is shown in Fig 8.
To make the comparison fair, the two systems were run 10 times with the same inputs and the mean of the results was calculated.
Figure 8. The comparison between the results quality of the proposed algorithm and the similar previous algorithm
VII. CONCLUSION AND FUTURE WORKS
Search engines search the web using the keywords entered by users and show the results. However, most users are not able to find suitable keywords for the concept they are searching for.
The purpose is to obtain the best query automatically in order to get the required information from the web. This paper therefore presents a method based on a genetic algorithm that optimizes the query and formulates an appropriate combination of keywords to reach the user's real goal. To decrease the user wait time, the system is designed in a distributed form across several web servers. In addition, to avoid unnecessary calculation in obtaining the optimized result, the termination of the genetic algorithm is determined dynamically. As future work, the fitness function can be improved: because it plays an important part in the proposed method, other factors can be included in its calculation. The proposed system can also be changed in a way that lets the user interact with the system during query evaluation and express opinions in order to reach the optimized result faster. Ultimately, as the results improve, the wait time for the final result decreases.
REFERENCES
[1] Wei-Po Lee, Tsung-Che Tsai, "An interactive agent-based system for concept-based web search", Expert Systems with Applications 24, 2003, 365-373.
[2] M. Caramia, G. Felici, A. Pezzoli, "Improving search results with data mining in a thematic search engine", Elsevier Computers and Operations Research 31, 2004, 2387-2404.
[3] M. Gordon, "Probabilistic and genetic algorithms for document retrieval", Commun. ACM 31 (10), 1988, 1208-1218.
[4] Fabio Crestani, Puay Leng Lee, "Searching the web by constrained spreading activation", Elsevier Information Processing and Management 36, 2000, 585-605.
[5] Filippo Menczer, "Complementing search engines with online web mining agents", Elsevier Decision Support Systems 35, 2003, 195-212.
[6] Lin-Chih Chen, Cheng-Jye Luh, Chichang Jou, "Generating page clippings from web search results using a dynamically terminated genetic algorithm", Elsevier Information Systems 30, 2005, 299-316.
[7] Rocio L. Cecchini, Carlos M. Lorenzetti, Ana G. Maguitman, Nelida Beatriz Brignole, "Using genetic algorithms to evolve a population of topical queries", Elsevier Information Processing and Management 44, 2008, 1863-1878.
[8] Weiguo Fan, Praveen Pathak, Linda Wallace, "Nonlinear ranking function representations in genetic programming-based ranking discovery for personalized search", Elsevier Decision Support Systems 42, 2006, 1338-1349.
[9] M.D. Gordon, "User-based document clustering by redescribing subject descriptions with a genetic algorithm", J. Am. Soc. Inf. Sci. Technol. 42 (5), 1991, 311-322.
[10] F. Petry, B. Buckles, D. Prabhu, D. Kraft, "Fuzzy information retrieval using genetic algorithms and relevance feedback", In Proceedings of the ASIS Annual Meeting, 1993, pp. 122-125.
[11] Kyung-Joong Kim, Sung-Bae Cho, "Personalized mining of web documents using link structures and fuzzy concept networks", Elsevier Applied Soft Computing 7, 2007, 398-410.
[12] J. Yang, R.R. Korfhage, "Effects of query term weights modification in document retrieval: a study based on a genetic algorithm", In Proceedings of the Second Annual Symposium on Document Analysis and Information Retrieval, 1993, pp. 271-285.
[13] H. Chen, Y. Chung, M. Ramsey, C. Yang, "An intelligent personal spider (agent) for dynamic Internet/Intranet searching", Decision Support Systems 23 (1), 1998, 41-58.
[14] H. Chen, Y. Chung, C. Yang, M. Ramsey, "A smart Itsy Bitsy Spider for the Web", J. Am. Soc. Inf. Sci. Technol. 49 (7), 1998, 604-618.
[15] Z.Z. Nick, P. Themis, "Web search using a genetic algorithm", IEEE Internet Comput. 5 (2), 2001, 18-26.
[16] F. Picarougne, N. Monmarché, A. Oliver, G. Venturini, "Web mining with a genetic algorithm", In Proceedings of the Eleventh International World Wide Web Conference, 2002.
[17] D. Vrajitoru, "Genetic algorithms in information retrieval", AIDRI97: Learning; From Natural Principles to Artificial Methods, Geneve, 1997.
[18] J.T. Horng, C.C. Yeh, "Applying genetic algorithms to query optimization in document retrieval", Inf. Proc. Manage. 36, 2000, 737-759.
[19] C. Lopez-Pujalte, V.P. Guerrero-Bote, F. de Moya-Anegón, "A test of genetic algorithms in relevance feedback", Inf. Proc. Manage. 38 (6), 2002, 795-807.
[20] C. Lopez-Pujalte, V.P. Guerrero-Bote, F. de Moya-Anegón, "Order-based fitness functions for genetic algorithms applied to relevance feedback", J. Am. Soc. Inf. Sci. Technol. 54 (2), 2003, 152-160.
@ 2010 http://www.ijaest.iserp.org. All rights Reserved.