How To Build A Search Engine



Everybody uses Google's search engine every day. I believe many people have had the idea of building a search engine themselves, but quickly gave up because it seemed too technically difficult: too much code to write, too many architecture problems to consider, and relevance issues that are too hard to resolve. It seems like a mission impossible. But is that really true? The answer is no. The open source community has already produced a series of search engine building blocks, and they work quite well. You can build a search engine from them much like playing with building blocks as a child. Sound interesting? Let me explain in a little more detail.

First of all, you need a server to host the engine. Either a dedicated server or a virtual private server is fine, with at least 512 MB of RAM and at least 1 GB of disk space. Both Windows and Linux work, although Linux is preferred.

Crawling web pages is the first step in building a search engine. Web pages must first be fetched to local disk so that the search engine can analyze them further. Fetching starts from a list of seed URLs and continues by finding new URLs in the pages behind those seed URLs. Still more URLs may be found in the newly crawled pages. By repeating this process, the crawler can reach almost every linked page on the internet. A full crawl of the whole internet generally takes several weeks, and storing all the crawled pages requires huge disks and disk arrays that you probably cannot afford. However, you can set parameters to control the crawler's behavior, limiting it to the domains or websites you are interested in and to URLs within a maximum link depth. Nutch is such a crawler: a Java-based open source program.
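To make the crawl loop concrete, here is a minimal sketch in Python of a breadth-first crawl that starts from seed URLs, stays inside a set of allowed hosts, and stops at a maximum link depth. It is not how Nutch is implemented or configured. The names `crawl`, `toy_fetch_links` and `TOY_WEB` are made up for illustration, and the "web" is a small in-memory dictionary so that the example runs without a network connection.

```python
from collections import deque
from urllib.parse import urlparse

# Hypothetical in-memory "web": each URL maps to the links found on that page.
# A real crawler would download the page and parse its links instead.
TOY_WEB = {
    "http://a.example/": ["http://a.example/docs", "http://b.example/"],
    "http://a.example/docs": ["http://a.example/docs/deep", "http://a.example/"],
    "http://a.example/docs/deep": ["http://a.example/docs/deeper"],
    "http://a.example/docs/deeper": [],
    "http://b.example/": ["http://b.example/page"],
    "http://b.example/page": [],
}


def toy_fetch_links(url):
    """Stand-in for downloading a page and extracting its links."""
    return TOY_WEB.get(url, [])


def crawl(seeds, fetch_links, allowed_hosts, max_depth):
    """Breadth-first crawl starting from the seed URLs.

    Only URLs whose host is in allowed_hosts are visited, and links that
    are more than max_depth hops away from a seed are not followed.
    Returns the visited URLs in the order they were crawled.
    """
    seen = set()
    queue = deque()
    for url in seeds:
        if url not in seen:
            seen.add(url)
            queue.append((url, 0))

    visited = []
    while queue:
        url, depth = queue.popleft()
        if urlparse(url).hostname not in allowed_hosts:
            continue  # outside the configured domains
        visited.append(url)
        if depth == max_depth:
            continue  # depth limit reached: do not follow this page's links
        for link in fetch_links(url):
            if link not in seen:  # never queue the same URL twice
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


if __name__ == "__main__":
    for page in crawl(["http://a.example/"], toy_fetch_links,
                      {"a.example"}, max_depth=2):
        print(page)
```

With the seed `http://a.example/`, the host restricted to `a.example` and a maximum depth of 2, the sketch visits three pages and skips both the `b.example` site and the page three links away from the seed.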
Search Google for 'Nutch tutorial' and you will find plenty of tutorial articles that explain how to start Nutch, how to configure the target domains, how to set the maximum crawl depth, and so on.

Indexing web pages is the second step in building a search engine. Indexing is generally implemented by building an inverted index, a table that maps each word to all the documents containing it. Indexing is the critical step that lets the engine find which documents contain the words of a search query. Lucene is such an indexing library, and it is also Java based. Search Google for 'Lucene tutorial' and you will find plenty of articles that demonstrate how to use Lucene to create an index for a directory containing all the web pages fetched by a crawler such as Nutch. The index is itself stored as files under a predefined directory.

The final step is to set up a web container that can talk to the index and rank results for search queries. We need an open source web container that can work with a Lucene index. Tomcat is the best choice since it is also Java based, and the Lucene group developed a .war file for Tomcat for exactly this integration. You only need to install Tomcat and copy the Lucene .war file to Tomcat's webapps folder; Tomcat can then serve queries against the Lucene index and rank the results.

So, do you still think it is difficult to build a search engine? Of course, this is just a basic one running on a single machine, but don't look down on such a small search engine. Although it runs on a single machine, it can serve 50K or more unique visitors every day as long as the server has powerful hardware. More importantly, you can configure it to offer great features that even Google doesn't have, as long as you have smart ideas. With 50K or more unique visitors every day, you can already earn a decent income and be happy enough.
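For readers curious about what the indexing step does under the hood, here is a minimal sketch in Python of a word-to-documents mapping and a lookup against it. It is a toy illustration, not Lucene's API or file format. The function names and the sample pages are invented for the example, and it does no ranking at all: it only returns the documents that contain every word of the query.

```python
import re
from collections import defaultdict

# Hypothetical crawled pages: document id -> extracted text.
DOCS = {
    "page1.html": "Nutch is an open source web crawler",
    "page2.html": "Lucene builds an inverted index",
    "page3.html": "An open source index for a web crawler",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the sorted ids of documents containing every query word."""
    words = tokenize(query)
    if not words:
        return []
    # Start with the documents for the first word, then intersect
    # with the documents for each remaining word.
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return sorted(result)


if __name__ == "__main__":
    index = build_inverted_index(DOCS)
    print(search(index, "open source"))
    print(search(index, "crawler index"))
```

Because each word already points at its documents, answering a query is a matter of a few set intersections rather than a scan of every page, which is why the index is built ahead of time.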

Recently I launched a free ebook search engine built entirely on Lucene, Nutch and Tomcat, and more importantly, it took me only one day. What an easy job!

Article Source: http://EzineArticles.com/?expert=Cheng_Gold


