The Internet revolution has made a wealth of information resources available for direct and easy access on the user's desktop. However, finding appropriate information has become a significant problem for many users.
With the development of the World Wide Web, searching web pages has attracted great research interest. The challenge of locating information in such a huge data source has given rise to the most popular application of the Internet: the search engine.
Current search engines allow users to locate information of interest, and online centralized catalogs (often called portals) such as Yahoo [1] provide more relevant and well-organized information.
Search engines operate on huge databases and carry out keyword searches. Recall is poor because no database covers the entire WWW. Accuracy is also low: the user is offered a pile of page references, a large share of which are not a useful answer to the search request [2].
Every day, millions of new web pages are downloaded and indexed by hundreds of search engine spiders, or agents. These spiders produce metadata that is stored in the search engines' local databases and later searched by clients looking for web pages of interest. The spiders are centralized: parsing, link extraction, stemming, ranking and storing are all performed on the main servers of these search engines.
It is becoming clear that the major search engines have great difficulty providing a comprehensive and up-to-date search service for the Web. Depending on its reindexing frequency, a traditional centralized search engine may take weeks or months to notice that a page has been changed, added or deleted. Given that Google has indexed more than three billion pages [3], detecting such changes within hours would be nearly impossible. Neither available bandwidth nor current processors can support such low latency.
Because centralized search engines perform all processing on their own servers, crawling changed pages and updating their records in the databases incurs bandwidth and processing overhead. In this thesis, we propose a distributed search architecture to resolve these problems. We propose local search modules that can optionally be embedded into Apache web server distributions and installed on web hosting server systems.
Our distributed architecture model, DSA, may overcome some of the problems and difficulties of traditional search engines. Incremental indexing is used to overcome the update problem of traditional search engines. To achieve this, the client modules notify the server modules about updates, deletions and additions of web pages. The client modules are either triggered by the web site administrator or run on a schedule, so the DSA search engine holds a complete and up-to-date index of each site.
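The incremental indexing step described above can be illustrated with a minimal sketch. It assumes that a client module detects changes by comparing content fingerprints between two runs; the names `fingerprint` and `diff_site` and the use of SHA-256 are illustrative assumptions, not the actual DSA implementation.

```python
import hashlib


def fingerprint(content: bytes) -> str:
    """Return a content hash used to detect changed pages (assumed: SHA-256)."""
    return hashlib.sha256(content).hexdigest()


def diff_site(
    previous: dict[str, str], current_pages: dict[str, bytes]
) -> tuple[dict[str, list[str]], dict[str, str]]:
    """Compare fingerprints from the last run with the pages found now.

    previous      -- path -> fingerprint, as stored after the last run
    current_pages -- path -> raw page content found in this run
    Returns the change report to send to the server and the new fingerprints.
    """
    current = {path: fingerprint(body) for path, body in current_pages.items()}
    changes = {
        "added": sorted(p for p in current if p not in previous),
        "updated": sorted(
            p for p in current if p in previous and current[p] != previous[p]
        ),
        "deleted": sorted(p for p in previous if p not in current),
    }
    return changes, current
```

Only the paths listed in the change report would need to be reparsed and reported to the server; unchanged pages cause no traffic at all.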
With these distributed modules, the heavy traffic generated by traditional search engine agents is greatly reduced. Only the compressed metadata exchanged between modules is transferred, not whole web pages. This also reduces the bandwidth that the search engine servers need to index the World Wide Web.
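As a rough illustration of sending compressed metadata instead of whole pages, the sketch below serializes a page's keyword scores and compresses them. The JSON record layout, the zlib compression and the function names are assumptions made for this example; the wire format actually used by DSA is not specified here.

```python
import json
import zlib


def pack_metadata(url: str, keyword_scores: dict[str, float]) -> bytes:
    """Serialize and compress one page's metadata (hypothetical record layout)."""
    record = {"url": url, "keywords": keyword_scores}
    return zlib.compress(json.dumps(record, sort_keys=True).encode("utf-8"))


def unpack_metadata(payload: bytes) -> dict:
    """Reverse pack_metadata on the receiving side."""
    return json.loads(zlib.decompress(payload).decode("utf-8"))
```

The payload grows with the number of distinct keywords rather than with the size of the page markup, which is the property the architecture relies on.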
Computation is also distributed. Centralized search engines need huge server systems to download, extract, parse, stem and rank web pages; the main requirement for DSA is a database storage system that keeps the web page indexes and runs search queries. Expensive computations such as extracting, parsing and scoring are done by the client modules on the hosting servers of the web sites, which makes the system scalable.
Offline web pages and documents can also be searched in the DSA system. Today’s centralized search engines (such as Google or AltaVista) cannot offer offline searching because they cannot index pages that they cannot access. In our model, this feature is activated by configuring client modules to index offline paths, which are stored on the web server but not served by it.
Another contribution is our document analysis and keyword scoring technique. In DSA, client modules parse web page content and score keywords according to the style in which they are written, their position in the document and their presence in the title, description and citations. In addition, word stemming for English and Turkish is performed on the server side so that more relevant results are returned to users of the search engine.
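A minimal sketch of this kind of keyword scoring is given below. It assumes that "style" refers to HTML markup such as headings and bold text and that "citations" refers to link text; the weights are invented for illustration, and position-based weighting and stemming are left out for brevity. It is not the scoring function used by DSA.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser

# Hypothetical weights; the actual DSA values are not given in this chapter.
TAG_WEIGHTS = {"title": 5.0, "h1": 3.0, "b": 2.0, "strong": 2.0, "a": 1.5}
DESCRIPTION_WEIGHT = 4.0
BASE_WEIGHT = 1.0


class KeywordScorer(HTMLParser):
    """Accumulate a score per word, weighted by the enclosing markup."""

    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.scores = defaultdict(float)

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attr = dict(attrs)
            # Words in the meta description get their own weight.
            if (attr.get("name") or "").lower() == "description":
                self._add(attr.get("content") or "", DESCRIPTION_WEIGHT)
            return
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Close the most recent matching tag and anything left open inside it.
        for i in range(len(self.open_tags) - 1, -1, -1):
            if self.open_tags[i] == tag:
                del self.open_tags[i:]
                break

    def handle_data(self, data):
        # Use the strongest weight among the currently open tags.
        weight = max(
            (TAG_WEIGHTS.get(t, BASE_WEIGHT) for t in self.open_tags),
            default=BASE_WEIGHT,
        )
        self._add(data, weight)

    def _add(self, text, weight):
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            self.scores[word] += weight


def score_keywords(html: str) -> dict[str, float]:
    """Return word -> accumulated score for one HTML page."""
    scorer = KeywordScorer()
    scorer.feed(html)
    scorer.close()
    return dict(scorer.scores)
```

Because the scores are computed on the hosting server, only the resulting word-to-score table has to reach the search engine.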
The DSA model also overcomes the index update latency that follows additions, deletions and updates on web sites, which is a bottleneck in centralized search engines. Thanks to its configurable architecture, the search engine server system does not need to revisit web sites from its central system. Instead, the client modules running on the web hosting systems watch for changes on their local systems.
The DSA model also addresses security. The module-to-module protocol between client and server modules has a built-in security mechanism that prevents other programs or intruders from communicating with the server or client modules.
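This chapter does not say which mechanism the protocol uses. One common way to authenticate messages between two modules is a shared-secret HMAC, sketched below purely as an illustration; the function names and the choice of SHA-256 are assumptions, and the sketch does not describe the DSA protocol itself.

```python
import hashlib
import hmac


def sign(secret: bytes, message: bytes) -> str:
    """Compute an authentication tag for a message using a shared secret."""
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify(secret: bytes, message: bytes, signature: str) -> bool:
    """Accept the message only if the tag matches (constant-time comparison)."""
    return hmac.compare_digest(sign(secret, message), signature)
```

A receiver that does not share the secret, or that receives a modified message, fails the check and can drop the request.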
Client modules are available from the DSA home page after subscription. Users can download the client module software and configuration files, then install and run the client module on their web hosting servers to add their sites to the DSA search engine.
The outline of this thesis is as follows. Chapter 2 presents related work. Chapter 3 explains the architecture of the DSA system in detail, covering the distributed architecture, the client and server modules, the communication protocol, software components, data structures, packages and installation of the system. Chapter 4 compares our distributed architecture with the traditional centralized structure and lists the advantages and disadvantages of each. Chapter 5 reports the results and statistics produced by our proof-of-concept implementation, including data size measurements from our prototype system for several search result examples. Chapter 6 concludes the discussion of the DSA model and presents possible future improvements to the system.