3.  Benefits of Distributed Approach

This section describes centralized search engines and the benefits of the distributed approach used in the DSA system.

3.1.  Centralized Search Engines

In centralized search engines, crawlers (agents) request web pages one by one or in multiple threads from the physical location of the search engine system. They usually work from a queue of web page addresses that is maintained automatically or with human intervention. For each address in the queue, an agent downloads the whole page, parses it for links to other pages and frames, parses the words in the content, stems and scores these words with a relevance scoring algorithm, and stores them in the search engine database on the local system. All of this processing is done on the search engine side, which is why these systems are called centralized search engines.
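The crawl loop described above can be sketched as follows. This is a minimal illustration and not the code of any particular engine: the `crawl` function, the injected `fetch` callable and the use of raw term frequency as the relevance score are all assumptions made for the example, and stemming is left out.

```python
from collections import Counter, deque
from html.parser import HTMLParser
import re


class LinkAndTextParser(HTMLParser):
    """Collects link/frame targets and text content from one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag in ("frame", "iframe") and attrs.get("src"):
            self.links.append(attrs["src"])

    def handle_data(self, data):
        self.text_parts.append(data)


def crawl(seed_urls, fetch):
    """Process the address queue; `fetch(url)` returns HTML text or None.

    `fetch` is a hypothetical downloader supplied by the caller.
    """
    queue = deque(seed_urls)
    seen = set(seed_urls)
    index = {}  # url -> Counter of word frequencies
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        parser = LinkAndTextParser()
        parser.feed(html)
        parser.close()
        words = re.findall(r"[a-z]+", " ".join(parser.text_parts).lower())
        index[url] = Counter(words)  # term frequency as a stand-in score
        for link in parser.links:
            if link not in seen:  # enqueue each discovered address once
                seen.add(link)
                queue.append(link)
    return index
```

Passing a dictionary lookup such as `pages.get` as `fetch` lets the loop run against an in-memory set of pages. In a real centralized engine every call to `fetch` is a network download to the engine's own location, which is the cost discussed in the rest of this section.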

Domains or pages to be visited are either added to the crawler queue by web site administrators or owners, as in AltaVista, or found automatically from citations in already indexed web sites, as in Google [7]. Nowadays, most search engines charge a fee to add a site to their indexes and to show its results at higher ranks. Google, on the other hand, only honors other pages’ citations to a site when indexing and ranking results.

Because the modules of a centralized search engine all run at one location, every page must be crawled again to find updates, additions and deletions, which brings bandwidth and processing overhead. Google states that it uses hundreds of clustered machines and distributes crawling, indexing and searching jobs over them. Even so, the most important bottlenecks remain bandwidth and the latency of page index updates.

Figure 4.1 shows a sample design of a centralized system, in which each module requests and downloads web pages from the web servers hosting these sites to the physical location of the search engine system.

 

Figure 4.1. Architecture of a centralized search engine

 

3.2.  Distributed System of DSA

The distributed modules of DSA, namely the client and server modules, work and communicate as shown in Figure 4.2. Client modules run on the web servers that host the web site and use the processing power of these machines. They read pages from the web server locally and therefore use no network bandwidth for crawling. On the client system, they parse and analyze the content, extract and score keywords, and compose information about the page, which is packed into meta data. The meta data is then compressed, encrypted and uploaded to the server modules over the Internet. Unpacking the meta data, stemming, and storing the keywords and page information are left to the server modules, located on the local system of DSA.
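A rough sketch of the client-side packing step and the matching server-side unpacking is given below. The meta data layout, the function names and the use of JSON with zlib are assumptions for illustration only; the actual DSA format is not specified here. Encryption is left out because the Python standard library does not ship a symmetric cipher.

```python
import json
import re
import zlib
from collections import Counter


def build_metadata(url, text, top_n=10):
    """Client side: score keywords and pack them with page information.

    The field names are hypothetical; term frequency stands in for the score.
    """
    words = re.findall(r"\w+", text.lower())
    scores = Counter(words).most_common(top_n)
    return {"url": url, "length": len(text), "keywords": dict(scores)}


def pack(metadata):
    """Client side: serialize and compress the meta data before upload."""
    return zlib.compress(json.dumps(metadata).encode("utf-8"))


def unpack(blob):
    """Server side: reverse of pack()."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))
```

Only the compact keyword record crosses the network, not the page itself, which is where the bandwidth saving of the distributed design comes from.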

Client modules run at scheduled intervals to check for changes in web page content and to track additions of new pages and deletions of indexed pages. They form and send meta data to the server modules only when they find that a page should be indexed. This removes the huge bandwidth and processing bottleneck of centralized search engines and gives DSA incremental indexing. The volume of data transferred is much smaller than in centralized search systems.
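One way a scheduled run could decide which pages need to be sent is to compare content fingerprints with those saved by the previous run. The sketch below assumes SHA-256 hashes and a simple path-to-content mapping; both are illustrative choices and not necessarily how the DSA client module tracks changes.

```python
import hashlib


def fingerprint(content):
    """Hash of the page content, used to detect modifications."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def detect_changes(previous, current_pages):
    """Compare the last run's fingerprints with the pages present now.

    previous:      path -> fingerprint saved by the last run
    current_pages: path -> page content read locally
    Returns (report, fingerprints to save for the next run).
    """
    current = {path: fingerprint(body) for path, body in current_pages.items()}
    added = sorted(p for p in current if p not in previous)
    deleted = sorted(p for p in previous if p not in current)
    changed = sorted(
        p for p in current if p in previous and current[p] != previous[p]
    )
    return {"added": added, "changed": changed, "deleted": deleted}, current
```

Only the paths listed under `added` and `changed` would need new meta data, and those under `deleted` only a removal notice, so an unchanged site produces no upload at all.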

 

Figure 4.2. Distributed system of DSA

 

Latency is another issue in search engines. Depending on its reindexing frequency, a centralized search engine may take weeks or months to notice that a page has been changed, added or deleted. Given that Google has indexed more than three billion pages [3], it would be nearly impossible to notice such a change within hours; neither the available bandwidth nor current processors can cope with such a small latency. In the DSA system, on the other hand, scheduled client modules notice these alterations as soon as they run and send the new index to the server modules, where the necessary modifications, additions and deletions are performed in the local storage of our search engine system. Both low bandwidth usage and short latency are achieved with less processing cost on the server system than in centralized search engine systems.
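To give a feel for the scale involved, the short calculation below estimates the sustained download rate a centralized crawler would need in order to revisit every page within a given window. Only the figure of three billion pages comes from the text; the function name and the chosen windows are illustrative.

```python
def required_fetch_rate(total_pages, window_seconds):
    """Pages per second needed to revisit every page once per window."""
    return total_pages / window_seconds


PAGES = 3_000_000_000  # index size quoted in the text

hourly = required_fetch_rate(PAGES, 60 * 60)             # refresh every hour
monthly = required_fetch_rate(PAGES, 30 * 24 * 60 * 60)  # refresh every 30 days

print(f"hourly refresh:  {hourly:,.0f} pages/s")
print(f"monthly refresh: {monthly:,.0f} pages/s")
```

An hourly refresh works out to roughly 830,000 page downloads per second, against roughly 1,200 per second for a monthly refresh, which illustrates why reindexing cycles of weeks or months are the practical choice for a centralized crawler.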

Unlike traditional search engines, the DSA search engine does not support advanced search, because we do not think that the advanced search feature is widely used by end users.

 

Table 4.1. Comparison table of centralized search engines and distributed DSA

| No | Comparison Issue | Centralized | Distributed |
|----|------------------|-------------|-------------|
| 1 | Crawler queue | Human modified or automated | By subscription for client module installation |
| 2 | Bandwidth | Huge bandwidth needed to download all html content and reindex web sites | Less bandwidth needed as compressed meta data of only changed and added pages sent to server |
| 3 | Compression | Does not exist | Meta data sent to server is compressed |
| 4 | Clustering | Can be clustered | Can be clustered according to installation chosen |
| 5 | Incremental Indexing | Partial. Should download all pages to track changes | Supported |
| 6 | Scoring Keywords | Supported by engines working with modules support | Supported |
| 7 | Processing Power | Huge processing power needed. | Less processing power needed as computation is distributed |
| 8 | Scalability | Not all engines are scalable. | Scalable. |
| 9 | Platform Independent | Popular engines are not platform independent | Platform independent |
| 10 | Latency | High. Months needed to index added or changed pages and remove deleted pages | Very low. Client modules realize changes in the web site quickly. Changes are updated as soon as client module runs on its schedule |
| 11 | Load balancing | Supported by most engines | Supported with clustered server module installation |
| 12 | Content Tamper | Content of the page indexed cannot be changed by the client for the engine module | Web site owner can tamper content while client module is processing |
| 13 | Advanced Search | Supported by most engines | Not supported |
| 14 | Stemming | Not supported by most engines. Google supports | Turkish and English stemming supported |
| 15 | Stopwords | Supported | Supported |

 

 

A drawback of this distributed system is that the content of a web site can be tampered with while the client module is indexing the site and then changed back to the original content after the client module finishes its job. A client site owner can thus alter the content so that the index contains something unrelated to the real site. One solution to this problem is to randomly select a page from the site, download its content to the server system and check whether the index stored for this page is correct. This feature is not implemented in our DSA system. Table 4.1 compares centralized search engines with the distributed DSA search engine system.
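The random spot check proposed above as a countermeasure to tampering could look roughly like the sketch below. Since the feature is not implemented in DSA, everything here is hypothetical: `index_page` stands in for the server's own indexing, `fetch` for a server-side download, and the exact-equality comparison is the simplest possible check.

```python
import random
import re
from collections import Counter


def index_page(html_text):
    """Hypothetical stand-in for the server's indexing of one page."""
    return Counter(re.findall(r"\w+", html_text.lower()))


def spot_check(stored_index, fetch, rng=random):
    """Pick one indexed page at random, re-index it and compare.

    stored_index: url -> index previously reported by the client module
    fetch:        callable that downloads the page content for a url
    Returns (url, matches), where matches is False if the index differs.
    """
    url = rng.choice(sorted(stored_index))
    fresh = index_page(fetch(url))
    return url, fresh == stored_index[url]
```

A production check would probably tolerate small differences, for example by comparing only the top-scored keywords, because a page can change legitimately between the client run and the download.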
