The description of the distributed search engine DSA covers its distributed architecture, software components, security and performance issues, and installation.
The distributed architecture of DSA consists of three parts: client modules, server modules and the communication protocol.
In DSA, the client and server modules are implemented in the Perl [21] programming language and communicate over HTTP. Server modules are executed by the Apache Web Server [22] when a client module sends an HTTP request to this server. Client modules send client data and wait for the response from the server module.
In Figure 3.1, the client module sends an HTTP request containing meta data packets to the HTTP server of the DSA system. The server forks a server module and passes it the request parameters. After processing the packets, the server module sends an HTTP response back to the calling client module.
Client modules are located and executed on web servers or on the hosting systems of ISPs. They crawl the web site, download and parse its pages, score the keywords according to the parsing and scoring constraints, and send the index data of each web document to the server module.
Each client has a unique client id that identifies it to the server module. It also has its own system configuration parameters (Table 3.1) and user (or web site administrator) configuration parameters (Table 3.3).
These parameters are stored locally in encrypted form and are readable only by the client modules while they process the documents of the web site. Table 3.2 lists example system settings from the DSA prototype.
The user configuration parameters include the URL at which crawling starts, the number of pages the client module should process and the page extensions that must not be crawled. How the client modules use all of these parameters is explained in Section 3.2.
Table 3.1. Client system configuration parameters
| Parameter Name | Description |
|---|---|
| Client ID | Unique DSA Client identifier |
| Is Meta Zipped | Flag for indicating compression option |
| Server Module Url | Address of server module to communicate |
| Timeout | Timeout value to wait for response from server module |
| Max Visit Oldness | Minimum required time for revisits in seconds |
| Field Multiplier 1 | Score multiplier for page description |
| Field Multiplier 2 | Score multiplier for meta keywords |
| Field Multiplier 3 | Score multiplier for bold words |
| Field Multiplier 4 | Score multiplier for italic words |
| Field Multiplier 5 | Score multiplier for heading words |
| Field Multiplier 6 | Score multiplier for words written in paragraph and newline beginnings |
| Field Multiplier 7 | Score multiplier for parent link’s citation text words |
| Field Multiplier 8 | Score multiplier for title and url text words |
| Field Multiplier 9 | Score multiplier for all content words |
| Key Count 100 | Percentage of the number of keywords to be sent to server when word count of page is up to 100 |
| Key Count 500 | Percentage of the number of keywords to be sent to server when word count of page is up to 500 |
| Key Count 1000 | Percentage of the number of keywords to be sent to server when word count of page is up to 1000 |
| Key Count 5000 | Percentage of the number of keywords to be sent to server when word count of page is up to 5000 |
| Key Count Over | Percentage of the number of keywords to be sent to server when word count of page is over 5000 |
Table 3.4 lists example settings for the client user configuration file of the DSA prototype. According to these values, the client module should index the pages in the “boun.edu.tr” domain, starting from the address “http://www.boun.edu.tr/” and following up to 1000 links. It should not transform any URL to lowercase, should strip out the text inside HTML option boxes, and should not index web pages whose addresses have the “php” or “swf” extensions.
Table 3.2. Example client system configuration file
| Parameter Name | Value |
|---|---|
| Client ID | ABCDEFGHABCDEFGHABCDEFGHABCDEFGH |
| Is Meta Zipped | 1 |
| Server Module Url | http://dsa-engine.com/tez/server_module.pl |
| Timeout | 90 |
| Max Visit Oldness | 86400 |
| Field Multiplier 1 | 10 |
| Field Multiplier 2 | 2 |
| Field Multiplier 3 | 10 |
| Field Multiplier 4 | 11 |
| Field Multiplier 5 | 16 |
| Field Multiplier 6 | 7 |
| Field Multiplier 7 | 40 |
| Field Multiplier 8 | 18 |
| Field Multiplier 9 | 1 |
| Key Count 100 | 1 |
| Key Count 500 | 0.7 |
| Key Count 1000 | 0.45 |
| Key Count 5000 | 0.25 |
| Key Count Over | 0.15 |
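The way the “Key Count” parameters could limit the number of keywords sent to the server can be illustrated with a minimal Python sketch. It uses the example values of Table 3.2 and treats them as fractions (1 meaning all keywords), although Table 3.1 describes them as percentages. The function names, the rounding up and the ordering by score are assumptions made for illustration and are not taken from the DSA prototype.

```python
import math

# (upper bound on page word count, fraction of keywords to send),
# taken from the example settings in Table 3.2.
KEY_COUNT_FRACTIONS = [
    (100, 1.0),
    (500, 0.7),
    (1000, 0.45),
    (5000, 0.25),
]
KEY_COUNT_OVER = 0.15  # pages with more than 5000 words


def keyword_fraction(word_count):
    """Return the fraction of keywords to send for a page of this size."""
    for upper_bound, fraction in KEY_COUNT_FRACTIONS:
        if word_count <= upper_bound:
            return fraction
    return KEY_COUNT_OVER


def keywords_to_send(scored_keywords, word_count):
    """Keep the highest scored keywords (assumed selection rule).

    scored_keywords maps keyword -> score. Rounding the limit up is an
    assumption, so that short pages always send at least one keyword.
    """
    ranked = sorted(scored_keywords.items(), key=lambda kv: kv[1], reverse=True)
    limit = math.ceil(len(ranked) * keyword_fraction(word_count))
    return ranked[:limit]
```

With these settings, a page of 800 words falls into the “Key Count 1000” band, so 45 percent of its scored keywords would be sent.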
Table 3.3. Client user configuration parameters
| Parameter Name | Description |
|---|---|
| Start Url | Url to start crawling the client domain |
| Domain Constraint | Expression set for the crawler, to filter web page url domains for parsing. For example, when this expression is set as “boun.edu.tr”, the crawler downloads all pages with urls containing “boun.edu.tr” in url domain field |
| Translate Lowercase | Flag for transforming url text to lowercase. Crawler changes links to lowercase if this flag is set to 1 |
| Max Link Count | Maximum number of pages to be crawled |
| Strip Select | Flag for stripping off select fields in html text of a page. Used by parser to exclude html select tags. |
| Denied Extensions | Extensions filter for crawler to exclude, separated by space |
Table 3.4. Example client user configuration file
| Parameter Name | Value |
|---|---|
| Start Url | http://www.boun.edu.tr |
| Domain Constraint | boun.edu.tr |
| Translate Lowercase | 0 |
| Max Link Count | 1000 |
| Strip Select | 1 |
| Denied Extensions | .php .swf |
Server modules run on the server or servers of the distributed search engine. Unlike traditional centralized search engines, they do not download web documents themselves; instead, they are triggered by client modules. A client module is first authenticated, and then the DSA protocol begins.
Figure 3.2. Cases of server module control responses
Server modules receive web page definitions, page information and index data from the client modules and store them. After storing these data, they reply to the requesting client module in one of three ways, as shown in Figure 3.2.
In case 1, the client module sends meta data to the server and, after processing it, the server module responds that the client should proceed with the next page. In case 2, the server module sends a stop response because of system load. In case 3, the server module reports an error to the client module because of an uncaught exception, a failed authentication or a database problem.
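The three cases can be summarized from the server's point of view with a short Python sketch. The enumeration names, the order of the checks and the `store_meta_data` callable are hypothetical and only illustrate the decision; the prototype itself is written in Perl and its internal structure is not shown here.

```python
from enum import Enum


class ControlResponse(Enum):
    PROCEED = 1  # case 1: meta data stored, client may send the next page
    STOP = 2     # case 2: server is under load
    ERROR = 3    # case 3: exception, failed authentication or database problem


def control_response(overloaded, authenticated, store_meta_data):
    """Pick the control response for one uploaded meta data package.

    store_meta_data is a hypothetical callable that writes the package to
    the database and raises an exception on failure. The order of the
    checks below is an assumption.
    """
    if not authenticated:
        return ControlResponse.ERROR
    if overloaded:
        return ControlResponse.STOP
    try:
        store_meta_data()
    except Exception:
        # Any uncaught problem while storing is reported as an error.
        return ControlResponse.ERROR
    return ControlResponse.PROCEED
```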
Figure 3.3. Communication architecture of DSA
Figure 3.3 shows the workflow of DSA and the communication protocol between the client and server modules. This communication has three main phases, which are explained in more detail in Subsections 3.1.3.1, 3.1.3.2 and 3.1.3.3.
In the first phase, authentication is established between the two parties of DSA. In the second phase, the client module crawls the web site page by page, and builds and uploads meta data to the server module. The response of the server module forms the last phase of the communication protocol. The second and third phases recur until the server module tells the client module to stop, the client module reaches the maximum link count, or the whole web site has been traversed.
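The recurrence of the second and third phases can be sketched as a client-side loop. In the Python sketch below, `authenticate` and `upload` are hypothetical stand-ins for the HTTP calls of the client module, and the response strings are assumed names for the three server responses. Counting a page as sent before its response is inspected is also an assumption.

```python
def run_session(pages, authenticate, upload, max_link_count):
    """Drive one client session and report how it ended.

    pages          -- iterable of pages produced by the crawler
    authenticate() -- returns an authentication id, or None on failure
    upload(a, p)   -- returns "proceed", "stop" or "error"
    Returns (number of pages sent, reason for ending).
    """
    auth_id = authenticate()                  # phase 1: authentication
    if auth_id is None:
        return 0, "not authenticated"
    sent = 0
    for page in pages:
        if sent >= max_link_count:
            return sent, "max link count reached"
        response = upload(auth_id, page)      # phase 2: build and upload
        sent += 1
        if response != "proceed":             # phase 3: server response
            return sent, "server sent " + response
    return sent, "site traversed"
```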
Figure 3.4 shows the authentication phase. The client module sends its client id to the server and requests authentication. The server module first checks the server load by monitoring the number of server processes. If this number is greater than a system-defined threshold, it tells the connected client module to stop. In that case, the client module waits for a timeout period and then retries the authentication request.
Otherwise, the server module checks the client id and, if access can be granted, generates a random authentication id and sends it back to the requesting client module. This id is called the session authentication id and is valid for a timeout period.
The client module uses this authentication id in the following steps of the communication and generates the session key, which the server module generates as well. The session key is used to encrypt the data sent from the client to the server and to decrypt the data received by the server; the same key is used for both encryption and decryption.
Figure 3.4. Communication of client and server during authentication
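The server-side checks of the authentication phase can be illustrated with the following Python sketch. The process threshold, the validity period, the in-memory session table and the return values are assumptions chosen for the example. How the session key is derived is not specified here, so the sketch leaves it out.

```python
import secrets
import time

MAX_SERVER_PROCESSES = 50   # assumed system-defined load threshold
AUTH_TIMEOUT_SECONDS = 90   # assumed validity period of an authentication id
KNOWN_CLIENT_IDS = {"ABCDEFGHABCDEFGHABCDEFGHABCDEFGH"}  # example id, Table 3.2
sessions = {}               # authentication id -> (client id, expiry time)


def authenticate(client_id, running_processes, now=None):
    """Return ("ok", auth_id), ("stop", None) or ("error", None)."""
    now = time.time() if now is None else now
    if running_processes > MAX_SERVER_PROCESSES:
        return ("stop", None)       # server load too high, client retries later
    if client_id not in KNOWN_CLIENT_IDS:
        return ("error", None)      # access cannot be given
    auth_id = secrets.token_hex(16)  # random session authentication id
    sessions[auth_id] = (client_id, now + AUTH_TIMEOUT_SECONDS)
    return ("ok", auth_id)


def is_session_valid(auth_id, client_id, now=None):
    """Check that the authentication id belongs to the client and has not expired."""
    now = time.time() if now is None else now
    entry = sessions.get(auth_id)
    return entry is not None and entry[0] == client_id and now <= entry[1]
```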
After this step, the client process compresses and encrypts the meta data package to form a temporary meta data file. The third-party Zlib [25] and CBC-Blowfish [26] Perl modules are used for compression and encryption. The client module posts the meta data file to the server module, together with its client id and authentication id, using the HTTP file upload method. When the client module requests to post the binary package file, the DSA web server loads and executes the server module to respond to the upload.
Figure 3.5. Building and uploading meta data file from client to server
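The compression step of building the meta data file can be sketched in Python with the standard `zlib` module, controlled by the “Is Meta Zipped” flag of Table 3.1. Serializing the meta data as JSON is an assumption made for the example, since the package format is not described here. The Blowfish encryption step is deliberately omitted because it needs a third-party library.

```python
import json
import zlib


def pack_meta_data(meta_data, is_meta_zipped):
    """Serialize meta data and compress it when the flag is set."""
    raw = json.dumps(meta_data, sort_keys=True).encode("utf-8")
    return zlib.compress(raw) if is_meta_zipped else raw


def unpack_meta_data(payload, is_meta_zipped):
    """Reverse pack_meta_data on the receiving side."""
    raw = zlib.decompress(payload) if is_meta_zipped else payload
    return json.loads(raw.decode("utf-8"))
```

In a complete implementation the packed bytes would be encrypted with the session key before upload and decrypted on the server before unpacking.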
The server module stems word bodies, recalculates word weights and stores the web page information and the scored keyword index in the DSA database. After this step, the server either tells the client module to go on with the next page of its web site, tells it to stop, or reports an error. The client module acts according to this response and logs the result of the communication.
Figure 3.6. Response of server module after meta data processing
The software components of the DSA platform are the crawling, incremental indexing, parsing, scoring, compression, security, communication, stemming, storing, searching, listing and client configuration components. Table 3.5 shows which components are used in the client modules, the server modules and the search interface of DSA. The following subsections explain the functionality of each component in detail.
Table 3.5. Software components of DSA and their usage areas
| Component | Client Module | Server Module | Search Interface |
|---|---|---|---|
| Crawler | Yes | | |
| Incremental Indexer | Yes | | |
| Parser | Yes | | |
| Scorer | Yes | | |
| Compression | Yes | Yes | |
| Security | Yes | Yes | |
| Communication | Yes | Yes | |
| Stemmer | | Yes | |
| Storer | | Yes | |
| Searching | | | Yes |
| Listing | | | Yes |
| Client Configuration | | Yes | |
| DSA Client Subscription | | | Yes |
The crawler component, which is used in client modules, traverses and downloads the pages of the web site, extracts links from the page content, crawls these links and feeds its crawler queue. Initially, the crawler queue holds only one URL, which is the value of the “start url” client user parameter.
During execution, the component fetches a URL from the crawler queue, deletes it from that queue and puts it into the visited queue, as shown in Figure 3.7. The visited queue holds the addresses of pages that have already been traversed. Next, the page at this URL is downloaded and the links within its content are parsed. New links that do not exist in the visited queue are put into the crawler queue.
Meanwhile, the citation texts written on page links are added to the parent citations data structure shown in Table 3.6. For each page, this data structure holds the page that links to it and the text written on that link. After these steps, the content is passed to the incremental indexing component, which is described in Section 3.2.2.
Figure 3.7. Flow of crawler component
Table 3.6. Parent citation data structure
| Field Name | Description |
|---|---|
| Url | Url of the page queued |
| Parent Link | Url of currently crawled page |
| Parent Citation Text | Citation text written on this link |
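The interplay of the crawler queue, the visited queue and the parent citation records can be sketched in Python as follows. The `fetch_links` callable is a hypothetical stand-in for downloading a page and parsing its links. The first-in, first-out order, the check against already queued URLs and the recording of a citation for every link found are assumptions of this sketch.

```python
from collections import deque


def crawl(start_url, fetch_links, max_link_count):
    """Traverse a site and collect parent citations.

    fetch_links(url) returns (link_url, link_text) pairs found on the page.
    Returns (visited URLs in order, list of parent citation records).
    """
    crawler_queue = deque([start_url])  # initially holds only the start url
    seen = {start_url}                  # visited or already queued URLs
    visited = []
    parent_citations = []
    while crawler_queue and len(visited) < max_link_count:
        url = crawler_queue.popleft()   # fetch and delete from crawler queue
        visited.append(url)             # put into visited queue
        for link_url, link_text in fetch_links(url):
            # One record per link, with the fields of Table 3.6.
            parent_citations.append({
                "url": link_url,
                "parent_link": url,
                "parent_citation_text": link_text,
            })
            if link_url not in seen:
                seen.add(link_url)
                crawler_queue.append(link_url)
    return visited, parent_citations
```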
Table 3.7 lists the URL, parent link and parent citation text computed by the client module of the DSA prototype for the links on the home page of our department web site. Here, “[IMG]” denotes an image link.
Table 3.7. Parent citations of “www.cmpe.boun.edu.tr”
| URL | Parent Link | Parent Citation Text |
|---|---|---|
| http://www.cmpe.boun.edu.tr/graduate/index.php | http://www.cmpe.boun.edu.tr/ | Graduate Program: |
| http://www.cmpe.boun.edu.tr/graduate/ | http://www.cmpe.boun.edu.tr/ | [IMG] |
| http://www.cmpe.boun.edu.tr/admission/graduate.php | http://www.cmpe.boun.edu.tr/ | Graduate Admission |
| http://www.cmpe.boun.edu.tr/undergrad/information.php | http://www.cmpe.boun.edu.tr/ | Information |
| http://www.cmpe.boun.edu.tr/graduate/desciptions.php | http://www.cmpe.boun.edu.tr/ | Course Descriptions |
| http://www.cmpe.boun.edu.tr/undergrad/courselist.php | http://www.cmpe.boun.edu.tr/ | Course List |
| http://www.cmpe.boun.edu.tr/graduate/ms_theses.php | http://www.cmpe.boun.edu.tr/ | MS Degrees Granted |
| http://www.cmpe.boun.edu.tr/graduate/ms_info.php | http://www.cmpe.boun.edu.tr/ | MS Information |
| http://www.cmpe.boun.edu.tr/people/support_personnel.php | http://www.cmpe.boun.edu.tr/ | Support Personnel |
| http://www.cmpe.boun.edu.tr/oldweb/ | http://www.cmpe.boun.edu.tr/ | Old CmpE-web |
| http://www.cmpe.boun.edu.tr/~akin/robsem.html | http://www.cmpe.boun.edu.tr/ | Robotics Seminars |
| http://www.cmpe.boun.edu.tr/research/index.php | http://www.cmpe.boun.edu.tr/ | Research: |
| http://www.cmpe.boun.edu.tr/people/alumni/index.php | http://www.cmpe.boun.edu.tr/ | Alumni |
| http://www.cmpe.boun.edu.tr/undergrad/transferfaq.php | http://www.cmpe.boun.edu.tr/ | Transfer Applications FAQ |
| http://www.cmpe.boun.edu.tr/graduate/phd_theses.php | http://www.cmpe.boun.edu.tr/ | PhD Degrees Granted |
| http://www.cmpe.boun.edu.tr/research/ | http://www.cmpe.boun.edu.tr/ | [IMG] |
| http://www.cmpe.boun.edu.tr/people/faculty/index.php | http://www.cmpe.boun.edu.tr/ | Faculty Members |
| http://www.cmpe.boun.edu.tr/~otk/ | http://www.cmpe.boun.edu.tr/ | CmpE Undergraduate Student Representatives |
| http://www.cmpe.boun.edu.tr/people/index.php | http://www.cmpe.boun.edu.tr/ | People: |
| http://www.cmpe.boun.edu.tr/events/index.php | http://www.cmpe.boun.edu.tr/ | Events: |
| http://www.cmpe.boun.edu.tr/graduate/phd_info.php | http://www.cmpe.boun.edu.tr/ | PhD Information |
| http://www.cmpe.boun.edu.tr/undergrad/training/index.php | http://www.cmpe.boun.edu.tr/ | Summer Training |
| http://www.cmpe.boun.edu.tr/sitemap.php | http://www.cmpe.boun.edu.tr/ | [IMG] |
| http://www.cmpe.boun.edu.tr/graduate/courselist.php | http://www.cmpe.boun.edu.tr/ | Course List |
| http://www.cmpe.boun.edu.tr/people/ | http://www.cmpe.boun.edu.tr/ | [IMG] |
| http://www.cmpe.boun.edu.tr/admission/index.php | http://www.cmpe.boun.edu.tr/ | Admission: |
| http://www.cmpe.boun.edu.tr/people/assistants/index.php | http://www.cmpe.boun.edu.tr/ | Research Assistants |
| http://www.cmpe.boun.edu.tr/undergrad/curriculum.php | http://www.cmpe.boun.edu.tr/ | Curriculum |
| http://www.cmpe.boun.edu.tr/admission/undergrad/ | http://www.cmpe.boun.edu.tr/ | Undergraduate Admission |
| http://www.cmpe.boun.edu.tr/undergrad/descriptions.php | http://www.cmpe.boun.edu.tr/ | Course Descriptions |
| http://www.cmpe.boun.edu.tr/tr/ | http://www.cmpe.boun.edu.tr/ | Türkçe |
Besides text links, the crawler component recognizes pages with frames and follows JavaScript and web server redirects. Whenever a new link is found, it checks whether the domain of the URL matches the domain constraint expression listed in Table 3.3. If it matches and the link has not been visited before, the link is added to the crawler queue. For example, let the web site address be “http://www.boun.edu.tr”, the domain constraint be “boun.edu.tr” and the URL tested for addition to the crawler queue be “http://cmpe.boun.edu.tr/sitemap.htm”. This URL can be added to the crawler queue because its domain section “cmpe.boun.edu.tr” matches the domain constraint.
Link addition to the crawler queue is further limited by the maximum link count and denied extensions parameters shown in Table 3.3. If the extension of a URL matches one of the values of the “denied extensions” parameter, the crawler component does not feed this link to the crawler queue.
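The link filter described above can be sketched in Python with the standard `urllib.parse` module. The function name and the exact matching rules are assumptions: the domain constraint is tested as a substring of the host name, and a denied extension is tested as a suffix of the URL path.

```python
from urllib.parse import urlsplit


def may_enqueue(url, domain_constraint, denied_extensions, visited):
    """Decide whether a newly found link may enter the crawler queue."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    if domain_constraint not in host:
        return False                 # outside the allowed domain
    if any(parts.path.endswith(ext) for ext in denied_extensions):
        return False                 # extension is on the denied list
    return url not in visited        # only links not visited before


# Example values from Table 3.4; the extensions are separated by spaces.
denied = ".php .swf".split()
print(may_enqueue("http://cmpe.boun.edu.tr/sitemap.htm", "boun.edu.tr", denied, set()))
```

The call at the end reproduces the example from the text and prints `True`, since “cmpe.boun.edu.tr” contains the domain constraint and “.htm” is not a denied extension.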
Additionally, if the number of pages crawled has reached the value of the “max link count” parameter or there are no more links in the crawler queue, the crawler stops traversing the web site and the client module process ends. Examples of these parameters from a client module of the DSA prototype are listed in Table 3.4.
When the home page of “www.cmpe.boun.edu.tr” is crawled by this prototype, the crawler queue shown in Table 3.8 is built by the crawler component.
The job of the incremental indexing component is to track the pages visited within the web site and to decide whether a page should be parsed and scored and whether its index should be sent to the server. It stores this data in an encrypted file on the client side and updates this file every time a meta data file is uploaded to the server. It uses the data structure shown in Table 3.9, which holds the URL of the page, the time it was last indexed and whether the index was successfully uploaded to the server module.
The “last index date” field holds the time when the page was last visited and parsed and its index was created. The “successfully uploaded flag” is set when the index was successfully uploaded to the server during the last visit.
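One plausible form of this decision is sketched below in Python. The rule itself is an assumption inferred from the two fields above and from the “Max Visit Oldness” parameter of Table 3.1: a page is indexed again if it has never been seen, if its last upload failed, or if enough time has passed since it was last indexed. The record layout and the function name are hypothetical.

```python
def should_index(record, now, max_visit_oldness):
    """Decide whether a page should be parsed, scored and uploaded.

    record is None for a page never seen before, otherwise a dict with
    "last_index_date" (seconds since the epoch) and "successfully_uploaded".
    max_visit_oldness is the minimum time between revisits, in seconds.
    """
    if record is None:
        return True
    if not record["successfully_uploaded"]:
        return True                  # retry a failed upload
    return now - record["last_index_date"] >= max_visit_oldness
```

With the example value of 86400 seconds from Table 3.2, a page that was indexed and uploaded successfully would be skipped until one day has passed.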
Table 3.8. Crawler queue when home page of “www.cmpe.boun.edu.tr” is crawled
|
Crawler Queue Page Addresses |
|
http://www.cmpe.boun.edu.tr/graduate/index.php |
|
http://www.cmpe.boun.edu.tr/graduate/ |
|
http://www.cmpe.boun.edu.tr/admission/graduate.php |
|
http://www.cmpe.boun.edu.tr/undergrad/information.php |
|
http://www.cmpe.boun.edu.tr/graduate/desciptions.php |
|
http://www.cmpe.boun.edu.tr/undergrad/courselist.php |
|
http://www.cmpe.boun.edu.tr/graduate/ms_theses.php |
|
http://www.cmpe.boun.edu.tr/graduate/ms_info.php |
|
http://www.cmpe.boun.edu.tr/people/support_personnel.php |
|
http://www.cmpe.boun.edu.tr/oldweb/ |
|
http://www.cmpe.boun.edu.tr/~akin/robsem.html |
|
http://www.cmpe.boun.edu.tr/people/alumni/index.php |
|
http://www.cmpe.boun.edu.tr/research/index.php |
|
http://www.cmpe.boun.edu.tr/undergrad/transferfaq.php |