2.  Architecture of DSA

This chapter describes the distributed search engine DSA: its distributed architecture, its software components, security and performance issues, and the installation of DSA.

2.1.  Distributed Architecture

Three parts make up the distributed architecture of DSA: client modules, server modules and the communication protocol.

In the DSA distributed search engine, the client and server module software is implemented in the Perl programming language [21] and communicates over the http protocol. Server modules are executed by the Apache Web Server [22] when a client module sends an http request to this http server. A client module sends its data and then waits for the response from the server module.

As shown in Figure 3.1, a client module sends an http request containing meta data packets to the http server of the DSA system. The server forks a server module and passes it the request parameters. After processing the packets, the server module sends an http response back to the calling client module.


Figure 3.1. Data flow of DSA

2.1.1.  Client Modules

Client modules are installed and executed on the web servers or on ISP hosting systems. Each one crawls its web site, downloads and parses the pages of the site, scores the keywords according to the parsing and scoring constraints, and sends the index data of each web document to the server module.

Clients have unique client ids that identify them to the server module. They have their own system configuration parameters and user (or web site administrator) configuration parameters.

2.1.1.1.  System Configuration Parameters. The system configuration parameters shown in Table 3.1 are the platform parameters of the distributed search engine. They are: the client id, a 32-byte string that uniquely identifies the client module; the constants for the parsers, indexers and crawlers, which tell how to crawl the intended web site, parse the content and aggregate the index data for the pages or documents of the site; the url of the server module; and the timeout value for sending and requesting data.

These parameters are stored locally in encrypted form and are readable only by the client modules while they process the web site documents. Table 3.2 lists example settings of the DSA prototype. Section 3.2 explains how the client modules use the parameters in Table 3.1 and Table 3.2.

2.1.1.2.  User Configuration Parameters. The user configuration parameters listed in Table 3.3 let the web site administrator control how the crawler behaves during client module processing.

These parameters include the url of the web page at which crawling starts, the number of pages the client module should process, and the page extensions that must not be crawled. Section 3.2 explains how the client modules use these parameters.

Table 3.1. Client system configuration parameters

Parameter Name | Description
Client ID | Unique DSA Client identifier
Is Meta Zipped | Flag for indicating compression option
Server Module Url | Address of server module to communicate
Timeout | Timeout value to wait for response from server module
Max Visit Oldness | Minimum required time for revisits in seconds
Field Multiplier 1 | Score multiplier for page description
Field Multiplier 2 | Score multiplier for meta keywords
Field Multiplier 3 | Score multiplier for bold words
Field Multiplier 4 | Score multiplier for italic words
Field Multiplier 5 | Score multiplier for heading words
Field Multiplier 6 | Score multiplier for words written in paragraph and newline beginnings
Field Multiplier 7 | Score multiplier for parent link’s citation text words
Field Multiplier 8 | Score multiplier for title and url text words
Field Multiplier 9 | Score multiplier for all content words
Key Count 100 | Percentage of the number of keywords to be sent to server when word count of page is up to 100
Key Count 500 | Percentage of the number of keywords to be sent to server when word count of page is up to 500
Key Count 1000 | Percentage of the number of keywords to be sent to server when word count of page is up to 1000
Key Count 5000 | Percentage of the number of keywords to be sent to server when word count of page is up to 5000
Key Count Over | Percentage of the number of keywords to be sent to server when word count of page is over 5000

Table 3.4 lists example settings for the client user configuration file of the DSA prototype. With these values, the client module indexes the pages in the “boun.edu.tr” domain, starting from the address “http://www.boun.edu.tr/” and following up to 1000 links. It does not transform urls to lowercase, it strips out the text inside html select boxes, and it does not index web pages whose addresses have the “php” or “swf” extension.

2.1.1.3.  Scheduling of Client Modules. Client modules are run at defined times and intervals by the scheduling facilities of the operating system. The client user controls how frequently the web site is revisited.

Table 3.2. Example client system configuration file

Parameter Name | Value
Client ID | ABCDEFGHABCDEFGHABCDEFGHABCDEFGH
Is Meta Zipped | 1
Server Module Url | http://dsa-engine.com/tez/server_module.pl
Timeout | 90
Max Visit Oldness | 86400
Field Multiplier 1 | 10
Field Multiplier 2 | 2
Field Multiplier 3 | 10
Field Multiplier 4 | 11
Field Multiplier 5 | 16
Field Multiplier 6 | 7
Field Multiplier 7 | 40
Field Multiplier 8 | 18
Field Multiplier 9 | 1
Key Count 100 | 1
Key Count 500 | 0.7
Key Count 1000 | 0.45
Key Count 5000 | 0.25
Key Count Over | 0.15
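
The Key Count rows can be read as a tiered lookup: the word count of a page selects the fraction of its keywords that is sent to the server. The following Python sketch illustrates this reading with the example values of Table 3.2 (the DSA prototype itself is written in Perl). The function names, the rounding rule and the choice to keep the highest-scoring keywords are assumptions made for illustration; they are not stated in this section.

```python
# Tier limits and fractions copied from the Key Count rows of Table 3.2.
KEY_COUNT_TIERS = [
    (100, 1.0),    # Key Count 100
    (500, 0.7),    # Key Count 500
    (1000, 0.45),  # Key Count 1000
    (5000, 0.25),  # Key Count 5000
]
KEY_COUNT_OVER = 0.15  # Key Count Over: pages with more than 5000 words


def keyword_fraction(word_count: int) -> float:
    """Return the fraction of keywords to send for a page of this size."""
    for limit, fraction in KEY_COUNT_TIERS:
        if word_count <= limit:
            return fraction
    return KEY_COUNT_OVER


def keywords_to_send(scored: dict[str, float], word_count: int) -> list[str]:
    """Keep the top-scoring share of keywords (assumed selection rule)."""
    if not scored:
        return []
    fraction = keyword_fraction(word_count)
    # Assumption: round to the nearest integer and always keep at least one.
    keep = max(1, round(len(scored) * fraction))
    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scored, key=lambda word: (-scored[word], word))
    return ranked[:keep]
```

For instance, under these assumptions a page of 2000 words with four scored keywords would send only its single best keyword, because the 0.25 tier applies.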

 

Table 3.3. Client user configuration parameters

Parameter Name | Description
Start Url | Url to start crawling the client domain
Domain Constraint | Expression set for the crawler, to filter web page url domains for parsing. For example, when this expression is set as “boun.edu.tr”, the crawler downloads all pages with urls containing “boun.edu.tr” in url domain field
Translate Lowercase | Flag for transforming url text to lowercase. Crawler changes links to lowercase if this flag is set to 1
Max Link Count | Maximum number of pages to be crawled
Strip Select | Flag for stripping off select fields in html text of a page. Used by parser to exclude html select tags.
Denied Extensions | Extensions filter for crawler to exclude, separated by space

Table 3.4. Example client user configuration file

Parameter Name | Value
Start Url | http://www.boun.edu.tr
Domain Constraint | boun.edu.tr
Translate Lowercase | 0
Max Link Count | 1000
Strip Select | 1
Denied Extensions | .php .swf

2.1.2.  Server Modules

Server modules operate on the server or servers of the distributed search engine. Unlike traditional centralized search engines, they do not download the web documents themselves; instead, they are triggered by client modules. A client module is first authenticated, and then the DSA protocol begins.


Figure 3.2. Cases of server module control responses

Server modules receive and store the web page definitions, information and index data sent by the client modules. After storing these data, they reply to the requesting client module in one of three ways, as shown in Figure 3.2.

In “case 1”, the client module sends meta data to the server and, after processing it, the server module responds that the client should proceed with the next page. In “case 2”, the server module sends a stop response because of system load. In “case 3”, the server module reports an error to the client module because of an uncaught exception, a failed authentication or a database problem.

Figure 3.3. Communication architecture of DSA

2.1.3.  The Communication Protocol of DSA

Figure 3.3 shows the workflow of DSA and the related communication protocol between client and server modules. This communication has three main phases, which are explained in more detail in Subsections 3.1.3.1, 3.1.3.2 and 3.1.3.3.

In the first phase, authentication is established between the two parties of DSA. Next, the client module starts crawling the web site page by page, and builds and uploads meta data to the server module. The response of the server module forms the last phase of the communication protocol. The second and third phases repeat until the server module tells the client module to stop, the client module reaches the maximum link count, or the whole web site has been traversed.
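
The three phases and their stopping conditions can be summarized as a small control loop. The Python sketch below is only an illustration of that loop (the prototype is written in Perl). The names `Response`, `run_client`, `authenticate` and `upload` are hypothetical, the network calls are replaced by plain callables, and ending the run on an error response is an assumption.

```python
from enum import Enum
from typing import Callable, Iterable


class Response(Enum):
    PROCEED = "proceed"  # meta data stored, continue with the next page
    STOP = "stop"        # server asks the client to stop, e.g. system load
    ERROR = "error"      # exception, failed authentication or database problem


def run_client(
    pages: Iterable[str],
    authenticate: Callable[[], bool],
    upload: Callable[[str], Response],
    max_link_count: int,
) -> list[str]:
    """Run the three protocol phases and return a log of what happened."""
    if not authenticate():                        # phase 1: authentication
        return ["authentication failed"]
    log = []
    for count, page in enumerate(pages):
        if count >= max_link_count:
            log.append("max link count reached")
            break
        response = upload(page)                   # phase 2: upload meta data
        log.append(f"{page}: {response.value}")   # phase 3: server response
        if response is not Response.PROCEED:
            break                                 # stop or error ends the run
    else:
        log.append("site traversed")              # no more pages to crawl
    return log
```

Passing the two network steps in as callables keeps the loop testable without a server; a real client would perform http requests inside them.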

2.1.3.1.  Authentication. The communication protocol between client and server modules in DSA relies on client identification and authentication. Each client module crawls and parses its web site as set by the client system configuration parameters, which the server modules also check against the server database.

Figure 3.4 shows the authentication phase. The client module sends its client id to the server and requests authentication. The server module first checks the server load by monitoring the number of server processes. If this number is greater than a system-defined threshold, it tells the connected client module to stop. In that case, the client module waits for the timeout period and then retries the authentication request.

If the load is acceptable, the server module checks the client id and, if access can be given, generates a random authentication id and sends it back to the requesting client module. This id is called the session authentication id and is valid for a timeout period.

The client module uses this authentication id in the following steps of the communication and generates the session key. The server module generates the same session key, which is used to encrypt the data sent from client to server and to decrypt the data received by the server. The same key is used for both encryption and decryption.
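
The server-side checks of the authentication phase might look roughly like the following Python sketch (the prototype is written in Perl). The threshold, the session lifetime, the in-memory tables and the function names are hypothetical, and the derivation of the session key is left out because this section does not specify it.

```python
import secrets
import time

LOAD_THRESHOLD = 20   # hypothetical limit on concurrent server processes
SESSION_TIMEOUT = 90  # hypothetical lifetime of an authentication id, seconds

# Stand-ins for the server database (assumption: simple in-memory tables).
known_clients = {"ABCDEFGHABCDEFGHABCDEFGHABCDEFGH"}  # example id, Table 3.2
sessions = {}  # client id -> (authentication id, expiry time)


def authenticate(client_id, active_processes, now=None):
    """Return (status, authentication id); status is stop, error or ok."""
    now = time.time() if now is None else now
    if active_processes > LOAD_THRESHOLD:
        return "stop", None                 # server is too busy, retry later
    if client_id not in known_clients:
        return "error", None                # unknown client, no access
    auth_id = secrets.token_hex(16)         # random session authentication id
    sessions[client_id] = (auth_id, now + SESSION_TIMEOUT)
    return "ok", auth_id


def is_session_valid(client_id, auth_id, now=None):
    """Check a client id and authentication id pair on later requests."""
    now = time.time() if now is None else now
    stored = sessions.get(client_id)
    if stored is None:
        return False
    stored_id, expires_at = stored
    return secrets.compare_digest(stored_id, auth_id) and now < expires_at
```

The load check comes first, so that an overloaded server can refuse work before it touches its client records.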

 

 

 


Figure 3.4. Communication of client and server during authentication

 

2.1.3.2.  Uploading of Meta Data File. After the authentication phase, the client module crawls the pages of the web site by following the links within the site domain. For each page it collects information about the page, parses the content of the html document, scores the keywords in the content according to the relevance computing constraints and forms the index of the document. It then packages this index, called word weight capsules (WWC), together with the information of the web page [7, 24], as shown in Figure 3.5. This package is called the meta data of the web page.

After this step, the client process compresses and encrypts the meta data package to form a temporary meta data file. The third-party Zlib [25] and CBC-Blowfish Perl modules [26] are used for compression and encryption. The client module posts the meta data file, together with its client id and authentication id, to the server module using the http file upload method. When the client module requests to post the binary package file, the DSA web server loads and executes the server module to respond to the upload.
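
A minimal Python sketch of the packaging step, and of the reverse step that the server performs in the third phase, is given below (the prototype is written in Perl). It covers compression only: Blowfish is not part of the Python standard library, so the encryption step is marked by a comment rather than implemented. The JSON layout and the function names are assumptions; the real format of the meta data file is not described in this section.

```python
import json
import zlib


def build_meta_data_file(page_info: dict, word_weights: dict) -> bytes:
    """Client side: package page information and word weights, compress."""
    # Assumed layout: one JSON object holding both parts of the meta data.
    package = {"page": page_info, "wwc": word_weights}
    raw = json.dumps(package, sort_keys=True).encode("utf-8")
    compressed = zlib.compress(raw)
    # In DSA the compressed bytes are then encrypted with the session key
    # (Blowfish in CBC mode); that step is omitted in this sketch.
    return compressed


def read_meta_data_file(blob: bytes) -> tuple[dict, dict]:
    """Server side: decompress and split into page info and word weights."""
    # Decryption with the session key would come first in DSA.
    package = json.loads(zlib.decompress(blob).decode("utf-8"))
    return package["page"], package["wwc"]
```

Compressing before encrypting is the order given in the text; it also matters in practice, because encrypted data no longer compresses well.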

 

 


Figure 3.5. Building and uploading meta data file from client to server

 

2.1.3.3.  Response of Meta Data Processing. In the third phase of the protocol, the server module receives the request and checks the client id and authentication id against its records. If the client is granted access, the process decrypts, decompresses and splits the uploaded meta data package into page information and word weight capsules, as shown in Figure 3.6.

The server module stems the word bodies, recalculates the word weights and stores the web page information and the scored keyword index in the DSA database. After this step, the server tells the client module to go on with the next page of its web site or to stop, or it reports an error. The client module acts on this response and logs the communication result.


Figure 3.6. Response of server module after meta data processing

 

2.2.  Software Components

The software components of the DSA platform are the crawling, incremental indexing, parsing, scoring, compression, security, communication, stemming, storing, searching, listing and client configuration components. Table 3.5 shows which components are used in the client modules, the server modules and the search interface of DSA. The following subsections explain the functionality of each component in detail.

Table 3.5. Software components of DSA and their usage areas

Component | Client Module | Server Module | Search Interface
Crawler | Yes | - | -
Incremental Indexer | Yes | - | -
Parser | Yes | - | -
Scorer | Yes | - | -
Compression | Yes | Yes | -
Security | Yes | Yes | -
Communication | Yes | Yes | -
Stemmer | - | Yes | -
Storer | - | Yes | -
Searching | - | - | Yes
Listing | - | - | Yes
Client Configuration | - | Yes | -
DSA Client Subscription | - | - | Yes

2.2.1.  Crawler Component

The crawler component, which is used in the client modules, traverses and downloads the pages within the web site, extracts the links from the page content, crawls these links and feeds its crawler queue. Initially, the crawler queue holds only one url, which is the value of the “start url” client user parameter.

During the execution of this component, a new url is fetched from the crawler queue, deleted from it and put into the visited queue, as shown in Figure 3.7. The visited queue holds the addresses of the pages that have been traversed. Next, the page at this url is downloaded and the links within its content are parsed. New links that do not exist in the visited queue are put into the crawler queue.

Meanwhile, the citation texts written on the page links are added to the parent citations data structure shown in Table 3.6. For each page, this data structure holds the page that links to it and the text written on that link. After these steps, the content is passed to the incremental indexing component, which is described in Section 3.2.2.
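
The interplay of the crawler queue, the visited queue and the parent citations can be sketched in a few lines of Python (the prototype is written in Perl). The `SITE` dictionary is a hypothetical in-memory stand-in for downloading and parsing pages, and the extra check that a link is not already waiting in the crawler queue is an addition made here to avoid duplicate entries.

```python
from collections import deque

# Hypothetical site: url -> list of (link url, text written on the link).
SITE = {
    "http://example.org/": [
        ("http://example.org/a", "Page A"),
        ("http://example.org/b", "[IMG]"),
    ],
    "http://example.org/a": [("http://example.org/b", "Page B")],
    "http://example.org/b": [("http://example.org/", "Home")],
}


def crawl(start_url, fetch_links):
    """Traverse a site and return the visited urls and parent citations."""
    queue = deque([start_url])  # crawler queue, initially only the start url
    visited = []                # visited queue, in visiting order
    parent_citations = []       # rows of (url, parent link, citation text)
    while queue:
        url = queue.popleft()   # fetch and delete from the crawler queue
        visited.append(url)
        for link, text in fetch_links(url):
            parent_citations.append((link, url, text))
            # Only links that were not visited before are queued.
            if link not in visited and link not in queue:
                queue.append(link)
    return visited, parent_citations
```

Calling `crawl("http://example.org/", lambda url: SITE.get(url, []))` visits the three pages once each and records one citation row per link, in the column order of Table 3.6.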

 

 


Figure 3.7. Flow of crawler component

 

Table 3.6. Parent citation data structure

Field Name | Description
Url | Url of the page queued
Parent Link | Url of currently crawled page
Parent Citation Text | Citation text written on this link

Table 3.7 lists the url, parent link and parent citation text words computed by the client module of the DSA prototype for the home page of our department web site. Here, the marker “[IMG]” stands for an image link.

Table 3.7. Parent citations of “www.cmpe.boun.edu.tr”

URL | Parent Link | Parent Citation Text
http://www.cmpe.boun.edu.tr/graduate/index.php | http://www.cmpe.boun.edu.tr/ | Graduate Program:
http://www.cmpe.boun.edu.tr/graduate/ | http://www.cmpe.boun.edu.tr/ | [IMG]
http://www.cmpe.boun.edu.tr/admission/graduate.php | http://www.cmpe.boun.edu.tr/ | Graduate Admission
http://www.cmpe.boun.edu.tr/undergrad/information.php | http://www.cmpe.boun.edu.tr/ | Information
http://www.cmpe.boun.edu.tr/graduate/desciptions.php | http://www.cmpe.boun.edu.tr/ | Course Descriptions
http://www.cmpe.boun.edu.tr/undergrad/courselist.php | http://www.cmpe.boun.edu.tr/ | Course List
http://www.cmpe.boun.edu.tr/graduate/ms_theses.php | http://www.cmpe.boun.edu.tr/ | MS Degrees Granted
http://www.cmpe.boun.edu.tr/graduate/ms_info.php | http://www.cmpe.boun.edu.tr/ | MS Information
http://www.cmpe.boun.edu.tr/people/support_personnel.php | http://www.cmpe.boun.edu.tr/ | Support Personnel
http://www.cmpe.boun.edu.tr/oldweb/ | http://www.cmpe.boun.edu.tr/ | Old CmpE-web
http://www.cmpe.boun.edu.tr/~akin/robsem.html | http://www.cmpe.boun.edu.tr/ | Robotics Seminars
http://www.cmpe.boun.edu.tr/research/index.php | http://www.cmpe.boun.edu.tr/ | Research:
http://www.cmpe.boun.edu.tr/people/alumni/index.php | http://www.cmpe.boun.edu.tr/ | Alumni
http://www.cmpe.boun.edu.tr/undergrad/transferfaq.php | http://www.cmpe.boun.edu.tr/ | Transfer Applications FAQ
http://www.cmpe.boun.edu.tr/graduate/phd_theses.php | http://www.cmpe.boun.edu.tr/ | PhD Degrees Granted
http://www.cmpe.boun.edu.tr/research/ | http://www.cmpe.boun.edu.tr/ | [IMG]
http://www.cmpe.boun.edu.tr/people/faculty/index.php | http://www.cmpe.boun.edu.tr/ | Faculty Members
http://www.cmpe.boun.edu.tr/~otk/ | http://www.cmpe.boun.edu.tr/ | CmpE Undergraduate Student Representatives
http://www.cmpe.boun.edu.tr/people/index.php | http://www.cmpe.boun.edu.tr/ | People:
http://www.cmpe.boun.edu.tr/events/index.php | http://www.cmpe.boun.edu.tr/ | Events:
http://www.cmpe.boun.edu.tr/graduate/phd_info.php | http://www.cmpe.boun.edu.tr/ | PhD Information
http://www.cmpe.boun.edu.tr/undergrad/training/index.php | http://www.cmpe.boun.edu.tr/ | Summer Training
http://www.cmpe.boun.edu.tr/sitemap.php | http://www.cmpe.boun.edu.tr/ | [IMG]
http://www.cmpe.boun.edu.tr/graduate/courselist.php | http://www.cmpe.boun.edu.tr/ | Course List
http://www.cmpe.boun.edu.tr/people/ | http://www.cmpe.boun.edu.tr/ | [IMG]
http://www.cmpe.boun.edu.tr/admission/index.php | http://www.cmpe.boun.edu.tr/ | Admission:
http://www.cmpe.boun.edu.tr/people/assistants/index.php | http://www.cmpe.boun.edu.tr/ | Research Assistants
http://www.cmpe.boun.edu.tr/undergrad/curriculum.php | http://www.cmpe.boun.edu.tr/ | Curriculum
http://www.cmpe.boun.edu.tr/admission/undergrad/ | http://www.cmpe.boun.edu.tr/ | Undergraduate Admission
http://www.cmpe.boun.edu.tr/undergrad/descriptions.php | http://www.cmpe.boun.edu.tr/ | Course Descriptions
http://www.cmpe.boun.edu.tr/tr/ | http://www.cmpe.boun.edu.tr/ | Türkçe

Besides text links, the crawler component recognizes pages with frames and follows JavaScript and web server redirects. Whenever a new link is found, it checks whether the domain of the url matches the domain constraint expression listed in Table 3.3. If it matches and the link has not been visited before, the link is added to the crawler queue. For example, let the domain be “http://www.boun.edu.tr”, the domain constraint be “boun.edu.tr” and the url tested for addition to the crawler queue be “http://cmpe.boun.edu.tr/sitemap.htm”. This url can be added to the crawler queue because its domain section “cmpe.boun.edu.tr” matches the domain constraint.

Other limits on adding links to the crawler queue are the maximum page count and the file extension filter parameters shown in Table 3.3. If the extension of a url matches one of the values of the “denied extensions” parameter, the crawler component does not feed this link to the crawler queue.

Additionally, if the number of pages crawled has reached the value of the “max link count” parameter, or there are no more links in the crawler queue, the crawler stops traversing the web site and the client module process ends. Examples of these parameters from a client module of the DSA prototype are listed in Table 3.4.
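
The link admission rules described above can be combined into a single predicate. The Python sketch below is one possible reading (the prototype is written in Perl): the domain constraint is treated as a substring of the host name, following the description in Table 3.3, and the denied extensions are compared with the end of the url path. The function name and parameter names are hypothetical.

```python
from urllib.parse import urlsplit


def may_enqueue(url, domain_constraint, denied_extensions,
                visited, crawled_count, max_link_count):
    """Decide whether a newly found link may enter the crawler queue."""
    if crawled_count >= max_link_count:
        return False                      # maximum page count reached
    if url in visited:
        return False                      # page was traversed before
    parts = urlsplit(url)
    host = parts.hostname or ""
    if domain_constraint not in host:
        return False                      # outside the allowed domain
    path = parts.path.lower()
    # Reject the link if its path ends with any denied extension.
    return not any(path.endswith(ext) for ext in denied_extensions)
```

With the settings of Table 3.4, the denied extensions would be obtained as `".php .swf".split()`, and the example url “http://cmpe.boun.edu.tr/sitemap.htm” passes all four checks.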

When the home page of “www.cmpe.boun.edu.tr” is crawled by this prototype, the crawler queue shown in Table 3.8 is built by the crawler component.

2.2.2.  Incremental Indexer Component

The job of this component is to track the pages visited within the web site and to decide whether a page should be parsed and scored and its index sent to the server. It stores this data in an encrypted file on the client side, and this file is updated every time a meta data file is uploaded to the server. It uses the data structure shown in Table 3.9, which holds the url of the page, when it was last indexed and whether the index was successfully uploaded to the server module.

The “last index date” field holds the time when the page was last visited and parsed and its index was created. The “successfully uploaded flag” is set when the index was successfully uploaded to the server during the last visit.
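
One plausible decision rule built on these two fields is sketched below in Python (the prototype is written in Perl). It assumes that a page is indexed again when it has no record yet, when its last upload failed, or when the “Max Visit Oldness” interval of Table 3.1 has elapsed since the last index date. The record type, the function name and the exact rule are assumptions, not the author's stated algorithm.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class IndexRecord:
    url: str
    last_index_date: float       # seconds since the epoch
    successfully_uploaded: bool  # set when the last upload succeeded


def needs_indexing(record: Optional[IndexRecord], now: float,
                   max_visit_oldness: float = 86400) -> bool:
    """Decide whether a page should be parsed, scored and uploaded again."""
    if record is None:
        return True                       # page was never indexed
    if not record.successfully_uploaded:
        return True                       # last upload failed, try again
    # Revisit only after the minimum required time has passed.
    return now - record.last_index_date >= max_visit_oldness
```

The default of 86400 seconds is the example “Max Visit Oldness” value from Table 3.2, so a successfully uploaded page would be skipped for one day.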

 

Table 3.8. Crawler queue when home page of “www.cmpe.boun.edu.tr” is crawled

Crawler Queue Page Addresses

http://www.cmpe.boun.edu.tr/graduate/index.php

http://www.cmpe.boun.edu.tr/graduate/

http://www.cmpe.boun.edu.tr/admission/graduate.php

http://www.cmpe.boun.edu.tr/undergrad/information.php

http://www.cmpe.boun.edu.tr/graduate/desciptions.php

http://www.cmpe.boun.edu.tr/undergrad/courselist.php

http://www.cmpe.boun.edu.tr/graduate/ms_theses.php

http://www.cmpe.boun.edu.tr/graduate/ms_info.php

http://www.cmpe.boun.edu.tr/people/support_personnel.php

http://www.cmpe.boun.edu.tr/oldweb/

http://www.cmpe.boun.edu.tr/~akin/robsem.html

http://www.cmpe.boun.edu.tr/people/alumni/index.php

http://www.cmpe.boun.edu.tr/research/index.php

http://www.cmpe.boun.edu.tr/undergrad/transferfaq.php