This component runs in both client and server modules: on the client side it encrypts the compressed meta data sent to the server module, and on the server side it decrypts that data. The encrypted meta data is stored in a temporary file, called the encrypted meta data file, which is deleted after it has been uploaded to the DSA server modules. The component is also used when client system configuration parameters and site log files are read from and written to disk. It uses a third-party Perl module called CBS [26], in which the Blowfish algorithm is used. The same key is used for encryption and decryption; it is formed from the authentication id generated in both client and server modules. The key generation algorithm is hard-coded in the security component, and it is assumed that the key generation function is not known to intruders. In this distributed search engine platform, not much effort was spent on security, because our main aim was not to implement the strongest security and key exchange methods but rather to prevent garbled and distorted messages from being sent to server modules by intruders and by clients who try to increase the scores of their web page indexes.
Figure 3.10 shows the symmetric operations performed on the meta data file in client and server modules. On the client side, the meta data is first compressed and then encrypted to pack the meta data file for upload. On the server side, the retrieved meta data file is first decrypted and then decompressed to unpack the meta data of the page.
Figure 3.10. Symmetric use of compression and security components
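The symmetric order of operations described above can be illustrated with a minimal Python sketch. It is not the implementation used in DSA: `derive_key` is a hypothetical key derivation (the real one is deliberately undisclosed), and `xor_stream` is a trivial, insecure stand-in for the Blowfish cipher, used only so that the example needs no third-party library. The point of the sketch is the ordering: compress then encrypt on the client, decrypt then decompress on the server.

```python
import hashlib
import zlib


def derive_key(auth_id: str) -> bytes:
    # Hypothetical key derivation from the authentication id (assumption).
    return hashlib.sha256(auth_id.encode("utf-8")).digest()


def xor_stream(data: bytes, key: bytes) -> bytes:
    # Insecure stand-in for Blowfish: XOR with a repeating key.
    # Applying it twice with the same key restores the input.
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))


def pack(meta_data: str, auth_id: str) -> bytes:
    # Client side: compress first, then encrypt.
    compressed = zlib.compress(meta_data.encode("utf-8"))
    return xor_stream(compressed, derive_key(auth_id))


def unpack(blob: bytes, auth_id: str) -> str:
    # Server side: decrypt first, then decompress.
    compressed = xor_stream(blob, derive_key(auth_id))
    return zlib.decompress(compressed).decode("utf-8")
```

Because both sides derive the key from the same authentication id, `unpack(pack(text, auth_id), auth_id)` returns the original text.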
The communication component, which is used in both client and server modules, transmits authentication messages and the encrypted meta data file between client modules and server modules by using the HTTP file upload method. It uploads the meta data file to the HTTP address defined in the “server module url” client system parameter (see Table 3.2); on the server side, it retrieves this meta data. The communication component handles upload and download failures and retransmits data in case of errors. It also sends an error or success message back from server modules to client modules after each upload.
In server modules, after the meta data file has been downloaded, this component reads the content of the file and passes it to the security component. After the file content is decrypted, it is decompressed by the compression component, and the server module parses the meta data into page definitions and word weight capsules for further operations.
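The retransmission behaviour of the communication component might be sketched as follows. This is only an illustration under assumptions: the function name `upload_with_retry`, the `send` callback, the attempt limit and the delay are all hypothetical, since the text does not state how many retries are made or how failures are detected.

```python
import time
from typing import Callable


def upload_with_retry(send: Callable[[bytes], bool], payload: bytes,
                      max_attempts: int = 3, delay_seconds: float = 0.0) -> bool:
    """Call send(payload) until it reports success or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        try:
            if send(payload):
                return True  # server answered with a success message
        except OSError:
            pass  # treat a transport error like a failed upload
        if attempt < max_attempts:
            time.sleep(delay_seconds)  # wait before retransmitting
    return False
```

Here `send` stands for whatever routine performs the actual HTTP file upload and returns `True` when the server module reports success.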
We use a heuristic algorithm for fast stemming of the distinct words of a page. It is executed in server modules to find the bodies (stems) of the words that clients send to the server in the word weight capsules of web pages, according to the language used in the HTML content. In each step, to find the body of a word, it deletes the last letter (character) of the word and looks the remainder up in the body dictionary database [27] (the Turkish dictionary was downloaded from the tdk.gov.tr website).
If no matching word body is found in the database by the time the word has been shortened to three letters (a body is assumed to be longer than three letters), the algorithm assumes that the word is a proper name, a word in another language, or a word missing from the dictionary database, and moves on to the next word. If it finds a matching body, it adds the body to the page’s word index and gives it one third of the score of the original word. The fraction of the score that the body inherits from the original word was chosen experimentally.
For example, suppose the word is “books”, with a score of 30. The algorithm first looks up “books” in the dictionary and finds no match; it then drops the last letter ‘s’ and finds a match for “book”. It now takes “book” as the body of “books” and adds one third of 30, that is 10, to the score of the body word “book”. Experiments with this heuristic fast stemming show that about 90 per cent of true word bodies are found, which is an acceptable success rate.
In this way, when a user searches for “book” in the DSA search engine, a page that contains “books” but not “book” is also listed in the results. This method relies on the fact that inflectional suffixes in Turkish and English do not change a word’s meaning but emphasize the word body [28]. Giving additional scores to word bodies changes a page’s rank in search results for the body word when the page contains words formed by adding inflectional suffixes to that body. This component therefore helps to produce a more relevant ranking of search results.
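The heuristic described above can be sketched in Python as follows. The names `find_body` and `add_body_scores` are hypothetical, the dictionary is modelled as an in-memory set rather than a database table, and two details are assumptions: a word that is itself in the dictionary is treated as already being a body, and the one-third share is rounded down to an integer.

```python
def find_body(word, dictionary, min_body_len=4):
    """Strip letters from the end until a dictionary entry is found."""
    if word in dictionary:
        return None  # assumption: the word is already a body
    candidate = word[:-1]
    # A body is assumed to be longer than three letters.
    while len(candidate) >= min_body_len:
        if candidate in dictionary:
            return candidate
        candidate = candidate[:-1]
    return None  # proper name, foreign word, or missing from dictionary


def add_body_scores(word_scores, dictionary):
    """Return a copy of the index with one third of each score added to its body."""
    index = dict(word_scores)
    for word, score in word_scores.items():
        body = find_body(word, dictionary)
        if body is not None:
            index[body] = index.get(body, 0) + score // 3
    return index
```

With a dictionary containing “book”, the index `{"books": 30}` becomes `{"books": 30, "book": 10}`, matching the worked example in the text.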
Figure 3.11 shows the stemming algorithm run on the Turkish word “çocukların”; the word “çocuk” is found as the body. Table 3.18 shows stemming results from two pages of the Msnbc and Bigglook web sites.
Figure 3.11. Stemming of Turkish word “çocukların”
Table 3.18. Two example results of stemming
| Word (English page) | Stemmed Body | Word (Turkish page) | Stemmed Body |
|---|---|---|---|
| accounts | account | olduktan | oldu |
| lawyers | lawyer | yapımında | yapım |
| years | year | kokusu | koku |
| claimants | claimant | yetiştirilmektedir | yetiştirilmek |
| diagnosed | diagnose | buzdolabında | buzdolabı |
| crimes | crime | beklememiş | bekleme |
| believing | belie | çilekler | çilek |
| claimed | claim | arnavutköy | arnavut |
| psychiatrists | psychiatrist | sahiptir | sahip |
| specialists | specialist | sululuğa | sulu |
| commented | comment | manavlarda | manav |
| experts | expert | mevsimi | mevsim |
| russians | russia | restoranlar | restoran |
| served | serve | tatlılığa | tatlılı |
| sparked | spark | meyveler | meyve |
| windows | window | reçeli | reçel |
| developing | develop | almaz | alma |
This component stores the word weights of a web page that client modules send to server modules. After the words are stemmed, the storing component stores each keyword, its relevance score and the URL of the document in the search engine database. It also inserts or updates other meta data fields that are listed in search results for a query, such as title, description, date of last visit, citation text, citation link and content size of a page. Besides storing, this component also deletes removed pages and their related word indexes when a client module sends a request to delete the URL records of moved pages.
Storing is performed by SQL queries on the database. Sybase Adaptive Server Enterprise 11.2 is used for storing and searching. For dictionary storage and lookup, MySQL [29] is used, which is known to be faster than most database systems on simple lookups. The necessary table keys and indexes are defined for faster searches and lookups.
The fields, data types and definitions of the DSA storage tables are shown in Table 3.19, Table 3.20, Table 3.21, Table 3.22 and Table 3.23.
Table 3.19. Structure of database table “domains”
| Field Name | Data type | Definition |
|---|---|---|
| Domainid | Numeric(6,0) | Id of distinct domains of DSA |
| Domainname | Varchar(255) | Domain name of the site |
| Last Visited | Datetime | Last visit time of domain |
| Client ID | Varchar(32) | ID of client module that visited domain |
Table 3.20. Structure of database table “links”
| Field Name | Data type | Definition |
|---|---|---|
| Linkid | Numeric(6,0) | Id of distinct pages of DSA |
| Domainid | Numeric(6,0) | Id of domain that this page resides in |
| Linkname | Varchar(255) | Url of the page link |
| Title | Varchar(255) | Title of page |
| Description | Varchar(255) | Description of page |
| Last Visited | Datetime | Last visited time of page |
| Client ID | Varchar(32) | ID of client module that visited page |
| Last Modified | Datetime | Last modification time of page |
| Content Length | Numeric(6,0) | Size of page content in bytes |
| Parent Link | Varchar(255) | Url of parent page’s link |
| Parent Citation | Varchar(255) | Citation text of parent link |
Table 3.21. Structure of database table “keywords”
| Field Name | Data type | Definition |
|---|---|---|
| Keyword | Char(25) | Indexed word |
| Score | Numeric(5,0) | Score of the indexed word |
| Linkid | Numeric(6,0) | Id of the page in which this indexed word exists |
Table 3.22. Structure of database table “clients”
| Field Name | Data type | Definition |
|---|---|---|
| Client ID | Varchar(32) | ID of client |
| Auth ID | Varchar(32) | Session authorization ID of the connected client |
| Access | Numeric(1,0) | Access flag |
Table 3.23. Structure of database tables “English and Turkish dictionaries”
| Field Name | Data type | Definition |
|---|---|---|
| Word | Varchar(32) | Body word in dictionary |
For each uploaded meta data file, the storing component stores new site domains in the “domains” table, web page definitions in the “links” table and scored keywords in the “keywords” table. It checks client identifications and access rights against the “clients” table. The stemming component uses the English and Turkish dictionaries to find word bodies according to the language used in the web page.
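As a rough illustration of the storing step, the sketch below uses Python's built-in `sqlite3` module in place of Sybase, with a reduced version of the “domains”, “links” and “keywords” tables. The function names, the reduced column set, and the choice to replace a page's keywords on every upload are assumptions made for the example, not details taken from the DSA implementation.

```python
import sqlite3


def create_schema(db):
    # Reduced versions of the "domains", "links" and "keywords" tables.
    db.executescript("""
        CREATE TABLE domains (domainid INTEGER PRIMARY KEY,
                              domainname TEXT UNIQUE);
        CREATE TABLE links (linkid INTEGER PRIMARY KEY, domainid INTEGER,
                            linkname TEXT UNIQUE, title TEXT);
        CREATE TABLE keywords (keyword TEXT, score INTEGER, linkid INTEGER,
                               PRIMARY KEY (keyword, linkid));
    """)


def store_page(db, domain, url, title, word_scores):
    # Insert the domain only if it is new.
    db.execute("INSERT OR IGNORE INTO domains (domainname) VALUES (?)", (domain,))
    domain_id = db.execute("SELECT domainid FROM domains WHERE domainname = ?",
                           (domain,)).fetchone()[0]
    # Insert the page if it is new, then update its meta data fields.
    db.execute("INSERT OR IGNORE INTO links (domainid, linkname, title) "
               "VALUES (?, ?, ?)", (domain_id, url, title))
    db.execute("UPDATE links SET title = ? WHERE linkname = ?", (title, url))
    link_id = db.execute("SELECT linkid FROM links WHERE linkname = ?",
                         (url,)).fetchone()[0]
    # Assumption: keywords of a page are replaced on every upload.
    db.execute("DELETE FROM keywords WHERE linkid = ?", (link_id,))
    db.executemany("INSERT INTO keywords (keyword, score, linkid) VALUES (?, ?, ?)",
                   [(word, score, link_id) for word, score in word_scores.items()])
    return link_id
```

Calling `store_page` twice for the same URL keeps a single row in “links” and leaves only the most recently uploaded keywords for that page.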
This component handles the search process for queries in our distributed search engine. We used the PHP [30] programming language to serve search pages. The graphical interface of the site is designed for readability and functionality. The component takes the search phrase as input from the user, splits it into words, prepares the related SQL query, executes it on the DSA database system and lists the results in HTML format. Only pages that have all of the searched words in their index are included in the results. The pages are then sorted by the sum of the scores of the searched words, from highest to lowest, and assigned a corresponding rank.
Figure 3.12 explains the search algorithm. First, the search query is split into distinct words and stopwords are eliminated. The remaining words are named word1, word2, up to wordn. Afterwards, the SQL search query is generated, and it works in the database as follows: for each searched word, the pages (pagei, pagej, up to some pagen) and their word relevance scores (scorei1, scorei2, scorein, up to some scorenn) are found in the “keywords” database table.
Then the pages that contain all searched keywords are joined, and their related word scores are added to find the total score of the searched keywords in each page. The other pages are filtered out because at least one of the searched keywords does not exist in them. To find the resulting rank of each page, the pages are sorted by total score from highest to lowest, and the result is passed to the listing component to be displayed to the user.
Figure 3.12. Searching and result ranking algorithm
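The searching and ranking steps could be expressed, for instance, as a single grouped SQL query. The sketch below is one possible formulation, not the query used in DSA: it runs on Python's built-in `sqlite3` instead of Sybase and PHP, the `STOPWORDS` set is a hypothetical placeholder, and it assumes that the “keywords” table holds at most one row per keyword and page.

```python
import sqlite3

STOPWORDS = {"the", "a", "of", "and"}  # hypothetical stopword list


def search(db, phrase):
    # Split the phrase into distinct words and drop stopwords.
    words = []
    for word in phrase.lower().split():
        if word not in STOPWORDS and word not in words:
            words.append(word)
    if not words:
        return []
    placeholders = ", ".join("?" for _ in words)
    sql = (
        "SELECT linkid, SUM(score) AS total FROM keywords "
        f"WHERE keyword IN ({placeholders}) "
        "GROUP BY linkid "
        # Keep only pages that contain every searched word.
        "HAVING COUNT(DISTINCT keyword) = ? "
        # Rank by total score, highest first.
        "ORDER BY total DESC, linkid ASC"
    )
    return db.execute(sql, (*words, len(words))).fetchall()
```

The function returns `(linkid, total score)` pairs in rank order; pages that lack any one of the searched words are excluded by the `HAVING` clause.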