After the incremental indexer component decides to parse a web page and build its index, the parser component analyzes the content of the page. The following steps are applied to the HTML content:
· Control characters such as newline, carriage return and tab; comment, applet, script and style tags; and non-alphanumeric characters are removed.
· “Select” HTML tags are removed if the strip select client user configuration parameter is set (see Table 3.4).
· All letters are transformed to lower case.
· The language of the web page is determined from the HTML meta content-language and meta-charset definitions. If these settings are defined as “TR”, “ISO-8859-1” or “Windows-1254”, the language is assumed to be Turkish; in all other cases it is assumed to be English. The stopwords file and the dictionary of the stemming component are selected according to this result.
· The corresponding stopwords file is loaded and matching words are eliminated from the content.
Stopwords are the most frequent words in English and Turkish, such as “a”, “the”, “and” and “we”. They carry no meaning on their own and should not be scored or included in the page index [7, 8]. Stopwords are also eliminated from the search query when a user searches with the DSA search engine. These words are stored locally in text files on client and server systems, and some examples of English and Turkish stopwords are listed in Table 3.11.
Table 3.11. Few examples of English and Turkish stopwords
| English stopwords | Turkish stopwords |
| --- | --- |
| A | Ben |
| Any | Ve |
| All | Ile |
| The | De |
| This | Gibi |
| When | Şey |
| Who | Çok |
| We | Biz |
| They | Onlar |
| For | İçin |
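The preprocessing steps listed above can be sketched roughly as follows. This is a minimal illustration and not the DSA parser itself: the function names, the regular expressions and the in-memory stopword sets are assumptions, and the sketch drops all tags at once, whereas the actual parser still needs tag positions to fill the buckets afterwards.

```python
import re

# Hypothetical in-memory stopword sets (examples from the text and Table 3.11);
# the DSA prototype loads these from local text files instead.
STOPWORDS = {
    "EN": {"a", "and", "any", "all", "the", "this", "when", "who", "we", "they", "for"},
    "TR": {"ben", "ve", "ile", "de", "gibi", "şey", "çok", "biz", "onlar", "için"},
}

# Meta values that make the parser assume Turkish, as stated in the text.
TURKISH_MARKERS = {"tr", "iso-8859-1", "windows-1254"}


def detect_language(content_language, charset):
    """Return "TR" if either meta value is a Turkish marker, otherwise "EN"."""
    values = {(content_language or "").lower(), (charset or "").lower()}
    return "TR" if values & TURKISH_MARKERS else "EN"


def clean_words(html, language):
    """Strip unwanted blocks and tags, lower-case the text, drop stopwords."""
    # Remove comments and applet/script/style blocks together with their content.
    text = re.sub(r"(?is)<!--.*?-->|<(applet|script|style)\b.*?</\1\s*>", " ", html)
    # Simplification: drop every remaining tag instead of tracking it.
    text = re.sub(r"(?s)<[^>]+>", " ", text)
    # Replace control characters and transform all letters to lower case.
    text = re.sub(r"[\r\n\t]", " ", text).lower()
    words = re.findall(r"\w+", text)
    return [w for w in words if w not in STOPWORDS[language]]
```

Note that Python's `str.lower()` does not apply Turkish casing rules to dotted and dotless i, so a real implementation for Turkish pages would need locale-aware lower-casing.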
After these modifications to the HTML content, words are parsed and classified into the buckets shown in Figure 3.9. A word is placed in a bucket according to where it appears in the page, using the following numbered criteria:
1. in the description,
2. in the HTML meta keywords,
3. between bold tags (<b> and </b>),
4. between italic tags (<i> and </i>),
5. between heading tags (<h> and </h>),
6. at the beginning of paragraphs and new lines (after <p> or <br> tags),
7. in the parent citation text (found from the parent citation data structure),
8. in the title or URL text,
9. in the rest of the page, outside the criteria mentioned above.
Figure 3.9. Buckets filled with words according to numbered criteria
During this process, the title and description of the page are also extracted. Table 3.12 lists the contents of the buckets after a client module in the DSA prototype parses the home page of the “www.cmpe.boun.edu.tr” domain. For example, the words in bucket three are written in bold, whereas bucket two is empty because the page contains no meta keywords. The words in bucket four are written in italics, bucket eight stores the words in the page title and URL, and bucket nine holds the words written in the content.
Table 3.12. Bucket contents of home page of “www.cmpe.boun.edu.tr”
| Bucket | Words in the Bucket |
| --- | --- |
| Bucket 1 | mail webmaster türkçe events news announcements people faculty members research assistants support personnel students alumni research artificial intelligence laboratory computer networks research perceptual intelligence labor |
| Bucket 2 | |
| Bucket 3 | mail to webmaster events people research undergraduate program graduate program admission robotics seminars cmpe undergraduate student representatives bogazici university department of computer engineering tr-34342 bebek istanbul turkey general information boğaziçi university istanbul turkey |
| Bucket 4 | mail webmaster news events month boğaziçi university istanbul turkey |
| Bucket 5 | türkçe news events month robotics seminars cmpe courses |
| Bucket 6 | mail webmaster events faculty members artificial intelligence laboratory information curriculum ms information graduate admission department bogazici university turkey department department fax ++90 general information |
| Bucket 7 | |
| Bucket 8 | bogazici university computer engineering department www cmpe boun edu tr |
| Bucket 9 | computer bebek month phone bogazici fax news educational istanbul north events building department tr eta located turkey general main block adjacent technology information announcements campus engineering library university secretary |
The scoring component assigns weights to the words in each bucket filled by the parser component and forms the meta data of the indexed page. Each bucket has a defined score, and each word receives the sum of the scores of the buckets in which it appears. A word can be in one or more buckets according to the parser criteria; for example, a word can be written in bold, appear in the title and also be part of the citation text of the page. Each criterion has its own score, defined in the client system configuration parameters named field multipliers (see Table 3.2), and each field multiplier assigns its score to the words in the corresponding bucket. According to Table 3.2, a word in bucket one takes the score 10, a word in bucket two takes the score two, and so on. In this way, word weight capsules (WWC) are computed. A word weight capsule consists of a word and its computed relevance score, held in the data structure shown in Table 3.13. Here, the computed relevance score is the sum of the scores a word gains from each bucket in which it appears.
Table 3.13. Data structure of word weight capsule
| Field Name | Description |
| --- | --- |
| Word | Distinct word in a page |
| Computed Relevance Score | Sum of the scores a word gains from each bucket in which it appears |
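The scoring rule described above can be illustrated with a short sketch. The names `FIELD_MULTIPLIERS` and `score_words` are hypothetical, and only the multipliers for buckets one and two (10 and 2) are taken from the text; the remaining values are placeholders, since Table 3.2 is not reproduced here. The sketch counts each bucket once per word, following the wording that a word gains a score from each bucket in which it appears; whether repeated occurrences inside one bucket add up in the prototype is not stated in this section.

```python
from collections import defaultdict

# Bucket number -> field multiplier. Buckets 1 and 2 follow the text;
# all other values are placeholders, not the values of Table 3.2.
FIELD_MULTIPLIERS = {1: 10, 2: 2, 3: 3, 4: 3, 5: 4, 6: 2, 7: 8, 8: 10, 9: 1}


def score_words(buckets, multipliers=FIELD_MULTIPLIERS):
    """Build word weight capsules as (word, score) pairs, highest score first."""
    scores = defaultdict(int)
    for bucket_no, words in buckets.items():
        # Assumption: a bucket contributes its multiplier once per distinct word.
        for word in set(words):
            scores[word] += multipliers[bucket_no]
    # Sort by descending score; ties are broken alphabetically for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))
```

For instance, a word that appears in buckets one and nine would receive 10 + 1 = 11 under these placeholder multipliers.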
Because all client modules use the same algorithm and the same field multiplier scores for the words of pages indexed in the DSA system, if a word w has its highest score in page p among all indexed pages, then page p is the most relevant page for word w. So, when w is searched in the DSA search engine, p will be the most relevant page for the query and will be displayed first.
After scores are computed for the words in the buckets, word weight capsules are sorted by computed relevance score from highest to lowest, and a portion from the top is taken to form the meta data sent to the server module. The size of this portion is set by the key count percentage parameters in the client system parameters (see Table 3.2), and the parameter is selected according to the number of words in the page. If a page has fewer than 100 words, the “key count 100” multiplier is selected; if a page has more than 100 but fewer than 500 words, the “key count 500” multiplier is selected, and this ratio of the word weight capsules is sent to the server. For example, if a page has 400 scored distinct words, the “key count 500” ratio parameter, which is 0.7, is selected. So, 70 per cent of the words, that is the top 280 words, will be sent in word weight capsules to the server module.
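A minimal sketch of this selection step is given below. The name `select_top_capsules` and the table `KEY_COUNT_PERCENTAGES` are assumptions made for illustration. Only the 70 per cent value for “key count 500” comes from the text; the value for “key count 100”, the fallback for pages with 500 or more words, and the treatment of a page with exactly 100 words are placeholders.

```python
# (exclusive upper bound on distinct word count, percentage of capsules kept).
# Only the 70 per cent entry is taken from the text; the others are assumed.
KEY_COUNT_PERCENTAGES = [(100, 100), (500, 70)]
DEFAULT_PERCENTAGE = 50  # assumed value for pages with 500 or more words


def select_top_capsules(capsules, table=KEY_COUNT_PERCENTAGES,
                        default=DEFAULT_PERCENTAGE):
    """Keep the top-scoring share of (word, score) capsules for upload."""
    ranked = sorted(capsules, key=lambda capsule: capsule[1], reverse=True)
    count = len(ranked)
    # Pick the first percentage whose bound exceeds the word count.
    percentage = next((p for bound, p in table if count < bound), default)
    # Integer arithmetic avoids floating point rounding surprises.
    return ranked[: count * percentage // 100]
```

With 400 scored distinct words, the sketch keeps 400 * 70 / 100 = 280 capsules, which matches the example in the text.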
In this way, search performance is kept fast, and the words that reflect the content of the web page with less relevance are not stored in the search engine database. If these less frequent and less relevant words were indexed and stored for this document, they would not help the document appear among the top-ranked results; they could only place the page among the bottom-ranked results. It is known that users of search engines do not follow more than a few tens of search results and do not even look at the 40th or 50th result [4].
Table 3.14 lists the word weight capsules computed for the home page of “www.cmpe.boun.edu.tr” by a client module in the DSA prototype.
Table 3.14. Word weight capsules of home page of “www.cmpe.boun.edu.tr”
| Word | Score | Word | Score |
| --- | --- | --- | --- |
| department | 187 | admission | 17 |
| university | 182 | members | 17 |
| computer | 160 | block | 17 |
| engineering | 158 | phone | 17 |
| bogazici | 146 | campus | 17 |
| turkey | 55 | library | 17 |
| events | 51 | building | 17 |
| news | 49 | located | 11 |
| istanbul | 48 | north | 10 |
| information | 48 | educational | 10 |
| research | 40 | technology | 10 |
| webmaster | 38 | labor | 10 |
| mail | 38 | student | 10 |
| tr | 37 | robotics | 10 |
| boğaziçi | 28 | alumni | 10 |
| cmpe | 28 | representatives | 10 |
| month | 28 | seminars | 10 |
| laboratory | 27 | students | 10 |
| intelligence | 27 | support | 10 |
| bebek | 27 | personnel | 10 |
| announcements | 27 | türkiye | 10 |
| eta | 25 | assistants | 10 |
| general | 24 | networks | 10 |
| fax | 24 | perceptual | 10 |
| program | 20 | adjacent | 10 |
| people | 20 | main | 10 |
| undergraduate | 20 | secretary | 8 |
| www | 18 | curriculum | 7 |
| edu | 18 | ms | 7 |
| boun | 18 | info@cmpe | 7 |
| graduate | 17 | | |
| faculty | 17 | | |
| artificial | 17 | | |
After the scoring phase, the scoring component forms the meta data to be sent to the server. The meta data consists of the word weight capsules and the page definitions: title, description, last modification date, URL of the parent page, citation text of the parent page, content length and address (URL) of the page. The fields and format of the meta data are shown in Table 3.15.
The fields are separated by the pipe character ( | ), and any pipe character inside a field is preceded by a backslash ( \| ) to prevent confusion. In the meta data, word weight capsules are stored in a structure that is easier for server modules to parse and that decreases the message size. In this method, word weight capsules are split into words and word relevance weights. All words are concatenated with the tilde character ( ~ ), and the resulting string is named word capsules. The related scores are converted into two-byte characters and concatenated, forming the weight capsules. Both concatenations are done in the same order, so server modules can find the score corresponding to each word while parsing.
Computed word weights are converted into a binary representation during concatenation: each weight is written as exactly two bytes, shown here as ASCII characters, so scores are sent in binary rather than text format. No separator is needed between weights because each one occupies exactly two characters (i.e. two bytes). A score greater than 256 * 256 - 1 = 65535 is fixed to 65535. For example, let the score of a word be 65. The ASCII character with code 65 is “A”, so the character “A” is sent in the meta data instead of the text “65”. However, the weight capsule must be exactly two bytes, because the server module splits the concatenated weight capsules every two bytes, so a null character (the first character of ASCII) is added in front of “A”.
As another example, let the score be 17446. Dividing it by 256 and taking the integer part gives 68, the code of the leftmost character, which is “D” in ASCII. The remainder, 17446 – (68 * 256), is 38, which is the ASCII code of “&”. The result of this base-256 conversion is therefore “D&”, which represents 17446. Instead of sending “17446” (five bytes), the two characters “D&” (two bytes) are sent in the meta data, saving three bytes in communication. The server module performs the inverse conversion, that is, it transforms “D&” back to 17446 while processing the weight capsules.
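The two-byte conversion worked through above can be reproduced with a few lines of code. This is a sketch only: the function names are hypothetical, and a big-endian unsigned 16-bit layout is assumed because it matches the “D&” example (high byte first).

```python
import struct

MAX_WEIGHT = 256 * 256 - 1  # 65535, the largest value two bytes can hold


def encode_weight(score):
    """Pack one score into two bytes, high byte first, clamped to 65535."""
    return struct.pack(">H", min(score, MAX_WEIGHT))


def encode_weights(scores):
    """Concatenate the two-byte forms without any separator."""
    return b"".join(encode_weight(score) for score in scores)


def decode_weights(data):
    """Inverse conversion on the server side: split every two bytes."""
    return [high * 256 + low for high, low in zip(data[0::2], data[1::2])]
```

Under these assumptions, `encode_weight(65)` yields a null byte followed by “A”, and `encode_weight(17446)` yields “D&”, as in the two examples in the text.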
Table 3.16 lists the meta data constructed for the home page of “www.cmpe.boun.edu.tr” by a client module in the DSA prototype, namely the title, description, last modified date, parent URL address, parent citation text, content length, address of the page, language used in the content, word capsules and weight capsules. In this table, weight capsules are shown in integer format to avoid a garbled representation of binary characters.
Table 3.15. Structure of meta data formed by parser component
| Field Name | Description |
| --- | --- |
| Title | Page title |
| Description | Description extracted from the page |
| Last Modified | Last modification date of the web page, as returned by the web server of the site |
| Parent Url | URL of the parent page citing this page, found from the parent citation data structure |
| Parent Citation Text | Citation text of the parent page, found from the parent citation data structure |
| Content Length | Size of the page in bytes |
| Address of Page | URL of the page |
| Language | Language of the HTML content (Turkish or English) |
| Word Capsules | Concatenation of the distinct parsed and scored words of the page, separated by the “~” character, in computed word score order |
| Weight Capsules | Concatenation of the two-byte computed word scores, in the same order as the word capsules |
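As a rough illustration of how a record with the fields of Table 3.15 could be assembled, consider the sketch below. The function names, the dictionary keys, the UTF-8 encoding of text fields and the on-the-wire field order (taken here from the row order of Table 3.15) are all assumptions; the text only specifies the pipe separator, the backslash escape, the “~” separator for words and the two-byte weights.

```python
def escape_field(value):
    """Precede every pipe character inside a field with a backslash."""
    return value.replace(b"|", b"\\|")


def build_meta_data(page, capsules):
    """Join page definitions, word capsules and weight capsules with pipes.

    `page` is a dict with hypothetical keys; `capsules` is a list of
    (word, score) pairs already sorted by descending score.
    """
    word_capsules = "~".join(word for word, _ in capsules)
    # Two bytes per score, high byte first, clamped to 65535.
    weight_capsules = b"".join(
        min(score, 65535).to_bytes(2, "big") for _, score in capsules
    )
    text_fields = [
        page["title"], page["description"], page["last_modified"],
        page["parent_url"], page["parent_citation_text"],
        str(page["content_length"]), page["address"], page["language"],
        word_capsules,
    ]
    fields = [field.encode("utf-8") for field in text_fields] + [weight_capsules]
    return b"|".join(escape_field(field) for field in fields)
```

One consequence of this layout is that a weight byte may itself equal the pipe character (for example the score 124), so the sketch escapes the binary weight field in the same way as the text fields; how the prototype handles this case is not described here.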
The compression component is used by both client and server modules. On the client side, it compresses the uploaded meta data according to the “is meta zipped” client system configuration parameter; on the server side, it decompresses the uploaded data. The third-party Zlib Perl module [25] is used for this task. Compressing the meta data saves valuable bandwidth for this distributed search engine platform.
Table 3.17 shows statistics collected by a client module of the DSA prototype while indexing the “cmpe.boun.edu.tr” domain. The second column gives the size of the whole HTML page, the third column the size of the meta data formed by the client module, and the fourth column the size of the compressed meta data actually sent to the server module. The fifth column gives the bandwidth gain ratio without compression, and the sixth column the bandwidth gain ratio with compression.
As Table 3.17 shows, the meta data is on average about 17 times smaller than the original content. Moreover, the data actually uploaded, that is the compressed meta data file, is about 28 times smaller than the original HTML content, which is a valuable bandwidth gain.
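The compression step and the gain ratios of Table 3.17 can be sketched as follows. The prototype uses the Zlib Perl module; the sketch below substitutes Python's standard `zlib` module to show the same idea, and the function names and the boolean `is_meta_zipped` argument are hypothetical stand-ins for the configuration parameter.

```python
import zlib


def compress_meta(meta, is_meta_zipped=True):
    """Client side: compress the meta data when the parameter is set."""
    return zlib.compress(meta) if is_meta_zipped else meta


def decompress_meta(payload, is_meta_zipped=True):
    """Server side: restore the original meta data."""
    return zlib.decompress(payload) if is_meta_zipped else payload


def gain_ratio(html_size, sent_size):
    """Ratio of HTML size to uploaded size, rounded as in Table 3.17."""
    return round(html_size / sent_size, 2)
```

For the home page row of Table 3.17, `gain_ratio(19653, 1029)` and `gain_ratio(19653, 656)` correspond to the fifth and sixth columns respectively.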
Table 3.16. Meta data constructed from home page of “www.cmpe.boun.edu.tr”
| Meta Data Field | Content of Meta Data Field |
| --- | --- |
| Title | Bogazici University COMPUTER ENGINEERING department |
| Description | mail to webmaster TÜRKÇE Events: News and Announcements. People: Faculty Members, Research Assistants, Support Personnel, Students, Alumni Research: Artificial Intelligence Laboratory, Computer Networks Research Laboratory, Perceptual Intelligence Labor |
| Last Modified | |
| Parent Url | |
| Parent Citation Text | |
| Content Length | 19653 |
| Address of Page | http://www.cmpe.boun.edu.tr/ |
| Language | TR |
| Word Capsules | department~university~computer~engineering~bogazici~turkey~events~news~istanbul~information~research~webmaster~mail~tr~boğaziçi~cmpe~month~laboratory~intelligence~bebek~announcements~eta~general~fax~program~people~undergraduate~www~edu~boun~graduate~faculty~artificial~admission~members~block~phone~campus~library~building~located~north~educational~technology~labor~student~robotics~alumni~representatives~seminars~students~support~personnel~türkiye~assistants~networks~ |
| Weight Capsules Before Converted to Binary Format | 187~182~160~158~146~55~51~49~48~48~40~38~38~37~28~28~28~27~27~27~27~25~24~24~20~20~20~18~18~18~17~17~17~17~17~17~17~17~17~17~11~10~10~10~10~10~10~10~10~10~10~10~10~10~10~10~ |
| Weight Capsules (Binary Formatted) | Binary formatted data cannot be shown here |
Table 3.17. Sizes of HTML Content, Meta Data and Compressed Meta Data
| Page Address | Size of HTML (bytes) | Size of Meta Data (bytes) | Size of Compressed Meta Data File (bytes) | (Size of HTML) / (Size of Meta Data) | (Size of HTML) / (Size of Compressed Meta Data File) |
| --- | --- | --- | --- | --- | --- |
| http://www.cmpe.boun.edu.tr/ | 19653 | 1029 | 656 | 19.10 | 29.96 |
| http://www.cmpe.boun.edu.tr/graduate/index.php | 16456 | 806 | 504 | 20.42 | 32.65 |
| http://www.cmpe.boun.edu.tr/graduate/ | 16447 | 795 | 504 | 20.69 | 32.63 |
| http://www.cmpe.boun.edu.tr/admission/graduate.php | 16625 | 851 | 568 | 19.54 | 29.27 |
| http://www.cmpe.boun.edu.tr/graduate/ms_info.php | 17333 | 1538 | 872 | 11.27 | 19.88 |
| http://www.cmpe.boun.edu.tr/~akin/robsem.html | 6535 | 721 | 464 | 9.06 | 14.08 |
| http://www.cmpe.boun.edu.tr/people/alumni/index.php | 16392 | 688 | 432 | 23.83 | 37.94 |
| http://www.cmpe.boun.edu.tr/graduate/phd_theses.php | 15342 | 1184 | 720 | 12.96 | 21.31 |
| http://www.cmpe.boun.edu.tr/research/ | 13779 | 527 | 352 | 26.15 | 39.14 |
| http://www.cmpe.boun.edu.tr/people/faculty/index.php | 18244 | 1147 | 688 | 15.91 | 26.52 |
|
http://www.cmpe.boun.edu.tr/events/index.php |