2.2.3.  Parser Component

Once the incremental indexer component decides to parse and index a web page, the parser component analyzes the content of that page. The following processing steps are applied to the html content:

·  Control characters (newline, carriage return and tab), comment, applet, script and style tags, and non-alphanumeric characters are removed.

·  “Select” html tags are removed if the strip select client user configuration parameter is set (see Table 3.4).

·  All letters are transformed to lower case.

·  The language of the web page is determined from the html meta content-language and meta charset definitions. If either is set to “TR”, “ISO-8859-1” or “Windows-1254”, the language is assumed to be Turkish; otherwise it is assumed to be English. The stopwords file and the dictionary of the stemming component are selected accordingly.

·  The corresponding stopwords file is loaded and the words it contains are eliminated from the content.
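The first steps of the list above can be sketched as follows. This is a minimal illustration, not the prototype's code: the function names `clean_html` and `detect_language` and the `strip_select` argument are hypothetical, and the sketch assumes that removing a comment, applet, script, style or select tag also removes everything it encloses.

```python
import re

# Charset values that the text treats as indicating Turkish content.
TURKISH_CHARSETS = {"iso-8859-1", "windows-1254"}


def detect_language(html: str) -> str:
    """Return "TR" or "EN" from the meta content-language / charset hints."""
    for tag in re.findall(r"<meta\b[^>]*>", html.lower()):
        if "content-language" in tag:
            lang = re.search(r"content\s*=\s*[\"']?\s*([a-z]+)", tag)
            if lang and lang.group(1) == "tr":
                return "TR"
        charset = re.search(r"charset\s*=\s*[\"']?\s*([\w-]+)", tag)
        if charset and charset.group(1) in TURKISH_CHARSETS:
            return "TR"
    return "EN"  # every other case is assumed to be English


def clean_html(html: str, strip_select: bool = False) -> str:
    """Drop comments and unwanted elements, flatten control characters, lower-case."""
    text = re.sub(r"<!--.*?-->", " ", html, flags=re.S)
    unwanted = ["applet", "script", "style"] + (["select"] if strip_select else [])
    for name in unwanted:
        # Assumption: the element is removed together with its content.
        text = re.sub(rf"<{name}\b.*?</{name}\s*>", " ", text, flags=re.S | re.I)
    text = re.sub(r"[\r\n\t]+", " ", text)
    # Note: str.lower() is not locale-aware, so Turkish dotted/dotless i
    # would need extra handling in a real implementation.
    return text.lower()
```

The remaining tags are kept at this stage because the later bucket classification depends on them.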

Stopwords are the most frequent words of English and Turkish, such as “a”, “the”, “and” and “we”. They carry no meaning on their own, so they should be neither scored nor included in the page index [7, 8]. Stopwords are also eliminated from the search query when a user searches with the DSA search engine. They are stored locally in text files on client and server systems; some examples of English and Turkish stopwords are listed in Table 3.11.

Table 3.11. A few examples of English and Turkish stopwords

English stopwords | Turkish stopwords
A                 | Ben
Any               | Ve
All               | Ile
The               | De
This              | Gibi
When              | Şey
Who               | Çok
We                | Biz
They              | Onlar
For               | İçin
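Stopword elimination itself is a simple filter. The sketch below uses only the example words of Table 3.11 and keeps them in in-memory sets; the prototype is described as loading them from local text files, so the `STOPWORDS` mapping and the `remove_stopwords` name are illustrative assumptions.

```python
# Example stopwords from Table 3.11, lower-cased like the parsed content.
STOPWORDS = {
    "EN": {"a", "any", "all", "the", "this", "when", "who", "we", "they", "for"},
    "TR": {"ben", "ve", "ile", "de", "gibi", "şey", "çok", "biz", "onlar", "için"},
}


def remove_stopwords(words, language):
    """Return the words that are not stopwords of the given language."""
    stopwords = STOPWORDS[language]
    return [word for word in words if word not in stopwords]
```

The same filter could be applied to the words of a search query before the query is evaluated.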

 

 

After these modifications to the html content, the words are parsed and classified into the buckets shown in Figure 3.9. Each bucket corresponds to one of the following numbered criteria, which describe where a word is written in the page:

1.  in the description,

2.  in the html meta keywords,

3.  between bold tags (<b> and </b>),

4.  between italic tags (<i> and </i>),

5.  between heading tags (<h> and </h>),

6.  at the beginning of paragraphs and new lines (after <p> or <br> tags),

7.  in the parent citation text (found from the parent citation data structure),

8.  in the title or url text,

9.  in the rest of the page, outside the criteria above.

Figure 3.9. Buckets filled with words according to numbered criteria
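A rough sketch of this classification, using Python's standard `html.parser`, is given below. It covers only buckets 2, 3, 4, 5, 8 and 9; the class name `BucketParser` is hypothetical, heading tags are assumed to mean <h1> to <h6>, and the choice to keep title words out of bucket 9 is an assumption rather than something stated in the text.

```python
import re
from html.parser import HTMLParser

WORD_RE = re.compile(r"\w+")


class BucketParser(HTMLParser):
    # Tag name -> bucket number, following the numbered criteria.
    TAG_BUCKETS = {"b": 3, "i": 4, "title": 8,
                   "h1": 5, "h2": 5, "h3": 5, "h4": 5, "h5": 5, "h6": 5}

    def __init__(self):
        super().__init__()
        self.buckets = {number: [] for number in range(1, 10)}
        self.open_tags = []  # currently open tags that map to a bucket

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and (attributes.get("name") or "").lower() == "keywords":
            content = (attributes.get("content") or "").lower()
            self.buckets[2] += WORD_RE.findall(content)
        elif tag in self.TAG_BUCKETS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in self.open_tags:
            # Close the innermost open tag with this name.
            index = len(self.open_tags) - 1 - self.open_tags[::-1].index(tag)
            del self.open_tags[index]

    def handle_data(self, data):
        words = WORD_RE.findall(data.lower())
        targets = {self.TAG_BUCKETS[tag] for tag in self.open_tags}
        for number in targets:
            self.buckets[number] += words
        if 8 not in targets:  # assumption: title text is not body content
            self.buckets[9] += words
```

A word inside nested tags, such as italic text inside bold text, lands in every matching bucket, which is what allows one word to collect several scores later.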

 

During this process, the title and description of the page are also extracted. Table 3.12 lists the contents of the buckets after a client module of the DSA prototype parses the home page of the “www.cmpe.boun.edu.tr” domain. For example, the words in bucket three are those written in bold, whereas the second bucket is empty because the page contains no meta keywords. The words in bucket four are those written in italic, bucket eight stores the words in the page title and url address, and bucket nine holds all the words written in the content.

 

Table 3.12. Bucket contents of home page of “www.cmpe.boun.edu.tr”

Bucket   | Words in the Bucket
Bucket 1 | mail webmaster türkçe events news announcements people faculty members research assistants support personnel students alumni research artificial intelligence laboratory computer networks research perceptual intelligence labor
Bucket 2 | (empty)
Bucket 3 | mail to webmaster events people research undergraduate program graduate program admission robotics seminars cmpe undergraduate student representatives bogazici university department of computer engineering tr-34342 bebek istanbul turkey general information boğaziçi university istanbul turkey
Bucket 4 | mail webmaster news events month boğaziçi university istanbul turkey
Bucket 5 | türkçe news events month robotics seminars cmpe courses
Bucket 6 | mail webmaster events faculty members artificial intelligence laboratory information curriculum ms information graduate admission department bogazici university turkey department department fax ++90 general information
Bucket 7 | (empty)
Bucket 8 | bogazici university computer engineering department www cmpe boun edu tr
Bucket 9 | computer bebek month phone bogazici fax news educational istanbul north events building department tr eta located turkey general main block adjacent technology information announcements campus engineering library university secretary

 

2.2.4.  Scoring Component

This component assigns weights to the words in each bucket filled by the parser component and forms the meta data of the indexed page. Each bucket has a defined score, and each word receives the sum of the scores of the buckets in which it appears. A word can be in one or more buckets, depending on the parser constraints; for instance, a word can be written in bold, appear in the title and also be part of the citation text of the page. Each constraint has its own score, defined in the client system configuration parameters called field multipliers (see Table 3.2), and each field multiplier gives its score to the words in the corresponding bucket. According to Table 3.2, a word in bucket one takes the score 10, a word in bucket two takes the score two, and so on. In this way, word weight capsules (WWC) are computed. A word weight capsule consists of a word and its computed relevance score, held in the data structure shown in Table 3.13. The computed relevance score is the sum of the scores the word gained from each bucket in which it appears.

Table 3.13. Data structure of word weight capsule

Field Name               | Description
Word                     | Distinct word in a page
Computed Relevance Score | Sum of the scores a word gained from each bucket in which it appears
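The scoring rule can be sketched as below. Only the multipliers of buckets one and two (10 and 2) come from the text; the other values in `FIELD_MULTIPLIERS` are placeholders, since Table 3.2 is not reproduced here. The sketch also assumes that a word is counted once per bucket, following the wording "existence in these buckets"; whether repeated occurrences add to the score is not stated.

```python
from typing import NamedTuple


class WordWeightCapsule(NamedTuple):
    word: str
    score: int  # computed relevance score


# Buckets 1 and 2 follow the text; buckets 3 to 9 are hypothetical values.
FIELD_MULTIPLIERS = {1: 10, 2: 2, 3: 3, 4: 3, 5: 5, 6: 2, 7: 8, 8: 10, 9: 1}


def score_words(buckets):
    """Sum, for every distinct word, the multipliers of the buckets holding it."""
    scores = {}
    for number, words in buckets.items():
        for word in set(words):  # assumption: one contribution per bucket
            scores[word] = scores.get(word, 0) + FIELD_MULTIPLIERS[number]
    return [WordWeightCapsule(word, score) for word, score in scores.items()]
```

For instance, a word found in both bucket one and bucket two would receive 10 + 2 = 12 under these multipliers.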

 

Because all client modules use the same algorithm and the same field multiplier scores for the words of the pages indexed in the DSA system, scores are comparable across pages: if a word w has a higher score in page p than in any other indexed page, then p is the page most relevant to w. So, when w is searched in the DSA search engine, p is the most relevant page for the query and is displayed first.

After the scores are computed, the word weight capsules are sorted by computed relevance score from highest to lowest, and a portion from the top is taken to form the meta data to be sent to the server module. The size of this portion is set by the key count percentage parameters in the client system parameters (see Table 3.2), and the parameter is selected by the number of words in the page. If a page has fewer than 100 words, the “key count 100” multiplier is selected; if it has more than 100 but fewer than 500 words, the value of the “key count 500” multiplier is selected, and this ratio of the word weight capsules is sent to the server. For example, if a page has 400 scored distinct words, the “key count 500” ratio parameter, which is 0.7, is selected. So 70 per cent of the words, that is the top 280 words, will be sent in word weight capsules to the server module.
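The selection step can be sketched as follows. The 70 per cent value for “key count 500” is taken from the example above; the value for “key count 100” and the fallback for larger pages are placeholders, and the names are hypothetical. The text does not say which parameter applies to a page with exactly 100 words; this sketch places it in the “key count 500” class.

```python
# (word-count limit, percentage of capsules kept); 70 comes from the text,
# the other percentages are hypothetical stand-ins for Table 3.2.
KEY_COUNT_PERCENT = [(100, 90), (500, 70)]
DEFAULT_PERCENT = 50  # hypothetical value for pages with 500 or more words


def select_top_capsules(capsules):
    """Sort (word, score) pairs by score, highest first, and keep the top share."""
    ranked = sorted(capsules, key=lambda capsule: capsule[1], reverse=True)
    count = len(ranked)
    percent = next(
        (value for limit, value in KEY_COUNT_PERCENT if count < limit),
        DEFAULT_PERCENT,
    )
    # Integer arithmetic avoids floating-point rounding surprises.
    return ranked[: count * percent // 100]
```

With 400 scored distinct words this keeps 400 * 70 / 100 = 280 capsules, matching the example in the text.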

In this way, search performance is kept fast, and the words that reflect the web page's content with less relevance are not stored in the search engine database. If these less frequent and less relevant words were indexed and stored, they would not help the document reach the top-ranked results for a query; they could only place the page among the bottom-ranked results. It is known that users of search engines do not follow more than a few tens of search results and do not even look at the 40th or 50th result [4].

Table 3.14 lists the word weight capsules computed for the home page of “www.cmpe.boun.edu.tr” by a client module of the DSA prototype.

Table 3.14. Word weight capsules of home page of “www.cmpe.boun.edu.tr”

Word          | Score | Word            | Score
department    | 187   | admission       | 17
university    | 182   | members         | 17
computer      | 160   | block           | 17
engineering   | 158   | phone           | 17
bogazici      | 146   | campus          | 17
turkey        | 55    | library         | 17
events        | 51    | building        | 17
news          | 49    | located         | 11
istanbul      | 48    | north           | 10
information   | 48    | educational     | 10
research      | 40    | technology      | 10
webmaster     | 38    | labor           | 10
mail          | 38    | student         | 10
tr            | 37    | robotics        | 10
boğaziçi      | 28    | alumni          | 10
cmpe          | 28    | representatives | 10
month         | 28    | seminars        | 10
laboratory    | 27    | students        | 10
intelligence  | 27    | support         | 10
bebek         | 27    | personnel       | 10
announcements | 27    | türkiye         | 10
eta           | 25    | assistants      | 10
general       | 24    | networks        | 10
fax           | 24    | perceptual      | 10
program       | 20    | adjacent        | 10
people        | 20    | main            | 10
undergraduate | 20    | secretary       | 8
www           | 18    | curriculum      | 7
edu           | 18    | ms              | 7
boun          | 18    | info@cmpe       | 7
graduate      | 17    |                 |
faculty       | 17    |                 |
artificial    | 17    |                 |

 

After the scoring phase, the scoring component forms the meta data to be sent to the server. The meta data consists of the word weight capsules and the page definitions: title, description, last modification date, url of the parent page, citation text of the parent page, content length and address (url) of the page. The fields and format of the meta data are shown in Table 3.15.

These fields are separated by the pipe character ( | ); a pipe character inside any field is preceded by a backslash ( \| ) to prevent confusion. In the meta data, the word weight capsules are stored in a structure that is easier for server modules to parse and that decreases the message size. The word weight capsules are split into words and word relevance weights. All words are concatenated with the tilde character ( ~ ), and the resulting string is called the word capsules. In parallel, the scores of the words are converted into two-byte characters and concatenated, forming the weight capsules. Both concatenations are done in the same order, so server modules can find the corresponding score of each word while parsing.
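The field layout described here can be sketched as follows. The helper names are hypothetical, and the sketch handles only the one escaping rule given in the text (a backslash before a pipe); how a literal backslash inside a field should be treated is not specified, so it is left untouched.

```python
import re


def build_word_capsules(words):
    """Concatenate the words, in score order, with the tilde character."""
    return "~".join(words)


def join_fields(fields):
    """Join text fields with '|', escaping pipes inside a field as '\\|'."""
    return "|".join(field.replace("|", "\\|") for field in fields)


def split_fields(record):
    """Inverse of join_fields: split on unescaped pipes, then unescape."""
    parts = re.split(r"(?<!\\)\|", record)
    return [part.replace("\\|", "|") for part in parts]
```

A server-side parser would call something like `split_fields` first and then split the word capsules field on the tilde character to recover the individual words.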

The computed word weights are converted into binary representation while being concatenated: each weight is written as two bytes, each byte being the character with that code. In other words, the scores are sent in binary format, not in text format. No separator character is needed in the concatenation because each weight is represented by exactly two characters (i.e. two bytes). A score greater than 256 * 256 - 1, which is 65535, is fixed to 65535. For example, let the score of a word be 65. The character with ASCII code 65 is “A”, so the character “A” is sent in the meta data instead of “65”. However, a weight capsule must be exactly two bytes long, because the server module splits the concatenation of weight capsules every two bytes, so a null character (the first character of ASCII) is added in front of “A”.

As another example, let the score be 17446. First we divide it by 256 and take the integer part of the quotient, which is 68, to find the leftmost character; 68 is the ASCII code of “D”. The remainder, 17446 - (68 * 256), is 38, which is the ASCII code of “&”. So the result of the base-256 conversion is “D&”, which represents 17446. Instead of “17446” (five bytes), the characters “D&” (two bytes) are sent in the meta data, saving three bytes in communication. The server module performs the inverse conversion, transforming “D&” back to 17446 while processing the weight capsules.
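The two worked examples can be reproduced with a short sketch. The function names are illustrative, not taken from the prototype, and the sketch assumes non-negative integer scores; the encoding is the same as a big-endian unsigned 16-bit integer.

```python
MAX_WEIGHT = 256 * 256 - 1  # 65535, the largest value that fits in two bytes


def encode_weight(score: int) -> bytes:
    """Encode one score as exactly two bytes (high byte first)."""
    score = min(score, MAX_WEIGHT)  # larger scores are fixed to 65535
    return bytes([score // 256, score % 256])


def encode_weights(scores) -> bytes:
    """Concatenate the two-byte codes; no separator is needed."""
    return b"".join(encode_weight(score) for score in scores)


def decode_weights(blob: bytes):
    """Split the concatenation every two bytes and rebuild the scores."""
    return [blob[i] * 256 + blob[i + 1] for i in range(0, len(blob), 2)]
```

Here `encode_weight(65)` yields a null byte followed by “A”, and `encode_weight(17446)` yields “D&”, as in the examples above; `decode_weights` performs the inverse step described for the server module.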

Table 3.16 lists the meta data constructed for the home page of “www.cmpe.boun.edu.tr” by a client module of the DSA prototype, namely the title, description, last modified date, parent url address, parent citation text, content length, address of the page, language used in the content, word capsules and weight capsules. Note that the weight capsules are shown in integer format in this table to avoid a garbled representation of binary characters.

 

Table 3.15. Structure of meta data formed by parser component

Field Name           | Description
Title                | Page title
Description          | Description extracted from page
Last Modified        | Last modification date of web page returned by web server of the site
Parent Url           | Url of parent page citing this page, found from parent citation data structure
Parent Citation Text | Citation text of parent page, found from parent citation data structure
Content Length       | Size of page in bytes
Address of Page      | Url of page
Language             | Language of html content (Turkish or English)
Word Capsules        | Concatenation of distinct parsed and scored words of page, separated by “~” character, in computed word score order
Weight Capsules      | Concatenation of two-byte computed word scores, in the same order as the word capsules

 

2.2.5.  Compression Component

The compression component is used by both client and server modules. In client modules it compresses the meta data to be uploaded, depending on the “is meta zipped” client system configuration parameter; in server modules it decompresses the uploaded data. The third-party Zlib Perl module [25] is used for this task. Compressing the meta data before it is sent saves valuable bandwidth for this distributed search engine platform.
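The round trip can be illustrated with Python's standard `zlib` module, used here in place of the Perl Zlib module of the prototype. The function names and the `is_meta_zipped` argument are assumptions that mirror the configuration parameter; the sketch also assumes the server knows whether the payload was compressed.

```python
import zlib


def pack_meta(meta: bytes, is_meta_zipped: bool) -> bytes:
    """Client side: compress the meta data if the parameter is set."""
    return zlib.compress(meta) if is_meta_zipped else meta


def unpack_meta(payload: bytes, is_meta_zipped: bool) -> bytes:
    """Server side: restore the original meta data."""
    return zlib.decompress(payload) if is_meta_zipped else payload
```

Meta data with repeated words and field names tends to compress well, which is the source of the bandwidth gain.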

Table 3.17 shows statistics collected by a client module of the DSA prototype while indexing the “cmpe.boun.edu.tr” domain. The second column gives the size of the whole html page, the third column the size of the meta data formed by the client module, and the fourth column the size of the compressed meta data actually sent to the server module. The fifth column gives the bandwidth gain ratio without compression, and the sixth column the bandwidth gain ratio with compression.

As Table 3.17 shows, the meta data is on average about 17 times smaller than the original content. Moreover, the data actually uploaded, that is the compressed meta data file, is about 28 times smaller than the original html content, which is a valuable bandwidth gain.

 

Table 3.16. Meta data constructed from home page of “www.cmpe.boun.edu.tr”

Meta Data Field | Content of Meta Data Field
Title | Bogazici University COMPUTER ENGINEERING department
Description | mail to webmaster TÜRKÇE Events: News and Announcements. People: Faculty Members, Research Assistants, Support Personnel, Students, Alumni Research: Artificial Intelligence Laboratory, Computer Networks Research Laboratory, Perceptual Intelligence Labor
Last Modified | (empty)
Parent Url | (empty)
Parent Citation Text | (empty)
Content Length | 19653
Address of Page | http://www.cmpe.boun.edu.tr/
Language | TR
Word Capsules | department~university~computer~engineering~bogazici~turkey~events~news~istanbul~information~research~webmaster~mail~tr~boğaziçi~cmpe~month~laboratory~intelligence~bebek~announcements~eta~general~fax~program~people~undergraduate~www~edu~boun~graduate~faculty~artificial~admission~members~block~phone~campus~library~building~located~north~educational~technology~labor~student~robotics~alumni~representatives~seminars~students~support~personnel~türkiye~assistants~networks~
Weight Capsules Before Conversion to Binary Format | 187~182~160~158~146~55~51~49~48~48~40~38~38~37~28~28~28~27~27~27~27~25~24~24~20~20~20~18~18~18~17~17~17~17~17~17~17~17~17~17~11~10~10~10~10~10~10~10~10~10~10~10~10~10~10~10~
Weight Capsules (Binary Formatted) | Binary formatted data cannot be shown here

 

Table 3.17. Sizes of HTML Content, Meta Data and Compressed Meta Data

Page Address | Size of HTML (bytes) | Size of Meta Data (bytes) | Size of Compressed Meta Data File (bytes) | (Size of HTML) / (Size of Meta Data) | (Size of HTML) / (Size of Compressed Meta Data File)
http://www.cmpe.boun.edu.tr/ | 19653 | 1029 | 656 | 19.10 | 29.96
http://www.cmpe.boun.edu.tr/graduate/index.php | 16456 | 806 | 504 | 20.42 | 32.65
http://www.cmpe.boun.edu.tr/graduate/ | 16447 | 795 | 504 | 20.69 | 32.63
http://www.cmpe.boun.edu.tr/admission/graduate.php | 16625 | 851 | 568 | 19.54 | 29.27
http://www.cmpe.boun.edu.tr/graduate/ms_info.php | 17333 | 1538 | 872 | 11.27 | 19.88
http://www.cmpe.boun.edu.tr/~akin/robsem.html | 6535 | 721 | 464 | 9.06 | 14.08
http://www.cmpe.boun.edu.tr/people/alumni/index.php | 16392 | 688 | 432 | 23.83 | 37.94
http://www.cmpe.boun.edu.tr/graduate/phd_theses.php | 15342 | 1184 | 720 | 12.96 | 21.31
http://www.cmpe.boun.edu.tr/research/ | 13779 | 527 | 352 | 26.15 | 39.14
http://www.cmpe.boun.edu.tr/people/faculty/index.php | 18244 | 1147 | 688 | 15.91 | 26.52

http://www.cmpe.boun.edu.tr/events/index.php