4.  Results From Proof of Concept Implementation

We measured and calculated results for our DSA implementation covering data volume, bandwidth gain, web site change fraction, index update latency, and an example expression search.

4.1.  Volume

To measure volume, we indexed 3695 pages from 5 sites. For each site, we compared the content size retrieved by centralized search engines with the size of the meta data, and of the compressed meta data file, uploaded to the DSA server modules. The results are shown in Table 5.1. After the site address, the columns list, in order: the total number of pages indexed, the total content size downloaded by centralized engines, the average page size, the total meta data size produced by the DSA client modules, the total size of the compressed and encrypted meta data files, and the average meta data file size retrieved by the DSA server modules for each site. From Table 5.1, we can see that a centralized search engine needs about ten megabytes of bandwidth to download 1000 HTML pages in the “www.cmpe.boun.edu.tr” domain, whereas the DSA system needs only about half a megabyte, which is about 20 times smaller.

Table 5.1. Volume of data statistics for centralized search engines and DSA

| Site Address | Number of Pages Retrieved | Total Content Size (Retrieved by Centralized, bytes) | Average Size of Pages (bytes) | Total Meta Data Size (DSA, bytes) | Total Meta Data File Size (DSA, bytes) | Average Meta Data File Size (bytes) |
|---|---|---|---|---|---|---|
| www.cmpe.boun.edu.tr | 1000 | 10893139 | 10893 | 886530 | 533952 | 534 |
| www.gima.com.tr | 30 | 955424 | 31847 | 47290 | 26428 | 875 |
| www.msnbc.com | 994 | 48572916 | 48866 | 1708594 | 965712 | 972 |
| www.uluslararasiegitim.com | 1000 | 30287525 | 30257 | 1097518 | 611512 | 611 |
| www.ykb.com | 671 | 13467172 | 20070 | 1004815 | 540064 | 805 |
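The "about 20 times smaller" figure quoted for www.cmpe.boun.edu.tr can be rechecked from the totals in Table 5.1. The following Python sketch does this for every site. The dictionary layout and the function name are illustrative choices and are not part of the DSA code. The values it prints are ratios of totals, so they need not match the per-page averages reported in Table 5.2.

```python
# Totals copied from Table 5.1:
# (pages, total content size in bytes, total meta data file size in bytes).
TABLE_5_1 = {
    "www.cmpe.boun.edu.tr": (1000, 10893139, 533952),
    "www.gima.com.tr": (30, 955424, 26428),
    "www.msnbc.com": (994, 48572916, 965712),
    "www.uluslararasiegitim.com": (1000, 30287525, 611512),
    "www.ykb.com": (671, 13467172, 540064),
}


def size_reduction(content_bytes, meta_file_bytes):
    """How many times smaller the uploaded meta data file is than the content."""
    return content_bytes / meta_file_bytes


# Total number of pages indexed across the five sites.
total_pages = sum(pages for pages, _, _ in TABLE_5_1.values())

if __name__ == "__main__":
    print("pages indexed:", total_pages)
    for site, (pages, content, meta_file) in TABLE_5_1.items():
        print(f"{site}: {size_reduction(content, meta_file):.1f} times smaller")
```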

 

 

Table 5.2 lists, for the same pages used in Table 5.1, the average ratio of page content size to the meta data sent from client modules to server modules. Our measurements show that, on average, the meta data files uploaded for these pages are 35.6 times smaller than the page content, which corresponds to a bandwidth gain of 97.19 per cent. We also achieved an average compression ratio of 1.73, which corresponds to a compression gain of 41.86 per cent.

Table 5.2. Averages of ratios of content sizes of pages over meta data and gain

| Site Address | Average Ratio of Page Size / Meta Data | Average Ratio of Page Size / Compressed Meta Data File | Average Compression Ratio |
|---|---|---|---|
| www.cmpe.boun.edu.tr | 11.21 | 17.65 | 1.64 |
| www.gima.com.tr | 19.55 | 33.73 | 1.71 |
| www.msnbc.com | 29.59 | 50.39 | 1.73 |
| www.uluslararasiegitim.com | 29.89 | 52.00 | 1.76 |
| www.ykb.com | 13.38 | 24.23 | 1.79 |
| General Average | 20.72 | 35.60 | 1.73 |
| Average Gain in Bandwidth | 95.17 % | 97.19 % | 41.86 % |
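The gain row of Table 5.2 follows from the ratio row: if the transferred data is r times smaller, the fraction saved is 1 - 1/r. The short Python sketch below shows this relationship using a hypothetical helper name. Note that the rounded compression ratio 1.73 yields roughly 42.2 per cent rather than the 41.86 per cent in the table, presumably because the table was computed from unrounded values.

```python
def gain_percent(ratio):
    """Percentage saved when the data becomes `ratio` times smaller."""
    return (1.0 - 1.0 / ratio) * 100.0


# General averages taken from Table 5.2.
PAGE_OVER_META = 20.72
PAGE_OVER_COMPRESSED_FILE = 35.60
COMPRESSION_RATIO = 1.73

if __name__ == "__main__":
    for label, ratio in [
        ("meta data", PAGE_OVER_META),
        ("compressed meta data file", PAGE_OVER_COMPRESSED_FILE),
        ("compression", COMPRESSION_RATIO),
    ]:
        print(f"{label}: ratio {ratio} gives a gain of {gain_percent(ratio):.2f} per cent")
```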

 

4.2.  Web Site Change Fraction and Latency

We define the web site change fraction as the ratio of the number of pages added, deleted or updated to the total number of pages in the web site. In our measurements, we focus on web sites that are likely to have a small change fraction, such as news sites. The following results were collected with a modified client module that finds the pages newly added each day to a news site by following the URL patterns of newly added news. For example, on the “milliyet.com.tr” web site, the sports news of a given day is listed under the URL pattern “www.milliyet.com.tr/year/month/day/spor/index.html”. A similar rule is applied to every news category. For the other news sites, similar URL pattern rules are applied; the news pages added each day were counted for ten days and the observations were averaged.
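As an illustration of the URL pattern rule, the following Python sketch builds the daily index address for a news category. It is not the modified client module itself; the function names are hypothetical, the example date is arbitrary, and the four-digit year and zero-padded month and day are assumptions, since the pattern is only given as year/month/day.

```python
from datetime import date, timedelta


def daily_index_url(host, day, category):
    """Fill the pattern host/year/month/day/category/index.html for one day.

    Assumption: four-digit year, zero-padded month and day.
    """
    return f"{host}/{day:%Y/%m/%d}/{category}/index.html"


def urls_for_period(host, first_day, days, category):
    """Daily index URLs for `days` consecutive days starting at `first_day`."""
    return [
        daily_index_url(host, first_day + timedelta(days=i), category)
        for i in range(days)
    ]


if __name__ == "__main__":
    # Arbitrary ten-day window, chosen only for illustration.
    for url in urls_for_period("www.milliyet.com.tr", date(2002, 12, 13), 10, "spor"):
        print(url)
```

A crawler restricted to these addresses only needs to fetch one index page per category per day, which is what keeps the number of traversed pages close to the number of new pages.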

These specialized modules were run for ten days on the Milliyet, Hurriyet, Sabah, Akşam, Radikal, Star and Ntvmsnbc news sites. The total number of pages on each site is an estimate: we determined from the archive sections how many days each web site has been published and multiplied this by the number of pages added each day. Of course, this technique cannot be generalized; it is only applicable to news sites that publish nearly the same amount of news each day.

In Table 5.3, we present, for each daily news resource, the number of news pages it publishes per day, the start date of its online publishing, the number of days elapsed since that date, and the estimated total number of news pages on the web site.

Table 5.3. Volume of listed news resources

| News Resource | Number of News per Day | Start Date of Online Publishing | Current Date | Days Past | Total Number of News Pages |
|---|---|---|---|---|---|
| Milliyet | 116 | 26.11.1996 | 22.12.2002 | 2186 | 253066 |
| Hurriyet | 220 | 01.01.1997 | 22.12.2002 | 2151 | 472575 |
| Sabah | 53 | 01.01.1997 | 22.12.2002 | 2151 | 114254 |
| Akşam | 103 | 01.01.1998 | 22.12.2002 | 1791 | 184085 |
| Radikal | 125 | 01.05.1998 | 22.12.2002 | 1671 | 208346 |
| Star | 101 | 01.01.2002 | 22.12.2002 | 351 | 35422 |
| Ntvmsnbc | 88 | 01.01.2000 | 22.12.2002 | 1071 | 94272 |
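The Days Past column in Table 5.3 is consistent with a simplified day count that treats every month as 30 days and every year as 360 days, rather than with an exact calendar difference. The Python sketch below reproduces the column under that assumption and shows the page estimate as days multiplied by pages per day. The function names are illustrative. Because the per-day counts in the table are rounded, the product only approximates the totals listed in the table.

```python
def days_past_30_360(start, end):
    """Day count with 30-day months; dates are (day, month, year) tuples."""
    d1, m1, y1 = start
    d2, m2, y2 = end
    return (y2 - y1) * 360 + (m2 - m1) * 30 + (d2 - d1)


def estimated_total_pages(pages_per_day, days_past):
    """Estimate used in the text: days published times pages added per day."""
    return pages_per_day * days_past


CURRENT_DATE = (22, 12, 2002)

# (start date of online publishing, news pages per day) from Table 5.3.
NEWS_RESOURCES = {
    "Milliyet": ((26, 11, 1996), 116),
    "Hurriyet": ((1, 1, 1997), 220),
    "Sabah": ((1, 1, 1997), 53),
    "Akşam": ((1, 1, 1998), 103),
    "Radikal": ((1, 5, 1998), 125),
    "Star": ((1, 1, 2002), 101),
    "Ntvmsnbc": ((1, 1, 2000), 88),
}

if __name__ == "__main__":
    for name, (start, per_day) in NEWS_RESOURCES.items():
        days = days_past_30_360(start, CURRENT_DATE)
        print(name, days, estimated_total_pages(per_day, days))
```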

 

 

We define latency as the number of pages that the search engine modules must traverse to find all the added, deleted and updated web pages. Evidently, this number is directly proportional to the delay in updating the search engine index: results from the updated web site index become available only after this delay.

Table 5.4 lists the web site change fraction and latency values of the news resources mentioned above. The latency of a centralized search engine is the same as the total number of pages in the web site: in this case, there is no way to know which pages were added, deleted or updated without traversing the whole web site and downloading it to the search engine system. In our DSA system, by contrast, client modules run at specific intervals on the client system and can traverse the whole site easily, without using any bandwidth. The latency of DSA is therefore equal to the number of news pages added each day. Client administrators can prevent client modules from traversing specific URL paths such as “/archive/” or “/arsiv/” to speed up the client module process. Additionally, they can set the starting URL parameter of the client user parameters to the address of the “latest news page”, which usually exists on big web sites and daily news sites. With these two parameters, they can direct client modules to find new and updated pages efficiently. For small web sites, changes can be found easily without setting these parameters. From Table 5.4 we can see that the web site change fraction is as small as 0.09 per cent on average for these web sites. Finding new pages would therefore be a very expensive job for centralized search engines, whose latency may amount to traversing the whole site.
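The two client parameters described above, excluded URL paths and a starting URL, can be pictured with the following Python sketch. The parameter names, the example host and the filtering function are hypothetical; they do not reproduce the actual DSA client configuration format.

```python
from urllib.parse import urlparse

# Hypothetical client parameters; names and values are illustrative only.
CLIENT_PARAMS = {
    "start_url": "http://www.example.com/latest/index.html",
    "excluded_paths": ["/archive/", "/arsiv/"],
}


def should_traverse(url, excluded_paths):
    """Return False when the URL path contains one of the excluded paths."""
    path = urlparse(url).path
    return not any(excluded in path for excluded in excluded_paths)


if __name__ == "__main__":
    candidates = [
        CLIENT_PARAMS["start_url"],
        "http://www.example.com/arsiv/2001/index.html",
        "http://www.example.com/archive/old.html",
    ]
    for url in candidates:
        print(url, should_traverse(url, CLIENT_PARAMS["excluded_paths"]))
```

Starting from the latest news page and skipping archive paths keeps the traversal close to the set of pages that actually changed.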

Table 5.4. Web site change fraction and latency values of mentioned news resources

| News Resource | Number of News Added Each Day | Total Number of News Pages | Web Site Change Fraction (per cent) | Latency of Centralized Search Engines (pages) | Latency of DSA (pages) |
|---|---|---|---|---|---|
| Milliyet | 116 | 253066 | 0.046 | 253066 | 116 |
| Hurriyet | 220 | 472575 | 0.047 | 472575 | 220 |
| Sabah | 53 | 114254 | 0.046 | 114254 | 53 |
| Akşam | 103 | 184085 | 0.056 | 184085 | 103 |
| Radikal | 125 | 208346 | 0.060 | 208346 | 125 |
| Star | 101 | 35422 | 0.285 | 35422 | 101 |
| Ntvmsnbc | 88 | 94272 | 0.093 | 94272 | 88 |
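The change fraction column and the latency comparison in Table 5.4 can be recomputed from the first two columns. The Python sketch below, with illustrative names, treats the centralized latency as the total page count and the DSA latency as the number of pages added per day, as defined in the text.

```python
# (news pages added each day, total number of news pages) from Table 5.4.
TABLE_5_4 = {
    "Milliyet": (116, 253066),
    "Hurriyet": (220, 472575),
    "Sabah": (53, 114254),
    "Akşam": (103, 184085),
    "Radikal": (125, 208346),
    "Star": (101, 35422),
    "Ntvmsnbc": (88, 94272),
}


def change_fraction_percent(added_per_day, total_pages):
    """Web site change fraction, expressed in per cent."""
    return added_per_day / total_pages * 100.0


def latency_ratio(added_per_day, total_pages):
    """Pages a centralized engine traverses per page traversed by DSA."""
    return total_pages / added_per_day


# Mean change fraction over the seven news resources, in per cent.
average_change_fraction = sum(
    change_fraction_percent(added, total) for added, total in TABLE_5_4.values()
) / len(TABLE_5_4)

if __name__ == "__main__":
    for name, (added, total) in TABLE_5_4.items():
        print(f"{name}: {change_fraction_percent(added, total):.3f} per cent, "
              f"centralized/DSA latency ratio {latency_ratio(added, total):.0f}")
    print(f"average change fraction: {average_change_fraction:.2f} per cent")
```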

 

 

Another parameter, the time delay of updating the search engine index, differs between centralized search engines and the DSA. In centralized systems, the bandwidth, the crawler queue and the number of web sites in the search engine index directly affect this delay. In our DSA system, the scheduling interval of the client modules and the number of member client modules seem to affect this parameter.

4.3.  Search Results

In this section we show an example search result from our DSA search engine in Figure 5.1. The expression “artificial intelligence” was searched over 42 web sites, covering 6283 pages from these web sites and 691161 word indexes.

Figure 5.1. Example search result from DSA

4.4.  Code Statistics

The components of the DSA system are written in the open source, high-level programming languages Perl and PHP. The client and server modules are written in Perl, whereas the subscription and DSA search interface pages are written in PHP; the client user configuration file generator is also written in Perl. Table 5.5 lists the number of files, the lines of code and the total file size of the written code and configuration files.

 

Table 5.5. Sizes of DSA programming codes

| Component | Number of Module Files | Total Lines of Code or Text | Total File Size (KB) |
|---|---|---|---|
| Client Module | 6 | 1032 | 28.7 |
| Stopwords Files | 2 | 686 | 4.59 |
| Client Configuration Files | 2 | 31 | 1.13 |
| Server Module | 5 | 579 | 23.9 |
| Subscription Module | 2 | 197 | 6.22 |
| Search Interface | 7 | 556 | 14.9 |
| Total | 24 | 3081 | 79.44 |

 
