We measured and computed results for our DSA implementation covering data volume, bandwidth gain, web site change fraction, page index update latency, and expression search results.
Table 5.1. Volume of data statistics for centralized search engines and DSA
| Site Address | Number of Pages Retrieved | Total Content Size (Retrieved by Centralized) | Average Size of Pages | Total Meta Data Size (DSA) | Total Meta Data File Size (DSA) | Average Meta Data File Size |
|---|---|---|---|---|---|---|
| www.cmpe.boun.edu.tr | 1000 | 10893139 | 10893 | 886530 | 533952 | 534 |
| www.gima.com.tr | 30 | 955424 | 31847 | 47290 | 26428 | 875 |
| www.msnbc.com | 994 | 48572916 | 48866 | 1708594 | 965712 | 972 |
| www.uluslararasiegitim.com | 1000 | 30287525 | 30257 | 1097518 | 611512 | 611 |
| www.ykb.com | 671 | 13467172 | 20070 | 1004815 | 540064 | 805 |
Table 5.2 lists, for the same pages used in Table 5.1, the average ratios of page content size to the meta data sent from client modules to server modules. Our measurements show that, on average, the total size of the meta data files uploaded for these pages is 35.6 times smaller than the total content size, which corresponds to a bandwidth gain of 97.19 per cent. We also achieved an average compression ratio of 1.73, for a compression gain of 41.86 per cent.
Table 5.2. Averages of ratios of content sizes of pages over meta data and gain
| Site Address | Average Ratio of Page Size / Meta Data | Average Ratio of Page Size / Compressed Meta Data File | Average Compression Ratio |
|---|---|---|---|
| www.cmpe.boun.edu.tr | 11.21 | 17.65 | 1.64 |
| www.gima.com.tr | 19.55 | 33.73 | 1.71 |
| www.msnbc.com | 29.59 | 50.39 | 1.73 |
| www.uluslararasiegitim.com | 29.89 | 52.00 | 1.76 |
| www.ykb.com | 13.38 | 24.23 | 1.79 |
| General Average | 20.72 | 35.60 | 1.73 |
| Average Gain in Bandwidth | 95.17 % | 97.19 % | 41.86 % |
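The bandwidth gains in the last row of Table 5.2 appear to follow from the size ratios in the row above it. The sketch below assumes the relation gain = 1 - 1/ratio, which is not stated in the text but reproduces the two tabulated bandwidth figures; the function name `bandwidth_gain` is chosen here for illustration only.

```python
def bandwidth_gain(ratio: float) -> float:
    """Percentage of traffic saved when data shrinks by the given factor.

    Assumed relation (not stated in the text): gain = 1 - 1/ratio.
    """
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    return (1.0 - 1.0 / ratio) * 100.0


# General averages taken from Table 5.2.
page_over_meta = 20.72        # page size / meta data
page_over_compressed = 35.60  # page size / compressed meta data file

print(round(bandwidth_gain(page_over_meta), 2))        # 95.17
print(round(bandwidth_gain(page_over_compressed), 2))  # 97.19
```

Under this assumption, a ratio of 35.60 means that only about 2.8 per cent of the original content size has to be transferred.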
We define the web site change fraction as the ratio of the number of pages added, deleted, or updated to the total number of pages in the web site. In our measurements, we focus on web sites that are likely to have a small change fraction, such as news sites. The results below were collected by a modified client module that finds the pages newly added each day to a news site by following the URL patterns of newly published news. For example, on the “milliyet.com.tr” web site, the sports news of a given day is listed under the URL pattern “www.milliyet.com.tr/year/month/day/spor/index.html”. A similar rule is applied to every news category and, with site-specific URL patterns, to the other news sites. The news pages added each day were counted for ten days, and the average of these observations was computed.
These specialized modules were run on the Milliyet, Hurriyet, Sabah, Akşam, Radikal, Star and Ntvmsnbc news sites for ten days. The total number of pages is an estimate: we determined from each site's archive section the number of days it has been publishing online and multiplied this by the number of pages added each day. This technique cannot be generalized; it applies only to news sites that publish nearly the same amount of news each day.
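The date-based URL rule described above can be sketched as follows. This is a minimal illustration, not the modified client module itself: the template string, the zero-padding of month and day, the function name `daily_index_urls`, and the dates in the usage line are all assumptions made for the example.

```python
from datetime import date, timedelta

# Hypothetical template modelled on the pattern quoted in the text.
# Zero-padding of month and day is an assumption.
PATTERN = "www.milliyet.com.tr/{year}/{month:02d}/{day:02d}/{category}/index.html"


def daily_index_urls(start: date, days: int, categories: list[str]) -> list[str]:
    """Build one index URL per category for each day in the observation window."""
    urls = []
    for offset in range(days):
        current = start + timedelta(days=offset)
        for category in categories:
            urls.append(
                PATTERN.format(
                    year=current.year,
                    month=current.month,
                    day=current.day,
                    category=category,
                )
            )
    return urls


# Illustrative ten-day window for the sports ("spor") category.
for url in daily_index_urls(date(2002, 12, 13), 10, ["spor"]):
    print(url)
```

Each generated index page would then be fetched and its links counted to obtain the number of news pages added on that day.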
Table 5.3 presents, for each daily news resource, the number of news items published per day, the start date of online publishing, the number of days elapsed since that date, and the estimated total number of news pages.
Table 5.3. Volume of listed news resources
| News Resource | Number of News per Day | Start Date of Online Publishing | Current Date | Days Past | Total Number of News Pages |
|---|---|---|---|---|---|
| Milliyet | 116 | 26.11.1996 | 22.12.2002 | 2186 | 253066 |
| Hurriyet | 220 | 01.01.1997 | 22.12.2002 | 2151 | 472575 |
| Sabah | 53 | 01.01.1997 | 22.12.2002 | 2151 | 114254 |
| Akşam | 103 | 01.01.1998 | 22.12.2002 | 1791 | 184085 |
| Radikal | 125 | 01.05.1998 | 22.12.2002 | 1671 | 208346 |
| Star | 101 | 01.01.2002 | 22.12.2002 | 351 | 35422 |
| Ntvmsnbc | 88 | 01.01.2000 | 22.12.2002 | 1071 | 94272 |
We define latency as the number of pages that the search engine modules must traverse to find all added, deleted, and updated web pages. This number is directly proportional to the delay in updating the search engine index; results from the updated web site index become available only after this delay.
Table 5.4 lists the web site change fraction and latency values of the news resources mentioned above. The latency of a centralized search engine equals the total number of pages in the web site, because it has no way of knowing which pages were added, deleted, or updated without traversing and downloading the whole web site. In our DSA system, by contrast, client modules run at scheduled intervals on the client system and can traverse the whole site without using any network bandwidth, so the latency of DSA equals the number of news pages added each day. Client administrators can prevent client modules from traversing specific URL paths such as “/archive/” or “/arsiv/” to speed up the client module process. They can also set the starting URL parameter in the client user parameters to the address of the “latest news” page, which usually exists on large web sites and daily news sites. With these two parameters, they can direct client modules to find new and updated pages efficiently. On small web sites, changes can be found easily without setting these parameters. Table 5.4 shows that the web site change fraction is as small as 0.09 per cent on average for these web sites, so finding new pages is very expensive for centralized search engines, which may have to traverse the whole site.
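The two client parameters described above, excluded URL paths and a starting URL, can be illustrated with a short filter. This is a sketch under stated assumptions, not the DSA client module: the names `EXCLUDED_PATHS`, `START_URL` and `should_traverse`, and the host `news.example.com`, are hypothetical. Only the two path fragments come from the text.

```python
from urllib.parse import urlparse

# Path fragments quoted in the text; the variable name is hypothetical.
EXCLUDED_PATHS = ("/archive/", "/arsiv/")

# Hypothetical starting point standing in for a site's "latest news" page.
START_URL = "http://news.example.com/latest/"


def should_traverse(url: str, excluded: tuple[str, ...] = EXCLUDED_PATHS) -> bool:
    """Return False for URLs whose path contains an excluded fragment."""
    path = urlparse(url).path
    return not any(fragment in path for fragment in excluded)


candidates = [
    START_URL,
    "http://news.example.com/2002/12/22/spor/index.html",
    "http://news.example.com/arsiv/2001/index.html",
]
to_visit = [url for url in candidates if should_traverse(url)]
print(to_visit)
```

Matching on the parsed path rather than on the full URL string keeps a host name or query string that happens to contain "archive" from being excluded by accident.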
Table 5.4. Web site change fraction and latency values of mentioned news resources
| News Resource | Number of News Added Each Day | Total Number of News Pages | Web Site Change Fraction (per cent) | Latency of Centralized Search Engines | Latency of DSA |
|---|---|---|---|---|---|
| Milliyet | 116 | 253066 | 0.046 | 253066 | 116 |
| Hurriyet | 220 | 472575 | 0.047 | 472575 | 220 |
| Sabah | 53 | 114254 | 0.046 | 114254 | 53 |
| Akşam | 103 | 184085 | 0.056 | 184085 | 103 |
| Radikal | 125 | 208346 | 0.060 | 208346 | 125 |
| Star | 101 | 35422 | 0.285 | 35422 | 101 |
| Ntvmsnbc | 88 | 94272 | 0.093 | 94272 | 88 |
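The change fraction and latency columns of Table 5.4 follow directly from the definitions given earlier. The sketch below recomputes them for two rows. The function names are chosen here for illustration, and the latency model (whole site for a centralized engine, changed pages only for DSA) is the one described in the text.

```python
def change_fraction_percent(changed_pages: int, total_pages: int) -> float:
    """Changed pages as a percentage of all pages in the web site."""
    return changed_pages / total_pages * 100.0


def latencies(changed_pages: int, total_pages: int) -> tuple[int, int]:
    """Pages to traverse: (centralized search engine, DSA)."""
    return total_pages, changed_pages


# (news pages added per day, estimated total pages), from Table 5.4.
rows = {
    "Milliyet": (116, 253066),
    "Star": (101, 35422),
}

for name, (added, total) in rows.items():
    fraction = round(change_fraction_percent(added, total), 3)
    centralized, dsa = latencies(added, total)
    print(name, fraction, centralized, dsa)
```

For Milliyet this gives a change fraction of 0.046 per cent, with 253066 pages to traverse in the centralized case against 116 for DSA.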
Another parameter, the time delay of updating the search engine index, depends on different factors in centralized search engines and in DSA. In centralized systems, the bandwidth, the crawler queue, and the number of web sites in the search engine index directly affect the time delay. In our DSA system, the scheduling interval of the client modules and the number of member client modules appear to affect this parameter.
Figure 5.1 shows an example search result from our DSA search engine. The expression “artificial intelligence” was searched over 6283 pages from 42 web sites, with 691161 word indexes.
The components of the DSA system are written in Perl and PHP, which are open source, high-level programming languages. The client and server modules and the client user configuration file generator are written in Perl, while the subscription and DSA search interface pages are written in PHP. Table 5.5 lists the number of files, the lines of code, and the total file size of the written code and configuration files.

Figure 5.1. Example search result from DSA
Table 5.5. Sizes of DSA programming codes
| Component | Number of Module Files | Total Lines of Code or Text | Total File Size (KB) |
|---|---|---|---|
| Client Module | 6 | 1032 | 28.7 |
| Stopwords Files | 2 | 686 | 4.59 |
| Client Configuration Files | 2 | 31 | 1.13 |
| Server Module | 5 | 579 | 23.9 |
| Subscription Module | 2 | 197 | 6.22 |
| Search Interface | 7 | 556 | 14.9 |
| Total | 24 | 3081 | 79.44 |