[IP] more on What if Google and Yahoo got into a fight?
Begin forwarded message:
From: Eric Glover <eric@xxxxxxxxxxxxxx>
Date: August 15, 2005 9:46:54 PM EDT
To: "David Farber (by way of Bernard A. Galler)" <dave@xxxxxxxxxx>
Cc: i-p@xxxxxxxxx
Subject: Re: [IP] What if Google and Yahoo got into a fight?
There are several flaws in the NCSA study (I looked at their Perl
code). I do *not* work for Google or Yahoo.
There are a few other assumptions which were not explicit:
#1: That both Google and Yahoo use the same relevance function to
decide which results to include - or there is some way to post
process this to compare equally.
#2: That the Yahoo crawler is biased in a way that is equally
probable for results which are returned for the keywords in the study
- or at least close.
Regarding #1: It is clear Google does not require keywords to be on a
particular web page for a result to be returned, so it is necessary to
verify that the pages returned by both Yahoo and Google meet some
identical constraints. It is also possible that Yahoo has multiple
partitions and searches the later partition only when there are no
results in the first. This could make Yahoo falsely appear to have
lower coverage.
Regarding #2: It is entirely possible that there is a bias in the
Yahoo crawler. Suppose it crawled 15 Billion 'calendar pages', or
Spanish pages, or pages with some other bias. In this case the NCSA
study fails for two reasons. First, what if the 'excluded
queries' (those with more than 1000 results) all come from Yahoo,
and the extra coverage was biased towards those queries? Second, what
if the extra content in Yahoo does not have keywords from the NCSA
study? Maybe Yahoo found 15 Billion Spanish pages, or image archives
with no keywords?
A few improvements to the study would involve:
#1: Post-processing with a different relevance function, such as
requiring that pages *must* contain each keyword somewhere in the HTML.
#2: Examining the actual intersection of the remaining results and
using it to test a probability model based on some predicted ratio of
sizes. Again, this assumes that the crawlers are biased similarly.
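
The two improvements above could be sketched roughly as follows. This is only an illustration in Python, not the NCSA code: the function names and the sample URLs are hypothetical, and the capture-recapture (Lincoln-Petersen) estimate is one possible probability model that I am assuming here, not one named in the study. It is only meaningful if both crawlers are biased similarly.

```python
import re

def contains_all_keywords(html, keywords):
    """True if every keyword occurs as a whole word somewhere in the HTML."""
    text = html.lower()
    return all(
        re.search(r"\b" + re.escape(word.lower()) + r"\b", text)
        for word in keywords
    )

def filter_results(pages, keywords):
    """Keep only the URLs whose HTML contains each keyword.

    `pages` is a hypothetical dict mapping URL -> already-fetched HTML.
    """
    return {url for url, html in pages.items()
            if contains_all_keywords(html, keywords)}

def lincoln_petersen(a_urls, b_urls):
    """Estimate the total number of matching pages from two result sets.

    Assumes each engine indexes pages independently and with similar bias.
    Returns None when the sets do not overlap (no estimate is possible).
    """
    overlap = len(a_urls & b_urls)
    if overlap == 0:
        return None
    return len(a_urls) * len(b_urls) / overlap

# Made-up result sets for one query, after keyword filtering.
google_urls = {"u1", "u2", "u3", "u4"}
yahoo_urls = {"u3", "u4", "u5", "u6", "u7", "u8"}
total = lincoln_petersen(google_urls, yahoo_urls)  # 4 * 6 / 2 = 12.0
print(len(google_urls) / total, len(yahoo_urls) / total)
```

The per-engine coverage fractions printed at the end can then be compared against the ratio predicted by the claimed index sizes.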
I am not saying I believe Yahoo has added these pages, but I am
saying that this study does not rule out the possibility that Yahoo
has in fact indexed nearly 19 Billion pages.
-Eric
David Farber (by way of Bernard A. Galler) wrote:
Begin forwarded message:
From: Tim Finin <finin@xxxxxxxxxxx>
Date: August 15, 2005 5:29:41 PM EDT
To: Dave Farber <dave@xxxxxxxxxx>
Subject: What if Google and Yahoo got into a fight?
Yahoo and Google have been arguing about whose index is bigger.
The disagreement was touched off when Yahoo claimed [1] that its
index provided access to "over 20 billion items". Google
demurred [2]. Researchers at the University of Illinois NCSA
just announced the results [5] of an experiment designed to settle
the question. Their scheme used a Perl program [3] to generate
over 10K random two word queries drawn from words in the
ispell dictionary [4]. Comparing the number of results found by
each engine for these search queries identified a clear winner --
Google [5].
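
The scheme described above can be sketched in a few lines. This is a rough Python illustration, not the Perl program in [3]: the function names are hypothetical, drawing two distinct words per query is an assumption, and the hit counts are supplied by the caller rather than fetched from either engine.

```python
import random

def random_two_word_queries(words, count, seed=None):
    """Build `count` queries, each made of two distinct randomly chosen words."""
    rng = random.Random(seed)
    return [" ".join(rng.sample(words, 2)) for _ in range(count)]

def compare_totals(hit_counts):
    """Sum the reported hits per engine over all queries.

    `hit_counts` is a list of (query, hits_a, hits_b) tuples that the
    caller has already collected (hypothetical data, no network access).
    """
    total_a = sum(a for _, a, _ in hit_counts)
    total_b = sum(b for _, _, b in hit_counts)
    return total_a, total_b

# Tiny stand-in word list; the study drew from the ispell dictionary [4].
words = ["alpha", "beta", "gamma", "delta"]
for query in random_two_word_queries(words, 3, seed=0):
    print(query)
```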
[1] http://www.ysearchblog.com/archives/000172.html
[2] http://battellemedia.com/archives/001790.php
[3] http://vburton.ncsa.uiuc.edu/compare.txt
[4] http://vburton.ncsa.uiuc.edu/wordlist.txt
[5] http://vburton.ncsa.uiuc.edu/indexsize.html
-- Tim Finin, Prof Computer Science & Electrical Engineering,
Univ of Maryland, Baltimore County,
1000 Hilltop Circle, Baltimore MD 21250.
finin@xxxxxxxx +1-410-455-3522 fax:-3969
http://umbc.edu/~finin/ http://ebiquity.umbc.edu/
-------------------------------------
You are subscribed as galler@xxxxxxxxx
To manage your subscription, go to
http://v2.listbox.com/member/?listname=ip
Archives at: http://www.interesting-people.org/archives/interesting-people/