[IP] more on What if Google and Yahoo got into a fight?

To: Ip Ip <ip@xxxxxxxxxxxxxx>
Subject: [IP] more on What if Google and Yahoo got into a fight?
From: David Farber <dave@xxxxxxxxxx>
Date: Tue, 16 Aug 2005 08:57:58 -0400
References: <4301458E.5040309@xxxxxxxxxxxxxx>
Reply-to: dave@xxxxxxxxxx



Begin forwarded message:

From: Eric Glover <eric@xxxxxxxxxxxxxx>
Date: August 15, 2005 9:46:54 PM EDT
To: "David Farber (by way of Bernard A. Galler)" <dave@xxxxxxxxxx>
Cc: i-p@xxxxxxxxx
Subject: Re: [IP] What if Google and Yahoo got into a fight?

There are several flaws in the NCSA study (I looked at their Perl code). For the record, I do *not* work for Google or Yahoo.


There are a few other assumptions which were not explicit:

#1: That both Google and Yahoo use the same relevance function to decide which results to include, or that there is some way to post-process the results so they can be compared on equal terms.

#2: That any bias in the Yahoo crawler affects results returned for the keywords in the study with equal probability, or at least close to it.

Regarding #1: It is clear Google does not require keywords to be on a particular web page for a result to be returned, so it is necessary to actually verify that both Yahoo and Google pages meet some identical constraints. It is also possible that Yahoo has multiple partitions and only searches the later partition when there are no results in the first; this might cause it to falsely appear that Yahoo has lower coverage.

Regarding #2: It is entirely possible that there is a bias in the Yahoo crawler. Pretend that it crawled 15 billion 'calendar pages' or Spanish pages, or had some other bias. In this case the NCSA study fails for two reasons. First, what if the 'excluded queries' (those with more than 1000 results) all come from Yahoo, and the extra coverage was biased towards those queries? Second, what if the extra content in Yahoo does not have keywords from the NCSA study? Maybe Yahoo found 15 billion Spanish pages, or image archives with no keywords?
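
To make the second failure mode concrete, here is a toy sketch in Python. Everything in it is invented for illustration (the tiny indexes, the word sets, the query list); it is not the NCSA code and uses no real engine data. It shows that extra pages containing none of the study's keywords leave the hit counts unchanged, however large the extra crawl is.

```python
# Toy model: each "page" is just the set of words it contains.
# All data below is made up for illustration.
shared_pages = [
    {"apple", "river"},
    {"apple", "stone"},
    {"river", "stone"},
]

# Hypothetical extra pages that only one engine crawled.
# None of them contain any word used by the study's queries.
extra_pages = [{"lunes", "martes"} for _ in range(15)]

index_a = shared_pages
index_b = shared_pages + extra_pages


def hit_count(index, query):
    """Number of pages in the index containing every word of the query."""
    return sum(1 for page in index if set(query) <= page)


study_queries = [("apple", "river"), ("apple", "stone"), ("river", "stone")]

total_a = sum(hit_count(index_a, q) for q in study_queries)
total_b = sum(hit_count(index_b, q) for q in study_queries)

size_ratio = len(index_b) / len(index_a)   # true ratio of index sizes
count_ratio = total_b / total_a            # ratio the hit counts suggest

print("true size ratio:", size_ratio)
print("ratio suggested by hit counts:", count_ratio)
```

In this toy case the second index is six times larger, yet the hit counts for the study queries are identical.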


A few improvements to the study would involve:

#1: Post-processing using a different relevance function, such as requiring that pages *must* contain each keyword somewhere in the HTML.
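
A minimal sketch of such a post-processing filter, in Python. It assumes the HTML of each result has already been fetched by some other means; the function names and the whole-word matching rule are my own choices for illustration, not anything taken from the NCSA program.

```python
import re


def contains_all_keywords(html, keywords):
    """True if every keyword occurs as a whole word somewhere in the raw HTML.

    Matching is case-insensitive and runs over the raw markup, so a keyword
    that appears only inside a tag or attribute also counts.
    """
    text = html.lower()
    return all(
        re.search(r"\b" + re.escape(word.lower()) + r"\b", text) is not None
        for word in keywords
    )


def filter_results(pages, keywords):
    """Keep the URLs of (url, html) pairs whose HTML contains every keyword."""
    return [url for url, html in pages if contains_all_keywords(html, keywords)]


# Hypothetical fetched results for the query "apple river".
pages = [
    ("http://example.com/a", "<p>Apple trees by the river</p>"),
    ("http://example.com/b", "<p>apple pie recipes</p>"),
    ("http://example.com/c", "<p>pineapple farms near a river</p>"),
]
print(filter_results(pages, ["apple", "river"]))
```

Applying the same filter to both engines' results gives counts that at least share one relevance rule.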

#2: Examining the actual intersection of the remaining results and using this to test a probability model based on some predicted ratio of sizes. Again, this assumes that the crawlers are biased similarly.
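
One way such a test could look, sketched in Python under stated assumptions. If pages are sampled from each index without bias, then P(in B | in A) = |A and B| / |A| and P(in A | in B) = |A and B| / |B|, so the ratio of the two fractions estimates |B| / |A|. The function names are hypothetical, and plain Python sets stand in for "is this URL in the other engine's index?", which in practice would need a separate lookup against each engine.

```python
def overlap_fraction(sample, other_index):
    """Fraction of URLs in `sample` that also appear in `other_index`."""
    if not sample:
        raise ValueError("sample must not be empty")
    return sum(1 for url in sample if url in other_index) / len(sample)


def implied_size_ratio(sample_a, index_a, sample_b, index_b):
    """Estimate |B| / |A| as P(in B | in A) / P(in A | in B).

    Only valid if both samples are drawn without bias, which is exactly
    the assumption about the crawlers discussed above.
    """
    a_in_b = overlap_fraction(sample_a, index_b)
    b_in_a = overlap_fraction(sample_b, index_a)
    if b_in_a == 0:
        raise ValueError("no overlap observed; the ratio is undefined")
    return a_in_b / b_in_a


# Made-up indexes: A has 4 URLs, B has 8, and they share 2.
index_a = {"u1", "u2", "u3", "u4"}
index_b = {"u3", "u4", "u5", "u6", "u7", "u8", "u9", "u10"}

ratio = implied_size_ratio(sorted(index_a), index_a, sorted(index_b), index_b)
print("implied |B| / |A|:", ratio)
```

The implied ratio can then be compared with the ratio predicted from the claimed index sizes; a large mismatch would point to either a wrong claim or dissimilar crawler bias.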

I am not saying I believe Yahoo has added these pages, but I am saying that this study does not rule out the possibility that Yahoo has in fact indexed nearly 19 billion pages.


-Eric

David Farber (by way of Bernard A. Galler) wrote:

Begin forwarded message:
From: Tim Finin <finin@xxxxxxxxxxx>
Date: August 15, 2005 5:29:41 PM EDT
To: Dave Farber <dave@xxxxxxxxxx>
Subject: What if Google and Yahoo got into a fight?
Yahoo and Google have been arguing about whose index is bigger.
The disagreement was touched off when Yahoo claimed [1] that its
index provided access to "over 20 billion items".  Google
demurred [2].  Researchers at the University of Illinois NCSA
just announced the results [5] of an experiment designed to settle
the question.  Their scheme used a Perl program [3] to generate
over 10K random two-word queries drawn from words in the
ispell dictionary [4]. Comparing the number of results found by
each engine for these search queries identified a clear winner --
Google [5].
[1] http://www.ysearchblog.com/archives/000172.html
[2] http://battellemedia.com/archives/001790.php
[3] http://vburton.ncsa.uiuc.edu/compare.txt
[4] http://vburton.ncsa.uiuc.edu/wordlist.txt
[5] http://vburton.ncsa.uiuc.edu/indexsize.html
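
For readers who want a feel for the query-generation step, a rough Python sketch follows. It is not the NCSA Perl program [3]; the function name, the seed parameter, and the short stand-in word list are assumptions made for illustration.

```python
import random


def random_two_word_queries(words, n, seed=None):
    """Return n queries, each a pair of distinct entries drawn from `words`.

    `words` must be a list with at least two entries. A seed can be given
    to make the draw repeatable.
    """
    rng = random.Random(seed)
    return [tuple(rng.sample(words, 2)) for _ in range(n)]


# Stand-in word list; the study drew from an ispell-derived list [4].
words = ["apple", "river", "stone", "cloud", "lantern", "meadow"]

queries = random_two_word_queries(words, 5, seed=0)
for first, second in queries:
    print(first, second)
```

Each pair would then be submitted to both engines and the reported result counts compared.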

-- Tim Finin, Prof Computer Science & Electrical Engineering, Univ of Maryland, Baltimore County, 1000 Hilltop Circle, Baltimore MD 21250. finin@xxxxxxxx +1-410-455-3522 fax:-3969 http://umbc.edu/~finin/ http://ebiquity.umbc.edu/

-------------------------------------
You are subscribed as galler@xxxxxxxxx
To manage your subscription, go to
 http://v2.listbox.com/member/?listname=ip

Archives at: http://www.interesting-people.org/archives/interesting-people/



