[IP] What can't you find on Google? Vital statistics
Delivered-To: dfarber+@xxxxxxxxxxxxxxxxxx
Date: Sat, 01 May 2004 21:47:55 -0400
From: Claudio Gutiérrez <cgutierrez@xxxxxxxxxxxxxx>
Subject: What can't you find on Google? Vital statistics
To: dave@xxxxxxxxxx
*What can't you find on Google? Vital statistics*
*John Naughton*
http://observer.guardian.co.uk/business/story/0,6903,1202522,00.html
Here's a cheap trick to play on an audience - especially one drawn from the
business community. Ask them how many use Microsoft software. Virtually
every hand in the room will go up. How many use Apple Macs? One or two - at
most. How many use Linux? If the audience is drawn from corporate suits, no
hands will show. Now comes the punchline: who uses Google? A forest of
hands appears. 'Ah,' you say, 'that's very interesting, because it means
you're all Linux users.' Stunned looks all round.
The computing engine that powers Google is the largest cluster of Linux
servers in the history of the world. If you talk to computer-science folks,
you find that they regard this - rather than the number of web pages
indexed - as the most interesting thing about the company. Managing such a
vast server-farm is a formidable task. For example, how do you implement
security patches and operating-system upgrades (much more frequent in Linux
than in proprietary systems from Microsoft or Sun) on thousands of servers
without causing disruption to service? Google manages it with sophisticated
techniques for rippling changes through the cluster while maintaining 100 per
cent uptime. This is serious stuff, and there are a lot of
IT managers out there who would give their eye-teeth to be able to do it
half as well.
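What 'rippling' means in the abstract is easy enough to sketch: take a small
batch of machines out of the serving rotation, patch them, put them back, and
repeat until the whole farm is done. The toy Python below is only a sketch of
that shape - the server names are made up and the patch step is a stub, since
the real mechanism is exactly what Google won't describe.

# Toy sketch of rippling an upgrade through a cluster: patch a small
# batch of machines at a time while the rest keep serving traffic.
# Everything here is a stand-in; Google's actual tooling is not public.

servers = [f"server-{n:05d}" for n in range(10_000)]
in_rotation = set(servers)            # machines currently serving queries

def apply_patch(server: str) -> bool:
    # Placeholder for the real upgrade step; pretend it always succeeds.
    return True

def rolling_upgrade(batch_size: int = 50) -> None:
    for i in range(0, len(servers), batch_size):
        batch = servers[i:i + batch_size]
        for s in batch:
            in_rotation.discard(s)    # drain: stop routing queries here
        for s in batch:
            if apply_patch(s):
                in_rotation.add(s)    # healthy again, back into rotation
        # At no point are more than batch_size machines out of service.

rolling_upgrade()
print(len(in_rotation), "of", len(servers), "servers back in rotation")

Doing that continuously, across thousands of machines, without ever dropping a
query is the part that impresses the professionals.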
Google is famous for being a confident, open company. Its clean,
uncluttered search page is supposed to be a metaphor for the organisation
behind it. But when you start asking questions about its technology, then
the water rapidly becomes murky. More than half the company's 1,000
employees are techies, and they are much in demand as seminar speakers in
university computer-science departments, where people are curious about
Google's technology. Wall Street - with its beady eye on the forthcoming
IPO - wants to know what Google does (and more importantly, what it plans
to do next). Computer scientists, in contrast, want to know how Google does it.
The two questions are different but increasingly, it seems, interlinked. At
any rate, the technical community has begun to realise that presentations
by Google techies have been run through some kind of corporate filter
before they make it into PowerPoint. The operation of the filter is erratic
(it's difficult for PR flacks effectively to censor geeks at the best of
times), but it seems that the overall aim is to understate every aspect of
Google's technology and technical performance by several orders of magnitude.
How do we know this? Mainly because of internal inconsistencies in the data
provided by Google employees. One university presentation, for example,
claimed that Google handled 150 million queries a day, and 1,000 per second
at peak times. This prompted Simson Garfinkel of MIT's Technology Review
to do some simple calculations. If the system handles a peak load of 1,000
queries per second, he reasoned, then even running flat out it could serve at
most 86.4 million queries per day (86,400 seconds times 1,000 queries) - well
short of the claimed 150 million - or perhaps 40 million queries per day if
you assume that the system spends only half its time at peak capacity. 'No
matter how you crank the math', he concluded, 'Google's statistics are not
self-consistent'.
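The sums behind his complaint take only a few lines. Assuming an 86,400-second
day and taking the published figures at face value:

peak_rate = 1_000                       # claimed peak load, queries per second
seconds_per_day = 24 * 60 * 60          # 86,400

ceiling = peak_rate * seconds_per_day   # 86,400,000 queries/day even at constant peak
half_peak = ceiling // 2                # roughly 43 million if at peak only half the time
claimed = 150_000_000                   # queries per day, as presented

print(ceiling, half_peak, claimed)      # the claim sits well above the ceiling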
Or take the number of servers that Google operates. The only figure the
company will admit to is '10,000+'. They also claim to have '4+ petabytes'
of disk storage, and have let slip that each server is fitted with two 80
gigabyte hard drives. Now a petabyte is 10 to the power of 15 bytes, so if
Google had only 10,000 servers, that would come to 400 GB per server - yet two
80 GB drives give only 160 GB per machine. So again the numbers don't add up.
I could go on, but you will get the point.
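Once more, the arithmetic is easy to reproduce from the figures Google has
given out:

petabyte = 10 ** 15                           # bytes
storage = 4 * petabyte                        # '4+ petabytes' of disk
server_count = 10_000                         # '10,000+' servers

implied_per_server = storage / server_count   # 4e11 bytes, i.e. 400 GB
fitted_per_server = 2 * 80 * 10 ** 9          # two 80 GB drives, i.e. 160 GB

print(implied_per_server / 1e9, "GB implied per server")   # 400.0
print(fitted_per_server / 1e9, "GB actually fitted")       # 160.0
# Either each machine carries far more disk than admitted,
# or there are far more machines than admitted.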
But what it all comes down to is this: Google has far more computing power
at its disposal than it is letting on. In fact, there have been rumours in
the business for months that the Google cluster actually has 100,000
servers - which, if true, means that the company's technical competence
beggars belief.
Now the interesting question raised by all this is: why the reticence? Most
companies lose no opportunity to brag about their technology. (Think of all
those Oracle ads.) Is this an example of Google behaving ultra-responsibly
- being careful not to hype its prospects prior to an IPO? Or is it a sign
of a deeper commercial strategy? The latter is what Garfinkel suspects.
'After all,' he says, 'if Google publicised how many pages it has indexed
and how many computers it has in its data centres around the world, search
competitors such as Yahoo!, Teoma, and Mooter would know how much capital
they had to raise in order to have a hope of displacing the king at the top
of the hill.' If truth is the first casualty of war, openness is the first
casualty of going public.
-------------------------------------
You are subscribed as roessler@xxxxxxxxxxxxxxxxxx
To manage your subscription, go to
http://v2.listbox.com/member/?listname=ip
Archives at: http://www.interesting-people.org/archives/interesting-people/