
[IP] What can't you find on Google? Vital statistics




Delivered-To: dfarber+@xxxxxxxxxxxxxxxxxx
Date: Sat, 01 May 2004 21:47:55 -0400
From: Claudio Gutiérrez <cgutierrez@xxxxxxxxxxxxxx>
Subject: What can't you find on Google? Vital statistics
To: dave@xxxxxxxxxx
X-Accept-Language: en-us, en
User-Agent: Mozilla/5.0 (Windows; U; Win98; en-US; rv:1.6) Gecko/20040113
X-Spam-Status: No,
hits=2.3 required=7.5 tests=SUBJ_HAS_Q_MARK,MSG_ID_ADDED_BY_MTA_2 version=2.31
X-Spam-Level: **
X-Spam-Filtered-At: eList eXpress <http://www.elistx.com/>

*What can't you find on Google? Vital statistics*

*John Naughton*
http://observer.guardian.co.uk/business/story/0,6903,1202522,00.html

Here's a cheap trick to play on an audience - especially one drawn from the business community. Ask them how many use Microsoft software. Virtually every hand in the room will go up. How many use Apple Macs? One or two - at most. How many use Linux? If the audience is drawn from corporate suits, no hands will show. Now comes the punchline: who uses Google? A forest of hands appears. 'Ah,' you say, 'that's very interesting, because it means you're all Linux users.' Stunned looks all round.

The computing engine that powers Google is the largest cluster of Linux servers in the history of the world. If you talk to computer-science folks, you find that they regard this - rather than the number of web pages indexed - as the most interesting thing about the company. Managing such a vast server-farm is a formidable task. For example, how do you implement security patches and operating-system upgrades (much more frequent in Linux than in proprietary systems from Microsoft or Sun) on thousands of servers without causing disruption to service? Google manages to achieve this with sophisticated techniques for rippling changes through the cluster, yet achieves 100 per cent uptime. This is serious stuff, and there are a lot of IT managers out there who would give their eye-teeth to be able to do it half as well.
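The article gives no detail of how Google actually does this, but the general shape of a rolling upgrade is easy to sketch. The Python snippet below is purely illustrative - the host names, batch size and apply_patch step are invented for the example, and this is not Google's procedure - it shows only the basic idea of patching a cluster a few machines at a time so that the rest keeps serving.

    # Illustrative only: a generic rolling-upgrade loop, not Google's actual method.
    # Machines are patched in small batches so most of the cluster keeps serving.
    def rolling_upgrade(servers, apply_patch, batch_size=50):
        """Patch servers a few at a time; return any hosts that failed."""
        failed = []
        for i in range(0, len(servers), batch_size):
            for host in servers[i:i + batch_size]:
                try:
                    apply_patch(host)    # e.g. drain traffic, install update, reboot, health-check
                except Exception:
                    failed.append(host)  # leave failures out of rotation for later repair
        return failed

    if __name__ == "__main__":
        cluster = [f"server-{n:05d}" for n in range(10_000)]           # the '10,000+' figure
        bad = rolling_upgrade(cluster, apply_patch=lambda host: None)  # no-op patch for the demo
        print(f"patched {len(cluster) - len(bad)} of {len(cluster)} servers")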

Google is famous for being a confident, open company. Its clean, uncluttered search page is supposed to be a metaphor for the organisation behind it. But when you start asking questions about its technology, then the water rapidly becomes murky. More than half the company's 1,000 employees are techies, and they are much in demand as seminar speakers in university computer-science departments, where people are curious about Google's technology. Wall Street - with its beady eye on the forthcoming IPO - wants to know what Google does (and more importantly, what it plans to do next). Computer scientists, in contrast, want to know how Google does it.

The two questions are different but increasingly, it seems, interlinked. At any rate, the technical community has begun to realise that presentations by Google techies have been run through some kind of corporate filter before they make it into Powerpoint. The operation of the filter is erratic (it's difficult for PR flacks effectively to censor geeks at the best of times), but it seems that the overall aim is to understate every aspect of Google's technology and technical performance by several orders of magnitude.

How do we know this? Mainly because of internal inconsistencies in the data provided by Google employees. One university presentation, for example, claimed that Google handled 150 million queries a day, and 1,000 per second at peak times. This prompted Simson Garfinkel of MIT's Technology Review to do some simple calculations. If the system is handling a peak load of 1,000 queries per second, he reasoned, that translates to a peak rate of 86.4 million queries per day - or perhaps 40 million queries per day if you assume that the system spends only half its time at peak capacity. 'No matter how you crank the math', he concluded, 'Google's statistics are not self-consistent'.
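Garfinkel's check is easy to reproduce from the two figures quoted in the presentation. The short calculation below (illustrative, using only those quoted numbers) converts the peak rate into a daily ceiling and compares it with the claimed volume.

    # Back-of-the-envelope check on the two figures quoted in the presentation.
    claimed_daily = 150_000_000      # queries per day, as presented
    peak_rate = 1_000                # queries per second at peak, as presented

    seconds_per_day = 24 * 60 * 60                 # 86,400
    ceiling = peak_rate * seconds_per_day          # 86.4 million if at peak all day
    half_peak = ceiling // 2                       # ~43 million if at peak half the day

    print(f"daily ceiling at constant peak: {ceiling:,}")
    print(f"rough figure at peak half the day: {half_peak:,}")
    print(f"claimed daily volume: {claimed_daily:,}")
    # 150 million claimed queries cannot fit under an 86.4 million ceiling,
    # which is the inconsistency Garfinkel points to.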

Or take the number of servers that Google operates. The only figure the company will admit to is '10,000+'. They also claim to have '4+ petabytes' of disk storage, and have let slip that each server is fitted with two 80 gigabyte hard drives. Now a petabyte is 10 to the power of 15 bytes, so if Google had only 10,000 servers, that would come to 400 gigabytes per server - well beyond the 160 gigabytes its two drives actually hold. So again the numbers don't add up. I could go on, but you get the point. What it all comes down to is this: Google has far more computing power at its disposal than it is letting on. In fact, there have been rumours in the business for months that the Google cluster actually has 100,000 servers - which if true means that the company's technical competence beggars belief.
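The same back-of-the-envelope arithmetic works for the storage figures. Taking the quoted numbers at face value - '4+ petabytes', '10,000+' servers, two 80 gigabyte drives apiece - this illustrative calculation shows the gap:

    # Checking the quoted storage figures (decimal units: 1 PB = 10**15 bytes, 1 GB = 10**9 bytes).
    claimed_storage_bytes = 4 * 10**15        # '4+ petabytes'
    claimed_servers = 10_000                  # '10,000+'
    disk_per_server_gb = 2 * 80               # two 80 GB drives = 160 GB

    needed_per_server_gb = claimed_storage_bytes / 10**9 / claimed_servers   # 400 GB
    servers_needed = claimed_storage_bytes / (disk_per_server_gb * 10**9)    # 25,000

    print(f"storage needed per server at 10,000 servers: {needed_per_server_gb:.0f} GB")
    print(f"servers needed at 160 GB each: {servers_needed:,.0f}")
    # 400 GB required versus 160 GB actually fitted: the cluster must be well
    # over twice as large as 10,000 machines for both claims to hold.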

Now the interesting question raised by all this is: why the reticence? Most companies lose no opportunity to brag about their technology. (Think of all those Oracle ads.) Is this an example of Google behaving ultra-responsibly - being careful not to hype its prospects prior to an IPO? Or is it a sign of a deeper commercial strategy? The latter is what Garfinkel suspects. 'After all,' he says, 'if Google publicised how many pages it has indexed and how many computers it has in its data centres around the world, search competitors such as Yahoo!, Teoma, and Mooter would know how much capital they had to raise in order to have a hope of displacing the king at the top of the hill.' If truth is the first casualty of war, openness is the first casualty of going public.

-------------------------------------

Archives at: http://www.interesting-people.org/archives/interesting-people/