[IP] AOL Releases Search Logs from 500,000 Users
Begin forwarded message:
From: Seth Finkelstein <sethf@xxxxxxxxx>
Date: August 7, 2006 1:05:50 AM EDT
To: Dave Farber <dave@xxxxxxxxxx>, ip@xxxxxxxxxxxxxx
Subject: AOL Releases Search Logs from 500,000 Users
AOL Releases Search Logs from 500,000 Users
[1]Adam D'Angelo - 8/5/2006
AOL just released the logs of all searches done by 500,000 of their
users over the course of three months earlier this year. That means
that if you happened to be randomly chosen as one of these users,
everything you searched for from March to May (2006) is now public
information on the internet.
This was not a leak - it was intentional. In their desperation to gain
recognition from the research community, AOL decided they would
compromise their integrity to provide a data set that might become
widely cited in research papers: "Please reference the following
publication when using this collection..." is the message before the
download.
This is a blatant violation of users' privacy. The data is
"anonymized", which to AOL means that each screenname was replaced
with a unique number. "It is still a research question how much
information needs to be anonymized to protect users," [9]says Abdur
from AOL. Here are some examples of what you can find in the data:
User 491577 searches for "florida cna pca lakeland tampa", "emt school
training florida", "low calorie meals", "infant seat", and "fisher
price roller blades". Among user 39509's hundreds of searches are:
"ford 352", "oklahoma disciplined pastors", "oklahoma disciplined
doctors", "home loans", and other personally identifying and
illegal material that I'm leaving out. Among user 545605's
searches are "shore hills park mays landing nj", "frank william
sindoni md", "ceramic ashtrays", "transfer money to china", and
"capital gains on sale of house". Compared to some of the data, these
examples are on the safe side. I'm leaving out the worst of it -
searches for names of specific people, addresses, telephone numbers,
illegal drugs, and more. There is no question that law enforcement,
employers, or friends could figure out who some of these people are.
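To illustrate why replacing a screenname with a number does little on its own, here is a minimal sketch that groups queries by that number. The rows are a handful of the example queries quoted above; the `ROWS` list and the `group_queries_by_user` helper are hypothetical names for illustration, not the layout of the released file.

```python
from collections import defaultdict

# (pseudonymous user number, query) pairs taken from the examples in the post.
# The in-memory layout is an assumption made for this sketch.
ROWS = [
    (491577, "florida cna pca lakeland tampa"),
    (545605, "shore hills park mays landing nj"),
    (491577, "emt school training florida"),
    (545605, "capital gains on sale of house"),
    (491577, "infant seat"),
]


def group_queries_by_user(rows):
    """Collect every query under the number that replaced the screenname."""
    profiles = defaultdict(list)
    for user_id, query in rows:
        profiles[user_id].append(query)
    return dict(profiles)


if __name__ == "__main__":
    for user_id, queries in group_queries_by_user(ROWS).items():
        print(user_id, queries)
```

Because the number is stable across all of a user's searches, a single pass is enough to rebuild a per-person search history, which is what makes the place names and personal names in the queries identifying.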
I hope others can find more examples in the data, which is up for
[10]download over here. The data set is very large when uncompressed,
which makes it pretty hard to work with, but someone should set up a
web interface so people can browse it (or even 10% of it) without
having to download the 400 MB file. If you make a mirror or a better
interface to the data, or find other examples, let me know and I'll
put a link up here.
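One way to carve out roughly 10% of the data for browsing is to sample by user rather than by row, so that each sampled user's history stays complete. The sketch below assumes the rows arrive as (user number, query) pairs; `keep_user` and `sample_rows` are hypothetical helpers, and the CRC32 hash is just a convenient stable choice, not anything the released data prescribes.

```python
import zlib


def keep_user(user_id, one_in=10):
    """Deterministically select roughly one in `one_in` users by hashing the ID."""
    return zlib.crc32(str(user_id).encode("utf-8")) % one_in == 0


def sample_rows(rows, one_in=10):
    """Yield only the rows whose user falls in the sample.

    `rows` is any iterable of (user_id, query) pairs, so the input can be
    streamed instead of being loaded into memory all at once.
    """
    for user_id, query in rows:
        if keep_user(user_id, one_in):
            yield user_id, query


if __name__ == "__main__":
    demo = [(u, "query %d" % i) for u in range(50) for i in range(2)]
    for row in sample_rows(demo):
        print(row)
```

Hashing the user number instead of drawing random numbers means the same users are selected on every run, so a mirror and a browsing interface built separately would agree on the subset.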
This is the same kind of data that the DOJ wanted from Google back in
March. [11]This ruling allowed Google to keep all query logs secret.
Now any government can just go download the data from AOL.
It's unclear if this is the type of data AOL released to the
government [12]back when Google refused to comply. If nothing else,
this should be a good example of why search history needs strong
privacy protection.
Thanks to Greg Linden for pointing this out [13]here.
Update 2: The MD5 of the file AOL posted (and has now removed) is
31cd27ce12c3a3f2df62a38050ce4c0a. I'm posting it so you can make sure
you have a valid copy, though so far none of the copies I've seen are
fake.
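Checking a downloaded copy against that value could look like the following minimal sketch. It uses Python's standard `hashlib`; the helper names and the file path in the usage line are hypothetical, and only the expected digest comes from the post.

```python
import hashlib

# MD5 value quoted in the post for the file AOL published.
EXPECTED_MD5 = "31cd27ce12c3a3f2df62a38050ce4c0a"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of the file at `path`, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so a 400 MB file never sits in memory whole.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_valid_copy(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 matches the expected hex digest."""
    return md5_of_file(path) == expected.lower()
```

Usage would be along the lines of `is_valid_copy("aol-data.tar.gz")`, where the file name is a placeholder for wherever the mirror copy was saved.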
Update: Seems like AOL took it down. There are some mirrors of the
data in the comments of the digg story, linked below. I estimate about
1000 people have the file, so it's definitely going to be circulated
around. The [2]main AOL research page is still up, with some other
data collections. The [3]Google cache of the download page is still
up, but you can't get the data from it. Here's discussion at other sites:
* [4]siliconbeat
* [5]techcrunch
* [6]digg
* [7]reddit
* [8]zoli's blog
References
1. http://www.ugcs.caltech.edu/~dangelo/
2. http://research.aol.com/pmwiki/pmwiki.php?n=Main.Home
3. http://72.14.207.104/search?q=cache:2Qvd2z9VbuIJ:research.aol.com/pmwiki/pmwiki.php%3Fn%3DResearch.500kUserQueriesSampledOver3Months+&hl=en&gl=us&ct=clnk&cd=1
4. http://www.siliconbeat.com/entries/2006/08/06/aol_research_exposes_data_weve_got_a_little_sick_feeling.html
5. http://www.techcrunch.com/2006/08/06/aol-proudly-releases-massive-amounts-of-user-search-data/
6. http://digg.com/tech_news/AOL_Releases_Search_Logs_from_500_000_Users
7. http://reddit.com/info/cfvt/comments
8. http://www.zoliblog.com/blog/_archives/2006/8/6/2204969.html
9. http://research.aol.com/pmwiki/pmwiki.php?n=Research.500kUserQueriesSampledOver3Months
10. http://research.aol.com/pmwiki/pmwiki.php?n=Research.500kUserQueriesSampledOver3Months
11. http://googleblog.blogspot.com/2006/03/judge-tells-doj-no-on-search-queries.html
12. http://www.boingboing.net/2006/01/20/aol_we_did_not_compl.html
13. http://glinden.blogspot.com/2006/08/chance-to-play-with-big-data.html
--
Seth Finkelstein Consulting Programmer http://sethf.com
Infothought blog - http://sethf.com/infothought/blog/
Interview: http://sethf.com/essays/major/greplaw-interview.php