
[IP] Google and Data Retention - Policies and Possibilities





-------- Original Message --------
Subject:        Google and Data Retention - Policies and Possibilities
Date:   Tue, 31 Jan 2006 09:08:22 -0800
From:   Lauren Weinstein <lauren@xxxxxxxxxx>
To:     dave@xxxxxxxxxx
CC:     lauren@xxxxxxxxxx



Dave,

That Google can track user searches is hardly an "alert the media"
revelation.  This capability was effectively obvious, since we know
that Google responds affirmatively to various law enforcement-related
data-retrieval orders (and quite possibly to others that we don't
know about, such as national security letters) that would be largely
useless without such data -- and Google has never claimed to operate
anonymously in this respect.

A more interesting question in terms of data retention is *how long*
various aspects of the data are retained.  That is, does this fine
grain of data "expire" over time, or is retrospective data mining of
the detailed data possible back into the indefinite past?

This issue is rapidly moving into the spotlight, as Congress appears
poised to discuss laws that would *mandate* data retention rules
for search engines and perhaps other Internet services -- and we all
know that when Congress gets involved in technical matters, the results
are often -- shall we say -- less "optimal" than if industry had
addressed these concerns itself voluntarily.

Obviously there are certain enhanced Google services (mostly related
to logged-in users in the search and Gmail spaces, including but not
limited to users availing themselves of Google's search history
features) that require long-term detailed data to function.

But viewed from the outside, there are steps that Google could take
to minimize privacy-related risks while not significantly
interfering with the value of that data for ongoing R&D and
innovation.  This is only a thumbnail conceptual description of
course, based on external observations alone.

 1) Minimize the length of time that full log records are maintained
    for users not using enhanced services.  For instance, full
    records might be maintained for 30 days (an arbitrary figure for
    this example).  These would be available to law enforcement
    queries and the like for ongoing investigations.  However, after
    the expiration period, records would be anonymized (stripped of
    IP, cookie, and other connection-related data identifying the
    user).  Logged search query strings (though they also can
    contain personal information, as we know) would not be affected
    at this stage and would continue to be available for R&D and
    other purposes, but now with a significantly lower outside
    abuse potential.

 2) After some longer period of time (a year? -- again, an arbitrary
    period for the sake of this example) the remaining portion of
    the records for non-enhanced service users would be purged
    (deleted).  I of course cannot address the non-trivial issues of
    system and related data backups in this regard, since I have no
    idea how Google has structured backup activities across their
    enterprise, but this aspect in particular might make for an
    interesting technical challenge.

 3) Users of Google's enhanced search-history-based services, etc.
    represent another interesting problem, since detailed data must
    be maintained for these users in some form for the services to
    function.  However, it seems likely that the outside abuse
    potential of this detailed data could be greatly reduced
    through various cryptographic techniques, while still permitting
    the required functionalities.  It should be noted that
    cryptographic methods may also be applicable in various ways to
    alternative solutions for the issues described in sections (1)
    and (2) above.

Since I am not privy to Google's internal topology, the above ideas can
quite reasonably be categorized as speculative.  However, the point
is that there do exist a range of technological approaches to dealing
with this data that could be harnessed to strike a reasonable balance
between data usefulness and privacy-related concerns -- permitting
R&D and innovation to proceed while minimizing the inherent abuse
potential in sensitive data of this sort.
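
The staged retention scheme in items (1) and (2) above can be sketched
roughly as follows.  This is only an illustration under stated
assumptions: the record layout (LogRecord), the function name
(apply_retention), and the enhanced_user flag are all hypothetical, and
the 30-day and one-year periods are the arbitrary example figures from
the text, not anything known about Google's systems.

```python
from dataclasses import dataclass, replace
from datetime import datetime, timedelta
from typing import Optional

# Arbitrary example periods taken from items (1) and (2).
ANONYMIZE_AFTER = timedelta(days=30)
PURGE_AFTER = timedelta(days=365)


@dataclass(frozen=True)
class LogRecord:
    """Hypothetical log record layout, assumed for illustration only."""
    timestamp: datetime
    query: str
    ip: Optional[str]
    cookie: Optional[str]
    enhanced_user: bool = False  # user relies on search-history features


def apply_retention(record: LogRecord, now: datetime) -> Optional[LogRecord]:
    """Return the record as it should be stored at `now`, or None if purged."""
    if record.enhanced_user:
        # Item (3): these records need separate treatment; left untouched here.
        return record
    age = now - record.timestamp
    if age >= PURGE_AFTER:
        return None  # item (2): delete the remaining portion
    if age >= ANONYMIZE_AFTER:
        # Item (1): strip connection-related identifiers, keep the query string.
        return replace(record, ip=None, cookie=None)
    return record  # still inside the full-record window
```

A periodic job could apply apply_retention to each stored record,
rewriting or dropping it according to the result.  Backups, which the
text flags as a separate challenge, are not addressed by this sketch.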

--Lauren--
Lauren Weinstein
lauren@xxxxxxxxxx or lauren@xxxxxxxx
Tel: +1 (818) 225-2800
http://www.pfir.org/lauren
Co-Founder, PFIR
  - People For Internet Responsibility - http://www.pfir.org
Co-Founder, IOIC
  - International Open Internet Coalition - http://www.ioic.net
Moderator, PRIVACY Forum - http://www.vortex.com
Member, ACM Committee on Computers and Public Policy
Lauren's Blog: http://lauren.vortex.com
DayThink: http://daythink.vortex.com


- - - - - - -




Begin forwarded message:

From: Adam Fields <ip20398470293845@xxxxxxxxxx>
Date: January 30, 2006 10:05:48 PM EST
To: dave@xxxxxxxxxx
Subject: More detailed queries of what Google stores

I asked two very specific questions in a conversation with John
Battelle, and he's received unequivocal answers from Google:

1) "Given a list of search terms, can Google produce a list of people
    who searched for that term, identified by IP address and/or Google
    cookie value?"

2) "Given an IP address or Google cookie value, can Google produce a
    list of the terms searched by the user of that IP address or cookie
    value?"

The answer to both of them is "yes".
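
To illustrate why both answers follow naturally once a log pairs an
identifier with each query, here is a minimal sketch.  The log rows,
identifiers, and function names are invented for the example, and
nothing here reflects how Google actually stores or indexes its data.

```python
from collections import defaultdict

# Hypothetical log rows: (identifier, search term).  The identifier stands
# in for an IP address or a cookie value.
log = [
    ("192.0.2.10", "data retention"),
    ("cookie:abc123", "data retention"),
    ("192.0.2.10", "privacy law"),
]

# Build one index in each direction from the same rows.
users_by_term = defaultdict(set)
terms_by_user = defaultdict(set)
for identifier, term in log:
    users_by_term[term].add(identifier)
    terms_by_user[identifier].add(term)


def who_searched(terms):
    """Question 1: identifiers that searched for any of the given terms."""
    return set().union(*(users_by_term.get(t, set()) for t in terms))


def what_searched(identifier):
    """Question 2: terms searched under one IP address or cookie value."""
    return set(terms_by_user.get(identifier, set()))
```

Once the identifier column is stripped from the rows, neither index can
be built, so neither question can be answered from that data.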

http://battellemedia.com/archives/002283.php

--
                                - Adam


-------------------------------------

Archives at: http://www.interesting-people.org/archives/interesting-people/