[IP] Essay on Data Mining
Begin forwarded message:
From: Bruce Schneier <schneier@xxxxxxxxxxxxxxx>
Date: March 9, 2006 3:09:05 PM EST
To: EPIC_IDOF@xxxxxxxxxxxxxxxx
Subject: [EPIC_IDOF] My Essay on Data Mining
Why Data Mining Won't Stop Terror
Commentary by Bruce Schneier
02:00 AM Mar, 09, 2006 EST
http://www.wired.com/news/columns/0,70357-0.html?tw=wn_index_3
In the post-9/11 world, there's much focus on connecting the dots.
Many believe data mining is the crystal ball that will enable us to
uncover future terrorist plots. But even in the most wildly
optimistic projections, data mining isn't tenable for that purpose.
We're not trading privacy for security; we're giving up privacy and
getting no security in return.
Most people first learned about data mining in November 2002, when
news broke about a massive government data mining program called
Total Information Awareness. The basic idea was as audacious as it
was repellent: suck up as much data as possible about everyone, sift
through it with massive computers, and investigate patterns that
might indicate terrorist plots.
Americans across the political spectrum denounced the program, and in
September 2003, Congress eliminated its funding and closed its offices.
But TIA didn't die. According to The National Journal, it just
changed its name and moved inside the Defense Department.
This shouldn't be a surprise. In May 2004, the General Accounting
Office published a report listing 122 different federal
government data-mining programs that used people's personal
information. This list didn't include classified programs, like the
NSA's eavesdropping effort, or state-run programs like MATRIX.
The promise of data mining is compelling, and convinces many. But
it's wrong. We're not going to find terrorist plots through systems
like this, and we're going to waste valuable resources chasing down
false alarms. To understand why, we have to look at the economics of
the system.
Security is always a trade-off, and for a system to be worthwhile,
the advantages have to be greater than the disadvantages. A national
security data-mining program is going to find some percentage of real
attacks and some percentage of false alarms. If the benefits of
finding and stopping those attacks outweigh the cost -- in money,
liberties, etc. -- then the system is a good one. If not, you'd be
better off spending that capital elsewhere.
Data mining works best when there's a well-defined profile you're
searching for, a reasonable number of attacks per year and a low cost
of false alarms. Credit-card fraud is one of data mining's success
stories: all credit-card companies mine their transaction databases
for spending patterns that indicate a stolen card.
Many credit-card thieves share a pattern -- purchase expensive luxury
goods, purchase things that can be easily fenced, etc. -- and data
mining systems can minimize the losses in many cases by shutting down
the card. In addition, the cost of false alarms is only a phone call
to the cardholder asking him to verify a couple of purchases. The
cardholders don't even resent these phone calls -- as long as they're
infrequent -- so the cost is just a few minutes of operator time.
Terrorist plots are different. There is no well-defined profile and
attacks are very rare. Taken together, these facts mean that
data-mining systems won't uncover any terrorist plots until they are
very accurate, and that even very accurate systems will be so flooded
with false alarms that they will be useless.
All data-mining systems fail in two different ways: false positives
and false negatives. A false positive is when the system identifies a
terrorist plot that really isn't one. A false negative is when the
system misses an actual terrorist plot. Depending on how you "tune"
your detection algorithms, you can err on one side or the other: you
can increase the number of false positives to ensure you are less
likely to miss an actual terrorist plot, or you can reduce the number
of false positives at the expense of missing terrorist plots.
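
The tuning trade-off described above can be sketched in a few lines of
Python. This is only an illustration: the suspicion scores, the labels
and the count_errors helper are all made up here, and no real
detection system is this simple.

```python
# Hypothetical (score, is_real_plot) pairs. The scores and labels are
# invented for illustration and come from no real system.
SCORED_EVENTS = [
    (0.95, True), (0.80, False), (0.75, True), (0.60, False),
    (0.55, False), (0.40, True), (0.30, False), (0.10, False),
]


def count_errors(events, threshold):
    """Flag every event scoring at or above threshold.

    Returns (false_positives, false_negatives).
    """
    false_pos = sum(1 for score, is_plot in events
                    if score >= threshold and not is_plot)
    false_neg = sum(1 for score, is_plot in events
                    if score < threshold and is_plot)
    return false_pos, false_neg


# A lower threshold flags more events: fewer misses, more false alarms.
for threshold in (0.2, 0.5, 0.9):
    fp, fn = count_errors(SCORED_EVENTS, threshold)
    print(f"threshold {threshold}: {fp} false positives, {fn} false negatives")
```

With this toy data, lowering the threshold catches every real plot but
flags four innocent events, while raising it removes the false alarms
and misses two of the three real plots.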
To reduce both those numbers, you need a well-defined profile. And
that's a problem when it comes to terrorism. In hindsight, it was
really easy to connect the 9/11 dots and point to the warning signs,
but it's much harder before the fact. Certainly, many terrorist plots
share common warning signs, but each is unique, as well. The better
you can define what you're looking for, the better your results will
be. Data mining for terrorist plots will be sloppy, and it'll be hard
to find anything useful.
Data mining is like searching for a needle in a haystack. There are
900 million credit cards in circulation in the United States.
According to the FTC's September 2003 Identity Theft Survey Report,
about 1 percent (10 million) of those cards are stolen and
fraudulently used each year.
When it comes to terrorism, however, trillions of connections exist
between people and events -- things that the data-mining system will
have to "look at" -- and very few plots. This rarity makes even
accurate identification systems useless.
Let's look at some numbers. We'll be optimistic -- we'll assume the
system has a one in 100 false-positive rate (99 percent accurate),
and a one in 1,000 false-negative rate (99.9 percent accurate).
Assume 1 trillion possible indicators to sift through: that's about
10 events -- e-mails, phone calls, purchases, web destinations,
whatever -- per person in the United States per day. Also assume that
10 of them are actual terrorist plots.
This unrealistically accurate system will generate 1 billion false
alarms for every real terrorist plot it uncovers. Every day of every
year, the police will have to investigate 27 million potential plots
in order to find the one real terrorist plot per month. Raise that
false-positive accuracy to an absurd 99.9999 percent and you're still
chasing 2,750 false alarms per day -- but that will inevitably raise
your false negatives, and you're going to miss some of those 10 real
plots.
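
The arithmetic behind these figures can be checked with a short Python
sketch. Every input is one of the essay's stated assumptions (1
trillion indicators and 10 real plots per year, plus the two error
rates); the function and variable names are made up for illustration.

```python
from fractions import Fraction

# The essay's assumptions, not measured values.
INDICATORS_PER_YEAR = 10**12             # events to sift through
REAL_PLOTS_PER_YEAR = 10                 # actual plots among them
FALSE_NEGATIVE_RATE = Fraction(1, 1000)  # share of real plots missed
DAYS_PER_YEAR = 365


def false_alarm_load(false_positive_rate):
    """Return false alarms per year, per day and per plot uncovered."""
    innocent_events = INDICATORS_PER_YEAR - REAL_PLOTS_PER_YEAR
    alarms_per_year = innocent_events * false_positive_rate
    plots_uncovered = REAL_PLOTS_PER_YEAR * (1 - FALSE_NEGATIVE_RATE)
    return {
        "per_year": alarms_per_year,
        "per_day": alarms_per_year / DAYS_PER_YEAR,
        "per_plot_uncovered": alarms_per_year / plots_uncovered,
    }


for label, rate in [("99 percent", Fraction(1, 100)),
                    ("99.9999 percent", Fraction(1, 10**6))]:
    load = false_alarm_load(rate)
    print(f"{label} accurate: "
          f"{float(load['per_day']):,.0f} false alarms per day, "
          f"{float(load['per_plot_uncovered']):,.0f} per plot uncovered")
```

At 99 percent the sketch gives about 27.4 million false alarms per day
and roughly 1 billion per plot uncovered. At 99.9999 percent it gives
about 2,740 per day, in line with the essay's rounded figure of 2,750.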
This isn't anything new. In statistics, it's called the "base rate
fallacy," and it applies in other domains as well. For example, even
highly accurate medical tests are useless as diagnostic tools if the
incidence of the disease is rare in the general population. Terrorist
attacks are also rare, so any "test" is going to result in an endless
stream of false alarms.
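
Bayes' theorem makes the base-rate effect concrete. The Python sketch
below computes the chance that a positive result is a real one. The
medical-test numbers (1 case in 10,000 people, 99 percent sensitivity,
1 percent false positives) are hypothetical and chosen only for
illustration; the plot numbers are the essay's assumptions.

```python
def positive_predictive_value(base_rate, sensitivity, false_positive_rate):
    """P(condition is real | test says positive), by Bayes' theorem."""
    true_positives = base_rate * sensitivity
    false_positives = (1 - base_rate) * false_positive_rate
    return true_positives / (true_positives + false_positives)


# Hypothetical medical test for a disease affecting 1 person in 10,000.
ppv_medical = positive_predictive_value(
    base_rate=1e-4, sensitivity=0.99, false_positive_rate=0.01)

# The essay's scenario: 10 real plots among 1 trillion indicators,
# 99.9 percent of real plots detected, 1 percent false positives.
ppv_plots = positive_predictive_value(
    base_rate=10 / 1e12, sensitivity=0.999, false_positive_rate=0.01)

print(f"medical test: {ppv_medical:.2%} of positives are real")
print(f"plot detection: {ppv_plots:.2e} of alarms are real")
```

Under these assumptions fewer than 1 in 100 positive medical results
are real, and only about 1 alarm in a billion points to a real plot.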
This is exactly the sort of thing we saw with the NSA's eavesdropping
program: the New York Times reported that the computers spat out
thousands of tips per month. Every one of them turned out to be a
false alarm.
And the cost was enormous -- not just for the FBI agents running
around chasing dead-end leads instead of doing things that might
actually make us safer, but also the cost in civil liberties. The
fundamental freedoms that make our country the envy of the world are
valuable, and not something that we should throw away lightly.
Data mining can work. It helps Visa keep the costs of fraud down,
just as it helps Amazon alert me to books I might want to buy and
Google show me advertising I'm more likely to be interested in. But
these are all instances where the cost of false positives is low (a
phone call from a Visa operator or an uninteresting ad) in systems
that have value even if there is a high number of false negatives.
Finding terrorist plots is not a problem that lends itself to data
mining. It's a needle-in-a-haystack problem, and throwing more hay on
the pile doesn't make that problem any easier. We'd be far better off
putting people in charge of investigating potential plots and letting
them direct the computers, instead of putting the computers in charge
and letting them decide who should be investigated.
_______________________________________________
EPIC_IDOF mailing list
EPIC_IDOF@xxxxxxxxxxxxxxxx
https://mailman.epic.org/cgi-bin/mailman/listinfo/epic_idof
-------------------------------------
Archives at: http://www.interesting-people.org/archives/interesting-people/