[IP] Essay on Data Mining

To: ip@xxxxxxxxxxxxxx
Subject: [IP] Essay on Data Mining
From: David Farber <dave@xxxxxxxxxx>
Date: Sun, 12 Mar 2006 18:56:19 -0500



Begin forwarded message:

From: Bruce Schneier <schneier@xxxxxxxxxxxxxxx>
Date: March 9, 2006 3:09:05 PM EST
To: EPIC_IDOF@xxxxxxxxxxxxxxxx
Subject: [EPIC_IDOF] My Essay on Data Mining

Why Data Mining Won't Stop Terror

Commentary by Bruce Schneier
02:00 AM Mar, 09, 2006 EST
http://www.wired.com/news/columns/0,70357-0.html?tw=wn_index_3

In the post-9/11 world, there's much focus on connecting the dots. Many believe data mining is the crystal ball that will enable us to uncover future terrorist plots. But even in the most wildly optimistic projections, data mining isn't tenable for that purpose. We're not trading privacy for security; we're giving up privacy and getting no security in return.

Most people first learned about data mining in November 2002, when news broke about a massive government data mining program called Total Information Awareness. The basic idea was as audacious as it was repellent: suck up as much data as possible about everyone, sift through it with massive computers, and investigate patterns that might indicate terrorist plots.

Americans across the political spectrum denounced the program, and in September 2003, Congress eliminated its funding and closed its offices.

But TIA didn't die. According to The National Journal, it just changed its name and moved inside the Defense Department.

This shouldn't be a surprise. In May 2004, the General Accounting Office published a report listing 122 different federal government data-mining programs that used people's personal information. This list didn't include classified programs, like the NSA's eavesdropping effort, or state-run programs like MATRIX.

The promise of data mining is compelling, and convinces many. But it's wrong. We're not going to find terrorist plots through systems like this, and we're going to waste valuable resources chasing down false alarms. To understand why, we have to look at the economics of the system.

Security is always a trade-off, and for a system to be worthwhile, the advantages have to be greater than the disadvantages. A national security data-mining program is going to find some percentage of real attacks and some percentage of false alarms. If the benefits of finding and stopping those attacks outweigh the cost -- in money, liberties, etc. -- then the system is a good one. If not, you'd be better off spending that capital elsewhere.

Data mining works best when there is a well-defined profile to search for, a reasonable number of attacks per year and a low cost of false alarms. Credit-card fraud is one of data mining's success stories: all credit-card companies mine their transaction databases for spending patterns that indicate a stolen card.

Many credit-card thieves share a pattern -- purchase expensive luxury goods, purchase things that can be easily fenced, etc. -- and data mining systems can minimize the losses in many cases by shutting down the card. In addition, the cost of false alarms is only a phone call to the cardholder asking him to verify a couple of purchases. The cardholders don't even resent these phone calls -- as long as they're infrequent -- so the cost is just a few minutes of operator time.

Terrorist plots are different. There is no well-defined profile and attacks are very rare. Taken together, these facts mean that data-mining systems won't uncover any terrorist plots until they are very accurate, and that even very accurate systems will be so flooded with false alarms that they will be useless.

All data-mining systems fail in two different ways: false positives and false negatives. A false positive is when the system identifies a terrorist plot that really isn't one. A false negative is when the system misses an actual terrorist plot. Depending on how you "tune" your detection algorithms, you can err on one side or the other: you can increase the number of false positives to ensure you are less likely to miss an actual terrorist plot, or you can reduce the number of false positives at the expense of missing terrorist plots.
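
The tuning trade-off described above can be sketched in a few lines of Python. This is only an illustration: the suspicion scores below are invented for the example and do not come from the essay or from any real system, and the name count_errors is hypothetical.

```python
# Hypothetical suspicion scores in [0, 1]; the values are made up for illustration.
INNOCENT_SCORES = [0.05, 0.10, 0.20, 0.35, 0.40, 0.55, 0.60, 0.70]
PLOT_SCORES = [0.45, 0.65, 0.80, 0.90]


def count_errors(threshold):
    """Flag every score at or above the threshold and count both kinds of mistakes."""
    # Innocent activity that gets flagged anyway.
    false_positives = sum(score >= threshold for score in INNOCENT_SCORES)
    # Real plots that slip under the threshold.
    false_negatives = sum(score < threshold for score in PLOT_SCORES)
    return false_positives, false_negatives


if __name__ == "__main__":
    for threshold in (0.3, 0.5, 0.75):
        fp, fn = count_errors(threshold)
        print(f"threshold={threshold}: {fp} false positives, {fn} false negatives")
```

Raising the threshold in this toy data removes false positives only by letting real plots through, which is the "tuning" choice the paragraph describes.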

To reduce both those numbers, you need a well-defined profile. And that's a problem when it comes to terrorism. In hindsight, it was really easy to connect the 9/11 dots and point to the warning signs, but it's much harder before the fact. Certainly, many terrorist plots share common warning signs, but each is unique, as well. The better you can define what you're looking for, the better your results will be. Data mining for terrorist plots will be sloppy, and it'll be hard to find anything useful.

Data mining is like searching for a needle in a haystack. There are 900 million credit cards in circulation in the United States. According to the FTC September 2003 Identity Theft Survey Report, about 1 percent of those cards (10 million) are stolen and fraudulently used each year.

When it comes to terrorism, however, trillions of connections exist between people and events -- things that the data-mining system will have to "look at" -- and very few plots. This rarity makes even accurate identification systems useless.

Let's look at some numbers. We'll be optimistic -- we'll assume the system has a one in 100 false-positive rate (99 percent accurate), and a one in 1,000 false-negative rate (99.9 percent accurate). Assume 1 trillion possible indicators to sift through each year: that's about 10 events -- e-mails, phone calls, purchases, web destinations, whatever -- per person in the United States per day. Also assume that 10 of them are actually terrorists plotting.

This unrealistically accurate system will generate 1 billion false alarms for every real terrorist plot it uncovers. Every day of every year, the police will have to investigate 27 million potential plots in order to find the one real terrorist plot per month. Raise that false-positive accuracy to an absurd 99.9999 percent and you're still chasing 2,750 false alarms per day -- but that will inevitably raise your false negatives, and you're going to miss some of those 10 real plots.
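
The arithmetic behind these figures can be reproduced with a short Python sketch. The rates and counts are the essay's own; the one-year time frame is an assumption inferred from "10 events per person per day", and the function name alarm_counts is hypothetical.

```python
# Figures from the essay; the yearly time frame is an inferred assumption.
INDICATORS_PER_YEAR = 1_000_000_000_000  # 1 trillion events examined
REAL_PLOTS_PER_YEAR = 10
FALSE_NEGATIVE_RATE = 1 / 1_000


def alarm_counts(false_positive_rate):
    """Return (false alarms per year, false alarms per day, false alarms per plot found)."""
    innocent_events = INDICATORS_PER_YEAR - REAL_PLOTS_PER_YEAR
    false_alarms = innocent_events * false_positive_rate
    plots_found = REAL_PLOTS_PER_YEAR * (1 - FALSE_NEGATIVE_RATE)
    return false_alarms, false_alarms / 365, false_alarms / plots_found


if __name__ == "__main__":
    # 99 percent accurate, then the "absurd" 99.9999 percent accurate case.
    for rate in (1 / 100, 1 / 1_000_000):
        per_year, per_day, per_plot = alarm_counts(rate)
        print(f"false-positive rate {rate:g}: "
              f"{per_year:,.0f} per year, {per_day:,.0f} per day, "
              f"{per_plot:,.0f} per real plot found")
```

With a one in 100 false-positive rate this gives roughly 27 million false alarms per day and about a billion per plot found; at one in a million it gives a little over 2,700 per day, in line with the essay's rounded figure.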

This isn't anything new. In statistics, it's called the "base rate fallacy," and it applies in other domains as well. For example, even highly accurate medical tests are useless as diagnostic tools if the incidence of the disease is rare in the general population. Terrorist attacks are also rare, so any "test" is going to result in an endless stream of false alarms.
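
The base rate effect follows from Bayes' rule, and a minimal Python sketch makes it concrete. The disease prevalence of 1 in 10,000 is a hypothetical value chosen for illustration, the card-fraud case loosely treats the 1 percent figure quoted earlier as a prevalence, and the function name chance_alarm_is_real is an assumption of this sketch.

```python
def chance_alarm_is_real(prevalence, false_positive_rate, false_negative_rate):
    """Bayes' rule: probability that a flagged case is a true case."""
    true_alarms = prevalence * (1 - false_negative_rate)
    false_alarms = (1 - prevalence) * false_positive_rate
    return true_alarms / (true_alarms + false_alarms)


if __name__ == "__main__":
    # Same optimistic accuracy in every case: 1 in 100 false positives,
    # 1 in 1,000 false negatives. Only the base rate changes.
    cases = {
        "card fraud (1 in 100, loosely from the essay)": 1 / 100,
        "rare disease (1 in 10,000, hypothetical)": 1 / 10_000,
        "terrorist plots (10 in 1 trillion, from the essay)": 10 / 1e12,
    }
    for label, prevalence in cases.items():
        probability = chance_alarm_is_real(prevalence, 1 / 100, 1 / 1_000)
        print(f"{label}: {probability:.3g}")
```

The test's accuracy is identical in all three cases, yet an alarm is right about half the time for card fraud, about 1 percent of the time for the rare disease, and about one time in a billion for plots.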

This is exactly the sort of thing we saw with the NSA's eavesdropping program: the New York Times reported that the computers spat out thousands of tips per month. Every one of them turned out to be a false alarm.

And the cost was enormous -- not just for the FBI agents running around chasing dead-end leads instead of doing things that might actually make us safer, but also the cost in civil liberties. The fundamental freedoms that make our country the envy of the world are valuable, and not something that we should throw away lightly.

Data mining can work. It helps Visa keep the costs of fraud down, just as it helps Amazon alert me to books I might want to buy and Google show me advertising I'm more likely to be interested in. But these are all instances where the cost of false positives is low (a phone call from a Visa operator or an uninteresting ad) in systems that have value even if there is a high number of false negatives.

Finding terrorism plots is not a problem that lends itself to data mining. It's a needle-in-a-haystack problem, and throwing more hay on the pile doesn't make that problem any easier. We'd be far better off putting people in charge of investigating potential plots and letting them direct the computers, instead of putting the computers in charge and letting them decide who should be investigated.


