Re: Which spam filter do you use?
* Eyolf Østrem <eyolf@xxxxxxxxxxx> [2007-10-05 13:22]:
> On 05.10.2007 (02:08), Kyle Wheeler wrote:
> > The real criticism I'd level against SpamAssassin as compared to
> > bogofilter is that SpamAssassin's bayesian classifier is relatively
> > simple.
Yes. Also, SpamAssassin is terribly slow.
> > To my knowledge, it doesn't tokenize word-pairs and phrases,
> > but just single words; thus, something that uses more advanced
> > bayesian techniques (I presume bogofilter fits this description?) may
> > well beat it at that particular game---which is where updating the
> > rules can help as a compensating factor. It's not like a virus-scanner
> > where an out-of-date database is worthless.
>
> So this does in fact mean that without that extra time tending the
> rules, SA may actually let more spam through?
Yes, even if you do take that time :-) From my experience, which
includes a few benchmarks I ran, Bogofilter's accuracy is way better
than SpamAssassin's, even with SpamAssassin's bayesian classifier,
Razor, and a few other non-default modules enabled.
Bogofilter's accuracy depends heavily on good training, though. It's
critical to train Bogofilter not only with misclassified messages but
also with messages it's unsure about. Fine-tuning the configuration
might also increase Bogofilter's accuracy[*].
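
The training policy above (feed back both the errors and the unsure
cases) can be sketched as a small selection rule. This is only an
illustration: the verdict strings and both helper functions are
hypothetical names made up for this example and are not part of
Bogofilter.

```python
# Hypothetical helpers, not Bogofilter's API: decide which messages
# should be fed back to the filter for training.

def needs_training(verdict: str, actual: str) -> bool:
    """verdict is the filter's output ("spam", "ham" or "unsure");
    actual is what the message really was ("spam" or "ham")."""
    if verdict == "unsure":
        return True           # filter could not decide: always train
    return verdict != actual  # filter decided wrongly: train as well


def select_for_training(results):
    """results: iterable of (message_id, verdict, actual) tuples.
    Returns (message_id, actual) pairs for the messages to train on."""
    return [(mid, actual) for mid, verdict, actual in results
            if needs_training(verdict, actual)]
```

Correctly classified messages are skipped here; only the mistakes and
the undecided cases are returned, each paired with its real class.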
IMO, SpamAssassin is only useful if you don't want to, or cannot, train
your spam filter for some reason (e.g., if you're an ISP, though in this
case SpamAssassin's slowness can be a real drawback).
> It's not that I'm dissatisfied with my current situation. My #1
> concern is actually with false negatives; I've since long given up
> browsing through the CapturedSpam folder to check for them before I
> delete everything.
So you mean false positives :-) In any case, any serious spam filter
allows for adjusting the spam/ham thresholds, so you can always reduce
the number of false positives to almost zero at the cost of increasing
the number of false negatives. SpamAssassin's default configuration
does just that (which makes sense, of course).
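
The threshold trade-off can be shown with a toy calculation. The
scores below are invented purely for illustration and do not come from
any real filter; `count_errors` and `SAMPLE` are hypothetical names.

```python
# Toy illustration with made-up scores; not output from any real filter.

def count_errors(scored, spam_cutoff):
    """scored: list of (score, is_spam) pairs, score in [0, 1].
    A message is filed as spam when score >= spam_cutoff.
    Returns (false_positives, false_negatives)."""
    false_pos = sum(1 for score, is_spam in scored
                    if score >= spam_cutoff and not is_spam)
    false_neg = sum(1 for score, is_spam in scored
                    if score < spam_cutoff and is_spam)
    return false_pos, false_neg


# Invented example data: (score, is_spam)
SAMPLE = [
    (0.02, False), (0.10, False), (0.55, False), (0.93, False),
    (0.60, True), (0.91, True), (0.97, True), (0.999, True),
]

for cutoff in (0.5, 0.9, 0.95):
    print(cutoff, count_errors(SAMPLE, cutoff))
```

On this invented data, raising the cutoff from 0.5 to 0.95 takes the
false positives from 2 down to 0, while the false negatives go from 0
up to 2.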
> it seemed that SA had absolutely 0 of that, whereas bogo might have
> one or two (out of I don't remember how many thousand).
Depends on the configuration.
Holger
[*] You could try bogotune(1) and/or increasing multi-token-count,
though this increases the database size and decreases Bogofilter's
performance. It _should_ increase Bogofilter's accuracy, though for
me it had far less effect than described in the following posting
(mainly because my results with multi-token-count=1 are already
better than those reported there):
http://www.bogofilter.org/pipermail/bogofilter-dev/2006-August/003357.html
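
The general idea of scoring runs of adjacent words in addition to
single words can be sketched as follows. This is a simplified
illustration and not Bogofilter's actual lexer: the `tokenize`
function, its regular expression, and the `*` separator are all
assumptions made for this example.

```python
# Simplified sketch of multi-word tokens; not Bogofilter's real lexer.
import re


def tokenize(text: str, multi_token_count: int = 1) -> list[str]:
    """Return single-word tokens plus, for multi_token_count > 1,
    every run of up to that many adjacent words joined by '*'."""
    words = re.findall(r"[a-z0-9$'-]+", text.lower())
    tokens = []
    for n in range(1, multi_token_count + 1):
        # Slide a window of n words across the message.
        for i in range(len(words) - n + 1):
            tokens.append("*".join(words[i:i + n]))
    return tokens


print(tokenize("Buy cheap pills now"))
print(tokenize("Buy cheap pills now", multi_token_count=2))
```

With a count of 2 the same four-word message yields seven tokens
instead of four, which fits the larger database and slower processing
mentioned above.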