Re: [PATCH] generic spam detection



This indeed looks very appealing for CVS inclusion.  I'd be curious
to hear about any experience people have with the patch.

Suggested tweaking:

- Add documentation.
- Add both lexical and numerical sorting by spam tag.

(This would make your patch a useful generalization of Rogier
Wolff's proposed spam-score sorting patch.)

Regards,
-- 
Thomas Roessler · Personal soap box at <http://log.does-not-exist.org/>.

On 2004-02-10 01:13:45 -0600, David Champion wrote:
> From: David Champion <dgc@xxxxxxxxxxxx>
> To: mutt-dev@xxxxxxxx
> Date: Tue, 10 Feb 2004 01:13:45 -0600
> Subject: [PATCH] generic spam detection
> Mail-Followup-To: mutt-dev@xxxxxxxx
> X-Spam-Level: 
> 
> (This is long and I'm having trouble writing a concise explanation, so
> please bear with me or ignore.)
> 
> From time to time someone posts to mutt-users or mutt-dev with a request
> that mutt should automatically detect warning headers from SpamAssassin,
> Bogofilter, or some other spam filter. A few patches have been posted
> to this effect, but not absorbed into the main code.
> 
> That's great! I'm of the opinion that mutt should not cater to any
> particular spam-detection product: these functions are separate, and
> mutt doesn't need distinct support for each one. However, mutt's
> strengths are in its flexibility, and spam can be handled in an open,
> unbiased way.
> 
> (Several people have suggested procmail rules for copying spam headers
> to the X-Label: header, and using that to indicate spam. That works, but
> I like using X-Label: to carry other information, and I think there's a
> fair argument that one shouldn't need to coerce various spam headers into
> one canonical header just for the mailer to pick it up.)
> 
> Around the time that everybody and his mother invented their own
> Bayesian analyzer, I had an idea of how to approach this. Each filter
> has its own header or set of headers to indicate results, and sometimes
> multiple filters are stacked together, so there's no single pattern that
> indicates spamminess.
> 
> This patch implements two new commands, "spam" and "nospam", and a
> variable, $spam_separator. These govern the "spam tag" for a message
> header. The "spam" command takes this form:
> 
>       spam 'regex' 'tag'
> 
> The "nospam" command takes only one regex:
> 
>       nospam 'regex'
> 
> When a message header is read, each header line is compared against the
> list of regular expressions from your "spam" commands. You can use as
> many as you like. If a header matches a spam regex, the message's "spam
> tag" is set to 'tag'. Parenthesized substitutions from the regex are
> performed: %1 is the first subexpression, %2 the second, etc. However,
> if a header matches both a spam regex *and* a nospam regex, it is
> ignored, and the spam tag is not set. Think of "nospam" as the exception
> list for things that match "spam", but are not spammy.
> 
> In $index_format, %H expands to a message's spam tag. %?H? notation
> works, too. And the new ~H pattern will match on spam tags, so you can
> perform limits and hooks against spam results.
> 
> The $spam_separator variable controls how multiple matches are treated.
> If it is unset, then the spam tag is always overwritten -- it will hold
> whatever the last spam regex to match indicated. If it is set, it is a
> join string: with each successive match, $spam_separator is appended to
> the existing spam tag, and the new 'tag' is appended to that.
> 
> Currently I'm using these settings:
> 
>     spam "X-DCC-.*-Metrics:.*(....)=many" "DCC/%1"
>     spam "X-Spam-Status: Yes" "SA"
>     set spam_separator=","
> 
> If a message scores in both DCC and SpamAssassin, it will get a spam
> tag of (for example) "DCC/Fuz1,SA". (The "Fuz1" comes from the "(....)"
> in the DCC regex.) My $index_format includes "%?H?*%H* ?", so it will
> insert "*DCC/Fuz1,SA*" in the subject area for this message. And I can
> search for messages that SpamAssassin marked with "~H SA".
> 
> The regex matching occurs as message headers are parsed, so if you're
> using spam commands, there's some startup overhead as a folder is opened
> or as mail arrives. But this isn't incurred once the mailbox is open,
> and the spam tag shouldn't be mutable anyway.
> 
> This is all sort of experimental for me, but I thought others might want
> to experiment, too. If this looks appealing for CVS, but needs tweaking,
> please let me know.
> 
> -- 
>  -D.    dgc@xxxxxxxxxxxx   **   Enterprise Network Servers and Such
>                            **   University of Chicago
>  We are the robots.        **   North America's southernmost seasonal glacier
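
For readers who want to see the rules from the quoted message in one place,
here is a rough Python sketch of how a spam tag could be computed from
"spam" and "nospam" regexes and $spam_separator. It is not the patch's C
code. The names spam_tag and expand_tag are hypothetical, the sample header
values are made up, and two details are assumptions: that every matching
"spam" rule contributes (not only the first), and that a %N with no
corresponding subexpression expands to an empty string. Python's re module
also differs from the POSIX regexes mutt uses, although the patterns from
the message behave the same way in both.

```python
import re


def expand_tag(template, match):
    """Replace %1, %2, ... in template with the regex's subexpressions."""
    def substitute(m):
        n = int(m.group(1))
        # Assumption: a missing or unmatched subexpression expands to "".
        if n <= match.re.groups:
            return match.group(n) or ""
        return ""
    return re.sub(r"%(\d)", substitute, template)


def spam_tag(header_lines, spam_rules, nospam_rules=(), separator=None):
    """Return the spam tag for one message, or None if nothing matched.

    spam_rules is a list of (regex, tag_template) pairs; nospam_rules is a
    list of regexes; separator plays the role of $spam_separator (None
    means unset).
    """
    tag = None
    for line in header_lines:
        # A header line that matches any nospam regex is ignored entirely.
        if any(re.search(rx, line) for rx in nospam_rules):
            continue
        for rx, template in spam_rules:
            m = re.search(rx, line)
            if m is None:
                continue
            new = expand_tag(template, m)
            if separator is None or tag is None:
                tag = new                      # unset separator: overwrite
            else:
                tag = tag + separator + new    # set separator: join
    return tag


rules = [
    (r"X-DCC-.*-Metrics:.*(....)=many", "DCC/%1"),
    (r"X-Spam-Status: Yes", "SA"),
]
headers = [
    "X-DCC-sample-Metrics: host 1234; Body=1 Fuz1=many",
    "X-Spam-Status: Yes, hits=7.2",
]
print(spam_tag(headers, rules, separator=","))  # DCC/Fuz1,SA
print(spam_tag(headers, rules))                 # SA
```

With the separator unset, the second call keeps only the last match, which
is the overwrite behaviour described in the message.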