[PATCH] generic spam detection



(This is long and I'm having trouble writing a concise explanation, so
please bear with me or ignore.)

From time to time someone posts to mutt-users or mutt-dev with a request
that mutt should automatically detect warning headers from SpamAssassin,
Bogofilter, or some other spam filter. A few patches have been posted
to this effect, but not absorbed into the main code.

That's as it should be. I'm of the opinion that mutt should not cater to
any particular spam-detection product: these functions are separate, and
mutt doesn't need distinct support for each one. However, mutt's
strength is its flexibility, and spam can be handled in an open,
unbiased way.

(Several people have suggested procmail rules for copying spam headers
to the X-Label: header, and using that to indicate spam. That works, but
I like using X-Label: to carry other information, and I think there's a
fair argument that one shouldn't need to coerce various spam headers into
one canonical header just for the mailer to pick it up.)

Around the time that everybody and his mother invented their own
Bayesian analyzer, I had an idea of how to approach this. Each filter
has its own header or set of headers to indicate results, and sometimes
multiple filters are stacked together, so there's no single pattern that
indicates spamminess.

This patch implements two new commands, "spam" and "nospam", and a
variable, $spam_separator. These govern the "spam tag" for a message
header. The "spam" command takes this form:

        spam 'regex' 'tag'

The "nospam" command takes only one regex:

        nospam 'regex'

When a message header is read, each header line is compared against the
list of regular expressions from your "spam" commands. You can use as
many as you like. If a header matches a spam regex, the message's "spam
tag" is set to 'tag'. Parenthesized substitutions from the regex are
performed: %1 is the first subexpression, %2 the second, etc. However,
if a header matches both a spam regex *and* a nospam regex, it is
ignored, and the spam tag is not set. Think of "nospam" as the exception
list for things that match "spam", but are not spammy.
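
To make the matching rule concrete, here is a minimal standalone sketch
of it, not the code from the patch below. It assumes POSIX extended,
case-insensitive regexes, a single spam/nospam pair, and single-digit
%N references; the names spam_tag_for_header and expand_template are
made up for illustration.

```c
#include <assert.h>
#include <ctype.h>
#include <regex.h>
#include <string.h>

#define MAX_SUBS 10  /* assumption: %0..%9 is enough for a sketch */

/* Copy templ into out, replacing %N with the Nth subexpression of the
 * match described by pm. Output is truncated to fit in outsize bytes. */
static void expand_template(const char *templ, const char *s,
                            const regmatch_t *pm, char *out, size_t outsize)
{
  size_t len = 0;
  const char *p = templ;

  assert(out != NULL && outsize > 0);

  while (*p)
  {
    if (*p == '%' && isdigit((unsigned char) p[1]))
    {
      int n = p[1] - '0';
      p += 2;
      if (pm[n].rm_so < 0)
        continue;                /* subexpression took no part in the match */
      for (regoff_t i = pm[n].rm_so; i < pm[n].rm_eo && len + 1 < outsize; i++)
        out[len++] = s[i];
    }
    else
    {
      if (len + 1 < outsize)
        out[len++] = *p;
      p++;
    }
  }
  out[len] = '\0';
}

/* Returns 1 and fills out when header matches spam_pat and is not
 * excepted by nospam_pat (which may be NULL). Returns 0 when no tag
 * applies, and -1 when a pattern fails to compile. */
int spam_tag_for_header(const char *spam_pat, const char *templ,
                        const char *nospam_pat, const char *header,
                        char *out, size_t outsize)
{
  regex_t spam_rx, nospam_rx;
  regmatch_t pm[MAX_SUBS];
  int result = 0;

  if (regcomp(&spam_rx, spam_pat, REG_EXTENDED | REG_ICASE) != 0)
    return -1;

  if (regexec(&spam_rx, header, MAX_SUBS, pm, 0) == 0)
  {
    result = 1;
    if (nospam_pat)
    {
      if (regcomp(&nospam_rx, nospam_pat,
                  REG_EXTENDED | REG_ICASE | REG_NOSUB) != 0)
        result = -1;
      else
      {
        if (regexec(&nospam_rx, header, 0, NULL, 0) == 0)
          result = 0;            /* the nospam exception wins */
        regfree(&nospam_rx);
      }
    }
    if (result == 1)
      expand_template(templ, header, pm, out, outsize);
  }

  regfree(&spam_rx);
  return result;
}
```

Called with the pattern "X-DCC-.*-Metrics:.*(....)=many", the template
"DCC/%1", and a header containing a single "Fuz1=many", this fills out
with "DCC/Fuz1". If a nospam pattern also matches the same header, it
returns 0 and leaves out alone.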

In $index_format, %H expands to a message's spam tag. %?H? notation
works, too. And the new ~H pattern will match on spam tags, so you can
perform limits and hooks against spam results.

The $spam_separator variable controls how multiple matches are treated.
If it is unset, then the spam tag is always overwritten -- it will hold
whatever the last spam regex to match indicated. If it is set, it is a
join string: with each successive match, $spam_separator is appended to
the existing spam tag, and the new 'tag' is appended to that.
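
The overwrite-versus-append rule can be sketched in a few lines. This
is an illustration only, using a fixed-size char array where the patch
uses a BUFFER. The name merge_spam_tag is hypothetical, and passing
NULL for sep stands in for an unset $spam_separator. As a
simplification, an empty existing tag is treated as "no tag yet".

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Merge new_tag into tag, a NUL-terminated string held in a buffer of
 * tagsize bytes. With sep == NULL the new tag replaces the old one;
 * otherwise sep and new_tag are appended. Truncates when out of room.
 * new_tag must not point into tag. */
void merge_spam_tag(char *tag, size_t tagsize,
                    const char *new_tag, const char *sep)
{
  size_t len;

  assert(tag != NULL && tagsize > 0 && new_tag != NULL);

  len = strlen(tag);
  if (len == 0 || sep == NULL)
  {
    /* first match, or no separator set: overwrite */
    snprintf(tag, tagsize, "%s", new_tag);
    return;
  }

  /* separator set and a tag already exists: append */
  snprintf(tag + len, tagsize - len, "%s%s", sep, new_tag);
}
```

Starting from an empty tag, merging "DCC/Fuz1" and then "SA" with a
separator of "," leaves "DCC/Fuz1,SA". With sep == NULL the same two
calls leave just "SA".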

Currently I'm using these settings:
    spam "X-DCC-.*-Metrics:.*(....)=many" "DCC/%1"
    spam "X-Spam-Status: Yes" "SA"
    set spam_separator=","

If a message scores in both DCC and SpamAssassin, it will get a spam
tag of (for example) "DCC/Fuz1,SA". (The "Fuz1" comes from the "(....)"
in the DCC regex.) My $index_format includes "%?H?*%H* ?", so it will
insert "*DCC/Fuz1,SA*" in the subject area for this message. And I can
search for messages that SpamAssassin marked with "~H SA".
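
For anyone who wants to check their reading of that example, here is a
rough sketch of what the "%?H?*%H* ?" conditional and the "~H SA"
search amount to once a tag exists. Both function names are
hypothetical, and the matcher is always case-sensitive, which is a
simplification of how mutt treats pattern case.

```c
#include <assert.h>
#include <regex.h>
#include <stdio.h>
#include <string.h>

/* Mimics "%?H?*%H* ?": emit "*tag* " when the message has a spam tag,
 * and nothing at all when it has none (tag == NULL). */
void render_spam_prefix(const char *tag, char *out, size_t outsize)
{
  assert(out != NULL && outsize > 0);

  if (tag)
    snprintf(out, outsize, "*%s* ", tag);
  else
    out[0] = '\0';
}

/* Mimics "~H pattern": returns 1 if the tag matches pattern, 0 if it
 * does not or if there is no tag, and -1 if pattern fails to compile. */
int spam_tag_matches(const char *tag, const char *pattern)
{
  regex_t rx;
  int rc;

  if (!tag)
    return 0;
  if (regcomp(&rx, pattern, REG_EXTENDED | REG_NOSUB) != 0)
    return -1;

  rc = regexec(&rx, tag, 0, NULL, 0);
  regfree(&rx);
  return rc == 0;
}
```

With the tag "DCC/Fuz1,SA", render_spam_prefix produces
"*DCC/Fuz1,SA* " and spam_tag_matches(tag, "SA") returns 1. A message
tagged only "DCC/Fuz1" does not match "SA".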

The regex matching occurs as message headers are parsed, so if you're
using spam commands, there's some startup overhead as a folder is opened
or as mail arrives. That cost isn't incurred again once the mailbox is
open, and a message's spam tag shouldn't change after that anyway.

This is all sort of experimental for me, but I thought others might want
to experiment, too. If this looks appealing for CVS, but needs tweaking,
please let me know.

-- 
 -D.    dgc@xxxxxxxxxxxx   **   Enterprise Network Servers and Such
                           **   University of Chicago
 We are the robots.        **   North America's southernmost seasonal glacier
--- mutt-1.5.6/PATCHES~ never
+++ mutt-1.5.6/PATCHES  Mon Feb  9 21:07:37 CST 2004
@@ -1,0 +1 @@
+patch-1.5.6.dgc.hormel.1
diff -Pur mutt-1.5.6-dist/globals.h mutt-1.5.6/globals.h
--- mutt-1.5.6-dist/globals.h   Sun Feb  1 11:15:17 2004
+++ mutt-1.5.6/globals.h        Sat Feb  7 21:39:00 2004
@@ -102,6 +102,7 @@
 WHERE char *Signature;
 WHERE char *SimpleSearch;
 WHERE char *Spoolfile;
+WHERE char *SpamSep;
 #if defined(USE_SSL) || defined(USE_NSS)
 WHERE char *SslCertFile INITVAL (NULL);
 WHERE char *SslEntropyFile INITVAL (NULL);
@@ -125,6 +126,8 @@
 WHERE RX_LIST *Alternates INITVAL(0);
 WHERE RX_LIST *MailLists INITVAL(0);
 WHERE RX_LIST *SubscribedLists INITVAL(0);
+WHERE SPAM_LIST *SpamList INITVAL(0);
+WHERE RX_LIST *NoSpamList INITVAL(0);
 
 /* bit vector for boolean variables */
 #ifdef MAIN_C
diff -Pur mutt-1.5.6-dist/hdrline.c mutt-1.5.6/hdrline.c
--- mutt-1.5.6-dist/hdrline.c   Sun Feb  1 11:15:17 2004
+++ mutt-1.5.6/hdrline.c        Sat Feb  7 22:24:26 2004
@@ -433,6 +433,18 @@
         optional = 0;
       break;
 
+    case 'H':
+      /* (Hormel) spam score */
+      if (optional)
+       optional = hdr->env->spam ? 1 : 0;
+
+       if (hdr->env->spam)
+         mutt_format_s (dest, destlen, prefix, NONULL (hdr->env->spam->data));
+       else
+         mutt_format_s (dest, destlen, prefix, "");
+
+      break;
+
     case 'i':
      mutt_format_s (dest, destlen, prefix, hdr->env->message_id ? hdr->env->message_id : "<no.id>");
       break;
diff -Pur mutt-1.5.6-dist/init.c mutt-1.5.6/init.c
--- mutt-1.5.6-dist/init.c      Sun Feb  1 12:21:00 2004
+++ mutt-1.5.6/init.c   Mon Feb  9 00:29:52 2004
@@ -366,6 +366,73 @@
 }
 
 
+static int add_to_spam_list (SPAM_LIST **list, const char *pat, const char *templ, BUFFER *err)
+{
+  SPAM_LIST *t, *last = NULL;
+  REGEXP *rx;
+  int n;
+  const char *p;
+
+  if (!pat || !*pat || !templ)
+    return 0;
+
+  if (!(rx = mutt_compile_regexp (pat, REG_ICASE)))
+  {
+    snprintf (err->data, err->dsize, _("Bad regexp: %s"), pat);
+    return -1;
+  }
+
+  /* check to make sure the item is not already on this list */
+  for (last = *list; last; last = last->next)
+  {
+    if (ascii_strcasecmp (rx->pattern, last->rx->pattern) == 0)
+    {
+      /* already on the list, so just ignore it */
+      last = NULL;
+      break;
+    }
+    if (!last->next)
+      break;
+  }
+
+  if (!*list || last)
+  {
+    t = mutt_new_spam_list();
+    t->rx = rx;
+    t->template = strdup(templ);
+
+    /* find highest match number in template string */
+    t->nmatch = 0;
+    for (p = templ; *p;)
+    {
+      if (*p == '%')
+      {
+       n = atoi(++p);
+       if (n > t->nmatch)
+         t->nmatch = n;
+       while (*p && isdigit(*p))
+         ++p;
+      }
+      else
+       ++p;
+    }
+    t->nmatch++;               /* match 0 is always the whole expr */
+
+    if (last)
+    {
+      last->next = t;
+      last = last->next;
+    }
+    else
+      *list = last = t;
+  }
+  else /* duplicate */
+    mutt_free_regexp (&rx);
+
+  return 0;
+}
+
+
 static void remove_from_list (LIST **l, const char *str)
 {
   LIST *p, *last = NULL;
@@ -500,6 +567,35 @@
     remove_from_rx_list ((RX_LIST **) data, buf->data);
   }
   while (MoreArgs (s));
+  
+  return 0;
+}
+
+static int parse_spam_list (BUFFER *buf, BUFFER *s, unsigned long data, BUFFER *err)
+{
+  BUFFER templ;
+
+  memset(&templ, 0, sizeof(templ));
+
+  if (!MoreArgs(s))
+  {
+    strfcpy(err->data, _("spam: no matching pattern"), err->dsize);
+    return -1;
+  }
+  mutt_extract_token (buf, s, 0);
+
+  if (MoreArgs(s))
+  {
+    mutt_extract_token (&templ, s, 0);
+  }
+  else
+  {
+    templ.data = strdup("");
+    templ.dsize = 0;
+  }
+
+  if (add_to_spam_list ((SPAM_LIST **) data, buf->data, templ.data, err) != 0)
+      return -1;
   
   return 0;
 }
diff -Pur mutt-1.5.6-dist/init.h mutt-1.5.6/init.h
--- mutt-1.5.6-dist/init.h      Sun Feb  1 11:15:17 2004
+++ mutt-1.5.6/init.h   Sat Feb  7 21:38:32 2004
@@ -2379,6 +2379,15 @@
   ** the message whether or not this is the case, as long as the
   ** non-``$$reply_regexp'' parts of both messages are identical.
   */
+  { "spam_separator",   DT_STR, R_NONE, UL &SpamSep, UL 0 },
+  /*
+  ** .pp
+  ** ``$spam_separator'' controls what happens when multiple spam headers
+  ** are matched: if unset, each successive header will overwrite any
+  ** previous match's value for the spam label. If set, each successive
+  ** match will append to the previous, using ``$spam_separator'' as a
+  ** separator.
+  */
   { "spoolfile",       DT_PATH, R_NONE, UL &Spoolfile, 0 },
   /*
   ** .pp
@@ -2728,6 +2737,7 @@
 
 static int parse_list (BUFFER *, BUFFER *, unsigned long, BUFFER *);
 static int parse_rx_list (BUFFER *, BUFFER *, unsigned long, BUFFER *);
+static int parse_spam_list (BUFFER *, BUFFER *, unsigned long, BUFFER *);
 static int parse_unlist (BUFFER *, BUFFER *, unsigned long, BUFFER *);
 static int parse_rx_unlist (BUFFER *, BUFFER *, unsigned long, BUFFER *);
 
@@ -2793,6 +2803,8 @@
   { "send-hook",       mutt_parse_hook,        M_SENDHOOK },
   { "set",             parse_set,              0 },
   { "source",          parse_source,           0 },
+  { "spam",            parse_spam_list,        UL &SpamList },
+  { "nospam",          parse_rx_list,          UL &NoSpamList },
   { "subscribe",       parse_subscribe,        0 },
   { "toggle",          parse_set,              M_SET_INV },
   { "unalias",         parse_unalias,          0 },
diff -Pur mutt-1.5.6-dist/mutt.h mutt-1.5.6/mutt.h
--- mutt-1.5.6-dist/mutt.h      Sun Feb  1 11:15:17 2004
+++ mutt-1.5.6/mutt.h   Sat Feb  7 21:34:20 2004
@@ -220,6 +220,7 @@
   M_ID,
   M_BODY,
   M_HEADER,
+  M_HORMEL,
   M_WHOLE_MSG,
   M_SENDER,
   M_MESSAGE,
@@ -405,6 +406,7 @@
   OPTSIGDASHES,
   OPTSIGONTOP,
   OPTSORTRE,
+  OPTSPAMSEP,
   OPTSTATUSONTOP,
   OPTSTRICTTHREADS,
   OPTSUSPEND,
@@ -512,10 +514,20 @@
   struct rx_list_t *next;
 } RX_LIST;
 
+typedef struct spam_list_t
+{
+  REGEXP *rx;
+  int     nmatch;
+  char   *template;
+  struct spam_list_t *next;
+} SPAM_LIST;
+
 #define mutt_new_list() safe_calloc (1, sizeof (LIST))
 #define mutt_new_rx_list() safe_calloc (1, sizeof (RX_LIST))
+#define mutt_new_spam_list() safe_calloc (1, sizeof (SPAM_LIST))
 void mutt_free_list (LIST **);
 void mutt_free_rx_list (RX_LIST **);
+void mutt_free_spam_list (SPAM_LIST **);
 int mutt_matches_ignore (const char *, LIST *);
 
 /* add an element to a list */
@@ -550,6 +562,7 @@
   char *supersedes;
   char *date;
   char *x_label;
+  BUFFER *spam;
   LIST *references;            /* message references (in reverse order) */
   LIST *in_reply_to;           /* in-reply-to header content */
   LIST *userhdrs;              /* user defined headers */
diff -Pur mutt-1.5.6-dist/muttlib.c mutt-1.5.6/muttlib.c
--- mutt-1.5.6-dist/muttlib.c   Sun Feb  1 11:15:17 2004
+++ mutt-1.5.6/muttlib.c        Mon Feb  9 00:46:46 2004
@@ -1283,6 +1283,60 @@
     sleep (s);
 }
 
+/*
+ * Creates and initializes a BUFFER*. If passed an existing BUFFER*,
+ * just initializes. Frees anything already in the buffer.
+ *
+ * Disregards the 'destroy' flag, which seems reserved for caller.
+ * This is bad, but there's no apparent protocol for it.
+ */
+BUFFER * mutt_buffer_init(BUFFER *b)
+{
+  if (!b)
+  {
+    b = malloc(sizeof(BUFFER));
+    if (!b)
+      return NULL;
+  }
+  else
+  {
+    safe_free(&b->data);
+  }
+  memset(b, 0, sizeof(BUFFER));
+  return b;
+}
+
+/*
+ * Creates and initializes a BUFFER*. If passed an existing BUFFER*,
+ * just initializes. Frees anything already in the buffer. Copies in
+ * the seed string.
+ *
+ * Disregards the 'destroy' flag, which seems reserved for caller.
+ * This is bad, but there's no apparent protocol for it.
+ */
+BUFFER * mutt_buffer_from(BUFFER *b, char *seed)
+{
+  int n;
+
+  if (!seed)
+    return NULL;
+
+  b = mutt_buffer_init(b);
+  b->data = strdup(seed);
+  b->dsize = strlen(seed);
+  b->dptr = b->data + b->dsize;
+  return b;
+}
+
+void mutt_buffer_free(BUFFER **b)
+{
+  if (!b || !*b)
+    return;
+  if ((*b)->data)
+    safe_free(&((*b)->data));
+  safe_free(b);
+}
+
 void mutt_buffer_addstr (BUFFER* buf, const char* s)
 {
   mutt_buffer_add (buf, s, mutt_strlen (s));
@@ -1379,6 +1433,21 @@
   }
 }
 
+void mutt_free_spam_list (SPAM_LIST **list)
+{
+  SPAM_LIST *p;
+  
+  if (!list) return;
+  while (*list)
+  {
+    p = *list;
+    *list = (*list)->next;
+    mutt_free_regexp (&p->rx);
+    safe_free(&p->template);
+    FREE (&p);
+  }
+}
+
 int mutt_match_rx_list (const char *s, RX_LIST *l)
 {
   if (!s)  return 0;
@@ -1388,6 +1457,57 @@
     if (regexec (l->rx->rx, s, (size_t) 0, (regmatch_t *) 0, (int) 0) == 0)
     {
       dprint (5, (debugfile, "mutt_match_rx_list: %s matches %s\n", s, l->rx->pattern));
+      return 1;
+    }
+  }
+
+  return 0;
+}
+
+int mutt_match_spam_list (const char *s, SPAM_LIST *l, char *text, int x)
+{
+  static regmatch_t *pmatch = NULL;
+  static int nmatch = 0;
+  int i, n, tlen;
+  char *p;
+
+  if (!s)  return 0;
+
+  tlen = 0;
+
+  for (; l; l = l->next)
+  {
+    /* If this pattern needs more matches, expand pmatch. */
+    if (l->nmatch > nmatch)
+    {
+      safe_realloc ((void**) &pmatch, l->nmatch * sizeof(regmatch_t));
+      nmatch = l->nmatch;
+    }
+
+    /* Does this pattern match? */
+    if (regexec (l->rx->rx, s, (size_t) l->nmatch, (regmatch_t *) pmatch, (int) 0) == 0)
+    {
+      dprint (5, (debugfile, "mutt_match_spam_list: %s matches %s\n", s, l->rx->pattern));
+      dprint (5, (debugfile, "mutt_match_spam_list: %d subs\n", l->rx->rx->re_nsub));
+
+      /* Copy template into text, with substitutions. */
+      for (p = l->template; *p;)
+      {
+       if (*p == '%')
+       {
+         n = atoi(++p);                        /* find pmatch index */
+         while (isdigit(*p))
+           ++p;                                /* skip subst token */
+         for (i = pmatch[n].rm_so; (i < pmatch[n].rm_eo) && (tlen < x - 1); i++)
+           text[tlen++] = s[i];
+       }
+       else if (tlen < x - 1)
+         text[tlen++] = *p++;
+       else
+         ++p;                                  /* out of room: drop the rest */
+      }
+      text[tlen] = '\0';
+      dprint (5, (debugfile, "mutt_match_spam_list: \"%s\"\n", text));
       return 1;
     }
   }
diff -Pur mutt-1.5.6-dist/parse.c mutt-1.5.6/parse.c
--- mutt-1.5.6-dist/parse.c     Wed Nov  5 03:41:33 2003
+++ mutt-1.5.6/parse.c  Mon Feb  9 00:45:51 2004
@@ -1267,6 +1267,7 @@
   long loc;
   int matched;
   size_t linelen = LONG_STRING;
+  char buf[LONG_STRING+1];
 
   if (hdr)
   {
@@ -1308,6 +1309,49 @@
 
       fseek (f, loc, 0);
       break; /* end of header */
+    }
+
+    *buf = '\0';
+
+    if (mutt_match_spam_list(line, SpamList, buf, sizeof(buf)))
+    {
+      if (!mutt_match_rx_list(line, NoSpamList))
+      {
+
+       /* if spam tag already exists, figure out how to amend it */
+       if (e->spam && *buf)
+       {
+         /* If SpamSep defined, append with separator */
+         if (SpamSep)
+         {
+           mutt_buffer_addstr(e->spam, SpamSep);
+           mutt_buffer_addstr(e->spam, buf);
+         }
+
+         /* else overwrite */
+         else
+         {
+           e->spam->dptr = e->spam->data;
+           *e->spam->dptr = '\0';
+           mutt_buffer_addstr(e->spam, buf);
+         }
+       }
+
+       /* spam tag is new, and match expr is non-empty; copy */
+       else if (!e->spam && *buf)
+       {
+         e->spam = mutt_buffer_from(NULL, buf);
+       }
+
+       /* match expr is empty; plug in null string if no existing tag */
+       else if (!e->spam)
+       {
+         e->spam = mutt_buffer_from(NULL, "");
+       }
+
+       if (e->spam && e->spam->data)
+          dprint(5, (debugfile, "p822: spam = %s\n", e->spam->data));
+      }
     }
 
     *p = 0;
diff -Pur mutt-1.5.6-dist/pattern.c mutt-1.5.6/pattern.c
--- mutt-1.5.6-dist/pattern.c   Wed Nov  5 03:41:33 2003
+++ mutt-1.5.6/pattern.c        Sat Feb  7 22:38:57 2004
@@ -58,6 +58,7 @@
   { 'g', M_CRYPT_SIGN,                 0,              NULL },
   { 'G', M_CRYPT_ENCRYPT,      0,              NULL },
   { 'h', M_HEADER,             M_FULL_MSG,     eat_regexp },
+  { 'H', M_HORMEL,             0,              eat_regexp },
   { 'i', M_ID,                 0,              eat_regexp },
   { 'k', M_PGP_KEY,            0,              NULL },
   { 'L', M_ADDRESS,            0,              eat_regexp },
@@ -1045,6 +1046,8 @@
     return (pat->not ^ ((h->security & APPLICATION_PGP) && (h->security & PGPKEY)));
     case M_XLABEL:
       return (pat->not ^ (h->env->x_label && regexec (pat->rx, h->env->x_label, 0, NULL, 0) == 0));
+    case M_HORMEL:
+      return (pat->not ^ (h->env->spam && h->env->spam->data && regexec (pat->rx, h->env->spam->data, 0, NULL, 0) == 0));
     case M_DUPLICATED:
       return (pat->not ^ (h->thread && h->thread->duplicate_thread));
   }
diff -Pur mutt-1.5.6-dist/protos.h mutt-1.5.6/protos.h
--- mutt-1.5.6-dist/protos.h    Sun Feb  1 11:15:17 2004
+++ mutt-1.5.6/protos.h Sun Feb  8 18:42:55 2004
@@ -32,6 +32,9 @@
        HEADER *, format_flag);
 
 int mutt_extract_token (BUFFER *, BUFFER *, int);
+BUFFER * mutt_buffer_init (BUFFER *);
+BUFFER * mutt_buffer_from (BUFFER *, char *);
+void mutt_buffer_free(BUFFER **);
 void mutt_buffer_add (BUFFER*, const char*, size_t);
 void mutt_buffer_addstr (BUFFER*, const char*);
 void mutt_buffer_addch (BUFFER*, char);
@@ -291,6 +294,7 @@
 int mutt_is_valid_mailbox (const char *);
 int mutt_lookup_mime_type (BODY *, const char *);
 int mutt_match_rx_list (const char *, RX_LIST *);
+int mutt_match_spam_list (const char *, SPAM_LIST *, char *, int);
 int mutt_messages_in_thread (CONTEXT *, HEADER *, int);
 int mutt_multi_choice (char *prompt, char *letters);
 int mutt_needs_mailcap (BODY *);