Re: [PATCH] generic spam detection

To: mutt-dev@xxxxxxxx
Subject: Re: [PATCH] generic spam detection
From: David Champion <dgc@xxxxxxxxxxxx>
Date: Mon, 12 Jul 2004 02:03:23 -0500
In-reply-to: <20040412204852.GT5807@xxxxxxxxxxxxxxxxxxxxxxxxxxxx>
List-unsubscribe: <mailto:mutt-dev-request@mutt.org?body=unsubscribe>
Mail-followup-to: mutt-dev@xxxxxxxx
References: <20040210071345.GA23743@xxxxxxxxxxxxxxxxx> <20040412204852.GT5807@xxxxxxxxxxxxxxxxxxxxxxxxxxxx>
Sender: owner-mutt-dev@xxxxxxxx
User-agent: Mutt/1.5.6i

[I posted a patch implementing generalized support for one or more
external spam filters.]

* On 2004.04.12, in <20040412204852.GT5807@xxxxxxxxxxxxxxxxxxxxxxxxxxxx>,
*       "Thomas Roessler" <roessler@xxxxxxxxxxxxxxxxxx> wrote:
> This indeed looks very appealing for CVS inclusion.  I'd be curious
> to hear about any experience people have with the patch.
> 
> Suggested tweaking:
> 
> - Add documentation.
> - Add both lexical and numerical sorting by spam tag.
> 
> (This would make your patch a useful generalization of Rogier
> Wolff's proposed spam-score sorting patch.)

I've added several paragraphs of documentation to manual.sgml
(<sect1>Spam detection<label id="spam">). Muttrc.man has a reference to
the full manual. I've never gotten sgmltools to work on my system, so if
someone could check these I'd appreciate it.

I added sorting in a similar vein to that posted by Rogier Wolff.
Thomas, I provided only one sort function. It gives priority to numeric
sorting if the sort keys begin with numbers, but it falls back on
lexical sort at the point where any leading numbers terminate. (If
spam tags don't begin with numbers, there's no difference from lexical
sorting; it's like "sort -n" this way.) If that doesn't provide what
you had in mind, I'd be glad to revisit it, but I think it's a good
compromise for a hypothetical case where a user has two spam filters,
one of which provides numeric spam tags and one of which does not.
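
A minimal sketch of that ordering, assuming two tags are compared as
plain C strings. spam_tag_cmp is a hypothetical helper written for
illustration; it is not the compare_spam function in the patch below,
which additionally applies SORTCODE and the auxiliary sort, and which
compares from the end of any leading number rather than from the start
of the tag:

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not part of the patch): orders two spam tags
 * numerically when both begin with a number, lexically otherwise.
 * Returns <0, 0 or >0, like strcmp. */
int spam_tag_cmp(const char *a, const char *b)
{
  char *aend, *bend;
  unsigned long an = strtoul(a, &aend, 10);
  unsigned long bn = strtoul(b, &bend, 10);

  /* strtoul leaves the end pointer at the start when no digits were
   * consumed; in that case at least one tag is non-numeric. */
  if (aend == a || bend == b)
    return strcmp(a, b);

  /* Both tags begin with a number: numeric order takes priority. */
  if (an != bn)
    return (an < bn) ? -1 : 1;

  /* Equal numbers: compare whatever follows them. */
  return strcmp(aend, bend);
}
```

With this sketch, "9/PM" sorts before "90+/SA" (9 < 90), while
"90+/DCC" and "90+/SA" tie numerically and are ordered by the text
after the number.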

The remainder of the patch is untouched. See manual.sgml for a reminder.
:)

There is one unrelated change included here, a typo fix ("unified
ressource locator") in manual.sgml.head. There's also one related but
unnecessary change included. In sort.h, SORT_* macros were aligned such
that the low four bits of a byte were reserved for sort methods, and the
high four bits reserved for sort flags (last and reverse). SORT_SPAM
becomes sort method 14, leaving room only for one more sort method in
the future. However, the globals Sort and SortAux are shorts. To make
room for future expansion, I realigned these flags to bits 8 and 9. If
that's undesirable, the fix is indeed as simple as it appears to be. :)
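
To illustrate the realigned layout, here is a short sketch. The four
macro values are copied from the patched sort.h; sort_method,
sort_is_reverse and sort_is_last are hypothetical helpers that do not
exist in mutt and are only there to show how the bits separate:

```c
/* Values as in the patched sort.h. */
#define SORT_SPAM     14
#define SORT_MASK     0xff      /* low 8 bits hold the sort method */
#define SORT_REVERSE  (1 << 8)  /* flag, formerly bit 4 */
#define SORT_LAST     (1 << 9)  /* flag, formerly bit 5 */

/* Hypothetical helper: strip the flag bits, leaving the method. */
int sort_method(short sort)
{
  return sort & SORT_MASK;
}

/* Hypothetical helper: nonzero if the reverse flag is set. */
int sort_is_reverse(short sort)
{
  return (sort & SORT_REVERSE) != 0;
}

/* Hypothetical helper: nonzero if the "last" flag is set. */
int sort_is_last(short sort)
{
  return (sort & SORT_LAST) != 0;
}
```

A value such as SORT_SPAM | SORT_REVERSE still fits comfortably in a
short, and the mask now leaves room for method numbers up to 255.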

Feedback: I've still not heard from anyone using this patch. If you are,
please let us know how it's working out. For my part, although I don't
use its features heavily, it hasn't caused any trouble since I began
using it in April.

-- 
 -D.    dgc@xxxxxxxxxxxx                                  NSIT::ENSS
        No money,  no book.  No book,  no study.  No study, no pass.
        No pass, no graduate. No graduate, no job. No job, no money.
             T h e   U n i v e r s i t y   o f   C h i c a g o

--- mutt-1.5.6/PATCHES~ never
+++ mutt-1.5.6/PATCHES  Mon Feb  9 21:07:37 CST 2004
@@ -1,0 +1 @@
+patch-1.5.6.dgc.hormel.2
diff -ur mutt-1.5.6-base/commands.c mutt-1.5.6-hormel.2/commands.c
--- mutt-1.5.6-base/commands.c  Sun Feb  1 11:10:57 2004
+++ mutt-1.5.6-hormel.2/commands.c      Sun Jul 11 22:19:55 2004
@@ -501,9 +501,9 @@
   int method = Sort; /* save the current method in case of abort */
 
   switch (mutt_multi_choice (reverse ?
-                            _("Rev-Sort (d)ate/(f)rm/(r)ecv/(s)ubj/t(o)/(t)hread/(u)nsort/si(z)e/s(c)ore?: ") :
-                            _("Sort (d)ate/(f)rm/(r)ecv/(s)ubj/t(o)/(t)hread/(u)nsort/si(z)e/s(c)ore?: "),
-                            _("dfrsotuzc")))
+                            _("Rev-Sort (d)ate/(f)rm/(r)ecv/(s)ubj/t(o)/(t)hread/(u)nsort/si(z)e/s(c)ore/s(p)am?: ") :
+                            _("Sort (d)ate/(f)rm/(r)ecv/(s)ubj/t(o)/(t)hread/(u)nsort/si(z)e/s(c)ore/s(p)am?: "),
+                            _("dfrsotuzcp")))
   {
   case -1: /* abort - don't resort */
     return -1;
@@ -542,6 +542,10 @@
   
   case 9: /* s(c)ore */ 
     Sort = SORT_SCORE;
+    break;
+
+  case 10: /* s(p)am */
+    Sort = SORT_SPAM;
     break;
   }
   if (reverse)
diff -ur mutt-1.5.6-base/doc/manual.sgml.head mutt-1.5.6-hormel.2/doc/manual.sgml.head
--- mutt-1.5.6-base/doc/manual.sgml.head        Sun Feb  1 11:49:53 2004
+++ mutt-1.5.6-hormel.2/doc/manual.sgml.head    Mon Jul 12 01:08:41 2004
@@ -1492,6 +1492,97 @@
 removed.  The pattern ``*'' is a special token which means to clear the list
 of all score entries.
 
+<sect1>Spam detection<label id="spam">
+<p>
+Usage: <tt/spam/ <em/pattern/ <em/format/
+Usage: <tt/nospam/ <em/pattern/
+
+Mutt has generalized support for external spam-scoring filters.
+By defining your spam patterns with the <tt/spam/ and <tt/nospam/
+commands, you can <em/limit/, <em/search/, and <em/sort/ your
+mail based on its spam attributes, as determined by the external
+filter. You also can display the spam attributes in your index
+display using the <tt/%H/ selector in the <ref id="index_format"
+name="&dollar;index&lowbar;format"> variable. (Tip: try <tt/%?H?[%H] ?/
+to display spam tags only when they are defined for a given message.)
+
+Your first step is to define your external filter's spam patterns using
+the <tt/spam/ command. <em/pattern/ should be a regular expression
+that matches a header in a mail message. If any message in the mailbox
+matches this regular expression, it will receive a ``spam tag'' or
+``spam attribute'' (unless it also matches a <tt/nospam/ pattern -- see
+below.) The appearance of this attribute is entirely up to you, and is
+governed by the <em/format/ parameter. <em/format/ can be any static
+text, but it also can include back-references from the <em/pattern/
+expression. (A regular expression ``back-reference'' refers to a
+sub-expression contained within parentheses.) <tt/%1/ is replaced with
+the first back-reference in the regex, <tt/%2/ with the second, etc.
+
+If you're using multiple spam filters, a message can have more than
+one spam-related header. You can define <tt/spam/ patterns for each
+filter you use. If a message matches two or more of these patterns, and
+the &dollar;spam&lowbar;separator variable is set to a string, then the
+message's spam tag will consist of all the <em/format/ strings joined
+together, with the value of &dollar;spam&lowbar;separator separating
+them.
+
+For example, suppose I use DCC, SpamAssassin, and PureMessage. I might
+define these spam settings:
+<tscreen><verb>
+spam "X-DCC-.*-Metrics:.*(....)=many"         "90+/DCC-%1"
+spam "X-Spam-Status: Yes"                     "90+/SA"
+spam "X-PerlMX-Spam: .*Probability=([0-9]+)%" "%1/PM"
+set spam_separator=", "
+</verb></tscreen>
+
+If I then received a message that DCC registered with ``many'' hits
+under the ``Fuz2'' checksum, and that PureMessage registered with a
+97% probability of being spam, that message's spam tag would read
+<tt>90+/DCC-Fuz2, 97/PM</tt>. (The four characters before ``=many'' in a
+DCC report indicate the checksum used -- in this case, ``Fuz2''.)
+
+If the &dollar;spam&lowbar;separator variable is unset, then each
+spam pattern match supersedes the previous one. Instead of getting
+joined <em/format/ strings, you'll get only the last one to match.
+
+The spam tag is what will be displayed in the index when you use
+<tt/%H/ in the <tt/&dollar;index&lowbar;format/ variable. It's also the
+string that the <tt/~H/ pattern-matching expression matches against for
+<em/search/ and <em/limit/ functions. And it's what sorting by spam
+attribute will use as a sort key.
+
+That's a pretty complicated example, and most people's actual
+environments will have only one spam filter. The simpler your
+configuration, the more effective mutt can be, especially when it comes
+to sorting.
+
+Generally, when you sort by spam tag, mutt will sort <em/lexically/ --
+that is, by ordering strings alphanumerically. However, if a spam tag
+begins with a number, mutt will sort numerically first, and lexically
+only when two numbers are equal in value. (This is like UNIX's
+<tt/sort -n/.) A message with no spam attributes at all -- that is, one
+that didn't match <em/any/ of your <tt/spam/ patterns -- is sorted at
+lowest priority. Numbers are sorted next, beginning with 0 and ranging
+upward. Finally, non-numeric strings are sorted, with ``a'' taking lower
+priority than ``z''. Clearly, in general, sorting by spam tags is most
+effective when you can coerce your filter to give you a raw number. But
+in case you can't, mutt can still do something useful.
+
+Finally, the <tt/nospam/ command can be used to write exceptions to
+<tt/spam/ patterns. If a header pattern matches something in a <tt/spam/
+command, but you nonetheless do not want it to receive a spam tag,
+you can list a more precise pattern under a <tt/nospam/ command.
+
+You can have as many <tt/spam/ or <tt/nospam/ commands as you like.
+You can even do your own primitive spam detection within mutt -- for
+example, if you consider all mail from MAILER-DAEMON to be spam, you can
+use a <tt/spam/ command like this:
+
+<tscreen><verb>
+spam "^From: .*MAILER-DAEMON"       "999"
+</verb></tscreen>
+
+
 <sect1>Setting variables<label id="set">
 <p>
 Usage: <tt/set/ &lsqb;no|inv&rsqb;<em/variable/&lsqb;=<em/value/&rsqb; &lsqb; <em/variable/ ... &rsqb;<newline>
@@ -1759,6 +1850,7 @@
 ~f USER         messages originating from USER
 ~g              cryptographically signed messages
 ~G              cryptographically encrypted messages
+~H EXPR         messages with a spam attribute matching EXPR
 ~h EXPR         messages which contain EXPR in the message header
 ~k             message contains PGP key material
 ~i ID           message which match ID in the ``Message-ID'' field
@@ -2390,7 +2482,7 @@
 
 <sect1>Start a WWW Browser on URLs (EXTERNAL)<label id="urlview">
 <p>
-If a message contains URLs (<em/unified ressource locator/ = address in the
+If a message contains URLs (<em/uniform resource locator/ = address in the
 WWW space like <em>http://www.mutt.org/</em>), it is efficient to get
 a menu with all the URLs and start a WWW browser on one of them.  This
 functionality is provided by the external urlview program which can be
@@ -3053,6 +3145,10 @@
 <tt><ref id="set" name="unset"></tt> <em/variable/ &lsqb;<em/variable/ ... &rsqb;
 <item>
 <tt><ref id="source" name="source"></tt> <em/filename/
+<item>
+<tt><ref id="spam" name="spam"></tt> <em/pattern/ <em/format/
+<item>
+<tt><ref id="spam" name="nospam"></tt> <em/pattern/
 <item>
 <tt><ref id="lists" name="subscribe"></tt> <em/address/ &lsqb; <em/address/ ... &rsqb; 
 <item>
diff -ur mutt-1.5.6-base/doc/muttrc.man.head mutt-1.5.6-hormel.2/doc/muttrc.man.head
--- mutt-1.5.6-base/doc/muttrc.man.head Sun Feb  1 11:15:18 2004
+++ mutt-1.5.6-hormel.2/doc/muttrc.man.head     Mon Jul 12 01:15:49 2004
@@ -336,6 +336,15 @@
 \fBsource\fP \fIfilename\fP
 The given file will be evaluated as a configuration file.
 .TP
+.nf
+\fBspam\fP \fIpattern\fP \fIformat\fP
+\fBnospam\fP \fIpattern\fP
+.fi
+These commands define spam-detection patterns from external spam
+filters, so that mutt can sort, limit, and search on
+``spam tags'' or ``spam attributes'', or display them
+in the index. See the Mutt manual for details.
+.TP
 \fBunhook\fP [\fB * \fP | \fIhook-type\fP ]
 This command will remove all hooks of a given type, or all hooks
 when \(lq\fB*\fP\(rq is used as an argument.  \fIhook-type\fP
@@ -384,6 +393,7 @@
 ~f \fIEXPR\fP  messages originating from \fIEXPR\fP
 ~g     PGP signed messages
 ~G     PGP encrypted messages
+~H \fIEXPR\fP  messages with spam tags matching \fIEXPR\fP
 ~h \fIEXPR\fP  messages which contain \fIEXPR\fP in the message header
 ~k     message contains PGP key material
 ~i \fIEXPR\fP  message which match \fIEXPR\fP in the \(lqMessage-ID\(rq field
diff -ur mutt-1.5.6-base/globals.h mutt-1.5.6-hormel.2/globals.h
--- mutt-1.5.6-base/globals.h   Sun Feb  1 11:15:17 2004
+++ mutt-1.5.6-hormel.2/globals.h       Sun Jul 11 02:16:57 2004
@@ -102,6 +102,7 @@
 WHERE char *Signature;
 WHERE char *SimpleSearch;
 WHERE char *Spoolfile;
+WHERE char *SpamSep;
 #if defined(USE_SSL) || defined(USE_NSS)
 WHERE char *SslCertFile INITVAL (NULL);
 WHERE char *SslEntropyFile INITVAL (NULL);
@@ -125,6 +126,8 @@
 WHERE RX_LIST *Alternates INITVAL(0);
 WHERE RX_LIST *MailLists INITVAL(0);
 WHERE RX_LIST *SubscribedLists INITVAL(0);
+WHERE SPAM_LIST *SpamList INITVAL(0);
+WHERE RX_LIST *NoSpamList INITVAL(0);
 
 /* bit vector for boolean variables */
 #ifdef MAIN_C
diff -ur mutt-1.5.6-base/hdrline.c mutt-1.5.6-hormel.2/hdrline.c
--- mutt-1.5.6-base/hdrline.c   Sun Feb  1 11:15:17 2004
+++ mutt-1.5.6-hormel.2/hdrline.c       Sun Jul 11 02:16:57 2004
@@ -433,6 +433,18 @@
         optional = 0;
       break;
 
+    case 'H':
+      /* (Hormel) spam score */
+      if (optional)
+       optional = hdr->env->spam ? 1 : 0;
+
+       if (hdr->env->spam)
+         mutt_format_s (dest, destlen, prefix, NONULL (hdr->env->spam->data));
+       else
+         mutt_format_s (dest, destlen, prefix, "");
+
+      break;
+
     case 'i':
       mutt_format_s (dest, destlen, prefix, hdr->env->message_id ? hdr->env->message_id : "<no.id>");
       break;
diff -ur mutt-1.5.6-base/init.c mutt-1.5.6-hormel.2/init.c
--- mutt-1.5.6-base/init.c      Sun Feb  1 12:21:00 2004
+++ mutt-1.5.6-hormel.2/init.c  Sun Jul 11 02:16:57 2004
@@ -366,6 +366,73 @@
 }
 
 
+static int add_to_spam_list (SPAM_LIST **list, const char *pat, const char *templ, BUFFER *err)
+{
+  SPAM_LIST *t, *last = NULL;
+  REGEXP *rx;
+  int n;
+  const char *p;
+
+  if (!pat || !*pat || !templ)
+    return 0;
+
+  if (!(rx = mutt_compile_regexp (pat, REG_ICASE)))
+  {
+    snprintf (err->data, err->dsize, _("Bad regexp: %s"), pat);
+    return -1;
+  }
+
+  /* check to make sure the item is not already on this list */
+  for (last = *list; last; last = last->next)
+  {
+    if (ascii_strcasecmp (rx->pattern, last->rx->pattern) == 0)
+    {
+      /* already on the list, so just ignore it */
+      last = NULL;
+      break;
+    }
+    if (!last->next)
+      break;
+  }
+
+  if (!*list || last)
+  {
+    t = mutt_new_spam_list();
+    t->rx = rx;
+    t->template = strdup(templ);
+
+    /* find highest match number in template string */
+    t->nmatch = 0;
+    for (p = templ; *p;)
+    {
+      if (*p == '%')
+      {
+       n = atoi(++p);
+       if (n > t->nmatch)
+         t->nmatch = n;
+       while (*p && isdigit(*p))
+         ++p;
+      }
+      else
+       ++p;
+    }
+    t->nmatch++;               /* match 0 is always the whole expr */
+
+    if (last)
+    {
+      last->next = t;
+      last = last->next;
+    }
+    else
+      *list = last = t;
+  }
+  else /* duplicate */
+    mutt_free_regexp (&rx);
+
+  return 0;
+}
+
+
 static void remove_from_list (LIST **l, const char *str)
 {
   LIST *p, *last = NULL;
@@ -500,6 +567,35 @@
     remove_from_rx_list ((RX_LIST **) data, buf->data);
   }
   while (MoreArgs (s));
+  
+  return 0;
+}
+
+static int parse_spam_list (BUFFER *buf, BUFFER *s, unsigned long data, BUFFER *err)
+{
+  BUFFER templ;
+
+  memset(&templ, 0, sizeof(templ));
+
+  if (!MoreArgs(s))
+  {
+    strfcpy(err->data, _("spam: no matching pattern"), err->dsize);
+    return -1;
+  }
+  mutt_extract_token (buf, s, 0);
+
+  if (MoreArgs(s))
+  {
+    mutt_extract_token (&templ, s, 0);
+  }
+  else
+  {
+    templ.data = strdup("");
+    templ.dsize = 0;
+  }
+
+  if (add_to_spam_list ((SPAM_LIST **) data, buf->data, templ.data, err) != 0)
+      return -1;
   
   return 0;
 }
diff -ur mutt-1.5.6-base/init.h mutt-1.5.6-hormel.2/init.h
--- mutt-1.5.6-base/init.h      Sun Feb  1 11:15:17 2004
+++ mutt-1.5.6-hormel.2/init.h  Mon Jul 12 00:12:03 2004
@@ -901,6 +901,7 @@
   ** .dt %E .dd number of messages in current thread
   ** .dt %f .dd entire From: line (address + real name)
   ** .dt %F .dd author name, or recipient name if the message is from you
+  ** .dt %H .dd spam attribute(s) of this message
   ** .dt %i .dd message-id of the current message
   ** .dt %l .dd number of lines in the message (does not work with maildir,
   **            mh, and possibly IMAP folders)
@@ -2314,6 +2315,7 @@
   ** .  mailbox-order (unsorted)
   ** .  score
   ** .  size
+  ** .  spam
   ** .  subject
   ** .  threads
   ** .  to
@@ -2379,6 +2381,15 @@
   ** the message whether or not this is the case, as long as the
   ** non-``$$reply_regexp'' parts of both messages are identical.
   */
+  { "spam_separator",   DT_STR, R_NONE, UL &SpamSep, UL 0 },
+  /*
+  ** .pp
+  ** ``$spam_separator'' controls what happens when multiple spam headers
+  ** are matched: if unset, each successive header will overwrite any
+  ** previous match's value for the spam label. If set, each successive
+  ** match will append to the previous, using ``$spam_separator'' as a
+  ** separator.
+  */
   { "spoolfile",       DT_PATH, R_NONE, UL &Spoolfile, 0 },
   /*
   ** .pp
@@ -2678,6 +2689,7 @@
   { "threads",         SORT_THREADS },
   { "to",              SORT_TO },
   { "score",           SORT_SCORE },
+  { "spam",            SORT_SPAM },
   { NULL,              0 }
 };
 
@@ -2696,6 +2708,7 @@
                                         */
   { "to",              SORT_TO },
   { "score",           SORT_SCORE },
+  { "spam",            SORT_SPAM },
   { NULL,              0 }
 };
   
@@ -2728,6 +2741,7 @@
 
 static int parse_list (BUFFER *, BUFFER *, unsigned long, BUFFER *);
 static int parse_rx_list (BUFFER *, BUFFER *, unsigned long, BUFFER *);
+static int parse_spam_list (BUFFER *, BUFFER *, unsigned long, BUFFER *);
 static int parse_unlist (BUFFER *, BUFFER *, unsigned long, BUFFER *);
 static int parse_rx_unlist (BUFFER *, BUFFER *, unsigned long, BUFFER *);
 
@@ -2793,6 +2807,8 @@
   { "send-hook",       mutt_parse_hook,        M_SENDHOOK },
   { "set",             parse_set,              0 },
   { "source",          parse_source,           0 },
+  { "spam",            parse_spam_list,        UL &SpamList },
+  { "nospam",          parse_rx_list,          UL &NoSpamList },
   { "subscribe",       parse_subscribe,        0 },
   { "toggle",          parse_set,              M_SET_INV },
   { "unalias",         parse_unalias,          0 },
diff -ur mutt-1.5.6-base/mutt.h mutt-1.5.6-hormel.2/mutt.h
--- mutt-1.5.6-base/mutt.h      Sun Feb  1 11:15:17 2004
+++ mutt-1.5.6-hormel.2/mutt.h  Sun Jul 11 02:16:57 2004
@@ -220,6 +220,7 @@
   M_ID,
   M_BODY,
   M_HEADER,
+  M_HORMEL,
   M_WHOLE_MSG,
   M_SENDER,
   M_MESSAGE,
@@ -405,6 +406,7 @@
   OPTSIGDASHES,
   OPTSIGONTOP,
   OPTSORTRE,
+  OPTSPAMSEP,
   OPTSTATUSONTOP,
   OPTSTRICTTHREADS,
   OPTSUSPEND,
@@ -512,10 +514,20 @@
   struct rx_list_t *next;
 } RX_LIST;
 
+typedef struct spam_list_t
+{
+  REGEXP *rx;
+  int     nmatch;
+  char   *template;
+  struct spam_list_t *next;
+} SPAM_LIST;
+
 #define mutt_new_list() safe_calloc (1, sizeof (LIST))
 #define mutt_new_rx_list() safe_calloc (1, sizeof (RX_LIST))
+#define mutt_new_spam_list() safe_calloc (1, sizeof (SPAM_LIST))
 void mutt_free_list (LIST **);
 void mutt_free_rx_list (RX_LIST **);
+void mutt_free_spam_list (SPAM_LIST **);
 int mutt_matches_ignore (const char *, LIST *);
 
 /* add an element to a list */
@@ -550,6 +562,7 @@
   char *supersedes;
   char *date;
   char *x_label;
+  BUFFER *spam;
   LIST *references;            /* message references (in reverse order) */
   LIST *in_reply_to;           /* in-reply-to header content */
   LIST *userhdrs;              /* user defined headers */
diff -ur mutt-1.5.6-base/muttlib.c mutt-1.5.6-hormel.2/muttlib.c
--- mutt-1.5.6-base/muttlib.c   Sun Feb  1 11:15:17 2004
+++ mutt-1.5.6-hormel.2/muttlib.c       Sun Jul 11 02:16:57 2004
@@ -1283,6 +1283,60 @@
     sleep (s);
 }
 
+/*
+ * Creates and initializes a BUFFER*. If passed an existing BUFFER*,
+ * just initializes. Frees anything already in the buffer.
+ *
+ * Disregards the 'destroy' flag, which seems reserved for caller.
+ * This is bad, but there's no apparent protocol for it.
+ */
+BUFFER * mutt_buffer_init(BUFFER *b)
+{
+  if (!b)
+  {
+    b = malloc(sizeof(BUFFER));
+    if (!b)
+      return NULL;
+  }
+  else
+  {
+    safe_free(&b->data);
+  }
+  memset(b, 0, sizeof(BUFFER));
+  return b;
+}
+
+/*
+ * Creates and initializes a BUFFER*. If passed an existing BUFFER*,
+ * just initializes. Frees anything already in the buffer. Copies in
+ * the seed string.
+ *
+ * Disregards the 'destroy' flag, which seems reserved for caller.
+ * This is bad, but there's no apparent protocol for it.
+ */
+BUFFER * mutt_buffer_from(BUFFER *b, char *seed)
+{
+  int n;
+
+  if (!seed)
+    return NULL;
+
+  b = mutt_buffer_init(b);
+  b->data = strdup(seed);
+  b->dsize = strlen(seed);
+  b->dptr = b->data + b->dsize;
+  return b;
+}
+
+void mutt_buffer_free(BUFFER **b)
+{
+  if (!b || !*b)
+    return;
+  if ((*b)->data)
+    safe_free(&((*b)->data));
+  safe_free(b);
+}
+
 void mutt_buffer_addstr (BUFFER* buf, const char* s)
 {
   mutt_buffer_add (buf, s, mutt_strlen (s));
@@ -1379,6 +1433,21 @@
   }
 }
 
+void mutt_free_spam_list (SPAM_LIST **list)
+{
+  SPAM_LIST *p;
+  
+  if (!list) return;
+  while (*list)
+  {
+    p = *list;
+    *list = (*list)->next;
+    mutt_free_regexp (&p->rx);
+    safe_free(&p->template);
+    FREE (&p);
+  }
+}
+
 int mutt_match_rx_list (const char *s, RX_LIST *l)
 {
   if (!s)  return 0;
@@ -1388,6 +1457,57 @@
     if (regexec (l->rx->rx, s, (size_t) 0, (regmatch_t *) 0, (int) 0) == 0)
     {
       dprint (5, (debugfile, "mutt_match_rx_list: %s matches %s\n", s, l->rx->pattern));
+      return 1;
+    }
+  }
+
+  return 0;
+}
+
+int mutt_match_spam_list (const char *s, SPAM_LIST *l, char *text, int x)
+{
+  static regmatch_t *pmatch = NULL;
+  static int nmatch = 0;
+  int i, n, tlen;
+  char *p;
+
+  if (!s)  return 0;
+
+  tlen = 0;
+
+  for (; l; l = l->next)
+  {
+    /* If this pattern needs more matches, expand pmatch. */
+    if (l->nmatch > nmatch)
+    {
+      safe_realloc ((void**) &pmatch, l->nmatch * sizeof(regmatch_t));
+      nmatch = l->nmatch;
+    }
+
+    /* Does this pattern match? */
+    if (regexec (l->rx->rx, s, (size_t) l->nmatch, (regmatch_t *) pmatch, (int) 0) == 0)
+    {
+      dprint (5, (debugfile, "mutt_match_spam_list: %s matches %s\n", s, l->rx->pattern));
+      dprint (5, (debugfile, "mutt_match_spam_list: %d subs\n", l->rx->rx->re_nsub));
+
+      /* Copy template into text, with substitutions. */
+      for (p = l->template; *p;)
+      {
+       if (*p == '%')
+       {
+         n = atoi(++p);                        /* find pmatch index */
+         while (isdigit(*p))
+           ++p;                                /* skip subst token */
+         for (i = pmatch[n].rm_so; (i < pmatch[n].rm_eo) && (tlen < x - 1); i++)
+           text[tlen++] = s[i];
+       }
+       else if (tlen < x - 1)
+         text[tlen++] = *p++;                  /* keep room for the NUL */
+       else
+         ++p;                                  /* text is full; drop the rest */
+      }
+      text[tlen] = '\0';
+      dprint (5, (debugfile, "mutt_match_spam_list: \"%s\"\n", text));
       return 1;
     }
   }
diff -ur mutt-1.5.6-base/parse.c mutt-1.5.6-hormel.2/parse.c
--- mutt-1.5.6-base/parse.c     Wed Nov  5 03:41:33 2003
+++ mutt-1.5.6-hormel.2/parse.c Sun Jul 11 02:16:57 2004
@@ -1267,6 +1267,7 @@
   long loc;
   int matched;
   size_t linelen = LONG_STRING;
+  char buf[LONG_STRING+1];
 
   if (hdr)
   {
@@ -1308,6 +1309,49 @@
 
       fseek (f, loc, 0);
       break; /* end of header */
+    }
+
+    *buf = '\0';
+
+    if (mutt_match_spam_list(line, SpamList, buf, sizeof(buf)))
+    {
+      if (!mutt_match_rx_list(line, NoSpamList))
+      {
+
+       /* if spam tag already exists, figure out how to amend it */
+       if (e->spam && *buf)
+       {
+         /* If SpamSep defined, append with separator */
+         if (SpamSep)
+         {
+           mutt_buffer_addstr(e->spam, SpamSep);
+           mutt_buffer_addstr(e->spam, buf);
+         }
+
+         /* else overwrite */
+         else
+         {
+           e->spam->dptr = e->spam->data;
+           *e->spam->dptr = '\0';
+           mutt_buffer_addstr(e->spam, buf);
+         }
+       }
+
+       /* spam tag is new, and match expr is non-empty; copy */
+       else if (!e->spam && *buf)
+       {
+         e->spam = mutt_buffer_from(NULL, buf);
+       }
+
+       /* match expr is empty; plug in null string if no existing tag */
+       else if (!e->spam)
+       {
+         e->spam = mutt_buffer_from(NULL, "");
+       }
+
+       if (e->spam && e->spam->data)
+          dprint(5, (debugfile, "p822: spam = %s\n", e->spam->data));
+      }
     }
 
     *p = 0;
diff -ur mutt-1.5.6-base/pattern.c mutt-1.5.6-hormel.2/pattern.c
--- mutt-1.5.6-base/pattern.c   Wed Nov  5 03:41:33 2003
+++ mutt-1.5.6-hormel.2/pattern.c       Sun Jul 11 02:16:57 2004
@@ -58,6 +58,7 @@
   { 'g', M_CRYPT_SIGN,                 0,              NULL },
   { 'G', M_CRYPT_ENCRYPT,      0,              NULL },
   { 'h', M_HEADER,             M_FULL_MSG,     eat_regexp },
+  { 'H', M_HORMEL,             0,              eat_regexp },
   { 'i', M_ID,                 0,              eat_regexp },
   { 'k', M_PGP_KEY,            0,              NULL },
   { 'L', M_ADDRESS,            0,              eat_regexp },
@@ -1045,6 +1046,8 @@
      return (pat->not ^ ((h->security & APPLICATION_PGP) && (h->security & PGPKEY)));
     case M_XLABEL:
       return (pat->not ^ (h->env->x_label && regexec (pat->rx, h->env->x_label, 0, NULL, 0) == 0));
+    case M_HORMEL:
+      return (pat->not ^ (h->env->spam && h->env->spam->data && regexec (pat->rx, h->env->spam->data, 0, NULL, 0) == 0));
     case M_DUPLICATED:
       return (pat->not ^ (h->thread && h->thread->duplicate_thread));
   }
diff -ur mutt-1.5.6-base/protos.h mutt-1.5.6-hormel.2/protos.h
--- mutt-1.5.6-base/protos.h    Sun Feb  1 11:15:17 2004
+++ mutt-1.5.6-hormel.2/protos.h        Sun Jul 11 02:16:57 2004
@@ -32,6 +32,9 @@
        HEADER *, format_flag);
 
 int mutt_extract_token (BUFFER *, BUFFER *, int);
+BUFFER * mutt_buffer_init (BUFFER *);
+BUFFER * mutt_buffer_from (BUFFER *, char *);
+void mutt_buffer_free(BUFFER **);
 void mutt_buffer_add (BUFFER*, const char*, size_t);
 void mutt_buffer_addstr (BUFFER*, const char*);
 void mutt_buffer_addch (BUFFER*, char);
@@ -291,6 +294,7 @@
 int mutt_is_valid_mailbox (const char *);
 int mutt_lookup_mime_type (BODY *, const char *);
 int mutt_match_rx_list (const char *, RX_LIST *);
+int mutt_match_spam_list (const char *, SPAM_LIST *, char *, int);
 int mutt_messages_in_thread (CONTEXT *, HEADER *, int);
 int mutt_multi_choice (char *prompt, char *letters);
 int mutt_needs_mailcap (BODY *);
diff -ur mutt-1.5.6-base/sort.c mutt-1.5.6-hormel.2/sort.c
--- mutt-1.5.6-base/sort.c      Sun Feb  1 11:10:58 2004
+++ mutt-1.5.6-hormel.2/sort.c  Sun Jul 11 23:53:59 2004
@@ -149,6 +149,57 @@
   return (SORTCODE ((*ha)->index - (*hb)->index));
 }
 
+int compare_spam (const void *a, const void *b)
+{
+  HEADER **ppa = (HEADER **) a;
+  HEADER **ppb = (HEADER **) b;
+  char   *aptr, *bptr;
+  int     ahas, bhas;
+  int     result = 0;
+
+  /* Firstly, require spam attributes for both msgs */
+  /* to compare. Determine which msgs have one.     */
+  ahas = (*ppa)->env && (*ppa)->env->spam;
+  bhas = (*ppb)->env && (*ppb)->env->spam;
+
+  /* If one msg has spam attr but other does not, sort the one with first. */
+  if (ahas && !bhas)
+    return (SORTCODE(1));
+  if (!ahas && bhas)
+    return (SORTCODE(-1));
+
+  /* Else, if neither has a spam attr, presume equality. Fall back on aux. */
+  if (!ahas && !bhas)
+  {
+    AUXSORT(result, a, b);
+    return (SORTCODE(result));
+  }
+
+
+  /* Both have spam attrs. */
+
+  /* preliminary numeric examination */
+  result = (strtoul((*ppa)->env->spam->data, &aptr, 10) -
+            strtoul((*ppb)->env->spam->data, &bptr, 10));
+
+  /* If either aptr or bptr is equal to data, there is no numeric    */
+  /* value for that spam attribute. In this case, compare lexically. */
+  if ((aptr == (*ppa)->env->spam->data) || (bptr == (*ppb)->env->spam->data))
+    return (SORTCODE(strcmp(aptr, bptr)));
+
+  /* Otherwise, we have numeric value for both attrs. If these values */
+  /* are equal, then we first fall back upon string comparison, then  */
+  /* upon auxiliary sort.                                             */
+  if (result == 0)
+  {
+    result = strcmp(aptr, bptr);
+    if (result == 0)
+      AUXSORT(result, a, b);
+  }
+
+  return (SORTCODE(result));
+}
+
 sort_t *mutt_get_sort_func (int method)
 {
   switch (method & SORT_MASK)
@@ -169,6 +220,8 @@
       return (compare_to);
     case SORT_SCORE:
       return (compare_score);
+    case SORT_SPAM:
+      return (compare_spam);
     default:
       return (NULL);
   }
diff -ur mutt-1.5.6-base/sort.h mutt-1.5.6-hormel.2/sort.h
--- mutt-1.5.6-base/sort.h      Mon Jan  6 04:25:35 2003
+++ mutt-1.5.6-hormel.2/sort.h  Sun Jul 11 21:58:17 2004
@@ -29,9 +29,12 @@
 #define SORT_ADDRESS   11
 #define SORT_KEYID     12
 #define SORT_TRUST     13
-#define SORT_MASK      0xf
-#define SORT_REVERSE   (1<<4)
-#define SORT_LAST      (1<<5)
+#define SORT_SPAM      14
+/* dgc: Sort & SortAux are shorts, so I'm bumping these bitflags up from
+ * bits 4 & 5 to bits 8 & 9 to make room for more sort keys in the future. */
+#define SORT_MASK      0xff
+#define SORT_REVERSE   (1<<8)
+#define SORT_LAST      (1<<9)
 
 typedef int sort_t (const void *, const void *);
 sort_t *mutt_get_sort_func (int);
