Re: [PATCH] generic spam detection

To: mutt-dev@xxxxxxxx
Subject: Re: [PATCH] generic spam detection
From: David Champion <dgc@xxxxxxxxxxxx>
Date: Thu, 15 Jul 2004 01:46:35 -0500
In-reply-to: <20040715012321.GW24127@xxxxxxxxxxxxxxxxx>
List-unsubscribe: <mailto:mutt-dev-request@mutt.org?body=unsubscribe>
Mail-followup-to: mutt-dev@xxxxxxxx
References: <20040712142005.GA44582@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx> <20040712180051.GH27149@xxxxxxxxxxxxxxxxx> <20040714053853.GA55151@xxxxxxxxxxxxxxxxxxxxxxxxxx> <20040210071345.GA23743@xxxxxxxxxxxxxxxxx> <20040412204852.GT5807@xxxxxxxxxxxxxxxxxxxxxxxxxxxx> <20040712070323.GQ2543@xxxxxxxxxxxxxxxxx> <20040712142005.GA44582@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx> <20040712180051.GH27149@xxxxxxxxxxxxxxxxx> <20040712184628.GF14973@xxxxxxxxxxxxxxxxxxxxxxxxxxxx> <20040715012321.GW24127@xxxxxxxxxxxxxxxxx>
Sender: owner-mutt-dev@xxxxxxxx
User-agent: Mutt/1.5.6i

v3 is attached.

* On 2004.07.14, in <20040715012321.GW24127@xxxxxxxxxxxxxxxxx>,
*       "David Champion" <dgc@xxxxxxxxxxxx> wrote:
> 
> I think that "" should not be the default, but I'm split evenly on
> whether it should be unset or "," (or the like).

I changed the default $spam_separator to "," for this version of the
patch -- just for variety, in case anyone is trying this and feels that
it makes a better default.

> It's surprising to me that anyone would want to folder-hook these --
> my original thought was that spam patterns would remain the same for
> all folders, and it seems strange that one folder might use different
> external spam engines than another. But perhaps this is for a non-spam
> application of the functionality. More below.

Another note on this topic: notice that HEADER->env->spam is only
updated during folder reads. This means that folder-hook is the *only*
circumstance in which it makes any sense to change/add/remove spam or
nospam lists, or $spam_separator, during runtime. I overlooked this
before, in fact: for $spam_separator, R_NONE is a suitable flag because
changing $spam_separator won't update any spam tags until the mailbox is
reread anyway, and that already implies a full redraw.

Computing the tag once at read time isn't strictly necessary; it's just
an optimization against assembling lists into strings on the fly as the
index is rendered. But it seems like a wise one.
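
For anyone who wants to see the read-time assembly in isolation, here is
a minimal sketch. It is not the code in the patch: amend_tag() is a
hypothetical helper, it works on a fixed-size char array rather than a
BUFFER, and it treats an empty tag the same as a missing one, which is a
simplification. A NULL sep stands in for an unset $spam_separator.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, not part of the patch: fold one newly matched
 * format string into an existing tag. sep == NULL models an unset
 * $spam_separator (the new match overwrites); otherwise the new text
 * is appended after the separator.
 * Returns 0 on success, -1 if the result would not fit in tag. */
static int amend_tag (char *tag, size_t taglen, const char *sep, const char *add)
{
  size_t have = strlen (tag);

  if (!sep || !have)
  {
    /* unset separator, or no tag yet: the new match becomes the tag */
    if (strlen (add) >= taglen)
      return -1;
    strcpy (tag, add);
    return 0;
  }

  /* separator set and a tag already present: append "sep" then "add" */
  if (have + strlen (sep) + strlen (add) >= taglen)
    return -1;
  strcat (tag, sep);
  strcat (tag, add);
  return 0;
}
```

Feeding it "90+/DCC-Fuz2" and then "97/PM" with sep set to ", " leaves
"90+/DCC-Fuz2, 97/PM" in the tag; with sep NULL only "97/PM" survives.
Because this happens once per header at read time, a later change to the
separator has no effect until the folder is read again.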

> > It would be consistent with behavior of other list management
> > functions in mutt if spam would go through the nospam list and
> > remove a possibly identical regular expression (in addition to
> > "spamming" something), and if "nospam" would go through the spam
> > list and remove anything matching from that one.
> 
> Agreed. This is doable, and I'm willing to extend the patch to cover
> this. It shouldn't be hard.

This is done. Suppose the following four rules:
  spam aaa aaa  # if "aaa" matches, add "aaa" into the spam tag
  spam bbb bbb  # if "bbb" matches, add "bbb" into the spam tag
nospam aaabbb   # a special-case exception to "spam aaa"
nospam bbbaaa   # a special-case exception to "spam bbb"

Then:
  spam bbbaaa   # removes the second "nospam" above
nospam aaa      # removes the first "spam" above

Additionally, adding a "spam" rule whose pattern already exists will now
update it to use the new template. (Formerly this was a silent no-op.)

So:
  spam bbb ccc  # changes "spam" rule #2.

The result is effectively:

spam bbb ccc    # if "bbb" matches, add "ccc" into the spam tag
nospam aaabbb   # an exception to "spam bbb"

Also, "nospam *" removes all spam and nospam entries. (No point leaving
nospam rules when removing all spam rules.) This will be useful in
default folder-hooks, for resetting both spam and nospam lists to empty.
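
The list rules above can be modelled in a few lines. This is a toy
sketch under stated assumptions, not the patch's implementation: the
struct and the functions do_spam(), do_nospam() and drop() are all
hypothetical, patterns are compared as plain strings rather than
compiled regexps, and the lists are small fixed arrays.

```c
#include <stdio.h>
#include <string.h>

#define MAXRULES 16
#define MAXSTR   64

/* Toy model of the two lists; every name here is hypothetical. */
struct rules
{
  char spam_pat[MAXRULES][MAXSTR];
  char spam_fmt[MAXRULES][MAXSTR];
  int  nspam;
  char nospam_pat[MAXRULES][MAXSTR];
  int  nnospam;
};

/* Remove every entry equal to pat; fmts may be NULL. Returns new count. */
static int drop (char pats[][MAXSTR], char fmts[][MAXSTR], int n, const char *pat)
{
  int i, j = 0;

  for (i = 0; i < n; i++)
  {
    if (strcmp (pats[i], pat) == 0)
      continue;
    if (j != i)
    {
      strcpy (pats[j], pats[i]);
      if (fmts)
        strcpy (fmts[j], fmts[i]);
    }
    j++;
  }
  return j;
}

/* "spam pat fmt": update the format if pat is known, else append.
 * "spam pat" (fmt == NULL): only withdraw a matching nospam entry. */
static void do_spam (struct rules *r, const char *pat, const char *fmt)
{
  int i;

  if (!fmt)
  {
    r->nnospam = drop (r->nospam_pat, NULL, r->nnospam, pat);
    return;
  }
  for (i = 0; i < r->nspam; i++)
    if (strcmp (r->spam_pat[i], pat) == 0)
      break;
  if (i == MAXRULES)
    return;                     /* toy model: ignore overflow */
  if (i == r->nspam)
  {
    snprintf (r->spam_pat[i], MAXSTR, "%s", pat);
    r->nspam++;
  }
  snprintf (r->spam_fmt[i], MAXSTR, "%s", fmt);
}

/* "nospam pat": "*" clears both lists; a pattern identical to a spam
 * rule removes that rule; anything else is recorded as an exception. */
static void do_nospam (struct rules *r, const char *pat)
{
  int before = r->nspam;

  if (strcmp (pat, "*") == 0)
  {
    r->nspam = r->nnospam = 0;
    return;
  }
  r->nspam = drop (r->spam_pat, r->spam_fmt, r->nspam, pat);
  if (r->nspam != before)
    return;
  r->nnospam = drop (r->nospam_pat, NULL, r->nnospam, pat); /* no duplicates */
  if (r->nnospam < MAXRULES)
    snprintf (r->nospam_pat[r->nnospam++], MAXSTR, "%s", pat);
}
```

Replaying the seven commands from the example against a zeroed struct
rules ends with one spam rule (pattern "bbb", format "ccc") and one
nospam entry ("aaabbb"), and a final do_nospam(&r, "*") empties both.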

> > That said, I'm wondering if the spam/nospam stuff shouldn't reuse
> > the current hook framework; we'd then have "unhook" as a
> > coarse-grained mechanism to clean up the situation.
> > 
> > spam-hook, ham-hook?
> 
> The hook framework would need to be extended to handle backreferences,
> I think, but this seems like it would work. It has the advantage you
> mention, but I think it still requires a new datatype (e.g. spam_list_t)
> to map the hook pattern to a spam tag template, and most of the
> functionality in parse_spam_list() would be duplicated anyway inside
> mutt_parse_hook(). It seems like the primary advantage of changing
> it would be in the extent to which it makes the user experience
> more consistent. But it's not clear to me whether it would be more
> consistent, particularly. I wonder what others think of this.

I don't mean to remove this option from discussion; I just felt that the
other approach was something I could address today.
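
Whichever framework ends up holding the rules, the part that needs
back-reference support is the mapping from a matched pattern to a tag
template. A standalone sketch of that step with POSIX regex follows.
It is illustrative only: expand_tag() is a hypothetical helper (not
mutt_match_spam_list() from the patch), it recompiles the pattern on
every call, it assumes extended, case-insensitive regexps, and it
handles only single-digit references %0 through %9.

```c
#include <ctype.h>
#include <regex.h>
#include <stdlib.h>

#define MAXSUB 10   /* this sketch supports %0 through %9 only */

/* Hypothetical helper: match pat against line and expand templ into
 * out, replacing %N with the N-th parenthesized sub-expression.
 * Returns 1 on a match, 0 on no match, -1 on a bad pattern or a
 * zero-size buffer. Output is NUL-terminated and truncated to fit. */
static int expand_tag (const char *pat, const char *templ, const char *line,
                       char *out, size_t outlen)
{
  regex_t rx;
  regmatch_t m[MAXSUB];
  size_t len = 0;
  const char *p;
  int matched;

  if (!outlen)
    return -1;
  if (regcomp (&rx, pat, REG_EXTENDED | REG_ICASE) != 0)
    return -1;
  matched = (regexec (&rx, line, MAXSUB, m, 0) == 0);
  regfree (&rx);
  if (!matched)
    return 0;

  for (p = templ; *p && len < outlen - 1;)
  {
    if (*p == '%' && isdigit ((unsigned char) p[1]))
    {
      int n = p[1] - '0';
      regoff_t i;

      p += 2;
      /* rm_so is -1 when the group took no part in the match */
      for (i = m[n].rm_so; i >= 0 && i < m[n].rm_eo && len < outlen - 1; i++)
        out[len++] = line[i];
    }
    else
      out[len++] = *p++;        /* literal text, including a lone '%' */
  }
  out[len] = '\0';
  return 1;
}
```

With the PureMessage rule from the manual text in the patch, pattern
"X-PerlMX-Spam: .*Probability=([0-9]+)%" and template "%1/PM", a made-up
header line "X-PerlMX-Spam: Probability=97%" expands to "97/PM".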

-- 
 -D.    dgc@xxxxxxxxxxxx                                  NSIT::ENSS

--- mutt-1.5.6/PATCHES~ never
+++ mutt-1.5.6/PATCHES  Thu Jul 15 01:46:16 CDT 2004
@@ -1,0 +1 @@
+patch-1.5.6.dgc.hormel.3
diff -ur mutt-1.5.6-base/commands.c mutt-1.5.6-hormel.3/commands.c
--- mutt-1.5.6-base/commands.c  Sun Feb  1 11:10:57 2004
+++ mutt-1.5.6-hormel.3/commands.c      Sun Jul 11 22:19:55 2004
@@ -501,9 +501,9 @@
   int method = Sort; /* save the current method in case of abort */
 
   switch (mutt_multi_choice (reverse ?
-                            _("Rev-Sort (d)ate/(f)rm/(r)ecv/(s)ubj/t(o)/(t)hread/(u)nsort/si(z)e/s(c)ore?: ") :
-                            _("Sort (d)ate/(f)rm/(r)ecv/(s)ubj/t(o)/(t)hread/(u)nsort/si(z)e/s(c)ore?: "),
-                            _("dfrsotuzc")))
+                            _("Rev-Sort (d)ate/(f)rm/(r)ecv/(s)ubj/t(o)/(t)hread/(u)nsort/si(z)e/s(c)ore/s(p)am?: ") :
+                            _("Sort (d)ate/(f)rm/(r)ecv/(s)ubj/t(o)/(t)hread/(u)nsort/si(z)e/s(c)ore/s(p)am?: "),
+                            _("dfrsotuzcp")))
   {
   case -1: /* abort - don't resort */
     return -1;
@@ -542,6 +542,10 @@
   
   case 9: /* s(c)ore */ 
     Sort = SORT_SCORE;
+    break;
+
+  case 10: /* s(p)am */
+    Sort = SORT_SPAM;
     break;
   }
   if (reverse)
diff -ur mutt-1.5.6-base/doc/manual.sgml.head mutt-1.5.6-hormel.3/doc/manual.sgml.head
--- mutt-1.5.6-base/doc/manual.sgml.head        Sun Feb  1 11:49:53 2004
+++ mutt-1.5.6-hormel.3/doc/manual.sgml.head    Thu Jul 15 01:08:27 2004
@@ -1492,6 +1492,106 @@
 removed.  The pattern ``*'' is a special token which means to clear the list
 of all score entries.
 
+<sect1>Spam detection<label id="spam">
+<p>
+Usage: <tt/spam/ <em/pattern/ <em/format/
+Usage: <tt/nospam/ <em/pattern/
+
+Mutt has generalized support for external spam-scoring filters.
+By defining your spam patterns with the <tt/spam/ and <tt/nospam/
+commands, you can <em/limit/, <em/search/, and <em/sort/ your
+mail based on its spam attributes, as determined by the external
+filter. You also can display the spam attributes in your index
+display using the <tt/%H/ selector in the <ref id="index_format"
+name="&dollar;index&lowbar;format"> variable. (Tip: try <tt/%?H?[%H] ?/
+to display spam tags only when they are defined for a given message.)
+
+Your first step is to define your external filter's spam patterns using
+the <tt/spam/ command. <em/pattern/ should be a regular expression
+that matches a header in a mail message. If any message in the mailbox
+matches this regular expression, it will receive a ``spam tag'' or
+``spam attribute'' (unless it also matches a <tt/nospam/ pattern -- see
+below.) The appearance of this attribute is entirely up to you, and is
+governed by the <em/format/ parameter. <em/format/ can be any static
+text, but it also can include back-references from the <em/pattern/
+expression. (A regular expression ``back-reference'' refers to a
+sub-expression contained within parentheses.) <tt/%1/ is replaced with
+the first back-reference in the regex, <tt/%2/ with the second, etc.
+
+If you're using multiple spam filters, a message can have more than
+one spam-related header. You can define <tt/spam/ patterns for each
+filter you use. If a message matches two or more of these patterns, and
+the &dollar;spam&lowbar;separator variable is set to a string, then the
+message's spam tag will consist of all the <em/format/ strings joined
+together, with the value of &dollar;spam&lowbar;separator separating
+them.
+
+For example, suppose I use DCC, SpamAssassin, and PureMessage. I might
+define these spam settings:
+<tscreen><verb>
+spam "X-DCC-.*-Metrics:.*(....)=many"         "90+/DCC-%1"
+spam "X-Spam-Status: Yes"                     "90+/SA"
+spam "X-PerlMX-Spam: .*Probability=([0-9]+)%" "%1/PM"
+set spam_separator=", "
+</verb></tscreen>
+
+If I then received a message that DCC registered with ``many'' hits
+under the ``Fuz2'' checksum, and that PureMessage registered with a
+97% probability of being spam, that message's spam tag would read
+<tt>90+/DCC-Fuz2, 97/PM</tt>. (The four characters before ``=many'' in a
+DCC report indicate the checksum used -- in this case, ``Fuz2''.)
+
+If the &dollar;spam&lowbar;separator variable is unset, then each
+spam pattern match supersedes the previous one. Instead of getting
+joined <em/format/ strings, you'll get only the last one to match.
+
+The spam tag is what will be displayed in the index when you use
+<tt/%H/ in the <tt/&dollar;index&lowbar;format/ variable. It's also the
+string that the <tt/~H/ pattern-matching expression matches against for
+<em/search/ and <em/limit/ functions. And it's what sorting by spam
+attribute will use as a sort key.
+
+That's a pretty complicated example, and most people's actual
+environments will have only one spam filter. The simpler your
+configuration, the more effective mutt can be, especially when it comes
+to sorting.
+
+Generally, when you sort by spam tag, mutt will sort <em/lexically/ --
+that is, by ordering strings alphanumerically. However, if a spam tag
+begins with a number, mutt will sort numerically first, and lexically
+only when two numbers are equal in value. (This is like UNIX's
+<tt/sort -n/.) A message with no spam attributes at all -- that is, one
+that didn't match <em/any/ of your <tt/spam/ patterns -- is sorted at
+lowest priority. Numbers are sorted next, beginning with 0 and ranging
+upward. Finally, non-numeric strings are sorted, with ``a'' taking lower
+priority than ``z''. Clearly, in general, sorting by spam tags is most
+effective when you can coerce your filter to give you a raw number. But
+in case you can't, mutt can still do something useful.
+
+The <tt/nospam/ command can be used to write exceptions to <tt/spam/
+patterns. If a header pattern matches something in a <tt/spam/ command,
+but you nonetheless do not want it to receive a spam tag, you can list a
+more precise pattern under a <tt/nospam/ command.
+
+If the <em/pattern/ given to <tt/nospam/ is exactly the same as the
+<em/pattern/ on an existing <tt/spam/ list entry, the effect will be to
+remove the entry from the spam list, instead of adding an exception.
+Likewise, if the <em/pattern/ for a <tt/spam/ command matches an entry
+on the <tt/nospam/ list, that <tt/nospam/ entry will be removed. If the
+<em/pattern/ for <tt/nospam/ is ``*'', <em/all entries on both lists/
+will be removed. This might be the default action if you use <tt/spam/
+and <tt/nospam/ in conjunction with a <tt/folder-hook/.
+
+You can have as many <tt/spam/ or <tt/nospam/ commands as you like.
+You can even do your own primitive spam detection within mutt -- for
+example, if you consider all mail from <tt/MAILER-DAEMON/ to be spam,
+you can use a <tt/spam/ command like this:
+
+<tscreen><verb>
+spam "^From: .*MAILER-DAEMON"       "999"
+</verb></tscreen>
+
+
 <sect1>Setting variables<label id="set">
 <p>
 Usage: <tt/set/ &lsqb;no|inv&rsqb;<em/variable/&lsqb;=<em/value/&rsqb; &lsqb; <em/variable/ ... &rsqb;<newline>
@@ -1759,6 +1859,7 @@
 ~f USER         messages originating from USER
 ~g              cryptographically signed messages
 ~G              cryptographically encrypted messages
+~H EXPR         messages with a spam attribute matching EXPR
 ~h EXPR         messages which contain EXPR in the message header
 ~k             message contains PGP key material
 ~i ID           message which match ID in the ``Message-ID'' field
@@ -2390,7 +2491,7 @@
 
 <sect1>Start a WWW Browser on URLs (EXTERNAL)<label id="urlview">
 <p>
-If a message contains URLs (<em/unified ressource locator/ = address in the
+If a message contains URLs (<em/uniform resource locator/ = address in the
 WWW space like <em>http://www.mutt.org/</em>), it is efficient to get
 a menu with all the URLs and start a WWW browser on one of them.  This
 functionality is provided by the external urlview program which can be
@@ -3053,6 +3154,10 @@
 <tt><ref id="set" name="unset"></tt> <em/variable/ &lsqb;<em/variable/ ... &rsqb;
 <item>
 <tt><ref id="source" name="source"></tt> <em/filename/
+<item>
+<tt><ref id="spam" name="spam"></tt> <em/pattern/ <em/format/
+<item>
+<tt><ref id="spam" name="nospam"></tt> <em/pattern/
 <item>
 <tt><ref id="lists" name="subscribe"></tt> <em/address/ &lsqb; <em/address/ ... &rsqb; 
 <item>
diff -ur mutt-1.5.6-base/doc/muttrc.man.head mutt-1.5.6-hormel.3/doc/muttrc.man.head
--- mutt-1.5.6-base/doc/muttrc.man.head Sun Feb  1 11:15:18 2004
+++ mutt-1.5.6-hormel.3/doc/muttrc.man.head     Mon Jul 12 01:15:49 2004
@@ -336,6 +336,15 @@
 \fBsource\fP \fIfilename\fP
 The given file will be evaluated as a configuration file.
 .TP
+.nf
+\fBspam\fP \fIpattern\fP \fIformat\fP
+\fBnospam\fP \fIpattern\fP
+.fi
+These commands define spam-detection patterns from external spam
+filters, so that mutt can sort, limit, and search on
+``spam tags'' or ``spam attributes'', or display them
+in the index. See the Mutt manual for details.
+.TP
 \fBunhook\fP [\fB * \fP | \fIhook-type\fP ]
 This command will remove all hooks of a given type, or all hooks
 when \(lq\fB*\fP\(rq is used as an argument.  \fIhook-type\fP
@@ -384,6 +393,7 @@
 ~f \fIEXPR\fP  messages originating from \fIEXPR\fP
 ~g     PGP signed messages
 ~G     PGP encrypted messages
+~H \fIEXPR\fP  messages with spam tags matching \fIEXPR\fP
 ~h \fIEXPR\fP  messages which contain \fIEXPR\fP in the message header
 ~k     message contains PGP key material
 ~i \fIEXPR\fP  message which match \fIEXPR\fP in the \(lqMessage-ID\(rq field
diff -ur mutt-1.5.6-base/globals.h mutt-1.5.6-hormel.3/globals.h
--- mutt-1.5.6-base/globals.h   Sun Feb  1 11:15:17 2004
+++ mutt-1.5.6-hormel.3/globals.h       Sun Jul 11 02:16:57 2004
@@ -102,6 +102,7 @@
 WHERE char *Signature;
 WHERE char *SimpleSearch;
 WHERE char *Spoolfile;
+WHERE char *SpamSep;
 #if defined(USE_SSL) || defined(USE_NSS)
 WHERE char *SslCertFile INITVAL (NULL);
 WHERE char *SslEntropyFile INITVAL (NULL);
@@ -125,6 +126,8 @@
 WHERE RX_LIST *Alternates INITVAL(0);
 WHERE RX_LIST *MailLists INITVAL(0);
 WHERE RX_LIST *SubscribedLists INITVAL(0);
+WHERE SPAM_LIST *SpamList INITVAL(0);
+WHERE RX_LIST *NoSpamList INITVAL(0);
 
 /* bit vector for boolean variables */
 #ifdef MAIN_C
diff -ur mutt-1.5.6-base/hdrline.c mutt-1.5.6-hormel.3/hdrline.c
--- mutt-1.5.6-base/hdrline.c   Sun Feb  1 11:15:17 2004
+++ mutt-1.5.6-hormel.3/hdrline.c       Sun Jul 11 02:16:57 2004
@@ -433,6 +433,18 @@
         optional = 0;
       break;
 
+    case 'H':
+      /* (Hormel) spam score */
+      if (optional)
+       optional = hdr->env->spam ? 1 : 0;
+
+       if (hdr->env->spam)
+         mutt_format_s (dest, destlen, prefix, NONULL (hdr->env->spam->data));
+       else
+         mutt_format_s (dest, destlen, prefix, "");
+
+      break;
+
     case 'i':
       mutt_format_s (dest, destlen, prefix, hdr->env->message_id ? hdr->env->message_id : "<no.id>");
       break;
diff -ur mutt-1.5.6-base/init.c mutt-1.5.6-hormel.3/init.c
--- mutt-1.5.6-base/init.c      Sun Feb  1 12:21:00 2004
+++ mutt-1.5.6-hormel.3/init.c  Thu Jul 15 01:11:04 2004
@@ -366,6 +365,112 @@
 }
 
 
+static int add_to_spam_list (SPAM_LIST **list, const char *pat, const char *templ, BUFFER *err)
+{
+  SPAM_LIST *t = NULL, *last = NULL;
+  REGEXP *rx;
+  int n;
+  const char *p;
+
+  if (!pat || !*pat || !templ)
+    return 0;
+
+  if (!(rx = mutt_compile_regexp (pat, REG_ICASE)))
+  {
+    snprintf (err->data, err->dsize, _("Bad regexp: %s"), pat);
+    return -1;
+  }
+
+  /* check to make sure the item is not already on this list */
+  for (last = *list; last; last = last->next)
+  {
+    if (ascii_strcasecmp (rx->pattern, last->rx->pattern) == 0)
+    {
+      /* Already on the list. Formerly we just skipped this case, but
+       * now we're supporting removals, which means we're supporting
+       * re-adds conceptually. So we probably want this to imply a
+       * removal, then do an add. We can achieve the removal by freeing
+       * the template, and leaving t pointed at the current item.
+       */
+      t = last;
+      safe_free(&t->template);
+      break;
+    }
+    if (!last->next)
+      break;
+  }
+
+  /* If t is set, it's pointing into an extant SPAM_LIST* that we want to
+   * update. Otherwise we want to make a new one to link at the list's end.
+   */
+  if (!t)
+  {
+    t = mutt_new_spam_list();
+    t->rx = rx;
+    if (last)
+      last->next = t;
+    else
+      *list = t;
+  }
+
+  /* Now t is the SPAM_LIST* that we want to modify. It is prepared. */
+  t->template = strdup(templ);
+
+  /* find highest match number in template string */
+  t->nmatch = 0;
+  for (p = templ; *p;)
+  {
+    if (*p == '%')
+    {
+       n = atoi(++p);
+       if (n > t->nmatch)
+         t->nmatch = n;
+       while (*p && isdigit((int)*p))
+         ++p;
+    }
+    else
+       ++p;
+  }
+  t->nmatch++;         /* match 0 is always the whole expr */
+
+  return 0;
+}
+
+static int remove_from_spam_list (SPAM_LIST **list, const char *pat)
+{
+  SPAM_LIST *spam, *prev;
+  int nremoved = 0;
+
+  /* Nothing to do on an empty list. Being first is a special case. */
+  if (!(spam = *list)) return 0;
+  if (spam->rx && !mutt_strcmp(spam->rx->pattern, pat))
+  {
+    *list = spam->next;
+    mutt_free_regexp(&spam->rx);
+    safe_free(&spam->template);
+    safe_free(&spam);
+    return 1;
+  }
+
+  prev = spam;
+  for (spam = prev->next; spam;)
+  {
+    if (!mutt_strcmp(spam->rx->pattern, pat))
+    {
+      prev->next = spam->next;
+      mutt_free_regexp(&spam->rx);
+      safe_free(&spam->template);
+      safe_free(&spam);
+      spam = prev->next;
+      ++nremoved;
+    }
+    else
+      spam = spam->next;
+  }
+
+  return nremoved;
+}
+
 static void remove_from_list (LIST **l, const char *str)
 {
   LIST *p, *last = NULL;
@@ -502,6 +607,76 @@
   while (MoreArgs (s));
   
   return 0;
+}
+
+static int parse_spam_list (BUFFER *buf, BUFFER *s, unsigned long data, BUFFER *err)
+{
+  BUFFER templ;
+
+  memset(&templ, 0, sizeof(templ));
+
+  /* Insist on at least one parameter */
+  if (!MoreArgs(s))
+  {
+    if (data == M_SPAM)
+      strfcpy(err->data, _("spam: no matching pattern"), err->dsize);
+    else
+      strfcpy(err->data, _("nospam: no matching pattern"), err->dsize);
+    return -1;
+  }
+
+  /* Extract the first token, a regexp */
+  mutt_extract_token (buf, s, 0);
+
+  /* data should be either M_SPAM or M_NOSPAM. M_SPAM is for spam commands. */
+  if (data == M_SPAM)
+  {
+    /* If there's a second parameter, it's a template for the spam tag. */
+    if (MoreArgs(s))
+    {
+      mutt_extract_token (&templ, s, 0);
+
+      /* Add to the spam list. */
+      if (add_to_spam_list (&SpamList, buf->data, templ.data, err) != 0)
+          return -1;
+    }
+
+    /* If not, try to remove from the nospam list. */
+    else
+    {
+      remove_from_rx_list(&NoSpamList, buf->data);
+    }
+
+    return 0;
+  }
+
+  /* M_NOSPAM is for nospam commands. */
+  else if (data == M_NOSPAM)
+  {
+    /* nospam only ever has one parameter. */
+
+    /* "*" is a special case. */
+    if (!mutt_strcmp(buf->data, "*"))
+    {
+      mutt_free_spam_list (&SpamList);
+      mutt_free_rx_list (&NoSpamList);
+      return 0;
+    }
+
+    /* If it's on the spam list, just remove it. */
+    if (remove_from_spam_list(&SpamList, buf->data) != 0)
+      return 0;
+
+    /* Otherwise, add it to the nospam list. */
+    if (add_to_rx_list (&NoSpamList, buf->data, REG_ICASE, err) != 0)
+      return -1;
+
+    return 0;
+  }
+
+  /* This should not happen. */
+  strfcpy(err->data, "This is no good at all.", err->dsize);
+  return -1;
 }
 
 static int parse_unlist (BUFFER *buf, BUFFER *s, unsigned long data, BUFFER *err)
diff -ur mutt-1.5.6-base/init.h mutt-1.5.6-hormel.3/init.h
--- mutt-1.5.6-base/init.h      Sun Feb  1 11:15:17 2004
+++ mutt-1.5.6-hormel.3/init.h  Wed Jul 14 20:32:46 2004
@@ -901,6 +901,7 @@
   ** .dt %E .dd number of messages in current thread
   ** .dt %f .dd entire From: line (address + real name)
   ** .dt %F .dd author name, or recipient name if the message is from you
+  ** .dt %H .dd spam attribute(s) of this message
   ** .dt %i .dd message-id of the current message
   ** .dt %l .dd number of lines in the message (does not work with maildir,
   **            mh, and possibly IMAP folders)
@@ -2314,6 +2315,7 @@
   ** .  mailbox-order (unsorted)
   ** .  score
   ** .  size
+  ** .  spam
   ** .  subject
   ** .  threads
   ** .  to
@@ -2379,6 +2381,15 @@
   ** the message whether or not this is the case, as long as the
   ** non-``$$reply_regexp'' parts of both messages are identical.
   */
+  { "spam_separator",   DT_STR, R_NONE, UL &SpamSep, UL "," },
+  /*
+  ** .pp
+  ** ``$spam_separator'' controls what happens when multiple spam headers
+  ** are matched: if unset, each successive header will overwrite any
+  ** previous match's value for the spam label. If set, each successive
+  ** match will append to the previous, using ``$spam_separator'' as a
+  ** separator.
+  */
   { "spoolfile",       DT_PATH, R_NONE, UL &Spoolfile, 0 },
   /*
   ** .pp
@@ -2678,6 +2689,7 @@
   { "threads",         SORT_THREADS },
   { "to",              SORT_TO },
   { "score",           SORT_SCORE },
+  { "spam",            SORT_SPAM },
   { NULL,              0 }
 };
 
@@ -2696,6 +2708,7 @@
                                         */
   { "to",              SORT_TO },
   { "score",           SORT_SCORE },
+  { "spam",            SORT_SPAM },
   { NULL,              0 }
 };
   
@@ -2728,6 +2741,7 @@
 
 static int parse_list (BUFFER *, BUFFER *, unsigned long, BUFFER *);
 static int parse_rx_list (BUFFER *, BUFFER *, unsigned long, BUFFER *);
+static int parse_spam_list (BUFFER *, BUFFER *, unsigned long, BUFFER *);
 static int parse_unlist (BUFFER *, BUFFER *, unsigned long, BUFFER *);
 static int parse_rx_unlist (BUFFER *, BUFFER *, unsigned long, BUFFER *);
 
@@ -2793,6 +2807,8 @@
   { "send-hook",       mutt_parse_hook,        M_SENDHOOK },
   { "set",             parse_set,              0 },
   { "source",          parse_source,           0 },
+  { "spam",            parse_spam_list,        M_SPAM },
+  { "nospam",          parse_spam_list,        M_NOSPAM },
   { "subscribe",       parse_subscribe,        0 },
   { "toggle",          parse_set,              M_SET_INV },
   { "unalias",         parse_unalias,          0 },
diff -ur mutt-1.5.6-base/mutt.h mutt-1.5.6-hormel.3/mutt.h
--- mutt-1.5.6-base/mutt.h      Sun Feb  1 11:15:17 2004
+++ mutt-1.5.6-hormel.3/mutt.h  Wed Jul 14 20:59:15 2004
@@ -220,6 +220,7 @@
   M_ID,
   M_BODY,
   M_HEADER,
+  M_HORMEL,
   M_WHOLE_MSG,
   M_SENDER,
   M_MESSAGE,
@@ -312,6 +313,9 @@
 #define M_SEL_MULTI    (1<<1)
 #define M_SEL_FOLDER   (1<<2)
 
+/* flags for parse_spam_list */
+#define M_SPAM          1
+#define M_NOSPAM        2
 
 /* boolean vars */
 enum
@@ -405,6 +409,7 @@
   OPTSIGDASHES,
   OPTSIGONTOP,
   OPTSORTRE,
+  OPTSPAMSEP,
   OPTSTATUSONTOP,
   OPTSTRICTTHREADS,
   OPTSUSPEND,
@@ -512,10 +517,20 @@
   struct rx_list_t *next;
 } RX_LIST;
 
+typedef struct spam_list_t
+{
+  REGEXP *rx;
+  int     nmatch;
+  char   *template;
+  struct spam_list_t *next;
+} SPAM_LIST;
+
 #define mutt_new_list() safe_calloc (1, sizeof (LIST))
 #define mutt_new_rx_list() safe_calloc (1, sizeof (RX_LIST))
+#define mutt_new_spam_list() safe_calloc (1, sizeof (SPAM_LIST))
 void mutt_free_list (LIST **);
 void mutt_free_rx_list (RX_LIST **);
+void mutt_free_spam_list (SPAM_LIST **);
 int mutt_matches_ignore (const char *, LIST *);
 
 /* add an element to a list */
@@ -550,6 +565,7 @@
   char *supersedes;
   char *date;
   char *x_label;
+  BUFFER *spam;
   LIST *references;            /* message references (in reverse order) */
   LIST *in_reply_to;           /* in-reply-to header content */
   LIST *userhdrs;              /* user defined headers */
diff -ur mutt-1.5.6-base/muttlib.c mutt-1.5.6-hormel.3/muttlib.c
--- mutt-1.5.6-base/muttlib.c   Sun Feb  1 11:15:17 2004
+++ mutt-1.5.6-hormel.3/muttlib.c       Sun Jul 11 02:16:57 2004
@@ -1283,6 +1283,60 @@
     sleep (s);
 }
 
+/*
+ * Creates and initializes a BUFFER*. If passed an existing BUFFER*,
+ * just initializes. Frees anything already in the buffer.
+ *
+ * Disregards the 'destroy' flag, which seems reserved for caller.
+ * This is bad, but there's no apparent protocol for it.
+ */
+BUFFER * mutt_buffer_init(BUFFER *b)
+{
+  if (!b)
+  {
+    b = malloc(sizeof(BUFFER));
+    if (!b)
+      return NULL;
+  }
+  else
+  {
+    safe_free(&b->data);
+  }
+  memset(b, 0, sizeof(BUFFER));
+  return b;
+}
+
+/*
+ * Creates and initializes a BUFFER*. If passed an existing BUFFER*,
+ * just initializes. Frees anything already in the buffer. Copies in
+ * the seed string.
+ *
+ * Disregards the 'destroy' flag, which seems reserved for caller.
+ * This is bad, but there's no apparent protocol for it.
+ */
+BUFFER * mutt_buffer_from(BUFFER *b, char *seed)
+{
+  int n;
+
+  if (!seed)
+    return NULL;
+
+  b = mutt_buffer_init(b);
+  b->data = strdup(seed);
+  b->dsize = strlen(seed);
+  b->dptr = b->data + b->dsize;
+  return b;
+}
+
+void mutt_buffer_free(BUFFER **b)
+{
+  if (!b || !*b)
+    return;
+  if ((*b)->data)
+    safe_free(&((*b)->data));
+  safe_free(b);
+}
+
 void mutt_buffer_addstr (BUFFER* buf, const char* s)
 {
   mutt_buffer_add (buf, s, mutt_strlen (s));
@@ -1379,6 +1433,21 @@
   }
 }
 
+void mutt_free_spam_list (SPAM_LIST **list)
+{
+  SPAM_LIST *p;
+  
+  if (!list) return;
+  while (*list)
+  {
+    p = *list;
+    *list = (*list)->next;
+    mutt_free_regexp (&p->rx);
+    safe_free(&p->template);
+    FREE (&p);
+  }
+}
+
 int mutt_match_rx_list (const char *s, RX_LIST *l)
 {
   if (!s)  return 0;
@@ -1388,6 +1457,57 @@
     if (regexec (l->rx->rx, s, (size_t) 0, (regmatch_t *) 0, (int) 0) == 0)
     {
       dprint (5, (debugfile, "mutt_match_rx_list: %s matches %s\n", s, l->rx->pattern));
+      return 1;
+    }
+  }
+
+  return 0;
+}
+
+int mutt_match_spam_list (const char *s, SPAM_LIST *l, char *text, int x)
+{
+  static regmatch_t *pmatch = NULL;
+  static int nmatch = 0;
+  int i, n, tlen;
+  char *p;
+
+  if (!s)  return 0;
+
+  tlen = 0;
+
+  for (; l; l = l->next)
+  {
+    /* If this pattern needs more matches, expand pmatch. */
+    if (l->nmatch > nmatch)
+    {
+      safe_realloc ((void**) &pmatch, l->nmatch * sizeof(regmatch_t));
+      nmatch = l->nmatch;
+    }
+
+    /* Does this pattern match? */
+    if (regexec (l->rx->rx, s, (size_t) l->nmatch, (regmatch_t *) pmatch, (int) 0) == 0)
+    {
+      dprint (5, (debugfile, "mutt_match_spam_list: %s matches %s\n", s, l->rx->pattern));
+      dprint (5, (debugfile, "mutt_match_spam_list: %d subs\n", l->rx->rx->re_nsub));
+
+      /* Copy template into text, with substitutions. */
+      for (p = l->template; *p && tlen < x - 1;)
+      {
+       if (*p == '%')
+       {
+         n = atoi(++p);                        /* find pmatch index */
+         while (isdigit((unsigned char) *p))
+           ++p;                                /* skip subst token */
+         for (i = pmatch[n].rm_so; (i < pmatch[n].rm_eo) && (tlen < x - 1); i++)
+           text[tlen++] = s[i];
+       }
+       else
+       {
+         text[tlen++] = *p++;
+       }
+      }
+      text[tlen] = '\0';
+      dprint (5, (debugfile, "mutt_match_spam_list: \"%s\"\n", text));
       return 1;
     }
   }
diff -ur mutt-1.5.6-base/parse.c mutt-1.5.6-hormel.3/parse.c
--- mutt-1.5.6-base/parse.c     Wed Nov  5 03:41:33 2003
+++ mutt-1.5.6-hormel.3/parse.c Sun Jul 11 02:16:57 2004
@@ -1267,6 +1267,7 @@
   long loc;
   int matched;
   size_t linelen = LONG_STRING;
+  char buf[LONG_STRING+1];
 
   if (hdr)
   {
@@ -1308,6 +1309,49 @@
 
       fseek (f, loc, 0);
       break; /* end of header */
+    }
+
+    *buf = '\0';
+
+    if (mutt_match_spam_list(line, SpamList, buf, sizeof(buf)))
+    {
+      if (!mutt_match_rx_list(line, NoSpamList))
+      {
+
+       /* if spam tag already exists, figure out how to amend it */
+       if (e->spam && *buf)
+       {
+         /* If SpamSep defined, append with separator */
+         if (SpamSep)
+         {
+           mutt_buffer_addstr(e->spam, SpamSep);
+           mutt_buffer_addstr(e->spam, buf);
+         }
+
+         /* else overwrite */
+         else
+         {
+           e->spam->dptr = e->spam->data;
+           *e->spam->dptr = '\0';
+           mutt_buffer_addstr(e->spam, buf);
+         }
+       }
+
+       /* spam tag is new, and match expr is non-empty; copy */
+       else if (!e->spam && *buf)
+       {
+         e->spam = mutt_buffer_from(NULL, buf);
+       }
+
+       /* match expr is empty; plug in null string if no existing tag */
+       else if (!e->spam)
+       {
+         e->spam = mutt_buffer_from(NULL, "");
+       }
+
+       if (e->spam && e->spam->data)
+          dprint(5, (debugfile, "p822: spam = %s\n", e->spam->data));
+      }
     }
 
     *p = 0;
diff -ur mutt-1.5.6-base/pattern.c mutt-1.5.6-hormel.3/pattern.c
--- mutt-1.5.6-base/pattern.c   Wed Nov  5 03:41:33 2003
+++ mutt-1.5.6-hormel.3/pattern.c       Sun Jul 11 02:16:57 2004
@@ -58,6 +58,7 @@
   { 'g', M_CRYPT_SIGN,                 0,              NULL },
   { 'G', M_CRYPT_ENCRYPT,      0,              NULL },
   { 'h', M_HEADER,             M_FULL_MSG,     eat_regexp },
+  { 'H', M_HORMEL,             0,              eat_regexp },
   { 'i', M_ID,                 0,              eat_regexp },
   { 'k', M_PGP_KEY,            0,              NULL },
   { 'L', M_ADDRESS,            0,              eat_regexp },
@@ -1045,6 +1046,8 @@
     return (pat->not ^ ((h->security & APPLICATION_PGP) && (h->security & PGPKEY)));
     case M_XLABEL:
       return (pat->not ^ (h->env->x_label && regexec (pat->rx, h->env->x_label, 0, NULL, 0) == 0));
+    case M_HORMEL:
+      return (pat->not ^ (h->env->spam && h->env->spam->data && regexec (pat->rx, h->env->spam->data, 0, NULL, 0) == 0));
     case M_DUPLICATED:
       return (pat->not ^ (h->thread && h->thread->duplicate_thread));
   }
diff -ur mutt-1.5.6-base/protos.h mutt-1.5.6-hormel.3/protos.h
--- mutt-1.5.6-base/protos.h    Sun Feb  1 11:15:17 2004
+++ mutt-1.5.6-hormel.3/protos.h        Sun Jul 11 02:16:57 2004
@@ -32,6 +32,9 @@
        HEADER *, format_flag);
 
 int mutt_extract_token (BUFFER *, BUFFER *, int);
+BUFFER * mutt_buffer_init (BUFFER *);
+BUFFER * mutt_buffer_from (BUFFER *, char *);
+void mutt_buffer_free(BUFFER **);
 void mutt_buffer_add (BUFFER*, const char*, size_t);
 void mutt_buffer_addstr (BUFFER*, const char*);
 void mutt_buffer_addch (BUFFER*, char);
@@ -291,6 +294,7 @@
 int mutt_is_valid_mailbox (const char *);
 int mutt_lookup_mime_type (BODY *, const char *);
 int mutt_match_rx_list (const char *, RX_LIST *);
+int mutt_match_spam_list (const char *, SPAM_LIST *, char *, int);
 int mutt_messages_in_thread (CONTEXT *, HEADER *, int);
 int mutt_multi_choice (char *prompt, char *letters);
 int mutt_needs_mailcap (BODY *);
diff -ur mutt-1.5.6-base/sort.c mutt-1.5.6-hormel.3/sort.c
--- mutt-1.5.6-base/sort.c      Sun Feb  1 11:10:58 2004
+++ mutt-1.5.6-hormel.3/sort.c  Sun Jul 11 23:53:59 2004
@@ -149,6 +149,57 @@
   return (SORTCODE ((*ha)->index - (*hb)->index));
 }
 
+int compare_spam (const void *a, const void *b)
+{
+  HEADER **ppa = (HEADER **) a;
+  HEADER **ppb = (HEADER **) b;
+  char   *aptr, *bptr;
+  int     ahas, bhas;
+  int     result = 0;
+
+  /* Firstly, require spam attributes for both msgs */
+  /* to compare. Determine which msgs have one.     */
+  ahas = (*ppa)->env && (*ppa)->env->spam;
+  bhas = (*ppb)->env && (*ppb)->env->spam;
+
+  /* If one msg has spam attr but other does not, sort the one without first. */
+  if (ahas && !bhas)
+    return (SORTCODE(1));
+  if (!ahas && bhas)
+    return (SORTCODE(-1));
+
+  /* Else, if neither has a spam attr, presume equality. Fall back on aux. */
+  if (!ahas && !bhas)
+  {
+    AUXSORT(result, a, b);
+    return (SORTCODE(result));
+  }
+
+
+  /* Both have spam attrs. */
+
+  /* preliminary numeric examination */
+  result = (strtoul((*ppa)->env->spam->data, &aptr, 10) -
+            strtoul((*ppb)->env->spam->data, &bptr, 10));
+
+  /* If either aptr or bptr is equal to data, there is no numeric    */
+  /* value for that spam attribute. In this case, compare lexically. */
+  if ((aptr == (*ppa)->env->spam->data) || (bptr == (*ppb)->env->spam->data))
+    return (SORTCODE(strcmp(aptr, bptr)));
+
+  /* Otherwise, we have numeric value for both attrs. If these values */
+  /* are equal, then we first fall back upon string comparison, then  */
+  /* upon auxiliary sort.                                             */
+  if (result == 0)
+  {
+    result = strcmp(aptr, bptr);
+    if (result == 0)
+      AUXSORT(result, a, b);
+  }
+
+  return (SORTCODE(result));
+}
+
 sort_t *mutt_get_sort_func (int method)
 {
   switch (method & SORT_MASK)
@@ -169,6 +220,8 @@
       return (compare_to);
     case SORT_SCORE:
       return (compare_score);
+    case SORT_SPAM:
+      return (compare_spam);
     default:
       return (NULL);
   }
diff -ur mutt-1.5.6-base/sort.h mutt-1.5.6-hormel.3/sort.h
--- mutt-1.5.6-base/sort.h      Mon Jan  6 04:25:35 2003
+++ mutt-1.5.6-hormel.3/sort.h  Sun Jul 11 21:58:17 2004
@@ -29,9 +29,12 @@
 #define SORT_ADDRESS   11
 #define SORT_KEYID     12
 #define SORT_TRUST     13
-#define SORT_MASK      0xf
-#define SORT_REVERSE   (1<<4)
-#define SORT_LAST      (1<<5)
+#define SORT_SPAM      14
+/* dgc: Sort & SortAux are shorts, so I'm bumping these bitflags up from
+ * bits 4 & 5 to bits 8 & 9 to make room for more sort keys in the future. */
+#define SORT_MASK      0xff
+#define SORT_REVERSE   (1<<8)
+#define SORT_LAST      (1<<9)
 
 typedef int sort_t (const void *, const void *);
 sort_t *mutt_get_sort_func (int);
