
full-text-indexing email



I'm once again full-text-indexing all my email, and it's working
great. This is only loosely related to mutt; I use mutt, but almost
any MUA could be plugged in in its place. In the rest of this message
I'll describe what I'm doing and how.

Last time I did full-text indexing, I used glimpse; but
that codebase has gone down a road I didn't want to follow
(restricted-use commercial code).

For a long time I did without. Recently I took another look over the
available full-text indexers; Freshmeat gave me: Lupy, glimpse,
harvest, holmes, namazu, swish, swish++, and yase. The first one I
looked at closely was swish++, and I stopped there. Perhaps some of
the others would have worked better; I didn't check.

Given swish++, the whole job was almost trivial. I do find that on
my platform the memory use of its index program scales somewhat
differently from what the author reports: on Red Hat 8, with
-W100000, it climbed to 76MB before finishing my email archives,
where the author was seeing 64MB per 250K words. Aside from that,
everything is slick.

I archive my email in Maildirs; that's critical to this strategy,
since each message is then a separate file. Building the initial
index didn't take all that long, and the periodic incremental
re-index is _really_ quick. I re-index with:

  #!/bin/sh -e

  cd "$HOME/archive/Mail"
  find */??? -type f -newer swish++.index | index -W100000 -I -
  mv swish++.index.new swish++.index
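
The script above assumes swish++.index already exists, since find's
-newer test needs a file to compare against. A minimal sketch of
choosing the file list for either case follows; the function name
list_files_to_index is hypothetical, and the index location is taken
from the script above.

```shell
#!/bin/sh
# Sketch: pick the file list for a full or an incremental index run.
# Assumption: the index lives at DIR/swish++.index, as in the script above.

# list_files_to_index DIR  (hypothetical helper)
# Prints every message file when no index exists yet, or only the files
# newer than the existing index on later runs. Paths are relative to DIR.
list_files_to_index() {
    (
        cd "$1" || exit 1
        if [ -f swish++.index ]; then
            find */??? -type f -newer swish++.index
        else
            find */??? -type f
        fi
    )
}
```

On later runs the output can be piped to index -W100000 -I - exactly
as above. For a first run the list would presumably go to index
without -I; that is an assumption, so check the swish++ manual.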

With a current index, I can do keyword searches for email with the
attached perl script. Invoked with keywords (actually, search accepts
boolean expressions of keywords), it builds a temporary maildir
populated with hard links to the matching messages and invokes mutt
on it. Very, very fast.
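
The linking step can also be sketched in shell, which makes it easy
to try on its own. This assumes what the attached Perl script relies
on: comment lines from search start with "#", and the second
whitespace-separated field of each result line is the file name. The
function name link_results is hypothetical.

```shell
#!/bin/sh
# Sketch: turn search output into a throwaway Maildir of hard links.
# Assumption: result lines have the file name as their second field and
# comment lines start with "#", as the attached Perl script expects.

# link_results BOX  (hypothetical helper)
# Reads search output on stdin and hard-links each hit into BOX/cur.
# File names containing whitespace are not handled.
link_results() {
    box=$1
    mkdir -p "$box/tmp" "$box/new" "$box/cur" || return 1
    count=0
    while read -r rank path rest; do
        case $rank in
            '#'*|'') continue ;;   # skip comment and blank lines
        esac
        ln "$path" "$box/cur/$(basename "$path")" || return 1
        count=$((count + 1))
    done
    # Succeed only if at least one message matched.
    [ "$count" -gt 0 ]
}
```

Run from the archive directory, since the indexed paths are relative:
search mutt and maildir | link_results "$HOME/.mailsearch$$" followed
by mutt -f on that directory.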

-Bennett
#!/usr/bin/perl -w
use strict;
use IO::File;
use File::Basename;

my $nothing = <<'EoF';
    Lucy Locket lost her pocket; Kitty Fisher found it.
    Nothing in it, nothing in it, but the binding 'round it.
EoF

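my $tmpbox = $ENV{HOME} . '/.mailsearch' . $$;
# Remove the temporary maildir on exit. system (not exec) with a
# localized $? keeps the script's own exit status intact.
END { local $?; system "rm", "-rf", $tmpbox; }
mkdir $tmpbox, 0700 or die "mkdir $tmpbox: $!\n";
mkdir "$tmpbox/$_" or die "mkdir $tmpbox/$_: $!\n" for qw(tmp new cur);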

my $cur = "$tmpbox/cur";

chdir $ENV{HOME} . '/archive/Mail' or die;

my $gotsome = 0;
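# List-form open runs search directly, so the query words are not
# re-parsed by a shell.
open(my $fi, '-|', 'search', @ARGV) or die "can't run search: $!\n";
while (<$fi>) {
    next if /^#/;                 # skip search's comment lines
    my $fn = (split)[1];          # second field is the file name
    link $fn, "$cur/@{[basename($fn)]}" or die "link $fn: $!\n";
    $gotsome = 1;
}
close $fi;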
die $nothing unless $gotsome;
system "mutt", "-f", $tmpbox;
