[PATCH] Re-introduce inode sorting
Hi,
sorry that this is going to be longer, the subject is quite tricky.
The inode sorting patch was sent to mutt-dev in:
http://does-not-exist.org/mail-archives/mutt-dev/msg00205.html
saying that it can speed it uncached maildir parsing from 200 to 15
seconds making it more than 13 times faster, some concerns were raised
and the code disabled by default.
For #2981 I've been doing some tests on my gentoo box and can confirm
this factor roughly (10 minutes for >300k messages compared to far over
2 hours). This almost renders mutt unusable for larger folders that
aren't cached yet (e.g. an internal structure change requires a full
cache rebuild or a user just giving mutt a try).
Googling for some messages on the kernel mailing list turns out that not
doing inode sorting on dir_index-enabled ext3 filesystems effectively
makes apps access inodes in nearly random order, so that explains the
slowdown. I think mutt needs inode sorting.
However, turning it on in general, per OS and even per filesystem is not
desirable. The attached patch is a first try at "on demand" inode
sorting.
Using readdir() the code builds up a linked list with files to parse. In
the second pass upon the first time that we either need to stat() or
really parse a message if not's hcache'd yet, inode sorting now is done.
Note that only the portion of the linked list is sorted following and
including the entry that we miss in the cache. Some more simple test
series of mine show that by doing insertion sort on lists with 6 or
fewer elements makes sorting another 7% to 3% faster.
Thus, the attached patch also addresses the main concern in the above
thread: For a fully cached mailbox with maildir_header_cache_verify=no,
this patch won't do sorting since no stat() or open() is involved. Also,
for my simple tests it does help to sort only part of the list instead
completely since it reduces the number of elements to check (down from a
total >30k to just <20 entries for some folders). Since the code is
moved into the second-pass maildir parsing, it's not only triggered when
initially opening a folder but also when re-scanning the currently open
one which helps also if cur/ is massively changed outside of the current
mutt.
I've been trying hard to come up with numbers on how fast sorting is.
According to my tests, a rule of thumb is: it does roughly 500k messages
per second and GHz clock speed. On my MacPro, it does 300k in 0.3
seconds at 2 GHz, on my 800 MHz laptop it takes 0.65 for the same
folder; this "formula" even roughly holds for smaller folders on these
machines. I consider the sorting overhead relatively low when looking at
how much time we save by mostly avoiding seeking.
I'm running this code for quite a few days and it doesn't cause harm for
me on HFS+/OSX and ext3/Linux; on the contrary, it speeds things up
(though far more on ext3 than on HFS+).
I've also put some timing and debug code in, so it would be kind if
people could test this patch and see the time sorting takes in the debug
file for further analysis (especially checking whether partial list
sorting happens/helps).
For my personal preference, I've turned the compile-time option into a
run-time option $maildir_sort_inode which we may even consider enabling
by default if it turns out that this optimization patch really does
optimize things. IMHO users shouldn't have to deal with filesystem
details only because they happen to have huge folders (though that
requires thinking about the storage format and tools involved anyway,
but that is debatable).
As it's not final yet, some words in the manual are still missing.
Rocco
comparing with ../pdmef/feature/inode-sort
searching for changes
diff --git a/configure.ac b/configure.ac
--- a/configure.ac
+++ b/configure.ac
@@ -757,16 +757,6 @@ if test x$ac_cv_dirent_d_ino = xyes ; th
[Define to 1 if your system has the dirent::d_ino member])
fi
AC_MSG_RESULT($ac_cv_dirent_d_ino)
-
-dnl This may look cumbersome -- please keep it that way, so we can
-dnl quickly change the default to "yes" again.
-mutt_cv_inodesort=no
-AC_ARG_ENABLE(inodesort, AC_HELP_STRING([--enable-inodesort], [Read files in
maildir folders sorted by inode]),
- [if test x$enableval = xyes && test x$ac_cv_dirent_d_ino = xyes ; then
mutt_cv_inodesort=yes; fi])
-
-if test $mutt_cv_inodesort = yes; then
- AC_DEFINE(USE_INODESORT, 1, [ Define to sort files in a maildir by
inode number. ])
-fi
mutt_cv_warnings=yes
AC_ARG_ENABLE(warnings, AC_HELP_STRING([--disable-warnings], [Turn off
compiler warnings (not recommended)]),
diff --git a/doc/makedoc-defs.h b/doc/makedoc-defs.h
--- a/doc/makedoc-defs.h
+++ b/doc/makedoc-defs.h
@@ -54,3 +54,6 @@
# ifndef USE_SASL
# define USE_SASL
# endif
+# ifndef HAVE_DIRENT_D_INO
+# define HAVE_DIRENT_D_INO
+# endif
diff --git a/init.h b/init.h
--- a/init.h
+++ b/init.h
@@ -1136,6 +1136,34 @@ struct option_t MuttVars[] = {
** files when the header cache is in use. This incurs one stat(2) per
** message every time the folder is opened.
*/
+#if HAVE_DIRENT_D_INO
+ { "maildir_sort_inode", DT_BOOL, R_NONE, OPTSORTINODE, 0 },
+ /*
+ ** .pp
+ ** For Maildir folders, this variable controls whether to sort filenames
first
+ ** by their inode number on demand before trying to open() or stat() them.
Whether this
+ ** variable should be \fIset\fP or \fIunset\fP highly depends on what
filesystem
+ ** the folder resides.
+ ** .pp
+ ** Under some circumstances, setting this variable may result in great
performance
+ ** improvements when reading large maildir folders when either header
caching is
+ ** disabled or many messages are not yet cached. In situations where
(nearly) all
+ ** messages are cached, sorting a large filename list may take more time
than the
+ ** time savings from using a sorted list are (causing lower performance).
+ ** .pp
+ ** In general, if ``$$maildir_header_cache_verify'' is \fIset\fP, users may
want to
+ ** enable this setting since Mutt will stat() the messages anyway. When
+ ** ``$$maildir_header_cache_verify'' is \fIunset\fP, it depends on the
filesystem
+ ** and the cached/uncached message ratio whether this setting improves
performance
+ ** or not.
+ ** .pp
+ ** Users of Mutt on an ext3 filesystem using \fIdir_index\fP likely want to
\fIset\fP
+ ** this value.
+ ** .pp
+ ** When this option is \fIset\fP, Mutt sorts entries on demand, i.e. only
once before
+ ** it actually attempts to stat() or open() a message.
+ */
+#endif /* HAVE_DIRENT_D_INO */
#if defined(HAVE_GDBM) || defined(HAVE_DB4)
{ "header_cache_pagesize", DT_STR, R_NONE, UL &HeaderCachePageSize, UL
"16384" },
/*
diff --git a/main.c b/main.c
--- a/main.c
+++ b/main.c
@@ -246,12 +246,6 @@ static void show_version (void)
"+USE_FLOCK "
#else
"-USE_FLOCK "
-#endif
-
-#ifdef USE_INODESORT
- "+USE_INODESORT "
-#else
- "-USE_INODESORT "
#endif
);
puts (
diff --git a/mh.c b/mh.c
--- a/mh.c
+++ b/mh.c
@@ -56,14 +56,16 @@
#include <sys/time.h>
#endif
+#define INS_SORT_THRESHOLD 6
+
struct maildir
{
HEADER *h;
char *canon_fname;
unsigned header_parsed:1;
-#ifdef USE_INODESORT
+#ifdef HAVE_DIRENT_D_INO
ino_t inode;
-#endif /* USE_INODESORT */
+#endif /* HAVE_DIRENT_D_INO */
struct maildir *next;
};
@@ -677,8 +679,10 @@ static HEADER *maildir_parse_message (in
static int maildir_parse_entry (CONTEXT * ctx, struct maildir ***last,
const char *subdir, const char *fname,
- int *count, int is_old, progress_t *progress,
- ino_t inode
+ int *count, int is_old, progress_t *progress
+#if HAVE_DIRENT_D_INO
+ , ino_t inode
+#endif
#if USE_HCACHE
, header_cache_t *hc
#endif
@@ -741,9 +745,9 @@ static int maildir_parse_entry (CONTEXT
entry = safe_calloc (sizeof (struct maildir), 1);
entry->h = h;
entry->header_parsed = (ctx->magic == M_MH);
-#ifdef USE_INODESORT
+#ifdef HAVE_DIRENT_D_INO
entry->inode = inode;
-#endif /* USE_INODESORT */
+#endif /* HAVE_DIRENT_D_INO */
**last = entry;
*last = &entry->next;
@@ -811,11 +815,9 @@ static int maildir_parse_dir (CONTEXT *
(debugfile, "%s:%d: parsing %s\n", __FILE__, __LINE__,
de->d_name));
maildir_parse_entry (ctx, last, subdir, de->d_name, count, is_old,
- progress,
+ progress
#if HAVE_DIRENT_D_INO
- de->d_ino
-#else
- 0
+ , de->d_ino
#endif
#if USE_HCACHE
, hc
@@ -889,7 +891,7 @@ static size_t maildir_hcache_keylen (con
}
#endif
-#ifdef USE_INODESORT
+#ifdef HAVE_DIRENT_D_INO
/*
* Merge two maildir lists according to the inode numbers.
*/
@@ -949,47 +951,107 @@ static struct maildir* maildir_merge_in
return head;
}
+static struct maildir* maildir_ins_sort (struct maildir* list)
+{
+ struct maildir *tmp, *last, *ret = NULL, *back;
+
+ ret = list;
+ list = list->next;
+ ret->next = NULL;
+
+ while (list)
+ {
+ last = NULL;
+ back = list->next;
+ for (tmp = ret; tmp && tmp->inode <= list->inode; tmp = tmp->next)
+ last = tmp;
+
+ list->next = tmp;
+ if (last)
+ last->next = list;
+ else
+ ret = list;
+
+ list = back;
+ }
+
+ return ret;
+}
+
/*
* Sort maildir list according to inode.
*/
-static struct maildir* maildir_sort_inode(struct maildir* list)
+static struct maildir* _maildir_sort_inode (struct maildir* list, size_t len)
{
struct maildir* left = list;
struct maildir* right = list;
+ size_t c = 0;
if (!list || !list->next)
{
return list;
}
+ if (len != (size_t)(-1) && len <= INS_SORT_THRESHOLD)
+ return maildir_ins_sort (list);
+
list = list->next;
while (list && list->next)
{
right = right->next;
list = list->next->next;
+ c++;
}
list = right;
right = right->next;
list->next = 0;
- left = maildir_sort_inode(left);
- right = maildir_sort_inode(right);
- return maildir_merge_inode(left, right);
+ left = _maildir_sort_inode (left, c);
+ right = _maildir_sort_inode (right, c);
+ return maildir_merge_inode (left, right);
}
-#endif /* USE_INODESORT */
+#ifdef DEBUG
+static struct maildir* maildir_sort_inode (struct maildir* list)
+{
+ struct timeval tv1 = { 0, 0}, tv2 = { 0, 0 };
+ int a, b, c;
+ struct maildir *p;
+
+ gettimeofday (&tv1, NULL);
+ list = _maildir_sort_inode (list, (size_t)(-1));
+ gettimeofday (&tv2, NULL);
+
+ a = tv2.tv_sec - tv1.tv_sec;
+ b = tv2.tv_usec - tv1.tv_usec;
+ if (b < 0)
+ a--, b *= -1;
+ for (p = list, c = 0; p; p = p->next, c++)
+ ;
+ dprint (4, (debugfile, "maildir: inode sorting took %.6f secs for %d
entries\n", a + b/1e6, c));
+
+ return list;
+}
+#else
+#define maildir_sort_inode(L) _maildir_sort_inode(L,(size_t)(-1))
+#endif
+
+#endif /* HAVE_DIRENT_D_INO */
/*
* This function does the second parsing pass for a maildir-style
* folder.
*/
-void maildir_delayed_parsing (CONTEXT * ctx, struct maildir *md,
+void maildir_delayed_parsing (CONTEXT * ctx, struct maildir **md,
progress_t *progress)
{
- struct maildir *p;
+ struct maildir *p, *last = NULL;
char fn[_POSIX_PATH_MAX];
int count;
+#if HAVE_DIRENT_D_INO
+ int sort = 0;
+#endif
#if USE_HCACHE
header_cache_t *hc = NULL;
@@ -1001,10 +1063,13 @@ void maildir_delayed_parsing (CONTEXT *
hc = mutt_hcache_open (HeaderCache, ctx->path, NULL);
#endif
- for (p = md, count = 0; p; p = p->next, count++)
+ for (p = *md, count = 0; p; p = p->next, count++)
{
if (! (p && p->h && !p->header_parsed))
+ {
+ last = p;
continue;
+ }
if (!ctx->quiet && progress)
mutt_progress_update (progress, count, -1);
@@ -1018,7 +1083,22 @@ void maildir_delayed_parsing (CONTEXT *
#if USE_HCACHE
if (option(OPTHCACHEVERIFY))
+ {
+#if HAVE_DIRENT_D_INO
+ if (!sort && option (OPTSORTINODE))
+ {
+ dprint (4, (debugfile, "maildir: need to sort by inode\n"));
+ p = maildir_sort_inode (p);
+ if (!last)
+ *md = p;
+ else
+ last->next = p;
+ sort = 1;
+ snprintf (fn, sizeof (fn), "%s/%s", ctx->path, p->h->path);
+ }
+#endif
ret = stat(fn, &lastchanged);
+ }
else {
lastchanged.st_mtime = 0;
ret = 0;
@@ -1029,6 +1109,20 @@ void maildir_delayed_parsing (CONTEXT *
p->h = mutt_hcache_restore ((unsigned char *)data, &p->h);
maildir_parse_flags (p->h, fn);
} else
+ {
+#endif
+#if HAVE_DIRENT_D_INO
+ if (!sort && option (OPTSORTINODE))
+ {
+ dprint (4, (debugfile, "maildir: need to sort by inode\n"));
+ p = maildir_sort_inode (p);
+ if (!last)
+ *md = p;
+ else
+ last->next = p;
+ sort = 1;
+ snprintf (fn, sizeof (fn), "%s/%s", ctx->path, p->h->path);
+ }
#endif
if (maildir_parse_message (ctx->magic, fn, p->h->old, p->h))
{
@@ -1039,8 +1133,10 @@ void maildir_delayed_parsing (CONTEXT *
} else
mutt_free_header (&p->h);
#if USE_HCACHE
+ }
FREE (&data);
#endif
+ last = p;
}
#if USE_HCACHE
mutt_hcache_close (hc);
@@ -1099,16 +1195,12 @@ int mh_read_dir (CONTEXT * ctx, const ch
mh_update_maildir (md, &mhs);
mhs_free_sequences (&mhs);
}
-#ifdef USE_INODESORT
-
- md = maildir_sort_inode(md);
-#endif /* USE_INODESORT */
if (ctx->magic == M_MAILDIR)
{
snprintf (msgbuf, sizeof (msgbuf), _("Reading %s..."), ctx->path);
mutt_progress_init (&progress, msgbuf, M_PROGRESS_MSG, ReadInc, count);
- maildir_delayed_parsing (ctx, md, &progress);
+ maildir_delayed_parsing (ctx, &md, &progress);
}
maildir_move_to_context (ctx, &md);
@@ -1898,7 +1990,7 @@ int maildir_check_mailbox (CONTEXT * ctx
maildir_update_tables (ctx, index_hint);
/* do any delayed parsing we need to do. */
- maildir_delayed_parsing (ctx, md, NULL);
+ maildir_delayed_parsing (ctx, &md, NULL);
/* Incorporate new messages */
have_new = maildir_move_to_context (ctx, &md);
diff --git a/mutt.h b/mutt.h
--- a/mutt.h
+++ b/mutt.h
@@ -431,6 +431,7 @@ enum
OPTSCORE,
OPTSIGDASHES,
OPTSIGONTOP,
+ OPTSORTINODE,
OPTSORTRE,
OPTSPAMSEP,
OPTSTATUSONTOP,