
Re: The future of mailboxes?



On sextidi 26 Thermidor, year CCXIV, Dave wrote:
> >I thought about doing a postgres backend for mutt myself, but my first
> >coding experience in mutt showed me that perhaps an experienced mutt
> >programmer would make more headway. :)

I believe SQLite would be a better choice than PostgreSQL or MySQL:
PostgreSQL- or MySQL-based solutions would require installing a server (most
people do not have one already) and setting up a database and a database
account. SQLite is just a library embedding the SQL engine, and the database
itself is a single file.

I would really like to hear your preliminary thoughts about that project,
especially about the database structure.

As for me, I think that for the sake of compatibility, it would be better to
keep the mail itself in old-fashioned storage (or at least be able to do
it), and use the database only for the caching of metadata.
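
To make the idea concrete, here is a minimal sketch of such a metadata
cache, written in Python with the standard sqlite3 module. The table layout,
the column names and the helper names (open_cache, cache_message,
unread_subjects) are illustrative assumptions, not anything that exists in
mutt. The mail itself stays in its mbox file or maildir; the table only
records where each message lives and the few fields needed to build an index.

```python
import sqlite3

# Hypothetical schema: one row per message, pointing back at the real storage.
SCHEMA = """
CREATE TABLE IF NOT EXISTS message (
    id         INTEGER PRIMARY KEY,
    folder     TEXT NOT NULL,      -- path of the mbox file or maildir
    location   TEXT NOT NULL,      -- byte offset (mbox) or file name (maildir)
    message_id TEXT,
    sender     TEXT,
    subject    TEXT,
    date_utc   INTEGER,            -- seconds since the epoch
    seen       INTEGER NOT NULL DEFAULT 0,
    UNIQUE (folder, location)
);
CREATE INDEX IF NOT EXISTS message_by_date ON message (folder, date_utc);
"""

def open_cache(path=":memory:"):
    """Open (or create) the cache; the whole database is a single file."""
    db = sqlite3.connect(path)
    db.executescript(SCHEMA)
    return db

def cache_message(db, folder, location, message_id, sender, subject,
                  date_utc, seen=False):
    """Insert or refresh the cached metadata of one message."""
    db.execute(
        "INSERT OR REPLACE INTO message "
        "(folder, location, message_id, sender, subject, date_utc, seen) "
        "VALUES (?, ?, ?, ?, ?, ?, ?)",
        (folder, str(location), message_id, sender, subject, date_utc,
         int(seen)),
    )
    db.commit()

def unread_subjects(db, folder):
    """Return the subjects of unread messages, oldest first."""
    rows = db.execute(
        "SELECT subject FROM message "
        "WHERE folder = ? AND seen = 0 ORDER BY date_utc",
        (folder,),
    )
    return [subject for (subject,) in rows]
```

Since the cache holds nothing that cannot be rebuilt from the mail files, it
can simply be deleted and regenerated if it is ever in doubt.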


By the way, are there technical notes somewhere about which API must be
implemented when writing a new storage format for mutt?



On quintidi 25 Thermidor, year CCXIV, Kyle Wheeler wrote:
> Uh, yeah, that’s unfounded suspicion on your part, I think. It makes 
> perfect sense that someone would want to know what the most recently 
> arrived messages are, ESPECIALLY in situations like your INBOX. I 
> mean, if you have more than, oh, say, 20 messages in your INBOX, 
> sorting by sender or subject or what-have-you seems pretty pointless.  
> Just cognitively, that makes no sense. All that’s gonna happen is 
> you’re going to miss messages because they weren’t in a convenient 
> spot for looking at them.

The question is not delivery order here, but read/unread status. This is
something that maildir specifically does well, I must admit.

But anyway, that is only relevant for the live, recent mail, while I am
mainly speaking of archives.

> The actual choice of which mail to read is generally based exclusively 
> on meta-data (unless you’re doing full-content searches for text 
> strings, hardly a common-case). And THAT, I agree, is perfect data for 
> caching in a little database (i.e. what mutt does when hcache is 
> enabled).

It is widely accepted that keeping the same information twice in two
different formats is a bad thing, because it leads, sooner or later, to
copies that are out of sync, and therefore to the need for synchronisation
tools and so on.

Thus, if part of your mail is in a database, then having all your mail in
the database is a sensible and straightforward choice.

It may be a little more efficient to do otherwise, and in this particular
case we both agree that it is, if for different reasons; but putting
everything in the database is not at all stupid. And storing the mail itself
in files has drawbacks too, mainly the added complexity of ensuring that the
database and the filesystem data are in sync.
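
As an illustration of that extra complexity, here is a small Python sketch
of the kind of check a cache would need before trusting its own contents.
The function names (fingerprint, is_stale) are hypothetical, and comparing
size and modification time is only a cheap heuristic chosen for the example:
it can miss a change that keeps the size and lands within the timestamp
granularity.

```python
import os

def fingerprint(path):
    """Return a cheap (size, mtime in ns) fingerprint of a mail file."""
    st = os.stat(path)
    return (st.st_size, st.st_mtime_ns)

def is_stale(path, cached_fingerprint):
    """True if the file changed (or vanished) since the cache was filled."""
    try:
        return fingerprint(path) != cached_fingerprint
    except FileNotFoundError:
        return True
```

The fingerprint would be stored next to the cached metadata, and every entry
whose file reports is_stale would have to be parsed again.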

> Not true; quite often you are limited by the serialization of reads 
> with computation.

Normally, operating systems have read-ahead mechanisms, and they work very
well for sequential reading, like what is done with mbox-style files. If you
try to read such a file from two points at the same time, you will just force
a lot of seeks on the disk and kill its performance (unless you have RAID).

On the other hand, there is no such thing as read-ahead for directories with
a lot of files. In that case, having several threads will indeed help. But
it will always be worse than simple sequential reading of a single file.

The only case where several files may be faster is when there are a lot of
huge attachments and the task is to read only the headers. And that is only
true if the mbox file has no reliable Content-Length field.
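
The role of Content-Length can be sketched in a few lines of Python. This is
only an illustration under the assumption stated above, namely that every
message carries a correct Content-Length header; scan_headers is a
hypothetical name, and folded header lines are simply skipped to keep the
example short.

```python
def scan_headers(path):
    """Yield (offset, headers) for each message, seeking over the bodies."""
    with open(path, "rb") as f:
        while True:
            offset = f.tell()
            line = f.readline()
            if not line:
                return                  # end of file
            if not line.startswith(b"From "):
                continue                # e.g. blank separator between messages
            headers = {}
            for raw in iter(f.readline, b""):
                if raw in (b"\n", b"\r\n"):
                    break               # blank line: end of the headers
                if raw[:1] in b" \t":
                    continue            # folded continuation line, ignored here
                name, sep, value = raw.partition(b":")
                if sep:
                    key = name.strip().lower().decode("ascii", "replace")
                    headers[key] = value.strip().decode("utf-8", "replace")
            # Trust Content-Length and jump over the body without reading it.
            f.seek(int(headers.get("content-length", 0)), 1)
            yield offset, headers
```

With a reliable Content-Length, a multi-megabyte attachment costs one seek
instead of a full read. Without the header the loop falls back to scanning
the body line by line for the next "From " separator.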

You can do benchmarks. I did mine, and I found reading a big mbox file was
ten times faster than reading the same mail in maildir, but the circumstances
were not optimal and the test is very crude. If you want to try it, just
remember you must flush the disk cache before each round, or you will only
be testing your memory bandwidth.
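
A crude benchmark of this kind could look like the following Python sketch.
It is not the test that produced the figure above; the function names
(read_mbox, read_maildir, timed) are made up for the example. The sketch
does not flush the disk cache itself: on Linux that can be done as root by
running sync and then writing 3 to /proc/sys/vm/drop_caches before each
round.

```python
import os
import time

def read_mbox(path, chunk=1 << 20):
    """Read one big file sequentially; return the number of bytes read."""
    total = 0
    with open(path, "rb") as f:
        while True:
            data = f.read(chunk)
            if not data:
                return total
            total += len(data)

def read_maildir(path):
    """Open and read every message file in cur/ and new/."""
    total = 0
    for sub in ("cur", "new"):
        folder = os.path.join(path, sub)
        if not os.path.isdir(folder):
            continue
        for entry in os.scandir(folder):
            if entry.is_file():
                with open(entry.path, "rb") as f:
                    total += len(f.read())
    return total

def timed(reader, path):
    """Run reader(path) and return (bytes read, elapsed seconds)."""
    start = time.perf_counter()
    total = reader(path)
    return total, time.perf_counter() - start
```

For a fair comparison both layouts should hold the same messages, for
instance timed(read_mbox, "archive.mbox") against timed(read_maildir,
"archive.maildir"), with the cache flushed between the two calls.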

> Mmap()’d data is not equivalent to sendfile(); mmap still does a 
> context switch into the kernel for all pages that aren’t currently in 
> memory, it’s just very efficient at it because it does it in the data 
> size with the least overhead. So you have to go into the kernel once 
> to get the data mapped in, and then again when you do an fwrite(). The 
> primary advantage is portability, not speed.

With read-ahead, page faults are very rare during a sequential read.
Furthermore, if the kernel itself is doing the reading, it has absolutely
all the information it needs. In fact, the documentation for, let us say, the
Linux sendfile function states that "in_fd, must correspond to a file which
supports mmap()-like operations", which strongly suggests that sendfile is
nothing more than mmap+write bundled together.

And if you start counting the overhead of just one system call, you had
probably better reconsider a scheme where one open+stat is needed for every
single mail.
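
The two copy paths under discussion can be put side by side in a short
Python sketch. It assumes Linux, where os.sendfile accepts a regular file as
destination; the function names are hypothetical, and the sketch only shows
that both paths move the same bytes, it measures nothing.

```python
import mmap
import os

def copy_mmap_write(src_path, dst_path):
    """Map the source into memory, then hand the mapping to write()."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        size = os.fstat(src.fileno()).st_size
        if size == 0:
            return 0                    # mmap refuses to map an empty file
        with mmap.mmap(src.fileno(), 0, access=mmap.ACCESS_READ) as m:
            dst.write(m)                # pages are faulted in as they are read
        return size

def copy_sendfile(src_path, dst_path):
    """Let the kernel move the bytes; user space never touches them."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        size = os.fstat(src.fileno()).st_size
        offset = 0
        while offset < size:
            sent = os.sendfile(dst.fileno(), src.fileno(), offset,
                               size - offset)
            if sent == 0:
                break                   # source shrank under us
            offset += sent
        return offset
```

Either way the cost is a handful of system calls for the whole file, to be
compared with the per-message open and stat of a one-file-per-mail layout.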

> In theory, yes, but in practice, no, they do not have exactly the same 
> consequences on databases and filesystems: databases tend to avoid 
> writing to disk whenever possible, while filesystems are frequently 
> designed for reliability when (say) the power goes out.

Do you have reasons to think that database designers are stupid or do not
care about reliability?

> I know: that was my point. Often people point to “hey, database 
> replication!” to support the claim that a database is much more 
> reliable than a filesystem.

The point is that it is not less reliable.

> If the implementation details are all exactly identical, then why wish 
> for storing mail in a database rather than a filesystem? Your logic 
> defeats itself. Yes, some databases have similar data structures, but 
> that hardly makes them indistinguishable.

When you install a server, you adapt the filesystems to the task: you mount
filesystems tuned differently on different mount points, to achieve the best
performance.

Databases do exactly the same, but in a much more flexible way
(hot-resizeable partitions and filesystems are more recent than databases,
and databases have a wider range of possible data structures).

For tables which are just linear storage of blobs of data, they adopt a very
filesystem-like data structure, and achieve the same performance as
filesystems. For tables with more cross-references, they adopt a more
complex data structure; but in that case, filesystems just could not have
done the job.

> I suppose we think of different things when we say “spool”. When I say 
> spool I think of /var/spool/mail/: the places where our *new* mail 
> gets delivered.

I was referring to the whole union of all the mail of a particular user,
regardless of its age or storage place.


Regards,

-- 
  Nicolas George
