
Re: The future of mailboxes?



I thought you wanted to keep these responses off-list. If not, fine.

On Sunday, August 13 at 05:39 PM, quoth Nicolas George:
> By the way, are there technical notes somewhere about what API it is necessary to implement when writing a new storage format for mutt?

My understanding is that mutt is, unfortunately, not modularized like that. The integration with things like IMAP support and mbox/maildir support is fairly complex and not abstracted out.

> It is widely admitted that keeping the same information twice in two different formats is a bad thing, because it leads, sooner or later, to copies that are out of sync, and therefore needs synchronisation tools and so on.

For a "bad thing" it also happens to be extremely popular. Virtually every modern computer has, for example, not one but two levels of cache between CPU and main memory. Some have three, and I believe the Pentium 4 hides a semi-secret fourth level. There are, indeed, rather complex cache coherency algorithms, and it's something that hardware designers fret about a lot, but it's something worth doing in most cases.

What helps is the designation that one is a *cache* and the other is *authoritative*. This is different from what they taught you in database design class, where data duplication was bad because everything is equally authoritative.

Additionally, if you're presuming that one can only ever access the cache with a client that supports it (e.g. the mail client you wish to design), you can easily ensure that the cache is *never* out of sync with the main mail storage.
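
To make that concrete, here is a minimal sketch of the idea. This is not mutt's header cache; the struct and function names are invented. The cache records the size and mtime of the mbox when it was built, and gets thrown away and rebuilt the moment the authoritative file no longer matches:

#include <stdbool.h>
#include <sys/stat.h>
#include <time.h>

struct header_cache {
    off_t  mbox_size;    /* size of the mbox when the cache was built  */
    time_t mbox_mtime;   /* mtime of the mbox when the cache was built */
    /* ... cached header data would live here ... */
};

/* The mbox on disk is authoritative; trust the cache only if the mbox
 * has not changed since the cache was written. */
static bool cache_is_fresh(const char *mbox_path,
                           const struct header_cache *hc)
{
    struct stat st;

    if (stat(mbox_path, &st) != 0)
        return false;    /* can't even stat the mbox: rebuild the cache */

    return st.st_size == hc->mbox_size && st.st_mtime == hc->mbox_mtime;
}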

> The only point where several files may behave faster is if there are a lot of huge attachments and the task is to read only the headers. And that is only true if the mbox file has no reliable Content-Length field.

The mbox format itself does not require a Content-Length field (it is merely a convenience), and thus, for a general-purpose mail reader, you must assume that you NEVER have a reliable Content-Length field: it is an optimization, nothing more.
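
In code, the consequence is just a linear scan (a hypothetical scanner, not mutt's parser; it ignores ">From " quoting and assumes lines fit in the buffer). The only thing you can count on in mbox is the "From " separator at the start of each message; a trustworthy Content-Length would merely let you fseek() past a body instead of reading through it:

#include <stdio.h>
#include <string.h>

/* Count the messages in an mbox by looking for the "From " separator
 * at the start of each line. */
static long count_mbox_messages(const char *path)
{
    char line[8192];
    long count = 0;
    FILE *fp = fopen(path, "r");

    if (!fp)
        return -1;

    while (fgets(line, sizeof line, fp))
        if (strncmp(line, "From ", 5) == 0)
            count++;

    fclose(fp);
    return count;
}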

> Furthermore, if the kernel itself is doing the reading, it has absolutely all the information it needs. In fact, if you read the documentation for, let us say, the Linux sendfile function, "in_fd must correspond to a file which supports mmap()-like operations", which strongly suggests that sendfile is nothing more than mmap+write bundled together.

The implementation is irrelevant, what matters is that the kernel does it all: I say "kernel, send this file" and the kernel goes and does it and tells me when it's done. The kernel could be doing a tight read+write loop for all I care, but the real reason it wants file descriptors that support mmap()-like operations is that it wants to be able to have any and all of the data necessary for sending at once, so that it can optimize its sending strategy, and also so that it can seek backward in the file to handle retransmissions. The kernel, knowing where all of the pieces of that file are, can then simply set up a DMA transfer from the disk to the NIC.
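
For reference, the "kernel, send this file" call looks roughly like this on Linux. The names are placeholders, error handling is trimmed, and real code would loop because sendfile may transfer fewer bytes than requested:

#include <sys/sendfile.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

/* Hand an entire file to the kernel for sending over a connected
 * socket, and let the kernel pick the transfer strategy. */
static ssize_t send_whole_file(int sock_fd, const char *path)
{
    struct stat st;
    off_t offset = 0;
    ssize_t sent;
    int in_fd = open(path, O_RDONLY);

    if (in_fd < 0)
        return -1;
    if (fstat(in_fd, &st) < 0) {
        close(in_fd);
        return -1;
    }

    /* in_fd has to be something that supports mmap()-like operations,
     * i.e. a regular file rather than a socket or a pipe. */
    sent = sendfile(sock_fd, in_fd, &offset, (size_t)st.st_size);

    close(in_fd);
    return sent;
}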


>> In theory, yes, but in practice, no, they do not have exactly the same consequences on databases and filesystems: databases tend to avoid writing to disk whenever possible, while filesystems are frequently designed for reliability when (say) the power goes out.

> Do you have reasons to think that database designers are stupid or do not care about reliability?

I've had far too many databases corrupt themselves into oblivion because the power went out. I've had filesystems get upset because the power went out, but nothing I couldn't fix with a good fsck; worst case: I lost a couple files.

If database designers are not stupid and care dearly about reliability, then tell me: why is MyISAM still the default storage engine for MySQL? It supports none of the reliability features of InnoDB, BerkeleyDB, or Gemini, nor does it support transactions. Is that supposed to make me supremely confident that they have data integrity in the face of unexpected failure as their number one priority? No, of course not: that tells me that they have SPEED as their number one priority.

> For tables which are just linear storage of blobs of data, they adopt a very filesystem-like data structure, and achieve the same performance as filesystems.

MySQL stores BLOBs in tables of 2000-byte rows. Pretending for a moment that this is essentially equivalent to a block on disk, where did that number come from? It's not a convenient power of two, and it has no relation to the 4k, 8k, or 16k block sizes of most filesystems, nor to the 4k page size of most memory systems. Maybe if you somehow convinced MySQL to format up its own partition, you could argue that it could make the low-level block size 2000 bytes, but who are we kidding here?

Assuming that the BLOB (let's say, a 24k mail message like the one you just sent) got inserted at the same time as some other BLOB did, those rows may be interleaved (or MySQL may choose to interleave them for some other reason). Reading linearly through them will be slower from MySQL's database than from a filesystem, because you've got to map in twice as many pages from disk. And, of course, it has to go through the filesystem anyway, because the database is just a file on disk. You've got twice as much overhead, in the *best* case, because the database has to figure out where all the parts of the blob are within your file, and the OS has to figure out where all the parts of the file are too.

Given the 2000-byte block size (which doesn't vary depending on the size of the database, like block sizes in filesystems do), I'm not exactly supremely confident that MySQL is going to structure its BLOB storage tables to optimize the jumping around in the file so as to minimize the jumping around in the filesystem it's implemented on top of.
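
Back-of-the-envelope, using the numbers above (2000-byte rows, a 24k message, 4k pages) and assuming, for illustration, the worst case where every other row belongs to some other BLOB:

#include <stdio.h>

int main(void)
{
    const long msg_bytes  = 24 * 1024;  /* the 24k mail message          */
    const long row_bytes  = 2000;       /* the BLOB row size cited above */
    const long page_bytes = 4096;       /* a typical VM/filesystem page  */

    long rows = (msg_bytes + row_bytes - 1) / row_bytes;          /* 13 */

    /* Contiguous file: the message occupies msg_bytes on disk. */
    long file_pages = (msg_bytes + page_bytes - 1) / page_bytes;  /*  6 */

    /* Fully interleaved with another BLOB: the same rows are spread
     * across twice that span of the table file. */
    long span = 2 * rows * row_bytes;
    long db_pages = (span + page_bytes - 1) / page_bytes;         /* 13 */

    printf("plain file: %ld pages, interleaved table: %ld pages\n",
           file_pages, db_pages);
    return 0;
}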

> For tables with more cross-references, they adopt a more complex data structure; but in that case, filesystems just could not have done the job.

Well, good for them. But we're talking about storing mail, which filesystems do very well.

~Kyle
--
Life is too important to be taken seriously.
                                                       -- Oscar Wilde
