
Re: e-mail encoding/formatting (was Re: Split-screen mode in mutt?)



Thus spake Kyle Wheeler on Tue, May 09, 2006 at 06:05:37PM -0400 or 
thereabouts: <kyle-mutt@xxxxxxxxxxxxxx> [2006-05-09 18:49]:
> On Monday, May  8 at 10:53 PM, quoth cga2000:
> >> vim, less, od, etc. do not decode quoted-printable encoding. They 
> >> edit/view files just as they are.
> >> 
> > not the quoted-printable ("electronic mail") encoding.. but vim & 
> > less at least would necessarily be able to handle UTF-8 encoding. 
> > Hence my using "od" to display the hex contents of the file/message. 
> > I thought this was the only way I could visualize it without any 
> > rendering software tampering with it.
> >
> > Or am I missing something?
> 
> Well, the thing is that whether your mail contains actual UTF-8 bytes 
> or whether your mail contains only 7-bit ascii bytes determines what 
> vim sees.
> 
> Put another way: the actual email that gets sent, in its raw form, is 
> in "quoted-printable" encoding. That means that it actually contains 
> the ascii strings that look like "=20" and "=E2=80=99". That is not 
> UTF-8 encoding, that's just a string of ascii characters. If you edit 
> the raw mail file, that's what vim sees.
> 
> Mutt, however, knows that they are quoted-printable encodings for 
> bytes, and as such represent things, and so can transform the "raw" 
> (or "undecoded") form into a "decoded" form where things like 
> =E2=80=99 are replaced with bytes with the hexadecimal values 
> indicated. If the "decoded" form is saved to a file, vim will see 
> those bytes and will be able to interpret them as a UTF-8-encoded 
> character.
> 
> In other words, if you run vim/od/less/whatever on the "undecoded" 
> ("raw") email, you will see the quoted-printable strings rather than 
> the UTF-8 characters. If, on the other hand, you run 
> vim/od/less/whatever on the "decoded" email, you will see the UTF-8 
> bytes. This "decoding" is what is controlled, for example, by mutt's 
> $pipe_decode variable. If you set pipe_decode and then pipe a message 
> to (for example) less, you should see the three bytes with those 
> values. If you unset pipe_decode and then pipe a message to less, you 
> should see the nine-byte string "=E2=80=99" (for example) instead.
> 
> Mutt stores all mail in "raw" or "undecoded" format, and decodes it 
> every time you view it. So, saving mail to a file and then editing 
> that file will show you the long =E2=80=99 form.
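
The raw-versus-decoded distinction can also be seen outside mutt. Below
is a minimal sketch in Python, assuming its standard "email" module. The
sample message and its address are made up for illustration, and it
makes no claim about how mutt does this internally:

```python
import email
from email import policy

# A made-up raw message, stored the way it travels: 7-bit ASCII only.
RAW = (
    b"From: sender@example.org\n"
    b"Subject: test\n"
    b"MIME-Version: 1.0\n"
    b"Content-Type: text/plain; charset=utf-8\n"
    b"Content-Transfer-Encoding: quoted-printable\n"
    b"\n"
    b"it=E2=80=99s\n"
)

msg = email.message_from_bytes(RAW, policy=policy.default)

# "Undecoded" view: the body still holds the literal ASCII string =E2=80=99.
raw_body = msg.get_payload()

# "Decoded" view: the transfer encoding is undone, leaving the UTF-8 bytes.
decoded_bytes = msg.get_payload(decode=True)

# Fully rendered text: the bytes interpreted according to the charset.
text = msg.get_content()

print(repr(raw_body))
print(repr(decoded_bytes))
print(repr(text))
```

The first print corresponds to what an editor sees in the stored mbox,
the second to what a decoded pipe or save would contain.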

I wasn't sure how/if you could use mutt to actually save a message - I
think you can save the message to a separate mbox by pressing 's' (by
default) in the pager or index, right?

So what I did - and it made better sense anyway - was that I hit 'L' as
if I wanted to reply to the message. This caused mutt to start a vim
session on the message and then I was able to issue a vim/ex ":w
/tmp/msg" command to save the message to a file. So I must have saved a
copy of the "decoded" message that contained just the 3-byte UTF-8
encoding.
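
The nine-bytes-versus-three-bytes point can be checked without mutt at
all. This is only a sketch, assuming Python's standard "quopri" module;
the variable names are mine:

```python
import quopri

# The nine ASCII characters as they appear in the raw (undecoded) mail.
encoded = b"=E2=80=99"

# Undo the quoted-printable transfer encoding.
decoded = quopri.decodestring(encoded)

# Interpret the resulting bytes as UTF-8 to get the actual character.
char = decoded.decode("utf-8")

print(len(encoded), len(decoded))   # byte counts before and after decoding
print(decoded.hex())                # the hex values od would show
print(repr(char))
```

Running od on the file I saved from vim should show the same three hex
values.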
> 
> Does that help?

Crystal-clear explanation..!! It provides the missing piece of the
puzzle - that mail is en/de-coded at each end. I sort-of knew about this
but I did not see the implications and the penny just did not drop.

One of the difficulties of learning all this in a live environment is
that a garbled display may be caused by mutt, or by other things at my
end not being set up correctly (locale, font, bad procmail rewrite
rules). But if I understand the situation correctly, it could also be
caused by a bad setup at the other end (the sender's machine), or even
on some relaying machine. So it's not just a case of saying "this
message does not display correctly here in mutt, so let's investigate
my setup and fix it", because other things over which I have no control
may be at fault. It is not easy to tell the difference.

Am I right in assuming this?

Lastly, I will need to set aside at least one week-end to review and
test my entire setup in detail. Are there any useful docs that are more
implementation-oriented (I run a Linux box) than the RFCs and that
would cover the encodings & content header aspects?

Thank you very much for going to all that trouble setting me on the
right track.

cga