
Re: More on non-ascii chars in headers



-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Thursday, September 27 at 05:50 PM, quoth Eyolf Østrem:
>> 3. Along similar lines, windows-1252 contains the entire set of 
>> possible values, 0 to 255, and has a character assigned to each. 
>> Thus, no email will *ever* not match windows-1252. The way mutt 
>> figures out that a message isn't in a specific character set is if 
>> there are values in the message that aren't valid in the character 
>> set. For example, in Latin-1, values 0x00 through 0x1F are unused; 
>> thus if they appear in an email, it cannot be encoded in Latin-1. 
>> Windows-1252 may not always be the *right* character set, but 
>> there's no way for mutt to know that.
>
> But should I still remove utf8 from that list?

Yes!

> What if I receive a message with characters which are NOT in 
> Windows-1252 but in utf8?

That's technically impossible. Emails are sent as bytes, not as 
characters. Messages don't have "characters" until someone renders the 
byte sequence into pictures of letters on the screen through the use 
of a byte-to-symbol mapping known as a "character set".

> Will they then still match Windows-1252 but with the wrong 
> characters?

Precisely.

An email (heck, any file, for that matter) is not actually 
"characters". As far as the computer (or mutt) is concerned, an email 
is just a sequence of bytes. Every byte in the world is a number from 
0 to 255. When attempting to render an email (i.e. transform a 
sequence of bytes into characters that people can read), your email 
client must use a lookup table to match byte values to the characters 
that they refer to. This lookup table is called a "character set". The 
windows-1252 character set maps every value (0-255) to a character. 
Thus, *ANY* arbitrary sequence of bytes can be interpreted as 
windows-1252. Put another way, there is no way to tell, just by 
looking at it, that a given sequence of bytes is not text encoded in 
windows-1252. By contrast, in latin-1 (aka iso-8859-1), not all values 
are valid, so if a message (aka sequence of bytes) contains any of 
those invalid values, mutt knows that the message cannot be encoded 
with the latin-1 character set.
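
To make the "lookup table" idea concrete, here is a minimal sketch in 
Python. The two validity rules are assumptions made for illustration 
(they are not mutt's or iconv's actual tables): latin-1 is treated as 
having no characters at 0x80 through 0x9F, which ISO 8859-1 leaves 
unassigned, and windows-1252 is treated as covering every byte value. 
The function names are made up.

```python
# Illustrative validity rules. These are assumptions for the sake of the
# example, not the tables mutt or iconv actually use.

def latin1_assigns(byte: int) -> bool:
    # Assumption: 0x80-0x9F are invalid, since ISO 8859-1 assigns no
    # characters to that range.
    return not (0x80 <= byte <= 0x9F)

def cp1252_assigns(byte: int) -> bool:
    # Assumption: windows-1252 covers the whole byte range.
    return True

def could_be(data: bytes, assigns) -> bool:
    """Return True if no byte in data rules the character set out."""
    return all(assigns(b) for b in data)

every_byte = bytes(range(256))
curly_quotes = b"\x93quoted\x94"   # windows-1252 curly quotes at 0x93/0x94

print(could_be(every_byte, cp1252_assigns))    # True
print(could_be(curly_quotes, cp1252_assigns))  # True
print(could_be(curly_quotes, latin1_assigns))  # False
print(could_be(b"caf\xe9", latin1_assigns))    # True
```

The only way a check like this can say "no" is by finding a byte the 
table has no entry for. A table with an entry for everything can never 
say "no".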

UTF-8 is a "multi-byte" character set, which means that you cannot use 
a single byte to represent all of its characters. Thus, some byte 
values are reserved to indicate "this is the first byte of a 
multi-byte character". However, every *single byte* in UTF-8 is a 
valid windows-1252 character.
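
As a rough sketch of what "reserved" means here, the following Python 
classifies a byte value by the role it can play in UTF-8, using the 
byte ranges from the UTF-8 definition (RFC 3629). The function names 
are made up for this example. The second helper simply asks Python's 
own strict decoder whether a byte sequence is well-formed UTF-8.

```python
def utf8_byte_role(byte: int) -> str:
    """Classify one byte value by the role it can play in UTF-8."""
    if byte <= 0x7F:
        return "single"        # a complete one-byte (ASCII) character
    if byte <= 0xBF:
        return "continuation"  # 0x80-0xBF: second or later byte
    if 0xC2 <= byte <= 0xDF:
        return "lead-2"        # first byte of a two-byte character
    if 0xE0 <= byte <= 0xEF:
        return "lead-3"        # first byte of a three-byte character
    if 0xF0 <= byte <= 0xF4:
        return "lead-4"        # first byte of a four-byte character
    return "invalid"           # 0xC0, 0xC1, 0xF5-0xFF never occur in UTF-8

def is_valid_utf8(data: bytes) -> bool:
    """Return True if Python's strict decoder accepts data as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

encoded = "\u00e9".encode("utf-8")            # e-acute, bytes C3 A9
print([utf8_byte_role(b) for b in encoded])   # ['lead-2', 'continuation']
print(is_valid_utf8(encoded))                 # True
print(is_valid_utf8(b"caf\xe9"))              # False: lead byte, nothing after
```

So UTF-8 has structure that a byte sequence can violate, which is what 
makes "this is not UTF-8" a detectable condition.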

If someone took a utf-8-encoded email (read: sequence of bytes) and 
handed it to a file reader that only understood windows-1252, it would 
get rendered; it would just look wrong. Take, for example, this character: ☺
That character is not in windows-1252. In UTF-8, that character is 
encoded as three bytes, with the values 226, 152, and 186 (or, in 
hexadecimal, 0xE2, 0x98, and 0xBA). If this sequence of bytes is not 
labeled as utf-8, that character is indistinguishable from the 
windows-1252 letters â˜º. Now, those three letters look like junk to 
you and me, but mutt doesn't know that, and can't.
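
The smiley example can be reproduced with Python's built-in codecs, as 
a quick sketch. Python's name for windows-1252 is "cp1252"; the 
variable names are just for illustration.

```python
smiley = "\u263a"                    # WHITE SMILING FACE
raw = smiley.encode("utf-8")         # the bytes that actually get sent

print(list(raw))                     # [226, 152, 186]
print(raw.hex(" "))                  # e2 98 ba

# The same three bytes read through the windows-1252 table:
misread = raw.decode("cp1252")
print(len(misread))                  # 3
print([hex(ord(c)) for c in misread])  # ['0xe2', '0x2dc', '0xba']

# Read through the right table, the original character comes back:
print(raw.decode("utf-8") == smiley)   # True
```

Neither decode raises an error. Both are "valid"; only one is what the 
sender meant.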

~Kyle

P.S. Officially, windows-1252 has five unassigned byte values (0x81, 
0x8D, 0x8F, 0x90, and 0x9D), but Windows does actually treat them as 
valid control characters, so in practice, windows-1252 has the entire 
byte range covered.
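
For the curious, a short Python sketch that asks a decoder which 
single byte values it refuses. This reflects Python's own codec tables 
only, which follow the official assignments; it says nothing about 
what Windows, iconv, or mutt accept. The helper name is made up.

```python
def undecodable_bytes(codec: str) -> list:
    """List the single byte values that the named codec refuses to decode."""
    bad = []
    for value in range(256):
        try:
            bytes([value]).decode(codec)
        except UnicodeDecodeError:
            bad.append(value)
    return bad

# Python's strict windows-1252 table:
print([hex(v) for v in undecodable_bytes("cp1252")])

# Python's latin-1 codec, for comparison:
print([hex(v) for v in undecodable_bytes("latin-1")])
```

How strict a decoder is about unassigned values depends on the 
implementation, so results like these are specific to the tool.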
- -- 
Many who claim to have been transformed by Christ's love are deeply, 
even murderously, intolerant of criticism.
                                                          -- Sam Harris
-----BEGIN PGP SIGNATURE-----
Comment: Thank you for using encryption!

iD8DBQFG++bCBkIOoMqOI14RAiboAJ4pBegXksMpe70CwEpYtCNkprMk7wCfZSWN
GZQRrXkIgGPL+ALI9yKzK84=
=xv/I
-----END PGP SIGNATURE-----