Re: Q: View as Windows-1252?
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
On Friday, August 3 at 11:29 PM, quoth Kai Grossjohann:
>> The other one is... well, downright malicious! Out of curiosity,
>> what mail client is composing messages mislabelled utf8 like that?
>
> I confess that I have no idea. Actually, I already had a value of
> assumed_charset and of charset, perhaps that did it. I had:
>
> set charset=utf8
> set assumed_charset=utf-8:windows-1252:iso-8859-1
>
> Perhaps the order of windows-1252 and iso-8859-1 was reversed. I
> thought that this was a smart move, because if decoding as UTF-8 works,
> then it's probably going to be UTF-8.
Ahhh, no, you're misunderstanding. Think of it this way: the computer
sees email as just an array of numbers. We like to think of them as
letters, but they're just numbers. The trick, of course, is that the
computer has to decide what to display on screen for each number, and
the problem is that the same number means different things in
different charsets. So it can do a test and see "does this number mean
something in this charset?". Better yet, "do all the numbers in this
email mean something in this charset?" Thus, if there's a number that
doesn't mean something in the charset (or means something
undisplayable), it can say "aha! this is the wrong charset".
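To make that test concrete, here's a rough sketch in Python (not mutt's actual code; the helper name decodes_cleanly is made up for illustration). It just asks whether every byte in a message means something in a given charset:

```python
def decodes_cleanly(raw: bytes, charset: str) -> bool:
    """Return True if every byte in raw means something in charset."""
    try:
        raw.decode(charset)
        return True
    except UnicodeDecodeError:
        # At least one number has no meaning in this charset.
        return False


plain = b"Hello, world"
accented = b"caf\xe9"  # 0xE9 is e-acute in windows-1252

print(decodes_cleanly(plain, "us-ascii"))         # True
print(decodes_cleanly(accented, "us-ascii"))      # False
print(decodes_cleanly(accented, "windows-1252"))  # True
```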
UTF-8 uses almost the entire set of numbers. In other words, almost
*any* possible number is valid in UTF-8, and virtually every
unlabelled email you get will thus be treated as if it were UTF-8, even
though chances are most of them aren't UTF-8. Now, there are caveats to
that, because UTF-8 requires specific sequences of numbers for anything
outside the us-ascii range (so a message can be detected as not being
UTF-8 in some cases), but most of the time, most English-speaking folks
don't use characters that require those sequences.
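A small Python sketch of that caveat (again, only an illustration, and is_valid_utf8 is a hypothetical helper): a message that sticks to plain English passes as UTF-8, while one with an accented letter encoded as windows-1252 does not, because the lone high byte isn't part of a proper UTF-8 sequence.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True if raw is a well-formed UTF-8 byte sequence."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


ascii_only = b"plain English text"
cp1252_text = "caf\u00e9".encode("windows-1252")  # b"caf\xe9"
utf8_text = "caf\u00e9".encode("utf-8")           # b"caf\xc3\xa9"

print(is_valid_utf8(ascii_only))   # True: us-ascii bytes are valid UTF-8
print(is_valid_utf8(cp1252_text))  # False: lone 0xE9 is an incomplete sequence
print(is_valid_utf8(utf8_text))    # True
```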
What you want to do instead with assumed_charset is to have it go in
order of restriction. Start with us-ascii---that's what most English
emails are sent in anyway, and it's also the most restrictive charset.
If your email contains a number that's not in that charset, then mutt
will know to try a different charset. Windows-1252 is a superset of
us-ascii, so next it will try that. If that works, great; if not, then
it's time to check for utf-8.
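Here's a sketch of that fall-through in Python. It is an assumption about the general idea, not how mutt implements it; guess_charset is a made-up name, and the colon-separated list just mimics the assumed_charset syntax. The first charset in which every byte decodes wins:

```python
from typing import Optional


def guess_charset(raw: bytes, assumed_charset: str) -> Optional[str]:
    """Return the first charset in the colon-separated list that decodes raw."""
    for name in assumed_charset.split(":"):
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue  # some byte means nothing here; try the next charset
        return name
    return None


ORDER = "us-ascii:windows-1252:utf-8"

print(guess_charset(b"hello", ORDER))     # us-ascii
print(guess_charset(b"caf\xe9", ORDER))   # windows-1252
print(guess_charset(b"\xd0\x81", ORDER))  # utf-8
```

The last case only reaches utf-8 because Python's windows-1252 codec leaves byte 0x81 undefined. Most UTF-8 byte sequences also decode (as garbage) under windows-1252, so in this sketch the earlier entry in the list usually decides the label.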
Obviously, this isn't perfect, but the whole point of assumed_charset
is to be somewhat better at guessing the *correct* charset for
unlabeled emails.
It's also worth considering what the most common cases are. In the
English-speaking world, MOST email is sent in either us-ascii or
windows-1252. Sometimes it's iso-8859-1, and sometimes it's been
mislabelled as iso-8859-1. People using good email clients will label
their charset, but those who use very old or very poorly-written
clients might not (or might mislabel their charsets). These mail
clients are unlikely to be doing anything complicated with their
charsets, and are most likely to be assuming that everyone in the
world uses some basic charset (such as windows-1252, or us-ascii).
Mail clients that have put the time and effort into actually
supporting utf-8 tend to be aware of the problem of unlabelled
charsets, so it's highly unlikely that you'd find a UTF-8-encoded
message that was not labelled as UTF-8.
There are exceptions everywhere, of course, but those are your common
cases.
>> Hmm, that's a badly worded man-page entry. I think it means one of
>> two things (both of which are, I think, true): either it's saying
>> that only the first charset that is valid for the message will be
>> used (i.e. if windows-1252 is a valid way of interpreting the
>> message, utf-8 will not be tried---this is especially important for
>> asian charsets, where in most cases there's no way to tell if the
>> charset produced random garbage or not),
>
> Hm. But surely the same thing applies to the header? So why was it
> explicitly talking about the message body?
Like I said: it's badly worded. The same thing applies to the header
as well.
> And perhaps Mutt was putting utf-8 there after Ctrl-E because that
> was the first entry in assumed_charset.
Huh. Possible. I've never paid enough attention to that detail of the
magic mutt pulls for bad email.
> But then, why didn't it try the whole list in the first place? Then
> it would have discovered the correct charset and wouldn't have
> displayed question marks for the non-ascii characters.
Indeed. Well, you may be dealing with a slightly different problem
then. Sometimes the question marks are mutt's doing, and sometimes
they're from your terminal (i.e. mutt told it to display character X,
but the terminal's font doesn't have a picture of that character, so
the terminal puts up an "I have no idea" character).
If mutt is having trouble, it will do one of two things: either it'll
replace the trouble character with three question marks (rare), or
it'll display the octal value of the character preceded by a
backslash, like this: \334
If it's doing anything else, then it's probably your terminal, and not
mutt, that's throwing up its hands in dismay.
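If you want to see where a value like \334 comes from, here's a rough Python sketch of the backslash-octal form. It only illustrates the notation (show_undisplayable is a hypothetical helper, and the printable range is my assumption, not mutt's rule):

```python
def show_undisplayable(raw: bytes) -> str:
    """Render printable us-ascii bytes as-is and everything else as \\ooo."""
    out = []
    for b in raw:
        if 0x20 <= b < 0x7F:
            out.append(chr(b))
        else:
            # Three-digit octal value of the byte, preceded by a backslash.
            out.append("\\%03o" % b)
    return "".join(out)


# 0xDC is U-umlaut in windows-1252 and iso-8859-1; 0xDC is 334 in octal.
print(show_undisplayable(b"M\xdcNCHEN"))  # M\334NCHEN
```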
> Very strange situation. Apologies for not investigating the situation
> fully before asking here.
NP
~Kyle
- --
A government which robs Peter to pay Paul can always depend on the
support of Paul.
-- George Bernard Shaw
-----BEGIN PGP SIGNATURE-----
Comment: Thank you for using encryption!
iD8DBQFGs6fxBkIOoMqOI14RAgKgAJ93valSKvF6NZqvgCTFrbBMRh6lYACg1XRY
lPct0We6cTC3jx+Cb7fSWtw=
=b1SG
-----END PGP SIGNATURE-----