
Re: Q: View as Windows-1252?



-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Friday, August  3 at 11:29 PM, quoth Kai Grossjohann:
>> The other one is... well, downright malicious! Out of curiosity, 
>> what mail client is composing messages mislabelled utf8 like that?
>
> I confess that I have no idea.  Actually, I already had a value of 
> assumed_charset and of charset, perhaps that did it.  I had:
>
> set charset=utf8 
> set assumed_charset=utf-8:windows-1252:iso-8859-1
>
> Perhaps the order of windows-1252 and iso-8859-1 was reversed.  I 
> thought that this was a smart move, because if decoding as UTF-8 works, 
> then it's probably going to be UTF-8.

Ahhh, no, you're misunderstanding. Think of it this way: the computer 
sees email as just an array of numbers. We like to think of them as 
letters, but they're just numbers. The trick, of course, is that the 
computer has to decide what to display on screen for each number, and 
the problem is that the same number means different things in 
different charsets. So it can do a test and see "does this number mean 
something in this charset?". Better yet, "do all the numbers in this 
email mean something in this charset?" Thus, if there's a number that 
doesn't mean something in the charset (or means something 
undisplayable), it can say "aha! this is the wrong charset".
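
Here's a rough Python sketch of that test, just to make the idea 
concrete. It is not what mutt does internally; the function name 
decodes_cleanly is made up for the example, and "cp1252" is simply 
Python's name for windows-1252.

```python
def decodes_cleanly(data: bytes, charset: str) -> bool:
    """Return True if every number in data means something in charset."""
    try:
        data.decode(charset)
    except UnicodeDecodeError:
        return False
    return True


raw = b"caf\xe9"  # the last number is 233

print(decodes_cleanly(raw, "ascii"))   # 233 is outside us-ascii
print(decodes_cleanly(raw, "cp1252"))  # 233 is e-acute in windows-1252

# The same number can mean different things in different charsets:
# 164 is the generic currency sign in iso-8859-1 and the euro sign in
# iso-8859-15. Printing the code points keeps the output plain ASCII.
print(hex(ord(b"\xa4".decode("iso-8859-1"))))
print(hex(ord(b"\xa4".decode("iso-8859-15"))))
```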

UTF-8 accepts every number below 128 exactly as us-ascii does. In 
other words, any message made up only of plain ASCII is also valid 
UTF-8, and virtually every unlabelled English email you get will thus 
be treated as if it was UTF-8, even though chances are most of them 
weren't written as UTF-8. Now, there are caveats to that, because 
UTF-8 requires specific sequences of numbers for anything from 128 up 
(so a message can be detected as not being UTF-8 in some cases), but 
most of the time, most English-speaking folks don't use characters 
that require specific sequences of numbers.
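
To illustrate the "specific sequences" caveat, here is a small Python 
sketch (again only an illustration; is_valid_utf8 is a hypothetical 
helper, not anything from mutt):

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data is a well-formed UTF-8 byte sequence."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


plain = b"Hello, world"               # every number is below 128
latin = "caf\u00e9".encode("cp1252")  # b"caf\xe9": a lone 233
proper = "caf\u00e9".encode("utf-8")  # b"caf\xc3\xa9": 195 then 169

print(is_valid_utf8(plain))   # plain ASCII always passes
print(is_valid_utf8(latin))   # a lone 233 is not a complete sequence
print(is_valid_utf8(proper))  # the two-number sequence is well formed
```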

What you want to do instead with assumed_charset is to have it go in 
order of restriction. Start with us-ascii---that's what most English 
emails are sent in anyway, and it's also the most restrictive charset. 
If your email contains a number that's not in that charset, then mutt 
will know to try a different charset. Windows-1252 is a superset of 
us-ascii, so next it will try that. If that works, great, if not, then 
it gets to be time to check for utf-8.
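
As a sketch of "try each charset in order and take the first one that 
fits", assuming Python's codecs stand in for mutt's own conversion 
(guess_charset and ORDER are names I made up for this example):

```python
from typing import Optional, Sequence


def guess_charset(data: bytes, candidates: Sequence[str]) -> Optional[str]:
    """Return the first candidate charset that decodes data without error."""
    for charset in candidates:
        try:
            data.decode(charset)
        except UnicodeDecodeError:
            continue  # this charset has no meaning for some number; try the next
        return charset
    return None


# us-ascii, then windows-1252 ("cp1252" in Python), then utf-8
ORDER = ["ascii", "cp1252", "utf-8"]

print(guess_charset(b"plain text", ORDER))    # fits us-ascii already
print(guess_charset(b"caf\xe9", ORDER))       # needs windows-1252
print(guess_charset(b"\xe2\x80\x9d", ORDER))  # 157 is undefined in cp1252
```

Keep in mind that the first charset that fits wins: a message whose 
numbers all happen to be defined in windows-1252 stops there, and 
utf-8 never gets a look.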

Obviously, this isn't perfect, but the whole point of assumed_charset 
is to be somewhat better at guessing the *correct* charset for 
unlabeled emails.

It's also worth considering what the most common cases are. In the 
English-speaking world, MOST email is sent in either us-ascii or 
windows-1252. Sometimes it's iso-8859-1, and sometimes it's been 
mislabelled as iso-8859-1. People using good email clients will label 
their charset, but those who use very old or very poorly-written 
clients might not (or might mislabel their charsets). These mail 
clients are unlikely to be doing anything complicated with their 
charsets, and are most likely to be assuming that everyone in the 
world uses some basic charset (such as windows-1252, or us-ascii). 
Mail clients that have put the time and effort into actually 
supporting utf-8 tend to be aware of the problem of unlabelled 
charsets, so it's highly unlikely that you'd find a UTF-8-encoded 
message that was not labelled as UTF-8.

There are exceptions everywhere, of course, but those are your common 
cases.

>> Hmm, that's a badly worded man-page entry. I think it means one of 
>> two things (both of which are, I think, true): either it's saying 
>> that only the first charset that is valid for the message will be 
>> used (i.e. if windows-1252 is a valid way of interpreting the 
>> message, utf-8 will not be tried---this is especially important for 
>> asian charsets, where in most cases there's no way to tell if the 
>> charset produced random garbage or not),
>
> Hm.  But surely the same thing applies to the header?  So why was it 
> explicitly talking about the message body?

Like I said: it's badly worded. The same thing applies to the header 
as well.

> And perhaps Mutt was putting utf-8 there after Ctrl-E because that 
> was the first entry in assumed_charset.

Huh. Possible. I've never paid enough attention to that detail of the 
magic mutt pulls for bad email.

> But then, why didn't it try the whole list in the first place?  Then 
> it would have discovered the correct charset and wouldn't have 
> displayed question marks for the non-ascii characters.

Indeed. Well, you may be dealing with a slightly different problem 
then. Sometimes the question marks are mutt's doing, and sometimes 
they're from your terminal (i.e. mutt told it to display character X, 
but the terminal's font doesn't have a picture of that character, so 
the terminal puts up an "I have no idea" character).

If mutt is having trouble, it will do one of two things: either it'll 
replace the trouble character with three question marks (rare), or 
it'll display the octal value of the character preceded by a 
backslash, like this: \334
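
To show where a value like that comes from, here's a tiny Python 
sketch of octal escaping. It is not mutt's code, and show_octal is a 
made-up name; it just renders anything outside printable us-ascii as 
a backslash plus three octal digits.

```python
def show_octal(data: bytes) -> str:
    """Keep printable us-ascii as-is; show every other byte as \\ooo."""
    out = []
    for byte in data:
        if 0x20 <= byte < 0x7F:
            out.append(chr(byte))
        else:
            out.append("\\%03o" % byte)
    return "".join(out)


# 220 (0xDC) is U-umlaut in windows-1252 and iso-8859-1; 220 in octal is 334.
print(show_octal(b"\xdcber"))  # prints \334ber
```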

If it's doing anything else, then it's probably your terminal, and not 
mutt, that's throwing up its hands in dismay.

> Very strange situation.  Apologies for not investigating the situation 
> fully before asking here.

NP

~Kyle
- -- 
A government which robs Peter to pay Paul can always depend on the 
support of Paul.
                                                -- George Bernard Shaw
-----BEGIN PGP SIGNATURE-----
Comment: Thank you for using encryption!

iD8DBQFGs6fxBkIOoMqOI14RAgKgAJ93valSKvF6NZqvgCTFrbBMRh6lYACg1XRY
lPct0We6cTC3jx+Cb7fSWtw=
=b1SG
-----END PGP SIGNATURE-----