Re: Charset Issue



-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Alain can correct me if I'm wrong about any of this. :)

On Wednesday, March 12 at 04:02 PM, quoth Jorge Luis:
>set charset="iso-8859-1"

Setting the $charset manually is usually a bad idea.

>satyr's environment includes LANG=en_US.UTF-8; yekk's is
>LANG=en_US.ISO8859-1.

And THAT is a good example of WHY setting the $charset manually is a 
bad idea. iso-8859-1 is not byte-compatible with UTF-8: outside the 
ASCII range, the same bytes mean different things in each.
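
To make that concrete, here is a minimal Python sketch (Python is just 
my choice for illustration; mutt does nothing of the sort internally) 
that encodes the same character both ways. The helper name 
octal_escape is made up for this example.

```python
# The same character, a-acute (U+00E1), encoded in the two charsets.
ch = "\u00e1"

latin1_bytes = ch.encode("iso-8859-1")   # one byte
utf8_bytes = ch.encode("utf-8")          # two bytes

def octal_escape(data: bytes) -> str:
    """Render every byte as a C-style octal escape (hypothetical helper)."""
    return "".join("\\%03o" % b for b in data)

print(octal_escape(latin1_bytes))   # the ISO-8859-1 form
print(octal_escape(utf8_bytes))     # the UTF-8 form
```

The first line printed is the \341 you see in your =Sent folder; the 
second is the \303\241 from your .sig.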

>My .sigs include an "a" with an acute accent (C octal escaped UTF-8:
>\303\241).  Sigs on both machines were created with emacs in a Latin-1
>language environment.
>
>Mail sent from satyr to yekk comes across with the accented glyph
>properly displayed.

That is most likely due to your editor, not mutt. Your editor is being 
fed a file with a UTF-8 character, and is recognizing that it needs to 
convert that character into the charset specified by $LANG.

>The headers include:
>
>Content-Type: text/plain; charset=iso-8859-1
>Content-Transfer-Encoding: quoted-printable
>
>The same mail, when viewed in satyr's =Sent folder, shows an escaped
>character (\341) in place of the á, even though the headers of the
>saved mail are the same.

The octal value 0341 (or 0xE1 in hex) is the encoding of á in 
ISO-8859-1. So, the headers are all correct, and so is what you sent. 
Here's the thing, though: on satyr (the UTF-8 environment), mutt reads 
that and thinks that it doesn't need to do ANY conversion to display 
properly. Which isn't true: 0xE1 means something very different 
(namely, it means you have a malformed character) in a UTF-8 
environment. So when validation of that character for display (via 
ncurses or whatever your mutt uses, software that does not see mutt's 
$charset setting but instead sees the LANG environment variable) 
*FAILS*, mutt falls back to the ASCII-only escaped version. If mutt 
were aware that it was operating in a UTF-8 environment, it would 
convert that character and successfully display it.
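
A rough sketch of that fallback, in Python. This is NOT mutt's actual 
code: display_form is a hypothetical helper, and the escaping rule is 
my assumption about what an ASCII-only fallback looks like.

```python
def display_form(raw: bytes, display_charset: str) -> str:
    """Decode bytes for display; escape everything if they are malformed.

    Hypothetical helper for illustration, not taken from mutt.
    """
    try:
        return raw.decode(display_charset)
    except UnicodeDecodeError:
        # ASCII-only fallback: keep printable ASCII, octal-escape the rest.
        return "".join(
            chr(b) if 32 <= b < 127 else "\\%03o" % b
            for b in raw
        )

print(display_form(b"\xe1", "iso-8859-1"))  # valid in a Latin-1 environment
print(display_form(b"\xe1", "utf-8"))       # malformed in a UTF-8 environment
```

The same byte displays fine under ISO-8859-1 and comes out escaped 
under UTF-8, which is the difference between yekk and satyr.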

>The only way I can get the accented character to display properly on
>both machines is to create a utf-8 encoded signature and set
>allow_8bit on satyr so that there's no qp encoding of the mail.

Ask yourself this, though: how does your terminal handle malformed 
characters? Many terminals fall back to displaying the malformed 
pieces of characters as if they were ISO-8859-1 characters. When mutt 
doesn't have to decode quoted-printable, it doesn't verify every 
character letter-by-letter, and instead just does a wholesale 
conversion from the character set the mail is labeled as (iso-8859-1,  
in this case) to $charset (so, no change is made in this case) and 
dumps the result to the terminal. Your terminal sees the 0xE1 byte, 
recognizes that it is a malformed character, and does its fallback: 
pretends that the byte is an ISO-8859-1 character. What you're seeing 
is not "correct"; you're seeing your terminal's error-recovery mode. 
:)
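
That error-recovery behavior can be imitated in a few lines of Python. 
This is only a sketch, assuming the terminal reinterprets each 
malformed byte as ISO-8859-1; real terminals differ, and the names 
latin1_fallback and terminal_render are invented here.

```python
import codecs

def latin1_fallback(err):
    """Error handler: show undecodable bytes as ISO-8859-1 characters."""
    bad = err.object[err.start:err.end]
    return bad.decode("iso-8859-1"), err.end

# Register the handler under a made-up name so decode() can use it.
codecs.register_error("latin1-fallback", latin1_fallback)

def terminal_render(stream: bytes) -> str:
    """Imitate a UTF-8 terminal that recovers from malformed input."""
    return stream.decode("utf-8", errors="latin1-fallback")

print(terminal_render(b"Adri\xc3\xa1n"))  # well-formed UTF-8
print(terminal_render(b"Adri\xe1n"))      # malformed, recovered as Latin-1
```

Both inputs end up looking the same on screen, which is why the 
second case can pass for "working".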

The difference here is, I think, that when mutt is decoding 
quoted-printable, it checks whether each decoded character is 
displayable, while when displaying messages that are not 
quoted-printable-encoded, it does not check each and every byte 
(because that would take too long).
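
For reference, the quoted-printable step itself is simple: =E1 on the 
wire decodes to the single byte 0xE1, which only then gets interpreted 
according to the charset label. A small Python illustration using the 
standard quopri module (the sample body is invented, and this says 
nothing about how mutt implements the step):

```python
import quopri

# A made-up body as it travels on the wire, labeled charset=iso-8859-1.
wire_body = b"Adri=E1n"

raw = quopri.decodestring(wire_body)   # undo the quoted-printable layer
text = raw.decode("iso-8859-1")        # interpret per the Content-Type label

print(raw)
print(text)
```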

>Is the global LANG environment variable overriding the charset that's
>set in muttrc?

Think of it this way: your terminal is going to accept a specific 
character set. In order for your applications to know what the 
terminal will display, they rely on LANG (and all the other related 
environment variables). When you specify a $charset manually, you're 
telling mutt to ignore LANG, but it ignores it at its own peril. What 
happens 
is that mutt then tries to display characters that are valid in 
$charset---but they may be invalid characters as far as the terminal 
is concerned.

To use another analogy, imagine that your terminal only outputs 
characters through a square hole. LANG indicates "square hole". 
However, you've set $charset to "round hole". That $charset setting 
doesn't change the fact that your terminal only outputs characters 
through a square hole. Thus, as mutt runs, it says "ah, $charset says 
that I'm pushing characters through a round hole" and so mutt converts 
all characters to look like round pegs. But no matter whether mutt 
believes the hole to be round, it is actually square, and you can't 
fit round pegs through a square hole. It would have been better if you 
allowed mutt to set $charset itself, because then it could 
automatically detect whether it needs to output characters as square 
pegs or round pegs. Does that make sense?
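
If it helps, this is roughly what "detecting the hole shape" amounts 
to. The Python sketch below is a simplification of my own: real 
programs normally ask the C library (nl_langinfo(CODESET) after 
setlocale) rather than parsing LANG by hand, and LC_ALL and LC_CTYPE 
take precedence over LANG. charset_from_lang is a hypothetical name.

```python
import codecs

def charset_from_lang(lang: str, default: str = "us-ascii") -> str:
    """Guess a charset from a LANG-style value such as 'en_US.UTF-8'."""
    _, dot, codeset = lang.partition(".")
    if not dot:
        return default                   # e.g. LANG=C carries no codeset
    codeset = codeset.split("@", 1)[0]   # drop a modifier such as @euro
    return codecs.lookup(codeset).name   # normalize via Python's codec registry

print(charset_from_lang("en_US.UTF-8"))      # satyr
print(charset_from_lang("en_US.ISO8859-1"))  # yekk
```

The point is that the answer comes from the environment the terminal 
lives in, not from a value typed into muttrc.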

>How can I set mutt to use the iso-8859-1 charset?  Am
>I missing something obvious?

The question that comes to my mind is: what are you trying to achieve?

~Kyle
- -- 
Come to me, son of Jor-El. Kneel before Zod. Snootchie-bootchies.
                                                                 -- Jay
-----BEGIN PGP SIGNATURE-----
Comment: Thank you for using encryption!

iEYEARECAAYFAkfYSJgACgkQBkIOoMqOI17dmQCg2RKYbKFQ6MBPdjhkWna8Lgym
fZcAn2wSf/ixu0NP9MhCqc8k12kOYuM8
=QhaE
-----END PGP SIGNATURE-----