Re: Smarter send_charset

To: Mutt dev ml <mutt-dev@xxxxxxxx>
Subject: Re: Smarter send_charset
From: Alain Bench <veronatif@xxxxxxx>
Date: Tue, 06 Sep 2005 16:02:14 +0200 (CEST)
In-reply-to: <20050906030954.GA9606@xxxxxxxxxxxxxx>
List-unsubscribe: <mailto:mutt-dev-request@mutt.org?body=unsubscribe>
Mail-followup-to: Mutt dev ml <mutt-dev@xxxxxxxx>
References: <20050906030954.GA9606@xxxxxxxxxxxxxx>
Sender: owner-mutt-dev@xxxxxxxx
User-agent: Mutt/1.4i-ja.1

Hello Ryan,

 On Monday, September 5, 2005 at 11:09:54 PM -0400, Ryan King wrote:

> default send_charset is "us-ascii:iso-8859-1:utf-8". From that list,
> "Mutt will use the first character set into which the text can be
> converted exactly." I'm struggling to think of any way the utf-8
> encoding will be selected - because all bitpatterns from the smallest
> 0x00 to the grandest 0xFF are valid ISO-8859-1 (as far as I know).

    You missed a basic point: The origin charset is known. It's
$charset. Trying each $charset to $send_charset conversion in turn until
one succeeds completely is easy when you know $charset.

    Example: $charset is CP-850, $send_charset is default. If you write
an e acute, it's a byte 0x82. Mutt tries in turn:

| $ printf "\x82" | iconv -f cp850 -t us-ascii | hex
| iconv: (stdin): cannot convert
| $ printf "\x82" | iconv -f cp850 -t iso-8859-1 | hex
| E9

    ...Mutt selects "iso-8859-1", and upon sending will convert CP-850
to Latin-1 and send a byte E9. Fine: That's a Latin-1 e acute.

    If you write a semi-graphic U+256C "BOX DRAWINGS DOUBLE VERTICAL AND
HORIZONTAL", it's a byte 0xCE. Mutt tries in turn:

| $ printf "\xCE" | iconv -f cp850 -t us-ascii | hex
| iconv: (stdin): cannot convert
| $ printf "\xCE" | iconv -f cp850 -t iso-8859-1 | hex
| iconv: (stdin): cannot convert
| $ printf "\xCE" | iconv -f cp850 -t utf-8 | hex
| E2 95 AC

    ...Mutt selects "utf-8", and upon sending will convert CP-850 to
UTF-8 and send three bytes E2 95 AC. Fine: That's a UTF-8 double cross.
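
    The same selection loop can be sketched in a few lines of Python.
This is only an illustration of the logic described above, not Mutt's
actual code: it uses Python's built-in codecs instead of iconv, the
function name pick_send_charset is made up for the example, and the
charset is spelled "cp850" because that is the name Python knows.

```python
def pick_send_charset(raw, charset, send_charset):
    """Return (name, converted_bytes) for the first exact conversion.

    raw          -- bytes as typed, encoded in the local charset
    charset      -- the known origin charset (Mutt's $charset)
    send_charset -- colon-separated candidates (Mutt's $send_charset)
    """
    # The origin charset is known, so decoding involves no guessing.
    text = raw.decode(charset)
    for candidate in send_charset.split(":"):
        try:
            # Strict encoding fails on any character the candidate
            # charset cannot represent exactly.
            return candidate, text.encode(candidate)
        except UnicodeEncodeError:
            continue  # not exact: try the next candidate
    raise ValueError("no charset in the list can represent the text")


default = "us-ascii:iso-8859-1:utf-8"

# e acute, byte 0x82 in CP-850
print(pick_send_charset(b"\x82", "cp850", default))
# box drawing U+256C, byte 0xCE in CP-850
print(pick_send_charset(b"\xce", "cp850", default))
```

    The first call should pick "iso-8859-1" with the single byte E9,
the second "utf-8" with the three bytes E2 95 AC, matching the iconv
runs above.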


    Knowledge of the origin $charset makes the difference between a
guessing heuristic that can fail (the $file_charset feature, for
example), and this sure, deterministic $send_charset process.


> though the following line is going to be valid UTF-8, my client will
> lie to you all about the charset being used:

    The umlauts were valid Latin-1, and your message was labelled
accordingly. All is well.


> change the default send_charset to "us-ascii:utf-8:iso-8859-1".

    Putting UTF-8 first would mask the iso-8859-1 that follows it: One
of US-ASCII or UTF-8 will *always* match. Latin-1 will never be
selected. It will not even be tried.
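
    A small Python sketch makes the masking visible. Again this is an
illustration using Python's codecs, not Mutt's code; first_exact and
the three sample strings are invented for the example.

```python
def first_exact(text, send_charset):
    """Name of the first charset in the list that encodes text exactly."""
    for name in send_charset.split(":"):
        try:
            text.encode(name)  # strict: raises if any character is missing
            return name
        except UnicodeEncodeError:
            pass
    return None


# Plain ASCII, an e acute, and the box drawing character U+256C.
samples = ["plain", "\u00e9", "\u256c"]

default = [first_exact(s, "us-ascii:iso-8859-1:utf-8") for s in samples]
proposed = [first_exact(s, "us-ascii:utf-8:iso-8859-1") for s in samples]

print(default)   # order shipped as the default
print(proposed)  # order with utf-8 moved before iso-8859-1
```

    With the proposed order the e acute is already caught by "utf-8",
so "iso-8859-1" never shows up in the result.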


> isn't it time to start making UTF-8 the default everywhere?

    UTF-8 is fine when needed, but forcing its use needlessly makes
messages bigger and reduces their chances of being read correctly by old
or underpowered clients. So the answer is: No, it's not time to force
UTF-8.
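
    The size difference is easy to check. A quick Python comparison,
with a sample word of my own choosing (any Latin-1 text with accented
letters would do):

```python
# A word with two non-ASCII Latin-1 letters: u umlaut and sharp s.
word = "Gr\u00fc\u00dfe"

latin1_bytes = word.encode("iso-8859-1")
utf8_bytes = word.encode("utf-8")

# Latin-1 uses one byte per character; UTF-8 needs two bytes for each
# of the two non-ASCII letters.
print(len(word), len(latin1_bytes), len(utf8_bytes))
```

    Every non-ASCII Latin-1 character costs one extra byte in UTF-8.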


Bye!    Alain.
-- 
set honor_followup_to=yes in muttrc is the default value, and makes your
list replies go where the original author wanted them to go: Only to the
list, or with a private copy.
