
Re: bug#1876: mutt-1.5.6i: Mutt doesn't handle invalid characters when replying to a mail



On 2004-09-30 20:32:19 +0200, Alain Bench wrote:
>     [ BTS to mutt-dev forwarding seems broken: Half mails ]
>     [don't appear (as your last 3). Hence the CC mutt-dev.]

I didn't know that. :(

>  On Saturday, May 29, 2004 at 11:24:12 AM +0200, Vincent Lefèvre wrote:
> 
>     [raw headers]
> > I don't want the assumed_charset here, but question marks (or similar
> > valid characters).
> 
>     What do you mean by « similar valid characters »?

The question mark was just an example. It could be a dot, or something
else, possibly in the whole Unicode character set.

> > how could I *automatically* know the correct encoding?
> 
>     No way. Perhaps some form of external hint can help guessing with
> good probability: Domain, mailing list, MUA, whatever. But guessing is
> not knowing.

OK, so I don't want any character replacement based on any guess
(except possibly for file attachments, where the charset is almost
always missing, and could even be detected from the contents, e.g.
for XML with the XML prolog).
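
For the XML case, a minimal sketch of reading the declared charset
from the prolog could look like the following. The function name
xml_declared_charset is hypothetical, and the sketch assumes the
declaration sits on the first line, has no byte order mark before it,
and uses double quotes:

```shell
# Hypothetical helper: print the encoding declared in an XML prolog.
# Reads the document on stdin and prints nothing if no encoding is declared.
# Assumption: the declaration is on line 1 and quotes the value with ".
xml_declared_charset() {
    sed -n '1s/^<?xml[^>]*encoding="\([^"]*\)".*/\1/p'
}

# Usage: xml_declared_charset < attachment.xml
```

This only reports what the file declares; the declaration itself can
still be wrong.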

>  On Monday, May 31, 2004 at 11:56:06 AM +0200, Vincent Lefèvre wrote:
> 
>     [1252 mislabelled Latin-1]
> > On 2004-05-22 16:00:11 +0200, Alain Bench wrote:
> >> Is there really a problem? Here I get the expected question marks
> > It is a problem, because I get invalid characters (instead of question
> > marks), and emacs is confused by them.

BTW, since then, I changed my editor configuration. I added the
following code to my script that calls emacs:

# Remove invalid characters
recode -f "..ucs,ucs.." "$1"

However, it would be better not to rely on such configuration. Mutt
should not propagate invalid sequences.
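
As a rough sketch of the kind of check meant here (not the recode
call above, but a plain validity test done with iconv instead), a
file can be tested before it reaches the editor. The function name
is_valid_utf8 is hypothetical, and the sketch assumes the file is
supposed to be UTF-8:

```shell
# Hypothetical helper: succeed only if the file is well-formed UTF-8.
# Converting UTF-8 to UTF-8 makes iconv fail on the first invalid
# or incomplete sequence, so only its exit status is used.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1
}

# Usage: is_valid_utf8 "$1" || echo "invalid sequences in $1" >&2
```

Unlike recode -f, this changes nothing in the file; it only reports
whether something is wrong.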

>     Understood. This depends on terminal's $charset. If iconv can't
> convert the char (from MIME charset to $charset), you get question mark.
> If iconv can, you get the converted char in editor.

This doesn't seem to be the case in practice, when the *source* is
invalid.

>     Latin-1 chars in the zone 0x80-0x9F are not invalid, but defined as
> control chars. And they are perfectly convertable to UTF-8. Example with
> Latin-1 0x80 U+0080 PADDING CHARACTER (PAD).
> 
> | $ echo -ne "\200" | iconv -f iso-8859-1 -t utf-8 | hex
> | C2 80
> 
>     And glibc-2.3.2/localedata/charmaps/UTF-8 table confirms:
> 
> | <U0080>     /xc2/x80     PADDING CHARACTER (PAD)

Hmm... OK, on this example, perhaps Emacs was a bit misleading, even
broken. For instance, when you add a 0x80 character to a latin-1 file,
Emacs no longer recognizes the file as latin-1, but as raw text.

Now, no-one seems to use them as control sequences, and I don't think
that such control sequences should be present in mail messages. And
they aren't printable characters, are they?

>     Iconv tells Mutt it's convertable. Mutt gives the valid UTF-8
> converted char to the editor. Why is editor confused?

Because Emacs does charset recognition. Well, no big problem here with
the above example, IIRC. But the following has already happened: Mutt
gave iso-8859-1 characters to the editor in a UTF-8 locale, and that
was bad. IIRC, the mail I replied to was written in iso-8859-1 but
didn't declare the charset. Something like that. The consequence was
that my message contained mixed iso-8859-1 and utf-8 sequences!
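
To illustrate what such a mix looks like at the byte level, here is a
small sketch. The helper hex_bytes and the sample word are made up
for the illustration; they are not taken from the actual message:

```shell
# Hypothetical helper: print stdin as one lowercase hex string.
hex_bytes() {
    od -An -tx1 | tr -d ' \n'
}

# "résumé" with the first é as iso-8859-1 (0xE9) and the second
# as utf-8 (0xC3 0xA9), i.e. both encodings in the same text.
printf 'r\351sum\303\251' | hex_bytes
# prints 72e973756dc3a9
```

The lone e9 byte followed by 73 is not a valid utf-8 sequence, while
c3 a9 is, so no single charset label describes the whole text.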

> > iso-8859-1 isn't windows-1252; the problem needs to be reported to
> > the sender.
> 
>     I fully agree. You seem to consider Mutt as a problem detector.
> Some may prefer to use Mutt as an efficient MUA, workarounding or
> even hiding probs as much as possible, and use another tool to
> detect problems.

No, Mutt must detect problems, so that it doesn't propagate them.
I also don't want Mutt to do non-standard things I haven't asked for
(like trying to fix the charset, since it can't do that reliably).

>     [raw headers]
[...]
>     Confirmed. But someone has made a yet unreleased patch for these
> problems that works perfectly so far: Unconvertables from first
> $assumed_charset are ?-masked, and thus no more parsing problems. And
> this doesn't break possibility to set multiple charsets.

Where is this patch, please?

-- 
Vincent Lefèvre <vincent@xxxxxxxxxx> - Web: <http://www.vinc17.org/>
100% accessible validated (X)HTML - Blog: <http://www.vinc17.org/blog/>
Work: CR INRIA - computer arithmetic / SPACES project at LORIA