Re: bug#1876: mutt-1.5.6i: Mutt doesn't handle invalid characters when replying to a mail
On 2004-10-02 21:56:26 +0200, Alain Bench wrote:
> On Saturday, October 2, 2004 at 12:02:39 PM +0200, Vincent Lefèvre wrote:
> [raw headers]
> > I don't want any character replacement based on any guess
>
> OK. You *will* have full ?-masking by setting $strict_mime=no and
> default $assumed_charset=us-ascii. Granted, it doesn't work yet.
OK.
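For the record, I suppose the corresponding lines in .muttrc would be
something like this (assuming a version where $assumed_charset is
available):

  set strict_mime=no
  set assumed_charset="us-ascii"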
> > [no guess] except possibly for file attachments, where the charset is
> > almost always missing
>
> Incoming attachments, or local files you attach to send? If the
> latter, what do you think about the $file_charset feature?
Incoming attachments. I think this can already be done with filters
configured in the .mailcap file.
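For instance, a purely hypothetical entry for text attachments known to
really be in latin-1 could look like this (the type and the iconv call
are just placeholders for whatever filter is appropriate):

  text/x-whatever; iconv -f iso-8859-1 -t utf-8 %s; copiousoutput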
> > could even be detected from the contents, e.g. for XML with the XML
> > prolog
>
> Hum... Not generalisable.
It depends on the file. With LaTeX files, one can detect the charset
when the file uses the inputenc package, e.g.
  \usepackage[latin1]{inputenc}
In text/* files, there's sometimes some information that can be used
(e.g. from Emacs' local variables, when they are present?).
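As a rough illustration only (it's just a sketch, a real detection would
have to cope with options lists, comments, and so on), the encoding
could be extracted from such a LaTeX file with something like:

$ sed -n 's/.*\\usepackage\[\([^]]*\)\]{inputenc}.*/\1/p' file.tex
latin1

(assuming file.tex contains the \usepackage line above).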
> [bytes 80-9F with label Latin-1]
> > I added the following code to my script that calls emacs:
> >| # Remove invalid characters
> >| recode -f "..ucs,ucs.." "$1"
>
> What's that? Converts locale's UTF-8 to Unicode and back?
Yes. When the -f option is used, invalid sequences are removed during
a conversion. But in order to have a conversion, the destination
charset must be different from the source charset. That's why I need
an intermediate charset (and I never use ucs).
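With a UTF-8 locale, this should be equivalent to writing the request
explicitly, since an empty charset in a recode request just means the
locale's charset:

$ recode -f "utf-8..ucs,ucs..utf-8" file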
> Applied on a file already iconved by Mutt from label Latin-1 to
> locale's UTF-8?
Not exactly, see the example below.
> Am I missing something, or will this just pass the said inv^Wcontrol
> characters through?
>
> | $ echo -ne "\200" | iconv -f iso-8859-1 | recode -f "..ucs,ucs.." | hex
> | C2 80
This wasn't for the control characters, but for latin-1 characters that
could be found in UTF-8 files incorrectly generated by Mutt/iconv. I've
finally found the message in my mail archives (it was in a temporary
archive, that's why I didn't find it earlier...). This is a message
with
  Content-Type: TEXT/PLAIN; charset=X-UNKNOWN
  Content-Transfer-Encoding: QUOTED-PRINTABLE
and with latin-1 characters in the body (Pine generated this crap, no
surprise!). In Mutt, with UTF-8 locales, this is displayed as:
non born\351s n'est pas repr\351sentable parce que son exposant est trop
When I replied with Mutt (using UTF-8 locales), the file given to the
editor contained these untranslated latin-1 characters, and Emacs
thought that it was a latin-1 file (and it was quite right), so that
the body of my reply was encoded in latin-1, though declared as utf-8
by Mutt.
Now, with my new mutteditor script that calls recode to remove the
invalid characters, I get in Emacs:
non borns n'est pas reprsentable parce que son exposant est trop
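For reference, a minimal version of such a wrapper (used with
set editor="mutteditor" in .muttrc) would be something like:

  #!/bin/sh
  # Remove sequences that are invalid in the locale's charset,
  # then run the real editor on the file given by Mutt.
  recode -f "..ucs,ucs.." "$1"
  exec emacs "$1"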
> > Mutt should not propagate invalid sequences.
>
> They are not invalid. Mutt relies on iconv, who says rightly they
> are valid. Perhaps annoying, but valid.
See the example above. BTW, I've tried again after commenting out the
recode line, and Mutt still has the same problem. And this is not
iconv's fault:
$ /bin/echo -ne "\351" | iconv -f x-unknown -t iso-8859-1 | hex
iconv: conversion from `x-unknown' is not supported
000000
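For comparison, when the charset is actually known, the conversion to
the locale's UTF-8 is trivial (same hex filter as above):

$ /bin/echo -ne "\351" | iconv -f iso-8859-1 -t utf-8 | hex
C3 A9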
> [control characters]
> > And they aren't printable characters, are they?
>
> All unprintable. You mean that as such, they should be \octalized in
> editor, as they already are in pager? Comments anyone?
I'd agree with this behavior, or the characters could possibly be
replaced by a question mark.
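If I understand the proposal correctly, the editor would then see the
control characters roughly as sed's l command shows them, e.g.:

$ echo -ne "a\200b" | sed -n l
a\200b$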
--
Vincent Lefèvre <vincent@xxxxxxxxxx> - Web: <http://www.vinc17.org/>
100% accessible validated (X)HTML - Blog: <http://www.vinc17.org/blog/>
Work: CR INRIA - computer arithmetic / SPACES project at LORIA