
Re: Q: View as Windows-1252?



Hello Kai, hello Kyle, hello Christian,

 On Friday, August 3, 2007 at 15:37:38 +0200, Kai Grossjohann wrote:

> I (fairly) often get messages with no charset specified, or with the
> wrong charset specified, so I do Ctrl-E on them and edit the charset
> parameter to windows-1252

    As Kyle said, an $assumed_charset and a set of charset-hooks will
solve most such problems pretty well, say 95%. The rare remaining
problem mails can then be handled manually via <edit-type> without too
much pain. I'll just add some comments to the discussion:

 -1) The first argument of a charset-hook is a regexp. To avoid
annoying false positive matches, I advise *always* writing the
strictest possible regexp for your goal. This leads to
"charset-hook ^iso-8859-1$ windows-1252" and the like.

    Otherwise Latin-9 (iso-8859-15) would be matched, and would be
aliased to CP-1252. This would break Euros, and is definitely unwanted.
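
    The false positive is easy to reproduce outside Mutt. A minimal
sketch in Python, assuming its "re" module behaves like Mutt's regexp
matching for these simple patterns (the helper name hook_matches is
made up for illustration):

```python
import re

def hook_matches(pattern, label):
    """Return True if the hook regexp matches anywhere in the charset label."""
    return re.search(pattern, label) is not None

loose = "iso-8859-1"       # unanchored: also matches inside "iso-8859-15"
strict = "^iso-8859-1$"    # anchored: matches the exact label only

loose_hits_latin9 = hook_matches(loose, "iso-8859-15")
strict_hits_latin9 = hook_matches(strict, "iso-8859-15")
strict_hits_latin1 = hook_matches(strict, "iso-8859-1")

# Byte 0xA4 is the Euro sign in Latin-9, but the generic currency sign
# in CP-1252, which is why the unwanted alias breaks Euros.
euro_in_latin9 = b"\xa4".decode("iso-8859-15")
euro_as_cp1252 = b"\xa4".decode("cp1252")

print(loose_hits_latin9, strict_hits_latin9, strict_hits_latin1)
print(euro_in_latin9, euro_as_cp1252)
```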


 -2) Drop "charset-hook windows-1251 windows-1252". This aliases one
charset to an entirely different and incompatible one. It may well fix
the mislabelling in a few mails, but it will also break every mail that
is correctly labelled windows-1251. Definitely unwanted.

    Such deep mislabellings need a different solution from static
charset-hooks: perhaps dynamic charset-hooks declared inside folder- or
message-hooks (and unhooked by default), or, if rare enough, just the
occasional manual <edit-type>.
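
    To see how incompatible the two charsets are, here is a small
Python sketch. It only uses Python's own cp1251 and cp1252 codecs as
stand-ins for what iconv would do; the sample word is an arbitrary
choice:

```python
# A correctly labelled windows-1251 (Cyrillic) mail body.
text = "Привет"
raw = text.encode("cp1251")

# Decoded with its real charset, the text survives.
right = raw.decode("cp1251")

# Decoded as windows-1252, as the alias would force, the same bytes
# turn into accented Latin letters.
wrong = raw.decode("cp1252")

print(right)
print(wrong)
```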


 -3) Drop "charset-hook ^us-ascii$ utf-8". This may well fix the wrong
(or missing) label in a few MIME mails (those containing UTF-8), but
will break the majority of such mails (which really contain CP-1252).
What you want is a generic fix "charset-hook ^us-ascii$ cp1252" for the
majority, and additionally either some dynamic charset-hook or manual
<edit-type> (as in point #2 above) to fix the UTF-8 corner cases.


 -4) Kyle: You listed "charset-hook none windows-1252". I don't recall
having ever seen a charset=none label. Does it really happen?


 -5) Most people should not set $charset, but set LANG and let $charset
automatically inherit the right value.


 -6) "utf8" does not exist. Sometimes it's known by iconv as an alias,
but not on all platforms. This must be spelled "utf-8" with the dash.
We should really consider populating Mutt's internal list of aliases for
charset.c:mutt_canonical_charset().


 -7) $assumed_charset takes a list of charsets, right. For raw headers,
Mutt scans the list and takes the first charset in which the header is
fully valid. However for bodies, Mutt takes... item #1 in list, period.
There is *no* charset auto-sensing for bodies.
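
    The difference between the two cases can be sketched in Python.
This is only a model of the behaviour described above, not Mutt's
actual code; all three function names are hypothetical, and Python's
codecs stand in for iconv:

```python
def first_valid_charset(data, charsets):
    """Return the first charset in which data decodes cleanly, else None."""
    for name in charsets:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None

def charset_for_header(raw_header, assumed):
    # Raw headers: scan the whole list, keep the first charset that fits.
    return first_valid_charset(raw_header, assumed)

def charset_for_body(raw_body, assumed):
    # Bodies: no sensing at all, item #1 is used whatever the content.
    return assumed[0]

assumed = ["utf-8", "windows-1252"]
cp1252_bytes = "Grüße".encode("cp1252")

header_choice = charset_for_header(cp1252_bytes, assumed)
body_choice = charset_for_body(cp1252_bytes, assumed)
print(header_choice, body_choice)
```

    With this list the header is sensed correctly, while the same
bytes in a body would be treated as UTF-8.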


 -8) Those charset auto-sensing lists (like $assumed_charset,
$file_charset/$attach_charset, or Vim's fileencodings) could list utf-8
first, then Latin-1 or such. And nothing appended.

    The reason is that any string is *always* valid Latin-1 (yes, even
if it contains bytes between 128 and 159), so nothing listed after it
will ever be checked. The same applies to nearly all 256-character
charsets in place of Latin-1 (CP-125*, ISO-8859-*, CP-85*, KOI-8*,
etc.). A few exceptions do exist (example: byte 213 is invalid in
CP-857), but don't really invalidate this rule.

    By contrast, UTF-8 strings are much more specific: UTF-8 uses
any bytes, but only in specific sequences. This means that a Latin-1
text has a fairly low risk of being wrongly sensed as UTF-8, and that if
some text is valid UTF-8, then it very probably really is UTF-8.
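
    Both halves of this argument can be checked with a few lines of
Python. A minimal sketch, assuming Python's "latin-1" and "utf-8"
codecs validate the same way iconv does (the helper is_valid_utf8 and
the sample text are made up for illustration):

```python
def is_valid_utf8(data):
    """Return True if data is a well-formed UTF-8 byte sequence."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

# Every possible byte value, including 128..159, decodes as Latin-1.
every_byte = bytes(range(256))
latin1_accepts_everything = len(every_byte.decode("latin-1")) == 256

latin1_text = "déjà vu".encode("latin-1")
utf8_text = "déjà vu".encode("utf-8")

# Latin-1 text is rejected as UTF-8, so sensing falls through to the
# next list item. UTF-8 text passes the UTF-8 check.
latin1_sensed_as_utf8 = is_valid_utf8(latin1_text)
utf8_sensed_as_utf8 = is_valid_utf8(utf8_text)

# With Latin-1 listed first, UTF-8 text would be accepted (as garbage)
# and the utf-8 entry would never be reached.
garbled = utf8_text.decode("latin-1")

print(latin1_accepts_everything, latin1_sensed_as_utf8, utf8_sensed_as_utf8)
```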


 -9) There is no point in listing a subset together with its superset
in $assumed_charset. The superset alone suffices.


 -10) Due to points #7 and #8, the optimal generic $assumed_charset for
westerners (ie for all Latin-1 centered languages) is the mono-charset
$assumed_charset=windows-1252

    Appending anything is (practically) useless. Prepending "utf-8"
would be good for headers, but would harm bodies.


 -11) MIME mails with a "Content-Type:" header but without a charset
label are by default treated by Mutt *nearly* as if the label was
"us-ascii". However this is a border case, and it is affected either by
$assumed_charset *or* by a "charset-hook ^us-ascii$ something". If both
exist, $assumed_charset=blah wins over "charset-hook ^us-ascii$". But
then a "charset-hook ^blah$" would have the last word and win. There are
some more subtleties I won't try to explain, but for such mails
<edit-type> (and the attachments menu) shows charset=blah, provided
$assumed_charset was "blah" when the folder was last loaded. Runtime
changes to $assumed_charset are ignored (until the next reload).
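
    The precedence just described can be written down as a toy model
in Python. This is only my reading of the order above, not Mutt's
code: the function effective_charset is hypothetical, it stops at the
first matching hook (an assumption), and it ignores the further
subtleties and the reload behaviour:

```python
import re

def effective_charset(label, assumed_charset, charset_hooks):
    """Toy model: resolve the charset used for a MIME body part.

    label            charset parameter from Content-Type, or None
    assumed_charset  list of charsets, possibly empty
    charset_hooks    list of (regexp, target) pairs
    """
    if label is None:
        # An unlabelled part takes item #1 of $assumed_charset if set,
        # otherwise it is handled as us-ascii.
        label = assumed_charset[0] if assumed_charset else "us-ascii"
    # charset-hooks are applied to whatever label resulted.
    for pattern, target in charset_hooks:
        if re.search(pattern, label):
            return target
    return label

ascii_hook = [(r"^us-ascii$", "something")]
both_hooks = ascii_hook + [(r"^blah$", "other")]

only_hook = effective_charset(None, [], ascii_hook)
assumed_wins = effective_charset(None, ["blah"], ascii_hook)
blah_hook_wins = effective_charset(None, ["blah"], both_hooks)
print(only_hook, assumed_wins, blah_hook_wins)
```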


 -12) Christian gave the state-of-the-art generic set of hooks for
westerners. People on platforms where iconv knows EUC-JP-MS (ie *not*
unpatched libiconv) can just add this one:

| charset-hook ^euc-jp$ euc-jp-ms

    Iconv must know the target charset. Otherwise such a charset-hook is
worse than nothing.


Bye!    Alain.
-- 
Mutt muttrc tip to send mails in best adapted first necessary and sufficient
charset (version for East Europe Latin-2/CP-852/CP-1250 terminal users):
set send_charset="us-ascii:iso-8859-1:iso-8859-15:windows-1252:iso-8859-2:windows-1250:utf-8"