Re: japanese text in email body

To: Mutt users ml <mutt-users@xxxxxxxx>
Subject: Re: japanese text in email body
From: Henry Nelson <netb@xxxxxxxxxxxxxx>
Date: Sat, 19 Jun 2004 07:35:18 +0900
In-reply-to: <20040618163913.GA29090@xxxxxxxxx>
List-unsubscribe: <mailto:mutt-users-request@mutt.org?body=unsubscribe>
Mail-followup-to: Mutt users ml <mutt-users@xxxxxxxx>
References: <20040615112808.GA20268@xxxxxxxxxxxxxx> <20040616105815.GA21405@xxxxxxxxx> <20040616234904.GA20636@xxxxxxxxxxxxxxxxxxxxxxxxx> <20040608152908.GA3009@xxxxxxxxxxxxxxxxxxxxxxxxx> <20040608163345.GA1677@xxxxxxx> <20040612125623.GA6871@xxxxxxxxx> <20040615112808.GA20268@xxxxxxxxxxxxxx> <20040616105815.GA21405@xxxxxxxxx> <20040616214021.GA26002@xxxxxxxxxxxxxx> <20040618163913.GA29090@xxxxxxxxx>
Sender: owner-mutt-users@xxxxxxxx
User-agent: Mutt/1.5.4i-ja.1

On Fri, Jun 18, 2004 at 06:39:14PM +0200, Alain Bench wrote:
>  On Thursday, June 17, 2004 at 6:40:21 AM +0900, Henry Nelson wrote:
> 
> > How do you all do with the character "??"?  My use of this (a one
> > inside of a circle) caused a bit of havoc in a recent e-mail.
> 
>     You mean "①", the U+2460 circled digit one? In your mail it was
> replaced by a pair of question marks.

U+2460 "looks right" according to
http://www.fileformat.info/info/unicode/char/2460/index.htm.  Would
pressing "^E" and changing the charset to "iso-2022-jp" display the
character?  (Maybe not, since it should already be labeled correctly.)

>     Humm... According to the Glibc 2.3.2 charmap tables, this character
> is not part of EUC-JP. It exists only in:

Wow!  Anyway, all I know is that TeraTerm is set to receive and send "EUC"
and my locale, actually LC_CTYPE, is set to "ja_JP.eucJP".

>     Even if your EUC-JP terminal is in some way enhanced and has U+2460,
> iconv is not aware of this, and will fail to convert it from EUC-JP to

I compile iconv with "--enable-extra-encodings"; maybe this makes it aware?

>     What hex are the 2 bytes encoding circled one on your terminal? Here
> 3 bytes on UTF-8:
> 
> | $ echo -n "①" | hex
> | E2 91 A0

I don't have "hex".  I hope "hexdump" is the same thing.

% echo -n "①" | hexdump                ## between quotes is (1)
0000000 ada1
0000002

% echo -n "①②③④⑤" | hexdump   ## between quotes is (1)(2)(3)(4)(5)
0000000 ada1 ada2 ada3 ada4 ada5
000000a
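
For anyone who wants to repeat this comparison without "hex" or "hexdump",
here is a minimal Python sketch.  The helper names (encode_hex,
hexdump_words) are made up for illustration, and the list of codecs tried
at the bottom is only an assumption about what is worth checking; whether
each one can encode U+2460 depends on the codec tables, so no result is
claimed for them here.  Note also that plain "hexdump" prints 16-bit words
in host byte order, so the same two bytes can appear swapped on a
little-endian machine; "hexdump -C" prints them byte by byte.

```python
import struct

CIRCLED_ONE = "\u2460"  # U+2460 CIRCLED DIGIT ONE


def encode_hex(text, codec):
    """Return text encoded in codec as spaced hex, or None if it cannot be encoded."""
    try:
        data = text.encode(codec)
    except UnicodeEncodeError:
        return None
    return " ".join(f"{b:02X}" for b in data)


def hexdump_words(data, byteorder):
    """Show data as 16-bit hex words, read in 'big' or 'little' byte order."""
    if len(data) % 2:
        data += b"\x00"  # pad an odd trailing byte so the data unpacks as words
    prefix = ">" if byteorder == "big" else "<"
    words = struct.unpack(prefix + "H" * (len(data) // 2), data)
    return " ".join(f"{w:04x}" for w in words)


if __name__ == "__main__":
    # Hypothetical list of codecs to probe; None means "cannot encode".
    for codec in ("utf-8", "euc_jp", "euc_jisx0213", "cp932", "iso2022_jp"):
        print(f"{codec:14} {encode_hex(CIRCLED_ONE, codec)}")

    # The two bytes AD A1, viewed as one 16-bit word in each byte order.
    raw = bytes([0xAD, 0xA1])
    print("big-endian view:   ", hexdump_words(raw, "big"))
    print("little-endian view:", hexdump_words(raw, "little"))
```

The UTF-8 line should agree with the "E2 91 A0" quoted above; the other
lines show which of the listed codecs accept the character at all.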

>     I guess that when they receive 2022-JP, it's OK. But that when they
> receive EUC-JP, Shift-JIS, or UTF-8, it has 2/3 risks to fail, as soon
> as the term doesn't match... Or are your students behind your magic
> NKF all-to-EUC converter?

Perhaps that's it.  These students are at home on a Windows 98 or Windows XP
machine, or receiving mail on their cellphones.  The students who couldn't
read my mail had the same cellphone provider, "ezweb.ne.jp".  (I wasn't
referring to students logging into a Unix shell account, where the filter is
a protective device.)

>     I'm afraid your nasty procmail rule or gawk script to auto-detach
> and remove attachments are at fault. This destroys the MIME structure

Yes.  Actually, I had protected mutt.org mails in addition to yours, but
I forgot to allow formail to keep the "Content-Type:" header for the mutt
mailing list box.  I think that is corrected now.  (Thanks!)

>     You may try to <edit-type> (^E) my mail and replace the prompt by:
> 
> | multipart/mixed; boundary="ZGiS0Q5IWpPtfppv"
> 
>     ...but I'm not sure of the result. You will perhaps gain the shoguns

Anyway, I'll give it a try.

> > Any clues?
> 
>     Dave's mail was UTF-8. Your procmail rules wrongly interpreted it as
> being EUC-JP, and converted it, or overwrited the label. Or something
> like that. I get the same strange chars as you if I iconv Dave's raw
> UTF-8 mail from EUC-JP to ISO-2022-JP. After such corruption, no mailer

VERY interesting.  How is UTF-8 encoded in the mail, I wonder.  My procmail
recipe is:
:0 Bfbw
* $ \\^[\\\$B.*\\^[\\([BJ]
* H ?? $ !^(From|Message-ID|Received|Reply-To|Sender|Subject|To):? .*($NOFILTR)
|$LBDIR/nkf -Jeu
:0 aHfhw
|$FORMAIL -i"Content-Type: ${TOEUC:-text/plain; charset=euc-jp}"

In other words, that forced conversion _from_ ISO-2022-JP _to_ EUC-JP
is triggered by characters between [a raw escape followed by "$B"] and
[a raw escape followed by "(" and a "B" or "J"].  Are you saying that
UTF-8 is stuck between the same two tags?  Oh, no.  Very bad situation.
How can I distinguish between ISO-2022-JP and UTF-8?
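
One rough way to tell the two apart, sketched in Python rather than
procmail: ISO-2022-JP is a 7-bit encoding that switches character sets
with escape sequences, while non-ASCII UTF-8 text always contains bytes
with the high bit set.  The function name guess_charset and the labels it
returns are hypothetical, and this is only a heuristic on the raw body
bytes, not something nkf or procmail does by itself.  It also assumes the
body has not been base64 or quoted-printable encoded; such a body would
have to be decoded first.  Unlike the recipe above, the pattern also
accepts ESC "$@" as an opening sequence.

```python
import re

# ESC $ B (or ESC $ @) switches to a JIS kanji set; ESC ( B or ESC ( J
# switches back to ASCII / JIS Roman.
ISO2022JP_RUN = re.compile(rb"\x1b\$[@B].*?\x1b\([BJ]", re.DOTALL)


def guess_charset(body: bytes) -> str:
    """Heuristically label raw body bytes (hypothetical helper)."""
    has_high_bit = any(b >= 0x80 for b in body)
    if not has_high_bit:
        # Pure 7-bit data: ISO-2022-JP if the escape sequences are present.
        if ISO2022JP_RUN.search(body):
            return "iso-2022-jp"
        return "us-ascii"
    # 8-bit data cannot be ISO-2022-JP; see whether it is well-formed UTF-8.
    try:
        body.decode("utf-8")
    except UnicodeDecodeError:
        return "unknown-8bit"  # could be EUC-JP, Shift_JIS, ...
    return "utf-8"


if __name__ == "__main__":
    sample = "\u65e5\u672c\u8a9e"  # three kanji as sample text
    for codec in ("iso2022_jp", "utf-8", "euc_jp"):
        print(codec, "->", guess_charset(sample.encode(codec)))
```

With a check like this, the nkf conversion would only run on a body that
is both 7-bit and contains the escape sequences, so 8-bit UTF-8 mail would
be left alone.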

> can display it right. Not Mutt's fault.

That's for sure.  Too bad all programs aren't coded as well as mutt!

> A: Because it messes up the order in which people normally read text.
> Q: Why is top-posting such a bad thing?

Q: What's "top-posting"?

-- 
henry nelson
 | day job: | http://yuba.kcn.ne.jp/biorec/nehan/henken.html
