
Re: japanese text in email body



On Fri, Jun 18, 2004 at 06:39:14PM +0200, Alain Bench wrote:
>  On Thursday, June 17, 2004 at 6:40:21 AM +0900, Henry Nelson wrote:
> 
> > How do you all do with the character "??"?  My use of this (a one
> > inside of a circle) caused a bit of havoc in a recent e-mail.
> 
>     You mean "①", the U+2460 circled digit one? In your mail it was
> replaced by a pair of question marks.

U+2460 "looks right" according to
"http://www.fileformat.info/info/unicode/char/2460/index.htm".  Would doing
"^E" and changing the charset to "iso-2022-jp" display the character?  (Maybe
not, since it should already be labeled okay.)

>     Humm... According to the Glibc 2.3.2 charmap tables, this character
> is not part of EUC-JP. It exists only in:

Wow!  Anyway, all I know is that TeraTerm is set to receive and send "EUC"
and my locale, actually LC_CTYPE, is set to "ja_JP.eucJP".

>     Even if your EUC-JP terminal is in some way enhanced and has U+2460,
> iconv is not aware of this, and will fail to convert it from EUC-JP to

I compile iconv with "--enable-extra-encodings"; maybe this makes it aware?

>     What hex are the 2 bytes encoding circled one on your terminal? Here
> 3 bytes on UTF-8:
> 
> | $ echo -n "①" | hex
> | E2 91 A0

I don't have "hex".  I hope "hexdump" is the same thing.

% echo -n "①" | hexdump                ## between quotes is (1)
0000000 ada1
0000002

% echo -n "①②③④⑤" | hexdump   ## between quotes is (1)(2)(3)(4)(5)
0000000 ada1 ada2 ada3 ada4 ada5
000000a
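
For anyone who wants to check the same bytes without a Japanese terminal, here
is a minimal Python sketch.  It assumes the two bytes are AD A1 in that order
(my hexdump prints 16-bit words, so the order depends on the machine), and it
uses Python's own codec names "euc_jp" and "euc_jisx0213", which are not
necessarily the same tables as Glibc's or iconv's.  The helper name
try_decode is made up for this example.

```python
CIRCLED_ONE = "\u2460"  # CIRCLED DIGIT ONE


def try_decode(data, codec_names):
    """Return {codec: decoded text, or None if the bytes are invalid there}."""
    results = {}
    for name in codec_names:
        try:
            results[name] = data.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results


def hex_bytes(data):
    """Format bytes the way the 'hex' output quoted above does, e.g. 'E2 91 A0'."""
    return " ".join(f"{b:02X}" for b in data)


if __name__ == "__main__":
    # Assumed byte order of what the terminal sent for the circled one.
    terminal_bytes = bytes([0xAD, 0xA1])
    for name, text in try_decode(
        terminal_bytes, ["euc_jp", "euc_jisx0213", "utf-8"]
    ).items():
        print(name, repr(text))
    # The same character as UTF-8, for comparison with the quoted dump.
    print(hex_bytes(CIRCLED_ONE.encode("utf-8")))
```

A None for a codec means those two bytes are not a valid character in that
codec's table, which is the kind of mismatch described above.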

>     I guess that when they receive 2022-JP, it's OK. But when they
> receive EUC-JP, Shift-JIS, or UTF-8, there is a 2-in-3 risk of failure,
> as soon as the term doesn't match... Or are your students behind your
> magic NKF all-to-EUC converter?

Perhaps that's it.  These students are at home on Windows98 or WindowsXP
machines, or receiving mail on their cellphones.  The students who couldn't
read my mail had the same cellphone provider, "ezweb.ne.jp".  (I wasn't
referring to students logging into a Unix shell account, where the filter is
a protection device.)

>     I'm afraid your nasty procmail rule or gawk script to auto-detach
> and remove attachments are at fault. This destroys the MIME structure

Yes.  Actually, I had protected mutt.org mails in addition to yours, but
I forgot to allow formail to keep the "Content-Type:" header for the mutt
mailing list box.  I think that is corrected now.  (Thanks!)

>     You may try to <edit-type> (^E) my mail and replace the prompt by:
> 
> | multipart/mixed; boundary="ZGiS0Q5IWpPtfppv"
> 
>     ...but I'm not sure of the result. You will perhaps gain the shoguns

Anyway, I'll give it a try.

> > Any clues?
> 
>     Dave's mail was UTF-8. Your procmail rules wrongly interpreted it as
> being EUC-JP, and converted it, or overwrote the label. Or something
> like that. I get the same strange chars as you if I iconv Dave's raw
> UTF-8 mail from EUC-JP to ISO-2022-JP. After such corruption, no mailer

VERY interesting.  How is UTF-8 encoded in the mail, I wonder.  My procmail
recipe is:
:0 Bfbw
* $ \\^[\\\$B.*\\^[\\([BJ]
* H ?? $ !^(From|Message-ID|Received|Reply-To|Sender|Subject|To):? .*($NOFILTR)
|$LBDIR/nkf -Jeu
:0 aHfhw
|$FORMAIL -i"Content-Type: ${TOEUC:-text/plain; charset=euc-jp}"

In other words, that forced conversion _from_ ISO-2022-JP _to_ EUC-JP
is triggered by characters between [a raw escape followed by "$B"] and
[a raw escape followed by "(" and a "B" or "J"].  Are you saying that
UTF-8 is stuck between the same two tags?  Oh, no.  Very bad situation.
How can I distinguish between ISO-2022-JP and UTF-8?
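
One possible way to tell the two apart, sketched in Python rather than as a
procmail condition: ISO-2022-JP is a 7-bit encoding that announces itself with
escape sequences, while Japanese text in UTF-8 always contains bytes of 0x80
or above.  The function name guess_charset and its return labels are made up
for this example, and the check is only a heuristic; for instance, some EUC-JP
byte runs can happen to be valid UTF-8, and it looks at a whole body rather
than at individual MIME parts.

```python
import re

# ISO-2022-JP switches into JIS X 0208 with ESC $ @ or ESC $ B, and back
# to ASCII or JIS-Roman with ESC ( B or ESC ( J.
_ISO2022JP_ESCAPE = re.compile(rb"\x1b\$[@B]|\x1b\([BJ]")


def guess_charset(data):
    """Heuristically label a raw message body (bytes)."""
    has_high_bytes = any(b >= 0x80 for b in data)
    if not has_high_bytes:
        # Pure 7-bit data: ISO-2022-JP if it carries the escapes,
        # otherwise plain ASCII.
        if _ISO2022JP_ESCAPE.search(data):
            return "iso-2022-jp"
        return "us-ascii"
    # 8-bit data cannot be ISO-2022-JP; see whether it is well-formed UTF-8.
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "unknown-8bit"  # could be EUC-JP, Shift-JIS, ...
    return "utf-8"


if __name__ == "__main__":
    sample = "日本語"
    print(guess_charset(sample.encode("iso2022_jp")))
    print(guess_charset(sample.encode("utf-8")))
    print(guess_charset(b"plain ASCII text"))
```

The point of the ordering is that the escape-sequence test alone is not
enough; the body should also be free of 8-bit bytes before it is handed to
a from-ISO-2022-JP conversion.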

> can display it right. Not Mutt's fault.

That's for sure.  Too bad all programs aren't coded as well as mutt!

> A: Because it messes up the order in which people normally read text.
> Q: Why is top-posting such a bad thing?

Q: What's "top-posting"?

-- 
henry nelson
 | day job: | http://yuba.kcn.ne.jp/biorec/nehan/henken.html