Re: Problems with mutt and utf-8, can't talk to itself even!



-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Wednesday, March 19 at 02:14 PM, quoth Chris G:
>Well I'm still not sure things are right, even after getting my 
>editor to do (approximately) the right thing.
>
>Here are some incorrect pound signs:-

Those are all encoded as three bytes: 0xEF 0xBF 0xBD

>Here are some correct (as in correctly encoded as utf-8 by my editor)
>pound signs:-

Those are also all the same three bytes: 0xEF 0xBF 0xBD

That *looks* like valid utf-8.

For a quick tutorial in three-byte utf-8, the way three-byte 
characters are encoded (in binary) is like this:

     1110xxxx 10yyyyyy 10zzzzzz

The three bytes 0xEF 0xBF and 0xBD are, in binary, this:

     11101111 10111111 10111101

Thus, the decoded portions are:

         1111   111111   111101

Put them back together as a single binary number:

     1111111111111101

That's 65533 in decimal (0xfffd in hex). In Unicode, that code point 
is written U+FFFD, which (according to the Unicode specification) is:

     REPLACEMENT CHARACTER
     - used to replace an incoming character whose value is unknown or
       unrepresentable in Unicode
     - compare the use of U+001A as a control character to indicate the
       substitute function
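
If you'd rather check that arithmetic than take my word for it, here 
is a minimal Python sketch of the same decoding. The function name 
decode_three_byte is just something I made up for illustration, and it 
assumes the input really is a three-byte sequence; it only checks the 
marker bits and does not reject overlong forms or surrogates the way a 
real decoder would.

```python
import unicodedata

def decode_three_byte(b0, b1, b2):
    """Decode one 1110xxxx 10yyyyyy 10zzzzzz sequence by hand (illustrative helper)."""
    # Check the marker bits before trusting the payload bits.
    if (b0 & 0xF0) != 0xE0 or (b1 & 0xC0) != 0x80 or (b2 & 0xC0) != 0x80:
        raise ValueError("not a three-byte UTF-8 sequence")
    # Keep the 4 + 6 + 6 payload bits and glue them back together.
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)

code_point = decode_three_byte(0xEF, 0xBF, 0xBD)
print(hex(code_point))                    # 0xfffd
print(unicodedata.name(chr(code_point)))  # REPLACEMENT CHARACTER

# Cross-check against Python's own UTF-8 decoder.
print(bytes([0xEF, 0xBF, 0xBD]).decode("utf-8") == chr(code_point))  # True
```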

In other words, if that's what your editor is generating, then it 
doesn't know how to handle a pound symbol and is substituting the 
replacement character for it, even though it DOES seem to understand 
UTF-8 (kinda).
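
I can't tell from the message which step is doing the substitution, so 
treat this as a guess rather than a diagnosis, but one common way to 
end up with exactly those three bytes is for something to read a 
Latin-1 pound sign (the single byte 0xA3) as if it were UTF-8 and 
replace it when that fails. A small Python sketch of that assumed 
scenario:

```python
# Assumed scenario: the pound sign arrives as the single Latin-1 byte 0xA3.
latin1_pound = "\u00a3".encode("latin-1")
print(latin1_pound)               # b'\xa3'

# A lone 0xA3 is not valid UTF-8, so a lenient decoder substitutes U+FFFD.
decoded = latin1_pound.decode("utf-8", errors="replace")
print(decoded == "\ufffd")        # True

# Writing that character back out as UTF-8 yields the three bytes seen above.
print(decoded.encode("utf-8"))    # b'\xef\xbf\xbd'
```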

For what it's worth, the CORRECT utf-8 encoding of the pound symbol 
(U+00A3) is only two bytes. Here's how we get it. Code points from 
U+0080 through U+07FF take two bytes, encoded like this (in binary):

     110yyyyy 10zzzzzz

U+00A3 translates to the hex number 0xA3, which in binary is this:

     10100011

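If we split that into the low six bits and whatever is left over, that 
becomes:

           10   100011

Pad the left part with zeros to five bits (00010), add the 110 and 10 
prefixes, and in UTF-8 it's encoded as: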

     11000010 10100011

Thus, the correct UTF-8 encoding for a pound symbol is 0xC2 0xA3. 
Here's an example: £
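
The same encoding steps can be sketched in Python. Again, 
encode_two_byte is a name I've invented for illustration, and it only 
covers the two-byte range (U+0080 through U+07FF), nothing else.

```python
def encode_two_byte(code_point):
    """Encode a code point as 110yyyyy 10zzzzzz by hand (illustrative helper)."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("code point does not use the two-byte form")
    b0 = 0xC0 | (code_point >> 6)    # 110 prefix + upper five bits
    b1 = 0x80 | (code_point & 0x3F)  # 10 prefix + lower six bits
    return bytes([b0, b1])

encoded = encode_two_byte(0xA3)
print(encoded.hex())                        # c2a3

# Cross-check against Python's own UTF-8 encoder.
print(encoded == "\u00a3".encode("utf-8"))  # True
```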

~Kyle
- -- 
I contend that we are both atheists. I just believe in one fewer god 
than you do. When you understand why you dismiss all the other 
possible gods, you will understand why I dismiss yours.
                                           -- Sir Stephen Henry Roberts
-----BEGIN PGP SIGNATURE-----
Comment: Thank you for using encryption!

iEYEARECAAYFAkfhMFsACgkQBkIOoMqOI144+gCg5bLJ2t7fK7+Ih1A6qBFgeuka
jO0AoKDy+JgwsknmCiSDkOwG4OTE2p0Z
=euIx
-----END PGP SIGNATURE-----