
Re: mutt_FormatString() not multibyte-aware



Hi,

* Alain Bench [06-06-30 12:10:11 +0200] wrote:
> On Friday, June 23, 2006 at 13:02:24 +0000, Rocco Rutte wrote:
>
>> we could also convert to utf-8 first because it's so trivial to test
>> for continuations (as mutt IIRC does in other places already).
>
>    I don't get it: We need to count cells. Conversion to UTF-8 would
> easily give the count of characters, but each one can take 0 to 2 cells.
> So something around wcwidth() like mutt_strwidth() or such is still
> needed. And those don't want UTF-8, but wc or current locale mb.

In rfc2047.c there is:

  #define CONTINUATION_BYTE(c) (((c) & 0xc0) == 0x80)

for UTF-8, with which you can easily determine how many bytes a multibyte character in a 'char*' occupies, and that is what we need for padding.

The RFC 2047 encoder now converts everything to UTF-8 and uses the above #define to produce encoded words which do not break within multibyte characters (which RFC 2047 requires, but you likely know that).
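
To illustrate, here is a minimal sketch of how that test could be used. The helper names utf8_char_len() and utf8_safe_cut() are made up for this example and are not existing mutt functions, and the input is assumed to be valid UTF-8:

```c
#include <stddef.h>

/* Same test as the macro in rfc2047.c. */
#define CONTINUATION_BYTE(c) (((c) & 0xc0) == 0x80)

/* Hypothetical helper: number of bytes of the UTF-8 character starting
 * at s, i.e. the lead byte plus all continuation bytes following it.
 * Returns 0 at the end of the string. Assumes s points at a lead byte. */
static size_t utf8_char_len(const char *s)
{
  size_t n;

  if (!*s)
    return 0;
  n = 1;
  /* the terminating NUL is not a continuation byte, so this stops there */
  while (CONTINUATION_BYTE((unsigned char) s[n]))
    n++;
  return n;
}

/* Hypothetical helper: largest cut position <= max (in bytes) that does
 * not fall inside a multibyte character of the len-byte string s. */
static size_t utf8_safe_cut(const char *s, size_t len, size_t max)
{
  size_t cut = max < len ? max : len;

  /* back off while the byte after the cut is a continuation byte */
  while (cut > 0 && cut < len && CONTINUATION_BYTE((unsigned char) s[cut]))
    cut--;
  return cut;
}
```

For the three bytes of U+20AC (e2 82 ac), utf8_char_len() gives 3, and cutting 'a', U+20AC, 'b' at 2 bytes backs off to 1 so the character isn't split. Note this only counts bytes, not screen cells.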

And I can think of something similar for mutt_FormatString().

That would make the status lines more correct. For single extracted tokens we would still need wcwidth() to determine their width on screen, but for detecting padding chars the above is good enough (provided the performance cost of the local->utf8->local round trip doesn't matter much)... and better than 'foo=*bar++'.

>    On platforms where wcwidth() is unreliable, we could embed the
> replacement in wcwidth.c via -HAVE_WC_FUNCS.

This is the case already, see wcwidth.c (which could use an update, btw).

  bye, Rocco
--
:wq!