Re: [Mutt] #2995: Mutt is folding subject line folds using a tab,

To: frnkblk@xxxxxxxxx, elliot.li.tech@xxxxxxxxx, ben@xxxxxxxxxxx, brendan@xxxxxxxxxx, pdmef@xxxxxxx
Subject: Re: [Mutt] #2995: Mutt is folding subject line folds using a tab,
From: Mutt <fleas@xxxxxxxx>
Date: Mon, 06 Apr 2009 08:03:41 -0000
Auto-submitted: auto-generated
Cc: mutt-dev@xxxxxxxx
In-reply-to: <039.9e188e8932288d27e6fb31a5d5236dc9@xxxxxxxx>
List-post: <mailto:mutt-dev@mutt.org>
List-unsubscribe: send mail to majordomo@mutt.org, body only "unsubscribe mutt-dev"
Mail-followup-to: fleas@xxxxxxxx
References: <039.9e188e8932288d27e6fb31a5d5236dc9@xxxxxxxx>
Reply-to: fleas@xxxxxxxx
Sender: owner-mutt-dev@xxxxxxxx

#2995: Mutt is folding subject line folds using a tab, which appears to be 
against
RFC 2822
----------------------+-----------------------------------------------------
  Reporter:  frnkblk  |       Owner:  mutt-dev                  
      Type:  defect   |      Status:  new                       
  Priority:  minor    |   Milestone:  1.6                       
 Component:  mutt     |     Version:  1.5.17                    
Resolution:           |    Keywords:  header folding white-space
----------------------+-----------------------------------------------------

Comment(by Derek Martin):

 {{{
 It is hard to believe this ticket has caused as much difficulty as it
 has... or that this bug has existed for almost exactly two years. :)

 On Tue, Mar 17, 2009 at 02:18:59PM -0000, Mutt wrote:

 This function was added in Mutt-1.5.14, which was released in April of
 2007.  As far as I can tell, Mutt did not insert a tab before that.
 This bug was reported in November of 2007, roughly 7 months later.
 Rocco asked why no one had noticed this before then...  I think the
 most likely explanation is a combination of the fact that until
 roughly then, (relatively) not many people were running those releases
 of Mutt, with the fact that most humans tend to write short subject
 lines (the subject is the header where this matters most, AFAICT).  I
 didn't spend the time to track it down, but I'd be willing to bet that
 the bug was reported shortly after Debian took a version of mutt
 containing this change...


 I don't see why this is so difficult.  Granted the function is hairy
 (which is a problem in and of itself), but it shouldn't be difficult
 to simply insert a newline before some whitespace.  However, that's
 not what it does... it does this:

       /* Space padding, but only if necessary */
       if (!first && *cp != '\t' && *cp != ' ')
       {
         if (fputc ('\t', fp) == EOF)
           return -1;
         col += 8 - (col % 8);
       }
     }

 If you read RFC (2)822, this seems clearly wrong.  The only character
 you are allowed to insert for line wrapping is a newline, which MUST
 be inserted before whitespace.  It doesn't say anything about
 *changing* existing whitespace.  Looking at this function, it appears
 to me that it has at least 4 fairly serious problems:

 1. It's long and hairy.  Everyone who's looked at it has said they
 would be loathe to try to modify it.  This is a pretty good sign that
 the function needs to be rewritten and probably broken down into
 smaller logical pieces.

 2. It uses iswspace() to determine if the next character is a
 whitespace character.  This is almost certainly wrong; RFC 2822
 defines WSP in headers as either ' ' or '\t'.  No other characters are
 considered WSP.  However, iswspace() probably evaluates to true in
 many other instances in various locales.  Probably a function
 is_rfc822_wsp() is needed (or a macro, though I really hate them).

 3. It is inserting a tab, where there was none, and where no RFC says
 you should insert one.  It is also *removing a space* in many cases.
 Who said you could do that?

 4. If the newline which was inserted for line wrapping was inserted
 correctly in the first place, this code should NEVER execute, since
 the next character after the inserted newline (which was from the
 original header) MUST be either ' ' or '\t'.  RFC 2822 says this:

    Each header field is logically a single line of characters
    comprising the field name, the colon, and the field body.  For
    convenience however, and to deal with the 998/78 character
    limitations per line, the field body portion of a header field can
    be split into a multiple line representation; this is called
    "folding".The general rule is that wherever this standard allows
    for folding white space (not simply WSP characters), a CRLF may be
    inserted before any WSP.

 It does not say "a CRLF and arbitrary additional WSP may be inserted
 before any WSP."   It absolutely doesn't say that you can change
 whitespace with arbitrary other whitespace.  It says you can insert a
 CRLF.  That's it.  Wherever WSP exists in the header, you can insert a
 CRLF before it.  That's not what mutt is doing, and what mutt
 interacts badly with other software that folds and unfolds correctly.

 FWIW Rocco's comments in earlier ticket updates about spaces being
 removed between encoded words in RFC 2047 are interesting, but
 irrelevant. RFC 2047 says white space between two encoded words must
 be ignored to account for the limit of 75 characters of any one
 encoded word, i.e. you need to be able to put them together into a
 single decoded encoded-word, if that is what they are to represent.
 Space between an encoded word and any unencoded word must not be
 ignored, as it is genuine space.  But we are not dealing with encoded
 words in this ticket, so RFC 2047 is not relevant.

 This code is wrong...  For the love of all that is good and decent in
 the world, please fix it, or let someone else fix it. :)

 FWIW, if I wrote this (which I could possibly do, maybe as soon as
 this weekend), I think the algorithm should (very) roughly be:

   1. start at the beginning of value (which is also called cp)
   2. get the next token (either WSP or MB-character), with the number
      bytes consumed from value, if there's one to get.
   3. If the token causes the line length to be greater than wraplen,
      and there are still chars left in value, or if there are no more
      characters left in value, insert '\n'.
   4. Add the token to the ouput buffer (if there was one).
   5. move the pointer that points to value up by the number of bytes
      consumed.
   6. If there are no more characters, print the buffer.

 I think the function should be divided in two: one that does step 2
 above (essentially the stuff in the big if block in the exiting
 function), and one that does the rest.  The block of code I quoted
 above should be removed entirely.

 A couple of other things to note:

 It looks to me like calling foldingstrfcpy() may be meant to fix
 this problem.  However, the '\t' is written directly to the file,
 instead of into the buffer, so it can't fix it.  Otherwise, it's not
 clear to me why the function is called (strncpy() would seem to
 be just as good here).  In fact, this seems to do the wrong thing; any
 tabs which the user *wanted* (ignoring the fact that this will
 essentially never be true) will be converted to spaces.

 Someone commented that this code may be shared by both the compose and
 the display code.  If that's the case, fixing it right probably
 requires an option to the function that controls whether the
 folded line is indented with a tab or not, for pretty display.

 If we're in an encoded word, but we find a '\n' (which is not allowed
 in an encoded word), we just assume that we were not really in an
 encoded word.  This seems possibly wrong, though maybe it isn't.  At
 the least, it seems like maybe mutt ought to warn the user that the
 header may be malformed (though I suppose if it is, they'll figure
 it out).  Is it possible this could be a worse bug?  Space is also not
 allowed in an encoded word, but there is no checking of this condition
 in that section of the function...

 Also while I can see why one might want to pass in the header's value
 as a const, there's no need to create the alias cp to value.  You can
 still increment the pointer even though the stuff it points to is
 constant.  Once the alias is created, the variable value is never
 referred to again...  In fact, the way the code is written is
 detrimental:  Generally, if you try to modify *cp the compiler will
 not complain, but your program will either segfault or let you do it!
 If you try to modify *value, you will get a compiler error.  [At
 least, this is how gcc behaves on my system.]

 Lastly the variables last and wrapped appear to be initialized
 twice; once in the variable declaration section, and again a bit
 further down in the code.
 }}}

-- 
Ticket URL: <http://dev.mutt.org/trac/ticket/2995#comment:>
Mutt <http://www.mutt.org/>
The Mutt mail user agent

Follow-Ups:
- Re: [Mutt] #2995: Mutt is folding subject line folds using a tab,
  - From: Rocco Rutte

References:
- [Mutt] #2995: Mutt is folding subject line folds using a tab, which appears to be against RFC 2822
  - From: Mutt

Prev by Date: mutt: 9 new changesets
Next by Date: Re: [Mutt] #2995: Mutt is folding subject line folds using a tab,
Previous by thread: Re: [Mutt] #2995: Mutt is folding subject line folds using a tab,
Next by thread: Re: [Mutt] #2995: Mutt is folding subject line folds using a tab,
Index(es):
- Date
- Thread