Re: Filtering html



On Mon, 13 Jun, 2005, archaiesteron@xxxxxxxxxxx wrote:
> On Sun, Jun 12, 2005 at 12:39:18PM +0100, James Mason wrote:
> > How can I use mutt, in conjunction with fetchmail and procmail, to pipe
> > emails containing html through something like html2text as they arrive,
> 
> Do you already use procmail? It is not clear from your message. If you
> do, please skip the first section of my answer.
> 
> First of all, make fetchmail hand mail to procmail by adding the line
>     mda "procmail -d baruchel"
> to your .fetchmailrc for each account. Of course, replace "baruchel"
> with your login name (not your login name for the POP account but your
> login name on the local machine). Then check that you have no
> ~/.procmailrc and run some initial tests: nothing should have changed
> (your mail should still arrive in your local mailbox). OK?
> 
> Then create ~/.procmailrc (or edit the existing one). If you don't have
> one yet, add the following lines at the beginning:
> 
> VERBOSE=off
> MAILDIR=$HOME/Mail
> PMDIR=$HOME/.procmail
> DEFAULT=/var/mail/baruchel
> LOGFILE=$PMDIR/log
> 
> First line: set it to 'on' for the initial tests.
> Second line: set it to the directory that holds your mail folders.
> Third line: leave it as it is, but create the directory ~/.procmail.
> Fourth line: set it to your own local mailbox.
> Last line: leave it as it is.
> 
> Then add a recipe for filtering HTML.
> Start with the line:
> :0 fbw
> then add your regexp. I don't know the best test for detecting
> HTML, but maybe you should study this:
>   http://www.mhonarc.org/~ehood/MIME/2045/rfc2045.html#5
> It seems that the header relevant to your question is Content-Type,
> probably something like "Content-Type: text/html", but you should
> ask in a newsgroup.
> The recipe should then look like this:
> 
> :0 fbw
> * ^Content-Type.*text/html
> | my_program
> 
> where your program will be a filter for the body (probably lynx --dump or
> anything you want).
> 
> If you find several regexps, you should use
> * ^Content-Type.*(text/html|another/regexp|still_another/regexp)
> | html2text
> 
> where | inside the parentheses means OR and ( ... ) makes a group.
> 
> Of course, be careful, because if you do something wrong, you will
> lose the content of your mailbox (I suggest you work first on a junk
> mailbox with nothing important in it).

I think it is better to keep a copy:

  :0
  * ^Content-Type:.*text/html
  {
      :0 c
      original_mail_goes_here

      :0 fbw
      | html2text
  }

or like this:

  :0 c
  * ^Content-Type:.*text/html
  original_mail_goes_here
  :0 A fbw
  | html2text
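
For reference, here is a commented sketch of the copy-then-filter idea. It
is only an illustration under assumptions, not a tested setup: the folder
name "html-originals" is made up, html2text is assumed to read the body on
stdin and write plain text to stdout, and the formail step (formail ships
with procmail) is an addition of mine to relabel the converted body.

  :0
  * ^Content-Type:.*text/html
  {
      # c = work on a carbon copy, so the untouched original is saved
      # first; the trailing colon asks for a local lockfile on the mbox.
      # "html-originals" is a hypothetical folder under $MAILDIR.
      :0 c:
      html-originals

      # f = treat the pipe as a filter, b = feed it the body only,
      # w = wait for the exit code (on failure the body is left as it was)
      :0 fbw
      | html2text

      # a = run only if the preceding recipe succeeded; h = headers only.
      # Replaces the Content-Type header to match the converted body.
      :0 a fhw
      | formail -I "Content-Type: text/plain"
  }

Without the f flag, procmail treats "| html2text" as the final delivery
and the converted text never reaches a mailbox. After the block, the
filtered message continues through the rest of the rcfile and ends up in
$DEFAULT unless a later recipe delivers it. Note also that the condition
is matched against the headers only, so a multipart message whose HTML
sits in one of its parts will not match.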

> 
> Hope it will help,
> 
> -- 
> Thomas Baruchel

-- 
(dogmaT
        (icq 303140614)
        (jabber dogmat_at_njs_dot_netlab_dot_cz)
        (mail dogmat_at_dogmat_dot_us)
        (web http://dogmat.us))