
Re: extract_url.pl - version 1.3



-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Wednesday, May 21 at 02:24 PM, quoth Wilkinson, Alex:
>On Tue, May 20, 2008 at 07:14:21PM -0500, Kyle Wheeler wrote:
>
>> The original reason for this script was because urlview doesn't 
>> correctly handle format=flowed email or any other email encodings, 
>> so URLs are often mishandled or simply broken. This script handles 
>> all known encodings *correctly* (when fed the raw email). It can be 
>> used either as a standalone script (which requires the Curses::UI 
>> perl module) or as a pre-filter for urlview.
>
> Ahh, now this is what i like to hear.

:)

> I have a few questions:
>
>  1. What is meant by "format=flowed email" ? 

Email that is tagged as "format=flowed" (i.e. it says so in the 
Content-Type header) informs the receiving client that some lines are 
"connected" and some lines are not. Lines that end in a space are 
considered "to be continued". It's kinda like putting a backslash at 
the end of a shell-script line. This allows the client to be able to 
re-format and re-wrap all the lines in an email to fit whatever 
display width is available. Note that this is for text/plain email 
only; NOT HTML.

Many mail clients can send format=flowed (also known as "f=f") mail, 
including mutt, Eudora, and Apple's Mail.app.

Part of the motivation for f=f mail is that the email spec limits line 
length, and part of it is that devices like BlackBerries need to be 
able to redisplay email on much smaller screens than the email was 
written for.

There's a variant of f=f email, called delsp=yes email (i.e. it has 
that tag in the Content-Type header as well). The difference between 
this variant and the "standard" f=f email is in how lines are joined: 
standard f=f keeps the trailing space when joining a line to the next 
one, while delsp=yes deletes it. Several email clients use this 
technique to split long URLs over multiple lines (thus obeying the 
line-length restrictions of the relevant email RFCs) in a way that 
allows the client to reconstruct the original URL easily. In practice, 
this means that in delsp=yes email, sentences get broken up by two 
spaces at the end of each line, while URLs get broken up by a single 
space at the end of the line.

My extract_url.pl handles both kinds of format=flowed email correctly.

(As an example, this email I'm sending right now is format=flowed. 
Note that most lines end in a space.)
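
To make the joining rule concrete, here is a rough Python sketch. It is 
not how extract_url.pl does it (that is Perl), the function name 
unflow is made up for illustration, and it ignores quoted (">") lines 
entirely:

```python
def unflow(text, delsp=False):
    """Join format=flowed lines (unquoted text only) into paragraphs."""
    paragraphs = []
    current = ""
    for line in text.split("\n"):
        # Undo space-stuffing: senders prefix one space to protect some lines.
        if line.startswith(" "):
            line = line[1:]
        # The signature separator "-- " ends in a space but is never flowed.
        if line == "-- ":
            if current:
                paragraphs.append(current)
                current = ""
            paragraphs.append(line)
        elif line.endswith(" "):
            # Soft break ("to be continued"); delsp=yes drops the trailing space.
            current += line[:-1] if delsp else line
        else:
            # Hard break: this line finishes the paragraph.
            paragraphs.append(current + line)
            current = ""
    if current:
        paragraphs.append(current)
    return "\n".join(paragraphs)


print(unflow("a sentence that \nwas wrapped"))
print(unflow("http://example.com/some/ \nlong/path", delsp=True))
```

Without delsp=True the second call would leave a space in the middle 
of the URL, which is exactly the breakage delsp=yes is meant to avoid.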

>  2. What are the "known encodings" ?

Primarily, Base64 and quoted-printable. In essence, anything that's 
understood by perl's MIME::Parser.
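
For a feel of what that decoding step looks like, here is a small 
sketch using Python's standard email package as a stand-in for 
MIME::Parser (the two are different libraries; decoded_text_parts is a 
name invented for this example):

```python
from email import message_from_string


def decoded_text_parts(raw_message):
    """Return (content_type, decoded_text) for every text/* MIME part."""
    msg = message_from_string(raw_message)
    parts = []
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue  # skips multipart containers and attachments
        # decode=True undoes the Content-Transfer-Encoding
        # (base64 or quoted-printable) and returns bytes.
        payload = part.get_payload(decode=True)
        charset = part.get_content_charset() or "us-ascii"
        parts.append((part.get_content_type(),
                      payload.decode(charset, errors="replace")))
    return parts


raw = (
    "Content-Type: text/html; charset=us-ascii\n"
    "Content-Transfer-Encoding: quoted-printable\n"
    "\n"
    '<A HREF=3D"http://example.com/a/=\n'
    'b">x</A>\n'
)
for content_type, text in decoded_text_parts(raw):
    print(content_type, text)
```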

> I often have broken links in the body of my emails and I don't know why e.g.
>
> The link is meant to look like:
>
>   http://odinr.dcb.defence.gov.au/uhtbin/cgisirsi/MhOktEUDHs/DSTOE/242330010/60/54/X
>
> But I will always see it like this in mutt:
>
>   http://odinr.dcb.defence.gov.au/uhtbin/cgisirsi/MhOktEUDHs/DSTOE/2423300 
>   10/60/54/X

That kind of thing happened to me a *lot*.

> When I look at the raw spool file (independent of mutt) I see:
>
>   <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
>   <HTML>
>   <HEAD>
>   <META HTTP-EQUIV=3D"Content-Type" CONTENT=3D"text/html; = 
>   charset=3Dus-ascii">
>   <META NAME=3D"Generator" CONTENT=3D"MS Exchange Server version = 
>   6.5.7652.24">
>   <TITLE>Link to catalogue</TITLE>
>   </HEAD>
>   <BODY>
>   <!-- Converted from text/rtf format -->
>   <BR>
>
>   <P><A = 
>   HREF=3D"http://odinr.dcb.defence.gov.au/uhtbin/cgisirsi/MhOktEUDHs/DSTOE/= 
>   242330010/60/54/X"><U><FONT COLOR=3D"#0000FF" SIZE=3D2 = 
>   FACE=3D"Arial">http://odinr.dcb.defence.gov.au/uhtbin/cgisirsi/MhOktEUDHs= 
>   /DSTOE/242330010/60/54/X</FONT></U></A>
>   </P>
>
> Would your script deal with this annoying problem (which I still don't 
> understand). If it would ... I am going to use it permanently :)

Yes, my script would handle that.

What you have there is an HTML email that's been quoted-printable 
encoded.

The MIME::Parser module automatically transforms the quoted-printable 
form:

    <P><A =
    HREF=3D"http://odinr.dcb.defence.gov.au/uhtbin/cgisirsi/MhOktEUDHs/DSTOE/=
    242330010/60/54/X"><U><FONT COLOR=3D"#0000FF" SIZE=3D2 =
    FACE=3D"Arial">http://odinr.dcb.defence.gov.au/uhtbin/cgisirsi/MhOktEUDHs=
    /DSTOE/242330010/60/54/X</FONT></U></A>
    </P>

Into straight-up HTML:

    <P><A
    HREF="http://odinr.dcb.defence.gov.au/uhtbin/cgisirsi/MhOktEUDHs/DSTOE/242330010/60/54/X"><U><FONT
    COLOR="#0000FF" SIZE=2
    FACE="Arial">http://odinr.dcb.defence.gov.au/uhtbin/cgisirsi/MhOktEUDHs/DSTOE/242330010/60/54/X</FONT></U></A>
    </P>

And then my script runs that through an HTML parser to extract the URL 
from the <A HREF=""> tag.
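
The same two steps (undo the quoted-printable, then pull the HREF out 
with an HTML parser) can be sketched in Python with nothing but the 
standard library. This is only an illustration of the idea, not the 
script's actual code; HrefCollector and urls_from_qp_html are made-up 
names:

```python
import quopri
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def urls_from_qp_html(qp_bytes):
    # "=3D" becomes "=", and "=" at the end of a line joins it to the next.
    html = quopri.decodestring(qp_bytes).decode("us-ascii", errors="replace")
    collector = HrefCollector()
    collector.feed(html)
    collector.close()
    return collector.hrefs


sample = (
    b'<P><A =\n'
    b'HREF=3D"http://odinr.dcb.defence.gov.au/uhtbin/cgisirsi/MhOktEUDHs/DSTOE/=\n'
    b'242330010/60/54/X"><U>link</U></A>\n'
    b'</P>\n'
)
print(urls_from_qp_html(sample))
```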

Obviously, to work its magic, my script needs access to the raw form 
of the email. If you feed it the pre-formatted output of a web browser 
like lynx, there's no way to tell whether a URL continues on the next 
line or not. But, given the raw form (i.e. following the directions on 
the web page), it will handle that.

The biggest difference between extract_url.pl and urlview is that 
urlview is just looking for URLs in plain text (which means when the 
line ends, so does the URL), while extract_url.pl is looking to decode 
things first, and so can reconstruct URLs that have been split over 
multiple lines. Because of that difference, extract_url.pl can be used 
as a pre-filter for urlview (it just prints out all the URLs in a form 
that urlview can understand).
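
To see why scanning plain text line by line loses the tail of a 
wrapped URL, here is a tiny Python sketch. The regular expression and 
the name scan_lines are my own simplifications for illustration; 
urlview's real pattern is different:

```python
import re

# Deliberately simple URL pattern: stops at whitespace, <, > or a quote.
URL_RE = re.compile(r'https?://[^\s<>"]+')


def scan_lines(text):
    """Find URLs one line at a time, so a URL ends where its line ends."""
    return [url for line in text.splitlines() for url in URL_RE.findall(line)]


wrapped = "see http://example.com/a/b \nc/d"
print(scan_lines(wrapped))                      # finds only the first half
print(scan_lines("see http://example.com/a/bc/d"))  # whole URL once rejoined
```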

This new version of extract_url.pl has the ability to do something 
else that urlview cannot, and that's maintain some sense of the 
context of a given URL from the original email. It's not perfect (take 
duplicate URLs for example), but I think it's worthwhile.

~Kyle
- -- 
Reason is itself a matter of faith. It is an act of faith to assert 
that our thoughts have any relation to reality at all.
                                                   -- G. K. Chesterton
-----BEGIN PGP SIGNATURE-----
Comment: Thank you for using encryption!

iD8DBQFIM8qIBkIOoMqOI14RAivAAKC4ZS9kfunofrnRsEdb9ChDjpy6UgCghIag
UGXsKhwXgpXPgZN7IRqfSEs=
=H+py
-----END PGP SIGNATURE-----