Re: extract_url.pl - version 1.3
- To: mutt-users@xxxxxxxx
- Subject: Re: extract_url.pl - version 1.3
- From: Kyle Wheeler <kyle-mutt@xxxxxxxxxxxxxx>
- Date: Wed, 21 May 2008 02:08:56 -0500
- In-reply-to: <20080521062447.GN86600@xxxxxxxxxxxxxxxxxxxxxxxxxxxx>
- List-post: <mailto:mutt-users@mutt.org>
- List-unsubscribe: send mail to majordomo@mutt.org, body only "unsubscribe mutt-users"
- Mail-followup-to: mutt-users@xxxxxxxx
- Openpgp: id=CA8E235E; url=http://www.memoryhole.net/~kyle/kyle-pgp.asc; preference=signencrypt
- References: <20080521001421.GA26283@xxxxxxxxxxxxx> <20080521062447.GN86600@xxxxxxxxxxxxxxxxxxxxxxxxxxxx>
- Sender: owner-mutt-users@xxxxxxxx
- User-agent: Mutt/1.5.13 (2006-08-11)
On Wednesday, May 21 at 02:24 PM, quoth Wilkinson, Alex:
>On Tue, May 20, 2008 at 07:14:21PM -0500, Kyle Wheeler wrote:
>
>> The original reason for this script was because urlview doesn't
>> correctly handle format=flowed email or any other email encodings,
>> so URLs are often mishandled or simply broken. This script handles
>> all known encodings *correctly* (when fed the raw email). It can be
>> used either as a standalone script (which requires the Curses::UI
>> perl module) or as a pre-filter for urlview.
>
> Ahh, now this is what I like to hear.
:)
> I have a few questions:
>
> 1. What is meant by "format=flowed email" ?
Email that is tagged as "format=flowed" (i.e. it says so in the
Content-Type header) informs the receiving client that some lines are
"connected" and some lines are not. Lines that end in a space are
considered "to be continued". It's kinda like putting a backslash at
the end of a shell-script line. This allows the client to re-format
and re-wrap all the lines in an email to fit whatever
display width is available. Note that this is for text/plain email
only; NOT HTML.
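For example (with trailing spaces shown here as «SP» so you can
actually see them), raw format=flowed text like this:

    This paragraph was wrapped by the sending client at«SP»
    about seventy columns, but may be rejoined and rewrapped.
    This short line stands on its own.

tells the receiving client that it may glue the first two lines back
together (and re-wrap them to any width), while the other line breaks
are real ones.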
Many mail clients can send format=flowed (also known as "f=f") mail,
including mutt, Eudora, and Apple's Mail.app, among others.
Part of the motivation for f=f mail is that the email spec limits line
length, and part of it is that devices like BlackBerries need to be
able to redisplay email on much smaller screens than the email was
written for.
There's a variant of f=f email, called delsp=yes email (i.e. it has
that tag in the Content-type header as well). The difference between
this variant and the "standard" f=f email is in how lines are joined:
when a line ends in a space, does the receiving client keep that space
or delete it? (delsp=yes means delete it.) Several email
clients use this technique to split long URLs over multiple lines
(thus obeying the line-length restrictions of the relevant email RFCs)
in a way that allows the client to reconstruct the original URL
easily. In practice, this means that sentences get broken up by two
spaces at the end of each line, while URLs get broken up by a single
space at the end of the line.
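In code, the join rule boils down to something like this (a rough
sketch of the logic, not the actual code from extract_url.pl):

    # assume $delsp is true iff the Content-Type said delsp=yes,
    # and each $line has already had its newline stripped
    if ($line =~ / $/) {            # trailing space: "to be continued"
        $line =~ s/ $// if $delsp;  # delsp=yes: drop the soft-break space
        $para .= $line;             # join with whatever comes next
    } else {
        $para .= $line . "\n";      # no trailing space: a real line break
    }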
My extract_url.pl handles both kinds of format=flowed email correctly.
(As an example, this email I'm sending right now is format=flowed
formatted. Note that most lines end in a space.)
> 2. What are the "known encodings" ?
Primarily, Base64 and quoted-printable. In essence, anything that's
understood by perl's MIME::Parser.
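If you're curious what that looks like, here's a rough sketch (not the
actual extract_url.pl code) of using MIME::Parser to get at the
decoded text parts of a raw message fed in on stdin:

    use MIME::Parser;

    my $parser = MIME::Parser->new;
    $parser->output_to_core(1);           # keep decoded parts in memory
    my $entity = $parser->parse(\*STDIN);  # feed it the raw email

    # walk the MIME tree and print every decoded text part
    sub dump_text {
        my $ent = shift;
        if ($ent->parts) {
            dump_text($_) for $ent->parts;
        } elsif ($ent->effective_type =~ m{^text/}) {
            print $ent->bodyhandle->as_string;  # already de-base64'd / de-QP'd
        }
    }
    dump_text($entity);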
> I often have broken links in the body of my emails and I don't know why e.g.
>
> The link is meant to look like:
>
>
> http://odinr.dcb.defence.gov.au/uhtbin/cgisirsi/MhOktEUDHs/DSTOE/242330010/60/54/X
>
> But I will always see it like this in mutt:
>
> http://odinr.dcb.defence.gov.au/uhtbin/cgisirsi/MhOktEUDHs/DSTOE/2423300
> 10/60/54/X
That kind of thing happened to me a *lot*.
> When I look at the raw spool file (independent of mutt) I see:
>
> <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
> <HTML>
> <HEAD>
> <META HTTP-EQUIV=3D"Content-Type" CONTENT=3D"text/html; =
> charset=3Dus-ascii">
> <META NAME=3D"Generator" CONTENT=3D"MS Exchange Server version =
> 6.5.7652.24">
> <TITLE>Link to catalogue</TITLE>
> </HEAD>
> <BODY>
> <!-- Converted from text/rtf format -->
> <BR>
>
> <P><A =
> HREF=3D"http://odinr.dcb.defence.gov.au/uhtbin/cgisirsi/MhOktEUDHs/DSTOE/=
> 242330010/60/54/X"><U><FONT COLOR=3D"#0000FF" SIZE=3D2 =
> FACE=3D"Arial">http://odinr.dcb.defence.gov.au/uhtbin/cgisirsi/MhOktEUDHs=
> /DSTOE/242330010/60/54/X</FONT></U></A>
> </P>
>
> Would your script deal with this annoying problem (which I still don't
> understand). If it would ... I am going to use it permanently :)
Yes, my script would handle that.
What you have there is an HTML email that's been quoted-printable
encoded.
The MIME::Parser module automatically transforms the quoted-printable
form:
<P><A =
HREF=3D"http://odinr.dcb.defence.gov.au/uhtbin/cgisirsi/MhOktEUDHs/DSTOE/=
242330010/60/54/X"><U><FONT COLOR=3D"#0000FF" SIZE=3D2 =
FACE=3D"Arial">http://odinr.dcb.defence.gov.au/uhtbin/cgisirsi/MhOktEUDHs=
/DSTOE/242330010/60/54/X</FONT></U></A>
</P>
Into straight-up HTML:
<P><A
HREF="http://odinr.dcb.defence.gov.au/uhtbin/cgisirsi/MhOktEUDHs/DSTOE/242330010/60/54/X"><U><FONT
COLOR="#0000FF" SIZE=2
FACE="Arial">http://odinr.dcb.defence.gov.au/uhtbin/cgisirsi/MhOktEUDHs/DSTOE/242330010/60/54/X</FONT></U></A>
</P>
And then my script runs that through an HTML parser to extract the URL
from the <A HREF=""> tag.
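That last step is pretty simple, too. A bare-bones version of the idea
(just a sketch using HTML::Parser, and assuming the decoded HTML is
already sitting in $decoded_html) would be:

    use HTML::Parser;

    my %urls;   # a hash, so duplicate links collapse to one entry
    my $p = HTML::Parser->new(
        api_version => 3,
        start_h     => [ sub {
            my ($tag, $attr) = @_;
            $urls{ $attr->{href} } = 1
                if $tag eq 'a' && defined $attr->{href};
        }, 'tagname, attr' ],
    );
    $p->parse($decoded_html);   # the straight-up HTML from above
    $p->eof;
    print "$_\n" for keys %urls;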
Obviously, in order to work its magic, my script needs access to the
raw form of the email (if you feed it the pre-formatted output of a
web browser like lynx, there's no way to tell whether a URL continues
on the next line or not). But, given the raw
form (i.e. following the directions on the web page), it will handle
that.
The biggest difference between extract_url.pl and urlview is that
urlview is just looking for URLs in plain text (which means when the
line ends, so does the URL), while extract_url.pl decodes things
first, and so can reconstruct URLs that have been split over
multiple lines. Because of that difference, extract_url.pl can be used
as a pre-filter for urlview (it just prints out all the URLs in a form
that urlview can understand).
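If you want to try that, a mutt macro along these lines should do it
(adjust the key and the path to taste, and make sure $pipe_decode is
unset so the script sees the raw message):

    macro index,pager \cb "<pipe-message> extract_url.pl | urlview<Enter>" "extract URLs with urlview"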
This new version of extract_url.pl can also do something that urlview
cannot: maintain some sense of the context of a given URL from the
original email. It's not perfect (take
duplicate URLs for example), but I think it's worthwhile.
~Kyle
-- 
Reason is itself a matter of faith. It is an act of faith to assert
that our thoughts have any relation to reality at all.
-- G. K. Chesterton