Exporting Old use.perl.org Blog Entries

December 11th, 2010

Perluse.perl.orgpQueryLWP::Simple

This week-end I finally got around importing all my old use.perl.org blog entries to Fearful Symmetry. To ease off the migration, I ended up writing two itsy-bitsy scripts. They’re nothing fancy, but in case they might help someone, here they are.

Harvest the entries

This was easy. For each account, use.perl.org has a journal entries listing page. So the whole operation consisted of grabbing that webpage and mirror everything on it looking like a journal entry. Not terribly sophisticated, but for this specific job it’s all we need.

Of the script itself, the most interesting part is LWP::Simple::getstore(). Most people know and use LWP::Simple::get(), but more than a few forget its sibling, which save the retrieved webpage directly to a file — which is perfect for harvesting activities like this one.

harvest_entries.pl

Extract the information off the harvested pages

As one might suspect, the harvested use.perl.org pages contain a little bit more than the raw blog entries. Getting to the information we want — the blog entry’s title, creation date, body, etc — is not hard, but it’s a little onerous to do by hand.

There are a lot of way to extract information from a webpage, from quick and dirty regular expressions (like I did in for the script above) to full-fledged DOM parsing using, say, HTML::Tree. As I’m playing a lot with jQuery these days, I wondered if there was anything Perlish available offering the same type of interface. Guess what? There is: pQuery.

After playing with it a little bit, I’d say that pQuery is not quite as slick and ready for prime-time as its JavaScript forebear. But again, for this small task, it allowed me to do the job.

The resulting script is as straight-forward as they come. I used Firebug to find out which html elements I want, tested the resulting paths with jQuery and, once I was happy with the result, adapted the result to pQuery.

extract_entry.pl

It’s harvesting time

With those two scripts ready to go, the harvesting process becomes much less of a chore:

$ perl files/harvest_entries.pl 
retrieving 38951...
retrieving 38951...

$ perl files/extract_entry.pl 38951 
title: Breaking off from the use.perl.org mothership
date: 10 May 2009
original url: http://use.perl.org/~Yanick/journal/38951


<p>
For the last couple of months, as a concession between
visibility and control, I'd been double-posting my blog
entries both here and on my
personal blog.
But now that my blog is registered on both the
<a href="http://perlsphere.net/" rel="nofollow">Perlsphere</a> and
<a href="http://ironman.enlightenedperl.org/" rel="nofollow">IronMan</a> aggregators,
the need for the second posts here has dwindled.  So... I'm going
on a limb and tentatively turn off the echoing.
See y'all on <a href="http://babyl.dyndns.org/techblog" rel="nofollow">Hacking Thy Fearful
Symmetry</a>!</p>

Of course, there is still the grooming of the use.perl.org html, and the actual importing to the new blogging engine. But… surely a handful of other scripts can take care of that, right? :-)