Web scraping a friend's blog
A friend who has a blog on a popular blogging farm asked me for help backing it up. The site doesn't offer a backup download, so we'll use Web scraping!
- Web scraping > Interactive prototyping
- Requirements (features)
- Back up what? Plain text of all her posts ever, timestamps, maybe some of the styling applied to the text (as HTML), the text of the comments on each post, and images embedded in posts.
- Periodically harvest new posts and new comments. Automatically?
- Archive format? Enable local browsing, i.e., convert each post to an HTML document, inserting related comments and images. A frames-based index page could be convenient. What for?
- How to deploy (distribute, install, maintain?) the solution? (Pure Python, plus Beautiful Soup, I guess?)
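The archive-format idea above (one locally browsable HTML document per post, with its comments inserted) could be sketched roughly as follows. The `Post` and `Comment` classes and the `render_post` function are hypothetical names made up for this sketch, not anything the site provides. It assumes the post body is kept as HTML taken from the original page, while titles, timestamps, and comment text are plain text that must be escaped. Rewriting image links to point at local copies is not shown.

```python
from dataclasses import dataclass, field
from html import escape
from typing import List


@dataclass
class Comment:
    author: str
    text: str  # plain text, escaped on output


@dataclass
class Post:
    title: str
    timestamp: str
    body_html: str  # styling markup kept from the original post (assumed safe)
    comments: List[Comment] = field(default_factory=list)


def render_post(post: Post) -> str:
    """Build one standalone HTML document for a single archived post."""
    title = escape(post.title)
    comment_items = "\n".join(
        f"<li><b>{escape(c.author)}</b>: {escape(c.text)}</li>"
        for c in post.comments
    )
    return (
        "<!DOCTYPE html>\n"
        f'<html><head><meta charset="utf-8"><title>{title}</title></head>\n'
        "<body>\n"
        f"<h1>{title}</h1>\n"
        f'<p class="timestamp">{escape(post.timestamp)}</p>\n'
        f"{post.body_html}\n"
        f"<h2>Comments</h2>\n<ol>\n{comment_items}\n</ol>\n"
        "</body></html>\n"
    )
```

Writing the returned string to one file per post, named after the timestamp or a slug, would give a browsable archive; an index page would then only need to list those files.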
- Crawling
- Algorithm? Tentatively: fetch the front page, find the link to the first entry listed (the newest, assuming they're in reverse chronological order), harvest that post, then follow the link to the previous post and harvest recursively. But: we don't want to harvest a post twice (efficiency), yet what if an old entry gets new comments, or is edited (if that's possible)? Robustness? "Staggering" (throttling, out of civility)?
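A minimal sketch of that crawl, under several assumptions: it is written as a loop rather than recursion (a long blog could otherwise hit Python's recursion limit), the `fetch` and `find_previous` callables are hypothetical and supplied by the caller, and "don't harvest twice" is approximated by remembering a hash of each page so that a post is only reported again when its HTML has changed (for example because of a new comment). Every page is still fetched, since the link to the previous post lives in the page itself.

```python
import hashlib
import time
from typing import Callable, Dict, Iterator, Optional, Tuple


def crawl_posts(
    first_url: str,
    fetch: Callable[[str], str],                    # hypothetical: URL -> HTML text
    find_previous: Callable[[str], Optional[str]],  # hypothetical: HTML -> previous post URL
    known_hashes: Dict[str, str],                   # URL -> hash from earlier runs
    delay_seconds: float = 2.0,
) -> Iterator[Tuple[str, str]]:
    """Yield (url, html) for every post that is new or has changed."""
    url: Optional[str] = first_url
    visited = set()  # guards against link cycles
    while url is not None and url not in visited:
        visited.add(url)
        html = fetch(url)
        digest = hashlib.sha256(html.encode("utf-8")).hexdigest()
        if known_hashes.get(url) != digest:
            known_hashes[url] = digest
            yield url, html
        url = find_previous(html)
        if url is not None:
            time.sleep(delay_seconds)  # throttle, out of civility
```

Hashing the whole page is crude: if the site puts anything dynamic into the page (ads, visit counters), every post will look changed on every run, so hashing only the extracted post and comment text would probably be more useful. Persisting `known_hashes` between runs (for instance as a JSON file) is left out.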
- Parsing the HTML
- Site's HTML is horrendous beyond words. Unspeakable, really. We'll probably use Beautiful Soup to extract content, or regular expressions.
Saved the root page as a WAR archive (it's a tarball) and unpacked its contents: 65 files, 1.6MiB total. index.html is 846KB (of which the actual text is less than 1KB!), and 33 of the files are small (<2KB) icons.
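To get a feel for pulling the small amount of real text out of that much markup, here is a rough sketch. It uses the standard library's `html.parser` instead of Beautiful Soup, purely so that it has no dependencies; that is a swap for illustration, not a decision. The container class name `post-body` is an assumption: the real site's markup has not been inspected closely enough to know where the post text lives.

```python
from html.parser import HTMLParser


class PostTextExtractor(HTMLParser):
    """Collect the text found inside <div class="..."> with a given class."""

    def __init__(self, container_class: str = "post-body"):  # assumed class name
        super().__init__(convert_charrefs=True)
        self.container_class = container_class
        self.depth = 0    # > 0 while inside the container div
        self.chunks = []  # text fragments found inside the container

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth > 0:
            self.depth += 1  # nested div inside the container
        else:
            classes = (dict(attrs).get("class") or "").split()
            if self.container_class in classes:
                self.depth = 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def extract_post_text(html: str, container_class: str = "post-body") -> str:
    """Return the container's text with runs of whitespace collapsed."""
    parser = PostTextExtractor(container_class)
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.chunks).split())
```

This relies on the `<div>` tags being balanced, which may well not hold for this site; a more forgiving parser such as Beautiful Soup is likely the better choice for the real thing. It also keeps any script or style text that happens to sit inside the container.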
- Operation…
Last modified 2009-08-06 07:21:21 +0000