Bob Belderbos | Software Developer

A Python script to import a complete blog to plain text

In today's post I'll show you a script I wrote yesterday to import an entire blog based on an (XML) sitemap. It also converts the posts from HTML to plain text. Although it was merely Python practice, it could actually be useful to export/back up blogs to a single text file for easy consumption in a text editor or offline.

How it works / some comments:

  • The class gets instantiated with a couple of arguments:

    # create instance
    blog = ImportBlogPosts("", '<div class="entry-content">', '<div><br /><h4><strong>You might also like:')
    # get all URLs with /20 in them (this means real post URLs in my blog's case, so ignoring the about, archive and home pages)
    blog.import_post_urls("/20")
    # ... or just a single URL:
    blog.import_post_urls('')

    The first arg is clear: the blog URL. The 2nd and 3rd args are the HTML start and end snippets used to grab the actual post (and not all the fluff around it). The sitemap has to be present at URL/sitemap.xml, or it could be a file on the FS. I did see quite a few blogs without a sitemap, so an RFE would be to let the class generate the sitemap if not present ...

  • Importing web content: this is easy in Python with urllib.urlopen, but in this case I had some issues with body content not being imported, so I used "wget -q .." with subprocess instead.
  • I find the native module "xml.dom.minidom" pretty convenient for parsing XML.
  • I use a nice module for html-to-text conversion: html2text. It delivers markdown format. One tricky thing with encodings (a subject I need to study on its own one day): to get the script working when printing to stdout as well as when redirecting its output to a file (with '>' on Unix) I had to use this magic:

      postContent = postContent.decode('utf-8')
      print html2text.html2text(postContent).encode('ascii', 'ignore')

    Yes that is right: decode from utf-8, and encode to ascii. Anybody having a better way or explanation, please comment.
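One possible explanation for the two lines above, sketched in Python 3 (the script itself is Python 2): the bytes that come back from wget are UTF-8, so decode('utf-8') turns them into text, and encode('ascii', 'ignore') then drops every character plain ASCII cannot represent. In Python 2, stdout has no encoding when it is redirected to a file, so printing non-ASCII text there falls back to the ascii codec and fails; dropping those characters avoids the error at the cost of losing them. The sample string is made up for illustration.

```python
# Bytes as they might arrive from the network: "cafe - naive" with accents
# and an en dash, encoded as UTF-8 (written with escapes to keep this ASCII).
raw = "caf\u00e9 \u2013 na\u00efve".encode("utf-8")

# Step 1: decode the UTF-8 bytes into text (str in Python 3, unicode in Python 2).
text = raw.decode("utf-8")

# Step 2: encode to ASCII, silently dropping anything outside the ASCII range.
ascii_only = text.encode("ascii", "ignore")

# Alternative that keeps every character: write UTF-8 bytes instead of ASCII.
utf8_again = text.encode("utf-8")

print(ascii_only)
print(utf8_again == raw)
```

If losing the accented characters is not acceptable, encoding the output as UTF-8 (the alternative shown last) is probably the better choice, assuming the editor used to read the file handles UTF-8.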

The code

Here is a copy of the code; see also here on GitHub.

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Author: Bob Belderbos / written: Dec 2012
# Purpose: import all blog posts to one file, converting them in (markdown) text
# Thanks to html2text for doing the actual conversion ( ) 
import os, sys, pprint, xml.dom.minidom, urllib, html2text, subprocess

class ImportBlogPosts(object):
  """ Import all blog posts and create one big text file (pdf would increase size too much,
      and I like searching text files with Vim).  It uses the blog's sitemap to get all URLs. """

  def __init__(self, url, poststart, postend, sitemap="sitemap.xml"):
    """ Specify blog url, where post html starts/ stops, what urls in sitemap are valid, and sitemap """
    self.sitemap = sitemap
    self.sitemapUrl = "%s/%s" % (url, self.sitemap)
    self.postStartMark = poststart # where does post content html start?
    self.postEndMark = postend # where does post content html stop?
    if not os.path.isfile(self.sitemap):
      cmd = "wget -q %s" % self.sitemapUrl
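      # download the sitemap; wget returns a non-zero exit code on failure
      if subprocess.call(cmd.split()) != 0:
        sys.exit("Non-zero exit code returned from %s, exiting ..." % cmd)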
    self.blogUrls = self.parse_sitemap(self.sitemap)

  def parse_sitemap(self, sitemap):
    """ Parse blog's specified xml sitemap """
    posts = {}
    dom = xml.dom.minidom.parse(sitemap)
    for element in dom.getElementsByTagName('url'):
      url = self.getText(element.getElementsByTagName("loc")[0].childNodes)
      mod = self.getText(element.getElementsByTagName("lastmod")[0].childNodes)
      posts[url] = mod # there can be identical mods, but urls are unique
    urls = []
    # return urls ordered desc on last mod. date
    for key, value in sorted(posts.iteritems(), reverse=True, key=lambda (k,v): (v,k)):
      urls.append(key)
    return urls

  def getText(self, nodelist):
    """ Helper method for parsing XML childnodes (see parse_sitemap) """
    rc = ""
    for node in nodelist:
      if node.nodeType == node.TEXT_NODE:
        rc = rc + node.data
    return rc

  def import_post_urls(self, urlCriteria="http"):
    """ Loop over blog URL getting each one's content, default 'http' practically results in importing all links """
    for i, url in enumerate(self.blogUrls):
      if urlCriteria in url:
        html = self.get_url(url)
        if html is not None:
          self.print_banner(i, url)
          self.print_content(url)

  def get_url(self, url):
    """ Import html from specified url """
    try:
      f = urllib.urlopen(url)
      html = f.read()
      f.close()
      return html
    except IOError:
      print "Problem getting url %s" % url
      return None

  def print_banner(self, i, url):
    """ print a banner for a specified URL (to separate from content) """
    divider = "+"*120
    print "\n\n"
    print divider
    print "%i) %s" % (i, url)
    print divider
    print "\n"

  def print_content(self, url):
    """ Get blog post's content, get relevant html, then convert to plain text """
    try:
      # I know, I probably should have used urllib.urlopen but somehow
      # it doesn't import the body html, so using good ol' wget as workaround
      cmd = "wget -q -O - %s" % url
      html = subprocess.check_output(cmd.split())
    except subprocess.CalledProcessError as e:
      print "Something went wrong importing %s, error: %s" % (url, e)
      return False
    postContent = self.filter_post_content(html)
    if postContent is None:
      print "postContent == None, something went wrong in filter_post_content?"
    else:
      try:
        # to print in terminal decode from utf-8 needed, to print and redirect
        # script's output to file with >, that only works with ascii encode
        postContent = postContent.decode('utf-8')
        print html2text.html2text(postContent).encode('ascii', 'ignore')
      except Exception:
        print "Cannot convert this post's html to plain text"

  def filter_post_content(self, textdata):
    """ Takes the post page html and return the post html body """
    try:
      post = textdata.split(self.postStartMark)
      post = "".join(post[1:]).split(self.postEndMark)
      return post[0]
    except Exception:
      print "Cannot split post content based on specified start- and endmarks"
      return None

# end class

### run this program from cli
import optparse
parser = optparse.OptionParser()
parser.add_option('-u', '--url', help='specify a blog url', dest='url')
parser.add_option('-b', '--beginhtml', help='first html (div) tag of a blog post', dest='beginhtml')
parser.add_option('-e', '--endhtml', help='first html after the post content', dest='endhtml')
parser.add_option('-s', '--sitemap', help='sitemap name, default = sitemap.xml', dest='sitemap', default="sitemap.xml")
parser.add_option('-p', '--posts', help='url string to filter on, e.g. "/2012" for all 2012 posts', dest='posts', default="http")
(opts, args) = parser.parse_args()

# Making sure all mandatory options appeared.
mandatories = ['url', 'beginhtml', 'endhtml']
for m in mandatories:
  if not opts.__dict__[m]:
    print "Mandatory option is missing\n"
    parser.print_help()
    sys.exit(1)

# Execute program with given cli options:
blog = ImportBlogPosts(opts.url, opts.beginhtml, opts.endhtml, opts.sitemap)
blog.import_post_urls(opts.posts)

### example class instant. syntax, and using it for other blogs
# + instant class
# blog = ImportBlogPosts("", '<div class="entry-content">', '<div><br /><h4><strong>You might also like:')
# + all posts my blog:
# blog.import_post_urls("/20")
# + only one post my blog:
# blog.import_post_urls('')
# + another single post on my blog:
# blog.import_post_urls('')
# + other blogs:
# blog = ImportBlogPosts("", '<div class="entry">', '<div class="home_bottom">', "zenhabits.xml") 
# blog = ImportBlogPosts("", '<div class="post content">', '<div class="clearfix"></div>', "/Users/bbelderbos/Downloads/gary.xml") 
# + import all urls
# blog.import_post_urls()
# blog = ImportBlogPosts("", '<div class="post-wrapper">', 'related posts', "/Users/bbelderbos/Downloads/programmingzen.xml") 
# + supposedly all posts
# blog.import_post_urls("/20")
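
The sitemap step (parse_sitemap and getText above) can be tried in isolation. Below is a minimal sketch in Python 3, not the script's own code: the sitemap content, the example.com URLs and the helper names node_text and urls_newest_first are made up for illustration. Unlike the script, it tolerates a url entry without a lastmod element by sorting it last.

```python
import xml.dom.minidom

# Made-up sitemap for illustration; a real one would come from URL/sitemap.xml.
SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/2012/11/older-post.html</loc><lastmod>2012-11-03</lastmod></url>
  <url><loc>http://example.com/2012/12/newer-post.html</loc><lastmod>2012-12-20</lastmod></url>
  <url><loc>http://example.com/about.html</loc></url>
</urlset>"""


def node_text(element, tag):
    """Return the concatenated text of the first <tag> child, or "" if absent."""
    matches = element.getElementsByTagName(tag)
    if not matches:
        return ""
    return "".join(node.data for node in matches[0].childNodes
                   if node.nodeType == node.TEXT_NODE)


def urls_newest_first(sitemap_xml):
    """Return all <loc> values, sorted descending on <lastmod>, then on URL."""
    dom = xml.dom.minidom.parseString(sitemap_xml)
    posts = [(node_text(url, "lastmod"), node_text(url, "loc"))
             for url in dom.getElementsByTagName("url")]
    return [loc for _, loc in sorted(posts, reverse=True)]


print(urls_newest_first(SITEMAP))
```

Sorting (lastmod, loc) tuples gives the same ordering idea as the script's sort key (v, k): last-modified date first, URL as tie-breaker.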

Running the script / example outputs

I ran the following command to get all my 2012 blog posts in a text file:

$ python  -u \
  -b '<div class="entry-content">' -e '<div><br /><h4><strong>You might also like:' \
  -p "" > import_blog_posts_bb_2012.txt

You can see the result here. Run it without the -p option (no post filter) and you get over 140 posts == 14,000 lines ... useful for quickly vi-ing anything and copying code snippets ;)
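The -p filter is nothing more than a substring test on each URL, with "http" as the default so that practically every link matches. A minimal Python 3 sketch of that selection follows; the helper name select_urls and the example.com URLs are hypothetical, not part of the script.

```python
def select_urls(urls, criteria="http"):
    """Keep URLs containing criteria, paired with their index in the full list."""
    return [(i, url) for i, url in enumerate(urls) if criteria in url]


# Made-up URLs standing in for the list parsed from the sitemap.
urls = [
    "http://example.com/2012/12/python-notes.html",
    "http://example.com/2011/05/git-tips.html",
    "http://example.com/about.html",
]

print(select_urls(urls, "/2012"))  # only the 2012 post
print(select_urls(urls, "git"))    # only the post with "git" in its URL
print(len(select_urls(urls)))      # default "http" keeps everything
```

Because the index comes from the full list, the number printed in each banner stays the same whichever filter is used.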

Another example: run it with -p "git" to get my 3 posts that had "git" in the url:

$ python  -u \
  -b '<div class="entry-content">' -e '<div><br /><h4><strong>You might also like:' \
  -p "git" > git.txt
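
The -b and -e values in the commands above are plain substring markers: everything between the first one and the next occurrence of the second one is treated as the post. Here is a minimal Python 3 sketch of that extraction, written as a variant of filter_post_content rather than a copy of it; the function name extract_between and the sample page are made up for illustration.

```python
def extract_between(html, start_mark, end_mark):
    """Return the text between start_mark and the next end_mark, or None."""
    start = html.find(start_mark)
    if start == -1:
        return None  # start marker not on the page
    start += len(start_mark)
    end = html.find(end_mark, start)
    if end == -1:
        return None  # end marker not found after the start marker
    return html[start:end]


# Made-up page with a post body followed by a "related posts" block.
page = ('<html><div class="entry-content"><p>Hello</p>'
        '<div class="related">x</div></html>')

post = extract_between(page, '<div class="entry-content">', '<div class="related">')
print(post)
```

Returning None when a marker is missing makes a wrong -b or -e value easy to spot, whereas a split-based approach would hand back an empty string in that case.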