I wanted to have a book category list for a potential feature in My Reading List. I found this page. Then I wondered how I could parse the html to reuse the categories. It turned out to be pretty easy in Perl :)

You can download the script here. It does the following:

  • 1. Usage: perl -w parseCategories.pl -j -s, where -j = json, -s = sql, -t = text 
    Sql would be to import the data into a table for re-use (e.g. autocomplete)
  • 2. it uses LWP::Simple to import the content of the mentioned website into a variable
  • 3. it splits the content in an array and loops over the lines, checking for the patterns:
    a. category (font.*<ul>),
    b. subcategory (<li><a href...)
    - it puts those in a hash
  • 4. depending the cli option, it provides the output

Perl's motto is TMTOWTDI (There's more than one way to do it), so I would be happy to hear any suggestions to improve this script.
At least it got the job done. It was just a quick test. Ideally you would want to make it re-usable:
1. receive URL from cli as well
2. put parsing in functions and let user define those as well.
- with this in place it could be extended to be a generic/simple URL/content parser. 

Bob Belderbos

Software Developer, Pythonista, Data Geek, Student of Life. About me