Some shell tricks for more efficient command line use

In this post, some shell / Unix techniques I use over and over again to be more efficient at the command line.

  • Aliases (.bashrc) save time. When I type "lt", the shell runs "ls -lrth"; "ht" takes me to "/Applications/XAMPP/htdocs". All those seconds add up over time (see the example aliases after this list).
  • Stay in the same terminal by putting programs in the background. I use this often in Vim: ctrl-z to suspend Vim, do stuff at the command line, go back to Vim with "fg" (when suspending > 1 process, use "jobs" and "fg %number" to toggle between them).
  • Diff two command outputs without creating files. For example, after taking the "title" out of one of my pages, I could diff the local version against the remote site:
  • $ diff <(cat header.php) <(ssh bob 'cat ~/public_html/fbreadinglist/header.php')
    42a43
    >   <title><?php  echo $title; ?></title>
    

    Any command can go inside the parentheses (process substitution) - quite a powerful technique.

  • I use sed daily to search and replace patterns from the command line, but its regex engine is archaic: you have to escape capturing parentheses (), and you don't have +, {} or non-greedy quantifiers (*?). You can get a richer regex engine by using a Perl one-liner: perl -pe 's/pattern/replace/g' - the same syntax, but with "sed" replaced by "perl -pe" (see the comparison after this list). This is a great alternative to have when text parsing from the command line gets more complicated.
  • Search history: press Ctrl-R and you get (reverse-i-search)`': here you can type a string to search through your shell history. This saves a lot of time compared to hitting the up arrow many times.
  • Navigation: use "cd -" to go to the previous directory (like a back button). If I need to sidetrack, I type "bash" to do work in a new shell; when exiting this new shell, I can continue where I left off.
  • "set -e" in a shell script will cause it to die immediately when any command returns a non-zero (= not ok) exit status. This is useful to spot errors early and not continue with bad assumptions.
  • Execute in combination with find: change permissions on all 644 files to 664: $ find . -perm 644 -exec chmod 664 {} \;
  • Another example is to recursively search project files for "todo" actions: $ find . -name '*.py' -exec grep -inH "todo" {} \;

    A variant on this is the use of xargs, which reads items from pipes / stdin: $ find *py | xargs grep -l todo 2>/dev/null

  • Command substitution: same as last example but with a for loop and using this technique: $ for i in $(find . -perm 644); do chmod 664 $i; done
  • Any command can go inside the $( ), and I often use this to loop over the output of a command or script.

  • Brace expansion lets you quickly back up a file: $ cp file{,.org} (which expands to: cp file file.org)
  • Sometimes the "ls" command is too verbose; to see only subdirectories, use the -d switch: $ ls -d */
  • Use shortcuts to move around on the command line. You can set bash to emacs or vi mode, see here. I use ctrl+a/e/u/l the most, to respectively go to the start/end of the line, delete everything before the cursor, and clear the screen.
  • Change your prompt. Mine: PS1='[\u@\h \W$(__git_ps1 " (%s)")]\$ ' == [bbelderbos@Bob-Belderboss-MacBook-Pro posts (master)]$
  • Note that I use git-prompt.sh to show the current git branch in my prompt.

  • Useful to know about mkdir: the -m option lets you specify the permissions of the directory: $ mkdir -m 777 dir_with_777_perms ; the -p option lets you create directory trees: $ mkdir -p /deep/file/tree/level4/level5/level6
  • No cat is needed for grep:
  • # extra process
    $ cat tmp/a/longfile.txt | grep string 
    # this is better / faster
    $ grep string tmp/a/longfile.txt
    
  • Redirecting stderr to stdout is easy: add 2>&1 to your command: $ find / -name foo 2>&1

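Two quick illustrations of the alias and sed vs. perl -pe bullets above. First, the aliases as they could appear in .bashrc (a minimal sketch; the paths are from my machine):

# from .bashrc
alias lt='ls -lrth'
alias ht='cd /Applications/XAMPP/htdocs'

And the same substitution done with sed and with perl -pe (input is just an example): sed needs escaped capture groups, while perl also gives you +, {n,m} and non-greedy *?:

$ echo foobar | sed 's/\(foo\)bar/\1baz/'
foobaz
$ echo foobar | perl -pe 's/(foo)bar/${1}baz/'
foobaz
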
What are your favourite Shell / Unix tricks?

Of course the tips above are just a fraction of what is possible. Please share your time-savers in the comments below ...

Fundamental SSH tricks when working across multiple hosts

Time to note down some SSH tricks. Vim, Git, Unix, SSH! All very efficient when you spend some time learning the ins and outs.

Key-based login

It is hard to keep up with many hostnames and passwords, so the first thing to set up is a key-based login. First run ssh-keygen -t dsa and then ssh-copy-id (if the latter is not present, check out this one-liner).
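
A typical sequence (user and host are placeholders):

$ ssh-keygen -t dsa
$ ssh-copy-id user@remote_host
# the classic fallback if ssh-copy-id is not installed:
$ cat ~/.ssh/id_dsa.pub | ssh user@remote_host 'mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys'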

The power of ~/.ssh/config

Man, check out this nice intro. Using this file is more convenient than shell aliases. I even used this file to solve a mysterious "Too many authentication failures for [user]" error, which happened (seen with ssh -v) when multiple keys were offered to the remote host!
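
A minimal sketch of an entry (host name and key path are hypothetical). IdentitiesOnly is one way to stop ssh from offering every key it knows about, which is what triggers that error:

# ~/.ssh/config
Host bob
    HostName bob.example.com
    User bbelderbos
    IdentityFile ~/.ssh/id_dsa
    IdentitiesOnly yes

With this in place, "ssh bob" is all that is needed to log in.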

X11 forwarding

Run ssh -X host to show remote X applications on your local screen (if X is running on your local machine)

Run commands on a remote server

ssh user@remote_host 'cmd' - this is a powerful way to quickly analyze data on different hosts without the burden of logging in and copying terminal output. See a nice Perl loop example.
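
The linked example uses a Perl loop; a shell equivalent could look like this (host names hypothetical):

$ for host in web1 web2 web3; do echo "== $host =="; ssh $host 'uptime'; done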

Pipe data from/to a remote system

One of my favorites: see this post for two powerful examples:

  • 1. send a big directory to another server, compressing / uncompressing it via the pipe to ssh:
  • $ tar -cz content | ssh user@remote_host 'tar -xz'

    The other way around works just as well to get files off a remote server: $ ssh remote_host tar c content/ | tar xv

  • 2. back up a remote mysql database to STDOUT and receive it via pipe/SSH as STDIN to import it into a local DB:
  • $ ssh user@remote_host 'mysqldump -udbuser -ppassword dbname' | mysql -uroot -ppassword backup

Mount a remote directory

$ sshfs user@remote_host:/home/user/documents local_folder/ - see this ssh hacks post.

Practical example

This just happened trying to convert this post from text to html:

$ perl parsepost.pl 2013.05.12_ssh_tricks.txt 
Can't locate HTML/Entities.pm in @INC ...

Instead of fixing it right away I took the opportunity to try a quick workaround via ssh:

# cat the txt post to STDOUT,
# via pipe/ssh to the remote server it goes into the perl script as STDIN (-),
# the result is redirected to a file on my local server,
# note that I can do "ssh bob" thanks to my .ssh/config setup ;)
$ cat 2013.05.12_ssh_tricks.txt | ssh bob <path>/perl/wordpress_parse_post.pl - > 2013.05.12_ssh_tricks.html

Only the beginning

Forwarding, tunneling, proxies, access to services through a firewall ... much more is possible. Check out SSH: More than secure shell for other SSH use cases.

Your favorite SSH trick / hack?

Feel free to comment below ...

Mastering Git: 5 useful tips to increase your skills

A little over a year ago I wrote my first post on git with some basics. I had just started using it, and since then I have never looked back. Here is a follow-up post with some more basic-to-intermediate Git operations ...

Git has definitely saved me a lot of time and given my projects much better structure (you become better at committing (smaller) chunks of code!). I am also not afraid anymore to make changes, which leads to more freedom when coding. In this article, some random things I learned; in a couple of months I will follow up with an "advanced Git" post ...

Branching

Branches are cheap (41 bytes), and it is very convenient to work on bugs and new features, or to try out ideas, in isolation (without touching the main branch, aka master).

So first thing I now do when working on something non-trivial:

$ git checkout -b new_branch_name

This creates the new branch and puts you on it at once (longer version: $ git branch new_branch_name && git checkout new_branch_name). To merge? Simple:

# go to receiver branch
# any changes on branch you come from need to be committed
$ git checkout master 
# fold the branch in if happy with changes
$ git merge new_branch_name

It is best to merge often: the longer you diverge from master, the more likely it is you will have to resolve merge conflicts.

Rebasing

Today I learned about a new concept: rebasing. This basically allows you to clean up history: you can make it appear that all work happened in series, even though it originally happened in parallel. It also lets you move a branch forward in time so that it stays closer to master.

!! Note that you want to do this on local commits only: you are effectively changing history, so don't do this on code that has been shared (pushed to the remote repo). Now that I branch more often, I can see myself using this feature ...
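
A minimal sketch of both uses (branch name from the branching section above):

# replay the branch's commits on top of the current tip of master
$ git checkout new_branch_name
$ git rebase master

# clean up your last three local commits (squash / reword / reorder)
$ git rebase -i HEAD~3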

Undo commands

Again, something to do only before pushing code to a team; to correct your last commit, easy:

$ git commit --amend

This not only lets you change your commit message, it also allows you to stage new files (git add .), which will then be included in the last commit you are amending.
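
For example (file name hypothetical):

$ git add forgotten_file.py
$ git commit --amend
# forgotten_file.py is now part of the amended commit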

Sometimes you are not happy with commits before the last one; --amend cannot help you there, but git reset can. You want to be cautious with this command though:

# undo commit, but keep files in changed state
$ git reset --soft tree-ish (e.g. HEAD^) 
# undo commit and also delete the changes, as if the change(s) never happened
$ git reset --hard tree-ish (e.g. HEAD^)

# example wiping out last commit: 
$ git log -2
commit d4d479a6121d6757dfcf49795dc0d7134be474f6
..
    test commit

commit 60cbc505273e6f583f347844069106fb1ffde5eb
..

$ git reset --hard HEAD^
HEAD is now at 60cbc50 make an extra div for flexible content under the video
$ git status
# On branch new_feature
nothing to commit (working directory clean)

See here for a good explanation of the different switches of git reset.

Also handy: to revert uncommitted changes, just do:

$ git checkout -- file(s)

... and the changes on the file(s) are cancelled. More about undoing in git.

Useful tools

There are a lot; here are just a few handy things I learned in the tools space:

  • Visualization of git history: check out gitk, or create the following alias to graphically see branches etc in the log:
  • $ git config --global --add alias.lol "log --graph --decorate --pretty=oneline --abbrev-commit --all"
    
  • Learn to use a bare repo (see the sketch after this list). Some time ago I made a script that saves time creating remotes on my hosting server.
  • Use autocomplete and show the current branch in your shell prompt. Configure the following in your shell init file:
  • # from .bashrc
    ..
    source ~/.git-completion.bash
    source ~/.git-prompt.sh
    PS1='[\u@\h \W$(__git_ps1 " (%s)")]\$ ' 
    
    # prompt now shows active branch
    [bbelderbos@Bob-Belderboss-MacBook-Pro youtube_feed (master)]$ git checkout -b new_feature
    Switched to a new branch 'new_feature'
    [bbelderbos@Bob-Belderboss-MacBook-Pro youtube_feed (new_feature)]$ 
    
  • Stashing allows you to store edits on a stack when you're not ready to commit them and, for example, need to switch branches (git will refuse to switch branches when pending changes conflict with the target branch):
  • $ git status 
    # dirty files
    $ git stash save
    $ git status # = clean
    $ git checkout other_branch # pending changes would abort this command, now I can
    $ git stash apply stash@{0} # use "pop" to wipe it out of the stack
    $ git status # change now shows up in the destination branch I am on
    

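On the bare repo bullet above, a minimal sketch of creating one on a server and using it as a remote (host and paths are hypothetical):

$ ssh myserver 'git init --bare ~/repos/project.git'
$ git remote add origin myserver:repos/project.git
$ git push origin master
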
Splitting file changes over more than one commit:

$ git add -p

This is a form of interactive committing: git shows you hunks of the file (-p stands for --patch mode) that you can stage for the commit or not. Below is an example. I actually heard about this feature today, and tested it while writing this blog. The great thing about Git (and Vim for that matter) is the ease with which you can try things out and get direct feedback. As you get handier with the commands, you see that it is easy to undo almost anything; with a small subset of git commands you can experiment without making a mess. Besides, it is hard to really lose work in Git ...

    # making two edits to style.css
    # 1. adding padding to body
    # 2. increasing font-size of p
    $ git diff 
    diff --git a/style.css b/style.css
    ..
    @@ -2,6 +2,7 @@ body {
    ..
    +  padding: 5px;
     }
    ..
    +p {
    +  font-size: 1.1em;
    +}
     
    # add file in --patch / interactive mode, note the "Stage this hunk" questions
    # I only stage the first change
    $ git add -p
    diff --git a/style.css b/style.css
    index 88a2b04..13a1311 100644
    --- a/style.css
    +++ b/style.css
    @@ -2,6 +2,7 @@ body {
       font : 75%/1.5 "Lucida Grande", Helvetica, "Lucida Sans Unicode", Arial, Verdana, sans-serif;
       color:#000; background-color: #f2f2f2;  
       margin: 0 5px;
    +  padding: 5px;
     }
     a {
       color: #900;
    Stage this hunk [y,n,q,a,d,/,j,J,g,e,?]? y
    @@ -80,4 +81,7 @@ span.inactive {
       padding: 2px 6px 2px 6px;
       cursor: default;
     }
    +p {
    +  font-size: 1.1em;
    +}
     
    Stage this hunk [y,n,q,a,d,/,K,g,e,?]? n
    
    # gs is a bash alias for "git status"
    $ gs
    ..
    ..
    # modified:   style.css
    
    # first commit (with 1 of the 2 changes)
    $ git commit -m "added padding"
    [new_feature 95c7c53] added padding
     1 file changed, 1 insertion(+)
    
    # left is the second change (which was not staged with the first commit)
    $ git diff style.css
    diff --git a/style.css b/style.css
    index c0a0a96..13a1311 100644
    --- a/style.css
    +++ b/style.css
    @@ -81,4 +81,7 @@ span.inactive {
       padding: 2px 6px 2px 6px;
       cursor: default;
     }
    +p {
    +  font-size: 1.1em;
    +}
    
    # adding the 2nd change 
    $ git add .
    $ git commit -m "increased font-size"
    [new_feature 809e00a] increased font-size
     1 file changed, 3 insertions(+)
    
    # changes have gone in two commits
    $ git  diff HEAD^..HEAD
    diff --git a/style.css b/style.css
    index c0a0a96..13a1311 100644
    --- a/style.css
    +++ b/style.css
    @@ -81,4 +81,7 @@ span.inactive {
       padding: 2px 6px 2px 6px;
       cursor: default;
     }
    +p {
    +  font-size: 1.1em;
    +}
     
    $ git  diff HEAD^^..HEAD^
    diff --git a/style.css b/style.css
    index 88a2b04..c0a0a96 100644
    --- a/style.css
    +++ b/style.css
    @@ -2,6 +2,7 @@ body {
       font : 75%/1.5 "Lucida Grande", Helvetica, "Lucida Sans Unicode", Arial, Verdana, sans-serif;
       color:#000; background-color: #f2f2f2;  
       margin: 0 5px;
    +  padding: 5px;
     }
     a {
       color: #900;
    

Git is a huge subject

... and learning more of it pays off: it saves time, and you can improve software quality by making smart commits, isolating work in branches, etc.

It is also a generic skill: you can use it for any project, be it Javascript, PHP, C++, Python, or anything else you want to keep track of: a blog, or a book you are writing.

I hope this rather random selection of tips has been useful. What other git operations do you often use to improve / speed up your workflow?

$ git add 2013.03.24_git_intermediate.txt
$ git commit -m "git post done"

Website re-design: making it fully responsive

As you can see, I adopted a new theme for my website / blog. Reason? Responsive web design. I want my site to be easily accessible on a desktop, tablet and smartphone.

Responsive is hot

Mashable predicted that 2013 is the Year of Responsive Web Design. And it is true: more and more websites are converting to this approach.

And it makes sense with the increasing number of mobiles and tablets browsing the web. I do a lot more reading on the phone/tablet myself these days, and visits from these devices keep increasing. Hence it was time for my site to become responsive as well!

Re-use or build from scratch?

Both options are interesting, but there are some very good templates out there: for Wordpress the mentioned 2012 theme, or the Responsive Theme. I decided to use Morten Rand-Hendriksen's Anaximander theme, which he builds from scratch in the Lynda course WordPress: Building Responsive Themes.

Apart from providing an awesome theme you can adapt for your blog, this course gives real insight into how to build an advanced responsive Wordpress theme from scratch. It has some slick features, for example the dynamic homepage (fall-into-place-when-resized) layout using masonry.

Soon however I will restyle one of my projects from scratch (I am thinking of My Reading List) to be responsive.

Where to start? Resources: apart from looking at the CSS of the WP themes and the mentioned Lynda course, I recommend starting with Ethan Marcotte's article on responsive web design and his book on the subject, which I discussed here.

Look and compare ...

In the old design, resizing the browser would make the right sidebar overlap the content. The new design uses media queries and a flexible grid to adapt to any size. It sets max-width: 100% on the images so they scale: see the images in this post, resize the browser, or enjoy a better view of my blog if you are reading this on a tablet or phone :)

[screenshots: before / after - desktop / after - ipad / before - ipad]

Python script to clone the iTunes autofill feature for USB devices

I never thought I would autofill my iPhone with music, but I recently tried it and I actually like it. You start to appreciate more types of music. In this post, a Python script that does the same thing for USB devices, so I can listen to new music in my car.

It is funny how you limit yourself to a few albums/playlists over time; autofill breaks this habit. Now my daily music digest is much more varied. The autofill script is on github. It has become a bit longer than expected because I added more features.

With the required options (music lib path, destination dir path, max size of autofill), it just takes random songs. With the optional switches you can filter on genre and limit the song length. I use eyed3 to read the mp3 metadata. As it is expensive to run it against a lot of mp3 files, I dump its output to a file in json format. When using the -c option you can retrieve the metadata of your songs from this file. This makes it much faster and has the benefit (over a simple copy/paste script) that you can specify genres, etc.

Some useful things in Python (see the sketch after this list):

  • recursively find files,
  • loading and storing json (note you can also use pickle),
  • optparse is a great module to deal with command line parsing (see switches below),
  • random dict key selection with random.choice,
  • become trained in catching exceptions (here, for example, the AttributeError when eyed3 couldn't find genre.name), otherwise the program will end unexpectedly!
  • RFE: it generally suits my purpose, but there are many options that could be added, like filtering on artist, album and other criteria, deleting previously selected songs, etc. Any ideas, feel free to post them in the comments.
  • code: split functionality over more / smaller methods and classes (the length of 'path_exists_check' and 'select_genres' is convenient), and build automated testing against each method.

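A minimal sketch of the caching plus random-selection ideas from this list (file and function names are hypothetical, not the actual script):

import json, os, random

def load_library(music_dir, cache_file="my_music_library.json"):
    """ Load song metadata from the json cache, building it on the first run """
    if os.path.isfile(cache_file):
        return json.load(open(cache_file))
    library = {}
    for dirpath, dirnames, filenames in os.walk(music_dir):  # recursive find
        for name in filenames:
            if name.endswith(".mp3"):
                library[os.path.join(dirpath, name)] = {}  # eyed3 metadata would go here
    json.dump(library, open(cache_file, "w"))
    return library

library = load_library("music_dir")
song = random.choice(library.keys())  # random dict key selection (Python 2)
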
By the way, here are some other uses for your old USB drives.


Script use:

$ python music_autofill.py
Mandatory option is missing: [path]

Usage: music_autofill.py [options]

Options:
  -h, --help            show this help message and exit
  -p PATH, --path=PATH  path to music lib
  -u USB, --usb=USB     path to usb stick
  -s SIZE, --size=SIZE  max size autofill in MB
  -g, --genres          filter on genres
  -l LENGTH, --length=LENGTH
                        max length of song in minutes
  -v, --verbose         verbose switch
  -c, --caching         dump music lib to json for fast retrieval

Example output

Without genres, using cache file:

$ python music_autofill.py -p "music_dir" -u ~/Desktop/tmp  -s 1000 -c 
1048576000 bytes reached, we're done!
Successfully copied: 103 / failures upon copying: 0
Cache file music_dir/my_music_library.json was used
- to run without caching, don't use the -c option
- to refresh the cache, delete the mentioned file

 

With -g to specify genres:

$ python music_autofill.py -p "music_dir" -u /Volumes/USB\ DISK/ -g  -s 200  -c 
provide genres to filter on, separated by commas: brazilian beats, breaks, da bass, blues
209715200 bytes reached, we're done!
Successfully copied: 36 / failures upon copying: 0
Cache file music_dir/my_music_library.json was used
..

Script to email a weekly digest of upcoming and playing movies

To keep up to date with new and upcoming movies, I wrote a Python script that sends me a weekly movie digest. It queries themoviedb.org, parses the "now playing" and "upcoming" pages, creates the proper html, and mails it. Put in a cronjob, this happens weekly.

See the code here (previously I copied the code into the post, but what happens if I update it? I would push to github but forget to update it in the post, hence a link to github only is best, with the exception of highlighted snippets in the post). This is the initial version; I might create a page to sign up and let users select genres (the filter is already in), like Any New Books does for books. They actually inspired me to build this; their weekly new books mail is a nice service.

Note that the css styles are inside the html elements, which is not recommended for a website, but html email does not support linking to an external stylesheet, so here I didn't have a better option. Images matter, so I use the bigger version, thanks to themoviedb.org's consistent url naming (replacing w92 by w185 in the poster url).

The mail code is a nice snippet I found on stackoverflow.

For DOM parsing, I highly recommend you get familiar with Beautiful Soup, which makes parsing html easy and powerful.
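
To give an idea, a minimal sketch of this kind of parsing (the URL and the tag/class names are hypothetical, not the actual markup of themoviedb.org):

import urllib2
from bs4 import BeautifulSoup

html = urllib2.urlopen("http://www.themoviedb.org/movie/now-playing").read()
soup = BeautifulSoup(html)
for link in soup.find_all("a", class_="title"):  # all movie title links
    print link.get("href"), link.get_text(strip=True)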

The weekly digests are saved to sharemovies as well, see this example. See also some images below from two different mail clients.

Also note the encodes/decodes that were needed; this article explains pretty well what to do when you are confronted with these annoying UnicodeDecodeError exceptions. Basically, what worked for me is encoding text obtained from any web source to utf-8 when writing to a file or printing to stdout, and decoding it when assigning it to variables.
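
In Python 2 terms, a tiny sketch of that rule of thumb (the string is just an example):

# -*- coding: utf-8 -*-
raw = 'Almod\xc3\xb3var'       # utf-8 bytes as they come from a web source
title = raw.decode('utf-8')    # decode when assigning to a variable
print title.encode('utf-8')    # encode again when printing / writing to a file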

Not much else, the rest is pretty basic Python. If you want to receive the weekly update, go to sharemovi.es and email me via the Email button at the top. The homepage actually shows a subset of the upcoming and now-playing results I query with this script, however there it uses the themoviedb API.

 

[widget example images from two different mail clients]

How to search and copy Stack Overflow data without leaving Vim!

To save time and concentrate as a developer, Vim is the best place to be. But I cannot code everything from memory, so I made a tool to look up Stack Overflow questions and answers without leaving Vim.

Searching Google for programming-related questions, I found out that about 80% of the time I end up at Stack Overflow, which has tons of useful information!

What if I could search this huge Q&A database from the command line? I built a Python class to do so. But to be able to run it inside a Vim buffer you will need the Vim plugin conque and some settings in .vimrc. With that setup you can search Stack Overflow interactively in a Vim split window and copy and paste useful code snippets back and forth.

In the following sections I will show you how it works...

Setup / config

      1. Install Conque - make sure you use 2.2; 2.1 gave me some issues. Just download the file, open Vim and run :so %, then exit. Open Vim again and you can use the plugin.

 

      2. Get a copy of the stackoverflow_cli_search script.

      3. Set up a key mapping in .vimrc to open up the script in a vertical split (at least that is how I like it):

nmap ,s :ConqueTermVSplit python ...path-to-script.../stackoverflow_cli_search.py

Note that I use comma (,) as mapleader - in .vimrc add: let mapleader = ","

I made two similar key mappings as well:

      • a. to try things in Python while I am coding:

nmap cp :ConqueTermVSplit python

      • b. to search github code (see step 4 below):

nmap ,g :ConqueTermVSplit python ...path-to-script.../github_search.py

4. When coding you can just type ,s to start searching Stack Overflow - ,g for github (if you downloaded the script from the previously mentioned post as well) - or cp to get an interactive python shell. All in a new Vim vertical split window, so no need to leave the terminal; you can switch between the two windows by hitting ctrl+w twice.

Here you see a screenshot of the split window:

[screenshot: vim split window]

 

5. When you want to copy a code snippet you found, hit Esc and conque goes into normal mode, so you can select with V (visually select current line) + a motion command + y (yank). Then you move to your code window (2x ctrl+w) and p (paste) the yanked buffer. To resume with the script, go back to the stackoverflow window (again 2x ctrl+w) and go into Insert mode with i, I, a, A, etc.

Example

The example below I literally pasted into this blog post without leaving Vim (in the right window typing Esc-Vgg-y to copy the whole buffer, then 2x ctrl+w to go back to this post, and there p to paste):

      (S)earch (default when pressing Enter)
      (1-15) Show answers for question number ...
      (N)ext answer
      (L)ist questions again for last search
      (Q)uit
      Enter choice:  

You picked: [s]
Enter search: python re.compile 
Questions found for search <python re.compile>
1) python regex re.compile match
2) python re.compile match percent sign %
3) Case insensitive Python regular expression without re.compile
4) Python and re.compile return inconsistent results
5) Does re.compile() or any given Python library call throw an exception?
6) python re.compile Beautiful soup
7) python re.compile strings with vars and numbers
8) how to do re.compile() with a list in python
9) python regex re.compile() match string
10) Python re.compile between two html tags
11) Python BeautifulSoup find using re.compile for end of string
12) Python: How does regex re.compile(r'^[-\w]+$') search? Or, how does regex work in this context?
13) Clean Python Regular Expressions
14) Matching a specific sequence with regex?
15) Regex negated capture group returns answer

      (S)earch (default when pressing Enter)
      (1-15) Show answers for question number ...
      (N)ext answer
      (L)ist questions again for last search
      (Q)uit
      Enter choice: 9 

You picked: [9]
Q&A for 9) python regex re.compile() match string 
http://stackoverflow.com/questions/8012320/python-regex-re-compile-match-string

----------------------------------------
[ Question ]
----------------------------------------

Gents,
  I am trying to grab the version number from a string via python regex...
Given filename: facter-1.6.2.tar.gz

When, inside the loop:
import re
version = re.split('(.*\d.\d.\d)',sfile)
print version

How do i get the 1.6.2 bit into version 
Thanks!

----------------------------------------
[ Answer #1 ]
----------------------------------------
Two logical problems:
1) Since you want only the 1.6.2 portion, you don't want to capture the .* part before the first \d, so it goes outside the parentheses.

[truncated]

      (S)earch (default when pressing Enter)
      (1-15) Show answers for question number ...
      (N)ext answer
      (L)ist questions again for last search
      (Q)uit
      Enter choice: n 

You picked: [n]
----------------------------------------
[ Answer #2 ]
----------------------------------------
match = re.search(r'\d.\d.\d', sfile)
if match:
    version = match.group()

      (S)earch (default when pressing Enter)
      (1-15) Show answers for question number ...
      (N)ext answer
      (L)ist questions again for last search
      (Q)uit
      Enter choice: n 

You picked: [n]
----------------------------------------
[ Answer #3 ]
----------------------------------------
>>> re.search(r"\d+(.\d+)+", sfile).group(0)
'1.6.2'

      (S)earch (default when pressing Enter)
      (1-15) Show answers for question number ...
      (N)ext answer
      (L)ist questions again for last search
      (Q)uit
      Enter choice: n 

You picked: [n]
All answers shown, choose a question of previous search (L) or press Enter (or S) for a new search

      (S)earch (default when pressing Enter)
      (1-15) Show answers for question number ...
      (N)ext answer
      (L)ist questions again for last search
      (Q)uit
      Enter choice:

 

The script

See below (download at Github):

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import os, sys, urllib, urllib2, pprint
from bs4 import BeautifulSoup as Soup

class StackoverflowCliSearch(object):
  """ Query stackoverflow from cli 
      I think this could be handy in Vim's split view (with ConqueTerm) """

  def __init__(self):
    """ Definition class variables, initialize menu """
    self.searchTerm = ""
    self.questions = {}
    self.showNumAnswers = 1 # show 1 answer first, then 1 by 1 pressing N
    self.output = None # answers iterator, set in show_question_answer
    self.show_menu() # start user interaction

  def show_menu(self):
    """ Menu that allows user to to search, query question's answers, etc. """
    prompt = """
      (S)earch (default when pressing Enter)
      (1-15) Show answers for question number ...
      (N)ext answer
      (L)ist questions again for last search
      (Q)uit
      Enter choice: """
    while True:
      chosen = False 
      while not chosen:
        try:
          choice = raw_input(prompt).strip().lower()
        except (EOFError, KeyboardInterrupt):
          choice = 'q'
        except:
          sys.exit("Not a valid option")
        if choice == '': choice = 's' # hitting Enter = new search
        print '\nYou picked: [%s]' % choice
        if not choice.isdigit() and choice not in 'snlq':
          print "This is an invalid option, try again"
        else:
          chosen = True
      if choice.isdigit() : self.show_question_answer(int(choice))
      if choice == 's': self.search_questions() 
      if choice == 'n': self.show_more_answers() 
      if choice == 'l': self.list_questions(True) 
      if choice == 'q': sys.exit("Goodbye!")

  def search_questions(self):
    """ Searches stackoverflow for questions containing the search term """
    self.questions = {} 
    self.searchTerm = raw_input("Enter search: ").strip().lower()
    data = {'q': self.searchTerm }
    data = urllib.urlencode(data)
    soup = self.get_url("http://stackoverflow.com/search", data)
    for i,res in enumerate(soup.find_all(attrs={'class': 'result-link'})):
      q = res.find('a')
      self.questions[i+1] = {}
      self.questions[i+1]['url'] = "http://stackoverflow.com" + q.get('href')
      self.questions[i+1]['title'] = q.get('title')
    self.list_questions()

  def get_url(self, url, data=False):
    """ Imports url data into Soup for easy html parsing """
    u = urllib2.urlopen(url, data) if data else urllib2.urlopen(url)
    return Soup(u)

  def list_questions(self, repeat=False):
    """ Lists the questions that were found with the last search action """
    # check the (L)ist-again case first, otherwise it would never be reached
    if not self.questions and repeat:
      print "There are no questions in memory yet, please perform a (S)earch first"
      return False
    if not self.questions:
      print "No questions found for search <%s>" % self.searchTerm
      return False
    print "Questions found for search <%s>" % self.searchTerm 
    for q in self.questions:
      print "%d) %s" % (q, self.questions[q]["title"])

  def show_question_answer(self, num):
    """ Shows the question and the first self.showNumAnswers answers """
    entries = []
    if num not in self.questions: 
      print "num <%s> does not appear in questions dict" % str(num) 
      return False
    print "Q&A for %d) %s \n%s\n" % \
      (num, self.questions[num]['title'], self.questions[num]['url'])
    soup = self.get_url(self.questions[num]['url'])
    for i,answer in enumerate(soup.find_all(attrs={'class': 'post-text'})):
      qa = "Question" if i == 0 else "Answer #%d" % i
      out = "%s\n[ %s ]\n%s\n" % ("-"*40, qa, "-"*40)
      out += ''.join(answer.findAll(text=True))
      # print the Q and first Answer, save subsequent answers for iteration with option (N)ext answer
      if i <= self.showNumAnswers:
        print out
      else:
        entries.append(out) 
    self.output = iter(entries)

  def show_more_answers(self):
    """ Result of option (N)ext answer: iterates over the next answer (1 per method call) """ 
    if not self.output:
      print "There is no QA output yet, please select a Question listed or perform a (S)earch first"
      return False
    try:
      print self.output.next()
    except StopIteration as e:
      print "All answers shown, choose a question of previous search (L) or press Enter (or S) for a new search"

# instantiate
so = StackoverflowCliSearch()

 

Python script to query github code in your terminal

I was using the Advanced Search on Github the other day and thought: what if I could use this in a terminal? So I started to try out some things in Python, which led to the following script.

Some comments about the script


  • It is an interactive script that you run from a terminal; the main options are (n) for new search and (s) for show script snippet. You first search for a keyword, and optionally a programming language and the number of result pages to parse. See an example at the end of this post ...
  • It takes these args and builds the right search URL (base url = https://github.com/search?q=) and uses html2text.theinfo.org to strip out the html (I tried the remote version here; for local use just download and import the html2text Python module, see my last post for an example).
  • It filters the relevant html with re.split(r"seconds\)|## Breakdown", html) - this is based on what html2text makes of github's html markup.
  • When choosing (s) and then the number of a search result, the method "show_script_context" imports the raw script and shows each line that matches the search string, with 8 lines before and after (like grep -A8 -B8 would do).
  • You can use Conque to run this in a split window in Vim which allows you to copy output to the script you are working on.

The code

See below and on github:

#!/usr/bin/env python                                                                                                                                     
# -*- coding: utf-8 -*-
# Author: Bob Belderbos / written: Dec 2012
# Purpose: have an interactive github cli search app
#
import re, sys, urllib, pprint
# import html2text # -- to use local version

class GithubSearch:
  """ This is a command line wrapper around Github's Advanced Search
      https://github.com/search """

  def __init__(self):
    """ Setup variables """
    self.searchTerm = ""
    self.scripts = []
    self.show_menu()


  def show_menu(self):
    """ Show a menu to interactively use this program """
    prompt = """
      (N)ew search
      (S)how more context (github script)
      (Q)uit
      Enter choice: """
    while True:
      chosen = False 
      while not chosen:
        try:
          choice = raw_input(prompt).strip().lower()
        except (EOFError, KeyboardInterrupt):
          choice = 'q'
        except:
          sys.exit("Not a valid option")
        print '\nYou picked: [%s]' % choice
        if choice not in 'nsq':
          print "This is an invalid option, try again"
        else:
          chosen = True
      if choice == 'q': sys.exit("Goodbye!")
      if choice == 'n': self.new_search() 
      if choice == 's': self.show_script_context()

  
  def new_search(self):
    """ Take the input field info for the advanced git search """
    # reset script url tracking list and counter
    self.scripts = [] 
    self.counter = 0
    # take user input to define the search
    try:
      self.searchTerm = raw_input("Enter search term: ").strip().lower().replace(" ", "+")
    except:
      sys.exit("Error handling this search term, exiting ...")
    lang = raw_input("Filter on programming language (press Enter to include all): ").strip().lower()
    try:
      prompt = "Number of search pages to process (default = 3): "
      numSearchPages = int(raw_input(prompt).strip()[0])
    except:
      numSearchPages = 3
    # get the search results
    for page in range(1,numSearchPages+1):
      results = self.get_search_results(page, lang)
      for result in results[1].split("##"): # each search result is divided by ##
        self.parse_search_result(result)


  def get_search_results(self, page, lang):
    """ Query github's advanced search and re.split for the relevant piece of info 
        RFE: have a branch to use html2text local copy if present, vs. remote if not """
    githubSearchUrl = "https://github.com/search?q="
    searchUrl = urllib.quote_plus("%s%s&p=%s&ref=searchbar&type=Code&l=%s" % \
      (githubSearchUrl, self.searchTerm, page, lang))
    html2textUrl = "http://html2text.theinfo.org/?url="
    queryUrl = html2textUrl+searchUrl
    html = urllib.urlopen(queryUrl).read()
    return re.split(r"seconds\)|## Breakdown", html)


  def parse_search_result(self, result):
    """ Process the search results, also store each script URL in a list for reference """
    lines = result.split("\n")
    source = "".join(lines[0:2])
    pattern = re.compile(r".*\((.*?)\)\s+\((.*?)\).*")
    m = pattern.match(source)
    if m != None:
      self.counter += 1 
      url = "https://raw.github.com%s" % m.group(1).replace("tree/", "")
      lang = m.group(2)
      self.print_banner(lang, url)
      self.scripts.append(url) # keep track of script links 
      for line in lines[2:]:
        # ignore pagination markup
        if "github.com" in line or "https://git" in line or "[Next" in line: continue 
        if line.strip() == "": continue
        print line


  def print_banner(self, lang, url):
    """ Print the script, lang, etc. in a clearly formatted way """
    print "\n" + "+" * 125
    print "(%i) %s / src: %s" % (self.counter, lang, url)


  def show_script_context(self, script_num=""):
    """ Another menu option to show more context from the github script 
        surrounding or leading up to the search term """
    if len(self.scripts) == 0:
      print "There are no search results yet, so cannot show any scripts yet."
      return False
    script_num = int(raw_input("Enter search result number: ").strip())
    script = self.scripts[script_num-1] # list starts with index 0 = 1 less than counter
    a = urllib.urlopen(script)
    if a.getcode() != 200:
      print "The requested script did not give a 200 return code"
      return False
    lines = a.readlines() 
    a.close()
    if len(lines) == 0:
      print "Did not get content back from script, maybe it is gone?"
      return False
    num_context_lines = 8
    print "\nExtracting more context for search term <%s> ..." % self.searchTerm
    print "Showing %i lines before and after the match in the original script hosted here:\n%s\n" % \
      (num_context_lines, script)
    for i, line in enumerate(lines):
      if self.searchTerm.lower() in line.lower():
        print "\n... %s found at line %i ..." % (self.searchTerm, i)
        j = i - num_context_lines
        for x in lines[i-num_context_lines : i+num_context_lines]:
          if self.searchTerm.lower() in x.lower():
            print "%i ---> %s" % (j, x), # makes the match stand out
          else:
            print "%i      %s" % (j, x),        
          j += 1


### instantiate
github = GithubSearch()

See it in action

$ vi github_search.py 


      (N)ew search
      (S)how more context (github script)
      (Q)uit
      Enter choice: N

You picked: [n]
Enter search term: os.system
Filter on programming language (press Enter to include all): python
Number of search pages to process (default = 3): 

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
(1) Python / src: https://raw.github.com/fhopecc/stxt/1a14c802362047af4c9f6d5ec2312a57cbc9bca6/task/setup_win.py
    import os
    _os.system_(

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
(2) Python / src: https://raw.github.com/fhopecc/stxt/325dc6e2cbfecc9d071264f71aee7b156a8a6970/task/shutdown.py
    import os
    _os.system_('shutdown -s -f')

..
..
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
(9) Python / src: https://raw.github.com/rob0r/snmpnetif/b6228f3ba6c55a7f8119af3a1bd4c014f5533b9b/snmpnetif.py
    (True):
                try:
                    # clear the screen
                    if os.name == 'nt': clearscreen = _os.system_

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
(10) Python / src: https://raw.github.com/trey0/geocamShare/98029ffb1d26784346f7a2e5984048e8764df116/djangoWsgi.py
    .mkstemp('djangoWsgiSourceMe.txt')
        os.close(fd)
        _os.system_('bash -c "(source %s/sourceme.sh && printenv > %s

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
(11) Python / src: https://raw.github.com/jphcoi/MLMTC/48029dd647dc17173ed94693deccbb8d7bb42ed6/map_builder/CFpipe.py
    _sys
        try:
          _os.system_(command_sys)
        except:
          '----------------------------------detection de communaut

      (N)ew search
      (S)how more context (github script)
      (Q)uit
      Enter choice: s

You picked: [s]
Enter search result number: 9

Extracting more context for search term <os.system> ...
Showing 8 lines before and after the match in the original script hosted here:
https://raw.github.com/rob0r/snmpnetif/b6228f3ba6c55a7f8119af3a1bd4c014f5533b9b/snmpnetif.py


... os.system found at line 250 ...
242              ifidx = self.ifactive()
243              
244              # get active interface names
245              ifnames = self.ifnames(ifidx)
246              
247              while(True):
248                  try:
249                      # clear the screen
250 --->                 if os.name == 'nt': clearscreen = os.system('cls')
251 --->                 if os.name == 'posix': clearscreen = os.system('clear')
252                      
253                      # print the device name and uptime
254                      print(devicename)
255                      print('Device uptime: {0}\n').format(self.devuptime())
256                      
257                      # print stats if the first loop has run
..

      (N)ew search
      (S)how more context (github script)
      (Q)uit
      Enter choice: n

You picked: [n]
Enter search term: grep
Filter on programming language (press Enter to include all): perl
Number of search pages to process (default = 3): 

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
(1) Perl / src: https://raw.github.com/xPapers/xPapers/1fe2bf177e3d37f2024d00601340627a8ded85ad/lib/xPapers/Cat.pm
    ->catCount($me->catCount-1);
        $me->save;
        $me->clear_cache;
        # detach
        $me->cat_memberships([_grep_

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
(2) Perl / src: https://raw.github.com/roethigj/Lx-Office-Anpassungen/e06afb2fc94573bc4a305a41e95a8b7a812e2db0/SL/IS.pm
    ->{TEMPLATE_ARRAYS}->{$_} }, "") } _grep_({ $_ ne "description" } @arrays));
        }
        $form

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
(3) Perl / src: https://raw.github.com/roethigj/Lx-Office-Anpassungen/e06afb2fc94573bc4a305a41e95a8b7a812e2db0/SL/OE.pm
    ->{sort} && _grep_($form->{sort}, keys(%allowed_sort_columns))) {
        $sortorder = $allowed

      (N)ew search
      (S)how more context (github script)
      (Q)uit
      Enter choice: s

You picked: [s]
Enter search result number: 3

Extracting more context for search term <grep> ...
Showing 8 lines before and after the match in the original script hosted here:
https://raw.github.com/roethigj/Lx-Office-Anpassungen/e06afb2fc94573bc4a305a41e95a8b7a812e2db0/SL/OE.pm


... grep found at line 210 ...
202          "ordnumber"               => "o.ordnumber",
203          "quonumber"               => "o.quonumber",
204          "name"                    => "ct.name",
205          "employee"                => "e.name",
206          "salesman"                => "e.name",
207          "shipvia"                 => "o.shipvia",
208          "transaction_description" => "o.transaction_description"
209        );
210 --->   if ($form->{sort} && grep($form->{sort}, keys(%allowed_sort_columns))) {
211          $sortorder = $allowed_sort_columns{$form->{sort}} . " ${sortdir}";
212        }
213        $query .= qq| ORDER by | . $sortorder;
214      
215        my $sth = $dbh->prepare($query);
216        $sth->execute(@values) ||
217          $form->dberror($query . " (" . join(", ", @values) . ")");

... grep found at line 1135 ...
1127        my $sameitem = "";
1128        foreach $item (sort { $a->[1] cmp $b->[1] } @partsgroup) {
1129          $i = $item->[0];
1130      
1131          if ($item->[1] ne $sameitem) {
1132            push(@{ $form->{TEMPLATE_ARRAYS}->{description} }, qq|$item->[1]|);
1133            $sameitem = $item->[1];
1134      
1135 --->       map({ push(@{ $form->{TEMPLATE_ARRAYS}->{$_} }, "") } grep({ $_ ne "description" } @arrays));
1136          }
1137      
1138          $form->{"qty_$i"} = $form->parse_amount($myconfig, $form->{"qty_$i"});
1139      
1140          if ($form->{"id_$i"} != 0) {
1141      
1142            # add number, description and qty to $form->{number}, ....
..

      (N)ew search
      (S)how more context (github script)
      (Q)uit
      Enter choice: q

You picked: [q]
Goodbye!

shell returned 1

A Python script to import a complete blog to plain text

In today's post I'll show you a script I wrote yesterday to import an entire blog based on an (XML) sitemap. It also converts the posts from html to plain text. Although it was merely Python practice, it could actually be useful for exporting/backing up blogs to a single text file for easy consumption in a text editor or offline.

How it works / some comments:

  • The class gets instantiated with a couple of arguments:
  • # create instance
    blog = ImportBlogPosts("http://bobbelderbos.com", '<div class="entry-content">', '<div><br /><h4><strong>You might also like:')
    # get all URLs with /20 in them (this means real post URLs in my blog's case, so ignoring the about-, the archive- and homepage)
    blog.import_post_urls("/20")
    # ... or just a single URL: 
    blog.import_post_urls('http://bobbelderbos.com/2012/09/how-to-grow-craft-programming/')
    


    The first arg is clear: the blog URL. The 2nd and 3rd args are the html start and end snippets to grab the actual post (and not all the fluff around it). The sitemap has to be present at URL/sitemap.xml, or it can be a file on the FS. I did see quite a few blogs without a sitemap, so an RFE would be to let the class generate the sitemap if not present ...

  • Importing web content is easy in Python: urllib.urlopen, but in this case I had some issues with body content not being imported, so I used "wget -q .." with subprocess.call.
  • I find the native module "xml.dom.minidom" pretty convenient for parsing XML.
  • I use a nice module for html-to-text conversion: html2text - it delivers markdown format. One tricky thing with encodings (a subject I need to study on its own one day): to get the script working to stdout as well as when redirecting its output to a file (with '>' on Unix), I had to use this magic:
  • ..
      postContent = postContent.decode('utf-8')
      print html2text.html2text(postContent).encode('ascii', 'ignore') 
    ..
    

    Yes, that is right: decode from utf-8, and encode to ascii. Anybody who has a better way or explanation, please comment.

The code

Here a copy of the code, see also here on Github.

#!/usr/bin/env python                                                                                                                                     
# -*- coding: utf-8 -*-
# Author: Bob Belderbos / written: Dec 2012
# Purpose: import all blog posts to one file, converting them in (markdown) text
# Thanks to html2text for doing the actual conversion ( http://www.aaronsw.com/2002/html2text/ ) 
# 
import os, sys, pprint, xml.dom.minidom, urllib, html2text, subprocess

class ImportBlogPosts(object):
  """ Import all blog posts and create one big text file (pdf would increase size too much,
      and I like searching text files with Vim).  It uses the blog's sitemap to get all URLs. """

  def __init__(self, url, poststart, postend, sitemap="sitemap.xml"):
    """ Specify blog url, where post html starts/ stops, what urls in sitemap are valid, and sitemap """
    self.sitemap = sitemap
    self.sitemapUrl = "%s/%s" % (url, self.sitemap)
    self.postStartMark = poststart # where does post content html start?
    self.postEndMark = postend # where does post content html stop?
    if not os.path.isfile(self.sitemap):
      cmd = "wget -q %s" % self.sitemapUrl
      if subprocess.call(cmd.split()) != 0:
        sys.exit("No 0 returned from %s, exiting ..." % cmd)
    self.blogUrls = self.parse_sitemap(self.sitemap)

      
  def parse_sitemap(self, sitemap):
    """ Parse blog's specified xml sitemap """
    posts = {}
    dom = xml.dom.minidom.parse(sitemap)
    for element in dom.getElementsByTagName('url'):
      url = self.getText(element.getElementsByTagName("loc")[0].childNodes)
      mod = self.getText(element.getElementsByTagName("lastmod")[0].childNodes)
      posts[url] = mod # there can be identical mods, but urls are unique
    urls = []
    # return urls ordered desc on last mod. date
    for key, value in sorted(posts.iteritems(), reverse=True, key=lambda (k,v): (v,k)):
      urls.append(key)
    return urls


  def getText(self, nodelist):
    """ Helper method for parsing XML childnodes (see parse_sitemap) """
    rc = ""
    for node in nodelist:
      if node.nodeType == node.TEXT_NODE:
        rc = rc + node.data
    return rc

  
  def import_post_urls(self, urlCriteria="http"):
    """ Loop over blog URL getting each one's content, default 'http' practically results in importing all links """
    for i, url in enumerate(self.blogUrls):
      if urlCriteria in url: 
        html = self.get_url(url)
        if html != None:
          self.print_banner(i, url)
          self.print_content(url)
  

  def get_url(self, url):
    """ Import html from specified url """
    try:
      f = urllib.urlopen(url)
      html = f.read()
      f.close()
      return html
    except: 
      print "Problem getting url %s" % url
      return None


  def print_banner(self, i, url):
    """ print a banner for a specified URL (to seperate from content) """
    divider = "+"*120
    print "\n\n"
    print divider
    print "%i) %s" % (i, url)
    print divider
    print "\n"


  def print_content(self, url): 
    """ Get blog post's content, get relevant html, then convert to plain text """
    try:
      # I know, I probably should have used urllib.urlopen but somehow
      # it doesn't import the body html, so using good 'ol wget as workaround
      cmd = "wget -q -O - %s" % url
      html = subprocess.check_output(cmd.split())
    except subprocess.CalledProcessError as e:
      print "Something went wrong importing %s, error: %s" % (url, e)
      return False
    postContent = self.filter_post_content(html)
    if postContent == None:
      print "postContent == None, something went wrong in filter_post_content?"
    else:
      try:
        # to print in terminal decode to utf-8 needed, to print and redirect
        # script's output to file with >, that only works with ascii encode
        postContent = postContent.decode('utf-8')
        print html2text.html2text(postContent).encode('ascii', 'ignore') 
      except:
        print "Cannot convert this post's html to plain text"


  def filter_post_content(self, textdata):
    """ Takes the post page html and return the post html body """
    try:
      post = textdata.split(self.postStartMark)
      post = "".join(post[1:]).split(self.postEndMark)
      return post[0]
    except:
      print "Cannot split post content based on specified start- and endmarks"
      return None

# end class


### run this program from cli
import optparse
parser = optparse.OptionParser()
parser.add_option('-u', '--url', help='specify a blog url', dest='url')
parser.add_option('-b', '--beginhtml', help='first html (div) tag of a blog post', dest='beginhtml')
parser.add_option('-e', '--endhtml', help='first html after the post content', dest='endhtml')
parser.add_option('-s', '--sitemap', help='sitemap name, default = sitemap.xml', dest='sitemap', default="sitemap.xml")
parser.add_option('-p', '--posts', help='url string to filter on, e.g. "/2012" for all 2012 posts', dest='posts', default="http")
(opts, args) = parser.parse_args()

# Making sure all mandatory options appeared.
mandatories = ['url', 'beginhtml', 'endhtml']
for m in mandatories:
  if not opts.__dict__[m]:
    print "Mandatory option is missing\n"
    parser.print_help()
    exit(-1)

# Execute program with given cli options: 
blog = ImportBlogPosts(opts.url, opts.beginhtml, opts.endhtml, opts.sitemap)
blog.import_post_urls(opts.posts)



### example class instantiation syntax, and using it for other blogs
# + instant class
# blog = ImportBlogPosts("http://bobbelderbos.com", '<div class="entry-content">', '<div><br /><h4><strong>You might also like:')
# + all posts my blog:
# blog.import_post_urls("/20")
# + only one post my blog:
# blog.import_post_urls('http://bobbelderbos.com/2012/09/how-to-grow-craft-programming/')
# + another single post on my blog:
# blog.import_post_urls('http://bobbelderbos.com/2012/10/php-mysql-novice-to-ninja/')
# 
# + other blogs:
# blog = ImportBlogPosts("http://zenhabits.net", '<div class="entry">', '<div class="home_bottom">', "zenhabits.xml") 
# blog = ImportBlogPosts("http://blog.extracheese.org/", '<div class="post content">', '<div class="clearfix"></div>', "/Users/bbelderbos/Downloads/gary.xml") 
# + import all urls
# blog.import_post_urls()
# blog = ImportBlogPosts("http://programmingzen.com", '<div class="post-wrapper">', 'related posts', "/Users/bbelderbos/Downloads/programmingzen.xml") 
# + supposedly all posts
# blog.import_post_urls("/20")

Running the script / example outputs

I ran the following command to get all my 2012 blog posts in a text file:

$ python import_blog_posts.py  -u http://bobbelderbos.com \
  -b '<div class="entry-content">' -e '<div><br /><h4><strong>You might also like:' \ 
  -p "http://bobbelderbos.com/2012" > import_blog_posts_bb_2012.txt

You can see the result here. Run it without the -p option (no post filter) and you can get over 140 posts == 14.000 lines ... useful for quickly vi-ing anything and copying code snippets ;)

Another example: run it with -p "git" to get my 3 posts that had "git" in the url:

$ python import_blog_posts.py  -u http://bobbelderbos.com \ 
  -b '<div class="entry-content">' -e '<div><br /><h4><strong>You might also like:' \
  -p "git" > git.txt

Daily movie digest Spanish TV / Part II - rewrite in Python

To learn more Python I am making up some new scripting exercises these days. I am also rewriting some Perl scripts I made last year. Today: my daily Spanish TV movie email script, discussed here, rewritten and improved in Python.

What is this script about?

This script queries http://www.sincroguia.tv/todas-las-peliculas.html for movies that will be aired on Spanish TV today. For each title it crawls the corresponding URL for additional details. All is formatted in an output I get emailed everyday via a cronjob on my webhost.

What could be better since last time?


  • All titles were shown in verbose mode, so if I wanted to quickly see what was on, it required a lot of scrolling, not good. Hence the script prints a summary first now (example at the end of this post).
  • Related: the script has day detection now, because during weekdays I only want to know what is on starting at 8pm; for weekend days I want the movie guide for the whole day.
  • The movie URLs get more thoroughly parsed, providing more movie info (kudos to sincroguia.tv, the movie info is actually quite good)
  • Spanish movie titles have their English counterpart on the details page, so I pushed this vital piece of information up to the top summary. This way I can quickly see which movie it actually is!
  • Structuring code in OOP: this is a new trend in my coding lately, and I feel the code gets much cleaner and potentially more re-usable. The class is a black box; somebody could just plug it in and call the methods he/she is interested in - in this case only two, but it makes the point I think:
  • t = TvCine()
    t.print_movie_titles()
    t.print_movie_details()
    

    I still think the methods should be shorter and over time I want to introduce TDD to make it all more robust (see the test sketch below), but you have to start somewhere. I think this version is much more readable than the Perl variant (any opinions and suggestions are welcome in the comments of course). Btw, Why Python? is an interesting read if you also consider Python after or alongside Perl.
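
As a first step towards that TDD goal, here is a minimal test sketch, assuming the start-time logic were factored out of __init__ into a small standalone helper (start_time_for is a hypothetical name, it does not exist in the script below):

import unittest

def start_time_for(weekday):
    """ Hypothetical helper: weekdays (0-4) start at 20h, weekends (5-6) at 9h """
    return 9 if weekday in (5, 6) else 20

class TestStartTime(unittest.TestCase):
    def test_weekend_starts_in_the_morning(self):
        self.assertEqual(start_time_for(5), 9)   # Saturday
        self.assertEqual(start_time_for(6), 9)   # Sunday

    def test_weekday_starts_in_the_evening(self):
        self.assertEqual(start_time_for(0), 20)  # Monday

if __name__ == '__main__':
    unittest.main()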

The script

Without further ado, the script (also on GitHub):

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Author: Bob Belderbos / written: Dec 2012
# Purpose: get movies aired on Spanish tv to use in 24-hour cronjob
#
import pprint, urllib, re, sys, datetime
from bs4 import BeautifulSoup as Soup

class TvCine(object):

  def __init__(self):
    """ Setup variables, define hour range of which I want to know the movie airing of """
    # if weekday (0-4 - 0 being Monday) show movies from 20-24h, weekend I want to see all movies aired: 
    self.weekday = datetime.datetime.today().weekday() 
    if self.weekday in [5,6]: # 5 = Sat, 6 = Sun
      self.START_TIME = 9
    else: 
      self.START_TIME = 20
    # always end at midnight (tomorrow a new day, so a new output from cron)
    self.END_TIME = 00
    self.moviePage = "http://www.sincroguia.tv/todas-las-peliculas.html" 
    self.movies = self.parse_movies()
    # pprint.pprint(self.movies); sys.exit()


  def parse_movies(self):
    """ Import the movie URL """
    soup = Soup(self.read_url(self.moviePage))
    movies = []
    for link in soup.find_all("a"):
      time = link.previous_sibling
      try:
        channel = re.sub(r".* - ", "", str(link.contents[0].encode(encoding='UTF-8',errors='strict')))
      except Exception:  # fall back when the link has no usable text content
        channel = "not_found"
      url = link.get('href')
      title = link.get('title')
      if not "/peliculas/" in url: continue
      if int(time[:2]) < self.START_TIME: continue
      if time[:2] == self.END_TIME: break
      (longTitle, verboseInfo) = self.get_movie_verbose_info(title, url)
      movies.append({ 'time': time[0:6], 
                      'channel': channel,
                      'title':longTitle.encode(encoding='UTF-8',errors='strict'), 
                      'url': url.encode(encoding='UTF-8',errors='strict'), 
                      'info': verboseInfo.encode(encoding='UTF-8',errors='strict'),  
                    })
    return movies

   
  def get_movie_verbose_info(self, title, url):
    """ Read the movie page in and return the translated title if available and all movie info """
    html = self.read_url(url)
    # try to get the relevant html section of the movie page, if nothing found too bad, move on
    soup = self.filter_relevant_bits(html)
    titleInfo = ficha = contentficha = ""
    lineNum = 0
    if soup: 
      for line in soup.li.stripped_strings: 
        ficha += line + "n"
      for line in soup.find_all('li')[1].stripped_strings:
        lineNum += 1
        if lineNum < 3: titleInfo += line + " "
        contentficha += line + "\n"
    else:
      ficha = "Not able to obtain movie info for %s" % title
    return (titleInfo, ficha+"\n"+contentficha)


  def read_url(self, url):
    """ Read and return the content of a url """
    f = urllib.urlopen(url) 
    html = f.read()
    f.close()
    return html

  
  def filter_relevant_bits(self, html):
    """ Get the html part that matters from the movie page """
    a = html.split('class="ficha">')
    try:
      movieInfo = a[1].split('<a href="javascript:;" onclick="remote')
    except IndexError:
      return False 
    soup = Soup(movieInfo[0]) 
    return soup


  def print_movie_titles(self): 
    """ Print all the movie titles to be aired on Spanish TV today """
    print "I. Movies Spanish TV Today %s:00-%s:00n" % (self.START_TIME, self.END_TIME)
    for m in self.movies:
      print m['time'], " | ", "%-8s" % m['channel'], " | ", m['title']
    print "nn"
      

  def print_movie_details(self):
    """ Print verbose details for each movie """
    print "II. Details for each movie ... n" 
    for m in self.movies:
      print "+" * 80
      print m['time'], " | ", "%-8s" % m['channel'], " | ", m['title']
      print "+" * 80
      print "URL: n" + m['url']
      print "nDetails: n" + m['info']
      print "nn"


### instantiate and run
t = TvCine()
t.print_movie_titles()
t.print_movie_details()
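
One small improvement for re-usability (in line with the OOP point above): guarding the example run means another script could import the class without triggering the crawling side effects. A sketch:

# sketch: with this guard, "import cine_tv" elsewhere would not
# kick off the network calls; running the file directly still works
if __name__ == '__main__':
    t = TvCine()
    t.print_movie_titles()
    t.print_movie_details()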

How does the output look?

I can send you one, or you can download the script and run it yourself. This is a snippet (the Spanish accents actually work in my terminal and email, not sure why they get messed up here):

$ python cine_tv.py 

I. Movies Spanish TV Today 20:00-0:00

20:00   |  L63       |  Cine: Poltergeist (Fenómenos extraños) Cine - Terror 
20:25   |  PARAM     |  El guerrero americano II: la confrontación (American Ninja 2: The Confrontation) 
22:00   |  PARAM     |  Seabiscuit, más allá de la leyenda Cine - Drama 
22:00   |  L63       |  Cine: Poltergeist II (Poltergeist II: The Other Side) 
22:00   |  T5        |  Cine: El equipo A (The A-Team) 
22:00   |  La2       |  El cine de La 2: Oliver Twist Cine - Drama 
22:10   |  A3        |  El peliculón: Toy Story 3 Cine - Animación 
22:25   |  La6       |  Cine: Estado de sitio (The Siege) 
22:30   |  La1       |  Cine: Ocean's Eleven (Ocean's Eleven) 
22:30   |  Nova      |  Cine: Homicidio en primer grado (Murder in the First) 



II. Details for each movie ... 

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
20:00   |  L63       |  Cine: Poltergeist (Fenómenos extraños) Cine - Terror 
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
URL: 
http://www.sincroguia.tv/peliculas/poltergeist_fenos_extra_18570664.html

Details: 
Director:
Tobe Hooper
Intérpretes:
JoBeth Williams, Oliver Robins, Heather O'Rourke, Beatrice Straight, Craig T. Nelson
Guión:
Steven Spielberg, Michael Grais, Mark Victor
Música:
Jerry Goldsmith
Director de Fotografía:
Matthew F. Leonetti
Producción:
Steven Spielberg, Frank Marshall
Productora:
Metro-Goldwyn-Mayer (MGM), SLM Production Group
Idioma Original:
Inglés
Nacionalidad:
Estados Unidos
Año:
1982
Duración:
114          minutos
Edad:
Todos los Públicos

Cine: Poltergeist (Fenómenos extraños)
Cine - Terror
laSexta3
Miércoles 02 de Enero de 2013
Inicio: 
        20:00        / Fin:
        22:00
Terror
Calificación Artística:
Calificación Comercial:
Una familia estadounidense padece fenómenos paranormales en su casa. Al principio, los espíritus se manifiestan moviendo muebles y demás objetos del hogar. Pero pronto se vuelven agresivos y secuestran a la hija pequeña de la familia. Cuando todas las explicaciones científicas y racionales han fracasado, los padres contratan a una espiritista que intentará limpiar la casa y recuperar a la niña.
Producida por Steven Spielberg (quien se rumorea que también dirigió parte del filme), "Poltergeist" fue una de las películas de terror más exitosas de la década de 1980. Recuperó para el cine a Tobe Hooper ("La matanza de Texas"), un nombre legendario del género. La cinta está basada en un capítulo de la serie "La dimensión desconocida" titulado "Niña perdida". "Poltergeist (Fenómenos extraños)" dio pie a dos secuelas y hasta a una serie de televisión.




++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
20:25   |  PARAM     |  El guerrero americano II: la confrontación (American Ninja 2: The Confrontation) 
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
URL: 
http://www.sincroguia.tv/peliculas/el_guerrero_americano_ii_la_confrontaci_18511018.html

Details: 
..
..
etc. etc.
..
..

In closing

Hope this inspires you to come up with your own coding exercises to share. Feel free to ping me for ideas and suggestions.

I wish you all a Happy New Year!