Intro

In The Unusual Books That Shaped 50+ Billionaires, Mega-Bestselling Authors, and Other Prodigies Tim Ferriss provides a mega-list of the most-gifted and favorite books of top performers. In this post I use the collections.Counter() tool to see which books get recommended more than once.

Pre-work

  • Before coding I got a copy with wget to inspect the html.

  • Recommended books have amazon.com in them, the author’s new book - Tools of Titans - should be discarded, but it is only mentioned once, so can ignore that for now.

  • I verified there were 204 Amazon links to assert that in my script (Vim users, add this to your .vimrc: nmap ,c :%s///gn ; now when you do a search (in this case amazon.com), pressing comma+c, you get the amount of hits: "x matches on y lines").

Code

from collections import Counter
from itertools import dropwhile
import os
import re

from bs4 import BeautifulSoup as Soup
import requests

AMAZON = "amazon.com"
CACHED_HTML = "titans_books.html"
SHORT_SRC = "http://bit.ly/2gP0fv3"

title = re.compile(r'.*{}/([^/]+).*'.format(AMAZON)).sub

def get_html():
	if os.path.isfile(CACHED_HTML):
		with open(CACHED_HTML) as f:
			html = f.read().lower()
	else:
		html = requests.get(SHORT_SRC).text
	return Soup(html, 'html5lib')

def get_books():
	cnt = Counter()
	for a in get_html().find_all('a', href=True):
		href = a['href']
		if AMAZON in href:
			book = title(r'\1', href)
			cnt[book] += 1
	return cnt

def get_multiple_mentions(books, keep=2):
	for key, count in dropwhile(lambda key_count: key_count[1] >= keep, books.most_common()):
		del books[key]
	return books

def print_results(books):
	for book, count in books.items():
		print("{:<3} {}".format(count, book))


if __name__ == "__main__":
	def test(cnt):
		assert(cnt["tao-te-ching-laozi"] == 3)
		assert(cnt["influence-psychology-persuasion-robert-cialdini"] == 2)
		assert(sum(cnt.values()) == 204)

	books = get_books()
	test(books)
	multiple_mentions = get_multiple_mentions(books)
	print_results(multiple_mentions)

Results:

	2   shogun-james-clavell
	2   unbearable-lightness-being-milan-kundera
	2   alchemist-paulo-coelho
	3   tao-te-ching-laozi
	2   zero-one-notes-startups-future
	2   sapiens-humankind-yuval-noah-harari
	2   influence-psychology-persuasion-robert-cialdini
	2   hour-body-uncommon-incredible-superhuman
	2   mindset-psychology-carol-s-dweck
	3   atlas-shrugged-ayn-rand
	2   4-hour-workweek-escape-live-anywhere
	2   what-makes-sammy-budd-schulberg
	2   war-art-through-creative-battles
	2   black-swan-improbable-robustness-fragility
	2   hard-thing-about-things-building
	2   drama-gifted-child-search-revised
	2   checklist-manifesto-how-things-right

Good Py habits

  • When importing modules seperate standard library from external ones. Here BeautifulSoup and requests go in their own block. This way it does not cost an extra brain cycle to figure out where modules are coming from. In case of a bigger project I would work in a virtualenv and create a requirements.txt file with pip freeze > requirements.txt

  • Use uppercase constants at the top, except method aliases and named tuples (not used here) for which I use lowercase.

  • I like the re.compile(r’’).sub alias (from a regex golf by Peter Norvig): it leads to more compact code in the body of the method that uses it:

      # at top: 
      title = re.compile(r'.*{}/([^/]+).*'.format(AMAZON)).sub
    
      # in method: 
      book = title(r'\1', href)
    
  • Use small methods that do one thing, with small interfaces (2 args most)

  • If not using unittests for more serious work it’s very easy to at least add some assert statements. In this case 3 asserts pretty much guarantee the program will behave as expected (see here the total Amazon links value of 204 I verified at the start).

      assert(sum(cnt.values()) == 204)
      assert(cnt["tao-te-ching-laozi"] == 3)
      assert(cnt["influence-psychology-persuasion-robert-cialdini"] == 2)
    
  • Don’t re-invent the wheel: Counter, BeautifulSoup and requests provide awesome functionality with little code, use them.

  • String formatting: use “{}”.format(var) over old-style %s, hard to beat this in terms of elegance / compactness:

      print("{:<3} {}".format(count, book))
    
  • Download html while developing (CACHED_HTML), requests slows down the process and unnecessarily hits Tim’s server.

Bonus golf ... Just for fun, can we make it tweet sized?

This is just for fun, don’t write code like this for maintainability (use the long format above).

Answer is: yes, but without the title extract (so showing links) and with a cheat: the imports I could not include. Then, yes I could wrap it in 139 characters :)

	>>> import requests
	>>> from collections import Counter
	>>> from bs4 import BeautifulSoup as Soup
	>>> cmd = "Counter(a['href'] for a in Soup(requests.get('http://bit.ly/2gP0fv3').text).find_all('a', href=True) if 'zon' in a['href']).most_common(20)"
	>>> assert(len(cmd) <= 140)
	>>> eval(cmd)
	...
	...

Conclusion: practice!

Apart from seriously wanting to know which books repeatingly came up, ‘silly’ exercises like these do help.

You stumble upon a lot of things, even for a relatively easy script as this. And to get to a decent result you practice refactoring.

If you need inspiration, opportunities are everywhere. Small example: I was doing a Javascript course and wanted to know the total time of the course which was not listed. I ended up using re.find_all, a method I did not use that much, to extract all mm:ss timestamps of individual videos which I summed. Again, not a big deal but still found new ways of doing things.

Code katas are great for learning!


Bob Belderbos

Software Developer, Pythonista, Data Geek, Student of Life. About me