I like to solve everyday problems while learning more about coding. Learning to code is all about practice. Today I will show you a Perl script that sends a daily email listing all movies broadcast on Spanish TV in the coming 24 hours.
I tried to do this some time ago. BeautifulSoup is very powerful for HTML parsing, but with the typical TV guide I had to rely on a class "cine inactiu" to filter out the movies. This worked pretty well, but the list was not complete, and because the movie names are in Spanish, the lookups I linked to imdbapi for more info usually did not yield any result. So I needed to redo this exercise ...
I didn't start scripting right away. First I looked for a better source, and I found one. This page does half of the work by listing all movies to be aired today on Spanish TV. Usually it is not necessary to code everything! What I wanted to add on top of this was:
- Enrich this list with info for each movie, mainly the director and actors. This was easy: follow and parse each link in the list.
- Email this list to myself with a cron job. One issue was the encoding not showing up well in Apple Mail, and mailx not sending out the cron mail when the encoding was not defined properly in the command's switches. More on that in a minute. Without further ado, here is the script, which is also available on GitHub:
    #!/usr/bin/perl -w
    # author: bob belderbos
    # v0.1 sept 2012
    # purpose: send an email with all movies on Spanish TV in the next 24 hours
    # sincroguia.tv has a pretty complete list
    # this script serves best in a daily cronjob
    #
    use strict;
    use Data::Dumper;
    use LWP::Simple;
    use Encode qw(encode decode);
    # http://perlgeek.de/en/article/encodings-and-unicode

    my $enc = 'utf-8';
    my $output;
    # single quotes so Perl does not try to interpolate the "@" part
    my $email = '[email protected]';
    my @html = getUrl("http://www.sincroguia.tv/todas-las-peliculas.html");

    # movie lines start with hh:mm timestamps
    for (grep {/^\d{2}:\d{2}|^<br/} @html){
      # separate days
      if(/^<br/){
        s/<br \/>//g;
        $output .= createHeader($_, "*");
        next;
      }
      # parse movies
      m/(\d{2}:\d{2}) - <a.*?"([^"]+)" href="([^"]+)".*- ([^<]+).*/;
      my ($time, $title, $url, $channel) = ($1, $2, $3, $4);
      $output .= encode($enc, createHeader("$time / $channel / $title") . "$url\n" . getMovieInfo($url) . "\n\n");
    }

    # send me the generated movielist
    sendEmail($email, $output);

    sub getUrl {
      my $url = shift;
      my @html = split /<\/?li[> ]/, get($url);
      return @html;
    }

    sub getMovieInfo {
      my $url = shift;
      my $info;
      for(getUrl($url)){
        next if(! /column/);
        for my $line (split /<\/?h3[> ]/, $_){
          if($line =~ /Director:|rpretes|Idioma|Nacionalidad|Añ/){
            $line =~ s/.*?strong>(.*)<\/strong>(.*)/$1$2\n/g;
            $line =~ s/Año/Estreno/g;
            $info .= $line ;
          }
        }
        last;
      }
      return $info;
    }

    sub createHeader {
      my $str = shift;
      my $delimiter = shift // "=";
      my $width = 70;
      my $output = "\n" . $delimiter x $width . "\n" . $str . "\n" . $delimiter x $width . "\n";
      return $output;
    }

    sub sendEmail {
      my ($to, $output) = @_;
      my $subject = "Today's movies Spanish TV";
      # on mail pipe: http://objectmix.com/perl/380680-sending-email-perl-using-pipe-mailx.html
      open my $pipe, '|-', 'mailx', '-s', $subject,
        # char issue mailx: http://forums.opensuse.org/english/other-forums/development/programming-scripting/419802-charset-problem-mailx.html
        '-S', 'ttycharset=utf-8', '-S', 'sendcharsets=utf-8', '-S', 'encoding=8bit',
        $to or die "can't open pipe to mailx: $!\n";
      print $pipe $output;
      close $pipe;
    }
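To see what the movie-line regex captures, it can be tried on its own. The snippet below is a small sketch: the sample line and its example.com URL are made up for illustration (the real markup on sincroguia.tv may differ), and parseMovieLine is a hypothetical helper that is not part of the script.

```perl
use strict;
use warnings;

# Hypothetical sample line; the real markup on the site may differ.
my $sample = '22:00 - <a title="El padrino" href="http://example.com/el-padrino.html">El padrino</a> - La 1';

# Same pattern as in the script, wrapped in a sub that returns
# an empty list when the line does not look like a movie entry.
sub parseMovieLine {
    my $line = shift;
    return unless $line =~ m/(\d{2}:\d{2}) - <a.*?"([^"]+)" href="([^"]+)".*- ([^<]+).*/;
    return ($1, $2, $3, $4);    # time, title, url, channel
}

my ($time, $title, $url, $channel) = parseMovieLine($sample);
print "$time / $channel / $title\n$url\n" if defined $time;
```

Checking that the match succeeded before using $1 to $4 is a small safety net in case a line slips through the grep but does not fit the full pattern.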
Some notes / things I learned
- I learned how to send email via Perl, see the "sendEmail" subroutine. I used a pipe to mailx, because sendmail would not work with my hosting account. The only thing I had trouble with was Spanish characters like é, á, ñ, etc. Mailx would complain and not send the mail, putting it in dead.letter in my home directory instead. Encoding the output as UTF-8, with the help of this article, did the trick!
- The other challenge with Spanish characters was parsing the source page. I found a good article explaining encodings, with code to handle them.
- The rest is pretty basic Perl, plus some regexes, which really make me like Perl and use it for more and more text parsing. I should probably add some eval {} in case one of the movie pages does not respond; otherwise cron will send me another mail with the stderr output of the script (I could add 2>/dev/null to the cron job, but I think the script should handle this).
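The eval {} idea from the last note could look roughly like this. It is a sketch, not what the script currently does: safeFetch and getUrlSafe are hypothetical names, and the fetcher is passed in as a code reference (in the real script it would be LWP::Simple's get, which returns undef on failure) so the example runs without network access.

```perl
use strict;
use warnings;

# Hypothetical helper: call a fetcher code ref and return an empty string
# when it dies or returns undef, so callers never split an undefined value.
sub safeFetch {
    my ($fetcher, $url) = @_;
    my $content;
    my $ran = eval { $content = $fetcher->($url); 1 };
    return '' unless $ran && defined $content;
    return $content;
}

# Variant of getUrl built on safeFetch; yields an empty list on failure.
sub getUrlSafe {
    my ($fetcher, $url) = @_;
    return split /<\/?li[> ]/, safeFetch($fetcher, $url);
}

# Stand-in fetchers for illustration only.
my $working = sub { '<li>21:30 - one</li><li>23:00 - two</li>' };
my $failed  = sub { undef };             # mimics a page that does not respond
my $dies    = sub { die "timeout\n" };   # mimics a fetcher that throws

for my $fetcher ($working, $failed, $dies) {
    my @pieces = getUrlSafe($fetcher, 'http://example.com/');
    print scalar(@pieces), " piece(s)\n";
}
```

With this in place a dead movie page simply contributes no info lines, and nothing is written to stderr for cron to mail.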
Bonus
Because the time is written as hh:mm, Apple Mail recognizes it, and clicking it lets you add an event to your calendar. This way I can easily set a reminder for a movie I might want to see later in the evening.
Update 23.09.2012
The movie page does not always give the English movie name, and sometimes it has a generic term like "La película de la semana", which doesn't give any clue. So I did some more parsing to include the movie name, original name and plot info. I also included the Twitter and Facebook sharing links which are at the bottom of each movie page, example: