Last year I wanted to enrich my movie collection database with information available on the IMDb website, such as rating, production year, genre, cast, summary, directors and so forth, to make it easier to find a good film that I have not yet watched. My idea was to get this data from IMDb in bulk, without manual intervention. Unfortunately, after a bit of research, I found out that they do not offer any web services to make my life easier, so I had to use page scraping to do the job. In other words, this means making an HTTP GET request for a specific IMDb page and parsing out the HTML data I was interested in.
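As a rough illustration of what page scraping involves, here is a minimal Python sketch (not the code I originally used) that fetches a page with an HTTP GET and pulls out the text of its <title> element. The function names, the User-Agent value and the sample page are all made up for the example; a real IMDb page needs far more specific parsing for rating, cast and so on.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element of a page."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_title(html):
    """Parse an HTML string and return the page title, stripped of whitespace."""
    parser = TitleParser()
    parser.feed(html)
    return parser.title.strip()


def fetch_html(url):
    """Plain HTTP GET; the User-Agent string is an arbitrary placeholder."""
    request = Request(url, headers={"User-Agent": "movie-collection-sketch"})
    with urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


# Hypothetical page content, used so the example runs without network access.
sample_page = (
    "<html><head><title>Ice Age: The Meltdown (2006)</title></head>"
    "<body></body></html>"
)
print(extract_title(sample_page))
```

For a live page you would call extract_title(fetch_html(url)) instead of using the sample string.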
I opted to use Google's web services to find the movie information page for each title. Initially I used IMDb's own search, but I later discovered that it sucks: I was only getting about 70% accuracy! With Google's search the accuracy jumped to 99.5%.
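Whatever search engine you use, the step after searching is picking the IMDb title page out of the results. The sketch below assumes the results arrive as a plain list of URLs, which is my simplification and not the shape of any particular search API. The function name and the non-IMDb URLs are invented for the example.

```python
import re

# Matches IMDb title pages such as http://imdb.com/title/tt0438097/
TITLE_URL = re.compile(r"^https?://(?:[\w-]+\.)*imdb\.com/title/(tt\d+)/?")


def first_imdb_title_id(result_urls):
    """Return the ID of the first result that is an IMDb title page, or None."""
    for url in result_urls:
        match = TITLE_URL.match(url)
        if match:
            return match.group(1)
    return None


# Hypothetical search results for "Ice Age 2"; only the imdb.com/title URL
# is a real address taken from this post.
results = [
    "http://example.com/ice-age-2-review",
    "http://imdb.com/title/tt0438097/",
    "http://example.org/some-other-page",
]
print(first_imdb_title_id(results))
```

Taking the first matching result keeps the search engine's ranking, which is where the accuracy comes from.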
The search works by passing in a list of titles, for example:
Ice Age 2
Norbit
and getting back all the data I am interested in, in CSV format, as follows:
Ice Age 2,"fabb4e49-248e-4c34-b277-bb9927f3b498","Ice Age: The Meltdown","2006","6.9","U","Animation;Adventure;Comedy;Family","91","Ice Age 2 (Singapore: English title) (USA) (working title);Ice Age 2: The Meltdown (USA) (trailer title);Ice Age: The Meltdown (UK);","2 wins&13 nominations","Diego, Manny and Sid return in this sequel to the hit Ice Age. This time around the Ice Age is over and is starting to melt, which will destroy their valley. So they must unite and warn everyone about the situation. ","Fall From Height;Character's Point Of View Camera Shot;Ice Age;Love;Musical Number","Ray Romano;John Leguizamo;Denis Leary;Seann William Scott;Josh Peck;Queen Latifah;Will Arnett;Jay Leno;Chris Wedge;Peter Ackerman;Caitlin Rose Anderson;Connor Anderson;Joseph Bologna;Jack Crocicchia;Peter DeSève","","","http://imdb.com/title/tt0438097/","http://ia.imdb.com/media/imdb/01/I/40/67/11/10m.jpg"
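To show how a row like this can be read back, here is a small Python sketch using the standard csv module. The column names are my guesses from the sample row, since the file itself has no header, and the two empty fields are simply labelled unknown. The sample line is a shortened copy of the row above.

```python
import csv
import io

# Column names are guesses based on the sample row; the file has no header.
COLUMNS = [
    "search_title", "id", "title", "year", "rating", "certificate",
    "genres", "runtime", "aka", "awards", "summary", "keywords", "cast",
    "unknown_1", "unknown_2", "imdb_url", "thumbnail_url",
]
# Fields that pack several values into one cell, separated by semicolons.
MULTI_VALUE = {"genres", "aka", "keywords", "cast"}


def parse_row(line):
    """Turn one CSV line into a dict, splitting the semicolon-separated fields."""
    values = next(csv.reader(io.StringIO(line)))
    record = dict(zip(COLUMNS, values))
    for name in MULTI_VALUE:
        if name in record:
            record[name] = [v for v in record[name].split(";") if v]
    return record


# Shortened copy of the sample row (first eight fields only).
sample = (
    'Ice Age 2,"fabb4e49-248e-4c34-b277-bb9927f3b498","Ice Age: The Meltdown",'
    '"2006","6.9","U","Animation;Adventure;Comedy;Family","91"'
)
movie = parse_row(sample)
print(movie["title"], movie["year"], movie["genres"])
```

Using a real CSV reader matters here because the summary field contains commas inside its quotes, so a naive split on "," would break the row apart in the wrong places.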
I saved this data in a CSV file and used an OleDb file connection so that I could use regular SQL for my searches. Now I have the luxury of typing an artist's name, genre or year and getting back the movies in my collection that match my criteria.
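The same idea can be sketched without OleDb. The example below swaps in Python's built-in sqlite3 module with an in-memory table, so it is an analogue of my setup and not the setup itself. The table layout, the column names and the find_movies helper are invented for the example, and the Norbit row is filled in by hand just to have a second record.

```python
import sqlite3

# Simplified rows: (title, year, genres, cast). The real file has more columns.
movies = [
    ("Ice Age: The Meltdown", 2006, "Animation;Adventure;Comedy;Family",
     "Ray Romano;John Leguizamo;Denis Leary"),
    ("Norbit", 2007, "Comedy", "Eddie Murphy"),  # illustrative row
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE movies (title TEXT, year INTEGER, genres TEXT, cast_list TEXT)"
)
conn.executemany("INSERT INTO movies VALUES (?, ?, ?, ?)", movies)


def find_movies(conn, cast=None, genre=None, year=None):
    """Return titles matching every criterion that was supplied."""
    query = "SELECT title FROM movies WHERE 1=1"
    params = []
    if cast:
        query += " AND cast_list LIKE ?"
        params.append(f"%{cast}%")
    if genre:
        query += " AND genres LIKE ?"
        params.append(f"%{genre}%")
    if year:
        query += " AND year = ?"
        params.append(year)
    query += " ORDER BY title"
    return [row[0] for row in conn.execute(query, params)]


print(find_movies(conn, genre="Comedy"))
print(find_movies(conn, cast="Ray Romano"))
```

The LIKE matching on semicolon-separated text is crude but good enough for a personal collection; a separate table per genre or cast member would be the tidier design.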
I intend to make this publicly available on a website and also provide a web service for movie searches. My only constraint is that Google supports only 1000 searches per day on its legacy SOAP API. Unfortunately, Google stopped supporting its SOAP search API some time ago, so it is no longer possible to obtain new keys for the service. It has been replaced with an AJAX search API, which is not good enough in this case because it does not give you programmatic access to the search results.
I'll write another post to let you know how I managed to show IMDb's thumbnails on my site and bypass their protection against cross-site image linking. To make it clear, it wasn't as easy as adding <img src="http://ia.imdb.com/media/imdb/01/I/40/67/11/10m.jpg">. Try it out: the image will not show up if you reference it on your own website using <img src="">.

