7th Jul, 2007

My Movie Search

Last year I wanted to enhance my movie collection database with additional information available on IMDb’s website, such as rating, production year, genre, cast, summary and directors, to make it easier to find a good film that I have not yet watched. My idea was to get this data from IMDb in bulk, without manual intervention. Unfortunately, after a bit of research, I found out that they do not offer any web services to make my life easier, so I had to use page scraping to do the job. In other words, this means making an HTTP GET request for a specific IMDb page and parsing the HTML for the data I was interested in.
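To give an idea of what page scraping involves, here is a minimal Python sketch of the fetch-and-parse step. It is not my actual scraper: the sample markup is hypothetical rather than IMDb's real page structure, and the names `TitleParser`, `fetch_html` and `extract_title` are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class TitleParser(HTMLParser):
    """Collects the text found inside <title> elements."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def fetch_html(url):
    """Issue an HTTP GET and return the decoded body (not called in the demo)."""
    request = Request(url, headers={"User-Agent": "my-movie-search-sketch"})
    with urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def extract_title(html):
    """Parse an HTML document and return the text of its <title>."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title.strip()


# Hypothetical page fragment, NOT IMDb's real markup.
SAMPLE_HTML = (
    "<html><head><title>Ice Age: The Meltdown (2006)</title></head>"
    "<body></body></html>"
)
print(extract_title(SAMPLE_HTML))
```

A real scraper needs one such extraction per field (rating, genre, cast and so on) and breaks whenever the site changes its markup, which is the main weakness of this approach.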

I opted to use Google’s web services to find the movie information page I was interested in. Initially I used IMDb’s own search, but later I discovered that it sucks: I was only getting about 70% accuracy! With Google’s search the accuracy jumped to 99.5%.
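The lookup step can be sketched roughly as follows, in Python. This assumes the search is restricted to IMDb with a `site:` operator and that the first result pointing at a `/title/tt.../` page is the one to scrape; both are assumptions for illustration, the function names are hypothetical, and no search API is actually called here.

```python
import re

# Matches IMDb title pages such as http://imdb.com/title/tt0438097/
TITLE_URL = re.compile(r"^https?://(?:www\.)?imdb\.com/title/(tt\d+)/?$")


def build_query(title):
    """Build a search query restricted to IMDb (assumed query format)."""
    return f"site:imdb.com {title}"


def first_title_id(result_urls):
    """Return the IMDb title ID of the first result that is a title page."""
    for url in result_urls:
        match = TITLE_URL.match(url)
        if match:
            return match.group(1)
    return None


# Hypothetical search results; only the second URL is a title page.
results = [
    "http://imdb.com/name/nm0000000/",
    "http://imdb.com/title/tt0438097/",
]
print(build_query("Ice Age 2"))
print(first_title_id(results))
```

Filtering on the URL pattern is what keeps actor pages, news items and message boards out of the results.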

The search works by passing in a list of titles, for example:
Ice Age 2
Norbit

and getting back all the data I am interested in, in CSV format, as follows:

Ice Age 2,"fabb4e49-248e-4c34-b277-bb9927f3b498","Ice Age: The Meltdown","2006","6.9","U","Animation;Adventure;Comedy;Family","91","Ice Age 2 (Singapore: English title) (USA) (working title);Ice Age 2: The Meltdown (USA) (trailer title);Ice Age: The Meltdown (UK);","2 wins&13 nominations","Diego, Manny and Sid return in this sequel to the hit Ice Age. This time around the Ice Age is over and is starting to melt, which will destroy their valley. So they must unite and warn everyone about the situation. ","Fall From Height;Character’s Point Of View Camera Shot;Ice Age;Love;Musical Number","Ray Romano;John Leguizamo;Denis Leary;Seann William Scott;Josh Peck;Queen Latifah;Will Arnett;Jay Leno;Chris Wedge;Peter Ackerman;Caitlin Rose Anderson;Connor Anderson;Joseph Bologna;Jack Crocicchia;Peter DeSève","","","http://imdb.com/title/tt0438097/","http://ia.imdb.com/media/imdb/01/I/40/67/11/10m.jpg"
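A row like the one above can be read back with any CSV parser. Here is a short Python sketch using the first eight fields of the sample row. The column names are guesses on my part (the file has no header row), in particular `certificate` for "U" and `runtime` for "91".

```python
import csv
import io

# First eight fields of the sample row above, with plain double quotes.
ROW = (
    'Ice Age 2,"fabb4e49-248e-4c34-b277-bb9927f3b498",'
    '"Ice Age: The Meltdown","2006","6.9","U",'
    '"Animation;Adventure;Comedy;Family","91"\n'
)

# Assumed column names; the original file does not label its columns.
COLUMNS = ["query", "id", "title", "year", "rating",
           "certificate", "genres", "runtime"]


def parse_row(line):
    """Parse one CSV line into a dict, splitting the ';'-separated genres."""
    values = next(csv.reader(io.StringIO(line)))
    record = dict(zip(COLUMNS, values))
    record["genres"] = record["genres"].split(";")
    return record


movie = parse_row(ROW)
print(movie["title"], movie["year"], movie["genres"])
```

Multi-valued fields (genres, keywords, cast) are packed into one column with semicolons, so they have to be split after the CSV parsing itself.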

I saved this data in a CSV file and used an OleDb file connection to be able to use regular SQL for my searches. Now I have the luxury of typing an artist’s name, genre or year and getting back the movies in my collection that match my criteria :)

I intend to make this publicly available on a website and also provide a web service for movie searches. My only constraint is that Google only supports 1000 searches per day on their legacy SOAP API. Unfortunately, Google stopped supporting their SOAP Search API some time ago, so it is not possible to get new keys for the service. They have replaced it with an AJAX Search API, which is not good enough in this case because it does not give you programmatic access to the search results.
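The SQL-over-CSV idea described above can be approximated in Python with an in-memory SQLite table. This is a stand-in for the OleDb connection, not the same thing. The column subset, the `load` and `search` helpers, and the second sample row are made up for illustration.

```python
import csv
import io
import sqlite3

# Reduced column set; the second row is invented sample data.
CSV_TEXT = '''title,year,rating,genres,cast
"Ice Age: The Meltdown",2006,6.9,"Animation;Adventure;Comedy;Family","Ray Romano;John Leguizamo;Denis Leary"
"Hypothetical Drama",1999,7.0,"Drama","Jane Doe"
'''


def load(conn, csv_text):
    """Create the movies table and fill it from CSV text with a header row."""
    conn.execute(
        "CREATE TABLE movies "
        "(title TEXT, year INTEGER, rating REAL, genres TEXT, cast_list TEXT)"
    )
    rows = csv.DictReader(io.StringIO(csv_text))
    conn.executemany(
        "INSERT INTO movies VALUES (:title, :year, :rating, :genres, :cast)",
        rows,
    )


def search(conn, genre=None, year=None, actor=None):
    """Return titles matching every criterion that was supplied."""
    clauses, params = [], []
    if genre:
        # Wrap the list in ';' so 'Comedy' cannot match inside another word.
        clauses.append("';' || genres || ';' LIKE ?")
        params.append(f"%;{genre};%")
    if year:
        clauses.append("year = ?")
        params.append(year)
    if actor:
        clauses.append("cast_list LIKE ?")
        params.append(f"%{actor}%")
    where = " AND ".join(clauses) or "1"
    query = f"SELECT title FROM movies WHERE {where} ORDER BY title"
    return [row[0] for row in conn.execute(query, params)]


conn = sqlite3.connect(":memory:")
load(conn, CSV_TEXT)
print(search(conn, genre="Comedy"))
print(search(conn, actor="leguizamo"))
```

Only the `WHERE` fragments are assembled as strings; the search values themselves are always passed as bound parameters.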

I’ll write another post to let you know how I managed to show IMDb’s thumbnails on my site and bypass their protection against cross-site image linking (hotlinking). To make it clear, it wasn’t as easy as adding <img src="http://ia.imdb.com/media/imdb/01/I/40/67/11/10m.jpg">. Try it out: the image will not show up if you reference it from your own website with a plain <img> tag.


Responses

Carlo, you may find some help if you try using Amazon. As they own IMDb, they provide much of the information straight from there and their APIs may make things like search and gathering the information easier.

And I suggest gathering the IMDb keywords for movies as well…for my personal collection on http://movies.nirajsanghvi.com I found that it made finding the right movie easier because I can now look through my movies by genre or situations too. I started my page before Amazon had nice APIs, so I’m screenscraping for content. I enter just an Amazon ID number for each DVD I get, and all the information gets auto-populated into a form from both Amazon and IMDb.

Niraj, I looked into Amazon’s API before I wrote the page scraper but unfortunately their webservice does not provide all the information I was interested in such as Genre and Ranking.

I found Google’s search of IMDb’s site to be a better approach, since it is much more accurate even when the title is misspelt. In my case, I already had a database with titles and DVD numbers, so I just passed in the information I already had as a list and it gave me back a CSV file with all the information from IMDb. It would have been very time consuming to find Amazon’s ID for every movie I had. You can try using Amazon’s API to search by title; however, I wasn’t impressed with their search results.

Hi there, I would be very interested in using this system you have made but I don’t really understand it.
If you could email me that would be great.
