An Interactive Visualization of the Netflix Prize Dataset

The visualization activated below (click the button) shows all 17,700 movies that are part of the Netflix Prize Competition. The movies are laid out such that similar movies are close to one another. Similarity between two movies is computed based on whether users who like one like the other, or (and, really) those who dislike one dislike the other. Alternatively, take a look at a colorful, static version.
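The exact similarity formula isn't given here, so the following is only a sketch of one common choice that matches the description: Pearson correlation over the users who rated both movies. The function name and the toy ratings are made up for illustration.

```python
from math import sqrt


def movie_similarity(ratings_a, ratings_b):
    """Pearson correlation over users who rated both movies.

    ratings_a, ratings_b: dicts mapping user id -> star rating (1-5).
    Returns a value in [-1, 1], or 0.0 when the correlation is undefined.
    NOTE: an assumed formula, not necessarily the one used for the viz.
    """
    common = sorted(set(ratings_a) & set(ratings_b))
    if len(common) < 2:
        return 0.0
    xs = [ratings_a[u] for u in common]
    ys = [ratings_b[u] for u in common]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


# Toy ratings (hypothetical users and movies)
movie_a = {"u1": 5, "u2": 1, "u3": 4, "u4": 2}
movie_b = {"u1": 4, "u2": 2, "u3": 5, "u4": 1}  # liked by the same users as movie_a
movie_c = {"u1": 1, "u2": 5, "u3": 2, "u4": 4}  # liked by the opposite users

sim_ab = movie_similarity(movie_a, movie_b)
sim_ac = movie_similarity(movie_a, movie_c)
```

A positive score means the same users tend to like (or dislike) both movies, so a layout algorithm would pull those two movies together; a negative score pushes them apart.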

Mouse over to get the movie titles…

Another Visualization of the Netflix Prize Dataset

Here’s a recent visualization I did of the dataset used in the Netflix Prize Competition. The dataset is 17,700 movies and 31 gigs of user ratings. This viz shows similar movies close to one another, with the similarities determined by a formula based on ratings.

I found most interesting a cluster of movies (in blue) that I’d say are generally acclaimed. The cluster contains movies across all genres, such as Schindler’s List, Braveheart, and Super Size Me. Beyond that, there are a bunch of clusters which are mostly defined by a genre such as music, sports, documentary, Imax, children’s films, or bonus material. The big blob in the center is mostly what I’d call junk movies.

I’ve labeled some movies just to give some sense of what the clusters contain. There’s an interactive version of the viz as well, so you can explore the movies for yourself…

Scheme Tutorial

I was asked to give a short (1 hr) tutorial on the Scheme language this week for students in the graduate and undergraduate AI courses at Indiana.  Thought I would post the slides in case anyone wants to adapt it for their own purposes…

A Review of MemoryArchive

I recently came across a small site running on MediaWiki called MemoryArchive. The concept is that each article is a memory written, unlike Wikipedia, by a single author. Subjective content allowed.

There seems to be a legit place for a site with this concept to complement Wikipedia. Wikipedia is derivative knowledge: it is intended that the content be cited, meaning it already had to have been published somewhere. Many valuable (and not so valuable) facts don’t fit that bill. Also, when sources disagree but are merged into a single Wikipedia article, history according to Wikipedia has a rather non-deterministic feel to it.

That said, MemoryArchive has a long way to go in terms of concept, technology, and adoption. If anyone involved with MemoryArchive comes across this review…well, I have some ideas:

  1. The site needs to provide a data dump (similar to Wikipedia’s data dump) or API.  That way researchers can use the knowledge without scraping the content.  Incidentally, I have written a basic scraper in Perl for this site if anyone wants it.
  2. Use Semantic MediaWiki. It’s the future.
  3. Allow any user to create links and categories on any page. You’re already using MediaWiki, might as well take advantage of the technology.
  4. Allow usernames to be linked to social network accounts such as Facebook. It will create many opportunities for applications to use the memories, and for memories to be related to one another.
  5. Link events to Wikipedia pages on those events…as I said, it’s complementary to Wikipedia.


Ensemble Machine Learning Tutorial

Here are the slides from a 2-part lecture I’m giving on ensemble learning at Indiana University. It includes a discussion of the Netflix Prize competition, and the use of ensemble techniques in that competition.

Introduction to Ensemble Learning

Featuring Successes in the Netflix Prize Competition

Todd Holloway. Two-lecture series for B551, November 20 & 27, 2007, Indiana University

  • Introduction: bias and variance problems
  • The Netflix Prize: success of ensemble methods in the Netflix Prize
  • Why ensemble methods work
  • Algorithms: AdaBoost, BrownBoost, random forests

Bias and Variance

Decision trees: small trees have high bias. Large trees have high variance. Why?
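One way to approach the “Why?” is the standard bias-variance decomposition of squared error. This is a textbook result, not something taken from the slides. It assumes a target y = f(x) + ε with noise variance σ², and a model \hat{f} fit to a randomly drawn training sample, with expectations taken over training samples and noise:

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \sigma^2
```

A small tree has too few leaves to represent f, so the first term stays large no matter which sample it is trained on. A large tree can fit each training sample closely, so its predictions swing from sample to sample and the second term grows.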

Ensemble classification: aggregation of the predictions of multiple classifiers with the goal of improving accuracy.
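The simplest form of aggregation is a majority vote. Here is a minimal sketch; the function names, the tie-breaking rule, and the three toy classifiers are all assumptions made for illustration.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most frequent label; ties go to the smallest label."""
    counts = Counter(predictions)
    best = max(counts.values())
    return min(label for label, count in counts.items() if count == best)


def ensemble_predict(classifiers, example):
    """Ask every classifier for a label and aggregate by majority vote."""
    return majority_vote([clf(example) for clf in classifiers])


# Three hypothetical classifiers that map a number to a star rating
classifiers = [
    lambda x: 5 if x > 0.5 else 1,
    lambda x: 5 if x > 0.3 else 1,
    lambda x: 5 if x > 0.9 else 1,
]

vote_high = ensemble_predict(classifiers, 0.6)  # two of three say 5
vote_low = ensemble_predict(classifiers, 0.2)   # all three say 1
```

Weighted votes or averaging the predicted ratings are common alternatives when the outputs are numeric, as they are with star ratings.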

Supervised learning task: the training data is a set of users and the ratings (1, 2, 3, 4, or 5 stars) those users have given to movies. Construct a classifier that, given a user and an unrated movie, correctly classifies that movie as 1, 2, 3, 4, or 5 stars. There is a $1 million prize for a 10% improvement over Netflix’s current movie recommender/classifier (RMSE = 0.9514).
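To make the target concrete, here is a small sketch of the error measure and the arithmetic behind a 10% improvement. It assumes the 0.9514 figure is a root mean squared error, which is how the competition scored entries; the helper name and the sample ratings are made up.

```python
from math import sqrt


def rmse(predicted, actual):
    """Root mean squared error between predicted and actual ratings."""
    if not predicted or len(predicted) != len(actual):
        raise ValueError("need two non-empty lists of equal length")
    squared = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return sqrt(squared / len(actual))


BASELINE_RMSE = 0.9514                   # figure quoted above
target_rmse = BASELINE_RMSE * (1 - 0.10)  # 10% lower error

# Hypothetical predictions for four (user, movie) pairs
example_error = rmse([3.5, 4.0, 2.0, 5.0], [4, 4, 1, 5])
```

Ten percent off 0.9514 works out to roughly 0.8563, so a winning entry has to be off by noticeably less than one star on a typical prediction.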


  • Utility of combining diverse, independent opinions in human decision-making: a protective mechanism (e.g. stock portfolio diversity)
  • Violation of Ockham’s Razor: identifying the best model requires identifying the proper “model complexity”


  • Boosting: make examples currently misclassified more important (or less, in some cases)
  • Bagging: use different samples or attributes of the examples to generate diverse classifiers
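As an illustration of the boosting idea, here is a sketch of one reweighting round using the textbook discrete AdaBoost update. The slides don't spell out the update rule, so treat the formula, the function name, and the toy data as assumptions.

```python
from math import exp, log


def adaboost_reweight(weights, predictions, labels):
    """One AdaBoost round: return (alpha, new_weights).

    labels and predictions are +1 or -1; weights are positive and sum to 1.
    Misclassified examples gain weight, correctly classified ones lose it.
    """
    error = sum(w for w, p, y in zip(weights, predictions, labels) if p != y)
    if not 0 < error < 0.5:
        raise ValueError("weak learner must have weighted error in (0, 0.5)")
    alpha = 0.5 * log((1 - error) / error)  # vote weight of this classifier
    raw = [w * exp(-alpha * y * p)
           for w, p, y in zip(weights, predictions, labels)]
    total = sum(raw)
    return alpha, [w / total for w in raw]


# Toy round: four equally weighted examples, the last one is misclassified
labels = [1, 1, -1, -1]
predictions = [1, 1, -1, 1]
alpha, new_weights = adaboost_reweight([0.25] * 4, predictions, labels)
```

After this round the single misclassified example carries half of the total weight, so the next classifier is pushed to get it right. Bagging, by contrast, leaves the weights alone and instead trains each classifier on a different resampled version of the data.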

Random forests

Let the number of training cases be M, and the number of variables in the classifier be N.

For each tree,

  • Choose a training set by choosing M times with replacement from all M available training cases.
  • For each node, randomly choose n variables on which to base the decision at that node.
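The two sampling steps above can be sketched as follows. This covers only the randomization, not tree construction; the function names and toy sizes are hypothetical.

```python
import random


def bootstrap_sample(cases, rng):
    """Draw len(cases) training cases with replacement (one tree's training set)."""
    return [rng.choice(cases) for _ in cases]


def choose_node_variables(num_variables, n, rng):
    """Pick n distinct variable indices, out of num_variables, for one node."""
    return rng.sample(range(num_variables), n)


rng = random.Random(0)        # seeded so the run is repeatable
cases = list(range(10))       # toy ids for the training cases
tree_sample = bootstrap_sample(cases, rng)
node_vars = choose_node_variables(6, 2, rng)  # 6 variables in total, 2 per node
```

Because the draw is with replacement, some cases appear more than once in a tree's sample and others are left out entirely, which is part of what makes the trees differ from one another.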

Some more evidence of ensembling. In this case all the competition entrants’ predictions were combined after the event closed to see what could have been…