KDD 2011: Recap

Often I write some kind of summary when I get back from a conference, but today I ran into Justin Donaldson, a fellow Indiana alum, who said he still occasionally finds use of the content, or at least the visualizations, on my blog. Anyhow, I felt motivated to write my notes up as a blog post.

Overall, many of the talks I attended were using latent factor models and stochastic gradient descent, with a few mentions of gradient boosted trees. Good stuff. My favorite talks…

  • Charles Elkan’s “A Log-Linear Model with Latent Features for Dyadic Prediction”. A scalable method for collaborative filtering using both latent features and “side information” (user/item content features). Can’t wait to try it out! Here’s a link.
  • Peter Norvig’s Keynote about data mining at Google. Some tidbits:
    • Google mostly uses unsupervised or semi-supervised learning, both because of the cost of labeling and because labels themselves can be an impediment to higher accuracy.
    • He had this great graph of the accuracy of several algorithms for the word sense disambiguation task, plotted against the amount of training data. At any given training-set size, the best-performing algorithm was still beaten by the worst-performing one trained on an order of magnitude more data. A great argument for simple learning algorithms at very large scale.
    • They are very interested in transfer learning.
  • KDD Cup. Topic was music recommendation using Yahoo’s data.
    • Many of the same ideas and observations as in the Netflix Prize. Neighbor models and latent factor models trained with stochastic gradient descent seemed pervasive.
    • Ensembles were necessary to win, but the accuracy improvement wasn’t huge over the best individual models. Quite the argument for not using ensembles in industry.
    • The Yahoo organizers made some really interesting comments about the data. Among them: the mean rating is quite different for power users, which makes sense. And the data was acquired through different UI mechanisms, if I understood correctly, which impacts the distributions.
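If you haven’t played with the latent factor models that dominated these talks, here’s a minimal sketch of the core idea: factor the ratings matrix with plain stochastic gradient descent. The toy data and hyperparameters below are made up for illustration, not taken from any of the papers or winning entries.

```python
import random

def sgd_factorize(ratings, n_users, n_items, k=2, lr=0.01, reg=0.02, epochs=200):
    """Fit user/item latent vectors to (user, item, rating) triples with SGD."""
    random.seed(0)
    P = [[random.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_users)]
    Q = [[random.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            # Prediction error for this observed rating
            err = r - sum(P[u][f] * Q[i][f] for f in range(k))
            # Gradient step with L2 regularization on both factors
            for f in range(k):
                pu, qi = P[u][f], Q[i][f]
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)
    return P, Q

# Toy ratings: (user, item, rating)
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0),
           (1, 2, 1.0), (2, 1, 2.0), (2, 2, 5.0)]
P, Q = sgd_factorize(ratings, n_users=3, n_items=3)
train_sse = sum((r - sum(P[u][f] * Q[i][f] for f in range(2))) ** 2
                for u, i, r in ratings)
```

The winning KDD Cup systems add plenty on top of this (biases, neighbor terms, ensembles), but this is the kernel most of them share.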

Looking forward to tomorrow!


Three Favorites from the Knight News Challenge

The Knight News Challenge, in its 5th year, is a very cool effort to fund ideas that mash up aspects of the news industry with new tech.  Looking at the winners from the past four years…I think most are more about inspiration than viability, and that’s just fine.  Here are three that stood out to me…
MediaBugs.org. The idea is to have a mechanism that allows the ‘crowd’ to report errors in reporting.  Cool idea…it probably needs a way for a user to trivially report an error while reading a news site, like nytimes.com, to gain traction.

ushahidi.com.  The idea here is a map-based system for sharing information in a crisis.  Apparently it was used following the Haiti earthquake.  Very cool!!

documentcloud.org. The idea is to facilitate sharing of the documents used as sources in news stories.  I could imagine this growing into community or algorithmic fact checking.


Getting Really Large Images onto WordPress

I wanted to add zoomable versions of some very large visualizations to the site this evening.  So I uploaded them to GigaPan (actually, some had been uploaded years ago), embedded them on the site, and everything works great!!  Click on the ‘Data Art’ tab above to see for yourself.  Here are the steps if you’re interested…

  1. GigaPan only accepts images that are 50 megapixels or more, which is really large.  If your image is large, but not that large, download SmillaEnlarger and increase the size a little.
  2. Sign up for a GigaPan account if you don’t already have one.
  3. Download the GigaPan uploader.
  4. Install the uploader and upload the image.  The software was pretty easy to use.  When the upload is finished, you’ll get a URL for the image that includes a 5-digit id.  One of mine was 65469, for example.
  5. On the wordpress page or post where you want the image, place the following code (replacing 65469 with your image’s id):
    <iframe src="http://www.gigapan.org/media/gigapans/65469/options/nosnapshots/iframe/flash.html?height=400" frameborder="0" height="400" scrolling="no" width="100%"></iframe>
  6. That’s it!  You’ll get something like this…


Direct Search

For the past couple of years I’ve been primarily involved with engineering models used in search engines.  At times I’ve run into situations where a model I’m using or developing has some parameters that need to be set.  For example, a model might have a parameter that is a threshold on the number of times a keyword will be counted before we decide that additional occurrences are probably spam (and, yes, I’m talking about BM25 here).  And, at times, either the cost function I would like to use to set the parameters is not differentiable (yeah, I’m thinking about DCG), or I’m perfectly happy to use a quick and dirty method.  So I end up going with a direct search algorithm.  Here’s what I’ve learned (and haven’t forgotten)…
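To give a flavor of what I mean by direct search, here’s a minimal coordinate-search sketch: probe each parameter up and down by a step size, keep any improvement, and shrink the step when nothing helps. The objective below is a made-up stand-in for something non-differentiable like DCG, not an actual ranking metric.

```python
def coordinate_search(f, x0, step=1.0, shrink=0.5, tol=1e-3, max_iter=1000):
    """Derivative-free minimization: probe each coordinate +/- step,
    accept improvements, and halve the step when no probe helps."""
    x = list(x0)
    fx = f(x)
    iters = 0
    while step > tol and iters < max_iter:
        improved = False
        for i in range(len(x)):
            for delta in (step, -step):
                trial = list(x)
                trial[i] += delta
                ft = f(trial)
                if ft < fx:
                    x, fx = trial, ft
                    improved = True
                    break
        if not improved:
            step *= shrink  # stuck: refine the search
        iters += 1
    return x, fx

def obj(x):
    # Hypothetical non-differentiable objective: L1 distance to (3, -1)
    return abs(x[0] - 3.0) + abs(x[1] + 1.0)

best, val = coordinate_search(obj, [0.0, 0.0])
```

Nothing here needs a gradient, which is the whole appeal when the cost function is a ranking metric or the model is a black box.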


Guide to Getting Started in Machine Learning

Someone at work recently asked how he should go about studying machine learning on his own. So I’m putting together a little guide. This post will be a living document…I’ll keep adding to it, so please suggest additions and make comments.

Fortunately, there’s a ton of great resources that are free and on the web. The very best way to get started that I can think of is to read chapter one of The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2009 edition). The pdf is available online. Or buy the book on Amazon here, if you prefer.

Once you’ve read the first chapter, download R. R is an open-source statistics package/language that’s quite popular. Never heard of it? Check out this post (How Google and Facebook are using R).


20 Useful Visualization Libraries

Well, it’s not entirely limited to libraries.  ‘Useful stuff for visualization practitioners’ sounded a little non-specific, though.  These are all freely available.


A Look at FINVIZ.com (Financial Visualizations)

about_finviz

FINVIZ is a suite of free financial tools that takes advantage of modern visualization ideas.  The infoviz and interaction designs are certainly worth a blog post.  Here’s a look at their efforts…


Visualizing the ‘Power Struggle’ in Wikipedia

A new visualization Bruce Herr and I recently completed is being featured in this week’s New Scientist Magazine (the article is free online, minus the viz).  They did a good job jazzing up the language used to describe the viz–‘power struggle’, ‘bubbling mass’, ‘blitzed articles’–but they also dumbed down the technical accomplishments.  I guess not everyone gets as excited about algorithms as I do.

Before I say any more about the viz, though, let me mention that it’s appearing at the NetSci 2007 Conference this week, and hopefully a variant will appear at Wikimania later this summer as well.  The viz is a huge 5 feet by 5 feet when printed, and I only include a low-res, smaller version here.  At some point high-quality art prints of it will appear at SciMaps for sale to fund further visualization research.


Another Visualization of the Netflix Prize Dataset

Here’s a recent visualization I did of the dataset used in the Netflix Prize Competition. The dataset is 17,700 movies and 31 gigs of user ratings. This viz shows similar movies close to one another, with the similarities determined by a formula based on ratings.

I found most interesting a cluster of movies (in blue) that I’d say are generally acclaimed. The cluster contains movies from across all genres, such as Schindler’s List, Braveheart, and Super Size Me. Beyond that, there are a bunch of clusters that are mostly defined by a genre, such as music, sports, documentary, IMAX, children’s films, or bonus material. The big blob in the center is mostly what I’d call junk movies.
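For the curious: the post doesn’t spell out the similarity formula, but a common choice for rating-based similarity is cosine similarity over co-rated users. A sketch, with made-up ratings:

```python
from math import sqrt

def cosine_similarity(a, b):
    """Cosine similarity between two {user: rating} dicts, over co-rated users."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[u] * b[u] for u in common)
    norm_a = sqrt(sum(a[u] ** 2 for u in common))
    norm_b = sqrt(sum(b[u] ** 2 for u in common))
    return dot / (norm_a * norm_b)

# Made-up ratings: movie_b is rated like movie_a, movie_c oppositely
movie_a = {"u1": 5, "u2": 4, "u3": 1}
movie_b = {"u1": 4, "u2": 5}
movie_c = {"u1": 1, "u3": 5}
sim_ab = cosine_similarity(movie_a, movie_b)
sim_ac = cosine_similarity(movie_a, movie_c)
```

In practice people often subtract each user’s mean rating first (adjusted cosine), since raw ratings make everything look at least somewhat similar.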


GapMinder Talk

Just read an article about Google buying a small company called GapMinder which does data visualization.  I checked out the talk on the GapMinder homepage, and would recommend watching the first 10 minutes of it.  The visualization tool that is used throughout the talk is something special…easy to see Google’s interest.


Tips for Making Perl Programs Run Faster

In my daily work I tend to manipulate fairly large datasets, such as Wikipedia, U.S. Patents, Netflix Ratings, and Imdb.  Here are a few tricks I’ve come across so you don’t lose time waiting for your programs to finish.


Flash vs. Processing

Over the past year and a half I’ve been hooked on the language Processing. I’ve even contributed an early version of a library for visualizing social network data.
