KDD 2011: Recap

Often I write some kind of summary when I get back from a conference, but today I ran into Justin Donaldson, a fellow Indiana alum, who said he still occasionally finds use of the content, or at least the visualizations, on my blog. Anyhow, I felt motivated to write my notes up as a blog post.

Overall, many of the talks I attended were using latent factor models and stochastic gradient descent. A few mentions of gradient boosted trees. Good stuff. My favorite talks…

  • Charles Elkan’s “A Log-Linear Model with Latent Features for Dyadic Prediction”. A scalable method to do collaborative filtering using both latent features and “side information” (user/item content features). Can’t wait to try it out! Here’s a link.
  • Peter Norvig’s Keynote about data mining at Google. Some tidbits:
    • Google mostly uses unsupervised or semi-supervised learning, both because of the cost of labeling and because labels themselves can be an impediment to higher accuracy.
    • He had this great graph of the accuracy of several algorithms for the word sense disambiguation task plotted against the amount of data used in training. The worst-performing algorithm always beat the best-performing one once it was given an order of magnitude more data. A great argument for simple learning algorithms at very large scale.
    • They are very interested in transfer learning.
  • KDD Cup. Topic was music recommendation using Yahoo’s data.
    • Many of the same ideas and observations as in the Netflix Prize. Neighbor models and latent factor models trained with stochastic gradient descent seemed pervasive.
    • Ensembles were necessary to win, but the accuracy improvement wasn’t huge over the best individual models. Quite the argument for not using ensembles in industry.
    • Yahoo organizers made some really interesting comments about the data. Among them that the mean rating is quite different for power users, which makes sense. And the data is acquired from different UI mechanisms, if I understood correctly, which impacts the distributions.
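For anyone who hasn’t seen the latent factor approach that kept coming up, here is a minimal Python sketch of training one with stochastic gradient descent. It is a toy illustration under stated assumptions, not any team’s KDD Cup entry: the function names, the hyperparameters, and the tiny ratings list are all made up for the example.

```python
import random


def predict(P, Q, user, item):
    """Predicted rating: dot product of the user and item factor vectors."""
    return sum(p * q for p, q in zip(P[user], Q[item]))


def train_latent_factors(ratings, n_factors=2, lr=0.02, reg=0.02,
                         epochs=500, seed=0):
    """Fit factor vectors to (user, item, rating) triples with plain SGD."""
    rng = random.Random(seed)
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    # Small random starting vectors (an arbitrary but common choice).
    P = {u: [rng.uniform(-0.1, 0.1) for _ in range(n_factors)] for u in users}
    Q = {i: [rng.uniform(-0.1, 0.1) for _ in range(n_factors)] for i in items}
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - predict(P, Q, u, i)
            for f in range(n_factors):
                pu, qi = P[u][f], Q[i][f]
                # Gradient step on squared error with L2 regularization.
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)
    return P, Q


def rmse(P, Q, ratings):
    """Root mean squared error of the model on a list of ratings."""
    total = sum((r - predict(P, Q, u, i)) ** 2 for u, i, r in ratings)
    return (total / len(ratings)) ** 0.5


# Hypothetical ratings: (user, item, rating on a 1-5 scale).
toy_ratings = [
    ("u1", "a", 5), ("u1", "b", 1),
    ("u2", "a", 4), ("u2", "b", 1),
    ("u3", "a", 1), ("u3", "b", 5),
]

if __name__ == "__main__":
    P, Q = train_latent_factors(toy_ratings)
    print("training RMSE:", rmse(P, Q, toy_ratings))
```

Real systems add bias terms, held-out validation, and many more factors, but the update loop stays about this simple, which is probably a big part of why the approach is so popular.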

Looking forward to tomorrow!


Three Favorites from the Knight News Challenge

The Knight News Challenge, in its 5th year, is a very cool effort to fund ideas that mash up aspects of the news industry with new tech.  Looking at the winners from the past four years…I think most are more about inspiration than viability, and that’s just fine.  Here are three that stood out to me…

MediaBugs.org. The idea is to have a mechanism to allow the ‘crowd’ to report errors in reporting.  Cool idea…probably needs a way for a user to trivially report an error while reading from a news site, like nytimes.com, to gain traction.

ushahidi.com.  The idea here is a map-based system for sharing information in a crisis.  Apparently it was used following the Haiti earthquake.  Very cool!!

documentcloud.org. The idea is to facilitate sharing of documents used as sources in news stories.  I could imagine this idea growing into community or algorithmic fact checking.


Getting Really Large Images onto WordPress

I wanted to add zoomable versions of some very large visualizations to the site this evening.  So I uploaded them to GigaPan (actually, some had been uploaded years ago), and embedded them on the site, and everything works great!!  Click on the ‘Data Art’ tab above to see for yourself.  Here are the steps if you’re interested…

  1. GigaPan only accepts images that are 50 megapixels or more, which is really large.  If your image is large, but not that large, download SmillaEnlarger and increase the size a little.
  2. Sign up for a GigaPan account if you don’t already have one.
  3. Download GigaPan uploader.
  4. Install the uploader, and upload the image.  The software was pretty easy to use.  When the upload is finished, you’ll get a url for the image that includes a 5 digit id.  One of mine was 65469, for example.
  5. On the WordPress page or post where you want the image, place the following code (replacing 65469 with your image’s id):
    <iframe src="http://www.gigapan.org/media/gigapans/65469/options/nosnapshots/iframe/flash.html?height=400" frameborder="0" height="400" scrolling="no" width="100%"></iframe>
  6. That’s it!  You’ll get something like this…


First Data Visualization Meetup—Nov 10th

Lately Zhou Yu and I have been working to start a data visualization meetup in the Bay Area…something that surprisingly doesn’t already exist.  Well (finally!) we have scheduled our first meetup, a talk by Stamen CEO Eric Rodenbeck (bio).  We’re still looking for a regular venue for our meetups (ideally, one in SF and one on the Peninsula), but for now, Jeff Heer of Stanford has been good enough to allow us to use a classroom on campus.  I have no doubt this topic is going to attract a fantastic group of people.  Come join us!!


Direct Search

For the past couple of years I’ve been primarily involved with engineering models used in search engines.  At times I’ve run into situations where a model I’m using or developing has some parameters that need to be set.  For example, a model might have a parameter that is a threshold on the number of times a keyword will be counted before we decide that additional occurrences are probably spam (and, yes, I’m talking about BM25 here).  And, at times, either the cost function I would like to use to set the parameters is not differentiable (yeah, I’m thinking about DCG), or I’m perfectly happy to use a quick and dirty method.  So I end up going with a direct search algorithm.  Here’s what I’ve learned (and haven’t forgotten)…
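To make “direct search” concrete, here is a minimal Python sketch of one of the simplest members of the family, compass (coordinate pattern) search. It only ever compares cost values, so the cost function does not need to be differentiable. Everything here is an assumption for illustration: the function name, the step-halving schedule, and the toy cost, which merely stands in for something like a negated DCG computed on a tuning set.

```python
def compass_search(cost, x0, step=1.0, min_step=1e-3, max_iter=1000):
    """Minimize cost(x) by probing +/- step along each coordinate.

    When no probe improves the best cost, the step is halved. The search
    stops once the step drops below min_step or max_iter sweeps have run.
    """
    x = list(x0)
    best = cost(x)
    for _ in range(max_iter):
        if step < min_step:
            break
        improved = False
        for i in range(len(x)):
            for delta in (step, -step):
                trial = list(x)
                trial[i] += delta
                value = cost(trial)
                if value < best:
                    x, best, improved = trial, value, True
        if not improved:
            step /= 2.0
    return x, best


def toy_cost(params):
    """A made-up cost with a kink (non-differentiable at b == -1)."""
    a, b = params
    return (a - 3.0) ** 2 + abs(b + 1.0)


if __name__ == "__main__":
    print(compass_search(toy_cost, [0.0, 0.0]))
```

Each sweep costs two evaluations per parameter, so this is only practical for a handful of parameters, which is usually the situation when hand-tuning a ranking function.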


Beautiful Visualization: The Book

Had the opportunity last fall to contribute a chapter to the recently released book “Beautiful Visualization” by Julie Steele and Noah Iliinsky. So for my chapter I did visualizations of two large datasets. One was of the Netflix Prize, which was an updated version of a visualization I did a couple of years back. And since I was working at AT&T Interactive R&D at the time, the other visualization I did was of the query logs for Yellowpages.com, a local search engine owned by AT&T.

Julie Steele was wonderful to work with as an editor. And O’Reilly is kind enough to allow the chapter authors to release their own chapters in digital form. So if you’re interested, you can download the chapter here.


Guide to Getting Started in Machine Learning

Someone at work recently asked how he should go about studying machine learning on his own. So I’m putting together a little guide. This post will be a living document…I’ll keep adding to it, so please suggest additions and make comments.

Fortunately, there are a ton of great resources that are free and on the web. The very best way to get started that I can think of is to read chapter one of The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2009 edition). The pdf is available online. Or buy the book on Amazon here, if you prefer.

Once you’ve read the first chapter, download R. R is an open-source statistics package/language that’s quite popular. Never heard of it? Check out this post (How Google and Facebook are using R).


20 Useful Visualization Libraries

Well, not entirely limited to libraries.  Useful stuff for visualization practitioners sounded a little non-specific, though.  These are all freely available.


Network Visualization for Systems Biology

This is a quick look at the state of the art of network visualization in systems biology. It’s an interesting topic on its own (and my day job at the moment), and also as it relates to the visualization of other types of networks, such as social networks (think Facebook). Systems biology is all about looking at proteins, pathogens, and more, within the contexts in which they interact. Naturally, then, the visualizations that tend to be particularly useful are those such as network visualizations that can provide macro understanding of the interactions.  Questions such visualizations help with include those of the form “if a drug affects protein X, what else will it affect?”


A Look at FINVIZ.com (Financial Visualizations)


FINVIZ is a suite of free financial tools that takes advantage of modern visualization ideas.  The infoviz and interaction designs are certainly worth a blog post.  Here’s a look at their efforts…


5 Reasons Visualization Is Not More Prevalent

Why does it seem I have to look hard to find good data visualization examples?  Why do few tech companies devote resources to visualization (Google’s the obvious exception)?  Why are there relatively few job postings for visualization, with many of those that do exist requiring mainly graphic design skills and not data visualization skills?  I was thinking about this today and I came up with a few possible reasons, some based on perceptions, and others based on marketplace realities.

Reason #1: People Don’t Know What Data Visualization Is

People don’t know what data visualization is.  Don’t believe me?  Read the Amazon.com reviews for the book Visualizing Data by Ben Fry. They contain negative comments such as “One would expect a book with the title ‘Visualizing Data’ to be crammed with pictures”.  The issue seems to be that too much of the book is devoted to data and the mapping of data properties to visual properties.


10 New York Times Visualizations

NYTimes.com has done a great job of moving beyond the static infographics found in newspapers.  10 favorites below…comment if you know of good ones I’ve missed.  Also, for further reading/viewing, see…

- Playgrounds for Data: Inspiration from NYTimes.com Interactives
- Infovis 2007 slides on Matthew Ericson’s blog…


ETech Presentation on Ensemble Machine Learning


Just wanted to put up my slides from ETech this past week.  The talk is pretty similar to the talk I posted a few months ago, just a bit more fleshed out.
[ppt][pptx][pdf]


See Conference (Information Visualization) to be Streamed Live in April

An information visualization conference, the See Conference, is being held in Wiesbaden, Germany, on April 19th.  Impressive speaker list.  The conference organizers plan to stream the speeches in real time via the conference website.

Pictured: Ben Fry, Zachary Lieberman, Frank van Ham. And comfortable seats!

Ensemble Machine Learning Tutorial

Here are the slides from a 2-part lecture I’m giving on ensemble learning at Indiana University.  It includes a discussion of the Netflix Prize competition, and the use of ensemble techniques in that competition.

[PDF][PPT]


Visualizing Science and Technology in Wikipedia

If you didn’t see our original Wikipedia Activity Visualization, check it out here (there’s a detailed explanation, as well).  Also, there is a Google Maps-style zoomable version here.

This new version uses the same layout and images (well, slightly improved) as the original, but this time we tried to highlight activity in regions of Wikipedia that are predominately math or science or technology.

So we developed a program to classify Wikipedia articles as being one of these three categories (or none), based on the categories the article was assigned to and their positions in the Wikipedia category link network.


Scheme Tutorial

I was asked to give a short (1 hr) tutorial on the Scheme language this week for students in the graduate and undergraduate AI courses at Indiana.  Thought I would post the slides in case anyone wants to adapt them for their own purposes…

PDF version
PPT (Office 2007) version


ICCBR 2007 Highlights

ICCBR07 (International Conference on Case Based Reasoning) is held in alternating years with the ECCBR conference.  The venue was Belfast, a city with a nice blue collar charm to it.  Seemed sort of a European version of my hometown of Green Bay.  Stayed in a Queen’s University dorm room, where I was constantly reminded I am too old to be staying in the dorms.  Should have paid for a stay at the Europa Hotel where the conference was held…classy place.


Google Tech Talk Review: Statistical Aspects of Data Mining

This is a talk series being given at Google by David Mease based on a Master’s level stats course he is teaching this summer at Stanford.  It’s easy listening if you already have some data mining or stats background.

 

The introduction (part 1) is particularly well done, as is the portion on association rule mining (parts 7 and 8).  This is the first half of the course which has already occurred…I’ll add links as new sessions are added to Google video.


Visualizing the ‘Power Struggle’ in Wikipedia

A new visualization Bruce Herr and I recently completed is being featured in this week’s New Scientist Magazine (the article is free online, minus the viz).  They did a good job jazzing up the language used to describe the viz (‘power struggle’, ‘bubbling mass’, ‘blitzed articles’), but they also dumbed down the technical accomplishments.  I guess not everyone gets as excited about algorithms as I do.

Before I talk any more about the viz, though, let me mention it’s appearing at the NetSci 2007 Conference this week, and hopefully a variant will appear at Wikimania later this summer as well.  The viz is a huge 5 feet by 5 feet when printed, and I only include a low res, smaller version here.  At some point high quality art prints of it will appear at SciMaps for sale to fund further visualization research.


Another Visualization of the Netflix Prize Dataset

Here’s a recent visualization I did of the dataset used in the Netflix Prize Competition. The dataset is 17,700 movies and 31 gigs of user ratings. This viz shows similar movies close to one another, with the similarities determined by a formula based on ratings.

I found most interesting a cluster of movies (in blue) that I’d say are generally acclaimed. The cluster contains movies across all genres, such as Schindler’s List, Braveheart, and Super Size Me. Beyond that, there’s a bunch of clusters which are mostly defined by a genre such as music, sports, documentary, IMAX, children’s films, or bonus material. The big blob in the center is mostly what I’d call junk movies.


An Interactive Visualization of the Netflix Prize Dataset

The visualization activated below (click the button) shows all 17,700 movies that are part of the Netflix Prize Competition. The movies are laid out such that similar movies are close to one another. Similarity between two movies is computed based on whether users who like one like the other, or (and, really) those who dislike one dislike the other.  Alternatively, take a look at a colorful, static version.
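One common way to turn “users who like one like the other, and those who dislike one dislike the other” into a number is a Pearson correlation over the users who rated both movies. The Python sketch below shows that idea; it is an assumed stand-in for illustration, not necessarily the formula behind the visualization, and the function name and toy ratings are hypothetical.

```python
from math import sqrt


def movie_similarity(ratings_a, ratings_b):
    """Pearson correlation of two movies over the users who rated both.

    ratings_a and ratings_b map user id -> rating. Returns a value in
    [-1, 1], or 0.0 when there is too little overlap to say anything.
    """
    common = sorted(set(ratings_a) & set(ratings_b))
    if len(common) < 2:
        return 0.0
    a = [ratings_a[u] for u in common]
    b = [ratings_b[u] for u in common]
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / sqrt(var_a * var_b)


# Hypothetical ratings keyed by user id.
movie_x = {1: 5, 2: 1, 3: 3}
movie_y = {1: 4, 2: 2, 3: 3, 4: 5}   # agrees with movie_x on shared users
movie_z = {1: 1, 2: 5, 3: 3}         # disagrees with movie_x

if __name__ == "__main__":
    print(movie_similarity(movie_x, movie_y))
    print(movie_similarity(movie_x, movie_z))
```

Because the ratings are mean-centered, shared dislikes count as agreement just as much as shared likes, which matches the “(and, really)” in the description above.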


GapMinder Talk

Just read an article about Google buying a small company called GapMinder which does data visualization.  I checked out the talk on the GapMinder homepage, and would recommend watching the first 10 minutes of it.  The visualization tool that is used throughout the talk is something special…easy to see Google’s interest.


Installing WordPress on GoDaddy

Setting up WordPress on a GoDaddy hosting account is really not difficult (this blog is an example that it can be done!).  Below are my notes on the process.  If you glance at these steps, and don’t want to mess around with this, consider using one of the following hosting services which come with WordPress pre-installed (fairly rare): An Hosting, Lunarpages, BlueHost, Yahoo.

Steps for installing WordPress on a GoDaddy Hosting Account


A Tutorial on Flash Remoting Using Perl

Flash remoting is a big improvement over forms/CGI for communication between Flash and server.  There’s a great little project called amfphp for using PHP with Flash remoting.  There’s a whole lot less great (but appreciated!) version called amf::perl for Perl and Python. There is little documentation, so I thought I’d post an example.


Tips for Making Perl Programs Run Faster

In my daily work I tend to manipulate fairly large datasets, such as Wikipedia, U.S. Patents, Netflix Ratings, and IMDb.  Here are a few tricks I’ve come across so that you don’t lose time waiting for your programs to finish.


Flash vs. Processing

Over the past year and a half I’ve been hooked on the language Processing. I’ve even contributed an early version of a library for visualizing social network data.


Top 10 Google Tech Talks

All Google Tech Talks are here (Google EngEDU is the actual name of the talk series). Thought I’d compile a top ten list…

  1. Python and Python 3000. Two talks about the Python language given by its inventor Guido van Rossum. The first is about the language’s origins and the second is about its future.

CoCitation vs. Bibliometric Coupling

I recently posted an efficient algorithm for computing the similarity of two Wikipedia pages (or any two nodes in a network) using cocitation similarity. Another type of similarity which may be worth considering is bibliometric coupling (usually called bibliographic coupling), in which two pages are similar if they link to many of the same pages. What is interesting is that it takes only a few minor tweaks to the cocitation algorithm to compute bibliometric coupling. Here’s the bibliometric coupling pseudocode (Perl style):
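As a rough stand-in, here is a minimal Python sketch of the coupling computation (not the Perl-style pseudocode itself). It assumes the link data is a list of (source, target) pairs; the function name and toy data are hypothetical. The trick is to group links by target and count pairs of sources, rather than comparing every pair of pages.

```python
from collections import defaultdict
from itertools import combinations


def coupling_counts(links):
    """Count, for each pair of pages, how many pages both of them link to.

    links is an iterable of (source, target) pairs. Only pairs that share
    at least one target ever appear in the result.
    """
    inlinks = defaultdict(set)
    for source, target in links:
        inlinks[target].add(source)
    counts = defaultdict(int)
    for sources in inlinks.values():
        # Every pair of pages linking to this target shares one more target.
        for a, b in combinations(sorted(sources), 2):
            counts[(a, b)] += 1
    return dict(counts)


# Hypothetical link data: p1 and p2 both link to A and B.
toy_links = [
    ("p1", "A"), ("p1", "B"),
    ("p2", "A"), ("p2", "B"), ("p2", "C"),
    ("p3", "C"),
]

if __name__ == "__main__":
    print(coupling_counts(toy_links))
```

Grouping by source instead of by target, and counting pairs of targets, gives cocitation counts, so the two measures really are a small tweak apart.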


Wikipedia Page Similarities

I’m working on a visualization/map of Wikipedia pages. The map will lay out pages close to one another if they are similar. So, in order to create such a map I need to compute the similarity of any two Wikipedia pages.

For my first attempt at this, I decided to go with a cocitation measure of similarity. So, two Wikipedia pages will be said to be similar if other Wikipedia pages that link to one usually link to the other.

However, the naive way to compute this, looking at every pair of pages, is far too inefficient given that there are 650,000 pages in the English Wikipedia, and 14.5 million pagelinks. So I’ve worked up a much more efficient algorithm. Here’s the pseudocode…I hope someone, somewhere out in cyberspace will find this useful. (It can, in fact, be used to compute co-citation similarities for any data represented as nodes and links.)
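One standard way to avoid the all-pairs comparison is to walk each page’s outgoing links and count pairs of targets, so work is only done for pairs that are actually cocited at least once. The Python sketch below illustrates that approach under stated assumptions: links arrive as (source, target) pairs, the function names are hypothetical, and the cosine-style normalization by in-link counts is one reasonable choice rather than a confirmed detail of the algorithm described here.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def cocitation_counts(links):
    """Count, for each pair of pages, how many pages link to both."""
    outlinks = defaultdict(set)
    for source, target in links:
        outlinks[source].add(target)
    counts = defaultdict(int)
    for targets in outlinks.values():
        # Every pair of pages linked from this source is cocited once more.
        for a, b in combinations(sorted(targets), 2):
            counts[(a, b)] += 1
    return dict(counts)


def cocitation_similarity(links):
    """Normalize cocitation counts by the pages' in-link counts (cosine style)."""
    links = list(links)
    inlinks = defaultdict(set)
    for source, target in links:
        inlinks[target].add(source)
    return {
        (a, b): n / sqrt(len(inlinks[a]) * len(inlinks[b]))
        for (a, b), n in cocitation_counts(links).items()
    }


# Hypothetical link data: A and B are always linked together.
toy_links = [
    ("p1", "A"), ("p1", "B"),
    ("p2", "A"), ("p2", "B"), ("p2", "C"),
    ("p3", "C"),
]

if __name__ == "__main__":
    print(cocitation_similarity(toy_links))
```

The cost is driven by the sum of squared out-degrees rather than the square of the page count, so pages with enormous link lists are worth capping or sampling in practice.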
