KDD 2011: Recap

Often I write some kind of summary when I get back from a conference, but today I ran into Justin Donaldson, a fellow Indiana alum, who said he still occasionally makes use of the content, or at least the visualizations, on my blog. Anyhow, I felt motivated to write my notes up as a blog post.

Overall, many of the talks I attended were using latent factor models and stochastic gradient descent, with a few mentions of gradient boosted trees. Good stuff. My favorite talks…

  • Charles Elkan’s “A Log-Linear Model with Latent Features for Dyadic Prediction”. A scalable method for collaborative filtering using both latent features and “side information” (user/item content features). Can’t wait to try it out! Here’s a link.
  • Peter Norvig’s Keynote about data mining at Google. Some tidbits:
    • Google mostly uses unsupervised or semi-supervised learning, both because of the cost of labeling and because labels themselves can be an impediment to higher accuracy.
    • He had this great graph of the accuracy of several algorithms for the word sense disambiguation task plotted against the amount of data used in training. The best performing algorithm always underperformed the worst algorithm when it was given an order of magnitude more data. A great argument for simple learning algorithms at very large scale.
    • They are very interested in transfer learning.
  • KDD Cup. Topic was music recommendation using Yahoo’s data.
    • Many of the same ideas and observations as in the Netflix Prize. Neighbor models and latent factor models trained with stochastic gradient descent seemed pervasive (see the sketch after this list).
    • Ensembles were necessary to win, but the accuracy improvement wasn’t huge over the best individual models. Quite the argument for not using ensembles in industry.
    • Yahoo organizers made some really interesting comments about the data. Among them that the mean rating is quite different for power users, which makes sense. And the data is acquired from different UI mechanisms, if I understood correctly, which impacts the distributions.
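Since latent factor models trained with SGD came up in talk after talk, here is a minimal sketch of the basic recipe in Python.  The hyperparameters and the tiny rating set are made up for illustration; this is the generic technique, not any particular KDD Cup entry.

    # Minimal latent-factor model trained with stochastic gradient descent.
    # Toy data and made-up hyperparameters; illustrative only.
    import numpy as np

    def train_latent_factors(ratings, n_users, n_items, n_factors=20,
                             lr=0.005, reg=0.02, n_epochs=20, seed=0):
        """ratings: iterable of (user_id, item_id, rating) triples."""
        rng = np.random.default_rng(seed)
        P = rng.normal(scale=0.1, size=(n_users, n_factors))  # user factors
        Q = rng.normal(scale=0.1, size=(n_items, n_factors))  # item factors
        for _ in range(n_epochs):
            for u, i, r in ratings:
                pu, qi = P[u].copy(), Q[i].copy()
                err = r - pu @ qi                     # prediction error
                P[u] += lr * (err * qi - reg * pu)    # SGD step on user factors
                Q[i] += lr * (err * pu - reg * qi)    # SGD step on item factors
        return P, Q

    # Tiny example: 3 users, 3 items
    toy = [(0, 0, 5.0), (0, 1, 3.0), (1, 1, 4.0), (2, 2, 1.0)]
    P, Q = train_latent_factors(toy, n_users=3, n_items=3)
    print(P @ Q.T)  # approximate rating matrix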

Looking forward to tomorrow!

Three Favorites from the Knight News Challenge

The Knight News Challenge, in its 5th year, is a very cool effort to fund ideas that mash up aspects of the news industry with new tech.  Looking at the winners from the past four years…I think most are more about inspiration than viability, and that’s just fine.  Here are three that stood out to me…

MediaBugs.org. The idea is to have a mechanism to allow the ‘crowd’ to report errors in reporting.  Cool idea…probably needs a way for a user to trivially report an error while reading from a news site, like nytimes.com, to gain traction.

ushahidi.com.  The idea here is a map-based system for sharing information in a crisis.  It was apparently used following the Haiti earthquake.  Very cool!!

documentcloud.org. The idea is to facilitate sharing of the documents used as sources in news stories.  I could imagine this growing into community or algorithmic fact checking.

Getting Really Large Images onto WordPress

I wanted to add zoomable versions of some very large visualizations to the site this evening.  So I uploaded them to GigaPan (actually, some had been uploaded years ago), embedded them on the site, and everything works great!!  Click on the ‘Data Art’ tab above to see for yourself.  Here are the steps if you’re interested…

  1. GigaPan only accepts images that are 50 megapixels or more, which is really large.  If your image is large, but not that large, download SmillaEnlarger and increase the size a little (or see the Python sketch after these steps for a scripted alternative).
  2. Sign up for a GigaPan account if you don’t already have one.
  3. Download GigaPan uploader.
  4. Install the uploader, and upload the image.  The software was pretty easy to use.  When the upload is finished, you’ll get a url for the image that includes a 5 digit id.  One of mine was 65469, for example.
  5. On the wordpress page or post where you want the image, place the following code (replacing 65469 with your image’s id):
    <iframe src="http://www.gigapan.org/media/gigapans/65469/options/nosnapshots/iframe/flash.html?height=400" frameborder="0" height="400" scrolling="no" width="100%"></iframe>
  6. That’s it!  You’ll get something like this…
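One aside on step 1: if you’d rather script the resize than use SmillaEnlarger, something like the following rough Python/Pillow sketch works.  The filename is a placeholder, and a dedicated enlarger will generally produce nicer results.

    # Upscale an image so it clears GigaPan's ~50 megapixel minimum.
    # Quick-and-dirty Lanczos resize; the filename below is a placeholder.
    import math
    from PIL import Image

    TARGET_MEGAPIXELS = 50
    Image.MAX_IMAGE_PIXELS = None  # let Pillow open very large images

    img = Image.open("big_visualization.png")
    current_mp = (img.width * img.height) / 1e6
    if current_mp < TARGET_MEGAPIXELS:
        scale = math.sqrt(TARGET_MEGAPIXELS / current_mp) * 1.01  # small safety margin
        img = img.resize((int(img.width * scale), int(img.height * scale)),
                         Image.LANCZOS)
    img.save("big_visualization_upscaled.png")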

Network Visualization for Systems Biology

This is a quick look at the state-of-the-art of network visualization in systems biology. It’s an interesting topic on its own (and my day job at the moment), and also as it relates to the visualization of other types of networks, such as social networks (think Facebook). Systems biology is all about looking at proteins, pathogens, and more, within the contexts in which they interact. Naturally, then, the visualizations that tend to be particularly useful are those such as network visualizations that can provide macro understanding of the interactions.  Questions such visualizations help with include those of the form “if a drug affects protein X, what else will it affect?”

The Networks
Quite a bit of interesting complexity is present in these interaction networks (the data).  They are often small-world, disassortative (unlike social networks, which tend to be assortative), scale-free, and modular.  Biologists are usually interested either in larger scale, cell-level networks, or in meaningful sub-networks called pathways, which typically are in the range of 50-500 nodes.
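As an aside, if you have such a network handy as an edge list, these properties are easy to poke at with networkx in Python.  The little hub-and-spoke graph below is invented purely for illustration.

    # Toy check of the network properties mentioned above, using networkx.
    # The hub-and-spoke graph here is made up for illustration.
    import networkx as nx

    G = nx.Graph()
    G.add_edges_from([("hub", p) for p in ("p1", "p2", "p3", "p4", "p5")])
    G.add_edges_from([("p1", "p2"), ("p4", "p5")])

    print("degree assortativity:", nx.degree_assortativity_coefficient(G))  # negative => disassortative
    print("average clustering:  ", nx.average_clustering(G))                # hints at modularity
    print("avg shortest path:   ", nx.average_shortest_path_length(G))      # small if small-world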

Making life interesting, duplicate nodes representing different states are often included.  The edges are directed, and may be hyperedges when multiple nodes necessarily interact together. And, in truth, the edges are often approximations of the actual interactions in the underlying network.  These approximations come from experimental findings published in journals.

This image is part of Roche Applied Science’s “Biochemical Pathways” series of wall charts.  The charts are in the style of circuit diagrams, which seems to be the most common 2-D representation of metabolic pathways.  This set seems to have been particularly influential.  The appeal of this ‘map’ is likely its scale.  Viewers can spend a great deal of time exploring.  In visualization there is a notion of ‘information density’: the more visual attributes used to convey the data, the more information the visualization can carry.  This image has a very high information density.

Layout

In general (not just in systems biology), network/graph layout (choosing where to place the nodes and edges) is done with consideration for (A) the network topology and (B) aesthetics.  The primary topology concern is to place connected node pairs near one another and unconnected pairs apart.  The primary aesthetic concerns are to ensure that nodes do not overlap, edges do not cross, and labels are readable.
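The workhorse for the topology side is force-directed layout.  Here is a minimal sketch using networkx and matplotlib, with a built-in toy graph standing in for a pathway; label placement and other aesthetics are ignored.

    # Minimal force-directed layout: connected nodes are pulled together,
    # unconnected ones pushed apart. The karate club graph stands in for a pathway.
    import matplotlib.pyplot as plt
    import networkx as nx

    G = nx.karate_club_graph()
    pos = nx.spring_layout(G, seed=42)   # Fruchterman-Reingold force-directed layout
    nx.draw_networkx(G, pos, node_size=120, font_size=6)
    plt.axis("off")
    plt.show()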

However, nodes in systems biology often also have biologically significant locations associated with them (e.g., within a cell, or within the nucleus of a cell).  The most common way of handling this location information is to lay the network out in the standard manner, but constrain each node to a compartment/level designated as extracellular, membrane, cytoplasm, nucleus, etc.  This visualization, created with the Cerebral plugin for Cytoscape, is the best example of this that I know of.
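A crude way to approximate that constraint (my own toy version, not how Cerebral actually does it) is to run a free layout for the horizontal spread and then pin each node’s vertical coordinate to a band for its compartment.  The node names and compartment assignments below are invented.

    # Toy compartment-constrained layout: x comes from a spring layout,
    # y is pinned to a band per compartment. Names are invented for illustration.
    import networkx as nx

    COMPARTMENT_Y = {"extracellular": 3.0, "membrane": 2.0,
                     "cytoplasm": 1.0, "nucleus": 0.0}

    def compartment_layout(G, compartment):
        """compartment: dict mapping node -> compartment name."""
        free = nx.spring_layout(G, seed=1)  # free layout for horizontal spread
        return {n: (free[n][0], COMPARTMENT_Y[compartment[n]]) for n in G}

    G = nx.Graph([("ligand", "receptor"), ("receptor", "kinase"), ("kinase", "tf")])
    comp = {"ligand": "extracellular", "receptor": "membrane",
            "kinase": "cytoplasm", "tf": "nucleus"}
    print(compartment_layout(G, comp))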

Realism

Most of the network visualization tools for systems biology create very abstract images.  However, in high-quality publications, such as the journal Nature, the abstract images are often hand rendered to include more realistic imagery.  Something I would like to do more of is look at actual microscope images and behavioral models to try to usefully bridge the gap.

Visual Data Mining

There are many uses of these network visualizations for biologists and others.  One is just that they can leave a more lasting impression/memory than simple lists.  A major use case, though, is visual data mining, which may take many forms.  Followers of Tufte know that contrasts are often the most valuable element of a visualization.  This image is a straightforward example.  More sophisticated visual data mining might include clustering and classification of those clusters.

Because the Roche wall charts beg to be explored, it is only natural that a tool would be created for doing so.  G-Language is an open source shell that supports, among other things, pathway visualization plugins.  The Genome Projector is a module for G-Language which uses the Google Maps API to allow exploration and annotation.  No doubt, as systems biology network visualization tools reach later versions, more and more will support rich interaction and, perhaps, treat the visualization as a vehicle for collaboration.

Hierarchy and Metanodes

In the networks section above, I mentioned that the networks are often modular.  The most obvious modules are organelles, but other modules exist, such as those defined by function.  Incorporating this modularity information into the visualization, typically by collapsing each module into a single ‘metanode’, often makes the picture even more abstract.
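To make the metanode idea concrete, here is a small sketch that detects modules with a greedy modularity heuristic (a stand-in for biologically meaningful modules) and collapses each one into a single metanode.  Toy data; not tied to any particular tool mentioned above.

    # Detect modules, then collapse each module into a metanode via a quotient graph.
    import networkx as nx
    from networkx.algorithms.community import greedy_modularity_communities

    G = nx.karate_club_graph()
    modules = [set(m) for m in greedy_modularity_communities(G)]

    meta = nx.quotient_graph(G, modules)  # one metanode per module
    print("original nodes:", G.number_of_nodes(), "metanodes:", meta.number_of_nodes())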

Haugeland’s AI Views 25 Years Later

A couple of years ago, I picked John Haugeland’s Artificial Intelligence: The Very Idea up off the free book table in the computer science department of Indiana University. I finally read it this weekend.  Published in 1985, there’s a lot to like about the book, but it’s definitely a product of its time, a period when computer and cognitive scientists were obsessing over knowledge representation.  I wanted to call out a few (perhaps arrogant) quotes reflective of its day…

“A different pipedream of the 1950s was machine translation of natural languages.  The idea first gained currency in 1949 (via a ‘memorandum’ circulated by mathematician Warren Weaver) and was vigorously pursued … Weaver actually proposed a statistical solution based on the N nearest words (or nouns) in the immediate context. …  Might a more sophisticated ‘statistical semantics’ (Weaver’s own phrase) carry the day? Not a chance.”

Pipedream…somebody tell Google 🙂  Actually, I had no idea machine translation was worked on in the 1950s.  Cool!  I would mention that the other pipedream of the ’50s he discusses is cybernetics, which, in various forms, is also a very popular area of research today.

“Artificial Intelligence must start by trying to understand knowledge…and then, on that basis, tackle learning.  It may even happen that, once the fundamental structures are worked out, acquisition and adaptation will be comparatively easy to include…it does not appear that learning is the most basic problem, let alone a shortcut or a natural starting point.”

Seems like research that has treated knowledge representation and learning as one problem (neural nets, Bayesian nets, etc.) has been particularly fruitful.

“AI has discovered that knowledge itself is extraordinarily complex and difficult to implement–so much so that even the general structure of a system with common sense is not yet clear.”

And, clearly, the Cyc project solved this problem.

Anyway, the book is still a very interesting read, particularly if you like thinking about the challenges inherent in the domain of knowledge representation.