20 Useful Visualization Libraries

Well, not entirely limited to libraries.  Useful stuff for visualization practitioners sounded a little non-specific, though.  These are all freely available.

1. Prefuse (Java) & Flare (Flex)

2. SIMILE (AJAX)

3. Processing (Java)

4. GigaPan (Service)

5. Modest Maps (Flash, Python)

6. Google Visualization API (JavaScript)

7. Google Chart API (JavaScript)

8. Google Maps API (JavaScript, Flash)

9. Graphviz (wrappers for a dozen languages, including Java, Perl, and Python)

10. JFree (Java)

11. pChart (PHP)

12. OpenLayers (JavaScript)

13. Anti-Grain (C++)

14. JGraph (Java)

15. Boost Graph Library (C++, Python wrapper)

16. Open Flash Chart (Flash)

17. Ubigraph (Wrappers for Python, Java, C, and more)

18. JUNG (Java)

19. TimeMap (Java)

20. Many Eyes (online service)

Network Visualization for Systems Biology

This is a quick look at the state of the art in network visualization for systems biology. It’s an interesting topic on its own (and my day job at the moment), and also in how it relates to the visualization of other types of networks, such as social networks (think Facebook). Systems biology is all about looking at proteins, pathogens, and more, within the contexts in which they interact. Naturally, then, the visualizations that tend to be most useful are those, such as network visualizations, that provide a macro-level understanding of the interactions. Questions such visualizations help with include those of the form “if a drug affects protein X, what else will it affect?”

The Networks
Quite a bit of interesting complexity is present in these interaction networks (the data). They are often small-world, disassortative (unlike social networks), scale-free, and exhibit modularity. Biologists are usually interested in looking at either larger-scale, cell-level networks or meaningful sub-networks called pathways, which are typically in the range of 50-500 nodes.

Making life interesting, duplicate nodes representing different states are often included.  The edges are directed, and may be hyperedges when multiple nodes necessarily interact together. And, in truth, the edges are often approximations of the actual interactions in the underlying network.  These approximations come from experimental findings published in journals.

This image is part of Roche Applied Science’s “Biochemical Pathways” series of wall charts. The charts are in the style of circuit diagrams, which seems to be the most common 2-D representation of metabolic pathways. This set seems to have been particularly influential. The appeal of this ‘map’ is likely its scale. Viewers can spend a great deal of time exploring. In visualization there is a notion of ‘information density’, meaning the more visual attributes used to convey the data, the more information that may be present in the visualization. This image has a very high information density.

Layout

In general (not just in systems biology), network/graph layout (choosing where to place the nodes and edges) is done with consideration for (A) the topology of the network and (B) the aesthetics. The primary topology concern is to place connected node pairs near one another and unconnected pairs apart. The primary aesthetic concerns are to ensure that nodes do not overlap, edges do not cross, and labels are readable.
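To make the topology concern concrete, here is a minimal sketch of a force-directed layout in Python. It assumes Fruchterman-Reingold style forces (every pair of nodes repels, every edge attracts) and a simple linear cooling schedule; the function name and parameters are made up for illustration, and this is not how any particular systems biology tool computes its layouts.

```python
import math
import random


def force_layout(nodes, edges, pos=None, iterations=200, k=1.0, temp=0.1, seed=0):
    """Toy 2-D force-directed layout. Returns {node: (x, y)}.

    Every pair of nodes repels with force k*k/d and every edge pulls its
    endpoints together with force d*d/k, so k is roughly the preferred
    edge length. `temp` caps how far a node may move per iteration.
    """
    nodes = list(nodes)
    if pos is None:
        # Assumption: random starting positions in the square [-1, 1] x [-1, 1].
        rng = random.Random(seed)
        pos = {n: (rng.uniform(-1, 1), rng.uniform(-1, 1)) for n in nodes}
    pos = {n: [float(pos[n][0]), float(pos[n][1])] for n in nodes}

    for t in range(iterations):
        force = {n: [0.0, 0.0] for n in nodes}

        # Repulsion: pushes unconnected pairs apart and keeps nodes from overlapping.
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                dx = pos[a][0] - pos[b][0]
                dy = pos[a][1] - pos[b][1]
                d = math.hypot(dx, dy) or 1e-9
                f = k * k / d
                force[a][0] += dx / d * f
                force[a][1] += dy / d * f
                force[b][0] -= dx / d * f
                force[b][1] -= dy / d * f

        # Attraction: pulls connected pairs toward one another.
        for a, b in edges:
            dx = pos[a][0] - pos[b][0]
            dy = pos[a][1] - pos[b][1]
            d = math.hypot(dx, dy) or 1e-9
            f = d * d / k
            force[a][0] -= dx / d * f
            force[a][1] -= dy / d * f
            force[b][0] += dx / d * f
            force[b][1] += dy / d * f

        # Move each node along its net force, capped by a cooling limit.
        limit = temp * (1.0 - t / iterations)
        for n in nodes:
            mag = math.hypot(force[n][0], force[n][1])
            if mag > 0:
                scale = min(mag, limit) / mag
                pos[n][0] += force[n][0] * scale
                pos[n][1] += force[n][1] * scale

    return {n: (p[0], p[1]) for n, p in pos.items()}


# Usage: "a" and "b" are connected, "c" is not connected to anything.
layout = force_layout(
    ["a", "b", "c"],
    [("a", "b")],
    pos={"a": (0.0, 0.0), "b": (2.0, 0.0), "c": (1.0, 0.5)},
)
```

In the usage example the connected pair settles at roughly the preferred edge length while the unconnected node drifts away from both. Real tools layer the aesthetic concerns (edge crossings, label placement) on top of something like this.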

However, nodes in systems biology often also have biologically significant locations associated with them (e.g., within a cell, or within the nucleus of a cell). The most common way of handling this location information is to treat the layout in a standard network layout manner, but constrain nodes to a compartment/level designated as the extracellular, membrane, cytoplasm, nucleus, etc. This visualization, created with the Cerebral plugin for Cytoscape, is the best example of this that I know of.

Realism

Most of the network visualization tools for systems biology create very abstract images. However, in high quality publications, such as the journal Nature, the abstract images are often hand rendered to include more realistic imagery. Something I would like to do more of is to look at actual microscope images and behavioral models to try to usefully bridge the gap.

Visual Data Mining

There are many uses of these network visualizations for biologists and others. One is just that they can leave a more lasting impression/memory than simple lists. A major use case, though, is visual data mining, which may take many forms. Followers of Tufte know that contrasts are often the most valuable element of a visualization. This image is a straightforward example. More sophisticated visual data mining might include clustering and classification of those clusters.

Because the Roche wall charts beg to be explored, it is only natural that a tool would be created for doing so. G-Language is an open source shell that supports, among other things, pathway visualization plugins. The Genome Projector is a module for G-Language which uses the Google Maps API to allow exploration and annotation. No doubt, as systems biology network visualization tools reach later versions, more and more will support rich interaction and, perhaps, treat the visualization as a vehicle for collaboration.

Hierarchy and Metanodes

In the networks section above, I mentioned that the networks are often modular. The most obvious modules are organelles. But other modules exist, such as those defined by functionality. As the above examples show, incorporation of the modularity information into the visualization often is done in a manner that makes it even more abstract.

Beautiful Visualization: The Book

Had the opportunity last fall to contribute a chapter to the recently released book “Beautiful Visualization” by Julie Steele and Noah Iliinsky. So for my chapter I did visualizations of two large datasets. One was of the Netflix Prize, which was an updated version of a visualization I did a couple of years back. And since I was working at AT&T Interactive R&D at the time, the other visualization I did was of the query logs for Yellowpages.com, a local search engine owned by AT&T.

Julie Steele was wonderful to work with as an editor. And O’Reilly is kind enough to allow the chapter authors to release their own chapters in digital form. So if you’re interested, you can download the chapter here.

Here’s the Netflix visualization from the chapter. Click it to enlarge.

Movies in the Netflix Prize Dataset

Closeup of Netflix Prize Visualization.

Another closeup of the Netflix Prize visualization.

A Look at FINVIZ.com (Financial Visualizations)

FINVIZ is a suite of free financial tools that takes advantage of modern visualization ideas.  The infoviz and interaction designs are certainly worth a blog post.  Here’s a look at their efforts…

1. Sector Visualization.  This visualization is a treemap implemented using the Google Maps API.   It shows how well sectors and companies (stocks) within those sectors are doing.  The attention to detail is exceptional.  The company name stays the same size on zoom, and is dual encoded using a background image.  The gain/loss is shown using shades of green/red, and is also dual encoded using text.  On mouseover details are provided in a side panel.

2. Stock Charts. When you create a portfolio of stocks, a number of views of that portfolio are available. One is a small multiples view, which allows easy comparison without the overlay that Google Finance and Yahoo Finance charts require. Again, attention to detail is wonderful. The current price is highlighted, the trend lines are nicely colored, and the volume bar chart is part of the background.

3. Trends.  They use Sparklines for trend indicators.  Well, they may just be icons (not encoded by actual data), but I’ll delude myself nonetheless.

4. News. They aggregate the news items for all the stocks in a portfolio onto one page. Very nicely done. The day, month, and year are shown only when they change. A chart is overlaid when you mouse over the price (notice the little icon next to the word price that indicates this…attention to detail).

5. Profiles.  Again, just very nicely done, showing all of the profiles on the same page.

6.  Relative Volume Indicator.  A second vertical axis is added.

Any set of images may be loaded into the Google Maps API, which handles loading and zoom.

I don’t know for certain how they implemented this map…I could be entirely wrong about the Google Maps usage, and it could have been done entirely in Flash, for example.

But we did something similar with our Wikipedia map using the Google Maps API by generating the images at multiple levels of zoom and then loading them into the API. See http://scimaps.org/maps/wikipedia/20080103/
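The multiple-zoom-level idea can be sketched in a few lines of Python. This assumes the usual Google Maps tiling convention (256-pixel square tiles, with 2^zoom tiles along each side of the map); the `zoom/x/y.png` naming scheme is hypothetical, and this is an illustration rather than the code behind either map.

```python
TILE_SIZE = 256  # pixels per tile edge under the standard tiling convention


def tiles_per_side(zoom):
    """Number of tiles along one edge of the map at a zoom level."""
    return 2 ** zoom


def image_size_at_zoom(zoom):
    """Edge length, in pixels, of the full image rendered for a zoom level."""
    return TILE_SIZE * tiles_per_side(zoom)


def tile_for_point(u, v, zoom):
    """Tile (x, y) containing a point given in normalized image coordinates.

    u and v run from 0.0 (left/top) to 1.0 (right/bottom).
    """
    n = tiles_per_side(zoom)
    x = min(int(u * n), n - 1)
    y = min(int(v * n), n - 1)
    return x, y


def tile_names(zoom):
    """Hypothetical file names for every tile at a zoom level, row by row."""
    n = tiles_per_side(zoom)
    return [f"{zoom}/{x}/{y}.png" for y in range(n) for x in range(n)]
```

Each extra zoom level quadruples the number of tiles, which is why the images are generated ahead of time and only the visible tiles are loaded.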

5 Reasons Visualization Is Not More Prevalent

Why does it seem I have to look hard to find good data visualization examples? Why do few tech companies devote resources to visualization (Google’s the obvious exception)? Why are there relatively few job postings for visualization, with many of those that do exist requiring mainly graphic design skills rather than data visualization skills? I was thinking about this today and I came up with a few possible reasons, some based on perceptions, and others based on marketplace realities.

Reason #1: People Don’t Know What Data Visualization Is

People don’t know what data visualization is. Don’t believe me? Read the Amazon.com reviews for the book Visualizing Data by Ben Fry. They contain negative comments such as “One would expect a book with the title ‘Visualizing Data’ to be crammed with pictures”. The issue seems to be that too much of the book is devoted to data and the mapping of data properties to visual properties.

Graphic design is different from data visualization. Graphic designers are largely free from having to deal with actual data, and from having their product emerge from data. Graphic design components and data visualization components are often mixed, and with great success. But they are different. Art is not visualization. And visualization is not art…unless it is.

The above visualization (which is, in fact, by Ben Fry) is driven by the properties of two underlying datasets.  One dataset is the DNA of a monkey.  The genes (the data) are represented as very tiny white text.  A second dataset used is human DNA. It is only depicted after the difference of the two datasets has been computed.  Then the genes that are different between the monkey and human are represented in red.  Fry obviously didn’t choose which areas of the visualization would be red, the data did.  What about the monkey pic?  Even that is a visual representation of a property of the dataset…the type of the DNA dataset shown in white text.
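The “the data chose the red” point can be shown with a tiny Python sketch. It assumes two already-aligned sequences of equal length and made-up color names; it is only an illustration of mapping a data property to a visual property, not Fry’s actual method.

```python
def color_by_difference(reference, other, same="white", different="red"):
    """Pair each symbol of `reference` with a color chosen by the data.

    Assumption: both sequences are already aligned and the same length.
    """
    if len(reference) != len(other):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (symbol, same if symbol == counterpart else different)
        for symbol, counterpart in zip(reference, other)
    ]


# Usage with two made-up sequences that differ at one position.
colored = color_by_difference("GATTACA", "GACTACA")
```

Nothing in the drawing code decides where the red goes; change the input and the red moves with it.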

Reason #2: Crappy Existing Visualizations have Polluted Perception

The visualization on the left is the interface for the search engine Kartoo. The visualization on the right is a feature CNET used to have called The Big Picture. Both attempt to visualize data usually shown as lists (search results, related news articles) as 2D networks. It’s a nice idea, as pairwise relationship properties can be visually represented as edges. But these particular efforts both miss the boat. They don’t actually increase the amount of information represented by very much vs lists, while greatly increasing the mental load placed on the user trying to extract the basic information.

Reason #3: People are Unable to Mentally Separate the View from the Data

Here’s another Ben Fry work (I was watching a video/talk of his earlier today, which is part of the reason he is so prevalent in this post).  It shows six different visualizations of the same dataset.

Many times data relates to physical objects. In such cases people may have trouble with any visual representation of the data that does not include those physical objects. In other cases, the data has simply always been depicted in a certain way, and that habit interferes with any new depiction.

Reason #4: Visualization is Difficult to Create and Easy to Copy

This is somewhat irrelevant, but I have had a Yahoo mail account for about a decade.  There was a good six year stretch where it never changed.  If Gmail hadn’t come along, who knows.

When Google released Google Finance, it marked a number of firsts…the use of AJAX for stock charts (the chart itself is actually Flash), the overlay of events on the chart, and the dual time sliders.  No doubt Google spent much time and effort designing this visualization tool.  How long did it take Yahoo Finance to copy Google Finance’s chart once Google revealed it?  Not long.  Good visualization design is hard.  It’s even harder when its object is to deconstruct very complex data.  Reverse engineering a visualization is easy.

Reason #5: People Won’t Pay for Visualization?

I’m not so sure about this one, but our company’s CTO recently commented to me that he couldn’t think of any successful standalone visualization effort other than Processing.

Applications such as Google Maps don’t count, both because it’s free and, more importantly, because people wouldn’t have access to the underlying data without the visualization. I can think of a few commercially successful standalone visualizations such as this one, but surely the list is fairly short.

10 New York Times Visualizations

NYTimes.com has done a great job of moving beyond the static infographics found in newspapers.  10 favorites below…comment if you know of good ones I’ve missed.  Also, for further reading/viewing, see…

– Playgrounds for Data: Inspiration from NYTimes.com Interactives
– Infovis 2007 slides on Matthew Ericson’s blog…

The Times had a great graphic comparing wars, but I can’t seem to find the link now. I think it listed WWI, WWII, Korea, Vietnam, Iraq I, and the current Iraq war. The graphic compared duration, casualties, countries involved. It was really stunning. I wish I could track it down now.

See Conference (Information Visualization) to be Streamed Live in April

An information visualization conference, the See Conference, is being held in Wiesbaden, Germany, on April 19th.  Impressive speaker list.  The conference organizers plan to stream the speeches in real time via the conference website.

Due to this post I attended the conference and wrote my impressions down:
http://informationandvisualization.de/blog/impressions-see-conference3

Visualizing the ‘Power Struggle’ in Wikipedia

A new visualization Bruce Herr and I recently completed is being featured in this week’s New Scientist Magazine (the article is free online, minus the viz).  They did a good job jazzing up the language used to describe the viz–’power struggle’, ‘bubbling mass’, ‘blitzed articles’–but they also dumbed down the technical accomplishments.  I guess not everyone gets as excited about algorithms as I do.

Before I talk anymore about the viz, though, let me mention it’s appearing at the NetSci 2007 Conference this week, and hopefully a variant will appear at Wikimania later this summer as well. The viz is a huge 5 feet by 5 feet when printed, and I only include a low res, smaller version here. At some point high quality art prints of it will appear at SciMaps for sale to fund further visualization research.

Now for the good stuff. Much like my visualization of the Netflix Prize competition data, we began this piece by representing the data as a network. In this case the nodes in the network are Wikipedia articles and the edges are the links between articles. We then (with some help from our friends at Sandia) used an algorithm to lay out all 650,000 nodes (Wikipedia articles) that had at least one link in such a way that similar articles are near one another. These are the yellow dots, which when viewed at low res give a yellow tint to the whole picture.

The sizes of the nodes (circles, dots, whatever you want to call them) are based on a model of revision activity. So large circles indicate that an article might be controversial, or the subject of lots of vandalism, or just a topic whose content frequently changes. We labeled only the largest nodes, to keep it readable. There is an interactive version of this in the works based on the Google Maps platform which will change the labels and pictures used as the user ‘zooms’ in or out. Stay tuned for that.

The image used for each tile was selected automatically, simply by using the first image in the most linked to article among all the articles in that tile.  We were pleasantly surprised by the quality of the images that appeared.
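The selection rule is simple enough to sketch in Python. The record fields (`x`, `y`, `inlinks`, `images`), the square-grid tiling, and the choice to return nothing when the top article has no images are all assumptions made for illustration; this is not the code we actually ran.

```python
from collections import defaultdict


def group_by_tile(articles, tile_size):
    """Bucket laid-out articles into square tiles keyed by (column, row)."""
    tiles = defaultdict(list)
    for article in articles:
        key = (int(article["x"] // tile_size), int(article["y"] // tile_size))
        tiles[key].append(article)
    return dict(tiles)


def pick_tile_image(articles):
    """First image of the most linked-to article, or None if there is none."""
    if not articles:
        return None
    most_linked = max(articles, key=lambda article: article["inlinks"])
    images = most_linked["images"]
    return images[0] if images else None


# Usage with made-up records.
sample = [
    {"title": "A", "x": 10, "y": 10, "inlinks": 5, "images": ["a1.jpg", "a2.jpg"]},
    {"title": "B", "x": 20, "y": 30, "inlinks": 90, "images": ["b1.jpg"]},
    {"title": "C", "x": 150, "y": 10, "inlinks": 7, "images": []},
]
chosen = {
    tile: pick_tile_image(members)
    for tile, members in group_by_tile(sample, 100).items()
}
```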

Our hope for this visualization approach, which we continue to improve on, is that it could be updated in real time to give a macro sense of what is happening in Wikipedia.  I personally hope that some variation of it will end up in high schools as a teaching tool and for generating discussions.

Top 20 Most Hotly Revised Articles

  • Jesus
  • Adolf Hitler
  • October 2003
  • Nintendo Revolution
  • Hurricane Katrina
  • India
  • RuneScape
  • Anarchism
  • Britney Spears
  • PlayStation 3
  • Saddam Hussein
  • Japan
  • Albert Einstein
  • 2004 Indian Ocean Earthquake
  • New York City
  • Germany
  • Muhammad
  • Pope Benedict XVI
  • Ronald Reagan
  • Hinduism

Google Tech Talk Review: Statistical Aspects of Data Mining

This is a talk series being given at Google by David Mease based on a Master’s level stats course he is teaching this summer at Stanford. It’s easy listening if you already have some data mining or stats background.

The introduction (part 1) is particularly well done, as is the portion on association rule mining (parts 7 and 8).  This is the first half of the course which has already occurred…I’ll add links as new sessions are added to Google video.

Part 1: Introduction. Discussion of locations of potentially useful data (grocery checkout, apartment door card, elevator card, laptop login, traffic sensors, cell phone, google badge, etc).  Note mild obsession with consent.  Overview of predicting future vs describing patterns, and other broad areas of data mining.  Intro to R.

Part 2: Data. Reading datasets into Excel and R. Observational (data mining) vs Experimental. Qualitative vs quantitative. Nominal vs ordinal. And so on…

Part 3: Data cont. More Excel and R.  Sampling.

Part 4:  Plots. Histograms, ECDF.

Part 5:  More R plots.  Overlaying multiple plots. Statistical significance.  Labels in plots.

Part 6:  More R plots.  Box plots.  Color in plots.  Installing packages.  ACCENT principles and Tufte.

Part 7: Association Rules. Measures of location. Measures of spread.  Measures of association.  Frequent itemsets.  Similar to conditional probabilities.

Part 8: More association rule mining.  Support and confidence calculations. Personalization using rules. Beyond support and confidence.

Part 9: Review

Part 10: Classification.  Overview.  A negative view of decision trees.  DTs in R.  Algos for generating DTs.

Part 11: More DTs.  Gini index.  Entropy. Pruning. Precision, recall, f-measure, and ROC curve.

Part 12: Nearest Neighbor. KNN.  Support Vector Machines. Adding ’slack’ variables, using basis functions to make the space linearly separable. Some comments on Stats vs ML. Intro to ensemble (uncorrelated) classifiers.

Part 13: Last class.  Random Forests.  AdaBoost.  Some discussion of limits of classifiers (nondeterministic observational datasets).  Clustering.  K-Means.
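For readers who haven’t seen the support and confidence calculations from Parts 7 and 8, here is a minimal Python sketch using the standard definitions: support is the fraction of transactions containing an itemset, and confidence of A → B is the share of transactions containing A that also contain B. The function names and the grocery baskets are made up for illustration; the course itself works in R and Excel.

```python
def count_containing(transactions, itemset):
    """Number of transactions that contain every item in `itemset`."""
    items = set(itemset)
    return sum(1 for transaction in transactions if items <= set(transaction))


def support(transactions, itemset):
    """Fraction of all transactions that contain the itemset."""
    return count_containing(transactions, itemset) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that
    also contain the consequent."""
    base = count_containing(transactions, antecedent)
    if base == 0:
        raise ValueError("antecedent never occurs in the transactions")
    both = count_containing(transactions, set(antecedent) | set(consequent))
    return both / base


# Usage with five made-up grocery baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]
bread_support = support(baskets, {"bread"})                 # 4 of 5 baskets
bread_to_milk = confidence(baskets, {"bread"}, {"milk"})    # 3 of the 4 bread baskets
```

The conditional-probability flavor mentioned in Part 7 shows up directly: confidence is just the relative frequency of the consequent among the baskets that contain the antecedent.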

35 Great Visualizations

Geographical & Historical

WorldProcessor. Globes overlaid with information. Beautiful…must see!
Wikisky Google maps for the stars.
Flight Patterns Visualizations of FAA data.
TextArc: History of Science Beautiful.
2007 Calendar. Brad Paley design.
31 days in Iraq. Visualization of deaths in Iraq. Depressing.
Tracing the Visitor’s Eye Flickr tags on a geospatial basemap.
Schreiner International Cables Map. Old world map of cables.
Napoleon’s March. Made famous by Edward Tufte.

Government

The State of the Union in Words. From the NYTimes.
Death and Taxes. Where taxes go to.
U.S. Frequency Allocations Chart. Radio frequency allocations.
2006 Election Results. Interesting maps.
Taxonomy Visualization of Patent Data. Patent viz.

Internet

IRC Arcs IRC viz.
Welkin RDF visualizer.
History Flow Visualization of the Wikipedia. Wikipedia viz.
Who owns the Internet? Map of the net backbones.
Amaznode. Amazon product visualization.
Java Technology Concept Map. Relationships among Java technologies.
The Dumpster. Romance and Blogs.

Pop Culture

Radio Protector Music recommendation visualizer.
Movie and Actors. Visualization of imdb data.

Genealogy

rhnav Genealogy visualization.
20 Generations Family Tree. One particular family tree viz.
Swarm. Networks of websites.

Science

Tree of Life. Species visualization.
Map of Science. From the journal Nature.

Literature

TextArc Visualizations of books, other writings
Visual Poetry. Visualizations of poems.
The Voice. One of the better tag clouds I’ve seen.

Other

Thinking Machine 4. A chess match viz.
Ph.D. Thesis Map. Dissertation outline in the style of a subway map.
Newspaper Map. Very unusual.
Visual Elements Periodic Table.