Beautiful Visualization: The Book

Had the opportunity last fall to contribute a chapter to the recently released book “Beautiful Visualization” by Julie Steele and Noah Iliinsky. For my chapter I did visualizations of two large datasets. One was of the Netflix Prize dataset, an updated version of a visualization I did a couple of years back. And since I was working at AT&T Interactive R&D at the time, the other was of the query logs for Yellowpages.com, a local search engine owned by AT&T.

Julie Steele was wonderful to work with as an editor. And O’Reilly is kind enough to allow the chapter authors to release their own chapters in digital form. So if you’re interested, you can download the chapter here.

Here’s the Netflix visualization from the chapter. Click it to enlarge.

Movies in the Netflix Prize Dataset

Closeup of Netflix Prize Visualization.

Another closeup of the Netflix Prize visualization.

A Look at FINVIZ.com (Financial Visualizations)

FINVIZ is a suite of free financial tools that takes advantage of modern visualization ideas.  The infoviz and interaction designs are certainly worth a blog post.  Here’s a look at their efforts…

1. Sector Visualization.  This visualization is a treemap implemented using the Google Maps API.   It shows how well sectors and companies (stocks) within those sectors are doing.  The attention to detail is exceptional.  The company name stays the same size on zoom, and is dual encoded using a background image.  The gain/loss is shown using shades of green/red, and is also dual encoded using text.  On mouseover details are provided in a side panel.

2. Stock Charts.  When you create a portfolio of stocks, a number of views of that portfolio become available.  One is a small multiples view, which allows easy comparison without the overlaying one has to do with Google Finance and Yahoo Finance charts.  Again, the attention to detail is wonderful.  The current price is highlighted, the trend lines are nicely colored, and the volume bar chart is part of the background.

3. Trends.  They use Sparklines for trend indicators.  Well, they may just be icons (not encoded by actual data), but I’ll delude myself nonetheless.

4. News. They aggregate the news items for all the stocks in a portfolio onto one page.  Very nicely done.  The day, month, and year are only shown when they change.  Mousing over a price overlays the chart (notice the little icon next to the word price indicating this…attention to detail).

5. Profiles.  Again, just very nicely done, showing all of the profiles on the same page.

6.  Relative Volume Indicator.  A second vertical axis is added.


Any set of images may be loaded into the Google Maps API, which handles the loading and zoom.

I don’t know for certain how they implemented this map…I could be entirely wrong about the Google Maps usage, and it could have been done entirely in Flash, for example.

But we did something similar with our Wikipedia map, using the Google Maps API: we generated the images at multiple levels of zoom and then loaded them into the API. See http://scimaps.org/maps/wikipedia/20080103/
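In case it’s useful, here’s a minimal sketch of that tile-generation step in Python (not the code we actually used), assuming a single large, square source image and the standard Google Maps scheme of 256-pixel tiles laid out in a 2^z-by-2^z grid at zoom level z. The file name wikipedia_map.png and the tiles/z/x/y.png layout are just illustrative.

# Cut one large square image into 256x256 tiles for zoom levels 0..MAX_ZOOM.
import os
from PIL import Image

TILE = 256
MAX_ZOOM = 5                    # assumption: six zoom levels is plenty
SOURCE = "wikipedia_map.png"    # hypothetical full-resolution map image
OUT_DIR = "tiles"

Image.MAX_IMAGE_PIXELS = None   # allow very large source images

source = Image.open(SOURCE)

for z in range(MAX_ZOOM + 1):
    side = TILE * 2 ** z                     # the map is 2^z tiles wide at zoom z
    scaled = source.resize((side, side))     # resample the whole map for this zoom
    for x in range(2 ** z):
        for y in range(2 ** z):
            tile = scaled.crop((x * TILE, y * TILE,
                                (x + 1) * TILE, (y + 1) * TILE))
            path = os.path.join(OUT_DIR, str(z), str(x))
            os.makedirs(path, exist_ok=True)
            tile.save(os.path.join(path, "%d.png" % y))

On the JavaScript side, a custom map type (e.g., google.maps.ImageMapType with a getTileUrl function that maps zoom/x/y to those file paths) is enough for the API to handle the panning and zooming.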

Guide to Getting Started in Machine Learning

Someone at work recently asked how he should go about studying machine learning on his own. So I’m putting together a little guide. This post will be a living document…I’ll keep adding to it, so please suggest additions and make comments.

Fortunately, there’s a ton of great resources that are free and on the web. The very best way to get started that I can think of is to read chapter one of The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2009 edition). The pdf is available online. Or buy the book on Amazon here, if you prefer.

Once you’ve read the first chapter, download R. R is an open-source statistics package/language that’s quite popular. Never heard of it? Check out this post (How Google and Facebook are using R).

Once you’ve installed R, maybe played around a little, then check out this page which describes the major machine learning packages in R. If you’re already familiar with some of the techniques, then dive in and start playing around with them in R. On the other hand, if it looks really complicated, don’t worry about it yet.

Oh, by the way, if you want to start playing around with machine learning in R, you’ll need data. Check out the UCI Machine Learning Repository. They have both real and toy datasets. The iris dataset, for example, is famous for showing up in many research publications.
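If you want to see what a first look at one of those datasets involves before committing to anything, here’s a tiny sketch in Python with pandas, purely as an illustration (in R the iris data is built in and just data(iris) away); it assumes pandas is installed and that the UCI repository still serves the iris file at its long-standing URL.

# Load the UCI iris data and take a quick look (column names chosen for readability).
import pandas as pd

URL = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
COLS = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]

iris = pd.read_csv(URL, header=None, names=COLS)

print(iris.shape)                      # 150 rows, 5 columns
print(iris.groupby("species").mean())  # per-species averages of the four measurements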

I’d suggest next reading more of The Elements of Statistical Learning. It’s an excellent book. Try doing some of the programming exercises using R. If you don’t like this book, there are plenty of others. Bishop’s Pattern Recognition and Machine Learning is a famous one. It can be a little difficult depending on your math background. Tom Mitchell’s Machine Learning is another that’s often used to teach the topic.

5 Reasons Visualization Is Not More Prevalent

Why does it seem I have to look hard to find good data visualization examples?  Why do few tech companies devote resources to visualization (Google’s the obvious exception)?  Why are there relatively few job postings for visualization, with many of those that do exist requiring mainly graphic design skills rather than data visualization skills?  I was thinking about this today and came up with a few possible reasons, some based on perceptions and others on marketplace realities.

Reason #1: People Don’t Know What Data Visualization Is

People don’t know what data visualization is.  Don’t believe me?  Read the Amazon.com reviews for Ben Fry’s book Visualizing Data. They contain negative comments such as “One would expect a book with the title ‘Visualizing Data’ to be crammed with pictures”.  The issue seems to be that too much of the book is devoted to data and to the mapping of data properties to visual properties.

Graphic design is different from data visualization.  Graphic designers are largely free from having to deal with actual data, and from having their product emerge from data.  Graphic design components and data visualization components are often mixed, and with great success.  But they are different.  Art is not visualization.  And visualization is not art…unless it is.

The above visualization (which is, in fact, by Ben Fry) is driven by the properties of two underlying datasets.  One dataset is the DNA of a monkey; the genes (the data) are rendered as very tiny white text.  The second dataset is human DNA, and it is depicted only after the difference between the two datasets has been computed: the genes that differ between the monkey and the human are shown in red.  Fry obviously didn’t choose which areas of the visualization would be red; the data did.  What about the monkey pic?  Even that is a visual representation of a property of the dataset…it tells you whose DNA is being shown in the white text.

Reason #2: Crappy Existing Visualizations have Polluted Perception


The visualization on the left is the interface for the search engine Kartoo.  The visualization on the right is a feature CNET used to have called The Big Picture.  Both attempt to visualize data usually shown as lists (search results, related news articles) as 2D networks.  It’s a nice idea, since pairwise relationships can be visually represented as edges.  But these particular efforts both miss the boat.  They don’t actually increase the amount of information represented very much versus the lists, while greatly increasing the mental load placed on the user trying to extract the basic information.

Reason #3: People are Unable to Mentally Separate the View from the Data

Here’s another Ben Fry work (I was watching a video/talk of his earlier today, which is part of the reason he is so prevalent in this post).  It shows six different visualizations of the same dataset.

Much of the time, data relates to physical objects.  In such cases, people may have trouble with any visual representation of that data that doesn’t include the physical objects themselves.  In other situations, the data has simply always been depicted in a certain way, and that history interferes with any new depiction.

Reason #4: Visualization is Difficult to Create and Easy to Copy


This is somewhat irrelevant, but I have had a Yahoo mail account for about a decade.  There was a good six year stretch where it never changed.  If Gmail hadn’t come along, who knows.

When Google released Google Finance, it marked a number of firsts…the use of AJAX for stock charts (the chart itself is actually Flash), the overlay of events on the chart, and the dual time sliders.  No doubt Google spent much time and effort designing this visualization tool.  How long did it take Yahoo Finance to copy Google Finance’s chart once Google revealed it?  Not long.  Good visualization design is hard.  It’s even harder when its object is to deconstruct very complex data.  Reverse engineering a visualization is easy.

Reason #5: People Won’t Pay for Visualization?

I’m not so sure about this one, but our company’s CTO recently commented to me that he couldn’t think of any successful standalone visualization effort other than Processing.

An application like Google Maps doesn’t count, both because it’s free and, more importantly, because people wouldn’t have access to the underlying data without the visualization.  I can think of a few commercially successful standalone visualizations, such as this one, but surely the list is fairly short.

Haugeland’s AI Views 25 Years Later

A couple of years ago, I picked John Haugeland’s Artificial Intelligence: The Very Idea up off the free book table in the computer science department of Indiana University. Finally read it this weekend.  Published in 1985, the book has a lot to like, but it’s definitely a product of its time–a period when computer and cognitive scientists were obsessing over knowledge representation.  Wanted to call out a few (perhaps arrogant) quotes reflective of its day…

“A different pipedream of the 1950s was machine translation of natural languages.  The idea first gained currency in 1949 (via a ‘memorandum’ circulated by mathematician Warren Weaver) and was vigorously pursued … Weaver actually proposed a statistical solution based on the N nearest words (or nouns) in the immediate context. …  Might a more sophisticated ‘statistical semantics’ (Weaver’s own phrase) carry the day? Not a chance.”

Pipedream…somebody tell Google 🙂  Actually, I had no idea machine translation was worked on in the 1950s.  Cool!  I would mention that the other pipedream of the ’50s he discusses is cybernetics, which, in various forms, is also a very popular area of research today.

“Artificial Intelligence must start by trying to understand knowledge…and then, on that basis, tackle learning.  It may even happen that, once the fundamental structures are worked out, acquisition and adaptation will be comparatively easy to include…it does not appear that learning is the most basic problem, let alone a shortcut or a natural starting point.”

Seems like research that has treated knowledge representation and learning as one problem (neural nets, Bayesian nets, etc.) has been particularly fruitful.

“AI has discovered that knowledge itself is extraordinarily complex and difficult to implement–so much so that even the general structure of a system with common sense is not yet clear.”

And, clearly, the Cyc project solved this problem…

Anyway, the book is still a very interesting read, particularly if you like thinking about the challenges inherent in the domain of knowledge representation.

10 New York Times Visualizations

NYTimes.com has done a great job of moving beyond the static infographics found in newspapers.  10 favorites below…comment if you know of good ones I’ve missed.  Also, for further reading/viewing, see…

– Playgrounds for Data: Inspiration from NYTimes.com Interactives
– Infovis 2007 slides on Matthew Ericson’s blog…

The Times had a great graphic comparing wars, but I can’t seem to find the link now. I think it listed WWI, WWII, Korea, Vietnam, Iraq I, and the current Iraq war. The graphic compared duration, casualties, countries involved. It was really stunning. I wish I could track it down now.


See Conference (Information Visualization) to be Streamed Live in April

An information visualization conference, the See Conference, is being held in Wiesbaden, Germany, on April 19th.  Impressive speaker list.  The conference organizers plan to stream the speeches in real time via the conference website.

Due to this post I attended the conference and wrote my impressions down:
http://informationandvisualization.de/blog/impressions-see-conference3

Wikipedia Page Similarities

I’m working on a visualization, a ‘map’ if you will, of Wikipedia pages. The map will lay out pages close to one another if they are similar. So, in order to create such a map, I need to compute the similarity of any two Wikipedia pages.

For my first attempt at this, I decided to go with a cocitation measure of similarity. So, two Wikipedia pages will be said to be similar if other Wikipedia pages that link to one usually link to the other.

However, the naive way to compute this, looking at every pair of pages, is far too inefficient given that there are 650,000 pages in the English Wikipedia and 14.5 million pagelinks. So I’ve worked up a much more efficient algorithm. Here’s the pseudocode (Perl style)…I hope someone, somewhere out in cyberspace will find it useful. (It can, in fact, be used to compute cocitation similarities for any data represented as nodes and links.)

# %nodes holds the wikipedia pages; %links maps each page to the set of pages
# it links to (both assumed to be populated from the pagelinks data).
my (%nodes, %links);

# flip the pagelinks around: %reverse maps each page to the pages linking to it
my %reverse;
for my $source (keys %links) {
    $reverse{$_}{$source} = 1 for keys %{ $links{$source} };
}

my %cocitations;   # 2d hash for temporarily storing cocitation counts
my %scores;        # 2d hash storing the final similarity scores

for my $node (keys %nodes) {

    # count cocitations for $node (wiki page): every page that links to $node
    # also cocites each of the other pages it links to
    for my $sourceNode (keys %{ $reverse{$node} }) {
        $cocitations{$node}{$_}++ for keys %{ $links{$sourceNode} };
    }

    my $citationCount = keys %{ $reverse{$node} };

    # similarity scores for $node: 2 * cocitations / (sum of in-link counts)
    for my $node2 (keys %{ $cocitations{$node} }) {
        my $citationCount2 = keys %{ $reverse{$node2} };
        $scores{$node}{$node2} =
            2 * $cocitations{$node}{$node2} / ($citationCount + $citationCount2);
    }
}

CoCitation vs. Bibliometric Coupling

I recently posted an efficient algorithm for computing the similarity of two Wikipedia pages (or any two nodes in a network) using cocitation similarity. Another type of similarity worth considering is bibliometric coupling, in which two pages are similar if they link to many of the same pages. What is interesting is that only a few minor tweaks to the cocitation algorithm are needed to compute bibliometric coupling. Here’s the bibliometric coupling pseudocode (Perl style):

# %nodes holds the wikipedia pages; %links maps each page to the set of pages
# it links to (both assumed to be populated from the pagelinks data).
my (%nodes, %links);

# flip the pagelinks around: %reverse maps each page to the pages linking to it
my %reverse;
for my $source (keys %links) {
    $reverse{$_}{$source} = 1 for keys %{ $links{$source} };
}

my %biblioCounts;   # 2d hash for temporarily storing shared-reference counts
my %scores;         # 2d hash storing the final similarity scores

for my $node (keys %nodes) {

    # count shared references for $node (wiki page): every page that $node
    # links to is also linked to by some set of other pages ($node2)
    for my $linkedNode (keys %{ $links{$node} }) {
        $biblioCounts{$node}{$_}++ for keys %{ $reverse{$linkedNode} };
    }

    my $referenceCount = keys %{ $links{$node} };

    # similarity scores for $node: 2 * shared references / (sum of out-link counts)
    for my $node2 (keys %{ $biblioCounts{$node} }) {
        my $referenceCount2 = keys %{ $links{$node2} };
        $scores{$node}{$node2} =
            2 * $biblioCounts{$node}{$node2} / ($referenceCount + $referenceCount2);
    }
}

Ranking Online Backup Services

This post is a bit off topic for this blog.  However, I recently decided I really needed an off-site backup service (er, I lost some files).  And, as usual, I spent way too much time looking around the web for such a service.  Anyways, I thought I’d share my homework.   To give you an idea of my situation, I have about 30GB of data (code, datasets, visualizations) that I need constantly backed up.

Backup Services vs. Online Storage

There are a ton of options.  I just wanted an automatic backup service–one that works flawlessly.  Many services I looked at emphasize that your files can be accessed from anywhere.  These services often refer to themselves as online storage rather than online backup.  The pure backup services usually offer unlimited space and focus on the software for backing up changes on demand or when your computer is idle, and on the software for restoring files.  The tradeoff is usually that files are stored compressed and are only meant to be accessed for somewhat rare restores.  Online storage services, by comparison, limit the storage space, and although they almost always advertise the backup aspect, files usually need to be transferred manually.  The advantage is that files on such services can be accessed over the web.

I’m happy to keep the files I need access to wherever I go on my USB flash drive, so I just need an automatic backup service.

I’ve ranked the services based on my impressions of their offerings and personal experience…

Best Deals

  1. Mozy–I ended up going with Mozy as my backup service, and am very happy so far.  I go to lunch or a meeting, and come back to a message that my files were backed up while my computer was idle.  The software for selecting which files and folders to back up is quite easy to use.  However, Mozy is purely a backup service; all files are stored compressed on their servers.  So if you’re interested in remote access to your files, look elsewhere.  2GB free, or unlimited space for $4.95/month.
  2. Carbonite–Their software for automatic backup is really pretty slick.  I was torn between Carbonite and Mozy.  Like Mozy, this is strictly a backup service.  There is no free version; unlimited space for $49.95/yr.

Runners Up

  1. StreamLoad (MediaMax)–StreamLoad is rebranded across a number of sites, including MediaMax.com and the AMD Media Vault.  Initially, at least on paper, this seemed like the best deal to me.  They have a free 25GB tier (they throttle downloading, though), which may be enough space for many folks.  It both supports automatic backup and provides anywhere access to your files.  It’s $4.95/month for the larger 100GB.  I said it initially looked good; then I downloaded the software, and it kept crashing on both my laptop and desktop.  Unacceptable 🙁  Maybe others will have a better experience.
  2. Files Anywhere–Provides both online access and automatic backup.  $11.95/month for 10GB.
  3. Box.net–Looks nice, but only 5GB for $4.95/month is too small for my needs.
  4. FlipDrive–Does not provide automatic backup.  $4.95/month for 20GB.
  5. Iomega IStorage–5GB for $19.99/month.  Yikes!  Too small a space, and too expensive.
  6. IBackup.com–5GB for $9.95/month, 50GB for $49.95/month.  Too little for too much.
  7. ShareFile.com–Pay by bandwidth.  $23.95/month for 2GB of bandwidth.

Foolish

  1. XDrive–Owned by AOL.  Provides automatic backup and online access.  Strictly a free service, with 5GB of space.  Reviews of its software are mixed, but mainly I decided to stay away because if I ever needed more than 5GB, there would be no way to upgrade.
  2. GoDaddy–I’m a big GoDaddy fan.  This blog runs on a GoDaddy hosting account.  But their online storage service just does not look like a bargain.  Called “Online File Folder”, it’s 2GB for $20/yr.  No automatic backup software.
  3. True Share–Overpriced.  “As little as” $30/month for 3GB.
  4. Amazon S3–A web service, strictly for developers.  Might be a good deal for a small or medium-sized business.
  5. Yahoo Briefcase–A pathetic 25MB of storage.  Might as well just use Gmail for storage.

If you just need online storage and remote access, consider a regular hosting account

  1. An Hosting 
  2. Lunarpages 
  3. BlueHost
  4. Yahoo

Around the Web

  1. TechCrunch: The Online Storage Gang
  2. PC World: Online Storage