Sep
8
20 Useful Visualization Libraries
September 8, 2008 | 13 Comments
Well, not entirely limited to libraries. Useful stuff for visualization practitioners sounded a little non-specific, though. These are all freely available.
May
29
Network Visualization for Systems Biology
May 29, 2008 | Leave a Comment
This is a quick look at the state of the art in network visualization for systems biology. It’s an interesting topic on its own (and my day job at the moment), and also as it relates to the visualization of other types of networks, such as social networks (think Facebook). Systems biology is all about looking at proteins, pathogens, and more, within the contexts in which they interact. Naturally, then, the most useful visualizations tend to be those, such as network visualizations, that give a macro-level understanding of the interactions. Questions such visualizations help with include those of the form “if a drug affects protein X, what else will it affect?”
The Networks
Quite a bit of interesting complexity is present in these interaction networks (the data). They are often small-world, disassortative (unlike social networks), scale-free, and exhibit modularity. Biologists are usually interested in looking at either larger-scale, cell-level networks or meaningful sub-networks called pathways, which typically are in the range of 50-500 nodes.
Making life interesting, duplicate nodes representing different states are often included. The edges are directed, and may be hyperedges when multiple nodes necessarily interact together. And, in truth, the edges are often approximations of the actual interactions in the underlying network. These approximations come from experimental findings published in journals.
May
12
A Look at FINVIZ.com (Financial Visualizations)
May 12, 2008 | 2 Comments
FINVIZ is a suite of free financial tools that takes advantage of modern visualization ideas. The infoviz and interaction designs are certainly worth a blog post. Here’s a look at their efforts…
1. Sector Visualization. This visualization is a treemap implemented using the Google Maps API. It shows how well sectors and companies (stocks) within those sectors are doing. The attention to detail is exceptional. The company name stays the same size on zoom, and is dual encoded using a background image. The gain/loss is shown using shades of green/red, and is also dual encoded using text. On mouseover details are provided in a side panel.
Apr
20
5 Reasons Visualization Is Not More Prevalent
April 20, 2008 | 12 Comments
Why does it seem I have to look hard to find good data visualization examples? Why do few tech companies devote resources to visualization (Google’s the obvious exception)? Why are there relatively few job postings for visualization, and why do many of those that exist require mainly graphic design skills rather than data visualization skills? I was thinking about this today and I came up with a few possible reasons, some based on perceptions, and others based on marketplace realities.
Reason #1: People Don’t Know What Data Visualization Is
People don’t know what data visualization is. Don’t believe me? Read the Amazon.com reviews for the book Visualizing Data by Ben Fry. They contain negative comments such as “One would expect a book with the title ‘Visualizing Data’ to be crammed with pictures”. The complaint seems to be that too much of the book is devoted to data and the mapping of data properties to visual properties.
Apr
13
Haugeland’s AI Views 25 Years Later
April 13, 2008 | Leave a Comment
A couple of years ago, I picked John Haugeland’s Artificial Intelligence: The Very Idea up off the free book table in the computer science department of Indiana University. Finally read it this weekend. Published in 1985, there’s a lot to like about the book, but it’s definitely a product of its time, a period when computer and cognitive scientists were obsessing about knowledge representation. Wanted to call out a few (perhaps arrogant) quotes reflective of its day…
“A different pipedream of the 1950s was machine translation of natural languages. The idea first gained currency in 1949 (via a ‘memorandum’ circulated by mathematician Warren Weaver) and was vigorously pursued … Weaver actually proposed a statistical solution based on the N nearest words (or nouns) in the immediate context. … Might a more sophisticated ‘statistical semantics’ (Weaver’s own phrase) carry the day? Not a chance.”
Apr
3
10 New York Times Visualizations
April 3, 2008 | 3 Comments
NYTimes.com has done a great job of moving beyond the static infographics found in newspapers. 10 favorites below…comment if you know of good ones I’ve missed. Also, for further reading/viewing, see…
- Playgrounds for Data: Inspiration from NYTimes.com Interactives
- Infovis 2007 slides on Matthew Ericson’s blog…
Mar
11
ETech Presentation on Ensemble Machine Learning
March 11, 2008 | 2 Comments
Just wanted to put up my slides from ETech this past week. The talk is pretty similar to the talk I posted a few months ago, just a bit more fleshed out.
[ppt][pptx][pdf]
Unfortunately, I only made it to the conference for the day I was speaking. Beautiful venue. Seemed that most of the buzz related to social networking issues and climate change. Would have liked to have heard Peter Norvig’s talk. Maybe another year.
Mar
9
An information visualization conference, the See Conference, is being held in Wiesbaden, Germany, on April 19th. Impressive speaker list. The conference organizers plan to stream the speeches in real time via the conference website.
Ben Fry
Zachary Lieberman
Frank van Ham
And comfortable seats!
Nov
23
Ensemble Machine Learning Tutorial
November 23, 2007 | 1 Comment
Here are the slides from a 2-part lecture I’m giving on ensemble learning at Indiana University. It includes a discussion of the Netflix Prize competition, and the use of ensemble techniques in that competition.
Nov
5
A Review of MemoryArchive.org
November 5, 2007 | 1 Comment
I recently came across a small site running on Mediawiki called MemoryArchive.org. The concept is that each article is a memory written, unlike Wikipedia, by a single author. Subjective content allowed.
There seems to be a legit place for a site with this concept to complement Wikipedia. Wikipedia is derivative knowledge: the content is meant to be cited, meaning it already had to have been published somewhere. Many valuable (and not so valuable) facts don’t fit that bill. Also, when sources disagree but are merged into a single Wikipedia article, history according to Wikipedia has a rather non-deterministic feel to it.
That said, MemoryArchive.org has a long way to go in terms of concept, technology, and adoption. If anyone involved with MemoryArchive comes across this review…well, I have some ideas:
Oct
2
Visualizing Science & Tech Activity in Wikipedia
October 2, 2007 | Leave a Comment
If you didn’t see our original Wikipedia Activity Visualization, check it out here (there’s a detailed explanation, as well). Also, there is a Google maps style zoomable version here.
This new version uses the same layout and images (well, slightly improved) as the original, but this time we tried to highlight activity in regions of Wikipedia that are predominately math or science or technology.
So we developed a program to classify Wikipedia articles as being one of these three categories (or none), based on the categories the article was assigned to and their positions in the Wikipedia category link network.
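To give a feel for what classifying by position in the category network can look like, here is a minimal sketch in Python. It is not our actual program: the `parents` map, the three root category names, and the `max_depth` cutoff are all assumptions made for illustration. It walks upward from an article's categories, breadth first, and returns the first of the three topic categories it reaches.

```python
from collections import deque

# Hypothetical root categories standing in for math, science and technology.
TOPICS = ("Mathematics", "Science", "Technology")


def classify(article_categories, parents, max_depth=3):
    """Return the nearest topic category reachable from the article, or None.

    article_categories: categories assigned directly to the article.
    parents: dict mapping a category to its parent categories (assumed input).
    max_depth: how many links upward we are willing to follow (assumed cutoff).
    """
    seen = set(article_categories)
    queue = deque((category, 0) for category in article_categories)
    while queue:
        category, depth = queue.popleft()
        if category in TOPICS:
            return category
        if depth == max_depth:
            continue  # too far from the article to be a convincing match
        for parent in parents.get(category, ()):
            if parent not in seen:
                seen.add(parent)
                queue.append((parent, depth + 1))
    return None


# Toy category network, invented for the example.
parents = {
    "Algebra": ["Mathematics"],
    "Bridges": ["Civil engineering"],
    "Civil engineering": ["Engineering"],
    "Engineering": ["Technology"],
    "Rock bands": ["Music"],
    "Music": ["Arts"],
}

algebra_topic = classify(["Algebra"], parents)
bridge_topic = classify(["Bridges"], parents)
band_topic = classify(["Rock bands"], parents)
```

The depth cutoff matters because almost everything connects to almost everything else if you climb far enough up the category network, so an unbounded walk would label nearly every article.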
Sep
2
Scheme Tutorial
September 2, 2007 | 3 Comments
I was asked to give a short (1 hr) tutorial on the Scheme language this week for students in the graduate and undergraduate AI courses at Indiana. Thought I would post the slides in case anyone wants to adapt it for their own purposes…
PDF version
PPT (Office 2007) version
Aug
17
ICCBR 2007 Highlights
August 17, 2007 | 1 Comment
ICCBR07 (International Conference on Case Based Reasoning) is held on alternating years with the ECCBR conference. The venue was Belfast, a city with a nice blue-collar charm to it. Seemed sort of a European version of my hometown of Green Bay. Stayed in a Queen’s University dorm room, where I was constantly reminded I am too old to be staying in the dorms. Should have paid for a stay at the Europa Hotel where the conference was held…classy place.
Day 1
Perceptions of CBR by David Aha. Argued that CBR may become irrelevant if there are not more theoretical results published. Showed stats that more recent CBR publications are system-oriented to back up his argument (but may be true of AI in general). Suggested that CBR researchers have theory envy towards machine learning practitioners.
Note: A thought occurring to me is that CBR is more a set of design patterns, ones which are fairly accessible by the general public given the diverse interests of the delegates present.
Credible Case Based Reasoning by Eyke Hüllermeier. A formal treatment of the retrieval component of CBR (nicely timed to correspond to Aha’s argument).
Jul
27
This is a talk series being given at Google by David Mease based on a Master’s level stats course he is teaching this summer at Stanford. It’s easy listening if you already have some data mining or stats background.
The introduction (part 1) is particularly well done, as is the portion on association rule mining (parts 7 and 8). This is the first half of the course which has already occurred…I’ll add links as new sessions are added to Google video.
Jul
21
On Transfer Learning
July 21, 2007 | Leave a Comment
Definition (from DARPA): The ability of a system to recognize and apply knowledge and skills learned in previous tasks to novel tasks
Current approaches involve either building a shared model of one or more domains (in the form of a case base, hierarchy, or relational schema) that couples the classifiers together, or creating mappings between distinct representations. Bayesian and neural approaches dominate the research thus far.
(from Droy 2007-IJCAI07) In spam filtering, a typical data set consists of thousands of labeled emails belonging to a collection of users. In this sense, we have multiple data sets–one for each user. Should we combine the data set and ignore the prior knowledge that different users labeled each email? If we combine the data from a group of users who roughly agree on the definition of spam we will have increased the available training data from which to make predictions. However, if the preferences within a population of users are heterogeneous, then we should expect that simply collapsing the data into an undifferentiated collection will make our predictions worse.
May
20
Visualizing the ‘Power Struggle’ in Wikipedia
May 20, 2007 | 18 Comments
A new visualization Bruce Herr and I recently completed is being featured in this week’s New Scientist Magazine (the article is free online, minus the viz). They did a good job jazzing up the language used to describe the viz–‘power struggle’, ‘bubbling mass’, ‘blitzed articles’–but they also dumbed down the technical accomplishments. I guess not everyone gets as excited about algorithms as I do.
Before I talk anymore about the viz, though, let me mention it’s appearing at the NetSci 2007 Conference this week, and hopefully a variant will appear at Wikimania later this summer as well. The viz is a huge 5 feet by 5 feet when printed, and I only include a low res, smaller version here. At some point high quality art prints of it will appear at SciMaps for sale to fund further visualization research.
Now for the good stuff. Much like my visualization of the Netflix Prize competition data, we began this piece by representing the data as a network. In this case the nodes in the network are Wikipedia articles and the edges are the links between articles. We then (with some help from our friends at Sandia) used an algorithm to lay out all 650,000 nodes (Wikipedia articles) that had at least one link in such a way that similar articles are near one another. These are the yellow dots, which when viewed at low res give a yellow tint to the whole picture.
Apr
3
Another Visualization of the Netflix Prize Dataset
April 3, 2007 | 7 Comments
Here’s a recent visualization I did of the dataset used in the Netflix Prize Competition. The dataset is 17,700 movies and 31 gigs of user ratings. This viz shows similar movies close to one another, with the similarities determined by a formula based on ratings.
I found most interesting a cluster of movies (in blue) that I’d say are generally acclaimed. The cluster contains movies across all genres, such as Schindler’s List, Braveheart, and Super Size Me. Beyond that, there’s a bunch of clusters which are mostly defined by a genre such as music, sports, documentary, IMAX, children’s films, or bonus material. The big blob in the center is mostly what I’d call junk movies.
I’ve labeled some movies just to give some sense of what the clusters contain. There’s an interactive version of the viz as well, so you can explore the movies for yourself…
Apr
3
An Interactive Visualization of the Netflix Prize Dataset
April 3, 2007 | 2 Comments
The visualization activated below (click the button) shows all 17,700 movies that are part of the Netflix Prize Competition. The movies are laid out such that similar movies are close to one another. Similarity between two movies is computed based on whether users who like one like the other, or (and, really) those who dislike one dislike the other. Alternatively, take a look at a colorful, static version.
Mouse over to get the movie titles…
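For the curious, here is a rough Python sketch of a like/dislike agreement measure of the kind described above. It is not the exact formula behind the viz. The thresholds (4 or 5 stars counts as a like, 1 or 2 as a dislike) and the choice to divide by the number of users who rated both movies are assumptions made for illustration.

```python
def sentiment(stars):
    """Map a 1-5 star rating to like (+1), dislike (-1) or neutral (0).

    The cutoffs are an assumption, not taken from the original formula.
    """
    if stars >= 4:
        return 1
    if stars <= 2:
        return -1
    return 0


def movie_similarity(ratings_a, ratings_b):
    """Fraction of shared raters who either like both movies or dislike both.

    ratings_a, ratings_b: dicts mapping a user id to that user's star rating.
    """
    shared = ratings_a.keys() & ratings_b.keys()
    if not shared:
        return 0.0
    agreements = sum(
        1
        for user in shared
        if sentiment(ratings_a[user]) != 0
        and sentiment(ratings_a[user]) == sentiment(ratings_b[user])
    )
    return agreements / len(shared)


# Toy ratings, invented for the example.
movie_a = {"u1": 5, "u2": 1, "u3": 4, "u4": 3}
movie_b = {"u1": 4, "u2": 2, "u3": 1, "u5": 5}

similarity = movie_similarity(movie_a, movie_b)
```

Here three users rated both movies; two of them agree (one likes both, one dislikes both) and one disagrees, so the score is two thirds.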
Mar
18
GapMinder Talk
March 18, 2007 | 3 Comments
Just read an article about Google buying a small company called GapMinder which does data visualization. I checked out the talk on the GapMinder homepage, and would recommend watching the first 10 minutes of it. The visualization tool that is used throughout the talk is something special…easy to see Google’s interest.
Feb
28
Installing WordPress on GoDaddy
February 28, 2007 | 57 Comments
Setting up WordPress on a GoDaddy hosting account is really not difficult (this blog is an example that it can be done!). Below are my notes on the process. If you glance at these steps, and don’t want to mess around with this, consider using one of the following hosting services which come with WordPress pre-installed (fairly rare): An Hosting, Lunarpages, BlueHost, Yahoo
Steps for installing WordPress on a GoDaddy Hosting Account
1. Get an account. If you haven’t already, purchase a hosting account. I chose the Deluxe plan, which really isn’t very expensive. You’ll be emailed directions after you purchase the account. The email will say it takes 24-48 hrs to activate, but it actually only takes 20 minutes or so.
Feb
28
A Tutorial on Flash Remoting Using Perl
February 28, 2007 | 2 Comments
Flash remoting is a big improvement over forms/CGI for communication between Flash and the server. There’s a great little project called amfphp for using PHP with Flash remoting. There’s a far less polished (but appreciated!) counterpart called amf::perl for Perl and Python. There is little documentation, so I thought I’d post an example.
Here are my remoting notes dealing with amf::perl. For context, I was working on a movie recommendation system.
Steps:
Feb
24
Tips for Making Perl Programs Run Faster
February 24, 2007 | 3 Comments
In my daily work I tend to manipulate fairly large datasets, such as Wikipedia, U.S. Patents, Netflix Ratings, and IMDb. Here are a few tricks I’ve come across so that you don’t lose time waiting for your programs to finish.
- Use Storable
Feb
22
Flash vs. Processing
February 22, 2007 | 10 Comments
Over the past year and a half I’ve been hooked on the language Processing. I’ve even contributed an early version of a library for visualizing social network data.
Feb
22
Ranking Online Backup Services
February 22, 2007 | 3 Comments
This post is a bit off topic for this blog. However, I recently decided I really needed an off-site backup service (er, I lost some files). And, as usual, I spent way too much time looking around the web for such a service. Anyways, I thought I’d share my homework. To give you an idea of my situation, I have about 30GB of data (code, datasets, visualizations) that I need constantly backed up.
Backup Services vs. Online Storage
There are a ton of options. I just wanted an automatic backup service–one that works flawlessly. Many services I looked at emphasize that the files can be accessed anywhere. These services often refer to themselves as online storage rather than online backup. The pure backup services usually offer unlimited space and focus on the software for backing up changes on demand or when your computer is idle, and on the software for restoring files. The tradeoff is usually that files are stored compressed, and are only meant to be accessed for somewhat rare restores. Online storage services, by comparison, limit the storage space, and although they almost always advertise the backup aspect, files usually need to be manually transferred. The advantage is that the files on such services can be accessed over the web.
Feb
17
Top 10 Google Tech Talks
February 17, 2007 | 3 Comments
All Google Tech Talks are here (Google EngEDU is the actual name of the talk series). Thought I’d compile a top ten list…
- Python and Python 3000. Two talks about the Python language given by its inventor Guido van Rossum. The first is about the language’s origins and the second is about its future.
Feb
15
CoCitation vs. Bibliometric Coupling
February 15, 2007 | Leave a Comment
I recently posted an efficient algorithm for computing the similarity of two Wikipedia pages (or any two nodes in a network) using cocitation similarity. Another type of similarity which may be worth considering is bibliometric coupling, in which two pages are similar if the pages they link to are similar. What is interesting is that it takes only a few minor tweaks to the cocitation algorithm to compute bibliometric coupling. Here’s the bibliometric coupling pseudocode (Perl style):
%nodes, %links //the wikipedia pages and pagelinks
%reverse = reverse(%links) //flipping the pagelinks around
%biblioCounts //2d hash for temporarily storing counts
%scores //2d hash storing the final similarity scores
foreach node (keys %nodes){
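The pseudocode excerpt stops at the start of the loop, so here is a small runnable Python sketch of one way the idea can be carried through. It is a guess at the shape of the computation, not the original algorithm: the function names are invented, and normalizing by the geometric mean of the two out-degrees is an assumption. The trick is to flip the links around, then walk each linked-to page once and credit every pair of pages that both link to it, rather than comparing every pair of pages directly.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def reverse_links(links):
    """Flip page -> outlinks into page -> set of pages linking to it."""
    reverse = defaultdict(set)
    for source, targets in links.items():
        for target in targets:
            reverse[target].add(source)
    return reverse


def coupling_scores(links):
    """Score each pair of pages by how many link targets they share.

    links: dict mapping a page to the set of pages it links to.
    Pairs that share no targets never appear in the result.
    """
    biblio_counts = defaultdict(int)
    for target, sources in reverse_links(links).items():
        # Every pair of pages linking to `target` shares one more target.
        for a, b in combinations(sorted(sources), 2):
            biblio_counts[(a, b)] += 1
    # Assumed normalization: shared targets over the geometric mean of
    # the two pages' out-degrees, which gives a score between 0 and 1.
    return {
        (a, b): shared / sqrt(len(links[a]) * len(links[b]))
        for (a, b), shared in biblio_counts.items()
    }


# Toy link data, invented for the example.
links = {
    "A": {"X", "Y"},
    "B": {"X", "Y", "Z"},
    "C": {"Z"},
}

scores = coupling_scores(links)
```

In the toy data, A and B share two targets (X and Y), B and C share one (Z), and A and C share none, so no score is ever stored for that pair. Skipping unrelated pairs entirely is where the savings over the naive all-pairs comparison come from.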
Feb
12
Wikipedia Page Similarities
February 12, 2007 | 1 Comment
I’m working on a visualization, a ‘map’ if you will, of Wikipedia pages. The map will lay out pages close to one another if they are similar. So, in order to create such a map I need to compute the similarity of any two Wikipedia pages.
For my first attempt at this, I decided to go with a cocitation measure of similarity. So, two Wikipedia pages will be said to be similar if other Wikipedia pages that link to one usually link to the other.
However, the naive way to compute this, looking at every pair of pages, is far too inefficient given that there are 650,000 pages in the English Wikipedia, and 14.5 million pagelinks. So I’ve worked up a much more efficient algorithm. Here’s the pseudocode…I hope someone, somewhere out in cyberspace will find this useful. (It can, in fact, be used to compute co-citation similarities for any data represented as nodes and links.)
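As a quick illustration of why the all-pairs comparison can be avoided, here is a minimal Python sketch of cocitation counting. It is a simplified stand-in, not the pseudocode from the full post: the function name and the toy data are invented, and it stops at raw counts without any normalization. Each page is visited once, and every pair of pages it links to gets one cocitation, so the work scales with the links that exist rather than with the number of possible page pairs.

```python
from collections import defaultdict
from itertools import combinations


def cocitation_counts(links):
    """Count, for each pair of pages, how many pages link to both.

    links: dict mapping a page to the set of pages it links to.
    Returns a dict keyed by alphabetically ordered (page, page) tuples.
    """
    counts = defaultdict(int)
    for source, targets in links.items():
        # Every pair of pages cited together by `source` gains one count.
        for a, b in combinations(sorted(targets), 2):
            counts[(a, b)] += 1
    return dict(counts)


# Toy link data, invented for the example.
links = {
    "A": {"X", "Y"},
    "B": {"X", "Y", "Z"},
    "C": {"Y", "Z"},
}

counts = cocitation_counts(links)
```

One caveat with this simple version: a page with thousands of outlinks produces millions of pairs on its own, so on real Wikipedia data some cap or sampling on very long link lists would probably be needed.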
Feb
10
Visualization Google Tech Talks
February 10, 2007 | Leave a Comment
15 Views of a Node Link Graph: An Information Visualization Portfolio by Tamara Munzner
Scholarly Data, Network Science, and (Google) Maps by Katy Borner
Feb
10
35 Great Visualizations
February 10, 2007 | 1 Comment

Geographical & Historical
WorldProcessor. Globes overlaid with information. Beautiful…must see!
Wikisky Google maps for the stars.
Flight Patterns Visualizations of FAA data.
TextArc: History of Science Beautiful.
2007 Calendar. Brad Paley design.
31 days in Iraq. Visualization of deaths in Iraq. Depressing.
Tracing the Visitor’s Eye Flickr tags on a geospatial basemap.
Schreiner International Cables Map. Old world map of cables.
Napoleon’s March. Made famous by Edward Tufte.
Government