A Beautiful WWW
http://abeautifulwww.com
Information Retrieval. Information Visualization. Data Mining. Artificial Intelligence. Web Programming.

Guide to Getting Started in Machine Learning
http://abeautifulwww.com/2009/10/11/guide-to-getting-started-in-machine-learning/
Posted Sun, 11 Oct 2009 05:01:02 +0000 by admin

Someone at work recently asked how he should go about studying machine learning on his own. So I’m putting together a little guide. This post will be a living document…I’ll keep adding to it, so please suggest additions and make comments.


Fortunately, there are a ton of great resources that are free and on the web. The very best way to get started that I can think of is to read chapter one of The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2009 edition). The PDF is available online. Or buy the book on Amazon, if you prefer.

Once you’ve read the first chapter, download R. R is an open-source statistics package/language that’s quite popular. Never heard of it? Check out this post (How Google and Facebook are using R).

Once you’ve installed R and maybe played around a little, check out this page, which describes the major machine learning packages in R. If you’re already familiar with some of the techniques, then dive in and start playing around with them in R. On the other hand, if it looks really complicated, don’t worry about it yet.

Oh, by the way, if you want to start playing around with machine learning in R, you’ll need data. Check out the UCI Machine Learning Repository. They have both real and toy datasets. The iris dataset, for example, is famous for showing up in many research publications.
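
To give a feel for what a first experiment on a dataset like iris looks like, here is a minimal nearest-neighbor sketch. It is written in Python rather than R purely for illustration. The six rows are typed in by hand in the iris layout (sepal length, sepal width, petal length, petal width, species), and the names `TRAINING` and `nearest_neighbor` are made up for this example.

```python
import math

# A handful of hand-typed rows in the iris layout:
# (sepal length, sepal width, petal length, petal width) in cm, plus species.
TRAINING = [
    ((5.1, 3.5, 1.4, 0.2), "setosa"),
    ((4.9, 3.0, 1.4, 0.2), "setosa"),
    ((7.0, 3.2, 4.7, 1.4), "versicolor"),
    ((6.4, 3.2, 4.5, 1.5), "versicolor"),
    ((6.3, 3.3, 6.0, 2.5), "virginica"),
    ((5.8, 2.7, 5.1, 1.9), "virginica"),
]


def nearest_neighbor(sample, training=TRAINING):
    """Return the label of the training row closest to `sample` (Euclidean distance)."""
    _, label = min(training, key=lambda row: math.dist(row[0], sample))
    return label


if __name__ == "__main__":
    print(nearest_neighbor((5.0, 3.4, 1.5, 0.2)))  # setosa
```

With the full dataset you would hold some rows back for testing instead of eyeballing a single prediction, but the idea is the same.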

I’d suggest next reading more of The Elements of Statistical Learning. It’s an excellent book. Try doing some of the programming exercises using R. If you don’t like this book, there are plenty of others. Bishop’s Pattern Recognition and Machine Learning is a famous one. It can be a little difficult depending on your math background. Tom Mitchell’s Machine Learning is another that’s often used to teach the topic.

If you’re looking for perhaps a more passive experience, or want the feel of a classroom, Andrew Ng of Stanford has posted all of his lectures online. He starts by saying that he thinks machine learning is the most exciting field in all of computer science. Hear, hear!

Another great resource is the machine learning course MIT has posted on their OpenCourseWare site. It has the lecture notes, assignments, and more.

I’ll stop here now. More later.

20 Useful Visualization Libraries
http://abeautifulwww.com/2008/09/08/20-useful-visualization-libraries/
Posted Mon, 08 Sep 2008 05:04:10 +0000 by admin

Well, not entirely limited to libraries.  Useful stuff for visualization practitioners sounded a little non-specific, though.  These are all freely available.

1. Prefuse (Java) & FLARE (Flex)
2. SIMILE (AJAX)
3. Processing (Java)
4. GigaPan (Service)
5. Modest Maps (Flash, Python)
6. Google Visualization API (JavaScript)
7. Google Chart API (JavaScript)
8. Google Maps API (JavaScript, Flash)
9. GraphViz (Wrappers for a dozen languages including Java, Perl, Python. Free.)
10. JFree (Java)
11. pChart (PHP)
12. OpenLayers (JavaScript)
13. Anti-Grain (C++)
14. JGraph (Java)
15. Boost Graph Library (C++, Python wrapper)
16. Open Flash Chart (Flash)
17. Ubigraph (Wrappers for Python, Java, C, and more)
18. JUNG (Java)
19. TimeMap (Java)
20. Many Eyes (online service)

Network Visualization for Systems Biology
http://abeautifulwww.com/2008/05/29/network-visualization-for-systems-biology/
Posted Fri, 30 May 2008 03:37:51 +0000 by admin

This is a quick look at the state-of-the-art of network visualization in systems biology. It’s an interesting topic on its own (and my day job at the moment), and also as it relates to the visualization of other types of networks, such as social networks (think Facebook). Systems biology is all about looking at proteins, pathogens, and more, within the contexts in which they interact. Naturally, then, the visualizations that tend to be particularly useful are those such as network visualizations that can provide macro understanding of the interactions.  Questions such visualizations help with include those of the form “if a drug affects protein X, what else will it affect?”

The Networks
Quite a bit of interesting complexity is present in these interaction networks (the data).  They are often small-world, disassortative (unlike social networks), scale-free, and modular.  Biologists are usually interested in either larger-scale, cell-level networks or meaningful sub-networks called pathways, which typically have 50-500 nodes.

Making life interesting, duplicate nodes representing different states are often included.  The edges are directed, and may be hyperedges when multiple nodes necessarily interact together. And, in truth, the edges are often approximations of the actual interactions in the underlying network.  These approximations come from experimental findings published in journals.  
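
To make the “if a drug affects protein X, what else will it affect?” question concrete, here is a rough sketch that treats a pathway as a plain directed graph and collects everything downstream of one node with a breadth-first search. The node names and the `PATHWAY` dictionary are hypothetical, and the sketch ignores hyperedges and duplicate state nodes, which a real tool would have to handle.

```python
from collections import deque


def downstream(graph, start):
    """Return every node reachable from `start` by following directed edges."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for target in graph.get(node, ()):
            # Skip nodes already found, and don't report the start node itself.
            if target not in seen and target != start:
                seen.add(target)
                queue.append(target)
    return seen


# Hypothetical toy pathway: each key directly affects the nodes in its list.
PATHWAY = {
    "X": ["A", "B"],
    "A": ["C"],
    "B": ["C", "D"],
    "D": ["X"],  # feedback loop back to X
    "E": ["A"],  # upstream of A, so not affected by X
}

if __name__ == "__main__":
    print(sorted(downstream(PATHWAY, "X")))  # ['A', 'B', 'C', 'D']
```

Since the edges are often approximations taken from published experiments, the result is best read as “possibly affected” rather than as a definitive answer.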

A First Look
This image is part of Roche Applied Science’s “Biochemical Pathways” series of wall charts.  The charts are in the style of circuit diagrams, which seems to be the most common 2-D representation of metabolic pathways.  This set seems to have been particularly influential.  The appeal of this ‘map’ is likely its scale: viewers can spend a great deal of time exploring.  In visualization there is a notion of ‘information density’, meaning that the more visual attributes are used to convey the data, the more information may be present in the visualization.  This image has a very high information density.

Layout

In general (not just in systems biology), network/graph layout (choosing where to place the nodes and edges) is done with consideration for (A) the topology of the network and (B) the aesthetics.  The primary topological concern is to place connected node pairs near one another and unconnected pairs apart.  The primary aesthetic concerns are to ensure that nodes do not overlap, edges do not cross, and labels are readable.
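
One common way to get “connected pairs near, unconnected pairs apart” is a force-directed (spring-embedder) layout: every pair of nodes repels, and every edge acts like a spring. The sketch below is a bare-bones version in that spirit, not the algorithm of any particular tool mentioned here. The function name `force_layout`, its parameters, and the example graph are assumptions for illustration, and it makes no attempt to avoid edge crossings or place labels.

```python
import math


def force_layout(nodes, edges, iterations=300, k=1.0, max_step=0.1):
    """Spring-embedder sketch: all pairs repel, connected pairs attract.

    `nodes` is a list of names, `edges` a list of (u, v) pairs.
    Returns {name: [x, y]}.
    """
    n = len(nodes)
    # Deterministic start: nodes evenly spaced on a unit circle.
    pos = {v: [math.cos(2 * math.pi * i / n), math.sin(2 * math.pi * i / n)]
           for i, v in enumerate(nodes)}
    for it in range(iterations):
        force = {v: [0.0, 0.0] for v in nodes}
        # Repulsion between every pair keeps nodes from overlapping.
        for i, u in enumerate(nodes):
            for v in nodes[i + 1:]:
                dx = pos[u][0] - pos[v][0]
                dy = pos[u][1] - pos[v][1]
                d = max(math.hypot(dx, dy), 1e-9)
                f = k * k / d
                force[u][0] += f * dx / d
                force[u][1] += f * dy / d
                force[v][0] -= f * dx / d
                force[v][1] -= f * dy / d
        # Attraction along edges pulls connected nodes together.
        for u, v in edges:
            dx = pos[u][0] - pos[v][0]
            dy = pos[u][1] - pos[v][1]
            d = max(math.hypot(dx, dy), 1e-9)
            f = d * d / k
            force[u][0] -= f * dx / d
            force[u][1] -= f * dy / d
            force[v][0] += f * dx / d
            force[v][1] += f * dy / d
        # Move each node along its net force, capped by a shrinking step size.
        step = max_step * (1 - it / iterations)
        for v in nodes:
            fx, fy = force[v]
            length = max(math.hypot(fx, fy), 1e-9)
            move = min(length, step)
            pos[v][0] += fx / length * move
            pos[v][1] += fy / length * move
    return pos


if __name__ == "__main__":
    # Two triangles joined by a single bridge edge (c, d).
    names = ["a", "b", "c", "d", "e", "f"]
    links = [("a", "b"), ("b", "c"), ("c", "a"),
             ("d", "e"), ("e", "f"), ("f", "d"), ("c", "d")]
    for name, (x, y) in force_layout(names, links).items():
        print(f"{name}: ({x:.2f}, {y:.2f})")
```

The shrinking step size plays the role of a cooling schedule, so the layout settles instead of oscillating.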

However, nodes in systems biology often also have biologically significant locations associated with them (e.g., within a cell, or within the nucleus of a cell).  The most common way of handling this location information is to lay out the network in a standard manner, but constrain each node to a compartment/level designated as extracellular, membrane, cytoplasm, nucleus, etc.  This visualization, created with the Cerebral plugin for Cytoscape, is the best example of this that I know of.

Realism

Most of the network visualization tools for systems biology create very abstract images.  However, in high-quality publications, such as the journal Nature, the abstract images are often hand-rendered to include more realistic imagery.  Something I would like to do more of is look at actual microscope images and behavioral models to try to usefully bridge the gap.

Visual Data Mining

There are many uses of these network visualizations for biologists and others.  One is simply that they can leave a more lasting impression/memory than simple lists.  A major use case, though, is visual data mining, which may take many forms.  Followers of Tufte know that contrasts are often the most valuable element of a visualization.  This image is a straightforward example.  More sophisticated visual data mining might include clustering and classification of those clusters.

Zoom and Community Involvement

Because the Roche wall charts beg to be explored, it is only natural that a tool would be created for doing so.  G-Language is an open source shell that supports, among other things, pathway visualization plugins.  The Genome Projector is a module for G-Language that uses the Google Maps API to allow exploration and annotation.  No doubt, as systems biology network visualization tools reach later versions, more and more will support rich interaction and, perhaps, treat the visualization as a vehicle for collaboration.

Hierarchy and Metanodes
In the networks section above, I mentioned that the networks are often modular.  The most obvious modules are organelles.  But other modules exist, such as those defined by functionality.  As the above examples show, incorporating the modularity information into the visualization is often done in a manner that makes it even more abstract.

A Look at FINVIZ.com (Financial Visualizations)
http://abeautifulwww.com/2008/05/12/a-look-at-finvizcom-financial-visualizations/
Posted Mon, 12 May 2008 06:31:47 +0000 by admin

FINVIZ is a suite of free financial tools that takes advantage of modern visualization ideas.  The infoviz and interaction designs are certainly worth a blog post.  Here’s a look at their efforts…

1. Sector Visualization.  This visualization is a treemap implemented using the Google Maps API.   It shows how well sectors and companies (stocks) within those sectors are doing.  The attention to detail is exceptional.  The company name stays the same size on zoom, and is dual encoded using a background image.  The gain/loss is shown using shades of green/red, and is also dual encoded using text.  On mouseover details are provided in a side panel.
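
For anyone who hasn’t built a treemap, the core idea is just splitting a rectangle into pieces whose areas are proportional to the values. Below is a minimal “slice-and-dice” sketch of that idea plus the green/red encoding. I don’t know which layout algorithm FINVIZ actually uses (it may well be something fancier, such as a squarified layout), and the function names, sector names, and numbers are all made up for illustration.

```python
def slice_and_dice(items, x, y, width, height, vertical=True):
    """Split a rectangle into strips whose areas are proportional to each value.

    `items` is a list of (name, value) pairs; returns {name: (x, y, w, h)}.
    """
    total = sum(value for _, value in items)
    rects = {}
    offset = 0.0
    for name, value in items:
        share = value / total
        if vertical:  # strips side by side, left to right
            rects[name] = (x + offset, y, width * share, height)
            offset += width * share
        else:  # strips stacked top to bottom
            rects[name] = (x, y + offset, width, height * share)
            offset += height * share
    return rects


def gain_color(pct_change):
    """Map a gain/loss percentage to a simple color name."""
    if pct_change > 0:
        return "green"
    if pct_change < 0:
        return "red"
    return "grey"


if __name__ == "__main__":
    # Hypothetical sector sizes (say, market cap in arbitrary units).
    sectors = [("Technology", 60), ("Energy", 30), ("Utilities", 10)]
    for name, rect in slice_and_dice(sectors, 0, 0, 100, 50).items():
        print(name, rect)
```

Nesting companies inside sectors is then a matter of calling `slice_and_dice` again on each sector’s rectangle with `vertical` flipped.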

2. Stock Charts.  When you create a portfolio of stocks, a number of views of that portfolio are available.  One is a small multiples view, which allows easy comparison without the overlays that Google Finance and Yahoo Finance charts require.  Again, the attention to detail is wonderful.  The current price is highlighted, the trendlines are nicely colored, and the volume barchart is part of the background.

 3. Trends.  They use Sparklines for trend indicators.  Well, they may just be icons (not encoded by actual data), but I’ll delude myself nonetheless.
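
If the sparklines were driven by actual data, the mapping would be simple: scale each value into a small number of height levels. Here is a text-only sketch of that using Unicode block characters. The function name `sparkline` is my own, and this is only meant to show the data-to-height mapping, not how FINVIZ renders theirs.

```python
BARS = "▁▂▃▄▅▆▇█"


def sparkline(values):
    """Render a sequence of numbers as a one-line string of block characters."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        return BARS[0] * len(values)  # flat series: every bar at the lowest level
    top = len(BARS) - 1
    # Scale each value to an index 0..top, rounding to the nearest level.
    return "".join(BARS[int((v - lo) * top / span + 0.5)] for v in values)


if __name__ == "__main__":
    print(sparkline([1, 2, 3, 4, 5, 6, 7, 8]))  # ▁▂▃▄▅▆▇█
```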

4. News. They aggregate the news items for all the stocks in a portfolio onto one page.  Very nicely done.  The day, month, and year are only shown when they change.  A chart is overlaid on mouseover of the price (notice the little icon to indicate this next to the word price…attention to detail).

5. Profiles.  Again, just very nicely done, showing all of the profiles on the same page.

6.  Relative Volume Indicator.  A second vertical axis is added.

5 Reasons Visualization Is Not More Prevalent
http://abeautifulwww.com/2008/04/20/5-reasons-visualization-is-not-more-prevalent/
Posted Sun, 20 Apr 2008 07:33:43 +0000 by admin

Why does it seem I have to look hard to find good data visualization examples?  Why do few tech companies devote resources to visualization (Google’s the obvious exception)?  Why are there relatively few job postings for visualization, and why do many of those that exist mainly require graphic design skills rather than data visualization skills?  I was thinking about this today and I came up with a few possible reasons, some based on perceptions, and others based on marketplace realities.

Reason #1: People Don’t Know What Data Visualization Is

People don’t know what data visualization is.  Don’t believe me?  Read the Amazon.com reviews for the book Visualizing Data by Ben Fry. They contain negative comments such as “One would expect a book with the title ‘Visualizing Data’ to be crammed with pictures”.  The issue seems to be that too much of the book is devoted to data and the mapping of data properties to visual properties.

Graphic design is different from data visualization.  Graphic designers are largely free from having to deal with actual data, and from having their product emerge from data.  Graphic design components and data visualization components are often mixed, and with great success.  But they are different.  Art is not visualization.  And visualization is not art…unless it is ;)

The above visualization (which is, in fact, by Ben Fry) is driven by the properties of two underlying datasets.  One dataset is the DNA of a monkey.  The genes (the data) are represented as very tiny white text.  A second dataset used is human DNA. It is only depicted after the difference of the two datasets has been computed.  Then the genes that are different between the monkey and human are represented in red.  Fry obviously didn’t choose which areas of the visualization would be red, the data did.  What about the monkey pic?  Even that is a visual representation of a property of the dataset…the type of the DNA dataset shown in white text.   
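
The “the data chose the red, not the designer” point can be shown in a few lines. This sketch compares two aligned sequences symbol by symbol and assigns a color from the comparison alone. It is only a toy: the sequences are made up (not real genomic data), the name `color_by_difference` is hypothetical, and I’m not claiming this is how Fry computed his piece.

```python
def color_by_difference(reference, other):
    """Pair each symbol of `reference` with 'red' where `other` differs, else 'white'."""
    if len(reference) != len(other):
        raise ValueError("sequences must be aligned to the same length")
    return [(a, "red" if a != b else "white") for a, b in zip(reference, other)]


if __name__ == "__main__":
    # Made-up, pre-aligned toy sequences.
    monkey = "GATTACA"
    human = "GACTACA"
    for symbol, color in color_by_difference(monkey, human):
        print(symbol, color)
```

Swap in a different pair of sequences and the red moves with no design decision involved, which is exactly what separates this from graphic design.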

Reason #2: Crappy Existing Visualizations have Polluted Perception

[Images: Kartoo search interface (left); CNET’s The Big Picture (right)]

The visualization on the left is the interface for the search engine Kartoo.  The visualization on the right is a feature CNET used to have called The Big Picture.  Both attempt to visualize data usually shown as lists (search results, related news articles) as 2D networks.  It’s a nice idea, as pairwise relationship properties can be visually represented as edges.  But these particular efforts both miss the boat.  They don’t actually increase the amount of information represented by very much vs lists, while greatly increasing the mental load placed on the user trying to extract the basic information.

Reason #3: People are Unable to Mentally Separate the View from the Data

Here’s another Ben Fry work (I was watching a video/talk of his earlier today, which is part of the reason he is so prevalent in this post).  It shows six different visualizations of the same dataset.

Many times data relates to physical objects.  In such cases people may have trouble dealing with the data when it is visually represented in any manner that does not include those physical objects.  Another situation is one in which data has simply always been depicted in a certain way, which interferes with any new depiction.

Reason #4: Visualization is Difficult to Create and Easy to Copy

This is somewhat irrelevant, but I have had a Yahoo mail account for about a decade.  There was a good six year stretch where it never changed.  If Gmail hadn’t come along, who knows. 

When Google released Google Finance, it marked a number of firsts…the use of AJAX for stock charts (the chart itself is actually Flash), the overlay of events on the chart, and the dual time sliders.  No doubt Google spent much time and effort designing this visualization tool.  How long did it take Yahoo Finance to copy Google Finance’s chart once Google revealed it?  Not long.  Good visualization design is hard.  It’s even harder when its object is to deconstruct very complex data.  Reverse engineering a visualization is easy.

Reason #5: People Won’t Pay for Visualization?

I’m not so sure about this one, but our company’s CTO recently commented to me that he couldn’t think of any successful standalone visualization effort other than Processing.

Applications such as Google Maps don’t count, both because they’re free and, more importantly, because people wouldn’t have access to the underlying data without the visualization.  I can think of a few commercially successful standalone visualizations such as this one, but surely the list is fairly short.

Haugeland’s AI Views 25 Years Later
http://abeautifulwww.com/2008/04/13/haugelands-ai-views-25-years-later/
Posted Sun, 13 Apr 2008 21:58:08 +0000 by admin

A couple of years ago, I picked John Haugeland’s Artificial Intelligence: The Very Idea up off the free book table in the computer science department of Indiana University. Finally read it this weekend.  Published in 1985, there’s a lot to like about the book, but it’s definitely a product of its time, that period being when computer and cognitive scientists were obsessing about knowledge representation.  Wanted to call out a few (perhaps arrogant) quotes reflective of its day…

“A different pipedream of the 1950s was machine translation of natural languages.  The idea first gained currency in 1949 (via a ‘memorandum’ circulated by mathematician Warren Weaver) and was vigorously pursued … Weaver actually proposed a statistical solution based on the N nearest words (or nouns) in the immediate context. …  Might a more sophisticated ‘statistical semantics’ (Weaver’s own phrase) carry the day? Not a chance.”

Pipedream…somebody tell Google :)   Actually, I had no idea machine translation was worked on in the 1950s.  Cool!  I would mention that the other pipedream of the ’50s he discusses is cybernetics, which, in various forms, is also a very popular area of research today.

“Artificial Intelligence must start by trying to understand knowledge…and then, on that basis, tackle learning.  It may even happen that, once the fundamental structures are worked out, acquisition and adaptation will be comparatively easy to include…it does not appear that learning is the most basic problem, let alone a shortcut or a natural starting point.”

Seems like research that has treated knowledge representation and learning as one problem (neural nets, Bayesian nets, etc) has been particularly fruitful.

“AI has discovered that knowledge itself is extraordinarily complex and difficult to implement–so much so that even the general structure of a system with common sense is not yet clear.”

And, clearly, the Cyc project solved this problem ;)

Anyway, the book is still a very interesting read, particularly if you like thinking about the challenges inherent in the domain of knowledge representation.

10 New York Times Visualizations
http://abeautifulwww.com/2008/04/03/10-new-york-times-visualizations/
Posted Thu, 03 Apr 2008 04:07:47 +0000 by admin

NYTimes.com has done a great job of moving beyond the static infographics found in newspapers.  10 favorites below…comment if you know of good ones I’ve missed.  Also, for further reading/viewing, see…

- Playgrounds for Data: Inspiration from NYTimes.com Interactives
- Infovis 2007 slides on Matthew Ericson’s blog…

[Images: nytimesnamingnames, nytimesUnion, nytimesHowClassWorks, nytimesBuyOrRent, nytimesSectorSnap, nytimesmoviebox, nytimeskatrina, nytimes-election2004, nytimesCasualities, primary]

ETech Presentation on Ensemble Machine Learning
http://abeautifulwww.com/2008/03/11/etech-presentation-on-ensemble-machine-learning-3/
Posted Tue, 11 Mar 2008 05:55:11 +0000 by admin

Just wanted to put up my slides from ETech this past week.  The talk is pretty similar to the talk I posted a few months ago, just a bit more fleshed out.
[ppt][pptx][pdf]

Unfortunately, I only made it to the conference for the day I was speaking.  Beautiful venue.  Seemed that most of the buzz related to social networking issues and climate change.  Would have liked to have heard Peter Norvig’s talk.  Maybe another year.

See Conference (Information Visualization) to be Streamed Live in April
http://abeautifulwww.com/2008/03/09/see-conference-information-visualization-to-be-streamed-live-in-april-2/
Posted Sun, 09 Mar 2008 06:50:46 +0000 by admin

An information visualization conference, the See Conference, is being held in Wiesbaden, Germany, on April 19th.  Impressive speaker list.  The conference organizers plan to stream the speeches in real time via the conference website.

Ben Fry
Zachary Lieberman
Frank van Ham

And comfortable seats!

Ensemble Machine Learning Tutorial
http://abeautifulwww.com/2007/11/23/ensemble-machine-learning-tutorial/
Posted Fri, 23 Nov 2007 20:11:00 +0000 by admin

Here are the slides from a 2-part lecture I’m giving on ensemble learning at Indiana University.  It includes a discussion of the Netflix Prize competition, and the use of ensemble techniques in that competition.

[PDF][PPT]
