Ensemble Machine Learning Tutorial

ensembleTutorialSlide

Here’s the slides from a 2-part lecture I’m giving on ensemble learning at Indiana University.  It includes a discussion of the Netflix Prize competition, and the use of ensemble techniques in that competition.

[PDF][PPT]

1 Comment »

Visualizing Science and Technology in Wikipedia

sciwikivis-small If you didn’t see our original Wikipedia Activity Visualization, check it out here (there’s a detailed explanation, as well).  Also, there is a Google maps style zoomable version here.

This new version uses the same layout and images (well, slightly improved) as the original, but this time we tried to highlight activity in regions of Wikipedia that are predominately math or science or technology.

So we developed a program to classify Wikipedia articles as being one of these three categories (or none), based on the categories the article was assigned to and their positions in the Wikipedia category link network.

Read More

4 Comments »

Scheme Tutorial

windowslivewriterschemetutorial-13a79image-5 I was asked to give a short (1 hr) tutorial on the Scheme language this week for students in the graduate and undergraduate AI courses at Indiana.  Thought I would post the slides in case anyone wants to adapt it for their own purposes…

PDF version
PPT (Office 2007) version

Read More

4 Comments »

ICCBR 2007 Highlights

iccbrLogo
ICCBR07 (International Conference on Case Based Reasoning) is held on alternating years with the ECCBR conference.  The venue was Belfast, a city with nice blue collar charm to it.  Seemed sort of a European version of my hometown of Green Bay.  Stayed in a Queens University dorm room, where I was constantly reminded I am too old to be staying in the dorms.  Should have paid out for a stay for the Europa Hotel where the conference was held…classy place.

Read More

1 Comment »

Google Tech Talk Review: Statistical Aspects of Data Mining

This is a talk series being given at Google by David Mease based on a Master’s level stats course he is teaching this summer at Stanford.  Its easy listening if you already have some data mining or stats background. 

 

The introduction (part 1) is particularly well done, as is the portion on association rule mining (parts 7 and 8).  This is the first half of the course which has already occurred…I’ll add links as new sessions are added to Google video.

Read More

1 Comment »

Visualizing the ‘Power Struggle’ in Wikipedia

wikimap-small A new visualization Bruce Herr and I recently completed is being featured in this week’s New Scientist Magazine (the article is free online, minus the viz).  They did a good job jazzing up the language used to describe the viz–’power struggle’, ‘bubbling mass’, ‘blitzed articles’–but they also dumbed down the technical accomplishments.  I guess not everyone gets as excited about algorithms as I do. 

Before I talk anymore about the viz, though, let me mention its appearing at the NetSci 2007 Conference this week, and hopefully a varient will appear at Wikimania later this summer as well.  The viz is a huge 5 feet by 5 feet when printed, and I only include a low res, smaller version here.  At some point high quality art prints of it will appear at SciMaps for sale to fund further visualization research.

Read More

21 Comments »

Another Visualization of the Netflix Prize Dataset

netflixAllMoviesSmall Here’s a recent visualization I did of the dataset used in the Netflix Prize Competition. The dataset is 17,700 movies and 31 gigs of user ratings. This viz shows similar movies close to one another, with the similarities determined by a formula based on ratings.

I found most interesting a cluster of movies (in blue) that I’d say are generally acclaimed. The cluster contains movies of across all genres, such as Schindler’s List, BraveHeart, and Super Size Me. Beyond that, there’s a bunch of clusters which are mostly defined by a genre such as music, sports, documentary, Imax, children’s films, or bonus material. The big blob in the center is mostly what I’d call junk movies.

Read More

7 Comments »

An Interactive Visualization of the Netflix Prize Dataset

smallNetflixVizInteractive The visualization activated below (click the button) shows all 17,700 movies that are part of the Netflix Prize Competition. The movies are laid out such that simlar movies are close to one another. Similarity between two movies is computed based on whether users who like one like the other, or (and, really) those who dislike one dislike the other.  Alternatively, take a look at a colorful, static version.

Read More

4 Comments »

GapMinder Talk

Just read an article about Google buying a small company called GapMinder which does data visualization.  I checked out the talk on the GapMinder homepage, and would recommend watching the first 10 minutes of it.  The visualization tool that is used throughout the talk is something special…easy to see Google’s interest.

3 Comments »

Installing WordPress on GoDaddy

wordpressGoDaddy Setting up WordPress on a GoDaddy hosting account is really not difficult (this blog is an example that it can be done!).  Below are my notes on the process.  If you glance at these steps, and don’t want to mess around with this, consider using one of the following hosting services which come with WordPress pre-installed (fairly rare): An Hosting, Lunarpages, BlueHost, Yahoo

Steps for installing WordPress on a GoDaddy Hosting Account

Read More

82 Comments »

A Tutorial on Flash Remoting Using Perl

remoting Flash remoting is a big improvement over forms/cgi for communication between flash and server.  There’s a great little project called amfphp for using php with flash remoting.  There’s a whole lot less great (but appreciated!) version called amf::perl for perl and python. There is little documentation, so I thought I’d post an example. 

Read More

2 Comments »

Tips for Making Perl Programs Run Faster

In my daily work I tend to manipulate fairly large datasets, such as Wikipedia, U.S. Patents, Netflix Ratings, and Imdb.  Here’s a few tricks I’ve come across so that you don’t lose time waiting for your programs to finish. 

Read More

3 Comments »

Flash vs. Processing

Over the past year and a half I’ve been hooked on the language Processing. I’ve even contributed a early version library for visualizing social network data

Read More

11 Comments »

Top 10 Google Tech Talks

All Google Tech Talks are here (Google EngEDU is the actual name of the talk series). Thought I’d compile a top ten list…

  1. Python and Python 3000. Two talks about the Python language given by its inventor Guido van Rossum. The first is about the language’s origins and the second is about its future.
  2. Read More

4 Comments »

CoCitation vs. Bibliometric Coupling

I recently posted an efficient algorithm for computing the similarity of two Wikipedia pages (or any two nodes in a network) using cocitation similarity. Another type of similarity which may be worth considering is bibliometric coupling, in which two pages are similar if the pages they link to are similar. What is interesting is that it is only a few minor tweaks to the cocitation algorithm to compute bibliometricc coupling. Here’s the bibliometric coupling psuedocode (Perl style):

Read More

No Comments »

Wikipedia Page Similarities

I’m working on a visualization/map of Wikipedia pages. The map will layout pages close to one another if they are similar. So, in order to create such a map I need to compute the similarity of any two Wikipedia pages.

For my first attempt at this, I decided to go with a cocitation measure of similarity. So, two Wikipedia pages will be said to be similar if other Wikipedia pages that link to one usually link to the other.

However, the naive way to compute this, looking at every pair of pages, is far too inefficient given that there are 650,000 pages in the English Wikipedia, and 14.5 million pagelinks. So I’ve worked up a much more efficient algorithm. Here’s the psuedocode…I hope someone, somewhere out in cyberspace will find this useful. (It can, in fact, be used to compute co-citation similarities for any data represented as nodes and links)

Read More

1 Comment »

Visualization Google Tech Talks

15 Views of a Node Link Graph: An Information Visualization Portfolio by Tamara Munzner

Read More

No Comments »

35 Great Visualizations

Geographical & Historical

WorldProcessor. Globes overlaid with information. Beautiful…must see!
Wikisky Google maps for the stars.
Flight Patterns Visualizations of FAA data.
TextArc: History of Science Beautiful.
2007 Calender. Brad Paley design.
31 days in Iraq. Visualization of deaths in Iraq. Depressing.
Tracing the Visitor’s Eye Flickr tags on a geospatial basemap.
Schreiner International Cables Map. Old world map of cables.
Napolean’s March. Made famous by Edward Tufte.

Read More

No Comments »