Tips for Making Perl Programs Run Faster

In my daily work I tend to manipulate fairly large datasets, such as Wikipedia, U.S. patent data, the Netflix ratings, and IMDb.  Here are a few tricks I've come across so that you don't lose time waiting for your programs to finish.

  1. Use Storable
    • If you need to save a hash, array, or more complex data structure to use in its entirety at a later time, it is around 4 times faster to use Storable than to read/write it from a text file.  It's simple to use, just two calls (a runnable sketch of this and the other tips appears after this list)…
    • store(\%hash, $filename);
    • $hashref = retrieve($filename);
  2. Use a Berkeley Db for random access
    • Berkeley DBs are associative databases (as opposed to relational databases like MySQL, PostgreSQL, etc.), and they are quite useful.
    • Whereas Storable is the choice when we will later load the entire data structure back into memory, BDBs are the choice when we will only be performing random access later on.  So, for example, let's say I have all the IMDb movie titles in a hash and I want to save it to disk, but I know that later on I am only going to look up one or two titles at a time; that's a case where I want to use a BDB.
    • There are two modules for BDBs to choose from: DB_File or BerkeleyDB.  I usually use DB_File.
  3. Use Unix Grep
    • This might sound a bit off the wall, but when you have your data in a file, rather than reading it in and parsing it with Perl regular expressions, it can be up to 100 times faster to call grep or awk.  Grep is particularly useful when your task is simply to look something up in a text file.  In general, when trying to speed up code, don't be afraid to take advantage of the operating system.
    • system("command", $input, @files);
  4. Patterns for Storing Hash Structures
    • This tip is somewhat domain-dependent.  I often find myself using 2d hashes, primarily because I work with data represented as graphs (i.e. a set of edges represented as node-node-weight triples).  So these 2d hashes look like {id1}->{id2}=value.
    • When writing such a hash out to text, I'll use chr(1) to generate a delimiter and directly write one "id1 id2 value\n" line per entry (with chr(1) instead of spaces).
    • When using Storable, I don’t need to do anything special.
    • When using a BDB, there is no such thing as a 2d hash.  Either one has to create multiple BDBs or, usually the better option, convert the hash to {id1}=id2a|value|id2b|value… and use split when accessing the data.  Again, I like to use chr(1) instead of | as the delimiter (see the sketch below).
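
To make these tips concrete, here's a minimal sketch covering all four.  The file names and toy data are mine, purely for illustration, and error handling is kept to a minimum.

use strict;
use warnings;
use Storable qw(store retrieve);
use DB_File;
use Fcntl;

## Tip 1: round-trip a hash through Storable (binary, much faster
## than reading/writing a text file).
my %ratings = ('The Godfather' => 9.2, 'Plan 9 from Outer Space' => 4.0);
store(\%ratings, 'ratings.storable');
my $ratings_ref = retrieve('ratings.storable');

## Tip 2: tie a hash to a Berkeley DB file; lookups hit the disk, so
## only the entries you touch are ever loaded into memory.
my %titles;
tie %titles, 'DB_File', 'titles.db', O_RDWR|O_CREAT, 0666, $DB_HASH
    or die "Cannot open titles.db: $!";
$titles{'tt0068646'} = 'The Godfather';   ## written straight to disk
print $titles{'tt0068646'}, "\n";
untie %titles;

## Tip 3: let the OS do the scanning.  The list form of system()
## avoids shell quoting headaches.
system('grep', 'Godfather', 'titles.txt');

## Tip 4: flatten a 2d hash {id1}{id2}=w into one value per id1,
## delimited by chr(1), so it fits in a BDB.
my $SEP = chr(1);
my %edges = (a => { b => 1, c => 2 });
my %flat;
for my $id1 (keys %edges) {
    $flat{$id1} = join $SEP, %{ $edges{$id1} };   ## id2a,w,id2b,w...
}
my %row = split /$SEP/, $flat{'a'};               ## back to {id2}=w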

GapMinder Talk

I just read an article about Google buying a small company called GapMinder, which does data visualization.  I checked out the talk on the GapMinder homepage and would recommend watching the first 10 minutes of it.  The visualization tool used throughout the talk is something special… it's easy to see Google's interest.

I think GapMinder was the tool used by Hans Rosling to make his really awesome presentation at TED Talks last year. You can find all the talks here and search by name. Some of those talks are simply outstanding.

http://www.ted.com/index.php

Please look at his earlier talk too.

Great blog, by the way. Visualization brings the data alive.

It's online now… quite interesting…
http://tools.google.com/gapminder/

Visualizing the ‘Power Struggle’ in Wikipedia

A new visualization Bruce Herr and I recently completed is being featured in this week’s New Scientist Magazine (the article is free online, minus the viz).  They did a good job jazzing up the language used to describe the viz–’power struggle’, ‘bubbling mass’, ‘blitzed articles’–but they also dumbed down the technical accomplishments.  I guess not everyone gets as excited about algorithms as I do.

Before I talk any more about the viz, though, let me mention it's appearing at the NetSci 2007 Conference this week, and hopefully a variant will appear at Wikimania later this summer as well.  The viz is a huge 5 feet by 5 feet when printed; I only include a low-res, smaller version here.  At some point high-quality art prints of it will appear at SciMaps for sale to fund further visualization research.

Now for the good stuff.  Much like my visualization of the Netflix Prize competition data, we began this piece by representing the data as a network.  In this case the nodes in the network are Wikipedia articles and the edges are the links between articles.  We then (with some help from our friends at Sandia) used an algorithm to lay out all 650,000 nodes (Wikipedia articles) that had at least one link, in such a way that similar articles are near one another.  These are the yellow dots, which when viewed at low res give a yellow tint to the whole picture.

The sizes of the nodes (circles, dots, whatever you want to call them) are based on a model of revision activity.  So large circles indicate that an article might be controversial, or the subject of lots of vandalism, or just a topic whose content frequently changes.  We labeled only the largest nodes, to keep the image readable.  There is an interactive version of this in the works, based on the Google Maps platform, which will change the labels and pictures used as the user 'zooms' in or out.  Stay tuned for that.

The image used for each tile was selected automatically, simply by using the first image in the most linked-to article among all the articles in that tile.  We were pleasantly surprised by the quality of the images that appeared.
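
The selection logic amounts to something like the sketch below.  To be clear, this is not our actual code; $inlinks and $first_image are hypothetical precomputed maps (article => inlink count, article => first image in the article).

## Illustrative only: choose a tile's image as the first image of
## the most linked-to article in that tile.
sub tile_image {
    my ($tile_articles, $inlinks, $first_image) = @_;
    my ($best) = sort { ($inlinks->{$b} || 0) <=> ($inlinks->{$a} || 0) }
                 @$tile_articles;
    return $first_image->{$best};
}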

Our hope for this visualization approach, which we continue to improve on, is that it could be updated in real time to give a macro sense of what is happening in Wikipedia.  I personally hope that some variation of it will end up in high schools as a teaching tool and for generating discussions.

Top 20 Most Hotly Revised Articles

  • Jesus
  • Adolf Hitler
  • October 2003
  • Nintendo Revolution
  • Hurricane Katrina
  • India
  • RuneScape
  • Anarchism
  • Britney Spears
  • PlayStation 3
  • Saddam Hussein
  • Japan
  • Albert Einstein
  • 2004 Indian Ocean Earthquake
  • New York City
  • Germany
  • Muhammad
  • Pope Benedict XVI
  • Ronald Reagan
  • Hinduism

On Transfer Learning

Definition (from DARPA): The ability of a system to recognize and apply knowledge and skills learned in previous tasks to novel tasks

Current approaches involve either building a shared model of one or more domains (in the form of a case base, hierarchy, or relational schema) that couples the classifiers together, or creating a mapping between distinct representations.  Bayesian and neural approaches dominate the research thus far.

(from Droy 2007-IJCAI07) In spam filtering, a typical data set consists of thousands of labeled emails belonging to a collection of users.  In this sense, we have multiple data sets, one for each user.  Should we combine the data sets and ignore the prior knowledge that different users labeled each email?  If we combine the data from a group of users who roughly agree on the definition of spam, we will have increased the available training data from which to make predictions.  However, if the preferences within a population of users are heterogeneous, then we should expect that simply collapsing the data into an undifferentiated collection will make our predictions worse.

Resources
Caruana dissertation (1997).  Part of ALVINN
Berkeley 2005 course.  Reading list.  The focus is on Bayesian approaches.
Oregon State 2005 course. Probabilistic Relational Models.
DARPA Proposal.  Now in its third and final year.
CBR Approach.  Strategy game playing.
Wikipedia Entry

Workshops
NIPS 1995
Inductive Transfer : 10 Years Later (2005)

Google Tech Talk Review: Statistical Aspects of Data Mining

This is a talk series being given at Google by David Mease, based on a Master's-level stats course he is teaching this summer at Stanford.  It's easy listening if you already have some data mining or stats background.

The introduction (part 1) is particularly well done, as is the portion on association rule mining (parts 7 and 8).  These cover the first half of the course, which has already occurred… I'll add links as new sessions are added to Google Video.

Part 1: Introduction. Discussion of locations of potentially useful data (grocery checkout, apartment door card, elevator card, laptop login, traffic sensors, cell phone, Google badge, etc.).  Note mild obsession with consent.  Overview of predicting the future vs. describing patterns, and other broad areas of data mining.  Intro to R.

Part 2: Data. Reading datasets into excel and R. Observational (data mining) vs Experimental.  Qualitative vs quantitative.  Nominal vs ordinal.  And so on…

Part 3: Data cont. More Excel and R.  Sampling.

Part 4:  Plots. Histograms, ECDF.

Part 5:  More R plots.  Overlaying multiple plots. Statistical significance.  Labels in plots.

Part 6:  More R plots.  Box plots.  Color in plots.  Installing packages.  ACCENT principles and Tufte.

Part 7: Association Rules. Measures of location. Measures of spread.  Measures of association.  Frequent itemsets.  Similar to conditional probabilities.
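
Since the association rule parts boil down to support and confidence, a quick note: both are simple ratios of counts.  Here's a toy Perl version (the transactions are made up):

use strict;
use warnings;

## Toy support/confidence for the rule milk => bread.
my @transactions = (
    [qw(milk bread eggs)],
    [qw(milk bread)],
    [qw(bread eggs)],
    [qw(milk eggs)],
);
my ($n_a, $n_ab) = (0, 0);
for my $t (@transactions) {
    my %in = map { $_ => 1 } @$t;
    $n_a++  if $in{milk};
    $n_ab++ if $in{milk} && $in{bread};
}
my $support    = $n_ab / @transactions;  ## P(A and B) = 2/4
my $confidence = $n_ab / $n_a;           ## P(B | A)   = 2/3
printf "support=%.2f confidence=%.2f\n", $support, $confidence;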

Part 8: More association rule mining.  Support and confidence calculations. Personalization using rules. Beyond support and confidence.

Part 9: Review.

Part 10: Classification.  Overview.  A negative view of decision trees.  DTs in R.  Algos for generating DTs.

Part 11: More DTs.  Gini index.  Entropy. Pruning. Precision, recall, f-measure, and ROC curve.

Part 12: Nearest Neighbor. KNN.  Support Vector Machines. Adding 'slack' variables, using basis functions to make the space linearly separable. Some comments on Stats vs ML. Intro to ensemble (uncorrelated) classifiers.

Part 13: Last class.  Random Forests.  AdaBoost.  Some discussion of limits of classifiers (nondeterministic observational datasets).  Clustering.  K-Means.

Visualizing Science & Tech Activity in Wikipedia

If you didn't see our original Wikipedia Activity Visualization, check it out here (there's a detailed explanation as well).  Also, there is a Google Maps-style zoomable version here.

This new version uses the same layout and images (well, slightly improved) as the original, but this time we tried to highlight activity in regions of Wikipedia that are predominantly math, science, or technology.

So we developed a program to classify Wikipedia articles as belonging to one of these three categories (or none), based on the categories each article was assigned to and their positions in the Wikipedia category link network.
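
A naive stand-in for such a classifier (to be clear, not the program we actually used) might just walk up the category graph from each article's categories until it hits a seed category.  Here $parents and $seeds are hypothetical inputs:

use strict;
use warnings;

## $parents maps a category to an arrayref of its parent categories
## in the category link network; $seeds maps a few root categories to
## 'math', 'science', or 'technology'.
sub classify {
    my ($article_cats, $parents, $seeds) = @_;
    my @frontier = @$article_cats;
    my %seen;
    for my $depth (0 .. 5) {    ## cap how far up the graph we walk
        for my $cat (@frontier) {
            return $seeds->{$cat} if exists $seeds->{$cat};
        }
        @frontier = grep { !$seen{$_}++ }
                    map  { @{ $parents->{$_} || [] } } @frontier;
    }
    return 'none';
}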

We were not surprised to see a tight cluster of math pages, in a region, I would add, which has little ‘hot’ activity.  In fact, the only article in that region with lots of activity is the article “Earth”.  It was also not surprising that technology articles are fairly spread out among the topics.

What's striking is the science-related band (green-blue) that runs diagonally through the middle of the topic map.  I won't share my interpretation, but rather let those interested come up with their own.  Hope you enjoy; please leave comments!

Above: The most actively edited science-related articles.

Left: Not much science here… a good indication the algorithms are working pretty well!

ICCBR 2007 Highlights

ICCBR07 (International Conference on Case-Based Reasoning) is held in alternating years with the ECCBR conference.  The venue was Belfast, a city with a nice blue-collar charm to it; it seemed like a European version of my hometown of Green Bay.  I stayed in a Queens University dorm room, where I was constantly reminded that I am too old to be staying in the dorms.  I should have paid for a stay at the Europa Hotel, where the conference was held… classy place.

Day 1

Perceptions of CBR by David Aha.  Argued that CBR may become irrelevant if more theoretical results are not published.  Showed stats that recent CBR publications are system-oriented to back up his argument (though this may be true of AI in general).  Suggested that CBR researchers have theory envy toward machine learning practitioners.
Note: A thought that occurred to me is that CBR is more a set of design patterns, ones fairly accessible to the general public, given the diverse interests of the delegates present.
Credible Case Based Reasoning by Eyke Hullermeier. A formal treatment of the retrieval component of CBR (nicely timed to correspond to Aha’s argument).

Day 2

Databases and CBR by Larry Kerschberg.  History of database research.  Currently concerned with metadata, reasoning, and data provenance, and moving beyond 'just a schema' (what can happen to a data item).  Suggested using CBR for web resource discovery (Wikipedia, Amazon, Facebook as parts of the case base) as a major direction for CBR. Invitation to submit CBR work to a special issue of the Journal of Intelligent Information Systems on integrating artificial intelligence and database technologies.  Scaling and cross-domain issues were also brought up.  Ran out of time to cover his paper "Knowledge Sifter". He was quite interested in CaseML as well.
Empolis by Ralph Traphoner.  An Eclipse-based tool for building CBR systems.  It supports a plug-in architecture, a fact which alone makes the tool promising in my mind.  Not sure whether it is commercial or what license it carries.
Industry Panel by GE, GM, and Knexus.  Discussion of what industry wants.  Examples of money-making applications were a theme, not surprisingly.  Hybrid systems and Internet data were mentioned as well.  One researcher suggested that releasing data was not worth his time.  Another mentioned he has never heard of a case-based recommender in the field.  One audience member raised the concern that many outside the CBR community see CBR as just KNN and a lot of talk.  Something that strikes me about this community is that its "industry" is non-tech.

Day 3

My talk at the workshop on personalizing similarity measures.
jColibri by Juan Recio-Garcia.  CBR system using Lucene.

Day 4

Note: A Google Group on context in CBR is now up.
Note: Suggestion that the difference for most people between IR and TCBR is that TCBR systems have some inference step.
CBR in RoboCup Soccer by Hans-Dieter Burkhard. Funny videos!  I’ll try to post these.
Note: David Leake's paper on data provenance won best paper.  Congrats David!

Additional Comments

Note: A "cooking" contest is being held at next year's ECCBR, where recipes are cases.  I like the idea of a contest, and the crossover to the real world (the taste of the cooking matters); however, I wonder if the cooking domain allows for sufficient bragging rights.
Note: Major topics in CBR continue to be similarity measures, context awareness, text processing, and health care support systems.  The new topic for next year is to involve web 2.0 technologies.

35 Great Visualizations

Geographical & Historical

WorldProcessor. Globes overlaid with information. Beautiful… must see!
Wikisky. Google Maps for the stars.
Flight Patterns. Visualizations of FAA data.
TextArc: History of Science. Beautiful.
2007 Calendar. Brad Paley design.
31 Days in Iraq. Visualization of deaths in Iraq. Depressing.
Tracing the Visitor's Eye. Flickr tags on a geospatial basemap.
Schreiner International Cables Map. Old world map of cables.
Napoleon's March. Made famous by Edward Tufte.

Government

The State of the Union in Words. From the NYTimes.
Death and Taxes. Where taxes go to.
U.S. Frequency Allocations Chart. Radio frequency allocations.
2006 Election Results. Interesting maps.
Taxonomy Visualization of Patent Data. Patent viz.

Internet

IRC Arcs. IRC viz.
Welkin. RDF visualizer.
History Flow. Visualization of Wikipedia edit histories.
Who owns the Internet? Map of the net backbones.
Amaznode. Amazon product visualization.
Java Technology Concept Map. Relationships among Java technologies.
The Dumpster. Romance and blogs.

Pop Culture

Radio Protector. Music recommendation visualizer.
Movie and Actors. Visualization of IMDb data.

Genealogy

rhnav. Genealogy visualization.
20 Generations Family Tree. One particular family tree viz.
Swarm. Networks of websites.

Science

Tree of Life. Species visualization.
Map of Science. From the journal Nature.

Literature

TextArc. Visualizations of books and other writings.
Visual Poetry. Visualizations of poems.
The Voice. One of the better tag clouds I’ve seen.

Other

Thinking Machine 4. A chess match viz.
Ph.D. Thesis Map. Dissertation outline in the style of a subway map.
Newspaper Map. Very unusual.
Visual Elements Periodic Table.

Flash vs. Processing

Over the past year and a half I've been hooked on the language Processing. I've even contributed an early version of a library for visualizing social network data.

For those unfamiliar with Processing, it's a variant of Java, distinguished by its emphasis on interactive media. The fundamental unit in Processing is the sketch, which is entirely and continuously redrawn at some given rate.  Sketches may be compiled into either standalone apps or Java applets.

Many creative types work with Processing; here are some cool examples: The Dumpster, Thinking Machine 4 (pictured left), and Relations Environment. More can be seen in the Processing exhibition.  It's not hard to see why I was excited by the language.

I’m still excited by the Processing community and all the cool apps and exhibits they are turning out.  However, today I am much less excited by the technology.

  1. As an Internet content delivery mechanism, Java applets are a poor choice.  They have a large memory footprint, so they are slow to load, and they provide few options for communicating between client and server.
  2. Sketches are redrawn in their entirety with each tick of the clock, and there are no layers.  Why is this a problem?  It limits the number of graphical objects one can involve in a sketch before the sketch slows down.
  3. There are no user interface components of any quality, and Java Swing is not compatible with Processing.  While this is a real limitation, it is a somewhat hollow complaint, as I realize it is only a matter of time before quality UI widgets are created within Processing, for Processing.
  4. Java 1.5 is not supported.  My concern here is that one of Processing's greatest strengths is that a developer may use Java, and I wonder whether the Processing community will be able to maintain this integration over time.

Recently I’ve been looking into Flash (e.g., the chart in Google Finance is Flash) and have begun to believe that Flash is a better alternative.  It is great at content delivery, has convenient UI components and support, and the fundamental unit of Flash, the “movie”, can be redrawn using separate layers.

I am still learning ActionScript, the object-oriented language used for Flash.  My early impression is that the recent versions, 2.0 and 3.0, are of pretty high quality, though still a far cry from Java.

I'll post a follow-up in a few months, once I have more experience with Flash and ActionScript under my belt.

A Tutorial on Flash Remoting Using Perl

Flash remoting is a big improvement over forms/CGI for communication between Flash and the server.  There's a great little project called amfphp for using PHP with Flash remoting.  A whole lot less great (but appreciated!) version called AMF::Perl exists for Perl and Python. There is little documentation, so I thought I'd post an example.

Here are my remoting notes dealing with AMF::Perl.  For context, I was working on a movie recommendation system.

Steps:

(1) On the server side, in the cgi-bin, create a Perl file and call it, say, "recommendation.pl".   Your version of this file will look the same, except replace "/recommendation" with the appropriate subdirectory where you will put the code in step 2.

#!/usr/bin/perl -w
use strict;
use AMF::Perl;

## The gateway dispatches incoming remoting calls to the modules
## found under the base class path.
my $gateway = AMF::Perl->new;
$gateway->setBaseClassPath("./recommendation/");
$gateway->service();

(2) In a subdirectory, in this case "/recommendation", create a Perl module, in this case called "rec.pm".  This module contains the "methodTable" that will be accessed by the Flash code.  The example below has both get and set calls.

#!perl -w
use strict;

use AMF::Perl;
use lib 'Filesystem\\Active Projects\\'; ## directory where my server side code is
use SimMan::TitleLookup;                 ## my own server side perl modules
use SimMan::Recommendation;

package rec;

## must have this constructor
sub new
{
    my ($proto) = @_;
    my $self = {};
    bless $self, $proto;
    return $self;
}

## Every method callable from Flash must be listed here with
## "access" set to "remote".
sub methodTable
{
    return {
        "getFeatureWeights" => {
            "description" => "Returns the current feature weights",
            "access" => "remote",
        },
        "setFeatureWeights" => {
            "description" => "Sets the current feature weights",
            "access" => "remote",
        },
        "disambiguateQuery" => {
            "description" => "Returns possible titles and their ids",
            "access" => "remote",
        },
    };
}

sub getFeatureWeights
{
    my ($self) = @_;
    my $hash = SimMan::Readers::readFeatureWeights();
    my @array;
    push @array, $hash;    ## AMF::Perl expects an array reference back
    return \@array;
}

sub setFeatureWeights
{
    my ($self, $featureWeights) = @_;
    SimMan::Readers::setFeatureWeights($featureWeights);
}

sub disambiguateQuery
{
    my ($self, $usertext) = @_;
    return SimMan::TitleLookup::getCandidateTitles($usertext);
}

1;

(3) On the client side, first include the necessary files and give the location of the Perl file we just created.

Then you can directly call the methods listed in the methodTable.  Notice that the results sent back by the Perl module are returned to a function called "methodname_Result", where methodname is the name of the method that was called.

import mx.remoting.NetServices;
import mx.remoting.debug.NetDebug;

NetDebug.initialize();
NetServices.setDefaultGatewayUrl("http://localhost/cgi-bin/recommendation.pl");
var connection = NetServices.createGatewayConnection();
var recommender = connection.getService("rec", this);

// Results of rec.disambiguateQuery land here automatically.
function disambiguateQuery_Result(result) {
    if (result.length == 0) {
        Results.htmlText = "<b>No movies found</b>";
    }
    else {
        Results.htmlText = "<b>Are you looking for…</b><br><br>";
        for (var i = 0; i < result.length; i++) {
            Results.htmlText += result[i].id + " " + result[i].title + "<br>";
        }
    }
}

searchText.enter = function() {
    recommender.disambiguateQuery(searchText.text, 10);
};

Hope this helps someone!