Wikipedia Page Similarities

I’m working on a visualization, a ‘map’ if you will, of Wikipedia pages. The map will lay out pages close to one another when they are similar, so to create it I need to compute the similarity of any two Wikipedia pages.

For my first attempt at this, I decided to go with a cocitation measure of similarity. So, two Wikipedia pages are considered similar if the other pages that link to one tend to also link to the other.

However, the naive way to compute this, comparing every pair of pages, is far too inefficient: with 650,000 pages in the English Wikipedia that is over 200 billion pairs, fed by 14.5 million pagelinks. So I’ve worked up a much more efficient algorithm. Here’s the pseudocode (Perl style)…I hope someone, somewhere out in cyberspace will find this useful. (It can, in fact, be used to compute cocitation similarities for any data represented as nodes and links.)

my (%nodes, %links);   # the wikipedia pages and pagelinks: $links{$from}{$to} = 1
my %reverse;           # the pagelinks flipped around: $reverse{$to}{$from} = 1 (see the sketch below)
my %scores;            # 2d hash storing the final similarity scores

foreach my $node (keys %nodes) {

    my %cocitations;   # temporary cocitation counts for this node

    # every page that links to $node also "cocites" everything else it links to
    foreach my $sourceNode (keys %{ $reverse{$node} }) {
        foreach my $cocited (keys %{ $links{$sourceNode} }) {
            $cocitations{$cocited}++;
        }
    }

    # similarity score: 2 * cocitations / (inlinks of $node + inlinks of $node2)
    my $citationCount = keys %{ $reverse{$node} };
    foreach my $node2 (keys %cocitations) {
        my $citationCount2 = keys %{ $reverse{$node2} };
        $scores{$node}{$node2} =
            2 * $cocitations{$node2} / ($citationCount + $citationCount2);
    }
}
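The one step the listing glosses over is building %reverse from %links. A minimal sketch of that flipping step, assuming edges are stored as $links{$from}{$to} = 1 as above, and filling the %reverse declared in the listing:

# flip every pagelink: if $from links to $to, record that $to is linked from $from
foreach my $from (keys %links) {
    foreach my $to (keys %{ $links{$from} }) {
        $reverse{$to}{$from} = 1;
    }
}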

CoCitation vs. Bibliometric Coupling

I recently posted an efficient algorithm for computing the similarity of two Wikipedia pages (or any two nodes in a network) using cocitation similarity. Another type of similarity which may be worth considering is bibliometric coupling, in which two pages are similar if they tend to link to the same pages. What is interesting is that only a few minor tweaks to the cocitation algorithm are needed to compute bibliometric coupling. Here’s the bibliometric coupling pseudocode (Perl style):

my (%nodes, %links);    # the wikipedia pages and pagelinks: $links{$from}{$to} = 1
my %reverse;            # the pagelinks flipped around, built the same way as before
my %scores;             # 2d hash storing the final similarity scores

foreach my $node (keys %nodes) {

    my %biblioCounts;   # temporary counts of shared outgoing links for this node

    # every page that $node links to couples $node with the other pages linking to it
    foreach my $linkedNode (keys %{ $links{$node} }) {
        foreach my $node2 (keys %{ $reverse{$linkedNode} }) {
            $biblioCounts{$node2}++;
        }
    }

    # similarity score: 2 * shared links / (outlinks of $node + outlinks of $node2)
    my $linkCount = keys %{ $links{$node} };
    foreach my $node2 (keys %biblioCounts) {
        my $linkCount2 = keys %{ $links{$node2} };
        $scores{$node}{$node2} =
            2 * $biblioCounts{$node2} / ($linkCount + $linkCount2);
    }
}

Ranking Online Backup Services

This post is a bit off topic for this blog.  However, I recently decided I really needed an off-site backup service (er, I lost some files), and, as usual, I spent way too much time looking around the web for one.  Anyway, I thought I’d share my homework.  To give you an idea of my situation, I have about 30GB of data (code, datasets, visualizations) that I need constantly backed up.

Backup Services vs. Online Storage

There are a ton of options.  I just wanted an automatic backup service–one that works flawlessly.  Many services I looked at emphasize that your files can be accessed from anywhere; these often call themselves online storage rather than online backup.  The pure backup services usually offer unlimited space and focus on software for backing up changes on demand or when your computer is idle, and for restoring files.  The tradeoff is usually that files are stored compressed, and are only meant to be accessed for fairly rare restores.  Online storage services, by comparison, limit the storage space, and although they almost always advertise the backup aspect, files usually need to be transferred manually.  The advantage is that the files on such services can be accessed over the web.

I’m happy to keep the files I need access to wherever I go on my USB flash drive, so I just need an automatic backup service.

I’ve ranked the services based on my impressions of their offerings and personal experience…

Best Deals

  1. Mozy–I ended up going with Mozy as my backup service, and am very happy so far.  I go to lunch or a meeting, and come back to a message that my files were backed up while my computer was idle.  The software for selecting which files and folders to back up is quite easy to use.  However, Mozy is purely a backup service; all files are stored compressed on their servers.  So if you’re interested in remote access to your files, look elsewhere.  2GB free, or unlimited space for $4.95/month.
  2. Carbonite–Their software for automatic backup is really pretty slick.  I was torn between Carbonite and Mozy.  Like Mozy, this is strictly a backup service.  There is no free version; unlimited space for $49.95/yr.

Runners Up

  1. StreamLoad (MediaMax)–StreamLoad is rebranded across a number of sites, including MediaMax.com and the AMD Media Vault.  Initially, at least on paper, this seemed like the best deal to me.  They have a free 25GB plan (though they throttle downloads), which may be enough space for many folks.  It both supports automatic backup and provides anywhere access to your files.  It’s $4.95/month for the larger 100GB plan.  I said it looked good initially; then I downloaded the software, and it kept crashing on both my laptop and desktop.  Unacceptable 🙁  Maybe others will have a better experience.
  2. Files Anywhere–Provides both online access and automatic backup.  $11.95/month for 10GB.
  3. Box.net–Looks nice, but only 5GB for $4.95/month is too small for my needs.
  4. FlipDrive–Does not provide automatic backup.  $4.95/month for 20GB.
  5. Iomega IStorage–5GB for $19.99/month.  Yikes!  Too small a space, and too expensive.
  6. IBackup.com–5GB for $9.95/month, 50GB for $49.95/month.  Too little for too much.
  7. ShareFile.com–Pay by bandwidth.  $23.95/month for 2GB of bandwidth.

Foolish

  1. XDrive–Owned by AOL.  Provides automatic backup and online access.  Strictly a free service offering 5GB of space.  Reviews of its software are mixed, but mainly I decided to stay away because if I ever needed more than 5GB, there is no way to upgrade.
  2. GoDaddy–I’m a big GoDaddy fan; this blog runs on a GoDaddy hosting account.  But their online storage service, called “Online File Folder”, just does not look like a bargain: it’s 2GB for $20/yr, with no automatic backup software.
  3. True Share–Overpriced.  “As little as $30/month for 3GB”.
  4. Amazon S3–A web service, strictly for developers.  Might be a good deal for a small or medium-sized business.
  5. Yahoo Briefcase–A pathetic 25MB of storage.  Might as well just use Gmail for storage.

If you just need online storage and remote access, consider a regular hosting account

  1. An Hosting 
  2. Lunarpages 
  3. BlueHost
  4. Yahoo

Around the Web

  1. TechCrunch: The Online Storage Gang
  2. PC World: Online Storage

Tips for Making Perl Programs Run Faster

In my daily work I tend to manipulate fairly large datasets, such as Wikipedia, U.S. patents, Netflix ratings, and IMDb.  Here are a few tricks I’ve come across so that you don’t lose time waiting for your programs to finish.

  1. Use Storable
    • If you need to save a hash, array, or more complex data structure and later reload it in its entirety, it is around 4 times faster to use Storable than to write and re-parse a text file.  It’s simple to use: add use Storable; and then it’s just two calls…
    • store(\%hash, $filename);
    • $hashref = retrieve($filename);
  2. Use a Berkeley DB for random access
    • Berkeley DBs are associative (key/value) databases, as opposed to relational databases like MySQL, PostgreSQL, etc.  And they are quite useful.
    • Whereas Storable is the right choice when we will later load the entire data structure back into memory, BDBs are the right choice when we will only be performing random access later on.  So, for example, let’s say I have all the IMDb movie titles in a hash and I want to save them to disk, but I know that later on I will only look up one or two titles at a time; that’s a case where I want to use a BDB (see the sketch after this list).
    • There are two modules to choose from: DB_File or BerkeleyDB.  I usually use DB_File.
  3. Use Unix grep
    • This might sound a bit off the wall, but when your data is in a file, rather than reading it in and parsing it with Perl regular expressions, it can be up to 100 times faster to call grep or awk.  Grep is particularly useful when your task is simply to look something up in a text file.  In general, when trying to speed up code, don’t be afraid to take advantage of the operating system.
    • system("grep", $pattern, @files);  ## or capture the matches: my @lines = `grep $pattern $file`;
  4. Patterns for Storing Hash Structures
    • This tip is somewhat domain dependent.  I often find myself using 2d hashes, primarily because I work with data represented as graphs (i.e. a set of edges represented as node-node-weight).  So these 2d hashes look like {id1}->{id2}=value.
    • When writing such a hash out to text, I use chr(1) as the delimiter and directly write “id1 id2 value\n” (with chr(1) instead of the spaces) per entry.
    • When using Storable, I don’t need to do anything special.
    • When using a BDB, there is no such thing as a 2d hash.  Either create multiple BDBs or, usually the better option, flatten the hash to {id1}=id2a|value|id2b|value… and use split when accessing the data.  Again, I like chr(1) rather than | as the delimiter (see the second sketch after this list).
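Here is a minimal sketch of the DB_File approach from tip 2: tie a hash to a file once, then reopen it later and pull out individual keys without loading everything into memory.  The file name and the sample title are just placeholders.

use strict;
use warnings;
use DB_File;
use Fcntl;

# build the database once: tie a hash to a file and fill it
tie my %titles, 'DB_File', 'titles.db', O_RDWR|O_CREAT, 0666, $DB_HASH
    or die "Cannot open titles.db: $!";
$titles{'tt0068646'} = 'The Godfather';
untie %titles;

# later: random access to single titles, without loading the whole file
tie my %lookup, 'DB_File', 'titles.db', O_RDONLY, 0666, $DB_HASH
    or die "Cannot open titles.db: $!";
print $lookup{'tt0068646'}, "\n";
untie %lookup;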
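And a sketch of the flattening pattern from tip 4, using a made-up %scores hash shaped like {id1}->{id2}=value.  The %db hash here is a stand-in for one tied to DB_File as in the previous sketch; chr(1) separates the fields.

use strict;
use warnings;

# hypothetical 2d hash of edge weights
my %scores = ( 'page_a' => { 'page_b' => 0.5, 'page_c' => 0.25 } );
my %db;          # stand-in for a hash tied to DB_File as above
my $d = chr(1);  # delimiter that will not appear in the data

# flatten each row into one string: id2a <d> value <d> id2b <d> value ...
foreach my $id1 (keys %scores) {
    $db{$id1} = join $d, map { ($_, $scores{$id1}{$_}) } keys %{ $scores{$id1} };
}

# later: pull back a single row and rebuild the inner hash with split
my %row = split /$d/, $db{'page_a'};
print $row{'page_b'}, "\n";   # 0.5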

A Tutorial on Flash Remoting Using Perl

Flash remoting is a big improvement over forms/CGI for communication between Flash and the server.  There’s a great little project called amfphp for using PHP with Flash remoting.  There’s a far less polished (but appreciated!) version called AMF::Perl for Perl and Python.  There is little documentation, so I thought I’d post an example.

Here are my remoting notes dealing with AMF::Perl.  For context, I was working on a movie recommendation system.

Steps:

(1) On the server side, in the cgi-bin, create a Perl file and call it, say, “recommendation.pl”.  Your version of this file will look the same, except replace “/recommendation” with the subdirectory where you will put the code in step 2.

#!perl  -w
use strict;
use AMF::Perl;

my $gateway = AMF::Perl->new;
$gateway->setBaseClassPath("./recommendation/");
$gateway->service();

(2) In a subdirectory, in this case “/recommendation”, create a Perl module, in this case called “rec.pm”.  This module contains the “methodTable” which will be accessed by the Flash code.  The example below has both get and set calls.

#!perl  -w

use AMF::Perl;
use lib 'Filesystem\\Active Projects\\'; ## directory where my server side code is
use SimMan::TitleLookup;     ## my own server side Perl modules
use SimMan::Recommendation;

package rec;
## must have this constructor
sub new
{
    my ($proto)=@_;
    my $self={};
    bless $self, $proto;
    return $self;
}

sub methodTable
{
    return {
        "getFeatureWeights" => {
            "description" => "Returns the current feature weights",
            "access" => "remote",
        },
        "setFeatureWeights" => {
            "description" => "Sets the current feature weights",
            "access" => "remote",
        },
        "disambiguateQuery" => {
            "description" => "Returns possible titles and their ids",
            "access" => "remote",
        },
    };
}

sub getFeatureWeights
{
    my ($self) = @_;
    my $hash = SimMan::Readers::readFeatureWeights();
    my @array;
    push @array, $hash;
    return \@array;
}

sub setFeatureWeights
{
    my ($self, $featureWeights) = @_;
    SimMan::Readers::setFeatureWeights($featureWeights);
}

sub disambiguateQuery
{
    my ($self, $usertext) = @_;
    return SimMan::TitleLookup::getCandidateTitles($usertext);
}

1;

(3) On the client side, first include the necessary files and give the location of the Perl files we just created.

Then you can directly call the methods listed in the methodTable.  Notice that the result sent back by the Perl module is delivered to a callback function named “method_Result”, where method is the name of the remote method (e.g. disambiguateQuery_Result below).

import mx.remoting.NetServices;
import mx.remoting.debug.NetDebug;   // NetDebug lives in the debug package
NetDebug.initialize();
NetServices.setDefaultGatewayUrl("http://localhost/cgi-bin/recommendation.pl");
var connection = NetServices.createGatewayConnection();
var recommender = connection.getService("rec", this);

function disambiguateQuery_Result(result){
    if (result.length == 0){
        Results.htmlText = "<b>No movies found</b>";
    }
    else{
        Results.htmlText = "<b>Are you looking for…</b><br><br>";
        for (var i = 0; i < result.length; i++) {
            Results.htmlText += result[i].id + " " + result[i].title + "<br>";
        }
    }
    }
}

searchText.enter = function(){
    recommender.disambiguateQuery(searchText.text, 10);
};

Hope this helps someone!