In my daily work I tend to manipulate fairly large datasets, such as Wikipedia, U.S. Patents, Netflix Ratings, and IMDb.  Here are a few tricks I’ve come across so that you don’t lose time waiting for your programs to finish.

  1. Use Storable
    • If you need to save a hash, array, or more complex data structure to use in its entirety at a later time, it is around 4 times faster in my experience to use Storable than to read/write it from a text file.  It’s simple to use, just two calls…
    • store(\%hash, $filename);
    • $hashref = retrieve($filename);
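For a complete round trip, here is a minimal sketch. The %ratings data and the file name are made up for illustration, and the file is written to a temporary directory so the script cleans up after itself.

```perl
use strict;
use warnings;
use Storable qw(store retrieve);
use File::Temp qw(tempdir);

# Hypothetical nested structure; any mix of hashes and arrays works.
my %ratings = (
    'Alien'  => { year => 1979, votes => [ 9, 8, 10 ] },
    'Brazil' => { year => 1985, votes => [ 8, 7 ] },
);

# Write into a temporary directory that is removed at exit.
my $dir      = tempdir(CLEANUP => 1);
my $filename = "$dir/ratings.sto";

# store() takes a reference and serializes the whole structure.
store(\%ratings, $filename);

# retrieve() returns a reference to a fresh copy of the structure.
my $restored = retrieve($filename);

print "Alien was released in $restored->{Alien}{year}\n";
print "Brazil has ", scalar @{ $restored->{Brazil}{votes} }, " votes\n";
```

Note that retrieve always hands back a reference, so a stored hash comes back as a hash reference rather than a plain hash.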
  2. Use a Berkeley DB for random access
    • Berkeley DBs are associative (key-value) databases, as opposed to relational databases like MySQL, PostgreSQL, etc.  And they are quite useful.
    • Whereas we want to use Storable when we will later load the entire data structure back into memory, we want to use BDBs when we will only perform random access at a later time.  So, for example, let’s say I have all the IMDb movie titles in a hash, and I want to save it to disk.  But I know that later on I am only going to want to look up one or two titles at a time, so that’s a case where I want to use a BDB.
    • There are two modules to choose from for BDBs: DB_File or BerkeleyDB.  I usually use DB_File.
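As a rough sketch of the DB_File approach, the script below ties a hash to an on-disk file, writes a few entries, then reopens the file read-only and looks up a single key without loading the rest. The ids, titles, and file name are invented for illustration.

```perl
use strict;
use warnings;
use Fcntl;      # provides O_RDWR, O_CREAT, O_RDONLY
use DB_File;
use File::Temp qw(tempdir);

my $dir  = tempdir(CLEANUP => 1);
my $file = "$dir/titles.db";

# Write phase: every assignment to the tied hash goes to the file.
my %titles;
tie %titles, 'DB_File', $file, O_RDWR | O_CREAT, 0666, $DB_HASH
    or die "Cannot create $file: $!";
$titles{id1001} = 'Alien';      # hypothetical ids
$titles{id1002} = 'Brazil';
$titles{id1003} = 'Casablanca';
untie %titles;

# Read phase: reopen read-only and fetch only the key we need.
my %lookup;
tie %lookup, 'DB_File', $file, O_RDONLY, 0666, $DB_HASH
    or die "Cannot open $file: $!";
my $title = $lookup{id1002};
untie %lookup;

print "id1002 is $title\n";
```

The tied hash behaves like an ordinary Perl hash, so existing lookup code usually needs no changes beyond the tie and untie calls.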
  3. Use Unix grep
    • This might sound a bit off the wall, but when you have your data in a file, rather than reading it in and parsing it using Perl regular expressions, it can be up to 100 times faster to call grep or awk.  grep is particularly useful when your task is simply to look something up in a text file.  In general, when trying to speed up code, don’t be afraid to take advantage of the operating system.
    • system("command", $input, @files);
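The system call above lets the command print straight to the terminal. When the matches are needed back in Perl, one option is a piped open in list form, which also avoids shell quoting problems. This is only a sketch: it assumes a grep binary is on the PATH, and the sample file and search string are made up.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Build a small sample file so the example is self-contained.
my $dir  = tempdir(CLEANUP => 1);
my $file = "$dir/titles.txt";
open(my $out, '>', $file) or die "Cannot write $file: $!";
print {$out} "Alien\t1979\n", "Brazil\t1985\n", "Aliens\t1986\n";
close($out) or die "Cannot close $file: $!";

my $needle = 'Alien';

# List-form piped open: the arguments bypass the shell entirely.
# -F treats the pattern as a fixed string; -- ends option parsing.
open(my $grep, '-|', 'grep', '-F', '--', $needle, $file)
    or die "Cannot run grep: $!";
my @matches = <$grep>;
close($grep);    # grep exits non-zero when nothing matches, so no die here
chomp @matches;

print "$_\n" for @matches;
```

Whether this beats a pure Perl loop depends on the file size and the pattern, so it is worth timing both on your own data.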
  4. Patterns for Storing Hash Structures
    • This tip is somewhat domain dependent.  I often find myself using 2D hashes, primarily because I work with data represented as graphs (i.e. a set of edges represented as node-node-weight).  So these 2D hashes look like {id1}->{id2}=value.
    • When writing such a hash out to text, I use chr(1) to generate a delimiter and write “id1 id2 value\n” per entry, with chr(1) in place of the spaces.
    • When using Storable, I don’t need to do anything special.
    • When using a BDB, there is no such thing as a 2D hash.  One can either create multiple BDBs or, usually the better option, convert the hash to {id1}=id2a|value|id2b|value… and use split when accessing the data.  Again, I like to use chr(1) instead of | as a delimiter.
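Both layouts can be sketched in a few lines. The graph data below is invented, and a plain hash (%flat) stands in for the tied BDB hash so the example runs without a database file.

```perl
use strict;
use warnings;

my $SEP = chr(1);    # control character unlikely to appear in ids or values

# Hypothetical weighted graph stored as a 2D hash.
my %graph = (
    a => { b => 0.5, c => 2 },
    b => { c => 1 },
);

# Text-file layout: one "id1 id2 value" record per edge, joined by chr(1).
my @lines;
for my $id1 (sort keys %graph) {
    for my $id2 (sort keys %{ $graph{$id1} }) {
        push @lines, join($SEP, $id1, $id2, $graph{$id1}{$id2}) . "\n";
    }
}

# BDB layout: flatten each inner hash into a single delimited string.
# %flat is an ordinary hash here; in practice it would be tied to a BDB.
my %flat;
for my $id1 (keys %graph) {
    my $row = $graph{$id1};
    $flat{$id1} = join($SEP, map { ($_, $row->{$_}) } sort keys %$row);
}

# Access: split the string back into id2 => value pairs.
my %neighbours = split /\Q$SEP\E/, $flat{a};
print "a -> c has weight $neighbours{c}\n";
```

This packing assumes that neither the ids nor the values ever contain chr(1) and that no value is an empty string, since split drops trailing empty fields.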