ETech Presentation on Ensemble Machine Learning

Just wanted to put up my slides from ETech this past week.  The talk is pretty similar to the talk I posted a few months ago, just a bit more fleshed out.
[ppt][pptx][pdf]

Unfortunately, I only made it to the conference for the day I was speaking.  Beautiful venue.  Seemed that most of the buzz related to social networking issues and climate change.  Would have liked to have heard Peter Norvig’s talk.  Maybe another year.

Top 10 Google Tech Talks

All Google Tech Talks are here (Google EngEDU is the actual name of the talk series). Thought I’d compile a top ten list…

  1. Python and Python 3000. Two talks about the Python language given by its inventor Guido van Rossum. The first is about the language’s origins and the second is about its future.
  2. How Open Source Projects Survive Poisonous People (And You Can Too). Really liked this talk. Really liked it! Given by the lead developers of Subversion (and other large projects), this talk provides a guide to working as a team. Next time I lead a project, I am going to ask everyone to watch this before starting work.
  3. Winning the DARPA Grand Challenge. The story of the robot race through the Mojave Desert. Having been on a Grand Challenge team, I appreciate just how hard it was to win.
  4. Scrum, et al. An excellent talk about the Scrum agile software development methodology.
  5. Wikipedia and MediaWiki. A talk about the implementation of Wikipedia given by its original (and, for a long time, only) paid staff developer. Not a very dynamic talk, but the insider perspective is interesting.
  6. Computers versus Common Sense. Doug Lenat gives a talk about the famous AI project Cyc. I thought this project was put to rest a long time ago. Guess I was wrong. Anyway, if you like symbolic artificial intelligence, it’s a really interesting talk.
  7. Scholarly Data, Network Science, and (Google) Maps. A very good information visualization talk.
  8. 15 Views of a Node Link Graph: An Information Visualization Portfolio. A bunch of visualization techniques. I think Tamara Munzner leads some of the most interesting visualization work anywhere.
  9. Human Computation. A talk about harnessing human knowledge for tasks such as spam filtering and image recognition.
  10. Scrum Tuning: Lessons learned from Scrum implementation at Google. A talk about the experience of using Scrum given by one of its inventors.

 

Installing WordPress on GoDaddy

Setting up WordPress on a GoDaddy hosting account is really not difficult (this blog is proof that it can be done!).  Below are my notes on the process.  If you glance at these steps and don’t want to mess around with this, consider using one of the following hosting services, which come with WordPress pre-installed (fairly rare): AN Hosting, Lunarpages, BlueHost, Yahoo

Steps for installing WordPress on a GoDaddy Hosting Account

1. Get an account.  If you haven’t already, purchase a hosting account.  I chose the Deluxe plan, which really isn’t very expensive.  You’ll be emailed directions after you purchase the account.  The email will say it takes 24-48 hrs to activate, but it actually only takes 20 minutes or so.

2. Log in to “My Account”.  The login is on the GoDaddy homepage.  On the My Account screen, click “Hosting Account List”.  Then click “open” under Control Panel.  You should end up at the “Hosting Manager”.

3. Create a MySQL database.  WordPress stores its data in MySQL.

  • Click the MySQL icon.  Then click “Create New Database”. Name the db “wordpress”.
  • Create a db login.
  • Confirm.
  • Submit.  Wait a minute. Then refresh. The status should change to “setup”.
  • Click the db name.
  • Highlight the hostname and copy it (ctrl-c).  You’ll need it for the WordPress config file.

4.  Download WordPress.  Unzip the files.

5.  Configure the file wp-config.php.  Change the following lines using your information.

define('DB_NAME', 'wordpress');
define('DB_USER', 'username');
define('DB_PASSWORD', 'password');
define('DB_HOST', 'localhost');

6.  Upload the WordPress directory to your GoDaddy account.  You’ll need an FTP client to upload files to your account (I use SmartFTP) and you’ll need the FTP address for your site.  Your address is ftp.yourdomain.com.  Put the files in your top-level directory; that way, when you go to www.yourdomain.com, it will load WordPress.
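If you prefer a script to a GUI client, the upload step can be sketched with Python’s built-in ftplib.  The host, username, and password here are placeholders for your own account details.

```python
# Sketch of step 6 with Python's standard-library ftplib instead of a
# GUI client. "ftp.yourdomain.com", "username", and "password" are
# placeholders -- substitute your own GoDaddy details.
import ftplib
import os

def upload_tree(ftp, local_dir):
    """Recursively upload the contents of local_dir to the current remote dir."""
    for name in os.listdir(local_dir):
        local_path = os.path.join(local_dir, name)
        if os.path.isdir(local_path):
            try:
                ftp.mkd(name)  # directory may already exist on the server
            except ftplib.error_perm:
                pass
            ftp.cwd(name)
            upload_tree(ftp, local_path)
            ftp.cwd("..")
        else:
            with open(local_path, "rb") as f:
                ftp.storbinary("STOR " + name, f)

# Usage (placeholders):
#   ftp = ftplib.FTP("ftp.yourdomain.com")
#   ftp.login("username", "password")
#   upload_tree(ftp, "wordpress")
#   ftp.quit()
```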

7. Test WordPress.  There are detailed directions for configuring WordPress here.

Nice, simple info. A couple of small suggestions, because I host most of my clients on GoDaddy, and many of them are beginning to use WordPress.

After downloading the WordPress files, you’ll need to rename wp-config-sample.php to wp-config.php before you upload the edited file.

Also, I’ve found that I do need to replace the ‘localhost’ value in the config file with the GoDaddy database hostname. Go to the GoDaddy database control panel and look for a hostname something like this — p91mysql121.secureserver.net.

Thanks for letting me add these items. Hope they may save someone a little time and frustration.

I have documented my struggles with WordPress on GoDaddy on my blog. In addition to the steps mentioned in the article and the helpful additions in the comments, there are a few common issues that popped up for me. I have listed the problems and the resolutions on my blog as well. Hopefully someone finds this informative.
http://www.kelath.net/blog/index.php/2006/01/23/set-up-wordpress-the-easy-way/

BTW, I am on their Economy hosting plan on IIS.

 

KDD 2011: Recap

Often I write some kind of summary when I get back from a conference, but today I ran into Justin Donaldson, a fellow Indiana alum, who said he still occasionally finds the content, or at least the visualizations, on my blog useful. Anyhow, I felt motivated to write my notes up as a blog post.

Overall, many of the talks I attended were about latent factor models and stochastic gradient descent. A few mentions of gradient boosted trees. Good stuff. My favorite talks…

  • Charles Elkan’s “A Log-Linear Model with Latent Features for Dyadic Prediction”. A scalable method to do collaborative filtering using both latent features and “side information” (user/item content features). Can’t wait to try it out! Here’s a link.
  • Peter Norvig’s Keynote about data mining at Google. Some tidbits:
    • Google mostly uses unsupervised or semi-supervised learning, both because of the cost of labeling and because labels themselves can be an impediment to higher accuracy.
    • He showed a great graph of accuracy on the word sense disambiguation task for several algorithms, plotted against the amount of training data used. The worst-performing algorithm, when given an order of magnitude more data, always beat the best-performing algorithm. A great argument for simple learning algorithms at very large scale.
    • They are very interested in transfer learning.
  • KDD Cup. Topic was music recommendation using Yahoo’s data.
    • Many of the same ideas and observations as in the Netflix Prize. Neighbor models and latent factor models trained with stochastic gradient descent seemed pervasive.
    • Ensembles were necessary to win, but the accuracy improvement wasn’t huge over the best individual models. Quite the argument for not using ensembles in industry.
    • Yahoo organizers made some really interesting comments about the data. Among them that the mean rating is quite different for power users, which makes sense. And the data is acquired from different UI mechanisms, if I understood correctly, which impacts the distributions.
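For the curious, the core of the latent factor + SGD recipe that was pervasive in the Cup (and the Netflix Prize) can be sketched in a few lines.  This is a minimal illustration with made-up hyperparameters, not any team’s actual model.

```python
# Minimal latent factor model trained with stochastic gradient descent,
# in the spirit of the Netflix Prize / KDD Cup solutions. Pure Python;
# the rank k, learning rate, and regularization are illustrative only.
import random

def train(ratings, n_users, n_items, k=8, lr=0.01, reg=0.05, epochs=100):
    """ratings: list of (user, item, rating) triples."""
    random.seed(0)
    # Small random factor matrices: one k-vector per user and per item.
    P = [[random.gauss(0, 0.1) for _ in range(k)] for _ in range(n_users)]
    Q = [[random.gauss(0, 0.1) for _ in range(k)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            pred = sum(P[u][f] * Q[i][f] for f in range(k))
            err = r - pred
            for f in range(k):
                pu, qi = P[u][f], Q[i][f]
                P[u][f] += lr * (err * qi - reg * pu)  # SGD step, user factor
                Q[i][f] += lr * (err * pu - reg * qi)  # SGD step, item factor
    return P, Q
```

The prediction for a (user, item) pair is just the dot product of the two learned factor vectors; neighbor models and ensembles would be layered on top of something like this.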

Looking forward to tomorrow!

Three Favorites from the Knight News Challenge

The Knight News Challenge, now in its 5th year, is a very cool effort to fund ideas that mash up aspects of the news industry with new tech.  Looking at the winners from the past four years…I think most are more about inspiration than viability, and that’s just fine.  Here are three that stood out to me…

MediaBugs.org. The idea is to have a mechanism to allow the ‘crowd’ to report errors in reporting.  Cool idea…probably needs a way for a user to trivially report an error while reading from a news site, like nytimes.com, to gain traction.

ushahidi.com.  The idea here is a map-based system for sharing information in a crisis.  Apparently it was used following the Haiti earthquake.  Very cool!!

documentcloud.org. The idea is to facilitate sharing of the documents used as sources in news stories.  I could imagine this growing into community or algorithmic fact checking.

Getting Really Large Images onto WordPress

I wanted to add zoomable versions of some very large visualizations to the site this evening.  So I uploaded them to GigaPan (actually, some had been uploaded years ago) and embedded them on the site, and everything works great!!  Click on the ‘Data Art’ tab above to see for yourself.  Here are the steps if you’re interested…

  1. GigaPan only accepts images that are 50 megapixels or more, which is really large.  If your image is large, but not that large, download SmillaEnlarger and increase the size a little.
  2. Sign up for a GigaPan account if you don’t already have one.
  3. Download GigaPan uploader.
  4. Install the uploader, and upload the image.  The software was pretty easy to use.  When the upload is finished, you’ll get a URL for the image that includes a 5-digit ID.  One of mine was 65469, for example.
  5. On the wordpress page or post where you want the image, place the following code (replacing 65469 with your image’s id):
    <iframe src="http://www.gigapan.org/media/gigapans/65469/options/nosnapshots/iframe/flash.html?height=400" frameborder="0" height="400" scrolling="no" width="100%"></iframe>
  6. That’s it!  You’ll get something like this…
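As an aside on step 1, the enlargement factor needed to hit GigaPan’s 50-megapixel minimum is easy to compute before firing up SmillaEnlarger.  This is a quick illustrative helper, not part of any GigaPan tooling.

```python
# Compute the linear (width/height) scale factor an enlarger would need
# to apply so an image reaches GigaPan's 50-megapixel minimum.
import math

MIN_PIXELS = 50_000_000  # GigaPan's stated minimum

def scale_needed(width, height):
    """Return the uniform scale factor to reach 50 MP (1.0 if already big enough)."""
    pixels = width * height
    if pixels >= MIN_PIXELS:
        return 1.0
    return math.sqrt(MIN_PIXELS / pixels)
```

For example, a 5000×4000 image (20 MP) needs each side enlarged by about 1.58× to cross the threshold.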

First Data Visualization Meetup—Nov 10th

Lately Zhou Yu and I have been working to start a data visualization meetup in the Bay Area…something that surprisingly doesn’t already exist.  Well (finally!) we have scheduled our first meetup, a talk by Stamen CEO Eric Rodenbeck (bio).  We’re still looking for a regular venue for our meetups (ideally, one in SF and one on the Peninsula), but for now, Jeff Heer of Stanford has been good enough to allow us to use a classroom on campus.  I have no doubt this topic is going to attract a fantastic group of people.  Come join us!!

20 Useful Visualization Libraries

Well, not entirely limited to libraries.  Useful stuff for visualization practitioners sounded a little non-specific, though.  These are all freely available.

1. Prefuse (Java) & Flare (Flex)

2. SIMILE (AJAX)

3. Processing (Java)

4. GigaPan (Service)

5. Modest Maps (Flash, Python)

6. Google Visualization API (Javascript)

7. Google Chart API (Javascript)

8. Google Maps API (Javascript, Flash)

9. Graphviz (wrappers for a dozen languages including Java, Perl, Python)

10. JFree (Java)

11. pChart (PHP)

12. OpenLayers (JavaScript)

13. Anti-Grain (C++)

14. JGraph (Java)

15. Boost Graph Library (C++, Python wrapper)

16. Open Flash Chart (Flash)

17. Ubigraph (Wrappers for Python, Java, C, and more)

18. JUNG (Java)

19. TimeMap (Java)

20. Many Eyes (online service)

Tuning Search Engine Components

For the past couple of years I’ve been primarily involved with engineering models used in search engines.  At times I’ve run into situations where a model I’m using or developing has some parameters that need to be set.  For example, a model might have a parameter that is a threshold on the number of times a keyword will be counted before we decide that additional occurrences are probably spam (and, yes, I’m talking about BM25 here).  And, at times, either the cost function I would like to use to set the parameters is not differentiable (yeah, I’m thinking about DCG), or I’m perfectly happy to use a quick and dirty method.  So I end up going with a direct search algorithm.  Here’s what I’ve learned (and haven’t forgotten)…

  • I don’t know of any direct search method that scales to more than a dozen-ish parameters.
  • Apache Commons Math has two direct search algorithms implemented in its optimization package that are a great place to start.  The package also provides a framework for defining the cost function.  Check it out: http://commons.apache.org/math/userguide/optimization.html
  • Implementations abound in which each parameter is iteratively changed, using a heuristic for direction and possibly momentum for the changes.  Evaluation of the cost function usually happens after a single parameter is updated, rather than only after an epoch.  Here is a good example lifted from a paper describing the winning solution to the Netflix Prize (http://www.netflixprize.com/assets/ProgressPrize2008_BigChaos.pdf)…
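To make that last idea concrete, here is a toy sketch of a coordinate-wise direct search: nudge one parameter at a time, evaluate the (possibly non-differentiable) cost after each single change, grow the step on success, and reverse and shrink it on failure.  The growth/shrink constants are illustrative heuristics, not values from the paper.

```python
# Toy coordinate-wise direct search: change one parameter at a time and
# evaluate the cost after each single change. Works on non-differentiable
# costs since it never needs a gradient. Constants (1.2, -0.5) are
# illustrative, not from any particular paper.
def coordinate_search(cost, params, step=0.1, epochs=100):
    params = list(params)
    steps = [step] * len(params)   # per-parameter signed step sizes
    best = cost(params)
    for _ in range(epochs):
        for i in range(len(params)):
            trial = list(params)
            trial[i] += steps[i]
            c = cost(trial)        # evaluate after each single update
            if c < best:
                best, params = c, trial
                steps[i] *= 1.2    # keep going, a little faster
            else:
                steps[i] *= -0.5   # back up and take smaller steps
    return params, best
```

Like all direct search, this makes many cost evaluations, which is why it stops being practical beyond a handful of parameters.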

Network Visualization for Systems Biology

This is a quick look at the state-of-the-art of network visualization in systems biology. It’s an interesting topic on its own (and my day job at the moment), and also as it relates to the visualization of other types of networks, such as social networks (think Facebook). Systems biology is all about looking at proteins, pathogens, and more, within the contexts in which they interact. Naturally, then, the visualizations that tend to be particularly useful are those such as network visualizations that can provide macro understanding of the interactions.  Questions such visualizations help with include those of the form “if a drug affects protein X, what else will it affect?”

The Networks
Quite a bit of interesting complexity is present in these interaction networks (the data).  They are often small-world, disassortative (unlike social networks), scale-free, and exhibit modularity.  Biologists are usually either interested in looking at larger-scale cell-level networks, or meaningful sub-networks called pathways, which typically are in the range of 50-500 nodes.

Making life interesting, duplicate nodes representing different states are often included.  The edges are directed, and may be hyperedges when multiple nodes necessarily interact together. And, in truth, the edges are often approximations of the actual interactions in the underlying network.  These approximations come from experimental findings published in journals.

This image is part of Roche Applied Science’s “Biochemical Pathways” series of wall charts.  The charts are in the style of circuit diagrams, which seems to be the most common 2-D representation of metabolic pathways.  This set seems to have been particularly influential.  The appeal of this ‘map’ is likely its scale.  Viewers can spend a great deal of time exploring.  In visualization there is a notion of ‘information density’: the more visual attributes used to convey the data, the more information may be present in the visualization.  This image has a very high information density.

Layout

In general (not just systems biology), network/graph layout (choosing where to place the nodes and edges) is done with consideration for (A) the network topology and (B) the aesthetics.  The primary topology concern is to place connected node pairs near one another and unconnected pairs apart.  The primary aesthetic concerns are to ensure that nodes do not overlap, edges do not cross, and labels are readable.

However, nodes in systems biology often also have biologically significant locations associated with them (e.g., within a cell, or within the nucleus of a cell).  The most common way of handling this location information is to lay out the network in the standard manner, but constrain each node to the compartment/level designated as extracellular, membrane, cytoplasm, nucleus, etc.  This visualization, created with the Cerebral plugin for Cytoscape, is the best example of this that I know of.
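A toy sketch of that constrained-layout idea (not Cerebral’s actual algorithm): run ordinary force-directed iterations, but clamp each node’s y-coordinate to a band for its compartment.  The band coordinates and the graph are made up for illustration.

```python
# Toy compartment-constrained force-directed layout: standard repulsion
# and spring attraction, but each node's y is clamped to a band for its
# compartment. Band coordinates are invented for illustration.
import math
import random

BANDS = {"extracellular": (0.0, 1.0), "membrane": (1.0, 2.0),
         "cytoplasm": (2.0, 3.0), "nucleus": (3.0, 4.0)}

def layout(edges, compartment, iters=200, k=0.3, step=0.05):
    """compartment: dict node -> band name; returns dict node -> [x, y]."""
    random.seed(0)
    nodes = sorted(compartment)
    pos = {n: [random.uniform(0, 4), sum(BANDS[compartment[n]]) / 2.0]
           for n in nodes}
    for _ in range(iters):
        force = {n: [0.0, 0.0] for n in nodes}
        for i, a in enumerate(nodes):          # pairwise repulsion
            for b in nodes[i + 1:]:
                dx = pos[a][0] - pos[b][0]
                dy = pos[a][1] - pos[b][1]
                d2 = dx * dx + dy * dy + 1e-6
                f = k * k / d2
                force[a][0] += f * dx; force[a][1] += f * dy
                force[b][0] -= f * dx; force[b][1] -= f * dy
        for a, b in edges:                     # spring attraction
            dx = pos[b][0] - pos[a][0]
            dy = pos[b][1] - pos[a][1]
            force[a][0] += dx; force[a][1] += dy
            force[b][0] -= dx; force[b][1] -= dy
        for n in nodes:
            pos[n][0] += step * force[n][0]
            pos[n][1] += step * force[n][1]
            lo, hi = BANDS[compartment[n]]     # the Cerebral-style constraint
            pos[n][1] = min(max(pos[n][1], lo), hi)
    return pos
```

The clamp is the whole trick: topology still drives the x positions, while biology dictates the vertical layering.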

Realism

Most of the network visualization tools for systems biology create very abstract images.  However, in high-quality publications, such as the journal Nature, the abstract images are often hand rendered to include more realistic imagery.  Something I would like to do more of is look at actual microscope images and behavioral models to try to usefully bridge the gap.

Visual Data Mining

There are many uses of these network visualizations for biologists and others.  One is just that they can leave a more lasting impression/memory than simple lists.  A major use case, though, is visual data mining, which may take many forms.  Followers of Tufte know that contrasts are often the most valuable element of a visualization.  This image is a straightforward example.  More sophisticated visual data mining might include clustering and classification of those clusters.

Because the Roche wall charts beg to be explored, it is only natural that a tool would be created for doing so.  G-Language is an open source shell that supports, among other things, pathway visualization plugins.  The Genome Projector is a module for G-Language which uses the Google Maps API to allow exploration and annotation.  No doubt, as systems biology network visualization tools mature, more and more will support rich interaction and, perhaps, treat the visualization as a vehicle for collaboration.

Hierarchy and Metanodes

 

 

 

 

 

In the networks section above, I mentioned that the networks are often modular.  The most obvious modules are organelles.  But other modules exist, such as those defined by function.  As the above examples show, incorporating the modularity information into the visualization often makes it even more abstract.