Tuning Search Engine Components

No Comments »

 image For the past couple of years I’ve been primarily involved with engineering models used in search engines.  At times I’ve run into situations where a model I’m using or developing has some parameters that need to be set.  For example, a model might have a parameter that is a threshold on a number of times a keyword will be counted before we decide that additional occurrences are probably spam (and, yes, I’m talking about BM25 here).  And, at times, either the cost function I would like to use to set the parameters is not differentiable (yeah, I’m thinking about DCG), or I’m perfectly happy to use a quick and dirty method.  So I end up going with a direct search algorithm.  Here’s what I’ve learned (and haven’t forgotten)…

  • I don’t know of any direct search method that scales to more than a dozen-sih parameters.
  • Apache Commons Math has two direct search algorithms implemented in its Optimization package that are great place to start.  The package also provides a framework for defining the cost function.  Check it out: http://commons.apache.org/math/userguide/optimization.html 
  • Implementations abound in which each parameter is iteratively changed, using a heuristic for direction and possibility momentum for the changes.  Evaluation of the cost function usually happens after a single parameter is updated, rather than only after an epoch.  Here is a good example lifted from a paper describing the winning solution to the Netflix Prize (http://www.netflixprize.com/assets/ProgressPrize2008_BigChaos.pdf)…

image

  • Share/Bookmark

Beautiful Visualization: The Book

No Comments »

Had the opportunity last fall to contribute a chapter to the recently released book “Beautiful Visualization” by Julie Steele and Noah Iliinsky. So for my chapter I did visualizations of two large datasets. One was of the Netflix Prize, which was an updated version of a visualization I did a couple of years back. And since I was working at AT&T Interactive R&D at the time, the other visualization I did was of the query logs for Yellowpages.com, a local search engine owned by AT&T.

Julie Steele was wonderful to work with as an editor. And O’Reilly is kind enough to allow the chapter authors to release their own chapters in digital form. So if your interested, you can download the chapter here.

Here’s the Netflix visualization from the chapter. Click it to enlarge.

Movies in the Netflix Prize Dataset


Closeup of Netflix Prize Visualization.


Another closeup of the Netflix Prize visualization.

  • Share/Bookmark

Guide to Getting Started in Machine Learning

11 Comments »

Someone at work recently asked how he should go about studying machine learning on his own. So I’m putting together a little guide. This post will be a living document…I’ll keep adding to it, so please suggest additions and make comments.


Fortunately, there’s a ton of great resources that are free and on the web. The very best way to get started that I can think of is to read chapter one of The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2009 edition). The pdf is available online. Or buy the book on Amazon here, if you prefer.

Once you’ve read the first chapter, download R. R is an open-source statistics package/language that’s quite popular. Never heard of it? Check out this post (How Google and Facebook are using R).

Once you’ve installed R, maybe played around a little, then check out this page which describes the major machine learning packages in R. If you’re already familiar with some of the techniques, then dive in and start playing around with them in R. On the other hand, if it looks really complicated, don’t worry about it yet.

Oh, by the way, if you want to start playing around with machine learning in R, you’ll need data. Check out the UCI Machine Learning Repository. They have both real and toy datasets. The iris dataset, for example, is famous for showing up in many research publications.

I’d suggest next reading more of The Elements of Statistical Learning. Its an excellent book. Try doing some of the programming exercises using R. If you don’t like this book, there are plenty of others. Bishop’s Pattern Recognition and Machine Learning is a famous one. It can be a little difficult depending on your math background. Tom Mitchell’s Machine Learning is another that’s often used to teach the topic.

If you’re looking for perhaps a more passive experience, or want the feel of a classrom, Andrew Ng of Stanford has posted all of his lectures online. He starts by saying that he thinks machine learning is the most exciting field in all of computer science. Here here!

Another great resource is the machine learning course MIT has posted on their OpenCourseWare site. It has the lecture notes, assignments, and more.

I’ll stop here now. More later.

  • Share/Bookmark

20 Useful Visualization Libraries

21 Comments »

Well, not entirely limited to libraries.  Useful stuff for visualization practitioners sounded a little non-specific, though.  These are all freely available.

1. Prefuse (Java) & FLARE (Flex) 
image11image14 

 2. simile (AJAX)

image104image109 

 3. Processing (Java)

 image46imageimage267  

4. GigaPan (Service)

image278image98    

5. Modest Maps (Flash, Python)

imageimage

6. Google Visualization API (Javascript)

imageimage

7. Google Chart API (Javascript)

imageimageimage image

8. Google Maps API (Javascript, Flash)

9. GraphViz (Wrappers for a dozen languages including Java, Perl, Python.  Free.) 

image image

10. JFree (Java)

imageimage

11. pChart (PHP)

imageimage image

12. OpenLayers (JavaScript)

imageimage178

13. Anti-Grain (C++)

imageimageimage

14. JGraph (Java)

image

15. Boost Graph Library (C++, phyton wrapper)

16. Open Flash Chart (Flash)

imageimage

17. Ubigraph (Wrappers for Python, Java, C, and more)

imageimage203

18. JUNG (Java)

imageimage

19. TimeMap (Java)

imageimageimage

20. Many Eyes (online service)

imageimage image

  • Share/Bookmark

Network Visualization for Systems Biology

2 Comments »

 roche3This is a quick look at the state-of-the-art of network visualization in systems biology. It’s an interesting topic on its own (and my day job at the moment), and also as it relates to the visualization of other types of networks, such as social networks (think Facebook). Systems biology is all about looking at proteins, pathogens, and more, within the contexts in which they interact. Naturally, then, the visualizations that tend to be particularly useful are those such as network visualizations that can provide macro understanding of the interactions.  Questions such visualizations help with include those of the form “if a drug affects protein X, what else will it affect?”

The Networks
Quite a bit of interesting complexity is present in these interaction networks (the data).  They are often small-world, disassociative (unlike social networks), scale-free, and exhibit modularity.  Biologists are usually either interested in looking at larger scale cell level networks, or meaningful sub-networks called pathways, which typically are in the range of 50-500 nodes.

Making life interesting, duplicate nodes representing different states are often included.  The edges are directed, and may be hyperedges when multiple nodes necessarily interact together. And, in truth, the edges are often approximations of the actual interactions in the underlying network.  These approximations come from experimental findings published in journals.  

A First Look
roche1 This image is part of Roche Applied Science’s “Biochemical Pathways” series of wall charts.  The charts are in the style of circuit diagrams, which seems to be the most common 2-D representation of metabolic pathways.  This set seems to have been particular influential.  The appeal of this ‘map’ is likely its scale.  Viewers can spend a great deal of time exploring.  In visualization there is a notion of ‘information density’, meaning the more visual attributes used to convey the data, the more information that may be present in the visualization.  This image has a very high information density. 

Layout

clip_image004In general (not just systems bio), network/graph layout (choosing where to place the nodes and edges) is done with consideration for (A) the topology network and (B) the aesthetics.  The primary topology concern is to place connected node pairs near one another and unconnected pairs apart.  The primary aesthetic concerns are to ensure that nodes do not overlap, edges do not cross, and labels are readable.    

cerebralmapk However, nodes in systems biology often also have biologically significant locations associated with them (e.g., within a cell, or within the nucleus of a cell).  The most common way of handling this location information is to treat the layout in a standard network layout manner, but constrain nodes to a compartment/level designated as the extracellular, membrane, cytoplasm, nucleus, etc.  This visualization, created with the Cerebral plugin for Cytoscape is the best example I know of of this.

Realism

clip_image008clip_image033Most of the network visualization tools for systems biology create very abstract images.  However, in high quality publications, such as the journal Nature, the abstract images are often hand rendered to include more realistic imagery.  Something I would like to do more of if look at actual microscope images and behavioral models to try to usefully bridge the gap.

Visual Data Mining

clip_image010There are many uses of these network visualizations for biologists and others.  One is just that they can leave a more lasting impression/memory than simple lists.  A major use case, though, is visual data mining, which may take many forms.  Followers of Tufte know that contrasts are often the most valuable element of a visualization.  This image is a straightforward example.  More sophistication visual data mining might include clustering and classification of those clusters.

clip_image012
clip_image019clip_image017

Zoom and Community Involvement

genomeprojectorBecause the Roche wall charts beg to be explored, it is only natural that a tool would be created for doing so.  G-Language is an open source shell that supports, among other things, pathway visualization plugins.  The Genome Projector is module for G-Language which uses the Google Maps API to allow exploration and annotation.  No doubt, as systems biology network visualization tools reach later versions, more and more will support rich interaction and, perhaps, treat the visualization as a vehicle for collaboration.

Hierarchy and Metanodes
 imageimageIn the networks section above, I mentioned that the networks are often modular.  The most obvious modules are organelles.  But other modules exist, such as those defined functionality.  As the above examples show, incorporation of the modularity information into the visualization often is done in a manner that makes it even more abstract.  

  • Share/Bookmark

A Look at FINVIZ.com (Financial Visualizations)

2 Comments »

about_finviz

FINVIZ is a suite of free financial tools that takes advantage of modern visualization ideas.  The infoviz and interaction designs are certainly worth a blog post.  Here’s a look at their efforts…

1. Sector Visualization.  This visualization is a treemap implemented using the Google Maps API.   It shows how well sectors and companies (stocks) within those sectors are doing.  The attention to detail is exceptional.  The company name stays the same size on zoom, and is dual encoded using a background image.  The gain/loss is shown using shades of green/red, and is also dual encoded using text.  On mouseover details are provided in a side panel.

map

map-closup

2. Stock Charts.  When you create a portfolio of stocks, a number of views of that portfolio.  One is a small multiples view which allows easy comparison without overlay as one has to do with Google Finance and Yahoo Finance charts.  Again, attention to detail is wonderful.  The current price is highlighted, the trendlines are nicely colored, and the volume barchart is part of the background.

smallMultiples

 3. Trends.  They use Sparklines for trend indicators.  Well, they may just be icons (not encoded by actual data), but I’ll delude myself nonetheless.

 sparklines

4. News. They aggregate the news items for all the stocks in a portfolio onto one page.  Very nicely done.  Only shows the day, month, year, when they change.  Overlays chart when mouseover of price (notice the little icon to indicate this next to the word price…attention to detail). 

news

5. Profiles.  Again, just very nicely done, showing all of the profiles on the same page.

profiles

6.  Relative Volume Indicator.  A second vertical axis is added.

image

  • Share/Bookmark

5 Reasons Visualization Is Not More Prevalent

13 Comments »

Why does it seem I have to look hard to find good data visualization examples?  Why do few tech companies devote resources to visualization (Google’s the obvious exception)?  Why are there relatively few job postings for visualization, with many of those there are requiring mainly graphic design skills and not data visualization skills?  I was thinking about this today and I came up with a few possible reasons, some based on perceptions, and others based on marketplace realities.

Reason #1: People Don’t Know What Data Visualization Is

benfry-monkey-small People don’t know what data visualization is.  Don’t believe me?  Read the Amazon.com reviews for the book Data Visualization by Ben Fry. They contain negative comments such as “One would expect a book with the title ‘Visualizing Data’ to be crammed with pictures”.  The issue seems be that too much of the book is devoted to data and the mapping of data properties to visual properties

Graphic design is different from data visualization.  Graphic designers are largely free from having to deal with actual data, and from having their product emerge from data.  Graphic design components and data visualization components are often mixed, and with great success.  But they are different.  Art is not visualization.  And visualization is not art…unless it is ;)

The above visualization (which is, in fact, by Ben Fry) is driven by the properties of two underlying datasets.  One dataset is the DNA of a monkey.  The genes (the data) are represented as very tiny white text.  A second dataset used is human DNA. It is only depicted after the difference of the two datasets has been computed.  Then the genes that are different between the monkey and human are represented in red.  Fry obviously didn’t choose which areas of the visualization would be red, the data did.  What about the monkey pic?  Even that is a visual representation of a property of the dataset…the type of the DNA dataset shown in white text.   

Reason #2: Crappy Existing Visualizations have Polluted Perception

kartoo600px-Cnet05thebigpicture 

The visualization on the left is the interface for the search engine Kartoo.  The visualization on the right is a feature CNET used to have called The Big Picture.  Both attempt to visualize data usually shown as lists (search results, related news articles) as 2D networks.  Its a nice idea, as pairwise relationship properties can be visually represented as edges.  But these particular efforts both miss the boat.  They don’t actually increase the amount of information represented by very much vs lists, while greatly increasing the mental load placed on the user trying to extract the basic information. 

Reason #3: People are Unable to Mentally Separate the View from the Data

benfrymultivizonedataset Here’s another Ben Fry work (I was watching a video/talk of his earlier today, which is part of the reason he is so prevalent in this post).  It shows six different visualizations of the same dataset.

Many times data relates to physical objects.  In such cases people may have trouble dealing with such data as visually represented in any other manner than that which includes those physical objects.  Or another situation is one in which data has just always been depicted in a certain way, which interferes with any new depiction. 

Reason #4: Visualization is Difficult to Create and Easy to Copy

googlefinance yahoofinance

This is somewhat irrelevant, but I have had a Yahoo mail account for about a decade.  There was a good six year stretch where it never changed.  If Gmail hadn’t come along, who knows. 

When Google released Google Finance, it marked a number of firsts…the use of AJAX for stock charts (the chart itself is actually Flash), the overlay of events on the chart, and the dual time sliders.  No doubt Google spent much time and effort designing this visualization tool.  How long did it take Yahoo Finance to copy Google Finance’s chart once Google revealed it?  Not long.  Good visualization design is hard.  It’s even harder when its object is to deconstruct very complex data.  Reverse engineering a visualization is easy.

Reason #5: People Won’t Pay for Visualization?

I’m not so sure about this one, but our company’s CTO recently commented to me that he couldn’t think of any successful standalone visualization effort other than Processing

Applications such as Google Maps don’t count both because its free, and, more importantly, because people wouldn’t have access to the underlying data without the visualization.  I can think of a few commercial successful standalone visualizations such as this one, but surely the list is fairly short. 

  • Share/Bookmark

Haugeland’s AI Views 25 Years Later

No Comments »

image

A couple of years ago, I picked John Haugeland‘s Artificial Intelligence: The Very Idea up off the free book table in the computer science department of Indiana University. Finally read it this weekend.  Published in 1985, there’s  a lot to like about the book, but its definitely a product of its time.  That period being when computer and cognitive scientists were obsessing about knowledge representation.  Wanted to call-out a few (perhaps arrogant) quotes reflective of its day…

“A different pipedream of the 1950s was machine translation of natural languages.  The idea first gained currency in 1949 (via a ‘memorandum’ circulated by mathematician Warren Weaver) and was vigorously pursed … Weaver actually proposed a statistical solution based on the N nearest words (or nouns) in the immediate context. …  Might a more sophisticated ‘statistical semantics’ (Weaver’s own phrase) carry the day? Not a chance.”

Pipedream…somebody tell Google :)   Actually, I had no idea machine translation was worked on in the 1950s.  Cool!  I would mention that the other pipedream of the ’50s he discusses is cybernetics, which, in various forms, is also a very popular area of research today.

“Artificial Intelligence must start by trying to understand knowledge…and then, on that basis, tackle learning.  It may even happen that, once the fundamental structures are worked out, acquisition and adaptation will be comparatively easy to include…it does not appear that learning is the most basic problem, let alone a shortcut or a natural starting point.”

Seems like research that has treated knowledge representation and learning as one problem (neural nets, Bayesian nets, etc) has been particularly fruitful.

“AI has discovered that knowledge itself is extraordinarily complex and difficult to implement–so much so that even the general structure of a system with common sense is not yet clear.”

And, clearly, the Cyc project solved this problem ;)

Anyway, the book is still a very interesting read, particularly if you like thinking about the challenges inherent in the domain knowledge representation.

  • Share/Bookmark

10 New York Times Visualizations

3 Comments »

NYTimes.com has done a great job of moving beyond the static infographics found in newspapers.  10 favorites below…comment if you know of good ones I’ve missed.  Also, for further reading/viewing, see…

- Playgrounds for Data: Inspiration from NYTimes.com Interactives
- Infovis 2007 slides on Matthew Ericson’s blog…

 nytimesnamingnames

nytimesUnion 

nytimesHowClassWorks 

nytimesBuyOrRent 

nytimesSectorSnap 

nytimesmoviebox 

nytimeskatrina 

nytimes-election2004

nytimesCasualities 

primary

  • Share/Bookmark

ETech Presentation on Ensemble Machine Learning

2 Comments »

logo

Just wanted to put up my slides from ETech this past week.  The talk is pretty similar to the talk I posted a few months ago, just a bit more fleshed out.
[ppt][pptx][pdf]

Unfortunately, I only made it to the conference for the day I was speaking.  Beautiful venue.  Seemed that most the buzz related to social networking issues and climate change.  Would have liked to have heard Peter Norvig’s talk.  Maybe another year.

etech1

etech2

  • Share/Bookmark

See Conference (Information Visualization) to be Streamed Live in April

1 Comment »

An information visualization conference, the See Conference, is being held in Wiesbaden, Germany, on April 19th.  Impressive speaker list.  The conference organizers plan to stream the speeches in real time via the conference website.

see1  
Ben Fry
see2  
Zachary Lieberman
see3  
Frank van Ham
see4
And comfortable seats!
  • Share/Bookmark

Ensemble Machine Learning Tutorial

1 Comment »

ensembleTutorialSlide

Here’s the slides from a 2-part lecture I’m giving on ensemble learning at Indiana University.  It includes a discussion of the Netflix Prize competition, and the use of ensemble techniques in that competition.

[PDF][PPT]

  • Share/Bookmark

A Review of MemoryArchive.org

1 Comment »

MemoryArchive I recently came across a small site running on Mediawiki called MemoryArchive.org.  The concept is that each article is a memory written, unlike Wikipedia, by a single author.  Subjective content allowed.

There seems to be a legit place for a site with this concept to complement Wikipedia.  Wikipedia is derivative knowledge, it is intended that the content be cited, meaning it already had to have been published somewhere.  Many valuable (and not so valuable) facts don’t fit that bill.  Also, when sources disagree but are merged into a single Wikipedia article, history according to Wikipedia has a rather non-deterministic feel to it.

That said, MemoryArchive.org has a long way to go in terms of concept, technology, and adoption.  If anyone involved with MemoryArchive comes across this review…well, I have some ideas:

  1. The site needs to provide a data dump (similar to Wikipedia’s data dump) or API.  That way researchers can use the knowledge without scraping the content.  Incidentally, I have written a basic scraper in Perl for this site if anyone wants it.
  2. Use Semantic Mediawiki.  Its the future.
  3. Allow any users to create links, categories on any page.  You’re already using MediaWiki, might as well take advantage of the technology.
  4. Allow usernames to be linked to social network account such as Facebook.  It will create many opportunities for applications to use the memories, and for memories to be related to one another.
  5. Link events to Wikipedia pages on those events…as I said, its complimentary to Wikipedia.

MemoryArchive.org
MemoryArchive.org

  • Share/Bookmark

Visualizing Science & Tech Activity in Wikipedia

1 Comment »

 sciwikivis-small If you didn’t see our original Wikipedia Activity Visualization, check it out here (there’s a detailed explanation, as well).  Also, there is a Google maps style zoomable version here.

This new version uses the same layout and images (well, slightly improved) as the original, but this time we tried to highlight activity in regions of Wikipedia that are predominately math or science or technology. 

So we developed a program to classify Wikipedia articles as being one of these three categories (or none), based on the categories the article was assigned to and their positions in the Wikipedia category link network. 

We were not surprised to see a tight cluster of math pages, in a region, I would add, which has little ‘hot’ activity.  In fact, the only article in that region with lots of activity is the article “Earth”.  It was also not surprising that technology articles are fairly spread out among the topics.

What’s striking is the science-related band (green-blue) that runs diagonal through the middle of the topic map.  I won’t share my interpretation, but rather let those interested come up with there own.  Hope you enjoy, please leave comments!

(CLICK IMAGE TO ENLARGE)

07-Wikipedia-PS3-150DPI
(CLICK IMAGE TO ENLARGE)

science

Above: The most actively edited science-related articles.

Left: Not much science here…a good indication the algorithms are working pretty well!

  • Share/Bookmark

Scheme Tutorial

4 Comments »

windowslivewriterschemetutorial-13a79image-5 I was asked to give a short (1 hr) tutorial on the Scheme language this week for students in the graduate and undergraduate AI courses at Indiana.  Thought I would post the slides in case anyone wants to adapt it for their own purposes…

PDF version
PPT (Office 2007) version

image

  • Share/Bookmark

ICCBR 2007 Highlights

1 Comment »

iccbrLogo
ICCBR07 (International Conference on Case Based Reasoning) is held on alternating years with the ECCBR conference.  The venue was Belfast, a city with nice blue collar charm to it.  Seemed sort of a European version of my hometown of Green Bay.  Stayed in a Queens University dorm room, where I was constantly reminded I am too old to be staying in the dorms.  Should have paid out for a stay for the Europa Hotel where the conference was held…classy place.

Day 1

Perceptions of CBR by David Aha.  Argued that CBR may become irrelevant if there are not more theoretical results published.  Showed stats that more recent CBR publications are system-oriented to back up his argument (but may be true of AI in general).  Suggested that CBR researchers have theory envy towards machine learning practitioners. 
Note: A thought occurring to me is that CBR is more a set of design patterns, ones which are fairly accessible by the general public given the diverse interests of the delegates present.   
Credible Case Based Reasoning by Eyke Hullermeier. A formal treatment of the retrieval component of CBR (nicely timed to correspond to Aha’s argument).

Day 2

Databases and CBR by Larry Kerschberg.  History of database research.  Currently concerned with meta data, reasoning, and data providence, and moving beyond ’just a schema’ (what can happen to data item).  Suggested using CBR for web resource discovery (Wikipedia, Amazon, Facebook as parts of casebase) as major direction for CBR. Invitation to submit CBR work to special issue of Journal of Intelligent Information Systems on integrating artificial intelligence and database technologies.  Scaling, cross domain issues also brought up.  Ran out of time to cover his paper ”Knowledge Sifter”. He was quite interested in CaseML as well.
Empolis by Ralph Traphoner.  An Eclipse based tool for CBR systems.  Supports a plug-in architecture, a fact alone which makes the tool promising in my mind.  Not sure whether it is commercial or what license it carries.
Industry Panel by GE, GM, and Knexus.  Discussion of what industry wants.  Examples of money making applications was a theme, not surprisingly.  Hybrid systems and Internet data were mentioned as well.  One researcher suggested that releasing data was not worth his time.  Another mentioned he has never heard of a case based recommender in the field.  One audience member raised a concern that many outside of the CBR community see CBR as just KNN and a lot of talk.  Something that strikes me about this community is that “industry” is non-tech.

Day 3

My talk at the workshop on personalizing similarity measures. 
jColibri by Juan Recio-Garcia.  CBR system using Lucene.

Day 4

Note: Google Group on context in CBR is now up.
Note: Suggestion that the difference for most people between IR and TCBR is that TCBR systems have some inference step.
CBR in RoboCup Soccer by Hans-Dieter Burkhard. Funny videos!  I’ll try to post these.
Note: David Leake’s paper on data providence won best paper.  Congrats David!

Additional Comments

Note: A “cooking” contest is being help at next year’s ECCBR where recipes are cases.  I like the idea of a contest, and the crossover to the real world (taste of cooking matters), however I wonder if the cooking domain allows for sufficient bragging rights.
Note: Major topics in CBR continue to be similarity measures, context awareness, text processing, and health care support systems.  The new topic for next year is to involve web 2.0 technologies.

  • Share/Bookmark

Google Tech Talk Review: Statistical Aspects of Data Mining

1 Comment »

This is a talk series being given at Google by David Mease based on a Master’s level stats course he is teaching this summer at Stanford.  Its easy listening if you already have some data mining or stats background. 

 

The introduction (part 1) is particularly well done, as is the portion on association rule mining (parts 7 and 8).  This is the first half of the course which has already occurred…I’ll add links as new sessions are added to Google video.

Part 1: Introduction. Discussion of locations of potentially useful data (grocery checkout, apartment door card, elevator card, laptop login, traffic sensors, cell phone, google badge, etc).  Note mild obsession with consent.  Overview of predicting future vs describing patterns, and other broad areas of data mining.  Intro to R.

Part 2: Data. Reading datasets into excel and R. Observational (data mining) vs Experimental.  Qualitative vs quantitative.  Nominal vs ordinal.  And so on…

Part 3: Data cont. More Excel and R.  Sampling.

Part 4Plots. Histograms, ECDF.

Part 5:  More R plots.  Overlaying multiple plots. Statistical significance.  Labels in plots.

Part 6:  More R plots.  Box plots.  Color in plots.  Installing packages.  ACCENT principles and Tufte.

Part 7: Association Rules. Measures of location. Measures of spread.  Measures of association.  Frequent itemsets.  Similar to conditional probabilities.

Part 8: More association rule mining.  Support and confidence calculations. Personalization using rules. Beyond support and confidence.Part 9: Review

Part 10: Classification.  Overview.  A negative view of decision trees.  DTs in R.  Algos for generating DTs.

Part 11: More DTs.  Gini index.  Entropy. Pruning. Precision, recall, f-measure, and ROC curve.

Part 12: Nearest Neighbor. KNN.  Support Vector Machines. Adding ‘slack’ variables, using basis functions to make the space linearly separable. Some comments on Stats vs ML. Intro to ensemble (uncorrelated) classifiers.

Part 13: Last class.  Random Forests.  AdaBoost.  Some discussion of limits of classifiers (nondeterministic observational datasets).  Clustering.  K-Means.

  • Share/Bookmark

On Transfer Learning

No Comments »

transferlearningDefinition (from DARPA): The ability of a system to recognize and apply knowledge and skills learned in previous tasks to novel tasks

Current approaches involve either the building of a shared model of a domain or multiple domains, in the form of a case base, hierarchy, or relational schema, that couple the classifiers together, or the creation of mapping between distinct representations.  Bayesian and neural approaches dominate the research thus far. 

(from Droy 2007-IJCAI07) In spam filtering, a typical data set consists of thousands of labeled emails belonging to a collection of users.  In this sense, we have multiple data sets–one for each user.  Should we combine the data set and ignore the prior knowledge that different users labeled each email?  If we combine the data from a group of users who roughly agree on the definition of spam we will have increased the available training data from which to make predictions.  However, if the preferences within a population of users are heterogeneous, then we should expect that simply collapsing the data into an undifferentiated collection will make our predictions worse.

Resources
Caruana dissertation (1997).  Part of ALVINN
Berkeley 2005 course.  Reading list.  Bayesian approaches are focused in on.  
Oregon State 2005 course. Probabilistic Relational Models.
DARPA Proposal.  Now in its third and final year. 
CBR Approach.  Strategy game playing. 
Wikipedia Entry 

Workshops
NIPS 1995
Inductive Transfer : 10 Years Later (2005)

05-29_figure01

  • Share/Bookmark

Visualizing the ‘Power Struggle’ in Wikipedia

21 Comments »

wikimap-small A new visualization Bruce Herr and I recently completed is being featured in this week’s New Scientist Magazine (the article is free online, minus the viz).  They did a good job jazzing up the language used to describe the viz–’power struggle’, ‘bubbling mass’, ‘blitzed articles’–but they also dumbed down the technical accomplishments.  I guess not everyone gets as excited about algorithms as I do. 

Before I talk anymore about the viz, though, let me mention its appearing at the NetSci 2007 Conference this week, and hopefully a varient will appear at Wikimania later this summer as well.  The viz is a huge 5 feet by 5 feet when printed, and I only include a low res, smaller version here.  At some point high quality art prints of it will appear at SciMaps for sale to fund further visualization research.

Now for the good stuff.  Much like my visualization of the netflix prize competition data, we began this piece by representing the data as a network.  In this case the nodes in the network are wikipedia articles and the edges are the links between articles.  We then (with some help from our friends at Sandia) used an algorithm to lay out all 650,000 nodes (wikipedia articles) that had at least one link in such a way that similar articles are near one another.  These are the yellow dots, which when viewed at low res give a yellow tint to the whole picture.

The sizes of the nodes (circles, dots, whatever you want to call them), are based on a model of revision activity.  So large circles indicate that an article might be controversial, or the subject of lots of vandalism, or just a topic whose content frequently changes.  We labeled only the largest nodes, to keep it readable.  There is an interactive version of this in the works based on the google maps platform which will change the labels and pictures used as the user ‘zooms’ in or out.  Stay tuned for that.

The image used for each tile was selected automatically, simply by using the first image in the most linked to article among all the articles in that tile.  We were pleasantly surprised by the quality of the images that appeared.   

Our hope for this visualization approach, which we continue to improve on, is that it could be updated in real time to give a macro sense of what is happening in Wikipedia.  I personally hope that some variation of it will end up in high schools as a teaching tool and for generating discussions.

Top 20 Most Hotly Revised Articles

  • Jesus
  • Adolf Hitler
  • October 2003
  • Nintendo revolution
  • Hurricane Katrina
  • India
  • RuneScape
  • Anarchism
  • Britney Spears
  • PlayStation 3
  • Saddam Hussein
  • Japan
  • Albert Einstein
  • 2004 Indian Ocean Earthquake
  • New York City
  • Germany
  • Muhammad
  • Pope Benedict XVI
  • Ronald Regan
  • Hinduism 
  • Share/Bookmark

Another Visualization of the Netflix Prize Dataset

7 Comments »

netflixAllMoviesSmall Here’s a recent visualization I did of the dataset used in the Netflix Prize Competition. The dataset is 17,700 movies and 31 gigs of user ratings. This viz shows similar movies close to one another, with the similarities determined by a formula based on ratings.

I found most interesting a cluster of movies (in blue) that I’d say are generally acclaimed. The cluster contains movies of across all genres, such as Schindler’s List, BraveHeart, and Super Size Me. Beyond that, there’s a bunch of clusters which are mostly defined by a genre such as music, sports, documentary, Imax, children’s films, or bonus material. The big blob in the center is mostly what I’d call junk movies.

I’ve labeled some movies just to give some sense of what the clusters contain. There’s an interactive version of the viz as well, so you can explore the movies for yourself…

  • Share/Bookmark

An Interactive Visualization of the Netflix Prize Dataset

2 Comments »

smallNetflixVizInteractive The visualization activated below (click the button) shows all 17,700 movies that are part of the Netflix Prize Competition. The movies are laid out such that simlar movies are close to one another. Similarity between two movies is computed based on whether users who like one like the other, or (and, really) those who dislike one dislike the other.  Alternatively, take a look at a colorful, static version.

Mouse over to get the movie titles…

[swf movie="http://abeautifulwww.com/wp-content/uploads/2007/04/vizview.swf" /]
http://abeautifulwww.com/wp-content/uploads/2007/04/vizview.swf

  • Share/Bookmark

GapMinder Talk

3 Comments »

Just read an article about Google buying a small company called GapMinder which does data visualization.  I checked out the talk on the GapMinder homepage, and would recommend watching the first 10 minutes of it.  The visualization tool that is used throughout the talk is something special…easy to see Google’s interest.

  • Share/Bookmark

Installing WordPress on GoDaddy

77 Comments »

wordpressGoDaddy Setting up WordPress on a GoDaddy hosting account is really not difficult (this blog is an example that it can be done!).  Below are my notes on the process.  If you glance at these steps, and don’t want to mess around with this, consider using one of the following hosting services which come with WordPress pre-installed (fairly rare): An Hosting, Lunarpages, BlueHost, Yahoo

Steps for installing WordPress on a GoDaddy Hosting Account

1. Get an account.  If you haven’t already, purchase a hosting account.  I chose the Deluxe plan, which really isn’t very expensive.  You’ll be emailed directions after you purchase the account.  The email will say it takes 24-48 hrs to activate, but it actually only takes 20 minutes or so.

GoDaddy.com Hosting & Servers

2. Login to the “my account”.  The login is on the GoDaddy homepage.  On the my account screen, click “Hosting Account List”.  Then click “open” under control panel.  You should be at the “Hosting Manager” seen below.

3. Create a MySql Database.  WordPress stores its data on MySql.

  • Click the MySql icon.  Then click “Create New Database”. Name the db “WordPress”. 
  • Create a db login.
  • Confirm.
  • Submit.  Wait a minute. Then refresh. The status should change to “setup”. 
  • Click the db name.
  • Highlight the hostname and copy it (ctrl-c).  You’ll need it for the WordPress config file.

4.  Download WordPress.  Unzip the files.

5.  Configure the file wp-config.php.  Change the following lines using your information.

define(‘DB_NAME’, ‘wordpress’);
define(‘DB_USER’, ‘username’);
define(‘DB_PASSWORD’, ‘password’);
define(‘DB_HOST’, ‘localhost’);

6.  Upload the WordPress directory to your GoDaddy account.  You’ll need an ftp client to upload files to your account (I use Smart Ftp) and you’ll need the ftp address for your site.  Your address is ftp.yourdomain.com.  Put the files in you top level directory, that way when you go to www.yourdomain.com it will load WordPress.

7. Test WordPress.  There are detailed directions for configuring WordPress here.

 

Around the Web

  • Here is another good tutorial.
  • Share/Bookmark

A Tutorial on Flash Remoting Using Perl

2 Comments »

remoting Flash remoting is a big improvement over forms/cgi for communication between flash and server.  There’s a great little project called amfphp for using php with flash remoting.  There’s a whole lot less great (but appreciated!) version called amf::perl for perl and python. There is little documentation, so I thought I’d post an example. 

Here’s my remoting notes dealing with amf::perl.  For context, I was working on a movie recommendation system.

Steps:

(1) On the Server side, in the cgi-bin, create a perl file, and call it, say “recommendation.pl”.   Your version of this file will look the same, except replace “/recommendation” with an appropriate subdirectory where you will put the code in step 2. 

#!perl  -w
use strict;
use AMF::Perl;

my $gateway = AMF::Perl->new;
$gateway->setBaseClassPath("./recommendation/");
$gateway->service();
 

(2) In a subdirectory, in this case “/recommendation”, create a perl module, in this case called “rec.pm”.  This module contains the “methodTable” which will be accessed by the flash code.  The example below has both get and set calls. 

#!perl  -w

use AMF::Perl;
use lib 'Filesystem\\Active Projects\\'; ## directory where my server side code is
use SimMan::TitleLookup; ## my own server side perl modulesuse SimMan::Recommendation;

package rec;
## must have this constructorsub new 
{
    my ($proto)=@_;
    my $self={};
    bless $self, $proto;
    return $self;
}

sub methodTable
{
    return {
        "getFeatureWeights" => {
            "description" => "Returns the current feature weights",
            "access" => "remote",
        },
        "setFeatureWeights" => {
            "description" => "Sets the current feature weights",
            "access" => "remote",
        },
        "disambiguateQuery" => {
            "description" => "Returns possible titles and their ids",
            "access" => "remote",
        },
    };
}

sub getFeatureWeights
{
    my ($self) = @_;
    my $hash = SimMan::Readers::readFeatureWeights();
    my @array;
    push @array, $hash;
    return \@array;
}

sub setFeatureWeights
{
    my ($self, $featureWeights) = @_;
    SimMan::Readers::setFeatureWeights($featureWeights);}

sub disambiguateQuery
{
    my ($self, $usertext) = @_;
    return $titles = SimMan::TitleLookup::getCandidateTitles($usertext);
}

1;
 

(3) On the client side, first include the necessary files and give the location of the perl files we just created.  

Then you can directly call the methods listed in the methodTable.  Notice that the results sent back by the perl module will be returned to a function called “method_result”, where method is the name of the method.

import mx.remoting.NetServices;
NetDebug.initialize();
NetServices.setDefaultGatewayUrl("http://localhost/cgi-bin/recommendation.pl");
var connection = NetServices.createGatewayConnection();
var recommender = connection.getService("rec", this);

function disambiguateQuery_Result(result){
    if (result.length == 0){
        Results.htmlText = "<b>No movies found</b>";
    }
    else{
        Results.htmlText = "<b>Are you looking for...</b><br><br>";
        for (i=0; i< result.length; i++) {
            Results.htmlText += "result[i].id + " " + result[i].title + "<br>";
        }
    }
}

searchText.enter = function(){
    recommender.disambiguateQuery(searchText.text, 10);
};

Hope this helps someone!

  • Share/Bookmark

Tips for Making Perl Programs Run Faster

3 Comments »

In my daily work I tend to manipulate fairly large datasets, such as Wikipedia, U.S. Patents, Netflix Ratings, and Imdb.  Here’s a few tricks I’ve come across so that you don’t lose time waiting for your programs to finish. 

  1. Use Storable
    • If you need to a save a hash, array, or more complex data structure to use in its entirety at a later time, it is around 4 times faster to use storable than to read/write it from a text file.  Its simple to use, just two calls…
    • store(\%hash, $filename);
    • $hashref = retrieve($filename);
  2. Use a Berkeley Db for random access
    • Berkeley Db’s are associative databases (as opposed to relational databases like MySql, Postgresql, etc).  And they are quite useful.
    • Whereas we want to use Storable when we will be later loading the entire data structure back into memory, we want to use BDBs when we will only be performing random access at a later time.  So, for example, lets say I have all the IMDB movie titles in a hash, and I want to save it to disk.  But I know that later on I am only going to want to look up one or two titles at a time, so that’s a case where I want to use a BDB.
    • There’s two modules for BDBs to choose: DB_File or BerkeleyDB.  I usually use DB_File.
  3. Use Unix Grep
    • This might sound a bit off the wall, but when you have your data in a file, rather than reading it in and parsing it using Perl regular expressions, it can be up to 100 times faster to call Grep or AWK.  Grep is particularly useful when your task is simply to lookup something in a text file.  In general, when trying to speed up code, don’t be afraid to take advantage of the operating system.
    • system(“command”, $input, @files);
  4. Patterns for Storing Hash Structures
    • This tip is somewhat domain dependent.  I often find myself using 2d hashes, primarily because I work with data represented as graphs (i.e. a set of edges represented as node-node-weight).  So these 2d hashes look like {id1}->{id2}=value.
    • When writing such a hash out to text I’ll use the command chr(1) to generate a delimiter, and directly write “id1 id2 value\n” (with chr(1) instead of spaces) per entry.
    • When using Storable, I don’t need to do anything special.
    • When using a BDB, there is no such thing as a 2d hash.  Either one has to create multiple BDBs, or, usually the better option, is to convert the hash to {id}=id2a|value|id2b|value… and use split when accessing the data.  Again, I like to use chr(1) instead of | as a delimiter. 
  • Share/Bookmark

Flash vs. Processing

11 Comments »

Over the past year and a half I’ve been hooked on the language Processing. I’ve even contributed a early version library for visualizing social network data

For those unfamiliar with Processing, it’s a variant of Java.  Its distinguished by its emphasis on interactive media. The fundamental unit in Processing is a sketch, which is entirely and continuously redrawn at some given rate.  Sketches may be compiled into either standalone apps or Java applets.

Many creative types work with Processing–here’s some cool examples: The DumpsterThinking Machines 4 (pictured left), and Relations Environment. More can be seen on exhibition.  Its not hard to see why I was excited by the language. 

I’m still excited by the Processing community and all the cool apps and exhibits they are turning out.  However, today I am much less excited by the technology. 

  1. As an Internet content delivery mechanism, Java Applets are a poor choice.  They have a large memory footprint and so are slow to load, and they provide few options for communicating between client and server.
  2. Sketches are redrawn in their entirety with each tick of the clock.  There are no layers.  Why is this a problem?  This limits the number of graphical objects one can involve in a sketch without the sketch slowing down.
  3. There are no user interface components of any quality.  And Java Swing is not compatible with Processing.  While this is a real limitation, this is a somewhat hollow complaint, as I realize it is only a matter of time before quality UI widgets are created within Processing for Processing. 
  4. Java 1.5 is not supported.  My concern here deals with the fact that one of Processing’s greatest strengths is that a developer may use Java.  I am wondering whether the Processing community will be able to maintain this integration over time. 

Recently I’ve been looking into Flash (e.g., the chart in Google Finance is Flash) and have begun to believe that Flash is a better alternative.  It is great at content delivery, has convenient UI components and support, and the fundamental unit of Flash, the “movie”, can be redrawn using separate layers.

I am still learning ActionScript, the object-oriented language used for Flash.  My early impressions of it are that recent versions 2.0 and 3.0 are of a pretty high quality, but a far cry from Java.

I’ll post a follow-up to this in a few months after I have more experience with Flash and Actionscript under my belt.

Around the web

  • Share/Bookmark

Ranking Online Backup Services

3 Comments »

This post is a bit off topic for this blog.  However, I recently decided I really needed an off-site backup service (er, I lost some files).  And, as usual, I spent way too much time looking around the web for such a service.  Anyways, I thought I’d share my homework.   To give you an idea of my situation, I have about 30GB of data (code, datasets, visualizations) that I need constantly backed up. 

Backup Services vs. Online Storage

There are a ton of options.  I just wanted an automatic backup service–one that works flawlessly.  Many services I looked at emphasize that the files can be accessed anywhere.  These services often refer to themselves as online storage rather that online backup.  The pure backup services usually offer unlimited space and focus on the software for backing up changes on demand or when your computer is idle, and on the software for restoring files.  The tradeoff is usually that files are stored compressed, and are only meant to be accessed for somewhat rare restores.  Online storage services, by comparision, limit the storage space, and although they almost always advertise the backup aspect, files usually need to be manually transfered.  The advantage is that the files on such services can be accessed over the web. 

I’m happy to keep files on my usb flash drive which I need access to wherever I go, and so I just need the automatic backup services. 

I’ve ranked the services based on my impressions of their offerings and personal experience…

Best Deals

  1. Mozy–I ended up going with Mozy as my backup service, and an very happy so far.  I go to lunch or a meeting, and come back to a message that my files were backed up while my computer was idle.  The software for selecting which files and folders to backup is quite easy to use.  However, Mozy is purely a backup service, all files are stored compressed on their servers.  So if you’re interested in remote access to your files, look elsewhere.  2GB free, or unlimited space for 4.95/month.
  2. Carbonite–Their software for automatic backup is really pretty slick.  I was torn between Carbonite and Mozy.  Like Mozy this is strictly a backup service.  There is no free version; unlimited space for 49.95/yr. 

Runners Up

  1. StreamLoad (MediaMax)–StreamLoad is rebranded across a number of sites including MediaMax.com and the AMD Media Vault.  Initially, at least on paper, this seemed liked the best deal to me.  They have a free service of 25GB (they throttle downloading though), which may be enough space for many folks.  It both supports automatic backup and provides anywhere access to your files.  Its 4.95/month for the larger 100GB.  I said initially it looked good.  I downloaded the software, and it kept crashing on both my laptop and desktop.  Unacceptable :(   Maybe others will have a better experience.
  2. Files Anywhere–Provides both online access and automatic backup.  11.95/month for 10gb. 
  3. Box.net–Looks nice, but only 5GB for 4.95/month is too small for my needs.
  4. FlipDrive–Does not provide automatic backup.  4.95/month for 20 gb. 
  5. Iomega IStorage–5GB for 19.99/month.  Yikes!  Too small a space, and too expensive.
  6. IBackup.com–5GB for 9.95/month, 50GB for 49.95/month.  Too little for too much. 
  7. ShareFile.com–Pay by bandwidth.  23.95/month for 2gb of bandwidth.

Foolish

  1. XDrive–Owned by AOL.  Provides for automatic backup and online access.  Strictly a free service which provides 5gb of space.  Reviews are mixed on its software, but mainly I decided to stay away because if I needed more than 5gb, there is no means of upgrading.
  2. GoDaddy I’m a big GoDaddy fan.  This blog runs on a GoDaddy Hosting accont. But their online storage service just does not look like a bargain.  Called “Online File Folder”, its 2gb for $20/yr.  No automatic backup software.
  3. True Share–Overpriced.  “As little as 30/month for 3gb”.
  4. Amazon S3–A web service.  This is strictly for developers only.  Might be a good deal for a small or medium sized business.   
  5. Yahoo Briefcase–A pathetic 25mb of storage.  Might as well just use Gmail for storage.

If you just need online storage and remote access, consider a regular hosting account

  1. An Hosting
  2. Lunarpages
  3. BlueHost
  4. Yahoo

Around the Web

  1. TechCrunch: The Online Storage Gang
  2. PC World: Online Storage 
  • Share/Bookmark

Top 10 Google Tech Talks

4 Comments »

All Google Tech Talks are here (Google EngEDU is the actual name of the talk series). Thought I’d compile a top ten list…

  1. Python and Python 3000. Two talks about the Python language given by its inventor Guido van Rossum. The first is about the language’s origins and the second is about its future.
  2. How Open Source Projects Survive Poisonous People (And You Can Too). Really liked this talk. Really liked it! Given by the lead developers of SubVersion (and other large projects), this talk provides a guide to working as a team. Next time I lead a project, I am going to ask everyone to watch this before starting work.
  3. Winning the DARPA Grand Challenge. The story of the robot race thru the mojave desert. Having been on a Grand Challenge team, I appreciate just how hard it was to win.
  4. Scrum, et al. An excellent talk about the Scrum agile software development methodology.
  5. Wikipedia and MediaWiki. A talk about the implementation of Wikipedia given by it original, and for a long time, only paid staff developer. Not a very dynamic talk, but the insider perspective is interesting.
  6. Computers versus Common Sense. Doug Lenet gives a talk about the famous AI project Cyc. I thought this project was put to rest a long time ago. Guess I was wrong. Anyways, if you like symbolic Artifical Intelligence, its a really interesting talk.
  7. Scholarly Data, Network Science, and (Google) Maps. A very good information visualization talk.
  8. 15 Views of a Node Link Graph: An Information Visualization Portfolio. A bunch of visualization techniques. I think Tamara Munzner leads some of the most interesting visualization work anywhere.
  9. Human Computation. A talk about harnessing human knowledge for tasks such as spam filtering and image recognition.
  10. Scrum Tuning: Lessons learned from Scrum implementation at Google. A talk about the experience of using Scrum given by one of its inventors.
  • Share/Bookmark

CoCitation vs. Bibliometric Coupling

No Comments »

I recently posted an efficient algorithm for computing the similarity of two Wikipedia pages (or any two nodes in a network) using cocitation similarity. Another type of similarity which may be worth considering is bibliometric coupling, in which two pages are similar if the pages they link to are similar. What is interesting is that it is only a few minor tweaks to the cocitation algorithm to compute bibliometricc coupling. Here’s the bibliometric coupling psuedocode (Perl style):

%nodes, %links //the wikipedia pages and pagelinks
%reverse = reverse(%links) //flipping the pagelinks around
%biblioCounts //2d hash for temporarily storing counts
%scores //2d hash storing the final similarity scores

foreach node (keys %nodes){ 

   foreach linkedNode (keys %links->{node}) //count cocitations for node (wiki page)
      foreach node2 (keys %reverse->{linkedNode})
         $biblioCounts{node}->{node2} ++;

  citationCount = keys(%linked->{node})
  foreach node2 ( keys %biblioCounts{node}) //similarities scores for node
      citationCount2 = keys(%linked->{node2})
      scores{node}->{node2} =
		2 * biblioCounts{node}->{node2} / (citationCount + citationCount2)
}
  • Share/Bookmark

Wikipedia Page Similarities

1 Comment »

I’m working on a visualization, a ‘map’ if you will, of Wikipedia pages. The map will layout pages close to one another if they are similar. So, in order to create such a map I need to compute the similarity of any two Wikipedia pages.

For my first attempt at this, I decided to go with a cocitation measure of similarity. So, two Wikipedia pages will be said to be similar if other Wikipedia pages that link to one usually link to the other.

However, the naive way to compute this, looking at every pair of pages, is far too inefficient given that there are 650,000 pages in the English Wikipedia, and 14.5 million pagelinks. So I’ve worked up a much more efficient algorithm. Here’s the psuedocode…I hope someone, somewhere out in cyberspace will find this useful. (It can, in fact, be used to compute co-citation similarities for any data represented as nodes and links)

%nodes, %links //the wikipedia pages and pagelinks
%reverse = reverse(%links) //flipping the pagelinks around
%cocitations //2d hash for temporarily storing cocitation counts
%scores //2d hash storing the final similarity scores

foreach node (keys %nodes){ 

   foreach sourceNode (keys %reverse->{node}) //count cocitations for node (wiki page)
      foreach cocited (keys %links->{sourceNode})
         $cocitations{node}->{cocited} ++;

  citationCount = keys(%reverse->{node})
  foreach node2 ( keys %cocitations{node}) //similarities scores for node 
      citationCount2 = keys(%reverse->{node2})
      scores{node}->{node2} =
		2 * cocitations{node}->{node2} / (citationCount + citationCount2)
}
  • Share/Bookmark