Google Tech Talk Review: Statistical Aspects of Data Mining


This is a talk series being given at Google by David Mease, based on a Master’s-level stats course he is teaching this summer at Stanford.  It’s easy listening if you already have some data mining or stats background.

 

The introduction (part 1) is particularly well done, as is the portion on association rule mining (parts 7 and 8).  This is the first half of the course, which has already occurred.  I’ll add links as new sessions are added to Google Video.

Part 1: Introduction. Discussion of locations of potentially useful data (grocery checkout, apartment door card, elevator card, laptop login, traffic sensors, cell phone, Google badge, etc.).  Note the mild obsession with consent.  Overview of predicting the future vs describing patterns, and other broad areas of data mining.  Intro to R.

Part 2: Data. Reading datasets into Excel and R. Observational (data mining) vs experimental.  Qualitative vs quantitative.  Nominal vs ordinal.  And so on…

Part 3: Data cont. More Excel and R.  Sampling.

Part 4: Plots. Histograms, ECDF.
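
The talk builds its plots in R, but the idea behind an ECDF (empirical cumulative distribution function) fits in a few lines. The following is a minimal Python sketch, not code from the course: the helper name `make_ecdf` and the sample values are made up for illustration. For any x, the ECDF is the fraction of observations less than or equal to x.

```python
from bisect import bisect_right


def make_ecdf(sample):
    """Return a function F where F(x) is the fraction of observations <= x."""
    xs = sorted(sample)
    n = len(xs)

    def F(x):
        # bisect_right counts how many sorted values are <= x.
        return bisect_right(xs, x) / n

    return F


# Hypothetical sample, chosen only to illustrate the step-function behavior.
waits = [3, 1, 4, 1, 5]
F = make_ecdf(waits)
print(F(0), F(1), F(3), F(10))  # 0.0 0.4 0.6 1.0
```

Plotting F over the sorted observations gives the familiar staircase shape, which jumps by 1/n at each observation (or by more where values are tied).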

Part 5:  More R plots.  Overlaying multiple plots. Statistical significance.  Labels in plots.

Part 6:  More R plots.  Box plots.  Color in plots.  Installing packages.  ACCENT principles and Tufte.

Part 7: Association Rules. Measures of location. Measures of spread.  Measures of association.  Frequent itemsets.  Similar to conditional probabilities.

Part 8: More association rule mining.  Support and confidence calculations. Personalization using rules. Beyond support and confidence.

Part 9: Review
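
The support and confidence calculations from the association rule sessions come down to counting transactions. Here is a small Python sketch under stated assumptions: the basket data is invented, and the function names `support` and `confidence` are my own, not taken from the talk or from any R package. Support is the fraction of transactions that contain an itemset. Confidence of a rule A -> B is the fraction of transactions containing A that also contain B, which is why it reads like a conditional probability.

```python
# Hypothetical market-basket data, invented for illustration.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]


def count_containing(itemset, transactions):
    """Number of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)


def support(itemset, transactions):
    """Fraction of all transactions that contain the itemset."""
    return count_containing(itemset, transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Of the transactions containing the antecedent, the fraction that
    also contain the consequent (an estimate of P(consequent | antecedent))."""
    both = count_containing(set(antecedent) | set(consequent), transactions)
    return both / count_containing(antecedent, transactions)


print(support({"bread", "milk"}, transactions))       # 0.6
print(confidence({"bread"}, {"milk"}, transactions))  # 0.75
```

Here three of the five baskets contain both bread and milk (support 0.6), and three of the four baskets with bread also contain milk (confidence 0.75).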

Part 10: Classification.  Overview.  A negative view of decision trees.  DTs in R.  Algos for generating DTs.

Part 11: More DTs.  Gini index.  Entropy. Pruning. Precision, recall, f-measure, and ROC curve.
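
For reference, the Gini index and entropy are both impurity measures computed from the class proportions in a tree node. The sketch below uses the standard textbook definitions in Python. The function names and the toy labels are assumptions for illustration, not code from the lectures.

```python
from collections import Counter
from math import log2


def class_proportions(labels):
    """Fraction of the node's records that fall in each class."""
    n = len(labels)
    return [count / n for count in Counter(labels).values()]


def gini(labels):
    """Gini index: 1 minus the sum of squared class proportions."""
    return 1.0 - sum(p * p for p in class_proportions(labels))


def entropy(labels):
    """Entropy in bits: sum of -p * log2(p) over the classes present."""
    return sum(-p * log2(p) for p in class_proportions(labels))


def weighted_impurity(children, measure):
    """Impurity of a split: child impurities weighted by child size."""
    n = sum(len(child) for child in children)
    return sum(len(child) / n * measure(child) for child in children)


# Toy node with two classes in equal proportion (maximally impure).
node = ["spam", "spam", "ham", "ham"]
print(gini(node), entropy(node))  # 0.5 1.0

# A split that separates the classes perfectly has zero impurity.
split = [["spam", "spam"], ["ham", "ham"]]
print(weighted_impurity(split, gini))  # 0.0
```

A tree-growing algorithm typically picks the split that lowers the weighted impurity the most relative to the parent node.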

Part 12: Nearest Neighbor. KNN.  Support Vector Machines. Adding ‘slack’ variables, using basis functions to make the space linearly separable. Some comments on Stats vs ML. Intro to ensemble (uncorrelated) classifiers.

Part 13: Last class.  Random Forests.  AdaBoost.  Some discussion of limits of classifiers (nondeterministic observational datasets).  Clustering.  K-Means.
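
As a reminder of how K-Means works, here is a minimal pure-Python sketch of the usual assign-then-update loop (Lloyd's algorithm). It is an illustration only: the starting centroids are supplied by hand rather than chosen at random, and the points and function names are hypothetical, not taken from the class.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def assign(points, centroids):
    """Index of the nearest centroid for each point."""
    return [
        min(range(len(centroids)), key=lambda i: sq_dist(p, centroids[i]))
        for p in points
    ]


def kmeans(points, centroids, max_iter=100):
    """Alternate assignment and centroid updates until labels stop changing."""
    labels = assign(points, centroids)
    for _ in range(max_iter):
        updated = []
        for i, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == i]
            # Keep the old centroid if a cluster ends up empty.
            updated.append(mean_point(members) if members else old)
        centroids = updated
        new_labels = assign(points, centroids)
        if new_labels == labels:
            break
        labels = new_labels
    return centroids, labels


# Two well-separated groups of hypothetical 2-D points.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, labels = kmeans(points, [(0, 0), (10, 10)])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

The result depends on the starting centroids, which is why practical implementations usually run several random restarts and keep the best clustering.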


On Transfer Learning


Definition (from DARPA): The ability of a system to recognize and apply knowledge and skills learned in previous tasks to novel tasks.

Current approaches involve either building a shared model of one or more domains (in the form of a case base, hierarchy, or relational schema) that couples the classifiers together, or creating a mapping between distinct representations.  Bayesian and neural approaches dominate the research thus far.

(from Droy 2007-IJCAI07) In spam filtering, a typical data set consists of thousands of labeled emails belonging to a collection of users.  In this sense, we have multiple data sets–one for each user.  Should we combine the data sets and ignore the prior knowledge that different users labeled each email?  If we combine the data from a group of users who roughly agree on the definition of spam, we will have increased the available training data from which to make predictions.  However, if the preferences within a population of users are heterogeneous, then we should expect that simply collapsing the data into an undifferentiated collection will make our predictions worse.

Resources
Caruana dissertation (1997).  Part of ALVINN
Berkeley 2005 course.  Reading list.  Focuses on Bayesian approaches.
Oregon State 2005 course. Probabilistic Relational Models.
DARPA Proposal.  Now in its third and final year. 
CBR Approach.  Strategy game playing. 
Wikipedia Entry 

Workshops
NIPS 1995
Inductive Transfer : 10 Years Later (2005)

