Google Tech Talk Review: Statistical Aspects of Data Mining

This is a talk series being given at Google by David Mease based on a Master’s level stats course he is teaching this summer at Stanford.  Its easy listening if you already have some data mining or stats background.

The introduction (part 1) is particularly well done, as is the portion on association rule mining (parts 7 and 8).  This is the first half of the course which has already occurred…I’ll add links as new sessions are added to Google video.

Part 1: Introduction. Discussion of locations of potentially useful data (grocery checkout, apartment door card, elevator card, laptop login, traffic sensors, cell phone, google badge, etc).  Note mild obsession with consent.  Overview of predicting future vs describing patterns, and other broad areas of data mining.  Intro to R.

Part 2: Data. Reading datasets into excel and R. Observational (data mining) vs Experimental.  Qualitative vs quantitative.  Nominal vs ordinal.  And so on…

Part 3: Data cont. More Excel and R.  Sampling.

Part 4:  Plots. Histograms, ECDF.

Part 5:  More R plots.  Overlaying multiple plots. Statistical significance.  Labels in plots.

Part 6:  More R plots.  Box plots.  Color in plots.  Installing packages.  ACCENT principles and Tufte.

Part 7: Association Rules. Measures of location. Measures of spread.  Measures of association.  Frequent itemsets.  Similar to conditional probabilities.

Part 8: More association rule mining.  Support and confidence calculations. Personalization using rules. Beyond support and confidence.Part 9: Review

Part 10: Classification.  Overview.  A negative view of decision trees.  DTs in R.  Algos for generating DTs.

Part 11: More DTs.  Gini index.  Entropy. Pruning. Precision, recall, f-measure, and ROC curve.

Part 12: Nearest Neighbor. KNN.  Support Vector Machines. Adding ’slack’ variables, using basis functions to make the space linearly separable. Some comments on Stats vs ML. Intro to ensemble (uncorrelated) classifiers.

Part 13: Last class.  Random Forests.  AdaBoost.  Some discussion of limits of classifiers (nondeterministic observational datasets).  Clustering.  K-Means.

I too am so lost! I tried endlessly to set up wordpress . After numerous failed attempts I finnaly put down my ego and called customer support. They told me to my horror, that I can’t run php with using windows! They say in life change can be scary. I’ve battled deadly illness, traveled alone in the Middle East, got into the ring with 300lb men that wanted to rip my head off but the thought of changing my OS is way too scary and I can’t do it. I read that the “no php with windows” is bullshit and a case of under qualified support at godaddy. Is this true? I have know experience at php so if anyone has the answer, I’m really gonna need it in fine detail please. Wonderful blog ya got here!