Midterm Review. 1-Intro Data Mining vs. Statistics –Predictive v. experimental; hypotheses vs data-driven Different types of data Data Mining pitfalls.

1 Midterm Review

2 1-Intro
Data Mining vs. Statistics
–Predictive v. experimental; hypotheses vs data-driven
Different types of data
Data Mining pitfalls
–With lots of data you can find anything
Data privacy and security
–Good and bad examples

3 2- EDA and Visualization
Good visualization is good analysis
Examples of vis
–1-d, 2-d, multivariate
–Histograms, boxplots, scatterplots, density estimates, etc.
–Overplotting with many points
–Conditional plots (small multiples)
–Good, bad examples

4 3- Data mining concepts
Preparing data for analysis
–How to deal with missing data?
–What are good transformations?
–How to deal with outliers?
Data reduction
–Reducing n: sampling, subsetting
–Reducing p: principal components, finding projections that preserve variance
–Scree plot shows how much variance each principal component accounts for
MDS:
–Needs a distance matrix
–Minimizes ‘stress function’
–Mostly used for visualization and EDA
In-vs-out of sample evaluation
–In-sample: must penalize for complexity
–Out-of-sample: use cross-validation to evaluate predictive performance
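The scree-plot idea on this slide can be illustrated with a minimal sketch. It assumes two-dimensional data so that the eigenvalues of the covariance matrix have a closed form; the function names and the toy points are hypothetical and are not taken from the course material.

```python
import math

def covariance_2d(points):
    """Sample covariance entries (sxx, sxy, syy) for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - mean_y) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points) / (n - 1)
    return sxx, sxy, syy

def explained_variance_ratio_2d(points):
    """Share of total variance carried by each of the two principal components."""
    sxx, sxy, syy = covariance_2d(points)
    trace = sxx + syy  # total variance
    # Closed-form eigenvalues of a symmetric 2x2 matrix.
    half_gap = math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    eig_large = trace / 2 + half_gap
    eig_small = trace / 2 - half_gap
    return eig_large / trace, eig_small / trace

# Hypothetical toy data lying close to the line y = x.
toy_points = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8), (5.0, 5.1)]
pc1_share, pc2_share = explained_variance_ratio_2d(toy_points)
print(pc1_share, pc2_share)
```

Plotting these shares against the component index gives the scree plot; here almost all of the variance falls on the first component, so reducing p from 2 to 1 would lose very little.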

5 3- Data mining concepts
Complexity/Performance tradeoff
Evaluating Classification models
–Accuracy (how many did I get right): not the best choice
–Precision/recall or Sensitivity/specificity tradeoff
–Varying the classification threshold traces out the ROC curve
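As a rough illustration of the threshold tradeoff, the sketch below computes precision, recall (sensitivity), specificity and accuracy at a given cutoff, then collects (false positive rate, true positive rate) pairs for a few cutoffs. The labels, scores and function names are made up for the example.

```python
def confusion_counts(labels, scores, threshold):
    """Count TP, FP, TN, FN when predicting 1 for score >= threshold."""
    tp = fp = tn = fn = 0
    for label, score in zip(labels, scores):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

def rates(labels, scores, threshold):
    """Summary rates at one threshold; empty denominators give 0.0."""
    tp, fp, tn, fn = confusion_counts(labels, scores, threshold)
    return {
        "accuracy": (tp + tn) / len(labels),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,       # sensitivity, TPR
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
        "fpr": fp / (fp + tn) if fp + tn else 0.0,
    }

# Hypothetical classifier scores for 3 positives and 5 negatives.
labels = [1, 1, 0, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2, 0.1, 0.05]

# Each threshold contributes one (FPR, TPR) point on the ROC curve.
roc_points = []
for threshold in (0.75, 0.5, 0.35, 0.0):
    r = rates(labels, scores, threshold)
    roc_points.append((r["fpr"], r["recall"]))
print(roc_points)
```

Lowering the threshold raises recall at the cost of more false positives, which is the tradeoff the ROC curve summarizes.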

6 4-Regression
Linear regression
–What is it, what are the assumptions, how do you check them
–Model selection: exhaustive or greedy (forward/backward selection) search
Extensions of Linear regression
–Non-linear in parameters, linear in form
–Generalized Linear Models: logistic regression, Poisson regression
–Shrinkage: ridge regression, lasso regression; profile plots show the trace of parameter estimates
–Principal component regression
–Nonparametric models: smoothing splines
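The shrinkage effect behind a ridge profile plot can be sketched in the simplest possible setting: one predictor, with both variables centered so no intercept is penalized. Under that assumption the ridge slope is sum(xy) / (sum(x^2) + penalty). The data and the penalty grid below are hypothetical.

```python
def ridge_slope(xs, ys, penalty):
    """Ridge estimate of the slope for a single centered predictor."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    # penalty = 0 gives the ordinary least squares slope.
    return sxy / (sxx + penalty)

# Hypothetical data with a slope of roughly 2.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

# The "profile": the coefficient estimate traced over increasing penalties.
profile = [(penalty, ridge_slope(xs, ys, penalty)) for penalty in (0.0, 1.0, 10.0, 100.0)]
for penalty, slope in profile:
    print(penalty, round(slope, 4))
```

The estimate moves steadily toward zero as the penalty grows. The lasso uses an absolute-value penalty instead and can set coefficients exactly to zero; it is not shown here.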

7 5-Classification
Categorical or binary response – ‘supervised’ learning
LDA: fit a parametric model to each class
Classification (decision) trees
–Binary splits on any predictor X
–Best split found algorithmically by gini or entropy to maximize purity
–Best size can be found via cross validation
–Can be unstable
K-Nearest Neighbors
–Tradeoff of large/small k
Probabilistic models
–Bayes error rate: best possible error if model is correct
–Naïve Bayes: independence assumption on p(x_i | c)
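A single step of tree growing, finding the binary split on one predictor that minimizes weighted Gini impurity, might look like the sketch below. It considers only midpoints between distinct sorted values; the helper names and toy data are illustrative assumptions, not a specific library's algorithm.

```python
def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions (0 means pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_split(values, labels):
    """Return (threshold, weighted_gini) for the best split 'x <= threshold'.

    The threshold is None if no split improves on the unsplit impurity.
    """
    pairs = list(zip(values, labels))
    n = len(pairs)
    best = (None, gini(labels))
    distinct = sorted(set(values))
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [label for value, label in pairs if value <= threshold]
        right = [label for value, label in pairs if value > threshold]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / n
        if weighted < best[1]:
            best = (threshold, weighted)
    return best

# Hypothetical predictor that separates the two classes cleanly.
values = [1, 2, 3, 10, 11, 12]
labels = ["a", "a", "a", "b", "b", "b"]
threshold, impurity = best_split(values, labels)
print(threshold, impurity)
```

A full tree repeats this search over every predictor and recurses into each side, which is one reason small changes in the data can produce a different tree.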

8 6-Clustering
No response variable – ‘unsupervised’ learning
Needs distance measures
–Euclidean, cosine, Jaccard, edit, ordinal and categorical
K-means
–Select initial solution
–Classify points, then re-calculate means
Hierarchical clustering
–Solutions for all k from 1 to n
–Dendrogram is an effective visualization
–Different distance functions (linkages) will result in different clusterings
Probabilistic
–Mixture models fit using EM algorithm
–Model based clustering
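The two K-means steps listed on this slide (classify points, then re-calculate means) can be sketched as a short loop. This version assumes squared Euclidean distance and takes the initial centers as an argument; the names and toy points are hypothetical.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, initial_centers, max_iter=100):
    """Alternate assignment and mean updates until the centers stop moving."""
    centers = [tuple(c) for c in initial_centers]
    assignments = []
    for _ in range(max_iter):
        # Step 1: classify each point to its nearest center.
        assignments = [
            min(range(len(centers)), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        # Step 2: re-calculate each center as the mean of its members.
        new_centers = []
        for j in range(len(centers)):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:
                new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(centers[j])  # keep an empty cluster's center
        if new_centers == centers:
            break
        centers = new_centers
    return centers, assignments

# Hypothetical data with two well-separated groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centers, assignments = kmeans(points, initial_centers=[(1.0, 1.0), (8.0, 8.0)])
print(centers, assignments)
```

The result depends on the initial solution, so in practice the loop is often run from several starting points and the best clustering kept.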

