Data Analytics CMIS Short Course, Part II
Day 1, Part 1: Introduction
Sam Buttrey, December 2015

Who Am I?
A.B., Princeton, Statistics; M.A., Ph.D., U. California-Berkeley, Statistics
Naval Postgraduate School, Department of Operations Research, 1996-Present
Data Analysis, Data Mining, Big Data Analytics, Classification, Modeling and Applications…
Married, one child…

Tentative Schedule
Today: Trees and Ensembles
– 9:00 – 10:00: Recap part I (Whitaker); Overview
– 10:00 – 2:00: Regression & Classification trees
– 2:00 – 3:00: Ensemble models
– 3:00 – 4:00: *Evaluating classifiers: ROC and F1
Tomorrow: Unsupervised Models
– 9:00 – 11:00: Principal components
– 11:00 – 2:00: Clustering
– 2:00 – 4:00? Association Rules
If Time Remains
– Simple forecasting models

The Big Picture
These courses are intended to help you visualize, predict, classify and find patterns in large data sets
Often these are constructed by combining different data sets from different sources
– Inconsistencies, redundancies, noise
Data sets in the course are small enough to be used quickly, but we want automatic techniques that scale up to huge-ish data

Data
Will normally appear as a rectangular array: rows are observations, columns are measurements (variables): n × p
– Data that is not already rectangular will be wrestled into this form!
– columns of pixels, counts of terms in documents, etc.
Each column has the same sort of measurement: numeric (incl. date/time), categorical, logical (True/False), text
Data might be missing
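A minimal sketch of such a rectangular array in R. The data frame `obs`, its column names and all of its values are made up for illustration; they are not course data.

```r
# Hypothetical toy data: 4 observations (rows) by 5 variables (columns).
obs <- data.frame(
  when     = as.Date(c("2015-12-01", "2015-12-02", "2015-12-03", "2015-12-04")),
  reading  = c(1.2, NA, 3.4, 2.8),             # numeric, with one missing value
  category = factor(c("a", "b", "a", "c")),    # categorical
  flagged  = c(TRUE, FALSE, FALSE, TRUE),      # logical
  note     = c("ok", "check", "ok", "ok"),     # text
  stringsAsFactors = FALSE                     # keep text as character
)

dim(obs)              # n rows, p columns
sapply(obs, class)    # the sort of measurement held in each column
colSums(is.na(obs))   # number of missing values per column
```

Missing values show up as `NA`, and `colSums(is.na(...))` is one quick way to see where they are.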

Types of Models (i)
Often one response (target) variable is the primary measure of interest (“Y”)
We want to predict the value of Y in new data where predictors (X’s) are known
When Y is numeric, this is regression
– E.g. size of error in TACNAV data
When Y is categorical, this is classification
– E.g. digit recognition (0, 1, …, 9)
These models are called “supervised” because the true Y’s are known
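The two kinds of supervised model can be sketched with R's built-in `mtcars` data (a stand-in chosen here, not a course data set), using the linear and logistic regression from Part I. The object names and the 0.5 cutoff are illustrative assumptions.

```r
# New data: the predictor (X) is known, the response (Y) is not.
new_cars <- data.frame(wt = c(2.5, 3.5))

# Regression: Y is numeric (miles per gallon).
reg_fit <- lm(mpg ~ wt, data = mtcars)
predict(reg_fit, newdata = new_cars)

# Classification: Y is categorical (am: 0 = automatic, 1 = manual).
cls_fit  <- glm(am ~ wt, data = mtcars, family = binomial)
p_manual <- predict(cls_fit, newdata = new_cars, type = "response")
ifelse(p_manual > 0.5, "manual", "automatic")   # assumed 0.5 cutoff
```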

Recap of Part I
Whitaker, September 2015
1. R and RStudio
2. Linear regression
– Comparison to Nearest-Neighbor Methods
3. Logistic Regression
4. Controlling Complexity
– Training set/test set; cross-validation
– Lasso and other regularization
5. Intro. to Neural Networks
These all involve linear combinations of predictor variables!

Constant Concerns
Modern models like trees are very flexible and therefore prone to over-fitting
Control complexity by:
1. Using separate training and validation sets, the latter to compare models
– Then evaluate prediction error with test set
2. Cross-validation across, say, 10 folds
3. Regularization (shrinkage of coefficients) via ridge or lasso
A constant theme in big data
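As a rough sketch of item 2, 10-fold cross-validation can be written in base R as follows. The helper `cv_rmse`, the two candidate formulas and the use of the built-in `mtcars` data are assumptions made for illustration, not the course's own code.

```r
set.seed(1)                                   # reproducible fold assignment
k <- 10
folds <- sample(rep(1:k, length.out = nrow(mtcars)))

# Hypothetical helper: cross-validated root-mean-squared error of a linear model.
cv_rmse <- function(formula, data, folds) {
  response <- all.vars(formula)[1]            # name of the Y variable
  sq_err <- numeric(nrow(data))
  for (f in sort(unique(folds))) {
    held_out <- folds == f
    fit  <- lm(formula, data = data[!held_out, ])       # fit on the other folds
    pred <- predict(fit, newdata = data[held_out, ])    # predict the held-out fold
    sq_err[held_out] <- (data[[response]][held_out] - pred)^2
  }
  sqrt(mean(sq_err))
}

cv_simple  <- cv_rmse(mpg ~ wt, mtcars, folds)
cv_complex <- cv_rmse(mpg ~ wt + hp + disp + qsec + drat, mtcars, folds)
c(simple = cv_simple, complex = cv_complex)
```

Every observation is predicted exactly once by a model that never saw it, so the two numbers can be compared directly when choosing between the models.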

Use the model built with the training data to predict a new set of data
[Figure: labels “Training Set” and “An Independent Test Set” shown against “Measure of Complexity”, running from less complex (under-fitting: high bias, low variance) to more complex (over-fitting: low bias, high variance)]
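One way to see this trade-off is a small simulation, sketched below under stated assumptions: the data are made up (a sine curve plus noise), and polynomial degree stands in for the measure of complexity.

```r
set.seed(42)
n <- 60
x <- runif(n, -2, 2)
sim <- data.frame(x = x, y = sin(2 * x) + rnorm(n, sd = 0.3))   # made-up truth plus noise

train_rows <- sample(n, 40)
train <- sim[train_rows, ]
test  <- sim[-train_rows, ]        # independent test set, never used for fitting

rmse <- function(fit, data) sqrt(mean((data$y - predict(fit, newdata = data))^2))

degrees <- 1:10                    # higher degree = more complex model
errors <- t(sapply(degrees, function(d) {
  fit <- lm(y ~ poly(x, d), data = train)
  c(train = rmse(fit, train), test = rmse(fit, test))
}))
cbind(degree = degrees, errors)
```

Training error can only go down as the degree rises, because each model contains the previous one; the test column is the one to watch for signs of over-fitting.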

Types of Models (ii)
Unsupervised models have no particular response variable
Goals are to find groups (clustering, source separation), or anomalies, or relationships (association rules), or reduce dimensionality for visualization
Generally more difficult and less satisfying than supervised models
– Hard to evaluate or compare quality
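A minimal sketch of two of these goals, clustering and dimension reduction for visualization, using R's built-in `iris` measurements as a stand-in data set. The choice of three clusters is an assumption made for the example.

```r
measurements <- scale(iris[, 1:4])   # numeric columns only; no response variable

set.seed(7)                          # k-means starts from random centers
groups <- kmeans(measurements, centers = 3, nstart = 20)
table(groups$cluster)                # sizes of the groups found

pcs <- prcomp(measurements)          # principal components
plot(pcs$x[, 1:2], col = groups$cluster,
     main = "k-means groups on the first two principal components")
```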

R and RStudio Recap
R is a very popular free open-source statistical environment
RStudio is a free front-end that makes managing scripts and graphics easier
Our variables come in vectors; a rectangular set of vectors makes up a data.frame
Example – beer (35 x 11)

R Basics Restated
R is case-sensitive (but Windows isn’t)
help (thing) or ?thing for help
a <- b assigns value of b to a
– Subsequent changes to b don’t affect a
Recall earlier commands with up-arrow
– history(100) shows last 100
Use forward slash for file names
Special characters: \n, \t, \\, \"
– Single or double-quotes okay; # for comment
== for “is equal”, != for “not equal”
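These basics can be tried directly at the console. The variable names and the file path below are made up for illustration.

```r
b <- 5
a <- b            # a gets the current value of b
b <- 10           # a later change to b...
a                 # ...leaves a unchanged

x <- 1
X <- 2            # case-sensitive: x and X are different objects
x == X            # "is equal" comparison
x != X            # "not equal" comparison

same_quotes <- 'single' == "single"    # either kind of quote makes the same string
cat("tab:\there\nquote: \"hi\"  backslash: \\\n")   # special characters inside a string

data_path <- "C:/data/beer.csv"        # hypothetical path, written with forward slashes
```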

Materials
The disc has Slides, R Scripts, Data sets and Libraries (plus a few random things)
– library ( ) looks in default places; library (, lib.loc= )
Ctrl-Enter executes a line from a script, but…
Lots of commands are already given to you – for best results make sure you understand them

R Refresher
Get beer data into R from Excel
Data frames and variable types
Simple exploration
– Plot of Calories vs. Alcohol
Simple linear regression model of Calories vs. Alcohol
Drawing the response “surface”
– To be used to compare the linear model with the tree model
Let’s do this!
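The steps above can be sketched without the course files. The data frame `beer_demo` below is simulated and only borrows the column names Calories and Alcohol; it is not the real beer data, and the import step from Excel is left out.

```r
# Simulated stand-in for the beer data (35 rows); all values are made up.
set.seed(2015)
beer_demo <- data.frame(Alcohol = round(runif(35, 3, 6), 1))
beer_demo$Calories <- round(20 + 27 * beer_demo$Alcohol + rnorm(35, sd = 10))

str(beer_demo)                                # data frame and variable types
plot(Calories ~ Alcohol, data = beer_demo)    # simple exploration

# Simple linear regression of Calories on Alcohol.
beer_lm <- lm(Calories ~ Alcohol, data = beer_demo)
coef(beer_lm)

# Response "surface": predictions over a grid of Alcohol values.
grid <- data.frame(Alcohol = seq(min(beer_demo$Alcohol),
                                 max(beer_demo$Alcohol), length.out = 100))
grid$fit <- predict(beer_lm, newdata = grid)
lines(grid$Alcohol, grid$fit, col = "blue")
```

With a single predictor the surface is just a straight line; predicting over a grid rather than calling `abline` keeps the same code usable for a model whose fitted surface is not linear.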
