Data Analytics CMIS Short Course part II Day 1 Part 1: Introduction Sam Buttrey December 2015.

Presentation transcript:

Data Analytics CMIS Short Course part II Day 1 Part 1: Introduction Sam Buttrey December 2015

Who Am I?
A.B., Princeton, Statistics; M.A., Ph.D., U. California-Berkeley, Statistics
Naval Postgraduate School, Department of Operations Research, 1996-Present
Data Analysis, Data Mining, Big Data Analytics, Classification, Modeling and Applications…
Married, one child…

Tentative Schedule
Today: Trees and Ensembles
– 9:00 – 10:00: Recap part I (Whitaker); Overview
– 10:00 – 2:00: Regression & Classification trees
– 2:00 – 3:00: Ensemble models
– 3:00 – 4:00: *Evaluating classifiers: ROC and F1
Tomorrow: Unsupervised Models
– 9:00 – 11:00: Principal components
– 11:00 – 2:00: Clustering
– 2:00 – 4:00? Association Rules
If Time Remains
– Simple forecasting models

The Big Picture
These courses are intended to help you visualize, predict, classify and find patterns in large data sets
Often these are constructed by combining different data sets from different sources
– Inconsistencies, redundancies, noise
Data sets in the course are small enough to be used quickly, but we want automatic techniques that scale up to huge-ish data

Data
Will normally appear as a rectangular array: rows are observations, columns are measurements (variables): n × p
– Data that is not already rectangular will be wrestled into this form!
– columns of pixels, counts of terms in documents, etc.
Each column has the same sort of measurement: numeric (incl. date/time), categorical, logical (True/False), text
Data might be missing
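A minimal sketch of such a rectangular array in R. The data frame `obs` and all of its column names and values are hypothetical, made up only to show one column of each type and one missing value.

```r
# Hypothetical toy data: 3 observations (rows) by 5 variables (columns).
obs <- data.frame(
  when = as.Date(c("2015-12-01", "2015-12-02", "2015-12-03")),  # date
  size = c(1.5, NA, 3.2),                                        # numeric, one value missing
  kind = factor(c("a", "b", "a")),                               # categorical
  ok   = c(TRUE, FALSE, TRUE),                                   # logical
  note = c("first", "second", "third"),                          # text
  stringsAsFactors = FALSE
)

dim(obs)             # n rows, p columns
sapply(obs, class)   # each column holds one sort of measurement
colSums(is.na(obs))  # number of missing values in each column
```

Every column has a single type, and `NA` marks the missing entry.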

Types of Models (i)
Often one response (target) variable is the primary measure of interest (“Y”)
We want to predict the value of Y in new data where predictors (X’s) are known
When Y is numeric, this is regression
– E.g. size of error in TACNAV data
When Y is categorical, this is classification
– E.g. digit recognition (0, 1, …, 9)
These models are called “supervised” because the true Y’s are known
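The two supervised cases can be sketched with data sets that ship with R. The choice of `cars` and `mtcars`, the predictors, and the 0.5 cutoff are assumptions for illustration; they are not the course's TACNAV or digit data.

```r
# Numeric Y: regression. 'cars' has stopping distance (dist) and speed.
reg_fit   <- lm(dist ~ speed, data = cars)
new_cars  <- data.frame(speed = c(10, 20))   # new data: X known, Y unknown
pred_dist <- predict(reg_fit, newdata = new_cars)

# Categorical Y: classification. 'am' in mtcars is 0 (automatic) or 1 (manual).
cls_fit     <- glm(am ~ wt, data = mtcars, family = binomial)
new_mt      <- data.frame(wt = c(2.5, 3.5))
prob_manual <- predict(cls_fit, newdata = new_mt, type = "response")
pred_class  <- ifelse(prob_manual > 0.5, "manual", "automatic")  # assumed cutoff

pred_dist    # predicted numeric responses
pred_class   # predicted class labels
```

In both cases the model is fitted where the true Y is known and then applied to rows where only the X's are available.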

Recap of Part I
Whitaker, September
1. R and RStudio
2. Linear regression
– Comparison to Nearest-Neighbor Methods
3. Logistic Regression
4. Controlling Complexity
– Training set/test set; cross-validation
– Lasso and other regularization
5. Intro. to Neural Networks
These all involve linear combinations of predictor variables!

Constant Concerns
Modern models like trees are very flexible and therefore prone to over-fitting
Control complexity by:
1. Using separate training and validation sets, the latter to compare models
   Then evaluate prediction error with test set
2. Cross-validation across, say, 10 folds
3. Regularization (shrinkage of coefficients) via ridge or lasso
A constant theme in big data
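A rough sketch of item 2, 10-fold cross-validation, in base R. The data set (`mtcars`), the model formula, the seed, and the use of mean squared error are all assumptions chosen for illustration.

```r
set.seed(1)                                  # arbitrary seed so the folds are repeatable
n    <- nrow(mtcars)
fold <- sample(rep(1:10, length.out = n))    # assign each row to one of 10 folds at random

cv_err <- numeric(10)
for (k in 1:10) {
  train <- mtcars[fold != k, ]               # fit on the other nine folds
  held  <- mtcars[fold == k, ]               # hold out fold k
  fit   <- lm(mpg ~ wt + hp, data = train)
  cv_err[k] <- mean((held$mpg - predict(fit, newdata = held))^2)
}

mean(cv_err)   # cross-validated estimate of mean squared prediction error
```

Repeating the loop for several candidate models and comparing `mean(cv_err)` is one way to choose among them without touching a final test set.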

Use the model built with the training data to predict a new set of data
[Figure: curves for the training set and an independent test set plotted against a measure of complexity, from less complex to more complex. Under-fitting: high bias, low variance. Over-fitting: low bias, high variance.]

Types of Models (ii)
Unsupervised models have no particular response variable
Goals are to find groups (clustering, source separation), or anomalies, or relationships (association rules), or reduce dimensionality for visualization
Generally more difficult and less satisfying than supervised models
– Hard to evaluate or compare quality
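As a small illustration of two of these goals, the sketch below clusters and reduces the dimension of the four numeric columns of the built-in `iris` data. The data set, the choice of three clusters, and the seed are assumptions; the species column is deliberately left out, so no response variable is used.

```r
x <- scale(iris[, 1:4])           # numeric columns only, standardized

set.seed(1)                       # k-means starts from random centers
km <- kmeans(x, centers = 3, nstart = 10)   # find groups (3 is an assumed number)
table(km$cluster)                 # cluster sizes

pc <- prcomp(iris[, 1:4], scale. = TRUE)    # principal components
plot(pc$x[, 1:2], col = km$cluster)         # first two components, colored by cluster
```

Nothing here says whether three clusters is the "right" answer, which is the evaluation difficulty the slide mentions.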

R and RStudio Recap
R is a very popular free open-source statistical environment
RStudio is a free front-end that makes managing scripts and graphics easier
Our variables come in vectors; a rectangular set of vectors makes up a data.frame
Example – beer (35 x 11)

R Basics Restated
R is case-sensitive (but Windows isn’t)
help(thing) or ?thing for help
a <- b assigns value of b to a
– Subsequent changes to b don’t affect a
Recall earlier commands with up-arrow
– history(100) shows last 100
Use forward slash for file names
Special characters: \n, \t, \\, \"
– Single or double-quotes okay; # for comment
== for “is equal”, != for “not equal”
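A few of these basics in one short script. The object names and the file path are hypothetical, used only to show the syntax.

```r
b <- c(1, 2, 3)
a <- b            # a receives the current value of b
b[1] <- 99        # a later change to b...
a                 # ...does not affect a

A <- "upper"      # A and a are different objects: R is case-sensitive

# Special characters inside a string: tab, newline, quote, backslash
cat("tab\there\nquote: \"x\"  backslash: \\\n")
s1 <- 'single quotes'   # single or double quotes both work
s2 <- "double quotes"

path <- "data/beer.csv" # hypothetical path, written with forward slashes

a[2] == 2         # "is equal"
a[2] != 2         # "not equal"
```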

Materials
The disc has Slides, R Scripts, Data sets and Libraries (plus a few random things)
– library ( ) looks in default places; library (, lib.loc= )
Ctrl-Enter executes a line from a script, but…
Lots of commands are already given to you – for best results make sure you understand them

R Refresher
Get beer data into R from Excel
Data frames and variable types
Simple exploration
– Plot of Calories vs. Alcohol
Simple linear regression model of Calories vs. Alcohol
Drawing the response “surface”
– To be used to compare the linear model with the tree model
Let’s do this!
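The same sequence of steps can be sketched without the course files. This is an illustration under stated assumptions: the built-in `mtcars` data frame stands in for the beer data, `mpg` and `wt` stand in for Calories and Alcohol, and the Excel import step is not shown.

```r
# Stand-in for the beer data (assumption): the built-in mtcars data frame.
str(mtcars)                              # data frame and variable types
summary(mtcars[, c("mpg", "wt")])        # simple exploration

plot(mpg ~ wt, data = mtcars)            # scatterplot of response vs. predictor

fit <- lm(mpg ~ wt, data = mtcars)       # simple linear regression
coef(fit)

# Response "surface": predictions over a grid of predictor values.
# With one predictor this is just the fitted line.
grid <- data.frame(wt = seq(min(mtcars$wt), max(mtcars$wt), length.out = 50))
grid$pred <- predict(fit, newdata = grid)
lines(grid$wt, grid$pred)                # add the fitted line to the plot
```

Predicting over a grid rather than calling `abline(fit)` keeps the code reusable for a model that is not a straight line, such as a tree.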