Lecture 1: Introduction to Machine Learning Methods

Lecture 1: Introduction to Machine Learning Methods
Stephen P. Ryan, Olin Business School, Washington University in St. Louis

Structure of the Course

Goals:
- Basic machine learning algorithms
- Heterogeneous treatment effects
- Some recent econometric theory advances
- How to use machine learning in your research projects

Sequence:
- Introduction to various ML techniques
- Causal inference and ML
- Random projection with large choice sets
- Double machine learning
- Using text data
- Moment trees for heterogeneous models

Shout Out to Elements of Statistical Learning

- http://statweb.stanford.edu/~tibs/ElemStatLearn/ (free PDF online)
- Convenient summary of many ML techniques by some of the leaders of the field: Trevor Hastie, Robert Tibshirani, and Jerome Friedman
- Many examples, figures, and algorithms in these notes are drawn from this book
- Other web resources:
  - Christopher Manning and Richard Socher: Natural Language Processing with Deep Learning, http://web.stanford.edu/class/cs224n/syllabus.html
  - Oxford Deep NLP course: https://github.com/oxford-cs-deepnlp-2017/lectures
  - Deep learning: http://deeplearning.net/

Machine Learning

- What is machine learning? Key idea: prediction
- Econometric theory has provided a general treatment of nonparametric estimation over the last twenty years
- Highest level: nothing new under the sun
- Twist: combine model selection with estimation
- These methods fit in the middle ground between fully pre-specified parametric models and completely nonparametric approaches
- Scalability on large data sets (e.g., so-called big data)

Two Broad Types of Machine Learning Methods

- Frequency domain: pre-specify the right-hand-side variables and select among them; many methods select a finite number of elements from a potentially high-dimensional set
- Characteristic space: search over which variables (and their interactions) belong as explanatory variables; no need to take a stand on the functional form of the elements
- Both approaches require some restrictions on function complexity

The General Problem

- Consider: $y = f(x, \beta, \epsilon)$
- Even when x is univariate, this can be a complex relationship
- What are some approaches for estimating this relationship?
  - Nonparametric: kernel regression (see the sketch below)
  - Semi-nonparametric: series estimation, e.g. b-splines with increasing knot density
  - Parametric: assume a functional form, e.g. $y = x'\beta + \epsilon$
- Each approach has pros and cons
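A minimal sketch of the nonparametric option, assuming a Gaussian kernel and a hand-picked bandwidth (both illustrative choices, not from the lecture): the Nadaraya-Watson estimator predicts at each point with a locally weighted average of the observed y values.

```python
# Nadaraya-Watson kernel regression on simulated data (illustrative bandwidth).
import numpy as np

def kernel_regression(x_grid, x, y, bandwidth):
    """Locally weighted average of y, with Gaussian weights centered at each grid point."""
    weights = np.exp(-0.5 * ((x_grid[:, None] - x[None, :]) / bandwidth) ** 2)
    return (weights * y).sum(axis=1) / weights.sum(axis=1)

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 200))
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.size)

x_grid = np.linspace(0, 1, 50)
f_hat = kernel_regression(x_grid, x, y, bandwidth=0.05)   # smaller bandwidth = wigglier fit
```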

ML Approaches in Characteristic Space

- Alternatively, consider growing the set of RHS variables
- Classification and regression trees do this via recursive partitioning of the characteristic space
- One starts with just the data and lets the tree decide what interactions matter
- Nodes split on a decision rule
- Final nodes ("leaves") of the tree return a classification vote or the mean value of the function (see the sketch below)
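A minimal sketch of this idea, assuming scikit-learn and a made-up data-generating process with a pure interaction (nothing here is from the lecture): the tree discovers the x1-x2 interaction on its own, with no functional form specified.

```python
# Let a regression tree recover an interaction between x1 and x2 from the data alone.
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(0)
X = rng.uniform(size=(500, 3))
y = 2.0 * (X[:, 0] > 0.5) * (X[:, 1] > 0.5) + rng.normal(scale=0.1, size=500)  # pure interaction

tree = DecisionTreeRegressor(max_depth=3).fit(X, y)
print(export_text(tree, feature_names=["x1", "x2", "x3"]))  # splits on x1 and x2 only
```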

ML Approaches in Frequency Domain

- Many ML algorithms approach this problem by saturating the RHS of the model, then assigning most of those variables zero influence
- Variants of penalized regression: $\min_\beta \sum_i (y_i - x_i'\beta)^2 + G(\beta)$, where G is some penalty function on the size of $\beta$
  - Ridge regression: penalize the sum of squared coefficients (with X standardized)
  - LASSO: penalize the sum of absolute values of the coefficients
  - Square-root LASSO: replace the squared-error loss with its square root, keeping the absolute-value penalty
- Support vector machines have a slightly different objective function (see the SVM slide below)
- There is also a series of incremental approaches, moving from simple to complex (a ridge/LASSO sketch follows below)
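A minimal sketch of two of these penalties, assuming scikit-learn and simulated data with only a few non-zero coefficients (the data-generating process and penalty weights are illustrative, not from the lecture): LASSO sets most coefficients exactly to zero, while ridge only shrinks them.

```python
# Ridge vs. LASSO on a sparse linear model.
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
beta = np.zeros(50)
beta[:5] = [3.0, -2.0, 1.5, -1.0, 0.5]            # only 5 of 50 variables matter
y = X @ beta + rng.normal(size=200)

ridge = Ridge(alpha=10.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)
print("ridge non-zero coefficients:", np.sum(np.abs(ridge.coef_) > 1e-8))   # all 50 (shrunk)
print("lasso non-zero coefficients:", np.sum(np.abs(lasso.coef_) > 1e-8))   # roughly the 5 signals
```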

Comparing LASSO and Ridge Regression

Incremental Forward Stagewise Regression

Coefficient Paths

Basis Functions

Increasing Smoothness

Splines in General

- There are many varieties of splines
- The smoothing spline is the solution to a penalized least-squares problem (reproduced below); amazingly, there is a cubic spline that minimizes this objective function
- All splines can also be written in terms of basis splines, or b-splines
- B-splines are great, and you should think about them when you need to approximate an unknown function
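The criterion itself did not survive the transcript; the standard smoothing-spline problem, as stated in Elements of Statistical Learning, is

```latex
\hat{f} = \arg\min_{f} \; \sum_{i=1}^{N} \bigl( y_i - f(x_i) \bigr)^2
          + \lambda \int f''(t)^2 \, dt
```

where $\lambda \ge 0$ trades off fit against curvature, and the minimizer is a natural cubic spline with knots at the unique values of $x_i$.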

B-Spline Construction

- Define a set of knots for a b-spline of degree M
- We place two knots at the endpoints of our data; call them knot 0 and knot K
- We define K-1 knots on the interior between those two endpoints
- We also add M knots outside each endpoint, to the left and to the right
- We can then compute b-splines recursively (see the book for details)
- Key point: b-splines are defined locally; upshot: numerically stable
- Approximate a function by least squares: $f(x) = \sum_i \beta_i \, bs_i(x)$ (a sketch follows below)
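A minimal sketch of this construction, assuming simulated data on [0, 1] and boundary knots repeated at the endpoints (a common "clamped" variant of the placement described on the slide): build the cubic basis with the Cox-de Boor recursion, then fit $f(x) = \sum_i \beta_i \, bs_i(x)$ by least squares.

```python
# Cubic B-spline basis via the Cox-de Boor recursion, fit by ordinary least squares.
import numpy as np

def bspline_basis(x, knots, degree):
    """Return the matrix B with B[j, i] = bs_i(x_j) for the given knot vector."""
    x = np.asarray(x)
    n_basis = len(knots) - degree - 1
    # Degree-0 basis: indicator of each half-open knot interval.
    B = np.array([(knots[i] <= x) & (x < knots[i + 1])
                  for i in range(len(knots) - 1)], dtype=float).T
    for k in range(1, degree + 1):
        B_new = np.zeros((len(x), B.shape[1] - 1))
        for i in range(B.shape[1] - 1):
            left_den = knots[i + k] - knots[i]
            right_den = knots[i + k + 1] - knots[i + 1]
            left = (x - knots[i]) / left_den * B[:, i] if left_den > 0 else 0.0
            right = (knots[i + k + 1] - x) / right_den * B[:, i + 1] if right_den > 0 else 0.0
            B_new[:, i] = left + right      # locality: each basis spans only a few intervals
        B = B_new
    return B[:, :n_basis]

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 200))
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)

M = 3                                       # cubic b-splines
interior = np.linspace(0, 1, 11)            # knot 0 ... knot K on [0, 1]
knots = np.r_[np.repeat(0.0, M), interior, np.repeat(1.0, M)]   # M extra knots at each boundary

B = bspline_basis(x, knots, M)
beta, *_ = np.linalg.lstsq(B, y, rcond=None)   # f_hat(x) = B @ beta
```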

Visual Representation of Basis Functions

Bias vs Variance: Smoothing Spline Example

Many Techniques Have Tuning Parameters

- In general, the question is how to determine those tuning parameters
- One approach is to use cross-validation
- Basic idea: estimate on one sample, predict on another
- Common approach: leave-one-out (LOO); across all subsamples in which one observation is removed, estimate the model, predict the error for the omitted observation, and sum up (see the sketch below)
- Balances too much variance (overfitting) against too much bias (oversmoothing)
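A minimal sketch of leave-one-out cross-validation, assuming scikit-learn, simulated data, and a ridge penalty as the tuning parameter to be chosen (all illustrative assumptions, not the lecture's example):

```python
# Choose a ridge penalty by leave-one-out cross-validation.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
beta = np.zeros(10)
beta[:3] = [2.0, -1.0, 0.5]                      # only three signals
y = X @ beta + rng.normal(scale=1.0, size=100)

loo = LeaveOneOut()
lambdas = np.logspace(-3, 3, 13)
cv_mse = []
for lam in lambdas:
    # For each left-out observation: fit on the rest, predict it, record squared error.
    scores = cross_val_score(Ridge(alpha=lam), X, y, cv=loo,
                             scoring="neg_mean_squared_error")
    cv_mse.append(-scores.mean())

best_lambda = lambdas[int(np.argmin(cv_mse))]
print("LOO-CV choice of lambda:", best_lambda)
```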

Example: Smoothing Splines

General Problem: Model Selection and Fit

How to Think About This Problem

- In general, we need to assess fit in-sample
- We also need to assess fit across models
- Bias-variance decomposition (the standard form is reproduced below)
- The optimal solution is a three-way partition of the data set: training data (fit the model), validation data (choose among models), and test data (report test error)
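The decomposition referenced on the slide is not reproduced in the transcript; the standard version for $y = f(x) + \epsilon$ with $E[\epsilon] = 0$ and $\mathrm{Var}(\epsilon) = \sigma_\epsilon^2$ (as in Elements of Statistical Learning, Ch. 7) is

```latex
E\bigl[ (y_0 - \hat{f}(x_0))^2 \mid x_0 \bigr]
  = \underbrace{\sigma_\epsilon^2}_{\text{irreducible error}}
  + \underbrace{\bigl[ E\hat{f}(x_0) - f(x_0) \bigr]^2}_{\text{bias}^2}
  + \underbrace{E\bigl[ \hat{f}(x_0) - E\hat{f}(x_0) \bigr]^2}_{\text{variance}}
```

More flexible models lower the bias term but raise the variance term, which is the tradeoff the three-way data split is designed to manage.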

Related to ML Methods

- Many ML methods seek to balance the bias-variance tradeoff in some fashion
- We will see honest trees later on, which build on this principle through sample splitting
- There are also some stochastic methods, such as bagging and random forests

Bootstrap and Bagging

- The bootstrap refers to the process of resampling your observations to (hopefully) learn something about the population, typically standard errors or confidence intervals
- Resample with replacement many times and compute the statistic of interest on each sample; this produces a distribution
- Bagging (bootstrap aggregation) is a similar idea: replace the estimate from the single sample with an average of estimates over bootstrap samples (a sketch follows below)
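A minimal sketch of the resampling step, assuming we want a bootstrap standard error for a sample median on made-up skewed data (statistic and data are illustrative). The bagged estimator is the same loop applied to a fitted model, $\hat{f}_{bag}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}^{*b}(x)$.

```python
# Bootstrap standard error of the median: resample with replacement, recompute, repeat.
import numpy as np

rng = np.random.default_rng(0)
data = rng.lognormal(mean=0.0, sigma=1.0, size=500)   # skewed data, illustrative

B = 2000
boot_medians = np.empty(B)
for b in range(B):
    resample = rng.choice(data, size=data.size, replace=True)   # bootstrap sample
    boot_medians[b] = np.median(resample)

print("point estimate:", np.median(data))
print("bootstrap standard error:", boot_medians.std(ddof=1))
print("95% percentile interval:", np.percentile(boot_medians, [2.5, 97.5]))
```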

Trees

- One of the settings where bagging can really reduce variance is trees
- Trees recursively partition the feature space into rectangles
- At each node, the tree splits the sample on the basis of some rule (e.g., x3 > 0.7, or x4 ∈ {BMW, Mercedes-Benz})
- Splits are chosen to optimize some criterion (e.g., minimize mean-squared prediction error); the split search is sketched below
- The tree is grown until some stopping criterion is met
- A leaf returns a class (classification tree) or an average value (regression tree)
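A minimal sketch of the split search at a single node, assuming squared-error loss and a single continuous feature (the data-generating process mimics the x3 > 0.7 example above and is otherwise made up): scan candidate thresholds and keep the split that minimizes the total squared error around the two leaf means.

```python
# Best single split of y on a scalar feature x, by exhaustive threshold search.
import numpy as np

def best_split(x, y):
    """Return (threshold, sse) of the squared-error-minimizing split of y on x."""
    order = np.argsort(x)
    x, y = x[order], y[order]
    best = (None, np.inf)
    for j in range(1, len(x)):
        if x[j] == x[j - 1]:
            continue                          # no valid threshold between equal values
        left, right = y[:j], y[j:]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best[1]:
            best = ((x[j - 1] + x[j]) / 2, sse)
    return best

rng = np.random.default_rng(0)
x = rng.uniform(size=300)
y = np.where(x > 0.7, 2.0, 0.0) + rng.normal(scale=0.3, size=300)
print(best_split(x, y))                       # recovers a threshold near 0.7
```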

Example of a Tree

Bagging a Tree

- Bagging a tree can often lead to great reductions in the variance of the estimate
- Why? Bagging replaces a single estimate using all the data with an ensemble of estimates using data resampled with replacement
- Let $\phi(x)$ be a predictor at a point x, fit on a random training sample, and let $\mu(x) = E(\phi(x))$ be its average over training samples. Then (see below):
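The inequality that completes this slide is not in the transcript; the standard argument (Elements of Statistical Learning, Sec. 8.7), assuming Y is independent of the training draw, is

```latex
E\bigl[ (Y - \phi(x))^2 \bigr]
  = E\bigl[ (Y - \mu(x))^2 \bigr] + E\bigl[ (\phi(x) - \mu(x))^2 \bigr]
  \;\ge\; E\bigl[ (Y - \mu(x))^2 \bigr]
```

so the aggregated predictor $\mu(x)$ never does worse in expected squared error, and bagging approximates it by averaging predictors fit on bootstrap samples.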

Random Forest

- The idea of resampling can be extended to subsampling
- The random forest is an ensemble version of the regression tree
- Key difference: estimate trees on bootstrap samples, but restrict the set of variables considered at each split to a random subset
- Why? This helps break the correlation across trees
- That is useful since the variance of the mean of B identically distributed RVs with variance $\sigma^2$ and pairwise correlation $\rho$ is $\rho \sigma^2 + \frac{1-\rho}{B} \sigma^2$ (a comparison of a single tree, bagging, and a random forest is sketched below)
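A minimal sketch of that comparison, assuming scikit-learn and an illustrative nonlinear data-generating process (none of the settings below are from the lecture): the random forest additionally restricts the variables tried at each split via max_features, which is the decorrelation device described above.

```python
# Compare a single regression tree, bagged trees, and a random forest on held-out data.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = X[:, 0] * X[:, 1] + np.sin(X[:, 2]) + rng.normal(scale=0.5, size=1000)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    "single tree": DecisionTreeRegressor(random_state=0),
    "bagged trees": BaggingRegressor(DecisionTreeRegressor(), n_estimators=200, random_state=0),
    "random forest": RandomForestRegressor(n_estimators=200, max_features=3, random_state=0),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    print(name, mean_squared_error(y_te, model.predict(X_te)))   # expect tree > bagging >= forest
```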

Random Forest Algorithm

Support Vector Machines (linear kernel)

- Support vector machines are a modified penalized regression method: $\sum_i V(y_i - f(x_i)) + \frac{\lambda}{2} \|\beta\|^2$, where $V_\epsilon(r) = |r| - \epsilon$ if $|r| > \epsilon$ and 0 otherwise
- Basically, we are going to ignore "small" errors
- This is a nonlinear optimization problem; when $f(x_i)$ is linear, the solution depends on only a subset of observations (the "support vectors") whose coefficients are non-zero (see the sketch below)
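A minimal sketch, assuming scikit-learn's SVR with a linear kernel on simulated data (epsilon and C are arbitrary illustrative choices): epsilon sets the width of the "ignore small errors" tube, and only the observations outside or on the tube, the support vectors, determine the fit.

```python
# Support vector regression with a linear kernel and an epsilon-insensitive loss.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.5, -2.0, 0.0, 0.0, 0.5]) + rng.normal(scale=0.3, size=200)

svr = SVR(kernel="linear", epsilon=0.5, C=1.0).fit(X, y)
print("coefficients:", svr.coef_.ravel())
print("number of support vectors:", len(svr.support_))   # typically far fewer than 200
```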

So How Is Any of This Useful?

- Think about machine learning as a combination of model selection and estimation
- Econometric theory has given us high-level tools for thinking about completely nonparametric estimation
- These techniques fit between fully parametric and fully nonparametric estimation
- Key point: we are approximating conditional expectations
- The economics literature is now considering the problem of how to take model selection seriously

Counterpoint