Regression, Prediction and Classification

Jacob LaRiviere
Microsoft Research 2013, 5/13/2018 8:49 AM

© 2013 Microsoft Corporation. All rights reserved. Microsoft, Windows, and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.

Terminology
$X$: features. $y$: outcomes.
The goal is to model outcomes as a function of features.
The width of $X$ is $p$ (the number of features).
The length of $X$ and $y$ is $n$ (the number of observations).

Terminology cont.
$X$: features. $y$: outcomes.
[Diagram: $X$ with column label "Feature 1" and row label "Obs 1"; $y$ with row label "Obs 1".]

Estimating equations
$Y = f(X) + \epsilon$
$y_i = f(x_i) + \epsilon_i$
where $\epsilon_i$ is called the error term.

Example: coin flipping
$Y = f(X) + \epsilon$
$\mathit{flip}_i = p + \epsilon_i$
If outcome $= 1$ then $\epsilon_i = 1 - p$.
If outcome $= 0$ then $\epsilon_i = -p$.
$p$ is a constant, called the “success rate” (here $p$ is a probability, not the width of $X$).
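The coin-flip model can be checked with a small simulation. This is only a sketch: the success rate of 0.3, the sample size, and all variable names are assumptions chosen for illustration, not values from the slides.

```r
# Hypothetical simulation of the coin-flip model flip_i = p + eps_i.
set.seed(1)
p <- 0.3                                   # assumed success rate
flip <- rbinom(1000, size = 1, prob = p)   # 1 = success, 0 = failure

eps <- flip - p                            # implied error term for each flip
error_values <- sort(unique(eps))          # only -p and 1 - p can occur
p_hat <- mean(flip)                        # sample estimate of the success rate
```

The error term takes only two values, and the sample mean of the flips estimates $p$.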

Linear regression
$y_i = \alpha + \beta_1 x_i + \epsilon_i$
$\alpha$ gives the intercept; $\beta_1$ gives the slope.
Ordinary least squares finds the $\hat{\alpha}$ and $\hat{\beta}_1$ that minimize the squared distance between the fitted line and the observations:
$\hat{y}_i = \hat{\alpha} + \hat{\beta}_1 x_i$
minimizes $\sum_{i=1}^{n} (\hat{y}_i - y_i)^2$

How does this work intuitively?
Assume $\alpha = 0$.
$\min_{\beta} \sum_i (y_i - \hat{y}_i)^2 = \sum_i (y_i - \beta x_i)^2 = \sum_i \left[ y_i^2 - 2 \beta x_i y_i + (\beta x_i)^2 \right]$
First-order condition (FOC): $-2 \sum_i x_i y_i + 2 \hat{\beta} \sum_i x_i^2 = 0 \;\rightarrow\; \hat{\beta} \sum_i x_i^2 = \sum_i x_i y_i$
$\hat{\beta} = \frac{\sum_i x_i y_i}{\sum_i x_i^2}$
In matrix form: $\hat{\beta} = (X'X)^{-1} (X'Y)$
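As a rough check of the derivation, the scalar formula, the matrix formula, and lm without an intercept should agree on the same data. The simulated data and variable names below are assumptions made for illustration.

```r
# Hypothetical data with no intercept (alpha = 0); the true slope 2 is an assumption.
set.seed(2)
x <- rnorm(50)
y <- 2 * x + rnorm(50)

# Scalar form from the first-order condition: sum(x*y) / sum(x^2)
beta_scalar <- sum(x * y) / sum(x^2)

# Matrix form (X'X)^{-1} (X'Y), solved without forming the inverse explicitly
X <- matrix(x, ncol = 1)
beta_matrix <- solve(t(X) %*% X, t(X) %*% y)

# lm with the intercept suppressed (the "0 +" in the formula)
beta_lm <- coef(lm(y ~ 0 + x))
```

All three estimates coincide up to floating-point error.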

How does this work for inference?
What about hypothesis testing? (Now assume $\alpha \neq 0$.)
$se(\hat{\beta}_1) = \sqrt{\hat{\sigma}^2_{\hat{\beta}_1}}$, with $\hat{\sigma}^2_{\hat{\beta}_1} = \frac{1}{n} \cdot \frac{\frac{1}{n-2} \sum_{i=1}^{n} (x_i - \bar{x})^2 \hat{\epsilon}_i^2}{\left[ \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2 \right]^2}$
$\rightarrow t = \frac{\hat{\beta}_1 - \beta_{1,0}}{se(\hat{\beta}_1)}$
Both $\hat{\beta}_1$ and $se(\hat{\beta}_1)$ are themselves sample statistics computed from the data $y$ and $X$. As a result, they each have their own distribution (e.g., normal and chi-squared). The resulting ratio has the Student t distribution, which is used for hypothesis testing for a given sample.
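The t-statistic can be rebuilt by hand from the coefficient table that summary returns. This is a sketch on simulated data (all names and values are assumptions); note that summary on an lm fit reports the classical standard error, which can differ from a heteroskedasticity-robust formula.

```r
# Hypothetical data; the true intercept 1 and slope 0.5 are assumptions.
set.seed(3)
x <- rnorm(100)
y <- 1 + 0.5 * x + rnorm(100)

fit <- lm(y ~ x)
coefs <- summary(fit)$coefficients       # estimates, std. errors, t values, p-values

beta1 <- coefs["x", "Estimate"]
se1 <- coefs["x", "Std. Error"]

# t-statistic for the null hypothesis beta_1 = 0
t_manual <- (beta1 - 0) / se1

# Two-sided p-value from the Student t distribution with n - 2 degrees of freedom
p_manual <- 2 * pt(abs(t_manual), df = df.residual(fit), lower.tail = FALSE)
```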

Height as function of age
Growth is most rapid in ages 12-16.
[Figure: height plotted against age; x-axis: AGE.]

Height as function of age
If we estimated $h = \alpha + \beta \cdot age$, the constant growth rate $\beta$ would overstate growth in early and later years and understate it during puberty.
[Figure: height plotted against age; x-axis: AGE.]

How would we estimate this relationship?
General formula: $h_i = f(age_i) + \epsilon_i$
$\epsilon_i$ allows people of the same age to have different heights. Even if we have the correct growth model using age alone, we expect some variation in height conditional on age.
To use a linear model, polynomial features allow for a non-linear relationship between the feature, age, and the outcome, height.
$h_i = \alpha + \beta_1 age_i + \beta_2 age_i^2 + \dots + \beta_p age_i^p + \epsilon_i$

Linear regression
Linear regression allows you to use non-linear functions of the features, provided each enters the model as an additive term with its own weight $\beta_j$:
$y_i = \alpha + \beta_1 g_1(x_{i,1}) + \dots + \beta_p g_p(x_{i,p}) + \epsilon_i$
The simplest example is where all features enter without any transformation:
$y_i = \alpha + \beta_1 x_{i,1} + \dots + \beta_p x_{i,p} + \epsilon_i$

Linear regression
$y_i = \alpha + \beta_1 x_i + \epsilon_i$
A linear function tends to underestimate income for medium-high education and overestimate income for low education.
[Figure: income plotted against education with the linear fit.]

Linear regression
A cubic polynomial fits the data much better:
$\hat{y}_i = \hat{\alpha} + \hat{\beta}_1 x_i + \hat{\beta}_2 x_i^2 + \hat{\beta}_3 x_i^3$
$\sum_i (\hat{y}_i - y_i)^2$ is now much smaller.
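The comparison between a linear and a cubic fit can be sketched by computing the residual sum of squares for both. The data-generating process below is an assumption (a made-up cubic relationship), not the income and education data from the slides.

```r
# Hypothetical data with a cubic relationship between x and y.
set.seed(4)
x <- runif(200, 0, 10)
y <- 0.05 * x^3 - 0.5 * x^2 + x + rnorm(200)

fit_linear <- lm(y ~ x)
fit_cubic <- lm(y ~ x + I(x^2) + I(x^3))   # I() protects ^ inside a formula

# Residual sum of squares: the quantity OLS minimizes
rss_linear <- sum(resid(fit_linear)^2)
rss_cubic <- sum(resid(fit_cubic)^2)
```

Because the linear model is nested in the cubic one, the in-sample sum of squares can only fall when the extra terms are added.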

Visualizing in two dimensions
$y_i = \alpha + \beta_1 educ_i + \beta_2 seniority_i + \epsilon_i$
A higher-order polynomial in educ and seniority could approximate this function.
[Figure: outcome plotted against educ and seniority.]

Steps for linear regression
1. Define the relationship we are trying to model. What is the outcome? What raw features do we have? E.g. $h_i = f(age_i) + \epsilon_i$
2. Define a linear model to approximate this relationship, e.g. $h_i = \alpha + \beta_1 age_i + \beta_2 age_i^2 + \dots + \beta_p age_i^p + \epsilon_i$
3. Create the necessary features, e.g. age2 = age^2 (or use poly).
4. Estimate the model:
   mymodel = lm(y ~ age + age2 + … + ageP)
   summary(mymodel)
5. Evaluate the model with an evaluation metric.
6. Repeat steps 2-5 to improve model fit.
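The steps above can be sketched end to end on simulated data. The growth curve used to generate heights, the cubic degree, and the data frame name df are all assumptions made for illustration.

```r
# Step 1: hypothetical height data (an S-shaped growth curve plus noise).
set.seed(5)
age <- runif(300, 2, 20)
height <- 80 + 90 / (1 + exp(-(age - 12) / 2)) + rnorm(300, sd = 4)
df <- data.frame(age = age, height = height)

# Steps 2-3: choose a cubic model and create the polynomial features by hand.
df$age2 <- df$age^2
df$age3 <- df$age^3

# Step 4: estimate the model and inspect it.
mymodel <- lm(height ~ age + age2 + age3, data = df)
summary(mymodel)

# Step 5: evaluate with an evaluation metric (here in-sample mean squared error).
mse <- mean(resid(mymodel)^2)

# Alternative to step 3: let poly build the same raw polynomial terms.
mymodel_poly <- lm(height ~ poly(age, 3, raw = TRUE), data = df)
```

Both ways of building the features give the same fitted values; step 6 would repeat this with a different degree or different features.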

Model prediction
1. After we have fit a model, we can predict the outcome for any input:
$\hat{h}_i = \hat{\alpha} + \hat{\beta}_1 age_i + \hat{\beta}_2 age_i^2 + \dots + \hat{\beta}_p age_i^p$
2. For any given observation:
$h_i - \hat{h}_i$: prediction error
$(h_i - \hat{h}_i)^2$: squared loss
$|h_i - \hat{h}_i|$: absolute loss

Model Fit
Fit refers to how well our model, an approximation of reality, matches what we actually observe in the data.
Mean squared error (MSE) is the average of the squared loss over the data points (rows) we evaluate.
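A tiny worked example of squared loss, absolute loss, and mean squared error. The observed and predicted heights below are made-up numbers used only to show the arithmetic.

```r
# Hypothetical observed heights and model predictions.
h <- c(150, 160, 170, 180)
h_hat <- c(152, 158, 171, 176)

squared_loss <- (h - h_hat)^2      # 4, 4, 1, 16
absolute_loss <- abs(h - h_hat)    # 2, 2, 1, 4

mse <- mean(squared_loss)          # 25 / 4 = 6.25
mae <- mean(absolute_loss)         # 9 / 4 = 2.25
```

Squared loss weights the largest miss (4 units) far more heavily than absolute loss does.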

Loss versus Prediction
Penalty | Sample: In (Inference & Research Design) | Sample: Out (prediction)
L1 Norm | Down-weights outliers | LASSO Cross Validation
L2 Norm | OLS Estimation | Ridge Cross Validation

Training vs. Test
We typically want to “train” the model on the data “we have” and test the model on “new data”. This is a realistic test of how well the model predicts outcomes.
In practice, we will subset our data:
X%: Training
Y%: Test
80% is commonly used for training.

Why do we need training vs. test?
[Figure: two curves, labeled “MSE: test set” and “MSE: training set”.]

Why do we need training vs. test?
The green curve “overfits” the data. It fits the noise.
Using an “out of sample” test set guards against “overfitting”.
The blue curve gets closest to the black (truth) curve.
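Overfitting can be illustrated by fitting polynomials of increasing degree and comparing training and test MSE. This is a sketch under assumed settings: the "truth" function, noise level, sample sizes, and degrees are all made up for illustration.

```r
# Hypothetical "truth" curve and noisy samples drawn around it.
set.seed(11)
f <- function(x) sin(2 * x)
train <- data.frame(x = runif(40, 0, 3))
train$y <- f(train$x) + rnorm(40, sd = 0.3)
test <- data.frame(x = runif(200, 0, 3))
test$y <- f(test$x) + rnorm(200, sd = 0.3)

# Fit polynomials of degree 1, 3 and 15 on the training set only,
# then record the MSE on both the training set and the test set.
mse_by_degree <- sapply(c(1, 3, 15), function(d) {
  fit <- lm(y ~ poly(x, d), data = train)
  c(train = mean(resid(fit)^2),
    test = mean((test$y - predict(fit, newdata = test))^2))
})
colnames(mse_by_degree) <- c("deg1", "deg3", "deg15")
```

Training MSE can only fall as the degree grows, so it cannot signal when the model has started fitting noise; the test row of mse_by_degree is the one to inspect for that.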

Training vs. test in practice
1. Randomly select observations (rows) to be in the test set. The remainder are the training set. Why random: we want complete coverage, e.g. we don’t want to train on February and test on July.
2. Subset on the training set for all model estimation, e.g. any data.frame you pass to lm.
3. Use the predict command to make predictions ($\hat{y}$) for the test set.
4. Compute the evaluation metric using the test set (you have the observed value and the predicted or modeled value).
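A minimal sketch of this workflow on simulated data. The 80/20 split follows the common choice mentioned earlier; the data, the data frame name df, and the use of MSE as the metric are assumptions.

```r
# Hypothetical data set with one feature.
set.seed(6)
n <- 200
df <- data.frame(x = runif(n, 0, 10))
df$y <- 3 + 2 * df$x + rnorm(n)

# 1. Randomly pick 20% of the rows for the test set; the rest is training.
test_idx <- sample(n, size = round(0.2 * n))
train <- df[-test_idx, ]
test <- df[test_idx, ]

# 2. Estimate the model on the training set only.
fit <- lm(y ~ x, data = train)

# 3. Predict the outcome for the test set.
y_hat <- predict(fit, newdata = test)

# 4. Compute the evaluation metric on the test set.
test_mse <- mean((test$y - y_hat)^2)
```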

R-squared
$R^2 = 1 - \frac{SSR}{TSS}$
$SSR$: residual sum of squares
$TSS$: total variation in the outcome variable
$R^2$ is the fraction of the variation in the outcome variable captured by the model.

Adjusted R-squared
R-squared is reported for the training data. That is, it is “in sample”.
The “in sample” model fit weakly improves with more explanatory variables.
$\text{Adjusted } R^2 = 1 - \frac{n-1}{n-k-1} \cdot \frac{SSR}{TSS}$
$n$ is the number of observations and $k$ is the number of explanatory variables.
The adjustment penalizes adding variables that have no explanatory power.
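Both measures can be computed by hand and compared with what summary reports. In this sketch the data are simulated, and x2 is an assumed feature that has no explanatory power by construction.

```r
# Hypothetical data: y depends on x1 only; x2 is pure noise.
set.seed(7)
n <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- 1 + 2 * x1 + rnorm(n)

fit <- lm(y ~ x1 + x2)
k <- 2                               # number of explanatory variables

ssr <- sum(resid(fit)^2)             # residual sum of squares
tss <- sum((y - mean(y))^2)          # total variation in the outcome

r2 <- 1 - ssr / tss
adj_r2 <- 1 - (n - 1) / (n - k - 1) * ssr / tss
```

The hand-computed values match summary(fit)$r.squared and summary(fit)$adj.r.squared, and the adjusted value is the smaller of the two.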

A “Fair” R-squared Measure
R-squared is reported for the training data. That is, it is “in sample”.
We learned that we typically want to use out-of-sample evaluation metrics, so what do we do?
It turns out that, in sample for a linear model with an intercept, $R^2 = corr(y_i, \hat{y}_i)^2$.
This means $corr(y_i, \hat{y}_i)^2$ for the test set can be interpreted just like R-squared (% of variation explained) and is a “fair test”.
This is related to an important concept called over-fitting.
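A sketch of the “fair” measure: the squared correlation between observed and predicted outcomes, computed once in sample (where it should equal the reported R-squared) and once on a held-out test set. The data and the split size are assumptions.

```r
# Hypothetical data and a random train/test split.
set.seed(8)
n <- 300
df <- data.frame(x = rnorm(n))
df$y <- 1 + 2 * df$x + rnorm(n)
test_idx <- sample(n, 60)
train <- df[-test_idx, ]
test <- df[test_idx, ]

fit <- lm(y ~ x, data = train)

# In sample: squared correlation equals the reported R-squared.
r2_in <- summary(fit)$r.squared
r2_in_check <- cor(train$y, fitted(fit))^2

# Out of sample: the same quantity computed on the test set.
r2_out <- cor(test$y, predict(fit, newdata = test))^2
```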

Feature types
Binary features: equal to 1 or 0.
Continuous features: can take any real value.
Categorical features: can take one of N values, e.g. Saturday, Sunday, Monday… If you declare the feature with as.factor, R will treat it as a set of binary variables, one for each category (e.g. =1 if Saturday, =0 otherwise). In lm with an intercept, one category is dropped and serves as the baseline.
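To see what R does with a factor, model.matrix shows the binary columns that a formula expands to. The day values below are made up; model.matrix is the base R function lm uses internally to build its design matrix.

```r
# Hypothetical categorical feature with three distinct values.
day <- c("Saturday", "Sunday", "Monday", "Saturday", "Monday")
day_f <- as.factor(day)

# Without an intercept: one binary column per category.
dummies_all <- model.matrix(~ 0 + day_f)

# With an intercept (the lm default): the first level, Monday,
# is dropped and becomes the baseline category.
dummies_lm <- model.matrix(~ day_f)
```

Here dummies_all has columns for Monday, Saturday and Sunday, while dummies_lm has an intercept plus columns for Saturday and Sunday only.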

Understand binary features
$y_i = \alpha + \beta_1 \cdot 1\{\text{if student}\} + \beta_2 \, income_i + \epsilon_i$
$\beta_1$ allows for a different y-intercept for students.

Understand binary features
$y_i = \alpha + \beta_1 \cdot 1\{\text{if student}\} + \beta_2 \, income_i + \beta_3 \cdot 1\{\text{if student}\} \cdot income_i + \epsilon_i$
$\beta_1$ allows for a different y-intercept for students.
$\beta_3$ allows for a different slope for students.

Interaction terms
Interaction term: multiplying two features by each other. When done as {continuous feature}*{binary feature}, it allows for a different slope for the group represented by the binary feature. Be sure to include the continuous feature without the interaction as well.
Example: suppose temperature has a different impact on the number of bikeshare trips taken on weekdays vs. weekends. We can create a binary variable is_weekend.
If we add is_weekend to the model, we allow for a different baseline ridership on weekends.
If we add is_weekend*temp, we allow temperature to have a different effect on weekends.
We can “interact” is_weekend with a polynomial in temp. This effectively gives us a separate temperature model for weekends vs. weekdays.
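The bikeshare example can be sketched with simulated data. Every number below (baseline ridership, slopes, noise) is an assumption chosen so that weekends have both a different baseline and a different temperature slope.

```r
# Hypothetical bikeshare data: weekday slope 5, weekend slope 5 + 3 = 8.
set.seed(9)
n <- 400
temp <- runif(n, 0, 35)
is_weekend <- rbinom(n, 1, 2 / 7)
trips <- 100 + 5 * temp + 30 * is_weekend + 3 * is_weekend * temp +
  rnorm(n, sd = 10)

# Main effects plus the interaction term temp:is_weekend.
fit <- lm(trips ~ temp + is_weekend + temp:is_weekend)
b <- coef(fit)

slope_weekday <- unname(b["temp"])                         # effect of temp on weekdays
slope_weekend <- unname(b["temp"] + b["temp:is_weekend"])  # effect of temp on weekends
```

In an R formula the colon adds only the product term, so the continuous feature temp is still listed on its own, as recommended above.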

Feature explosion
If we have two raw features, A and B, how many models can we make?
Without interactions (4): none, just A, just B, A + B.
With interactions (8 in total): the four above plus A*B, A + B + A*B, A + A*B, B + A*B.
If we have $p$ features then we have $2^{p+1}$ possible models! $p = 29 \rightarrow 1{,}073{,}741{,}824$
We cannot possibly run all these models… so we’ll learn methods to guide us. Human intelligence is often a great guide as well.
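The counts on this slide can be reproduced by enumeration. The sketch below treats each candidate term (A, B, and the product A*B) as either included or excluded; the variable names are illustrative.

```r
# Each model is a choice of include/exclude for every candidate term.
candidate_terms <- c("A", "B", "A*B")
include <- expand.grid(rep(list(c(FALSE, TRUE)), length(candidate_terms)))
names(include) <- candidate_terms

n_models_two_features <- nrow(include)   # 2^3 = 8 models for two raw features

# The slide's count for p = 29 features: 2^(p + 1).
n_models_p29 <- 2^(29 + 1)               # 1,073,741,824
```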

Interpreting regression output
summary(my_model) gives the coefficient estimates (betas), standard errors, t-statistics and p-values.
t-statistics evaluate the hypothesis that the coefficient is equal to zero. Thus a t-statistic greater than 1.96 in absolute value allows us to reject this hypothesis at the 5% significance level (95% confidence).
P-values simply convert t-stats into the probability that we would get something this far from zero due to sampling chance alone.
Note that t-statistics evaluate features “one by one”. Overall model fit is a better guide to explanatory power (e.g. we are sometimes better off leaving insignificant features in).
t-stats can still be a good guide for throwing out irrelevant features.

Parametric vs. non-parametric regression
$Y = f(X) + \epsilon$
Parametric: $f$ is defined by a model we write down, with parameters that we have to estimate (e.g. $\alpha$, $\beta_1$, etc.).
Non-parametric: we directly fit the data.

Local averaging

Local averaging vs. bins

Kernels
A kernel is a method of assigning higher weight to the points closer to the target point we are trying to fit. The choice of kernel usually does not matter much.

Bandwidths
Bandwidths tell us how wide to make our kernel (window). Larger bandwidths → smoother functions, because more data is used to make each average. The bandwidth is sometimes called the smoothing parameter. Bandwidths matter!
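
A kernel-weighted local average with two different bandwidths can be sketched in Python as follows. The Gaussian kernel, the synthetic zigzag data and the roughness measure are all assumptions chosen for illustration; they are not from the slides.

```python
import math

def gaussian_kernel(u):
    """Weight that decays smoothly with scaled distance u (one common kernel choice)."""
    return math.exp(-0.5 * u * u)

def kernel_average(x0, xs, ys, bandwidth):
    """Weighted average of ys, with more weight on points whose x is near x0."""
    weights = [gaussian_kernel((xi - x0) / bandwidth) for xi in xs]
    return sum(w * yi for w, yi in zip(weights, ys)) / sum(weights)

def roughness(fit):
    """Total up-and-down movement of a fitted curve; smaller means smoother."""
    return sum(abs(b - a) for a, b in zip(fit, fit[1:]))

# Synthetic data: a sine curve plus a deterministic zigzag standing in for noise.
xs = [i / 10 for i in range(51)]
ys = [math.sin(xi) + (0.3 if i % 2 == 0 else -0.3) for i, xi in enumerate(xs)]

narrow = [kernel_average(x0, xs, ys, bandwidth=0.05) for x0 in xs]
wide = [kernel_average(x0, xs, ys, bandwidth=0.5) for x0 in xs]

print("roughness, narrow bandwidth:", round(roughness(narrow), 2))
print("roughness, wide bandwidth:  ", round(roughness(wide), 2))
```

The narrow bandwidth chases the zigzag, while the wide one averages it away, which is the sense in which the bandwidth acts as a smoothing parameter.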

K-nearest neighbors
An alternative to bandwidths: instead of specifying a fixed window, we can say “use the K nearest neighbors”. This ensures each bin has the same amount of data and can be a useful tool. Functionally it will often be quite similar to using a bandwidth, except when there are “sparse data” issues (then K-NN is preferred). K is often expressed as a fraction of the data, e.g. NN=.1 says “each bin is 10% of the data”.
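
As a rough Python illustration of the K-nearest-neighbors idea, the sketch below averages the outcomes of the K closest points, with K given either directly or as a fraction of the data. The function names and the small data set (dense near 0, sparse around 5) are hypothetical.

```python
def knn_average(x0, xs, ys, k):
    """Average the outcomes of the k observations whose x is closest to x0."""
    nearest = sorted(range(len(xs)), key=lambda i: abs(xs[i] - x0))[:k]
    return sum(ys[i] for i in nearest) / k

def knn_fraction(x0, xs, ys, fraction):
    """Same idea, with the neighborhood given as a fraction of the data."""
    k = max(1, round(fraction * len(xs)))
    return knn_average(x0, xs, ys, k)

# Hypothetical data: many points near 0, only one near 5.
xs = [0.0, 0.1, 0.2, 0.3, 0.4, 5.0, 9.0, 9.1]
ys = [1.0, 1.2, 0.9, 1.1, 1.0, 3.0, 5.0, 5.2]

# A fixed window of width 1 around x0 = 5 would contain a single point;
# K-NN widens the neighborhood until it holds k points.
print(knn_average(5.0, xs, ys, k=3))
print(knn_fraction(5.0, xs, ys, fraction=0.25))
```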

Local polynomial regression
Think of local averaging as fitting a constant within a given window of data. Instead of fitting a constant, we can fit a line (y = a*x + b) or a polynomial (y = a*x + b*x^2 + c), etc. Local polynomial regression is thus more general than local averaging, but very similar.
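
The difference between fitting a local constant and fitting a local line can be sketched in Python with kernel-weighted least squares. The Gaussian weights, the function names and the test data are assumptions for illustration; this is not the locfit implementation.

```python
import math

def _weights(x0, xs, bandwidth):
    """Gaussian kernel weights centered on the target point x0."""
    return [math.exp(-0.5 * ((xi - x0) / bandwidth) ** 2) for xi in xs]

def local_constant(x0, xs, ys, bandwidth):
    """Local averaging: fit a constant to the weighted data around x0."""
    w = _weights(x0, xs, bandwidth)
    return sum(wi * yi for wi, yi in zip(w, ys)) / sum(w)

def local_linear(x0, xs, ys, bandwidth):
    """Fit a weighted least-squares line around x0 and evaluate it at x0."""
    w = _weights(x0, xs, bandwidth)
    sw = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, xs)) / sw
    y_bar = sum(wi * yi for wi, yi in zip(w, ys)) / sw
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, xs))
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, xs, ys))
    slope = sxy / sxx
    return y_bar + slope * (x0 - x_bar)

# Exactly linear data, y = 2x + 1, evaluated at the left edge x0 = 0.
xs = [i / 10 for i in range(21)]
ys = [2 * xi + 1 for xi in xs]

print("local constant at 0:", round(local_constant(0.0, xs, ys, 0.3), 3))
print("local linear at 0:  ", round(local_linear(0.0, xs, ys, 0.3), 3))
```

At the edge of the data the local constant can only average points to one side, so it overshoots the true value of 1, while the local line recovers it.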

Some examples

Local polynomial in R
Use the locfit package: install.packages("locfit"), then library(locfit). The modeler needs to specify the bandwidth (or fraction of nearest neighbors) and the degree of the local polynomial. Contrast: in parametric regression, we had to write down a model. Aside: we’ll learn that non-parametric methods struggle in higher dimensions.

When to use each model?
If there is a relatively small number of features (e.g. <4), then non-parametric makes a lot of sense. With many features, the “curse of dimensionality” sets in and non-parametric methods fall apart (the “neighborhood” is generally empty). Parametric models using interaction terms and polynomials are the preferred method with many features. “Semi-parametric” methods combine elements of both.
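
A back-of-the-envelope Python sketch of why the neighborhood empties out as features are added. It assumes features spread uniformly over the unit hypercube, a simplification that is not in the slides; the function names are hypothetical.

```python
def expected_neighbors(n, width, p):
    """Expected number of points inside a cube of side `width` when n points
    are spread uniformly over the unit cube in p dimensions (assumption)."""
    return n * width ** p

def edge_needed(fraction, p):
    """Side length of the cube needed to capture `fraction` of the data."""
    return fraction ** (1 / p)

for p in (1, 2, 4, 10):
    print(
        f"p={p:2d}  neighbors in a 0.2-wide window (n=1000): "
        f"{expected_neighbors(1000, 0.2, p):10.4f}   "
        f"edge needed for 10% of data: {edge_needed(0.10, p):.3f}"
    )
```

With one feature, a window covering 20% of the range holds about 200 of 1,000 points; with four features it holds fewer than two, and capturing 10% of the data in ten dimensions takes a window spanning most of each feature's range, which is no longer local.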

Choosing a “demand model”

Demand models approximate how customers make choices
Model definition: distilling the relevant aspects of a real-world situation for systematic and quantitative study.

Model 1: The Classic Demand Curve
How many units of product X will be sold at each price point, all else equal? Equivalently, how many consumers think buying X at price p is better than buying any other product available at current prices (assuming unit demand)?
Figure: demand curve (axes: price and quantity).

A common pitfall is to confuse a demand curve with some other graph that happens to have price and quantity on the axes. For example, data showing that more expensive models tended to have higher sales does not mean consumers like price increases!

What does a demand curve not model?
Any form of competition (it’s held fixed). Any form of consumer segmentation (there’s just one curve).

Competition
All products have to compete to get bought. Consumers don’t have to buy, and won’t if the price is too high. Consumers could buy something else, so the prices of substitute products are relevant. Consumers can wait until tomorrow to buy, so future prices of all products are relevant.

Segmentation
Buyers vary in their price sensitivity and in how they value different product features, e.g. enterprise vs. consumer markets. We essentially get a different curve for each segment, and can maybe set a different price for each. Measurement: can we find buyer groups g1, g2, etc. and estimate demand separately by group?

Isolating important aspects of demand
Which features of the product drive consumer valuations, and do they differ across well-defined segments? Empirical methods will help us learn which features matter to consumers. For product versions, the most important are those that can easily be varied based on engineering requirements (e.g. dual SIM in phones).
How sensitive are customers to price? Elasticity of demand: if price drops X%, quantity demanded goes up Y%.
Are there segments with different price sensitivity? E.g. customers by country, income level, student status, etc.; “mission critical” database needs vs. the average job.
How does price sensitivity vary along the demand curve? E.g. when generic drugs are released, the brand-name price goes up: price-sensitive customers will always opt for the cheaper alternative, and those that remain have higher valuations and less sensitivity to price.
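
The elasticity definition above (price changes by X%, quantity changes by Y%) can be written as a small Python helper. The function name and all prices and quantities below are hypothetical, chosen only to show two segments with different price sensitivity.

```python
def price_elasticity(p0, q0, p1, q1):
    """Percent change in quantity divided by percent change in price,
    both measured relative to the starting point (p0, q0)."""
    pct_change_q = (q1 - q0) / q0
    pct_change_p = (p1 - p0) / p0
    return pct_change_q / pct_change_p

# Hypothetical segment data: the same 10% price cut in two segments.
student = price_elasticity(p0=100, q0=1000, p1=90, q1=1150)     # +15% quantity
enterprise = price_elasticity(p0=100, q0=1000, p1=90, q1=1020)  # +2% quantity

print(f"student elasticity:    {student:.2f}")
print(f"enterprise elasticity: {enterprise:.2f}")
```

An elasticity below -1 (the student segment here) means a price cut raises revenue in that segment; one between -1 and 0 (the enterprise segment) means it lowers revenue.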

Model 2: The Logit Demand Model
Models demand in markets rather than demand for a particular product. There are J products that compete in a market (e.g. smartphones). Each product j has an attractiveness a_j that depends on its features x_j and price p_j.

Model 2: The Logit Demand Model
A product’s market share s_j depends on how attractive it is relative to its competitors: s_j = exp(a_j) / (exp(a_1) + ... + exp(a_J)). So with 3 products, product 1 gets s_1 = exp(a_1) / (exp(a_1) + exp(a_2) + exp(a_3)).

An Example
Model assumptions:
Utility from owning a smartphone: 4
Additional utility from iPhone: 1
Change in attractiveness from a $1 increase in price: -0.008

Product   Price   Baseline Utility   Attractiveness (net of price)   Market Share
iPhone    650     5                  -0.2                            0.35
Rest      450     4                  0.4                             0.65
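
The numbers in this example can be reproduced with a short Python sketch. It assumes the standard logit share formula with no outside option (exponentiated attractiveness divided by the sum over products), which is the form that matches the 0.35 and 0.65 shares in the table; the function names are hypothetical.

```python
import math

PRICE_COEF = -0.008  # change in attractiveness per $1 of price (from the example)

def attractiveness(baseline_utility, price, price_coef=PRICE_COEF):
    """Attractiveness net of price: baseline utility plus the price effect."""
    return baseline_utility + price_coef * price

def logit_shares(attract):
    """Map each product's attractiveness to a market share (shares sum to 1)."""
    denom = sum(math.exp(a) for a in attract.values())
    return {name: math.exp(a) / denom for name, a in attract.items()}

attract = {
    "iPhone": attractiveness(baseline_utility=4 + 1, price=650),
    "Rest": attractiveness(baseline_utility=4, price=450),
}
shares = logit_shares(attract)

for name in attract:
    print(f"{name}: attractiveness={attract[name]:.1f} share={shares[name]:.2f}")
```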

Model 2: Logit Demand Model
Advantage: enables richer scenario planning, for price changes on one or multiple products simultaneously. We can simulate the impact of price changes, of changing price sensitivity and of raising product attractiveness.
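
A sketch of this kind of scenario planning in Python, reusing the iPhone example's parameters. The three scenarios (a $50 price cut, a price coefficient of -0.010, and a 0.5 increase in iPhone utility) are hypothetical, as is the helper iphone_share, and the share formula is the standard logit form assumed to match the example.

```python
import math

def iphone_share(iphone_utility=5.0, iphone_price=650, rest_utility=4.0,
                 rest_price=450, price_coef=-0.008):
    """iPhone market share in a two-product logit model."""
    a_iphone = iphone_utility + price_coef * iphone_price
    a_rest = rest_utility + price_coef * rest_price
    return math.exp(a_iphone) / (math.exp(a_iphone) + math.exp(a_rest))

scenarios = {
    "baseline": iphone_share(),
    "iPhone price cut to 600": iphone_share(iphone_price=600),
    "customers more price sensitive": iphone_share(price_coef=-0.010),
    "iPhone utility raised by 0.5": iphone_share(iphone_utility=5.5),
}

for label, share in scenarios.items():
    print(f"{label:32s} iPhone share = {share:.3f}")
```

Each scenario changes one input and leaves the rest at the example's values, so differences in share can be attributed to that input alone.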

Which model to pick?
There are two basic models of demand. The demand curve is simple and in principle requires only own-price and sales data to estimate. A logit model (and anything more complex) needs market-level data, i.e. sales and prices for all products.

Which model to pick?
We’re doing all this to optimize prices, so: if we have a single product to price and, where competitor products exist, their prices are unlikely to change, then estimating a demand curve is just fine.

Which model to pick?
We’re doing all this to optimize prices, so: if we have multiple competing products to price, then we need to model cannibalization and need a logit. Likewise, if competitors are likely to respond and we want to model likely future scenarios, we need a logit.
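
Cannibalization is easy to see in a three-product logit sketch in Python. The product names, utilities and prices below are hypothetical; the price coefficient is borrowed from the iPhone example, and the share formula is the standard logit form with no outside option.

```python
import math

PRICE_COEF = -0.008  # borrowed from the iPhone example

def logit_shares(products):
    """products maps name -> (baseline utility, price); returns name -> share."""
    attract = {name: u + PRICE_COEF * price for name, (u, price) in products.items()}
    denom = sum(math.exp(a) for a in attract.values())
    return {name: math.exp(a) / denom for name, a in attract.items()}

# Hypothetical market: we sell a premium and a budget product against one rival.
before = logit_shares({
    "own_premium": (5.0, 650),
    "own_budget": (4.5, 550),
    "rival": (4.0, 450),
})
# Scenario: cut the premium price by $50, everything else unchanged.
after = logit_shares({
    "own_premium": (5.0, 600),
    "own_budget": (4.5, 550),
    "rival": (4.0, 450),
})

for name in before:
    print(f"{name:12s} {before[name]:.3f} -> {after[name]:.3f}")
```

The premium product gains share partly from the rival and partly from our own budget product, an effect a single-product demand curve cannot represent.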

Bigger picture
Does your demand model match the proposed pricing strategy (competing product set, segments, revenue model)? Do you have a way of learning the parameters of this demand model?

More complicated models
Models with multiple latent (not directly observable) consumer types. Models of dynamic demand (forward-looking consumers). Models of consumer search.

Estimating a demand model

How do you estimate a demand curve?
Figure: demand curve (axes: price and quantity).
What you’d like is some data on prices and sales in which everything else is “held constant” (e.g. experiments). The problem is that price changes are often correlated with other changes in the environment, so everything else is not held constant.

Christmas is bad for econometrics
Figure: prices and sales.

Spurious correlations
In general, price and sales are related in so many ways other than through the demand channel that a regression of sales on price is literally the textbook example of an endogenous regression.
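
A small simulation in Python shows how such a regression can mislead. Everything here is made up for illustration: the true demand slope is set to -2, and a holiday indicator raises both demand and price, so price is correlated with an omitted demand shifter.

```python
import random

def ols_slope(x, y):
    """Slope from a simple regression of y on x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / sxx

TRUE_SLOPE = -2.0          # assumed true effect of price on sales
rng = random.Random(0)     # fixed seed so the run is repeatable

holiday, price, sales = [], [], []
for _ in range(2000):
    h = 1 if rng.random() < 0.5 else 0              # holiday season or not
    p = 20 + 6 * h + rng.gauss(0, 1)                # sellers raise prices in the holidays
    q = 100 + TRUE_SLOPE * p + 20 * h + rng.gauss(0, 1)  # holidays also shift demand up
    holiday.append(h)
    price.append(p)
    sales.append(q)

# Naive regression of sales on price, ignoring the season.
naive = ols_slope(price, sales)

# Same regression run separately within each season, holding the shifter fixed.
within = {
    h: ols_slope(
        [p for p, s in zip(price, holiday) if s == h],
        [q for q, s in zip(sales, holiday) if s == h],
    )
    for h in (0, 1)
}

print(f"naive slope: {naive:.2f}")
print(f"slope within non-holiday periods: {within[0]:.2f}")
print(f"slope within holiday periods:     {within[1]:.2f}")
```

The naive slope comes out positive even though every consumer in the simulation buys less when the price rises; adding more simulated observations only tightens the estimate around that wrong value.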

Good Data not “Big Data”
In the age of big data, everyone is scrambling to collect all the data they can. But it’s really data quality rather than quantity that matters. More data doesn’t magically make the spurious correlations go away; you just measure the spurious correlations very precisely!

Getting good data
Hypothetical surveys: how can we make them realistic? Field experiments: when do they work? Detailed historical logs: what can we do with them, and how do we remove spurious correlation?

How can I know something about demand for a new product?
Learn something from the features and aspects of the new product that already appear in existing products. Use surveys and conjoint analysis methods. Run pilots, limited releases and initial experimentation.
