Published by Noah Ramsey. Modified over 3 years ago.

1
Predicting Pizza in Chinatown: An Intro to Multilevel Regression
Harlan D. Harris, PhD
Jared P. Lander, MA
NYC Predictive Analytics Meetup
October 14, 2010

2
1. How do cost and fuel type affect pizza quality?
2. How do those factors vary by neighborhood?

3
Linear Regression (OLS)
$\text{rating}_i = \beta_0 + \beta_{\text{price}} \cdot \text{price}_i + \varepsilon_i$
Find the βs that minimize $\sum_i \varepsilon_i^2$.

4
Linear Regression (OLS)
$\text{rating}_i = \beta_0 + \beta_{\text{price}} \cdot \text{price}_i + \varepsilon_i$
Find the βs that minimize $\sum_i \varepsilon_i^2$.

5
Multiple Regression
rating_i = beta[intercept] * 1
         + beta[price] * price_i
         + beta[oven=wood] * I(oven_i = wood)
         + beta[oven=coal] * I(oven_i = coal)
         + error_i
Goal: find the betas (coefficients) that minimize Σ_i error_i^2.
Three oven types need only two coefficients (gas is the reference level).
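The indicator coding above is what R builds automatically from a factor. The following is a minimal sketch on made-up toy data (the data frame `toy` and its values are hypothetical, not from the talk); it only assumes that `gas` is placed first in the factor levels so that it becomes the reference.

```r
# Hypothetical toy data; the values are invented for illustration only.
toy <- data.frame(
  rating = c(3.5, 2.8, 3.3, 3.8, 4.1, 2.9),
  price  = c(2.00, 2.50, 1.75, 3.50, 3.00, 2.25),
  oven   = factor(c("gas", "gas", "wood", "coal", "wood", "coal"),
                  levels = c("gas", "wood", "coal"))  # first level = reference
)

# Design matrix for rating ~ price + oven: an intercept column, the price
# column, and one 0/1 indicator column per non-reference oven type.
X <- model.matrix(rating ~ price + oven, data = toy)
print(colnames(X))
print(X)

# lm() uses the same coding; each oven coefficient is an offset from gas.
fit <- lm(rating ~ price + oven, data = toy)
print(coef(fit))
```

There is no `ovengas` column: the gas effect is absorbed into the intercept, which is why three oven types produce two coefficients.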

6
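Multiple Regression (OLS)
$\text{rating}_i = \beta_0 + \beta_{\text{price}} \cdot \text{price}_i + \beta_{\text{wood}} \cdot I(\text{oven}_i = \text{"wood"}) + \beta_{\text{coal}} \cdot I(\text{oven}_i = \text{"coal"}) + \varepsilon_i$
Find the βs that minimize $\sum_i \varepsilon_i^2$.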

7
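Multiple Regression (OLS) with Interactions
$\text{rating}_i = \beta_0 + \beta_{\text{price}} \cdot \text{price}_i + \beta_{\text{wood}} \cdot I(\text{oven}_i = \text{"wood"}) + \beta_{\text{wood,price}} \cdot \text{price}_i \cdot I(\text{oven}_i = \text{"wood"}) + \beta_{\text{coal}} \cdot I(\text{oven}_i = \text{"coal"}) + \beta_{\text{coal,price}} \cdot \text{price}_i \cdot I(\text{oven}_i = \text{"coal"}) + \varepsilon_i$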
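With interaction terms, each oven type gets its own price slope: the reference slope plus that oven's interaction coefficient. Here is a small sketch on simulated data (the data, the seed, and the "true" slopes are assumptions chosen for illustration) showing how to recover the per-oven slopes from an `lm()` fit.

```r
set.seed(1)
n <- 60

# Simulated data; the true slopes below are assumed, not from the talk.
toy <- data.frame(
  price = runif(n, 1.25, 5.25),
  oven  = factor(sample(c("gas", "wood", "coal"), n, replace = TRUE),
                 levels = c("gas", "wood", "coal"))
)
true.slope <- unname(c(gas = 0.3, wood = 0.8, coal = 0.5)[as.character(toy$oven)])
toy$rating <- 1 + true.slope * toy$price + rnorm(n, sd = 0.3)

# price * oven expands to price + oven + price:oven
fit <- lm(rating ~ price * oven, data = toy)
b <- coef(fit)
print(names(b))

# Per-oven price slopes: reference slope plus the interaction offset
slopes <- c(gas  = unname(b["price"]),
            wood = unname(b["price"] + b["price:ovenwood"]),
            coal = unname(b["price"] + b["price:ovencoal"]))
print(round(slopes, 2))
```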

8
Groups
Examples (group / observations):
teachers / test scores
states / poll results
neighborhoods / pizza ratings

9
Full Pooling (ignore groups)
Examples (group / observations): teachers / test scores, states / poll results, neighborhoods / pizza ratings
$\text{rating}_i = \beta_0 + \beta_{\text{price}} \cdot \text{price}_i + \varepsilon_i$

10
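No Pooling (groups as factors)
$\text{rating}_i = \beta_0 + \beta_{\text{price}} \cdot \text{price}_i + \beta_{B} \cdot I(\text{group}_i = \text{"B"}) + \beta_{B,\text{price}} \cdot \text{price}_i \cdot I(\text{group}_i = \text{"B"}) + \beta_{C} \cdot I(\text{group}_i = \text{"C"}) + \beta_{C,\text{price}} \cdot \text{price}_i \cdot I(\text{group}_i = \text{"C"}) + \varepsilon_i$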

11
Pizzas
Name      Rating    $/Slice     Fuel Type    Neighborhood
Rosarios  3.5       2.00        Gas          Lower East Side
Rays      2.8       2.50        Gas          Chinatown
Joes      3.3       1.75        Wood         East Village
Pomodoro  3.8       3.50        Coal         SoHo
(role)    Response  Continuous  Categorical  Group

12
Data Summary in R
> za.df <- read.csv("Fake Pizza Data.csv")
> summary(za.df)
     Rating        CostPerSlice    HeatSource      Neighborhood
 Min.   :0.030   Min.   :1.250   Coal: 17    Chinatown  :14
 1st Qu.:1.445   1st Qu.:2.000   Gas :158    EVillage   :48
 Median :4.020   Median :2.500   Wood: 25    LES        :35
 Mean   :3.222   Mean   :2.584               LittleItaly:43
 3rd Qu.:4.843   3rd Qu.:3.250               SoHo       :60
 Max.   :5.000   Max.   :5.250
http://github.com/HarlanH/nyc-pa-meetup-multilevel-pizza

13
Viewing the Data in R
> plot(za.df)

14
Visualize
ggplot(za.df, aes(CostPerSlice, Rating, color=HeatSource)) +
  geom_point() +
  facet_wrap(~ Neighborhood) +
  geom_smooth(aes(color=NULL), color='black', method='lm', se=FALSE, size=2)

15
Multiple Regression in R
> lm.full.main
> plotCoef(lm.full.main)
http://www.jaredlander.com/code/plotCoef.r

16
Full-Pooling: Include Interaction
> lm.full.int <- lm(Rating ~ CostPerSlice * HeatSource, data=za.df)
> plotCoef(lm.full.int)

17
Visualize the Fit (Full-Pooling)
> lm.full.int <- lm(Rating ~ CostPerSlice * HeatSource, data=za.df)

19
No Pooling Model
lm(Rating ~ CostPerSlice * Neighborhood + HeatSource, data=za.df)

20
Visualize the Fit (No-Pooling)
lm(Rating ~ CostPerSlice * Neighborhood + HeatSource, data=za.df)

21
Evaluation of Fitted Model
Cross-Validation Error
Adjusted R^2
AIC
BIC
RSS
Tests for Normal Residuals
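Most of these criteria are one call away in base R. The sketch below compares the full-pooling and no-pooling formulas from the earlier slides on simulated stand-in data (the data frame `za.sim`, the seed, and the generating slopes are assumptions, since the talk's CSV is not reproduced here). The helper `evaluate()` is a hypothetical convenience function, and cross-validation error is left out to keep the example short.

```r
set.seed(42)

# Simulated stand-in for the talk's data (assumed structure, made-up values)
hoods <- c("Chinatown", "EVillage", "LES", "LittleItaly", "SoHo")
n <- 200
za.sim <- data.frame(
  CostPerSlice = round(runif(n, 1.25, 5.25), 2),
  HeatSource   = factor(sample(c("Coal", "Gas", "Wood"), n, replace = TRUE,
                               prob = c(0.1, 0.8, 0.1))),
  Neighborhood = factor(sample(hoods, n, replace = TRUE))
)
# Each neighborhood gets its own (assumed) price slope
hood.slope <- c(0.2, 0.5, 0.4, 0.6, 0.8)[as.integer(za.sim$Neighborhood)]
za.sim$Rating <- 1 + hood.slope * za.sim$CostPerSlice + rnorm(n, sd = 0.5)

full.pool <- lm(Rating ~ CostPerSlice * HeatSource, data = za.sim)
no.pool   <- lm(Rating ~ CostPerSlice * Neighborhood + HeatSource, data = za.sim)

# Hypothetical helper: collect several fit criteria for one lm object
evaluate <- function(m) {
  c(adj.r2    = summary(m)$adj.r.squared,
    AIC       = AIC(m),
    BIC       = BIC(m),
    RSS       = sum(residuals(m)^2),
    shapiro.p = shapiro.test(residuals(m))$p.value)  # normality of residuals
}

print(round(rbind(full.pool = evaluate(full.pool),
                  no.pool   = evaluate(no.pool)), 3))
```

Because the simulated slopes differ by neighborhood, the no-pooling fit should score better here; on real data the comparison can go either way.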

22
Use Natural Groupings
Cluster Sampling
Intercluster Differences
Intracluster Similarities

23
Multilevel Characteristics
Model gravitates toward big groups
Small groups gravitate toward the model
Best when groups are similar to each other
  y_i = Intercept_j[i] + Slope_j[i] + noise
  Intercept[j] = Intercept_alpha + Slope_alpha + noise
  Slope[j] = Intercept_beta + Slope_beta + noise
Model the effects of the groups
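The pull between groups and the overall model can be illustrated with the standard precision-weighted estimate of a group mean. This is a sketch under simplifying assumptions: an intercept-only model, and within-group and between-group standard deviations (`sigma`, `tau`) and an overall mean (`mu`) that are treated as known and set to made-up values. A real multilevel fit estimates these from the data.

```r
# Assumed values for illustration (not from the talk)
sigma <- 1.0   # within-group standard deviation
tau   <- 0.5   # between-group standard deviation
mu    <- 3.2   # overall mean rating

# Three groups with the same raw mean but very different sample sizes
n.j    <- c(Chinatown = 2,   LES = 10,  SoHo = 60)
ybar.j <- c(Chinatown = 4.5, LES = 4.5, SoHo = 4.5)

# Weight on the group's own mean grows with its sample size
w <- (n.j / sigma^2) / (n.j / sigma^2 + 1 / tau^2)

# Partial-pooling estimate: compromise between group mean and overall mean
partial <- w * ybar.j + (1 - w) * mu

print(round(cbind(n = n.j, weight = w, no.pool = ybar.j, partial = partial), 3))
```

The group with two observations is pulled most of the way toward the overall mean, while the group with sixty stays close to its own average.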

24
Multi-Names for Multilevel Models
Multilevel
Hierarchical
Mixed-Effects
Bayesian
Partial-Pooling

25
Multi-Names for Multilevel Models
(1) Fixed effects are constant across individuals, and random effects vary. For example, in a growth study, a model with random intercepts a_i and fixed slope b corresponds to parallel lines for different individuals i, or the model y_it = a_i + b t. Kreft and De Leeuw (1998) thus distinguish between fixed and random coefficients.
(2) Effects are fixed if they are interesting in themselves or random if there is interest in the underlying population. Searle, Casella, and McCulloch (1992, Section 1.4) explore this distinction in depth.
(3) "When a sample exhausts the population, the corresponding variable is fixed; when the sample is a small (i.e., negligible) part of the population the corresponding variable is random." (Green and Tukey, 1960)
(4) "If an effect is assumed to be a realized value of a random variable, it is called a random effect." (LaMotte, 1983)
(5) Fixed effects are estimated using least squares (or, more generally, maximum likelihood) and random effects are estimated with shrinkage ("linear unbiased prediction" in the terminology of Robinson, 1991). This definition is standard in the multilevel modeling literature (see, for example, Snijders and Bosker, 1999, Section 4.2) and in econometrics.
http://www.stat.columbia.edu/~cook/movabletype/archives/2005/01/why_i_dont_use.html

26
Bayesian Interpretation
Everything has a distribution (including the groups)
Group-level model is prior information for the individual-level coefficients
Group-level model has an assumed-normal prior
(Can fit multilevel models with Bayesian methods, or with simpler/faster/easier approximations.)

27
R Options
lme4::lmer()
nlme::lme()
MCMCglmm()
BUGS
Others/niche approaches…

28
Back to the Pizza
Model the overall pattern among neighborhoods
Natural clustering of pizzerias in neighborhoods adds information
Neighborhoods with many/few pizzerias
  Many: trust data, à la no-pooling model
  Few: trust overall patterns, à la full-pooling model

29
Back to the Pizza
Use neighborhoods as the natural grouping

30
Multilevel Pizza
5 slope coefficients and 5 intercept coefficients, one of each per neighborhood
Slopes/intercepts are assumed to have a Gaussian distribution
Ideally, could describe all 5 slopes with 2 numbers (mean/variance)
Neighborhoods with little data don't get the freedom to set their own coefficients; they get pulled towards the overall slope or intercept
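One common way to write this down is the varying-intercept, varying-slope form below. It is a sketch in the notation of the earlier regression slides; the symbols $\alpha_j$, $\beta_j$, $\mu_\alpha$, $\mu_\beta$, $\Sigma$, $\sigma_y$, and the index $j[i]$ ("the neighborhood of pizzeria $i$") are notational assumptions rather than something shown in the talk.

$$\text{rating}_i = \alpha_{j[i]} + \beta_{j[i]} \cdot \text{price}_i + \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma_y^2)$$

$$\begin{pmatrix} \alpha_j \\ \beta_j \end{pmatrix} \sim N\!\left( \begin{pmatrix} \mu_\alpha \\ \mu_\beta \end{pmatrix}, \Sigma \right), \qquad j = 1, \dots, 5$$

The second line is what lets five slopes be summarized by a mean and a variance: each neighborhood's pair $(\alpha_j, \beta_j)$ is a draw from one shared Gaussian, so sparse neighborhoods borrow strength from the rest.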

31
R syntax
lm.me.cost2 <- lmer(Rating ~ HeatSource + (1 + CostPerSlice | Neighborhood),
                    data=za.df)

32
Results (Partial-Pooling)
lm.me.cost2 <- lmer(Rating ~ HeatSource + (1 + CostPerSlice | Neighborhood),
                    data=za.df)
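To see what such a fit returns, here is a sketch that runs the same formula on simulated stand-in data. The data frame `za.sim`, the seed, and the generating intercepts and slopes are assumptions; only the group sizes are borrowed from the data summary earlier in the talk. It requires the `lme4` package, and with only five groups the fit may print convergence or singularity messages.

```r
library(lme4)

set.seed(7)
# Group sizes follow the talk's data summary; everything else is simulated.
sizes <- c(Chinatown = 14, EVillage = 48, LES = 35, LittleItaly = 43, SoHo = 60)
hood  <- factor(rep(names(sizes), times = sizes))
a.j   <- rnorm(5, mean = 1.0, sd = 0.5)   # assumed true group intercepts
b.j   <- rnorm(5, mean = 0.5, sd = 0.2)   # assumed true group slopes

za.sim <- data.frame(
  Neighborhood = hood,
  CostPerSlice = round(runif(length(hood), 1.25, 5.25), 2),
  HeatSource   = factor(sample(c("Coal", "Gas", "Wood"), length(hood),
                               replace = TRUE, prob = c(0.1, 0.8, 0.1)))
)
j <- as.integer(za.sim$Neighborhood)
za.sim$Rating <- a.j[j] + b.j[j] * za.sim$CostPerSlice +
  rnorm(nrow(za.sim), sd = 0.5)

# Same formula as on the slide: intercept and price slope vary by neighborhood
fit <- lmer(Rating ~ HeatSource + (1 + CostPerSlice | Neighborhood),
            data = za.sim)

print(fixef(fit))               # population-level intercept and HeatSource effects
hood.coefs <- coef(fit)$Neighborhood
print(hood.coefs)               # per-neighborhood intercepts and slopes
print(VarCorr(fit))             # estimated spread of intercepts and slopes
```

One design point worth noting: the slide's formula has no fixed `CostPerSlice` term, so the neighborhood slopes are treated as deviations from zero. Adding `CostPerSlice` to the fixed part would instead pull them toward a common estimated slope.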

33
Predicting a New Pizzeria
Neighborhood: Chinatown
Cost: $4.20
Fuel: Wood

34
Uncertainty in Prediction
Fitted coefficients are uncertain
  arm::sim()
Model error term
  rnorm(1, model matrix %*% sim$Neighborhood[, Chinatown, ], variance)
New neighborhood: model possible coefficients
  mvrnorm(1, 0, VarCorr(model)$Neighborhood)
http://github.com/HarlanH/nyc-pa-meetup-multilevel-pizza
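Two of these sources of uncertainty can be sketched with `lme4` and `MASS` alone. This is not the talk's code: it runs on simulated stand-in data (`za.sim` and all generating values are assumptions), it uses `predict()` and `sigma()` rather than `arm::sim()`, and it therefore ignores the uncertainty in the fitted coefficients themselves.

```r
library(lme4)
library(MASS)   # for mvrnorm()

set.seed(7)
# Simulated stand-in data (group sizes from the talk's summary, values assumed)
sizes <- c(Chinatown = 14, EVillage = 48, LES = 35, LittleItaly = 43, SoHo = 60)
hood  <- factor(rep(names(sizes), times = sizes))
a.j   <- rnorm(5, mean = 1.0, sd = 0.5)
b.j   <- rnorm(5, mean = 0.5, sd = 0.2)
za.sim <- data.frame(
  Neighborhood = hood,
  CostPerSlice = round(runif(length(hood), 1.25, 5.25), 2),
  HeatSource   = factor(sample(c("Coal", "Gas", "Wood"), length(hood),
                               replace = TRUE, prob = c(0.1, 0.8, 0.1)))
)
j <- as.integer(za.sim$Neighborhood)
za.sim$Rating <- a.j[j] + b.j[j] * za.sim$CostPerSlice +
  rnorm(nrow(za.sim), sd = 0.5)

fit <- lmer(Rating ~ HeatSource + (1 + CostPerSlice | Neighborhood),
            data = za.sim)

# The new pizzeria from the slide: Chinatown, $4.20, wood-fired
new.za <- data.frame(
  Neighborhood = factor("Chinatown", levels = levels(za.sim$Neighborhood)),
  CostPerSlice = 4.20,
  HeatSource   = factor("Wood", levels = levels(za.sim$HeatSource))
)
n.sims <- 1000

# (a) Known neighborhood: fitted value plus residual (model error) noise
mu.hat      <- predict(fit, newdata = new.za)
draws.known <- rnorm(n.sims, mean = mu.hat, sd = sigma(fit))

# (b) Brand-new neighborhood: draw its intercept and slope offsets from the
#     fitted group-level distribution, then add residual noise
Sigma     <- VarCorr(fit)$Neighborhood            # 2x2 covariance matrix
re        <- mvrnorm(n.sims, mu = c(0, 0), Sigma = Sigma)
fixed     <- predict(fit, newdata = new.za, re.form = NA)  # fixed effects only
draws.new <- fixed + re[, 1] + re[, 2] * new.za$CostPerSlice +
  rnorm(n.sims, sd = sigma(fit))

print(quantile(draws.known, c(0.025, 0.5, 0.975)))
print(quantile(draws.new,   c(0.025, 0.5, 0.975)))
```

The interval for an unseen neighborhood is wider because its coefficients are themselves unknown draws from the group-level distribution.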

35
Other Examples: Red State Blue State

36
Other Examples: Tobacco Usage

37
Other Examples: Diabetes Prevalence

38
Other Examples: Insufficient Fruit and Vegetable Intake

39
Other Examples: Clean Drinking Water

40
Steps to Multilevel Models
Full-Pooling Model
No-Pooling Model
Separate Models
Two-Step Analysis

41
How Many Groups? How Many Observations?
As few as one or two groups
Even two observations per group
Can have many groups with just one observation

42
Larger Datasets
Andy Gelman: The Blessing of Dimensionality
More Data
Add Complexity
Because you can

43
Resources
Gelman and Hill (ARM)
Pinheiro & Bates
Snijders and Bosker
R-SIG-Mixed-Models (http://glmm.wikidot.com/faq)
(SAS/SPSS)

44
Thanks!
