Linear Regression/Correlation


1 Linear Regression/Correlation
Quantitative Explanatory and Response Variables
Goal: Test whether the level of the response variable is associated with (depends on) the level of the explanatory variable
Goal: Measure the strength of the association between the two variables
Goal: Use the level of the explanatory variable to predict the level of the response variable

2 Linear Relationships
Notation:
Y: Response (dependent, outcome) variable
X: Explanatory (independent, predictor) variable
Linear Function (Straight-Line Relation): $Y = \alpha + \beta X$ (plot Y on the vertical axis, X on the horizontal axis)
Slope ($\beta$): The amount Y changes when X increases by 1
$\beta > 0 \Rightarrow$ Line slopes upward (Positive Relation)
$\beta = 0 \Rightarrow$ Line is flat (No Linear Relation)
$\beta < 0 \Rightarrow$ Line slopes downward (Negative Relation)
Y-intercept ($\alpha$): Y level when X = 0

3 Example: Service Pricing
Internet History Resources (New South Wales Family History Document Service)
Membership fee: AUD 20
20¢ (AUD 0.20) per image viewed
Y = Total cost of service
X = Number of images viewed
$\alpha$ = Cost when no images viewed
$\beta$ = Incremental cost per image viewed
$Y = \alpha + \beta X = 20 + 0.20X$
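
The pricing rule on this slide is an exact (deterministic) straight line, so it can be written directly as a function. A minimal Python sketch, using only the fee and per-image charge stated above; the names `MEMBERSHIP_FEE`, `COST_PER_IMAGE`, and `total_cost` are made up for illustration:

```python
# Pricing rule from the slide: 20 dollar membership fee plus 0.20 per image.
MEMBERSHIP_FEE = 20.00   # intercept: cost when no images are viewed
COST_PER_IMAGE = 0.20    # slope: incremental cost per image viewed

def total_cost(images_viewed: int) -> float:
    """Return the total cost Y = intercept + slope * X for X images viewed."""
    return MEMBERSHIP_FEE + COST_PER_IMAGE * images_viewed

for x in (0, 10, 100):
    print(x, round(total_cost(x), 2))
```

Each additional image raises the cost by the slope (0.20), and the cost at X = 0 is the intercept (20).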

4 Example: Service Pricing

5 Probabilistic Models
In practice, the relationship between Y and X is not “perfect”; other sources of variation exist.
We decompose Y into 2 components:
Systematic relationship with X: $\alpha + \beta X$
Random error: $\varepsilon$
Random responses can be written as the sum of the systematic (also thought of as the mean) and random components: $Y = \alpha + \beta X + \varepsilon$
The (conditional on X) mean response is: $E(Y) = \alpha + \beta X$
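
The decomposition into a systematic part and a random part can be illustrated by simulation. The sketch below is only an illustration under assumptions: the parameter values are invented, and the errors are drawn from a normal distribution with mean 0, which is the assumption the deck states later for the linear regression model. The function name `simulate_responses` is hypothetical.

```python
import random

def simulate_responses(xs, alpha, beta, sigma, seed=0):
    """Draw Y = alpha + beta*X + error for each X, with error ~ Normal(0, sigma)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [alpha + beta * x + rng.gauss(0.0, sigma) for x in xs]

xs = [1, 2, 3, 4, 5]
mean_response = [2.0 + 0.5 * x for x in xs]                   # systematic part, E(Y)
observed = simulate_responses(xs, alpha=2.0, beta=0.5, sigma=1.0)
random_part = [y - m for y, m in zip(observed, mean_response)]  # the error terms

for x, m, y in zip(xs, mean_response, observed):
    print(x, m, round(y, 3))
```

Setting `sigma=0.0` removes the random component and reproduces the exact straight line.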

6 Least Squares Estimation
Problem: $\alpha$ and $\beta$ are unknown parameters, and must be estimated and tested based on sample data.
Procedure:
Sample n individuals, observing X and Y on each one
Plot the pairs, Y (vertical axis) versus X (horizontal axis)
Choose the line that “best fits” the data
Criterion: Choose the line that minimizes the sum of squared vertical distances from the observed data points to the line.
Least Squares Prediction Equation: $\hat{Y} = a + bX$, where $b = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sum (X - \bar{X})^2}$ and $a = \bar{Y} - b\bar{X}$
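
The least squares criterion has a standard closed-form solution: the slope is the sum of cross-products of deviations divided by the sum of squared X deviations, and the intercept makes the line pass through the point of means. A minimal Python sketch follows; the data are made up for illustration (they are not the data from these slides) and `least_squares` is a hypothetical helper name.

```python
def least_squares(xs, ys):
    """Return (a, b): intercept and slope minimizing the sum of squared vertical distances."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx          # slope estimate
    a = y_bar - b * x_bar    # intercept estimate
    return a, b

# Made-up illustrative data (not from the slides).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b = least_squares(xs, ys)
print(f"Y-hat = {a:.2f} + {b:.2f} X")   # prints "Y-hat = 2.20 + 0.60 X"
```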

7 Example - Pharmacodynamics of LSD
Response (Y) - Math score (mean among 5 volunteers)
Predictor (X) - LSD tissue concentration (mean of 5 volunteers)
[Table and figure: raw data and scatterplot of score vs LSD concentration]
Source: Wagner, et al (1968)

8 Example - Pharmacodynamics of LSD
(Column totals given in bottom row of table)

9 SPSS Output and Plot of Equation

10 Example - Retail Sales, U.S. SMSAs
Y = Per Capita Retail Sales
X = Females per 100 Males

11 Residuals
Residuals (aka Errors): Difference between observed values and predicted values: $e = Y - \hat{Y}$
Error sum of squares: $SSE = \sum (Y - \hat{Y})^2$
Estimate of the (conditional) standard deviation of Y: $s = \sqrt{\frac{SSE}{n - 2}}$
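
The three quantities on this slide can be computed in a few lines once a fitted line is available. The sketch below assumes made-up data whose least squares line is Y-hat = 2.2 + 0.6 X, and divides SSE by n - 2, the error degrees of freedom the deck uses in its analysis of variance slides. The name `residual_summary` is hypothetical.

```python
import math

def residual_summary(xs, ys, a, b):
    """Return (residuals, SSE, s) for the fitted line Y-hat = a + b*X."""
    residuals = [y - (a + b * x) for x, y in zip(xs, ys)]   # observed minus predicted
    sse = sum(e ** 2 for e in residuals)                    # error sum of squares
    s = math.sqrt(sse / (len(xs) - 2))                      # n - 2 error degrees of freedom
    return residuals, sse, s

# Made-up illustrative data with its least squares line Y-hat = 2.2 + 0.6 X.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
residuals, sse, s = residual_summary(xs, ys, a=2.2, b=0.6)
print([round(e, 2) for e in residuals], round(sse, 2), round(s, 3))
```

For a least squares line with an intercept, the residuals sum to zero, which is a quick check that the coefficients were computed correctly.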

12 Linear Regression Model
Data: $Y = \alpha + \beta X + \varepsilon$
Mean: $E(Y) = \alpha + \beta X$
Conditional Standard Deviation: $\sigma$
Error terms ($\varepsilon$) are assumed to be independent and normally distributed

13 Example - Pharmacodynamics of LSD

14 Correlation Coefficient
The slope of the regression line describes the direction of association (if any) between the explanatory (X) and response (Y) variables.
Problems:
The magnitude of the slope depends on the units of the variables
The slope is unbounded, so it does not measure the strength of association
Some situations arise where interest is in the association between variables, but there is no clear definition of X and Y
Population Correlation Coefficient: $\rho$
Sample Correlation Coefficient: $r$

15 Correlation Coefficient
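Pearson Correlation: Measure of strength of linear association: $r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\sum (X - \bar{X})^2 \sum (Y - \bar{Y})^2}} = b \left( \frac{s_X}{s_Y} \right)$
Does not delineate between explanatory and response variables
Is invariant to linear transformations of Y and X (up to a sign change when a variable is multiplied by a negative constant)
Is bounded between -1 and 1 (higher values in absolute value imply a stronger relation)
Has the same sign (positive/negative) as the slope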
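
The listed properties can be checked numerically. The sketch below computes the sample Pearson correlation from sums of deviations and then verifies that it does not change when X and Y swap roles or when X is rescaled to different units. The data are made up for illustration and `pearson_r` is a hypothetical helper name.

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

# Made-up illustrative data (not from the slides).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r = pearson_r(xs, ys)
r_swapped = pearson_r(ys, xs)                          # roles of X and Y exchanged
r_rescaled = pearson_r([10 * x + 3 for x in xs], ys)   # X measured in other units
print(round(r, 4), round(r_swapped, 4), round(r_rescaled, 4))
```

All three values agree (about 0.7746 for these data), whereas the slope itself would change under either operation.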

16 Example - Pharmacodynamics of LSD
Using formulas for standard deviation from the beginning of the course: $s_X$ = [value lost] and $s_Y$ = [value lost]
From previous calculations: b = -9.01
This represents a strong negative association between math scores and LSD tissue concentration

17 Coefficient of Determination
Measure of the variation in Y that is “explained” by X
Step 1: Ignoring X, measure the total variation in Y (around its mean): $TSS = \sum (Y - \bar{Y})^2$
Step 2: Fit the regression relating Y to X and measure the unexplained variation in Y (around its predicted values): $SSE = \sum (Y - \hat{Y})^2$
Step 3: Take the difference (variation in Y “explained” by X), and divide by the total: $r^2 = \frac{TSS - SSE}{TSS}$
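
The three steps translate directly into code. A minimal sketch, assuming made-up data (not the data from these slides) and a hypothetical helper name `coefficient_of_determination`:

```python
def coefficient_of_determination(xs, ys):
    """Return (TSS, SSE, r_squared) following the three steps on the slide."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    # Step 1: total variation in Y around its mean, ignoring X.
    tss = sum((y - y_bar) ** 2 for y in ys)
    # Step 2: fit the least squares line and measure unexplained variation.
    b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    a = y_bar - b * x_bar
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    # Step 3: explained variation as a proportion of the total.
    return tss, sse, (tss - sse) / tss

# Made-up illustrative data (not from the slides).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
tss, sse, r_squared = coefficient_of_determination(xs, ys)
print(round(tss, 2), round(sse, 2), round(r_squared, 2))
```

For these data, TSS = 6 and SSE = 2.4, so 60% of the variation in Y is "explained" by X.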

18 Example - Pharmacodynamics of LSD
TSS SSE

19 Inference Concerning the Slope ($\beta$)
Parameter: Slope in the population model ($\beta$)
Estimator: Least squares estimate: $b$
Estimated standard error: $\hat{\sigma}_b = \frac{s}{\sqrt{\sum (X - \bar{X})^2}}$
Methods of making inference regarding the population:
Hypothesis tests (2-sided or 1-sided)
Confidence intervals

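20 Significance Test for $\beta$
1-Sided Test: $H_0: \beta = 0$ vs $H_A^+: \beta > 0$ or $H_A^-: \beta < 0$
2-Sided Test: $H_0: \beta = 0$ vs $H_A: \beta \neq 0$
Test statistic: $t_{obs} = b / \hat{\sigma}_b$, where $\hat{\sigma}_b$ is the estimated standard error of b, compared with the t-distribution with n-2 degrees of freedom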
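
A sketch of the 2-sided test, using the usual t statistic (slope estimate divided by its estimated standard error, with n - 2 degrees of freedom). The data are made up for illustration, `slope_t_test` is a hypothetical helper name, and the critical value 3.182 is the standard t-table value for a 2-sided 5% test with 3 degrees of freedom; it is hard-coded here rather than computed.

```python
import math

def slope_t_test(xs, ys):
    """Return (b, se_b, t_obs, df) for testing H0: slope = 0."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    s = math.sqrt(sse / (n - 2))      # estimated conditional standard deviation
    se_b = s / math.sqrt(s_xx)        # estimated standard error of the slope
    return b, se_b, b / se_b, n - 2

# Made-up illustrative data (not from the slides).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
b, se_b, t_obs, df = slope_t_test(xs, ys)

T_CRIT = 3.182                        # t-table value: 2-sided, 5% level, df = 3
reject_h0 = abs(t_obs) >= T_CRIT
print(round(b, 2), round(se_b, 4), round(t_obs, 4), df, reject_h0)
```

With only five observations the slope of 0.6 is not significantly different from 0 at the 5% level, even though it is positive.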

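21 $(1-\alpha)100\%$ Confidence Interval for $\beta$
$b \pm t_{\alpha/2,\,n-2}\,\hat{\sigma}_b$, where $\hat{\sigma}_b$ is the estimated standard error of b
Conclude positive association if the entire interval is above 0
Conclude negative association if the entire interval is below 0
Cannot conclude an association if the interval contains 0
Conclusion based on the interval is the same as for the 2-sided hypothesis test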
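
The decision rule on this slide can be sketched as a small function. The interval is the usual estimate plus or minus a t critical value times the estimated standard error of the slope. The input numbers below are assumptions for illustration (a slope of 0.6 with standard error 0.2828 from a made-up sample of 5 points, and the t-table value 3.182 for 95% confidence with 3 degrees of freedom); the function names are hypothetical.

```python
def slope_confidence_interval(b, se_b, t_crit):
    """Return (lower, upper) = b -/+ t_crit * se_b."""
    margin = t_crit * se_b
    return b - margin, b + margin

def conclusion(lower, upper):
    """Apply the slide's decision rule to a confidence interval for the slope."""
    if lower > 0:
        return "positive association"
    if upper < 0:
        return "negative association"
    return "cannot conclude an association"

# Assumed illustrative inputs (not results from the slides).
lower, upper = slope_confidence_interval(b=0.6, se_b=0.2828, t_crit=3.182)
print(round(lower, 2), round(upper, 2), conclusion(lower, upper))
```

Because this interval contains 0, the conclusion matches a 2-sided test that fails to reject the hypothesis of zero slope.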

22 Example - Pharmacodynamics of LSD
Testing $H_0: \beta = 0$ vs $H_A: \beta \neq 0$
95% Confidence Interval for $\beta$: critical value $t_{.025,5}$

23 Analysis of Variance in Regression
Goal: Partition the total variation in Y into variation “explained” by X and random variation: $\sum (Y - \bar{Y})^2 = \sum (Y - \hat{Y})^2 + \sum (\hat{Y} - \bar{Y})^2$
These three sums of squares and degrees of freedom are:
Total (TSS): $df_{Total} = n - 1$
Error (SSE): $df_{Error} = n - 2$
Model (SSR): $df_{Model} = 1$

24 Analysis of Variance in Regression
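Analysis of Variance - F-test
$H_0: \beta = 0$ vs $H_A: \beta \neq 0$
Test statistic: $F_{obs} = \frac{MSR}{MSE} = \frac{SSR / 1}{SSE / (n - 2)}$, where MSR and MSE are the model and error mean squares
F represents the F-distribution with 1 numerator and n-2 denominator degrees of freedom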
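
A sketch of the analysis of variance computation, assuming the usual F statistic (model mean square divided by error mean square) and the degrees of freedom listed on the previous slide. The data are made up for illustration and `anova_table` is a hypothetical helper name.

```python
def anova_table(xs, ys):
    """Return sums of squares, degrees of freedom, and F for simple linear regression."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    tss = sum((y - y_bar) ** 2 for y in ys)                     # total
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))   # error
    ssr = sum(((a + b * x) - y_bar) ** 2 for x in xs)           # model
    msr = ssr / 1            # model mean square, df = 1
    mse = sse / (n - 2)      # error mean square, df = n - 2
    return {
        "SSR": ssr, "SSE": sse, "TSS": tss,
        "df_model": 1, "df_error": n - 2, "df_total": n - 1,
        "F": msr / mse,
    }

# Made-up illustrative data (not from the slides).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
table = anova_table(xs, ys)
print({k: round(v, 3) for k, v in table.items()})
```

In simple linear regression the F statistic equals the square of the t statistic for the slope, so the two tests always agree.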

25 Example - Pharmacodynamics of LSD
Total Sum of Squares (TSS):
Error Sum of Squares (SSE):
Model Sum of Squares (SSR):

26 Example - Pharmacodynamics of LSD
Analysis of Variance - F-test
$H_0: \beta = 0$ vs $H_A: \beta \neq 0$

27 Example - SPSS Output

28 Significance Test for Pearson Correlation
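Test is identical (mathematically) to the t-test for $\beta$, but is more appropriate when there is no clear explanatory and response variable
$H_0: \rho = 0$ vs $H_A: \rho \neq 0$ (can do a 1-sided test)
Test Statistic: $t_{obs} = \frac{r}{\sqrt{(1 - r^2)/(n - 2)}}$
P-value: $2P(t \geq |t_{obs}|)$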
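
The claim that this test matches the t-test for the slope can be checked numerically. The sketch below assumes the usual statistic, r divided by the square root of (1 - r^2)/(n - 2), and made-up data; `correlation_t_statistic` is a hypothetical helper name. Because r treats the two variables symmetrically, the statistic is the same whichever variable is called X.

```python
import math

def correlation_t_statistic(xs, ys):
    """Return (r, t_obs) for testing H0: population correlation = 0."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    r = s_xy / math.sqrt(s_xx * s_yy)
    t_obs = r / math.sqrt((1 - r ** 2) / (n - 2))
    return r, t_obs

# Made-up illustrative data (not from the slides).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r, t_obs = correlation_t_statistic(xs, ys)
_, t_swapped = correlation_t_statistic(ys, xs)   # roles of X and Y exchanged
print(round(r, 4), round(t_obs, 4), round(t_swapped, 4))
```

For these data the statistic is about 2.12, the same value the slope t-test gives for a regression of Y on X.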

29 Model Assumptions & Problems
Linearity: Many relations are not perfectly linear, but can be well approximated by a straight line over a range of X values
Extrapolation: While we can check the validity of a straight-line relation within the observed X levels, we cannot assume the relationship continues outside this range
Influential Observations: Some data points (particularly ones with extreme X levels) can exert a large influence on the predicted equation.
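
The effect of an influential observation can be illustrated by refitting the line with and without a single point that has an extreme X level. The data and the added point (20, 0) are made up for illustration, and `fitted_slope` is a hypothetical helper name.

```python
def fitted_slope(xs, ys):
    """Least squares slope for the regression of ys on xs."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    return s_xy / s_xx

# Made-up illustrative data (not from the slides).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
slope_without = fitted_slope(xs, ys)

# Add one observation far outside the observed X range.
slope_with = fitted_slope(xs + [20], ys + [0])

print(round(slope_without, 3), round(slope_with, 3))
```

A single point with an extreme X level changes the estimated slope from positive (0.6) to negative (about -0.2), which is why such observations deserve a separate look before the fitted equation is interpreted.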

