

1. Part IB. Descriptive Statistics: Multivariate Statistics. Focus: Multiple Regression. Spring 2007.

2. Regression Analysis
- Y = f(X): Y is a function of X.
- Regression analysis: a method of determining the specific function relating Y to X.
- Linear regression: a popular model in social science.
- A brief review is offered here; see the ppt files on the course website.

3. Example: summarize the relationship with a straight line.

4. Draw a straight line, but how?

5. Notice that some predictions are not completely accurate.

6. How to draw the line?
- Purpose: draw the regression line that gives the most accurate predictions of y given x.
- Criterion for "accurate": the sum of (observed y − predicted y)² = the sum of (prediction errors)², that is, the sum of squared differences between observed and predicted values.
- This quantity is called the sum of squared errors, or sum of squared residuals (SSE).
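A minimal Python sketch of this criterion; the data and the candidate line are made up for illustration and are not from the slides:

```python
import numpy as np

# Illustrative data (not from the slides).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

def sse(y_obs, y_pred):
    """Sum of squared errors: sum of (observed y - predicted y)^2."""
    return np.sum((y_obs - y_pred) ** 2)

# Predictions from one candidate line, y_hat = 0.5 + 1.9 * x.
y_hat = 0.5 + 1.9 * x
print(sse(y, y_hat))
```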

7. Ordinary Least Squares (OLS) Regression
- The regression line is drawn so as to minimize the sum of the squared vertical distances from the points to the line (i.e., to minimize SSE).
- This line minimizes squared prediction error.
- This line passes through the middle of the point cloud, which makes it a natural choice for describing the relationship.
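To see "minimize SSE" in action, here is a rough sketch that searches numerically over intercept and slope and lands on the same line as the closed-form least-squares fit; the data are made up for illustration:

```python
import numpy as np
from scipy.optimize import minimize

# Illustrative data (not from the slides).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

def sse(params):
    """SSE for the line y_hat = intercept + slope * x."""
    intercept, slope = params
    return np.sum((y - (intercept + slope * x)) ** 2)

# Numerically search for the (intercept, slope) that minimizes SSE ...
result = minimize(sse, x0=[0.0, 1.0], method="Nelder-Mead")

# ... and compare with the closed-form least-squares fit.
slope_ls, intercept_ls = np.polyfit(x, y, deg=1)
print(result.x)                  # approximately [intercept_ls, slope_ls]
print(intercept_ls, slope_ls)
```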

8. To describe a regression line (equation): algebraically, a line is described by its intercept and slope. Notation:
- y = the dependent variable
- x = the independent variable
- y_hat = predicted y, based on the regression line
- β = slope of the regression line
- α = intercept of the regression line

9. The meaning of slope and intercept:
- slope = the change in y_hat for a one-unit change in x
- intercept = the value of y_hat when x is 0
- When interpreting the intercept and slope, pay attention to the units of x and y.

10. General equation of a regression line: y_hat = α + βx, where α and β are chosen to minimize the sum of (observed y − predicted y)². A formula for the α and β that minimize this sum is programmed into statistical programs and calculators; a sketch of that formula appears below.
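A sketch of the usual closed-form least-squares estimates, b = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² and a = ȳ − b·x̄, on made-up data:

```python
import numpy as np

# Illustrative data (not from the slides).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Closed-form least-squares estimates:
#   slope     b = sum((x - x_bar) * (y - y_bar)) / sum((x - x_bar)^2)
#   intercept a = y_bar - b * x_bar
x_bar, y_bar = x.mean(), y.mean()
b = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
a = y_bar - b * x_bar

print(a, b)  # same line as np.polyfit(x, y, deg=1)
```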

11. An example of a regression line.

12. Fit: how much of the variation in y can the regression explain? Look at the regression equation again:
- y_hat = α + βx
- y = α + βx + ε
- Data = what we explain + what we do not explain
- Data = predicted + residual
(The data have a part we can explain and a part we cannot: the predicted part and the error part.)

13. In regression, we can think of "fit" in this way:
- Total variation = the total sum of squares of y (about its mean)
- Explained variation = the part of the total variation explained by our predictions
- Unexplained variation = the sum of squares of the residuals
- R² = (explained variation) / (total variation), the coefficient of determination: the share of the total variation in y that the regression explains. A small sketch follows.
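A sketch of this decomposition on made-up data; R² is computed as 1 minus unexplained over total, which equals explained over total:

```python
import numpy as np

# Illustrative data (not from the slides).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Fit the least-squares line and compute predictions.
b, a = np.polyfit(x, y, deg=1)
y_hat = a + b * x

total_variation = np.sum((y - y.mean()) ** 2)   # total sum of squares
unexplained = np.sum((y - y_hat) ** 2)          # sum of squared residuals
r_squared = 1 - unexplained / total_variation   # = explained / total

print(r_squared)
```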

14. R² = r². NOTE: this is a special feature of simple (OLS) regression; it does not hold for multiple regression or other regression methods. The check below illustrates it.
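A quick check of this fact on made-up data:

```python
import numpy as np
from scipy.stats import pearsonr

# Illustrative data (not from the slides).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

b, a = np.polyfit(x, y, deg=1)
y_hat = a + b * x
r_squared = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

r, _ = pearsonr(x, y)
print(np.isclose(r_squared, r ** 2))  # True for simple OLS regression
```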

15. Some cautions about regression and R²:
- It is dangerous to use R² to judge how "good" a regression is; the appropriateness of regression is not a function of R².
- When to use regression? It is not suitable for non-linear shapes (although non-linear shapes can sometimes be transformed); regression is appropriate when r (the correlation) is appropriate as a measure.

16. Supplement: Proportional Reduction of Error (PRE)
- PRE measures compare the errors of prediction under different prediction rules, contrasting a naive rule with a sophisticated one.
- R² is a PRE measure: the naive rule is to predict y_bar; the sophisticated rule is to predict y_hat.
- R² measures the reduction in predictive error from using the regression predictions rather than predicting the mean of y, as in the sketch below.
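A sketch of the PRE calculation on made-up data; for OLS the result equals R²:

```python
import numpy as np

# Illustrative data (not from the slides).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

b, a = np.polyfit(x, y, deg=1)

# Naive rule: always predict the mean of y.
errors_naive = np.sum((y - y.mean()) ** 2)

# Sophisticated rule: predict y_hat from the regression line.
errors_regression = np.sum((y - (a + b * x)) ** 2)

# Proportional reduction in error; equals R^2 for OLS.
pre = (errors_naive - errors_regression) / errors_naive
print(pre)
```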

17. Cautions about correlation and regression:
- Extrapolation is not appropriate.
- Pay attention to lurking (omitted) variables: variables that influence the relationship between the two variables studied but are not included in the analysis. They are a problem for establishing causation.
- Association does not imply causation. Association alone is weak evidence about causation; experiments with random assignment are the best way to establish causation.

18. Inference for Simple Regression

19. Regression equation. The fitted line is y_hat = α + βx; the underlying model is y = α + βx + ε.
- y = the dependent variable
- x = the independent variable
- β = slope = the predicted change in y for a one-unit change in x
- α = intercept = the predicted value of y when x is 0
- y_hat = the predicted value of the dependent variable

20. Global test (F test): tests whether the regression equation has any explanatory power (H0: β = 0).


22. The regression model. Note: the slope and intercept of the fitted regression line are statistics (i.e., computed from the sample data). To do inference, we treat them as estimates of unknown population parameters.

23. Inference for regression. Population regression line: μ_y = α + βx, estimated from the sample by y_hat = a + bx. b is an unbiased estimator of the true slope β, and a is an unbiased estimator of the true intercept α.

24-25. Sampling distribution of a (intercept) and b (slope):
- The mean of the sampling distribution of a is α; the mean of the sampling distribution of b is β.
- The standard errors of a and b are related to the amount of spread about the regression line (σ).
- The sampling distributions are normal; when σ is estimated, use the t distribution for inference.

26. The standard error of the least-squares line. Estimate σ, the spread about the regression line, using the residuals from the regression; recall that residual = (y − y_hat). The population standard deviation about the line (σ) is estimated from these sample residuals.

27. Estimate σ from the sample data.
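The usual sample estimate is s = sqrt( Σ residuals² / (n − 2) ), which is presumably the formula shown on this slide; a sketch on made-up data:

```python
import numpy as np

# Illustrative data (not from the slides).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

b, a = np.polyfit(x, y, deg=1)
residuals = y - (a + b * x)

# Estimate of sigma, the spread about the regression line:
#   s = sqrt( sum(residuals^2) / (n - 2) )
n = len(y)
s = np.sqrt(np.sum(residuals ** 2) / (n - 2))
print(s)
```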

28. Standard error of the slope (b). The sampling distribution of b has a standard error SE_b. A small SE_b means our estimate b is a precise estimate of β. SE_b is directly related to s, and inversely related to the sample size n and to S_x (the spread of the x values).
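A sketch of the standard formula SE_b = s / sqrt( Σ(x − x̄)² ), equivalently s / (S_x · sqrt(n − 1)), on made-up data:

```python
import numpy as np

# Illustrative data (not from the slides).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

b, a = np.polyfit(x, y, deg=1)
n = len(y)
s = np.sqrt(np.sum((y - (a + b * x)) ** 2) / (n - 2))

# SE_b = s / sqrt(sum((x - x_bar)^2)); equivalently s / (S_x * sqrt(n - 1)).
se_b = s / np.sqrt(np.sum((x - x.mean()) ** 2))
print(se_b)
```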

29. Confidence interval for the regression slope. A level C confidence interval for the slope β of the "true" regression line is b ± t* × SE_b, where t* is the upper (1 − C)/2 critical value from the t distribution with n − 2 degrees of freedom. To test the hypothesis H0: β = 0, compute the t statistic t = b / SE_b and refer it to the t distribution with n − 2 degrees of freedom.
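A sketch of the confidence-interval arithmetic on made-up data, using scipy for the t critical value:

```python
import numpy as np
from scipy import stats

# Illustrative data (not from the slides).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

b, a = np.polyfit(x, y, deg=1)
n = len(y)
s = np.sqrt(np.sum((y - (a + b * x)) ** 2) / (n - 2))
se_b = s / np.sqrt(np.sum((x - x.mean()) ** 2))

# 95% confidence interval: b +/- t* x SE_b, with t* from t with n - 2 df.
t_star = stats.t.ppf(1 - 0.05 / 2, df=n - 2)
print(b - t_star * se_b, b + t_star * se_b)
```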

30. Significance tests for the slope. Test hypotheses about the slope β. Usually H0: β = 0 (no linear relationship between the independent and dependent variables), against one of the alternatives HA: β > 0, HA: β < 0, or HA: β ≠ 0.
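A sketch of the two-sided test on made-up data:

```python
import numpy as np
from scipy import stats

# Illustrative data (not from the slides).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

b, a = np.polyfit(x, y, deg=1)
n = len(y)
s = np.sqrt(np.sum((y - (a + b * x)) ** 2) / (n - 2))
se_b = s / np.sqrt(np.sum((x - x.mean()) ** 2))

# Test H0: beta = 0 against HA: beta != 0.
t_stat = b / se_b
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)  # two-sided p-value
print(t_stat, p_value)
```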


32. Statistical inference for the intercept. We could also do statistical inference for the regression intercept α. Possible hypotheses: H0: α = 0 versus HA: α ≠ 0, tested with a t test based on a, very similar to the t tests we have already done. In most substantive applications we are interested in the slope (β), not in α.

33. Example: SPSS regression procedures and output.
- To get a scatterplot: Graphs → Scatter → Simple → Define (select x and y).
- To get a correlation coefficient: Analyze → Correlate → Bivariate.
- To perform simple regression: Analyze → Regression → Linear (select x and y; you can also choose to save predicted values and residuals).
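For readers without SPSS, a rough Python equivalent of the correlation and simple-regression steps; the data below are made-up stand-ins for the UN variables on the following slides, not the actual 1995 values:

```python
import numpy as np
from scipy import stats

# Made-up stand-ins: female literacy (%) as x, infant mortality as y.
literacy = np.array([40.0, 55.0, 70.0, 85.0, 95.0])
mortality = np.array([110.0, 80.0, 55.0, 30.0, 12.0])

# Correlation coefficient (SPSS: Analyze -> Correlate -> Bivariate).
r, p_r = stats.pearsonr(literacy, mortality)

# Simple regression (SPSS: Analyze -> Regression -> Linear).
result = stats.linregress(literacy, mortality)

print(r, p_r)
print(result.slope, result.intercept, result.rvalue ** 2, result.pvalue)
```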

34. SPSS Example: Infant Mortality vs. Female Literacy, 1995 UN Data

35. Example: correlation between infant mortality and female literacy.

36. Regression: infant mortality vs. female literacy, 1995 UN data.

37. Regression: infant mortality vs. female literacy, 1995 UN data (continued).

38. Hypothesis test example. Dahua is analyzing generational differences in educational attainment. He has collected data on 117 father-son pairs. The father's education level is the independent variable and the son's education level is the dependent variable. His regression equation is y_hat = 0.2915x + 10.25, and the standard error of the regression slope is 0.10.
1. At α = 0.05, can Dahua conclude that fathers' and sons' education levels are related?
2. For all boys whose fathers are college graduates, what is the predicted average education level?
3. For one boy whose father is a college graduate, what is the predicted education level?
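A sketch of the arithmetic for this exercise, assuming education is measured in years and treating "college graduate" as x = 16 years (an assumption; the slide does not give the coding):

```python
from scipy import stats

# Figures from the exercise on this slide.
n = 117
b = 0.2915          # estimated slope
intercept = 10.25
se_b = 0.10         # standard error of the slope

# 1. Test H0: beta = 0 at alpha = 0.05 (two-sided).
t_stat = b / se_b                               # = 2.915
t_crit = stats.t.ppf(1 - 0.05 / 2, df=n - 2)    # roughly 1.98
print(t_stat > t_crit)                          # True: reject H0

# 2./3. Predicted schooling for a son whose father is a college graduate,
# assuming "college graduate" corresponds to 16 years of schooling
# (an assumption; the slide does not state the units).
x_father = 16
print(intercept + b * x_father)                 # about 14.9
```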

