1
Nemours Biomedical Research Statistics April 2, 2009 Tim Bunnell, Ph.D. & Jobayer Hossain, Ph.D. Nemours Bioinformatics Core Facility
2
Relationships Among Variables
Response (dependent) variable(s) - measure the outcome of a study.
Explanatory (independent) variable(s) - explain or influence changes in a response variable.
Outlier - an observation that falls outside the overall pattern of the relationship.
Positive association - an increase in an independent variable is associated with an increase in a dependent variable.
Negative association - an increase in an independent variable is associated with a decrease in a dependent variable.
3
Scatterplots
A scatterplot shows the relationship between two variables (age and height in this case).
It reveals the form, direction, and strength of the relationship.
4
Rcmdr: Scatterplots - Demo
Graphs -> Scatterplot -> select the x (independent) and y (response) variables -> select options (e.g. least-squares line), plotting parameters, and the span for the smooth -> type the x and y labels -> plot by groups (e.g. grp).
The plot is on the next slide.
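The same menu steps can be approximated in plain R. This is a minimal sketch, not the code Rcmdr itself generates: the data frame `demo` is simulated, and the column names `age`, `hgt` and `grp` are assumptions standing in for the course data set.

```r
# Simulated stand-in for the course data; 'age', 'hgt' and 'grp' are assumed names.
set.seed(1)
demo <- data.frame(age = runif(60, 20, 170))
demo$hgt <- 31 + 0.19 * demo$age + rnorm(60, sd = 1)
demo$grp <- factor(rep(c("a", "b"), each = 30))

# Scatterplot of the response (y) against the explanatory variable (x),
# with one color per group.
plot(hgt ~ age, data = demo, col = as.integer(demo$grp),
     xlab = "Age", ylab = "Height")

# Add the least-squares line and a lowess smooth.
# The argument f of lowess() plays the role of the smoother span.
fit <- lm(hgt ~ age, data = demo)
abline(fit, lty = 1)
lines(lowess(demo$age, demo$hgt, f = 0.5), lty = 2)
```

A smaller `f` makes the smooth follow the points more closely; a larger one makes it flatter.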
5
Rcmdr: Scatterplots
6
Scatterplots
[Figure: three scatterplots showing a strong positive association, scattered points with a poor association, and a strong negative association]
7
Correlation
Correlation measures the degree to which two variables are associated.
Two commonly used correlation coefficients:
–Pearson Correlation Coefficient
–Spearman Rank Correlation Coefficient
8
Correlation of two variables
Pearson Correlation Coefficient: measures the direction and strength of the linear relationship between two quantitative variables.
Suppose that we have data on variables x and y for n individuals. Then the correlation coefficient r between x and y is defined as
$$r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right) \left(\frac{y_i - \bar{y}}{s_y}\right)$$
where $\bar{x}$ and $\bar{y}$ are the means and $s_x$ and $s_y$ are the standard deviations of x and y.
r is always a number between -1 and 1.
Values of r near 0 indicate little or no linear relationship.
Values of r near -1 or 1 indicate a very strong linear relationship.
The extreme values r = 1 or r = -1 occur only in the case of a perfect linear relationship, when the points lie exactly along a straight line.
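As a quick check of the definition, the coefficient can be computed from standardized values and compared with R's built-in `cor()`. The numbers below are made up for illustration and are not from the course data.

```r
# Illustrative data (assumed, not from the slides).
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2.0, 4.1, 5.9, 8.2, 9.9, 12.1)
n <- length(x)

# Standardize each variable with its mean and sample standard deviation.
zx <- (x - mean(x)) / sd(x)
zy <- (y - mean(y)) / sd(y)

# Pearson r: sum of products of standardized values, divided by n - 1.
r_manual <- sum(zx * zy) / (n - 1)

# Built-in version (Pearson is the default method).
r_builtin <- cor(x, y)

print(c(manual = r_manual, builtin = r_builtin))
```

The two values agree up to floating-point rounding.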
9
Correlation of two variables
A positive r indicates a positive association, i.e. the two variables change in the same direction; a negative r indicates a negative association.
The scatterplot of height and age shows that these two variables have a strong, positive linear relationship. Their correlation coefficient is 0.9829632, which is very close to 1.
10
Correlation of two variables
Spearman Rank Correlation Coefficient:
This is a non-parametric measure of correlation between two variables.
It is essentially the Pearson correlation coefficient computed on the ranks of the data for the two variables instead of on the data themselves.
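The rank-based definition can be verified directly in R. The sketch below uses made-up data that are monotone but not linear, so the Spearman coefficient is 1 while the Pearson coefficient is smaller.

```r
# Illustrative data (assumed): y increases with x, but not along a straight line.
x <- c(10, 20, 30, 40, 50)
y <- c(1, 8, 27, 64, 125)

# Spearman coefficient from the built-in function.
rho_builtin <- cor(x, y, method = "spearman")

# The same quantity as the Pearson correlation of the ranks.
rho_ranks <- cor(rank(x), rank(y))

# Pearson correlation of the raw data, for comparison.
r_pearson <- cor(x, y)

print(c(spearman = rho_builtin, from_ranks = rho_ranks, pearson = r_pearson))
```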
11
Rcmdr Teaching demo: Correlation demonstration
Select the 'Demos' menu. If it is not in the menu bar, select 'Tools' and then 'Load Rcmdr plug-in(s)', and 'Demos' will appear in the menu bar.
Select 'Simple correlation'.
12
Correlation of two variables
[Figure: three scatterplots with r = 0.02, r = -0.96, and r = 0.98]
13
Rcmdr: Correlation of two variables
Statistics -> Summaries -> Correlation matrix -> pick variables (e.g. hgt and age) and select the type of correlation (e.g. Pearson product-moment or Spearman rank) -> OK.

    cor(data[,c("age","hgt")], use="complete.obs")
              age       hgt
    age 1.0000000 0.9829632
    hgt 0.9829632 1.0000000
14
Strong Association but no Correlation
The gas mileage of an automobile first increases and then decreases as speed increases.
[Data table and scatterplot of mileage against speed]
The scatterplot shows a strong association, but the calculated r = 0. Why?
Because the relationship is not linear, and r measures only the linear relationship between two variables.
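A small numerical illustration of this effect follows. The speed and mileage values are assumed for the sketch (the slide's own data table is not reproduced here); they rise and then fall symmetrically, so the positive and negative contributions to r cancel.

```r
# Assumed illustrative data: mileage peaks at a middle speed.
speed <- c(20, 30, 40, 50, 60)
mpg   <- c(24, 28, 30, 28, 24)

# Pearson correlation: the curve is symmetric about speed = 40,
# so the linear association is zero up to floating-point rounding.
r <- cor(speed, mpg)
print(r)

# The scatterplot still shows a clear curved pattern.
plot(speed, mpg, type = "b", xlab = "Speed", ylab = "Miles per gallon")
```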
15
Influence of an outlier
Consider a data set of two variables X and Y.
[Data table]
For the full data set, r = -0.237.
After dropping the last pair, r = 0.996.
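The effect of a single outlying pair can be reproduced with a small made-up data set. The values below are assumptions for illustration only, so the resulting coefficients will differ from the ones quoted on the slide.

```r
# Assumed illustrative data: six nearly collinear points plus one outlying pair.
x <- c(1, 2, 3, 4, 5, 6, 20)
y <- c(2.1, 3.9, 6.2, 8.0, 9.8, 12.1, 1.0)

# Correlation using every pair, including the outlier.
r_all <- cor(x, y)

# Correlation after dropping the last (outlying) pair.
r_trimmed <- cor(head(x, -1), head(y, -1))

print(c(all_pairs = r_all, without_last = r_trimmed))
```

One extreme point is enough to turn a near-perfect positive correlation into a negative one, which is why a scatterplot should be inspected before r is interpreted.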
16
Simple Linear Regression
Regression describes the value of a response variable as a function of the value of an explanatory variable.
A regression model is a function that describes the relationship between response and explanatory variables.
A simple linear regression has one explanatory variable, and the regression line is straight.
The response variable is quantitative; the explanatory variable(s) can be either quantitative or categorical.
Categorical variables are handled by creating dummy variable(s).
17
Simple Linear Regression
The linear relationship between variables Y and X can be written as the regression model
$$Y = b_0 + b_1 X + e$$
where Y is the response variable, X is the explanatory variable, e is the residual (error), and $b_0$ and $b_1$ are two parameters.
$b_0$ is the intercept and $b_1$ is the slope of the straight line $Y = b_0 + b_1 X$.
18
Simple Linear Regression
A simple regression line is fitted for height on age. The intercept is 31.019 and the slope (regression coefficient) is 0.1877.
19
Simple Linear Regression
Assumptions:
1. The response variable is normally distributed at each value of the explanatory variable.
2. The relationship between the two variables is linear.
3. Observations of the response variable are independent.
4. The residual error is normally distributed with mean 0 and constant standard deviation.
20
Simple Linear Regression
Estimating the parameters $b_0$ and $b_1$
–The least squares method estimates $b_0$ and $b_1$ by fitting a straight line through the data points so that the sum of the squared deviations of the data points from the line is minimized.
–Formula:
$$b_1 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2}, \qquad b_0 = \bar{Y} - b_1 \bar{X}$$
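The closed-form least squares estimates can be checked against `lm()`. This is a sketch on made-up numbers, not the course data.

```r
# Illustrative data (assumed, not from the slides).
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2.0, 4.1, 5.9, 8.2, 9.9, 12.1)

# Slope: sum of cross-products of deviations over sum of squared x deviations.
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)

# Intercept: the fitted line passes through the point of means.
b0 <- mean(y) - b1 * mean(x)

# The same estimates from R's linear model function.
fit <- lm(y ~ x)

print(c(b0 = b0, b1 = b1))
print(coef(fit))
```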
21
Simple Linear Regression
Fitted least squares regression line
–Fitted line: $\hat{Y}_i = b_0 + b_1 X_i$
–where $\hat{Y}_i$ is the fitted / predicted value of the i-th observation ($Y_i$) of the response variable.
–Estimated residual: $e_i = Y_i - \hat{Y}_i$
–The least squares method estimates $b_0$ and $b_1$ to minimize the sum of squared errors: $\sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$
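To make the minimization concrete, the sketch below computes fitted values, residuals, and the sum of squared errors for a least squares fit, then compares it with a line whose slope has been nudged away from the estimate. The data are assumed for illustration.

```r
# Illustrative data (assumed, not from the slides).
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2.0, 4.1, 5.9, 8.2, 9.9, 12.1)

fit <- lm(y ~ x)

# Fitted values, residuals, and the sum of squared errors of the fitted line.
y_hat <- fitted(fit)
e     <- y - y_hat
sse   <- sum(e^2)

# Any other line does worse; here the slope is shifted by 0.1.
y_other   <- coef(fit)[1] + (coef(fit)[2] + 0.1) * x
sse_other <- sum((y - y_other)^2)

print(c(sse_least_squares = sse, sse_shifted_slope = sse_other))
```

With an intercept in the model, the least squares residuals also sum to zero up to rounding.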
22
Simple Linear Regression
Fitted least squares regression line
[Figure: fitted regression line with residuals e1, e2, and e3 marked]
In this example, a regression line (red line) has been fitted to a series of observations (blue diamonds), and residuals are shown for a few observations (arrows).
23
Simple Linear Regression
Interpretation of the regression coefficient and intercept
–The regression coefficient ($b_1$) reflects the average change in the response variable Y for a unit change in the explanatory variable X. That is, it is the slope of the regression line.
–The intercept ($b_0$) estimates the average value of the response variable Y without the influence of the explanatory variable X. That is, when the explanatory variable equals 0.
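Using the intercept (31.019) and slope (0.1877) reported for the height-on-age fit, both interpretations can be read off a prediction function. The helper name `predict_hgt` is hypothetical, and the ages passed to it are arbitrary illustrative values.

```r
# Estimates quoted in the slides for the regression of height on age.
b0 <- 31.019
b1 <- 0.1877

# Hypothetical helper: predicted height for a given age.
predict_hgt <- function(age) b0 + b1 * age

# Intercept: the predicted height when age is 0.
at_zero <- predict_hgt(0)

# Slope: the change in predicted height for a one-unit increase in age.
one_unit_change <- predict_hgt(101) - predict_hgt(100)

print(c(at_zero = at_zero, one_unit_change = one_unit_change))
```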
24
Rcmdr Teaching demo: Simple linear regression
Demos -> Simple linear regression
25
Simple Linear Regression
Statistics -> Fit Models -> Linear regression -> in the small window, type a name for the model, pick the response (e.g. hgt) and explanatory (e.g. age) variables, then OK.

    Call:
    lm(formula = hgt ~ age, data = data)

    Residuals:
         Min       1Q   Median       3Q      Max
    -2.53975 -0.55722  0.08105  0.68147  2.24326

    Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
    (Intercept) 31.018720   0.439077   70.64   <2e-16 ***
    age          0.187735   0.004609   40.73   <2e-16 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 1.07 on 58 degrees of freedom
    Multiple R-squared: 0.9662, Adjusted R-squared: 0.9656
    F-statistic: 1659 on 1 and 58 DF, p-value: < 2.2e-16
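The pieces of this output can also be pulled out of the fitted model programmatically. The sketch below uses simulated data with assumed column names `age` and `hgt`, so its numbers will not match the output above.

```r
# Simulated stand-in for the course data; 'age' and 'hgt' are assumed names.
set.seed(2)
demo <- data.frame(age = runif(60, 20, 170))
demo$hgt <- 31 + 0.19 * demo$age + rnorm(60, sd = 1)

model1 <- lm(hgt ~ age, data = demo)
s <- summary(model1)

# Coefficient table: Estimate, Std. Error, t value, Pr(>|t|).
coef_table <- coef(s)
print(coef_table)

# Residual standard error, its degrees of freedom, and R-squared.
print(c(sigma = s$sigma, df = s$df[2], r_squared = s$r.squared))
```

With 60 observations and two estimated parameters, the residual degrees of freedom are 60 - 2 = 58, as in the slide output.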
26
Simple Regression
Dummy or indicator variable:
A variable that marks or encodes a particular attribute.
A dummy variable usually takes the value 1 or 0 to indicate the presence or absence of an attribute, e.g. 1 for male and 0 for female.
Zero (0) represents the reference category.
A categorical variable with k levels can be expressed as (k-1) dummy variables.
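R builds these (k-1) dummy variables automatically when a factor appears in a model formula. A minimal sketch with a made-up three-level factor `grp` (an assumed name), using R's default treatment coding in which the first level is the reference category:

```r
# Assumed illustrative factor with k = 3 levels.
grp <- factor(c("a", "b", "c", "a", "b", "c"))

# Design matrix: an intercept column plus k - 1 = 2 dummy columns.
X <- model.matrix(~ grp)
print(X)

# Rows of the reference level "a" have 0 in every dummy column.
reference_rows <- X[grp == "a", c("grpb", "grpc")]
```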
27
Rcmdr: Simple Regression
Creating a dummy variable:
–Data -> Manage variables in active data set -> Recode variables -> pick the variable to recode (e.g. sex) -> give the new variable a name (e.g. sexc) -> unselect "Make new variable a factor" (selected by default) -> enter the recode directives ("m"=1 "f"=0) -> OK
–This creates a variable sexc with values 1 and 0.
Running the regression:
–Statistics -> Fit Models -> Linear Regression -> type a name for the model (e.g. Model2), pick the response (e.g. LWAS) and explanatory (e.g. sexc) variables -> OK
28
Rcmdr: Simple Regression (output)

    Call:
    lm(formula = LWAS ~ sexc, data = data)

    Residuals:
        Min      1Q  Median      3Q     Max
    -45.030 -10.186   3.845  10.244  22.649

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)   86.805      2.740   31.68   <2e-16 ***
    sexc          -9.454      3.875   -2.44   0.0178 *
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 15.01 on 58 degrees of freedom
    Multiple R-squared: 0.09307, Adjusted R-squared: 0.07744
    F-statistic: 5.952 on 1 and 58 DF, p-value: 0.01778
29
Rcmdr: Simple Regression (output): Interpretation
A dummy variable regression predicts the mean level of each group.
The estimate of the intercept is the mean for the reference group. In our example, the mean LWAS for female patients is 86.805.
The estimate for the dummy variable is the difference between the means of the two categories.
The sum of the intercept estimate and the dummy variable estimate (e.g. sexc) is the predicted mean for the other group. In our example, the mean LWAS for male patients is (86.805 - 9.454) = 77.351.
A p-value below 0.05 for the dummy variable (e.g. sexc) indicates a significant effect of that variable.
The p-value for sexc is 0.0178, which is below 0.05 and indicates a significant effect of the variable sex on the response LWAS. Put another way, the mean LWAS of male and female patients is significantly different.
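The link between the regression estimates and the group means can be checked on simulated data. In this sketch the data frame `demo` and its values are assumptions; only the variable names `sex`, `sexc` and `LWAS` follow the slides. It also compares the p-value for the dummy variable with that of an equal-variance two-sample t-test, which tests the same difference in means.

```r
# Simulated stand-in data; values are assumed, names follow the slides.
set.seed(3)
demo <- data.frame(sex = rep(c("f", "m"), each = 30))
demo$LWAS <- ifelse(demo$sex == "m", 77, 87) + rnorm(60, sd = 15)

# Dummy variable: 1 for male, 0 for female (the reference category).
demo$sexc <- ifelse(demo$sex == "m", 1, 0)

model2 <- lm(LWAS ~ sexc, data = demo)
b <- coef(model2)

# Group means computed directly.
means <- tapply(demo$LWAS, demo$sex, mean)

# Intercept = mean of females; intercept + slope = mean of males.
print(c(intercept = unname(b[1]), mean_f = unname(means["f"])))
print(c(sum = unname(b[1] + b[2]), mean_m = unname(means["m"])))

# p-value for sexc versus the pooled-variance two-sample t-test.
p_lm <- coef(summary(model2))["sexc", "Pr(>|t|)"]
p_t  <- t.test(LWAS ~ sex, data = demo, var.equal = TRUE)$p.value
print(c(p_lm = p_lm, p_t = p_t))
```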
30
Multiple Regression
Two or more independent variables are used to predict a single dependent variable.
The multiple regression model of Y on p explanatory variables can be written as
$$Y = b_0 + b_1 X_1 + b_2 X_2 + \dots + b_p X_p + e$$
where $b_i$ (i = 1, 2, ..., p) is the regression coefficient of $X_i$.
31
Multiple Regression
The fitted Y is given by
$$\hat{Y} = b_0 + b_1 X_1 + b_2 X_2 + \dots + b_p X_p$$
The estimated residual error is the same as in simple linear regression:
$$e = Y - \hat{Y}$$
32
Rcmdr: Multiple Regression
Statistics -> Fit Models -> Linear regression -> pick the response variable (e.g. PLUC.pre) and more than one explanatory variable (e.g. age and LWAS).
Everything is the same as for simple regression, except that we select two or more explanatory variables (e.g. age and LWAS).

    Call:
    lm(formula = PLUC.pre ~ age + LWAS, data = data)

    Residuals:
        Min      1Q  Median      3Q     Max
    -5.9201 -1.5830  0.3182  1.4167  4.0089
33
Rcmdr: Multiple Regression

    Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
    (Intercept) 12.248701   1.677373   7.302 9.97e-10 ***
    age         -0.004814   0.009321  -0.517    0.607
    LWAS        -0.025224   0.018033  -1.399    0.167
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 2.16 on 57 degrees of freedom
    Multiple R-squared: 0.03926, Adjusted R-squared: 0.005548
    F-statistic: 1.165 on 2 and 57 DF, p-value: 0.3194
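A multiple regression of this form can be fitted in plain R by adding terms to the model formula. The data below are simulated (the response is generated independently of the predictors), and only the variable names follow the slides, so the estimates will differ from the output above.

```r
# Simulated stand-in data; values are assumed, names follow the slides.
set.seed(4)
demo <- data.frame(age  = runif(60, 20, 170),
                   LWAS = rnorm(60, mean = 82, sd = 15))
demo$PLUC.pre <- 10 + rnorm(60, sd = 2)  # unrelated to the predictors by construction

# Two explanatory variables are joined with '+' in the formula.
model3 <- lm(PLUC.pre ~ age + LWAS, data = demo)
b <- coef(model3)

# Fitted values written out from the estimated coefficients:
# b0 + b1 * age + b2 * LWAS.
y_hat <- b[1] + b[2] * demo$age + b[3] * demo$LWAS

print(b)
print(df.residual(model3))
```

The residual degrees of freedom are n minus the number of estimated coefficients, here 60 - 3 = 57.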
34
Coefficient of Determination (Multiple R-squared)
The total variation in the response variable Y is due to (i) the regression on all variables in the model and (ii) the residual (error).
Total variation of y: SS(y) = SS(Regression) + SS(Residual)
The coefficient of determination is
$$R^2 = \frac{SS(\text{Regression})}{SS(y)} = 1 - \frac{SS(\text{Residual})}{SS(y)}$$
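The decomposition of the total sum of squares, and the resulting R-squared, can be verified by hand against `summary()`. The data are made up for illustration.

```r
# Illustrative data (assumed, not from the slides).
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2.0, 4.1, 5.9, 8.2, 9.9, 12.1)

fit <- lm(y ~ x)

# Sums of squares: total, explained by the regression, and residual.
ss_total <- sum((y - mean(y))^2)
ss_reg   <- sum((fitted(fit) - mean(y))^2)
ss_resid <- sum(resid(fit)^2)

# Coefficient of determination from the decomposition.
r2 <- ss_reg / ss_total

print(c(ss_total = ss_total, ss_reg_plus_resid = ss_reg + ss_resid))
print(c(r2_manual = r2, r2_summary = summary(fit)$r.squared))
```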
35
Coefficient of Determination (Multiple R-squared)
$R^2$ lies between 0 and 1.
$R^2 = 0.8$ implies that 80% of the total variation in the response variable Y is due to the contribution of all explanatory variables in the model. That is, the fitted regression model explains 80% of the variance in the response variable.
36
Coefficient of Determination (Multiple R-squared)
$R^2$ never decreases when more variables are added to the model, regardless of sample size. Such an increase in $R^2$ may be due to chance variation alone.
The adjusted $R^2$ accounts for the sample size and the number of variables used in the model, which reduces the influence of chance variation.
37
Rcmdr: Coefficient of Determination (Multiple R-squared)
It is in the output of the multiple regression. For the previous example, the coefficient of determination is 0.039.

    Multiple R-squared: 0.03926, Adjusted R-squared: 0.005548
    F-statistic: 1.165 on 2 and 57 DF, p-value: 0.3194
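The adjusted value in this output can be reproduced from the multiple R-squared with the standard adjustment formula, 1 - (1 - R^2)(n - 1)/(n - p - 1). The slides do not state the formula, so treat it as an assumption about how the reported figure was obtained; n = 60 is inferred from the 57 residual degrees of freedom plus the three estimated coefficients.

```r
# Values taken from the multiple regression output in the slides.
r2 <- 0.03926  # Multiple R-squared
p  <- 2        # number of explanatory variables (age, LWAS)
n  <- 57 + 3   # residual df plus number of coefficients (inferred)

# Standard adjusted R-squared formula (assumed; not stated in the slides).
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(adj_r2)
```

The result agrees with the reported 0.005548 up to the rounding of the printed R-squared.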
38
Thank you