HLTH 653 Lecture 2, Raul Cruz-Cano, Spring 2013. Statistical analysis procedures: Proc univariate, Proc ttest, Proc corr, Proc reg.



Proc Univariate The UNIVARIATE procedure provides data summarization on the distribution of numeric variables.

PROC UNIVARIATE <options>;
  VAR variable-1 ... variable-n;
RUN;

Options:
PLOTS: create low-resolution stem-and-leaf, box, and normal probability plots
NORMAL: request tests for normality

data blood;
  INFILE 'C:\blood.txt';
  INPUT subjectID $ gender $ bloodtype $ age_group $ RBC WBC cholesterol;
run;

proc univariate data=blood;
  var cholesterol;
run;

OUTPUT (1) The UNIVARIATE Procedure, Variable: cholesterol. The Moments table reports N (795), Sum Weights (795), Mean, Sum Observations, Std Deviation, Variance, Skewness, Kurtosis, Uncorrected SS, Corrected SS, Coeff Variation, and Std Error Mean (numeric values omitted).

OUTPUT (1) Moments - Moments are statistical summaries of a distribution. N - The number of valid observations for the variable; the total number of observations is the sum of N and the number of missing values. Sum Weights - A numeric variable can be specified as a weight variable to weight the values of the analysis variable. The default weight is 1 for each observation, so this field is the sum of the weight variable's values across observations.

OUTPUT (1) Sum Observations - The sum of the observation values. When a weight variable is specified, this field is the weighted sum. The mean of the variable is the sum of observations divided by the sum of weights.

OUTPUT (1) Std Deviation - The standard deviation is the square root of the variance. It measures the spread of a set of observations: the larger the standard deviation, the more spread out the observations.

OUTPUT (1) Variance - The variance is a measure of variability: the sum of the squared distances of the data values from the mean, divided by N-1. We generally do not use the variance as an index of spread because it is in squared units; instead, we use the standard deviation.
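The variance and standard deviation definitions above can be checked numerically. A quick sketch in Python (not SAS) on a small made-up sample:

```python
import math

# Toy sample (hypothetical values) illustrating:
# variance = corrected SS / (N - 1), std deviation = sqrt(variance).
data = [180, 200, 190, 220, 210]
n = len(data)
mean = sum(data) / n                               # 200.0
corrected_ss = sum((x - mean) ** 2 for x in data)  # 1000.0
variance = corrected_ss / (n - 1)                  # 250.0
std_dev = math.sqrt(variance)
print(round(std_dev, 4))                           # 15.8114
```

PROC UNIVARIATE computes exactly this N-1 (corrected) version for its Std Deviation and Variance fields.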

OUTPUT (1) Skewness - Skewness measures the degree and direction of asymmetry. A symmetric distribution, such as the normal, has skewness 0. A distribution that is skewed to the left (e.g., when the mean is less than the median) has negative skewness; one skewed to the right has positive skewness.

OUTPUT (1) Kurtosis - Kurtosis is a measure of the heaviness of the tails of a distribution. In SAS, a normal distribution has kurtosis 0 (SAS reports excess kurtosis). Kurtosis is positive if the tails are "heavier" than those of a normal distribution and negative if they are "lighter". Nearly normal distributions have kurtosis values close to 0, while extremely nonnormal distributions may have large positive or negative values.
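To see the sign conventions for skewness and excess kurtosis in action, here is a Python sketch using the simple moment-based (population) formulas on a toy right-skewed sample; note SAS applies small-sample adjustments to these formulas, so its values would differ slightly:

```python
# Moment-based skewness and excess kurtosis (population formulas,
# not SAS's adjusted versions) for a hypothetical right-skewed sample.
data = [1, 2, 2, 3, 3, 3, 4, 10]   # long right tail
n = len(data)
mean = sum(data) / n
m2 = sum((x - mean) ** 2 for x in data) / n
m3 = sum((x - mean) ** 3 for x in data) / n
m4 = sum((x - mean) ** 4 for x in data) / n
skewness = m3 / m2 ** 1.5          # > 0: skewed to the right
excess_kurtosis = m4 / m2 ** 2 - 3  # 0 for a normal distribution
print(skewness > 0, excess_kurtosis > 0)
```

The single large value (10) pulls the mean above the median, giving positive skewness and a heavy right tail (positive excess kurtosis).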

OUTPUT (1) Uncorrected SS - The sum of the squared data values. Corrected SS - The sum of the squared distances of the data values from the mean; this number divided by N-1 gives the variance.

OUTPUT (1) Coeff Variation - The coefficient of variation is another way of measuring variability. It is a unitless measure, defined as the ratio of the standard deviation to the mean and generally expressed as a percentage. It is useful for comparing variation between different variables.

OUTPUT (1) Std Error Mean - The estimated standard deviation of the sample mean: the standard deviation of the sample divided by the square root of the sample size. It provides a measure of the variability of the sample mean.
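The standard-error formula is a one-liner. A Python sketch (the s and n values here are illustrative, not the lecture's exact output):

```python
import math

# Standard error of the mean = s / sqrt(n), with hypothetical values.
s, n = 48.2, 795
sem = s / math.sqrt(n)
print(sem)
```

With n = 795 observations, the standard error is much smaller than the standard deviation itself, which is why large samples give precise mean estimates.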

OUTPUT (2) Location and Variability table: Mean, Median, Mode; Std Deviation, Variance (2489), Range, Interquartile Range (remaining values omitted). NOTE: The mode displayed is the smallest of 2 modes with a count of 12.

Median - The median is a measure of central tendency: the middle number when the values are arranged in ascending (or descending) order. It is less sensitive than the mean to extreme observations. Mode - The mode is another measure of central tendency: the value that occurs most frequently in the variable.

OUTPUT (3) Range - The range is a measure of the spread of a variable: the difference between the largest and the smallest observations. It is easy to compute and easy to understand. Interquartile Range - The interquartile range is the difference between the upper quartile (75th percentile) and the lower quartile (25th percentile). It measures the spread of a data set and is robust to extreme observations.
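Both spread measures can be computed directly. A Python sketch on a made-up sorted sample (quartile conventions vary between packages; Python's `statistics.quantiles` uses the "exclusive" method by default, which can differ slightly from SAS's Definition 5):

```python
import statistics

# Range and IQR on a hypothetical sorted sample.
data = [94, 123, 138, 150, 163, 177, 199, 244]
rng = max(data) - min(data)                  # 244 - 94 = 150
q1, _median, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1
print(rng, iqr)
```

Note the single large value 244 inflates the range but barely moves the IQR, illustrating the IQR's robustness to extreme observations.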

OUTPUT (3) Tests for Location: Mu0=0. The table lists each test (Student's t, Sign, Signed Rank) with its statistic and p-value; all three report Pr < .0001 (statistic values omitted).

OUTPUT (3) Sign - The sign test is a simple nonparametric procedure for testing a null hypothesis about the population median. It is used when we have a small sample from a nonnormal distribution. The statistic is M = (N+ - N-)/2, where N+ is the number of values greater than Mu0 and N- is the number of values less than Mu0; values equal to Mu0 are discarded. Under the hypothesis that the population median equals Mu0, the sign test calculates the p-value for M using a binomial distribution. The interpretation of the p-value is the same as for the t-test. In our example the M statistic is 398 and the p-value is less than .0001, so we conclude that the median of the variable is significantly different from zero.
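The M statistic and its binomial p-value can be sketched in a few lines of Python (toy data tested against a hypothetical Mu0, not the lecture's cholesterol sample):

```python
from math import comb

# Sign test sketch: M = (N+ - N-)/2, two-sided p from Binomial(n, 0.5).
data = [120, 130, 90, 140, 150, 160, 110, 95]
mu0 = 100                              # hypothesized median
n_plus = sum(x > mu0 for x in data)    # values above mu0
n_minus = sum(x < mu0 for x in data)   # values below mu0
m = (n_plus - n_minus) / 2
n = n_plus + n_minus                   # ties with mu0 are discarded
k = min(n_plus, n_minus)
# Two-sided binomial tail probability under H0 (p = 0.5 per sign):
p_two_sided = min(1.0, 2 * sum(comb(n, i) for i in range(k + 1)) / 2 ** n)
print(m, p_two_sided)
```

With only 8 observations the p-value is large, so this toy sample gives no evidence against Mu0 = 100; the lecture's example, with 795 observations all far from 0, yields p < .0001.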

OUTPUT (3) Signed Rank - The signed rank test, also known as the Wilcoxon test, is used to test the null hypothesis that the population median equals Mu0. It assumes that the distribution of the population is symmetric. The Wilcoxon signed rank statistic is computed from the rank sum and the numbers of observations above and below the median. The interpretation of the p-value is the same as for the t-test. In our example the p-value is less than .0001, so we again conclude that the median of the variable is significantly different from zero.

OUTPUT (4) Quantiles (Definition 5): 100% Max, 99%, 95%, 90%, 75% Q3, 50% Median, 25% Q1, 10% = 138, 5% = 123, 1% = 94, 0% Min = 17 (remaining values omitted).

95% - Ninety-five percent of all values of the variable are equal to or less than this value.

OUTPUT (5) Extreme Observations - A list of the five lowest and five highest values of the variable with their observation numbers, followed by a table of missing values (count and percent of all observations). (Values omitted.)

Student's t-test Independent one-sample t-test. This test compares one sample mean to a specific value mu0:

    t = (xbar - mu0) / (s / sqrt(N))

where xbar is the sample mean, s is the standard deviation of the sample, and N is the sample size. The degrees of freedom used in this test is N-1.
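The one-sample t statistic above is easy to compute by hand. A Python sketch on made-up data against a hypothetical mu0 = 200:

```python
import math
import statistics

# One-sample t: t = (xbar - mu0) / (s / sqrt(n)), df = n - 1.
data = [210, 190, 205, 220, 185, 200]   # toy sample
mu0 = 200
n = len(data)
xbar = statistics.mean(data)
s = statistics.stdev(data)              # sample (n-1) standard deviation
t = (xbar - mu0) / (s / math.sqrt(n))
df = n - 1
print(t, df)
```

A small t (here about 0.32 with 5 degrees of freedom) means the sample mean is well within sampling noise of mu0.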

Student's t-test Dependent t-test: used when the samples are dependent; that is, when there is only one sample that has been tested twice (repeated measures) or when there are two samples that have been matched or "paired". For this test, the differences between all pairs must be calculated; the pairs are either one person's pretest and posttest scores or one person in a group matched to another person in the other group. The average (XbarD) and standard deviation (sD) of those differences are used in the statistic:

    t = (XbarD - mu0) / (sD / sqrt(N))

The constant mu0 is nonzero if you want to test whether the average difference is significantly different from mu0. The degrees of freedom used is N-1.
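The paired statistic reduces to a one-sample t-test on the within-pair differences. A Python sketch with invented before/after values (not the lecture's blood-pressure data):

```python
import math
import statistics

# Paired t-test from within-pair differences (toy data, mu0 = 0).
before = [120, 132, 128, 140, 135]
after = [118, 130, 131, 136, 129]
d = [a - b for a, b in zip(after, before)]   # pair differences
n = len(d)
dbar = statistics.mean(d)
sd = statistics.stdev(d)                     # (n-1) std dev of differences
t = dbar / (sd / math.sqrt(n))
df = n - 1
print(t, df)
```

Because each subject serves as its own control, between-subject variability cancels out of the differences, which is the advantage of the paired design.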

PROC TTEST The following statements are available in PROC TTEST:

PROC TTEST <options>;
  CLASS variable;
  PAIRED variables;
  BY variables;
  VAR variables;

CLASS: a CLASS statement giving the name of the classification (or grouping) variable must accompany the PROC TTEST statement in the two independent-sample case (two-sample t test). The class variable must have two, and only two, levels.

Paired Statements PAIRED: the PAIRED statement identifies the variables to be compared in a paired t test.
1. You can use one or more variables in the PairLists.
2. Variables or lists of variables are separated by an asterisk (*) or a colon (:).
3. The asterisk (*) requests comparisons between each variable on the left with each variable on the right.
4. Use the PAIRED statement only for paired comparisons.
5. The CLASS and VAR statements cannot be used with the PAIRED statement.

PROC TTEST OPTIONS:
ALPHA=p specifies that confidence intervals are to be 100(1-p)% confidence intervals, where 0 < p < 1. By default, PROC TTEST uses ALPHA=0.05. If p is 0 or less, or 1 or more, an error message is printed.
H0=m requests tests against m instead of 0 in all three situations (one-sample, two-sample, and paired-observation t tests). By default, PROC TTEST uses H0=0.
DATA=SAS-data-set names the SAS data set for the procedure to use.

*One sample ttest*;
proc ttest data=blood H0=200;
  var cholesterol;
run;

One sample t test Output The TTEST Procedure, Variable: cholesterol. The output reports N, Mean, Std Dev, Std Err, Minimum, Maximum; the mean with its 95% CL, the standard deviation with its 95% CL; and DF, t Value, Pr > |t| (values omitted). 95% CL Mean is the 95% confidence interval for the mean. 95% CL Std Dev is the 95% confidence interval for the standard deviation.

One sample t test Output DF - The degrees of freedom for the t-test is simply the number of valid observations minus 1. We lose one degree of freedom because we have estimated the mean from the sample; some of the information in the data was used to estimate the mean, so it is not available for the test, and the degrees of freedom account for this. t Value - The t statistic: the ratio of the difference between the sample mean and the given number (H0) to the standard error of the mean. Pr > |t| - The probability of observing a greater absolute value of t under the null hypothesis.

title 'Paired Comparison';
data pressure;
  input SBPbefore SBPafter;
  diff_BP = SBPafter - SBPbefore;
datalines;
...
;
run;

proc ttest data=pressure;
  paired SBPbefore*SBPafter;
run;

Paired t test Output The TTEST Procedure, Difference: SBPbefore - SBPafter. The output reports N, Mean, Std Dev, Std Err, Minimum, Maximum; the mean of the differences with its 95% CL; and DF, t Value, Pr > |t| (values omitted). The t statistic tests whether the mean of the differences is 0. Here p = 0.3, suggesting that the mean of the differences is not significantly different from 0.

Two independent samples t-test An independent-samples t-test is used when you want to compare the means of a normally distributed interval dependent variable for two independent groups. For example, using the hsb2 data file, say we wish to test whether the mean for write is the same for males and females.

proc ttest data = "c:\hsb2";
  class female;
  var write;
run;

Proc corr The CORR procedure is a statistical procedure for numeric random variables that computes correlation statistics. (The default correlation analysis includes descriptive statistics, Pearson correlation statistics, and probabilities for each analysis variable.)

PROC CORR <options>;
  VAR variables;
  WITH variables;
  BY variables;

proc corr data=blood;
  var RBC WBC cholesterol;
run;

Proc Corr Output Simple Statistics table: for each variable (RBC, WBC, cholesterol) the output reports N, Mean, Std Dev, Sum, Minimum, and Maximum (values omitted). N - The number of valid (i.e., non-missing) cases used in the correlation. By default, proc corr uses pairwise deletion for missing observations, meaning that a pair of observations (one from each variable in the pair being correlated) is included if both values are non-missing. If you use the nomiss option on the proc corr statement, proc corr uses listwise deletion and omits all observations with missing data on any of the named variables.

Proc Corr Pearson Correlation Coefficients, Prob > |r| under H0: Rho=0, Number of Observations: a matrix over RBC, WBC, and cholesterol; each cell shows the correlation coefficient, its p-value, and the number of observations (values omitted). Pearson Correlation Coefficients - measure the strength and direction of the linear relationship between two variables. The correlation coefficient can range from -1 to +1, with -1 indicating a perfect negative correlation, +1 indicating a perfect positive correlation, and 0 indicating no linear correlation at all.
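The Pearson coefficient can be computed straight from its definition. A Python sketch on two made-up variables:

```python
import math

# Pearson r = sum((x - mx)(y - my)) / sqrt(sum((x - mx)^2) * sum((y - my)^2))
x = [1, 2, 3, 4, 5]   # toy data
y = [2, 4, 5, 4, 5]
n = len(x)
mx, my = sum(x) / n, sum(y) / n
sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
sxx = sum((a - mx) ** 2 for a in x)
syy = sum((b - my) ** 2 for b in y)
r = sxy / math.sqrt(sxx * syy)
print(r)
```

Here r is about 0.77, a fairly strong positive linear relationship; squaring it gives the R-Square value that reappears in the regression output below.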

Proc Reg The REG procedure is one of many regression procedures in the SAS System.

PROC REG <options>;
  MODEL dependents = regressors </ options>;
  BY variables;
  OUTPUT <OUT=SAS-data-set> keyword=names;

data blood;
  INFILE 'F:\blood.txt';
  INPUT subjectID $ gender $ bloodtype $ age_group $ RBC WBC cholesterol;
run;

data blood1;
  set blood;
  if gender='Female' then sex=1; else sex=0;
  if bloodtype='A' then typeA=1; else typeA=0;
  if bloodtype='B' then typeB=1; else typeB=0;
  if bloodtype='AB' then typeAB=1; else typeAB=0;
  if age_group='Old' then Age_old=1; else Age_old=0;
run;

proc reg data=blood1;
  model cholesterol = sex typeA typeB typeAB Age_old RBC WBC;
run;

Proc reg output Analysis of Variance table: Source (Model, Error, Corrected Total), DF, Sum of Squares, Mean Square, F Value, Pr > F (values omitted). DF - The degrees of freedom associated with the sources of variance. (1) The total variance has N-1 degrees of freedom (663-1 = 662). (2) The model degrees of freedom equal the number of parameters minus 1 (P-1); including the intercept there are 8 parameters, so the model has 8-1 = 7 degrees of freedom. (3) The residual degrees of freedom are the total DF minus the model DF: 662-7 = 655.

Proc reg output Sum of Squares - associated with the three sources of variance: total, model, and residual.
SSTotal - The total variability around the mean: Sum(Y - Ybar)^2.
SSResidual - The sum of squared errors in prediction: Sum(Y - Ypredicted)^2.
SSModel - The improvement in prediction from using the predicted value of Y over just using the mean of Y: the sum of squared differences between the predicted value of Y and the mean of Y, Sum(Ypredicted - Ybar)^2.
Note that SSTotal = SSModel + SSResidual, and SSModel / SSTotal equals R-Square, the proportion of the variance explained by the independent variables.
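The decomposition SSTotal = SSModel + SSResidual can be verified numerically. A Python sketch fitting a simple least-squares line to toy data:

```python
# Verifies SSTotal = SSModel + SSResidual and R^2 = SSModel / SSTotal
# for a one-predictor least-squares fit (hypothetical data).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
mx, my = sum(x) / n, sum(y) / n
b1 = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
b0 = my - b1 * mx                      # least-squares intercept
yhat = [b0 + b1 * a for a in x]        # predicted values
ss_total = sum((b - my) ** 2 for b in y)
ss_model = sum((p - my) ** 2 for p in yhat)
ss_resid = sum((b - p) ** 2 for b, p in zip(y, yhat))
r_square = ss_model / ss_total
print(ss_total, ss_model + ss_resid, r_square)
```

For these numbers R-Square comes out to 0.6, which is also the square of the Pearson correlation between x and y, as it must be in simple regression.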

Proc reg output Mean Square - The Mean Squares: the Sums of Squares divided by their respective DF. These are computed so you can form the F ratio, dividing the Mean Square Model by the Mean Square Residual, to test the significance of the predictors in the model.

Proc reg output F Value and Pr > F - The F value is the Mean Square Model divided by the Mean Square Residual. The F value and p-value answer the question "Do the independent variables predict the dependent variable?" The p-value is compared to your alpha level (typically 0.05); if it is smaller, you can conclude that the independent variables reliably predict the dependent variable. Note that this is an overall significance test assessing whether the group of independent variables, used together, reliably predicts the dependent variable; it does not address the ability of any particular independent variable to predict the dependent variable.

Proc reg output Fit statistics: Root MSE, R-Square, Dependent Mean, Adj R-Sq, Coeff Var (values omitted). Root MSE - The standard deviation of the error term: the square root of the Mean Square Residual (or Error).

Proc reg output Dependent Mean - The mean of the dependent variable. Coeff Var - The coefficient of variation, a unitless measure of variation in the data: the Root MSE divided by the mean of the dependent variable, multiplied by 100 (100*(48.2/201.69) = 23.90).

Proc reg output Parameter Estimates table: for each variable (Intercept, sex, typeA, typeB, typeAB, Age_old, RBC, WBC) the output reports DF, Parameter Estimate, Standard Error, t Value, and Pr > |t|; here the intercept has p < .0001 (other values omitted). t Value and Pr > |t| - These columns provide the t value and two-tailed p-value used in testing the null hypothesis that the coefficient/parameter is 0.

ANOVA A one-way analysis of variance (ANOVA) is used when you have a categorical independent variable (with two or more categories) and a normally distributed interval dependent variable, and you wish to test for differences in the means of the dependent variable broken down by the levels of the independent variable.
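The between-group versus within-group comparison behind one-way ANOVA can be computed by hand. A Python sketch (not SAS) with three made-up groups:

```python
# One-way ANOVA F by hand: F = MS(between) / MS(within), on toy groups.
groups = {
    "A": [10, 12, 11, 13],
    "B": [14, 15, 16, 17],
    "C": [9, 8, 10, 9],
}
all_vals = [v for g in groups.values() for v in g]
grand = sum(all_vals) / len(all_vals)                # grand mean
# Between-group SS: group sizes times squared deviations of group means.
ss_between = sum(
    len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups.values()
)
# Within-group SS: squared deviations of values from their own group mean.
ss_within = sum(
    sum((v - sum(g) / len(g)) ** 2 for v in g) for g in groups.values()
)
df_between = len(groups) - 1
df_within = len(all_vals) - len(groups)
f = (ss_between / df_between) / (ss_within / df_within)
print(f, df_between, df_within)
```

A large F (here the group means are far apart relative to within-group scatter) signals that at least one group mean differs, which is exactly what PROC ANOVA's F test reports for the clover example below.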

ANOVA The following example studies the effect of bacteria on the nitrogen content of red clover plants. The treatment factor is bacteria strain, and it has six levels. Five of the six levels consist of five different Rhizobium trifolii bacteria cultures combined with a composite of five Rhizobium meliloti strains. The sixth level is a composite of the five Rhizobium trifolii strains with the composite of the Rhizobium meliloti. Red clover plants are inoculated with the treatments, and nitrogen content is later measured in milligrams.

title1 'Nitrogen Content of Red Clover Plants';
data Clover;
  input Strain $ Nitrogen @@;
datalines;
3DOK1 ... 3DOK5 ... 3DOK4 ... 3DOK7 ... 3DOK13 ...
COMPOS 17.3 COMPOS 19.4 COMPOS 19.1 COMPOS 16.9 COMPOS 20.8
;
run;

proc anova data = Clover;
  class Strain;
  model Nitrogen = Strain;
run;

proc freq data = Clover;
  tables Strain;
run;

ANOVA The test for Strain suggests that there are differences among the bacterial strains, but it does not reveal any information about the nature of the differences. Mean comparison methods can be used to gather further information.

HLTH 653 This is a required class. It is part of the qualifier exams.