Technology Short Courses: Spring 2010, Kentaka Aruga. Presentation transcript:
1 Technology Short Courses: Spring 2010, Kentaka Aruga: SAS Statistics
2 Objectives of the course
Performing simple descriptive statistics (proc means, proc freq, and proc corr)
Performing basic test statistics (Chi-square test, T-test, F-test)
Basic commands for regression analysis and how to export the result into a table (proc reg)
SAS can perform many kinds of statistical analysis, such as simple descriptive statistics, ANOVA, ANCOVA, logistic regression, and multivariate analysis, but this course focuses mainly on linear regression analysis.
3 Section 1: Preparation. Getting data and importing data
4 Getting data
http://www.uri.edu/its/research/vote.txt
Download the SAS commands that will be used in this practice.
Download the data files that will be used in this course.
Save the files to the 'C:/' drive of your Windows computer.
5 Importing Excel file to SAS
Open the SAS program, then copy and paste the following commands from the file you have just downloaded, "sasstat.txt":
libname car 'c:/';
proc import out=car.auto
    datafile="c:/auto.xls"
    dbms=excel2000 replace;
    sheet="auto";
    getnames=yes;
run;
7 Then highlight the command lines and execute them. Either right-click the highlighted text and click "submit selection", or click the "submit" icon.
8 Proc import
Look at the 'trunk' column. Do you see an empty column?
SAS determines the data type based on the most common data type in the first 8 rows. The 'trunk' column has mixed data. Since the first eight rows are all zero, the remaining rows all become zero.
10 Proc import
Add the following statement:
mixed=yes;
Now the commands should look like this:
proc import out=car.auto
    datafile="c:/auto.xls"
    dbms=excel2000 replace;
    sheet="auto";
    getnames=yes;
    mixed=yes;
run;
Execute these commands.
12 Importing Excel file from the main menu bar
From the main menu click "File", and then click "Import Data".
13 Importing Excel file from the main menu bar
In the "Import Wizard", specify the data source (in this example select MS Excel) and click next.
In the "Connect to MS Excel" wizard, browse to the Excel file you are importing.
14 Importing Excel file from the main menu bar
In the "Select Table" wizard, select the name of the sheet in your Excel file and click next.
In the "Select library and member" wizard, specify the library into which you want to import the Excel file.
Enter a name in the "Member" box to name the data set that will be imported into SAS.
15 Saving the syntax for importing Excel file
You can save the syntax for the import you just performed from the main menu bar.
Browse to a location and name the file in the "Create SAS Statements" wizard.
Open the "sas" file you just saved to see the commands.
17 How to perform simple descriptive statistics (Review from SAS basics course)
How would you see the number of observations, mean, standard deviation, minimum, and maximum of all numeric variables in SAS?
Ans.
proc means data=car.auto;
run;
How do you analyze the frequency of the variables?
Ans.
proc freq data=car.auto;
run;
18 Proc means
By default "proc means" provides the number of observations, mean, standard deviation, minimum, and maximum of all numeric variables.
proc means data=car.auto;
run;
Specifying a certain variable:
var variable name;
Q. How would you execute the means procedure for the variables "price", "mpg", and "weight"?
Creating an output table:
output out=file name;
Q. How would you get the output for the means procedure for the variables "price", "mpg", and "weight"?
19 Proc means (Answers)
proc means data=car.auto;
    var price mpg weight;
    output out=car.means;
run;
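The answer above writes the default statistics to the output data set. As a minimal sketch, assuming you only want the means and standard deviations saved under readable names, the OUTPUT statement can list the statistics explicitly. The data set name work.means_out and the variable names such as mean_price are hypothetical choices made for this illustration, not part of the course files.

```sas
/* Sketch: save only the means and standard deviations of three variables. */
/* work.means_out and the mean_* / std_* names are hypothetical.            */
proc means data=car.auto noprint;   /* noprint suppresses the printed table */
    var price mpg weight;
    output out=work.means_out
        mean=mean_price mean_mpg mean_weight
        std=std_price std_mpg std_weight;
run;

/* Display the one-row summary data set that was just created. */
proc print data=work.means_out;
run;
```

The names after mean= and std= are matched to the VAR variables in order, so the first name in each list belongs to "price".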
20 Proc freq
By default this procedure creates frequency tables for all variables.
proc freq data=car.auto;
run;
Specifying a certain variable:
tables variable name;
Q. How would you execute the FREQ procedure for the variable "foreign"?
Creating an output table (an option of the tables statement):
tables variable name / out=file name;
Q. How would you get the output for the FREQ procedure for the variable "foreign"?
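One possible way to answer the two questions above is sketched here. It assumes the car.auto data set imported earlier; the output data set name car.freq_foreign is a hypothetical choice for illustration.

```sas
/* Frequency table for the single variable "foreign". */
proc freq data=car.auto;
    tables foreign;
run;

/* Same table, also saved as a data set.                           */
/* car.freq_foreign is a hypothetical name chosen for this sketch. */
proc freq data=car.auto;
    tables foreign / out=car.freq_foreign;
run;
```

The saved data set holds one row per value of "foreign", with the count and the percentage of observations.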
24 Proc corr
The CORR procedure generates 'Simple Statistics' based on non-missing values, and the 'Pearson Correlation Coefficient', an index that quantifies the linear relationship between a pair of variables.
An insignificant p-value means the data show no evidence of a linear relationship between the two variables.
Coefficient of determination = square of the correlation coefficient
25 Proc corr
Finding correlations between pairs of variables
1) All variables
proc corr data=car.auto;
run;
2) Three specific variables
proc corr data=car.auto;
    var price mpg weight;
run;
26 Rho(X,Y) = COV(X,Y) / (VAR(X) * VAR(Y))^(1/2)
The negative correlation coefficient, together with its low p-value, indicates a significant negative linear relationship between weight and mpg. The heavier the car is, the lower the mpg becomes.
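Written out in standard notation, the correlation formula and the coefficient of determination mentioned earlier look as follows. The value r = -0.8 in the last line is an illustrative number only, not a result from the auto data.

```latex
\rho_{XY} = \frac{\mathrm{Cov}(X,Y)}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}},
\qquad -1 \le \rho_{XY} \le 1

R^2 = \rho_{XY}^{2}

% Illustrative value only: r = -0.8 \;\Rightarrow\; R^2 = (-0.8)^2 = 0.64
```

The sign of the coefficient gives the direction of the relationship and its magnitude gives the strength; the p-value only says whether the coefficient differs significantly from zero.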
27 Section 3: Performing basic test statistics (Chi-square test, T-test, F-test)
28 Chi-square test of independence
What is the Chi-square test of independence?
Ans. It tests whether the row variable and the column variable are independent or related.
What is the null hypothesis?
Ans. The variables in the row and column are independent: there is no relationship between row and column frequencies.
The command for this test is an option of "proc freq". Simply use chisq.
To display the expected frequency for each cell, use the option "expected".
The expected value of each cell is calculated under the assumption that the row and column variables are independent.
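For reference, the expected cell frequency and the test statistic described above can be written as follows. The symbols are introduced here for illustration: O_ij is the observed count in row i and column j, R_i and C_j are the row and column totals, n is the total number of observations, and r and c are the numbers of rows and columns.

```latex
E_{ij} = \frac{R_i \, C_j}{n}

\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}},
\qquad \mathrm{df} = (r-1)(c-1)
```

A large chi-square value relative to its degrees of freedom gives a small p-value, which leads to rejecting the null hypothesis of independence.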
29 Chi-square test of independence: exercise
There are 34 students in the classroom, and they voted on whether they wanted to have a turtle in their classroom as a pet. The data file "vote.txt" contains the result of the vote (Yes=y, No=n) and the gender of the students (male=m, female=f).
Q1 Import the file "vote.txt" into SAS and name the variables "answers" and "gender".
Q2 Using the option "chisq", test whether or not the answers to the vote and gender are associated with each other.
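A minimal sketch of one way to approach this exercise is shown below. It assumes that "vote.txt" was saved to 'c:/', that it is blank-delimited with no header row, and that each line lists the answer first and the gender second; check the file and adjust the INPUT statement if its layout differs. The data set name work.vote is a hypothetical choice.

```sas
/* Q1: read the text file (assumed layout: answer, then gender, */
/* separated by blanks, with no header row).                     */
data work.vote;
    infile "c:/vote.txt";
    input answers $ gender $;
run;

/* Q2: cross-tabulate gender by answers, with the chi-square test */
/* and the expected cell frequencies.                             */
proc freq data=work.vote;
    tables gender*answers / chisq expected;
run;
```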
32 What does the result tell you?
The p-value is lower than 0.01.
The null hypothesis that the two variables are independent is rejected even at the 1% significance level.
The two variables "answers" and "gender" are associated with each other (they are dependent).
33 Proc ttest
This procedure is used to test the hypothesis of equality of means for two normal populations from which independent samples have been obtained.
Three cases in SAS:
One-sample t-test: computes the sample mean of the variable and compares it with a given number.
Two-sample t-test: compares the mean of the first sample minus the mean of the second sample to a given number.
Paired-observations t-test: compares the mean of the differences in the observations to a given number.
34 Assumptions of "proc ttest"
The observations are random samples drawn from normally distributed populations. This can be tested using the UNIVARIATE procedure.
If the normality assumptions are not satisfied, use the NPAR1WAY procedure.
The two populations of a group comparison must be independent.
If they are not independent, you should question the validity of a paired comparison.
The default null hypothesis value is zero. To change it, use h0=number, e.g. h0=10.
The default significance level is 5% (alpha=0.05, which gives 95% confidence limits). To change it, use alpha=significance level, e.g. alpha=0.01.
Source:
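The options above can be sketched as follows, assuming the car.auto data set from earlier. The values h0=20 and alpha=0.01 are illustrative choices, not values taken from the course.

```sas
/* Check normality of mpg; the NORMAL option requests tests for normality. */
proc univariate data=car.auto normal;
    var mpg;
run;

/* One-sample t-test of H0: mean mpg = 20, with 99% confidence limits. */
/* h0=20 and alpha=0.01 are illustrative values only.                  */
proc ttest data=car.auto h0=20 alpha=0.01;
    var mpg;
run;

/* Nonparametric alternative for comparing two groups when normality fails. */
proc npar1way data=car.auto wilcoxon;
    class foreign;
    var mpg;
run;
```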
35 Proc ttest: exercise
How would you perform a t-test on the mpg variable classified by the foreign variable?
Hint: use the "class" and "var" statements.
What will the null hypothesis be in this case?
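36 Proc ttest (Cont'd)
The command:
proc ttest data=car.auto;
    class foreign;
    var mpg;
run;
CLASS statement: contains a variable that distinguishes the groups being compared.
VAR statement: specifies the response variable to be used in calculations.
The null hypothesis: the mean mpg of domestic cars equals the mean mpg of foreign cars (the difference in means is zero).
The alternative hypothesis: the two means are not equal (the difference in means is not zero).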
37 The first table shows the basic statistics.
The second table is the t-test for equal means. Before using this table, look at the third table to determine whether the assumption of equal variances is reasonable.
The third table is a test of equal variances. Here the test for equality of the mpg variances yields an F value of 1.18 with a high p-value, so the null hypothesis of equal variance is not rejected.
Thus, read the "equal variance" row of the second table, which gives the T value based on the assumption of equal mpg variances. The null hypothesis of equal means is not rejected even at the 10% significance level. The t statistic is not significant at α=0.05, indicating no significant difference in mpg between domestic and foreign cars.
38 Section 4: Basic commands for regression analysis and how to export the result into a table (proc reg)
39 Regression analysis
Regression analysis: finding a reasonable mathematical model of the relationship between a response variable (y) and a set of explanatory variables (x1, x2, ..., xP).
General model: y = beta0 + beta1*x1 + beta2*x2 + ... + betaP*xP + error
Proc reg estimates the coefficients (the betas).
40 Proc reg
General command:
proc reg data = file name;
    model DV = IV;
run;
DV: dependent variable. IV: independent variable.
This procedure also performs the following tests:
F-test: tests the null hypothesis that none of the independent variables has any effect.
T-test: tests, for each IV, the null hypothesis that the independent variable has no effect on the dependent variable.
41 Proc reg: exercise
Let 'price' be the response variable (dependent variable, DV), and 'mpg' and 'length' be the explanatory variables (independent variables, IV).
Q1 What will be the commands?
Q2 What null hypotheses will be tested?
Q3 Will the model be significant?
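42 Proc reg: answers
Q1
proc reg data = car.auto;
    model price = mpg length;
run;
Q2
F-test: the coefficients of mpg and length are both zero (neither variable has any effect on price).
T-test: for each variable separately, its coefficient is zero (the coefficient of mpg is zero; the coefficient of length is zero).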
44 Proc reg: Confidence and prediction interval
Construct 95% confidence and prediction intervals by adding two options, 'clm' and 'cli'. 'clm' gives confidence limits for the mean predicted value, and 'cli' gives prediction limits for an individual value.
How would you add these options to the previous model?
proc reg data=car.auto;
    model price = mpg length / clm cli;
run;
A 95% CI means that if the estimation process were repeated many times, 95% of all the calculated intervals would be expected to contain the true parameter value.
A 95% PI means that about 95% of the time, the next measurement you make will fall inside this interval.
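The 'clm' and 'cli' options print the limits in the output window. If you want them in a data set instead, one option is the OUTPUT statement of proc reg, sketched below. The data set name work.pred and the new variable names (predicted, mean_lower, and so on) are hypothetical names chosen for this illustration.

```sas
/* Sketch: save predicted values with 95% limits to a data set. */
/* work.pred and the variable names are hypothetical.           */
proc reg data=car.auto;
    model price = mpg length;
    output out=work.pred
        p=predicted                       /* predicted price               */
        lclm=mean_lower uclm=mean_upper   /* confidence limits for mean    */
        lcl=indiv_lower ucl=indiv_upper;  /* prediction limits, individual */
run;
quit;
```

For each observation, the prediction limits are wider than the confidence limits for the mean, because they also account for the variation of an individual car around the regression line.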
45 Proc reg: creating an output table
Add "outest = file name" after the "proc reg" command:
proc reg data=car.auto outest=car.est1;
    model price = mpg length / clm cli;
run;
quit;
Note: there is no semicolon between "data=car.auto" and "outest=car.est1".
In order to see the output data file "car.est1" you need to add the statement "quit" at the end.
46 You can remove the variables (columns) you do not want to see by using the "keep" or "drop" option:
data car.est2 (keep=intercept mpg length);
    set car.est1;
run;
data car.est3 (drop=price _model_ _depvar_ _type_ _RMSE_);
    set car.est1;
run;
47 Proc reg: creating an output table
To see other outputs, go to "Help", type in "REG", and open "The REG procedure". Then click "Syntax".
50 Exporting the output data to Excel
General commands:
proc export data = name of the SAS data file you are exporting
    outfile = "drive and path of the output file on your computer"
    dbms = excel2000 replace;
run;
How would you export the file "car.est2" into an Excel file?
Ans.
proc export data = car.est2
    outfile = "c:/est.xls"
    dbms = excel2000 replace;
run;
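If the Excel engine is not available in your SAS installation, a comma-separated file is a possible alternative, since Excel can open it directly. This is a sketch of that alternative and not part of the course commands; the file name "c:/est.csv" is a hypothetical choice.

```sas
/* Sketch: export the estimates as a CSV file that Excel can open. */
/* The output path c:/est.csv is a hypothetical choice.            */
proc export data=car.est2
    outfile="c:/est.csv"
    dbms=csv replace;
run;
```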
51 Useful supports: other useful sites
Online SAS manuals
This will automatically link you to
Statbookstore: useful site for finding program examples