Introduction to SAS
What is a data set? A data set (or dataset) is a collection of data, usually presented in tabular form. Each column represents a particular variable. Each row corresponds to a given member of the data set in question.
There are three types of datasets: cross-sectional, time-series, and panel (a combination of cross-sectional and time-series datasets).
Cross-Sectional Data Cross-sectional data refers to data collected by observing many subjects (such as individuals, firms or countries/regions) at the same point of time, or without regard to differences in time.

Members  Age  Wage  Years of schooling
John     40   100k  14
Paul     34   110k  17
Mary     28   75k   10
Tom      30   130k  16
Sara     37   50k   15
Time-Series Data A time series is a sequence of data points, typically measured at successive times spaced at uniform intervals. Frequencies: daily, weekly, monthly, quarterly, annual.

Year  GDP xyz  Inflation Rate
2004  34       3.2
2005  30       2.5
2006  37       2.7
2007  38       3
2008  41       2.9
2009  43       3.4
Panel Data Panel data, also called longitudinal data or cross-sectional time-series data, are data where multiple cases (people, firms, countries, etc.) were observed at two or more time periods.

Person  Year  Income  Age  Sex
1       2003  1500    27   1
1       2004  1700    28   1
1       2005  2000    29   1
2       2003  2100    41   2
2       2004  2100    42   2
2       2005  2200    43   2
What should you know about your dataset?
What type of dataset do you have?
How many variables do you have?
How many observations do you have?
What kind of variables do you have?
– Numeric. A numeric variable is an observed response that is a numerical value.
– String. A string variable is any combination of one or more characters.
Are there missing values?
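In SAS, most of these questions can be answered with two procedures. The sketch below is illustrative only: the data set name wages and the variable names are assumptions, and the values are the cross-sectional example from the earlier slide, with wage entered in thousands. PROC CONTENTS reports the number of observations and variables and each variable's type; PROC MEANS with N and NMISS counts non-missing and missing values for the numeric variables.

```sas
/* Hypothetical data set built from the cross-sectional example slide. */
/* member is a string (character) variable; the rest are numeric.      */
data wages;
   input member $ age wage school;
   datalines;
John 40 100 14
Paul 34 110 17
Mary 28 75 10
Tom 30 130 16
Sara 37 50 15
;
run;

/* Number of observations, number of variables, and the type of each variable */
proc contents data=wages;
run;

/* N = count of non-missing values, NMISS = count of missing values */
proc means data=wages n nmiss;
run;
```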
How to store your dataset? Microsoft Excel Spreadsheets
1. What does SAS look like? [Screenshot of the SAS interface with labels: Editor window, Log window, Output window, Results window, Explorer window, Execute the program, New libraries]
Anatomy of a SAS Program
(1) Data name statement
(2) Input statement (list of all variables to be read into the program)
(3) Transformation statements
(4) Datalines statement (copy & paste from Excel)
(5) Placement of data
(6) PROC statements
– Means
– Corr
– Reg
– Model
– Autoreg
(7) Run statement
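The seven parts listed above can be sketched as one small program. This is a minimal illustration, not the program shown on the original slides: the data set name wages, the variable names, and the log transformation are assumptions, and the data are the cross-sectional example with wage in thousands. The OPTIONS NODATE line is likewise an assumption about how the date is suppressed in the output.

```sas
options nodate;                      /* assumed: suppresses the date on the output */

/* (1) data name statement */
data wages;
   /* (2) input statement: member is character ($), the others numeric */
   input member $ age wage school;
   /* (3) transformation statement (illustrative) */
   lwage = log(wage);
   /* (4) datalines statement, followed by (5) the data themselves */
   datalines;
John 40 100 14
Paul 34 110 17
Mary 28 75 10
Tom 30 130 16
Sara 37 50 15
;

/* (6) a PROC statement */
proc means data=wages;
   var age wage school lwage;
/* (7) run statement */
run;
```

The line containing only a semicolon after the last data row marks the end of the data.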
Need this statement after the data. No date will appear on the output.
Model Statement print
– Creation of a data set named datareg, which contains the predicted values of the dependent variable and the residuals
– Test of normality of the residuals
– autoreg also produces AIC, SIC, and within-sample MAE, MAPE, and RMSE
– Confidence intervals associated with the estimated coefficients
– Square of partial correlation coefficients
Statistics in SAS Use PROC MEANS or PROC CORR.
Proc Means Data = ??? N mean median std min max cv skewness kurtosis;
var var_name1 var_name2…;
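As a concrete instance of the template above, the following sketch requests the same statistics for a small hypothetical data set. The name wages and its variables are assumptions, with values from the cross-sectional example and wage in thousands. It then prints the correlation matrix with PROC CORR.

```sas
/* Hypothetical data set from the cross-sectional example */
data wages;
   input member $ age wage school;
   datalines;
John 40 100 14
Paul 34 110 17
Mary 28 75 10
Tom 30 130 16
Sara 37 50 15
;
run;

/* Descriptive statistics for the listed variables */
proc means data=wages n mean median std min max cv skewness kurtosis;
   var age wage school;
run;

/* Pearson correlations among the same variables */
proc corr data=wages;
   var age wage school;
run;
```

If the VAR statement is omitted, both procedures analyze every numeric variable in the data set.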
Regression in SAS Use PROC REG or PROC MODEL Simple and Multiple Regression
Using SAS PROC REG for Simple Linear Regression
The general syntax for PROC REG is
– PROC REG <options>; <statements>;
The most commonly used options are:
– DATA=datasetname Specifies dataset
– SIMPLE Displays descriptive statistics
The most commonly used statements are:
– MODEL dependentvar = independentvar ; Specifies the variable to be predicted (dependentvar) and the variable that is the predictor (independentvar)
Several MODEL options are available.
Example
Proc reg data = spaghettisauce;
Model qprego = pprego / P R cli dwprob;
run;
Confidence limits of parameter estimates. Square of partial correlation coefficients.
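One way to obtain these quantities in PROC REG, along with an output data set of predicted values and residuals and a normality test on the residuals, is sketched below. This is an assumption about how the pieces fit together, not the program from the original slides. The data set macro and the variable names gdp, inflation, gdp_hat, and resid are hypothetical, and the values come from the time-series example. CLB and PCORR2 are the MODEL options for confidence limits of the parameter estimates and squared partial correlation coefficients.

```sas
/* Hypothetical data set from the time-series example */
data macro;
   input year gdp inflation;
   datalines;
2004 34 3.2
2005 30 2.5
2006 37 2.7
2007 38 3
2008 41 2.9
2009 43 3.4
;
run;

proc reg data=macro;
   /* CLB: confidence limits for the parameter estimates     */
   /* PCORR2: squared partial correlation coefficients       */
   model gdp = inflation / clb pcorr2;
   /* Save predicted values and residuals in data set datareg */
   output out=datareg p=gdp_hat r=resid;
run;
quit;

/* NORMAL requests tests of normality for the residuals */
proc univariate data=datareg normal;
   var resid;
run;
```

PROC REG is interactive, so the QUIT statement ends the procedure before the next step starts.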
Using SAS PROC REG for Multiple Linear Regression
The general syntax for PROC REG is
– PROC REG <options>; <statements>;
The most commonly used options are:
– DATA=datasetname Specifies dataset
– SIMPLE Displays descriptive statistics
The most commonly used statements are:
– MODEL dependentvar = independentvars ; Specifies the variable to be predicted (dependentvar) and the variables that are the predictors (independentvars)
MODEL STATEMENT OPTIONS (Place after the slash following the list of explanatory variables.)
– P Requests a table containing predicted values from the model.
– R Requests that the residuals be analyzed.
– CLI Requests the 95 percent upper and lower confidence limits for an individual value of the dependent variable.
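A short sketch of a multiple regression using these options follows. The data set wages and its variable names are hypothetical, with values from the cross-sectional example and wage in thousands. With only five observations the estimates are not meaningful; the point is only where the options go.

```sas
/* Hypothetical data set from the cross-sectional example */
data wages;
   input member $ age wage school;
   datalines;
John 40 100 14
Paul 34 110 17
Mary 28 75 10
Tom 30 130 16
Sara 37 50 15
;
run;

/* SIMPLE prints descriptive statistics for the model variables */
proc reg data=wages simple;
   /* P: predicted values; R: residual analysis;                 */
   /* CLI: 95% confidence limits for an individual predicted value */
   model wage = age school / p r cli;
run;
quit;
```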