Lesson 3 Overview Descriptive Procedures Controlling SAS Output

PRINT, MEANS, UNIVARIATE, SGPLOT
Controlling SAS Output
Program 3 in course notes
LSB: See syllabus

Welcome to lesson 3. Up to now we have concentrated mainly on how to read our data into SAS. That is an important step; however, our ultimate purpose is to do some sort of analysis, whether calculating simple descriptive statistics like means or frequencies or performing more complex analyses like linear regression. Once we get the data in and create a SAS data set, we are ready to do this. We do analyses using procedures. There are many SAS procedures, each used to perform a certain type of analysis or task. In this lecture we will look at a few of the ones used most often for descriptive analyses, including some graphical methods. This is illustrated in program 3 of the course notes and is covered in the sections of the LSB (Little SAS Book) listed in the syllabus. We will also look at how to control the output displayed in the output window.

Descriptive Procedures In SAS
This is a list of the most commonly used descriptive procedures, some of which we have seen before. PROC PRINT displays values of variables. PROC MEANS displays summary statistics like the mean and standard deviation for numeric variables. PROC UNIVARIATE gives additional statistics beyond those of PROC MEANS and can display certain plots. PROC FREQ displays one-way and multi-way frequency distributions for categorical data. PROCs PLOT and CHART display X-Y plots and bar charts in text mode. PROCs GPLOT and GCHART produce high resolution versions of these plots. New in version 9.2 and up is the SGPLOT procedure, a single procedure that does most types of graphing. The examples I use in this class will use the new SGPLOT procedure, the S standing for statistical.

Syntax for Procedures
    PROC PROCNAME DATA=datasetname <options> ;
        substatements / <options> ;

The WHERE statement is a useful substatement available to all procedures.

    PROC FREQ DATA=demo ;
        TABLES marstat;
        WHERE state = 'MN';
    RUN;

Procedure calls have a common structure. The keyword PROC is followed by the name of the procedure, then the keyword DATA, an equals sign, and the dataset name. This is followed by various options that depend on the procedure. After any options comes a semicolon that ends the PROC statement. Under the PROC statement are one or more sub-statements that depend on the procedure. For example, VAR is a sub-statement for both the PRINT and MEANS procedures. Options on sub-statements are placed after a slash (/). The WHERE statement is a useful statement that can be used in all procedures. It filters the rows of the dataset on which the procedure operates. In the example here we display the variable marstat from the demo dataset only for observations where state equals Minnesota. If you forget the syntax for a procedure, you can look up the procedure in the SAS help.
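To make the structure concrete, here is a minimal sketch that can be run as is. It uses sashelp.class, a sample dataset shipped with SAS, as a stand-in for the demo dataset (an assumption, since demo itself is not shown here). NOCUM is a TABLES option chosen only to illustrate where a sub-statement option goes.

```sas
* Sketch only: sashelp.class stands in for the demo dataset;
PROC FREQ DATA=sashelp.class;
    TABLES age / NOCUM;      * sub-statement option goes after the slash;
    WHERE sex = 'F';         * only rows with sex equal to F are used;
    TITLE 'Age distribution, females only';
RUN;
```

The same WHERE statement could be placed under PROC PRINT or PROC MEANS without change.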

Data Layout of tomhs.dat
TOMHS Data Dictionary (website)

    Variable  Type  Len  Pos  Inform     Description
    PTID      Char  10   1    $10.       Patient ID
    CLINIC    Char  1    12   $1.        Clinical center
    RANDDATE  Num   6    14   mmddyy10.  Randdate
    SBPBL     Num                        SBP at baseline

    DATA tomhs;
        INFILE 'folderpath\tomhs.dat';
        INPUT ptid $10. @12 clinic $1. @14 randdate mmddyy10. @115 sbpbl 3. ;
    RUN;

Note: You can give any legal variable name.

Program 3
    DATA weight;
        INFILE 'C:\SAS_Files\tomhs.dat' ;
        INPUT ptid $10. @12 clinic $1. @30 sex 1. @58 height 4. @85 weight 5. ;
        * Create new variables here;
        bmi = (weight* )/(height*height);
        * BMI is calculated in kg/m2;
    RUN;

OK, let's look at program 3, which gives examples of running descriptive procedures. We start by creating a dataset called weight, reading in 5 variables from the TOMHS study file, tomhs.dat. The INFILE statement gives the complete path to the file. You will need to modify that depending on where you save the TOMHS dataset. We input the ID, clinical center, sex, height, and weight of the patient using the pointer informat method. The input positions can be obtained from the data layout in the course notes. We then compute a new variable called bmi, calculated as weight divided by height squared. The constant multiplying weight is the number that simultaneously converts inches into meters and pounds into kilograms, so that bmi is in kilograms per meter squared. This is a common measure of obesity. Note: The * notation indicates multiplication; the / indicates division.
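The conversion constant can be worked out from the unit definitions. The sketch below is not the course's program: it derives a factor from the exact definitions of the pound and the inch and applies it to one hypothetical patient. The dataset name bmi_check and the height and weight values are made up for illustration.

```sas
* Sketch: derive a pounds and inches to kg/m2 conversion factor;
DATA bmi_check;
    kg_per_lb = 0.45359237;                 * kilograms in one pound;
    m_per_in  = 0.0254;                     * meters in one inch;
    factor    = kg_per_lb / (m_per_in**2);  * roughly 703.07;
    height = 70;                            * hypothetical height in inches;
    weight = 180;                           * hypothetical weight in pounds;
    bmi = (weight * factor) / (height*height);
    PUT factor= bmi=;                       * writes both values to the log;
RUN;
```

Because this data step has no INPUT or SET statement, it executes once and writes a single observation.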

SAS Data Step: Built-in Loop
    DATA weight;
        INFILE 'C:\SAS_Files\tomhs.dat';   * EOF then stop;
        INPUT ptid $10. @12 clinic $1. @30 sex 1. @58 height 4. @85 weight 5. ;
        bmi = (weight* )/(height*height);
        OUTPUT;                            * Inserted by SAS;
    RUN;

    (The statements between DATA and RUN get repeated for each data row.)

Before we look at procedures, let's carefully go through how the data step works. The first thing you need to know is that there is a built-in loop in the data step: the code in the box gets executed for each row in the data file. The second thing is that there is an implied OUTPUT statement at the end of the loop. The OUTPUT statement writes the variable values to the SAS dataset. So if there are 100 rows in the file tomhs.dat, the loop will execute 100 times, eventually writing out 100 observations to the data set weight. The statements that control the loop are the INFILE and INPUT statements. As long as there is data remaining to read, the loop continues. When INPUT tries to read past the end of the file, SAS stops the data step and the loop ends.
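One way to watch the built-in loop is to print the automatic variable _N_, which counts the iterations of the data step. This is a minimal sketch using three made-up in-line data rows instead of tomhs.dat; the dataset name loopdemo and the variables x and y are hypothetical.

```sas
* Sketch: the data step body runs once per data row;
DATA loopdemo;
    INPUT x;             * reads the next data row;
    y = x * 2;
    PUT _N_= x= y=;      * one log line per pass through the loop;
    DATALINES;
10
20
30
;
RUN;
```

The log should show three lines, one per row, and loopdemo ends up with three observations even though no OUTPUT statement was written.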

    PROC PRINT DATA = weight (OBS=5);
        TITLE 'Proc Print: Five observations from the TOMHS Study';
    RUN;

    PROC MEANS DATA = weight;
        VAR height weight bmi;
        TITLE 'Proc Means Example 1';

    PROC MEANS DATA = weight MEAN MEDIAN STD MAXDEC=2;
        VAR height weight bmi;
        TITLE 'Proc Means Example 2 (specifying options)';
    RUN;

Page 258 of Little SAS Book (5th edition)
Also see online help under proc means

We now run several procedures on the dataset weight. Note that you can just "PROC away", i.e. once the dataset is created you can run multiple procedures in a row by listing them one under the other. Here we have one PROC PRINT followed by two PROC MEANS steps. The output generated in your output window will be in this same order and indexed in the results window. For the PROC PRINT we use the data set option OBS=, which limits the observations displayed, here set to 5. This option is available in all procedures but is mostly used with PROC PRINT to limit the output. Since there is no VAR statement under the PROC PRINT, all variables will be displayed. We add a TITLE, which is enclosed in quotes, and complete the procedure with the RUN statement. We follow this with a PROC MEANS that gives descriptive statistics for the 3 variables listed in VAR. The default statistics displayed are the number of non-missing values (N), the mean, the standard deviation, and the minimum and maximum values. The next PROC MEANS tells SAS to display only the mean, median, and standard deviation. The MAXDEC option limits the number of decimals displayed for each statistic to 2. These are all options that are part of the PROC MEANS statement. To find the entire list of statistics available, you can look at the referenced pages of the textbook or look under PROC MEANS in the SAS help.
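Other statistic keywords can be listed on the PROC MEANS statement in the same way. The sketch below requests quartiles and a count of missing values; it uses the sample dataset sashelp.class shipped with SAS rather than the weight dataset, so the variable names are those of that dataset, not of the TOMHS data.

```sas
* Sketch: requesting a different set of statistics from PROC MEANS;
PROC MEANS DATA=sashelp.class (OBS=10) N NMISS MIN Q1 MEDIAN Q3 MAX MAXDEC=1;
    VAR height weight;
    TITLE 'Quartiles for the first 10 observations';
RUN;
```

Here the OBS= data set option limits the analysis to the first 10 rows, which shows that it works outside PROC PRINT as well.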

Proc Print: Five observations from the TOMHS Study
    [PROC PRINT output: columns Obs, ptid, clinic, sex, height, weight, bmi; five rows, from clinics C, B, B, D, A]

Proc Means Example 1
The MEANS Procedure
    [PROC MEANS output: columns Variable, N, Mean, Std Dev, Minimum, Maximum; rows height, weight, bmi]

Here is the output from PROC PRINT and the first PROC MEANS that would be displayed in your output window. The titles used for each procedure are shown in blue. The average BMI for patients is just less than 29, ranging from 21.5 to 37.5.

Proc Means Example 2 (specifying options)
The MEANS Procedure
    [PROC MEANS output: columns Variable, Mean, Median, Std Dev; rows height, weight, bmi]

This is the result from the second PROC MEANS. Only the mean, median, and standard deviation are displayed, each to 2 decimals as given by the MAXDEC option.

OMITTING RUN STATEMENTS
    PROC PRINT DATA = weight (OBS=5);
    PROC MEANS DATA = weight;
        VAR height weight bmi;
    PROC MEANS DATA = weight MEAN MEDIAN;

THIS CODE WILL RUN THE FIRST TWO PROCEDURES BUT NOT THE LAST

I want to clarify the need for and function of RUN statements. A RUN statement tells SAS to "run" the code that precedes it. It is good practice to put a RUN statement after each procedure and after each data step. However, SAS will sometimes run the procedure anyway: if SAS encounters another PROC or DATA statement, it will insert a RUN statement for you. You will have a problem in PC SAS if you omit the RUN statement after your last procedure. You will not get any output from that procedure because SAS is waiting for a RUN statement. The code above would give you results from the first two procedures but not the last. If this happens, you can submit a single line of code containing just the RUN statement. This "flushes out" the output from the last procedure.

    PROC MEANS DATA = weight N MEAN STD MAXDEC=2 ;
        CLASS clinic;
        VAR height weight bmi;
        TITLE 'Proc Means Example 3 (Using a CLASS statement)';
    RUN;

    [Output: for each clinic (A, B, C, D), columns N Obs, Variable, N, Mean, Std Dev; rows height, weight, bmi]

If you want to display statistics separately for each level of a categorical variable, add a CLASS statement under PROC MEANS with the name of the categorical variable. Here we display summary statistics for height, weight, and BMI for each of the four clinical centers. The VAR and CLASS statements can be in either order, as is the case for most sub-statements under a procedure. Other typical class variables are gender, treatment group, and race. The FW= option stands for field width and controls the number of columns used for each statistic. This can be useful to squeeze more columns of statistics onto a page.
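The slide's code does not show where the FW= option goes, so here is a minimal sketch. It assumes the option is placed on the PROC MEANS statement alongside MAXDEC=, and it uses the sample dataset sashelp.class with sex as the class variable instead of the weight dataset.

```sas
* Sketch: CLASS statement plus FW= to narrow each statistic column;
PROC MEANS DATA=sashelp.class N MEAN STD MAXDEC=2 FW=8;
    CLASS sex;
    VAR height weight;
    TITLE 'CLASS statement with FW=8';
RUN;
```

Trying a few values of FW= is an easy way to see how it trades column width for the number of statistics that fit on a line.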

    * Adding WAYS statement to get totals and by clinic;
    PROC MEANS DATA = weight N MEAN STD MAXDEC=2;
        CLASS clinic;
        VAR height weight bmi;
        WAYS 0 1 ;
    RUN;

    [Output: first an overall table (N Obs, Variable, N, Mean, Std Dev for height, weight, bmi), then the same statistics for each clinic A, B, C, D]

Often you will want the statistics both by a class variable and in total. To get this you either run two procedures (one with the class variable and one without) or run one PROC MEANS with the WAYS statement as shown here. The value 0 requests the total and the value 1 requests each level of the class variable. If you have multiple class variables, there are various totals and subtotals, and the WAYS statement lets you pick the subtotals you want.
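With two class variables the WAYS values select how many class variables are combined at a time. The following sketch uses sashelp.class (a sample dataset shipped with SAS) with sex and age as class variables; these stand in for the course's variables and are an assumption for illustration only.

```sas
* Sketch: WAYS with two class variables;
PROC MEANS DATA=sashelp.class N MEAN MAXDEC=1;
    CLASS sex age;
    VAR height;
    WAYS 0 1 2;    * 0 = overall, 1 = by sex and by age separately, 2 = sex by age;
RUN;
```

Dropping the 2 from the WAYS statement would keep the overall and one-way summaries and omit the crossed sex-by-age table.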

    * Could also sort the data by clinic and then use BY statement;
    PROC SORT data=weight;
        BY clinic;
    PROC MEANS DATA = weight N MEAN STD MAXDEC=2 ;
        VAR height weight bmi;
        TITLE 'Proc Means Example 4 (Using a BY statement)';
        BY clinic;
    RUN;

    [Partial output: a separate table headed clinic=A, clinic=B, and so on, each with Variable, N, Mean, Std Dev for height, weight, bmi]

Another way to display statistics separately for each level of a variable is to first sort your data by that variable and then use a BY statement under the procedure. To sort a dataset, use PROC SORT with a BY statement as seen here. SAS simply rearranges the rows of the dataset. The BY statement can be used in any procedure; it tells SAS to run the procedure for each level of the BY variable. For PROC MEANS it is more efficient to use the CLASS statement, as you do not need to sort your data.
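A variation on the sort-then-BY pattern writes the sorted rows to a new dataset with the OUT= option, leaving the original untouched. This sketch uses sashelp.class, which is normally read-only, so OUT= is needed there; class_sorted is a hypothetical dataset name.

```sas
* Sketch: sort into a new dataset, then run PROC MEANS by group;
PROC SORT DATA=sashelp.class OUT=class_sorted;
    BY sex;
RUN;

PROC MEANS DATA=class_sorted N MEAN STD MAXDEC=2;
    BY sex;
    VAR height weight;
    TITLE 'BY-group statistics from a sorted copy';
RUN;
```

If the BY statement is used on data that are not sorted by the BY variable, SAS reports an error in the log rather than sorting for you.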

PROC UNIVARIATE
    PROC UNIVARIATE DATA = weight;
        VAR bmi;
        ID ptid;
        TITLE 'Proc Univariate Example 1';
    RUN;
    * Note: PROC UNIVARIATE will give you much output ;

The next procedure we will look at is PROC UNIVARIATE, which displays extensive statistics for numeric variables. The syntax is similar to that for PROC MEANS. Here we run PROC UNIVARIATE on the variable bmi, listed in the VAR statement. Among the many statistics displayed are the 5 highest and 5 lowest values of BMI. We use the ID statement to label these values with the patient ID. PROC UNIVARIATE produces a lot of output. If you had several variables in your VAR statement, you would end up with many pages of output.

Proc Univariate Example 1
The UNIVARIATE Procedure
Variable: bmi

    Moments: N, Sum Weights, Mean, Sum Observations, Std Deviation, Variance, Skewness, Kurtosis, Uncorrected SS, Corrected SS, Coeff Variation, Std Error Mean

    Basic Statistical Measures
        Location: Mean, Median, Mode
        Variability: Std Deviation, Variance, Range, Interquartile Range

    Tests for Location: Mu0=0
        Test          Statistic   p Value
        Student's t   t           Pr > |t|     <.0001
        Sign          M           Pr >= |M|    <.0001
        Signed Rank   S           Pr >= |S|    <.0001

Here is the first portion of the output. Note that the statistics are placed under categories (Moments, Basic Statistical Measures, Tests for Location). Many of these are the same as in PROC MEANS. You may not know the meaning of all of the statistics, but that is OK; just look for the information you need. The Tests for Location section provides tests of whether the population mean (or median) of the variable is 0. This would only be relevant for change variables, for example, the change in blood pressure after treatment.

Quantiles (Definition 5)
    Quantile levels listed with their estimates: 100% Max (37.5179), 99%, 95%, 90%, 75% Q3, 50% Median, 25% Q1, 10%, 5%, 1%, 0% Min

Extreme Observations
    [Five lowest and five highest values, each listed with Value, ptid, and Obs]

This section displays the rest of the output. Various quantiles (or percentiles) are given for the variable. For example, the row labeled 90% gives the 90th percentile of bmi. The last section is called Extreme Observations and lists the 5 lowest and 5 highest values. These are identified by the variable in the ID statement, here the patient ID. If you don't use an ID variable, only the observation number will be displayed. We see that the thinnest person is patient A00083 with a bmi of 21.5 and the heaviest person is patient B02059 with a bmi of 37.5. This section of output can be used to identify outliers. If you saw an unrealistic value for BMI displayed here, you could go back to the medical chart for the patient to see if there was a data entry error in the recorded height or weight. You can display more than 5 values by using the NEXTROBS= option on the PROC UNIVARIATE statement.
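As a small illustration of NEXTROBS=, the sketch below asks for the 3 lowest and 3 highest values instead of the default 5. It runs on the sample dataset sashelp.class, with name playing the role of the ID variable; both are stand-ins for the course's weight dataset and ptid.

```sas
* Sketch: change the number of extreme observations listed;
PROC UNIVARIATE DATA=sashelp.class NEXTROBS=3;
    VAR weight;
    ID name;      * labels each extreme value with the value of name;
    TITLE 'Three lowest and highest weights';
RUN;
```

A larger value such as NEXTROBS=10 works the same way when more candidates for outliers need checking.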

    * High resolution graphs can also be produced.
      The following makes a histogram and normal plot ;
    ODS GRAPHICS ON;
    PROC UNIVARIATE DATA = weight;
        VAR bmi;
        HISTOGRAM bmi / NORMAL MIDPOINTS=20 to 40 by 2;
        INSET N = 'N' (5.0) MEAN = 'Mean' (5.1) STD = 'Sdev' (5.1)
              MIN = 'Min' (5.1) MAX = 'Max' (5.1) / POS=NW HEADER='Summary Statistics';
        LABEL bmi = 'Body Mass Index (kg/m2)';
        TITLE 'Histogram of BMI';
        PROBPLOT bmi / NORMAL (MU=est SIGMA=est);
    RUN;

PROC UNIVARIATE can also be used to display high resolution histograms and normal probability plots. The ODS GRAPHICS ON statement turns on graphics for the procedure that follows. Plots specified in the UNIVARIATE procedure will be written to an external file in PNG format. They can also be viewed by clicking the appropriate link in the results window. To produce a histogram, use the HISTOGRAM statement. The keyword HISTOGRAM is followed by the name of the variable, followed by options, if any, after a slash (/). Here we produce a histogram for bmi with a normal curve superimposed on the plot. For the X-axis we order values from 20 to 40 with bars of width 2 using the MIDPOINTS option. The INSET statement inserts statistics for bmi on the same plot. The POS option here tells SAS to put the statistics in the north-west part of the plot area. Look under the documentation for PROC UNIVARIATE for several examples of using the INSET statement. The PROBPLOT statement produces a high-resolution normal probability plot. The MU and SIGMA options tell SAS to estimate the mean and standard deviation from the data. Needless to say, some of this syntax is difficult to remember. However, once you have an example that works, you can use it as a template the next time you want to make a histogram.

Here are the two plots generated
The histogram, in combination with the summary statistics in the plot, produces a nice summary of the variable. We note the bimodal pattern for BMI. The probability plot tells a similar story, in perhaps a less intuitive way. If the data came exactly from a normal distribution, the data points would form a straight line rather than the curved line seen here.

PROC SGPLOT plot statements: HISTOGRAM, DENSITY, VBOX (HBOX), SCATTER, SERIES, REG, STEP, HBAR (VBAR)
    * PROC SGPLOT can do several types of plots;
    PROC SGPLOT;
        HISTOGRAM bmi;
        DENSITY bmi / TYPE=NORMAL;
        DENSITY bmi / TYPE=KERNEL;
        YAXIS GRID;
        TITLE 'HISTOGRAM of BMI';
    RUN;

You can also use the SGPLOT procedure to create histograms. However, you will not get the statistics displayed inside the graphical area. On the right side of this slide is a list of several plots that can be displayed using PROC SGPLOT. We will see examples of these in this and upcoming lessons. We use the DENSITY statement with TYPE=NORMAL to superimpose the best-fit normal distribution on the histogram. TYPE=KERNEL fits a kernel density estimate, a smooth empirical curve, to the data. Note: SAS keeps adding the various pieces you supply to the overall plot. SAS first draws a histogram, then a normal density plot, followed by a kernel density plot.

    * PROC SGPLOT can do several types of plots - here a boxplot;
    PROC SGPLOT;
        HBOX bmi;
        XAXIS GRID;
        TITLE 'Boxplot of BMI';
    RUN;

    [Figure: horizontal box plot of bmi, with the 25th percentile, median, and 75th percentile labeled]

Here we use PROC SGPLOT with the HBOX statement (horizontal box plot) to display a high resolution box plot. There is a similar VBOX statement. Side-by-side boxplots are often a useful way to compare distributions for different groups; these can be produced with the CATEGORY= option. We will see that in the next example.

    * Using SGPLOT to make side-by-side boxplots;
    PROC SGPLOT;
        TITLE "Boxplot of BMI for Men and Women";
        HBOX bmi / CATEGORY=sex;
    RUN;

Here we show how to create box plots separately for men and women, displayed on the same plot. The keyword for the boxplot is HBOX; the CATEGORY= option is where you put the variable for which you want to make separate plots. We see here that women have, on average, lower BMIs than men and also have more variability than men, as shown by the width of the box, which is the distance between the first and third quartiles.

    * Formatting plot;
    PROC FORMAT;
        VALUE gender 1='Men' 2='Women';
    RUN;

    PROC SGPLOT;
        TITLE "Boxplot of BMI by Gender";
        HBOX bmi / CATEGORY=sex;
        LABEL sex = 'Gender';
        LABEL bmi = 'BMI (kg/m2)';
        FORMAT sex gender. ;
    RUN;

Here we make the same side-by-side box plot easier to read. PROC FORMAT defines a format called gender that displays the value 1 as Men and 2 as Women, and the FORMAT statement under PROC SGPLOT applies it to the variable sex. The two LABEL statements replace the variable names sex and bmi with descriptive labels on the plot.

    * Using SGPLOT to make scatter plot;
    PROC SGPLOT;
        TITLE "Weight vs Height";
        SCATTER X=height Y=weight;
    RUN;

Here we show how to create a scatter plot. The keyword is SCATTER followed by the X and Y variables.

    * Using SGPLOT to add regression line;
    PROC SGPLOT;
        TITLE "Weight vs Height";
        REG X=height Y=weight;
    RUN;

Here we show how to create a regression plot. The keyword is REG followed by the X and Y variables.
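The REG statement accepts options after a slash like other sub-statements. As a sketch, the example below adds confidence limits for the mean with the CLM option and sets axis labels. It uses sashelp.class, a sample dataset shipped with SAS, rather than the weight dataset, so it can be run without tomhs.dat.

```sas
* Sketch: regression line with confidence limits for the mean;
PROC SGPLOT DATA=sashelp.class;
    TITLE 'Weight vs Height with confidence limits';
    REG X=height Y=weight / CLM;    * CLM adds a confidence band around the line;
    XAXIS LABEL='Height';
    YAXIS LABEL='Weight';
RUN;
```

The REG statement draws the data points as well as the fitted line, so a separate SCATTER statement is not needed here.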

    * With the Output Delivery System you can selectively include only portions of the output;
    ODS TRACE ON / LISTING;
    * Lists the names of the pieces of output to the output window (need to add this option);
    PROC UNIVARIATE DATA = weight ;
        VAR bmi;
        TITLE 'Proc Univariate Example 1';
    RUN;

We have seen that PROC UNIVARIATE generates a lot of output. What if we only wanted to display portions of the output? With what is called the Output Delivery System, or ODS, we can do that. Each piece of output has a name, and we can tell SAS to display only output with these names. How do we know the names of the pieces of output? We could look them up in a manual or the SAS help. However, SAS has an ODS statement that will display the names above the output. The statement is ODS TRACE ON. With the LISTING option the names appear above the output in the output window. The ODS statement is placed before the procedure as seen here.

Output Window
    Output Added:
    -------------
    Name:     Moments
    Label:    Moments
    Template: base.univariate.Moments
    Path:     Univariate.bmi.Moments

    Moments: N, Sum Weights, Mean, Sum Observations, Std Deviation, Variance, Skewness, Kurtosis, Uncorrected SS, Corrected SS, Coeff Variation, Std Error Mean

Here is the first part of the univariate output with the ODS trace turned on. The output name is given above the output as seen here. The name of this piece is Moments. We could have guessed that from the heading; however, the heading is not always the same as the output name. A Label, Template, and Path are also displayed. We can ignore those for our purpose. The other pieces of output can be identified similarly.

    * This will restrict output to BasicMeasures and Quantiles tables;
    ODS TRACE OFF;
    ODS SELECT BasicMeasures Quantiles;
    PROC UNIVARIATE DATA = weight ;
        VAR bmi;
    RUN;

Once you know the names of the pieces of output, you can go back to your program and select only the pieces you want displayed. This is done with the ODS SELECT statement, followed by the names of the output we want displayed. Here we specify that only the BasicMeasures and Quantiles tables should be displayed. We will want to turn off the trace before running this procedure. We do this with the ODS TRACE OFF statement.
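The same idea works with other output names, and ODS EXCLUDE does the reverse by dropping named pieces and keeping the rest. The sketch below uses the sample dataset sashelp.class so it can be run without the TOMHS file; the table names Moments, ExtremeObs, and TestsForLocation are the PROC UNIVARIATE names that ODS TRACE reports.

```sas
* Sketch: keep only two named tables;
ODS SELECT Moments ExtremeObs;
PROC UNIVARIATE DATA=sashelp.class;
    VAR height;
    ID name;
RUN;

* Sketch: show everything except one named table;
ODS EXCLUDE TestsForLocation;
PROC UNIVARIATE DATA=sashelp.class;
    VAR height;
RUN;
```

Each ODS SELECT or ODS EXCLUDE statement is placed immediately before the procedure it should affect, as in the slide's example.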

LIMITING SAS OUTPUT
Variable: bmi

    Basic Statistical Measures
        Location: Mean, Median, Mode
        Variability: Std Deviation, Variance, Range, Interquartile Range

    Quantiles (Definition 5)
        Quantile levels listed with their estimates: 100% Max, 99%, 95%, 90%, 75% Q3, 50% Median, 25% Q1, 10%, 5%, 1%, 0% Min

Here is the output from PROC UNIVARIATE with the ODS SELECT in effect. We see that only the two parts of the output requested are displayed.

Reading SAS Dataset
    DATA weight;
        INFILE 'C:\SAS_Files\tomhs.dat' ;
        INPUT ptid $10. @12 clinic $1. @30 sex 1. @58 height 4. @85 weight 5. ;
        bmi = (weight* )/(height*height);
        * BMI is calculated in kg/m2;
    RUN;

    DATA weight2;
        SET weight (KEEP = ptid clinic sex bmi);
        WHERE clinic = 'A';
    RUN;

Before we conclude this lesson, I want to show you briefly how you can use the data step to read a SAS data set. The first data step above is the same as before, reading raw data to create a SAS data set called weight. What if we wanted to create another SAS data set based on weight? This is illustrated in the second data step. Here we create a new SAS data set called weight2 based on data set weight. The key statement to read a SAS data set is SET, which reads in the data set specified. The KEEP= option brings in only the listed variables, and the WHERE statement brings in only rows where clinic = 'A'. We will look much more at working with SAS data sets later in the course, but I wanted to illustrate this early on to show you the basic setup.
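Here is a version of the SET pattern that can be run without tomhs.dat. It is only a sketch: sashelp.class (a sample dataset shipped with SAS) stands in for weight, and the dataset name girls and the variable ratio are hypothetical.

```sas
* Sketch: create a new SAS data set from an existing one;
DATA girls;
    SET sashelp.class (KEEP = name sex height weight);   * bring in 4 variables;
    WHERE sex = 'F';                                     * bring in only these rows;
    ratio = weight / height;                             * new variable computed per row;
RUN;

PROC PRINT DATA=girls (OBS=5);
    TITLE 'First five rows of the new data set';
RUN;
```

Note that the WHERE variable must be among the variables kept, since the filter is applied to the data as it is read in.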
