Lesson 8 - Topics Creating SAS datasets from procedures


1 Lesson 8 - Topics
  Creating SAS datasets from procedures
  Creating reports using data-step and PROC TABULATE
  Using PROC RANK
  Programs in course notes
  LSB 4:11;13-17;5:3

Welcome to lesson 8. In this lesson we will see how you can create SAS datasets from procedures and how to use ODS and some data step techniques we have learned to create customized reports. We will also look at PROC RANK, a utility procedure that creates new variables holding the ranks or quantile groups of existing variables.

2 Making SAS Datasets From Procedures
  Output from SAS PROCs can be put into SAS datasets:
    To do further processing of the information from the output
    To reformat output to make a report

One of the nice features of SAS is that you can take pieces of output from procedures and send them to a SAS dataset. Why would you want to do this? Putting the output into a SAS dataset allows you to do further processing of the output using the DATA step. Rows of output will be observations and columns of output will be variables. You can then do anything you could do with any dataset, such as creating new variables and running procedures. This "massaging" of the output can then be used to create a customized report. We will also see how procedure output datasets can be used to restructure our original dataset and create new variables.

3 Ways to Put Output into SAS Datasets
  Using the OUTPUT statement, available in many procedures
  Using the ODS OUTPUT statement: any output table can be put into a SAS dataset

There are two ways to make a dataset from output. The first is the OUTPUT statement that is available in many procedures. The second is with ODS, using the ODS OUTPUT statement. You may remember from a previous session that each piece of output has a table name. With the ODS OUTPUT statement you can place that output table in a SAS dataset. The examples in the programs to follow will illustrate each method.
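As a rough side-by-side of the two methods, here is a minimal sketch. It is not one of the course programs: it assumes the sample dataset sashelp.class that ships with SAS, and the dataset names stats_out and stats_ods and the variable name mean_wt are made up for illustration. Summary is the ODS table name that PROC MEANS uses for its statistics table.

```sas
* Method 1 - OUTPUT statement inside the procedure (hypothetical names);
PROC MEANS DATA = sashelp.class NOPRINT;
  VAR weight;
  OUTPUT OUT = stats_out N = n MEAN = mean_wt;
RUN;

* Method 2 - ODS OUTPUT captures a named output table;
ODS OUTPUT Summary = stats_ods;
PROC MEANS DATA = sashelp.class N MEAN;
  VAR weight;
RUN;

* Print both datasets without a VAR statement to see every variable;
PROC PRINT DATA = stats_out; RUN;
PROC PRINT DATA = stats_ods; RUN;
```

The two datasets hold the same statistics but are laid out differently, which is why it helps to print each one before using it.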

4 Report We Want to Generate
  Quartiles of Weight Change by Clinical Center

  Clinic    N    P25    P50    P75
  A
  B
  C
  D
  Total
  [Cell values are not preserved in this transcript.]

Suppose we want to create the following report based on the TOMHS data. The report gives, for each clinic and for all clinics combined, statistics on the distribution of weight change: the count and the 25th, 50th, and 75th percentiles. Think about how we could get this information from SAS. We know we can get percentiles from PROC UNIVARIATE, and we could get them separately for each clinic using a CLASS statement or a BY statement. However, a lot of output would be generated and it would not be displayed in this way. Let's see how we can generate this report by creating a dataset using the OUTPUT statement in PROC MEANS.

5 Program 14
  LIBNAME class 'C:\SAS_Files';
  * Will use SAS dataset version of TOMHS data;
  DATA wt;
    SET class.tomhs (KEEP=ptid clinic wtbl wt12);
    wtchg = wt12 - wtbl;
  RUN;

This is the first portion of program 14. The DATA step creates a dataset called wt, reading in data from the SAS dataset version of the TOMHS data (class.tomhs). We bring in the variables ptid, clinic, wtbl, and wt12 (the baseline and 12-month weights). We then compute the change in weight from baseline to 12 months.

6 * Create report by clinic using OUTPUT;
  PROC MEANS DATA = wt NOPRINT;
    CLASS clinic;
    VAR wtchg;
    OUTPUT OUT=summary N = n Q1 = p25 MEDIAN = p50 Q3 = p75;
  RUN;

  OUT=summary: name of new dataset
  N = n, Q1 = p25, MEDIAN = p50, Q3 = p75: statistic name = variable name
  Dataset summary will have one observation for each clinic and the total.

We run PROC MEANS on the weight change variable wtchg, with clinic as a CLASS variable. We use the OUTPUT statement within the procedure to create a SAS dataset containing the statistics we want. The syntax is the keyword OUTPUT, followed by the keyword OUT and the name of the dataset we are creating (summary here). This is followed by the statistics we want (using the keyword of each statistic) and the variable names we assign to the statistics. Here we tell SAS to output to the dataset called summary the N and the three quartiles (25th, 50th, and 75th percentiles) for the variable wtchg and to give these four statistics the names n, p25, p50, and p75. The dataset summary will have one observation for each clinic plus one for all clinics combined, 5 in all. To see what the dataset looks like we will run a PROC PRINT on it. Note that although the OUTPUT statement is rather long it is just one statement, i.e. there is only one semicolon. You may have also noticed the NOPRINT option on the PROC statement. This tells SAS not to send the standard output to the output window. We will be getting the information we need from the output dataset.

7 PROC PRINT DATA = summary; RUN;
  [Output listing: columns Obs, clinic, _TYPE_, _FREQ_, n, p25, p50, p75, with one row for each of clinics A, B, C, and D and one row for all clinics combined. Values are not preserved in this transcript.]

Here is the PROC PRINT statement and the output. It is usually a good idea to run a PROC PRINT after creating an output dataset so you can see what was actually created. Sometimes SAS will add variables as well (here _TYPE_ and _FREQ_), so run the PROC PRINT without a VAR statement so all variables will be displayed, as we do here. We see that this listing is pretty much the report we wanted. There is one row for each clinic, plus a row for all clinics combined, with the statistics we want. To finish the report we will move the total row to the bottom and label it, drop the extra variables, remove the Obs column, and display the percentile variables to one decimal place.

8 * Put total row at the bottom;
  PROC SORT;
    BY DESCENDING _type_ clinic;
  RUN;
  PROC PRINT;
  RUN;
  [Output listing: same columns as before, now with clinics A, B, C, and D first and the total row last. Values are not preserved in this transcript.]

In the output dataset the row for all clinics combined has _TYPE_ = 0 and the clinic rows have _TYPE_ = 1. Sorting by descending _TYPE_ and then by clinic therefore keeps the clinic rows in order and moves the total row to the bottom.

9 * Create final report;
  DATA summary;
    LENGTH clinic $5;
    SET summary;
    IF MISSING(clinic) THEN clinic = 'Total';
    DROP _type_ _freq_;
  RUN;
  PROC PRINT NOOBS;
    FORMAT p25 p50 p75 6.1;
  RUN;
  [Output listing: columns clinic, n, p25, p50, p75, with rows A, B, C, D, and Total. Values are not preserved in this transcript.]

This DATA step finishes the report. The LENGTH statement comes before the SET statement so that clinic is long enough to hold the text 'Total'. The total row, which has a missing value for clinic, is given the label 'Total', and the variables _TYPE_ and _FREQ_ are dropped. In PROC PRINT the NOOBS option removes the Obs column, and the FORMAT statement displays the percentile variables to one decimal place.
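The whole sequence from slides 6 to 9 can be tried without the TOMHS data. The sketch below is a stand-in under stated assumptions: it uses the sample dataset sashelp.class, with sex playing the role of clinic and weight the role of wtchg, and the dataset name report is made up for illustration.

```sas
* Quartiles of weight by sex, plus an overall row (stand-in data);
PROC MEANS DATA = sashelp.class NOPRINT;
  CLASS sex;
  VAR weight;
  OUTPUT OUT = summary N = n Q1 = p25 MEDIAN = p50 Q3 = p75;
RUN;

* The overall row has _TYPE_ = 0 so a descending sort moves it last;
PROC SORT DATA = summary;
  BY DESCENDING _type_ sex;
RUN;

* Widen the character variable before SET so it can hold the label;
DATA report;
  LENGTH sex $5;
  SET summary;
  IF MISSING(sex) THEN sex = 'Total';
  DROP _type_ _freq_;
RUN;

PROC PRINT DATA = report NOOBS;
  FORMAT p25 p50 p75 6.1;
RUN;
```

Writing the result to a new dataset (report) rather than back to summary is only a convenience here: it leaves the PROC MEANS output intact if you want to rerun the last two steps.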

10 Using ODS to Send Output to a SAS Dataset
  Syntax: ODS OUTPUT output-table = new-data-set;

  * Output quantile table to a dataset;
  ODS OUTPUT quantiles = qwt;
  PROC UNIVARIATE DATA = wt;
    VAR wtbl wt12;
  RUN;
  ODS OUTPUT CLOSE;
  PROC PRINT DATA=qwt;
  RUN;

The more general method of putting output into a SAS dataset is with the ODS OUTPUT statement. The syntax is ODS OUTPUT followed by the output table name, an equals sign, and the name of the dataset you want to hold the output table. The output table name must correspond to a table produced by the procedure you will call. Here we will be running PROC UNIVARIATE on two variables, weight at baseline and weight at 12 months, and we want to put the quantile table into a SAS dataset. We will name the dataset qwt. We then run the univariate procedure as usual and follow the RUN statement with an ODS OUTPUT CLOSE statement. This captures the output into our new dataset qwt. To see what we get, we run a PROC PRINT on the new dataset.
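If you do not remember the name of the output table you want, ODS TRACE will write the table names to the log. The following is a minimal sketch, not part of the course programs; it assumes the sample dataset sashelp.class, so the variables are height and weight rather than the TOMHS weights.

```sas
* List the output table names in the log;
ODS TRACE ON;
PROC UNIVARIATE DATA = sashelp.class;
  VAR weight;
RUN;
ODS TRACE OFF;

* Capture the Quantiles table for two variables;
ODS OUTPUT Quantiles = qwt;
PROC UNIVARIATE DATA = sashelp.class;
  VAR height weight;
RUN;

PROC PRINT DATA = qwt;
RUN;
```

The log from the first step should list Quantiles among the table names, which is the name used on the ODS OUTPUT statement in the second step.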

11 Display of Output Dataset
  [Listing of qwt: 22 rows with columns Obs, VarName, Quantile, and Estimate. Rows 1-11 are for wtbl and rows 12-22 are for wt12; within each variable the rows are 100% Max, 99%, 95%, 90%, 75% Q3, 50% Median, 25% Q1, 10%, 5%, 1%, and 0% Min. The estimate values are not preserved in this transcript.]
  Would like to put side-by-side

This is the display of the data. We get two sections of 11 rows each, one for weight at baseline and one for weight at 12 months, with the name of the statistic and the value as variables. This report might be good enough, but we might like to put the baseline and 12-month weight data together on the same rows. We know how to do this from methods we have used before to restructure datasets. Let's see how this is done here.

12 Separate the data into 2 datasets
  * Separate the data into 2 datasets;
  DATA wtbl wt12;
    SET qwt;
    IF varname = 'wtbl' THEN OUTPUT wtbl;
    ELSE IF varname = 'wt12' THEN OUTPUT wt12;
  RUN;
  * PROC DATASETS used for changing variable names;
  PROC DATASETS;
    MODIFY wtbl;
    RENAME estimate = wtbl;
    MODIFY wt12;
    RENAME estimate = wt12;
  QUIT;
  * Put 2 datasets side-by-side. Note: no BY statement, OK here;
  DATA all;
    MERGE wtbl wt12;
    DROP varname;
  RUN;
  PROC PRINT;
  RUN;

We will use the method we have used earlier, where we create separate datasets for certain rows and then merge them together to get them on the same rows. Here we create a dataset for the 11 rows containing the weight statistics at baseline and a dataset for the 11 rows containing the weight statistics at 12 months. We conditionally output the rows based on the variable VARNAME. We then use PROC DATASETS to change the name of the variable estimate, which contains the percentile statistics, to wtbl or wt12. The last DATA step uses the MERGE statement to put the two datasets together. No BY statement is needed because both datasets have the same 11 rows in the same order. We name this combined dataset ALL. The PROC PRINT will show us what our new dataset looks like.
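The same side-by-side layout can also be reached in a single DATA step by reading the quantile dataset twice with the WHERE= and RENAME= dataset options. This is a different technique from the split, PROC DATASETS, and merge sequence on the slide, offered only as a sketch. It assumes the sample dataset sashelp.class (so the two variables are height and weight), and the dataset name both is made up for illustration.

```sas
* Build the quantile dataset from stand-in data;
ODS OUTPUT Quantiles = qwt;
PROC UNIVARIATE DATA = sashelp.class;
  VAR height weight;
RUN;

* Read qwt twice - once per variable - renaming estimate on the way in;
* As on the slide there is no BY statement so rows are paired in order;
DATA both;
  MERGE qwt (WHERE = (UPCASE(varname) = 'HEIGHT') RENAME = (estimate = height))
        qwt (WHERE = (UPCASE(varname) = 'WEIGHT') RENAME = (estimate = weight));
  DROP varname;
RUN;

PROC PRINT DATA = both;
RUN;
```

Like the version on the slide, this relies on both sets of rows being in the same quantile order, which holds here because they come from the same output table.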

13
  Obs  Quantile      wtbl     wt12
    1  100% Max    279.30   271.50
    2  99%         274.15   271.50
  [Rows 3-11 (95%, 90%, 75% Q3, 50% Median, 25% Q1, 10%, 5%, 1%, 0% Min) are not preserved in this transcript.]

This new dataset ALL has 11 rows, with the statistics for each weight variable as separate variables. Now we can easily compare any percentile for the two time periods. Getting to know a few data step "tricks" can be useful because they can be used over and over again to reformat the data to produce the desired report.

14 PROC RANK
  Used to divide observations into equal size categories based on values of a variable
  Creates a new variable containing the categories
  New variable is added to the dataset or to a new dataset
  Example: Divide weight change into 5 equal categories (quintiles)

When investigating the relationship of a continuous independent variable to a dependent variable it is often desirable to divide the independent variable into categories and then see how the dependent variable changes across the categories. Suppose you want to investigate the relationship between change in weight and change in blood cholesterol. One analysis you could do is divide people into categories based on their weight change and compare the average cholesterol change across the weight change groups. Sometimes you may have specific weight change categories of interest, for example, weight loss > 10 lbs, weight loss 1-10 lbs, weight gain 1-10 lbs, and weight gain > 10 lbs. However, in some cases you have no pre-specified categories. Then you may just want to divide persons into categories of equal size. To do this you could run PROC UNIVARIATE on weight change, look at the quantiles to determine the cutoff levels, and then go back and create a new variable using IF/THEN logic. However, SAS has a utility procedure that can do that for you automatically. The procedure is called PROC RANK. You simply tell SAS the name of the variable you want to form groups for and how many levels you want. SAS will then compute a new variable containing the categories and add it to the dataset.

15 PROC RANK SYNTAX
  PROC RANK DATA = dataset OUT = outdataset GROUPS = # of categories;
    VAR varname;
    RANKS newvarname;

  Most of the time you can set OUT to be the same dataset specified in DATA.
  PROC RANK writes no output

Here is the general syntax for calling PROC RANK. DATA= is the input dataset. The OUT option is the output dataset that will contain the new variable or variables. Since this dataset will also include all variables in the original dataset, you can usually specify the OUT dataset to be the same as the input dataset specified in DATA. GROUPS is set to the number of categories you want to divide the variables into. GROUPS=5 would create quintiles. In VAR you list the continuous variables for which you want to create new categorical variables, and RANKS is the list of names you want to give the new variables. Since PROC RANK is just creating a dataset, no output will go to the output window.

16 PROGRAM 15
  LIBNAME class 'C:\SAS_Files';
  DATA wtchol;
    SET class.tomhs (KEEP=ptid wtbl wt12 cholbl chol12);
    wtchg = wt12 - wtbl;
    cholchg = chol12 - cholbl;
  RUN;
  *This PROC will add a new variable to dataset which is the tertile of weight change. The new variable will be 0,1,or 2;
  PROC RANK DATA = wtchol GROUPS=3 OUT = wtchol;
    VAR wtchg;
    RANKS twtchg;
  RUN;

  RANKS twtchg: name of new variable

We will see how this works in program 15. We start by creating a dataset called wtchol, reading in weight and cholesterol variables from the TOMHS SAS dataset. We then compute new variables for weight and cholesterol change, 12 month minus baseline. Negative values will indicate decreases and positive values will indicate increases. The RUN statement ends the DATA step. We then run PROC RANK. We want to divide the weight change variable into three equal size groups. The new variable will be called twtchg (T for tertile). The output dataset will be the same as the input dataset. It will now contain one new variable.
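To experiment with PROC RANK without the TOMHS data, a minimal sketch along the same lines follows. It assumes the sample dataset sashelp.class, and the names ranked and wt_tertile are made up for illustration. Because the SASHELP library is normally read-only, the output goes to a new WORK dataset instead of back to the input dataset.

```sas
* Divide weight into three groups of roughly equal size (0 1 2);
PROC RANK DATA = sashelp.class GROUPS = 3 OUT = ranked;
  VAR weight;
  RANKS wt_tertile;
RUN;

* Check how many observations fall in each group;
PROC FREQ DATA = ranked;
  TABLES wt_tertile;
RUN;

* The MIN and MAX of weight within each group show the cutpoints;
PROC MEANS DATA = ranked N MIN MAX MAXDEC = 1;
  CLASS wt_tertile;
  VAR weight;
RUN;
```

Changing GROUPS = 3 to GROUPS = 5 would give quintiles coded 0 through 4.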

17 PARTIAL LOG
  8    DATA wtchol;
  9    SET class.tomhsp (KEEP=ptid clinic sex wtbl wt12 cholbl chol12);
  10   wtchg = wt12 - wtbl;
  11   cholchg = chol12 - cholbl;
  12   RUN;
  NOTE: There were 100 observations read from the data set CLASS.TOMHSP.
  NOTE: The data set WORK.WTCHOL has 100 observations and 9 variables.
  19   PROC RANK DATA = wtchol GROUPS=3 OUT = wtchol;
  20   VAR wtchg; RANKS twtchg;
  21   RUN;
  NOTE: The data set WORK.WTCHOL has 100 observations and 10 variables.

Here is a partial log from running the program. The DATA step creates the dataset wtchol, which has 100 observations and 9 variables. The note after the PROC RANK states that there are now 10 variables in the dataset. The new variable is the one that PROC RANK created, twtchg.

18
  PROC FREQ DATA = wtchol;
    TABLES twtchg;
  RUN;

  OUTPUT:
  [Frequency table titled "Rank for Variable wtchg": columns twtchg, Frequency, Percent, Cumulative Frequency, Cumulative Percent, with one row for each value of twtchg. Values are not preserved in this transcript.]
  Frequency Missing = 8

To check the values of the new variable we display a frequency distribution using PROC FREQ. We note that twtchg takes on three values, 0, 1, and 2, and that each contains about one third of the data. Values of 0 indicate the lowest third of weight change, values of 1 the middle third, and values of 2 the upper third. There are 8 missing values; these are persons with missing weight change.

19 Partial Listing of Dataset wtchol with new variable added
  PROC PRINT DATA = wtchol (obs=20);
    VAR ptid wtchg twtchg;
    TITLE 'Partial Listing of Dataset wtchol with new variable added';
  RUN;

  [Output listing: columns Obs, PTID, wtchg, twtchg, rows 1-10 shown. Values are not preserved in this transcript.]

We then do a PROC PRINT displaying the original and ranked weight change variables. We limit the display to 20 observations; 10 are shown here. We note the first patient lost 12 pounds. This put him/her in the middle weight change category. Patient A00354 lost 21 pounds; this person is in the lowest category of weight change, which indicates the greatest weight loss.

20
  PROC MEANS N MEAN MIN MAX MAXDEC=2;
    VAR cholchg wtchg;
    CLASS twtchg;
    TITLE 'Mean Cholesterol Change by Tertile of Weight Change';
  RUN;

We now want to display the average change in cholesterol by the weight change categories. We do that with PROC MEANS with a class variable. We include in the VAR list cholesterol change and the original weight change variable (variable wtchg). The latter variable will display information so that we know the cutpoints used to define the 3 levels of weight change.

21 Mean Cholesterol Change by Tertile of Weight Change
  [PROC MEANS output: for each level of twtchg (Rank for Variable wtchg), the N Obs and the N, Mean, Minimum, and Maximum of cholchg and wtchg. Values are not preserved in this transcript.]
  Maximum of wtchg: cutpoints for tertiles
  Could graph this data in an x-y plot (3 points)

Here is the output from PROC MEANS. We see that the mean drop in serum cholesterol is greatest in the greatest weight loss category. The drop in cholesterol for the middle weight change category is smaller (4.70 mg/dl), and for the upper weight change category the drop is just 0.74 mg/dl. So we confirm a direct relationship between weight change and cholesterol change. The MAX values for variable wtchg in the lowest and middle categories tell us the cutoffs for the three levels of weight change. A simple summary of the relationship could be done by plotting the three points noted here, with mean weight change on the X-axis and mean cholesterol change on the Y-axis. You could add standard error bars to the plot if you like.
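One way to produce the three-point plot is to send the group means to a dataset and plot that dataset. The sketch below is only an illustration under stated assumptions: it uses the sample dataset sashelp.heart, so it groups on weight and cholesterol levels rather than on the TOMHS change variables, and the names heart, wt_tertile, pts, mean_wt, and mean_chol are made up.

```sas
* Tertiles of weight in the stand-in data;
PROC RANK DATA = sashelp.heart GROUPS = 3 OUT = heart;
  VAR weight;
  RANKS wt_tertile;
RUN;

* NWAY keeps only the rows for the three groups and drops the overall row;
PROC MEANS DATA = heart NOPRINT NWAY;
  CLASS wt_tertile;
  VAR weight cholesterol;
  OUTPUT OUT = pts MEAN = mean_wt mean_chol;
RUN;

* One point per tertile;
PROC SGPLOT DATA = pts;
  SERIES X = mean_wt Y = mean_chol / MARKERS;
  XAXIS LABEL = 'Mean weight';
  YAXIS LABEL = 'Mean cholesterol';
RUN;
```

With the TOMHS data the same steps would apply to wtchg and cholchg, with twtchg as the CLASS variable.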

22 TABLE GENERATION: PROC TABULATE
  [Example table: the rows are the treatment groups (GROUP) and a total row (ALL); the columns are the N and mean of dbp12 and sbp12, from (dbp12 sbp12)*(N MEAN*f=8.1). The table itself is not preserved in this transcript.]

SAS has a procedure called PROC TABULATE that can be used to generate a table of descriptive summaries. The output from, say, PROC MEANS or PROC UNIVARIATE may give you the information you want, but it isn't put together in a nice table as you may want. PROC TABULATE can be used to generate various formatted tables. Here is an example of one. A table is made up of rows and columns. Here the rows are each treatment group in TOMHS along with the total; the columns are the N and MEAN for diastolic and systolic blood pressure. You could get this information from PROC MEANS with a CLASS statement but the output would not be organized so nicely. Let's look at the PROC TABULATE syntax to produce this table.

23
  PROC TABULATE DATA=class.tomhs FORMAT=8.0;
    CLASS group;
    VAR sbp12 dbp12;
    TABLES group ALL='Total', (dbp12 sbp12)*(N MEAN*f=8.1)/RTS=20;
    LABEL dbp12 = 'Diastolic BP';
    LABEL sbp12 = 'Systolic BP';
    LABEL group = 'RX Group';
    FORMAT group fgroup.;
    TITLE 'Average Blood Pressure at 12-Months';
  RUN;

  CLASS and VAR: same as PROC MEANS

Note first that the CLASS and VAR statements in PROC TABULATE are the same as you would use in PROC MEANS. What follows is the TABLES statement. This can be a bit tricky to follow but let's give it a try. Take consolation that there are entire classes and books on how to use PROC TABULATE, so don't be too discouraged if you don't get it all the first time. Remember from the output on the previous slide that the row information related to treatment group in TOMHS, the variable group. That is placed first, followed by the keyword ALL to give the total. The part in quotes is the label for the total. A comma is then typed, followed by the column information you want: here the N and MEAN for each of the variables dbp12 and sbp12. The f=8.1 tells SAS to display the mean in a column of 8 characters with one decimal. The RTS option sets the number of spaces in the first column (where only labels are placed and not any data). We add LABEL statements for each variable and apply a format for group so that the formatted values are displayed rather than the values 1-6.

24 Closer Look At TABLES Statement
  TABLES group ALL='Total', (dbp12 sbp12)*(N MEAN*f=8.1)/RTS=20;

  Statement before comma indicates row information to display
  Statement after comma indicates column information to display
  A * indicates to crosstabulate data
  A space indicates to concatenate data
  Words: For each group and the total display the N and mean of diastolic and systolic BP

Let's take a closer look at the TABLES statement. Remember tables have rows and columns. The code before the comma is the row information; the code after the comma is the column information. There are two important characters in the TABLES statement, the space and the asterisk (*). The * indicates to crosstabulate the information. A space means to concatenate the information. In English, the TABLES statement is telling SAS: for each group and the total, display the N and mean of diastolic and systolic BP. I encourage you to take this program and first run it as is to produce the output. Then make some changes to the TABLES statement. Take away the ALL keyword and see what happens, or add a statistic such as the standard deviation to the columns.
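For practice away from the TOMHS data, the same TABLE layout can be tried on a sample dataset. This is a minimal sketch, assuming sashelp.class, with sex standing in for group and height and weight standing in for the blood pressure variables. It also adds the standard deviation (STD) to the columns, as suggested in the exercise above.

```sas
* Rows are each sex plus a total row;
* Columns are N - mean - standard deviation for each variable;
PROC TABULATE DATA = sashelp.class FORMAT = 8.0;
  CLASS sex;
  VAR height weight;
  TABLE sex ALL = 'Total',
        (height weight) * (N MEAN*f=8.1 STD*f=8.2) / RTS = 20;
  LABEL sex = 'Sex';
  TITLE 'Height and Weight by Sex';
RUN;
```

Removing ALL = 'Total' from the row expression drops the total row, which makes the role of the space (concatenation) easy to see.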

25
  [Example table: the rows are the clinical centers (CLINIC) and a total row (ALL); the columns are the N and row percent for each sex, from (sex=' ')*(N ROWPCTN*f=10.1). The table itself is not preserved in this transcript.]

Here is one more example using the tabulate procedure. Back in program 4 we used PROC FREQ to display a crosstabulation of clinical center and gender. All the information was there in the output but it was kind of hard to find because there were so many numbers. Here is the crosstabulation using PROC TABULATE, where the table is formatted to display selected information more clearly. The row information is the clinical center (plus the total); the column information is gender. The numbers in the cells are the counts and the row percentages, that is, the percent of men and women in each clinic. Note that the percentages in each row add to 100%. Let's look at the TABULATE code that generated the table.

26
  PROC TABULATE DATA=class.tomhsp FORMAT=8.;
    CLASS clinic sex;
    TABLE (clinic ALL='Total'), (sex=' ')*(N ROWPCTN*f=10.1)/RTS=15;
    FORMAT sex sex. clinic $clinic.;
    LABEL clinic = 'Clinical Center';
    KEYLABEL ROWPCTN = 'Percent';
    TITLE 'N and Percent Men and Women Enrolled by Center';
  RUN;

  For HTML output, place before the procedure:
  ODS HTML FILE = 'mytable.html';

Here there is no VAR statement, only a CLASS statement with the two variables clinic and sex. The row portion of the TABLE statement is just the variable clinic with the keyword ALL; the column portion crosses the variable sex with two statistics, N and the row percent (ROWPCTN). The blank text in quotes after the variable sex tells SAS not to include a label for sex. Note also the KEYLABEL statement. This allows us to set the text for the row percent. To see what this statement does, simply remove it, run the program, and see how the output differs. If you want the output to be in HTML format, include the ODS HTML statement before the procedure.
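The same kind of count and row percent table can be tried on a sample dataset. The sketch below is a stand-in, assuming sashelp.cars, with origin in the role of clinic and drivetrain in the role of sex. The file name mytable.html is taken from the slide; where that file is written depends on your SAS session, so treat the path as an assumption and adjust it if your environment does not allow writing to the current directory.

```sas
* Send the table to an HTML file as well;
ODS HTML FILE = 'mytable.html';

* Rows are each origin plus a total and columns are N and row percent;
PROC TABULATE DATA = sashelp.cars FORMAT = 8.;
  CLASS origin drivetrain;
  TABLE (origin ALL = 'Total'),
        (drivetrain = ' ') * (N ROWPCTN*f=10.1) / RTS = 15;
  LABEL origin = 'Origin';
  KEYLABEL ROWPCTN = 'Percent';
  TITLE 'N and Percent of Cars by Drive Train and Origin';
RUN;

* Close the HTML destination so the file is complete;
ODS HTML CLOSE;
```

Removing the KEYLABEL statement and rerunning shows the default column heading that SAS uses for ROWPCTN.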

