Typical biostatistics tasks

Design a study (experimental design / clinical trials)
  SAS power calculation: GLMPOWER and POWER procedures
Collect data
  Format / code book / safety / run the experiment, etc.
Manage data in SAS
  Data input and manipulation: SAS data step, SAS SQL
Analyze data
  SAS/STAT procedures
Interpret results
  Biostatistics and others

Input Data
Reading in data that has a fixed format directly
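As a minimal sketch of reading fixed-format data directly, the data step below uses column input. The data set name patient_fixed, the variable names, the column positions, and the three data lines are all made up for illustration; they do not come from the course data.

```sas
/* Hypothetical fixed-width layout (an assumption for this sketch):
   id in columns 1-3, gender in column 5,
   age in columns 7-8, weight in columns 10-14. */
data patient_fixed;
  input id 1-3 gender $ 5 age 7-8 weight 10-14;
  datalines;
001 M 64 180.5
002 F 58 142.0
003 M 71 .
;
run;

/* List the data to confirm each field landed in the right variable. */
proc print data=patient_fixed;
run;
```

To read the same layout from an external file, the datalines block would be replaced by an infile statement naming the file.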

Import data to SAS
Importing through the menus is very useful and convenient.
You can also save the menu-driven import steps to a SAS file and run that file later to save time. The saved code uses the SAS procedure IMPORT.

Importing Microsoft Excel files
The import procedure (proc import) can read a variety of popular file formats. We focus only on the most popular file type read with this procedure, a Microsoft Excel file. If our data are stored in a Microsoft Excel file, we read the file in with the following code:
PROC IMPORT OUT=WORK.d
    DATAFILE="C:\yourfile.xls"
    DBMS=EXCEL REPLACE;
  RANGE="Sheet1$";
  GETNAMES=YES;
  MIXED=NO;
  SCANTEXT=YES;
  USEDATE=YES;
  SCANTIME=YES;
RUN;

Verify the data input / clean up
After reading or importing the data, print the first and last 10 rows and compare them with the original data.
When the data set is very large (for example, bioinformatics data), verification is especially important.
Data usually contain errors, and you need to find them:
  Missing values? An Excel entry typed as I2 (the letter I followed by 2) is read as missing, not as 12.
  Yes, YES, and Y are the same to a human but different values to SAS.
  Negative values (for example, age)
  Out-of-range values
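The checks above can be sketched as follows. This assumes the imported data set is WORK.d, as in the proc import example; the variable names age and answer are hypothetical stand-ins for a numeric variable and a Yes/No variable.

```sas
/* First 10 rows */
proc print data=work.d (obs=10);
run;

/* Store the number of observations in the macro variable nobs.
   The set statement is never executed; nobs= is filled at compile time. */
data _null_;
  if 0 then set work.d nobs=n;
  call symputx('nobs', n);
  stop;
run;

/* Last 10 rows (or all rows if there are fewer than 10) */
proc print data=work.d (firstobs=%sysfunc(max(1, %eval(&nobs - 9))));
run;

/* Missing, negative, and out-of-range values show up in n, nmiss, min, max.
   age is a hypothetical variable name. */
proc means data=work.d n nmiss min max;
  var age;
run;

/* Inconsistent codes such as Yes, YES, and Y appear as separate levels.
   answer is a hypothetical variable name. */
proc freq data=work.d;
  tables answer / missing;
run;
```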

Customized format
proc format;
  value racefmt 1='Caucasian'
                2='African American'
                3='Other';
  value $genderfmt 'M'='Male'
                   'F'='Female';
  value agefmt low-59='50ish or younger'
               60-69='60ish'
               70-79='70ish'
               80-high='80ish or older';
run;

Applying the format
We have defined our formats, but we have yet to apply them to our variables. We may do this in a data step or in each SAS procedure. To apply the formats to the variables race and gender within a data step, we use the statement
  format gender $genderfmt. race racefmt.;
Note the required period after the name of each format. The format will be associated with the variables in all later procedures within the program. We now examine the output from proc freq after having formatted the variables:

Applying the format
The format statement may also be included within a set of procedure statements, but then the format applies only to that particular procedure.
proc freq data=mydir.patientdata;
  tables age;
  format age agefmt.;
run;
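For comparison, the data-step form of the format statement might look like the sketch below. It assumes the formats defined with proc format already exist in the session and that mydir.patientdata contains gender (character) and race (numeric); the output data set name patientdata_fmt is hypothetical.

```sas
/* Attach the formats permanently to the variables in a new data set.
   patientdata_fmt is a hypothetical name chosen for this sketch. */
data patientdata_fmt;
  set mydir.patientdata;
  format gender $genderfmt. race racefmt.;
run;

/* No format statement is needed here: the formats travel with the data set. */
proc freq data=patientdata_fmt;
  tables gender race;
run;
```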

Data Analysis – Step One: descriptive
For continuous variables
Basic statistics: mean, std, Q1, Q3, median, etc.
proc means
  Options: statistic keywords (mean, std, Q1, Q3, median, etc.)
  Statements: var, by, class
proc sort
  Statement: by
proc univariate
  Option: plot
  Statement: histogram / kernel

proc means data=patientdata nmiss q1 median q3 min max range qrange maxdec=1;
  var height weight;
run;

output

More descriptive statistics in SAS
The univariate procedure supplies the user with many of the quantities needed for a descriptive analysis of the data. We now analyze the variable age contained in our data set:
proc univariate data=patientdata plot;
  var age;
run;

Density estimator
Without going into the fine mathematical details, we may also consider using a kernel density estimator to summarize the distribution of the data. This statistical technique rests on the idea that the population the data come from is not made up of rectangles, as a histogram suggests, but follows a smooth curve. The following code superimposes a density estimate on top of a histogram. The color option within the parentheses changes the default color of the curve to black.
proc univariate data=patientdata noprint;
  var age;
  histogram / kernel(color=black);
run;

a kernel density estimator

Summarizing categorical data
The summarization of categorical data usually consists of computing the frequency distribution. This consists of the possible data values observed in the data set for a particular variable, the frequency with which each value was observed, and the relative frequency. For ordinal data it is often useful to examine the cumulative frequency and cumulative relative frequency. This is usually not sensible for nominal data.

Summarizing categorical data in SAS
The freq procedure is used to generate a frequency distribution table for discrete or categorical data. To obtain the frequencies, relative frequencies, cumulative frequencies, and cumulative relative frequencies for both the gender and race variables, we may use the following SAS statements:
proc freq data=patientdata;
  tables gender race;
run;

proc freq
This is an extremely powerful procedure in SAS, and using it to generate a frequency distribution table is only the tip of the iceberg in terms of its capabilities.

Data Analysis – Step 2: Marginal effects
Usually, a study has one or more outcomes y1, y2, … and several predictors x1, x2, x3, …
One wants to find the marginal effects, x_i vs. y_j.
We consider the association between two variables, x and y:
  x categorical, y categorical: chi-square test
  x categorical, y continuous: t-test or ANOVA
  x continuous, y continuous: correlation
  x continuous, y categorical: logistic regression
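The pairings above map onto SAS procedures roughly as in the sketch below. The data set study and the variables group (two-level categorical), outcome (binary, coded 0/1), x and y (continuous) are hypothetical names used only to show the shape of each call.

```sas
/* x categorical, y categorical: chi-square test */
proc freq data=study;
  tables group*outcome / chisq;
run;

/* x categorical (two levels), y continuous: two-sample t-test */
proc ttest data=study;
  class group;
  var y;
run;

/* x continuous, y continuous: correlation */
proc corr data=study;
  var x y;
run;

/* x continuous, y binary: logistic regression,
   modeling the probability that outcome equals 1 */
proc logistic data=study;
  model outcome(event='1') = x;
run;
```

With more than two groups, the t-test would be replaced by an ANOVA procedure.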

Data Analysis – Step 3: Multivariate modeling
Regression
Logistic regression
General linear model
Generalized linear model
Non-linear model
Mixed model

Randomized trials
Program: Control, Experiment
Outcome: loss (continuous)
Statistical question: test whether the losses are the same between the two groups
Method: t-test

Two-sample inference in SAS, independent samples
We use the ttest procedure for this test. The var statement again lists the variable(s) containing the sample data, and the procedure performs the analysis for every variable listed there. The class statement is where we list the variable containing the group assignments for the data points in the variable(s) listed in the var statement. Therefore, proc ttest requires that the data be stacked for independent two-sample inference.

proc ttest data=rehab_stacked;
  class program;
  var los;
run;
The output is below.

The confidence level of the confidence intervals seen above may be changed with the alpha option in the proc ttest statement. The h0 option, also introduced in our discussion of the one-sample t-test, may be used to change the null value of the difference between the means from its default of zero. The above output contains a large number of results. There are some basic statistics (the mean and standard deviation with their corresponding confidence intervals, the standard error, and the minimum and maximum) for each separate group. There is an estimate of the difference between the population means, the corresponding confidence interval (assuming equal variances), the pooled standard deviation and its corresponding confidence interval (details omitted), and the standard error of the difference between the two sample means. Next come the results of both hypothesis tests discussed for testing equality of the population means. Finally, there is the hypothesis test for equality of the variances.
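As a small illustration of the two options, the call below requests 90 percent confidence intervals and states the null difference explicitly. The values 0.10 and 0 are chosen only for the example.

```sas
/* alpha=0.10 gives 90% confidence intervals;
   h0=0 is the default null difference, written out here for clarity. */
proc ttest data=rehab_stacked alpha=0.10 h0=0;
  class program;
  var los;
run;
```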

We see that SAS took the difference to be the mean of the control group minus the mean of the experimental group.
SAS uses the alphabetic/numeric order of the two levels of the variable listed in the class statement. If formats are used for this variable, it uses the alphabetic/numeric order of the formatted levels. We see that the hypothesis of equal variances is rejected (p-value=0.0189), suggesting that the more reliable p-value for testing equality of the means is the one corresponding to unequal variances (p-value=0.0357). Note that the confidence interval for the difference between the two population means when not assuming the variances to be equal is not given in the output.

Inference about µ1 and µ2, paired samples
Paired samples consist of each observation from population 1 having a corresponding observation from population 2. When pairing is employed, you compare the populations under more homogeneous conditions. Knowing the data are paired is additional information which should be used in the analysis. Let us consider a study of the effect of a physical fitness program targeting young teens. To evaluate the program, a group of 10 subjects took a physical fitness test before and after taking part in the program. Researchers expect the test scores to increase overall as a result of the program.

A listing of the data, contained in the data set fitness, is as follows:
Obs before after
Due to the nature of the study, there would be dependence between the observations in the two columns. Studying this dependence is not an objective here, and statistical techniques to aid us in doing so will be covered later. Here we will incorporate this dependence information into our analysis.

To compare the two groups descriptively, we construct some boxplots of the data.
data fitness_stacked;
  set fitness;
  length time $ 6;
  /* write one row per subject and time point */
  score = before; time = 'before'; output;
  score = after;  time = 'after';  output;
run;
proc sort data=fitness_stacked;
  by time;
run;
proc boxplot data=fitness_stacked;
  plot score*time;
run;

From the above, there does appear to be an increase in the scores after the intervention and an increase in the variability.

To perform paired inference, we again use proc ttest.
proc ttest data=fitness;
  paired before*after;
run;
Instead of all the data points being contained in one SAS variable, as was the case for independent-sample inference, here we give the procedure two variables, each containing the data for one of the two groups. SAS assumes that paired observations are in the same row of the two variables.
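The paired analysis is equivalent to a one-sample t-test on the within-subject differences, which the sketch below makes explicit. The data set name fitness_diff and the variable name diff are hypothetical.

```sas
/* Compute the within-subject difference, before minus after,
   matching the order used in the paired statement. */
data fitness_diff;
  set fitness;
  diff = before - after;
run;

/* One-sample t-test of mean difference = 0;
   the t statistic and p-value match the paired analysis. */
proc ttest data=fitness_diff h0=0;
  var diff;
run;
```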

We see that there is no evidence to suggest the populations are different (p-value = 0.4076).

Comparing two groups in regards to a binary variable
Often in applied statistics, we are interested in comparing the probability of some event across two groups. Suppose that, as a secondary objective of our study, we want to see whether the response to treatment differs between those who have a certain genetic marker (marker = “1”) and those who do not (marker = “0”). That is, we are interested in comparing $\pi_1$ and $\pi_2$, the probabilities of response among those with and without the marker, respectively. To summarize the relationship between these two variables, we create what is termed a contingency table, using proc freq.
proc freq data=cancer_efficacy;
  tables response*marker;
run;

In the output below there are four numbers within each cell.
The first number is the frequency observed in the data for observations whose response equals the row label and whose marker equals the column label. The second number is the overall percent, which is the first number divided by the sample size. The third number is the row percentage, which is the percentage of patients in that row who fall in that cell. The fourth number is the column percentage, which is the percentage of patients in that column who fall in that cell.

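Marginal frequencies and probabilities, which contain information about each variable separately, are also available in the above table. The estimates of $\pi_1$ and $\pi_2$ are the sample proportions of responders within each marker group; with response as the row variable and marker as the column variable, these are column percentages.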

The difference between proportions
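We may consider a variety of functions in comparing these two estimates. The simplest is the difference, which is $\hat{\pi}_1 - \hat{\pi}_2$. This is an estimate of the true difference between the probabilities, $\pi_1 - \pi_2$. It stands to reason that we would want to quantify the variability associated with our estimate, which may be done using a confidence interval. With $n_1$ and $n_2$ denoting the two group sample sizes, an approximate $100(1-\alpha)$ percent confidence interval for $\pi_1 - \pi_2$ is given by
$$\hat{\pi}_1 - \hat{\pi}_2 \pm z_{\alpha/2}\sqrt{\frac{\hat{\pi}_1(1-\hat{\pi}_1)}{n_1} + \frac{\hat{\pi}_2(1-\hat{\pi}_2)}{n_2}}$$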

The above is also only asymptotically valid and should not be used unless the sample size in each group is sufficiently large.
To request such a confidence interval from SAS, we add the riskdiff option to the tables statement.
proc freq data=cancer_efficacy;
  tables marker*response / riskdiff;
run;
The added output is below. Note that when computing this interval, we must make sure the variable which defines our groups is represented by the rows. It is for this reason that we switched the variables marker and response in the code above. The analysis is done both for the event represented by the first column and for the event represented by the second column.
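To make the large-sample interval for a difference in proportions concrete, the data step below computes it by hand. The counts x1, n1, x2, n2 are hypothetical and are not taken from the cancer_efficacy data; probit is the SAS standard normal quantile function.

```sas
data wald_ci;
  /* Hypothetical counts: x responders out of n patients in each group */
  x1 = 12; n1 = 20;
  x2 = 5;  n2 = 20;
  alpha = 0.05;

  p1 = x1 / n1;                 /* estimated probability, group 1 */
  p2 = x2 / n2;                 /* estimated probability, group 2 */
  diff = p1 - p2;               /* estimated difference */

  /* large-sample standard error of the difference */
  se = sqrt(p1*(1 - p1)/n1 + p2*(1 - p2)/n2);

  z = probit(1 - alpha/2);      /* standard normal critical value */
  lower = diff - z*se;
  upper = diff + z*se;
run;

proc print data=wald_ci noobs;
  var p1 p2 diff se lower upper;
run;
```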

We see the confidence interval (-0.75, -0.03) corresponds to $\pi_2 - \pi_1$. To get the confidence interval for $\pi_1 - \pi_2$, we multiply both confidence limits by -1 and swap them: (0.03, 0.75).

The relative risk and odds ratio
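As an alternative to the simple difference between proportions, we may consider the relative risk (RR), which is defined as the ratio of the probabilities of the event of interest in the two groups: $RR = \pi_1/\pi_2$. We see that $RR = 1$ when $\pi_1 = \pi_2$, $RR > 1$ when $\pi_1 > \pi_2$, and $RR < 1$ when $\pi_1 < \pi_2$.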

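Risk and Odds
Instead of working with probabilities, we may work with the odds of an event, that is, we may consider $\text{odds} = \pi/(1-\pi)$. Since $\pi$ takes on values from 0 to 1, the odds take values from 0 to $\infty$. Note that the odds equal 1 when $\pi = 0.5$. Also, $\pi = \text{odds}/(1+\text{odds})$.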

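Odds Ratio
The odds ratio (OR) is simply defined as the ratio of the odds of the event of interest in the two groups: $OR = \dfrac{\pi_1/(1-\pi_1)}{\pi_2/(1-\pi_2)}$. Both parameters, RR and OR, take on values from 0 to $\infty$. Our interpretation of the OR is the same as that of the RR in regards to which probability is higher: $OR = 1$ when $\pi_1 = \pi_2$, $OR > 1$ when $\pi_1 > \pi_2$, and $OR < 1$ when $\pi_1 < \pi_2$.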

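Estimators of the relative risk and odds ratio may be obtained by simply substituting the estimators in place of the parameters $\pi_1$ and $\pi_2$. We have $\widehat{RR} = \hat{\pi}_1/\hat{\pi}_2$ and $\widehat{OR} = \dfrac{\hat{\pi}_1/(1-\hat{\pi}_1)}{\hat{\pi}_2/(1-\hat{\pi}_2)}$. Note that $OR = RR \times \dfrac{1-\pi_2}{1-\pi_1}$. For rare events (low probability), $OR \approx RR$. For this reason many interpret the OR as they would the RR.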
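The plug-in estimators, and the closeness of the two measures for a rare event, can be checked with a short data step. The counts are hypothetical, chosen so that the event is rare in both groups; the variable names rel_risk and odds_ratio are likewise made up for this sketch.

```sas
data rr_or;
  /* Hypothetical counts: x events out of n subjects in each group */
  x1 = 3; n1 = 100;
  x2 = 1; n2 = 100;

  p1 = x1 / n1;
  p2 = x2 / n2;

  rel_risk   = p1 / p2;                           /* estimated RR */
  odds_ratio = (p1/(1 - p1)) / (p2/(1 - p2));     /* estimated OR */

  /* The OR equals the RR times (1-p2)/(1-p1); this factor is near 1
     when both probabilities are small. */
  factor = (1 - p2) / (1 - p1);
run;

proc print data=rr_or noobs;
  var p1 p2 rel_risk odds_ratio factor;
run;
```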

Confidence intervals are available for both parameters, RR and OR.
We omit the details of these intervals, but they are only asymptotically valid, and so one should check the same sample size criterion as was used for the confidence interval of $\pi_1 - \pi_2$. To compute the estimates of the RR and OR and their corresponding confidence intervals, we start with the code below.
proc freq data=cancer_efficacy;
  tables marker*response / relrisk;
run;
Again, the groups must be defined by the row variable.

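We note that from the above we have estimates and corresponding confidence intervals for the odds ratio and for the relative risk of each column event.
The odds ratio SAS computed is for the column 1 event. Since, mathematically, exchanging both the two rows and the two columns of a 2×2 table leaves the odds ratio unchanged, we see that the desired output is contained in the above for the odds ratio.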

To obtain the relative risk estimate we desire, we must switch the order of the rows, which is done using proc sort and the order option in the proc freq statement.
proc sort data=cancer_efficacy;
  by descending marker;
run;
proc freq data=cancer_efficacy order=data;
  tables marker*response / relrisk;
run;
Setting the order option equal to data causes SAS to use the actual order of the values observed in the data set to determine the order of the row and column levels.

We see we obtained a relative risk of 2.06 and its confidence interval, (1.04, 4.08). Unfortunately, the validity of the confidence intervals for the difference, the relative risk, and the odds ratio is questionable, since the sample size is very small in one of the cells. For the odds ratio, we do have an exact version which does not need a large sample size to be valid. To obtain this we again use the exact statement with the keyword or.
proc freq data=cancer_efficacy order=data;
  tables marker*response;
  exact or;
run;
The added output is as follows:

Testing for association between two categorical variables
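Instead of simple estimation of quantities comparing $\pi_1$ and $\pi_2$, many scientific problems are answered via a test of the hypotheses $H_0: \pi_1 = \pi_2$ versus $H_a: \pi_1 \neq \pi_2$. In our specific example, $\pi_1$ and $\pi_2$ are the probabilities of response for marker = 1 and marker = 0. These hypotheses may equivalently be written as $H_0: P(\text{response} \mid \text{marker}=1) = P(\text{response} \mid \text{marker}=0)$ versus $H_a: P(\text{response} \mid \text{marker}=1) \neq P(\text{response} \mid \text{marker}=0)$.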

In writing the hypotheses as has been done in the two sets above, we see the hypothesis test concerning the proportions is simply a statistical comparison of the conditional distributions of one variable given another variable. Alternatively, we may consider the test as a test of association between two categorical variables: $H_0$: the two variables are independent, versus $H_a$: the two variables are associated.

This test still requires a large enough sample size, as the first test introduced does.
We use the chisq option in the tables statement.
proc freq data=cancer_efficacy;
  tables marker*response / chisq;
run;

The results of the two tests are found in the first and third rows of the subset of output found below. Note that the degrees of freedom in our example come from the fact that r=c=2, and so (r-1)(c-1)=(2-1)(2-1)=1. We see that SAS automatically checks the sample size requirements for us and warns us if they are not met. Since there appears to be a sample size requirement violation, we need to consider an alternative procedure to test for association.
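The text here does not name the alternative procedure it has in mind. One commonly used choice when expected cell counts are small is Fisher's exact test, which proc freq provides through the fisher option; the sketch below assumes that is an acceptable alternative for this table.

```sas
/* Fisher's exact test does not rely on a large-sample approximation.
   For a 2x2 table, the chisq option already prints it as well. */
proc freq data=cancer_efficacy;
  tables marker*response / fisher;
run;
```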

Correlation and regression
Continuous outcome Continuous predictors
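A minimal sketch for a continuous outcome with a continuous predictor is given below. It reuses the height and weight variables from the patientdata examples; treating weight as the outcome and height as the predictor is an assumption made only for illustration.

```sas
/* Pearson correlation between the two continuous variables */
proc corr data=patientdata;
  var height weight;
run;

/* Simple linear regression of weight on height */
proc reg data=patientdata;
  model weight = height;
run;
quit;
```

The quit statement ends proc reg, which otherwise stays active waiting for further statements.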
