# Descriptive Statistics Summer Program Brian Healy.

What have we learned so far?

- What biostatistics is
- The role of a biostatistician
- How to input data into R
- Simple R functions

What are we doing today?

- Types of data
- Summary statistics
- Measures of central tendency
- Tables
- Graphs
- How to do all of these things in R

Big picture

- When we first want to describe a data set, or summarize a large data set using graphs or tables, we have several tools available:
  - Summary statistics: a single number or set of numbers that describes the entire data set
  - Frequency table: a table showing the number of members in each of a set of specific groups
  - Graphs: a picture showing characteristics of the data, usually focusing on one or more aspects of the data set
- The best way to use these methods depends on the type of data you have

Tables and graphs

- The graphs and tables are the most important part of any scientific paper or presentation, because they are what people are most likely to pay attention to and remember. They also allow a large amount of data to be summarized in a small space.
- Statistical papers are somewhat different

Types of data

- The first thing to notice about a variable is what kind of variable it is.
- Nominal: Blond hair=1, Brown hair=2, Red hair=3
  - Definition: Values fall into unordered classes
  - Dichotomous: Only 2 outcomes (male and female)
- Ordinal: Mild=1, Moderate=2, Severe=3
  - Definition: Values fall into ordered classes, but magnitude has no meaning
- Discrete: Number of deaths in states in USA
  - Definition: Takes on specific values, and the magnitude and order are important
  - Often considered continuous in analyses, but conclusions can be misleading
- Continuous: Height and weight
  - Definition: Any value within a range is possible

Summary statistics

- Definition: a single number or group of numbers that describes an entire data set
  - Example: Ages of class: `class<-read.table("class.dat", header=T)`, then `age<-class[,3]`
  - Maximum:
  - Minimum:
  - Range:
- Each of these provides information about the entire group in one number
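As a minimal sketch of these three summary statistics, the snippet below uses a made-up vector of ages. The values are illustrative only and are not the class data read from `class.dat`.

```r
# Hypothetical ages, used only for illustration (not the real class data)
age <- c(22, 25, 23, 31, 24, 27, 22, 29)

age_max <- max(age)        # largest value
age_min <- min(age)        # smallest value
age_range <- range(age)    # returns both the minimum and the maximum

print(age_max)
print(age_min)
print(age_range)
```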

Measures of location

- Measures of the location of a distribution (measures of central tendency)
  - Mean: the sum of the values divided by the number of values, $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
  - Median: the middle value
  - Mode: the most common value
- Example: Ages of class
  - Mean:
  - Median:
  - Mode:
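The three measures of location can be computed along the following lines. The ages are made up for illustration, and because base R has no built-in function for the statistical mode, the mode is taken here from a frequency count, which is one possible approach among several.

```r
# Hypothetical ages, used only for illustration (not the real class data)
age <- c(22, 25, 23, 31, 24, 27, 22, 29)

age_mean <- mean(age)
age_median <- median(age)

# mode() in R reports the storage type, not the most common value,
# so count each value and keep the most frequent one(s).
counts <- table(age)
age_mode <- as.numeric(names(counts)[counts == max(counts)])

print(c(mean = age_mean, median = age_median))
print(age_mode)
```

If several values tie for the highest count, `age_mode` contains all of them.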

What happens if we have outliers?

- Each measure of central tendency is appropriate in certain circumstances
- Outlier: an extreme observation
  - May be important to understand the full picture: rare toxicity
  - May be an error in data entry or have some other cause, and be better ignored
  - Mean: very sensitive
  - Median: less sensitive, more robust

Computing summary stats in R

- Question: What is the average high temperature in Boston in August?
- `data<-c(89, 77, 54, 80, 87, 92, 93, 83, 86)`
- `mean(data)`
- `median(data)`
- Which better describes the data?
- What are explanations for the outlier? Should we include this data point?
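One way to explore the question about the outlier is to recompute both statistics with the lowest reading removed and compare how far each one moves. This is only a sketch of that comparison; whether the point should actually be dropped depends on why it is so low.

```r
data <- c(89, 77, 54, 80, 87, 92, 93, 83, 86)

mean_all <- mean(data)
median_all <- median(data)

# Drop the lowest reading (54) to see how much each statistic moves
trimmed <- data[data != min(data)]
mean_trimmed <- mean(trimmed)
median_trimmed <- median(trimmed)

cat("mean:  ", mean_all, "->", mean_trimmed, "\n")
cat("median:", median_all, "->", median_trimmed, "\n")
```

The mean shifts by several degrees while the median moves by half a degree, which illustrates the sensitivity of the mean to a single extreme value.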

What about the mean and median of ordinal and nominal data?

- For our nominal data example, we used Blond hair=1, Brown hair=2, Red hair=3
  - Data set: 1, 2, 2, 2, 3, 1, 2, 1, 2, 2
  - Mean: 1.8
  - Median: 2
  - Do these summary statistics have any meaning in this case?
- For our ordinal data example, Mild=1, Moderate=2, Severe=3
  - Data set: 1, 1, 3, 1, 1, 1, 1, 2, 2, 2, 1
  - Mean: 1.455
  - Median: 1
  - Do these have more meaning than the previous? What must we be careful of?
- How could we describe each of these types of data better?
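One possible answer to the last question is to store such variables as factors and describe them with counts rather than means. The sketch below uses the two data sets from this slide; the variable names `hair` and `effect` are chosen here for illustration.

```r
hair <- factor(c(1, 2, 2, 2, 3, 1, 2, 1, 2, 2),
               levels = 1:3, labels = c("Blond", "Brown", "Red"))
effect <- factor(c(1, 1, 3, 1, 1, 1, 1, 2, 2, 2, 1),
                 levels = 1:3, labels = c("Mild", "Moderate", "Severe"),
                 ordered = TRUE)

hair_counts <- table(hair)      # counts per (unordered) category
effect_counts <- table(effect)  # counts per (ordered) category

# For an ordered factor the median category is meaningful. Reading it off the
# integer codes works here because the number of observations is odd.
median_effect <- levels(effect)[median(as.integer(effect))]

print(hair_counts)
print(effect_counts)
print(median_effect)
```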

Measures of spread

- Beyond the location of the data, we may be interested in how varied the data are
- Ex. You are planning to spend a year in London and Los Angeles. You find out that the average temperatures in the two places are 65 °F and 75 °F. You could use this information to decide what clothes to bring. Is this all you would want to know?
  - The spread of the distribution, i.e. the range of possible temperatures

Measures of spread

- Measures of distance from the mean:
  - Variance:
  - Standard deviation:
  - Note that the units on the standard deviation match the units on the mean
- Interquartile range: 25th percentile and 75th percentile
- Range: Minimum and maximum
- Which of these are sensitive to outliers?
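For reference, the usual sample versions of the variance and standard deviation (the ones computed by R's `var` and `sd`) can be written as below, where $\bar{x}$ is the sample mean and $n$ is the number of observations. The slide's own formulas did not survive in this transcript, so the notation here is an assumption.

```latex
s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2,
\qquad
s = \sqrt{s^2}
```

Because $s$ is the square root of a sum of squared deviations, it is expressed in the same units as the data and the mean.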

Computing measures of spread in R

- Let's look at the spread in the heights of the class
- `height<-c(63,64,66,64,64,67,68,67,63,71)`
- `var(height)`
- `sd(height)`
- `IQR(height)`
- `range(height)`
- What is the difference in the output for IQR and range?
- To find any quantile, use `quantile(height, 0.75)`
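The difference between the two outputs can be seen by storing the results. A short sketch using the heights above (the variable names on the left are chosen here for illustration):

```r
height <- c(63, 64, 66, 64, 64, 67, 68, 67, 63, 71)

q <- quantile(height, c(0.25, 0.75))  # 25th and 75th percentiles
iqr_value <- IQR(height)              # one number: distance between the quartiles
range_value <- range(height)          # two numbers: minimum and maximum
range_width <- diff(range_value)      # width of the range as a single number

print(q)
print(iqr_value)
print(range_value)
print(range_width)
```

`IQR` returns a single difference, while `range` returns the two endpoints, so `diff(range(height))` is needed to get a comparable single number.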

Tables

- Simple display for a group of numbers
- Very common in publications
- Two main types
  - Display tables: show several characteristics of groups in one display
  - Frequency tables: show the number of people in each group

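Frequency tables

| Hair color | Number of people | Relative freq | Cumulative freq |
|---|---|---|---|
| Blond | 3 | 0.3 | 0.3 |
| Brown | 6 | 0.6 | 0.9 |
| Red | 1 | 0.1 | 1.0 |

| Side effect | Number of people | Relative freq | Cumulative freq |
|---|---|---|---|
| Mild | 7 | 0.64 | 0.64 |
| Moderate | 3 | 0.27 | 0.91 |
| Severe | 1 | 0.09 | 1.00 |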
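A frequency table like the hair color one can be assembled in R roughly as follows, using the nominal data set from the earlier slide. The column names in `freq_table` are chosen here for illustration.

```r
hair <- factor(c(1, 2, 2, 2, 3, 1, 2, 1, 2, 2),
               levels = 1:3, labels = c("Blond", "Brown", "Red"))

n_people <- table(hair)               # number of people per group
rel_freq <- prop.table(n_people)      # counts divided by the total
cum_freq <- cumsum(rel_freq)          # running total of the relative frequencies

freq_table <- data.frame(
  hair_color = names(n_people),
  n          = as.vector(n_people),
  rel_freq   = as.vector(rel_freq),
  cum_freq   = as.vector(cum_freq)
)
print(freq_table)
```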

Creating a table in R

- A couple of different methods make tables in R
- Data:
  - `a<-c(1,1,1,1,2,2,2,2,2,3)`
  - `b<-c(2,1,2,2,2,2,2,1,1,1)`
  - `table(a)`
  - `table(a,b)`
  - `tabulate(a)`
- How do these work?
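To see how the three calls differ, it can help to store and print each result. The names `one_way`, `two_way`, and `plain` are introduced here only for the comparison.

```r
a <- c(1, 1, 1, 1, 2, 2, 2, 2, 2, 3)
b <- c(2, 1, 2, 2, 2, 2, 2, 1, 1, 1)

one_way <- table(a)       # labelled counts of each distinct value in a
two_way <- table(a, b)    # cross-tabulation: rows are values of a, columns are values of b
plain   <- tabulate(a)    # unlabelled integer counts for the bins 1, 2, ..., max(a)

print(one_way)
print(two_way)
print(plain)
```

`table` keeps the category labels and handles more than one variable, while `tabulate` returns a bare vector of counts and only works on positive integers.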

Practice

- Using the class data, answer the following questions:
  - How many students in the class have a Master's degree?
  - How many students went to college west of the Mississippi and have a Master's?
  - How many students like baseball (4 or 5)?
  - What is the longest time anyone was on a plane?
  - What is the largest family size in the class?
  - How many people have more than 4 people in their family?

Grouped data

- Frequency tables are also useful when you collect sensitive data, like income, for which people may not be willing to give exact values but will provide a range.
- With data such as this, how could we find the mean?

| Income | Number of people |
|---|---|
| \$10,000-\$29,999 | 15 |
| \$30,000-\$49,999 | 35 |
| \$50,000-\$69,999 | 60 |
| Over \$70,000 | 44 |

Grouped mean

- Since we do not have the specific data points, we cannot calculate the exact mean
- We can use the groups to estimate the mean using the grouped mean

$$\bar{x}_{\text{grouped}} = \frac{\sum_{j} n_j m_j}{\sum_{j} n_j}$$

where $n_j$ is the number of people in each group and $m_j$ is the midpoint of the group
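The grouped mean can be sketched in R using the counts from the income table. The midpoints below are rounded (for example 20,000 for the \$10,000-\$29,999 group), and the midpoint for the open-ended top group is an assumption made only for illustration, since that group has no upper bound.

```r
# Group counts from the income table
n_j <- c(15, 35, 60, 44)

# Approximate midpoints of each group. The last group ("Over $70,000") has no
# upper bound, so 80000 is an assumed midpoint chosen only for illustration.
m_j <- c(20000, 40000, 60000, 80000)

# Weighted average of the midpoints, weighted by the group counts
grouped_mean <- sum(n_j * m_j) / sum(n_j)
print(grouped_mean)
```

The estimate changes with the midpoint chosen for the top group, so it is worth trying a few plausible values to see how sensitive the result is.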

Graphs and Plots

- One of the biggest advantages of R is the quality of the plots
- Let's plot the ages of the class
- To make plots in R, use the following commands for the appropriate plots
  - Histogram: `hist(age)`
  - Box plot: `boxplot(age)`

Plot Command

- `plot` is the basic command for producing a scatter plot or line graph. Common arguments:
  - `col=` sets colors
  - `lty=` sets line types
  - `lwd=` sets line widths
  - `pch=` sets the plotting character
  - `type=` picks points (`type="p"`) or lines (`type="l"`)
  - `cex=` sets the "character expansion"
  - `xlab=` and `ylab=` set the axis labels
  - `xlim=` and `ylim=` set the limits of the axes
  - `main=` puts a title on the plot
  - `sub=` adds a sub-title (the separate `mtext()` function writes text in the margins)
  - See `help(par)` for details
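A small sketch of several of these arguments used together is shown below. The ages and heights are made up, and the plot is drawn to a null PDF device only so that the example runs without opening a window; remove the `pdf(NULL)` and `dev.off()` lines to see it on screen.

```r
# Made-up ages and heights, used only for illustration
age <- c(22, 25, 23, 31, 24, 27, 22, 29)
height <- c(63, 64, 66, 64, 67, 68, 63, 71)

pdf(NULL)  # null device: nothing is written to disk
plot(age, height,
     type = "p",            # points
     pch = 19,              # filled circles
     col = "blue",
     cex = 1.2,             # slightly larger symbols
     xlab = "Age (years)", ylab = "Height (inches)",
     xlim = c(20, 35), ylim = c(60, 75),
     main = "Height versus age")
mtext("Illustrative data", side = 3, line = 0.3)  # text in the top margin
dev.off()
```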

One-Dimensional Plots

- `barplot(height)` #simple form
- `barplot(height, width, names, space=.2, inside=TRUE, beside=FALSE, horiz=FALSE, legend, angle, density, col, blocks=TRUE)`
- `boxplot(..., range, width, varwidth=FALSE, notch=FALSE, names, plot=TRUE)`
- `hist(x, nclass, breaks, plot=TRUE, angle, density, col, inside)`

Two-Dimensional Plots

- `lines(x, y, type="l")`
- `points(x, y, type="p")`
- `matplot(x, y, type="p", lty=1:5, pch=, col=1:4)`
- `matpoints(x, y, type="p", lty=1:5, pch=, col=1:4)`
- `matlines(x, y, type="l", lty=1:5, pch=, col=1:4)`
- `plot(x, y, type="p", log="")`
- `abline(coef)`, `abline(a, b)`, `abline(reg)`, `abline(h=)`, `abline(v=)`
- `qqplot(x, y, plot=TRUE)`
- `qqnorm(x, datax=FALSE, plot=TRUE)`

Three-Dimensional Plots

- `contour(x, y, z, v, nint=5, add=FALSE, labex)`
- `interp(x, y, z, xo, yo, ncp=0, extrap=FALSE)`
- `persp(z, eye=c(-6,-8,5), ar=1)`

Multiple Plots Per Page

- `par(mfrow=c(nrow, ncol), oma=c(0, 0, 4, 0))`
  - `mfrow=c(m,n)`: subsequent figures will be drawn row-by-row in an m by n matrix on the page.
  - `oma=c(xbot,xlef,xtop,xrig)`: outer margins, in lines of text.
- `mtext(side=3, line=0, cex=2, outer=T, "This is an Overall Title For the Page")`
- Try this code on your own
  - `par(mfrow=c(2,1))`
  - `hist(age)`
  - `plot(class[,3],class[,4])`
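A self-contained version of this exercise might look like the sketch below. The data are made up because `class.dat` is not available here, and the null PDF device is used only so the code runs without opening a window.

```r
# Made-up ages and heights, used only for illustration
age <- c(22, 25, 23, 31, 24, 27, 22, 29)
height <- c(63, 64, 66, 64, 67, 68, 63, 71)

pdf(NULL)  # null device: nothing is written to disk
# Two rows, one column, with four lines of outer margin at the top for a title
old_par <- par(mfrow = c(2, 1), oma = c(0, 0, 4, 0))
hist(age, main = "Histogram of age", xlab = "Age (years)")
plot(age, height, xlab = "Age (years)", ylab = "Height (inches)")
mtext("Overall title for the page", side = 3, line = 0, cex = 2, outer = TRUE)
par(old_par)  # restore the previous layout settings
dev.off()
```

Saving the value returned by `par` and restoring it afterwards keeps later plots from inheriting the two-panel layout.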

Output to a postscript file

- Often we want to output an R graph to a postscript file to place it into a LaTeX file or other document
- To do this, we use the following code
  - `postscript("graph1.ps")`: opens a postscript file in the current working directory
  - `hist(age)`: plots a graph into the file
  - `dev.off()`: closes the postscript file
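The three steps can be tried end to end as follows. This sketch writes to R's temporary directory rather than the working directory, and the ages are made up for illustration; both choices are assumptions of the example.

```r
# Made-up ages, used only for illustration
age <- c(22, 25, 23, 31, 24, 27, 22, 29)

# Temporary location chosen for this sketch; use "graph1.ps" to write
# into the current working directory instead.
ps_file <- file.path(tempdir(), "graph1.ps")

postscript(ps_file)                              # open the postscript device
hist(age, main = "Ages", xlab = "Age (years)")   # draw into the file
dev.off()                                        # close the device and finish the file

print(file.exists(ps_file))
```

The file is only complete after `dev.off()` has been called, so forgetting that step is a common reason for an empty or unreadable output file.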

Making plots of your own

- Make the following plots
  - Histogram of height in the class with the appropriate labels
  - Scatterplot of height and age in the class using a different point symbol
  - Make a postscript file with four plots of your choice
  - Write a function to make a histogram and boxplot on one graph

Using a for loop

- Sometimes, we would like to do the same thing several times. One way to do this is to use a for loop
- Ex. We have a data set with several statistics for Red Sox players. We would like to find the mean and median of each of these statistics.
  - `base<-read.table("baseball.dat", header=T)`
  - The columns of this are player id, at bats, hits, home runs, walks, L/R
- How could we find the mean of the first 5 columns?

    basemean<-basemed<-matrix(0,1,5)
    for (i in 1:5){
      basemean[i]<-mean(base[,i])
      basemed[i]<-median(base[,i])
    }
    basemean
    basemed

Apply function

- A great way to do a similar action in R is to use the apply function
- `apply(base,2,mean)`
  - First argument: name of the data set
  - Second argument: 1=by row, 2=by column
  - Third argument: function to be applied (built-in or user defined)
- Note that you get the same result as the for loop for the first five columns.
- For this example there is limited benefit to the apply function, but in more complex situations it saves a lot of time
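The equivalence between the loop and `apply` can be checked on a small stand-in data set. The data frame below is made up (it is not the contents of `baseball.dat`), and its column names are assumptions that follow the column order described earlier.

```r
# Made-up stand-in for baseball.dat (values are illustrative, not real statistics)
base <- data.frame(
  id    = 1:4,
  ab    = c(450, 520, 300, 610),
  hits  = c(120, 160, 75, 180),
  hr    = c(10, 25, 5, 30),
  walks = c(40, 60, 20, 80),
  lh    = c(1, 0, 1, 0)
)

# for-loop version over the first five columns
loop_means <- numeric(5)
for (i in 1:5) {
  loop_means[i] <- mean(base[, i])
}

# apply version: the 2 means "apply the function to each column"
apply_means <- apply(base[, 1:5], 2, mean)
apply_medians <- apply(base[, 1:5], 2, median)

print(loop_means)
print(apply_means)
print(apply_medians)
```

The `apply` result also carries the column names, which makes the output easier to read than the unnamed vector from the loop.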

Using conditionals

- Now, we would like to find the total number of at bats and walks by left-handed batters. Remember that for left-handed batters LH is 1.
- We could do this using a for loop and if statements. Try this yourself.

    numplayers<-nrow(base)
    totab<-0
    totwalks<-0
    for (j in 1:numplayers){
      if (base[j,6]==1){
        totab<-totab+base[j,2]
        totwalks<-totwalks+base[j,5]
      }
    }

- The body of the if statement is only evaluated when the condition is true. You can also have else if and else statements, which will be evaluated if the initial condition is false. We will see this later in the summer.
- Although this is one way to get the total number of walks and at bats, it involves a lot of code.
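As a preview of `else if` and `else`, the sketch below counts batters by handedness. The data frame is made up, and the coding of right-handed batters as 0 is an assumption of this example; the slides only state that left-handed batters are coded 1.

```r
# Made-up stand-in for baseball.dat (values are illustrative, not real statistics)
base <- data.frame(
  id    = 1:4,
  ab    = c(450, 520, 300, 610),
  hits  = c(120, 160, 75, 180),
  hr    = c(10, 25, 5, 30),
  walks = c(40, 60, 20, 80),
  lh    = c(1, 0, 1, 0)   # 1 = left-handed; 0 = right-handed is assumed here
)

n_left <- 0
n_right <- 0
n_other <- 0
for (j in seq_len(nrow(base))) {
  if (base[j, 6] == 1) {
    n_left <- n_left + 1        # runs only when the first condition is TRUE
  } else if (base[j, 6] == 0) {
    n_right <- n_right + 1      # checked only when the first condition is FALSE
  } else {
    n_other <- n_other + 1      # runs when none of the conditions above is TRUE
  }
}

print(c(left = n_left, right = n_right, other = n_other))
```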

Subsetting a data set

- Another great thing about R is that you can embed if statements directly in the index
- Ex. As we know, to determine the total number of walks we can use
  - `sum(base[,5])`
- If we want to find the total number of walks among left-handed players, we can sum over the correct subset of players
  - `sum(base[(base[,6]==1),5])`
  - This command evaluates where `(base[,6]==1)` is true and sums over that subset only
  - What happens when you type `base[(base[,6]==1),]`?
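The subsetting approach can be tried on a small stand-in data set. The data frame below is made up (it is not the contents of `baseball.dat`), and the intermediate name `left` is introduced only to make the logical index visible.

```r
# Made-up stand-in for baseball.dat (values are illustrative, not real statistics)
base <- data.frame(
  id    = 1:4,
  ab    = c(450, 520, 300, 610),
  hits  = c(120, 160, 75, 180),
  hr    = c(10, 25, 5, 30),
  walks = c(40, 60, 20, 80),
  lh    = c(1, 0, 1, 0)
)

left <- base[, 6] == 1             # logical vector: TRUE for left-handed batters
walks_left <- sum(base[left, 5])   # walks summed over the TRUE rows only
ab_left <- sum(base[left, 2])      # at bats summed over the TRUE rows only
left_rows <- base[left, ]          # all columns for the left-handed batters

print(left)
print(walks_left)
print(ab_left)
print(left_rows)
```

Leaving the column position empty, as in `base[left, ]`, keeps every column and returns only the rows where the condition holds.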

Practice

- Make a histogram of the hits by batters with more than 400 at bats.
- Find the minimum number of at bats by a right-handed batter

More on R functions

Yesterday, we briefly mentioned that you could write your own functions in R. This is one of the most valuable aspects of R.

Let's look at this function. What does it do?

    fun<-function(x, y){
      mx<-mean(x); maxx<-max(x)
      my<-mean(y); maxy<-max(y)
      if (maxx>maxy){
        list(group=1, mean=mx)
      } else {
        list(group=2, mean=my)
      }
    }

    pp<-c(2,3,3,3,2,10)
    ppp<-c(8,7,6,8,6,5,6,7,8,7,7,9)
    fun(pp,ppp)
    $group
    [1] 1
    $mean
    [1] 3.833333

- Now, try to write functions to do the following things.
  - Take a vector input and find the mean of all of the values except the minimum and maximum
  - Take a vector input and output a graph with a histogram and boxplot
  - Take a matrix input. Find the mean and median of each column. Output the mean, median and column number as a list for the column with the highest median

Possible answers

    fun2<-function(x){
      s<-sum(x)-min(x)-max(x)
      n<-length(x)-2
      list(mean=s/n)
    }
    fun3<-function(x){
      par(mfrow=c(2,1))
      hist(x); boxplot(x)
    }
    fun4<-function(x){
      meds<-apply(x,2,median)
      mns<-apply(x,2,mean)
      n<-c(1:ncol(x))
      maxmed<-max(meds)
      nn<-n[(meds==maxmed)]
      list(column=nn, mean=mns[nn], median=meds[nn])
    }