2 Instructor
A PhD statistician with an interest in new statistical methods for the genomic analysis of human complex diseases
Education:
BS in Mathematics at Pohang Univ. of Science and Technology in South Korea
MS and PhD in Statistics at the Univ. of Minnesota
Postdoc training in Medical Genetics at the Univ. of Washington
A Research Assistant Professor working with DC Rao on many grants related to the genetics of blood pressure, cardiovascular disease, and related conditions
Also a course master of the Fundamental of Genetic Epidemiology (with Treva Rice)
3 Source for lecture material
simpleR – Using R for Introductory Statistics by John Verzani
R package: Simple
Using R for Introductory Statistics by John Verzani
R package: UsingR
Data Analysis and Graphics Using R: An Example-Based Approach by John Maindonald and W. John Braun
R package: DAAG
4 Outline
Introduction to R
Data
Univariate Data
Bivariate Data
Regression Analysis
Multivariate Data
With any programming language, you cannot learn by watching someone else: you have to do it yourself. So get your hands dirty!
5 YouTube videos for new R users
R Tutorial series by tutorial (https://www.youtube.com/watch?v=ZoPJGmpYJzw&list=PL69A9CCD816A5F3A5&index=1)
Statistics with R series by Christoph Scherber (https://www.youtube.com/watch?v=Xh6Rex3ARjc)
Statistics with R series by Courtney Brown (https://www.youtube.com/watch?v=2-kw1MlOS1U)
6 1. Introduction to R (include Friday’s material)
7 Brief history of R
R was originally written by Ross Ihaka and Robert Gentleman at the University of Auckland
It is an implementation of the S language, which was principally developed by John Chambers
In 1998, the Association for Computing Machinery gave John Chambers its Software Award. His citation reads: “S has forever altered the way people analyze, visualize, and manipulate data ... It is an elegant, widely accepted, and enduring software system, with conceptual integrity.”
The R Project (www.r-project.org)
8 Reasons for using R
R is free (download it from the internet). Use is covered by the Free Software Foundation's GNU General Public License, which is designed to guarantee the freedom of users to develop and give away the software
R runs on a wide variety of systems: Windows, MacOS X, UNIX (including FreeBSD), and Linux
R has state of the art statistical and graphical abilities, and strong scientific computational abilities, with new features regularly added
R has a vibrant and rapidly growing user community, who contribute by discussion on various lists, by adding new abilities, and by writing books and papers that are intended to help other users
9 More reasons for using R
R has become a system of choice for statistical researchers. It is used increasingly for the development of software in many different areas of science and commerce
The R system has had, increasingly in the past five years, a leading role in statistical software innovation. Each year, the American Statistical Association Statistical Computing and Graphics Section makes a $1000 cash award (the John M Chambers award) for statistical software written by, or in collaboration with, an undergraduate or graduate student. All winning entries from 2003 to 2010 have been for software that is associated with R.
R makes well-designed publication-quality plots that can incorporate mathematical symbols and formulae as needed
10 Excellent features in R
R has an excellent built-in help system.
R has excellent graphing capabilities.
The language has a powerful, easy to learn syntax with many built-in statistical functions.
The language is easy to extend with user-written functions.
R is a computer programming language. For programmers it will feel more familiar than others. For new computer users, the next leap to programming will not be so large.
Students can easily migrate to the commercially supported S-Plus program if commercial software is desired.
11 R as a calculator
The comment character (#) is used to make comments.
> 1 + 1 # Simple Arithmetic
[1] 2
> 2 + 3 * 4 # Operator precedence
[1] 14
> 3 ^ 2 # Exponentiation
[1] 9
> exp(1) # Basic math. functions are available
[1] 2.718282
> sqrt(10)
[1] 3.162278
> pi # The constant pi is predefined
[1] 3.141593
> 2*pi*6378 # Circumference of earth at equator (in km)
[1] 40074.16
12 R as a smart calculator
> x = 1 # Can define variables
> y = 3 # using "=" to assign values.
# You can also use "<-".
> z = 4
> x * y * z
[1] 12
> X * Y * Z # names are case sensitive
Error: object 'X' not found
> This.Year = 2004 # names can include period
> This.Year
[1] 2004
13 R does a lot more!
Definitely not just a calculator
R can manipulate vectors, matrices and datasets
R has many built-in statistical functions
R produces excellent graphics
R allows you to define your own functions
14 2. Data (include Friday’s material)
15 What is data?
When we read the newspaper or watch TV news, we find data and its interpretation.
Most often the data is presented in a summarized format, letting the reader draw conclusions.
Statistics allow us to summarize data in the familiar terms of counts, proportions, and averages.
So let us learn about data: how to summarize it, how to present it, and how to infer from it when appropriate.
16 Entering data with c
The most useful R command for quickly entering small data sets is the c function, which combines or concatenates terms together.
Example: suppose we have the following count of the number of typos per page:
In R
We assigned the values to a variable called typos
The value of typos doesn't automatically print out. It does when we type the name
The value of typos is prefaced with a funny looking [1]. This indicates that the value is a vector.
17 Data is a vector
The data is stored in R as a vector. This means that it keeps track of the order that the data is entered in.
This is a good thing for several reasons
Our simple data vector typos has a natural order: page 1, page 2 etc. We wouldn't want to mix these up.
We can make changes to the data item by item instead of having to enter the entire data set again.
Vectors are also a mathematical object. There are natural extensions of mathematical concepts such as addition and multiplication that make it easy to work with data when they are vectors.
18 Vectors in R
Created with
c() to concatenate elements
rep() to repeat elements or patterns
seq() or m:n to generate sequences
Most mathematical functions and operators can be applied to vectors without loops!
Possible to select and edit groups of elements simultaneously
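The data and the R session for this slide did not survive in the text, so the following is only a minimal sketch of the idea. The counts below are illustrative values chosen for this sketch (an assumption), not necessarily the ones shown on the original slide.

```r
# Illustrative typo counts for 8 pages (assumed values, not from the slide)
typos <- c(2, 3, 0, 3, 1, 0, 0, 1)

# Assignment prints nothing; typing the name prints the vector,
# and the output line starts with [1] because the value is a vector
typos

# Number of entries (pages) stored in the vector
length(typos)
```

Either `=` or `<-` can be used for the assignment, as noted on the "smart calculator" slide.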
19 Example with vectors in R
> rep(1,10) # repeats the number 1, 10 times
[1] 1 1 1 1 1 1 1 1 1 1
> seq(2,6) # sequence of integers between 2 and 6
[1] 2 3 4 5 6
# equivalent to 2:6
> seq(4,20,by=4) # Every 4th integer between 4 and 20
[1] 4 8 12 16 20
> x = c(2,0,0,4) # Create vector with elements 2,0,0,4
> y = c(1,9,9,9)
> x + y # Sums elements of two vectors
[1] 3 9 9 13
> x * 4 # Multiplies elements
[1] 8 0 0 16
> sqrt(x) # Function applies to each element
[1] 1.414214 0.000000 0.000000 2.000000
# Returns vector
20 Accessing vector elements
To extract data from a vector, use slicing and extraction as below.
Use the [ ] operator to select elements
To select specific elements, use an index or a vector of indexes to identify them
To exclude specific elements, use a negative index or a vector of negative indexes
Alternatively, use a vector of T and F values to select a subset of elements
21 Example
> x = c(2,0,0,4)
> x[1] # Select the first element, equivalent to x[c(1)]
[1] 2
> x[-1] # Exclude the first element
[1] 0 0 4
> x[1] = 3 ; x
[1] 3 0 0 4
> x[-1] = 5 ; x
[1] 3 5 5 5
> y = c(1,9,9,9) # y as defined in the previous example
> y < 9 # Compares each element, returns result as vector
[1] TRUE FALSE FALSE FALSE
> y[4] = 1
> y < 9
[1] TRUE FALSE FALSE TRUE
> y[y<9] = 2 # Edits elements marked as TRUE in index vector
> y
[1] 2 9 9 2
22 Assignment: Question 1
Try to guess the results of these R commands. Remember, the way to access entries in a vector is with [ ]. Suppose we assume
> x = c(1,3,5,7,9)
> y = c(2,3,5,7,11,13)
x+1
y*2
length(x) and length(y)
x + y
sum(x>5) and sum(x[x>5])
sum(x>5 | x< 3)
y[3]
y[-3]
y[x]
y[y>=7]
23 Examples with typos
Suppose we want to keep track of our various drafts as the typos change.
Or
The assignment to the first entry in the vector typos.draft2 is done by referencing the first entry in the vector. This is done with square brackets [ ]
Parentheses () are for functions, and square brackets [ ] are for vectors (and arrays and lists).
24 Apply a function
R comes with many built-in functions that one can apply to data such as typos. One of them is the mean function for finding the mean or average of the data.
Call median or var to find the median or sample variance.
The syntax is the same: the function name followed by parentheses to contain the argument(s):
25 Assignment: Question 2
Let the data x be given by x = c(1, 8, 2, 6, 3, 8, 5, 5, 5, 5)
Use R to compute the following functions. Note, we use X1 to denote the first element of x (which is 1) etc.
(X1 + X2 + … + X10)/10 (use sum)
Find log10(Xi) for each i. (Use the log function which by default is base e)
Find (Xi - 4.4)/2.875 for each i. (Do it all at once)
Find the difference between the largest and smallest values of x. (This is the range. You can use max and min or guess a built in command.)
26 Assignment: Question 3
Suppose you track your commute times for two weeks (10 days) and you find the following times in minutes
Enter this into R.
Use the function max to find the longest commute time, the function mean to find the average and the function min to find the minimum.
Oops, the 24 was a mistake. It should have been 18. How can you fix this? Do so, and then find the new average.
How many times was your commute 20 minutes or more? To answer this one can try (if you called your numbers commutes) sum(commutes >= 20). What do you get?
What percent of your commutes are less than 17 minutes? How can you answer this with R?
27 Use graphs to check data
Graphics are important for conveying important features of the data.
Numerical summaries, such as an average, can be very useful, but important features of the data may be missed without a glance at an appropriate graph.
This is the best way to begin investigation of a new set of data, drawing attention to obvious errors or quirks in the data, or to obvious clues that the data contains.
The use of graphs to display and help understand data has a long tradition. John W. Tukey formalized and extended this tradition, giving it the name Exploratory Data Analysis.
Data should, as far as possible, have the opportunity to speak for themselves, prior to or as part of a formal analysis!
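The code for this slide was lost, so here is a hedged sketch of the draft-tracking idea. The name `typos.draft2` comes from the slide; `typos.draft1` and all the counts are assumptions made for illustration.

```r
# First draft: illustrative counts per page (assumed values)
typos.draft1 <- c(2, 3, 0, 3, 1, 0, 0, 1)

# Second draft starts as a copy of the first
typos.draft2 <- typos.draft1

# Square brackets reference an entry of the vector:
# suppose the typos on page 1 were all fixed
typos.draft2[1] <- 0

typos.draft1   # unchanged
typos.draft2   # only the first entry differs
```

Copying first and then editing one entry avoids retyping the whole data set, which is the point made on the "Data is a vector" slide.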
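As a small illustration of the shared syntax, the sketch below applies `mean`, `median` and `var` to a `typos` vector. The values are assumed for this example, since the slide's data is not in the text.

```r
# Illustrative typo counts (assumed values)
typos <- c(2, 3, 0, 3, 1, 0, 0, 1)

# Same pattern each time: function name, then the argument in parentheses
avg <- mean(typos)     # arithmetic mean
med <- median(typos)   # middle value of the sorted data
v   <- var(typos)      # sample variance (divides by n - 1)

avg
med
v
```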
28 3. Univariate Data
Graphics and other simple functions to explore univariate data, data with a single variable.
29 Univariate data
Data can be of three types: categorical, discrete numeric and continuous numeric. Methods for viewing and summarizing the data depend on the type.
Examples: the U.S. census (http://www.census.gov) asks questions of a categorical nature; a doctor's chart records data on a patient.
The gender or the history of illnesses can be treated as categories.
The age of a person and their weight are numeric quantities. Both are usually treated as discrete numeric quantities, since people report them as integers rather than as fractions.
If one really needed to know precisely, they could in theory take on a continuum of values, and we would consider them to be continuous.
30 Table for categorical data
The table command allows us to look at tables. Its simplest usage looks like table(x) where x is a categorical variable.
Example: Smoking survey. A survey asks people if they smoke or not. The data is
We can enter this into R with the c() command, and summarize with the table command as follows
The table command simply adds up the frequency of each unique value of the data
31 Assignment: Question 4
The number of O-ring failures for the first 23 flights of the US space shuttle Challenger were
0 1 0 NA
(NA means not available - the equipment was lost).
Make a table of the possible categories. Try to find the mean. (You might need to try mean(x, na.rm=TRUE) to avoid the value NA, or look at x[!is.na(x)].)
32 Bar charts
A bar chart draws a bar with a height proportional to the count in the table.
Suppose a group of 25 people are surveyed as to their beer-drinking preference. The categories were (1) Domestic can, (2) Domestic bottle, (3) Microbrew and (4) Import. The raw data is
33 Bar charts in R
In R
To read in the data, use scan(), which is very useful for reading data from a file or by typing. You type in the data. It stops adding data when you enter a blank row. (Try ?scan for more information.)
We don't use barplot with the raw data.
Use the table command to create summarized data, then use barplot to create the barplot of frequencies shown.
For proportions, divide the summarized data by the number of data points.
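The survey responses themselves are missing from the text, so this sketch uses made-up answers (an assumption) just to show how `c()` and `table()` fit together.

```r
# Hypothetical yes/no answers from five respondents (assumed values)
smokes <- c("Y", "N", "N", "Y", "N")

# table() counts how often each unique value occurs
tab <- table(smokes)
tab
```

Dividing by `length(smokes)` would turn the counts into proportions.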
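The steps above can be sketched as follows. The raw data for the beer survey is not in the text, so the 25 coded responses below are assumed values for illustration, and they are entered with `c()` instead of `scan()` so the example runs without interactive typing.

```r
# 25 coded preferences: 1 = domestic can, 2 = domestic bottle,
# 3 = microbrew, 4 = import (assumed values for illustration)
beer <- c(3, 4, 1, 1, 3, 4, 3, 3, 1, 3, 2, 1, 2,
          1, 2, 3, 2, 3, 1, 1, 1, 1, 4, 3, 1)

# barplot(beer) would draw one bar per respondent, which is not what we want.
# Summarize first, then plot the frequencies.
counts <- table(beer)
barplot(counts, xlab = "Preference", ylab = "Frequency")

# Proportions: divide the counts by the number of data points
props <- counts / length(beer)
barplot(props, xlab = "Preference", ylab = "Proportion")
```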
35 Center and spread for numeric data
R commands for common numerical summaries are mean, var, sd, median and summary.
Example: CEO salaries. A sample of CEO annual salaries (in millions):
36 Stem-and-leaf charts
If the data set is relatively small, the stem-and-leaf diagram is useful for seeing the shape of the distribution and the values.
Use apropos() when you think you know the function's name but aren't sure.
37 Histograms
The simplest way to view a distribution of numeric data
rug() gives the tick marks just above the x-axis
jitter(x) gives a little jitter to the x values to eliminate ties
38 Boxplots
The boxplot is useful to summarize data succinctly, displaying if the data is symmetric or has suspected outliers.
The boxplot has a box with lines at Q1, Median, Q3 and whiskers which extend to the min and max.
To showcase possible outliers, the whiskers are shortened to a length of 1.5 times the box length. Any points beyond that are plotted individually.
We can check quickly for symmetry and outliers (data points beyond the whiskers).
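A minimal sketch of a histogram with a jittered rug. It uses the built-in `faithful` data set (eruption durations of the Old Faithful geyser) as a stand-in, which is an assumption; the slide's own data is not in the text.

```r
# Stand-in data: eruption durations in minutes from the built-in faithful data
x <- faithful$eruptions

# hist() draws the histogram and returns the bin counts invisibly
h <- hist(x, main = "Eruption durations", xlab = "Minutes")

# rug() adds tick marks along the x-axis; jitter() nudges tied values apart
rug(jitter(x))
```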
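To make the whisker rule concrete, here is a hedged sketch with made-up data (an assumption) containing one value far from the rest. `boxplot.stats()` reports the points that fall beyond the whiskers.

```r
# Made-up data with one suspiciously large value
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 50)

# range = 1.5 is the default: whiskers reach at most 1.5 box lengths
boxplot(x, range = 1.5, horizontal = TRUE)

# Points beyond the whiskers, using the same 1.5 multiplier
out <- boxplot.stats(x, coef = 1.5)$out
out
```

Setting `range = 0` in `boxplot()` would extend the whiskers to the min and max instead.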
39 Example: Movie sales
Data on movie revenues for the 25 biggest movies of a given week.
Boxplots of the current and gross sales
Both distributions are skewed, but the gross sales are less so. This shows why Hollywood is interested in the "big hit", as a real big hit can generate a lot more revenue than quite a few medium sized hits.
40 Assignment: Questions 5 and 6
5. Make a histogram and boxplot of three data sets: south, crime and aid.
Which of these data sets is skewed?
Which has outliers?
Which is symmetric?
6. For the data sets bumpers, firstchi, math make a histogram. Try to predict the mean, median and standard deviation. Check your guesses with the appropriate R commands.
41 4. Bivariate Data
Graphics and other simple functions to explore bivariate data, data with two variables
42 Bivariate data
With univariate data, we summarized a data set with measures of center and spread and the shape of a distribution with words such as “symmetric” and “long-tailed.”
With bivariate data we can ask additional questions about the relationship between the two variables.
For example, are height and weight related? Are age and heart rate related? Are income and taxes paid related? Is a new drug better than an old drug? Does the weather depend on the previous day's weather?
If a bivariate data set has a natural pairing, such as (x1, y1), …, (xn, yn), then it makes sense to investigate the data set jointly.
We will focus on relationships in numeric data.
43 Scatterplots to compare relationships
The scatterplot is a simple but important tool for investigating pairwise relationships (for example, the height of a father compared to his son's height).
The home data example shows old assessed value (1970) versus new assessed value (2000). There should be some relationship.
The linear model will be covered later.
44 Correlation between two variables
The correlation between two variables numerically describes whether larger values of one variable are related to larger values of the other variable.
A valuable numeric summary of the strength of the linear relationship is the Pearson correlation coefficient.
45 The Spearman rank correlation
To get the Pearson correlation coefficient, use cor
If the relationship between the variables is not linear but is increasing, we can still use the correlation coefficient to understand the strength of the relationship. We use the ranked data.
This is the Spearman rank correlation, which is the Pearson correlation coefficient computed with the ranked data.
Is there another way to get the Spearman correlation?
46 Pearson vs. Spearman correlation
Example: the Pearson correlation for 4 cases
In the 2nd plot, the Pearson correlation is 0.878, while the Spearman correlation is 0.928.
When a linear fit is inadequate, the Spearman correlation better captures the strength of the relationship.
47 Assignment: Question 7
The data set mammals (in MASS package) contains data on body weight versus brain weight.
Use cor to find the Pearson and Spearman correlation coefficients. Are they similar?
Plot the data using the plot command and see if you expect them to be similar. You should be unsatisfied with this plot.
Next, plot the logarithm (log) of each variable and see if that makes a difference.
48 Assignment: Question 8
The data set mtcars contains information about cars from a 1974 Motor Trend issue. Answer the following:
What are the variable names? (Try names.)
What is the maximum mpg? Which car has this?
What are the first 5 cars listed?
What horsepower (hp) does the “Valiant” have?
What are all the values for the Mercedes 450slc (Merc 450SLC)?
Make a scatterplot of cylinders (cyl) vs. miles per gallon (mpg). Fit a regression line. Is this a good candidate for linear regression?
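One possible answer to the closing question, sketched with made-up data (an assumption): `cor` accepts a `method` argument, so the Spearman correlation can be requested directly instead of ranking the data by hand.

```r
# Made-up data: y increases with x, but not linearly
x <- c(1, 2, 3, 4, 5, 6)
y <- x^3

# Pearson correlation on the raw values (below 1, the relation is not linear)
pearson <- cor(x, y)

# Spearman, first way: Pearson correlation of the ranks
spearman.ranks <- cor(rank(x), rank(y))

# Spearman, second way: ask cor() for it directly
spearman.method <- cor(x, y, method = "spearman")

c(pearson, spearman.ranks, spearman.method)
```

Both Spearman values agree, and both equal 1 here because the ranks of x and y match exactly.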
49 R Basics: Reading in datasets with library and data
The library and data commands can be used in several different ways
To list all available packages: Use the command library().
To list all available datasets: Use the command data() without any arguments
To list all data sets in a given package: Use data(package="package name"), for example data(package="Simple").
To read in a dataset: Use data(dataset name), as in the example data(movies). You need to load the package first to access its datasets, as in the command library("Simple").
To find out information about a dataset: You can use the help command to see if there is documentation on the data set. For example help("movies") or equivalently ?movies
50 Assignment: Question 9
In the library MASS, a dataset UScereal contains information about popular breakfast cereals.
Investigate the following relationships, and make comments on what you see. You can use tables, barplots, scatterplots etc. to do your investigation.
the relationship between manufacturer and shelf
the relationship between fat and vitamins
the relationship between fat and shelf
the relationship between carbohydrates and sugars
the relationship between fibre and manufacturer
the relationship between sodium and sugars
Are there other relationships you can predict and investigate?
51 5. Regression Analysis
Regression analysis is fundamental and forms a major part of statistical analysis!
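The same workflow, sketched with the MASS package instead of Simple. This assumes MASS is installed (it normally ships with R as a recommended package), whereas Simple may have to be installed separately.

```r
# Load the package so its datasets become accessible
library(MASS)

# Read in one of its datasets; this creates the data frame 'mammals'
data(mammals)

# Quick look at what was loaded
names(mammals)   # variable names
head(mammals)    # first few rows

# To list every dataset in the package: data(package = "MASS")
# To read the documentation:            help("mammals")  or  ?mammals
```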
52 Linear regression model
Linear regression can be used to study the linear relationship for paired data sets (x, y).
x and y have a linear relationship in a mathematical sense when y = mx + b, where m is the slope of the line and b the intercept.
In statistics, we don’t assume these variables have an exact linear relationship: rather, we consider the possibility for noise or error.
The regression model is $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$
The value $\varepsilon_i$ is an error term
The coefficients $\beta_0$ and $\beta_1$ are the regression coefficients
53 Linear regression analysis
The regression model: $y = \beta_0 + \beta_1 x + \varepsilon$
The values of $\beta_0$ and $\beta_1$ are unknown and will be estimated in a reasonable manner from the data
The estimated regression line is $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$ (using "hats" to denote the estimates)
For each data point $x_i$ we have $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$ (called the predicted value)
The difference between the true and predicted value is the residual $e_i = y_i - \hat{y}_i$
54 Statistical model: signal vs. noise
Statistical models have both deterministic and random error components, or signal components and noise components.
observation = signal + noise
($\beta_0 + \beta_1 x$ is the signal and $\varepsilon$ is the noise in the linear model)
After fitting a model, we have
observation = fitted value + residual
($\hat{y}$ is the fitted value and $e$ is the residual)
which we can think of as
observation = smooth component + rough component.
The idea is that the fitted value will recapture most of the signal and the residual will contain mostly noise.
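The slide does not say which estimation method is meant by "a reasonable manner". The sketch below assumes ordinary least squares, which is what `lm()` uses, and computes the estimates by hand on made-up data to show they match.

```r
# Made-up paired data (assumed values for illustration)
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Least-squares estimates of the slope and intercept
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)

# Predicted values and residuals
y.hat <- b0 + b1 * x
e     <- y - y.hat

# lm() should give the same coefficients
fit <- lm(y ~ x)
rbind(by.hand = c(b0, b1), lm = unname(coef(fit)))
```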
56 Example: linear regression with R
The maximum heart rate of a person is often said to be related to age by the equation: Max.rate = 220 - Age
Suppose this is to be empirically proven and 15 people of varying ages are tested for their maximum heart rate.
We use lm() to fit a linear model
57 R’s model formula notation
To fit a linear model, we use lm(y ~ x)
The most basic usage for lm is lm(formula)
The formula is a model formula that represents the simple linear regression model.
The response variable is on the left hand side and the predictor on the right: response ~ predictor
In our example, this is y ~ x, where ~ in this notation is read "is modeled by".
So, the model formula y ~ x would be read: y is modeled by x.
The model formula implicitly assumes an intercept term and a linear model.
58 Linear regression with R
The result of the lm function can be stored:
We can use summary to get regression coefficients (and more).
The result of the lm function is of class lm and so the plot and summary commands have been adapted.
For several generic functions (including print, plot, and summary), the result depends on the class of object that is given as argument.
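A hedged sketch of storing and inspecting an lm fit for the heart-rate example. The ages and maximum heart rates below are illustrative values for 15 people (an assumption; the slide's data is not in the text), and the names `age`, `max.rate` and `fit` are chosen for this sketch.

```r
# Illustrative data for 15 people (assumed values)
age      <- c(18, 23, 25, 35, 65, 54, 34, 56, 72, 19, 23, 42, 18, 39, 37)
max.rate <- c(202, 186, 187, 180, 156, 169, 174, 172, 153, 199, 193, 174,
              198, 183, 178)

# Store the result of lm(); read the formula as "max.rate is modeled by age"
fit <- lm(max.rate ~ age)

class(fit)     # "lm", so generic functions dispatch to lm-specific methods
coef(fit)      # estimated intercept and slope
summary(fit)   # coefficients, standard errors, R-squared and more

# Scatterplot with the fitted line added
plot(age, max.rate)
abline(fit)

# observation = fitted value + residual
all.equal(unname(fitted(fit) + resid(fit)), max.rate)
```

Calling `plot(fit)` instead would produce the diagnostic plots that the lm class provides, which is the class-dependent behavior described above.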