# Introduction to R for Absolute Beginners: Part II Melinda Fricke Department of Linguistics University of California, Berkeley

Introduction to R for Absolute Beginners: Part II. Melinda Fricke, Department of Linguistics, University of California, Berkeley (melindafricke@berkeley.edu). D-Lab Workshop Series, Spring 2013.

Welcome (back)! What we covered last time:

- creating and manipulating objects: variable assignment (=)
- types of objects: single values vs. vectors vs. data frames
- types of data: numerical vs. character vs. factor (categorical)

Welcome (back)! What we covered last time (cont’d):

- functions for manipulating data: c(), as.factor(), as.character(), as.numeric(), table(), aggregate()
- functions for getting around: ls(), rm(), read.table(), write.table(), dim(), summary(), head(), tail()
- subscripting, e.g. salary[1,1], where the pattern is dataframe[row, column]

Professors’ salaries, revisited

Data: http://linguistics.berkeley.edu/~mfricke/R_Workshop

(S. Weisberg (1985). Applied Linear Regression, Second Edition. New York: John Wiley and Sons. Page 194. Downloaded from http://data.princeton.edu/wws509/datasets/#salary on January 31st, 2013.)

read.table("salary.txt", header=T) -> salary

Professors’ salaries How many rows and columns are in this dataset? What are the possible values for “rk” (rank)? How many data points are there for each rank? How many males vs. females are there at each rank?

Professors’ salaries

How many rows and columns are in this dataset?

dim(salary)
[1] 52 6

What are the possible values for “rk” (rank)?

levels(salary\$rk)
[1] "assistant" "associate" "full"

(If levels() returns NULL, rk was read in as character rather than as a factor; convert it first with as.factor(salary\$rk) -> salary\$rk.)

How many data points are there for each rank?

table(salary\$rk)
assistant = 18, associate = 14, full = 20

How many males vs. females are there at each rank?

table(salary\$rk, salary\$sx)

| | female | male |
| --- | --- | --- |
| assistant | 8 | 10 |
| associate | 2 | 12 |
| full | 4 | 16 |
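If you don't have salary.txt at hand, the same inspection steps can be tried on a tiny made-up data frame. This is only a sketch: the data frame toy and all of its values are invented for illustration, and only the column names (rk, sx, sl) are borrowed from the salary data.

```r
# A small, made-up data frame (hypothetical values, not the real salary data)
toy <- data.frame(
  rk = c("assistant", "full", "associate", "full", "assistant"),
  sx = c("female", "male", "male", "female", "male"),
  sl = c(17000, 32000, 24000, 30000, 18000),
  stringsAsFactors = FALSE
)

# Convert rank to a factor so that levels() has something to report
toy$rk <- as.factor(toy$rk)

dim(toy)                # number of rows, then number of columns
levels(toy$rk)          # the possible values of rank
table(toy$rk)           # how many data points per rank
table(toy$rk, toy$sx)   # rank cross-tabulated with sex
toy[1, 3]               # subscripting: row 1, column 3
```

The pattern is the same as on the slide: dim() for the shape, levels() for the categories, table() for the counts, and square brackets for picking out a single cell.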

Plotting

Create a plot showing the relationship between years since degree (yd) and salary (sl).

plot(y ~ x, data=NameOfDataframe) draws a basic scatterplot. Read the tilde (~) as “as a function of”.

Plotting

Create a plot showing the relationship between years since degree (yd) and salary (sl).

plot(sl ~ yd, data=salary)

Look at the help file for “plot” and try to add the following to your plot:

- a main title
- labels for the x and y axes
- red dots (instead of the default black dots)

Plotting

Create a plot showing the relationship between years since degree (yd) and salary (sl).

plot(sl ~ yd, data=salary, main="Professors' salaries", xlab="years since degree", ylab="salary (\$)", col="red")

This adds:

- a main title (main)
- labels for the x and y axes (xlab, ylab)
- red dots instead of the default black dots (col)

Plotting

Create a plot showing the relationship between years since degree (yd) and salary (sl).

plot(sl ~ yd, data=salary, main="Professors' salaries", xlab="years since degree", ylab="salary (\$)", col="red")

More customizing:

- ylim = c(0, 40000): change the y range
- xlim = c(0, 20): change the x range
- pch = 19: specify the “plotting character”

Plotting

Create a plot showing the relationship between years since degree (yd) and salary (sl).

plot(sl ~ yd, data=salary, main="Professors' salaries", xlab="years since degree", ylab="salary (\$)", col="red", ylim=c(0, 40000), xlim=c(0, 20), pch=19)

More customizing:

- ylim = c(0, 40000): change the y range
- xlim = c(0, 20): change the x range
- pch = 19: specify the “plotting character”

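Saving a plot

jpeg("ProfessorsSalaries.jpeg", width=8, height=5, units="in", res=300)
plot(sl ~ yd, data=salary, main="Professors' salaries", xlab="years since degree", ylab="salary (\$)", col="red")
dev.off()

jpeg() opens a file to draw into, plot() draws the figure in that file, and dev.off() closes the file.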
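The open, draw, close pattern can be tried without the salary file. The sketch below makes two assumptions that are not in the slides: it uses pdf() instead of jpeg(), and it plots the built-in faithful data. The temporary file name is only there for illustration.

```r
# Write the plot to a temporary PDF file (assumption: pdf() instead of jpeg())
out <- tempfile(fileext = ".pdf")

pdf(out, width = 8, height = 5)   # open the file; size is in inches
plot(eruptions ~ waiting, data = faithful,
     main = "Old Faithful Eruptions", col = "red")
dev.off()                         # close the file so it is written to disk

file.exists(out)                  # check that the file was created
```

If you forget dev.off(), the file stays open and later plots keep going into it instead of appearing on screen.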

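More plotting

plot(salary)
boxplot(sl ~ sx + rk, data=salary)

Calling plot() on a whole data frame draws a scatterplot for every pair of columns. The boxplot formula says: show me the salary (the dependent variable) as a function of both sex and rank. The data argument says that these variable names can be found in the data frame called “salary”.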

Packages

To download and install a new package:

- In R, go to the “Packages & Data” menu.
- Go to the “Package Installer”.
- Search for “animation” and click “Get List”.
- Select the package and click “Install Selected”.
- In the console, type: library(animation)
- If nothing happens, that means everything went smoothly!

Packages

We’re going to work with some data sets that come installed with R, in the “datasets” package.

library(help=datasets)
head(faithful)

In the faithful data frame:

- eruptions = duration of each eruption
- waiting = # minutes since last eruption

Correlation Is there a correlation between the duration of a given eruption and the number of minutes since the preceding eruption? Make a plot that shows the relationship between eruption duration and time since last eruption. plot(y ~ x, data = dataframe)

Correlation plot(eruptions ~ waiting, data = faithful, main = "Old Faithful Eruptions", ylab = "duration of eruption (min)", xlab = "time since last eruption (min)")

Correlation

How can we tell if this correlation is statistically significant? Use cor.test(), the correlation test:

cor.test( ~ eruptions + waiting, data=faithful)

Result: cor = 0.9, p < 0.0001
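The numbers on this slide can also be pulled out of the test result instead of being read off the printed output. A small sketch (the names ct, r and p are just illustrative):

```r
# Store the result of the correlation test in an object
ct <- cor.test(~ eruptions + waiting, data = faithful)

r <- unname(ct$estimate)   # the correlation coefficient
p <- ct$p.value            # the p-value

round(r, 2)                # 0.9
p < 0.0001                 # TRUE
```

Typing names(ct) shows everything else that is stored in the result.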

Correlation: simple linear regression

How can we tell if this correlation is statistically significant? Another option is lm(), the linear model (regression):

lm(eruptions ~ waiting, data = faithful) -> faithful.lm
summary(faithful.lm)

Correlation: simple linear regression

Output of summary(faithful.lm):

Call: lm(formula = eruptions ~ waiting, data = faithful)

Residuals:

| Min | 1Q | Median | 3Q | Max |
| --- | --- | --- | --- | --- |
| -1.29917 | -0.37689 | 0.03508 | 0.34909 | 1.19329 |

Coefficients:

| | Estimate | Std. Error | t value | Pr(>\|t\|) | Signif. |
| --- | --- | --- | --- | --- | --- |
| (Intercept) | -1.874016 | 0.160143 | -11.70 | <2e-16 | *** |
| waiting | 0.075628 | 0.002219 | 34.09 | <2e-16 | *** |

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4965 on 270 degrees of freedom

Multiple R-squared: 0.8115, Adjusted R-squared: 0.8108

F-statistic: 1162 on 1 and 270 DF, p-value: < 2.2e-16
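One way to read the coefficients is to use them for a prediction. The sketch below refits the same model and predicts the eruption duration for a waiting time of 80 minutes; the value 80 and the names new_wait, pred and by_hand are chosen only for illustration.

```r
# Refit the model from the slide
faithful.lm <- lm(eruptions ~ waiting, data = faithful)

coef(faithful.lm)          # the intercept and the slope for waiting

# Predict the eruption duration for a waiting time of 80 minutes
new_wait <- data.frame(waiting = 80)
pred <- predict(faithful.lm, newdata = new_wait)

# The same prediction by hand: intercept + slope * 80
by_hand <- coef(faithful.lm)[["(Intercept)"]] + coef(faithful.lm)[["waiting"]] * 80

round(by_hand, 2)          # about 4.18 minutes
```

Both routes give the same answer; predict() just does the arithmetic for you.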

Correlation: simple linear regression

plot(eruptions ~ waiting, data = faithful, main = "Old Faithful Eruptions", ylab = "duration of eruption (min)", xlab = "time since last eruption (min)")
abline(faithful.lm)

abline() adds the fitted regression line to the existing plot.

Correlation: “exploratory plotting”

plot(eruptions ~ waiting, data = faithful, main = "Old Faithful Eruptions", ylab = "duration of eruption (min)", xlab = "time since last eruption (min)")
lines(lowess(faithful\$waiting, faithful\$eruptions), col="red", lwd=4)

In the second call:

- lowess() is a function that computes a smooth line through the points, and lines() draws it
- faithful\$waiting gives the x coordinates and faithful\$eruptions gives the y coordinates
- col sets the color of the line
- lwd sets the “line width”

Practice: correlation

Remember the professors’ salaries dataset? Is there a correlation between salary and years since degree?

- cor.test( ~ x + y, data=yourdata)
- lm(y ~ x, data=yourdata)
- plot(y ~ x, data=yourdata)

Practice: correlation

Is there a correlation between salary and years since degree?

cor.test( ~ sl + yd, data=salary)
cor = 0.675, p = 4.102e-08

lm(sl ~ yd, data=salary) -> salary.lm
summary(salary.lm)
Coefficients: yd = 390.65, p < 0.0001

Good news! On average, you’ll make about \$391 more for every year after you get your degree!

Practice: correlation

Is there a correlation between salary and years since degree?

plot(sl ~ yd, data = salary, main = "Professors' salaries", xlab = "years since degree", ylab = "salary (\$)")
abline(salary.lm, lty=2)

Here lty=2 is passed to abline() so that the regression line is drawn dashed.

t-tests

A t-test asks the question: Are these two sample distributions drawn from the same underlying population? More specifically, it asks whether the means of the two samples differ by more than we would expect from chance.

Example: test scores. Test scores generally form a normal distribution: a few are excellent, and a few are horrible, but the majority are right around the average. Given two sets of test scores, we want to know if the students from one school performed significantly better than the students at a different school.

t-tests A few words about normal distributions: A normal distribution can be described by a mean, and some variation around that mean. Look at the help file for the function rnorm(). rnorm() generates random numbers from a normal distribution. Create a distribution of 100 random numbers, with the mean of the distribution centered around 0.

t-tests

rnorm(n, mean, sd)

rnorm(100) # try this multiple times
hist(rnorm(100)) # now try this multiple times
hist(rnorm(1000, 4.5)) # and this!

Each time we generate a set of random numbers, we are “sampling” from a distribution. The more numbers we generate, the better idea we get of the “underlying” distribution.
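To see the effect of sample size in numbers rather than in histograms, you can compare the mean of a small sample with the mean of a large one. This is a sketch; the seed value and the sample sizes are arbitrary choices, and set.seed() is used only so that the "random" numbers come out the same on every run.

```r
# Fix the random number generator so the results are reproducible
set.seed(42)

# A small and a large sample from the same distribution (mean 4.5, sd 1)
small <- rnorm(10, mean = 4.5, sd = 1)
large <- rnorm(100000, mean = 4.5, sd = 1)

mean(small)   # can be noticeably off from 4.5
mean(large)   # very close to 4.5
```

Remove the set.seed() line and rerun a few times: the mean of the small sample jumps around much more than the mean of the large one.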

t-tests

A t-test asks the question: Are these two sample distributions drawn from the same underlying population?

Example with 10 numbers per sample:

hist(rnorm(10), ylim=c(0,5), xlim=c(-5,5), col="red")
hist(rnorm(10), add=T, col="blue")

And with 1000 numbers per sample:

hist(rnorm(1000), ylim=c(0,500), xlim=c(-5,5), col="red")
hist(rnorm(1000), add=T, col="blue")

t-tests One more example of the normal distribution: library(animation) quincunx()

t-tests

Example: American vs. Japanese cars

Download the .txt file located at http://linguistics.berkeley.edu/~mfricke/R_Workshop.html, and save it to your working directory.

Read it into R:

read.table("Cars-MPG.txt", header=T) -> cars

Take a minute to inspect the dataframe.

t-tests

hist(cars\$American) # a basic histogram
hist(cars\$Japanese, add=T)

How can we fix this problem?

t-tests

hist(cars\$American)
hist(cars\$Japanese, add=T)

How can we fix this problem? Set the x range when drawing the first histogram, so that both sets of values fit on the plot: xlim=c(0,50)

t-tests

hist(cars\$Japanese, breaks=10, col="red", main="Fuel Efficiency in American vs. Japanese Cars", xlab="miles per gallon", xlim=c(0,50))
hist(cars\$American, breaks=10, col="blue", add=T)
legend("topright", legend=c("American", "Japanese"), fill=c("blue", "red"))

t-tests

hist(cars\$American, breaks=10, col="blue", main="Fuel Efficiency in American vs. Japanese Cars", xlab="miles per gallon", xlim=c(0,50), prob=T)
legend("topright", legend=c("American", "Japanese"), fill=c("blue", "red"))
hist(cars\$Japanese, breaks=10, col="red", add=T, prob=T)
lines(density(cars\$American), lwd=4, lty=2)
lines(density(cars\$Japanese), lwd=4, lty=2)

t-tests

t.test(cars\$American, cars\$Japanese)

Welch Two Sample t-test

data: cars\$American and cars\$Japanese

t = -17.3377, df = 138.232, p-value < 2.2e-16

alternative hypothesis: true difference in means is not equal to 0

95 percent confidence interval: -16.10429 -12.80710

sample estimates: mean of x = 16.02532, mean of y = 30.48101
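If you don't have Cars-MPG.txt, the same test can be tried on simulated data. Everything about the data below is an assumption made for illustration: the sample sizes, means and standard deviations are invented to look roughly like the output above, and they are not the real car data.

```r
# Simulated stand-ins for the two columns (hypothetical values)
set.seed(1)
american <- rnorm(79, mean = 16, sd = 4)
japanese <- rnorm(79, mean = 30, sd = 6)

# The same call as on the slide, stored in an object
tt <- t.test(american, japanese)

tt$statistic   # the t value
tt$parameter   # the degrees of freedom
tt$p.value     # the p-value
tt$conf.int    # 95% confidence interval for the difference in means
tt$estimate    # the two sample means
```

Each piece of the printed output has a named slot in the result, which is handy when you want to reuse a number later in your script.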

ANOVA An ANOVA (ANalysis Of VAriance) asks the question: Given more than two sample populations, are any of them drawn from different underlying populations? In other words: are any of these groups different?

ANOVA Look at the “chickwts” dataframe. This is data on chicken weights, according to which type of feed they received. How many data points do we have? How many types of feed were there, and what were they?

ANOVA

Look at the “chickwts” dataframe. This is data on chicken weights, according to which type of feed they received.

How many data points do we have?

dim(chickwts)
[1] 71 2

How many types of feed were there, and what were they?

levels(chickwts\$feed)
[1] "casein" "horsebean" "linseed" "meatmeal" "soybean" "sunflower"

length(levels(chickwts\$feed))
[1] 6

ANOVA Can you make a boxplot showing weight as a function of feed? boxplot()

ANOVA

boxplot(weight ~ feed, data = chickwts, main = "Effect of Chicken Feed on Weight", ylab = "weight (g)", xlab = "type of feed")

anova(lm(weight ~ feed, data = chickwts))
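The slide shows the call but not what to look for in the result. As a sketch, the ANOVA table can be stored and its pieces picked out by row and column name (the names fit and tab are just illustrative):

```r
# Fit the model and store the ANOVA table
fit <- lm(weight ~ feed, data = chickwts)
tab <- anova(fit)

tab                        # the full table

tab["feed", "Df"]          # degrees of freedom for feed (6 groups - 1)
tab["Residuals", "Df"]     # residual degrees of freedom (71 chicks - 6 groups)
tab["feed", "F value"]     # the F statistic
tab["feed", "Pr(>F)"]      # the p-value: are any of the feeds different?
```

The table is a data frame, so the same subscripting you learned last time works on it.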

ANOVA follow-up: t-test

An ANOVA only tells you whether ANY of the groups you’re comparing are different from ANY of the others. If you want to check whether specific groups are different, you need to do a t-test.

subset(chickwts, feed=="casein") -> casein
subset(chickwts, feed=="horsebean") -> horsebean
t.test(casein\$weight, horsebean\$weight)

Result: t = 7.34, p = 0.000000721
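t.test() also accepts the formula notation used for plot() and lm(), as long as the grouping variable has exactly two levels. The sketch below is an alternative way of writing the comparison above, not a different test; the names two, tt_formula and tt_vectors are made up for illustration.

```r
# Keep only the two feeds of interest and drop the unused factor levels
two <- droplevels(subset(chickwts, feed %in% c("casein", "horsebean")))

# Formula version: weight as a function of feed
tt_formula <- t.test(weight ~ feed, data = two)

# Two-vector version, as on the slide
casein    <- subset(chickwts, feed == "casein")
horsebean <- subset(chickwts, feed == "horsebean")
tt_vectors <- t.test(casein$weight, horsebean$weight)

tt_formula$statistic
tt_vectors$statistic   # same t value as the formula version
```

Which version you use is a matter of taste; the formula version saves you from creating one object per group.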

Practice: ANOVA Do professors of different ranks make different amounts of money? How can you depict this graphically?

Practice: ANOVA

Do professors of different ranks make different amounts of money?

anova(lm(sl ~ rk, data = salary))
p = 1.174e-15 (0.000000000000001174)

How can you depict this graphically?

boxplot(sl ~ rk, data = salary)

Practice: ANOVA How do male and female salaries compare? Is this relationship consistent across ranks?

Practice: ANOVA

How do male and female salaries compare? Is this relationship consistent across ranks?

anova(lm(sl ~ sx * rk, data=salary))

In the formula, ‘*’ means “check each of these factors, plus their interaction”.

Results: sx, p = 0.001; rk, p < 0.0001; sx:rk, p = 0.85

Men and women make different amounts, and professors of different ranks make different amounts, but there is no significant interaction between sex and rank: men and women are equally unequal across all ranks!
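You can ask R to show what ‘*’ expands to. This sketch only inspects the formulas themselves, so it runs without the salary data; the names labels_star and labels_long are illustrative.

```r
# The terms R will test for the '*' formula from the slide
labels_star <- attr(terms(sl ~ sx * rk), "term.labels")
labels_star

# The same model written out in full: two main effects plus the interaction
labels_long <- attr(terms(sl ~ sx + rk + sx:rk), "term.labels")
labels_long

identical(labels_star, labels_long)   # TRUE
```

So sx * rk is shorthand for sx + rk + sx:rk, which is why the ANOVA output has three rows of results.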

Practice: ANOVA

How do male and female salaries compare? Is this relationship consistent across ranks?

boxplot(sl ~ sx + rk, data = salary, main = "Professors' salaries", xlab = "sex and rank", ylab = "salary (\$)")

chi-squared tests A chi-squared test asks the question: If I were sampling multiple times from one underlying population, would this set of counts be surprising? If the probability of a given outcome is very low, then you probably aren’t sampling from one population. You’re probably dealing with two different populations.

chi-squared tests: an example Imagine you are blindfolded, and someone tells you there are two boxes in front of you.

chi-squared tests: an example

You are given one box, and you draw 10 balls out of it.

| | sample 1 |
| --- | --- |
| red | 9 |
| blue | 1 |

chi-squared tests: an example

You are again given a box, and you also draw 10 balls out of it.

| | sample 1 | sample 2 |
| --- | --- | --- |
| red | 9 | 0 |
| blue | 1 | 10 |

Do you think you were given two different boxes, or did you sample from the same box twice?

chi-squared tests: an example

Let’s do the same thing over again… What if the counts looked like this?

| | sample 1 | sample 2 |
| --- | --- | --- |
| red | 5 | 4 |
| blue | 5 | 6 |

chi-squared tests: an example

Let’s do the same thing over again… What if the counts looked like this?

| | sample 1 | sample 2 |
| --- | --- | --- |
| red | 5000 | 4000 |
| blue | 5000 | 6000 |

A chi-squared test asks how surprised we should be at a given outcome, assuming we’re drawing balls out of the same box.

chi-squared tests: an example

matrix(data = c(5, 5, 6, 4), nrow=2) -> ex1
matrix(data = c(10, 0, 1, 9), nrow=2) -> ex2
matrix(data = c(5000, 5000, 6000, 4000), nrow=2) -> ex3
ex1

chisq.test(ex1): X-squared = 0, df = 1, p = 1

chisq.test(ex2): X-squared = 12.9293, df = 1, p-value = 0.0003235

chisq.test(ex3): X-squared = 201.6164, df = 1, p-value < 2.2e-16

If we draw from the same box over and over again, counts as similar as those in example 1 are exactly what we expect to see, so the p-value is 1. Example 1 should not surprise us at all. There is no reason to think we’re drawing from 2 separate boxes.

The probability of drawing from the same box and getting a distribution like example 2, though, is VERY small. So we’re probably not drawing from the same box! The same goes for example 3: with 10,000 balls in each sample, even a 50/50 split vs. a 60/40 split would be very surprising if both samples came from one box.
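To see where “surprise” comes from, you can look at the counts the test expects if both samples came from the same box. A sketch using example 2 (the names res and res_nc are illustrative; suppressWarnings() is used because R warns that the approximation may be unreliable when expected counts are this small):

```r
# Example 2 from the slide
ex2 <- matrix(c(10, 0, 1, 9), nrow = 2)

res <- suppressWarnings(chisq.test(ex2))

res$expected    # counts expected if both samples came from one box
res$statistic   # X-squared, as on the slide
res$p.value

# For 2x2 tables, chisq.test() applies a continuity correction by default.
# Turning it off gives the plain sum of (observed - expected)^2 / expected.
res_nc <- suppressWarnings(chisq.test(ex2, correct = FALSE))
by_hand <- sum((ex2 - res$expected)^2 / res$expected)

res_nc$statistic
by_hand         # same value as the uncorrected statistic
```

The further the observed counts are from the expected counts, the larger X-squared gets and the smaller the p-value becomes.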

chi-squared tests A chi-squared test asks the question: If I were sampling multiple times from one underlying population, would this set of counts be surprising? If the probability of a given outcome is very low, then you probably aren’t sampling from one population. You’re probably dealing with two different populations.

What we did today

A common way of entering formulas in R: (y ~ x, data = dataframe)

- plot(y ~ x, data = dataframe)
- boxplot(y ~ x, data = dataframe)
- lm(y ~ x, data = dataframe)
- anova(lm(y ~ x, data = dataframe))

What we did today

Some functions ask for data in a different format, though. Check the help file to see what a function’s arguments are!

- cor.test(~ x + y, data = dataframe)
- t.test(dataframe\$x, dataframe\$y)

To open a help file: ?cor.test or help(cor.test)

What we did today

There are lots of ways to customize plots!

- main = "": main title
- xlim = c(x,y): set the range for the x values
- ylim = c(x,y): set the range for the y values
- xlab = "": label the x axis
- ylab = "": label the y axis
- col = "blue": set the color for an object
- lwd = 2: set the width for a line
- lty = 2: define the type of line to use (2 = dashed)
- pch = 19: set the plotting character
- legend(): add a legend to a plot

Thank you! I hope this was helpful. Please give me your feedback so I can improve the workshop for future iterations. Also be sure to check out the other D-Lab R offerings!
