What factors are most responsible for height? Outcome = (Model) + Error
Analytics & History: 1st Regression Line http://galton.org/cgi-bin/searchImages/search/pearson/vol3a/pages/vol3a_0019.htm The first “Regression Line”
Galton’s Notebook on Families & Height http://vincentarelbundock.github.io/Rdatasets/doc/HistData/GaltonFamilies.html
Galton’s Family Height Dataset
Predictors: X1 (Dad’s Height), X2 (Mom’s Height), X3 (Gender). Outcome: Y (Child’s Height).
> getwd()
[1] "C:/Users/johnp_000/Documents"
setwd() changes the working directory; it takes the path of the target folder as its argument.
Dataset Input
h <- read.csv("GaltonFamilies.csv")
Labels: Object (h), Function (read.csv), Filename ("GaltonFamilies.csv"), Data()
Inspect the result with str() and summary(). Data Types: Numbers and Factors/Categorical
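As a minimal sketch of what str() and summary() report for numeric and categorical columns, the block below uses a small made-up data frame instead of the real file. The name h_toy and all values in it are hypothetical; only the column names follow GaltonFamilies. Since R 4.0.0, read.csv() and data.frame() leave text columns as character vectors, so the sketch converts gender to a factor explicitly.

```r
# Toy stand-in for a few rows of GaltonFamilies.csv (values are made up).
h_toy <- data.frame(
  father      = c(78.5, 75.5, 75.0, 69.0),
  mother      = c(67.0, 66.5, 64.0, 66.0),
  gender      = c("male", "female", "male", "female"),
  childHeight = c(73.2, 69.2, 71.0, 62.5)
)

# Text columns are not turned into factors automatically in R >= 4.0.0,
# so convert the categorical variable by hand.
h_toy$gender <- factor(h_toy$gender)

str(h_toy)      # one line per column: its type and first few values
summary(h_toy)  # min, quartiles, mean, max for numbers; counts per level for factors
```

The same two calls applied to h would show which of its columns R treats as numbers and which as categories.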
Outline
One Variable: Univariate (Dependent / Outcome Variable)
Two Variables: Bivariate (Outcome and each Predictor)
All Four Variables: Multivariate
Variable   Name             Type          Steps
Y          Child’s Height   Continuous    Histogram
X1         Dad’s Height     Continuous    Scatter
X2         Mom’s Height     Continuous    Scatter
X3         Gender           Categorical   Boxplot
All four variables: Linear Regression
Frequency Distribution, Histogram
hist(h$childHeight)
Frequency distribution: a graph plotting the values of observations on the horizontal axis, with a bar showing how many times each value occurred in the data set.
Density Plot
plot(density(h$childHeight))
The area under the density curve equals 1.
Mode, Bimodal
hist(h$childHeight, freq = FALSE, breaks = 25, ylim = c(0, 0.14))
curve(dnorm(x, mean = mean(h$childHeight), sd = sd(h$childHeight)), col = "red", add = TRUE)
Industries / Organizations Creating and Using R
Hadley Wickham, Asst. Professor of Statistics at Rice University: ggplot2, plyr, reshape, rggobi, profr. http://ggplot2.org/
Industry Pct.: Research 24% Higher Education 7% Information Technology 9% Computer Software Financial Services 6% Banking 2% Pharmaceuticals 4% Biotechnology Market Research 3% Management Consulting Total 69%
http://prezi.com/s1qrgfm9ko4i/the-r-ecosystem/ Source: LinkedIn R Group (Sept, 2011)
ggplot2
library(ggplot2)
h.gg <- ggplot(h, aes(childHeight))
h.gg + geom_histogram(binwidth = 1) + labs(x = "Height", y = "Frequency")
h.gg + geom_density()
http://www.cookbook-r.com/Graphs/Plotting_distributions_(ggplot2)/
ggplot2
h.gg <- ggplot(h, aes(childHeight)) + theme(legend.position = "right")
h.gg + geom_density() + labs(x = "Height", y = "Density")
h.gg + geom_density(aes(fill = factor(gender)), size = 2)
Correlation and Regression http://en.wikipedia.org/wiki/Genetics
Covariance
1. Calculate the difference between the mean and each person’s score for the first variable (x).
2. Calculate the difference between the mean and their value for the second variable (y).
3. Multiply these “error” values.
4. Add these values to get the cross-product deviations.
5. The covariance is the average of the cross-product deviations.
Covariance (plot of Y against X): persons 2, 3, and 5 appear to deviate from the two means by similar magnitudes.
Covariance Calculate the error [deviation] between the mean and each subject’s score for the first variable (x). Calculate the error [deviation] between the mean and their score for the second variable (y). Multiply these error values. Add these values and you get the cross product deviations. The covariance is the average cross-product deviations:
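The formula that followed the colon is not preserved in this text. A standard way to write the average cross-product deviation is shown below, assuming the sample convention of dividing by N - 1 (the convention R's cov() uses) rather than by N.

```latex
\operatorname{cov}(x, y) = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{N - 1}
```

Here N is the number of subjects, and x̄ and ȳ are the means of the two variables.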
Standardizing the Covariance
Covariance depends upon the units of measurement.
To normalize it, divide by the standard deviations of both variables.
The standardized version of covariance is known as the correlation coefficient.
Correlation
?cor
cor(h$father, h$childHeight)
[1] 0.2660385
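The covariance steps and the standardization can be sketched in a few lines of R. The vectors x and y below are made-up heights for illustration, not values from the Galton data, and the N - 1 denominator is assumed because it is what cov() and sd() use.

```r
# Made-up heights in inches for five father/child pairs (illustration only).
x <- c(65, 67, 70, 72, 74)   # first variable, e.g. father
y <- c(66, 66, 69, 70, 73)   # second variable, e.g. child

dev_x <- x - mean(x)          # deviations from the mean of x
dev_y <- y - mean(y)          # deviations from the mean of y
cross <- sum(dev_x * dev_y)   # sum of the cross-product deviations

cov_xy <- cross / (length(x) - 1)   # sample covariance (N - 1 denominator)
r_xy   <- cov_xy / (sd(x) * sd(y))  # standardize: divide by both SDs

all.equal(cov_xy, cov(x, y))  # agrees with the built-in covariance
all.equal(r_xy, cor(x, y))    # agrees with the built-in Pearson correlation

# Changing units (inches to centimetres) rescales the covariance
# but leaves the correlation unchanged.
cov_cm <- cov(2.54 * x, 2.54 * y)
r_cm   <- cor(2.54 * x, 2.54 * y)
```

Comparing cov_cm with cov_xy and r_cm with r_xy shows why the correlation, not the covariance, is the quantity usually reported.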
Scatterplot Matrix: pairs()
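A minimal sketch of a pairs() call together with the matching correlation matrix. The data frame h_toy and its values are hypothetical stand-ins; with the real data, the same calls would be applied to the numeric columns of h.

```r
# Toy stand-in for the numeric height columns (made-up values).
h_toy <- data.frame(
  father      = c(78.5, 75.5, 75.0, 69.0, 70.0, 68.0),
  mother      = c(67.0, 66.5, 64.0, 66.0, 62.0, 63.5),
  childHeight = c(73.2, 69.2, 71.0, 62.5, 68.0, 66.0)
)

pairs(h_toy)            # one scatterplot for every pair of columns
cor_mat <- cor(h_toy)   # Pearson correlations for the same pairs
round(cor_mat, 2)       # symmetric matrix with 1s on the diagonal
```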
Correlations Matrix
library(car)
scatterplotMatrix(heights)
ggplot2
Box Plot http://web.anglia.ac.uk/numbers/graphsCharts.html
Children’s Height vs. Gender
boxplot(childHeight ~ gender, data = h, col = c("pink", "lightblue"), main = "Children's Height by Gender", xlab = "Gender", ylab = "")
Descriptive Stats: Box Plot
69.23 - 64.10 = 5.13
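The figures on this slide appear to be the two group averages of child height and their difference; that reading is an assumption, since the slide does not label them. A per-group summary of this kind can be sketched with tapply(). The data frame h_toy and its values are made up for illustration.

```r
# Toy data: three sons and three daughters (made-up heights in inches).
h_toy <- data.frame(
  gender      = c("male", "male", "male", "female", "female", "female"),
  childHeight = c(70, 69, 71, 64, 65, 63)
)

# Mean height within each gender group.
means <- tapply(h_toy$childHeight, h_toy$gender, mean)

# Difference between the two group means.
gap <- means[["male"]] - means[["female"]]
gap
```

With the real data, tapply(h$childHeight, h$gender, mean) would give the corresponding group means.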
Subset Males
men <- subset(h, gender == "male")
Subset Females
women <- subset(h, gender == "female")
Children’s Height: Males
hist(men$childHeight)
qqnorm(men$childHeight)
qqline(men$childHeight)
Children’s Height: Females
hist(women$childHeight)
qqnorm(women$childHeight)
qqline(women$childHeight)
ggplot2
library(ggplot2)
h.bb <- ggplot(h, aes(factor(gender), childHeight))
h.bb + geom_boxplot()
h.bb + geom_boxplot(aes(fill = factor(gender)))