What factors are most responsible for height?

Presentation transcript:

What factors are most responsible for height? Outcome = (Model) + Error
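The decomposition on this slide can be checked numerically. The sketch below, in base R, fits a simple linear model to a small made-up data frame (the `toy` values are hypothetical and are not Galton's data) and confirms that each observed outcome equals the model's fitted value plus the residual error.

```r
# Hypothetical toy data: values are invented for illustration only.
toy <- data.frame(
  father      = c(65, 67, 68, 70, 72, 74),
  childHeight = c(64, 66, 69, 68, 71, 72)
)

# Model: predict the child's height from the father's height.
fit <- lm(childHeight ~ father, data = toy)

model_part <- fitted(fit)     # the part the model explains
error_part <- residuals(fit)  # the part left unexplained

# Outcome = (Model) + Error, row by row.
rebuilt <- model_part + error_part
print(cbind(outcome = toy$childHeight, model = model_part, error = error_part))
print(all.equal(unname(rebuilt), toy$childHeight))
```

With a least-squares fit that includes an intercept, the residuals also sum to (numerically) zero, which is one way to see that the error term carries no systematic offset.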

Analytics & History: 1st Regression Line http://galton.org/cgi-bin/searchImages/search/pearson/vol3a/pages/vol3a_0019.htm The first “Regression Line”

Galton’s Notebook on Families & Height http://vincentarelbundock.github.io/Rdatasets/doc/HistData/GaltonFamilies.html

Galton’s Family Height Dataset X1 X2 X3 Y

> getwd() [1] "C:/Users/johnp_000/Documents" > setwd()

Dataset Input
h <- read.csv("GaltonFamilies.csv")
Object: h; Function: read.csv(); Filename: "GaltonFamilies.csv"
data()

str() summary() Data Types: Numbers and Factors/Categorical
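As a rough illustration of what `str()` and `summary()` report for the two data types named here, the sketch below uses a tiny made-up data frame (`toy` is hypothetical, standing in for the height file) with one factor column and one numeric column.

```r
# Hypothetical toy data standing in for the height file.
toy <- data.frame(
  gender      = factor(c("male", "female", "female", "male")),
  childHeight = c(70.5, 64.0, 63.5, 69.0)
)

str(toy)      # one line per column: its type and first few values
summary(toy)  # counts per level for the factor, quartiles and mean for the number

# The two kinds of column can also be checked directly.
is_num <- sapply(toy, is.numeric)
is_fac <- sapply(toy, is.factor)
print(is_num)
print(is_fac)
```

Note that since R 4.0.0, `read.csv()` reads text columns as character rather than factor by default, so a column such as gender may need `factor()` or `stringsAsFactors = TRUE` before it shows up as categorical.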

Outline
One Variable: Univariate (Dependent / Outcome Variable)
Two Variables: Bivariate (Outcome and each Predictor)
All Four Variables: Multivariate

Variable, Type, Steps
Y (Child’s Height), Continuous, Histogram
X1 (Dad’s Height) and X2 (Mom’s Height), Continuous, Scatter
X3 (Gender), Categorical, Boxplot
Final step: Linear Regression

Frequency Distribution, Histogram
hist(h$childHeight)
Frequency distribution: a graph plotting the values of observations on the horizontal axis, with a bar showing how many times each value occurred in the data set.
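To make the definition concrete, the sketch below counts how often each value occurs in a short made-up vector (`toy_heights` is hypothetical) and shows that `hist()` produces the same kind of counts once the values are grouped into bins.

```r
# Hypothetical heights, invented for illustration only.
toy_heights <- c(64, 66, 66, 67, 67, 67, 68, 68, 70)

# Frequency distribution: how many times each value occurs.
freq <- table(toy_heights)
print(freq)

# hist() does the same counting after grouping values into bins.
# plot = FALSE returns the bin counts without drawing the chart.
bins <- hist(toy_heights, breaks = seq(63.5, 70.5, by = 1), plot = FALSE)
print(bins$mids)    # centre of each bin
print(bins$counts)  # bar height for each bin
```

Dropping `plot = FALSE` draws the bars; the `breaks` argument controls how wide each bar is, which is the same choice the `breaks = 25` argument makes on a later slide.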

Density Plot
plot(density(h$childHeight))
Area = 1
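The "Area = 1" note can be checked approximately. The sketch below estimates a density for a made-up vector (`toy_heights` is hypothetical) and adds up the area under the curve with a simple rectangle rule; the result should be close to, though not exactly, 1 because the curve is evaluated on a finite grid.

```r
# Hypothetical heights, invented for illustration only.
toy_heights <- c(62, 63.5, 64, 65, 66, 66.5, 67, 68, 69, 70, 71, 73)

d <- density(toy_heights)  # kernel density estimate on an evenly spaced grid

# Approximate the area under the curve: height of each point times grid spacing.
step <- d$x[2] - d$x[1]
area <- sum(d$y) * step
print(area)
```

This is why the vertical axis of a density plot is not a count: the heights are scaled so that the total area, rather than the total of the bars, equals one.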

Mode, Bimodal
hist(h$childHeight, freq=F, breaks=25, ylim=c(0,0.14))
curve(dnorm(x, mean=mean(h$childHeight), sd=sd(h$childHeight)), col="red", add=T)

Industries / Organizations Creating and Using R
Hadley Wickham, Asst. Professor of Statistics at Rice University: ggplot2, plyr, reshape, rggobi, profr (http://ggplot2.org/)
Industry Pct.: Research 24% Higher Education 7% Information Technology 9% Computer Software Financial Services 6% Banking 2% Pharmaceuticals 4% Biotechnology Market Research 3% Management Consulting Total 69%
http://prezi.com/s1qrgfm9ko4i/the-r-ecosystem/ Source: LinkedIN R Group (Sept, 2011)

ggplot2
library(ggplot2)
h.gg <- ggplot(h, aes(childHeight))
h.gg + geom_histogram(binwidth = 1) + labs(x = "Height", y = "Frequency")
h.gg + geom_density()
http://www.cookbook-r.com/Graphs/Plotting_distributions_(ggplot2)/

ggplot2
h.gg <- ggplot(h, aes(childHeight)) + theme(legend.position = "right")
h.gg + geom_density() + labs(x = "Height", y = "Density")
h.gg + geom_density(aes(fill=factor(gender)), size=2)

Correlation and Regression http://en.wikipedia.org/wiki/Genetics

Covariance
1. Calculate the difference between the mean and each person’s score for the first variable (x).
2. Calculate the difference between the mean and their value for the second variable (y).
3. Multiply these “error” values.
4. Add these values to get the cross-product deviations.
5. The covariance is the average of the cross-product deviations.

Covariance (plot of Y against X): Persons 2, 3, and 5 appear to deviate from their means by similar magnitudes.

Covariance
Calculate the error [deviation] between the mean and each subject’s score for the first variable (x). Calculate the error [deviation] between the mean and their score for the second variable (y). Multiply these error values. Add these values and you get the cross-product deviations. The covariance is the average of the cross-product deviations.
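The steps on this slide can be followed by hand in R. The sketch below uses two short made-up vectors (`x` and `y` are hypothetical, not Galton's data) and assumes that "average" means dividing by n - 1, which is the denominator R's built-in `cov()` uses.

```r
# Hypothetical paired scores, invented for illustration only.
x <- c(65, 67, 68, 70, 72)  # first variable, e.g. father's height
y <- c(64, 66, 69, 68, 71)  # second variable, e.g. child's height

dev_x <- x - mean(x)        # deviation of each x from its mean
dev_y <- y - mean(y)        # deviation of each y from its mean
cross <- dev_x * dev_y      # multiply the paired deviations

# Sum the cross-products, then average them (assumed denominator: n - 1).
cov_by_hand <- sum(cross) / (length(x) - 1)

print(cov_by_hand)
print(cov(x, y))            # built-in function for comparison
```

A positive result means the two variables tend to sit on the same side of their means at the same time; a negative result means they tend to sit on opposite sides.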

Standardizing the Covariance
Covariance depends upon the units of measurement. To normalize it, divide the covariance by the standard deviations of both variables. The standardized version of covariance is known as the correlation coefficient.
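A short sketch of the standardization, using made-up vectors (`x` and `y` are hypothetical and are treated as inches purely for illustration): converting the units changes the covariance, while the covariance divided by both standard deviations stays the same and matches `cor()`.

```r
# Hypothetical paired heights, treated as inches for illustration only.
x <- c(65, 67, 68, 70, 72)
y <- c(64, 66, 69, 68, 71)

# Standardize: divide the covariance by both standard deviations.
r_by_hand <- cov(x, y) / (sd(x) * sd(y))
print(r_by_hand)
print(cor(x, y))   # built-in correlation coefficient for comparison

# Change the units to centimetres.
x_cm <- x * 2.54
y_cm <- y * 2.54
print(cov(x, y))        # covariance in squared inches
print(cov(x_cm, y_cm))  # a different number in squared centimetres
print(cor(x_cm, y_cm))  # the correlation is unchanged
```

Because the units cancel out, a correlation always falls between -1 and 1, which makes it comparable across pairs of variables measured on different scales.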

Correlation
?cor
cor(h$father, h$childHeight)
[1] 0.2660385

Scatterplot Matrix: pairs()

Correlations Matrix
library(car)
scatterplotMatrix(heights)
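The numbers behind a scatterplot matrix can be printed with base R alone. The sketch below builds a small made-up data frame (`toy` is hypothetical, not the `heights` object used on the slide), keeps only its numeric columns, and passes them to `cor()` to get every pairwise correlation at once.

```r
# Hypothetical toy data, invented for illustration only.
toy <- data.frame(
  father      = c(65, 67, 68, 70, 72, 74),
  mother      = c(62, 61, 64, 63, 66, 65),
  childHeight = c(64, 66, 69, 68, 71, 72),
  gender      = c("female", "female", "male", "female", "male", "male")
)

# cor() needs numeric input, so drop the categorical column first.
num_cols <- toy[sapply(toy, is.numeric)]

cm <- cor(num_cols)  # pairwise correlation matrix
print(round(cm, 2))
```

Calling `pairs(num_cols)` on the same columns would draw the matching grid of scatterplots, one panel per cell of this matrix.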

ggplot2

Box Plot http://web.anglia.ac.uk/numbers/graphsCharts.html

Children’s Height vs. Gender
boxplot(childHeight ~ gender, data=h, col=c("pink","lightblue"), main="Children's Height by Gender", xlab="Gender", ylab="")

Descriptive Stats: Box Plot
69.23 - 64.10 = 5.13
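A difference like the one on this slide can be computed per group. The sketch below assumes the slide's figures are group averages (an assumption, since the slide does not say) and uses a small made-up data frame (`toy` is hypothetical) to show one way of getting a mean for each gender and the gap between them.

```r
# Hypothetical toy data, invented for illustration only.
toy <- data.frame(
  gender      = c("male", "female", "male", "female", "male", "female"),
  childHeight = c(70, 64, 69, 63.5, 71, 65)
)

# Mean height within each gender group.
means <- tapply(toy$childHeight, toy$gender, mean)
print(means)

# Gap between the two group means.
gap <- unname(means["male"]) - unname(means["female"])
print(gap)
```

Swapping `mean` for `median` in the `tapply()` call gives the values marked by the centre lines of the boxes instead.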

Subset Males
men <- subset(h, gender == 'male')

Subset Females
women <- subset(h, gender == 'female')

Children’s Height: Males
hist(men$childHeight)
qqnorm(men$childHeight)
qqline(men$childHeight)

Children’s Height: Females
hist(women$childHeight)
qqnorm(women$childHeight)
qqline(women$childHeight)

ggplot2
library(ggplot2)
h.bb <- ggplot(h, aes(factor(gender), childHeight))
h.bb + geom_boxplot()
h.bb + geom_boxplot(aes(fill = factor(gender)))