XIAO WU DATA ANALYSIS & BASIC STATISTICS
PURPOSE OF THIS WORKSHOP Statistics as a useful tool to analyze results Basic terminology and most commonly used tests Exposure to more advanced statistical tools
WHY DO WE NEED STATISTICS?
Summary Classification Interpretation Pattern searching Abnormality identification Prediction Interpolation Extrapolation
SUMMARY
SUMMARY Mean, median, mode Variance, standard deviation Max, min values and range Quartiles
EXAMPLE Firm A Mean: $5,800 Firm B Mean: $5,000
EXAMPLE
Firm A: Mean $5,800; Median $4,000; SD $7,270; 3rd quartile $4,000; 1st quartile $500
Firm B: Mean $5,000; Median $5,000; SD $203; 3rd quartile $5,175; 1st quartile $4,825
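A minimal Python sketch of why the mean alone can mislead. The two salary lists are invented for illustration (they are not the data behind the Firm A and Firm B figures), and the helper name `summarize` is an assumption. Only the standard library is used.

```python
import statistics

# Hypothetical salaries: one very large value skews the first list.
skewed = [500, 600, 4000, 4200, 21000]
even = [4800, 4900, 5000, 5100, 5200]

def summarize(values):
    """Return the summary statistics listed on the SUMMARY slide."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "sd": statistics.stdev(values),  # sample standard deviation
        "q1": q1,
        "q3": q3,
        "min": min(values),
        "max": max(values),
        "range": max(values) - min(values),
    }

for name, data in (("skewed", skewed), ("even", even)):
    print(name, summarize(data))
```

In the skewed list the mean sits well above the median, which is the kind of gap the median, standard deviation and quartiles reveal.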
EXAMPLE Salary tables (columns: #, Salary ($))
CLASSIFICATION Identification of variable Independent vs. dependent Numeric vs. categorical
Variable types: Categorical (Nominal, Ordinal); Numeric (Continuous, Discrete)
PATTERN SEARCHING Distribution of data Some commonly used distributions Uniform Binomial Poisson … Central limit theorem
UNIFORM Every outcome has an equal chance Example: Flipping a coin Rolling a die What if you need to flip multiple times?
BINOMIAL Two outcomes, probability p and 1- p Multiple trials: n Example: Flipping a coin 100 times Germination of multiple seeds
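As a sketch of the coin example, the binomial probability of exactly k successes in n trials is C(n, k) p^k (1 - p)^(n - k). The function name `binomial_pmf` and the choice of a fair coin (p = 0.5) are assumptions made for this illustration.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent trials, each with success probability p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 100, 0.5  # 100 flips of a fair coin (assumed)
prob_50 = binomial_pmf(50, n, p)
prob_40_to_60 = sum(binomial_pmf(k, n, p) for k in range(40, 61))

print(f"P(exactly 50 heads) = {prob_50:.4f}")
print(f"P(40 to 60 heads)   = {prob_40_to_60:.4f}")
```

Exactly 50 heads is the single most likely count, yet it is still uncommon. Most of the probability is spread over the counts near 50.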
POISSON Counts of rare, independent events Events occur at a constant average rate (called p here; often written λ) Example: radioactive decay
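A short sketch of the Poisson probability of seeing k events when the average count per interval is `rate`: rate^k e^(-rate) / k!. The rate of 2 decays per interval is an invented value, not a measurement.

```python
from math import exp, factorial

def poisson_pmf(k, rate):
    """P(exactly k events in an interval) when the average count per interval is `rate`."""
    return rate**k * exp(-rate) / factorial(k)

rate = 2.0  # assumed average of 2 decays per counting interval
for k in range(6):
    print(k, round(poisson_pmf(k, rate), 4))
```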
THE MOST IMPORTANT DISTRIBUTION
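NORMAL DISTRIBUTION Central limit theorem: the mean of many independent samples is approximately normally distributed, whatever the underlying distribution (provided its variance is finite) Large sample size → approximately normal distribution Parameters: mean, standard deviation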
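The central limit theorem can be checked with a small simulation. This sketch draws from an exponential distribution (strongly skewed, mean 1, standard deviation 1) and looks at how the sample means behave as the sample size n grows. The distribution, the seed and the function name `sample_means` are choices made for this illustration only.

```python
import random
import statistics

random.seed(0)  # fixed seed so the illustration is reproducible

def sample_means(n, repeats=5000):
    """Means of `repeats` samples, each of n draws from an exponential distribution with mean 1."""
    return [
        statistics.mean(random.expovariate(1.0) for _ in range(n))
        for _ in range(repeats)
    ]

for n in (1, 5, 50):
    means = sample_means(n)
    # The centre should stay near 1 while the spread shrinks roughly like 1 / sqrt(n).
    print(n, round(statistics.mean(means), 3), round(statistics.stdev(means), 3))
```

A histogram of `sample_means(50)` would look close to a bell curve even though individual draws are far from normal.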
PATTERN SEARCHING Hypothesis testing Difference between two populations Z-test or t-test? What does p-value mean? Family-wise error – Bonferroni correction More than two possibilities Chi square test Fisher’s exact test More than two variables ANOVA
EXAMPLE 1 SAT score is related to gender Null hypothesis Alternative hypothesis (3 possibilities) One or two tail? Z or T test? p=0.07, conclusion?
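A hedged sketch of a two-sample z-test for a difference in mean scores. The group means, standard deviations and sample sizes are made up, and the function name `two_sample_z` is an assumption. A z-test is reasonable here only because the hypothetical samples are large; with small samples a t-test would be the usual choice.

```python
from math import erfc, sqrt

def two_sample_z(mean1, sd1, n1, mean2, sd2, n2):
    """Return (z, two-tailed p) for the null hypothesis that both population means are equal."""
    se = sqrt(sd1**2 / n1 + sd2**2 / n2)  # standard error of the difference in means
    z = (mean1 - mean2) / se
    p_two_tailed = erfc(abs(z) / sqrt(2))  # P(|Z| >= |z|) under the null
    return z, p_two_tailed

# Hypothetical summary statistics for two groups of 400 students each.
z, p_two = two_sample_z(1060, 200, 400, 1035, 195, 400)
p_one = p_two / 2  # one-tailed p, valid only if the direction was predicted in advance

print(f"z = {z:.2f}, two-tailed p = {p_two:.3f}, one-tailed p = {p_one:.3f}")
```

With the conventional 0.05 threshold, a two-tailed p-value near 0.07 does not reject the null hypothesis, while the matching one-tailed value would fall below 0.05. This is why the choice of tails has to be made before looking at the data.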
EXAMPLE 2 Predictors of stroke Age Hypertension Gender …
EXAMPLE 3 Genome-wide association studies Scanning markers across the DNA of many people to find genetic variations associated with certain diseases
PATTERN SEARCHING Hypothesis testing One variable Z-test or t-test? What does p-value mean? Family-wise error – Bonferroni correction Compare two categorical variables Chi square test Fisher’s exact test More than two variables ANOVA
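The Bonferroni correction controls the family-wise error rate by dividing the significance level by the number of tests. A minimal sketch, with invented p-values and an assumed helper name `bonferroni`:

```python
def bonferroni(p_values, alpha=0.05):
    """Return (threshold, decisions): each test is significant only if p <= alpha / number of tests."""
    threshold = alpha / len(p_values)
    return threshold, [p <= threshold for p in p_values]

p_values = [0.001, 0.012, 0.03, 0.2]  # hypothetical results of four tests
threshold, significant = bonferroni(p_values)

print("per-test threshold:", threshold)
for p, keep in zip(p_values, significant):
    print(p, "significant" if keep else "not significant")
```

Here p = 0.03 would pass an uncorrected 0.05 cut-off but fails the corrected one. With very many tests, as in genome-wide studies, the per-test threshold becomes extremely small.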
CHI SQUARE Punnett Square A cross between two pea plants yields 880 plants, 639 green, 241 yellow Hypothesis: The green allele is dominant and both parents are heterozygous.
CHI SQUARE
Punnett square for Gg × Gg:
        G             g
G   GG (green)    Gg (green)
g   Gg (green)    gg (yellow)
Expected: 75% green, 25% yellow
CHI SQUARE
                          Green    Yellow
Observed (o)              639      241
Expected (e)              660      220
Deviation (d = o – e)     -21      21
Deviation squared (d^2)   441      441
d^2/e                     0.668    2.005
Sum (chi-square)          2.673
Degrees of freedom: number of categories – 1 = 1
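The chi-square calculation for the pea cross (639 green and 241 yellow out of 880, tested against a 3:1 ratio) can be sketched as follows. The p-value line uses a shortcut that holds only for 1 degree of freedom, where the statistic is the square of a standard normal variable. Variable names are chosen for this illustration.

```python
from math import erfc, sqrt

observed = {"green": 639, "yellow": 241}
expected_ratio = {"green": 0.75, "yellow": 0.25}  # dominant green allele, heterozygous parents

total = sum(observed.values())
expected = {colour: total * ratio for colour, ratio in expected_ratio.items()}

# Sum of (observed - expected)^2 / expected over all categories.
chi_sq = sum((observed[c] - expected[c]) ** 2 / expected[c] for c in observed)
df = len(observed) - 1

# Valid for df == 1 only: P(chi-square > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)).
p_value = erfc(sqrt(chi_sq / 2))

print(f"chi-square = {chi_sq:.3f}, df = {df}, p = {p_value:.3f}")
```

The statistic is below 3.841, the 5% critical value for 1 degree of freedom, so these counts do not contradict the 3:1 hypothesis.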
CHI SQUARE
PREDICTION Regression Linear regression Multiple linear regression Accuracy vs. simplicity Validation leave-k-out
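A small sketch of simple linear regression with leave-one-out validation (leave-k-out with k = 1). Each point is held out in turn, the line is refitted on the rest, and the held-out point is predicted. The data and the function names `fit_line` and `leave_one_out_mse` are invented for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def leave_one_out_mse(xs, ys):
    """Mean squared prediction error when each point is predicted from all the others."""
    squared_errors = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        slope, intercept = fit_line(train_x, train_y)
        prediction = slope * xs[i] + intercept
        squared_errors.append((ys[i] - prediction) ** 2)
    return sum(squared_errors) / len(squared_errors)

# Hypothetical, roughly linear data.
xs = [1, 2, 3, 4, 5, 6]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]

print("fit on all data:", fit_line(xs, ys))
print("leave-one-out MSE:", leave_one_out_mse(xs, ys))
```

The leave-one-out error estimates how the model performs on data it has not seen, which is the quantity that matters when trading accuracy against simplicity.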
EXAMPLE Use brain structural measurements to predict a subject’s performance on picture vocabulary test 144 total structural measurements 521 subjects First step: eliminate unnecessary variables All zeros? Highly correlated pairs Variables that do not correlate well with performance score
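The three screening steps could be sketched as below. This is an illustration under stated assumptions, not the procedure actually used in the study: the feature names, the data and both cut-off values are hypothetical.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sqrt(sxx * syy)

def screen(features, target, pair_cutoff=0.95, target_cutoff=0.3):
    """Return the names of features that survive the three screening steps."""
    # Step 1: drop variables that are all zeros.
    kept = {name: col for name, col in features.items() if any(v != 0 for v in col)}
    # Step 2: within each highly correlated pair, keep only the first variable seen.
    selected = []
    for name, col in kept.items():
        if all(abs(pearson(col, kept[other])) < pair_cutoff for other in selected):
            selected.append(name)
    # Step 3: drop variables that correlate weakly with the performance score.
    return [n for n in selected if abs(pearson(kept[n], target)) >= target_cutoff]

# Hypothetical measurements for five subjects.
features = {
    "volume_a": [1, 2, 3, 4, 5],
    "volume_b": [2, 4, 6, 8, 10.5],  # nearly a copy of volume_a
    "empty": [0, 0, 0, 0, 0],
    "noise": [3, 1, 3, 1, 3],
}
score = [10, 12, 15, 19, 20]

print(screen(features, score))
```

Which member of a correlated pair to keep is a design choice. This sketch keeps whichever appears first.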
EXAMPLE Run regression Validation: leave 1 out and leave 10 out Principal component analysis …
PREDICTION More complicated models: Bayesian approach Use prior knowledge to update prediction Diffusion weights Use local structure to predict neighboring values
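One simple instance of using prior knowledge to update a prediction is the Beta-binomial model for an unknown success probability, for example a seed germination rate. It is shown here only as an illustration of the Bayesian idea. The prior and the counts are invented, and the function names are assumptions.

```python
def update_beta(alpha, beta, successes, failures):
    """Posterior Beta parameters after observing the given successes and failures."""
    return alpha + successes, beta + failures

def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)

prior = (2, 2)  # assumed weak prior belief centred on 0.5
posterior = update_beta(*prior, successes=8, failures=2)  # hypothetical: 8 of 10 seeds germinate

print("prior mean:    ", beta_mean(*prior))
print("posterior mean:", round(beta_mean(*posterior), 3))
```

The posterior mean lies between the prior mean and the observed fraction of 0.8, and it moves closer to the data as more observations arrive.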
STATISTICAL TOOLS Excel, MATLAB, R, Minitab, …
QUESTIONS?
MY OWN RESEARCH Cost-effectiveness analysis Mathematical modeling in medicine Simulate iterations rather than actual patients
RECENT RESULTS
RESULTS
GROUP EXERCISE