Data Analysis Using R: 2. Descriptive Statistics Tuan V. Nguyen Garvan Institute of Medical Research, Sydney, Australia
Overview
– Measurements
– Population vs sample
– Summary of data: mean, variance, standard deviation, standard error
– Graphical analyses
– Transformation
Scales of Measurement
In general, most observable behaviors can be measured on a ratio scale.
In general, many unobservable psychological qualities (e.g., extraversion) are measured on interval scales.
We will mostly concern ourselves with the simple distinction between categorical (nominal) and continuous (ordinal, interval, ratio) variables.
Ordinal Measurement
Ordinal: designates an ordering; quasi-ranking
– Does not assume that the intervals between numbers are equal.
– Example: finishing place in a race (first place, second place)
[Figure: a time axis from 1 to 8 hours with the 1st to 4th finishing places marked]
Interval and Ratio Measurement
Interval: designates an equal-interval ordering
– The distance between, for example, a 1 and a 2 is the same as the distance between a 4 and a 5.
– Example: common IQ tests are assumed to use an interval metric.
Ratio: designates an equal-interval ordering with a true zero point (i.e., the zero implies an absence of the thing being measured)
– Example: number of intimate relationships a person has had. 0 quite literally means none, and a person who has had 4 relationships has had twice as many as someone who has had 2.
Statistics: Enquiry into the unknown
Population: parameter. Sample: estimate.
Estimate the population mean
Population height: mean = 160 cm, standard deviation = 5.0 cm
  ht <- rnorm(10, mean=160, sd=5)
  mean(ht)
  ht <- rnorm(10, mean=160, sd=5)
  mean(ht)
  ht <- rnorm(100, mean=160, sd=5)
  mean(ht)
  ht <- rnorm(1000, mean=160, sd=5)
  mean(ht)
  ht <- rnorm(10000, mean=160, sd=5)
  mean(ht)
  hist(ht)
The larger the sample, the closer the estimate tends to be to the population mean!
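The claim about sample size can be made concrete by repeating the sampling many times for each n and measuring how much the sample means vary. The following is a minimal sketch, not part of the original slides; the seed, the number of repetitions (500) and the names `sizes` and `spread` are assumptions chosen for illustration.

```r
set.seed(1)  # arbitrary seed, used only so the run is reproducible
sizes <- c(10, 100, 1000, 10000)

# For each sample size, draw 500 samples and record the SD of their means
spread <- sapply(sizes, function(n) {
  sd(replicate(500, mean(rnorm(n, mean = 160, sd = 5))))
})

# Compare the simulated spread with the theoretical standard error sd/sqrt(n)
data.frame(n = sizes, sd_of_means = spread, theory = 5 / sqrt(sizes))
```

The simulated spread should track 5/sqrt(n) closely, which is why a tenfold increase in n narrows the estimate by only a factor of about 3.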
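Estimate the population proportion
Population proportion of males = 0.50
Take n samples of k individuals each and record the number of males in each sample: rbinom(n, k, prob)
  males <- rbinom(10, 10, 0.5)
  males
  mean(males)
  males <- rbinom(20, 100, 0.5)
  males
  mean(males)
  males <- rbinom(1000, 100, 0.5)
  males
  mean(males)
mean(males) is the average number of males per sample; dividing it by k gives the estimated proportion.
The larger the sample, the closer the estimate tends to be to the population proportion!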
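Because rbinom() returns counts, the estimated proportion is obtained by dividing by the number of individuals per sample. A short sketch under that reading follows; the seed and the names `size` and `p_hat` are illustrative assumptions.

```r
set.seed(2)                       # arbitrary seed for reproducibility
size  <- 100                      # individuals per sample (k on the slide)
males <- rbinom(1000, size, 0.5)  # number of males in each of 1000 samples

p_hat <- males / size             # convert counts to sample proportions
mean(p_hat)                       # should be close to 0.5
sd(p_hat)                         # should be close to sqrt(0.5 * 0.5 / 100) = 0.05
```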
Summary of Continuous Data
Measures of central tendency:
– Mean, median, mode
Measures of dispersion or variability:
– Variance, standard deviation, standard error
– Interquartile range
R commands: length(x), mean(x), median(x), var(x), sd(x), summary(x)
R example
  height <- rnorm(1000, mean=55, sd=8.2)
  mean(height)
  median(height)
  var(height)
  sd(height)
  summary(height)
summary() reports: Min., 1st Qu., Median, Mean, 3rd Qu., Max.
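The standard error and the interquartile range are listed among the measures of dispersion but have no command in the example. One way to obtain them in base R is sketched below; the seed and the variable names are assumptions, and the simulated values reuse the mean of 55 and SD of 8.2 from the example.

```r
set.seed(3)                            # arbitrary seed for reproducibility
x <- rnorm(1000, mean = 55, sd = 8.2)  # simulated data, as in the example

n   <- length(x)
se  <- sd(x) / sqrt(n)                 # standard error of the mean
iqr <- IQR(x)                          # interquartile range (3rd minus 1st quartile)

c(n = n, mean = mean(x), sd = sd(x), se = se, iqr = iqr)
```

Base R has no function for the statistical mode of continuous data, so it is left out here.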
Implications of the mean and SD
“In the Vietnamese population aged 30+ years, the average weight was 55.0 kg, with an SD of 8.2 kg.”
What does this mean? Assuming weight is normally distributed:
About 68% of individuals will have a weight between 55 +/- 8.2*1 = 46.8 to 63.2 kg
About 95% of individuals will have a weight between 55 +/- 8.2*1.96 = 38.9 to 71.1 kg
Implications of the mean and SD
The distribution of weight in the entire population can be shown to be:
[Figure: distribution curve of weight with the 1 SD and 1.96 SD ranges marked]
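The 68% and 95% ranges quoted for weight can be checked with the normal distribution functions in base R. This is a small sketch that assumes weight is exactly normal with mean 55 and SD 8.2; the variable names are illustrative.

```r
m <- 55    # mean weight (kg)
s <- 8.2   # standard deviation (kg)

range_1sd   <- m + c(-1, 1) * s          # 46.8 to 63.2
range_196sd <- m + c(-1, 1) * 1.96 * s   # about 38.9 to 71.1

# Proportion of a normal population inside each range
cover_1sd   <- pnorm(1) - pnorm(-1)        # about 0.683
cover_196sd <- pnorm(1.96) - pnorm(-1.96)  # about 0.950

range_1sd; range_196sd; cover_1sd; cover_196sd
```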
Summary of Categorical Data
Categorical data:
– Gender: male, female
– Race: Asian, Caucasian, African
Semi-quantitative data:
– Severity of disease: mild, moderate, severe
– Stages of cancer: I, II, III, IV
– Preference: dislike very much, dislike, equivocal, like, like very much
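Mean and variance of a proportion
For an individual consumer i, the probability that he/she prefers A is $p_i$.
Assuming that all consumers are independent and share the same preference probability, $p_i = p$.
The variance of $p_i$ is $\mathrm{var}(p_i) = p(1-p)$.
For a sample of n consumers, the estimated probability of preference for A is
$\bar{p} = \frac{1}{n}\sum_{i=1}^{n} p_i$
and the variance of $\bar{p}$ is
$\mathrm{var}(\bar{p}) = \frac{p(1-p)}{n}$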
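Normal approximation of a binomial distribution
For an individual consumer i, the probability that he/she prefers A is $p_i$.
Assuming that all consumers are independent and share the same preference probability, $p_i = p$.
The variance of $p_i$ is $\mathrm{var}(p_i) = p(1-p)$.
For a sample of n consumers, the estimated probability of preference for A is
$\bar{p} = \frac{1}{n}\sum_{i=1}^{n} p_i$
the variance of $\bar{p}$ is
$\mathrm{var}(\bar{p}) = \frac{p(1-p)}{n}$
and the standard deviation is
$s = \sqrt{\frac{p(1-p)}{n}}$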
Normal approximation of a binomial distribution - example
10 consumers, 8 preferred product A.
Proportion of preference for A: p = 0.8
Variance: var(p) = 0.8(0.2)/10 = 0.016
Standard deviation of p: s = sqrt(0.016) = 0.126
95% CI of p: 0.8 +/- 1.96(0.126) = 0.55 to 1.00 (the upper limit, 1.05, is truncated at 1)
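The arithmetic of this example can be reproduced in a few lines of R. The sketch below uses the slide's numbers (8 of 10 consumers); the variable names are assumptions, and the exact interval from `binom.test()` is added only as a point of comparison because the normal approximation is rough with n = 10.

```r
k <- 8    # consumers preferring A
n <- 10   # consumers asked

p <- k / n               # 0.8
v <- p * (1 - p) / n     # variance of the proportion: 0.016
s <- sqrt(v)             # standard deviation: about 0.126

ci <- p + c(-1, 1) * 1.96 * s   # normal-approximation 95% CI
ci <- pmin(pmax(ci, 0), 1)      # a proportion cannot leave [0, 1]
ci

# Exact (Clopper-Pearson) interval, for comparison
exact_ci <- binom.test(k, n)$conf.int
exact_ci
```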
Descriptive Analyses: Continuous data
Paired t-test
– Continuous data
– Normally distributed
– Two samples are NOT independent
Paired t-test – an example
The problem: Viewing certain meats under red light might enhance judges' preferences for meat. 12 judges were asked to score the redness of meat under red light and under white light.
Results:
  Judge  Red  White
  1      20   22
  2      18   19
  3      19   17
  4      22   18
  5      17   21
  6      20   23
  7      19   19
  8      16   20
  9      21   22
  10     17   20
  11     23   27
  12     18   24
Paired t-test – analysis
Per-judge differences (white light minus red light): 2, 1, -2, -4, 4, 3, 0, 4, 1, 3, 4, 6
Mean difference: 1.83, SD: 2.82
Standard error (SE): SD/sqrt(n) = 2.82/sqrt(12) = 0.81
t = (1.83 – 0)/0.81 ≈ 2.25
P-value < 0.05 (t exceeds 2.20, the two-sided 5% critical value for 11 degrees of freedom)
Conclusion: there was a significant effect of light colour.
Paired t-test – R analysis
  red <- c(20,18,19,22,17,20,19,16,21,17,23,18)
  white <- c(22,19,17,18,21,23,19,20,22,20,27,24)
  t.test(red, white, paired=TRUE)
Output:
  data: red and white
  t = …, df = 11, p-value = …
  alternative hypothesis: true difference in means is not equal to 0
  95 percent confidence interval: …
  sample estimates: mean of the differences …
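The hand calculation of the paired t-test can be checked against `t.test()` step by step. The sketch below uses the scores of the 12 judges; the names `d`, `se`, `t_stat` and `p_val` are illustrative, and the difference is taken as white minus red so that the statistic is positive.

```r
red   <- c(20, 18, 19, 22, 17, 20, 19, 16, 21, 17, 23, 18)
white <- c(22, 19, 17, 18, 21, 23, 19, 20, 22, 20, 27, 24)

d  <- white - red            # per-judge differences
n  <- length(d)              # 12 judges
se <- sd(d) / sqrt(n)        # standard error of the mean difference

t_stat <- mean(d) / se                      # t statistic on n - 1 df
p_val  <- 2 * pt(-abs(t_stat), df = n - 1)  # two-sided p-value

# The built-in test should agree with the manual calculation
fit <- t.test(white, red, paired = TRUE)
c(manual_t = t_stat, builtin_t = unname(fit$statistic),
  manual_p = p_val, builtin_p = fit$p.value)
```

Swapping the order to `t.test(red, white, paired = TRUE)` only changes the sign of t and of the confidence limits.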
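Two-sample t-test
  Sample       Group 1    Group 2
  1            $x_1$      $y_1$
  2            $x_2$      $y_2$
  3            $x_3$      $y_3$
  4            $x_4$      $y_4$
  5            $x_5$      $y_5$
  …
  n            $x_n$      $y_n$
  Sample size  $n_1$      $n_2$
  Mean         $\bar{x}$  $\bar{y}$
  SD           $s_x$      $s_y$
Mean difference: $D = \bar{x} - \bar{y}$
Variance of D (without assuming equal variances): $s_D^2 = \frac{s_x^2}{n_1} + \frac{s_y^2}{n_2}$
T-statistic: $t = D / s_D$
95% confidence interval (large samples): $D \pm 1.96\, s_D$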
Two-group comparison: an example
Consumers rated their preference for two rice desserts (A and B).
Table columns: ID, A, B.
Unpaired t-test using R
  a <- c(3,7,1,9,3,4,1,2,6,7,5,8,5,9,4,6,4,3,9,5)
  b <- c(3,1,2,4,5,2,2,5,3,2,3,4,2,3,5,4,3,1,3,2)
  t.test(a, b)
Output:
  Welch Two Sample t-test
  data: a and b
  t = …, df = …, p-value = …
  alternative hypothesis: true difference in means is not equal to 0
  95 percent confidence interval: …
  sample estimates: mean of x …, mean of y …
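By default `t.test()` runs the Welch test, which does not assume equal variances. As a rough sketch of what happens inside, the statistic and its degrees of freedom can be computed by hand from the two rating vectors; the names `va`, `vb`, `se` and `df` are illustrative, and the degrees of freedom follow the Welch-Satterthwaite approximation.

```r
a <- c(3, 7, 1, 9, 3, 4, 1, 2, 6, 7, 5, 8, 5, 9, 4, 6, 4, 3, 9, 5)
b <- c(3, 1, 2, 4, 5, 2, 2, 5, 3, 2, 3, 4, 2, 3, 5, 4, 3, 1, 3, 2)

va <- var(a) / length(a)   # variance of the mean of a
vb <- var(b) / length(b)   # variance of the mean of b

D      <- mean(a) - mean(b)   # mean difference
se     <- sqrt(va + vb)       # standard error of D
t_stat <- D / se

# Welch-Satterthwaite degrees of freedom
df <- (va + vb)^2 / (va^2 / (length(a) - 1) + vb^2 / (length(b) - 1))
p_val <- 2 * pt(-abs(t_stat), df)

fit <- t.test(a, b)   # built-in Welch test for comparison
c(D = D, t = t_stat, df = df, p = p_val)
```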
Transformation of data: multiplicative effects
The following data represent lysozyme levels in the gastric juice of 29 patients with peptic ulcer and of 30 normal controls. The question of interest is whether lysozyme levels differ between the two groups.
Group 1: patients with peptic ulcer (n = 29)
Group 2: normal controls (n = 30)
Unpaired t-test by R
  g1 <- c(0.2, 0.3, 0.4, 1.1, 2.0, 2.1, 3.3, 3.8, 4.5, 4.8, 4.9, 5.0, 5.3, 7.5, 9.8, 10.4, 10.9, 11.3, 12.4, 16.2, 17.6, 18.9, 20.7, 24.0, 25.4, 40.0, 42.2, 50.0, 60)
  g2 <- c(0.2, 0.3, 0.4, 0.7, 1.2, 1.5, 1.5, 1.9, 2.0, 2.4, 2.5, 2.8, 3.6, 4.8, 4.8, 5.4, 5.7, 5.8, 7.5, 8.7, 8.8, 9.1, 10.3, 15.6, 16.1, 16.5, 16.7, 20.0, 20.7, 33.0)
  t.test(g1, g2)
Output:
  data: g1 and g2
  t = …, df = …, p-value = …
  alternative hypothesis: true difference in means is not equal to 0
  95 percent confidence interval: …
  sample estimates: mean of x …, mean of y …
Exploration of data
  par(mfrow=c(1,2))
  hist(g1)
  hist(g2)
Group 1: mean(g1) = 14.3, sd(g1) = 15.7
Group 2: mean(g2) = 7.7, sd(g2) = 7.8
Re-analysis of lysozyme data
  log.g1 <- log(g1)
  log.g2 <- log(g2)
  t.test(log.g1, log.g2)
Output:
  data: log.g1 and log.g2
  t = 1.406, df = …, p-value = …
  alternative hypothesis: true difference in means is not equal to 0
  95 percent confidence interval: …
  sample estimates: mean of x …, mean of y …
exp(mean of log.g1 minus mean of log.g2) = 1.67
Group 1's geometric mean is 67% higher than group 2's.
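Exponentiating a difference of log means gives a ratio of geometric means, and the same back-transformation applies to the confidence limits. The sketch below shows one way to do this with the lysozyme data; the names `diff_log`, `ratio` and `ratio_ci` are illustrative.

```r
g1 <- c(0.2, 0.3, 0.4, 1.1, 2.0, 2.1, 3.3, 3.8, 4.5, 4.8, 4.9, 5.0, 5.3,
        7.5, 9.8, 10.4, 10.9, 11.3, 12.4, 16.2, 17.6, 18.9, 20.7, 24.0,
        25.4, 40.0, 42.2, 50.0, 60)
g2 <- c(0.2, 0.3, 0.4, 0.7, 1.2, 1.5, 1.5, 1.9, 2.0, 2.4, 2.5, 2.8, 3.6,
        4.8, 4.8, 5.4, 5.7, 5.8, 7.5, 8.7, 8.8, 9.1, 10.3, 15.6, 16.1,
        16.5, 16.7, 20.0, 20.7, 33.0)

fit <- t.test(log(g1), log(g2))   # Welch test on the log scale

# Difference of log means, then back-transform to a ratio
diff_log <- unname(fit$estimate[1] - fit$estimate[2])
ratio    <- exp(diff_log)               # ratio of geometric means
ratio_ci <- exp(as.numeric(fit$conf.int))  # 95% CI for that ratio

ratio
ratio_ci
```

If the interval for the ratio includes 1, the data are compatible with no multiplicative difference between the groups.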
Descriptive Analyses: Categorical data
Comparison of two proportions - theory
  Group                  1        2
  Sample size            $n_1$    $n_2$
  Number of events       $e_1$    $e_2$
  Proportion of events   $p_1$    $p_2$
Difference: $D = p_1 - p_2$
SE of the difference: $SE = \sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}}$
$Z = D / SE$
95% CI: $D \pm 1.96\,(SE)$
With $(n_1 + n_2) > 20$, and if $|Z| > 2$, it is possible to reject the null hypothesis.
Comparison of two proportions - example
Thirty-day mortality rate of rats exposed to heroin or cocaine (100 rats per group):
  Group              Heroin   Cocaine
  Sample size        100      100
  Number of deaths   90       36
  Mortality rate     0.90     0.36
Analysis
Difference: D = 0.90 – 0.36 = 0.54
SE(D) = [0.9(0.1)/100 + 0.36(0.64)/100]^(1/2) = 0.057
Z = 0.54 / 0.057 = 9.5
95% CI: 0.54 +/- 1.96(0.057) = 0.43 to 0.65
Conclusion: reject the null hypothesis.
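The same calculation can be written out in R directly from the counts. This is a minimal sketch of the unpooled z-test described on the slide; the variable names are assumptions, and the results will differ slightly from `prop.test()`, which applies a continuity correction by default.

```r
deaths <- c(90, 36)     # heroin, cocaine
total  <- c(100, 100)

p  <- deaths / total                     # mortality rates: 0.90 and 0.36
D  <- p[1] - p[2]                        # difference: 0.54
se <- sqrt(sum(p * (1 - p) / total))     # unpooled standard error
z  <- D / se                             # z statistic
ci <- D + c(-1, 1) * 1.96 * se           # 95% confidence interval

c(D = D, SE = se, Z = z)
ci
```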
Comparison of two proportions - R
  events <- c(90, 36)
  total <- c(100, 100)
  prop.test(events, total)
Output:
  2-sample test for equality of proportions with continuity correction
  data: events out of total
  X-squared = …, df = 1, p-value = 8.341e-15
  alternative hypothesis: two.sided
  95 percent confidence interval: …
  sample estimates: prop 1 …, prop 2 …
Comparison of >2 proportions – Chi-square analysis
  table(sex, ethnicity)
          ethnicity
  sex      African  Asian  Caucasian  Others
  Female   4        43     22         0
  Male     4        17     8          2

  females <- c(4, 43, 22, 0)
  total <- c(8, 60, 30, 2)
  prop.test(females, total)
Comparison of >2 proportions – Chi-square analysis
Output:
  4-sample test for equality of proportions without continuity correction
  data: females out of total
  X-squared = …, df = 3, p-value = …
  alternative hypothesis: two.sided
  sample estimates: prop 1 …, prop 2 …, prop 3 …, prop 4 …
  Warning message: Chi-squared approximation may be incorrect in: prop.test(females, total)
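The warning appears because some cells have small expected counts, which makes the chi-squared approximation unreliable. One common response, offered here as a suggestion rather than as the slides' own method, is to inspect the expected counts and fall back on Fisher's exact test. The table name `tab` is illustrative.

```r
females <- c(4, 43, 22, 0)
total   <- c(8, 60, 30, 2)

# Rebuild the sex-by-ethnicity table from the counts
tab <- rbind(Female = females, Male = total - females)
colnames(tab) <- c("African", "Asian", "Caucasian", "Others")

# Expected counts under independence; values below 5 trigger the warning
chi <- suppressWarnings(chisq.test(tab))
expected <- chi$expected
expected

# Fisher's exact test does not rely on the chi-squared approximation
exact <- fisher.test(tab)
exact$p.value
```

For a 2 x k table, `chisq.test()` gives the same X-squared statistic as `prop.test()` on the corresponding counts.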
Summary
Examine the distribution of data
– Mean and variance: systematic difference?
– Normally distributed? Transformation?
Present confidence intervals (and p-values)