Statistics.


1 Statistics

2 A Word on Statistics - Wislawa Szymborska
Out of every hundred people, those who always know better: fifty-two. Unsure of every step: almost all the rest. Ready to help, if it doesn't take long: forty-nine. Always good, because they cannot be otherwise: four -- well, maybe five. Able to admire without envy: eighteen. Led to error by youth (which passes): sixty, plus or minus. Those not to be messed with: four-and-forty. Living in constant fear of someone or something: seventy-seven. Capable of happiness: twenty-some-odd at most. Harmless alone, turning savage in crowds: more than half, for sure. Wislawa Szymborska won the 1996 Nobel Prize for literature. Her most recent book in English is View With a Grain of Sand (1995). She lives in Krakow, Poland. Joanna Trzeciak, the translator of Szymborska's poem in this issue, is currently at work on a collection of translations of Szymborska's poetry. Copyright © 1997 by The Atlantic Monthly Company. All rights reserved. The Atlantic Monthly; May 1997; A Word on Statistics; Volume 279, No. 5; page 68.

3 Those who are just: quite a few, thirty-five
Those who are just: quite a few, thirty-five. But if it takes effort to understand: three. Worthy of empathy: ninety-nine. Mortal: one hundred out of one hundred -- a figure that has never varied yet. Cruel when forced by circumstances: it's better not to know, not even approximately. Wise in hindsight: not many more than wise in foresight. Getting nothing out of life except things: thirty (though I would like to be wrong). Balled up in pain and without a flashlight in the dark: eighty-three, sooner or later.

4

5

6

7

8 Today Introduction to statistics
Looking at our qualitative data in a quantitative way More exploration of the data Presentations Tutorials

9 Why statistics are important
Statistics are concerned with difference – how much does one feature of an environment differ from another? Suicide rates/100,000 people

10 Why statistics are important
Relationships – how much does one feature of the environment change as another measure changes? The response of the fear centre of white people to black faces depending on their exposure to diversity as adolescents

11 The two tasks of statistics
Magnitude: What is the size of the difference or the strength of the relationship?
Reliability: To what degree can the measures of the magnitude of variables be replicated with other samples drawn from the same population?

12 Magnitude – what’s our measure?
Raw number? Rate? Some aggregate of numbers? Mean, median, mode? Suicide rates/100,000 people

13 Arithmetic mean or average
The mean (M or X̄) is the sum (ΣX) of all the sample values (X1 + X2 + X3 + … + XN) divided by the sample size (N). Mean/average = ΣX/N
Carbon footprint scores: 63 71 75 78 80 85 64 72 79 81 66 73 67 83 86 68 74 76 84 89 70 90 77 92
Carbon footprint scores: 63 = 1.7 planet earths, 92 = 2.7 planet earths; = 1.9 – 2.1. Yellow = Polynesian participants

14 Compute the mean
             Total   Polynesian   Other
Total (ΣX)   3483    971          2512
N            45      13           32
Mean         77.4    74.7         78.5
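As a quick check of the table above, the following Python sketch recomputes each mean as ΣX/N from the totals and sample sizes on the slide. The names `groups` and `mean_from_total` are illustrative only.

```python
# Totals (sum of X) and sample sizes (N) as given on the slide.
groups = {
    "Total": (3483, 45),
    "Polynesian": (971, 13),
    "Other": (2512, 32),
}

def mean_from_total(total, n):
    """Arithmetic mean: the sum of the values divided by how many there are."""
    return total / n

# Round to one decimal place, as the slide does.
means = {name: round(mean_from_total(total, n), 1)
         for name, (total, n) in groups.items()}
print(means)
```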

15 The median
The median is the "middle" value of the sample: there are as many sample values above the median as below it.
If the number (N) in the sample is odd, the median is the value at position (N-1)/2+1 when the sample is ordered from smallest to largest. E.g. if N=45, the median is the value at the (45-1)/2+1 = 23rd position.
If the sample size is even, the median is the average of the values at positions N/2 and N/2+1. E.g. if N=32, the median is the average of the values at the 32/2 (16th) and 32/2+1 (17th) positions.
Why use the median? Where data is skewed, the mean gives a false idea of where most of the data lies; the median gives a more useful idea.
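The position rule for odd and even N can be sketched in Python as below. The function name `median_by_position` and the two small samples are hypothetical and not taken from the slides.

```python
def median_by_position(values):
    """Median found by position in the sample sorted from smallest to largest."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        position = (n - 1) // 2 + 1      # 1-based position of the middle value
        return ordered[position - 1]     # convert to 0-based index
    lower = ordered[n // 2 - 1]          # value at position N/2
    upper = ordered[n // 2]              # value at position N/2 + 1
    return (lower + upper) / 2

# Hypothetical scores, for illustration only.
print(median_by_position([70, 64, 81, 75, 66]))  # odd N: prints 70
print(median_by_position([70, 64, 81, 75]))      # even N: prints 72.5
```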

16 Other measures of central tendency
The mode is the single most frequently occurring data value. If two or more values occur equally frequently, the data set is called bi-modal, tri-modal, etc.
The midrange is the midpoint of the sample: the average of the smallest and largest data values in the sample.
The geometric mean (log transformation) and the harmonic mean (inverse transformation) are both used where data is skewed, with the aim of creating a more even distribution.
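These measures are all available in, or easy to build from, Python's standard `statistics` module (Python 3.8 or later is assumed). The sketch below uses a hypothetical right-skewed sample, not data from the slides, to show how the skew pulls the arithmetic mean above the geometric and harmonic means.

```python
import statistics

# Hypothetical, right-skewed sample (one large value); not from the slides.
scores = [2, 3, 3, 4, 4, 5, 20]

modes = statistics.multimode(scores)            # every value tied for most frequent
midrange = (min(scores) + max(scores)) / 2      # midpoint of smallest and largest
arithmetic = statistics.mean(scores)            # ordinary mean
geometric = statistics.geometric_mean(scores)   # mean taken on the log scale
harmonic = statistics.harmonic_mean(scores)     # mean taken on the reciprocal scale

print(modes, midrange, arithmetic, geometric, harmonic)
```

With two values tied for most frequent, `multimode` returns both, which is the bi-modal case described above.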

17 Compute the median and mode
63 71 75 78 80 85 64 72 79 81 66 73 67 83 86 68 74 76 84 89 70 90 77 92

18 Mean, median, mode, mid-range
             Total        Polynesian   Other
Total (ΣX)   3483         971          2512
N            45           13           32
Mean         77.4         74.7         78.5
Median       77           75
Mode         75, 79, 84   81           84
Midrange

19

20 The underlying distribution of the data

21 Normal distribution

22 Three things we must know before we can say events are different
The difference in mean scores of two or more events - the bigger the gap between means the greater the difference
The degree of variability in the data - the less variability the better

23 Variance and Standard Deviation
These are estimates of the spread of data. They are calculated by measuring the distance between each data point and the mean.
The variance (s²) is the average of the squared deviations of each sample value from the mean: s² = Σ(X-M)²/(N-1)
The standard deviation (s) is the square root of the variance.

24 Standard deviation = sx
Calculating the variance and the standard deviation for the Polynesian sample
x       (x-Mx)   (x-Mx)²
64      -10.7    114.3
66      -8.7     75.6
67      -7.7     59.2
70      -4.7     22.0
71      -3.7     13.6
74      -0.7     0.5
75      0.3      0.1
77      2.3      5.3
79      4.3      18.6
80      5.3      28.2
81      6.3      39.8
81      6.3      39.8
86      11.3     127.9
Total   971               544.8
Mean (Mx) = 74.7
Variance = sx² = 41.9
Nx = 13
Standard deviation = sx = 6.5
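The calculation can be reproduced with Python's `statistics` module. This is a sketch under one assumption: the list below takes the score 81 to occur twice, which is what the stated total of 971 and N = 13 imply. Both divisors are shown, because dividing the sum of squares by N gives the 41.9 reported on this slide, while dividing by N-1 gives the 45.397 reported in the t-test output later in the deck.

```python
import statistics

# Polynesian scores; the second 81 is inferred from Total = 971 and N = 13.
polynesian = [64, 66, 67, 70, 71, 74, 75, 77, 79, 80, 81, 81, 86]

mean = statistics.mean(polynesian)
sum_of_squares = sum((x - mean) ** 2 for x in polynesian)

var_n = statistics.pvariance(polynesian)         # sum of squares / N
var_n_minus_1 = statistics.variance(polynesian)  # sum of squares / (N - 1)
sd_n = statistics.pstdev(polynesian)             # square root of var_n
sd_n_minus_1 = statistics.stdev(polynesian)      # square root of var_n_minus_1

print(round(sum_of_squares, 1), round(var_n, 1), round(var_n_minus_1, 3))
```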

25 All normal distributions have similar properties
The percentage of the scores that is between one standard deviation (s) below the mean and one standard deviation above is always about 68.27%
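This property can be checked numerically with `statistics.NormalDist` (Python 3.8 or later is assumed). The sketch uses the standard normal distribution; the same proportions hold for any mean and standard deviation.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, standard deviation 1

# Proportion of scores within one and within two standard deviations of the mean.
within_one_sd = z.cdf(1) - z.cdf(-1)
within_two_sd = z.cdf(2) - z.cdf(-2)

print(f"{within_one_sd:.2%} within 1 SD, {within_two_sd:.2%} within 2 SD")
```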

26 Is there a difference between Polynesian and “other” scores

27 Is there a significant difference between Polynesian and “other” scores

28 Three things we must know before we can say events are different
The extent to which the sample is representative of the population from which it is drawn - the bigger the sample the greater the likelihood that it represents the population from which it is drawn - small samples have unstable means. Big samples have stable means.

29 Estimating difference
The measure of stability of the mean is the Standard Error of the Mean = standard deviation/the square root of the number in the sample. So the stability of the mean is determined by the variability in the sample (this can be affected by the consistency of measurement) and the size of the sample. The standard error of the mean (SEM) is the standard deviation of the distribution of sample means we would get if we were to draw a sample and measure its mean again and again.

30 Yes it’s significant.
The mean of the smaller sample is not too variable. Its Standard Error of the Mean = 6.5/√13 ≈ 1.8
The 95% confidence interval = 1.96 SDs ≈ 3.5. This gives a range from 71.2 to 78.2
The “Other” mean (78.5) falls just outside this confidence interval
Figure: Distribution of the standard error of the mean (Polynesian Mean = 74.7, SD = 6.5, N = 13)
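The arithmetic on this slide can be sketched as follows, using the Polynesian mean, SD and N quoted above and the "Other" mean of 78.5 from the earlier table. Variable names are illustrative.

```python
import math

mean_poly, sd_poly, n_poly = 74.7, 6.5, 13   # Polynesian sample, from the slide
mean_other = 78.5                            # "Other" mean, from the earlier table

sem = sd_poly / math.sqrt(n_poly)            # standard error of the mean
half_width = 1.96 * sem                      # 95% of a normal lies within 1.96 SDs
ci_low = mean_poly - half_width
ci_high = mean_poly + half_width

# Does the "Other" mean fall outside the Polynesian 95% confidence interval?
outside = not (ci_low <= mean_other <= ci_high)
print(round(sem, 1), round(ci_low, 1), round(ci_high, 1), outside)
```

Note that this compares one mean against the other sample's interval, as the slides do; the t-test later in the deck also takes the second sample's variability into account.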

31 Is the difference between means significant?
What is clear is that the mean of the Other group is just outside the area where there is a 95% chance that the mean for the Polynesian group will fall, so it is likely that the Other mean comes from a different population than the Polynesian mean. The convention is to say that if mean 2 falls outside the area (the confidence interval) where 95% of mean 1 scores are estimated to be, then mean 2 is significantly different from mean 1. We say that the probability of getting a difference this large if mean 1 and mean 2 came from the same population is less than 0.05 (p<0.05), and that the difference is significant.

32 The significance of significance
Not an opinion
A sign that very specific criteria have been met
A standardised way of saying that:
- There is a difference between two groups – p<0.05;
- There is no difference between two groups – p>0.05;
- There is a predictable relationship between two groups – p<0.05; or
- There is no predictable relationship between two groups – p>0.05.
A way of getting around the problem of variability

33 One and two tailed tests
If you argue for a one tailed test, saying the difference can only be in one direction, then you can add the 2.5% error from the side where no data is expected to the side where it is.
Figure: M1 distribution on a standard-deviation axis, with regions labelled "2.5% of M1 distribution" and "95% of M1 distribution"

34 Distribution of Standard error of the mean
If we were to argue for a one tailed test (that Polynesian people were more eco-sustainable than the Others), the 95% confidence interval can all be to the left of the SEM distribution rather than equally distributed on either side. This means that instead of going to the 47.5% line on the right we go to the 45% line = 1.65 SDs or 3.0 units.
Figure: Normal distribution of the standard error of the mean (Polynesian Mean = 74.7, SD = 6.5, N = 13)
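The 1.96 and 1.65 multipliers come from the normal distribution, and the one-tailed cutoff follows from them. A minimal sketch, assuming Python 3.8 or later for `statistics.NormalDist` and reusing the Polynesian mean, SD and N from the slides:

```python
import math
from statistics import NormalDist

standard_normal = NormalDist()
z_two_tailed = standard_normal.inv_cdf(0.975)  # 2.5% of error in each tail
z_one_tailed = standard_normal.inv_cdf(0.95)   # all 5% of error in one tail

sem = 6.5 / math.sqrt(13)                      # standard error of the mean
one_tailed_units = z_one_tailed * sem          # cutoff distance in score units
one_tailed_cutoff = 74.7 + one_tailed_units    # upper limit for the Polynesian mean

print(round(z_two_tailed, 2), round(z_one_tailed, 2), round(one_tailed_units, 1))
```

Because the one-tailed cutoff sits closer to the mean than the two-tailed one, the "Other" mean of 78.5 lies further beyond it.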

35 T-tests
t = (Mx-My)/√(Sx²/Nx + Sy²/Ny), where t is the value generated and:
Mx = the mean carbon footprint of participants with higher incomes
My = the mean carbon footprint of participants with moderate to low incomes
Sx² = the variance of the carbon footprint of participants with higher incomes
Sy² = the variance of the carbon footprint of participants with moderate to low incomes
Nx = the number of participants with higher incomes
Ny = the number of participants with moderate to low incomes
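A minimal Python sketch of this statistic, assuming the denominator is the square root of the summed variance terms (the standard form for two samples with unequal variances). The function name `unequal_variance_t` is hypothetical; the inputs are the Polynesian and Other summary statistics quoted in the t-test table elsewhere in this deck.

```python
import math

def unequal_variance_t(mean_x, mean_y, var_x, var_y, n_x, n_y):
    """t = (Mx - My) / sqrt(Sx^2/Nx + Sy^2/Ny)."""
    standard_error = math.sqrt(var_x / n_x + var_y / n_y)
    return (mean_x - mean_y) / standard_error

# Means, variances and sample sizes for the Polynesian and Other groups.
t_stat = unequal_variance_t(74.69, 78.50, 45.397, 44.26, 13, 32)
print(round(t_stat, 2))
```

The sign of t only reflects which group is entered first; its size is what gets compared with the critical value.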

36 t-Test: Two-Sample Assuming Unequal Variances
T-test result. This does exactly what we have done except it argues that in every sample the first data point is fixed and that other data points are free to vary in relation to it. Consequently, when estimating variance we should divide by (N-1) not N. That makes this test more conservative.
t-Test: Two-Sample Assuming Unequal Variances
                               Polynesian   Other
Mean                           74.69        78.50
Variance                       45.397       44.26
Observations                   13           32
Hypothesized Mean Difference
Degrees of freedom (df)        43
t Stat                         -1.73
p(T<=t) one-tail               0.045
t Critical one-tail            + or -1.68
p(T<=t) two-tail               0.090
t Critical two-tail            + or

37 Impact of gender on safety
t-Test: Two-Sample Assuming Unequal Variances
                               women   men
Mean                           2.05    1.58
Variance                       0.95    0.99
Observations                   21.00   12.00
Hypothesized Mean Difference   0.00
df                             23.00
t Stat                         1.30
P(T<=t) one-tail               0.10
t Critical one-tail            1.71
P(T<=t) two-tail               0.21
t Critical two-tail            2.07

38 Impact of religion on safety
t-Test: Two-Sample Assuming Unequal Variances
                               no religion   religion
Mean                           1.81          2.00
Variance                       1.23          0.86
Observations                   16.00         15.00
Hypothesized Mean Difference   0.00
df                             29.00
t Stat                         -0.51
P(T<=t) one-tail               0.31
t Critical one-tail            1.70
P(T<=t) two-tail               0.61
t Critical two-tail            2.05

39 Impact of work on safety
t-Test: Two-Sample Assuming Unequal Variances
                               work in MPHS   not working in MPHS
Mean                           2.43           1.47
Variance                       0.26           1.15
Observations                   14.00          19.00
Hypothesized Mean Difference   0.00
df                             27.00
t Stat                         3.39
P(T<=t) one-tail
t Critical one-tail            1.70
P(T<=t) two-tail
t Critical two-tail            2.05

40 Correlations and Chi-square

41 The correlation with the glacier went unnoticed
The correlation with the glacier went unnoticed. The debate proceeded and receded  with slow heated monotonous cold regularity although never reversing  at the same point of disagreement. The correlation with the glacier went. . .  The weight of paper and opinion now far-exceeding the frozen mountain, even at its zenith. But no amount of FSC vellum  could paper over the crevasse cracked argument. The correlation with the glacier   The blue-green water vein bled  But no aerial artery replenished the source. The constant melt etching the message of increased bloodletting from the waning carcase

42 The correlation with the. Lost in the science of the unknown
The correlation with the Lost in the science of the unknown. The pre-historic signpost, scarred by graffiti, slowly shrank and collapsed Its incremental deficit matched by political will. The correlation We are, we were, the new dinosaurs, like the sun-burnt beached berg doomed for demise in the new non-ice age. No-one will record its disappearance or ours. The correlation with humanity went unnoticed. Correlation by John S

43 Chi-square test - comparing MPHS samples with the local populations
The chi-square test:
Looks at the magnitude or size of the difference between observed and expected values (O-E), and then squares those differences so they are all positive: (O-E)².
Adjusts those differences so they are relative to the size of the expected values: (O-E)²/E. This is a variance measure and takes care of effects that are due to the size of the expected value, which in turn is related to the sample size.
Calculates a chi-square value, which is the sum of the adjusted differences (Σ(O-E)²/E = 14.03). This is compared with the value that chi-square would have to reach to be significant for the number of categories used (n).
The question: Is the MPHS sample representative of the cultural mix of the MPHS population?

44 What would we predict?
Age     MPHS sample (n)   MPHS sample (%)   MPHS population (%)
18-30   9                 27%               48%
31-40   13                39%               22%
41-50   9                 27%               18%
>50     2                 6%                13%
Total   33                100%
In red are the number of participants we would predict (we EXPECT) based on the percent in each category in the MPHS population (2006). In blue is what we got (we OBSERVED). Is the match sufficiently close?

45 Does the MPHS sample match the population age distribution?
Age     O    E    O-E     (O-E)²   (O-E)²/E
18-30   9    16   -6.69   44.72    2.85
31-40   13   7    5.82    33.89    4.72
41-50   9    6    3.09    9.54     1.61
>50     2    4    -2.22   4.94     1.17
chi-square = 10.35
Degrees of freedom = N-1 = 3, where N = the number of categories (parameters), not the number of participants
Value of chi-square (χ²) for p<0.05 = 7.81
Actual χ² is more than 7.81, therefore there is a significant difference between the MPHS sample and MPHS population
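The age comparison can be reproduced in a few lines of Python. Two assumptions are made here: the unrounded expected counts are recovered from the slide as O minus (O-E), and the observed count for the 41-50 group is taken to be 9, which is what the sample total of 33 implies. The critical value 7.81 is the one quoted on the slide for 3 degrees of freedom.

```python
# Observed counts per age group (18-30, 31-40, 41-50, >50).
observed = [9, 13, 9, 2]
# Expected counts recovered as O - (O - E) from the slide's difference column.
expected = [15.69, 7.18, 5.91, 4.22]

# Sum of squared differences, each scaled by its expected count.
chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

degrees_of_freedom = len(observed) - 1   # number of categories minus one
critical_value = 7.81                    # from the slide, p < 0.05 with df = 3
significant = chi_square > critical_value

print(round(chi_square, 2), degrees_of_freedom, significant)
```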

46 Does the MPHS sample match the population age distribution?
Children               O    E    O-E     (O-E)²   (O-E)²/E
No Children            6    10   -3.79   14.33    1.46
One Child              10   5    5.00    25.02    5.01
Two Children           7    7    0.08    0.01     0.00
Three Children                           2.12     0.47
Four Children          1    2    -1.48   2.19     0.88
Five Children                    -0.08
Six or More Children   0    1    -1.19   1.41     1.19
chi-square = 9.02
Degrees of freedom = N-1 = 6, where N = the number of categories (parameters), not the number of participants
Value of chi-square (χ²) for p<0.05 = 12.59
Actual χ² is less than 12.59, therefore there is no significant difference between the MPHS sample and MPHS population

47 r=0.904 N=33 p<0.00

48 Correlations
r = Σ((X – MX)(Y – MY))/(N × SX × SY)
X = GDP purchasing power in $'000s
Y = Better Life Index (0-10)
MX = mean of X = 25,200
MY = mean of Y = 6.34
SX = standard deviation of X = 7.02
SY = standard deviation of Y = 1.44
r = correlation coefficient = +0.90
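The correlation formula can be sketched directly in Python. Because the formula divides by N, the standard deviations used here are the divide-by-N (population) versions. The function name `pearson_r` and the small data set are hypothetical; they are not the GDP and Better Life Index data behind the slide.

```python
from statistics import mean, pstdev

def pearson_r(xs, ys):
    """r = sum((X - MX) * (Y - MY)) / (N * SX * SY), with divide-by-N SDs."""
    n = len(xs)
    mx, my = mean(xs), mean(ys)
    sx, sy = pstdev(xs), pstdev(ys)
    cross_products = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cross_products / (n * sx * sy)

# Hypothetical illustration: wealth in $'000s against a 0-10 index.
wealth = [12, 18, 25, 31, 40]
index = [4.9, 5.6, 6.1, 7.0, 7.4]
r = pearson_r(wealth, index)
print(round(r, 2))
```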

49 What degrees of freedom? df = N-2 = 33-2 = 31
One or two tails? Have we made a prior prediction? Yes, that life satisfaction will increase with wealth = 1 tailed test
What degrees of freedom? For a correlation, df = N-2 = 33-2 = 31
What level of significance should be chosen? It depends on the number of correlations. p<0.05, as there is only one correlation. Often there are hundreds, in which case a tougher criterion should be chosen.
Where can we find the critical values of r? In a table of critical values of the correlation coefficient.

50 Correlation felt safety and people in the household
                            Children   Adults   Total people in household   Total safety
Children                    1.00
Adults                      0.18       1.00
Total people in household   0.70       0.83     1.00
Total safety                -0.09      -0.24    -0.22                       1.00
Critical value of r for p<0.05, df=30: r=0.349

51

52 Correlation and regression
Correlation quantifies the degree to which two random variables are related. Correlation does not fit a line through the data points: you are simply computing a correlation coefficient (r) that tells you how much one variable tends to change when the other one does.
Linear regression finds the best line for predicting the size of one variable (the dependent variable) from another variable that is treated as fixed (the independent variable). The coefficient of determination (r²) tells how much of the variability of the dependent variable is accounted for by the independent variable.
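To make the distinction concrete, here is a small least-squares sketch in Python. The function name `fit_line` and the data are hypothetical. It fits the line that predicts y from x and also returns r², the share of the variability in y accounted for by x.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least-squares line y = intercept + slope * x, plus r squared."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))  # co-variation of x and y
    sxx = sum((x - mx) ** 2 for x in xs)                    # variation in x
    syy = sum((y - my) ** 2 for y in ys)                    # variation in y
    slope = sxy / sxx
    intercept = my - slope * mx
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared

# Hypothetical data: x is treated as fixed, y is the variable being predicted.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
slope, intercept, r_squared = fit_line(xs, ys)
print(round(slope, 2), round(intercept, 2), round(r_squared, 3))
```

Swapping xs and ys leaves r² unchanged but gives a different line, which is why regression needs a decision about which variable is being predicted while correlation does not.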

53 Correlations

54 A perfect relationship, but not a linear correlation

55 A powerful relationship, but not a correlation – what’s happening here?

56 Normality of the data and Homoscedasticity

57 r=0.904 N=33 p<0.00

58

59 How correlation is used and misused

60 Tests of significance
Tests of difference – t-tests, analysis of variance, chi-square, odds ratios
Tests of relationship – correlation, regression analysis
Tests of difference and relationship – analysis of covariance, multiple regression analysis.

61 Inferential statistics

62 How safe do MPHS people feel?
1. Feeling safe in their own home: yes=1, no=0
2. Feeling safe in their local part of MPHS: yes=1, no=0
3. Feeling safe in MPHS generally: yes=1, no=0
Total safety score = sum of 1 to 3; range = 0 to 3.
If people don’t refer to 1 above, score it as 1. If people score 0 on 2, they must be 0 on 3.
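The scoring rules can be written as a small Python function. This is only a sketch: the function name `total_safety` and the use of `None` for "home was not mentioned" are assumptions made for illustration.

```python
def total_safety(home, local, general):
    """Total safety score (0 to 3) from three yes=1 / no=0 codes.

    home may be None when the participant did not refer to their own home.
    """
    if home is None:
        home = 1          # not mentioned: score home safety as 1
    if local == 0:
        general = 0       # unsafe locally implies unsafe in MPHS generally
    return home + local + general

print(total_safety(None, 1, 1))  # home not mentioned, safe elsewhere: prints 3
print(total_safety(1, 0, 1))     # unsafe locally overrides the general code: prints 1
```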

63

64

