More Statistics measures of uncertainty and small number issues Contributors Mark Dancox Shelley Bradley Jacq Clarkson.


1 More Statistics measures of uncertainty and small number issues Contributors Mark Dancox Shelley Bradley Jacq Clarkson

2 Things to be covered Summarising data –Common measures –Correlation Common Distributions Measuring uncertainty –Standard error –Confidence Intervals Significance and p-values Small samples

3 Summary Statistics

4 Summarising data Impractical to look at every single piece of data so need to use summary measures Need to reduce a lot of information into compact measures Look at location and spread

5 How do you describe an elephant?

6 How big is it…..?

7 How do you describe an elephant? How varied…?

8 Measures of Location Mean –Commonly used measure of location –Is the sum of values divided by the number of values The sample mean is given by: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ Can be drastically affected by unusual observations (called ‘outliers’) so it is not very robust. Excel function: AVERAGE()

9 Measures of location Median: is a value for which 50% of the data lie above (or below) “middle value” –For an odd number of observations, the median is the observation exactly in the middle of the ordered list –For an even number of observations, the median is the mean of the two middle observations in the ordered list Less sensitive to outliers and gives a ‘real’ value (unlike the mean) but does ‘throw away’ a lot of information in the sample Excel function: MEDIAN() Mode: The mode is the most frequently occurring value Sometimes too simplistic and not always unique Excel function: MODE()
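The three measures of location can be compared on a small data set. The sketch below uses Python's standard `statistics` module rather than Excel, and the class sizes are hypothetical values chosen so that one outlier is present.

```python
import statistics

# Hypothetical class sizes; the last value is a deliberate outlier.
class_sizes = [22, 24, 24, 25, 27, 28, 90]

mean_size = statistics.mean(class_sizes)      # sum of values / number of values
median_size = statistics.median(class_sizes)  # middle value of the ordered list
mode_size = statistics.mode(class_sizes)      # most frequently occurring value

print(f"mean={mean_size:.1f} median={median_size} mode={mode_size}")
```

The single outlier pulls the mean well above the median, which is the lack of robustness described on the slide.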

10 Measures of Spread Variance: The sum of the squared deviations of each sample value from the sample mean, divided by N-1: $s^2 = \frac{1}{N-1}\sum_{i=1}^{N}(x_i - \bar{x})^2$ Excel function: VARA() Standard deviation: is the square root of the sample variance Excel function: STDEVA()
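A minimal sketch of the sample variance and standard deviation, written out step by step so the N-1 divisor is visible. The data values are hypothetical, and the result is cross-checked against `statistics.variance`, which uses the same divisor.

```python
import math
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical observations

n = len(values)
mean = sum(values) / n

# Sum of squared deviations from the sample mean, divided by N-1.
variance = sum((x - mean) ** 2 for x in values) / (n - 1)
std_dev = math.sqrt(variance)

print(f"variance={variance:.3f} sd={std_dev:.3f}")
print("matches library:", math.isclose(variance, statistics.variance(values)))
```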

11 Measures of Spread We can describe the spread of a distribution by using percentiles. The pth percentile of a distribution is the value such that approximately p percent of the values are equal or less than that number. Excel function: percentile () Quartiles divide data into four equal parts. –First quartile (Q 1 ) 25 th Percentile 25% of observations are below Q 1 and 75% above Q 1 –Second quartile (Q 2 ) 50 th Percentile 50% of observations are below Q 2 and 50% above Q 2 –Third quartile (Q 3 ) 75 th Percentile 75% of observations are below Q 3 and 25% above Q 3

12 Measures of Spread Range: the difference between the largest and smallest values Can be misleading if the data set contains outliers. The interquartile range is the difference between the third and the first quartiles in a dataset (50% of the data lie in this range). Interquartile range more robust to outliers.
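Quartiles, the interquartile range and the range can be sketched with `statistics.quantiles`. The data are hypothetical, and the `"inclusive"` interpolation method is an assumption: different tools use slightly different quartile conventions, so results may differ a little from a spreadsheet.

```python
import statistics

def spread_summary(data):
    """Return (range, Q1, Q2, Q3, IQR) for a list of numbers."""
    q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return max(data) - min(data), q1, q2, q3, q3 - q1

clean = [1, 2, 3, 4, 5, 6, 7, 8, 9]          # hypothetical data
with_outlier = [1, 2, 3, 4, 5, 6, 7, 8, 90]  # same data, one extreme value

print("clean:       ", spread_summary(clean))
print("with outlier:", spread_summary(with_outlier))
```

The outlier inflates the range but leaves the interquartile range unchanged, which is why the IQR is described as more robust.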

13 If the data are approximately normal, we can use the mean and the standard deviation s to find intervals within which a given percentage of the data lie. [Figure: the range covers 100% of the data; the interquartile range, from Q1 to Q3, covers the middle 50%]

14 Skewness Values in a distribution may not be spread evenly. This will affect symmetry. Skewness is a measure of asymmetry. –If skewness = 0 the distribution is symmetrical –If skewness > 0 the distribution has a longer right tail (a few unusually large values) –If skewness < 0 the distribution has a longer left tail (a few unusually small values)

15 Some skewed distributions Count of persons killed and seriously injured in road traffic accidents, 2003-2005

16 Some skewed distributions

17

18 Relative positions of the mean and median for (a) right-skewed, (b) symmetric, and (c) left-skewed distributions Note: The mean is a good summary of location when the data are roughly symmetric, as with the normal distribution. If the data are markedly skewed it is better to report the median as the measure of location.

19 Skewness The degree of skewness affects measures of location. If no skew –Mean = Median If skew > 0 (right or +ve skew) –Mean > Median If skew < 0 (left or -ve skew) –Mean < Median
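The link between the sign of the skewness and the ordering of mean and median can be checked numerically. This sketch assumes the simple moment coefficient of skewness (third central moment divided by the 1.5 power of the second); other software may apply a small-sample adjustment. All data are hypothetical.

```python
import statistics

def skewness(values):
    """Moment coefficient of skewness: m3 / m2**1.5 (assumed definition)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    return m3 / m2 ** 1.5

right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]      # long right tail
symmetric = [1, 2, 3, 4, 5]
left_skewed = [-x for x in right_skewed]      # mirror image, long left tail

for name, data in [("right", right_skewed), ("symmetric", symmetric), ("left", left_skewed)]:
    print(name, round(skewness(data), 2), statistics.mean(data), statistics.median(data))
```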

20 DSRs for Circulatory Disease mortality for persons per 100,000 population, Local Authorities, 2004-06

21 Exercise 1 Calculate some summary statistics for the class size data in sheet one of the exercises.

22 Correlation is a measure of association between two continuous variables Correlation is best visualised graphically, plotting one variable (Y) against the other (X)

23 Positive correlation: one variable increases with the other. Negative correlation: one variable increases as the other decreases. No correlation: Y neither increases nor decreases with X.

24 Correlation coefficient Correlation coefficients measure strength of (linear) association between continuous variables Pearson’s correlation coefficient r measures linear association i.e. Do the points lie on a straight line? If the points form a perfect straight line, then we have perfect correlation. The closer r is to 0, the weaker the correlation r = 1 Perfect positive correlation r = -1 Perfect negative correlation r = 0 No linear correlation $r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{(n-1)\, S_x S_y}$ where $S_x$ and $S_y$ denote the sample standard deviations of X and Y. Excel function: PEARSON()
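A minimal sketch of Pearson's r computed from first principles. It uses the equivalent form "sum of cross-products divided by the square root of the product of the sums of squares"; the function name and data are illustrative only.

```python
import math

def pearson_r(xs, ys):
    """Pearson's correlation coefficient for two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]
print(pearson_r(x, [2, 4, 6, 8, 10]))   # points on a rising straight line
print(pearson_r(x, [10, 8, 6, 4, 2]))   # points on a falling straight line
print(pearson_r(x, [1, 8, 27, 64, 125]))  # monotonic but curved
```

The curved example gives a value below 1 even though Y always increases with X, because r only measures linear association.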

25 [Figure: example scatter plots with r = +1, r = -1, r = 0.3, r = 0.7, r = -0.5 and r = 0]

26 Example

27 $0 \le r \le 1$ Positive correlation $-1 \le r \le 0$ Negative correlation Spearman’s rank correlation coefficient measures association whenever one or both variables are on an ordinal scale. The association does not need to be linear: does one variable increase/decrease with the other? $r_s = 1 - \frac{6\sum d_i^2}{n(n^2-1)}$ where d is the difference between the rank orderings of the data. Not an inbuilt function in Excel
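Spearman's coefficient can be computed by hand from rank differences. The sketch below assumes the usual shortcut formula 1 - 6Σd²/(n(n²-1)), which is only valid when there are no tied values; the helper names are illustrative.

```python
def ranks(values):
    """Rank values from 1 (smallest) to n. Assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result

def spearman_rs(xs, ys):
    """Spearman's rank correlation via the no-ties shortcut formula."""
    n = len(xs)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

x = [1, 2, 3, 4, 5]
print(spearman_rs(x, [1, 8, 27, 64, 125]))  # curved but always increasing
print(spearman_rs(x, [50, 40, 30, 20, 10])) # always decreasing
```

Because only the rank order matters, a curved but steadily increasing relationship still scores a perfect 1.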

28 WARNING Spurious correlations can arise from: Change of direction of association; Outliers; Subgroups

29 Ecological Fallacy When correlations based on grouped data are incorrectly assumed to hold for individuals. E.g. investigating the relationship between food consumption and cancer risk. One way to begin such an investigation would be to look at data on the country level and construct a plot of overall cancer risk against per capita daily caloric intake. But it is people, not countries, who get cancer. It could be that within countries those who eat more are less likely to develop cancer.

30 Example: Fat in the Diet and Cancer Source: K. Carroll, “Experimental evidence of dietary factors and hormone-dependent cancers,” Cancer Research vol. 35 (1975) p. 3379. Copyright by Cancer Research.

31 On the country level, per capita food intake may just be an indicator of overall wealth and industrialization. The ecological fallacy was in studying countries when one should have been studying people.

32 Distributions for Public Health Analysts

33 Types of distributions Normal distribution –Used for continuous measures such as height, weight, blood pressure Poisson distribution –Used for discrete counts of things: violent crimes, number of serious accidents, number of horse kicks Binomial distribution –Used to analyse data where the response is a discrete count of a category: success/failure, response/non-response

34 Normal Distribution Distribution of natural phenomena Continuous Family of distributions with same shape Area under curve is the same (=1) Symmetrical Defined by mean (μ) and standard deviation (σ) Widely assumed for statistical inference

35 Normal Distribution, changes in mean (μ) Keeping the standard deviation constant, changing the mean of a distribution moves the distribution to the left or right…

36 Normal Distribution, changes in standard deviation (σ) Keeping the mean constant but changing the standard deviation affects the ‘narrowness’ of the curve…

37 Distribution of values

38 Poisson A discrete distribution taking on the values X= 0,1,2,3,… Often used to model the number of events occurring in a defined period of time. Determined by a single parameter, lambda (λ), which is the mean of the process. Shape of distribution changes with the value of λ
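How the shape changes with λ can be seen by tabulating the Poisson probabilities directly. This is a minimal sketch using the standard probability formula P(X = k) = e^(-λ) λ^k / k!; the two values of λ are arbitrary illustrations.

```python
import math

def poisson_pmf(k, lam):
    """Probability of observing exactly k events when the mean is lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

for lam in (1, 5):
    probabilities = [round(poisson_pmf(k, lam), 3) for k in range(10)]
    print(f"lambda={lam}:", probabilities)
```

With a small λ the probabilities pile up near zero and the distribution is strongly right-skewed; as λ grows the shape becomes more symmetric.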

39 Example of Poisson distribution

40 Binomial Distribution Used to analyse discrete counts Consider when interested in a count expressed as a proportion of a total sample size. –“the proportion of brown-eyed persons in a building” Defined by the probability of an outcome (p) and the sample size (n)
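A short sketch of the binomial probabilities for given n and p, using the standard formula C(n, k) p^k (1-p)^(n-k). The figures (10 people, 30% brown-eyed) are hypothetical.

```python
import math

def binomial_pmf(k, n, p):
    """Probability of exactly k 'successes' in n independent trials."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

n, p = 10, 0.3  # hypothetical: 10 people, 30% chance each is brown-eyed
distribution = [binomial_pmf(k, n, p) for k in range(n + 1)]
expected_count = sum(k * prob for k, prob in enumerate(distribution))

print([round(prob, 3) for prob in distribution])
print("expected count:", round(expected_count, 2))
```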

41 Binomial Distribution

42 Which distribution would best describe? Number of abortions by gestational age Percentage of patients with diabetes mellitus treated with ACE inhibitor therapy for Acute sickness Number of Adults on prescribed medication Proportion of Adults who are overweight Average weekly alcohol units consumed

43 Choice of distribution No hard and fast rules about which distribution should be used. If a sample size is big enough choice of distribution may be less important –“everything tends to normality” –Normal distribution will be a good approximation to Poisson and Binomial distributions given big enough sample.
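The claim that the normal distribution approximates the binomial for large samples can be checked numerically. The sketch below compares exact binomial cumulative probabilities with a normal approximation that has the same mean and standard deviation; the continuity correction of 0.5 and the chosen n and p are assumptions made for illustration.

```python
import math
from statistics import NormalDist

def binomial_cdf(k, n, p):
    """Exact P(X <= k) for a binomial(n, p) count."""
    return sum(math.comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k + 1))

n, p = 400, 0.2
approximation = NormalDist(mu=n * p, sigma=math.sqrt(n * p * (1 - p)))

for k in (70, 80, 90):
    exact = binomial_cdf(k, n, p)
    normal = approximation.cdf(k + 0.5)  # continuity correction
    print(f"P(X <= {k}): exact={exact:.4f} normal={normal:.4f}")
```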

44 Standard Error Summary statistics – such as the mean- are based on samples Different samples from the same population give rise to different values This variation is quantified by the standard error
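The idea that different samples give different means can be illustrated by simulation. This sketch repeatedly draws samples from a hypothetical normal population (mean 80, standard deviation 12, sample size 43, all assumed values) and compares the spread of the sample means with the textbook standard error σ/√n.

```python
import math
import random
import statistics

random.seed(1)  # fixed seed so the run is repeatable

POP_MEAN, POP_SD, SAMPLE_SIZE, N_SAMPLES = 80.0, 12.0, 43, 2000

sample_means = []
for _ in range(N_SAMPLES):
    sample = [random.gauss(POP_MEAN, POP_SD) for _ in range(SAMPLE_SIZE)]
    sample_means.append(statistics.mean(sample))

spread_of_means = statistics.stdev(sample_means)   # observed variation between samples
theoretical_se = POP_SD / math.sqrt(SAMPLE_SIZE)   # sigma / sqrt(n)

print(f"observed={spread_of_means:.2f} theoretical={theoretical_se:.2f}")
```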

45 An example….

46 An example….continued….

47 Standard Errors for some common distributions Normal distribution Poisson distribution Binomial Distribution
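The formulas on this slide are not preserved in the transcript. The commonly used textbook forms are given below as an assumption about what was shown: the standard error of a mean, of a Poisson count, and of a binomial proportion.

```latex
\begin{align*}
\text{Normal (mean of } n \text{ values):} \quad & SE(\bar{x}) = \frac{s}{\sqrt{n}} \\
\text{Poisson (count with mean } \lambda\text{):} \quad & SE = \sqrt{\lambda}, \text{ estimated by the square root of the observed count} \\
\text{Binomial (proportion } \hat{p} \text{ from } n \text{ trials):} \quad & SE(\hat{p}) = \sqrt{\frac{\hat{p}\,(1-\hat{p})}{n}}
\end{align*}
```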

48 Confidence Intervals How to calculate How to interpret

49 Confidence Intervals Summary statistics are point estimates based on samples Confidence Intervals quantify the degree of uncertainty in these estimates (“wings of uncertainty”) Quoted as a lower limit and an upper limit which provide a range of values within which the population value is likely to lie

50 Calculating Confidence Intervals General form of any 95% C.I.: Point Estimate ± 1.96*(Estimated SE) For 99% CIs we use 2.58 For 90% CIs we use 1.645
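The multipliers come from the standard normal distribution, so they can be looked up rather than memorised. A minimal sketch using `statistics.NormalDist`; the function names are illustrative.

```python
from statistics import NormalDist

def z_multiplier(confidence):
    """Standard normal multiplier for a two-sided interval, e.g. 0.95 -> about 1.96."""
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)

def confidence_interval(estimate, standard_error, confidence=0.95):
    """Point estimate plus or minus multiplier times estimated standard error."""
    margin = z_multiplier(confidence) * standard_error
    return estimate - margin, estimate + margin

for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%}: multiplier = {z_multiplier(level):.3f}")
```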

51 Interpretation A 95% Confidence Interval is a random interval, such that in repeated sampling 95 out of every 100 intervals succeed in covering the parameter Loose interpretation –“95% chance true value inside interval” [Figure: intervals from repeated samples plotted against the true value; about 5% of them (marked X) miss it]
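The repeated-sampling interpretation can be demonstrated by simulation: draw many samples from a population whose mean is known, build a 95% interval from each, and count how often the interval covers the true mean. The population values and sample size are hypothetical.

```python
import math
import random
import statistics

random.seed(2)  # fixed seed so the run is repeatable

TRUE_MEAN, TRUE_SD, SAMPLE_SIZE, REPEATS = 80.0, 12.0, 43, 2000

covered = 0
for _ in range(REPEATS):
    sample = [random.gauss(TRUE_MEAN, TRUE_SD) for _ in range(SAMPLE_SIZE)]
    mean = statistics.mean(sample)
    se = statistics.stdev(sample) / math.sqrt(SAMPLE_SIZE)
    if mean - 1.96 * se <= TRUE_MEAN <= mean + 1.96 * se:
        covered += 1

coverage = covered / REPEATS
print(f"intervals covering the true mean: {coverage:.1%}")
```

The proportion should come out close to 95%, with the remaining intervals missing the true value purely through sampling variation.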

52 Interpretation of confidence intervals Non overlapping intervals indicative of real differences Overlapping intervals need to be considered with caution Need to be careful about using confidence intervals as a means of testing. The smaller the sample size, the wider the confidence interval

53 Example If the mean weight (kg) for a given sample of 43 men aged 55 is 81.4kg and the standard deviation is 12.7 kg…. Then, A 95% confidence interval is given by (81.4 - 1.96*(stdev/√n), 81.4 + 1.96*(stdev/√n)) which evaluates to (77.6, 85.2) kg
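The arithmetic in this example can be reproduced in a few lines, using only the figures given on the slide.

```python
import math

mean_weight = 81.4  # kg, sample mean
std_dev = 12.7      # kg, sample standard deviation
n = 43              # sample size

standard_error = std_dev / math.sqrt(n)
lower = mean_weight - 1.96 * standard_error
upper = mean_weight + 1.96 * standard_error

print(f"SE = {standard_error:.2f} kg")
print(f"95% CI = ({lower:.1f}, {upper:.1f}) kg")  # (77.6, 85.2) kg
```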

54 Exercise 2 Using the CI calculator provided 1. Calculate the 95% CI for the mean class size from exercise 1. 2. If 20% in a sample of 400 are smokers, calculate a 95% confidence interval around this proportion

55 Exercise 3, which areas are significantly higher than England?

56 Measuring uncertainty

57 Types of hypotheses Null Hypothesis (H0) –The hypothesis under consideration –“there is no difference between groups” –The accused is innocent Alternative Hypothesis (Ha) –The hypothesis we accept if we reject the null hypothesis –“there is a difference between groups” –Or the accused is guilty

58 Hypothesis Testing Inferences about a population are often based upon a sample. Want to be able to use sample statistics to draw conclusions about the underlying population values Hypothesis testing provides some criteria for reaching these conclusions

59 General principles of hypothesis testing Formulate null (H0) and alternative (Ha) hypotheses (simple or composite) Choose test statistic Decide on rule for choosing between the null and alternative hypotheses Calculate test statistic and compare against the decision rule.

60 Illustration of acceptance regions

61 Significance Levels Used as the criterion for rejecting or not rejecting H0 A p-value < 0.05 (or 0.01) indicates that the observed data would be unlikely if H0 were true Usually 5% or 1% Chosen a priori

62 P-values Criteria to judge statistical significance of results. Quoted as values between 0 and 1 Probability of a result at least as extreme as the one observed, assuming H0 is true Values less than 0.05 (or 0.01) indicate an observation unlikely under the assumption that H0 is true
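As a sketch of how a p-value is obtained, the code below runs a two-sided z-test of a sample mean against a hypothesised population mean. The sample figures are borrowed from the weight example earlier in the deck; the null value of 78 kg is purely hypothetical, and a normal test statistic is assumed.

```python
import math
from statistics import NormalDist

def two_sided_p_value(z):
    """Probability of a standard normal value at least as extreme as z, in either direction."""
    return 2 * (1 - NormalDist().cdf(abs(z)))

sample_mean, std_dev, n = 81.4, 12.7, 43
null_mean = 78.0  # hypothetical value under H0

standard_error = std_dev / math.sqrt(n)
z = (sample_mean - null_mean) / standard_error
p_value = two_sided_p_value(z)

print(f"z = {z:.2f}, p = {p_value:.3f}")
print("reject H0 at 5%" if p_value < 0.05 else "do not reject H0 at 5%")
```

Here the test statistic is smaller than 1.96, so the p-value exceeds 0.05 and H0 is not rejected at the 5% level.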

63 Illustration of P-value under H0

64 Sample size Results may indicate no difference between groups This may be because there is truly no difference between groups or because there was an insufficiently large sample size for this to be detected

65 Determining Sample Size Choice of sample size depends on: –Anticipated size of effect/ required precision –Variability in measurement –Power –Significance levels

66 Sample size formula It is possible to combine information on variability, significance and power with the size of the effect we are trying to detect: Example from Clinical Trials: $N = \frac{2\sigma^2 (z_{\alpha/2} + z_{\beta})^2}{(\mu_0 - \mu_a)^2}$
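The formula can be evaluated directly. In this sketch N is taken to be the number needed in each of two groups, which is the usual reading of this formula but is an assumption here; the standard deviation, difference to detect, significance level and power are hypothetical inputs.

```python
import math
from statistics import NormalDist

def sample_size(sigma, difference, alpha=0.05, power=0.80):
    """N = 2 * sigma^2 * (z_{alpha/2} + z_beta)^2 / difference^2, rounded up."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    n = 2 * sigma ** 2 * (z_alpha + z_beta) ** 2 / difference ** 2
    return math.ceil(n)

# Hypothetical: detect a 5 kg difference when the standard deviation is 12.7 kg.
print(sample_size(sigma=12.7, difference=5.0))
print(sample_size(sigma=12.7, difference=2.5))  # halving the effect size
```

Halving the difference to be detected roughly quadruples the required sample size, because the difference enters the formula squared.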

67 Small samples The smaller the sample, the higher the degree of uncertainty in results. Increased variability in small samples Confidence Intervals for estimates are wider Low numbers may affect the calculation of directly standardised rates (for instance) Distribution assumptions may be affected.

68 Dealing with small numbers Confidentiality can be an issue Can combine several years of data –Mortality pooled over several years for rare conditions Suicide, infant mortality, cancers in the young Combine counts across categories of data –Low cell counts in cross-classifications of the data Exact methods may be needed.
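The effect of pooling several years of data can be sketched with a crude rate and a normal-approximation interval based on the Poisson standard error (square root of the count). The death counts and population are hypothetical, and the approximation is used only for illustration.

```python
import math

def rate_with_ci(deaths, person_years, per=100_000):
    """Crude rate per `per` population with an approximate 95% interval."""
    rate = deaths / person_years * per
    se = math.sqrt(deaths) / person_years * per  # Poisson SE of the count, scaled
    return rate, rate - 1.96 * se, rate + 1.96 * se

yearly_deaths = [3, 5, 2]   # hypothetical counts for three years
population = 50_000         # hypothetical, assumed constant each year

single = rate_with_ci(yearly_deaths[0], population)
pooled = rate_with_ci(sum(yearly_deaths), population * len(yearly_deaths))

single_width = single[2] - single[1]
pooled_width = pooled[2] - pooled[1]

print("single year:", [round(v, 1) for v in single])
print("three years:", [round(v, 1) for v in pooled])
```

Pooling narrows the interval. The single-year lower limit also falls below zero, which is impossible for a rate and shows why exact methods may be needed when counts are this small.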

69 Problems with small numbers

70 Finding out more: APHO http://www.apho.org.uk/resource/item.aspx?RID=48457

71 Finding out more Lots of useful information can be found at the HealthKnowledge website…

72 Finding out more The NCHOD website also contains useful information on methodology… http://www.nchod.nhs.uk/

73 Finding out more Some further references of interest: –Bland, M. Introduction to Medical Statistics. Third Edition. Oxford University Press, 2000. –Hennekens CH, Buring JE. Epidemiology in Medicine, Lippincott Williams & Wilkins, 1987. –Larson, H.J. Introduction to Probability Theory and Statistical Inference. Third Edition. Wiley, 1982

