This presentation is made available through a Creative Commons Attribution-Noncommercial license.

Title: Introduction to Thinking about Data
Attribution: Dr. Jim Scott, Clinic on the Meaningful Modeling of Epidemiological Data

2 MMED, African Institute for the Mathematical Sciences, Muizenberg, South Africa, May 2010. Brian Williams, Ph.D.; Jim Scott, Ph.D., M.A., M.P.H.

3 The object of all science, whether natural science or psychology, is to co-ordinate our experiences into a logical system. - A. Einstein, The Meaning of Relativity (1922)

4
• Data are better than anecdotes
• Where do the data come from?
• Be wary of confounding
• Understand variability
• What do the data say?
• Do you believe them?

5 “Data, data, data!” he cried impatiently. “I can’t make bricks without clay.” - Sherlock Holmes. Source: Statistics, 3rd ed., Freedman, Pisani, Purves

6 "The story became credible because it was published in The Lancet," Alison Singer, president of the Autism Science Foundation, said Tuesday. "It was in The Lancet, and we really rely on these medical journals."

7 Cancer "cures"

8 Per Capita Expenditures on Road Maintenance(?)

9 Percent of Planes Delayed from City of Origin, January 2009

              Continental               United
Airport       Late   Total   % Late     Late   Total   % Late
Newark         957    3998     23.9      100     399     25.1
LaGuardia       62     356     17.4      113     573     19.7
Pittsburgh       8      60     13.3       17     119     14.3
Detroit         16     145     11.0       16     139     11.5
Totals        1043    4559     22.9      246    1230     20.0

slide credit: Jeff Witmer, data source:
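
The delay table is an instance of the confounding the earlier slide warns about: Continental has the lower delay rate at every airport, yet the higher rate overall. The following is a minimal Python sketch that recomputes the percentages from the counts on the slide. The variable and function names are illustrative assumptions, not part of the original presentation.

```python
# (late departures, total departures) per airport, copied from the slide.
delays = {
    "Continental": {"Newark": (957, 3998), "LaGuardia": (62, 356),
                    "Pittsburgh": (8, 60), "Detroit": (16, 145)},
    "United": {"Newark": (100, 399), "LaGuardia": (113, 573),
               "Pittsburgh": (17, 119), "Detroit": (16, 139)},
}


def percent_late(late, total):
    """Percentage of departures that were late."""
    return 100.0 * late / total


def overall_percent_late(airline):
    """Pool all airports for one airline, then compute the percentage late."""
    late = sum(l for l, _ in delays[airline].values())
    total = sum(t for _, t in delays[airline].values())
    return percent_late(late, total)


# Airport by airport, Continental has the smaller percentage late.
for airport in delays["Continental"]:
    c = percent_late(*delays["Continental"][airport])
    u = percent_late(*delays["United"][airport])
    print(f"{airport:<11} Continental {c:5.1f}%   United {u:5.1f}%")

# Pooled over airports, the ordering reverses.
print(f"{'Overall':<11} Continental {overall_percent_late('Continental'):5.1f}%   "
      f"United {overall_percent_late('United'):5.1f}%")
```

The reversal arises because most of Continental's departures in the table are from Newark, the airport with the highest delay rate for both airlines, so the airport acts as a confounder when the rows are pooled.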

10
• “When the facts change, I change my mind. What do you do, sir?” - John Maynard Keynes
• Variation is everywhere

Observed value = Truth + Bias + Random Error
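
The decomposition Observed value = Truth + Bias + Random Error can be illustrated with a small simulation. This is a sketch under stated assumptions: the truth, bias, and noise values are hypothetical, the random error is assumed to be Gaussian, and `simulate_observations` is a name invented for this example.

```python
import random
import statistics


def simulate_observations(truth, bias, noise_sd, n, seed=0):
    """Draw n observations, each equal to truth + bias + Gaussian random error."""
    rng = random.Random(seed)
    return [truth + bias + rng.gauss(0.0, noise_sd) for _ in range(n)]


# Hypothetical values chosen only for illustration.
truth, bias, noise_sd = 50.0, 3.0, 5.0

small = simulate_observations(truth, bias, noise_sd, n=10)
large = simulate_observations(truth, bias, noise_sd, n=100_000)

print(f"mean of 10 observations:      {statistics.fmean(small):.2f}")
print(f"mean of 100,000 observations: {statistics.fmean(large):.2f}")
print(f"truth: {truth:.2f}   truth + bias: {truth + bias:.2f}")
```

Averaging more observations shrinks the contribution of the random error, so the mean settles near truth + bias rather than near the truth. Under these assumptions, collecting more data does nothing to remove the bias term.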

11
• Caution:
• This may require thoughtful consideration

13 What is the relationship?

14 What is the relationship?

15 Median Global Temperature During the Past 50 Years

16 All-cause mortality among men in England and Wales, 1838-2002: Cholera, flu, war and antibiotics. Richard Peto. [Figure: deaths per thousand plotted against year, 1850 to 2000]

17 Weekly influenza and pneumonia mortality, United Kingdom, 1918–1919. [Figure: deaths per 1,000 people by month, July 1918 to April 1919] Source: Taubenberger, J.K. and Morens, D.M. ‘1918 Influenza: the Mother of All Pandemics’, Emerging Infectious Diseases 12 (2006) 15

18
• 4 outbreaks between 1831 and 1854
• Most believed cholera was transmitted through vapors: “miasma”
• Snow proposed that cholera was actually spread by contaminated water
• Considered “folly”

19
• Supported the conventional ‘miasmatic’ theory of disease
• Eventually became a proponent of the ‘germ’ theory of disease
• Contemporary of John Snow

20 Source: P. Bingham, N.Q. Verlander, M.J. Cheal. Public Health (2004) 118, 387-394

21 Source: P. Bingham, N.Q. Verlander, M.J. Cheal. Public Health (2004) 118, 387-394

22
• Considered the father of field epidemiology
• Conducted a series of investigations of cholera outbreaks in London
• Acclaimed anesthesiologist

23
• Snow had noted that the two water supply companies, Lambeth Co. and Southwark and Vauxhall Co., were drawing water from the Thames at a point downstream of London
• This water was heavily polluted, making it a likely source of infection

24
• Before the 1854 epidemic, the Lambeth Company moved its water intake point upstream of London
• Snow decided to compare deaths from cholera in households served by the two different companies
• He walked door-to-door to determine the water source for each house

25
• July 9th to August 26th, 1854
• Mortality was much higher in households supplied with drinking water by the Southwark and Vauxhall Company

Districts with water supplied by   Population (1851 Census)   Deaths from cholera   Cholera death rate per 1,000 population
Southwark and Vauxhall Co. only                     167,654                   844                                       5.0
Lambeth Co. only                                     19,133                    18                                       0.9
Both companies                                      300,149                   652                                       2.2
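
The death rates in Snow's table follow directly from the population and death counts. The sketch below recomputes them and adds a rate ratio between the two single-company groups. The rate ratio is a summary added here for illustration and does not appear on the slide, and the names `districts` and `deaths_per_1000` are assumptions made for this example.

```python
# (population in the 1851 census, cholera deaths), copied from the slide.
districts = {
    "Southwark and Vauxhall Co. only": (167_654, 844),
    "Lambeth Co. only": (19_133, 18),
    "Both companies": (300_149, 652),
}


def deaths_per_1000(population, deaths):
    """Cholera deaths per 1,000 population."""
    return 1000.0 * deaths / population


rates = {name: deaths_per_1000(pop, deaths)
         for name, (pop, deaths) in districts.items()}

for name, rate in rates.items():
    print(f"{name:<32} {rate:4.1f} deaths per 1,000")

# Added summary (not on the slide): how many times higher the death rate was
# for Southwark and Vauxhall customers than for Lambeth customers.
rate_ratio = rates["Southwark and Vauxhall Co. only"] / rates["Lambeth Co. only"]
print(f"Rate ratio, Southwark and Vauxhall vs. Lambeth: {rate_ratio:.1f}")
```

With these counts the death rate among Southwark and Vauxhall households is a little over five times the rate among Lambeth households, and the districts served by both companies fall in between.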

26
• In addition to noting the role of the water companies, Snow also described the epidemic that hit the Golden Square area of London
• He determined where cholera cases lived and worked and created a map

27
• Since Snow believed water was the source of infection, he also plotted water pumps on the map
• More of the cases were clustered around the Broad Street pump than around other pumps

28
• To gain support for his hypothesis, Snow approached residents and cases to determine where they drew their water
• He found that residents avoided the other pumps because they were either grossly contaminated or inconveniently located

29
• Although these findings provided more support for his theory, he noted that there were relatively few cases in two of the blocks near the pump
• He conducted more field epidemiology to characterize that neighborhood

30
• The work house
  ▪ A large work house in the area had a deep well that served as the only source of drinking water
• The brewery
  ▪ A nearby brewery also supplied its workers with a daily allotment of beer

31 [Images: The Workhouse; The Brewery]

32 [Map: Snow, 1854. Labels: Oxford Street, Regent Street, work house. Legend: cases of cholera, pumps]

33
• Snow took these findings to the appropriate health authorities and, as the story goes, removed the handle from the Broad Street pump
• Shortly afterwards, the cholera epidemic subsided

34 “Why is it, then, that Dr. Snow is so singular in his opinion? Has he any facts to show in proof? No... The fact is, that the well whence Dr. Snow draws all sanitary truth is the main sewer. In riding his hobby very hard, he has fallen down through a gully-hole and has never since been able to get out again” - The Lancet (editorial). Source: The Ghost Map, by Steven Johnson, 2006

35 Lancet 1858, obituary column: DR JOHN SNOW - This well-known physician died at noon on the 16th instant, at his house in Sackville-street, from an attack of apoplexy. His researches on chloroform and other anaesthetics were appreciated by the profession.

In 1854 Filippo Pacini published ‘Microscopical observations and pathological deductions on cholera’, in which he described a comma-shaped bacillus, which he called Vibrio, and its relation to the disease.

36
• “Always do right. This will gratify some people, and astonish the rest” - Mark Twain
• Beware: not all data are created equal

Source: Statistics, 3rd ed., Freedman, Pisani, Purves

38
• 1936 Literary Digest Poll
• Literary Digest had predicted the winner of every US presidential election since 1916.
• In 1936, Literary Digest mailed questionnaires to 10 million people (25% of voters).
• 2.4 million people responded
• Returned questionnaires:
  ▪ Landon: 1,293,668 (57%)
  ▪ FDR: 972,897 (43%)

Source:

39
• Actual result: Roosevelt 61%, Landon 37%.
• One of the biggest landslides in U.S. history

40
• How were the data collected?
• Those who received the questionnaire were systematically different from those who didn’t
  ▪ 10 million sent out (~25% of voters)
  ▪ 2.3 million returned: a sample of convenience
  ▪ Not representative
• Sampling frame:
  ▪ Telephone books
  ▪ Automobile registries

41
• Sample size doesn’t matter if the data collection scheme is flawed

Observed value = Truth + Bias + Random Error
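
The claim that sample size cannot rescue a flawed collection scheme can be sketched with a simulation. Everything about the bias mechanism here is a hypothetical assumption: a supporter of the eventual winner is taken to be half as likely as a non-supporter to end up among the respondents. That figure is not an estimate of the Literary Digest's actual bias, and `biased_poll` and `random_poll` are names invented for this example. Only the 61% true share comes from the slides.

```python
import random

TRUE_SUPPORT = 0.61  # Roosevelt's actual share, as given on the results slide


def biased_poll(n, seed=1):
    """Share of supporters among n respondents from a flawed collection scheme.

    Hypothetical assumption: a supporter is only half as likely as a
    non-supporter to be included among the respondents.
    """
    rng = random.Random(seed)
    supporters = 0
    respondents = 0
    while respondents < n:
        is_supporter = rng.random() < TRUE_SUPPORT
        p_included = 0.5 if is_supporter else 1.0
        if rng.random() < p_included:
            respondents += 1
            supporters += is_supporter  # True counts as 1, False as 0
    return supporters / n


def random_poll(n, seed=2):
    """Share of supporters in a simple random sample of n voters."""
    rng = random.Random(seed)
    return sum(rng.random() < TRUE_SUPPORT for _ in range(n)) / n


big_biased = biased_poll(200_000)
small_random = random_poll(1_000)

print(f"true support:                   {TRUE_SUPPORT:.3f}")
print(f"biased poll, 200,000 responses: {big_biased:.3f}")
print(f"random poll, 1,000 responses:   {small_random:.3f}")
```

Under these assumptions the large biased poll converges on the wrong answer, while the much smaller random sample lands far closer to the true share. The extra responses reduce random error only and leave the bias term untouched.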

42
• "It seems to me what is called for is an exquisite balance between two conflicting needs: the most skeptical scrutiny of all hypotheses that are served up to us and at the same time a great openness to new ideas …
• If you are only skeptical, then no new ideas make it through to you …
• On the other hand, if you are open to the point of gullibility and have not an ounce of skeptical sense in you, then you cannot distinguish the useful ideas from the worthless ones." - Carl Sagan

43
• Primary sources: use them to inform models
• What are the data saying?
• Where do they come from?
• Are you a believer?

