Public Health Information Network (PHIN) Series II

1 Public Health Information Network (PHIN) Series II
Outbreak Investigation Methods: From Mystery to Mastery

2

3 Access Series Files Online
Session slides; session activities (when applicable); session evaluation forms; speaker biographies. Alternate Web site: http://www.sph.unc.edu/nccphp/phtin/index.htm

4 Site Sign-in Sheet
Please submit your site sign-in sheet and session evaluation forms to: Suzi Silverstein, Director, Education and Training, Emergency Preparedness & Response Programs. FAX: (804) 225-3888

5 Series II Session VI “Data Analysis”

6 Series II Sessions
“Recognizing an Outbreak” “Risk Communication” “Study Design” “Designing Questionnaires” “Interviewing Techniques” “Data Analysis” “Writing and Reviewing Epidemiological Literature”

7 Today’s Presenters Amy Nelson, PhD Consultant
NC Center for Public Health Preparedness Sarah Pfau, MPH

8 “Analyzing Data” Learning Objectives
Upon completion of this session, you will: Understand what an analytic study contributes to an epidemiological outbreak investigation Understand the importance of data cleaning as a part of analysis planning

9 “Analyzing Data” Learning Objectives
Know why and how to generate descriptive statistics to assess trends in your data Know how to generate and interpret epi curves to assess trends in your outbreak data Understand how to interpret measures of central tendency

10 “Analyzing Data” Learning Objectives (cont’d.)
Know why and how to generate measures of association for cohort and case-control studies Understand how to interpret measures of association (risk ratios, odds ratios) and corresponding confidence intervals Know how to generate and interpret selected descriptive and analytic statistics in Epi Info software

11 Lecturer
Amy Nelson, PhD, Consultant, NC Center for Public Health Preparedness

12 Analyzing Data: Session Overview
Analysis planning Descriptive epidemiology Epi curves Spot maps Measures of central tendency Attack rates Analytic epidemiology Measures of association Case study analysis using Epi Info software Today we are going to cover a range of topics that you should be familiar with when you analyze data during an outbreak investigation. An essential part of data analysis is the set of fundamental procedures that you should complete before analyzing data. These include planning your analysis, cleaning your data, and assessing summaries of your data. After I discuss these precursors to data analysis, I will talk about measures of central tendency, attack rates, and measures of association. We will end today’s session with a comprehensive case-control outbreak investigation example. Our guest lecturer will walk you through the process of cleaning, summarizing, and analyzing data, as well as interpreting analysis output using the Analyze Data component of Epi Info software.

13 Analysis Planning The precursor to data analysis is analysis planning.
This process ideally takes place before questionnaire design because, as you learned in the September and October PHIN sessions, many elements of questionnaire design can affect how you can analyze the collected data.

14 Analysis Planning An invaluable investment of time
Helps you select the most appropriate epidemiologic methods Helps assure that the work leading up to analysis yields a database structure and content that your preferred analysis software needs to successfully run analysis programs You may feel rushed to identify the source of an outbreak during an investigation. But analysis planning can: Be an invaluable investment of time; Help you select the most appropriate research methods and statistical tools; and Help assure that the work leading up to analysis yields a database structure and contents that your preferred analysis software needs to successfully run analysis programs.

15 Analysis Planning Several factors influence, and sometimes limit, your approach to data analysis: Research question Exposure and outcome variables Study design Sample population The methodological decisions made throughout the outbreak investigation process can influence, and possibly even limit, how you can analyze your data. These include how your research questions are developed (for example, will you need to collect qualitative or quantitative data, or both? Will categorical data be nominal or ordinal?). The data on exposure and outcome that you collect also affect analysis. For example, will you collect information on whether a person was exposed in a yes/no format, or will you measure the quantity of a substance the person was exposed to? Sometimes you may collect data in one format but want the ability to put it into categories or other formats. Another determining factor in data analysis is the study design that you use; for example, you will need to interpret distinctly different measures of association for cohort versus case-control studies. You will learn how to interpret such analysis output in today’s session. Your method of sample population selection can also affect data analysis. For example, some sampling methods, like the cluster sampling within census blocks that you heard about in last month’s session, require more complex analysis techniques.

16 Analysis Planning Three key considerations as you plan your analysis:
Work backwards from the research question(s) to design the most efficient data collection instrument Study design will determine which statistical tests and measures of association you evaluate in the analysis output Consider the need to present, graph, or map data Three key considerations for planning a study are listed here. Let’s discuss these in detail.

17 Analysis Planning Work backwards from the research question(s) to design the most efficient data collection instrument Develop a sound data collection instrument Collect pieces of information that can be counted, sorted, and recoded or stratified Analysis phase is not the time to realize that you should have asked questions differently!

18 Analysis Planning Study design will determine which statistical tools you will use Use risk ratio (RR) with cohort studies and odds ratio (OR) with case-control studies; need to know which to evaluate, because both are generated simultaneously in Epi Info and SAS Some sampling methods (e.g., matching in case-control studies) require special types of analysis We covered study designs in the July? session… An important point to remember is that the study design you use will dictate which statistical tests you will conduct in the analysis. In a cohort study…(slide)
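To make the RR versus OR distinction on this slide concrete, here is a minimal Python sketch that computes both measures from the same 2x2 table. The counts and the function names are hypothetical and chosen only for illustration; they do not come from the session's case study, and this is not how Epi Info or SAS produce their output.

```python
def risk_ratio(a, b, c, d):
    """Risk ratio for a cohort study.

    a = exposed and ill,   b = exposed and not ill,
    c = unexposed and ill, d = unexposed and not ill.
    """
    risk_exposed = a / (a + b)
    risk_unexposed = c / (c + d)
    return risk_exposed / risk_unexposed


def odds_ratio(a, b, c, d):
    """Odds ratio (cross-product ratio) for a case-control study."""
    return (a * d) / (b * c)


# Hypothetical 2x2 counts, for illustration only.
a, b, c, d = 30, 20, 10, 40

rr = risk_ratio(a, b, c, d)
odds = odds_ratio(a, b, c, d)
print(f"RR = {rr:.1f}, OR = {odds:.1f}")
```

The same table yields two different numbers, which is why you need to know in advance which one matches your study design.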

19 Analysis Planning Consider the need to present, graph, or map data
Even if you collect continuous data, you may later categorize it so you can generate a bar graph and assess frequency distributions If you plan to map data, you may need X- and Y-coordinate or denominator data There may be additional data you need to complete graphs or maps that you hadn’t thought of including in your data collection instrument.

20 Basic Steps of an Outbreak Investigation
Verify the diagnosis and confirm the outbreak Define a case and conduct case finding Tabulate and orient data: time, place, person Take immediate control measures Formulate and test hypotheses Plan and execute additional studies Implement and evaluate control measures Communicate findings Before moving on, let’s review the basic steps of an outbreak investigation so that we can put analysis into the broader context. Remember that, in reality, some of these steps may occur simultaneously or in a different order. First, we must verify the diagnosis and confirm that an outbreak is actually occurring. Next, we create a preliminary case definition and conduct active case finding. Once we have information from some cases, we compile and review it, and then we implement preliminary control measures. After that, we formulate and test a hypothesis, and then plan and execute additional studies based on the preliminary results. Finally, we implement and evaluate control measures and communicate our findings. In this session, we will emphasize steps 3, 5, and 6, all of which involve data management and data analysis.

21 Descriptive Epidemiology
First we’ll discuss step #3, Tabulate and orient data by person, place, and time. This is descriptive epidemiology.

22 Step 3: Tabulate and orient data: time, place, person
Descriptive epidemiology: Familiarizes the investigator with the data Comprehensively describes the outbreak Is essential for hypothesis generation (step #5) Characterizing an outbreak by time, place and person is called “descriptive epidemiology”, and comprises step #3 of the basic steps of an outbreak investigation. Descriptive epidemiology is performed towards the beginning of an investigation once data are collected from several case-patients and often several more times throughout the investigation as more data are collected. Why is descriptive epidemiology important? It’s important for three reasons: 1. To become familiar with the outbreak data 2. To describe the outbreak in terms of its time trend, geographic extent and the population affected 3. It is essential for step #5: hypothesis generation. Let’s examine each of these reasons in more detail.

23 Data Cleaning Check for accuracy Outliers Check for completeness
Missing values Determine whether or not to create or collapse data categories Get to know the basic descriptive findings The first component in descriptive epidemiology involves examining each variable in the data set, individually. This process, referred to as “data cleaning,” identifies outlying values that may be due to data collection or entry error. Data cleaning also allows the investigator to get to know the basic descriptive findings for each variable. Data cleaning involves tasks such as: Checking for accuracy Checking for consistency Checking for completeness Determining whether or not to create or collapse data categories Getting to know the basic descriptive findings of the data If you are the only person who has collected and entered data, you will probably have a good sense of these characteristics. However, if more than one person has collected and/or entered the data into a database it is critical that you familiarize yourself with these aspects of the data before embarking on data analysis. Analyzing data without knowing the data can lead to inaccurate conclusions.

24 Data Cleaning: Outliers
Outliers can be cases at the very beginning and end that may not appear to be related First check to make certain they are not due to a collection, coding or data entry error If they are not an error, they may represent Baseline level of illness Outbreak source A case exposed earlier than the others An unrelated case A case exposed later than the others A case with a long incubation period Cases at the very beginning or end of an outbreak that may not appear to be related to the outbreak are referred to as “outliers.” The first thing to do when considering outliers is to make sure they are not mistakes due to data collection, coding, or entry error. The same check applies to other variables. For example, you may want to check the range of ages among the case-patients. If you have only one 1-year-old or one 95-year-old and the remaining case-patients range in age from 10 to 60, you might want to check whether these outlying values for age have been accurately recorded or calculated and entered into the database.

25 Data Cleaning: Distribution of Variables
For example, here is a frequency distribution of date of illness onset for a nursing home outbreak. Notice that most case-patients became ill towards the end of the month while one patient became ill at the beginning of the month. The investigator would want to confirm this date of onset to make certain it wasn’t due to a data collection or entry error. When outliers are not errors, they can provide important information. For example, an early case may not be part of the outbreak; it may just represent the baseline or background level of illness. However, it may also represent the source of the outbreak, such as an infected food handler, or a case that was exposed earlier than the others. Also, a late case may not be part of the outbreak. Alternatively, a late case may represent an individual who had a long incubation period, who was exposed later than the other cases, or who was a secondary case. “Outlier”

26 Data Cleaning: Missing Values
The investigator can check into missing values that are expected versus those that are due to problems in data collection or entry The number of missing values for each variable can also be learned from frequency distributions Data cleaning also gives the investigator a sense of which variables have few or many missing values, and why values may be missing. For example, in an outbreak investigation for a foodborne illness, it would make sense that vegetarians would be missing responses for any food item containing meat. Alternatively, missing values could be due to problems with data collection or data entry.

27 Data Cleaning: Frequency Distributions
After looking for missing values and outliers, frequency distributions can highlight the basic descriptive findings. For categorical data, the frequency distribution provides the proportion of each response. For continuous variables, it reveals the shape of the distribution. This frequency table for the variable representing the date of onset of illness shows the investigator that 29 out of 75 records, almost 40%, are missing data for the date of onset of illness. This situation is not ideal; 40% is a lot. This might spur the investigator to try to obtain these dates. If this isn’t possible, the investigator will at least know that this variable, date of illness onset, may not be very reliable.
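The frequency table described here can be sketched in a few lines of Python. This is only an illustration, not the Epi Info output shown on the slide: the function name is made up, and the onset values are hypothetical except for the 29 missing records out of 75 mentioned in the narration.

```python
from collections import Counter


def frequency_table(values):
    """Return (value, count, percent) rows, reporting None as 'Missing'."""
    counts = Counter("Missing" if v is None else v for v in values)
    total = len(values)
    return [(value, n, round(100 * n / total, 1))
            for value, n in counts.most_common()]


# Hypothetical onset-date variable: 75 records, 29 of them missing.
# The two non-missing values and their counts are invented for illustration.
onset = [None] * 29 + ["day 20"] * 16 + ["day 21"] * 30

table = frequency_table(onset)
for value, n, pct in table:
    print(f"{value:8} {n:3} {pct:5.1f}%")
```

Listing the missing values as their own row makes the 29-of-75 problem visible immediately, before any analysis is run on the variable.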

28 Data Cleaning: Data Categories
Which variables are continuous versus categorical? Collapse existing categories into fewer? Create categories from continuous? (e.g., age) In your dataset you may have continuous numeric variables that you want to collapse into discrete categories. Eventually you will need to determine how many categories you want for a variable and what the boundaries for each category should be. During data analysis, you will have the flexibility to analyze even intervals (for example, in the case of “age”, you could analyze ages 0 – 100 in ten 10-year intervals) or uneven intervals that better answer your research question (for example, maybe you are studying an outbreak among children, and want to look at < 1 year of age, 2 – 4, 5 – 6, and then an “all other ages” category). As an aside, if you are analyzing data but did not create the questionnaire that was used, design the database, or enter data, you may want to review any documentation (sometimes called a “code book”) that accompanies a database to identify any existing pre-coding of variables. Or, in Epi Info, you can look in the MakeView component to see how a field (which is the Epi Info name for a variable) was programmed into the database.
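As a rough sketch of collapsing a continuous variable into categories, the Python below assigns ages to either even or uneven intervals. The function name, the cut points, and the labels are assumptions made for illustration (the uneven cut points in particular are not the ones from the narration), and this is not how Epi Info recodes a field.

```python
import bisect


def age_group(age, upper_bounds, labels):
    """Return the label of the interval that contains `age`.

    `upper_bounds` holds the exclusive upper limit of every interval except
    the last, so `labels` must have one more entry than `upper_bounds`.
    """
    return labels[bisect.bisect_right(upper_bounds, age)]


# Even 10-year intervals: 0-9, 10-19, ..., 80-89, 90+ (assumed labels).
even_bounds = list(range(10, 100, 10))
even_labels = [f"{lo}-{lo + 9}" for lo in range(0, 90, 10)] + ["90+"]

# Uneven intervals for a pediatric outbreak (hypothetical cut points).
child_bounds = [1, 5, 7]
child_labels = ["<1", "1-4", "5-6", "7+"]

for age in (0.5, 6, 37, 95):
    print(age,
          age_group(age, even_bounds, even_labels),
          age_group(age, child_bounds, child_labels))
```

Keeping the cut points in a list makes it easy to try a different grouping later without touching the raw ages.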

29 Descriptive Epidemiology
Comprehensively describes the outbreak Time Place Person We just discussed the fact that it is necessary for the investigator to become familiar with the outbreak data and that this is accomplished through a component of descriptive epidemiology called data cleaning. The second reason descriptive epidemiology is important is because it can be used to comprehensively characterize the outbreak. Specifically, it can be used to describe the outbreak in terms of its time trend, geographic extent and the population affected.

30 Descriptive Epidemiology
Time

31 Descriptive Epidemiology: Time
We will first discuss how descriptive epidemiology is used to display time trends and to generate epidemic curves in outbreaks. Time data are typically shown as a line graph or histogram with the time period of interest shown on the x-axis (the horizontal line) and the number of cases during the corresponding time period on the y-axis (the vertical line). Note that when you generate a histogram, the time intervals along the x-axis are of equal width.

32 Descriptive Epidemiology: Time
What is an epidemic curve and how can it help in an outbreak? An epidemic curve (epi curve) is a graphical depiction of the number of cases of illness by the date of illness onset With infectious diseases, once cases in an outbreak have been counted, the tally is used to help solve the investigation by creating an epidemic curve, or epi curve. An epi curve is defined as a graphical depiction of the number of outbreak cases by date of illness onset.

33 Descriptive Epidemiology: Time
An epi curve can provide information on the following characteristics of an outbreak: Pattern of spread Magnitude Outliers Time trend Exposure and / or disease incubation period An epi curve is useful because it can provide information on the outbreak’s: Pattern of spread Magnitude Outliers Time trend Exposure and / or disease incubation period I will discuss each of these aspects of an epi curve in detail.

34 Epidemic Curves The overall shape of the epi curve can reveal the type of outbreak (the pattern of spread) Common source Intermittent Continuous Point source Propagated The overall shape of the curve can reveal the type of outbreak (common source, point source or propagated).

35 Epidemic Curves: Common Source
People are exposed to a common harmful source Period of exposure may be brief (point source), long (continuous) or intermittent A common source outbreak is one in which people are exposed to a common harmful source. The period of exposure may be brief, long or intermittent.

36 Epi Curve: Common Source Outbreak with Intermittent Exposure
Intermittent exposure often results in an epi curve with irregular peaks that reflect the timing and the extent of exposure This epi curve might illustrate an outbreak in which a food handler ill with Hepatitis A worked only on certain days. Cases were exposed to a common source, the food handler, but they were exposed intermittently since the food handler worked only on certain days of the week. Pattern of Spread

37 Epi Curve: Common Source Outbreak with Continuous Exposure
This graph shows an example of an epi curve for a common source outbreak with continuous exposure. In this type of outbreak, the duration of exposure is relatively long, and cases will often rise gradually (and possibly plateau, rather than peak). An example of this type of outbreak could be an ill food handler who worked every day for many days in a row. Pattern of Spread

38 Epi Curve: Point Source Outbreak
This graph is an example of an epi curve for a point source outbreak. A point source outbreak is a type of common source outbreak in which all of the cases are exposed within one incubation period. Note how the graph shows a steep upslope and a comparatively gradual downslope. This shape is characteristic for a point source outbreak. Eating a batch of contaminated food at a social function would be an example of this type of outbreak. Everyone was exposed to the same source, a contaminated food, and the exposure was brief - in this case limited to the social function. Pattern of Spread

39 Epi Curve: Propagated Outbreak
This graph shows an example of an epi curve for a propagated outbreak. A propagated outbreak occurs when there is person-to-person spread. Because of this, propagated epidemics can last longer than common source epidemics, and may lead to multiple waves of infection if secondary and tertiary cases occur. The classic epi curve from a propagated outbreak shows successively taller peaks, distanced one incubation period apart. However, in reality, the epi curve for this type of outbreak may not fit this exact pattern. This type of outbreak could occur, for example, with a disease such as tuberculosis, in which one infected person transmits the disease to several other people who, in turn, infect even more people. Pattern of Spread

40 Epidemic Curves Magnitude
An epidemic curve can provide a sense of the magnitude of the outbreak as well. For example, there were 73 cases reported in the point source outbreak shown in a prior slide. This is a fairly large outbreak for certain diseases in a small geographical area! Magnitude

41 Epidemic Curves: Time Trend
Provide information about the time trend of the outbreak Consider: Date of illness onset for the first case Date when the outbreak peaked Date of illness onset for the last case Epi curves also can provide information about the time trend of the outbreak. These three pieces of information can provide insight into the period of exposure to the risk factor causing the outbreak or the incubation period of the organism causing the outbreak.

42 Epidemic Curves Time Trend
Again, using the same point source outbreak as an example, the epi curve allows the investigator to glean some useful information about the time trend involved. Illness onset for the first case was on day 11 and cases continued for the rest of the month. The outbreak peaked on day 21 and then began to decline. No new cases were reported after day 28. Unless there was secondary spread, based on the curve, this outbreak appears to be over. Time Trend

43 Epidemic Curves: Incubation Period
If the timing of the exposure is known, epi curves can be used to estimate the incubation period of the disease The time between the exposure and the peak of the epi curve represents the median incubation period Epi curves can also be used to estimate two important outbreak characteristics: the probable period of exposure and the incubation period of the causative organism. If the timing of the presumed exposure is known, epi curves can be used to estimate the incubation period of the disease, and this may facilitate the identification of the causative agent. This is because the period between the known or hypothesized exposure time and the peak of the epi curve represents the hypothesized median incubation period.

44 Epidemic Curves: Incubation Period
In common source outbreaks with known incubation periods, epi curves can help determine the average period of exposure Find the average incubation period for the organism and count backwards from the peak case on the epi curve In common source outbreaks involving diseases with known incubation periods, epi curves can help determine the probable period of exposure. This can be done by looking up the average incubation period for the organism and counting back from the peak case the amount of time of the average incubation period.

45 Epidemic Curves This can also be done with the minimum incubation period Find the minimum incubation period for the organism and count backwards from the earliest case on the epi curve Likewise, look up the minimum incubation period for the organism and count back that amount of time from the earliest case on the epi curve to get a second estimate of when exposure occurred.

46 Exposure / Outbreak Incubation Period
The exposure dates obtained from the average and minimum incubation periods should be close and should represent the probable period of exposure Widen the estimated exposure period by 10% to 20% Ideally, the dates obtained by counting back the average and the minimum incubation periods should be close, and the time between them will represent the probable period of exposure. Since this technique is not precise, you may want to widen the identified exposure period by 10% to 20% on either side so as not to miss a potential exposure.

47 Calculating Incubation Period
We can use these data from an outbreak of E. coli O157:H7 infection to demonstrate how to calculate the probable exposure period for a point source outbreak. First, we need to look up the minimum and average incubation periods for the organism. Next, we identify the peak of the outbreak and count back from it the number of days in the average incubation period. In this case, the average incubation period for this organism is 4 days, and counting back from the peak gives an exposure date of December 6. Then we begin at the earliest identified case and count back the minimum incubation period, which is 1 day, and we see that this gives an exposure date of December 7. Since this isn’t a precise technique, we should probably widen the interval a little, say from December 5th to the 7th, and look for exposures that occurred during this time period. Let’s also use this example to learn how to estimate the incubation period for the causative organism. Let’s pretend that we didn’t know the pathogen involved in this outbreak, but we knew that everyone was exposed on December 6th, at a school picnic. We could estimate the median incubation period by counting the number of days between the date of exposure and the peak of the epi curve. In this case, the estimated incubation period would be 5 days. This could help us identify the pathogen because we could focus on organisms that have a median incubation period of approximately 5 days. Source: Onset of illness among cases of E. coli O157:H7 Infection, Massachusetts, December, 1998.
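The count-back steps just described can be sketched with Python's standard datetime module. The peak and earliest onset dates below are assumptions back-calculated from the narration (December 6 plus 4 days, and December 7 plus 1 day), since the chart itself is not reproduced in this transcript; the function name is also made up for illustration.

```python
from datetime import date, timedelta


def count_back(onset_date, incubation_days):
    """Return the date `incubation_days` before `onset_date`."""
    return onset_date - timedelta(days=incubation_days)


# Assumed onset dates, inferred from the narration rather than read off the chart.
peak_onset = date(1998, 12, 10)
earliest_onset = date(1998, 12, 8)

# Count back the average incubation period (4 days) from the peak,
# and the minimum incubation period (1 day) from the earliest case.
from_peak = count_back(peak_onset, incubation_days=4)
from_earliest = count_back(earliest_onset, incubation_days=1)

# Widen by one day on the early side, as in the narration (an illustrative choice).
window_start = min(from_peak, from_earliest) - timedelta(days=1)
window_end = max(from_peak, from_earliest)

print("Count back from peak:         ", from_peak)
print("Count back from earliest case:", from_earliest)
print("Probable exposure window:     ", window_start, "to", window_end)
```

With these assumed dates the sketch reproduces the December 5th to 7th window from the example.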

48 Creating an Epidemic Curve
Provide a descriptive title Label each axis Plot the number of cases of disease reported during an outbreak on the y-axis Plot the time or date of illness onset on the x-axis Include the pre-epidemic period to show the baseline number of cases Let’s now look at the elements that you should address or include when you create an epidemic curve graph. The structure of an epi curve is straightforward. Simply plot the number of cases of disease reported during an outbreak on the y-axis (the vertical line) and the time or date of illness onset on the x-axis (the horizontal line).

49 Epi Curve for a Common Source Outbreak with Continuous Exposure
Here is an example of an epi curve with a descriptive title and labeled axes. A simple but important point to remember is to label the axes correctly and to include a descriptive title with each epi curve. The epi curve, with its title and axes, should provide enough information to be completely self-explanatory. Also, the pre-epidemic period should always be included on the graph to illustrate the baseline number of cases. Keep in mind that epi curves are a type of histogram, so, technically, there should not be any space between the x-axis data intervals. Furthermore, there should be no overlap between data intervals.
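The elements listed for an epi curve (title, labeled axes, equal intervals, and a pre-epidemic baseline) can be sketched as a plain-text histogram in Python. The onset dates, the title, and both function names are hypothetical; this is an illustration of the counting logic, not the graphing feature of Epi Info.

```python
from collections import Counter
from datetime import date, timedelta


def epi_curve_counts(onset_dates, start, end):
    """Count cases for every day from `start` to `end`, keeping zero-case days."""
    counts = Counter(onset_dates)
    n_days = (end - start).days + 1
    days = [start + timedelta(days=i) for i in range(n_days)]
    return [(day, counts.get(day, 0)) for day in days]


def render(curve, title):
    """Return a text histogram with a title and labeled axes."""
    lines = [title, "Date of onset (x-axis) | Number of cases (y-axis)"]
    for day, n in curve:
        lines.append(f"{day.isoformat()} | {'#' * n} ({n})")
    return "\n".join(lines)


# Hypothetical onset dates, for illustration only.
onsets = [date(2024, 3, 4), date(2024, 3, 6), date(2024, 3, 6), date(2024, 3, 7)]

# Start the axis before the first case so the pre-epidemic baseline is visible.
curve = epi_curve_counts(onsets, date(2024, 3, 1), date(2024, 3, 8))
print(render(curve, "Cases of illness by date of onset (hypothetical data)"))
```

Because every day in the range gets a row, days with no cases still appear, which keeps the intervals equal and shows the baseline.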

50 Creating an Epidemic Curve
X-axis considerations Choice of time unit for x-axis depends upon the incubation period Begin with a unit approximately one quarter the length of the incubation period Example: 1. Mean incubation period for influenza = 36 hours 2. 36 hours x ¼ = 9 hours 3. Use 9-hour intervals on the x-axis for an outbreak of influenza lasting several days One of the trickier aspects of creating an epi curve is choosing the unit of time for the x-axis. This choice is usually based on the incubation period of the illness and the time interval of the outbreak. In general, a time unit that is approximately one quarter of the incubation period is usually a good place to start.
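The one-quarter rule of thumb is simple arithmetic, sketched below in Python using the influenza figure from the slide. The function names and the 20-hour onset time are hypothetical, and the result is only a starting point to be adjusted by looking at the data.

```python
def starting_interval(incubation_period, fraction=0.25):
    """Suggest an x-axis time unit as a fraction of the incubation period."""
    return incubation_period * fraction


def interval_index(time_since_start, width):
    """Return which x-axis interval (0, 1, 2, ...) an onset time falls into."""
    return int(time_since_start // width)


# Influenza example from the slide: mean incubation period of 36 hours.
width_hours = starting_interval(36)
print(width_hours)  # 9.0

# A hypothetical case with onset 20 hours after the start of the x-axis
# falls into the third interval (index 2, covering hours 18 up to 27).
print(interval_index(20, width_hours))
```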

51 Creating an Epidemic Curve
X-axis considerations If the incubation period is not known, graph several epi curves with different time units Usually the day of illness onset is the best unit for the x-axis If the incubation period of the illness (or the illness itself) is not known, several epi curves with different time intervals on the x-axis should be examined to see which one best represents the data. For most diseases, date of onset is appropriate for the x-axis, but for illnesses with very short incubation periods (for example, Staphylococcus aureus food poisoning) hours of onset may be preferable. Likewise, for diseases with long incubation periods, such as tuberculosis, the best time interval may be days, weeks, or months. Epi Info software allows you to plot by hours, minutes, and even seconds, and by a.m. versus p.m. as needed.

52 Epi Curve X-Axis Considerations
For example, consider these data for the same outbreak displayed by week of onset on the left and day of onset on the right. The graph by day of onset looks quite different and is more informative than the graph by week of onset. For example, the epi curve using the day as the unit of time on the x-axis distributes the cases more evenly and highlights the first identified case patient. X-axis unit of time = 1 week X-axis unit of time = 1 day

53 Descriptive Epidemiology
Place That concludes our discussion on the importance of descriptive epidemiology in providing information about time trends in the outbreak. In addition to describing the time element of an outbreak, descriptive epidemiology can provide information about the geography of the outbreak.

54 Descriptive Epidemiology: Place
Spot map Shows where cases live, work, spend time If population size varies between locations being compared, use location-specific attack rates instead of number of cases Characterizing the outbreak by place allows the investigator to assess the geographic extent of the situation and may also reveal patterns, such as clusters of cases, that may provide information about the cause or source of the outbreak. A spot map can be used to describe the outbreak “place”. A spot map is simply a map that indicates the location of a case characteristic. For example, a spot map may show where a case lives or works. Individual cases are usually plotted with spot maps. If the populations in the areas being compared differ, it is best to use location-specific attack rates rather than numbers of cases. This will take the size of the location-specific population into account and will allow for comparison of the different areas. We will discuss attack rates later in this presentation.
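To see why location-specific attack rates are preferred when population sizes differ, here is a minimal Python sketch. It assumes the usual definition of an attack rate as cases divided by the population at risk, expressed as a percentage; the area names and counts are hypothetical.

```python
def attack_rate(cases, population):
    """Attack rate as a percentage: cases divided by the population at risk."""
    return 100 * cases / population


# Hypothetical areas with the same number of cases but different populations.
areas = {
    "Area A": (12, 400),
    "Area B": (12, 2400),
}

rates = {name: attack_rate(cases, pop) for name, (cases, pop) in areas.items()}
for name, rate in rates.items():
    print(f"{name}: {rate:.1f}%")
```

Both areas would show twelve dots on a spot map, but the attack rate in the smaller area is six times higher.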

55 Descriptive Epidemiology: Place
This is an example of a spot map from an outbreak of histoplasmosis in Minnesota. Each dot on the map represents the location where one case-patient in the outbreak lived. Source:

56 Descriptive Epidemiology
Person In addition to time and place, descriptive epidemiology is also used to characterize the case-patients involved in the outbreak.

57 Descriptive Epidemiology: Person
Data summarization for descriptive epidemiology of the population Line listings Graphs Bar graphs Histograms This may be done using several methods of data summarization. Three commonly used methods are: line listings, histograms and graphs.

58 Line Listing
Columns: Case #; Report Date; Onset Date; Physician Diagnosis; Signs/Symptoms (N, V, J); Lab (HAIgM); Demographics (Sex, Age). Values shown on the slide, by case number: 1: 10/12/02, 10/5/02, Hepatitis A, M, 37; 2: 10/4/02, 62; 3: 10/13/02, 38; 4: 10/9/02, NA, F, 44; 5: 10/15/02, 17; 6: 10/16/02, 10/6/02, 43. A line listing allows information regarding time, person, and place to be organized and reviewed quickly. It can be done by hand with pencil and paper or using a software program such as Microsoft Excel or Epi Info. To set up a line listing, create a table in which each row represents a case and each column represents a variable of interest (variables of interest will depend on the nature of the outbreak but should include components of the case definition). New cases should be added to the list as they are identified, and all cases should be updated throughout the investigation as new information is obtained. This is an example of a line listing for a Hepatitis A outbreak. It shows the case number (the name of the case is excluded because it is confidential information), the date the illness was reported, the date of symptom onset, the clinical diagnosis, whether the person had nausea, vomiting or jaundice, lab results, and some basic demographics.

59 Bar Graph This is an example of a bar chart graph showing the number of males and females in each exposure category. A bar chart graphs categorical data with space between the intervals. Always remember to include a descriptive title and to label both axes when you create a graph. Earlier in the session you saw numerous Epi Curve examples. Epi curves are histogram graphs that display continuous data with no space between the intervals.

60 Descriptive Epidemiology
Measures of central tendency Mean Median Mode Range Another way to characterize the study population in an outbreak is to use some descriptive statistics, such as measures of central tendency. These measures are ways to summarize or describe the values contained in a variable. The most commonly used measures of central tendency are the mean, median and mode. The range, which describes the spread of the values rather than their center, is usually reported alongside them.

61 Measures of Central Tendency
Mean (Average) The sum of all values divided by the number of values Example: Cases 7, 10, 8, 5, 5, 37, 9 years old Mean = (7 + 10 + 8 + 5 + 5 + 37 + 9) / 7 = 81 / 7 Mean = 11.6 years of age The mean is simply the arithmetic average. It is calculated by adding up all of the values and then dividing by the number of values. For example, if we wanted to calculate the mean age of seven case-patients in an outbreak investigation who were 7, 10, 8, 5, 5, 37 and 9 years of age we would add these values up (81) and divide by the number of case-patients, 7. The mean, or average age is 11.6 years.

62 Measures of Central Tendency
Median (50th percentile) The value that falls in the middle position when the measurements are ordered from smallest to largest Example: Ages 7, 10, 8, 5, 5, 37, 9 Ages sorted: 5, 5, 7, 8, 9, 10, 37 Median age = 8 The median is another measure of central tendency. It is the 50th percentile value. In other words it is the value in the middle position of a set of measurements ordered from smallest to largest. Using the same example as before, we first rank the ages from youngest to oldest. Since there are so few case-patients, it’s easy to see that the middle value is 8. Compared to the mean, the median is less sensitive to extreme values, because only the middle position matters, not how far the extreme values lie from it. In this example the case who is 37 years of age is quite a bit older than the rest of the cases. If you’ll recall, the mean age for these case-patients was 11.6. The median value, 8, is less because it is not influenced by the 37 year old.
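To make the point about extreme values concrete, here is a minimal sketch using Python's standard `statistics` module with the seven ages from the example. Dropping the 37-year-old is only an illustration of sensitivity, not a recommended analysis step, and the variable names are my own.

```python
from statistics import mean, median

ages = [7, 10, 8, 5, 5, 37, 9]                      # ages from the slide example
ages_without_outlier = [a for a in ages if a != 37]  # illustration only

mean_all = mean(ages)                      # 81 / 7
median_all = median(ages)                  # middle of the 7 sorted values
mean_trimmed = mean(ages_without_outlier)  # 44 / 6
median_trimmed = median(ages_without_outlier)

print(f"All seven cases:         mean = {mean_all:.1f}, median = {median_all}")
print(f"Without the 37-year-old: mean = {mean_trimmed:.1f}, median = {median_trimmed}")
```

The mean moves by more than four years when one case is removed, while the median moves by only half a year.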

63 Calculate a Median Value
If the number of measurements is odd: Median = value with rank (n+1) / 2 5, 5, 7, 8, 9,10, 37 n = 7, (n+1) / 2 = (7+1) / 2 = 4 The 4th value = 8 Where n = the number of values Oftentimes there will be too many values to identify the median by sight. In that case there are rules for calculating the median: If the number of values is odd, the median value is the value with rank (n+1)/2. In our example, n, which is the number of values, is 7 and (n+1)/2=4. The 4th ordered value is 8.

64 Calculate a Median Value
If the number of measurements is even: Median = average of the two values with rank n / 2 and rank (n / 2) + 1 Where n = the number of values Example: 5, 5, 7, 8, 9, 10 (the earlier ages without the 37-year-old) n = 6; n / 2 = 3, so “7” is the first value (n / 2) + 1 = 4, so “8” is the second value (7 + 8) / 2 = 7.5 The Median value = 7.5 If the number of values is even, the median is the value halfway between the values with rank (n/2) and (n/2)+1. In other words, it’s the average of the two middle values.
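The two rank rules (odd and even n) can be sketched as one small Python function. This is an illustrative sketch, not Epi Info code; `median_by_rank` is a name I made up, and the six-value list is simply the example ages with the 37-year-old left out so that n is even.

```python
from statistics import median

def median_by_rank(values):
    """Median using the slide's rank rules (ranks are counted from 1)."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty list is undefined")
    if n % 2 == 1:
        # Odd n: the value with rank (n + 1) / 2
        rank = (n + 1) // 2
        return ordered[rank - 1]
    # Even n: average of the values with rank n / 2 and (n / 2) + 1
    lower_rank = n // 2
    upper_rank = lower_rank + 1
    return (ordered[lower_rank - 1] + ordered[upper_rank - 1]) / 2

odd_ages = [7, 10, 8, 5, 5, 37, 9]   # n = 7, from the slides
even_ages = [7, 10, 8, 5, 5, 9]      # n = 6, illustrative

print(median_by_rank(odd_ages))      # rank 4 of the sorted list
print(median_by_rank(even_ages))     # average of ranks 3 and 4
```

The function agrees with `statistics.median` on both lists, which is a quick way to check a hand calculation.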

65 Measures of Central Tendency
Mode [Modal Value] The value that occurs the most frequently Example: 5, 5, 7, 8, 9,10, 37 Mode= 5 It is possible to have more than one mode Example: 5, 5, 7,8,10,10, 37 Modes= 5 and 10 Another way to look at the center of a distribution of values is to look for the value that occurs the most frequently. This is called the mode. In this example, the mode would be 5, because it occurs twice whereas the rest of the values only occur once. Depending on the data, there may be more than one modal value. For example, if we added someone to our study who was 10 years old there would be two values for the mode: 5 and 10.
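If you want every modal value reported, one option outside Epi Info is Python's `statistics.multimode` (available in Python 3.8 and later). The sketch below uses the two age lists from the slide; the variable names are my own.

```python
from statistics import multimode

ages_one_mode = [5, 5, 7, 8, 9, 10, 37]    # first slide example
ages_two_modes = [5, 5, 7, 8, 10, 10, 37]  # second slide example

# multimode returns a list of all values tied for the highest frequency
modes_first = multimode(ages_one_mode)
modes_second = multimode(ages_two_modes)

print("Mode(s), first example: ", modes_first)
print("Mode(s), second example:", modes_second)
```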

66 Measures of Central Tendency
Mode [Modal Value]: The value for the variable in which the greatest frequency of records fall Epi Info limitation: If multiple values share the same frequency that is also the highest frequency, Epi Info will identify only the first value it encounters as “Mode” as it scans the table in ascending order Computer programs such as Epi Info can be used to calculate these measures of central tendency, which is obviously the easier way to go about it if you have a large data set. However, a limitation of Epi Info is that it will not identify multiple modal values. It will identify only the first modal value it encounters.

67 Measures of Central Tendency Mode Software Limitation
Modal Values For example, here is an Epi Info histogram showing the frequency of age values in a dataset. This data set has several modal values: 11, 17, 35 and 62. But Epi Info will identify only 11 as the mode. The ages 11, 17, 35, and 62 all qualify for the status of “mode,” but Epi Info identifies Age 11 as the mode in analysis output for MEANS AGE in viewOswego.

68 Measures of Central Tendency
50th percentile The box on the top of this slide was generated from Epi Info and shows all of the measures of central tendency we have discussed. The results of the central tendency analysis are illustrated at the bottom of the slide. In this data set the mean was 36.8, the median (which is the 50th percentile) is 36, the most commonly appearing value was 11, which is the mode, and the data ranged from 3 to 77. Min = 3; Mode = 11; Median = 36.0; Mean (average) = 36.8; Max = 77

69 Activity: Calculate Mean and Median
Completion time: 5 minutes We are going to pause here for a short activity.

70 Calculate Mean and Median Age
Case # / Age (Years): 1 / 5, 2 / 9, 3 / 7, 4 / 6, 5 / 5, 6 / 8 We want you to practice hand-calculating two measures of central tendency with the numbers provided in this table. We kept both the number of values and the individual values small so you can complete this activity without a calculator. However, you WILL need paper and pencil / pen. The first value that we are asking you to calculate is the Mean (average) age among the six cases listed in the table. The second value that we are asking you to calculate is the Median (50th percentile) age among the six cases listed in the table. We have provided the formula for calculating the Median with an even number of values. In a few minutes I will show you the answers and how they were calculated. For an even number of measurements, Median = the average of the two values ranked n / 2 and (n / 2) + 1

71 Calculate Mean and Median Age
Mean age: 5 + 5 + 6 + 7 + 8 + 9 = 40 40 / 6 = 6.67 years Median age: 5, 5, 6, 7, 8, 9 Average of values ranked (n/2) and (n/2)+1 = ranks (6/2) and (6/2)+1 = average of 6 and 7 = (6+7) / 2 = 6.5 years Here are the answers to the activity: The mean age was calculated by adding all six of the values, then dividing that total by 6, the number of values that were added together. The Mean age in years among our six cases is 6.67. Because we had an even number of values to work with, we had to calculate the median by averaging the two middle values. Our median age for these six cases is 6.5 years.

72 Question & Answer Opportunity
What questions do you have about the elements of descriptive epidemiology that I have discussed today?

73 5 minute break

74 Attack Rates At the beginning of the presentation, I noted several important functions that descriptive epidemiology serves: First, data cleaning, which includes the use of descriptive epidemiology tools and familiarizes the investigator with the data Second, descriptive epidemiology allows the investigator to comprehensively describe the outbreak with respect to time, place and the affected population Finally, descriptive epidemiology is essential for hypothesis generation Hypotheses about the risk factor or exposure for the disease under question can be generated by reviewing the medical literature, from information gathered in hypothesis-generating questionnaires and by performing descriptive epidemiology to characterize the cases and to establish commonalities between them. Attack rates are an important way to examine the outbreak data for purposes of generating hypotheses. In fact, they are probably the most common measure of disease frequency used in outbreak investigations.

75 Attack Rates (AR)
AR = (# of cases of a disease) / (# of people at risk, for a limited period of time) Food-specific AR = (# people who ate a food and became ill) / (# people who ate that food) Attack rates are calculated by dividing the number of people at risk in a population who become ill by the number of people at risk in the population. Attack rates are useful for comparing the risk of disease in groups with different exposures. For example, food-specific attack rates can be calculated by dividing the number of people who ate a certain food and became ill by the total number of people who ate the food.

76 Food-Specific Attack Rates
Consumed Item (Ill, Total, AR%) versus Did Not Consume Item (Ill, Total, AR%):
Chicken: 12, 46, 26 versus 17, 29, 59
Cake: 26, 43, 61 versus 20, 32, 63
Water: 10, 24, 42 versus 33, 51, 65
Green Salad: 42, 54, 78 versus 3, 21, 14
Asparagus: 4, 6, 67 versus 42, 69, 61
Let’s discuss an example of food-specific attack rates. These data come from a hypothetical example of gastrointestinal illness among wedding attendees. Attack rates for those who did and did not eat specific food items were calculated by dividing the number of people who consumed an item and became ill by the total number of people who consumed the item, and then doing the same for persons who did not consume the food. Generally, there are three situations to look out for when calculating attack rates in an outbreak: The attack rate is high among those who consumed the food item. The attack rate is low among persons who did not consume the item. Most of the cases were exposed to the food item, making the exposure a reasonable explanation for most or all cases. Let’s see if any of the food items satisfy the three situations we’re looking for: The attack rate for the asparagus among those who consumed asparagus is pretty high (67%), but the attack rate among those who did not consume asparagus is also fairly high (61%). Moreover, only four persons who consumed the asparagus became ill, so it probably isn’t the culprit. However, the green salad looks interesting. The attack rate is high among the exposed (78% of those who consumed a green salad) and the attack rate is comparatively low among persons who did not eat green salad (14%). Furthermore, 42 of the cases, which is most of them, ate the green salad, making it a reasonable explanation for the illness. When the attack rate is high both among people who ate a specific food item and among people who did not, the food is not likely to be the source of infection. In this example, the attack rate for those who ate cake is high, but it is also high among those who did not eat cake.
However, when the attack rate is high among people who DID consume a specific food item and low among people who did NOT consume that food item, and many of the ill people consumed the food item in question, making it a plausible explanation for the outbreak, the causal relationship between the food item and illness should be further studied. This is done by conducting an analytic study and calculating an odds ratio or risk ratio to assess the relationship between exposure to the food and outcome of illness. The next portion of today’s session will look more at these aspects of analytic epidemiology. CDC. Outbreak of foodborne streptococcal disease. MMWR 23:365, 1974. This food is probably not the source of infection
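The three things to look for (high attack rate among those who ate the item, low attack rate among those who did not, and most cases exposed) can be sketched in a few lines of Python. This is an illustrative sketch only: the counts for chicken and green salad are taken from the slide table, while the function and variable names are my own, and no cut-off for "high" or "low" is implied.

```python
def attack_rate(ill, total):
    """Attack rate as a percentage: ill / total * 100."""
    return 100 * ill / total

# food: ((ill, total) among those who ate it, (ill, total) among those who did not)
foods = {
    "Chicken": ((12, 46), (17, 29)),
    "Green Salad": ((42, 54), (3, 21)),
}

summary = {}
for food, (ate, did_not_eat) in foods.items():
    ar_ate = attack_rate(*ate)
    ar_not = attack_rate(*did_not_eat)
    # Share of all ill persons (in this row) who ate the item
    share_of_cases = 100 * ate[0] / (ate[0] + did_not_eat[0])
    summary[food] = (ar_ate, ar_not, share_of_cases)
    print(f"{food}: AR ate = {ar_ate:.0f}%, AR did not eat = {ar_not:.0f}%, "
          f"cases exposed = {share_of_cases:.0f}%")
```

Green salad stands out on all three points, whereas chicken shows a lower attack rate among those who ate it than among those who did not.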

77 Stratified Attack Rates
(columns: Ill, Well, Total, AR%) Women: 13, 16, 29, 45 Men: 5, 27, 32, 16 Attack rate in women: 13 / 29 = 45% Attack rate in men: 5 / 32 = 16% Sometimes it is helpful to calculate stratified attack rates. This just means to calculate attack rates by a characteristic of interest. For example, this table shows gender-specific attack rates. It is interesting that women have a much higher attack rate compared to men. This might prompt investigators to consider an exposure that is common in women. Stratified attack rates can also be calculated by other variables such as: age group, occupation or race.

78 Hypothesis Generation vs. Hypothesis Testing
As we discussed, descriptive epidemiology (along with medical literature) is one of the methods we can use to generate hypotheses. Descriptive epidemiology may provide some information to give us some ideas about what is causing illness. Once hypotheses have been generated using descriptive epidemiology, they may then be tested, using analytic epidemiology. So next we will talk about analytic epidemiology. But first, let’s contrast hypothesis generation and hypothesis testing. These two crucial aspects to an outbreak investigation are often confused when, in reality, although they both involve data analysis, they are two distinct entities.

79 Hypothesis Generation vs. Hypothesis Testing
Formulate hypotheses Occurs after having spoken with some case-patients and public health officials Based on information from literature review Based on descriptive epidemiology (step #3) Test hypotheses Occurs after hypotheses have been generated Based on analytic epidemiology Medical literature, the information collected from a hypothesis-generating questionnaire, and other descriptive epidemiology such as the calculation of attack rates help you formulate hypotheses about the source of the agent involved in the outbreak, how the disease was transmitted, and the exposure or risk factor that caused the disease. Once you formulate hypotheses based on descriptive epidemiology, the next step is to test the hypotheses with an analytic study. In other words, once a hypothesis or hypotheses have been generated, their plausibility can be evaluated with the use of statistics.

80 Descriptive Epidemiology Analytic Epidemiology
Descriptive Epidemiology: Search for clues; Formulate hypotheses; No comparison group; Answers: How much, who, what, when, where Analytic Epidemiology: Clues available; Test hypotheses; Comparison group; Answers: How, why Now that we have discussed descriptive epidemiology, and are clear on the difference between hypothesis generation and hypothesis testing, we’re ready to discuss analytic epidemiology. Before we begin, let’s contrast the two approaches.

81 Analytic Epidemiology

82 Analytic Epidemiology
Measures of Association Risk Ratio (cohort study) Odds Ratio (case-control study)

83 Cohort versus Case-Control Study
This slide highlights differences between the design of a cohort and case-control study. This was discussed in greater detail during the August PHIN “Study Design” session. The most important distinguishing point between the two is that in a cohort, exposed and unexposed persons are compared and in a case-control study diseased and non-diseased persons are compared.

84 Cohort versus Case-Control Study
Another difference between the two study designs is that in a cohort study the risk ratio is the measure of association used to assess the relationship between the risk factor and the illness while the odds ratio is used in a case-control study.

85 Analysis Output This is a picture of data analysis output from Epi Info. You can see that it provides both the odds ratio and risk ratio values simultaneously in analysis output, regardless of your study design. It is therefore important to understand the study design you are using to know if you should rely on the risk ratio or the odds ratio.

86 Measure of Association
Cohort Study Measure of Association

87 Risk Ratio Ill Not Ill Total Exposed A B A+B Unexposed C D C+D
Risk Ratio = [A/(A+B)] / [C/(C+D)] The measure of association between the exposure and the illness in a cohort study is called the risk ratio. Another name for it is relative risk. A risk ratio is calculated by first measuring the risk of disease in the exposed group and in the unexposed group. The risk is calculated by dividing the number of new cases of disease in the exposed group by the total number of people in the exposed group and doing the same for the unexposed group. So, for example, we see that the risk of disease among the exposed is A/(A+B) and the risk of disease in the unexposed is C/(C+D). The risk ratio is the ratio of the risk in the exposed to the risk in the unexposed and is shown at the bottom of the slide. In a cohort study with multiple potential exposures, several 2x2 tables would be created and a risk ratio for the association between each different potential exposure and the disease would be calculated. For example, there might be one 2x2 table and risk ratio for having eaten the potato salad vs. not having eaten the potato salad, and another 2x2 table and risk ratio for having eaten the ham vs. not having eaten the ham.

88 Risk Ratio Example
(columns: Ill, Well, Total) Ate alfalfa sprouts: 43, 11, 54 Did not eat alfalfa sprouts: 3, 18, 21 Total: 46, 29, 75 Now let’s practice calculating a risk ratio using a hypothetical outbreak of Salmonella with consumption of alfalfa sprouts as the potential exposure. It is calculated as (43/54), the risk in the exposed, divided by (3/21), the risk in the unexposed. The resulting risk ratio is 5.6. This indicates that persons who ate alfalfa sprouts were 5.6 times as likely to become ill as those who did not eat alfalfa sprouts. RR = (43 / 54) / (3 / 21) = 5.6
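The alfalfa sprouts calculation can be checked with a short Python sketch. The cell labels a, b, c, d follow the A, B, C, D layout of the risk ratio 2x2 table; the function name is my own.

```python
def risk_ratio(a, b, c, d):
    """Risk ratio from a cohort 2x2 table.

    a = exposed and ill,   b = exposed and well,
    c = unexposed and ill, d = unexposed and well.
    """
    risk_exposed = a / (a + b)
    risk_unexposed = c / (c + d)
    return risk_exposed / risk_unexposed

# Hypothetical Salmonella outbreak from the slide
rr = risk_ratio(a=43, b=11, c=3, d=18)
print(f"RR = {rr:.1f}")
```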

89 Interpreting a Risk Ratio
RR=1.0 = no association between exposure and disease RR>1.0 = positive association RR<1.0 = negative association A risk ratio is interpreted in the following way: If the exposure is not associated with the illness, the RR=1 If the exposure is positively associated with the illness, RR>1 If the exposure is negatively associated with the illness, RR<1, indicating a potential protective effect of the exposure.

90 Measure of Association
Case-Control Study Measure of Association

91 (A/C)/(B/D)=(A*D)/(B*C)
Odds Ratio Cases Controls Exposed A B Unexposed C D Odds Ratio (A/C)/(B/D)=(A*D)/(B*C) In a case-control study, the measure of association used is called the odds ratio. Because we begin with diseased people (the cases) and people without the disease (the controls) we do not know the risk in the exposed and unexposed groups, as we did in the cohort study. For this reason, we are unable to directly calculate the risk ratio, the measure of association used in cohort studies. Instead, in case-control studies, another measure of association, the odds ratio, is used to estimate the risk ratio. Once data are collected from a case-control study, a 2x2 table can be constructed and the odds ratio may be calculated. The odds ratio is generated by first calculating the odds that a case was exposed (A/C) and the odds that a control was exposed (B/D) and then taking their ratio: (A/C)/(B/D) which simplifies to (A*D)/(B*C) Again, in reality, more than one 2x2 table and odds ratio may be calculated to examine the relationship between multiple exposures and the illness.

92 Odds Ratio Example
(columns: Case, Control, Total) Ate at restaurant X: 60, 25, 85 Did not eat at restaurant X: 18, 55, 73 Total: 78, 80, 158 Let’s use a hypothetical outbreak of hepatitis A as an example to learn how to interpret an odds ratio. The exposure is having eaten at restaurant X in April 2004. If we conducted a case-control study and generated an OR of about seven, we would interpret it in the following way: The odds of having eaten at restaurant X in April 2004 were about 7 times greater among persons with Hepatitis A than among persons without Hepatitis A. OR = (60 / 18) / (25 / 55) = 7.3
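The same kind of check works for the odds ratio. This minimal sketch uses the restaurant X counts from the slide and the cross-product form (A*D)/(B*C); the function name is my own.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio from a case-control 2x2 table.

    a = exposed cases,   b = exposed controls,
    c = unexposed cases, d = unexposed controls.
    """
    return (a * d) / (b * c)

# Hypothetical hepatitis A outbreak from the slide
odds = odds_ratio(a=60, b=25, c=18, d=55)
print(f"OR = {odds:.1f}")
```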

93 Interpreting an Odds Ratio
The odds ratio is interpreted in the same way as a risk ratio: OR=1.0 = no association between exposure and disease OR>1.0 = positive association OR<1.0 = negative association The odds ratio is interpreted in the same way as a risk ratio: If the exposure is not associated with the illness, the OR=1 If the exposure is positively associated with the illness, OR>1 If the exposure is negatively associated with the illness, OR<1. Again, this could indicate a protective effect of the exposure.

94 What to do with a Zero Cell
(columns: Case, Control, Total) Ate at restaurant X: 60, 0, 60 Did not eat at restaurant X: 18, 55, 73 Total: 78, 55, 133 Try to recruit more study participants Add 1 to each cell* *Remember to document / report this! What happens if one of the cells in the 2x2 table has a zero in it? It’s impossible to calculate an odds ratio or risk ratio in this case. If this happens, a software program will probably produce an error message that says “undefined”. The investigator can try to recruit additional study participants into the study to get rid of the zero cell. If this is impossible, the investigator may also add 1 to each cell and then calculate the measure of association. Remember, if this latter approach is taken, it must be documented and reported.
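As a rough sketch of the "add 1 to each cell" approach, the Python function below applies the correction only when a zero cell is present and returns a flag so the adjustment can be documented. The function name and the flag are my own additions for illustration; the counts are the zero-cell table from the slide.

```python
def odds_ratio_with_correction(a, b, c, d):
    """Return (odds ratio, corrected flag).

    If any cell is zero, 1 is added to every cell before calculating,
    and the flag is True so that the adjustment can be reported.
    """
    corrected = 0 in (a, b, c, d)
    if corrected:
        a, b, c, d = a + 1, b + 1, c + 1, d + 1
    return (a * d) / (b * c), corrected

# Zero-cell table from the slide: no controls ate at restaurant X
value, was_corrected = odds_ratio_with_correction(60, 0, 18, 55)
note = " (1 added to each cell)" if was_corrected else ""
print(f"OR = {value:.1f}{note}")
```

Tables without a zero cell pass through unchanged, so the function gives the ordinary odds ratio in that case.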

95 Confidence Intervals Both measures of association, the risk ratio and the odds ratio, will be accompanied by confidence intervals. By convention, 95% confidence intervals are usually used in epidemiology. Let’s discuss these for a moment.

96 Confidence Intervals Allow the investigator to:
Evaluate statistical significance Assess the precision of the estimate (the odds ratio or risk ratio) Consist of a lower bound and an upper bound Example: RR=1.9, 95% CI: 1.1-3.1 The 95% confidence interval includes a lower limit and an upper limit around the point estimate, the odds ratio or the risk ratio. For example, in a study with an odds ratio of 1.9, the confidence interval may be 1.1 to 3.1. Confidence intervals serve two important purposes. First, they allow you to evaluate statistical significance. If the confidence interval for an odds ratio or a risk ratio includes 1.0, the estimate is not statistically significant. If the confidence interval does not include 1.0, the estimate is statistically significant. This is the same as having a p-value of less than 0.05.

97 Confidence Intervals Provide information on precision of estimate
Narrow confidence intervals = more precise Wide confidence intervals = less precise Example (wide): OR=10, 95% CI: 0.9-44 Example (narrow): OR=10, 95% CI: 9-11 Secondly, confidence intervals can be used to evaluate the precision of an estimate. In general, precision can be affected by the sample size and the role of random error, which is variability in the data that we can’t readily explain. A smaller sample size and/or lots of random error will lead to an imprecise estimate and will result in wide confidence intervals. A larger sample size and/or little random error will lead to a precise estimate and will result in narrower confidence intervals. So, in general, narrow confidence intervals indicate higher precision and wide confidence intervals indicate less precision. For example, if a study were performed and generated an odds ratio of 10 with a 95% confidence interval of 0.9 to 44, what would these results tell us? Since the confidence interval includes 1.0, the estimate is not quite statistically significant, although the magnitude of the association, 10, is quite strong. However, the confidence interval is so wide that the study either had a small sample size, a lot of random error, or both, so it would behoove the investigators to be somewhat cautious when interpreting these results and arguing that the exposure-disease relationship is causal. On the other hand, if a study were performed and generated an odds ratio of 10 with a 95% confidence interval of 9 to 11, the magnitude of the association, again, is quite strong and the confidence interval is narrow. It is unlikely that an estimate this far from the null value of 1.0 with a confidence interval this tight would be due to random error. Therefore, the investigator would probably feel more confident arguing that the exposure-disease relationship is causal, assuming no bias is present.
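The session does not say how the confidence limits are computed. As one common approximation (an assumption on my part, and not necessarily the method Epi Info reports), a 95% interval for an odds ratio can be built on the log scale using the standard error sqrt(1/a + 1/b + 1/c + 1/d). The sketch below applies it to the restaurant X table; the function and variable names are my own.

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio with an approximate 95% CI (log-based method).

    Assumes no cell is zero. z = 1.96 corresponds to 95% confidence.
    """
    estimate = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(estimate) - z * se_log)
    upper = math.exp(math.log(estimate) + z * se_log)
    return estimate, lower, upper

estimate, lower, upper = odds_ratio_ci(60, 25, 18, 55)
# Statistically significant if the interval excludes the null value 1.0
significant = not (lower <= 1.0 <= upper)
print(f"OR = {estimate:.1f}, 95% CI: {lower:.1f}-{upper:.1f}, "
      f"significant: {significant}")
```

Other methods (for example, exact limits) can give somewhat different bounds, especially with small cell counts.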

98 Plan and Execute Additional Studies
To gather more specific info Example: Salmonella muenchen Intervention study Example: implement intensive hand-washing Unfortunately, analytic studies aren’t always as informative as we hope they will be. For example, perhaps the hypotheses that were being tested were not well founded to begin with. When this happens, investigators must look anew at their existing data, possibly gather more information from the cases, and attempt to generate additional hypotheses. An example of this occurred with Salmonella muenchen in Ohio. The investigators performed a case-control study in which they asked about exposure to many different types of foods but the study failed to reveal any potential food exposure. The investigators noticed that all of their case households but less than half of their control households included individuals between the ages of 15 and 35 years. This prompted them to consider vehicles of transmission that were common among young adults. After additional questioning and a second case-control study, investigators were able to implicate marijuana as the outbreak source. This finding was confirmed by laboratory analysis of the marijuana. Finally, the goal of most outbreak investigations is to control and prevent disease transmission. Sometimes intervention studies are performed to evaluate these measures. For example, after an outbreak of Norwalk virus on a cruise ship an investigator may want to implement a prospective cohort study to assess the effect of implementing a mandatory intensive hand washing schedule for all crew members on the incidence of Norwalk virus illness.

99 Question & Answer Opportunity
What questions do you have about the measures of association involved in analytic epidemiology? If you do not have any further questions, we will take our second and final five minute break for the day. When we resume the session, Sarah Pfau will walk you through a mock analysis in Epi Info software. Sarah will illustrate many of the descriptive and analytic methods that I discussed today by showing you some of the output that you can expect to generate in Epi Info.

100 5 minute break

101 Download Epi Info software for free at: http://www.cdc.gov/epiinfo
Epi Info Analysis Case Study Download Epi Info software for free at: http://www.cdc.gov/epiinfo Many of you across the state have participated in Epi Info software trainings. Those trainings, like this session, were developed by the North Carolina Center for Public Health Preparedness and sponsored by the Virginia Department of Health Emergency Preparedness and Response Programs Office. We want to use today’s session to emphasize the practical application of Epi Info software for your outbreak investigation analytic studies. With the application of just a handful of Windows-based analysis commands, you can generate the descriptive and analytic output that you need to generate and test hypotheses during an outbreak investigation. Our limited time today will not allow me to conduct a live software demonstration, so I have generated some screen shots of the case study analysis output from Epi Info. In addition to the output images, I will provide you with the Windows-based commands and code that you can use to replicate this case study analysis on your own.

102 Oswego Tutorial 1. Epi Info Main Menu 2. “Help” 3. “Tutorials”
The case study example on which I have based today’s sample analyses is an outbreak that occurred in Oswego County, New York in 1940. You can access the complete Oswego Tutorial that comes installed with Epi Info software from the Epi Info home page. Select “Help” from the toolbar menu, then the “Tutorials” option from the dropdown menu, then “Oswego tutorial” from the menu that appears.

103 Case Study Overview Oswego County, New York: 1940
80 people attended a church supper on 4 / 18 46 people who attended the supper suffered from gastrointestinal illness beginning 4 / 18 and ending 4 / 19 75 people (ill and non-ill) interviewed Investigation focus: church supper as source of infection Let’s jump right into the case study. In Oswego County, 80 people attended the same church supper, and by the next day, the local health officer identified an outbreak of acute gastrointestinal illness in the community. All persons known to be ill had attended the church supper; furthermore, family members who had not attended the church supper had not become ill. So investigators focused the outbreak investigation on the church supper.

104 Church Supper Supper held in the church basement.
Foods contributed by numerous families.  Supper from 6:00 PM to 11:00 PM, so food consumed over a period of several hours. Here is the menu of items that were served at the church supper. And I will tell you a little bit more about the setting in which food was served: The supper was held in the basement of the village church.  Foods were contributed by numerous members of the congregation.  The supper began at 6:00 PM and continued until 11:00 PM, so the food spread out upon a table was consumed over a period of several hours.

105 Case Study Descriptive Epidemiology
Investigators needed to determine: The type of outbreak occurring; The pathogen causing the acute gastrointestinal illness; and The source of infection Let’s begin with the descriptive epidemiology element of the Oswego outbreak investigation. Even though investigators knew that they wanted to focus the investigation on the church supper, they needed to perform descriptive epidemiology in order to generate a hypothesis about: The type of outbreak occurring; The pathogen causing the acute gastrointestinal illness; and The source of infection. So investigators interviewed 75 of the 80 people known to have been present at the church supper. All data for both descriptive and analytic analyses were collected with one survey instrument, but let’s first consider the descriptive elements and walk through generating and interpreting output in Epi Info software.

106 Data Cleaning Know your data! Know the: Number of records
Field formats and contents Special properties Table relationships As Amy discussed today, one of the first steps to generating descriptive statistics is data cleaning. Before you begin, you may want to first re-visit the original survey instrument. This will show you how database fields originated, and how response options were coded on the survey instrument. Within Epi Info software, you can first open your database in MakeView to see how field response options were coded. For example, are they nominal or ordinal? Was a Likert scale used? [These terms were addressed in the September “Designing Questionnaires” session; you may want to review the slides on the Virginia Department of Health Training Web site if these concepts seem fuzzy]. You can also identify any programming associated with specific fields and review the database structure or layout.

107 Data Cleaning Tell Epi Info which records to include in analyses
When you Read in your data set in Epi Info’s Analyze Data component, you will see documentation about the number of records in the database. In the “Set” command in Analyze Data, you can tell Epi Info to include all records, versus only deleted or undeleted records that you may have flagged in the Enter Data component of the software. You can also tell Epi Info to include records with missing values. Tell Epi Info which records to include in analyses “Set” command in Analyze Data

108 Case Study: Line Listing
Organize and review data about time, person, and place that were collected via hypothesis generating interviews. Once you “Read” in your data table in the Analyze Data component of Epi Info, determine the contents of your database, and do any “cleaning” as needed, you can begin to generate descriptive statistics that will help you develop a research hypothesis. The first descriptive output that you might generate is the line listing. You learned today that a line listing allows you to quickly organize and review information regarding time, person, and place. I will show you how to generate one in Epi Info.

109 Case Study: Line Listing
Code for generating output: Here is a line listing that has been sorted on the continuous variable AGE in ascending order. You can include as many variables as you want in a line listing. Here I included the “person” variables AGE, SEX, and DATEONSET (date of onset of illness). Also, for your reference when you download these slides, here is the Epi Info programming code that: Reads in ViewOswego Displays Variables [this is an optional step, just to familiarize yourself with the data table contents so you will know which variable names to call up for a line listing] Sorts on AGE in ascending order Selects a subset of records: in this example, only those for ill people in the data set; and Generates a line listing for the fields AGE, SEX, and date of onset of illness. Try to reproduce this output on your own! I have included an image of the line listing so you can check your work.

110 Line Listing Windows Commands
1. Read (viewOswego in Sample.MDB) 2. Sort (on AGE, in ascending order) 3. Select (only the cases where ILL=“Yes”) 4. List (generate a line listing with the fields AGE, SEX, and DATEONSET) To generate this line listing using Analyze Data windows-based commands, you would use the “Read” “Sort” “Select” and “List” commands in that order.
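For readers who want to see the logic of the Read, Sort, Select, and List sequence outside of Epi Info, here is a minimal sketch in Python. It is not Epi Info code. The five records are invented for illustration (they are not the Oswego data), and the function name `line_listing` is hypothetical.

```python
# Hypothetical records for illustration only; these are NOT the Oswego data.
records = [
    {"AGE": 52, "SEX": "F", "ILL": "Yes", "DATEONSET": "04/19 00:30"},
    {"AGE": 11, "SEX": "M", "ILL": "No", "DATEONSET": None},
    {"AGE": 35, "SEX": "M", "ILL": "Yes", "DATEONSET": "04/18 22:00"},
    {"AGE": 8, "SEX": "F", "ILL": "Yes", "DATEONSET": "04/19 02:00"},
    {"AGE": 62, "SEX": "F", "ILL": "No", "DATEONSET": None},
]


def line_listing(rows, fields, sort_key, ill_value="Yes"):
    """Sort the rows, keep only ill records, and return the requested fields."""
    ordered = sorted(rows, key=lambda r: r[sort_key])        # Sort (ascending)
    cases = [r for r in ordered if r["ILL"] == ill_value]    # Select ILL = "Yes"
    return [{f: r[f] for f in fields} for r in cases]        # List chosen fields


listing = line_listing(records, ["AGE", "SEX", "DATEONSET"], "AGE")
for row in listing:
    print(row["AGE"], row["SEX"], row["DATEONSET"])
```

Because the selection is applied after the sort, the listing keeps the ascending AGE order and shows only the ill records.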

111 Case Study: Means
If you do not like looking at all of those rows of data in a line listing and are mostly interested in quickly determining the distribution of a demographic variable such as age among cases in your outbreak, you can alternatively use the “Means” command in Epi Info. The standard output will reveal the minimum, maximum, and average ages among cases. You can generate this output with one, windows-based command-- “Means” -- and select AGE as the variable of interest in the dialogue window. I have pasted the programming code at the bottom of this slide – a simple, two-word program! Here you can see that among the 46 cases in the case study database, the minimum age is 3 years, the maximum age is 77 years, and the average age is 39 years. Code: Windows command: Means (of AGE)

112 Distribution: Frequency by Gender
A frequency distribution is another element of descriptive epidemiology. A frequency table provides quick insight into the distribution of record categories such as “male” versus “female.” Here is the frequency distribution of Oswego cases by gender. You can quickly glean that nearly two thirds of the cases are female. Notice also that Epi Info frequency tables, by default, include yellow, horizontal bar graphs that represent the distribution of cases. For your reference, I have provided the windows-based command on the right side of the slide. The code on the left: Reads in ViewOswego; Sorts on AGE in ascending order; Selects a subset of records: only the ill people in the data set; and Generates a frequency table for the variable SEX. Try to reproduce this output on your own! Windows command: Frequencies (by SEX)

113 Case Study: Epidemic Curve
Variable of Interest: DATEONSET (date of onset of illness) Entered into database mm/dd/yyyy/hh/mm/ss/AM PM Another helpful piece of descriptive output is the epidemic curve graph. By generating a histogram that illustrates an epidemic curve, you can determine the type of outbreak under investigation and estimate a pathogen’s incubation period. This is particularly helpful if you have not yet identified the pathogen, as in this case study. The Oswego case study variable of interest for generating an epi curve is “DATEONSET” [the case date of onset of illness]. DATEONSET was programmed into Epi Info with the “date / time” format shown here.

114 Case Study: Epidemic Curve
Here is a histogram that I generated in Epi Info for the continuous variable, “DATEONSET”. Remember that an epi curve is part of the “time” element of descriptive epidemiology. The X-axis interval on this graph is one hour, beginning with 3:00 p.m. on April 18th and ending with 11:00 a.m. on April 19th. So right away you know that the epidemic was short-lived. You can also look at the distribution of cases and infer that the type of outbreak is “Point Source.” Let’s look at the textbook image that Amy showed you earlier, and compare it with this histogram.

115 Point-Source Outbreak
If you compare these two images, you can feel pretty confident that you are correct in labeling this outbreak as “Point-Source.” So who – or what – was the point source of infection at the church supper? You cannot answer that question conclusively until you conduct an analytic study and examine the measures of association in Epi Info. But first we need to look at more descriptive statistics ‘Textbook’ distribution Case Study distribution

116 Case Study: Epidemic Curve
Average incubation period; Maximum incubation period; Overlap; Outlier?
We are not yet done assessing the epi curve. The histogram shows that the onset of illness was close to the time of food consumption, since the meal was served in the evening of April 18th, from 6:00 to 11:00 p.m. Isn’t it interesting that we have a case plotted at 3:00 p.m., prior to the meal being served? Could this person be our point source, or just an outlier? We should now determine the minimum, maximum, and average incubation periods. There are two ways to do this. The first way is to use this histogram, label segments of information, and do some simple subtraction. Following the model that Amy showed you earlier, I labeled the maximum incubation period, or the time from the most likely period of exposure to the time that the last case of illness was reported. This is represented by the blue line on the histogram. It runs from the earliest hour at which dinner was served (6:00 p.m. on April 18th) to the latest case reported: 10:00 a.m. on April 19th. Next I labeled the minimum incubation period. Supper was technically served from 6:00 p.m. to 11:00 p.m., and the first onset of illness after the meal began was at 9:00 p.m. Since there is overlap between the latest time of “most likely period of exposure” and the first onset of illness, I cannot neatly label a minimum incubation period. However, I can infer that the minimum incubation period was very short: possibly 3 hours or less if someone ate at 6:00 p.m. or later and was already ill at 9:00 p.m. The final element to label was the average incubation period, represented by the red line on the histogram. It runs from the middle of the time that food was served (8:30 p.m., because we know from the case study introductory information that food was served from 6:00 p.m. to 11:00 p.m.) to the peak of the outbreak. Using this method to estimate the average incubation period with a histogram, I can count the number of hours from 8:30 p.m. on April 18th to 2:00 a.m. on April 19th and infer that the average incubation period was 5.5 hours.
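The subtraction behind this histogram estimate can be written out as a short Python sketch. The serving window (6:00 p.m. to 11:00 p.m. on April 18th) and the peak (2:00 a.m. on April 19th) come from the notes above; the year used in the code is an arbitrary placeholder, since only the elapsed time matters, and the variable names are my own.

```python
from datetime import datetime

# The year is an arbitrary placeholder; only elapsed time is used.
YEAR = 2000
serving_start = datetime(YEAR, 4, 18, 18, 0)   # supper served from 6:00 p.m.
serving_end = datetime(YEAR, 4, 18, 23, 0)     # until 11:00 p.m.
epidemic_peak = datetime(YEAR, 4, 19, 2, 0)    # peak of the epi curve, 2:00 a.m.

# Midpoint of the most likely exposure period.
serving_midpoint = serving_start + (serving_end - serving_start) / 2

# Hours from the exposure midpoint to the peak of the outbreak.
average_incubation_hours = (epidemic_peak - serving_midpoint).total_seconds() / 3600

print(serving_midpoint.strftime("%H:%M"), average_incubation_hours)
```

This reproduces the 8:30 p.m. midpoint and the 5.5 hour estimate read off the histogram.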

117 Using Epi Info to Create Epi Curves
Step-by-Step Instructions Open the Analyze Data component Use the “Read” command to access your data table Click on the “Graph” command Choose “Histogram” as the “Graph Type” Choose your date / time of illness onset variable as the x- axis main variable The instructions on this and the following slide are here for your reference so you can try generating an epi curve on your own. I will forward through them, but encourage you to practice generating an epi curve with the Oswego data table.

118 Using Epi Info to Create Epi Curves
Step-by-Step Instructions Choose “count” from the “Show value of” option beneath the y-axis option Choose weeks, days, hours, or minutes for the x-axis interval from the “interval” dropdown menu Type in graph title where it says “Page title” Click “OK”

119 Determine Incubation Period
Alternative: Create a temporary variable called “Incubation” in Analyze Data: INCUBATION = DATEONSET – TIMESUPPER Where field format is identical: Date / time – mm/dd/yyyy/hh/mm/ss/AM PM The alternative way to estimate minimum, maximum, and average incubation periods is to create a temporary variable called, “incubation” in the Analyze Data component of Epi Info. This calculation is based on two fields in the database: The TIMESUPPER field [date and time of day that a case consumed the church supper]; and The DATEONSET field [date and time of day that a case had onset of illness] The formula for calculating incubation in number of HOURS is not as straightforward as the equation I have on this slide. You must tell Epi info to divide the end result by 60. I will provide the precise programming code on the next slide. The two critical factors for being able to have Epi Info calculate an incubation period for each record are: a. You need identical field formats for TIMESUPPER and DATEONSET b. You need DATA for both fields for every record ***In this case study data base, investigators were able to obtain the approximate time of eating supper from only half of the persons who experienced gastrointestinal illness*** This could be problematic in the calculation of incubation period for an unknown pathogen. However, this might be the reality of a data base that you will have to work with.

120 Means INCUBATION Analysis Output
Once you create a temporary “incubation” variable and Epi Info calculates a value for each record with data in both the TIMESUPPER and DATEONSET fields, you can then use the “Means” command in Epi Info to display minimum, maximum, and average incubation period values for cases. Here are the two standard pieces of analysis output that you will get with the Means command. Note here that there are only 22 values even though there are 46 cases in the database. That is due to the fact that the researchers did not have data on time of food consumption from all cases. When I used the epi curve graph to estimate the average incubation period, I got a value of 5.5 hours. And here, Epi Info has calculated the incubation period to be 4.3 hours. My manual epi curve estimate was not exact, but still close enough to the mathematical average to allow me to hypothesize which pathogen might be causing the outbreak. You will see in the next couple of slides how the incubation period for pathogens can extend over many hours, so that being one hour off with the estimate most likely would not direct investigators to the wrong pathogen.

121 Calculate Mean Incubation in Epi Info
Here is the programming code that you can use in Epi Info to replicate the creation of a temporary “incubation” variable and the stratified Means output that I just showed you. The first line of code simply Reads in the data table ViewOswego. The second line creates a name and placeholder for the new, temporary variable “incubation.” The third line assigns the results of the calculation based on the variables TIMESUPPER and DATEONSET to the temporary variable “INCUBATION” for each record where there are data for both the TIMESUPPER and DATEONSET variables. Here, TIMESUPPER is actually subtracted from DATEONSET, but the notation using a comma is how you program this subtraction in Epi Info. The resulting calculation is divided by 60 so you can graph the incubation period in intervals of hours versus minutes. The fourth line generates standard Means analysis output, stratified by illness status. However, you could just run an unstratified “Means” since only people who are ill will have an incubation period.
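The same calculation can be sketched in Python for readers who want to check the logic by hand. This is an illustration under stated assumptions, not the Epi Info program: the four case records are invented (they are not the Oswego data), the year is an arbitrary placeholder, and the helper name `incubation_hours` is hypothetical. It subtracts TIMESUPPER from DATEONSET, converts minutes to hours by dividing by 60, and skips any record that is missing either field.

```python
from datetime import datetime
from statistics import mean


def incubation_hours(time_supper, date_onset):
    """Return DATEONSET minus TIMESUPPER in hours, or None if either is missing."""
    if time_supper is None or date_onset is None:
        return None
    minutes = (date_onset - time_supper).total_seconds() / 60
    return minutes / 60  # divide the minutes by 60 to express the result in hours


# Hypothetical (TIMESUPPER, DATEONSET) pairs; the third has no supper time recorded.
cases = [
    (datetime(2000, 4, 18, 19, 30), datetime(2000, 4, 18, 22, 30)),
    (datetime(2000, 4, 18, 20, 0), datetime(2000, 4, 19, 1, 0)),
    (None, datetime(2000, 4, 19, 0, 30)),
    (datetime(2000, 4, 18, 22, 0), datetime(2000, 4, 19, 2, 0)),
]

# Keep only records where an incubation period could be calculated.
values = [h for h in (incubation_hours(s, o) for s, o in cases) if h is not None]

print(len(values), min(values), max(values), mean(values))
```

As with the case study database, the count of incubation values is smaller than the count of cases whenever supper times are missing.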

122 Identify the Pathogen. . .
Once you have a solid estimate for the mean incubation period of the illness being investigated, you can use a resource like this one and compare the estimate to the documented characteristics of known pathogens. If we compare the case study mean incubation period of 4.3 hours with this table, it fits within the 1 to 6 hour mean for Staphylococcus aureus. In this table, you can see the wide range in the number of hours for an incubation period to which I referred earlier.

123 Identify the Pathogen. . .
CDC’s Foodborne Outbreak Response and Surveillance Unit “Guide to Confirming the Diagnosis in Foodborne Diseases”
This Centers for Disease Control and Prevention resource organizes information in much the same way as the table that I just showed you, and you can download the information as a .PDF file. Keep in mind that even after you look at this resource and learn more about the risk factors and other characteristics associated with a pathogen, the only way to verify the involvement of a particular pathogen will be laboratory tests.

124 Case Study: Attack Rates
Obtain the information that you need to calculate food-specific attack rates via: Stratified Frequency Tables Line Listings 2 x 2 Tables Food-specific AR # people who ate a food and became ill # people who ate that food The last element of descriptive epidemiology that I want to review in Epi Info analysis output is food-specific attack rates. We already have an illness incubation period and predominant signs and symptoms, so let’s look at how you can use descriptive statistics to assess each suspected food item on a menu. Once you have food-specific attack rates, you can fine-tune your hypothesis for the analytic portion of your outbreak investigation. You can obtain the information that you need to calculate food-specific attack rates one of three ways. It’s a matter of preference, although you will see that one way stands out as the most efficient. One way is to generate stratified frequency tables; Another way is to generate line listings; and The third way is to generate 2 x 2 tables. You have already seen frequency table and line listing analysis output, but I now want to show you how to pull numbers from the output and calculate attack rates. Let’s quickly review the simple equation: the food-specific attack rate is the number of people who ate a food and became ill divided by the total number of people who ate that food.

125 Stratified Frequency Tables
40 people ate cake; 27 people who ate cake are ill. AR for people who consumed cake: 27 / 40 = 67.5%
35 people did not eat cake; 19 of those people are ill. AR for people who did not consume cake: 19 / 35 = 54.3%
Frequencies CAKE ; Stratify by ILL
One way that you can obtain data for food-specific attack rate calculations is to generate stratified frequency tables for ill and non-ill persons. Here is some sample analysis output from Epi Info for the food item “cakes” in the case study. The text under the second table describes the windows-based commands that you need to use in Analyze Data. You can see from the sample calculations that you need numbers from both frequency tables in order to calculate the attack rates [but these tables are generated simultaneously when you request stratified output]. The food-specific attack rate for people who consumed cake in the case study outbreak is 67.5%. That’s a pretty high attack rate. At first glance we might want to hold on to this food item as one to potentially explore further in the analytic study. But more than one half of the people who did not consume cake also became ill, so we might actually re-think including this food item in the analytic study.
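The arithmetic for the two cake attack rates can be checked with a few lines of Python. The counts (27 of 40 and 19 of 35) come from the frequency tables described above; the function name `attack_rate` is my own and is only a sketch of the simple equation.

```python
def attack_rate(ill, total):
    """Food-specific attack rate as a percentage: ill / total * 100."""
    return 100 * ill / total


ate_cake = attack_rate(27, 40)   # people who ate cake
no_cake = attack_rate(19, 35)    # people who did not eat cake

print(f"Ate cake: {ate_cake:.1f}%  Did not eat cake: {no_cake:.1f}%")
```

Looking at both rates side by side makes the point in the notes: the rate among people who did not eat cake is not much lower than the rate among people who did.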

126 Line Listings
40 people ate cake; 27 people who ate cake are ill. AR for people who consumed cake: 27 / 40 = 67.5%
Here, the same data are presented in a different, and inefficient, format. With a line listing, you have to count the raw data that come so neatly packaged in frequency tables before you can do the math to get to your attack rate. I made this line listing a little easier for the researcher to interpret by first using the “Sort” command and sorting on the variable “ILL.” This way the “ILL” and “Non-ILL” individuals in the database are at least quickly distinguishable for counting purposes. I want to point out that I included only the calculation of the attack rate among people who consumed cake on this slide. However, you cannot forget to also calculate the counterpart: the attack rate among people who did not consume cake.
Not Ill; Ill

127 Tables Analysis Output
2 x 2 Table The third and most efficient way to generate numbers for calculating attack rates is the windows-based “Tables” command in Analyze Data. One standard part of “Tables” analysis output is a 2 x 2 table. The other part is Odds Ratios, Risk Ratios, and Confidence Intervals. I will get to that output in a few minutes. Here I have generated a 2 x 2 table for the food cake, where the exposure is cake consumption and the outcome is illness. I have a short activity for you so you can practice interpreting this output. . . Windows command: Tables (Exposure = CAKES; Outcome = ILL)

128 Activity: Interpreting Output
Epi Info provides both a count and a relative percentage by row or column within each cell of a 2 x 2 table. If you misinterpret the output, you can end up with an incorrect conclusion that leads you in the wrong direction during your outbreak investigation! It is important that you are able to put the numbers in this table into conclusive statements that make sense to you. So I have one question for you to answer in a sentence. The question is: “What percentage of people who ate cake did not get ill?” Your answer will be, “BLANK percent of people who ate cake did not get ill.” I will give you a few minutes to find the correct row or column percent that you need from this table, and will then review the answer. What percentage of people who ate cake did not get ill?

129 Activity: Interpreting Output
Exposure; Outcome
Here’s the answer: 32.5% of the people who ate cake did not get ill. Did you pull the correct number from the table? Here’s how to find the correct answer: We know that the question includes an exposure of eating cake, so we find “Cake consumption equals ‘yes’” in the far left column of the table. We know that the question includes an outcome of not getting ill, so we find “illness outcome equals ‘no’” in the next-to-last column of the table. We then need a percentage to answer the question, but there are two percentages in the cell where our two criteria intersect. So how do we determine which one is correct? Well, this is where it is helpful to put the numbers in a sentence. If I use the Column percent of 44.8, I am saying, “44.8% of the people who did not become ill ate cake.” That is the answer to a different question: “What percentage of the people who are not ill consumed cake?” If I use the Row percent of 32.5, I am saying, “32.5% of the people who consumed cake did not become ill.” Bingo; that’s the answer to the activity question.
Answer: 32.5% of the people who ate cake did not get ill.
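If the difference between the row percent and the column percent is still fuzzy, this Python sketch recomputes both for the same cell. The counts follow from the cake figures given earlier (27 of 40 who ate cake were ill; 19 of 35 who did not eat cake were ill); the dictionary layout and variable names are my own.

```python
# Counts for the cake 2 x 2 table: rows are exposure, columns are outcome.
table = {
    "ate cake": {"ill": 27, "not ill": 13},
    "no cake": {"ill": 19, "not ill": 16},
}

cell = table["ate cake"]["not ill"]

row_total = sum(table["ate cake"].values())                 # everyone who ate cake
col_total = sum(row["not ill"] for row in table.values())   # everyone who is not ill

row_percent = 100 * cell / row_total   # of those who ate cake, the share not ill
col_percent = 100 * cell / col_total   # of those not ill, the share who ate cake

print(round(row_percent, 1), round(col_percent, 1))
```

The row percent uses the exposure group as its denominator, which is why it answers a question that begins “What percentage of people who ate cake...”.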

130 Case Study Attack Rates
Food item             Consumed Item            Did Not Consume Item
                      Ill   Total   AR(%)      Ill   Total   AR(%)
Baked Ham             29    46      63%        17    29      59%
Cabbage Salad         18    28      64%        28    47      60%
Cakes                 27    40      68%        19    35      54%
Chocolate Ice Cream   25    47      53%        20    27      74%
Vanilla Ice Cream     43    54      80%         3    21      14%

We should further investigate the association of vanilla ice cream consumption and illness.
In a real outbreak investigation, you would calculate the food-specific attack rate for each item on the supper menu. I have prepared a table with attack rates for five items so we can review and compare them quickly. The case study point source of infection is one of the five items included in this table, so let’s take a look. So which food item looks suspicious here? Well, all food items have relatively high attack rates among people who consumed the food, but vanilla ice cream stands out as the food with the highest attack rate: 80 percent of the people who consumed vanilla ice cream became ill. Conversely, if we assess the attack rates among people who did not consume those same food items, we see that four of the five also have high attack rates. So if anywhere from 54 percent to 74 percent of the people who did not consume those foods became ill, there must be some other food that is the culprit. This again brings us to vanilla ice cream: 80 percent of people who consumed the vanilla ice cream became ill, but only 14 percent of the people who did not consume vanilla ice cream became ill. Finally, 43 of the case-patients consumed vanilla ice cream, whereas only 3 case-patients did not eat the vanilla ice cream. You would therefore probably conclude that you should further investigate the association of vanilla ice cream consumption and outcome of illness.

131 Generate and Test a Hypothesis!
The epi curve is indicative of a Point-Source outbreak
Based on the incubation period, we suspect Staphylococcus aureus as the pathogen
The food-specific attack rates lead us to believe that vanilla ice cream may be the source of infection
Based on the descriptive statistics and epi curves that we have reviewed, we have been able to make some inferences that allow us to formulate a hypothesis and move forward with an analytic study. Let’s review what we know up to this point: [READ SLIDE] Based on this information, we will assess the association between vanilla ice cream consumption and illness. So let’s move on to the analytic part of the outbreak investigation case study.

132 Case Study
First, let’s quickly review this table that Amy showed you earlier. You know that you want to evaluate the association between an exposure (eating vanilla ice cream) and an outcome (becoming ill). So you will undertake an analytic study. Since this is a small, well-defined outbreak, you can choose the cohort study design. And the measure of association for a cohort study will be a risk ratio.

133 Tables Analysis Output
2 x 2 Table Shell Epi Info 2 x 2 Table The windows-based Analyze Data command used to generate risk ratios or odds ratios is “Tables.” This slide displays two images: On the left, a text book table shell to review the origin of risk and odds ratios formulas that Amy discussed earlier, and On the right, one part of the standard 2 x 2 Tables analysis output in Epi Info for the food “vanilla ice cream.” You just worked with this output in today’s second activity. Windows command: Tables (for VANILLA)

134 Tables Analysis Output
In addition to a contingency or 2 x 2 table, standard Tables analysis output in Epi Info always includes both odds ratios and risk ratios. So be sure that you have a clear understanding as to which study design you have implemented (case-control versus cohort) so you interpret the appropriate measure of association. We are looking for a risk ratio greater than 1.0 to indicate a positive association between vanilla ice cream consumption and onset of illness. The Risk Ratio in the output is 5.6. Translated, the inference for this Risk Ratio is, “The risk of becoming ill was more than five times greater for people who consumed vanilla ice cream than for people who did not consume vanilla ice cream.” Finally, do not forget that you should also assess the confidence interval values that accompany the risk ratio. Here the confidence interval range is 1.9 to 16.0. While the null value of 1.0 is not included in the confidence interval, it is a fairly wide range, probably because the sample size was small.
“The risk of becoming ill was more than five times greater for people who consumed vanilla ice cream than for people who did not consume vanilla ice cream.”
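To see where a risk ratio and its confidence interval come from, here is a minimal Python sketch using the vanilla ice cream counts from the attack rate table (43 ill of 54 who ate it; 3 ill of 21 who did not). The interval uses a common log-based approximation with z = 1.96; that choice is my assumption for illustration, and I am not claiming it is the exact method behind the Epi Info output. The function name `risk_ratio_ci` is hypothetical.

```python
from math import exp, log, sqrt


def risk_ratio_ci(a, b, c, d, z=1.96):
    """Risk ratio with an approximate 95% confidence interval.

    a: exposed and ill        b: exposed and not ill
    c: unexposed and ill      d: unexposed and not ill
    """
    rr = (a / (a + b)) / (c / (c + d))
    # Standard error of ln(RR) under the log-based approximation (an assumption).
    se = sqrt(1 / a - 1 / (a + b) + 1 / c - 1 / (c + d))
    return rr, exp(log(rr) - z * se), exp(log(rr) + z * se)


# Vanilla ice cream: 43 of 54 who ate it were ill; 3 of 21 who did not were ill.
rr, lower, upper = risk_ratio_ci(43, 11, 3, 18)

print(f"RR = {rr:.1f}, 95% CI {lower:.1f} to {upper:.1f}")
```

Rounded to one decimal place, this gives a risk ratio of 5.6 with an interval of 1.9 to 16.0, in line with the case study results. The interval is wide because only 3 ill people fall in the unexposed group.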

135 Case Study Analytic Results
- Point-Source Outbreak
- Staphylococcus aureus suspected pathogen based on 4.3 hr average incubation period
- Vanilla ice cream suspected source of infection (highest food-specific AR of 80%)
- Vanilla ice cream RR = 5.6
- Vanilla ice cream C.I. = 1.9 – 16.0
Let’s conclude this case study outbreak investigation by reviewing a compilation of the descriptive and analytic results. [REVIEW SLIDE] I want to end this case study segment by sharing some report information from the real outbreak investigation. I do not want you to leave today’s session wondering about the final outcome of the Oswego outbreak investigation! All handlers of the ice cream were examined. No external lesions or upper respiratory infections were noted. Nose and throat cultures were taken from two individuals who prepared the ice cream. Bacteriological examinations were made by the Division of Laboratories and Research, Albany, on both ice creams. Their report is as follows: 'Large numbers of Staphylococcus aureus and albus were found in the specimen of vanilla ice cream. Only a few staphylococci were demonstrated in the chocolate ice cream.' Report of the nose and throat cultures of the individuals who prepared the ice cream read as follows: 'Staphylococcus aureus and hemolytic streptococci were isolated from nose culture, and Staphylococcus albus from throat culture, of ice cream preparer ‘A.’ Staphylococcus albus was isolated from the nose culture of ice cream preparer ‘B.’ The hemolytic streptococci were not of the type usually associated with infections in man.' The source of bacterial contamination of the vanilla ice cream is not clear. Whatever the method of the introduction of the staphylococci, it appears reasonable to assume it must have occurred between the evening of April 17th and the morning of April 18th [ice cream preparation began on April 17, and covered containers were left to stand overnight in the church basement].
No reason for contamination peculiar to the vanilla ice cream is known.

136 Online Epi Info Instruction
8 Self-Instructional Training Modules for various screen components, functions, and commands in Analyze Data
Before we adjourn today, I want to let you know that 8 free, online self-instructional modules on using Epi Info’s Analyze Data component are available at the North Carolina Center for Public Health Preparedness’ Training Web site. Each in-depth module includes both PowerPoint slides with a narrated lecture and a recorded software demonstration video. One module even teaches you how to generate Epi Curves. Each module runs anywhere from 30 minutes to 1 hour in length.

137 Question & Answer Opportunity
We have a few minutes for questions. If you have questions about generating descriptive or analytic analysis output specific to today’s case study analysis, I can possibly squeeze in a software demonstration since I have Epi Info installed on this laptop.

138 Next Session December 1st, 1:00 p.m. – 3:00 p.m.
Topic: “Writing and Reviewing Epidemiological Literature” Well, we made it through the steps of an outbreak investigation in only six sessions! Our seventh and final session in this series will be, “Writing and Reviewing Epidemiological Literature.” We hope that you can join us. Amy Nelson will join us again as the lecturer. Thank you, and we will see you on December 1st at 1:00 p.m.!

139 Session VI Summary
Analysis planning can: be an invaluable investment of time; help you select the most appropriate epidemiologic methods; and help assure that the work leading up to analysis yields a database structure and content that your preferred analysis software needs to successfully run analysis programs. As you plan your analysis: 1) Work backwards from the research question(s) to design the most efficient data collection instrument; 2) Consider your study design to guide which statistical tests and measures of association you evaluate in the analysis output; and 3) Consider the need to present, graph, or map data.

140 Session VI Summary
Descriptive epidemiology: 1) Familiarizes the investigator with data about time, place, and person; 2) Comprehensively describes the outbreak; and 3) Is essential for hypothesis generation. Data cleaning is the first step in preparing to generate descriptive statistics, as it contributes to the accuracy and completeness of the data. Measures of central tendency provide a means of assessing the distribution of data. Measures include mean, median, mode, and range. Epi curves, spot maps, and line listings are all ways in which you can generate and review the time, place, and person elements, respectively, of descriptive statistics.

141 Session VI Summary
Attack rates are descriptive statistics that are useful for comparing the risk of disease in groups with different exposures (such as consumption of individual food items). Analytic epidemiology allows you to test the hypotheses generated via review of descriptive statistics and the medical literature. The measures of association for case-control and cohort analytic studies, respectively, are odds ratios and risk ratios. Confidence intervals that accompany measures of association evaluate the statistical significance of the measures and assess the precision of the estimates.

142 References and Resources
Centers for Disease Control and Prevention (1992). Principles of Epidemiology, 2nd ed. Atlanta, GA: Public Health Practice Program Office. Division of Public Health Surveillance and Informatics, Epidemiology Program Office, Centers for Disease Control and Prevention (January 2003). Epi Info Support Manual. [included with installation of the software, which can be found at: Gordis L. (1996). Epidemiology. Philadelphia, WB Saunders.

143 References and Resources
Rothman KJ. Epidemiology: An Introduction. New York, Oxford University Press, 2002. Stehr-Green, J. and Stehr-Green, P. (2004). Hypothesis Generating Interviews. Module 3 of a Field Epidemiology Methods course being developed in the NC Center for Public Health Preparedness, UNC Chapel Hill. Torok, M. (2004). FOCUS on Field Epidemiology. “Epidemic Curves”. Volume 1, Issue 5. NC Center for Public Health Preparedness

