Research Designs Commonly Used In Epidemiology
One of the basic concepts in research designs that try to discern cause is that we have to make sure our selection of subjects and our selection of dependent and independent variables are consistent with the biological basis of disease. In 1965 A. B. Hill (Austin Bradford Hill) proposed a set of 9 criteria to be weighed before concluding “cause”. They are:
1. Strength of association
2. Consistency
3. Specificity
4. Temporal relationship
5. Dose effect
6. Experimental evidence
7. Biological plausibility
8. Coherence
9. Analogy
In the previous example 200 births were randomly selected from the population, producing the following results (a bar denotes absence, so $\bar{E}$ is "not exposed" and $\bar{D}$ is "no disease"):

Joint probability: $P(D \,\&\, E) = 0.035$
Marginal probability: $P(D) = 0.07$
Conditional probability: $P(D \mid E) = 0.119$

RR: $\dfrac{P(D \mid E)}{P(D \mid \bar{E})} = 2.39$

OR: $\dfrac{P(D \mid E)}{P(\bar{D} \mid E)} \div \dfrac{P(D \mid \bar{E})}{P(\bar{D} \mid \bar{E})} = 2.58$

AR: $\dfrac{P(D) - P(D \mid \bar{E})}{P(D)} = 0.29$

ER: $P(D \mid E) - P(D \mid \bar{E}) = 0.069$

Because all births in the population had an equal chance of being selected, joint, marginal, and conditional probabilities can all be estimated with reasonable confidence.
The type of research design illustrated by the previous examples is a population-based design (often called a cross-sectional study). This design is very useful for obtaining estimates of the different probabilities discussed and has a very simple procedure:
1. Obtain a simple random sample of size n from the study population.
2. Measure the presence or absence of both D and E for all of the sampled individuals.
3. Calculate away...
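The "calculate away" step can be sketched in a few lines of Python. This is only an illustration: the function name `risk_profile` is made up here, and the four cell counts are back-calculated from the probabilities reported for the 200-birth example (0.035 × 200 = 7, 0.07 × 200 = 14, and 7 / 0.119 ≈ 59 exposed), so they are an assumption rather than data taken from the slides.

```python
def risk_profile(a, b, c, d):
    """Risk measures from a 2x2 table of counts from a simple random sample.

    a: exposed, diseased        b: exposed, not diseased
    c: unexposed, diseased      d: unexposed, not diseased
    """
    n = a + b + c + d
    p_joint = a / n            # P(D & E)
    p_d = (a + c) / n          # P(D), a marginal probability
    p_d_e = a / (a + b)        # P(D | E)
    p_d_not_e = c / (c + d)    # P(D | not E)
    return {
        "P(D&E)": p_joint,
        "P(D)": p_d,
        "P(D|E)": p_d_e,
        "RR": p_d_e / p_d_not_e,
        "OR": (p_d_e / (1 - p_d_e)) / (p_d_not_e / (1 - p_d_not_e)),
        "AR": (p_d - p_d_not_e) / p_d,
        "ER": p_d_e - p_d_not_e,
    }


# Assumed counts, reconstructed from the reported probabilities (not given in the slides).
profile = risk_profile(a=7, b=52, c=7, d=134)
for name, value in profile.items():
    print(f"{name}: {value:.3f}")
```

With these reconstructed counts the measures round to the values reported for the example (RR 2.39, OR 2.58, AR 0.29, ER 0.069), which suggests the back-calculation is at least consistent with the slide.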
When the prevalence of E (exposure to the hypothesized risk for the disease in question) is pretty low in the population, population-based designs may not yield sufficient numbers of people with E to calculate risk profiles associating E and D. One way to get around this is to conduct a cohort study. This type of research design is primarily concerned with defining two sub-populations based on E (or level of E) within the target population, creating two (or more) cohorts: sub-groups with one defining common characteristic.
Basic Procedures:
1. Identify two subgroups of the population based on the presence or absence of E ($E$ and $\bar{E}$).
2. Take a separate random sample of equal n from each of the two subgroups.
3. Measure the presence or absence of D in both random samples.
4. Calculate away...
An example of a cohort study might entail defining our $E$ and $\bar{E}$ as unmarried and married, respectively, and then randomly selecting 100 birth records for each category of exposure. We then determine the presence or absence of our variable D for each record, possibly producing the following data set:

             Birthweight
             Low   Normal   Total
Unmarried     12       88     100
Married        5       95     100
Total         17      183     200

Because the numbers of “married” and “unmarried” birth records are equal (and this is certainly NOT representative of the population), neither joint probabilities nor marginal probabilities for the population can be estimated from this sample. Only conditional probabilities can be estimated, because only associations between $D$ or $\bar{D}$ and $E$ or $\bar{E}$ will be representative of the population. Because AR includes a marginal probability [$P(D)$, which depends on $P(E)$], it cannot be calculated either. Only RR, OR, and ER can be calculated with this type of design.
So, using the data obtained from a COHORT design, we can calculate the following sample statistics (and if I wasn’t so lazy I also would have calculated confidence intervals for each):

RR: $\dfrac{P(D \mid E)}{P(D \mid \bar{E})} = \dfrac{12/100}{5/100} = 2.40$

OR: $\dfrac{P(D \mid E)}{P(\bar{D} \mid E)} \div \dfrac{P(D \mid \bar{E})}{P(\bar{D} \mid \bar{E})} = \dfrac{(12/100)/(88/100)}{(5/100)/(95/100)} = 2.59$

ER: $P(D \mid E) - P(D \mid \bar{E}) = (12/100) - (5/100) = 0.070$
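The slide skips the confidence intervals, so here is a minimal sketch of how they might be added. It uses the common large-sample (Wald-type) formulas, with the intervals for RR and OR built on the log scale; that choice, the 95% level (z = 1.96), and the function name `cohort_measures` are assumptions made for illustration, not the author's stated method.

```python
import math


def cohort_measures(a, n1, c, n0, z=1.96):
    """RR, OR and ER with Wald-type intervals for a two-cohort sample.

    a of the n1 exposed records have D; c of the n0 unexposed records have D.
    Returns {measure: (estimate, lower, upper)}.
    """
    b, d = n1 - a, n0 - c          # records without D in each cohort
    p1, p0 = a / n1, c / n0        # P(D | E) and P(D | not E)

    rr = p1 / p0
    se_log_rr = math.sqrt(1 / a - 1 / n1 + 1 / c - 1 / n0)

    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)

    er = p1 - p0
    se_er = math.sqrt(p1 * (1 - p1) / n1 + p0 * (1 - p0) / n0)

    def ratio_ci(estimate, se_log):
        # Interval is symmetric on the log scale, then transformed back.
        return estimate * math.exp(-z * se_log), estimate * math.exp(z * se_log)

    return {
        "RR": (rr, *ratio_ci(rr, se_log_rr)),
        "OR": (odds_ratio, *ratio_ci(odds_ratio, se_log_or)),
        "ER": (er, er - z * se_er, er + z * se_er),
    }


# Cohort example: 12 of 100 unmarried and 5 of 100 married records are low birthweight.
results = cohort_measures(a=12, n1=100, c=5, n0=100)
for name, (estimate, low, high) in results.items():
    print(f"{name}: {estimate:.2f} (95% CI {low:.2f} to {high:.2f})")
```

With only 100 records per cohort, all three intervals include the null value (1 for RR and OR, 0 for ER), a reminder that the point estimates alone say little about certainty.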
When the incidence of D in the population is pretty low, population-based designs and E-based cohort designs may not yield sufficient numbers of people with D to calculate risk profiles associating D and E. One way to get around this is to conduct a case-control study. This type of research design defines two sub-populations based on D ($D$ and $\bar{D}$) within the target population.
Basic Procedures:
1. Identify two subgroups of the population based on the presence or absence of D ($D$ and $\bar{D}$).
2. Take a separate random sample of equal n from each of the two subgroups.
3. Measure the presence or absence of E in both random samples.
4. Calculate away...
An example of a case-control study using the same birth data would entail defining our $D$ as low birthweight and $\bar{D}$ as normal birthweight and then randomly selecting 100 birth records from each category of disease. We would then measure our E (unmarried and married) for each record in both of the random samples, possibly producing the following data set:

             Birthweight
             Low   Normal   Total
Unmarried     50       28      78
Married       50       72     122
Total        100      100     200

Because the numbers of “low” and “normal” birth records are equal (and this is certainly NOT representative of the population), neither joint probabilities nor marginal probabilities for the population can be estimated from this sample. Only conditional probabilities can be estimated, because only associations between $D$ or $\bar{D}$ and $E$ or $\bar{E}$ will be representative of the population. In addition, only those associations which condition on disease outcome can be estimated, leaving OR as the only (accurate) calculation possible (this is because the frequency of D in the sampling process was decided by the investigator, so any probability of D, whatever it is conditioned on, will not be relevant to the population).
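A small sketch can show why OR survives the investigator's choice of sample sizes while a naively computed "RR" does not. The case-control table is the one from this example; the multiplier `k` (drawing k times as many controls with the same exposure split) is a hypothetical variation introduced only for illustration, as are the function names.

```python
def odds_ratio(a, b, c, d):
    """Cross-product odds ratio for a 2x2 table."""
    return (a * d) / (b * c)


def naive_rr(a, b, c, d):
    """'RR' computed as if the rows were sampled cohorts (not valid for case-control data)."""
    return (a / (a + b)) / (c / (c + d))


# a: unmarried cases, b: unmarried controls, c: married cases, d: married controls
a, b, c, d = 50, 28, 50, 72

# Hypothetical: keep the 100 cases but draw k times as many controls
# with the same exposure proportions.
for k in (1, 2, 5):
    print(
        f"controls x{k}: OR = {odds_ratio(a, b * k, c, d * k):.2f}, "
        f"naive RR = {naive_rr(a, b * k, c, d * k):.2f}"
    )
```

The OR is the same for every `k`, while the naive RR drifts with the case-to-control ratio, which is set by the investigator and carries no information about the population.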
So, we can calculate the OR:

$\text{OR} = \dfrac{(50/100)/(50/100)}{(28/100)/(72/100)} = 2.57$

As mentioned previously, if D is rare in both exposed and unexposed populations (you know whether this is true from population-based studies), then OR is a close approximation of RR. With some clever rearrangements of the formulas (and using Bayes’ formula twice), AR can be defined as a conditioned function which incorporates RR, so we can still get a close estimate of AR (by using the OR in place of RR; rearrangements of formulas not shown). Thus it is still possible to calculate (within defined confidence limits) estimates of the OR, RR, and AR parameters with a simple case-control design. (Estimates of RR and AR will not be quite as accurate as with population-based designs, but that is just a limitation of this type of design.)
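The slides do not show the rearrangement, so the following is only one standard form and may not be the one the author had in mind: writing $P(D)$ in terms of RR and $P(E)$ and then applying Bayes' formula gives AR = P(E | D) × (RR − 1) / RR, where P(E | D) is the proportion of cases that are exposed, a quantity a case-control sample can estimate. The sketch below plugs OR in for RR, which is only reasonable under the rarity assumption; the function name is hypothetical.

```python
def attributable_risk_from_case_control(a, b, c, d):
    """Approximate AR from case-control counts, using OR in place of RR.

    a: exposed cases      b: exposed controls
    c: unexposed cases    d: unexposed controls
    """
    odds_ratio = (a * d) / (b * c)
    p_e_given_d = a / (a + c)   # exposure proportion among cases
    # Assumed rearrangement: AR = P(E | D) * (RR - 1) / RR, with RR ~ OR.
    ar = p_e_given_d * (odds_ratio - 1) / odds_ratio
    return odds_ratio, ar


# Case-control example: 50 of 100 low-birthweight and 28 of 100 normal records are unmarried.
or_estimate, ar_estimate = attributable_risk_from_case_control(a=50, b=28, c=50, d=72)
print(f"OR = {or_estimate:.2f}, approximate AR = {ar_estimate:.2f}")
```

This gives an AR of about 0.31, close to but not identical with the 0.29 obtained from the population-based sample, in line with the remark that case-control estimates of AR are somewhat less accurate.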
Although there may be limitations to the simple case-control design in terms of calculating estimates of the population’s risk profiles, if one is very clever about selecting controls, then one can avoid the major limitations. Probably the most important limitation to simple case-control designs is that the majority of diseases that we are concerned about (breast cancer, lung cancer, CAD, ...) are not really rare, violating the rarity assumption that is necessary to be able to estimate RR from OR. Yet, in order to have sufficient numbers of people in our studies, case-control designs are by far the most efficient and cost-effective to use. A common characteristic of these diseases is that the interval of risk is long; it takes a long time (many years) to produce the disease. To put it another way, as the duration of exposure increases, there is an increase in total risk for the disease, resulting in a relatively high incidence in older people. This aspect of the disease process can be taken advantage of by carefully selecting controls.
One way to handle this is to perform what is called risk-set sampling or density sampling. For each case, one or more controls are selected. The clever part is that the controls are stratified on the basis of time. (Think of it as a form of time-standardized incidence-exposure rate, similar to the age-standardized mortality rate.) Because of the long duration of the risk period, if the period is divided into small time-frames (for example, every 5 years for a cumulative total of 30 years), then you produce a series of small sub-groups of cases and controls, one for each time period (6 separate case-control groups for this example, each with a separate set of risk profile calculations, just LINKED over time). Notice that this is similar to producing a series of sub-groups with levels of exposure as previously discussed for linear regression analysis (a series of separate risk/disease groups LINKED by level of exposure). This time it is time duration of exposure and not level of exposure. Because the separate case-control groups actually comprise a series of sub-sets or sub-samples from the total sample, this version of a case-control design is called a nested case-control design (there are a variety of variations of this nested design that we won’t bother with).
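The control-selection idea can be sketched as follows. This is a simplified illustration under stated assumptions: the record layout (`id`, `time` in years of follow-up, `case`, `exposed`), the eight-person cohort, and the function name are all hypothetical, and controls are drawn at each case's own event time rather than within fixed 5-year windows.

```python
import random


def risk_set_sample(subjects, controls_per_case=1, seed=0):
    """For each case, draw controls from those still at risk at the case's event time.

    subjects: list of dicts with 'id', 'time' (years until disease or end of
    follow-up) and 'case' (True if follow-up ended with disease).
    """
    rng = random.Random(seed)
    matched_sets = []
    cases = sorted((s for s in subjects if s["case"]), key=lambda s: s["time"])
    for case in cases:
        # At risk: still under follow-up and disease-free after the case's time.
        # A later case may serve as a control for an earlier one.
        at_risk = [s for s in subjects if s["time"] > case["time"]]
        k = min(controls_per_case, len(at_risk))
        matched_sets.append((case, rng.sample(at_risk, k)))
    return matched_sets


# Hypothetical cohort, invented for illustration only.
cohort = [
    {"id": 1, "time": 4.0, "case": True, "exposed": True},
    {"id": 2, "time": 9.5, "case": True, "exposed": True},
    {"id": 3, "time": 12.0, "case": False, "exposed": False},
    {"id": 4, "time": 30.0, "case": False, "exposed": False},
    {"id": 5, "time": 17.0, "case": True, "exposed": False},
    {"id": 6, "time": 30.0, "case": False, "exposed": True},
    {"id": 7, "time": 22.5, "case": False, "exposed": False},
    {"id": 8, "time": 30.0, "case": False, "exposed": False},
]

for case, controls in risk_set_sample(cohort, controls_per_case=2):
    control_ids = [c["id"] for c in controls]
    print(f"case {case['id']} at {case['time']} years: controls {control_ids}")
```

Each matched set then plays the role of one small case-control group tied to a point in time, and the sets are analyzed together rather than as a single pooled table.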
The neat thing about these designs is that within each individual time-frame, the incidence of disease would be rare, so the rarity assumption is NOT violated and therefore the various components of the risk profiles can be calculated. This risk over time also introduces a new statistical term, Relative Hazard (RH): the same basic calculations as RR, except that time duration of exposure (5 years vs. 10 years vs. 15 years...) is the important concept rather than the level of exposure (10 cigarettes smoked/day vs. 20 vs. 30 vs. 40...). As in case-control designs for rare diseases where OR estimates RR, the calculated OR in this version of a nested design estimates RH. This multiple-groups-over-time design allows for a linear regression analysis in order to produce those cool graphs showing associations between P of D and time. With this design, the change in P for each increment of time would represent EH in the same way that the change in P for each level of exposure represents ER.
Variations on this nested case-control concept allow for a variety of sophisticated studies (none of them discussed here) that are capable of controlling for different kinds of population factors that limit the utility of simple case-control designs, which has led to nested case-control designs of various kinds being the dominant forms of research design in epidemiology. Of course, if you have records of the entire population at your disposal (often possible in those countries with single-payer health-care systems), then population-based case-control designs can be used, yielding much more power... (power refers to the ability to observe differences or associations with a high degree of confidence). This leads to the last major topic... Exactly how do we determine our “confidence” level (as opposed to confidence limits) in “observing” differences or associations? That is the realm of statistical analysis and how statistical analysis estimates the probability that your observed associations or differences happened by random chance or by exposure (treatment?).
Papers for discussion next time: (I’m sure you have already read them long ago, so a quick review should suffice)
1. Friedenreich (both of ‘em)
2. Thune
3. Dees