Forensics and DNA Statistics Harry R Erwin, PhD CIS308 Faculty of Applied Sciences University of Sunderland
References Goodwin, Linacre, and Hadi (2007) An Introduction to Forensic Genetics, Wiley. Butler (2005) Forensic DNA Typing, 2nd edition, Elsevier, chapters 20-23.
Statistics and DNA According to Butler, “Statistical genetic information is often more difficult for DNA analysts to grasp than the technology and biology issues…because of its heavy use of mathematics particularly algebra. The concepts of probabilities can be challenging to forensic scientists schooled in biology rather than mathematics.” The implication is that you may need to provide the necessary expertise. 8(
Lecture Plan Review STR population database analyses (ch. 20) Profile frequency estimates, likelihood ratios, and source attribution (ch. 21) Approaches to statistical analysis of mixtures and degraded DNA (ch. 22) Kinship and parentage testing (ch. 23)
Review: What to Remember Probability – Laws of probability – Likelihood ratios – Bayesian statistics Statistics – Hypothesis testing – Chi-square test – Confidence intervals – Randomization tests
Introduction There are three possible outcomes of a DNA test: 1. No match 2. Inconclusive 3. Match. Only a match requires statistics to provide meaning. Which statistics to apply is debatable.
Laws of probability Probability: number of times an event occurs divided by the number of opportunities for it to occur. Three laws of probability to remember: 1. Probabilities range between 0.0 and 1.0. 2. If two events are mutually exclusive, the probability of either taking place is the sum of their probabilities. 3. If two events are independent, the probability of both occurring is the product of their individual probabilities.
Likelihood ratios A Likelihood Ratio (LR) is the comparison of the probabilities of the evidence, E, under two alternative (mutually exclusive) hypotheses. – The prosecution hypothesis, $H_p$, and – The defence hypothesis, $H_d$. These hypotheses should cover all possible cases – $\mathrm{LR} = \Pr(E \mid H_p) / \Pr(E \mid H_d)$
Bayesian statistics Posterior odds = (Likelihood ratio)*(Prior odds) – $\Pr(H_p \mid E) / \Pr(H_d \mid E) = \mathrm{LR} \times \Pr(H_p) / \Pr(H_d)$
Verbal terminology for likelihood ratios
Likelihood Ratio    Verbal Equivalent
1-10                Limited support for the prosecution hypothesis
10-100              Moderate support for the prosecution hypothesis
100-1000            Moderately strong support for the prosecution hypothesis
1000-10000          Strong support for the prosecution hypothesis
10000-100000        Very strong support for the prosecution hypothesis
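A minimal sketch of the odds form of Bayes' rule stated on this slide. The function names and the example numbers (an LR of 1000 and prior odds of 1 to 10,000) are hypothetical, chosen only to show the arithmetic.

```python
def posterior_odds(likelihood_ratio, prior_odds):
    """Odds form of Bayes' rule: posterior odds = LR * prior odds."""
    return likelihood_ratio * prior_odds


def odds_to_probability(odds):
    """Convert odds in favour of a hypothesis into a probability."""
    return odds / (1.0 + odds)


# Hypothetical case: LR = 1000, prior odds of 1 to 10,000 for the prosecution hypothesis.
post = posterior_odds(1000.0, 1.0 / 10_000)
print(post)                       # posterior odds, about 1 to 10
print(odds_to_probability(post))  # posterior probability, about 0.09
```

Even a "strong support" likelihood ratio leaves the posterior probability low when the prior odds are small, which is why the LR and the posterior odds must not be confused.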
Fallacies to avoid Prosecutor’s fallacy Defendant’s fallacy
Statistics Statistics measures uncertainty and reliability. A population is the set of objects of interest. A sample is an observable subset of a population. A statistic is some observable property of a sample.
Hypothesis testing
1. Select appropriate statistical model
2. Choose two alternative hypotheses, $H_0$ and $H_1$
3. Specify the level of significance and its critical value, C
4. Collect data and calculate statistic
5. Check region of rejection for statistic
6. Reject $H_0$? Yes: accept $H_1$. No: accept $H_0$.
A Weakness of Hypothesis Testing— The Base Rate Fallacy In a scientific research area where there is a lot of activity, most interesting uninvestigated hypotheses are false and most published results are accidental. Why? Because the Neyman-Pearson protocol only controls the probability of false alarms. Implication: test DNA against as small a group of possible matches as possible to avoid false alarms.
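The base rate effect can be sketched numerically. This is an illustration under assumed numbers, not a calculation from the lecture: `prior` is the fraction of tested hypotheses that are actually true, `power` is the chance a true one is detected, and `alpha` is the false alarm rate the test controls.

```python
def prob_true_given_significant(prior, power, alpha):
    """Probability that a 'significant' result reflects a true hypothesis."""
    true_positives = power * prior
    false_positives = alpha * (1.0 - prior)
    return true_positives / (true_positives + false_positives)


# Same test (power 0.8, alpha 0.05), different base rates of true hypotheses.
for prior in (0.5, 0.1, 0.01):
    print(prior, round(prob_true_given_significant(prior, 0.8, 0.05), 3))
```

As the base rate of true hypotheses falls, false alarms make up a growing share of the significant results, even though alpha is unchanged.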
Chi-square test A goodness-of-fit test. Answers “How close do the observations come to the expected results?” The $\chi^2$ statistic is parameterised by degrees of freedom, df, and large values indicate there’s a significant deviation from theory.
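A minimal sketch of the goodness-of-fit statistic, assuming the usual Pearson form (sum of (observed − expected)² / expected). The counts are hypothetical, and the critical values are the standard 5% points of the chi-square distribution for 1 to 3 degrees of freedom.

```python
# Standard 5% critical values of the chi-square distribution, keyed by df.
CRITICAL_5PCT = {1: 3.841, 2: 5.991, 3: 7.815}


def chi_square_statistic(observed, expected):
    """Pearson chi-square statistic for matched observed/expected counts."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical counts in three categories; df = 2 when nothing is estimated from the data.
observed = [30, 50, 20]
expected = [25.0, 50.0, 25.0]
stat = chi_square_statistic(observed, expected)
print(stat, stat > CRITICAL_5PCT[2])  # statistic and whether it exceeds the 5% point
```

If a parameter such as an allele frequency is estimated from the same data, one degree of freedom is lost for each estimated parameter.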
Confidence intervals Usually the sample mean plus and minus two standard deviations. For roughly normal data about 95% of observations fall inside that interval, so an observation outside it occurs only about 5% of the time. Other confidence intervals can be defined. These are used to help visualise measurements against a population.
Randomization tests These explore whether collecting the data differently would affect the results. Usually starts by treating the collected data as representative of the population, and permuting it, leaving samples out, or randomly ‘resampling’ it multiple times to see the range of descriptive statistics. Get a computational statistician involved if these questions come up. The tools are available in R to do these kinds of analyses. Keywords: resampling, bootstrap, jackknife
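A minimal sketch of the bootstrap idea, written in Python rather than R. The sample (400 allele observations, 100 of them allele 16) and the helper names are hypothetical; the percentile interval used here is one simple choice among several.

```python
import random


def bootstrap_interval(sample, statistic, n_resamples=2000, level=0.95, seed=1):
    """Percentile bootstrap interval for a statistic of the sample."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(sample)
    # Resample with replacement, recompute the statistic each time, then sort.
    stats = sorted(statistic(rng.choices(sample, k=n)) for _ in range(n_resamples))
    tail = (1.0 - level) / 2.0
    lo_idx = int(tail * n_resamples)
    hi_idx = int((1.0 - tail) * n_resamples) - 1
    return stats[lo_idx], stats[hi_idx]


def frequency_of_16(alleles):
    """Fraction of allele observations equal to 16."""
    return alleles.count(16) / len(alleles)


# Hypothetical data: 400 allele observations, 100 of which are allele 16.
sample = [16] * 100 + [15] * 300
print(frequency_of_16(sample), bootstrap_interval(sample, frequency_of_16))
```

The width of the interval shows how much the estimated frequency could move if the same population had been sampled again.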
Principles of Population Genetics Laws of genetics Number of alleles and number of possible genotypes
Populations What is a population? – A group of people sharing common ancestry. – Usually defined broadly Hardy-Weinberg Equilibrium – Within a randomly mating population, the genotype frequencies at any single genetic locus will remain constant. This allows genotype frequencies to be predicted from allele frequencies. (See Punnett Square.) – All human populations deviate (mildly) from HWE and your statistics will require (mild) corrections.
Punnett Square
                 Mother: A ($p$)   Mother: a ($q$)
Father: A ($p$)  AA ($p^2$)        aA ($qp$)
Father: a ($q$)  Aa ($pq$)         aa ($q^2$)
Genotype frequencies: AA $p^2$, Aa $2pq$, aa $q^2$
Note the following: $p + q = 1.0$. The fitness of the alleles (A and a) must be equal in the population. This usually is the result of hybrid vigor, where the heterozygote has an advantage over both homozygotes.
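The Punnett square can be sketched as a small function that predicts genotype frequencies from one allele frequency under Hardy-Weinberg equilibrium. The function name and the example value p = 0.3 are hypothetical.

```python
def hwe_genotype_frequencies(p):
    """Expected genotype frequencies for a two-allele locus under HWE."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("allele frequency must lie between 0 and 1")
    q = 1.0 - p  # p + q = 1
    return {"AA": p * p, "Aa": 2 * p * q, "aa": q * q}


freqs = hwe_genotype_frequencies(0.3)
print(freqs, sum(freqs.values()))  # the three frequencies sum to 1
```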
Deviations from HWE in Human Populations Finite populations produce random genetic drift—not an issue for populations larger than a small town. Non-random mating is not believed to affect the STR loci. Migration effects disappear over a period of several generations. Natural selection is not believed to affect the STR loci. Mutation rates at ~0.2%/generation are not likely to affect allelic frequencies.
STR Population Database Analyses Chapter 20 of Butler, 2nd edition. Population DNA databases Statistical tests on DNA databases Practical considerations “As population databases increase in numbers, virtually all populations will show some statistically significant departures from random mating proportions. Although statistically significant, many of the differences will be small enough to be practically unimportant.”
Creating a Population DNA Database Not for amateurs! Need >100 samples per local population group Often uses anonymous samples from a blood bank—watch for sampling effects Analysis—use appropriate STR kits Determine allele frequencies at each locus—note sampling bias issues. (Suppose there was a large local population that never gave blood samples for religious reasons.) Check HWE. Note the potential existence of non-interbreeding populations Exploratory sampling from existing databases suggests an unexpectedly high probability of random match on 9 or more loci.
Statistical tests on DNA databases There are a number of computer programmes available to evaluate the usefulness of a DNA database. Consider using DNATYPE first of all Need to test for independence of alleles at each genetic locus and between loci Unfortunately, independence testing does not validate the product rule 8( Compare to other population data sets Watch for population substructure
Practical considerations Watch these journals for population data: – For the Record articles in Journal of Forensic Sciences – Announcements of Population Data in Forensic Science International Understand the numbers reported. Understand why the markers in use have been chosen. Understand what the most common and rarest genotypes are for the DNA markers in use.
How Statistical Calculations are Made Generate data with set(s) of samples from desired population group(s) – Generally only 100-150 samples are needed to obtain reliable allele frequency estimates Determine allele frequencies at each locus – Count number of each allele seen Allele frequency information is used to estimate the rarity of a particular DNA profile – Homozygotes ($p^2$), heterozygotes ($2pq$) – Product rule used (multiply locus frequency estimates) For more information, see Chapters 20 and 21 in Forensic DNA Typing, 2nd Edition
Allele Frequency Tables
[Tables: D3S1358 allele frequencies for alleles 11, 12, 13, 14, 15, 15.2, 16, 17, 18, 19 and 20, with the most common allele marked. Sources: Butler et al. (2003) JFS 48(4):908-911 (Caucasian N = 302, African American N = 258) and Einum et al. (2004) JFS 49(6) (Caucasian N = 7,636, African American N = 7,602).]
Allele frequencies denoted with an asterisk (*) are below the 5/2N minimum allele threshold recommended by the National Research Council report (NRCII) The Evaluation of Forensic DNA Evidence published in 1996.
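Allele counting and the 5/2N minimum can be sketched as follows. The genotype data are hypothetical, and substituting 5/2N for any observed frequency below it is an assumption about how the threshold is applied, not something the slides spell out.

```python
from collections import Counter


def allele_frequencies(genotypes):
    """Count alleles over N genotyped people; return raw and floor-adjusted frequencies."""
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    total_alleles = 2 * len(genotypes)  # 2N alleles for N people
    minimum = 5 / total_alleles         # the 5/2N minimum allele frequency
    raw = {a: c / total_alleles for a, c in counts.items()}
    # Assumption: rare alleles are reported at the minimum instead of their raw value.
    adjusted = {a: max(f, minimum) for a, f in raw.items()}
    return raw, adjusted


# Hypothetical sample of N = 100 people typed at one locus.
genotypes = [(15, 16)] * 60 + [(16, 17)] * 39 + [(11, 15)]
raw, adjusted = allele_frequencies(genotypes)
print(raw[11], adjusted[11])  # 0.005 observed, raised to 5/200 = 0.025
```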
Frequency Estimates, Likelihood Ratios, and Source Attribution Chapter 21 of Butler, 2nd edition. Frequency estimate calculations Likelihood ratio Source attribution Other topics
Frequency Estimate Calculations Work through a frequency estimate calculation. Take a DNA profile and use the allele frequencies in a population database. A random match probability is not the probability that someone is guilty or that someone else left the biological material. Understand how rare alleles and tri-allelic patterns are handled. Understand the product rule Understand the differences between population databases. Understand the impact of population structure Understand the impact of relatives.
Likelihood ratio Practice quantifying the evidentiary value of a match between a reference sample, K, and a questioned sample, Q Explore likelihood ratios.
Source attribution When $p_x$ is the random match probability for a profile X, $(1 - p_x)^N$ is the probability of not observing the particular profile in a sample of N unrelated individuals. When this probability is greater than or equal to a confidence level $1 - \alpha$, then $(1 - p_x)^N \ge 1 - \alpha$, or $p_x \le 1 - (1 - \alpha)^{1/N}$. In the American population, a random match probability (RMP) of $3.35 \times 10^{-11}$ will confer a 99% confidence that the profile is unique in the population. For the UK, the RMP is $2.01 \times 10^{-10}$.
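The uniqueness threshold can be checked numerically. This sketch assumes population sizes of about 300 million (US) and 50 million (UK); those sizes are not stated on the slide, but they reproduce the quoted figures. The function name is hypothetical.

```python
import math


def rmp_threshold(population_size, confidence=0.99):
    """Largest RMP satisfying (1 - p)**N >= confidence, i.e. 1 - confidence**(1/N)."""
    # expm1 keeps precision when the result is tiny compared with 1.
    return -math.expm1(math.log(confidence) / population_size)


print(rmp_threshold(3e8))  # assumed US population size
print(rmp_threshold(5e7))  # assumed UK population size
```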
Other topics DNA database searches—multiply the RMP by the number of persons in the database to adjust for the possibility of matching that many people. The more persons in your database, the more likely the match will occur randomly. For lineage markers—use the count of the profile in the database as an estimate of its underlying probability in the population and do a frequency estimate with a confidence interval based on that.
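Both adjustments can be sketched in a few lines. The function names and example numbers are hypothetical, and the normal-approximation upper bound used for the counting method is an assumed choice of confidence interval, not one specified on the slide.

```python
import math


def database_adjusted_match_probability(rmp, database_size):
    """Database search adjustment: RMP multiplied by the number of profiles searched."""
    return min(1.0, rmp * database_size)


def counting_method_upper_bound(matches, database_size, z=1.96):
    """Upper 95% bound on a lineage-marker profile frequency (normal approximation)."""
    p = matches / database_size
    return p + z * math.sqrt(p * (1.0 - p) / database_size)


print(database_adjusted_match_probability(1e-9, 1_000_000))  # about 1 in 1000
print(counting_method_upper_bound(2, 1000))                  # haplotype seen twice in 1000
```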
DNA Profile Frequency with all 13 CODIS STR loci
Locus     Allele  Value   Allele  Value   1 in    Combined (product rule)
D3S1358   16      0.2533  17      0.2152  9.17
VWA       17      0.2815  18      0.2003  8.87    81
FGA       21      0.1854  22      0.2185  12.35   1005
D8S1179   12      0.1854  14      0.1656  16.29   16,364
D21S11    28      0.1589  30      0.2782  11.31   185,073
D18S51    14      0.1374  16      0.1391  26.18   4,845,217
D5S818    12      0.3841  13      0.1407  9.25    44,818,259
D13S317   11      0.3394  14      0.0480  30.69   1.38 x 10^9
D7S820    9       0.1772                  31.85   4.38 x 10^10
D16S539   9       0.1126  11      0.3212  13.8    6.05 x 10^11
TH01      6       0.2318                  18.62   1.13 x 10^13
TPOX      8       0.5348                  3.50    3.94 x 10^13
CSF1PO    10      0.2169                  21.28   8.37 x 10^14
The Random Match Probability for this profile in the U.S. Caucasian population is 1 in 837 trillion (10^12).
[Figure: AmpFlSTR® Identifiler™ (Applied Biosystems) profile showing AMEL, D3, TH01, TPOX, D2, D19, FGA, D21, D18, CSF, D16, D7, D13, D5, VWA, D8]
What would be entered into a DNA database for searching: 16,17 - 17,18 - 21,22 - 12,14 - 28,30 - 14,16 - 12,13 - 11,14 - 9,9 - 9,11 - 6,6 - 8,8 - 10,10
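The product rule calculation for this profile can be reproduced with a short sketch. It uses the allele frequencies from the 13-locus profile on this slide; the dictionary layout and function names are hypothetical, and a single frequency marks a homozygous locus.

```python
# Allele frequencies for the profile; one value means a homozygous genotype.
PROFILE = {
    "D3S1358": (0.2533, 0.2152), "VWA": (0.2815, 0.2003), "FGA": (0.1854, 0.2185),
    "D8S1179": (0.1854, 0.1656), "D21S11": (0.1589, 0.2782), "D18S51": (0.1374, 0.1391),
    "D5S818": (0.3841, 0.1407), "D13S317": (0.3394, 0.0480), "D7S820": (0.1772,),
    "D16S539": (0.1126, 0.3212), "TH01": (0.2318,), "TPOX": (0.5348,), "CSF1PO": (0.2169,),
}


def genotype_frequency(freqs):
    """p^2 for a homozygote, 2pq for a heterozygote (no substructure correction)."""
    if len(freqs) == 1:
        return freqs[0] ** 2
    p, q = freqs
    return 2 * p * q


def random_match_probability(profile):
    """Product rule: multiply the genotype frequencies across independent loci."""
    rmp = 1.0
    for freqs in profile.values():
        rmp *= genotype_frequency(freqs)
    return rmp


rmp = random_match_probability(PROFILE)
print(f"1 in {1 / rmp:.3g}")  # on the order of 10^14, matching the combined column
```

Small differences from the slide's running totals come from the slide rounding the per-locus "1 in" values before multiplying.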
The Same 13 Locus STR Profile in Different Populations
1 in 0.84 quadrillion (10^15), i.e. 1 in 837 trillion, in U.S. Caucasian population (NIST)
1 in 2.46 quadrillion (10^15) in U.S. Caucasian population (FBI)*
1 in 1.86 quadrillion (10^15) in Canadian Caucasian population*
1 in 16.6 quadrillion (10^15) in African American population (NIST)
1 in 17.6 quadrillion (10^15) in African American population (FBI)*
1 in 18.0 quadrillion (10^15) in U.S. Hispanic population (NIST)
*http://www.csfs.ca/pplus/profiler.htm
These values are for unrelated individuals assuming no population substructure (using only $p^2$ and $2pq$).
NIST study: Butler, J.M., et al. (2003) Allele frequencies for 15 autosomal STR loci on U.S. Caucasian, African American, and Hispanic populations. J. Forensic Sci. 48(4):908-911. (http://www.cstl.nist.gov/biotech/strbase/NISTpop.htm)
Example Calculations with Population Substructure Adjustments
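The worked example for this slide did not survive in the transcript. As a stand-in, here is a sketch of two widely used substructure corrections from the NRC II report: recommendation 4.1 (homozygotes only) and the conditional formulas of recommendation 4.10. The value theta = 0.01 and the function names are assumptions for illustration.

```python
def homozygote_nrc_4_1(p, theta):
    """NRC II rec. 4.1: homozygote frequency p^2 + p(1 - p)*theta; heterozygotes stay 2pq."""
    return p * p + p * (1.0 - p) * theta


def homozygote_nrc_4_10(p, theta):
    """NRC II rec. 4.10 conditional match probability for a homozygote."""
    num = (2 * theta + (1 - theta) * p) * (3 * theta + (1 - theta) * p)
    return num / ((1 + theta) * (1 + 2 * theta))


def heterozygote_nrc_4_10(p, q, theta):
    """NRC II rec. 4.10 conditional match probability for a heterozygote."""
    num = 2 * (theta + (1 - theta) * p) * (theta + (1 - theta) * q)
    return num / ((1 + theta) * (1 + 2 * theta))


# Assumed theta = 0.01, applied to the D7S820 homozygote (allele frequency 0.1772).
print(0.1772 ** 2, homozygote_nrc_4_1(0.1772, 0.01), homozygote_nrc_4_10(0.1772, 0.01))
```

With theta = 0 all three functions reduce to the plain $p^2$ and $2pq$ values; a positive theta makes the profile look less rare, which favours the defendant.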
Example Calculations with Corrections for Relatives
Statistical Analysis of Mixtures and Degraded DNA Chapter 22 of Butler, 2nd edition. Mixture interpretation Partial DNA profiles
Mixture interpretation This is nasty, but any truth is better than indefinite doubt. The most conservative approach is to judge whether the suspect might be represented by the mixture found in the sample. Sometimes you can pull apart the alleles, one known person at a time. (In radar/sonar, this is known as adaptive beam forming.) Duplicate alleles among the persons in the mixture are then a problem. When contributions of donors are about equal, you have a serious problem.
Exclusion Probabilities Use the combined probability of exclusion. This is an estimate of the proportion of the population that has at least one allele not observed in the mixture. The calculation assumes independence: the proportion of the population included at each locus is multiplied across loci, and the combined probability of exclusion is one minus that product. Vulnerable to non-detection of alleles. Provides a conservative estimate.
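A minimal sketch of the combined probability of exclusion, assuming the common formulation in which the per-locus inclusion probability is the squared sum of the observed allele frequencies. The allele frequencies and function names are hypothetical.

```python
def probability_of_inclusion(observed_allele_freqs):
    """Chance a random person carries only alleles seen in the mixture at one locus."""
    return sum(observed_allele_freqs) ** 2


def combined_probability_of_exclusion(loci):
    """CPE = 1 - product of per-locus inclusion probabilities (assumes independent loci)."""
    combined_inclusion = 1.0
    for freqs in loci:
        combined_inclusion *= probability_of_inclusion(freqs)
    return 1.0 - combined_inclusion


# Hypothetical two-locus mixture: frequencies of the alleles observed at each locus.
mixture = [[0.1, 0.2, 0.3], [0.25, 0.25]]
print(combined_probability_of_exclusion(mixture))  # about 0.91
```

If an allele drops out and goes undetected, a true contributor can be wrongly excluded, which is the vulnerability noted on the slide.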
Likelihood Ratio Set up two competing hypotheses. The problem is that defining the hypotheses is not straightforward. Uses the evidence better than the exclusion method.
Mixtures Complicated to interpret. Basic approach is to identify the alleles from known contributors. Any detected alleles outside that set had to come from unknowns (one or more…) When the mixture results are affected by low-copy-number stochastic limits, degradation, or PCR inhibition, so that alleles are missing, all bets are off.
Partial DNA profiles Only loci with results can be interpreted. Degraded samples or low-copy-number samples can cause PCR to fail at some loci. Interpret only the detected alleles. Any data are better than none at all.
Kinship and Parentage Testing Chapter 23 of Butler, 2nd edition. When DNA samples being compared are from related individuals, the assumption of independence is violated, and different statistical equations must be applied. Parentage testing – Statistical calculations – Impact of mutational events – Reference samples Reverse parentage testing – Data from both parents is often not available
Conclusions Unfortunately, you’re likely to be the expert. If you have the opportunity, study this on your own or do a forensics qualification (post-graduate or subject-area) You know where to find help. – Michael Oakes – Peter Dunne – Malcolm Farrow Be honest about your level of skill More statistics won’t hurt you.