Parametric and Non-Parametric analysis of complex diseases Lecture #8

Based on Chapters 25 and 26 of Terwilliger and Ott’s Handbook of Human Genetic Linkage. Prepared by Dan Geiger.

Complex Diseases
- Unknown mode of inheritance (dominant/recessive)
- Several interacting loci (epistasis)
- Unclear affected status (e.g., psychiatric disorders)
- Genetic heterogeneity (allelic and non-allelic)
- Non-genetic factors

The result is that the likelihood of the data, computed under a given model and a given location of the disease locus (specified by θ), may be completely off because the model we use is wrong! We start by specifying what alternative models look like, using the Bayes net model we developed.

Mode of Inheritance
[Figure: Bayes net for a pedigree at two loci, with nodes L11m ... L23f, S13m, S13f, S23m, S23f, X11 ... X23 and phenotype nodes y1, y2, y3.]
Specify different conditional probability tables between the phenotype variables Yi and the genotypes.

Recessive, full penetrance:
P(y1 = sick | X11 = (a,a)) = 1
P(y1 = sick | X11 = (A,a)) = 0
P(y1 = sick | X11 = (A,A)) = 0

More modes of Inheritance
Dominant, full penetrance:
P(y1 = sick | X11 = (a,a)) = 1
P(y1 = sick | X11 = (A,a)) = 1
P(y1 = sick | X11 = (A,A)) = 0

Dominant, 20% penetrance, 5% penetrance for phenocopies:
P(y1 = sick | X11 = (a,a)) = 0.2
P(y1 = sick | X11 = (A,a)) = 0.2
P(y1 = sick | X11 = (A,A)) = 0.05

Dominant, 60% penetrance:
P(y1 = sick | X11 = (a,a)) = 0.6
P(y1 = sick | X11 = (A,a)) = 0.6
P(y1 = sick | X11 = (A,A)) = 0

Recessive, 40% penetrance, 1% penetrance for phenocopies:
P(y1 = sick | X11 = (a,a)) = 0.4
P(y1 = sick | X11 = (A,a)) = 0.01
P(y1 = sick | X11 = (A,A)) = 0.01
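The conditional probability tables above can be written down as small lookup tables. The following is a minimal Python sketch; the dictionary name, the model labels, and the encoding of a genotype by its number of copies of the disease allele a are illustrative assumptions, not part of any linkage package.

```python
# Penetrance tables P(y = sick | genotype), keyed by the number of copies of
# the disease allele "a": 2 -> (a,a), 1 -> (A,a), 0 -> (A,A).
# All names here are hypothetical and chosen only for illustration.
PENETRANCE = {
    "recessive_full": {2: 1.0, 1: 0.0, 0: 0.0},
    "dominant_full": {2: 1.0, 1: 1.0, 0: 0.0},
    "dominant_20_phenocopy_5": {2: 0.2, 1: 0.2, 0: 0.05},
    "dominant_60": {2: 0.6, 1: 0.6, 0: 0.0},
    "recessive_40_phenocopy_1": {2: 0.4, 1: 0.01, 0: 0.01},
}


def prob_phenotype(model, copies_of_a, sick):
    """Return P(y = sick | genotype) or P(y = not sick | genotype)."""
    p_sick = PENETRANCE[model][copies_of_a]
    return p_sick if sick else 1.0 - p_sick
```

Switching the mode of inheritance then amounts to switching the table that is plugged into the phenotype nodes; the rest of the network is unchanged.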

Two or more interacting loci (epistasis)
[Figure: Bayes net for a pedigree at two loci, with nodes L11m ... L23f, S13m, S13f, S23m, S23f, X11 ... X23 and phenotype nodes y1, y2, y3.]
Specify different conditional probability tables between the phenotype variables Yi and the two or more genotypes of person i.

Example: recessive, full penetrance:
P(y1 = sick | X11 = (a,a), X21 = (a,a)) = 1
P(y1 = sick | X11 = (A,a), X21 = (a,a)) = 0
P(y1 = sick | X11 = (A,A), X21 = (a,a)) = 0
with 6 more zero entries to specify.

There are 512 possible patterns of zeros and ones for two loci! There is not enough data to try them all.
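The count of 512 follows from having 3 genotypes at each of two loci, hence 9 joint genotypes, each assigned penetrance 0 or 1. A short sketch that checks the arithmetic and builds the fully penetrant two-locus recessive table from the example (the variable and function names are illustrative assumptions):

```python
from itertools import product

# The three unordered genotypes at a biallelic locus.
GENOTYPES = [("a", "a"), ("A", "a"), ("A", "A")]

# All joint genotypes at two loci: 3 * 3 = 9 combinations.
joint_genotypes = list(product(GENOTYPES, repeat=2))

# Each joint genotype gets penetrance 0 or 1, so there are 2**9 patterns.
n_patterns = 2 ** len(joint_genotypes)


def double_recessive(g1, g2):
    """Full penetrance only when both loci are homozygous (a,a)."""
    return 1.0 if g1 == ("a", "a") and g2 == ("a", "a") else 0.0


table = {pair: double_recessive(*pair) for pair in joint_genotypes}
```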

Unclear affection status
[Figure: the pedigree Bayes net with true phenotype nodes Y1, Y2, Y3 and measured-status nodes Zi attached to them.]
Specify a “confusion matrix” describing the process that determines affected status:
P(z1 = measured sick | y1 = sick) = 0.9
P(z1 = measured sick | y1 = not sick) = 0.2
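Combining the confusion matrix with a penetrance value gives the probability of the measured status given the genotype, by summing out the true status y. A minimal sketch, using the two numbers from the slide; the function name and the penetrance argument are assumptions made for illustration.

```python
# Confusion matrix from the slide: P(z = measured sick | y).
P_MEASURED_SICK = {"sick": 0.9, "not sick": 0.2}


def prob_measured_sick(penetrance):
    """P(z = measured sick | genotype), where penetrance = P(y = sick | genotype).

    Sums over the unobserved true status y:
    P(z | x) = P(z | y = sick) P(y = sick | x) + P(z | y = not sick) P(y = not sick | x)
    """
    return (P_MEASURED_SICK["sick"] * penetrance
            + P_MEASURED_SICK["not sick"] * (1.0 - penetrance))
```

For instance, with a penetrance of 0.6 the measured-sick probability is 0.9 * 0.6 + 0.2 * 0.4 = 0.62.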

Genetic Heterogeneity
[Figure: Bayes net over nodes Li1f ... Li3m, Si3f, Si3m, Xi1, Xi2, Xi3, repeated for loci i = 1, 2, 3.]
Allelic heterogeneity: more than one allele at a locus predisposes to the disease.
Non-allelic heterogeneity: several independent loci predispose to the disease.

Non-genetic factors
[Figure: the pedigree Bayes net with a liability class node Li attached to each phenotype node yi.]
Example: Li = 1 means “old”, Li = 2 means “young”.

Under liability class 1 (L1 = 1):
P(y1 = sick | X11 = (a,a), L1 = 1) = 1
P(y1 = sick | X11 = (A,a), L1 = 1) = 0.05
P(y1 = sick | X11 = (A,A), L1 = 1) = 0.05

Under L1 = 2 (“young”): the first line changes to, say, 0.3 and the other two lines to, say, 0.

Trying all options can be bad
If we repeatedly try many models with various parameters, the likelihood of the data under one of these models can occasionally be high just by pure chance. The usual requirement that the LOD score be higher than 3 should then no longer be used; it becomes too permissive:
LOD(θ) = log10[ Pr(data | θ) / Pr(data | θ = ½) ] > 3
The analysis of some schizophrenia pedigrees in the genetics literature yielded a LOD score of 6.49, implying that true gene locations were found with very high confidence. Further scrutiny could not replicate these findings.
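To make the LOD definition concrete, here is a sketch for the simplest textbook situation only: fully informative, phase-known meioses, where the likelihood is θ^r (1 − θ)^(n − r) for r recombinants among n meioses. This closed form is an assumption made for illustration; for general pedigrees the likelihood has to be computed from the Bayes net. The function name is hypothetical.

```python
import math


def lod_phase_known(theta, recombinants, meioses):
    """LOD(theta) = log10[ Pr(data | theta) / Pr(data | theta = 1/2) ].

    Assumes phase-known, fully informative meioses, so that
    Pr(data | theta) = theta**r * (1 - theta)**(n - r).
    """
    non_recombinants = meioses - recombinants
    likelihood = theta ** recombinants * (1.0 - theta) ** non_recombinants
    null_likelihood = 0.5 ** meioses
    if likelihood == 0.0:
        # The data are impossible under this theta.
        return float("-inf")
    return math.log10(likelihood / null_likelihood)
```

Under these assumptions, ten meioses with no recombinants give LOD(0) = 10 log10(2), just above the classical threshold of 3.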

But we did use repeated θ values!?
Even for Mendelian (i.e., monogenic) diseases, and assuming we have the correct model (say, dominant), we still try many θ values to find the best location. Why is this practice OK?
Answer: because whenever we get a negative answer, we eliminate part of the genome. Assuming the model is correct, eliminating 50% of the genome doubles the prior probability of finding linkage in the remaining part.
However: for complex diseases this answer no longer applies; the model is known to be wrong.

Correction factor by Kidd & Ott (1984)
For a fixed model with various θ values, use the requirement
LOD(θ) > 3 + log10(m)
where m is the number of θ values tried. The length of the human genome is about 5000 cM, and a distance of 50 cM or so is enough to declare two sites independent, so 100 independent markers cover the genome. The threshold for the LOD score is therefore between 3 and 5, since 3 + log10(100) = 5.
When we try many models, this correction idea translates to
LOD(θ) > 5 + log10(n)
where n is the number of models tried. In the schizophrenia study 18 models were reported, so the LOD score threshold should be 5 + log10(18) ≈ 6.26. The study's result of 6.49 is not that surprising any more.
ONE MUST COUNT ALL MODELS ONE TRIED, even the unpublished attempts.
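The corrected thresholds are simple arithmetic and can be checked directly. A small sketch (the function name is an illustrative assumption):

```python
import math


def lod_threshold(base, n_tests):
    """Corrected LOD threshold: base + log10(number of tests tried)."""
    return base + math.log10(n_tests)


# 100 independent locations tried under one fixed model.
genome_wide = lod_threshold(3, 100)

# 18 models tried on top of the genome-wide threshold of 5.
schizophrenia_study = lod_threshold(5, 18)
```

Here `genome_wide` evaluates to 5 and `schizophrenia_study` to about 6.26, so the reported LOD score of 6.49 clears the corrected threshold only narrowly.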

Affected only analysis
Problem: if we analyze a complex disease using a single-locus model, attempting to find one of the influential predisposing gene locations, we run into the following difficulty. An individual reported to be unaffected could actually possess some predisposing alleles. So affection status is more indicative when the individual is affected (say, in 9/10 cases) but less so when the individual is reported unaffected (say, in 1/10 cases).
Idea: treat unaffected pedigree members as unobserved (regarding affection status).
[Figure: the pedigree Bayes net with phenotype nodes Y1, Y2, Y3; the node of the unaffected individual, Y2, is removed.]
The two values for Yi are: affected, not affected. Removing Y2 from the Bayesian network means that unaffected is coded as unknown (unobserved), because Σ_{y2} p(y2 | x12) = 1.

Parametric versus Non-Parametric
All analyses considered so far are “parametric”, meaning that a mode of inheritance is assumed. In some cases several candidate modes of inheritance are assumed, but the analysis still uses each one in turn. For complex diseases it is believed that “non-parametric” methods might work better. In our context, these are methods that do not take the mode of inheritance into account. The idea is that computing linkage without assuming a mode of inheritance is more robust to errors in model specification. Clearly, if the model is correct, parametric methods perform better, but not so if the model is wrong, as it is for complex traits.

Some Non-Parametric Methods
Definitions: any two identical copies of an allele are said to be identical by state (IBS). If these alleles are inherited from the same individual, then they are also identical by descent (IBD). Clearly, IBD implies IBS but not vice versa.
Main idea: if affected siblings share more IBD alleles at some marker locus than randomly expected among siblings, then that locus might be near a locus of a predisposing gene.
We will consider the following non-parametric methods:
- Affected Sib-Pair Analysis (ASP)
- Extended Affected Sib-Pair Analysis (ESPA)
- Affected Pedigree Member method (APM)

Identical By Descent (IBD)
[Figure: example pedigrees with marker genotypes (1/2, 1/1, 1/3, 1/2, 1/3 and 1/1, 1/2), annotated as follows.]
- Exactly one allele IBD.
- No allele is IBD. One allele is IBS.
- At least one allele IBD. Expected 1.5 alleles IBD.

Affected Sib-Pair Analysis
The idea is that any two siblings are expected to have one allele IBD by chance (and at most two IBD alleles, of course). When a deviation from this pattern is detected by examining many sib-pairs, linkage is established between a disease gene and the marker location. This phenomenon occurs regardless of the mode of inheritance, but its strength is different for each mode.

[Figure: nuclear family with marker genotypes 1/2 (father) and 3/4 (mother); the two sons shown are 1/3 and 1/4.]
There are 16 combinations of sibling marker genotypes. The number of IBD alleles for each combination is:

SON1 \ SON2    1/3    1/4    2/3    2/4
1/3             2*     1*     1      0
1/4             1*     2*     0      1
2/3             1      0      2      1
2/4             0      1      1      2

Not surprisingly, the expected number of IBD alleles is (4*2 + 8*1)/16 = 1.
But now assume a dominant disease that comes from the father and lies on the haplotype carrying the 1 allele. The only viable options are marked with * in the table. The expected IBD is thus (2*2 + 2*1)/4 = 1.5, which can be detected in the analysis.
For a recessive disease linked to the haplotypes carrying alleles 1 and 3, the only viable pair is 1/3, 1/3, with expected IBD of 2.
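The three expected values can be checked by brute-force enumeration. The sketch below assumes a father with marker genotype 1/2 and a mother with 3/4, so that all four parental alleles are distinct and sharing a parental allele is the same as sharing it IBD; the function names are illustrative.

```python
from fractions import Fraction
from itertools import product

FATHER = (1, 2)
MOTHER = (3, 4)

# Each child is (paternal allele, maternal allele): 1/3, 1/4, 2/3, 2/4.
CHILDREN = list(product(FATHER, MOTHER))


def ibd(child1, child2):
    """Number of alleles shared IBD (0, 1 or 2) by two siblings."""
    return int(child1[0] == child2[0]) + int(child1[1] == child2[1])


def expected_ibd(viable):
    """Average IBD over all sib pairs in which both sibs pass `viable`."""
    pairs = [(c1, c2) for c1, c2 in product(CHILDREN, repeat=2)
             if viable(c1) and viable(c2)]
    return Fraction(sum(ibd(c1, c2) for c1, c2 in pairs), len(pairs))


# No constraint: all 16 pairs are equally likely.
no_linkage = expected_ibd(lambda child: True)

# Dominant disease on the paternal haplotype carrying allele 1.
dominant = expected_ibd(lambda child: child[0] == 1)

# Recessive disease on the haplotypes carrying alleles 1 and 3.
recessive = expected_ibd(lambda child: child == (1, 3))
```

The enumeration gives 1, 3/2 and 2 respectively, matching the hand calculation.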

[Figure: pedigree with marker genotypes 1/2, 1/3, 1/4, 3/4.]
The standard ASP analysis, where pedigrees look like the one above (two parents, two children, all observed), can even be done by hand. However, one can also use general pedigrees, assume that some family members are not observed, and consider more distant relatives such as first cousins.

Extended Affected Sib-Pair Analysis (e.g., the ESPA program)
[Figure: pedigree with marker genotypes ?/?, 1/3, 1/4, 3/4, where one individual is untyped.]
Compute the probability of the alleles of every family configuration given the other typed persons in the pedigree. Based on these probabilities compute:
E[IBD] = 1 * Pr(1 allele IBD) + 2 * Pr(2 alleles IBD)
(The ESPA program currently assumes no loops and at most 5 alleles at a locus.)

Affected Pedigree Members method (APM)
Computing IBD for distant relatives is considered hard or impossible, so researchers used IBS instead. Consider one relative to have alleles (A1, A2) and the other to have (B1, B2). There are four possible ways for a pair of alleles to be IBS. Weeks and Lange (1988) used a statistic zij for counting the IBS status of two individuals i and j.
[Formula for zij not preserved in the transcript.]
This measure should be compared to what is expected under no linkage. To use many pedigrees, a conversion to standard normal variables is used.
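Since the formula itself is missing here, the following sketch only illustrates the idea of counting IBS matches over the four allele comparisons (A1, B1), (A1, B2), (A2, B1), (A2, B2). It assumes the unweighted form in which each match contributes 1/4; this is an assumption about the statistic, not a quotation of it, and the function name is hypothetical.

```python
def ibs_statistic(genotype_i, genotype_j):
    """Fraction of the four allele pairings (one allele from each person)
    that are identical by state.

    Assumed form: z_ij = (1/4) * sum over a, b of [A_a == B_b].
    """
    matches = sum(1 for a in genotype_i for b in genotype_j if a == b)
    return matches / 4
```

For example, genotypes (1, 2) and (1, 3) share one matching pairing out of four, while two (1, 1) homozygotes match in all four.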

Taking Gene Frequencies into Account
Clearly it is more surprising for affected relatives to share a rare allele than a common one. So one can use a weighted average of the IBS matches, with weights that depend on the allele frequencies.
[Formulas for the weighted statistic not preserved in the transcript.]
APM is a popular method and therefore needs to be mentioned. Terwilliger & Ott state, however, that: “The reliability and robustness of this method is unclear, because it is much more dependent on gene frequencies estimates than standard linkage analysis”.
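A sketch of how allele frequencies could enter such a statistic: each IBS match is weighted by a decreasing function of the frequency of the matching allele, so that sharing a rare allele counts for more. The default weight 1/sqrt(p) and all names below are assumptions chosen for illustration; the exact weighting used in the lecture is not recoverable from this text.

```python
import math


def weighted_ibs(genotype_i, genotype_j, freq,
                 weight=lambda p: 1.0 / math.sqrt(p)):
    """IBS statistic in which each match is weighted by weight(freq[allele]).

    `freq` maps an allele to its (estimated) population frequency.
    The default weighting function is an illustrative assumption.
    """
    total = 0.0
    for a in genotype_i:
        for b in genotype_j:
            if a == b:
                total += weight(freq[a])
    return total / 4


# Hypothetical allele frequencies: allele 1 is rare, allele 2 is common.
freq = {1: 0.01, 2: 0.99}
rare_sharing = weighted_ibs((1, 1), (1, 1), freq)
common_sharing = weighted_ibs((2, 2), (2, 2), freq)
```

The sketch also makes the quoted criticism visible: the value of the statistic moves directly with the frequency estimates in `freq`.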

Using SUPERLINK’s BayesNet Model
What query do we need to answer in order to decide how likely it is that two affected relatives A and B share a common allele?
Answer: find all the paths of the form A ← founder → B that connect them and compute the likelihood of the assignment to the selector variables that activate these paths. There is no need to resort to IBS calculations.

Example
[Figure: pedigree with marker genotypes ?/?, 1/3, 1/4 and two relatives labelled A and B.]
The expected IBD is given using the posterior distribution of the four selector variables SAf, SAm, SBf, SBm given the data in the pedigree, i.e.,
Pr(SAf=0, SAm=0, SBf=0, SBm=0 | data) * 2 + Pr(SAf=0, SAm=0, SBf=0, SBm=1 | data) * 1 + etc., 16 terms altogether.
Note that we only treat one marker at a time, which is standard for the APM method.
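The 16-term sum can be sketched as follows, assuming that A and B are full siblings and that each selector takes the value 0 or 1 to indicate which of the parent's two alleles was transmitted. The posterior itself would come from inference in the Bayes net; here it is simply passed in as a dictionary, and the uniform example is a stand-in, not real output.

```python
from itertools import product


def expected_ibd(posterior):
    """Expected number of IBD alleles for siblings A and B.

    `posterior` maps (s_af, s_am, s_bf, s_bm) to Pr(selectors | data).
    The sibs share the paternal allele IBD when s_af == s_bf and the
    maternal allele IBD when s_am == s_bm.
    """
    total = 0.0
    for (s_af, s_am, s_bf, s_bm), prob in posterior.items():
        total += prob * (int(s_af == s_bf) + int(s_am == s_bm))
    return total


# Stand-in posterior: uninformative data, all 16 assignments equally likely.
uniform = {s: 1 / 16 for s in product((0, 1), repeat=4)}
```

With the uniform stand-in the result is 1, the value expected for siblings when the marker carries no information.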

Pedigree’s Data (fn.ped)
[Figure: excerpt of the pedigree file, with the following fields annotated.]
- Pedigree number
- Individual ID
- Father’s ID
- Mother’s ID
- Sex: 1 = male, 2 = female
- Status: 1 = healthy, 2 = diseased
- Liability class (in this file, 1 class only)
- Marker alleles (known and unknown)

Marker File (fn.dat) 1 disease locus + 13 markers
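[Figure: excerpt of the marker file, partially preserved as: 1 2 1 0 0 1 3 6 ...[other 12 markers skipped]... 0 0. The following items are annotated.]
- Which program is used
- Disease locus: locus type = affection (coded 1), number of alleles = 2, followed by the normal and mutated allele frequencies
- Markers: locus type = “allele numbers” (coded 3), number of alleles = 6
- Founder allele frequencies of the last marker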

Marker File (fn.dat), continued
[Figure: the same marker file excerpt (1 2 1 0 0 1 3 6 ...[other 12 markers skipped]... 0 0), with the following items annotated.]
- Number of liability classes
- Recessive disease, full penetrance
- Recombination distances between markers
