Basic Model For Genetic Linkage Analysis, Lecture #3
Prepared by Dan Geiger

Using the Maximum Likelihood Approach
The probability of the pedigree data, Pr(data | θ), is a function of the known and unknown recombination fractions, denoted collectively by θ. How can we construct this likelihood function? The maximum likelihood approach is to seek the value of θ that maximizes the likelihood function Pr(data | θ); this is the ML estimate.
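As a toy illustration of the idea (not from the lecture), the sketch below scans the admissible range 0 ≤ θ ≤ 1/2 for the maximizer of a stand-in likelihood; the binomial-style toy_likelihood and its meiosis counts are assumptions made purely for the example.

```python
import numpy as np

# Hypothetical stand-in for Pr(data | theta): a binomial-style likelihood
# over counted recombinant and non-recombinant meioses (illustration only).
def toy_likelihood(theta, recombinants=2, non_recombinants=8):
    return theta**recombinants * (1 - theta)**non_recombinants

# ML estimate by a grid scan over the admissible range [0, 1/2].
thetas = np.linspace(0.0, 0.5, 501)
ml_theta = thetas[np.argmax([toy_likelihood(t) for t in thetas])]
print(f"ML estimate: {ml_theta:.3f}")  # 0.200, i.e. 2 of 10 meioses recombinant
```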

Constructing the Likelihood function
First, we need to determine the variables that describe the problem. There are many possible choices; some variables we can observe and some we cannot.
Lijm = maternal allele at locus i of person j. The values of this variable are the possible alleles li at locus i.
Lijf = paternal allele at locus i of person j. The values of this variable are the possible alleles li at locus i (same as for Lijm).
Xij = unordered allele pair at locus i of person j. The values are pairs of ith-locus alleles (li, l'i).
As a starting point, we assume that the data consist of an assignment to a subset of the variables {Xij}. In other words, some (or all) persons are genotyped at some (or all) loci.

What are the relationships among the variables for a specific individual?
L11m = maternal allele at locus 1 of person 1; L11f = paternal allele at locus 1 of person 1; X11 = unordered allele pair at locus 1 of person 1 (the data).
P(L11m = a) is the frequency of allele a. We use lower-case letters for states, writing, in short, P(l11m).
P(x11 | l11m, l11f) = 0 or 1, depending on consistency.
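A minimal sketch of these two local tables in Python; the allele frequencies are hypothetical numbers, not from the lecture:

```python
# Hypothetical allele frequencies standing in for P(L11m = a).
freq = {"a": 0.3, "A": 0.7}

def p_x_given_l(x, l_m, l_f):
    """P(x11 | l11m, l11f): 1 if the unordered pair x is consistent with
    the ordered parental alleles, and 0 otherwise."""
    return 1.0 if set(x) == {l_m, l_f} else 0.0

print(p_x_given_l(("a", "A"), "A", "a"))   # 1.0: consistent
print(p_x_given_l(("a", "a"), "A", "a"))   # 0.0: inconsistent
```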

What are the relationships among the variables across individuals?
The mother's alleles L11m, L11f determine X11; the father's alleles L12m, L12f determine X12; and the offspring's alleles L13m, L13f determine X13.
P(l13m | l11m, l11f) = 1/2 if l13m = l11m or l13m = l11f
P(l13m | l11m, l11f) = 0 otherwise
First attempt: correct, but not efficient, as we shall see.

Probabilistic model for two loci
(Figure: the locus-1 network L11m, L11f, L12m, L12f, L13m, L13f with observations X11, X12, X13, and the analogous locus-2 network with variables L2jt and observations X2j.)
L23m depends on whether L13m got its value from L11m or L11f, on whether a recombination occurred, and on the values of L21m and L21f. This is quite complex.

Adding a selector variable
S13m is the selector of the maternal allele at locus 1 of person 3; L13m is the maternal allele at locus 1 of person 3 (offspring). P(s13m) = 1/2.
Selector variables Sijm are 0 or 1 depending on which of the parent's alleles is transmitted to offspring j at locus i:
P(l13m | l11m, l11f, S13m = 0) = 1 if l13m = l11m
P(l13m | l11m, l11f, S13m = 1) = 1 if l13m = l11f
P(l13m | l11m, l11f, s13m) = 0 otherwise
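A minimal sketch of this deterministic transmission table; the function name is chosen for the example:

```python
# The selector s chooses which of the parent's two alleles is passed on
# (0 = the parent's maternal allele l_m, 1 = the paternal allele l_f).
def p_child_given_parents(l_child, l_m, l_f, s):
    return 1.0 if l_child == (l_m if s == 0 else l_f) else 0.0

P_SELECTOR = 0.5   # P(s13m = 0) = P(s13m = 1) = 1/2

print(p_child_given_parents("a", "a", "A", 0))  # 1.0: maternal allele chosen
print(p_child_given_parents("a", "a", "A", 1))  # 0.0: paternal allele is A
```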

Probabilistic model for two loci (with selectors)
(Figure: the locus-1 network now includes selectors S13m and S13f feeding L13m and L13f; the locus-2 network is analogous, with selectors S23m and S23f.)

Probabilistic Model for Recombination
(Figure: the two locus networks, coupled by edges from S13m to S23m and from S13f to S23f.)
θ is the recombination fraction between loci 1 and 2.

Constructing the likelihood function I
X11 is the observed variable; all other variables are hidden.
Joint probability:
P(l11m, l11f, x11, s13m, l13m) = P(l11m) P(l11f) P(x11 | l11m, l11f) P(s13m) P(l13m | s13m, l11m, l11f)
Probability of data (sum over all states of all hidden variables):
Prob(data) = P(x11) = Σ_{l11m} Σ_{l11f} Σ_{s13m} Σ_{l13m} P(l11m, l11f, x11, s13m, l13m)
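This sum can be evaluated by brute-force enumeration. Below is a sketch assuming hypothetical allele frequencies; it re-defines the two indicator tables from the earlier sketches so it runs on its own:

```python
from itertools import product

freq = {"a": 0.3, "A": 0.7}            # hypothetical allele frequencies
alleles = list(freq)

def p_x_given_l(x, lm, lf):            # genotype consistency (0 or 1)
    return 1.0 if set(x) == {lm, lf} else 0.0

def p_transmit(lc, lm, lf, s):         # selector-based transmission (0 or 1)
    return 1.0 if lc == (lm if s == 0 else lf) else 0.0

def prob_x11(x11):
    return sum(freq[lm] * freq[lf]
               * p_x_given_l(x11, lm, lf)
               * 0.5                                  # P(s13m)
               * p_transmit(lc, lm, lf, s)
               for lm, lf, s, lc in product(alleles, alleles, (0, 1), alleles))

print(prob_x11(("a", "A")))            # 2 * 0.3 * 0.7 = 0.42
```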

Constructing the likelihood function II
Product over all local probability tables:
P(l11m, l11f, x11, l12m, l12f, x12, l13m, l13f, x13, l21m, l21f, x21, l22m, l22f, x22, l23m, l23f, x23, s13m, s13f, s23m, s23f) =
P(l11m) P(l11f) P(x11 | l11m, l11f) … P(s13m) P(s13f) P(s23m | s13m, θ) P(s23f | s13f, θ)
Probability of data (sum over all states of all hidden variables):
Prob(data | θ) = P(x11, x12, x13, x21, x22, x23) = Σ_{l11m, l11f, …, s23f} [ P(l11m) P(l11f) P(x11 | l11m, l11f) … P(s13m) P(s13f) P(s23m | s13m, θ) P(s23f | s13f, θ) ]
The result is a function of the recombination fraction θ. The ML estimate is the θ value that maximizes this function.
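A brute-force sketch of Prob(data | θ). Note one deviation from the three-person pedigree above: a hypothetical second child (person 4) is added, since a single child carries no linkage information and would make the curve flat. All genotypes and frequencies are assumptions for the example:

```python
from itertools import product
import numpy as np

freq = {1: {"a": 0.3, "A": 0.7}, 2: {"b": 0.5, "B": 0.5}}   # per-locus
x = {1: {1: ("a", "A"), 2: ("a", "a"), 3: ("a", "a"), 4: ("a", "a")},
     2: {1: ("b", "B"), 2: ("b", "b"), 3: ("b", "b"), 4: ("b", "b")}}
CHILDREN = (3, 4)

def consistent(geno, lm, lf):                   # zero-or-one genotype table
    return set(geno) == {lm, lf}

def likelihood(theta):
    total = 0.0
    # one maternal and one paternal selector per child per locus
    for sel in product((0, 1), repeat=4 * len(CHILDREN)):
        s = dict(zip(product((1, 2), CHILDREN, "mf"), sel))
        p = 1.0
        for c in CHILDREN:
            for t in "mf":
                p *= 0.5                        # P(selector at locus 1)
                p *= theta if s[2, c, t] != s[1, c, t] else 1 - theta
        for i in (1, 2):                        # founder-allele sum per locus
            alleles, p_locus = list(freq[i]), 0.0
            for l1m, l1f, l2m, l2f in product(alleles, repeat=4):
                if not (consistent(x[i][1], l1m, l1f)
                        and consistent(x[i][2], l2m, l2f)):
                    continue
                term = freq[i][l1m] * freq[i][l1f] * freq[i][l2m] * freq[i][l2f]
                for c in CHILDREN:              # children's alleles are forced
                    lcm = l1m if s[i, c, "m"] == 0 else l1f
                    lcf = l2m if s[i, c, "f"] == 0 else l2f
                    if not consistent(x[i][c], lcm, lcf):
                        term = 0.0
                p_locus += term
            p *= p_locus
        total += p
    return total

thetas = np.linspace(0.0, 0.5, 101)
print(thetas[np.argmax([likelihood(t) for t in thetas])])   # 0.0: tight linkage
```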

The Disease Locus I
Phenotype variables Yij are 0 or 1 depending on whether a phenotypic trait associated with locus i of person j is observed, e.g., sick versus healthy. For example, a model of a fully penetrant recessive disease yields the penetrance probabilities:
P(y11 = sick | X11 = (a,a)) = 1
P(y11 = sick | X11 = (A,a)) = 0
P(y11 = sick | X11 = (A,A)) = 0
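A one-line sketch of this penetrance table:

```python
# Fully penetrant recessive disease: the phenotype is a deterministic
# function of the unordered genotype at the disease locus.
def p_sick_given_genotype(genotype):
    return 1.0 if set(genotype) == {"a"} else 0.0

assert p_sick_given_genotype(("a", "a")) == 1.0
assert p_sick_given_genotype(("A", "a")) == 0.0
assert p_sick_given_genotype(("A", "A")) == 0.0
```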

The Disease Locus II
Note that in this model we assume the phenotype/disease depends only on the alleles of one locus. Also, we did not model degrees of sickness.

Introducing a tentative disease locus
(Figure: locus 1 is the marker locus; locus 2 is the disease locus, with phenotype variables Y21, Y22, Y23 attached to X21, X22, X23. Assume sick means xij = (a,a).)
The recombination fraction θ is unknown. Finding it can help determine whether a gene causing the disease lies in the vicinity of the marker locus.

Locus-by-Locus Summation order
Sum over the variables of locus i before summing over the variables of locus i+1, and within each locus sum over the allele variables (Lijt) before the selector variables (Sijt). This summation order yields a Hidden Markov Model (HMM).

Hidden Markov Models in General
(Figure: a chain of hidden states S1, S2, …, Si, …, each emitting an observation R1, R2, …, Ri, ….) The network depicts the factorization:
P(s1, …, sm, r1, …, rm) = P(s1) P(r1 | s1) ∏_{i=2..m} P(si | si-1) P(ri | si)
Application in communication: the message sent is (s1,…,sm) but we receive (r1,…,rm). Compute the most likely message sent.
Application in speech recognition: the word said is (s1,…,sm) but we recorded (r1,…,rm). Compute the most likely word said.
Application in genetic linkage analysis: to be discussed now.
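For any HMM of this form, P(r1, …, rm) can be computed by the forward algorithm, summing out one hidden state at a time. A generic sketch with hypothetical placeholder numbers:

```python
import numpy as np

# pi is the initial distribution over hidden states, T[s, s'] = P(s' | s),
# and each entry of 'emissions' is a vector of P(observed r_t | hidden state).
def hmm_likelihood(pi, T, emissions):
    alpha = pi * emissions[0]            # P(s1, r1)
    for e in emissions[1:]:
        alpha = (alpha @ T) * e          # marginalize s_{t-1}, absorb r_t
    return alpha.sum()                   # P(r1, ..., rm)

pi = np.array([0.5, 0.5])
T = np.array([[0.9, 0.1], [0.1, 0.9]])
emissions = [np.array([0.8, 0.1]), np.array([0.7, 0.2])]
print(hmm_likelihood(pi, T, emissions))
```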

Hidden Markov Model in our case
(Figure: the chain of inheritance vectors S2, …, Si+1, with locus data X1, …, Xi+1 and, at the disease locus, Yi.)
The compounded variable Si = (Si,1,m, Si,1,f, …, Si,n,m, Si,n,f) is called the inheritance vector. It has 2^{2n} states, where n is the number of persons that have parents in the pedigree (non-founders). The compounded variable Xi is the data regarding locus i; similarly, for the disease locus we use Yi.
To specify the HMM we need to write down the transition matrices from Si-1 to Si and the matrices P(xi | si). Note that these quantities have already been implicitly defined.

The transition matrix
Recall that P(s23m | s13m, θ) = θ if s23m ≠ s13m, and 1 - θ otherwise. (θ depends on the locus index i, but this dependence is omitted from the notation.) So for a single meiosis the transition matrix is
Θ = [ 1-θ  θ ; θ  1-θ ]
In our example, where we have one non-founder (n = 1), the transition probability table has size 4 × 4 = 2^{2n} × 2^{2n}, encoding the four options of recombination/non-recombination for the two parental meioses; it is the Kronecker product Θ ⊗ Θ. For n non-founders, the transition matrix is the n-fold Kronecker product of this 4 × 4 matrix, i.e., one 2 × 2 factor Θ per meiosis.
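A sketch of this construction with NumPy's Kronecker product; the function names are chosen for the example:

```python
import numpy as np

def meiosis_matrix(theta):
    # 2x2 factor per meiosis: 1-theta on the diagonal (no recombination),
    # theta off the diagonal (recombination).
    return np.array([[1 - theta, theta],
                     [theta, 1 - theta]])

def inheritance_transition(theta, n_nonfounders):
    T = np.array([[1.0]])
    for _ in range(2 * n_nonfounders):   # two meioses per non-founder
        T = np.kron(T, meiosis_matrix(theta))
    return T

print(inheritance_transition(0.1, 1))    # 4x4: entries (1-t)^2, t(1-t), t^2
```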

Efficient Product
Because of the Kronecker structure, multiplying a vector by the transition matrix can be done one 2 × 2 factor at a time, without ever forming the full matrix. Each such step costs 2^{2n} multiplications on a vector of size 2^{2n}; continuing recursively, at most 2n times, yields a complexity of O(2n · 2^{2n}), far less than the O(2^{4n}) needed for regular matrix-vector multiplication. With n = 10 non-founders, we drop from the non-feasible region to the feasible one.
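A sketch of the fast product: the vector is reshaped so that one selector bit is exposed at a time and the 2 × 2 factor is applied along that axis; the final check compares against the explicit Kronecker matrix for a small case.

```python
import numpy as np

def fast_transition_multiply(v, theta, n_meioses):
    M = np.array([[1 - theta, theta], [theta, 1 - theta]])
    for axis in range(n_meioses):
        v = v.reshape(2 ** axis, 2, -1)       # isolate one selector bit
        v = np.einsum("ab,ibj->iaj", M, v)    # apply the 2x2 factor there
        v = v.reshape(-1)
    return v

# Agrees with the explicit matrix for two meioses.
theta, n = 0.1, 2
v = np.arange(2 ** n, dtype=float)
M = np.array([[1 - theta, theta], [theta, 1 - theta]])
T = np.kron(M, M)
print(np.allclose(fast_transition_multiply(v, theta, n), T @ v))  # True
```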

Probability of data in one locus given an inheritance vector
(Figure: the locus-2 network.)
P(x21, x22, x23 | s23m, s23f) =
Σ_{l21m, l21f, l22m, l22f, l23m, l23f} P(l21m) P(l21f) P(l22m) P(l22f) P(x21 | l21m, l21f) P(x22 | l22m, l22f) P(x23 | l23m, l23f) P(l23m | l21m, l21f, s23m) P(l23f | l22m, l22f, s23f)
The five last terms are always zero-or-one, namely, indicator functions.

Efficient computation
(Figure: the locus-2 network with S23m = 1, S23f = 0 and X23 = {A1, A2}.)
Assume only individual 3 is genotyped. For the inheritance vector (s23m, s23f) = (1, 0), the founder alleles L21m and L22f are not restricted by the data, while (L21f, L22m) has only two possible joint assignments, (A1, A2) or (A2, A1):
p(x21, x22, x23 | s23m = 1, s23f = 0) = p(A1) p(A2) + p(A2) p(A1)
In general, every inheritance vector defines a subgraph of the Bayesian network above, from which we build a founder graph, as follows.
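A brute-force sketch that reproduces this example by enumerating the founder alleles; the frequencies are hypothetical:

```python
from itertools import product

freq = {"A1": 0.4, "A2": 0.6}   # hypothetical founder-allele frequencies

def p_locus_given_selectors(x, s23m, s23f):
    """Brute-force P(x21, x22, x23 | s23m, s23f): sum founder-allele
    frequencies over all assignments consistent with the genotypes x
    (x[person] is None when that person is untyped)."""
    total = 0.0
    for l21m, l21f, l22m, l22f in product(freq, repeat=4):
        l23m = l21m if s23m == 0 else l21f      # transmitted maternal allele
        l23f = l22m if s23f == 0 else l22f      # transmitted paternal allele
        ok = all(x[p] is None or set(x[p]) == geno
                 for p, geno in ((1, {l21m, l21f}),
                                 (2, {l22m, l22f}),
                                 (3, {l23m, l23f})))
        if ok:
            total += freq[l21m] * freq[l21f] * freq[l22m] * freq[l22f]
    return total

x = {1: None, 2: None, 3: ("A1", "A2")}         # only individual 3 typed
print(p_locus_given_selectors(x, 1, 0))         # p(A1)p(A2) + p(A2)p(A1) = 0.48
```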

Efficient computation (cont.)
In general, every inheritance vector defines a subgraph (the black lines in the figure). Construct a founder graph whose vertices are the founder variables and where there is an edge between two vertices if they have a common typed descendant; the label of an edge is the constraint dictated by that common typed descendant. Now find all consistent assignments for every connected component.

A Larger Example
(Figure: a descent graph over individuals 1–8 with typed genotypes a,b; a,c; b,d, and the resulting founder graph, an example of a constraint satisfaction graph, with edge labels {a,b}, {a,b}, {a,b}, {b,d}, {a,c}.)
Connect two nodes if they have a common typed descendant. The constraint {a,b} means the relation {(a,b), (b,a)}.

The Constraint Satisfaction Problem
The number of possible consistent alleles per non-isolated node is 0, 1, or 2, namely the size of the intersection of the labels on its adjacent edges. For example, node 2 retains all possible alleles, node 6 can only be b because its value must lie in both {a,b} and {b,d}, and node 3 can be assigned either a or b.
For each non-singleton connected component: start with an arbitrary node and pick one of its values; this dictates all other values in the component. Repeat with the other value if there is one. So each non-singleton component yields at most two solutions, as in the sketch below.
What is the special constraint problem here?
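A sketch of this propagation. The edge list below is a hypothetical three-node path with both edges labelled {a,b} (the exact topology of the slide's figure is not recoverable from the transcript), chosen so that its two solutions match the component term p(a)p(b)p(a) + p(b)p(a)p(b) used on the next slide:

```python
def propagate(edges, start, value):
    """Assign 'value' to 'start' and propagate: an edge labelled (u, v)
    forces its two endpoints to carry u and v in some order."""
    assignment, stack = {start: value}, [start]
    while stack:
        node = stack.pop()
        for a, b, label in edges:
            if node not in (a, b):
                continue
            if assignment[node] not in label:
                return None                   # contradiction: no solution
            other = b if node == a else a
            forced = label[1] if assignment[node] == label[0] else label[0]
            if assignment.get(other, forced) != forced:
                return None                   # contradiction: no solution
            if other not in assignment:
                assignment[other] = forced
                stack.append(other)
    return assignment

edges = [(3, 5, ("a", "b")), (5, 4, ("a", "b"))]   # hypothetical component
for v in ("a", "b"):                               # at most two solutions
    print(propagate(edges, 3, v))  # {3:'a',5:'b',4:'a'}, then {3:'b',5:'a',4:'b'}
```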

Solution of the CSP
Since each non-singleton component yields at most two solutions, the likelihood is simply a product of sums, each of at most two terms: every component contributes one sum, and singleton components contribute the term 1. In our example:
1 · [ p(a) p(b) p(a) + p(b) p(a) p(b) ] · p(d) p(b) p(a) p(c)
Complexity: building the founder graph takes O(f^2 + n) time. And while solving general CSPs is NP-hard, this one is easy: it is graph coloring where the domains are often of size 2.
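Assembling the example's likelihood, with hypothetical allele frequencies standing in for p(a), …, p(d):

```python
p = {"a": 0.3, "b": 0.3, "c": 0.2, "d": 0.2}   # hypothetical frequencies
two_solution_component = p["a"]*p["b"]*p["a"] + p["b"]*p["a"]*p["b"]
forced_component = p["d"] * p["b"] * p["a"] * p["c"]
likelihood = 1 * two_solution_component * forced_component  # singletons give 1
print(likelihood)
```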