Basic Model For Genetic Linkage Analysis Lecture #3


1 Basic Model For Genetic Linkage Analysis Lecture #3
. Prepared by Dan Geiger

5 Using the Maximum Likelihood Approach
The probability of the pedigree data, Pr(data | θ), is a function of the known and unknown recombination fractions, denoted collectively by θ. How can we construct this likelihood function? The maximum likelihood approach is to seek the value of θ that maximizes the likelihood function Pr(data | θ). This value is the ML estimate.
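To make the idea of maximizing Pr(data | θ) concrete, here is a minimal sketch under a strong simplifying assumption that is not part of the lecture: the data are phase-known, so out of n informative meioses exactly k are recombinant and the likelihood is θ^k (1 − θ)^(n − k). The counts and the grid resolution are hypothetical.

```python
import math

def log_likelihood(theta, recombinants, meioses):
    """Log of theta^k * (1 - theta)^(n - k) for k recombinants in n meioses."""
    k, n = recombinants, meioses
    if theta <= 0.0:
        # theta = 0 is only compatible with observing no recombinants
        return 0.0 if k == 0 else -math.inf
    return k * math.log(theta) + (n - k) * math.log(1.0 - theta)

def ml_estimate(recombinants, meioses, grid_size=501):
    """Grid search for the maximizing theta on the interval [0, 0.5]."""
    grid = [0.5 * i / (grid_size - 1) for i in range(grid_size)]
    return max(grid, key=lambda t: log_likelihood(t, recombinants, meioses))

# Hypothetical toy data: 2 recombinants among 10 meioses.
theta_hat = ml_estimate(recombinants=2, meioses=10)
```

A grid search is used only because it is easy to read; for real pedigrees the likelihood has to be built from the model developed on the following slides.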

6 Constructing the Likelihood function
First, we need to determine the variables that describe the problem. There are many possible choices. Some variables we can observe and some we cannot.
Lijm = Maternal allele at locus i of person j. The values of this variable are the possible alleles li at locus i.
Lijf = Paternal allele at locus i of person j. The values of this variable are the possible alleles li at locus i (same as for Lijm).
Xij = Unordered allele pair at locus i of person j. The values are pairs of ith-locus alleles (li, l'i).
As a starting point, we assume that the data consists of an assignment to a subset of the variables {Xij}. In other words, some (or all) persons are genotyped at some (or all) loci.

7 What are the relationships among the variables for a specific individual?
[Figure: L11m and L11f both point to X11.]
L11m = maternal allele at locus 1 of person 1.
L11f = paternal allele at locus 1 of person 1.
X11 = unordered allele pair at locus 1 of person 1 (the data).
P(L11m = a) is the frequency of allele a. We use lowercase letters for states and write, in short, P(l11m).
P(x11 | l11m, l11f) = 0 or 1, depending on consistency.

8 What are the relationships among the variables across individuals?
[Figure: mother (L11m, L11f, X11), father (L12m, L12f, X12) and offspring (L13m, L13f, X13) at locus 1.]
P(l13m | l11m, l11f) = 1/2 if l13m = l11m or l13m = l11f (and 1 if both hold, i.e. the mother is homozygous)
P(l13m | l11m, l11f) = 0 otherwise
First attempt: correct but not efficient, as we shall see.

9 Probabilistic model for two loci
[Figure: model for locus 1 (nodes L11m, L11f, X11, L12m, L12f, X12, L13m, L13f, X13) and model for locus 2 (nodes L21m, L21f, X21, L22m, L22f, X22, L23m, L23f, X23).]
L23m depends on whether L13m got its value from L11m or from L11f, on whether a recombination occurred, and on the values of L21m and L21f. This is quite complex.

10 Adding a selector variable
[Figure: L11m, L11f and S13m point to L13m; L11m and L11f point to X11.]
S13m = selector of the maternal allele at locus 1 of person 3, with P(s13m) = 1/2.
L13m = maternal allele at locus 1 of person 3 (the offspring).
Selector variables Sijm are 0 or 1 depending on which of the mother's two alleles at locus i is transmitted to person j.
P(l13m | l11m, l11f, S13m = 0) = 1 if l13m = l11m
P(l13m | l11m, l11f, S13m = 1) = 1 if l13m = l11f
P(l13m | l11m, l11f, s13m) = 0 otherwise
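The selector table can be written down directly. The sketch below is illustrative only (function and argument names are hypothetical): it encodes the indicator P(l13m | l11m, l11f, s13m) and checks that summing the selector out with P(s13m) = 1/2 gives back the selector-free transmission probability of the previous slide.

```python
from fractions import Fraction

def p_child_allele(child, mother_m, mother_f, selector):
    """P(l13m = child | l11m = mother_m, l11f = mother_f, s13m = selector).

    Selector 0 copies the mother's maternal allele, selector 1 her paternal one,
    so the table is an indicator (0 or 1).
    """
    transmitted = mother_m if selector == 0 else mother_f
    return 1 if child == transmitted else 0

def p_child_allele_marginal(child, mother_m, mother_f):
    """Sum the selector out using P(s13m) = 1/2."""
    return sum(Fraction(1, 2) * p_child_allele(child, mother_m, mother_f, s)
               for s in (0, 1))
```

For a heterozygous mother the marginal is 1/2 for each of her alleles; for a homozygous mother it is 1 for her allele.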

11 Probabilistic model for two loci
[Figure: model for locus 1 (L11m, L11f, X11, L12m, L12f, X12, S13m, S13f, L13m, L13f, X13) and model for locus 2 (L21m, L21f, X21, L22m, L22f, X22, S23m, S23f, L23m, L23f, X23), each with selector variables added.]

12 Probabilistic Model for Recombination
[Figure: the models for locus 1 and locus 2 as on the previous slide, now linked through the selector variables (S13m to S23m, S13f to S23f).]
θ is the recombination fraction between loci 1 and 2.

13 Constructing the likelihood function I
[Figure: L11m, L11f, X11, S13m, L13m.] X11 is the observed variable. All other variables are not observed (hidden).
Joint probability:
P(l11m, l11f, x11, s13m, l13m) = P(l11m) P(l11f) P(x11 | l11m, l11f) P(s13m) P(l13m | s13m, l11m, l11f)
Probability of data (sum over all states of all hidden variables):
Prob(data) = P(x11) = Σ_{l11m} Σ_{l11f} Σ_{s13m} Σ_{l13m} P(l11m, l11f, x11, s13m, l13m)
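The sum over hidden variables on this slide can be carried out by brute-force enumeration. The following sketch assumes a hypothetical two-allele locus with made-up frequencies; the helper names are illustrative.

```python
from itertools import product

FREQ = {"A": 0.7, "a": 0.3}  # hypothetical allele frequencies

def p_genotype(x, lm, lf):
    """P(x | lm, lf): 1 if the unordered pair x matches the two alleles, else 0."""
    return 1.0 if sorted(x) == sorted((lm, lf)) else 0.0

def p_transmit(child, lm, lf, selector):
    """P(l13m | s13m, l11m, l11f): selector 0 copies lm, selector 1 copies lf."""
    return 1.0 if child == (lm if selector == 0 else lf) else 0.0

def prob_data(x11):
    """P(x11), summing the joint over l11m, l11f, s13m and l13m."""
    total = 0.0
    for l11m, l11f, s13m, l13m in product(FREQ, FREQ, (0, 1), FREQ):
        total += (FREQ[l11m] * FREQ[l11f] * p_genotype(x11, l11m, l11f)
                  * 0.5 * p_transmit(l13m, l11m, l11f, s13m))
    return total
```

Because nothing below X11 is observed here, the sums over s13m and l13m contribute a factor of 1, and prob_data(("A", "a")) reduces to 2 · p(A) · p(a).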

14 Constructing the likelihood function II
P(l11m, l11f, x11, l12m, l12f, x12, l13m, l13f, x13, l21m, l21f, x21, l22m, l22f, x22, l23m, l23f, x23, s13m, s13f, s23m, s23f | θ)
= product over all local probability tables
= P(l11m) P(l11f) P(x11 | l11m, l11f) … P(s13m) P(s13f) P(s23m | s13m, θ) P(s23f | s13f, θ)
Probability of data (sum over all states of all hidden variables):
Prob(data | θ) = P(x11, x12, x13, x21, x22, x23) = Σ_{l11m, l11f, …, s23f} [P(l11m) P(l11f) P(x11 | l11m, l11f) … P(s13m) P(s13f) P(s23m | s13m, θ) P(s23f | s13f, θ)]
The result is a function of the recombination fraction. The ML estimate is the θ value that maximizes this function.
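As a sanity check of this construction, the two-locus likelihood of the mother, father, offspring trio can be evaluated by direct enumeration. This is a sketch under stated assumptions: allele frequencies are hypothetical and shared by both loci, the child's alleles are filled in from the founders and selectors (which skips the zero terms of the indicator tables), and P(s23t | s13t, θ) is taken to be 1 − θ when the two selectors agree and θ otherwise.

```python
from itertools import product

FREQ = {"1": 0.6, "2": 0.4}  # hypothetical allele frequencies, same at both loci

def consistent(x, lm, lf):
    """P(x | lm, lf) as a boolean; an untyped person (x is None) is always consistent."""
    return x is None or sorted(x) == sorted((lm, lf))

def p_selector(s, s_prev, theta):
    """Assumed P(s at locus 2 | s at locus 1, theta): recombination flips the selector."""
    return 1.0 - theta if s == s_prev else theta

def likelihood(theta, x):
    """Prob(data | theta); x[(locus, person)] is an unordered pair or None.

    Persons: 1 = mother, 2 = father, 3 = offspring. Each founder pair is (maternal, paternal).
    """
    total = 0.0
    for f in product(FREQ, repeat=8):
        l = {(1, 1): f[0:2], (1, 2): f[2:4], (2, 1): f[4:6], (2, 2): f[6:8]}
        if not all(consistent(x[key], *l[key]) for key in l):
            continue
        prior = 1.0
        for allele in f:
            prior *= FREQ[allele]
        for s13m, s13f, s23m, s23f in product((0, 1), repeat=4):
            child1 = (l[(1, 1)][s13m], l[(1, 2)][s13f])  # (l13m, l13f)
            child2 = (l[(2, 1)][s23m], l[(2, 2)][s23f])  # (l23m, l23f)
            if consistent(x[(1, 3)], *child1) and consistent(x[(2, 3)], *child2):
                total += (prior * 0.25 * p_selector(s23m, s13m, theta)
                          * p_selector(s23f, s13f, theta))
    return total

# Hypothetical fully typed trio.
DATA = {(1, 1): ("1", "2"), (1, 2): ("1", "1"), (1, 3): ("1", "1"),
        (2, 1): ("1", "2"), (2, 2): ("2", "2"), (2, 3): ("1", "2")}
```

One caveat worth noting: for a single offspring with parents of unknown phase, this function comes out flat in θ, so such a trio alone carries no linkage information. Larger pedigrees are what make the maximization meaningful.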

15 The Disease Locus I
[Figure: L11m, L11f, X11, S13m, L13m, with the phenotype node Y11 attached to X11.]
Phenotype variables Yij are 0 or 1 depending on whether a phenotypic trait associated with locus i of person j is observed, e.g., sick versus healthy. For example, a model of a perfectly recessive disease yields the penetrance probabilities:
P(y11 = sick | X11 = (a,a)) = 1
P(y11 = sick | X11 = (A,a)) = 0
P(y11 = sick | X11 = (A,A)) = 0
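The penetrance table of the perfectly recessive model is small enough to spell out as a function. This is an illustrative sketch; the function name and the default disease-allele label are assumptions.

```python
def penetrance_recessive(genotype, disease_allele="a"):
    """P(y = sick | x = genotype) for a fully penetrant recessive disease.

    genotype is an unordered pair of alleles; only the homozygous
    disease genotype is affected.
    """
    return 1.0 if all(allele == disease_allele for allele in genotype) else 0.0
```

Softer models would replace the 0 and 1 entries with probabilities strictly between them, which this lecture's example does not do.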

16 The Disease Locus II
[Figure: L11m, L11f, X11, S13m, L13m, Y11, as on the previous slide.]
Note that in this model we assume the phenotype/disease depends only on the alleles of one locus. Also, we did not model levels of sickness.

17 Introducing a tentative disease Locus
[Figure: marker locus (L11m, L11f, X11, L12m, L12f, X12, S13m, S13f, L13m, L13f, X13) linked to a disease locus (L21m, L21f, X21, L22m, L22f, X22, S23m, S23f, L23m, L23f, X23) with phenotype nodes Y21, Y22, Y23.]
Disease locus: assume sick means xij = (a,a).
The recombination fraction θ is unknown. Finding it can help determine whether a gene causing the disease lies in the vicinity of the marker locus.

18 Locus-by-Locus Summation order
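[Figure: the network for locus i (Li1m, Li1f, Xi1, Li2m, Li2f, Xi2, Si3m, Si3f, Li3m, Li3f, Xi3), with the summation order over loci numbered 1 to 4.]
Sum over the locus i variables before summing over the locus i+1 variables. Within a locus, sum over the allele variables Lijt (drawn in orange) before summing over the selector variables Sijt. This order yields a Hidden Markov Model (HMM).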

19 Hidden Markov Models in General
[Figure: hidden chain S1, S2, S3, …, Si-1, Si, Si+1, each state Si emitting an observed Ri.]
This depicts the factorization P(s1, …, sm, r1, …, rm) = P(s1) P(r1 | s1) Π_{i=2..m} P(si | si-1) P(ri | si).
Application in communication: the message sent is (s1, …, sm) but we receive (r1, …, rm). What is the most likely message sent?
Application in speech recognition: the word said is (s1, …, sm) but we recorded (r1, …, rm). What is the most likely word said?
Application in genetic linkage analysis: to be discussed now.
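For linkage analysis the quantity of interest is the probability of the observations, which for an HMM is obtained with the standard forward recursion. The sketch below is a generic illustration with hypothetical parameter names, not code from the lecture; states and observations are assumed to be small integers used as list indices.

```python
def forward_likelihood(initial, transition, emission, observations):
    """P(r_1, ..., r_m) for a hidden Markov model.

    initial[s]       = P(S_1 = s)
    transition[s][t] = P(S_{i+1} = t | S_i = s)
    emission[s][r]   = P(R_i = r | S_i = s)
    """
    states = range(len(initial))
    # alpha[s] = P(r_1, ..., r_i, S_i = s), updated one position at a time
    alpha = [initial[s] * emission[s][observations[0]] for s in states]
    for r in observations[1:]:
        alpha = [emission[t][r] * sum(alpha[s] * transition[s][t] for s in states)
                 for t in states]
    return sum(alpha)
```

The cost is proportional to m times the square of the number of hidden states, instead of the exponential cost of enumerating every hidden sequence.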

20 Hidden Markov Model In our case
[Figure: hidden chain S1, S2, S3, …, Si-1, Si, Si+1 with observed nodes X1, X2, X3, …, Xi-1, Xi, Xi+1; at the disease locus the observed node is Yi.]
The compound variable Si = (Si,1,m, Si,1,f, …, Si,n,m, Si,n,f) is called the inheritance vector. It has 2^(2n) states, where n is the number of persons that have parents in the pedigree (non-founders). The compound variable Xi collects the data regarding locus i. Similarly, for the disease locus we use Yi. To specify the HMM we need to write down the transition matrices from Si-1 to Si and the matrices P(xi | Si). Note that these quantities have already been implicitly defined.

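21 The transition matrix
Recall that each selector keeps its value from one locus to the next with probability 1 − θ and flips with probability θ, i.e. P(s23t | s13t, θ) is given by the 2 × 2 matrix T(θ) = [[1 − θ, θ], [θ, 1 − θ]] for t = m, f. Note that θ depends on the locus index i, but this dependence is omitted. In our example, where we have one non-founder (n = 1), the transition probability table has size 4 × 4 = 2^(2n) × 2^(2n), encoding the four options of recombination/non-recombination for the two parental meioses: it is T(θ) ⊗ T(θ) (the Kronecker product). For n non-founders, the transition matrix is the n-fold Kronecker product of this 4 × 4 matrix, or equivalently the 2n-fold Kronecker product of T(θ).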
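The Kronecker construction can be sketched in a few lines of plain Python. The assumption made here, following the definition of the recombination fraction, is that a single meiosis has the 2 × 2 transition matrix with 1 − θ on the diagonal and θ off the diagonal; function names are illustrative.

```python
def kron(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    return [[x * y for x in row_a for y in row_b] for row_a in a for row_b in b]

def transition_matrix(theta, n_nonfounders):
    """Transition matrix between inheritance vectors at adjacent loci.

    Built as the 2n-fold Kronecker product of the single-meiosis matrix,
    giving a matrix of size 2^(2n) x 2^(2n).
    """
    single = [[1.0 - theta, theta], [theta, 1.0 - theta]]
    result = [[1.0]]
    for _ in range(2 * n_nonfounders):
        result = kron(result, single)
    return result

# One non-founder: a 4 x 4 table over (no recombination / recombination) in each parent.
T = transition_matrix(0.1, 1)
```

With θ = 0.1 the first row of T is approximately (0.81, 0.09, 0.09, 0.01): no recombination in either meiosis, recombination in exactly one, and recombination in both.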

22 Efficient Product
So, if we start with a vector of size 2^(2n), we need about 2^(2n) multiplications once the results for matrix A are in hand. Continuing recursively, at most 2n times, yields a complexity of O(2n · 2^(2n)), far less than the O(2^(4n)) needed for regular multiplication. With n = 10 non-founders, we drop from the infeasible region to a feasible one.
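One way to realize this saving is to apply the 2 × 2 recombination matrix to one selector bit at a time, never forming the full matrix. The sketch below is an iterative variant of the recursive idea on the slide, written under the assumption that inheritance vectors are indexed by the integers 0 … 2^(2n) − 1 with one bit per selector; names are illustrative.

```python
def apply_transition(v, theta):
    """Multiply a vector over inheritance vectors by the 2n-fold Kronecker power
    of [[1 - theta, theta], [theta, 1 - theta]] without building the matrix.

    len(v) must be a power of two, 2^(2n). Each pass handles one selector bit
    and costs O(len(v)), so the total cost is O(2n * 2^(2n)).
    """
    size = len(v)
    out = list(v)
    bit = 1
    while bit < size:
        # Mix each state with its neighbor that differs only in this selector bit.
        out = [(1.0 - theta) * out[s] + theta * out[s ^ bit] for s in range(size)]
        bit <<= 1
    return out
```

Since the single-meiosis matrix is symmetric, multiplying from the left or from the right gives the same result, so the same routine serves both directions of an HMM recursion.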

23 Probability of data in one locus given an inheritance vector
[Figure: model for locus 2 (L21m, L21f, X21, L22m, L22f, X22, S23m, S23f, L23m, L23f, X23).]
P(x21, x22, x23 | s23m, s23f) = Σ_{l21m, l21f, l22m, l22f, l23m, l23f} P(l21m) P(l21f) P(l22m) P(l22f) P(x21 | l21m, l21f) P(x22 | l22m, l22f) P(x23 | l23m, l23f) P(l23m | l21m, l21f, s23m) P(l23f | l22m, l22f, s23f)
The last five terms are always zero or one, namely, indicator functions.
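Because the last five terms are indicators, the sum can be evaluated by enumerating only the founder alleles and checking consistency. The following is a minimal sketch for the three-person pedigree; the function name, the allele labels and the frequencies in the usage note are hypothetical.

```python
from itertools import product

def prob_locus_given_selectors(x, freq, s_m, s_f):
    """P(x21, x22, x23 | s23m = s_m, s23f = s_f) by enumerating founder alleles.

    x = (x_mother, x_father, x_child); each entry is an unordered allele pair,
    or None if that person is not genotyped. freq maps allele -> frequency.
    """
    def ok(observed, lm, lf):
        # Indicator P(x | lm, lf); untyped persons impose no constraint.
        return observed is None or sorted(observed) == sorted((lm, lf))

    total = 0.0
    for l21m, l21f, l22m, l22f in product(freq, repeat=4):
        l23m = (l21m, l21f)[s_m]  # selector 0 copies the maternal allele, 1 the paternal
        l23f = (l22m, l22f)[s_f]
        if ok(x[0], l21m, l21f) and ok(x[1], l22m, l22f) and ok(x[2], l23m, l23f):
            total += freq[l21m] * freq[l21f] * freq[l22m] * freq[l22f]
    return total
```

For instance, with only the offspring typed as {A1, A2} and selectors (1, 0), the result is p(A1)p(A2) + p(A2)p(A1), since the two untransmitted founder alleles sum out to 1.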

24 Efficient computation
[Figure: model for locus 2 with S23m = 1, S23f = 0 and X23 = {A1, A2}.]
Assume only individual 3 is genotyped. For the inheritance vector (s23m, s23f) = (1, 0), the founder alleles L21m and L22f are not restricted by the data, while (L21f, L22m) have only two possible joint assignments, (A1, A2) or (A2, A1):
p(x21, x22, x23 | s23m = 1, s23f = 0) = p(A1)p(A2) + p(A2)p(A1)
In general, every inheritance vector defines a subgraph of the Bayesian network above. We build a founder graph.

25 Efficient computation
[Figure: model for locus 2 with S23m = 1, S23f = 0 and X23 = {A1, A2}, the subgraph selected by the inheritance vector drawn in black; below it, the founder graph on L21m, L21f, L22m, L22f with one edge labeled {A1, A2}.]
In general, every inheritance vector defines a subgraph, as indicated by the black lines above. Construct a founder graph whose vertices are the founder variables and where there is an edge between two vertices if they have a common typed descendant. The label of an edge is the constraint dictated by the common typed descendant. Now find all consistent assignments for every connected component.

26 A Larger Example
[Figure: descent graph over founder alleles 1 to 8 with typed descendants {a,b}, {a,c} and {b,d}, and the corresponding founder graph (an example of a constraint satisfaction graph) with edges labeled {a,b}, {b,d} and {a,c}.]
Connect two nodes if they have a common typed descendant. The constraint {a,b} means the relation {(a,b), (b,a)}.

27 The Constraint Satisfaction Problem
[Figure: founder graph on nodes 1 to 8 with edges labeled {a,b}, {b,d} and {a,c}.]
The number of possible consistent alleles per non-isolated node is 0, 1 or 2, namely the size of the intersection of its adjacent edge labels. For example, node 2 can take any allele, node 6 can only be b because its value must lie in both {a,b} and {b,d}, and node 3 can be assigned either a or b.
For each non-singleton connected component: start with an arbitrary node and pick one of its values. This dictates all other values in the component. Repeat with the other value if it has one. So each non-singleton component yields at most two solutions.
What is special about this constraint problem?

28 Solution of the CSP
[Figure: founder graph on nodes 1 to 8 with edges labeled {a,b}, {b,d} and {a,c}.]
Since each non-singleton component yields at most two solutions, the likelihood is simply a product of sums, each of at most two terms. Each component contributes one factor. Singleton components contribute the factor 1. In our example: 1 * [p(a)p(b)p(a) + p(b)p(a)p(b)] * p(d)p(b)p(a)p(c).
Complexity. Building the founder graph: O(f^2 + n). Solving general CSPs is NP-hard, but this one is a special case: a graph coloring problem in which domains have size at most 2, so fixing one node determines its whole component.
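The propagation procedure and the product of sums can be sketched as follows. The exact edges of the slide's founder graph are not recoverable from the transcript, so the graph in the example is hypothetical, chosen only so that it reproduces the expression above; allele frequencies and all names are likewise illustrative, and self-loops (inbreeding) are not handled.

```python
from collections import defaultdict

def propagate(start, value, adj):
    """Fix start = value and push the choice along the edges.

    Returns the forced assignment of the whole component, or None if inconsistent.
    """
    assignment = {start: value}
    stack = [start]
    while stack:
        u = stack.pop()
        for v, (x, y) in adj[u]:
            if assignment[u] == x:
                forced = y
            elif assignment[u] == y:
                forced = x
            else:
                return None  # u's allele is not allowed by this edge label
            if v not in assignment:
                assignment[v] = forced
                stack.append(v)
            elif assignment[v] != forced:
                return None
    return assignment

def founder_graph_likelihood(founders, edges, freq):
    """Product over components of the sum over their (at most two) solutions."""
    adj = defaultdict(list)
    for u, v, label in edges:
        adj[u].append((v, label))
        adj[v].append((u, label))
    likelihood, seen = 1.0, set()
    for start in founders:
        if start in seen or not adj[start]:
            continue  # already covered, or a singleton contributing the factor 1
        component_sum = 0.0
        for value in sorted(set(adj[start][0][1])):
            assignment = propagate(start, value, adj)
            if assignment is not None:
                seen.update(assignment)
                term = 1.0
                for allele in assignment.values():
                    term *= freq[allele]
                component_sum += term
        if component_sum == 0.0:
            return 0.0  # no consistent assignment for this component
        likelihood *= component_sum
    return likelihood

# Hypothetical graph: node 2 isolated, a path 3-5-4 and a path 1-6-7-8.
EDGES = [(3, 5, ("a", "b")), (5, 4, ("a", "b")),
         (1, 6, ("b", "d")), (6, 7, ("a", "b")), (7, 8, ("a", "c"))]
FREQ = {"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}
result = founder_graph_likelihood(range(1, 9), EDGES, FREQ)
```

Each component is visited at most twice, once per candidate value of its starting node, which is why the work after building the graph stays small.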
