Basic Model For Genetic Linkage Analysis. Prepared by Dan Geiger.


1 Basic Model For Genetic Linkage Analysis. Prepared by Dan Geiger

2 Using the Maximum Likelihood Approach
The probability of the pedigree data, Pr(data | θ), is a function of the known and unknown recombination fractions, denoted collectively by θ. How can we construct this likelihood function? The maximum likelihood approach is to seek the value of θ which maximizes the likelihood function Pr(data | θ). This is the ML estimate.
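Stated compactly (standard maximum-likelihood notation; the symbol for the estimate is ours, not from the slide):

```latex
\hat{\theta} \;=\; \arg\max_{\theta} \; \Pr(\text{data} \mid \theta)
```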

3 Constructing the Likelihood function
First, we need to determine the variables that describe the problem. There are many possible choices; some variables we can observe and some we cannot.
L_ijm = maternal allele at locus i of person j. The values of this variable are the possible alleles l_i at locus i.
L_ijf = paternal allele at locus i of person j. The values of this variable are the possible alleles l_i at locus i (same as for L_ijm).
X_ij = unordered allele pair at locus i of person j. The values are pairs of locus-i alleles (l_i, l'_i).
As a starting point, we assume that the data consist of an assignment to a subset of the variables {X_ij}. In other words, some (or all) persons are genotyped at some (or all) loci.

4 What are the relationships among the variables for a specific individual?
[Network fragment: L_11m (maternal allele at locus 1 of person 1) and L_11f (paternal allele at locus 1 of person 1) are the parents of X_11 (unordered allele pair at locus 1 of person 1 = data).]
P(L_11m = a) is the frequency of allele a. We use lower-case letters for states, writing, in short, P(l_11m).
P(x_11 | l_11m, l_11f) = 0 or 1, depending on consistency.
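A minimal Python sketch of these two local probability tables; the allele names and frequencies below are made-up placeholders, not taken from the slides:

```python
# Hypothetical allele frequencies for the founder prior P(L_11m) (and P(L_11f)).
allele_freq = {"A1": 0.6, "A2": 0.4}

def p_founder_allele(a):
    """P(L_ijm = a) for a founder: the population frequency of allele a."""
    return allele_freq[a]

def p_genotype_given_alleles(x, l_m, l_f):
    """P(X_ij = x | L_ijm = l_m, L_ijf = l_f): 1 if the unordered pair x is
    consistent with the ordered allele pair (l_m, l_f), else 0."""
    return 1.0 if tuple(sorted(x)) == tuple(sorted((l_m, l_f))) else 0.0
```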

5 What are the relationships among the variables across individuals?
[Network fragment: the mother's alleles L_11m, L_11f (genotype X_11), the father's alleles L_12m, L_12f (genotype X_12), and the offspring's alleles L_13m, L_13f (genotype X_13); each offspring allele has the corresponding parent's two alleles as its parents in the network.]
P(l_13m | l_11m, l_11f) = 1/2 if l_13m = l_11m or l_13m = l_11f; P(l_13m | l_11m, l_11f) = 0 otherwise.
First attempt: correct, but not efficient, as we shall see.

6 Probabilistic model for two loci
[Two copies of the locus network: the model for locus 1 (variables L_11m, L_11f, L_12m, L_12f, L_13m, L_13f, X_11, X_12, X_13) and the model for locus 2 (variables L_21m, L_21f, L_22m, L_22f, L_23m, L_23f, X_21, X_22, X_23).]
L_23m depends on whether L_13m got its value from L_11m or from L_11f, on whether a recombination occurred, and on the values of L_21m and L_21f. This is quite complex.

7 Adding a selector variable
[Network fragment: L_11m, L_11f and the selector S_13m (selector of the maternal allele at locus 1 of person 3) are the parents of L_13m (maternal allele at locus 1 of person 3, the offspring).]
Selector variables S_ijm are 0 or 1, depending on whose allele is transmitted as the maternal allele of offspring j at locus i.
P(s_13m) = 1/2
P(l_13m | l_11m, l_11f, S_13m = 0) = 1 if l_13m = l_11m
P(l_13m | l_11m, l_11f, S_13m = 1) = 1 if l_13m = l_11f
P(l_13m | l_11m, l_11f, s_13m) = 0 otherwise
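The selector turns transmission into a deterministic table; a small sketch (function names are illustrative):

```python
def p_selector(s):
    """P(S_ijm = s): each of the two parental alleles is transmitted with probability 1/2."""
    return 0.5 if s in (0, 1) else 0.0

def p_child_allele(l_child, l_parent_m, l_parent_f, s):
    """P(L_ijm = l_child | parent's alleles, selector s): deterministic given s.
    s = 0 selects the parent's maternal allele, s = 1 the paternal allele."""
    chosen = l_parent_m if s == 0 else l_parent_f
    return 1.0 if l_child == chosen else 0.0
```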

8 Probabilistic model for two loci
[The same two-locus network as before, now with selector variables: locus 1 adds S_13m and S_13f, locus 2 adds S_23m and S_23f.]

9 Probabilistic Model for Recombination
[The two-locus network with edges added from S_13m to S_23m and from S_13f to S_23f; these edges carry the recombination parameter.]
θ is the recombination fraction between loci 1 and 2.
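With the selectors in place, recombination is just a possible flip of the selector between adjacent loci; a one-line sketch of that conditional table (illustrative, not the original code):

```python
def p_selector_transition(s_next, s_prev, theta):
    """P(S_(i+1)jm = s_next | S_ijm = s_prev): a recombination between the two
    loci flips the selector with probability theta, otherwise it is copied."""
    return theta if s_next != s_prev else 1.0 - theta
```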

10 Constructing the likelihood function I
Joint probability:
P(l_11m, l_11f, x_11, s_13m, l_13m) = P(l_11m) P(l_11f) P(x_11 | l_11m, l_11f) P(s_13m) P(l_13m | s_13m, l_11m, l_11f)
Probability of data (sum over all states of all hidden variables):
Prob(data) = P(x_11) = Σ_{l_11m} Σ_{l_11f} Σ_{s_13m} Σ_{l_13m} P(l_11m, l_11f, x_11, s_13m, l_13m)
X_11 is the observed variable; all other variables are not observed (hidden).
[Network fragment: L_11m, L_11f, S_13m, L_13m and the observed X_11, as before.]
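A brute-force version of this sum, reusing the illustrative tables sketched above (the allele set is a placeholder):

```python
from itertools import product

def prob_x11(x11, alleles=("A1", "A2")):
    """P(x_11): sum the joint probability over all states of the hidden
    variables l_11m, l_11f, s_13m and l_13m."""
    total = 0.0
    for l_m, l_f, s, l_child in product(alleles, alleles, (0, 1), alleles):
        total += (p_founder_allele(l_m)
                  * p_founder_allele(l_f)
                  * p_genotype_given_alleles(x11, l_m, l_f)
                  * p_selector(s)
                  * p_child_allele(l_child, l_m, l_f, s))
    return total
```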

11 Constructing the likelihood function II
Joint probability (product over all local probability tables):
P(l_11m, l_11f, x_11, l_12m, l_12f, x_12, l_13m, l_13f, x_13, l_21m, l_21f, x_21, l_22m, l_22f, x_22, l_23m, l_23f, x_23, s_13m, s_13f, s_23m, s_23f | θ) = P(l_11m) P(l_11f) P(x_11 | l_11m, l_11f) … P(s_13m) P(s_13f) P(s_23m | s_13m, θ) P(s_23f | s_13f, θ)
Probability of data (sum over all states of all hidden variables):
Prob(data | θ) = P(x_11, x_12, x_13, x_21, x_22, x_23) = Σ_{l_11m, l_11f, …, s_23f} [ P(l_11m) P(l_11f) P(x_11 | l_11m, l_11f) … P(s_13m) P(s_13f) P(s_23m | s_13m, θ) P(s_23f | s_13f, θ) ]
The result is a function of the recombination fraction. The ML estimate is the θ value that maximizes this function.
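The ML step itself can be as simple as a grid search once Prob(data | θ) can be evaluated; a hedged sketch, where `likelihood` stands for any function that computes the two-locus sum above for a given θ:

```python
import numpy as np

def ml_recombination_fraction(likelihood, grid=np.linspace(0.0, 0.5, 501)):
    """Crude ML estimate: evaluate Prob(data | theta) on a grid over [0, 0.5]
    and return the theta with the highest likelihood."""
    values = [likelihood(theta) for theta in grid]
    return float(grid[int(np.argmax(values))])
```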

12 The Disease Locus I
[Network fragment: the locus-1 variables as before, with a phenotype node Y_11 whose parent is X_11.]
Phenotype variables Y_ij are 0 or 1, depending on whether a phenotypic trait associated with locus i of person j is observed, e.g., sick versus healthy.
For example, a model of a perfect recessive disease yields the penetrance probabilities:
P(y_11 = sick | X_11 = (a,a)) = 1
P(y_11 = sick | X_11 = (A,a)) = 0
P(y_11 = sick | X_11 = (A,A)) = 0
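The slide's penetrance model as a lookup table (a sketch; allele symbols as on the slide):

```python
def p_sick_given_genotype(genotype):
    """Penetrance P(Y_ij = sick | X_ij) for a fully penetrant recessive disease:
    only the homozygous (a, a) genotype is affected."""
    penetrance = {("a", "a"): 1.0, ("A", "a"): 0.0, ("A", "A"): 0.0}
    return penetrance[tuple(sorted(genotype))]
```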

13 The Disease Locus II
[Same network fragment as the previous slide.]
Note that in this model we assume the phenotype/disease depends only on the alleles of one locus. Also, we did not model levels of sickness.

14 Introducing a tentative disease locus
[Two-locus network as before, with phenotype nodes Y_21, Y_22, Y_23 attached to the genotypes at the disease locus; the other locus is the marker locus. Disease locus: assume sick means x_ij = (a,a).]
The recombination fraction θ is unknown. Finding it can help determine whether a gene causing the disease lies in the vicinity of the marker locus.

15 Locus-by-Locus Summation Order
Sum over the locus-i variables before summing over the locus-(i+1) variables. Within each locus, sum over the allele variables (L_ijt) before summing over the selector variables (S_ijt). This order yields a Hidden Markov Model (HMM).

16 Hidden Markov Models in General
Application in communication: the message sent is (s_1, …, s_m) but we receive (r_1, …, r_m). Compute the most likely message sent.
Application in speech recognition: the word said is (s_1, …, s_m) but we recorded (r_1, …, r_m). Compute the most likely word said.
Application in genetic linkage analysis: to be discussed now.
[Figure: an HMM chain of hidden variables with one observation attached to each.]
The graph depicts the factorization:
P(s_1, …, s_m, r_1, …, r_m) = P(s_1) P(r_1 | s_1) ∏_{i=2}^{m} P(s_i | s_{i-1}) P(r_i | s_i)

17 Hidden Markov Model in Our Case
[Figure: the chain of inheritance vectors S_1, …, S_m with the locus data X_i attached as observations, and Y_i at the disease locus.]
The compounded variable S_i = (S_{i,1,m}, S_{i,1,f}, …, S_{i,n,m}, S_{i,n,f}) is called the inheritance vector. It has 2^{2n} states, where n is the number of persons that have parents in the pedigree (non-founders). The compounded variable X_i is the data regarding locus i; similarly, for the disease locus we use Y_i.
To specify the HMM we need to write down the transition matrices from S_{i-1} to S_i and the matrices P(x_i | S_i). Note that these quantities have already been implicitly defined.
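Given the transition matrices and the emission terms P(x_i | S_i), the likelihood is accumulated with the standard HMM forward recursion; a generic sketch (the inputs are assumed to be precomputed, and nothing here is the original implementation):

```python
import numpy as np

def hmm_likelihood(emission_probs, transitions, prior=None):
    """Forward algorithm for the locus HMM.

    emission_probs: list of length m; emission_probs[i][s] = P(x_i | S_i = s)
                    over all 2**(2n) inheritance vectors s.
    transitions:    list of length m-1; transitions[i][t, s] = P(S_{i+1} = t | S_i = s).
    prior:          P(S_1); uniform over inheritance vectors if omitted.
    Returns Prob(data) = sum of the final forward vector.
    """
    num_states = len(emission_probs[0])
    forward = np.full(num_states, 1.0 / num_states) if prior is None else np.asarray(prior)
    forward = forward * emission_probs[0]
    for i, T in enumerate(transitions):
        forward = (T @ forward) * emission_probs[i + 1]
    return float(forward.sum())
```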

18 The transition matrix
Recall that a single meiosis either recombines (probability θ) or does not (probability 1-θ), so the per-meiosis transition matrix is T_θ = [ 1-θ  θ ; θ  1-θ ]. (Note that θ depends on the locus index i, but this dependence is omitted.)
In our example, where we have one non-founder (n = 1), the transition probability table size is 4 × 4 = 2^{2n} × 2^{2n}, encoding the four options of recombination/non-recombination for the two parental meioses: it is the Kronecker product T_θ ⊗ T_θ.
For n non-founders, the transition matrix is the n-fold Kronecker product (T_θ ⊗ T_θ) ⊗ … ⊗ (T_θ ⊗ T_θ), equivalently the 2n-fold Kronecker product of T_θ.
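For small pedigrees the matrix can be built explicitly, e.g. with NumPy's Kronecker product (a sketch; the next slide is precisely about avoiding this explicit construction):

```python
import numpy as np
from functools import reduce

def transition_matrix(theta, n_nonfounders):
    """The 2**(2n) x 2**(2n) inheritance-vector transition matrix, built as the
    2n-fold Kronecker product of the single-meiosis 2x2 matrix."""
    T = np.array([[1 - theta, theta],
                  [theta, 1 - theta]])
    return reduce(np.kron, [T] * (2 * n_nonfounders))
```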

19 Efficient Product
Write the transition matrix as T_θ ⊗ A, where A is the Kronecker product of the remaining 2n-1 single-meiosis matrices. Multiplying a vector of length 2^{2n} by T_θ ⊗ A then reduces to multiplications by A on the two halves of the vector, plus O(2^{2n}) scalar operations to combine the halves using the entries of T_θ. So, if we had the products with matrix A in hand, we would need only about 2^{2n} further multiplications. Continuing recursively, at most 2n times, yields a complexity of O(2n · 2^{2n}), far less than the O(2^{4n}) needed for regular matrix-vector multiplication. With n = 10 non-founders, we drop from the non-feasible region to the feasible one.
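One way to realize this in NumPy is to keep the vector as a tensor with one axis per meiosis and apply the 2 × 2 factor along each axis in turn; a sketch under that assumption (not the original implementation):

```python
import numpy as np

def kron_matvec(theta, n_meioses, v):
    """Multiply v (length 2**n_meioses) by the n_meioses-fold Kronecker product
    of the single-meiosis 2x2 matrix in O(n_meioses * 2**n_meioses) time,
    without ever forming the full transition matrix."""
    T = np.array([[1 - theta, theta],
                  [theta, 1 - theta]])
    w = np.asarray(v, dtype=float).reshape([2] * n_meioses)
    for axis in range(n_meioses):
        w = np.tensordot(T, w, axes=([1], [axis]))  # apply T along one selector axis
        w = np.moveaxis(w, 0, axis)                 # restore the original axis order
    return w.reshape(-1)
```

For one non-founder (two meioses) this agrees with multiplying by the explicit 4 × 4 matrix from the previous sketch.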

20 Probability of data in one marker locus given an inheritance vector
[Model for locus 2, as before.]
P(x_21, x_22, x_23 | s_23m, s_23f) = Σ_{l_21m, l_21f, l_22m, l_22f, l_23m, l_23f} P(l_21m) P(l_21f) P(l_22m) P(l_22f) P(x_21 | l_21m, l_21f) P(x_22 | l_22m, l_22f) P(x_23 | l_23m, l_23f) P(l_23m | l_21m, l_21f, S_23m) P(l_23f | l_22m, l_22f, S_23f)
The last five terms are always zero or one, namely, indicator functions.

21 Efficient computation
[Model for locus 2; individual 3's genotype is X_23 = {A_1, A_2}.]
Assume only individual 3 is genotyped. For the inheritance vector (s_23m, s_23f) = (1, 0), the founder alleles L_21m and L_22f are not restricted by the data, while (L_21f, L_22m) have only two possible joint assignments, (A_1, A_2) or (A_2, A_1):
p(x_21, x_22, x_23 | s_23m = 1, s_23f = 0) = p(A_1) p(A_2) + p(A_2) p(A_1)
In general, every inheritance vector defines a subgraph of the Bayesian network above. We build a founder graph.

22 Efficient computation
[Same locus-2 network; the chosen inheritance vector selects the subgraph indicated by the black lines in the original figure.]
In general, every inheritance vector defines a subgraph. Construct a founder graph whose vertices are the founder variables and where there is an edge between two vertices if they have a common typed descendant. The label of an edge is the constraint dictated by the common typed descendant. Now find all consistent assignments for every connected component.
[Founder graph for the example: vertices L_21m, L_21f, L_22m, L_22f, with an edge labeled {A_1, A_2} between L_21f and L_22m.]

23 A Larger Example
[Figure: a descent graph with typed individuals having genotypes (a,b), (a,c) and (b,d), and the corresponding founder graph with edges labeled {a,b}, {a,c} and {b,d} (an example of a constraint satisfaction graph).]
Connect two nodes if they have a common typed descendant.

24 The Constraint Satisfaction Problem
[Founder graph from the previous slide, with edge labels {a,b}, {a,c}, {b,d}.]
The number of possible consistent alleles per non-isolated node is 0, 1 or 2, namely, the intersection of its adjacent edges' labels. For example, node 2 (which is isolated) has all possible alleles, node 6 can only be b, and node 3 can be assigned either a or b.
For each non-singleton connected component: start with an arbitrary node and pick one of its values; this dictates all other values in the component. Repeat with the other value, if it has one. So each non-singleton component yields at most two solutions.
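A hedged sketch of this component-by-component propagation; the graph representation (a dict from founder-vertex pairs to edge labels) is illustrative:

```python
def component_assignments(nodes, edges):
    """Enumerate the (at most two) consistent allele assignments of one
    non-singleton connected component of the founder graph.

    nodes: the founder vertices of the component.
    edges: dict mapping frozenset({u, v}) -> edge label, i.e. the allele set
           dictated by the common typed descendant (e.g. {'a', 'b'}, or {'a'}
           if that descendant is homozygous).
    """
    def neighbours(u):
        for pair, label in edges.items():
            if u in pair:
                (v,) = pair - {u}
                yield v, label

    start = next(iter(nodes))
    # Candidate values for the start node: intersection of its adjacent edge labels.
    candidates = set.intersection(*(set(label) for _, label in neighbours(start)))

    solutions = []
    for value in candidates:
        assignment, stack, ok = {start: value}, [start], True
        while stack and ok:
            u = stack.pop()
            for v, label in neighbours(u):
                if assignment[u] not in label:
                    ok = False
                    break
                rest = set(label) - {assignment[u]}
                v_value = rest.pop() if rest else assignment[u]  # homozygous edge forces the same allele
                if v not in assignment:
                    assignment[v] = v_value
                    stack.append(v)
                elif assignment[v] != v_value:
                    ok = False
                    break
        if ok:
            solutions.append(assignment)
    return solutions
```

Each returned assignment contributes one product of allele frequencies to that component's sum, as on the next slide.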

25 Solution of the CSP
[Founder graph from the previous slides.]
Since each non-singleton component yields at most two solutions, the likelihood is simply a product of sums, each with at most two terms. Each component contributes one factor; singleton components contribute the factor 1.
In our example: 1 * [ p(a)p(b)p(a) + p(b)p(a)p(b) ] * p(d)p(b)p(a)p(c).
Complexity: building the founder graph takes O(f^2 + n), while solving general CSPs is NP-hard.

26 Summary

27 Road Map For Graphical Models
Foundations:
- Probability theory: subjective versus objective
- Other formalisms for uncertainty (fuzzy, possibilistic, belief functions)
- Types of graphical models: directed, undirected, chain graphs, dynamic networks, factored HMMs, etc.
- Discrete versus continuous distributions
- Causality versus correlation
Inference:
- Exact inference: variable elimination, clique trees, message passing; using internal structure like determinism or zeroes; queries: MLE, MAP, belief update, sensitivity
- Approximate inference: sampling methods; loopy propagation (minimizing some energy function); variational methods

28 Road Map For Graphical Models
Learning:
- Complete data versus incomplete data
- Observed variables versus hidden variables
- Learning parameters versus learning structure
- Scoring methods versus conditional-independence-test methods
- Exact scores versus asymptotic scores
- Search strategies versus optimal learning of trees/polytrees/TANs
Applications:
- Diagnostic tools: from printer problems to airplane failures
- Medical diagnosis
- Error-correcting codes: turbo codes
- Image processing
- Applications in bioinformatics: gene mapping; learning regulatory, metabolic, and other networks