Learning – EM in the ABO locus. Tutorial #9 © Ilan Gronau. Based on original slides of Ydo Wexler & Dan Geiger.

2 Genotype statistics
Mendelian genetics:
 - Locus: a particular location on a chromosome (genome).
 - Each locus has two copies, called alleles (one paternal and one maternal).
 - Each copy has several relevant states (genotypes).
 - The locus genotype is determined by the combined genotype of both copies.
 - The locus genotype yields the phenotype (physical features).
We wish to estimate the distribution of all possible genotypes. Suppose we randomly sample $N$ individuals and find that $N_{s,t}$ of them have locus genotype $s/t$. The MLE is given by:
$$\hat{\theta}_{s/t} = \frac{N_{s,t}}{N}$$
Sampling genotypes is costly; sampling phenotypes is cheap.
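The genotype MLE above is just a relative frequency. A minimal sketch, assuming a hypothetical genotyped sample (the sample and the name `genotype_mle` are illustrative and not from the slides):

```python
from collections import Counter

def genotype_mle(sample):
    """MLE of locus-genotype frequencies: N_{s,t} / N for each observed genotype."""
    counts = Counter(sample)          # N_{s,t} for every genotype s/t seen
    n = len(sample)                   # N, the sample size
    return {genotype: c / n for genotype, c in counts.items()}

# Hypothetical sample of 8 genotyped individuals (assumption, for illustration only).
sample = ["a/o", "a/o", "b/b", "o/o", "a/b", "a/o", "b/o", "o/o"]
theta_hat = genotype_mle(sample)
print(theta_hat)
```

This only works when genotypes are observed directly, which is the costly case the slide mentions.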

3 The ABO locus
The ABO locus determines blood type. It has six possible genotypes: {a/a, a/o, b/o, b/b, a/b, o/o}. They lead to four possible phenotypes: {A, B, AB, O}.
We wish to estimate the proportions of the 6 genotypes in a population.
 - Sampling a genotype: sequence a genomic region.
 - Sampling a phenotype: check for the presence of antibodies (a simple blood test).
Problem: the phenotype doesn't reveal the genotype (in the case of A and B).

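4 The ABO locus
Problem: the phenotype doesn't reveal the genotype.
The probabilistic model: allele genotypes are distributed i.i.d. with probabilities $\theta_a, \theta_b, \theta_o$, and determine the probabilities of the locus genotypes (Hardy-Weinberg equilibrium):
$$\theta_{a/b} = 2\theta_a\theta_b; \quad \theta_{a/o} = 2\theta_a\theta_o; \quad \theta_{b/o} = 2\theta_b\theta_o$$
$$\theta_{a/a} = \theta_a^2; \quad \theta_{b/b} = \theta_b^2; \quad \theta_{o/o} = \theta_o^2$$
This implies probabilities for the phenotypes:
$$\Pr[P = A \mid \Theta] = \theta_{a/a} + \theta_{a/o} = \theta_a^2 + 2\theta_a\theta_o$$
$$\Pr[P = B \mid \Theta] = \theta_{b/b} + \theta_{b/o} = \theta_b^2 + 2\theta_b\theta_o$$
$$\Pr[P = AB \mid \Theta] = \theta_{a/b} = 2\theta_a\theta_b$$
$$\Pr[P = O \mid \Theta] = \theta_{o/o} = \theta_o^2$$
$\Theta$ is the model parameter set: $\Theta = \{\theta_a, \theta_b, \theta_o\}$.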
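The mapping from allele probabilities to phenotype probabilities can be sketched as follows. The function names are assumptions made for illustration; the formulas are the Hardy-Weinberg expressions from the slide.

```python
def genotype_probs(theta_a, theta_b, theta_o):
    """Locus-genotype probabilities under Hardy-Weinberg equilibrium."""
    return {
        "a/a": theta_a ** 2,
        "b/b": theta_b ** 2,
        "o/o": theta_o ** 2,
        "a/b": 2 * theta_a * theta_b,
        "a/o": 2 * theta_a * theta_o,
        "b/o": 2 * theta_b * theta_o,
    }

def phenotype_probs(theta_a, theta_b, theta_o):
    """Phenotype probabilities: sum the genotypes that yield each phenotype."""
    g = genotype_probs(theta_a, theta_b, theta_o)
    return {
        "A": g["a/a"] + g["a/o"],
        "B": g["b/b"] + g["b/o"],
        "AB": g["a/b"],
        "O": g["o/o"],
    }

probs = phenotype_probs(0.2, 0.2, 0.6)
print(probs)
```

Because the three allele probabilities sum to 1, the four phenotype probabilities also sum to 1.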

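5 Likelihood of phenotype data
Given a population phenotype sample:
Data = {B, A, B, B, O, A, B, A, O, B, AB}
the likelihood of our parameter set $\Theta = \{\theta_a, \theta_b, \theta_o\}$ is (factors in the order A, B, AB, O):
$$L(\Theta) = \Pr[\text{Data} \mid \Theta] = (\theta_a^2 + 2\theta_a\theta_o)^3 \, (\theta_b^2 + 2\theta_b\theta_o)^5 \, (2\theta_a\theta_b)^1 \, (\theta_o^2)^2$$
The maximum of this function yields the MLE. Use EM to obtain it.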
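To make the likelihood concrete, here is a small sketch that evaluates its logarithm for the sample on this slide. The function name and the second parameter set are assumptions chosen for illustration.

```python
import math
from collections import Counter

# Phenotype sample from the slide.
DATA = ["B", "A", "B", "B", "O", "A", "B", "A", "O", "B", "AB"]

def log_likelihood(theta, data):
    """log Pr[data | theta] for theta = (theta_a, theta_b, theta_o)."""
    a, b, o = theta
    p = {
        "A": a * a + 2 * a * o,
        "B": b * b + 2 * b * o,
        "AB": 2 * a * b,
        "O": o * o,
    }
    counts = Counter(data)  # here: A=3, B=5, AB=1, O=2
    return sum(n * math.log(p[ph]) for ph, n in counts.items())

ll_first = log_likelihood((0.2, 0.2, 0.6), DATA)
ll_second = log_likelihood((0.2, 0.4, 0.4), DATA)  # hypothetical alternative guess
print(ll_first, ll_second)
```

The second guess gives more weight to allele b, which suits a sample dominated by phenotype B, so its log-likelihood is higher.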

6 The EM algorithm
The setting for the algorithm:
 - Our data is a series of outcomes of experiments.
 - Each experiment is conducted identically and independently.
 - The outcome of an experiment is a function of the values selected for a set of discrete random variables $X_1, \ldots, X_n$.
 - The actual values selected for $X_1, \ldots, X_n$ may be hidden from us.
We wish to find the MLE of the probability distributions of $X_1, \ldots, X_n$.

7 The EM algorithm
Examples of this setting:
1. Genotyping in the ABO locus: a single hidden variable $X$, a single allele genotype (a, b, or o). Model parameters: $\Theta = \{\theta_a, \theta_b, \theta_o\}$.
2. Hidden Markov Models: two hidden variables $T_s, E_s$ for every state $s$ ($E_s$ chooses the signal; $T_s$ chooses the next state). Model parameters: transition and emission probabilities.

8 The EM algorithm
Start with some set of parameters $\Theta$. Iterate until convergence:
 - E-step: calculate the expected count for every possible result of every hidden variable in the model, as implied by the data and $\Theta$.
 - M-step: for every hidden variable, use the expected counts as statistics to yield $\Theta' \leftarrow \mathrm{MLE}(\text{data}, \Theta)$.

9 The EM algorithm
In our example:
 - Single hidden variable $X$: a single allele genotype (a, b, or o).
 - Model parameters: $\Theta = \{\theta_a, \theta_b, \theta_o\}$.
 - E-step: count the expected number of a, b, o alleles in the population (total number of counts: $2n$).
 - M-step: set $\theta'_a = \#a / 2n$; $\theta'_b = \#b / 2n$; $\theta'_o = \#o / 2n$.

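10 E-step calculations: gene counting
Result(s) of hidden variables, with the gene count of each genotype:
genotype   #a  #b  #o
a/o         1   0   1
a/a         2   0   0
b/o         0   1   1
b/b         0   2   0
a/b         1   1   0
o/o         0   0   2
Observed outcome of "experiment", with the expected gene count of each phenotype:
phenotype   genotypes    #a                                #b                                #o
A           a/o, a/a     1·Pr[a/o | A] + 2·Pr[a/a | A]     0                                 1·Pr[a/o | A]
B           b/o, b/b     0                                 1·Pr[b/o | B] + 2·Pr[b/b | B]     1·Pr[b/o | B]
AB          a/b          1                                 1                                 0
O           o/o          0                                 0                                 2
where the genotype probabilities given the phenotype follow from the model:
$$\Pr[a/o \mid A] = \frac{\theta_{a/o}}{\theta_{a/o} + \theta_{a/a}} = \frac{2\theta_o}{\theta_a + 2\theta_o}; \quad \Pr[a/a \mid A] = \frac{\theta_{a/a}}{\theta_{a/o} + \theta_{a/a}} = \frac{\theta_a}{\theta_a + 2\theta_o}$$
$$\Pr[b/o \mid B] = \frac{2\theta_o}{\theta_b + 2\theta_o}; \quad \Pr[b/b \mid B] = \frac{\theta_b}{\theta_b + 2\theta_o}$$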
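The gene-counting E-step can be sketched as a single function. The name `e_step` and the small phenotype counts below are assumptions picked so the arithmetic comes out in round numbers; they are not data from the slides.

```python
def e_step(theta, n):
    """Expected numbers of a, b, o alleles given phenotype counts n and theta."""
    a, b, o = theta
    p_aa = a / (a + 2 * o)  # Pr[a/a | A]; Pr[a/o | A] = 1 - p_aa
    p_bb = b / (b + 2 * o)  # Pr[b/b | B]; Pr[b/o | B] = 1 - p_bb
    exp_a = n["A"] * (2 * p_aa + (1 - p_aa)) + n["AB"]
    exp_b = n["B"] * (2 * p_bb + (1 - p_bb)) + n["AB"]
    exp_o = n["A"] * (1 - p_aa) + n["B"] * (1 - p_bb) + 2 * n["O"]
    return exp_a, exp_b, exp_o

# Hypothetical phenotype counts (assumption), 28 individuals in total.
counts = {"A": 7, "B": 14, "AB": 3, "O": 4}
exp_a, exp_b, exp_o = e_step((0.2, 0.2, 0.6), counts)
print(exp_a, exp_b, exp_o)
```

Every individual contributes exactly two alleles, so the three expected counts always add up to $2n$ (here 56).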

11 A numeric example
Data (sufficient statistics: $n_A, n_B, n_{AB}, n_O$):
type   #people
A      100
B      200
AB     50
O      50
We start with an initial guess: $\Theta^0 = \{\theta_a, \theta_b, \theta_o\} = \{0.2, 0.2, 0.6\}$.

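12 A numeric example: execution of EM
Data (type: #people): A 100, B 200, AB 50, O 50.
1st iteration: $\Theta^0 = \{0.2, 0.2, 0.6\}$, so $\Pr[a/a \mid A] = \Pr[b/b \mid B] = 1/7$.
E-step (contributions in the order A, B, AB, O):
$$E[\#a] = 100 \cdot \tfrac{8}{7} + 0 + 50 + 0 \approx 164.3$$
$$E[\#b] = 0 + 200 \cdot \tfrac{8}{7} + 50 + 0 \approx 278.6$$
$$E[\#o] = 100 \cdot \tfrac{6}{7} + 200 \cdot \tfrac{6}{7} + 0 + 2 \cdot 50 \approx 357.1$$
The three counts sum to $800 = 2n$.
M-step: $\Theta^1 = \{0.205, 0.348, 0.447\}$.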

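13 A numeric example: execution of EM
Data (type: #people): A 100, B 200, AB 50, O 50.
2nd iteration: $\Theta^1 = \{0.205, 0.348, 0.447\}$.
E-step (same computation as in the 1st iteration, with $\Theta^1$ in place of $\Theta^0$):
$$E[\#a] \approx 169; \quad E[\#b] \approx 306; \quad E[\#o] \approx 325$$
The three counts again sum to $800 = 2n$.
M-step: $\Theta^2 = \{0.211, 0.383, 0.406\}$.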
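The two iterations of the numeric example can be reproduced with a short sketch. The function name `em_iteration` is an assumption; the data and the initial guess are those given in the slides.

```python
def em_iteration(theta, n):
    """One EM iteration (gene-counting E-step followed by the M-step)."""
    a, b, o = theta
    two_n = 2 * sum(n.values())
    p_aa = a / (a + 2 * o)  # Pr[a/a | A]
    p_bb = b / (b + 2 * o)  # Pr[b/b | B]
    # E-step: expected allele counts.
    exp_a = n["A"] * (1 + p_aa) + n["AB"]
    exp_b = n["B"] * (1 + p_bb) + n["AB"]
    exp_o = n["A"] * (1 - p_aa) + n["B"] * (1 - p_bb) + 2 * n["O"]
    # M-step: normalize by the total number of alleles, 2n.
    return exp_a / two_n, exp_b / two_n, exp_o / two_n

counts = {"A": 100, "B": 200, "AB": 50, "O": 50}
theta0 = (0.2, 0.2, 0.6)
theta1 = em_iteration(theta0, counts)
theta2 = em_iteration(theta1, counts)
print(theta1)
print(theta2)
```

The computed values agree with the slide's $\Theta^1$ and $\Theta^2$ to within about $10^{-3}$; the slide's figures are rounded to three decimals.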

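14 EM algorithm for the ABO locus: summary
Sufficient statistics: $n_A, n_B, n_{AB}, n_O$, with $n = n_A + n_B + n_{AB} + n_O$.
E-step:
$$E[\#a] = n_A \cdot \frac{2\theta_a + 2\theta_o}{\theta_a + 2\theta_o} + n_{AB}$$
$$E[\#b] = n_B \cdot \frac{2\theta_b + 2\theta_o}{\theta_b + 2\theta_o} + n_{AB}$$
$$E[\#o] = n_A \cdot \frac{2\theta_o}{\theta_a + 2\theta_o} + n_B \cdot \frac{2\theta_o}{\theta_b + 2\theta_o} + 2 n_O$$
M-step:
$$\theta'_a = \frac{E[\#a]}{2n}; \quad \theta'_b = \frac{E[\#b]}{2n}; \quad \theta'_o = \frac{E[\#o]}{2n}$$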

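15 EM algorithm for the ABO locus: summary
Sufficient statistics: $n_A, n_B, n_{AB}, n_O$, with $n = n_A + n_B + n_{AB} + n_O$.
Iteration update formula:
$$\theta'_a = \frac{1}{2n}\left[ n_A \cdot \frac{2\theta_a + 2\theta_o}{\theta_a + 2\theta_o} + n_{AB} \right]$$
$$\theta'_b = \frac{1}{2n}\left[ n_B \cdot \frac{2\theta_b + 2\theta_o}{\theta_b + 2\theta_o} + n_{AB} \right]$$
$$\theta'_o = \frac{1}{2n}\left[ n_A \cdot \frac{2\theta_o}{\theta_a + 2\theta_o} + n_B \cdot \frac{2\theta_o}{\theta_b + 2\theta_o} + 2 n_O \right]$$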
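A sketch of iterating the update formula until the parameters stop changing. The names `update` and `run_em`, the tolerance, and the iteration cap are all assumptions; the slides do not specify a stopping rule.

```python
def update(theta, n_A, n_B, n_AB, n_O):
    """Closed-form EM update for (theta_a, theta_b, theta_o)."""
    a, b, o = theta
    two_n = 2 * (n_A + n_B + n_AB + n_O)
    new_a = (n_A * (2 * a + 2 * o) / (a + 2 * o) + n_AB) / two_n
    new_b = (n_B * (2 * b + 2 * o) / (b + 2 * o) + n_AB) / two_n
    new_o = (n_A * 2 * o / (a + 2 * o) + n_B * 2 * o / (b + 2 * o) + 2 * n_O) / two_n
    return new_a, new_b, new_o

def run_em(theta, stats, tol=1e-10, max_iter=10_000):
    """Apply the update until the largest parameter change drops below tol."""
    for iteration in range(1, max_iter + 1):
        new_theta = update(theta, *stats)
        if max(abs(x - y) for x, y in zip(new_theta, theta)) < tol:
            return new_theta, iteration
        theta = new_theta
    return theta, max_iter

stats = (100, 200, 50, 50)  # n_A, n_B, n_AB, n_O from the numeric example
theta_hat, iterations = run_em((0.2, 0.2, 0.6), stats)
print(theta_hat, iterations)
```

Stopping on the parameter change is one common choice; stopping on the change in log-likelihood would work as well.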

16 EM algorithm: ABO example
Data (type: #people): A 100, B 200, AB 50, O 50.
[Figure: $\theta_a, \theta_b, \theta_o$ versus learning iteration; labeled values 0.20, 0.38, 0.42]

17 EM algorithm: ABO example
Same data and figure, annotated: good convergence (maybe).
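One way to examine convergence numerically is to track the log-likelihood across iterations; EM never decreases it, although it may settle on a local maximum. This is a sketch under assumptions: the helper names and the choice of 50 iterations are illustrative, and the slide's plot itself is not reproduced here.

```python
import math

def log_likelihood(theta, n):
    """Log-likelihood of phenotype counts n under theta = (a, b, o)."""
    a, b, o = theta
    p = {"A": a * a + 2 * a * o, "B": b * b + 2 * b * o, "AB": 2 * a * b, "O": o * o}
    return sum(n[ph] * math.log(p[ph]) for ph in n)

def em_step(theta, n):
    """One EM iteration using the gene-counting update."""
    a, b, o = theta
    two_n = 2 * sum(n.values())
    p_aa = a / (a + 2 * o)
    p_bb = b / (b + 2 * o)
    return (
        (n["A"] * (1 + p_aa) + n["AB"]) / two_n,
        (n["B"] * (1 + p_bb) + n["AB"]) / two_n,
        (n["A"] * (1 - p_aa) + n["B"] * (1 - p_bb) + 2 * n["O"]) / two_n,
    )

counts = {"A": 100, "B": 200, "AB": 50, "O": 50}
theta = (0.2, 0.2, 0.6)
history = [log_likelihood(theta, counts)]
for _ in range(50):  # 50 iterations is an arbitrary choice
    theta = em_step(theta, counts)
    history.append(log_likelihood(theta, counts))
print(history[0], history[-1])
```

Starting from several different initial guesses and comparing the final log-likelihoods is a simple check against a poor local maximum.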

18 Alternative solution
Alternative view:
 - Single hidden variable $X'$: a maternal allele genotype (a, b, or o).
 - Model parameters: $\Theta = \{\theta_a, \theta_b, \theta_o\}$.
 - E-step: count the expected number of maternal a, b, o alleles in the population (total number of counts: $n$).
 - M-step: set $\theta'_a = \#a / n$; $\theta'_b = \#b / n$; $\theta'_o = \#o / n$.
Initial view:
 - Single hidden variable $X$: a single allele genotype (a, b, or o).
 - Model parameters: $\Theta = \{\theta_a, \theta_b, \theta_o\}$.
 - E-step: count the expected number of a, b, o alleles in the population (total number of counts: $2n$).
 - M-step: set $\theta'_a = \#a / 2n$; $\theta'_b = \#b / 2n$; $\theta'_o = \#o / 2n$.

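19 E-step calculations: gene counting (maternal allele)
Result(s) of hidden variables, with the count of each maternal genotype:
maternal genotype   #a  #b  #o
a                    1   0   0
b                    0   1   0
o                    0   0   1
Observed outcome of "experiment", with the expected maternal count of each phenotype:
phenotype   maternal genotype   #a               #b               #o
A           o or a              1·Pr[X'=a | A]   0                1·Pr[X'=o | A]
B           o or b              0                1·Pr[X'=b | B]   1·Pr[X'=o | B]
AB          a or b              1/2              1/2              0
O           o                   0                0                1
where
$$\Pr[X'=a \mid A] = \frac{\theta_a + \theta_o}{\theta_a + 2\theta_o}; \quad \Pr[X'=o \mid A] = \frac{\theta_o}{\theta_a + 2\theta_o}$$
$$\Pr[X'=b \mid B] = \frac{\theta_b + \theta_o}{\theta_b + 2\theta_o}; \quad \Pr[X'=o \mid B] = \frac{\theta_o}{\theta_b + 2\theta_o}$$
This is exactly ½ of what we got by gene counting.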
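The "exactly ½" relation can be checked numerically. In this sketch both function names are assumptions; the data and initial guess are those of the numeric example.

```python
def gene_counts(theta, n):
    """Expected allele counts over both copies (total 2n)."""
    a, b, o = theta
    p_aa = a / (a + 2 * o)  # Pr[a/a | A]
    p_bb = b / (b + 2 * o)  # Pr[b/b | B]
    return (
        n["A"] * (1 + p_aa) + n["AB"],
        n["B"] * (1 + p_bb) + n["AB"],
        n["A"] * (1 - p_aa) + n["B"] * (1 - p_bb) + 2 * n["O"],
    )

def maternal_counts(theta, n):
    """Expected counts of the maternal allele only (total n)."""
    a, b, o = theta
    p_mat_a = (a + o) / (a + 2 * o)  # Pr[maternal allele = a | A]
    p_mat_b = (b + o) / (b + 2 * o)  # Pr[maternal allele = b | B]
    return (
        n["A"] * p_mat_a + n["AB"] / 2,
        n["B"] * p_mat_b + n["AB"] / 2,
        n["A"] * (1 - p_mat_a) + n["B"] * (1 - p_mat_b) + n["O"],
    )

counts = {"A": 100, "B": 200, "AB": 50, "O": 50}
theta = (0.2, 0.2, 0.6)
both = gene_counts(theta, counts)
maternal = maternal_counts(theta, counts)
print(both, maternal)
```

Since the maternal counts are half the gene counts and are normalized by $n$ instead of $2n$, the two views produce the same M-step estimates.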

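20 EM algorithm for the ABO locus: summary
Sufficient statistics: $n_A, n_B, n_{AB}, n_O$, with $n = n_A + n_B + n_{AB} + n_O$.
Iteration update formula (maternal-allele view; identical to the gene-counting update, since both the counts and the normalizer are halved):
$$\theta'_a = \frac{1}{n}\left[ n_A \cdot \frac{\theta_a + \theta_o}{\theta_a + 2\theta_o} + \frac{n_{AB}}{2} \right]$$
$$\theta'_b = \frac{1}{n}\left[ n_B \cdot \frac{\theta_b + \theta_o}{\theta_b + 2\theta_o} + \frac{n_{AB}}{2} \right]$$
$$\theta'_o = \frac{1}{n}\left[ n_A \cdot \frac{\theta_o}{\theta_a + 2\theta_o} + n_B \cdot \frac{\theta_o}{\theta_b + 2\theta_o} + n_O \right]$$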
