24/07/2007, ISMB/ECCB 2007: Bayesian association of haplotypes and non-genetic factors to regulatory and phenotypic variation in.


1 24/07/2007, ISMB/ECCB 2007
Bayesian association of haplotypes and non-genetic factors to regulatory and phenotypic variation in human populations
Anitha Kannan and John Winn, Jim Huang *
* Probabilistic and Statistical Inference Group, Edward S. Rogers Department of Electrical and Computer Engineering, University of Toronto, Toronto, ON, Canada
Microsoft Research Cambridge, Machine Learning and Perception Group, Cambridge, UK

2 Outline
Main contributions:
– Joint Bayesian modelling of genetic variation data and quantitative trait measurements
– Rich probabilistic model for genotype data
– State-of-the-art results on predicting missing genotypes

3 Outline
Genotype: unordered pair of alleles at each SNP, taken across both chromosomes
Haplotype: ordered set of SNP alleles along a single chromosome
The presence of recombination hotspots partitions haplotypes into blocks [Daly et al., 2001]
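The distinction between genotype and haplotype above can be illustrated with a minimal sketch. It assumes biallelic SNPs coded as 0/1; the function name and the example haplotypes are hypothetical and do not come from the talk.

```python
from typing import List, Tuple


def genotype_from_haplotypes(hap_a: List[int], hap_b: List[int]) -> List[Tuple[int, ...]]:
    """Collapse two haplotypes into a genotype: one unordered allele pair per SNP."""
    if len(hap_a) != len(hap_b):
        raise ValueError("haplotypes must cover the same loci")
    # Sorting each pair discards which chromosome carried which allele (the phase).
    return [tuple(sorted(pair)) for pair in zip(hap_a, hap_b)]


# Hypothetical haplotypes over four SNPs (0/1 allele coding is an assumption).
hap_maternal = [0, 1, 1, 0]
hap_paternal = [1, 1, 0, 0]

genotype = genotype_from_haplotypes(hap_maternal, hap_paternal)
swapped = genotype_from_haplotypes(hap_paternal, hap_maternal)

# A different haplotype pair that yields the same genotype: phase is ambiguous.
alternative = genotype_from_haplotypes([0, 1, 0, 0], [1, 1, 1, 0])
```

Because `genotype`, `swapped` and `alternative` are all equal, the genotype alone cannot tell these haplotype pairs apart, which is why phase has to be inferred.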

4 Part I: Learning haplotype block structure
Our model for genotype data should:
– Account for phase & parent-child information
– Account for uncertainty in ancestral haplotypes
– Account for uncertainty in block structure
– Account for population-specific haplotype block statistics
– Allow for prior knowledge of haplotype block structure

5 Previous models for genotype data
Previous methods learn a low-dimensional representation of the genotype data:
HAPLOBLOCK (Greenspan, G. and Geiger, D. RECOMB 2003)
– Hard partitioning of data into a set of haplotype blocks using low-dimensional “ancestral” haplotypes
fastPHASE (Scheet, P. and Stephens, M. Am J Hum Genet 2006)
– Learn ancestral haplotypes from high-dimensional genotype data while accounting for uncertainty in haplotype blocks
Jojic, N., Jojic, V. and Heckerman, D. UAI 2004.

6 Probabilistic generative model for genotype data
Diagram labels: low-dimensional latent representation; high-dimensional data; unsupervised learning via maximum likelihood

7 A probabilistic model for genotype data

8 Learning the model for genotype data
Maximum likelihood: (equation not preserved in the transcript)
Lower bound on log likelihood: (equation not preserved in the transcript)
Diagram labels: Inference; Learning/Parameter estimation
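Since the slide's equations did not survive, the generic form of such a bound is sketched here. The symbols are assumptions rather than the talk's notation: G stands for the observed genotype data, H for all hidden variables, θ for the model parameters, q for any distribution over H, and F for the free energy.

```latex
\log p(G \mid \theta)
  = \log \sum_{H} p(G, H \mid \theta)
  \;\ge\; \sum_{H} q(H) \log \frac{p(G, H \mid \theta)}{q(H)}
  \;=\; -F(q, \theta)
```

The inequality is Jensen's. In this generic scheme, inference tightens the bound by updating q with θ fixed, and learning raises it by updating θ with q fixed.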

9 Variational inference and parameter estimation
Exact inference is intractable!
Approximate the posterior distribution:
Baum-Welch-like algorithm:
– Run forward-backward algorithm separately on each chain of states
– Estimate transition probabilities and ancestral haplotypes given distributions over states
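The forward-backward step named above can be sketched for a single chain of discrete states. This is the textbook scaled forward-backward recursion for a hidden Markov model, not the authors' implementation; the function name, argument layout and the toy numbers are assumptions.

```python
import math
from typing import List, Tuple


def forward_backward(init: List[float],
                     trans: List[List[float]],
                     lik: List[List[float]]) -> Tuple[List[List[float]], float]:
    """Posterior state marginals and log-likelihood for one chain.

    init[k]     : probability of state k at the first locus
    trans[i][k] : probability of moving from state i to state k
    lik[t][k]   : likelihood of the observation at locus t under state k
    """
    T, K = len(lik), len(init)
    alpha = [[0.0] * K for _ in range(T)]
    beta = [[1.0] * K for _ in range(T)]
    scale = [0.0] * T

    # Forward pass, rescaled at every locus to avoid numerical underflow.
    for t in range(T):
        for k in range(K):
            if t == 0:
                prior = init[k]
            else:
                prior = sum(alpha[t - 1][i] * trans[i][k] for i in range(K))
            alpha[t][k] = prior * lik[t][k]
        scale[t] = sum(alpha[t])
        alpha[t] = [a / scale[t] for a in alpha[t]]

    # Backward pass, reusing the same scale factors.
    for t in range(T - 2, -1, -1):
        for i in range(K):
            beta[t][i] = sum(trans[i][k] * lik[t + 1][k] * beta[t + 1][k]
                             for k in range(K)) / scale[t + 1]

    gamma = [[alpha[t][k] * beta[t][k] for k in range(K)] for t in range(T)]
    return gamma, sum(math.log(s) for s in scale)


# Toy example: two states, three loci (all numbers are made up).
init = [0.5, 0.5]
trans = [[0.9, 0.1], [0.1, 0.9]]
lik = [[0.8, 0.2], [0.7, 0.3], [0.1, 0.9]]
gamma, loglik = forward_backward(init, trans, lik)
```

Each row of `gamma` is a distribution over states at one locus. In a Baum-Welch-like scheme these marginals would feed the re-estimation of transition probabilities and ancestral haplotypes.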

10 Predicting missing genotype data
Have we learned a good density model for genotype data?
Gains from
– Accounting for uncertainty in haplotype block structure
– Accounting for uncertainty in ancestral haplotypes
– Accounting for parental relationships
Assess model using cross-validation/test prediction error

11 Predicting missing genotype data
Crohn’s/5q31 data set (Daly et al., 2001)
– Crohn’s disease data from Chromosome 5q31 containing genotypes for 129 children + 258 parents across 103 loci (phases given for children)
For each test set, mark a fraction ρ of the data as missing
Retain the parameters of the model learned from the training data, then draw 1000 samples over the missing data
Compute the fill-in error rate over the 1000 samples, for all missing data
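The evaluation protocol above can be sketched as follows. This is a minimal illustration under stated assumptions: the helper names `make_mask` and `fill_in_error_rate` are hypothetical, genotypes are stored as nested lists, and the samples are supplied by hand rather than drawn from a learned model.

```python
import random
from typing import List, Set, Tuple

Cell = Tuple[int, int]  # (individual index, locus index)


def make_mask(n_individuals: int, n_loci: int, rho: float, seed: int = 0) -> Set[Cell]:
    """Pick a fraction rho of all (individual, locus) cells to treat as missing."""
    rng = random.Random(seed)
    cells = [(i, k) for i in range(n_individuals) for k in range(n_loci)]
    return set(rng.sample(cells, round(rho * len(cells))))


def fill_in_error_rate(truth: List[List[int]],
                       samples: List[List[List[int]]],
                       missing: Set[Cell]) -> float:
    """Fraction of masked cells filled in wrongly, pooled over all samples."""
    errors = 0
    total = 0
    for sample in samples:
        for i, k in missing:
            errors += sample[i][k] != truth[i][k]
            total += 1
    return errors / total if total else 0.0


# Tiny hand-made example: two individuals, two loci, two "samples".
truth = [[0, 1], [1, 1]]
missing = {(0, 0), (1, 1)}
samples = [
    [[0, 1], [1, 1]],  # both masked cells filled in correctly
    [[1, 1], [1, 1]],  # cell (0, 0) filled in wrongly
]
error_rate = fill_in_error_rate(truth, samples, missing)
mask = make_mask(4, 5, 0.25)
```

Here one of the four masked predictions is wrong, so `error_rate` is 0.25. In the protocol on the slide, `samples` would hold the 1000 draws from the model with its parameters held fixed.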

12 Prediction error for Crohn’s/5q31 data

13 Comparative performance for Crohn’s/5q31 data

14 Reconstructing phase
– Run EM using 10 random initializations on the full data set
– Estimate phase from posterior
– Compute phase error over all loci where phase is known, unambiguous and where alleles are completely observed
– Compute average and standard deviation of phase error over the 10 initializations

15 Reconstructing phase
Daly 5q31 data | Phase during EM | Mean phase error rate | Standard deviation of phase error rate | Minimum free energy (nats)
children w/ phase | frozen | 0.59% | 1.00% | 1.50 x 10^4
children w/out phase | learned | 8.21% | 1.09% | 2.23 x 10^4
children w/ phase + parents | frozen | 0.39% | 0.07% | 1.45 x 10^4
children w/out phase + parents | learned | 9.51% | 1.78% | 1.36 x 10^4

16 How many ancestors?

17 Establishing haplotype block boundaries
Define the recombination prior γ on transition probabilities
– Different values of γ correspond to different “blockiness” of the data
For each locus k, we can compute the probability of transition p_k
– We can set a threshold t and establish block boundaries
Once blocks are defined, we can assign block labels l_b = (m, n)
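The thresholding step above can be sketched as follows. The convention that p_k is the transition probability between locus k-1 and locus k, the strict comparison with t, and the function names are all assumptions. The (first locus, last locus) tuples returned here describe block extents only and are not the block labels l_b of the slide.

```python
from typing import List, Tuple


def block_boundaries(transition_probs: List[float], threshold: float) -> List[int]:
    """Loci k whose transition probability p_k exceeds the threshold t."""
    return [k for k, p in enumerate(transition_probs) if p > threshold]


def blocks_from_boundaries(n_loci: int, boundaries: List[int]) -> List[Tuple[int, int]]:
    """Turn boundary loci into (first locus, last locus) extents, one per block."""
    starts = [0] + [k for k in boundaries if 0 < k < n_loci]
    ends = starts[1:] + [n_loci]
    return [(start, end - 1) for start, end in zip(starts, ends)]


# Made-up per-locus transition probabilities for seven loci.
p = [0.0, 0.01, 0.6, 0.02, 0.03, 0.8, 0.01]

coarse = blocks_from_boundaries(len(p), block_boundaries(p, threshold=0.5))
fine = blocks_from_boundaries(len(p), block_boundaries(p, threshold=0.025))
```

Lowering the threshold from 0.5 to 0.025 splits the same region into more, smaller blocks, which mirrors the effect that a different degree of “blockiness” has on the resulting partition.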

18 Establishing haplotype block boundaries
Figure panels: smaller number of larger blocks…; larger number of smaller blocks…

19 Haplotype block structure in the ENm006 region
573 SNP markers for 270 individuals from 3 sub-populations:
– 90 Yoruba individuals (30 parent-parent-offspring trios) from Ibadan, Nigeria (YRI)
– 90 individuals (30 trios) of European descent from Utah (CEU)
– 45 Han Chinese individuals from Beijing (CHB) and 45 Japanese individuals from Tokyo (JPT), pooled as CHB+JPT

20 Pattern usage in Chromosome 5q31

21 Part II: Linking haplotype block structure and gene expression data

22 A model for linking haplotype structure to quantitative trait measurements
Figure labels: observed quantitative trait profile; relevance variable (x 1.0, x 0.0); latent block profile; haplotype blocks 1 and 2; individuals 1 to 5; labels 1 to 4

23 A Bayesian model for linking haplotype structure to quantitative measurements
Graphical model (plate diagram):
– Variables: S_bj, T_bj (block label); z_gj (observed trait); μ_bg (latent block profile); w_bg (relevance variable); ρ_g (noise precision)
– Hyperparameters: α_0, β_0; τ_0, μ_0; π_0
– Plates: individuals j = 1,…,J; blocks b = 1,…,B; quantitative traits g = 1,…,G

24 A Bayesian model for linking haplotype structure to quantitative measurements
Graphical model (plate diagram):
– Variables: l_bj, S_bj, T_bj (block label variables); z_gj (observed gene expression); μ_bg (latent block profile); w_bg (relevance variable); ρ_g (noise precision)
– Hyperparameters: α_0, β_0; τ_0, μ_0; π_0
– Plates: individuals j = 1,…,J; blocks b = 1,…,B; genes g = 1,…,G

25 A Bayesian model for linking haplotype structure to quantitative measurements
Graphical model (plate diagram):
– Variables: μ_bg (latent block profile); w_bg (relevance variable); ρ_g (noise precision)
– Hyperparameters: α_0, β_0; τ_0, μ_0; π_0

26 A Bayesian model for linking haplotype structure to quantitative measurements
Graphical model (plate diagram):
– Variables: S_bj, T_bj (block labels); z_gj (observed trait); μ_bg (latent block expression profile); w_bg (relevance variable); ρ_g (noise precision)
– Hyperparameters: α_0, β_0; τ_0, μ_0; π_0

27 A Bayesian model for linking haplotype structure to quantitative measurements
Equation headings (equations not preserved in the transcript): likelihood; priors; joint probability

28 Variational Bayes for inferring relationships between haplotype blocks and quantitative measurements
Posterior over block labels is held fixed
Factorized variational approximation: (equation not preserved in the transcript)
VB inference and learning
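The factorized approximation itself is missing from the transcript. One plausible mean-field form is sketched here, assuming full factorization over the variables named in the model diagrams (relevance variables w_bg, latent block profiles μ_bg, noise precisions ρ_g, observed traits z); the factorization actually used in the talk may differ.

```latex
q(w, \mu, \rho)
  = \prod_{b=1}^{B} \prod_{g=1}^{G} q(w_{bg})\, q(\mu_{bg})
    \;\prod_{g=1}^{G} q(\rho_g),
\qquad
\log p(z) \;\ge\;
  \mathbb{E}_{q}\!\left[ \log p(z, w, \mu, \rho) \right]
  - \mathbb{E}_{q}\!\left[ \log q(w, \mu, \rho) \right]
```

Under such a factorization, each factor is updated in turn while the others are held fixed, and every update can only increase the bound on the right.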

29 Variational Bayes updates

30 Linking haplotype blocks to phenotype
– 387 individuals with Crohn’s (+1) or non-Crohn’s (-1) phenotype
– Link 10 haplotype blocks from 5q31 to phenotype
– Average cross-validation error: 23.1% ± 3.45%
– Haplotype blocks 2 and 10 most relevant to Crohn’s phenotype (p < 4.76 x 10^-5)
Figure axes: test cases (sorted); test data splits

31 Robustness of GeneSNP to irrelevant genes
Figure panels: 5 irrelevant genes; 10 irrelevant genes
Adding irrelevant genes doesn’t hurt much…

32 Robustness of GeneSNP to irrelevant blocks
Figure panels: 1 irrelevant block; 2 irrelevant blocks; 10 irrelevant blocks
…but adding irrelevant haplotype blocks does hurt (bad)!
LESSON: It is important to group large numbers of SNPs into a smaller number of haplotype blocks!

33 Linking haplotype blocks to gene expression
ENm006 data set:
– 19 haplotype blocks (573 SNPs)
– 28 gene expression profiles in ENm006 region (Stranger et al., 2007)

34 Addressing population stratification
The population variable affects phenotype/gene expression…
…whereas variation between individuals is the effect we’re interested in

35 Associations between haplotype blocks and gene expression
– GDI1 - HapBlock2 (YRI): p < 2.5 x 10^-4
– GDI1 - HapBlock5 (CHB+JPT): p < 3.33 x 10^-4

36 Summary
– Enhanced version of the Jojic et al. (UAI 2004) model for haplotype inference/discovering block structure
– Novel Bayesian model for associating haplotype blocks to gene expression
– We re-discover population-specific block structures across populations in the HapMap data
– Predictions for Crohn’s disease from Chromosome 5q31 data
– Cis-associations between blocks and gene expression in ENm006 in the presence of non-genetic factors
– Cis-association between HapBlocks 2 and 5 and GDI1

37 The road ahead…
– Applying to larger portions of the HapMap data
– Finding trans-associations
– Non-linear models for associating block structure to quantitative traits
– Joint learning of haplotype block structure and associations
– Accounting for patterns of gene co-expression/similar phenotypes

38 Acknowledgements
– Manolis Dermitzakis and Richard Durbin, Wellcome Trust Sanger Institute
– Nebojsa Jojic, Microsoft Research Redmond
– Paul Scheet, University of Michigan - Ann Arbor
– US National Science Foundation (NSF)

