HMM Sampling and Applications to Gene Finding and Alignment European Conference on Computational Biology 2003 Simon Cawley * and Lior Pachter + and thanks.


1 HMM Sampling and Applications to Gene Finding and Alignment
European Conference on Computational Biology 2003
Simon Cawley* and Lior Pachter+, with thanks to Eli Rusman
*Affymetrix, +UC Berkeley Mathematics Dept

2 Conservation of alternative splicing between human and mouse
Modrek and Lee: 40-60% of human genes have alternative splice forms (Nature Genetics, 2002).
Nurtdinov et al.: 75% of human alternative splice forms are conserved in mouse (Human Molecular Genetics, 2003).
Can we develop ab initio methods for detecting conserved alternative splice sites?

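3 Sequence Alignment
[figure: alignment of the sequences A A C A T T A G A and AGATTACCACA]

4 Finding the optimal alignment
[figure: the same two sequences, A A C A T T A G A and AGATTACCACA, with the optimal alignment marked "max"]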

5 Sampling to find alternative alignments
$$a_{i,j} = w\,a_{i-1,j} + w\,a_{i,j-1} + s_{i,j}\,a_{i-1,j-1}$$
$a_{i,j}$: alignment forward variables for positions [1,i] and [1,j] in each sequence
$s_{i,j}$: match/mismatch probabilities for positions i,j in each sequence
$w$: gap probabilities
[figure: the sequences A A C A T T A G A and AGATTACCACA]
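The recurrence on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the presenters' implementation: it assumes a single gap probability `w`, a two-valued score `s` (one probability for matches, one for mismatches), and the boundary value a[0][0] = 1. The function names and parameter values are hypothetical. Sampling walks back from the final cell and picks each predecessor with probability proportional to its term in the recurrence, so repeated calls return alternative alignments rather than only the optimal one.

```python
import random

def forward_table(x, y, w, match, mismatch):
    """Fill a[i][j] = w*a[i-1][j] + w*a[i][j-1] + s(i,j)*a[i-1][j-1], with a[0][0] = 1."""
    T, U = len(x), len(y)
    a = [[0.0] * (U + 1) for _ in range(T + 1)]
    a[0][0] = 1.0
    for i in range(T + 1):
        for j in range(U + 1):
            if i == 0 and j == 0:
                continue
            total = 0.0
            if i > 0:
                total += w * a[i - 1][j]          # gap in y
            if j > 0:
                total += w * a[i][j - 1]          # gap in x
            if i > 0 and j > 0:
                s = match if x[i - 1] == y[j - 1] else mismatch
                total += s * a[i - 1][j - 1]      # aligned pair
            a[i][j] = total
    return a

def sample_alignment(x, y, a, w, match, mismatch, rng):
    """Stochastic traceback: choose each step in proportion to its forward term."""
    i, j = len(x), len(y)
    top, bottom = [], []
    while i > 0 or j > 0:
        moves, weights = [], []
        if i > 0:
            moves.append((1, 0))
            weights.append(w * a[i - 1][j])
        if j > 0:
            moves.append((0, 1))
            weights.append(w * a[i][j - 1])
        if i > 0 and j > 0:
            s = match if x[i - 1] == y[j - 1] else mismatch
            moves.append((1, 1))
            weights.append(s * a[i - 1][j - 1])
        di, dj = rng.choices(moves, weights=weights)[0]
        top.append(x[i - 1] if di else "-")
        bottom.append(y[j - 1] if dj else "-")
        i -= di
        j -= dj
    return "".join(reversed(top)), "".join(reversed(bottom))

# Usage with the two sequences from the slides; parameter values are illustrative.
x, y = "AACATTAGA", "AGATTACCACA"
table = forward_table(x, y, w=0.1, match=0.6, mismatch=0.05)
rng = random.Random(0)
for _ in range(3):
    print(*sample_alignment(x, y, table, 0.1, 0.6, 0.05, rng), sep="\n", end="\n\n")
```

For long sequences the products underflow, so a practical version would work in log space or rescale each row; that is omitted here to keep the recurrence visible.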

6 Linear Space Sampling
Sequences of length T and U; to obtain k samples:
 – Time complexity: O(TU + k(T+U))
 – Memory requirements: O(T+U)
Hirschberg's divide and conquer algorithm:
 – Time complexity: O(TU)
 – Memory requirements: O(T+U)
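The memory bound rests on the observation that the forward recurrence only ever looks one row back. The sketch below shows just that observation: it computes the final forward value a[T][U] while keeping two rows in memory. It is not the full linear-space sampler and not Hirschberg's algorithm, both of which need additional bookkeeping to recover the path. The function name and parameters are hypothetical, and the scoring follows the same assumptions as the recurrence above (single gap probability `w`, match/mismatch probabilities, a[0][0] = 1).

```python
def forward_total_two_rows(x, y, w, match, mismatch):
    """Return a[T][U] for the alignment forward recurrence using O(U) memory."""
    # Row i = 0 contains only gaps in x: a[0][j] = w**j.
    prev = [w ** j for j in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        # Column j = 0 contains only gaps in y: a[i][0] = w * a[i-1][0].
        curr = [w * prev[0]] + [0.0] * len(y)
        for j in range(1, len(y) + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            curr[j] = w * prev[j] + w * curr[j - 1] + s * prev[j - 1]
        prev = curr  # the older row is no longer needed
    return prev[-1]

print(forward_total_two_rows("AACATTAGA", "AGATTACCACA", 0.1, 0.6, 0.05))
```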

7 Alternative Splicing in Mammalian Genomes
[figure: pre-mRNA undergoes SPLICING and TRANSLATION to give Protein I, or ALTERNATIVE SPLICING and TRANSLATION to give Protein II]

8 Cross-species simultaneous gene finding and alignment
M. Alexandersson, S. Cawley, L. Pachter, SLAM: Cross-species gene finding and alignment with a generalized pair hidden Markov model, Genome Research 13 (2003), p. 496-502

9 Modeling gene features
[figure: human and mouse sequences drawn 5' to 3', with Exon 1, Intron 1, Exon 2, Intron 2, Exon 3 and CNS marked]

10 The SLAM hidden Markov model

11 SLAM components
Splice site detector
 – VLMM
Intron and intergenic regions
 – 2nd order Markov chain
 – independent geometric lengths
Coding sequence
 – PHMM on protein level
 – generalized length distribution
Conserved non-coding sequence
 – PHMM on DNA level
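To make the intron and intergenic component concrete, here is a minimal sketch of a 2nd order Markov chain over DNA together with a geometric length distribution. It is an illustration of the two model types named on the slide, not SLAM's code: the function names, the pseudocount, and the training sequence are all hypothetical, and the input is assumed to contain only uppercase A, C, G and T.

```python
import math

ALPHABET = "ACGT"

def train_second_order(seqs, pseudocount=1.0):
    """Estimate P(base | previous two bases) from training sequences."""
    counts = {a + b: {c: pseudocount for c in ALPHABET}
              for a in ALPHABET for b in ALPHABET}
    for s in seqs:
        for k in range(2, len(s)):
            counts[s[k - 2:k]][s[k]] += 1.0
    probs = {}
    for context, row in counts.items():
        total = sum(row.values())
        probs[context] = {c: n / total for c, n in row.items()}
    return probs

def log_prob(seq, probs):
    """Log-probability of seq[2:] conditioned on its first two bases."""
    return sum(math.log(probs[seq[k - 2:k]][seq[k]]) for k in range(2, len(seq)))

def geometric_log_pmf(length, p_end):
    """log P(L = length) for length >= 1 when each base ends the region with probability p_end."""
    return (length - 1) * math.log1p(-p_end) + math.log(p_end)

# Usage: score a candidate region's content and its length separately, then add them.
probs = train_second_order(["ACGTACGTTTGACA"])
region = "ACGTAC"
print(log_prob(region, probs) + geometric_log_pmf(len(region), p_end=0.01))
```

A geometric length falls out of a plain HMM state with a self-loop, which is why it is the inexpensive choice for introns and intergenic regions; the coding states on the slide use a generalized length distribution instead.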

12 SLAM input and output
Input:
 – Pair of homologous sequences.
Output:
 – CDS and CNS predictions in both sequences.
 – Protein predictions.
 – Protein and CNS alignment.

13 http://bio.math.berkeley.edu/slam/

14 Input:

15 Output:


18 Methodology for identifying alternative splice sites
 – Compiled SLAM gene predictions for the human, mouse and rat genomes.
 – Identified a set of 3400 human/mouse/rat gene triples with consistent predictions from the human/mouse (hs/mm) and human/rat (hs/rn) analyses.
 – For each triple, sampled sub-optimal parses from the hs/mm and hs/rn runs.
 – Collected alternative exons (non-Viterbi exons) that appeared in both the hs/mm and hs/rn runs.
 – Examined overlap with RefSeq genes, mRNAs and ESTs.
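The collection step above amounts to set operations on exon coordinates. The sketch below is one plausible reading of it, under assumptions the slide does not state: exons are represented as (start, end) pairs in human coordinates, and an exon "appears in both runs" only when the coordinates match exactly. All names and the example coordinates are hypothetical.

```python
def alternative_exons(sampled_parses, viterbi_exons):
    """Exons seen in at least one sampled parse but absent from the Viterbi parse."""
    seen = set()
    for parse in sampled_parses:
        seen.update(parse)
    return seen - set(viterbi_exons)

def conserved_alternative_exons(hs_mm_samples, hs_mm_viterbi,
                                hs_rn_samples, hs_rn_viterbi):
    """Alternative exons found independently in the hs/mm and hs/rn runs."""
    return (alternative_exons(hs_mm_samples, hs_mm_viterbi)
            & alternative_exons(hs_rn_samples, hs_rn_viterbi))

# Usage with made-up coordinates for a single gene triple.
viterbi = [(100, 200), (300, 400)]
hs_mm = [[(100, 200), (300, 400)],
         [(100, 200), (310, 400)],
         [(100, 200), (250, 280), (300, 400)]]
hs_rn = [[(100, 200), (310, 400)],
         [(100, 220), (300, 400)]]
print(conserved_alternative_exons(hs_mm, viterbi, hs_rn, viterbi))
```

Requiring agreement between the two independent runs is what filters out exons that a single sampled parse produced by chance.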

19 SLAM whole genome predictions
 – Built a whole genome homology map (Colin Dewey): http://baboon.math.berkeley.edu/~cdewey/homologyMaps/
 – Pre-aligned the homologous blocks to reduce the SLAM search space (Nicolas Bray using AVID): http://baboon.math.berkeley.edu/mavid/ http://hanuman.math.berkeley.edu/kbrowser/
 – Ran SLAM on the resulting blocks: http://bio.math.berkeley.edu/slam/mouse/ http://bio.math.berkeley.edu/slam/rat/

20 [human] [mouse] [rat]


24 Comparing predicted alternative exons to ESTs and mRNAs
                   human/mouse/rat alternative exons    human/mouse alternative exons
                   EST/mRNA     No EST/mRNA             EST/mRNA     No EST/mRNA
Gene count         29           344                     4613         296
Alt. Exon count    29           441                     5577         240
Shifties           28           209                     2622         227
Newbies            1            232                     2955         13

25 Conclusions
 – Sampling is memory efficient and fast, and should be used routinely for alignment applications.
 – Conserved alternative splice forms can be detected ab initio.
 – The extent of alternative splicing conservation is currently unclear. Sampling provides an alternative approach for investigating this problem, one that is not sensitive to biases in EST data.
 – Problem: design effective and scalable validation strategies for alternative splice sites.

