1 Phylogenetic Shadowing
Daniel L. Ong

2 Abstract
March 9, 2005, RUGS, UC Berkeley
The human genome contains about 3 billion base pairs!
Algorithms to analyze these sequences must run in linear time to be tractable
Finding genes is important to molecular biologists, first step to understanding

3 Outline
Introduction
Alignments
Phylogenetic trees
Sequence models
–Example: mRNA and scRNA models
Conclusions

4 Introduction to Biosequences
Four nucleotides: A pairs with T; G pairs with C
–In RNA, U replaces T
The NIH GenBank has 188 GB of sequence data; UC Santa Cruz has another 128 GB
The central dogma: DNA is transcribed into RNA, which is translated into protein
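
The pairing rules on this slide can be written down directly. The following is a minimal sketch, not code from the talk; the function names `reverse_complement` and `transcribe` are illustrative choices, and the input is assumed to contain only the letters A, C, G, and T.

```python
# Watson-Crick pairing from the slide: A pairs with T, G pairs with C.
DNA_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the opposite strand, read in its own 5'-to-3' direction."""
    return dna.upper().translate(DNA_COMPLEMENT)[::-1]


def transcribe(coding_strand: str) -> str:
    """Write the RNA copy of a coding strand by replacing T with U."""
    return coding_strand.upper().replace("T", "U")


print(reverse_complement("ATGC"))  # GCAT
print(transcribe("ATGC"))          # AUGC
```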

5 Alignments
Alignment: given two sequences, insert gaps or allow mismatches in the input sequences to minimize a cost function
–Similar to edit distance
–Generalizes to n sequences
Exploited to predict genes
–Greater similarity in protein-coding genes
–Mutated as a pair in structural RNA genes (Chakrabarti & Pachter, 2004)
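
To make the "minimize a cost function" idea concrete, here is a small sketch of pairwise global alignment by dynamic programming (Needleman-Wunsch style). The unit `mismatch` and `gap` costs are assumptions chosen so that the result coincides with edit distance; real aligners use tuned scoring schemes.

```python
def alignment_cost(a: str, b: str, mismatch: int = 1, gap: int = 1) -> int:
    """Minimum total cost of globally aligning sequences a and b."""
    # Row 0: aligning the empty prefix of a with b[:j] needs j gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        # Column 0: aligning a[:i] with the empty prefix of b needs i gaps.
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            substitute = prev[j - 1] + (0 if a[i - 1] == b[j - 1] else mismatch)
            gap_in_b = prev[j] + gap      # a[i-1] is aligned to a gap
            gap_in_a = curr[j - 1] + gap  # b[j-1] is aligned to a gap
            curr.append(min(substitute, gap_in_b, gap_in_a))
        prev = curr
    return prev[-1]


print(alignment_cost("ACGT", "AGT"))  # 1: one gap opposite the C
```

Only two rows of the table are kept, so memory is linear in the length of `b`, while time remains proportional to the product of the two lengths.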

6 Multiple alignment
Considering multiple sequences allows us to leverage the comparative genomics paradigm
–Functionally important regions of the genome are more likely to be conserved across species
–The converse is also true
Genomes should be closely related
–About 5-7 species of a family (Boffelli et al., 2003)
–Additional genomes increase sensitivity (true positives) and decrease specificity (true negatives)
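
For reference, sensitivity and specificity are ratios over the four outcomes of a prediction. The sketch below uses the standard definitions; the counts in the usage lines are made-up numbers for illustration, not results from the talk.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Fraction of real functional regions that were predicted (TP / (TP + FN))."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg: int, false_pos: int) -> float:
    """Fraction of non-functional regions correctly left unpredicted (TN / (TN + FP))."""
    return true_neg / (true_neg + false_pos)


# Hypothetical counts, for illustration only.
print(sensitivity(true_pos=80, false_neg=20))
print(specificity(true_neg=90, false_pos=10))
```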

7 Phylogenetic Trees
Use a directed binary tree to track the relationships between organisms
Each node represents the nucleotide at a particular position in an aligned sequence
–Current organisms are the leaves of the tree (observed)
–Internal nodes are the common ancestors (unobserved)
Edges are speciation events and represent “evolutionary distance” as an extra parameter
Assume each nucleotide evolves independently (site-independent evolution) [Durbin et al., 1998]

8 Phylogenetic Tree
Site-independent model computes probability of independent columns
–Used for protein-coding genes
Pairwise site-dependent model computes probability of base-paired columns
–Used for scRNA genes
[Figure; credit: Marty Yanofsky]
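
The site-independent column probability can be sketched with Felsenstein's pruning recursion. The talk does not say which substitution model was used, so this example assumes the Jukes-Cantor model with a uniform root distribution; the nested-tuple tree encoding `(left, left_branch, right, right_branch)` and the species names are likewise illustrative assumptions.

```python
import math

BASES = "ACGT"


def jc_prob(x: str, y: str, t: float) -> float:
    """Jukes-Cantor probability that base x is replaced by y over branch length t."""
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay if x == y else 0.25 - 0.25 * decay


def partial(node, column):
    """Map each base b to P(observed leaves below node | node carries b)."""
    if isinstance(node, str):  # leaf: the nucleotide is observed
        return {b: 1.0 if column[node] == b else 0.0 for b in BASES}
    left, t_left, right, t_right = node
    p_left, p_right = partial(left, column), partial(right, column)
    return {
        x: sum(jc_prob(x, y, t_left) * p_left[y] for y in BASES)
        * sum(jc_prob(x, y, t_right) * p_right[y] for y in BASES)
        for x in BASES
    }


def column_likelihood(tree, column) -> float:
    """Probability of one alignment column, summing over unobserved ancestors."""
    root = partial(tree, column)
    return sum(0.25 * root[b] for b in BASES)


def alignment_log_likelihood(tree, alignment) -> float:
    """Sum of per-column log-likelihoods; columns are treated as independent."""
    length = len(next(iter(alignment.values())))
    return sum(
        math.log(column_likelihood(tree, {s: seq[i] for s, seq in alignment.items()}))
        for i in range(length)
    )


# Hypothetical three-species tree; branch lengths are expected substitutions per site.
tree = (("human", 0.01, "chimp", 0.01), 0.02, "baboon", 0.05)
alignment = {"human": "ACGT", "chimp": "ACGT", "baboon": "ACGA"}
print(alignment_log_likelihood(tree, alignment))
```

Because each column is scored separately, the cost grows linearly with the alignment length. The pairwise site-dependent model on this slide would instead score two base-paired columns jointly, which this sketch does not attempt.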

9 How to find a Phylogenetic Tree?
Given n sequences, we want to find the correct tree topology
–Search works for small n
–Maximum likelihood: choose the tree that maximizes the probability of the alignment
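
A quick count shows why exhaustive search only works for small n. The number of distinct rooted binary trees on n labeled leaves is the double factorial (2n - 3)!!, a standard combinatorial result; the function name below is an illustrative choice.

```python
def rooted_topologies(n_leaves: int) -> int:
    """Number of distinct rooted, leaf-labeled binary trees: (2n - 3)!!."""
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 3 (empty product for n <= 2).
    for odd in range(3, 2 * n_leaves - 2, 2):
        count *= odd
    return count


for n in (3, 5, 10):
    print(n, rooted_topologies(n))
# 3 3
# 5 105
# 10 34459425
```

Maximum likelihood then amounts to scoring each candidate topology against the alignment and keeping the best one, which is why heuristics are needed once n grows past a handful of species.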

10 Biosequence analysis
Phylogenetic trees encapsulate evolutionary time across sequences
Sequence model predicts changes along the length of a particular sequence
–Sequence models are typically HMMs

11 Example: mRNA genes
Suppose we want to identify coding genes with an HMM
–Exon: DNA segment that gets transcribed to mRNA
–Have states in HMM corresponding to exon regions (Alexandersson et al., 2003)
Other types of RNA that get transcribed from DNA but not translated into protein are noncoding
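
As a toy illustration of labeling exon regions with an HMM, the sketch below runs the Viterbi algorithm on a two-state model. Everything numeric here is an assumption made up for the example (the state names, the GC-rich exon emissions, and the transition probabilities); it is far simpler than the generalized pair HMM of Alexandersson et al. and is not the model from the talk.

```python
import math

# Hypothetical two-state model: all probabilities are illustrative only.
STATES = ("exon", "other")
START = {"exon": 0.5, "other": 0.5}
TRANS = {"exon": {"exon": 0.9, "other": 0.1},
         "other": {"exon": 0.1, "other": 0.9}}
EMIT = {"exon": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
        "other": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}}


def viterbi(seq: str) -> list:
    """Return the most probable state path for a non-empty DNA string."""
    # Log-probability of the best path ending in each state at position 0.
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    backpointers = []
    for base in seq[1:]:
        new_score, pointer = {}, {}
        for s in STATES:
            best_prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            pointer[s] = best_prev
            new_score[s] = (score[best_prev] + math.log(TRANS[best_prev][s])
                            + math.log(EMIT[s][base]))
        backpointers.append(pointer)
        score = new_score
    # Trace back from the best final state.
    state = max(STATES, key=score.get)
    path = [state]
    for pointer in reversed(backpointers):
        state = pointer[state]
        path.append(state)
    return path[::-1]


print(viterbi("ATATATGCGCGCGC"))
```

The running time is linear in the sequence length for a fixed number of states, in line with the tractability requirement stated in the abstract.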

12 Structural RNA (scRNA)
A sequence with many self-binding sites, forming a stable structure
Implicated in regulating critical biochemical pathways
[Figure; credit: Michael W. King]

13 Example: Structural RNA
Due to semi-palindromic structure, sequence model would be a PCFG
–Violates the site-independent assumption of phylogenetic trees
–Modify to allow pairwise site-dependencies in addition to non-matches
Gene length can be in the thousands
–Limit the length of scRNA to constant L; time O(L^3 + N*L^2), N = length of multi-alignment [Chakrabarti & Ong, 2004]
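
The cubic term comes from dynamic programming over all subsequence spans, as in CYK parsing of a PCFG. As a simpler stand-in for the PCFG itself, the sketch below uses Nussinov-style base-pair maximization on a single sequence; it is not the talk's model and ignores probabilities and the multiple alignment. The `min_loop` parameter (minimum number of unpaired bases inside a hairpin) is an assumption added for illustration.

```python
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}


def max_base_pairs(rna: str, min_loop: int = 0) -> int:
    """Maximum number of nested base pairs in rna (Nussinov-style DP)."""
    n = len(rna)
    if n == 0:
        return 0
    # best[i][j] = maximum number of pairs within rna[i..j], inclusive.
    best = [[0] * n for _ in range(n)]
    for span in range(1, n):            # O(L) span lengths
        for i in range(n - span):       # O(L) start positions
            j = i + span
            score = best[i][j - 1]      # case 1: position j stays unpaired
            # Case 2: j pairs with k, keeping more than min_loop bases apart.
            for k in range(i, j - min_loop):   # O(L) split points
                if (rna[k], rna[j]) in PAIRS:
                    left = best[i][k - 1] if k > i else 0
                    inner = best[k + 1][j - 1] if k + 1 <= j - 1 else 0
                    score = max(score, left + 1 + inner)
            best[i][j] = score
    return best[0][n - 1]


print(max_base_pairs("GGGAAACCC", min_loop=3))  # 3: a three-pair hairpin stem
```

The three nested loops give the O(L^3) behaviour, which is why capping the structure length at a constant L keeps the overall scan over a long alignment manageable.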

14 Example completed
Can combine the HMM and the PCFG to form a supermodel
Use a generic framework to identify mRNA, scRNA, and other regions [Chakrabarti & Ong, 2004]

15 Phylogenetic shadowing
Use multiple alignment of several closely related genomes
Analysis of data becomes more reliable (Boffelli et al., 2003)
–More genomes reduce probability of false positives
–Still need closely related species to decrease chance of false negatives

16 Conclusions
Phylogenetic shadowing uses a multiple alignment to analyze multiple genomes simultaneously, increasing success
AI techniques have proven useful in Computational Biology
–Still many more problems to solve

17 References
M. Alexandersson, S. Cawley, and L. Pachter. “SLAM: Cross-Species Gene Finding and Alignment with a Generalized Pair Hidden Markov Model.” Genome Research 13 (2003), p 496-502.
D. Boffelli, J. McAuliffe, D. Ovcharenko, K.D. Lewis, I. Ovcharenko, L. Pachter, and E.M. Rubin. “Phylogenetic shadowing of primate sequences to find functional regions of the human genome.” Science 299 (2003), p 1391-1394.
K. Chakrabarti and D.L. Ong. “Computational Identification of Noncoding RNA Genes through Phylogenetic Shadowing.” ACM/ISCB RECOMB 8 (2004), poster.
K. Chakrabarti and L. Pachter. “Visualization of multiple genome annotations and alignments with the K-BROWSER.” Genome Research 14 (2004), p 716-720.
R. Durbin, S. Eddy, A. Krogh, and G. Mitchison. “Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids.” New York: Cambridge University Press, 1998.
