Gene Prediction: Past, Present, and Future Sam Gross.

Presentation transcript:

Gene Prediction: Past, Present, and Future Sam Gross

Genes
Gene → RNA → Protein
Proteins are about 500 AA long
Genes are about 1500bp long
(Diagram: a gene drawn from the start codon ATG to a stop codon: TAG, TAA, or TGA)

ORF Scanning
In “lower” organisms, genes are contiguous
We expect about 1 stop codon per 64bp
–3 of the 64 possible codons are stops, so a random reading frame hits one about every 21 codons
If we see a long ORF, it’s probably a gene!
–And conversely, all genes are long ORFs
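A minimal sketch of ORF scanning as described on this slide, assuming forward-strand scanning only and an ATG-to-stop definition of an ORF. The function name `find_orfs` and the `min_len` threshold are illustrative choices, not part of the talk.

```python
STOP_CODONS = {"TAG", "TAA", "TGA"}

def find_orfs(seq, min_len=300):
    """Return (start, end) pairs of forward-strand ORFs, end exclusive.

    An ORF here runs from the first ATG in a frame to the next in-frame
    stop codon. min_len (in bp, stop codon included) is a hypothetical cutoff.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None  # position of the currently open ATG, if any
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if i + 3 - start >= min_len:
                    orfs.append((start, i + 3))
                start = None  # close the ORF and look for the next ATG
    return sorted(orfs)

# Toy usage: one 18bp ORF starting at position 2
toy = "CC" + "ATG" + "GCT" * 4 + "TAA" + "GG"
print(find_orfs(toy, min_len=18))
```

With stops expected roughly every 64bp by chance, a threshold of a few hundred bp already filters out most random ORFs; the right cutoff depends on the organism.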

Introns
(Diagram: a gene from ATG to a stop codon (TGA, TAA, or TAG), interrupted by an intron that begins with GT and ends with AG)
Drosophila: 3.4 introns per gene on average
–Mean intron length 475bp, mean exon length 397bp
Human: 8.8 introns per gene on average
–Mean intron length 4400bp, mean exon length 165bp
ORF scanning is defeated

Splicing
(Diagram: the intron between GT and AG is spliced out, joining the flanking exons between ATG and the stop codon)

Needles in a Haystack
Human genome is about 3.2Gbp
20,000 – 25,000 genes
78% intergenic, 20% introns, 2% coding

Gotta Find ‘Em All
60-85% of all human genes have been found, mostly by random EST sequencing
–This probably won’t work for the rest
For most genes, only one splice variant is known
If we can computationally predict a gene, we have a cheap experiment (RT-PCR) to verify it

Looking For Clues
Signals used by the cell
–99% of introns begin with GT, end with AG
–0.8% of introns begin with GC, end with AG
–Gene begins with ATG
–Gene ends with TAG, TAA, or TGA
Other properties of genes
–Exons have characteristic lengths
–Base composition of exons is characteristic due to genetic code
–Exons tend to be conserved between species
–Pattern of conservation is three-periodic

Three-Periodicity
Most amino acids can be coded for by more than one DNA triplet (codon)
Usually, the degeneracy is in the last position
Species  Codons   Amino acids
Human    CCT GTT  Proline, Valine
Mouse    CCA GTC  Proline, Valine
Rat      CCA GTC  Proline, Valine
Dog      CCG GTA  Proline, Valine
Chicken  CCC GTG  Proline, Valine
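The three-periodic pattern can be checked directly on the five sequences from this slide. The sketch below counts, for each codon position, how often the other species differ from human; the helper name `mismatches_by_codon_position` is made up for illustration.

```python
# Aligned coding sequences from the slide (two codons: Proline, Valine)
ALIGNED = {
    "Human": "CCTGTT",
    "Mouse": "CCAGTC",
    "Rat": "CCAGTC",
    "Dog": "CCGGTA",
    "Chicken": "CCCGTG",
}

def mismatches_by_codon_position(reference, others):
    """Count mismatches against the reference at codon positions 1, 2 and 3."""
    counts = [0, 0, 0]
    for other in others:
        for i, (a, b) in enumerate(zip(reference, other)):
            if a != b:
                counts[i % 3] += 1  # i % 3 is the position within the codon
    return counts

human = ALIGNED["Human"]
rest = [seq for name, seq in ALIGNED.items() if name != "Human"]
print(mismatches_by_codon_position(human, rest))  # all differences fall in position 3
```

In this example every one of the eight differences sits at the third codon position, which is the periodic signal a coding-region conservation model can exploit.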

Hidden Markov Models
The de facto standard for gene prediction
Probabilistic finite state machine
Transition to a state, emit a character, transition to a new state
–Many independence assumptions
(Diagram: two states, CDS and NC, emitting the sequence ACG)

HMMs For Gene Prediction
Generative model
–Define P(X, Y) as a product of many independent terms
P(X = ACG, Y = NC NC CDS) = P(start in noncoding) * P(noncoding emits A) * P(noncoding transitions to noncoding) * P(noncoding emits C) * P(noncoding transitions to coding) * P(coding emits G)
Terms are of the forms P(y_i | y_{i-1}) and P(x_i | y_i)
–Trained by collecting counts
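The product on this slide can be written out as a short function. This is a sketch with made-up probabilities: the tables `START`, `TRANS`, and `EMIT` are hypothetical values chosen for illustration, not parameters of any real gene finder.

```python
import math

# Hypothetical two-state model (NC = noncoding, CDS = coding)
START = {"NC": 0.9, "CDS": 0.1}
TRANS = {"NC": {"NC": 0.9, "CDS": 0.1}, "CDS": {"NC": 0.1, "CDS": 0.9}}
EMIT = {
    "NC": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
    "CDS": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
}

def joint_log_prob(x, y):
    """log P(X, Y): one start term, then one transition and one emission term per position."""
    logp = math.log(START[y[0]]) + math.log(EMIT[y[0]][x[0]])
    for i in range(1, len(x)):
        logp += math.log(TRANS[y[i - 1]][y[i]])  # P(y_i | y_{i-1})
        logp += math.log(EMIT[y[i]][x[i]])       # P(x_i | y_i)
    return logp

# The slide's example: sequence ACG labeled NC, NC, CDS
print(math.exp(joint_log_prob("ACG", ["NC", "NC", "CDS"])))
```

Working in log space avoids underflow on genome-length sequences, where the raw product would be far below floating-point range.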

HMMs For Gene Prediction
To predict genes given a sequence X, calculate
argmax_Y P(Y | X) = argmax_Y P(X, Y) / P(X) = argmax_Y P(X, Y)
–The last step holds because P(X) does not depend on Y
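The argmax over labelings is usually computed with the Viterbi algorithm. The sketch below is a standard textbook version, not code from any of the programs discussed; the model tables are hypothetical toy values.

```python
import math

# Hypothetical two-state model (NC = noncoding, CDS = coding)
STATES = ["NC", "CDS"]
START = {"NC": 0.9, "CDS": 0.1}
TRANS = {"NC": {"NC": 0.9, "CDS": 0.1}, "CDS": {"NC": 0.1, "CDS": 0.9}}
EMIT = {
    "NC": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
    "CDS": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
}

def viterbi(x, states, start, trans, emit):
    """Return (best_path, log P(X, best_path)) by dynamic programming."""
    # best[s]: log-probability of the best labeling of x[:i+1] that ends in state s
    best = {s: math.log(start[s]) + math.log(emit[s][x[0]]) for s in states}
    backpointers = []
    for ch in x[1:]:
        prev, best, ptr = best, {}, {}
        for s in states:
            r = max(states, key=lambda p: prev[p] + math.log(trans[p][s]))
            best[s] = prev[r] + math.log(trans[r][s]) + math.log(emit[s][ch])
            ptr[s] = r
        backpointers.append(ptr)
    last = max(states, key=lambda s: best[s])
    path = [last]
    for ptr in reversed(backpointers):  # walk back from the final state
        path.append(ptr[path[-1]])
    return path[::-1], best[last]

path, logp = viterbi("ATGCGCGCAT", STATES, START, TRANS, EMIT)
print(path, logp)
```

Each position considers every (previous state, current state) pair once, which is where the O(N^2 L) cost for N states and sequence length L comes from.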

Generalized Hidden Markov Models
Like an HMM, but state durations are explicit
Transition to a state, pick a duration d, emit d characters, transition to a new state
Dynamic programming algorithm complexity goes from O(N^2 L) to O(N^2 L K)
–N is the number of states, L is the sequence length
–K is the maximum state duration
–Not so bad in practice
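A sketch of the segment-based dynamic program, assuming explicit duration tables and no self-transitions. All names and numbers are hypothetical. For brevity the emission score of each segment is recomputed with a sum, which adds a factor the O(N^2 L K) bound avoids by using prefix sums.

```python
import math

def segment_viterbi(x, states, start, trans, dur, emit, max_dur):
    """Best segmentation of x into (state, begin, end) segments under a toy GHMM."""
    n, neg_inf = len(x), float("-inf")
    # best[j][s]: best log-prob of labeling x[:j] with a segment of state s ending at j
    best = [{s: neg_inf for s in states} for _ in range(n + 1)]
    back = [{s: None for s in states} for _ in range(n + 1)]
    for j in range(1, n + 1):
        for s in states:
            for d in range(1, min(max_dur, j) + 1):  # candidate durations
                p_d = dur[s].get(d, 0.0)
                if p_d == 0.0:
                    continue
                i = j - d
                seg = math.log(p_d) + sum(math.log(emit[s][c]) for c in x[i:j])
                if i == 0:
                    cands = [(math.log(start[s]) + seg, None)]
                else:
                    cands = [(best[i][p] + math.log(trans[p][s]) + seg, p)
                             for p in states
                             if trans[p].get(s, 0.0) > 0 and best[i][p] > neg_inf]
                for score, p in cands:
                    if score > best[j][s]:
                        best[j][s], back[j][s] = score, (i, p)
    s = max(states, key=lambda t: best[n][t])
    score, j, segments = best[n][s], n, []
    while j > 0:  # trace segments back to the start of the sequence
        i, p = back[j][s]
        segments.append((s, i, j))
        j, s = i, p
    return segments[::-1], score

# Hypothetical toy model: CDS segments always last 3bp, NC segments 1 or 2bp
STATES = ["NC", "CDS"]
START = {"NC": 0.5, "CDS": 0.5}
TRANS = {"NC": {"CDS": 1.0}, "CDS": {"NC": 1.0}}
DUR = {"NC": {1: 0.5, 2: 0.5}, "CDS": {3: 1.0}}
EMIT = {"NC": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
        "CDS": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1}}
print(segment_viterbi("AGCGA", STATES, START, TRANS, DUR, EMIT, max_dur=3))
```

Explicit durations let the model encode facts such as characteristic exon lengths, which a plain HMM can only approximate with geometric length distributions.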

Predicting Genes With HMMs
Given a sequence, we can calculate the most likely annotation
(Diagram: the sequence GGTGAGGTGACCAAGAACGTGTTGACAGTA labeled with the states Intergenic, Single Exon, Initial Exon, Internal Exon, Final Exon, and Intron)

The Past: GENSCAN
Chris Burge, Stanford, 1997
Before the Human Genome Project
–No alignments available
–People still thought there were 100,000 human genes

The GENSCAN Model

Output probabilities for NC and CDS depend on previous 5 bases (5th-order)
–P(X_i | X_{i-1}, X_{i-2}, X_{i-3}, X_{i-4}, X_{i-5})
Each CDS frame has its own model
Special 2nd-order positional models for start codon, stop codon, and acceptor site
Even fancier model for donor sites
–Maximal dependence decomposition (MDD)
–Long-range dependencies
Separate model for different isochores
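A k-th order output model of the kind described here can be estimated by counting which base follows each k-mer context. The sketch below is a generic illustration, not GENSCAN's training code; `train_markov` and the pseudocount smoothing are assumptions.

```python
from collections import defaultdict

def train_markov(seqs, order=5, pseudocount=1.0, alphabet="ACGT"):
    """Estimate P(X_i | previous `order` bases) from training sequences by counting.

    Returns a function prob(context, base). The pseudocount is a hypothetical
    smoothing choice so that unseen contexts still get a valid distribution.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for seq in seqs:
        for i in range(order, len(seq)):
            counts[seq[i - order:i]][seq[i]] += 1

    def prob(context, base):
        row = counts.get(context, {})  # .get avoids creating empty entries
        total = sum(row.values()) + pseudocount * len(alphabet)
        return (row.get(base, 0.0) + pseudocount) / total

    return prob

# Toy usage with a 2nd-order model so the counts are easy to follow
prob = train_markov(["ACGACGACG"], order=2)
print(prob("AC", "G"), prob("AC", "T"), prob("TT", "A"))
```

A 5th-order model over DNA has 4^5 = 1024 contexts per state, so it needs a reasonably large training set; frame-specific CDS models multiply that by three.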

GENSCAN Performance
First program to do well on realistic sequences
–Multiple genes in both orientations
Pretty good sensitivity, poor specificity
–70% exon Sn, 40% exon Sp
Not enough exons per gene
Was the best gene predictor for about 4 years

Comparative Gene Prediction
(Diagram: two aligned examples, A and B, each spanning an exon and an intron)
A
Human    A A G G T G
Mouse    A A G G T G
Chicken  A A G G T G
B
Human    A A G G T G
Mouse    A A T G T G
Chicken  A A _ A C G

The Recent Past: TWINSCAN
Korf, Flicek, Duan, Brent, Washington University in St. Louis, 2001
Uses an informant sequence to help predict genes
–For human, informant is normally mouse
Informant is represented as a conservation sequence over three characters
–Match: |
–Mismatch: :
–Unaligned: .
Conservation sequence assumed independent of target sequence
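The three-character encoding can be sketched as a function over a pairwise alignment. The input convention (`_` for a gap, `.` for an unaligned informant position) and the treatment of gaps (informant gaps count as mismatches, target gaps are dropped) are assumptions made for this illustration, as is the name `conservation_sequence`.

```python
def conservation_sequence(target_row, informant_row):
    """Encode an aligned informant as '|' (match), ':' (mismatch), '.' (unaligned).

    Both rows must have the same length. '_' marks a gap; '.' in the
    informant row marks a position with no alignment (hypothetical convention).
    """
    if len(target_row) != len(informant_row):
        raise ValueError("alignment rows must have equal length")
    out = []
    for t, q in zip(target_row, informant_row):
        if t == "_":
            continue  # no target base in this column, so nothing to annotate
        if q == ".":
            out.append(".")
        elif q == t:
            out.append("|")
        else:
            out.append(":")  # substitution, or a gap in the informant (assumption)
    return "".join(out)

print(conservation_sequence("AAGGTG", "AATGTG"))
print(conservation_sequence("AAGGTG", "AA_ACG"))
```

The result has exactly one symbol per target base, so it can be emitted by the HMM in lockstep with the target sequence.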

The TWINSCAN Model
Just like GENSCAN, except adds models for conservation sequence
5th-order models for CDS and NC, 2nd-order models for start and stop codons and splice sites
–One CDS model for all frames
Many informants tried, but mouse seems to be at the “sweet spot”

TWINSCAN Performance
Slightly more sensitive than GENSCAN, much more specific
–Exon sensitivity/specificity about 75%
Much better at the gene level
–Most genes are mostly right, about 25% exactly right
Was the best gene predictor for about 4 years

The Present: N-SCAN
Gross and Brent, Washington University in St. Louis, 2005
If one informant sequence is good, let’s try more!
Also several other improvements on TWINSCAN

N-SCAN Improvements
Multiple informants
Richer models of sequence evolution
Frame-specific CDS conservation model
Conserved noncoding sequence model
5’ UTR structure model

HMM Outputs
GENSCAN: target sequence only
TWINSCAN: target sequence plus conservation sequence
Target        GGTGAGGTGACCAAGAACGTGTTGACAGTA
Conservation  |||:||:||:|||||:||||||||
N-SCAN: target sequence plus aligned informants
Target        GGTGAGGTGACCAAGAACGTGTTGACAGTA
Informant1    GGTCAGC___CCAAGAACGTGTAG
Informant2    GATCAGC___CCAAGAACGTGTAG
Informant3    GGTGAGCTGACCAAGATCGTGTTGACACAA
...

N-SCAN State Diagram

Two-Component Output Distributions
Target sequence model
Phylogenetic model for informants
Product gives the probability of a multiple alignment column

Phylogenetic Bayesian Network Models

Graph Transformation

Inference
Slightly-modified version of Felsenstein’s algorithm
At each of the O(N) nodes, we calculate 6^{o+1} summations over 6^{o+1} values
–o is the model order; the 6 symbols are the four bases plus gap and unaligned
Total time complexity is O(N 6^{2(o+1)})
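For intuition, here is the unmodified pruning recursion for an order-0 model on a small tree. It is a simplified sketch rather than N-SCAN's variant: it uses a two-letter alphabet and one shared conditional probability table for every edge, both hypothetical simplifications.

```python
def subtree_likelihood(node, cpd, alphabet):
    """Return {a: P(observed leaves below node | node has value a)}.

    A node is either a leaf (a one-character string, the observed symbol)
    or a tuple of child nodes. cpd[a][b] = P(child = b | parent = a),
    shared by every edge here for simplicity.
    """
    if isinstance(node, str):
        return {a: 1.0 if a == node else 0.0 for a in alphabet}
    result = {a: 1.0 for a in alphabet}
    for child in node:
        below = subtree_likelihood(child, cpd, alphabet)
        for a in alphabet:
            # one summation over all child values, for each parent value
            result[a] *= sum(cpd[a][b] * below[b] for b in alphabet)
    return result

# Hypothetical toy: alphabet {A, C}, tree with one leaf and one internal child
ALPHABET = "AC"
CPD = {"A": {"A": 0.9, "C": 0.1}, "C": {"A": 0.2, "C": 0.8}}
TREE = ("A", ("A", "C"))
print(subtree_likelihood(TREE, CPD, ALPHABET))
```

Each edge costs one summation per parent value over all child values, so with 6^{o+1} values per node the per-node work is 6^{2(o+1)}, matching the total on the slide.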

Training
Simple with labeled multiple alignment of all sequences
Can use known genes as a labeling
Don’t know ancestral genome sequences
–Treat them as missing data and use EM

CPD Parameterizations
Each Bayesian network of order o has (2N-1)(6^{o+1})(6^{o+1}-1) free parameters
We can reduce this number by restricting the form of the CPDs
Partially reversible models
–Relative frequency of DNA k-mers remains constant as sequence evolves
–Gaps and unaligned regions introduced over time
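Evaluating the parameter count for a few orders shows why the CPDs need to be restricted. The function below just evaluates the slide's formula; reading N as the number of aligned sequences is an assumption, and `free_parameters` is an illustrative name.

```python
def free_parameters(n, order, alphabet_size=6):
    """Evaluate (2N-1) * 6^(o+1) * (6^(o+1) - 1) from the slide.

    n is taken to be the number of aligned sequences (an assumption about
    what N denotes); alphabet_size=6 covers the four bases, gap, and unaligned.
    """
    k = alphabet_size ** (order + 1)
    return (2 * n - 1) * k * (k - 1)

# Growth with model order for a four-sequence alignment
for o in range(3):
    print(o, free_parameters(4, o))
```

Going from order 0 to order 2 multiplies the count by more than a thousand, which is far more than can be estimated reliably from a set of known genes.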

N-SCAN Phylogenetic Models vs. Traditional Phylogenetic Models
Root (target) node is observed
–Can use existing single-sequence models
–Can use higher-order models
–Can estimate target sequence model optimally

N-SCAN Phylogenetic Models vs. Traditional Phylogenetic Models
No assumption of homogeneous substitution process
–Gaps and unaligned regions can be treated naturally
–Robust against function-changing mutation, alignment error, and sequencing error
–The price is many more parameters

Conservation Score Coefficient
N-SCAN uses log-likelihood scores internally
The score of a position i under state S is [formula missing from transcript]
Values of k between 0.3 and 0.6 result in the best performance
–Performance is roughly constant in this range

Whole-Genome Human Gene Prediction
Annotations used were cleaned RefSeqs
–16,259 genes
–20,837 transcripts
N-SCAN used human, mouse, rat, chicken alignment

Exact Exon Accuracy

Exact Gene Accuracy

Intron Sensitivity By Length

Human Informant Effectiveness

Drosophila Informant Effectiveness

The Future(?): CONTRAST
New gene predictor currently in the works
Based not on a generalized HMM, but a semi-Markov conditional random field (SCRF)

HMMs For Gene Prediction
Generative model
–Define P(X, Y) as a product of many independent terms
P(X = ACG, Y = NC NC CDS) = P(start in noncoding) * P(noncoding emits A) * P(noncoding transitions to noncoding) * P(noncoding emits C) * P(noncoding transitions to coding) * P(coding emits G)
Terms are of the forms P(y_i | y_{i-1}) and P(x_i | y_i)
–Trained by collecting counts

HMMs For Gene Prediction
To predict genes given a sequence X, calculate
argmax_Y P(Y | X) = argmax_Y P(X, Y) / P(X) = argmax_Y P(X, Y)
Advantage: simplicity
–Extremely fast training, efficient inference
Disadvantage: simplicity
–Makes many unwarranted independence assumptions
–Inaccurate model will get us into trouble

When HMMs Go Wrong
Normal HMM training optimizes the wrong function
–We use P(Y | X) for prediction, but we’re optimizing P(X, Y) = P(Y | X) P(X)
–This means we may prefer parameters that lead to worse predictions if they assign a higher probability to the sequence
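Written out, the mismatch between the two objectives looks as follows. The parameter vector θ and the training-example index n are notation introduced here for illustration; they do not appear on the slides.

```latex
\begin{align*}
\hat{\theta}_{\text{joint}}
  &= \arg\max_{\theta} \sum_{n} \log P_{\theta}\bigl(X^{(n)}, Y^{(n)}\bigr) \\
  &= \arg\max_{\theta} \sum_{n} \Bigl[ \log P_{\theta}\bigl(Y^{(n)} \mid X^{(n)}\bigr)
     + \log P_{\theta}\bigl(X^{(n)}\bigr) \Bigr] \\
\hat{\theta}_{\text{cond}}
  &= \arg\max_{\theta} \sum_{n} \log P_{\theta}\bigl(Y^{(n)} \mid X^{(n)}\bigr)
\end{align*}
```

The extra term log P(X) in the joint objective rewards parameters that explain the sequence itself, even when that costs accuracy in the term that prediction actually uses.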

When HMMs Go Wrong
(Diagram: emission distributions of example models over three alignment symbols)
A = Conserved triplet
B = Synonymous substitution
C = Nonsynonymous substitution
State  A    B    C
NC     3%   2%   95%
CDS    49%  49%  2%
NC     3%   2%   95%
CDS    3%   95%  2%
NNC    2%   2%   96%
CNS    96%  2%   2%
CDS    49%  49%  2%
Example sequences: …CCCCCCCCCCCCCAAAAAAAAAACCCC…CCCCCCCBBABAAABBABBABCC…

Can We Fix It?
Directly optimize P(Y | X)
No closed form solution
–But function and gradient can be calculated efficiently using DP
If we’re going to numerically optimize anyway, might as well switch to a more expressive model

CRFs For Gene Prediction
Discriminative model
–Define P(Y | X) as a product of many terms
–Individual terms are not probabilities!
–Terms are of the form f_j(y_{i-1}, y_i, X, i) w_j
The Good
–Independence assumptions much weaker than in HMMs
–Inference complexity is the same as for HMM
The Bad
–Training requires numerical optimization of (convex) likelihood function
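A sketch of how a linear-chain CRF turns weighted features into P(Y | X), using the standard form in which the path score is the sum of w_j f_j over positions and the normalizer is computed by the forward algorithm. The two feature functions and their weights are invented for illustration and are not features of any real gene predictor.

```python
import math

def logsumexp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def crf_log_prob(x, y, states, features, weights):
    """log P(Y | X) = score(X, Y) - log Z(X) for a linear-chain CRF."""
    def score(prev, cur, i):
        # weighted sum of features at position i; prev is None at i = 0
        return sum(w * f(prev, cur, x, i) for f, w in zip(features, weights))

    path_score = sum(score(y[i - 1] if i > 0 else None, y[i], i)
                     for i in range(len(x)))
    # Forward algorithm in log space gives log Z(X), the sum over all labelings
    alpha = {s: score(None, s, 0) for s in states}
    for i in range(1, len(x)):
        alpha = {s: logsumexp([alpha[p] + score(p, s, i) for p in states])
                 for s in states}
    return path_score - logsumexp(list(alpha.values()))

# Hypothetical features: each sees both labels and the whole sequence X
def gc_in_exon(prev, cur, x, i):
    return 1.0 if cur == "exon" and x[i] in "GC" else 0.0

def same_state(prev, cur, x, i):
    return 1.0 if prev == cur else 0.0

STATES = ["exon", "intron"]
FEATURES = [gc_in_exon, same_state]
WEIGHTS = [1.0, 0.5]
print(math.exp(crf_log_prob("GCAT", ["exon", "exon", "intron", "intron"],
                            STATES, FEATURES, WEIGHTS)))
```

Because only the final ratio has to be a probability, individual features can overlap or look at arbitrary parts of X without any independence assumption being violated.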

The Math CRFs HMMs

HMMs vs. CRFs
(Diagram: graphical models of an HMM and a CRF, each over labels y_1 … y_6 and observations x_1 … x_6)

HMMs vs. CRFs
HMM-style “features”
–Last state is exon, current state is intron
–Current state is exon, current sequence character is “C”
CRF-style features
–Current state is exon, CG percent in 100Kbp window is between 40% and 50%, at least one CpG island predicted within 10Kbp
–Current state is exon, 3 unspliced ESTs with at least 95% identity aligned near current position
–Current state is exon, 1 spliced EST with at least 95% identity aligned near current position

Semi-Markov CRFs
Semi-Markov CRFs are to CRFs as generalized HMMs (or semi-HMMs) are to HMMs
Instead of assigning labels to each position, assign labels to segments
Features are f(y_{i-1}, y_i, X, i, j)

Future Directions
SVM-based splice site models that use alignment information
–Splice site models in current gene predictors are pretty primitive
Alternative splicing!
–Not yet handled well
–Very poor experimental coverage of transcriptome