Gene Prediction: Past, Present, and Future Sam Gross.

Genes
Gene → RNA → Protein
Proteins are about 500 AA long
Genes are about 1500 bp long
Start codon: ATG; stop codons: TAG, TAA, TGA

ORF Scanning
In “lower” organisms, genes are contiguous
We expect about 1 stop codon per 64 bp
If we see a long ORF, it’s probably a gene!
–And conversely, all genes are long ORFs
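
The ORF-scanning idea above can be sketched in a few lines of Python. This is a minimal illustration, not a production gene finder: the function name `find_orfs`, the `min_len_bp` threshold, and the choice to scan only the forward strand are assumptions made for the example.

```python
STOP_CODONS = {"TAG", "TAA", "TGA"}
START_CODON = "ATG"

def find_orfs(seq, min_len_bp=300):
    """Return (start, end, frame) for forward-strand ORFs of at least min_len_bp.

    An ORF here runs from an ATG to the first in-frame stop codon, inclusive.
    Coordinates are 0-based and end-exclusive. The threshold is an assumption.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None  # position of the first ATG seen since the last stop codon
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START_CODON:
                start = i
            elif start is not None and codon in STOP_CODONS:
                end = i + 3
                if end - start >= min_len_bp:
                    orfs.append((start, end, frame))
                start = None
    return orfs

# Toy sequence: one 36 bp ORF in frame 2.
toy = "CC" + "ATG" + "GCT" * 10 + "TAA" + "GG"
print(find_orfs(toy, min_len_bp=30))
```

Because a random frame hits a stop codon roughly every 64 bp, a threshold of a few hundred bp already filters out most chance ORFs in intron-free genomes.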

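Introns
[Figure: gene diagram with the ATG start codon, introns that begin with GT and end with AG, and the TGA/TAA/TAG stop codons]
Drosophila: 3.4 introns per gene on average
–Mean intron length 475 bp, mean exon length 397 bp
Human: 8.8 introns per gene on average
–Mean intron length 4400 bp, mean exon length 165 bp
ORF scanning is defeated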

Splicing
[Figure: gene diagrams marking the ATG start codon, GT and AG splice sites, and the TGA/TAA/TAG stop codons]

Needles in a Haystack
Human genome is about 3.2 Gbp
20,000 – 25,000 genes
78% intergenic, 20% introns, 2% coding

Gotta Find ‘Em All
60-85% of all human genes have been found, mostly by random EST sequencing
–This probably won’t work for the rest
For most genes, only one splice variant is known
If we can computationally predict a gene, we have a cheap experiment (RT-PCR) to verify it

Looking For Clues
Signals used by the cell
–99% of introns begin with GT, end with AG
–0.8% of introns begin with GC, end with AG
–Gene begins with ATG
–Gene ends with TAG, TAA, or TGA
Other properties of genes
–Exons have characteristic lengths
–Base composition of exons is characteristic due to genetic code
–Exons tend to be conserved between species
Pattern of conservation is three-periodic

Three-Periodicity
Most amino acids can be coded for by more than one DNA triplet (codon)
Usually, the degeneracy is in the last position
Human    CCT GTT  (Proline, Valine)
Mouse    CCA GTC  (Proline, Valine)
Rat      CCA GTC  (Proline, Valine)
Dog      CCG GTA  (Proline, Valine)
Chicken  CCC GTG  (Proline, Valine)
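
To make the three-periodic pattern concrete, the short sketch below counts variable alignment columns by codon position, using the five sequences from the slide. The function names are hypothetical, and the alignment is assumed to start in frame at column 0.

```python
# Codons from the slide: every species encodes Proline followed by Valine.
ALIGNMENT = {
    "Human":   "CCTGTT",
    "Mouse":   "CCAGTC",
    "Rat":     "CCAGTC",
    "Dog":     "CCGGTA",
    "Chicken": "CCCGTG",
}

def column_conservation(alignment):
    """Return one boolean per column: True if all species share the same base."""
    seqs = list(alignment.values())
    return [len(set(col)) == 1 for col in zip(*seqs)]

def mismatches_by_codon_position(alignment):
    """Count variable columns at codon positions 1, 2 and 3 (frame starts at column 0)."""
    counts = [0, 0, 0]
    for i, conserved in enumerate(column_conservation(alignment)):
        if not conserved:
            counts[i % 3] += 1
    return counts

print(mismatches_by_codon_position(ALIGNMENT))
```

In this toy alignment all variation falls on the third codon position, which is the signal a comparative gene predictor can exploit.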

Hidden Markov Models
The de facto standard for gene prediction
Probabilistic finite state machine
Transition to a state, emit a character, transition to a new state
–Many independence assumptions
[Figure: two-state HMM with states NC and CDS emitting the sequence ACG]

HMMs For Gene Prediction
Generative model
–Define P(X, Y) as a product of many independent terms
P(ACG) = P(start in noncoding) * P(noncoding emits A) * P(noncoding transitions to noncoding) * P(noncoding emits C) * P(noncoding transitions to coding) * P(coding emits G)
Terms are of the forms P(y_i | y_{i-1}) and P(x_i | y_i)
–Trained by collecting counts

HMMs For Gene Prediction
To predict genes given a sequence X, calculate
argmax_Y P(Y | X) = argmax_Y P(X, Y) / P(X) = argmax_Y P(X, Y)
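
The argmax over labelings is usually computed with the Viterbi algorithm. The sketch below is a minimal log-space version for a two-state noncoding/coding HMM. All probabilities in the example are made-up illustrative values, not parameters from any real gene predictor, and the code assumes no probability is exactly zero.

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most likely state path for obs under an HMM (log-space Viterbi)."""
    # best[i][s] = log-probability of the best path that ends in state s at position i
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}]
    back = [{}]
    for i in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[i - 1][p] + math.log(trans_p[p][s])) for p in states),
                key=lambda t: t[1],
            )
            best[i][s] = score + math.log(emit_p[s][obs[i]])
            back[i][s] = prev
    # Trace back from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for i in range(len(obs) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return path[::-1]

# Illustrative parameters (assumptions): noncoding is AT-rich, coding is GC-rich.
STATES = ["NC", "CDS"]
START = {"NC": 0.9, "CDS": 0.1}
TRANS = {"NC": {"NC": 0.9, "CDS": 0.1}, "CDS": {"NC": 0.1, "CDS": 0.9}}
EMIT = {
    "NC":  {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
    "CDS": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
}

print(viterbi("AAAAGCGCGC", STATES, START, TRANS, EMIT))
```

The running time is O(N^2 L) for N states and a sequence of length L, since every position considers every pair of previous and current states.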

Generalized Hidden Markov Models
Like an HMM, but state durations are explicit
Transition to a state, pick a duration d, emit d characters, transition to a new state
Dynamic programming algorithm complexity goes from O(N^2 L) to O(N^2 L K) for N states and sequence length L
–K is the maximum state duration
–Not so bad in practice
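
A rough sketch of Viterbi decoding for a generalized HMM follows, to show where the extra factor of K comes from: each cell of the dynamic program also loops over the duration of the last segment. The function name, the duration tables, and the toy parameters are all assumptions made for the example. It further assumes that every duration from 1 to `max_dur` has nonzero probability and that consecutive segments have different states.

```python
import math

def ghmm_viterbi(obs, states, start_p, trans_p, dur_p, emit_p, max_dur):
    """Most likely segmentation of obs as a list of (state, start, end), end exclusive."""
    n = len(obs)
    neg_inf = float("-inf")
    # prefix[s][j] = summed log emission probability of obs[:j] under state s,
    # so the emission score of any segment is a difference of two prefix sums.
    prefix = {s: [0.0] for s in states}
    for s in states:
        for c in obs:
            prefix[s].append(prefix[s][-1] + math.log(emit_p[s][c]))
    # best[j][s] = best log-probability of obs[:j] with the last segment in state s
    best = [{s: neg_inf for s in states} for _ in range(n + 1)]
    back = [{s: None for s in states} for _ in range(n + 1)]
    for j in range(1, n + 1):
        for s in states:
            for d in range(1, min(max_dur, j) + 1):  # duration of the last segment
                i = j - d
                seg = math.log(dur_p[s][d]) + prefix[s][j] - prefix[s][i]
                if i == 0:
                    cands = [(math.log(start_p[s]) + seg, None)]
                else:
                    cands = [(best[i][p] + math.log(trans_p[p][s]) + seg, p)
                             for p in states if p != s]
                for score, p in cands:
                    if score > best[j][s]:
                        best[j][s], back[j][s] = score, (i, p)
    # Trace back segment by segment from the best final state.
    s, j, segments = max(states, key=lambda t: best[n][t]), n, []
    while j > 0:
        i, p = back[j][s]
        segments.append((s, i, j))
        j, s = i, p
    return segments[::-1]

# Illustrative parameters (assumptions): uniform durations of 1 to 6 bases.
G_STATES = ["NC", "CDS"]
G_START = {"NC": 0.9, "CDS": 0.1}
G_TRANS = {"NC": {"CDS": 1.0}, "CDS": {"NC": 1.0}}
G_DUR = {s: {d: 1.0 / 6 for d in range(1, 7)} for s in G_STATES}
G_EMIT = {
    "NC":  {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
    "CDS": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
}

print(ghmm_viterbi("AAAAGCGCGC", G_STATES, G_START, G_TRANS, G_DUR, G_EMIT, max_dur=6))
```

Real gene predictors replace the uniform duration table with empirical exon and intron length distributions, which is the main reason to use explicit durations in the first place.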

Predicting Genes With HMMs
Given a sequence, we can calculate the most likely annotation
[Figure: state diagram with states Intergenic, Single Exon, Initial Exon, Internal Exon, Final Exon, and Intron, annotating the sequence GGTGAGGTGACCAAGAACGTGTTGACAGTA]

The Past: GENSCAN
Chris Burge, Stanford, 1997
Before the Human Genome Project
–No alignments available
–People still thought there were 100,000 human genes

The GENSCAN Model

Output probabilities for NC and CDS depend on previous 5 bases (5th-order)
–P(x_i | x_{i-1}, x_{i-2}, x_{i-3}, x_{i-4}, x_{i-5})
Each CDS frame has its own model
Special 2nd-order positional models for start codon, stop codon, and acceptor site
Even fancier model for donor sites
–Maximal dependence decomposition (MDD)
–Long-range dependencies
Separate model for different isochores
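
A 5th-order output model can be trained by counting, as the earlier HMM slides describe. The sketch below is a simplified illustration only: it fits a single model with add-one smoothing, whereas GENSCAN keeps separate models per coding frame and per isochore. The names `train_markov` and `log_likelihood` and the pseudocount are assumptions.

```python
import math
from collections import defaultdict

def train_markov(seqs, order=5, pseudocount=1.0):
    """Estimate P(x_i | previous `order` bases) from training sequences by counting."""
    counts = defaultdict(lambda: defaultdict(float))
    for seq in seqs:
        for i in range(order, len(seq)):
            counts[seq[i - order:i]][seq[i]] += 1

    def prob(context, base):
        # Add-one style smoothing over the four bases (an assumption of this sketch).
        c = counts.get(context, {})
        total = sum(c.values()) + 4 * pseudocount
        return (c.get(base, 0.0) + pseudocount) / total

    return prob

def log_likelihood(seq, prob, order=5):
    """Log-probability of seq[order:] given its first `order` bases."""
    return sum(math.log(prob(seq[i - order:i], seq[i])) for i in range(order, len(seq)))

prob = train_markov(["ACGTACGTACGT"], order=5)
print(prob("ACGTA", "C"), prob("AAAAA", "C"))
```

A gene predictor would train one such model on coding sequence and another on noncoding sequence, then compare the two log-likelihoods inside the dynamic program.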

GENSCAN Performance
First program to do well on realistic sequences
–Multiple genes in both orientations
Pretty good sensitivity, poor specificity
–70% exon Sn, 40% exon Sp
Not enough exons per gene
Was the best gene predictor for about 4 years

Comparative Gene Prediction
[Figure: two aligned exon/intron boundaries, panels A and B, at positions -3 -2 -1 +1 +2 +3]
Panel A: Human AAGGTG; Mouse AAGGTG; Chicken AAGGTG
Panel B: Human AAGGTG; Mouse AATGTG; Chicken AA_ACG

The Recent Past: TWINSCAN
Korf, Flicek, Duan, Brent, Washington University in St. Louis, 2001
Uses an informant sequence to help predict genes
–For human, informant is normally mouse
Informant sequence consists of three characters
–Match: |
–Mismatch: :
–Unaligned: .
Informant sequence assumed independent of target sequence
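
The three-character encoding above can be sketched as a function that turns a pairwise alignment into a conservation sequence. The input convention is an assumption for this example: both strings have equal length, `.` in the informant marks unaligned positions, and any other differing character (including a gap symbol such as `_`) counts as a mismatch. The actual TWINSCAN preprocessing may treat gaps differently.

```python
def conservation_sequence(target, informant):
    """Encode an informant alignment as '|' (match), ':' (mismatch), '.' (unaligned)."""
    if len(target) != len(informant):
        raise ValueError("target and informant must be aligned to the same length")
    out = []
    for t, q in zip(target.upper(), informant.upper()):
        if q == ".":
            out.append(".")   # no informant sequence aligned here
        elif q == t:
            out.append("|")   # identical base
        else:
            out.append(":")   # substitution or gap, treated as a mismatch here
    return "".join(out)

print(conservation_sequence("GGTGAGGTGACC", "GGTCAGC_GA.."))
```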

The TWINSCAN Model
Just like GENSCAN, except adds models for conservation sequence
5th-order models for CDS and NC, 2nd-order models for start and stop codons and splice sites
–One CDS model for all frames
Many informants tried, but mouse seems to be at the “sweet spot”

TWINSCAN Performance
Slightly more sensitive than GENSCAN, much more specific
–Exon sensitivity/specificity about 75%
Much better at the gene level
–Most genes are mostly right, about 25% exactly right
Was the best gene predictor for about 4 years

The Present: N-SCAN
Gross and Brent, Washington University in St. Louis, 2005
If one informant sequence is good, let’s try more!
Also several other improvements on TWINSCAN

N-SCAN Improvements
Multiple informants
Richer models of sequence evolution
Frame-specific CDS conservation model
Conserved noncoding sequence model
5’ UTR structure model

HMM Outputs: GENSCAN, TWINSCAN, N-SCAN
Target and conservation sequence:
Target                 GGTGAGGTGACCAAGAACGTGTTGACAGTA
Conservation sequence  |||:||:||:|||||:||||||||......
Target and multiple informants:
Target                 GGTGAGGTGACCAAGAACGTGTTGACAGTA
Informant1             GGTCAGC___CCAAGAACGTGTAG......
Informant2             GATCAGC___CCAAGAACGTGTAG......
Informant3             GGTGAGCTGACCAAGATCGTGTTGACACAA...

N-SCAN State Diagram

Two-Component Output Distributions
Target sequence model
Phylogenetic model for informants
Product gives the probability of a multiple alignment column

Phylogenetic Bayesian Network Models

Graph Transformation

Inference
Slightly-modified version of Felsenstein’s algorithm
At each of the O(N) nodes, we calculate 6^{o+1} summations over 6^{o+1} values
Total time complexity is O(N · 6^{2(o+1)})
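
For orientation, here is a minimal sketch of the standard Felsenstein pruning algorithm, not N-SCAN's modified version. It assumes a 4-letter alphabet, order 0, a Jukes-Cantor substitution model, and an unobserved root with a uniform prior. N-SCAN instead uses a 6-character alphabet (bases plus gap and unaligned), higher orders, and an observed target sequence at the root. The tree encoding and branch lengths are made up for the example.

```python
import math

BASES = "ACGT"

def jc_prob(a, b, t):
    """Jukes-Cantor probability of base b at the child given base a at the parent."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if a == b else 0.25 - 0.25 * e

def partial(node, column):
    """Return {base: P(observed leaves below node | node has this base)}."""
    if isinstance(node, str):  # leaf: node is a species name
        return {b: 1.0 if column[node] == b else 0.0 for b in BASES}
    left, t_left, right, t_right = node
    pl, pr = partial(left, column), partial(right, column)
    # One summation per parent base for each child: |alphabet| sums over |alphabet| values.
    return {
        a: sum(jc_prob(a, b, t_left) * pl[b] for b in BASES)
           * sum(jc_prob(a, b, t_right) * pr[b] for b in BASES)
        for a in BASES
    }

def column_likelihood(tree, column):
    """Probability of one alignment column, with a uniform prior on the root base."""
    root = partial(tree, column)
    return sum(0.25 * root[a] for a in BASES)

# Internal nodes are (left, left_branch_length, right, right_branch_length).
TREE = (("human", 0.1, "mouse", 0.1), 0.05, "chicken", 0.3)

print(column_likelihood(TREE, {"human": "A", "mouse": "A", "chicken": "A"}))
print(column_likelihood(TREE, {"human": "A", "mouse": "C", "chicken": "G"}))
```

With alphabet size 6^{o+1} in place of 4, the same two nested sums at each node give the 6^{2(o+1)} factor in the complexity stated on the slide.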

Training
Simple with labeled multiple alignment of all sequences
Can use known genes as a labeling
Don’t know ancestral genome sequences
–Treat them as missing data and use EM

CPD Parameterizations
Each Bayesian network of order o has (2N-1)(6^{o+1})(6^{o+1} - 1) free parameters
We can reduce this number by restricting the form of the CPDs
Partially reversible models
–Relative frequency of DNA k-mers remains constant as sequence evolves
–Gaps and unaligned regions introduced over time
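
A quick calculation shows how fast the unrestricted parameter count grows with the model order. The helper below simply evaluates the formula from the slide. Reading N as the number of aligned sequences (so that a binary tree has 2N-1 nodes) is an assumption, as is the four-species example.

```python
def free_parameters(n, order, alphabet_size=6):
    """Evaluate (2N-1) * A^(o+1) * (A^(o+1) - 1) with alphabet size A (6 on the slide)."""
    k = alphabet_size ** (order + 1)
    return (2 * n - 1) * k * (k - 1)

# Four sequences, e.g. a human, mouse, rat, chicken alignment.
for order in range(3):
    print(order, free_parameters(4, order))
```

For four sequences this gives 210 parameters at order 0, 8,820 at order 1, and 325,080 at order 2, which is why restricted CPD forms matter.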

N-SCAN Phylogenetic Models vs. Traditional Phylogenetic Models
Root (target) node is observed
–Can use existing single-sequence models
–Can use higher-order models
–Can estimate target sequence model optimally

N-SCAN Phylogenetic Models vs. Traditional Phylogenetic Models
No assumption of homogeneous substitution process
–Gaps and unaligned regions can be treated naturally
–Robust against function-changing mutation, alignment error, and sequencing error
–The price is many more parameters

Conservation Score Coefficient
N-SCAN uses log-likelihood scores internally
The score of a position i under state S is [equation not preserved in this transcript]
Values of k between 0.3 and 0.6 result in the best performance
–Performance is roughly constant in this range

Whole-Genome Human Gene Prediction
Annotations used were cleaned RefSeqs
–16,259 genes
–20,837 transcripts
N-SCAN used human, mouse, rat, chicken alignment

Exact Exon Accuracy

Exact Gene Accuracy

Intron Sensitivity By Length

Human Informant Effectiveness

Drosophila Informant Effectiveness

The Future(?): CONTRAST
New gene predictor currently in the works
Based not on a generalized HMM, but a semi-Markov conditional random field (SCRF)

HMMs For Gene Prediction
Generative model
–Define P(X, Y) as a product of many independent terms
P(ACG) = P(start in noncoding) * P(noncoding emits A) * P(noncoding transitions to noncoding) * P(noncoding emits C) * P(noncoding transitions to coding) * P(coding emits G)
Terms are of the forms P(y_i | y_{i-1}) and P(x_i | y_i)
–Trained by collecting counts

HMMs For Gene Prediction
To predict genes given a sequence X, calculate
argmax_Y P(Y | X) = argmax_Y P(X, Y) / P(X) = argmax_Y P(X, Y)
Advantage: simplicity
–Extremely fast training, efficient inference
Disadvantage: simplicity
–Makes many unwarranted independence assumptions
–Inaccurate model will get us into trouble

When HMMs Go Wrong
Normal HMM training optimizes the wrong function
–We use P(Y | X) for prediction, but we’re optimizing P(X, Y) = P(Y | X) P(X)
–This means we may prefer parameters that lead to worse predictions if they assign a higher probability to the sequence

When HMMs Go Wrong
A = Conserved triplet, B = Synonymous substitution, C = Nonsynonymous substitution
[Figure: output distributions of the models, in the order shown on the slide]
NC: A 3%, B 2%, C 95% | CDS: A 49%, B 49%, C 2%
NC: A 3%, B 2%, C 95% | CDS: A 3%, B 95%, C 2%
NNC: A 2%, B 2%, C 96% | CNS: A 96%, B 2%, C 2% | CDS: A 49%, B 49%, C 2%
Example sequences: …CCCCCCCCCCCCCAAAAAAAAAACCCC… and …CCCCCCCBBABAAABBABBABCC…

Can We Fix It?
Directly optimize P(Y | X)
No closed form solution
–But function and gradient can be calculated efficiently using DP
If we’re going to numerically optimize anyway, might as well switch to a more expressive model

CRFs For Gene Prediction
Discriminative model
–Define P(Y | X) as a product of many terms
Individual terms are not probabilities!
Terms are of the form f_j(y_{i-1}, y_i, X, i) w_j
The Good
–Independence assumptions much weaker than in HMMs
–Inference complexity is the same as for HMM
The Bad
–Training requires numerical optimization of (convex) likelihood function
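
The following sketch shows how a linear-chain CRF turns weighted features into a conditional probability P(Y | X), using the forward algorithm for the normalizer. It is a toy illustration under stated assumptions: the three feature functions, their weights, and the convention that the previous label is `None` at the first position are invented for the example and are not taken from CONTRAST.

```python
import math

def _logsumexp(vals):
    m = max(vals)
    return m + math.log(sum(math.exp(v - m) for v in vals))

def score(y_prev, y, x, i, features, weights):
    """Weighted feature sum at position i: sum_j w_j * f_j(y_prev, y, x, i)."""
    return sum(w * f(y_prev, y, x, i) for f, w in zip(features, weights))

def log_partition(x, states, features, weights):
    """log Z(X): forward algorithm summing over all labelings, O(N^2 L) like an HMM."""
    alpha = {s: score(None, s, x, 0, features, weights) for s in states}
    for i in range(1, len(x)):
        alpha = {
            s: _logsumexp([alpha[p] + score(p, s, x, i, features, weights) for p in states])
            for s in states
        }
    return _logsumexp(list(alpha.values()))

def log_prob(y, x, states, features, weights):
    """log P(Y | X) = (sum of weighted features along Y) - log Z(X)."""
    total = sum(score(y[i - 1] if i else None, y[i], x, i, features, weights)
                for i in range(len(x)))
    return total - log_partition(x, states, features, weights)

# Hypothetical features. The last one looks at a window of X, which an HMM cannot do.
def f_gc_base(y_prev, y, x, i):
    return 1.0 if y == "CDS" and x[i] in "GC" else 0.0

def f_same_label(y_prev, y, x, i):
    return 1.0 if y_prev == y else 0.0

def f_gc_window(y_prev, y, x, i):
    window = x[max(0, i - 2):i + 3]
    gc = sum(c in "GC" for c in window)
    return 1.0 if y == "CDS" and 2 * gc > len(window) else 0.0

CRF_STATES = ["NC", "CDS"]
FEATURES = [f_gc_base, f_same_label, f_gc_window]
WEIGHTS = [1.0, 2.0, 0.5]  # made-up values; in practice these are learned

print(math.exp(log_prob(["NC", "CDS", "CDS"], "AGC", CRF_STATES, FEATURES, WEIGHTS)))
```

Training would adjust `WEIGHTS` to maximize the summed `log_prob` over labeled sequences, which is the numerical optimization the slide refers to.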

The Math CRFs HMMs
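
The equations on this slide did not survive in the transcript. As a hedged reconstruction, the standard textbook forms consistent with the terms named on the earlier slides are shown below. The symbols L (sequence length) and Z(X) (the normalizer) are introduced here, and P(y_1 | y_0) is assumed to stand for the start probability.

```latex
% HMM: joint probability as a product of transition and emission terms
P(X, Y) = \prod_{i=1}^{L} P(y_i \mid y_{i-1}) \, P(x_i \mid y_i)

% CRF: conditional probability from weighted features, normalized over all labelings
P(Y \mid X) = \frac{1}{Z(X)} \exp\!\Big( \sum_{i=1}^{L} \sum_{j} w_j \, f_j(y_{i-1}, y_i, X, i) \Big),
\qquad
Z(X) = \sum_{Y'} \exp\!\Big( \sum_{i=1}^{L} \sum_{j} w_j \, f_j(y'_{i-1}, y'_i, X, i) \Big)
```

Taking the logarithm of the HMM product gives a sum of log-probabilities, which is a special case of the CRF exponent in which each feature is an indicator and each weight is a log-probability.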

HMMs vs. CRFs
[Figure: graphical models of an HMM and a CRF over labels y1 … y6 and observations x1 … x6]

HMMs vs. CRFs
HMM-style “features”
–Last state is exon, current state is intron
–Current state is exon, current sequence character is “C”
CRF-style features
–Current state is exon, CG percent in 100 Kbp window is between 40% and 50%, at least one CpG island predicted within 10 Kbp
–Current state is exon, 3 unspliced ESTs with at least 95% identity aligned near current position
–Current state is exon, 1 spliced EST with at least 95% identity aligned near current position

Semi-Markov CRFs
Semi-Markov CRFs are to CRFs as generalized HMMs (or semi-HMMs) are to HMMs
Instead of assigning labels to each position, assign labels to segments
Features are f(y_{i-1}, y_i, X, i, j)

Future Directions
SVM-based splice site models that use alignment information
–Splice site models in current gene predictors are pretty primitive
Alternative splicing!
–Not yet handled well
–Very poor experimental coverage of transcriptome
