Gene prediction 10. June 2004 Irmtraud Meyer University of Oxford

Gene prediction 10. June 2004 Irmtraud Meyer University of Oxford meyer@stats.ox.ac.uk

Overview:
● Gene finding: definitions, aim, data
● Gene finding in prokaryotes versus eukaryotes
● Sequence signals
● Gene prediction methods
  – Generalised HMMs: example Genscan
  – PairHMMs: example Doublescan
● Evaluating gene prediction methods
● References: other gene prediction programs & publications

Definition gene:
● Gene: continuous section of the genome which is transcribed (and translated, if the gene is protein coding) and which corresponds to a functional product
● Protein coding gene: DNA -> (transcription) -> RNA -> (splicing) -> mRNA -> (translation) -> amino acid sequence (functional product)
● RNA gene: DNA -> (transcription) -> RNA (functional product)
● In the following: focus on protein coding genes

Genefinding: Ab initio genefinding: prediction of the location of genes and its encoded aminoacid sequences given a raw DNA sequence.....tttttgcagtactcccgggccctctgttggggcctccccttcctctccagggtggagtcgaggaggcggggc tgcgggcctccttatctctagagccggccctggctctctggcgcggggccccttagtccgggctttttgccatggggtctctgttcc ctctgtcgctgctgttttttttggcggccgcctacccgggagttgggagcgcgctgggacgccggactaagcgggcgcaaagcccca agggtagccctctcgcgccctccgggacctcagtgcccttctgggtgcgcatgagcccggagttcgtggctgtgcagccggggaagt cagtgcagctcaattgcagcaacagctgtccccagccgcagaattccagcctccgcaccccgctgcggcaaggcaagacgctcagag ggccgggttgggtgtcttaccagctgctcgacgtgagggcctggagctccctcgcgcactgcctcgtgacctgcgcaggaaaaacac gctgggccacctccaggatcaccgcctacagtgagggacaggggctcggtcccggctggggtgaggggagggggctggaagaggtgg gggaagggtagttgacagtcgctctatagggagcgcccgcggacctcactcagaggctcccccttgccttagaaccgccccacagcg tgattttggagcctccggtcttaaagggcaggaaatacactttgcgctgccacgtgacgcaggtgttcccggtgggctacttggtgg tgaccctgaggcatggaagccgggtcatctattccgaaagcctggagcgcttcaccggcctggatctggccaacgtgaccttgacct acgagtttgctgctggaccccgcgacttctggcagcccgtgatctgccacgcgcgcctcaatctcgacggcctggtggtccgcaaca gctcggcacccattacactgatgctcggtgaggcacccctgtaaccctggggactaggaggaagggggcagagagagttatgacccc gagagggcgcacagaccaagcgtgagctccacgcgggtcgacagacctccctgtgttccgttcctaattctcgccttctgctcccag cttggagccccgcgcccacagctttggcctccggttccatcgctgcccttgtagggatcctcctcactgtgggcgctgcgtacctat gcaagtgcctagctatgaagtcccaggcgtaaagggggatgttctatgccggctgagcgagaaaaagaggaatatgaaacaatctgg ggaaatggccatacatggtgg.... Input data

Genefinding: 5'....tttttgcagtactcccgggccctctgttggggcctccccttcctctccagggtggagtcgaggaggcggggc tgcgggcctccttatctctagagccggccctggctctctggcgcggggccccttagtccgggctttttgccATGGGGTCTCTGTTCC CTCTGTCGCTGCTGTTTTTTTTGGCGGCCGCCTACCCGGGAGTTGGGAGCGCGCTGGGACGCCGGACTAAGCGGGCGCAAAGCCCCA AGGGTAGCCCTCTCGCGCCCTCCGGGACCTCAGTGCCCTTCTGGGTGCGCATGAGCCCGGAGTTCGTGGCTGTGCAGCCGGGGAAGT CAGTGCAGCTCAATTGCAGCAACAGCTGTCCCCAGCCGCAGAATTCCAGCCTCCGCACCCCGCTGCGGCAAGGCAAGACGCTCAGAG GGCCGGGTTGGGTGTCTTACCAGCTGCTCGACGTGAGGGCCTGGAGCTCCCTCGCGCACTGCCTCGTGACCTGCGCAGGAAAAACAC GCTGGGCCACCTCCAGGATCACCGCCTACAgtgagggacaggggctcggtcccggctggggtgaggggagggggctggaagaggtgg gggaagggtagttgacagtcgctctatagggagcgcccgcggacctcactcagaggctcccccttgccttagAACCGCCCCACAGCG TGATTTTGGAGCCTCCGGTCTTAAAGGGCAGGAAATACACTTTGCGCTGCCACGTGACGCAGGTGTTCCCGGTGGGCTACTTGGTGG TGACCCTGAGGCATGGAAGCCGGGTCATCTATTCCGAAAGCCTGGAGCGCTTCACCGGCCTGGATCTGGCCAACGTGACCTTGACCT ACGAGTTTGCTGCTGGACCCCGCGACTTCTGGCAGCCCGTGATCTGCCACGCGCGCCTCAATCTCGACGGCCTGGTGGTCCGCAACA GCTCGGCACCCATTACACTGATGCTCGgtgaggcacccctgtaaccctggggactaggaggaagggggcagagagagttatgacccc gagagggcgcacagaccaagcgtgagctccacgcgggtcgacagacctccctgtgttccgttcctaattctcgccttctgctcccag CTTGGAGCCCCGCGCCCACAGCTTTGGCCTCCGGTTCCATCGCTGCCCTTGTAGGGATCCTCCTCACTGTGGGCGCTGCGTACCTAT GCAAGTGCCTAGCTATGAAGTCCCAGGCGtaaagggggatgttctatgccggctgagcgagaaaaagaggaatatgaaacaatctgg ggaaatggccatacatggtgg.... 3' 5' 3' Intron Exon (protein coding) Intergenic sequence Legend:

Genefinding: prokaryotes versus eukaryotes

Types of protein coding genes:
[Figure: schematic of a typical prokaryotic gene and of a typical eukaryotic gene, each drawn 5' to 3']

Eukaryotic versus prokaryotic genomes:
Prokaryotic genome:
● High gene density
● Most genes have no introns
● Short genome (mostly < 10 Mb)
Eukaryotic genome:
● Low gene density (protein coding sequence makes up < 3 % of the human genome)
● Most genes have introns
● Introns are typically much longer (one order of magnitude) than exons
● Long genome (human: 3000 Mb)
[Figure: part of a prokaryotic genome and part of a eukaryotic genome]

Eukaryotic versus prokaryotic gene finding strategies:
● Prokaryotes:
  – Search for long open reading frames (ORFs)
  – => gene finding relatively easy
● Eukaryotes:
  – Two main problems to solve:
    – Gene location problem: predict the location of the gene in the genome
    – Gene structure problem: predict the exon-intron structure of the gene
  – => gene prediction difficult

Prokaryotic gene prediction:
● Main idea:
  – A gene is a not-too-short stretch of DNA (> 100 codons) which starts (at the 5' end) with a start codon, ends (at the 3' end) with a stop codon and has no in-frame stop codons in between
  – Start codon: usually ATG
  – Stop codon: usually TGA, TAG or TAA
  – Codon: nucleotide triplet (X1, X2, X3)
● Ok: in-frame start codons, out-of-frame stop codons
● Not ok: in-frame stop codons
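The ORF rule above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: it scans the forward strand only, accepts ATG as the only start codon, and the names `find_orfs` and `min_codons` are made up for this example.

```python
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=100):
    """Return (start, end) pairs of forward-strand ORFs.

    start is the index of the A of the start codon, end is one past
    the last base of the stop codon (half-open interval).
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None  # position of the first ATG seen in this frame
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START:
                start = i
            elif start is not None and codon in STOPS:
                n_codons = (i - start) // 3  # codons before the stop
                if n_codons >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # look for the next ORF in this frame
    return orfs


# Toy sequence: ATG + 5 codons + TAA, embedded in flanking bases.
toy = "CC" + "ATG" + "GCT" * 5 + "TAA" + "G"
print(find_orfs(toy, min_codons=6))
```

In-frame ATG codons inside an open ORF are ignored (they are "ok"), whereas the first in-frame stop codon closes the ORF.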

Eukaryotic gene prediction:
● Gene location and gene structure problem => cannot simply search for ORFs
● Main idea:
  – Search for signals in the genomic DNA which indicate a functional element or a boundary between two functional elements.
● Signals which indicate a boundary:
  – Start codon signal (ATG codon, Kozak signal)
  – Stop codon signal (TGA, TAA, TAG)
  – 5' splice site signal (consensus GT at 5' end of intron)
  – 3' splice site signal (consensus AG at 3' end of intron)
● Signals which indicate an element:
  – Region of codon bias or high hexamer frequencies

Problem with sequence signals:
● Each type of sequence signal alone does not reliably indicate the location and structure of a gene.
[Figure: known gene structure (5' to 3') compared with the predicted sequence signals]
● Idea:
  – Combine the set of predicted sequence signals, according to a gene model, into valid gene structures.

Definitions:
● Valid gene structure: gene structure which could in principle be translated into a protein sequence.
● Gene model: set of rules which define what a valid gene structure is, for example:
  – Start codon
  – Codons are not stop codons
  – Strong 5' splice site signal
  – Strong 3' splice site signal
  – Exon frame conserved across intron
  – Stop codon
  – ...
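A gene model of this kind can be written down as a small validity check. The sketch below is a deliberately crude stand-in: "strong" splice site signals are reduced to the consensus dinucleotides GT and AG, only the forward strand is handled, and the function name and the exon representation are assumptions of this example.

```python
STOPS = {"TAA", "TAG", "TGA"}


def is_valid_gene_structure(seq, exons):
    """Check a candidate gene structure against a minimal gene model.

    exons: sorted list of half-open (start, end) coding intervals on
    the forward strand; the regions between them are the introns.
    """
    seq = seq.upper()
    cds = "".join(seq[s:e] for s, e in exons)
    # The spliced coding sequence must consist of whole codons; this
    # also enforces that the exon frame is conserved across introns.
    if len(cds) < 6 or len(cds) % 3 != 0:
        return False
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    if codons[0] != "ATG" or codons[-1] not in STOPS:
        return False
    if any(codon in STOPS for codon in codons[:-1]):
        return False  # in-frame stop codon before the end
    for (_, end_prev), (start_next, _) in zip(exons, exons[1:]):
        intron = seq[end_prev:start_next]
        if len(intron) < 4:
            return False
        if not (intron.startswith("GT") and intron.endswith("AG")):
            return False
    return True


# Two exons (ATGGC | TTAA) separated by the intron GTAAAAAG.
toy = "ATGGC" + "GTAAAAAG" + "TTAA"
print(is_valid_gene_structure(toy, [(0, 5), (13, 17)]))
```

A real gene finder would replace the hard yes/no tests on the splice sites by scores and would enumerate only those combinations of signals which pass such a check.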

Sequence signals

Sequence signals:
● Kozak signal: signal around the translation start, i.e. the start codon ATG; 21 bp, [-9, 11]. Reference: M. Kozak (1981), Nucleic Acids Research 9, 5233-5252
● Splice site signals: signal around the 5' end (consensus GT) and the 3' end (consensus AG) of introns; 21 bp, [-10, 10] ... 26 bp, [-15, 10]
[Figure: the signals shown by position, with position 0 marked. Reference: A. Levine, MPhil thesis, University of Cambridge, 2001, page 8]

Sequence signals:
● Hexamer frequencies: the frequency of 6-letter words within protein coding regions differs from that in non-protein coding regions
● CpG island (CpG = CG pair): region with a higher than average frequency of unmethylated CG pairs, associated with the 5' ends of many genes (56 % of human genes); tends to overlap the promoter and to extend about 1000 bp downstream into the transcription unit. Most Cs of CpG dinucleotides in the human genome are methylated and tend to mutate via C->T into TpG (or CpA), so that CpG dinucleotides occur about five times less frequently than expected.
● References: A. Bird (1987), Trends in Genetics 3, 342-347; Antequera and Bird (1993), PNAS 90, 11995-11999.
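The CpG depletion described above can be made concrete with an observed/expected ratio. The sketch below uses one commonly used definition, (number of CG dinucleotides) * (sequence length) / (number of C * number of G); it is not taken from the slides, and the function name is an assumption of this example.

```python
def cpg_obs_exp(seq):
    """Observed/expected CpG ratio of a DNA sequence.

    Expected count assumes C and G occur independently:
    expected = (#C * #G) / length.
    """
    seq = seq.upper()
    n_c = seq.count("C")
    n_g = seq.count("G")
    if n_c == 0 or n_g == 0:
        return 0.0
    n_cpg = seq.count("CG")  # CG cannot overlap itself
    return n_cpg * len(seq) / (n_c * n_g)


print(cpg_obs_exp("CGCGCGCG"))  # CpG-rich toy sequence
print(cpg_obs_exp("CCCCGGGG"))  # same composition, CpG-poor
```

A ratio around 0.2 would correspond to the roughly five-fold depletion mentioned above, whereas CpG islands have ratios much closer to 1.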

Sequence signals:
● G+C isochores: the genome of higher organisms can be partitioned into regions of different G+C content (= percentage of C and G nucleotides in that region), which are typically longer than 20 kb. The density of genes varies with the G+C content: in the mouse and human genomes, 75-80 % of the genes are found in the G+C-richest half of the genome. Reference: Mouse Seq. Consortium (2002), Nature 420, 520-562.
[Figure: G+C content in 20 kb windows; mouse (blue), human (red)]
● Strong correlation between the G+C content of orthologous mouse and human genes (data not shown)

Summary: sequence signals:
[Figure: gene schematic (promoter, UTR, exons) annotated with the sequence signals listed below]
● CpG island
● Kozak signal
● Splice site signals
● Stop codon
● Promoter signal and poly-A cleavage site (difficult to predict)
● G+C isochore (extends beyond gene boundaries)

Scoring sequence signals:
● (1) Methods for scoring the strength of sequence signals are for example:
  – Weight matrix models (WMM) (splice site scores, start codon scores, poly-A sites)
  – Maximal dependence decomposition (MDD) (more sophisticated splice site scores, for example used in Genscan)
  – Hexamer frequencies (coding region scores)
  – Codon frequencies (coding region scores)
● (2) Methods for scoring the length of a sequence signal are:
  – Length distributions (exon lengths)
● Score: measure of the likelihood of the signal being true. For example, the log-odds score of a sequence signal is the logarithm of the ratio of the probability of the sequence under the signal model to the probability of the same sequence arising by chance.
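As an illustration of a weight matrix model and its log-odds score, here is a minimal sketch. The training sites, the pseudocount and the uniform background are assumptions chosen for the example, not values used by any of the programs discussed here.

```python
import math

BASES = "ACGT"


def train_wmm(sites, pseudocount=1.0):
    """Estimate position-specific base probabilities from aligned sites."""
    length = len(sites[0])
    wmm = []
    for pos in range(length):
        counts = {base: pseudocount for base in BASES}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        wmm.append({base: counts[base] / total for base in BASES})
    return wmm


def log_odds(wmm, window, background=None):
    """log2 of P(window | signal model) / P(window | background)."""
    if background is None:
        background = {base: 0.25 for base in BASES}
    return sum(
        math.log2(wmm[pos][base] / background[base])
        for pos, base in enumerate(window)
    )


# Hypothetical 5' splice site examples (start of the intron).
sites = ["GTAAGT", "GTGAGT", "GTAAGA", "GTAAGT"]
wmm = train_wmm(sites)
print(log_odds(wmm, "GTAAGT"), log_odds(wmm, "CCCCCC"))
```

A positive score means the window looks more like the signal than like background sequence; a negative score means the opposite. A WMM treats positions as independent, which is the limitation MDD addresses.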

Gene prediction methods

Gene prediction methods: generalised HMMs
● Motivation: the lengths of the sequences generated by each state of an HMM in simulation mode follow a geometric distribution; exon lengths, however, are not geometrically distributed.
● Intron length > 50 bp required for splicing
[Figure from C. Burge, PhD thesis, Stanford University, 1997]
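To see why a plain HMM state implies geometric lengths: a state with self-transition probability p emits exactly d symbols with probability p^(d-1) * (1 - p). The sketch below (function names are illustrative) shows that this distribution always peaks at length 1, whatever its mean.

```python
def geometric_length_prob(length, p_stay):
    """P(a state with self-transition probability p_stay emits `length` symbols)."""
    if length < 1:
        return 0.0
    return p_stay ** (length - 1) * (1.0 - p_stay)


def mean_length(p_stay):
    """Expected number of symbols emitted before leaving the state."""
    return 1.0 / (1.0 - p_stay)


# Even with a mean length of 100, the most probable length is 1.
for d in (1, 50, 100, 200):
    print(d, geometric_length_prob(d, 0.99))
print(mean_length(0.99))
```

A generalised HMM replaces this implicit distribution by an explicit length distribution for each such state, at the price of a more expensive Viterbi algorithm.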

Genscan:
[Figure: state diagram of the Genscan generalised HMM, with the states that have a length distribution marked; the reverse strand part of the HMM is omitted]
States:
● Intergenic sequence
● Promoter
● 5' UTR
● Initial exon
● Exons of phase 0, 1 or 2
● Introns of phase 0, 1 or 2
● Terminal exon
● Exon of single exon genes
● 3' UTR
● Poly-A signal

Genscan:
● Sequence signals used:
  – Splice site signals: consensus 5' GT – AG 3', use MDD
  – Promoter signal: TATA-containing promoter with probability 0.7, consisting of a 15 bp TATA-box WMM and an 8 bp cap site WMM with 14-20 bp of intergenic characteristics in between; TATA-less promoter with probability 0.3, consisting of 40 bp of intergenic characteristics
  – Poly-A site signal: 6 bp WMM (consensus: AATAAA)
  – Kozak signal: 12 bp WMM (consensus: gccAcCATGgcg)
  – Stop codon signal: one of three stop codons plus 3 bp WMM
  – Hexamer frequencies
● Viterbi is used to derive the state path with the highest overall probability
● Simultaneously predicts non-overlapping genes on both strands

Genscan:
● Transition probabilities are chosen according to the G+C content of the sequence (four G+C intervals are used: [0, 0.43), [0.43, 0.51), [0.51, 0.57), [0.57, 1])
● Algorithmic complexity of Viterbi:
  – N^2 * L for an HMM of N states and a sequence of length L
  – N^2 * L^3 for a generalised HMM; if one assumes that a state with a length distribution can have at most a duration D, this reduces to N^2 * D^2 * L
● Parameters (transition and emission probabilities and length distributions) have been set up to predict human genes
● Reference: C. Burge and S. Karlin (1997), JMB 268, 78-94; C. Burge, PhD thesis, Stanford University, 1997; http://genes.mit.edu/GENSCAN.html
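For reference, the N^2 * L case (plain HMM, no length distributions) looks as follows. This is a generic textbook Viterbi in log space with a made-up two-state toy model (N = non-coding, A+T rich; C = coding, G+C rich); it is not Genscan's model or code.

```python
import math


def viterbi(obs, states, log_start, log_trans, log_emit):
    """Most probable state path of a plain HMM; time is O(N^2 * L)."""
    scores = [{s: log_start[s] + log_emit[s][obs[0]] for s in states}]
    backpointers = []
    for symbol in obs[1:]:
        row, pointer = {}, {}
        for s in states:
            # Best predecessor of state s: N candidates per state.
            best = max(states, key=lambda r: scores[-1][r] + log_trans[r][s])
            row[s] = scores[-1][best] + log_trans[best][s] + log_emit[s][symbol]
            pointer[s] = best
        scores.append(row)
        backpointers.append(pointer)
    last = max(states, key=lambda s: scores[-1][s])
    path = [last]
    for pointer in reversed(backpointers):
        path.append(pointer[path[-1]])
    path.reverse()
    return path, scores[-1][last]


def log_table(table):
    return {key: math.log(value) for key, value in table.items()}


# Toy parameters, chosen only for illustration.
STATES = ("N", "C")
LOG_START = log_table({"N": 0.5, "C": 0.5})
LOG_TRANS = {
    "N": log_table({"N": 0.9, "C": 0.1}),
    "C": log_table({"N": 0.1, "C": 0.9}),
}
LOG_EMIT = {
    "N": log_table({"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}),
    "C": log_table({"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1}),
}

path, score = viterbi("ATATGCGCGC", STATES, LOG_START, LOG_TRANS, LOG_EMIT)
print("".join(path), score)
```

In a generalised HMM the inner maximisation additionally runs over all possible durations of the current state, which is where the extra factors of L (or D) come from.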

Motivation for comparative gene prediction:
● 99 % of genes have a homologous partner
● 80 % of genes have an orthologous partner
● 86 % of orthologous pairs: number of exons conserved
● 85 % identity (protein coding DNA) versus 69 % identity (intronic DNA)
● 70 % identity (orthologous proteins)
=> Aim: try to detect genes in a comparative way, making use of two similar DNA sequences

Strategies for comparative gene prediction:
● (1) Use the fact that the encoded amino acid sequences are similar and that they are encoded in the same or a similar number of exons
● (2) Use the fact that exons are on average more conserved and show a different conservation pattern than non-protein coding sequences
  – Exons: 3-periodicity
    X: ATTGTATGCCACGACCAAAGA
    Y: ATCGTCTGTCATGATCAAAGG
  – Non-exons: no 3-periodicity
    X: ATTAGTTGCACCGACCAAAGA
    Y: ATCCGTTGCATTGATCAAAGG
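The 3-periodicity can be checked directly on the two example pairs from this slide by counting mismatches per codon position. The sketch assumes an ungapped alignment that starts at codon position 1; the function name is illustrative.

```python
def mismatches_by_codon_position(x, y):
    """Count mismatches at codon positions 1, 2 and 3 of an ungapped alignment."""
    if len(x) != len(y):
        raise ValueError("sequences must have equal length")
    counts = [0, 0, 0]
    for index, (a, b) in enumerate(zip(x, y)):
        if a != b:
            counts[index % 3] += 1
    return counts


exon_x, exon_y = "ATTGTATGCCACGACCAAAGA", "ATCGTCTGTCATGATCAAAGG"
other_x, other_y = "ATTAGTTGCACCGACCAAAGA", "ATCCGTTGCATTGATCAAAGG"

print(mismatches_by_codon_position(exon_x, exon_y))    # [0, 0, 6]
print(mismatches_by_codon_position(other_x, other_y))  # [1, 1, 4]
```

In the exon pair all six mismatches fall on the third codon position, where changes are often silent; in the non-exon pair the same number of mismatches is spread over all three positions.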

Motivation: simultaneous alignment & gene prediction
● Idea: gene prediction aids alignment and vice versa

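Doublescan:
[Figure: state diagram of Doublescan's pairHMM; states are labelled by what they read from the two sequences as x:y, with "-" denoting a gap]
● Start and End (connected to all other states)
● Intergenic: match intergenic x:y, emit x:-, emit -:y
● Start and stop codons: start:start, stop:stop
● Exon: match exon x1x2x3:y1y2y3, emit x1x2x3:-, emit -:y1y2y3
● Splice sites of matched introns: GT:GT and AG:AG, x1GT:y1GT and AGx2x3:AGy2y3, x1x2GT:y1y2GT and AGx3:AGy3
● Matched intron: match intron x:y, emit x:-, emit -:y
● Intron in x only (emit x intron x:-): splice sites GT:- and AG:-, x1GT:- and AGx2x3:-, x1x2GT:- and AGx3:-
● Intron in y only (emit y intron -:y): splice sites -:GT and -:AG, -:y1GT and -:AGy2y3, -:y1y2GT and -:AGy3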

Doublescan's pairHMM:
● Reads two input sequences simultaneously
● Aligns and predicts genes simultaneously
● Models splice sites in UTRs, but does not try to predict transcription start and end sites
● Can also model exon-fusion and exon-splitting events
● The underlying pairHMM (i.e. the directed graph of states and transitions) can be used for any pair of eukaryotic genomes; only the pairHMM's parameters are trained for a specific pair of genomes
● Has so far been trained on human-mouse and C. elegans-C. briggsae gene pairs

Doublescan's pairHMM:
● Sequence signals used:
  – Splice site scores (derived by external program)
  – Kozak signal (derived by external program)
  – Similarity on protein level, conservation patterns (assigned by emission probabilities of the pairHMM)
● The state path with the highest overall probability is reported
● Algorithmic complexity for N states and two sequences of lengths S and Q:
  – Viterbi algorithm: O(N^2 * S * Q) (memory and time)
  – Hirschberg algorithm: O(N^2 * S * Q) (time), O(N * min(S, Q)) (memory)
  – Stepping Stone algorithm: O(N * sqrt(S^2 + Q^2))

Stepping Stone Algorithm:
[Figure, shown on two slides: alignment matrix of sequence x (AGCCGGCGGCTAGGGCCCCGAAGTAAGGC) against sequence y (CGAACCCGGCTAGGGCAGGGTCCCTAACC), with positions numbered 0 to 4 along both axes and local matches such as ...ACGACCCAA... / ...ACAACCCAG... marked]

Summary Doublescan:
Advantages:
● Better alignment than with methods that do not know about genes (Blast etc.)
● Better performance than non-comparative ab initio methods
Disadvantages:
● Main similarities in the two sequences have to appear in collinearity
● Computationally more expensive than non-comparative HMM-based methods
References:
● D. S. Hirschberg (1975), Communications of the ACM, 18, 341-343
● I. M. Meyer and R. Durbin (2002), Bioinformatics, 18, 1309-1318, http://www.sanger.ac.uk/Software/analysis/doublescan

Evaluating gene prediction methods

Evaluating the performance of a gene prediction program:
● Motivation: need a measure to evaluate the quality of a gene prediction and to compare the quality of different gene prediction or gene annotation methods
● Idea: compare a set of known genes (annotation) to a set of predicted genes (prediction) by comparing them on three different levels:
  – Nucleotide level: fine scale, compare nucleotides
  – Intermediate level: medium scale, compare entire CDS, start and stop codons
  – Gene level: coarse scale, compare entire gene structures
[Figure: annotation versus prediction]

Evaluation on different levels:
● Evaluation of genes on gene level:
[Figure: annotated versus predicted genes, labelled Tp, Tp(overlapping), Fn and Fp]
● Evaluation of exons on intermediate level:
[Figure: annotated versus predicted exons, labelled Tp, Tp(overlapping), Fn and Fp]
● Definitions: Tp = true positive, Tn = true negative, Fp = false positive, Fn = false negative

Evaluation on different levels (cont'd):
● Evaluation of exons on nucleotide level:
[Figure: annotated versus predicted exons, with nucleotides labelled Tp, Tn, Fp and Fn]
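At the nucleotide level the four counts can be obtained by comparing the sets of annotated and predicted exon positions. A minimal sketch, assuming exons are given as half-open (start, end) intervals on one strand; the function name and the data layout are assumptions of this example.

```python
def nucleotide_level_counts(annotated, predicted, seq_length):
    """Count Tp, Fp, Fn and Tn nucleotides for exon predictions."""
    annotated_positions = set()
    for start, end in annotated:
        annotated_positions.update(range(start, end))
    predicted_positions = set()
    for start, end in predicted:
        predicted_positions.update(range(start, end))
    tp = len(annotated_positions & predicted_positions)
    fp = len(predicted_positions - annotated_positions)
    fn = len(annotated_positions - predicted_positions)
    tn = seq_length - len(annotated_positions | predicted_positions)
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn}


# First predicted exon is shifted by 5 bp, second one is exact.
print(nucleotide_level_counts([(10, 20), (30, 40)], [(15, 25), (30, 40)], 50))
```

Using position sets keeps the example short; for whole chromosomes an interval-based comparison would be more memory efficient.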

Main performance measures:
● Assume: annotation is correct and complete (i.e. no genes missing)
● Main performance measures are:
  – Sensitivity: fraction of known features which were correctly predicted. Examples: gene sensitivity, exon sensitivity, start codon sensitivity. Between 0 (no known feature found) and 1 (all known features found)
  – Specificity: fraction of predicted features which match the known features (this is the convention in gene prediction; elsewhere this quantity is called precision or positive predictive value). Examples: gene specificity, exon specificity, etc. Between 0 (no predicted feature correct) and 1 (all predicted features correct)
● Note: sensitivity and specificity are independent measures; a high value of one does not imply a high value of the other

Overview: performance measures
For a given entity and label one can compute:
● sensitivity = (# tp) / (# tp + # fn): the fraction of annotated entities which were correctly predicted
● specificity = (# tp) / (# tp + # fp): the fraction of predicted entities which are correct
● missing = (# fn) / (# tp + # fn): the fraction of annotated entities which are missing in the prediction
● overlapping_1 = (# tp(overlapping)) / (# tp + # fn): the fraction of annotated entities which are overlapped by a predicted entity
● overlapping_2 = (# tp(overlapping)) / (# tp + # fp): the fraction of predicted entities which overlap an annotated entity
● wrong = (# fp) / (# tp + # fp): the fraction of predicted entities which do not overlap any annotated entity
● A label is for example "exon" or "start codon"; an entity can be a nucleotide, exon, start codon, stop codon or an entire gene
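The measures above translate directly into code. The sketch below simply transcribes the formulas as given on this slide; the function name and the choice to return NaN for an empty denominator are assumptions of this example.

```python
def performance_measures(tp, fp, fn, tp_overlapping=0):
    """Compute the slide's performance measures from raw counts."""

    def ratio(numerator, denominator):
        # Undefined when nothing was annotated or nothing was predicted.
        return numerator / denominator if denominator else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tp, tp + fp),
        "missing": ratio(fn, tp + fn),
        "overlapping_1": ratio(tp_overlapping, tp + fn),
        "overlapping_2": ratio(tp_overlapping, tp + fp),
        "wrong": ratio(fp, tp + fp),
    }


# Example: 15 correct, 5 spurious and 5 missed entities.
measures = performance_measures(tp=15, fp=5, fn=5)
print(measures["sensitivity"], measures["specificity"])
```

A predictor can trade one measure for the other, for instance by predicting fewer, more confident exons, which is why both values should always be reported together.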

Words of caution:
● Gene prediction methods have usually been created with one genome in mind and usually cannot be adapted (easily or at all) to other genomes, because the underlying gene model or the set of parameters would have to be changed. However, most of the recently developed comparative gene prediction methods can, at least in principle, be adjusted to analyse other pairs of genomes by retraining only their set of parameters.
● The performance of a gene prediction program usually depends on the set of genes on which it was tested. The reported performance therefore does not necessarily generalise to other data sets. This implies that the performance of different gene prediction methods should be compared on the same test set.
● The best way to evaluate a novel gene prediction method is to compare its performance to that of the best existing gene prediction methods on a test set of genes which is large and diverse and whose composition is ideally similar to that found in the entire genome (if known).

Overview: references to other gene prediction programs and publications

1.) Ab initio gene prediction programs:
● Geneparser: http://beagle.colorado.edu/~eesnyder/GeneParser.html
● Morgan: http://www.tigr.org/~salzberg/morgan.html
● Genscan: http://genes.mit.edu/GENSCAN.html
● Genefinder: http://argon.cshl.org/genefinder/
● Genlang: http://www.cbil.upenn.edu/genlang/genlang_home.html
● Genie: http://www.fruitfly.org/seq_tools/genie.html
● Geneid: http://www1.imim.es/geneid.html
● Fgenes, Fgenesh, Fgenesh+: http://genomic.sanger.ac.uk/gf/gfs.shtml
● Grail: http://compbio.ornl.gov
● Glimmer: http://www.tigr.org/software/glimmerm/
● HMMgene: http://www.cbs.dtu.dk/services/HMMgene/
● MZEF: http://www.ebi.ac.uk/~thanaraj/MZEF-SPC.html and http://www.cshl.org/public/SCIENCE/zhang.html

2a.) Homology based gene prediction:
[Figure: protein x matched to gene y in DNA sequence Y]
● Task: given the known amino acid sequence of protein x, find the corresponding gene in the DNA which encodes the same or a very similar protein.
● Programs:
  – Genewise (HMM based, http://www.ebi.ac.uk/Wise2/)
  – Procrustes (HMM based, no working url, see M.S. Gelfand et al., PNAS, 93: 9061-9066, 1996)

2b.) Homology based gene prediction:
[Figure: gene x in DNA sequence X matched to gene y in DNA sequence Y]
● Task: given a known gene x, find the corresponding related gene y.
● Program:
  – Projector (HMM based, www.sanger.ac.uk/Software/analysis/projector, paper by I.M. Meyer and R. Durbin to appear in Nucleic Acids Research in 2004)

3.) Comparative ab-initio gene prediction:
● Task: given two related DNA sequences, find the encoded pairs of related genes
● Program, requires two or more pre-aligned sequences, finds genes in all of them:
  – Evogene (http://www.birc.dk/Software/evogene)
● Program, requires two pre-aligned sequences, finds genes in only one of them:
  – Twinscan (http://genes.cs.wustl.edu/) (Genscan re-implementation which takes the alignment into account, HMM based)
● Programs, require pre-aligned sequences:
  – CEM (V. Bafna et al. (Celera), ISMB Proceedings 2000, program not available)
  – SGP-1 (http://195.37.47.237/sgp-1/)
● Programs (all pairHMM based), align and predict genes simultaneously:
  – Pro-gen (http://www.anchorgen.com/pro_gen/pro_gen.html)
  – Doublescan (www.sanger.ac.uk/Software/analysis/doublescan)
  – Slam (http://baboon.math.berkeley.edu/~syntenic/about.html)

4.) Homology assisted gene prediction:
● Task: take one DNA sequence (target sequence) and several protein sequences which have partial Tblastx (i.e. protein level) matches to it (informant sequences) and predict genes in the target sequence
● Program:
  – Genomescan (http://genes.mit.edu/genomescan.html) (Genscan re-implementation which takes the alignments into account)
● Task: take one DNA sequence (target sequence) and several DNA sequences which have partial Tblastx (i.e. protein level) matches to it (informant sequences) and predict genes in the target sequence
● Program:
  – SGP-2 (Genis Parra et al., Genome Research 2003, 13, 108-117) (Geneid re-implementation which takes the alignments into account)

Appendix

PairHMMs: Doublescan Full pairHMM of Doublescan and Projector
