Slide 1. Gene finding pipelines for automatic annotation of new eukaryotic and bacterial genomes
Victor Solovyev
Professor of computer science, Royal Holloway, University of London
Chairman, Softberry Inc.
Slide 3. Expression stages and structural organization of a typical eukaryotic protein-coding gene
[Figure. Gene structure, 5' to 3': enhancer, core promoter, start of transcription, 5'-non-coding exon, ATG codon, internal exons and introns, stop codon, 3'-non-coding exon, poly-A signal. Expression stages: transcription with 5'-capping and 3'-polyadenylation (pre-mRNA); splicing, i.e. removal of intron sequences (mRNA); translation (protein).]
Slide 4. Ab initio multiple gene prediction approaches
Programs:
- Genscan (Burge, Karlin, 1997)
- HMMgene (Krogh, 1997)
- Fgenesh (Salamov, Solovyev, 1998)
- Genie (Reese et al., 2000)
- GeneID (Guigo et al., 1992)
- Fgenes (Solovyev, 1997)
Classes of approach: neural networks, probabilistic, pattern recognition, discriminant functions.
Characteristics:
- Likelihoods of gene components, HMM
- Flexible combinations of any discriminative features
- Balanced score as a product of likelihoods, simple features
Slide 5. Hidden Markov model of multiple eukaryotic genes
[Figure. State diagram with states E5, EF, E0, E1, E2, EL, E3, I0, I1, I2, I5, I3, N, Pr and PolyA, drawn for the 5'- and 3'- directions.]
Used in the Genscan and Fgenesh programs.
Ei and Ii are the different exon and intron states, respectively (i = 0, 1, 2 reflect the 3 possible different ORFs). E5/E3 mark non-coding exons, and I5/I3 are the 5'- and 3'-introns adjacent to non-coding exons.
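How an HMM assigns states to a sequence can be sketched with a much smaller model than the one on the slide. The example below is only an illustration under stated assumptions: it uses two hypothetical states ("N" for non-coding, "E" for exon-like) and made-up probabilities, not the Genscan or Fgenesh state set or parameters. It decodes the most probable state path with the Viterbi algorithm in log space.

```python
import math

# Toy model: all states and probabilities below are invented for illustration.
STATES = ("N", "E")  # N = non-coding, E = exon-like
START = {"N": 0.5, "E": 0.5}
TRANS = {"N": {"N": 0.9, "E": 0.1}, "E": {"N": 0.1, "E": 0.9}}
EMIT = {
    "N": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "E": {"A": 0.10, "C": 0.40, "G": 0.40, "T": 0.10},
}


def viterbi(seq, states=STATES, start_p=START, trans_p=TRANS, emit_p=EMIT):
    """Return the most probable state path for a non-empty sequence."""
    # score[s] = best log-probability of any path that ends in state s
    score = {s: math.log(start_p[s]) + math.log(emit_p[s][seq[0]]) for s in states}
    back = []  # back[t][s] = best predecessor of state s at position t + 1
    for symbol in seq[1:]:
        new_score, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: score[p] + math.log(trans_p[p][s]))
            new_score[s] = (score[prev] + math.log(trans_p[prev][s])
                            + math.log(emit_p[s][symbol]))
            pointers[s] = prev
        score = new_score
        back.append(pointers)
    # Trace back from the best final state
    path = [max(states, key=lambda s: score[s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return "".join(reversed(path))


if __name__ == "__main__":
    print(viterbi("AAAAAAAA" + "GGGGGGGGGGGG" + "AAAAAAAA"))
```

A real gene-finding HMM adds frame-specific exon and intron states, signal models for splice sites, and explicit length distributions; the decoding principle is the same.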
Slide 9. Importance of good specific parameters: rice example
Fgenesh with monocot gene-finding parameters.
Slide 10. Strategy to make gene-finding parameters for new genomes
1. Using GenBank genes for close organisms
2. Using the new genomic sequence
   a) Having known mRNA/cDNA sequences:
      - Map mRNA onto the genomic sequence with the EST_MAP program
      - Extract genes and use them as a learning set
   b) Using ab initio gene prediction:
      - Predict genes, select genes with protein support
   c) Using a database of known proteins (NR) that can be mapped onto the genome by the Prot_map program with reconstruction of gene structure
In addition, find protein-coding ORFs in a set of ESTs with the BESTORF program and use them in learning coding parameters.
Slide 11. Learning parameters using GenBank genes for close organisms
- Select a GenBank organism class having enough known genes.
- Create an Infogene database with reconstructed genes by running the Infog program (some genes might be described in several GenBank entries).
- Run the GetGenes program to extract genes from Infogene for use in learning programs (cleaning out genes with annotation errors in ORFs and splice sites).
- Run the Efeature program to create:
  - a set of coding regions (usually significantly bigger than the set of genes)
  - a set of non-coding regions
- Run scripts/programs for learning coding parameters (there might be several GC zones).
- Run scripts/programs for learning splice site parameters.
- Run scripts/programs to create exon length distributions and other statistics.
- Check parameters of initial probabilities (exons/introns/non-coding) depending on gene density in the genome and gene structure.
- Test and edit parameters to select the best variant.
- Repeat learning on bigger or smaller organism classes and select the best learning set.
Slide 12. Developed parameters for the Fgenesh group of programs
- Human, Mouse, Drosophila, C. elegans, Fish (WUSTL, Baylor, CSHL, JGI)
- Dicots (Arabidopsis), Nicotiana tabacum, Monocots (Corn, Rice, Wheat, Barley) (TIGR, Rutgers University)
- Algae, Plasmodium falciparum, Anopheles gambiae
- Schizosaccharomyces pombe, Neurospora crassa, Aspergillus nidulans, Coprinus cinereus, Cryptococcus neoformans, Fusarium graminearum, Magnaporthe grisea, Ustilago maydis (MIT/Broad Institute)
- Medicago (University of Minnesota)
- Brugia malayi (TIGR)
Slide 13. FGENESH++: automatic eukaryotic genome annotation pipeline
1. RefSeq mRNA mapping by the Est_map program; mapped genes are excluded from the further gene prediction process.
2. Map all known proteins (NR) onto the genome by the Prot_map program with gene structure reconstruction (find regions occupied by genes).
3. Run Fgenesh+ using mapped proteins and selected genome sequences.
4. Run ab initio Fgenesh gene prediction on the rest of the genome.
5. Search for protein homologs (by BLAST) of all products of predicted genes in NR.
6. Run Fgenesh+ gene prediction on sequences (from stage 4) having protein homologs.
7. Second run of Fgenesh in regions free from genes selected on stages 1, 3, 5.
8. Run Fgenesh gene predictions in large introns of known and predicted genes.
A special variant of FGENESH++ can take synteny into account (human-mouse, for example) using the FGENESH-2 program, which predicts genes using 2 similar genomic sequences from different species.
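Several pipeline stages operate on "the rest of the genome" or on "regions free from genes". The bookkeeping behind that can be sketched as interval subtraction. This is a minimal illustration, not Softberry code: the function name `free_regions` and the use of 1-based inclusive coordinates are assumptions made for the example.

```python
def free_regions(seq_len, genes):
    """Return 1-based inclusive intervals of [1, seq_len] not covered by any gene.

    genes: iterable of (start, end) pairs; overlapping genes are allowed.
    """
    gaps = []
    cursor = 1  # first position not yet known to be covered
    for start, end in sorted(genes):
        if start > cursor:
            gaps.append((cursor, start - 1))
        cursor = max(cursor, end + 1)
    if cursor <= seq_len:
        gaps.append((cursor, seq_len))
    return gaps


if __name__ == "__main__":
    # Hypothetical genes already accepted in earlier stages
    accepted = [(10, 20), (15, 30), (50, 60)]
    print(free_regions(100, accepted))
```

The returned intervals are the candidates for a further ab initio run; the same routine applied inside one gene, with its exons as the covered intervals, yields its introns, which could then be filtered by length.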
Slide 14. Components of the Fgenesh++ automatic pipeline
The gene-finding group of programs share most of their components and work with the same organism-specific parameters.
- Fgenesh: ab initio gene prediction. Runs on whole chromosomes (~300 Mb). Fast: the human genome (3 Gb of sequence) is processed in ~4 hours.
- Fgenesh+: this derivative of Fgenesh uses information on homologous proteins to improve the accuracy of gene prediction, if such homologs can be found.
- Fgenesh-2: a variant of Fgenesh that uses homology between two genomic DNA sequences, such as human and mouse, as an extra factor for more accurate gene prediction.
- Fgenesh_C: uses information on homologous mRNA/EST to improve the accuracy of gene prediction. Can be used to reconstruct alternatively spliced genes.
Slide 15. Components of the Fgenesh++ automatic pipeline
Programs for mapping known mRNA/EST or proteins with reconstruction of gene structure:
- Est_map: a program for fast mapping of a set of mRNAs/ESTs to a chromosome sequence. It takes splice site weight matrices into account for accurate mapping (important for accurately mapping very small exons).
- Prot_map: used for fast mapping of a database of protein sequences to a genome, taking splice sites into account (useful for genomes with few known genes and for searching for pseudogenes).
Slide 17. Prot_map example of alignment
gatcacagaggctgg(..)agtgtctgtgtttca?[GGRIVSSKPFAPLNFRINSRNLSg(..)evdhqlkerfanmke GGRIVSSKPFAPLNFRINSRNLS-]gtaagaaactctcat(..)ctgtggctcctgcag[acIGTIMRVVELSPLKGSVSWTGK(..) dIGTIMRVVELSPLKGSVSWTGKPVSYYLHTIDRTI]gtgagtatctcgctg(..)ctttcttctttttag[LENYFSSLKNPPVSYYLHTIDRTI (..) LENYFSSLKNPKLR]gtaagtttgtgtgtt(..)ctgctctccttccag[EEQEAARRRQQRESKSNAATPKLR (..) EEQEAARRRQQRESKSNAATPTKGPEGKVAGPADAPM]gtaaggccccagcct(..)ccttgtgtcctccag[DSGAEEEKTKGPEGKVAGPADAPM (..) DSGAEEEKAGAATVKKPSPSKARKKKLNKKGRKMAGRKRGRPKKMNTANPERKPKKNQTALDALHAQT
Slide 18. Analysis of gene-finding accuracy and running time
Test on 83 small (< bp) human genes using mouse homolog:
  Prot_map: Sne= 73.7 Sn_pe Spe Sn_n= 93.9 Sp_n= 88.6 C= Time ~ 1 min
  Genewise: Sne= 76.4 Sn_pe Spe Sn_n= 94.9 Sp_n= 89.4 C= Time ~ 90 min
  Fgenesh:
Test on 8 big (> bp) human genes using mouse homolog:
  Prot_map: Sne= 87.9 Sn_pe Spe Sn_n= 94.3 Sp_n= 96.0 C= Time ~ 1 min
  Genewise: Sne= 91.9 Sn_pe Spe Sn_n= 95.1 Sp_n= 97.0 C= Time ~ 1200 min
Prot_map mapping of Human protein set of proteins on chromosome 19 (~59 Mb) takes 90 min (best hit for each protein) and 148 min (all significant hits for each protein).
- Can be used to quickly find an initial gene set in a new genome by mapping all known proteins.
- Used for pseudogene finding: pseudogenes appear as mappings with frameshifts that damage ORFs.
Slide 19. New Fgenesh+ and Genewise
1) 700 genes with 6508 exons having a similar protein with > 90% similarity
  GeneWISE: Sne= 94.1 Sn_pe Spe Sn_n= 98.9 Sp_n= 99.6 C=0.992
  FGENESH+: Sne= 96.9 Sn_pe= 98.5 Spe= 97.9 Sn_n= 99.0 Sp_n= 99.5 C=0.992
2) 18 genes with 116 exons having a similar Drosophila protein with identity 28-70%
  GeneWISE: Sne= 40.5 Sn_pe Spe Sn_n= 68.3 Sp_n= 99.7 C=0.813
  Fgenesh+: Sne= 70.7 Sn_pe= 84.5 Spe= 82.0 Sn_n= 84.8 Sp_n= 96.9 C=0.898
  5'-exon: Observed - 18, Predicted - 14, Correct - 2 (11 by Fgenesh+)
  Internal exons: Observed - 80, Predicted - 43, Correct - 38 (59 by Fgenesh+)
  3'-exon: Observed - 18, Predicted - 14, Correct - 7 (12 by Fgenesh+)
Run time: Fgenesh+ is 50-1000 times faster than GeneWise.
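The slides report these measures without defining them. The sketch below assumes the definitions commonly used in gene-finding evaluation: sensitivity = TP / (TP + FN); "specificity" = TP / (TP + FP); C is the correlation coefficient over nucleotides; and an exon counts as correct only when both of its boundaries match. Function names and the input representation (sets of coding positions, lists of exon coordinate pairs) are chosen for the example and are not taken from the presentation.

```python
import math


def nucleotide_accuracy(true_coding, pred_coding, seq_len):
    """Return (Sn_n, Sp_n, C) from sets of coding positions."""
    tp = len(true_coding & pred_coding)
    fp = len(pred_coding - true_coding)
    fn = len(true_coding - pred_coding)
    tn = seq_len - tp - fp - fn
    sn = tp / (tp + fn) if tp + fn else 0.0
    sp = tp / (tp + fp) if tp + fp else 0.0
    denom = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    cc = (tp * tn - fn * fp) / denom if denom else 0.0
    return sn, sp, cc


def exon_accuracy(true_exons, pred_exons):
    """Return (Sne, Spe); an exon is correct if (start, end) matches exactly."""
    correct = len(set(true_exons) & set(pred_exons))
    sne = correct / len(set(true_exons)) if true_exons else 0.0
    spe = correct / len(set(pred_exons)) if pred_exons else 0.0
    return sne, spe


if __name__ == "__main__":
    annotated = set(range(10, 20))
    predicted = set(range(15, 25))
    print(nucleotide_accuracy(annotated, predicted, seq_len=100))
    print(exon_accuracy([(1, 10), (20, 30), (40, 50)], [(1, 10), (21, 30)]))
```

Under these assumed definitions, the Fgenesh+ counts listed above, 11 + 59 + 12 = 82 correct out of 116 observed exons, give 82 / 116 ≈ 70.7%, which matches the reported Sne.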
Slide 21. Automated Gene Calling at Center for Genome Research, MIT
Sequencing:
- 2003/2004: 6 new yeast genomes
- 2004/2005: ~20 new yeast genomes
Gene structures are predicted using a combination of FGENESH, FGENESH+, and GENEWISE (Sanger Institute).
- If the protein used in the previous step had >90% amino acid identity to the translated genome (cumulative across sub-alignments), then the GENEWISE call, if valid, was favored over the FGENESH+ call and was used as the EVIDENCE_GENE.
- If this protein had >80% but less than 90% amino acid identity to the translated genome (cumulative across sub-alignments), then the FGENESH+ call, if valid, was favored over the GENEWISE call and was used as the EVIDENCE_GENE.
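The two identity rules can be written as a small selection function. This is a sketch of the rule as stated on the slide, not the center's actual code. Three details are assumptions the slide does not cover: an invalid call is represented as `None`; the other program's call is used as a fallback when the favored one is invalid; and identity of exactly 90% falls into the lower bracket. Cases at or below 80% identity return `None` because the slide gives no rule for them.

```python
def choose_evidence_gene(identity, genewise_call=None, fgenesh_plus_call=None):
    """Pick the EVIDENCE_GENE from two candidate calls.

    identity: percent amino acid identity of the supporting protein.
    A call that is not valid is passed as None (assumed convention).
    """
    if identity > 90:
        preferred, fallback = genewise_call, fgenesh_plus_call
    elif identity > 80:
        # Assumption: exactly 90% is handled together with the 80-90% bracket.
        preferred, fallback = fgenesh_plus_call, genewise_call
    else:
        return None  # not covered by the rules quoted on the slide
    # Assumption: fall back to the other call when the favored one is invalid.
    return preferred if preferred is not None else fallback


if __name__ == "__main__":
    print(choose_evidence_gene(95, "genewise_model", "fgenesh_plus_model"))
    print(choose_evidence_gene(85, "genewise_model", "fgenesh_plus_model"))
```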
Slide 26. Examples of usage of the Fgenesh suite in genome annotations
- Grimwood J, Gordon LA, Olsen A, .., Salamov A., Solovyev V., ..., Lukas S. (2004) The DNA sequence and biology of human chromosome 19. Nature, 428(6982). Fgenesh, Fgenesh+ and est_map used to annotate genes in human chromosome 19.
- Heilig et al. (2003) The DNA sequence and analysis of human chromosome 14. Nature 421. FGENESH used for human chromosome 14 annotation.
- Hillier et al. (2003) The DNA sequence of human chromosome 7. Nature 424. Extensive use of FGENESH-2 for human chromosome 7 annotation.
- Feng et al. (2002) Sequence and analysis of rice chromosome 4. Nature 420. FGENESH used for annotation of rice chromosome 4.
- Galagan et al. (2003) The genome sequence of the filamentous fungus Neurospora crassa. Nature 422. Neurospora genome annotation based on FGENESH and FGENESH+.
- Lander et al. (2001) Initial sequencing and analysis of the human genome. Nature 409. The original paper on sequencing the human genome by the public consortium also reports use of the FGENESH gene finder for genome annotation.
- Deloukas et al. (2001) The DNA sequence and comparative analysis of human chromosome 20. Nature 414. Use of FGENESH for annotation of human chromosome 20.
- Yu et al. (2002) A draft sequence of the rice genome (Oryza sativa L. ssp. indica). Science 296. The rice genome sequencing and annotation project used FGENESH as the primary source of gene predictions.
- Holt et al. (2002) The Genome Sequence of the Malaria Mosquito Anopheles gambiae. Science 298. Use of FGENESH for annotation of the Anopheles genome.
Slide 27. Canonical and non-canonical splice sites
SpliceDB (Burset, Seledtsov, Solovyev, NAR 1999, 2000)
GT-AG: 99.24%   GC-AG: 0.69%   AT-AC: 0.05%   other sites: 0.02%
Consensus sequences (the number after each letter is its frequency in %; "|" marks the exon/intron boundary):
a) GT-AG group (canonical splice sites): examples
   donor:    M70 A60 G80 | GT R95 A71 G81 T46
   acceptor: Y73 Y75 Y78 Y79 Y80 Y79 Y78 Y81 Y86 Y86 N C71 AG | G52
b) GC-AG group: 126 examples
   donor:    M83 A89 G98 | GC A87 A84 G97 T71
c) AT-AC group: 8 annotated examples + 2 examples recovered from annotation errors
   donor:    S90 | AT A100 T100 C100 C100 T100 T90 T70
   acceptor: T70 G50 C70 N C60 AC | A60 T60
Gene prediction is usually done with only standard splice sites.
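The three groups above are defined by the first and last two nucleotides of the intron, so assigning an intron to a group is a simple lookup. The sketch below is illustrative only; the function names are invented for the example, and it inspects nothing beyond the terminal dinucleotides (it does not score the surrounding consensus).

```python
from collections import Counter

# Terminal dinucleotide pairs for the groups named on the slide
SPLICE_GROUPS = {
    ("GT", "AG"): "GT-AG",
    ("GC", "AG"): "GC-AG",
    ("AT", "AC"): "AT-AC",
}


def splice_class(intron):
    """Classify an intron sequence by its first and last two nucleotides."""
    intron = intron.upper()
    return SPLICE_GROUPS.get((intron[:2], intron[-2:]), "other")


def class_frequencies(introns):
    """Return the percentage of introns falling into each group."""
    counts = Counter(splice_class(seq) for seq in introns)
    total = sum(counts.values())
    return {name: 100.0 * n / total for name, n in counts.items()}


if __name__ == "__main__":
    # Intron ends taken from the Prot_map alignment example, interior omitted
    print(splice_class("gtaagaaactctcat" + "ctgtggctcctgcag"))
    print(class_frequencies(["GTAAAG", "GTCCAG", "GCAAAG", "ATAAAC"]))
```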
Slide 28. Additional sources of genes
- Identified with the help of synteny data
- Non-canonical splice sites
- Alternatively spliced
- Alternative promoters
- Alternative poly-A
Additional studies of the above topics will update the current gene collections.
Slide 30. Exon-based synteny
- Run the gene-finding annotation pipeline for each genome.
- Select chains of similar exons between the 2 genomes by comparing coding exons with BLAST.
- 95% in agreement with filtered genome alignments.
Brudno et al. (2004) Automated Whole-Genome Multiple Alignment of Rat, Mouse, and Human. Genome Research, 14(4).
Slide 32. Pseudogene finding using Prot_map
nnnnnnnn(..)nnnnnnnnnnnnnag[KEFDFESANAQFNKEEMGREFHNKLKLKEDKL(..) KDFDFESANAQFNKEEIDREFHNKLKLKEDKLEKEEKPVNGEDKGDSGVDTQNSEGHADEEDALGPNCFYDQTKSSFDNISGDDNRERRPTWEKQEKPVNGEDKGDSGVDTQNSEGNADEEDPLGPNCYYDKTKSFFDNISCDDNRERRPTWAEGRRLNAETFGIPLCPNRGHGGYRGRGaGLGFHGGRGRg]gtggcagaagtggta(..)AEERRLNAETFGIPLRPNRGRGGYRGRG-GLGFRGGRGR (..)
Slide 33. Pseudogene finder
Generation of pseudogene candidates:
- Run a script that finds genes having almost identical encoded proteins (or parts of them) with fewer introns (or without introns).
- Run prot_map, mapping human (mammalian) proteins and selecting damaged ones.
Selecting pseudogenes using additional features, such as a poly-A tail and the Ks/Kn ratio.
Slide 38. Accuracy of prediction by TSSP on plant genomic sequences
Selected known genomic regions upstream of CDS.
- 40 TATA promoters: true positives 92%; total number of false positives: 22 (1 per 3648 bp)
- 25 TATA-less promoters: true positives 95%; total number of false positives: 15 (1 per 3300 bp)
For each promoter class (TATA and TATA-less), only the one predicted TSS with the highest score in an interval of 300 bp was taken during the search.
Slide 43. Fgenesb_annotator - Bacterial Gene/Operon Prediction and Annotation Pipeline
FGENESB is a new complex package for annotation of bacterial genomes. Its gene prediction algorithm is based on Markov chain models of coding regions and of translation and termination sites.
Operon models are based on distances between ORFs, frequencies of different genes neighboring each other in known bacterial genomes, and predicted promoters and terminators.
The gene prediction parameters are self-learned, so the only input necessary for annotation of a new genome is its sequence.
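A Markov chain model of coding regions can be sketched in a few lines. The example below is a deliberately simplified illustration, not the FGENESB model: it uses a 2nd-order chain with add-one smoothing, ignores reading frame, and trains on two tiny invented sequences. The function names are assumptions made for the example. A region is scored by the log-ratio of its probability under a "coding" chain versus a "non-coding" chain.

```python
import math
from collections import defaultdict


def train_markov(seqs, order=2):
    """Count how often each nucleotide follows each context of length `order`."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in seqs:
        for i in range(order, len(seq)):
            counts[seq[i - order:i]][seq[i]] += 1
    return counts


def log_prob(seq, counts, order=2, alphabet="ACGT"):
    """Log-probability of seq under the chain, with add-one smoothing."""
    total = 0.0
    for i in range(order, len(seq)):
        context = counts.get(seq[i - order:i], {})
        seen = sum(context.values())
        total += math.log((context.get(seq[i], 0) + 1) / (seen + len(alphabet)))
    return total


def coding_score(seq, coding_counts, noncoding_counts, order=2):
    """Positive when seq looks more like the coding training set."""
    return (log_prob(seq, coding_counts, order)
            - log_prob(seq, noncoding_counts, order))


if __name__ == "__main__":
    # Invented toy training data, far too small for real use
    coding = train_markov(["ATGGCGGCGGCGGCGGCG"])
    noncoding = train_markov(["ATATATATATATATAT"])
    print(coding_score("GCGGCGGCG", coding, noncoding))
    print(coding_score("ATATATAT", coding, noncoding))
```

A production model would use a higher order, keep separate transition tables for each codon position, and train on many genes.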
Slide 45. rRNA and tRNA annotation
STEP 1. Finds all potential ribosomal RNA (rRNA) genes using BLAST against bacterial and/or archaeal rRNA databases, and masks the detected rRNA genes.
STEP 2. Predicts tRNA genes using the tRNAscan-SE program. Inside bactg_ann.pl: runs tRNAscan-SE and masks the detected tRNA genes.
Slide 46. Genes and operon identification
STEP 3. Makes initial predictions of long, slightly overlapping ORFs that are used as a starting point for calculating prediction parameters. Iterates until it stabilizes. Generates parameters such as 5th-order in-frame Markov chains for coding regions, 2nd-order Markov models for the regions around the start codon, the upstream RBS site and the stop codon, and probability distributions of ORF lengths. Predicts protein-coding genes.
STEP 4. Predicts operons based only on distances between predicted genes.
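The distance-only operon assignment of STEP 4 can be sketched as a single pass over sorted genes. This is an illustration under assumptions, not the FGENESB rule: the 100 bp threshold is hypothetical, the same-strand requirement is added because an operon is transcribed as one unit, and the tuple layout `(start, end, strand)` is chosen for the example.

```python
def group_operons(genes, max_gap=100):
    """Group adjacent same-strand genes separated by at most max_gap bp.

    genes: list of (start, end, strand) with 1-based inclusive coordinates.
    Returns a list of operons, each a list of genes in genomic order.
    """
    operons = []
    for gene in sorted(genes):
        start, _end, strand = gene
        if operons:
            _prev_start, prev_end, prev_strand = operons[-1][-1]
            # Intergenic distance; negative when the genes overlap
            gap = start - prev_end - 1
            if strand == prev_strand and gap <= max_gap:
                operons[-1].append(gene)
                continue
        operons.append([gene])
    return operons


if __name__ == "__main__":
    predicted = [(100, 500, "+"), (520, 900, "+"), (1500, 2000, "+"), (2010, 2500, "-")]
    for number, operon in enumerate(group_operons(predicted), start=1):
        print(number, operon)
```

Single-gene groups correspond to standalone transcription units. Evidence from conserved gene neighborhoods or from predicted promoters and terminators would split or merge these initial groups.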
Slide 47. Annotate genes by comparison with databases of known proteins
STEP 5. Runs blastp for predicted proteins against the COG database (cog.pro) and annotates them with COG descriptions.
STEP 6. Runs blastp against NR for proteins having no COG hits and annotates them with NR descriptions.
Slide 48. Promoter and terminator prediction and improvement of operon assignment
STEP 7. Uses information about conservation of neighboring gene pairs in known genomes to improve operon prediction.
STEP 8. Predicts potential promoters (tssb) and terminators (bterm) in the corresponding 5'-upstream and 3'-downstream regions of predicted genes.
- Tssb: bacterial promoter prediction (sigma70), using a discriminant function with characteristics of promoter sequence features (such as conserved motifs, binding sites, etc.).
- Bterm: prediction of rho-independent terminators as hairpins, with energy scoring based on a discriminant function of hairpin elements.
STEP 9. Refines operon predictions using predicted promoters and terminators as additional evidence.
Slide 49. Fgenesb_annotator output:
Op / CDS ## COG0593 ATPase involved in DNA
+ Term
+ Prom
Op / CDS ## COG0592 DNA polymerase
+ Term
+ Prom
Op / CDS ## COG2501 Uncharacterized ACR
Op / CDS ## COG1195 Recombinational DNA
2 Op / CDS ## COG0187 DNA gyrase (topoisomerase II) B subunit
+ Term
+ Prom
Op CDS ## COG0188 DNA gyrase (topoisomerase II) A subunit
+ Term
+ SSU_RRNA # AY [D: ] # 16S ribosomal RNA # Bacillus cereus
+ TRNA # Ile GAT 0 0
+ TRNA # Ala TGC 0 0
+ LSU_RRNA # AF [D: ] # 23S ribosomal RNA # Bacillus
Op CDS
+ 5S_RRNA # AE [D: ] # 5S ribosomal RNA # Bacillus
Op CDS ## Similar_to_GB
Op CDS
- Prom
Slide 59. Oceans / acid mines / agriculture
- DNA from specific sources (not growing in labs): oceans, acid mines, agriculture (with a mix of 100s of species)
- Annotation of new bacteria
- Annotation of bacterial communities
- New drugs
- New enzymes
Slide 60. Main collaborators: Asaf Salamov, Igor Seledtsov, Ilham Shahmuradov