Presentation on theme: "Martin G. Reese Nomi L. Harris George Hartzell Suzanna E. Lewis"— Presentation transcript:
1 The challenge of annotating a complete eukaryotic genome: A case study in Drosophila melanogaster. Martin G. Reese, Nomi L. Harris, George Hartzell, Suzanna E. Lewis. Drosophila Genome Center, Department of Molecular and Cell Biology, 539 Life Sciences Addition, University of California, Berkeley
2 Abstract. Many of the technical issues involved in sequencing complete genomes are essentially solved. Technologies already exist that provide sufficient solutions for ascertaining sequencing error rates and for assembling sequence data. Currently, however, standards or rules for the annotation process are still an outstanding problem. How shall the genomes be annotated, what shall be annotated, which computational tools are most effective, how reliable are these annotations, how organism-specific do the tools have to be and ultimately how should the computational results be presented to the community? All these questions are unsolved. This tutorial will give an overview and assessment of the current state of annotation based upon experiences gained at the Drosophila melanogaster genome project. In the tutorial we will do three things. First, we will break down the annotation process and discuss the various aspects of the problem. This will serve to clarify the term "annotation", which is often used to collectively describe a process that has a number of discrete steps. Second, with the participation of computational biologists from the community, we will compare existing tools for sequence annotation. We will do this by providing a 3 megabase sequence that has already been well-characterized at our center as a testbed for evaluating other feature-finding algorithms. This is similar to what has been done at the CASP (critical assessment of techniques for protein structure prediction) conferences (http://predictioncenter.llnl.gov) for protein structure prediction. Third, we will discuss which annotation problems are essentially solved and which problems remain.
3 Tutorial goals. Review the algorithms currently used in annotation. Assess existing methods under “field” conditions. Identify open issues in annotation.
4 Tutorial organization. Definitions; Annotation; “Biological” issues; “Engineering” issues; Application of tools within an existing annotation system; Break (20 minutes); Review of existing tools; Our annotation experiment; Conclusions and outstanding issues
5 What is a gene? Definition: An inheritable trait associated with a region of DNA that codes for a polypeptide chain or specifies an RNA molecule, which in turn has an influence on some characteristic phenotype of the organism.
6 What are annotations? Definition: Features on the genome derived through the transformation of raw genomic sequences into information by integrating computational tools, auxiliary biological data, and biological knowledge.
7 How does an annotation differ from a gene? Many annotations are the same as ‘genes’: the annotation describes an inheritable trait associated with a region of DNA. But an annotation may not always correspond in this way, e.g. an STS, or sequence overlap. Region of genomic DNA or RNA is not translated or transcribed.
14 Definitions for data modeling. Feature: An interval or an ordered set of intervals on a sequence that describes some biological attribute and is justified by evidence. Sequence: A linear molecule of DNA, RNA or amino acids. Evidence: A computational or experimental result coming out of an analysis of a sequence. Annotation: A set of features.
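The four definitions above map naturally onto a small data model. The following is a minimal Python sketch, not the BDGP schema: the field names (`program`, `score`, `kind`) and the 0-based half-open interval convention are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(frozen=True)
class Sequence:
    """A linear molecule of DNA, RNA or amino acids."""
    name: str
    residues: str


@dataclass(frozen=True)
class Evidence:
    """A computational or experimental result from an analysis of a sequence."""
    program: str   # hypothetical field, e.g. the name of the analysis tool
    score: float   # hypothetical field


@dataclass
class Feature:
    """An interval, or ordered set of intervals, on a sequence, justified by evidence."""
    sequence: Sequence
    intervals: List[Tuple[int, int]]  # 0-based, half-open (start, end); assumed convention
    kind: str                         # the biological attribute, e.g. "exon"
    evidence: List[Evidence] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Keep the intervals ordered and make sure each one lies on the sequence.
        self.intervals = sorted(self.intervals)
        for start, end in self.intervals:
            if not (0 <= start < end <= len(self.sequence.residues)):
                raise ValueError(f"interval ({start}, {end}) is off the sequence")

    def length(self) -> int:
        """Total number of residues covered by the feature."""
        return sum(end - start for start, end in self.intervals)


@dataclass
class Annotation:
    """A set of features."""
    features: List[Feature] = field(default_factory=list)
```

Because a feature is defined as intervals on one particular sequence, any later change to that sequence (gaps closed, frameshifts corrected) invalidates the coordinates, which is the tracking problem raised under the engineering issues.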
15 Annotation. Annotated genome. Depth of knowledge: detailed analysis (typically biological) of single genes. Breadth of knowledge: large-scale analysis (typically computational) of entire genome.
16 Annotation process overview. Data: genome sequence, auxiliary data. Methods: computational tools, database resources. Annotation systems. Understanding of a genome.
17 Types of sequence data. Chromosomal sequence: euchromatic, heterochromatic. mRNA sequences: full-length cDNA, 5’ EST, 3’ EST. Protein sequences. Insertion site flanking sequences.
18 Auxiliary data. Maps: genetic, physical, radiation hybrid (RH), deletion, cytogenetic. Expression data: tissue, stage. Phenotypes: lethality, sterility.
19 Computational annotation tools. Gene finding. Repeat finding. EST/cDNA alignment. Homology searching: BLAST, FASTA, HMM-based methods, etc. Protein family searching: PFAM, Prosite, etc.
20 Database resources. Curated sequence feature data sets: repeat elements; transposons; non-redundant mRNA; STSs and other sequence markers. Genome sequence from related species: D. melanogaster vs. D. virilis, D. hydei. Genome sequence from more distant species. Protein sequences from distant species.
21 Biological issues in annotation. Common: genes within genes; alternative splicing; alternative poly-adenylation sites. Rare: translational frame shifting; mRNA editing; eukaryotic operons; alternative initiation.
22 Engineering issues in annotation. What sequence to start with? Because features are intervals on a sequence, problems can be caused by gaps, frameshifts, and other changes to the sequence. How do you track these changes over time and model features that span gaps? When to annotate? Feature identification can aid in sequencing. It may be advisable to carry out sequencing and annotation in parallel, thus enabling them to complement one another. What analyses need to be run and how? What dependencies are there between various analysis programs? What parameter settings to use?
23 Engineering issues in annotation. What public sequence data sets are needed? What are the mechanics of obtaining public sequence databases? Are curated data sets available, or do you need to set up a means of maintaining your own (for repeats, insertions, organism of interest)? How do you achieve computational throughput? Workstation farm, or simply a big, powerful box? Job flow control. What do you do with the results? Homogenize results into a single format? Filter results for significance and redundancy.
24 Engineering issues in annotation. Interpreting the results. Is human curation needed? How can you achieve consistency between curators? How do you design the user interface so that it is simple enough to get the task completed speedily but complex enough to deal with biology? How do you capture curations? How are annotation translations to be described? EC terminology; ProSite families; Pfam domains. Is function distinguishable from process?
25 Engineering issues in annotation. How do you manage data? What is the appropriate database schema design? How is the database to be kept up to date? Will it be directly from programs running user interfaces and analyses or via a middleware layer? Is a flat file format needed and what should it be? What query and retrieval support is needed? How do you distribute data? For bulk downloads, what is the format of the data? What information is best summarized in tables? What information requires an integrated graphical view?
26 Engineering issues in annotation. How do you update the annotations? How frequently are they re-evaluated? How can re-evaluation be minimized (only subsets of the databanks, only modified sequences)? How can differences between old and new computational results be detected? Changes in computational results may need to trigger changes in curated annotations.
27 Drosophila melanogaster. Drosophila is the most important model organism*. Drosophila genome: 4 chromosomes; 180 Mb total sequence; 140 Mb euchromatic sequence; 12,000-14,000 genes. * Source: G.M. Rubin
28 Drosophila Genome Project. Laboratories working on Drosophila sequencing: BDGP (Berkeley Drosophila Genome Project); EDGP (European Drosophila Genome Project); Celera Genomics Inc. “Complete” D. melanogaster sequence will be finished by the end of 1999. Comprehensive database: FlyBase.
29 Goals of the Drosophila Genome Project. Complete genome sequence. Structure of all transcripts. Expression pattern of all genes. Phenotype resulting from mutation of all ORFs. And more...
30 Sequencing at the BDGP. Genomic sequence: P1 and BAC clones; 24 Mb of completed sequence (as of July 22, 1999); 18 Mb unfinished sequence in process. Complete tiling path in BACs: 1.5x-path draft sequencing. ESTs and cDNAs: 80,942 ESTs finished (as of March 19, 1999); over 800 full-length cDNAs.
32 What sequence to start with? Unit of sequencing at the BDGP: completed high-quality clone sequences. Reassembling the genomic sequence: need to place clones in correct genomic positions; need to integrate genes that span multiple clones; solved by using genomic overlaps to reconstitute full genomic sequence.
33 Which analyses need to be run? Similarity searches: BLAST (Altschul et al., 1990), i.e. BLASTN (nucleotide databases), BLASTX (amino acid databases), TBLASTX (nucleotide databases, six-frame translation of query and database); sim4 (Miller et al., 1998), a sequence alignment program for finding near-perfect matches between nucleotide sequences containing introns. Gene predictors: Genefinder (Green, unpublished); GenScan (Burge and Karlin, 1997); Genie (Reese et al., 1997). Other analyses: tRNAscan-SE (Lowe and Eddy, 1996).
34 Which analyses need to be run and how? mRNAs: ORFFinder (Frise, unpublished). Protein translations: HMMPFAM 2.1 (Eddy 1998) against PFAM (v Sonnhammer et al. 1997, Bateman et al. 1999); Ppsearch (Fuchs 1994) against ProSite (release 15.0) filtered with EMOTIF (Nevill-Manning et al. 1998); Psort II (Horton and Nakai 1997); ClustalW (Higgins et al. 1996).
35 What public sequence data sets are needed? Automating updates of public databases: Genbank, SwissProt, trEMBL, BLOCKS, dbEST, EDGP. Curated data sets: D. melanogaster genes (FlyBase); transposable elements (EDGP); repeat elements (EDGP); STSs (BDGP).
37 How do you achieve computational throughput? BDGP computing power: Sun Ultra 450 (3 machines, 4 processors each); Sun Enterprise (1 machine, 8 processors). Used these directly, without any system for distributed computing. Job flow control: the Genomic Daemon. Automatic batch analysis of genomic clones. Berkeley Fly Database is used for queuing system and storage of results. Many clones can be analyzed simultaneously. Results are processed and saved in XML format for interactive browsing.
38 What do you do with the results? Berkeley Output Parser (BOP). Input to BOP: genomic sequence; results of computational analyses; filtering preferences. Parses results from BLAST, sim4, GeneFinder, GenScan, and tRNAscan-SE analyses. Filters BLAST and sim4 results: eliminates redundant or insignificant hits; merges hits that represent a single region of homology. Homogenizes results into a single format. Output: sequence and filtered results in XML format.
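The filtering and merging step can be illustrated with a short Python sketch. This is not BOP itself: the `Hit` record, the score cutoff and the `max_gap` parameter are hypothetical stand-ins for the "filtering preferences" mentioned on the slide, and real hits would also carry strand and subject information.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    start: int    # start on the genomic sequence
    end: int      # end on the genomic sequence
    score: float  # similarity score reported by the search program


def filter_and_merge(hits: List[Hit], min_score: float, max_gap: int = 0) -> List[Hit]:
    """Drop insignificant hits, then merge hits that overlap (or lie within
    max_gap of each other) into one region of homology."""
    significant = sorted(h for h in hits if h.score >= min_score)
    merged: List[Hit] = []
    for hit in significant:
        if merged and hit.start <= merged[-1].end + max_gap:
            last = merged[-1]
            # Extend the previous region and keep the best score seen in it.
            merged[-1] = Hit(last.start, max(last.end, hit.end), max(last.score, hit.score))
        else:
            merged.append(hit)
    return merged
```

Sorting by start position first means each hit only needs to be compared with the most recently merged region, so the merge is a single pass.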
39 Is human curation needed? Not for everything: some features are obvious and can be identified computationally; known D. melanogaster genes are detected automatically by GeneSkimmer; repetitive elements. But still for many things: annotating complete gene structure is still hard. We use CloneCurator (BDGP’s Java graphical editor) for curation.
40 Gene Skimmer. Quick way of identifying genes in new sequence before curation. Start with XML output from BOP. Look for sim4 hits with known Drosophila genes. Find gene hits with sequence identity >98%, coverage >30%. Verify that hits represent real genes.
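The two cutoffs on this slide amount to a simple filter over sim4 hits. The sketch below is an illustration in Python, not the Gene Skimmer code: the `Sim4Hit` record is hypothetical, and "coverage" is assumed here to mean the fraction of the known gene covered by the alignment.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sim4Hit:
    gene: str        # name of the known gene that was aligned
    identity: float  # fraction of identical bases in the alignment, 0..1
    coverage: float  # fraction of the known gene covered, 0..1 (assumed definition)


def skim_genes(hits: List[Sim4Hit],
               min_identity: float = 0.98,
               min_coverage: float = 0.30) -> List[str]:
    """Return the sorted names of genes with at least one hit above both cutoffs."""
    return sorted({h.gene for h in hits
                   if h.identity > min_identity and h.coverage > min_coverage})
```

The candidates this returns would still go through the final step on the slide, verifying that the hits represent real genes.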
41 Gene Skimmer. URL: http://www.fruitfly.org/sequence/genomic-clones.html
42 CloneCurator. Displays computational results and annotations on a genomic clone. Interactive browsing: zoom/scroll; change cutoffs for display of results; analyze GC content, restriction sites, etc. Interactive annotation editing: expert “endorses” selected results. Presents annotations to community via Web site.
44 How do we annotate gene/protein function? Gene Ontology Project. Controlled hierarchical vocabulary for multiple-genome annotations and comparisons. Standardized vocabulary facilitates collaboration. Good data modeling allows better database querying. Ontology browser provides interactive search of hierarchical terms. “GO” project (http://www.ebi.ac.uk/~ashburn/GO)
48 How do you distribute the data? Bulk downloads: FASTA at, curated data sets. Tabular data at: sequenced genomic clones; clone contigs sorted by genomic location; clone contigs sorted by size. Ribbon provides integrated graphical view of annotations on physical contigs.
49 Ribbon. Human curator annotates individual clones (~100 kb). Clones are assembled into physical contigs (regions of physical map). Clone annotations are merged and renumbered for display on whole physical contigs. Ribbon is our Java tool for displaying curated annotations on physical contigs. Will soon be available on the Web.
51 How do you manage the data? Using Informix as our database server. Updated via Perl dbi.pm module. Development underway in: schema revisions; GAME DTD (Genome Annotation Markup Entities); Perl module for annotation objects (Ewan Birney).
52 How do you maintain annotations? Open questions: How frequently are annotations re-evaluated? How can re-evaluation be minimized (only subsets of the databanks, only modified sequences)? How can differences between old and new computational results be detected? Changes in computational results may need to trigger changes in curated annotations.
53 Integrated annotation systems: ACeDB, Genotator, Magpie, GAIA, TIGR
54 Integrated annotation systems: ACeDB. Developed for analysis of the C. elegans genome. Sophisticated database designed for storing annotations and related information. New Java and Web-based versions available. Written by Jean Thierry-Mieg and Richard Durbin.
56 Genotator. Back end automates sequence analysis; browser provides interactive viewing and editing of annotations. Nomi Harris (1997), Genome Research 7(7).
57 Magpie. Expert system based (PROLOG). Data collection daemon. Data analysis and report daemon. “Intelligent” integration of various individual feature prediction systems. Allows human interactions. Gaasterland and Sensen (1996), TIG, 12.
58 GAIA. Web-based system. Results displayed as Java applets. Bailey, L.C., J. Schug, S. Fischer, M. Gibson, J. Crabtree, D.B. Searls, and G.C. Overton (1998), Genome Research.
59 TIGR Human Gene Index. Gene Indices for various organisms. Databases for transcribed genes linked into external/internal genomic databases. Internal backend analysis software.
60 Computational analysis tools. Gene finding. Repeat finding. EST/cDNA alignment. Homology searching: BLAST, FASTA, HMM-based methods, etc. Protein family searching: PFAM, Prosite, etc.
61 Gene finding: Prokaryotes vs. Eukaryotes. Prokaryotes: contiguous open reading frames (ORFs); short intergenic sequences. Good method: detecting large ORFs. Complications: partial sequences; sequencing errors; start codon prediction; overlapping genes on both strands.
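Detecting large ORFs, the method this slide recommends for prokaryotes, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: only ATG is treated as a start codon, the standard stop codons are used, the `min_codons` default is arbitrary, and coordinates are reported on the strand that was scanned.

```python
from typing import List, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    return dna.translate(_COMPLEMENT)[::-1]


def find_orfs(dna: str, min_codons: int = 100) -> List[Tuple[str, int, int]]:
    """Scan all six reading frames and return (strand, start, end) for every
    ATG...stop stretch of at least min_codons codons (stop codon excluded
    from the count). end is exclusive and includes the stop codon."""
    orfs = []
    for strand, seq in (("+", dna.upper()), ("-", reverse_complement(dna.upper()))):
        for frame in range(3):
            start = None
            for i in range(frame, len(seq) - 2, 3):
                codon = seq[i:i + 3]
                if start is None and codon == "ATG":
                    start = i                      # first ATG opens the ORF
                elif start is not None and codon in STOP_CODONS:
                    if (i - start) // 3 >= min_codons:
                        orfs.append((strand, start, i + 3))
                    start = None                   # look for the next ORF
    return orfs
```

The complications listed on the slide show up directly in this sketch: a partial sequence may lack the stop codon so the ORF is never reported, a single sequencing error can introduce or remove a stop, and taking the first ATG is only a guess at the true start codon.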
62 Gene finding: Prokaryotes vs. Eukaryotes. Eukaryotes: complex gene structures (exons/introns); D. melanogaster has an average of 4 introns/gene; very long genes (D. melanogaster X gene 160 kb); very long introns; many introns; “nested”, overlapping, and alternatively spliced genes; 5’ UTRs with non-coding exons; long 3’ UTRs; complex transcription machinery. ORF-finding alone is not adequate.
63 Integrated gene finding. Assumptions: signal and content sensors alone are not sufficient for predicting gene structure; gene structure is hierarchical; each component (exon, intron, splice site, etc.) can be modeled independently. The approach: generate a list of candidates for each component (with scores); assemble the components into a “gene model”.
64 Integrated gene finding: Dynamic programming. Determines the best combination of components. Two-part problem: develop an “optimal” scoring function; use dynamic programming to find an “optimal” alignment through the scoring matrix.
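As a rough illustration of the second part of the problem, the Python sketch below chains scored exon candidates with dynamic programming. It is a deliberately simplified stand-in: it only requires that chosen exons do not overlap, whereas real gene finders also enforce reading-frame compatibility, splice signals and intron scores. The tuple layout `(start, end, score)` with half-open coordinates is an assumption.

```python
from bisect import bisect_right
from typing import List, Tuple

Exon = Tuple[int, int, float]  # (start, end, score), half-open coordinates


def best_exon_chain(candidates: List[Exon]) -> Tuple[float, List[Exon]]:
    """Return the highest-scoring set of non-overlapping exon candidates."""
    exons = sorted(candidates, key=lambda e: e[1])   # order by end position
    ends = [e[1] for e in exons]
    # best[i] holds (score, chain) for the optimum over the first i exons.
    best: List[Tuple[float, List[Exon]]] = [(0.0, [])]
    for i, (start, end, score) in enumerate(exons):
        # j = number of earlier exons that end at or before this exon starts.
        j = bisect_right(ends, start, 0, i)
        take = (best[j][0] + score, best[j][1] + [(start, end, score)])
        skip = best[i]
        best.append(take if take[0] > skip[0] else skip)
    return best[-1]
```

Each exon is either taken (added to the best chain that ends before it) or skipped, so the optimum over all combinations is found without enumerating them.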
66 Integrated gene finding: Linear and Quadratic Discriminant Analysis (LDA/QDA). Deterministic calculation of thresholds. n-class discrimination. Example: HSPL, Solovyev et al. (1997), ISMB, 5. QDA: can represent a great improvement over LDA. MZEF, Michael Zhang (1997), PNAS, 94.
67 Integrated gene finding: Feed-forward neural networks. Supervised learning. Training to discriminate between several feature classes. Computing units. Gradient descent optimization. Multi-layer networks. Limitations: black-box predictions; local minima. Example: GRAIL, Uberbacher et al. (1991), PNAS, 88.
68 Approaches to gene finding: Hidden Markov models. A finite model describing a probability distribution over all possible sequences of equal length. “Natural” scoring function. (Conditional) maximum likelihood “training”. Markov: k-order Markov chain, current state dependent on k previous states; the next state in a 1st-order Markov model depends on the current state. Hidden: hidden states generate visible symbols. Assumptions: independence of states; no long-range correlation. Example: HMMgene, A. Krogh (1998), In Guide to Human Genome Computing.
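To make "hidden states generate visible symbols" concrete, here is a minimal Viterbi decoder for a first-order HMM in Python. It is a generic textbook sketch, not the HMMgene implementation; the two-state model in the usage note is a toy with made-up probabilities, and the function assumes a non-empty observation sequence and strictly positive probabilities.

```python
import math
from typing import Dict, List, Sequence


def viterbi(observations: Sequence[str],
            states: Sequence[str],
            start_p: Dict[str, float],
            trans_p: Dict[str, Dict[str, float]],
            emit_p: Dict[str, Dict[str, float]]) -> List[str]:
    """Return the most probable hidden-state path (log-space Viterbi)."""
    # Scores for the first symbol.
    column = {s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
              for s in states}
    backpointers = []
    for symbol in observations[1:]:
        new_column, pointers = {}, {}
        for s in states:
            # Best predecessor for state s at this position.
            prev = max(states, key=lambda p: column[p] + math.log(trans_p[p][s]))
            new_column[s] = (column[prev] + math.log(trans_p[prev][s])
                             + math.log(emit_p[s][symbol]))
            pointers[s] = prev
        column = new_column
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(states, key=lambda s: column[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]
```

For example, with states "exon" (emitting mostly G and C) and "intron" (emitting mostly A and T) and a 0.9 probability of staying in the same state, decoding "GCGCGCATATAT" labels the first six bases exon and the last six intron. The score of a path is simply its log probability under the model, which is the "natural" scoring function the slide refers to.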
69 Approaches to gene finding: Generalized hidden Markov models. Each HMM state can be a probabilistic sub-model. Complex hierarchical system. Requires care in modeling state overlaps. Examples: Genie, Kulp et al. (1996), ISMB, 4; GenScan, Burge and Karlin (1997), JMB, 268(1), 78-94.
70 Gene finding software. Signal recognition: promoter prediction; splice site prediction; start codon prediction; poly-adenylation site prediction. Coding potential. Coding exons. Gene structure prediction: spliced alignment; LDA/QDA; neural networks; HMMs and GHMMs.
71 Promoter recognition. PromoterScan: identify potential promoter regions; based on databases of known TF binding sites, TFD (Ghosh (1991), TIBS, 16) and TRANSFAC (Heinemeyer et al. (1999), NAR, 27); Prestridge (1995), JMB, 249. MatInd and MatInspector: finding consensus matches to known TF binding sites; based on TRANSFAC; Heinemeyer et al. (1999), NAR, 27; Quandt et al. (1995), NAR, 23.
72 Promoter recognition (cont.). TSSG/TSSW: LDA-based combination of several features (TATA-box, Inr signal, upstream regions); Solovyev et al. (1997), ISMB, 5. Transcription Element Search Software: identify TF binding sites; based on TRANSFAC.
73 Promoter recognition (cont.). CBS Promoter 2.0 Prediction Server: simulated transcription factors; principles common to neural networks and genetic algorithms; Knudsen (1999), Bioinformatics 15(5). CorePromoter: position dependent 5-tuple; QDA; Michael Zhang (1998), Genome Research, 8.
74 Promoter recognition (cont.). Neural network promoter prediction (NNPP): time-delay neural network; combining TATA box and initiator; Reese (1999), in preparation.
76 Promoter recognition (cont.). Markov chain promoter finder: competing interpolated Markov chains for promoters, exons, introns; promoter model consists of five states representing the core promoter parts; Ohler, Reese et al., Bioinformatics 15(5).
77 Splice site prediction. Nakata, 1985: Nakata (1985), NAR, 13(14). BCM GeneFinder: HSPL, prediction of splice sites in human DNA sequences; triplet frequencies in various functional parts of splice site regions; combined with codon statistics; Solovyev et al. (1994), NAR, 22(24).
78 Splice site prediction (cont.). Neural network splice site predictor (NNSPLICE): multi-layered feed-forward neural network; modeled after Brunak et al. (1991), JMB, 220; Reese et al. (1997), JCB, 4(3). NetGene2: combination of neural networks and rule-based system; splice site signal neural network combined with coding potential; Hebsgaard et al. (1996), NAR, 24(17); Brunak et al. (1991), JMB, 220.
79 Splice site prediction (cont.). SplicePredictor: logitlinear models for splice site regions; degree of matching to the splice site consensus; local compositional contrast; Brendel and Kleffe (1998), NAR, 26(20).
80 Start codon prediction. NetStart: trained on cDNA-like sequences; neural network based; local start codon information; global sequence information; Pedersen and Nielsen (1997), ISMB, 5.
81 Poly-adenylation signal prediction. BCM GeneFinder: POLYAH, recognition of 3'-end cleavage and poly-adenylation region; triplet frequencies in various functional parts in poly-adenylation regions; LDA; Solovyev et al. (1994), NAR, 22(24).
82 Prediction of coding potential. Periodicity detection: coding sequences have an inherent periodicity of three; especially good on long coding sequences. Auto-correlation: seeking the strongest response when the shifted sequence is compared with the original; Michel (1986), J. Theor. Biol. 120. Fourier transformation (spectral analysis): detection of a peak at the position corresponding to a frequency of 1/3 (period three); Silverman and Linsker (1986), J. Theor. Biol. 118.
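The spectral idea can be shown with a few lines of Python. This sketch evaluates the Fourier power at frequency 1/3 only, summed over the four base-indicator sequences and divided by the sequence length; that normalization and the function name are choices made here for illustration, not taken from the cited papers, and the input is assumed to be non-empty.

```python
import cmath


def period3_power(dna: str) -> float:
    """Fourier power at frequency 1/3, summed over the indicator
    sequences of A, C, G and T and normalized by sequence length."""
    dna = dna.upper()
    total = 0.0
    for base in "ACGT":
        # Discrete Fourier coefficient of the 0/1 indicator of this base
        # at frequency 1/3 (period of three positions).
        coeff = sum(cmath.exp(-2j * cmath.pi * pos / 3)
                    for pos, b in enumerate(dna) if b == base)
        total += abs(coeff) ** 2
    return total / len(dna)
```

In a window-based scan, this value would be computed for successive windows along the genomic sequence, with high values suggesting coding potential. Short windows give noisy estimates, consistent with the slide's remark that the method is especially good on long coding sequences.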
83 Prediction of coding potential (cont.). Trifonov (1980; 1987): G-notG-U periodicity; JMB, 194. Fickett (1982): position asymmetry in the three codon positions; NAR 10(17). Staden (1984): codon usage in tables; NAR 12.
84 Prediction of coding potential (cont.). Claverie and Bougueleret (1987): hexamer frequency differentials; NAR 14. Fichant and Gautier (1987): codon usage homogeneity; CABIOS, 3(4). GRAIL I (1991): neural network using a shifting fixed-size window; 7 sensors as input, 2 hidden layers and 1 unit as output; Uberbacher et al. (1991), PNAS, 88(24).
85 Prediction of coding potential (cont.). GeneMark (1986): inhomogeneous Markov chain models; easily trainable (closed-form solution for maximum likelihood); used extensively in prokaryotic genomes; Borodovsky et al. (1993), Computers & Chemistry, 17. Glimmer (1998): interpolated Markov chains from first to eighth order; Salzberg et al. (1998), NAR, 26(2).
86 Prediction of coding potential (cont.). Review by Fickett (1992): “Assessment of protein coding measures”, NAR, 20.
87 Prediction of coding exons. SorFind: detection of “spliceable” ORFs; Hutchinson, NAR, 20(13). BCM GeneFinder: FEXD, FEXN, FEXA, FEXY, FEXH, HEXON; LDA; Solovyev et al. (1994), NAR, 22(24). GRAIL II: exon candidates, heuristic integration, learning with neural network; Uberbacher et al., Genet. Eng., 16.
88 “Integrated” gene models: LDA/QDA. FGene: LDA based; dynamic programming for the integration of LDA output; Solovyev et al. (1995), ISMB, 3.
89 “Integrated” gene models: NN. GeneParser: “gene-parsing” approach; potential alternative splicing recognized; neural network and dynamic programming; Snyder and Stormo (1995), JMB, 248, 1-18.
90 “Integrated” gene models: Artificial intelligence approaches. GeneID: rule-based system; homology integration; Guigó et al. (1992), JMB, 226. GeneID using DP: DP to combine a set of potential exons; Guigó et al. (1998), JCB, 5.
91 “Integrated” gene models: Artificial intelligence approaches. GenLang: syntactic pattern recognition system; formal grammar; tools from computational linguistics; Dong and Searls (1994), Genomics, 23.
92 “Integrated” gene models: HMMs. HMMGene: several genes per sequence possible; user constraints possible; Krogh (1997), ISMB, 5. GeneMark.hmm: based on GeneMark program for bacterial sequences; can predict frame shifts; trained for various organisms; Lukashin and Borodovsky (1998), NAR, 26.
93 “Integrated” gene models: GHMMs. Genie: generalized hidden Markov model with length distribution; integration of multiple content and signal sensors. Content: codon statistics, repeats, intron, intergenic, database homology hits. Signal: promoter, start codon, splice sites, stop codon. Dynamic programming to find optimal parse. Several genes per sequence possible. Kulp et al. (1996), ISMB, 4; Reese et al. (1997), JCB, 4(3).
95 “Integrated” gene models: GHMMs. GenScan: multiple content and signal models; semi-hidden Markov model sensors with length distribution; takes GC content into account (separate models); several genes per sequence possible; Burge and Karlin (1997), JMB, 268(1).
96 EST/cDNA alignment for gene finding: Spliced alignments. PROCRUSTES: spliced alignment algorithm; dynamic programming to combine a set of potential exons; frame conservation; homologous sequence needed; Gelfand et al. (1996), PNAS, 93.
97 EST/cDNA alignment. Sim4: aligns cDNA to genomic sequence; uses local similarity; Florea et al. (1998), Genome Research, 8. GeneWise: dynamic programming; partial genes allowed; based on Pfam and statistical splice site models; Birney (1999), unpublished.
98 EST/cDNA alignment (cont.). ACEMBLY: aligns ESTs to genomic sequence; identifies alternative splicing; integrated in ACeDB; Jean Thierry-Mieg (unpublished).
99 Repeat finders. Censor: uses database of repeat sequences; Jurka et al. (1996), Comp. and Chem., 20(1). BLAST: integrated masking operations; XBLAST procedure; Claverie (1994), In Automated DNA Sequencing and Analysis Techniques, M. D. Adams, C. Fields and J. C. Venter, eds.; http://www.ncbi.nlm.nih.gov/BLAST
100 Repeat finders (cont.). RepeatMasker: detection of interspersed repeats; Smit and Green, unpublished results.
101 Homology searching. BLAST suite: BLASTN, BLASTX, TBLASTX, PSI-BLAST; Altschul et al. (1990), JMB, 215. FASTA suite: FASTA, TFASTA; Pearson and Lipman (1988), PNAS, 85. HMM-based searching: SAM (UCSC group); HMMER, Sean Eddy.
102 Gene family searching: BLOCKS, PROSITE, PFAM, SCOP
103 The genome annotation experiment (GASP1). Genome Annotation Assessment Project (GASP1). Annotation of 2.9 Mb of Drosophila melanogaster genomic DNA. Open to everybody, announced on several mailing lists. Participants can use any analysis methods they like (gene finding programs, homology searches, by-eye assessment, combination methods, etc.) and should disclose their methods. “CASP”-like. 12 participating groups.
105 Goals of the experiment. Compare and contrast various genome annotation methods. Objective assessment of the state of the art in gene finding and functional site prediction. Identify outstanding problems in computational methods for the annotation process.
106 Adh contig. 2.9 Mb contiguous Drosophila sequence from the Adh region, one of the best studied genomic regions. From chromosome 2L (34D-36A). Ashburner et al. (to appear in Genetics). 222 gene annotations (as of July 22, 1999). 375,585 bases are coding (12.95%). We chose the Adh region because it was thought to be typical: a representative test bed for evaluating annotation techniques.
109 Drosophila data sets provided to participants. Curated Drosophila nuclear DNA "coding sequences" (CDS). Curated non-redundant Drosophila genomic DNA data (275 “multi”- and 144 “single”-exon sequence entries from Genbank). Drosophila 5' and 3' splice sites. Drosophila start codon sites. Drosophila promoter sequences. Drosophila repeat sequences. Drosophila transposon sequences. Drosophila cDNA sequences. Drosophila EST sequences. URL:
110 Timetable. May 13, 1999 - June 30, 1999: Distribution of the sample sequence and associated data to the predictors. Collection of predictions. June 30, 1999 - July 31, 1999: Evaluation of the predictions by the Drosophila Genome Center. August 4, 1999: External expert assessment of the prediction results (HUGO meeting, EMBL). August 6, 1999: Tutorial #3 at the ISMB ‘99 conference in Heidelberg, Germany.
111 Resources for assessing predictions. 80 cDNA sequences NOT in Genbank before experiment deadline: sequenced from 5 different cDNA libraries; 3 paralogs to other genes in the genome; 19 cDNAs with cloning artifacts; 2 apparently representing unspliced RNA; multiple inserts (2 cDNAs cloned in the same vector); 58 “usable” cDNAs. 33 cDNA sequences in Genbank during experiment. Annotations from Adh paper.
112 Curated data sets for assessing predictions. Standard 1 (Adh.std1.gff), “conservative gene set”: 43 gene structures (7 single- and 36 multi-coding-exon genes). Criteria for inclusion: >=95% (most >=99%) of the cDNA aligned to genomic DNA (using sim4); “GT”/“AG” splice site consensus sequences; splice site score from neural net, with 5’ splice sites >=0.35 threshold (98% true positive score) and 3’ splice sites >=0.25 threshold (92% true positive score); start codon and stop codon annotations from Standard 3 (derived from Adh paper). These 43 genes represent “typical” genes.
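The splice-site part of the inclusion criteria can be written down as a small check. The Python sketch below is only an illustration of the stated rules: it assumes intron coordinates are 0-based and half-open, and that the neural-net scores for the 5’ and 3’ sites are supplied by an external predictor. The function and constant names are hypothetical.

```python
DONOR_THRESHOLD = 0.35     # 5' splice site cutoff from the slide
ACCEPTOR_THRESHOLD = 0.25  # 3' splice site cutoff from the slide


def intron_meets_standard1(genomic: str, start: int, end: int,
                           donor_score: float, acceptor_score: float) -> bool:
    """True if the intron genomic[start:end] begins with GT, ends with AG,
    and both splice site scores reach their thresholds."""
    intron = genomic[start:end].upper()
    has_consensus = (len(intron) >= 4
                     and intron.startswith("GT")
                     and intron.endswith("AG"))
    return (has_consensus
            and donor_score >= DONOR_THRESHOLD
            and acceptor_score >= ACCEPTOR_THRESHOLD)
```

A gene would qualify for the conservative set only if every one of its introns passes this check and its cDNA alignment meets the coverage criterion.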
113 Curated data sets for assessing predictions. Standard 2 (Adh.std2.gff): superset of Standard 1; 15 additional gene structures; same alignment criteria as Standard 1 but no splice site consensus requirement; not used in the experiment.
114 Curated data sets for assessment. Standard 3 (Adh.std3.gff), “more complete gene set”: 222 gene structures (39 single- and 183 multi-coding-exon genes). Criteria: annotated as described in Ashburner et al.; cDNA to genomic alignment using sim4; start codons predicted by ORFFinder (Frise et al., unpublished); ~182 genes have similarity to a homologous protein sequence in another organism or have a Drosophila EST hit; edge verification by partial EST/cDNA alignments; BLASTX, TBLASTX homology results; PFAM alignments; gene structure verification using GenScan (human); 14 genes had EST/homology hits but no gene finding predictions; ~40 genes only have “strong” GenScan predictions.
115 Submission format GFF (Durbin and Haussler, 1998, unpublished)
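For readers unfamiliar with GFF, each record is one tab-separated line with eight mandatory fields (sequence name, source, feature, start, end, score, strand, frame) and an optional trailing group field. The Python sketch below parses one such line; the sample record in the usage note is invented for illustration and is not taken from any submission.

```python
from typing import NamedTuple, Optional


class GffRecord(NamedTuple):
    seqname: str
    source: str
    feature: str
    start: int               # 1-based, inclusive
    end: int                 # 1-based, inclusive
    score: Optional[float]   # None when the field is "."
    strand: str              # "+", "-" or "."
    frame: str               # "0", "1", "2" or "."
    group: str               # optional grouping field, "" if absent


def parse_gff_line(line: str) -> Optional[GffRecord]:
    """Parse one GFF line; return None for blank lines and comments."""
    if not line.strip() or line.startswith("#"):
        return None
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        raise ValueError("a GFF record needs at least 8 tab-separated fields")
    seqname, source, feature, start, end, score, strand, frame = fields[:8]
    group = fields[8] if len(fields) > 8 else ""
    return GffRecord(seqname, source, feature, int(start), int(end),
                     None if score == "." else float(score),
                     strand, frame, group)
```

A line such as `Adh<TAB>myprogram<TAB>exon<TAB>1200<TAB>1450<TAB>0.95<TAB>+<TAB>0<TAB>gene1` (with real tab characters) would parse into a record with `start == 1200` and `group == "gene1"`. A shared line-oriented format like this lets predictions from different programs be compared against the curated standards with the same tooling.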
117 Submissions. MAGPIE Team. Credit: Terry Gaasterland, Alexander Sczyrba, Elizabeth Thomas, Gulriz Kurban, Paul Gordon, Christoph Sensen; Laboratory for Computational Genomics, Rockefeller and Institute for Marine Biosciences, Canada. Method: Automatic genome analysis system integrating Drosophila Genscan predictions, confirming exon boundaries using database searches, repeat finding (Calypso, REPuter) and gene function annotations.
118 Submissions (cont.). References: “Multigenome MAGPIE” poster at ISMB ‘99. Gaasterland and Ragan (1998), J. of Microbial and Comparative Genomics, 3. Gaasterland and Sensen (1996), Biochimie 78. REPuter: Kurtz and Schleiermacher (1999), Bioinformatics 15(5).
119 Submissions (cont.). Computational Genomics Group, The Sanger Centre. Credit: Victor Solovyev, Asaf Salamov. Method: Discriminant analysis based gene prediction programs FGenes (trained for human) and FGenesH (trained for Drosophila); combining the output of FGenes, FGenesH and BLAST using FGenesH+. 3 different “threshold” annotations are submitted. The program running time is linear in the sequence length. Automatic, plus additional user interactive screening. Non-redundant NCBI database used for BLAST. URL/References
120 Submissions (cont.). Genome Annotation Group, The Sanger Centre. Credit: Ewan Birney. Method: Protein family based gene identification using Wise2 (previously Genewise) and PFAM. URL
121 Submissions (cont.). Pattern Recognition, The University of Erlangen. Credit: Uwe Ohler, Georg Stemmer, Stefan Harbeck, Heinrich Niemann. Method: Promoter recognition based on interpolated Markov chains; “Genscan”-like promoter model (MCPromoter); maximal mutual information based estimation of interpolated Markov chains. Automatic. Promoter training data set from
122 Submissions (cont.). References: Ohler, Harbeck, Niemann, Noeth and Reese (1999), Bioinformatics 15(5). Ohler, Harbeck and Niemann (1999), Proc. EUROSPEECH, to appear. URL
123 Submissions (cont.). Computational Biosciences, Oak Ridge National Laboratory. Credit: Richard J. Mural, Douglas Hyatt, Frank Larimer, Manesh Shah, Morey Parang. Method: Integrated neural network based system including gene assembly using EST and homology information (GRAILexp). URL:
124 Submissions (cont.). Center for Biological Sequence Analysis, Technical University of Denmark. Credit: Anders Krogh. Method: Modular HMM incorporating database hits (proteins and ESTs/cDNAs) and other “external information” probabilistically (HMMGene); the HMM has modules for coding regions, splice sites, translation start/stop, etc. It will be a fully automated system. Trained on Drosophila data and Victor Solovyev (personal communication).
125 Submissions (cont.). References: Krogh (1998), In S.L. Salzberg et al., eds., Computational Methods in Molecular Biology, 45-63, Elsevier. Krogh (1997), Gaasterland et al., eds., Proc. ISMB 97. URL: Not yet for Drosophila.
126 Submissions (cont.). BLOCKS group, Fred Hutchinson Cancer Research Center in Seattle, Washington. Credit: Jorja Henikoff, Steve Henikoff. Method: DNA translation in 6 frames and search against BLOCKS+ and against BLOCKS extracted from Smart3.0 (http://coot-embl-heidelberg.de/SMART/) using BLIMPS; automatic post-processing to join multiple predictions from the same block. Automatic with some user interactive screening of results.
127 Submissions (cont.). References: Henikoff, Henikoff and Pietrokovski (1999), Nucl. Acids Res., 27. Henikoff and Henikoff (1994), Proc. 27th Ann. Hawaii Intl. Conf. on System Sciences. Henikoff and Henikoff (1994), Genomics, 19. URL
128 Submissions (cont.) Genome Informatics Team, IMIM, Barcelona, Spain. Credit: Roderic Guigó, Josep F. Abril, Enrique Blanco, Moises Burset, Genis Parra. Method: Dynamic programming based system to combine potential exon candidates modeled as a fifth order Markov model and functional sequence sites modeled as a position weight matrix (Geneid version 3). Fully automatic, very fast. Trained on Drosophila data.
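The idea of using dynamic programming to combine scored exon candidates can be illustrated with a deliberately simplified sketch. It is not Geneid's algorithm: the exon scores are assumed to be given, and reading-frame compatibility and exon types (first, internal, terminal) are ignored. It only picks the highest-scoring set of non-overlapping candidates; `best_exon_chain` is a hypothetical name.

```python
from bisect import bisect_right


def best_exon_chain(exons):
    """Pick the highest-scoring set of non-overlapping exon candidates.

    exons: list of (start, end, score) with inclusive coordinates.
    Returns (total_score, chain), the chain sorted by position.
    """
    exons = sorted(exons, key=lambda e: e[1])  # sort by end coordinate
    ends = [e[1] for e in exons]
    n = len(exons)
    best = [0.0] * (n + 1)     # best[i]: best score using the first i exons
    choice = [None] * (n + 1)  # choice[i]: predecessor index if exon i is used

    for i, (start, _end, score) in enumerate(exons, 1):
        # j = number of earlier candidates that end before this one starts
        j = bisect_right(ends, start - 1, 0, i - 1)
        take = best[j] + score
        if take > best[i - 1]:
            best[i], choice[i] = take, j
        else:
            best[i] = best[i - 1]

    # Trace back through the choices to recover the selected exons.
    chain, i = [], n
    while i > 0:
        if choice[i] is None:
            i -= 1
        else:
            chain.append(exons[i - 1])
            i = choice[i]
    return best[n], chain[::-1]


if __name__ == "__main__":
    candidates = [(1, 100, 5.0), (50, 150, 7.0), (120, 200, 4.0), (160, 250, 3.0)]
    print(best_exon_chain(candidates))
```

Sorting by end coordinate and using a binary search for the compatible predecessor keeps the run time at O(n log n), which is one reason chaining approaches of this kind can be fast on long genomic sequences.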
129 Submissions (cont.) References: Guigó et al. (1998), JCB, 5. URL: Information on training process:
130 Submissions (cont.) Mark Borodovsky's Lab, School of Biology, Georgia Institute of Technology. Credit: Mark Borodovsky, John Besemer. Method: Markov chain models combined with HMM technology (Genemark.hmm). URL:
131 Submissions (cont.) Biodivision, GSF Forschungszentrum für Umwelt und Gesundheit, Neuherberg, Germany. Credit: Matthias Scherf, Andreas Klingenhoff, Thomas Werner. Method: Universal sequence classifier based on a correlated word analysis to predict initiators and promoter-associated TATA boxes (CoreInspector V1.0 beta). Sequences of 100 bp are classified at once. Trained on Eukaryotic Promoter Database (EPD version 5.9). Fully automatic, 2 seconds per 1 kb. References: Scherf et al. (1999), in preparation. URL:
132 Submissions (cont.) The Department of Biomathematical Sciences, Mount Sinai School of Medicine, New York. Credit: Gary Benson. Method: Tandem repeats finder (TRF v2.02) uses a theoretical model of the similarity between adjacent copies of a pattern (pattern from bp recognized); dynamic programming for candidate validation. Fully automatic; very fast (seconds per 1 Mb). References: Benson (1999), Nucl. Acids Res., 27(2). URL:
133 Submissions (cont.) Genie, UC Berkeley/UC Santa Cruz/Neomorphic Inc. Credit: Martin G. Reese, David Kulp, Hari Tammana, David Haussler. Method: Generalized hidden Markov model with optional integration of EST hits and homology searches (Genie). Trained on Drosophila data. Semi-automatic, in that the overlaps of the analyzed sequence contigs (110 kb) were manually run again with Genie to resolve conflicts. BLAST used for homology searches on non-redundant protein database (nr).
134 Submissions (cont.) References: Reese et al. (1997), JCB, 4(3); Kulp et al. (1997), Biocomputing: Proc. of the 1997 PSB conference; Kulp et al. (1996), ISMB, 4. URL:
154 Definition: “Joined” and “split” genes. JG = (# actual genes that overlap predicted genes) / (# predicted genes that overlap one or more actual genes). SG = (# predicted genes that overlap actual genes) / (# actual genes that overlap one or more predicted genes). JG > 1: tendency to join multiple actual genes into one prediction. SG > 1: tendency to split actual genes into separate gene predictions. Inspired by Hayes and Guigó (1999), unpublished.
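A minimal sketch of how the two ratios could be computed from gene coordinates is given below. The slide does not say how the numerators are tallied, so one reading is assumed here: each numerator counts overlapping (actual, predicted) pairs, which makes JG the average number of actual genes per matched prediction and SG the average number of predictions per matched actual gene, both at least 1. If the numerators counted distinct genes instead, the two ratios would simply be reciprocals of each other. The function names and the inclusive `(start, end)` tuples are illustrative choices.

```python
def overlaps(a, b):
    """True if two inclusive (start, end) intervals share at least one base."""
    return a[0] <= b[1] and b[0] <= a[1]


def joined_split(actual, predicted):
    """Return (JG, SG) for lists of actual and predicted gene intervals.

    Assumption: numerators count overlapping (actual, predicted) pairs.
    """
    pairs = sum(1 for a in actual for p in predicted if overlaps(a, p))
    predicted_hit = sum(
        1 for p in predicted if any(overlaps(a, p) for a in actual)
    )
    actual_hit = sum(
        1 for a in actual if any(overlaps(a, p) for p in predicted)
    )
    jg = pairs / predicted_hit if predicted_hit else float("nan")
    sg = pairs / actual_hit if actual_hit else float("nan")
    return jg, sg


if __name__ == "__main__":
    actual = [(100, 200), (1000, 2000)]
    # The second actual gene is split into three separate predictions.
    predicted = [(100, 200), (1000, 1300), (1400, 1700), (1800, 2000)]
    print(joined_split(actual, predicted))
```

In the usage example no prediction spans more than one actual gene, so JG stays at 1, while the three-way split of the second gene pushes SG above 1.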
156 Annotation experiment results Results available during tutorial and at
157 Results: Base level. Sensitivity: low variability among predictors; ~95% coverage of the proteome. Specificity: ~90%; programs that are more like Genscan (used for original annotation) might do better?
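For readers unfamiliar with base-level scoring, the sketch below shows one way such numbers can be computed. It assumes the convention common in gene-finding evaluations, which this slide does not spell out: sensitivity is the fraction of actual coding bases that are predicted, and specificity is the fraction of predicted coding bases that are actual. The function names and inclusive coordinates are illustrative.

```python
def covered_bases(intervals):
    """Return the set of positions covered by inclusive (start, end) intervals."""
    bases = set()
    for start, end in intervals:
        bases.update(range(start, end + 1))
    return bases


def base_level_accuracy(actual_exons, predicted_exons):
    """Return (sensitivity, specificity) at the base level.

    Assumed definitions: Sn = TP / (TP + FN), Sp = TP / (TP + FP),
    where TP is the number of bases that are both actual and predicted coding.
    """
    actual = covered_bases(actual_exons)
    predicted = covered_bases(predicted_exons)
    tp = len(actual & predicted)
    sn = tp / len(actual) if actual else float("nan")
    sp = tp / len(predicted) if predicted else float("nan")
    return sn, sp


if __name__ == "__main__":
    print(base_level_accuracy([(1, 100)], [(51, 150)]))
```

Building explicit position sets is fine for a short example but would be memory-hungry on a multi-megabase test region; an interval-merging implementation would be preferable there.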
158 Results: Exon level. Higher variability among predictors. Up to ~75% sensitivity (both exon boundaries correct). 55% specificity. Low specificity because partial exon overlaps do not count. Missing exons below 5%. Many wrong exons (~20%).
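The exon-level measures mentioned above can be sketched as follows. The definitions are assumptions consistent with the slide's wording: an exon counts as correct only if both boundaries match exactly, a "missing" exon is an actual exon with no overlapping prediction, and a "wrong" exon is a predicted exon with no overlapping actual exon. Names are illustrative.

```python
def overlaps(a, b):
    """True if two inclusive (start, end) intervals share at least one base."""
    return a[0] <= b[1] and b[0] <= a[1]


def exon_level_accuracy(actual, predicted):
    """Return exon-level sensitivity, specificity, missing and wrong rates."""
    actual_set, predicted_set = set(actual), set(predicted)
    exact = actual_set & predicted_set  # both boundaries identical
    missing = [
        a for a in actual_set
        if not any(overlaps(a, p) for p in predicted_set)
    ]
    wrong = [
        p for p in predicted_set
        if not any(overlaps(a, p) for a in actual_set)
    ]
    n_actual, n_predicted = len(actual_set), len(predicted_set)
    nan = float("nan")
    return {
        "sensitivity": len(exact) / n_actual if n_actual else nan,
        "specificity": len(exact) / n_predicted if n_predicted else nan,
        "missing": len(missing) / n_actual if n_actual else nan,
        "wrong": len(wrong) / n_predicted if n_predicted else nan,
    }


if __name__ == "__main__":
    actual = [(1, 100), (200, 300), (400, 500), (600, 700)]
    predicted = [(1, 100), (200, 290), (800, 900)]
    print(exon_level_accuracy(actual, predicted))
```

In the usage example the prediction `(200, 290)` overlaps an actual exon but misses its end by ten bases, so it counts neither as correct nor as wrong, which is how partial overlaps lower specificity without showing up among the wrong exons.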
160 Results: Gene level. 60% of actual genes predicted completely correctly. Specificity only 30-40%. 5-10% missed genes (comparable to Sanger Center). 40% wrong genes; a lot of short genes over-predicted (possibly not annotated in Standard 3). Splitting genes is a bigger problem than joining genes.
175 Idfg1, Idfg2, Idfg3 (cont.) Chitinase-related. Gene function has changed (now a growth factor).
176 Conclusion of GASP1. 95% coverage of the proteome. Base level prediction is easier, exon level prediction is harder. Small genes over-predicted (?). Long introns. The high number of “wrong genes” indicates possible incomplete annotation in Standard 3 (Are there more genes?). HMM seems to currently be the best approach. Major improvements in multiple gene regions.
177 Conclusion GASP1 (cont.). Much lower false positive rates. Methods optimized for the organism of interest do better. Gene finding that includes homology does not always improve prediction. Split genes are more of a problem than joined genes. No program is perfect.
178 Discussion GASP1. Genes in introns. Alternative splicing. Genomic contamination in cDNA libraries. Translation start prediction. Biological verification of prediction needed. Improve test bed by cDNA sequencing. More regulation data needed to confirm promoter assessment. Combining methods. Better methods needed. GASP 2?
179 Conclusions on annotating complete eukaryotic genomes. Throughput has to improve dramatically. Not only genes but also their relationships have to be elucidated. Complete transcript cDNAs are a very powerful tool for annotation, including alternative transcripts. Comparative genomics as well as expression analysis improves/completes genome annotation. Standardization efforts needed (ontology working group, OMG, OiB, NCBI/EBI, Bioxml, etc.). Standards for description of gene products. Exchange format (GFF, Genbank, EMBL, XML).
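As a concrete illustration of what a simple exchange format looks like, the sketch below parses one tab-separated GFF feature line into named fields. It assumes the classic layout of eight mandatory columns (sequence name, source, feature, start, end, score, strand, frame) plus an optional ninth grouping/attribute column; the class name, function name and the example line are hypothetical and do not come from the tutorial data.

```python
from typing import NamedTuple, Optional


class GffFeature(NamedTuple):
    seqname: str
    source: str
    feature: str
    start: int
    end: int
    score: Optional[float]  # None when the column holds "."
    strand: str
    frame: str
    attributes: str         # empty string when the ninth column is absent


def parse_gff_line(line):
    """Parse one tab-separated GFF feature line (assumed 8 or 9 columns)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        raise ValueError(f"expected at least 8 columns, got {len(fields)}")
    seqname, source, feature, start, end, score, strand, frame = fields[:8]
    return GffFeature(
        seqname=seqname,
        source=source,
        feature=feature,
        start=int(start),
        end=int(end),
        score=None if score == "." else float(score),
        strand=strand,
        frame=frame,
        attributes=fields[8] if len(fields) > 8 else "",
    )


if __name__ == "__main__":
    # Hypothetical example line, for illustration only.
    print(parse_gff_line("contig1\tGenie\texon\t1000\t1200\t0.95\t+\t0\tgene1"))
```

Because every prediction program can emit the same columns, results from different tools can be compared against a reference annotation with one shared parser.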
180 Conclusions on annotating complete eukaryotic genomes (cont.). Maintenance requires even more effort than the original development. Automated methods are not good enough. Human curators can cause problems too. Functional assignment by homology is sometimes unreliable.
181 Discussion on annotating complete eukaryotic genomes. Re-annotation: updating results and annotations over time. Genomic sequence changes (indels, point mutations). Analysis software changes. New entries in public sequence databases. Entries removed from sequence databases. Audit trail for annotations. Master copy of genome annotations should reside in the model organism databases where the expertise resides. Community collaborative annotation.
182 Acknowledgments: Uwe Ohler (University of Erlangen, Germany); Gerry Rubin (UC Berkeley); Sima Misra (UC Berkeley); Erwin Frise (UC Berkeley); Roderic Guigó (Barcelona); GFF team (headed by Richard Bruskiewich, Sanger Centre); Assessment team: Michael Ashburner (EBI), Peer Bork (EMBL), Richard Durbin (Sanger), Roderic Guigó (Barcelona), Tim Hubbard (Sanger); Annotation experiment participants.