TWAIN: a new tool for parallel gene finding (and other gene finders)
Mihaela Pertea, William Majoros, Steven Salzberg

First, some background…

Genomes completed and published by TIGR and our collaborators, 1995-present
Organism                       Reference
Arabidopsis thaliana           Lin et al., Nature 402: 761-8 (2000)
Archaeoglobus fulgidus         Klenk et al., Nature 390: 364-370 (1997)
Bacillus anthracis Ames        Read et al., Nature 423: 81-86 (2003)
Bacillus anthracis Florida     Read et al., Science 296: 2028-33 (2002)
Borrelia burgdorferi           Fraser et al., Nature 390: 580-586 (1997)
Brucella suis                  Paulsen et al., PNAS 99 (2002)
Caulobacter crescentus         Nierman et al., PNAS 98 (2001)
Chlamydia pneumoniae           Read et al., Nucl. Acids Res. 28 (2000)
Chlamydia muridarum            Read et al., Nucl. Acids Res. 28 (2000)
Chlamydophila caviae           Read et al., Nucl. Acids Res. 31 (2003)
Chlorobium tepidum             Eisen et al., PNAS 99: 9509-9514 (2002)
Coxiella burnetii RSA 493      Seshadri et al., PNAS 100: 5455-60 (2003)
Deinococcus radiodurans        White et al., Science 286 (1999)
Enterococcus faecalis          Paulsen et al., Science 299: 2071-2074 (2003)
Haemophilus influenzae         Fleischmann et al., Science 269 (1995)
Helicobacter pylori            Tomb et al., Nature 388: 539-547 (1997)
Methanococcus jannaschii       Bult et al., Science 273: 1058-1073 (1996)
Mycobacterium tuberculosis     Fleischmann et al., J. Bact. 184 (2002)
Mycoplasma genitalium          Fraser et al., Science 270: 397-403 (1995)
Neisseria meningitidis         Tettelin et al., Science 287 (2000)
Oryza sativa (rice) chr 10     Wing et al., Science 300: 1566-1569 (2003)
Plasmodium falciparum          Gardner et al., Nature 419: 531-534 (2002)
Plasmodium yoelii              Carlton et al., Nature 419: 512-519 (2002)
Porphyromonas gingivalis       Nelson et al., J. Bact., in revision
Pseudomonas putida             Nelson et al., Envir. Microbiol. (2002)
Shewanella oneidensis          Heidelberg et al., Nat. Biotech. 20 (2002)
Streptococcus agalactiae       Tettelin et al., PNAS 99 (2002)
Streptococcus pneumoniae       Tettelin et al., Science 293 (2001)
Sulfolobus islandicus virus    Arnold et al., Virology 15: 252-66 (2000)
Thermotoga maritima            Nelson et al., Nature 399: 323-329 (1999)
Treponema pallidum             Fraser et al., Science 281: 375-388 (1998)
Vibrio cholerae                Heidelberg et al., Nature 406 (2000)

Genomes in progress or recently completed
Fibrobacter succinogenes; Prevotella intermedia; Pseudomonas fluorescens; Silicibacter pomeroyi DSS-3; Streptococcus agalactiae A909; Streptococcus gordonii; Streptococcus mitis; Streptococcus pneumoniae 670; Acidobacterium capsulatum; Bacillus anthracis A01055; Bacillus anthracis A0402; Bacillus anthracis Ames 0581; Burkholderia thailandensis; Campylobacter coli RM2228; Campylobacter upsaliensis RM3195; Clostridium perfringens SM101; Epulopiscium fishelonii; Hyphomonas neptunium; Listeria monocytogenes F6854; Listeria monocytogenes H7858; Mycoplasma arthritidis; Mycoplasma capricolum; Myxococcus xanthus; Prevotella ruminicola; Pyrococcus furiosus; Verrucomicrobium spinosum; Actinomyces naeslundii; Bacillus anthracis A0071; Bacillus anthracis Kruger B; Erwinia chrysanthemi; Gemmata obscuriglobus; Mycobacterium tuberculosis; Ruminococcus albus; Streptococcus sobrinus; Aspergillus fumigatus; Brugia malayi; Coccidioides immitis; Cryptococcus neoformans; Entamoeba histolytica; Oryza sativa Chromosome 3 & 10; Plasmodium vivax; Schistosoma mansoni; Solanum spp.; Tetrahymena thermophila; Toxoplasma gondii; Theileria parva; Trichomonas vaginalis; Trypanosoma brucei; Trypanosoma cruzi; Acidithiobacillus ferrooxidans; Bacillus anthracis Kruger B; Burkholderia mallei; Clostridium perfringens ATCC13124; Dehalococcoides ethenogenes; Desulfovibrio vulgaris; Ehrlichia chaffeensis; Ehrlichia sennetsu; Geobacter sulfurreducens; Listeria monocytogenes; Methylococcus capsulatus; Mycobacterium avium 104; Mycobacterium smegmatis; Pseudomonas syringae; Staphylococcus aureus; Staphylococcus epidermidis; Treponema denticola; Wolbachia sp.; Anaplasma phagocytophila; Bacillus cereus 10987; Bacteroides forsythes; Brucella ovis; Baumannia cicadellinicola; Campylobacter jejuni; Carboxydothermus hydrogenoformans; Colwellia sp. 34H; Dichelobacter nodosus

Anatomy of a Genome Sequencing Project
Stages: Shotgun sequencing -> Genome assembly -> Annotation -> Data release -> Downstream research
- Shotgun sequencing: library construction, colony picking, template preparation, sequencing reactions, base calling, sequence files
- Genome assembly: assembler -> genome scaffold, ordered contig set, gap closure, sequence editing, re-assembly, ONE ASSEMBLY! (per molecule), combinatorial PCR, POMP
- Annotation: gene finding, homology searches, function assignments, metabolic pathways, gene families, comparative genomics, transcriptional/translational regulatory elements, repetitive sequences
- Data release: publication, www.tigr.org
- Downstream research: microarray studies; vaccine, drug development; human disease studies
- Other diagram label: LIMS entry point

Gene Finding
- Gene finding plays an ever-larger role in high-speed DNA sequencing projects
- 1000’s of genes generated each week at a high-throughput sequencing facility
- Separate gene finders are needed for every organism
- Training on organism X, finding genes on Y, generates inferior results
- Bootstrapping problem: training data is hard to find
Prokaryotic (“easy”): bacteria, viruses, archaea; high gene density; no introns
Eukaryotic (hard): low gene density; many introns

GLIMMER: A Bacterial Gene Finder
- GLIMMER 2.0: released late 1999
- More than 2000 sites worldwide (Open Source)
- Also handles Archaea, viruses, others
- Refs: Salzberg et al., NAR, 1998, Genomics 1999; Delcher et al., NAR, 1999; Pertea et al., Nature 2000; Pertea and Salzberg, Plant Mol Biol 2001; Majoros et al., NAR, 2003
- Web site and code: http://www.tigr.org/software

Bacterial gene finding, pre-Glimmer: Uniform Markov Models
Use the conditional probability of a sequence position given the previous k positions in the sequence, e.g. ACCGAT.
Fixed, kth-order model: larger values of k yield better models (as long as data is sufficient).
Probability (score) of sequence $s_1 s_2 s_3 \dots s_n$ is:
$$P(s_1 s_2 \dots s_n) = \prod_{i=1}^{n} P(s_i \mid s_{i-k} \dots s_{i-1})$$

Uniform Markov Models
Advantages:
- Easy to train. Count frequencies of (k+1)-mers in training data.
- Easy to assign a score to a sequence.
Disadvantages:
- (k+1)-mers can be undersampled; i.e., occur too infrequently in training data.
- Choosing a single value of k may not be the best way to model the data.
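The training and scoring steps listed above can be sketched in a few lines. This is only an illustration, not code from any of the tools in this talk: the function names `train_fixed_order` and `log_score` are hypothetical, add-one smoothing is an assumption added here to soften the undersampling problem, and scoring starts at the first base that has a full k-base context.

```python
import math
from collections import Counter

def train_fixed_order(seqs, k, alphabet="acgt"):
    """Estimate P(base | previous k bases) by counting (k+1)-mers.

    Add-one smoothing is an assumption of this sketch, so that unseen
    (k+1)-mers do not receive probability zero.
    """
    counts = Counter()
    for s in seqs:
        for i in range(k, len(s)):
            counts[(s[i - k:i], s[i])] += 1

    def prob(context, base):
        total = sum(counts[(context, b)] for b in alphabet)
        return (counts[(context, base)] + 1) / (total + len(alphabet))

    return prob

def log_score(seq, k, prob):
    """Sum of log conditional probabilities over positions with a full context."""
    return sum(math.log(prob(seq[i - k:i], seq[i])) for i in range(k, len(seq)))

# Toy usage: a 2nd-order model trained on a single short repeat.
model = train_fixed_order(["acgacgacg"], k=2)
seen = log_score("acgacg", 2, model)
unseen = log_score("aaaaaa", 2, model)
```

Here `seen` comes out higher than `unseen`, because every 3-mer in the first sequence occurs in the training data. With larger k the count table grows as 4^(k+1), which is the undersampling problem the slide describes.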

Glimmer: Interpolated Markov Models
- Use a linear combination of Markov chains of orders 0 through 8; for example:
  $c_8 P(g \mid atcagtta) + c_7 P(g \mid tcagtta) + \dots + c_1 P(g \mid a) + c_0 P(g)$
  where $c_0 + c_1 + \dots + c_8 = 1$
- Equivalent to interpolating the results of multiple Markov chains
- Score of a sequence is the product of interpolated probabilities of bases in the sequence
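A minimal sketch of the interpolation idea follows. It is a simplification under stated assumptions: the weights here are fixed constants chosen by hand, only orders 0 to 2 are used to keep the example small, and the names `train_order`, `imm_prob` and `imm_log_score` are hypothetical. How GLIMMER actually chooses its weights is not covered on this slide.

```python
import math
from collections import Counter

def train_order(seqs, j, alphabet="acgt"):
    """Order-j estimate of P(base | previous j bases), add-one smoothed (assumption)."""
    counts = Counter()
    for s in seqs:
        for i in range(j, len(s)):
            counts[(s[i - j:i], s[i])] += 1

    def prob(ctx, base):
        total = sum(counts[(ctx, b)] for b in alphabet)
        return (counts[(ctx, base)] + 1) / (total + len(alphabet))

    return prob

def imm_prob(base, context, models, weights):
    """Interpolate: sum over j of c_j * P_j(base | last j bases of context)."""
    assert abs(sum(weights) - 1.0) < 1e-9, "weights must sum to 1"
    return sum(c * models[j](context[len(context) - j:], base)
               for j, c in enumerate(weights))

def imm_log_score(seq, models, weights):
    """Log of the product of interpolated probabilities along the sequence."""
    k = len(weights) - 1
    return sum(math.log(imm_prob(seq[i], seq[i - k:i], models, weights))
               for i in range(k, len(seq)))

# Toy usage with hand-picked weights c_0, c_1, c_2 (illustrative values only).
training = ["atcagttagatcagttag"]
models = [train_order(training, j) for j in range(3)]
weights = [0.2, 0.3, 0.5]
p = imm_prob("g", "ta", models, weights)
```

Because each component model is a proper distribution and the weights sum to 1, the interpolated probabilities over the four bases also sum to 1.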

IMMs vs. Fixed-Order Models
Performance:
- An IMM should always do at least as well as a fixed-order model. E.g., even if the kth-order model is correct, it can be simulated by a (k+1)st-order model.
- Our results support this.
An IMM can be used as a fixed-order model.

How GLIMMER Works
Three separate programs:
- long-orfs: automatically extracts long open reading frames that do not overlap other long ORFs.
- IMM model builder: takes any kind of sequence data.
- Gene predictor: takes a genome sequence and finds all the genes.
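The first step, collecting long non-overlapping ORFs as bootstrap training data, can be illustrated roughly as follows. This is not the long-orfs program itself: the sketch scans only the forward strand, treats ATG as the only start codon, and the names `find_orfs` and `non_overlapping` and the default `min_len` are assumptions made for illustration.

```python
STOPS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_len=300):
    """Forward-strand ORFs as (start, end) with end exclusive, ATG through stop codon."""
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # first ATG opens the ORF
            elif start is not None and codon in STOPS:
                if i + 3 - start >= min_len:   # keep only long ORFs
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)

def non_overlapping(orfs):
    """Keep only ORFs that overlap no other ORF in the list."""
    return [a for a in orfs
            if not any(a != b and a[0] < b[1] and b[0] < a[1] for b in orfs)]

# Toy usage: one 18-base ORF embedded in flanking sequence.
toy = "CC" + "ATG" + "GCA" * 4 + "TAA" + "GG"
long_orfs = non_overlapping(find_orfs(toy, min_len=18))
```

The sequences returned this way could then be fed to a model builder, which is what makes the approach self-training on a new genome.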

GLIMMER 2.0’s Performance
Organism         Genes annotated   Annotated genes found   Additional genes
H. influenzae    1738              1720 (99.0%)            250 (14%)
M. genitalium    483               480 (99.4%)             81 (17%)
M. jannaschii    1727              1721 (99.7%)            221 (13%)
H. pylori        1590              1550 (97.5%)            293 (18%)
E. coli          4269              4158 (97.4%)            824 (19%)
B. subtilis      4100              4030 (98.3%)            586 (14%)
A. fulgidus      2437              2404 (98.6%)            274 (11%)
B. burgdorferi   853               843 (99.3%)             62 (7%)
T. pallidum      1039              1014 (97.6%)            180 (17%)
T. maritima      1877              1854 (98.8%)            190 (10%)

GLIMMER on “known” genes
Organism         Genes annotated   Known genes   Correct predictions
H. influenzae    1738              1501          1496 (99.7%)
M. genitalium    483               478           476 (99.6%)
M. jannaschii    1727              1259          1256 (99.8%)
H. pylori        1590              1092          1084 (99.3%)
E. coli          4269              2656          2632 (99.1%)
B. subtilis      4100              1249          1231 (98.6%)
A. fulgidus      2437              1799          1786 (99.3%)
B. burgdorferi   853               601           600 (99.8%)
T. pallidum      1039              755           747 (98.9%)
T. maritima      1877              1504          1493 (99.3%)
Average                                          (99.3%)

Speed
- Training for 2 Megabase genome: < 30 sec (on a Linux desktop)
- Find all genes in 2Mb genome: < 30 sec
Impact: GLIMMER has been used for:
- B. anthracis (anthrax) (TIGR)
- B. burgdorferi (Lyme disease), T. pallidum (syphilis) (TIGR)
- C. pneumoniae (pneumonia) (Berkeley/Stanford/UCSF)
- T. maritima, D. radiodurans, M. tuberculosis, V. cholerae, S. pneumoniae, C. trachomatis, C. pneumoniae, N. meningitidis (TIGR)
- X. fastidiosa (Brazilian consortium)
- Plasmodium falciparum (malaria) [GlimmerM]
- Arabidopsis thaliana (model plant) [GlimmerM]
- and many others: viruses, simple eukaryotes, more bacteria

Eukaryotic gene finding
- Much harder
- Overall accuracy usually below 50%
  - Human (mammalian) gene finding is hardest
  - very long introns, and lots of them
- Leading methods: HMMs, GHMMs
- New ideas needed
- New opportunity: use sequence of related species

GlimmerHMM
[State diagram: Intergenic; introns I0, I1, I2; internal exons Exon0, Exon1, Exon2; single exon (Exon Sngl); Initial Exon; Terminal Exon]

GlimmerHMM: results on Arabidopsis thaliana
              Nucl              Exon              Gene
              Sn   Sp   Acc     Sn   Sp   Acc     Sn   Sp   Acc
GlimmerHMM    95   99   97      71   78   74.5    33   32   32.5
Genscan+      93   99   96      74   81   77.5    35   35   35
Train data set: 3237 genes
Test data set: 809 non-homologous genes
All genes confirmed by full-length Arabidopsis cDNAs
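The Acc columns in this table are consistent with Acc being the mean of Sn and Sp, which the short check below reproduces. The definitions of Sn and Sp in terms of true and false positives follow the usual gene-finding convention (Sp = TP / (TP + FP)); that convention is an assumption here, since the slide does not define them, and the function names are hypothetical.

```python
def sensitivity(tp, fn):
    """Fraction of true features (bases, exons, genes) that were predicted."""
    return tp / (tp + fn)

def specificity(tp, fp):
    """Fraction of predicted features that are correct (gene-finding convention)."""
    return tp / (tp + fp)

def accuracy(sn, sp):
    """Mean of sensitivity and specificity."""
    return (sn + sp) / 2

# (Sn, Sp) pairs from the table, in the order nucleotide, exon, gene.
table = {
    "GlimmerHMM": [(95, 99), (71, 78), (33, 32)],
    "Genscan+": [(93, 99), (74, 81), (35, 35)],
}
recomputed = {name: [accuracy(sn, sp) for sn, sp in rows]
              for name, rows in table.items()}
```

The recomputed values match the table: 97, 74.5 and 32.5 for GlimmerHMM, and 96, 77.5 and 35 for Genscan+.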

Exonomy: a generalized HMM
Program     Nucleotide accuracy   Exon spec   Exon sens   Whole-gene accuracy
Unveil      94%                   75%         74%         46%
Exonomy     95%                   63%         61%         42%
GlimmerM    93%                   71%         71%         44%
Genscan     94%                   80%         75%         27%
Arabidopsis test results, 300 genes (Majoros et al., 2003)

Aspergillus species experiments
- Training data:
  - 589 Genbank genomic sequences containing 625 genes that have the phrase ‘complete cds’ in their description
  - 1166 introns inferred from spliced alignments of ESTs to a recent genome assembly
- Test data:
  - 85 genes for Aspergillus fumigatus, manually curated and with strong protein evidence

Gene Finding in A. fumigatus
[Figure: prediction tracks for GlimmerHMM, Unveil, GlimmerM, Phat and Exonomy, compared with “Truth”]

Aspergillus fumigatus test results

Example: D. melanogaster vs. D. pseudoobscura (alignment generated by MUMmer/Promer)
[Figure: D. melanogaster chr 2L, showing annotated genes and amino acid matches]

Ortholog Detection in TWAIN
- Promer/MUMmer to identify conserved regions
- Individual gene finder to predict coding regions separately in each genome
- Combine these two types of evidence with protein sequence homology
[Diagram: Species 1 and Species 2; run TWAIN on these]

TWAIN approach: Premise
Instead of independently choosing the optimal gene models for two conserved regions, we want to find the pair of nearly optimal gene models that produce the most similar proteins.

Build Parse Graphs
- Parse graph: keep the N highest-scoring ORFs according to the individual gene finder
- Parse graphs are built without regard for synteny
- Nodes are: start, stop, donor, acceptor sites

Align Parse Graphs
The two parse graphs are aligned using a global alignment algorithm on gene structures. The optimal alignment corresponds to the best pair of orthologous gene predictions.

Gene Alignment in TWAIN
- Ideally, each cell links back to the “optimal” predecessor
- A cell with a diagonal link to its left denotes homologous signals
- A cell with a horizontal or vertical link to its left indicates that a signal in one species is not present in the other
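The link types described above can be illustrated with a toy global alignment over two lists of signals. This is a deliberately reduced sketch, not TWAIN's algorithm: it aligns flat signal lists rather than parse graphs, uses made-up `match` and `gap` scores instead of gene-finder and protein-alignment scores, and the function name `align_signals` is hypothetical.

```python
def align_signals(a, b, match=2.0, gap=-1.0):
    """Globally align two signal lists; return (score, list of aligned pairs).

    A pair (x, y) is a diagonal link (homologous signals); (x, None) or
    (None, y) is a vertical or horizontal link (signal present in one species only).
    """
    n, m = len(a), len(b)
    score = [[float("-inf")] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    score[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            cands = []
            if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
                cands.append((score[i - 1][j - 1] + match, "diag"))
            if i > 0:
                cands.append((score[i - 1][j] + gap, "up"))
            if j > 0:
                cands.append((score[i][j - 1] + gap, "left"))
            score[i][j], back[i][j] = max(cands)   # link to the best predecessor
    # Trace the links back from the final cell.
    path, i, j = [], n, m
    while (i, j) != (0, 0):
        move = back[i][j]
        if move == "diag":
            path.append((a[i - 1], b[j - 1]))
            i, j = i - 1, j - 1
        elif move == "up":
            path.append((a[i - 1], None))
            i -= 1
        else:
            path.append((None, b[j - 1]))
            j -= 1
    return score[n][m], path[::-1]

# Toy usage: species 2 has one extra intron (an extra GT/AG pair).
species1 = ["ATG", "GT", "AG", "TAG"]
species2 = ["ATG", "GT", "AG", "GT", "AG", "TAG"]
best, links = align_signals(species1, species2)
```

In this example four signals are linked diagonally and two signals of species 2 are left unpaired, which corresponds to an intron insertion.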

Some Examples
[Figures: intron insertion; exon insertion; multiple insertions]

Pair HMM Equivalence
[Diagram: states E1,E2 and I1,I2]

Pair HMM Equivalence: Intron insertion
[Diagram: states E1,E2p; I1,--; E1,E2p]

Orthogonal vs. Oblique Linking
- The oblique (red) alignment matches up the two introns
- The orthogonal alignment denotes coding regions that have shifted across introns

Dynamic Programming Optimizations
- We only look back to cells left of and below the current cell, and only to those having an edge in both parse graphs to the current cell
- Depending on the Promer alignments, we might “cut corners” to improve performance

Scoring Model
[Scoring equation shown as an image on the slide]
where:
- $P_i(\phi_i)$ is the probability that sequence i has parse $\phi_i$
- P(align) is the probability that an alignment PHMM generates this pair of proteins (evaluated by the forward algorithm)
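To make the P(align) term concrete, here is a sketch of the forward algorithm for a small pair HMM. Everything specific in it is an assumption: it uses the textbook three-state design (match, insert in x, insert in y), a toy emission model that only distinguishes identical from non-identical residues, and invented parameter values. The structure and parameters of TWAIN's own PHMM are not given on the slide.

```python
def pair_hmm_forward(x, y, alphabet_size=20, p_match=0.6,
                     delta=0.1, eps=0.3, tau=0.05):
    """Probability that the toy pair HMM generates sequences x and y.

    States: M emits an aligned pair, X emits a residue of x only,
    Y emits a residue of y only. delta = gap open, eps = gap extend,
    tau = transition to the end state. All values are illustrative.
    """
    n, m = len(x), len(y)
    q = 1.0 / alphabet_size                      # single-residue emission

    def p(a, b):                                 # pair emission, sums to 1 over all pairs
        same = p_match if a == b else (1.0 - p_match) / (alphabet_size - 1)
        return same / alphabet_size

    fM = [[0.0] * (m + 1) for _ in range(n + 1)]
    fX = [[0.0] * (m + 1) for _ in range(n + 1)]
    fY = [[0.0] * (m + 1) for _ in range(n + 1)]
    fM[0][0] = 1.0                               # begin state behaves like M
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            if i > 0 and j > 0:
                fM[i][j] = p(x[i - 1], y[j - 1]) * (
                    (1 - 2 * delta - tau) * fM[i - 1][j - 1]
                    + (1 - eps - tau) * (fX[i - 1][j - 1] + fY[i - 1][j - 1]))
            if i > 0:
                fX[i][j] = q * (delta * fM[i - 1][j] + eps * fX[i - 1][j])
            if j > 0:
                fY[i][j] = q * (delta * fM[i][j - 1] + eps * fY[i][j - 1])
    # Sum over all alignments: any state may move to the end state.
    return tau * (fM[n][m] + fX[n][m] + fY[n][m])

# Toy usage: identical peptides versus unrelated ones.
similar = pair_hmm_forward("MRNDC", "MRNDC")
different = pair_hmm_forward("MRNDC", "WWWWW")
```

The forward algorithm sums over all alignments rather than taking the single best one, so similar protein pairs receive a higher probability than dissimilar pairs of the same lengths. For full-length proteins the products underflow, so a practical version would work in log space or rescale each row.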

Partial Alignments not penalized
MRNDCACQEGHLINRFPDNAR
||||| || ||||
MRNDCTCQRGHLI
ATG..................................................................TAG
Evaluation of a cell in the alignment matrix depends in many cases on the alignment of partial proteins up to this point in the parse graph (e.g., at a GT-GT cell).
Because a terminal portion of a partial sequence may be matched later, we do not penalize for insertions/deletions at the right end of the alignment.
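One common way to leave right-end gaps unpenalized is to run an ordinary global alignment and then take the best score in the last row or last column instead of the bottom-right corner. The sketch below applies that idea to the two peptides shown above. It is a generic score-based illustration with invented `match`, `mismatch` and `gap` values and a hypothetical function name, not the probabilistic evaluation TWAIN uses.

```python
def prefix_alignment_score(x, y, match=1, mismatch=-1, gap=-2):
    """Alignment anchored at the left end; unmatched residues at the
    right end of either sequence cost nothing."""
    n, m = len(x), len(y)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        s[i][0] = i * gap            # gaps at the left end are still penalized
    for j in range(1, m + 1):
        s[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if x[i - 1] == y[j - 1] else mismatch
            s[i][j] = max(s[i - 1][j - 1] + sub,
                          s[i - 1][j] + gap,
                          s[i][j - 1] + gap)
    # Free right end: one sequence must be fully consumed, the other may stop early.
    last_row = max(s[n])
    last_col = max(row[m] for row in s)
    return max(last_row, last_col)

# The partial proteins from the slide.
long_pep = "MRNDCACQEGHLINRFPDNAR"
short_pep = "MRNDCTCQRGHLI"
score = prefix_alignment_score(long_pep, short_pep)
```

With these toy scores the result is 9: eleven identities minus two mismatches over the first 13 residues, with the trailing NRFPDNAR of the longer peptide ignored.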

Pair HMM results available within a few weeks….

Priority organisms
- Human-mouse gene finding not very high-impact
  - lots of ancillary data give better evidence
  - most genes now known
  - nonetheless, this problem is getting all the attention
- Countless other species really need gene finders:
  - Brugia malayi (causes lymphatic filariasis)
  - Toxoplasma gondii
  - Schistosoma mansoni (Schistosomiasis)
  - Entamoeba histolytica (50 million cases/year)
  - Tetrahymena thermophila (model organism)
  - Plants: potato, maize, sorghum
  - Mammals: chimp, dog, cow, pig

Acknowledgements
GLIMMER: Arthur Delcher, Simon Kasif, Owen White
GlimmerM, GlimmerHMM: Mihaela Pertea
Exonomy, Unveil: Bill Majoros
TWAIN: Mihaela Pertea, Bill Majoros
Funding support: National Institutes of Health (NLM); National Science Foundation (CISE, BIO)
Software downloads: http://www.tigr.org/software
