JIGSAW: a better way to combine predictions
J. E. Allen, W. H. Majoros, M. Pertea and S. L. Salzberg. JIGSAW, GeneZilla, and GlimmerHMM: puzzling out the features of human genes in the ENCODE regions. Genome Biology 2006, 7(Suppl 1):S9.
J. E. Allen and S. L. Salzberg. JIGSAW: integration of multiple sources of evidence for gene prediction. Bioinformatics 21(18).
J. E. Allen, M. Pertea and S. L. Salzberg. Computational gene prediction using multiple sources of evidence. Genome Research, 14(1), 2004.

Collecting gene structure evidence for JIGSAW
Figure 1. Evidence from the UCSC genome browser used as input to JIGSAW. Evidence includes computational gene finders, alignments from gene expression evidence, and evidence of cross-species sequence conservation.

Representing gene structure evidence in JIGSAW
Each evidence source can predict up to six gene features:
–Start codon
–Stop codon
–Intron
–Protein coding nucleotides
–Donor site
–Acceptor site

Figure 3. Four evidence sources mapped to sequence S: gene prediction (GP1) with no confidence score, gene prediction with confidence score 0.65 (GP2), cDNA aligned with 86% identity and an EST aligned with 95% identity. Examples of the different feature vector types are shown: start codon (sta), stop codon (stp), donor site (don), acceptor site (acc), intron (inr) and amino acid codon (cod). Each element in the feature vector is an evidence source’s prediction for that feature type. The possible exon boundaries are k0, k1, …, k6.
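The feature vectors described in Figure 3 can be sketched in a few lines of Python. This is only an illustration, not JIGSAW's actual data model: the class name `EvidenceSource`, the function `feature_vector`, the coordinates, and the choice to encode an unscored prediction as 1.0 and a missing prediction as 0.0 are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set

# Feature type abbreviations as used in Figure 3.
FEATURE_TYPES = ("sta", "stp", "don", "acc", "inr", "cod")


@dataclass
class EvidenceSource:
    """One line of evidence mapped onto the sequence (hypothetical model)."""
    name: str
    score: Optional[float]  # confidence or identity in [0, 1]; None if unscored
    predictions: Dict[str, Set[int]] = field(default_factory=dict)


def feature_vector(sources, feature_type, location):
    """One element per evidence source for a given feature type and location.

    Assumed encoding: the source's score if it predicts the feature there,
    1.0 if it predicts it but carries no score, and 0.0 otherwise.
    """
    vector = []
    for src in sources:
        if location in src.predictions.get(feature_type, set()):
            vector.append(1.0 if src.score is None else src.score)
        else:
            vector.append(0.0)
    return tuple(vector)


# The four sources of Figure 3; the coordinates are made up.
sources = [
    EvidenceSource("GP1", None, {"sta": {100}, "don": {250}}),
    EvidenceSource("GP2", 0.65, {"sta": {100}, "don": {262}}),
    EvidenceSource("cDNA", 0.86, {"don": {250}}),
    EvidenceSource("EST", 0.95, {"don": {250}}),
]

print(feature_vector(sources, "sta", 100))  # start codon vector at 100
print(feature_vector(sources, "don", 250))  # donor site vector at 250
```

Each vector has one slot per evidence source, so agreement and disagreement between sources at a candidate boundary are visible at a glance.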

Training
Schematic of the JIGSAW training procedure. Known genes are used to evaluate the accuracy of the different combinations of evidence. Prediction accuracy for each feature type (start codon, stop codon, acceptor, donor, amino acid codon and intron) is measured separately.
[Figure: four evidence sources (gene prediction 1, gene prediction 2, cDNA, EST alignment) mapped to training sequences S1, S2, …, Sm containing single, initial, internal and terminal exons. Panels: start site, stop site, donor site and acceptor site feature vectors; example coding and intron feature vectors.]
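As a rough sketch of the training idea, the accuracy of each combination of evidence can be estimated by counting how often a given feature vector coincides with a true feature in the known genes, separately for each feature type. The function name `estimate_accuracy` and the observation format are assumptions for illustration. A plain frequency table is used here for simplicity; it does not generalize to unseen vectors the way the decision trees in Figure 4 do.

```python
from collections import defaultdict


def estimate_accuracy(observations):
    """Estimate accuracy per (feature type, feature vector).

    observations: iterable of (feature_type, vector, is_correct), where
    is_correct says whether the known gene annotation has that feature
    at the location where the vector was observed.
    """
    counts = defaultdict(lambda: [0, 0])  # key -> [correct, total]
    for feature_type, vector, is_correct in observations:
        entry = counts[(feature_type, vector)]
        entry[0] += int(is_correct)
        entry[1] += 1
    return {key: correct / total for key, (correct, total) in counts.items()}


# Made-up training observations for two feature types.
training = [
    ("don", (1.0, 0.0, 0.86, 0.95), True),
    ("don", (1.0, 0.0, 0.86, 0.95), True),
    ("don", (1.0, 0.0, 0.86, 0.95), False),
    ("don", (0.0, 0.65, 0.0, 0.0), False),
    ("sta", (1.0, 0.65, 0.0, 0.0), True),
]

accuracy = estimate_accuracy(training)
for key, value in sorted(accuracy.items()):
    print(key, round(value, 3))
```

Keying the table on the feature type keeps the six accuracy estimates separate, as the slide describes.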

Figure 4a. The plot shows the accuracy of predictions based on alignments to non-human sequences that overlap a gene finder’s predictions. Each point is a pair of alignments observed in training and their percent identity to the genomic sequence. ‘+’ points are labeled ‘accurate’ and ‘x’ points are labeled ‘inaccurate.’ The two lines correspond to the non-leaf nodes in the decision tree.

Figure 4b. Decision tree used to partition the feature vector space from Figure 4a into three sub-regions. This decision tree indicates that non-human cDNA alignments with > 95% identity to the human sequence (region “V1”) are accurate protein coding predictors.
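A tree with two non-leaf nodes and three leaves, as in Figure 4b, might look like the sketch below. Only the first split (cDNA identity above 95% leading to region V1) comes from the caption. The second threshold (0.80), the region names "V2" and "V3", and the helper `leaf_accuracy` are hypothetical placeholders chosen for the example.

```python
def region(cdna_identity, other_identity, second_threshold=0.80):
    """Assign a pair of alignment identities (fractions in [0, 1]) to a leaf.

    The 0.95 split is from Figure 4b; second_threshold and the labels
    "V2" and "V3" are assumptions for illustration.
    """
    if cdna_identity > 0.95:       # first non-leaf node
        return "V1"
    if other_identity > second_threshold:  # second non-leaf node
        return "V2"
    return "V3"


def leaf_accuracy(points):
    """Fraction of 'accurate' training points falling in each leaf.

    points: iterable of (cdna_identity, other_identity, is_accurate).
    """
    totals, hits = {}, {}
    for cdna_id, other_id, is_accurate in points:
        leaf = region(cdna_id, other_id)
        totals[leaf] = totals.get(leaf, 0) + 1
        hits[leaf] = hits.get(leaf, 0) + int(is_accurate)
    return {leaf: hits[leaf] / totals[leaf] for leaf in totals}


# Made-up training points: '+' is True, 'x' is False.
points = [
    (0.97, 0.60, True),
    (0.99, 0.90, True),
    (0.90, 0.85, True),
    (0.90, 0.88, False),
    (0.70, 0.50, False),
]

print(leaf_accuracy(points))
```

The per-leaf fraction plays the role of the accuracy estimate attached to any feature vector that lands in that region.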

JIGSAW dynamic programming
Interval: assigns a state to a subsequence between two positions.
Dynamic programming algorithm: at the end of each interval (e0, for example), store the score of the best parse ending at that location.
Modification: store scores for every parse “type” ending at e0.
Types are start, stop, coding, intron, donor, acceptor.
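The idea of keeping one best score per parse type at each interval end can be sketched as follows. This is a simplified illustration, not JIGSAW's implementation: the `Interval` class, the `ALLOWED` transition table, additive scores, and the requirement that consecutive intervals be exactly adjacent are all assumptions, and intergenic regions are ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    start: int
    end: int      # inclusive
    kind: str     # start, stop, coding, intron, donor or acceptor
    score: float  # assumed additive, e.g. a log-accuracy


# Hypothetical, simplified grammar: which type may follow which.
ALLOWED = {
    "start": {"coding"},
    "coding": {"donor", "stop"},
    "donor": {"intron"},
    "intron": {"acceptor"},
    "acceptor": {"coding"},
    "stop": set(),
}


def best_parse(intervals):
    """Return (score, path) of the best parse from a start to a stop."""
    best = {}  # (end position, type) -> (score, path of intervals)
    for iv in sorted(intervals, key=lambda i: (i.end, i.start)):
        candidates = []
        if iv.kind == "start":
            candidates.append((iv.score, (iv,)))
        for prev_kind, followers in ALLOWED.items():
            prev = best.get((iv.start - 1, prev_kind))
            if iv.kind in followers and prev is not None:
                candidates.append((prev[0] + iv.score, prev[1] + (iv,)))
        if not candidates:
            continue
        top = max(candidates, key=lambda c: c[0])
        key = (iv.end, iv.kind)
        if key not in best or top[0] > best[key][0]:
            best[key] = top
    finals = [v for (_, kind), v in best.items() if kind == "stop"]
    return max(finals, key=lambda c: c[0]) if finals else (float("-inf"), ())


# Made-up intervals with two competing donor sites.
intervals = [
    Interval(0, 2, "start", 0.9),
    Interval(3, 10, "coding", 0.8), Interval(3, 12, "coding", 0.5),
    Interval(11, 12, "donor", 0.7), Interval(13, 14, "donor", 0.9),
    Interval(13, 20, "intron", 0.6), Interval(15, 20, "intron", 0.9),
    Interval(21, 22, "acceptor", 0.8),
    Interval(23, 30, "coding", 0.7),
    Interval(31, 33, "stop", 0.9),
]

score, path = best_parse(intervals)
print(round(score, 2), [iv.kind for iv in path])
```

Indexing the table by both end position and type is what lets a later interval ask only for predecessors of a compatible type, which is the modification the slide describes.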

JIGSAW GHMM gene model

Evidence types for JIGSAW experiments on human DNA
cDNA from human genes
UniGene transcripts
GenBank cDNAs matching SwissProt proteins w/at least 98% identity
RefSeq genes from non-human species
TIGR Gene Index (human and other)
Ab initio gene finders
–Genscan, GeneID, GeneZilla, GlimmerHMM
–NOTE: JIGSAW allows you to use the same gene finder as multiple “lines” of evidence - e.g., GlimmerHMM with different parameter settings
Alignment-based gene finders
–Twinscan
–SGP
Predicted conserved elements from phylogenetic analysis (PhastCons)

Effects of different evidence sources
Figure 6. JIGSAW prediction performance using different combinations of evidence. Gene finders = ab initio gene finders only; non-human EST = gene finders + non-human expression evidence; human mRNA = gene finders + human mRNA; curated cDNA = gene finders + KnownGene; All = all evidence. KnownGene = cDNA evidence from curated proteins (from UCSC) without using JIGSAW.

Comparison of JIGSAW and other methods on human ENCODE regions
Sensitivity (Sn) = % of exons correctly predicted
Specificity (Sp) = % of exon predictions that are correct
F-score = (2 × Sn × Sp) / (Sn + Sp)
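The three measures can be computed directly from sets of exon coordinates. In this sketch an exon counts as correctly predicted only when both of its boundaries match the annotation exactly; that matching rule, the function name `exon_accuracy`, and the coordinates are assumptions for illustration.

```python
def exon_accuracy(predicted, annotated):
    """Return (Sn, Sp, F) as fractions for exons given as (start, end) pairs."""
    predicted, annotated = set(predicted), set(annotated)
    correct = len(predicted & annotated)  # exact boundary matches
    sn = correct / len(annotated) if annotated else 0.0
    sp = correct / len(predicted) if predicted else 0.0
    f = 2 * sn * sp / (sn + sp) if (sn + sp) > 0 else 0.0
    return sn, sp, f


# Made-up example: 4 annotated exons, 5 predictions, 3 exact matches.
annotated = [(100, 200), (300, 400), (500, 600), (700, 800)]
predicted = [(100, 200), (300, 400), (500, 600), (650, 680), (710, 800)]

sn, sp, f = exon_accuracy(predicted, annotated)
print(f"Sn={sn:.2f} Sp={sp:.2f} F={f:.3f}")
```

With 3 of 4 annotated exons recovered and 3 of 5 predictions correct, Sn is 0.75, Sp is 0.60 and the F-score is (2 × 0.75 × 0.60) / (0.75 + 0.60) ≈ 0.667.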

Gene Prediction Accuracy at the exon level: Sensitivity versus specificity. Top panel: dotplot for sensitivity versus specificity at the exon level for CDS evaluation. Each dot represents the overall value for each program on the 31 test sequences. Fig. 6 from Guigo et al., Genome Biology 2006, 7(Suppl 1):S2

Gene Prediction Accuracy at the exon level: Sensitivity versus specificity. Bottom panel: boxplots of the average sensitivity and specificity for each program. Each dot corresponds to the average in each of the test sequences for which GENCODE annotation existed. Fig. 6 from Guigo et al., Genome Biology 2006, 7(Suppl 1):S2.

EGASP results: Gene level accuracy

JIGSAW on other species