Analysis of the RNAseq Genome Annotation Assessment Project by Subhajyoti De.


The RNAseq Genome Annotation Assessment Project

Outline:
1. Introduction and a summary of submissions
2. Analysis methodology
3. Nucleotide level analysis
4. Exon level analysis
5. Transcript level analysis
6. Missing and wrong genes

The RGASP aims to assess the current progress of automatic gene building using RNAseq as its primary dataset. More specifically, we aim to evaluate the status of computational methods to map human RNAseq data, assemble them into transcripts and quantify the abundance of each transcript in particular datasets. Promising transcript predictions not covered by the Gencode annotation will be validated by experimental methods.

Introduction and a summary of submissions

3 species: human, worm and fly.
Multiple RNA-seq datasets for each organism.
15 submitters.
304 submissions.

Analysis methodology

1. We carried out independent evaluations for the coding portions of the mRNA transcripts (CDS-focused) and for the mRNA transcripts as a whole (mRNA-focused).
2. Analysis was carried out at multiple levels:
   1. Nucleotide level
   2. Exon level
   3. Transcript level
3. For each of the levels, we calculated the sensitivity and specificity of the predictions (as discussed later). As a summary measure we also reported the average of the two statistics.

Nucleotide level analysis

Figure labels: Annotation set, Prediction set, True positives, False positives, False negatives.

Sensitivity = (number of annotated nucleotides correctly predicted) / (number of annotated nucleotides in the annotation set)
Specificity = (number of predicted nucleotides that are also annotated) / (number of predicted nucleotides in the prediction set)

Nucleotide level analysis

Points to note:
1. Nucleotide predictions had to be on the same strand as the annotations to be considered correct.
2. Individual nucleotides present in multiple transcripts in either the annotation or the predictions are considered only once.
3. As a summary measure, we also calculated the arithmetic average of specificity and sensitivity.
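The nucleotide-level points above can be sketched roughly as follows. This is a minimal illustration, not the evaluation code used by the project: the feature tuples `(chrom, strand, start, end)`, the 1-based inclusive coordinates and the function names are all assumptions made for the example.

```python
def covered_positions(features):
    """Collect every (chrom, strand, position) covered by the features.

    Each feature is a hypothetical (chrom, strand, start, end) tuple with
    1-based inclusive coordinates. Using a set means a nucleotide shared by
    several transcripts is counted only once, and keeping the strand in the
    key means opposite-strand predictions never match.
    """
    positions = set()
    for chrom, strand, start, end in features:
        for pos in range(start, end + 1):
            positions.add((chrom, strand, pos))
    return positions


def nucleotide_level(annotation, prediction):
    """Return (sensitivity, specificity, arithmetic average)."""
    annotated = covered_positions(annotation)
    predicted = covered_positions(prediction)
    true_positives = len(annotated & predicted)
    sensitivity = true_positives / len(annotated) if annotated else 0.0
    specificity = true_positives / len(predicted) if predicted else 0.0
    return sensitivity, specificity, (sensitivity + specificity) / 2


# Toy data: two overlapping annotated exons, one same-strand prediction
# and one prediction on the opposite strand.
annotation = [("chr1", "+", 101, 200), ("chr1", "+", 151, 250)]
prediction = [("chr1", "+", 151, 300), ("chr1", "-", 101, 200)]
sn, sp, avg = nucleotide_level(annotation, prediction)
print(f"sensitivity={sn:.3f} specificity={sp:.3f} average={avg:.3f}")
```

Enumerating single positions is fine for a toy example; for whole genomes an interval-based representation would be far more economical.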

Nucleotide level analysis (H. sapiens)

Nucleotide level analysis (D. melanogaster)

Nucleotide level analysis (C. elegans)

Exon level analysis

Figure labels: Annotation set, Prediction set, True positives, False positives, False negatives.

Sensitivity = (number of annotated exons correctly predicted) / (number of annotated exons in the annotation set)
Specificity = (number of predicted exons that are also annotated) / (number of predicted exons in the prediction set)

Exon level analysis

Points to note:
1. An exon in the prediction must have identical start and end coordinates, and the same strand, as an exon in the annotation to be counted correct.
2. If an exon is present in multiple transcripts in either the annotation or the predictions, it is counted only once.
3. As a summary measure, we also calculated the arithmetic average of specificity and sensitivity.
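A minimal sketch of the exon-level rules above, assuming each transcript is given as a list of hypothetical `(chrom, strand, start, end)` exon tuples. The data layout and function name are illustrative assumptions, not the project's actual code.

```python
def exon_level(annotated_transcripts, predicted_transcripts):
    """Return (sensitivity, specificity, arithmetic average) at exon level.

    Exons are hypothetical (chrom, strand, start, end) tuples. Building sets
    counts an exon shared by several transcripts only once, and set
    intersection requires identical coordinates and strand.
    """
    annotated = {exon for tx in annotated_transcripts for exon in tx}
    predicted = {exon for tx in predicted_transcripts for exon in tx}
    correct = len(annotated & predicted)
    sensitivity = correct / len(annotated) if annotated else 0.0
    specificity = correct / len(predicted) if predicted else 0.0
    return sensitivity, specificity, (sensitivity + specificity) / 2


# Toy data: the first annotated exon is shared by two transcripts.
annotated_transcripts = [
    [("chr1", "+", 100, 200), ("chr1", "+", 300, 400)],
    [("chr1", "+", 100, 200), ("chr1", "+", 500, 600)],
]
predicted_transcripts = [
    [("chr1", "+", 100, 200), ("chr1", "+", 300, 410)],  # second exon ends 10 bp late
    [("chr1", "-", 500, 600)],                           # right coordinates, wrong strand
]
exon_sn, exon_sp, exon_avg = exon_level(annotated_transcripts, predicted_transcripts)
print(exon_sn, exon_sp, exon_avg)
```

In this toy case only the shared first exon counts as correct: one boundary mismatch or a strand mismatch is enough to reject an exon.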

Exon level analysis (H. sapiens)

Exon level analysis (D. melanogaster)

Exon level analysis (C. elegans)

Transcript level analysis

Figure labels: Annotation set, Prediction set, True positives, False positives, False negatives.

Sensitivity = (number of annotated transcripts correctly predicted) / (number of annotated transcripts in the annotation set)
Specificity = (number of predicted transcripts that are also annotated) / (number of predicted transcripts in the prediction set)

Transcript level analysis

Points to note:
1. We consider a transcript accurately predicted if the number of exons in the transcript and their boundaries match exactly between the annotation and the prediction.
2. For the CDS-focused evaluation, we consider a transcript correctly predicted if the beginning and end of translation are correctly annotated and each of the 5' and 3' splice sites of the coding exons is correct.
3. For the mRNA-focused evaluation, a transcript is counted correct if all of the exons from the start of transcription to the end of transcription match perfectly between the annotation and prediction sets.
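The exact-match rule above can be illustrated with a short sketch. It assumes each transcript is a list of hypothetical `(chrom, strand, start, end)` exon tuples and that every transcript in either set is counted individually; both are assumptions of this example. For a CDS-focused evaluation one would pass only the coding portions of the exons, for an mRNA-focused evaluation the full exons.

```python
def exon_chain(transcript):
    """Canonical, hashable form of a transcript: its exons in sorted order."""
    return tuple(sorted(transcript))


def transcript_level(annotation, prediction):
    """Return (sensitivity, specificity) under exact exon-chain matching."""
    annotated_chains = [exon_chain(tx) for tx in annotation]
    predicted_chains = [exon_chain(tx) for tx in prediction]
    annotated_set = set(annotated_chains)
    predicted_set = set(predicted_chains)
    # An annotated transcript is found if some prediction has the same chain.
    found = sum(chain in predicted_set for chain in annotated_chains)
    # A predicted transcript is confirmed if the annotation has the same chain.
    confirmed = sum(chain in annotated_set for chain in predicted_chains)
    sensitivity = found / len(annotated_chains) if annotated_chains else 0.0
    specificity = confirmed / len(predicted_chains) if predicted_chains else 0.0
    return sensitivity, specificity


e1 = ("chr1", "+", 100, 200)
e2 = ("chr1", "+", 300, 400)
e3 = ("chr1", "+", 500, 600)
annotation = [[e1, e2], [e1, e3]]
prediction = [
    [e1, e2],                          # exact match
    [e1],                              # wrong number of exons
    [e1, ("chr1", "+", 500, 610)],     # last exon boundary is off
]
tx_sn, tx_sp = transcript_level(annotation, prediction)
print(tx_sn, tx_sp)
```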

Transcript level analysis: Human (CDS-focused)

Relaxed transcript level analysis

Figure labels: Annotation set, Prediction set, True positives, False positives, False negatives.

Sensitivity = (number of annotated transcripts correctly predicted) / (number of annotated transcripts in the annotation set)
Specificity = (number of predicted transcripts that are also annotated) / (number of predicted transcripts in the prediction set)

Relaxed transcript level analysis

Points to note:
1. We consider a transcript 'accurately' predicted if the number of exons in the transcript matches exactly between the annotation and the prediction, and their boundaries differ by no more than 5 bp.
2. All other criteria remain the same as for the transcript level analysis.
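As a rough sketch of the relaxed criterion above: the function below compares one annotated and one predicted transcript, each a list of hypothetical `(chrom, strand, start, end)` exon tuples. Pairing exons in sorted genomic order and applying the 5 bp tolerance to each boundary separately are assumptions of this example.

```python
TOLERANCE_BP = 5  # maximum allowed boundary difference, from the slide


def relaxed_match(annotated_tx, predicted_tx, tolerance=TOLERANCE_BP):
    """True if both transcripts have the same number of exons and every
    paired exon boundary differs by at most `tolerance` bp."""
    if len(annotated_tx) != len(predicted_tx):
        return False
    # Pair exons in genomic order (assumes one chromosome and strand per transcript).
    for ann, pred in zip(sorted(annotated_tx), sorted(predicted_tx)):
        a_chrom, a_strand, a_start, a_end = ann
        p_chrom, p_strand, p_start, p_end = pred
        if (a_chrom, a_strand) != (p_chrom, p_strand):
            return False
        if abs(a_start - p_start) > tolerance or abs(a_end - p_end) > tolerance:
            return False
    return True


annotated = [("chr1", "+", 100, 200), ("chr1", "+", 300, 400)]
close = [("chr1", "+", 103, 200), ("chr1", "+", 300, 405)]   # within 5 bp
far = [("chr1", "+", 106, 200), ("chr1", "+", 300, 400)]     # start is 6 bp off
short = [("chr1", "+", 100, 200)]                            # one exon missing

print(relaxed_match(annotated, close), relaxed_match(annotated, far),
      relaxed_match(annotated, short))
```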

Very relaxed transcript level analysis

Figure labels: Annotation set, Prediction set, True positives, False positives, False negatives.

Sensitivity = (number of annotated transcripts correctly predicted) / (number of annotated transcripts in the annotation set)
Specificity = (number of predicted transcripts that are also annotated) / (number of predicted transcripts in the prediction set)

Very relaxed transcript level analysis: Worm (exon-focused)

Points to note:
1. We consider a transcript 'accurately' predicted if
   1. the number of exons in the transcript differs by no more than two (terminal exons only) between the annotation and the prediction, and
   2. the boundaries of all equivalent exons differ by no more than 5 bp between the annotation and the prediction.
2. All other criteria remain the same as for the transcript level analysis.
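One possible reading of the very relaxed criterion is sketched below. The slide does not spell out how the "terminal exons only" rule is applied, so this example assumes that the longer transcript may carry up to two extra exons at its ends, and that the remaining exons must pair up within 5 bp. Exon tuples `(chrom, strand, start, end)` and all function names are hypothetical.

```python
def exons_close(ann, pred, tolerance):
    """Same chromosome and strand, both boundaries within `tolerance` bp."""
    return (ann[0] == pred[0] and ann[1] == pred[1]
            and abs(ann[2] - pred[2]) <= tolerance
            and abs(ann[3] - pred[3]) <= tolerance)


def very_relaxed_match(annotated_tx, predicted_tx, tolerance=5, max_extra=2):
    """True if the exon counts differ by at most `max_extra`, the surplus
    exons sit only at the ends of the longer transcript, and all remaining
    exons match within `tolerance` bp."""
    longer, shorter = sorted(
        (sorted(annotated_tx), sorted(predicted_tx)), key=len, reverse=True
    )
    extra = len(longer) - len(shorter)
    if not shorter or extra > max_extra:
        return False
    # Try every way of splitting the surplus exons between the two ends.
    for skip_left in range(extra + 1):
        window = longer[skip_left:skip_left + len(shorter)]
        if all(exons_close(a, p, tolerance) for a, p in zip(window, shorter)):
            return True
    return False


annotated = [("chrI", "+", 100, 200), ("chrI", "+", 300, 400),
             ("chrI", "+", 500, 600), ("chrI", "+", 700, 800)]
inner_only = [("chrI", "+", 302, 400), ("chrI", "+", 500, 596)]   # both terminal exons absent
internal_gap = [annotated[0], annotated[1], annotated[3]]         # internal exon skipped
single = [annotated[0]]                                           # three exons absent

print(very_relaxed_match(annotated, inner_only),
      very_relaxed_match(annotated, internal_gap),
      very_relaxed_match(annotated, single))
```

A skipped internal exon is rejected here because only contiguous windows of the longer transcript are compared, which is how this sketch encodes the terminal-exon restriction.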

Missing and wrong genes

'Missing exons' (MEs): annotated exons that do not overlap any predicted exon by at least 1 bp.
'Wrong exons' (WEs): predicted exons that do not overlap any annotated exon by at least 1 bp.

Figure labels: Annotation set, Prediction set, Missed exons, Wrong exons.

'Wrong exons' (WEs) that are predicted independently by more than two predictors are recorded, and some of them will be tested experimentally.
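The two definitions above can be sketched as follows, assuming exons are hypothetical `(chrom, strand, start, end)` tuples with 1-based inclusive coordinates. Requiring the same strand for an overlap is an assumption of this example, carried over from the nucleotide-level rule; the slide itself does not state it.

```python
def overlaps(a, b):
    """True if two exons share at least 1 bp on the same chromosome and strand."""
    return a[0] == b[0] and a[1] == b[1] and a[2] <= b[3] and b[2] <= a[3]


def missing_and_wrong(annotated_exons, predicted_exons):
    """Return (missing exons, wrong exons).

    Missing: annotated exons overlapping no predicted exon.
    Wrong: predicted exons overlapping no annotated exon.
    The all-against-all comparison is quadratic, which is acceptable for a
    toy example but not for whole-genome exon sets.
    """
    missing = [a for a in annotated_exons
               if not any(overlaps(a, p) for p in predicted_exons)]
    wrong = [p for p in predicted_exons
             if not any(overlaps(p, a) for a in annotated_exons)]
    return missing, wrong


annotated_exons = [("chr1", "+", 100, 200), ("chr1", "+", 300, 400)]
predicted_exons = [("chr1", "+", 200, 250),   # shares exactly 1 bp with the first exon
                   ("chr1", "+", 500, 600)]   # overlaps nothing
missing, wrong = missing_and_wrong(annotated_exons, predicted_exons)
print("missing:", missing)
print("wrong:", wrong)
```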

Dubious wrong exons

Figure labels: Annotation set, Prediction set, Dubious wrong exons.

'Dubious wrong exons' (WEs) that are predicted independently by more than two predictors are reported.

Screenshot of the list of dubious wrong exons. Captions: dubious wrong exons in the whole human genome; dubious wrong exons in the whole worm genome.
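A small sketch of how wrong exons shared by several predictors might be collected. It assumes that "predicted independently" means exons with identical coordinates and strand in different submissions, and that the per-predictor wrong-exon lists are already available; the dictionary layout and predictor names are made up for illustration.

```python
from collections import Counter


def recurrent_wrong_exons(wrong_by_predictor, min_predictors=3):
    """Return wrong exons reported by at least `min_predictors` predictors.

    `wrong_by_predictor` maps a predictor name to its list of wrong exons,
    each a hypothetical (chrom, strand, start, end) tuple. "More than two
    predictors" corresponds to the default of three.
    """
    counts = Counter()
    for exons in wrong_by_predictor.values():
        counts.update(set(exons))  # count each exon once per predictor
    return sorted(exon for exon, n in counts.items() if n >= min_predictors)


shared = ("chr2", "+", 1000, 1100)
wrong_by_predictor = {
    "predictor_A": [shared, ("chr2", "+", 5000, 5100)],
    "predictor_B": [shared, shared],            # duplicate within one predictor
    "predictor_C": [shared, ("chr2", "+", 5000, 5100)],
}
dubious = recurrent_wrong_exons(wrong_by_predictor)
print(dubious)
```

Matching on overlap instead of identical coordinates would be an equally plausible reading and would need a different grouping step.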

Acknowledgement

Jen Harrow
Felix Kokocinski
Tim Hubbard
The RGASP community