Common Errors in Student Annotation Submissions contributions from Paul Lee, David Xiong, Thomas Quisenberry Annotating multiple genes at the same locus.

Slides:



Advertisements
Similar presentations
EAnnot: A genome annotation tool using experimental evidence Aniko Sabo & Li Ding Genome Sequencing Center Washington University, St. Louis.
Advertisements

1 Computational Molecular Biology MPI for Molecular Genetics DNA sequence analysis Gene prediction Gene prediction methods Gene indices Mapping cDNA on.
Finding Eukaryotic Open reading frames.
Finding prokaryotic genes and non intronic eukaryotic genes
Assessment of sequence alignment Lecture Introduction The Dot plot Matrix visualisation matching tool: – Basics of Dot plot – Examples of Dot plot.
Chapter 6 Gene Prediction: Finding Genes in the Human Genome.
Wellcome Trust Workshop Working with Pathogen Genomes Module 3 Sequence and Protein Analysis (Using web-based tools)
Locating genes in Plasmodium falciparum You have seen how artemis is used to view, analyse and annotate bacterial genomes, but now we are going to move.
ENCODE pseudogene updates Adam Frankish, HAVANA 6/10/05.
Basic Introduction of BLAST Jundi Wang School of Computing CSC691 09/08/2013.
What is comparative genomics? Analyzing & comparing genetic material from different species to study evolution, gene function, and inherited disease Understand.
 GEP Digital Laboratory Notebook Nick Reeves, Mt. San Jacinto Community College.
Annotation of Drosophila GEP Workshop – August 2015 Wilson Leung and Chris Shaffer.
BIOINFORMATICS IN BIOCHEMISTRY Bioinformatics– a field at the interface of molecular biology, computer science, and mathematics Bioinformatics focuses.
LOC_Os02g08480 Supplementary Figure S1. Exons shorter than a read length have few or no reads aligned. The gene at LOC_Os02g08040 contains exons shorter.
ANALYSIS AND VISUALIZATION OF SINGLE COPY ORTHOLOGS IN ARABIDOPSIS, LETTUCE, SUNFLOWER AND OTHER PLANT SPECIES. Alexander Kozik and Richard W. Michelmore.
Programs and Web Tools Status Update GEP Alumni Workshop Wilson Leung 08/05/2011.
CISC667, F05, Lec9, Liao CISC 667 Intro to Bioinformatics (Fall 2005) Sequence Database search Heuristic algorithms –FASTA –BLAST –PSI-BLAST.
1 P6a Extra Discussion Slides Part 1. 2 Section A.
BLAST Basic Local Alignment Search Tool (Altschul et al. 1990)
Figure 2: over-representation of neighbors in the fushi-tarazu region of Drosophila melanogaster. Annotated enhancers are marked grey. The CDS is marked.
Web Databases for Drosophila Introduction to FlyBase and Ensembl Database Wilson Leung6/06.
Fea- ture Num- ber Feature NameFeature description 1 Average number of exons Average number of exons in the transcripts of a gene where indel is located.
Pattern Matching Rhys Price Jones Anne R. Haake. What is pattern matching? Pattern matching is the procedure of scanning a nucleic acid or protein sequence.
 GEP Implementation at Mt. San Jacinto Community College Nick Reeves, Ph.D.
Introduction to ab initio and evidence-based gene finding Wilson Leung08/2015.
Web Databases for Drosophila An introduction to web tools, databases and NCBI BLAST Wilson Leung08/2015.
Overview of the Drosophila modENCODE hybrid assemblies Wilson Leung01/2014.
Annotation of Drosophila primer
RNA-Seq Primer Understanding the RNA-Seq evidence tracks on the GEP UCSC Genome Browser Wilson Leung08/2014.
A Non-EST-Based Method for Exon-Skipping Prediction Rotem Sorek, Ronen Shemesh, Yuval Cohen, Ortal Basechess, Gil Ast and Ron Shamir Genome Research August.
Annotation of Drosophila virilis Chris Shaffer GEP workshop, 2006.
JIGSAW: a better way to combine predictions J.E. Allen, W.H. Majoros, M. Pertea, and S.L. Salzberg. JIGSAW, GeneZilla, and GlimmerHMM: puzzling out the.
Primer on Annotation of Drosophila Genes GEP Workshop – January 2016 Wilson Leung and Chris Shaffer.
Gene Structure and Identification III BIO520 BioinformaticsJim Lund Previous reading: 1.3, , 10.4,
SRB Genome Assembly and Analysis From 454 Sequences HC70AL S Brandon Le & Min Chen.
-1- Module 3: RNA-Seq Module 3 BAMView Introduction Recently, the use of new sequencing technologies (pyrosequencing, Illumina-Solexa) have produced large.
Finding genes in the genome
Annotation of eukaryotic genomes
What is BLAST? Basic BLAST search What is BLAST?
Gene Finding in Chimpanzee Evidence based improvement of ab initio gene predictions Chris Shaffer06/2009.
454 Genome Sequence Assembly and Analysis HC70AL S Brandon Le & Min Chen.
Visualization of genomic data Genome browsers. UCSC browser Ensembl browser Others ? Survey.
1 Gene Finding. 2 “The Central Dogma” TranscriptionTranslation RNA Protein.
Web Databases for Drosophila
What is BLAST? Basic BLAST search What is BLAST?
RNA-Seq Primer Understanding the RNA-Seq evidence tracks on
Annotation of Drosophila
Annotation for D. virilis
C acaatataATGGAGCGTGAGACTTCGTCATCTTCAACTCCTCCGGAGGATCTTGTTACATCGATGATCGGAAAGTTCGTCGCTGTCATGTCTA b acaatataATGGAGCGTGAGACTTCGTCATCTTCAACTCCTCCGGAGGATCTTGTTACATCGATGATCGGAAAGTTCGTCGCTGTCTTGTCTA.
MAKRLVLFATVVIALVALTVAEGEASRQLQCERELQESSLEACRLVVDQQLAGRLPWSTGLQMRCCQQLRDISAKCRPVAVSQVARQYGQ MAKRLVLFATVVIGLVALTVAEGEASRQLQCERELQESSLEACRLVVDQQLAGRLPWSTGLQMRCCQQLRDISAKCRPVAVSQVARQYGQ.
22 kDa a-coixin gene cluster
Basics of BLAST Basic BLAST Search - What is BLAST?
Genomics and Personalized Care in Health Systems Lecture 7 Gene Finding (Part 2) Ab initio and Evidence-Based Gene Finding Leming Zhou, PhD School of.
TSS Annotation Workflow
GEP Annotation Workflow
Visualization of genomic data
Genome Center of Wisconsin, UW-Madison
Ab initio gene prediction
Bioinformatics and BLAST
Gene Annotation with DNA Subway
Comparative Genomics.
Basic Local Alignment Search Tool (BLAST)
BLAT Blast Like Alignment Tool
Alex M. Plocik, Brenton R. Graveley  Molecular Cell 
Jun Xu, Roger W. Hendrix, Robert L. Duda  Molecular Cell 
Basic Local Alignment Search Tool
Determine CDS Coordinates
Multiple sequence alignment of STAT6 and other STAT proteins produced by ClusterW and ESpript (espript.ibcp.fr/ESPript/ESPript/). Multiple sequence alignment.
Common Errors in Student Annotation Submissions contributions from Paul Lee, David Xiong, Thomas Quisenberry Annotating multiple genes at the same locus.
Presentation transcript:

Common Errors in Student Annotation Submissions contributions from Paul Lee, David Xiong, Thomas Quisenberry Annotating multiple genes at the same locus based on BLASTX alignments Over-reliance on BLAST alignments Over-reliance on gene predictors Not annotating all genes or all isoforms Missing small exons Annotating incorrect splice sites

Over-reliance on BLASTX alignment BLASTX Alignment Annotations Gene Predictors RNA-Seq Data

Relying on a single gene predictor

Strategies to resolve common errors Dot plot TBLASTN / BLASTX with exon by exon strategy RNA-Seq Identify small coding exons using “Small Exon Finder” Use dot plot and peptide sequence alignment to check

An interesting annotation problem: contig34 (Liz Chen’s project from Bio 4342), reconciliation by Thomas Quisenberry Submitted annotations: Did Liz include an extra exon at ? Her model has 10 exons, while the Drosophila melanogaster model only has 9.

Continuing investigation of contig34 Checked other student’s submission forms for CG1909, the gene in question:  y-axis= student annotation submission; x-axis=D. melanogaster gene model  Gap (red) indicates residues in D. melanogaster gene that are not present in student annotation  All in all, this dot plot warrants further investigation

contig34 continuing investigation Check UCSC Genome Browser view for this gene in D. biarmipes: Above: blue box marks BLASTX alignment and RNA-Seq data in the region of extra exon. Right-hand exon (fifth) is supported by RNA-Seq data, conservation. Below: TBLASTN results using a.a. sequence of fourth exon in D. melanogaster model as the query and nucleotide sequence of contig34 as the subject  two regions of conservation

contig34 completed Gene model checker dot plot output for model including additional exon Much better than before! Amino acid sequence conserved Appropriate splice junctions maintaining ORF identified Model has 1 more exon

Using RNA-Seq and TopHat to identify 5’ and 3’ UTR, start and stop codons

Interesting annotation challenges: Read-through stop codons Jungreis et al “Evidence of abundant stop codon read-through in Drosophila and other metazoa.” Genome Res : Genome Res.

Interesting annotation challenges Errors in the consensus sequence TBLASTN search of exon against contig shows a frame shift in the middle of the exon (problem with 454 sequencing)

To avoid these discrepancies, students should remember to … check the dot plot and peptide sequence alignment comparison with D. melanogaster (output from Gene Model Checker); be able to explain & defend any differences! look for discrepancies by going back to the Gene Record Finder and comparing exon lengths and locations; double check all splice sites; check whether any proposed non- canonical splice sites are also observed in the D. melanogaster model or nearby species; check all final annotation models with BLASTP alignments to the D. melanogaster orthologue (higher resolution); for 454 sequenced species, check DNA sequence using added Illumina reads or RNA-Seq data if needed.