Common Errors in Student Annotation Submissions contributions from Paul Lee, David Xiong, Thomas Quisenberry Annotating multiple genes at the same locus.

Slides:



Advertisements
Similar presentations
Finding Eukaryotic Open reading frames.
Advertisements

Finding prokaryotic genes and non intronic eukaryotic genes
Chapter 6 Gene Prediction: Finding Genes in the Human Genome.
Wellcome Trust Workshop Working with Pathogen Genomes Module 3 Sequence and Protein Analysis (Using web-based tools)
Locating genes in Plasmodium falciparum You have seen how artemis is used to view, analyse and annotate bacterial genomes, but now we are going to move.
What is comparative genomics? Analyzing & comparing genetic material from different species to study evolution, gene function, and inherited disease Understand.
 GEP Digital Laboratory Notebook Nick Reeves, Mt. San Jacinto Community College.
NCBI Review Concepts Chuong Huynh. NCBI Pairwise Sequence Alignments Purpose: identification of sequences with significant similarity to (a)
Annotation of Drosophila GEP Workshop – August 2015 Wilson Leung and Chris Shaffer.
BIOINFORMATICS IN BIOCHEMISTRY Bioinformatics– a field at the interface of molecular biology, computer science, and mathematics Bioinformatics focuses.
Common Errors in Student Annotation Submissions contributions from Paul Lee, David Xiong, Thomas Quisenberry Annotating multiple genes at the same locus.
ANALYSIS AND VISUALIZATION OF SINGLE COPY ORTHOLOGS IN ARABIDOPSIS, LETTUCE, SUNFLOWER AND OTHER PLANT SPECIES. Alexander Kozik and Richard W. Michelmore.
CISC667, F05, Lec9, Liao CISC 667 Intro to Bioinformatics (Fall 2005) Sequence Database search Heuristic algorithms –FASTA –BLAST –PSI-BLAST.
1 P6a Extra Discussion Slides Part 1. 2 Section A.
BLAST Basic Local Alignment Search Tool (Altschul et al. 1990)
Web Databases for Drosophila Introduction to FlyBase and Ensembl Database Wilson Leung6/06.
Fea- ture Num- ber Feature NameFeature description 1 Average number of exons Average number of exons in the transcripts of a gene where indel is located.
Pattern Matching Rhys Price Jones Anne R. Haake. What is pattern matching? Pattern matching is the procedure of scanning a nucleic acid or protein sequence.
 GEP Implementation at Mt. San Jacinto Community College Nick Reeves, Ph.D.
Web Databases for Drosophila An introduction to web tools, databases and NCBI BLAST Wilson Leung08/2015.
Annotation of Drosophila primer
RNA-Seq Primer Understanding the RNA-Seq evidence tracks on the GEP UCSC Genome Browser Wilson Leung08/2014.
A Non-EST-Based Method for Exon-Skipping Prediction Rotem Sorek, Ronen Shemesh, Yuval Cohen, Ortal Basechess, Gil Ast and Ron Shamir Genome Research August.
Annotation of Drosophila virilis Chris Shaffer GEP workshop, 2006.
JIGSAW: a better way to combine predictions J.E. Allen, W.H. Majoros, M. Pertea, and S.L. Salzberg. JIGSAW, GeneZilla, and GlimmerHMM: puzzling out the.
Primer on Annotation of Drosophila Genes GEP Workshop – January 2016 Wilson Leung and Chris Shaffer.
Construction of Substitution matrices
SRB Genome Assembly and Analysis From 454 Sequences HC70AL S Brandon Le & Min Chen.
Finding genes in the genome
Annotation of eukaryotic genomes
What is BLAST? Basic BLAST search What is BLAST?
Gene Finding in Chimpanzee Evidence based improvement of ab initio gene predictions Chris Shaffer06/2009.
Visualization of genomic data Genome browsers. UCSC browser Ensembl browser Others ? Survey.
Genetic Code and Interrupted Gene Chapter 4. Genetic Code and Interrupted Gene Aala A. Abulfaraj.
1 Gene Finding. 2 “The Central Dogma” TranscriptionTranslation RNA Protein.
Web Databases for Drosophila
What is BLAST? Basic BLAST search What is BLAST?
RNA-Seq Primer Understanding the RNA-Seq evidence tracks on
Annotation of Drosophila
Annotation for D. virilis
bacteria and eukaryotes
Annotating The data.
C acaatataATGGAGCGTGAGACTTCGTCATCTTCAACTCCTCCGGAGGATCTTGTTACATCGATGATCGGAAAGTTCGTCGCTGTCATGTCTA b acaatataATGGAGCGTGAGACTTCGTCATCTTCAACTCCTCCGGAGGATCTTGTTACATCGATGATCGGAAAGTTCGTCGCTGTCTTGTCTA.
What is a Hidden Markov Model?
MAKRLVLFATVVIALVALTVAEGEASRQLQCERELQESSLEACRLVVDQQLAGRLPWSTGLQMRCCQQLRDISAKCRPVAVSQVARQYGQ MAKRLVLFATVVIGLVALTVAEGEASRQLQCERELQESSLEACRLVVDQQLAGRLPWSTGLQMRCCQQLRDISAKCRPVAVSQVARQYGQ.
22 kDa a-coixin gene cluster
Basics of BLAST Basic BLAST Search - What is BLAST?
Genomics and Personalized Care in Health Systems Lecture 7 Gene Finding (Part 2) Ab initio and Evidence-Based Gene Finding Leming Zhou, PhD School of.
TSS Annotation Workflow
GEP Annotation Workflow
Visualization of genomic data
Genome Center of Wisconsin, UW-Madison
Mark M Metzstein, H.Robert Horvitz  Molecular Cell 
Ab initio gene prediction
Bioinformatics and BLAST
From: TopHat: discovering splice junctions with RNA-Seq
Gene Annotation with DNA Subway
Bacterial genomics: The controlled chaos of shifty pathogens
Comparative Genomics.
Basic Local Alignment Search Tool (BLAST)
Warthog and Groundhog, novel families related to Hedgehog
BLAT Blast Like Alignment Tool
Alex M. Plocik, Brenton R. Graveley  Molecular Cell 
Evaluation of re-annotation for non-melanogaster Drosophila species.
Basic Local Alignment Search Tool
Determine CDS Coordinates
Universal Alternative Splicing of Noncoding Exons
Multiple sequence alignment of STAT6 and other STAT proteins produced by ClusterW and ESpript (espript.ibcp.fr/ESPript/ESPript/). Multiple sequence alignment.
Volume 11, Issue 7, Pages (May 2015)
A, Schematic diagram of identified splice variants of PD-L1.
Presentation transcript:

Common Errors in Student Annotation Submissions contributions from Paul Lee, David Xiong, Thomas Quisenberry Annotating multiple genes at the same locus based on BLASTX alignments Over-reliance on BLAST alignments Over-reliance on gene predictors Not annotating all genes or all isoforms Missing small exons Annotating incorrect splice sites

Over-reliance on BLASTX alignment Annotations BLASTX Alignment Gene Predictors RNA-Seq Data

Relying on a single gene predictor

Strategies to resolve common errors Dot plot TBLASTN / BLASTX with exon by exon strategy RNA-Seq Identify small coding exons using “Small Exon Finder” Use dot plot and peptide sequence alignment to check

05/16/2019 An interesting annotation problem: contig34 (Liz Chen’s project from Bio 4342), reconciliation by Thomas Quisenberry Submitted annotations: Did Liz include an extra exon at 32298-32363? Her model has 10 exons, while the Drosophila melanogaster model only has 9. Here I’m going to outline the steps I took in revising an interesting annotation problem CG1909 is a gene from project 34, which Liz worked on in 4342

Continuing investigation of contig34 Checked other student’s submission forms for CG1909, the gene in question: y-axis= student annotation submission; x-axis=D. melanogaster gene model Gap (red) indicates residues in D. melanogaster gene that are not present in student annotation All in all, this dot plot warrants further investigation

contig34 continuing investigation Check UCSC Genome Browser view for this gene in D. biarmipes: Above: blue box marks BLASTX alignment and RNA-Seq data in the region of extra exon. Right-hand exon (fifth) is supported by RNA-Seq data, conservation. Below: TBLASTN results using a.a. sequence of fourth exon in D. melanogaster model as the query and nucleotide sequence of contig34 as the subject  two regions of conservation

05/16/2019 contig34 completed  Gene model checker dot plot output for model including additional exon Much better than before! Amino acid sequence conserved Appropriate splice junctions maintaining ORF identified Model has 1 more exon -> Demonstrates the importance of scrutiny when revising these projects

Using RNA-Seq and TopHat to identify 5’ and 3’ UTR, start and stop codons

Interesting annotation challenges: Read-through stop codons 05/16/2019 Interesting annotation challenges: Read-through stop codons http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3227100/figure/F3/ Jungreis et al “Evidence of abundant stop codon read-through in Drosophila and other metazoa.” Genome Res. 2011 21: 2096-113.

Interesting annotation challenges Errors in the consensus sequence TBLASTN search of exon against contig shows a frame shift in the middle of the exon (problem with 454 sequencing)

To avoid these discrepancies, students should remember to… 05/16/2019 To avoid these discrepancies, students should remember to… check the dot plot and peptide sequence alignment comparison with D. melanogaster (output from Gene Model Checker); be able to explain & defend any differences! look for discrepancies by going back to the Gene Record Finder and comparing exon lengths and locations; double check all splice sites; check whether any proposed non-canonical splice sites are also observed in the D. melanogaster model or nearby species; check all final annotation models with BLASTP alignments to the D. melanogaster orthologue (higher resolution); for 454 sequenced species, check DNA sequence using added Illumina reads or RNA-Seq data if needed.