Annotation of Drosophila primer


Annotation of Drosophila primer GEP Workshop – August 2015 Wilson Leung and Chris Shaffer

Outline
- Overview of the GEP annotation project
  - Annotation goals
  - Nomenclature
- GEP annotation strategy
  - Using genomics databases
  - Interpreting RNA-Seq data
  - Understanding the phase of splice sites
- “Annotation of a Drosophila gene” walkthrough

Legend: Start codon, Coding region, Stop codon, Splice donor, Splice acceptor, UTR
AAACAACAATCATAAATAGAGGAAGTTTTCGGAATATACGATAAGTGAAATATCGTTCTTAAAAAAGAGCAAGAACAGTTTAACCATTGAAAACAAGATTATTCCAATAGCCGTAAGAGTTCATTTAATGACAATGACGATGGCGGCAAAGTCGATGAAGGACTAGTCGGAACTGGAAATAGGAATGCGCCAAAAGCTAGTGCAGCTAAACATCAATTGAAACAAGTTTGTACATCGATGCGCGGAGGCGCTTTTCTCTCAGGATGGCTGGGGATGCCAGCACGTTAATCAGGATACCAATTGAGGAGGTGCCCCAGCTCACCTAGAGCCGGCCAATAAGGACCCATCGGGGGGGCCGCTTATGTGGAAGCCAAACATTAAACCATAGGCAACCGATTTGTGGGAATCGAATTTAAGAAACGGCGGTCAGCCACCCGCTCAACAAGTGCCAAAGCCATCTTGGGGGCATACGCCTTCATCAAATTTGGGCGGAACTTGGGGCGAGGACGATGATGGCGCCGATAGCACCAGCGTTTGGACGGGTCAGTCATTCCACATATGCACAACGTCTGGTGTTGCAGTCGGTGCCATAGCGCCTGGCCGTTGGCGCCGCTGCTGGTCCCTAATGGGGACAGGCTGTTGCTGTTGGTGTTGGAGTCGGAGTTGCCTTAAACTCGACTGGAAATAACAATGCGCCGGCAACAGGAGCCCTGCCTGCCGTGGCTCGTCCGAAATGTGGGGACATCATCCTCAGATTGCTCACAATCATCGGCCGGAATGNTAANGAATTAATCAAATTTTGGCGGACATAATGNGCAGATTCAGA ACGTATTAACAAAATGGTCGGCCCCGTTGTTAGTGCAACAGGGTCAAATATCGCAAGCTCAAATATTGGCCCAAGCGGTGTTGGTTCCGTATCCGGTAATGTCGGGGCACAATGGGGAGCCACACAGGCCGCGTTGGGGCCCCAAGGTATTTCCAAGCAAATCACTGGATGGGAGGAACCACAATCAGATTCAGAATATTAACAAAATGGTCGGCCCCGTTGTTATGGATAAAAAATTTGTGTCTTCGTACGGAGATTATGTTGTTAATCAATTTTATTAAGATATTTAAATAAATATGTGTACCTTTCACGAGAAATTTGCTTACCTTTTCGACACACACACTTATACAGACAGGTAATAATTACCTTTTGAGCAATTCGATTTTCATAAAATATACCTAAATCGCATCGTC

Annotation: Adding labels to a sequence
- Genes: Novel or known genes, pseudogenes
- Regulatory elements: Promoters, enhancers, silencers
- Non-coding RNA: tRNAs, miRNAs, siRNAs, snoRNAs
- Repeats: Transposable elements, simple repeats
- Structural: Origins of replication, synteny
- Experimental results:
  - DNase I Hypersensitive sites (DHS)
  - ChIP-chip and ChIP-Seq datasets (e.g., modENCODE)

GEP Drosophila annotation projects
- Species in the phylogenetic tree: D. melanogaster, D. simulans, D. sechellia, D. yakuba, D. erecta, D. ficusphila, D. eugracilis, D. biarmipes, D. takahashii, D. elegans, D. rhopaloa, D. kikkawai, D. bipectinata, D. ananassae, D. pseudoobscura, D. persimilis, D. willistoni, D. mojavensis, D. virilis, D. grimshawi
- Legend categories: Reference; Published; Species in the Four Genomes Paper; Annotation projects for Fall 2015 / Spring 2016; Manuscript in progress; New species sequenced by modENCODE
- Anticipate that we will work on the remaining projects from the D. elegans F element and new projects from the D element this Fall.
- Phylogenetic tree produced by Thom Kaufman as part of the modENCODE project

Muller element nomenclature Muller elements Species A B C D E F D. simulans X 2L 2R 3L 3R 4 D. sechellia D. melanogaster D. yakuba D. erecta D. ananassae XLXR 4L4R D. pseudoobscura XL 3 XR 2 5 D. persimilis D. willistoni D. mojavensis 6 D. virilis D. grimshawi

Gene structure nomenclature
[Figure: gene model diagram labeling the gene span, primary mRNA, protein, exons, UTRs, and CDSs]

GEP annotation strategy
- Technique is optimized for projects with a moderately close, well-annotated neighbor species
  - Example: D. melanogaster
- Need to apply different strategies when annotating genes in other species
  - Examples: corn, parrot

GEP annotation goals
- Identify and annotate all the genes in your project
- For each gene, identify and precisely map (accurate to the base pair) all coding exons (CDS)
  - Do this for ALL isoforms
  - Annotate the initial transcribed exon and transcription start site (TSS)
- Optional curriculum not submitted to GEP
  - Clustal analysis (protein, promoter regions)
  - Repeats analysis
  - Synteny analysis

Evidence-based annotation
- Human-curated analysis
  - More accurate than standard ab initio and evidence-based gene finders
- Goal: collect, analyze, and synthesize all the available evidence to create the best-supported gene model
  - Example: 4591-4688, 5157-5490, 5747-6001

Collect, analyze, and synthesize
- Collect:
  - Genome Browser
  - Conservation (BLAST searches)
- Analyze:
  - Interpreting Genome Browser evidence tracks
  - Interpreting BLAST results
- Synthesize:
  - Construct the best-supported gene model based on potentially contradictory evidence

Basic annotation workflow
1. Identify the likely ortholog in D. melanogaster
2. Observe the gene structure of the ortholog
3. Map each CDS of the ortholog to the project sequence
   - Use BLASTX to identify the conserved region
   - Note position and reading frame
4. Use these data to construct a gene model
   - Use TopHat splice junctions to identify splice site boundaries, or a combination of conservation, splice signals, and reading frame
   - Identify the exact start and stop base position for each CDS
5. Use the Gene Model Checker to verify the gene model
6. For each additional isoform, repeat steps 2-5

Annotation workflow (graphically)
- Start with a feature on the contig
- BLASTP search of feature against the D. melanogaster proteins database
- D. melanogaster gene model (1 isoform, 5 CDS)
- BLASTX search of each D. melanogaster CDS against the contig
- [Figure: BLASTX alignments of the five CDS placed along the contig, each labeled with its reading frame]

Annotation workflow (graphically)
- BLASTX search of each D. melanogaster CDS against the contig
  - Reading frames of the alignments along the contig: 3, 1, 3, 2, 1
- Identify the exact coordinates of each CDS using the Genome Browser
  - [Figure: start codon (M), splice donors (GT), and splice acceptors (AG) marked on the Bem46 gene model]
- Use the Gene Model Checker to verify the final CDS coordinates
  - Coordinates: 1245-1383, 1437-1678, 1740-2081, 2159-2337, 2397-2511
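The coordinate string on this slide lends itself to a quick sanity check. The following is a minimal sketch, not part of the Gene Model Checker: the helper names are hypothetical, and it assumes 1-based inclusive coordinates on the plus strand. It parses the string, then reports the total CDS length and the intron lengths.

```python
def parse_cds_coordinates(text):
    """Parse 'start-end, start-end, ...' into a list of (start, end) tuples."""
    cds = []
    for part in text.split(","):
        start, end = part.strip().split("-")
        cds.append((int(start), int(end)))
    return cds


def total_cds_length(cds):
    """Sum of CDS lengths; coordinates are treated as 1-based and inclusive."""
    return sum(end - start + 1 for start, end in cds)


def intron_lengths(cds):
    """Length of each gap between consecutive CDS (plus-strand order assumed)."""
    return [nxt[0] - prev[1] - 1 for prev, nxt in zip(cds, cds[1:])]


coords = "1245-1383, 1437-1678, 1740-2081, 2159-2337, 2397-2511"
cds = parse_cds_coordinates(coords)
print(total_cds_length(cds))  # total coding length in bp
print(intron_lengths(cds))    # one entry per intron
```

For this model the total length is a multiple of three and every intron is longer than 40 bp, which is what a complete coding region that obeys the basic constraints should show.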

Four main web sites used by the GEP annotation strategy
- GEP UCSC Genome Browser (http://gander.wustl.edu)
- FlyBase (http://flybase.org)
  - Tools -> Genomic/Map Tools -> BLAST
- Gene Record Finder (http://gep.wustl.edu)
  - Projects -> Annotation Resources
- NCBI BLAST (http://blast.ncbi.nlm.nih.gov)
  - BLASTX -> select the checkbox:

UCSC Genome Browser
- Provides a graphical view of a genomic region
  - Sequence conservation
  - Gene and splice site predictions
  - RNA-Seq data and splice junction predictions
- BLAT – BLAST-Like Alignment Tool
  - Map protein or nucleotide sequence against an assembly
  - Faster but less sensitive than BLAST
- Table Browser
  - Access raw data used to create the graphical browser

Two different versions of the UCSC Genome Browser
- Official UCSC Version: http://genome.ucsc.edu
  - Published data, lots of species, whole genomes, used for “Chimp Chunks”
- GEP Version: http://gander.wustl.edu
  - GEP data, parts of genomes, used for annotation of Drosophila species

Additional training resources
- Training section on the UCSC web site: http://genome.ucsc.edu/training/index.html
  - Training videos, user guides and tutorials
  - Mailing lists
- OpenHelix tutorials and training materials: http://www.openhelix.com/ucsc
  - Pre-recorded tutorials
  - Reference cards

GEP UCSC Genome Browser overview
[Screenshot labels: Genomic sequence, Evidence tracks]

Control how evidence tracks are displayed on the Genome Browser
- Five different display modes:
  - Hide: track is hidden
  - Dense: all features appear on a single line
  - Squish: overlapping features appear on separate lines; features are half the height compared to full mode
  - Pack: overlapping features appear on separate lines; features are the same height as full mode
  - Full: each feature is displayed on its own line
- Set “Base Position” track to “Full” to see the amino acid translations
- Some evidence tracks (e.g., RepeatMasker) only have a subset of these display modes

FlyBase – Database for the Drosophila research community
- Lots of ancillary data for each gene in D. melanogaster
- Curation of literature for each gene
- Reference D. melanogaster annotations for all other databases
  - Including NCBI, EBI, and DDBJ
- Fast release cycle (6-8 releases per year)

Be aware of different annotation releases
- D. melanogaster Release 6 genome assembly
  - First change of the assembly since late 2006
  - Most modENCODE analysis used the Release 5 assembly
- Gene annotations change much more frequently
  - Use FlyBase as the canonical reference
- GEP data freeze:
  - Update GEP materials before the start of semester
  - Potential discrepancies in exercise screenshots
  - Minor differences in search results
  - Let us know about major errors or discrepancies
- Lifted Release 5 datasets required by the GEP to Release 6

Gene Record Finder – Observe the structure of D. melanogaster genes
- Retrieve CDS and exon sequences for each gene in D. melanogaster
- CDS and exon usage maps for each isoform
- List of unique CDS
- Optimal for the exon-by-exon annotation strategy

Nomenclature for Drosophila genes
- Drosophila gene names are case-sensitive
  - Lowercase initial letter = recessive mutant phenotype
  - Uppercase initial letter = dominant mutant phenotype
- Every D. melanogaster gene has an annotation symbol
  - Begins with the prefix CG (Computed Gene)
  - Some genes have a different gene symbol (e.g., mav)
  - Example: mav is the gene symbol, CG1901 is the annotation symbol
- Suffix after the gene symbol denotes different isoforms
  - mRNA = -R; protein = -P
  - mav-RA = Transcript for the A isoform of mav
  - mav-PA = Protein product for the A isoform of mav
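The isoform suffix convention above is regular enough to parse mechanically. The sketch below is illustrative only: the function name and the regular expression are assumptions made for this example and are not a FlyBase API. It splits a name such as mav-RA into gene symbol, product type, and isoform letter.

```python
import re

# Hypothetical pattern: symbol, hyphen, R (mRNA) or P (protein), isoform letters.
ISOFORM_PATTERN = re.compile(r"^(?P<gene>.+)-(?P<kind>[RP])(?P<isoform>[A-Z]+)$")
PRODUCT_TYPES = {"R": "mRNA", "P": "protein"}


def parse_isoform_name(name):
    """Split 'mav-RA' into ('mav', 'mRNA', 'A')."""
    match = ISOFORM_PATTERN.match(name)
    if match is None:
        raise ValueError(f"not an isoform name: {name!r}")
    return match["gene"], PRODUCT_TYPES[match["kind"]], match["isoform"]


print(parse_isoform_name("mav-RA"))
print(parse_isoform_name("mav-PA"))
```

Because gene names are case-sensitive, the parser deliberately leaves the case of the symbol untouched.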

NCBI – comprehensive database for biomedical and genomics information
- One of the most comprehensive genomics databases
  - Quality of GenBank records will vary
- PubMed for literature searches
- BLAST web service
  - BLAST search against RefSeq, nr/nt databases
  - Align two (or more) sequences (bl2seq)

Common BLAST programs
- Except for BLASTN, all alignments are based on comparisons of protein sequences
- Alignment coordinates are relative to the original sequences
- Decide which BLAST program to use based on the type of query and subject sequences:

Program   Query                   Database (Subject)
BLASTN    Nucleotide              Nucleotide
BLASTP    Protein                 Protein
BLASTX    Nucleotide -> Protein   Protein
TBLASTN   Protein                 Nucleotide -> Protein
TBLASTX   Nucleotide -> Protein   Nucleotide -> Protein
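The choice of program follows directly from the sequence types of the query and the database. As a rough sketch (the function name and the `translate` flag are hypothetical, introduced only for this example), the decision can be written as a lookup:

```python
def choose_blast_program(query, subject, translate=False):
    """Pick a BLAST program from the query and subject sequence types.

    query and subject are 'nucleotide' or 'protein'. The translate flag only
    matters when both are nucleotide: it selects a translated comparison
    (TBLASTX) over a direct nucleotide comparison (BLASTN).
    """
    key = (query, subject)
    if key == ("nucleotide", "nucleotide"):
        return "TBLASTX" if translate else "BLASTN"
    table = {
        ("protein", "protein"): "BLASTP",
        ("nucleotide", "protein"): "BLASTX",   # query is translated
        ("protein", "nucleotide"): "TBLASTN",  # database is translated
    }
    if key not in table:
        raise ValueError(f"unknown sequence types: {key}")
    return table[key]


# Mapping a D. melanogaster protein (CDS translation) onto a contig with
# bl2seq, where the contig is the nucleotide query:
print(choose_blast_program("nucleotide", "protein"))
```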

Where can I run BLAST?
- NCBI BLAST web service: http://blast.ncbi.nlm.nih.gov/Blast.cgi
- EBI BLAST web service: http://www.ebi.ac.uk/Tools/sss/
- FlyBase BLAST (Drosophila and other insects): http://flybase.org/blast/

Data on the Genome Browser is incomplete
- The Genome Browser contains many different evidence tracks:
  - Sequence similarity to D. melanogaster proteins
  - RNA-Seq from different developmental stages
  - Gene and splice site predictions
  - Sequence alignments of multiple Drosophila species
- Annotators still need to identify the ortholog
- WARNING! Students often over-interpret the BLASTX alignment track (D. mel Proteins); use with caution

Identify ortholog based on BLASTP search of the feature against D. melanogaster proteins
- Large increase in E-value from mav-PA to gbb-PB
- Most genes (~90%) remain on the same Muller element across the different Drosophila species
  - ~95% of all genes, ~90% of F element genes remain on the same Muller element

Basic biological constraints (inviolate rules*)
- Coding regions start with a methionine
- Coding regions end with a stop codon
- Gene should be on only one strand of DNA
- Exons appear in order along the DNA (collinear)
- Intron sequences should be at least 40 bp
- Intron starts with a GT (or rarely GC)
  - ~1% of the genes in D. melanogaster have a GC donor site
- Intron ends with an AG
- Only break these rules if they are found in D. melanogaster or supported by experimental evidence
* There are known exceptions to each rule
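Most of these rules can be checked mechanically once CDS coordinates are in hand. The following is a minimal sketch and not the GEP Gene Model Checker: the function name is hypothetical, it assumes a plus-strand gene with 1-based inclusive coordinates, and the toy contigs are invented for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
MIN_INTRON = 40  # minimum intron length in bp, from the rule list


def check_gene_model(contig, cds_list):
    """Return a list of rule violations for a plus-strand gene model."""
    problems = []
    coding = "".join(contig[start - 1:end] for start, end in cds_list)
    if not coding.startswith("ATG"):
        problems.append("coding region does not start with ATG (methionine)")
    if coding[-3:] not in STOP_CODONS:
        problems.append("coding region does not end with a stop codon")
    for (s1, e1), (s2, e2) in zip(cds_list, cds_list[1:]):
        if s2 <= e1:
            problems.append(f"CDS {s2}-{e2} is not collinear with CDS {s1}-{e1}")
            continue
        intron = contig[e1:s2 - 1]  # bases strictly between the two CDS
        if len(intron) < MIN_INTRON:
            problems.append(f"intron {e1 + 1}-{s2 - 1} is shorter than {MIN_INTRON} bp")
        if intron[:2] not in ("GT", "GC"):
            problems.append(f"intron {e1 + 1}-{s2 - 1} does not start with GT or GC")
        if intron[-2:] != "AG":
            problems.append(f"intron {e1 + 1}-{s2 - 1} does not end with AG")
    return problems


# Toy contigs (invented): two CDS separated by a 44 bp or a 14 bp intron.
good_contig = "CC" + "ATGGCC" + "GT" + "A" * 40 + "AG" + "GGATAA" + "TT"
short_contig = "CC" + "ATGGCC" + "GT" + "A" * 10 + "AG" + "GGATAA" + "TT"
print(check_gene_model(good_contig, [(3, 8), (53, 58)]))
print(check_gene_model(short_contig, [(3, 8), (23, 28)]))
```

A reported violation is a prompt to look for supporting evidence rather than proof of an error, since each rule has known exceptions.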

Evidence for gene models (in general order of importance)
1. Expression data
   - RNA-Seq, EST, cDNA
2. Conservation
   - Sequence similarity to genes in D. melanogaster
   - Sequence similarity to other Drosophila species (Multiz)
3. Computational predictions
   - Gene and splice site predictions
4. Tie-breakers of last resort
   - See the “Annotation Instruction Sheet”

modENCODE RNA-Seq data
- RNA-Seq evidence tracks:
  - RNA-Seq coverage (read depth)
  - TopHat splice junction predictions
  - Assembled transcripts (Cufflinks, Oases)
- Positive results very helpful
- Negative results less informative
  - Lack of transcription ≠ no gene
- GEP curriculum:
  - RNA-Seq Primer
  - Browser-Based Annotation and RNA-Seq Data
- modENCODE RNA-Seq data: mixed embryos, adult males, adult females

Generating RNA-Seq data (Illumina)
- Processed mRNA (5’ cap, poly-A tail)
- RNA fragments (~250 bp)
- Library with adapters
- Paired-end sequencing
- ~125 bp RNA-Seq reads (forward and reverse)
Wang Z et al. (2009) RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet. 10(1):57-63.

Use the TopHat splice junction predictions to identify splice sites
[Figure: RNA-Seq reads from a processed mRNA (5’ cap, M, *, poly-A tail) aligned to the contig; reads that span an intron define the TopHat junctions]

Use BLASTX to map D. melanogaster CDS onto the contig
- Use sequence similarity to infer homology
  - D. melanogaster is very well annotated
  - Coding sequences evolve slowly
  - Exon structure changes very slowly
- Change two settings in the “Algorithm parameters” section when using bl2seq
  - Turn off compositional adjustments
  - Turn off the low complexity filter

Strategies for finding small CDS
- Examine RNA-Seq coverage and TopHat junctions
  - Small CDS is typically part of a larger transcribed exon
- Use Query subrange to restrict the search region
- Increase the Expect threshold and try again
  - Keep increasing the Expect threshold until you get matches
  - Also try decreasing the word size
- Use the Small Exon Finder
  - Minimize changes in CDS size
  - Available under Projects -> Annotation Resources
  - See the “Annotation Strategy Guide” for details

A genomic sequence has 6 different reading frames
- Frame: Base to begin translation relative to the first base of the sequence
- [Figure: frames 1, 2, and 3 marked on a sequence]
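To make the six frames concrete, the sketch below translates a short sequence in all of them using the standard genetic code. It is a minimal illustration: the function names are hypothetical, and numbering the negative frames from the first base of the reverse complement is an assumption of this example that may differ from the convention a particular browser or BLAST report uses.

```python
from itertools import product

# Standard genetic code, built with bases in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(seq):
    return seq.translate(str.maketrans("ACGTN", "TGCAN"))[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons (e.g. with N) become X."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translation(seq):
    """Return translations keyed by frame label: +1, +2, +3, -1, -2, -3."""
    seq = seq.upper()
    rc = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(seq[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


frames = six_frame_translation("ATGGCCTAA")
for label in ("+1", "+2", "+3", "-1", "-2", "-3"):
    print(label, frames[label])
```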

Splice donor and acceptor phases
- Phase: Number of bases between the complete codon and the splice site
  - Donor phase: Number of bases between the end of the last complete codon and the splice donor site (GT/GC)
  - Acceptor phase: Number of bases between the splice acceptor site (AG) and the start of the first complete codon
- Phase is relative to the reading frame of the CDS

Phase depends on the reading frame
- Phase of the splice acceptor site in the example:
  - Phase 2 relative to frame +1
  - Phase 0 relative to frame +2
  - Phase 1 relative to frame +3
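The relationship between frame and phase is simple modular arithmetic. The sketch below assumes a plus-strand CDS, 1-based coordinates, and frames numbered 1 to 3 by the position of the first base of a codon; the function names and the example position 5 are assumptions chosen so the output matches the pattern on this slide.

```python
def acceptor_phase(cds_start, frame):
    """Bases between the acceptor site and the first complete codon.

    cds_start is the 1-based position of the first base after the AG.
    """
    return (frame - cds_start) % 3


def donor_phase(cds_end, frame):
    """Bases between the last complete codon and the donor site.

    cds_end is the 1-based position of the last base before the GT.
    """
    return (cds_end - frame + 1) % 3


# A CDS starting at position 5 (hypothetical) has a different phase in each frame.
for frame in (1, 2, 3):
    print(f"frame +{frame}: acceptor phase {acceptor_phase(5, frame)}")
```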

Phase of the donor and acceptor sites must be compatible
- Extra nucleotides from donor and acceptor phases will form an additional codon
- Donor phase + acceptor phase = 0 or 3
- Example (phase 1 donor, phase 2 acceptor):
  - Genomic: CCA AAT G | GT … … … AG | TT CTC GAT
  - Spliced: CCA AAT GTT CTC GAT
  - Translation: P N V L D
- By definition, BLASTX cannot detect the additional codon that results from donor and acceptor phases
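The compatibility rule is small enough to express directly. This is a sketch with hypothetical function names, where phases are integers 0, 1, or 2:

```python
def phases_compatible(donor_phase, acceptor_phase):
    """True when the leftover bases on both sides of the intron form whole codons."""
    return (donor_phase + acceptor_phase) in (0, 3)


def required_acceptor_phase(donor_phase):
    """Acceptor phase needed to keep the reading frame after a given donor phase."""
    return (3 - donor_phase) % 3


print(phases_compatible(1, 2))     # the example on this slide
print(required_acceptor_phase(1))
```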

Incompatible donor and acceptor phases result in a frame shift
- Genomic: CCA AAT G GT GT … … AG TT CTC GAT
- Spliced at the second GT: CCA AAT GGT TTC TCG AT
- Translation: P N G F S
- Phase 0 donor is incompatible with phase 2 acceptor; use prior GT, which is a phase 1 donor.
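The effect of picking the wrong donor can be reproduced by joining the exon sequences and translating the result. The sketch below uses the exon fragments from this slide with the standard genetic code; the variable and function names are hypothetical.

```python
from itertools import product

# Standard genetic code, built with bases in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(seq):
    """Translate complete codons only, starting at the first base."""
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq) - 2, 3))


downstream_exon = "TTCTCGAT"  # begins right after the AG (phase 2 acceptor)

# Upstream exon ends just before whichever GT is chosen as the donor.
donor_choices = {
    "first GT (phase 1 donor)": "CCAAATG",
    "second GT (phase 0 donor)": "CCAAATGGT",
}

for label, upstream_exon in donor_choices.items():
    spliced = upstream_exon + downstream_exon
    print(label, spliced, translate(spliced))
```

With the first GT the downstream exon is read as CTC GAT (L D); with the second GT the same bases are read as TTC TCG (F S), which is the frame shift.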

Verify the final gene model using the Gene Model Checker
- Examine the checklist and explain any errors or warnings in the GEP Annotation Report
- View your gene model in the context of the other evidence tracks on the Genome Browser
- Examine the dot plot and explain any discrepancies in the GEP Annotation Report
  - Look for large vertical and horizontal gaps
- See the “How to do a quick check of student annotations” document on the GEP web site

Questions?
Introduction to Macs, Annotation of a Drosophila gene walkthrough
http://www.flickr.com/photos/jac_opo/240254763/sizes/l/