Annotation of Drosophila virilis Chris Shaffer GEP workshop, 2006.

Annotation of D. virilis Outline of general technique and then one practical example This technique may not be the best with other projects (e.g. corn, bacteria) The technique is optimized for projects with: – A moderately close, well-annotated neighbor species – No EST, mRNA or expression data available

Helpful Hints Evolutionary distance between D. virilis and D. melanogaster is much larger than chimp to human – Conservation will be at the protein domain level – Synteny is detectable in some fosmids – Most genes stay on the same chromosome (3 exceptions seen in ~40 genes)

D. virilis Average gene size will be smaller than in mammals Very low density of pseudogenes Almost all genes in virilis will have the same basic structure as their melanogaster orthologs; mapping exon by exon works well for most genes

How to proceed
First, identify features of interest:
1. Genscan results
– Watch out for ends: fused or split genes
2. Regions of high similarity with D. melanogaster protein, identified by BLAST
– Overlapping genes usually on opposite strand
– Be vigilant for partial genes at fosmid ends
3. Regions with high similarity to known genes (i.e. BLAST to nr) not covered above

Basic Procedure
For each feature of interest:
1. Identify the likely ortholog in D. m.
2. Use D. m. database to find gene model of ortholog and identify all exons
3. Use BLASTX to identify locations and frames of each exon, one by one
4. Based on locations, frames, and gene predictions, find donor and acceptor splice sites that link frames together; identify the exact base location (start and stop) of each coding exon
5. Double-check your results by translation

Basic procedure (graphically) BLASTX of the predicted gene against melanogaster proteins suggests that this region of the fosmid is orthologous to a Dm gene with 5 exons. BLASTX of each exon then locates its region of similarity. [Figure: fosmid, feature, and per-exon regions of similarity]

Basic procedure (graphically) Zoom in on the ends of the exons and find the first Met, the matching intron donor (GT) and acceptor (AG) sites, and the final stop codon. Once these have been identified, write down the exact location of the first base and last base of each exon. Use these numbers to check your gene model.
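The search for donor and acceptor sites near an exon end can be sketched in code. This is a minimal Python illustration and not part of the workshop materials: the function name `find_sites`, the toy sequence, and the window coordinates are all hypothetical. It assumes a plus-strand gene and 1-based coordinates as displayed in the browser.

```python
def find_sites(seq, motif, start, end):
    """List 1-based positions where a 2-base motif begins inside seq[start..end].

    start and end are 1-based and inclusive; the whole motif must fit in the window.
    """
    seq = seq.upper()
    sites = []
    for pos in range(start, end):  # pos is the 1-based first base of the motif
        if seq[pos - 1:pos + 1] == motif:
            sites.append(pos)
    return sites


# Toy sequence (hypothetical), positions 1..10: A A G T C C A G T T
toy = "AAGTCCAGTT"
donors = find_sites(toy, "GT", 1, 10)     # candidate intron starts
acceptors = find_sites(toy, "AG", 1, 10)  # candidate intron ends
print(donors)     # [3, 8]
print(acceptors)  # [2, 7]
```

Every GT or AG is only a candidate. The BLASTX location and frame, together with the gene finder results, decide which candidate is used.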

Example Annotation Open Safari and go to goose.wustl.edu Click on Genome Browser

Example Annotation Settings are: Insect; D. virilis; Mar. 2005; chr10 (chr10 is a fosmid from 2005) Click submit

Example Annotation Seven predicted Genscan genes Each one would be investigated

Investigate 10.4 All putative genes will need to be analyzed; we will focus on 10.4 in this example To zoom in on this gene enter: chr10: in position box Then click jump button

Step 1: Find Ortholog If this is a real gene it will probably have at least some homology to a D. melanogaster protein Step one: do a BLAST search with the predicted protein sequence of 10.4 to all proteins in D. melanogaster

Step 1: Find Ortholog Click on one of the exons in gene 10.4 On the Genscan report page click on Predicted Protein Select and copy the sequence Do a blastp search of the predicted sequence to the D. melanogaster “Annotated Proteins” database at

Step 1: Find Ortholog The results show a significant hit to the “A” and “B” isoforms of the gene “mav”

Step 1: Results of Ortholog search The alignment looks right for virilis vs. melanogaster: regions of high similarity interspersed with regions of little or no similarity We have a probable ortholog: maverick

Step 2: Gene model What does mav look like? Go to Ensembl to get exons and map them to regions: – Web browser: go to

Step 2: Gene model Click on Drosophila Search for mav (top right search box) Click on “Ensembl Gene: CG1901” Scroll down to map and notice two isoforms

We now have a gene model (two-exon gene, two isoforms). We will annotate isoform A since it is the largest. Due to time constraints, our policy so far is to have students pick and annotate only one isoform for each feature. If more than one isoform exists, pick the largest or the one with the most exons. Here the student should choose to annotate isoform A (largest). All isoforms should be annotated eventually.

Step 3: Investigate Exons Since we are annotating isoform A, we need the peptide sequence of exons 1 and 2 to use in a BLASTX search Click on [Peptide info] for isoform A, on the right just above the map Scroll down to find the peptide sequence with exons in different colors: YNASSNKYSLINVSQSKNFPQLFNKKLSVQWINTVPIQSRQTR ETRDIGLETKRHSKPSKRVDETRLKHLVLKGLGIKKLPDMRKVNISQ AEYSSKYIEYLSRLRSNQEKGNSYFNNFMGASFTRDLHFLSITTNGF NDISNKRLRHRRSLKKINRLNQNPKKHQNYGDLLRGEQDTMNILLH FPLTNAQDANFHHDK

Step 3: Investigate Exons Start with exon 1 We will use a variant of the BLAST program, called bl2seq (BLAST 2 Sequences). This version compares two sequences instead of comparing a sequence to a database It is best to search the entire fosmid DNA sequence (easier to keep track of positions) with the amino acid sequence of exon 1

Step 3: Investigate Exons Create 3 tabs in Safari In the first tab, go to the goose browser chr10 of virilis; click the DNA button, then click “get DNA” In the second tab, go to and get the peptide sequence for the melanogaster mav gene These first two tabs now have the two sequences you are going to compare In the third tab go to NCBI blast page and click on “Align two sequences (bl2seq)”

Step 3: Investigate Exons Copy and paste the genomic sequence from tab 1 into sequence box 1 of tab 3 Copy and paste the peptide sequence of exon 1 from tab 2 into sequence box 2 Since we are comparing a DNA sequence to a protein we need to run BLASTX Turn off the filter Leave other values at default for now Click “align” button to run the comparison

Step 3: Investigate Exons No significant homology found Either the mav ortholog is not in this fosmid (unlikely given the original blastp hit) or this exon is not well conserved Let's look for similarities of lower quality Click the back button to go back to the bl2seq page Change the expect value to 1000 and click align

Step 3: Investigate Exons We have a weak alignment (50 identities and 94 similarities), but we have seen worse when comparing single exons from these two species Notice the location of the hit (bases to 17504) and frame +3
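The frame reported for a plus-strand hit follows directly from its start coordinate, which is useful when checking that two exons can be joined. The snippet below is a small Python sketch under stated assumptions: the function name `plus_strand_frame` and the example coordinates are hypothetical (they are not the coordinates of the mav hit), and only the plus strand is handled.

```python
def plus_strand_frame(codon_start):
    """Return the frame (1, 2 or 3) of a plus-strand codon.

    codon_start is the 1-based position of the first base of any codon in
    the hit. Frame +1 starts at base 1, +2 at base 2 and +3 at base 3.
    """
    if codon_start < 1:
        raise ValueError("coordinates are 1-based")
    return (codon_start - 1) % 3 + 1


# Hypothetical coordinates, for illustration only
print(plus_strand_frame(1))    # 1
print(plus_strand_frame(3))    # 3
print(plus_strand_frame(100))  # 1
```

Two hits on the same strand with different frames, such as +3 and +2, are expected when the intron between them has a length that is not a multiple of three.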

Step 3: Investigate Exons A similar search with exon 2 sequences gives a location of chr10: and frame +2 For larger genes, continue with each exon, searching with bl2seq (adjusting the E-value cutoff if necessary) and noting the location and frame of each region of similarity

Step 4: Create Gene Model Pick ATG (met) at start of gene, first met in frame with coding region of similarity (+3) For each putative intron/exon boundary compare location of BLASTX result with gene finder results to locate exact first and last base of the exon and check that the intron starts with “GT” and ends with “AG” Exons: ; Intron GT and AG present
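Both checks in this step can be expressed as short functions. The following Python sketch is an illustration only, with hypothetical function names and a toy sequence. It assumes a plus-strand gene, 1-based inclusive exon coordinates, and it reads "first met" as the most upstream in-frame ATG that is not separated from the region of similarity by a stop codon, which is one possible interpretation.

```python
STOPS = {"TAA", "TAG", "TGA"}


def first_in_frame_met(seq, codon_start):
    """Walk upstream codon by codon from a codon inside the similarity region.

    Returns the 1-based position of the most upstream in-frame ATG reached
    before an in-frame stop codon, or None if no ATG is found.
    """
    seq = seq.upper()
    best = None
    pos = codon_start
    while pos >= 1:
        codon = seq[pos - 1:pos + 2]
        if codon in STOPS:
            break
        if codon == "ATG":
            best = pos
        pos -= 3
    return best


def check_introns(seq, exons, allow_gc_donor=False):
    """For each intron, report (donor, acceptor, ok).

    exons is a list of (start, end) pairs sorted by position. An intron is
    ok when it starts with GT (or GC, if allowed) and ends with AG.
    """
    seq = seq.upper()
    report = []
    for (_, end1), (start2, _) in zip(exons, exons[1:]):
        donor = seq[end1:end1 + 2]             # first two intron bases
        acceptor = seq[start2 - 3:start2 - 1]  # last two intron bases
        donor_ok = donor == "GT" or (allow_gc_donor and donor == "GC")
        report.append((donor, acceptor, donor_ok and acceptor == "AG"))
    return report


# Toy example (hypothetical): exon 1 = 1..6, intron = 7..14, exon 2 = 15..20
toy = "ATGAAA" + "GTCCCCAG" + "TTTTAA"
print(first_in_frame_met(toy, 4))              # 1
print(check_introns(toy, [(1, 6), (15, 20)]))  # [('GT', 'AG', True)]
```

The `allow_gc_donor` switch is there for rare cases where the only sensible model uses a GC donor; it is off by default so that such cases are flagged for a closer look.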

Step 4: Confirm Gene Model
As a final check we need to create the putative mRNA, translate it and make sure the protein we get out is similar to what is expected:
1. Enter coordinates for each exon in browser
2. Click “DNA” button at top then “get DNA”
3. Copy the sequence into a text file
4. Repeat for each exon, adding DNA to file
5. Go to
6. Enter your entire sequence, hit “Translate Sequence”; should get one long protein
7. Compare the protein sequence to ortholog using bl2seq
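The same confirmation can be done offline once the fosmid sequence and exon coordinates are in hand. The Python sketch below is an illustration, not the workshop's tool: the function names and the toy sequence are hypothetical, and it assumes a plus-strand gene, 1-based inclusive coordinates, and the standard genetic code.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def splice(seq, exons):
    """Join the exons, given as (start, end) pairs, into a putative coding sequence."""
    return "".join(seq[start - 1:end] for start, end in exons).upper()


def translate(cds):
    """Translate in frame 1; unknown codons become X, stops become *."""
    usable = len(cds) - len(cds) % 3
    return "".join(CODON_TABLE.get(cds[i:i + 3], "X") for i in range(0, usable, 3))


def is_one_long_protein(protein):
    """True if the protein starts with M and its only stop is the last residue."""
    return (protein.startswith("M") and protein.endswith("*")
            and "*" not in protein[:-1])


# Toy example (hypothetical): exon 1 = 1..6, intron = 7..14, exon 2 = 15..20
toy = "ATGAAA" + "GTCCCCAG" + "TTTTAA"
protein = translate(splice(toy, [(1, 6), (15, 20)]))
print(protein)                       # MKF*
print(is_one_long_protein(protein))  # True
```

An internal `*` in the translation usually means that an exon boundary is off by one or two bases or that the wrong frame was joined. The final comparison of the protein to the ortholog still needs bl2seq.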

Step 4: confirm model (Future) We have a web page under construction which will simplify confirmation This web site will double-check intron-exon boundaries, translate the putative message and create a data file suitable for uploading

Considerations Some exons are very hard to find (small or non-conserved); keep increasing the E value to find any hits (10,000,000 is not unheard of) Donor “GC” seen on rare occasions We have seen one example (out of about 70 genes) where the only reasonable interpretation was that an intron had moved Without EST and expression data you may get stuck; use your best judgment

Gene Function In addition to annotating the genes, we ask the students to look into the function of each gene and discuss what they found in their final paper on annotation For genes in Drosophila, the best place to begin investigating gene function is the Drosophila online database called FlyBase.

Flybase flybase.bio.indiana.edu

FlyBase gene info Search for gene name Will find links to info pages with many helpful references Remember that many genes have functions assigned based only on similarity data This is especially true for anonymous genes (“CG#####”). Take any functional assignment with a large amount of skepticism; consider it a guess at best

Gene function for Mav