1 of 28 Evaluating Genes and Transcripts (“Genebuild”)

2 of 28 Outline
- Ensembl gene set
- Ensembl EST genes
- Ab initio predictions
- Manual curation (Vega)
- Ensembl / Havana merged gene set
- CCDS project

3 of 28 Biological Evidence
All Ensembl gene predictions are based on experimental evidence:
- UniProt/Swiss-Prot: a manually curated database and therefore of highest accuracy
- NCBI RefSeq: a partially manually curated database
- UniProt/TrEMBL: automatically annotated translations of EMBL coding sequence (CDS) features
- EMBL / GenBank / DDBJ: primary nucleotide sequence repository

4 of 28 The Ensembl Genebuild
Genome assembly + computer programs + experimental evidence -> Ensembl genes

5 of 28 The Ensembl Genebuild
A new release of Ensembl doesn’t contain a new genebuild for each species!
A new genebuild is only done if there is:
- a new genome assembly, or
- a lot of new supporting evidence

6 of 28 Genome Assemblies
Genome assemblies are not created by Ensembl, but provided by other institutes / consortia, e.g.:
- NCBI: human, mouse
- Rat Genome Sequencing Consortium: rat
- Sanger: zebrafish
- Broad Institute: mammals
- Baylor College: cow
- Washington University: chicken
- etc.

7 of 28 The Ensembl Genebuild
- Targeted build: align species-specific proteins to the genome to create transcripts
- Similarity build: align proteins from closely related species to locate additional transcripts
- Add UTRs using mRNA evidence
- Eliminate redundant transcripts and create genes
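
Note: the last step on this slide (eliminate redundant transcripts and create genes) can be illustrated with a minimal Python sketch. This is not Ensembl's pipeline code. The exon-coordinate representation, the function names, and the rule that transcripts sharing any exon overlap belong to one gene are assumptions made for illustration, and strand is ignored.

```python
from typing import List, Tuple

Exon = Tuple[int, int]          # (start, end), inclusive coordinates
Transcript = Tuple[Exon, ...]   # exons of one transcript, sorted by start


def drop_redundant(transcripts: List[Transcript]) -> List[Transcript]:
    """Keep one copy of each distinct exon structure, preserving input order."""
    seen = set()
    unique = []
    for t in transcripts:
        if t not in seen:
            seen.add(t)
            unique.append(t)
    return unique


def exons_overlap(a: Transcript, b: Transcript) -> bool:
    """True if any exon of a overlaps any exon of b."""
    return any(s1 <= e2 and s2 <= e1 for s1, e1 in a for s2, e2 in b)


def cluster_into_genes(transcripts: List[Transcript]) -> List[List[Transcript]]:
    """Group non-redundant transcripts into genes by (transitive) exon overlap."""
    genes: List[List[Transcript]] = []
    for t in drop_redundant(transcripts):
        # Existing genes that this transcript touches.
        hits = [g for g in genes if any(exons_overlap(t, u) for u in g)]
        # Remove those genes and replace them with one merged gene.
        genes = [g for g in genes if not any(g is h for h in hits)]
        merged = [u for g in hits for u in g] + [t]
        genes.append(merged)
    return genes
```

With two overlapping transcripts, one exact duplicate, and one distant single-exon transcript, this yields two genes: the duplicate is dropped and the overlapping pair is grouped together.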

8 of 28 “Special” cases
- Pseudogenes
- Non-coding RNA genes
  - sequences from RFAM and miRBase dbs and covariance models
  - hand-checked set
- Ig Segment Genes (Immunoglobulin and T-cell receptor segments)
  - sequences from IMGT db and Exonerate

9 of 28 Classification of Transcripts
- Ensembl transcripts and proteins are mapped to UniProt/Swiss-Prot, NCBI RefSeq and UniProt/TrEMBL entries
- Genes that map to species-specific protein/mRNA records are classified as known
- Genes that do not map to species-specific protein/mRNA records are classified as novel
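
Note: the known / novel rule on this slide amounts to a single test. The sketch below is illustrative only. The function name and the (database, species) pair representation of a mapped record are assumptions, not part of Ensembl.

```python
from typing import Iterable, Tuple


def classify_gene(mapped_records: Iterable[Tuple[str, str]], species: str) -> str:
    """Return "known" if any mapped protein/mRNA record is from `species`.

    mapped_records holds (database, record_species) pairs, a hypothetical
    representation chosen for this example.
    """
    if any(record_species == species for _, record_species in mapped_records):
        return "known"
    return "novel"
```

A human gene supported only by a mouse record would therefore be classified as novel.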

10 of 28 Names and Descriptions
- Transcript names are inferred from mapped transcripts and proteins
  - Swiss-Prot > RefSeq > TrEMBL ID
  - Novel transcripts have only Ensembl identifiers
- Genes are assigned the official gene symbol if available
  - HGNC (HUGO) symbol for human genes
  - Species-specific nomenclature committees (MGI, ZFIN etc.)
  - Otherwise Swiss-Prot > RefSeq > TrEMBL ID
- Gene description is inferred from mapped database entries; the source is always given
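
Note: the naming priority on this slide (official symbol first, then Swiss-Prot > RefSeq > TrEMBL, then the Ensembl identifier alone) can be sketched as a simple lookup. The function name, the dictionary layout and all identifiers used with it are hypothetical.

```python
from typing import Dict, Optional

# Priority order stated on the slide, highest first.
PRIORITY = ("Swiss-Prot", "RefSeq", "TrEMBL")


def pick_name(mapped: Dict[str, str],
              ensembl_id: str,
              official_symbol: Optional[str] = None) -> str:
    """Choose a display name for a gene.

    mapped maps a source database name to the ID mapped from that source
    (a hypothetical layout for this example).
    """
    if official_symbol:
        return official_symbol
    for source in PRIORITY:
        if source in mapped:
            return mapped[source]
    # Novel genes/transcripts carry only their Ensembl identifier.
    return ensembl_id
```

For transcripts, the same function applies with `official_symbol` left unset.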

11 of 28 Supporting evidence: ExonView
[Figure: ExonView screenshot; labels: mRNA, peptide, UTR, coding/UTR]

12 of 28 Supporting evidence ContigView

13 of 28 Configuring the Genebuild
- Genebuild configured for each species
- Data availability
  - Targeted build most useful in human, mouse
  - Similarity build most useful in C. intestinalis, mosquito
- Structural issues
  - Zebrafish
    - Many duplications
    - Genome from different haplotypes
  - Mosquito
    - Many single-exon genes
    - Genes within genes

14 of 28 Low Coverage Genomes
- Low coverage genomes (~2x) come in lots of scaffolds: a “classic” genebuild will result in many partial and fragmented genes
- Whole Genome Alignment (WGA) to an annotated reference genome: this method reduces fragmentation by piecing together scaffolds into “gene-scaffolds” that contain complete gene(s)
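
Note: the idea of piecing scaffolds into a “gene-scaffold” can be illustrated as follows. This is a toy sketch, not the Ensembl method. It assumes each scaffold piece already has a reference start position from the whole genome alignment, ignores orientation, and pads joins with a fixed run of N whose length is an arbitrary choice.

```python
from typing import List, Tuple

# (scaffold name, start on the reference, scaffold sequence); hypothetical layout
Piece = Tuple[str, int, str]


def build_gene_scaffold(pieces: List[Piece], gap: int = 6) -> str:
    """Order pieces by reference position and join them with N padding."""
    ordered = sorted(pieces, key=lambda p: p[1])
    return ("N" * gap).join(seq for _, _, seq in ordered)
```

Two fragments of one gene that sit on different scaffolds end up in a single sequence, in reference order, separated by the N run.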

15 of 28 Low Coverage Genomes
[Figure: scaffolds joined by NNNNNN into a “gene-scaffold”, shown against the reference assembly]

16 of 28 EST Gene Set
- ESTs (Expressed Sequence Tags) are single reads, with a high chance of sequencing mistakes
- EST libraries are regularly contaminated with genomic DNA
- Generally ~400 bp, so unlikely to cover a whole gene
Therefore, EST gene predictions are less reliable and are kept separate from the core Ensembl gene set.

17 of 28 EST Gene Set
[Figure: ContigView screenshot; tracks: ESTs, EST genes]

18 of 28 Ab initio Predictions
- Predict translatable transcript structures solely on the basis of genome sequence
- No validation with biological expression information
- GENSCAN for vertebrate genomes
- SNAP better for invertebrates
NB: Both programs over-predict transcript structures.

19 of 28 Ab initio Predictions
[Figure: ContigView screenshot; track: GENSCAN prediction]

20 of 28 Automatic vs Manual Annotation
Automatic annotation:
- Quick
- Use unfinished sequence or shotgun assembly
- Consistent annotation
Manual annotation:
- Slow
- Need finished sequence
- Flexible, can deal with inconsistencies
- Most rules have exceptions
- Consult publications as well as databases

21 of 28 Annotation that Causes Problems for Ensembl
- Multiple variants
- UTRs
- Pseudogenes
- Non-coding genes (ncRNAs)
- Overlapping genes, anti-sense genes
- Gene duplication events

22 of 28 Manually Curated Gene Sets
- FlyBase: fruitfly
- WormBase: C. elegans
- SGD: yeast
- Vega: human, zebrafish, mouse, dog

23 of 28 Vega Genome Browser

24 of 28 Vega Transcripts
Vega transcripts:
- Vega Havana: transcripts annotated by the Havana team at Sanger
- Vega External: transcripts annotated by other Vega teams

25 of 28 Ensembl / Havana Merge
Full-length protein-coding transcripts annotated by the Sanger Havana team (part of Vega) are merged with the human Ensembl transcript set.
Transcripts:
- Ensembl/Havana: gold
- Ensembl: red / black
- Havana: blue
Genes:
- Ensembl/Havana: gold
- Ensembl: red / black
- Havana: blue
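
Note: the labelling behind the colours on this slide can be sketched as a set comparison. The slide does not state how two transcripts are judged to be the same. The sketch below assumes, purely for illustration, that an Ensembl and a Havana transcript merge when their exon coordinates are identical. The function name and data layout are hypothetical.

```python
from typing import Dict, Iterable, Tuple

Exon = Tuple[int, int]
Structure = Tuple[Exon, ...]   # exon coordinates of one transcript

# Display colours as listed on the slide.
COLOURS = {"Ensembl/Havana": "gold", "Ensembl": "red / black", "Havana": "blue"}


def merge_sets(ensembl: Iterable[Structure],
               havana: Iterable[Structure]) -> Dict[Structure, str]:
    """Label each distinct transcript structure by the set(s) it comes from."""
    ens, hav = set(ensembl), set(havana)
    labels: Dict[Structure, str] = {}
    for structure in ens | hav:
        if structure in ens and structure in hav:
            labels[structure] = "Ensembl/Havana"   # assumed merge criterion
        elif structure in ens:
            labels[structure] = "Ensembl"
        else:
            labels[structure] = "Havana"
    return labels
```

Looking up a label in `COLOURS` then gives the colour the browser would use for that transcript.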

26 of 28 Ensembl / Havana Merge
[Figure: browser screenshot; labels: merged Ensembl / Havana gene, merged Ensembl / Havana transcript]

27 of 28 CCDS (Consensus Coding Sequences)
- Collaboration between NCBI, UCSC, Ensembl and Havana to produce a set of stable, reliable, complete (ATG->stop) CDS structures for human and mouse
- Long-term aim is to get to a single gene set for human and mouse
- The genebuild pipeline has been modified to retain these ‘blessed’ CDSs (stored in a database for incorporation in the build)
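
Note: “complete (ATG->stop)” can be made concrete with a small check on a spliced coding sequence. This is an illustrative sketch, not the CCDS project's own validation. It assumes the standard genetic code (stop codons TAA, TAG, TGA) and treats any in-frame internal stop as a failure.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}   # standard genetic code


def is_complete_cds(seq: str) -> bool:
    """True if seq starts with ATG, ends with a stop codon, is a whole
    number of codons, and has no in-frame stop before the last codon."""
    seq = seq.upper()
    if len(seq) < 6 or len(seq) % 3 != 0:
        return False
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    if codons[0] != "ATG" or codons[-1] not in STOP_CODONS:
        return False
    return not any(codon in STOP_CODONS for codon in codons[1:-1])
```

A sequence such as ATGGCCTAA passes, while one lacking the start codon, truncated mid-codon, or containing an internal stop does not.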

28 of 28 Q & A