
NCBI’s Genome Annotation: Overview
Incremental processing
Re-annotation (batch)
Post-annotation review
Case studies
NOTE: limiting discussion to annotation of genes and pseudogenes
Donna Maglott, Ph.D. for the RefSeq and Annotation groups

NCBI: Incremental processing
Maintain gene/sequence relationship
–Data sources
    MGI’s ftp site (names, MGI ids, sequence accessions)
    Sequences and annotation from INSD (DDBJ, EMBL, GenBank)
    UniProt (names, sequence)
    CCDS Collaboration (CDS definition; Kim will expand on this later.)
    Gene family-specific databases
    HomoloGene
    UniGene
    Scientific community
–Actions
    Create or update data via Entrez Gene
    Create or update RefSeq sequences
    Identify conflicts and discuss with stakeholders
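The "identify conflicts" action can be pictured as a comparison of two sources keyed by a shared identifier. The Python sketch below is only an illustration, not NCBI's actual process: the function name `find_conflicts`, the record layout, and the placeholder identifiers in the usage note are all assumptions made for this example.

```python
def find_conflicts(gene_records, mgi_records):
    """Compare two hypothetical {mgi_id: {"symbol": str, "accessions": set}} mappings.

    Returns a list of (mgi_id, field, detail) tuples for records present in
    both sources that disagree on the symbol, or where the second source
    lists sequence accessions the first one lacks.
    """
    conflicts = []
    for mgi_id in sorted(set(gene_records) & set(mgi_records)):
        ours = gene_records[mgi_id]
        theirs = mgi_records[mgi_id]
        if ours["symbol"] != theirs["symbol"]:
            # Nomenclature disagreement: report both names.
            conflicts.append((mgi_id, "symbol", (ours["symbol"], theirs["symbol"])))
        missing = theirs["accessions"] - ours["accessions"]
        if missing:
            # Sequence accessions known to one source but not the other.
            conflicts.append((mgi_id, "accessions", sorted(missing)))
    return conflicts
```

Called with two small dictionaries keyed by placeholder ids such as `"MGI:0000001"`, the function returns one tuple per disagreement, which a curator could then raise with the relevant stakeholders.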

Create or update RefSeq sequences /?id=NG_007044&v=947227:11232 Curated annotation of the T-cell receptor alpha/delta locus

Changed annotation after release of Mm37.1 Annotated as version 3, overlapping model at 5’ end

Identify conflicts and discuss with stakeholders
Hi MGI,
I believe Tnrc18 (geneID: , MGI: ) and Zfp469 (geneID: , MGI: ) should be merged. This region of chr 5 has major assembly problems in the reference assembly, but the Celera assembly appears to accurately represent the structure as compared to transcript data and the orthologous regions of the human and rat genomes. In the reference assembly, Zfp469 and Tnrc18 are on separate scaffolds… there are multiple mouse and human transcripts spanning both loci…
Zfp469 is currently represented as NM_ (based on BC ), and this appears to be a valid transcript variant that uses a well-supported early polyA signal/site. However, mouse AK , CB , AK , and human AB all extend past this early polyA signal/site to include an additional 13 exons that overlap with Tnrc18 and potentially encode an additional 1125 aa at the C-terminus…
There is also an issue of nomenclature. I cannot find any evidence of the transcripts associated with Zfp469 encoding a zinc finger protein… the long variant does contain a trinucleotide repeat so the Tnrc18 name may be more appropriate, although the repeat is not found in the shorter NM_ variant. Also be aware that we had mis-associated the human TNRC18 nomenclature (HGNC: 11962)
Terence Murphy

One Gene or Two? BLAST alignment of human RefSeq NM_ to Mm37.1

Exon coverage better in Celera assembly BLAST alignment of human RefSeq NM_ to Celera

NCBI: Incremental processing
Maintain gene/sequence relationship (cont’d)
–Products
    RefSeq sequences via
        –Entrez Nucleotide
        –Entrez Protein
        –ftp
    Gene-specific data via
        –Entrez Gene
        –ftp
    Nomenclature propagated to UniGene and HomoloGene

NCBI: Re-annotation of genes
Timing
–Always with a new assembly
–May occur without a re-assembly
Evidence used
–cDNAs aligned to the genome by Splign
–Proteins aligned to the genome by ProSplign
–Ab initio predictions (Gnomon)
–Annotated RefSeq genomic sequences

NCBI: Re-annotation
Tracking/identification of annotation (decreasing weight)
–Best RefSeq placement (Splign/global alignment)
–Comparison to previous annotation
    Assembly to assembly
    Clone to clone
    ‘product’ to ‘product’
–Best GenBank placement
Products of annotation
–If tracked, reassign GeneID and RefSeq model accession(s)
–If novel and transcribed, assign new GeneID and RefSeq model accession(s)
–If novel pseudogene, assign new GeneID and annotate on the genomic sequence without assigning a model RefSeq accession
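The tracking order and the three product rules on this slide can be read as a small decision procedure. The Python sketch below is an illustration under stated assumptions, not NCBI's pipeline code: the `Model` record, the method keys in `TRACKING_ORDER`, the function names, and the returned action strings are hypothetical. The slide does not say how a transcribed pseudogene is handled; checking the pseudogene rule first is an assumption of this sketch.

```python
from dataclasses import dataclass, field
from typing import Optional

# Tracking methods in decreasing weight, following the slide's order.
TRACKING_ORDER = ("refseq_placement", "previous_annotation", "genbank_placement")


@dataclass
class Model:
    """Hypothetical record for one feature produced by re-annotation."""
    # Maps tracking method -> GeneID found by that method, if any.
    tracked_by: dict = field(default_factory=dict)
    transcribed: bool = True
    pseudogene: bool = False


def tracked_gene_id(model: Model) -> Optional[int]:
    """Return the GeneID from the highest-weight method that found one."""
    for method in TRACKING_ORDER:
        gene_id = model.tracked_by.get(method)
        if gene_id is not None:
            return gene_id
    return None


def assign_products(model: Model) -> str:
    """Apply the slide's three product rules to one model."""
    gene_id = tracked_gene_id(model)
    if gene_id is not None:
        return f"reassign GeneID {gene_id} and RefSeq model accession(s)"
    if model.pseudogene:
        # Assumption: the pseudogene rule is checked before the transcribed rule.
        return "assign new GeneID; annotate on genomic sequence, no model accession"
    if model.transcribed:
        return "assign new GeneID and new RefSeq model accession(s)"
    return "no rule on the slide applies"
```

When several tracking methods each find a GeneID, the sketch keeps the one from the method listed first, which is how "decreasing weight" is interpreted here.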

One Gene or Two? BLAST alignment of human RefSeq NM_ to Mm37.1

NCBI: Post annotation review
Data reviewed
–GeneIDs with sequence data, not annotated
–GeneIDs annotated previously, not in the current annotation
–CDS features NOT included in the CCDS set
Actions taken
–Create new RefSeqs if necessary
–Update existing RefSeqs if necessary
–Provide annotation comment to explain cases currently under review
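The three "data reviewed" categories are, in effect, set differences between identifier lists. The Python sketch below shows that reading; the function name `review_sets`, its parameters, and the result keys are hypothetical, and the inputs are assumed to be plain collections of identifiers rather than any real NCBI data structure.

```python
def review_sets(with_sequence, previous, current, cds_features, ccds):
    """Build the three review lists named on the slide.

    with_sequence: GeneIDs that have sequence data
    previous:      GeneIDs annotated in the previous annotation
    current:       GeneIDs annotated in the current annotation
    cds_features:  identifiers of annotated CDS features
    ccds:          identifiers of CDS features in the CCDS set
    """
    current = set(current)
    return {
        # GeneIDs with sequence data, not annotated
        "sequence_not_annotated": sorted(set(with_sequence) - current),
        # GeneIDs annotated previously, not in the current annotation
        "dropped_since_previous": sorted(set(previous) - current),
        # CDS features NOT included in the CCDS set
        "cds_not_in_ccds": sorted(set(cds_features) - set(ccds)),
    }
```

Each resulting list is a work queue for curators, who then create or update RefSeqs or add an annotation comment as the slide describes.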

NCBI: Post annotation review

NCBI: Representative review cases
Under-representation of ncRNAs in the RefSeq set
–May result in failure to annotate ncRNA that overlap a protein coding gene
–More RefSeqs for this category are being generated
Management of ‘read-through’ transcripts
–Generate RefSeq if multiple lines of evidence
–Discuss with all nomenclature groups

NCBI: Representative review cases
Adjudication of the name to be assigned to a given genomic location
–Evaluate conserved synteny
–Discuss with all nomenclature groups

A case history: Arhgap27 and P18Rik
VEGA: One Gene
NCBI/MGI: Two Genes

A case history: Arhgap27 and P18Rik
Contributing factors
Computation: model prediction suggests one gene, but independent RefSeq mRNAs force two
Only one gene is in the CCDS set
Curation: NCBI merged in 2005
Reversed the merge in 2006 in response to request from MGI
UniProt treats as one gene, Arhgap27
Evidence: one cDNA in rat, one cDNA from mouse, Arhgap12 all consistent with the one-gene model

A case history: A read-through locus

A case history: A (vertical) read-through locus

A case history: olfactory receptors
Action: Merge the Gene records in collaboration with MGI

A case history: interspersed loci