The NCBI Annotation Pipeline


1 The NCBI Annotation Pipeline
6/13/2018
Genome Assembly
Annotation
  Automated
  Manual
Section 1-Sequence Characteristics - number of basepairs - characteristics of fragments - contamination
NCBI Genome Resources

2 Assembly (sequence chromosome)
Remove contaminants
Bin by chromosome arms
Sequence Layout
Sequence Building
Place on chromosomes

3 Sequence Assembly - a modified greedy approach
Sequence Layout
  Curated Finished Regions
  Curated assembly instructions
  MegaBLAST hits
  Consider TPF clone order
  BAC chromosome assignment: annotation, STS markers, personal communication
  Remove conflicting overlaps, redundant BACs
BAC Sequence Fragments, Assemble, Order, NCBI Contig
Sequence Building
  Consider fragment:fragment sequence overlaps for each BAC pair in layout
  Meld overlapping sequence
  Order and Orient (o+o): alignments (mRNA, EST), BAC annotation, paired plasmid reads
113,310 Melds
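The "meld overlapping sequence" step can be illustrated with a toy greedy merge. This is a minimal sketch under strong assumptions, not the NCBI pipeline: it only handles exact suffix/prefix overlaps between fragments that are already in the same orientation, and the names `overlap_len`, `greedy_meld`, and `min_overlap` are hypothetical. The layout inputs listed on the slide (MegaBLAST hits, TPF clone order, curated assembly instructions) are not modeled.

```python
from itertools import permutations


def overlap_len(a: str, b: str, min_overlap: int = 3) -> int:
    """Return the length of the longest suffix of `a` that is a prefix of `b`.

    Overlaps shorter than `min_overlap` are reported as 0.
    """
    max_len = min(len(a), len(b))
    for n in range(max_len, min_overlap - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_meld(fragments: list[str], min_overlap: int = 3) -> list[str]:
    """Repeatedly meld the ordered pair of fragments with the longest overlap.

    Stops when no remaining pair overlaps by at least `min_overlap` bases.
    """
    frags = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while True:
        best_n, best_i, best_j = 0, -1, -1
        for i, j in permutations(range(len(frags)), 2):
            n = overlap_len(frags[i], frags[j], min_overlap)
            if n > best_n:
                best_n, best_i, best_j = n, i, j
        if best_n == 0:
            return frags
        # Meld: keep all of the left fragment, append the non-overlapping tail.
        melded = frags[best_i] + frags[best_j][best_n:]
        frags = [f for k, f in enumerate(frags) if k not in (best_i, best_j)]
        frags.append(melded)


if __name__ == "__main__":
    print(greedy_meld(["ACGTTG", "TTGCAA", "CAAGGT"]))
```

A greedy choice like this can be misled by repeats, which may be one reason the slide lists curated instructions and clone order as additional inputs to the layout.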

4 Genome Build Process
Input Resources: dbSNP, STS, Clones, Collaboration, Curation, GenomeScan, GenBank, LocusLink, RefSeq
Freeze Input Data: Sequences, Curated NTs, TPF, BLAST hits
Exclude Problem accessions
Assembly
Annotation
Contig Build & Release
Prepare for release
Update: Links, gi’s
Resource Updates: LocusLink
Public Release: Sequences (contig, mRNA, protein), Map Viewer, FTP, BLAST
Analysis & Review
Corrections for next build

5 What is being annotated?
Feature: Method
Genes: By alignment, by prediction
Markers: By ePCR
Variation: By alignment
Clones/Cytogenetic location: By alignment (BAC ends)
Phenotype (MIM): Via Gene identification, associated markers
Cytogenetic Position: By annotated BAC-END sequenced clones; by FISH-mapped clones used in assembly
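Marker placement "by ePCR" (electronic PCR) can be sketched as a search for a primer pair whose implied product size is close to the expected one. The code below is an illustration only, assuming exact primer matches on the forward strand of the contig. The function names, the `margin` parameter, and the example sequences are hypothetical and do not describe the actual e-PCR program.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T, N only)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq.upper()))


def find_all(haystack: str, needle: str) -> list[int]:
    """Return every 0-based start position of `needle`, including overlapping hits."""
    positions = []
    start = haystack.find(needle)
    while start != -1:
        positions.append(start)
        start = haystack.find(needle, start + 1)
    return positions


def epcr_hits(contig: str, forward: str, reverse: str,
              expected_size: int, margin: int = 50) -> list[tuple[int, int, int]]:
    """Find (start, end, size) of candidate PCR products on the forward strand.

    `start` is 0-based and `end` is exclusive. A hit requires the forward primer,
    then the reverse complement of the reverse primer downstream, with a product
    size within `margin` bases of `expected_size`.
    """
    contig = contig.upper()
    fwd = forward.upper()
    rev_rc = reverse_complement(reverse)
    hits = []
    for f in find_all(contig, fwd):
        for r in find_all(contig, rev_rc):
            size = r + len(rev_rc) - f
            # Reject pairs where the primer sites overlap or are out of order.
            if size < len(fwd) + len(rev_rc):
                continue
            if abs(size - expected_size) <= margin:
                hits.append((f, r + len(rev_rc), size))
    return hits


if __name__ == "__main__":
    contig = "GGGG" + "ACGTAC" + "T" * 10 + "GATTACA" + "CCCC"
    print(epcr_hits(contig, "ACGTAC", "TGTAATC", expected_size=23, margin=2))
```

A practical tool would also tolerate mismatches and search both strands. Exact matching is used here only to keep the idea visible.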

6 Genome Annotation: Genes
Automated:
  By mRNA alignment (RefSeq, GenBank): protein coding and transcribed pseudogenes
  By EST alignments
  GenomeScan predictions
Products:
  Sequences - RefSeq RNAs and proteins
  Resources - MapViewer, FTP, BLAST

7 RefSeq: a reagent for Contig Annotation
Potential Problems: Gene Families, Partial, Chimeric, Intron read-through, Linker, Vector, Wrong organism
RefSeq Advantages: Separate Gene Families, Not Partial, Means to correct problem sequences
Diagram labels: genome, RefSeq mRNAs, GenBank mRNAs, ESTs, TBLASTN, RPSBLAST, GenomeScan
RefSeq process results in excluding problem GenBank sequences from annotation pipeline

8 NCBI Reference Sequences (RefSeq)
Genome Oriented Resource
  A sequence for each macromolecule
  Central Dogma: Chromosome, mRNA, preprotein, mature protein
  Linked on a residue by residue basis
  Objectively non-redundant and comprehensive
Curated Resource
  Authoritative source by genome
  Derivative of GenBank but corrected, merged, extended
  Publicly distributed, Entrez Genomes Web site
Reagents for Genome Annotation and Analysis
Substrate for Functional Genomics

9 Reference Sequences
Goal: One sequence entry for each naturally occurring DNA, RNA and protein molecule
Key: curated, calculated
NC_000000 chromosome
NT_000000 contig
NM_000000 mRNA
XM_000000 predicted mRNA
NP_000000 protein
XP_000000 predicted protein
NG_000000 gene
Multiple products for one gene are instantiated as separate RefSeqs with the same LocusID.
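The accession prefixes on this slide lend themselves to a small lookup. The following is a sketch rather than an official parser: the table contains only the seven prefixes listed above, the optional `.version` suffix handling is an addition not shown on the slide, and the names `REFSEQ_PREFIXES` and `classify_accession` are hypothetical.

```python
import re

# Prefix-to-molecule table taken from the slide; other prefixes are not covered here.
REFSEQ_PREFIXES = {
    "NC": "chromosome",
    "NT": "contig",
    "NM": "mRNA",
    "XM": "predicted mRNA",
    "NP": "protein",
    "XP": "predicted protein",
    "NG": "gene",
}

# Two uppercase letters, an underscore, digits, and an optional ".version" suffix.
_ACCESSION = re.compile(r"^([A-Z]{2})_(\d+)(?:\.(\d+))?$")


def classify_accession(accession: str) -> str:
    """Return the molecule type implied by a RefSeq-style accession prefix."""
    match = _ACCESSION.match(accession.strip())
    if match is None:
        raise ValueError(f"not a RefSeq-style accession: {accession!r}")
    prefix = match.group(1)
    if prefix not in REFSEQ_PREFIXES:
        raise ValueError(f"prefix {prefix!r} is not in the slide's table")
    return REFSEQ_PREFIXES[prefix]


if __name__ == "__main__":
    for acc in ("NC_000000", "NT_000000", "XM_000000"):
        print(acc, "->", classify_accession(acc))
```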

10 RefSeq: Scope
The NCBI Reference Sequence (RefSeq) project provides non-redundant sequence data, including bacterial and viral genomes, mitochondria, chromosomes, constructed genomic contigs, transcripts, and proteins.

11 Process Flow
Technology: Sybase relational databases
  Support multiple users
  Instant updates in-house
  Daily updates public web site
  C, C++, Perl, stored procedures, triggers
In-house processing, External Collaboration
LocusLink
Public Release: LocusLink Web Site
Annotation Data
Automated BLAST Analysis
RefTrack
Provisional & Predicted Records (Transcripts & Proteins)
Status, QBLAST, Update
Reviewed Records (Genomic Regions, Transcripts, Proteins)
Manual Curation
Accessible in: BLAST, BLink, Entrez, FTP, LocusLink
RefTrack Tracking Database: Decisions, Status, Accessions, History
Genome Annotation Pipeline

12 Manual Curation Process
RefSeq infrastructure: supporting analysis
Tools: User Input, Genome Analysis, Database QC checks
Assignment: Disease Genes, Gene Families, Gene Clusters, Identified Problems
Analysis: Precomputed BLAST (Vector, ESTs, GenBank ‘nr’, HTG, Contigs), blastx, blastp, Spidey alignments, Map Viewer, Evidence Viewer
Review: Alternate Splicing, Gene Families
Review: Sequence, Database data, Literature
Database Editing
Sequence Editing: Sequence changes, Remove contamination, Correct errors, Extend UTRs, Splice Variants, Annotation
Annotation: Sequin, Ingenue
Final Check, Quality Control
Store Data & QC: RefSeq validator
Public

13 New Type of RefSeq: Genomic Regions
Why?
  Correct Assembly through Duplications, Paralogous Gene Clusters
  Optimize Annotation in Gene Clusters
Goals:
  1. Establish genomic reference sequence
  2. Provide annotation [genes, pseudogenes, variation]
  3. Provide corresponding mRNA, protein RefSeq
Used in: Genome Assembly & Annotation Pipeline

14 Gene Annotation
Automated Annotation:
  By mRNA alignment (RefSeq, GenBank)
  By EST alignments
  GenomeScan predictions
  Initial RefSeq record (provisional, predicted status)
  LocusLink – maintaining associations; connections to function
Manual Annotation:
  RefSeq – genomic regions, mRNA, protein (reviewed status)
  Local regions of Genome Assembly – provide curated assembly instructions (join a->b; don’t join a->c)

15 Products of annotation
RefSeqs (transcripts, proteins)
Gene id (LocusID)
features in chromosome coordinates
features in contig (NT accession) coordinates
Available in:
  Map Viewer: Graphical display, Tabular display, Sequence downloads
  FTP: RefSeqs (contigs, transcripts, proteins), Mapping Data
  LocusLink & Other resources
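Features are reported in both contig and chromosome coordinates, so a conversion between the two is implied. The sketch below shows one way such a conversion could work, assuming 1-based inclusive coordinates and a single ungapped placement of each contig on its chromosome. The `ContigPlacement` fields, the function name, and the example numbers are hypothetical and are not taken from NCBI's data files.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContigPlacement:
    """Where a contig sits on a chromosome (1-based, inclusive coordinates)."""
    contig: str       # e.g. an NT accession
    chrom: str
    chrom_start: int  # lowest chromosome position covered by the contig
    length: int       # contig length in bases
    strand: str       # "+" if the contig runs with the chromosome, "-" if reversed


def contig_to_chrom(p: ContigPlacement, start: int, end: int) -> tuple[int, int]:
    """Convert a feature span from contig coordinates to chromosome coordinates."""
    if not (1 <= start <= end <= p.length):
        raise ValueError("feature span must lie within the contig")
    if p.strand == "+":
        return p.chrom_start + start - 1, p.chrom_start + end - 1
    if p.strand == "-":
        # Contig base 1 maps to the highest chromosome position of the placement.
        chrom_end = p.chrom_start + p.length - 1
        return chrom_end - end + 1, chrom_end - start + 1
    raise ValueError("strand must be '+' or '-'")


if __name__ == "__main__":
    placement = ContigPlacement("NT_000000", "1", chrom_start=1001, length=500, strand="+")
    print(contig_to_chrom(placement, 10, 20))
```

Returning the span with the smaller coordinate first in both orientations keeps downstream interval comparisons simple. The feature's own strand would need to be tracked separately.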

16 Map Viewer: Query support

17 Map Viewer: Tabular report

18 Genes in regions of conserved synteny
Anchored by human gene order
Anchored by mouse gene order

19 Query by sequence: Review the alignment
A click away:
  Alignments (BLAST hit)
  Gene Description (LocusLink)
  Report of all features in the region
  Contig sequence
  Sequence in the region
  Other mRNAs aligning in the region
Define your own gene model based on alignments in the region

20 Quality Control - Genome review
Is the sequence correct?
Is the feature correctly placed?
Is there a feature that should be placed?
Are the attributes of the feature correct?
Approaches:
  In-house analysis & review (manual curation)
  Shared information (UCSC/Ensembl)
  Solicited review by experts in local regions

21 Acknowledgments
http://www.ncbi.nlm.nih.gov/genome/guide/human.html
Genome Build Team: Richa Agarwala, Hsiu-Chuan Chen, Slava Chetvernin, Deanna Church, Olga Ermolaeva, Renata Geer, Wratko Hlavina, Wonhee Jang, Jonathan Kans, Ken Katz, Paul Kitts, David Lipman, Adam Lowe, Donna Maglott, Jim Ostell, Kim Pruitt, Sergey Resenchuk, Victor Sapojnikov, Greg Schuler, Steve Sherry, Andrei Shkeda, Tatiana Tatusova, Lukas Wagner, Sarah Wheelan
RefSeq Curator Staff
BLAST Team
Entrez Team
NCBI Service Desk Staff
Collaborators: Human Gene Nomenclature Committee, OMIM Staff, The Jackson Laboratory, Rat Genome Database

