National Center for Biotechnology Information The Consensus CoDing Sequence (CCDS) Database Kim D. Pruitt Mouse Genome Annotation Summit Meeting March.


1 The Consensus CoDing Sequence (CCDS) Database
Kim D. Pruitt
National Center for Biotechnology Information
Mouse Genome Annotation Summit Meeting, March 12-13, 2008

2 Why is the CCDS project needed?
The availability of the human and mouse genome sequences has had a significant impact on disease and health research.
Most scientists rely on annotation information when designing, interpreting, and evaluating research results.
Inconsistencies in annotation results among the main public resources hamper use of this important data.
Researchers may not realize that a different annotation result is available elsewhere, possibly leading to erroneous or incomplete interpretations.
The Problem: Annotation of the genome sequence is essential, but beware of different interpretations!

3 CCDS - A collaborative project
Initiated by the main public annotation/browser groups to address concerns by the scientific community about inconsistencies in the human and mouse genome annotation.
Built by consensus among the collaborating members, which include:
– European Bioinformatics Institute (EBI)
– National Center for Biotechnology Information (NCBI)
– University of California, Santa Cruz (UCSC)
– Wellcome Trust Sanger Institute (WTSI)

4 What is the CCDS project?
Project goals:
– Identify a core set of protein-coding genes that are consistently annotated and of high quality
– Support convergence toward a standard set of gene annotations
Scope:
– Human and mouse protein-coding regions
Update frequency:
– Variable
– Depends on frequency of genome annotation updates

5 Process flow – calculating updates
Inputs: Ensembl merged annotation [Havana (manual), Ensembl (computational)]; NCBI (computational); RefSeq (manual)
Compare CDS (Annotation + Sequence): Identical, Similar, Novel
Existing CCDS: Retain, Lost
New match: QA, New CCDS ID, Out of scope
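
The comparison step can be illustrated with a minimal sketch. The slide does not spell out the matching rules, so everything here is an assumption: a CDS is modeled as a tuple of inclusive exon coordinates, "identical" means exactly equal coordinates, "similar" means any coding overlap, and "novel" means no overlap. The names `classify` and `overlaps` are hypothetical, and the sequence comparison named on the slide is not modeled.

```python
from typing import List, Tuple

Exon = Tuple[int, int]   # (start, end), inclusive; same chromosome and strand assumed
CDS = Tuple[Exon, ...]   # ordered coding exons of one annotated CDS


def overlaps(a: CDS, b: CDS) -> bool:
    """Return True if any coding exon of a overlaps any coding exon of b."""
    return any(s1 <= e2 and s2 <= e1 for s1, e1 in a for s2, e2 in b)


def classify(cds: CDS, other_group: List[CDS]) -> str:
    """Classify one group's CDS against the other group's annotation set.

    Assumed rules (not taken from the slide):
      identical - exactly the same exon coordinates exist in the other set
      similar   - some coding overlap, but no exact coordinate match
      novel     - no coding overlap with anything in the other set
    """
    if cds in other_group:
        return "identical"
    if any(overlaps(cds, other) for other in other_group):
        return "similar"
    return "novel"
```

Under these assumptions, only the "identical" class would be a candidate for a CCDS ID, which matches the later statement that annotated CDS coordinates must be identical.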

6 Assessing Quality
Quality assessment tests include:
– Consensus splice sites ("GY..AG" or "AT..AC")
– Valid start and stop codons with no internal stops
– NMD (nonsense-mediated decay)
– Low complexity
– Repeat-containing
– Insufficient protein homology
– Genome conservation
– Putative pseudogene
CCDS status is conservatively applied:
– Annotated CDS coordinates are identical
– Annotation is of high quality and passes QA tests, or curator review
Existing CCDS proteins can be flagged for review by the collaborating members. Updates and removals are by consensus agreement.
QA test results are reviewed by curators. Overrides are set to retain supported CDSs.
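
Two of the listed tests are simple enough to sketch. This is an illustration only, not the project's QA code: the function names are hypothetical, the start codon is assumed to be ATG, the standard stop codons are assumed, and "GY" is read with Y as the IUPAC code for C or T.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code assumed


def has_valid_orf(cds_seq: str) -> bool:
    """Check for a start codon, a terminal stop codon, and no internal stops.

    Assumes an ATG start; real annotation may allow exceptions.
    """
    seq = cds_seq.upper()
    if len(seq) < 6 or len(seq) % 3 != 0:
        return False
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    if codons[0] != "ATG" or codons[-1] not in STOP_CODONS:
        return False
    # Any stop codon before the last codon is an internal stop.
    return not any(codon in STOP_CODONS for codon in codons[1:-1])


def has_consensus_splice_sites(intron_seq: str) -> bool:
    """Check intron ends against "GY..AG" (GT or GC donor) or "AT..AC"."""
    seq = intron_seq.upper()
    if len(seq) < 4:
        return False
    donor, acceptor = seq[:2], seq[-2:]
    return (donor in {"GT", "GC"} and acceptor == "AG") or (
        donor == "AT" and acceptor == "AC"
    )
```

A CDS failing either check would, per the slide, go to curator review rather than being rejected automatically, since overrides can retain supported CDSs.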

7 CCDS Counts
Date    Build   CDS IDs  GeneIDs
Mar-05  Hs35.1  14,795   13,142
Feb-07  Hs36.2  18,290   16,008
Oct-06  Mm36.1  13,374   13,014
Nov-07  Mm37.1  17,707   16,893

Step                      Source   Genes  Proteins
Annotation                NCBI
Annotation                Ensembl
Matching CDS
QA & curation rejections
Accepted rejections
Final CCDS ID

8 Curation – how are updates curated and coordinated?
Any member of the collaboration can flag a CCDS for review:
– Update the CDS definition (alter N-terminus extent, internal splice site)
– Withdraw the CCDS ID (insufficiently supported, or non-protein-coding)
NCBI provides a collaboration web site to coordinate this review.
All collaborators must agree with a change to finalize a decision.
Withdrawal of a CCDS may happen between genome annotation updates.
An update to a CCDS is indicated by:
– Status change: a status of ‘pending update’ is reported when there is collaborative agreement that a change is needed
– Version change: the CCDS version number is incremented once the change is reflected in public annotation. This only occurs after a genome annotation update and CCDS analysis have taken place.
CCDS curation is fully integrated with RefSeq curation.
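
The status and version rules above can be modeled as a small state change. This is a sketch under stated assumptions, not the CCDS database schema: the class `CcdsRecord`, its fields, the "public" status label, and the ID `CCDS0000` are hypothetical, while the ‘pending update’ status and the four collaborators come from the slides.

```python
from dataclasses import dataclass, field
from typing import Set

# Collaborating members named earlier in the talk.
COLLABORATORS = frozenset({"EBI", "NCBI", "UCSC", "WTSI"})


@dataclass
class CcdsRecord:
    ccds_id: str                 # hypothetical identifier
    version: int = 1
    status: str = "public"       # assumed label for the normal state
    approvals: Set[str] = field(default_factory=set)

    def approve_update(self, member: str) -> None:
        """Record one member's agreement; all must agree to finalize."""
        if member not in COLLABORATORS:
            raise ValueError(f"unknown collaborator: {member}")
        self.approvals.add(member)
        if self.approvals == COLLABORATORS:
            self.status = "pending update"

    def apply_annotation_update(self) -> None:
        """Increment the version only after a genome annotation update."""
        if self.status == "pending update":
            self.version += 1
            self.status = "public"
            self.approvals.clear()
```

The point of the two-step design is that agreement changes only the status; the version number moves later, when the new CDS appears in public annotation.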

9 CCDS update & curation stats
Curation-based changes:
name   action    status   count
human  update    pending  366
human  update    agreed   557
human  withdraw  pending  189
human  withdraw  agreed   519
mouse  update    pending  185
mouse  update    agreed   57
mouse  withdraw  pending  16
mouse  withdraw  agreed   8

Annotation pipeline-based changes:
name   build  status                              count
human  35.1   Withdrawn, inconsistent annotation  133
human  36.2   Withdrawn, inconsistent annotation  29
mouse  36.1   Withdrawn, inconsistent annotation  29
mouse  37.1   Withdrawn, inconsistent annotation

Mouse: ~5200 curated CCDS genes

10 Curation considerations
– Alignments
– Track low-quality sequences (‘kill list’)
– Protein conservation
– Publications
– Personal communications
– QA measures

11 Access – How do I know if an annotation has a CCDS ID?
Genome browser displays:
– NCBI
– UCSC
Gene reports:
– Ensembl
– NCBI
– UCSC
– Vega
Other:
– RefSeq annotation (NCBI)
– CCDS web site
– FTP

12 NCBI Map Viewer (chr. 5)
Link to CCDS Browser

13 UCSC Browser
chr5:

14 UCSC Browser – Tyms gene
CCDS Browser

15 Access of CCDS data at NCBI
CCDS Database & Browser interface:
– Project description
– Query support
– Reports attributes of the CCDS: location data, sequence members, status
– FTP reports

16 CCDS Browser
History; Entrez Gene; View CCDS Details; Find all CCDSs for the Gene

17 CCDS Browser
Mouse-over highlights codon
Click to highlight codon and corresponding amino acid

18 Biology is complex – some CCDS curation examples
– 1 vs. 2 vs. ‘n’ genes
– Translation start site

19 1 vs. 2 vs. ‘n’ genes
Curation considerations:
– Nomenclature
– History (scientific use, publications, etc.)
– Different (but similar) products vs. distinct products
– Shared promoters

20 carnitine palmitoyltransferase 1b, choline kinase beta

21 1 vs. 2 vs. ‘n’ genes
Current RefSeq representation of the region:
– two protein-coding loci
– one non-coding locus for the non-coding transcript product (a read-through transcript)
Chkb-cpt1b (PMID: ); Chkb (CCDS ); Cpt1b (CCDS )

22 Translation start site
Curation considerations:
– Publication reports (CDS begins at ‘n’)
– Other cDNA sequencing reveals the ORF can be extended further upstream
– Evaluate:
  Genome conservation
  Literature reports for the protein
  Putative Kozak signals
  Presence of in-frame upstream stop codon
  INSDC submissions from an experimental lab source that do have the longer ORF extent annotated
– Consult with an expert
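
One of the evaluation points, the presence of an in-frame upstream stop codon, lends itself to a short sketch. The function name and the 0-based indexing are assumptions for illustration; the idea is that an in-frame stop upstream of the annotated start limits how far the ORF could be extended.

```python
from typing import Optional

STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code assumed


def upstream_inframe_stop(mrna: str, start: int) -> Optional[int]:
    """Return the 0-based position of the nearest in-frame stop codon
    upstream of the annotated start codon, or None if there is none.

    `start` is the 0-based index of the first base of the start codon.
    """
    seq = mrna.upper()
    pos = start - 3
    while pos >= 0:
        if seq[pos:pos + 3] in STOP_CODONS:
            return pos
        pos -= 3  # step back one codon, staying in frame
    return None
```

If this returns None, the reading frame stays open to the 5' end of the transcript, so the other evidence on the list (conservation, Kozak signals, longer INSDC submissions) would be needed to decide on an extension.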

23 Internal CCDS browser (restricted access)
Jmjd2d, jumonji domain containing 2D (chr 19)

24 Update is agreed on by all parties, resulting in a 258 aa N-terminal extension

25 Examples – no CCDS ID
EBI+WTSI and NCBI transcript annotation may differ even though the gene includes annotations with CCDS IDs.

26 Examples – no CCDS ID
Reasons:
– not found by one group
– different CDS length
– different splice sites
– different internal exon
– curation removal
(Figure panels labeled: EBI/WTSI, NCBI)

27 Acknowledgements
Donna Maglott, Josh Cherry, Keith Oxenride, Craig Wallin, Andrei Shkeda
RefSeq Curators
NCBI Genome Annotation Group
NCBI Map Viewer Group
Collaborators at Ensembl, UCSC, Vega: Jen Ashurst & Vega curator group, Rachel Harte, Mark Diekhans, Steve Searle

28 Ensembl – Tyms gene

29 Vega browser – Tyms gene (chromosome )

