
1 CDD – a conserved domain database. Aron Marchler-Bauer, NCBI, National Library of Medicine, NIH. DIMACS Workshop on Protein Domains: Identification, Classification and Evolution, February 27-28, 2003

2 CDD: a collection of domain multiple alignments linked to protein 3D structure; imported alignment models mirrored ‘as-is’, sources are Pfam, Smart, COGs (close to 10,000); curated alignment models (about 300); part of NCBI’s Entrez query/retrieval system; RPS-Blast to search PSSMs derived from alignment models

3 Entrez with CDD … [Diagram labels: Protein Sequences; MEDLINE Abstracts; BLAST Sequence Similarity; 3D Structures; Conserved Domains; Protein Sequences; Term Frequency Statistics; Related Conserved Domains; VAST Structure Similarity; Domain Architecture Similarity; CD-protein Links]

4 Conserved Domains as part of Entrez to … annotate three-dimensional structures

5 Conserved Domains as part of Entrez to … annotate protein sequences

6 Conserved Domains as part of Entrez to … neighbor proteins by domain architecture. Currently (CDD v1.60): ~5 million protein-CD links

7 RPS-Blast (Reverse Position-Specific Blast)
(Psi)-Blast: Search query (and its PSSM) against sequence database. Lookup table holds possible word matches to query; database sequences are scanned for single or multiple word matches, which are then extended to identify statistically significant alignments.
RPS-Blast: Search query protein sequence against a database of PSSMs. Lookup table holds possible word matches to database PSSMs; query sequences are scanned for single or multiple word matches, which are then extended to identify statistically significant alignments.
How does it compare?
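The lookup-table idea on this slide can be illustrated with a toy sketch. It is not RPS-Blast's implementation: the 4-letter alphabet, the PSSM values, the word size, the threshold, and the function names `build_lookup` and `scan_query` are all assumptions made for illustration, and the extension and statistics steps are left out entirely.

```python
from collections import defaultdict
from itertools import product


def build_lookup(pssm, alphabet, word_size=3, threshold=5):
    """Map each word to the PSSM positions where it scores >= threshold."""
    table = defaultdict(list)
    for pos in range(len(pssm) - word_size + 1):
        for word in product(alphabet, repeat=word_size):
            # Residues missing from a column get an assumed default score of -1.
            score = sum(pssm[pos + i].get(res, -1) for i, res in enumerate(word))
            if score >= threshold:
                table["".join(word)].append(pos)
    return table


def scan_query(query, table, word_size=3):
    """Return (query_offset, pssm_position) pairs for every word hit."""
    hits = []
    for q in range(len(query) - word_size + 1):
        for pos in table.get(query[q:q + word_size], []):
            hits.append((q, pos))
    return hits


# Toy PSSM: 4 columns over a made-up 4-letter alphabet.
ALPHABET = "ACDE"
TOY_PSSM = [{"A": 3}, {"C": 3}, {"D": 3}, {"E": 3}]

table = build_lookup(TOY_PSSM, ALPHABET)
hits = scan_query("EACDA", table)
print(hits)  # [(1, 0), (2, 1)]: two word hits on the same diagonal
```

In a real search the two hits on one diagonal would seed an extension step; here they only show that the table is built from the database side (the PSSM) and the query is what gets scanned.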

8 Test set: Smart v3.3, 569 Domain Families / Alignments / PSSMs; 23736 protein sequences used in alignments; 14100 protein sequences from the initial Drosophila genome set. The effect of the search heuristics can be measured directly against IMPALA, a similar program using the rigorous Smith-Waterman algorithm.

9 The effect of the search heuristics and the differences in alignment model encoding can be measured against HMMer. Test set: Smart v3.3, 569 Domain Families / Alignments / PSSMs; 23736 protein sequences used in alignments; 14100 protein sequences from the initial Drosophila genome set.

10 RPS-Blast vs. IMPALA, Speed vs. Sensitivity

11 Self-recognition: fraction of the sequence fragments used to build the alignment model that yield significant scores when compared with the search model. Information content: Σ p log(p/q), summed across aligned columns. The average alignment information content for the 568 models used in the test is 240 bits. For 26 families (about 5%), self-recognition works better with IMPALA (detectable heuristics effects); the average alignment information content for these 26 models is 100 bits. For 542 models (about 95%), we did not detect heuristics effects in a self-recognition test.
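A minimal sketch of the two quantities defined on this slide, under stated assumptions: the logarithm is taken base 2 (the slide reports bits), p is the observed residue frequency in a column and q a background frequency, columns contain no gaps, and "significant" is modelled as an E-value at or below a cutoff. The cutoff value, the uniform toy background, and the function names are illustrative, not taken from the talk.

```python
import math
from collections import Counter


def column_information(column, background):
    """Sum of p * log2(p / q) over the residues observed in one column."""
    counts = Counter(column)
    total = len(column)
    return sum(
        (n / total) * math.log2((n / total) / background[res])
        for res, n in counts.items()
    )


def alignment_information(rows, background):
    """Information content (bits) summed across all aligned columns."""
    return sum(column_information(col, background) for col in zip(*rows))


def self_recognition(evalues, max_evalue=0.01):
    """Fraction of model-building fragments that score significantly."""
    return sum(e <= max_evalue for e in evalues) / len(evalues)


# Toy data: 4-letter alphabet with a uniform background (assumption).
background = {res: 0.25 for res in "ACDE"}
rows = ["ACD", "ACD", "ACE", "ACE"]

bits = alignment_information(rows, background)
fraction = self_recognition([1e-5, 0.5, 1e-3, 2.0])
print(bits, fraction)  # 5.0 0.5
```

Two fully conserved columns contribute 2 bits each against this background, and the evenly split third column contributes 1 bit, which is why low-information models have less signal to recover from a heuristic search.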

12

13 In 65 families (~11%) there is more than 5% difference in self-recognition between HMMer and RPS-Blast; their mean information content is 65 bits. In 503 families (89%) there is less than 5% difference in self-recognition.

14 Conclusions: differences, maybe not too surprising, affecting a fairly small subset of the models at the lower end of the ‘informativeness’ spectrum; can optimize PSSM calculation, but might see diminishing returns; it may be more effective to deal with scope of models. Need to do something about the model collection … curation of alignment models
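The per-family comparison could be tabulated roughly as follows. This is a sketch only: it assumes that "5% difference" means an absolute difference of 0.05 in the self-recognition fraction, which the slide does not spell out, and the family names and rates are invented.

```python
def families_differing(rates_a, rates_b, min_diff=0.05):
    """Families whose self-recognition differs by more than min_diff."""
    shared = rates_a.keys() & rates_b.keys()
    return sorted(f for f in shared if abs(rates_a[f] - rates_b[f]) > min_diff)


# Invented per-family self-recognition fractions for two search methods.
method_a = {"fam1": 0.97, "fam2": 0.90, "fam3": 0.60}
method_b = {"fam1": 0.95, "fam2": 0.80, "fam3": 0.61}

differing = families_differing(method_a, method_b)
share = len(differing) / len(method_a)
print(differing, share)
```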

15 Recording conserved features in CDD …

16

17 Conserved Features in CDs: catalytic, binding, interaction, and regulatory sites; explain observed patterns of sequence conservation; annotate if applicable to all aligned members; annotate if evidence is available (3D structure, citation)

18 Collection has become redundant: search results for 2SRC (Tyrosine Kinase). Right now: about 9400 CD-CD links in Entrez

19 Collection has become redundant: search results for 1G291 (MalK). Many ATPase domains are sequence-similar to each other, and possibly related by descent from a common ancestor. How to explain this redundancy?

20 Curation: literature check; examination of the conserved domain extent; examination of the multiple alignment, identification of a core substructure, establishment of a block-based alignment in agreement with 3D-structure data; feature annotation and recording of evidence; investigation of ‘related’ domains and their apparent relationships, resolving and recording the family hierarchy; update of CD alignment models with new sequences and 3D-structure data

21 Curation needs to deal with: noise from sequence data (gene models, annotation); noise from alignments / alignment methods

22 Block alignment model and family hierarchies: Parent alignment; Children: membership consistency, alignment consistency
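The slide names the two consistency checks without defining them, so the sketch below rests on one possible reading, stated as an assumption: membership consistency means every child sequence also appears in the parent, and alignment consistency means residues that share a parent column also share a column in the child. The data layout (sequence id mapped to the residue index in each ungapped block column) and all names are hypothetical.

```python
def membership_consistent(parent, child):
    """Every sequence in the child alignment is also in the parent."""
    return set(child) <= set(parent)


def alignment_consistent(parent, child):
    """Residues aligned in a parent column stay aligned in the child."""
    shared = [seq for seq in child if seq in parent]
    if not shared:
        return True
    for col in range(len(parent[shared[0]])):
        child_cols = set()
        for seq in shared:
            residue = parent[seq][col]
            if residue not in child[seq]:
                return False  # parent column is missing from the child
            child_cols.add(child[seq].index(residue))
        if len(child_cols) != 1:
            return False  # residues ended up in different child columns
    return True


# Toy block alignments: residue index per column, one list per sequence.
parent = {"s1": [10, 11, 12], "s2": [20, 21, 22], "s3": [5, 6, 7]}
good_child = {"s1": [9, 10, 11, 12], "s2": [19, 20, 21, 22]}
bad_child = {"s1": [10, 11, 12, 13], "s2": [19, 20, 21, 22]}

print(membership_consistent(parent, good_child))  # True
print(alignment_consistent(parent, good_child))   # True
print(alignment_consistent(parent, bad_child))    # False
```

Under this reading a child may extend the parent's blocks (as `good_child` does) but may not shift sequences relative to each other.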

23

24 Rizzi and Schindelin, Curr. Opin. Struct. Biol. 2002, 12:709-720

25 … sequences used in the alignment hit a variety of models in CD-Search:

26 … examine domain architectures as recorded in CDART:

27 … validate sequences, validate alignment block structure, and examine sequence tree:

28

29

30

31 [Diagram labels] Pfam: PF0994 MoCF_biosynth, MoeA_N, MoeA_C. COGs: MoaB, CinA, MoeA. CDD: cd00758, with cd00758_a (MoaB), cd00758_b (MoeA), cd00758_c (CinA)

32 Concept borrowed from COGs – pattern of phylogenetic distribution as evidence for functional divergence after gene duplication events

33 Principles for establishing CD-Hierarchies: Economy: too many families slow down the search system; Search performance: flat alignment models must be split; Domain age: we’re primarily interested in sets of ancient conserved domains; Domain architectures; Subgroup-specific features. [Figure labels: Plants, Animals, Archaea, Alpha-proteobact., Gram+; 3.5 bio, 1.7 bio, 2.6 bio]

34 Future directions: ability to describe complex hierarchies, which will allow modeling of fusion events. [Diagram labels: ABCDEFG; ABC_2, ABC_1, DEFG_2, DEFG_1; ABC_2DEFG_2]

35 Credits: Steve Bryant, Lewis Geer, Siqian He, David Hurwitz, Christopher Lanczycki, Charlie Liu, Tom Madej, Anna Panchenko, Ben Shoemaker, Vahan Simonyan, Paul Thiessen, Yanli Wang, John Anderson, Natalie Fedorova, John Jackson, Aviva Jacobs, Cynthia Liebert, Gabriele Marchler, Raja Mazumder, B. Sridhar Rao, Carol DeWeese-Scott, James Song, Sona Vasudevan, Roxanne Yamashita, Jodie Yin; PFAM, SMART, COGs, BLAST team, Entrez team, Taxonomy team, NCBI Help-Desk

36

