Pattern databases. Gopalan Vivek.

1 Pattern databases. Gopalan Vivek

2 Pattern databases - topics Definition Applications Classifications Common Databases Conclusions

3 Pattern databases Definition Applications Classifications Common Databases Conclusions

4 Pattern databases – definition: Secondary databases derived from conserved regions obtained from multiple sequence alignment of sequences in primary databases such as GenBank, EMBL, DDBJ, SWISS-PROT/TrEMBL, PIR, etc.

5 Primary databases (SWISS-PROT: protein; GenBank: DNA), millions of sequences -> pattern extraction by multiple sequence alignment -> pattern databases, thousands of patterns

6 Pattern databases Definition Applications Classifications Common Databases Conclusions

7 Pattern Databases - Applications Function prediction for protein/nucleotide sequences, even when sequence similarity is low (<25%). Useful for classifying protein sequences into families. Searching a pattern database takes less time than searching a primary database, since patterns are a compact representation of the features of many sequences.

8 Pattern databases Definition Applications Classifications Common Databases Conclusions

9 Multiple Sequence Alignment (MSA). Family-based databases consider the full MSA. Motif-based databases consider local regions in the MSA (labelled Motif-1, Motif-3 in the figure).

10 Pattern Databases – Protein. Motif based: PROSITE, PRINTS, BLOCKS. Family based: ProDom, PIR-ALN, ProtoMap, DOMO, ProClass, Pfam, SMART, TIGRFAMs, SBASE, SYSTERS.

11 DNA pattern databases: REBASE, TRANSFAC

12 InterPro - Integrated resource of protein families and sites: PROSITE, PRINTS, BLOCKS, Pfam, ProDom -> InterPro

13 Pattern databases Definition Applications Classifications Common Databases –PROSITE, PRINTS, BLOCKS & SMART (motif based) –MetaFam, InterPro (Integrated databases) Conclusions

14 Databases – General Tips 1. Source 2. Input formats & parameters 3. Output formats 4. Quality of the data 5. Other details – updates, coverage, speed, download, reference, methods etc.

15 Focus To search pattern databases for the enzyme “Alkaline phosphatase” using their text or keyword search options. To analyze the quality of the results from each of these databases (sensitivity, specificity). Sequence & pattern searches: in the afternoon’s practical.
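The two quality measures named above can be illustrated with a short sketch. The slides do not define them, so the standard definitions are assumed here: sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP). The function names and the example counts are hypothetical and not taken from any database. Some pattern-database documentation reports precision, TP / (TP + FP), instead of specificity, so that is included for comparison.

```python
def sensitivity(tp: int, fn: int) -> float:
    """Fraction of true family members that the pattern detects."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """Fraction of non-members that the pattern correctly rejects."""
    return tn / (tn + fp)


def precision(tp: int, fp: int) -> float:
    """Fraction of reported hits that are true family members."""
    return tp / (tp + fp)


# Hypothetical counts from scanning 1000 sequences with one pattern.
tp, fp, fn, tn = 45, 5, 5, 945

print(f"sensitivity = {sensitivity(tp, fn):.3f}")
print(f"specificity = {specificity(tn, fp):.3f}")
print(f"precision   = {precision(tp, fp):.3f}")
```

Because non-members vastly outnumber members in a primary database, specificity tends to look high even for a noisy pattern, which is one reason to check precision as well.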

16 PROSITE consists of biologically significant protein sites, patterns and profiles that help to reliably identify to which known protein family (if any) a new sequence belongs. Based on SWISS-PROT/TrEMBL.

17 Text Search Sequence Scanner ID and text Search


19 Result: PROSITE documentation page. Details about the pattern/profile; PROSITE ID; PROSITE pattern: [IV]-x-D-S-[GAS]-[GASC]-[GAST]-[GA]-T [S is the active site residue]
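A PROSITE pattern like the one above can be read as a regular expression. The sketch below is a minimal translation, assuming the common PROSITE conventions (`-` separates positions, `x` is any residue, `[..]` lists allowed residues, `{..}` lists forbidden residues, `(n)` or `(n,m)` is a repeat count, `<` and `>` anchor the sequence ends). The function name `prosite_to_regex` and the test sequence are hypothetical, and rarer syntax such as `>` inside brackets is not handled.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression."""
    pattern = pattern.strip().rstrip(".")  # entries may end with a period
    parts = []
    for element in pattern.split("-"):
        anchor_start = element.startswith("<")
        anchor_end = element.endswith(">")
        element = element.strip("<>")
        # Separate the residue specification from an optional repeat count.
        match = re.fullmatch(r"([^()]+)(?:\((\d+(?:,\d+)?)\))?", element)
        if match is None:
            raise ValueError(f"cannot parse element: {element!r}")
        core, repeat = match.groups()
        if core == "x":
            regex = "."  # any residue
        elif core.startswith("[") and core.endswith("]"):
            regex = core  # allowed residues, same syntax as a regex class
        elif core.startswith("{") and core.endswith("}"):
            regex = "[^" + core[1:-1] + "]"  # forbidden residues
        elif len(core) == 1 and core.isalpha():
            regex = core  # a single fixed residue
        else:
            raise ValueError(f"unsupported element: {element!r}")
        if repeat:
            regex += "{" + repeat + "}"
        parts.append(("^" if anchor_start else "") + regex + ("$" if anchor_end else ""))
    return "".join(parts)


regex = prosite_to_regex("[IV]-x-D-S-[GAS]-[GASC]-[GAST]-[GA]-T")
print(regex)

# Hypothetical test sequence, written to contain one match.
hit = re.search(regex, "MKVTDSAASATAW")
print(hit.group(), hit.start())
```

Note that `re.search` reports only non-overlapping matches, whereas a real pattern scanner may need to report overlapping ones.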

20 Numerical Results PROSITE Pattern Detailed View - page 1

21 Detailed View - page 2 True Positives False Positives View entry in raw text format (no links)

22 Raw Text Format – PROSITE Format

23 Line codes in the PROSITE format:
ID  Identification
AC  Accession number
DT  Date
DE  Short description
PA  Pattern
MA  Matrix/profile
RU  Rule
NR  Numerical results
CC  Comments
DR  Cross-references to SWISS-PROT
3D  Cross-references to PDB
DO  Pointer to the documentation file
//  Termination line
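Because every line starts with a two-character code, a raw-text entry is easy to split into fields. The following is a minimal sketch, assuming each line holds the code followed by spaces and then the value. The sample entry is hand-written for illustration: only the pattern on the PA line comes from the slides, and the ID, AC, DE and DO values are placeholders, not real PROSITE identifiers. The function name `parse_entry` is also hypothetical.

```python
from collections import defaultdict

# Hand-written illustrative record; all values except PA are placeholders.
SAMPLE_ENTRY = """\
ID   EXAMPLE_PATTERN; PATTERN.
AC   PS00000;
DE   Placeholder description for illustration.
PA   [IV]-x-D-S-[GAS]-[GASC]-[GAST]-[GA]-T.
DO   PDOC00000;
//
"""


def parse_entry(text: str) -> dict[str, list[str]]:
    """Group the lines of one entry by their two-character line code."""
    fields = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        if line.startswith("//"):
            break  # termination line: end of this entry
        code, value = line[:2], line[2:].strip()
        fields[code].append(value)
    return dict(fields)


entry = parse_entry(SAMPLE_ENTRY)
print(entry["ID"])
print(entry["PA"])
```

Values are kept as lists because a code such as CC or DR can occur on several lines of the same entry.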

24 PROSITE Profiles

25 Highly degenerate protein structural and functional domains – immunoglobulin domains, SH2 and SH3 domains. Consensus sequences of repetitive DNA elements – SINEs, LINEs. Basic gene expression signals – promoter elements, RNA processing signals, translational initiation sites. DNA-binding protein motifs. Protein and nucleic acid compositional domains – glutamine-rich activation domains, CpG islands.

26 PROSITE - features: completeness; high specificity; documentation; periodic reviewing; parallel update with SWISS-PROT (primary database).

27 Multiple sequence alignment (cydeggis, cyedggis, cyeeggit, cyhgdggs, cyrgdgnt) -> find 4-5 functionally conserved residues -> core pattern (motif): C-Y-x2-[DG]-G-x-[ST] -> scan SWISS-PROT -> more false positives? YES: increase the sequence length of the pattern and scan again. NO: add to PROSITE DB.
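One way to see how a core pattern such as the one above relates to the alignment is to collapse each alignment column into a pattern element. The sketch below is a simplified illustration only, not PROSITE's curation procedure, which relies on choosing functionally conserved residues. The function `derive_pattern` and its `max_alternatives` threshold are hypothetical, and the output uses the standard PROSITE `x(2)` notation where the slide writes `x2`.

```python
from itertools import groupby


def derive_pattern(aligned: list[str], max_alternatives: int = 2) -> str:
    """Collapse an ungapped alignment into a PROSITE-style pattern.

    A column with one residue becomes that residue, a column with up to
    `max_alternatives` residues becomes a bracket group, and anything
    more variable becomes the wildcard x.
    """
    if len({len(seq) for seq in aligned}) != 1:
        raise ValueError("all aligned sequences must have the same length")

    elements = []
    for column in zip(*(seq.upper() for seq in aligned)):
        residues = sorted(set(column))
        if len(residues) == 1:
            elements.append(residues[0])
        elif len(residues) <= max_alternatives:
            elements.append("[" + "".join(residues) + "]")
        else:
            elements.append("x")

    # Merge runs of consecutive wildcards into x(n).
    merged = []
    for element, run in groupby(elements):
        count = len(list(run))
        if element == "x" and count > 1:
            merged.append(f"x({count})")
        else:
            merged.extend([element] * count)
    return "-".join(merged)


alignment = ["cydeggis", "cyedggis", "cyeeggit", "cyhgdggs", "cyrgdgnt"]
print(derive_pattern(alignment))
```

With the five sequences from the slide and a threshold of two alternatives, the result agrees with the core pattern shown there; a different threshold would give a stricter or looser pattern.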

28 PRINTS: protein fingerprint database. Fingerprint: a set of motifs that represent the most conserved regions of a multiple sequence alignment. Better diagnostic reliability than single-motif methods. Source: SWISS-PROT/TrEMBL

29 Multiple sequence alignment (cydeggis, cyedggis, cyeeggit, cyhgdggs) -> identification of ALL the conserved regions (motifs) -> creation of frequency matrices -> iterative scanning of SWISS-PROT/TrEMBL with the frequency matrices until convergence -> fingerprint (set of motifs) stored in PRINTS DB
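The frequency-matrix step can be sketched in a few lines. The code below is a minimal illustration under stated assumptions: it counts residues per column of one ungapped motif and scores a sequence window by summing the counts of its residues. The function names and the test sequence are hypothetical, and the real PRINTS scanning and convergence procedure is more involved than this single pass.

```python
from collections import Counter


def build_frequency_matrix(motif: list[str]) -> list[Counter]:
    """Count how often each residue occurs in each column of the motif."""
    if len({len(seq) for seq in motif}) != 1:
        raise ValueError("all motif sequences must have the same length")
    return [Counter(column) for column in zip(*(seq.upper() for seq in motif))]


def score_window(matrix: list[Counter], window: str) -> int:
    """Sum the column counts of the residues in a window of motif width."""
    return sum(column[residue] for column, residue in zip(matrix, window.upper()))


def best_hit(matrix: list[Counter], sequence: str) -> tuple[int, int]:
    """Return (start, score) of the highest-scoring window in the sequence."""
    width = len(matrix)
    if len(sequence) < width:
        raise ValueError("sequence is shorter than the motif")
    scores = [
        (score_window(matrix, sequence[start:start + width]), start)
        for start in range(len(sequence) - width + 1)
    ]
    best_score, best_start = max(scores, key=lambda item: item[0])
    return best_start, best_score


motif = ["cydeggis", "cyedggis", "cyeeggit", "cyhgdggs"]
matrix = build_frequency_matrix(motif)

# Hypothetical sequence containing one motif-like region.
print(best_hit(matrix, "AAACYEEGGISAAA"))
```

In an iterative scheme, sequences scoring above some threshold would be added to the motif and the matrix rebuilt, repeating until no new sequences are found.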

30 Database ID, no. of motifs and text Search Motif scanner (for searching a sequence or pattern against PRINTS database)


32 Page 1 for ‘alkaline phosphatase’ entry in PRINTS Documentation, Links & references

33 Page 2 Fingerprint details Sequence Summary

34 Page 3 Motif no. 1 Motif no. 2 “Raw” motif SWISSPROT -IDs Start and Interval between motifs in the fingerprint

35 BLOCKS: Blocks are multiply aligned ungapped segments corresponding to the most highly conserved regions of proteins. The BLOCKS database is a collection of blocks representing known protein families that can be used to compare a protein or DNA sequence with documented families of proteins.

36 Making blocks: Blocks are produced by the automated PROTOMAT system (Henikoff and Henikoff, 1991), which applies a robust motif-finder to a set of related protein sequences.


38 Sequence, no. of blocks and text Searches Blocks Maker

39 Page 1 Summary Search methods using blocks

40 Page 2 BLOCK - 1. Labels: start position of the block; SWISS-PROT ID. Weak blocks: strength < 1100. Strong blocks: strength >= 1100.

41 SMART contains >500 domain families found in signaling, extracellular and chromatin-associated proteins. Each domain is extensively annotated with phyletic distributions, functional class, tertiary structures and functionally important residues.

42 ID and text Search ID & sequence Search Domain & GO search Alkaline Phosphatase



45 Results – Alkaline phosphatase “Signatures”. PROSITE: represented as a single motif. PRINTS: represented as 5 motif regions. BLOCKS: represented as 6 block regions. SMART: represented as a single profile.

46 Composite pattern databases: MetaFam, InterPro, CDD (Conserved Domain Database), iProClass

47 MetaFam & PANAL. MetaFam: protein family classification built with Blocks+, DOMO, Pfam, PIR-ALN, PRINTS, PROSITE, ProDom, SBASE, SYSTERS. PANAL: Protein ANALysis tool page of MetaFam.


49 InterPro: built from PROSITE, PRINTS, Pfam, ProDom, SMART, TIGRFAM, SWISS-PROT and TrEMBL. Text- and sequence-based searches.



52 PRINTS PROSITE Pfam PRODOM SMART Detailed View - page 1

53 Detailed View - page 2 BLOCKS database link


55 Detailed View - page 2

56 T – True Positive F – False Positive Range of the motif

57 Pattern databases Definition Applications Classifications Common Databases –PROSITE, PRINTS & BLOCKS (motif based) –MetaFam, InterPro (Integrated databases) Conclusions

58 CONCLUSION Pattern databases are diverse, ranging from small patterns to profiles to complex HMMs. Each has different strengths and weaknesses. Database formats differ. It is best to combine and analyze results from different pattern databases.
