PRINTS A protein family database with a difference Terri Attwood Faculty of Life Sciences & School of Computer Science University of Manchester, Oxford.

1 PRINTS A protein family database with a difference Terri Attwood Faculty of Life Sciences & School of Computer Science University of Manchester, Oxford Road Manchester M13 9PT, UK http://www.bioinf.manchester.ac.uk/dbbrowser/ Understanding the difference

2 Preface 10/26/2015
Pattern-recognition tools come in different shapes & sizes
– the databases they underpin consequently differ
[GSTALIVMFYWC]-[GSTANCPDE]-{EDPKRH}-x-{PQ}-[LIVMNQGA]-{RK}-{RK}-[LIVMFT]-[GSTANC]-[LIVMFYWSTAC]-[DENH]-R-[FYWCSH]-{PE}-x-[LIVM]
The challenge is to understand the consequences of those differences
But it isn’t just the underlying methods that differ
– the database search tools differ
– & the results of using those tools differ
– the annotation philosophies of the databases also differ
How, then, do we know if the family we see in different databases is the same or different?
– how do we know if the differences are meaningful?
– the smallest things can be highly significant
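
The pattern on this slide is written in PROSITE-style syntax, and a small sketch may help show how such a pattern is read. The code below is an illustration only, not ScanProsite's implementation: the function name `prosite_to_regex` and the two toy sequences are hypothetical, and only the syntax elements used on the slide (plus simple repeat counts) are handled.

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression string."""
    pieces = []
    for element in pattern.replace(" ", "").split("-"):
        # Split off an optional repeat count such as (3) or (2,4).
        match = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = match.groups()
        if core == "x":
            core = "."                      # x: any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"  # {..}: any residue except these
        # [..] and single letters are already valid regex syntax.
        if low and high:
            core += "{" + low + "," + high + "}"
        elif low:
            core += "{" + low + "}"
        pieces.append(core)
    return "".join(pieces)

# The pattern shown on the slide.
SLIDE_PATTERN = (
    "[GSTALIVMFYWC]-[GSTANCPDE]-{EDPKRH}-x-{PQ}-[LIVMNQGA]-{RK}-{RK}-[LIVMFT]-"
    "[GSTANC]-[LIVMFYWSTAC]-[DENH]-R-[FYWCSH]-{PE}-x-[LIVM]"
)
SLIDE_REGEX = prosite_to_regex(SLIDE_PATTERN)

# Hypothetical 17-residue toy sequences (not real proteins).
TOY_HIT = "GGAAALAALGLDRFAAL"   # satisfies every position
TOY_MISS = "GGAAALAALGLDKFAAL"  # the invariant R is replaced by K

print(SLIDE_REGEX)
print(bool(re.search(SLIDE_REGEX, TOY_HIT)), bool(re.search(SLIDE_REGEX, TOY_MISS)))
```

A single substitution at the one invariant position is enough to lose the match, which is the sense in which the smallest things can be highly significant.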

3 Overview
Setting the scene
– health warnings
Methods of family analysis
– where fingerprints fit in
Some examples
– & cautionary tales
Conclusions
Epilogue

4 Health warning 1: Remember the biology
Proteins exhibit rich evolutionary relationships & complex molecular interactions
– so they present significant challenges for in silico analysis
Problems arise when we lose sight of the underlying biology

5 We are using biology-unaware search tools to analyse such complex systems…
In trying to understand molecular function, we must be realistic about what we can achieve using such naïve approaches…

6 Health warning 2: Remember the limitations of the search methods
Pairwise search methods (BLAST, FastA...) & catch-all family-based methods (profiles, HMMs…) ‘see’ generic similarity
These methods do not see the often-subtle differences that constitute the functional determinants between closely-related families
But identifying similarity between sequences is not the same as identifying their functions
– in trying to derive functional insights, it is therefore imperative to recognise the limitations of the methods used
What you see depends on how you look!

7 Aims of family analysis: Identifying patterns
Given a set of sequences, we usually want to know
– what are these proteins; to what family do they belong?
– what is the function; how can we explain this in structural terms?
We try to answer these questions by seeking patterns that will allow us to infer relationships with previously-characterised sequences
We do this in 3 main ways…

8 Methods of family analysis
Single motif methods: Regular expressions (PROSITE)
Full domain alignment methods: Profiles (Profile Library), HMMs (Pfam)
Multiple motif methods: Identity matrices (PRINTS)

9 Challenge of family analysis: patterns of conservation change
Highly divergent family with single function?
Superfamily with many diverse functional families?
– must distinguish if we are to diagnose function reliably
– but this is not always straightforward

10 Where fingerprints fit in
Fingerprints are sets of motifs that characterise families
– taken together, the motifs create diagnostic signatures
Offer improved diagnostic reliability over single motifs by virtue of the biological context provided by motif neighbours (order, interval)

11 Single-motif search
Need convincing? …it’s actually common sense!

12 Two-motif search

13 Three-motif search
From 406 hits with 1 motif, we converged to 1 hit with 3 motifs
– so, adding motifs improves diagnostic reliability
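
The narrowing effect described on slides 11 to 13 can be sketched with a toy search. This is a simplified illustration under stated assumptions: the motifs are plain regular expressions (PRINTS itself scores motifs with matrices, not regexes), only motif order is checked (the interval between motifs is not modelled), and the motifs, sequences and function names are all hypothetical.

```python
import re

def matches_in_order(sequence, motifs):
    """Return True if every motif occurs in the sequence, in the given order, without overlap."""
    position = 0
    for motif in motifs:
        hit = re.compile(motif).search(sequence, position)
        if hit is None:
            return False
        position = hit.end()  # the next motif must start after this one ends
    return True

def hits_by_motif_count(database, motifs):
    """Count database hits when searching with the first 1, 2, ... motifs."""
    counts = []
    for k in range(1, len(motifs) + 1):
        counts.append(sum(matches_in_order(seq, motifs[:k]) for seq in database.values()))
    return counts

# Hypothetical motifs and toy sequences (not real proteins).
MOTIFS = ["ACD", "GHK", "WYF"]
TOY_DB = {
    "s1": "MMACDLLGHKPPWYFQQ",  # all three motifs, in order
    "s2": "MMACDLLGHKPPQQ",     # first two motifs only
    "s3": "MMACDLLPPQQ",        # first motif only
    "s4": "GHKMMACDWYF",        # all three motifs, but in the wrong order
    "s5": "MMLLPP",             # no motifs
}

print(hits_by_motif_count(TOY_DB, MOTIFS))
```

Sequence `s4` contains all three motifs but is rejected once the second motif is added, because the motifs occur out of order; that is the kind of context a single-motif search cannot use.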

14 Visualising fingerprints: the significance of motif context
[Figure labels: N, C]

15 Creating fingerprints
[Figure labels: signature, database, annotation!; UniProt → PRINTS]

16 Families are hierarchical & hence so are fingerprints
Fingerprints allow us to focus on differences as well as similarities
[Figure labels: loop region, TM domain]

17 Differences yield functional insights
[Figure: motifs K1 to K8 and PTP1 to PTP6, with WPD and HCX5R marked; panels A to D]

18 Perspectives from InterPro: highlighting similarities & differences

19 Similarities are informative They give insights into shared high-level functions

20 Differences are informative They give insights into unique functional specificities The more differences, the more you learn about the tool’s functional niche Protein families are just the same!

21 Examples & cautionary tales
Similarity searches have been the mainstay of functional annotation efforts
– because they allow us to recognise similarities with things we’ve seen before & allow us to transfer characteristics of known to unknown proteins
Results of in silico searches need to be considered carefully
– let’s take a closer look…

22 [Figure labels: opioid receptor (three subtypes; Greek subtype symbols not recoverable); true; Q23293_CAEEL Putative uncharacterized protein]

23 When is a GPCR not an SSR?
Query length: 389 AA
Date run: 2002-10-18 09:08:29 UTC+0100 on sib-blast.unil.ch
Taxon: Homo sapiens
Database: XXswissprot 120,412 sequences; 45,523,583 total letters
SWISS-PROT Release 40.29 of 10-Oct-2002
Db  AC      Description                                                  Score  E-value
sp  Q9UKP6  Q9UKP6 Orphan receptor [Homo sapiens...                      782    0.0
sp  P31391  SSR4_HUMAN Somatostatin receptor type 4 (SS4R) [SSTR4]...    167    3e-41
sp  O43603  GALS_HUMAN Galanin receptor type 2 (GAL2-R) (GALR2) [G...    147    4e-35
sp  P30872  SSR1_HUMAN Somatostatin receptor type 1 (SS1R) (SRIF-2...    144    3e-34
sp  P32745  SSR3_HUMAN Somatostatin receptor type 3 (SS3R) (SSR-28...    140    3e-33
sp  P35346  SSR5_HUMAN Somatostatin receptor type 5 (SS5R) (SSTR5)...    140    6e-33
sp  P30874  SPLICE ISOFORM B of P30874 [SSTR2] [Homo sapiens...          134    3e-31
sp  P30874  SSR2_HUMAN Somatostatin receptor type 2 (SS2R) (SRIF-1...    134    3e-31
sp  P48145  GPR7_HUMAN Neuropeptides B/W receptor type 1 (G protei...    133    7e-31
sp  O60755  GALT_HUMAN Galanin receptor type 3 (GAL3-R) (GALR3) [G...    132    2e-30
sp  P41143  OPRD_HUMAN Delta-type opioid receptor (DOR-1) [OPRD1]...     128    2e-29
sp  P35372  SPLICE ISOFORM 1A of P35372 [OPRM1] [Homo sapien...          125    1e-28
sp  P35372  OPRM_HUMAN Mu-type opioid receptor (MOR-1) [OPRM1] [Ho...    125    1e-28

24 When is a GPCR not an SSR? …when it’s a UR2R
Query length: 389 AA
Date run: 2002-10-18 09:08:29 UTC+0100 on sib-blast.unil.ch
Taxon: Homo sapiens
Database: XXswissprot 120,412 sequences; 45,523,583 total letters
SWISS-PROT Release 40.29 of 10-Oct-2002
Db  AC      Description                                                  Score  E-value
sp  Q9UKP6  UR2R_HUMAN Urotensin II receptor (UR-II-R) [GPR14] [Ho...    782    0.0
sp  P31391  SSR4_HUMAN Somatostatin receptor type 4 (SS4R) [SSTR4]...    167    3e-41
sp  O43603  GALS_HUMAN Galanin receptor type 2 (GAL2-R) (GALR2) [G...    147    4e-35
sp  P30872  SSR1_HUMAN Somatostatin receptor type 1 (SS1R) (SRIF-2...    144    3e-34
sp  P32745  SSR3_HUMAN Somatostatin receptor type 3 (SS3R) (SSR-28...    140    3e-33
sp  P35346  SSR5_HUMAN Somatostatin receptor type 5 (SS5R) (SSTR5)...    140    6e-33
sp  P30874  SPLICE ISOFORM B of P30874 [SSTR2] [Homo sapiens...          134    3e-31
sp  P30874  SSR2_HUMAN Somatostatin receptor type 2 (SS2R) (SRIF-1...    134    3e-31
sp  P48145  GPR7_HUMAN Neuropeptides B/W receptor type 1 (G protei...    133    7e-31
sp  O60755  GALT_HUMAN Galanin receptor type 3 (GAL3-R) (GALR3) [G...    132    2e-30
sp  P41143  OPRD_HUMAN Delta-type opioid receptor (DOR-1) [OPRD1]...     128    2e-29
sp  P35372  SPLICE ISOFORM 1A of P35372 [OPRM1] [Homo sapien...          125    1e-28
sp  P35372  OPRM_HUMAN Mu-type opioid receptor (MOR-1) [OPRM1] [Ho...    125    1e-28

25 UR2R_HUMAN vs GPCRRHODOPSN

26 Perspectives from other resources

27
ID   Q6NV75 PRELIMINARY; PRT; 609 AA.
AC   Q6NV75;
DT   05-JUL-2004 (TrEMBLrel. 27, Created)
DT   05-JUL-2004 (TrEMBLrel. 27, Last sequence update)
DT   05-JUL-2004 (TrEMBLrel. 27, Last annotation update)
DE   G protein-coupled receptor 153.
GN   Name=GPR153;
OS   Homo sapiens (Human).
OX   NCBI_TaxID=9606
RN   [1]
RP   SEQUENCE FROM N.A.
RC   TISSUE=Brain;
RA   Strausberg R.L., Feingold E.A., Grouse L.H., Derge J.G.,
RA   Jones S.J., Marra M.A.;
RT   "Generation and initial analysis of more than 15,000 full-length
RT   human and mouse cDNA sequences.";
RL   Proc. Natl. Acad. Sci. U.S.A. 99:16899-16903(2002).
RP   SEQUENCE FROM N.A.
RC   TISSUE=Brain;
RA   Strausberg R.;
RL   Submitted (MAR-2004) to the EMBL/GenBank/DDBJ databases.
DR   EMBL; BC068275; AAH68275.1; -.
DR   GO; GO:0004872
DR   InterPro; IPR000276; GPCR_Rhodpsn.
DR   Pfam; PF00001; 7tm_1; 1.
DR   PROSITE; PS50262; G_PROTEIN_RECEP_F1_2; 1.
KW   Receptor
SQ   SEQUENCE 609 AA; 65341 MW; E525CC7F60D0891C CRC64;
     MSDERRLPGS AVGWLVCGGL SLLANAWGIL SVGAKQKKWK PLEFLLCTLA ATHMLNVAVP
     IATYSVVQLR RQRPDFEWNE GLCKVFVSTF YTLTLATCFS VTSLSYHRMW MVCWPVNYRL
     SNAKKQAVHT VMGIWMVSFI LSALPAVGWH DTSERFYTHG CRFIVAEIGL GFGVCFLLLV
     GGSVAMGVIC TAIALFQTLA VQVGRQADHR AFTVPTIVVE DAQGKRRSSI DGSEPAKTSL
     QTTGLVTTIV FIYDCLMGFP VLVVSFSSLR ADASAPWMAL CVLWCSVAQA LLLPVFLWAC
     DRYRADLKAV REKCMALMAN DEESDDETSL EGGISPDLVL ERSLDYGYGG DFVALDRMAK
     YEISALEGGL PQLYPLRPLQ EDKMQYLQVP PTRRFSHDDA DVWAAVPLPA FLPRWGSGED
     LAALAHLVLP AGPERRRASL LAFAEDAPPS RARRRSAESL LSLRPSALDS GPRGARDSPP
     GSPRRRPGPG PRSASASLLP DAFALTAFEC EPQALRRPPG PFPAAPAAPD GADPGEAPTP
     PSSAQRSPGP RPSAHSHAGS LRPGLSASWG EPGGLRAAGG GGSTSSFLSS PSESSGYATL
     HSDSLGSAS
//
Pfam: match Q6NV75/24-297 GPCR?
PROSITE (profile): no match
PROSITE (regex): no match
PRINTS: no match
ClustalW: sequences too divergent to be aligned
→ false negative

28 Rhodopsin-like superfamily GPCRs in InterPro 2005
IPR000276  GPCR_Rhodopsn         7,752 proteins
PS50262    G_PROTEIN_RECEP_F1_2  7,702 proteins
PF00001    7tm_1                 7,064 proteins
PS00237    G_PROTEIN_RECEP_F1_1  6,527 proteins
PR00237    GPCRRHODOPSN          5,821 proteins (don’t include partials)

29 Rhodopsin-like superfamily GPCRs in the source databases
Source             FP  FN   U   TP       matches
Pfam               ?   ?    ?   ? 8,776  7,064
PROSITE (profile)  3   3    12  1,837    7,702
PROSITE (regex)    92  261  0   1,530    6,527
PRINTS             0   ?    0   1,154    5,821
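
To make the counts on slide 29 easier to compare, they can be turned into precision and recall. This is a sketch under stated assumptions: FP, FN and TP are read as false positives, false negatives and true positives, the U column is left out of the calculation, and only the two PROSITE rows are used because the Pfam and PRINTS rows contain unknown (?) values. The function names are illustrative.

```python
def precision(tp: int, fp: int) -> float:
    """Fraction of reported matches that are true family members."""
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    """Fraction of true family members that the method reports."""
    return tp / (tp + fn)

# Counts as read from the slide; rows with '?' entries are omitted.
COUNTS = {
    "PROSITE (profile)": {"tp": 1837, "fp": 3, "fn": 3},
    "PROSITE (regex)": {"tp": 1530, "fp": 92, "fn": 261},
}

SUMMARY = {
    name: (precision(c["tp"], c["fp"]), recall(c["tp"], c["fn"]))
    for name, c in COUNTS.items()
}

for name, (p, r) in SUMMARY.items():
    print(f"{name}: precision={p:.3f} recall={r:.3f}")
```

On these figures the regular expression misses roughly one in seven annotated family members, while the profile misses very few.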

30 Rhodopsin-like superfamily GPCRs in InterPro 2006
IPR000276  GPCR_Rhodopsn         14,206 proteins
PS50262    G_PROTEIN_RECEP_F1_2  14,108 proteins
PF00001    7tm_1                 13,148 proteins
PR00237    GPCRRHODOPSN          11,357 proteins
PS00237    G_PROTEIN_RECEP_F1_1  11,109 proteins

31 Rhodopsin-like superfamily GPCRs in InterPro 2009
IPR000276  7TM_GPCR_Rhodopsn     24,039 proteins
PF00001    7tm_1                 23,702 proteins  16,975
PR00237    GPCRRHODOPSN          20,158 proteins  6,660 (incl. partials)
PS00237    G_PROTEIN_RECEP_F1_1  15,939 proteins  1,950
PS50262    G_PROTEIN_RECEP_F1_2  ? proteins       2,390
25,248 GPCR rhodopsin-like superfamily
What does it all mean? How are users supposed to know?
No human curator has time to validate all these matches…

32 The annotation paradox
Without annotation, data are meaningless
But there’s too much data for manual annotation to be practicable
– it took ~600 person-years, over 23 years, to annotate ~500K Swiss-Prot entries
– but... 9 million entries in TrEMBL, 163 million in EMBL?!
So manual annotation is clearly impossible, but is nevertheless a necessary evil
Like PROSITE, therefore, fingerprints are manually annotated prior to inclusion in PRINTS
– & hence, like PROSITE, the database has remained small
– let’s briefly take a closer look…
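
The scale of the problem on slide 32 can be made concrete with a back-of-envelope extrapolation. The sketch below assumes, purely for illustration, that the effort per entry stays at the Swiss-Prot rate quoted on the slide; real effort per entry would vary, and the names used are hypothetical.

```python
# Figures quoted on the slide (all approximate).
SWISSPROT_PERSON_YEARS = 600
SWISSPROT_ENTRIES = 500_000
TREMBL_ENTRIES = 9_000_000
EMBL_ENTRIES = 163_000_000

def person_years_needed(entries: int) -> float:
    """Scale the Swiss-Prot effort linearly to a different number of entries."""
    return SWISSPROT_PERSON_YEARS * entries / SWISSPROT_ENTRIES

TREMBL_EFFORT = person_years_needed(TREMBL_ENTRIES)
EMBL_EFFORT = person_years_needed(EMBL_ENTRIES)

print(f"TrEMBL: ~{TREMBL_EFFORT:,.0f} person-years")
print(f"EMBL:   ~{EMBL_EFFORT:,.0f} person-years")
```

Under that linear assumption, TrEMBL alone would need about 18 times the effort already spent on Swiss-Prot.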

33 Protein family annotation: a PRINTS view

34 Protein family annotation: a PRINTS view
Where do we get this information?
UniProt:Swiss-Prot
PROSITE
InterPro
PubMed/literature
Auto-annotation tools: PRECIS, METIS, BioIE… MINOTAUR

35 Protein family annotation: a PROSITE view

36 Protein family annotation: a PROSITE view

37 Protein family annotation a Pfam view

38 Protein family annotation: an InterPro view
Where does this information come from?

39 In an ideal world…

55 Automatic nonsense!

56 Conclusions
Similarity searches have been the mainstay of functional annotation efforts
– because they reduce a complex problem to a more tractable one, i.e., identifying & quantifying relationships between sequences
But identifying similarity between sequences is not the same as identifying their functions
Failure to appreciate this fundamental point has generated numerous annotation errors in our databases
– & in the literature!

57 Conclusions
In characterising unknown sequences, it is wise to run pairwise & family-based searches
– top hits aren’t always the most biologically significant
– BLAST/FastA/profiles/HMMs offer broad brush strokes
– motif-based methods add fine detail
– no method alone is best (they all have limitations)
– different methods give different perspectives
The differences revealed by these perspectives are often more important than the similarities they uncover
– differences may shed light on unique functional determinants
Never lose sight of the underlying biology!

58 Rhodopsin - rod cell, achromatic receptor Opsin - green-sensitive cone photoreceptor

59 Argininosuccinate lyase - amino acid biosynthesis Delta crystallin - non-enzymatic, structural eye-lens protein

60 Hands On
Review the UniProt entry Q6NV75: http://www.uniprot.org/uniprot/Q6NV75
Submit this to ScanProsite: http://www.expasy.ch/tools/scanprosite
Submit this to FingerPRINTScan: http://www.bioinf.manchester.ac.uk/cgi-bin/dbbrowser/fingerPRINTScan/muppet/FPScan_fam.cgi
Submit this to GraphScan: http://www.bioinf.manchester.ac.uk/cgi-bin/dbbrowser/fingerPRINTScan/muppet/GRAPHScan.cgi
Access Utopia: http://utopia.cs.manchester.ac.uk/

