Bioinformatics approaches for… Teresa K Attwood Faculty of Life Sciences & School of Computer Science University of Manchester, Oxford Road Manchester.


1 Bioinformatics approaches for… Teresa K Attwood Faculty of Life Sciences & School of Computer Science University of Manchester, Oxford Road Manchester M13 9PT, UK

2 ….analysing GPCRs….

3 ….which craft is best?

4 Overview What are GPCRs? –why they’re interesting & important –why bioinformatics approaches are important In silico function prediction –a reality check Family-based methods for characterising GPCRs Understanding the tools –problems with pair-wise & family-based approaches –estimating (biological) significance Seeking deeper functional insights Conclusions

5 What are GPCRs? G protein-coupled receptors A functionally diverse family of cell-surface 7TM proteins Functional diversity achieved via –interaction with a variety of ligands –stimulation of various intracellular pathways via coupling to different G proteins (figure labels: GDP, GTP)

6 Why are GPCRs interesting? Attwood, TK & Flower, DR (2002) Trawling the genome for G protein-coupled receptors: the importance of integrating bioinformatic approaches. In Drug Design – Cutting Edge Approaches, pp They are ubiquitous –>800 GPCR genes in the human genome, from 3 major superfamilies: rhodopsin-, secretin- & metabotropic glutamate receptor-like Share almost no sequence similarity –but are united by common 7TM architecture Constitute a complex multi-gene family –populated by >50 families & >350 subtypes

7 Isn’t just stamp collecting! Attwood, TK & Flower, DR (2002) Trawling the genome for G protein-coupled receptors: the importance of integrating bioinformatic approaches. In Drug Design – Cutting Edge Approaches, pp GPCRs are of profound biomedical importance –targets for >50% of prescription drugs –yield sales >$16 billion/annum: they’re big business! Given their importance, we need to –characterise the ones we know about –identify new ones & discover what they do! –e.g., as potential new drug targets

8 Why studying GPCRs is difficult Only 2 crystal structures available –bovine rhodopsin (2000) & human β2-adrenergic receptor (2007) Many GPCRs haven’t been characterised experimentally –remain ‘orphans’, with unknown ligand specificity With >800 human GPCRs, this isn’t much to go on!

9 Why use bioinformatics approaches? Computational approaches are important –can be used to help identify, characterise & model novel receptors usually by similarity & extrapolation of known characteristics Bioinformatics thus offers complementary tools for elucidating the structures & functions of receptors But the task is non-trivial –GPCRs exhibit rich relationships & complex molecular interactions present many challenges for in silico analysis –in trying to derive meaningful functional insights, traditional methods are likely to be limited

10 We’ve been using biology-unaware search tools to analyse such complex systems How far can we truly expect to understand cellular function with such naïve approaches…?

11 In silico function prediction …a reality check What is the function of this structure? What is the function of this sequence? What is the function of this motif? –the fold provides a scaffold, which can be decorated in different ways by different sequences to confer different functions - knowing the fold & function allows us to rationalise how the structure effects its function at the molecular level

12 “A test case for structural genomics: Structure-based assignment of the biochemical function of hypothetical protein mj0577” (Zarembinski et al., PNAS) Although the structure co-crystallised with ATP, the biochemical function of the protein is unknown

13 What's in a sequence?

14 Methods for family analysis
Single motif methods: exact regex (PROSITE), fuzzy regex (eMOTIF)
Multiple motif methods: identity matrices (PRINTS), weight matrices (Blocks)
Full domain alignment methods: profiles (Profile Library), HMMs (Pfam)
Attwood, TK (2000). The quest to deduce protein function from sequence: the role of pattern databases. Int. J. Biochem. Cell Biol., 32(2), 139–155.

15 The challenge of family analysis highly divergent family with single function? superfamily with many diverse functional families? –must distinguish if function analysis done in silico –a tough challenge!

16 In the beginning was PROSITE [GSTALIVMYWC]-[GSTANCPDE]-{EDPKRH}-X(2)-[LIVMNQGA]-X(2)-[LIVMFT]-[GSTANC]-[LIVMFYWSTAC]-[DENH]-R (figure label: TM domain)

17 Diagnostic limitations of PROSITE
ID   G_PROTEIN_RECEP_F1_1; PATTERN.
AC   PS00237;
DT   APR-1990 (CREATED); NOV-1997 (DATA UPDATE); SEP-2004 (INFO UPDATE).
DE   G-protein coupled receptors family 1 signature.
PA   [GSTALIVMFYWC]-[GSTANCPDE]-{EDPKRH}-x(2)-[LIVMNQGA]-x(2)-[LIVMFT]-
PA   [GSTANC]-[LIVMFYWSTAC]-[DENH]-R-[FYWCSH]-x(2)-[LIVM].
NR   /RELEASE=44.6,159201;
NR   /TOTAL=1622(1621); /POSITIVE=1530(1529); /UNKNOWN=0(0);
NR   /FALSE_POS=92(92); /FALSE_NEG=261; /PARTIAL=61;
This represents an apparent 22% error rate –the actual rate is probably higher
Thus, a match to a pattern is not necessarily true –& a mis-match is not necessarily false!
False-negatives are a fundamental limitation to this type of pattern matching –if you don't know what you're looking for, you'll never know you missed it!
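To make the pattern and the error figure concrete, here is a minimal sketch that translates the PS00237 pattern quoted on this slide into a Python regular expression and recomputes the error rate. Two things are assumptions rather than anything stated on the slide: the helper names (`prosite_to_regex`, `apparent_error_rate`) are made up for illustration, and the 22% is assumed to be (false positives + false negatives) / total, which does round to 22% with the NR numbers above. The converter handles only the syntax that appears in this pattern (classes, `{}` exclusions, `x`, repeat counts), not anchors or other PROSITE features.

```python
import re

# PA lines of PS00237 as quoted on the slide, joined into one pattern.
GPCR_F1_PATTERN = (
    "[GSTALIVMFYWC]-[GSTANCPDE]-{EDPKRH}-x(2)-[LIVMNQGA]-x(2)-[LIVMFT]-"
    "[GSTANC]-[LIVMFYWSTAC]-[DENH]-R-[FYWCSH]-x(2)-[LIVM]."
)


def prosite_to_regex(pattern: str) -> str:
    """Translate a simple PROSITE pattern into a Python regex (illustrative helper)."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # Separate an optional repeat count, e.g. x(2) or x(2,4).
        m = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, repeat = m.group(1), m.group(2)
        if core in ("x", "X"):
            core = "."                       # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"   # residues that must NOT occur
        # [ABC] classes and single letters are already valid regex syntax.
        parts.append(core + ("{" + repeat + "}" if repeat else ""))
    return "".join(parts)


def apparent_error_rate(false_pos: int, false_neg: int, total: int) -> float:
    """Assumed definition: wrong calls (FP + FN) as a fraction of all sequences."""
    return (false_pos + false_neg) / total


regex = prosite_to_regex(GPCR_F1_PATTERN)
print(regex)
print(f"{apparent_error_rate(92, 261, 1622):.1%}")
```

The regex either matches or it does not. A sequence that differs from the pattern at a single position (a false negative) is simply never reported, which is the limitation the slide describes.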

18 Where do motifs (fingerprints) fit in? (fingerprints are hierarchical) loop region TM domain

19 Rhodopsin-like superfamily, family & subtype GPCRs in PRINTS Attwood, TK (2001) A compendium of specific motifs for diagnosing GPCR subtypes. TiPS, 22(4),

20 Searching PRINTS - FingerPRINTScan Scordis, P, Flower, DR & Attwood, TK (1999) FingerPRINTScan: intelligent searching of the PRINTS motif database. Bioinformatics, 15, GPCR fingerprints are embedded in PRINTS –allows diagnosis of GPCR mosaics

21

22 Visualising fingerprints Attwood, TK & Findlay, JBC (1993) Design of a discriminating fingerprint for G-protein-coupled receptors. Protein Eng., 6(2), 167–176. (figure labels: N, C)

23 Visualising fingerprints Attwood, TK & Findlay, JBC (1993) Design of a discriminating fingerprint for G-protein-coupled receptors. Protein Eng., 6(2), 167–176. N C

24 Diagnosing partial matches Missed by PROSITE –wasn’t annotated as a FN

25 An integrated approach Mulder, NJ, Apweiler, R, Attwood, TK, Bairoch, A et al. (2007) New developments in InterPro. NAR, 35, D To simplify sequence analysis, the family dbs were integrated within a unified annotation resource – InterPro –initial partners were PRINTS, PROSITE, profiles & Pfam; now many more partners –linked to its satellite dbs but lags behind their coverage –by Oct 2007, it had 14,768 entries & covered 76% of UniProtKB; major role in fly & human genome annotation

26 InterPro – method comparison

27 Where has this got us?

28 Understanding the tools …estimating significance How do we know what to believe? Let’s explore some of the difficulties that arise when pair-wise search tools (BLAST & FastA) & family-based methods are used naïvely –these examples caution us to think about what the results actually mean in biological terms…

29 Identifying sequence similarity GPCRs present many challenges for in silico functional analysis Several signature-based methods now available –with different areas of optimum application Yet naïve, pair-wise similarity searching has been the mainstay of functional annotation efforts –it allows us to identify/quantify relationships between sequences But quantifying similarity between sequences is not the same as identifying their functions

30 Problems with pairwise similarity tools Gaulton, A & Attwood, TK (2003) Bioinformatics approaches for the classification of G protein-coupled receptors. Current Opinion in Pharmacology, 3, For identifying precise families to which receptors belong & the ligands they bind, pair-wise tools are limited –at what level of seq ID is ligand specificity conserved? some GPCRs with 25% ID share a common ligand; others, with greater levels, don’t… It may be impossible to tell from BLAST if an orphan belongs to a known family (the top hit), or if it will bind a novel ligand –e.g., for the now de-orphaned UR2R, BLAST indicates most similarity to the type 4 SSRs, yet it is known to bind a different (related) ligand

31 When is a GPCR not an SSR?
Query length: 389 AA
Date run: :08:29 UTC+0100 on sib-blast.unil.ch
Taxon: Homo sapiens
Database: XXswissprot 120,412 sequences; 45,523,583 total letters SWISS-PROT Release of 10-Oct-2002
Db AC Description Score E-value
sp Q9UKP6 Q9UKP6 Orphan receptor [Homo sapiens
sp P31391 SSR4_HUMAN Somatostatin receptor type 4 (SS4R) [SSTR4] e-41
sp O43603 GALS_HUMAN Galanin receptor type 2 (GAL2-R) (GALR2) [G e-35
sp P30872 SSR1_HUMAN Somatostatin receptor type 1 (SS1R) (SRIF e-34
sp P32745 SSR3_HUMAN Somatostatin receptor type 3 (SS3R) (SSR e-33
sp P35346 SSR5_HUMAN Somatostatin receptor type 5 (SS5R) (SSTR5) e-33
sp P30874 SPLICE ISOFORM B of P30874 [SSTR2] [Homo sapiens e-31
sp P30874 SSR2_HUMAN Somatostatin receptor type 2 (SS2R) (SRIF e-31
sp P48145 GPR7_HUMAN Neuropeptides B/W receptor type 1 (G protei e-31
sp O60755 GALT_HUMAN Galanin receptor type 3 (GAL3-R) (GALR3) [G e-30
sp P41143 OPRD_HUMAN Delta-type opioid receptor (DOR-1) [OPRD1] e-29
sp P35372 SPLICE ISOFORM 1A of P35372 [OPRM1] [Homo sapien e-28
sp P35372 OPRM_HUMAN Mu-type opioid receptor (MOR-1) [OPRM1] [Ho e-28

32 When is a GPCR not an SSR? …when it’s a UR2R
Query length: 389 AA
Date run: :08:29 UTC+0100 on sib-blast.unil.ch
Taxon: Homo sapiens
Database: XXswissprot 120,412 sequences; 45,523,583 total letters SWISS-PROT Release of 10-Oct-2002
Db AC Description Score E-value
sp Q9UKP6 UR2R_HUMAN Urotensin II receptor (UR-II-R) [GPR14] [Ho
sp P31391 SSR4_HUMAN Somatostatin receptor type 4 (SS4R) [SSTR4] e-41
sp O43603 GALS_HUMAN Galanin receptor type 2 (GAL2-R) (GALR2) [G e-35
sp P30872 SSR1_HUMAN Somatostatin receptor type 1 (SS1R) (SRIF e-34
sp P32745 SSR3_HUMAN Somatostatin receptor type 3 (SS3R) (SSR e-33
sp P35346 SSR5_HUMAN Somatostatin receptor type 5 (SS5R) (SSTR5) e-33
sp P30874 SPLICE ISOFORM B of P30874 [SSTR2] [Homo sapiens e-31
sp P30874 SSR2_HUMAN Somatostatin receptor type 2 (SS2R) (SRIF e-31
sp P48145 GPR7_HUMAN Neuropeptides B/W receptor type 1 (G protei e-31
sp O60755 GALT_HUMAN Galanin receptor type 3 (GAL3-R) (GALR3) [G e-30
sp P41143 OPRD_HUMAN Delta-type opioid receptor (DOR-1) [OPRD1] e-29
sp P35372 SPLICE ISOFORM 1A of P35372 [OPRM1] [Homo sapien e-28
sp P35372 OPRM_HUMAN Mu-type opioid receptor (MOR-1) [OPRM1] [Ho e-28

33

34 The trouble with top hits The most statistically significant hit is not always the most biologically relevant Yet many rule-based ‘expert systems’ still rely on top BLAST or FastA hits to make their diagnoses BLAST/FastA ‘see’ generic similarity & not the often-subtle differences that constitute the functional determinants between closely-related receptor families & subtypes Failure to appreciate this fundamental point has generated numerous annotation errors in our databases
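The point about top hits can be illustrated with the hit list from the UR2R search on slides 31 and 32. The sketch below is only an illustration under stated assumptions: the hit data are transcribed by hand from those slides (splice-isoform rows omitted), the E-values survive in the transcript only as exponents (the mantissas were lost), the family labels are read off the hit descriptions, and the function names are hypothetical.

```python
from collections import Counter

# (entry name, ligand family from the description, E-value exponent).
# Transcribed from the BLAST listing for the urotensin II receptor query;
# only the exponents of the E-values are legible in the transcript.
HITS = [
    ("SSR4_HUMAN", "somatostatin", -41),
    ("GALS_HUMAN", "galanin", -35),
    ("SSR1_HUMAN", "somatostatin", -34),
    ("SSR3_HUMAN", "somatostatin", -33),
    ("SSR5_HUMAN", "somatostatin", -33),
    ("SSR2_HUMAN", "somatostatin", -31),
    ("GPR7_HUMAN", "neuropeptides B/W", -31),
    ("GALT_HUMAN", "galanin", -30),
    ("OPRD_HUMAN", "opioid", -29),
    ("OPRM_HUMAN", "opioid", -28),
]


def top_hit_family(hits):
    """Naive rule: transfer the annotation of the single best-scoring hit."""
    return min(hits, key=lambda h: h[2])[1]


def families_within(hits, window):
    """Count families among hits within `window` orders of magnitude of the best hit."""
    best = min(exp for _, _, exp in hits)
    return Counter(fam for _, fam, exp in hits if exp - best <= window)


print(top_hit_family(HITS))        # the naive call: a somatostatin receptor
print(families_within(HITS, 10))   # several ligand families sit close behind
```

The naive rule labels the query a somatostatin receptor, yet the query is the urotensin II receptor. Listing the families that crowd in behind the top hit does not give the right answer either, but it does show that the top hit is one of several plausible relatives rather than a diagnosis.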

35 Misleading annotation via FastA (figure labels: three opioid receptor subtypes; “true”)

36 Misleading results from BLAST As we’ve seen, it’s tempting to use top hits from BLAST or FastA results to classify unknown proteins –but this may lead us (& especially computer programs) to false functional conclusions PSI-BLAST is more sensitive than BLAST, because it creates a profile from hits above a given threshold –but this too can cause problems –let’s take a closer look

37

38 So, is UL78 a GPCR? & if so, what sort?

39 What PSI-BLAST said (profile dilution in action)

40 What GeneQuiz said… a thrombin receptor

41 What GeneQuiz said later…

42 Overview of results pair-wise & family-based methods

43 What is UL78?
Tool: No hit / Poor hit / Significant hit
BLAST: GPCRs in list
PSI-BLAST: thrombin receptor; chemokine & opioid receptors
PROSITE profile: GPCR
Pfam:
PRINTS:
Blocks-PRINTS: GPCR
GeneQuiz: thrombin receptor; C5A receptor
Bioinformatics tools, alone, cannot tell us!

44 So, beware top hits …but also beware bottom hits! Let us now compare & contrast some InterPro results with those of its source dbs…

45 Rhodopsin-like superfamily GPCRs in InterPro 2005
IPR000276 GPCR_Rhodopsn: 7752 proteins
PS50262 G_PROTEIN_RECEP_F1_2: 7702 proteins
PF00001 7tm_1: … proteins
PS00237 G_PROTEIN_RECEP_F1_1: 6527 proteins
PR00237 GPCRRHODOPSN: 5821 proteins (don’t include partials)

46 Rhodopsin-like superfamily GPCRs in the source databases
Pfam: FP ?, FN ?, U ?, TP? 8776; matches 7064
PROSITE (profile): FP 3, FN 3, U 12, TP 1837; matches 7702
PROSITE (regex): FP 92, FN 261, U 0, TP 1530; matches 6527
PRINTS: FP 0, FN ?, U 0, TP 1154; matches 5821
>2165 updated

47 Rhodopsin-like superfamily GPCRs in InterPro 2007
IPR000276 GPCR_Rhodopsn: 16,845 proteins
PS50262 G_PROTEIN_RECEP_F1_2: 16,714 proteins
PF00001 7tm_1: 15,712 proteins
PR00237 GPCRRHODOPSN: 13,405 proteins
PS00237 G_PROTEIN_RECEP_F1_1: 13,723 proteins
No human curator has time to validate all these matches…

48 14,615 rhodopsin-like superfamily GPCRs in Pfam?

49 GPCR?
Pfam: match Q6NV75/
PROSITE (profile): no match
PROSITE (regex): no match
PRINTS: no match
ClustalW – sequences too divergent to be aligned
false negative
ID   Q6NV75 PRELIMINARY; PRT; 609 AA.
AC   Q6NV75;
DT   05-JUL-2004 (TrEMBLrel. 27, Created)
DT   05-JUL-2004 (TrEMBLrel. 27, Last sequence update)
DT   05-JUL-2004 (TrEMBLrel. 27, Last annotation update)
DE   G protein-coupled receptor 153.
GN   Name=GPR153;
OS   Homo sapiens (Human).
OX   NCBI_TaxID=9606
RN   [1]
RP   SEQUENCE FROM N.A.
RC   TISSUE=Brain;
RA   Strausberg R.L., Feingold E.A., Grouse L.H., Derge J.G.,
RA   Jones S.J., Marra M.A.;
RT   "Generation and initial analysis of more than 15,000 full-length
RT   human and mouse cDNA sequences.";
RL   Proc. Natl. Acad. Sci. U.S.A. 99: (2002).
RP   SEQUENCE FROM N.A.
RC   TISSUE=Brain;
RA   Strausberg R.;
RL   Submitted (MAR-2004) to the EMBL/GenBank/DDBJ databases.
DR   EMBL; BC068275; AAH ; -.
DR   GO; GO:
DR   InterPro; IPR000276; GPCR_Rhodpsn.
DR   Pfam; PF00001; 7tm_1; 1.
DR   PROSITE; PS50262; G_PROTEIN_RECEP_F1_2; 1.
KW   Receptor
SQ   SEQUENCE 609 AA; MW; E525CC7F60D0891C CRC64;
     MSDERRLPGS AVGWLVCGGL SLLANAWGIL SVGAKQKKWK PLEFLLCTLA ATHMLNVAVP
     IATYSVVQLR RQRPDFEWNE GLCKVFVSTF YTLTLATCFS VTSLSYHRMW MVCWPVNYRL
     SNAKKQAVHT VMGIWMVSFI LSALPAVGWH DTSERFYTHG CRFIVAEIGL GFGVCFLLLV
     GGSVAMGVIC TAIALFQTLA VQVGRQADHR AFTVPTIVVE DAQGKRRSSI DGSEPAKTSL
     QTTGLVTTIV FIYDCLMGFP VLVVSFSSLR ADASAPWMAL CVLWCSVAQA LLLPVFLWAC
     DRYRADLKAV REKCMALMAN DEESDDETSL EGGISPDLVL ERSLDYGYGG DFVALDRMAK
     YEISALEGGL PQLYPLRPLQ EDKMQYLQVP PTRRFSHDDA DVWAAVPLPA FLPRWGSGED
     LAALAHLVLP AGPERRRASL LAFAEDAPPS RARRRSAESL LSLRPSALDS GPRGARDSPP
     GSPRRRPGPG PRSASASLLP DAFALTAFEC EPQALRRPPG PFPAAPAAPD GADPGEAPTP
     PSSAQRSPGP RPSAHSHAGS LRPGLSASWG EPGGLRAAGG GGSTSSFLSS PSESSGYATL
     HSDSLGSAS
//

50 Beware top & bottom hits …but also beware simplistic analysis tools coupled with wet experiments! Let’s finally look at how hydropathy profiles can compel biologists to make strange deductions… - & still get their results published in Science!
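For readers unfamiliar with hydropathy profiles, the following is a minimal sketch of the kind of calculation involved: a sliding-window average over the Kyte-Doolittle hydropathy scale, followed by a count of stretches that exceed a cut-off. The scale values are the published Kyte-Doolittle ones; the 19-residue window and the 1.6 cut-off are commonly quoted defaults for spotting membrane-spanning stretches and are assumptions here, as are the function names. Nothing in this sketch reproduces the specific analysis the slide alludes to.

```python
# Kyte-Doolittle hydropathy scale (one value per amino acid).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(seq, window=19):
    """Mean hydropathy of every full window along the sequence."""
    scores = [KD[aa] for aa in seq]
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(scores) - window + 1)
    ]


def count_hydrophobic_segments(seq, window=19, threshold=1.6):
    """Count contiguous runs of windows whose mean hydropathy reaches the cut-off."""
    segments, inside = 0, False
    for value in hydropathy_profile(seq, window):
        above = value >= threshold
        if above and not inside:
            segments += 1      # a new hydrophobic stretch starts here
        inside = above
    return segments


# Toy sequence: two hydrophobic blocks separated by charged residues.
toy = "D" * 30 + "L" * 25 + "D" * 30 + "L" * 25 + "D" * 30
print(count_hydrophobic_segments(toy))
```

The calculation is arithmetic on a lookup table. A count of seven hydrophobic stretches says a protein may cross the membrane seven times; it says nothing about whether it couples to G proteins, which is why a profile alone is weak evidence for calling something a GPCR.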

51 GPCR?
Pfam: Lanthionine synthetase C-like protein
PROSITE (profile): no match
PROSITE (regex): no match
PRINTS: no match
ClustalW – sequences too divergent to be aligned
ID   Q9C929_ARATH Unreviewed; 401 AA.
AC   Q9C929;
DT   01-JUN-2001, integrated into UniProtKB/TrEMBL.
DT   01-JUN-2001, sequence version 1.
DT   24-JUL-2007, entry version 23.
DE   Putative G protein-coupled receptor;
GN   Name=F14G24.19; OrderedLocusNames=At1g52920;
OS   Arabidopsis thaliana (Mouse-ear cress).
OC   Eukaryota; Viridiplantae; Streptophyta;... Arabidopsis.
OX   NCBI_TaxID=3702;
RN   [1]
RP   NUCLEOTIDE SEQUENCE.
RA   Lin X., Kaul S., Town C.D., Benito M., Creasy T.H., Haas B.J., Wu D.,
RA   Maiti R., Ronning C.M., Koo H., Fujii C.Y., Utterback T.R.,
RA   Barnstead M.E., Bowman C.L., White O., Nierman W.C., Fraser C.M.;
RT   "Arabidopsis thaliana chromosome 1 BAC F14G24 genomic sequence.";
RL   Submitted (DEC-1999) to the EMBL/GenBank/DDBJ databases.
RN   [2]
RP   NUCLEOTIDE SEQUENCE.
RA   Town C.D., Kaul S.;
RL   Submitted (JAN-2001) to the EMBL/GenBank/DDBJ databases.
DR   EMBL; AC019018; AAG ; -; Genomic_DNA. [EMBL / GenBank / DDBJ]
DR   PIR; E96570; E
DR   UniGene; At.66935; -.
DR   GenomeReviews; CT485782_GR; AT1G
DR   KEGG; ath:At1g52920; -.
DR   TAIR; At1g52920; -.
DR   GO; GO: ; F:receptor activity; IEA:UniProtKB-KW.
DR   InterPro; IPR007822; LANC_like.
DR   InterPro; Graphical view of domain structure.
DR   Pfam; PF05147; LANC_like; 1.
KW   Receptor.
SQ   SEQUENCE 401 AA; MW; C9D3BF8CC8F0FE0B CRC64;
     MPEFVPEDLS GEEETVTECK DSLTKLLSLP YKSFSEKLHR YALSIKDKVV WETWERSGKR
     VRDYNLYTGV LGTAYLLFKS YQVTRNEDDL KLCLENVEAC DVASRDSERV TFICGYAGVC
     ALGAVAAKCL GDDQLYDRYL ARFRGIRLPS DLPYELLYGR AGYLWACLFL NKHIGQESIS
     SERMRSVVEE IFRAGRQLGN KGTCPLMYEW HGKRYWGAAH GLAGIMNVLM HTELEPDEIK
     DVKGTLSYMI QNRFPSGNYL SSEGSKSDRL VHWCHGAPGV ALTLVKAAQV YNTKEFVEAA
     MEAGEVVWSR GLLKRVGICH GISGNTYVFL SLYRLTRNPK YLYRAKAFAS FLLDKSEKLI
     SEGQMHGGDR PFSLFEGIGG MAYMLLDMND PTQALFPGYE L
//

52 Remember: Computers don’t do biology! They do sums (quickly) & crude string matching

53 Seeking deeper functional insights Attwood, TK, Croning, MD & Gaulton, A (2002) Deriving structural and functional insights from a ligand-based hierarchical classification of G protein-coupled receptors. Protein Eng., 15, S’family, family & subtype motifs have different locations If s’family motifs define the common scaffold, hypothesis: –family motifs relate to ligand binding? –subtype motifs relate to G protein coupling? –powerful tools for subtyping & potentially de-orphaning GPCRs

54 Locations of ligand-binding residues & motif distribution

55 Locations of G protein-coupling residues & distribution of motifs Subtype motifs & # of fingerprints mapping to each region # G protein coupling regions & # of families mapping to each region

56 Seeking deeper functional insights? Attwood, TK, Croning, MD & Gaulton, A (2002) Deriving structural and functional insights from a ligand-based hierarchical classification of G protein-coupled receptors. Protein Eng., 15, Clearly, many family- & subtype motifs are simply in the ‘wrong’ place for the initial hypothesis to be true (figure labels: Muscarinic receptors; Muscarinic receptor M5; GPCR superfamily)

57 Refining the hypothesis Besides, it’s not that simple –only part of the answer Need to consider that GPCRs don’t function in isolation –their functions are modulated via interactions with other proteins Also, the phenomenon of dimerisation challenges the view of the GPCR monomer as functional unit –many GPCRs exist as homo- & heterodimers Such observations demand a more systematic analysis of motifs & their likely functional roles

58 Oligomerisation & protein-protein interaction residues/regions A pilot study with adrenergic, bradykinin & dopamine receptors family-level motifs subfamily-level motifs residues involved in oligomerisation residues involved in protein-protein interaction residues involved in G protein coupling residues involved in ligand binding

59 Where next? Based on location, some family-level motifs couldn’t be involved in ligand binding & some subtype-level motifs couldn’t be involved in G protein coupling –clearly, 3D location must be taken into account functional correlations would then be stronger The remaining motifs are likely to be involved in other molecular interactions –e.g., dimerisation, effector proteins….(early results promising) this will help us to build a knowledge-based system to help suggest the likely functional roles for family- & subtype-level motifs in future

60 Conclusions There are many barriers to success for the jobbing bioinformatician, e.g.: –not fully understanding the processes we’re trying to model & predict (e.g., protein folding) –the dynamic nature of biological data –not having been rigorous in the way we define &/or describe biology/biological processes in the literature –the volume of data, data heterogeneity –maintenance of data, propagation of errors… Possibly the largest hurdle is that computers are number crunchers –they don’t do biology, & trying to teach them is hard –& the harder we try, the clearer it is how naïve we’ve been

61 Conclusions In silico functional annotation requires several dbs to be searched & several tools to be used –different methods provide different perspectives –dbs aren’t complete & their contents don’t fully overlap The more dbs searched, the harder it is to interpret results The more computers are involved in automating annotation, the greater the need for collaboration –especially between s/w developers, annotators & ‘wet’ experimentalists The more data we have, the more rigorous we must be in thinking/writing if we are to make sense of the complexities
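As a small illustration of why several databases need to be searched and compared, the sketch below groups the per-method results for one protein and flags the entry when no method supports the family named in its annotation. The data are the results shown on slide 51 for Q9C929 (annotated "Putative G protein-coupled receptor"); the function names and the exact-string comparison are assumptions made for this example, and a real pipeline would need a proper mapping between database entries and families.

```python
def group_by_verdict(results):
    """Map each verdict (hit description, or 'no match') to the methods reporting it."""
    grouped = {}
    for method, hit in results.items():
        grouped.setdefault(hit or "no match", []).append(method)
    return grouped


def needs_review(results, annotated_as):
    """Flag an entry when no method reports the family named in its annotation."""
    return annotated_as not in group_by_verdict(results)


# Results for Q9C929 as listed on slide 51 (None means "no match").
Q9C929 = {
    "Pfam": "Lanthionine synthetase C-like protein",
    "PROSITE (profile)": None,
    "PROSITE (regex)": None,
    "PRINTS": None,
}

for verdict, methods in group_by_verdict(Q9C929).items():
    print(f"{verdict}: {', '.join(methods)}")
print("needs review:", needs_review(Q9C929, "G protein-coupled receptor"))
```

A check like this only tells a curator where the methods disagree with the annotation. Deciding which of them is right still needs someone who understands the biology.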

62 Conclusions Flower DR & Attwood, TK (2004) Integrative bioinformatics for functional genome annotation: trawling for G protein-coupled receptors. Semin Cell Dev Biol., 15(6), For GPCRs, there are many analysis tools available –BLAST, FastA, family databases, modelling tools, etc. We must understand the limitations of the methods –no method is infallible or able to replace the need for biological validation –use all available resources & understand their problems – none is best! Used wisely, bioinformatics tools are useful –BLAST/FastA offer broad brush strokes, motif-methods add fine detail –together, they facilitate receptor characterisation & prediction of ligand specificity, & allow identification of novel ligand-binding, G protein-coupling or other likely molecular interaction motifs We are a long way from having reliable tools for deducing GPCR function & structure from sequence –but with the right approach, there is hope

63

