Improving compound–protein interaction prediction by building up highly credible negative samples Toward more realistic drug-target interaction predictions.

1 Improving compound–protein interaction prediction by building up highly credible negative samples
Source: Bioinformatics 31(12), 2015: i221-i229
Authors: Hui Liu, Jianjiang Sun, Jihong Guan, Jie Zheng and Shuigeng Zhou

Toward more realistic drug-target interaction predictions
Source: Briefings in Bioinformatics, April 2014: 325-337
Authors: Tapio Pahikkala, Antti Airola, Sami Pietila, Sushil Shakyawar, Agnieszka Szwajda, Jing Tang and Tero Aittokallio

Presented by Shiang-Yin Shih

2 Outline
- Background
- Introduction
- Materials
- Methods
- Results
- Discussion and conclusion

3 Background
- Popular benchmarking data set of binary drug-target interactions, covering four target classes: (1) enzyme, (2) ion channel, (3) nuclear receptor, (4) G protein-coupled receptor
- Four factors that may lead to dramatic differences in the prediction results: (1) problem formulation, (2) evaluation data set, (3) evaluation procedure, (4) experimental setting

4 Introduction (1/2)
- Computational prediction of compound–protein interactions (CPIs) is of great importance for drug design and development.
- Genome-scale experimental validation of CPIs is not only time-consuming but also prohibitively expensive.
- Traditional computational approaches fall roughly into two categories: structure-based and ligand-based.

6 Introduction (2/2)
- Structure-based methods require protein 3D structures, which are unavailable for most protein families.
- Ligand-based methods perform poorly for proteins with few or no known ligands.
- This article aims at building up a set of highly credible negative samples of CPIs via an in silico screening method.

7 Materials (1/6) - Compound–protein interaction
- Compound–protein interactions (CPIs) were retrieved from DrugBank 4.1, Matador and STITCH 4.0.
- DrugBank and Matador are manually curated databases; STITCH is a comprehensive database that collects CPIs from four different sources: experiments, databases, text mining and predicted interactions.

8 Materials (2/6) - Chemical structure similarity
- Chemical structures (also referred to as fingerprints) of drugs were obtained from the PubChem database.
- The Jaccard score of the fingerprints was used as the chemical structure similarity between compounds.
- The Jaccard score between compounds c and c' is defined as $J(c, c') = \frac{|c \cap c'|}{|c \cup c'|}$, where each compound is treated as the set of substructures it contains.
- In total, 821 kinds of substructures were used in the analysis for human and C. elegans.
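A minimal sketch of the Jaccard score on fingerprints, assuming each fingerprint is stored as the set of indices of the substructures present in the compound. The function name, the example index values and the handling of two empty fingerprints are illustrative assumptions, not taken from the paper.

```python
def jaccard(a, b):
    """Jaccard score |a ∩ b| / |a ∪ b| of two collections of feature indices."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Assumption: two empty fingerprints are given similarity 0.
        return 0.0
    return len(a & b) / len(union)


# Hypothetical fingerprints: indices of substructures present in each compound.
fp_c = {3, 17, 42, 101}
fp_c_prime = {3, 42, 250}

# Two shared substructures out of five distinct ones.
print(jaccard(fp_c, fp_c_prime))
```

The same function applies unchanged to any set-valued feature, such as side effects or GO terms.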

9 Materials (3/6) - Side effect similarity
- Side effects of drugs were downloaded from the SIDER database.
- The side effect similarity of each pair of drugs was computed as the Jaccard score of their known side effects or, when these are unknown, of their top 10 predicted side effects.

10 Materials (4/6) - Sequence similarity
- Amino acid sequences of proteins were obtained from the UCSC Table Browser.
- Sequence similarity between proteins was computed using a normalized version of the Smith–Waterman score.
- The normalized Smith–Waterman score between two proteins g and g' is $s(g, g') = \frac{SW(g, g')}{\sqrt{SW(g, g)}\,\sqrt{SW(g', g')}}$, where $SW(\cdot, \cdot)$ means the original Smith–Waterman score.
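A minimal sketch of the normalization step, assuming the commonly used form in which the raw score of a pair is divided by the geometric mean of the two self-alignment scores. The raw Smith–Waterman scores are taken as given (they would come from an alignment tool); the dictionary layout, protein names and score values below are hypothetical.

```python
import math


def normalized_sw(raw, g, h):
    """Normalize a raw Smith-Waterman score by the two self-alignment scores.

    `raw` maps (protein, protein) pairs to original Smith-Waterman scores
    and must contain the self-scores (g, g) and (h, h).
    """
    return raw[(g, h)] / (math.sqrt(raw[(g, g)]) * math.sqrt(raw[(h, h)]))


# Hypothetical raw scores for two proteins.
raw_scores = {
    ("P1", "P1"): 100.0,
    ("P2", "P2"): 400.0,
    ("P1", "P2"): 50.0,
}

# 50 / (10 * 20)
print(normalized_sw(raw_scores, "P1", "P2"))
```

With this normalization every protein has similarity 1 to itself, which puts sequence similarity on the same scale as the Jaccard-based similarities.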

11 Materials (5/6) - Functional annotation semantic similarity
- GO annotations were downloaded from the GO database.
- The semantic similarity score between each pair of proteins was calculated based on the overlap of the GO terms associated with the two proteins.
- Specifically, the Jaccard score of the GO term sets of each pair of proteins was used as their similarity.

12 Materials (6/6) - Protein domain similarity
- Protein domains were extracted from the PFAM database.
- Each protein was represented by a domain fingerprint (binary vector).
- The numbers of PFAM domains for human and C. elegans are 1331 and 3837, respectively.

13 Methods
- Integration of multiple similarities
- The screening framework

14
- Compute the integrated similarity of each pair of compounds/proteins via Equation (1)/Equation (2).
- Build the assembly K of known/predicted CPIs as mentioned above.

15
- For drugs $c_i$ and $c_j$, the individual similarities are combined into a single comprehensive similarity measure (Equation (1)), in which $cs^{(n)}_{ij}$ ($n = 1, 2$) represents the similarity measures derived from chemical structure and side effects.
- The authors computed the comprehensive similarity between proteins $p_i$ and $p_j$ by Equation (2), where $ps^{(n)}_{ij}$ ($n = 1, 2, 3$) represents the similarity measures derived from sequence, functional annotation and protein domains.
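The combining equations themselves did not survive in this transcript. As an illustration only, the sketch below assumes a noisy-OR style combination, $1 - \prod_n (1 - s^{(n)})$, which is one common way to merge several similarity scores in $[0, 1]$; it should not be read as the authors' exact formula. The function name and the example values are hypothetical.

```python
def integrate(similarities):
    """Combine per-feature similarities in [0, 1] into one score.

    Assumption: noisy-OR combination, 1 - prod(1 - s_n). The result is high
    when at least one individual similarity is high.
    """
    product = 1.0
    for s in similarities:
        product *= 1.0 - s
    return 1.0 - product


# Hypothetical compound pair: chemical structure and side effect similarity.
print(integrate([0.5, 0.5]))

# Hypothetical protein pair: sequence, GO and domain similarity.
print(integrate([0.0, 0.0, 0.0]))
```

Under this assumption the integrated score never falls below the largest individual similarity, so a pair counts as similar if any single view says so.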

16
- For any protein $p_l$ targeted by $c_k$ in K, compute the weighted score $spc_{jkl}$, which indicates the possibility of protein $p_j$ being targeted by compound $c_k$ in consideration of the similarity between $p_j$ and $p_l$.
- Calculate the combined score by summing the weighted scores $spc_{jkl}$ over $l$, which gives $\sum_l spc_{jkl}$ (Equation (3)).
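The summation over known targets can be sketched as follows. The exact form of the weighted score is not given in this transcript, so it is passed in as a callable; the `toy_weighted_score` below is a placeholder that simply looks up a protein similarity and is purely an assumption for the sake of a runnable example. All names and values are hypothetical.

```python
def combined_score(compound, protein, known, weighted_score):
    """Sum the weighted scores over every protein p_l targeted by `compound` in K.

    `known` maps each compound to the set of proteins it targets in K.
    `weighted_score(p_j, c_k, p_l)` is supplied by the caller.
    """
    return sum(
        weighted_score(protein, compound, p_l)
        for p_l in known.get(compound, ())
    )


# Hypothetical K and protein similarities.
K = {"c1": {"pA", "pB"}}
protein_sim = {("pX", "pA"): 0.25, ("pX", "pB"): 0.5}


def toy_weighted_score(p_j, c_k, p_l):
    # Placeholder only: NOT the paper's weighted score.
    return protein_sim[(p_j, p_l)]


print(combined_score("c1", "pX", K, toy_weighted_score))
```

Keeping the weighted score pluggable makes the aggregation step independent of how individual contributions are defined.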

17
- Build the set of positive interactions from two databases: DrugBank and Matador.

18
- Rank the potential negative CPIs according to the scores obtained by Equation (3); those with the highest scores are taken to form the set of negative sample candidates.
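The ranking step can be sketched as a sort followed by a cut-off. The slide does not say how many candidates are kept, so the cut-off `k`, the pair names and the scores below are hypothetical.

```python
def top_candidates(scores, k):
    """Return the k compound-protein pairs with the highest scores.

    `scores` maps (compound, protein) pairs to their Equation (3) score.
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [pair for pair, _ in ranked[:k]]


# Hypothetical scores for four unlabeled pairs.
scores = {
    ("c1", "p1"): 0.9,
    ("c1", "p2"): 0.1,
    ("c2", "p1"): 0.7,
    ("c2", "p2"): 0.4,
}

print(top_candidates(scores, 2))
```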

19
- The negative sample candidates are further filtered by using the feature divergence of compounds and proteins.

20
- Combining the positive and negative interactions, the authors obtain a gold standard set of CPIs.
- On the basis of the chemical substructures and protein PFAM domains, the authors construct the tensor product for each CPI, so that each interaction is represented by a vector in the chemogenomical space.
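A minimal sketch of the tensor product representation, assuming both the compound and the protein are given as binary fingerprints (lists of 0/1). Each entry of the resulting vector is 1 only when a particular substructure and a particular domain are both present. The toy fingerprints are hypothetical and far shorter than real ones.

```python
def tensor_product(compound_fp, protein_fp):
    """Flattened tensor (outer) product of two binary fingerprints.

    The result has len(compound_fp) * len(protein_fp) entries, ordered so
    that the protein index varies fastest.
    """
    return [c * p for c in compound_fp for p in protein_fp]


# Hypothetical fingerprints: 3 substructures, 2 PFAM domains.
compound_fp = [1, 0, 1]
protein_fp = [0, 1]

print(tensor_product(compound_fp, protein_fp))
```

Because the vector length is the product of the two fingerprint lengths, real fingerprints give very high-dimensional but sparse vectors, so a sparse representation is usually preferable in practice.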

21
- Train a classifier (e.g. SVM) using the chemogenomical feature vectors, tune the model parameters via cross-validation and finally predict new CPIs.

22 Results (1/4) - Performance evaluation protocol
- First, the positive samples were built from the manually curated databases DrugBank and Matador.
- Then two sets of negative samples were generated: one by randomly sampling compound–protein pairs not included in the positive samples, the other by the screening framework.
- The screened negative samples were evaluated by comparing the performance of six classical classifiers and three existing predictive methods on the same set of positive samples combined with either the screened or the randomly generated negative samples.
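The random baseline described above can be sketched as sampling, without replacement, from all compound–protein pairs that are not known positives. The function name, the fixed seed and the toy identifiers are assumptions made for illustration.

```python
import random


def sample_random_negatives(compounds, proteins, positives, n, seed=0):
    """Draw n distinct compound-protein pairs that are not in `positives`."""
    positives = set(positives)
    candidates = [
        (c, p) for c in compounds for p in proteins if (c, p) not in positives
    ]
    if n > len(candidates):
        raise ValueError("not enough unlabeled pairs to sample from")
    # A seeded generator keeps the baseline reproducible across runs.
    return random.Random(seed).sample(candidates, n)


# Hypothetical toy data.
compounds = ["c1", "c2", "c3"]
proteins = ["p1", "p2"]
positives = [("c1", "p1"), ("c2", "p2")]

negatives = sample_random_negatives(compounds, proteins, positives, n=2)
print(negatives)
```

Pairs drawn this way are only unlabeled, not verified non-interactions, which is the weakness the screened negatives are meant to address.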

23 Results (2/4) - Evaluation on classical classifiers

25 Results (3/4) - Evaluation on existing predictive methods

26 Results (4/4) - Evaluation on drug bioactivity dataset

27 Discussion and conclusion
- Negative samples are as important as positive samples.
- This is the first work devoted to screening reliable negative samples of CPIs.
- Extensive experiments demonstrated that the authors' screened negative samples are highly credible and helpful for identifying CPIs.
- The screened negative samples are a useful resource for identifying drug targets and constitute a helpful supplement to the current curated compound–protein databases.

