Improving compound–protein interaction prediction by building up highly credible negative samples / Toward more realistic drug-target interaction predictions

Improving compound–protein interaction prediction by building up highly credible negative samples
Source: Bioinformatics 31(12), 2015: i221-i229
Author(s): Hui Liu, Jianjiang Sun, Jihong Guan, Jie Zheng and Shuigeng Zhou

Toward more realistic drug-target interaction predictions
Source: Bioinformatics, April 2014
Author(s): Tapio Pahikkala, Antti Airola, Sami Pietila, Sushil Shakyawar, Agnieszka Szwajda, Jing Tang and Tero Aittokallio

Presented by Shiang-Yin Shih

Outline
- Background
- Introduction
- Materials
- Methods
- Results
- Discussion and conclusion

Background
- Popular benchmarking data set of binary drug-target interactions, with four target classes: (1) enzyme, (2) ion channel, (3) nuclear receptor and (4) G protein-coupled receptor targets
- Four factors may lead to dramatic differences in the prediction results: (1) problem formulation, (2) evaluation data set, (3) evaluation procedure and (4) experimental setting

Introduction (1/2)
- Computational prediction of compound–protein interactions (CPIs) is of great importance for drug design and development.
- Genome-scale experimental validation of CPIs is not only time-consuming but also prohibitively expensive.
- Traditional computational approaches fall roughly into two categories: structure based and ligand based.

Introduction (2/2)
- Structure-based approaches are limited because protein 3D structures are often unavailable for most protein families.
- Ligand-based approaches perform poorly for proteins that have few or no known ligands.
- This article aims at building up a set of highly credible negative samples of CPIs via an in silico screening method.

Materials (1/6) - Compound–protein interactions
- Compound–protein interactions (CPIs) were retrieved from DrugBank 4.1, Matador and STITCH 4.0.
- DrugBank and Matador are manually curated databases, and STITCH is a comprehensive database that collects CPIs from four different sources: experiments, databases, text mining and predicted interactions.

Materials (2/6) - Chemical structure similarity
- Chemical structures (also referred to as fingerprints) of drugs were obtained from the PubChem database.
- The Jaccard score of the fingerprints is used as the chemical structure similarity between compounds.
- The Jaccard score between compounds c and c' is defined as J(c, c') = |c ∩ c'| / |c ∪ c'|, where c and c' are treated as the sets of substructures present in each compound.
- In total, 821 kinds of substructures are used in the analysis for human and C. elegans.
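The Jaccard score on this slide can be illustrated with a minimal sketch. It assumes each fingerprint is stored as the set of indices of the substructures that are present; the function name and the toy fingerprints are hypothetical and are not taken from the paper.

```python
def jaccard(a, b):
    """Jaccard score |a ∩ b| / |a ∪ b| of two feature sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Two empty fingerprints: the score is undefined; 0.0 is assumed here.
        return 0.0
    return len(a & b) / len(union)


# Hypothetical fingerprints: indices of the substructures present in each compound.
compound_c = {3, 17, 42, 108}
compound_c_prime = {3, 42, 300}
print(jaccard(compound_c, compound_c_prime))  # 2 shared / 5 in the union = 0.4
```

The same function would apply unchanged to side-effect sets, GO-term sets and PFAM domain sets, since the slides describe all of them as Jaccard scores.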

Materials (3/6) - Side effect similarity
- Side effects of drugs were downloaded from the SIDER database.
- The side effect similarity of each pair of drugs is the Jaccard score computed from either their known side effects or, when these are unknown, their top 10 predicted side effects.

Materials (4/6) - Sequence similarity
- Amino acid sequences of proteins were obtained from the UCSC Table Browser.
- Sequence similarity between proteins was computed using a normalized version of the Smith–Waterman score.
- The normalized score between two proteins g and g' is built from the original Smith–Waterman score.
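The exact normalization is not shown in this transcript. One widely used choice, assumed in this sketch rather than confirmed by the slides, divides each raw score by the geometric mean of the two self-alignment scores, so that every protein has similarity 1 with itself. The function name and the raw scores below are hypothetical.

```python
import math


def normalize_sw(raw):
    """Normalize a symmetric matrix of raw Smith-Waterman scores.

    raw[i][j] is the original score between proteins i and j;
    the diagonal holds the self-alignment scores.
    Assumed rule: raw[i][j] / sqrt(raw[i][i] * raw[j][j]).
    """
    n = len(raw)
    return [
        [raw[i][j] / math.sqrt(raw[i][i] * raw[j][j]) for j in range(n)]
        for i in range(n)
    ]


# Hypothetical raw scores for three proteins (not taken from the paper).
raw_scores = [
    [100.0, 30.0, 10.0],
    [30.0, 81.0, 9.0],
    [10.0, 9.0, 64.0],
]
normalized = normalize_sw(raw_scores)
print(normalized[0][1])  # 30 / sqrt(100 * 81) = 30 / 90
```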

Materials (5/6) - Functional annotation semantic similarity
- GO annotations were downloaded from the GO database.
- The semantic similarity score between each pair of proteins was calculated from the overlap of the GO terms associated with the two proteins.
- Specifically, the Jaccard score over the GO terms of each pair of proteins was used as their similarity.

Materials (6/6) - Protein domain similarity
- Protein domains were extracted from the PFAM database.
- Each protein was represented by a domain fingerprint (binary vector).
- Numbers of PFAM domains for human and C. elegans are 1331 and

Methods
- Integration of multiple similarities
- The screening framework

- Compute the integrated similarity of each pair of compounds/proteins via Equation (1)/Equation (2).
- Build the assembly K of known/predicted CPIs as mentioned above.

- For drugs c_i and c_j, the individual similarities are combined into a single comprehensive similarity measure (Equation (1)), in which cs^(n)_ij (n = 1, 2) represents the similarity measure derived from features of chemical structure and side effect.
- The authors compute the comprehensive similarity between proteins p_i and p_j by Equation (2), where ps^(n)_ij (n = 1, 2, 3) represents the similarity measure derived from sequence, functional annotation and protein domain similarity.
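Equations (1) and (2) are not reproduced in this transcript. As a minimal sketch of how several similarity measures can be folded into one, the code below assumes every measure lies in [0, 1] and combines them with a probabilistic OR, 1 - prod(1 - s_n). This combination rule is an assumption made for illustration and may differ from the authors' formula; the function name and the input values are hypothetical.

```python
def integrate_similarities(measures):
    """Combine similarity values in [0, 1] as 1 - prod(1 - s_n) (assumed rule)."""
    remaining = 1.0
    for s in measures:
        if not 0.0 <= s <= 1.0:
            raise ValueError("each similarity must lie in [0, 1]")
        remaining *= 1.0 - s
    return 1.0 - remaining


# Hypothetical compound pair: chemical structure and side-effect similarity.
print(integrate_similarities([0.5, 0.5]))       # 1 - 0.5 * 0.5 = 0.75
# Hypothetical protein pair: sequence, GO annotation and domain similarity.
print(integrate_similarities([0.5, 0.0, 0.5]))  # a zero measure changes nothing: 0.75
```

Under this rule the integrated value is at least as large as the largest individual measure, so a pair that looks similar by any one feature stays similar after integration.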

- For any protein p_l targeted by c_k in K, compute the weighted score spcjkl, which indicates the possibility of protein p_j being targeted by compound c_k in consideration of the similarity between p_j and p_l.
- Calculate the combined score for p_j and c_k by summing up the weighted scores spcjkl with respect to l.
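The summation over known targets can be sketched as follows. The weighting formula is not shown in this transcript, so the sketch assumes the weighted score is simply the integrated protein similarity between p_j and p_l; the dictionary layout, the function name and all values are hypothetical.

```python
def combined_score(candidate, compound, known_targets, protein_sim):
    """Sum the similarity between `candidate` and every known target of `compound`.

    known_targets: dict mapping a compound to the set of proteins it targets in K.
    protein_sim:   nested dict, protein_sim[p_j][p_l] = integrated similarity.
    Assumption: the weighted score equals protein_sim[p_j][p_l].
    """
    return sum(
        protein_sim[candidate][target]
        for target in known_targets.get(compound, ())
    )


# Hypothetical data: compound c1 is known to target p1 and p2.
known_targets = {"c1": {"p1", "p2"}}
protein_sim = {
    "p3": {"p1": 0.25, "p2": 0.5},
    "p4": {"p1": 0.0, "p2": 0.125},
}
scores = {p: combined_score(p, "c1", known_targets, protein_sim) for p in ("p3", "p4")}
print(scores)
```

In this toy case p3 resembles the known targets of c1 far more than p4 does, which is the kind of contrast the screening step relies on.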

- Build the set of positive interactions from two databases: DrugBank and Matador.

- Rank the potential negative CPIs according to the scores obtained by Equation (3); those with the highest scores form the set of negative sample candidates.

- The negative sample candidates are further filtered by using the feature divergence of compound and protein.

- Combining the positive and negative interactions, the authors obtain a gold standard set of CPIs.
- On the basis of the chemical substructures and protein PFAM domains, the authors construct the tensor product for each CPI, so that each interaction is represented by a vector in the chemogenomical space.
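A tensor-product representation of one compound–protein pair can be sketched as the flattened outer product of the two binary fingerprints. The function name and the tiny fingerprints are hypothetical; real fingerprints would be far longer.

```python
def tensor_product(compound_fp, protein_fp):
    """Flatten the outer product of two binary fingerprints into one feature vector."""
    return [c * p for c in compound_fp for p in protein_fp]


compound_fp = [1, 0, 1]  # hypothetical substructure bits
protein_fp = [0, 1]      # hypothetical PFAM domain bits
features = tensor_product(compound_fp, protein_fp)
print(features)  # [0, 1, 0, 0, 0, 1]
```

Each entry is 1 only when a given substructure and a given domain co-occur in the pair. With the 821 substructures and 1331 human PFAM domains quoted in the Materials slides, one such vector would have 821 × 1331 = 1,092,751 entries, so a sparse representation is likely to be needed in practice.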

- Train a classifier (e.g. SVM) using the chemogenomical feature vectors, tune the model parameters via cross-validation and finally predict new CPIs.

Results (1/4) - Performance evaluation protocol
- The positive samples were first built from the manually curated databases DrugBank and Matador.
- Two sets of negative samples were then generated: one by randomly sampling compound–protein pairs not included in the positive samples, the other by the screening method.
- The screened negative samples were evaluated by comparing the performance of six classical classifiers and three existing predictive methods on the same set of positive samples combined with either the screened or the randomly generated negative samples.
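The random baseline in this protocol, drawing negatives from pairs that are not known positives, can be sketched as below. The function name, the parameter k and the toy identifiers are hypothetical, and the slides do not state how many random negatives were drawn.

```python
import random


def sample_random_negatives(compounds, proteins, positives, k, seed=0):
    """Draw k distinct compound-protein pairs that are not in `positives`."""
    positives = set(positives)
    # Enumerate every unlabeled pair; fine for a toy example, but a
    # genome-scale data set would call for rejection sampling instead.
    candidates = [
        (c, p) for c in compounds for p in proteins if (c, p) not in positives
    ]
    if k > len(candidates):
        raise ValueError("not enough unlabeled pairs to sample from")
    return random.Random(seed).sample(candidates, k)


compounds = ["c1", "c2", "c3"]
proteins = ["p1", "p2"]
positives = {("c1", "p1"), ("c2", "p2")}
negatives = sample_random_negatives(compounds, proteins, positives, k=2)
print(negatives)
```

Pairs drawn this way are only unlabeled, not verified non-interactions, which is the weakness the screened negatives are meant to address.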

Results (2/4) - Evaluation on classical classifiers

Results (3/4) - Evaluation on existing predictive methods

Results (4/4) - Evaluation on drug bioactivity dataset

Discussion and conclusion
- Negative samples are as important as positive samples.
- This is the first work devoted to screening reliable negative samples of CPIs.
- Extensive experiments demonstrated that the authors' screened negative samples are highly credible and helpful for identifying CPIs.
- The screened negative samples are a useful resource for identifying drug targets and constitute a helpful supplement to the current curated compound–protein databases.