Predicting domain-domain interactions using a parsimony approach Katia Guimaraes, Ph.D. NCBI / NLM / NIH.

Presentation transcript:

Predicting domain-domain interactions using a parsimony approach Katia Guimaraes, Ph.D. NCBI / NLM / NIH

K. Guimaraes NCBI/NLM/NIH 2 The problem
We have:
- A protein-protein interaction network, not necessarily very reliable.
- The domain composition of the proteins in the network.
We want: to identify a set of putative domain interactions.
Basic assumption: protein interactions are mediated by domain-domain interactions.

K. Guimaraes NCBI/NLM/NIH 3 Related Work
Association Method: Sprinzak and Margalit, J. Mol. Biol.
The score of a domain pair $(i,j)$ is based on the ratio
$\mathrm{Score}(i,j) = \dfrac{\text{observed frequency}(i,j)}{\text{expected frequency}(i,j)}$
[Figure from Sprinzak and Margalit, 2001: example in which a domain pair receives a score of 4.]
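The ratio on this slide can be sketched in a few lines of Python. This is only an illustration, not the published method: it assumes the observed frequency is the fraction of interacting protein pairs in which the domain pair co-occurs, and the expected frequency is the product of the two domains' frequencies among all proteins. The function name `association_scores` and the input layout are hypothetical.

```python
from collections import Counter
from itertools import product


def association_scores(domains, interactions):
    """Observed/expected frequency ratio for each co-occurring domain pair.

    domains: dict mapping protein name -> set of domain names
    interactions: list of (protein, protein) pairs
    Both frequency definitions below are assumptions made for this sketch.
    """
    interactions = list(interactions)
    n_proteins = len(domains)

    # Fraction of proteins that contain each domain.
    domain_count = Counter(d for ds in domains.values() for d in ds)
    freq = {d: c / n_proteins for d, c in domain_count.items()}

    # Number of interacting protein pairs in which each domain pair co-occurs.
    observed = Counter()
    for p, q in interactions:
        pairs = {tuple(sorted(pair)) for pair in product(domains[p], domains[q])}
        observed.update(pairs)

    scores = {}
    for (a, b), count in observed.items():
        observed_freq = count / len(interactions)
        expected_freq = freq[a] * freq[b]
        scores[(a, b)] = observed_freq / expected_freq
    return scores
```

A ratio well above 1 means the pair co-occurs in interacting proteins more often than the individual domain frequencies would suggest.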

K. Guimaraes NCBI/NLM/NIH 4 Related Work
Maximum Likelihood Estimation (EM): Deng, Mehta, Sun, and Chen, Genome Res.
Goal: assign a probability to each domain-domain contact so that the likelihood of the network is maximized.
The method repeatedly adapts the parameters to explain the observed network until they no longer change.
Important feature: it can take missing data into account, so as to consider, for instance, false negatives.

K. Guimaraes NCBI/NLM/NIH 5 Related Work
Domain Pair Exclusion Analysis (DPEA): Riley, Lee, Sabatti, and Eisenberg, Genome Biology.
Approach: the MLE is computed multiple times, each time with a given domain-domain interaction disallowed, to observe the impact of that exclusion on the likelihood of the protein interaction network.
DPEA outperforms all previous prediction methods.

K. Guimaraes NCBI/NLM/NIH 6 Our Approach
Our hypothesis: interactions evolved in the most parsimonious way. So we try to explain the protein interactions using the “smallest-weighted” set of putative domain interactions.
Ex: [Figure: example protein interaction network] A single domain pair would suffice to explain all protein interactions in this network.

K. Guimaraes NCBI/NLM/NIH 7 The intuition behind our approach
If single-domain proteins interact, the problem is trivial: [Figure: two interacting single-domain proteins]
But the fact is that most proteins have multiple domains.

K. Guimaraes NCBI/NLM/NIH 8 What if there are multiple interacting proteins, all with multiple domains?
By the parsimony principle, domain pairs that are common to those protein interactions are the best candidates as putative mediators.
[Figure: example network] In this example, two domain pairs represent the best choices.

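K. Guimaraes NCBI/NLM/NIH 9 Modeling the problem as an LP
For each domain pair $D_i D_j$, create a variable $x_{ij} \ge 0$.
For each protein interaction $P_m P_n$, create a constraint:
$\sum_{x_{ij} \in \{P_m, P_n\}} x_{ij} \ge 1$
Here $x_{ij} \in \{P_m, P_n\}$ means that $D_i D_j$ is a potential domain contact between $P_m$ and $P_n$.
[Figure: example network] For this network there will be six constraints.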

K. Guimaraes NCBI/NLM/NIH 10 Modeling the problem as an LP
From the set of protein-protein interactions, identify the potential domain-domain contacts; these form the set of variables.
Ex: [Figure: example network] We have 8 potential contacts.
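The step from interactions to variables and constraints might be sketched as follows. The function name `build_lp` and the input layout (a dict of domain sets plus a list of interacting protein pairs) are assumptions for illustration; the sketch only enumerates the potential contacts and records which of them appear in each constraint, without solving anything.

```python
from itertools import product


def build_lp(domains, interactions):
    """Enumerate LP variables and constraints from a protein network.

    domains: dict mapping protein name -> set of domain names
    interactions: list of (protein, protein) pairs
    Returns (variables, constraints):
      variables   - sorted list of potential domain contacts (i, j)
      constraints - one list of variable indices per protein interaction,
                    encoding "sum of x_ij over these indices >= 1"
    """
    per_edge = []
    for p, q in interactions:
        # Every domain of p paired with every domain of q is a potential contact.
        contacts = {tuple(sorted(pair)) for pair in product(domains[p], domains[q])}
        per_edge.append(contacts)

    variables = sorted(set().union(*per_edge))
    index = {contact: k for k, contact in enumerate(variables)}
    constraints = [sorted(index[c] for c in contacts) for contacts in per_edge]
    return variables, constraints
```

The two returned lists map directly onto the columns and rows of the constraint matrix that an LP solver would expect.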

K. Guimaraes NCBI/NLM/NIH 11 Modeling the problem as an LP
Under parsimony, domain pairs that appear in multiple interacting protein pairs are better candidates for mediating the contact, so we minimize the sum of all scores assigned to the variables. So, we have:
Minimize $\sum x_{ij}$
Subject to: $\sum_{x_{ij} \in \{P_m, P_n\}} x_{ij} \ge 1$ for every interacting protein pair $\{P_m, P_n\}$.
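As a worked illustration of this formulation, consider a hypothetical three-protein network that is not taken from the slides: $P_1$ has domains $a$ and $b$, $P_2$ has domain $c$, $P_3$ has domain $a$, and the interactions are $P_1 P_2$ and $P_3 P_2$. The potential contacts are $(a,c)$ and $(b,c)$, which gives the following LP.

```latex
\begin{align*}
\text{minimize}\quad   & x_{ac} + x_{bc} \\
\text{subject to}\quad & x_{ac} + x_{bc} \ge 1 && \text{(interaction } P_1 P_2\text{)} \\
                       & x_{ac} \ge 1          && \text{(interaction } P_3 P_2\text{)} \\
                       & x_{ac},\; x_{bc} \ge 0
\end{align*}
```

The second constraint forces $x_{ac} \ge 1$, which already satisfies the first, so the optimum is $x_{ac} = 1$, $x_{bc} = 0$ with objective value 1. The single pair $(a,c)$ explains both interactions, which is the parsimonious outcome the objective is meant to favor.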

K. Guimaraes NCBI/NLM/NIH 12 Modeling the reliability of the protein interaction network
Large-scale experiments are rather unreliable. Estimation: protein interaction network reliability ~50%.
To model that:
- Build 1000 protein interaction subnetworks where each edge is kept according to the network reliability.
- Compute the LP-score for each $x_{ij}$ in each network $k$, denoted $x_{ij}^k$.
- The LP-score for each pair is the average of the values obtained in all runs.
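The resampling procedure on this slide could look roughly like the sketch below. The LP solver is passed in as `solve_lp`, a hypothetical callable that takes a list of edges and returns a dict of LP-scores per domain pair; the sketch also assumes that a pair missing from a given run contributes a score of 0 to the average, which the slide does not specify.

```python
import random
from collections import defaultdict


def averaged_lp_scores(edges, reliability, solve_lp, n_samples=1000, seed=0):
    """Average LP-scores over randomly thinned copies of the network.

    edges: list of (protein, protein) interactions
    reliability: probability with which each edge is kept
    solve_lp: hypothetical callable, list of edges -> {domain pair: score}
    """
    rng = random.Random(seed)
    totals = defaultdict(float)
    for _ in range(n_samples):
        # Keep each edge independently with probability `reliability`.
        kept = [edge for edge in edges if rng.random() < reliability]
        for pair, score in solve_lp(kept).items():
            totals[pair] += score
    # Assumption: a pair absent from a run counts as score 0 in that run.
    return {pair: total / n_samples for pair, total in totals.items()}
```

Fixing the seed keeps the sampled subnetworks reproducible between runs.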

K. Guimaraes NCBI/NLM/NIH 13 The pw-score
$\text{pw-score}(i,j) = \min\left(\text{p-value}(i,j),\ (1-r)^{w(i,j)}\right)$
The pw-score is an indicator of the influence of:
- Frequency of appearance of the domain pair
- Number of witnesses in view of network reliability
We use the pw-score to filter our predictions.
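The pw-score formula translates directly into code. The function and parameter names below are chosen for this sketch; `reliability` is $r$ and `n_witnesses` is $w(i,j)$.

```python
def pw_score(p_value, reliability, n_witnesses):
    """pw-score(i, j) = min(p-value(i, j), (1 - r) ** w(i, j))."""
    false_witness_term = (1.0 - reliability) ** n_witnesses
    return min(p_value, false_witness_term)
```

With no witnesses the second term equals 1, so the pw-score reduces to the p-value; each additional witness can only lower it.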

K. Guimaraes NCBI/NLM/NIH 14 Dataset used
Protein interaction network and domain contents compiled by Eisenberg’s group for [Riley et al., 2005] (DPEA). The protein interaction network was originally obtained from DIP.
- 26,032 protein-protein interactions (constraints)
- 177,233 potential domain contacts (variables)
Gold Standard Set = subset of iPFAM

K. Guimaraes NCBI/NLM/NIH 15 Comparison with other methods
We did two experiments to evaluate our method:
1. Enrichment of domain pairs confirmed by crystal structure among the topmost scored pairs.
2. Prediction of the interacting domain pair between two proteins containing at least one domain pair in the gold standard set.

K. Guimaraes NCBI/NLM/NIH 16 Enrichment of domain pairs in the gold standard set among topmost scored pairs
The PE method outperforms the others in both coverage and accuracy.
[Figure panels: pw-score ≤ 0.01; pw-score ≤ 0.05]

K. Guimaraes NCBI/NLM/NIH 17 EXPERIMENT 2: Prediction of interacting domain pair between two interacting proteins
Task: given an interacting protein pair $\{P_m, P_n\}$, identify which domain pair(s) mediate the protein interaction.
We use a more controlled dataset: the protein pairs used in this experiment include only those that contain at least one potential domain contact that is in the gold standard set (GSS), that is, 1,780 pairs and not 26,032.
We assume that every protein interaction is mediated by a domain pair in the gold standard set.
For each of the 1,780 interacting protein pairs, we check whether the domain pair(s) with maximum score is (are) in the gold standard set.
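One way to score this experiment is sketched below. The slide does not say how ties are counted, so this sketch assumes every top-scoring domain pair of a protein interaction counts as one prediction and that PPV is the fraction of those predictions found in the gold standard set. The function name and input layout are hypothetical.

```python
def mediating_pair_ppv(contacts_per_interaction, scores, gold_standard):
    """PPV of top-scoring domain pairs against a gold standard set.

    contacts_per_interaction: list with one collection of potential
        domain contacts per interacting protein pair
    scores: dict mapping domain pair -> score (missing pairs score 0.0)
    gold_standard: set of domain pairs known to interact
    Assumption: each tied top-scoring pair counts as a separate prediction.
    """
    correct = 0
    predicted = 0
    for contacts in contacts_per_interaction:
        if not contacts:
            continue
        best = max(scores.get(c, 0.0) for c in contacts)
        top = [c for c in contacts if scores.get(c, 0.0) == best]
        predicted += len(top)
        correct += sum(1 for c in top if c in gold_standard)
    return correct / predicted if predicted else 0.0
```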

K. Guimaraes NCBI/NLM/NIH 18 Comparison of PPV in the Mediating Domain Pair Prediction experiment
PPV estimates are separated by classes, according to the number of potential domain contacts of the protein interaction.
The PPV of PE is well above that of the other methods in every class.
[Figure annotations: overall PPV around 75%; DPEA ~42%]

K. Guimaraes NCBI/NLM/NIH 19 Predicting domain-domain interactions using a parsimony approach Katia Guimaraes, Raja Jothi, Elena Zotenko, and Teresa Przytycka Genome Biology, 2006

K. Guimaraes NCBI/NLM/NIH 20 The impact of many appearances of the same domain
Domain pairs that appear very frequently may receive higher scores simply because of their frequency. Of course, a frequent pair may actually interact, so we define a p-value to indicate how likely it is that a score arises from frequency alone.

K. Guimaraes NCBI/NLM/NIH 21 Estimating a p-value
We randomize the network: build 1000 protein interaction networks with:
- The same set of proteins, with the same domain architectures
- $n_e$ edges selected at random ($n_e$ = number of edges in the original protein interaction network)
Then:
- Compute the LP-score for each $x_{ij}$ in each network $k$, denoted $x_{ij}^k$.
- $\text{p-value}(x_{ij}) = \dfrac{\#\{k : \text{LP-score}(x_{ij}^k) \ge \text{LP-score}(x_{ij})\}}{1000}$
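The randomization test could be sketched as follows. As before, `solve_lp` is a hypothetical callable that maps a list of edges to a dict of LP-scores; it is expected to hold the domain architectures itself, since those stay fixed. The sketch assumes random edges are drawn without replacement from all unordered pairs of distinct proteins, which the slide does not state.

```python
import random
from itertools import combinations


def empirical_p_values(proteins, edges, solve_lp, n_random=1000, seed=0):
    """Fraction of random networks whose LP-score matches or beats the observed one.

    proteins: iterable of protein names
    edges: list of (protein, protein) interactions in the original network
    solve_lp: hypothetical callable, list of edges -> {domain pair: score}
    """
    rng = random.Random(seed)
    observed = solve_lp(edges)
    # Assumption: candidate edges are all unordered pairs of distinct proteins.
    candidate_edges = list(combinations(sorted(proteins), 2))
    n_edges = len(edges)

    hits = {pair: 0 for pair in observed}
    for _ in range(n_random):
        random_edges = rng.sample(candidate_edges, n_edges)
        random_scores = solve_lp(random_edges)
        for pair, score in observed.items():
            # Count runs where the random score is at least the observed score.
            if random_scores.get(pair, 0.0) >= score:
                hits[pair] += 1
    return {pair: count / n_random for pair, count in hits.items()}
```

A small p-value means the observed score is rarely reached when edges are placed at random, so the score is less likely to be an artifact of how often the domains appear.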

K. Guimaraes NCBI/NLM/NIH 22 The presence of witnesses
Recall the case of interacting single-domain proteins: [Figure: two interacting single-domain proteins]
We call such interacting protein pairs witnesses. But since the edges of the network are not reliable, we may have false witnesses.
We estimate the chance that all witnesses of a domain pair $(i,j)$ are false as $(1-r)^{w(i,j)}$, where $r$ is the reliability of the network and $w(i,j)$ is the number of witnesses of $(i,j)$.

K. Guimaraes NCBI/NLM/NIH 23 Dataset used
As input data we used the files compiled by Eisenberg’s group for [Riley et al., 2005] (DPEA). The protein interaction network was originally obtained from DIP.
- 26,032 protein-protein interactions
- underlying 11,403 proteins
- from 69 organisms
(This set generated 177,233 potential domain contacts.)
Domain architectures of the 11,403 proteins were obtained by HMM, and include PFAM-B domains.
Our LP had 177,233 variables and 26,032 constraints.