Protein – Protein Interactions Lisa Chargualaf Simon Kanaan Keefe Roedersheimer Others: Dr. Izaguirre, Dr. Chen, Dr. Wuchty, ChengBang Huang.

What are proteins?
Basis of most living functions
Building blocks of life
– Substrates
– Products
– Enzymes
One cell contains thousands of different proteins; the human body contains 50 to 100 thousand different proteins.

Proteins
Composed of sequences of amino acids
– Variations of 20 primary/basic amino acids
Rules governing structure:
– AAs close in the folded structure may or may not be close in the primary structure
– Hydrophobic residues are generally buried in the core; hydrophilic residues are usually exposed
– Protein chains generally do not form knots
– Related proteins generally have similar structures
Similar structures can exist without similar sequences

What is a “protein–protein” (P-P) interaction and why is it important?
Derived from the nuclear material within a cell, proteins fold and interact in intricate arrangements that give the components of a cell their functionality; these components in turn work cooperatively to form whole body systems.
Protein-protein interactions serve as the chemical basis of all living organisms.
Understanding protein interactions helps us understand the protein network.

What causes P-P interactions?
There are several hypotheses about the driving force behind proteins interacting with each other:
– The primary sequence dictates interaction between attached functional groups
– Protein domains drive proteins to fold and interact as they do

What are protein domains?
Significant portions of proteins
Composed of distinct peptides
The key to intricate arrangements

Domains and Proteins
A single protein molecule can possess multiple domains, which makes it difficult to find a simple formula that dictates how protein-protein interactions occur. Yet certain protein domains show affinities for one another, and these affinities are frequently seen in living organisms. This motivates our research, which examines domain-domain interactions as a factor in the mechanism of protein-protein interactions. The model system is the yeast cell, with several of its proteins serving as the test cases. The work uses a protein family data bank available online.

Our “Formula” dictating which P-P interactions occur
A data bank gives a list of protein interactions.
A protein interaction (P1, P2) is explained by a domain pair (D1, D2) if P1 includes one domain and P2 includes the other.
Find the minimum number of domain pairs that explains every interaction in the data bank.
This is equivalent to the Minimum Set Cover problem.
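The "explained by" rule above can be sketched as a small predicate. This is only an illustration: the function name `explains`, the dictionary `domains_of`, and the toy proteins are assumptions, not part of the original slides.

```python
def explains(domain_pair, protein_pair, domains_of):
    """Return True if the domain pair accounts for the protein interaction.

    domains_of is a hypothetical mapping from protein name to its set of domains.
    """
    d1, d2 = domain_pair
    p1, p2 = protein_pair
    # One protein must carry one domain and the other protein the other domain,
    # in either orientation.
    return (d1 in domains_of[p1] and d2 in domains_of[p2]) or (
        d2 in domains_of[p1] and d1 in domains_of[p2]
    )


# Toy data, invented for illustration only.
domains_of = {"P1": {"D1", "D2"}, "P2": {"D2", "D3"}}
print(explains(("D2", "D3"), ("P1", "P2"), domains_of))  # True
print(explains(("D1", "D1"), ("P1", "P2"), domains_of))  # False
```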

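Minimum Set Cover Problem
Given a global set of elements and a collection of subsets whose union is the global set, find the smallest sub-collection of those subsets whose union still equals the global set.
The decision version is NP-complete; the optimization version is NP-hard.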

Why the Minimum Set of Domains?
Let's look at the following case:
– P1 contains domains D1 and D2
– P2 contains domains D2 and D3
– P3 contains domains D2 and D4
– P4 contains domains D2 and D5
And let's assume the protein interactions are:
P1 - P1
P1 - P2
P1 - P3
P1 - P4
These P-P interactions can be explained by four domain pairs:
– (D2 - D2)
– (D2 - D3)
– (D2 - D4)
– (D1 - D5)
Or by a single pair:
– (D2 - D2)

Mapping to MSC
Let each interaction be numbered:
P1 - P1 = 0
P1 - P2 = 1
P1 - P3 = 2
P1 - P4 = 3
Interactions covered by each domain pair:
D2-D2 = {0,1,2,3}
D2-D3 = {1}
D2-D4 = {2}
D1-D5 = {3}
This maps to the integer MSC problem with a global set of {0,1,2,3} and subsets {{0,1,2,3},{1},{2},{3}}.
The solution is D2-D2; larger problems are much harder to solve.
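For an instance this small, the exact answer can be found by brute force. The sketch below uses the global set and subsets from this slide; the function name `exact_min_set_cover` is an assumption for illustration, and the search is exponential in the number of subsets.

```python
from itertools import combinations


def exact_min_set_cover(universe, subsets):
    """Try sub-collections in order of increasing size; return the first cover found."""
    names = sorted(subsets)
    for size in range(1, len(names) + 1):
        for chosen in combinations(names, size):
            covered = set().union(*(subsets[name] for name in chosen))
            if covered >= universe:
                return list(chosen)
    return None  # no sub-collection covers the universe


universe = {0, 1, 2, 3}
subsets = {
    "D2-D2": {0, 1, 2, 3},
    "D2-D3": {1},
    "D2-D4": {2},
    "D1-D5": {3},
}
print(exact_min_set_cover(universe, subsets))  # ['D2-D2']
```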

Implementation/Algorithm
The base algorithm consists of functions that record the protein structure and interaction information and store them in separate data structures.
It also builds a domain-domain matrix that holds information about interacting domains. Each entry (i, j) is the number of protein-protein interactions in which the domain pair (Di, Dj) was observed as a possible cause.
Example:
– P1: {D1, D2, D3} and P2: {D1, D5} interact. The candidate domain pairs are (D1, D1), (D1, D5), (D2, D1), (D2, D5), (D3, D1) and (D3, D5).
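A minimal sketch of building such a matrix follows, using a `Counter` keyed by domain pair instead of a dense 2-D array. The names `build_pair_counts` and `domains_of` are assumptions, as is the choice to treat (Di, Dj) and (Dj, Di) as the same unordered pair.

```python
from collections import Counter
from itertools import product


def build_pair_counts(interactions, domains_of):
    """Count, for each domain pair, the interactions in which it is a candidate cause."""
    counts = Counter()
    for p1, p2 in interactions:
        # Sorting each pair makes the key unordered; the set keeps a pair from
        # being counted twice within a single interaction.
        candidates = {
            tuple(sorted(pair))
            for pair in product(domains_of[p1], domains_of[p2])
        }
        counts.update(candidates)
    return counts


# The example from the slide: P1 and P2 interact.
domains_of = {"P1": {"D1", "D2", "D3"}, "P2": {"D1", "D5"}}
counts = build_pair_counts([("P1", "P2")], domains_of)
for pair in sorted(counts):
    print(pair, counts[pair])
```

Running it on the slide's example yields six candidate pairs, each observed once.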

Exact Problems
In the worst case there are (# of domains)^2 domain interactions, each corresponding to a subset.
The large number of protein interactions corresponds to the global set.
MSC is NP-complete, and the exact solution requires considering all combinations of subsets.
This is computationally expensive and impractical for more than ~10 domains; a real problem has thousands.
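To give a feel for the growth described above, the sketch below applies the slide's worst-case bound of (# of domains)^2 candidate subsets and counts the possible combinations of subsets, 2^(number of subsets). The function name `search_space` is an assumption for illustration.

```python
def search_space(num_domains):
    """Return (worst-case number of domain pairs, number of subset combinations)."""
    pairs = num_domains ** 2  # worst-case bound quoted on the slide
    return pairs, 2 ** pairs


for n in (2, 5, 10):
    pairs, combos = search_space(n)
    print(f"{n} domains: {pairs} pairs, about {combos:.2e} combinations")
```

Even at 10 domains the number of combinations is 2^100, which is why an exhaustive search is not an option at realistic sizes.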

Implementation/Algorithm
The algorithm approximates the minimum set of domain pairs.
It needs to choose d-d pairs in an educated rather than a randomized fashion.
This is done with weight functions: each domain pair is given a weight, and the pair with the largest weight is chosen.

Different Functions
Different weight functions were considered. We decided to look at two for now:
– MSC
– MSC by probability
We also looked at running MSC twice, with the addition of pairs that have a high probability of interacting.

MSC
Assumption:
– The most commonly observed interacting domain pair among the protein interactions is probably the cause of those interactions.
While there are P-P interactions to be explained {
– Choose the most commonly observed interacting domain pair Di-Dj.
– Remove Di-Dj.
– Remove all P-P interactions explained by Di-Dj from the data being observed.
– Undo the effect of those P-P interactions on the matrix.
}
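The loop above is the classic greedy approximation to set cover. The following sketch is one way it might look, under stated assumptions: the names `candidate_pairs`, `greedy_msc`, and `domains_of` are hypothetical, domain pairs are treated as unordered, ties are broken alphabetically, and the counts are rebuilt each round rather than decremented in place (which has the same effect as undoing the removed interactions on the matrix).

```python
from itertools import product


def candidate_pairs(p1, p2, domains_of):
    """All unordered domain pairs that could explain the interaction (p1, p2)."""
    return {tuple(sorted(pair)) for pair in product(domains_of[p1], domains_of[p2])}


def greedy_msc(interactions, domains_of):
    """Repeatedly pick the domain pair that explains the most unexplained interactions."""
    unexplained = set(interactions)
    chosen = []
    while unexplained:
        counts = {}
        for p1, p2 in unexplained:
            for pair in candidate_pairs(p1, p2, domains_of):
                counts[pair] = counts.get(pair, 0) + 1
        if not counts:
            break  # remaining interactions involve proteins with no known domains
        # sorted() makes tie-breaking deterministic (alphabetical).
        best = max(sorted(counts), key=counts.get)
        chosen.append(best)
        unexplained = {
            (p1, p2)
            for p1, p2 in unexplained
            if best not in candidate_pairs(p1, p2, domains_of)
        }
    return chosen


# Toy data, invented for illustration only.
domains_of = {"A": {"D1"}, "B": {"D1", "D2"}, "C": {"D3"}, "D": {"D3"}}
interactions = [("A", "A"), ("A", "B"), ("C", "D")]
print(greedy_msc(interactions, domains_of))  # [('D1', 'D1'), ('D3', 'D3')]
```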

MSC by Probability
Assumption:
– Incorporate the absence of P-P interactions.
– Initialize the matrix just as in MSC. Go through every element in the matrix and divide that entry by the number of proteins that contain the first domain times the number of proteins that contain the second domain. Each element now represents the probability that domains i and j interact.
– The weight function then chooses the highest probability in the matrix, finds which protein interactions this domain pair explains, removes their influence from the data, and repeats.
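The normalization step can be sketched as follows. The names `pair_probabilities`, `pair_counts`, and `domains_of` are assumptions for illustration, and the toy counts are invented; the result is the score that the slide interprets as an interaction probability.

```python
from collections import Counter


def pair_probabilities(pair_counts, domains_of):
    """Divide each pair count by (#proteins with Di) * (#proteins with Dj)."""
    proteins_with = Counter(
        domain for domains in domains_of.values() for domain in domains
    )
    return {
        (di, dj): count / (proteins_with[di] * proteins_with[dj])
        for (di, dj), count in pair_counts.items()
    }


# Toy data, invented for illustration only: D1 and D2 each occur in two proteins.
domains_of = {"P1": {"D1"}, "P2": {"D1", "D2"}, "P3": {"D2"}}
pair_counts = {("D1", "D1"): 2, ("D1", "D2"): 1}
probs = pair_probabilities(pair_counts, domains_of)
print(probs[("D1", "D1")])  # 0.5
print(probs[("D1", "D2")])  # 0.25
```

A pair that is seen often only because its domains are very common gets a lower score than under the plain count.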

Prediction
Input: a set of proteins with known structure, and the set of domain pairs obtained from the algorithm being evaluated.
Go through each interacting domain pair (Di, Dj).
Every protein containing domain Di is predicted to interact with every protein containing Dj.
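A minimal sketch of this prediction step is given below. The names `predict_interactions` and `domains_of` are assumptions, and so is the decision to allow a protein to be predicted as interacting with itself.

```python
from itertools import combinations_with_replacement


def predict_interactions(domain_pairs, domains_of):
    """Predict every protein pair that is explained by at least one chosen domain pair."""
    predicted = set()
    # combinations_with_replacement also yields (P, P), so self-interactions are possible.
    for p1, p2 in combinations_with_replacement(sorted(domains_of), 2):
        for di, dj in domain_pairs:
            if (di in domains_of[p1] and dj in domains_of[p2]) or (
                dj in domains_of[p1] and di in domains_of[p2]
            ):
                predicted.add((p1, p2))
                break
    return predicted


# Toy data, invented for illustration only.
domains_of = {"P1": {"D1"}, "P2": {"D2"}, "P3": {"D3"}}
print(predict_interactions([("D1", "D2")], domains_of))  # {('P1', 'P2')}
```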

Testing
Running the MSC approximation vs. the exact MSC on very small sets to see how close the approximation is to the exact solution.

Testing
Building training data of different sizes using the Swiss Pfam-A database, among others.
Running the approximation algorithms on these sets.
Running AM on the same sets.
Attempting to use sets of similar size to those used for MLE, for comparison's sake.

Testing
Compare calculated P-P interactions with observed interactions (number of matches, false positives, and false negatives).
Calculate fold, specificity, and sensitivity in order to compare with previous research.
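The comparison can be sketched with set operations. The slides do not define fold or specificity, so this sketch leaves fold out and reports only the counts, the sensitivity (matches divided by observed interactions), and the fraction of predictions that match (labelled `precision` here; whether this is what the slide calls specificity is an assumption). The function name `evaluate` is also hypothetical.

```python
def evaluate(predicted, observed):
    """Compare predicted and observed interaction sets."""
    matches = predicted & observed
    false_positives = predicted - observed
    false_negatives = observed - predicted
    return {
        "matches": len(matches),
        "false_positives": len(false_positives),
        "false_negatives": len(false_negatives),
        # Fraction of observed interactions that were predicted.
        "sensitivity": len(matches) / len(observed) if observed else 0.0,
        # Fraction of predictions that were observed.
        "precision": len(matches) / len(predicted) if predicted else 0.0,
    }


# Toy data, invented for illustration only.
predicted = {("P1", "P2"), ("P1", "P3")}
observed = {("P1", "P2"), ("P2", "P3")}
print(evaluate(predicted, observed))
```

Both sets must use the same ordering convention for protein pairs, for example always sorted, or matching pairs will be missed.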

Results

Future Work
Finish testing and comparing different weight functions.
Gather statistics by running the different algorithms multiple times on data sets of different sizes.
Test the exact MSC against the different weight functions.