Lecture 3 Protein Function prediction using network concepts

Presentation transcript:

Lecture 3 Protein Function prediction using network concepts Application of network concepts in DNA sequencing

The topology of a protein-protein interaction (PPI) network is informative in itself, but further analysis can reveal additional information. A popular assumption, which holds in many cases, is that proteins of similar function interact with each other. Based on this assumption, we have developed methods to predict protein functions and protein complexes from PPI networks, mainly based on cluster analysis.

Cluster Analysis Cluster analysis, also called data segmentation, means grouping or segmenting a collection of objects into subsets or "clusters", such that objects within each cluster are more closely related to one another than to objects assigned to different clusters. In the context of a graph, densely connected nodes are considered clusters. Visually, we can detect two clusters in this graph.

K-cores of Protein-Protein Interaction Networks Definition: Let a graph $G=(V, E)$ consist of a finite set of nodes $V$ and a finite set of edges $E$. A subgraph $S=(V', E')$, where $V' \subseteq V$ and $E' \subseteq E$, is a k-core (or a core of order k) of $G$ if and only if $\forall v \in V'$: $\deg(v) \geq k$ within $S$, and $S$ is the maximal subgraph with this property.

Concept of a k-core graph Graph G 1-core graph: the degree of every node is one or more

Concept of a k-core graph 2-core graph: the degree of every node is two or more

Graph G 3-core graph: the degree of every node is three or more. The 3-core is the highest k-core subgraph of the graph G.
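The k-core idea can be sketched in code as repeated "peeling": nodes whose remaining degree is below k are removed until none are left. The following is a minimal sketch, not the lecture's own implementation; the function names `k_core` and `highest_core` and the small example graph are assumptions made for illustration.

```python
from collections import defaultdict

def k_core(edges, k):
    """Return the node set of the k-core of an undirected simple graph."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self-loops
            adj[u].add(v)
            adj[v].add(u)
    # Repeatedly peel off nodes whose remaining degree is below k.
    queue = [n for n in adj if len(adj[n]) < k]
    removed = set()
    while queue:
        n = queue.pop()
        if n in removed:
            continue
        removed.add(n)
        for m in adj[n]:
            adj[m].discard(n)
            if m not in removed and len(adj[m]) < k:
                queue.append(m)
        adj[n].clear()
    return set(adj) - removed

def highest_core(edges):
    """Return (k, nodes) for the largest k with a non-empty k-core."""
    k = 0
    while k_core(edges, k + 1):
        k += 1
    return k, k_core(edges, k)

# Hypothetical graph: a 4-clique (a, b, c, d) attached to a triangle (e, f, g).
edges = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("b", "d"),
         ("c", "d"), ("d", "e"), ("e", "f"), ("e", "g"), ("f", "g")]
print(k_core(edges, 2))     # all seven nodes have degree two or more
print(k_core(edges, 3))     # only the 4-clique survives
print(highest_core(edges))  # the 3-core is the highest k-core here
```

Removing the triangle nodes lowers the degree of their neighbours, which is why the peeling has to be repeated rather than done in a single pass.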

Application of a k-core graph Analyzing protein-protein interaction data obtained from different sources, G. D. Bader and C.W.V. Hogue, Nature biotechnology, Vol 20, 2002

Protein function prediction using k-core graphs

Introduction : Function prediction Schwikowski, B., Uetz, P. and Fields, S. A network of protein-protein interactions in yeast. Nature Biotech. 18, 1257-1261 (2000). Deals with a network of 2039 proteins and 2709 interactions. 65% of interactions occurred between protein pairs with at least one common function. Hishigaki, H., Nakai, K., Ono, T., Tanigami, A., and Takagi, T. Assessment of prediction accuracy of protein function from protein-protein interaction data. Yeast 18, 523-531 (2001). Reported similar results.

Introduction : Function prediction Hypothesis: function-unknown proteins that form a densely connected subgraph with proteins of a particular function may belong to that functional group. We utilize this concept by determining k-cores of strategically constructed sub-networks.

Prediction of Protein Functions Based on K-cores of Protein-Protein Interaction Networks “Prediction of Protein Functions Based on K-cores of Protein-Protein Interaction Networks and Amino Acid Sequences”, Md. Altaf-Ul-Amin, Kensaku Nishikata, Toshihiro Koma, Teppei Miyasato, Yoko Shinbo, Md. Arifuzzaman, Chieko Wada, Maki Maeda, Taku Oshima, Hirotada Mori, Shigehiko Kanaya The 14th International Conference on Genome Informatics December 14-17, 2003, Yokohama Japan.

E. coli PPI network Total 3007 proteins and 11531 interactions. Around 2000 are function-unknown proteins. The highest k-core of this total graph is not so helpful.

10-core graph: the highest k-core of the E. coli PPI network

We separate 1072 interactions (out of 11531) involving protein synthesis (P. S.) and function-unknown (U. F.) proteins.

Function-unknown proteins of this 6-core graph are likely to be involved in protein synthesis.

Extending the k-core based function prediction method and its application to PPI data of Arabidopsis thaliana Protein Function Prediction based on k-cores of Interaction Networks, Norihiko Kamakura, Hiroki Takahashi, Kensuke Nakamura, Shigehiko Kanaya and Md. Altaf-Ul-Amin, Proceedings of 2010 International Conference on Bioinformatics and Biomedical Technology (ICBBT 2010)

Materials and Methods : Dataset All PPI data of Arabidopsis thaliana: 3118 interactions involving 1302 proteins, collected from databases and scientific literature by our laboratory. Green = unknown proteins (289 proteins). Pink = known proteins (1013 proteins).

Materials and Methods : Dataset Functional groups in the network The PPI dataset contains proteins of 19 different functions according to the first level categories of the KNApSAcK database.

Materials and Methods : Dataset The trends of interactions in the context of functional similarity Diagonal elements show number of interactions between similar function proteins.

Materials And Methods : Flowchart of the method

Results : Subnetworks [Table: subnetwork name, number of interactions] In this work we do not consider the sub-networks that contain fewer than 100 interactions. We finally consider subnetworks corresponding to 9 functional classes.

Results : Subnetwork corresponding to cellular communication As an example, here we show the subnetworks and k-cores corresponding to cellular communication. Subnetwork extraction: we extracted the following 3 types of interactions: cellular communication-cellular communication, cellular communication-unknown, and unknown-unknown. Total: 603 interactions.
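The subnetwork extraction step can be sketched as a simple edge filter. This is a minimal sketch under stated assumptions: each protein carries at most one functional label, `None` (or a missing entry) marks a function-unknown protein, and the protein names and the helper `extract_subnetwork` are hypothetical.

```python
def extract_subnetwork(edges, function_of, target):
    """Keep interactions whose two proteins are each either in the target
    class or function-unknown (class-class, class-unknown, unknown-unknown)."""
    allowed = {target, None}  # None marks a function-unknown protein
    return [(u, v) for u, v in edges
            if function_of.get(u) in allowed and function_of.get(v) in allowed]

# Hypothetical annotation and interactions.
function_of = {
    "P1": "cellular_communication",
    "P2": "cellular_communication",
    "P3": "energy",
    "U1": None,
    "U2": None,
}
edges = [("P1", "P2"), ("P1", "U1"), ("U1", "U2"), ("P3", "U1"), ("P2", "P3")]

sub = extract_subnetwork(edges, function_of, "cellular_communication")
print(sub)  # the two interactions involving the "energy" protein P3 are dropped
```

The k-cores of such a subnetwork can then be computed, and the function-unknown proteins that remain in a high-order core become candidates for the target class.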

Results : Subnetwork corresponding to cellular communication [Figure: 1-core] The red nodes represent known proteins; the green nodes represent function-unknown proteins.

Results : k-cores corresponding to cellular communication [Figure: 6-core and 7-core] The red nodes represent known proteins; the green nodes represent function-unknown proteins. This figure implies that determining the k-cores of strategically constructed sub-networks can reveal which unknown proteins are densely connected to proteins of a particular functional class.

Results : Function Predictions The number of unknown genes included in different k-cores corresponding to different functional groups
Columns: k-core 2, k-core 3, k-core 4, k-core 5, k-core 6, k-core 7, k-core 8
cell_cycle: 11 7
cell_rescue: 4
cellular_communication: 37 33 23 15 12 8
energy: 5 2
metabo: 1
protein_fate: 69 35 25 10
protein_synthesis:
transcription: 24 14
transport_facilitation:
total: 129 88 64 52 36 27

Results : Function Predictions Prediction based on 2-cores, 3-cores and 4-cores. 2-core, 4-core: most proteins have been assigned unique functions. 3-core: most proteins have been assigned unique functions and some have been assigned multiple functions.

Assessment of Predictions As most of the proteins with predicted functions are still unknown, their annotations do not contain clear information on their functions. When k is much larger than one, the effect of false positives is greatly reduced. However, to assess the predictions statistically, we constructed 1000 random graphs consisting of the same 1,302 proteins with 3,118 randomly inserted edges, and constructed the corresponding subnetworks.

Assessment of Predictions The box plots show the distribution of k-cores with respect to their size in 1000 graphs corresponding to each sub-network and the filled triangles show the size of k-cores in real PPI sub-networks.

Assessment of Predictions It can be concluded theoretically that the existence of higher-order k-cores in the PPI sub-networks, compared with random graphs of the same size, is likely to be due to interactions between proteins of similar function. Therefore, we assume that function predictions based on k-cores are statistically significant when k is greater than the highest value of k found in the corresponding random graphs. Based on this, we predicted the functions of 67 proteins (the list is available online at http://kanaya.naist.jp/Kcore/supplementary/Function_prediction.xls).
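The comparison against random graphs can be sketched as follows. This is an illustrative simplification, not the procedure used in the study: it compares the highest core order of one whole graph with that of random graphs on the same nodes and the same number of edges, uses 20 trials instead of 1000, and runs on a hypothetical toy network. The names `core_order` and `null_core_orders` are assumptions.

```python
import random
from collections import defaultdict
from itertools import combinations

def core_order(edges):
    """Highest k for which the graph has a non-empty k-core."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    best = 0
    while adj:
        # Remove a node of minimum remaining degree; the largest minimum
        # degree seen during this peeling is the highest core order.
        n = min(adj, key=lambda x: len(adj[x]))
        best = max(best, len(adj[n]))
        for m in adj.pop(n):
            adj[m].discard(n)
    return best

def null_core_orders(nodes, m, trials, seed=0):
    """Core orders of random graphs with the same nodes and m random edges."""
    rng = random.Random(seed)
    pairs = list(combinations(nodes, 2))
    return [core_order(rng.sample(pairs, m)) for _ in range(trials)]

# Hypothetical "observed" network: a 6-clique with a short tail, 12 nodes in all.
observed = list(combinations(range(6), 2)) + [(5, 6), (6, 7), (7, 8)]
observed_k = core_order(observed)
null = null_core_orders(range(12), len(observed), trials=20)
print("observed highest core order:", observed_k)
print("highest core order among random graphs:", max(null))
print("exceeds all random graphs:", observed_k > max(null))
```

Following the criterion above, a k-core of the real sub-network would be treated as significant only when k exceeds the highest core order found in the corresponding random graphs.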

“Prediction of Protein Functions Based on Protein-Protein Interaction Networks: A Min-Cut Approach”, Md. Altaf-Ul-Amin, Toshihiro Koma, Ken Kurokawa, Shigehiko Kanaya, Proceedings of the Workshop on Biomedical Data Engineering (BMDE), Tokyo, Japan, pp. 37-43, April 3-4, 2005.

Outline Introduction The concept of Min-Cut Problem Formulation A Heuristic Method Evaluation of the Proposed Method Conclusions

Introduction After the complete sequencing of several genomes, the challenging problem now is to determine the functions of proteins, either by determining protein functions experimentally or by using various computational methods based on a) sequence, b) structure, c) gene neighborhood, d) gene fusions, e) cellular localization, f) protein-protein interactions.

Introduction The present work predicts protein functions based on the protein-protein interaction network. For the purpose of prediction, we consider the interactions of function-unknown proteins with function-known proteins, and of function-unknown proteins with function-unknown proteins, in the context of the whole network.

Introduction The majority of protein-protein interactions are between protein pairs of similar function. Therefore, we assign function-unknown proteins to different functional groups in such a way that the number of inter-group interactions becomes the minimum. Hence we call the proposed approach a Min-Cut approach.

The concept of Min-Cut [Figure: a typical small network of known proteins (K) and unknown proteins (U1 to U4); the known proteins belong to groups G1 and G2]

The concept of Min-Cut [Figure: unknown proteins assigned to known groups based on majority interactions] Number of CUT = 4

The concept of Min-Cut [Figure: an alternative assignment of unknown proteins] Number of CUT = 2. For every assignment of unknown proteins, there is a value of CUT. The Min-Cut approach looks for an assignment for which the number of CUT is minimum.
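The CUT value of an assignment can be sketched as a count of inter-group interactions that involve at least one unknown protein (interactions between two known proteins are not counted). The network below is hypothetical and is not the one drawn on the slides; `cut_size` and `best_assignment` are illustrative names, and the exhaustive search is only feasible for very small examples.

```python
from itertools import product

def cut_size(edges, known_group, assignment):
    """Count inter-group interactions that involve an assigned unknown protein."""
    group = {**known_group, **assignment}
    return sum(1 for u, v in edges
               if (u in assignment or v in assignment) and group[u] != group[v])

def best_assignment(edges, known_group, unknowns):
    """Try every assignment of unknowns to groups and keep the smallest CUT."""
    groups = sorted(set(known_group.values()))
    best = None
    for choice in product(groups, repeat=len(unknowns)):
        assignment = dict(zip(unknowns, choice))
        cut = cut_size(edges, known_group, assignment)
        if best is None or cut < best[0]:
            best = (cut, assignment)
    return best

# Hypothetical network: K1, K2 in G1; K4, K5 in G2; U1, U2 unknown.
known_group = {"K1": "G1", "K2": "G1", "K4": "G2", "K5": "G2"}
edges = [("K1", "K2"), ("K2", "K4"), ("U1", "K1"), ("U1", "K2"),
         ("U1", "K4"), ("U2", "K4"), ("U2", "K5"), ("U2", "U1")]

print(cut_size(edges, known_group, {"U1": "G1", "U2": "G1"}))
print(cut_size(edges, known_group, {"U1": "G1", "U2": "G2"}))
print(best_assignment(edges, known_group, ["U1", "U2"]))
```

With g groups and n unknown proteins there are g^n assignments, so exhaustive search does not scale to real networks.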

Problem Formulation Here we explain some points with a typical example.

Problem Formulation V = set of all nodes, E = set of all edges, G = {K1, K2, K3, K4, K5, K6, K7, K8, K9, K10}, U = {U1, U2, U3, U4, U5, U6, U7, U8}

Problem Formulation We generate $U' \subseteq U$ such that each protein of $U'$ is connected in N with at least one protein of group G by a path of length 1 or length 2. $U'$ = {U1, U2, U3, U4, U5, U6, U7}

Problem Formulation We can assign proteins of $U'$ to different groups and calculate CUT. Interactions between known protein pairs can never be part of CUT. For this assignment of unknown proteins, CUT = 6.

Problem Formulation The problem we are trying to solve is to assign the proteins of set $U'$ to known groups G1, G2, ..., G3 in such a way that the CUT becomes the minimum.
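The set of unknown proteins that lie within a path of length 1 or 2 of some known protein can be computed directly from the adjacency lists. The following is a minimal sketch on a hypothetical chain-shaped network; the function name `reachable_unknowns` is an assumption, and the intermediate node of a length-2 path is allowed to be either known or unknown.

```python
from collections import defaultdict

def reachable_unknowns(edges, known, unknown):
    """Unknown proteins with a path of length 1 or 2 to at least one known protein."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    result = set()
    for u in unknown:
        neighbours = adj[u]
        direct = bool(neighbours & known)                       # path of length 1
        two_step = any(adj[n] & known for n in neighbours)      # path of length 2
        if direct or two_step:
            result.add(u)
    return result

# Hypothetical chain: K1 - U1 - U2 - U3 - U4
known = {"K1"}
unknown = {"U1", "U2", "U3", "U4"}
edges = [("K1", "U1"), ("U1", "U2"), ("U2", "U3"), ("U3", "U4")]
print(reachable_unknowns(edges, known, unknown))  # U3 and U4 are too far away
```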

A Heuristic Method The problem at hand is a variant of the network partitioning problem. It is known that network partitioning problems are NP-hard. Therefore, we resort to heuristics to find a solution that is as good as possible.

A Heuristic Method [Figure: unknown proteins U1 to U7 and known groups G1, G2 and G3]

A Heuristic Method U1 has one path of length 1 with G2 and two paths of length 2 with G1.

A Heuristic Method U4 has two paths of length 1 with G1, one path of length 1 with G2 and one path of length 2 with G3.

A Heuristic Method By assigning all the unknown proteins to their respective highest-priority groups, CUT = 6.

A Heuristic Method For this assignment of unknown proteins, CUT = 7.

A Heuristic Method For this assignment of unknown proteins, CUT = 4.
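Counting the paths of length 1 and length 2 from an unknown protein to each known group can be sketched as below. The slides do not spell out how these counts are turned into priorities, so the ranking rule here (more length-1 paths first, then more length-2 paths) is purely an assumption, as are the function names and the toy network.

```python
from collections import Counter, defaultdict

def path_counts(edges, group_of, node):
    """Count paths of length 1 and length 2 from node to proteins of each known group."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    one, two = Counter(), Counter()
    for n in adj[node]:
        if n in group_of:
            one[group_of[n]] += 1          # node - n, with n known
        for m in adj[n]:
            if m != node and m in group_of:
                two[group_of[m]] += 1      # node - n - m, with m known
    return one, two

def ranked_groups(edges, group_of, node):
    """Assumed priority rule: more length-1 paths first, then more length-2 paths."""
    one, two = path_counts(edges, group_of, node)
    groups = set(one) | set(two)
    return sorted(groups, key=lambda g: (-one[g], -two[g], g))

# Hypothetical network: A is in G2; B and C are in G1; U1 and U2 are unknown.
group_of = {"A": "G2", "B": "G1", "C": "G1"}
edges = [("U1", "A"), ("U1", "U2"), ("U2", "B"), ("U2", "C")]

one, two = path_counts(edges, group_of, "U1")
print(dict(one), dict(two))  # one length-1 path with G2, two length-2 paths with G1
print(ranked_groups(edges, group_of, "U1"))
```

An initial assignment built from such priorities could then be refined by moving proteins between groups whenever the move lowers CUT.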

Evaluation of the Proposed Approach The proposed method is a general one and can be applied to any organism and any type of functional classification. Here we applied it to the protein-protein interaction network of the yeast Saccharomyces cerevisiae. We obtained the protein-protein interaction data from ftp://ftpmips.gsf.de/yeast/PPI/, which contains 15613 genetic and physical interactions.

Evaluation of the Proposed Approach Example pairs: YAR019c YMR001c; YAR019c YNL098c; YAR019c YOR101w; YAR019c YPR111w; YAR027w YAR030c; YAR027w YBR135w; YAR031w YBR217w; ... Total 12487 pairs. We discard self-interactions and extract a set of 12487 unique binary interactions involving 4648 proteins.
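The cleaning step (discard self-interactions, keep each unordered pair once) can be sketched as follows. The input list is hypothetical: the first two pairs appear on the slide, while the reversed duplicate and the self-interaction were added only to exercise the filter. The name `unique_binary_interactions` is an assumption.

```python
def unique_binary_interactions(pairs):
    """Drop self-interactions and treat (a, b) and (b, a) as the same interaction."""
    unique = set()
    for a, b in pairs:
        if a == b:
            continue  # self-interaction
        unique.add(tuple(sorted((a, b))))
    return unique

# Hypothetical raw input.
raw = [
    ("YAR019c", "YMR001c"),
    ("YAR019c", "YNL098c"),
    ("YMR001c", "YAR019c"),  # same pair, reversed
    ("YAR027w", "YAR027w"),  # self-interaction
]
interactions = unique_binary_interactions(raw)
proteins = {p for pair in interactions for p in pair}
print(len(interactions), "interactions involving", len(proteins), "proteins")
```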

Evaluation of the Proposed Approach A network of 12487 interactions and 4648 proteins is reasonably big.

Evaluation of the Proposed Approach We collect the classification data from http://mips.gsf.de/genre/proj/yeast/index.jsp

Evaluation of the Proposed Approach The proposed approach is intended to predict the functions of function-unknown proteins. However, when the functions of truly function-unknown proteins are predicted, the correctness of the predictions cannot be determined. We therefore treat around 10% randomly selected proteins of each group of Table 1 as function-unknown proteins.

Evaluation of the Proposed Approach The union of the 10% samples of all groups consists of 604 proteins. This is the unknown group U. The union of the remaining 90% of each of the functional groups constitutes the set of known proteins G. There are 3783 proteins in total in G. We generate $U' \subseteq U$ such that each protein of $U'$ is connected in N with at least one protein of group G by a path of length 1 or length 2. There are 470 proteins in $U'$. We predicted the functions of these 470 proteins using the proposed method.

Evaluation of the Proposed Approach We applied this algorithm using Max_value=50000 to predict the functions of the 470 proteins.

Evaluation of the Proposed Approach We cannot guarantee that the minimum CUT corresponds to the maximum number of successful predictions. However, the trend of the results in the figure above shows that, very likely, the lower the value of CUT, the greater the number of successful predictions.

Evaluation of the Proposed Approach We then examine the relation between successful predictions and the degrees of the proteins in the network. In the example network, degree of U4 = 7 and degree of U7 = 3.

Evaluation of the Proposed Approach The success rate of prediction is as low as 30.46% for proteins that have degree 1 in the interaction network. However, it is 67.61% for proteins that have degree 8 or more. This implies that the reliability of the prediction can be improved by providing a reasonable amount of interaction information.

Application of network concepts in DNA sequencing

Sequencing by hybridization (SBH) Given an unknown DNA sequence, an array provides information about all strings of length l that the sequence contains. Example: s = TATGGTGC. Orderly placed: S(s,l) = {TAT, ATG, TGG, GGT, GTG, TGC}. Randomly placed: S(s,l) = {GTG, ATG, TGG, TAT, GGT, TGC}. Input: a spectrum S representing all l-mers from an unknown string s. Output: the string s such that spectrum(s,l) = S.
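The spectrum of a string is simply the set of its substrings of length l, which a few lines of code make concrete. The function name `spectrum` mirrors the notation above; treating the spectrum as an unordered set is the only assumption.

```python
def spectrum(s, l):
    """Return the set of all l-mers (substrings of length l) contained in s."""
    return {s[i:i + l] for i in range(len(s) - l + 1)}

print(sorted(spectrum("TATGGTGC", 3)))
```

Because the result is a set, the order in which the l-mers are listed carries no information, which is exactly why reconstructing s from S is a non-trivial problem.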

Sequencing by hybridization (SBH) Input: a spectrum S representing all l-mers from an unknown string s. Output: the string s such that spectrum(s,l) = S. The reduction of the SBH problem to an Eulerian path problem is to construct a graph whose edges correspond to l-mers from spectrum(s,l) and then to find a path in this graph visiting every edge exactly once.

Sequencing by hybridization (SBH) In this graph, the nodes correspond to (l-1)-mers and the edges correspond to l-mers from spectrum(s,l). S(s,l) = {GTG, ATG, TGG, TAT, GGT, TGC}. (l-1)-mers: GT, TG, AT, TG, TG, GG, TA, AT, GG, GT, TG, GC. (l-1)-mers (redundancy removed): GT, TG, AT, GG, TA, GC. [Figure: graph on the nodes TA, AT, TG, GG, GT, GC; the Eulerian path spells s = TATGGTGC]

Sequencing by hybridization (SBH) A path in a graph visiting every edge exactly once is called an Eulerian (pronounced "Oilerian") path. A connected graph has an Eulerian path if and only if it contains at most two semibalanced nodes and all other nodes are balanced. Balanced node: indegree = outdegree. Semibalanced node: |indegree - outdegree| = 1. [Figure: graph on the nodes TA, AT, TG, GG, GT, GC with the semibalanced nodes marked]
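The reduction and the balance condition can be sketched in code: build the graph of (l-1)-mers, check that at most two nodes are semibalanced, and trace an Eulerian path. The path-finding step below uses Hierholzer's algorithm, which the slides do not name; it is swapped in here as a standard choice. The function name `sbh_reconstruct` is an assumption.

```python
from collections import Counter, defaultdict

def sbh_reconstruct(lmers):
    """Return one string whose l-mer spectrum is `lmers`, or None if no Eulerian path exists."""
    out_edges = defaultdict(list)
    indeg, outdeg = Counter(), Counter()
    for mer in lmers:
        u, v = mer[:-1], mer[1:]  # edge from the prefix to the suffix (l-1)-mer
        out_edges[u].append(v)
        outdeg[u] += 1
        indeg[v] += 1
    nodes = set(indeg) | set(outdeg)
    diffs = {n: outdeg[n] - indeg[n] for n in nodes}
    semibalanced = [n for n in nodes if abs(diffs[n]) == 1]
    if any(abs(d) > 1 for d in diffs.values()) or len(semibalanced) > 2:
        return None
    # Start at the node with one extra outgoing edge, if there is one.
    start = next((n for n in nodes if diffs[n] == 1), next(iter(nodes)))
    stack, path = [start], []
    while stack:  # Hierholzer's algorithm
        u = stack[-1]
        if out_edges[u]:
            stack.append(out_edges[u].pop())
        else:
            path.append(stack.pop())
    path.reverse()
    if len(path) != len(lmers) + 1:
        return None  # some edges were unreachable: the graph is not connected
    return path[0] + "".join(node[-1] for node in path[1:])

print(sbh_reconstruct(["GTG", "ATG", "TGG", "TAT", "GGT", "TGC"]))
```

In this example TA and GC are the two semibalanced nodes, so every Eulerian path must start at TA and end at GC.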

Sequencing by hybridization (SBH) Another example: S(s,l) = {ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT}. (l-1)-mers: AT, TG, TG, GG, TG, GC, GT, TG, GG, GC, GC, CA, GC, CG, CG, GT. (l-1)-mers (redundancy removed): AT, TG, GG, GC, GT, CA, CG. [Figure: graph on the nodes AT, TG, GG, GC, GT, CA, CG] One Eulerian path spells ATGGCGTGCA.

Sequencing by hybridization (SBH) For the same spectrum and the same graph, a second Eulerian path spells ATGCGTGGCA, so the reconstruction is not unique.
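That this spectrum admits more than one reconstruction can be checked by enumerating every ordering of the l-mers in which consecutive l-mers overlap by l-1 characters, which is the same as enumerating all Eulerian paths. The following is a brute-force sketch meant only for small spectra; the function name `all_reconstructions` is an assumption.

```python
def all_reconstructions(lmers):
    """Enumerate every string that uses each l-mer exactly once, by backtracking."""
    l = len(lmers[0])
    results = set()

    def extend(seq, remaining):
        if not remaining:
            results.add(seq)
            return
        for i, mer in enumerate(remaining):
            # The next l-mer must start with the last l-1 characters of seq.
            if mer[:-1] == seq[-(l - 1):]:
                extend(seq + mer[-1], remaining[:i] + remaining[i + 1:])

    for i, mer in enumerate(lmers):
        extend(mer, lmers[:i] + lmers[i + 1:])
    return sorted(results)

S = ["ATG", "TGG", "TGC", "GTG", "GGC", "GCA", "GCG", "CGT"]
print(all_reconstructions(S))
```

The two branches arise at node TG, which has two outgoing edges (TGG and TGC); either can be taken first while still allowing every edge to be used.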