Lecture 3 Protein Function prediction using network concepts


1 Lecture 3
Protein function prediction using network concepts
Application of network concepts in DNA sequencing

2 The topology of a protein-protein interaction (PPI) network is informative, but further analysis can reveal other information. A popular assumption, which is true in many cases, is that proteins of similar function interact with each other. Based on this assumption, we have developed methods to predict protein functions and protein complexes from PPI networks, mainly based on cluster analysis.

3 Cluster Analysis
Cluster analysis, also called data segmentation, means grouping or segmenting a collection of objects into subsets or "clusters" such that objects within each cluster are more closely related to one another than to objects assigned to different clusters. In the context of a graph, densely connected nodes are considered clusters. Visually, we can detect two clusters in this graph.

4 K-cores of Protein-Protein Interaction Networks
Definition: Let a graph G = (V, E) consist of a finite set of nodes V and a finite set of edges E. A subgraph S = (V′, E′), where V′ ⊆ V and E′ ⊆ E, is a k-core or a core of order k of G if and only if ∀ v ∈ V′: deg(v) ≥ k within S, and S is the maximal subgraph with this property.
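The definition suggests a simple way to compute a k-core: repeatedly delete nodes whose degree is below k until no such node remains. What survives is the maximal subgraph in which every node has degree at least k. The following is a minimal Python sketch of that pruning idea; the function name `k_core` and the toy edge list are illustrative assumptions and do not come from the lecture.

```python
from collections import defaultdict

def k_core(edges, k):
    """Return the node set of the k-core of an undirected graph (edge list)."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:                      # self-loops are ignored
            adj[u].add(v)
            adj[v].add(u)
    alive = set(adj)
    # Nodes that already violate the degree condition.
    stack = [n for n in alive if len(adj[n]) < k]
    while stack:
        n = stack.pop()
        if n not in alive:
            continue
        alive.discard(n)
        # Removing n lowers the degree of its surviving neighbours.
        for m in adj[n]:
            if m in alive:
                adj[m].discard(n)
                if len(adj[m]) < k:
                    stack.append(m)
    return alive

# Hypothetical toy graph: a triangle a-b-c with a tail c-d-e.
toy_edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"), ("d", "e")]
print(sorted(k_core(toy_edges, 1)))
print(sorted(k_core(toy_edges, 2)))
print(sorted(k_core(toy_edges, 3)))
```

In the toy graph the tail is peeled away for k = 2, leaving only the triangle, and nothing survives for k = 3.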

5 Concept of a k-core graph
Graph G. 1-core graph: the degree of every node is one or more.

6 Concept of a k-core graph
1-core graph: the degree of every node is one or more.

7 Concept of a k-core graph
2-core graph: the degree of every node is two or more.

8 Concept of a k-core graph
1-core graph: the degree of every node is one or more.

9 Graph G. 3-core graph: the degree of every node is three or more. The 3-core is the highest k-core subgraph of the graph G.

10 Application of a k-core graph
Analyzing protein-protein interaction data obtained from different sources, G. D. Bader and C.W.V. Hogue, Nature biotechnology, Vol 20, 2002

11

12 Protein function prediction using k-core graphs

13 Introduction : Function prediction
Schwikowski, B., Uetz, P. and Fields, S. A network of protein-protein interactions in yeast. Nature Biotech. 18, (2000). Deals with a network of 2039 proteins and 2709 interactions. 65% of interactions occurred between protein pairs with at least one common function.
Hishigaki, H., Nakai, K., Ono, T., Tanigami, A., and Tagaki, T. Assessment of prediction accuracy of protein function from protein-protein interaction data. Yeast 18, (2001). Reported similar results.

14 Introduction : Function prediction
Hypothesis: Function-unknown proteins that form a densely connected subgraph with proteins of a particular function may belong to that functional group. We utilize this concept by determining k-cores of strategically constructed sub-networks.

15 Prediction of Protein Functions Based on K-cores of Protein-Protein Interaction Networks
“Prediction of Protein Functions Based on K-cores of Protein-Protein Interaction Networks and Amino Acid Sequences”, Md. Altaf-Ul-Amin, Kensaku Nishikata, Toshihiro Koma, Teppei Miyasato, Yoko Shinbo, Md. Arifuzzaman, Chieko Wada, Maki Maeda, Taku Oshima, Hirotada Mori, Shigehiko Kanaya. The 14th International Conference on Genome Informatics, December 14-17, 2003, Yokohama, Japan.

16 E. coli PPI network
Total 3007 proteins and interactions. Around 2000 are function-unknown proteins. The highest k-core of this total graph is not so helpful.

17 10-core graph: the highest k-core of the E. coli PPI network

18 We separate 1072 interactions (out of 11531) involving protein synthesis (P.S.) and function-unknown (U.F.) proteins.

19 Function-unknown proteins of this 6-core graph are likely to be involved in protein synthesis.

20 Extending the k-core based function prediction method and its application to PPI data of Arabidopsis thaliana
“Protein Function Prediction based on k-cores of Interaction Networks”, Norihiko Kamakura, Hiroki Takahashi, Kensuke Nakamura, Shigehiko Kanaya and Md. Altaf-Ul-Amin, Proceedings of 2010 International Conference on Bioinformatics and Biomedical Technology (ICBBT 2010)

21 Materials and Methods : Dataset All PPI data of Arabidopsis thaliana
3118 interactions involving 1302 proteins. Collected from databases and scientific literature by our laboratory. Green= Unknown proteins (289 proteins) Pink= Known proteins (1013 proteins)

22 Materials and Methods : Dataset Functional groups in the network
The PPI dataset contains proteins of 19 different functions according to the first level categories of the KNApSAcK database.

23 Materials and Methods : Dataset The trends of interactions in the context of functional similarity
Diagonal elements show number of interactions between similar function proteins.

24 Materials And Methods : Flowchart of the method

25 Results : Subnetworks
[Table: subnetwork name, number of interactions]
In this work we do not consider sub-networks that contain fewer than 100 interactions. We therefore consider the subnetworks corresponding to 9 functional classes.

26 Results : Subnetwork corresponding to cellular communication
As an example, here we show the subnetworks and k-cores corresponding to cellular communication.
Subnetwork extraction: we extracted the following 3 types of interactions.
Cellular communication-Cellular communication
Cellular communication-Unknown
Unknown-Unknown
Total: 603 interactions
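The extraction step and the later k-core step can be sketched together in a few lines of Python. This is only an illustration under stated assumptions: the protein names, the `function_of` dictionary, the label "unknown" and all function names are hypothetical, and the k-core routine is a plain pruning loop rather than the authors' implementation.

```python
from collections import defaultdict

def k_core(edges, k):
    """Nodes of the k-core, found by repeatedly deleting nodes of degree < k."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    alive = set(adj)
    changed = True
    while changed:
        low = {n for n in alive if len(adj[n] & alive) < k}
        changed = bool(low)
        alive -= low
    return alive

def function_subnetwork(edges, function_of, target):
    """Keep target-target, target-unknown and unknown-unknown interactions."""
    allowed = {target, "unknown"}
    return [(u, v) for u, v in edges
            if function_of.get(u, "unknown") in allowed
            and function_of.get(v, "unknown") in allowed]

def unknowns_in_core(edges, function_of, target, k):
    """Function-unknown proteins that sit in the k-core of the subnetwork."""
    sub = function_subnetwork(edges, function_of, target)
    return {n for n in k_core(sub, k)
            if function_of.get(n, "unknown") == "unknown"}

# Hypothetical toy data: proteins missing from function_of count as unknown.
function_of = {"P1": "cellular_communication", "P2": "cellular_communication",
               "P3": "cellular_communication", "Q1": "energy"}
edges = [("P1", "P2"), ("P2", "P3"), ("P1", "P3"),
         ("X1", "P1"), ("X1", "P2"), ("X1", "P3"),
         ("X2", "P1"), ("Q1", "P1"), ("Q1", "X1")]

print(unknowns_in_core(edges, function_of, "cellular_communication", 3))
```

Here X1 interacts with all three cellular-communication proteins and stays in the 3-core, while X2 has a single interaction and is pruned; the interactions involving the energy protein Q1 never enter the subnetwork.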

27 Results : Subnetwork corresponding to cellular communication
1-core The red nodes : known proteins. The green nodes : unknown proteins.

28 Results : k-cores corresponding to cellular communication
The red nodes : known proteins. The green nodes : function-unknown proteins.

29 Results : k-cores corresponding to cellular communication
The red nodes : known proteins. The green nodes : unknown proteins.
[Figure panels: 6-core, 7-core]
This figure implies that determination of k-cores in strategically constructed sub-networks can reveal which unknown proteins are densely connected to proteins of a particular functional class.

30 Results : Function Predictions
The number of unknown genes included in different k-cores corresponding to different functional groups
Columns: k-core 2, k-core 3, k-core 4, k-core 5, k-core 6, k-core 7, k-core 8
cell_cycle: 11 7
cell_rescue: 4
cellular_communication: 37 33 23 15 12 8
energy: 5 2
metabo: 1
protein_fate: 69 35 25 10
protein_synthesis:
transcription: 24 14
transport_facilitation:
total: 129 88 64 52 36 27

31 Results : Function Predictions
Prediction based on 2-cores, 3-cores and 4-cores 2-core 4-core Most proteins have been assigned unique functions 3-core Most proteins have been assigned unique functions and some have been assigned multiple functions

32 Assessment of Predictions
As most of the proteins with predicted functions are still unknown, their annotations do not contain clear information on their functions. When k is much larger than one, the effect of false positives is greatly reduced. However, to assess the predictions statistically, we constructed 1000 random graphs consisting of the same 1,302 proteins with 3,118 randomly inserted edges, and constructed subnetworks from them.
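The random-graph baseline can be sketched as follows. This is a simplified illustration, not the authors' procedure: it draws random graphs with a fixed number of nodes and edges and records the highest k for which a k-core exists in the whole graph, whereas the lecture builds function-specific subnetworks from each random graph. The helper names, the fixed seed and the small sizes used in the demo are assumptions.

```python
import random
from collections import defaultdict

def random_graph(nodes, n_edges, rng):
    """Draw n_edges distinct undirected edges uniformly at random."""
    nodes = list(nodes)
    if n_edges > len(nodes) * (len(nodes) - 1) // 2:
        raise ValueError("too many edges for this number of nodes")
    edges = set()
    while len(edges) < n_edges:
        u, v = rng.sample(nodes, 2)          # two different nodes, no self-loop
        edges.add((u, v) if u < v else (v, u))
    return sorted(edges)

def highest_core(edges):
    """Largest k for which the k-core is non-empty (0 for an empty graph)."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    alive = set(adj)
    k = 0
    while True:
        # Peel nodes of degree < k + 1 until the remaining graph is stable.
        changed = True
        while changed:
            low = {n for n in alive if len(adj[n] & alive) < k + 1}
            changed = bool(low)
            alive -= low
        if not alive:
            return k
        k += 1

# Small demo (the lecture uses 1,302 proteins, 3,118 edges and 1000 graphs).
rng = random.Random(0)
sample = random_graph(range(50), 120, rng)
distribution = [highest_core(random_graph(range(50), 120, rng)) for _ in range(20)]
print(sorted(distribution))
```

A k-core observed in the real sub-network at a k above everything seen in such a distribution is the kind of evidence the slide refers to.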

33 Assessment of Predictions
The box plots show the distribution of k-cores with respect to their size in 1000 graphs corresponding to each sub-network and the filled triangles show the size of k-cores in real PPI sub-networks.

34 Assessment of Predictions
It can be concluded that the existence of higher-order k-cores in the PPI sub-networks, compared with random graphs of the same size, is likely due to interactions between proteins of similar function. We therefore assume that function predictions based on k-cores with k greater than the highest k found in the corresponding random graphs are statistically significant. Based on this, we predicted the functions of 67 proteins (the list is available online).

35 “Prediction of Protein Functions Based on Protein-Protein Interaction Networks: A Min-Cut Approach”, Md. Altaf-Ul-Amin, Toshihiro Koma, Ken Kurokawa, Shigehiko Kanaya, Proceedings of the Workshop on Biomedical Data Engineering (BMDE), Tokyo, Japan, pp , April 3-4, 2005.

36 Outline Introduction The concept of Min-Cut Problem Formulation A Heuristic Method Evaluation of the Proposed Method Conclusions

37 Outline Introduction The concept of Min-Cut Problem Formulation A Heuristic Method Evaluation of the Proposed Method Conclusions

38 Introduction
After the complete sequencing of several genomes, the challenging problem now is to determine the functions of proteins.
Determining protein functions experimentally
Using various computational methods based on: a) sequence b) structure c) gene neighborhood d) gene fusions e) cellular localization f) protein-protein interactions

39 Introduction
The present work predicts protein functions based on the protein-protein interaction network. For the purpose of prediction, we consider the interactions of function-unknown proteins with function-known proteins and of function-unknown proteins with function-unknown proteins, in the context of the whole network.

40 Introduction
The majority of protein-protein interactions are between protein pairs of similar function. Therefore, we assign function-unknown proteins to different functional groups in such a way that the number of inter-group interactions becomes the minimum. Hence we call the proposed approach a Min-Cut approach.

41 Outline Introduction The concept of Min-Cut Problem Formulation A Heuristic Method Evaluation of the Proposed Method Conclusions

42 The concept of Min-Cut
[Figure: known proteins (K) in groups G1 and G2 and unknown proteins U1-U4]
A typical and small network of known and unknown proteins

43 The concept of Min-Cut
Unknown proteins assigned to known groups based on majority interactions

44 The concept of Min-Cut
Number of CUT = 4

45 The concept of Min-Cut
An alternative assignment of unknown proteins

46 The concept of Min-Cut
Number of CUT = 2
For every assignment of unknown proteins, there is a value of CUT. The Min-Cut approach looks for an assignment for which the number of CUT is minimum.

47 Outline Introduction The concept of Min-Cut Problem Formulation A Heuristic Method Evaluation of the Proposed Method Conclusions

48 Problem Formulation Here we explain some points with a typical example.

49 Problem Formulation
V = set of all nodes
E = set of all edges
G = {K1, K2, K3, K4, K5, K6, K7, K8, K9, K10}
U = {U1, U2, U3, U4, U5, U6, U7, U8}

50 Problem Formulation
We generate U´ ⊆ U such that each protein of U´ is connected in N with at least one protein of group G by a path of length 1 or length 2.
U´ = {U1, U2, U3, U4, U5, U6, U7}
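The construction of U´ can be sketched in Python as a neighbourhood check: an unknown protein is kept if a known protein appears among its neighbours or among the neighbours of its neighbours. The function name and the toy network are hypothetical and only illustrate the rule stated on the slide.

```python
from collections import defaultdict

def reachable_unknowns(edges, known, unknown):
    """Unknown proteins with a path of length 1 or 2 to some known protein."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    selected = set()
    for u in unknown:
        within_one = adj[u]                                  # path length 1
        within_two = set().union(*(adj[x] for x in within_one))  # path length 2
        if (within_one | within_two) & known:
            selected.add(u)
    return selected

# Hypothetical chain: K1 - U1 - U2 - U3 - U4, plus a known-known edge.
known = {"K1", "K2"}
unknown = {"U1", "U2", "U3", "U4"}
edges = [("K1", "K2"), ("U1", "K1"), ("U2", "U1"), ("U3", "U2"), ("U4", "U3")]
print(sorted(reachable_unknowns(edges, known, unknown)))
```

U3 and U4 are more than two steps away from any known protein, so they are left out, in the same way that U8 is absent from U´ in the slide's example.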

51 Problem Formulation
We can assign the proteins of U´ to different groups and calculate CUT. Interactions between known protein pairs can never be part of CUT. For this assignment of unknown proteins, CUT = 6.
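Counting CUT for a given assignment is a single pass over the interactions. The sketch below assumes that CUT is the number of interactions joining proteins placed in different groups, skipping interactions between two known proteins as stated above; interactions touching an unassigned protein are also skipped, which is an assumption of this sketch. All names and the toy network are hypothetical.

```python
def cut_size(edges, known_group, assignment):
    """Number of inter-group interactions for one assignment of unknowns."""
    group = {**known_group, **assignment}
    cut = 0
    for u, v in edges:
        if u in known_group and v in known_group:
            continue                  # known-known pairs never count
        if u not in group or v not in group:
            continue                  # unassigned proteins are ignored here
        if group[u] != group[v]:
            cut += 1
    return cut

# Hypothetical toy network with two groups and two unknown proteins.
known_group = {"K1": "G1", "K2": "G1", "K3": "G2", "K4": "G2"}
edges = [("K1", "K2"), ("K2", "K3"),
         ("U1", "K1"), ("U1", "K3"), ("U1", "K4"),
         ("U2", "U1"), ("U2", "K2")]

print(cut_size(edges, known_group, {"U1": "G2", "U2": "G1"}))
print(cut_size(edges, known_group, {"U1": "G1", "U2": "G2"}))
```

The known-known interaction K2-K3 crosses groups but is never counted, so only the placement of U1 and U2 changes the value.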

52 Problem Formulation
The problem we are trying to solve is to assign the proteins of set U´ to known groups G1, G2, …, Gn in such a way that the CUT becomes the minimum.
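For a very small network the problem can be solved by trying every assignment, which makes the objective concrete. The following Python sketch does exactly that; it is an exhaustive search for illustration only, not the heuristic used in the lecture, and its cost grows exponentially with the number of unknown proteins. The names and the toy data are assumptions.

```python
from itertools import product

def cut_size(edges, known_group, assignment):
    """Inter-group interactions, ignoring known-known pairs."""
    group = {**known_group, **assignment}
    return sum(1 for u, v in edges
               if not (u in known_group and v in known_group)
               and group[u] != group[v])

def min_cut_assignment(edges, known_group, unknowns, groups):
    """Try every assignment of unknowns to groups; return (CUT, assignment)."""
    best = None
    for choice in product(groups, repeat=len(unknowns)):
        assignment = dict(zip(unknowns, choice))
        cut = cut_size(edges, known_group, assignment)
        if best is None or cut < best[0]:
            best = (cut, assignment)
    return best

# Hypothetical toy network: U1 leans towards G1, U2 towards G2.
known_group = {"K1": "G1", "K2": "G1", "K3": "G2", "K4": "G2"}
edges = [("U1", "K1"), ("U1", "K2"), ("U2", "K3"), ("U2", "K4"), ("U1", "U2")]

print(min_cut_assignment(edges, known_group, ["U1", "U2"], ["G1", "G2"]))
```

With two unknown proteins and two groups there are only four assignments to compare; a network of realistic size needs a heuristic instead.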

53 Outline Introduction The concept of Min-Cut Problem Formulation A Heuristic Method Evaluation of the Proposed Method Conclusions

54 A Heuristic Method
The problem at hand is a variant of the network partitioning problem. It is known that network partitioning problems are NP-hard. Therefore, we resort to heuristics to find as good a solution as possible.

55 A Heuristic Method
[Figure: example network with unknown proteins U1-U7]

56 A Heuristic Method
U1 has one path of length 1 with G2 and two paths of length 2 with G1.

57 A Heuristic Method
U4 has two paths of length 1 with G1, one path of length 1 with G2 and one path of length 2 with G3.
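The path counts used in these statements can be computed directly from the adjacency lists. The Python sketch below counts, for one unknown protein, the paths of length 1 and length 2 that reach each functional group. It assumes that the intermediate node of a length-2 path is itself a function-unknown protein, and it does not turn the counts into priorities, because the transcript does not spell out that rule. All names and the toy network are hypothetical.

```python
from collections import defaultdict

def path_counts(protein, edges, known_group):
    """Map each group to (paths of length 1, paths of length 2) from protein."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    counts = defaultdict(lambda: [0, 0])
    for x in adj[protein]:
        if x in known_group:
            counts[known_group[x]][0] += 1          # direct interaction
        else:
            for k in adj[x]:                        # via an unknown neighbour
                if k in known_group:
                    counts[known_group[k]][1] += 1
    return {g: tuple(c) for g, c in counts.items()}

# Hypothetical network built to mirror the statement about U1.
known_group = {"K1": "G1", "K2": "G1", "K3": "G2"}
edges = [("U1", "K3"), ("U1", "U2"), ("U2", "K1"), ("U1", "U3"), ("U3", "K2")]

print(path_counts("U1", edges, known_group))
```

In this toy network U1 has one path of length 1 to G2 and two paths of length 2 to G1, the same pattern the slide describes for U1.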

58 A Heuristic Method

59 A Heuristic Method

60 A Heuristic Method
By assigning all the unknown proteins to their respective highest-priority groups, CUT = 6.

61 A Heuristic Method
For this assignment of unknown proteins, CUT = 7.

62 A Heuristic Method
For this assignment of unknown proteins, CUT = 4.

63 Outline Introduction The concept of Min-Cut Problem Formulation A Heuristic Method Evaluation of the Proposed Method Conclusions

64 Evaluation of the Proposed Approach
The proposed method is a general one and can be applied to any organism and any type of functional classification. Here we applied it to the protein-protein interaction network of the yeast Saccharomyces cerevisiae. We obtained the protein-protein interaction data from ftp://ftpmips.gsf.de/yeast/PPI/ which contains genetic and physical interactions.

65 Evaluation of the Proposed Approach
YAR019c YMR001c
YAR019c YNL098c
YAR019c YOR101w
YAR019c YPR111w
YAR027w YAR030c
YAR027w YBR135w
YAR031w YBR217w
Total pairs
We discard self-interactions and extract a set of unique binary interactions involving 4648 proteins.

66 Evaluation of the Proposed Approach
A network of 12487 interactions and 4648 proteins is reasonably big.

67 Evaluation of the Proposed Approach
We collect from the classification data

68 Evaluation of the Proposed Approach
The proposed approach is intended to predict the functions of function-unknown proteins. However, when the functions of truly function-unknown proteins are predicted, the correctness of the predictions cannot be determined. We therefore treat around 10% randomly selected proteins of each group of Table 1 as function-unknown proteins.

69 Evaluation of the Proposed Approach
The union of the 10% taken from all groups consists of 604 proteins. This is the unknown group U. The union of the remaining 90% of each of the functional groups constitutes the set of known proteins G. There are 3783 proteins in total in G. We generate U´ ⊆ U such that each protein of U´ is connected in N with at least one protein of group G by a path of length 1 or length 2. There are 470 proteins in U´. We predicted the functions of these 470 proteins using the proposed method.

70 Evaluation of the Proposed Approach
We applied this algorithm using Max_value=50000 to predict the functions of the 470 proteins.

71 Evaluation of the Proposed Approach
We cannot guarantee that the minimum CUT corresponds to the maximum number of successful predictions. However, the trend of the results in the figure shows that, very likely, the lower the value of CUT, the greater the number of successful predictions.

72 Evaluation of the Proposed Approach
We then examine the relation between successful predictions and the degree of the proteins in the network. Degree of U4 = 7. Degree of U7 = 3.

73 Evaluation of the Proposed Approach
We then examine the relation between successful predictions and the degree of the proteins in the network.

74 Evaluation of the Proposed Approach
The success rate of prediction is as low as 30.46% for proteins that have degree one in the interaction network. However, it is 67.61% for proteins that have degree 8 or more. This implies that the reliability of the prediction can be improved by providing a reasonable amount of interaction information.

75 Application of network concepts in DNA sequencing

76 Sequencing by hybridization (SBH)
Given an unknown DNA sequence, an array provides information about all strings of length l that the sequence contains.
s = TATGGTGC
S(s,l) = {TAT, ATG, TGG, GGT, GTG, TGC} (orderly placed)
S(s,l) = {GTG, ATG, TGG, TAT, GGT, TGC} (randomly placed)
Input: A spectrum S representing all l-mers from an unknown string s
Output: The string s such that spectrum(s,l) = S.
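Computing the spectrum of a known string is straightforward and is handy for checking a reconstruction. A minimal Python sketch, with the function name `spectrum` chosen here for illustration:

```python
def spectrum(s, l):
    """Set of all substrings of length l (l-mers) contained in s."""
    return {s[i:i + l] for i in range(len(s) - l + 1)}

print(sorted(spectrum("TATGGTGC", 3)))
```

Because the result is a set, the order in which the l-mers are listed carries no information, which is exactly why the original string has to be reconstructed.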

77 Sequencing by hybridization (SBH)
Input: A spectrum S representing all l-mers from an unknown string s Output: The string s such that spectrum (s,l) = S. The reduction of the SBH problem to an Eulerian path problem is to construct a graph whose edges correspond to l-mers from spectrum(s,l) and then to find a path in this graph visiting every edge exactly once.

78 Sequencing by hybridization (SBH)
The reduction of the SBH problem to an Eulerian path problem is to construct a graph whose nodes correspond to (l-1)-mers and whose edges correspond to l-mers from spectrum(s,l), and then to find a path in this graph visiting every edge exactly once.
S(s,l) = {GTG, ATG, TGG, TAT, GGT, TGC}
(l-1)-mers: GT, TG, AT, TG, TG, GG, TA, AT, GG, GT, TG, GC
(l-1)-mers (redundancy removed): GT, TG, AT, GG, TA, GC
[Figure: graph on the nodes GG, AT, GT, GC, TG, TA] s = TATGGTGC
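The reduction can be sketched in Python: each l-mer becomes an edge from its (l-1)-mer prefix to its (l-1)-mer suffix, and an Eulerian path is then found with Hierholzer's algorithm, which is one standard choice and is not named in the lecture. The function name and the error handling are assumptions of this sketch, and it returns only one reconstruction even when several exist.

```python
from collections import defaultdict

def reconstruct(spectrum):
    """Rebuild a string from its l-mer spectrum via an Eulerian path."""
    graph = defaultdict(list)            # (l-1)-mer -> list of successor (l-1)-mers
    indeg = defaultdict(int)
    outdeg = defaultdict(int)
    for lmer in sorted(spectrum):
        prefix, suffix = lmer[:-1], lmer[1:]
        graph[prefix].append(suffix)     # one edge per l-mer
        outdeg[prefix] += 1
        indeg[suffix] += 1
    nodes = sorted(set(indeg) | set(outdeg))
    # Start where outdegree exceeds indegree by one, if such a node exists.
    start = next((n for n in nodes if outdeg[n] - indeg[n] == 1), min(graph))
    # Hierholzer's algorithm: walk until stuck, then back up and splice.
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if graph[node]:
            stack.append(graph[node].pop())
        else:
            path.append(stack.pop())
    path.reverse()
    if len(path) != len(spectrum) + 1:
        raise ValueError("spectrum has no Eulerian path")
    # Spell the string: first node, then the last letter of each later node.
    return path[0] + "".join(node[-1] for node in path[1:])

print(reconstruct({"GTG", "ATG", "TGG", "TAT", "GGT", "TGC"}))
```

For this spectrum the path TA, AT, TG, GG, GT, TG, GC spells TATGGTGC.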

79 Sequencing by hybridization (SBH)
A path in a graph visiting every edge exactly once is called an Eulerian (pronounced "Oilerian") path. A connected graph has an Eulerian path if and only if it contains at most two semibalanced nodes and all other nodes are balanced.
Balanced node: indegree = outdegree
Semibalanced node: |indegree - outdegree| = 1
[Figure: graph on the nodes GG, AT, GT, GC, TG, TA with the semibalanced nodes marked]
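The degree condition can be checked mechanically once the l-mers are read as edges from prefix to suffix. The Python sketch below classifies the nodes and tests the "at most two semibalanced nodes, all others balanced" condition; it does not test connectivity, which the theorem also requires. The function names are illustrative assumptions.

```python
from collections import Counter

def degree_report(spectrum):
    """Return (semibalanced nodes, nodes with |indegree - outdegree| > 1)."""
    indeg, outdeg = Counter(), Counter()
    for lmer in spectrum:
        outdeg[lmer[:-1]] += 1           # edge leaves the prefix node
        indeg[lmer[1:]] += 1             # edge enters the suffix node
    nodes = set(indeg) | set(outdeg)
    semibalanced = {n for n in nodes if abs(indeg[n] - outdeg[n]) == 1}
    unbalanced = {n for n in nodes if abs(indeg[n] - outdeg[n]) > 1}
    return semibalanced, unbalanced

def degrees_allow_eulerian_path(spectrum):
    """Degree part of the Eulerian path condition (connectivity not checked)."""
    semibalanced, unbalanced = degree_report(spectrum)
    return not unbalanced and len(semibalanced) <= 2

example = {"GTG", "ATG", "TGG", "TAT", "GGT", "TGC"}
print(degree_report(example))
print(degrees_allow_eulerian_path(example))
```

For the example spectrum the semibalanced nodes are TA, where the path starts, and GC, where it ends.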

80 Sequencing by hybridization (SBH)
Another example
S(s,l) = {ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT}
(l-1)-mers: AT, TG, TG, GG, TG, GC, GT, TG, GG, GC, GC, CA, GC, CG, CG, GT
(l-1)-mers (redundancy removed): AT, TG, GG, GC, GT, CA, CG
[Figure: graph on the nodes TG, GG, AT, GC, CA, GT, CG] One Eulerian path gives ATGGCGTGCA

81 Sequencing by hybridization (SBH)
S(s,l) = {ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT}
(l-1)-mers: AT, TG, TG, GG, TG, GC, GT, TG, GG, GC, GC, CA, GC, CG, CG, GT
(l-1)-mers (redundancy removed): AT, TG, GG, GC, GT, CA, CG
[Figure: the same graph] Another Eulerian path gives ATGCGTGGCA
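The two slides show that one spectrum can admit more than one reconstruction. A small backtracking search in Python makes this explicit by enumerating every string that uses each l-mer exactly once. This is a brute-force sketch for tiny spectra only, with hypothetical function names; it is not an efficient Eulerian path enumeration.

```python
def all_reconstructions(spectrum):
    """Every string whose consecutive l-mers use each spectrum element once."""
    lmers = sorted(spectrum)
    results = set()

    def extend(text, remaining):
        if not remaining:
            results.add(text)
            return
        for i, lmer in enumerate(remaining):
            # The next l-mer must overlap the current text by l-1 letters.
            if text[-(len(lmer) - 1):] == lmer[:-1]:
                extend(text + lmer[-1], remaining[:i] + remaining[i + 1:])

    for i, lmer in enumerate(lmers):
        extend(lmer, lmers[:i] + lmers[i + 1:])
    return results

def spectrum(s, l):
    """Set of all l-mers contained in s."""
    return {s[i:i + l] for i in range(len(s) - l + 1)}

S = {"ATG", "TGG", "TGC", "GTG", "GGC", "GCA", "GCG", "CGT"}
print(sorted(all_reconstructions(S)))
```

Both strings have the same 3-mer spectrum, so hybridization data with l = 3 cannot distinguish between them.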

