Efficient Triangle Motif Counting in Large Scale Complex Networks with GPUs Hakan Kardeş CS 791v.


1 Efficient Triangle Motif Counting in Large Scale Complex Networks with GPUs
Hakan Kardeş CS 791v

2 Introduction
Many systems are modeled as complex networks in order to understand their local and global characteristics. Studying network models of these systems provides a new direction for better understanding biological, chemical, technological, and social systems.

3 Complex Networks Everywhere
Figure panels: Aspirin; Yeast protein interaction network; An Internet; Web; Co-author network

4 Why Graph Mining and Searching?
In many cases, the systems under investigation are very large, and the corresponding graphs have a large number of nodes and edges, so graph mining techniques are required to derive information from them. Several graph mining techniques have been developed to extract useful information from graph representations and to analyze various features of complex networks.

5 Why is Triangle Counting important?
Clustering coefficient
Transitivity ratio
Social network analysis fact: “Friends of friends are friends” [WF94]
Hidden thematic structure of the Web (Eckmann et al., PNAS [EM02])
Motif detection (e.g., [YPSB05])
Web spam detection (Becchetti et al., KDD ’08 [BBCG08])
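The first two items both reduce to triangle counts, which a short example can make concrete. The sketch below uses the standard definitions (local clustering coefficient 2·T(v) / (d(v)·(d(v)−1)) and transitivity ratio 3·triangles / connected triples). The function names and the toy graph are assumptions made for illustration and are not taken from the slides.

```python
from itertools import combinations

def triangles_per_node(adj):
    """Return {node: number of triangles containing that node}."""
    counts = {}
    for v, nbrs in adj.items():
        # A triangle through v is a pair of v's neighbors that are adjacent to each other.
        counts[v] = sum(1 for u, w in combinations(nbrs, 2) if w in adj[u])
    return counts

def local_clustering(adj):
    """Local clustering coefficient 2*T(v) / (d(v) * (d(v) - 1)) for every node."""
    tri = triangles_per_node(adj)
    result = {}
    for v, nbrs in adj.items():
        d = len(nbrs)
        result[v] = 2.0 * tri[v] / (d * (d - 1)) if d >= 2 else 0.0
    return result

def transitivity(adj):
    """Transitivity ratio: 3 * (number of triangles) / (number of connected triples)."""
    tri = triangles_per_node(adj)
    closed = sum(tri.values())  # each triangle is counted once per corner, i.e. 3 times
    triples = sum(len(n) * (len(n) - 1) // 2 for n in adj.values())
    return closed / triples if triples else 0.0

# Toy graph (hypothetical): triangle A-B-C with a pendant node D attached to C.
adj = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C"},
}
print(local_clustering(adj))
print(transitivity(adj))
```

Both measures need only per-node triangle counts and degrees, which is why an efficient triangle counter is the expensive part of computing them.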

6 Related Work
Hakan Kardes and M. H. Gunes. Structural Graph Indexing for Mining Complex Networks. IEEE ICDCS 2010 Workshop on Simplifying Complex Networks for Practitioners, Genoa, Italy, June
Our paper, in which we count all star, triangle, complete bipartite, and clique structures.
Matthieu Latapy. Main-memory triangle computations for very large (sparse (power-law)) graphs. Theor. Comput. Sci. 407, 1-3 (November 2008)
Survey paper, focused on space complexity.
Charalampos Tsourakakis, Petros Drineas, Eirinaios Michelakis, Ioannis Koutis, and Christos Faloutsos. Spectral Counting of Triangles in Power-Law Networks via Element-Wise Sparsification. International Conference on Advances in Social Network Analysis and Mining, 2009
Relies on the spectral properties of power-law networks; focused on power-law networks.

7 Related Work
Luca Becchetti, Paolo Boldi, Carlos Castillo, and Aristides Gionis. Efficient algorithms for large-scale local triangle counting. ACM Trans. Knowl. Discov. Data 4, 3, Article 13 (October 2010), 28 pages.
They count the number of triangles for a given node.
Charalampos E. Tsourakakis, U Kang, Gary L. Miller, and Christos Faloutsos. Doulion: Counting triangles in massive graphs with a coin. In Knowledge Discovery and Data Mining (KDD '09)
Belkacem Serrour, Alex Arenas, and Sergio Gomez. Detecting communities of triangles in complex networks using spectral optimization.
Bill Andreopoulos, Christof Winter, Dirk Labudde, and Michael Schroeder. Triangle network motifs predict complexes by complementing high-error interactomes with structural information. BMC Bioinformatics 2009
Graph indexing studies fall mainly into two categories: path-based and structure-based approaches. Path-based approaches, such as GraphGrep [19] and Daylight [11], use path expressions as indexing features. GraphGrep enumerates all paths in the graph up to length maxL. Then, for a graph query qi, it checks whether each graph gi contains all of the query's paths up to length maxL. A significant advantage of path-based approaches is that paths can be manipulated more easily than general graphs. However, as Yan et al. indicated, a path is a simple structure that loses structural information of the graph, and hence the false positive ratio of path-based methods can be very high [22]. In addition, the number of paths in a graph database increases exponentially, making path-based methods impractical for very large graphs.
Alternatively, structure-based graph indexing approaches, such as gIndex [22], identify subgraphs to be indexed. gIndex first searches for the frequent subgraphs in the graph and then indexes these frequent structures. An issue in this case is that frequent subgraph discovery increases complexity, and an exponential number of frequent fragments may exist under low frequency support. Therefore, in their study, they limit the number of nodes and index frequent structures of up to 10 nodes.

8 Methodology

9 Star
We first index the star structure, in which a node has multiple neighbors, as shown in the figures below. All star structures within a graph G = (V,E) are represented as s(vi, nsi), where vi ∈ V and nsi is the set of all neighbors of vi. We index the maximal star structure for each node.

10 Star
Nodes: a, b, c, d, e, f.
Edges: (a,b), (a,d), (a,f), (b,e), (c,f), (d,f)
Star Structures: [figure]
Algorithm:
First build a star structure s(v,ø), without any neighbors, for each node v ∈ V.
Then, for each edge e(a, b) ∈ E, add b to the neighbor set of a and a to the neighbor set of b.
Finally, remove star structures s(v,ns) that have fewer than two neighbors.
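The three steps of the star algorithm can be sketched in a few lines. This is a minimal illustration on the slide's example graph, not the author's implementation. The function name `index_stars` and the dictionary-of-sets representation are assumptions.

```python
def index_stars(nodes, edges):
    """Build maximal star structures s(v, ns) as a {v: ns} mapping (hypothetical helper)."""
    # Step 1: start with an empty star s(v, {}) for every node.
    stars = {v: set() for v in nodes}
    # Step 2: for each edge (a, b), add each endpoint to the other's neighbor set.
    for a, b in edges:
        stars[a].add(b)
        stars[b].add(a)
    # Step 3: drop stars with fewer than two neighbors.
    return {v: ns for v, ns in stars.items() if len(ns) >= 2}

# Example graph from the slide.
nodes = ["a", "b", "c", "d", "e", "f"]
edges = [("a", "b"), ("a", "d"), ("a", "f"), ("b", "e"), ("c", "f"), ("d", "f")]

stars = index_stars(nodes, edges)
for center in sorted(stars):
    print(center, sorted(stars[center]))
```

On this graph, nodes c and e each have a single neighbor, so their stars are removed in the last step, leaving stars centered at a, b, d, and f.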

11 Triangle
Algorithm:
Find the second-hop neighbors of ‘a’ by iterating over the ns set.
Then, take the intersection of the second-hop neighbors of ‘a’ and the ns set.
Grow the triangle set for each element isi ∈ is of the intersection set.
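A sequential sketch of this intersection step is shown below. It assumes an undirected graph stored as a dictionary of neighbor sets and records each triangle as a sorted tuple so that it is counted once. The names `build_adjacency` and `find_triangles` are hypothetical, and the sketch says nothing about how the work would be divided among GPU threads.

```python
def build_adjacency(edges):
    """Undirected adjacency as {node: set of neighbors}."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj

def find_triangles(adj):
    """Return the set of triangles, each as a sorted 3-tuple of nodes."""
    triangles = set()
    for a, ns in adj.items():
        for n in ns:
            # Second-hop neighbors of a reached through n, intersected with ns:
            # every node w in the intersection closes the triangle (a, n, w).
            for w in adj[n] & ns:
                triangles.add(tuple(sorted((a, n, w))))
    return triangles

# Example graph from the Star slide.
edges = [("a", "b"), ("a", "d"), ("a", "f"), ("b", "e"), ("c", "f"), ("d", "f")]
adj = build_adjacency(edges)
print(find_triangles(adj))
```

Every triangle is discovered from each of its three corners, and storing sorted tuples in a set removes those repeats. The per-node loop bodies are independent of one another, which is what makes the computation a candidate for parallel execution.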

12 CUDA
For the parallel algorithm, I will use CUDA.

13 CUDA

14 Experiments

15 Possible Datasets for Experiments
Router-level Internet topology (around 2.3M nodes and 4M edges)
Routing data on the Internet network (around 124K nodes and 207K edges)
A mobile phone graph (around 2.7M nodes and 6M edges)
Will be requested from the authors of “Structure of neighborhoods in a large social network”
Biological data
Wikipedia graph (around 1.6M nodes and 18.5M edges)
I haven’t decided how to do it yet.
I will generate sample graphs with different numbers of triangles.

16 Results
Triangle Counting CPU vs. GPU (plot: execution time vs. no. of nodes)

17 Results
Triangle Counting CPU vs. GPU (plot: execution time vs. no. of edges, while no. of nodes is constant)

18 Results
Triangle Counting with different triangle sizes (plot: execution time vs. no. of triangles)

19 Results
Triangle Counting with different block sizes (plot: execution time)

20 Future Work

21 Structural Graph Indexing(SGI)
We propose an alternative structural indexing approach to search and process queries efficiently, even in very large graphs. As indexing features, we use commonly observed graph structures: star, complete bipartite, triangle, and clique. These structures are ubiquitous in biological, chemical, technological, social, and many other complex networks.

22 Structural Models
Figure panels: 3-Star (K1,3); n-Star (K1,n); 2*3-Complete Bipartite (K2,3); 3*4-Complete Bipartite (K3,4); m*n-Complete Bipartite (Km,n); Triangle (K3); Clique (K4); n-Clique (Kn)

23 Structural Graph Indexing
An important feature of these structures is that each one is composed of the previous one: a clique contains complete bipartite structures, and a complete bipartite structure contains star structures. So, we can index these structures within the original graph in a consecutive manner. We first identify star structures, and then the complete bipartite and clique structures from the preceding ones.

24 Structural Graph Indexing
An important difference between our approach and previous studies is that we do not limit the size of the subgraphs considered in indexing. We index all maximal graphs that match the structure formulation. For instance, a maximal clique is a clique that cannot be extended by adding one more vertex from the graph. However, the substructure size in indexing may be limited when needed, since maximal clique search is known to be NP-complete.

25 Complete Bipartite
The second structure we index is the complete bipartite structure, shown in the figures below. A complete bipartite graph G = (V1 ∪ V2, E) is a bipartite graph such that V1 and V2 are two disjoint sets and, for any two vertices vi ∈ V1 and vj ∈ V2, there is an edge between them (i.e., ∃ e(vi, vj) ∈ E).
The complete bipartite graph with partitions of size |V1| = m and |V2| = n is denoted as K(m, n). Note that the star structure is a special case of the complete bipartite graph where m = 1. Moreover, finding a complete bipartite subgraph K(m,n) with the maximal number of edges m·n is an NP-complete problem [15].
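The definition translates directly into a membership test. The sketch below assumes an undirected graph stored as a dictionary of neighbor sets. The function name `is_complete_bipartite` and the node labels are hypothetical. It checks only that every cross edge exists (a subgraph test) and does not forbid edges inside V1 or V2.

```python
def is_complete_bipartite(adj, left, right):
    """True if left and right are disjoint, non-empty, and every (vi, vj) cross pair is an edge."""
    left, right = set(left), set(right)
    if not left or not right or left & right:
        return False
    # Every node on the left must be adjacent to all nodes on the right.
    return all(right <= adj[v] for v in left)

# K(2,3): v1 and v2 are each connected to d1, d2, and d3.
adj = {
    "v1": {"d1", "d2", "d3"},
    "v2": {"d1", "d2", "d3"},
    "d1": {"v1", "v2"},
    "d2": {"v1", "v2"},
    "d3": {"v1", "v2"},
}
print(is_complete_bipartite(adj, {"v1", "v2"}, {"d1", "d2", "d3"}))
print(is_complete_bipartite(adj, {"v1"}, {"d1", "d2", "d3"}))  # a star, i.e. m = 1
```

With |V1| = 2 and |V2| = 3 this example has m·n = 6 edges, and the second call shows the star as the m = 1 case.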

26 Complete Bipartite
Complete bipartite structure is ubiquitous in many complex networks:
protein-protein interaction networks (Thomas et al.)
the Internet (Fay et al.)
We index all complete bipartite structures in the graph G using the indexed star structures. For each star structure s(a,ns), where a ∈ V and ns is the neighbor set of the node a, we identify the maximal complete bipartite structure involving the node ‘a’.
For example, [4] examines the structure of protein-protein interaction networks and shows that the graph of all protein-protein interactions is made up of complete bipartite structures containing two disjoint sets of nodes in which each node in one set is connected to every node in the other set. Additionally, bipartite graphs are very relevant in the Internet, as such subgraphs correspond to nodes that have identical sets of neighbors. Studying those nodes would be of particular interest for understanding the structure of the Internet, as these are duplicated structures in the graph [8].

27 Complete Bipartite
Algorithm:
Find the second-hop neighbors of ‘a’ by iterating over the ns set and unifying them under the Lcan set, which holds the candidates for the left side of the complete bipartite structure, while the ns set is the candidate set for the right-hand side.
Then, find a K2,n and grow it to Km,n.
To find K2,n, iterate over each candidate node in Lcan and determine its neighbor intersection with a. If the intersection set is larger than two, then these nodes belong to the right-hand side (Rnew).
Grow the K2,n by finding all nodes in the left-hand side (i.e., Lcan) that have all of the right-hand side nodes (i.e., Rnew) as neighbors.
Figure labels: a; ns1 ... nsn (right candidate set); ls1 ... lsn (left candidate set)
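One way to read these steps as code is sketched below, for a single star center `a`. The function name, the frozenset return format, and the `min_right` parameter are assumptions. `min_right=3` follows the slide's "larger than two" threshold literally and can be changed if a different minimum is intended.

```python
def complete_bipartites_for(adj, a, min_right=3):
    """Complete bipartite structures (left, right) that contain node a on the left side."""
    ns = adj[a]
    # Lcan: second-hop neighbors of a, i.e. candidates for the left side.
    lcan = set()
    for n in ns:
        lcan |= adj[n]
    lcan.discard(a)

    found = set()
    for c in lcan:
        # K(2,n): the neighbors shared by a and c form the right side Rnew.
        r_new = adj[c] & ns
        if len(r_new) < min_right:
            continue
        # Grow to K(m,n): every left candidate adjacent to all of Rnew joins the left side.
        left = {a} | {l for l in lcan if r_new <= adj[l]}
        found.add((frozenset(left), frozenset(r_new)))
    return found

# Hypothetical graph: v1, v2, v3 all connect to d1, d2, d3; v4 connects to d1 only.
adj = {
    "v1": {"d1", "d2", "d3"},
    "v2": {"d1", "d2", "d3"},
    "v3": {"d1", "d2", "d3"},
    "v4": {"d1"},
    "d1": {"v1", "v2", "v3", "v4"},
    "d2": {"v1", "v2", "v3"},
    "d3": {"v1", "v2", "v3"},
}
for left, right in complete_bipartites_for(adj, "v1"):
    print(sorted(left), sorted(right))
```

Here v4 shares only d1 with v1, so it is rejected at the K(2,n) step, and the single structure found is the K(3,3) on {v1, v2, v3} and {d1, d2, d3}. Different candidates can lead to the same structure, so results are collected in a set.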

28 Clique
Finally, we index the clique structures shown in the figures below. A clique in a graph G = (V,E) is a subset of the vertex set (i.e., C ⊆ V) such that there is an edge between every pair of its nodes (i.e., ∀ ci, cj ∈ C with i ≠ j, ∃ e(ci, cj) ∈ E). We index all maximal clique structures in the graph using complete bipartite structures.
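The clique definition, together with the notion of maximality (a clique that cannot be extended by one more vertex), can be checked directly. This is a small sketch assuming a dictionary-of-sets adjacency. The function names and the example graph are hypothetical.

```python
from itertools import combinations

def is_clique(adj, nodes):
    """True if every pair of distinct nodes in `nodes` is joined by an edge."""
    return all(v in adj[u] for u, v in combinations(nodes, 2))

def is_maximal_clique(adj, nodes):
    """True if `nodes` is a clique and no outside vertex is adjacent to all of its members."""
    nodes = set(nodes)
    if not is_clique(adj, nodes):
        return False
    return not any(nodes <= adj[v] for v in adj if v not in nodes)

# Hypothetical graph: a 4-clique on v1..v4, plus v5 attached to v1 only.
adj = {
    "v1": {"v2", "v3", "v4", "v5"},
    "v2": {"v1", "v3", "v4"},
    "v3": {"v1", "v2", "v4"},
    "v4": {"v1", "v2", "v3"},
    "v5": {"v1"},
}
print(is_clique(adj, {"v1", "v2", "v3"}))
print(is_maximal_clique(adj, {"v1", "v2", "v3"}))
print(is_maximal_clique(adj, {"v1", "v2", "v3", "v4"}))
```

The triangle {v1, v2, v3} is a clique but not a maximal one, because v4 extends it. Only the full 4-clique and the edge {v1, v5} are maximal in this graph.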

29 Clique
This structure has been observed and utilized in many fields:
computational biology
protein structure prediction (Samudrala et al.)
electronic circuits (Cong et al.)
chemicals in a chemical database (Rhodes et al.)
For example, in computational biology many problems can be solved by finding maximal or all cliques within the graph. Similarly, [18] models protein structure prediction as a problem of finding cliques in a graph whose vertices represent positions of subunits of the protein; [6] finds a hierarchical partition of an electronic circuit into smaller subunits using cliques; and [16] uses cliques to describe chemicals in a chemical database that have a high degree of similarity with a target structure.

30 Clique
Algorithm:
First get the set of nodes from each complete bipartite K(m,n) and look for cliques that are formed by those nodes.
The clique search algorithm works recursively, taking each node from the K(m,n) as the pivot node in the L1 set and considering the other nodes as candidate nodes in the L2 set.
The function moves each node from the L2 set to the L1 set if it is connected to all nodes in L1, and then recursively tries to grow the structure with the remaining nodes as candidates.
When there are no more candidates to consider in the L2 set, a clique has been identified.
Note that any clique larger than three nodes in the graph G will be indexed as multiple bipartite structures. Hence, we do not need to consider all nodes in the graph when indexing maximal clique structures.
Note that this algorithm is not optimal, and better solutions for finding all cliques are proposed in [5], [17].
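The recursive L1/L2 search can be sketched as follows. This is an illustrative reading of the steps above rather than the author's code. The name `find_cliques`, the `pool` argument (standing in for the node set of one K(m,n)), and the `seen` set used to skip repeated L1 sets are all assumptions.

```python
def find_cliques(adj, pool):
    """Cliques among the nodes in `pool` that cannot be extended by another pool node."""
    found, seen = set(), set()

    def grow(l1, l2):
        key = frozenset(l1)
        if key in seen:  # this L1 set was already expanded along another path
            return
        seen.add(key)
        # Keep only candidates connected to every node already in L1.
        candidates = [v for v in l2 if l1 <= adj[v]]
        if not candidates:
            found.add(key)  # no candidates left in L2: a clique has been identified
            return
        for v in candidates:
            # Move v from L2 to L1 and recurse with the remaining candidates.
            grow(l1 | {v}, [u for u in candidates if u != v])

    pool = list(pool)
    for pivot in pool:
        grow({pivot}, [u for u in pool if u != pivot])
    return found

# Hypothetical graph: a 4-clique on v1..v4, plus v5 attached to v1 only.
adj = {
    "v1": {"v2", "v3", "v4", "v5"},
    "v2": {"v1", "v3", "v4"},
    "v3": {"v1", "v2", "v4"},
    "v4": {"v1", "v2", "v3"},
    "v5": {"v1"},
}
for clique in find_cliques(adj, adj.keys()):
    print(sorted(clique))
```

The example passes every node as the pool for brevity. In the indexing pipeline described above, the pool would be the nodes of a single complete bipartite structure, which keeps the search space small. The search is exhaustive and can take exponential time on large pools, consistent with the slide's remark that the algorithm is not optimal.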

31 Where to Submit
Advances in Social Network Analysis and Mining (ASONAM 2011)
Full paper submission deadline is March 1, 2011.
Full paper manuscripts have a maximum length of 8 pages (using the IEEE two-column template).
Kaohsiung, Taiwan, 7/25-7/27
Workshop on Large-scale Data Mining: Theory and Applications (LDMTA 2011)
Workshop on Mining and Learning with Graphs (MLG 2011)
Workshop on Social Network Mining and Analysis (SNAKDD 2011)
Full paper submission deadline is May 4-10, 2011.
Full paper manuscripts have a maximum length of 10 pages (using the ACM two-column template).
San Diego, CA, 8/21-8/24
Simplifying Network Science for Practitioners (SIMPLEX 2011)
Full paper submission deadline is Jan 31, 2011 – Feb
Full paper manuscripts have a maximum length of 6-10 pages (using the IEEE two-column template).
Minneapolis, Minnesota, USA, 6/20-6/24

32 Questions

33 Thank you

