Searching for large cliques in large scale networks
Workshop on Clustering and Search Techniques in Large Scale Networks (3-8 Nov. 2014)
Pablo San Segundo Carrillo (Associate Professor, UPM)
This work is partially funded by the Spanish National Government (DPI C02) and CAR (UPM-CSIC).
Overview
Basic concepts related to exact (large) clique search: enumeration; pruning scheme (greedy sequential coloring)
K-core analysis: an O(|E|) algorithm
Bit string encoding of graphs: BITSCAN / GRAPH C++ libraries; encoding of sparse graphs
BBMCS, a new maximum clique algorithm for large scale networks: pseudocode; results
Summary
Basic clique enumeration
[Figure: binomial search tree (with repetitions) over vertex subsets; surviving node labels include {2,3,4}, {1,4} and {4}]
Basic pruning scheme: greedy coloring (I)
[Figure: example graph partitioned into color classes C1, C2, C3]
SEQ: greedy sequential coloring procedure
1. Define a vertex ordering.
2. Color vertices sequentially with the least possible color.
Proposition 1 (Balas & Yu, 1986): the size of any feasible coloring C(G) is an upper bound on the size of a maximum clique in G, that is, ω(G) ≤ |C(G)|.
How to define a good ordering?
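The SEQ procedure above can be sketched in a few lines of C++. This is a minimal illustration on adjacency lists, not the bitstring implementation used later in the talk; the names `Graph`, `greedy_coloring` and `num_colors` are hypothetical and chosen only for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// adj[v] lists the neighbours of vertex v (undirected graph, 0-based labels).
using Graph = std::vector<std::vector<int>>;

// SEQ: visit vertices in `order` and give each one the least colour
// (0-based) that none of its already-coloured neighbours uses.
std::vector<int> greedy_coloring(const Graph& adj, const std::vector<int>& order) {
    assert(order.size() == adj.size());
    const std::size_t n = adj.size();
    std::vector<int> color(n, -1);          // -1 means "not coloured yet"
    std::vector<bool> used(n + 1, false);   // colours taken around the current vertex
    for (int v : order) {
        std::fill(used.begin(), used.end(), false);
        for (int w : adj[v])
            if (color[w] >= 0) used[color[w]] = true;
        int c = 0;
        while (used[c]) ++c;                // least colour still free
        color[v] = c;
    }
    return color;
}

// Number of colours used, i.e. |C(G)|, an upper bound on the clique number.
int num_colors(const std::vector<int>& color) {
    int k = 0;
    for (int c : color) k = std::max(k, c + 1);
    return k;
}
```

On the 5-vertex graph with edges 0-1, 0-2, 1-2, 1-3, 3-4 and the natural order 0..4, the sketch uses three colours, which here matches the triangle {0, 1, 2}. A different ordering may use more colours, which is why the choice of ordering matters.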
Basic pruning scheme: greedy coloring (II)
[Figure: search node at depth level k on the induced subgraph G[U], with vertex 1 as candidate; labels: U', size of current growing clique, size of current champion]
Is it worth selecting vertex 1 as a candidate?
Application of the color bound: since the current largest clique cannot be improved, vertex 1 is pruned.
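The slide does not spell out the test, so the following is an assumption based on how the color bound is usually written in MCQ-style algorithms: a candidate v with color number c(v) (1-based) is pruned when the growing clique plus c(v) cannot exceed the champion. The function name and parameters are hypothetical.

```cpp
#include <cassert>

// Colour-bound test (assumed form): the branch rooted at a candidate with
// 1-based colour number `color_number` can reach at most
// clique_size + color_number vertices, so it is pruned when that does not
// beat the champion.
bool color_bound_prunes(int clique_size, int color_number, int champion_size) {
    assert(clique_size >= 0 && color_number >= 1 && champion_size >= 0);
    return clique_size + color_number <= champion_size;
}
```

For example, with a growing clique of size 2 and a champion of size 3, a candidate colored 1 is pruned, while a candidate colored 2 still has to be explored.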
Initial sorting of nodes for maximum clique
How should vertices be sorted initially?
Proposition II: initially vertices should be sorted in non-decreasing degree order.
Absolute: degrees are taken once, from the original graph.
Degenerate: at each step each selected vertex is removed from the original graph and degrees are recomputed.
Example orderings, written as vertex (degree when selected):
Absolute:   0 (1), 2 (1), 3 (1), 4 (2), 1 (3)
Degenerate: 0 (1), 2 (1), 3 (1), 1 (1), 4 (1)
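A degenerate ordering can be sketched as repeated minimum-degree removal. The version below is a deliberately simple quadratic illustration; ties are broken by the smallest vertex index, which is an assumption of this sketch and may differ from the tie-breaking used on the slide. The names are hypothetical.

```cpp
#include <cassert>
#include <utility>
#include <vector>

using Graph = std::vector<std::vector<int>>;  // adj[v] = neighbours of v

// Returns (vertex, degree at removal time) pairs in degenerate order:
// repeatedly pick a vertex of minimum current degree, remove it, and
// lower the degree of its remaining neighbours.
std::vector<std::pair<int, int>> degenerate_order(const Graph& adj) {
    const int n = static_cast<int>(adj.size());
    std::vector<int> deg(n);
    std::vector<bool> removed(n, false);
    for (int v = 0; v < n; ++v) deg[v] = static_cast<int>(adj[v].size());

    std::vector<std::pair<int, int>> order;
    order.reserve(n);
    for (int step = 0; step < n; ++step) {
        int best = -1;  // remaining vertex of minimum degree, smallest index first
        for (int v = 0; v < n; ++v)
            if (!removed[v] && (best < 0 || deg[v] < deg[best])) best = v;
        assert(best >= 0);
        order.emplace_back(best, deg[best]);
        removed[best] = true;
        for (int w : adj[best])
            if (!removed[w]) --deg[w];  // degrees are recomputed after each removal
    }
    return order;
}
```

On the graph with edges 0-1, 0-2, 1-2, 1-3, 3-4 this yields 4 (1), 3 (1), 0 (2), 1 (1), 2 (0): the pendant path is peeled first and the triangle is left for last.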
State of the art (last decade): middle size graphs
MCQ: Tomita & Seki 2003. Heuristic decision based on color.
MaxClique-Dyn: Konc & Janežič 2007
MCS: Tomita et al.
BBMC: San Segundo et al. Use of bitstrings; initial order of vertices fixed.
BBMCL: San Segundo et al.
Impact of an initial large clique: Batsyn et al.
MaxSAT: Li et al. 2010, 2013
BBMCX: San Segundo, Batsyn, Nikolaev 2014
Initial sorting improvements: San Segundo, Batsyn, Nikolaev 2014
Vertical coloring: Nikolaev, Batsyn, San Segundo 2014
REAL GRAPHS
K-CORE DECOMPOSITION
Preliminaries
Definition I (k-core of a graph): a maximal subgraph such that all its vertices have degree at least k.
Definition II (core number K(v) of a vertex): the largest k such that the vertex belongs to a k-core.
Proposition III: k-core decomposition is hierarchical.
[Figure: nested 0-core, 1-core, 2-core and 3-core of an example graph]
Degenerate ordering, written as vertex (degree when selected): 0 (1), 2 (1), 3 (1), 1 (1), 4 (1)
Proposition IV: the core number of a graph plus one is an upper bound for the maximum clique, that is, ω(G) ≤ K(G)+1.
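The reasoning behind Proposition IV is short enough to write out. The derivation below assumes that K(G) denotes the largest core number of any vertex of G, which is how the slide appears to use it.

```latex
\begin{align*}
&\text{Let } Q \subseteq V \text{ be a maximum clique, so } |Q| = \omega(G). \\
&\text{Every } v \in Q \text{ has degree } \omega(G) - 1 \text{ inside } G[Q], \\
&\text{so } G[Q] \text{ is contained in the } (\omega(G)-1)\text{-core and }
  K(v) \ge \omega(G) - 1 \text{ for all } v \in Q. \\
&\text{Hence } K(G) = \max_{v \in V} K(v) \ge \omega(G) - 1,
  \text{ i.e. } \omega(G) \le K(G) + 1.
\end{align*}
```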
Quality of core number bounds for clique
Proposition V: ω(G) ≤ |C(G)| ≤ K(G)+1 ≤ Δ(G)+1 (the middle inequality holds for a greedy sequential coloring that colors vertices in reverse degenerate order).
Proposition VI (Batagelj & Zaversnik 2002): there exists an O(|E|) algorithm to compute the k-core decomposition.
Sketch of proof:
I. Order vertices by degree using bin-sort.
II. Critical operation: reduce the degree of a vertex while keeping all vertices sorted by degree. Swap the vertex with the first vertex in the same bin and increment the bin pointer by one.
[Figure: vertices stored in bins of degree 0, 1, 2]
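The proof sketch above corresponds to the following bin-sort procedure. This is a sketch written from the published description of the Batagelj & Zaversnik algorithm, using plain adjacency lists; the variable names (`bin`, `pos`, `vert`) are illustrative and the function name is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

using Graph = std::vector<std::vector<int>>;  // adj[v] = neighbours of v

// Core number of every vertex via bin-sort peeling.
std::vector<int> core_numbers(const Graph& adj) {
    const int n = static_cast<int>(adj.size());
    std::vector<int> deg(n), pos(n), vert(n);
    int max_deg = 0;
    for (int v = 0; v < n; ++v) {
        deg[v] = static_cast<int>(adj[v].size());
        max_deg = std::max(max_deg, deg[v]);
    }
    // I. Bin-sort: bin[d] becomes the index in vert[] where degree-d vertices start.
    std::vector<int> bin(max_deg + 1, 0);
    for (int v = 0; v < n; ++v) ++bin[deg[v]];
    int start = 0;
    for (int d = 0; d <= max_deg; ++d) {
        const int count = bin[d];
        bin[d] = start;
        start += count;
    }
    assert(start == n);
    for (int v = 0; v < n; ++v) {
        pos[v] = bin[deg[v]]++;
        vert[pos[v]] = v;
    }
    for (int d = max_deg; d > 0; --d) bin[d] = bin[d - 1];  // restore bin starts
    bin[0] = 0;
    // II. Peel vertices in sorted order, keeping vert[] sorted by current degree.
    for (int i = 0; i < n; ++i) {
        const int v = vert[i];
        for (int u : adj[v]) {
            if (deg[u] > deg[v]) {
                const int du = deg[u], pu = pos[u];
                const int pw = bin[du], w = vert[pw];  // first vertex of u's bin
                if (u != w) {                           // swap u with it
                    pos[u] = pw; vert[pu] = w;
                    pos[w] = pu; vert[pw] = u;
                }
                ++bin[du];  // the bin now starts one slot later
                --deg[u];   // u has moved down to the bin of degree du - 1
            }
        }
    }
    return deg;  // deg[v] now holds the core number K(v)
}
```

On the graph with edges 0-1, 0-2, 1-2, 1-3, 3-4 the triangle vertices get core number 2 and the two path vertices get 1, so K(G)+1 = 3 is tight for that example.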
Pruning with core numbers
Proposition VII: given a clique of size c, any vertex v such that K(v) < c cannot be part of a larger clique, so it may be pruned.
[Figure: example graph annotated with degrees and core numbers, in which any clique of size 2 cuts all vertices]
Can the coloring of a vertex c(v) be used in the same manner?
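Proposition VII translates into a simple filter over precomputed core numbers. The sketch below assumes the core numbers are already available in a vector indexed by vertex; `prune_by_core` is a hypothetical name.

```cpp
#include <cassert>
#include <vector>

// Keeps only the vertices that could still belong to a clique larger than
// `clique_size`: a clique of size c+1 needs every member to have core
// number at least c, so vertices with K(v) < c are dropped.
std::vector<int> prune_by_core(const std::vector<int>& core, int clique_size) {
    assert(clique_size >= 0);
    std::vector<int> kept;
    for (int v = 0; v < static_cast<int>(core.size()); ++v)
        if (core[v] >= clique_size) kept.push_back(v);
    return kept;
}
```

With core numbers {2, 2, 2, 1, 1}, a known clique of size 2 leaves only vertices 0, 1, 2, and a known clique of size 3 removes every vertex, which proves that clique optimal without any search.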
ENCODING OF THE MAXIMUM CLIQUE PROBLEM WITH BITSTRINGS
Preliminaries
Storage of a subset of natural numbers as a bitstring. Membership in a set: 1-bit, member; 0-bit, not a member.
Masks (C/C++), where A_b and B_b are the bitstrings of sets A and B:
A ∪ B    A_b | B_b
A ∩ B    A_b & B_b
A − B    A_b &~ B_b
B ⊈ A    (B_b &~ A_b) ≠ ∅
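The masks above map directly onto bitwise operators. The sketch below uses a single 64-bit word, so it only covers sets drawn from {0, ..., 63}; the type alias and function names are hypothetical and are not part of BITSCAN.

```cpp
#include <cassert>
#include <cstdint>
#include <initializer_list>

using Bits = std::uint64_t;  // bit i is 1 exactly when i is in the set

// Builds the bitstring of a small set of naturals below 64.
Bits make_set(std::initializer_list<int> elems) {
    Bits b = 0;
    for (int e : elems) {
        assert(e >= 0 && e < 64);
        b |= Bits{1} << e;
    }
    return b;
}

Bits unite(Bits a, Bits b)     { return a | b; }   // A ∪ B
Bits intersect(Bits a, Bits b) { return a & b; }   // A ∩ B
Bits subtract(Bits a, Bits b)  { return a & ~b; }  // A − B

// B ⊆ A exactly when nothing of B is left after removing A.
bool is_subset(Bits b, Bits a) { return (b & ~a) == 0; }
```

Each operation costs one machine instruction per 64 vertices, which is the source of the bit-parallel speedup exploited by BBMC.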
BITSCAN: a C++ library for bitstrings
Inspired by optimization requirements for bit string data structures found during 10 years of research in combinatorial optimization problems: implementation of exact algorithms for NP-hard problems related to graphs (maximum clique: BBMC; vertex coloring: PASS; etc.).
Some of these requirements:
Fast bitscanning loops (forward and reverse directions; destructive and non-destructive)
Sparsity
Semi-sparsity
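To make "destructive forward bitscanning" concrete, here is a portable sketch on one 64-bit word. It does not use or reproduce the BITSCAN API; the function names are hypothetical, and a production library would typically replace `lsb_index` with a hardware instruction or intrinsic.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Index of the least significant 1-bit; the word must be non-zero.
// Written as a portable loop for clarity.
int lsb_index(std::uint64_t b) {
    assert(b != 0);
    int i = 0;
    while ((b & 1u) == 0) {
        b >>= 1;
        ++i;
    }
    return i;
}

// Destructive forward scan: each iteration reports the lowest set bit and
// then erases it from the (local copy of the) word.
std::vector<int> scan_forward(std::uint64_t b) {
    std::vector<int> members;
    while (b != 0) {
        members.push_back(lsb_index(b));
        b &= b - 1;  // clears the lowest set bit
    }
    return members;
}
```

A non-destructive scan would instead keep the word intact and remember the last position visited, at the price of an extra mask per step.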
GRAPH: Graph encoding with BITSCAN
[Figure: adjacency matrix of a 5-vertex graph, stored as one bitarray per vertex (bitarray 0 to bitarray 4)]
Dense encoding:
    #include "pablodev/graph/graph.h"
    #define NUMBER_OF_VERTICES 5
    int main() {
        // undirected graph, one dense bitarray per adjacency row
        ugraph ug(NUMBER_OF_VERTICES);
        ug.add_edge(0, 1);
        ug.add_edge(0, 2);
        ug.add_edge(1, 2);
        ug.add_edge(1, 3);
        ug.add_edge(3, 4);
        //…
    }
Sparse encoding:
    #include "pablodev/graph/graph.h"
    #define NUMBER_OF_VERTICES 5   // value missing on the slide; assumed equal to the dense example
    int main() {
        // undirected graph, one sparse bitarray per adjacency row
        sparse_ugraph ug(NUMBER_OF_VERTICES);
        ug.add_edge(0, 1);
        ug.add_edge(0, 2);
        ug.add_edge(1, 2);
        ug.add_edge(1, 3);
        ug.add_edge(3, 4);
        //…
    }
Subgraphs and sets of vertices as bitstrings
For large scale networks it is CRITICAL to use a sparse bitstring encoding.
G = (V, E), one bit per vertex, vertex 1 leftmost:
V = {1, 2, 3, 4, 5}    G       11111
W = {1, 2, 4}          G[W]    11010
U = {2, 3, 5}          G[U]    01101
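One way to picture a sparse bitstring is as a sorted list of non-empty 64-bit blocks, so that a vertex set touching only a few regions of a huge graph costs memory proportional to those regions. The sketch below is an illustration of that idea only; it is not the BITSCAN data structure, and `SparseBits`, `Block` and `intersect` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

using Block = std::pair<std::size_t, std::uint64_t>;  // (block index, 64-bit word)

struct SparseBits {
    std::vector<Block> blocks;  // sorted by block index; all-zero words are never stored

    static bool before(const Block& b, std::size_t key) { return b.first < key; }

    void set(std::size_t i) {
        const std::size_t blk = i / 64;
        const std::uint64_t mask = std::uint64_t{1} << (i % 64);
        auto it = std::lower_bound(blocks.begin(), blocks.end(), blk, before);
        if (it != blocks.end() && it->first == blk) it->second |= mask;
        else blocks.insert(it, Block{blk, mask});
        assert(std::is_sorted(blocks.begin(), blocks.end()));  // invariant check
    }

    bool test(std::size_t i) const {
        const std::size_t blk = i / 64;
        auto it = std::lower_bound(blocks.begin(), blocks.end(), blk, before);
        return it != blocks.end() && it->first == blk &&
               ((it->second >> (i % 64)) & 1u) != 0;
    }
};

// Set intersection as a merge of the two sorted block lists; only blocks
// present in both operands are ANDed, and empty results are dropped.
SparseBits intersect(const SparseBits& a, const SparseBits& b) {
    SparseBits out;
    std::size_t i = 0, j = 0;
    while (i < a.blocks.size() && j < b.blocks.size()) {
        if (a.blocks[i].first < b.blocks[j].first) ++i;
        else if (b.blocks[j].first < a.blocks[i].first) ++j;
        else {
            const std::uint64_t w = a.blocks[i].second & b.blocks[j].second;
            if (w != 0) out.blocks.emplace_back(a.blocks[i].first, w);
            ++i;
            ++j;
        }
    }
    return out;
}
```

Storing vertices 3 and 1000000 takes two blocks here, whereas a dense bitstring over the same range would need more than fifteen thousand words.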
NEW BBMCS MAXIMUM CLIQUE ALGORITHM FOR LARGE SCALE NETWORKS
The new maximum clique algorithm (I)
BBMCS (G=(V, E))
Initial operations: U = V
1. K = core numbers of U                         // computed in O(|E|)
2. H = initial heuristic clique
3. Remove vertices s.t. K(v) < |H|               // a good H possibly solves the graph
4. Sort vertices in G by non-decreasing K(v)     // typical degeneracy order
5. repeat while |U| > 0
6.     select vertex u with minimum core number
7.     INIT_BRANCH(U, u)                         // unrolling of first level
8.     remove u from U
9. end-repeat
10. return ω(G)
The maximum clique algorithm (II)
BRANCH is the new implementation of BBMC for sparse graphs.
INIT_BRANCH(U, u)   // unrolling of first level
1. P = N_U(u) + u                                 // neighbor set of u (w.r.t. remaining vertices) plus u (a sparse bitstring)
2. if |P| < |H| return                            // cut based on size
3. if |COLOR(P)| ≤ |H| return                     // a good H possibly solves the graph
4. K_p = core numbers of P
5. if K_p(P) < |H| return                         // graph core number cut
6. Remove any vertex v from P s.t. K_p(v) < |H|   // vertex core number cut
7. L = P sorted by non-decreasing K(v)
8. BRANCH(P, L)                                   // BRANCH is the extension of BBMC to the sparse case
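The overall control flow of the two pseudocode slides can be sketched as follows. This is a heavily simplified illustration and not the BBMCS implementation: it uses sorted adjacency lists instead of sparse bitstrings, computes core numbers by quadratic min-degree peeling rather than the O(|E|) algorithm, starts from an empty champion H instead of a heuristic clique, and skips the recomputation of core numbers inside P (steps 4 to 7 of INIT_BRANCH). All names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

using Graph = std::vector<std::vector<int>>;  // adjacency lists, sorted by max_clique

struct Search {
    const Graph& adj;
    std::vector<int> best;  // current champion H

    bool adjacent(int u, int v) const {
        return std::binary_search(adj[u].begin(), adj[u].end(), v);
    }

    // Greedy sequential colouring of P: order lists P grouped by colour class,
    // bound[i] is the 1-based colour of order[i].
    void color_sort(const std::vector<int>& P, std::vector<int>& order,
                    std::vector<std::size_t>& bound) const {
        std::vector<std::vector<int>> classes;
        for (int v : P) {
            std::size_t k = 0;
            while (k < classes.size() &&
                   std::any_of(classes[k].begin(), classes[k].end(),
                               [&](int w) { return adjacent(v, w); }))
                ++k;
            if (k == classes.size()) classes.emplace_back();
            classes[k].push_back(v);
        }
        for (std::size_t k = 0; k < classes.size(); ++k)
            for (int v : classes[k]) {
                order.push_back(v);
                bound.push_back(k + 1);
            }
    }

    // BRANCH: extend clique S with candidates P, pruning with the colour bound.
    void branch(std::vector<int>& S, const std::vector<int>& P) {
        std::vector<int> order;
        std::vector<std::size_t> bound;
        color_sort(P, order, bound);
        for (int i = static_cast<int>(order.size()) - 1; i >= 0; --i) {
            if (S.size() + bound[i] <= best.size()) return;  // colour cut
            const int v = order[i];
            S.push_back(v);
            std::vector<int> next;
            for (int j = 0; j < i; ++j)
                if (adjacent(v, order[j])) next.push_back(order[j]);
            if (next.empty()) { if (S.size() > best.size()) best = S; }
            else branch(S, next);
            S.pop_back();
        }
    }
};

std::vector<int> max_clique(Graph adj) {
    const int n = static_cast<int>(adj.size());
    for (auto& nb : adj) std::sort(nb.begin(), nb.end());
    // Steps 1 and 4: min-degree peeling yields core numbers and degeneracy order.
    std::vector<int> deg(n), core(n), rank(n), order;
    std::vector<bool> removed(n, false);
    for (int v = 0; v < n; ++v) deg[v] = static_cast<int>(adj[v].size());
    int k = 0;
    for (int step = 0; step < n; ++step) {
        int u = -1;
        for (int v = 0; v < n; ++v)
            if (!removed[v] && (u < 0 || deg[v] < deg[u])) u = v;
        assert(u >= 0);
        k = std::max(k, deg[u]);
        core[u] = k; rank[u] = step; order.push_back(u); removed[u] = true;
        for (int w : adj[u]) if (!removed[w]) --deg[w];
    }
    Search s{adj, {}};
    for (int u : order) {  // steps 5 to 9: unrolled first level
        if (core[u] < static_cast<int>(s.best.size())) continue;  // core number cut
        std::vector<int> P;  // neighbours of u among the remaining vertices
        for (int w : adj[u]) if (rank[w] > rank[u]) P.push_back(w);
        if (P.size() + 1 <= s.best.size()) continue;              // size cut
        std::vector<int> S{u};
        if (P.empty()) { if (s.best.empty()) s.best = S; }
        else s.branch(S, P);
    }
    return s.best;
}
```

On the graph with edges 0-1, 0-2, 1-2, 1-3, 3-4 the sketch first records the edge {4, 3}, skips vertex 3 by the core number cut, finds the triangle from vertex 0, and then discards vertices 1 and 2 without branching. That early discarding is the effect the first-level unrolling is designed to produce on sparse graphs.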
Experiments
Reference algorithm: PMC, from "Parallel Maximum Clique Algorithms with Applications to Network Analysis and Storage", Ryan Rossi et al., arXiv.org, 2013. PMC is, by far, the state-of-the-art algorithm.
Hardware: Xeon 20-core Linux server, 64 GB RAM. Only one core used in all cases.
Datasets
Results: DIMACS 10 (I)
Columns: category, name, |V|, |E|, Δ, d_avg, K(G)+1, ω_o, ω
[Table: cell values and numeric name suffixes are missing from this transcript; instances listed in slide order as category: name]
DIMACS 10 (massive): hugebubbles
DIMACS 10 (triangular): delaunay_n
DIMACS 10 (massive): hugetrace
DIMACS 10 (triangular): delaunay_n
DIMACS 10 (massive): hugetric
DIMACS 10: adaptive
DIMACS 10 (massive): hugetric
DIMACS 10 (massive): hugetric
DIMACS 10: channel-500x100x100-b
DIMACS 10 (massive): hugetrace
DIMACS 10 (triangular): delaunay_n
DIMACS 10: packing-500x100x100-b
DIMACS 10 (triangular): delaunay_n
DIMACS 10 (triangular): delaunay_n
DIMACS 10 (random geometric): rgg_n_2_20_s
DIMACS 10 (triangular): delaunay_n
DIMACS 10: auto
DIMACS 10: citationCiteseer
DIMACS 10 (triangular): delaunay_n
DIMACS 10: m14b
DIMACS: [name missing]
DIMACS 10: fe-ocean
DIMACS 10 (triangular): delaunay_n
DIMACS 10: 598a
DIMACS 10: fe_rotor
DIMACS 10: fe-tooth
DIMACS 10 (triangular): delaunay_n
DIMACS 10: wing
DIMACS 10: fe-body
DIMACS 10 (triangular): delaunay_n
Results: DIMACS 10 (II)
Columns: category, name, |V|, |E|, PMC, BBMCS, %imp, ratio imp
[Table: cell values and numeric name suffixes are missing from this transcript; instances listed in slide order as category: name]
DIMACS 10 (massive): hugebubbles
DIMACS 10 (triangular): delaunay_n
DIMACS 10 (massive): hugetrace
DIMACS 10 (triangular): delaunay_n
DIMACS 10 (massive): hugetric
DIMACS 10: adaptive
DIMACS 10 (massive): hugetric
DIMACS 10 (massive): hugetric
DIMACS 10: channel-500x100x100-b
DIMACS 10 (massive): hugetrace
DIMACS 10 (triangular): delaunay_n
DIMACS 10: packing-500x100x100-b
DIMACS 10 (triangular): delaunay_n
DIMACS 10 (triangular): delaunay_n
DIMACS 10 (random geometric): rgg_n_2_20_s (surviving cell text: <.001)
DIMACS 10 (triangular): delaunay_n
DIMACS 10: auto
DIMACS 10: citationCiteseer
DIMACS 10 (triangular): delaunay_n
DIMACS 10: m14b
DIMACS: [name missing]
DIMACS 10: fe-ocean
DIMACS 10 (triangular): delaunay_n
DIMACS 10: 598a
DIMACS 10: fe_rotor
DIMACS 10: fe-tooth
DIMACS 10 (triangular): delaunay_n
DIMACS 10: wing
DIMACS 10: fe-body
DIMACS 10 (triangular): delaunay_n
Results: Social (I)
Columns: category, name, |V|, |E|, Δ, d_avg, K(G)+1, ω_o, ω
[Table: cell values and numeric name suffixes are missing from this transcript; instances listed in slide order as category: name]
Social facebook: socfb-A-anon
Social facebook: socfb-B-anon
Social: soc-flixster
Web graphs: web-wikipedia
Social: soc-pokec
Social: soc-lastfm
Social: soc-youtube-snap
Social: soc-digg
Social: soc-FourSquare
Social: soc-delicious
Social: soc-flickr
Social: soc-youtube
Social: soc-twitter-follows
Social: soc-gowalla
Social: soc-douban
Social: soc-LiveMocha
Social: soc-buzznet
Social: soc-BlogCatalog
Social: soc-slashdot
Social facebook: socfb-OR
Social: soc-brightkite
Social facebook: socfb-Penn
Social facebook: socfb-Texas
Social facebook: socfb-UF
Social facebook: socfb-UIllinois
Social facebook: socfb-Indiana
Social: soc-epinions
Social facebook: socfb-Wisconsin
Social facebook: socfb-Berkeley
Social facebook: socfb-UCLA
Social facebook: socfb-UConn
Results: Social (II)
Columns: category, name, |V|, |E|, PMC, BBMCS, %imp, ratio imp
[Table: cell values and numeric name suffixes are missing from this transcript; instances listed in slide order as category: name]
Social facebook: socfb-A-anon
Social facebook: socfb-B-anon
Social: soc-flixster
Web graphs: web-wikipedia
Social: soc-pokec
Social: soc-lastfm
Social: soc-youtube-snap
Social: soc-digg
Social: soc-FourSquare
Social: soc-delicious
Social: soc-flickr
Social: soc-youtube
Social: soc-twitter-follows
Social: soc-gowalla
Social: soc-douban
Social: soc-LiveMocha
Social: soc-buzznet
Social: soc-BlogCatalog
Social: soc-slashdot
Social facebook: socfb-OR
Social: soc-brightkite
Social facebook: socfb-Penn
Social facebook: socfb-Texas
Social facebook: socfb-UF
Social facebook: socfb-UIllinois
Social facebook: socfb-Indiana
Social: soc-epinions
Social facebook: socfb-Wisconsin
Social facebook: socfb-Berkeley
Social facebook: socfb-UCLA
Social facebook: socfb-UConn
Results: infrastructure
[Tables: cell values are missing from this transcript except for the fragments noted; instances listed in slide order as category: name]
Table 1 columns: category, name, |V|, |E|, Δ, d_avg, K(G)+1, ω_o, ω
DIMACS 10 (infrastructure): inf-europe_osm
Infrastructure: inf-road-usa
DIMACS 10 (infrastructure): inf-road_usa
DIMACS 10 (infrastructure): inf-road_central
DIMACS 10 (infrastructure): inf-germany_osm
DIMACS 10 (infrastructure): inf-great-britain_osm
DIMACS 10 (infrastructure): inf-netherlands_osm
DIMACS 10 (infrastructure): inf-belgium_osm
Infrastructure: inf-roadNet-PA
DIMACS 10 (infrastructure): inf-luxembourg_osm
Table 2 columns: category, name, |V|, |E|, PMC, BBMCS, %imp, ratio imp
DIMACS 10 (infrastructure): inf-europe_osm (surviving cell text: ts<.001)
Infrastructure: inf-road-usa (surviving cell text: <)
DIMACS 10 (infrastructure): inf-road_usa (surviving cell text: <)
DIMACS 10 (infrastructure): inf-road_central (surviving cell text: <)
DIMACS 10 (infrastructure): inf-germany_osm (surviving cell text: <)
DIMACS 10 (infrastructure): inf-great-britain_osm (surviving cell text: <)
DIMACS 10 (infrastructure): inf-netherlands_osm (surviving cell text: <)
DIMACS 10 (infrastructure): inf-belgium_osm (surviving cell text: <)
Infrastructure: inf-roadNet-PA (surviving cell text: <)
DIMACS 10 (infrastructure): inf-luxembourg_osm (surviving cell text: ts<.001)
Results: technological (I)
Columns: category, name, |V|, |E|, Δ, d_avg, K(G)+1, ω_o, ω
[Table: cell values and numeric name suffixes are missing from this transcript; instances listed in slide order as category: name]
DIMACS 10 (technological): venturiLevel
Technological: tech-as-skitter
Scientific computing: sc-ldoor
Scientific computing: sc-msdoor
Scientific computing: sc-pwtk
DIMACS 10 (technological): tech-caidaRouterLevel
Technological: tech-RL-caida
Scientific computing: sc-shipsec
Scientific computing: sc-shipsec
Scientific computing: sc-pkustk
Scientific computing: sc-pkustk
Technological: tech-p2p-gnutella
DIMACS 10 (technological): t60k
Scientific computing: sc-nasasrb
Technological: tech-internet-as
Technological: tech-as-caida
Technological: tech-WHOIS
Results: technological (II)
Columns: category, name, |V|, |E|, PMC, BBMCS, %imp, ratio imp
[Table: cell values and numeric name suffixes are missing from this transcript except for the fragments noted; instances listed in slide order as category: name]
DIMACS 10 (technological): venturiLevel
Technological: tech-as-skitter
Scientific computing: sc-ldoor
Scientific computing: sc-msdoor
Scientific computing: sc-pwtk
DIMACS 10 (technological): tech-caidaRouterLevel
Technological: tech-RL-caida
Scientific computing: sc-shipsec
Scientific computing: sc-shipsec
Scientific computing: sc-pkustk
Scientific computing: sc-pkustk
Technological: tech-p2p-gnutella
DIMACS 10 (technological): t60k
Scientific computing: sc-nasasrb
Technological: tech-internet-as (surviving cell text: <)
Technological: tech-as-caida (surviving cell text: <)
Technological: tech-WHOIS (surviving cell text: <)
Results: trivially solved during unrolling
Columns: category, name, |V|, |E|, Δ, d_avg, K(G)+1, ω_o, ω
[Table: cell values and numeric name suffixes are missing from this transcript; instances listed in slide order as category: name]
DIMACS 10 (random geometric): rgg_n_2_23_s
DIMACS 10 (random geometric): rgg_n_2_22_s
Social: soc-livejournal
DIMACS 10 (random geometric): rgg_n_2_21_s
Temporal reachability: scc_retweet-crawl
Collaboration: ca-hollywood
Collaboration: ca-coauthors-dblp
DIMACS 10: co-papers-dblp
DIMACS 10 (random geometric): rgg_n_2_19_s
Web graphs: web-it
DIMACS 10: co-papers-citeseer
Collaboration: ca-MathSciNet
Collaboration: ca-dblp
DIMACS 10: coAuthorsCiteseer
Collaboration: ca-dblp
Web graphs: web-arabic
DIMACS 10 (random geometric): rgg_n_2_17_s
Web graphs: web-uk
Web graphs: web-sk
Recommendation net: rec-amazon
DIMACS 10 (random geometric): rgg_n_2_16_s
DIMACS 10 (random geometric): rgg_n_2_15_s
Collaboration: ca-CondMat
Collaboration: ca-AstroPh
Web graphs: web-webbase
Web graphs: web-BerkStan
Web graphs: web-indochina
Collaboration: ca-HepPh
Temporal reachability: scc_infect-dublin
Summary
The main ideas behind finding the largest clique in large scale networks have been described: coloring and k-core bounds; the initial sorting decision heuristic; sparse bitstring data structures.
A new algorithm, BBMCS, has been presented and compared with the state-of-the-art reference algorithm PMC. BBMCS clearly outperformed PMC in extensive empirical tests.
Related bibliography
Initial sorting of vertices in the maximum clique problem reviewed. Pablo San Segundo, Alvaro Lopez, Mikhail Batsyn. LION 8 Conf., February, Florida.
Relaxed approximate coloring in exact maximum clique search. Pablo San Segundo, Cristobal Tapia. COR.
An improved bit parallel exact maximum clique algorithm. Pablo San Segundo et al. OPL.
A new DSATUR-based algorithm for exact vertex coloring. Pablo San Segundo. COR.
An exact bit-parallel algorithm for the maximum clique problem. Pablo San Segundo et al. COR.
Parallel Maximum Clique Algorithms with Applications to Network Analysis and Storage. Ryan Rossi et al. arXiv.org.
Efficient Search Using Bitboard Models. Pablo San Segundo et al. ICTAI Conf., 2006.