3Routing in transportation networks Road networks, Point-to-point shortest paths: 15 seconds (naïve) 10 microsecondsH. Bast et al., “Fast Routing in Road Networks with Transit Nodes”, Science 27, 2007.
4Internet and the WWWThe world-wide web can be represented as a directed graphWeb search and crawl: traversalLink analysis, ranking: Page rank and HITSDocument classification and clusteringInternet topologies (router networks) are naturally modeled as graphs
5Scientific Computing Reorderings for sparse solvers Fill reducing orderingsPartitioning, eigenvectorsHeavy diagonal to reduce pivoting (matching)Data structures for efficient exploitationof sparsityDerivative computations for optimizationgraph colorings, spanning treesPreconditioningIncomplete FactorizationsPartitioning for domain decompositionGraph techniques in algebraic multigridIndependent sets, matchings, etc.Support TheorySpanning trees & graph embedding techniquesImage source: Yifan Hu, “A gallery of large graphs”B. Hendrickson, “Graphs and HPC: Lessons for Future Architectures”,Image source: Tim Davis, UF Sparse Matrix Collection.
6Large-scale data analysis Graph abstractions are very useful to analyze complex data sets.Sources of data: petascale simulations, experimental devices, the Internet, sensor networksChallenges: data size, heterogeneity, uncertainty, data qualityAstrophysics: massive datasets,temporal variationsBioinformatics: data quality,heterogeneitySocial Informatics: new analyticschallenges, data uncertaintyImage sources: (1) (2,3)
7Data Analysis and Graph Algorithms in Systems Biology Study of the interactions between various components in a biological systemGraph-theoretic formulations are pervasive:Predicting new interactions: modelingFunctional annotation of novel proteins: matching, clusteringIdentifying metabolic pathways: paths, clusteringIdentifying new protein complexes: clustering, centralityImage Source: Giot et al., “A Protein Interaction Map of Drosophila melanogaster”,Science 302, , 2003.
8Graph –theoretic problems in social networks Targeted advertising: clustering and centralityStudying the spread of informationImage Source: Nexus (Facebook application)
9Network Analysis for Intelligence and Survelliance [Krebs ’04] Post 9/11 Terrorist Network Analysis from public domain informationPlot masterminds correctly identified from interaction patterns: centralityA global view of entities is often more insightfulDetect anomalous activities by exact/approximate subgraph isomorphism.Image Source:Image Source: T. Coffman, S. Greenblatt, S. Marcus, Graph-based technologiesfor intelligence analysis, CACM, 47 (3, March 2004): pp 45-47
11Characterizing Graph-theoretic computations Factors that influencechoice of algorithmInput dataGraph kernelgraph sparsity (m/n ratio)static/dynamic natureweighted/unweighted, weight distributionvertex degree distributiondirected/undirectedsimple/multi/hyper graphproblem sizegranularity of computation at nodes/edgesdomain-specific characteristicstraversalshortest path algorithmsflow algorithmsspanning tree algorithmstopological sort…..Problem: Find ***pathsclusterspartitionsmatchingspatternsorderingsGraph problems are often recast as sparse linear algebra (e.g., partitioning) or linear programming (e.g., matching) computations
13History1735: “Seven Bridges of Königsberg” problem, resolved by Euler, one of the first graph theory results.…1966: Flynn’s Taxonomy.1968: Batcher’s “sorting networks”1969: Harary’s “Graph Theory”1972: Tarjan’s “Depth-first search and linear graph algorithms”1975: Reghbati and Corneil, Parallel Connected Components1982: Misra and Chandy, distributed graph algorithms.1984: Quinn and Deo’s survey paper on “parallel graph algorithms”
14The PRAM model Idealized parallel shared memory system model Unbounded number of synchronous processors; no synchronization, communication cost; no parallel overheadEREW (Exclusive Read Exclusive Write), CREW (Concurrent Read Exclusive Write)Measuring performance: space and time complexity; total number of operations (work)
15PRAM Pros and Cons Pros Cons Simple and clean semantics. The majority of theoretical parallel algorithms are designed using the PRAM model.Independent of the communication network topology.ConsNot realistic, too powerful communication model.Algorithm designer is misled to use IPC without hesitation.Synchronized processors.No local memory.Big-O notation is often misleading.
16PRAM Algorithms for Connected Components Reghbati and Corneil , O(log2n) time and O(n3) processorsWyllie , O(log2n) time and O(m) processorsShiloach and Vishkin , O(log n) time and O(m) processorsReif and Spirakis , O(log log n) time and O(n) processors (expected)
20Distributed Graph representation Each processor stores the entire graph (“full replication”)Each processor stores n/p vertices and all adjacencies out of these vertices (“1D partitioning”)How to create these “p” vertex partitions?Graph partitioning algorithms: recursively optimize for conductance (edge cut/size of smaller partition)Randomly shuffling the vertex identifiers ensures that edge count/processor are roughly the same
212D graph partitioningConsider a logical 2D processor grid (pr * pc = p) and the matrix representation of the graphAssign each processor a sub-matrix (i.e, the edges within the sub-matrix)9 vertices, 9 processors, 3x3 processor gridx5817346FlattenSparse matrices2Per-processor local graphrepresentation
22Data structures in (parallel) graph algorithms A wide range seen in graph algorithms: array, list, queue, stack, set, multiset, treeImplementations are typically array-based for performance considerations.Key data structure considerations in parallel graph algorithm designPractical parallel priority queuesSpace-efficiencyParallel set/multiset operations, e.g., union, intersection, etc.
23Graph Algorithms on current systems ConcurrencySimulating PRAM algorithms: hardware limits of memory bandwidth, number of outstanding memory references; synchronizationLocalityTry to improve cache locality, but avoid “too much” superfluous computationWork-efficiencyIs (Parallel time) * (# of processors) = (Serial work)?
24The locality challenge “Large memory footprint, low spatial and temporal locality impede performance”Serial Performance of “approximate betweenness centrality” on a 2.67 GHz Intel Xeon 5560 (12 GB RAM, 8MB L3 cache) Input: Synthetic R-MAT graphs (# of edges m = 8n)No Last-level Cache (LLC) misses~ 5X drop inperformanceO(m) LLC misses
25Tuning implementation to minimize parallel overhead is non-trivial The parallel scaling challenge “Classical parallel graph algorithms perform poorly on current parallel systems”Graph topology assumptions in classical algorithms do not match real-world datasetsParallelization strategies at loggerheads with techniques for enhancing memory localityClassical “work-efficient” graph algorithms may not fully exploit new architectural featuresIncreasing complexity of memory hierarchy, processor heterogeneity, wide SIMD.Tuning implementation to minimize parallel overhead is non-trivialShared memory: minimizing overhead of locks, barriers.Distributed memory: bounding message buffer sizes, bundling messages, overlapping communication w/ computation.
27Graph traversal (BFS) problem definition 12distance fromsource vertexOutput:Input:58411233473469sourcevertex12Memory requirements (# of machine words):Sparse graph representation: m+nStack of visited vertices: nDistance array: n
28Optimizing BFS on cache-based multicore platforms, for networks with “power-law” degree distributionsProblem Spec.AssumptionsNo. of vertices/edges106 ~ 109Edge/vertex ratio1 ~ 100Static/dynamic?StaticDiameterO(1) ~ O(log n)Weighted/UnweightedUnweightedVertex degree distributionUnbalanced (“power law”)Directed/undirected?BothSimple/multi/hypergraph?MultigraphGranularity of computation at vertices/edges?MinimalExploiting domain-specific characteristics?PartiallyTest data(Data: Mislove et al.,IMC 2007.)Synthetic R-MATnetworks
29Parallel BFS Strategies 1. Expand current frontier (level-synchronous approach, suited for low diameter graphs)581O(D) parallel stepsAdjacencies of all verticesin current frontier arevisited in parallel73469sourcevertex22. Stitch multiple concurrent traversals (Ullman-Yannakakis approach, suited for high-diameter graphs)58path-limited searches from “super vertices”APSP between “super vertices”1sourcevertex734692
30A deeper dive into the “level synchronous” strategy Locality (where are the random accesses originating from?)1. Ordering of vertices in the “current frontier” array, i.e., accesses to adjacency indexing array, cumulative accesses O(n).2. Ordering of adjacency list of each vertex, cumulative O(m).3. Sifting through adjacencies to check whether visited or not, cumulative accesses O(m).5384933144746326111. Access Pattern: idx array -- 53, 31, 74, 262,3. Access Pattern: d array -- 0, 84, 0, 84, 93, 44, 63, 0, 0, 11
31Performance Observations Youtube social networkFlickr social networkGraph expansionEdge filtering
32Improving locality: Vertex relabeling Well-studied problem, slight differences in problem formulationsLinear algebra: sparse matrix column reordering to reduce bandwidth, reveal dense blocks.Databases/data mining: reordering bitmap indices for better compression; permuting vertices of WWW snapshots, online social networks for compressionNP-hard problem, several known heuristicsWe require fast, linear-work approachesExisting ones: BFS or DFS-based, Cuthill-McKee, Reverse Cuthill-McKee, exploit overlap in adjacency lists, dimensionality reduction
34Improving locality: Cache blocking Metadata denotingblocking patternProcesshigh-degreeverticesseparatelyAdjacencies (d)Adjacencies (d)123xxxxxxxxxxxxxxxxxxxxxxxxfrontierxxfrontierxxTune to L2cache sizexxxxxxxxxxlinear processingNew: cache-blocked approachInstead of processing adjacencies of each vertex serially, exploit sorted adjacency list structure w/ blocked accessesRequires multiple passes through the frontier array, tuning for optimal block size.Note: frontier array size may be O(n)
35Vertex relabeling heuristic Similar to older heuristics, but tuned for small-world networks:High percentage of vertices with (out) degrees 0, 1, and 2 in social and information networks => store adjacencies explicitly (in indexing data structure).Augment the adjacency indexing data structure (with two additional words) and frontier array (with one bit)Process “high-degree vertices” adjacencies in linear order, but other vertices with d-array cache blocking.Form dense blocks around high-degree verticesReverse Cuthill-McKee, removing degree 1 and degree 2 vertices
36Architecture-specific Optimizations 1. Software prefetching on the Intel Core i7 (supports 32 loads and 20 stores in flight)Speculative loads of index array and adjacencies of frontier vertices will reduce compulsory cache misses.2. Aligning adjacency lists to optimize memory accesses16-byte aligned loads and stores are faster.Alignment helps reduce cache misses due to fragmentation16-byte aligned non-temporal stores (during creation of new frontier) are fast.3. SIMD SSE integer intrinsics to process “high-degree vertex” adjacencies.4. Fast atomics (BFS is lock-free w/ low contention, and CAS-based intrinsics have very low overhead)5. Hugepage support (significant TLB miss reduction)6. NUMA-aware memory allocation exploiting first-touch policy
37Experimental Setup Network n m Max. out-degree % of vertices w/ out-degree 0,1,2Orkut3.07M223M32K5LiveJournal5.28M77.4M9K40Flickr1.86M22.6M26K73Youtube1.15M4.94M28K76R-MAT8M-64M8nn0.6Intel Xeon 5560 (Core i7, “Nehalem”)2 sockets x 4 cores x 2-way SMT12 GB DRAM, 8 MB shared L351.2 GBytes/sec peak bandwidth2.66 GHz proc.Performance averaged over10 different source vertices, 3 runs each.
39Cache locality improvement Performance count: # of non-contiguous memory accesses (assuming cache line size of 16 words)Theoretical count of the number of non-contiguous memory accesses: m+3n
40Parallel performance (Orkut graph) Execution time:0.28 seconds (8 threads)Parallel speedup: 4.9Speedup overbaseline: 2.9Graph: 3.07 million vertices, 220 million edgesSingle socket of Intel Xeon 5560 (Core i7)
41Performance AnalysisHow well does my implementation match theoretical bounds?When can I stop optimizing my code?Begin with asymptotic analysisExpress work performed in terms of machine-independent performance countsAdd input graph characteristics in analysisUse simple kernels or micro-benchmarks to provide an estimate of achievable peak performanceRelate observed performance to machine characteristics
42Graph 500 “Search” Benchmark (graph500.org) BFS (from a single vertex) on a static, undirected R-MAT network with average vertex degree 16.Evaluation criteria: largest problem size that can be solved on a system, minimum execution time.Reference MPI, shared memory implementations provided.NERSC Franklin system is ranked #2 on current list (Nov 2010).BFS using 500 nodes of Franklin
44Parallel Single-source Shortest Paths (SSSP) algorithms Edge weights: concurrency primary challenge!No known PRAM algorithm that runs in sub-linear time and O(m+nlog n) workParallel priority queues: relaxed heaps [DGST88], [BTZ98]Ullman-Yannakakis randomized approach [UY90]Meyer and Sanders, ∆ - stepping algorithm [MS03]Distributed memory implementations based on graph partitioningHeuristics for load balancing and termination detectionK. Madduri, D.A. Bader, J.W. Berry, and J.R. Crobak, “An Experimental Study of A Parallel Shortest Path Algorithm for Solving Large-Scale Graph Instances,” Workshop on Algorithm Engineering and Experiments (ALENEX), New Orleans, LA, January 6, 2007.
45∆ - stepping algorithm [MS03] Label-correcting algorithm: Can relax edges from unsettled vertices also∆ - stepping: “approximate bucket implementation of Dijkstra’s algorithm”∆: bucket widthVertices are ordered using buckets representing priority range of size ∆Each bucket may be processed in parallel
46∆ - stepping algorithm: illustration ∆ = 0.1 (say)0.05One parallel phasewhile (bucket is non-empty)Inspect light edgesConstruct a set of “requests” (R)Clear the current bucketRemember deleted vertices (S)Relax request pairs in RRelax heavy request pairs (from S)Go on to the next bucket0.56360.070.010.150.23420.020.130.1815d array∞∞∞∞∞∞∞Buckets
47∆ - stepping algorithm: illustration 0.05One parallel phasewhile (bucket is non-empty)Inspect light edgesConstruct a set of “requests” (R)Clear the current bucketRemember deleted vertices (S)Relax request pairs in RRelax heavy request pairs (from S)Go on to the next bucket0.56360.070.010.150.23420.020.130.1815d array∞∞∞∞∞∞BucketsInitialization:Insert s into bucket, d(s) = 0
48∆ - stepping algorithm: illustration One parallel phasewhile (bucket is non-empty)Inspect light edgesConstruct a set of “requests” (R)Clear the current bucketRemember deleted vertices (S)Relax request pairs in RRelax heavy request pairs (from S)Go on to the next bucket0.050.56360.070.010.150.23420.020.130.1815d array∞∞∞∞∞∞BucketsR2.01S
49∆ - stepping algorithm: illustration 0.05One parallel phasewhile (bucket is non-empty)Inspect light edgesConstruct a set of “requests” (R)Clear the current bucketRemember deleted vertices (S)Relax request pairs in RRelax heavy request pairs (from S)Go on to the next bucket0.56360.070.010.150.23420.020.130.1815d array∞∞∞∞∞∞BucketsR2.01S
50∆ - stepping algorithm: illustration 0.05One parallel phasewhile (bucket is non-empty)Inspect light edgesConstruct a set of “requests” (R)Clear the current bucketRemember deleted vertices (S)Relax request pairs in RRelax heavy request pairs (from S)Go on to the next bucket0.56360.070.010.150.23420.020.130.1815d array∞∞∞∞∞.01BucketsR2S
51∆ - stepping algorithm: illustration 0.05One parallel phasewhile (bucket is non-empty)Inspect light edgesConstruct a set of “requests” (R)Clear the current bucketRemember deleted vertices (S)Relax request pairs in RRelax heavy request pairs (from S)Go on to the next bucket0.56360.070.010.150.23420.020.130.1815d array∞∞∞∞∞.01BucketsR132.03.06S
52∆ - stepping algorithm: illustration 0.05One parallel phasewhile (bucket is non-empty)Inspect light edgesConstruct a set of “requests” (R)Clear the current bucketRemember deleted vertices (S)Relax request pairs in RRelax heavy request pairs (from S)Go on to the next bucket0.56360.070.010.150.23420.020.130.1815d array∞∞∞∞∞.01BucketsR13.03.06S2
53∆ - stepping algorithm: illustration 0.05One parallel phasewhile (bucket is non-empty)Inspect light edgesConstruct a set of “requests” (R)Clear the current bucketRemember deleted vertices (S)Relax request pairs in RRelax heavy request pairs (from S)Go on to the next bucket0.56360.070.010.150.23240.020.130.1815d array∞∞∞.03.01.06BucketsR13S2
54∆ - stepping algorithm: illustration 0.05One parallel phasewhile (bucket is non-empty)Inspect light edgesConstruct a set of “requests” (R)Clear the current bucketRemember deleted vertices (S)Relax request pairs in RRelax heavy request pairs (from S)Go on to the next bucket0.56360.070.010.150.23420.020.130.1815d array.03.01.06.16.29.62BucketsR1425S21366
64Betweenness Centrality Centrality: Quantitative measure to capture the importance of a vertex/edge in a graphdegree, closeness, eigenvalue, betweennessBetweenness Centrality( : No. of shortest paths between s and t)Applied to several real-world networksSocial interactionsWWWEpidemiologySystems biology
65Algorithms for Computing Betweenness All-pairs shortest path approach: compute the length and number of shortest paths between all s-t pairs (O(n3) time), sum up the fractional dependency values (O(n2) space).Brandes’ algorithm (2003): Augment a single-source shortest path computation to count paths; uses the Bellman criterion; O(mn) work and O(m+n) space.
66Our New Parallel Algorithms Madduri, Bader (2006): parallel algorithms for computing exact and approximate betweenness centralitylow-diameter sparse graphs (diameter D = O(log n), m = O(nlog n))Exact algorithm: O(mn) work, O(m+n) space, O(nD+nm/p) time.Madduri et al. (2009): New parallel algorithm with lower synchronization overhead and fewer non-contiguous memory referencesIn practice, 2-3X faster than previous algorithmLock-free => better scalability on large parallel systems
67Parallel BC Algorithm Consider an undirected, unweighted graph High-level idea: Level-synchronous parallel Breadth-First Search augmented to compute centrality scoresTwo stepstraversal and path countingdependency accumulation
71Parallel BC Algorithm Illustration 1. Traversal step: visit adjacent vertices, update distanceand path counts.SDP581122773469831sourcevertex27125257Level-synchronous approach: The adjacencies of all verticesin the current frontier can be visited in parallel
72Parallel BC Algorithm Illustration Traversal step: at the end, we have all reachable vertices,their corresponding predecessor multisets, and D values.SDP5821166142277346983831sourcevertex2871252576Level-synchronous approach: The adjacencies of all verticesin the current frontier can be visited in parallel
73Graph traversal step analysis Exploit concurrency in visiting adjacencies, as we assume that the graph diameter is small: O(log n)Upper bound on size of each predecessor multiset: In-degreePotential performance bottlenecks: atomic updates to predecessor multisets, atomic increments of path countsNew algorithm: Based on observation that we don’t need to store “predecessor” vertices. Instead, we store successor edges along shortest paths.simplifies the accumulation stepreduces an atomic operation in traversal stepcache-friendly!
74Graph Traversal Step locality analysis All the vertices are in acontiguous block (stack)for all vertices u at level d in parallel dofor all adjacencies v of u in parallel dodv = D[v];if (dv < 0)vis = fetch_and_add(&Visited[v], 1);if (vis == 0)D[v] = d+1;pS[count++] = v;fetch_and_add(&sigma[v], sigma[u]);fetch_and_add(&Scount[u], 1);if (dv == d + 1)All the adjacencies of a vertex arestored compactly (graph rep.)Non-contiguousmemory accessNon-contiguousmemory accessNon-contiguousmemory accessStore to S[u]Better cache utilization likely if D[v], Visited[v], sigma[v] are stored contiguously
75Parallel BC Algorithm Illustration 2. Accumulation step: Pop vertices from stack,update dependence scores.SDeltaP5821166427734698383sourcevertex28725576
76Parallel BC Algorithm Illustration 2. Accumulation step: Can also be done in alevel-synchronous manner.SDeltaP5821166427734698383sourcevertex28725576
77Accumulation step locality analysis All the vertices are in acontiguous block (stack)for level d = GraphDiameter-2 to 1 dofor all vertices v at level d in parallel dofor all w in S[v] in parallel do reduction(delta)delta_sum_v = delta[v] + (1 + delta[w]) * sigma[v]/sigma[w];BC[v] = delta[v] = delta_sum_v;Each S[v] is acontiguousblockOnly floating point operation in code
78Centrality Analysis applied to Protein Interaction Networks 43 interactionsProtein Ensembl IDENSGKelch-like protein 8
79Designing parallel algorithms for large sparse graph analysis System requirements: High (on-chip memory, DRAM, network, IO) bandwidth.Solution: Efficiently utilize available memory bandwidth.ImprovelocalityAlgorithmic innovationto avoid corner cases.“RandomAccess”-likeLocalityComplexityProblemData reduction/Compression“Stream”-likeFastermethodsO(n)1041061081012Peta+O(nlog n)WorkperformedO(n2)Problem size (n: # of vertices/edges)
80Review of lectureApplications: Internet and WWW, Scientific computing, Data analysis, SurveillanceOverview of parallel graph algorithmsPRAM algorithmsgraph representationParallel algorithm case studiesBFS: locality, level-synchronous approach, multicore tuningShortest paths: exploiting concurrency, parallel priority queueBetweenness centrality: importance of locality, atomics