Optimizing Graph Algorithms for Improved Cache Performance Aya Mire & Amir Nahir Based on: Optimizing Graph Algorithms for Improved Cache Performance – Joon-Sang Park, Michael Penner, Viktor K. Prasanna

The Problem with Graphs… Graph problems pose unique challenges for improving cache performance due to their irregular data access patterns.

Agenda A recursive implementation of the Floyd-Warshall algorithm. A tiled implementation of the Floyd-Warshall algorithm. Efficient data structures for general graph problems. Optimizations for the maximum matching algorithm.

Analysis model All proofs and complexity analysis will be based on the I/O model, i.e., the goal of the improved algorithm is to minimize the number of CPU-memory transactions. [Diagram: CPU ↔ Cache ↔ Main Memory, with transaction costs labeled A, B, C such that cost(A) ≪ cost(B) and cost(C) ≪ cost(B).]

Analysis model All proofs assume total control of the cache, i.e., if the cache is big enough to hold two data blocks, then the two can be held in the cache without evicting each other (no conflict misses).

The Floyd Warshall Algorithm An ‘all pairs shortest path’ algorithm. It works by iteratively calculating D^{(k)}, the matrix of all-pairs shortest paths that use only intermediate vertices from {1, 2, …, k}. Each iteration depends on the result of the previous one. Time complexity: Θ(|V|^3).

The Floyd Warshall Algorithm Pseudo code:

    for k from 1 to |V|
        for i from 1 to |V|
            for j from 1 to |V|
                D_{i,j}^{(k)} ← min{ D_{i,j}^{(k-1)}, D_{i,k}^{(k-1)} + D_{k,j}^{(k-1)} }
    return D^{(|V|)}
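
For reference, here is a minimal runnable version of the traditional algorithm (a Python sketch; the usual in-place form, which is equivalent to keeping separate D^{(k-1)} and D^{(k)} matrices):

    INF = float("inf")

    def floyd_warshall(D):
        # Traditional Floyd-Warshall on an n x n matrix, in place.
        # D[i][j] is the weight of edge (i, j): INF if absent, 0 if i == j.
        n = len(D)
        for k in range(n):
            for i in range(n):
                for j in range(n):
                    if D[i][k] + D[k][j] < D[i][j]:
                        D[i][j] = D[i][k] + D[k][j]
        return D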

The Floyd Warshall Algorithm The algorithm accesses the entire matrix in each iteration. The dependency of the k-th iteration on the results of the (k-1)-th iteration eliminates the ability to perform data reuse.

Lemma 1 Suppose D_{i,j}^{(k)} is computed as D_{i,j}^{(k)} ← min{ D_{i,j}^{(k-1)}, D_{i,k}^{(k')} + D_{k,j}^{(k'')} } for k-1 ≤ k', k'' ≤ |V|; then upon termination the FW algorithm correctly computes the all-pairs shortest paths. (Compare with the traditional update: D_{i,j}^{(k)} ← min{ D_{i,j}^{(k-1)}, D_{i,k}^{(k-1)} + D_{k,j}^{(k-1)} }.)

Lemma 1 - Proof To distinguish from the traditional FW algorithm, we use T_{i,j}^{(k)} to denote the results calculated using the “new” computation rule: T_{i,j}^{(k)} ← min{ T_{i,j}^{(k-1)}, T_{i,k}^{(k')} + T_{k,j}^{(k'')} } for k-1 ≤ k', k'' ≤ |V|.

Lemma 1 - Proof First, we show that for 1 ≤ k ≤ |V| the following inequality holds: T_{i,j}^{(k)} ≤ D_{i,j}^{(k)}. We prove this by induction. Base case: by definition we have T_{i,j}^{(0)} = D_{i,j}^{(0)}.

Lemma 1 - Proof Induction step: suppose T_{i,j}^{(k)} ≤ D_{i,j}^{(k)} holds for k = m-1. Then:

    T_{i,j}^{(m)} ← min{ T_{i,j}^{(m-1)}, T_{i,m}^{(m')} + T_{m,j}^{(m'')} }
                  ≤ min{ D_{i,j}^{(m-1)}, T_{i,m}^{(m')} + T_{m,j}^{(m'')} }    (by the induction hypothesis)
                  ≤ min{ D_{i,j}^{(m-1)}, T_{i,m}^{(m-1)} + T_{m,j}^{(m-1)} }   (limiting the choices of intermediate vertices makes a path the same or longer)
                  ≤ min{ D_{i,j}^{(m-1)}, D_{i,m}^{(m-1)} + D_{m,j}^{(m-1)} }   (by the induction hypothesis)
                  = D_{i,j}^{(m)}                                               (by definition)

Lemma 1 - Proof On the other hand, since the traditional algorithm computes the shortest paths at termination, and since T_{i,j}^{(|V|)} is the length of some path, we have D_{i,j}^{(|V|)} ≤ T_{i,j}^{(|V|)}. Combined with T_{i,j}^{(|V|)} ≤ D_{i,j}^{(|V|)} from the previous step, this gives D_{i,j}^{(|V|)} = T_{i,j}^{(|V|)}, which proves the lemma.

FW’s Algorithm – Recursive Implementation We first consider the base case of a two-node graph. [Figure: two vertices 1 and 2 with edge weights w_1 (1→2) and w_2 (2→1), and the corresponding 2×2 matrix.]

    Floyd-Warshall (T) {
        T_11 = min{ T_11, T_11 + T_11 }
        T_12 = min{ T_12, T_11 + T_12 }
        T_21 = min{ T_21, T_21 + T_11 }
        T_22 = min{ T_22, T_21 + T_12 }
        T_22 = min{ T_22, T_22 + T_22 }
        T_21 = min{ T_21, T_22 + T_21 }
        T_12 = min{ T_12, T_12 + T_22 }
        T_11 = min{ T_11, T_12 + T_21 }
    }

FW’s Algorithm – Recursive Implementation The general case: the matrix is divided into quadrants I (top-left), II (top-right), III (bottom-left) and IV (bottom-right); each line below is a recursive call that updates its first operand using the other two, exactly as in the base case:

    Floyd-Warshall (T) {
        if (not base case) {
            T_I   = min{ T_I,   T_I   + T_I   }
            T_II  = min{ T_II,  T_I   + T_II  }
            T_III = min{ T_III, T_III + T_I   }
            T_IV  = min{ T_IV,  T_III + T_II  }
            T_IV  = min{ T_IV,  T_IV  + T_IV  }
            T_III = min{ T_III, T_IV  + T_III }
            T_II  = min{ T_II,  T_II  + T_IV  }
            T_I   = min{ T_I,   T_II  + T_III }
        } else { … }
    }
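
The following Python sketch makes the recursion concrete (an illustration, assuming the matrix dimension is a power of two). fwr(T, a, b, c, n) updates the n×n sub-matrix with top-left corner a via A[i][j] = min(A[i][j], B[i][k] + C[k][j]), where b and c are the corners of the other two operands; calling it with a = b = c = (0, 0) runs the whole algorithm:

    def fwr(T, a, b, c, n, base=16):
        # Recursive Floyd-Warshall step on quadrants of T.
        if n <= base:
            (ar, ac), (br, bc), (cr, cc) = a, b, c
            for k in range(n):              # k outermost, as in iterative FW
                for i in range(n):
                    for j in range(n):
                        d = T[br + i][bc + k] + T[cr + k][cc + j]
                        if d < T[ar + i][ac + j]:
                            T[ar + i][ac + j] = d
            return
        h = n // 2
        q = lambda p, dr, dc: (p[0] + dr * h, p[1] + dc * h)
        # Quadrant corners: I = (0,0), II = (0,1), III = (1,0), IV = (1,1).
        calls = [
            ((0,0), (0,0), (0,0)),   # T_I   = min{T_I,   T_I   + T_I}
            ((0,1), (0,0), (0,1)),   # T_II  = min{T_II,  T_I   + T_II}
            ((1,0), (1,0), (0,0)),   # T_III = min{T_III, T_III + T_I}
            ((1,1), (1,0), (0,1)),   # T_IV  = min{T_IV,  T_III + T_II}
            ((1,1), (1,1), (1,1)),   # T_IV  = min{T_IV,  T_IV  + T_IV}
            ((1,0), (1,1), (1,0)),   # T_III = min{T_III, T_IV  + T_III}
            ((0,1), (0,1), (1,1)),   # T_II  = min{T_II,  T_II  + T_IV}
            ((0,0), (0,1), (1,0)),   # T_I   = min{T_I,   T_II  + T_III}
        ]
        for A, B, C in calls:
            fwr(T, q(a, *A), q(b, *B), q(c, *C), h, base)

    def fw_recursive(T):
        fwr(T, (0, 0), (0, 0), (0, 0), len(T))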

FW’s Recursive Algorithm – Correctness It can be shown that for each update D_{i,j}^{(k)} ← min{ D_{i,j}^{(k-1)}, D_{i,k}^{(k-1)} + D_{k,j}^{(k-1)} } in FW’s traditional implementation, there is a corresponding update T_{i,j}^{(k)} ← min{ T_{i,j}^{(k-1)}, T_{i,k}^{(k')} + T_{k,j}^{(k'')} }, where k-1 ≤ k', k'' ≤ |V|. Hence the algorithm’s correctness follows from Lemma 1.

FW’s Recursive Algorithm – How does it actually work… [Figure: the recursion carries the matrix from T^{(0)} to T^{(|V|)}; the first four recursive calls bring the quadrants to T_I^{(|V|/2)}, T_II^{(|V|/2)}, T_III^{(|V|/2)}, T_IV^{(|V|/2)}, and the last four complete them to T_IV^{(|V|)}, T_III^{(|V|)}, T_II^{(|V|)}, T_I^{(|V|)}.]

FW’s Recursive Algorithm – Example [Figures: the eight base-case updates of the two-node case applied step by step, using the same code as shown above.]

Representing the Matrix in an efficient way We usually store matrices in memory in one of two ways: row-major layout or column-major layout. Using either of these layouts will not improve performance, since the algorithm breaks the matrix into quadrants.

Representing the Matrix in an efficient way The Z-Morton layout: perform the following operations recursively until the quadrant size is a single data unit: divide the matrix into four quadrants, then store quadrants I, II, III and IV in memory, in that order. [Figure: an example of the resulting element order.]
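
The layout can also be computed directly: the Z-Morton position of element (i, j) is obtained by interleaving the bits of i and j, each row bit placed just above the matching column bit (a sketch, assuming the matrix dimension is a power of two):

    def z_morton_index(i, j, bits):
        # Position of (i, j) in the Z-Morton layout of a 2**bits x 2**bits
        # matrix, with quadrants stored in the order I, II, III, IV.
        idx = 0
        for b in range(bits):
            idx |= ((i >> b) & 1) << (2 * b + 1)   # row bit (more significant)
            idx |= ((j >> b) & 1) << (2 * b)       # column bit
        return idx

    # In a 4x4 matrix, element (2, 1) lands at position 9: quadrant III
    # starts at offset 8, and (2, 1) is element 1 within it.
    assert z_morton_index(2, 1, 2) == 9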

Complexity Analysis The running time of the algorithm is given by the recurrence T(|V|) = 8·T(|V|/2), which solves to Θ(|V|^3). Without taking the cache into account, the number of CPU-memory transactions is exactly the same as the running time.

Complexity Analysis - Theorem There exists some B, where B = O(|cache|^{1/2}), such that when using the FW-Recursive implementation, with the matrix stored in the Z-Morton layout, the number of CPU-memory transactions is reduced by a factor of B. ⇒ there will be O(|V|^3 / B) CPU-memory transactions.

Complexity Analysis After k recursive calls, the size of a quadrant’s dimension is |V| / 2^k. There exists some k such that, for B ≜ |V| / 2^k, we have 3·B^2 ≤ |cache|. Once this condition is fulfilled, the 3 sub-matrices of size B^2 involved in a call fit in the cache together, and no further CPU-memory transactions are required within the call. ⇒ B = O(|cache|^{1/2})

Complexity Analysis Therefore we get: O((|V|/B)^3) · O(B^2) = O(|V|^3 / B), where O((|V|/B)^3) is the transaction complexity of FW on a matrix of dimension |V|/B with no cache (the number of B-sized sub-problems), and O(B^2) is the number of transactions required to bring a B×B quadrant into the cache. ⇒ the number of CPU-memory transactions is reduced by a factor of B.

Complexity Analysis – lower bound In “I/O Complexity: The Red-Blue Pebble Game”, J. Hong and H. Kung have shown that the lower bound on CPU-memory transactions for multiplying matrices is Ω(N^3 / B), where B = O(|cache|^{1/2}).

Complexity Analysis – lower bound – Theorem The lower bound on CPU-memory transactions for the Floyd-Warshall algorithm is Ω(|V|^3 / B), where B = O(|cache|^{1/2}). Proof: by reduction.

Complexity Analysis – lower bound theorem - Proof Matrix multiplication is computed by the triple loop:

    for k from 1 to N
        for i from 1 to N
            for j from 1 to N
                C_{k,i} += A_{k,j} · B_{j,i}

With N = |V| and the inner statement replaced by D_{i,j}^{(k)} ← min{ D_{i,j}^{(k-1)}, D_{i,k}^{(k-1)} + D_{k,j}^{(k-1)} }, this is exactly the Floyd-Warshall loop structure; hence the Ω(|V|^3 / B) lower bound for matrix multiplication applies to FW as well.

Complexity Analysis - Conclusion The algorithm’s complexity: O(|V|^3 / B). Lower bound for FW: Ω(|V|^3 / B). ⇒ The recursive implementation is asymptotically optimal among all implementations of the Floyd-Warshall algorithm (with respect to CPU-memory transactions).

FW’s Algorithm – Recursive Implementation - Comments Note that the size of the cache is not one of the algorithm’s parameters, nor is it needed in order to store the matrix in the Z-Morton layout. Therefore: the algorithm is cache-oblivious.

FW’s Algorithm – Recursive Implementation - Comments Though the analysis model included only a single level of cache, since no special attributes of that level were assumed, the proofs can be generalized to multiple levels of cache. [Diagram: CPU with L0, L1 and L2 caches in front of main memory.]

FW’s Algorithm – Recursive Implementation - Comments Since cache parameters have been disregarded, the best (and simplest) way to find the optimal size B is by experiment.

FW’s Algorithm – Recursive Implementation - Improvement The algorithm can be further improved by making it cache-conscious: performing the recursive calls only until the problem size is reduced to B, and solving the B-sized problem in the traditional way (this saves the recursive calls’ overhead). This modification showed up to a 2× improvement in running time on some of the machines.

FW’s Algorithm – Tiled Implementation Consider the special case of Lemma 1 in which k', k'' are restricted to k-1 ≤ k', k'' ≤ tB, where tB is the last iteration of the current round of B consecutive iterations, and B satisfies 3·B^2 ≤ |cache| (B = O(|cache|^{1/2})). This leads to the following tiled implementation of FW’s algorithm.

FW’s Algorithm – Tiled Implementation Divide the matrix into B×B tiles. Perform |V|/B iterations; during the t-th iteration: I. update the (t,t)-th tile; II. update the remainder of the t-th tile-row and t-th tile-column; III. update the rest of the matrix.

FW’s Algorithm – Tiled Implementation Each iteration consists of three phases. Phase I: perform FW’s algorithm on the (t,t)-th tile (which is self-dependent).

FW’s Algorithm – Tiled Implementation Phase II: update the remainder of tile-row t: A_{i,j}^{(k)} ← min{ A_{i,j}^{(k-1)}, A_{i,k}^{(tB)} + A_{k,j}^{(k-1)} }; update the remainder of tile-column t: A_{i,j}^{(k)} ← min{ A_{i,j}^{(k-1)}, A_{i,k}^{(k-1)} + A_{k,j}^{(tB)} }. During the t-th iteration, k goes from (t-1)·B + 1 to t·B.

FW’s Algorithm – Tiled Implementation Phase III: update the rest of the matrix: A_{i,j}^{(k)} ← min{ A_{i,j}^{(k-1)}, A_{i,k}^{(tB)} + A_{k,j}^{(tB)} }; again, k goes from (t-1)·B + 1 to t·B.
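
A minimal Python sketch of the three phases (assuming n is divisible by B and 0-based indexing, so round t covers k = t·B … t·B + B - 1):

    def fw_tiled(D, B):
        # Tiled Floyd-Warshall on an n x n matrix, n divisible by B.
        n = len(D)

        def update_tile(ti, tj, lo, hi):
            # FW updates with intermediate vertices k in [lo, hi), applied
            # to the B x B tile whose top-left corner is (ti, tj).
            for k in range(lo, hi):
                for i in range(ti, ti + B):
                    for j in range(tj, tj + B):
                        d = D[i][k] + D[k][j]
                        if d < D[i][j]:
                            D[i][j] = d

        for t in range(n // B):
            lo, hi = t * B, (t + 1) * B
            update_tile(lo, lo, lo, hi)              # phase I: the (t,t) tile
            for s in range(0, n, B):                 # phase II: row t, column t
                if s != lo:
                    update_tile(lo, s, lo, hi)
                    update_tile(s, lo, lo, hi)
            for si in range(0, n, B):                # phase III: all the rest
                for sj in range(0, n, B):
                    if si != lo and sj != lo:
                        update_tile(si, sj, lo, hi)
        return D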

FW’s Algorithm – Tiled Example [Figures: one round of the tiled algorithm, highlighting in turn the (t,t)-th tile (phase I), the rest of tile-row t and tile-column t (phase II), and the remaining tiles (phase III).]

Representing the Matrix in an efficient way In order to match the data access pattern, a tile must be stored in contiguous memory. Therefore, the Z-Morton layout is used.

FW’s Tiled Algorithm – Correctness Let D_{i,j}^{(k)} be the result of the k-th iteration of the traditional FW implementation. Even though D_{i,j}^{(k)} and A_{i,j}^{(k)} may not be equal during the “inner” iterations, it can be shown by induction that they agree at the end of each round: D_{i,j}^{(k)} = A_{i,j}^{(k)} for k = t·B.

Complexity Analysis - Theorem There exists some B, where B = O(|cache|^{1/2}), such that when using the FW-Tiled implementation, the number of CPU-memory transactions is reduced by a factor of B. ⇒ there will be O(|V|^3 / B) CPU-memory transactions.

Complexity Analysis There are (|V|/B) × (|V|/B) tiles in the matrix, and the algorithm performs |V|/B iterations; in each iteration, all tiles are accessed. Updating a tile requires holding at most 3 tiles in the cache. ⇒ 3·B^2 ≤ |cache|

Complexity Analysis Therefore we get: (|V|/B) · [(|V|/B) × (|V|/B)] · O(B^2) = O(|V|^3 / B), where |V|/B is the number of iterations, (|V|/B) × (|V|/B) is the number of tiles in the matrix, and O(B^2) is the number of transactions required to bring a B×B tile into the cache. ⇒ the number of CPU-memory transactions is reduced by a factor of B.

Complexity Analysis - Conclusion The algorithm’s complexity: O(|V|^3 / B). Lower bound for FW: Ω(|V|^3 / B). ⇒ The tiled implementation is asymptotically optimal among all implementations of the Floyd-Warshall algorithm (with respect to CPU-memory transactions).

FW’s Algorithm – Tiled Implementation - Comments Note that when using the tiling method, the size of the cache is one of the algorithm’s parameters. Therefore: the tiled algorithm is cache-conscious.

FW’s Algorithm – Tiled Implementation - Comments In practice, the best (and simplest) way to find the optimal tile size B is by experiment.

FW’s Algorithm – experimental results Both algorithms (recursive and tiled) showed a 30% improvement in L1 cache misses and a 50% improvement in L2 cache misses for problem sizes of 1024 and 2048 vertices. The results of the two algorithms are nearly identical (less than 1% difference).

Dijkstra’s algorithm for Single Source Shortest Paths & Prim’s Algorithm for Minimum Spanning Tree Both algorithms have the same data access pattern:

    Dijkstra’s Algorithm:
        S ← ∅; Q ← V
        while Q ≠ ∅:
            u ← extract-min(Q)
            S ← S ∪ {u}
            for each v ∈ adj(u): update d[v]
        return S

    Prim’s Algorithm:
        Q ← V
        for each u ∈ Q: key(u) ← ∞
        key(root) ← 0
        while Q ≠ ∅:
            u ← extract-min(Q)
            for each v ∈ adj(u):
                if v ∈ Q and weight(u,v) < key(v) then key(v) ← weight(u,v)

Graph representation There are two commonly used graph representations. The adjacency matrix: A(i,j) = the cost of the edge from vertex i to vertex j. Pro: elements are accessed in an adjacent fashion. Con: representation size of O(|V|^2).

Graph representation The adjacency list representation: a pointer-based representation where a list of adjacent vertices is stored for each vertex in the graph; each node in the list holds the cost of the edge from the given vertex to the adjacent vertex. Pro: representation size of O(|V| + |E|). Con: the pointer-based representation leads to cache pollution.

Adjacency Array representation For each vertex in the graph, there is an array of adjacent vertices (with the corresponding edge weights). Representation size of O(|V| + |E|), and elements are accessed in an adjacent fashion. [Figure: an index over vertices 1, 2, 3, …, |V|, each entry pointing to a packed array of (v_i, w_i) pairs.]
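
A sketch of one common way to build the adjacency array (a CSR-style layout; the identifiers are illustrative, not from the paper), together with a Dijkstra that scans each neighbour list contiguously:

    import heapq

    def build_adjacency_array(n, edges):
        # Pack the out-neighbours of each vertex contiguously: the
        # neighbours of u, with weights, are adj[start[u]:start[u+1]].
        # edges is a list of directed (u, v, weight) triples.
        start = [0] * (n + 1)
        for u, _, _ in edges:
            start[u + 1] += 1
        for v in range(n):                  # prefix sums -> row offsets
            start[v + 1] += start[v]
        adj = [None] * len(edges)
        pos = start[:]                      # next free slot per vertex
        for u, v, w in edges:
            adj[pos[u]] = (v, w)
            pos[u] += 1
        return start, adj

    def dijkstra(n, start, adj, s):
        # The inner loop walks a contiguous slice of adj -- the adjacent
        # access pattern discussed above.
        dist = [float("inf")] * n
        dist[s] = 0
        pq = [(0, s)]
        while pq:
            d, u = heapq.heappop(pq)
            if d > dist[u]:
                continue                    # stale queue entry
            for v, w in adj[start[u]:start[u + 1]]:
                if d + w < dist[v]:
                    dist[v] = d + w
                    heapq.heappush(pq, (dist[v], v))
        return dist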

Matching Algorithm for Bipartite Graph Matching: a set M of edges in a graph is a matching if no vertex of the graph is an end of more than one edge in M. A matching is maximum if it is larger than any other matching. [Example figure: {1–4} is a maximal matching; {1–3, 2–4} is a maximum matching.]

Matching Algorithm for Bipartite Graph Let M be a matching. All edges in the graph are divided into two groups: matching edges and non-matching edges. A vertex is called free if it is not an end of any matching edge.

Matching Algorithm for Bipartite Graph A path P = (u_0, e_1, u_1, …, e_k, u_k) is called an augmenting path (with respect to M) if: u_0 and u_k are free, and the even-numbered edges e_2, e_4, …, e_{k-1} are matching edges. The set of edges M \ {e_2, e_4, …, e_{k-1}} ∪ {e_1, e_3, …, e_k} is also a matching, and it has one edge more than M. So, if we find an augmenting path, we can construct a larger matching.

Finding Augmenting paths in a Bipartite Graph In a bipartite graph, each augmenting path has one end in A and one end in B. Following such an augmenting path starting from its end in A, we traverse non-matching edges from A to B and matching edges from B to A. By turning the graph into a directed graph (each matching edge directed v_B → v_A, all other edges v_A → v_B), we turn the problem into a simple path-finding problem in a directed graph.

Matching Algorithm for Bipartite Graph The Algorithm:

    while (there exists an augmenting path) {
        increase |M| by one using the augmenting path
    }
    return M

Algorithm’s complexity: O(|V|·|E|)
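
One standard way to realize this loop is Kuhn’s algorithm: a DFS from each free A-vertex over the implicitly directed graph described above (a sketch; the identifiers are illustrative):

    def max_bipartite_matching(n_a, n_b, adj):
        # adj[a] lists the B-side neighbours of A-side vertex a. Returns
        # (match_b, size), where match_b[b] is the A-vertex matched to b,
        # or -1 if b is free. Runs in O(|V| * |E|).
        match_b = [-1] * n_b

        def try_augment(a, visited):
            # Follow non-matching edges a -> b; matching edges are followed
            # back b -> match_b[b] by the recursive call.
            for b in adj[a]:
                if b not in visited:
                    visited.add(b)
                    # b is free, or b's partner can be re-matched elsewhere:
                    # either way an augmenting path ends at b.
                    if match_b[b] == -1 or try_augment(match_b[b], visited):
                        match_b[b] = a
                        return True
            return False

        size = 0
        for a in range(n_a):
            if try_augment(a, set()):
                size += 1
        return match_b, size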

Matching Algorithm for Bipartite Graph – first optimization In order to find augmenting paths, we use the BFS algorithm, which has a data access pattern similar to that of Dijkstra/Prim. Therefore, using the adjacency array instead of the adjacency list / matrix improves running time.

Matching Algorithm for Bipartite Graph – second optimization We try to reduce the size of the working set, as in tiling: I. Partition G into g[1], g[2], …, g[p]. II. Find the maximum matching in g[i] for each i ∈ {1, 2, …, p} using the basic algorithm. III. Unite all sub-matchings into M. IV. Find the maximum matching in G using the basic algorithm (starting with M).
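
A sketch of phases I–IV on top of the matching routine above (illustrative names; parts is assumed to come from the partitioning step, each part given as an (a_set, b_set) pair whose induced sub-graph fits in the cache):

    def partitioned_matching(n_a, n_b, adj, parts):
        match_b = [-1] * n_b

        def try_augment(a, visited, allowed_b):
            # Same DFS as in Kuhn's algorithm, optionally restricted to a
            # subset of B-vertices so the search stays inside one sub-graph.
            for b in adj[a]:
                if (allowed_b is None or b in allowed_b) and b not in visited:
                    visited.add(b)
                    if match_b[b] == -1 or try_augment(match_b[b], visited, allowed_b):
                        match_b[b] = a
                        return True
            return False

        for a_set, b_set in parts:            # phases II + III: match within
            for a in a_set:                   # each sub-graph; sub-matchings
                try_augment(a, set(), b_set)  # accumulate directly in match_b
        matched_a = {match_b[b] for b in range(n_b) if match_b[b] != -1}
        for a in range(n_a):                  # phase IV: finish on the whole
            if a not in matched_a:            # graph, starting from M
                try_augment(a, set(), None)
        return match_b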

Matching Algorithm for Bipartite Graph – second optimization If the sizes of the sub-graphs are chosen appropriately, so that each fits into the cache, phase II generates only O(|V| + |E|) CPU-memory transactions, because each data element needs to be loaded into the cache only once. The best size for a sub-graph is found by experiment.

Matching Algorithm for Bipartite Graph – best case In the best case, the maximum matching is already found in phase II, and the algorithm’s CPU-memory transaction complexity is O(|V| + |E|). That leaves us with the problem of partitioning the graph well.

Partitioning the Bipartite Graph The goal: partition the edges into two groups such that the best possible matching can be found within each group. Algorithm: I. Arbitrarily partition the vertices into 4 equal partitions. II. Count the number of edges between each pair of partitions. III. Combine the partitions into two partitions such that as many “internal” edges as possible are created.
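
A sketch of this heuristic (illustrative; the graph is given as undirected (u, v) edge pairs):

    def partition_edges(n, edges):
        group = [v % 4 for v in range(n)]      # I. arbitrary 4-way split
        cnt = [[0] * 4 for _ in range(4)]
        for u, v in edges:                     # II. edges between each pair
            a, b = sorted((group[u], group[v]))
            cnt[a][b] += 1
        best, best_mate = -1, 1
        for x in (1, 2, 3):                    # III. pair group 0 with x,
            y, z = [g for g in (1, 2, 3) if g != x]   # the rest together
            internal = (cnt[0][0] + cnt[x][x] + cnt[0][x] +
                        cnt[y][y] + cnt[z][z] + cnt[y][z])
            if internal > best:
                best, best_mate = internal, x
        part1 = {v for v in range(n) if group[v] in (0, best_mate)}
        part2 = set(range(n)) - part1
        return part1, part2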

Conclusions Using efficient data representation methods can greatly improve an algorithm’s running time. Further improvement can be achieved by methods such as tiling and recursion. Other graph algorithms, such as Bellman-Ford, BFS and DFS, can be improved by the above because of their data access patterns.