Evaluating Graph Coloring on GPUs Pascal Grosset, Peihong Zhu, Shusen Liu, Suresh Venkatasubramanian, and Mary Hall Final Project for the GPU class - Spring.


Evaluating Graph Coloring on GPUs Pascal Grosset, Peihong Zhu, Shusen Liu, Suresh Venkatasubramanian, and Mary Hall Final Project for the GPU class - Spring 2010 submitted as a poster to PPoPP 2011

Graph Coloring
Assignment of colors to the vertices of a graph such that connected vertices have different colors
o Planar graphs: 4 colors are enough
o General (non-planar) graphs: deciding whether k colors suffice is NP-complete
Solutions:
o Brute force
o Heuristics
Use:
o Assignment of frequencies to wireless access points
o ...
[Figures: a planar graph and a non-planar graph]
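The definition above can be checked mechanically. The following is a minimal sketch, assuming the graph is stored as a Python dict that maps each vertex to a list of its neighbors; the names `is_proper_coloring` and `triangle` are illustrative and do not come from the slides.

```python
def is_proper_coloring(adj, color):
    """Return True if no edge joins two vertices that share a color."""
    return all(color[u] != color[v] for u in adj for v in adj[u])


# A triangle: every vertex is connected to the other two,
# so any proper coloring needs three distinct colors.
triangle = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
```

For instance, `is_proper_coloring(triangle, {0: 1, 1: 2, 2: 3})` holds, while giving vertices 0 and 1 the same color does not.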

Existing Algorithms
Many heuristics exist with different decision criteria
o First Fit - none
o LDO - uses degree as decision parameter
o SDO - uses saturation as decision parameter
o SDO & LDO - uses saturation and then degree
Trade-offs
o Speed:
   - Fastest: First Fit
   - Slowest: SDO & LDO
o Colors:
   - Best: SDO & LDO
   - Worst: First Fit
Degree: number of neighbors of a vertex. Saturation: number of differently colored neighbors.
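To make the two decision parameters concrete, here is a small sketch of First Fit together with degree and saturation. It assumes a dict-of-lists graph and uses 0 for "not yet colored" (the slides use 0 for conflicted nodes); all function names are illustrative, not taken from the authors' code.

```python
def degree(adj, v):
    """Degree: number of neighbors of v."""
    return len(adj[v])


def saturation(adj, color, v):
    """Saturation: number of distinct colors among v's colored neighbors."""
    return len({color[u] for u in adj[v] if color[u] != 0})


def first_fit(adj, order=None):
    """Give each vertex, in the given order, the smallest unused color."""
    color = {v: 0 for v in adj}  # 0 means "not yet colored"
    for v in (order if order is not None else adj):
        used = {color[u] for u in adj[v]}
        c = 1
        while c in used:
            c += 1
        color[v] = c
    return color


def ldo_order(adj):
    """LDO visiting order: largest degree first."""
    return sorted(adj, key=lambda v: -len(adj[v]))
```

Calling `first_fit(adj, ldo_order(adj))` gives an LDO-style coloring; plain `first_fit(adj)` visits vertices in storage order.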

Existing Parallel Solutions
We did not find any relevant related work on graph coloring on GPUs
Main inspiration:
o Gebremedhin and Manne (G-M) algorithm for shared memory architectures
   - 4 stages: Partitioning, Pseudo-coloring, Conflict detection, Conflict resolution

Proposed Framework
Adapt existing framework to GPUs
Phase 1: Graph Partitioning
o Decide how the graph will be partitioned into subgraphs
Phase 2: Graph Coloring & Conflict Identification
o Graph coloring using one of the heuristics
   - First Fit, SDO & LDO, Max In, Max Out
o Conflict identification
Phase 3: Sequential Conflict Resolution
o Guarantees that all remaining conflicts are removed
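Phase 1 can be illustrated with a very simple partitioner. This sketch assumes that vertices are numbered 0..n-1 and that a naive partition just cuts that range into contiguous blocks; the slides do not spell out the partitioning rule, so `naive_partition` and `boundary_vertices` are hypothetical helpers.

```python
def naive_partition(num_vertices, subgraph_size):
    """Cut vertex ids 0..n-1 into contiguous blocks of the given size."""
    return [list(range(start, min(start + subgraph_size, num_vertices)))
            for start in range(0, num_vertices, subgraph_size)]


def boundary_vertices(adj, parts):
    """Vertices with at least one neighbor in a different subgraph."""
    owner = {v: i for i, part in enumerate(parts) for v in part}
    return [v for v in adj if any(owner[u] != owner[v] for u in adj[v])]
```

Only boundary vertices can end up in conflict after the subgraphs are colored independently, which is why it is useful to identify them.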

Max In & Max Out
Two new heuristics
o Decision parameter: number of neighbors a vertex has outside the subgraph

while Num_Colored < N (number of vertices in subgraph) do
    max = -1
    for i = 1 to N do
        if !colored(Vi) then
            no = number of neighbors of Vi outside partition
            if no > max then
                max = no
                index = i
            else if no == max then
                if d(Vi) > d(Vindex) then
                    index = i
    Color Vindex
    Num_Colored++
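The selection rule in the pseudocode (most neighbors outside the partition first, ties broken by larger degree) can be sketched as follows. The graph layout (dict of neighbor lists) and the name `max_out_order` are assumptions made for illustration; the sketch returns only the order in which vertices of one subgraph would be picked, since the slide does not say how "Color Vindex" chooses the color.

```python
def max_out_order(adj, part):
    """Order in which the selection loop would pick vertices of one subgraph."""
    inside = set(part)
    # Number of neighbors each vertex has outside its own subgraph.
    outside = {v: sum(u not in inside for u in adj[v]) for v in part}
    remaining = list(part)
    order = []
    while remaining:
        # Most outside neighbors first; ties go to the larger degree,
        # and remaining ties to the earliest vertex, as in the pseudocode.
        pick = max(remaining, key=lambda v: (outside[v], len(adj[v])))
        order.append(pick)
        remaining.remove(pick)
    return order
```

Each picked vertex could then be given, for example, the smallest color not used by its neighbors.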

Phase 2: Graph Coloring & Conflict Identification
Main part
Transfer data from CPU to GPU
1. Graph Coloring
   o Run graph coloring: 1 thread per subgraph
   - cudaEventSynchronize -
2. Conflict Identification
   o Set color of conflicted nodes to 0: 1 thread per node
   - cudaEventSynchronize -
Transfer conflicts to CPU
Count conflicts
o If conflicts < threshold -> exit
o Else -> repeat from 1
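The loop above can be simulated on the CPU to see how conflicts arise and disappear. This is a sequential sketch, not the CUDA implementation: each "thread" colors its subgraph with First Fit while seeing only a stale snapshot of the other subgraphs, and a conflict resets the higher-numbered endpoint to 0. The tie-breaking rule, the `max_passes` safety limit and all names are assumptions.

```python
def first_free(adj, color, v):
    """Smallest color (starting at 1) not used by any neighbor of v."""
    used = {color[u] for u in adj[v]}
    c = 1
    while c in used:
        c += 1
    return c


def parallel_style_coloring(adj, parts, threshold=1, max_passes=10):
    color = {v: 0 for v in adj}  # 0 means uncolored or conflicted
    for _ in range(max_passes):
        snapshot = dict(color)  # what the other "threads" can see
        for part in parts:  # one simulated thread per subgraph
            local = dict(snapshot)
            for v in part:
                if local[v] == 0:
                    local[v] = first_free(adj, local, v)
                    color[v] = local[v]
        # Conflict identification: one simulated thread per node.
        conflicts = [v for v in adj
                     if any(color[u] == color[v] and u < v for u in adj[v])]
        for v in conflicts:
            color[v] = 0
        if len(conflicts) < threshold:
            break
    # Phase 3: sequential resolution of whatever is still uncolored.
    for v in adj:
        if color[v] == 0:
            color[v] = first_free(adj, color, v)
    return color
```

With one vertex per subgraph every edge is a potential conflict, so that case shows the repeat-until-threshold behavior most clearly.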

Data Storage
Adjacency Matrix (initial)
o Too big
Adjacency List
o Very compact
o Bad memory access pattern -> bad performance
"Sort of" Adjacency List
o Size of each list: max degree
o Good balance between performance and size -> can still be optimized
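One way to read the "sort of" adjacency list is as a fixed-width table in which every vertex gets exactly max-degree slots. The sketch below assumes a row-major flat array and a padding value of -1 for unused slots; both choices, and the function names, are illustrative rather than taken from the slides.

```python
def to_padded_adjacency(adj, pad=-1):
    """Flatten a list of neighbor lists into rows of max_degree slots each."""
    n = len(adj)
    max_deg = max((len(nbrs) for nbrs in adj), default=0)
    flat = [pad] * (n * max_deg)
    for v, nbrs in enumerate(adj):
        start = v * max_deg
        flat[start:start + len(nbrs)] = nbrs
    return flat, max_deg


def neighbors(flat, max_deg, v, pad=-1):
    """Read back the neighbors of v, skipping padding slots."""
    row = flat[v * max_deg:(v + 1) * max_deg]
    return [u for u in row if u != pad]
```

Because every row has the same length, the neighbors of vertex v start at index v * max_deg, at the cost of wasted space for low-degree vertices.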

Test Data
Data source: University of Florida Sparse Matrix Collection
[Figures: nasasrb, pwtk]

Benchmarks
Sequential algorithms
o First Fit
o SDO & LDO
   - Implementation taken directly from H. Al-Omari and K. E. Sabri, "New Graph Coloring Algorithms": O(n^3) -> up to 1000x speedups!!!
   - Optimized (our) implementation (as a red-black tree): O(m log n) -> 20x - 40x speedup
n: number of vertices, m: number of edges
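The idea behind the optimized sequential SDO & LDO can be sketched with a priority queue. Note that this example swaps in Python's binary heap (`heapq`) with lazy deletion in place of the red-black tree mentioned on the slide, and it assumes integer vertex ids and 0 for "uncolored"; it is an illustration of the ordering rule, not the authors' implementation.

```python
import heapq


def sdo_ldo(adj):
    """Color by highest saturation first, breaking ties by larger degree."""
    color = {v: 0 for v in adj}     # 0 means "not yet colored"
    seen = {v: set() for v in adj}  # distinct colors among neighbors
    # Entries are (-saturation, -degree, vertex) so the heap pops the
    # most saturated, then highest-degree, vertex first.
    heap = [(0, -len(adj[v]), v) for v in adj]
    heapq.heapify(heap)
    while heap:
        neg_sat, _, v = heapq.heappop(heap)
        if color[v] != 0 or -neg_sat != len(seen[v]):
            continue  # stale entry: already colored or saturation changed
        c = 1
        while c in seen[v]:
            c += 1
        color[v] = c
        for u in adj[v]:
            if color[u] == 0 and c not in seen[u]:
                seen[u].add(c)
                heapq.heappush(heap, (-len(seen[u]), -len(adj[u]), u))
    return color
```

A vertex is pushed again only when its saturation grows, so stale entries are simply skipped when popped.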

Implementation
Some details:
o Tests were carried out on a Tesla S1070 and a Fermi GTX 480 with caching
o Memory transfer time is included in all results
o All results are the average of 10 runs
Detailed times: Graph = hood; Algo = First Fit
o Memory transfer: Transfer in: ms; Transfer out: ms
o Getting boundary list: ms
o + Final CPU steps time: 1.63 ms
o Total GPU: ~96 ms vs Total CPU: ~100 ms
[Table: time of the Coloring, Detect and Count steps for Pass 1, Pass 2 and Pass 3]

Results: Classes
Interesting pattern in test results; 3 classes identified
Class 1
o pkustk10, pkustk11, shipsec1, shipsec5, shipsec8, msdoor & hood
   - Speedup steadily increases initially and eventually plateaus; coloring improves the more threads we use
   - Ratio of maximum to average degree is between 1.6 and 2.5
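The ratio of maximum to average degree used to separate the classes is easy to compute for any graph. A minimal sketch, assuming a dict of neighbor lists and a hypothetical helper name:

```python
def max_to_avg_degree_ratio(adj):
    """Maximum degree divided by average degree."""
    degrees = [len(nbrs) for nbrs in adj.values()]
    # max / (sum / n), rearranged to a single division.
    return max(degrees) * len(degrees) / sum(degrees)
```

For a star with one center and four leaves the degrees are 4, 1, 1, 1, 1, so the average is 1.6 and the ratio is 2.5.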

Results: Classes
Class 2
o pwtk, bmw3_2 and ldoor
o Speedup steadily increases but then drops off at a certain point; best coloring is found before the dip
o All the graphs in this class are quite large; pwtk & bmw3_2 have similar densities and sizes; the ratio of maximum to average degree is larger than in class 1: 3.4 for pwtk and 6.8 for bmw3_2

Results: Classes
Class 3
o ct20stif, nasasrb & pkustk13
   - Speedup steadily increases
   - Best coloring is in the middle of the range
   - Maximum degree is approximately 4 times the average degree

Results: Subgraph Size
Small subgraph sizes produce better speedups and colorings

Results: Subgraph Size

Results: Tesla vs Fermi Obviously Fermi is faster!

Results: Metis vs Non-Metis
Naive partitioning is most of the time faster and more efficient than using Metis
o Metis is not effective for such small partitions
o Yields unbalanced partitions at such small graph sizes
o Unbalanced partitions are bad for the GPU
Metis was slower even without counting the time Metis itself takes to run!

Conclusion
Set of guidelines for graph coloring
o Fastest: Parallel First Fit
   - Much better colors than Sequential First Fit
   - Slightly slower than Sequential First Fit
o Best results (if you do not care about speed)
   - Sequential SDO & LDO implemented as a red-black tree
o Balance of speed and colors
   - Parallel Max Out or Min Out: average of 20x speedup over sequential SDO & LDO
Use small subgraph sizes
Naive partitioning is good
CUDA not only makes calculations faster but can also be used to improve results: First Fit!

Class Poster

PPoPP Final Poster

Questions?