All-Pairs Shortest Paths

Presentation transcript:

All-Pairs Shortest Paths

Directed graph with "costs" on edges; find least-cost paths between all reachable vertex pairs.
Several classical algorithms, with:
  - Work ~ matrix multiplication
  - Span ~ log² n
Case study: an implementation on a multicore architecture, the graphics processing unit (GPU).
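The "work ~ matrix multiplication" claim comes from the classical observation that APSP is a matrix closure over the (min, +) semiring: squaring the weight matrix about log₂ n times yields all shortest distances. Purely as a reference point, a minimal NumPy sketch of that primitive (the names min_plus and apsp_by_squaring are ours, not from the talk):

    import numpy as np

    def min_plus(X, Y):
        """(min, +) matrix 'product': out[i, j] = min over k of X[i, k] + Y[k, j]."""
        out = np.full((X.shape[0], Y.shape[1]), np.inf)
        for k in range(X.shape[1]):
            # Combine column k of X with row k of Y; keep the elementwise minimum.
            out = np.minimum(out, X[:, [k]] + Y[[k], :])
        return out

    def apsp_by_squaring(W):
        """All-pairs shortest distances by repeated (min, +) squaring.

        W[i, j] is the edge weight (np.inf where there is no edge).
        Assumes no negative-weight cycles. O(log n) squarings, each O(n^3) work.
        """
        D = W.copy()
        np.fill_diagonal(D, 0)        # zero-length path from each vertex to itself
        m = 1
        while m < D.shape[0] - 1:     # a shortest path uses at most n-1 edges
            D = min_plus(D, D)
            m *= 2
        return D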

GPU characteristics

  - Powerful: two Nvidia 8800s > 1 TFLOPS
  - Inexpensive: about $500 each

But:
  - Difficult programming model: one instruction stream drives 8 arithmetic units
  - Performance is counterintuitive and fragile: the memory access pattern has subtle effects on cost
  - Extremely easy to underutilize the device: doing it wrong easily costs 100x in time

Recursive All-Pairs Shortest Paths

Based on the R-Kleene algorithm. Partition the adjacency matrix into four blocks:

    [ A  B ]
    [ C  D ]

Well suited for the GPU architecture:
  - Fast matrix-multiply kernel
  - In-place computation => low memory bandwidth
  - Few, large MatMul calls => low GPU dispatch overhead
  - Recursion stack on host CPU, not on multicore GPU
  - Careful tuning of GPU code

Over the semiring where + is "min" and × is "add":

    A = A*;            % recursive call
    B = AB;   C = CA;
    D = D + CB;
    D = D*;            % recursive call
    B = BD;   C = DC;
    A = A + BC;
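For concreteness, here is a minimal sequential NumPy sketch of that blocked recursion. It is our own illustration, not the authors' GPU code: the names min_plus and r_kleene and the base-case threshold are assumptions, and in the real implementation the multiplies run as tuned CUDA matrix kernels driven from the host.

    import numpy as np

    def min_plus(X, Y):
        """(min, +) product, same operation as in the earlier sketch (broadcast form)."""
        return np.min(X[:, :, None] + Y[None, :, :], axis=1)

    def r_kleene(A):
        """Overwrite A with its closure A* (all-pairs shortest distances) over the
        (min, +) semiring, following the slide's schedule:
            A = A*; B = AB; C = CA; D = D + CB;
            D = D*; B = BD; C = DC; A = A + BC.
        """
        n = A.shape[0]
        if n <= 16:                                    # base-case size is a tuning knob
            np.fill_diagonal(A, np.minimum(A.diagonal(), 0.0))
            for k in range(n):                         # plain Floyd-Warshall base case
                np.minimum(A, A[:, [k]] + A[[k], :], out=A)
            return A
        h = n // 2
        r_kleene(A[:h, :h])                                            # A = A*
        A[:h, h:] = min_plus(A[:h, :h], A[:h, h:])                     # B = AB
        A[h:, :h] = min_plus(A[h:, :h], A[:h, :h])                     # C = CA
        A[h:, h:] = np.minimum(A[h:, h:],
                               min_plus(A[h:, :h], A[:h, h:]))         # D = D + CB
        r_kleene(A[h:, h:])                                            # D = D*
        A[:h, h:] = min_plus(A[:h, h:], A[h:, h:])                     # B = BD
        A[h:, :h] = min_plus(A[h:, h:], A[h:, :h])                     # C = DC
        A[:h, :h] = np.minimum(A[:h, :h],
                               min_plus(A[:h, h:], A[h:, :h]))         # A = A + BC
        return A

Calling r_kleene(W) on an n × n float matrix W (np.inf where there is no edge) overwrites W with the all-pairs distances, mirroring the "in-place computation => low memory bandwidth" point above, up to the temporaries the (min, +) products allocate.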

Execution of Recursive APSP

APSP: Experiments and observations

128-core Nvidia 8800. Speedup relative to:
  - 1-core CPU: 120x – 480x
  - 16-core CPU: 17x – 45x
  - Iterative, 128-core GPU: 40x – 680x
  - MSSSP, 128-core GPU: ~3x

[Figure: time vs. matrix dimension]

Conclusions:
  - High performance is achievable, but not simple
  - Carefully chosen and optimized primitives will be key