A GPU Implementation of Inclusion-based Points-to Analysis
Mario Méndez-Lojo (AMD)
Martin Burtscher (Texas State University, USA)
Keshav Pingali (U.T. Austin, USA)

GPU + irregular algorithms

Regular algorithms manipulate matrices
– GPU + regular algorithm = OK (MMM, sorting, …)
Irregular algorithms manipulate graphs
– local computation: modifies edge/node values, not the graph structure (BFS, SCC, BH, …)
– GPU + local computation algorithm = OK
– refinement: adds edges or nodes
– GPU + refinement algorithm = ?
Goal: implement a refinement algorithm on the GPU
– Andersen's points-to analysis

Overview

1. Andersen's points-to analysis
– graph formulation
– parallelization
2. GPU implementation
– graph representation
– rule application
– optimizations
3. Experiments
– vs. serial CPU code
– vs. parallel CPU code

Points-to analysis

Static analysis technique
– approximates the set of locations pointed to by each (pointer) variable
– useful for code optimization, verification, …
Focus: Andersen's analysis
– available in modern compilers (gcc, LLVM, …)
– context- and flow-insensitive

Andersen's – graph formulation [OOPSLA'10]

1. Extract the pointer assignments: x = &y, x = y, x = *y, *x = y, x = y + o
2. Create the initial constraint graph
– nodes ≡ variables, edges ≡ statements
3. Apply the graph rewrite rules until fixpoint (next slide)
– solution = points-to subgraph
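As a concrete illustration of steps 1 and 2 (my own example, not from the deck), here is how a few C statements map to initial constraint-graph edges; the edge orientations are an assumption, chosen to match the hedged rule sketch given after the next slide:

```cuda
/* Illustrative only: statements and the initial edges they induce. */
int  b;       /* a plain variable   */
int *a, *c;   /* pointers           */
int **q;      /* pointer to pointer */

void build_example(void) {
    a = &b;   /* address-of: initial points-to edge  a -p-> b */
    c = a;    /* copy statement:  edge  a -c-> c              */
    q = &a;   /* address-of:      edge  q -p-> a              */
    c = *q;   /* load statement:  edge  q -l-> c              */
    *q = c;   /* store statement: edge  q -s-> c              */
}
```

During rule application, the copy rule then fires on a -p-> b and a -c-> c and adds c -p-> b, and so on until fixpoint.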

Graph rewrite rules

The deck presents the rules as a table with columns NAME, REWRITE RULE, and ENSURES, covering the copy, load, store, and addPtr rules.
[Figure: rule diagrams over nodes x, y, z (and z+o) with edge labels c, l, s, a,o, and p; the drawings do not survive transcription.]
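Since the diagrams are lost, here is a hedged LaTeX reconstruction of the four rules as inference rules over labeled edges, in the style of the OOPSLA'10 graph formulation. The edge orientations (a statement z = x adds the copy edge from x to z, etc.) are my assumption, not read off the slides:

```latex
\begin{align*}
\textsf{copy}   &: x \xrightarrow{p} y \wedge x \xrightarrow{c} z
  \implies z \xrightarrow{p} y     && (z = x)\\
\textsf{load}   &: x \xrightarrow{p} y \wedge x \xrightarrow{l} z
  \implies y \xrightarrow{c} z     && (z = {*}x)\\
\textsf{store}  &: x \xrightarrow{p} y \wedge x \xrightarrow{s} z
  \implies z \xrightarrow{c} y     && ({*}x = z)\\
\textsf{addPtr} &: x \xrightarrow{p} y \wedge x \xrightarrow{a,o} z
  \implies z \xrightarrow{p} y{+}o && (z = x + o)
\end{align*}
```

Each rule ensures that whenever its left-hand pattern occurs, the right-hand edge is present, so applying the rules to fixpoint yields the complete points-to (p) subgraph.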

Parallelization

All rules can proceed in parallel, but edge additions must be atomic
– enforced by the data structure
[Figure: two rule instances firing concurrently on a small graph with nodes x, y, z, s, t.]

GPU – graph representation

Cannot use the standard representations
– adjacency matrix: too much space
– CSR: no support for dynamic addition of edges
Sparse bit vector
– linked list of nodes
– base indicates the range a node covers
– bits indicates membership within that range
Example (bits width = 8): a node with base 0 and bits 0 and 2 set, followed by a node with base 1 and bit 7 set, represents {0·8+0, 0·8+2, 1·8+7} = {0, 2, 15}
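A minimal sketch of this data structure, using the slide's 8-bit word width; all names here are mine, and the paper's actual layout may differ:

```cuda
#define WORD_BITS 8u   // width of the bits field, as in the slide's example

// One linked-list node of the sparse bit vector.
struct SbvNode {
    unsigned      base;   // covers elements [base*WORD_BITS, (base+1)*WORD_BITS)
    unsigned char bits;   // bit i set <=> element base*WORD_BITS + i is a member
    SbvNode*      next;   // nodes kept sorted by increasing base
};

// Membership test: walk the sorted list until base matches or overshoots.
__host__ __device__ bool sbv_contains(const SbvNode* n, unsigned elem) {
    unsigned b = elem / WORD_BITS;
    while (n != nullptr && n->base < b) n = n->next;
    return n != nullptr && n->base == b &&
           ((n->bits >> (elem % WORD_BITS)) & 1u);
}
```

For the slide's example, the list (base = 0, bits for {0, 2}) → (base = 1, bit for {7}) represents {0, 2, 15}.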

Sparse bit vectors on the GPU

Each node is 128 bytes wide: a good fit for the Fermi architecture
– assign the i-th word to the i-th thread of a warp
– loading a node from global/shared memory = 1 transaction (thread i brings in the i-th word)
– set operations cause little thread divergence (threads i = 0..30 operate on the i-th word; thread 31 skips)
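The per-warp processing could look like the following hedged sketch. The 31-data-word + 1-metadata-word layout is my reading of the "thread 31 skips" remark; the paper's layout may differ:

```cuda
#define NODE_WORDS 32   // 128 bytes = 32 x 4-byte words per node

// Warp-cooperative union of one source node into one destination node.
// Lane i handles word i, so the node is read/written in one coalesced
// transaction and the lanes never diverge on the set operation itself.
__device__ void warp_union(unsigned* dst, const unsigned* src, bool* changed) {
    int lane = threadIdx.x & 31;        // lane id within the warp
    if (lane < NODE_WORDS - 1) {        // lane 31 skips the metadata word
        unsigned oldw = dst[lane];
        unsigned neww = oldw | src[lane];
        if (neww != oldw) {
            dst[lane] = neww;           // destination node is owned by this
            *changed  = true;           // warp, so a plain store suffices
        }
    }
}
```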

Wide sparse bit vectors

Bits are mostly empty → waste of space and parallelism
Works well in practice because of data locality
– identifiers are assigned sequentially to variables, in program order
– if x points to y, x probably points to variables with identifiers close to y
[Figure: adjacency matrix of the P edges for the gcc input (#vars = 120K); initial graph in black, final graph in blue.]

Rule execution on the GPU

Concurrent rule execution requires synchronization
– GPUs support atomics, but they hurt performance
Solution: flip the edges, so that active nodes add edges to themselves
– no synchronization required: each active node is processed by only one thread
[Figure: the copy rule before and after reversing the c edge into c⁻¹.]
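A hedged sketch of what the reversed copy rule could look like as a kernel; every helper name here is hypothetical, standing in for the paper's actual data structure:

```cuda
struct Graph { int num_nodes; /* edge lists elided in this sketch */ };

__device__ int  num_copy_preds(const Graph* g, int x);    // #(x -c^-1-> *) edges
__device__ int  copy_pred(const Graph* g, int x, int i);  // i-th such predecessor
__device__ bool pts_union(Graph* g, int x, int y);        // pts(x) |= pts(y)

// One thread per destination node x: follow x's inverted copy edges and
// pull the predecessors' points-to sets into pts(x). Since only pts(x) is
// written and x belongs to exactly one thread, no atomics are required.
__global__ void copy_rule_reversed(Graph* g, bool* changed) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    if (x >= g->num_nodes) return;
    for (int i = 0; i < num_copy_preds(g, x); ++i) {
        int y = copy_pred(g, x, i);     // x -c^-1-> y
        if (pts_union(g, x, y))
            *changed = true;            // benign race: all writers store true
    }
}
```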

Reversed rules

[Figure: table contrasting the original and reversed forms of the copy, load, store, and ptrAdd rules; each reversed rule replaces an edge label with its inverse (c⁻¹, l⁻¹, s⁻¹, a⁻¹) so that the node acquiring new edges is the one that fires the rule.]

Baseline algorithm

Flowchart (blue = CPU, red = GPU, green = transfer):
1. read input (CPU)
2. transfer the program statements to the GPU
3. create the initial constraint graph (GPU)
4. apply the rewrite rules (GPU)
5. graph changed? yes → repeat step 4; no → continue
6. transfer the P subgraph to the CPU
7. verify the solution (CPU)
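The driver loop implied by the flowchart might look like this hedged host-side sketch (apply_rules and Graph are placeholders for the real kernel and graph type):

```cuda
#include <cuda_runtime.h>

struct Graph;                                          // device-resident graph
__global__ void apply_rules(Graph* g, bool* changed);  // one rewrite pass

void solve(Graph* d_graph) {
    bool  h_changed;
    bool* d_changed;
    cudaMalloc(&d_changed, sizeof(bool));
    do {
        cudaMemset(d_changed, 0, sizeof(bool));
        apply_rules<<<256, 256>>>(d_graph, d_changed);  // fire all rules once
        cudaMemcpy(&h_changed, d_changed, sizeof(bool),
                   cudaMemcpyDeviceToHost);             // also synchronizes
    } while (h_changed);  // fixpoint: no rule added an edge
    cudaFree(d_changed);
}
```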

Iterations perform redundant work

A rule instance that already fired keeps matching in later iterations; e.g., a node x with a c⁻¹ edge should fire the corresponding rule only once.
Solution: keep two sets of points-to edges
– ∆P: edges added in the last iteration
– P: edges added before the last iteration
This implies
– replacing P by ∆P in the rules
– updating P and ∆P when each iteration ends
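In symbols (my notation, not the deck's), the end-of-iteration update is the standard difference-propagation step:

```latex
P \leftarrow P \cup \Delta P, \qquad
\Delta P \leftarrow \mathit{New} \setminus P
```

where New is the set of points-to edges derived during the iteration just finished and P is updated first. Because the rules read ∆P rather than P in their points-to premise, each rule instance fires only once.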

Optimized algorithm

Same flowchart as the baseline (blue = CPU, red = GPU, green = transfer), with two changes:
– after applying the rewrite rules, update ∆P
– transfer only the ∆P subgraph to the CPU, every iteration
Each ∆P transfer is small (< 10 ms), and the transfer time is less than the rule-application time, so the transfers add no penalty.

Results

Input: large C programs
– e.g., linux: 1.5M variables, 420K statements
CPU-s
– AMD Opteron
– C++, sequential
– “The ant and the grasshopper…” [PLDI'07]
CPU-1 and CPU-16
– same hardware as CPU-s
– Java 1.6, parallel
– “Parallel inclusion-based points-to analysis” [OOPSLA'10]
GPU
– NVIDIA Tesla C2070
– CUDA 4.1
The GPU beats CPU-16 for 11 of 14 inputs
– average speedup: 7x (compared to CPU-s)
[Chart: running time (ms) and speedup per input.]

GPU + irregular algorithm = OK

Success story: Andersen's points-to analysis
– 7x speedup compared to CPU-sequential
– 40% faster than CPU-parallel
Rethink the data structures and the algorithm
– CPU-parallel is similar to CPU-sequential
– the GPU version is very different from both
– the GPU version took 35% more person-hours (vs. CPU-parallel)
In the paper
– dynamic work assignment
– memory allocation
– experimental analysis

Thank you!

The CPU (parallel) and GPU implementations are publicly available.