Copperhead: A Python-like Data Parallel Language & Compiler
Bryan Catanzaro, UC Berkeley
Michael Garland, NVIDIA Research
Kurt Keutzer, UC Berkeley
Universal Parallel Computing Research Center, University of California, Berkeley

2/28 Intro to CUDA
• Overview
• Multicore/Manycore
• SIMD
• Programming with millions of threads

3/28 The CUDA Programming Model
• CUDA is a recent programming model, designed for
  ▪ Manycore architectures
  ▪ Wide SIMD parallelism
  ▪ Scalability
• CUDA provides:
  ▪ A thread abstraction to deal with SIMD
  ▪ Synchronization & data sharing between small groups of threads
• CUDA programs are written in C + extensions
• OpenCL is inspired by CUDA, but HW & SW vendor neutral
  ▪ Programming model essentially identical

4/28 Multicore and Manycore
• Multicore: yoke of oxen
  ▪ Each core optimized for executing a single thread
• Manycore: flock of chickens
  ▪ Cores optimized for aggregate throughput, deemphasizing individual performance
(figure: multicore vs. manycore chip diagrams)

5/28 Multicore & Manycore, cont.
Specifications           Core i7 960                          GTX 285
Processing Elements      4 cores, 4-way SIMD @ 3.2 GHz        30 cores, 8-way SIMD @ ~1.5 GHz
Resident Threads (max)   4 cores × 2 threads × 4-wide SIMD    30 cores × 32 SIMD vectors × 32-wide SIMD
                         = 32 strands                         = 30,720 strands
SP GFLOP/s               ~102                                 ~1080
Memory Bandwidth         25.6 GB/s                            159 GB/s
Register File            -                                    1.875 MB
Local Store              -                                    480 kB
(figure: Core i7 and GTX 285 die photos)

6/28 SIMD: Neglected Parallelism
• It is difficult for a compiler to exploit SIMD
  ▪ How do you deal with sparse data & branches?
  ▪ Many languages (like C) are difficult to vectorize
  ▪ Fortran is somewhat better
• Most common solution:
  ▪ Either forget about SIMD and pray the autovectorizer likes you
  ▪ Or write intrinsics (essentially assembly language), which requires a new code version for every SIMD extension

7/28 What to do with SIMD?
• Neglecting SIMD in the future will be more expensive
  ▪ AVX: 8-way SIMD, Larrabee: 16-way SIMD
• This problem composes with thread-level parallelism
(figure: 4-way SIMD vs. 16-way SIMD)

8/28 CUDA
• CUDA addresses this problem by abstracting both SIMD and task parallelism into threads
  ▪ The programmer writes a serial, scalar thread with the intention of launching thousands of threads
• Being able to launch 1 million threads changes the parallelism problem
  ▪ It’s often easier to find 1 million threads than 32: just look at your data & launch a thread per element
• CUDA is designed for data parallelism
  ▪ Not coincidentally, data parallelism is the only way for most applications to scale to 1000(+)-way parallelism

9/28 Hello World

10/28 CUDA Summary
• CUDA is a programming model for manycore processors
  ▪ It abstracts SIMD, making it easy to use wide SIMD vectors
  ▪ It provides good performance on today’s GPUs
• In the near future, CUDA-like approaches will map well to many processors & GPUs
• CUDA encourages SIMD-friendly, highly scalable algorithm design and implementation

11/28 A Parallel Scripting Language
• What is a scripting language?
  ▪ Lots of opinions on this
  ▪ I’m using an informal definition: a language where performance is happily traded for productivity
  ▪ Weak performance requirement of scalability: “My code should run faster tomorrow”
• What is the analog of today’s scripting languages for manycore?

12/28 Data Parallelism
• Assertion: scaling to 1000 cores requires data parallelism
• Accordingly, manycore scripting languages will be data parallel
  ▪ They should allow the programmer to express data parallelism naturally
  ▪ They should compose and transform the parallelism to fit target platforms

13/28 Warning: Evolving Project
• Copperhead is still in embryo
  ▪ We can compile a few small programs
  ▪ Lots more work to be done in both language definition and code generation
• Feedback is encouraged

14/28 Copperhead = Cu + Python
• Copperhead is a subset of Python, designed for data parallelism
• Why Python?
  ▪ Extant, well accepted, high-level scripting language
    - Free simulator(!!)
  ▪ Already understands things like map and reduce
  ▪ Comes with a parser & lexer
• The current Copperhead compiler takes a subset of Python and produces CUDA code
  ▪ Copperhead is not CUDA specific, but the current compiler is

15/28 Copperhead is not Pure Python
• Copperhead is not for arbitrary Python code
  ▪ Most features of Python are unsupported
• Copperhead is compiled, not interpreted
  ▪ Connecting Python code & Copperhead code will require binding the programs together, similar to Python-C interaction
• Copperhead is statically typed
(figure: Copperhead as a subset of Python)

16/28 Saxpy: Hello world

    def saxpy(a, x, y):
        return map(lambda xi, yi: a*xi + yi, x, y)

• Some things to notice:
  ▪ Types are implicit
    - The Copperhead compiler uses a Hindley-Milner type system with typeclasses, similar to Haskell
    - Typeclasses are fully resolved in CUDA via C++ templates
  ▪ Functional programming:
    - map, lambda (or the equivalent in list comprehensions)
    - You can pass functions around to other functions
    - Closure: the variable ‘a’ is free in the lambda function, but bound to the ‘a’ in its enclosing scope
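
The slides do not include a usage example; as a rough sketch (mine, not from the talk), the same function also runs under ordinary Python, where the multi-argument map behaves like the data-parallel map Copperhead provides (Python 3's map is lazy, hence the list() wrapper):

    # Hypothetical usage sketch in plain Python, not Copperhead itself.
    # Each output element depends only on the corresponding input elements,
    # which is what lets the compiler map it to one CUDA thread per element.
    def saxpy(a, x, y):
        return map(lambda xi, yi: a*xi + yi, x, y)

    x = [1.0, 2.0, 3.0, 4.0]
    y = [10.0, 20.0, 30.0, 40.0]
    print(list(saxpy(2.0, x, y)))   # [12.0, 24.0, 36.0, 48.0]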

17/28 Type Inference, cont.
• Copperhead includes function templates for intrinsics like add, subtract, map, scan, gather
• Expressions are mapped against templates
  ▪ Every variable starts out with a unique generic type, then types are resolved by union-find on the abstract syntax tree
• Tuple and function types are also inferred
(figure: inferring c = a + b against + : (Num0, Num0) → Num0; the fresh type variables for a, b, and c, e.g. A145, A207, A52, are unified to a single numeric type, Num52)
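
The talk does not show the inference code itself; the following is a minimal, hypothetical sketch of the union-find idea described above (the names TypeVar, find, and unify are mine, not the Copperhead compiler's):

    # Minimal union-find unification sketch (illustrative only). Each program
    # variable gets a fresh type variable; unifying the operands and result of
    # '+' against its Num signature merges them into one equivalence class.
    class TypeVar:
        count = 0
        def __init__(self):
            TypeVar.count += 1
            self.name = "A%d" % TypeVar.count
            self.parent = self          # union-find parent pointer
            self.constraint = None      # e.g. "Num"

    def find(t):
        while t.parent is not t:
            t.parent = t.parent.parent  # path halving
            t = t.parent
        return t

    def unify(t1, t2):
        r1, r2 = find(t1), find(t2)
        if r1 is r2:
            return
        r2.parent = r1
        r1.constraint = r1.constraint or r2.constraint

    # c = a + b, where (+) : (Num0, Num0) -> Num0
    a, b, c, num0 = TypeVar(), TypeVar(), TypeVar(), TypeVar()
    num0.constraint = "Num"
    for t in (a, b, c):
        unify(num0, t)

    print(find(a) is find(b) is find(c))   # True: one equivalence class
    print(find(a).constraint)              # 'Num'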

18/28 Data parallelism
• Copperhead computations are organized around data parallel arrays
• map performs a “forall” for each element in an array
  ▪ Accesses must be local
• Accessing non-local elements is done explicitly
  ▪ shift, rotate, or gather
• No side effects allowed
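
As a sequential reference for what these non-local primitives compute (my reading of the slides, not Copperhead's definition; in particular the boundary fill used by shift is an assumption):

    # Hypothetical list-returning reference semantics, for illustration only.
    def shift(a, amount, default=0):
        # Result element i is a[i + amount], or `default` past either end.
        n = len(a)
        return [a[i + amount] if 0 <= i + amount < n else default for i in range(n)]

    def rotate(a, amount):
        # Like shift, but wraps around instead of filling a default.
        n = len(a)
        return [a[(i + amount) % n] for i in range(n)]

    def gather(a, indices):
        # result[i] = a[indices[i]]: arbitrary read-only indexing.
        return [a[i] for i in indices]

    print(shift([1, 2, 3, 4], 1))               # [2, 3, 4, 0]
    print(rotate([1, 2, 3, 4], 1))              # [2, 3, 4, 1]
    print(gather([10, 20, 30, 40], [3, 0, 2]))  # [40, 10, 30]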

19/28 Copperhead primitives
• map
• reduce
• Scans:
  ▪ scan, rscan, segscan, rsegscan
  ▪ exscan, exrscan, exsegscan, exrsegscan
• Shuffles:
  ▪ shift, rotate, gather, scatter
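
The scan variants are not defined on the slides; the sketch below gives plausible sequential semantics for a few of them, based on standard scan conventions and on how split (slide 26) uses them, so the exact Copperhead definitions may differ:

    # Hypothetical reference semantics, for illustration only.
    def scan(op, a):
        # Inclusive scan: out[i] = a[0] op a[1] op ... op a[i]
        out, acc = [], None
        for x in a:
            acc = x if acc is None else op(acc, x)
            out.append(acc)
        return out

    def exscan(op, identity, a):
        # Exclusive scan: out[i] = identity op a[0] op ... op a[i-1]
        out, acc = [], identity
        for x in a:
            out.append(acc)
            acc = op(acc, x)
        return out

    def rscan(op, a):
        # Reverse (right-to-left) inclusive scan.
        return scan(op, a[::-1])[::-1]

    def exrscan(op, identity, a):
        # Reverse exclusive scan.
        return exscan(op, identity, a[::-1])[::-1]

    add = lambda a, b: a + b
    print(scan(add, [1, 2, 3, 4]))        # [1, 3, 6, 10]
    print(exscan(add, 0, [1, 2, 3, 4]))   # [0, 1, 3, 6]
    print(exrscan(add, 0, [1, 2, 3, 4]))  # [9, 7, 4, 0]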

20/28 Implementing Copperhead
• The Copperhead compiler is written in Python
• Python provides its own Abstract Syntax Tree
• Type inference, code generation, etc. are done by walking the AST

    def saxpy(a, x, y):
        return map(lambda xi, yi: a*xi + yi, x, y)

    Module(None,
      Stmt(
        Function(None, 'saxpy', ['a', 'x', 'y'], 0, None,
          Stmt(
            Return(
              CallFunc(Name('map'),
                       Lambda(['xi', 'yi'], 0,
                              Add(Mul(Name('a'), Name('xi')), Name('yi'))),
                       Name('x'), Name('y'), None, None))))))
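
The dump above is in the style of Python's old compiler module; with the modern standard-library ast module the same inspection looks like this (an illustrative sketch, not the Copperhead compiler's own code; CallCollector is a made-up example pass):

    import ast

    src = "def saxpy(a, x, y):\n    return map(lambda xi, yi: a*xi + yi, x, y)\n"
    tree = ast.parse(src)
    print(ast.dump(tree))   # the equivalent of the Module(...) dump above

    # A compiler pass is then a NodeVisitor walking the tree, e.g. collecting
    # every function that gets called:
    class CallCollector(ast.NodeVisitor):
        def __init__(self):
            self.called = []
        def visit_Call(self, node):
            if isinstance(node.func, ast.Name):
                self.called.append(node.func.id)
            self.generic_visit(node)

    c = CallCollector()
    c.visit(tree)
    print(c.called)   # ['map']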

21/28 Compiling Copperhead to CUDA
• Every Copperhead function creates at least one CUDA device function
• Top-level Copperhead functions create a CUDA global function, which orchestrates the device function calls
  ▪ The global function takes care of allocating shared memory and returning data (storing it to DRAM)
• Global synchronizations are implemented through multiple phases
• All intermediate arrays & plumbing handled by the Copperhead compiler

22/28 Saxpy Revisited

    def saxpy(a, x, y):
        return map(lambda xi, yi: a*xi + yi, x, y)

    template<typename Num>
    __device__ Num lambda0(Num xi, Num yi, Num a) {
        return ((a * xi) + yi);
    }

    template<typename Num>
    __device__ void saxpy0Dev(Array<Num> x, Array<Num> y, Num a,
                              uint _globalIndex, Num& _returnValueReg) {
        Num _xReg, _yReg;
        if (_globalIndex < x.length) _xReg = x[_globalIndex];
        if (_globalIndex < y.length) _yReg = y[_globalIndex];
        if (_globalIndex < x.length && _globalIndex < y.length)
            _returnValueReg = lambda0<Num>(_xReg, _yReg, a);
    }

    template<typename Num>
    __global__ void saxpy0(Array<Num> x, Array<Num> y, Num a,
                           Array<Num> _returnValue) {
        uint _blockMin = IMUL(blockDim.x, blockIdx.x);
        uint _blockMax = _blockMin + blockDim.x;
        uint _globalIndex = _blockMin + threadIdx.x;
        Num _returnValueReg;
        saxpy0Dev(x, y, a, _globalIndex, _returnValueReg);
        if (_globalIndex < _returnValue.length)
            _returnValue[_globalIndex] = _returnValueReg;
    }

23/28 Phases
(figure: a reduction decomposed into phase 0 and phase 1; a scan decomposed into phase 0, phase 1, and phase 2)
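
Only the figure labels survive in the transcript; as a sequential sketch of the idea (my illustration, not the compiler's output), a reduction that needs a global synchronization splits into two phases: per-block partial results, then a reduction of those partials:

    # Two-phase reduction sketch, illustrative only. On the GPU, the phase
    # boundary is where the global synchronization happens.
    def reduce_two_phase(op, identity, a, block_size):
        # Phase 0: each block reduces its own slice independently.
        partials = []
        for start in range(0, len(a), block_size):
            acc = identity
            for x in a[start:start + block_size]:
                acc = op(acc, x)
            partials.append(acc)
        # ---- phase boundary / global synchronization ----
        # Phase 1: reduce the per-block partial results.
        acc = identity
        for p in partials:
            acc = op(acc, p)
        return acc

    add = lambda a, b: a + b
    print(reduce_two_phase(add, 0, list(range(100)), block_size=16))   # 4950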

24/28 Copperhead to CUDA, cont.
• Compiler schedules computations into phases
  ▪ Right now, this composition is done greedily
  ▪ Compiler tracks global and local availability of all variables and creates a phase boundary when necessary
• Fusing work into phases is important for performance

    B = reduce(map(A))
    D = reduce(map(C))
(figure: these two computations scheduled into phase 0 and phase 1)
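
A sequential sketch of why fusion matters (illustrative, not the compiler's actual transformation): the unfused form materializes the whole mapped array before reducing it, while the fused form makes a single pass with no intermediate array:

    from functools import reduce

    def reduce_map_unfused(op, f, identity, a):
        mapped = [f(x) for x in a]          # intermediate array hits memory
        return reduce(op, mapped, identity)

    def reduce_map_fused(op, f, identity, a):
        acc = identity
        for x in a:                         # one pass, nothing materialized
            acc = op(acc, f(x))
        return acc

    add = lambda a, b: a + b
    square = lambda x: x * x
    data = list(range(10))
    print(reduce_map_unfused(add, square, 0, data))   # 285
    print(reduce_map_fused(add, square, 0, data))     # 285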

25/28 Copperhead to CUDA, cont.
• Shared memory is used only for communicating between threads
  ▪ Caching unpredictable accesses (gather)
  ▪ Accessing elements with a uniform stride (shift & rotate)
• Each device function returns its intermediate results through registers

26/28 Split

    def split(input, value):
        flags = map(lambda a: 1 if a <= value else 0, input)
        notFlags = map(lambda a: not a, flags)
        leftPositions = exscan(lambda a, b: a + b, 0, flags)
        rightPositions = exrscan(lambda a, b: a + b, 0, notFlags)
        positions = map(lambda a, b, flag: a if flag else len(input) - b - 1,
                        leftPositions, rightPositions, flags)
        return scatter(input, positions)

• This code is decomposed into 3 phases
• Copperhead compiler takes care of intermediate variables
• Copperhead compiler uses shared memory for the temporaries used in the scans here
  ▪ Everything else is in registers
(figure: the phase decomposition of split)
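
To see what split computes, here is a hypothetical way to run it as ordinary Python, giving the primitives sequential, list-returning reference implementations (pmap, exscan, exrscan, and scatter below are illustrative stand-ins, not Copperhead's data-parallel versions):

    pmap = lambda f, *arrays: list(map(f, *arrays))

    def exscan(op, identity, a):
        out, acc = [], identity
        for x in a:
            out.append(acc)
            acc = op(acc, x)
        return out

    def exrscan(op, identity, a):
        return exscan(op, identity, a[::-1])[::-1]

    def scatter(a, positions):
        # out[positions[i]] = a[i]
        out = [None] * len(a)
        for x, p in zip(a, positions):
            out[p] = x
        return out

    def split(input, value):
        flags = pmap(lambda a: 1 if a <= value else 0, input)
        notFlags = pmap(lambda a: not a, flags)
        leftPositions = exscan(lambda a, b: a + b, 0, flags)
        rightPositions = exrscan(lambda a, b: a + b, 0, notFlags)
        positions = pmap(lambda a, b, flag: a if flag else len(input) - b - 1,
                         leftPositions, rightPositions, flags)
        return scatter(input, positions)

    # Elements <= 4 are packed to the left, the rest to the right, both in order:
    print(split([5, 1, 7, 2, 8, 3], 4))   # [1, 2, 3, 5, 7, 8]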

27/28 Interpreting to Copperhead
• If the interpreter harvested dynamic type information, it could use the Copperhead compiler as a backend
• Fun project: see what kinds of information could be gleaned from the Python interpreter at runtime to figure out what should be compiled via Copperhead to a manycore chip

28/28 Future Work
• Finish support for the basics
• Compiler transformations
  ▪ Nested data parallelism flattening
    - segmented scans
• Retargetability
  ▪ Threading Building Blocks / OpenMP / OpenCL
• Bridge Python and Copperhead
• Implement real algorithms with Copperhead
  ▪ Vision/Machine Learning, etc.