
Slide 1: Copperhead: A Python-like Data Parallel Language & Compiler
  Bryan Catanzaro, UC Berkeley
  Michael Garland, NVIDIA Research
  Kurt Keutzer, UC Berkeley
  Universal Parallel Computing Research Center, University of California, Berkeley

Slide 2: Intro to CUDA
  - Overview
  - Multicore/Manycore
  - SIMD
  - Programming with millions of threads

Slide 3: The CUDA Programming Model
  - CUDA is a recent programming model, designed for:
    - Manycore architectures
    - Wide SIMD parallelism
    - Scalability
  - CUDA provides:
    - A thread abstraction to deal with SIMD
    - Synchronization & data sharing between small groups of threads
  - CUDA programs are written in C + extensions
  - OpenCL is inspired by CUDA, but HW & SW vendor neutral
    - Programming model essentially identical

Slide 4: Multicore and Manycore
  - Multicore: yoke of oxen
    - Each core optimized for executing a single thread
  - Manycore: flock of chickens
    - Cores optimized for aggregate throughput, deemphasizing individual performance
  [Figure: multicore vs. manycore chip diagrams]

Slide 5: Multicore & Manycore, cont.

  Specification            Core i7 960                         GTX285
  Processing Elements      4 cores, 4-way SIMD @ 3.2 GHz       30 cores, 8-way SIMD @ 1.5 GHz
  Resident Threads (max)   4 cores x 2 threads x 4-wide SIMD   30 cores x 32 SIMD vectors x 32-wide SIMD
                           = 32 strands                        = 30,720 strands
  SP GFLOP/s               102                                 1080
  Memory Bandwidth         25.6 GB/s                           159 GB/s
  Register File            -                                   1.875 MB
  Local Store              -                                   480 kB

  [Figure: Core i7 and GTX285 die photos]

Slide 6: SIMD: Neglected Parallelism
  - It is difficult for a compiler to exploit SIMD
    - How do you deal with sparse data & branches?
    - Many languages (like C) are difficult to vectorize; Fortran is somewhat better
  - Most common solution:
    - Either forget about SIMD
      - Pray the autovectorizer likes you
    - Or instantiate intrinsics (assembly language)
      - Requires a new code version for every SIMD extension

Slide 7: What to do with SIMD?
  - Neglecting SIMD in the future will be more expensive
    - AVX: 8-way SIMD; Larrabee: 16-way SIMD
  - This problem composes with thread-level parallelism
  [Figure: 4-way SIMD vs. 16-way SIMD]

Slide 8: CUDA
  - CUDA addresses this problem by abstracting both SIMD and task parallelism into threads
  - The programmer writes a serial, scalar thread with the intention of launching thousands of threads
  - Being able to launch 1 million threads changes the parallelism problem
    - It's often easier to find 1 million threads than 32: just look at your data & launch a thread per element
  - CUDA is designed for data parallelism
    - Not coincidentally, data parallelism is the only way for most applications to scale to 1000(+)-way parallelism

Slide 9: Hello World

Slide 10: CUDA Summary
  - CUDA is a programming model for manycore processors
    - It abstracts SIMD, making it easy to use wide SIMD vectors
    - It provides good performance on today's GPUs
  - In the near future, CUDA-like approaches will map well to many processors & GPUs
  - CUDA encourages SIMD-friendly, highly scalable algorithm design and implementation

Slide 11: A Parallel Scripting Language
  - What is a scripting language?
    - Lots of opinions on this
    - I'm using an informal definition: a language where performance is happily traded for productivity
    - Weak performance requirement of scalability: "my code should run faster tomorrow"
  - What is the analog of today's scripting languages for manycore?

Slide 12: Data Parallelism
  - Assertion: scaling to 1000 cores requires data parallelism
  - Accordingly, manycore scripting languages will be data parallel
    - They should allow the programmer to express data parallelism naturally
    - They should compose and transform the parallelism to fit target platforms

Slide 13: Warning: Evolving Project
  - Copperhead is still in embryo
  - We can compile a few small programs
  - Lots more work to be done in both language definition and code generation
  - Feedback is encouraged

Slide 14: Copperhead = Cu + python
  - Copperhead is a subset of Python, designed for data parallelism
  - Why Python?
    - Extant, well-accepted high-level scripting language
      - Free simulator(!!)
    - Already understands things like map and reduce
    - Comes with a parser & lexer
  - The current Copperhead compiler takes a subset of Python and produces CUDA code
    - Copperhead is not CUDA specific, but the current compiler is
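Since plain Python already understands map and reduce, ordinary Python can serve as the sequential "free simulator" for Copperhead-style code. A minimal sketch of that idea, using only standard Python built-ins (illustrative, not code from the talk):

  # Sequential semantics of the two core primitives, in plain Python.
  from functools import reduce
  import operator

  a = 2.0
  x = [1.0, 2.0, 3.0, 4.0]
  y = [10.0, 20.0, 30.0, 40.0]

  # map applies a function elementwise...
  saxpy = list(map(lambda xi, yi: a * xi + yi, x, y))   # [12.0, 24.0, 36.0, 48.0]

  # ...and reduce combines elements with a binary operator.
  total = reduce(operator.add, saxpy, 0.0)              # 120.0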

Slide 15: Copperhead is not Pure Python
  - Copperhead is not for arbitrary Python code
    - Most features of Python are unsupported
  - Copperhead is compiled, not interpreted
  - Connecting Python code & Copperhead code will require binding the programs together, similar to Python-C interaction
  - Copperhead is statically typed
  [Figure: Venn diagram showing Copperhead as a subset of Python]

Slide 16: Saxpy: Hello world
  Some things to notice:
  - Types are implicit
    - The Copperhead compiler uses a Hindley-Milner type system with typeclasses, similar to Haskell
    - Typeclasses are fully resolved in CUDA via C++ templates
  - Functional programming:
    - map, lambda (or equivalent in list comprehensions)
    - You can pass functions around to other functions
    - Closure: the variable 'a' is free in the lambda function, but bound to the 'a' in its enclosing scope

  def saxpy(a, x, y):
      return map(lambda xi, yi: a*xi + yi, x, y)
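For readers less used to the functional style, the same saxpy can be written with a list comprehension, the "equivalent in list comprehensions" the slide mentions. This restatement in plain Python is my own illustration, not code from the talk:

  def saxpy_comprehension(a, x, y):
      # 'a' is a free variable inside the comprehension, closed over from the
      # enclosing scope: the same closure behavior as in the lambda form.
      return [a * xi + yi for xi, yi in zip(x, y)]

  print(saxpy_comprehension(2.0, [1.0, 2.0], [10.0, 20.0]))   # [12.0, 24.0]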

Slide 17: Type Inference, cont.
  - Copperhead includes function templates for intrinsics like add, subtract, map, scan, gather
  - Expressions are matched against these templates
  - Every variable starts out with a unique generic type; types are then resolved by union-find on the abstract syntax tree
  - Tuple and function types are also inferred
  [Figure: inferring c = a + b against the template (+) : (Num0, Num0) -> Num0;
   the fresh type variables A145, A207, A52 unify to a single Num type]
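A rough illustration of the union-find approach (a simplified sketch of my own, not the actual Copperhead implementation): every variable gets a fresh type variable, unification merges equivalence classes, and a concrete type becomes the representative of its class.

  # Toy union-find type unification. The names A145, A207, A52 follow the
  # slide's figure; everything else is a simplification for illustration.
  class TypeVar:
      def __init__(self, name, concrete=None):
          self.name, self.concrete, self.parent = name, concrete, self

  def find(t):
      while t.parent is not t:
          t.parent = t.parent.parent   # path halving
          t = t.parent
      return t

  def unify(t1, t2):
      r1, r2 = find(t1), find(t2)
      if r1 is r2:
          return
      if r1.concrete and r2.concrete and r1.concrete != r2.concrete:
          raise TypeError("cannot unify %s with %s" % (r1.concrete, r2.concrete))
      if r1.concrete:
          r2.parent = r1   # keep the concrete type as the representative
      else:
          r1.parent = r2

  # c = a + b, matched against the template (+) : (Num0, Num0) -> Num0
  a, b, c = TypeVar("A145"), TypeVar("A207"), TypeVar("A52")
  num = TypeVar("Num0", concrete="Num")
  for t in (a, b, c):
      unify(t, num)
  print(find(c).concrete)   # Num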

Slide 18: Data parallelism
  - Copperhead computations are organized around data parallel arrays
  - map performs a "forall" for each element in an array
    - Accesses must be local
  - Accessing non-local elements is done explicitly
    - shift, rotate, or gather
  - No side effects allowed
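For intuition about the non-local access primitives, here are plain-Python stand-ins. The argument orders and boundary handling are my guesses for illustration; the real Copperhead signatures may differ:

  def gather(data, indices):
      # Pull elements from arbitrary positions.
      return [data[i] for i in indices]

  def rotate(data, amount):
      # Cyclic shift by 'amount' positions.
      return data[amount:] + data[:amount]

  def shift(data, amount, boundary):
      # Shift by 'amount' positions, filling vacated slots with 'boundary'.
      if amount >= 0:
          return data[amount:] + [boundary] * amount
      return [boundary] * (-amount) + data[:amount]

  xs = [0, 10, 20, 30]
  print(gather(xs, [3, 0, 0]))   # [30, 0, 0]
  print(rotate(xs, 1))           # [10, 20, 30, 0]
  print(shift(xs, 1, -1))        # [10, 20, 30, -1]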

Slide 19: Copperhead primitives
  - map
  - reduce
  - Scans:
    - scan, rscan, segscan, rsegscan
    - exscan, exrscan, exsegscan, exrsegscan
  - Shuffles:
    - shift, rotate, gather, scatter
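To make the scan vocabulary concrete, here is a sequential sketch of what the inclusive, exclusive, and reverse variants compute, based on the usual naming convention (the segmented variants additionally restart at segment boundaries). This is an illustrative reading, not the Copperhead source:

  def scan(f, xs):                 # inclusive scan
      out, acc = [], None
      for i, x in enumerate(xs):
          acc = x if i == 0 else f(acc, x)
          out.append(acc)
      return out

  def exscan(f, identity, xs):     # exclusive scan: element i excluded from result i
      out, acc = [], identity
      for x in xs:
          out.append(acc)
          acc = f(acc, x)
      return out

  def rscan(f, xs):                # reverse (right-to-left) inclusive scan
      return scan(f, xs[::-1])[::-1]

  add = lambda a, b: a + b
  print(scan(add, [1, 2, 3, 4]))        # [1, 3, 6, 10]
  print(exscan(add, 0, [1, 2, 3, 4]))   # [0, 1, 3, 6]
  print(rscan(add, [1, 2, 3, 4]))       # [10, 9, 7, 4]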

Slide 20: Implementing Copperhead
  - The Copperhead compiler is written in Python
  - Python provides its own Abstract Syntax Tree
  - Type inference, code generation, etc. are done by walking the AST

  def saxpy(a, x, y):
      return map(lambda xi, yi: a*xi + yi, x, y)

  Module(None,
    Stmt(
      Function(None, 'saxpy', ['a', 'x', 'y'], 0, None,
        Stmt(
          Return(
            CallFunc(
              Name('map'),
              Lambda(['xi', 'yi'], 0,
                Add(Mul(Name('a'), Name('xi')), Name('yi'))),
              Name('x'), Name('y'),
              None, None))))))
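The dump above matches the AST produced by Python 2's old compiler module; in current Python the standard ast module plays the same role. A minimal sketch of parsing and walking the saxpy source (not the Copperhead compiler itself):

  import ast

  source = ("def saxpy(a, x, y):\n"
            "    return map(lambda xi, yi: a*xi + yi, x, y)\n")

  tree = ast.parse(source)
  print(ast.dump(tree))          # textual tree, analogous to the slide's dump

  # Walking the tree: collect every function called by name.
  class CallCollector(ast.NodeVisitor):
      def __init__(self):
          self.calls = []
      def visit_Call(self, node):
          if isinstance(node.func, ast.Name):
              self.calls.append(node.func.id)
          self.generic_visit(node)

  collector = CallCollector()
  collector.visit(tree)
  print(collector.calls)         # ['map']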

Slide 21: Compiling Copperhead to CUDA
  - Every Copperhead function creates at least one CUDA device function
  - Top-level Copperhead functions create a CUDA global function, which orchestrates the device function calls
    - The global function takes care of allocating shared memory and returning data (storing it to DRAM)
  - Global synchronizations are implemented through multiple phases
  - All intermediate arrays & plumbing are handled by the Copperhead compiler

Slide 22: Saxpy Revisited

  template<typename Num>
  __device__ Num lambda0(Num xi, Num yi, Num a) {
    return ((a * xi) + yi);
  }

  template<typename Num>
  __device__ void saxpy0Dev(Array<Num> x, Array<Num> y, Num a,
                            uint _globalIndex, Num& _returnValueReg) {
    Num _xReg, _yReg;
    if (_globalIndex < x.length) _xReg = x[_globalIndex];
    if (_globalIndex < y.length) _yReg = y[_globalIndex];
    if (_globalIndex < x.length)
      _returnValueReg = lambda0<Num>(_xReg, _yReg, a);
  }

  template<typename Num>
  __global__ void saxpy0(Array<Num> x, Array<Num> y, Num a, Array<Num> _returnValue) {
    uint _blockMin = IMUL(blockDim.x, blockIdx.x);
    uint _blockMax = _blockMin + blockDim.x;
    uint _globalIndex = _blockMin + threadIdx.x;
    Num _returnValueReg;
    saxpy0Dev<Num>(x, y, a, _globalIndex, _returnValueReg);
    if (_globalIndex < _returnValue.length)
      _returnValue[_globalIndex] = _returnValueReg;
  }

  def saxpy(a, x, y):
      return map(lambda xi, yi: a*xi + yi, x, y)

Slide 23: Phases
  [Figure: a tree reduction executes in two phases (phase 0, phase 1);
   a scan executes in three phases (phase 0, phase 1, phase 2)]

Slide 24: Copperhead to CUDA, cont.
  - The compiler schedules computations into phases
    - Right now, this composition is done greedily
  - The compiler tracks global and local availability of all variables and creates a phase boundary when necessary
  - Fusing work into phases is important for performance

  B = reduce(map(A))
  D = reduce(map(C))
  [Figure: the two statements above scheduled across phase 0 and phase 1]
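As a rough illustration of the greedy scheduling idea, here is a toy model of my own (not the Copperhead compiler): elementwise results stay locally available within a phase, a reduction's result only becomes globally available after a phase boundary, and an operation that needs such a result forces the boundary. Real reductions themselves span multiple phases, as on the previous slide; that detail is omitted here.

  def schedule(ops):
      """ops: list of (output, kind, inputs), kind is 'map' or 'reduce'."""
      phases, current = [], []
      pending_global = set()   # reduction results produced in the current phase
      for out, kind, inputs in ops:
          # Using a value that is not yet globally available requires a
          # global synchronization first: close the current phase.
          if any(i in pending_global for i in inputs):
              phases.append(current)
              current, pending_global = [], set()
          current.append((out, kind, inputs))
          if kind == 'reduce':
              pending_global.add(out)
      if current:
          phases.append(current)
      return phases

  # B = reduce(map(A)); D = reduce(map(C)); E consumes B and D.
  ops = [('tmp0', 'map', ['A']), ('B', 'reduce', ['tmp0']),
         ('tmp1', 'map', ['C']), ('D', 'reduce', ['tmp1']),
         ('E', 'map', ['B', 'D'])]
  for n, phase in enumerate(schedule(ops)):
      print('phase', n, [op[0] for op in phase])
  # phase 0 ['tmp0', 'B', 'tmp1', 'D']
  # phase 1 ['E']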

Slide 25: Copperhead to CUDA, cont.
  - Shared memory is used only for communicating between threads
    - Caching unpredictable accesses (gather)
    - Accessing elements with a uniform stride (shift & rotate)
  - Each device function returns its intermediate results through registers

Slide 26: Split
  - This code is decomposed into 3 phases
  - The Copperhead compiler takes care of intermediate variables
  - The Copperhead compiler uses shared memory for the temporaries used in the scans here
  - Everything else is in registers

  def split(input, value):
      flags = map(lambda a: 1 if a <= value else 0, input)
      notFlags = map(lambda a: not a, flags)
      leftPositions = exscan(lambda a, b: a + b, 0, flags)
      rightPositions = exrscan(lambda a, b: a + b, 0, notFlags)
      positions = map(lambda a, b, flag: a if flag else len(input) - b - 1,
                      leftPositions, rightPositions, flags)
      return scatter(input, positions)

  [Margin annotations on the slide mark which of phases 0-2 each line runs in]
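To see what split computes, here is a pure-Python emulation that runs the function above sequentially. The stand-ins for exscan, exrscan, and scatter are my guesses at the reference semantics, for illustration only:

  def exscan(f, identity, xs):          # exclusive left-to-right scan
      out, acc = [], identity
      for x in xs:
          out.append(acc)
          acc = f(acc, x)
      return out

  def exrscan(f, identity, xs):         # exclusive right-to-left scan
      return exscan(f, identity, xs[::-1])[::-1]

  def scatter(data, positions):         # out[positions[i]] = data[i]
      out = [None] * len(data)
      for d, p in zip(data, positions):
          out[p] = d
      return out

  def split(input, value):
      flags = list(map(lambda a: 1 if a <= value else 0, input))
      notFlags = list(map(lambda a: not a, flags))
      leftPositions = exscan(lambda a, b: a + b, 0, flags)
      rightPositions = exrscan(lambda a, b: a + b, 0, notFlags)
      positions = list(map(lambda a, b, flag: a if flag else len(input) - b - 1,
                           leftPositions, rightPositions, flags))
      return scatter(input, positions)

  print(split([5, 1, 8, 3, 9, 2], 4))   # [1, 3, 2, 5, 8, 9]: elements <= 4 first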

Slide 27: Interpreting to Copperhead
  - If the interpreter harvested dynamic type information, it could use the Copperhead compiler as a backend
  - Fun project: see what kinds of information could be gleaned from the Python interpreter at runtime to figure out what should be compiled via Copperhead to a manycore chip

Slide 28: Future Work
  - Finish support for the basics
  - Compiler transformations
    - Nested data parallelism flattening
      - Segmented scans
  - Retargetability
    - Thread Building Blocks / OpenMP / OpenCL
  - Bridge Python and Copperhead
  - Implement real algorithms with Copperhead
    - Vision / Machine Learning, etc.

