1 Developing Scalable Parallel Applications in X10 David Cunningham, David Grove, Vijay Saraswat, and Olivier Tardieu IBM Research Based on material from previous X10 Tutorials by Bard Bloom, Evelyn Duesterwald, Steve Fink, Nate Nystrom, Igor Peshansky, Christoph von Praun, and Vivek Sarkar Additional contributions from Anju Kambadur, Josh Milthorpe and Juemin Zhang This material is based upon work supported in part by the Defense Advanced Research Projects Agency under its Agreement No. HR0011-07-9-0002. Please see x10-lang.org for the most up-to-date version of these slides and sample programs.
2 Tutorial Objectives Learn to develop scalable parallel applications using X10 – X10 language and class library fundamentals – Effective patterns of concurrency and distribution – Case studies: draw material from real working code – Lab sessions: learn by doing Exposure to Asynchronous PGAS programming model – Concepts broader than any single language Updates on X10 project and community – Highlight what is new in X10 since SC’10 – Gather feedback from you about X10
3 X10 Tutorial Outline X10 Overview X10 Fundamentals (~60 minutes) X10 Workstealing (~20 minutes) X10 on GPUs (~40 minutes) X10 Tools (~20 minutes) Lab Session I LUNCH Four Case Studies (approximately 30 minutes each) LU Factorization Global Load Balancing Framework Global Matrix Library Selections from ANUChem Summary (~15 minutes) Lab Session II (~45 minutes)
4 X10: Performance and Productivity at Scale An evolution of Java for concurrency, scale out, and heterogeneity The X10 language provides: – Ability to specify fine-grained concurrency – Ability to distribute computation across large-scale clusters – Ability to represent heterogeneity at the language level – Simple programming model for computation offload – Modern OO language features (build libraries/frameworks) – Support for writing deadlock-free and determinate code – Interoperability with Java – Interoperability with native libraries (e.g., BLAS)
5 X10 Target Environments High-end large clusters – BlueGene/P – Power7IH – x86 + IB, Power + IB – Goal: deliver scalable performance competitive with C+MPI Medium-scale commodity systems – ~100 nodes (~1000 cores and ~1 terabyte main memory) – Scale-out environments, but MTBF is days not minutes – Programs that run in minutes/hours at this scale – Goal: deliver main-memory performance with a simple programming model (accessible to Java programmers) Developer laptops – Linux, Mac, Windows. Eclipse-based IDE, debugger, etc. – Goal: support developer productivity
6 X10 Programming Model Asynchronous PGAS (Partitioned Global Address Space) – Global address space, but with explicit locality – Activities can be dynamically created and joined Abstracts hardware details, while still allowing the programmer to express locality and concurrency. [Figure: a row of places, each with its own address space and parallel activities (threads/processes); memory references may cross place boundaries]
7 X10 APGAS Constructs Two central ideas: Places and Asynchrony
Fine-grained concurrency: async S
Atomicity: atomic S; when (c) S
Place-shifting operations: at (p) S
Ordering: finish S; clocked, next
Language support for replication of values across places: distributed object model (replicate by default), GlobalRef, global data structures (DistArray)
8 Hello Whole World

import x10.io.Console;

class HelloWholeWorld {
  public static def main(Array[String]) {
    finish for (p in Place.places()) {
      async at (p) Console.OUT.println("Hello World from place " + p.id);
    }
  }
}

(%1) x10c++ -o HelloWholeWorld -O HelloWholeWorld.x10
(%2) runx10 -n 4 HelloWholeWorld
Hello World from place 0
Hello World from place 2
Hello World from place 3
Hello World from place 1
(%3)
9 Sequential MontyPi

import x10.io.Console;
import x10.util.Random;

class MontyPi {
  public static def main(args:Array[String](1)) {
    val N = Int.parse(args(0));
    val r = new Random();
    var result:Double = 0;
    for (1..N) {
      val x = r.nextDouble();
      val y = r.nextDouble();
      if (x*x + y*y <= 1) result++;
    }
    val pi = 4*result/N;
    Console.OUT.println("The value of pi is " + pi);
  }
}
10 Concurrent MontyPi

import x10.io.Console;
import x10.util.Random;

class MontyPi {
  public static def main(args:Array[String](1)) {
    val N = Int.parse(args(0));
    val P = Int.parse(args(1));
    val result = new Cell[Double](0);
    finish for (1..P) async {
      val r = new Random();
      var myResult:Double = 0;
      for (1..(N/P)) {
        val x = r.nextDouble();
        val y = r.nextDouble();
        if (x*x + y*y <= 1) myResult++;
      }
      atomic result() += myResult;
    }
    val pi = 4*(result())/N;
    Console.OUT.println("The value of pi is " + pi);
  }
}
11 Distributed MontyPi

import x10.io.Console;
import x10.util.Random;

class MontyPi {
  public static def main(args:Array[String](1)) {
    val N = Int.parse(args(0));
    val result = GlobalRef[Cell[Double]](new Cell[Double](0));
    finish for (p in Place.places()) at (p) async {
      val r = new Random();
      var myResult:Double = 0;
      for (1..(N/Place.MAX_PLACES)) {
        val x = r.nextDouble();
        val y = r.nextDouble();
        if (x*x + y*y <= 1) myResult++;
      }
      val ans = myResult;
      at (result.home) atomic result()() += ans;
    }
    val pi = 4*(result()())/N;
    Console.OUT.println("The value of pi is " + pi);
  }
}
12 X10 Project X10 is an open source project (http://x10-lang.org) – Released under Eclipse Public License (EPL) v1 – Current Release: 2.2.1 (September 2011) X10 Implementations – C++ based (“Native X10”) Multi-process (one place per process; multi-node) Linux, AIX, MacOS, Cygwin, BlueGene/P x86, x86_64, PowerPC – JVM based (“Managed X10”) Multi-process (one place per JVM process; multi-node) – Limitation on Windows to single process (single place) Runs on any Java 5/Java 6 JVM X10DT (X10 IDE) available for Windows, Linux, Mac OS X – Based on Eclipse 3.6 – Supports many core development tasks including remote build/execute facilities
13 X10 Compilation Flow [Pipeline diagram]
Front-end: X10 Source → Parsing / Type Check → X10 AST → AST Optimizations → AST Lowering
C++ back-end (Native X10): C++ Code Generation → C++ Source → C++ Compiler → Native Code, running in the native environment
Java back-end (Managed X10): Java Code Generation → Java Source → Java Compiler → Bytecode, running on Java VMs
Runtime layers: XRC (native) / XRJ (Java) / XRX over X10RT; Managed X10 reaches X10RT via JNI
14 Overview of Language Features
Many sequential features of Java inherited unchanged: classes (w/ single inheritance), interfaces (w/ multiple inheritance), instance and static fields, constructors, (static) initializers, overloaded, overridable methods, garbage collection, familiar control structures, etc.
New sequential features: structs, closures, multi-dimensional & distributed arrays
Substantial extensions to the type system: dependent types, generic types, function types, type definitions, inference
Concurrency: fine-grained concurrency: async S; atomicity and synchronization: atomic S, when (c) S; ordering: finish S
Distribution: GlobalRef[T], at (p) S
15 Classes Single inheritance, multiple interfaces. Static and instance methods. Static val fields. Instance fields may be val or var. Values of class types may be null. Heap allocated.
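A minimal sketch pulling these features together (the class and its members are illustrative, not from the tutorial):

class Counter {
  static val MAX = 100;                 // static val field
  val name:String;                      // immutable instance field (val)
  var count:Int = 0;                    // mutable instance field (var)
  def this(name:String) { this.name = name; }
  def inc() { if (count < MAX) count++; }
  static def make() { return new Counter("default"); } // heap allocation
}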
16 Structs User-defined primitives: no inheritance, may implement interfaces, all fields are val, all methods are final. Allocated “inline” in the containing object, array or variable; headerless.

struct Complex {
  val real:Double;
  val img:Double;
  def this(r:Double, i:Double) {
    real = r; img = i;
  }
  def operator + (that:Complex) {
    return Complex(real + that.real, img + that.img);
  }
  ....
}

val x = new Array[Complex](1..10)
17 Function Types (T1, T2, ..., Tn) => U is the type of functions that take arguments of types Ti and return a U. If f: (T) => U and x: T, then the invocation f(x) has type U.
Function types can be used as an interface: define the apply operator with the appropriate signature: operator this(x:T): U
Closures are first-class functions: (x:T): U => e
Used in array initializers: new Array[int](5, (i:int) => i*i) yields the array [ 0, 1, 4, 9, 16 ]
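For instance, a class can be used wherever a function type is expected by supplying the apply operator; a hedged sketch (the class is illustrative):

// Scale behaves as a value of type (Int) => Int
class Scale implements (Int) => Int {
  val k:Int;
  def this(k:Int) { this.k = k; }
  public operator this(x:Int):Int { return k * x; }
}
// usage: val f:(Int)=>Int = new Scale(3);  f(14) evaluates to 42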
18 Generic Types Classes, structs, and interfaces may have type parameters: class ArrayList[T] defines a type constructor ArrayList and a family of types ArrayList[Int], ArrayList[String], ArrayList[C], ... ArrayList[C] behaves as if the ArrayList class were copied and C substituted for T. Can instantiate on any type, including primitives (e.g., Int).

public final class ArrayList[T] {
  private var data:Array[T];
  private var ptr:int = 0;
  public def this(n:int) {
    data = new Array[T](n);
  }
  ....
  public def addLast(elem:T) {
    if (ptr == data.size) { ...grow data... }
    data(ptr++) = elem;
  }
  public def get(idx:int) {
    if (idx < 0 || idx >= ptr) { throw ... }
    return data(idx);
  }
}
19 Dependent Types Classes and structs may have properties (public val instance fields): class Region(rank:Int, zeroBased:Boolean, rect:Boolean) {... }
Properties can be constrained with a boolean expression: Region{rank==3} is the type of all regions with rank 3; Array[Int]{region==R} is the type of all arrays defined over region R (R must be a constant or a final variable in scope at the type).
Dependent types are checked statically. They are used to statically check locality properties, and heavily in the Array library to check invariants. The dependent type system is extensible, and dependent types enable optimization (compile-time constant propagation of properties).
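As a small illustration (the method is hypothetical, following the Array constraints shown above), a dot product whose argument types statically guarantee matching regions:

// b must be a rank-1 array over the same region as a;
// mismatches are rejected at compile time.
def dot(a:Array[Double](1), b:Array[Double]{region==a.region}):Double {
  var sum:Double = 0;
  for (p in a.region) sum += a(p) * b(p);
  return sum;
}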
20 Type Inference Field and local variable types are inferred from the initializer type: val x = 1; gives x type int{self==1}; val y = (1..2)*(1..10); gives y type Region{rank==2,rect}
Method return types are inferred from the method body: def m() {... return true... return false... } gives m return type Boolean
Loop index types are inferred from the region: given R: Region{rank==2}, in for (p in R) {... } p has type Point{rank==2}
Programming Tip: It is often important not to declare the types of local vals; this enables type inference to compute the most precise possible type.
21 async
async S creates a new child activity that executes statement S and returns immediately. S may reference variables in enclosing blocks. Activities cannot be named, aborted, or cancelled.
Stmt ::= async Stmt (cf. Cilk’s spawn)

// Compute the Fibonacci
// sequence in parallel.
def fib(n:Int):Int {
  if (n < 2) return 1;
  val f1:Int;
  val f2:Int;
  finish {
    async f1 = fib(n-1);
    f2 = fib(n-2);
  }
  return f1+f2;
}
22 finish
finish S executes S, but waits until all (transitively) spawned asyncs have terminated.
Rooted exception model: traps all exceptions thrown by spawned activities, and throws an (aggregate) exception if any spawned async terminates abruptly. There is an implicit finish at the main activity.
finish is useful for expressing “synchronous” operations on (local or) remote data.
Stmt ::= finish Stmt (cf. Cilk’s sync)

// Compute the Fibonacci
// sequence in parallel.
def fib(n:Int):Int {
  if (n < 2) return 1;
  val f1:Int;
  val f2:Int;
  finish {
    async f1 = fib(n-1);
    f2 = fib(n-2);
  }
  return f1+f2;
}
23 atomic
atomic S executes statement S atomically. Atomic blocks are conceptually executed in a serialized order with respect to all other atomic blocks: isolation and weak atomicity.
An atomic block body (S) must be nonblocking, must not create concurrent activities (sequential), and must not access remote data (local); restrictions are checked dynamically.
Stmt ::= atomic Statement
MethodModifier ::= atomic

// push data onto concurrent
// list-stack
val node = new Node(data);
atomic {
  node.next = head;
  head = node;
}

// target defined in lexically
// enclosing scope.
atomic def CAS(old:Object, n:Object) {
  if (target.equals(old)) {
    target = n;
    return true;
  }
  return false;
}
24 when
when (E) S suspends the activity until a state in which the guard E is true; in that state, S is executed atomically and in isolation.
The guard E is a boolean expression that must be nonblocking, must not create concurrent activities (sequential), must not access remote data (local), and must not have side-effects (const).
Stmt ::= WhenStmt
WhenStmt ::= when ( Expr ) Stmt | WhenStmt or (Expr) Stmt

class OneBuffer {
  var datum:Object = null;
  var filled:Boolean = false;
  def send(v:Object) {
    when ( !filled ) {
      datum = v;
      filled = true;
    }
  }
  def receive():Object {
    when ( filled ) {
      val v = datum;
      datum = null;
      filled = false;
      return v;
    }
  }
}
25 at
at (p) S executes statement S at place p. The current activity is blocked until S completes. There is a deep copy of the local object graph to the target place; the variables referenced from S define the roots of the graph to be copied. GlobalRef[T] is used to create remote pointers across at boundaries.
Stmt ::= at (p) Stmt

class C {
  var x:int;
  def this(n:int) { x = n; }
}

// Increment remote counter
def inc(c:GlobalRef[C]) {
  at (c.home) c().x++;
}

// Create GR of C
static def make(init:int) {
  val c = new C(init);
  return GlobalRef[C](c);
}
26 Distributed Object Model Objects live in a single place and can only be accessed in the place where they live. The object graph reachable from the statements in an “at” will be deeply copied from the source to the destination place, and a new isomorphic graph created. Instance fields of an object may be declared transient to break deep copy. GlobalRef[T] enables cross-place pointers to be created; a GlobalRef can be freely copied, but the referent can only be accessed via the apply at its home place.

public struct GlobalRef[T](home:Place){T<:Object} {
  public native def this(t:T):GlobalRef{self.home==here};
  public native operator this(){here == this.home}:T;
}
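A small sketch of the transient escape hatch (the class and field names are illustrative): a transient field is not serialized by at, and arrives at the destination with its default value.

class Mesh {
  val cells:Array[Double](1);                   // deep-copied when captured by an 'at'
  transient var scratch:Array[Double] = null;   // NOT copied; null at the destination place
  def this(cells:Array[Double](1)) { this.cells = cells; }
}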
27 Summary Whirlwind tour through the X10 language. Additional details will be introduced via case studies. See http://x10-lang.org for the specification, samples, etc. Next topic: alternate implementations of async/finish in the X10 runtime and their implications for programmers. Questions so far?
28 X10 Runtime
One process per place; choice of inter-process communication library (x10rt): standalone (shared mem), sockets (tcp/ip), mpi, PAMI
Pool of worker threads per place; the runtime scheduler maps asyncs to threads (many to 1)
Choice of two schedulers: fork-join scheduler (default, ~ Java Fork-Join) and work-stealing scheduler (work in progress, ~ Cilk); to use, compile with “x10c++ -WORK-STEALING”
29 X10 Schedulers
Each worker maintains a queue of pending jobs. On async p; q…: FJ: the worker pushes p, runs q… (job = async); WS: the worker pushes q…, runs p (job = continuation)
An idle worker pops a job from its own queue if any, and steals a job from a random other worker if its queue is empty (spin).
Queues are double-ended: push and pop from one end, steal from the other end => no contention for statically balanced workloads, less contention if queues are non-empty
30 Blocking Constructs
finish typically does not block the worker: WS runs all subtasks first; FJ runs a subtask from the queue instead of blocking. finish may still block because of thieves or remote asyncs, e.g., worker 1 reaches the end of the finish block while worker 2 is still running an async.
Context switch (blocking finish, when): FJ: OS-level context switch; the worker thread suspends and another worker thread is woken. WS: runtime-level context switch; the runtime saves the worker’s state and assigns a new job to the worker.
31 Cost Analysis
async: queue operations (fences, at worst CAS)
finish (single place): synchronized counter (FJ: always, WS: if >1 worker)
FJ relies on helper objects for finish and async: overhead of object creation + heap allocation + GC
WS relies on compiler transformation (CPS): sequential overhead (ex: 1.3x for QuickSort), steal overhead (~ context switch)
32 Pros and Cons
WS: + much more scalable with fine-grain concurrency; + fixed-size thread pool; - sequential overhead
FJ: - requires coarser tasks; - makes load balancing harder; + more “obvious” performance model; + more mature implementation (for now)
33 X10 Tutorial Outline X10 Overview X10 on GPUs CUDA via APGAS K-Means example & demo X10 Tools Lab Session I LUNCH Case Studies Summary Lab Session II
34 Why Program GPUs with X10? Why program the GPU at all? Many times faster for certain classes of applications; hardware is cheap and widely available (as opposed to, e.g., FPGAs). Why write X10 (instead of CUDA/OpenCL)? Use the same programming model (APGAS) on GPU and CPU; easier to write parallel/distributed GPU-aware programs; better type safety; higher-level abstractions => fewer lines of code
35 X10/GPU Programming Experience
GPU programming will never be as easy as CPU programming; we try to make it as easy as possible.
Correctness: can debug X10 kernel code on the CPU first, using standard techniques. Static errors avoid certain classes of faults. Goal: eliminate all GPU segfaults with negligible overhead. Currently detect all dereferences of non-local data (place errors). TODO: static array bounds checking. TODO: static null-pointer checking.
Performance: need an understanding of the CUDA performance model. Must know the advantages and limitations of [registers, SHM, global memory]. Avoid warp divergence; avoid irregular/misaligned memory access. Use the CUDA profiling tool to debug kernel performance (very easy and usable). Can inspect and disassemble the generated cubin file. Easier to tune blocks/threads using auto-configuration.
36 Why target CUDA and not OpenCL? In early 2009, OpenCL support was limited. CUDA is based on C++, OpenCL on C; C++ features help us implement X10 features (e.g. generics). By targeting CUDA we can re-use parts of the existing C++ backend. CUDA is evolving fast, and is more interesting for high-level languages (e.g. OOP). However, there are advantages to OpenCL too... we could support it in future.
37 APGAS model ↔ GPU model
at ↔ GPU / host divide
async ↔ blocks, threads
finish ↔ kernel invocation, DMAs
clocks ↔ __syncthreads()
Java-like sequential constructs ↔ shared mem, constant mem, local mem
38 GPU / Host divide [Figure: hosts connected by a high-performance network; each host attached to CUDA GPUs over the PCI bus; each GPU memory space is a 'place' in the APGAS model]
Goals: re-use existing language concepts wherever possible. An independent GPU memory space ⇒ a new place. Place count and topology are unknown until run-time; the same X10 code works on different run-time configurations. Could use the same approach for OpenCL, Cell, FPGA, …
39 X10/CUDA code (mass sqrt example)

for (host in Place.places()) at (host) {
  val init = new Array[Float](1000, (i:Int)=>i as Float);
  val recv = new Array[Float](1000);
  for (accel in here.children().values()) {    // discover GPUs
    val remote = CUDAUtilities.makeRemoteArray[Float](accel, 1000, (Int)=>0.0f); // alloc on GPU
    val num_blocks = 8, num_threads = 64;
    finish async at (accel) @CUDA {            // GPU code
      finish for ([block] in 0..(num_blocks-1)) async {
        clocked finish for ([thread] in 0..(num_threads-1)) clocked async {
          val tid = block*num_threads + thread;
          val tids = num_blocks*num_threads;
          for (var i:Int=tid ; i<1000 ; i+=tids) {
            remote(i) = Math.sqrtf(init(i));   // implicit capture and transfer to GPU
          }
        }
      }
    }
    // Console.OUT.println(remote(42));        // static type error: remote data
    finish Array.asyncCopy(remote, 0, recv, 0, recv.size); // copy result to host
    Console.OUT.println(recv(42));
  }
}
40 CUDA threads as APGAS activities

__global__ void kernel (...) {... }

kernel<<<blocks, threads>>>(...);
41 CUDA threads as APGAS activities

finish async at (p) @CUDA {                     // CUDA kernel
  finish for ([block] in 0..(blocks-1)) async { // CUDA block
    clocked finish for ([thread] in 0..(threads-1)) clocked async { // CUDA thread
      val tid = block*threads + thread;
      val tids = blocks*threads;
      for (var i:Int=tid ; i<len ; i+=tids) {
        remote(i) = Math.sqrtf(init(i));
      }
    }
  }
}

Enforces language restrictions to define the kernel shape: only sequential code, only primitive types / arrays, no allocation, no dynamic dispatch.
42 Dynamic Shared Memory & barrier

finish async at (p) @CUDA {
  finish for (block in 0..(blocks-1)) async {
    val shm = new Array[Int](threads, (Int)=>0); // compiled to CUDA 'shared' memory;
                                                 // initialised on GPU (once per block)
    clocked finish for (thread in 0..(threads-1)) clocked async {
      shm(thread) = f(thread);
      Clock.advanceAll();                        // CUDA barrier represented with clock op
      val tmp = shm((thread+1) % threads);
    }
  }
}

General philosophy: use existing X10 constructs to express CUDA semantics; the APGAS model is sufficient.
43 Constant cache memory

val arr = new Array[Int](threads, (Int)=>0);
finish async at (p) @CUDA {
  val cmem : Sequence[Int] = arr.sequence();   // compiled to CUDA 'constant' cache;
                                               // automatically uploaded to GPU
                                               // (just before kernel invocation)
  finish for ([block] in 0..(blocks-1)) async {
    clocked finish for ([thread] in 0..(threads-1)) clocked async {
      val tmp = cmem(42);
    }
  }
}

('Sequence' is an immutable array.) General philosophy: quite similar to 'shared' memory; again, the APGAS model is sufficient.
44 Blocks/Threads auto-config

finish async at (p) @CUDA {
  val blocks = CUDAUtilities.autoBlocks(),
      threads = CUDAUtilities.autoThreads();
  finish for ([block] in 0..(blocks-1)) async {
    clocked finish for ([thread] in 0..(threads-1)) clocked async {
      // kernel cannot assume any particular
      // values for 'blocks' and 'threads'
    }
  }
}

Maximise 'occupancy' at run-time according to a simple heuristic. If called on the CPU, autoBlocks() == 1 and autoThreads() == 8.
45 KMeansCUDA kernel

finish async at (gpu) @CUDA {
  val blocks = CUDAUtilities.autoBlocks(),
      threads = CUDAUtilities.autoThreads();
  finish for ([block] in 0..(blocks-1)) async {
    val clustercache = new Array[Float](num_clusters*dims, clusters_copy); // cache clusters in SHM;
                                                  // inputs dragged from host each iteration
    clocked finish for ([thread] in 0..(threads-1)) clocked async {
      val tid = block * threads + thread;
      val tids = blocks * threads;
      for (var p:Int=tid ; p<num_local_points ; p+=tids) { // strip-mine outer loop
        var closest:Int = -1;
        var closest_dist:Float = Float.MAX_VALUE;
        @Unroll(20) for ([k] in 0..(num_clusters-1)) {     // unroll this loop
          var dist : Float = 0;
          for ([d] in 0..(dims-1)) {
            // SOA form to allow regular accesses; padded for alignment
            // (stride includes padding)
            val tmp = gpu_points(p+d*num_local_points_stride) - clustercache(k*dims+d);
            dist += tmp * tmp;
          }
          if (dist < closest_dist) {
            closest_dist = dist;
            closest = k;
          }
        }
        gpu_nearest(p) = closest;  // allocated earlier (via CUDAUtilities)
      }
    }
  }
}

The above kernel is only part of the story: the kernel is run once per iteration of Lloyd's algorithm, gpu_nearest is copied to the CPU when done, and the rest of the algorithm runs faster on the CPU!
46 K-Means graphical demo Map of USA (can scroll / zoom) 300 Million points (2 dimensional space) Data auto-generated from zip-code population data Town pop. normally distributed around town centers Use K-Means to find 50 Centroids for USA population Possible use case: Delivery distribution warehouses Visualization uses GL Runs on the GPU Completely independent from CUDA GPUs can be used for graphics too! :) Can compare K-Means performance on CPU and GPU
47 K-Means End-to-End Performance
[Charts: colours show scaling up to 4 Tesla GPUs; y-axis gives Gflops normalized w.r.t. a native CUDA implementation (single GPU, host); x-axis: hosts x GPUs (per host); 2M/4M points, K=100/400, 4 dimensions. Higher K ⇒ more GPU work ⇒ better scaling.]
Analysis: performance is sensitive to many factors: fragile g++ optimisations in the CPU code, X10/CUDA DMA bandwidth (GPUs share the PCI bus), Infiniband for host-to-host communication (also shares the PCI bus), and fragile nvcc register allocation.
Outstanding DMA issue (~40% bandwidth loss): CUDA requires X10 CPU objects to be allocated with cudaMalloc, which is hard to integrate with existing libraries / GC; currently staging DMAs through a pre-allocated buffer... similar complaints by users on CUDA forums.
Options for further improving performance: use multicore for the CPU parts; hide DMAs.
48 Future Work
Support texture cache, with explicit allocation: val t = Texture.make(...); at (gpu) { … t(x, y) … }
The Fermi architecture presents obvious opportunities for supporting X10: indirect branches ⇒ object orientation and exceptions; NVidia's goal is the same as ours: allow high-level GPU programming. Also: atomic operations, GL interoperability.
OpenCL: no conceptual hurdles; different code-gen and runtime work; the X10 programming experience should be very similar.
Memory allocation on GPU: the latest CUDA has a 'malloc' call we might use; a GPU GC cycle between kernel invocations is a research project in itself.
49 X10 Tutorial Outline X10 Overview X10 on GPUs X10 Tools Overview Demo of X10DT and X10 Debugger Lab Session I LUNCH Case Studies Summary Lab Session II
50 X10DT: Integrated Development Tools for the X10 Programmer
A modern development environment for productive parallel programming.
Editing: source navigation, syntax highlighting, parsing errors, folding, hyperlinking, outline & quick outline, hover help, content assist, type hierarchy, format, search, call graph, quick fixes
Building: Java/C++ back-end support, remote target configuration
Launching: local and remote launch
Debugging: integrated parallel X10 debugging
Help: X10 programmer guide, X10DT usage guide
51 Editing
X10 Editor plus a set of common IDE services: source navigation, syntax highlighting, parsing errors, folding, hyperlinking, outline & quick outline, hover help, content assist
More services in progress: type hierarchy view, format, search, call graph view, quick fixes
52 Editing Services [Screenshot: outline view, source folding, hover help, syntax highlighting]
53 Editing Services (cont.) Content-Assist (Ctrl-space)
54 Editing Services (cont.) Early parsing errors
55 Editing Services (cont.) Hyperlinking (Ctrl + mouse click) opens a new window
56 X10DT Builders Support both X10 back-ends: Java and C++. Ability to switch back-ends from the “Configure” context menu in the Package Explorer view.
57 X10 Builders (2) – C++ back-end Communication interfaces: Sockets, MPI, Parallel Environment, IBM LoadLeveler. Local vs remote (SSH) builds.
58 X10 Builders (3) – C++ back-end Ability to customize compilation and linking for C++ generated files. Different sanity checks occur: SSH connection, compiler version, path existence, etc. Global validation available.
59 X10 Launcher “Run As” shortcuts for the currently selected back-end. Dedicated X10 launch configuration.
60 X10 Launcher (2) Possibility to tune communication interface options (e.g., number of processes) at launch time.
61 X10 Launcher (3) An X10 run with the C++ back-end switches automatically to the PTP Parallel Runtime perspective for better monitoring.
62 Debugging (C++ backend) Extends the Parallel Debugger in the Eclipse Parallel Tools Platform (PTP) with support for X10 source level debugging Breakpoints, stepping, variable views, stack views
63 Debugging (C++ backend) Breakpoints Variable View Execution controls (resume, terminate, step into, step over, step return)
64 Help Part 1: X10 Language Help Various topics of X10 language help Help -> Help Contents
65 Help: Part 2: X10DT Help Various help topics for installing and using X10DT Help -> Help Contents
66 Demo
67 Lab Session I Objectives – Install X10/X10DT on your laptop – Local build/launch of HelloWholeWorld – Optional remote build/launch of HelloWholeWorld – Explore code in provided tutorial workspace After lunch – X10 case studies – Summary – Lab Session II
68 X10 Tutorial Outline X10 Overview X10 on GPUs X10 Tools Lab Session I LUNCH Case Studies Summary Lab Session II
69 Case Study Outline LU Factorization Common communication/concurrency patterns in X10 Teams (collective operations) Global Load Balancing Framework Usage of finish/async/at Design/usage of a generic framework Global Matrix Library Building a scalable distributed data structure in X10 API design for end-user productivity Internal communication/concurrency patterns ANUChem Extending X10’s Array library Exploitation of X10’s constrained types for performance Use of distributed “collecting finish” idiom for scalability All code available in the X10DT workspace; feel free to follow along…
70 LU Factorization
SPMD implementation: one top-level async per place (finish ateach (p in Dist.makeUnique())); globally synchronous progress; collectives: barriers, broadcast, reductions (row, column, WORLD); point-to-point RDMAs for row swaps; uses BLAS for local computations
HPL-like algorithm: 2-d block-cyclic data distribution; right-looking variant with row partial pivoting (no look-ahead); recursive panel factorization with pivot search and column broadcast combined
71 Key Features
Distributed data structures: PlaceLocalHandle, RemoteArray
Communications: Array.asyncCopy, Teams
Optimizations: finish pragmas, C bindings
72 PlaceLocalHandle
PlaceLocalHandle: a global handle to per-place objects. Example:

val dist = Dist.makeUnique();
val constructor = ()=>new Array[Double](N);
val buffers = PlaceLocalHandle.make(dist, constructor);

This allocates a buffer array in each place and returns a global handle to the collection of buffers. In each place, buffers() evaluates to the local buffer.
73 RemoteArray and asyncCopy
RemoteArray: a global reference to a fixed local array. Example:

val myBuffer = buffers();
val globalRefToMyBuffer = new RemoteArray(myBuffer);

Array.asyncCopy: asynchronous copy from a source array to a preallocated destination array (source or dest is local); uses RDMAs if available and avoids local data copies.
74 Example

def exchange(row1:Int, row2:Int, min:Int, max:Int, dest:Int) {
  val sourceBuffer = new RemoteArray(buffers());
  val size = matrix().getRow(row1, min, max, buffers()); // pack
  finish {
    at (Place(dest)) async { // go to dest
      finish { // copy from source to dest
        Array.asyncCopy[Double](sourceBuffer, 0, buffers(), 0, size);
      } // wait for copy 1
      matrix().swapRow(row2, min, max, buffers()); // swap
      // copy from dest to source
      Array.asyncCopy[Double](buffers(), 0, sourceBuffer, 0, size);
    }
  } // wait for copies 1 and 2
  matrix().setRow(row1, min, max, buffers()); // store
}
75 Teams
X10 Teams are like MPI communicators: predefined Team.WORLD; constructor: val team01 = new Team([Place(0), Place(1)]); split method:

colRole = here.id / py;
rowRole = here.id % py;
col = Team.WORLD.split(here.id, rowRole, colRole);

Operations: barrier, broadcast, indexOfMax, allreduce, …

col.indexOfMax(colRole, Math.abs(max), id);
col.addReduce(colRole, max, Team.MAX);
76 C Bindings

@NativeCPPInclude("essl_natives.h")       // include .h in code generated from X10
@NativeCPPCompilationUnit("essl_natives.cc") // link with .cc
class LU {
  @NativeCPPExtern
  native static def blockTriSolve(me:Array[Double], diag:Array[Double], B:Int):void;
  @NativeCPPExtern
  native static def blockBackSolve(me:Array[Double], diag:Array[Double], B:Int):void;
  @NativeCPPExtern
  native static def blockMulSub(me:Array[Double], left:Array[Double], upper:Array[Double], B:Int):void;
  ...
}

Mapping is by function names and argument positions; supports primitive types and arrays of primitives.
77 Finish Pragmas
Optimize specific finish/async patterns:

@Pragma(Pragma.FINISH_HERE) finish {
  at (Place(dest)) async {
    @Pragma(Pragma.FINISH_ASYNC) finish {
      Array.asyncCopy[Double](sourceBuffer, 0, buffers(), 0, size);
    }
    matrix().swapRow(row2, min, max, buffers()); // swap
    Array.asyncCopy[Double](buffers(), 0, sourceBuffer, 0, size);
  }
}

@Pragma(Pragma.FINISH_ASYNC) // unique async
@Pragma(Pragma.FINISH_HERE)  // last async to terminate is local
@Pragma(Pragma.FINISH_LOCAL) // no remote async
@Pragma(Pragma.FINISH_SPMD)  // no nested remote async
78 Global Load Balancing
79 Work-stealing in distributed memory
Autonomous stealing is not possible: need help from the NIC or the victim.
Sparsely distributed work is expensive to find: network congestion, wasted CPU cycles.
Non-trivial to detect termination: users need to implement their own.
Well-studied, but hard to implement: one task is stolen at a time; shallow trees necessitate repeated stealing; requires aggregation of steals.
80 GLB
The main activity spawns an activity per place within a finish. Augment work-stealing with work-distributing:
(Thief) Each activity processes work; when no work is left, it steals at most w times.
(Thief) If no work is found, it visits at most z lifelines in turn; each donor records that the thief needs work. If still no work is found, the activity dies.
(Donor) Pushes work to recorded thieves when it has some.
When all activities die, the main activity continues to completion.
Caveat: this scheme assumes that work is relocatable. The loop each activity runs is sketched below.
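A hedged sketch of that per-place loop (helper names are illustrative, not the framework API):

def run() {
  while (true) {
    while (!stack.isEmpty()) {
      process(stack.pop());
      distributeToRecordedThieves();       // donor side: push work to registered thieves
    }
    var found:Boolean = false;
    for (1..w) {                           // thief side: at most w random steal attempts
      found = attemptRandomSteal();
      if (found) break;
    }
    if (!found) found = visitLifelines(z); // then at most z lifelines, registering on each
    if (!found) return;                    // die; a donor may push work here later
  }
}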
81 GLB: state transitions [State diagram: states computing, stealing, distributing, dead] Can enter X10 to handle asynchronous requests.
82 GLB: sample execution [Animation: number of steal attempts = 2, number of lifelines = 2]
83 GLB: lifeline graphs
The structure determines how soon work is propagated and how much overhead is incurred. Ring network: P/2 hops on average, low overhead.
Desirable properties: each vertex must have bounded out-degree; connected; low diameter.
84 Lifeline graph: cyclic hyper-cubes
P = number of processors. Choose h and z such that P <= h^z (example: P=8, h=3, z=2). Node numbers are written in base h; each node has at most z lifelines.
[Figure: 8 nodes labelled 00, 01, 02, 10, 11, 12, 20, 21 connected in a cyclic hypercube]
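Concretely, the j-th outgoing lifeline of node p can be computed by incrementing the j-th base-h digit of p modulo h; a hedged sketch (assumes P == h^z so every resulting id is a valid node):

def lifeline(p:Int, h:Int, j:Int):Int {
  var pow:Int = 1;
  for (1..j) pow *= h;                        // h^j, the weight of digit j
  val digit = (p / pow) % h;                  // current value of digit j
  return p - digit*pow + ((digit+1) % h)*pow; // digit j incremented mod h
}
// node p's lifelines are lifeline(p,h,0), ..., lifeline(p,h,z-1)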
85 UTS: Results [performance charts]
86 UTS: Results: 1 attempt to steal before resorting to lifelines
87 UTS: Results: 22% more time stealing; 83 attempts to steal before resorting to lifelines
88 UTS: Results: 87% efficiency on ANL’s BG/P
89 UTS: Results: 86% efficiency on Power7 with IB cluster
90 UTS: Results: 92% efficiency on IBM’s BG/P
91 UTS: Results: 89% efficiency on IBM’s Power7 with IB
92 GLB Summary
GLB combines stealing and distributing; it scales with high efficiency to 2048 processors. X10’s UTS was ~350 lines of actual code.
Future work: analytical framework for lifeline graphs; implement adaptive stealing; scaling out finish to 100K processors; hierarchical UTS with multiple threads per node.
93 GLB: framework
TaskFrame[T1, T2] is the user interface for lifeline-based GLB. T1: type of the work packet; T2: return type of the evaluation function.

static final class Fib2 extends TaskFrame[UInt, UInt] {
  public def runTask(t:UInt, s:Stack[UInt]):void offers UInt {
    if (t < 20u) offer fib(t);
    else {
      s.push(t-1);
      s.push(t-2);
    }
  }
  public def runRootTask(t:UInt, s:Stack[UInt]!):void offers UInt {
    runTask(t, s);
  }
}
94 Global Matrix Library
95 Global Matrix Library (x10.gml) Object-oriented library implementing double-precision, blocked, sparse and dense matrices, distributed across multiple places Each kind of matrix implemented in a different class, all implement Matrix operations. Operations include cell-wise operations (addition, multiplication etc), matrix multiplication (using the SUMMA algorithm), Euclidean distance, etc. Presents standard sequential interface (internal parallelism and distribution). Written in X10, with native callouts to BLAS Runs on Managed Backend (multiple JVMs connected with sockets) Runs on Native Backend (on sockets/Ethernet or MPI/sockets, Infiniband).
96 Global Matrix Library: PageRank

X10:
def step(G:…, P:…, GP:…, dupGP:…, UP:…, U:…, E:…) {
  for (1..iteration) {
    GP.mult(G,P).scale(alpha).copyInto(dupGP); // broadcast
    P.local().mult(E, UP.mult(U,P.local())).scale(1-alpha).cellAdd(dupGP.local()).sync(); // broadcast
  }
}

DML:
while (I < max_iteration) {
  p = alpha*(G%*%p)+(1-alpha)*(e%*%u%*%p);
}

The programmer determines the distribution of the data: G is block-distributed per row; P is replicated. The programmer writes sequential, distribution-aware code and performs the usual optimizations, e.g., allocating big data structures outside the loop. Could equally well have been in Java.
97 PageRank performance 40 cores (8 places, 5 workers)
98 Using GML: Gaussian Non-Negative Matrix Factorization
Key kernel for topic modeling. Involves factoring a large (D x W) matrix: D ~ 100M; W ~ 100K, but sparse (0.001). Iterative algorithm, involving distributed sparse matrix multiplication and cell-wise matrix operations.

X10:
for (1..iteration) {
  H.cellMult(WV.transMult(W,V,tW).cellDiv(WWH.mult(WW.transMult(W,W),H)));
  W.cellMult(VH.multTrans(V,H).cellDiv(WHH.mult(W,HH.multTrans(H,H))));
}

The key decision is the representation for the matrix, and its distribution. (Note: app code is polymorphic in this choice.) [Figure: V and W distributed across places P0…Pn; H replicated at each place]
99 GNNMF Performance 40 cores (8 places, 5 workers)
100 Linear Regression

X10:
for (1..iteration) {
  d_p.sync();
  d_q.transMult(V, Vp.mult(V, d_p), d_q2, false);
  q.cellAdd(p1.scale(lambda));
  alpha = norm_r2 / pq.transMult(p, q)(0, 0);
  p.copyTo(p1);
  w.cellAdd(p1.scale(alpha));
  old_norm_r2 = norm_r2;
  r.cellAdd(q.scale(alpha));
  norm_r2 = r.norm(r);
  beta = norm_r2/old_norm_r2;
  p.scale(beta).cellSub(r);
}

DML:
while(i < max_iteration) {
  q = ((t(V) %*% (V %*% p)) + eps * p);
  alpha = norm_r2 / castAsScalar(t(p) %*% q);
  w = w + alpha * p;
  old_norm_r2 = norm_r2;
  r = r + alpha * q;
  norm_r2 = sum(r * r);
  beta = norm_r2 / old_norm_r2;
  p = -r + beta * p;
  i = i + 1;
}
101 Linear regression 40 cores (8 places, 5 workers)
102 Getting the GML Code Entire GML library, example programs, and build/run scripts included in your tutorial workspace Also available open source under EPL at x10-lang.org Currently only via svn (x10.gml project in x10 svn) Packaged release and formal announcement coming soon
103 ANU Chem
104 ANUChem Computational chemistry codes written in X10 – FMM: electrostatic calculation using the Fast Multipole method (included in tutorial workspace) – PME: electrostatic calculation using the Smooth Particle Mesh Ewald method – Pumja Rasaayani: energy calculation using the Hartree-Fock SCF method Developed by Josh Milthorpe and Ganesh Venkateshwara at ANU – http://cs.anu.edu.au/~Josh.Milthorpe/anuchem.html – Open source (EPL); tracks X10 releases – Around 16,000 lines of X10 code – Overview and initial results in [Milthorpe et al IPDPS 2011] Code under active development; significant performance and scalability improvements in the 2.2.1 release vs. the results in the paper
105 Goals for Case Study Deeper understanding of X10 Arrays and DistArrays – Extending with application-specific behavior – Using X10’s constrained types to enable static checking and optimization – DistArray as example of efficient distributed data structure Usage of PlaceLocalHandles Usage of transient fields and caching Concurrency/distribution in FMM; use of collecting finish
106 Array and DistArray
A Point is an element of an n-dimensional Cartesian space (n>=1) with integer-valued coordinates, e.g., [5], [1, 2], …
A Region is an ordered set of points, all of the same rank.
An Array maps every Point in its defining Region to a corresponding data element. Key operations: creation; apply/set (read/write elements); map, reduce, scan; iteration, enhanced for loops; copy.
A Dist (distribution) maps every Point in its Region to a Place.
A DistArray (distributed array) is defined over a Dist and maps every Point in its distribution to a corresponding data element. Data elements in the DistArray may only be accessed in their home location (the place to which the Dist maps their Point). Key operations are similar to Array, but bulk operations (map, reduce, etc.) are distributed operations.
With the exception of Array literals ( [1, 2, 3] === new Array[int](3,(i:int)=>i+1) ), Array and DistArray are implemented completely as library classes in the x10.array package with no special compiler support.
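A hedged sketch of the DistArray API just described (X10 2.2-style calls; the initializer and reduction operator are illustrative):

val R = (0..99) * (0..99);                  // rank-2 region
val D = Dist.makeBlock(R);                  // block-distribute R over all places
val A = DistArray.make[Double](D, (p:Point(2)) => 1.0); // init runs at each element's home
val sum = A.reduce((a:Double, b:Double) => a + b, 0.0); // distributed bulk operation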
107 DistArrays in Detail We’ll examine the code of DistArray, Dist, BlockDist – LocalStorage accessed by PlaceLocalHandle – Use of transient fields to control serialization – Indexing operations Key idea: Dist.offset is an abstract method that subclasses must implement that is responsible for mapping Points in the global index space into offsets in local storage – Bulk operations (eg reduce) implemented internally using distributed collecting finish
108 Extending DistArray: MortonDist
MortonDist (3D) is used in FMM to define a DistArray that distributes octree boxes (the key data structure) to X10 places. See au.edu.anu.mm.MortonDist.x10.
[Figures: ordering of Points in a 2D MortonDist of 0..7*0..7; a 2D MortonDist of 0..7*0..7 with 5 Places]
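The heart of a Morton (Z-order) mapping is bit interleaving; a hedged sketch for 3D box indices (not the actual MortonDist code):

// Interleave the low 10 bits of x, y, z into a single Morton index.
def mortonIndex(x:Int, y:Int, z:Int):Int {
  var idx:Int = 0;
  for (var b:Int = 0; b < 10; b++) {
    idx |= ((x >> b) & 1) << (3*b + 2);
    idx |= ((y >> b) & 1) << (3*b + 1);
    idx |= ((z >> b) & 1) << (3*b);
  }
  return idx;
}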
109 Extending DistArray: PeriodicDist
Enables modular (“wrap-around”) indexing at each Place. Any Dist can be made periodic by encapsulating it in a PeriodicDist, which can then be used in a DistArray to get modular indexing. See x10.array.PeriodicDist.
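The essence of the periodic wrapper is a modular offset computation; a hedged sketch (the helper is illustrative):

// Map any integer index onto [lo..hi] with wrap-around,
// handling indices below lo as well.
def wrap(i:Int, lo:Int, hi:Int):Int {
  val size = hi - lo + 1;
  return lo + (((i - lo) % size) + size) % size;
}
// wrap(8, 0, 7) == 0;  wrap(-1, 0, 7) == 7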
110 Arrays in Detail
We’ll examine the implementation of x10.array.Array.
Array indexing: specialized implementations based on properties, optimized for the common case of Regions of rank<5 that are zero-based and rectangular (no holes in the index space). The general case is fully supported with minimal time overhead on indexing (space usage based on the bounding box of the Region). The constructor does computations to optimize later indexing. If bounds-checks are disabled, Native X10 indexing performance is identical to the equivalent C array.
Array iteration: the compiler optimizes iteration over rectangular Regions into counted for loops. Non-rectangular regions are fully supported, but use the Point-based Region iterator API (significant performance overhead).
111 FMM: ExpansionRegion
In FMM, a specialized region is used to represent the triangular-shaped region of multipole expansions. This enables error-checking and cleaner code when the relevant Arrays are defined over an ExpansionRegion.
Indexing performance is good, but in a few key loop nests “hand-inlining” of the ExpansionRegion iterator is required for maximal performance. Eventually this will be handled by X10 compiler improvements (well-known optimizations are sufficient, but not yet implemented). See au.edu.anu.mm.ExpansionRegion and Expansion.
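For reference, the triangular iteration space is the set of (l, m) pairs with 0 <= l <= p and -l <= m <= l; a hand-inlined loop nest looks roughly like this (a sketch, assuming that index structure):

// Equivalent to iterating an ExpansionRegion of order p,
// without the generic Point-based iterator.
for (var l:Int = 0; l <= p; l++) {
  for (var m:Int = -l; m <= l; m++) {
    sum += expansion(l, m);   // illustrative element access
  }
}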
112 FMM: Nested Collecting Finish The FMM “downward pass” is implemented using an elegant pattern of nested distributed collecting finishes – Clean, easy to read code – Good scalability (tested to 512 places on BG/P) Code to examine – au.edu.anu.mm.Fmm3d.downwardPass – au.edu.anu.mm.FmmBox.downward – au.edu.anu.mm.FmmLeafBox.downward
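The collecting-finish idiom itself looks roughly like this: a reducer is attached to a finish, and activities contribute values with offer (a hedged sketch assuming the x10.lang.Reducible interface; the reducer and per-place computation are illustrative):

struct SumReducer implements Reducible[Double] {
  public def zero():Double { return 0.0; }
  public operator this(a:Double, b:Double):Double { return a + b; }
}

val total:Double = finish (SumReducer()) {
  for (p in Place.places()) async at (p) {
    offer computeLocalContribution();  // hypothetical per-place work
  }
};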
113 X10 Tutorial Outline X10 Overview X10 on GPUs X10 Tools Lab Session I LUNCH Case Studies Summary Lab Session II
114 Programmer Resources x10-lang.org – Language specification – Online tutorials – X10-related classes and research papers – Performance Tuning Tips – FAQ Sample X10 programs – Small examples in X10 distribution and tutorials – X10 Benchmarks: optional download with X10 release – X10 Global Matrix Library – Other Applications ANUChem: http://cs.anu.edu.au/~Josh.Milthorpe/anuchem.htmlhttp://cs.anu.edu.au/~Josh.Milthorpe/anuchem.html Raytracer and OpenGL KMeans programs Your application here…
115 X10 Project Road Map Upcoming Releases – X10 2.2.2 mid-December 2011 Bug fixes and small improvements – X10 2.3.0 first half of 2012 Java interoperability Mixed-mode execution (single program running in mix of Native and Managed places) Will be backwards compatible with X10 2.2.x, but may include language enhancements for determinacy
116 X10 Community Join the X10 community at x10-lang.org! x10-users mailing list List of upcoming events, papers, classes, users, etc. X10/X10DT releases: http://x10-lang.org/software/download-x10 Try X10! Want to hear your experiences and feedback! Very interested in publicizing X10 applications and benchmarks that can be shared with the community.
117 Questions and Discussion Questions? Discussion? Next up: Lab Session II…
118 Lab Session II Objectives – Work on a more substantial X10 coding project of interest to you – Some suggested projects Explore GlobalMatrixLibrary sample programs; write some new code using GML Try writing a new client of GlobalLoadBalancing Try to parallelize (and then distribute) your favorite recursive divide and conquer algorithm Explore using X10DT by modifying sample programs Try coding up your favorite problem in X10…