Presentation is loading. Please wait.

Presentation is loading. Please wait.

Kathy Yelick, Computer Science Division, EECS, University of California, Berkeley Titanium Titanium: A High Performance Language Based on Java Kathy Yelick.

Similar presentations


Presentation on theme: "Kathy Yelick, Computer Science Division, EECS, University of California, Berkeley Titanium Titanium: A High Performance Language Based on Java Kathy Yelick."— Presentation transcript:

1 Kathy Yelick, Computer Science Division, EECS, University of California, Berkeley Titanium Titanium: A High Performance Language Based on Java Kathy Yelick http://titanium.cs.berkeley.edu/ U.C. Berkeley Also the UPC project at LBNL http://upc.nersc.gov

2 Kathy Yelick, Computer Science Division, EECS, University of California, Berkeley Titanium Titanium Group (Past and Present) Susan Graham Katherine Yelick Paul Hilfinger Phillip Colella (LBNL) Alex Aiken Greg Balls Andrew Begel Dan Bonachea Kaushik Datta David Gay Ed Givelberg Arvind Krishnamurthy Ben Liblit Peter McQuorquodale (LBNL) Sabrina Merchant Carleton Miyamoto Chang Sun Lin Geoff Pike Luigi Semenzato (LBNL) Jimmy Su Tong Wen (LBNL) Siu Man Yau (and many undergrad researchers)

3 Kathy Yelick, Computer Science Division, EECS, University of California, Berkeley Titanium Context Most parallel programs are written using explicit parallelism, either: Message passing with a SPMD model Usually for scientific applications with C++/Fortran Scales easily Shared memory with threads in C or Java Usually for non-scientific applications Easier to program, but usually provide less scalable performance Global Address Space Languages take the best of both global address space like threads (programmability) SPMD parallelism like MPI (performance) local/global distinction, i.e., layout matters (performance)

4 Kathy Yelick, Computer Science Division, EECS, University of California, Berkeley Titanium Based on Java, a cleaner C++ classes, automatic memory management, etc. compiled to C and then native binary (no JVM) Scalable parallelism model SPMD with a global address space Optimizing compiler static (compile-time) optimizer, not a JIT communication and memory optimizations synchronization analysis (e.g. static barrier analysis) cache and other uniprocessor optimizations

5 Kathy Yelick, Computer Science Division, EECS, University of California, Berkeley Titanium Summary of Features Added to Java 1.Scalable parallelism (Java threads replaced) 2.Multidimensional arrays with iterators 3.Checked Synchronization 4.Immutable (“value”) classes 5.Operator overloading 6.Templates 7.Zone-based memory management (regions) 8.Libraries for collective communication, distributed arrays, bulk I/O

6 Kathy Yelick, Computer Science Division, EECS, University of California, Berkeley Titanium Immutable Classes in Titanium For small objects, would sometimes prefer to avoid level of indirection pass by value (copying of entire object) especially when immutable -- fields never modified Examples: complex type multiple fields (pressure, velocity, force) in a grid Titanium introduces immutable classes all fields are final (constant) plus compiler implements as above Note: considering extension to allow mutation

7 Kathy Yelick, Computer Science Division, EECS, University of California, Berkeley Titanium Example of Immutable Classes An immutable class has few additions immutable class Complex { Complex () {real=0; imag=0; }... } Use of immutable complex values Complex c1 = new Complex(7.1, 4.3); c1 = c1.add(c1); Addresses performance and programmability Similar to structs in C in terms of performance Adds support for complex types Zero-argument constructor required new keyword Rest unchanged. No assignment to fields outside of constructors.

8 Kathy Yelick, Computer Science Division, EECS, University of California, Berkeley Titanium Operator Overloading Titanium adds operator overloading, important for readability in scientific code Very similar to operator overloading in C++ public Complex operator+(Complex c) { return new Complex(c.real + real, c.imag + imag); } Complex c1 = new Complex(7.1, 4.3); c1 = c1 + c1; Adds to programmability, not performance Must be used judiciously

9 Kathy Yelick, Computer Science Division, EECS, University of California, Berkeley Titanium Templates Many applications use containers: E.g., arrays parameterized by dimensions, element types Java supports this kind of parameterization through inheritance; Java templates based on this as well May only put Object types into containers Inefficient when used extensively Titanium provides a template mechanism closer to that of C++ E.g., can instantiate with “double” or “immutable Complex”

10 Kathy Yelick, Computer Science Division, EECS, University of California, Berkeley Titanium Example of Templates template class Stack {... public Element pop() {...} public void push( Element arrival ) {...} } template Stack list = new template Stack (); list.push( 1 ); int x = list.pop(); Addresses programmability and performance Not an object Strongly typed, No dynamic cast

11 Kathy Yelick, Computer Science Division, EECS, University of California, Berkeley Titanium Multidimensional Arrays Arrays in Java are objects Only 1D arrays directly supported Array bounds are checked Safe but potentially slow Multidimensional arrays as arrays-of-arrays General, but may be slow due to memory layout and difficulty of compiler analysis Hand-coding (array libraries) can confuse optimizer

12 Kathy Yelick, Computer Science Division, EECS, University of California, Berkeley Titanium Multidimensional Arrays in Titanium New kind of multidimensional array added Sub-arrays are supported Indexed by Points (tuple of ints) Very expressive sub-array support, e.g., Can refer to a row or column as a sub-array refer to the boundary region of an array Optimized by the compiler for caches Addresses programmability and performance

13 Kathy Yelick, Computer Science Division, EECS, University of California, Berkeley Titanium Unordered iteration With arrays, Titanium adds unordered iteration Helps compiler with loop analysis Also avoids some indexing details foreach (p within A.domain()) { A[p]... } p is a Point (tuple of ints) that can be used to index arrays Works for any dimension array Provides programmability and performance

14 Kathy Yelick, Computer Science Division, EECS, University of California, Berkeley Titanium Parallelism Model Titanium starts a copy of “main” on each processor (SPMD parallelism) Only major restriction to Java semantics Replaced Java’s thread model Many programs written with more general threads do: for i = 1 to p fork Handling dynamic thread creation on 1000s of processors is difficult Design is purely a performance consideration, dynamic threads are a future direction

15 Kathy Yelick, Computer Science Division, EECS, University of California, Berkeley Titanium Global Address Space Shared address space is partitioned References (pointers) are either local or global (meaning possibly remote) Object heaps are shared Global address space x: 1 y: 2 Program stacks are private l: g: x: 5 y: 6 x: 7 y: 8 p0p1pn

16 Kathy Yelick, Computer Science Division, EECS, University of California, Berkeley Titanium Communication Titanium has explicit global communication: Broadcast, reduction, etc. Primarily used to set up distributed data structures Most communication is implicit through the shared address space Dereferencing a global reference, g.x, can generate communication Arrays have copy operations, which generate bulk communication: A1.copy(A2)

17 Kathy Yelick, Computer Science Division, EECS, University of California, Berkeley Titanium Distributed Data Structures Building distributed arrays: Particle [1d] single [1d] allParticle = new Particle [0:Ti.numProcs-1][1d]; Particle [1d] myParticle = new Particle [0:myParticleCount-1]; allParticle.exchange(myParticle); Now each processor has array of pointers, one to each processor’s chunk of particles P0 P1P2 All to all broadcast

18 Kathy Yelick, Computer Science Division, EECS, University of California, Berkeley Titanium Global Address Space Communication through global address space designed for Productivity: explicit representation of distributed data structures Performance: exploits efficient one-sided communication (remote put/get) when it exists Tunability: shared memory style uses more global dereferences; distributed style uses more array copies

19 Kathy Yelick, Computer Science Division, EECS, University of California, Berkeley Titanium Region-Based Memory Management Extension of Java’s implicit memory management Regions are still “safe”, but can avoid or reduce need for distributed garbage collection PrivateRegion r = new PrivateRegion(); for (int j = 0; j < 10; j++) { int[] x = new ( r ) int[j + 1]; work(j, x); } try { r.delete(); } catch (RegionInUse oops) { System.out.println(“failed to delete”); } Designed for performance

20 Kathy Yelick, Computer Science Division, EECS, University of California, Berkeley Titanium Applications in Titanium Several applications and benchmarks Heart simulation Fluid solvers with Adaptive Mesh Refinement (AMR) Dense linear algebra: LU, MatMul Unstructured mesh kernel: EM3D Finite element benchmark Genetics: micro-array selection Tree-structure n-body code

21 Kathy Yelick, Computer Science Division, EECS, University of California, Berkeley Titanium 3D AMR Gas Dynamics Hyperbolic Solver [McCorquodale and Colella] Implementation of Berger-Colella algorithm Mesh generation algorithm included 2D Example (3D supported) Mach-10 shock on solid surface at oblique angle Future: Self-gravitating gas dynamics package

22 Kathy Yelick, Computer Science Division, EECS, University of California, Berkeley Titanium Immersed Boundary Method [Peskin/MacQueen, Yau] Fibers (e.g., heart muscles) modeled by list of fiber points Fluid space modeled by a regular lattice Irregular fiber lists need to interact with regular fluid lattice Trade-off between load balancing of fibers and minimizing communication memory and communication intensive Uses several parallel numerical kernels Navier-Stokes solver 3-D FFT solver Soon to be enhanced using an adaptive multigrid solver (possibly written in KeLP) Heart Simulation

23 Kathy Yelick, Computer Science Division, EECS, University of California, Berkeley Titanium Productivity Measures Performance Programmability Robustness Portability

24 Kathy Yelick, Computer Science Division, EECS, University of California, Berkeley Titanium Serial Performance (Pure Java) Note the Ti/Java numbers use Java arrays, not Titanium arrays Ti -nobc is with bounds-checking disabled

25 Kathy Yelick, Computer Science Division, EECS, University of California, Berkeley Titanium Parallel Performance and Scalability Poisson solver using “Method of Local Corrections” Communication < 5%; Scaled speedup nearly ideal (flat) IBM SP at SDSC Cray T3E at NERSC

26 Kathy Yelick, Computer Science Division, EECS, University of California, Berkeley Titanium Performance Tuning by Compiler Advantage of compiled languages (Berkeley UPC compiler) Scaled version of GUPS

27 Kathy Yelick, Computer Science Division, EECS, University of California, Berkeley Titanium Programmability Heart simulation developed in ~1 year Extended to support 2D structures for Cochlea model in ~1 month Preliminary code length measures Simple torus model Serial torus code is 17045 lines long (2/3 comments) Parallel Titanium torus version is 3057 lines long. Full heart model Shared memory Fortran heart code is 8187 lines long Parallel Titanium version is 4249 lines long. Need to be analyzed more carefully, but not a significant overhead for distributed memory parallelism

28 Kathy Yelick, Computer Science Division, EECS, University of California, Berkeley Titanium Robustness Robustness is the primary motivation for language “safety” in Java Type-safe, array bounds checked, auto memory management Study on C++ vs. Java from Phipps at Spirus: C++ has 2-3x more bugs per line than Java Java had 30-200% more lines of code per minute Extended in Titanium Checked synchronization avoids barrier deadlocks More abstract array indexing, retains bounds checking No attempt at quantify for Titanium yet Would like to measure speed of error detection (compile time, runtime exceptions, etc.)

29 Kathy Yelick, Computer Science Division, EECS, University of California, Berkeley Titanium Portability Heart code and other applications run anywhere Titanium runs Runs on serial or shared memory machines with native C compiler Including my laptop! Very important for programmer productivity For distributed memory, requires communication layer Alpha/Quadrics, IBM SP, Cray T3E, PC/Myrinet, anything with MPI Global Address Space Networking layer (GASNet) –With C compiler, get Titanium and LBNL/UPC compilers FFTW used in heart code: strategy for performance and portability

30 Kathy Yelick, Computer Science Division, EECS, University of California, Berkeley Titanium Performance and Portability Approach Use machines, not humans for architecture- specific tuning Code generation + search-based selection Can adapt to cache size, # registers, network buffering Used in Signal processing: FFTW, SPIRAL, UHFFT Dense linear algebra: Atlas, PHiPAC Sparse linear algebra: Sparsity Rectangular grid-based computations: Titanium compiler Global communication: Atlas-derivative

31 Kathy Yelick, Computer Science Division, EECS, University of California, Berkeley Titanium Summary Titanium designed for performance and programmability Some compromises (regions, local/global refs) Retains robustness (safety) of Java Also a big help in learning, avoiding certain kinds of bugs Tunability and performance transparency Aggressive automatic optimizations can make this worse Advertising (all open source): Titanium compiler: http://titanium.cs.berkeley.eduhttp://titanium.cs.berkeley.edu Berkeley UPC compiler: http://upc.nersc.govhttp://upc.nersc.gov Automatic tuning: http://www.cs.berkeley.edu/~richie/bebophttp://www.cs.berkeley.edu/~richie/bebop


Download ppt "Kathy Yelick, Computer Science Division, EECS, University of California, Berkeley Titanium Titanium: A High Performance Language Based on Java Kathy Yelick."

Similar presentations


Ads by Google