Languages and Compilers for High Performance Computing Kathy Yelick EECS Department U.C. Berkeley.


1 Languages and Compilers for High Performance Computing Kathy Yelick EECS Department U.C. Berkeley

2 Research Problems in High Performance
Organ Simulation: Domain-Specific Tools
BeBOP: Architecture-Specific Optimization (with Demmel)
Titanium: Language for Parallel Scientific Computing (with Graham, Hilfinger)

3 Titanium
Language for grid-based scientific computing
Based on Java (but compiled)
Extensions:
–Multidimensional arrays with iterators
–Immutable (“value”) classes
–Templates
–Operator overloading
–Checked Synchronization
–Zone-based memory management

4 Is High Performance Java an Oxymoron?

5 Parallel Dependence Analysis: Cycle Detection
First, find potential race conditions
–If none, then use traditional sequential analysis
–Analysis of shared/private data can help
Code defines a “program order” on accesses
P is the union of these across processors
Memory system defines an “access order”
A is access order (read/write & write/write pairs)
Avoid reordering along edges of a cycle
–Intuition: time cannot flow backwards.
[Figure: one processor performs "write data" then "write flag"; another performs "read flag" then "read data"]
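The cycle rule on this slide can be illustrated with a small graph search. The sketch below is a simplified illustration, not the analysis used in the Titanium compiler: it treats program order P as directed edges, access order A as undirected conflict edges, and reports every program-order edge that lies on a cycle in P ∪ A. The function name `delay_edges` and the access labels are hypothetical.

```python
from collections import defaultdict

def delay_edges(program_order, conflicts):
    """Return the program-order edges that lie on a cycle in P union A.

    program_order: list of (earlier, later) access pairs on one processor (P).
    conflicts: list of unordered access pairs touching the same location,
               with at least one write (A). Treated as bidirectional.
    Assumption: this is a simplified sketch of cycle detection.
    """
    succ = defaultdict(set)
    for u, v in program_order:
        succ[u].add(v)            # directed: u happens before v in the code
    for u, v in conflicts:
        succ[u].add(v)            # undirected: the memory system may order
        succ[v].add(u)            # the two accesses either way

    def reachable(src, dst):
        # Iterative depth-first search from src looking for dst.
        seen, stack = {src}, [src]
        while stack:
            node = stack.pop()
            if node == dst:
                return True
            for nxt in succ[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return False

    # An edge (u, v) is on a cycle if v can get back to u.
    return [(u, v) for u, v in program_order if reachable(v, u)]

# The flag/data example from the slide.
P = [("write data", "write flag"), ("read flag", "read data")]
A = [("write data", "read data"), ("write flag", "read flag")]
must_not_reorder = delay_edges(P, A)
```

In the flag/data example both program-order edges sit on the cycle, so neither processor's pair of accesses may be reordered. If the second processor only read the flag, no cycle would exist and the writes could be reordered freely.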

6 Parallel Control Analysis: Synchronization
Given a program P, determine which segments of P could run in parallel.
–Match barriers (single analysis in Titanium)
–Match synchronized regions
Both analyses can be used to:
–Detect bugs (race conditions)
–Enable optimizations: prefetching, split-phase memory, loop transformations, scheduling, …

7 Titanium Research Problems
Designed for block-structured grids; add support for unstructured grids.
Optimizations for local memory hierarchies (more on this later)
Design of low-cost communication layers for read/write
Add communication optimizations
See the project web page:

8 Performance Tuning
Motivation: performance of many applications dominated by a few kernels
Heart simulation → Navier-Stokes
–Sparse matrix-vector multiply (Multigrid)
–Fast Fourier Transforms
Information retrieval → LSI, LDA
–Sparse matrix-vector multiply
Image processing → filtering, segmentation
–Sorting/Histograms, Cosine transform, Sparse matrix-vector multiply
Many other examples

9 Architectural Trends
[Figure: Processor-Memory Performance Gap, 1980 to 2000. Axes: Performance (log scale, 1 to 1000) vs. Year. CPU curve (“Moore’s Law”): µProc 60%/yr. DRAM curve: 7%/yr. The gap grows 50% / year.]
A cache miss is O(100) cycles
Getting worse every year
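The 50% per year figure follows from the two growth rates quoted on the slide. A minimal sketch of the arithmetic, assuming both rates compound annually (the function names are illustrative, not from the talk):

```python
def annual_gap_growth(cpu_rate=0.60, dram_rate=0.07):
    """Yearly growth of the CPU/DRAM performance ratio.

    If CPU performance multiplies by 1.60 each year and DRAM by 1.07,
    their ratio multiplies by 1.60 / 1.07 each year.
    """
    return (1 + cpu_rate) / (1 + dram_rate) - 1

def cumulative_gap(years, cpu_rate=0.60, dram_rate=0.07):
    """Ratio of CPU to DRAM performance after `years`, starting from parity."""
    return ((1 + cpu_rate) / (1 + dram_rate)) ** years

growth = annual_gap_growth()   # roughly 0.495, i.e. about 50% per year
```

Because the ratio compounds, the gap widens geometrically rather than linearly, which is why the slide describes it as getting worse every year.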

10 Conventional Performance Tuning
Vendor or user hand tunes kernels
Drawbacks:
–Very time consuming and difficult work
–Even with intimate knowledge of architecture and compiler, performance hard to predict
–Must be redone for every architecture, compiler
–Not just a compiler problem:
  Best algorithm may depend on input, so some tuning must occur at run-time.
  Multiple algorithms for the same problem may not be provably equivalent by program analysis

11 Automatic Performance Tuning
Approach: for each kernel
1. Identify and generate a space of algorithms
2. Search for the fastest one, by running them
3. Constrain search space using performance models
What is a space of algorithms?
–Depends on kernel and input
–May vary
  instruction mix and order
  memory access patterns
  data structures
  mathematical formulation
Search both off-line and on-the-fly
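Steps 1 and 2 of the approach can be sketched as an empirical search over one tuning parameter. The example below is only an illustration under stated assumptions: the "space of algorithms" is a blocked matrix multiply parameterized by block size, the candidate block sizes are arbitrary, and the names `matmul_blocked` and `autotune` are hypothetical. It is not the PHiPAC or BeBOP generator, and it omits step 3 (pruning with a performance model).

```python
import time

def matmul_blocked(a, b, n, block):
    """Multiply two n x n matrices (lists of lists) using square blocking."""
    c = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, block):
        for kk in range(0, n, block):
            for jj in range(0, n, block):
                for i in range(ii, min(ii + block, n)):
                    row_a, row_c = a[i], c[i]
                    for k in range(kk, min(kk + block, n)):
                        aik, row_b = row_a[k], b[k]
                        for j in range(jj, min(jj + block, n)):
                            row_c[j] += aik * row_b[j]
    return c

def autotune(n=24, candidates=(2, 4, 8, 24), repeats=2):
    """Time every candidate block size and return (fastest, all timings)."""
    a = [[float(i + j) for j in range(n)] for i in range(n)]
    b = [[float(i - j) for j in range(n)] for i in range(n)]
    timings = {}
    for block in candidates:
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            matmul_blocked(a, b, n, block)
            best = min(best, time.perf_counter() - start)
        timings[block] = best          # keep the best of several runs
    return min(timings, key=timings.get), timings
```

The winning block size depends on the machine and on n, which is the point of searching by running the variants rather than predicting the result. An interpreted sketch like this shows the search structure only; the cache effects that make blocking pay off appear in compiled kernels.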

12 How Much Does Tuning Help? Experience from PHiPAC: ~10x on matmul

13 Sparse Matrices as Graphs
Sparse matrix is adjacency matrix for a graph
–Matrix vector multiplication is nearest neighbor computation
Optimizations:
–Register blocking: look for fixed size cliques
  Unroll loops and optimize “dense” kernels
–Cache blocking: partition graph and layout in memory by partitions
–Multiple vectors: Assume each node holds a vector, update them all simultaneously
  Common in some types of solvers
–Exploit symmetry (undirected graph)
–Exploit bounded degree or other special structures
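Register blocking can be illustrated by comparing a plain compressed sparse row (CSR) multiply with a version that stores fixed 2x2 dense blocks and unrolls the inner work. This is a minimal sketch, assuming a 2x2 block size and an even matrix dimension; the function names and the small test matrix are made up for illustration and are not taken from the Sparsity system.

```python
def spmv_csr(row_ptr, col_idx, vals, x):
    """y = A*x with A in CSR form: one stored value per nonzero."""
    y = [0.0] * (len(row_ptr) - 1)
    for i in range(len(y)):
        for p in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += vals[p] * x[col_idx[p]]
    return y

def spmv_bcsr_2x2(brow_ptr, bcol_idx, blocks, x):
    """y = A*x with A stored as 2x2 dense blocks (block CSR).

    Each block is (a00, a01, a10, a11). The 2x2 multiply is fully
    unrolled, and y0, y1, x0, x1 play the role of register-resident values.
    """
    y = [0.0] * (2 * (len(brow_ptr) - 1))
    for bi in range(len(brow_ptr) - 1):
        y0 = y1 = 0.0
        for p in range(brow_ptr[bi], brow_ptr[bi + 1]):
            a00, a01, a10, a11 = blocks[p]
            j = 2 * bcol_idx[p]
            x0, x1 = x[j], x[j + 1]
            y0 += a00 * x0 + a01 * x1
            y1 += a10 * x0 + a11 * x1
        y[2 * bi], y[2 * bi + 1] = y0, y1
    return y

# Hypothetical 4x4 matrix: [[1,2,0,0],[3,4,0,0],[0,0,5,0],[0,0,0,6]]
x = [1.0, 2.0, 3.0, 4.0]
y_csr = spmv_csr([0, 2, 4, 5, 6], [0, 1, 0, 1, 2, 3],
                 [1.0, 2.0, 3.0, 4.0, 5.0, 6.0], x)
y_bcsr = spmv_bcsr_2x2([0, 1, 2], [0, 1],
                       [(1.0, 2.0, 3.0, 4.0), (5.0, 0.0, 0.0, 6.0)], x)
```

The second block stores two explicit zeros. That trade-off, extra arithmetic on padding in exchange for fewer index loads and a dense inner kernel, is why the best block size depends on both the matrix and the machine.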

14 Speedups from Sparsity with 1 Vector

15 Speedups from Sparsity with 9 Vectors

16 BeBOP Research
BeBOP: Berkeley Benchmarking and Optimization group
Hand optimizations:
–Understood for some problems
How to build tools
–Work across machines (self-tuning)
–Work on multiple problems (code generation)

17 Application-Specific Tools
Simulation of the human body
Imagine a “digital body double”
–3D image-based medical record
–Includes diagnostic, pathologic, and other information
Used for:
–Diagnosis
–Less invasive surgery-by-robot
–Experimental treatments
Where are we today?

18 From Visible Human to Digital Human
Source: John Sullivan et al, WPI
Source: Building 3D Models from images

19 Heart Simulation Calculation
Developed by Peskin and McQueen at NYU
–Done on a Cray C90: 1 heart-beat in 100 hours
–Used for evaluating artificial heart valves
–Scalable parallel version done here
Implemented in Titanium
–Model also used for:
  Inner ear
  Blood clotting
  Embryo growth
  Insect flight
  Paper making

20 Digital Human Roadmap
[Figure: roadmap timeline with year marks 1995, 2000, 2005, 2010. Labels: 1 organ, 1 model; scalable implementations; 1 organ, multiple models; multiple organs; 3D model construction; new algorithms; organ system; coupled models; 100x performance]

21 Summary
Three related projects
–Titanium
–BeBOP
–Organ Simulation
Research issues
–How to make high performance easy
  Increasingly complex applications
  Increasingly complex machines

22 Simulation of a Heart
