1
Languages and Compilers for High Performance Computing
Kathy Yelick
EECS Department, U.C. Berkeley
2
Research Problems in High Performance
Organ Simulation: Domain-Specific Tools
BeBOP: Architecture-Specific Optimization (with Demmel)
Titanium: Language for Parallel Scientific Computing (with Graham, Hilfinger)
3
Titanium
Language for grid-based scientific computing
Based on Java (but compiled)
Extensions:
  –Multidimensional arrays with iterators
  –Immutable (“value”) classes
  –Templates
  –Operator overloading
  –Checked synchronization
  –Zone-based memory management
4
Is High Performance Java an Oxymoron?
5
Parallel Dependence Analysis: Cycle Detection
First, find potential race conditions
  –If none, then use traditional sequential analysis
  –Analysis of shared/private data can help
Code defines a “program order” on accesses
  –P is the union of these across processors
Memory system defines an “access order”
  –A is access order (read/write & write/write pairs)
Avoid reordering along edges of a cycle
  –Intuition: time cannot flow backwards.
[Figure: accesses labeled “write data”, “read flag”, “write flag”, “read data”]
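The cycle rule above can be sketched in a few lines of Python. This is a simplified illustration, not the Titanium compiler's analysis: it treats program-order edges (P) as directed and conflicting access pairs (A) as traversable in either direction, then reports the program-order edges that lie on a cycle and so must not be reordered. The function name `edges_on_cycles`, the access labels, and the reading of the slide's figure (one processor writes data then flag, the other reads flag then data) are assumptions made for the example.

```python
from collections import defaultdict, deque

def edges_on_cycles(program_edges, conflict_pairs):
    """Return the program-order edges that lie on a cycle of P union A.

    program_edges: directed (earlier, later) pairs within one processor.
    conflict_pairs: unordered pairs of accesses to the same location from
    different processors, at least one of which is a write.
    """
    succ = defaultdict(set)
    for u, v in program_edges:
        succ[u].add(v)
    # The access order is not known statically, so a conflict pair may be
    # traversed in either direction.
    for a, b in conflict_pairs:
        succ[a].add(b)
        succ[b].add(a)

    def reachable(src, dst):
        # Breadth-first search from src looking for dst.
        seen, queue = {src}, deque([src])
        while queue:
            node = queue.popleft()
            if node == dst:
                return True
            for nxt in succ[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return False

    # A program edge (u, v) is on a cycle when v can get back to u.
    return [(u, v) for u, v in program_edges if reachable(v, u)]

# Hypothetical encoding of the flag/data example.
program = [
    ("P1: write data", "P1: write flag"),
    ("P2: read flag", "P2: read data"),
]
conflicts = [
    ("P1: write data", "P2: read data"),
    ("P1: write flag", "P2: read flag"),
]
must_keep = edges_on_cycles(program, conflicts)
```

With both conflicts present, both program-order edges lie on the cycle, so neither processor's pair of accesses may be reordered. If only the data accesses conflicted, no cycle would exist and the analysis would leave both edges free to reorder.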
6
Parallel Control Analysis: Synchronization
Given a program P, determine which segments of P could run in parallel.
  –Match barriers (single analysis in Titanium)
  –Match synchronized regions
Both analyses can be used to:
  –Detect bugs (race conditions)
  –Enable optimizations: prefetching, split-phase memory, loop transformations, scheduling, …
7
Titanium Research Problems
Designed for block-structured grids; add support for unstructured grids.
Optimizations for local memory hierarchies (more on this later)
Design of low-cost communication layers for read/write
Add communication optimizations
See the projects web page: http://titanium.cs.berkeley.edu/tasks.html
8
Performance Tuning
Motivation: performance of many applications dominated by a few kernels
Heart simulation: Navier-Stokes
  –Sparse matrix-vector multiply (Multigrid)
  –Fast Fourier Transforms
Information retrieval: LSI, LDA
  –Sparse matrix-vector multiply
Image processing: filtering, segmentation
  –Sorting/Histograms, Cosine transform, Sparse matrix-vector multiply
Many other examples
9
Architectural Trends
[Figure: Processor-Memory Performance Gap. Performance (log scale, 1 to 1000) versus year, 1980 to 2000. CPU (µProc) improves 60%/yr. (“Moore’s Law”); DRAM improves 7%/yr.; the gap grows 50% / year.]
A cache miss is O(100) cycles
Getting worse every year
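The 50% per year figure follows from the two growth rates on the slide, as this small arithmetic check suggests. The constant and function names are made up for the illustration, and the normalization to 1.0 at year 0 is an assumption rather than the chart's actual baseline.

```python
CPU_GROWTH = 1.60   # processor performance: +60% per year (from the slide)
DRAM_GROWTH = 1.07  # DRAM performance: +7% per year (from the slide)

def gap_after(years):
    """Relative processor-memory gap after `years`, normalized to 1.0 at year 0."""
    return (CPU_GROWTH / DRAM_GROWTH) ** years

yearly = gap_after(1)        # about 1.495, i.e. roughly 50% growth per year
two_decades = gap_after(20)  # the same ratio compounded over 20 years
```

Because the ratio compounds, a gap that looks modest in any single year becomes very large over the two decades the chart covers.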
10
Conventional Performance Tuning
Vendor or user hand tunes kernels
Drawbacks:
  –Very time consuming and difficult work
  –Even with intimate knowledge of architecture and compiler, performance hard to predict
  –Must be redone for every architecture, compiler
  –Not just a compiler problem:
      Best algorithm may depend on input, so some tuning must occur at run-time.
      Multiple algorithms for the same problem may not be provably equivalent by program analysis
11
Automatic Performance Tuning
Approach: for each kernel
  1. Identify and generate a space of algorithms
  2. Search for the fastest one, by running them
  3. Constrain search space using performance models
What is a space of algorithms?
  –Depends on kernel and input
  –May vary
      instruction mix and order
      memory access patterns
      data structures
      mathematical formulation
Search both off-line and on-the-fly
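The generate-and-search loop above can be sketched as follows. This is a toy illustration under stated assumptions, not PHiPAC or any BeBOP tool: the "space of algorithms" is just a tiled matrix multiply parameterized by block size, the search runs each candidate and keeps the fastest, and the performance-model step is reduced to a hand-picked candidate list. The names `blocked_matmul` and `autotune`, the matrix size, and the candidate block sizes are all hypothetical.

```python
import time

def blocked_matmul(a, b, n, block):
    """Multiply two n x n matrices (lists of lists) using square tiles."""
    c = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, block):
        for kk in range(0, n, block):
            for jj in range(0, n, block):
                for i in range(ii, min(ii + block, n)):
                    row_a, row_c = a[i], c[i]
                    for k in range(kk, min(kk + block, n)):
                        aik, row_b = row_a[k], b[k]
                        for j in range(jj, min(jj + block, n)):
                            row_c[j] += aik * row_b[j]
    return c

def autotune(n, candidates, trials=3):
    """Time every candidate block size; return (best_block, timings)."""
    a = [[float(i + j) for j in range(n)] for i in range(n)]
    b = [[float(i - j) for j in range(n)] for i in range(n)]
    timings = {}
    for block in candidates:
        best = float("inf")
        for _ in range(trials):
            start = time.perf_counter()
            blocked_matmul(a, b, n, block)
            best = min(best, time.perf_counter() - start)
        timings[block] = best  # keep the fastest of the trials
    return min(timings, key=timings.get), timings

# Step 3 (a performance model) would prune this list before timing anything.
best_block, timings = autotune(n=48, candidates=[4, 8, 16, 48])
```

Which block size wins depends on the machine and the interpreter, which is the point of searching by measurement instead of predicting the answer.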
12
How Much Does Tuning Help? Experience from PHiPAC: ~10x on matmul
13
Sparse Matrices as Graphs
Sparse matrix is adjacency matrix for a graph
  –Matrix-vector multiplication is nearest neighbor computation
Optimizations:
  –Register blocking: look for fixed size cliques
      Unroll loops and optimize “dense” kernels
  –Cache blocking: partition graph and lay out in memory by partitions
  –Multiple vectors: Assume each node holds a vector, update them all simultaneously
      Common in some types of solvers
  –Exploit symmetry (undirected graph)
  –Exploit bounded degree or other special structures
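Register blocking can be illustrated by comparing a plain compressed sparse row (CSR) multiply with a 2x2 block CSR variant whose inner kernel is fully unrolled. This is a minimal sketch, not the Sparsity implementation: the function names, the fixed 2x2 block size, and the example matrix are assumptions, and pure Python shows only the data layout, not the speedup.

```python
def spmv_csr(n_rows, row_ptr, col_idx, vals, x):
    """y = A*x with A in compressed sparse row format."""
    y = [0.0] * n_rows
    for i in range(n_rows):
        for p in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += vals[p] * x[col_idx[p]]
    return y

def to_bcsr_2x2(n, row_ptr, col_idx, vals):
    """Convert CSR to 2x2 block CSR (n even); absent entries become explicit zeros."""
    b_ptr, b_col, b_val = [0], [], []
    for bi in range(n // 2):
        blocks = {}  # block column -> 4 values in row-major order
        for r in range(2):
            i = 2 * bi + r
            for p in range(row_ptr[i], row_ptr[i + 1]):
                bj, c = divmod(col_idx[p], 2)
                blocks.setdefault(bj, [0.0] * 4)[2 * r + c] = vals[p]
        for bj, blk in sorted(blocks.items()):
            b_col.append(bj)
            b_val.extend(blk)
        b_ptr.append(len(b_col))
    return b_ptr, b_col, b_val

def spmv_bcsr_2x2(n, b_ptr, b_col, b_val, x):
    """y = A*x with A in 2x2 block CSR; the dense 2x2 kernel is unrolled."""
    y = [0.0] * n
    for bi in range(n // 2):
        y0 = y1 = 0.0
        for p in range(b_ptr[bi], b_ptr[bi + 1]):
            x0, x1 = x[2 * b_col[p]], x[2 * b_col[p] + 1]
            v = 4 * p
            y0 += b_val[v] * x0 + b_val[v + 1] * x1
            y1 += b_val[v + 2] * x0 + b_val[v + 3] * x1
        y[2 * bi], y[2 * bi + 1] = y0, y1
    return y

# Example matrix (rows): [1 2 0 0], [3 4 0 0], [0 0 5 0], [0 6 0 7]
row_ptr = [0, 2, 4, 5, 7]
col_idx = [0, 1, 0, 1, 2, 1, 3]
vals = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
x = [1.0, 2.0, 3.0, 4.0]

y_csr = spmv_csr(4, row_ptr, col_idx, vals, x)
b_ptr, b_col, b_val = to_bcsr_2x2(4, row_ptr, col_idx, vals)
y_bcsr = spmv_bcsr_2x2(4, b_ptr, b_col, b_val, x)
```

The blocked format stores 12 values for 7 nonzeros here, so register blocking trades extra explicit zeros for fewer index loads and an unrolled kernel. Whether that trade pays off depends on the matrix and the machine.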
14
Speedups from Sparsity with 1 Vector
15
Speedups from Sparsity with 9 Vectors
16
BeBop Research
BeBop: Berkeley Benchmarking and Optimization group
Hand optimizations:
  –Understood for some problems
How to build tools
  –Work across machines (self-tuning)
  –Work on multiple problems (code generation)
17
Application-Specific Tools
Simulation of the human body
Imagine a “digital body double”
  –3D image-based medical record
  –Includes diagnostic, pathologic, and other information
Used for:
  –Diagnosis
  –Less invasive surgery-by-robot
  –Experimental treatments
Where are we today?
18
From Visible Human to Digital Human
Building 3D Models from images
[Images. Sources: John Sullivan et al, WPI; www.madsci.org]
19
Heart Simulation Calculation
Developed by Peskin and McQueen at NYU
  –Done on a Cray C90: 1 heart-beat in 100 hours
  –Used for evaluating artificial heart valves
  –Scalable parallel version done here
      Implemented in Titanium
  –Model also used for:
      Inner ear
      Blood clotting
      Embryo growth
      Insect flight
      Paper making
20
Digital Human Roadmap
[Figure: timeline with years 1995, 2000, 2005, 2010. Milestones shown: 1 organ, 1 model; scalable implementations; 1 organ, multiple models; multiple organs; 3D model construction; new algorithms; organ system; coupled models; 100x performance]
21
Summary
Three related projects
  –Titanium http://titanium.cs.berkeley.edu
  –BeBop http://www.cs.berkeley.edu/~richie/bebop
  –Organ Simulation
Research issues
  –How to make high performance easy
      Increasingly complex applications
      Increasingly complex machines
22
Simulation of a Heart