Presentation on theme: "High Performance Computing Research at Berkeley"— Presentation transcript:

1 High Performance Computing Research at Berkeley
Katherine Yelick, U.C. Berkeley EECS Dept. and Lawrence Berkeley National Laboratory, November 2004

2 Major Research Areas in HPC at Berkeley
Programming Languages & Compilers
Performance Analysis, Modeling, Tuning
Algorithms & Libraries
Reconfigurable Hardware
Applications
Many of these are collaborations with Lawrence Berkeley Lab scientists and with application scientists across campus, the lab, and elsewhere.

3 Challenges to Performance
Parallel machines are too hard to program:
  Users are "left behind" with each new major generation
  The drop in market size also affects those left in it
Efficiency is too low and dropping:
  Single-digit efficiency numbers are common
  Even on the Top500, fewer than 15% of systems achieve better than 80% efficiency
Two trends in high-end computing:
  Increasingly complicated systems
  Increasingly sophisticated algorithms
A deep understanding of performance at all levels is important.

4 Global Address Space Programming
The best of shared memory and message passing:
  Ease of use of shared memory
  Performance of message passing (or better)
Examples are UPC, Titanium, CAF, and Split-C.
[Figure: several threads share a global address space; object heaps (objects with fields x, y) are shared, while program stacks, with local (l) and global (g) pointers, are private to each thread.]
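A minimal UPC sketch of the idea (UPC is the parallel extension of C named above): the shared array lives in the global address space and is visible to every thread, while ordinary automatic variables stay on each thread's private stack. The array size and variable names here are made up for illustration.

```c
/* Minimal UPC sketch of the global address space model. */
#include <upc.h>
#include <stdio.h>

shared int x[THREADS];       /* one shared element per thread (global heap) */

int main(void) {
    int local = MYTHREAD;    /* private: lives on this thread's own stack   */
    x[MYTHREAD] = 2 * local; /* write my element of the shared array        */
    upc_barrier;             /* make all writes visible before reading      */
    /* Any thread may read any element; remote reads become communication. */
    printf("thread %d sees x[%d] = %d\n", MYTHREAD,
           (MYTHREAD + 1) % THREADS, x[(MYTHREAD + 1) % THREADS]);
    return 0;
}
```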

5 High Performance Computing at Berkeley
Three GAS Languages
The parallel extensions depend on the base language:
UPC (a parallel extension of C)
  Consistent with C design; the mapping to hardware is explicit
  Widely used in the DoD
Titanium (based on Java™)
  Consistent with Java; programmability and safety are primary concerns
  Bounds checking, exception handling, barrier checking
  Attractive to recently trained programmers
Co-Array Fortran
  Array-oriented, builds on Fortran 90

6 Goals of the Berkeley UPC Project
Make UPC ubiquitous on parallel machines:
  Workstations and PCs for development
  A portable compiler: for future machines too
Components of the research agenda:
  Ongoing language development
  Compiler optimizations for parallel languages
  Runtime work for Partitioned Global Address Space (PGAS) languages in general
  Application demonstrations of UPC
An LBNL/UCB collaboration.

7 Where Does Berkeley UPC Run?
Runs on SMPs, clusters, and supercomputers.
Supported operating systems: Linux, FreeBSD, Tru64, AIX, IRIX, HPUX, Solaris, MS Windows (Cygwin), Mac OS X, Unicos, SuperUX
Supported CPUs: x86, Itanium, Alpha, SPARC, PowerPC, PA-RISC, Opteron
GASNet communication: Myrinet GM, Quadrics Elan, Mellanox InfiniBand VAPI, IBM LAPI, Cray X1, SGI Altix, SHMEM
Specific supercomputer platforms: Cray T3E, Cray X1, IBM SP, NEC SX-6, Cluster X (Big Mac), SGI Altix 3000

8 High Performance Computing at Berkeley
The UPC Language
UPC was developed by researchers from IDA, Berkeley, and LLNL.
A consortium led by GWU and IDA sets the language standard; there is an ongoing effort to understand application needs.
Berkeley has been a key player in I/O, collectives, the memory model, and spec issues in general.
UPC is as standard as MPI-2, and more so than SHMEM.
Several commercial (HP, Cray, IBM) and open (Berkeley, MTU/HP, Intrepid) implementations exist; it is not just a language for Cray or SGI machines.

9 Compiling Explicitly Parallel Code
Most compilers are designed for languages with serial semantics, and code motion is a critical optimization:
  Compilers move code around: register re-use, instruction scheduling, loop transforms, overlapping communication
  Hardware dynamically moves operations around: out-of-order processors, network reordering, etc.
When is reordering correct?
  Because the programs are parallel, there are more restrictions, not fewer.
  We have to preserve the semantics of what may be viewed by other processors.

10 Compiler Analysis Overview
When compiling sequential programs, compute dependencies:
  x = expr1; y = expr2;  may be reordered to  y = expr2; x = expr1;
  Valid if y does not appear in expr1 and x does not appear in expr2 (roughly).
When compiling parallel code, we must also consider accesses by other processors.
  Initially flag = data = 0:
    Proc A: data = 1;  flag = 1;
    Proc B: while (flag == 0);  ... = ...data...;
Work by Yelick, Krishnamurthy, and Chen
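A plain-C restatement of the Proc A / Proc B example above, showing why the two writes on processor A (and the read after the spin loop on processor B) must not be reordered. The `volatile` qualifiers are only a stand-in to keep this C sketch from being optimized away; real PGAS languages rely on their memory model and barriers, not `volatile`.

```c
/* Shared variables, both initially zero (as on the slide). */
volatile int data = 0;
volatile int flag = 0;

/* Processor A (producer): the write to data must stay before the write
 * to flag; if the compiler or memory system reorders them, processor B
 * can observe flag == 1 while data is still 0.                         */
void proc_A(void) {
    data = 1;
    flag = 1;
}

/* Processor B (consumer): spins until flag is set, then reads data.
 * The read of data must not be hoisted above (or merged into) the loop. */
int proc_B(void) {
    while (flag == 0)
        /* spin */;
    return data;   /* correct only if the order above was preserved */
}
```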

11 Fast Runtime Support for GAS Languages
Many networks provide native RDMA support: InfiniBand, Quadrics, Cray, Myrinet.
Technical problems:
  Some networks require pinning ⇒ they can read/write only into a pinned area ⇒ we use a "firehose" approach to virtualize this.
  Each platform provides different primitives: we use a layered approach for portability, and a small core is the only requirement for functionality.
  One-sided read/write semantics are a good match, better than send/receive.
[Figure: layered software stack: compiler-generated code / compiler-specific runtime / GASNet Extended API / GASNet Core API / network hardware.]
Work by Bonachea, Bell, Hargrove, and Welcome

12 Small Message Performance
[Figure: small-message performance, MPI vs. best GASNet conduit; lower is better.]
There is a significant benefit to targeting the lower-level API: MPI is not always the best choice.

13 GASNet vs. MPI on Infiniband
[Figure: performance vs. message size on InfiniBand for MPI (MVAPI), GASNet (prepinned), and GASNet (not prepinned); higher is better.]
The performance drop-off at 512 KB messages for both layers is a hardware performance bug.
GASNet significantly outperforms MPI at mid-range sizes: the cost is MPI tag matching, inherent in the two-sided model.
The yellow line shows a naïve bounce-buffer pipelining scheme.

14 High Performance Computing at Berkeley
Applications in UPC
[Figure: grids G1–G4 distributed across processors 1 and 2.]
Problems that are hard (or tedious) in message passing:
  Fine-grained, asynchronous communication
  Dynamic load balancing is required
Three applications:
  Parallel mesh generation (Husbands, using Shewchuk's Triangle)
  Adaptive mesh refinement (shown, Welcome)
  Sparse matrix factorization (Demmel, Husbands, and Li)

15 High Performance Computing at Berkeley
Titanium Overview
An object-oriented language based on Java, with the same high-performance parallelism model as UPC: SPMD parallelism in a global address space.
Emphasis on domain-specific extensions for block-structured, grid-based computation:
  Multidimensional arrays: contiguous storage, domain calculus for index operations
  Sparse matrices and unstructured grids: dynamic communication optimizations
  Support for small objects: a general mechanism for examples like complex numbers
  Semi-automatic memory management: create named "regions" for new and delete
A joint project with Graham, Hilfinger, and Colella.
Notes: the support for small objects could be used to make the FFT code more natural; it is currently written with an array of reals and an array of imaginaries rather than an array of complex numbers.

16 High Performance Computing at Berkeley
Research in Titanium
Some problems are common to UPC:
  Analysis of parallel code
  Lightweight runtime support
  Memory-hierarchy optimizations
Automatic deadlock detection for bulk-synchronous code
Dynamic communication optimizations
Tools for debugging, performance analysis, and program development

17 Runtime Optimizations
Results for sparse matrix-vector multiply on two matrices: random and finite element.
The Titanium versions use:
  1. Packing of remote data
  2. Sending the entire bounding box
  A model selects between the two.
Compared against Fortran MPI (the Aztec library).

18 Heart Simulation in Titanium
A large application effort, joint with Peskin & McQueen at NYU and Yelick & Givelberg at UCB; part of NSF NPACI.
A generic framework: simulation of fluids with immersed elastic structures.
  Many applications in biology and engineering
  A well-known hard parallelism problem (locality / load balance)

19 Berkeley Institute for Performance Studies
A newly created joint institute between the lab and campus.
Goals: bring together researchers on all aspects of performance engineering, and use performance understanding to:
  Improve application performance
  Compare architectures for application suitability
  Influence the design of processors, networks, and compilers
  Identify algorithmic needs
National Science Foundation

20 High Performance Computing at Berkeley
BIPS Approaches
Benchmarking and analysis:
  Measure performance
  Identify opportunities for improvements in software, hardware, and algorithms
Modeling:
  Predict performance on future machines
  Understand performance limits
Tuning:
  Improve performance, by hand or with automatic self-tuning tools

21 High Performance Computing at Berkeley
Multi-Level Analysis
Full applications:
  What users want
  Do not reveal the impact of individual features
Compact applications:
  Can be ported with modest effort
  Easily matched to phases of full applications
Microbenchmarks:
  Isolate architectural features
  Hard to tie to real applications
[Figure: benchmark pyramid ordered by system size and complexity: microbenchmarks, compact applications, full applications, next-generation applications.]

22 High Performance Computing at Berkeley
Projects Within BIPS
APEX: Application Performance Characterization Benchmarking (Strohmaier, Shan)
BeBOP: Berkeley Benchmarking and Optimization Group (Yelick, Demmel)
LAPACK: Linear Algebra Package (Demmel*)
LDRD: Architectural Alternatives (Yelick, Hargrove)
Modern Vector Architecture (Oliker*)
PERC: Performance Engineering Research Center (Bailey, Shan)
Top500: Linpack (Strohmaier*)
ViVA: Virtual Vector Architectures (Oliker*)
* with many other collaborators

23 Vector System Evaluation
The US HPC market has been dominated by superscalar cache-based architectures, with clusters of commodity SMPs used for cost effectiveness.
Two architectures offer vector alternatives: the Japanese Earth Simulator and the Cray X1.
This is an ongoing study of DOE applications on these systems.
Work by L. Oliker, J. Borrill, A. Canning, J. Carter, J. Shalf, S.

24 Architectural Comparison
Node Type  Where  CPU/Node  Clock(MHz)  Peak(GFlop)  MemBW(GB/s)  Peak(byte/flop)  NetwkBW(GB/s/P)  BisectBW(byte/flop)  MPI latency(usec)  Topology
Power3     NERSC  16        375         1.5          1.0          0.47             0.13             0.087                16.3               Fat-tree
Power4     ORNL   32        1300        5.2          2.3          0.44                              0.025                7.0
Altix             2         1500        6.0          6.4          1.1              0.40             0.067                2.8
ES         ESC    8         500         8.0          32.0         4.0                               0.19                 5.6                Crossbar
X1                4         800         12.8         34.1         2.7              6.3              0.088                7.3                2D-torus
Custom vector architectures have high memory bandwidth relative to peak.
Tightly integrated networks result in lower latency (Altix).
Bisection bandwidth depends on topology; the Earth Simulator also dominates here.
A key "balance point" for vector systems is the scalar:vector ratio.

25 High Performance Computing at Berkeley
Applications Studied
Chosen for their potential to run at ultrascale:
CACTUS (Astrophysics, ,000 lines, grid based): solves Einstein's equations of general relativity.
PARATEC (Material Science, 50,000 lines, Fourier space/grid): Density Functional Theory electronic structure code.
LBMHD (Plasma Physics, 1,500 lines, grid based): lattice Boltzmann approach to magneto-hydrodynamics.
GTC (Magnetic Fusion, 5,000 lines, particle based): particle-in-cell method for the gyrokinetic Vlasov-Poisson equation.
MADCAP (Cosmology, ,000 lines, dense linear algebra): extracts key data from the Cosmic Microwave Background Radiation.

26 Summary of Results
Percent of peak at P=64 (on Power3, Power4, Altix, ES, X1) and speedup of the ES vs. the other platforms at the maximum available concurrency:
  LBMHD:   % peak 7%, 5%, 11%, 58%, 37%;  ES speedup 30.6, 15.3, 7.2, 1.5
  CACTUS:  % peak 6%, 34%;  ES speedup 45.0, 5.1, 6.4, 4.0
  GTC:     % peak 9%, 16%;  ES speedup 9.4, 4.3, 4.1, 0.9
  PARATEC: % peak 57%, 33%, 54%, 20%;  ES speedup 8.2, 3.9, 1.4
  MADCAP:  % peak 61%, 40%, ---, 53%, 19%;  ES speedup 3.4, 2.3
Tremendous potential of vector architectures: four of the codes ran faster than ever before, vector systems allow resolution not possible with scalar systems (at any processor count), and there is an advantage to having larger, faster nodes.
The ES shows much higher sustained performance than the X1; only limited X1-specific optimization has been done so far, and more may be possible (CAF, etc.).
Non-vectorizable code segments become very expensive (an 8:1 or even 32:1 ratio).
Vectors are potentially at odds with emerging methods (sparse, irregular, adaptive); the GTC example code is at odds with data parallelism.
There are social barriers to evaluation of these hard-to-vectorize codes.

27 PERC Performance Tools
Flexible instrumentation systems to capture:
  Hardware phenomena
  Instruction execution frequencies
  Memory reference behavior
  Execution overheads
An advanced data management infrastructure to track performance experiments and collect data across time and space.
User-friendly tools to tie performance data back to the user's source code.
Application-level analysis in Berkeley PERC.
Work by D. Bailey, H. Shan, and E. Strohmaier

28 EVH1 Astrophysics Analysis
Aggregate performance measures over all tasks for a 0.1 simulation-second run, collected with PAPI on an IBM SP (Nighthawk II, 375 MHz).

29 High Performance Computing at Berkeley
MicroBenchmarks
Using adaptable probes to understand microarchitecture limits:
  Tunable to "match" application kernels
  The ability to collect continuous data sets over a parameter range reveals performance cliffs
Three examples: Sqmat, APEX-Map, and SPMV (for HPCS)

30 High Performance Computing at Berkeley
Sqmat overview
A Java code generator produces unrolled C code.
Stream of matrices: square each matrix M times.
  M controls computational intensity (CI), the ratio between flops and memory accesses.
Each matrix is of size NxN.
  N controls working-set size: 2N² registers are required per matrix; N is varied to cover the observable register-set size.
Two storage formats:
  Direct storage: Sqmat's matrix entries are stored contiguously in memory.
  Indirect: entries are accessed through an indirection vector; the "stanza length" S controls the degree of indirection (S unit-stride accesses in a row).
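A schematic C version of the kernel just described (not the generator's unrolled output): each NxN matrix in the stream is squared M times, and the indirect variant reaches matrices through an index vector. Names and the scratch-buffer limit are illustrative assumptions.

```c
/* Schematic Sqmat kernel: square each NxN matrix in a stream M times.
 * N sets the working-set size (~2*N*N values per matrix); M sets the
 * computational intensity (flops per memory access). Assumes N <= 16. */
#include <stddef.h>
#include <string.h>

static void square_m_times(double *A, int N, int M) {
    double tmp[16 * 16];                        /* scratch; assumes N <= 16 */
    for (int rep = 0; rep < M; rep++) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                double s = 0.0;
                for (int k = 0; k < N; k++)
                    s += A[i * N + k] * A[k * N + j];
                tmp[i * N + j] = s;
            }
        memcpy(A, tmp, sizeof(double) * N * N);
    }
}

/* Direct storage: matrices stored contiguously in the stream. */
void sqmat_direct(double *stream, int nmat, int N, int M) {
    for (int m = 0; m < nmat; m++)
        square_m_times(stream + (size_t)m * N * N, N, M);
}

/* Indirect storage: matrices reached through an indirection vector; the
 * "stanza length" S governs how many consecutive indices are unit-stride. */
void sqmat_indirect(double *pool, const int *index, int nmat, int N, int M) {
    for (int m = 0; m < nmat; m++)
        square_m_times(pool + (size_t)index[m] * N * N, N, M);
}
```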

31 Tolerating Irregularity
S50: start with some M at S = ∞ (indirect unit stride). For a given M, how large must S be to achieve at least 50% of the original performance?
M50: start with M = 1, S = ∞. At S = 1 (every access random), how large must M be to achieve 50% of the original performance?

32 Tolerating Irregularity
S50: what percentage of memory accesses can be random before performance drops by half?
M50: how much computational intensity is required to hide the penalty of all-random access?
Gather/scatter is expensive on commodity cache-based systems:
  S50 on the Power4 is only 1.6% (1 in 64); the Itanium2 is much less sensitive, at 25% (1 in 4).
A huge amount of computation may be required to hide the overhead of irregular data access:
  The Itanium2 requires a CI of about 9 flops/word; the Power4 requires a CI of almost 75!

33 Emerging Architectures
General-purpose processors are badly suited for data-intensive operations:
  Large caches are not useful if re-use is low
  Low memory bandwidth, especially for irregular patterns
  Superscalar methods of increasing ILP are inefficient
  Power consumption
Three research processors designed as part of a DARPA effort:
  IRAM: processor-in-memory system with vectors (UCB); lots of memory bandwidth (on-chip DRAM)
  DIVA: processor-in-memory system designed for multiprocessor systems (ISI); scalability
  Imagine: stream-based processor (Stanford); lots of processing power (64 FPUs/chip)

34 Sqmat on Future Machines
Performance of Sqmat on PIMs and other processors for 3x3 matrices, squared 10 times (high computational intensity!).
Imagine is much faster for long streams, but slower for short ones.

35 HPCC Benchmarks and Apex-MAP

36 High Performance Computing at Berkeley
APEX Execution Model
Use an array of size M and access data in vectors of length L.
Random mode: pick the start address of each vector randomly, using the properties of the random numbers to achieve a re-use number k.
Regular mode: walk over consecutive (strided) vectors through memory, re-accessing each vector k times.
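A sketch of that access pattern in C. The data array has M elements, each access touches a vector of L consecutive elements, and each vector is re-used k times; a plain uniform random start stands in for the benchmark's tunable non-uniform generator, and the timing harness is omitted.

```c
#include <stdlib.h>

/* Schematic Apex-Map inner loop, random mode: sum over nvec vectors of
 * length L drawn from an array of size M, re-using each vector k times. */
double apex_random(const double *data, long M, long L, int k, long nvec) {
    double sum = 0.0;
    for (long v = 0; v < nvec; v++) {
        long start = rand() % (M - L);      /* random start of the vector   */
        for (int r = 0; r < k; r++)         /* re-access the vector k times */
            for (long i = 0; i < L; i++)
                sum += data[start + i];
    }
    return sum;
}

/* Regular mode: walk consecutive vectors through memory. */
double apex_regular(const double *data, long M, long L, int k) {
    double sum = 0.0;
    for (long start = 0; start + L <= M; start += L)
        for (int r = 0; r < k; r++)
            for (long i = 0; i < L; i++)
                sum += data[start + i];
    return sum;
}
```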

37 High Performance Computing at Berkeley
Apex-Map Sequential
[Figure: results plotted against the spatial and temporal locality parameters.]

38 High Performance Computing at Berkeley
Apex-Map Sequential
[Figure: results plotted against the spatial and temporal locality parameters.]

39 High Performance Computing at Berkeley
Apex-Map Sequential
[Figure: results plotted against the spatial and temporal locality parameters.]

40 High Performance Computing at Berkeley
Parallel Version
The same design principle as the sequential code.
Data are evenly distributed among processes; L contiguous addresses are accessed together.
Each remote access is a communication message of length L (random access).
MPI version first; SHMEM and UPC versions are planned.

41 High Performance Computing at Berkeley
SPMV Benchmark
A microbenchmark for sparse matrix-vector multiply: less "tunable", but closer to a real application.
Strategy:
  Use either a random matrix with dense blocks or a dense matrix in sparse format
  Register-block the matrix: store blocks contiguously, unroll, and keep only one index per block
Developed for the HPCS benchmarks.

42 High Performance Computing at Berkeley
[Figure: SpMV performance profiles on four platforms. Ultra 2i: 9% of peak, 63 Mflop/s best vs. 35 Mflop/s reference. Ultra 3: 6% of peak, 109 vs. 53 Mflop/s. Pentium III: 19% of peak, 96 vs. 42 Mflop/s. Pentium III-M: 15% of peak, 120 vs. 58 Mflop/s.]

43 High Performance Computing at Berkeley
[Figure: SpMV performance profiles on four more platforms. Power3: 13% of peak, 195 Mflop/s best vs. 100 Mflop/s reference. Power4: 14% of peak, 703 vs. 469 Mflop/s. Itanium 1: 7% of peak, 225 vs. 103 Mflop/s. Itanium 2: 1.1 Gflop/s best vs. 276 Mflop/s reference.]

44 Automatic Tuning

45 Motivation for Automatic Performance Tuning
Historical trends: sparse matrix-vector multiply (SpMV) typically runs at 10% of peak or less, "hand-tuning" yields about 2x over CSR, and tuning is becoming more difficult over time.
Performance depends on the machine, the kernel, and the matrix; the matrix may be known only at run time, and the best data structure and implementation can be surprising.
Our approach: empirical modeling and search.
  Up to 4x speedups and 31% of peak for SpMV
  Many optimization techniques for SpMV
  Several other kernels: triangular solve, A^T*A*x, A^k*x
  Proof of concept: integration with Omega3P
  Release of the OSKI library; integration into PETSc
(We will look at the data that supports the historical-trend claims shortly.)
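For reference, the "standard implementation" against which these speedups are measured is compressed sparse row (CSR) matrix-vector multiply, roughly as in the sketch below (array names are illustrative).

```c
/* Baseline sparse matrix-vector multiply y += A*x with A in compressed
 * sparse row (CSR) format: row_ptr has nrows+1 entries, col_idx and val
 * hold the nonzeros row by row. */
void spmv_csr(int nrows, const int *row_ptr, const int *col_idx,
              const double *val, const double *x, double *y) {
    for (int i = 0; i < nrows; i++) {
        double yi = y[i];
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
            yi += val[k] * x[col_idx[k]];
        y[i] = yi;
    }
}
```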

46 Extra Work Can Improve Efficiency!
In general the non-zero structure is more complicated. Example: 3x3 blocking.
  Impose a logical grid of 3x3 cells, fill in explicit zeros, and unroll the 3x3 block multiplies.
  "Fill ratio" = 1.5, yet on a Pentium III this gives a 1.5x speedup!
Notes: the main point is that there can be a considerable pay-off for a judicious choice of fill (r x c blocks), but allowing for fill makes the implementation space even more complicated. For this matrix on a Pentium III we observed a 1.5x speedup even after filling in an additional 50% explicit zeros. Two effects are at work: (1) filling in zeros while eliminating integer indices (overhead), and (2) the quality of the r x c code produced by the compiler may be much better for particular values of r and c. In this particular example the overall data-structure size stays the same, but the 3x3 code is 2x faster than the 1x1 code for a dense matrix stored in sparse format.
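A sketch of what 3x3 register blocking looks like in code (block compressed sparse row, with the 3x3 multiply unrolled). Explicit zeros filled into partially full blocks are simply stored in `val`; the payoff comes from one column index per block and register re-use of the x and y blocks. Illustrative only, not the generated code itself.

```c
/* 3x3 register-blocked SpMV (BCSR), y += A*x: brow_ptr/bcol_idx index 3x3
 * blocks, and val stores each block's 9 entries contiguously (explicit
 * zeros included, hence the "fill ratio"). */
void spmv_bcsr_3x3(int nbrows, const int *brow_ptr, const int *bcol_idx,
                   const double *val, const double *x, double *y) {
    for (int ib = 0; ib < nbrows; ib++) {
        double y0 = 0.0, y1 = 0.0, y2 = 0.0;        /* y block in registers */
        for (int k = brow_ptr[ib]; k < brow_ptr[ib + 1]; k++) {
            const double *b  = val + 9 * k;         /* the 3x3 block        */
            const double *xp = x + 3 * bcol_idx[k]; /* matching x block     */
            y0 += b[0] * xp[0] + b[1] * xp[1] + b[2] * xp[2];
            y1 += b[3] * xp[0] + b[4] * xp[1] + b[5] * xp[2];
            y2 += b[6] * xp[0] + b[7] * xp[1] + b[8] * xp[2];
        }
        y[3 * ib + 0] += y0;
        y[3 * ib + 1] += y1;
        y[3 * ib + 2] += y2;
    }
}
```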

47 Summary of Performance Optimizations
Optimizations for SpMV:
  Register blocking (RB): up to 4x over CSR
  Variable block splitting: 2.1x over CSR, 1.8x over RB
  Diagonals: 2x over CSR
  Reordering to create dense structure + splitting: 2x over CSR
  Symmetry: 2.8x over CSR, 2.6x over RB
  Cache blocking: 2.2x over CSR
  Multiple vectors (SpMM): 7x over CSR
  ...and combinations
Sparse triangular solve:
  Hybrid sparse/dense data structure: 1.8x over CSR
Higher-level kernels:
  A*A^T*x and A^T*A*x: 4x over CSR, 1.8x over RB
  A^2*x: 2x over CSR, 1.5x over RB
Items in bold: see Vuduc's 455-page dissertation; for the other items, see collaborations with BeBOPpers.

48 Optimized Sparse Kernel Interface - OSKI
Provides sparse kernels automatically tuned for the user's matrix and machine:
  BLAS-style functionality: SpMV, TrSV, ...
  Hides the complexity of run-time tuning
  Includes new, faster locality-aware kernels: A^T*A*x, ...
Faster than standard implementations: up to 4x faster matvec, 1.8x trisolve, 4x A^T*A*x.
For "advanced" users and solver-library writers; available as a stand-alone library (Dec '04) and as a PETSc extension (Feb '05).
Notes: the library interface defines low-level primitives in the style of the Sparse BLAS: sparse matrix-vector multiply and sparse triangular solve. Matrix-vector multiply touches each matrix element only once, whereas our locality-aware kernels can reuse these elements. The BeBOP library includes these kernels: (1) simultaneous computation of A*x and A^T*z, (2) A^T*A*x, and (3) A^k*x for non-negative integer k. Unlike tuning in the dense case, sparse tuning must occur at run time since the matrix is unknown until then. Here, the "standard implementation" stores the matrix in compressed sparse row (CSR) format with the kernel coded in C or Fortran and compiled with full optimizations. To maximize the impact of our software, we are implementing a new "automatically tuned" matrix type in PETSc; most PETSc users will be able to use our library with little or no modification to their source code. The stand-alone version of the library is C- and Fortran-callable. "Advanced users" means users willing to program at the level of the BLAS.
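A hedged sketch of what calling the library looks like, following the BLAS-style interface described above: wrap a CSR matrix in a tunable handle, give a workload hint, tune, then call the tuned SpMV. Function names follow the published OSKI interface as best recalled; treat the exact signatures and constants here as assumptions rather than a definitive reference.

```c
/* Sketch of an OSKI call sequence; names/signatures are approximate. */
#include <oski/oski.h>

void tuned_spmv_example(int *Aptr, int *Aind, double *Aval,
                        int nrows, int ncols, double *x, double *y)
{
    oski_Init();

    /* Wrap the user's CSR arrays in a tunable matrix handle. */
    oski_matrix_t A = oski_CreateMatCSR(Aptr, Aind, Aval, nrows, ncols,
                                        SHARE_INPUTMAT, 1, INDEX_ZERO_BASED);
    oski_vecview_t xv = oski_CreateVecView(x, ncols, STRIDE_UNIT);
    oski_vecview_t yv = oski_CreateVecView(y, nrows, STRIDE_UNIT);

    /* Hint that we expect ~500 SpMV calls on this matrix, then tune. */
    oski_SetHintMatMult(A, OP_NORMAL, 1.0, SYMBOLIC_VEC, 0.0, SYMBOLIC_VEC, 500);
    oski_TuneMat(A);

    /* y = 1.0*A*x + 0.0*y, using whatever data structure tuning selected. */
    oski_MatMult(A, OP_NORMAL, 1.0, xv, 0.0, yv);

    oski_DestroyVecView(xv);
    oski_DestroyVecView(yv);
    oski_DestroyMat(A);
    oski_Close();
}
```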

49 How the OSKI Tunes (Overview)
Library install time (offline): 1. build for the target architecture; 2. benchmark the generated code variants and store the benchmark data.
Application run time: 1. evaluate heuristic models against the user's matrix and a workload from program monitoring (history); 2. select a data structure and code, returning a matrix handle to the user for kernel calls.
[Diagram key: ovals are actions taken by the library, cylinders are data stored by or with the library, solid arrows are control flow, dashed arrows are data flow.]
Notes, install time: "Build" pre-compiles source code, and the possible code variants are stored in dynamic libraries. "Benchmark" measures the speed of the possible code variants and stores the results. The entire build process uses standard, portable GNU configure.
Notes, run time: the user passes a pre-assembled matrix in a standard format such as compressed sparse row (CSR) or column (CSC). The library contains a list of heuristic models; each model is a procedure that analyzes the matrix, workload, and benchmarking data and chooses the data structure and code it thinks best for that matrix and workload. A model is typically specialized to predict tuning parameters for a particular kernel and class of data structures (e.g., predicting the block size for register-blocked matvec), but higher-level meta-models that combine several heuristics or predict over several possible data structures and kernels are also possible. In the initial implementation, "Evaluate Models" works as follows: based on the workload, decide on an allowable amount of time for tuning (a "tuning budget"); then, while there is time left for tuning, select and evaluate a model to get the best predicted performance and corresponding tuning parameters.
Extensibility: advanced users may write and dynamically add code variants and heuristic models to the system.
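A hypothetical sketch of the run-time "Evaluate Models" step described in the notes above: within a tuning budget, each heuristic model analyzes the matrix, workload, and install-time benchmark data and predicts the performance of the data structure it would choose, and the best prediction wins. All names and types here are invented for illustration; they are not the library's internal interfaces.

```c
#include <time.h>

typedef struct {
    const char *name;
    /* Analyze matrix/workload/benchmark data; return predicted Mflop/s and,
     * if params is non-NULL, fill in the chosen tuning parameters. */
    double (*evaluate)(const void *matrix, const void *workload,
                       const void *bench_data, void *params);
} heuristic_model;

int select_data_structure(const void *matrix, const void *workload,
                          const void *bench_data,
                          const heuristic_model *models, int nmodels,
                          double tuning_budget_sec, void *chosen_params)
{
    int best = -1;
    double best_perf = 0.0;
    clock_t t0 = clock();

    for (int m = 0; m < nmodels; m++) {
        double elapsed = (double)(clock() - t0) / CLOCKS_PER_SEC;
        if (elapsed > tuning_budget_sec)
            break;                                 /* tuning budget exhausted */
        double perf = models[m].evaluate(matrix, workload, bench_data, NULL);
        if (perf > best_perf) { best_perf = perf; best = m; }
    }
    if (best >= 0)    /* re-run the winner to record its tuning parameters */
        models[best].evaluate(matrix, workload, bench_data, chosen_params);
    return best;      /* -1 means keep the untuned CSR reference */
}
```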

50 High Performance Software for Numerical Linear Algebra
James Demmel, Jack Dongarra, Xiaoye Li, ...
LAPACK and ScaLAPACK are widely used libraries for dense linear algebra (IBM, Intel, HP, SGI, Cray, NEC, Fujitsu, Matlab, ...).
A new release is planned (NSF support, seeking more):
  New, faster, more accurate numerical algorithms
  More parallel versions in ScaLAPACK
  Extended functionality
  Improved ease of use
  Performance tuning
  Reliability and support

51 More on Performance Tuning for LAPACK and ScaLAPACK
Build on the BeBOP experience to automate tuning.
There are now more than 1300 calls to "Get_tuning_parameter()": block-size parameters, layouts on parallel machines, crossover points between algorithms, floating-point properties, ...
These currently use preset values and need to be tuned; the optimization of these tuning parameters should be automated at installation time.

52 High Performance Computing at Berkeley
Sparse Linear Algebra
Joint with Xiaoye Li at LBNL.
One of the most scalable solvers for large sparse nonsymmetric linear systems; cover of Science, Dec.
Numerous improvements in the works:
  Fully parallelize the symbolic factorization
  Exploit architectures to improve accuracy
  Automatic parameter tuning for performance

53 High Performance Computing at Berkeley
Other Work
Bit-serial programming language (Ras Bodik)
Mesh generation (Jonathan Shewchuk)
Application collaborations:
  Heart simulation (Yelick)
  Earthquake modeling (Demmel)
  High-performance number theory (Bailey)
Many other CSE activities at UCB/LBNL: biology, astrophysics, CFD
Reconfigurable hardware (FPGAs): large machine emulator (Wawrzynek)

54 High Performance Computing at Berkeley
People within BIPS
Jonathan Carter, Kaushik Datta, James Demmel, Paul Hargrove, Parry Husbands, Shoaib Kamil, Bill Kramer, Rajesh Nishtala, Leonid Oliker, John Shalf, Hongzhang Shan, Horst Simon, David Skinner, Erich Strohmaier, Rich Vuduc, Mike Welcome, Katherine Yelick
...and many collaborators outside the Berkeley Lab and campus.

55 End of Slides

