1 Engineering Analysis of High Performance Parallel Programs. David Culler, Computer Science Division, U.C. Berkeley (LLNL ISCR, 5/16/2000). http://www.cs.berkeley.edu/~culler

2 Traditional Parallel Programming Tools
Focus on showing "what program did" and "when it did it"
–microscopic analysis of deterministic events
–oriented towards initial development of small programs on small data sets and small machines
Instrumentation
–traces, counters, profiles
Visualization
Examples
–AIMS, PTOOLS, PPP
–Pablo + Paradyn + ... => Delphi
–ACTS TAU - tuning and analysis utilities

3 Example: Pablo

4 Beyond Zeroth-Order Analysis
Zeroth-order analysis gets you to a system design that is reasonable and behaves properly under "ideal conditions"
Subject the system to various stresses to understand its operating regime and gain deeper insight into its dynamic behavior
Combine empirical data with analytical models
Iterate from "What?" to "What if?"
(Figure: wind speed vs. max displacement)

5 Approach: Framework for Parameterized Sensitivity Analysis
Framework performs analysis over numerous runs
–statistical filtering
–vary the parameter of interest
Provides a means of combining data to isolate effects of interest => ROBUSTNESS
(Diagram: a well-developed parallel program and a study parameter feed a problem/data set generator, instrumentation tools, and machine characterizers, leading to visualization and modeling; example parameters: procs, comm. perf., cache, scheduling, ...)

6 Example: NAS Parallel Benchmarks
Fix problem size (NPB 2.2 class A)
Two different architectures
–NOW UltraSPARC cluster (170 MHz)
–SGI Origin (250 MHz)
Six application kernels
–BT - Block Tridiagonal solve
–SP - Scalar Pentadiagonal solve
–LU - Sparse LU
–MG - Multigrid
–IS - Integer Sort
–FT - 3D FFT
Examine sensitivity to P (# procs)
–time(P), speedup(P) = Time(1)/Time(P)

7 Single Processor Performance

8 Simplest Example: Performance(P)
NPB 2.2 on NOW and Origin 2000 (250 MHz)

9 Understanding Speedup
SpeedUp(p) = T_1 / MAX_p (T_compute + T_comm + T_wait)
T_compute = (work/p + extra) x efficiency
With message passing (e.g., MPI), communication time and wait time are indistinguishable
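To make the terms concrete, here is a minimal sketch (mine, not from the talk) of separating the per-process terms with MPI_Wtime; compute_phase and exchange_with_neighbors are hypothetical stand-ins for the application's own routines, and the timer around the MPI calls necessarily lumps communication and wait time together, as the slide notes.

#include <mpi.h>
#include <stdio.h>

/* Hypothetical stand-ins for the application's own phases. */
static void compute_phase(void)           { /* local work: work/p + extra */ }
static void exchange_with_neighbors(void) { MPI_Barrier(MPI_COMM_WORLD); /* placeholder for sends/recvs */ }

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double t_compute = 0.0, t_comm = 0.0;
    for (int iter = 0; iter < 100; iter++) {
        double t0 = MPI_Wtime();
        compute_phase();
        double t1 = MPI_Wtime();
        exchange_with_neighbors();          /* comm + wait, indistinguishable under MPI */
        double t2 = MPI_Wtime();
        t_compute += t1 - t0;
        t_comm    += t2 - t1;
    }

    /* SpeedUp(p) uses the maximum over processes of (T_compute + T_comm + T_wait). */
    double local[2] = { t_compute, t_comm }, peak[2];
    MPI_Reduce(local, peak, 2, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("max compute %.3f s, max comm+wait %.3f s\n", peak[0], peak[1]);

    MPI_Finalize();
    return 0;
}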

10 A more austere metric...
Time spent doing thing X:
TotalTime_X(P) = sum over i = 1..P of Time_X(i)
Constant for perfect speedup
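Why it is constant under perfect speedup (my one-line restatement, not on the slide): if activity X parallelizes perfectly, each of the P processes spends 1/P of the sequential time on it, so the sum stays fixed and any growth of TotalTime_X with P is pure overhead.

\mathrm{TotalTime}_X(P) \;=\; \sum_{i=1}^{P} \mathrm{Time}_X(i)
\;=\; \sum_{i=1}^{P} \frac{T_X^{\mathrm{seq}}}{P}
\;=\; T_X^{\mathrm{seq}}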

11 Where Time is Spent (P)
Reveal basic processor and network loading (vs. P)

12 Where Time is Spent (P)
Reveal basic processor and network loading (vs. P)
Basis for model derivation - comm(P)

13 Why do comm. costs increase?
–total volume?
–volume per processor?
–message overhead?
–contention?

14 Communication Volume (P)

15 Communication Structure (P)

16 Understanding Efficiency (P, M)
Want to understand both what load the program is placing on the system and how well the system is handling that load
=> characterize the capability of the system via simple benchmarks (rather than advertised peaks)
=> combine with measured load for a predictive model, & compare
(Figure labels: 30 MB/s, 150 MB/s)
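The following is an illustrative sketch of the kind of "simple benchmark" meant here (mine, not the actual characterizer used in the study): a two-process MPI ping-pong that reports delivered bandwidth in MB/s, which can then be compared with the per-processor communication volume measured from the application. The message size and repetition count are arbitrary example values; run it with at least two MPI processes.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MSG_BYTES (1 << 20)   /* 1 MB message, an example size */
#define REPS      100

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char *buf = malloc(MSG_BYTES);
    memset(buf, 0, MSG_BYTES);

    double t0 = MPI_Wtime();
    for (int i = 0; i < REPS; i++) {
        if (rank == 0) {
            MPI_Send(buf, MSG_BYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, MSG_BYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, MSG_BYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, MSG_BYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double elapsed = MPI_Wtime() - t0;

    if (rank == 0) {
        /* two messages per round trip; report delivered MB/s */
        double mb = 2.0 * REPS * MSG_BYTES / 1e6;
        printf("achieved bandwidth: %.1f MB/s\n", mb / elapsed);
    }
    free(buf);
    MPI_Finalize();
    return 0;
}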

17 Communication Efficiency

18 Tools => Improvements in Run Time
Efficiency analysis (vs. parameters) gives insight into where to improve the system or the program
–use traditional profiling to see where in the program the 'bad stuff' happens
–or go back and tune the system to do better

19 Why does comp. time decrease?
Combining trace generation with simulation provides new structural insight
Here: clear knees in the program working set ($) shift with machine size (P)

20 Constant Problem Size Scaling
(Figure: processor counts 4, 8, 16, 32, 64, 128, 256)

21 LU Working Sets
Sharp drop in miss rate from 512 KB to 1024 KB: the WS is captured by the $ at 1024 KB per processor
As $ size increases (< 32 KB), the miss rate decreases at a constant rate
New effect in the 100s of KB to MB $ range

22 LU Working Sets
CPS (constant problem size) scaling means a smaller and smaller problem per processor
Smaller WS requirement
Miss rate curve "moves" to the left with P

23 LU Working Sets
Given a fixed machine, we only observe a vertical slice of the graph

24 LU Working Sets
(Figure panels: Cluster vs. Origin)

25 Working Sets
(Table in original slide: each of LU, IS, BT, FT, MG, SP is labeled No Effect, Cost, or Benefit)
There is a Cost to scaling when, at larger machine size, the miss rate increases
There is a Benefit to scaling when, at larger machine size, the miss rate decreases
Processing efficiency is determined by
–the interaction between the changes in working set and the size of the machine

26 Sensitivity to Multiprogramming
Parallel machines are increasingly general purpose
–multiprogramming, at least interrupts and daemons
Many 'ideal' programs very sensitive to perturbations
–message passing is loosely coupled, but the implementation may not be!

27 Tools => Improvements in Run Time
MPI implementation spin-waits on send till the network is available (or the queue is not full), or on recv-complete
Should use two-phase spin-block (see the sketch below)
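The two-phase idea, sketched generically (this is an illustration of the control structure, not the actual MPI-layer fix described in the paper): poll for completion for roughly one expected message time, then give up the processor and block.

#include <mpi.h>

/* Illustrative two-phase spin-block completion of a pending request.
 * SPIN_SECONDS is a hypothetical tuning knob: spin for about one expected
 * message time, then block so the processor can be yielded to other work. */
#define SPIN_SECONDS 1e-4

static void spin_block_wait(MPI_Request *req, MPI_Status *status)
{
    int done = 0;
    double deadline = MPI_Wtime() + SPIN_SECONDS;

    /* Phase 1: spin-poll while the message is likely to arrive soon. */
    while (!done && MPI_Wtime() < deadline)
        MPI_Test(req, &done, status);

    /* Phase 2: stop burning the processor and wait for completion. */
    if (!done)
        MPI_Wait(req, status);
}

It would be used in place of a bare MPI_Wait after an MPI_Irecv or MPI_Isend. In the real fix the blocking phase lives inside the MPI implementation and uses an OS-level sleep or interrupt-driven receive; MPI_Wait above merely stands in for "block", since a stock MPI_Wait may itself spin.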

28 Sensitivity to Seemingly Unrelated Activity
The mechanism for doing parameter studies is naturally extended to get statistically valid data through multiple samples at each point
–tend to get crisp, fast results in the wee hours
Extend the study outside the app
Example: two programs on a big Origin (64 P), alone vs. together
–8 processor IS run: 4.71 sec alone, 6.18 sec together
–36 processor SP run: 26.36 sec alone, 65.28 sec together

29 Repeatability
The variance for the repeated runs is a key result for production codes - the real world is not ideal

30 Understanding the Platform
A very simple example: broadcast(M, P)
Vary M and P; for each repetition:
–MPI barrier
–record start time
–MPI bcast
–record end time
–MPI barrier
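A minimal MPI rendering of that loop (an illustrative sketch; the slide gives only pseudocode, and the default message size and repetition count below are example values):

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);     /* P is fixed by how the job is launched */

    int m    = (argc > 1) ? atoi(argv[1]) : 1024;  /* message size M in bytes */
    int reps = (argc > 2) ? atoi(argv[2]) : 100;
    char *buf = calloc(m, 1);

    for (int i = 0; i < reps; i++) {
        MPI_Barrier(MPI_COMM_WORLD);
        double start = MPI_Wtime();
        MPI_Bcast(buf, m, MPI_CHAR, 0, MPI_COMM_WORLD);
        double end = MPI_Wtime();
        MPI_Barrier(MPI_COMM_WORLD);
        if (rank == 0)
            printf("bcast(%d, %d) rep %d: %.6f s\n", m, p, i, end - start);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}

Recording every repetition separately is what makes the per-iteration plots on the following slides possible (including discarding the first, cold-start iteration).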

31 NOW bcast(m, p)

32 Origin mean bcast(m, p)

33 NOW bcast(1024, p)

34 Origin bcast(1024, p)

35 NOW bcast(1024, 16) repetitions, discarded first iteration

36 Origin bcast(1024, 16) repetitions, discarded first iteration

37 Origin bcast(1024, 16) repetitions - 10x

38 Origin bcast(1024, 16) repetitions

39 Origin bcast(1M, 16) repetitions

40 Discussion
Apply engineering analysis to your parallel engineering analysis codes!
Isolate components
Introduce controlled variations
–processors
–data set
–communication rate
–repetition
Identify trouble spots

41 To read more
Parallel Computer Architecture: A Hardware/Software Approach, Culler and Singh, Morgan Kaufmann
Architectural Requirements and Scalability of the NAS Parallel Benchmarks, Wong, Martin, Arpaci-Dusseau, and Culler, Proc. of SC99
Building MPI for Multi-Programming Systems Using Implicit Information, Wong, Arpaci-Dusseau, and Culler, 6th European PVM/MPI User's Group Meeting
http://www.cs.berkeley.edu/~culler/papers

