
Slide 1: A Case Study. Kei Davis and Fabrizio Petrini {kei,fabrizio}@lanl.gov, CCS-3 PAL. Europar 2004, Pisa, Italy.

Slide 2: Section 3 Overview
- In this section we show the negative consequences of the lack of coordination in a large-scale machine.
- We analyze the behavior of a complex scientific application, representative of the ASCI workload, on a large-scale supercomputer.
- A case study that emphasizes the importance of coordination in the network and in the system software.

Slide 3: ASCI Q
- 2,048 ES45 AlphaServers, with 4 processors per node
- 16 GB of memory per node
- 8,192 processors in total
- 2 independent network rails, Quadrics Elan3
- More than 8,192 cables
- 20 Tflops peak, #2 in the Top500 list
- A complex human artifact

Slide 4: Dealing with the complexity of a real system
- In this section of the tutorial we provide insight into the methodology we used to substantially improve the performance of ASCI Q.
- This methodology is based on an arsenal of analytical models, custom microbenchmarks, full applications, and discrete event simulators.
- We deal with the complexity of the machine and the complexity of a real parallel application, SAGE, with more than 150,000 lines of Fortran and MPI code.

Slide 5: Overview
- Our performance expectations for ASCI Q and the reality
- Identification of performance factors: application performance and its breakdown into components
- Detailed examination of system effects: a methodology to identify operating system effects; the effect of scaling up to 2,000 nodes / 8,000 processors; quantification of the impact
- Towards the elimination of overheads: demonstrated more than 2x performance improvement
- Generalization of our results: application resonance
- Bottom line: the importance of integrating the various system activities across nodes

Slide 6: Performance of SAGE on 1,024 nodes
- Performance is consistent across QA and QB (the two segments of ASCI Q, each with 1,024 nodes / 4,096 processors).
- The measured time is 2x greater than the model predicts (4,096 PEs). There is a difference: why? (Lower is better.)

Slide 7: Using fewer PEs per node
- Test performance using 1, 2, 3, and 4 PEs per node. (Lower is better.)

Slide 8: Using fewer PEs per node (2)
- Measurements match the model almost exactly for 1, 2, and 3 PEs per node!
- The performance issue only occurs when using 4 PEs per node.

Slide 9: Mystery #1
SAGE performs significantly worse on ASCI Q than was predicted by our model.

Slide 10: SAGE performance components
- Look at SAGE in terms of its main components: Put/Get (point-to-point boundary exchange) and collectives (allreduce, broadcast, reduction).
- The performance issue seems to occur only in collective operations.

Slide 11: Performance of the collectives
- Measure collective performance separately, with 4 processes per node.
- Collectives (e.g., allreduce and barrier) mirror the performance of the application.

Slide 12: Identifying the problem within SAGE
Simplify: narrow SAGE down to its allreduce component.

Slide 13: Exposing the problems with simple benchmarks
Start from allreduce benchmarks and add complexity back as needed. The challenge: identify the simplest benchmark that exposes the problem.

Slide 14: Interconnection network and communication libraries
- The initial (obvious) suspects were the interconnection network and the MPI implementation.
- We tested in depth the network, the low-level transmission protocols, and several allreduce algorithms.
- We also implemented allreduce in the network interface card.
- By changing the synchronization mechanism we were able to reduce the latency of an allreduce benchmark by a factor of 7.
- But we got only a small improvement in SAGE (5%).

Slide 15: Mystery #2
Although SAGE spends half of its time in allreduce (at 4,096 processors), making allreduce 7 times faster yields only a small performance improvement.

Slide 16: Computational noise
- After ruling out the network and MPI, we focused our attention on the compute nodes.
- Our hypothesis is that computational noise is generated inside the processing nodes.
- This noise "freezes" a running process for a certain amount of time, creating a "computational" hole.

Slide 17: Computational noise: intuition
- Running 4 processes on all 4 processors of an AlphaServer ES45 (P0, P1, P2, P3).
- The computation of one process is interrupted by an external event (e.g., a system daemon or the kernel).

Slide 18: Computational noise: 3 processes on 3 processors
- Running 3 processes on 3 processors of an AlphaServer ES45 (P0, P1, P2), leaving one processor idle.
- The "noise" can run on the 4th processor without interrupting the other 3 processes.

Slide 19: Coarse-grained measurement
We execute a computational loop for 1,000 seconds on all 4,096 processors of QB. [Diagram: timelines of P1 to P4 from start to end.]

Slide 20: Coarse-grained computational overhead per process
The slowdown per process is small, between 1% and 2.5%. (Lower is better.)

Slide 21: Mystery #3
Although the "noise" hypothesis could explain SAGE's suboptimal performance, the microbenchmarks of per-processor noise indicate that at most 2.5% of performance is lost to noise.

Slide 22: Fine-grained measurement
- We run the same benchmark for 1,000 seconds, but we measure the run time of every 1 ms chunk of computation.
- This fine granularity is representative of many ASCI codes.
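The fine-grained measurement described above can be sketched as follows. This is an illustrative stand-in, not the actual LANL benchmark: it calibrates a compute chunk to roughly 1 ms, then times every chunk individually so that delayed outliers expose external interference.

```python
import time

def fine_grained_noise(n_chunks=10000, target_ms=1.0):
    """Time many nominally identical compute chunks; outliers that run
    far longer than target_ms were interrupted by external noise."""
    def chunk(iters):
        start = time.perf_counter()
        s = 0.0
        for i in range(iters):
            s += i * 0.5  # fixed amount of arithmetic work
        return (time.perf_counter() - start) * 1000.0
    # Calibrate an iteration count that takes roughly target_ms.
    iters = 1000
    while chunk(iters) < target_ms:
        iters *= 2
    # Measure each chunk individually and keep every timing.
    return [chunk(iters) for _ in range(n_chunks)]
```

Timings far above the nominal ~1 ms mark the chunks that were "frozen" by a daemon or kernel activity, which is exactly the signal the per-node analysis on the next slides exploits.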

Slide 23: Fine-grained computational overhead per node
- We now compute the slowdown per node, rather than per process.
- The noise has a clear per-cluster structure. (Optimum is 0; lower is better.)

Slide 24: Finding #1
Analyzing noise on a per-node basis reveals a regular structure across nodes.

Slide 25: Noise in a 32-node cluster
- The Q machine is organized in 32-node clusters (TruCluster).
- In each cluster there is a cluster manager (node 0), a quorum node (node 1), and the RMS data collection node (node 31).

Slide 26: Per-node noise distribution
- Plot the distribution of one million 1 ms computational chunks.
- On an ideal, noiseless machine the distribution is a single bar at 1 ms, with 1 million points per process (4 million per node).
- Every outlier identifies a computation that was delayed by external interference.
- We show the distributions for a standard cluster node and for nodes 0, 1, and 31.
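The distribution plot described above amounts to bucketing the per-chunk timings into a histogram. A minimal sketch, using synthetic sample data rather than real measurements:

```python
from collections import Counter

def timing_histogram(samples_ms, bucket_ms=0.5):
    """Bucket per-chunk timings; an ideal noiseless run collapses to a
    single bar at the nominal ~1 ms, every other bar is an outlier."""
    return Counter(round(t / bucket_ms) * bucket_ms for t in samples_ms)

# Synthetic example: mostly clean 1 ms chunks, a few delayed to 6 ms
# by an interruption (the kind of outlier seen on the cluster nodes).
samples = [1.0] * 95 + [6.0] * 5
hist = timing_histogram(samples)
```

On real data, the location and weight of the outlier bars identify the noise sources, which is how the distinct signatures of nodes 0, 1, and 31 show up on the next slides.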

Slide 27: Cluster nodes (2-30)
10% of the time, the execution of the 1 ms chunk of computation is delayed.

Slide 28: Node 0, cluster manager
We can identify 4 main sources of noise.

Slide 29: Node 1, quorum node
One source of heavyweight noise (335 ms!).

Slide 30: Node 31
Many fine-grained interruptions, between 6 and 8 milliseconds.

Slide 31: The effect of the noise
- An application is usually a sequence of computation phases, each followed by a synchronization (collective).
- But if an event delays a single node, it can stall all the other nodes.
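The amplification mechanism described above can be demonstrated with a toy bulk-synchronous model (an illustrative sketch with made-up parameters, not a model of SAGE itself): each step ends at a barrier, so its duration is the maximum over all nodes, and one delayed node stalls everyone.

```python
import random

def bsp_step_time(n_nodes, rng, grain_ms=1.0, noise_prob=0.01, noise_ms=5.0):
    """One bulk-synchronous step: every node computes grain_ms, a node is
    occasionally hit by noise_ms of noise, then all nodes synchronize.
    The barrier makes the step as slow as the slowest node."""
    per_node = (grain_ms + (noise_ms if rng.random() < noise_prob else 0.0)
                for _ in range(n_nodes))
    return max(per_node)

def mean_step_time(n_nodes, steps=2000):
    rng = random.Random(42)  # fixed seed for reproducibility
    return sum(bsp_step_time(n_nodes, rng) for _ in range(steps)) / steps
```

With 1% noise per node per step, a single node loses almost nothing, but at thousands of nodes nearly every step contains at least one noisy node, so nearly every step pays the full noise penalty.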

Slide 32: Effect of system size
The probability of a random event occurring during any given step increases with the node count.
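Under the standard simplifying assumption of independent per-node noise, this scaling has a simple closed form: if each node is interrupted during a step with probability p, some node is interrupted with probability 1 - (1 - p)^N.

```python
def prob_step_hit(p_node, n_nodes):
    """Probability that at least one of n_nodes is interrupted during a
    step, assuming independent noise with per-node probability p_node."""
    return 1.0 - (1.0 - p_node) ** n_nodes
```

Even a 1% per-node probability becomes a near-certainty at ASCI Q scale: with 2,048 nodes essentially every synchronized step contains an interrupted node.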

Slide 33: Tolerating noise: Buffered Coscheduling (BCS)
We can tolerate the noise by coscheduling the activities of the system software on each node.

Slide 34: Discrete event simulator used to model noise
- A DES is used to examine and identify the impact of noise: it takes as input the harmonics that characterize the noise.
- The noise model closely approximates the experimental data.
- The primary bottleneck is the fine-grained noise generated by the compute nodes (Tru64). (Lower is better.)
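A toy version of such a simulator can convey the idea. This is a minimal single-process sketch of my own, not the simulator used on Q: noise "harmonics" are (period, duration) pairs that periodically steal the CPU from a fixed amount of work, and the simulator computes the resulting wall time.

```python
import heapq

def simulate_noisy_compute(work_ms, harmonics):
    """Advance one process through work_ms of computation while periodic
    noise events, given as (period_ms, duration_ms) harmonics, steal the
    CPU; returns the total wall-clock time."""
    # Build the event queue of noise arrivals within a generous horizon.
    horizon = work_ms * 10
    events = []
    for period, duration in harmonics:
        t = period
        while t < horizon:
            heapq.heappush(events, (t, duration))
            t += period
    wall, remaining = 0.0, work_ms
    while events and remaining > 0.0:
        t, duration = heapq.heappop(events)
        gap = max(0.0, t - wall)  # compute time before the event fires
        if gap >= remaining:
            break  # the work completes before the next noise event
        remaining -= gap
        wall = max(wall, t) + duration  # pay the noise duration
    return wall + remaining
```

Feeding per-node harmonics like these into a multi-node model with barriers between steps reproduces the amplification effect: the fine-grained, high-frequency harmonics dominate the predicted slowdown.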

Slide 35: Finding #2
On fine-grained applications, more performance is lost to short but frequent noise on all nodes than to long but less frequent noise on just a few nodes.
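A back-of-envelope model (my own hedged estimate, with illustrative parameters, not figures from the study) shows why this finding holds for a bulk-synchronous code: the chance that a step of length g overlaps a noise source of period T on some node scales with the node count, so frequent light noise on all nodes beats rare heavy noise on one node.

```python
def expected_step_overhead(grain_ms, n_nodes, period_ms, dur_ms):
    """Expected per-step stall from one noise source: a node is hit during
    a step with probability ~grain_ms/period_ms; if any node is hit, the
    barrier makes every node pay roughly dur_ms."""
    p_node = min(1.0, grain_ms / period_ms)
    p_step = 1.0 - (1.0 - p_node) ** n_nodes
    return p_step * dur_ms

# Fine-grained noise on every node vs. one heavyweight event on one node
# (illustrative numbers in the spirit of the per-node distributions above).
fine = expected_step_overhead(1.0, 4096, period_ms=10.0, dur_ms=0.5)
heavy = expected_step_overhead(1.0, 1, period_ms=60000.0, dur_ms=335.0)
```

With these numbers the fine-grained source costs about half the step length every step, while the 335 ms event, amortized over its long period, costs only a few microseconds per step.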

Slide 36: Incremental noise reduction
1. Removed about 10 daemons from all nodes (including envmod, insightd, snmpd, lpd, niff).
2. Decreased the RMS monitoring frequency by a factor of 2 on each node (from an interval of 30 s to 60 s).
3. Moved several daemons from nodes 1 and 2 to node 0 in each cluster.

Slide 37: Improvements in barrier synchronization latency

Slide 38: Resulting SAGE performance
Nodes 0 and 31 were also configured out in the optimization.

Slide 39: Finding #3
We were able to double SAGE's performance by selectively removing noise caused by several types of system activities.

Slide 40: Generalizing our results: application resonance
- The computational granularity of a balanced bulk-synchronous application correlates with the type of noise that hurts it.
- Intuition: while any noise source has a negative impact, a few noise sources tend to have a major impact on a given application.
- Rule of thumb: the computational granularity of the application "enters into resonance" with noise of the same order of magnitude.
- Performance can be enhanced by selectively removing sources of noise.
- Knowing the computational granularity of a given application, we can provide a reasonable estimate of the performance improvement.

Slide 41: Cumulative noise distribution, sequence of barriers with no computation
Most of the latency is generated by the fine-grained, high-frequency noise of the cluster nodes.

Slide 42: Conclusions
- Combination of measurement, simulation, and modeling to identify and resolve performance issues on Q:
  - Used modeling to determine that a problem exists.
  - Developed computation kernels to quantify OS events: the effect increases with the number of nodes, and the impact is determined by the computational granularity of the application.
- Application performance has significantly improved.
- The method is also being applied to other large-scale systems.

