Presentation transcript:

Slide 1: A Case Study
Kei Davis and Fabrizio Petrini, Euro-Par 2004, Pisa, Italy (CCS-3 / PAL)

Slide 2: Section 3 Overview
- In this section we show the negative consequences of the lack of coordination in a large-scale machine.
- We analyze the behavior of a complex scientific application, representative of the ASCI workload, on a large-scale supercomputer.
- This is a case study that emphasizes the importance of coordination in the network and in the system software.

Slide 3: ASCI Q
- 2,048 AlphaServer ES45 nodes, with 4 processors per node
- 16 GB of memory per node
- 8,192 processors in total
- 2 independent network rails, Quadrics Elan3
- More than 8,192 cables
- 20 Tflops peak, #2 in the Top500 list
- A complex human artifact

Slide 4: Dealing with the complexity of a real system
- In this section of the tutorial we provide insight into the methodology we used to substantially improve the performance of ASCI Q.
- This methodology is based on an arsenal of:
  - analytical models
  - custom microbenchmarks
  - full applications
  - discrete event simulators
- We deal with the complexity of the machine and the complexity of a real parallel application, SAGE, with more than 150,000 lines of Fortran and MPI code.

Slide 5: Overview
- Our performance expectations for ASCI Q and the reality
- Identification of performance factors
  - Application performance and its breakdown into components
- Detailed examination of system effects
  - A methodology to identify operating system effects
  - Effect of scaling: up to 2,000 nodes / 8,000 processors
  - Quantification of the impact
- Towards the elimination of overheads
  - Demonstrated more than 2x performance improvement
- Generalization of our results: application resonance
- Bottom line: the importance of integrating the various system activities across nodes

Slide 6: Performance of SAGE on 1,024 nodes
- Performance is consistent across QA and QB (the two segments of ASCI Q, each with 1,024 nodes / 4,096 processors)
  - The measured time is 2x greater than the model predicts (4,096 PEs)
- There is a difference: why?
- (Chart: lower is better)

Slide 7: Using fewer PEs per node
- Test performance using 1, 2, 3, and 4 PEs per node
- (Chart: lower is better)

Slide 8: Using fewer PEs per node (2)
- Measurements match the model almost exactly for 1, 2, and 3 PEs per node!
- The performance issue occurs only when using 4 PEs per node

Slide 9: Mystery #1
- SAGE performs significantly worse on ASCI Q than was predicted by our model

Slide 10: SAGE performance components
- Look at SAGE in terms of its main components:
  - Put/Get (point-to-point boundary exchange)
  - Collectives (allreduce, broadcast, reduction)
- The performance issue seems to occur only in the collective operations

Slide 11: Performance of the collectives
- Measure collective performance separately, with 4 processes per node
- Collectives (e.g., allreduce and barrier) mirror the performance of the application
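To make this kind of measurement concrete, a minimal allreduce timing microbenchmark can be sketched as follows. This is an illustrative sketch, not the exact code used on ASCI Q; the iteration count and the single summed double are assumptions.

```c
/* Minimal sketch of an allreduce timing microbenchmark (illustrative,
 * not the ASCI Q code): every process repeatedly takes part in an
 * MPI_Allreduce and rank 0 reports the average and maximum latency. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    const int iterations = 10000;        /* assumed iteration count */
    double value = 1.0, result;
    double t, sum = 0.0, max = 0.0;
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);         /* start roughly together */
    for (int i = 0; i < iterations; i++) {
        t = MPI_Wtime();
        MPI_Allreduce(&value, &result, 1, MPI_DOUBLE, MPI_SUM,
                      MPI_COMM_WORLD);
        t = MPI_Wtime() - t;
        sum += t;
        if (t > max)
            max = t;
    }

    if (rank == 0)
        printf("allreduce latency: avg %.1f us, max %.1f us\n",
               1e6 * sum / iterations, 1e6 * max);

    MPI_Finalize();
    return 0;
}
```

Run with 4 MPI ranks per node, as in the measurements above, the gap between the average and the maximum latency is where noise-induced outliers show up.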

Slide 12: Identifying the problem within SAGE
- (Diagram: simplify from the full SAGE application down to its allreduce)

Slide 13: Exposing the problems with simple benchmarks
- (Diagram: add complexity from simple benchmarks up to allreduce)
- Challenge: identify the simplest benchmark that exposes the problem

Slide 14: Interconnection network and communication libraries
- The initial (obvious) suspects were the interconnection network and the MPI implementation
- We tested in depth the network, the low-level transmission protocols, and several allreduce algorithms
- We also implemented allreduce in the network interface card
- By changing the synchronization mechanism we were able to reduce the latency of an allreduce benchmark by a factor of 7
- But we got only a small improvement (5%) in SAGE

Slide 15: Mystery #2
- Although SAGE spends half of its time in allreduce (at 4,096 processors), making allreduce 7 times faster yields only a small performance improvement

Slide 16: Computational noise
- Having ruled out the network and MPI, we focused our attention on the compute nodes
- Our hypothesis is that computational noise is generated inside the processing nodes
- This noise "freezes" a running process for a certain amount of time and creates a "computational hole"

Slide 17: Computational noise: intuition
- Running 4 processes on all 4 processors of an AlphaServer ES45 (diagram: processes P0, P1, P2, P3)
- The computation of one process is interrupted by an external event (e.g., a system daemon or the kernel)

Slide 18: Computational noise: 3 processes on 3 processors
- Running 3 processes on 3 processors of an AlphaServer ES45 (diagram: P0, P1, P2, with the fourth processor idle)
- The "noise" can run on the 4th processor without interrupting the other 3 processes

Slide 19: Coarse-grained measurement
- We execute a computational loop for 1,000 seconds on all 4,096 processors of QB
- (Diagram: processes P1-P4 running from start to end along the time axis)
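A coarse-grained probe along these lines can be sketched as follows; this is an illustrative reconstruction, not the original kernel. Each process calibrates a unit of pure computation, runs enough of it for a nominal 1,000 seconds, and reports how much longer it actually took.

```c
/* Sketch of a coarse-grained noise probe (illustrative): each process
 * runs a fixed amount of pure computation and reports how much longer
 * it took than a calibrated, noise-free baseline. */
#include <mpi.h>
#include <stdio.h>

static double work(long reps)            /* pure CPU work, no I/O, no MPI */
{
    volatile double x = 0.0;
    for (long i = 0; i < reps; i++)
        x += (double)i * 1e-9;
    return x;
}

int main(int argc, char **argv)
{
    const long   calib_reps = 50 * 1000 * 1000;
    const double target_seconds = 1000.0;  /* nominal run length */
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Calibrate: how long does a fixed chunk of work take on this CPU? */
    double t0 = MPI_Wtime();
    work(calib_reps);
    double per_rep = (MPI_Wtime() - t0) / calib_reps;

    long   total_reps = (long)(target_seconds / per_rep);
    double expected   = total_reps * per_rep;

    t0 = MPI_Wtime();
    work(total_reps);
    double measured = MPI_Wtime() - t0;

    printf("rank %d: expected %.1f s, measured %.1f s, slowdown %.2f%%\n",
           rank, expected, measured,
           100.0 * (measured - expected) / expected);

    MPI_Finalize();
    return 0;
}
```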

Slide 20: Coarse-grained computational overhead per process
- The slowdown per process is small, between 1% and 2.5%
- (Chart: lower is better)

Slide 21: Mystery #3
- Although the "noise" hypothesis could explain SAGE's suboptimal performance, the microbenchmarks of per-processor noise indicate that at most 2.5% of performance is lost to noise

Slide 22: Fine-grained measurement
- We run the same benchmark for 1,000 seconds, but measure the run time of every 1 ms chunk of computation
- This fine granularity is representative of many ASCI codes
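The fine-grained variant of the probe times every ~1 ms chunk individually so that delayed chunks stand out. The calibration step and the per-rank trace files below are assumptions of this sketch, not details from the talk.

```c
/* Sketch of a fine-grained noise probe (illustrative): time each ~1 ms
 * chunk of computation individually so that delayed chunks stand out. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

static double work(long reps)            /* pure CPU work, no I/O, no MPI */
{
    volatile double x = 0.0;
    for (long i = 0; i < reps; i++)
        x += (double)i * 1e-9;
    return x;
}

int main(int argc, char **argv)
{
    const long nsamples = 1000000;       /* one million ~1 ms chunks */
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Calibrate the number of repetitions that takes about 1 ms. */
    long reps = 1000000;
    double t0 = MPI_Wtime();
    work(reps);
    double dt = MPI_Wtime() - t0;
    reps = (long)(reps * 0.001 / dt);

    double *sample = malloc(nsamples * sizeof *sample);
    if (sample == NULL)
        MPI_Abort(MPI_COMM_WORLD, 1);

    for (long i = 0; i < nsamples; i++) {
        t0 = MPI_Wtime();
        work(reps);
        sample[i] = MPI_Wtime() - t0;    /* nominally ~1 ms each */
    }

    /* Dump the samples for offline analysis (per-rank file assumed). */
    char name[64];
    snprintf(name, sizeof name, "noise-rank-%d.dat", rank);
    FILE *f = fopen(name, "w");
    if (f != NULL) {
        for (long i = 0; i < nsamples; i++)
            fprintf(f, "%.9f\n", sample[i]);
        fclose(f);
    }
    free(sample);

    MPI_Finalize();
    return 0;
}
```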

Slide 23: Fine-grained computational overhead per node
- We now compute the slowdown per node, rather than per process
- The noise has a clear per-cluster structure
- (Chart: the optimum is 0; lower is better)

Slide 24: Finding #1
- Analyzing noise on a per-node basis reveals a regular structure across nodes

Slide 25: Noise in a 32-node cluster
- The Q machine is organized into 32-node clusters (TruCluster)
- In each cluster there is a cluster manager (node 0), a quorum node (node 1), and the RMS data-collection node (node 31)

Slide 26: Per-node noise distribution
- Plot the distribution of one million 1 ms computational chunks
- In an ideal, noiseless machine the distribution is a single bar at 1 ms with one million points per process (4 million per node)
- Every outlier identifies a computation that was delayed by external interference
- We show the distributions for a standard cluster node, and also for nodes 0, 1, and 31
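Distributions of this kind can be reproduced from the traces with a simple histogram tool. The sketch below assumes the per-rank trace files written by the fine-grained probe above and a 1 ms bucket width; both are assumptions for illustration.

```c
/* Sketch of building the per-node duration histogram from the traces
 * written by the fine-grained probe above (file names and the 1 ms
 * bucket width are assumptions). Chunks far from the nominal 1 ms
 * bucket are the ones delayed by external interference. */
#include <stdio.h>

#define NBUCKETS 512                     /* 1 ms per bucket, 0..511 ms */

int main(int argc, char **argv)
{
    long histogram[NBUCKETS] = {0};
    double t;

    for (int i = 1; i < argc; i++) {     /* one trace file per process */
        FILE *f = fopen(argv[i], "r");
        if (f == NULL)
            continue;
        while (fscanf(f, "%lf", &t) == 1) {
            int bucket = (int)(t * 1000.0);   /* seconds -> ms bucket */
            if (bucket >= NBUCKETS)
                bucket = NBUCKETS - 1;
            histogram[bucket]++;
        }
        fclose(f);
    }

    for (int b = 0; b < NBUCKETS; b++)
        if (histogram[b] > 0)
            printf("%3d ms: %ld chunks\n", b, histogram[b]);
    return 0;
}
```

Feeding the four per-process traces of one node into the tool yields the per-node distribution; everything outside the nominal ~1 ms bucket is noise.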

Slide 27: Cluster nodes (2-30)
- 10% of the time, the execution of the 1 ms chunk of computation is delayed

Slide 28: Node 0, cluster manager
- We can identify 4 main sources of noise

Slide 29: Node 1, quorum node
- One source of heavyweight noise (335 ms!)

Slide 30: Node 31
- Many fine-grained interruptions, between 6 and 8 milliseconds

Slide 31: The effect of the noise
- An application is usually a sequence of a computation phase followed by a synchronization (collective)
- If a noise event happens on a single node, it can delay all the other nodes
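Because the collective cannot complete until the slowest process arrives, the duration of each iteration is set by the worst delay anywhere in the machine. A minimal way to write this, with g the computational granularity and delta_i the noise delay suffered by process i in that iteration, is:

```latex
T_{\text{iter}} \;=\; \max_{1 \le i \le P} \bigl( g + \delta_i \bigr) \;+\; T_{\text{collective}}
```

Even if each delta_i is small on average (the 1-2.5% measured earlier), with thousands of processes the maximum is almost never zero, which is how modest per-process noise turns into a large application-level slowdown.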

Slide 32: Effect of system size
- The probability that a random event occurs somewhere in the machine increases with the node count
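An illustrative way to quantify this scaling effect (a simple independence model, not a number from the talk): if an interfering event hits any given node during a compute phase with probability p, the probability that at least one of N nodes is hit is

```latex
P_{\text{any}} \;=\; 1 - (1 - p)^{N}
```

For example, with p = 0.001 per phase and N = 4096 processes, P_any = 1 - 0.999^4096, which is roughly 0.98: nearly every iteration is delayed somewhere, even though each individual process is rarely disturbed.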

Slide 33: Tolerating noise: Buffered Coscheduling (BCS)
- We can tolerate the noise by coscheduling the activities of the system software on each node

Slide 34: Discrete event simulator used to model the noise
- A DES is used to examine and identify the impact of the noise; it takes as input the harmonics that characterize the noise
- The noise model closely approximates the experimental data
- The primary bottleneck is the fine-grained noise generated by the compute nodes (Tru64)
- (Chart: lower is better)
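The authors' simulator is not reproduced here, but the qualitative conclusion can be illustrated with a toy Monte Carlo model of a bulk-synchronous loop. All process counts, rates, and durations below are invented for illustration; the actual numbers depend on the assumed noise profile.

```c
/* Toy noise model (not the authors' discrete event simulator): compare
 * frequent, short noise on every process against rare, long noise on a
 * few processes, for a bulk-synchronous loop with 1 ms granularity. */
#include <stdio.h>
#include <stdlib.h>

#define NPROCS      4096
#define NITERS      100000
#define GRANULARITY 0.001                /* 1 ms of computation per iteration */

/* n_affected processes can each be hit per iteration with probability
 * p_hit, a hit adding `delay` seconds; the collective waits for the
 * slowest process, so the iteration time is the maximum over processes. */
static double run(int n_affected, double p_hit, double delay, unsigned seed)
{
    double total = 0.0;
    srand(seed);
    for (int it = 0; it < NITERS; it++) {
        double slowest = GRANULARITY;
        for (int p = 0; p < n_affected; p++)
            if ((double)rand() / RAND_MAX < p_hit &&
                GRANULARITY + delay > slowest)
                slowest = GRANULARITY + delay;
        total += slowest;
    }
    return total / (NITERS * GRANULARITY);   /* slowdown factor */
}

int main(void)
{
    /* Fine-grained noise: 1 ms hits, 1% chance per iteration, all processes. */
    double fine = run(NPROCS, 0.01, 0.001, 1);
    /* Heavyweight noise: 335 ms hits, roughly once per 300 s, but only on
     * the 128 processes living on the 32 quorum nodes. */
    double heavy = run(128, 3.0e-6, 0.335, 2);

    printf("fine-grained noise : slowdown factor %.2f\n", fine);
    printf("heavyweight noise  : slowdown factor %.2f\n", heavy);
    return 0;
}
```

With these invented rates, the frequent 1 ms interruptions roughly double the iteration time, while the rare 335 ms events on a few nodes cost far less, in line with Finding #2 on the next slide.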

Slide 35: Finding #2
- On fine-grained applications, more performance is lost to short but frequent noise on all nodes than to long but less frequent noise on just a few nodes

Slide 36: Incremental noise reduction
1. Removed about 10 daemons from all nodes (including envmod, insightd, snmpd, lpd, and niff)
2. Decreased the RMS monitoring frequency by a factor of 2 on each node (from an interval of 30 s to 60 s)
3. Moved several daemons from nodes 1 and 2 to node 0 on each cluster

Slide 37: Improvements in the barrier synchronization latency

Slide 38: Resulting SAGE performance
- Nodes 0 and 31 were also configured out in the optimized runs

Slide 39: Finding #3
- We were able to double SAGE's performance by selectively removing the noise caused by several types of system activities

Slide 40: Generalizing our results: application resonance
- The computational granularity of a balanced bulk-synchronous application determines which type of noise it is most sensitive to
- Intuition: every noise source has a negative impact, but a few noise sources tend to have a major impact on a given application
- Rule of thumb: the computational granularity of the application "enters into resonance" with noise of the same order of magnitude
- Performance can be enhanced by selectively removing sources of noise
- Knowing the computational granularity of a given application, we can provide a reasonable estimate of the achievable performance improvement (a sketch of such an estimate follows)
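One way to make the rule of thumb quantitative (a back-of-the-envelope model, not a formula from the talk): suppose a noise source fires on each of N nodes at rate f with typical duration d, and the application computes for g seconds between collectives. Then the expected extra time per iteration is roughly

```latex
\Delta T \;\approx\; d \left( 1 - e^{-N f g} \right) \;\approx\;
\begin{cases}
N f g \, d, & N f g \ll 1 \\
d,          & N f g \gg 1
\end{cases}
```

In this simplified model the relative slowdown, Delta T / g, becomes large only when the events are frequent enough that N f g is not small and their duration d is of the same order as g or larger; this is the resonance effect described above, and removing the noise sources whose f and d sit in that regime for a given g is where most of the improvement comes from.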

Slide 41: Cumulative noise distribution, sequence of barriers with no computation
- Most of the latency is generated by the fine-grained, high-frequency noise of the cluster nodes

Slide 42: Conclusions
- Combination of measurement, simulation, and modeling to identify and resolve performance issues on Q:
  - Used modeling to determine that a problem exists
  - Developed computation kernels to quantify OS events:
    - The effect increases with the number of nodes
    - The impact is determined by the computational granularity of the application
- Application performance has improved significantly
- The method is also being applied to other large-scale systems