Improving Performance, Power, and Thermal Efficiency in High-End Systems
Kirk W. Cameron, Scalable Performance Laboratory, Department of Computer Science and Engineering, Virginia Tech (cameron@cs.vt.edu)
Poster sections: Introduction, Performance Efficiency, Power Efficiency, Thermal Efficiency

There are no comprehensive, holistic studies of performance, power, and thermals on distributed scientific systems and workloads. Without innovation, future HEC systems will waste performance potential, waste energy, and require extravagant cooling.

Problem Statement
Left unchecked, the fundamental drive to increase peak performance using tens of thousands of components in close proximity to one another will result in: 1) an inability to sustain performance improvements, and 2) exorbitant infrastructure and operational cost for power and cooling.

Performance, Power, and Thermal Facts
The gap between peak and achieved performance is growing. A 5-megawatt supercomputer can consume $4M in energy annually. In just 2 hours, the Earth Simulator can produce enough heat to warm a home in the Midwest all winter long.

Projections
Commodity components fail at an annual rate of 2-3%. A petaflop system of ~12,000 nodes (CPU, NIC, DRAM, disk) will sustain a hardware failure once every 24 hours. The life expectancy of an electronic component decreases 50% for every 10°C (18°F) temperature increase.

Our Approach
Observations: Predictive models and techniques are needed to maximize performance of emergent systems. Additional below-peak performance may provide adequate "slack times" for improved power and thermal efficiencies.
Constraint: Performance is the critical constraint. Reduce power and thermals ONLY if doing so does not reduce performance significantly.

Relevant Approaches to the Problem
Improving performance efficiency: A myriad of tools and modeling techniques exist to analyze and optimize the performance of parallel scientific applications. In our work we focus on fast analytical modeling techniques to optimize emergent architectures such as the IBM Cell Broadband Engine.
Improving power efficiency: Exploit application "slack times" to operate various components in lower power modes (e.g., dynamic voltage and frequency scaling, or DVFS) to conserve power and energy. Prior to our work, no framework existed for profiling the performance and power of parallel systems and applications. (A minimal sketch of this idea appears after this panel.)
Improving thermal efficiency: Exploit application "slack times" to operate various components in lower power (and thermal) modes to reduce the heat emitted by the system. Prior to our work, no framework existed for profiling the performance and thermals of parallel systems and applications.

Our Contributions
I. A portable framework to profile, analyze, and optimize distributed applications for performance, power, and thermals with minimal performance impact.
II. Performance-power-thermal tradeoff studies and optimizations of scientific workloads on various architectures.

Performance Analysis of NAS Parallel Benchmarks
Distributed thermal profiles: A thermal profile of FT reveals thermal patterns corresponding to code phases. Floating-point-intensive phases run hot while memory-bound phases run cooler. Significant temperature drops also occur in very short periods of time. The thermal behavior of BT (not pictured) shows temperatures synchronizing with workload behavior across nodes, and some nodes trend hotter than others. All of this data was obtained using Tempest.
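The slack-time DVFS idea described under "Improving power efficiency" above can be illustrated with a minimal sketch. This is not the CPU MISER or Tempest implementation: it assumes a Linux cpufreq sysfs interface with the userspace governor enabled (and permission to write to it), and the frequency value and MPI call shown in the comments are hypothetical.

```python
# Minimal sketch of slack-time DVFS (illustrative only, not CPU MISER/Tempest).
# Assumes Linux cpufreq with the "userspace" governor, so a frequency in kHz
# can be written to scaling_setspeed; requires suitable permissions.
import contextlib

CPUFREQ = "/sys/devices/system/cpu/cpu0/cpufreq"

def set_freq_khz(khz: int) -> None:
    """Request a fixed CPU frequency via the userspace governor."""
    with open(f"{CPUFREQ}/scaling_setspeed", "w") as f:
        f.write(str(khz))

def current_freq_khz() -> int:
    """Read the current operating frequency in kHz."""
    with open(f"{CPUFREQ}/scaling_cur_freq") as f:
        return int(f.read())

@contextlib.contextmanager
def slack_region(low_khz: int):
    """Run a communication- or memory-bound phase at reduced frequency,
    then restore the previous frequency. CPU-bound phases stay at full
    speed, so the slowdown is confined to slack time."""
    prev = current_freq_khz()
    set_freq_khz(low_khz)
    try:
        yield
    finally:
        set_freq_khz(prev)

# Hypothetical usage inside an iterative MPI solver:
#   with slack_region(low_khz=1_000_000):   # 1.0 GHz during the wait
#       comm.Allreduce(sendbuf, recvbuf)    # CPU mostly idle here
```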
Temperature-Performance Tradeoffs
Thermal-performance tradeoffs are studied using Tempest, with DVFS strategies applied to reduce temperature in parallel scientific applications. Tempest profiling techniques are automatic, accurate, and portable. Tempest is available for download from http://sourceforge.net; related papers can be found at http://scape.cs.vt.edu. (Figure: Tempest software architecture.)

Thermal regulation: The Tempest controller constrains temperature to within a threshold (thermal regulation of IS and FT, Class C, NP=4). Since the controller is heuristic, the temperature can exceed the threshold; however, temperature is typically controlled well using DVFS within a node. The weighted importance of thermals, performance, and energy can determine the "best" operating point over a number of nodes. Thermal optimizations are achieved with minimal performance impact.

CPU impact on thermals: For floating-point-intensive codes (e.g., SP, FT, and EP from NAS), the CPU is a large consumer of power under load and dissipates significant heat. Energy optimizations that significantly reduce CPU heat should therefore significantly reduce total system temperature. (Figure: detailed thermal profile of FT, Class C, NP=4.)

Thermal-aware performance impact: The performance impact of our thermal-aware DVFS controller is less than 10% for all the NAS parallel benchmark codes measured. Nonetheless, we commonly reduce operating temperature by nearly 10°C (18°F), which translates to a 50% reliability improvement in some cases. On average, we reduce operating temperature by 5-7°C. (Figure: average CPU temperature for various NAS PB codes.)

PowerPack II Software (deployed on the 8-node Dori cluster)
Power profiling API library: synchronized profiling of parallel applications.
Power control API library: synchronized DVS control within a parallel application.
Multimeter middleware: coordinates data from multiple meter sources.
Power analyzer middleware: sorts, sifts, analyzes, and correlates profiling data.
Performance profiler: uses common utilities to poll system performance status.
(Diagram: multimeters and sense resistors on the node under test feed a data collection system over RS232/GBIC Ethernet; component power is computed as P = (V_S - V_R) · V_R / R, where V_S is the supply voltage and V_R is the drop across sense resistor R. A small worked example of this formula appears at the end of this panel.)
This work was sponsored in part by the Department of Energy Office of Science Early Career Principal Investigator (ECPI) Program under grant number DOE DE-FG02-04ER25608.

Distributed power profiles: NAS codes exhibit regularity (e.g., FT on 4 nodes) that reflects algorithm behavior. Intensive use of memory corresponds to decreases in CPU power and increases in memory power use. Power consumption can vary by node for a single application, with the number of nodes under a fixed workload, and with varied workloads under a fixed number of nodes. Results often correlate with the communication/computation ratio.

Reducing energy consumption: CPU MISER uses dynamic voltage and frequency scaling (DVFS) to lower average processor power consumption. With the default cpuspeed daemon (auto) or any fixed lower frequency, performance loss is common; CPU MISER reduces energy consumption without significantly reducing performance. (Chart: normalized energy and delay for FT.C.8 under the cpuspeed daemon (auto), fixed frequency settings from 600 to 1400, and CPU MISER.) Memory MISER uses power-scalable DRAM to lower average memory power consumption by turning off memory DIMMs based on memory use and allocation; in its chart, the top curve shows the amount of online memory and the bottom curve shows actual demand. CPU MISER and Memory MISER are each capable of 30% total-system energy savings with less than 1% performance loss.
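The component power formula from the PowerPack measurement setup above, P = (V_S - V_R) · V_R / R, is straightforward to evaluate offline. The sketch below assumes evenly spaced multimeter samples of (V_S, V_R) pairs; the resistor value and sample readings in the example are illustrative, not measurements from the Dori testbed.

```python
# Minimal sketch of the sense-resistor power computation: a precision
# resistor R sits in series with a component's supply line, a multimeter
# samples the supply voltage V_S and the drop V_R across R, and component
# power follows from P = (V_S - V_R) * V_R / R.

def component_power(v_supply: float, v_resistor: float, r_ohms: float) -> float:
    """Instantaneous power drawn by the component (watts).
    Current through the series resistor: I = V_R / R.
    Voltage seen by the component:       V = V_S - V_R."""
    current = v_resistor / r_ohms
    return (v_supply - v_resistor) * current

def energy_joules(samples, r_ohms: float, dt_seconds: float) -> float:
    """Integrate power over evenly spaced (V_S, V_R) samples to get energy."""
    return sum(component_power(vs, vr, r_ohms) for vs, vr in samples) * dt_seconds

# Illustrative example: 12 V rail, 10 mOhm sense resistor, 50 ms sampling period.
samples = [(12.0, 0.05), (12.0, 0.08), (12.0, 0.06)]
print(energy_joules(samples, r_ohms=0.010, dt_seconds=0.05))
```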
Optimizing Heterogeneous Multicore Systems
We use a variation of the log_nP performance model, MMGP, to predict the cost of various process and data placement configurations at runtime. Using the performance model we can schedule process and data placement optimally for a heterogeneous multicore architecture. Results on the IBM Cell Broadband Engine show that dynamic multicore scheduling using analytical modeling is a viable, accurate technique to improve performance efficiency. Portions of this work were accomplished in collaboration with the Pearl Laboratory led by Prof. D. Nikolopoulos.

Model equations:
Time for a single iteration: T_i = T_HPU + T_APU + O_offload, where the offload overhead is O_offload = O_r + O_s.
Total time: T = sum_i (T_HPU,i + T_APU,i + O_offload,i).
Single APU: T_APU = T_APUp + C_APU, where T_APUp is the APU part that can be parallelized and C_APU is the APU sequential part.
Multiple APUs: T_APU(1,p) = T_APU(1,1)/p + C_APU, where p is the number of APUs, T_APU(1,1) is the offloaded time for 1 APU, and T_APU(1,p) is the offloaded time for p APUs. This gives T = T_HPU + T_APU(1,1)/p + C_APU + O_offload + p·g.
HPU time for one iteration with m HPU processes: T_HPU(m,1) = a_m · T_HPU(1,1) + T_CSW + O_col.
General form: T(m,p) = T_HPU(m,p) + T_APU(m,p) + O_offload + p·g.

Application: Parallel Bayesian Phylogenetic Inference (PBPI). Dataset: 107 sequences, each 10,000 nucleotides, 20,000 generations. MMGP mean error is 3.2% (standard deviation 2.6, maximum error 10%). PBPI executes a sampling phase at the beginning of execution; MMGP parameters are determined during the sampling phase and execution is restarted afterward with the MMGP-selected configuration. PBPI with the sampling phase outperforms other configurations by 1% to 4x, and the sampling-phase overhead is 2.5%.
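A minimal sketch of evaluating the MMGP equations above to choose a process and data placement follows. Parameter names mirror the poster; treating the T_HPU(m,1) expression as the HPU term of T(m,p), along with the helper names and example scaling factors, is an illustrative simplification rather than the published MMGP implementation.

```python
# Minimal sketch of an MMGP-style placement prediction, following the
# equations above. Parameter values are assumed to be fit during the
# application's sampling phase.
from dataclasses import dataclass

@dataclass
class MMGPParams:
    t_hpu_11: float   # HPU time per iteration with 1 HPU process, 1 APU
    t_apu_11: float   # offloaded (APU) time with a single APU
    c_apu: float      # sequential (non-parallelizable) part of the APU work
    o_offload: float  # offload overhead, O_r + O_s
    g: float          # per-APU gap/contention cost
    a_m: dict         # scaling factor a_m per HPU process count, e.g. {1: 1.0, 2: 0.55}
    t_csw: float      # context-switch overhead with multiple HPU processes
    o_col: float      # collective-communication overhead

def predict_time(params: MMGPParams, m: int, p: int) -> float:
    """Predicted time per iteration for m HPU processes and p APUs."""
    t_apu = params.t_apu_11 / p + params.c_apu                              # T_APU(1,p)
    t_hpu = params.a_m[m] * params.t_hpu_11 + params.t_csw + params.o_col   # T_HPU(m,1)
    return t_hpu + t_apu + params.o_offload + p * params.g                  # T(m,p)

def best_configuration(params: MMGPParams, max_m: int, max_p: int):
    """Pick the (m, p) placement with the lowest predicted time,
    mirroring the runtime scheduling decision described above."""
    configs = ((m, p) for m in range(1, max_m + 1)
                      for p in range(1, max_p + 1) if m in params.a_m)
    return min(configs, key=lambda mp: predict_time(params, *mp))
```

In practice the MMGPParams values would be measured during the sampling phase described above, and best_configuration would then select the placement used when execution is restarted.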

