
Computer Science Adaptive, Transparent Frequency and Voltage Scaling of Communication Phases in MPI Programs Min Yeol Lim Computer Science Department Sep. 8, 2006

Computer Science 2 Growing energy demand
Energy efficiency is a big concern
–Increased power density of microprocessors
–Cooling cost for heat dissipation
–Power and performance tradeoff
Dynamic voltage and frequency scaling (DVFS)
–Supported by newer microprocessors
–Cubic drop in power consumption: Power ∝ frequency × voltage²
–CPU is the major power consumer: 35~50% of total power
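To make the cubic claim concrete, here is the usual back-of-the-envelope derivation; the assumption that supply voltage scales roughly in proportion to frequency is the standard one behind the "cubic drop" and is not stated explicitly on the slide.

```latex
% Dynamic CPU power: P \propto f V^2.  Assuming V scales roughly with f,
% P \propto f^3, so halving the frequency cuts dynamic power to about 1/8.
P_{\mathrm{dyn}} \propto f V^2, \qquad V \propto f \;\Rightarrow\; P_{\mathrm{dyn}} \propto f^3,
\qquad f \to \tfrac{f}{2} \;\Rightarrow\; P_{\mathrm{dyn}} \to \tfrac{1}{8}\, P_{\mathrm{dyn}}
```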

Computer Science 3 Power-performance tradeoff
Cost vs. Benefit
–Power vs. performance
–Increasing execution time vs. decreasing power usage
–CPU scaling is meaningful only if benefit > cost
Energy at each setting: E = P1 × T1 (original p-state), E = P2 × T2 (reduced p-state)
[Figure: power-vs-time comparison of (P1, T1) and (P2, T2); the power reduction is the benefit, the added execution time is the cost]
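Spelling the slide's condition out as an inequality (a restatement of the slide, not an additional result):

```latex
% Reducing the p-state from (P_1, T_1) to (P_2, T_2), with P_2 < P_1 and
% T_2 > T_1, saves energy only if
E_2 = P_2 T_2 \;<\; P_1 T_1 = E_1
\quad\Longleftrightarrow\quad
\frac{T_2}{T_1} \;<\; \frac{P_1}{P_2}.
```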

Computer Science 4 Power-performance tradeoff (cont')
Cost > Benefit
–NPB EP benchmark: CPU-bound application, CPU is on the critical path
Benefit > Cost
–NPB CG benchmark: memory-bound application, CPU is NOT on the critical path
[Figures: EP and CG behavior across CPU frequencies (GHz)]

Computer Science 5 Motivation 1
Cost/Benefit is code specific
–Applications have different code regions
–Most MPI communication is not CPU-critical
P-state transition in each code region
–High voltage and frequency in CPU-intensive regions
–Low voltage and frequency in MPI communication regions

Computer Science 6 Time and energy performance of MPI calls
[Figures: time and energy measurements for MPI_Send and MPI_Alltoall]

Computer Science 7 Motivation 2
Most MPI calls are too short
–Scaling overhead from a p-state change on every call
–Up to 700 microseconds per p-state transition
Make regions of adjacent calls
–Small intervals between successive MPI calls
–P-state transition occurs once per region, not per call
[Figures: distributions of call lengths (fraction of calls vs. call length in ms) and of inter-call intervals (fraction of intervals vs. interval in ms)]
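A quick sanity check on these numbers, assuming one transition down at region entry and one back up at exit, and using the 10 ms threshold that the experiments later adopt:

```latex
% Worst-case switching cost per region: 2 \times 700\,\mu\mathrm{s} = 1.4\ \mathrm{ms}.
% For a region of at least 10 ms this is at most
\frac{2 \times 0.7\ \mathrm{ms}}{10\ \mathrm{ms}} = 14\%
% of the region, and it shrinks as regions grow; for a single sub-millisecond
% MPI call the same 1.4 ms would dominate the call itself.
```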

Computer Science 8 Reducible regions
[Timeline diagram: execution alternating between user code and the MPI library; MPI calls A through J are grouped into regions R1, R2, R3]

Computer Science 9 Reducible regions (cont')
Thresholds in time
–close-enough (τ): time distance between adjacent calls
–long-enough (λ): region execution time
[Timeline diagram: user/MPI library activity for calls A through J, annotated with gaps δ < τ between grouped calls and a region length δ > λ]
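A minimal sketch of how calls could be grouped with these two thresholds, shown offline over recorded call timestamps; the type, variable, and function names are illustrative, not from the actual implementation.

```c
#include <stdio.h>

/* One recorded MPI call: wall-clock start and end times in milliseconds. */
typedef struct { double start_ms, end_ms; } mpi_call_t;

/* Illustrative grouping: adjacent calls whose gap is below tau form a region;
 * a region is worth reducing the p-state for only if it is longer than lambda. */
static void find_reducible_regions(const mpi_call_t *calls, int n,
                                   double tau_ms, double lambda_ms)
{
    int i = 0;
    while (i < n) {
        int j = i;
        /* Extend the region while the gap to the next call is "close enough". */
        while (j + 1 < n && calls[j + 1].start_ms - calls[j].end_ms < tau_ms)
            j++;
        double length = calls[j].end_ms - calls[i].start_ms;
        if (length > lambda_ms)  /* "long enough" to amortize the transitions */
            printf("reducible region: calls %d..%d, %.1f ms\n", i, j, length);
        i = j + 1;
    }
}
```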

Computer Science 10 How to learn regions
Region-finding algorithms
–by-call: reduce only inside MPI code (τ = 0, λ = 0); effective only if a single MPI call is long enough
–simple: adaptive 1-bit prediction based on each call's last behavior; two flags per call: begin and end
–composite: save the pattern of MPI calls in each region; memorize the begin/end MPI calls and the number of calls (see the sketch after this list)
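A minimal sketch of the per-region bookkeeping the composite algorithm implies; the field names and layout are hypothetical, since the slides only state that the begin/end calls and the call count are memorized.

```c
/* Hypothetical record for one learned region in the composite algorithm.
 * The slides say the begin/end MPI calls and the number of calls in the
 * region are memorized; this exact layout is illustrative. */
typedef struct {
    unsigned long begin_call_id;  /* id of the MPI call site that opens the region  */
    unsigned long end_call_id;    /* id of the MPI call site that closes the region */
    int           num_calls;      /* how many MPI calls the region contains         */
} region_pattern_t;

/* Sketch of the runtime use: once a known begin call is seen, the reduced
 * p-state is applied; if the observed stream of calls stops matching the
 * saved pattern, the top p-state is restored (a false positive, handled in
 * the error slides below). */
```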

Computer Science 11 P-state transition errors
False-positive (FP)
–P-state is reduced in a region where the top p-state should have been used
–e.g. a region that terminates earlier than expected
False-negative (FN)
–Top p-state is used in a reducible region
–e.g. a region on its first appearance

Computer Science 12 P-state transition errors (cont')
[Timeline diagrams for the call sequence A A A B B B: the optimal p-state schedule compared with Simple and Composite; each panel shows the top vs. reduced p-state, with one FN marked for Simple and one for Composite]

Computer Science 13 P-state transition errors (cont')
[Timeline diagrams for the call sequence A A A A A A: the optimal p-state schedule compared with Composite (one FN) and Simple (two FNs and one FP); each panel shows the top vs. reduced p-state]

Computer Science 14 Selecting proper p-state
Automatic algorithm
–Use the composite algorithm to find regions
–Use hardware performance counters to evaluate the CPU dependency of reducible regions; the metric of CPU load is micro-operations per microsecond (OPS)
–Specify a p-state mapping table (the frequency values and range bounds below did not survive the transcript):

  OPS          Frequency
  > …          … MHz
  1000 ~ …     … MHz
  400 ~ …      … MHz
  200 ~ …      … MHz
  100 ~ …      … MHz
  < …          … MHz
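A minimal sketch of the per-region selection step, with an entirely hypothetical mapping table; the real OPS thresholds and frequencies are in the slide's table, whose values are not recoverable here.

```c
/* Hypothetical OPS -> frequency mapping; the real table lives on this slide
 * but its values did not survive the transcript.  Frequencies are in MHz and
 * stay within the 2000~800 MHz p-state range of the test cluster. */
struct ops_to_freq { double min_ops; int freq_mhz; };

static const struct ops_to_freq table[] = {
    { 1000.0, 2000 },   /* heavily CPU-dependent region: keep the top p-state */
    {  400.0, 1600 },
    {  200.0, 1200 },
    {  100.0, 1000 },
    {    0.0,  800 },   /* almost no CPU work: lowest available frequency     */
};

/* ops = retired micro-operations per microsecond, measured with hardware
 * performance counters during the region's previous execution. */
static int select_frequency_mhz(double ops)
{
    for (unsigned i = 0; i < sizeof table / sizeof table[0]; i++)
        if (ops >= table[i].min_ops)
            return table[i].freq_mhz;
    return 800;
}
```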

Computer Science 15 Implementation
Use PMPI
–MPI profiling interface
–Transparently intercept every MPI call with pre and post hooks (a minimal wrapper sketch follows below)
MPI call unique identifier
–Use the hash value of all program counters in the call history
–Gathered by inserting assembly code in C
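A minimal sketch of the PMPI interception idea: the profiling interface guarantees every MPI function is also reachable under a PMPI_ prefix, so a wrapper can run hooks before and after the real call. The hook names, the logging bodies, and the use of the caller's return address as a cheap stand-in for the hashed program-counter identifier are illustrative, not the thesis's actual implementation.

```c
#include <mpi.h>
#include <stdio.h>

/* Illustrative hooks: in the real system these would decide whether the call
 * opens or closes a reducible region and switch the p-state accordingly;
 * here they only log the call site. */
static void pre_mpi_hook(unsigned long call_site)  { printf("pre  %lx\n", call_site); }
static void post_mpi_hook(unsigned long call_site) { printf("post %lx\n", call_site); }

/* PMPI wrapper: the application links against this MPI_Send, which forwards
 * to the vendor implementation through the profiling entry point PMPI_Send.
 * (Signature shown as in MPI-3; older MPI versions declare buf without const.) */
int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm)
{
    /* Cheap call-site identifier: the caller's return address.  The slides
     * instead hash all program counters in the call history. */
    unsigned long call_site = (unsigned long)__builtin_return_address(0);

    pre_mpi_hook(call_site);
    int rc = PMPI_Send(buf, count, datatype, dest, tag, comm);
    post_mpi_hook(call_site);
    return rc;
}
```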

Computer Science 16 Results
System environment
–8 or 9 nodes with AMD Athlon-64 processors
–7 p-states supported: 2000~800 MHz
Benchmarks
–NPB MPI benchmark suite, class C: 8 applications
–ASCI Purple benchmark suite: Aztec
Thresholds (τ, λ) set to 10 ms

Computer Science 17 Benchmark analysis –Used composite for region information

Computer Science 18 Taxonomy
–Profile does not have FN or FP
Region finding → algorithms (by reduced p-state):
–Naive: By-call (single)
–Adaptive: Simple (single), Composite (single), Automatic (multiple)
–Static: Profile

Computer Science 19 Overall Energy Delay Product (EDP)
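For reference, the energy-delay product plotted here is the standard metric that weights energy by run time (a definition, not a result from the slides):

```latex
\mathrm{EDP} \;=\; E \times T \;=\; P \times T^{2}
```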

Computer Science 20 Comparison of p-state transition errors
[Figures: breakdown of execution time, one panel for Simple and one for Composite]

Computer Science 21 τ evaluation SP benchmark

Computer Science 22 τ evaluation (cont')
[Figures: MG, CG, BT, and LU benchmarks]

Computer Science 23 Conclusion
Contributions
–Design and implement an adaptive p-state transition system for MPI communication phases: identify reducible regions on the fly and determine the proper p-state dynamically
–Provide transparency to users
Future work
–Evaluate the performance with other applications
–Experiments on the OPT cluster

Computer Science 24

Computer Science 25 State transition diagram
Simple
[State diagram with two states, OUT and IN; transition labels: "close enough", not "close enough", begin == 1, end == 1, else]
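One plausible reading of the Simple diagram as code is sketched below; the exact guard conditions are reconstructed from the transition labels above rather than taken verbatim from the thesis.

```c
#include <stdbool.h>

/* Per-call-site prediction bits learned from the previous execution
 * ("simple" algorithm: adaptive 1-bit prediction with begin/end flags). */
typedef struct { bool begin, end; } call_flags_t;

typedef enum { OUT, IN } region_state_t;

/* Sketch: advance the OUT/IN state on each intercepted MPI call.
 * close_enough is true when the gap since the previous call is below tau. */
static region_state_t step(region_state_t s, call_flags_t f, bool close_enough)
{
    if (s == OUT) {
        /* Enter a reducible region when this call previously began one
         * and the calls are close enough together. */
        if (close_enough && f.begin)
            return IN;   /* reduce the p-state here */
        return OUT;
    }
    /* s == IN: leave the region when calls stop being close enough or this
     * call previously ended the region. */
    if (!close_enough || f.end)
        return OUT;      /* restore the top p-state here */
    return IN;
}
```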

Computer Science 26 State transition diagram (cont')
Composite
[State diagram with three states, OUT, IN, and REC; transition labels: "close enough", not "close enough", pattern mismatch, end of region, operation begins reducible region, else]

Computer Science 27 Performance

Computer Science 28 Benchmark analysis –Region information from composite with τ = 10 ms