Multiple Clock and Voltage Domains for Chip Multi Processors
December 2009
Efraim Rotem, Intel Corporation, Israel
Ran Ginosar, Technion, Israel
Avi Mendelson, Microsoft R&D, Israel
Uri Weiser, Technion, Israel

Compute Performance Matters
(Figure: historical performance growth on a log scale, with chip power rising from 1 W toward 100 W. Source: Dave Patterson.)
– Fueled by a combination of process and architecture
– We would like to keep on providing performance; power is the #1 limiter
– Both process technology and ILP slow down → multi-core architectures
– Multi-core is an order of magnitude more power efficient, but still deep in the power wall

Work Overview – Scope
How to best architect and manage the clock and voltage domains of a CMP to maximize performance under power constraints.
– 16-core, power-constrained CMP
– 1 through 16 voltage regulators (VR), either on-chip or off-chip
– 1 through 16 clock domains; FIFO buffers increase latency
Paper contributions:
– Power delivery constrains DVFS; multiple voltage domains are not so easy
– A methodology to evaluate CMP workloads
– Clustered voltage and clock domains

Operating Point and Constraints
Process technology voltages:
– Voltage range V_min to V_max
– Frequency range f_min to 2·f_min
– Nominal working point: V_min, f_min
Lower bound on quality of service:
– DFS allows frequency down to ½·f_min
Total power is a constraint:
– Must not exceed the nominal power
Power delivery has been added as a constraint. The most constraining parameter wins.

Why Is the VR a Constraint? A Simplified Example
Given a 16-core chip with a 100 A shared power delivery:
– Tying all cores to one VR allows sharing current among the cores
– A single core is allowed to consume all of the current
Assume we split the same VR into 16 per-core regulators:
– Each core gets a fixed 100 A / 16
– Sharing is no longer possible
– Keeping the original per-core capability would require 1,600 A!
(Figure: current I shared across all cores versus I/16 delivered to each core.)
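The arithmetic behind this example is small enough to spell out; here is a minimal sketch, assuming only the 16-core and 100 A figures given above (the variable names are illustrative):

```python
# Illustrative sketch of the shared-vs-split VR argument above.
# Assumed inputs: 16 cores and a 100 A shared power-delivery budget.
NUM_CORES = 16
SHARED_CURRENT_A = 100.0

# Shared VR: any single core may draw up to the whole budget.
max_current_one_core_shared = SHARED_CURRENT_A              # 100 A

# Split into 16 equal per-core VRs: no sharing is possible.
max_current_one_core_split = SHARED_CURRENT_A / NUM_CORES   # 6.25 A

# To preserve the single-core capability with split VRs, every VR must be
# rated for the full 100 A, i.e. 16 x 100 A of installed capability.
required_total_capability_a = NUM_CORES * SHARED_CURRENT_A  # 1,600 A

print(max_current_one_core_shared,
      max_current_one_core_split,
      required_total_capability_a)
```

Running it prints 100.0, 6.25 and 1600.0: splitting the regulator either caps each core at 6.25 A or inflates the installed capability sixteen-fold.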

Power Delivery Is Constrained
Power delivery headroom is needed for performance.
Replacing 1 VR by 16 individual VRs:
– Does not allow current sharing between cores
– Results in degraded power delivery
New technologies:
– Need less area / volume, BUT
– Still deliver limited current
More details in the paper.

Modeling Methodology
Workload Construction

Hybrid Model
Offline characterization of a real CPU:
– Instrumented an Intel® Core™ 2 Duo for power/performance measurements
– Characterized the behavior of SPEC-2K traces
– Extracted DVFS parameters and V/F scaling
Cycle-accurate simulation for FIFO impacts:
– 3 clocks of latency in each direction
Coded an analytic model to calculate performance:
– A function of power, frequency and workload
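As a rough illustration of what such an analytic per-core model can look like, here is a minimal sketch: relative performance as a function of frequency, driven by a per-application scalability factor and a fixed penalty for crossing an asynchronous FIFO. The functional form and the example numbers are assumptions, not the paper's fitted equations.

```python
# Sketch of an analytic per-core performance model in the spirit of the
# hybrid methodology described above. Assumed form, illustrative numbers.

def core_performance(freq_ratio: float, scalability: float,
                     fifo_penalty: float = 0.0) -> float:
    """Relative performance at a given frequency ratio (1.0 = nominal).

    scalability  -- fraction of performance that tracks frequency
                    (0 = fully memory bound, 1 = fully CPU bound)
    fifo_penalty -- relative loss from a clock-domain crossing (a few percent)
    """
    scaled = 1.0 + scalability * (freq_ratio - 1.0)
    return scaled * (1.0 - fifo_penalty)

# Example: a 70%-scalable thread run 20% above nominal frequency in its own
# clock domain, paying a 2% FIFO penalty (illustrative values).
print(core_performance(1.2, 0.70, fifo_penalty=0.02))
```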

Workload Construction
Typical multi-threaded benchmarks are insufficient:
– They are server or HPC centric, highly regular and uniform
– But client and cloud computing workloads are non-uniform
We performed a Monte-Carlo simulation:
– Used SPEC-2K as an application pool
– Randomly assigned a subset of 16 threads to the cores
– Studied both fully and partially threaded cases
– Performed all studies on the same workloads
– Repeated the workload selection and analysis 200 times
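A minimal sketch of this Monte-Carlo construction, using the SPEC-2K integer applications listed later in the deck as the pool (the helper names are illustrative, not from the paper):

```python
# Sketch of the Monte-Carlo workload construction described above: draw a
# random mix of SPEC-2K applications onto the 16 cores and repeat the study
# many times. The study logic itself is omitted.
import random

SPEC2K_POOL = ["gzip", "vpr", "gcc", "mcf", "crafty", "parser", "eon",
               "perlbmk", "gap", "vortex", "bzip2", "twolf"]
NUM_CORES = 16
NUM_EXPERIMENTS = 200

def random_workload(num_threads: int = NUM_CORES) -> list:
    """Assign num_threads randomly chosen applications to cores; any
    remaining cores (the partially threaded case) are left idle (None)."""
    threads = [random.choice(SPEC2K_POOL) for _ in range(num_threads)]
    return threads + [None] * (NUM_CORES - num_threads)

# 200 fully threaded mixes; the same mixes would then be evaluated under
# every topology and policy so that all studies share one workload set.
workloads = [random_workload() for _ in range(NUM_EXPERIMENTS)]
print(len(workloads), workloads[0])
```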

Results

Baseline: Single Voltage and Clock Domain with DVFS
– 10-25% performance gain from use of the power headroom
– Serves as the baseline for the studies that follow
Study setup: 200 random workloads, DVFS set to the lowest constraint, sorted by performance, shown as relative performance.
(Chart annotations: 16× Galgel at one end of the scale; 16× Crafty at 140%.)

Different Topologies – Fully Threaded Workloads
Example with a power supply capability of 150%.
– Some workloads gain performance and some lose compared to the baseline
– In contrast with previous studies
– Assign the budget asymmetrically
Study setup: 200 random workloads, Oracle study, three topologies vs. the baseline, each sorted independently, performance relative to the baseline.
(Chart annotations: ~50% of applications lose performance, ~50% gain performance.)
Legend: 1V – single voltage domain; nV – multiple voltage domains; 1C – single clock domain; nC – multiple clock domains.

Partially Threaded Workloads
Fewer threads → higher benefit from shared power.
(Chart annotations: regions where multiple VRs are better and where a single VR is better.)
Legend: 1C – single clock domain; nC – multiple clock domains; 1V – single voltage domain; nV – multiple voltage domains.
Oracle study.

Gaining the Best of Both Worlds: Clusters
– N clusters with 16/N cores each
– Sharing a VR between the cores in a cluster
– Setting the optimal voltage and frequency for each cluster
(Figure: four clusters, each sharing I/4 of the chip current.)
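A minimal sketch of this clustered scheme, assuming a simple rule (not from the paper) that each cluster is scaled to whatever its shared current budget allows, within the f_min to 2·f_min range quoted earlier:

```python
# Minimal sketch of the clustered topology: 16 cores split into N clusters,
# each cluster sharing 1/N of the chip current budget and running at its own
# voltage/frequency point. The frequency rule below is an illustrative
# assumption, not the paper's algorithm.

NUM_CORES = 16
F_MIN, F_MAX = 1.0, 2.0      # relative frequency range (f_min .. 2*f_min)

def cluster_frequencies(active_per_cluster, cluster_current_budget_a,
                        core_current_at_fmax_a):
    """Pick one frequency per cluster, limited by its shared current budget."""
    freqs = []
    for active in active_per_cluster:
        if active == 0:
            freqs.append(F_MIN)                 # idle cluster parked at f_min
            continue
        demand = active * core_current_at_fmax_a
        headroom = min(1.0, cluster_current_budget_a / demand)
        # Assume, for illustration, that current scales roughly with frequency.
        freqs.append(max(F_MIN, min(F_MAX, F_MAX * headroom)))
    return freqs

# Example: 4 clusters of 4 cores, a 100 A chip budget, 10 A per core at f_max.
print(cluster_frequencies([4, 2, 1, 0], 100.0 / 4, 10.0))
```

In this example the fully occupied cluster is current-limited to 1.25× f_min while lightly loaded clusters reach 2× f_min, which is the within-subset power sharing the slide aims for.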

Clusters
– The clustered topology is almost equal to the best of both topologies
– It outperforms both when the number of threads equals the number of clusters
(Chart annotation: the clustered configuration is always the best.)
Legend: 1V – single voltage domain; nV – multiple voltage domains; 1C – single clock domain; nC – multiple clock domains; xT – x threads.

How to Pick the Best Cluster Size?
– Oracle study, compared to the non-clustered topologies (by workload)
– Calculated the quadratic error from the best topology; best scenarios highlighted
– "Diagonal behavior": more constrained power delivery → larger clusters
(Table layout: columns – power delivery capability; rows – number of clusters; cells – distance from the Oracle, smaller is better.)
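The comparison metric named here can be written compactly; the sketch below assumes one plausible definition of the quadratic error (an RMS relative shortfall versus the Oracle), since the paper's exact normalization is not shown on the slide.

```python
# Sketch of the comparison metric: per-workload distance of a topology's
# performance from the best ("Oracle") result, summarized as an RMS error.
# The normalization is an assumption.
import math

def quadratic_error(perf_topology, perf_oracle):
    """Root-mean-square relative shortfall versus the Oracle, per workload."""
    errors = [(oracle - perf) / oracle
              for perf, oracle in zip(perf_topology, perf_oracle)]
    return math.sqrt(sum(e * e for e in errors) / len(errors))

# Example with three illustrative workloads (relative performance values).
print(quadratic_error([0.95, 1.00, 0.90], [1.00, 1.00, 1.00]))
```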

Summary
Power delivery is a major CPU performance constraint:
– Overlooked by previous works
– Multiple voltage domains do not allow power sharing
– Lightly threaded workloads are the most constrained
A clustered topology mitigates the sharing limitations:
– It allows sharing power within subsets of cores
– The optimal cluster size is a function of the power delivery capability
Explored non-uniform workloads:
– Different application types
– Partially vs. fully threaded workloads

Thank You

Run-Time Policies
A policy to evaluate run-time parameters and select frequencies.
Three control functions:
– Input: power or scalability
– Compute: a frequency for each core
– Scale each domain to its lowest constraint (e.g. power delivery, max frequency)
Calculated the quadratic error from the Oracle results.
(Figure: three input-to-frequency control functions – Greedy (Winner Takes All), Linear, and Polynomial; the Linear function is a linear dependency on the input.)
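As an illustration of the greedy (winner-takes-all) family, here is a minimal sketch: rank cores by how well their threads scale with frequency and grant the spare power budget to the most scalable cores first. The per-winner cap (50%/33%/10% of the surplus, mirroring the WTA-x% variants in the next slide's table) and the allocation details are assumptions, not the paper's exact policy.

```python
# Sketch of a greedy "winner takes all" run-time policy based on scalability.
# The cap per winner and the power-to-frequency conversion are assumptions.

def greedy_wta(scalability_per_core, surplus_power_w, winner_share=0.5):
    """Return the extra power granted to each core, most scalable first."""
    grants = [0.0] * len(scalability_per_core)
    remaining = surplus_power_w
    order = sorted(range(len(scalability_per_core)),
                   key=lambda i: scalability_per_core[i], reverse=True)
    for i in order:
        if remaining <= 0.0:
            break
        grant = min(remaining, winner_share * surplus_power_w)
        grants[i] = grant
        remaining -= grant
    return grants

# Example: four cores, 20 W of headroom, each winner capped at 50% of it.
print(greedy_wta([0.9, 0.2, 0.7, 0.5], 20.0, winner_share=0.5))
```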

Run-Time Policy Results
– The winning policy is a greedy (WTA) policy based on scalability; it comes very close to the Oracle
– Random and power-based policies are not good policies

Distance from the Oracle (smaller is better). WTA – Winner Takes All; SCA – scalability.

1VnC topology:
  Policy               Max      Average
  WTA 50%              5.84%    1.3%
  WTA 33%              4.41%    0.6%
  WTA 10%              1.23%    0.0%
  WTA by power 50%     22.76%   6.9%
  Linear by SCA        9.60%    6.1%
  Linear by power      49.76%   36.6%
  Polynomial by SCA    5.23%    3.3%
  Random               33.28%   19.9%

nVnC topology:
  Policy               Max      Average
  WTA 50%              2.90%    0.8%
  WTA 33%              3.37%    0.8%
  WTA 10%              4.63%    1.7%
  WTA by power 50%     4.60%    2.3%
  Linear by SCA        2.72%    1.5%
  Linear by power      5.77%    3.8%
  Polynomial by SCA    3.58%    1.5%
  Random               8.66%    4.3%

Workload Characterization
Measured the score at two frequencies and measured total CPU power.
– Scaled power = (workload power) / (max power); results range from 33% to 100%, where leakage + idle is ~30%
– Most applications use less than 100% power: even at V_max, f_max they consume less than I_max, because not all parts of the CPU are utilized
– Scalability = ΔPerf / ΔFrequency; results range from 0% to 100% (low → memory bound, high → CPU bound)
SPEC int scaled power (the slide also tabulates per-application performance scaling with frequency and FIFO impact):
  gzip 48%, vpr 44%, gcc 35%, mcf 49%, crafty 33%, parser 60%, eon 42%, perlbmk 50%, gap 45%, vortex 60%, bzip2 49%, twolf 97%, Int_rate 51%
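The two definitions on this slide translate directly into a small sketch, assuming score and total CPU power are measured at two frequency points (the example numbers are illustrative):

```python
# Sketch of the characterization math: scaled power and scalability from
# measurements at two frequencies. Function names and numbers are illustrative.

def scaled_power(workload_power_w: float, max_power_w: float) -> float:
    """Scaled power = workload power / maximum power."""
    return workload_power_w / max_power_w

def scalability(score_low: float, score_high: float,
                f_low_ghz: float, f_high_ghz: float) -> float:
    """Scalability = relative performance gain / relative frequency gain."""
    d_perf = (score_high - score_low) / score_low
    d_freq = (f_high_ghz - f_low_ghz) / f_low_ghz
    return d_perf / d_freq

# Example: gaining 12% performance from a 25% frequency increase gives a
# scalability of 0.48 (memory-bound-ish); 30 W out of a 65 W maximum gives a
# scaled power of ~46% (all numbers illustrative).
print(scalability(100.0, 112.0, 2.0, 2.5))   # 0.48
print(scaled_power(30.0, 65.0))              # ~0.46
```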

Workload Characterization (continued)
– Used cycle-accurate simulation to evaluate the FIFO impact per application (the FIFO-impact column of the table above)
– All studies are averages over the entire run, not accounting for variance over time
– The study also applies to phases within a workload

Some DVFS Model Details
– All models are built with relative values, not absolute voltages, frequencies or performance
– From min Vcc: linear scaling of frequency only
(Figure: frequency [GHz] versus voltage [V] operating curve.)
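A minimal sketch of a relative DVFS model consistent with these bullets and with the f_min to 2·f_min range from the constraints slide: above f_min the voltage scales linearly with frequency, and at V_min only the frequency is reduced (DFS). The dynamic-power form (~V²·f) and the ~30% leakage/idle fraction are assumptions drawn from textbook scaling and the characterization slide, not the paper's fitted parameters.

```python
# Sketch of a relative DVFS model. Assumed form and coefficients.
V_MIN, V_MAX = 1.0, 1.3    # relative voltage range
F_MIN, F_MAX = 1.0, 2.0    # relative frequency range (f_min .. 2*f_min)

def voltage_for_frequency(f: float) -> float:
    """Relative voltage needed to run at relative frequency f."""
    if f <= F_MIN:
        return V_MIN                        # DFS region: voltage stays at V_min
    slope = (V_MAX - V_MIN) / (F_MAX - F_MIN)
    return V_MIN + slope * (f - F_MIN)      # linear V-f scaling above f_min

def relative_power(f: float, leakage_fraction: float = 0.3) -> float:
    """Relative power at frequency f, normalized to 1.0 at (V_max, f_max)."""
    v = voltage_for_frequency(f)
    dynamic = (v / V_MAX) ** 2 * (f / F_MAX)
    return leakage_fraction + (1.0 - leakage_fraction) * dynamic

print(relative_power(F_MAX), relative_power(F_MIN), relative_power(0.5 * F_MIN))
```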

Workload Characteristics – A Few Observations
– Application power is distributed around ~60% of the maximum power: the minimum is 33% (leakage + idle power), and very few applications reach 100%
– Scalability is evenly distributed
– No correlation was found between power and scalability; this reflects out-of-order core characteristics, and a simpler core would be expected to show a positive correlation
– With a random pick of 16 cores, the overall power distribution is tighter, and there is a very low probability of all applications being high power or all low power

Why Is the VR a Constraint? Physics
(Figure: board-level power delivery path from the battery through the VR drivers, inductors and bulk capacitors to the CPU and graphics controller.)
The power delivery components need close proximity to the CPU.

Overview
How to best architect and manage the clock and voltage domains of a CMP to achieve maximum performance under power constraints.
Contributions:
– Power delivery constrains DVFS; multiple voltage domains are not so easy
– A methodology to evaluate CMP workloads
– Clustered voltage and clock domains

Work Overview – Scope
– 16-core, power-constrained CMP
– 1 through 16 voltage regulators (VR) and clock domains, either on-chip or off-chip VR
– Independent clock domains require a FIFO buffer → increased latency
Best topology? Optimal policy? Under constraints.