Dec-2009 Chip with Multiple Clock and Voltage Domains
Multiple Clock and Voltage Domains for Chip Multi Processors, December 2009, Efraim Rotem, Intel

1 Multiple Clock and Voltage Domains for Chip Multi Processors
December 2009
Efraim Rotem, Intel Corporation, Israel
Ran Ginosar, Technion, Israel
Avi Mendelson, Microsoft R&D, Israel
Uri Weiser, Technion, Israel

2 Compute Performance matters
[Figure: performance over time, 1978 to 2006, log scale from 1 to 10,000, with 1W, 10W and 100W power annotations. Source: Dave Patterson]
Fueled by a combination of process and arch
We would like to keep on providing performance – Power is #1 limiter
Both process technology and ILP slow down → multi core architectures
An order of magnitude more power efficient but deep in the power wall

3 Work Overview - scope
How to best architect and manage clock and voltage domains of a CMP to max performance under power constraints
16 core power constrained CMP
1 thru 16 voltage regulators (VR)
–Either on chip or off chip VR
1 thru 16 clock domains
–FIFO buffers increase latency
Paper contributions:
–Power delivery constrains DVFS
    Multi-voltage domains not so easy
–Methodology to evaluate CMP workloads
–Clustered voltage and clock domains

4 Operation point and constraints
Process technology voltages
–Voltage range V_min to V_max
–Frequency range f_min to 2 f_min
–Nominal working point V_min, f_min
Lower bound on quality of service
–Frequency DFS down to ½ f_min
Total power is a constraint
–Must not exceed nominal power
Power delivery has been added as a constraint
Most constraining parameter wins
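
The "most constraining parameter wins" rule can be illustrated with a minimal Python sketch. It assumes each constraint has already been converted into the highest frequency it would allow on its own, expressed relative to f_min. The function name `select_frequency`, the dictionary keys and the two sample limits are hypothetical and are not taken from the talk.

```python
def select_frequency(limits, floor=0.5):
    """Return (binding_constraint, frequency) in units of f_min.

    limits: maps a constraint name to the highest frequency that
            constraint alone would permit (hypothetical inputs).
    floor:  quality-of-service lower bound, 1/2 f_min on the slide.
    """
    if not limits:
        raise ValueError("at least one constraint is required")
    # The most constraining parameter is the one allowing the lowest frequency.
    binding = min(limits, key=limits.get)
    # Never scale below the quality-of-service floor.
    return binding, max(limits[binding], floor)


example = {
    "process_max": 2.0,      # frequency range tops out at 2 * f_min
    "total_power": 1.4,      # illustrative value, not from the talk
    "power_delivery": 1.25,  # illustrative value, not from the talk
}
print(select_frequency(example))
```

With these sample limits the power delivery entry is the binding one, which mirrors the point of the slide that power delivery has to be treated as a constraint alongside total power and the process limits.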

5 Why is VR a constraint? Simplified example
Given a 16 core 100A shared power delivery
–Tying all cores together allows sharing current among cores
–Allow one core to consume all the current
Assume we can split the same VR into 16
–Allow each core a fixed 100A / 16
–Sharing is not possible
–Keeping capability requires 1,600A!
[Figure: current per core, labeled I and Core, with the I / 16 level marked]
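
The arithmetic behind the simplified example can be written out as a short Python sketch. The 16 cores and the 100A figure come from the slide; the function and variable names are made up for illustration.

```python
CORES = 16
SHARED_VR_CURRENT_A = 100.0  # from the slide: one shared 100A supply


def per_core_cap_when_split(total_current_a, cores):
    """Fixed per-core limit when one VR is split evenly into `cores` VRs."""
    return total_current_a / cores


def total_current_to_keep_peak(peak_per_core_a, cores):
    """Total capability needed if every per-core VR must supply the old peak."""
    return peak_per_core_a * cores


# Shared VR: a single active core may draw the full 100A.
peak_shared = SHARED_VR_CURRENT_A
# Split VR: each core is capped at 100A / 16.
cap_split = per_core_cap_when_split(SHARED_VR_CURRENT_A, CORES)
# Keeping the 100A single-core peak with 16 VRs needs 16 * 100A.
needed = total_current_to_keep_peak(peak_shared, CORES)
print(cap_split, needed)  # 6.25 1600.0
```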

6 Power delivery is constrained
Need power delivery headroom for performance
Replacing 1 VR by 16 individual VRs:
–Does not allow current sharing between cores
–Results in degraded power delivery
New technologies:
–Need less area / volume, BUT
–Still deliver limited current
More details in the paper

7 Modeling methodology
Workload construction

8 Hybrid model
Offline characterization of a real CPU:
–Instrumented Intel® Core™-2 Duo for power performance measurements
–Characterized SPEC-2K traces behavior
–Extracted DVFS parameters and V/F scaling
Cycle accurate simulation for FIFO impacts
–3 clocks each direction
Coded analytic model to calculate performance
–Function of power, frequency and workload

9 Workload construction
Typical Multi Threaded benchmarks insufficient
–Server or HPC centric
    Highly regular and uniform
–But client and cloud computing is non uniform
We performed Monte-Carlo simulation
–Used SPEC-2K as an application pool
–Randomly assigned a subset of 16 threads to the cores
–Both fully and partially threaded studies
–Performed all studies on the same workload
–Repeated workload selection and analysis 200 times
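
The workload selection step might look like the Python sketch below. It is an illustration under stated assumptions, not the authors' tooling: the pool holds only the SPEC int names that appear in the characterization table of this deck (the full SPEC-2K pool is larger), applications are drawn with replacement, and a fixed seed stands in for "all studies on the same workload". `build_workloads` and its parameters are hypothetical names.

```python
import random

# SPEC int names as listed in the workload characterization slide.
APP_POOL = ["gzip", "vpr", "gcc", "mcf", "crafty", "parser",
            "eon", "perlbmk", "gap", "vortex", "bzip2", "twolf"]


def build_workloads(pool, n_cores=16, n_threads=16, trials=200, seed=0):
    """Return `trials` random workloads; each is a list of app-or-None per core."""
    if not 0 < n_threads <= n_cores:
        raise ValueError("n_threads must be between 1 and n_cores")
    rng = random.Random(seed)  # fixed seed so every study sees the same workloads
    workloads = []
    for _ in range(trials):
        assignment = [None] * n_cores
        # Partially threaded: only n_threads of the cores are busy.
        busy = rng.sample(range(n_cores), n_threads)
        for core in busy:
            # Assumption: apps are drawn with replacement.
            assignment[core] = rng.choice(pool)
        workloads.append(assignment)
    return workloads


fully = build_workloads(APP_POOL)
partial = build_workloads(APP_POOL, n_threads=4)
print(len(fully), sum(app is not None for app in partial[0]))
```

Reusing the same seed for every topology under study keeps the comparison fair, since each topology is then evaluated on an identical set of 200 workloads.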

10 Results

11 Baseline: Single Voltage and Clock DVFS
10-25% performance gain from use of power headroom
Serves as baseline for the studies to follow
200 random workloads
DVFS to lowest constraint
Sorted by performance
Shown relative performance
[Figure: relative performance of the 200 workloads, sorted; x axis from 20 to 200; 100% = 16 x Galgel, 140% = 16 x Crafty]

12 Different topologies - Fully threaded workloads
Example with power supply capability of 150%
Some workloads gain performance, some lose compared to baseline
–In contrast with previous studies – Assign budget asymmetrically
200 random workloads
Oracle study
Three topologies vs. baseline
Each sorted independently
Performance relative to baseline
50% apps lose perf
50% apps better perf
1V – Single voltage domain
nV – Multiple Voltage domains
1C – Single Clock domain
nC – Multiple Clock domains

13 Partially threaded workload
Fewer threads → higher benefit from shared power
Oracle study
[Figure: regions marked "Multi VR better" and "Single VR better"]
1V – Single voltage domain
nV – Multiple Voltage domains
1C – Single Clock domain
nC – Multiple Clock domains

14 Gaining the best of both worlds: Clusters
N clusters with 16/N cores each
Sharing VR between cores in a cluster
Setting optimal voltage frequency for each cluster
[Figure: clustered topology, each cluster supply labeled I/4]
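
A clustered topology can be sketched in a few lines of Python. The even split of the total current across clusters is an assumption read off the I/4 labels in the figure; `make_clusters` and the dictionary layout are hypothetical.

```python
def make_clusters(n_cores=16, n_clusters=4, total_current=1.0):
    """Split cores into equal clusters, each with its own share of the current.

    total_current is relative (1.0 == I); the even split is an assumption
    read off the I/4 labels in the figure.
    """
    if n_clusters <= 0 or n_cores % n_clusters != 0:
        raise ValueError("n_clusters must divide n_cores")
    size = n_cores // n_clusters
    budget = total_current / n_clusters
    return [
        {"cores": list(range(c * size, (c + 1) * size)), "current": budget}
        for c in range(n_clusters)
    ]


for cluster in make_clusters():
    print(cluster)
```

Setting `n_clusters=1` gives the single shared regulator and `n_clusters=16` gives one regulator per core, so the two extremes compared earlier are special cases of the same structure.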

15 Clusters
Clustered topology almost equal to the best of both topologies
Outperforms both when number of threads = number of clusters
Cluster always the best
1V – Single voltage domain
nV – Multiple Voltage domains
1C – Single Clock domain
nC – Multiple Clock domains
xT – X Threads

16 How to pick the best cluster size?
Oracle study
Compared to non-clustered (by workload)
Calculated quadratic error from best topology
Best scenarios highlighted
“Diagonal behavior”
–More constrained power delivery → larger clusters
[Table: columns – power delivery capability; rows – number of clusters; cells show distance from Oracle (smaller is better)]
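
The slide does not spell out how the quadratic error is computed. One plausible reading, sketched below in Python, is the root-mean-square relative shortfall of a topology against the Oracle result over all workloads. Both function names and the sample numbers in the usage lines are invented for illustration.

```python
import math


def quadratic_error(perf, oracle_perf):
    """Root-mean-square relative shortfall of `perf` against the Oracle."""
    if len(perf) != len(oracle_perf) or not perf:
        raise ValueError("need two non-empty sequences of equal length")
    squared = [((o - p) / o) ** 2 for p, o in zip(perf, oracle_perf)]
    return math.sqrt(sum(squared) / len(squared))


def best_cluster_count(perf_by_clusters, oracle_perf):
    """Pick the cluster count whose results lie closest to the Oracle."""
    return min(perf_by_clusters,
               key=lambda n: quadratic_error(perf_by_clusters[n], oracle_perf))


# Made-up relative performance for three workloads, not data from the talk.
oracle = [1.20, 1.10, 1.30]
perf_by_clusters = {
    1: [1.20, 1.00, 1.10],
    4: [1.18, 1.09, 1.28],
    16: [1.00, 1.10, 1.30],
}
print(best_cluster_count(perf_by_clusters, oracle))
```

Repeating the selection for each power delivery capability would fill one column of the table described on the slide.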

17 Summary
Power delivery is a major CPU perf. constraint
–Overlooked by previous works
–Multiple voltage domains do not allow power sharing
–Lightly threaded workloads are most constrained
Clustered topology mitigates sharing limitations
–Allows sharing power within subsets of cores
–Optimal cluster size: function of power delivery capability
Explored the non uniform workloads
–Different application types
–Partially vs. fully threaded workloads

18 Thank You

19 Run time policies
Policy to:
–Evaluate run time parameters and select frequency
Three control functions
–Input: power or scalability
–Compute: frequency for each core
Scale each domain to lowest constraint (e.g. power delivery, max freq)
Calculated quadratic error from Oracle results
[Figure: three control functions, each plotting Freq. against Input (Power / Scalability): Greedy (Winner Takes All), Linear, Polynomial; annotation: Linear dependency]
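
The three control functions are only drawn on the slide, not defined, so the Python sketch below is a guess at their general shape. It assumes the input metric (power or scalability) is normalized to the range 0 to 1, that frequencies are relative to f_min, and that the functions return a target frequency per core before any domain is scaled down to its lowest constraint. All names, the `fraction` parameter and the exponent are hypothetical.

```python
def wta(metric, fraction, f_nominal=1.0, f_boost=2.0):
    """Greedy (winner takes all): the top `fraction` of cores get f_boost."""
    n_win = max(1, round(len(metric) * fraction))
    ranked = sorted(range(len(metric)), key=lambda i: metric[i], reverse=True)
    winners = set(ranked[:n_win])
    return [f_boost if i in winners else f_nominal for i in range(len(metric))]


def linear(metric, f_nominal=1.0, f_boost=2.0):
    """Target frequency grows linearly with the input metric."""
    return [f_nominal + (f_boost - f_nominal) * m for m in metric]


def polynomial(metric, exponent=2.0, f_nominal=1.0, f_boost=2.0):
    """Target frequency grows with the metric raised to `exponent`."""
    return [f_nominal + (f_boost - f_nominal) * m ** exponent for m in metric]


# Scalability of mcf, crafty, gap and gzip from the characterization table.
scalability = [0.30, 0.99, 0.56, 0.95]
print(wta(scalability, 0.5))
print(linear(scalability))
print(polynomial(scalability))
```

In this reading, "WTA 50%" would correspond to `fraction=0.5` with scalability as the input, and "WTA by Power" to the same function fed with scaled power instead.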

20 Run time policy results
Winning policy is greedy (WTA) based on scalability
–Very close to Oracle
Random and power based policies are not good policies

Distance from Oracle (smaller is better)

1VnC                  Max       Average
WTA 50%               5.84%     1.3%
WTA 33%               4.41%     0.6%
WTA 10%               1.23%     0.0%
WTA by Power 50%      22.76%    6.9%
Linear by SCA         9.60%     6.1%
Linear by power       49.76%    36.6%
Polynomial by SCA     5.23%     3.3%
Random                33.28%    19.9%

nVnC                  Max       Average
WTA 50%               2.90%     0.8%
WTA 33%               3.37%     0.8%
WTA 10%               4.63%     1.7%
WTA by Power 50%      4.60%     2.3%
Linear by SCA         2.72%     1.5%
Linear by power       5.77%     3.8%
Polynomial by SCA     3.58%     1.5%
Random                8.66%     4.3%

WTA – Winner Take All
SCA – Scalability

21 Workload characterization
Measured score at two frequencies
Measured total CPU power (A)
–Scaled power = (Workload Power)/(Max Power)
–Results 33%-100%
    Leakage + idle is ~30%
–Most applications use less than 100% power
    Even at V_max, f_max they consume less than I_max
    Reason: Not all parts of the CPU are utilized
Scalability = ΔPerf/ΔFrequency (B)
–Result 0%-100%
    Low → Memory bound
    High → CPU bound

SPEC int     Scaled Power (A)   Perf. scaling with freq. (B)   FIFO impact (C)
gzip         48%                0.95                           0.13%
vpr          44%                0.68                           2.92%
gcc          35%                0.67                           0.92%
mcf          49%                0.30                           2.92%
crafty       33%                0.99                           0.59%
parser       60%                0.78                           1.29%
eon          42%                0.99                           0.00%
perlbmk      50%                1.00                           0.31%
gap          45%                0.56                           1.14%
vortex       60%                0.73                           1.45%
bzip2        49%                0.70                           0.71%
twolf        97%                0.99                           4.68%
Int_rate     51%                0.77                           1.42%
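
The two characterization metrics can be computed as in the Python sketch below. Scaled power follows the formula on the slide. For scalability, the sketch assumes ΔPerf and ΔFrequency are relative changes between the two measured frequencies, which is consistent with the 0%-100% result range; that interpretation, the function names and the sample numbers are assumptions.

```python
def scaled_power(workload_power, max_power):
    """Scaled power = (workload power) / (max power), as on the slide."""
    return workload_power / max_power


def scalability(score_lo, score_hi, freq_lo, freq_hi):
    """Relative performance change divided by relative frequency change."""
    d_perf = (score_hi - score_lo) / score_lo
    d_freq = (freq_hi - freq_lo) / freq_lo
    return d_perf / d_freq


# Illustrative measurements, not data from the talk.
print(scaled_power(24.0, 50.0))             # 0.48
print(scalability(100.0, 150.0, 2.0, 3.0))  # 1.0: fully CPU bound
print(scalability(100.0, 100.0, 2.0, 3.0))  # 0.0: fully memory bound
```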

22 Workload characterization
Used cycle accurate simulation to evaluate FIFO impact per application (C)
(FIFO impact column of the SPEC int table on slide 21)
All studies are averaged over the entire run, not accounting for variance over time
Study applies also to phases in workload

23 Some DVFS model details
All models are built with relative values and not absolute voltages, freq. or performance
From min Vcc – linear scaling of frequency only
[Figure: Freq [GHz] versus Voltage [V]]
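
A minimal Python sketch of a relative V/F model in this spirit is shown below. It assumes voltage stays at V_min for frequencies at or below f_min (frequency scaling only) and rises linearly up to V_max at 2 f_min. The `v_max=1.3` ratio is a made-up placeholder, and the first-order CMOS relation of dynamic power proportional to V² f is a textbook approximation rather than the model fitted in the talk. Leakage and idle power are ignored here.

```python
def voltage_for(freq, f_min=1.0, f_max=2.0, v_min=1.0, v_max=1.3):
    """Relative voltage needed for a relative frequency.

    At or below f_min the voltage is pinned at v_min, so only frequency
    scales. v_max = 1.3 is a hypothetical ratio, not a measured one.
    """
    if freq <= f_min:
        return v_min
    t = (min(freq, f_max) - f_min) / (f_max - f_min)
    return v_min + t * (v_max - v_min)


def dynamic_power(freq, f_min=1.0, v_min=1.0, **vf):
    """Relative dynamic power, using the first-order V^2 * f approximation."""
    v = voltage_for(freq, f_min=f_min, v_min=v_min, **vf)
    return (v / v_min) ** 2 * (freq / f_min)


for f in (0.5, 1.0, 1.5, 2.0):
    print(f, voltage_for(f), dynamic_power(f))
```

Below f_min power falls only linearly with frequency, while above it power grows faster than frequency because voltage rises as well, which is why the headroom above the nominal point is expensive.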

24 Workload characteristics – few observations
Application power is distributed around ~60% of max power
–Min 33% - Leakage + idle power
–Very few apps reach 100%
Scalability is evenly distributed
No correlation found between power and scalability
–OOO characteristics
–Simpler core is expected to show positive correlation
Random pick of 16 cores:
–Tighter overall power distribution
–Very low probability of all applications being high or low power

25 Why is VR a constraint - physics
[Figure: power delivery components labeled Battery, GFX, Controller, Drivers, Inductors, CPU, Bulk Cap.]
Need close proximity

26 Overview
How to best architect and manage clock and voltage domains of a CMP to achieve max performance under power constraints
Contributions:
–Power delivery constrains DVFS
    Multi-voltage domains not so easy
–Methodology to evaluate CMP workloads
–Clustered voltage and clock domains

27 Work Overview - scope
16 core power constrained CMP
1 thru 16 voltage regulators (VR) and clock domains
–Either on chip or off chip VR
Independent clock domains require a FIFO buffer → increased latency
Best topology? Optimal policy? Under constraints

