How to Turn the Technological Constraint Problem into a Control Policy Problem Using Slack
Brian Fields, Rastislav Bodík, Mark D. Hill
University of Wisconsin-Madison

The Problem: Managing constraints
Technological constraints dominate memory design. The familiar example:
Design: cache hierarchy
Constraint: memory latency
Non-uniformity: load latencies
Policy: what to replace?

The Problem: Managing constraints
In the future, technological constraints will also dominate microprocessor design.
Designs: clusters, fast/slow ALUs, Grid, ILDP
Constraints: wires, power, complexity
Non-uniformities: bypasses, execution latencies, L1 latencies
Policies: ? ? ?
Policy goal: minimize the effect of lower-quality resources

Key Insight: Control policy crucial
With non-uniform machines, the technological constraint problem becomes a control policy problem.

Key Insight: Control policy crucial
The best possible policy imposes delays only on instructions that can absorb them, so that execution time does not increase. It is achieved through slack: the number of cycles an instruction can be delayed without increasing execution time.

Contributions / Outline
Understanding (how to measure slack in a simulator?)
- determining slack: resource constraints are important
- reporting slack: apportion it to individual instructions
- analysis: suggest which nonuniform machines to build
Predicting (how to predict slack in hardware?)
- a simple delay-and-observe approach works well
Case study (how to design a control policy?)
- on a power-efficient machine, up to 20% speedup

Determining slack: Why hard?
The "probe the processor" approach (delay and observe):
1. Delay a dynamic instruction by n cycles.
2. See if execution time increased; if not, increase n, restart, and go to step 1.
Srinivasan and Lebeck's approximation for loads (MICRO '98) uses heuristics to predict whether execution time would increase.
Microprocessors are complex: sometimes slack is determined by resources (e.g., the ROB), not just data dependences. A sketch of the probe loop appears below.
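Below is a minimal sketch of that probe loop, assuming a hypothetical simulator entry point run_simulation(delays) that re-runs the program with the given artificial delays and returns total cycles; the geometric probing and all names are illustrative, not from the original tools.

def measure_slack(dyn_inst, run_simulation, max_probe=64):
    # Baseline run with no artificial delays.
    baseline = run_simulation(delays={})
    slack, n = 0, 1
    while n <= max_probe:
        # Re-run the entire program, delaying only this dynamic
        # instruction by n cycles.
        t = run_simulation(delays={dyn_inst: n})
        if t > baseline:      # the delay became visible: no more slack
            break
        slack, n = n, n * 2   # n cycles tolerated; probe a larger delay
    return slack

The expense is evident: one full program re-run per probe, per dynamic instruction, which is why the dependence-graph approach on the next slide is attractive.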

Determining slack
Alternative approach: dependence-graph analysis.
1. Build a resource-sensitive dependence graph.
2. Analyze it to find slack.
Casmira and Grunwald's solution (Kool Chips Workshop '00) builds graphs only from instructions in the issue window. But how can we build a resource-sensitive graph?

[Figure: a pure data-dependence graph of the example code. Slack = 0 cycles.]

[Figure: our dependence graph model (ISCA '01), with fetch (F), execute (E), and commit (C) nodes for each dynamic instruction. Slack = 0 cycles.]

[Figure: the same model with resource edges included. Slack = 6 cycles.]
Modeling resources increases observable slack. A sketch of such a graph builder follows.
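As a concrete illustration of a resource-sensitive graph, here is a sketch in the spirit of the ISCA '01 model: three nodes per dynamic instruction (F = fetch, E = execute, C = commit), with edges for program order, data dependences, and the finite instruction window. The edge set and latencies are simplified assumptions, not the paper's full model.

from collections import defaultdict

FRONTEND_LAT = 4   # illustrative fetch-to-execute pipeline depth

def build_graph(insts, rob_size):
    """insts[i] has .exec_latency and .data_deps (producer indices)."""
    g = defaultdict(list)                          # node -> [(succ, latency)]
    F = lambda i: ('F', i)
    E = lambda i: ('E', i)
    C = lambda i: ('C', i)
    for i, inst in enumerate(insts):
        g[F(i)].append((E(i), FRONTEND_LAT))       # fetch feeds execute
        g[E(i)].append((C(i), inst.exec_latency))  # commit waits on result
        if i > 0:
            g[F(i - 1)].append((F(i), 1))          # in-order fetch
            g[C(i - 1)].append((C(i), 1))          # in-order commit
        if i >= rob_size:                          # finite window: fetch of i
            g[C(i - rob_size)].append((F(i), 1))   # waits on an old commit
        for p in inst.data_deps:                   # true data dependences
            g[E(p)].append((E(i), insts[p].exec_latency))
    return g

The window edge (C to F) is exactly the kind of resource constraint a pure data-dependence graph misses, and it is what exposes the extra slack on this slide.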

Reporting slack
Global slack: the number of cycles a dynamic operation can be delayed without increasing execution time.
Apportioned slack: global slack distributed among operations using an apportioning strategy.
[Figure: example graph annotated with global slack (GS) and apportioned slack (AS) values.]

Slack measurements (Perl)
Simulated machine: 6-wide out-of-order superscalar, 128-entry issue window, 12-stage pipeline.

[Plot: cumulative slack distribution for Perl, with curves for global and apportioned slack.]

Analysis via apportioning strategy
What non-uniform designs can slack tolerate?
Example design: fast/slow ALUs. Non-uniformity: execution latency. Apportioning strategy: double latency.
Good news: 80% of dynamic instructions can have their latency doubled.

Contributions / Outline
Understanding (how to measure slack in a simulator?)
- determining slack: resource constraints are important
- reporting slack: apportion it to individual instructions
- analysis: suggest which nonuniform machines to build
Predicting (how to predict slack in hardware?)
- a simple delay-and-observe approach works well
Case study (how to design a control policy?)
- on a power-efficient machine, up to 20% speedup

Measuring slack in hardware
Goal: determine whether a static instruction has n cycles of slack (delay and observe, using the ISCA '01 analyzer):
1. Delay a dynamic instance by n cycles.
2. Check whether it became critical (via the critical-path analyzer):
a) No: the instruction has n cycles of slack.
b) Yes: the instruction does not have n cycles of slack.

Two predictor designs
1. Explicit slack predictor: retry delay and observe with different values of slack. Problem: obtaining unperturbed measurements.
2. Implicit slack predictor: delay and observe using the machine's natural non-uniform delays; "bin" instructions to match the non-uniform hardware.

Contributions / Outline
Understanding (how to measure slack in a simulator?)
- determining slack: resource constraints are important
- reporting slack: apportion it to individual instructions
- analysis: suggest which nonuniform machines to build
Predicting (how to predict slack in hardware?)
- a simple delay-and-observe approach works well
Case study (how to design a control policy?)
- on a power-efficient machine, up to 20% speedup

Fast/slow pipeline microarchitecture
[Diagram: fetch + rename steers into two 3-wide pipelines (each with its own issue window, register file, and ALUs, sharing a data cache), connected by a bypass bus. The slow pipeline saves ~37% of core power, per P ∝ f².]
The design has three nonuniformities:
1. higher execution latencies,
2. increased (cross-domain) bypass latency,
3. decreased effective issue bandwidth.
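As a sanity check on the ~37% figure, assuming the slide's P ∝ f² scaling and a slow pipeline clocked at half the fast pipeline's frequency: the slow pipeline draws (1/2)² = 1/4 of a fast pipeline's power, so fast + slow costs about 1 + 0.25 = 1.25 units versus 2 units for two fast pipelines, a savings of 0.75 / 2 ≈ 37%.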

Selecting bins for the implicit slack predictor
Use the implicit slack predictor with four (2²) bins, reflecting two decisions:
1. Steer to the fast or slow pipeline, then
2. Schedule with high or low priority within that pipeline.
The four bins are the cross product {fast, slow} x {high, low}.

Putting it all together
[Diagram: prediction path: the PC indexes a 4 KB slack prediction table, which supplies a slack bin number to the fast/slow pipeline core. Training path: a criticality analyzer (~1 KB) feeds a 4-bin slack state machine that updates the table.]
A sketch of this table-plus-training loop follows.
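Here is a hedged sketch of that datapath: a PC-indexed table maps each static instruction to one of the four bins (fast/slow pipeline crossed with high/low issue priority). The table geometry, indexing, and training rule are illustrative assumptions, not the paper's exact design.

TABLE_ENTRIES = 16384            # 16K entries x 2 bits/entry = 4 KB

class SlackPredictor:
    def __init__(self):
        self.bins = [0] * TABLE_ENTRIES   # bin 0 = fast/high (optimistic)

    def predict(self, pc):
        b = self.bins[pc % TABLE_ENTRIES]
        steer_slow = b >= 2               # bins 2,3 go to the slow pipeline
        low_priority = (b % 2 == 1)       # bins 1,3 issue with low priority
        return steer_slow, low_priority

    def train(self, pc, was_critical):
        # Implicit delay-and-observe: the instruction just executed under
        # its bin's natural delays, and the criticality analyzer reports
        # whether that hurt. Move toward fast/high if it was critical,
        # toward slow/low if it was not.
        i = pc % TABLE_ENTRIES
        if was_critical:
            self.bins[i] = max(0, self.bins[i] - 1)
        else:
            self.bins[i] = min(3, self.bins[i] + 1)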

Fast/slow pipeline performance
[Plot: performance of two fast, high-power pipelines vs. the slack-based policy vs. reg-dep steering.]

Slack used up
[Plot: average global slack per dynamic instruction remaining under two fast, high-power pipelines, the slack-based policy, and reg-dep steering.]

Conclusion: Future processor design flow
Future processors will be non-uniform; a slack-based policy can control them.
1. Measure slack in a simulator: decide early what designs are worth building.
2. Predict slack in hardware: the implementation is simple.
3. Design a control policy: map policy decisions to slack bins.

Backup slides

Define and compute local slack
Local slack: the number of cycles an edge's latency can be increased without delaying subsequent instructions. It is computed from node arrival times.
[Figure: example graph with edges carrying 2 cycles and 1 cycle of local slack.]
In real programs, ~20% of instructions have local slack of at least 5 cycles. A sketch of the computation appears below.
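A minimal sketch of that computation, reusing the node -> [(successor, latency)] graph shape from the builder sketch earlier: arrival times are longest-path distances, and an edge's local slack is how far it can stretch before it sets its consumer's arrival time.

from collections import defaultdict

def arrival_times(g, topo_order):
    # A node's event occurs when its latest-arriving input completes.
    arrival = defaultdict(int)
    for u in topo_order:
        for v, lat in g[u]:
            arrival[v] = max(arrival[v], arrival[u] + lat)
    return arrival

def local_slack(arrival, u, v, latency):
    # Cycles edge u -> v can stretch before it delays v.
    return arrival[v] - (arrival[u] + latency)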

Define global slack
Global slack: the number of cycles an edge's latency can be increased without delaying the last instruction in the program.
In real programs, >90% of instructions have global slack of at least 5 cycles.

Compute global slack
Calculate global slack by propagating backward through the graph, accumulating local slacks:
GS_6 = LS_6 = 0
GS_5 = LS_5 = 2
GS_3 = GS_6 + LS_3 = 1
GS_1 = min(GS_3, GS_5) + LS_1 = 2
A code sketch of this backward pass follows.
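A sketch of the backward pass, using the arrival times from the previous sketch: an edge into a sink keeps just its local slack, and otherwise GS(edge) = LS(edge) plus the minimum GS among the consumer's outgoing edges, matching the recurrence above.

def global_slack(g, topo_order, arrival):
    node_gs = {}                # min global slack over a node's out-edges
    edge_gs = {}
    for u in reversed(topo_order):
        for v, lat in g[u]:
            ls = arrival[v] - (arrival[u] + lat)   # local slack of u -> v
            gs = ls + node_gs.get(v, 0)            # accumulate backward
            edge_gs[(u, v)] = gs
            node_gs[u] = min(node_gs.get(u, gs), gs)
        node_gs.setdefault(u, 0)                   # node with no successors
    return edge_gs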

Apportioned slack
Goal: distribute slack to the instructions that need it. The apportioning strategy therefore depends on the nature of the machine's non-uniformities.
Example: non-uniformity: two bypass-bus speeds (1 cycle and 2 cycles); strategy: give 1 cycle of slack to as many edges as possible.

Define apportioned slack
Apportioned slack: global slack distributed among edges. For example:
GS_1 = 2, AS_1 = 1; GS_2 = 1, AS_2 = 1; GS_3 = 1, AS_3 = 0; GS_5 = 2, AS_5 = 1
In real programs, >75% of instructions can be apportioned slack of at least 5 cycles. A sketch of one simple apportioning pass appears below.
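One simple apportioning pass, sketched below: try to grant each edge a target amount of slack (for latency+1 apportioning, target = latency + 1) and keep only the grants that do not stretch the program's finish time. This greedy check-and-keep loop is an illustrative approximation, not the paper's exact algorithm, and it is far too slow for anything but small graphs.

from collections import defaultdict

def finish_time(g, topo_order, extra):
    t = defaultdict(int)
    for u in topo_order:
        for v, lat in g[u]:
            t[v] = max(t[v], t[u] + lat + extra.get((u, v), 0))
    return max(t.values(), default=0)

def apportion(g, topo_order, target):
    deadline = finish_time(g, topo_order, {})   # baseline execution time
    granted = {}
    for u in topo_order:
        for v, lat in g[u]:
            trial = dict(granted)
            trial[(u, v)] = target(lat)         # e.g. lambda lat: lat + 1
            if finish_time(g, topo_order, trial) <= deadline:
                granted = trial                 # grant is harmless; keep it
    return granted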

Slack measurements
[Plot: cumulative slack distributions, with curves for local, apportioned, and global slack.]

Multi-speed ALUs
Can we tolerate ALUs running at half frequency? Yes, but:
1. Does this hold for all types of operations (needed for multi-speed clusters)?
2. Can we make all integer ops double latency?

Load slack
Can we tolerate a long-latency L1 hit?
Design: a wire-constrained machine, e.g. Grid. Non-uniformity: multi-latency L1. Apportioning strategy: apportion ALL slack to load instructions.

Apportion all slack to loads
[Plot: most loads can tolerate an L2 cache hit.]

Multi-speed ALUs
Can we tolerate ALUs running at half frequency?
Design: fast/slow ALUs. Non-uniformity: multi-latency execution and bypass. Apportioning strategy: give each instruction slack equal to its original latency + 1.

Latency+1 apportioning
[Plot: most instructions can tolerate doubling their latency.]

Breakdown by operation (Latency+1 apportioning)

Validation
Two steps:
1. Increase instruction latencies by their apportioned slack, under three apportioning strategies: (1) latency+1, (2) 5 cycles to as many instructions as possible, (3) 12 cycles to as many loads as possible.
2. Compare to the baseline (no delays inserted).

Validation result: worst-case inaccuracy of 0.6%.

Predicting slack
Two steps to PC-indexed, history-based prediction:
1. Measure the slack of a dynamic instruction.
2. Store it in an array indexed by the PC of the static instruction.
Need: locality of slack (it can capture 80% of the potential exploitable slack).
Need: the ability to measure the slack of a dynamic instruction.

Locality of slack experiment
For each static instruction:
1. Measure the percentage of slackful dynamic instances.
2. Multiply by the number of dynamic instances.
3. Sum across all static instructions.
4. Compare to the total number of slackful dynamic instructions (the ideal case).
(Slackful = has enough apportioned slack to double its latency.) A sketch of this bookkeeping appears below.
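A sketch of that bookkeeping, assuming one (pc, slackful) sample per dynamic instance; the majority rule below is an assumption standing in for whatever thresholding the experiment actually used.

from collections import Counter, defaultdict

def slack_locality(samples):
    # samples: iterable of (pc, slackful) pairs, one per dynamic instance.
    per_pc = defaultdict(Counter)
    for pc, slackful in samples:
        per_pc[pc][slackful] += 1
    # Ideal: every slackful dynamic instance is captured.
    ideal = sum(c[True] for c in per_pc.values())
    # Per-PC prediction: a PC predicted slackful (its majority class)
    # captures all of its slackful instances.
    captured = sum(c[True] for c in per_pc.values() if c[True] >= c[False])
    return captured / ideal if ideal else 1.0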

Locality of slack
[Plot: locality-of-slack results.]
A PC-indexed, history-based predictor can capture most of the available slack.

Predicting slack (recap)
Locality of slack: established; it can capture 80% of the potential exploitable slack.
Still needed: the ability to measure the slack of a dynamic instruction.

Measuring slack in hardware (recap)
Goal: determine whether a static instruction has n cycles of slack. Delay a dynamic instance by n cycles, then check whether it became critical via the critical-path analyzer: if not, it has n cycles of slack (delay and observe).

Review: Critical-path analyzer (ISCA '01)
There is no need to measure latencies; just observe last-arriving edges. Plant a token at a node and propagate it forward along last-arriving edges: if the token survives to the end of the program, the node is critical; if the token dies, the node is noncritical. A sketch of the token pass appears below.
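A sketch of the token pass, assuming last_arriving is a set of edges precomputed from observed arrival times (an edge is last-arriving if it determined its consumer's arrival time):

def is_critical(node, g, last_arriving, sink):
    frontier, visited = {node}, set()
    while frontier:
        if sink in frontier:
            return True                        # token survived to the end
        visited |= frontier
        nxt = set()
        for u in frontier:
            for v, _ in g[u]:
                if (u, v) in last_arriving and v not in visited:
                    nxt.add(v)                 # ride the last-arriving edge
        frontier = nxt
    return False                               # token died: noncritical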

Baseline policies (existing, not based on slack)
1. Simple register-dependence steering (reg-dep).
2. Send to the fast cluster until the window is half full (fast-first win).
3. Send to the fast cluster until too many instructions are ready (fast-first rdy).
Minimal sketches of these heuristics follow.
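Minimal sketches of the three heuristics; the thresholds and names are illustrative.

def steer_reg_dep(inst, cluster_of):
    # reg-dep: join the cluster that holds one of this instruction's
    # producers, defaulting to the fast cluster.
    for p in inst.data_deps:
        if p in cluster_of:
            return cluster_of[p]
    return 'fast'

def steer_fast_first_win(fast_window_occupancy, window_capacity):
    # fast-first win: fill the fast cluster until its window is half full.
    return 'fast' if fast_window_occupancy < window_capacity // 2 else 'slow'

def steer_fast_first_rdy(fast_ready_count, ready_limit=6):
    # fast-first rdy: fill the fast cluster until too many insts are ready.
    return 'fast' if fast_ready_count < ready_limit else 'slow'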

Baseline policies (existing, not based on slack)
[Plot: performance of 2 fast clusters vs. register-dependence, fast-first window, and fast-first ready steering.]

Slack-based policies
[Plot: performance of 2 fast clusters vs. token-passing slack, ALOLD slack, and reg-dep steering.]
The slack-based policies gain 10% performance by hiding non-uniformities.

Extra slow cluster (still saves ~25% core power)
[Plot: performance of 2 fast clusters vs. token-passing slack, ALOLD slack, and the best existing policy.]