Presentation is loading. Please wait.

Presentation is loading. Please wait.

How to Turn the Technological Constraint Problem into a Control Policy Problem Using Slack Brian FieldsRastislav BodíkMark D. Hill University of Wisconsin-Madison.

Similar presentations


Presentation on theme: "How to Turn the Technological Constraint Problem into a Control Policy Problem Using Slack Brian FieldsRastislav BodíkMark D. Hill University of Wisconsin-Madison."— Presentation transcript:

1

2 How to Turn the Technological Constraint Problem into a Control Policy Problem Using Slack Brian FieldsRastislav BodíkMark D. Hill University of Wisconsin-Madison

3 The Problem: Managing constraints Technological constraints dominate memory design Non-uniformity: Load latencies Cache hierarchy Design: Memory latency Constraint: Policy: What to replace?

4 The Problem: Managing constraints In the future, technological constraints will also dominate microprocessor design  Policy Goal: Minimize effect of lower-quality resources Clusters Fast/Slow ALUs Grid, ILDP Design: Wires Power Complexity Constraint: Non-uniformity: Bypasses Exe. Latencies L1 latencies Policy: ? ? ?

5 Key Insight: Control policy crucial With non-uniform machines, the technological constraint problem becomes a control policy problem

6 Key Insight: Control policy crucial The best possible policy: Delays are imposed only on instructions so that execution time is not increased Achieved through slack: The amount an instruction can be delayed without increasing execution time

7 Contributions/Outline Understanding (measure slack in a simulator?) determining slack: resource constraints important reporting slack: apportion to individual instructions analysis: suggest nonuniform machines to build Predicting (how to predict slack in hardware?) simple, delay and observe approach works well Case study (how to design a control policy?) on power-efficient machine, up to 20% speedup

8 Determining slack: Why hard? “Probe the processor” approach: Delay and observe 1. Delay dynamic instruction by n cycles 2. See if execution time increased a) No, increase n; restart; go to step 1 Srinivasan and Lebeck approximation, for loads (MICRO ’98) heuristics to predict execution time increase Microprocessors are complex: Sometimes slack is determined by resources (e.g. ROB)

9 Determining slack Alternative approach: Dependence-graph analysis 1. Build resource-sensitive dependence graph 2. Analyze to find slack Casmira and Grunwald’s solution (Kool Chips Workshop ’00) Graphs only with instructions in issue window But, how to build resource-sensitive graph?

10 Data-Dependence Graph 1 11 1 1 2 3 Slack = 0 cycles

11 Our Dependence Graph Model (ISCA ‘01) EEEEE FFFFF CCCCC Slack = 0 cycles

12 Our Dependence Graph Model (ISCA ‘01) E 1 EEEE 11 1 1 2 3 FFFFF CCCCC 1 1 1 1 1 1 1 1 1 1 10 00 1 1 00 1 Slack = 6 cycles  Modeling resources increases observable slack

13 Reporting slack Global slack: # cycles a dynamic operation can be delayed without increasing execution time Apportioned slack: Distribute global slack among operations using an apportioning strategy 12 10 GS = 15 3 35 0 0 AS = 10 AS = 5

14 Slack measurements (Perl) 6-wide out-of-order superscalar 128-entry issue window 12-stage pipeline

15 Slack measurements (Perl) global

16 Slack measurements (Perl) apportioned global

17 Analysis via apportioning strategy What non-uniform designs can slack tolerate? Design Non-uniformity App. Strategy Fast/slow ALU Exe. latency Double latency Good news: 80% of dynamic instructions can have latency doubled

18 Contributions/Outline Understanding (measure slack in a simulator?) determining slack: resource constraints important reporting slack: apportion to individual instructions analysis: suggest nonuniform machines to build Predicting (how to predict slack in hardware?) simple, delay and observe approach works well Case study (how to design a control policy?) on power-efficient machine, up to 20% speedup 

19 Measuring slack in hardware Goal: Determine whether static instruction has n cycles of slack 1. Delay a dynamic instance by n cycles 2. Check if critical (via critical-path analyzer): a) No, instruction has n cycles of slack b) Yes, instruction does not have n cycles of slack delay and observe ISCA ‘01

20 Two predictor designs 2. Implicit slack predictor delay and observe with natural non-uniform delays “Bin” instructions to match non-uniform hardware 1. Explicit slack predictor Retry delay and observe with different values of slack Problem: obtaining unperturbed measurements

21 Contributions/Outline Understanding (measure slack in a simulator?) determining slack: resource constraints important reporting slack: apportion to individual instructions analysis: suggest nonuniform machines to build Predicting (how to predict slack in hardware?) simple, delay and observe approach works well Case study (how to design a control policy?) on power-efficient machine, up to 20% speedup 

22 Fast/slow pipeline microarchitecture Data Cache WIN Reg WIN Reg Fast, 3-wide pipeline Slow, 3-wide pipeline ALUs Fetch + Rename Design has three nonuniformities: Higher execution latencies Increased (cross-domain) bypass latency Decreased effective issue bandwidth Steer Bypass Bus P  F 2 save ~37% core power

23 Selecting bins for implicit slack predictor Use implicit slack predictor with four (2 2 ) bins: Two decisions 1.Steer to fast/slow pipeline, then 2.Schedule with high/low priority within a pipeline High Low Fast Slow 1 Steer Schedule 2 3 4

24 Putting it all together Slack prediction table 4 KB Fast/slow pipeline core Slack bin # Training Path PC Prediction Path Criticality Analyzer ~1 KB 4-bin slack state machine

25 Fast/slow pipeline performance 2 fast, high-power pipelines slack-based policy reg-dep steering

26 Slack used up Average global slack per dynamic instruction 2 fast, high-power pipelines slack-based policy

27 Slack used up Average global slack per dynamic instruction 2 fast, high-power pipelines slack-based policy reg-dep steering

28 Conclusion: Future processor design flow Future processors will be non-uniform. A slack-based policy can control them. 1. Measure slack in a simulator decide early on what designs to build 2. Predict slack in hardware simple implementation 3. Design a control policy policy decisions  slack bins

29 Backup slides

30 Define local slack 1 11 1 1 1 3 Define Local Slack: # cycles edge latency can be increased without delaying subsequent instructions 2 cycles 1 cycle In real programs, ~20% insts have local slack of at least 5 cycles

31 Compute local slack 1 11 1 1 1 3 1 3 3 21 5 4 2 cycles 1 cycle Define Local Slack: # cycles edge latency can be increased without delaying subsequent instructions In real programs, ~20% insts have local slack of at least 5 cycles Arrival Time

32 Define global slack Global Slack: # cycles edge latency can be increased without delaying the last instruction in the program 1 11 1 1 1 3 2 cycles 1 cycle In real programs, >90% insts have global slack of at least 5 cycles

33 Compute global slack Calculate global slack: backward propagate, accumulating local slacks LS 5 =2 LS 3 =1 LS 1 =1 LS 2 =0 GS 3 =GS 6 +LS 3 =1 GS 1 =MIN(GS 3,GS 5 )+LS 1 =2 GS 6 =LS 6 =0 GS 5 =LS 5 =2 In real programs, >90% insts have global slack of at least 5 cycles

34 Apportioned slack Goal: Distribute slack to instructions that need it Thus, apportioning strategy depends upon nature of non-uniformities in machine e.g.: non-uniformity: 2 speed bypass busses (1 cycle, 2 cycle) strategy: give 1 cycle slack to as many edges as possible

35 Define apportioned slack Apportioned slack: Distribute global slack among edges For example: GS 3 =1, AS 3 =0 GS 2 =1, AS 2 =1 GS 1 =2, AS 1 =1GS 5 =2, AS 5 =1 In real programs, >75% insts can be apportioned slack of at least 5 cycles

36 Slack measurements local apportioned global

37 Multi-speed ALUs Can we tolerate ALUs running at half frequency? Yes, but: 1. For all types of operations? (needed for multi-speed clusters) 2. Can we make all integer ops double latency?

38 Load slack Can we tolerate a long-latency L1 hit? design: wire-constrained machine, e.g. Grid non-uniformity: multi-latency L1 apportioning strategy: apportion ALL slack to load instructions

39 Apportion all slack to loads Most loads can tolerate an L2 cache hit

40 Multi-speed ALUs Can we tolerate ALUs running at half frequency? design: fast/slow ALUs non-uniformity: multi-latency execution latency, bypass apportioning strategy: give slack equal to original latency + 1

41 Latency+1 apportioning Most instructions can tolerate doubling their latency

42 Breakdown by operation (Latency+1 apportioning)

43 Validation Two steps: 1. Increase latencies of insts. by their apportioned slack for three apportioning strategies: 1) latency+1, 2) 5-cycles to as many instructions as possible, 3) 12-cycles to as many loads as possible 2. Compare to baseline (no delays inserted)

44 Validation Worst case: Inaccuracy of 0.6%

45 Predicting slack Two steps to PC-indexed, history-based prediction: 1. Measure slack of a dynamic instruction 2. Store in array indexed by PC of static instruction Need: Locality of slack can capture 80% of potential exploitable slack Need: Ability to measure slack of a dynamic instruction

46 Locality of slack experiment For each static instruction: 1. Measure % slackful dynamic instances 2. Multiply by # of dynamic instances 3. Sum across all static instructions 4. Compare to total slackful dynamic instructions (ideal case) slackful = has enough apportioned slack to double latency

47 Locality of slack

48

49 PC-indexed, history-based predictor can capture most of the available slack

50 Predicting slack Two steps to PC-indexed, history-based prediction: 1. Measure slack of a dynamic instruction 2. Store in array indexed by PC of static instruction Need: Locality of slack can capture 80% of potential exploitable slack Need: Ability to measure slack of a dynamic instruction

51 Measuring slack in hardware Goal: Determine whether static instruction has n cycles of slack 1. Delay a dynamic instance by n cycles 2. Check if critical (via critical-path analyzer): a) No, instruction has n cycles of slack b) Yes, instruction does not have n cycles of slack delay and observe 

52 Review: Critical-path analyzer (ISCA ’01) 1 11 1 1 1 4

53 Don’t need to measure latencies

54 Review: Critical-path analyzer (ISCA ’01) Just observe last-arriving edges

55 Review: Critical-path analyzer (ISCA ’01) Plant token and propagate forward If token survives, node is critical If token dies, node is noncritical

56 Baseline policies (existing, not based on slack) 1.Simple reg dep steering (reg dep) Send to fast cluster until: 2.Window half full (fast-first win) 3.Too many ready insts (fast-first rdy)

57 Baseline policies (existing, not based on slack) 2 fast clusters register dependence fast-first window fast-first ready

58 Slack-based policies 2 fast clusters token-passing slack ALOLD slack reg-dep steering 10% better performance from hiding non-uniformities

59 Extra slow cluster (still save ~25% core power) 2 fast clusters token-passing slack ALOLD slack best-existing policy


Download ppt "How to Turn the Technological Constraint Problem into a Control Policy Problem Using Slack Brian FieldsRastislav BodíkMark D. Hill University of Wisconsin-Madison."

Similar presentations


Ads by Google