Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)

Bottleneck Analysis: determining the performance effect of an event on execution time. An event could be: an instruction's execution, an instruction-window-full stall, a branch mispredict, a network request, inter-processor communication, etc.

Why is Bottleneck Analysis Important?

Bottleneck Analysis Applications
Run-time Optimization:
- Resource arbitration: e.g., how to schedule memory accesses?
- Effective speculation: e.g., which branches to predicate?
- Dynamic reconfiguration: e.g., when to enable hyperthreading?
- Energy efficiency: e.g., when to throttle frequency?
Design Decisions:
- Overcoming technology constraints: e.g., how to mitigate the effect of long wire latencies?
Programmer Performance Tuning:
- Where have the cycles gone? e.g., which cache misses should be prefetched?

Why is Bottleneck Analysis Hard?

Current state of the art: event counts. Exe. time = (CPU cycles + Mem. cycles) x Clock cycle time, where Mem. cycles = Number of cache misses x Miss penalty (e.g., 100 cycles). [figure: two overlapping misses, each 100 cycles: 2 misses, but only 1 miss penalty of actual delay]
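
The slide's arithmetic can be made concrete with a small sketch (not from the talk; the interval-union model below is a simplification, since real overlap is dictated by the dependence graph, and all numbers are invented):

```python
# Hypothetical sketch: why event counts mislead when events overlap.
MISS_PENALTY = 100

def count_model(cpu_cycles, n_misses):
    # The count-based model charges a full penalty per miss.
    return cpu_cycles + n_misses * MISS_PENALTY

def overlap_model(cpu_cycles, miss_intervals):
    # Actual delay is closer to the union of the miss intervals.
    covered = set()
    for start, end in miss_intervals:
        covered.update(range(start, end))
    return cpu_cycles + len(covered)

cpu = 50
parallel_misses = [(0, 100), (0, 100)]      # two misses issued together
print(count_model(cpu, 2))                  # 250: charges 2 penalties
print(overlap_model(cpu, parallel_misses))  # 150: only 1 penalty exposed
```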

Parallelism in systems complicates performance understanding: a branch mispredict and a full-store-buffer stall can occur in the same cycle that three loads are waiting on the memory system and two floating-point multiplies are executing. [figures: two parallel cache misses; two parallel threads]

Criticality Challenges
- Cost: how much speedup is possible from optimizing an event?
- Slack: how much can an event be "slowed down" before increasing execution time?
- Interactions: when do multiple events need to be optimized simultaneously? When do we have a choice?
- Exploit in hardware

Our Approach

Our Approach: Criticality. Critical events affect execution time; non-critical events do not. (Bottleneck analysis: determining the performance effect of an event on execution time.)

Defining criticality: we need performance sensitivity. Slowing down a "critical" event should slow down the entire program; speeding up a "noncritical" event should leave execution time unchanged.

Standard Waterfall Diagram [figure: each instruction of an example loop (R5 = 0; R3 = 0; R1 = #array + R3; R6 = ld[R1]; R3 = R3 + 1; R5 = R6 + R5; cmp R6, 0; bf L1; ...; R0 = R5; Ret R0) passing through Fetch (F), Execute (E), Commit (C) over time]

Annotated with Dependence Edges [figure: the same waterfall with dependence edges drawn between instructions; the mispredicted branch is marked]

Annotated with Dependence Edges [figure: the dependence edges labeled by type: fetch BW, ROB, data dependence, branch misprediction]

Edge Weights Added [figure: the same diagram with a latency attached to each edge]

Convert to Graph [figure: the waterfall redrawn as a dependence graph with F, E, and C nodes for each instruction and weighted edges]
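
A minimal sketch of the graph model may help here (node names and latencies are invented for illustration): each instruction contributes F, E, and C nodes, dependences become weighted edges, and execution time is the longest path through the resulting DAG.

```python
# Toy version of the dependence-graph model: nodes are (inst, stage),
# edges carry latencies, and the critical path is the longest path.
import functools

edges = {
    ("i1", "F"): [(("i1", "E"), 1), (("i2", "F"), 1)],   # fetch BW edge
    ("i1", "E"): [(("i1", "C"), 1), (("i2", "E"), 2)],   # data dependence
    ("i1", "C"): [(("i2", "C"), 1)],                     # in-order commit
    ("i2", "F"): [(("i2", "E"), 1)],
    ("i2", "E"): [(("i2", "C"), 1)],
    ("i2", "C"): [],
}

@functools.cache
def longest_from(node):
    # Longest weighted path starting at `node` (the graph is a DAG).
    return max((lat + longest_from(succ) for succ, lat in edges[node]),
               default=0)

print(longest_from(("i1", "F")))  # 4: the critical-path length here
```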

Smaller graph instance [figure: a small F/E/C graph; one edge is non-critical (but how much slack?), and a critical icache miss appears (but how costly?)]

Add "hidden" constraints [figure: the same graph with hidden machine constraints added as edges; the non-critical edge: how much slack? the critical icache miss: how costly?]

Add "hidden" constraints [figure: with the hidden constraints in place, both questions can be answered on the graph] Slack = 13 - 7 = 6 cycles; Cost = 13 - 7 = 6 cycles.

Slack "sharing" [figure: two non-critical edges share the same slack] Slack = 6 cycles: we can delay one edge by 6 cycles, but not both!
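
To make cost and slack concrete, here is a hedged sketch on a two-path toy graph that reproduces the 13 - 7 = 6 numbers from these slides (node names and the graph shape are invented):

```python
# Toy cost/slack computation on a weighted DAG with two paths of
# length 7 and 13 (numbers chosen to match the slide).
import functools

def exec_time(w):
    succs = {"s": ["a", "b"], "a": ["t"], "b": ["t"], "t": []}
    @functools.cache
    def dist(n):
        return max((w[(n, m)] + dist(m) for m in succs[n]), default=0)
    return dist("s")

w = {("s", "a"): 7, ("a", "t"): 0, ("s", "b"): 13, ("b", "t"): 0}
base = exec_time(w)                            # 13 cycles total
cost = base - exec_time({**w, ("s", "b"): 0})  # idealize the long edge: 6
slack = 0                                      # grow the short edge until
while exec_time({**w, ("s", "a"): 7 + slack + 1}) == base:
    slack += 1                                 # execution time increases
print(base, cost, slack)                       # 13 6 6
```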

Machine Imbalance [plot: cumulative distributions of apportioned and global slack] ~80% of instructions have at least 5 cycles of apportioned slack.

Criticality Challenges
- Cost: how much speedup is possible from optimizing an event?
- Slack: how much can an event be "slowed down" before increasing execution time?
- Interactions: when do multiple events need to be optimized simultaneously? When do we have a choice?
- Exploit in hardware

Simple criticality is not always enough. Sometimes events have nearly equal criticality [figure: miss #1 (99 cycles) in parallel with miss #2 (100 cycles)]. We want to know: how critical is each event? How far from critical is each event? Actually, even that is not enough.

Our solution: measure interactions. Two parallel cache misses: miss #1 (99 cycles), miss #2 (100 cycles). Cost(miss #1) = 0; Cost(miss #2) = 1; Cost({miss #1, miss #2}) = 100. Aggregate cost > sum of individual costs ⇒ parallel interaction. icost = aggregate cost - sum of individual costs = 100 - 0 - 1 = 99.

Interaction cost (icost): icost = aggregate cost - sum of individual costs. 1. Positive icost ⇒ parallel interaction [figure: two overlapping misses]. 2. Zero icost?

Interaction cost (icost): icost = aggregate cost - sum of individual costs. 1. Positive icost ⇒ parallel interaction. 2. Zero icost ⇒ independent [figure: two non-overlapping misses]. 3. Negative icost?

Negative icost. Two serial cache misses (data dependent): miss #1 (100 cycles), then miss #2 (100 cycles), with an ALU-latency path of 110 cycles in parallel. Cost(miss #1) = ?

Negative icost. Two serial cache misses (data dependent), with a parallel ALU-latency path of 110 cycles: Cost(miss #1) = 90; Cost(miss #2) = 90; Cost({miss #1, miss #2}) = 90. icost = aggregate cost - sum of individual costs = 90 - 90 - 90 = -90. Negative icost ⇒ serial interaction.

Interaction cost (icost): icost = aggregate cost - sum of individual costs. 1. Positive icost ⇒ parallel interaction [overlapping misses]. 2. Zero icost ⇒ independent [non-overlapping misses]. 3. Negative icost ⇒ serial interaction [two misses chained through an ALU-latency path]. Other interacting events include branch mispredicts, fetch BW, load-replay traps, and LSQ stalls.
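
The icost definition is mechanical enough to script. This sketch uses a deliberately minimal timing model (not the talk's simulator) and reproduces the +99 parallel and -90 serial numbers from the slides:

```python
# Sketch: icost of two misses under a toy timing model. `serial`
# chains the misses; `alu` is an independent parallel path that
# bounds how much time idealizing a miss can save.
def exec_time(m1, m2, serial, alu=0):
    path = (m1 + m2) if serial else max(m1, m2)
    return max(path, alu)

def icost(m1, m2, serial, alu=0):
    base = exec_time(m1, m2, serial, alu)
    cost1 = base - exec_time(0, m2, serial, alu)   # idealize miss #1
    cost2 = base - exec_time(m1, 0, serial, alu)   # idealize miss #2
    cost_both = base - exec_time(0, 0, serial, alu)
    return cost_both - cost1 - cost2

print(icost(99, 100, serial=False))            # +99: parallel interaction
print(icost(100, 100, serial=True, alu=110))   # -90: serial interaction
```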

Why care about serial interactions? (Two serial misses of 100 cycles each, ALU-latency path of 110 cycles.) Reason #1: we are over-optimizing! Prefetching miss #2 doesn't help if miss #1 is already prefetched (but the overhead still costs us). Reason #2: we have a choice of what to optimize; prefetching miss #2 has the same effect as prefetching miss #1.

Icost Case Study: Deep pipelines. Looking for serial interactions! [figure: pipeline with the Dcache (DL1) access growing from 1 to 4 cycles]

Icost Breakdown (6 wide, 64-entry window)

Event(s)       gcc       gzip      vortex
DL1            18.3 %    30.5 %    25.8 %
DL1+window               -15.3
DL1+bw                     6.0
DL1+bmisp                 -3.4
DL1+dmiss                 -0.4
DL1+alu                   -8.2
DL1+imiss                  ...
...
Total                    100.0

[only the gzip column and the DL1 row were captured in this transcript]

Icost Case Study: Deep pipelines [animated figure: F/E/C graph of instructions i1-i6 with 4-cycle DL1-access edges and a window edge, stepping through how the DL1 accesses and the window edge chain together]

Criticality Challenges
- Cost: how much speedup is possible from optimizing an event?
- Slack: how much can an event be "slowed down" before increasing execution time?
- Interactions: when do multiple events need to be optimized simultaneously? When do we have a choice?
- Exploit in hardware

Criticality Analyzer: online, fast feedback; limited to critical/not critical. Replacement for Performance Counters: requires offline analysis; constructs the entire graph.

Only last-arriving edges can be critical. Observation: for R1 ← R2 + R3, if the dependence into R2 is on the critical path, then the value of R2 arrived last: critical ⇒ arrives last, but arrives last does not imply critical. [figure: E node with operands R2 and R3; the R3 dependence resolved early]

Determining last-arrive edges: observe events within the machine.
- last_arrive[F] = E→F if branch mispredict; C→F if ROB stall; F→F otherwise
- last_arrive[E] = F→E if data ready on fetch; E→E otherwise (observe the arrival order of operands)
- last_arrive[C] = E→C if the commit pointer is delayed; C→C otherwise
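
Stated as code, the rules might look like the following sketch (the inst fields are invented stand-ins for signals the hardware would observe):

```python
# Hypothetical encoding of the last-arrive rules above; each function
# returns the edge type that determined when the stage could happen.
def last_arrive_F(inst):
    if inst.redirected_by_mispredict:
        return "E->F"   # fetch restarted by a mispredicted branch
    if inst.stalled_on_full_rob:
        return "C->F"   # fetch waited for a ROB entry to commit
    return "F->F"       # otherwise limited by in-order fetch bandwidth

def last_arrive_E(inst):
    if inst.operands_ready_at_fetch:
        return "F->E"   # could execute as soon as it was fetched
    return "E->E"       # last-arriving operand's producer (observe
                        # the arrival order of operands)

def last_arrive_C(inst):
    if inst.finished_after_commit_pointer:
        return "E->C"   # commit waited on this inst's own execution
    return "C->C"       # otherwise waited on the previous commit
```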

Last-arrive edges. The last-arrive rule ⇒ the critical path (CP) consists only of "last-arrive" edges. [figure: F/E/C graph with last-arrive edges highlighted]

Prune the graph: only last-arrive edges need to be put in the graph; no other edges could be on the CP. [figure: pruned F/E/C graph, "newest" marking the most recently fetched instruction]

…and we've found the critical path! Backward-propagate along last-arrive edges from the newest node. ✓ Found the CP by only observing last-arrive edges; ✗ but this still requires constructing the entire graph.

Step 2. Reducing storage requirements. The CP is a "long" chain of last-arrive edges ⇒ the longer a given chain of last-arrive edges, the more likely it is part of the CP. Algorithm: find sufficiently long last-arrive chains. 1. Plant a token into a node n. 2. Propagate it forward, only along last-arrive edges. 3. Check for the token after several hundred cycles. 4. If the token is alive, n is assumed critical.
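
A software sketch of the token-passing idea (the data layout is invented; the hardware version is the small SRAM described later in the backup slides):

```python
# Sketch: plant a token at a node and propagate it only along
# last-arrive edges; if the token is still moving after a long
# horizon, the chain is long and the node is trained critical.
def is_critical(last_arrive_edges, plant_node, horizon=500):
    # last_arrive_edges: (cycle, src, dst) tuples in cycle order
    holding = {plant_node}      # nodes currently carrying the token
    last_move = 0
    for cycle, src, dst in last_arrive_edges:
        if cycle > horizon:
            break
        if src in holding:
            holding.add(dst)    # the token propagates forward
            last_move = cycle
    # If the token was still being passed near the horizon, assume
    # plant_node sits on a sufficiently long last-arrive chain.
    return last_move > horizon - 50   # 50 is an arbitrary guard band
```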

Online Criticality Detection [animated figure: a token is planted at a node and forward-propagated along last-arrive edges; some tokens "die", while a token that survives marks its origin as critical]

Putting it all together [block diagram: the OOO core feeds last-arrive edges (producer → retired instruction) to the token-passing analyzer on the training path; a PC-indexed CP prediction table answers "E-critical?" on the prediction path]

Results
Performance (speed):
- Scheduling in clustered machines: 10% speedup
- Selective value prediction
- Deferred scheduling (Crowe, et al.): 11% speedup
- Heterogeneous cache (Rakvic, et al.): 17% speedup
Energy:
- Non-uniform machine (fast and slow pipelines): ~25% less energy
- Instruction queue resizing (Sasanka, et al.)
- Multiple frequency scaling (Semeraro, et al.): 19% less energy with 3% less performance
- Selective pre-execution (Petric, et al.)

Exploit in Hardware. Criticality Analyzer: online, fast feedback; limited to critical/not critical. Replacement for Performance Counters: requires offline analysis; constructs the entire graph.

Profiling goal. Goal: construct the graph over many dynamic instructions. Constraint: can only sample sparsely.

Profiling goal. Goal: construct the graph; constraint: can only sample sparsely. Analogy: sequencing a genome from a DNA strand.

"Shotgun" genome sequencing [animated figure: many short random samples are taken from the DNA strand; the sequence is reassembled by finding overlaps among samples]

Mapping "shotgun" to our situation [figure: a stream of many dynamic instructions annotated with events: icache miss, dcache miss, branch mispredict, no event]
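
The matching step can be sketched in a few lines: slide a detailed sample's signature along the long sample's signature and stitch it in where the bits agree (the taken/not-taken string encoding here is illustrative, not the hardware's format):

```python
# Sketch: place a detailed sample inside the long sample by matching
# a branch-outcome signature ('T'/'N' per branch, invented encoding).
def candidate_offsets(long_sig, frag_sig):
    n, m = len(long_sig), len(frag_sig)
    return [i for i in range(n - m + 1) if long_sig[i:i + m] == frag_sig]

long_sig = "TTNTNNTTNTNTTT"   # signature built from the long sample
fragment = "TNTT"             # signature of one detailed sample
print(candidate_offsets(long_sig, fragment))  # [9]: where to stitch it
```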

Profiler hardware requirements [figure: detailed samples are matched against a long sample: Match!]

Sources of error

Error Source                          Gcc      Parser   Twolf
Modeling execution as a graph          2.1 %    6.0 %    0.1 %
Errors in graph construction           5.3 %    1.5 %    1.6 %
Sampling only a few graph fragments    4.8 %    6.5 %    7.2 %
Total                                 12.2 %   14.0 %    8.9 %

Conclusion: Grand Challenges
- Cost: how much speedup is possible from optimizing an event?
- Slack: how much can an event be "slowed down" before increasing execution time?
- Interactions: when do multiple events need to be optimized simultaneously? When do we have a choice?
Approaches developed: modeling, the token-passing analyzer, parallel interactions, serial interactions, shotgun profiling.

Conclusion: Bottleneck Analysis Applications
- Run-time optimization: effective speculation, resource arbitration, dynamic reconfiguration, energy efficiency
- Design decisions: overcoming technology constraints
- Programmer performance tuning: where have the cycles gone?
Examples from this work: selective value prediction; scheduling and steering in clustered processors; resizing the instruction window; non-uniform machines; coping with a high-latency dcache; measuring the cost of cache misses and branch mispredicts.

Outline
Simple Criticality: Definition (ISCA '01); Detection (ISCA '01); Application (ISCA '01-'02)
Advanced Criticality: Interpretation (MICRO '03): what types of interactions are possible? Hardware Support (MICRO '03, TACO '04): enhancement to performance counters

Backup Slides

Related Work

Criticality Prior Work
- Critical-Path Method, PERT charts: developed for the Navy's "Polaris" project (1957); used as a project-management tool; simple critical-path and slack concepts.
- "Attribution" heuristics: Rosenblum et al. (SOSP-1995) and many others; mark the instruction at the head of the ROB as critical, etc.; empirically have limited accuracy; do not account for interactions between events.

Related Work: Microprocessor Criticality
- Latency tolerance analysis: Srinivasan and Lebeck (MICRO-1998)
- Heuristics-driven criticality predictors: Tune et al. (HPCA-2001); Srinivasan et al. (ISCA-2001)
- "Local" slack detector: Casmira and Grunwald (Kool Chips Workshop, 2000)
- ProfileMe with pair-wise sampling: Dean et al. (MICRO-1997)

Unresolved Issues

Alternative I: Addressing Unresolved Issues
Modeling and measurement: what resources can we model effectively? (difficulty with mutual-exclusion-type resources, e.g., ALUs); efficient algorithms; release a tool for measuring cost/slack.
Hardware: detailed design for the criticality analyzer; shotgun profiler simplifications (a gradual path from counters).
Optimization: explore heuristics for exploiting interactions.

Alternative II: Chip-Multiprocessors
Design decisions: should each core support out-of-order execution? Should SMT be supported? How many processors are useful? What is the effect of inter-processor latency?
Programmer performance tuning: parallelizing applications; what makes a good division into threads? How can we find them automatically, or at least help programmers to find them?

Unresolved issues: modeling and measurement. What resources can we model effectively? Difficulty with mutual-exclusion-type resources (ALUs); in other words, unanticipated side effects. [figure: four instructions (1. ld r2, [Mem]; 2. add r3 ← r3 + 1; 3. ld r4, [Mem]; 4. add r6 ← r4 + 1). In the original execution there is no adder contention; in the altered execution used to compute the cost of inst #3's cache miss, adder contention appears, and a contention edge that "should not be here" produces an incorrect critical path]

Unresolved issues: modeling and measurement (cont.). How should processor policies be modeled? (relationship to the icost definition). Efficient algorithms for measuring icosts (pairs of events, etc.). Release a tool for measuring cost/slack.

Unresolved issues: hardware and optimization. Hardware: detailed design for the criticality analyzer (help convince industry-types to build it); shotgun profiler simplifications (a gradual path from counters). Optimization: explore icost optimization heuristics (icosts are difficult to interpret).

Validation

Validation: can we trust our model? Run two simulations: reduce CP latencies ⇒ expect a "big" speedup; reduce non-CP latencies ⇒ expect no speedup.

Validation: can we trust our model?

Validation. Two steps: 1. Increase the latencies of instructions by their apportioned slack, for three apportioning strategies: 1) latency+1; 2) 5 cycles to as many instructions as possible; 3) 12 cycles to as many loads as possible. 2. Compare to the baseline (no delays inserted).

Validation Worst case: Inaccuracy of 0.6%

Slack Measurements

Three slack variants. Local slack: the number of cycles an instruction's latency can be increased without delaying any subsequent instruction. Global slack: the number of cycles latency can be increased without delaying the last instruction in the program. Apportioned slack: distribute global slack among instructions using an apportioning strategy.
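
On the graph model, the first two variants reduce to longest-path arithmetic. A hedged sketch with invented toy weights (apportioning strategies would then split the global slack among edges):

```python
# Sketch: local vs. global slack of an edge on a toy weighted DAG.
import functools

succs = {"s": ["a", "b"], "a": ["t"], "b": ["t"], "t": []}
preds = {"t": ["a", "b"], "a": ["s"], "b": ["s"], "s": []}
w = {("s", "a"): 3, ("s", "b"): 10, ("a", "t"): 2, ("b", "t"): 1}

@functools.cache
def earliest(n):   # earliest completion = longest path from the source
    return max((earliest(p) + w[(p, n)] for p in preds[n]), default=0)

@functools.cache
def latest(n):     # latest completion that keeps total time unchanged
    return min((latest(m) - w[(n, m)] for m in succs[n]),
               default=earliest("t"))

def local_slack(u, v):   # grow without delaying any later instruction
    return earliest(v) - (earliest(u) + w[(u, v)])

def global_slack(u, v):  # grow without delaying the last instruction
    return latest(v) - (earliest(u) + w[(u, v)])

print(local_slack("a", "t"), global_slack("a", "t"))  # 6 6
print(local_slack("s", "a"), global_slack("s", "a"))  # 0 6
```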

Slack measurements [plot: slack distribution, local] ~21% of instructions have at least 5 cycles of local slack.

Slack measurements [plot: local vs. global] ~90% of instructions have at least 5 cycles of global slack.

Slack measurements [plot: local, apportioned, global] ~80% of instructions have at least 5 cycles of apportioned slack. A large amount of exploitable slack exists.

Application-centered Slack Measurements

Load slack. Can we tolerate a long-latency L1 hit? Design: a wire-constrained machine, e.g., Grid. Non-uniformity: multi-latency L1. Apportioning strategy: apportion ALL slack to load instructions.

Apportion all slack to loads Most loads can tolerate an L2 cache hit

Multi-speed ALUs. Can we tolerate ALUs running at half frequency? Design: fast/slow ALUs. Non-uniformity: multi-latency execution and bypass. Apportioning strategy: give each instruction slack equal to its original latency + 1.

Latency+1 apportioning Most instructions can tolerate doubling their latency

Slack Locality and Prediction

Predicting slack. Two steps to PC-indexed, history-based prediction: 1. Measure the slack of a dynamic instruction. 2. Store it in an array indexed by the PC of the static instruction. Two requirements: 1. Locality of slack. 2. The ability to measure the slack of a dynamic instruction.
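
A sketch of such a predictor (the table size and the conservative-minimum update policy are assumptions for illustration, not the paper's exact design):

```python
# Hypothetical PC-indexed slack predictor: measured slacks train a
# small direct-mapped table; prediction is a simple lookup.
class SlackPredictor:
    def __init__(self, entries=4096):
        assert entries & (entries - 1) == 0   # power of two
        self.table = [None] * entries         # predicted slack per PC
        self.mask = entries - 1

    def predict(self, pc):
        slack = self.table[pc & self.mask]
        return 0 if slack is None else slack  # unknown: assume no slack

    def train(self, pc, measured_slack):
        # Keep the minimum seen, so we rarely over-delay an instruction.
        i = pc & self.mask
        old = self.table[i]
        self.table[i] = measured_slack if old is None else min(old, measured_slack)

pred = SlackPredictor()
pred.train(0x400123, 6)        # slack measured by delay-and-observe
print(pred.predict(0x400123))  # 6
```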

Locality of slack

PC-indexed, history-based predictor can capture most of the available slack

Slack Detector. Problem #1: iterating repeatedly over the same dynamic instruction. Solution: only sample each dynamic instruction once. Problem #2: determining whether overall execution time increased. Solution: check whether the delay made the instruction critical ("delay and observe"), which is effective for a hardware predictor.

Slack Detector ("delay and observe"). Goal: determine whether an instruction has n cycles of slack. 1. Delay the instruction by n cycles. 2. Check whether it became critical (via the critical-path analyzer). If no, the instruction has n cycles of slack; if yes, it does not.
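
In pseudocode form (the function names are placeholders for the hardware mechanisms the talk describes):

```python
# Sketch of "delay and observe": test one slack hypothesis per
# sampled dynamic instruction.
def has_n_cycles_of_slack(inst, n, delay_inst, token_says_critical):
    delay_inst(inst, n)               # artificially stretch the latency
    if token_says_critical(inst):     # ask the critical-path analyzer
        return False                  # the delay surfaced on the CP
    return True                       # execution time unchanged
```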

Slack Application

Fast/slow cluster microarchitecture [block diagram: fetch + rename steer instructions to a fast 3-wide cluster and a slow 3-wide cluster (each with its own window, registers, and ALUs), connected by a bypass bus to a shared data cache]. An aggressive non-uniform design: higher execution latencies, increased (cross-domain) bypass latency, decreased effective issue bandwidth; with P ∝ F², the slow cluster saves ~37% core power.

Picking bins for the slack predictor. Two decisions: 1. steer to the fast or slow cluster; 2. schedule with high or low priority within a cluster. Use an implicit slack predictor with four bins: 1. steer to fast cluster + schedule with high priority; 2. steer to fast cluster + schedule with low priority; 3. steer to slow cluster + schedule with high priority; 4. steer to slow cluster + schedule with low priority.

Slack-based policies [plot: slack-based policy vs. register-dependence steering vs. two fast, high-power clusters] 10% better performance from hiding non-uniformities.

CMP case study

Multithreaded Execution Case Study. Two questions: 1. How should a program be divided into threads? What makes a good cutpoint? How can we find them automatically, or at least help programmers find them? 2. What should a multiple-core design look like? Should each core support out-of-order execution? Should SMT be supported? How many processors are useful? What is the effect of inter-processor latency?

Parallelizing an application. Why parallelize a single-thread application? Legacy code and large code bases; difficult-to-parallelize apps (interpreted code, kernels of operating systems); the desire to use better programming languages (Scheme or Java instead of C/C++).

Parallelizing an application. Simplifying assumption: the program binary is unchanged. Simplified problem statement: given a program of length L, find the cutpoint that divides the program into the two threads providing maximum speedup. Must consider: data dependences, execution latencies, control dependences, and proper load balancing.

Parallelizing an application. Naive solution: try every possible cutpoint. Our solution: efficiently determine the effect of every possible cutpoint by modeling execution before and after every cut.

Solution [figure: the F/E/C graph from the first instruction to the last, with a "start" node marking where the second thread begins]
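
A brute-force version of this search is easy to sketch on a toy in-order, two-thread model (the latencies, dependences, and 10-cycle communication cost below are invented):

```python
# Sketch: evaluate every cutpoint on a toy model. Instructions before
# the cut run on thread 0, the rest on thread 1; each thread issues in
# order, and dependences crossing the cut pay a communication latency.
def finish_time(lat, deps, cut, comm=10):
    done, prev = {}, {0: None, 1: None}
    for i in range(len(lat)):
        t = 0 if i < cut else 1
        start = done[prev[t]] if prev[t] is not None else 0
        for j in deps.get(i, []):
            cross = (j < cut) != (i < cut)
            start = max(start, done[j] + (comm if cross else 0))
        done[i] = start + lat[i]
        prev[t] = i
    return max(done.values())

lat = [3, 1, 4, 1, 5, 9, 2, 6]           # toy latencies
deps = {2: [0], 5: [4], 7: [2]}          # toy data dependences
best = min(range(1, len(lat)), key=lambda c: finish_time(lat, deps, c))
print(best, finish_time(lat, deps, best))
```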

Parallelizing an application. Considerations: synchronization overhead (add latency to E→E edges); synchronization may involve turning E→E edges into E→F edges; scheduling of threads (additional C→F edges). Challenges: state behavior when one thread is split across multiple processors (caches, branch predictor); control behavior limits where cutpoints can be made.

Parallelizing an application. More general problem: divide a program into N threads (NP-complete). Icost can help: icost(p1, p2) << 0 implies p1 and p2 are redundant; action: move p1 and p2 further apart.

Preliminary Results. Experimental setup: a simulator based loosely on SimpleScalar, running Alpha SpecInt binaries. Procedure: 1. assume the execution trace is known; 2. look at each 1k-instruction run; 3. test every possible cutpoint using 1k-instruction graphs.

Dynamic Cutpoints Only 20% of cuts yield benefits of > 20 cycles

Usefulness of cost-based policy

Static Cutpoints Up to 60% of cuts yield benefits of > 20 cycles

Future Avenues of Research. Map cutpoints back to actual code: compare automatically generated cutpoints to human-generated ones; see what the performance gains are in a simulator, as opposed to just on the graph. Look at the effect of synchronization operations: what additional overhead do they introduce? Deal with the state and control problems: might need some technique outside of the graph.

Multithreaded Execution Case Study. Two possible questions: 1. How should a program be divided into threads? What makes a good cutpoint? How can we find them automatically, or at least help programmers find them? 2. What should a multiple-core design look like? Should each core support out-of-order execution? Should SMT be supported? How many processors are useful? What is the effect of inter-processor latency?

CMP design study. What we can do: try out many configurations quickly (dramatic changes in architecture are often only small changes in the graph); identify bottlenecks, especially interactions.

CMP design study: out-of-orderness. Is out-of-order execution necessary in a CMP? Procedure: model execution with different configurations (adjust CD edges); compute breakdowns; notice resources/events interacting with CD edges.

CMP design study: out-of-orderness [figure: the F/E/C graph from the first instruction to the last instruction]

CMP design study: out-of-orderness. Results summary. Single-core: performance tops out at 256 entries. CMP: performance gains continue up through 1024 entries, and some benchmarks see gains up to 16k entries. Why is a large window more beneficial? Use breakdowns to find out...

CMP design study: out-of-orderness. Components of window cost: cache misses holding up retirement? Long strands of data dependences? Predictable control flow? Icost breakdowns give quantitative and qualitative answers.

CMP design study: out-of-orderness. cost(window) + icost(window, A) + icost(window, B) + icost(window, AB) [stacked-bar figure: window cost from 0% to 100% decomposed into an independent ALU/cache-miss component, a parallel interaction, and a serial interaction, which together equal the window cost]

Summary of Preliminary Results. icost(window, ALU operations) << 0: primarily communication between processors; the window is often stalled waiting for data. Implications: a larger window may be overkill; we need a cheap non-blocking solution, e.g., continual-flow pipelines.

CMP design study: SMT? Benefits: reduced thread start-up latency; reduced communication costs. How we could help: distribution of thread lengths; breakdowns to understand the effect of communication.

CMP design study: how many processors? [figure: threads #1 and #2 scheduled across processors]

CMP design study: other questions. What is the effect of inter-processor communication latency? (Understand hidden vs. exposed communication.) Allocating processors to programs: a methodology for the O/S to better assign programs to processors.

Waterfall To Graph Story

Standard Waterfall Diagram [figure: the example loop's instructions passing through F/E/C stages over time]

Annotated with Dependence Edges [figure: the waterfall with dependence edges labeled by type: fetch BW, data dependence, ROB, branch mispredict]

Edge Weights Added [figure: latencies attached to the edges]

Convert to Graph [figure: the waterfall redrawn as an F/E/C dependence graph]

Find Critical Path [figure: the longest weighted path through the graph highlighted]

Add Non-last-arriving Edges [figure: the remaining, non-last-arriving edges added back to the graph]

Graph Alterations [figure: the graph altered so that the branch misprediction is made correct]

Token-passing analyzer

Step 1. Observing. Observation: for R1 ← R2 + R3, if the dependence into R2 is on the critical path, then the value of R2 arrived last: critical ⇒ arrives last, but arrives last does not imply critical. [figure: E node with operands R2 and R3; the R3 dependence resolved early]

Last-arrive edges: a CPU stethoscope [figure: the CPU observed through its last-arrive edge types: E→C, E→E, F→E, C→F, F→F, E→F, C→C]

Last-arrive edges [figure: F/E/C graph showing only last-arrive edges]

Remove latencies [figure: the same graph without edge weights]: explicit weights are not needed.

Step 2. Efficient analysis. The CP is a "long" chain of last-arrive edges ⇒ the longer a given chain of last-arrive edges, the more likely it is part of the CP. Algorithm: find sufficiently long last-arrive chains. 1. Plant a token into a node n. 2. Propagate it forward, only along last-arrive edges. 3. Check for the token after several hundred cycles. 4. If the token is alive, n is assumed critical.

Token-passing example [figure: 1. plant token; 2. propagate token along last-arrive edges over a ROB-sized window; 3. is the token alive? 4. yes ⇒ train critical] ✓ Found the CP without constructing the entire graph.

Implementation: a small SRAM array [figure: a token queue written with the last-arrive producer node (inst id, type) and read at commit (inst id, type)]. Size of SRAM: 3 bits x ROB size < 200 bytes. Simply replicate for additional tokens.

Scheduling and Steering

Case Study #1: Clustered architectures [figure: steering into per-cluster issue windows, then scheduling]. Configurations: 1. current state of the art (Base); 2. Base + CP scheduling; 3. Base + CP scheduling + CP steering.

Current State of the Art [plot: unclustered vs. 2-cluster vs. 4-cluster, at constant issue width and clock frequency] Avg. clustering penalty for 4 clusters: 19%.

CP Optimizations [plots: unclustered vs. 2-cluster vs. 4-cluster, for Base + CP scheduling and then Base + CP scheduling + CP steering] Avg. clustering penalty reduced from 19% to 6%.

Token-passing Vs. Heuristics

Local vs. Global Analysis [plot: oldest-uncommitted, oldest-unissued, token-passing] Previous CP predictors make local, resource-sensitive predictions (HPCA '01, ISCA '01) ⇒ CP exploitation seems to require global analysis.

Icost case study

Icost Case Study: Deep pipelines. Deep pipelines cause long-latency loops: level-one (DL1) cache access, issue-wakeup, branch misprediction, ... But we can often mitigate them indirectly. Assume a 4-cycle DL1 access; how to mitigate it? Increase cache ports? Increase window size? Increase fetch BW? Reduce cache misses? Really, we are looking for serial interactions!

Icost Case Study: Deep pipelines [animated figure: F/E/C graph of instructions i1-i6 with 4-cycle DL1-access edges and a window edge]

Vortex Breakdowns, enlarging the window [table: icost breakdown for vortex as the window grows; rows DL1, DL1+window, DL1+bw, DL1+bmisp, DL1+dmiss, DL1+alu, DL1+imiss, ..., Total; the numeric values were not captured in this transcript]

Shotgun Profiling

Offline Profiler Algorithm [figure: one long sample is combined with many detailed samples]

Design issues: identify the microexecution context; choosing signature bits; determining PCs (for better detailed-sample matching). [figure: a long sample with a start PC; each branch encodes a taken/not-taken bit in the signature]

Sources of error

Error Source                          Gcc      Parser   Twolf
Building graph fragments               5.3 %    1.5 %    1.6 %
Sampling only a few graph fragments    4.8 %    6.5 %    7.2 %
Modeling execution as a graph          2.1 %    6.0 %    0.1 %
Total                                 12.2 %   14.0 %    8.9 %

Icost vs. Sensitivity Study

Compare Icost and Sensitivity Study. Corollary to the DL1 and ROB serial interaction: as load latency increases, the benefit from enlarging the ROB increases. [figure: F/E/C graph of instructions i1-i6 with DL1-access edges]

Compare Icost and Sensitivity Study

Sensitivity Study advantages: more information, e.g., concave or convex curves. Interaction cost advantages: easy (automatic) interpretation; sign and magnitude have well-defined meanings; concise communication ("DL1 and ROB interact serially").

Outline. Definition (ISCA '01): what does it mean for an event to be critical? Detection (ISCA '01): how can we determine what events are critical? Interpretation (MICRO '03, TACO '04): what does it mean for two events to interact? Application (ISCA '01-'02, TACO '04): how can we exploit criticality in hardware?

Our solution: measure interactions. Two parallel cache misses (each 100 cycles): Cost(miss #1) = 0; Cost(miss #2) = 0; Cost({miss #1, miss #2}) = 100. Aggregate cost > sum of individual costs ⇒ parallel interaction. icost = aggregate cost - sum of individual costs = 100 - 0 - 0 = 100.

Criticality Analyzer (ISCA '01). Goal: detect the criticality of dynamic instructions. Procedure: 1. observe last-arriving edges (uses simple rules); 2. propagate a token forward along last-arriving edges (at worst, a read-modify-write sequence to a small array); 3. if the token dies, the instruction is non-critical; otherwise, it is critical.

Slack Analyzer (ISCA '02). Goal: detect the likely slack of static instructions. Procedure: 1. delay the instruction by n cycles; 2. check whether it is critical (via the critical-path analyzer). If no, the instruction has n cycles of slack; if yes, it does not.

Shotgun Profiling (TACO '04). Goal: create representative graph fragments. Procedure: enhance ProfileMe-style counters with context; use the context to piece together counter samples.