Spring 2019 Prof. Eric Rotenberg

ECE 721 Trace Caches
Spring 2019, Prof. Eric Rotenberg

High ILP Processors
- Many parallel execution lanes: between 4 and 16
- Large window to find enough independent instructions to keep the execution lanes busy: large PRF, IQ, LQ/SQ
- Implications for instruction fetch: the fetch rate must match the execution rate; with 16 execution lanes, a sustained fetch rate of 16 instr./cycle is ideally needed

Instruction Fetch Issues
- Branch throughput: predict more than one branch per cycle
- Noncontiguous instruction blocks: fetch past taken branches
[Figure: instruction cache holding blocks A, B, C, linked by a not-taken and a taken branch]

Noncontiguous Instructions
- Fundamental problem: a conventional I$ stores instructions in their static order, but the decoder wants to see instructions in their dynamic order
- Two classes of hardware solutions:
  1. Keep the conventional instruction cache and construct the dynamic order on-the-fly
  2. Direct approach: cache instructions in dynamic order

Trace Cache Concept
- Fill: the first time the path is executed, a new trace {A, taken, taken} is built from the instruction cache and written into the trace cache
- Later: the trace cache is accessed with start PC A and predictions (t, t); on a match, the existing trace {A, t, t} is delivered to the DECODER

[Figure: fetch organization — fetch address A indexes the instruction cache and the trace cache in parallel; a multiple branch predictor supplies predictions to the "path match?" hit logic, which controls a MUX selecting between the trace cache and the instruction cache]

Trace Selection
- Trace selection: the constraints for forming traces
  - Hardware-determined constraints
  - Goal-oriented constraints

Trace Selection (cont.)
- Hardware-determined constraints:
  - Maximum of n instructions (trace cache line size)
  - Maximum of m basic blocks (predictor bandwidth = m predictions/cycle)

Trace Selection (cont.)
- Goal-oriented constraints:
  - Maximize the number of instructions delivered on a T$ hit: make n and m large for long traces
  - Minimize the T$ miss rate: if n and m are too large, the number of unique traces explodes, and too many unique traces can't all be stored in the T$

Trace Selection (cont.)
- Typical trace selection: n = 16 instructions, m = 3 basic blocks
- Stop at the following instruction types:
  - Indirect branches and returns: these have multiple targets, so not embedding them reduces the number of unique traces and lowers the miss rate
  - System calls
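The selection constraints above can be sketched as a fill-unit loop. This is a simplified software model, not the hardware itself; the instruction-type names and the dictionary stream representation are illustrative choices, not from the slides:

```python
N_MAX = 16  # n: trace cache line size, in instructions
M_MAX = 3   # m: basic blocks per trace (predictor bandwidth)

def build_trace(dynamic_stream, n_max=N_MAX, m_max=M_MAX):
    """Collect instructions from the dynamic stream until a
    trace-selection constraint fires."""
    trace, branches = [], 0
    for instr in dynamic_stream:
        # Don't embed instructions with multiple targets, or system
        # calls: this keeps the number of unique traces down.
        if instr["type"] in ("indirect", "return", "syscall"):
            break
        trace.append(instr)
        if instr["type"] == "branch":
            branches += 1
            if branches == m_max:
                break  # the predictor supplies only m predictions/cycle
        if len(trace) == n_max:
            break  # the trace cache line holds only n instructions
    return trace
```

A trace therefore ends at the n-th instruction, at the m-th conditional branch, or just before an indirect branch, return, or system call, whichever comes first.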

Trace Cache Contents
- n instructions
- Control information:
  - Tag: upper bits of the trace's start PC
  - Branch flags: m-1 branch directions
  - Branch mask: encodes the number of embedded branches, and whether or not the trace ends in a branch
  - Trace fall-through: next PC if the trace-ending branch is predicted not-taken, or the trace doesn't end in a branch
  - Trace target: next PC if the trace-ending branch is predicted taken
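As a data-structure sketch, one trace cache line could be modeled as the record below. The field names are mine; the slides describe the contents but do not prescribe an encoding:

```python
from dataclasses import dataclass, field

@dataclass
class TraceCacheLine:
    tag: int                # upper bits of the trace's start PC
    branch_flags: int       # m-1 recorded branch directions
    branch_mask: int        # leading 1's = number of embedded branches
    ends_in_branch: bool    # does the trace end in a conditional branch?
    fall_through: int       # next PC if trace-ending branch not taken, or no branch
    target: int             # next PC if trace-ending branch taken
    instructions: list = field(default_factory=list)  # up to n instructions
```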

Trace Cache Detail
[Figure: the fetch address feeds both the trace cache and the core fetch unit (instruction cache, BTB, RAS, interchange & shift logic) in parallel; each trace cache line holds a tag, branch flags, branch mask, fall-through address, target address, and n (16) instructions; the predictor supplies m (3) predictions to the hit logic, which drives a MUX selecting which n (16) instructions go to the DECODER; a fill unit constructs new traces]

Breakdown of Fetch Address
Assuming 32-bit addresses and 4 bytes/instr.:
- T$ index: log2(# trace cache sets) bits
- T$ tag: 32 - (# index bits) - 2 bits
- Low two bits ("00"): always 0

Example
64KB, 4-way set-associative T$; trace size = 16 4-byte instructions
- # sets = 64KB / (4 traces/set * 16 instr./trace * 4 bytes/instr.) = 256
- # index bits = log2(# sets) = 8
- Resulting field widths: T$ tag = 22 bits, T$ index = 8 bits, 00 = 2 bits
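The slide's arithmetic can be checked with a short helper (a sketch; the function and parameter names are mine):

```python
import math

def tc_geometry(capacity_bytes, assoc, n_instr, bytes_per_instr=4,
                addr_bits=32):
    """Derive the trace cache set count and address-field widths,
    assuming one trace per line."""
    line_bytes = n_instr * bytes_per_instr
    sets = capacity_bytes // (assoc * line_bytes)
    index_bits = int(math.log2(sets))
    offset_bits = int(math.log2(bytes_per_instr))  # low bits, always 0
    tag_bits = addr_bits - index_bits - offset_bits
    return sets, tag_bits, index_bits, offset_bits

# The slide's example: 64KB, 4-way, 16 instructions/trace
print(tc_geometry(64 * 1024, 4, 16))  # (256, 22, 8, 2)
```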

Branch Mask
- Encodes the number of embedded branches = number of leading 1's in the mask
- If m = 3 (up to 3 basic blocks):

  # embedded branches   branch_mask
  3                     (not possible)
  2                     11
  1                     10
  0                     00
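A sketch of the leading-1's encoding, assuming m = 3 so the mask is m-1 = 2 bits wide (function names are illustrative):

```python
def encode_branch_mask(num_embedded, m=3):
    """Build an (m-1)-bit mask with num_embedded leading 1's."""
    width = m - 1
    assert 0 <= num_embedded <= width
    # num_embedded 1's in the high bits, right-padded with 0's
    return ((1 << num_embedded) - 1) << (width - num_embedded)

def decode_branch_mask(mask, m=3):
    """Count the leading 1's of an (m-1)-bit mask."""
    width, count = m - 1, 0
    for i in range(width - 1, -1, -1):
        if (mask >> i) & 1:
            count += 1
        else:
            break
    return count
```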

Examples (n=16, m=3)
[Figure: example traces shown with their branch_mask values (11, 10, 00) and their ends_in_branch flags]

T$ Hit Logic
- Use the index to read out the trace(s)
- Hit conditions:
  1. Start PC check: tag from fetch address == tag from trace cache
  2. Path check: (branch_mask & predictions) == (branch_mask & branch_flags)
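The two hit conditions translate directly into a predicate. In this sketch, `line` stands for one trace cache way read out by the index, and `predictions` packs the predicted directions the same way as the stored branch flags:

```python
def trace_cache_hit(fetch_tag, predictions, line):
    """Return True if the stored trace matches both the fetch address
    and the predicted path."""
    start_pc_ok = (fetch_tag == line.tag)
    # Compare only the embedded branches' directions; bits outside the
    # mask (beyond the trace's branches) are don't-cares.
    path_ok = ((line.branch_mask & predictions) ==
               (line.branch_mask & line.branch_flags))
    return start_pc_ok and path_ok
```

Note how the mask makes short traces match regardless of predictions for branches the trace doesn't contain.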

Additional Inputs to Next-PC Logic
The core fetch unit already supplies: BTB, RAS, PC++, target from decode, result from execution. The trace cache adds two inputs to the next-PC MUX:
- Trace fall-through: selected if (T$ hit) && (!ends_in_branch || last_not_taken)
- Trace target: selected if (T$ hit) && (ends_in_branch && last_taken)
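The two new MUX selections can be sketched as follows (a simplified model; on a trace cache miss, the core fetch unit's own next-PC choice wins, and `core_next_pc` stands in for that whole BTB/RAS/PC++ selection):

```python
def trace_next_pc(hit, ends_in_branch, last_pred_taken, line, core_next_pc):
    """Select the next fetch PC given the trace cache outcome."""
    if hit and (not ends_in_branch or not last_pred_taken):
        return line.fall_through  # trace fall-through
    if hit and ends_in_branch and last_pred_taken:
        return line.target        # trace target
    return core_next_pc           # T$ miss: core fetch unit decides
```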

T$ Redundancy & Fragmentation
[Figure: a CFG with basic blocks A, B, C, D; the instruction cache stores each block once, while the trace cache stores multiple traces (e.g., {A,B,D} and {B,C,D}) in which blocks such as B and D appear repeatedly, illustrating redundancy and fragmentation]

Design Space
- Conventional: size, set-associativity
- Specific to trace cache:
  - Partial matching
  - Path-associativity
  - Indexing methods (including predictions)
  - Trace selection policies
  - Trace cache fill options

Partial Matching
- Return part of a trace if only a prefix matches
- Must add fall-through & target addresses for each embedded branch (not just the trace-ending branch); these are needed for the next-PC logic

Path-Associativity
- Definition: the ability to store different traces with the same start PC
- Set-associative trace cache: normal associativity gives path-associativity
- Direct-mapped trace cache: combine the PC with prediction bits to form the trace cache index (e.g., concatenate, XOR, etc.), so different traces with the same start PC map to different sets

Indexing Methods
- Include prediction bits in the trace cache index for path-associativity
  - How many bits?
  - XOR or concatenate?
  - XOR with which PC bits?
- Modifying the indexing does not change the hit logic
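A sketch of the two indexing options; the bit counts and the exact folding are illustrative choices, not specified by the slides:

```python
def tc_index(pc, predictions, index_bits=8, pred_bits=2, method="xor"):
    """Form a path-associative trace cache index from the start PC and
    the first few prediction bits."""
    pc_index = (pc >> 2) & ((1 << index_bits) - 1)  # drop the 00 offset bits
    preds = predictions & ((1 << pred_bits) - 1)
    if method == "xor":
        # Fold the predictions into the low index bits.
        return pc_index ^ preds
    else:
        # "concat": append prediction bits below the PC bits, keeping
        # the index the same width.
        return ((pc_index << pred_bits) | preds) & ((1 << index_bits) - 1)
```

Either way, the hit logic is unchanged: the tag and path checks still decide whether the trace read out actually matches.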

Trace Cache Fill Options
- When to build traces and update the T$: at the fetch stage or at the retire stage
- Build one or multiple traces at a time: from the same dynamic instruction stream, traces can be built from a single starting point (Trace A, Trace B, Trace C) or from multiple starting points (also Trace A', Trace B', Trace C', etc.)

Other Trace Cache Applications
- Use in new microarchitectures: trace processors
- Trace pre-processing: improves performance, reduces pipeline complexity, or both. Examples:
  - Compiler-like optimizations (dead code removal, constant propagation, strength reduction, etc.)
  - Re-scheduling instructions (allows simple issue logic, e.g., in-order, without degrading performance)
  - Pre-renaming (simplifies the renaming logic)