Lecture 9. Branch Target Prediction and Trace Cache


COSC6385 Advanced Computer Architecture
Lecture 9. Branch Target Prediction and Trace Cache
Instructor: Weidong Shi (Larry), PhD, Computer Science Department, University of Houston

Branch Target Prediction
- Try the easy ones first: direct jumps, call/return, conditional (bi-directional) branches
- Hardware structures: Branch Target Buffer (BTB), Return Address Stack (RAS)

Branch Target Buffer (BTB)
[Figure: the branch PC indexes a table of (tag, target) pairs; tag comparators (=) select the matching entry to produce the predicted branch target, while a separate predictor supplies the branch direction; on a miss, the fall-through PC+4 is used.]
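The tag/target lookup in the figure can be sketched in a few lines. This is a minimal direct-mapped model with an illustrative size; real BTBs are set-associative and the field widths are design-specific.

```python
# Minimal sketch of a direct-mapped Branch Target Buffer (BTB).
# The table size is an illustrative assumption, not a real design point.
BTB_ENTRIES = 8

class BTB:
    def __init__(self):
        self.entries = [None] * BTB_ENTRIES  # each entry: (tag, target)

    def _index_tag(self, pc):
        word = pc >> 2                        # drop byte offset (4B insts)
        return word % BTB_ENTRIES, word // BTB_ENTRIES

    def update(self, pc, target):
        # Record a taken branch's target when it resolves.
        idx, tag = self._index_tag(pc)
        self.entries[idx] = (tag, target)

    def predict(self, pc):
        # Tag match -> predicted target; otherwise fall through to PC+4.
        idx, tag = self._index_tag(pc)
        entry = self.entries[idx]
        if entry and entry[0] == tag:
            return entry[1]
        return pc + 4
```

A hit returns the cached target in the same cycle the PC is known, which is the whole point of the structure.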

Return Address Stack (RAS)
- Different call sites make a return address hard to predict from the BTB: printf() is called by many callers, so the target of the "return" instruction in printf() is a moving target.
- Solution: a hardware stack (LIFO). A call pushes its return address onto the stack; a return takes its prediction from the top of stack (TOS).
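The push/pop discipline above is easy to model. A sketch, with the stack depth and overflow policy (drop the oldest entry) as illustrative assumptions:

```python
# Sketch of a Return Address Stack: calls push, returns pop from TOS.
class ReturnAddressStack:
    def __init__(self, depth=16):          # depth is an assumed parameter
        self.depth = depth
        self.stack = []

    def on_call(self, call_pc):
        if len(self.stack) == self.depth:  # overflow: discard oldest entry
            self.stack.pop(0)
        self.stack.append(call_pc + 4)     # return address = next instruction

    def predict_return(self):
        # Top-of-stack prediction; None when the stack is empty.
        return self.stack.pop() if self.stack else None
```

Nested calls naturally unwind in LIFO order, which is why the stack tracks returns that a single BTB entry cannot.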

Return Address Stack
[Figure: on a call, the return address (call PC + 4) is pushed onto the RAS; on a return, the TOS supplies the predicted target instead of the BTB.]
- The fetch stage may not know an instruction is a return prior to decoding, so it initially relies on the BTB for speculation and fixes up the prediction once the return is recognized.

Indirect Jump
- Needs target prediction: many possible targets (potentially 2^30 for a 32-bit machine), though in reality not so many. Similar to predicting values.
- Two approaches: tagless target prediction and tagged target prediction.

Indirect Branch Examples
- SPARC: jmpl %o7
- MIPS: jr $ra
- x86: jmp *%eax
- ARM: mov pc, r2
- Itanium: br.ret.sptk.few rp

Tagless Target Prediction [ChangHaoPatt'97]
[Figure: the branch PC is hashed (XORed) with the Branch History Register (BHR) to index a target cache of 2^N entries, each holding a full predicted target address.]
- Modify the PHT to be a "target cache": an indirect jump takes its prediction from the target cache, other branches from the BTB.
- Problem: the table is untagged, so different (PC, history) pairs can alias into the same entry.
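A minimal sketch of the tagless scheme. The XOR hash and table size follow the slide's PC-combined-with-BHR indexing, but the exact widths are illustrative assumptions:

```python
# Sketch of a tagless indirect-target predictor: a "target cache"
# indexed by a hash of the branch PC and global history (BHR).
TC_BITS = 10
TARGET_CACHE = [0] * (1 << TC_BITS)   # 0 = no prediction yet

def tc_index(branch_pc, bhr):
    # Illustrative hash: word-aligned PC XORed with the history bits.
    return ((branch_pc >> 2) ^ bhr) & ((1 << TC_BITS) - 1)

def predict_target(branch_pc, bhr):
    return TARGET_CACHE[tc_index(branch_pc, bhr)]

def update_target(branch_pc, bhr, actual_target):
    TARGET_CACHE[tc_index(branch_pc, bhr)] = actual_target
```

Because there is no tag, two different branches (or histories) that hash to the same index silently share an entry; the tagged variant on the next slide adds a tag compare to filter out exactly this aliasing.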

Tagged Target Prediction [ChangHaoPatt'97]
[Figure: branch PC and BHR hash into an n-bit index; a set-associative target cache (2^n entries per way) with a tag array and =? comparators selects the predicted target address.]
- To reduce aliasing, make the target cache set-associative and use the branch PC and/or history as tags.

Multiple Branch Prediction
- A really wide machine must fetch across several basic blocks, so it needs to predict multiple branches per cycle.
- How to fetch non-contiguous instructions in one cycle?
- Prediction accuracy is extremely critical: the combined accuracy falls geometrically (e.g. three 90%-accurate predictions yield only 0.9^3 ≈ 73%).

Instruction Supply Issues
[Figure: the instruction fetch unit fills an instruction buffer that feeds the execution core.]
- Fetch throughput defines the maximum performance achievable by the later stages; superscalar processors need to supply more than one instruction per cycle.
- Instruction supply is limited by: misalignment of instructions in a fetch group, changes of flow (interrupting instruction supply), and memory latency and bandwidth.

Flynn's Bottleneck
- ILP ≈ 1.86 for programs on the IBM 7090 when ILP is exploited only within basic blocks.
- [Riseman & Foster'72] studied breaking control dependencies on a perfect machine model (benchmarks include numerical programs, an assembler, and a compiler):

  Passed jumps:  0     1     2     8     32    128   ∞
  Average ILP:   1.72  2.72  3.62  7.21  14.8  24.2  51.2

Aligned Instruction Fetching (4 instructions)
[Figure: one 64B I-cache line laid out as four rows of four 4B instructions (A0–A15). With PC = ..xx000000 and a 16B fetch group, the row decoder pulls out one full row (inst 1–4) in cycle n.]

Misaligned Fetch
[Figure: with PC = ..xx001000, the 16B fetch group straddles two rows of the 64B line; a rotating network realigns the instructions so all four (inst 1–4) are still delivered in cycle n, as in the IBM RS/6000.]

Split Cache Line Access
[Figure: with PC = ..xx111000, the fetch group starts near the end of cache line A and continues into cache line B; the fetch must be broken into 2 physical accesses, delivering inst 1–2 in cycle n and inst 3–4 in cycle n+1.]

Split Cache Line Access Miss
[Figure: the same split access, but cache line B misses; inst 1–2 arrive in cycle n while inst 3–4 wait until cycle n+X for the miss to be serviced.]
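The alignment arithmetic behind the last three slides fits in one helper. A sketch using the slides' parameters (64B line, 16B / 4-instruction fetch group, 4B instructions):

```python
# Fetch-group alignment math for a 64B I-cache line and a 16B fetch group.
LINE_BYTES = 64
GROUP_BYTES = 16
INST_BYTES = 4

def fetch_group(pc):
    """Return (addresses, split) for one 4-instruction fetch starting at pc.
    split is True when the group crosses a cache-line boundary and thus
    needs a second physical access (or a banked/rotated cache design)."""
    addrs = [pc + i * INST_BYTES for i in range(GROUP_BYTES // INST_BYTES)]
    split = (pc % LINE_BYTES) + GROUP_BYTES > LINE_BYTES
    return addrs, split
```

A misaligned but in-line fetch (PC offset 8) only needs the rotating network; a group starting at offset 56 (..xx111000) genuinely spans two lines.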

High Bandwidth Instruction Fetching
- Wider issue → more instruction feed needed. Major challenge: fetching more than one non-contiguous basic block per cycle.
- Enabling techniques: predication, branch alignment based on profiling, and other hardware solutions (branch prediction is a given).
[Figure: a control-flow graph of basic blocks BB1–BB7.]

Predication Example

Source code:
  if (a[i+1] > a[i]) a[i+1] = 0; else a[i] = 0;

Typical assembly:
     lw r2, [r1+4]
     lw r3, [r1]
     blt r3, r2, L1
     sw r0, [r1]
     j L2
  L1: sw r0, [r1+4]
  L2:

Assembly w/ predication:
     lw r2, [r1+4]
     lw r3, [r1]
     sgt p4, r2, r3
     (p4) sw r0, [r1+4]
     (!p4) sw r0, [r1]

- Predication converts a control dependency into a data dependency, enlarging the basic block: more room for scheduling and no fetch disruption.

Collapsing Buffer [ISCA '95]
- Fetches multiple (often non-contiguous) instructions per cycle:
- Uses an interleaved BTB to enable multiple branch predictions
- Aligns instructions in the predicted sequential order
- Uses a banked I-cache for multiple line accesses

Collapsing Buffer
[Figure: the fetch PC feeds an interleaved BTB; two I-cache banks are read in parallel, their outputs pass through an interchange switch, and a collapsing circuit merges the useful instructions.]

Collapsing Buffer Mechanism
[Figure: two lines, A B C D and E F G H, are read from separate banks; valid-instruction bits from the interleaved BTB mark which instructions lie on the predicted path; the interchange switch routes the banks into predicted order, and the collapsing circuit squeezes out the invalid slots, e.g. producing the compacted group A B C E G.]

High Bandwidth Instruction Fetching
- To fetch more, we need to cross multiple basic blocks (and/or multiple cache lines) per cycle, which requires multiple branch predictions.
[Figure: a control-flow graph of basic blocks BB1–BB7.]

Multiple Branch Predictor [YehMarrPatt ICS'93]
- A Pattern History Table (PHT) design that supports multiple branch prediction (MBP), based on global history only: the Branch History Register (BHR, bits b_k..b_1) indexes the PHT for the primary prediction (p1); the secondary and tertiary predictions (p2, p3) reuse the speculatively updated history, so no branch addresses are needed to make them.
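The key idea, that all predictions in one cycle come from a single BHR snapshot plus speculative history updates, can be sketched as follows. The 2-bit counters, BHR length, and table size are illustrative assumptions, not the paper's exact configuration:

```python
# Sketch of global-history multiple branch prediction: make up to three
# predictions per cycle by speculatively shifting predicted outcomes
# into the history before the branches resolve.
K = 4                      # BHR length (assumed)
PHT = [2] * (1 << K)       # 2-bit counters, initialized weakly taken

def predict_3(bhr):
    """Primary, secondary, and tertiary predictions from one BHR value."""
    preds = []
    h = bhr
    for _ in range(3):
        taken = PHT[h & ((1 << K) - 1)] >= 2
        preds.append(taken)
        h = ((h << 1) | taken) & ((1 << K) - 1)  # speculative history update
    return preds
```

Because the later lookups depend only on history bits (not on the PCs of the second and third branches), this avoids the chicken-and-egg problem the next slide describes.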

Multiple Branch Prediction
- The fetch address provides the primary (br0) prediction and can be retrieved from the BTB.
- Predicted path: BB1 → BB2 → BB5. How to fetch BB2 and BB5? Not from the BTB: the branch PCs of br1 and br2 are not available when the MBP is made.
- Instead, use a BAC design (branch address cache).
[Figure: a branch tree: BB1 ends in br1 (2nd prediction: T → BB2, F → BB3); BB2 and BB3 end in branches (3rd prediction) leading to BB4–BB7.]

Branch Address Cache
- Use a Branch Address Cache (BAC) that keeps 6 possible fetch addresses per entry to supply 2 more predictions beyond the primary.
- Entry format (212 bits per fetch-address entry): a 23-bit tag; for each of the 3 branches in the tree, a valid bit V (indicates whether the sequence hits a branch) and a 2-bit branch type br (conditional, unconditional, return); 30-bit taken and not-taken target addresses at the first level, and T-T, T-N, N-T, N-N addresses at the second.
- To make one more level of prediction, 8 more addresses must be cached (14 addresses total): 464 bits per entry = (23+3)*1 + (30+3)*(2+4) + 30*8.
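The slide's storage arithmetic checks out; a few lines make the accounting explicit (field widths taken directly from the slide):

```python
# Storage accounting for a Branch Address Cache (BAC) entry,
# using the slide's field widths.
TAG, VALID, BR, ADDR = 23, 1, 2, 30

# Two extra predictions: tag, plus (valid + branch-type) for each of the
# 3 branch-tree nodes, plus 6 stored target addresses.
two_more = TAG + 3 * (VALID + BR) + 6 * ADDR

# Three extra predictions: the slide's 464-bit figure,
# 464 = (23+3)*1 + (30+3)*(2+4) + 30*8. The root node carries tag+V+br;
# the 6 interior addresses each carry V+br; the 8 leaves are bare addresses.
three_more = (TAG + VALID + BR) + (ADDR + VALID + BR) * 6 + ADDR * 8
```

The quadratic growth in stored addresses (6, then 14) is why BAC entries get expensive as the prediction depth increases.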

Caching Non-Consecutive Basic Blocks
[Figure: in a conventional instruction cache, basic blocks sit in linear memory locations (BB1 BB2 BB3 BB4 BB5), so fetching a predicted path (e.g. BB1 → BB3 → BB5) touches scattered lines; caching the dynamic sequence instead gives high fetch bandwidth with low latency.]

Trace Cache
- Caches dynamic, non-contiguous instruction sequences (traces) that cross multiple basic blocks; requires multiple branch prediction (MBP).
[Figure: for the dynamic sequence A B C D E F G H I J, a conventional I-cache fetch takes 5 cycles, a collapsing buffer fetch takes 3 cycles, but a trace cache line holding the whole trace is fetched in 1 cycle.]

Trace Cache [Rotenberg, Bennett, Smith MICRO'96]
- A trace cache line holds at most M branches or N instructions. The original paper and all follow-up TC studies use M = 3 (limited by the MBP) and N = 16.
- Each line stores a tag, a fall-through address (used if the last branch is predicted not taken), a taken address, branch flags, and a branch mask. Example: branch flags "10" mean the 1st branch was taken and the 2nd not taken; mask "11,1" means the line contains 3 branches and the trace ends with a branch.
- On a T.C. hit, N instructions are delivered in one fetch; on a T.C. miss, the line fill buffer constructs a new trace.

Trace Hit Logic
[Figure: the fetch address A is compared (=) against the line tag; the multiple-branch predictor's directions (e.g. N T N) are ANDed against the stored branch flags and mask. Matching the 1st block signals a partial hit; matching the remaining block(s) too signals a full trace hit, and the taken or fall-through address becomes the next fetch address.]

Trace Cache Example
- BB traversal path: ABDABDACDABDACDABDAC. Block sizes: A = 5, B = 6, C = 12, D = 4 instructions.
- A trace ends on one of three conditions: (1) it contains 3 branches, (2) it fills a 16-instruction trace cache line, or (3) the loop exits.
[Figure: a 5-line trace cache starts filling; the first line captures the trace A1–A5 B1–B6 D1–D4 (15 instructions, 3 branches).]


Trace Cache Example (cont.)
[Figure: after the whole path ABDABDACDABDACDABDAC is processed, the 5-line trace cache is full. The long block C overflows the 16-instruction line limit, so C is split across lines (e.g. a trace ending at C11 and the next starting at C12), and later traces repeat instructions already cached.]
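The three end-of-trace conditions are easy to simulate. This is a simplified sketch that never splits a basic block across trace lines, so it illustrates the stopping rules rather than reproducing the slide's exact line contents (where block C is split mid-block):

```python
# Simplified trace construction for the slides' example.
# Traces end at 3 branches or 16 instructions (each block ends in a branch);
# a real fill buffer would also split oversized blocks across lines.
BLOCK_LEN = {'A': 5, 'B': 6, 'C': 12, 'D': 4}
MAX_INSTS, MAX_BRANCHES = 16, 3

def build_traces(path):
    traces, cur, cur_insts, cur_brs = [], [], 0, 0
    for bb in path:
        # Start a new trace if adding this block would break either limit.
        if cur and (cur_insts + BLOCK_LEN[bb] > MAX_INSTS
                    or cur_brs == MAX_BRANCHES):
            traces.append(''.join(cur))
            cur, cur_insts, cur_brs = [], 0, 0
        cur.append(bb)
        cur_insts += BLOCK_LEN[bb]
        cur_brs += 1
    if cur:                      # condition 3: path exit flushes the last trace
        traces.append(''.join(cur))
    return traces
```

Running it on the slide's path shows why redundancy appears: the same blocks land in several different traces.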

Redundancy
- Duplication: instructions appear only once in the I-cache, but the same instruction appears many times in the trace cache.
- Fragmentation: if 3 basic blocks total fewer than 16 instructions, or a multiple-target branch (e.g. return, indirect jump, or trap) is encountered and trace construction stops, the empty slots are wasted resources.
- Example: a loop over blocks A (6 insts), B (4), C (6), D (3) generates the traces (ABC) = 16 inst, (BCD) = 13 inst, (CDA) = 15 inst, (DAB) = 13 inst, so each instruction is duplicated 3 times in the trace cache.

Indexability
- Suppose the TC holds traces (EAC) and (BCD), and the path is (EAC) then (D): block D cannot be indexed because it sits in the interior of (BCD), which can cause further duplication.
- Partial matching is also needed: (BCD) is cached, but only (BC) may be needed.
[Figure: a control-flow graph with blocks A–G and a trace cache holding (EAC) and (BCD).]

Pentium 4 (NetBurst) Trace Cache
[Figure: the front-end BTB, iTLB, and prefetcher feed the decoder from the L2 cache; decoded instructions fill the trace cache (with its own trace BTB), which feeds rename/execute. There is no conventional I-cache!]
- The trace cache stores decoded instructions and uses trace-based prediction (predict next-trace, not next-PC).

Pentium M - BTB
- Background: the BTB is a cache structure. Instructions are fetched in 16-byte blocks (Intel), so a block can contain multiple branches and a BTB lookup can produce multiple hits (same tag). Each entry therefore carries an offset field, and an offset algorithm selects the target among the candidates offered.
- Reverse-engineering goals: find the number of BTB entries (NBTB), number of sets (NSETS), number of ways (NWAYS), index and tag bits, offset bits and presence of the offset algorithm, bogus-branch handling, and the replacement policy.

BTB
- Number of BTB entries: 2048; number of sets: 512; number of ways: 4
- Index = IP[12:4], Tag = IP[21:13], Offset = IP[3:0]
- Offset algorithm: when there are multiple hits, select the target with the lowest offset that is not smaller than the current IP
- Bogus branches handling: evict the whole set
- Replacement policy: tree-based pseudo-LRU
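The field layout above is just bit slicing; a sketch makes it concrete (the example IP value is arbitrary):

```python
# Field extraction per the reverse-engineered Pentium M BTB layout:
# index = IP[12:4], tag = IP[21:13], offset = IP[3:0].
def btb_fields(ip):
    offset = ip & 0xF              # IP[3:0]: position within the 16B block
    index = (ip >> 4) & 0x1FF      # IP[12:4]: 9 bits -> 512 sets
    tag = (ip >> 13) & 0x1FF       # IP[21:13]: 9 bits of tag
    return index, tag, offset
```

Note the consistency check: 512 sets x 4 ways = 2048 entries, matching the stated BTB size, and all branches within one 16B fetch block share an index and tag, which is exactly why the offset field and selection algorithm are needed.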