Lecture 9. Branch Target Prediction and Trace Cache

COSC6385 Advanced Computer Architecture
Lecture 9. Branch Target Prediction and Trace Cache
Instructor: Weidong Shi (Larry), PhD
Computer Science Department, University of Houston

Branch Target Prediction
Try the easy ones first
  Direct jumps
  Call/Return
  Conditional branch (bi-directional)
Branch Target Buffer (BTB)
Return Address Stack (RAS)

Branch Target Buffer (BTB)
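[Figure: the branch PC indexes the BTB, whose entries each hold a Tag and a Target. The stored tags are compared (=) against the branch PC. The next fetch address is chosen between the matching Target and PC + 4, using the predicted branch direction.]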
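
The lookup in the figure can be sketched in a few lines. This is a minimal illustration and not the design of any particular processor: it assumes a direct-mapped table, 4-byte instructions, and a direction prediction supplied from elsewhere. The class name `BranchTargetBuffer` and its methods are hypothetical.

```python
class BranchTargetBuffer:
    """Direct-mapped BTB sketch: each entry holds a (tag, target) pair."""

    def __init__(self, index_bits=4):
        self.index_bits = index_bits
        self.entries = [None] * (1 << index_bits)

    def _split(self, pc):
        word = pc >> 2  # assumption: 4-byte instructions
        index = word & ((1 << self.index_bits) - 1)
        tag = word >> self.index_bits
        return index, tag

    def update(self, pc, target):
        """Record the resolved target of the branch at pc."""
        index, tag = self._split(pc)
        self.entries[index] = (tag, target)

    def next_fetch(self, pc, predict_taken):
        """Return the stored target on a tag hit that is predicted taken, else pc + 4."""
        index, tag = self._split(pc)
        entry = self.entries[index]
        if entry is not None and entry[0] == tag and predict_taken:
            return entry[1]
        return pc + 4


btb = BranchTargetBuffer()
btb.update(0x1000, 0x2000)
print(hex(btb.next_fetch(0x1000, True)))   # tag hit, predicted taken
print(hex(btb.next_fetch(0x1000, False)))  # predicted not taken: fall through
```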

Return Address Stack (RAS)
Different call sites make return address hard to predict
  printf() being called by many callers
  The target of the "return" instruction in printf() is a moving target
A hardware stack (LIFO)
  Call will push return address on the stack
  Return uses the prediction off of TOS
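
A small sketch of the push-on-call, pop-on-return behaviour described above. The fixed depth, the drop-oldest overflow policy, and the 4-byte instruction size are assumptions made for illustration; the slides do not specify them.

```python
class ReturnAddressStack:
    """Hardware-style LIFO of predicted return addresses (illustrative only)."""

    def __init__(self, depth=8):
        self.depth = depth
        self.stack = []

    def on_call(self, call_pc):
        # Assumption: when the stack is full, the oldest entry is discarded.
        if len(self.stack) == self.depth:
            self.stack.pop(0)
        self.stack.append(call_pc + 4)  # return address = instruction after the call

    def predict_return(self):
        # Prediction comes off the top of stack; None means "no prediction".
        return self.stack.pop() if self.stack else None


ras = ReturnAddressStack()
ras.on_call(0x100)   # first caller of printf()
ras.on_call(0x200)   # nested call
print(hex(ras.predict_return()), hex(ras.predict_return()))
```

Each call site pushes its own return address, so the same return instruction gets a different, correct prediction per caller.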

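Return Address Stack
[Figure: the call PC and the return PC each index the BTB. A call pushes its return address (PC + 4) onto the stack. A "Return?" indication from the BTB selects the stack output as the target.]
May not know it is a return instruction prior to decoding
  Rely on BTB for speculation
  Fix once recognize Return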

Indirect Jump Need Target Prediction
Many possible targets (potentially 2^30 for a 32-bit machine)
In reality, not so many
Similar to predicting values
Tagless Target Prediction
Tagged Target Prediction

Indirect Branch
SPARC:    jmpl %o7
MIPS:     jr $ra
X86:      jmp *%eax
ARM:      mov pc, r2
Itanium:  br.ret.sptk.few rp

Tagless Target Prediction [ChangHaoPatt’97]
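Modify the PHT to be a "Target Cache"
  Target Cache has 2^N entries, each holding a predicted target address
  Index = Hash(Branch PC, Branch History Register (BHR)), e.g. PC ⊕ BHR pattern
Target selection: (indirect jump) ? (from target cache) : (from BTB)
Alias?
[Figure: Branch PC and BHR feed a hash that selects one of the target cache entries 00…00 to 11…11, which supplies the predicted target address.]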
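
A minimal sketch of a tagless target cache, assuming an XOR hash of the word-aligned branch PC with a global history register. The class and method names are hypothetical, and the table size is arbitrary. The last lines show the aliasing the slide asks about: with no tags, a different (PC, history) pair that hashes to the same entry silently reads another branch's target.

```python
class TaglessTargetCache:
    """2^N-entry table of targets indexed by hash(PC, BHR), with no tags."""

    def __init__(self, n_bits=8):
        self.mask = (1 << n_bits) - 1
        self.targets = [0] * (1 << n_bits)
        self.bhr = 0  # global branch history register

    def _index(self, pc):
        return ((pc >> 2) ^ self.bhr) & self.mask  # assumed hash: PC xor BHR

    def predict(self, pc):
        return self.targets[self._index(pc)]

    def update(self, pc, actual_target):
        self.targets[self._index(pc)] = actual_target

    def record_outcome(self, taken):
        self.bhr = ((self.bhr << 1) | int(taken)) & self.mask


def predict_target(is_indirect, pc, target_cache, btb_target):
    # (indirect jump) ? (from target cache) : (from BTB)
    return target_cache.predict(pc) if is_indirect else btb_target


tc = TaglessTargetCache(n_bits=4)
tc.update(0x40, 0xAAA0)      # history 0000: entry 0
tc.record_outcome(True)      # history becomes 0001
tc.update(0x40, 0xBBB0)      # same branch, new history: entry 1
print(hex(tc.predict(0x40)))  # target learned under the current history
print(hex(tc.predict(0x44)))  # a different branch aliases onto entry 0
```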

Tagged Target Prediction [ChangHaoPatt’97]
To reduce aliasing with set-associative target cache
  Target Cache has 2^n entries per way
  Use branch PC and/or history for tags
[Figure: Branch PC and BHR feed a hash that produces an n-bit index into the target cache entries 00…00 to 11…11 and into a tag array. The tag comparison (=?) selects the predicted target address.]
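
The tagged variant can be sketched as a small set-associative table. This is an illustration under stated assumptions: the tag is the word-aligned branch PC (the slide allows PC and/or history), the index is PC xor BHR, and replacement is least-recently-updated. All names are hypothetical.

```python
class TaggedTargetCache:
    """Set-associative target cache: a tag check filters out aliased entries."""

    def __init__(self, n_bits=4, ways=2):
        self.n_bits = n_bits
        self.ways = ways
        # Each set is a list of (tag, target) pairs, most recently updated last.
        self.sets = [[] for _ in range(1 << n_bits)]

    def _index_and_tag(self, pc, bhr):
        index = ((pc >> 2) ^ bhr) & ((1 << self.n_bits) - 1)
        tag = pc >> 2  # assumption: tag with the branch PC only
        return index, tag

    def predict(self, pc, bhr):
        index, tag = self._index_and_tag(pc, bhr)
        for stored_tag, target in self.sets[index]:
            if stored_tag == tag:
                return target
        return None  # miss: fall back to another predictor such as the BTB

    def update(self, pc, bhr, actual_target):
        index, tag = self._index_and_tag(pc, bhr)
        kept = [e for e in self.sets[index] if e[0] != tag]
        kept.append((tag, actual_target))
        self.sets[index] = kept[-self.ways:]  # evict the oldest entry if over capacity


tc = TaggedTargetCache()
tc.update(0x40, 0b0000, 0xAAA0)
print(tc.predict(0x44, 0b0001))  # same set, different tag: reported as a miss
```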

Multiple Branch Prediction
For a really wide machine
  Across several basic blocks
  Need to predict multiple branches per cycle
How to fetch non-contiguous instructions in one cycle?
Prediction accuracy extremely critical (will be reduced geometrically)
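
To make "reduced geometrically" concrete: if each prediction is correct with probability p and the predictions are treated as independent (a simplifying assumption, not a claim from the slides), the chance that all k predictions in one fetch cycle are correct is p^k. The function name below is hypothetical.

```python
def all_correct_probability(per_branch_accuracy, branches_per_cycle):
    """Probability that every prediction in the fetch group is right,
    assuming independent predictions with equal accuracy."""
    return per_branch_accuracy ** branches_per_cycle


for k in (1, 2, 3):
    print(k, round(all_correct_probability(0.9, k), 3))
```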

Instruction Supply Issues
Fetch throughput defines max performance that can be achieved in later stages
Superscalar processors need to supply more than 1 instruction per cycle
Instruction supply limited by
  Misalignment of multiple instructions in a fetch group
  Change of flow (interrupting instruction supply)
  Memory latency and bandwidth
[Figure: Instruction Fetch Unit -> Instruction buffer -> Execution Core]

Flynn’s Bottleneck
ILP ≈ 1.86
  Programs on IBM 7090
  ILP exploited within basic blocks
[Riseman & Foster’72]
  Breaking control dependency
  A perfect machine model
  Benchmark includes numerical programs, assembler and compiler
Passed jumps   0     1     2     8     32    128   ∞
Average ILP    1.72  2.72  3.62  7.21  14.8  24.2  51.2
[Figure: control-flow graph of basic blocks BB0 to BB4]

Aligned Instruction Fetching (4 instructions)
PC=..xx000000
Assume one fetch group = 16B
Can pull out one row at a time
[Figure: one 64B I-cache line stored as four rows selected by a row decoder: row ..00 holds A0 to A3, row ..01 holds A4 to A7, row ..10 holds A8 to A11, row ..11 holds A12 to A15. One row of 4 instructions is delivered in cycle n.]

Misaligned Fetch
PC=..xx001000
IBM RS/6000
[Figure: the same 64B I-cache line (rows A0 to A3, A4 to A7, A8 to A11, A12 to A15). The fetch group starts in the middle of a row, and a rotating network after the array puts the 4 instructions delivered in cycle n into order.]

Split Cache Line Access
PC=..xx111000
Be broken down to 2 physical accesses
[Figure: cache line A (A0 to A15) followed by cache line B (B0 to B7 shown). The fetch group starts near the end of line A: 2 instructions are delivered in cycle n, and the group of 4 is completed from line B in cycle n+1.]
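
The aligned, misaligned, and split cases can be compared with a short calculation. The sketch assumes the parameters used on these slides (64B lines, 4-byte instructions, a fetch group of 4) and counts one physical access per cache line touched. The function name `fetch_accesses` is hypothetical.

```python
LINE_BYTES = 64   # one I-cache line
INST_BYTES = 4    # assumption: fixed 4-byte instructions
GROUP_INSTS = 4   # one fetch group = 16B


def fetch_accesses(pc):
    """Return one (line_address, [instruction slots]) pair per physical access."""
    accesses = []
    remaining = GROUP_INSTS
    while remaining:
        line = pc - pc % LINE_BYTES
        first = (pc % LINE_BYTES) // INST_BYTES
        count = min(remaining, LINE_BYTES // INST_BYTES - first)
        accesses.append((line, list(range(first, first + count))))
        pc += count * INST_BYTES
        remaining -= count
    return accesses


print(fetch_accesses(0b000000))  # aligned: one row of line A
print(fetch_accesses(0b001000))  # misaligned: slots 2..5, needs rotation
print(fetch_accesses(0b111000))  # split: 2 slots from line A, 2 from line B
```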

Split Cache Line Access Miss
PC=..xx111000
Cache line B misses
[Figure: cache line A (A0 to A15) is present, but the next line in the array is cache line C (C0 to C7 shown) rather than B. 2 instructions are delivered in cycle n, and the group of 4 is completed only in cycle n+X, after the miss on line B is serviced.]

High Bandwidth Instruction Fetching
Wider issue -> more instruction feed
Major challenge: to fetch more than one non-contiguous basic block per cycle
Enabling technique?
  Predication
  Branch alignment based on profiling
  Other hardware solutions (branch prediction is a given)
[Figure: control-flow graph of basic blocks BB1 to BB7]

Predication Example
Source code
    if (a[i+1] > a[i])
        a[i+1] = 0
    else
        a[i] = 0
Typical assembly
        lw  r2, [r1+4]
        lw  r3, [r1]
        blt r3, r2, L1
        sw  r0, [r1]
        j   L2
    L1: sw  r0, [r1+4]
    L2:
Assembly w/ predication
        lw  r2, [r1+4]
        lw  r3, [r1]
        sgt p4, r2, r3
    (p4)  sw r0, [r1+4]
    (!p4) sw r0, [r1]
Convert control dependency into data dependency
Enlarge basic block size
  More room for scheduling
  No fetch disruption
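
The equivalence of the two versions can be checked with a small model. It is only a behavioural sketch: Python has no predicated stores, so each guarded `sw` is modelled as a conditional select that always executes, which is the sense in which the control dependency becomes a data dependency on p4. Function names are hypothetical.

```python
def branchy(a, i):
    """Source-level version with a conditional branch."""
    if a[i + 1] > a[i]:
        a[i + 1] = 0
    else:
        a[i] = 0


def predicated(a, i):
    """Straight-line version: both stores are issued, each guarded by p4."""
    r2, r3 = a[i + 1], a[i]                # lw r2, [r1+4] ; lw r3, [r1]
    p4 = r2 > r3                           # sgt p4, r2, r3
    a[i + 1] = 0 if p4 else a[i + 1]       # (p4)  sw r0, [r1+4]
    a[i] = 0 if not p4 else a[i]           # (!p4) sw r0, [r1]


for data in ([1, 5], [5, 1], [3, 3]):
    x, y = list(data), list(data)
    branchy(x, 0)
    predicated(y, 0)
    print(data, x, y)
```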

Collapse Buffer [ISCA 95]
To fetch multiple (often non-contiguous) instructions
Use interleaved BTB to enable multiple branch predictions
Align instructions in the predicted sequential order
Use banked I-cache for multiple line access

Collapsing Buffer
[Figure: the fetch PC feeds the interleaved BTB and two I-cache banks (Cache Bank 1, Cache Bank 2). The bank outputs pass through an interchange switch and then a collapsing circuit.]

Collapsing Buffer Mechanism
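[Figure: the two banks supply cache lines (E F G H) and (A B C D), and the interleaved BTB supplies the valid instruction bits. Bank routing through the interchange switch puts the lines in fetch order (A B C D, E F G H). The collapsing circuit then drops the invalid instructions D, F and H, leaving A B C E G.]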
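
The two stages in the figure can be mimicked in software. This is a functional sketch only: the valid bits, which the interleaved BTB would produce, are passed in directly, and the function name `collapse` is hypothetical.

```python
def collapse(bank_lines, valid_bits, fetch_order):
    """Merge the outputs of the I-cache banks into one contiguous fetch group.

    bank_lines:  {bank: sequence of instructions in that bank's line}
    valid_bits:  {bank: sequence of 0/1 flags, 1 = on the predicted path}
    fetch_order: banks in predicted program order (the interchange switch)
    """
    merged = []
    for bank in fetch_order:
        for inst, valid in zip(bank_lines[bank], valid_bits[bank]):
            if valid:  # collapsing circuit: squeeze out skipped instructions
                merged.append(inst)
    return merged


lines = {0: "EFGH", 1: "ABCD"}
valid = {0: [1, 0, 1, 0], 1: [1, 1, 1, 0]}
print(collapse(lines, valid, fetch_order=[1, 0]))
```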

High Bandwidth Instruction Fetching
To fetch more, we need to cross multiple basic blocks (and/or multiple cache lines)
Multiple branch predictions
[Figure: control-flow graph of basic blocks BB1 to BB7]

Multiple Branch Predictor [YehMarrPatt ICS’93]
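Pattern History Table (PHT) design to support MBP
Based on global history only
[Figure: the Branch History Register (BHR, bits bk … b1) indexes the PHT to give the primary prediction p1. The secondary prediction p2 is read using the history extended by p1, and the tertiary prediction using the history extended by p1 and p2. The predictions are shifted into the BHR on update.]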
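
A sequential sketch of the idea, assuming 2-bit saturating counters and a purely global history. The hardware in the cited design reads the candidate PHT entries in parallel and selects among them with the earlier predictions; the loop below reaches the same entries one after another, which is a simplification. Names are hypothetical.

```python
class MultipleBranchPredictor:
    """Predict up to three branches per cycle from one global-history PHT."""

    def __init__(self, history_bits=4):
        self.mask = (1 << history_bits) - 1
        self.pht = [1] * (1 << history_bits)  # 2-bit counters, start weakly not-taken
        self.bhr = 0

    def _taken(self, history):
        return self.pht[history & self.mask] >= 2

    def predict3(self):
        p1 = self._taken(self.bhr)                 # primary
        h2 = (self.bhr << 1) | int(p1)
        p2 = self._taken(h2)                       # secondary: history extended by p1
        h3 = (h2 << 1) | int(p2)
        p3 = self._taken(h3)                       # tertiary: extended by p1 and p2
        return p1, p2, p3

    def update(self, taken):
        i = self.bhr & self.mask
        self.pht[i] = min(3, self.pht[i] + 1) if taken else max(0, self.pht[i] - 1)
        self.bhr = ((self.bhr << 1) | int(taken)) & self.mask


mbp = MultipleBranchPredictor()
for n in range(20):            # train on an alternating taken / not-taken pattern
    mbp.update(n % 2 == 0)
print(mbp.predict3())
```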

Multiple Branch Prediction
Fetch address (br0 primary prediction)
  Fetch address could be retrieved from BTB
Predicted path: BB1 -> BB2 -> BB5
How to fetch BB2 and BB5? BTB?
  Can't. Branch PCs of br1 and br2 are not available when the MBP prediction is made
  Use a BAC design (branch address cache)
[Figure: binary tree of basic blocks BB1 to BB7 rooted at the BTB entry for BB1, with taken (T) and not-taken (F) edges for the 2nd prediction (br1) and the 3rd prediction (br2).]

Branch Address Cache
Use a Branch Address Cache (BAC): keep 6 possible fetch addresses for 2 more predictions
  Indexed by the fetch address (from BTB)
  Entry fields: Tag (23 bits), three V/br pairs, Taken and Not-Taken target addresses, and the T-T, T-N, N-T and N-N addresses (30 bits each)
  br: 2 bits for branch type (cond, uncond, return)
  V: single valid bit (to indicate if hits a branch in the sequence)
  212 bits per fetch address entry
To make one more level prediction
  Need to cache another 8 more addresses (i.e. total = 14 addresses)
  464 bits per entry = (23+3)*1 + (30+3)*(2+4) + 30*8
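
The two entry sizes follow from the field widths on the slide. The sketch below generalises them to a prediction depth d, with 2^d - 1 V/br pairs and 2^(d+1) - 2 cached addresses. That closed form is an inference from the two cases given (3 pairs and 6 addresses, then 7 pairs and 14 addresses) rather than something the slide states, and the function name is hypothetical.

```python
def bac_entry_bits(depth, tag_bits=23, addr_bits=30, v_bits=1, br_bits=2):
    """Bits per BAC entry when caching addresses for `depth` levels of branches."""
    branch_slots = 2 ** depth - 1        # 1 + 2 (+ 4 ...) V/br pairs
    addresses = 2 ** (depth + 1) - 2     # 2 + 4 (+ 8 ...) target addresses
    return tag_bits + branch_slots * (v_bits + br_bits) + addresses * addr_bits


print(bac_entry_bits(2))  # 6 addresses
print(bac_entry_bits(3))  # 14 addresses
```

The storage roughly doubles with each extra prediction level, which is one reason the design does not scale to many branches per cycle.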

Caching Non-Consecutive Basic Blocks
High Fetch Bandwidth + Low Latency
[Figure: two panels. "Fetch in Conventional Instruction Cache": basic blocks BB1 to BB5 scattered across the cache. "Fetch in Linear Memory Location": BB1 BB2 BB3 BB4 BB5 stored back to back.]

Trace Cache
Cache dynamic non-contiguous instructions (traces)
Cross multiple basic blocks
Need to predict multiple branches (MBP)
[Figure: fetching the dynamic sequence A B C D E F G H I J takes 5 cycles from the I$, 3 cycles with a collapsing buffer, and 1 cycle from the trace cache (T$), which stores the whole sequence in one line.]

Trace Cache [Rotenberg Bennett Smith MICRO‘96]
Trace cache line fields: Tag, Br flag, Br mask, Fall-thru Address, Taken Address
  Br flag, e.g. 10: 1st Br taken, 2nd Br not taken
  Br mask, e.g. 11, 1
    11: 3 branches
    1: the trace ends w/ a branch
Cache at most (in original paper)
  M branches (M = 3 in all follow-up TC studies due to MBP), OR
  N instructions (N = 16 in all follow-up TC studies)
Fall-thru address if last branch is predicted not taken
[Figure: the fetch address and the MBP index the trace cache. On a T.C. hit, N instructions covering BB1, BB2 and BB3 (Branch 1, Branch 2, Branch 3) are supplied. On a T.C. miss, a line fill buffer builds the trace.]

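Trace Hit Logic
[Figure: example line with Tag = A, BF = 10, Mask = 11,1, Fall-thru = X, Target = Y, fetched with address A. The fetch address is compared with the Tag ("Match 1st Block"). The multi-branch predictor outputs are compared with the branch flags under the mask ("Match Remaining Block(s)"). The two results are ANDed to give "Trace hit". The last prediction selects the next fetch address between Fall-thru and Target.]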
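
The hit check can be written out as a short function. This is a sketch of the logic as described for the Rotenberg, Bennett and Smith design, under these assumptions: the branch flags cover every branch except a trailing one, and the prediction for a trailing branch only selects the next fetch address. The `TraceLine` fields mirror the slide's labels; the Python names and the example addresses are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TraceLine:
    tag: int               # fetch address of the first block
    branch_flags: tuple    # recorded outcomes of the interior branches
    num_branches: int      # from the branch mask
    ends_in_branch: bool   # from the branch mask
    fall_thru: int
    target: int


def trace_lookup(line, fetch_addr, predictions):
    """Return (hit, next_fetch_address) for one trace cache line."""
    if line.tag != fetch_addr:                      # match 1st block
        return False, None
    interior = line.num_branches - 1 if line.ends_in_branch else line.num_branches
    if tuple(predictions[:interior]) != tuple(line.branch_flags[:interior]):
        return False, None                          # remaining blocks do not match
    if line.ends_in_branch and predictions[interior]:
        return True, line.target                    # last branch predicted taken
    return True, line.fall_thru


# Example modelled on the slide: flags 10, mask 11,1 (addresses are made up).
line = TraceLine(tag=0xA0, branch_flags=(True, False), num_branches=3,
                 ends_in_branch=True, fall_thru=0x100, target=0x200)
print(trace_lookup(line, 0xA0, (True, False, False)))
```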

Trace Cache Example
BB Traversal Path: ABDABDACDABDACDABDAC
Basic block sizes: A = 5 insts, B = 6 insts, C = 12 insts, D = 4 insts
A trace ends on any of
  Cond 1: 3 branches
  Cond 2: Fill a trace cache line (16 instructions)
  Cond 3: Exit
Trace Cache (5 lines), once the trace cache is full
  A1-A5 B1-B6 D1-D4
  A1-A5 C1-C11
  C12 D1-D4 A1-A5
  B1-B6 D1-D4 A1-A5
  C1-C12 D1-D4
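
The example can be replayed with a few lines of code. The sketch assumes block sizes A = 5, B = 6, C = 12, D = 4 (read off the instruction labels in the example), assumes the last instruction of every block is its branch, and simply cuts the dynamic stream at the three termination conditions. It does not model trace cache hits or replacement. The function name `build_traces` is hypothetical.

```python
def build_traces(path, block_sizes, max_insts=16, max_branches=3):
    """Cut the dynamic instruction stream into traces."""
    stream = []
    for block in path:
        size = block_sizes[block]
        for i in range(1, size + 1):
            stream.append((f"{block}{i}", i == size))  # (name, is_branch)

    traces, current, branches = [], [], 0
    for name, is_branch in stream:
        current.append(name)
        branches += is_branch
        # Cond 1: 3 branches, Cond 2: the 16-instruction line is full
        if branches == max_branches or len(current) == max_insts:
            traces.append(tuple(current))
            current, branches = [], 0
    if current:                                        # Cond 3: exit
        traces.append(tuple(current))
    return traces


traces = build_traces("ABDABDACDABDACDABDAC", {"A": 5, "B": 6, "C": 12, "D": 4})
for t in traces:
    print(len(t), t[0], "..", t[-1])
print("distinct traces among the first eight:", len(set(traces[:8])))
```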

Redundancy
Duplication
  Note that instructions only appear once in I-Cache
  Same instruction appears many times in TC
Fragmentation
  If 3 BBs < 16 instructions
  If multiple-target branch (e.g. return, indirect jump or trap) is encountered, stop "trace construction"
  Empty slots -> wasted resources
Example
  A single BB is broken up into (ABC), (BCD), (CDA), (DAB)
  Duplicating each instruction 3 times
  (ABC) = 16 inst, (BCD) = 13 inst, (CDA) = 15 inst, (DAB) = 13 inst
[Figure: basic blocks A, B, C, D labeled with sizes 6, 4, 6 and 3, next to the trace cache contents]
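
The duplication and fragmentation in this example can be tallied directly. The block sizes A = 6, B = 4, C = 6, D = 3 are an inference: they are the only sizes consistent with the four trace lengths quoted on the slide. The 16-slot line length is taken from the earlier slides.

```python
LINE_SLOTS = 16
sizes = {"A": 6, "B": 4, "C": 6, "D": 3}       # inferred from the trace lengths
traces = ["ABC", "BCD", "CDA", "DAB"]

lengths = {t: sum(sizes[b] for b in t) for t in traces}
copies = {b: sum(b in t for t in traces) for b in sizes}
empty_slots = sum(LINE_SLOTS - n for n in lengths.values())

print(lengths)       # instructions stored per trace line
print(copies)        # how many lines hold each block
print(empty_slots)   # slots wasted to fragmentation across the four lines
```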

Indexability
TC saved traces (EAC) and (BCD)
Path: (EAC) to (D)
  Cannot index interior block (D)
  Can cause duplication
Need partial matching
  (BCD) is cached, if (BC) is needed
[Figure: control-flow graph of blocks A, B, C, D, E and G next to the trace cache contents]

Pentium 4 (NetBurst) Trace Cache
No I$ !!
Trace-based prediction (predict next-trace, not next-PC)
Trace $ holds decoded instructions
[Figure: block diagram with front-end BTB, iTLB and prefetcher, L2 cache, decoder, trace $ BTB, trace $, followed by rename, execute, etc.]

Pentium M - BTB
Background: BTB is a cache structure
Instructions are fetched in 16-byte blocks (Intel)
  Can have multiple branches per line
  BTB can have multiple hits (same tags)
  => Offset field in each entry
  => Offset algorithm selects the target among several offered
Try to find:
  Number of BTB entries (NBTB)
  Number of sets (NSETS)
  Number of ways (NWAYS)
  Index, Tag bits
  Offset bits and presence of offset algorithm
  Bogus branches handling
  Replacement policy

BTB
Number of BTB entries: 2048
Number of sets: 512
Number of ways: 4
Index = IP[12:4], Tag = IP[21:13], Offset = IP[3:0]
Offset algorithm: when there are multiple hits, selects the target with the lowest offset that is no smaller than that of the current IP
Bogus branches handling: evict whole set
Replacement policy: tree-based pseudo-LRU
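
The bit fields and the offset algorithm listed above translate into a few lines of code. The field boundaries come from the slide; the function names and the example addresses are hypothetical, and the hit list is assumed to contain only entries whose tag already matched.

```python
def btb_fields(ip):
    """Split an instruction pointer into (index, tag, offset)."""
    offset = ip & 0xF             # IP[3:0]: position inside the 16-byte block
    index = (ip >> 4) & 0x1FF     # IP[12:4]: 9 bits select one of 512 sets
    tag = (ip >> 13) & 0x1FF      # IP[21:13]
    return index, tag, offset


def select_target(hits, ip):
    """Offset algorithm: among (offset, target) hits, take the lowest offset
    that is not smaller than the offset of the current IP."""
    current = ip & 0xF
    candidates = [(off, tgt) for off, tgt in hits if off >= current]
    return min(candidates)[1] if candidates else None


print(btb_fields(0x2010))
print(hex(select_target([(2, 0x100), (9, 0x200), (12, 0x300)], 0x1004)))
```

With 512 sets and 4 ways the structure holds the 2048 entries reported on the slide.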
