Microarchitecture of Superscalars (3) Branch Prediction Dezső Sima Fall 2007 (Ver. 2.0) © Dezső Sima, 2007.

1 Microarchitecture of Superscalars (3) Branch Prediction Dezső Sima Fall 2007 (Ver. 2.0) © Dezső Sima, 2007

2 Branch prediction 1. Introduction 2. Basic branch prediction mechanisms 3. Auxiliary branch prediction mechanisms 4. Accessing the branch target path

3 1.1 The branch processing problem of pipelining (1) Figure 1.1: Straightforward processing of an unconditional branch on a four stage pipeline (figure labels: branch fetching, branch detection, BTA calculation, BTI fetching; 2 bubbles)

4 1.1 The branch processing problem of pipelining (2) Figure 1.2: Straightforward processing of a conditional branch on a four stage pipeline with immediate condition resolution

5 1.1 The branch processing problem of pipelining (3) Figure 1.3: Straightforward processing of a conditional branch on a four stage pipeline, with delayed condition resolution

6 1.1 The branch processing problem of pipelining (4) Figure 1.4: Number of pipeline stages in Intel’s and AMD’s processors (axes: year, 1995 to 2005, vs. no. of pipeline stages). Labelled points: Pentium (5), Pentium Pro (~12), Pentium 4 (~20), P4 Prescott (~30), Conroe (14), Core Duo, K6 (6), Athlon (6), Athlon-64 (12)

7 1.2 Branch statistics (1) Figure 1.5: Dynamic ratio of branches

8 1.2 Branch statistics (2) Figure 1.6: Ratio of the main instruction types Source: Stephens et al. „Instruction level profiling and evaluation of the IBM RS/6000”, Proc. 18th ISCA, pp

9 Branches Unconditional branchesConditional branches Simple unconditional branch Branch to subroutine Return from subroutine Loop-closing conditional branch Other conditional branches Taken for the first (n-1) iterations Taken Not taken ~ 1/6 Taken ~ 1/6 ~ 1/3 ~ 1/6 ~ 5/6 Figure 1.7: Grohoski’s estimate of branch statistics Source: Grohoski, G.F, IBM J. Res. Develop., 34 Jan. pp Branch statistics (3)

10 Figure 1.8: Frequency of taken and not taken branches Source: Sima, D et. al., ACA, Addison Wesley, 1997, pp. 303 Reference Frequency of taken branches Frequency of not taken branches Lee, Smith % % Edenfield & al %25 % Grohoski 1990~ 5/6 ~ 1/6

11 1.3 The principle of branch prediction (1) Figure 1.9: Correctly predicted conditional branch with delayed condition resolution on a four stage pipeline

12 1.3 The principle of branch prediction (2) Figure 1.10: Incorrectly predicted conditional branch with delayed condition resolution on a four stage pipeline (figure labels: dynamic branch prediction "branch!", BTA calculation, speculative BTI fetching and decode, condition checking "no branch!", a large number of bubbles)

13 1.3 The principle of branch prediction (3) Figure 1.11: Branch misprediction penalty on a long pipeline

14 1.4 Branch prediction accuracy/penalty (1) Figure 1.12: Branch prediction accuracy
Processor | Guessing method (relevant for prediction accuracy) | Implementation | Prediction accuracy | Reference
Am (1987) | Implicit dynamic | 32-entry two-way set associative BTIC | 60 % for repetitive branches | Weiss 1987
MC (1991) | Implicit dynamic, overridden by opcode-based static | 32-entry fully associative BTIC | 70 % on SPEC | Diefendorff, Allen 1992
MC (1993) | 2-bit dynamic | 256-entry BTAC | > 90 % | Circello, Goodrich 1993
MIPS R10000 (1996) | 2-bit dynamic | 512-entry BHT | 90 % | Halfhill, 1994
PowerPC 620 (1995) | Implicit dynamic, augmented with 2-bit dynamic | 256-entry fully associative BTAC, 2-K-entry BHT | 90 % | Thomson, Ryan 1994
PA-8000 (1995) | Implicit dynamic, overridden by 3-bit dynamic or compiler based static | 32-entry fully associative BTAC, 256-entry BHT | 80 % on SPECint92 | Gwennap 1994
UltraSparc (1995) | 2-bit dynamic | 2 K-entries in the IC, each shared among two instructions | 88 % on SPECint92, 94 % on SPECfp92 | Wayner 1994
BHT: Branch history table. BTAC: Branch target address cache. BTIC: Branch target instruction cache. IC: Instruction cache.
Source: Sima, D et al., ACA, Addison Wesley, 1997, pp. 340

15 1.4 Prediction accuracy/penalty (2) Effective penalty of branch processing (simplified). f_c: Probability (frequency) of correctly predicted branches. f_m: Probability (frequency) of mispredicted branches. P_c: Penalty of correctly predicted branches. P_m: Penalty of mispredicted branches. Examples (penalties in cycles): PPro, P4 Willamette, P4 Prescott
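The formula on this slide did not survive extraction. A plausible reading of the four symbols is the frequency-weighted sum P = f_c * P_c + f_m * P_m, with f_c = 1 - f_m. The sketch below assumes exactly that; the function name and the numbers are illustrative and are not the slide's PPro or P4 values.

```python
def effective_penalty(f_m: float, p_c: float, p_m: float) -> float:
    """Frequency-weighted branch penalty in cycles (assumed form of the formula)."""
    if not 0.0 <= f_m <= 1.0:
        raise ValueError("f_m must be a probability")
    f_c = 1.0 - f_m  # assumption: every branch is either predicted correctly or not
    return f_c * p_c + f_m * p_m


# Illustrative numbers only: 10 % mispredictions, no penalty when the
# prediction is correct, 20 cycles lost on a misprediction.
example = effective_penalty(f_m=0.10, p_c=0.0, p_m=20.0)
```

Under this reading, a longer pipeline raises P_m, so the same prediction accuracy costs more cycles per branch.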

16 2. Basic branch prediction mechanisms 2.1 Introduction (1) Branch processing: branch detection, branch prediction, accessing the branch target path

17 2.1 Introduction (2) Branch prediction mechanisms: basic branch prediction mechanism, auxiliary branch prediction mechanism

18 Basic branch prediction mechanism Processor based Local Compiler hints 2.1 Introduction (2)

19 Figure 2.1.: Local prediction ? Prediction depends only on the behaviour of the branch considered

20 Basic branch prediction mechanism Processor based Global (2-level) Local Compiler hints 2.1 Introduction (2)

21 Figure 2.2.: Global prediction Path 2: Path 1:..000 ?..100 Prediction depends on the actual execution path, that is on all branches executed

22 Basic branch prediction mechanism Processor based Global Local Compiler hints Combined (Choice prediction) (2-level) 2.1 Introduction (2)

23 1-level2-level Local prediction 2.2. Local prediction (1)

24 80486 (1989) PPC 601 (1993) POWER2 (1993) POWER1 (1990) Static prediction Displacement- based Dynamic prediction 1-level (local) prediction Opcode- based 1-bit prediction 'Always taken' Fixed prediction 'Always not taken' approach MC (1990) PPC 601 (1993) SuperSparc (1992) R4000 (1992) R8000 (1994) PPC: PowerPC 2.2. Local prediction (2) Always the same predictionBased on the object codeBased on the execution history

25 2.2. Local prediction (3) IFA: BHT (Branch History Table) 0: sequential cont 1: branch. } x: Figure 2.3: Principle of the 1-bit dynamic prediction x

26 2.2. Local prediction (4) Figure 2.4: State transition diagram of the 1-bit dynamic prediction (states: Taken, Not taken; T: Branch has been taken, NT: Branch has not been taken)
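The 1-bit scheme can be sketched as a table with one bit per entry that records the last outcome of the branch. The class name, the table size and the indexing by low-order IFA bits are assumptions made for illustration.

```python
class OneBitPredictor:
    """1-bit dynamic prediction: one history bit per BHT entry."""

    def __init__(self, entries: int = 16) -> None:
        self.entries = entries
        self.bht = [0] * entries  # 0: sequential continuation, 1: branch

    def _index(self, ifa: int) -> int:
        return ifa % self.entries  # assumed: indexed by low-order IFA bits

    def predict(self, ifa: int) -> bool:
        return self.bht[self._index(ifa)] == 1

    def update(self, ifa: int, taken: bool) -> None:
        # The bit simply remembers the most recent outcome.
        self.bht[self._index(ifa)] = 1 if taken else 0


# A loop-closing branch: taken three times, then not taken, executed twice.
outcomes = [True, True, True, False] * 2
predictor = OneBitPredictor()
one_bit_misses = 0
for taken in outcomes:
    if predictor.predict(0x40) != taken:
        one_bit_misses += 1
    predictor.update(0x40, taken)
```

In this run the predictor misses on every loop exit and again on the first iteration of the next loop execution, which is the known weakness of a single history bit.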

27 80486 (1989) PPC 601 (1993) POWER2 (1993) POWER1 (1990) Pentium (1993) PPC 604 (1995) PPC 620 (1996) Static prediction Displacement- based Dynamic prediction 1-level (local) prediction Opcode- based 1-bit prediction 2-bit'Always taken' Fixed prediction 'Always not taken' approach MC (1990) PPC 601 (1993) SuperSparc (1992) R4000 (1992) R8000 (1994) R10000 (1996) MC (1993) UltraSparc (1995) PPC: PowerPC 2.2. Local prediction (6) Always the same predictionBased on the object codeBased on the execution history

28 2.2. Local prediction (7) IFA: BHT 00,01: sequential cont 10,11: branch. } xx: BHT: Branch History Table Figure 2.6: Principle of the 2-bit dynamic prediction x

29 2.2. Local prediction (8) Figure 2.7: State transition diagram of the most frequently used 2-bit dynamic prediction (Smith algorithm). States: Strongly taken and Weakly taken (prediction: "Taken"), Weakly not taken and Strongly not taken (prediction: "Not Taken"). Transitions: AT: actually taken, ANT: actually not taken. Initialised when a branch is taken first.
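A minimal sketch of the 2-bit saturating counter follows. Starting every entry in the strongly taken state is an assumption (the slide only says the entry is initialised when the branch is first taken), as are the class name and table size.

```python
class TwoBitPredictor:
    """2-bit saturating counter per BHT entry (Smith-style)."""

    def __init__(self, entries: int = 16, initial: int = 3) -> None:
        self.entries = entries
        # 0: strongly not taken, 1: weakly not taken,
        # 2: weakly taken, 3: strongly taken (assumed initial state)
        self.bht = [initial] * entries

    def _index(self, ifa: int) -> int:
        return ifa % self.entries  # assumed: indexed by low-order IFA bits

    def predict(self, ifa: int) -> bool:
        return self.bht[self._index(ifa)] >= 2  # 10, 11: branch

    def update(self, ifa: int, taken: bool) -> None:
        i = self._index(ifa)
        if taken:
            self.bht[i] = min(3, self.bht[i] + 1)
        else:
            self.bht[i] = max(0, self.bht[i] - 1)


# Same loop-closing branch as before: taken three times, then not taken, twice.
outcomes = [True, True, True, False] * 2
predictor = TwoBitPredictor()
two_bit_misses = 0
for taken in outcomes:
    if predictor.predict(0x40) != taken:
        two_bit_misses += 1
    predictor.update(0x40, taken)
```

Because a single not-taken outcome only weakens the counter, the branch is still predicted taken when the loop is entered again, so only the loop exits are mispredicted.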

30 2.2. Local prediction (5) Figure 2.5: Alternatives for accessing Branch History Tables or Branch Target Address Buffers (C: counters)
Indexed access: IFA bits index the BHT directly. For large tables most branches will map to a unique entry. For smaller tables multiple branches may map to the same entry, resulting in interference and thus in degraded prediction accuracy. Examples: 16K entry local BHT (Power4), 16K entry global BHT (Power4), 16K entry selector table (Power4)
Associative access: the full IFA is compared with the stored IFAs. Avoids interference but strongly increases cost. Example: 64 entry BTAC (PPC 604)
Cache-like access (direct / set associative): IFA bits index the table and tags are compared. Reduces interference but increases cost (e.g. two-way set associative). Examples: 128*4 way BHT/BTAC (Pentium Pro), 1K*4 way BHT/BTAC (Pentium II, III, 4), 128*2 way BTAC (Power3)
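The interference effect of indexed access can be illustrated with a small sketch. The function name and the fixed 4-byte instruction length are assumptions; real designs choose their own index bits.

```python
def bht_index(ifa: int, entries: int, instr_bytes: int = 4) -> int:
    """Index a tagless table with the low-order IFA bits above the byte offset."""
    return (ifa // instr_bytes) % entries


# Two hypothetical branch addresses that differ only in higher address bits.
branch_a, branch_b = 0x1000, 0x1040

small_table = (bht_index(branch_a, 16), bht_index(branch_b, 16))
large_table = (bht_index(branch_a, 16 * 1024), bht_index(branch_b, 16 * 1024))
```

With 16 entries both branches share one entry and disturb each other's history; with 16K entries they map to different entries.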

31 80486 (1989) PPC 601 (1993) POWER2 (1993) POWER1 (1990) Pentium (1993) PPC 604 (1995) PPC 620 (1996) Static prediction Displacement- based Dynamic prediction 1-level (local) prediction Opcode- based 1-bit prediction 2-bit 3-bit 'Always taken ' Fixed prediction 'Always not taken ' approach MC (1990) PPC 601 (1993) SuperSparc (1992) R4000 (1992) R8000 (1994) R10000 (1996) MC (1993) UltraSparc (1995) PPC: PowerPC Figure 2.8: Early branch prediction mechanisms and their trends indicated by subsequent models of pipelined, 1. and 2. generation superscalars 2.2. Local prediction (9) Always the same predictionBased on the object codeBased on the execution history

32 1-level2-level Fixed prediction Static prediction Dynamic prediction Local prediction Always the same prediction Based on the object code Based on the execution history 2.2. Local prediction (10)

33 2.2. Local prediction (11) 2-level local prediction (1.-level: branch patterns, 2.-level: history bits)
With a shared global history table for all patterns (Alpha 21264): the IFA selects an entry of the local BHT (e.g. 1K×10 bit), whose pattern selects one of the shared counters (local BHT, e.g. 1K×3 bit). The Alpha 21264 uses 3-bit saturating counters whose most significant bit provides the prediction.
With individual history tables for different patterns (Pentium Pro): the IFA selects an entry of the local BHT (e.g. 128×4 bit, e.g. 4-ways each), whose pattern selects one of the individual counters of that branch (local BHT, e.g. 16×2 bit).
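The shared-counter variant can be sketched as follows, using the table sizes quoted on the slide (1K×10-bit histories, 1K×3-bit counters). The class name, the modulo indexing and the zero initial state are assumptions, not a description of the actual Alpha 21264 logic.

```python
class TwoLevelLocalPredictor:
    """2-level local prediction with a counter table shared by all branches."""

    def __init__(self, history_entries: int = 1024, history_bits: int = 10,
                 counter_bits: int = 3) -> None:
        self.history_entries = history_entries
        self.history_mask = (1 << history_bits) - 1
        self.histories = [0] * history_entries       # 1st level: per-branch pattern
        self.counter_max = (1 << counter_bits) - 1
        self.msb = 1 << (counter_bits - 1)
        self.counters = [0] * (1 << history_bits)    # 2nd level: shared counters

    def predict(self, ifa: int) -> bool:
        pattern = self.histories[ifa % self.history_entries]
        return self.counters[pattern] >= self.msb    # MSB gives the prediction

    def update(self, ifa: int, taken: bool) -> None:
        slot = ifa % self.history_entries
        pattern = self.histories[slot]
        if taken:
            self.counters[pattern] = min(self.counter_max, self.counters[pattern] + 1)
        else:
            self.counters[pattern] = max(0, self.counters[pattern] - 1)
        # Shift the new outcome into this branch's local history.
        self.histories[slot] = ((pattern << 1) | int(taken)) & self.history_mask


# Loop-closing branch, taken three times then not taken, repeated 100 times.
outcomes = [True, True, True, False] * 100
predictor = TwoLevelLocalPredictor()
misses = []
for taken in outcomes:
    misses.append(predictor.predict(0x40) != taken)
    predictor.update(0x40, taken)
steady_misses = sum(misses[-100:])
```

Once the counters are trained, each history pattern of the loop is followed by a fixed outcome, so even the loop exit is predicted correctly, which a 1-level counter cannot do.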

34 2.2. Local prediction (12) Figure 2.9.: The principle of Pentium Pro’s 128x4 way set associative BHT (the linear BTA is split into tag and index; Way 0 to Way 3 each hold tags and a 4-bit history; the history selects one of 16 counters, entries 0 to 15; counter xx: 00/01 not taken, 10/11 taken)

35 2.2. Local prediction (13) Figure 2.10.: The actual layout of Pentium Pro’s 128x4 way set associative BHT (fields: Tag, history H, counters C)

36 Basic branch prediction mechanism Processor based Global Local Compiler hints Combined (Choice prediction) (2-level) 2.3. Global prediction (1)

37 Simple global Global prediction 2.3. Global prediction (1)

38 Figure 2.11.: Simple global prediction BHT Global history (shift register) 000 x 1111 Branch history 2.3. Global prediction (1)

39 Simple globalGshare Global prediction 2.3. Global prediction (1)

40 Figure 2.12.: Principle of the Gshare prediction } Global history IFA 111 x BHT XOR... Branch history 2.3. Global prediction (1)
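A minimal gshare sketch follows: the BHT index is the XOR of IFA bits and the global history. The 4-bit index, the 2-bit counters and their weakly-not-taken initial state are assumptions chosen to keep the example small.

```python
class GsharePredictor:
    """Gshare sketch: BHT index = IFA bits XOR global history."""

    def __init__(self, index_bits: int = 4) -> None:
        self.mask = (1 << index_bits) - 1
        self.global_history = 0               # shift register of recent outcomes
        self.bht = [1] * (1 << index_bits)    # 2-bit counters, weakly not taken

    def index(self, ifa: int) -> int:
        return (ifa ^ self.global_history) & self.mask

    def predict(self, ifa: int) -> bool:
        return self.bht[self.index(ifa)] >= 2

    def update(self, ifa: int, taken: bool) -> None:
        i = self.index(ifa)
        if taken:
            self.bht[i] = min(3, self.bht[i] + 1)
        else:
            self.bht[i] = max(0, self.bht[i] - 1)
        self.global_history = ((self.global_history << 1) | int(taken)) & self.mask


# A branch that alternates taken / not taken; a plain per-branch counter
# cannot follow this pattern, but the history-based index can.
outcomes = [True, False] * 10
predictor = GsharePredictor()
misses = []
for taken in outcomes:
    misses.append(predictor.predict(0x0) != taken)
    predictor.update(0x0, taken)
late_misses = sum(misses[10:])
```

The same branch lands in different BHT entries depending on the history that precedes it, so each entry sees a consistent outcome.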

41 Simple globalGshareGselect Global prediction 2.3. Global prediction (1)

42 Global history IFA: BHT x Branch history Figure 2.13.: Principle of the Gselect prediction 2.3. Global prediction (1)
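Gselect forms the BHT index by concatenating global history bits with IFA bits rather than XORing them. In the sketch below the bit widths, the function names and the choice to put the history in the upper index bits are assumptions; the gshare function is included only for contrast.

```python
def gselect_index(ifa: int, global_history: int,
                  ifa_bits: int = 4, history_bits: int = 4) -> int:
    """Concatenate history bits (upper part) and IFA bits (lower part)."""
    ifa_part = ifa & ((1 << ifa_bits) - 1)
    history_part = global_history & ((1 << history_bits) - 1)
    return (history_part << ifa_bits) | ifa_part


def gshare_index(ifa: int, global_history: int, bits: int = 4) -> int:
    """XOR of IFA bits and history bits, shown for comparison."""
    return (ifa ^ global_history) & ((1 << bits) - 1)


# Two (IFA, history) pairs that collide under XOR but not under concatenation.
pair_1 = (0b1010, 0b0111)
pair_2 = (0b0111, 0b1010)
gshare_indices = (gshare_index(*pair_1), gshare_index(*pair_2))
gselect_indices = (gselect_index(*pair_1), gselect_index(*pair_2))
```

The concatenated index keeps the two cases apart but needs a table of 2^(4+4) entries here, whereas the XOR index addresses only 2^4 entries.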

43 Basic branch prediction mechanism Processor based Global Local Compiler hints Combined (Choice prediction) (2-level) 2.4. Combined prediction (1)

44 Figure 2.14.: Principle of the combined local and global prediction (as used in the Alpha 21264, or the POWER 4) BHT IFA: Global history Global BHT Local IFA: Best choice BHT Resulting prediction x Local prediction Global prediction Local prediction Global prediction Actual prediction (for updating) 2.4. Combined prediction (2)
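The selection principle can be sketched with three small tables: a local predictor, a global predictor and a choice table that learns which of the two to trust. All sizes, the 2-bit counters, their initial values and the indexing of the choice table by the global history are assumptions for illustration, not the Alpha 21264 or POWER4 implementation.

```python
def bump(counter: int, up: bool, maximum: int = 3) -> int:
    """Saturating increment or decrement of a small counter."""
    return min(maximum, counter + 1) if up else max(0, counter - 1)


class CombinedPredictor:
    """Local and global 2-bit predictors plus a choice table."""

    def __init__(self, bits: int = 4) -> None:
        size = 1 << bits
        self.mask = size - 1
        self.history = 0                 # global history shift register
        self.local = [1] * size          # indexed by IFA bits
        self.global_table = [1] * size   # indexed by global history
        self.choice = [1] * size         # >= 2 selects the global prediction

    def predict(self, ifa: int) -> bool:
        local_pred = self.local[ifa & self.mask] >= 2
        global_pred = self.global_table[self.history] >= 2
        use_global = self.choice[self.history] >= 2
        return global_pred if use_global else local_pred

    def update(self, ifa: int, taken: bool) -> None:
        li, gi = ifa & self.mask, self.history
        local_pred = self.local[li] >= 2
        global_pred = self.global_table[gi] >= 2
        # Train the choice table only when the two predictors disagree.
        if local_pred != global_pred:
            self.choice[gi] = bump(self.choice[gi], global_pred == taken)
        self.local[li] = bump(self.local[li], taken)
        self.global_table[gi] = bump(self.global_table[gi], taken)
        self.history = ((self.history << 1) | int(taken)) & self.mask


# Alternating branch: the local counter keeps guessing wrong, the global
# predictor learns the pattern, and the choice table moves to "global".
outcomes = [True, False] * 20
predictor = CombinedPredictor()
misses = []
for taken in outcomes:
    misses.append(predictor.predict(0x0) != taken)
    predictor.update(0x0, taken)
late_misses = sum(misses[20:])
```

Both component predictors are updated with every outcome, while the choice counter moves only toward whichever component was right when they disagreed.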

45 2.4. Combined prediction (3) Figure 2.15.: Implementation alternatives of the combined prediction
Alpha 21264
1. prediction: 2-level local dynamic prediction with a shared counter table for all patterns (1K * 10 bits/1K * 3 bits)
2. prediction: Simple 2-level global prediction (12-bit global history/4K * 2 bits)
Choice: Global history referenced choice table (12-bit global history/4K * 2-bits)

46 2.4. Combined prediction (4) Figure 2.16.: The combined predictor of the Alpha 21264. Minimum branch penalty: 7 cycles. Typical branch penalty: 11+ cycles (IQ delay). 48K bits of target addresses stored in I-cache. 32-entry return address stack. Predictor tables are reset on a context switch. Source: Microprocessor Report, 10/28/96

47 2.4. Combined prediction (5) Figure 2.17.: Implementation alternatives of the combined prediction
Alpha 21264
1. prediction: 2-level local dynamic prediction with a shared counter table for all patterns (1K * 10 bits/1K * 3 bits)
2. prediction: Simple 2-level global prediction (12-bit global history/4K * 2 bits)
Choice: Global history referenced choice table (12-bit global history/4K * 2-bits)
POWER 4
1. prediction: 1-level local dynamic prediction (16K * 1-bit)
2. prediction: 2-level Gshare global prediction (11-bit global history is hashed with the IFA, 16K * 1-bit counter table)
Choice: Accessed in the same way as the global counter table (16K * 1-bit)

48 2.4. Combined prediction (6) Figure 2.18.: The principle of the combined predictor of the POWER4 (figure labels: local history table, 16K*1bit, accessed by the IFA; global history table, 16K*1bit, accessed by the XOR of the IFA and the 11-bit global history, 1 bit per group; selector table, 16K*1bit, which selects the better of the local and global predictions; update path)

49 2.5. Overview of the basic branch prediction mechanisms Figure 2.20.: Trends of branch prediction schemes used in 2. and 3. generation superscalars

50 3. Auxiliary branch prediction mechanisms Figure 3.1.: Overview of auxiliary branch prediction mechanisms in 2. and 3. generation superscalars. Mechanism shown: backup use of static prediction (Pentium Pro, Pentium, P4 Will/Northw., P4 Prescott). Processors considered: Pentium, Pentium Pro, P4 Will/Northw., P4 Prescott, K6, K7, K8, PPC 604, PPC 620, POWER 3, POWER 4, POWER 5, Alpha, Alpha, PA-8000, PA-8500/8700, UltraSPARC-III. Legend: 1: 1. generation superscalars; 2: Supported by compiler hints; RAS: Return Address Stack

51 Figure 3.2: Static branch prediction algorithm of the Pentium Pro Source: Shanley T., „Pentium Pro Processor System Architecture„, Addison-Wesley Developers Press, 1996

52 3. Auxiliary branch prediction mechanisms Figure 3.1.: Overview of auxiliary branch prediction mechanisms in 2. and 3. generation superscalars. Mechanisms shown: backup use of static prediction (Pentium Pro, Pentium, P4 Will/Northw., P4 Prescott); dedicated prediction by a RAS (processors listed: K6, K7, K8, PPC 620, POWER 3, POWER 4, POWER 5, Alpha, Alpha, PA-8000, UltraSPARC-III; sizes listed: 16, 12, 8, 32 and 12 entries); preemptive use of compiler hints. Processors considered: Pentium, Pentium Pro, P4 Will/Northw., P4 Prescott, K6, K7, K8, PPC 604, PPC 620, POWER 3, POWER 4, Alpha, Alpha, PA-8000, PA-8500/8700, UltraSPARC-III. Legend: 1: 1. generation superscalars; 2: Supported by compiler hints; RAS: Return Address Stack

53 Return Address Stack (RAS)
A procedure, such as printf(), might be called from many different locations, so there are many different return addresses.
Architectural stack with preserved sequential consistency: PUSH return address on a CALL, POP return address on a RET.
RAS, used to continue execution speculatively from the popped return address: PUSH return address on a CALL, POP return address on a RET.
The problem of RASs: during speculative ooo execution, however, the logical sequence of the related PUSH and RET instructions may be disturbed, so the predicted return address may be wrong. To check the prediction the RET instruction is executed, and on a misprediction a repair mechanism is activated (to cancel wrongly executed instructions and repair the corrupted RAS).
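The push/pop behaviour and one possible repair mechanism can be sketched as follows. The class and method names, the overflow policy and the checkpoint/restore repair are assumptions; the slide does not specify how the RAS is repaired.

```python
class ReturnAddressStack:
    """Small speculative return address stack."""

    def __init__(self, depth: int = 16) -> None:
        self.depth = depth
        self.stack = []

    def on_call(self, return_address: int) -> None:
        if len(self.stack) == self.depth:
            self.stack.pop(0)  # overflow: drop the oldest entry (assumed policy)
        self.stack.append(return_address)

    def predict_return(self):
        """Pop the predicted return address, or None if the stack is empty."""
        return self.stack.pop() if self.stack else None

    def checkpoint(self) -> list:
        return list(self.stack)  # snapshot taken before speculating

    def restore(self, snapshot: list) -> None:
        self.stack = list(snapshot)  # repair after a misprediction


ras = ReturnAddressStack()
ras.on_call(0x104)           # CALL on the correct path
saved = ras.checkpoint()     # a branch is predicted here
ras.on_call(0x999)           # CALL executed on the wrong path
ras.restore(saved)           # misprediction detected: repair the RAS
predicted = ras.predict_return()
```

Without the restore step the RET would be predicted to return to the wrong-path address, which is the corruption described above.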

54 3. Auxiliary branch prediction mechanisms Figure 3.1.: Overview of auxiliary branch prediction mechanisms in 2. and 3. generation superscalars. Mechanisms shown: backup use of static prediction (Pentium Pro, Pentium, P4 Will/Northw., P4 Prescott); dedicated prediction by a RAS (processors listed: K6, K7, K8, PPC 604, PPC 620, POWER 3, POWER 4, POWER 5, Alpha, Alpha, PA-8000, UltraSPARC-III; sizes listed: 16, 12, 8, 32 and 12 entries), a loop detector and indirect branch prediction; preemptive use of compiler hints. Processors considered: Pentium, Pentium Pro, P4 Will/Northw., P4 Prescott, K6, K7, K8, PPC 604, PPC 620, POWER 3, POWER 4, Alpha, Alpha, PA-8000, PA-8500/8700, UltraSPARC-III. Legend: 1: 1. generation superscalars; 2: Supported by compiler hints; RAS: Return Address Stack

55 Figure 4.1.: Alternatives to generate the BTA BTA Calculated on the fly 4. Accessing the branch target path (1) 4.1. Overview

56 I-cache I F A R AII+1I+2I+3 IIFA BTA BTIBTI+1BTI+2BTI+3 Instruction fetch address + sequential address Compute BTA (IFA) Figure 4.2.: Principle of calculating the BTA on the fly This scheme is employed in earlier scalar (pipeline) processors as well as in a number of superscalar processors, such as: Z i486 MC Sparc CY7C601 SuperSparc Power PC Power1 Power A R4000 R (1984) (1989) (1990) (1988), (1992p), (1993), (1990), (1993), (1992), (1994), (1995), (1992), (1996) Source: Sima, D et. al., ACA, Addison Wesley, 1997, pp. 303 POWER4 (2001), POWER5 (2005) Ultra SPARC III (2003)

57 Figure 4.1.: Alternatives to generate the BTA BTA Accessed from the BTACCalculated on the fly 4. Accessing the branch target path (1) 4.1. Overview

58 Figure 4.3.: Principle of the BTAC scheme to access the branch target path. The Branch Target Address Cache (BTAC) contains branch target addresses (BTAs). These BTAs are read from the BTAC when the instruction immediately preceding a branch is fetched. (Their addresses are designated as BA-1.) Figure labels: IFAR, instruction fetch address (IFA), IIFA (sequential address), BTAC entries (BA-1, BTA: branch target address), I-cache contents (at A: I, I+1, I+2, I+3; at BTA: BTI, BTI+1, BTI+2, BTI+3)

59 Figure 4.4.: The principle of branch prediction using both a BHT and a BTAC (C: counter). The IFA accesses the I$, the BHT (counters C) and the BTAC (tags, BTA); instructions pass from the I$ to the IB and on to further processing. Next IFA selection: IIFA if the BTAC misses, BTA if the BTAC hits, BTA if mispredicted. Update the BHT with the branch result. Update the BTAC with the BTA if the BHT initiates it (create/delete BTAC entry). The BTAC is designated as BTB (Branch Target Buffer) by Intel.
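The interplay of BHT and BTAC can be sketched as below. This is a simplification under stated assumptions: the BTAC is looked up with the branch's own address (not BA-1), both tables are modelled as dictionaries, and a BTAC entry is created when the 2-bit counter predicts taken and deleted when it predicts not taken. All names are hypothetical.

```python
class FetchPredictor:
    """BHT-controlled BTAC: a BTAC hit redirects fetch to the stored BTA."""

    def __init__(self, fetch_bytes: int = 4) -> None:
        self.fetch_bytes = fetch_bytes
        self.bht = {}    # branch address -> 2-bit counter
        self.btac = {}   # branch address -> branch target address (BTA)

    def next_ifa(self, ifa: int) -> int:
        # BTAC hit: fetch from the BTA; miss: continue sequentially.
        return self.btac.get(ifa, ifa + self.fetch_bytes)

    def resolve(self, ifa: int, taken: bool, bta: int) -> None:
        # Update the BHT with the branch result.
        counter = self.bht.get(ifa, 0)
        counter = min(3, counter + 1) if taken else max(0, counter - 1)
        self.bht[ifa] = counter
        # The BHT initiates creating or deleting the BTAC entry.
        if counter >= 2:
            self.btac[ifa] = bta
        else:
            self.btac.pop(ifa, None)


fetch = FetchPredictor()
before_training = fetch.next_ifa(0x100)
fetch.resolve(0x100, taken=True, bta=0x200)
fetch.resolve(0x100, taken=True, bta=0x200)
after_taken = fetch.next_ifa(0x100)
fetch.resolve(0x100, taken=False, bta=0x200)
after_not_taken = fetch.next_ifa(0x100)
```

Keeping only predicted-taken branches in the BTAC means a hit doubles as the taken prediction, so the fetch stage needs no separate counter lookup in this sketch.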

60 Figure 4.5.: Examples of processors using the BTAC scheme
Processor | Number of BTAC entries | Implementation of the BTAC
ES/ based procs (1992p) | 4K | 2-way associative
Pentium (1994) | 256 | Fully associative
Pentium Pro | 512 | 4-way associative
Pentium 4 | 4K | 4-way associative
MC (1993) | 256 | 4-way associative
R 8000 (1994) | 1K (1) |
PA 8000 (1995) | 32 | Fully associative
Power PC 604 (1994) | 64 | Fully associative
Power PC 620 (1995) | 256 | Fully associative
(1): Each entry is shared among 4 instructions

61 Figure 4.6.: The physical implementation of branch prediction in Intel’s P4 Northwood and Prescott cores Source: de Vries H., „Looking at Intel’s Prescott die, part II.”, April 2003

62 4. Accessing the branch target path (1) Figure 4.1.: Alternatives to generate the BTA BTA Accessed from BTAC From the I$ Calculated on the fly 4.1. Overview

63 Figure 4.7.: Principle of the BTIC scheme to access the branch target path. The BTIC contains the addresses of the most recently taken branches (BA), the corresponding branch target instructions (BTI) and the addresses of the instructions following the BTIs (BTA+). When there is an entry in the BTIC for the actual IFA, the corresponding BTI is fetched from the BTIC and selected for decoding instead of the instruction from the I-cache. The address of the subsequent instruction along the taken path is also read from the BTIC and becomes the next IFA. Examples: Gmicro/200 (1988), AM (1988), MC (1993). Figure labels: IFAR, instruction fetch address (IFA), I-cache, BTIC entries (BA, BTI, BTA+), to decoding

64 Figure 4.8.:Trends to generate the BTA BTA Accessed from BTAC From the I$ Ultra SPARC III Calculated on the fly K6 PPro/PII/PIII/P4 K7/K8 Power 4, 5 Power Examples 4. Accessing the branch target path (1) 4.1. Overview

65 4.2. Case example 1: K7 (1) To each 16-byte long fetch block a 16-bit selector block is allocated as follows. The selector block identifies branches included in the associated fetch block. Two bits of the selector block correspond to two bytes of the fetch block. RETs are a single byte long; all other branches are at least two bytes long. Assuming at most a single RET in the fetch block, there may be at most one branch ending in any pair of bytes. In a fetch block there may be up to a single RET and two non-RET branches. More branches in a fetch block lead to conflicts in the prediction logic. Figure labels: fetch block (16-byte), selector block (16-bit), BTA, instruction execution

66 4.2. Case example 1: K7 (2) Each two-bit entry indicates whether or not there is a branch ending in the corresponding two bytes of the fetch block; if yes, it identifies the type of the branch as well. A branch instruction that crosses the 16-byte boundary is assigned to the second 16-byte window. Coding of the two bits (assumed): 00: no branch. 01: RET. 10: There is a conditional branch whose BTA is in the BTA0 field of the BTAC. 11: There is a conditional branch whose BTA is in the BTA1 field of the BTAC.
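Decoding a selector block under the (assumed) two-bit coding can be sketched as follows. The bit ordering, with the entry for the first byte pair in the least significant bits, is an additional assumption, as are the names used.

```python
# Two-bit codes as listed on the slide (the slide itself marks them as assumed).
SELECTOR_CODES = {
    0b00: "no branch",
    0b01: "RET",
    0b10: "cond. branch, BTA in BTA0",
    0b11: "cond. branch, BTA in BTA1",
}


def decode_selector(selector: int) -> list:
    """Split a 16-bit selector block into eight entries, one per byte pair."""
    if not 0 <= selector < (1 << 16):
        raise ValueError("selector block must fit in 16 bits")
    return [SELECTOR_CODES[(selector >> (2 * pair)) & 0b11] for pair in range(8)]


# Hypothetical fetch block: a conditional branch ends in byte pair 2
# and a RET ends in byte pair 6.
selector_block = (0b10 << 4) | (0b01 << 12)
entries = decode_selector(selector_block)
branch_pairs = [pair for pair, kind in enumerate(entries) if kind != "no branch"]
```

Each non-zero entry tells the fetch logic where a branch ends within the fetch block and which BTAC field (or the return address) supplies the target.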

67 4.2. Case example 1: K7 (3) Characteristic examples of selector settings and the resulting next fetch address:
xx00 No branch: IFA+16
xx A RET instruction: return address of the RET
xx A cond. branch (its BTA is in the BTAC 0 field): BTA0 if taken, else IFA+16
xx Two cond. branches (their BTAs are in the BTAC 0 and BTAC 1 fields): BTA0 if BC1 is taken, else BTA1 if BC2 is taken, else IFA+16
During predecoding, instruction boundaries as well as branch instructions are detected and the appropriate selector entries are marked accordingly. Predecoding is performed at no more than 4 bytes/cycle. If a cache line (64 bytes = 4 fetch blocks) is replaced, all associated selector blocks are invalidated.

68 4.2. Case example 1: K7 (4) The selector table is shared between the upper and lower part of the I$, and an extra address bit (A) identifies whether the entry belongs to the upper or the lower part of the I$. Source: Kaiser, A., ”K7 Branch Prediction”, Dec. 1999

69 Figure 4.9.: Assumed simplified scheme of accessing the branch target path in the K7, without showing the global prediction (A: address bit, C: Conditional branch, W: Way) 4.2. Case example 1: K7 (5)

70 4.2. Case example 2: K8 (1) The K8 doubled the size of the selector table, so each fetch block has its own selector entry. The K8 allows any mix of up to 3 branches (CALL, JMP, RET, conditional) per fetch block; the coding of the selector entries is modified accordingly. When instruction cache lines are evicted to the L2 cache, branch selectors and predecode information are also stored in the L2 cache. The K8 uses 48-bit addresses, but the BTAC keeps only the 15 least significant bits to identify the next address. Each BTA entry identifies the least significant 15 bits of the IFA as well as additional information, such as: 3-bit old IFA (bits 16,15); W bit: way identifier

71 Figure 4.10.: Assumed simplified scheme of accessing the branch target path in the K8, without showing the global prediction (C: Conditional branch, R: Return, W: Way 0/1, SA: Start address) 4.2. Case example 2: K8 (2)

72 4.2. Case example 2: K8 (3) Figure 4.11.: Logical view of Opteron’s (K8’s) instruction fetch and decode stages Source: de Vries H., „Understanding the detailed Architecture of AMD’s 64 bit Core”, Sept.

73 4.2. Case example 2: K8 (4) Figure 4.12.: Physical implementation of Opteron’s (K8’s) instruction cache and decoding Source: de Vries H., „Understanding the detailed Architecture of AMD’s 64 bit Core”, Sept.

