Microprocessor Microarchitecture: Instruction Fetch
Lynn Choi, Dept. of Computer and Electronics Engineering

Instruction Fetch w/ Branch Prediction
- On every cycle, 3 accesses are done in parallel (a sketch of this logic follows below):
  - Instruction cache access
  - Branch target buffer (BTB) access
    - If hit, the fetched block contains a branch, and the BTB provides its target address
    - Else, use the fall-through address (PC+4) for the next sequential access
  - Branch prediction table access
    - If predicted taken, the instructions after the branch are not sent to the back end, and the next fetch starts from the target address
    - If predicted not taken, the next fetch starts from the fall-through address
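A minimal sketch of the per-cycle next-PC logic described above, assuming a direct-mapped BTB and a table of 2-bit saturating counters; the table sizes, indexing, and field layout are illustrative assumptions, not taken from the slides:

```c
#include <stdint.h>
#include <stdbool.h>

#define BTB_ENTRIES 512
#define BPT_ENTRIES 1024

/* Hypothetical structures for illustration only. */
typedef struct {
    bool     valid;
    uint32_t tag;      /* branch PC */
    uint32_t target;   /* predicted target address */
} btb_entry_t;

static btb_entry_t btb[BTB_ENTRIES];
static uint8_t     bpt[BPT_ENTRIES];   /* 2-bit saturating counters, 0..3 */

/* One fetch cycle: the BTB and the prediction table are indexed with the
 * same PC, in parallel with the I-cache access (not modeled here). */
uint32_t next_fetch_pc(uint32_t pc)
{
    btb_entry_t *e = &btb[(pc >> 2) % BTB_ENTRIES];
    bool btb_hit = e->valid && e->tag == pc;

    if (!btb_hit)
        return pc + 4;                 /* no known branch: sequential fetch */

    bool pred_taken = bpt[(pc >> 2) % BPT_ENTRIES] >= 2;  /* counter MSB */
    return pred_taken ? e->target : pc + 4;
}
```

In a real front end all three structures are read in the same cycle; the sequential code above only models the combined outcome.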

Motivation
- Wider issue demands a higher instruction fetch rate. However, Ifetch bandwidth is limited by:
  - Basic block size
    - The average basic block is only 4~5 instructions
    - Need to increase the effective basic block size!
  - Branch prediction hit rate
    - Redirecting fetch on a misprediction is costly
    - More accurate prediction is needed
  - Branch throughput
    - Multiple branch predictions per cycle are necessary for a wide-issue superscalar!
    - This allows fetching multiple contiguous basic blocks; the number of instructions between taken branches is 6~7
    - Still limited by the instruction cache line size
  - Taken branches
    - Need a fetch mechanism for non-contiguous basic blocks
  - Instruction cache hit rate
    - Instruction prefetching helps

Solutions
- Increase basic block size (with compiler support)
  - Trace scheduling, superblock scheduling, predication
- Hardware mechanisms to fetch multiple non-consecutive basic blocks are needed!
  - Multiple branch predictions per cycle
  - Generation of fetch addresses for multiple basic blocks
  - Non-contiguous instruction alignment: fetch and align multiple non-contiguous basic blocks and pass them to the pipeline

Current Work
- Existing schemes to fetch multiple basic blocks per cycle
  - Branch address cache + multiple branch prediction (Yeh)
    - The branch address cache is a natural extension of the branch target buffer: it provides the starting addresses of the next several basic blocks
    - An interleaved instruction cache organization fetches multiple basic blocks per cycle
  - Trace cache (Rotenberg)
    - Caches dynamic instruction sequences
    - Exploits the locality of the dynamic instruction stream, eliminating the need to fetch multiple non-contiguous basic blocks and to align them for the pipeline

Branch Address Cache (Yeh & Patt)
- A hardware mechanism to fetch multiple non-consecutive basic blocks per cycle
  - Multiple branch predictions per cycle using two-level adaptive predictors
  - A branch address cache to generate fetch addresses for multiple basic blocks
  - An interleaved instruction cache organization to provide enough bandwidth to supply multiple non-consecutive basic blocks
  - Non-contiguous instruction alignment: fetch and align multiple non-contiguous basic blocks and pass them to the pipeline

Multiple Branch Predictions

Multiple Branch Predictor
- Variations of global two-level schemes are proposed (an MGAg sketch follows below):
  - MGAg: Multiple branch Global Adaptive prediction using a Global pattern history table
  - MGAs: Multiple branch Global Adaptive prediction using a per-Set pattern history table
- Multiple branch prediction based on local schemes would require more complicated BHT access, due to the sequential accesses for the primary/secondary/tertiary branches
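A simplified sketch of the MGAg idea, assuming 2-bit PHT counters and a global history register into which each prediction is speculatively shifted before the next, deeper prediction is made; the widths and indexing are illustrative:

```c
#include <stdint.h>

#define HIST_BITS 12
#define PHT_SIZE  (1u << HIST_BITS)

static uint8_t  pht[PHT_SIZE];   /* 2-bit saturating counters, 0..3 */
static uint32_t ghr;             /* global history register */

/* Predict up to three branches in one cycle, MGAg-style: the primary
 * prediction uses the full global history; each deeper prediction
 * speculatively shifts the shallower prediction into the history
 * it indexes with. */
void predict3(int pred[3])
{
    uint32_t h = ghr & (PHT_SIZE - 1u);

    pred[0] = pht[h] >= 2;                                  /* primary   */

    uint32_t h2 = ((h << 1) | (uint32_t)pred[0]) & (PHT_SIZE - 1u);
    pred[1] = pht[h2] >= 2;                                 /* secondary */

    uint32_t h3 = ((h2 << 1) | (uint32_t)pred[1]) & (PHT_SIZE - 1u);
    pred[2] = pht[h3] >= 2;                                 /* tertiary  */
}
```

In hardware the deeper predictions do not wait on the primary one: the PHT is read with the shared high-order history bits so that the 2 (or 4) candidate counters come out in parallel, and the earlier predictions merely select among them.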

Multiple Branch Predictors

Branch Address Cache
- A single fetch address accesses the BAC, which provides multiple target addresses
- For each prediction level L, the BAC provides 2^L addresses (targets and fall-throughs)
  - For example, with 3 branch predictions per cycle the BAC provides 14 (2 + 4 + 8) addresses
- For 2 branch predictions per cycle, each BAC entry holds (a struct sketch follows below):
  - TAG
  - Primary_valid, Primary_type
  - Taddr, Naddr (primary target and fall-through addresses)
  - ST_valid, ST_type, SN_valid, SN_type (valid/type bits for the secondary branches on the taken and not-taken paths)
  - TTaddr, TNaddr, NTaddr, NNaddr (secondary target and fall-through addresses for each primary outcome)
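A sketch of one such entry and of the address selection it enables, following the field list above; the field widths and the branch-type encoding are assumptions:

```c
#include <stdint.h>
#include <stdbool.h>

/* One BAC entry for 2 predictions/cycle (illustrative layout). */
typedef struct {
    uint32_t tag;

    /* primary branch in the fetch block */
    bool     primary_valid;
    uint8_t  primary_type;    /* e.g., conditional / unconditional / return */
    uint32_t taddr, naddr;    /* primary target / fall-through */

    /* secondary branches on the taken (ST) and not-taken (SN) paths */
    bool     st_valid, sn_valid;
    uint8_t  st_type,  sn_type;
    uint32_t ttaddr, tnaddr;  /* secondary addresses if primary taken */
    uint32_t ntaddr, nnaddr;  /* secondary addresses if primary not taken */
} bac_entry_t;

/* Given the two predictions, pick the two fetch addresses for next cycle. */
static inline void bac_next(const bac_entry_t *e, int p1, int p2,
                            uint32_t *a1, uint32_t *a2)
{
    *a1 = p1 ? e->taddr : e->naddr;
    if (p1)
        *a2 = p2 ? e->ttaddr : e->tnaddr;
    else
        *a2 = p2 ? e->ntaddr : e->nnaddr;
}
```

Even at L = 2 the entry already stores six addresses, and each extra prediction level doubles the newly added ones; this is the storage-growth problem noted later on the Issues slide.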

ICache for Multiple BB Access
- Two alternatives
  - Interleaved cache organization
    - Works as long as there is no bank conflict (see the conflict check sketched below)
    - Increasing the number of banks reduces conflicts
  - Multi-ported cache
    - Expensive
- The ICache miss rate increases
  - Since more instructions are fetched each cycle, there are fewer cycles between ICache misses
  - Remedies: increase associativity, increase cache size, prefetching
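The bank-conflict condition for the interleaved alternative, assuming lines are interleaved across banks by their low-order line-address bits (the bank count and line size are illustrative):

```c
#include <stdint.h>
#include <stdbool.h>

#define NUM_BANKS  8      /* illustrative */
#define LINE_BYTES 32

/* Two cache lines can be read in the same cycle only if they map to
 * different banks. Bank = line address mod number of banks; accesses
 * to the same line need only one read, so they do not conflict. */
static inline bool bank_conflict(uint32_t addr_a, uint32_t addr_b)
{
    uint32_t line_a = addr_a / LINE_BYTES;
    uint32_t line_b = addr_b / LINE_BYTES;
    return line_a != line_b && (line_a % NUM_BANKS) == (line_b % NUM_BANKS);
}
```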

Fetch Performance

Issues
- Issues with the branch address cache
  - The ICache must support simultaneous access to multiple non-contiguous cache lines
    - Too expensive (multi-ported caches)
    - Bank conflicts (interleaved organization)
  - Complex shift and alignment logic is needed to assemble the non-contiguous blocks into a sequential instruction stream
  - The number of target addresses stored in the branch address cache increases substantially as branch prediction throughput increases (2^(L+1) - 2 addresses for L predictions per cycle, as computed above)

Trace Cache (Rotenberg & Smith)
- Idea
  - Cache the dynamic instruction stream (the ICache stores the static instruction stream)
  - Based on the following two characteristics:
    - Temporal locality of the instruction stream
    - Branch behavior: most branches tend to be biased towards one direction or the other
- Issues
  - Redundant instruction storage
    - The same instructions can reside in both the ICache and the trace cache
    - The same instructions can appear in multiple trace cache lines

Trace Cache (Rotenberg & Smith)
- Organization
  - A special top-level instruction cache, each line of which stores a trace, a dynamic instruction sequence
  - A trace is a sequence of the dynamic instruction stream
    - At most n instructions and m basic blocks, where n is the trace cache line size and m is the branch predictor throughput
    - Specified by a starting address and m - 1 branch outcomes
  - Trace cache hit: a line has the same starting address and the same predicted branch outcomes as the current IP (a lookup sketch follows below)
  - Trace cache miss: fetching proceeds normally from the instruction cache
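A minimal sketch of the hit test just described, assuming a direct-mapped trace cache indexed by the fetch PC, with branch outcomes packed into bit flags; these organization details are illustrative:

```c
#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>

#define TC_SETS    256
#define MAX_INSTRS 16         /* n: trace line size */

typedef struct {
    bool     valid;
    uint32_t start_pc;               /* starting address of the trace */
    uint8_t  num_branches;           /* up to m - 1 recorded outcomes */
    uint8_t  branch_flags;           /* bit i = outcome of branch i */
    uint8_t  num_instrs;
    uint32_t instrs[MAX_INSTRS];     /* the cached dynamic sequence */
} tc_line_t;

static tc_line_t tcache[TC_SETS];

/* Hit if the line starts at the fetch PC and its recorded branch
 * outcomes match this cycle's predictions; otherwise fall back to
 * the conventional I-cache path. */
const tc_line_t *tc_lookup(uint32_t pc, uint8_t pred_flags)
{
    tc_line_t *line = &tcache[(pc >> 2) % TC_SETS];
    uint8_t mask = (uint8_t)((1u << line->num_branches) - 1u);

    if (line->valid && line->start_pc == pc &&
        (pred_flags & mask) == (line->branch_flags & mask))
        return line;       /* hit: supply up to n instructions at once */
    return NULL;           /* miss: fetch from the instruction cache */
}
```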

Trace Cache Organization

Design Options
- Associativity
  - Path associativity: the number of traces that may start at the same address
- Partial matches
  - When only the first few branch predictions match the branch flags, provide a prefix of the trace
- Indexing: fetch address vs. fetch address + predictions (sketched below)
- Multiple fill buffers
- Victim trace cache
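Two assumed indexing functions contrasting the options above; folding the predictions into the index lets traces that start at the same address but follow different paths land in different sets, a cheap alternative to path associativity (the hash and widths are made up):

```c
#include <stdint.h>

#define TC_SETS   256
#define PRED_BITS 2        /* m - 1 outcome bits folded into the index */

/* Index by fetch address only: all traces starting at this PC
 * compete for the same set. */
static inline uint32_t tc_index_addr(uint32_t pc)
{
    return (pc >> 2) % TC_SETS;
}

/* Index by fetch address + predictions: different predicted paths
 * from the same start PC map to different sets. */
static inline uint32_t tc_index_addr_pred(uint32_t pc, uint32_t pred_flags)
{
    uint32_t p = pred_flags & ((1u << PRED_BITS) - 1u);
    return ((pc >> 2) ^ (p << 6)) % TC_SETS;   /* fold predictions in */
}
```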

Experimentation
- Assumptions
  - Unlimited hardware resources, constrained only by true data dependences
  - Unlimited register renaming
  - Full dynamic execution
- Schemes
  - SEQ1: 1 basic block at a time
  - SEQ3: 3 consecutive basic blocks at a time
  - TC: trace cache
  - CB: collapsing buffer (Conte)
  - BAC: branch address cache (Yeh)

Performance

Trace Cache Miss Rates
- Trace miss rate: the percentage of accesses that miss the TC
- Instruction miss rate: the percentage of instructions not supplied by the TC (the toy computation below contrasts the two)
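A toy computation contrasting the two metrics; the counter values are invented purely to show that the rates need not coincide, since a hit supplies many instructions at once:

```c
#include <stdio.h>

int main(void)
{
    /* Invented counters for illustration only. */
    long accesses = 1000000,  misses         = 200000;   /* TC accesses  */
    long instrs   = 12000000, instrs_from_tc = 10400000; /* instructions */

    printf("trace miss rate:       %.1f%%\n", 100.0 * misses / accesses);
    printf("instruction miss rate: %.1f%%\n",
           100.0 * (instrs - instrs_from_tc) / instrs);
    return 0;
}
```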

Exercises and Discussion
- Itanium uses an instruction buffer between the front end (FE) and the back end (BE). What are the advantages of this structure?
- How can you add path associativity to a normal trace cache?