Lecture 11: Memory Data Flow Techniques


Lecture 11: Memory Data Flow Techniques. Topics: load/store buffer design, memory-level parallelism, consistency model, memory disambiguation.

Load/Store Execution Steps

Load: LW R2, 0(R1)
- Generate the virtual address; may wait on the base register
- Translate the virtual address into a physical address
- Read the data cache

Store: SW R2, 0(R1)
- Generate the virtual address; may wait on the base register and the data register
- Translate the virtual address into a physical address
- Write the data cache

Unlike register accesses, memory addresses are not known prior to execution.
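The steps above can be sketched in a toy model. This is not how real hardware is written, just an illustration: the page size, the page table, and the dictionary-as-cache are all assumptions for the example.

```python
# Toy sketch of load/store execution: address generation (base + offset),
# a minimal page-table translation, then the cache access.
PAGE = 4096
page_table = {0: 7, 1: 3}               # virtual page -> physical frame (toy)

def translate(vaddr):
    vpn, off = divmod(vaddr, PAGE)      # split into page number and offset
    return page_table[vpn] * PAGE + off

def load(base, offset, dcache):         # LW Rd, offset(base)
    paddr = translate(base + offset)    # generate, then translate
    return dcache.get(paddr)            # read the data cache

def store(base, offset, data, dcache):  # SW Rs, offset(base)
    paddr = translate(base + offset)
    dcache[paddr] = data                # (a real store waits for commit first)

mem = {}
store(4096, 8, 42, mem)                 # writes physical address 3*4096+8
print(load(4096, 8, mem))               # 42
```

Note that both the load and the store only learn their addresses while executing, which is exactly why the later slides need disambiguation hardware.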

Load/store Buffer in Tomasulo

Supports memory-level parallelism:
- Loads wait in the load buffer until their address is ready; memory reads are then processed
- Stores wait in the store buffer until their address and data are ready; memory writes wait further until the stores are committed

[Diagram: fetch unit with I-cache, decode/rename, reorder buffer, register file, reservation stations (RS), store buffer (S-buf), load buffer (L-buf), functional units FU1/FU2, data memory]

Load/store Unit with Centralized RS

A centralized RS includes part of the Tomasulo load/store buffers; loads and stores wait in the RS until they are ready.

[Diagram: fetch unit with I-cache, decode/rename, reorder buffer, register file, centralized RS feeding a store unit (with store buffer), a load unit, and functional units FU1/FU2, all connected to the cache via address and data paths]

Memory-level Parallelism

for (i=0;i<100;i++) A[i] = A[i]*2;

Loop: L.S  F2, 0(R1)
      MULT F2, F2, F4    ; F4 holds 2.0
      SW   F2, 0(R1)
      ADD  R1, R1, 4
      BNE  R1, R3, Loop

Overlapping loads and stores from different iterations (LW1, LW2, LW3 alongside SW1, SW2, SW3) gives a significant improvement over sequential reads/writes.

Memory Consistency

Memory contents must be the same as under sequential execution:
- Must respect RAW, WAW, and WAR dependences

Practical implementations:
- Reads may proceed out of order
- Writes proceed to memory in program order
- Reads may bypass earlier writes only if their addresses are different
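The last rule above can be stated as a small predicate. A minimal sketch, assuming a simplified model in which each pending store exposes only its address (the function name is invented for illustration):

```python
# A load may bypass the older, still-pending stores only if its address
# differs from every one of them; otherwise a RAW dependence forces it to wait.
def may_bypass(load_addr, pending_store_addrs):
    """Return True if the load can legally bypass all older stores."""
    return all(load_addr != s for s in pending_store_addrs)

# Example: stores to 0x100 and 0x108 are still pending.
pending = [0x100, 0x108]
print(may_bypass(0x200, pending))   # True  -> addresses differ, safe to bypass
print(may_bypass(0x108, pending))   # False -> RAW hazard, the load must wait
```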

Store Stages in Dynamic Execution

- Wait in the RS until the base address and store data are available (ready)
- Move to the store unit for address calculation and address translation
- Move to the store buffer (finished)
- Wait for ROB commit (completed)
- Write to the data cache (retired)

Stores always retire in program order, preserving WAW and WAR dependences.

[Diagram: RS feeding store unit and load unit; store buffer entries marked finished/completed, draining to the D-cache]

Source: Shen and Lipasti, page 197
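The store lifecycle above can be modeled as a tiny state machine. A hedged sketch (stage names follow the slide; the function is invented for illustration), whose key property is that a finished store cannot advance until the ROB commits it:

```python
# Ordered store stages; "retired" (the actual cache write) comes only
# after ROB commit, which is what makes stores non-speculative at memory.
STAGES = ["in_RS", "address_calc", "finished", "completed", "retired"]

def advance(state, rob_committed=False):
    """Advance a store by one stage; a finished store waits for commit."""
    if state == "finished" and not rob_committed:
        return state                      # stay in the store buffer
    i = STAGES.index(state)
    return STAGES[i + 1] if i + 1 < len(STAGES) else state

s = "in_RS"
s = advance(s)                        # -> "address_calc"
s = advance(s)                        # -> "finished" (in store buffer)
s = advance(s)                        # still "finished": ROB has not committed
s = advance(s, rob_committed=True)    # -> "completed"
s = advance(s)                        # -> "retired": write to the data cache
print(s)                              # retired
```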

Load Bypassing and Memory Disambiguation

To exploit memory parallelism, loads have to bypass stores, but this may violate RAW dependences.

Dynamic memory disambiguation: dynamically detect memory dependences by comparing each load address with every older store address.

Load Bypassing Implementation

Assume in-order execution of loads and stores:
1. Address calculation
2. Address translation
3. If no match against older store addresses, update the destination register

The load address is associatively searched against the store buffer for a matching older store address.

[Diagram: RS issuing in order to a load unit and a store unit; the store buffer's (addr, data) entries are searched for a match before the load accesses the D-cache]

Load Forwarding

Load forwarding: if a load address matches an older store address, the store's data can be forwarded to the load's destination register (in the ROB) instead of reading the cache. Multiple matches may exist; the last (youngest) one wins.

[Diagram: same datapath as load bypassing, with matching store-buffer data routed to the destination register rather than coming from the D-cache]
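The search-and-forward rule, including "last one wins", can be shown in a few lines. A minimal sketch under assumed names (store queue as a list of (addr, data) pairs, oldest first; the cache as a dict):

```python
# Load forwarding: scan older stores oldest -> youngest; the youngest
# matching store supplies the data ("last one wins"); if no store
# matches, the load reads the data cache instead.
def forward_or_read(load_addr, store_queue, dcache):
    """store_queue: list of (addr, data) tuples, oldest first."""
    data = None
    for addr, value in store_queue:   # later matches overwrite earlier ones
        if addr == load_addr:
            data = value
    if data is not None:
        return data                   # forwarded from the store queue
    return dcache.get(load_addr)      # no match: read the data cache

sq = [(0x100, 1), (0x108, 2), (0x100, 3)]   # two older stores to 0x100
cache = {0x200: 9}
print(forward_or_read(0x100, sq, cache))    # 3: youngest matching store wins
print(forward_or_read(0x200, sq, cache))    # 9: no match, the cache supplies data
```

Real hardware does this search associatively in one cycle over all store-buffer entries; the sequential loop here is only a functional model.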

In-order Issue Limitation

Any store in the RS may block all following loads:
- When is F2 of the SW available?
- When is the next L.S ready?
- Assume reasonable FU latency and pipeline length

for (i=0;i<100;i++) A[i] = A[i]/2;

Loop: L.S  F2, 0(R1)
      DIV  F2, F2, F4
      SW   F2, 0(R1)
      ADD  R1, R1, 4
      BNE  R1, R3, Loop

Speculative Load Execution

Forwarding does not always work, because some older store addresses may still be unknown. Instead:
- No match in the store buffer: predict that the load has no RAW dependence on older stores and let it execute
- Addresses are checked again at completion; if a match is found then, flush the pipeline at commit

[Diagram: out-of-order RS, load and store units, finished-load buffer checked against resolving store addresses; a match triggers a pipeline flush]
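The late check above is simple to state: when an older store finally resolves its address, compare it against the loads that already executed speculatively past it. A minimal sketch with invented names (the finished-load buffer modeled as a set of addresses):

```python
# When an older store resolves its address, any finished speculative load
# to the same address was mispredicted and forces a pipeline flush at commit.
def check_speculative_loads(store_addr, finished_load_addrs):
    """Return True if the resolving store hits a finished speculative load."""
    return store_addr in finished_load_addrs

finished = {0x100, 0x140}      # addresses of loads that executed speculatively
print(check_speculative_loads(0x140, finished))  # True  -> mispredicted, flush
print(check_speculative_loads(0x200, finished))  # False -> speculation was safe
```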

Alpha 21264 Pipeline

Alpha 21264 Load/Store Queues

32-entry load queue, 32-entry store queue.

[Diagram: integer and FP issue queues feeding address ALUs, integer ALUs, and FP ALUs; two 80-entry integer register files and a 72-entry FP register file; D-TLB; load queue (L-Q) and store queue (S-Q) in front of a dual-ported D-cache]

Load Bypassing, Forwarding, and RAW Detection

At commit:
- Load: wait if the LQ head is not completed, then advance the LQ head
- Store: mark the SQ head as completed, then advance the SQ head

If a load matches an entry in the store queue, the store's data is forwarded to the load. If a store finds a matching younger load that has already completed, a store-load trap is marked to flush the pipeline (at commit).

[Diagram: issue queues feeding the load queue (LQ) and store queue (SQ); both queues are checked against each other and the D-cache, with the ROB directing commit]

Speculative Memory Disambiguation

The 21264 uses a 1024-entry, 1-bit stWait table indexed by the load's fetch PC:
- When a load is trapped at commit, its stWait bit in the table is set
- When the load is next fetched, it reads its stWait bit from the table
- A load whose stWait bit is set waits in the (integer) issue queue until all older stores have issued
- The stWait table is cleared periodically

Architectural Memory States

Completed LQ/SQ entries -> committed states -> L1 cache -> L2 cache -> L3 cache (optional) -> memory -> disk, tape, etc.

A memory request searches the hierarchy from top to bottom.

Summary of Superscalar Execution

- Instruction flow techniques: branch prediction, branch target prediction, and instruction prefetch
- Register data flow techniques: register renaming, instruction scheduling, in-order commit, misprediction recovery
- Memory data flow techniques: load/store units, memory consistency

Source: Shen & Lipasti