Lecture 11: Memory Data Flow Techniques


1 Lecture 11: Memory Data Flow Techniques
Load/store buffer design, memory-level parallelism, the memory consistency model, and memory disambiguation.

2 Load/Store Execution Steps
Load: LW R2, 0(R1)
- Generate the virtual address; may wait on the base register
- Translate the virtual address into a physical address
- Read the data cache
Store: SW R2, 0(R1)
- Generate the virtual address; may wait on the base register and the data register
- Translate the virtual address into a physical address
- Write the data cache
Unlike register accesses, memory addresses are not known prior to execution.
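The steps above can be sketched as a toy Python model. The page-table contents, page size, and function names are illustrative assumptions, not part of the slides:

```python
# Minimal sketch (assumed toy model, not real hardware) of the load/store
# execution steps: address generation, translation, then cache access.
PAGE = 4096
page_table = {0: 7}            # virtual page -> physical frame (toy mapping)
dcache = {}                    # physical address -> value

def translate(vaddr):
    frame = page_table[vaddr // PAGE]
    return frame * PAGE + vaddr % PAGE

def load(base, offset):        # LW rd, offset(base): read the data cache
    paddr = translate(base + offset)
    return dcache.get(paddr, 0)

def store(base, offset, data): # SW rs, offset(base): write the data cache
    paddr = translate(base + offset)
    dcache[paddr] = data

store(base=100, offset=8, data=42)
print(load(base=100, offset=8))   # -> 42
```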

3 Load/store Buffer in Tomasulo
Support memory-level parallelism:
- Loads wait in the load buffer until their address is ready; the memory reads are then processed
- Stores wait in the store buffer until their address and data are ready; the memory writes wait further until the stores are committed
[Diagram: IM and fetch unit feed decode/rename; reorder buffer, regfile, and reservation stations (RS) dispatch to FU1/FU2, with a store buffer (S-buf) and load buffer (L-buf) in front of data memory (DM)]
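The two buffer policies can be stated as predicates. This is a hedged toy sketch; the field names are assumptions:

```python
# Sketch of the load/store buffer policy above (toy model): a load may
# access memory once its address is ready; a store's write must also wait
# until the store commits.
def load_may_read(entry):
    return entry["addr_ready"]

def store_may_write(entry):
    return entry["addr_ready"] and entry["data_ready"] and entry["committed"]

ld = {"addr_ready": True}
st = {"addr_ready": True, "data_ready": True, "committed": False}
print(load_may_read(ld), store_may_write(st))   # -> True False
st["committed"] = True
print(store_may_write(st))                      # -> True
```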

4 Load/store Unit with Centralized RS
Centralized RS includes part of the load/store buffers of the Tomasulo scheme
Loads and stores wait in the RS until they are ready
[Diagram: IM and fetch unit feed decode/rename; reorder buffer, regfile, and a centralized RS dispatch to a store unit (S-unit), load unit (L-unit), and FU1/FU2; the store unit's address and data pass through a store buffer to the cache]

5 Memory-level Parallelism
for (i = 0; i < 100; i++) A[i] = A[i]*2;

Loop: L.S  F2, 0(R1)
      MULT F2, F2, F4
      SW   F2, 0(R1)
      ADD  R1, R1, 4
      BNE  R1, R3, Loop

F4 holds 2.0. Significant improvement comes from overlapping the reads and writes of different iterations (LW1, LW2, LW3, ... overlapped with SW1, SW2, SW3, ...) rather than issuing them strictly sequentially.
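A back-of-the-envelope latency model shows why the overlap matters. The latency numbers below are illustrative assumptions, not measurements from the slides:

```python
# Toy latency model (assumed numbers) for memory-level parallelism: if each
# access takes MISS cycles, serializing N accesses costs N*MISS cycles,
# while fully overlapped accesses cost roughly MISS plus one issue slot each.
MISS = 100                     # assumed memory access latency in cycles
N = 6                          # LW1..LW3 and SW1..SW3 from the slide

sequential = N * MISS          # one access completes before the next starts
overlapped = MISS + (N - 1)    # one access issued per cycle, latencies overlap
print(sequential, overlapped)  # -> 600 105
```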

6 Memory Consistency
Memory contents must be the same as under sequential execution
Must respect RAW, WAW, and WAR dependences
Practical implementations:
- Reads may proceed out of order
- Writes proceed to memory in program order
- Reads may bypass earlier writes only if their addresses are different
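The "writes proceed in program order" rule can be sketched with a FIFO store buffer; the toy memory model below is an assumption for illustration:

```python
# Sketch (toy model) of in-order write drain: buffered stores reach memory
# strictly in the order they entered the buffer, so WAW order is preserved.
from collections import deque

memory = {}
store_buffer = deque()            # FIFO preserves program order

def buffer_store(addr, value):
    store_buffer.append((addr, value))

def drain_one():
    addr, value = store_buffer.popleft()   # oldest store drains first
    memory[addr] = value

buffer_store(0x10, 1)
buffer_store(0x10, 2)             # WAW to the same address
drain_one(); drain_one()
print(memory[0x10])               # -> 2 (later write wins, as in sequential order)
```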

7 Store Stages in Dynamic Execution
- Wait in the RS until the base address and the store data are available (ready)
- Move to the store unit for address calculation and address translation
- Move to the store buffer (finished)
- Wait for ROB commit (completed)
- Write to the data cache (retired)
Stores always retire in order, preserving WAW and WAR dependences
[Diagram: RS feeds the store unit and load unit; finished and completed stores sit in the store buffer before the D-cache]
Source: Shen and Lipasti, page 197
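The store lifecycle above can be written as a tiny state machine. The state names (ready, finished, completed, retired) follow the slide; the event names are assumptions:

```python
# Sketch of the store lifecycle as a toy state machine; events that do not
# apply in the current state leave the state unchanged.
def advance(state, event):
    transitions = {
        ("in_RS", "operands_available"): "ready",
        ("ready", "address_translated"): "finished",   # moved to store buffer
        ("finished", "rob_commit"): "completed",
        ("completed", "dcache_write"): "retired",
    }
    return transitions.get((state, event), state)

s = "in_RS"
for e in ["operands_available", "address_translated", "rob_commit", "dcache_write"]:
    s = advance(s, e)
print(s)  # -> retired
```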

8 Load Bypassing and Memory Disambiguation
To exploit memory parallelism, loads have to bypass older stores; but this may violate RAW dependences
Dynamic memory disambiguation: dynamic detection of memory dependences
- Compare each load address with the addresses of all older stores
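The disambiguation check itself is a simple address comparison, sketched here as a toy model:

```python
# Sketch of dynamic memory disambiguation: a load may bypass the stores
# ahead of it only if its address matches none of theirs.
def load_may_bypass(load_addr, older_store_addrs):
    return not any(load_addr == a for a in older_store_addrs)

print(load_may_bypass(0x100, [0x200, 0x300]))  # -> True  (safe to bypass)
print(load_may_bypass(0x200, [0x200, 0x300]))  # -> False (RAW dependence)
```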

9 Load Bypassing Implementation
Assume in-order execution of loads/stores:
1. Address calculation
2. Address translation
3. If no match in the store buffer, update the destination register
An associative search of the store buffer finds any matching older store address
[Diagram: the RS issues in order to the load unit and store unit; store addresses and data enter the store buffer, which loads search for a match before reading the D-cache]

10 Load Forwarding
Load forwarding: if a load address matches an older store address, the store's data can be forwarded
- If a match is found, forward the related data to the destination register (in the ROB)
- Multiple matches may exist; the last (youngest older) store wins
[Diagram: as in load bypassing, but matched store-buffer data is routed to the destination register instead of being read from the D-cache]
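The "last one wins" search can be sketched by scanning the store queue from youngest to oldest; the queue representation below is an assumed toy model:

```python
# Sketch of load forwarding: scan older stores from youngest to oldest; the
# first match found (i.e. the last-executed matching store) supplies the data.
def forward(load_addr, store_queue):
    """store_queue: list of (addr, data) pairs, oldest first."""
    for addr, data in reversed(store_queue):   # youngest match wins
        if addr == load_addr:
            return data
    return None                                # no match: read the D-cache

sq = [(0x40, 11), (0x80, 22), (0x40, 33)]      # two stores to 0x40
print(forward(0x40, sq))  # -> 33 (the youngest matching store)
print(forward(0x99, sq))  # -> None
```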

11 In-order Issue Limitation
Any store in the RS may block all following loads
- When is F2 of SW available?
- When is the next L.S ready?
Assume a reasonable FU latency and pipeline length

for (i = 0; i < 100; i++) A[i] = A[i]/2;

Loop: L.S  F2, 0(R1)
      DIV  F2, F2, F4
      SW   F2, 0(R1)
      ADD  R1, R1, 4
      BNE  R1, R3, Loop

12 Speculative Load Execution
Forwarding does not always work if some store addresses are unknown
- No match found: predict that the load has no RAW dependence on older stores and execute it speculatively
- Finished loads record their addresses in a finished load buffer; matching is redone when stores complete
- If a match is found at that point, the prediction was wrong: flush the pipeline at commit
[Diagram: out-of-order issue from the RS to the load and store units; the finished load buffer is checked by completing stores before the D-cache is updated]
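The check-at-completion mechanism can be sketched as follows; the finished load buffer is modeled as a simple address set, which is an illustrative assumption:

```python
# Sketch of speculative load execution (toy model): a load with unresolved
# older stores predicts "no conflict" and records its address in a finished
# load buffer; each completing store checks that buffer, and a match means
# the speculation was wrong and the pipeline must be flushed.
finished_loads = set()

def execute_load_speculatively(addr):
    finished_loads.add(addr)       # predict: no RAW on older stores

def complete_store(addr):
    if addr in finished_loads:
        return "flush"             # mispredicted: squash at commit
    return "ok"

execute_load_speculatively(0x100)
print(complete_store(0x200))  # -> ok
print(complete_store(0x100))  # -> flush
```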

13 Alpha Pipeline

14 Alpha 21264 Load/Store Queues
32-entry load queue (LQ), 32-entry store queue (SQ)
[Diagram: the integer issue queue feeds two address ALUs and two integer ALUs (Int RF, 80 registers); the FP issue queue feeds two FP ALUs (FP RF, 72 registers); the address ALUs access the D-TLB, LQ, SQ, and dual-ported D-cache]

15 Load Bypassing, Forwarding, and RAW Detection
At commit, the ROB checks whether the head instruction is a load or a store:
- Load: wait if the LQ head is not completed, then move the LQ head
- Store: mark the SQ head as completed, then move the SQ head
Executing loads search the SQ; on an address match, the store data is forwarded
Executing stores search the LQ; on an address match, a store-load trap is marked to flush the pipeline (at commit)
[Diagram: the issue queues (IQ) feed the LQ and SQ, which both access the D-cache; ROB commit drains the queue heads]

16 Speculative Memory Disambiguation
- When a load is trapped at commit, set the stWait bit in a table indexed by the load's PC
- When the load is fetched, it reads its stWait bit from the table
- A load whose stWait bit is set waits in the issue queue until all older stores have issued
- The stWait table is cleared periodically
[Diagram: the fetch PC indexes the stWait table; the bit travels with the renamed instruction into the integer issue queue]
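The stWait predictor can be sketched as a bit table indexed by the low bits of the PC. The table size below is an assumption for illustration, not taken from the slides:

```python
# Sketch of the stWait predictor (toy model): a trapped load sets its bit,
# a fetched load consults its bit, and the whole table decays periodically.
TABLE_SIZE = 1024                       # assumed table size
stwait = [False] * TABLE_SIZE

def on_load_trap(pc):                   # load trapped at commit
    stwait[pc % TABLE_SIZE] = True

def must_wait_for_stores(pc):           # checked when the load is fetched
    return stwait[pc % TABLE_SIZE]

def periodic_clear():                   # avoid waiting forever on stale bits
    for i in range(TABLE_SIZE):
        stwait[i] = False

on_load_trap(0x4A0)
print(must_wait_for_stores(0x4A0))  # -> True
periodic_clear()
print(must_wait_for_stores(0x4A0))  # -> False
```

Note that because the table is indexed by `pc % TABLE_SIZE`, unrelated loads whose PCs alias to the same entry share a prediction; the periodic clear bounds the cost of such false matches.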

17 Architectural Memory States
- LQ / SQ: completed entries (not yet committed)
- Committed states: L1-Cache, L2-Cache, L3-Cache (optional), Memory, Disk, Tape, etc.
- Memory request: search the hierarchy from top to bottom
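The top-to-bottom search can be sketched as follows; representing each level as a dictionary is an illustrative assumption:

```python
# Sketch of a top-to-bottom read of the hierarchy above (toy model): check
# the store queue's completed entries first, then each cache level, then
# memory, so the newest value for an address is always found first.
def read(addr, sq, levels):
    """sq: completed-but-uncommitted stores; levels: dicts ordered L1..memory."""
    if addr in sq:
        return sq[addr]                # newest value lives in the store queue
    for level in levels:
        if addr in level:
            return level[addr]
    raise KeyError(addr)

sq = {0x10: 5}
l1, l2, mem = {0x20: 7}, {}, {0x30: 9}
print(read(0x10, sq, [l1, l2, mem]))  # -> 5 (hit in the SQ)
print(read(0x30, sq, [l1, l2, mem]))  # -> 9 (falls through to memory)
```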

18 Summary of Superscalar Execution
- Instruction flow techniques: branch prediction, branch target prediction, and instruction prefetch
- Register data flow techniques: register renaming, instruction scheduling, in-order commit, misprediction recovery
- Memory data flow techniques: load/store units, memory consistency
Source: Shen & Lipasti

