Memory Data Flow Prof. Mikko H. Lipasti University of Wisconsin-Madison Lecture notes based on notes by John P. Shen Updated by Mikko Lipasti

Memory Data Flow
– Memory Data Dependences
– Load Bypassing
– Load Forwarding
– Speculative Disambiguation
– The Memory Bottleneck
Cache Hits and Cache Misses

Memory Data Dependences
Besides branches, long memory latencies are one of the biggest performance challenges today.
To preserve sequential (in-order) state in the data caches and external memory (so that recovery from exceptions is possible), stores are performed in order. This takes care of antidependences and output dependences to memory locations.
However, loads can be issued out of order with respect to stores if the out-of-order loads check for data dependences with respect to previous, pending stores.
WAW: store X … store X
WAR: load X … store X
RAW: store X … load X

Memory Data Dependences
“Memory Aliasing” = Two memory references involving the same memory location (collision of two memory addresses).
“Memory Disambiguation” = Determining whether two memory references will alias or not (whether there is a dependence or not).
Memory Dependency Detection:
– Must compute effective addresses of both memory references
– Effective addresses can depend on run-time data and other instructions
– Comparison of addresses requires much wider comparators
Example code:
(1) STORE V
(2) ADD
(3) LOAD W
(4) LOAD X
(5) LOAD V – RAW on V with (1)
(6) ADD
(7) STORE W – WAR on W with (3)
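To make the address-comparison step concrete, here is a minimal C++ sketch of the dependence test; the MemRef structure and function names are hypothetical, and it assumes word-granularity addresses (no partial overlap).

```cpp
#include <cstdint>

// Hypothetical memory reference after address generation (agen).
struct MemRef {
    uint64_t effAddr;   // effective address, known only at run time
    bool     isStore;
};

// Two references alias if they touch the same location (full-width compare).
bool aliases(const MemRef& a, const MemRef& b) {
    return a.effAddr == b.effAddr;
}

// A true memory dependence exists only when the references alias AND at
// least one of them is a store (RAW, WAR, or WAW on that location).
bool hasMemoryDependence(const MemRef& older, const MemRef& younger) {
    return aliases(older, younger) && (older.isStore || younger.isStore);
}
```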

Total Order of Loads and Stores
Keep all loads and stores totally in order with respect to each other. However, loads and stores can execute out of order with respect to other types of instructions. Consequently, stores are held for all previous instructions, and loads are held for all previous stores.
– I.e. stores are performed at the commit point
– Sufficient to prevent wrong branch path stores, since all prior branches are resolved by then

Illustration of Total Order

Load Bypassing
Loads can be allowed to bypass stores (if no aliasing). Two separate reservation stations and address generation units are employed for loads and stores. Store addresses still need to be computed before loads can be issued, to allow checking for load dependences. If the dependence cannot be checked, e.g. the store address cannot be determined, then all subsequent loads are held until the address is valid (conservative). Stores are kept in the ROB until all previous instructions complete, and are kept in the store buffer until gaining access to the cache port.
– Store buffer is the “future file” for memory
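A minimal sketch of the bypass decision described above, assuming a store queue ordered oldest-to-youngest; the entry layout and names are hypothetical, and addresses are compared at word granularity.

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical store-queue entry for a store older than the load in question.
struct StoreQEntry {
    std::optional<uint64_t> addr;   // empty until the store's agen completes
};

// A load may bypass (issue ahead of) its older stores only if every older
// store address is known and none matches the load's address. If any older
// store address is still unknown, hold the load (the conservative policy).
bool loadMayBypass(uint64_t loadAddr, const std::vector<StoreQEntry>& olderStores) {
    for (const auto& st : olderStores) {
        if (!st.addr.has_value()) return false;   // address not computed yet: hold
        if (*st.addr == loadAddr)  return false;  // aliases an older store: no bypass
    }
    return true;
}
```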

Illustration of Load Bypassing

Load Forwarding If a subsequent load has a dependence on a store still in the store buffer, it need not wait till the store is issued to the data cache. The load can be directly satisfied from the store buffer if the address is valid and the data is available in the store buffer. Since data is sourced from the store buffer: – Could avoid accessing the cache to reduce power/latency
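A sketch of the forwarding case, again with hypothetical structures; olderStores is assumed to be in program order (oldest first), so the scan runs youngest-to-oldest to pick the most recent matching store.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical store-buffer entry; data may become valid later than the address.
struct StoreBufEntry {
    bool     addrValid;
    bool     dataValid;
    uint64_t addr;
    uint64_t data;
};

struct ForwardResult {
    bool     match;       // an older store to the same address exists
    bool     dataReady;   // that store's data is available to forward
    uint64_t data;        // valid only if match && dataReady
};

// If the youngest older store with a matching address has its data, forward it
// and skip the D-cache access; if it matches but the data is not ready, wait.
ForwardResult tryForward(uint64_t loadAddr,
                         const std::vector<StoreBufEntry>& olderStores) {
    for (auto it = olderStores.rbegin(); it != olderStores.rend(); ++it) {
        if (it->addrValid && it->addr == loadAddr)
            return {true, it->dataValid, it->data};   // youngest matching store
    }
    return {false, false, 0};                          // no match: read the D-cache
}
```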

Illustration of Load Forwarding

The DAXPY Example Total Order

Performance Gains From Weak Ordering

Optimizing Load/Store Disambiguation
Non-speculative load/store disambiguation:
1. Loads wait for addresses of all prior stores
2. Full address comparison
3. Bypass if no match, forward if match
(1) can limit performance:
load r5, MEM[r3]   – cache miss
store r7, MEM[r5]  – RAW for agen, stalled
…
load r8, MEM[r9]   – independent load stalled

Speculative Disambiguation
What if aliases are rare?
1. Loads don’t wait for addresses of all prior stores
2. Full address comparison of stores that are ready
3. Bypass if no match, forward if match
4. Check all store addresses when they commit
   – No matching loads – speculation was correct
   – Matching unbypassed load – incorrect speculation
5. Replay starting from incorrect load
[Diagram: Load/Store RS, Agen, Load Queue, Store Queue, Reorder Buffer, Mem]
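A minimal sketch of the commit-time check (step 4) and the replay trigger (step 5); the queue layout, sequence numbers, and the “forwarded” flag are all hypothetical simplifications.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical load-queue entry for a load that already executed speculatively.
struct LoadQEntry {
    uint64_t seqNum;      // program-order sequence number
    uint64_t addr;        // resolved effective address
    bool     forwarded;   // true if it already received this store's data
};

// At store commit: any younger load to the same address that did NOT get its
// data forwarded consumed a stale value. Return the sequence number of the
// oldest such load so the pipeline can replay from there (0 = no violation).
uint64_t checkLoadsAtStoreCommit(uint64_t storeAddr, uint64_t storeSeq,
                                 const std::vector<LoadQEntry>& loadQueue) {
    uint64_t replayFrom = 0;
    for (const auto& ld : loadQueue) {
        if (ld.seqNum > storeSeq && ld.addr == storeAddr && !ld.forwarded) {
            if (replayFrom == 0 || ld.seqNum < replayFrom)
                replayFrom = ld.seqNum;   // oldest violating load wins
        }
    }
    return replayFrom;
}
```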

Speculative Disambiguation: Load Bypass
i1: st R3, MEM[R8] – address initially unknown, resolves to x800A
i2: ld R9, MEM[R4] – address initially unknown, resolves to x400A
i1 and i2 issue in program order
i2 checks store queue (no match)

Speculative Disambiguation: Load Forward
i1: st R3, MEM[R8] – address initially unknown, resolves to x800A
i2: ld R9, MEM[R4] – address initially unknown, resolves to x800A
i1 and i2 issue in program order
i2 checks store queue (match => forward)

Speculative Disambiguation: Safe Speculation
i1: st R3, MEM[R8] – address initially unknown, resolves to x800A
i2: ld R9, MEM[R4] – address initially unknown, resolves to x400C
i1 and i2 issue out of program order
i1 checks load queue at commit (no match)

Speculative Disambiguation: Violation
i1: st R3, MEM[R8] – address initially unknown, resolves to x800A
i2: ld R9, MEM[R4] – address initially unknown, resolves to x800A
i1 and i2 issue out of program order
i1 checks load queue at commit (match) – i2 marked for replay

Use of Prediction
If aliases are rare: static prediction
– Predict no alias every time
  – Why even implement forwarding? PowerPC 620 doesn’t
– Pay misprediction penalty rarely
If aliases are more frequent: dynamic prediction
– Use PHT-like history table for loads
  – If alias predicted: delay load
  – If aliased pair predicted: forward from store to load
    – More difficult to predict pair [store sets, Alpha 21264]
– Pay misprediction penalty rarely
Memory cloaking [Moshovos, Sohi]
– Predict load/store pair
– Directly copy store data register to load target register
– Reduce data transfer latency to absolute minimum
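A sketch of a PHT-like dynamic alias predictor in the spirit described above; the table size, PC hashing, counter width, and thresholds are all hypothetical tuning choices.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// 2-bit saturating counters indexed by load PC: high counts mean "this load
// has recently aliased an older store", so delay it; low counts mean issue it
// speculatively. Trained at commit, when the true outcome is known.
class AliasPredictor {
    static constexpr std::size_t kEntries = 4096;   // must be a power of two
    std::array<uint8_t, kEntries> counters{};       // start at 0: predict no alias

    static std::size_t index(uint64_t pc) { return (pc >> 2) & (kEntries - 1); }

public:
    bool predictAlias(uint64_t loadPC) const {
        return counters[index(loadPC)] >= 2;        // predict alias: delay the load
    }
    void update(uint64_t loadPC, bool didAlias) {
        uint8_t& c = counters[index(loadPC)];
        if (didAlias) { if (c < 3) ++c; }
        else          { if (c > 0) --c; }
    }
};
```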

Load/Store Disambiguation Discussion
RISC ISA:
– Many registers, most variables allocated to registers
– Aliases are rare
– Most important to not delay loads (bypass)
– Alias predictor may or may not be necessary
CISC ISA:
– Few registers, many operands from memory
– Aliases much more common, forwarding necessary
– Incorrect load speculation should be avoided
– If load speculation allowed, predictor probably necessary
Address translation:
– Can’t use virtual address (must use physical)
  – Wait till after TLB lookup is done
  – Or, use subset of untranslated bits (page offset)
    – Safe for proving inequality (bypassing OK)
    – Not sufficient for showing equality (forwarding not OK)

The Memory Bottleneck

Load/Store Processing
For both Loads and Stores:
1. Effective Address Generation:
   – Must wait on register value
   – Must perform address calculation
2. Address Translation:
   – Must access TLB
   – Can potentially induce a page fault (exception)
For Loads: D-cache Access (Read)
– Can potentially induce a D-cache miss
– Check aliasing against store buffer for possible load forwarding
– If bypassing store, must be flagged as “speculative” load until completion
For Stores: D-cache Access (Write)
– When completing must check aliasing against “speculative” loads
– After completion, wait in store buffer for access to D-cache
– Can potentially induce a D-cache miss

Easing The Memory Bottleneck

Memory Bottleneck Techniques
Dynamic Hardware (Microarchitecture):
– Use Multiple Load/Store Units (need multiported D-cache)
– Use More Advanced Caches (victim cache, stream buffer)
– Use Hardware Prefetching (need load history and stride detection)
– Use Non-blocking D-cache (need missed-load buffers/MSHRs)
– Large instruction window (memory-level parallelism)
Static Software (Code Transformation):
– Insert Prefetch or Cache-Touch Instructions (mask miss penalty) – see the prefetch sketch below
– Array Blocking Based on Cache Organization (minimize misses)
– Reduce Unnecessary Load/Store Instructions (redundant loads)
– Software-Controlled Memory Hierarchy (expose it above the DSI)
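As an example of the “insert prefetch instructions” item, a sketch of a software-prefetched reduction loop, assuming a GCC/Clang-style __builtin_prefetch; the prefetch distance of 16 elements is a hypothetical tuning knob that depends on miss latency and loop body length.

```cpp
#include <cstddef>

// Prefetch a fixed distance ahead so the cache line arrives (ideally) by the
// time that element is needed, masking the miss penalty behind useful work.
double sumWithPrefetch(const double* a, std::size_t n) {
    const std::size_t dist = 16;          // prefetch distance, in elements
    double s = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        if (i + dist < n)
            __builtin_prefetch(&a[i + dist], /*rw=*/0, /*locality=*/3);
        s += a[i];
    }
    return s;
}
```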

Caches and Performance
Caches enable design for the common case: cache hit
– Cycle time, pipeline organization
– Recovery policy
Uncommon case: cache miss
– Fetch from next level
  – Apply recursively if multiple levels
– What to do in the meantime?
– What is the performance impact?
– Various optimizations are possible

Performance Impact
Cache hit latency
– Included in “pipeline” portion of CPI
  – E.g. IBM study: 1.15 CPI with 100% cache hits
– Typically 1-3 cycles for L1 cache
  – Intel/HP McKinley: 1 cycle
    – Heroic array design
    – No address generation: load r1, (r2)
  – IBM Power4: 3 cycles
    – Address generation
    – Array access
    – Word select and align
    – Register file write (no bypass)

Cache Hit continued
Cycle stealing common
– Address generation < cycle
– Array access > cycle
– Clean, FSD cycle boundaries violated
Speculation rampant
– “Predict” cache hit
– Don’t wait for (full) tag check
– Consume fetched word in pipeline
– Recover/flush when miss is detected
  – Reportedly 7 (!) cycles later in Pentium 4
[Diagram: AGEN and CACHE stages overlapping cycle boundaries]

Cache Hits and Performance
Cache hit latency determined by:
– Cache organization
  – Associativity
    – Parallel tag checks expensive, slow
    – Way select slow (fan-in, wires)
  – Block size
    – Word select may be slow (fan-in, wires)
  – Number of blocks (sets x associativity)
    – Wire delay across array
    – “Manhattan distance” = width + height
    – Word line delay: width
    – Bit line delay: height
Array design is an art form
– Detailed analog circuit/wire delay modeling

Cache Misses and Performance
Miss penalty
– Detect miss: 1 or more cycles
– Find victim (replace block): 1 or more cycles
  – Write back if dirty
– Request block from next level: several cycles
  – May need to find line from one of many caches (coherence)
– Transfer block from next level: several cycles
  – (block size) / (bus width)
– Fill block into data array, update tag array: 1+ cycles
– Resume execution
In practice: 6 cycles to 100s of cycles
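A back-of-the-envelope sketch that adds up the miss-penalty components listed above; every cycle count in the example is a hypothetical round number, not taken from any specific design.

```cpp
#include <cstdio>

// Miss penalty = detect + find victim + request + transfer + fill,
// where transfer = (block size) / (bus width) in bus cycles.
int missPenaltyCycles(int detect, int victim, int request,
                      int blockBytes, int busBytesPerCycle, int fill) {
    int transfer = blockBytes / busBytesPerCycle;
    return detect + victim + request + transfer + fill;
}

int main() {
    // e.g. 1 (detect) + 1 (victim) + 20 (request) + 64/8 (transfer) + 1 (fill) = 31
    std::printf("miss penalty = %d cycles\n",
                missPenaltyCycles(1, 1, 20, 64, 8, 1));
    return 0;
}
```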

Cache Miss Rate
Determined by:
– Program characteristics
  – Temporal locality
  – Spatial locality
– Cache organization
  – Block size, associativity, number of sets

Improving Locality
Instruction text placement
– Profile program, place unreferenced or rarely referenced paths “elsewhere”
  – Maximize temporal locality
– Eliminate taken branches
  – Fall-through path has spatial locality

Improving Locality
Data placement, access order
– Arrays: “block” loops to access subarrays that fit into the cache (see the blocking sketch below)
  – Maximize temporal locality
– Structures: pack commonly-accessed fields together
  – Maximize spatial, temporal locality
– Trees, linked lists: allocate in usual reference order
  – Heap manager usually allocates sequential addresses
  – Maximize spatial locality
Hard problem, not easy to automate:
– C/C++ disallows rearranging structure fields
– OK in Java
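The blocking sketch referenced above: a tiled matrix transpose in which each B x B tile of both arrays fits in the cache; the tile size B is a hypothetical parameter tuned to cache capacity.

```cpp
#include <cstddef>

// Without blocking, the column-wise accesses evict lines before they are
// reused; with B x B tiles, each tile is reused while it is still resident.
void transposeBlocked(const double* src, double* dst, std::size_t n,
                      std::size_t B = 64) {
    for (std::size_t ii = 0; ii < n; ii += B)
        for (std::size_t jj = 0; jj < n; jj += B)
            for (std::size_t i = ii; i < ii + B && i < n; ++i)
                for (std::size_t j = jj; j < jj + B && j < n; ++j)
                    dst[j * n + i] = src[i * n + j];
}
```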

Cache Miss Rates: 3 C’s [Hill]
Compulsory miss
– First-ever reference to a given block of memory
– Cold misses = m_c: number of misses for an infinite fully-associative (FA) cache
Capacity
– Working set exceeds cache capacity
– Useful blocks (with future references) displaced
– Capacity misses = m_f - m_c: additional misses for a finite FA cache
Conflict
– Placement restrictions (not fully-associative) cause useful blocks to be displaced
– Think of it as capacity within a set
– Conflict misses = m_a - m_f: additional misses in the actual cache
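The 3C decomposition can be computed directly from three simulations of the same reference trace, as in this sketch; the miss counts used here are made-up numbers for illustration.

```cpp
#include <cstdio>

int main() {
    // Hypothetical miss counts from three simulations of one trace:
    long m_c = 1000;   // fully-associative, infinite size (compulsory/cold)
    long m_f = 2500;   // fully-associative, actual size
    long m_a = 3100;   // the actual (set-associative) cache
    long compulsory = m_c;
    long capacity   = m_f - m_c;
    long conflict   = m_a - m_f;
    std::printf("compulsory=%ld capacity=%ld conflict=%ld\n",
                compulsory, capacity, conflict);
    return 0;
}
```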

Cache Miss Rate Effects
Number of blocks (sets x associativity)
– Bigger is better: fewer conflicts, greater capacity
Associativity
– Higher associativity reduces conflicts
– Very little benefit beyond 8-way set-associative
Block size
– Larger blocks exploit spatial locality
– Usually miss rates improve up to 64B-256B blocks
– At 512B or more, miss rates get worse
  – Larger blocks are less efficient: more capacity misses
  – Fewer placement choices: more conflict misses

Cache Miss Rate
Subtle tradeoffs between cache organization parameters
– Large blocks reduce compulsory misses but increase miss penalty
  – #compulsory ~= (working set) / (block size)
  – #transfers = (block size) / (bus width)
– Large blocks increase conflict misses
  – #blocks = (cache size) / (block size)
– Associativity reduces conflict misses
– Associativity increases access time
Can an associative cache ever have a higher miss rate than a direct-mapped cache of the same size?
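A small sketch that plugs hypothetical numbers into the relations above, showing how growing the block size trades fewer compulsory misses against longer fills and fewer (more conflict-prone) blocks; the 32KB cache, 8KB working set, and 8B bus are illustrative assumptions.

```cpp
#include <cstdio>

int main() {
    const long cacheBytes = 32 * 1024;   // hypothetical cache size
    const long workingSet = 8 * 1024;    // hypothetical working-set footprint
    const long busWidth   = 8;           // bytes transferred per bus cycle
    const long blockSizes[] = {32, 64, 128, 256};
    for (long block : blockSizes) {
        long compulsory = workingSet / block;   // #compulsory ~= working set / block size
        long fillCycles = block / busWidth;     // #transfers = block size / bus width
        long numBlocks  = cacheBytes / block;   // fewer blocks => more conflict potential
        std::printf("block=%3ldB  compulsory~%4ld  fill=%3ld cycles  #blocks=%ld\n",
                    block, compulsory, fillCycles, numBlocks);
    }
    return 0;
}
```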

Cache Miss Rates: 3 C’s Vary size and associativity – Compulsory misses are constant – Capacity and conflict misses are reduced

Cache Miss Rates: 3 C’s Vary size and block size – Compulsory misses drop with increased block size – Capacity and conflict can increase with larger blocks

Summary Memory Data Flow – Memory Data Dependences – Load Bypassing – Load Forwarding – Speculative Disambiguation – The Memory Bottleneck Cache Hits and Cache Misses Further coverage of memory hierarchy later in the semester