18-447: Computer Architecture Lecture 25: Main Memory


18-447: Computer Architecture Lecture 25: Main Memory Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 4/3/2013

Reminder: Homework 5 (Today) Due April 3 (Wednesday!) Topics: Vector processing, VLIW, Virtual memory, Caching

Reminder: Lab Assignment 5 (Friday) Due Friday, April 5 Modeling caches and branch prediction at the microarchitectural level (cycle level) in C Extra credit: Cache design optimization Size, block size, associativity Replacement and insertion policies Cache indexing policies Anything else you would like

Heads Up: Midterm II in Two Weeks April 17 Similar format as Midterm I

Last Lecture Wrap up virtual memory – cache interaction Virtually-indexed physically-tagged caches Solutions to the synonym problem Improving cache (and memory hierarchy) performance Cheaper alternatives to more associativity Blocking and code reorganization Memory-level-parallelism (MLP) aware cache replacement Enabling multiple accesses in parallel

Today Enabling multiple accesses in parallel Main memory

Improving Basic Cache Performance Reducing miss rate More associativity Alternatives/enhancements to associativity Victim caches, hashing, pseudo-associativity, skewed associativity Better replacement/insertion policies Software approaches Reducing miss latency/cost Multi-level caches Critical word first Subblocking/sectoring Non-blocking caches (multiple cache misses in parallel) Multiple accesses per cycle

Review: Memory Level Parallelism (MLP) parallel miss isolated miss B A C time Memory Level Parallelism (MLP) means generating and servicing multiple memory accesses in parallel [Glew'98] Several techniques to improve MLP (e.g., out-of-order execution) MLP varies: some misses are isolated and some are parallel How does this affect cache replacement? Processors are fast, memory is slow. One way to bridge this gap is to service memory accesses in parallel. If misses are serviced in parallel, the processor incurs only one long-latency stall for all the parallel misses. The notion of generating and servicing off-chip accesses in parallel is termed Memory Level Parallelism, or MLP. Out-of-order processors improve MLP by continuing to process independent instructions until the instruction window is full. If there is a miss during the processing of independent instructions, that miss is serviced concurrently with the stalling miss. There are several techniques, such as runahead execution and CFP, to improve the MLP of out-of-order processors. MLP is not uniform across cache misses: some misses, such as pointer-chasing loads, are typically serviced with low MLP, while others, such as array accesses, are serviced with high MLP. Current cache management techniques are not aware of this variation in MLP.

Review: Fewest Misses = Best Performance P4 P3 P2 P1 P1 P2 P3 P4 S1 S2 S3 Hit/Miss H H H M H H H H M M M Misses=4 Stalls=4 Time stall Belady’s OPT replacement Although Belady’s OPT minimizes misses, it is impossible to implement in a real processor as it requires knowledge of the future. We employ the commonly used LRU policy in our experiments Hit/Miss H M M M H M M M H H H Saved cycles Time stall Misses=6 Stalls=2 MLP-Aware replacement

Reading: MLP-Aware Cache Replacement How do we incorporate MLP into replacement decisions? Qureshi et al., “A Case for MLP-Aware Cache Replacement,” ISCA 2006. Required reading for this week

Enabling Multiple Outstanding Misses

Handling Multiple Outstanding Accesses Non-blocking or lockup-free caches Kroft, “Lockup-Free Instruction Fetch/Prefetch Cache Organization," ISCA 1981. Question: If the processor can generate multiple cache accesses, can the later accesses be handled while a previous miss is outstanding? Idea: Keep track of the status/data of misses that are being handled in Miss Status Handling Registers (MSHRs) A cache access checks MSHRs to see if a miss to the same block is already pending. If pending, a new request is not generated If pending and the needed data available, data forwarded to later load Requires buffering of outstanding miss requests

Non-Blocking Caches (and MLP) Enable cache access when there is a pending miss Enable multiple misses in parallel Memory-level parallelism (MLP) generating and servicing multiple memory accesses in parallel Why generate multiple misses? Enables latency tolerance: overlaps latency of different misses How to generate multiple misses? Out-of-order execution, multithreading, runahead, prefetching parallel miss isolated miss C A B time

Miss Status Handling Register Also called “miss buffer” Keeps track of Outstanding cache misses Pending load/store accesses that refer to the missing cache block Fields of a single MSHR entry Valid bit Cache block address (to match incoming accesses) Control/status bits (prefetch, issued to memory, which subblocks have arrived, etc) Data for each subblock For each pending load/store Valid, type, data size, byte in block, destination register or store buffer entry address

Miss Status Handling Register Entry

MSHR Operation On a cache miss: Search MSHRs for a pending access to the same block Found: Allocate a load/store entry in the same MSHR entry Not found: Allocate a new MSHR No free entry: stall When a subblock returns from the next level in memory Check which loads/stores are waiting for it Forward data to the load/store unit Deallocate load/store entry in the MSHR entry Write subblock in cache or MSHR If last subblock, deallocate MSHR (after writing the block in cache)
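A minimal C sketch of this bookkeeping (entry counts, field names, and the flat-array search are illustrative assumptions, and subblock handling is omitted):

```c
#include <stdint.h>
#include <stdbool.h>

#define NUM_MSHRS       16
#define LOADS_PER_MSHR   4

typedef struct {
    bool    valid;
    uint8_t dest_reg;       /* destination register or store buffer entry */
    uint8_t byte_in_block;  /* which part of the block this access needs  */
} PendingAccess;

typedef struct {
    bool          valid;
    uint64_t      block_addr;               /* to match incoming accesses    */
    bool          issued_to_memory;         /* one of the control/status bits */
    PendingAccess waiters[LOADS_PER_MSHR];  /* pending loads/stores           */
} MSHR;

static MSHR mshrs[NUM_MSHRS];

static bool add_waiter(MSHR *m, PendingAccess a)
{
    for (int i = 0; i < LOADS_PER_MSHR; i++) {
        if (!m->waiters[i].valid) {
            m->waiters[i] = a;
            m->waiters[i].valid = true;
            return true;
        }
    }
    return false;   /* no free load/store slot in this entry: stall */
}

/* On a cache miss: merge into a pending MSHR if one matches the block,
 * otherwise allocate a new one; return false to signal a stall. */
bool handle_miss(uint64_t block_addr, PendingAccess access)
{
    int free_idx = -1;
    for (int i = 0; i < NUM_MSHRS; i++) {
        if (mshrs[i].valid && mshrs[i].block_addr == block_addr)
            return add_waiter(&mshrs[i], access);   /* no new memory request */
        if (!mshrs[i].valid && free_idx < 0)
            free_idx = i;
    }
    if (free_idx < 0)
        return false;                               /* all MSHRs in use: stall */
    mshrs[free_idx].valid = true;
    mshrs[free_idx].block_addr = block_addr;
    mshrs[free_idx].issued_to_memory = false;
    for (int i = 0; i < LOADS_PER_MSHR; i++)
        mshrs[free_idx].waiters[i].valid = false;
    return add_waiter(&mshrs[free_idx], access);
}
```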

Non-Blocking Cache Implementation When to access the MSHRs? In parallel with the cache? After cache access is complete? MSHRs need not be on the critical path of hit requests Which one below is the common case? Cache miss, MSHR hit Cache hit

Enabling High Bandwidth Caches (and Memories in General)

Multiple Instructions per Cycle Can generate multiple cache accesses per cycle How do we ensure the cache can handle multiple accesses in the same clock cycle? Solutions: true multi-porting virtual multi-porting (time sharing a port) multiple cache copies banking (interleaving)

Handling Multiple Accesses per Cycle (I) True multiporting Each memory cell has multiple read or write ports + Truly concurrent accesses (no conflicts regardless of address) -- Expensive in terms of latency, power, area What about read and write to the same location at the same time? Peripheral logic needs to handle this

Peripheral Logic for True Multiporting

Peripheral Logic for True Multiporting

Handling Multiple Accesses per Cycle (II) Virtual multiporting Time-share a single port Each access needs to be (significantly) shorter than the clock cycle Used in Alpha 21264 Is this scalable?

Handling Multiple Accesses per Cycle (III) Multiple cache copies Stores update both caches Loads proceed in parallel Used in Alpha 21164 Scalability? Store operations form a bottleneck Area proportional to “ports” Port 1 Load Cache Copy 1 Port 1 Data Store Cache Copy 2 Port 2 Data Port 2 Load

Handling Multiple Accesses per Cycle (IV) Banking (Interleaving) Bits in the address determine which bank an address maps to Address space is partitioned into separate banks Which bits to use for the “bank address”? + No increase in data store area -- Cannot satisfy multiple accesses to the same bank -- Crossbar interconnect for input/output Bank conflicts: two accesses are to the same bank How can these be reduced? Hardware? Software? Bank 0: Even addresses Bank 1: Odd addresses

General Principle: Interleaving Interleaving (banking) Problem: a single monolithic memory array takes long to access and does not enable multiple accesses in parallel Goal: Reduce the latency of memory array access and enable multiple accesses in parallel Idea: Divide the array into multiple banks that can be accessed independently (in the same cycle or in consecutive cycles) Each bank is smaller than the entire memory storage Accesses to different banks can be overlapped Issue: How do you map data to different banks? (i.e., how do you interleave data across banks?)
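Two common answers to the data-mapping question, sketched in C (the bank count and per-bank capacity are illustrative assumptions, not values from the lecture):

```c
#include <stdint.h>

#define NUM_BANKS      4
#define BANK_CAPACITY  (1u << 20)   /* words per bank, assumed */

/* Low-order interleaving: consecutive words go to consecutive banks,
 * so sequential accesses can overlap across banks. */
static inline unsigned bank_low(uint32_t word_addr)   { return word_addr % NUM_BANKS; }
static inline unsigned index_low(uint32_t word_addr)  { return word_addr / NUM_BANKS; }

/* High-order interleaving: each bank holds one contiguous chunk of the
 * address space, so a sequential stream stays within one bank. */
static inline unsigned bank_high(uint32_t word_addr)  { return word_addr / BANK_CAPACITY; }
static inline unsigned index_high(uint32_t word_addr) { return word_addr % BANK_CAPACITY; }
```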

Main Memory

Main Memory in the System CORE 0 L2 CACHE 0 L2 CACHE 1 CORE 1 SHARED L3 CACHE DRAM INTERFACE DRAM MEMORY CONTROLLER DRAM BANKS CORE 2 L2 CACHE 2 L2 CACHE 3 CORE 3

The Memory Chip/System Abstraction

Review: Memory Bank Organization Read access sequence: 1. Decode row address & drive word-lines 2. Selected bits drive bit-lines • Entire row read 3. Amplify row data 4. Decode column address & select subset of row • Send to output 5. Precharge bit-lines • For next access

Review: SRAM (Static Random Access Memory) Read sequence: 1. address decode 2. drive row select 3. selected bit-cells drive bitlines (entire row is read together) 4. differential sensing and column select (data is ready) 5. precharge all bitlines (for next read or write) Access latency dominated by steps 2 and 3; cycling time dominated by steps 2, 3 and 5 (step 2 proportional to 2^m, steps 3 and 5 proportional to 2^n) [Figure: 2^n-row x 2^m-column bit-cell array (n ≈ m to minimize overall latency), with row select, bitline/_bitline pairs, and 2^m differential sense amps with column mux]

Review: DRAM (Dynamic Random Access Memory) Bits stored as charge on node capacitance (non-restorative): the bit cell loses charge when read and loses charge over time Read sequence: steps 1~3 same as SRAM 4. a “flip-flopping” sense amp amplifies and regenerates the bitline, and the data bit is mux’ed out 5. precharge all bitlines Refresh: a DRAM controller must periodically read all rows within the allowed refresh time (10s of ms) so that charge is restored in the cells A DRAM die comprises multiple such arrays [Figure: 2^n-row x 2^m-column bit-cell array (n ≈ m to minimize overall latency), with RAS/CAS addressing, row enable, bitlines, and sense amp/column mux]
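A quick back-of-the-envelope view of what the refresh requirement implies, using assumed values (a 64 ms refresh window and 8192 rows per bank are typical DDR-class numbers, not figures from the slide):

```c
#include <stdio.h>

int main(void)
{
    const double refresh_window_ms = 64.0;   /* assumed allowed refresh time */
    const int    rows_per_bank     = 8192;   /* assumed row count            */

    /* To refresh every row once per window, the controller must issue a
     * row refresh roughly every window / rows interval. */
    printf("one row refresh every %.2f us\n",
           refresh_window_ms * 1000.0 / rows_per_bank);
    return 0;
}
```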

Review: DRAM vs. SRAM DRAM: slower access (capacitor); higher density (1T-1C cell); lower cost; requires refresh (power, performance, circuitry); manufacturing requires putting capacitor and logic together. SRAM: faster access (no capacitor); lower density (6T cell); higher cost; no need for refresh; manufacturing compatible with logic process (no capacitor).

Some Fundamental Concepts (I) Physical address space Maximum size of main memory: total number of uniquely identifiable locations Physical addressability Minimum size of data in memory that can be addressed Byte-addressable, word-addressable, 64-bit-addressable Addressability depends on the abstraction level of the implementation Alignment Does the hardware support unaligned access transparently to software? Interleaving

Some Fundamental Concepts (II) Interleaving (banking) Problem: a single monolithic memory array takes long to access and does not enable multiple accesses in parallel Goal: Reduce the latency of memory array access and enable multiple accesses in parallel Idea: Divide the array into multiple banks that can be accessed independently (in the same cycle or in consecutive cycles) Each bank is smaller than the entire memory storage Accesses to different banks can be overlapped Issue: How do you map data to different banks? (i.e., how do you interleave data across banks?)

Interleaving

Interleaving Options

Some Questions/Concepts Remember CRAY-1 with 16 banks 11 cycle bank latency Consecutive words in memory in consecutive banks (word interleaving) 1 access can be started (and finished) per cycle Can banks be operated fully in parallel? Multiple accesses started per cycle? What is the cost of this? We have seen it earlier (today) Modern superscalar processors have L1 data caches with multiple, fully-independent banks
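A tiny simulation of the CRAY-1 arithmetic: because the number of banks (16) exceeds the bank busy time (11 cycles) and consecutive words sit in consecutive banks, a sequential stream never revisits a bank before it is free, so one access can start every cycle (the bank count and latency come from the slide; the simulation itself is only illustrative):

```c
#include <stdio.h>

#define NUM_BANKS    16
#define BANK_LATENCY 11

int main(void)
{
    int bank_free_at[NUM_BANKS] = {0};
    for (int word = 0; word < 32; word++) {
        int bank  = word % NUM_BANKS;        /* word interleaving            */
        int start = word;                    /* try to start one per cycle   */
        if (start < bank_free_at[bank])
            start = bank_free_at[bank];      /* would stall on a bank conflict */
        bank_free_at[bank] = start + BANK_LATENCY;
        printf("word %2d -> bank %2d, starts at cycle %2d\n", word, bank, start);
    }
    return 0;
}
```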

The Bank Abstraction

Rank

The DRAM Subsystem

DRAM Subsystem Organization Channel DIMM Rank Chip Bank Row/Column

The DRAM Bank Structure

Page Mode DRAM A DRAM bank is a 2D array of cells: rows x columns A “DRAM row” is also called a “DRAM page” “Sense amplifiers” also called “row buffer” Each address is a <row,column> pair Access to a “closed row” Activate command opens row (placed into row buffer) Read/write command reads/writes column in the row buffer Precharge command closes the row and prepares the bank for next access Access to an “open row” No need for activate command

DRAM Bank Operation Access Address: (Row 0, Column 0) Columns Row address 0 Row address 1 Row decoder Rows Row 0 Empty Row 1 Row Buffer CONFLICT ! HIT HIT Column address 0 Column address 1 Column address 0 Column address 85 Column mux Data
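The access sequence in this example (row 0/column 0 on a closed row, a row-buffer hit to another column of row 0, then a conflict when row 1 is requested) can be captured by a small per-bank state machine; the sketch below is illustrative, not a real memory controller:

```c
#include <stdbool.h>
#include <stdio.h>

typedef struct {
    bool row_open;   /* is some row latched in the row buffer (sense amps)? */
    int  open_row;   /* which row, if any */
} BankState;

/* Issue the command sequence needed to read <row, col> from one bank. */
void access_bank(BankState *b, int row, int col)
{
    if (b->row_open && b->open_row == row) {
        printf("READ      col %d  (row buffer hit)\n", col);
        return;
    }
    if (b->row_open)
        printf("PRECHARGE          (row buffer conflict: close open row)\n");
    printf("ACTIVATE  row %d  (open row into row buffer)\n", row);
    printf("READ      col %d\n", col);
    b->row_open = true;
    b->open_row = row;
}

int main(void)
{
    BankState bank = { false, -1 };
    access_bank(&bank, 0, 0);    /* closed row: ACTIVATE + READ            */
    access_bank(&bank, 0, 85);   /* open-row hit: READ only                */
    access_bank(&bank, 1, 0);    /* conflict: PRECHARGE + ACTIVATE + READ  */
    return 0;
}
```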

The DRAM Chip Consists of multiple banks (2-16 in Synchronous DRAM) Banks share command/address/data buses The chip itself has a narrow interface (4-16 bits per read)

128M x 8-bit DRAM Chip

DRAM Rank and Module Rank: Multiple chips operated together to form a wide interface All chips comprising a rank are controlled at the same time Respond to a single command Share address and command buses, but provide different data A DRAM module consists of one or more ranks E.g., DIMM (dual inline memory module) This is what you plug into your motherboard If we have chips with 8-bit interface, to read 8 bytes in a single access, use 8 chips in a DIMM

A 64-bit Wide DIMM (One Rank)

A 64-bit Wide DIMM (One Rank) Advantages: Acts like a high-capacity DRAM chip with a wide interface Flexibility: memory controller does not need to deal with individual chips Disadvantages: Granularity: Accesses cannot be smaller than the interface width

Multiple DIMMs Advantages: Enables even higher capacity Disadvantages: Interconnect complexity and energy consumption can be high

DRAM Channels 2 Independent Channels: 2 Memory Controllers (Above) 2 Dependent/Lockstep Channels: 1 Memory Controller with wide interface (Not Shown above)

Generalized Memory Structure

Generalized Memory Structure

The DRAM Subsystem The Top Down View

DRAM Subsystem Organization Channel DIMM Rank Chip Bank Row/Column

DIMM (Dual in-line memory module) The DRAM subsystem “Channel” DIMM (Dual in-line memory module) Processor Memory channel Memory channel

DIMM (Dual in-line memory module) Breaking down a DIMM DIMM (Dual in-line memory module) Side view Front of DIMM Back of DIMM

DIMM (Dual in-line memory module) Breaking down a DIMM DIMM (Dual in-line memory module) Side view Front of DIMM Back of DIMM Rank 0: collection of 8 chips Rank 1

Rank Rank 0 (Front) Rank 1 (Back) <0:63> <0:63> Addr/Cmd CS <0:1> Data <0:63> Memory channel

. . . Breaking down a Rank Chip 0 Chip 1 Chip 7 Rank 0 <0:63> <0:7> <8:15> <56:63> Data <0:63>

Breaking down a Chip ... 8 banks Chip 0 Bank 0 <0:7> <0:7>

Breaking down a Bank ... ... Row-buffer Bank 0 <0:7> 1B 1B 1B 1B (column) row 16k-1 Bank 0 ... row 0 <0:7> Row-buffer 1B 1B 1B ... <0:7>

DRAM Subsystem Organization Channel DIMM Rank Chip Bank Row/Column

Example: Transferring a cache block Physical memory space 0xFFFF…F Channel 0 ... DIMM 0 Mapped to 0x40 Rank 0 64B cache block 0x00

Example: Transferring a cache block Physical memory space Chip 0 Chip 1 Chip 7 Rank 0 0xFFFF…F . . . ... <0:7> <8:15> <56:63> 0x40 64B cache block Data <0:63> 0x00

Example: Transferring a cache block Physical memory space Chip 0 Chip 1 Chip 7 Rank 0 0xFFFF…F . . . Row 0 Col 0 ... <0:7> <8:15> <56:63> 0x40 64B cache block Data <0:63> 0x00

Example: Transferring a cache block Physical memory space Chip 0 Chip 1 Chip 7 Rank 0 0xFFFF…F . . . Row 0 Col 0 ... <0:7> <8:15> <56:63> 0x40 64B cache block Data <0:63> 8B 0x00 8B

Example: Transferring a cache block Physical memory space Chip 0 Chip 1 Chip 7 Rank 0 0xFFFF…F . . . Row 0 Col 1 ... <0:7> <8:15> <56:63> 0x40 64B cache block Data <0:63> 8B 0x00

Example: Transferring a cache block Physical memory space Chip 0 Chip 1 Chip 7 Rank 0 0xFFFF…F . . . Row 0 Col 1 ... <0:7> <8:15> <56:63> 0x40 64B cache block 8B Data <0:63> 8B 0x00 8B

Example: Transferring a cache block Physical memory space Chip 0 Chip 1 Chip 7 Rank 0 0xFFFF…F . . . Row 0 Col 1 ... <0:7> <8:15> <56:63> 0x40 64B cache block 8B Data <0:63> 8B 0x00 A 64B cache block takes 8 I/O cycles to transfer. During the process, 8 columns are read sequentially.

Latency Components: Basic DRAM Operation CPU → controller transfer time Controller latency Queuing & scheduling delay at the controller Access converted to basic commands Controller → DRAM transfer time DRAM bank latency Simple CAS if row is “open” OR RAS + CAS if array precharged OR PRE + RAS + CAS (worst case) DRAM → CPU transfer time (through controller)
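The three bank-latency cases translate directly into a latency estimate; the sketch below uses made-up timing constants (the tCAS, tRCD, and tRP values are illustrative, not numbers from the lecture):

```c
#include <stdio.h>

enum BankCase { ROW_HIT, ROW_CLOSED, ROW_CONFLICT };

int bank_latency_cycles(enum BankCase c)
{
    const int tCAS = 10;   /* column access (CAS)          */
    const int tRCD = 10;   /* activate-to-read delay (RAS) */
    const int tRP  = 10;   /* precharge                    */

    switch (c) {
    case ROW_HIT:      return tCAS;               /* row already open  */
    case ROW_CLOSED:   return tRCD + tCAS;        /* array precharged  */
    case ROW_CONFLICT: return tRP + tRCD + tCAS;  /* worst case        */
    }
    return 0;
}

int main(void)
{
    printf("hit=%d closed=%d conflict=%d cycles\n",
           bank_latency_cycles(ROW_HIT),
           bank_latency_cycles(ROW_CLOSED),
           bank_latency_cycles(ROW_CONFLICT));
    return 0;
}
```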

Multiple Banks (Interleaving) and Channels Enable concurrent DRAM accesses Bits in address determine which bank an address resides in Multiple independent channels serve the same purpose But they are even better because they have separate data buses Increased bus bandwidth Enabling more concurrency requires reducing Bank conflicts Channel conflicts How to select/randomize bank/channel indices in address? Lower order bits have more entropy Randomizing hash functions (XOR of different address bits)

How Multiple Banks/Channels Help

Multiple Channels Advantages: Increased bandwidth Multiple concurrent accesses (if independent channels) Disadvantages: Higher cost than a single channel More board wires More pins (if on-chip memory controller)

Address Mapping (Single Channel) Single-channel system with 8-byte memory bus 2GB memory, 8 banks, 16K rows & 2K columns per bank Row interleaving Consecutive rows of memory in consecutive banks Cache block interleaving Consecutive cache block addresses in consecutive banks 64 byte cache blocks Accesses to consecutive cache blocks can be serviced in parallel How about random accesses? Strided accesses? Row (14 bits) Bank (3 bits) Column (11 bits) Byte in bus (3 bits) Row (14 bits) High Column Bank (3 bits) Low Col. Byte in bus (3 bits) 8 bits 3 bits
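The two interleavings on this slide, written out as bit-field extraction in C; the field positions follow the 14-bit row / 3-bit bank / 11-bit column / 3-bit byte-in-bus layout shown (the struct and function names are illustrative):

```c
#include <stdint.h>
#include <stdio.h>

typedef struct { unsigned row, bank, col; } DramCoord;

/* Row interleaving: consecutive rows of memory go to consecutive banks. */
DramCoord map_row_interleaved(uint32_t paddr)
{
    DramCoord c;
    c.col  = (paddr >> 3)  & 0x7FF;   /* bits 3..13  : column      */
    c.bank = (paddr >> 14) & 0x7;     /* bits 14..16 : bank        */
    c.row  = (paddr >> 17) & 0x3FFF;  /* bits 17..30 : row         */
    return c;
}

/* Cache block interleaving: consecutive 64B blocks go to consecutive banks. */
DramCoord map_block_interleaved(uint32_t paddr)
{
    DramCoord c;
    unsigned low_col  = (paddr >> 3) & 0x7;    /* bits 3..5  : low column  */
    c.bank            = (paddr >> 6) & 0x7;    /* bits 6..8  : bank        */
    unsigned high_col = (paddr >> 9) & 0xFF;   /* bits 9..16 : high column */
    c.col  = (high_col << 3) | low_col;
    c.row  = (paddr >> 17) & 0x3FFF;
    return c;
}

int main(void)
{
    /* Two consecutive 64B cache blocks land in different banks only
     * under cache block interleaving. */
    for (uint32_t a = 0; a < 128; a += 64) {
        DramCoord r = map_row_interleaved(a), b = map_block_interleaved(a);
        printf("addr %3u: row-interleaved bank %u, block-interleaved bank %u\n",
               a, r.bank, b.bank);
    }
    return 0;
}
```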

Bank Mapping Randomization DRAM controller can randomize the address mapping to banks so that bank conflicts are less likely 3 bits Column (11 bits) Byte in bus (3 bits) XOR Bank index (3 bits)
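A one-line version of this idea in C, assuming the row-interleaved field layout from the previous slide and XORing the bank field with three low-order row bits (which bits are XORed is an illustrative assumption):

```c
#include <stdint.h>

/* XOR the 3-bit bank field with 3 low-order row bits so that strided
 * accesses that would otherwise pile onto one bank get spread out. */
static inline unsigned randomized_bank(uint32_t paddr)
{
    unsigned bank_bits = (paddr >> 14) & 0x7;   /* original bank field   */
    unsigned row_bits  = (paddr >> 17) & 0x7;   /* 3 low-order row bits  */
    return bank_bits ^ row_bits;                /* 3-bit bank index      */
}
```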

Address Mapping (Multiple Channels) Row (14 bits) Bank (3 bits) Column (11 bits) Byte in bus (3 bits) Where are consecutive cache blocks? Row (14 bits) C Bank (3 bits) Column (11 bits) Byte in bus (3 bits) Row (14 bits) Bank (3 bits) C Column (11 bits) Byte in bus (3 bits) Row (14 bits) Bank (3 bits) Column (11 bits) C Byte in bus (3 bits) C Row (14 bits) High Column Bank (3 bits) Low Col. Byte in bus (3 bits) 8 bits 3 bits Row (14 bits) C High Column Bank (3 bits) Low Col. Byte in bus (3 bits) 8 bits 3 bits Row (14 bits) High Column C Bank (3 bits) Low Col. Byte in bus (3 bits) 8 bits 3 bits Row (14 bits) High Column Bank (3 bits) C Low Col. Byte in bus (3 bits) 8 bits 3 bits Row (14 bits) High Column Bank (3 bits) Low Col. C Byte in bus (3 bits) 8 bits 3 bits

Interaction with Virtual→Physical Mapping Operating System influences where an address maps to in DRAM Operating system can control which bank/channel/rank a virtual page is mapped to. It can perform page coloring to minimize bank conflicts Or to minimize inter-application interference Virtual Page number (52 bits) Page offset (12 bits) VA Physical Frame number (19 bits) Page offset (12 bits) PA Row (14 bits) Bank (3 bits) Column (11 bits) Byte in bus (3 bits) PA
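A hypothetical sketch of how an OS allocator could exploit this: with the row-interleaved layout above and 4KB pages, the 3 bank bits are physical address bits 14-16, i.e., bits 2-4 of the frame number, so the OS can hand out free frames whose bank index matches a per-application "color" (the free-list walk and the coloring policy here are illustrative assumptions, not a description of a real OS):

```c
#include <stdint.h>
#include <stddef.h>

/* Bank index of a 4KB physical frame: PA bits 14..16 = frame bits 2..4. */
static inline unsigned frame_bank(uint64_t frame_number)
{
    return (unsigned)((frame_number >> 2) & 0x7);
}

/* Pick a free frame whose bank matches the color assigned to this app;
 * return (uint64_t)-1 if none is available. */
uint64_t alloc_colored_frame(const uint64_t *free_frames, size_t n, unsigned color)
{
    for (size_t i = 0; i < n; i++)
        if (frame_bank(free_frames[i]) == color)
            return free_frames[i];
    return (uint64_t)-1;
}
```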