15-740/ Computer Architecture Lecture 25: Main Memory

Slides:



Advertisements
Similar presentations
5-1 Memory System. Logical Memory Map. Each location size is one byte (Byte Addressable) Logical Memory Map. Each location size is one byte (Byte Addressable)
Advertisements

COEN 180 DRAM. Dynamic Random Access Memory Dynamic: Periodically refresh information in a bit cell. Else it is lost. Small footprint: transistor + capacitor.
Understanding a Problem in Multicore and How to Solve It
Main Mem.. CSE 471 Autumn 011 Main Memory The last level in the cache – main memory hierarchy is the main memory made of DRAM chips DRAM parameters (memory.
CS.305 Computer Architecture Memory: Structures Adapted from Computer Organization and Design, Patterson & Hennessy, © 2005, and from slides kindly made.
1 Lecture 13: DRAM Innovations Today: energy efficiency, row buffer management, scheduling.
Chapter 9 Memory Basics Henry Hexmoor1. 2 Memory Definitions  Memory ─ A collection of storage cells together with the necessary circuits to transfer.
Lecture 12: DRAM Basics Today: DRAM terminology and basics, energy innovations.
1 Lecture 14: Cache Innovations and DRAM Today: cache access basics and innovations, DRAM (Sections )
1 Lecture 15: DRAM Design Today: DRAM basics, DRAM innovations (Section 5.3)
Memory Hierarchy.1 Review: Major Components of a Computer Processor Control Datapath Memory Devices Input Output.
ENEE350 Ankur Srivastava University of Maryland, College Park Based on Slides from Mary Jane Irwin ( )
1 Lecture 13: Cache Innovations Today: cache access basics and innovations, DRAM (Sections )
1 Lecture 14: Virtual Memory Today: DRAM and Virtual memory basics (Sections )
1 Lecture 7: Caching in Row-Buffer of DRAM Adapted from “A Permutation-based Page Interleaving Scheme: To Reduce Row-buffer Conflicts and Exploit Data.
Main Memory by J. Nelson Amaral.
Physical Memory and Physical Addressing By: Preeti Mudda Prof: Dr. Sin-Min Lee CS147 Computer Organization and Architecture.
1 Lecture 1: Introduction and Memory Systems CS 7810 Course organization:  5 lectures on memory systems  5 lectures on cache coherence and consistency.
Memory Technology “Non-so-random” Access Technology:
1 Lecture 4: Memory: HMC, Scheduling Topics: BOOM, memory blades, HMC, scheduling policies.
CPE232 Memory Hierarchy1 CPE 232 Computer Organization Spring 2006 Memory Hierarchy Dr. Gheith Abandah [Adapted from the slides of Professor Mary Irwin.
CSIE30300 Computer Architecture Unit 07: Main Memory Hsin-Chou Chi [Adapted from material by and
Survey of Existing Memory Devices Renee Gayle M. Chua.
Systems Overview Computer is composed of three main components: CPU Main memory IO devices Refers to page
EEE-445 Review: Major Components of a Computer Processor Control Datapath Memory Devices Input Output Cache Main Memory Secondary Memory (Disk)
Memory System Unit-IV 4/24/2017 Unit-4 : Memory System.
Main Memory CS448.
1 Lecture 14: DRAM Main Memory Systems Today: cache/TLB wrap-up, DRAM basics (Section 2.3)
Computer Memory Storage Decoding Addressing 1. Memories We've Seen SIMM = Single Inline Memory Module DIMM = Dual IMM SODIMM = Small Outline DIMM RAM.
By Edward A. Lee, J.Reineke, I.Liu, H.D.Patel, S.Kim
Modern DRAM Memory Architectures Sam Miller Tam Chantem Jon Lucas CprE 585 Fall 2003.
Computer Architecture Lecture 24 Fasih ur Rehman.
Parallelism-Aware Batch Scheduling Enhancing both Performance and Fairness of Shared DRAM Systems Onur Mutlu and Thomas Moscibroda Computer Architecture.
COMP541 Memories II: DRAMs
1 Adapted from UC Berkeley CS252 S01 Lecture 18: Reducing Cache Hit Time and Main Memory Design Virtucal Cache, pipelined cache, cache summary, main memory.
1 Lecture 3: Memory Buffers and Scheduling Topics: buffers (FB-DIMM, RDIMM, LRDIMM, BoB, BOOM), memory blades, scheduling policies.
1 Lecture: Memory Technology Innovations Topics: memory schedulers, refresh, state-of-the-art and upcoming changes: buffer chips, 3D stacking, non-volatile.
Contemporary DRAM memories and optimization of their usage Nebojša Milenković and Vladimir Stanković, Faculty of Electronic Engineering, Niš.
1 Lecture: DRAM Main Memory Topics: DRAM intro and basics (Section 2.3)
CS35101 Computer Architecture Spring 2006 Lecture 18: Memory Hierarchy Paul Durand ( ) [Adapted from M Irwin (
DRAM Tutorial Lecture Vivek Seshadri. Vivek Seshadri – Thesis Proposal DRAM Module and Chip 2.
CS203 – Advanced Computer Architecture Main Memory Slides adapted from Onur Mutlu (CMU)
Computer Architecture Lecture 22: Main Memory Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 3/26/2014.
15-740/ Computer Architecture Lecture 20: Main Memory II Prof. Onur Mutlu Carnegie Mellon University.
1 Lecture 4: Memory Scheduling, Refresh Topics: scheduling policies, refresh basics.
1 Lecture 16: Main Memory Innovations Today: DRAM basics, innovations, trends HW5 due on Thursday; simulations can take a few hours Midterm: 32 scores.
1 Lecture: Memory Basics and Innovations Topics: memory organization basics, schedulers, refresh,
CS161 – Design and Architecture of Computer Main Memory Slides adapted from Onur Mutlu (CMU)
18-447: Computer Architecture Lecture 25: Main Memory
CS 704 Advanced Computer Architecture
Reducing Memory Interference in Multicore Systems
CSE 502: Computer Architecture
Reducing Hit Time Small and simple caches Way prediction Trace caches
SRAM Memory External Interface
Samira Khan University of Virginia Oct 9, 2017
Lecture 15: DRAM Main Memory Systems
The Main Memory system: DRAM organization
Lecture: DRAM Main Memory
Lecture 23: Cache, Memory, Virtual Memory
Lecture: DRAM Main Memory
Lecture: DRAM Main Memory
Lecture: Memory Technology Innovations
Lecture 15: Memory Design
Samira Khan University of Virginia Nov 14, 2018
15-740/ Computer Architecture Lecture 19: Main Memory
DRAM Hwansoo Han.
Bob Reese Micro II ECE, MSU
Computer Architecture Lecture 30: In-memory Processing
18-447: Computer Architecture Lecture 19: Main Memory
Presentation transcript:

15-740/18-740 Computer Architecture Lecture 25: Main Memory Prof. Onur Mutlu Yoongu Kim Carnegie Mellon University

Today SRAM vs. DRAM Interleaving/Banking DRAM Microarchitecture Memory controller Memory buses Banks, ranks, channels, DIMMs Address mapping: software vs. hardware DRAM refresh Memory scheduling policies Memory power/energy management Multi-core issues Fairness, interference Large DRAM capacity

Readings Recommended: Mutlu and Moscibroda, “Parallelism-Aware Batch Scheduling: Enabling High-Performance and Fair Memory Controllers,” IEEE Micro Top Picks 2009. Mutlu and Moscibroda, “Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors,” MICRO 2007. Zhang et al., “A Permutation-based Page Interleaving Scheme to Reduce Row-buffer Conflicts and Exploit Data Locality,” MICRO 2000. Lee et al., “Prefetch-Aware DRAM Controllers,” MICRO 2008. Rixner et al., “Memory Access Scheduling,” ISCA 2000.

Main Memory in the System CORE 0 L2 CACHE 0 L2 CACHE 1 CORE 1 SHARED L3 CACHE DRAM INTERFACE DRAM MEMORY CONTROLLER DRAM BANKS CORE 2 L2 CACHE 2 L2 CACHE 3 CORE 3

Memory Bank Organization Read access sequence: 1. Decode row address & drive word-lines 2. Selected bits drive bit-lines • Entire row read 3. Amplify row data 4. Decode column address & select subset of row • Send to output 5. Precharge bit-lines • For next access

SRAM (Static Random Access Memory) Read Sequence 1. address decode 2. drive row select 3. selected bit-cells drive bitlines (entire row is read together) 4. diff. sensing and col. select (data is ready) 5. precharge all bitlines (for next read or write) Access latency dominated by steps 2 and 3 Cycling time dominated by steps 2, 3 and 5 step 2 proportional to 2m step 3 and 5 proportional to 2n row select bitline _bitline bit-cell array 2n row x 2m-col (nm to minimize overall latency) n+m 2n n m 2m diff pairs sense amp and mux 1

DRAM (Dynamic Random Access Memory) row enable Bits stored as charges on node capacitance (non-restorative) bit cell loses charge when read bit cell loses charge over time Read Sequence 1~3 same as SRAM 4. a “flip-flopping” sense amp amplifies and regenerates the bitline, data bit is mux’ed out 5. precharge all bitlines Refresh: A DRAM controller must periodically read all rows within the allowed refresh time (10s of ms) such that charge is restored in cells _bitline bit-cell array 2n row x 2m-col (nm to minimize overall latency) RAS 2n n m 2m sense amp and mux 1 A DRAM die comprises of multiple such arrays CAS

SRAM vs. DRAM SRAM is preferable for register files and L1/L2 caches Fast access No refreshes Simpler manufacturing (compatible with logic process) Lower density (6 transistors per cell) Higher cost DRAM is preferable for stand-alone memory chips Much higher capacity Higher density Lower cost

Memory subsystem organization Channel DIMM Rank Chip Bank Row/Column

DIMM (Dual in-line memory module) Memory subsystem “Channel” DIMM (Dual in-line memory module) Processor Memory channel Memory channel

DIMM (Dual in-line memory module) Breaking down a DIMM DIMM (Dual in-line memory module) Side view Front of DIMM Back of DIMM

DIMM (Dual in-line memory module) Breaking down a DIMM DIMM (Dual in-line memory module) Side view Front of DIMM Back of DIMM Rank 0: collection of 8 chips Rank 1

Rank Rank 0 (Front) Rank 1 (Back) <0:63> <0:63> Addr/Cmd CS <0:1> Data <0:63> Memory channel

DIMM & Rank (from JEDEC)

. . . Breaking down a Rank Chip 0 Chip 1 Chip 7 Rank 0 <0:63> <0:7> <8:15> <56:63> Data <0:63>

Breaking down a Chip ... 8 banks Chip 0 Bank 0 <0:7> <0:7>

Breaking down a Bank ... ... Row-buffer Bank 0 <0:7> 1B 1B 1B 1B (column) row 16k-1 Bank 0 ... row 0 <0:7> Row-buffer 1B 1B 1B ... <0:7>

Memory subsystem organization Channel DIMM Rank Chip Bank Row/Column

Example: Transferring a cache block Physical memory space 0xFFFF…F Channel 0 ... DIMM 0 Mapped to 0x40 Rank 0 64B cache block 0x00

Example: Transferring a cache block Physical memory space Chip 0 Chip 1 Chip 7 Rank 0 0xFFFF…F . . . ... <0:7> <8:15> <56:63> 0x40 64B cache block Data <0:63> 0x00

Example: Transferring a cache block Physical memory space Chip 0 Chip 1 Chip 7 Rank 0 0xFFFF…F . . . Row 0 Col 0 ... <0:7> <8:15> <56:63> 0x40 64B cache block Data <0:63> 0x00

Example: Transferring a cache block Physical memory space Chip 0 Chip 1 Chip 7 Rank 0 0xFFFF…F . . . Row 0 Col 0 ... <0:7> <8:15> <56:63> 0x40 64B cache block Data <0:63> 8B 0x00 8B

Example: Transferring a cache block Physical memory space Chip 0 Chip 1 Chip 7 Rank 0 0xFFFF…F . . . Row 0 Col 1 ... <0:7> <8:15> <56:63> 0x40 64B cache block Data <0:63> 8B 0x00

Example: Transferring a cache block Physical memory space Chip 0 Chip 1 Chip 7 Rank 0 0xFFFF…F . . . Row 0 Col 1 ... <0:7> <8:15> <56:63> 0x40 64B cache block 8B Data <0:63> 8B 0x00 8B

Example: Transferring a cache block Physical memory space Chip 0 Chip 1 Chip 7 Rank 0 0xFFFF…F . . . Row 0 Col 1 ... <0:7> <8:15> <56:63> 0x40 64B cache block 8B Data <0:63> 8B 0x00 A 64B cache block takes 8 I/O cycles to transfer. During the process, 8 columns are read sequentially.

Page Mode DRAM A DRAM bank is a 2D array of cells: rows x columns A “DRAM row” is also called a “DRAM page” “Sense amplifiers” also called “row buffer” Each address is a <row,column> pair Access to a “closed row” Activate command opens row (placed into row buffer) Read/write command reads/writes column in the row buffer Precharge command closes the row and prepares the bank for next access Access to an “open row” No need for activate command

DRAM Bank Operation Access Address: (Row 0, Column 0) Columns Row address 0 Row address 1 Row decoder Rows Row 0 Empty Row 1 Row Buffer CONFLICT ! HIT HIT Column address 0 Column address 1 Column address 0 Column address 85 Column mux Data

Latency Components: Basic DRAM Operation CPU → controller transfer time Controller latency Queuing & scheduling delay at the controller Access converted to basic commands Controller → DRAM transfer time DRAM bank latency Simple CAS is row is “open” OR RAS + CAS if array precharged OR PRE + RAS + CAS (worst case) DRAM → CPU transfer time (through controller)

A DRAM Chip and DIMM Chip: Consists of multiple banks (2-16 in Synchronous DRAM) Banks share command/address/data buses The chip itself has a narrow interface (4-16 bits per read) Multiple chips are put together to form a wide interface Called a module DIMM: Dual Inline Memory Module All chips in one side of a DIMM are operated the same way (rank) Respond to a single command Share address and command buses, but provide different data If we have chips with 8-bit interface, to read 8 bytes in a single access, use 8 chips in a DIMM

128M x 8-bit DRAM Chip

A 64-bit Wide DIMM

A 64-bit Wide DIMM Advantages: Disadvantages: Acts like a high-capacity DRAM chip with a wide interface Flexibility: memory controller does not need to deal with individual chips Disadvantages: Granularity: Accesses cannot be smaller than the interface width

Multiple DIMMs Advantages: Disadvantages: Enables even higher capacity Interconnect complexity and energy consumption can be high

DRAM Channels 2 Independent Channels: 2 Memory Controllers (Above) 2 Dependent/Lockstep Channels: 1 Memory Controller with wide interface (Not Shown above)

Generalized Memory Structure

Multiple Banks (Interleaving) and Channels Enable concurrent DRAM accesses Bits in address determine which bank an address resides in Multiple independent channels serve the same purpose But they are even better because they have separate data buses Increased bus bandwidth Enabling more concurrency requires reducing Bank conflicts Channel conflicts How to select/randomize bank/channel indices in address? Lower order bits have more entropy Randomizing hash functions (XOR of different address bits)

How Multiple Banks/Channels Help

Multiple Channels Advantages Disadvantages Increased bandwidth Multiple concurrent accesses (if independent channels) Disadvantages Higher cost than a single channel More board wires More pins (if on-chip memory controller)

Address Mapping (Single Channel) Single-channel system with 8-byte memory bus 2GB memory, 8 banks, 16K rows & 2K columns per bank Row interleaving Consecutive rows of memory in consecutive banks Cache block interleaving Consecutive cache block addresses in consecutive banks 64 byte cache blocks Accesses to consecutive cache blocks can be serviced in parallel How about random accesses? Strided accesses? Row (14 bits) Bank (3 bits) Column (11 bits) Byte in bus (3 bits) Row (14 bits) High Column Bank (3 bits) Low Col. Byte in bus (3 bits) 8 bits 3 bits

Bank Mapping Randomization DRAM controller can randomize the address mapping to banks so that bank conflicts are less likely 3 bits Column (11 bits) Byte in bus (3 bits) XOR Bank index (3 bits)

Address Mapping (Multiple Channels) Row (14 bits) Bank (3 bits) Column (11 bits) Byte in bus (3 bits) Where are consecutive cache blocks? Row (14 bits) C Bank (3 bits) Column (11 bits) Byte in bus (3 bits) Row (14 bits) Bank (3 bits) C Column (11 bits) Byte in bus (3 bits) Row (14 bits) Bank (3 bits) Column (11 bits) C Byte in bus (3 bits) C Row (14 bits) High Column Bank (3 bits) Low Col. Byte in bus (3 bits) 8 bits 3 bits Row (14 bits) C High Column Bank (3 bits) Low Col. Byte in bus (3 bits) 8 bits 3 bits Row (14 bits) High Column C Bank (3 bits) Low Col. Byte in bus (3 bits) 8 bits 3 bits Row (14 bits) High Column Bank (3 bits) C Low Col. Byte in bus (3 bits) 8 bits 3 bits Row (14 bits) High Column Bank (3 bits) Low Col. C Byte in bus (3 bits) 8 bits 3 bits

Interaction with VirtualPhysical Mapping Operating System influences where an address maps to in DRAM Operating system can control which bank a virtual page is mapped to. It can randomize Page<Bank,Channel> mappings Application cannot know/determine which bank it is accessing Virtual Page number (52 bits) Page offset (12 bits) VA Physical Frame number (19 bits) Page offset (12 bits) PA Row (14 bits) Bank (3 bits) Column (11 bits) Byte in bus (3 bits) PA

DRAM Refresh (I) DRAM capacitor charge leaks over time The memory controller needs to read each row periodically to restore the charge Activate + precharge each row every N ms Typical N = 64 ms Implications on performance? -- DRAM bank unavailable while refreshed -- Long pause times: If we refresh all rows in burst, every 64ms the DRAM will be unavailable until refresh ends Burst refresh: All rows refreshed immediately after one another Distributed refresh: Each row refreshed at a different time, at regular intervals

DRAM Refresh (II) Distributed refresh eliminates long pause times How else we can reduce the effect of refresh on performance? Can we reduce the number of refreshes?

DRAM Controller Purpose and functions Ensure correct operation of DRAM (refresh) Service DRAM requests while obeying timing constraints of DRAM chips Constraints: resource conflicts (bank, bus, channel), minimum write-to-read delays Translate requests to DRAM command sequences Buffer and schedule requests to improve performance Reordering and row-buffer management Manage power consumption and thermals in DRAM Turn on/off DRAM chips, manage power modes

DRAM Controller Issues Where to place? In chipset + More flexibility to plug different DRAM types into the system + Less power density in the CPU chip On CPU chip + Reduced latency for main memory access + Higher bandwidth between cores and controller More information can be communicated (e.g. request’s importance in the processing core)

DRAM Controller (II)

A Modern DRAM Controller

DRAM Scheduling Policies (I) FCFS (first come first served) Oldest request first FR-FCFS (first ready, first come first served) 1. Row-hit first 2. Oldest first Goal: Maximize row buffer hit rate  maximize DRAM throughput Actually, scheduling is done at the command level Column commands (read/write) prioritized over row commands (activate/precharge) Within each group, older commands prioritized over younger ones

DRAM Scheduling Policies (II) A scheduling policy is essentially a prioritization order Prioritization can be based on Request age Row buffer hit/miss status Request type (prefetch, read, write) Requestor type (load miss or store miss) Request criticality Oldest miss in the core? How many instructions in core are dependent on it?