Lecture 5-6: Memory System
ECE 511, University of Illinois
© Wen-mei Hwu and S. J. Patel, 2005


Outline
Data Supply
–Challenges
–Some solutions, including victim caches

Data Supply Objective
Minimize the number of cycles each load access is required to wait.
Maximize the number of loads/stores handled per cycle.
Current paradigm: multiple levels of caches
–Itanium 2 has 3 levels of on-chip caches

Performance of Memory Hierarchy

DRAM Technologies
DRAM and the memory subsystem significantly impact the performance and cost of a system.
Need to understand DRAM technologies:
–to architect an appropriate memory subsystem for an application
–to utilize the chosen DRAM efficiently
–to design a memory controller

DRAM Basics
The wide datapath on the DRAM chip has potentially huge bandwidth.
Pin limitations restrict the bandwidth available off-chip.
[Figure: block diagram. Row address drives the row decoder into the memory cell array; sense amps feed the column latches; the column address selects through a mux. The internal path is wide, the off-chip data path narrow.]

Fill Frequency (FF)
Ratio of bandwidth to granularity
–Minimum granularity impacts total system cost
–Low-cost systems with large bandwidth needs (e.g. handheld consumer multimedia) have very high FF requirements
–Increasing DRAM device sizes negatively impacts FF
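To make the ratio concrete, here is a minimal sketch; the capacities and the PC100-class bandwidth figure are illustrative assumptions, not numbers from the slides. It shows why growing device sizes hurt FF: doubling capacity at fixed bandwidth halves it.

```c
#include <stdio.h>

/* Fill frequency: peak bandwidth divided by minimum granularity.
 * With bandwidth in MB/s and granularity in MB, FF is in 1/s:
 * "how many times per second the whole memory could be filled". */
double fill_frequency(double peak_mb_per_s, double min_granularity_mb)
{
    return peak_mb_per_s / min_granularity_mb;
}

int main(void)
{
    /* Hypothetical systems with the same ~800 MB/s channel:
     * a 64 MB desktop vs. a handheld limited to one 16 MB device. */
    printf("64 MB system: FF = %.1f /s\n", fill_frequency(800.0, 64.0));
    printf("16 MB system: FF = %.1f /s\n", fill_frequency(800.0, 16.0));
    return 0;
}
```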

Asynchronous DRAM
No clock
RAS* (Row Address Strobe)
CAS* (Column Address Strobe)
Multiplexed address lines
System performance constrained:
–by inter-signal timing requirements
–by board-level inter-device skew
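A small sketch of what address multiplexing buys: the same pins carry the row address (latched by RAS*) and then the column address (latched by CAS*). The 24-bit address and 12/12 row/column split are an assumed geometry for illustration.

```c
#include <stdint.h>
#include <stdio.h>

/* Assumed geometry: a 24-bit cell address split into a 12-bit row and
 * a 12-bit column, so the chip needs 12 address pins instead of 24. */
#define ROW_BITS 12
#define COL_BITS 12

uint16_t row_of(uint32_t a) { return (a >> COL_BITS) & ((1u << ROW_BITS) - 1); }
uint16_t col_of(uint32_t a) { return a & ((1u << COL_BITS) - 1); }

int main(void)
{
    uint32_t addr = 0xABC123;
    /* RAS* latches row_of(addr), then CAS* latches col_of(addr),
     * both presented on the same 12 pins. */
    printf("row=0x%03X col=0x%03X\n", row_of(addr), col_of(addr));
    return 0;
}
```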

Asynchronous DRAM: Fast Page-Mode (FPM)
Wide row transferred to column latches on-chip (RAS*)
Columns in the same row can be accessed more quickly (CAS*)
[Timing diagram: one RAS* assertion latches Row; three CAS* pulses latch Col1, Col2, Col3 and return Data 1, 2, 3.]
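A toy behavioral model of the page-mode idea; the 4096x1024 array geometry is an illustrative assumption. The expensive step is the row activate; subsequent same-row reads only touch the column latches.

```c
#include <stdint.h>
#include <stdio.h>

#define ROWS 4096
#define COLS 1024

static uint8_t cell[ROWS][COLS];   /* DRAM cell array                 */
static uint8_t latch[COLS];        /* column latches (one open row)   */
static int open_row = -1;

/* RAS*: latch the row address; the sense amps copy the whole row into
 * the column latches. This is the slow step. */
void activate(int row)
{
    for (int c = 0; c < COLS; c++)
        latch[c] = cell[row][c];
    open_row = row;
}

/* CAS*: page-mode read. With the row already open, only the column
 * latches are read, much faster than a fresh RAS/CAS cycle. */
uint8_t read_col(int col)
{
    return open_row >= 0 ? latch[col] : 0;
}

int main(void)
{
    cell[7][3] = 42;
    activate(7);                   /* one slow row activate...         */
    printf("%d %d %d\n",           /* ...then three fast column reads  */
           read_col(2), read_col(3), read_col(4));
    return 0;
}
```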

Asynchronous DRAM: Extended Data Out (EDO)
Change in data output control reduces "dead time" between column accesses to the same row
Enables tighter timing, higher bandwidth
[Timing diagram: as in FPM, one RAS* and three CAS* pulses for Col1-Col3 returning Data 1-3, but with the data output held valid across column transitions.]

Synchronous DRAM (SDRAM)
Clock signal distributed to all devices
"Command" inputs (RAS*, CAS*, etc.) sampled on the positive clock edge
Higher-performance synchronous operation:
–all inputs latched by the clock
–only need to consider set-up and hold times of signals relative to the clock
–on-chip column access path is pipelined

Virtual Channel (VC)
Each device includes 16 channel buffers, each of which is very wide
–each channel buffer can hold a large segment of any row (fully associative)
–wide internal bus can transfer an entire segment to/from the memory array at once
[Figure: memory cell array and sense amps feeding channel buffers 1 through 16; a channel-select mux connects the chosen buffer to the data path.]

Virtual Channel (cont.)
Virtual channels are associated with independent access streams
–cache line refill, graphics, I/O DMA, etc.
–reduces row penalties for multiple tasks interleaving accesses to different rows
Data access is to the channel buffers, allowing concurrent background operations
–activate, precharge, prefetch, etc.

Virtual Channel (cont.)
Maintains the same command protocol and interface as SDRAM
–command set is a superset of SDRAM's
–extra commands encoded using unused address bits during column addressing
Requires memory controller support
–channel select, etc.
–defaults to buffer 1 under a traditional controller
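A controller-side sketch of the channel-buffer bookkeeping. The segment size and the miss handling are assumptions for illustration; only the fully-associative 16-buffer organization comes from the slides.

```c
#include <stdint.h>

#define NCHAN   16
#define SEGSIZE 256   /* bytes per buffered row segment (assumed) */

struct channel_buf {
    int      valid;
    uint32_t row, seg;          /* which segment of which row is held */
    uint8_t  data[SEGSIZE];
};

static struct channel_buf chan[NCHAN];

/* Fully associative: any of the 16 buffers may hold any segment of
 * any row, so a lookup must search all of them (a real controller
 * would simply remember the mapping it created). */
int chan_lookup(uint32_t row, uint32_t seg)
{
    for (int i = 0; i < NCHAN; i++)
        if (chan[i].valid && chan[i].row == row && chan[i].seg == seg)
            return i;
    return -1;  /* miss: issue a background prefetch into some buffer */
}
```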

Next-Generation Packet DRAMs
Wrap extra logic and high-speed signaling circuitry around a DRAM core
Split-transaction "packet" protocols:
–send command packets (including addresses) on the control lines
–send or receive the corresponding data packets some time later on the data lines
–synchronous with high-speed clock(s)

Packet DRAMs (cont.)
Individual devices independently decode command packets and access the DRAM core accordingly
Other common features:
–in-system timing and signaling calibration and optimization
–customized low-voltage, high-speed signaling
–multiple internal banks
–higher bandwidth with small granularity
Two examples: Direct RDRAM and SLDRAM
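To make "command packet" concrete, here is an illustrative packet layout and the per-device decode step. The field names and widths are invented for the sketch; the real RDRAM and SLDRAM packet formats differ.

```c
#include <stdint.h>

/* Hypothetical split-transaction command packet. */
struct cmd_packet {
    uint8_t  dev_id;   /* which device on the channel should respond */
    uint8_t  opcode;   /* activate / read / write / precharge / ...  */
    uint8_t  bank;     /* internal bank within the device            */
    uint16_t row;
    uint16_t col;
};

/* Every device snoops the control lines; only the addressed device
 * decodes the packet and accesses its core. The data packet follows
 * on the data lines a fixed number of cycles later. */
int packet_is_mine(const struct cmd_packet *p, uint8_t my_id)
{
    return p->dev_id == my_id;
}
```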

Direct Rambus DRAM (Direct RDRAM)
Developed by Rambus Inc.
–proprietary technology requires licensing and royalty payments
400 MHz differential transmit and receive clocks
DDR signaling (+/- edge sampling) for both control and data, yielding 800 Mbits/sec/pin
16-bit data bus: peak data bandwidth of 1600 MBytes/sec from one device

Direct RDRAM (cont.)
Control bus with independent row and column lines
–3 row lines and 5 column lines
–control packets take 8 cycles (10 ns)
–high control bandwidth plus independent row and column control provide efficient bus utilization; Rambus claims 95% efficiency "over a wide range of workloads"
16-byte minimum data transfer size
–4 cycles (10 ns) on the 16-bit data bus

Direct RDRAM (cont.)
Rambus Signaling Logic (RSL)
–0.8 V signal swing
–careful control of capacitance and inductance; transmission-line issues
–board design and layout constraints
Memory subsystem architecture
–1.6 GBytes/sec for 1 to 32 devices on a single channel
–multiply bandwidth by adding parallel channels

Synchronous-Link DRAM (SLDRAM)
Developed by an industry working group
–IEEE and JEDEC open standard
–claimed to be more evolutionary, with fewer radical changes than Direct Rambus
200 MHz differential clocks
–1 command clock driven by the controller
–2 data clocks driven by the source of the data, allowing a seamless change from one driving device to another

SLDRAM (cont.)
DDR signaling for 400 Mbits/sec/pin, control and data
16-bit data bus: peak data bandwidth of 800 MBytes/sec from one device
10-bit control bus
–no independent row/column control
–control packets take 4 cycles (10 ns)
–support for chip multicast addressing
8-byte minimum data transfer size
–4 cycles (10 ns) on the 16-bit data bus
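The peak-bandwidth figures quoted for both devices follow directly from clock rate, double-data-rate signaling, and bus width; a quick check:

```c
#include <stdio.h>

/* Peak bandwidth = clock * transfers-per-cycle * bus width in bytes. */
double peak_mbytes_per_sec(double clock_mhz, int ddr, int bus_bits)
{
    return clock_mhz * (ddr ? 2 : 1) * (bus_bits / 8.0);
}

int main(void)
{
    /* Direct RDRAM: 400 MHz, DDR, 16-bit bus -> 1600 MB/s */
    printf("Direct RDRAM: %.0f MB/s\n", peak_mbytes_per_sec(400.0, 1, 16));
    /* SLDRAM: 200 MHz, DDR, 16-bit bus -> 800 MB/s */
    printf("SLDRAM:       %.0f MB/s\n", peak_mbytes_per_sec(200.0, 1, 16));
    return 0;
}
```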

SLDRAM (cont.)
Synchronous-Link I/O (SLIO)
–0.7 V signal swing
–careful control of capacitance and inductance; transmission-line issues
–board design and layout constraints
–less tightly constrained than RSL

SLDRAM (cont.)
Memory subsystem architecture
–800 MBytes/sec for 1 to 8 devices on a single unbuffered channel
–up to 256 devices on a hierarchical buffered channel
–more bandwidth with parallel channels; parallel data channels can share a single command channel

Physical Realities of Caches
Cache latency increases with:
–size
–associativity
–number of ports
–line size (less of a factor)
–address translation (virtual vs. physical addressing)
Hit rates increase with:
–size
–associativity
–what about line size? (larger lines exploit spatial locality, but too large a line wastes capacity and can add conflicts)

Alternative Implementations [on-chip data cache hierarchies]
Pentium IV-like: 8kB, 4-way, 32B-line L1 (1 cycle); 256kB, 8-way unified L2 (6 cycles)
McKinley-like: 32kB, 4-way, 32B-line L1 (1 cycle); 96kB, 6-way unified L2 (6 cycles); 3MB L3 (20 cycles)

Increasing Cache Bandwidth
Well-known:
–multiporting
–multiple banks
Experimental:
–eliminating loads
–splitting the reference stream
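A minimal sketch of the multiple-banks idea; the bank count and interleaving on low-order line-address bits are assumed design choices. Two accesses can be serviced in the same cycle exactly when they fall in different banks.

```c
#include <stdint.h>

#define NBANKS     4     /* assumed bank count; a power of two        */
#define LINE_BYTES 32    /* matches the 32B lines used in the slides  */

/* Banking approximates multiporting cheaply. Interleaving on the
 * low-order line-address bits spreads consecutive lines across banks. */
unsigned bank_of(uint32_t addr)
{
    return (addr / LINE_BYTES) % NBANKS;
}

/* Two simultaneous accesses conflict (one must stall) iff they map
 * to the same bank. */
int bank_conflict(uint32_t a, uint32_t b)
{
    return bank_of(a) == bank_of(b);
}
```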

Categorizing Cache Misses
Compulsory
–first access to a block is a compulsory miss
Capacity
–miss induced by the finite size of the cache
Conflict [Jouppi: 30-40% of misses are conflicts]
–miss that would not have occurred in a fully-associative version of the same cache
Coherence
–would have been a hit if an invalidation had not removed the line
Question: given a memory reference stream, how do you classify misses?
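One standard answer to the slide's question, sketched below: replay the trace through the real cache and, in parallel, through a fully-associative LRU cache of equal capacity. The two cache simulators are left abstract here; the function only combines their verdicts.

```c
enum miss_kind { HIT, COMPULSORY, CAPACITY, CONFLICT };

/* Classify one reference from three facts the simulators supply:
 *  - seen_before: was this block ever touched earlier in the trace?
 *    (first touch is compulsory no matter what either cache does)
 *  - real_hit:    did the actual cache hit?
 *  - fa_hit:      did the equal-capacity fully-associative cache hit?
 * A miss in both caches is capacity; a miss only in the real cache is
 * conflict. Coherence misses need an invalidation trace and are
 * omitted from this sketch. */
enum miss_kind classify(int seen_before, int real_hit, int fa_hit)
{
    if (real_hit)     return HIT;
    if (!seen_before) return COMPULSORY;
    if (!fa_hit)      return CAPACITY;
    return CONFLICT;
}
```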

Miss Reduction
How do we reduce:
Compulsory misses?
–prefetching
Capacity misses?
–cache only critical or repetitive data
Conflict misses?
–set associativity, replacement policies
Coherence misses?
–we'll see later...

Conflict Misses
The problem with direct-mapped caches:
–data layout contributes to conflicts
–data access patterns (e.g. arrays) contribute to conflicts
Getting set-associative behavior with less latency:
–victim caches
–way prediction
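A minimal sketch of the victim-cache mechanism (Jouppi's structure): a tiny fully-associative buffer beside a direct-mapped cache. Entry count and the FIFO replacement are illustrative assumptions.

```c
#include <stdint.h>

#define VC_ENTRIES 4   /* victim caches are tiny and fully associative */

static struct { int valid; uint32_t tag; } victim[VC_ENTRIES];

/* On a direct-mapped miss, probe the victim cache; on a victim hit,
 * swap that line back into the main array. Lines evicted from the
 * main cache are inserted here instead of being dropped, so two hot
 * lines that map to the same direct-mapped set can coexist. */
int vc_probe(uint32_t tag)
{
    for (int i = 0; i < VC_ENTRIES; i++)
        if (victim[i].valid && victim[i].tag == tag)
            return i;
    return -1;
}

void vc_insert(uint32_t tag)
{
    static int next;               /* FIFO replacement for the sketch */
    victim[next].valid = 1;
    victim[next].tag = tag;
    next = (next + 1) % VC_ENTRIES;
}
```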

Caching Techniques
A sampling of data supply optimizations (with the problems they target):
–Victim caches [conflicts]
–Way prediction [conflicts, latency]
–Prefetching [compulsory, conflicts, capacity]
–Address prediction and precomputation [latency]
–Stack cache [bandwidth, conflicts]
–Variable line size [fill bandwidth, capacity]

Prefetching
Issues:
–additional conflicts
–bandwidth increase
–timeliness
Techniques:
–software-based
–stream buffers
–stride prefetching
–Markov/shadow prefetching
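A minimal sketch of stride prefetching using a per-PC reference prediction table; the table size, direct-mapped indexing, and one-ahead prefetch depth are assumptions for illustration.

```c
#include <stdint.h>

#define TBL 64   /* assumed table size, direct-mapped by PC */

static struct {
    uint32_t pc, last_addr;
    int32_t  stride;
} tbl[TBL];

/* Train on each load: remember its last address and stride. When the
 * same load repeats the same nonzero stride, predict the next address.
 * Returns the address to prefetch, or 0 for no confident prediction. */
uint32_t stride_predict(uint32_t pc, uint32_t addr)
{
    uint32_t pf = 0;
    int i = pc % TBL;

    if (tbl[i].pc == pc) {
        int32_t s = (int32_t)(addr - tbl[i].last_addr);
        if (s != 0 && s == tbl[i].stride)
            pf = addr + (uint32_t)s;   /* stride repeated: prefetch    */
        tbl[i].stride = s;             /* keep training on new strides */
    } else {
        tbl[i].pc = pc;                /* new load replaces old entry  */
        tbl[i].stride = 0;
    }
    tbl[i].last_addr = addr;
    return pf;
}
```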

Dynamic classification of memory references by region [T. Haquin and S. Patel]