April 20, 2004 Prof. Andreas Savvides Spring 2004

EENG 449bG/CPSC 439bG Computer Systems
Lecture 19: Memory Hierarchy Design Part III, Memory Technologies
Review today, not so fast in future

Announcements
Midterm 2 next time (20% of class grade)
  Material from chapters 3, 4, 5
  Use lecture slides and HW exercises as a study guide
Project presentation (10% of grade): April 26th (or May 4th)
Project reports (15% of grade): due May 6th

Main Memory Background
Performance of Main Memory:
  Latency: Cache Miss Penalty
    Access Time: time between the request and the arrival of the word
    Cycle Time: time between requests
  Bandwidth: I/O & Large Block Miss Penalty (L2)
Main Memory is DRAM: Dynamic Random Access Memory
  Dynamic since it needs to be refreshed periodically (8 ms, 1% time)
  Addresses divided into 2 halves (Memory as a 2D matrix):
    RAS or Row Access Strobe
    CAS or Column Access Strobe
Cache uses SRAM: Static Random Access Memory
  No refresh (6 transistors/bit vs. 1 transistor for DRAM)
  Size: DRAM/SRAM 4-8, Cost/Cycle time: SRAM/DRAM 8-16

Main Memory Organizations
Simple: CPU, Cache, Bus, Memory same width (32 or 64 bits)
Memory Performance Example
  4 clock cycles to send the address
  56 clock cycles access time per word
  4 clock cycles to send a word of data
  Cache block size: 4 words of 8 bytes each
  Miss Penalty: 4 x (4 + 56 + 4) = 256 clock cycles

Wide Memory Organization: CPU/Mux 1 word; Mux/Cache, Bus, Memory N words (Alpha: 64 bits & 256 bits; UltraSPARC 512)
  Memory width of 2 words, miss penalty: 2 x (4 + 56 + 4) = 128 clock cycles
  Memory width of 4 words, miss penalty: 1 x (4 + 56 + 4) = 64 clock cycles

Wide Memory drawbacks
  HW overhead: wider bus and multiplexers at each level
  If error correction is supported, the whole block has to be read on every byte write to compute the new code

Memory Interleaving: CPU, Cache, Bus 1 word; Memory N Modules (4 Modules); example is word interleaved
  New Miss Penalty: 4 + 56 + (4 x 4) = 76 clock cycles
  Bank advantages: can have up to 1-word/cycle writes if writes are not on the same bank
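The miss penalties for the simple, wide, and interleaved organizations can be reproduced with a short Python sketch. The function and parameter names are illustrative assumptions, not part of the lecture; the timing values (4 cycles address, 56 cycles access, 4 cycles transfer, 4-word block) are the ones from the example. The interleaved case assumes one bank per word of the block, so all accesses overlap.

```python
import math

ADDR, ACCESS, XFER = 4, 56, 4   # clock cycles, from the lecture example
BLOCK_WORDS = 4                 # cache block size in words


def miss_penalty_simple(words=BLOCK_WORDS):
    """One-word-wide memory: every word pays address + access + transfer."""
    return words * (ADDR + ACCESS + XFER)


def miss_penalty_wide(width_words, words=BLOCK_WORDS):
    """Memory and bus are width_words wide: fewer, wider accesses."""
    accesses = math.ceil(words / width_words)
    return accesses * (ADDR + ACCESS + XFER)


def miss_penalty_interleaved(banks, words=BLOCK_WORDS):
    """Word-interleaved banks on a one-word bus.

    Assumption: banks >= words, so the address is sent once, all banks
    access in parallel, and only the transfers are serialized.
    """
    if banks < words:
        raise ValueError("sketch only covers banks >= words per block")
    return ADDR + ACCESS + words * XFER


def bank_of(word_address, banks):
    """Word interleaving: consecutive words map to consecutive banks."""
    return word_address % banks


print(miss_penalty_simple())        # simple organization
print(miss_penalty_wide(2))         # 2-word-wide memory
print(miss_penalty_wide(4))         # 4-word-wide memory
print(miss_penalty_interleaved(4))  # 4 interleaved banks
```

Interleaving gets most of the benefit of the wide organization while keeping the bus and cache interface one word wide, which is why the transfer term is the only one multiplied by the block size.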

Memory Technologies

Memory Technologies DRAM
Dynamic Random Access Memory
Write
  Charge bitline HIGH or LOW and set wordline HIGH
Read
  Bit line is precharged to a voltage halfway between HIGH and LOW, and then the word line is set HIGH.
  Depending on the charge in the cap, the precharged bitline is pulled slightly higher or lower.
  Sense amp detects the change
  Destructive read!
Explains why the cap can’t shrink
  Need to sufficiently drive bitline
  Increase density => increase parasitic capacitance
[Figure: DRAM cell with word line, bit line, storage capacitor C, and sense amp]
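Why the capacitor cannot shrink can be illustrated with the usual charge-sharing estimate: when the word line opens, the cell capacitor and the bitline capacitance share charge, so the bitline moves by only a fraction of the stored voltage. The sketch below is a rough model under that assumption; the supply voltage and capacitance values are hypothetical and chosen only for illustration, not taken from the lecture.

```python
def bitline_swing(v_cell, v_precharge, c_cell, c_bitline):
    """Voltage change on the bitline after charge sharing with the cell.

    Charge conservation gives:
        dV = (v_cell - v_precharge) * c_cell / (c_cell + c_bitline)
    """
    return (v_cell - v_precharge) * c_cell / (c_cell + c_bitline)


# Hypothetical values, for illustration only.
VDD = 1.8            # volts
C_CELL = 30e-15      # farads (storage capacitor)
C_BITLINE = 300e-15  # farads (parasitic bitline capacitance)

swing_one = bitline_swing(VDD, VDD / 2, C_CELL, C_BITLINE)   # stored HIGH
swing_zero = bitline_swing(0.0, VDD / 2, C_CELL, C_BITLINE)  # stored LOW
swing_small_cell = bitline_swing(VDD, VDD / 2, C_CELL / 2, C_BITLINE)

print(f"stored 1: {swing_one * 1000:+.1f} mV")
print(f"stored 0: {swing_zero * 1000:+.1f} mV")
print(f"stored 1, half-size cap: {swing_small_cell * 1000:+.1f} mV")
```

With these numbers the sense amp has to resolve a swing of well under a tenth of a volt, and halving the cell capacitance shrinks it further, which matches the point that the cell must still drive the bitline as density and parasitic capacitance grow.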

DRAM Charge Leakage Need to have frequent refresh, rates vary from ms to ns, update approx. every 8 reads

DRAM logical organization (4 Mbit)
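[Figure: DRAM logical organization. Address pins A0…A13 feed the address buffer, which drives the row decoder and the column decoder; the memory array (16,384 x 16,384) is made of storage cells at the intersections of word lines and bit lines; sense amps & I/O connect the array to data in (D) and data out (Q).]
Square root of bits per RAS/CAS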

DRAM-chip internal organization
64K x 1 DRAM

RAS/CAS operation Row Address Strobe, Column Address Strobe
n address bits are provided in two steps using n/2 pins, referenced to the falling edges of RAS_L and CAS_L Traditional method of DRAM operation for 20 years. Now being supplanted by synchronous, clocked interfaces in SDRAM (synchronous DRAM).
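The two-step addressing can be sketched in Python. This is a minimal illustration, assuming the upper half of the address is sent as the row (with RAS) and the lower half as the column (with CAS); the slide does not say which half goes first, and the function names are hypothetical.

```python
def split_address(addr, n_bits):
    """Split an n-bit cell address into (row, column) halves of n/2 bits."""
    if n_bits % 2:
        raise ValueError("sketch assumes an even number of address bits")
    half = n_bits // 2
    row = addr >> half                # presented on the pins with RAS
    col = addr & ((1 << half) - 1)    # presented on the same pins with CAS
    return row, col


def join_address(row, col, n_bits):
    """Inverse of split_address, as the DRAM sees the full address."""
    return (row << (n_bits // 2)) | col


# A 64K x 1 DRAM has 16 address bits, so 8 multiplexed address pins.
row, col = split_address(0xABCD, 16)
print(hex(row), hex(col))
```

Halving the pin count this way is what makes the access a two-phase operation, which is why row and column timing show up separately in the read and write timing diagrams.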

DRAM read timing

DRAM read timing Read Latency

DRAM refresh timing

DRAM write timing

DRAM History
DRAMs: capacity +60%/yr, cost –30%/yr
  2.5X cells/area, 1.5X die size in 3 years
‘98 DRAM fab line costs $2B
  DRAM only: density, leakage v. speed
Rely on increasing no. of computers & memory per computer (60% market)
  SIMM or DIMM is replaceable unit => computers use any generation DRAM
Commodity, second source industry => high volume, low profit, conservative
  Little organization innovation in 20 years
Order of importance: 1) Cost/bit 2) Capacity
  First RAMBUS: 10X BW, +30% cost => little impact

So, Why do I freaking care?
By its nature, DRAM isn’t built for speed
  Response times depend on capacitive circuit properties, which get worse as density increases
DRAM process isn’t easy to integrate into CMOS process
  DRAM is off chip
  Connectors, wires, etc. introduce slowness
  IRAM efforts are looking to integrate the two
Memory architectures are designed to minimize the impact of DRAM latency
  Low level: memory chips
  High level: memory designs
You will pay $$$$$$ and then some $$$ for a good memory system.

So, Why do I freaking care?
Speed = ƒ(no. operations)
1990:
  Pipelined Execution & Fast Clock Rate
  Out-of-Order execution
  Superscalar Instruction Issue
1998: Speed = ƒ(non-cached memory accesses)

DRAM Future: 1 Gbit DRAM (ISSCC ‘96; production ‘02?)
                Mitsubishi      Samsung
Blocks          512 x 2 Mbit    x 1 Mbit
Clock           200 MHz         250 MHz
Data Pins       64              16
Die Size        24 x 24 mm      31 x 21 mm
Metal Layers    3               4
Technology      0.15 micron     micron
Sizes will be much smaller in production
Latency comparison: 180 ns in 1980, 40 ns in 2002

Single Port 6-T SRAM Cell
Static RAM (SRAM)
  Six transistors in cross-connected fashion
  Prevents the information from being disturbed when read
  SRAM requires minimal power to retain the charge: better than DRAM
  On the same process, DRAM has 4-8 times more capacity; SRAMs are 8-16 times faster

Fast Memory Systems: DRAM specific
Multiple CAS accesses: several names (page mode)
  Extended Data Out (EDO): 30% faster in page mode
New DRAMs to address gap; what will they cost, will they survive?
  RAMBUS: startup company; reinvent DRAM interface
    Each chip a module vs. slice of memory
    Short bus between CPU and chips ( MHz < 4 inches long)
    Does own refresh
    Variable amount of data returned
    1 byte / 2 ns (500 MB/s bandwidth)
    20% increase in DRAM area
  Synchronous DRAM (SDRAM): 2 banks on chip, a clock signal to DRAM, transfer synchronous to system clock ( MHz in 2001)

RAMBUS (RDRAM)
Protocol-based RAM w/ narrow (16-bit) bus
  High clock rate (400 MHz), but long latency
  Pipelined operation
  Multiple arrays w/ data transferred on both edges of clock
[Figures: RAMBUS bank; RDRAM memory system]
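The bandwidth figures quoted for these interfaces follow from simple arithmetic on bus width, clock rate, and transfers per clock. The sketch below is a back-of-the-envelope calculation of peak (not sustained) bandwidth; the helper names are hypothetical, and protocol overhead and latency are ignored.

```python
def peak_bandwidth(bus_bits, clock_hz, transfers_per_clock=1):
    """Peak bytes per second for a bus of bus_bits at clock_hz."""
    return bus_bits / 8 * clock_hz * transfers_per_clock


def bandwidth_from_interval(n_bytes, interval_ns):
    """Bytes per second when n_bytes arrive every interval_ns nanoseconds."""
    return n_bytes * 1e9 / interval_ns


# First-generation RAMBUS figure from the earlier slide: 1 byte every 2 ns.
first_rambus = bandwidth_from_interval(1, 2)

# RDRAM: 16-bit bus, 400 MHz clock, data on both clock edges.
rdram = peak_bandwidth(16, 400e6, transfers_per_clock=2)

print(f"1 byte / 2 ns: {first_rambus / 1e6:.0f} MB/s")
print(f"RDRAM peak:    {rdram / 1e9:.1f} GB/s")
```

The narrow bus is compensated by the high clock rate and double-edge transfers, which is how a 16-bit interface reaches a peak rate in the GB/s range while latency stays long.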

RAMBUS vs. SDRAM
  SDRAM comes in DIMMs, RAMBUS comes in RIMMs: similar in size but incompatible
  SDRAMs have almost comparable performance to RAMBUS
  Newer DRAM generations such as RDRAM and DRDRAM provide more bandwidth at a price premium

Need for Error Correction!
Motivation:
  Failures/time proportional to number of bits!
  As DRAM cells shrink, more vulnerable
Went through period in which failure rate was low enough without error correction that people didn’t do correction
  DRAM banks too large now
  Servers always corrected memory systems
Basic idea: add redundancy through parity bits
  Simple but wasteful version: keep three copies of everything, vote to find right value
    200% overhead, so not good!
  Common configuration: random error correction
    SEC-DED (single error correct, double error detect)
    One example: 64 data bits + 8 parity bits (11% overhead)
    Papers up on reading list from last term tell you how to do these types of codes
Really want to handle failures of physical components as well
  Organization is multiple DRAMs/SIMM, multiple SIMMs
  Want to recover from failed DRAM and failed SIMM!
  Requires more redundancy to do this
  All major vendors thinking about this in high-end machines
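The 64 + 8 configuration can be illustrated with an extended Hamming code, one common way to build SEC-DED; the slide does not say which code real memory systems use, so treat this as a sketch under that assumption. Function names and the codeword layout (overall parity at index 0, check bits at power-of-two positions) are illustrative choices.

```python
def num_check_bits(data_bits):
    """Smallest r with 2**r >= data_bits + r + 1 (single error correction)."""
    r = 0
    while (1 << r) < data_bits + r + 1:
        r += 1
    return r


def encode(data):
    """Return a SEC-DED codeword (list of 0/1); index 0 is overall parity."""
    r = num_check_bits(len(data))
    n = len(data) + r
    code = [0] * (n + 1)
    bits = iter(data)
    for pos in range(1, n + 1):
        if pos & (pos - 1):               # not a power of two: data position
            code[pos] = next(bits)
    for i in range(r):                    # check bit i covers positions with bit i set
        p = 1 << i
        code[p] = sum(code[pos] for pos in range(1, n + 1) if pos & p) % 2
    code[0] = sum(code) % 2               # extra parity bit enables double-error detection
    return code


def decode(code):
    """Return (status, data): status is 'ok', 'corrected', or 'detected'."""
    code = list(code)
    n = len(code) - 1
    syndrome = 0
    for pos in range(1, n + 1):
        if code[pos]:
            syndrome ^= pos               # XOR of set positions is 0 for a valid word
    if sum(code) % 2:                     # odd overall parity: assume a single error
        if syndrome > n:
            return "detected", None       # cannot be a single error
        code[syndrome] ^= 1               # syndrome 0 means the parity bit itself flipped
        status = "corrected"
    elif syndrome:
        return "detected", None           # even parity but nonzero syndrome: double error
    else:
        status = "ok"
    data = [code[pos] for pos in range(1, n + 1) if pos & (pos - 1)]
    return status, data


word = [(i * i + i // 3) % 2 for i in range(64)]   # arbitrary 64-bit test pattern
codeword = encode(word)
print(len(codeword) - len(word), "check bits for", len(word), "data bits")
```

For 64 data bits this gives 7 Hamming check bits plus one overall parity bit, i.e. 8 extra bits in a 72-bit word, compared with 128 extra bits for the keep-three-copies-and-vote scheme.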

More esoteric Storage Technologies?
Tunneling Magnetic Junction RAM (TMJ-RAM):
  Speed of SRAM, density of DRAM, non-volatile (no refresh)
  New field called “Spintronics”: combination of quantum spin and electronics
  Same technology used in high-density disk drives
MicroElectroMechanical Systems (MEMS) storage devices:
  Large magnetic “sled” floating on top of lots of little read/write heads
  Micromechanical actuators move the sled back and forth over the heads

Tunneling Magnetic Junction

MEMS-based Storage
Magnetic “sled” floats on array of read/write heads
  Approx 250 Gbit/in²
  Data rates: IBM: 250 MB/s w/ 1000 heads; CMU: 3.1 MB/s w/ 400 heads
Electrostatic actuators move media around to align it with heads
  Sweep sled ±50 µm in < 0.5s
Capacity estimated to be in the 1-10 GB range in 10 cm²
See Ganger et al.:

Embedded Processor Memory Technologies
Read Only Memory (ROM): programmed once at manufacture time, non-destructible
FLASH Memory
  Non-volatile but re-programmable
  Almost DRAM reading speeds but 10-100x slower writing
  Typical access times: 65 ns for 16 Mbit flash and 150 ns for 128 Mbit flash
  Flash building blocks are based on NOR or NAND devices
    NOR devices can be reprogrammed about 100,000 cycles
    NAND devices can be reprogrammed up to 1,000,000 cycles

Project & Reports
Your final report should build upon the midterm report
  Document your software architecture & approach
  If you have hardware, document the hardware
  Useful metrics: power consumption (in mA current drawn from the power supply)
Projects dealing with evaluations
  Report on your results
  Experiments with negative results are as important as positive results. Explain what did not work and why
Demo your project status on the day of final presentation

Concluding Remarks
Processor internals
  Performance, Pipelining, ILP SW & HW, Memory Hierarchies
Processor interfaces
  Using microcontrollers, peripherals and tools
