
1 COMP 206: Computer Architecture and Implementation
Montek Singh
Wed., Nov. 19, 2003
Topic: Main Memory (DRAM) Organization

2 Outline
 Introduction
 DRAM Organization
 Challenges
   – Bandwidth
   – Granularity
 Performance
Reading: HP3 5.8 and 5.9

3 Basics of DRAM Technology
DRAM (Dynamic RAM)
 Used mostly in main memory
 1 capacitor + 1 transistor per bit
 Needs refresh every 4-8 ms (about 5% of total time)
 Read is destructive (requires write-back)
 Access time < cycle time (because of the write-back)
 Density (25-50):1 relative to SRAM
 Address lines multiplexed (pins are scarce!)
SRAM (Static RAM)
 Used mostly in caches (I, D, TLB, BTB)
 1 flip-flop (4-6 transistors) per bit
 Read is not destructive
 Access time = cycle time
 Speed (8-16):1 relative to DRAM
 Address lines not multiplexed (high speed of decoding is important)

4 DRAM Organization: Fig. 5.29

5 Chip Organization
 Chip capacity (= number of data bits) tends to quadruple: 1K, 4K, 16K, 64K, 256K, 1M, 4M, …
 In early designs, each data bit belonged to a different address (×1 organization)
 Starting with 1 Mbit chips, wider chips (4, 8, 16, 32 bits wide) began to appear
   – Advantage: higher bandwidth
   – Disadvantage: more pins, hence more expensive packaging

6 Chip Organization Example: 64 Mb DRAM

7 DRAM Access
Several steps in a DRAM access:
 Half of the address bits select a row of the square array
 The whole row of bits is brought out of the memory array into a buffer register (slow: 60-80% of the access time)
 The other half of the address bits selects one bit of the buffer register (with the help of a multiplexer), which is read or written
 The whole row is written back to the memory array
Notes:
 This organization is demanded by the needs of refresh
 It has advantages: e.g., nibble, page, and static column mode operation
A minimal code sketch of this sequence follows below.
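
The sequence above can be captured in a short C sketch. This is a behavioral illustration only, not a timing model; the chip geometry (a 1024×1024 array for a 1M×1 part) and names such as dram_read and row_buffer are illustrative choices, not part of any real controller interface.

/* Behavioral sketch of one DRAM read: row select, destructive row read into
 * the buffer register, column select, write-back of the whole row. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define ROWS 1024
#define COLS 1024

static uint8_t cell[ROWS][COLS];   /* one bit per cell, stored in a byte */
static uint8_t row_buffer[COLS];   /* sense amplifiers / buffer register */

uint8_t dram_read(uint32_t addr)   /* 20-bit address: 10 row + 10 column */
{
    uint32_t row = addr >> 10;              /* half the address selects a row  */
    uint32_t col = addr & (COLS - 1);       /* other half selects one column   */

    memcpy(row_buffer, cell[row], COLS);    /* slow step: whole row read out   */
    memset(cell[row], 0, COLS);             /* the read destroys the row ...   */

    uint8_t bit = row_buffer[col];          /* column mux picks one bit        */

    memcpy(cell[row], row_buffer, COLS);    /* ... so the row is written back  */
    return bit;
}

int main(void)
{
    cell[3][7] = 1;                         /* plant one bit, then read it back */
    printf("bit at row 3, col 7: %u\n", dram_read((3u << 10) | 7u));
    return 0;
}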

8 DRAM Refresh
 Refreshes are performed one row at a time
   – Consider a 1M×1 DRAM chip with a 190 ns cycle time
   – Time to refresh all rows, one row at a time: 190 ns × 10³ rows ≈ 0.19 ms < 4-8 ms
 Refresh complicates the operation of the memory
   – Refresh control competes with the CPU for access to the DRAM
   – Each row is refreshed once every 4-8 ms irrespective of the use of that row
 Want to keep refresh overhead small (< 5-10% of total time); the short calculation below works this out
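
A quick C calculation of the refresh arithmetic above, assuming 1024 rows for the 1M×1 part (a 1024×1024 array) and a 4 ms refresh window; the figures are taken from the slide, and the ~5% result matches the overhead quoted earlier.

/* Back-of-the-envelope refresh overhead for a 1M x 1 DRAM part. */
#include <stdio.h>

int main(void)
{
    const double rows      = 1024;   /* 1M x 1 chip -> 1024 x 1024 array (assumed) */
    const double cycle_ns  = 190;    /* one refresh cycle per row                  */
    const double window_ms = 4;      /* worst-case refresh interval                */

    double refresh_ms = rows * cycle_ns * 1e-6;           /* ~0.19 ms */
    double overhead   = refresh_ms / window_ms * 100.0;   /* ~4.9 %   */

    printf("time to refresh all rows: %.3f ms\n", refresh_ms);
    printf("refresh overhead: %.1f %% of total time\n", overhead);
    return 0;
}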

9 Memory Performance Characteristics
 Latency (access time): the time interval between the instant at which the data is called for (READ) or requested to be stored (WRITE), and the instant at which it is delivered or completely stored
 Cycle time: the time between the instant the memory is accessed and the instant at which it may be validly accessed again
 Bandwidth (throughput): the rate at which data can be transferred to or from memory
   – Reciprocal of cycle time
   – "Burst mode" bandwidth is of greatest interest
 Cycle time > access time for conventional DRAM
 Cycle time < access time in "burst mode", when a sequence of consecutive locations is read or written
Illustrative numbers relating these metrics appear in the sketch below.
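
To make the three metrics concrete, here is a tiny C example for a conventional (non-burst) DRAM. The 60 ns / 110 ns / byte-wide figures are made up for illustration, not taken from the slides; the point is that bandwidth follows the cycle time, not the access time.

/* Latency vs. cycle time vs. bandwidth for a conventional DRAM (illustrative). */
#include <stdio.h>

int main(void)
{
    const double access_ns = 60;    /* data available this long after the request */
    const double cycle_ns  = 110;   /* next access allowed only after this        */
    const double width_B   = 1;     /* bytes transferred per access               */

    printf("latency   : %.0f ns\n", access_ns);
    printf("bandwidth : %.2f MB/s (width / cycle time)\n",
           width_B / (cycle_ns * 1e-9) / 1e6);
    return 0;
}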

10 Improving Performance
 Latency can be reduced by:
   – Reducing the access time of the chips
   – Using a cache ("cache trades latency for bandwidth")
 Bandwidth can be increased by using:
   – Wider memory (more chips)
   – More data pins per DRAM chip
   – Increased bandwidth per data pin

11 Two Recent Problems
 DRAM chip sizes are quadrupling every three years
 Main memory sizes are doubling every three years
 Thus, the main memory of the same kind of computer is being constructed from fewer and fewer DRAM chips
 This results in two serious problems:
   – Diminishing main memory bandwidth
   – Increasing granularity of memory systems

12 Increasing Granularity of Memory Systems
 The granularity of a memory system is the minimum memory size, and also the minimum increment in the amount of memory, permitted by the memory system
 Too large a granularity is undesirable:
   – Increases the cost of the system
   – Restricts its competitiveness
 Granularity can be decreased by:
   – Widening the DRAM chips
   – Increasing the per-pin bandwidth of the DRAM chips

13 Granularity Example
 We are using 16K×1 DRAM parts running at 2.5 MHz (400 ns cycle time). Eight such DRAM parts provide 16 KB of memory with 2.5 MB/s bandwidth.
 Industry switches to 64 Kb (64K×1) DRAM parts. Two such DRAM parts provide the desired 16 KB of memory. Such a system would have a 2-bit wide bus.
 To maintain a 2.5 MB/s bandwidth, the parts would need to run at 10 MHz. But the parts run at only 3.7 MHz. What are the options?

14 Granularity Example (2)
 Solution 1: Use eight 64K×1 DRAM parts (six would suffice for the required bandwidth). Problem: now we have 64 KB of memory rather than 16 KB.
 Solution 2: Use two 16K×4 DRAM parts (same capacity, different organization). This provides 16 KB of memory at the required bandwidth.
The calculation below works through these numbers.
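
The numbers in this two-slide example can be checked with a few lines of C. All figures (chip organizations, 2.5 MHz and 3.7 MHz clocks, capacities) come from the slides; the bandwidth_MBps helper is just an illustrative name.

/* Bandwidth and capacity for each configuration in the granularity example. */
#include <stdio.h>

static double bandwidth_MBps(int num_chips, int bits_per_chip, double freq_MHz)
{
    /* each cycle transfers (num_chips * bits_per_chip) bits */
    return num_chips * bits_per_chip * freq_MHz / 8.0;
}

int main(void)
{
    /* original system: eight 16K x 1 parts at 2.5 MHz -> 16 KB, 2.5 MB/s */
    printf("16Kx1 x 8 chips at 2.5 MHz: %.1f MB/s\n", bandwidth_MBps(8, 1, 2.5));

    /* two 64K x 1 parts keep the 16 KB capacity but give only a 2-bit bus */
    printf("64Kx1 x 2 chips at 3.7 MHz: %.2f MB/s (need 2.5)\n",
           bandwidth_MBps(2, 1, 3.7));
    printf("frequency needed for 2.5 MB/s on a 2-bit bus: %.1f MHz\n",
           2.5 * 8.0 / 2.0);

    /* Solution 1: eight 64K x 1 parts -> bandwidth is back, but 64 KB not 16 KB */
    printf("64Kx1 x 8 chips at 3.7 MHz: %.1f MB/s, 64 KB\n",
           bandwidth_MBps(8, 1, 3.7));

    /* Solution 2: two 16K x 4 parts -> 16 KB and a byte-wide bus again */
    printf("16Kx4 x 2 chips at 3.7 MHz: %.1f MB/s, 16 KB\n",
           bandwidth_MBps(2, 4, 3.7));
    return 0;
}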

15 Improving Memory Chip Performance
Several techniques to get more bits/sec out of a DRAM chip:
 Allow repeated accesses to the row buffer without another row access time
   → burst mode, fast page mode, EDO mode, …
 Simplify the DRAM-CPU interface
   → add a clock to reduce the overhead of synchronizing with the controller
   → synchronous DRAM (SDRAM)
 Transfer data on both rising and falling clock edges
   → double data rate (DDR)
 Each of the above adds a small amount of logic to exploit the high internal DRAM bandwidth

16 Basic Mode of Operation
 Slowest mode
 Uses only a single row and column address
 Row access is slow (60-70 ns) compared to column access (5-10 ns)
 Leads to three techniques for DRAM speed improvement:
   – Getting more bits out of the DRAM on one access, given the timing constraints
   – Pipelining the various operations to minimize total time
   – Segmenting the data in such a way that some operations are eliminated for a given set of accesses
[Timing diagram: row address then column address on the multiplexed address pins, framed by RAS and CAS strobes, followed by data]

17 Nibble (or Burst) Mode
 Several consecutive columns are accessed
 Only the first column address is explicitly specified
 The rest are internally generated using a counter
[Timing diagram: one RAS with row address RA, then four CAS pulses; only the first column address CA is supplied, and data D1-D4 are returned]

18 Fast Page Mode
 Accesses arbitrary columns within the same row
 Static column mode is similar
 A rough timing comparison against basic mode follows below
[Timing diagram: one RAS with row address RA, then four CAS pulses with explicit column addresses CA1-CA4, returning D1-D4]
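
A rough C comparison of four reads that hit the same row, using the row (~60 ns) and column (~10 ns) access times quoted on the Basic Mode slide. The figures are coarse, so treat the speedup as indicative only.

/* Basic mode vs. fast page mode for four reads within one row. */
#include <stdio.h>

int main(void)
{
    const double t_row = 60, t_col = 10;   /* ns, from the Basic Mode slide    */
    const int    n     = 4;                /* consecutive accesses             */

    double basic = n * (t_row + t_col);    /* new RAS + CAS for every access   */
    double page  = t_row + n * t_col;      /* one RAS, then one CAS per column */

    printf("basic mode : %.0f ns for %d reads\n", basic, n);
    printf("page mode  : %.0f ns for %d reads (%.1fx faster)\n",
           page, n, basic / page);
    return 0;
}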

19 EDO Mode
 Arbitrary column addresses
 Pipelined: the next column address can be presented while the previous data is still being driven out
 EDO = Extended Data Out
 Has other modes like "burst EDO", which allows reading of a fixed number of bytes starting at each specified column address
[Timing diagram: one RAS with row address RA, then seven CAS pulses with column addresses CA1-CA7; data D1-D6 overlaps the following column addresses]

20 Evolutionary DRAM Architectures
 SDRAM (Synchronous DRAM)
   – The interface retains a good part of the conventional DRAM interface: addresses multiplexed in two halves, separate data pins, two control signals
   – All address, data, and control signals are synchronized with an external clock (100-150 MHz)
 Allows decoupling of processor and memory
 Allows pipelining of a series of reads and writes
 Peak speed per memory module: 800-1200 MB/s (see the check below)
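
A quick check of the quoted peak: the sketch below assumes a 64-bit (8-byte) wide memory module transferring once per clock, which is the width that reproduces the 800-1200 MB/s figure; the module width is an assumption, not stated on the slide.

/* Peak module bandwidth at the two ends of the SDRAM clock range. */
#include <stdio.h>

int main(void)
{
    const double width_bytes = 8;   /* assumed 64-bit wide module */
    for (double clk_MHz = 100; clk_MHz <= 150; clk_MHz += 50)
        printf("%.0f MHz x %.0f bytes/clock = %.0f MB/s peak\n",
               clk_MHz, width_bytes, clk_MHz * width_bytes);
    return 0;
}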

21 Revolutionary DRAM Architectures
 Examples
   – RDRAM (Rambus DRAM)
   – MDRAM (MoSys DRAM)
 Salient features
   – Many smaller memory banks interleaved on one chip
   – "Protocol based" architecture: a narrow, fully multiplexed communication protocol
 Example: RAMBUS (RDRAM, DRDRAM)
   – Each chip is more like a memory system than a component
   – Interleaved memory and a high-speed interface
   – Packet-switched bus (split-transaction bus)
   – A chip can return a variable number of bytes from a single request, performs its own refresh, and transfers on both clock edges
   – Narrow bus (1-2 data bytes); up to 3 transactions can proceed concurrently
   – Internally, a 72-bit wide bus with a 5 ns cycle time
   – Up to 1600 MB/s peak bandwidth (a back-of-the-envelope check follows below)
   – Expensive!
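
A back-of-the-envelope check of how the internal figures line up with the quoted peak, assuming 64 of the 72 internal bus bits carry data (the remainder ECC); that 64/72 split is an assumption, not stated on the slide.

/* Internal bandwidth implied by a 72-bit bus with a 5 ns cycle (64 data bits assumed). */
#include <stdio.h>

int main(void)
{
    const double cycle_ns   = 5;
    const double data_bytes = 64 / 8.0;                 /* 8 data bytes per cycle */

    double peak_MBps = data_bytes / (cycle_ns * 1e-9) / 1e6;
    printf("peak internal bandwidth: %.0f MB/s\n", peak_MBps);   /* 1600 MB/s */
    return 0;
}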

22 Achieving Higher Memory Bandwidth (Fig. 5.27, HP3)

23 Memory Interleaving
 Goal: take advantage of the bandwidth of the multiple DRAMs in the memory system
 A memory address A is converted into a (b, w) pair, where
   – b = bank index
   – w = word index within the bank
 Logically a wide memory
   – Accesses to the B banks are staged over time to share internal resources such as the memory bus
 Interleaving can be on:
   – Low-order bits of the address (cyclic): b = A mod B, w = A div B
   – High-order bits of the address (block)
   – A combination of the two (block-cyclic)
The sketch below prints both mappings for a small example.
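
The cyclic and block mappings can be shown side by side in a few lines of C; the bank count B = 4 and bank size W = 8 words are illustrative parameters, not from the slides.

/* Address -> (bank, word) under cyclic (low-order) and block (high-order) interleaving. */
#include <stdio.h>

#define B 4   /* number of banks  */
#define W 8   /* words per bank   */

int main(void)
{
    for (unsigned A = 0; A < 12; A++) {
        unsigned b_cyclic = A % B, w_cyclic = A / B;   /* low-order bits pick the bank  */
        unsigned b_block  = A / W, w_block  = A % W;   /* high-order bits pick the bank */
        printf("A=%2u  cyclic:(b=%u,w=%u)  block:(b=%u,w=%u)\n",
               A, b_cyclic, w_cyclic, b_block, w_block);
    }
    return 0;
}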

24 Low-order Bit Interleaving

25 Mixed Interleaving
 The memory address register is 6 bits wide
   – The most significant 2 bits give the bank address
   – The next 3 bits give the word address within the bank
   – The LSB gives the (parity of the) module within the bank
 6 = 000110₂ = (00, 011, 0) = (0, 3, 0)
 41 = 101001₂ = (10, 100, 1) = (2, 4, 1)
These two examples are reproduced by the short program below.
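
The two worked examples above are reproduced by this short C program; the split function is just an illustrative name for the bit-field extraction described on the slide.

/* 6-bit address -> (bank, word, module) = (top 2 bits, middle 3 bits, low bit). */
#include <stdio.h>

static void split(unsigned A)
{
    unsigned bank   = (A >> 4) & 0x3;   /* most significant 2 bits */
    unsigned word   = (A >> 1) & 0x7;   /* next 3 bits             */
    unsigned module =  A       & 0x1;   /* least significant bit   */
    printf("A=%2u -> (bank=%u, word=%u, module=%u)\n", A, bank, word, module);
}

int main(void)
{
    split(6);    /* (0, 3, 0) */
    split(41);   /* (2, 4, 1) */
    return 0;
}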

26 Other Types of Memory
 ROM = Read-Only Memory
 Flash = ROM that can be written once in a while
   – Used in embedded systems and small microcontrollers
   – Offers IP protection and security

