CS510 Computer Architectures, Lecture 15: Main Memory


1 Lecture 15: Main Memory

2 Main Memory Background
Performance of main memory:
–Latency: determines the cache miss penalty
  Access Time (AT): time between a request and the arrival of the word
  Cycle Time (CT): minimum time between successive requests
–Bandwidth: matters for I/O and for the miss penalty of large blocks (L2)
Main memory, organized as a 2D matrix, is DRAM:
–Dynamic: needs to be refreshed periodically (every 8 ms)
  AT and CT differ: AT < CT
–Address is divided into 2 halves that are multiplexed to memory:
  RAS, or Row Access Strobe
  CAS, or Column Access Strobe
Cache uses SRAM:
–No refresh (6 transistors/bit vs. 1 transistor/bit for DRAM)
  AT and CT do not differ: AT = CT
–Address is not divided

3 Main Memory Background
Size (capacity): DRAM/SRAM ≈ 4~8
Cost: SRAM/DRAM ≈ 8~16; cycle time: DRAM/SRAM ≈ 8~16 (SRAM is faster)
Capacity of DRAM: grows 4 times every 3 years, or about 60% per year
RAS access time: improves about 7% per year

4 Main Memory Organization
Simple:
–CPU, Cache, Bus, Memory are the same width (32 bits)
Interleaved:
–CPU, Cache, Bus are 1 word wide; Memory has N modules (4 modules in the figure, which shows word interleaving)
Wide:
–CPU/Mux is 1 word wide; Mux/Cache, Bus, Memory are N words wide (Alpha: 64 bits & 256 bits)
[Figure: three organizations. 1-word-wide memory: CPU, Cache, Bus, M. Wide memory: CPU, MUX, Cache, Bus, M. Interleaved memory: CPU, Cache, Bus, Bank 0 to Bank 3.]

5 Main Memory Performance
Timing model (in clock cycles):
–1 to send the address
–6 access time
–1 to send a word of data
Block access time, assuming a cache block of 4 words (M.P. = miss penalty):
–Simple M.P. = 4 x (1 + 6 + 1) = 32
–Wide M.P. = 1 + 6 + 1 = 8
–Interleaved M.P. = 1 + 6 + (4 x 1) = 11
[Figure: word addresses laid out across Bank 0, Bank 1, Bank 2, Bank 3]
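The three miss-penalty formulas on this slide can be written as small functions. This is only a sketch of the slide's timing model: the struct and function names are made up for illustration, the wide case assumes the memory is as wide as the block, and the interleaved case assumes there are at least as many banks as words in the block.

```c
#include <assert.h>

/* Timing model from the slide, in clock cycles (names are illustrative). */
typedef struct {
    int send_addr;  /* cycles to send the address        */
    int access;     /* cycles of memory access time      */
    int send_data;  /* cycles to send one word of data   */
} mem_timing;

/* Simple 1-word-wide memory: every word pays the full round trip. */
int miss_penalty_simple(mem_timing t, int block_words) {
    assert(block_words > 0);
    return block_words * (t.send_addr + t.access + t.send_data);
}

/* Wide memory (assumed as wide as the block): one round trip for the block. */
int miss_penalty_wide(mem_timing t) {
    return t.send_addr + t.access + t.send_data;
}

/* Interleaved memory (assumed banks >= block_words): the banks overlap
   their access time, then the words return one per transfer slot. */
int miss_penalty_interleaved(mem_timing t, int block_words) {
    assert(block_words > 0);
    return t.send_addr + t.access + block_words * t.send_data;
}
```

With the slide's values (1, 6, 1) and a 4-word block, these return 32, 8 and 11 cycles respectively.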

6 Technique for Higher BW: 1. Wider Main Memory
Alpha AXP 21064: 256-bit wide L2, memory bus, and memory
Drawbacks:
–Expandability: doubling the width requires doubling the capacity
–Bus width: a multiplexer is needed to get the desired word from a block
–Error correction: keep separate error correction for every 32 bits; otherwise, a WRITE must read the block -> modify the word -> calculate the new ECC -> store

7 Technique for Higher BW: 2. Interleaved Memory
Interleaved Memory and Wide Memory
–Consider the following description of a machine and its cache performance:
  memory bus width = 1 word = 32 bits
  memory accesses per instruction = 1.2
  cache miss penalty = 8 cycles (1 + 6 + 1)
  average CPI (ignoring cache misses) = 2
  block size (words):   1   2   4
  miss rate (%):        3   2   1
–What is the performance improvement over the base machine (block size = 1) of 2-way and 4-way interleaving versus doubling the width of the memory and the bus?

8 Interleaved Memory
Answer
–Effective CPI = CPI + (memory refs/instr. x miss rate x miss penalty)
  = 2 + (1.2 x (0.03 for 1-word, 0.02 for 2-word, or 0.01 for 4-word blocks) x miss penalty)
–CPI for the base machine (BM: simple memory, 1-word blocks):
  2 + (1.2 x 0.03 x 8) = 2.288
–2-word blocks
  32-bit bus and memory, no interleaving: 2 + (1.2 x 0.02 x (2 x 8)) = 2.384, slower than BM
  32-bit bus and memory, interleaving: 2 + (1.2 x 0.02 x (1 + 6 + (2 x 1))) = 2.216, faster than BM
  64-bit bus and memory, no interleaving: 2 + (1.2 x 0.02 x 8) = 2.192, faster than BM
–4-word blocks
  32-bit bus and memory, no interleaving: 2 + (1.2 x 0.01 x (4 x 8)) = 2.384, slower than BM
  32-bit bus and memory, interleaving: 2 + (1.2 x 0.01 x (1 + 6 + (4 x 1))) = 2.132, faster than with 2-word blocks
  64-bit bus and memory, no interleaving: 2 + (1.2 x 0.01 x (2 x 8)) = 2.192, same as with 2-word blocks
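The CPI arithmetic above all follows one formula, so it can be checked with a single helper. The function name and parameter names below are hypothetical; the formula itself is the one on the slide.

```c
#include <assert.h>

/* Effective CPI = base CPI + (memory refs per instruction)
                              x (miss rate) x (miss penalty in cycles).
   miss_rate is a fraction, e.g. 0.03 for a 3% miss rate. */
double effective_cpi(double base_cpi, double refs_per_instr,
                     double miss_rate, int miss_penalty_cycles) {
    assert(miss_rate >= 0.0 && miss_rate <= 1.0);
    assert(miss_penalty_cycles >= 0);
    return base_cpi + refs_per_instr * miss_rate * miss_penalty_cycles;
}
```

For example, the base machine is `effective_cpi(2.0, 1.2, 0.03, 8)`, and the 4-word interleaved case is `effective_cpi(2.0, 1.2, 0.01, 11)`. Results are floating point, so compare them against the slide's figures with a small tolerance rather than with `==`.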

9 Technique for Higher BW: 3. Independent Memory Banks
Interleaved memory gives faster sequential accesses; independent memory banks give faster independent accesses
Motivation:
–Interleaving: higher BW for sequential accesses by interleaving sequential addresses across banks; the banks share the address lines
–Independent banks: each bank has its own bank controller and separate address lines
  1 bank for I/O, 1 bank for cache read, 1 bank for cache write, etc.
  If 1 controller controls all the banks, it can provide fast access time for only one operation
–Memory banks also benefit miss-under-miss in non-blocking caches
Superbank: all memory banks active on one block transfer
Bank: portion within a superbank that is word-interleaved
Address fields: | Superbank Number | Superbank Offset |, where the Superbank Offset is split into | Bank Number | Bank Offset |

10 Independent Memory Banks
How many banks?
–For sequential accesses, a new bank delivers a word on each clock
–So for sequential accesses, number of banks ≥ number of clocks to access a word in a bank
–Otherwise the access stream returns to the original bank before it has the next word ready
Increasing capacity of a DRAM chip => fewer chips needed to build a memory system of the same capacity => harder to have many banks
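The rule of thumb on this slide can be checked with a toy simulation. The model below is an assumption made for illustration, not something the lecture specifies: word `a` lives in bank `a % n_banks`, at most one request issues per clock, and a bank is busy for `bank_busy` clocks after accepting a request. The function name and `MAX_BANKS` limit are hypothetical.

```c
#include <assert.h>

#define MAX_BANKS 64

/* Count the stall cycles seen by n_words sequential word accesses
   under the simplified model described above. */
long sequential_stalls(int n_banks, int bank_busy, int n_words) {
    long free_at[MAX_BANKS] = {0};  /* clock at which each bank is free */
    long clock = 0, stalls = 0;
    assert(n_banks > 0 && n_banks <= MAX_BANKS);
    assert(bank_busy > 0 && n_words >= 0);
    for (int a = 0; a < n_words; a++) {
        int b = a % n_banks;
        if (free_at[b] > clock) {        /* bank still busy: wait for it */
            stalls += free_at[b] - clock;
            clock = free_at[b];
        }
        free_at[b] = clock + bank_busy;  /* bank starts this access */
        clock += 1;                      /* next request issues next clock */
    }
    return stalls;
}
```

Under this model, a 6-clock bank access with 6 or more banks never stalls, while 4 banks stall as soon as the stream wraps around to bank 0.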

11 Technique for Higher BW: 4. Avoiding Bank Conflicts
Even with many banks, bank conflicts still occur for certain regular access patterns
–e.g., storing a 256x512 array in 128 banks and processing it by columns (512 is an even multiple of 128)

    int x[256][512];
    int i, j;
    for (j = 0; j < 512; j = j+1)
        for (i = 0; i < 256; i = i+1)
            x[i][j] = 2 * x[i][j];

–The inner loop processes a column; all elements of a column are in the same bank, which causes bank conflicts
[Figure: array elements laid out across Bank 0, Bank 1, ..., Bank 127]
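To see why the column walk conflicts, one can count how many distinct banks a single column touches. The sketch below assumes one array element per memory word, C's row-major layout, and plain word interleaving (bank = word address MOD number of banks); the function name and the `MAX_BANKS` limit are hypothetical.

```c
#include <assert.h>

#define ROWS 256
#define MAX_BANKS 1024

/* Number of distinct banks touched when walking column j of a
   ROWS x row_len array. Word address of x[i][j] is i*row_len + j. */
int banks_touched_by_column(int row_len, int n_banks, int j) {
    char seen[MAX_BANKS] = {0};
    int distinct = 0;
    assert(row_len > 0 && n_banks > 0 && n_banks <= MAX_BANKS);
    assert(j >= 0 && j < row_len);
    for (int i = 0; i < ROWS; i++) {
        int bank = (i * row_len + j) % n_banks;
        if (!seen[bank]) {
            seen[bank] = 1;
            distinct++;
        }
    }
    return distinct;
}
```

With rows of 512 words and 128 banks, every element of a column maps to one bank. Padding each row to 513 words, or using 127 banks, spreads the same column over all the banks.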

12 Avoiding Bank Conflicts
SW approaches:
–Loop interchange, to avoid accessing the same bank repeatedly
–Declaring the array size to be not a power of 2 (the number of banks is a power of 2), so that the elements of a column are spread across different banks
HW approach: prime number of banks
–bank number = (address) MOD (number of banks)
–address within bank = address / number of banks
–To avoid a divide on every memory access: address within bank = (address) MOD (number of words in bank), e.g., 3 = (31) MOD (7)
–Bank number? Words per bank? Easy if both are powers of 2
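As a sketch of the loop-interchange approach, here is the 256x512 doubling loop from the bank-conflict example with its two loops swapped. The function wrapper and its name are added here for illustration only.

```c
#include <assert.h>

#define ROWS 256
#define COLS 512

/* Loop-interchanged version of the column-order loop: the inner loop
   now walks along a row, so consecutive accesses go to consecutive
   words, and therefore to consecutive banks under word interleaving. */
void double_all_row_order(int x[ROWS][COLS]) {
    assert(x != 0);
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            x[i][j] = 2 * x[i][j];
}
```

The result is the same as the original loop because each element is updated independently. The other software approach, a row length that is not a power of 2, would amount to declaring the array with a padded row such as `int x[256][513]` and leaving the extra column unused.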

13 Chinese Remainder Theorem (Fast Bank Number)
As long as two sets of integers a_i and b_i follow these rules:
  b_i = (x) MOD (a_i),  0 ≤ b_i < a_i,  0 ≤ x < a_0 × a_1 × a_2 × ...
and a_i and a_j are co-prime if i ≠ j, then the integer x has only one solution (unambiguous mapping):
–bank number = b_0 = (x) MOD (a_0); number of banks = a_0 (= 3 in the example), 0 ≤ b_0 < a_0
–address within a bank = b_1 = (x) MOD (a_1); size of a bank = a_1 (= 8 in the example)
–N words have addresses 0 to N-1; prime number of banks (3); words per bank is a power of 2 (8)

14 Fast Bank Numbers
                 Seq. Interleaved     Modulo Interleaved
Bank Number:      0    1    2          0    1    2
Addr in Bank:
      0           0    1    2          0   16    8
      1           3    4    5          9    1   17
      2           6    7    8         18   10    2
      3           9   10   11          3   19   11
      4          12   13   14         12    4   20
      5          15   16   17         21   13    5
      6          18   19   20          6   22   14
      7          21   22   23         15    7   23
Example for address 5: Bank # = (5) MOD (3) = 2; address in bank = 5/3 = 1 (seq. interleaved) or (5) MOD (8) = 5 (modulo interleaved)
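The modulo-interleaved mapping in this table can be sketched in a few lines. The type and function names are hypothetical, and the 64x64 limit in the checker is an arbitrary choice for the example. The second function checks by brute force the property the Chinese Remainder Theorem guarantees: when the number of banks and the words per bank are co-prime, no two addresses share a (bank, offset) slot.

```c
#include <assert.h>

typedef struct {
    unsigned bank;    /* which bank holds the word        */
    unsigned offset;  /* address of the word in that bank */
} bank_addr;

/* Modulo-interleaved mapping: both fields are remainders, no divide. */
bank_addr modulo_interleave(unsigned x, unsigned n_banks,
                            unsigned words_per_bank) {
    bank_addr r;
    assert(n_banks > 0 && words_per_bank > 0);
    assert(x < n_banks * words_per_bank);
    r.bank = x % n_banks;
    /* If words_per_bank is a power of 2, this is just the low bits of x. */
    r.offset = x % words_per_bank;
    return r;
}

/* Returns 1 if every address 0..N-1 gets its own (bank, offset) slot. */
int mapping_is_one_to_one(unsigned n_banks, unsigned words_per_bank) {
    unsigned char used[64][64] = {{0}};
    assert(n_banks <= 64 && words_per_bank <= 64);
    for (unsigned x = 0; x < n_banks * words_per_bank; x++) {
        bank_addr r = modulo_interleave(x, n_banks, words_per_bank);
        if (used[r.bank][r.offset])
            return 0;  /* two addresses collided in one slot */
        used[r.bank][r.offset] = 1;
    }
    return 1;
}
```

For the slide's example, address 5 with 3 banks of 8 words maps to bank 2, offset 5. With 4 banks of 8 words the two moduli are not co-prime, and addresses 0 and 8 collide.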

15 Technique for Higher BW: 5. DRAM Specific Interleaving
DRAM access: Row Access (RAS) and Column Access (CAS)
Multiple accesses to a RAS buffer: goes by several names (e.g., page mode)
–64 Mbit DRAM: cycle time = 100 ns, page mode = 20 ns
New DRAMs to address the CPU-DRAM speed gap; what will they cost, will they survive?
–Synchronous DRAM: provides a clock signal to the DRAM; transfers are synchronous to the system clock
–RAMBUS: startup company; reinvented the DRAM interface
  Each chip acts as a module vs. a slice of memory (or bank)
  Short bus between CPU and chips
  Does its own refresh
  Variable amount of data returned
  1 byte / 2 ns (500 MB/s per chip)
Niche memory only? Or main memory?
–e.g., Video RAM for frame buffers: DRAM + fast serial output

16 Main Memory Summary
Wider Memory: for independent access
Interleaved Memory: for sequential or independent accesses
Avoiding bank conflicts: SW & HW
DRAM specific optimizations: page mode & specialty DRAM

