
1 EE30332 Ch7 DP.1 Ch 7 Memory Hierarchy. Most of the slides are from Prof. Dave Patterson of the University of California at Berkeley; part of the material is from Sun Microsystems, and part is from AF. "Copyright 1997 UCB." Permission is granted to alter and distribute this material provided that the following credit line is included: "Adapted from (complete bibliographic citation). Copyright 1997 UCB."

2 EE30332 Ch7 DP.2 Ch 7 Memory Hierarchy: DRAM trends

Year   Size     Cycle Time
1980   64 Kb    250 ns
1983   256 Kb   220 ns
1986   1 Mb     190 ns
1989   4 Mb     165 ns
1992   16 Mb    145 ns
1995   64 Mb    120 ns

         Capacity         Speed (latency)
Logic:   2x in 3 years    2x in 3 years
DRAM:    4x in 3 years    2x in 10 years
Disk:    4x in 3 years    2x in 10 years

Over this period DRAM capacity improved roughly 1000:1 while cycle time improved only about 2:1.
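A quick arithmetic check of the table's closing ratios; a minimal Python sketch using the 1980 and 1995 endpoints from the table above:

```python
# Sketch: verify the capacity-vs-latency contrast from the DRAM table.
size_kb_1980, size_kb_1995 = 64, 64 * 1024   # 64 Kb -> 64 Mb
cycle_ns_1980, cycle_ns_1995 = 250, 120

capacity_ratio = size_kb_1995 / size_kb_1980   # 1024x, roughly "1000:1"
speed_ratio = cycle_ns_1980 / cycle_ns_1995    # ~2.1x, roughly "2:1"

print(f"capacity {capacity_ratio:.0f}:1, speed {speed_ratio:.1f}:1")
```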

3 EE30332 Ch7 DP.3 Who Cares About the Memory Hierarchy? [Figure: processor vs. DRAM performance, 1980-2000, log scale 1-1000.] Following "Moore's Law," processor performance grows about 60%/yr (2X per 1.5 years) while DRAM performance grows about 9%/yr (2X per 10 years), so the processor-DRAM memory gap (latency) grows about 50% per year.

4 EE30332 Ch7 DP.4 Impact on Performance. Suppose a processor executes at Clock Rate = 200 MHz (5 ns per cycle) with CPI = 1.1 and an instruction mix of 50% arith/logic, 30% load/store, 20% control. Suppose also that 10% of memory operations incur a 50-cycle miss penalty. Then CPI = ideal CPI + average stalls per instruction = 1.1 + (0.30 data mops/instr x 0.10 miss/mop x 50 cycles/miss) = 1.1 + 1.5 = 2.6. The processor is stalled waiting for memory 58% of the time! A 1% instruction miss rate would add a further 0.5 cycles to the CPI.
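The slide's calculation, restated as a short Python sketch (all numbers come from the slide itself):

```python
# CPI impact of data-cache misses, per the slide's example.
ideal_cpi = 1.1
loadstore_frac = 0.30    # 30% of instructions are loads/stores
miss_rate = 0.10         # 10% of memory operations miss
miss_penalty = 50        # cycles per miss

data_stalls = loadstore_frac * miss_rate * miss_penalty   # 1.5 cycles/instr
cpi = ideal_cpi + data_stalls                             # 2.6
stall_fraction = data_stalls / cpi                        # ~58%

# A 1% instruction miss rate adds one more stall term (1 fetch per instr).
extra_cpi = 1.0 * 0.01 * miss_penalty                     # 0.5 cycles/instr

print(f"CPI = {cpi:.1f}, stalled {stall_fraction:.0%}, +{extra_cpi} CPI")
```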

5 EE30332 Ch7 DP.5 The Goal: the illusion of large, fast, cheap memory. Fact: large memories are slow, and fast memories are small. How do we create a memory that is large, cheap, and fast (most of the time)? Hierarchy and parallelism.

6 EE30332 Ch7 DP.6 An Expanded View of the Memory System. [Figure: a processor (control + datapath) connected to successive levels of memory.] Moving away from the processor, speed runs from fastest to slowest, size from smallest to biggest, and cost per bit from highest to lowest.

7 EE30332 Ch7 DP.7 Why hierarchy works. The Principle of Locality: programs access a relatively small portion of the address space at any instant of time. [Figure: probability of reference plotted over the address space 0 to 2^n - 1, concentrated in a few narrow regions.]

8 EE30332 Ch7 DP.8 Memory Hierarchy: How Does it Work? Temporal Locality (locality in time): keep the most recently accessed data items closer to the processor. Spatial Locality (locality in space): move blocks consisting of contiguous words to the upper levels. [Figure: blocks X and Y moving between lower-level and upper-level memory, to and from the processor.]
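An illustrative snippet (not from the slides): a single summation loop exhibits both kinds of locality at once.

```python
# 'total' is touched on every iteration: temporal locality.
# data[0], data[1], ... are adjacent in memory: spatial locality,
# so a cache that fetches whole blocks gets many hits per miss.
data = list(range(1024))
total = 0
for i in range(len(data)):
    total += data[i]
```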

9 EE30332 Ch7 DP.9 Memory Hierarchy: Terminology. Hit: the data appears in some block in the upper level (example: Block X). Hit Rate: the fraction of memory accesses found in the upper level. Hit Time: RAM access time + time to determine hit/miss. Miss: the data must be retrieved from lower-level storage (example: Block Y). Miss Rate = 1 - Hit Rate. Miss Penalty: time to replace a block in the upper level + time to deliver the block to the processor. Note that Hit Time << Miss Penalty.

10 EE30332 Ch7 DP.10 How is the hierarchy managed? Registers <-> memory: by the compiler (and the programmer?). Cache <-> memory: by the hardware. Memory <-> disks: by the hardware and the operating system (virtual memory), and by the programmer (files). The processor sees registers, cache, and memory, but not the disk, when executing instructions; disks are handled as I/O, through page faults to/from memory.

11 EE30332 Ch7 DP.11 Memory Hierarchy Technology. Random access: "random" is good, since access time is the same for all locations. DRAM (Dynamic Random Access Memory): high density, low power, cheap, slow; dynamic because it must be "refreshed" regularly. SRAM (Static Random Access Memory): low density, high power, expensive, fast; static because its content lasts "forever" (until power is lost). "Not-so-random" access technology: access time varies from location to location and from time to time; examples: disk, CDROM. Sequential access technology: access time linear in location (e.g., tape). Main memory is built from DRAMs; caches are built from SRAMs.

12 EE30332 Ch7 DP.12 Main Memory Background. Performance of main memory: latency (the cache miss penalty) has two parts: access time, the time between the request and the word arriving, and cycle time, the time between requests; bandwidth matters for I/O and for large-block miss penalties (L2). Main memory is DRAM (except in supercomputers): dynamic, since it must be refreshed periodically (every 8 ms), with addresses divided into two halves (memory as a 2D matrix): RAS (Row Access Strobe) and CAS (Column Access Strobe). Caches use SRAM: no refresh (6 transistors/bit vs. 1 transistor/bit) and the address is not divided. Size ratio DRAM/SRAM: about 4-8. Cost and cycle-time ratio SRAM/DRAM: about 8-16.
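A sketch of the RAS/CAS idea, assuming hypothetical widths of 12 row bits and 10 column bits (the slides do not specify widths): the address goes to the DRAM in two halves, the row half with RAS and then the column half with CAS.

```python
# Hypothetical DRAM geometry: 12 row bits x 10 column bits.
ROW_BITS, COL_BITS = 12, 10

def split_dram_address(addr):
    assert addr < (1 << (ROW_BITS + COL_BITS))
    row = addr >> COL_BITS               # high half, sent with RAS
    col = addr & ((1 << COL_BITS) - 1)   # low half, sent with CAS
    return row, col

print(split_dram_address(0x2A5F3))       # -> (169, 499)
```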

13 EE30332 Ch7 DP.13 Random Access Memory (RAM) Technology. Why do computer designers need to know about RAM technology? Processor performance is usually limited by memory bandwidth, and as IC densities increase, lots of memory will fit on the processor chip, so on-chip memory can be tailored to specific needs: instruction cache, data cache, write buffer. What makes RAM different from a bunch of flip-flops? Density: RAM is much denser.

14 EE30332 Ch7 DP.14 Classical DRAM Organization (square). [Figure: a square RAM cell array; a row decoder driven by the row address selects the word (row) lines, and a column selector with I/O circuits, driven by the column address, reads the bit (data) lines.] The row and column addresses together select 1 bit at a time; each intersection represents a 1-T DRAM cell.

15 EE30332 Ch7 DP.15 Increasing Bandwidth: Interleaving. [Figure: access timelines.] Without interleaving, the CPU starts the access for D1 and must wait until D1 is available before starting the access for D2. With 4-way interleaving, accesses to memory banks 0, 1, 2, and 3 are overlapped, and bank 0 can be accessed again by the time its next turn comes around.

16 EE30332 Ch7 DP.16 Main Memory Performance. Timing model: 1 cycle to send the address, 6 cycles of access time, 1 cycle to send a word of data; a cache block is 4 words. Simple miss penalty = 4 x (1 + 6 + 1) = 32 cycles. Wide miss penalty = 1 + 6 + 1 = 8 cycles. Interleaved miss penalty = 1 + 6 + 4 x 1 = 11 cycles.
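The three miss penalties above, recomputed in a short sketch that follows the slide's timing model directly:

```python
# Timing model from the slide: 1 cycle to send the address,
# 6 cycles of access time, 1 cycle to transfer each word.
ADDR, ACCESS, XFER = 1, 6, 1
BLOCK_WORDS = 4

simple      = BLOCK_WORDS * (ADDR + ACCESS + XFER)   # one word at a time
wide        = ADDR + ACCESS + XFER                   # whole block at once
interleaved = ADDR + ACCESS + BLOCK_WORDS * XFER     # overlapped bank accesses

print(simple, wide, interleaved)   # 32 8 11
```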

17 EE30332 Ch7 DP.17 Independent Memory Banks. How many banks? For sequential accesses, the number of banks should be >= the number of clocks it takes to access a word in a bank; otherwise the CPU returns to the original bank before it has the next word ready. Increasing DRAM capacity => fewer chips per system => harder to have enough banks. DRAM bits/chip grow at 50%-60%/yr; per Nathan Myhrvold of Microsoft, mature software growth (33%/yr for NT) roughly matches the growth in MB/$ of DRAM (25%-30%/yr).

18 EE30332 Ch7 DP.18 Fewer DRAMs/System over Time (from Pete MacWilliams, Intel). DRAMs per system, by minimum PC memory size and DRAM generation:

                  '86    '89    '92    '96    '99    '02
                  1 Mb   4 Mb   16 Mb  64 Mb  256 Mb 1 Gb
  4 MB            32     8
  8 MB                   16     4
  16 MB                         8      2
  32 MB                                4      1
  64 MB                                8      2
  128 MB                                      4      1
  256 MB                                      8      2

Memory per system grows at 25%-30%/year while memory per DRAM grows at 60%/year, so the DRAM count per system keeps falling.

19 EE30332 Ch7 DP.19 Today's Situation: DRAM. A commodity, second-source industry: high volume, low profit, conservative, with little organizational innovation (vs. processors) in 20 years: page mode, EDO, synchronous DRAM. The DRAM industry is at a crossroads: fewer DRAMs per computer over time, since bits/chip grow at 50%-60%/yr while, per Nathan Myhrvold of Microsoft, mature software growth (33%/yr for NT) roughly matches the growth in MB/$ of DRAM (25%-30%/yr). DRAM is often chosen as the first major product for a new semiconductor technology, because the cells are simple, the arrangement is regular, and the large volume gives an earlier return on investment.

20 EE30332 Ch7 DP.20 Example: 1 KB Direct Mapped Cache with 32 B Blocks. For a 2^N byte cache, the uppermost (32 - N) bits are always the Cache Tag and the lowest M bits are the Byte Select (Block Size = 2^M). Here N = 10 and M = 5, so an address splits into a 22-bit Cache Tag (bits 31-10), a 5-bit Cache Index (bits 9-5), and a 5-bit Byte Select (bits 4-0). [Figure: the cache holds 32 lines, each with a Valid Bit, a Cache Tag stored as part of the cache "state", and 32 bytes of Cache Data (Byte 0-Byte 31 at index 0, Byte 32-Byte 63 at index 1, ..., Byte 992-Byte 1023 at index 31). Example address fields: Cache Tag 0x50, Cache Index 0x01, Byte Select 0x00.]
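As a check on the field widths above, a small Python sketch (the helper name decompose is ours, not from the slides) that splits a 32-bit address into tag, index, and byte select for this 1 KB, 32 B-block cache:

```python
BLOCK_BITS = 5   # 32-byte blocks -> 5 byte-select bits
INDEX_BITS = 5   # 1 KB / 32 B = 32 lines -> 5 index bits

def decompose(addr):
    byte_select = addr & ((1 << BLOCK_BITS) - 1)
    index = (addr >> BLOCK_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (BLOCK_BITS + INDEX_BITS)      # the remaining 22 bits
    return tag, index, byte_select

# Rebuild the slide's example: tag 0x50, index 0x01, byte select 0x00.
addr = (0x50 << 10) | (0x01 << 5) | 0x00
print([hex(f) for f in decompose(addr)])         # ['0x50', '0x1', '0x0']
```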

21 EE30332 Ch7 DP.21 Block Size Tradeoff. In general, a larger block size takes advantage of spatial locality, BUT: a larger block size means a larger miss penalty (it takes longer to fill the block), and if the block size is too big relative to the cache size, the miss rate goes up (too few cache blocks). In general, Average Access Time = Hit Time x (1 - Miss Rate) + Miss Penalty x Miss Rate. [Figure: as block size grows, the miss penalty rises steadily; the miss rate first falls (exploiting spatial locality), then rises again (fewer blocks compromise temporal locality); the average access time therefore has a minimum before the increased miss penalty and miss rate take over.]
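A minimal sketch of the slide's average-access-time formula; the example numbers are hypothetical, chosen only to show how a larger block can lower the miss rate yet still raise the average access time once the miss penalty grows:

```python
def avg_access_time(hit_time, miss_rate, miss_penalty):
    # The slide's form: hits cost hit_time, misses cost miss_penalty.
    return hit_time * (1 - miss_rate) + miss_penalty * miss_rate

print(avg_access_time(1, 0.05, 20))   # smaller blocks: 1.95 cycles
print(avg_access_time(1, 0.03, 40))   # larger blocks:  2.17 cycles (worse)
```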

22 EE30332 Ch7 DP.22 Extreme Example: a single big line. Cache size = 4 bytes, block size = 4 bytes: only ONE entry in the cache. If an item is accessed, it is likely to be accessed again soon, but it is unlikely to be accessed again immediately! The next access will likely be a miss again: the cache continually loads data but discards (forces out) it before it is used again. The worst nightmare of a cache designer: the Ping Pong Effect. Conflict Misses are misses caused by different memory locations mapped to the same cache index. Solution 1: make the cache size bigger. Solution 2: multiple entries for the same cache index. [Figure: a one-entry cache holding a valid bit, a cache tag, and bytes 0-3.]

23 EE30332 Ch7 DP.23 Another Extreme Example: Fully Associative. A fully associative cache forgets about the Cache Index and compares the Cache Tags of all entries in parallel. By definition, there are no Conflict Misses for a fully associative cache. [Figure: each entry holds a valid bit, a 27-bit Cache Tag, and 32 bytes of data; the address supplies only a tag and a Byte Select (e.g., 0x01), and the tag is compared against every entry at once.]

24 EE30332 Ch7 DP.24 A Two-way Set Associative Cache. N-way set associative: N entries for each cache index, i.e., N direct mapped caches operating in parallel. Example, a two-way set associative cache: the Cache Index selects a "set" from the cache, the two tags in the set are compared in parallel, and the data is selected based on the tag comparison. [Figure: two banks of valid/tag/data entries addressed by the same Cache Index; the address tag is compared against both banks, the compare results are ORed into Hit, and a mux (Sel1/Sel0) selects the matching Cache Block.]
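For illustration only, a software model (not the slide's hardware) of a two-way set-associative lookup: the index picks a set and both tags in the set are checked; in hardware the comparisons run in parallel, here they are a loop:

```python
BLOCK_BITS, INDEX_BITS, WAYS = 5, 5, 2
sets = [[{"valid": False, "tag": None, "data": None} for _ in range(WAYS)]
        for _ in range(1 << INDEX_BITS)]

def lookup(addr):
    index = (addr >> BLOCK_BITS) & ((1 << INDEX_BITS) - 1)   # selects a set
    tag = addr >> (BLOCK_BITS + INDEX_BITS)
    for way in sets[index]:              # hardware compares these in parallel
        if way["valid"] and way["tag"] == tag:
            return way["data"]           # hit: the mux selects this way
    return None                          # miss

print(lookup(0x1420))                    # miss on a cold cache -> None
```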

25 EE30332 Ch7 DP.25 Disadvantage of Set Associative Cache. N-way set associative versus direct mapped: N comparators vs. 1, an extra MUX delay for the data, and the data arrives AFTER the hit/miss decision and set selection. In a direct mapped cache, the cache block is available BEFORE the hit/miss decision, so it is possible to assume a hit and continue, recovering later on a miss.

26 EE30332 Ch7 DP.26 A Summary of Sources of Cache Misses. Compulsory (cold start, process migration, first reference): the first access to a block; a "cold" fact of life, there is not a whole lot you can do about it. (Note: if you are going to run "billions" of instructions, compulsory misses are insignificant.) Conflict (collision): multiple memory locations mapped to the same cache location; solution 1: increase the cache size; solution 2: increase associativity. Capacity: the cache cannot contain all the blocks accessed by the program; solution: increase the cache size. Invalidation: another process (e.g., I/O) updates memory.

27 EE30332 Ch7 DP.27 Sources of Cache Misses Quiz. Fill in each cell; choices: Zero, Low, Medium, High, Same (for cache size: Small, Medium, Big).

                     Direct Mapped   N-way Set Associative   Fully Associative
Cache Size:          ?               ?                       ?
Compulsory Miss:     ?               ?                       ?
Conflict Miss:       ?               ?                       ?
Capacity Miss:       ?               ?                       ?
Invalidation Miss:   ?               ?                       ?

28 EE30332 Ch7 DP.28 Sources of Cache Misses: Answer

                     Direct Mapped   N-way Set Associative   Fully Associative
Cache Size:          Big             Medium                  Small
Compulsory Miss:     Same            Same                    Same
Conflict Miss:       High            Medium                  Zero
Capacity Miss:       Low             Medium                  High
Invalidation Miss:   Same            Same                    Same

Note: if you are going to run "billions" of instructions, compulsory misses are insignificant.

29 EE30332 Ch7 DP.29 Impact on Cycle Time. [Figure: pipeline (PC, I-Cache, registers, execute, D-Cache, writeback) with miss/invalid signals; a direct mapped cache allows the miss signal to arrive after the data.] Cache hit time is directly tied to the clock rate; it increases with cache size and with associativity. Average Memory Access Time = Hit Time + Miss Rate x Miss Penalty. Execution Time = IC x CT x (ideal CPI + memory stalls per instruction).
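A small sketch of the slide's execution-time equation; the instruction count and cycle time are hypothetical, while the CPI and stall figures reuse the example from slide 4:

```python
def exec_time(instr_count, cycle_time_s, ideal_cpi, stalls_per_instr):
    # Time = IC x CT x (ideal CPI + memory stalls per instruction)
    return instr_count * cycle_time_s * (ideal_cpi + stalls_per_instr)

# 1M instructions at 5 ns/cycle, ideal CPI 1.1, 1.5 stall cycles/instr.
print(f"{exec_time(1_000_000, 5e-9, 1.1, 1.5):.4f} s")   # 0.0130 s
```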

30 EE30332 Ch7 DP.30 Improving Cache Performance: 3 general options. 1. Reduce the miss rate: larger cache, higher associativity. 2. Reduce the miss penalty: faster memory system. 3. Reduce the time to hit in the cache: faster parts for the cache.

31 EE30332 Ch7 DP.31 Basic Cache Types (for writes). Write through: the information is written to both the block in the cache and the block in the lower-level memory. Write back: the information is written only to the block in the cache; the modified cache block is written to main memory only when it is replaced (is the block clean or dirty?). Pros and cons of each? WT: read misses cannot result in writes. WB: no writes of repeated writes. WT is always combined with write buffers so that the processor does not wait for the lower-level memory.
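A toy sketch of the two policies, with dicts standing in for the cache and the lower-level memory (a one-entry-per-address model with no real replacement logic, purely illustrative):

```python
cache, memory, dirty = {}, {}, set()

def write_through(addr, value):
    cache[addr] = value
    memory[addr] = value     # every write also goes to the lower level

def write_back(addr, value):
    cache[addr] = value
    dirty.add(addr)          # lower level updated only on replacement

def replace(addr):
    if addr in dirty:        # "is the block clean or dirty?"
        memory[addr] = cache[addr]   # write the modified block back
        dirty.discard(addr)
    cache.pop(addr, None)
```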

32 EE30332 Ch7 DP.32 Write Buffer for Write Through. [Figure: Processor -> Cache, with a Write Buffer between the cache and DRAM.] A write buffer is needed between the cache and memory: the processor writes data into the cache and the write buffer, and the memory controller writes the contents of the buffer to memory. The write buffer can be a FIFO or a small associative cache, typically with 4 entries. It works fine as long as the store frequency (w.r.t. time) << 1 / DRAM write cycle. The memory system designer's nightmare: the store frequency approaches 1 / DRAM write cycle and the write buffer saturates.

33 EE30332 Ch7 DP.33 Write Buffer Saturation. Solutions for write buffer saturation: use a write back cache, or install a second-level (L2) cache between the write buffer and DRAM. [Figure: Processor -> Cache -> Write Buffer -> DRAM, and the same path with an L2 Cache inserted before DRAM.]

34 EE30332 Ch7 DP.34 Write-miss Policy: Write Allocate versus Not Allocate. Assume a 16-bit write to memory location 0x0 causes a miss. Do we read in the block? Yes: Write Allocate. No: Write Not Allocate. [Figure: the 1 KB direct mapped cache from slide 20, with Cache Tag 0x00, Cache Index 0x00, and Byte Select 0x00.]

35 EE30332 Ch7 DP.35 Recall: Levels of the Memory Hierarchy

Level          Capacity     Access Time   Cost
Registers      100s bytes   < 10s ns
Cache          K bytes      10-100 ns     $.01-.001/bit
Main Memory    M bytes      100 ns-1 us   $.01-.001
Disk           G bytes      ms            10^-3 - 10^-4 cents
Tape           infinite     sec-min       10^-6

Staging transfer units between levels: instructions/operands (managed by the program/compiler, 1-8 bytes), blocks (cache controller, 8-128 bytes), pages (OS, 512-4K bytes), files (user/operator, Mbytes). Going toward the lower levels capacity gets larger; going toward the upper levels access gets faster.

36 EE30332 Ch7 DP.36 Basic Issues in Virtual Memory System Design. What is the size of the information blocks transferred from secondary to main storage (M)? When a block is brought into M and M is full, some region of M must be released to make room for the new block --> replacement policy. Which region of M is to hold the new block --> placement policy. A missing item is fetched from secondary memory only on the occurrence of a fault --> demand load policy. Paging organization: the virtual and physical address spaces are partitioned into blocks of equal size, called pages in virtual memory and page frames in physical memory. [Figure: registers, cache, memory, and disk; pages move between memory frames and disk.]
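A small sketch of the paging split, under an assumed 4 KB page size (the slides do not fix one): a virtual address divides into a virtual page number, which the OS maps to a page frame, and an offset that translation leaves unchanged.

```python
PAGE_BITS = 12   # assumed 4 KB pages

def split_vaddr(vaddr):
    page = vaddr >> PAGE_BITS                # virtual page number
    offset = vaddr & ((1 << PAGE_BITS) - 1)  # unchanged by translation
    return page, offset

page, offset = split_vaddr(0x004012A8)
print(hex(page), hex(offset))                # 0x4012 0x2a8
```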

