EEM 486: Computer Architecture. Lecture 6: Memory Systems and Caches.

1 EEM 486: Computer Architecture. Lecture 6: Memory Systems and Caches

2 Lec 6.2 The Big Picture: Where Are We Now? The Five Classic Components of a Computer: Control, Datapath, Memory, Input, Output

3 Lec 6.3 The Art of Memory System Design. The processor issues a reference stream of operations (i-fetch, read, write) to a cache (SRAM) backed by main memory (DRAM). Goal: optimize the memory system organization to minimize the average memory access time for typical workloads (benchmark programs).

4 Lec 6.4 Technology Trends: DRAM

Year   Size     Cycle Time
1980   64 Kb    250 ns
1983   256 Kb   220 ns
1986   1 Mb     190 ns
1989   4 Mb     165 ns
1992   16 Mb    145 ns
1995   64 Mb    120 ns

Capacity grew 1000:1 over this period, while cycle time improved only about 2:1!

5 Lec 6.5 Processor-DRAM Memory Gap

6 Lec 6.6 The Goal: the illusion of large, fast, cheap memory  Facts: Large memories are slow but cheap (DRAM); fast memories are small but expensive (SRAM)  How do we create a memory that is large, fast, and cheap? Memory hierarchy Parallelism

7 Lec 6.7 The Principle of Locality The principle of locality: Programs access a relatively small portion of their address space at any instant of time  Temporal Locality (Locality in Time) => If an item is referenced, it will tend to be referenced again soon => Keep most recently accessed data items closer to the processor  Spatial Locality (Locality in Space) => If an item is referenced, nearby items will tend to be referenced soon => Move blocks of contiguous words to the upper levels Q: Why does code have locality?
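The two kinds of locality can be made concrete with a small sketch. The 4-word block size and helper names below are illustrative, not from the slides: a sequential array walk reuses each block several times (spatial locality), while a block-sized stride never touches the same block twice.

```python
# Toy illustration of spatial locality with an assumed 4-word block size.
BLOCK_WORDS = 4  # words per block (illustrative)

def block_hits(addresses):
    """Count accesses whose block is the same as the previous access's block."""
    hits, last_block = 0, None
    for addr in addresses:
        block = addr // BLOCK_WORDS
        if block == last_block:
            hits += 1
        last_block = block
    return hits

sequential = list(range(16))        # a[0], a[1], a[2], ... : spatial locality
strided = list(range(0, 64, 4))     # one access per block: no block reuse

print(block_hits(sequential))  # 12 of 16 accesses reuse the current block
print(block_hits(strided))     # 0
```

This is why moving whole blocks of contiguous words to the upper level pays off: after one miss, the next few sequential references are hits.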

8 Lec 6.8 Memory Hierarchy  Based on the principle of locality  A way of providing large, cheap, and fast memory. Levels (typical speed / size): Registers: ~1 ns; On-chip cache (SRAM): ~10s of ns / KBs; Second-level cache (SRAM); Main memory (DRAM): ~100s of ns / MBs; Secondary storage (disk): ~10,000,000s of ns (10s of ms) / GBs; Tertiary storage (tape): ~10,000,000,000s of ns (10s of sec) / TBs. $ per MByte increases toward the processor.

9 Lec 6.9 Cache Memory. The cache sits between the CPU and main memory: the CPU exchanges words with the cache, and the cache exchanges blocks with memory. The cache holds C lines (0 to C-1), each with a tag and a block of K words; main memory holds 2^n words grouped into blocks of K words.

10 Lec 6.10 Elements of Cache Design  Cache size  Mapping function Direct Set Associative Fully Associative  Replacement algorithm Least recently used (LRU) First in first out (FIFO) Random  Write policy Write through Write back  Line size  Number of caches Single or two level Unified or split

11 Lec 6.11 Terminology  Hit: data appears in some block in the upper level Hit Rate: the fraction of memory accesses found in the upper level Hit Time: time to access the upper level, which consists of RAM access time + time to determine hit/miss (Figure: on a read hit, the processor's request for X1 is satisfied directly from the cache, the upper level.)

12 Lec 6.12 Terminology  Miss: data must be retrieved from a block in the lower level Miss Rate = 1 - (Hit Rate) Miss Penalty: time to replace a block in the upper level + time to deliver the block to the processor  Hit Time << Miss Penalty (Figure: on a read miss, the request for Xn goes to memory, the block is brought into the cache, and Xn is then delivered to the processor.)

13 Lec 6.13 Direct Mapped Cache. Each memory location is mapped to exactly one location in the cache: Cache block # = (Block address) modulo (# of cache blocks) = low-order log2(# of cache blocks) bits of the block address
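The mapping rule can be checked in a few lines. The 8-block cache size and the function names below are illustrative; the point is that, for a power-of-two number of blocks, the modulo and the low-order-bits formulations agree.

```python
# Direct-mapped placement: both formulations from the slide (illustrative names).
NUM_BLOCKS = 8  # must be a power of two for the bit-mask form to work

def cache_index(block_address):
    # Cache block # = (Block address) modulo (# of cache blocks)
    return block_address % NUM_BLOCKS

def cache_index_bits(block_address):
    # ... = low-order log2(# of cache blocks) bits of the block address
    return block_address & (NUM_BLOCKS - 1)

# The two definitions agree for every block address
assert all(cache_index(a) == cache_index_bits(a) for a in range(64))
print(cache_index(12))  # 12 mod 8 = 4: block 12 maps to cache block 4
```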

14 Lec 6.14 64 KByte Direct Mapped Cache. Why do we need a Tag field? Why do we need a Valid bit field? What kind of locality are we taking care of? Total number of bits in the cache = 2^n x (|valid| + |tag| + |block|), where 2^n = # of cache blocks, |valid| = 1 bit, |tag| = 32 - (n + 2) bits (32-bit byte addresses, 1-word blocks), |block| = 32 bits.
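The bit-count formula on the slide can be evaluated directly. The function name is illustrative; n = 14 corresponds to the 64 KB cache (2^14 one-word, 4-byte blocks).

```python
# Total cache storage per the slide's formula: 2^n blocks of
# (valid bit + tag bits + data bits), with 32-bit byte addresses
# and 1-word blocks.
def cache_total_bits(n):
    valid = 1
    tag = 32 - (n + 2)  # 2 address bits are the byte offset within a word
    data = 32           # one word of data per block
    return 2**n * (valid + tag + data)

# 64 KB of data = 2^14 one-word blocks, so n = 14
print(cache_total_bits(14))             # 802816 bits
print(cache_total_bits(14) / 8 / 1024)  # 98.0 KB including tags and valid bits
```

Note the overhead: a "64 KB" cache actually stores about 98 KB of bits once tags and valid bits are counted.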

15 Lec 6.15 Reading from Cache  Address the cache with the PC (instruction fetch) or the ALU output (data access)  If the cache signals hit, we have a read hit: the requested word is on the data lines  Otherwise, we have a read miss: stall the CPU, fetch the block from memory and write it into the cache, then restart the execution

16 Lec 6.16 Writing to Cache  Address the cache with the PC or the ALU output  If the cache signals hit, we have a write hit. We have two options: -write-through: write the data into both the cache and memory -write-back: write the data only into the cache, and write it to memory only when the block is replaced  Otherwise, we have a write miss: handle it as if it were a write hit

17 Lec 6.17 64 KByte Direct Mapped Cache with Multiword Blocks. Taking advantage of spatial locality: 4K entries, each holding a four-word (128-bit) block. The 32-bit address splits into a 16-bit tag (bits 31-16), a 12-bit index (bits 15-4), a 2-bit block offset (bits 3-2), and a 2-bit byte offset (bits 1-0); a multiplexor selects the requested 32-bit word from the 128-bit block.

18 Lec 6.18 Writing to Cache (write-through, with miss handling)  Address the cache with the PC or the ALU output  If the cache signals hit, we have a write hit: a write-through cache writes the data into both the cache and memory  Otherwise, we have a write miss: stall the CPU, fetch the block from memory and write it into the cache, then restart the execution and redo the write

19 Lec 6.19 Associativity in Caches  Compute the set number: (Block number) modulo (Number of sets)  Choose one of the blocks in the computed set

20 Lec 6.20 Set Associative Cache  N-way set associative: N direct-mapped caches operate in parallel N entries for each cache index N comparators and an N-to-1 mux Data comes AFTER the hit/miss decision and set selection (Figure: a four-way set associative cache with 256 sets; the address provides a 22-bit tag and an 8-bit index, the tag is compared against all four ways in parallel, and a 4-to-1 multiplexor selects the data of the hitting way.)

21 Lec 6.21 Fully Associative Cache  A block can be anywhere in the cache => no cache index  Compare the cache tags of all cache entries in parallel  Practical only for a small number of cache blocks (Figure: 32-byte blocks with a 27-bit tag and a valid bit per entry; the remaining address bits form a byte select, e.g. 0x01, and every entry's tag is compared simultaneously.)

22 Lec 6.22 Four Questions for Caches  Q1: Block placement? Where can a block be placed in the upper level?  Q2: Block identification? How is a block found if it is in the upper level?  Q3: Block replacement? Which block should be replaced on a miss?  Q4: Write strategy? What happens on a write?

23 Lec 6.23 Q1: Block Placement? Block 12 to be placed in an 8-block cache: Direct mapped: one place - (Block address) mod (# of cache blocks); block 12 can go only in block (12 mod 8) = 4 Set associative: a few places - (Block address) mod (# of cache sets), where # of cache sets = # of cache blocks / degree of associativity; with 4 sets (2-way), block 12 can go in any block of set (12 mod 4) = 0 Fully associative: any place - block 12 can go in any of the 8 blocks

24 Lec 6.24 Q2: Block Identification? The address divides into Tag | Index | Block offset: the index selects the set, the tag selects the block within the set, and the block offset selects the data within the block. Direct mapped: indexing - index, then 1 comparison N-way set associative: limited search - index the set, then N comparisons Fully associative: full search - compare against all cache entries

25 Lec 6.25 Q3: Replacement Policy on a Miss?  Easy for direct mapped (no choice)  Set associative or fully associative: Random: randomly select one of the blocks in the set LRU (Least Recently Used): select the block in the set that has been unused for the longest time

Miss rates for LRU vs. random replacement:

Size     2-way LRU  2-way Random  4-way LRU  4-way Random  8-way LRU  8-way Random
16 KB    5.2%       5.7%          4.7%       5.3%          4.4%       5.0%
64 KB    1.9%       2.0%          1.5%       1.7%          1.4%       1.5%
256 KB   1.15%      1.17%         1.13%      1.13%         1.12%      1.12%
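A minimal sketch of LRU in an N-way set-associative cache, using an ordered map to track recency. The class and parameter names are illustrative; real hardware approximates LRU with a few state bits per set rather than an ordered list.

```python
from collections import OrderedDict

# Sketch of an N-way set-associative cache with LRU replacement,
# at block-address granularity (illustrative names).
class SetAssociativeCache:
    def __init__(self, num_sets, ways):
        self.num_sets, self.ways = num_sets, ways
        # One ordered map per set: insertion/move order encodes recency
        self.sets = [OrderedDict() for _ in range(num_sets)]

    def access(self, block_address):
        index = block_address % self.num_sets
        tag = block_address // self.num_sets
        blocks = self.sets[index]
        if tag in blocks:
            blocks.move_to_end(tag)    # mark as most recently used
            return True                # hit
        if len(blocks) == self.ways:
            blocks.popitem(last=False) # evict the least recently used block
        blocks[tag] = None
        return False                   # miss

cache = SetAssociativeCache(num_sets=2, ways=2)
# Blocks 0, 2, and 4 all map to set 0; LRU keeps the two touched most recently.
print([cache.access(b) for b in [0, 2, 0, 4, 2]])
# [False, False, True, False, False]: accessing 4 evicts 2 (the LRU block), not 0
```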

26 Lec 6.26 Q4: Write Policy?  Write through - the information is written to both the block in the cache and the block in lower-level memory  Write back - the information is written only to the block in the cache; the modified cache block is written to main memory only when it is replaced (is the block clean or dirty?)  Pros and cons of each? WT: read misses cannot result in writes; WB: repeated writes to a block require only one write to memory  WT is always combined with write buffers to avoid waiting for lower-level memory

27 Lec 6.27 Cache Performance
CPU time = (CPU execution clock cycles + Memory stall clock cycles) x Cycle time
Note: memory hit time is included in the execution cycles.
Stalls due to cache misses:
Memory stall clock cycles = Read-stall clock cycles + Write-stall clock cycles
Read-stall clock cycles = Reads x Read miss rate x Read miss penalty
Write-stall clock cycles = Writes x Write miss rate x Write miss penalty
If read miss penalty = write miss penalty:
Memory stall clock cycles = Memory accesses x Miss rate x Miss penalty

28 Lec 6.28 Cache Performance
CPU time = Instruction count x CPI x Cycle time = Instruction count x Cycle time x (ideal CPI + Memory stalls/Inst + Other stalls/Inst)
Memory stalls/Inst = Instruction miss rate x Instruction miss penalty + Loads/Inst x Load miss rate x Load miss penalty + Stores/Inst x Store miss rate x Store miss penalty
Average Memory Access Time (AMAT) = Hit time + (Miss rate x Miss penalty) = (Hit rate x Hit time) + (Miss rate x Miss time)

29 Lec 6.29 Example  Suppose a processor executes at: Clock rate = 200 MHz (5 ns per cycle), Base CPI = 1.1, 50% arith/logic, 30% ld/st, 20% control  Suppose that 10% of memory operations incur a 50-cycle miss penalty  Suppose that 1% of instructions incur the same miss penalty  CPI = Base CPI + average stalls per instruction = 1.1 (cycles/ins) + [0.30 (data Mops/ins) x 0.10 (miss/data Mop) x 50 (cycles/miss)] + [1 (inst Mop/ins) x 0.01 (miss/inst Mop) x 50 (cycles/miss)] = (1.1 + 1.5 + 0.5) cycles/ins = 3.1  AMAT = (1/1.3) x [1 + 0.01 x 50] + (0.3/1.3) x [1 + 0.1 x 50] = 2.54 cycles
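The example's arithmetic can be reproduced directly; the parameter values are taken from the slide, and the variable names are illustrative.

```python
# Parameters from the slide's example
base_cpi = 1.1
loads_stores_per_inst = 0.30   # 30% of instructions are ld/st
data_miss_rate = 0.10          # 10% of memory operations miss
inst_miss_rate = 0.01          # 1% of instruction fetches miss
miss_penalty = 50              # cycles

# CPI = base CPI + data-access stalls + instruction-fetch stalls
cpi = (base_cpi
       + loads_stores_per_inst * data_miss_rate * miss_penalty
       + 1.0 * inst_miss_rate * miss_penalty)
print(round(cpi, 2))  # 1.1 + 1.5 + 0.5 = 3.1

# AMAT, averaged over all 1.3 memory accesses per instruction
# (1 instruction fetch + 0.3 data accesses)
amat = ((1 / 1.3) * (1 + inst_miss_rate * miss_penalty)
        + (0.3 / 1.3) * (1 + data_miss_rate * miss_penalty))
print(round(amat, 2))  # 2.54
```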

30 Lec 6.30 Improving Cache Performance
CPU time = IC x CT x (ideal CPI + memory stalls)
Average Memory Access Time = Hit time + (Miss rate x Miss penalty) = (Hit rate x Hit time) + (Miss rate x Miss time)
Options to reduce AMAT: 1. Reduce the miss rate, 2. Reduce the miss penalty, or 3. Reduce the time to hit in the cache

31 Lec 6.31 Reduce Misses: Larger Block Size Increasing block size also increases miss penalty !

32 Lec 6.32 Reduce Misses: Higher Associativity Increasing associativity also increases both time and hardware cost !

33 Lec 6.33 Reducing Penalty: Second-Level Cache (Processor -> L1 Cache -> L2 Cache)  L2 equations:
AMAT = Hit Time_L1 + Miss Rate_L1 x Miss Penalty_L1
Miss Penalty_L1 = Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2
AMAT = Hit Time_L1 + Miss Rate_L1 x (Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2)
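The L2 equations translate directly into a small calculator. The numeric inputs in the example call are illustrative, not from the slides.

```python
# Two-level cache AMAT, following the slide's L2 equations.
def amat_two_level(hit_time_l1, miss_rate_l1,
                   hit_time_l2, miss_rate_l2, miss_penalty_l2):
    # Miss Penalty_L1 = Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2
    miss_penalty_l1 = hit_time_l2 + miss_rate_l2 * miss_penalty_l2
    # AMAT = Hit Time_L1 + Miss Rate_L1 x Miss Penalty_L1
    return hit_time_l1 + miss_rate_l1 * miss_penalty_l1

# E.g. 1-cycle L1, 5% L1 miss rate, 10-cycle L2, 20% L2 miss rate,
# 100-cycle memory (illustrative numbers):
print(round(amat_two_level(1, 0.05, 10, 0.20, 100), 2))
# 1 + 0.05 x (10 + 0.20 x 100) = 2.5 cycles
```

Note how the L2 shields the L1: only the 5% of accesses that miss L1, and then the 20% of those that also miss L2, ever pay the full 100-cycle memory penalty.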

34 Lec 6.34 Designing the Memory System to Support Caches  Simple: CPU, cache, bus, and memory all the same width (32 bits)  Interleaved: CPU, cache, and bus 1 word wide; N memory modules  Wide: CPU/mux 1 word; mux/cache, bus, and memory N words wide
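A sketch of the miss penalty under the three organizations, using assumed timings (1 cycle to send the address, 15 cycles per DRAM access, 1 cycle to transfer each word on the bus) for a 4-word block; real systems differ, but the relative ordering is the point.

```python
# Miss penalty for a 4-word block under three memory organizations.
# Timing parameters are assumed for illustration.
SEND_ADDR, DRAM_ACCESS, BUS_WORD = 1, 15, 1  # cycles
BLOCK_WORDS = 4

# Simple: fetch one word at a time; each pays a full access + transfer
simple = SEND_ADDR + BLOCK_WORDS * (DRAM_ACCESS + BUS_WORD)

# Wide: memory and bus are a block wide; one access, one transfer
wide = SEND_ADDR + DRAM_ACCESS + BUS_WORD

# Interleaved: bank accesses overlap, but transfers stay one word wide
interleaved = SEND_ADDR + DRAM_ACCESS + BLOCK_WORDS * BUS_WORD

print(simple, wide, interleaved)  # 65 17 20
```

Interleaving gets most of the benefit of a wide memory while keeping a narrow (cheap) bus.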

35 Lec 6.35 Main Memory Performance  DRAM (read/write) cycle time >> DRAM (read/write) access time  Cycle time: how frequently can you initiate an access?  Access time: how quickly will you get what you want once you initiate an access?  This gap limits DRAM bandwidth

36 Lec 6.36 Increasing Bandwidth - Interleaving. Access pattern without interleaving: the CPU must wait a full cycle time between accesses - start the access for D1, wait until D1 is available, then start the access for D2. Access pattern with 4-way interleaving: consecutive words go to different banks - access bank 0, then banks 1, 2, and 3 on successive cycles; by the time bank 3 has been started, bank 0 can be accessed again.

37 Lec 6.37 Summary #1/2  The Principle of Locality: programs are likely to access a relatively small portion of the address space at any instant of time -Temporal Locality: locality in time -Spatial Locality: locality in space  Three (+1) Major Categories of Cache Misses: Compulsory misses: sad facts of life; example: cold-start misses Conflict misses: remedied by increasing cache size and/or associativity; nightmare scenario: the ping-pong effect! Capacity misses: remedied by increasing cache size  Cache Design Space: total size, block size, associativity, replacement policy, write-hit policy (write-through, write-back), write-miss policy

38 Lec 6.38 Summary #2/2: The Cache Design Space  Several interacting dimensions: cache size, block size, associativity, replacement policy, write-through vs. write-back, write allocation  The optimal choice is a compromise: depends on access characteristics -workload -use (I-cache, D-cache, TLB) depends on technology / cost  Simplicity often wins

