
1 M E M O R Y

2 Computer Performance It depends in large measure on the interface between the processor and memory. CPI (or IPC) is affected. CPI = Cycles per instruction. IPC = Instructions per cycle.

3 Program locality Temporal locality (data) –recently referenced information is likely to be referenced again. Spatial locality –addresses near a referenced address are likely to be referenced. Sequential locality (instructions; a special case of spatial locality) –the next address is likely to be used.

4 Caches Instruction cache Data cache Unified cache Split cache: I-cache & D-cache

5 Tradeoffs Usually the cache (L1 and L2) is within the CPU chip, so chip area is a concern. Design alternatives: –Large I-cache –Equal area: I- and D-cache –Large unified cache –Multi-level split caches

6 Some terms Read: any read access (by the processor) to the cache –an instruction fetch or a data read. Write: any write access (by the processor) into the cache. Fetch: a read to main memory to load a block. Access: any reference to memory.

7 Miss Ratio Miss ratio = (number of references to the cache not found in the cache) / (total references to the cache). For an I-cache: (number of I-cache references not found in the cache) / (instructions executed).

8 Placement Problem (Diagram: blocks of main memory must be placed into the smaller cache memory.)

9 Placement Policies WHERE to put a block in the cache: the mapping between main and cache memories. Main memory has a much larger capacity than cache memory.

10 Fully Associative Cache (Diagram: memory blocks 0–31 and cache blocks 0–7.) A block can be placed in any location in the cache.

11 Direct Mapped Cache (Diagram: memory blocks 0–31 and cache blocks 0–7.) Cache block = (Block address) MOD (Number of blocks in cache); e.g., 12 MOD 8 = 4. A block can be placed ONLY in a single location in the cache.

12 Set Associative Cache (Diagram: memory blocks 0–31 and cache sets 0–3, two blocks per set.) Set = (Block address) MOD (Number of sets in cache); e.g., 12 MOD 4 = 0. A block can be placed in one of n locations in an n-way set associative cache.
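The three placement policies above can be sketched as a single mapping function (a minimal Python illustration; the function name and parameters are invented here, not from the slides):

```python
def candidate_slots(block_addr, num_blocks, num_sets):
    """Return the cache slots where a memory block may be placed.

    num_sets == num_blocks -> direct mapped (one candidate slot)
    num_sets == 1          -> fully associative (any slot)
    otherwise              -> n-way set associative, n = num_blocks // num_sets
    """
    ways = num_blocks // num_sets
    set_no = block_addr % num_sets
    # The slots of a set are the consecutive ways within that set.
    return [set_no * ways + w for w in range(ways)]

# Direct mapped, 8 blocks: block 12 -> slot 12 MOD 8 = 4
print(candidate_slots(12, 8, 8))   # [4]
# 2-way set associative, 4 sets: block 12 -> set 12 MOD 4 = 0
print(candidate_slots(12, 8, 4))   # [0, 1]
# Fully associative: any of the 8 slots
print(candidate_slots(12, 8, 1))   # [0, 1, 2, 3, 4, 5, 6, 7]
```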

13 (Diagram: the CPU issues a main memory address; name translation converts it to a cache address.)

14 Cache organization (Diagram: the CPU address is split into Tag | Index | Block offset. The index selects a cache entry holding a valid bit, a tag, and data; a comparator ("=") checks the stored tag against the address tag, and a MUX selects the addressed data.)
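The Tag | Index | Block-offset split can be sketched as follows (an illustrative Python sketch; the function name and the 16-byte-block/8-set parameters are assumptions for the example, with both assumed to be powers of two):

```python
def split_address(addr, block_size, num_sets):
    """Split a byte address into (tag, index, block offset).

    The low log2(block_size) bits are the offset, the next
    log2(num_sets) bits are the index, and the remaining
    high-order bits are the tag.
    """
    offset = addr % block_size
    index = (addr // block_size) % num_sets
    tag = addr // (block_size * num_sets)
    return tag, index, offset

# 16-byte blocks, 8 sets: address 0x1A3 (= 419)
print(split_address(0x1A3, 16, 8))  # (3, 2, 3)
```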

15 TAG Contains the "sector name" of the block: the high-order bits of its main memory address. (Diagram: a cache block pairs the tag with the data.)

16 Valid bit(s) Indicates whether the block contains valid data. –At initial program load into main memory, the cache holds no valid data. –Cache coherency in multiprocessors: data in main memory could have been changed by another processor (or process).

17 Dirty bit(s) Indicates whether the block has been written to. –Not needed in I-caches. –Not needed in a write-through D-cache. –A write-back D-cache needs it.

18 Write back (Diagram: the CPU writes into the cache; a dirty block (D) is copied to main memory only when it is replaced.)

19 Write through (Diagram: the CPU writes into both the cache and main memory.)
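The role of the dirty bit is easiest to see on eviction: under write-back, main memory stays stale until a modified block is replaced. A minimal sketch (the class and its interface are invented here for illustration):

```python
class WriteBackBlock:
    """One cache block under a write-back policy (illustrative sketch)."""

    def __init__(self, data):
        self.data = data
        self.dirty = False          # dirty bit: modified since load?

    def write(self, value):
        self.data = value
        self.dirty = True           # write-back: only the cache is updated

    def evict(self, main_memory, addr):
        if self.dirty:              # memory is updated only on eviction
            main_memory[addr] = self.data
        self.dirty = False

memory = {0: 10}
blk = WriteBackBlock(memory[0])
blk.write(99)
print(memory[0])    # still 10: main memory is stale until eviction
blk.evict(memory, 0)
print(memory[0])    # 99 after the write-back
```

Under write-through, the `write` method would update `main_memory` immediately and no dirty bit would be needed.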

20 Cache Performance Average access time (in cycles): T_ea = Hit time + Miss rate × Miss penalty

21 Associativity (D-cache misses per 1000 instructions)
Size (KB)    2-way: LRU / Rdm / FIFO    4-way: LRU / Rdm / FIFO    8-way: LRU / Rdm / FIFO
16           114.1 / 117.3 / 115.5      111.7 / 115.1 / 113.3      109.0 / 111.8 / 110.4
64           103.4 / 104.9 / 103.9      102.4 / 102.3 / 103.1       99.7 / 100.5 / 100.3
256           92.2 /  92.1 /  92.5       92.1 /  92.1 /  92.5       92.1 /  92.1 /  92.5

22 (Diagram: CPU – cache memory – main memory hierarchy.)

23 Replacement Policies FIFO (First-In First-Out) Random Least-recently used (LRU) –Replace the block that has been unused for the longest time.
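The LRU policy can be sketched with an ordered map (an illustrative sketch; the function name and the miss-count interface are assumptions, with blocks identified by their addresses):

```python
from collections import OrderedDict

def simulate_lru(block_refs, capacity):
    """Count misses for a fully associative cache with LRU replacement."""
    cache = OrderedDict()
    misses = 0
    for block in block_refs:
        if block in cache:
            cache.move_to_end(block)        # mark as most recently used
        else:
            misses += 1
            if len(cache) == capacity:
                cache.popitem(last=False)   # evict least recently used
            cache[block] = True
    return misses

# Blocks A B C A B D with capacity 3: D evicts C, the LRU block
print(simulate_lru(["A", "B", "C", "A", "B", "D"], 3))  # 4 misses
```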

24 Reducing Cache Miss Penalties Multilevel cache Critical word first Priority to Read over Write Victim cache

25 Multilevel caches (Diagram: CPU – L1 – L2 – L3 – main memory.)

26 Multilevel caches Average access time = Hit time_L1 + Miss rate_L1 × Miss penalty_L1, where Miss penalty_L1 = Hit time_L2 + Miss rate_L2 × Miss penalty_L2
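A sketch of the two-level formula (the numeric values below are hypothetical, not from the slides):

```python
def two_level_amat(hit_l1, miss_rate_l1, hit_l2, miss_rate_l2, penalty_l2):
    """Average memory access time for two cache levels (times in cycles).

    The L1 miss penalty is itself the average L2 access time.
    """
    miss_penalty_l1 = hit_l2 + miss_rate_l2 * penalty_l2
    return hit_l1 + miss_rate_l1 * miss_penalty_l1

# Hypothetical numbers: L1 hits in 1 cc with 4% misses; L2 hits in 10 cc
# with a 20% local miss rate and a 100 cc penalty to main memory.
print(two_level_amat(1, 0.04, 10, 0.20, 100))  # 1 + 0.04 * 30 = 2.2
```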

27 Critical word first The missed word is requested first from the next level, so the CPU can resume as soon as it arrives. Priority to read over write misses: writes are not as critical as reads, since the CPU cannot continue if a data item or instruction has not been read.

28 Victim cache A small fully associative cache holding blocks that have been discarded ("victims") from the main cache. A four-entry victim cache removed 20–90% of the conflict misses of a 4 KB direct-mapped cache.


30 Pseudo Associative Cache Hit time is that of a direct-mapped cache. On a miss: –check one other entry in the cache –(obtained by inverting the most significant index bit). If the block is found there, there is no long wait.

31 Reducing Cache Miss Rate Larger block size Larger caches Higher Associativity Pseudo-associative cache Prefetching

32 Larger Block Size Larger blocks take advantage of spatial locality. On the other hand, fewer blocks fit in the cache.


34 Miss rate versus block size
Block size   Cache size: 4K / 16K / 64K / 256K
16           8.57% / 3.94% / 2.04% / 1.09%
32           7.24% / 2.87% / 1.35% / 0.70%
64           7.00% / 2.64% / 1.06% / 0.51%
128          7.78% / 2.77% / 1.02% / 0.49%
256          9.51% / 3.29% / 1.15% / 0.49%

35 Example Memory takes 80 clock cycles of overhead and then delivers 16 bytes every 2 clock cycles (cc). The memory system thus supplies 16 bytes in 82 cc, 32 bytes in 84 cc, and so on (1 cycle for the miss + 1 cycle for the read).

36 Example (cont'd) Average access time = Hit time_L1 + Miss rate_L1 × Miss penalty_L1 For a 16-byte block in a 4 KB cache (miss rate 8.57%, miss penalty 82): Ave. access time = 1 + (8.57% × 82) = 8.0274

37 Example (cont'd) Average access time = Hit time_L1 + Miss rate_L1 × Miss penalty_L1 For a 64-byte block in a 256 KB cache (miss rate 0.51%, miss penalty 80 + 4 × 2 = 88): Ave. access time = 1 + (0.51% × 88) = 1.449
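Both worked examples can be checked with a one-line function (a sketch; the function name is invented here):

```python
def avg_access_time(hit_time, miss_rate, miss_penalty):
    """Average access time = hit time + miss rate * miss penalty (cycles)."""
    return hit_time + miss_rate * miss_penalty

# 16-byte block, 4 KB cache: 8.57% miss rate, 80 + 2 = 82 cc penalty
print(round(avg_access_time(1, 0.0857, 82), 4))   # 8.0274
# 64-byte block, 256 KB cache: 0.51% miss rate, 80 + 4*2 = 88 cc penalty
print(round(avg_access_time(1, 0.0051, 88), 4))   # 1.4488 (~1.449)
```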

38 Ave. mem. access time versus block size
Block size   Miss penalty   Cache size: 4K / 16K / 64K / 256K
16           82             8.027 / 4.231 / 2.673 / 1.894
32           84             7.082 / 3.411 / 2.134 / 1.588
64           88             7.160 / 3.323 / 1.933 / 1.449
128          96             8.469 / 3.659 / 1.979 / 1.470
256          112            11.651 / 4.685 / 2.288 / 1.549

39 Two-way cache (Alpha)


41 Higher Associativity Higher associativity is NOT free: a slower clock may be required. Thus, there is a trade-off between associativity (higher hit rate) and a faster clock (direct mapped).

42 Associativity example If clock cycle time (cct) increases as follows: –cct_2-way = 1.10 × cct_1-way –cct_4-way = 1.12 × cct_1-way –cct_8-way = 1.14 × cct_1-way Miss penalty is 50 cycles; hit time is 1 cycle.
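Combining the stretched clock with the 50-cycle penalty gives, for example (the miss rates below are hypothetical placeholders, since the slide's result figure is not reproduced in the transcript):

```python
def amat_assoc(cct_factor, miss_rate, miss_penalty=50):
    """Average access time in 1-way clock cycles: the 1-cycle hit is
    stretched by cct_factor; the miss penalty stays at 50 cycles."""
    return 1.0 * cct_factor + miss_rate * miss_penalty

# Hypothetical miss rates for a small cache (NOT from the slide):
configs = {"1-way": (1.00, 0.100), "2-way": (1.10, 0.085),
           "4-way": (1.12, 0.080), "8-way": (1.14, 0.078)}
for name, (cct, mr) in configs.items():
    print(name, amat_assoc(cct, mr))
```

With these placeholder rates the higher associativities still win; for large caches with already-low miss rates, the slower clock can make direct mapping the better choice.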


44 Pseudo Associative Cache Hit time is that of a direct-mapped cache. On a miss: –check one other entry in the cache –(obtained by inverting the most significant index bit). If the block is found there, there is no long wait.

45 Pseudo Associative time (Diagram: the access time is the hit time on a direct hit, the longer pseudo hit time on a pseudo hit, and the full miss penalty otherwise.)

46 Hardware prefetching What about fetching two blocks on a miss (as in the Alpha AXP 21064)? Problem: real misses may be delayed due to prefetching.

47 Fetch Policies Memory references: –~90% reads –~10% writes Thus, reads have higher priority.

48 Fetch Fetch on miss –Demand fetching Prefetching –Instructions (I-Cache)

49 Hardware prefetching What about fetching two blocks on a miss (as in the Alpha AXP 21064)? Problem: real misses may be delayed due to prefetching.

50 Compiler Controlled Prefetching Register prefetching –load values into registers before they are needed. Cache prefetching –loads data into the cache only. This is only possible if we have nonblocking (lockup-free) caches.

51 Order Main Memory to Cache The needed AU (addressable unit) is in the middle of the block. Block load –the complete block is loaded. Load forward –the needed AU is forwarded first; AUs behind the miss are not loaded. Fetch bypass –starts with the needed AU; the AUs behind it are loaded later.
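The three orderings can be sketched as follows (an illustrative Python sketch; the policy names come from the slide, while the function interface is invented):

```python
def fetch_order(block, miss_index, policy):
    """Order in which a block's addressable units (AUs) arrive in cache.

    block: list of AUs in address order; miss_index: the AU that missed.
    'block_load'   -> whole block in address order
    'load_forward' -> from the missed AU to the end only
    'fetch_bypass' -> missed AU first, earlier AUs wrap around later
    """
    if policy == "block_load":
        return block[:]
    if policy == "load_forward":
        return block[miss_index:]
    if policy == "fetch_bypass":
        return block[miss_index:] + block[:miss_index]
    raise ValueError(policy)

aus = [0, 1, 2, 3]          # a 4-AU block; the miss is on AU 2
print(fetch_order(aus, 2, "block_load"))    # [0, 1, 2, 3]
print(fetch_order(aus, 2, "load_forward"))  # [2, 3]
print(fetch_order(aus, 2, "fetch_bypass"))  # [2, 3, 0, 1]
```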

52 Compiler optimization: Merging Arrays The goal is to improve locality: rather than having two independent arrays, we could have only one.

53 Compiler optimization: Loop interchange Nested loops with nonsequential memory access. (Diagram: a 3×3 array A11…A33.)

54 (Diagram: the 3×3 array A11…A33 laid out row by row in linear memory.)
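In a row-major layout (as in C), interchanging the loops turns strided access into sequential access. A sketch of the flat memory indices each loop order touches (the function name is invented here):

```python
def visit_order(rows, cols, interchanged=False):
    """Flat (row-major) memory indices touched by a doubly nested loop."""
    order = []
    if not interchanged:
        for i in range(rows):           # row index outer: sequential access
            for j in range(cols):
                order.append(i * cols + j)
    else:
        for j in range(cols):           # column index outer: strided access
            for i in range(rows):
                order.append(i * cols + j)
    return order

# 3x3 array stored row by row
print(visit_order(3, 3))                     # [0, 1, 2, 3, 4, 5, 6, 7, 8]
print(visit_order(3, 3, interchanged=True))  # [0, 3, 6, 1, 4, 7, 2, 5, 8]
```

The first order walks memory sequentially (good spatial locality); the second jumps by a whole row on every step, touching a different cache block each time.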

55 Priority to Read over Write If two (or more) misses occur, give priority to the read. Example: M[512] ← R3, then R1 ← M[1024], then R1 ← M[512]. The write sits in a write buffer; the later read of M[512] must see the buffered value.

56 Non-blocking caches On a miss, the cache should not block other accesses. This is important for an out-of-order execution machine (the Tomasulo approach).

57 Access time Mem_access_time = Hit_time + Miss_rate × Miss_penalty Mem_access_time_L1 = Hit_time_L1 + Miss_rate_L1 × Miss_penalty_L1, where Miss_penalty_L1 = Hit_time_L2 + Miss_rate_L2 × Miss_penalty_L2

58 Reducing Hit Time Critical for increasing the processor clock frequency. "Smaller" hardware is faster –fewer sequential operations. Direct mapping is simpler. Small & simple caches.

59 Avoid address translation Virtual address → Physical address.

60 Example
Cache Size   I-cache   D-cache   Unified Cache
8 KB         0.82%     12.22%    4.63%
16 KB        0.38%     11.36%    3.75%
32 KB        0.16%     11.73%    3.18%
64 KB        0.065%    10.25%    2.89%
128 KB       0.03%     9.81%     2.66%
256 KB       0.002%    9.06%     2.42%

