
1 Computer Organization & Design (计算机组成与设计)
Weidong Wang (王维东), wdwang@zju.edu.cn
College of Information Science & Electronic Engineering, Institute of Information and Communication Engineering (信息与通信工程研究所)
Zhejiang University

2 Course Information
Instructor: Weidong WANG
–Email: wdwang@zju.edu.cn
–Tel (O): 0571-87953170
–Office Hours: TBD, Yuquan Campus, Xindian (High-Tech) Building 306, or email whenever
TAs (mobile, email):
»Lu Huang 黄露, 13516719473 / 6719473, eliver8801@zju.edu.cn
»Hanqi Shen 沈翰祺, 15067115046, 542886864@qq.com
»Office Hours: Wednesday & Saturday 14:00-16:30, Xindian (High-Tech) Building 308 (SMS or email is also fine)

3 Lecture 10 Memory Hierarchy

4 Review: Memory Systems Outline

5 Review: DRAM Architecture
[Figure: DRAM bit array with a row address decoder driving the word lines and a column decoder with sense amplifiers on the bit lines; the N+M address bits (N row, M column) select one memory cell (one bit).]
–Bits stored in 2-dimensional arrays on chip
–Modern chips have around 4-8 logical banks on each chip; each logical bank physically implemented as many smaller arrays

6 Technology: Speed & Cost 6

7 Review: Management of Memory Hierarchy
Small/fast storage, e.g., registers
–Address usually specified in instruction
–Generally implemented directly as a register file
»but hardware might do things behind software's back, e.g., stack management, register renaming
Larger/slower storage, e.g., main memory
–Address usually computed from values in registers
–Generally implemented as a hardware-managed cache hierarchy
»hardware decides what is kept in fast memory
»but software may provide "hints", e.g., don't cache or prefetch

8 Review: Common Predictable Patterns
Two predictable properties of memory references:
–Temporal Locality: If a location is referenced, it is likely to be referenced again in the near future.
–Spatial Locality: If a location is referenced, it is likely that locations near it will be referenced in the near future.
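As a minimal C sketch of how both properties show up in ordinary code (the function and array names are illustrative, not from the slides): the accumulator sum is touched on every iteration (temporal locality), and a[i] walks through consecutive addresses, so each fetched cache block serves several later iterations (spatial locality).

    #include <stddef.h>

    /* Sum an array: `sum` exhibits temporal locality (reused every
       iteration); a[0..n-1] exhibits spatial locality (sequential
       accesses, so one cache block covers several elements). */
    long sum_array(const int *a, size_t n) {
        long sum = 0;
        for (size_t i = 0; i < n; i++)
            sum += a[i];
        return sum;
    }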

9 Review: Taking Advantage of Locality

10 Memory Hierarchy: Everything is a Cache 10

11 Review: Memory System
Dynamic RAM (DRAM) is the main form of main memory storage in use today
–Holds values on small capacitors, needs refreshing (hence dynamic)
–Slow multi-step access: precharge, read row, read column
Static RAM (SRAM) is faster but more expensive
–Used to build on-chip memory for caches
A cache holds a small set of values in fast memory (SRAM) close to the processor
–Need to develop a search scheme to find values in the cache, and a replacement policy to make space for newly accessed locations
Caches exploit two forms of predictability in memory reference streams
–Temporal locality (时间局部性): the same location is likely to be accessed again soon
–Spatial locality (空间局部性): neighboring locations are likely to be accessed soon

12 Inside a Cache
[Figure: the cache sits between the processor and main memory, exchanging addresses and data with both. Each cache line holds an address tag (e.g., 100, 304, 416, 6848) plus a data block of several data bytes; the line tagged 100 holds copies of main memory locations 100, 101, and so on.]

13 Cache Algorithm (Read)
Look at the processor address and search the cache tags to find a match. Then either:
–Found in cache, a.k.a. HIT: return a copy of the data from the cache
–Not in cache, a.k.a. MISS: read the block of data from main memory, wait, then return the data to the processor and update the cache
Q: Which line do we replace?
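As a rough illustration of the read algorithm above, here is a minimal C sketch of a lookup in a tiny direct-mapped cache; the structure, the sizes, and the memory_read_block helper are assumptions made for the example, not part of the slides.

    #include <stdbool.h>
    #include <stdint.h>

    #define NUM_LINES   64            /* illustrative number of lines   */
    #define BLOCK_BYTES 16            /* illustrative block (line) size */

    struct cache_line {
        bool     valid;
        uint32_t tag;
        uint8_t  data[BLOCK_BYTES];
    };

    static struct cache_line cache[NUM_LINES];

    /* Hypothetical helper: fetch one block from main memory. */
    extern void memory_read_block(uint32_t block_addr, uint8_t *dst);

    /* Return the byte at `addr`, going to main memory on a miss. */
    uint8_t cache_read_byte(uint32_t addr) {
        uint32_t offset = addr % BLOCK_BYTES;
        uint32_t block  = addr / BLOCK_BYTES;
        uint32_t index  = block % NUM_LINES;   /* which line to check */
        uint32_t tag    = block / NUM_LINES;   /* rest of the address */

        struct cache_line *line = &cache[index];
        if (line->valid && line->tag == tag)   /* HIT */
            return line->data[offset];

        /* MISS: fetch the whole block, update the cache, then return. */
        memory_read_block(block * BLOCK_BYTES, line->data);
        line->valid = true;
        line->tag   = tag;
        return line->data[offset];
    }

In a direct-mapped cache the question at the bottom of the slide has a forced answer (the indexed line is the only candidate); associative caches, covered later, need a real replacement policy.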

14 Memory Hierarchy Level 14

15 How Processor Handles a Miss 15

16 Building a Cache 16

17 Placement Policy
Where can memory block 12 be placed in an 8-block cache?
–Fully associative: anywhere
–(2-way) set associative (组合的): anywhere in set 0, since 12 mod 4 = 0 (4 sets)
–Direct mapped: only into block 4, since 12 mod 8 = 4
[Figure: cache block numbers 0-7 and memory block numbers 0-31, with the allowed placements of memory block 12 under each policy.]
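A quick C check of the placement arithmetic on this slide (the cache geometry is the one in the example: 8 cache blocks, 4 sets for the 2-way case):

    #include <stdio.h>

    int main(void) {
        int block        = 12;
        int cache_blocks = 8;    /* direct mapped: 8 blocks       */
        int sets         = 4;    /* 2-way set associative: 4 sets */

        printf("direct mapped   -> block %d\n", block % cache_blocks); /* 4 */
        printf("2-way set assoc -> set %d\n",   block % sets);         /* 0 */
        /* fully associative -> any of the 8 cache blocks */
        return 0;
    }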

18 Review: Direct-Mapped Cache

19 Review: Tag and Valid Bits

20 Cache Organization 20

21 Review: Larger Block Size

22 Direct-Mapped Cache
[Figure: the address is split into Tag (t bits), Index (k bits), and Block Offset (b bits). The index selects one of 2^k lines, each holding a valid bit, a tag, and a data block; the stored tag is compared against the address tag, a match signals HIT, and the block offset selects the data word or byte.]
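A minimal C sketch of the t/k/b address split in the figure; the field widths below are illustrative (any choice with t + k + b = 32 works the same way).

    #include <stdint.h>
    #include <stdio.h>

    #define B_BITS 4                       /* block offset: 16-byte blocks */
    #define K_BITS 10                      /* index: 2^10 = 1024 lines     */
    #define T_BITS (32 - K_BITS - B_BITS)  /* tag: remaining high bits     */

    int main(void) {
        uint32_t addr   = 0x12345678u;
        uint32_t offset =  addr & ((1u << B_BITS) - 1);
        uint32_t index  = (addr >> B_BITS) & ((1u << K_BITS) - 1);
        uint32_t tag    =  addr >> (B_BITS + K_BITS);

        printf("tag=0x%x index=0x%x offset=0x%x\n", tag, index, offset);
        return 0;
    }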

23 Direct Map Address Selection: higher-order vs. lower-order address bits
[Figure: the same direct-mapped cache datapath as the previous slide, drawn with an alternative split of the address into tag and index fields, contrasting indexing with higher-order versus lower-order address bits.]

24 Cache Block 24

25 Block Size and Spatial Locality
[Figure: a 4-word block (Word0-Word3) with b = 2; the CPU address is split into a block address (32-b bits) and a b-bit block offset, where 2^b = block size, a.k.a. line size (in bytes).]
A block is the unit of transfer between the cache and memory.
Larger block size has distinct hardware advantages:
–less tag overhead
–exploit fast burst transfers from DRAM
–exploit fast burst transfers over wide busses
What are the disadvantages of increasing block size?
–Fewer blocks => more conflicts. Can waste bandwidth.

26 Review: Block Size

27 Problem: Conflict Misses

28 A Real Problem 28

29 Full Associative Cache

30 Fully Associative Cache

31 N-way Set Associative 31

32 2-Way Set-Associative Cache

33 Associative Cache 33

34 Associativity Example 34

35 Associativity Example 35

36 Set Associative Cache Organization 36

37 Cache Organization Options: 256-byte cache, 16-byte blocks, 16 blocks

38 Tag & Index with Set-Associative Cache 38

39 Bonus Trick: How to build a 3KB cache? 39

40 Associative Caches: Pros

41 Associative Caches: Cons

42 Replacement Methods 42

43 Review: Replacement Policy
In an associative cache, which block from a set should be evicted (逐出) when the set becomes full?
–Random
–Least-Recently Used (LRU)
»LRU cache state must be updated on every access
»true implementation only feasible for small sets (2-way)
»pseudo-LRU binary tree often used for 4-8 way
–First-In, First-Out (FIFO), a.k.a. Round-Robin
»used in highly associative caches
–Not-Most-Recently Used (NMRU)
»FIFO with an exception for the most-recently used block or blocks
Replacement is a second-order effect. Why? Replacement only happens on misses.
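A minimal C sketch of the pseudo-LRU binary tree mentioned above, for one 4-way set; the bit layout and function names are assumptions for illustration, not a specific hardware encoding. Three tree bits per set are enough: one per internal node, each recording which half was used more recently.

    #include <stdint.h>

    /* Tree-PLRU state for one 4-way set: 3 bits.
     * bit 0 (root):  0 = ways {0,1} used more recently than {2,3}
     * bit 1 (left):  0 = way 0 used more recently than way 1
     * bit 2 (right): 0 = way 2 used more recently than way 3
     */
    typedef struct { uint8_t bits; } plru4_t;

    /* Update the tree on every access (hit or fill) to `way` (0..3). */
    static void plru4_touch(plru4_t *s, int way) {
        if (way < 2) {
            s->bits &= ~1u;                            /* left pair recent  */
            if (way == 0) s->bits &= ~2u; else s->bits |= 2u;
        } else {
            s->bits |= 1u;                             /* right pair recent */
            if (way == 2) s->bits &= ~4u; else s->bits |= 4u;
        }
    }

    /* Pick a victim by walking toward the less recently used side. */
    static int plru4_victim(const plru4_t *s) {
        if (s->bits & 1u)                   /* right pair recent: evict left  */
            return (s->bits & 2u) ? 0 : 1;
        else                                /* left pair recent: evict right  */
            return (s->bits & 4u) ? 2 : 3;
    }

True LRU for a 4-way set would need to track a full ordering of the ways; the tree approximation needs only 3 bits and simple updates, which is why it is the common choice for 4-8 way caches.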

44 CPU-Cache Interaction (5-stage pipeline)
[Figure: pipeline datapath with the primary instruction cache in the fetch stage (PC, +4 adder, IR, hit?, PCen) and the primary data cache accessed in the memory stage (address and write data in, read data out, write enable, hit?); cache refill data comes from the lower levels of the memory hierarchy, and the entire CPU stalls on a data cache miss.]

45 Improving Cache Performance
Average memory access time (AMAT) = Hit time + Miss rate x Miss penalty
To improve performance:
–reduce the hit time
–reduce the miss rate
–reduce the miss penalty (损失)
What is the simplest design strategy? The biggest cache that doesn't increase hit time past 1-2 cycles (approx. 8-32 KB in modern technology).
[Design issues are more complex with out-of-order superscalar processors.]
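A worked example of the AMAT formula with made-up numbers (the 1-cycle hit time, 5% miss rate, and 100-cycle miss penalty are illustrative, not from the slides):

    #include <stdio.h>

    int main(void) {
        double hit_time     = 1.0;    /* cycles                          */
        double miss_rate    = 0.05;   /* 5% of accesses miss             */
        double miss_penalty = 100.0;  /* cycles to fetch from next level */

        double amat = hit_time + miss_rate * miss_penalty;
        printf("AMAT = %.1f cycles\n", amat);   /* 1 + 0.05 * 100 = 6.0 */
        return 0;
    }

Each of the three improvement directions on the slide attacks one term of this sum.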

46 Serial-versus-Parallel Cache and Memory Access
Let α be the HIT RATIO: the fraction of references that hit in the cache; 1 - α is the MISS RATIO: the remaining references.
Serial search (access main memory only after a cache miss):
–Average access time = t_cache + (1 - α) · t_mem
Parallel search (start the cache and main memory accesses together):
–Average access time = α · t_cache + (1 - α) · t_mem
Savings are usually small, since t_mem >> t_cache and the hit ratio α is high; the parallel organization also requires high bandwidth on the memory path, and the complexity of handling parallel paths can slow t_cache.

47 Causes for Cache Misses
–Compulsory (强制的): first reference to a block, a.k.a. cold-start misses; misses that would occur even with an infinite cache
–Capacity: the cache is too small to hold all the data needed by the program; misses that would occur even under a perfect replacement policy
–Conflict: misses that occur because of collisions due to the block-placement strategy; misses that would not occur with full associativity

48 Effect of Cache Parameters on Performance
Larger cache size
+ reduces capacity and conflict misses
- hit time will increase
Higher associativity (关联性 / 成组)
+ reduces conflict misses
- may increase hit time
Larger block size
+ reduces compulsory and capacity (reload) misses
- increases conflict misses and miss penalty

49 What About Writes? 49

50 Write Policy Choices
Cache hit:
–write through: write both cache & memory
»generally higher traffic, but simpler pipeline & cache design
–write back: write cache only; memory is written only when the entry is evicted (驱逐)
»a dirty bit per block further reduces write-back traffic
»must handle 0, 1, or 2 accesses to memory for each load/store
Cache miss:
–no write allocate: only write to main memory
–write allocate (a.k.a. fetch on write): fetch the block into the cache
Common combinations:
–write through and no write allocate (memory-centric, 以 memory 为中心)
–write back with write allocate (cache-centric, 以 cache 为中心)
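A minimal C sketch contrasting the two hit-side policies for a store that hits in the cache; the cache_line structure and the memory_write_word helper are assumptions made for the example.

    #include <stdbool.h>
    #include <stdint.h>

    #define WORDS_PER_BLOCK 4

    struct cache_line {
        bool     valid;
        bool     dirty;                 /* used only by write-back */
        uint32_t tag;
        uint32_t data[WORDS_PER_BLOCK];
    };

    /* Hypothetical helper: write one word to main memory. */
    extern void memory_write_word(uint32_t addr, uint32_t value);

    /* Write-through store hit: update the cache AND memory. */
    void store_hit_write_through(struct cache_line *line, uint32_t addr,
                                 uint32_t word_in_block, uint32_t value) {
        line->data[word_in_block] = value;
        memory_write_word(addr, value);  /* memory stays up to date */
    }

    /* Write-back store hit: update the cache only and mark it dirty;
     * memory is written later, when this line is evicted. */
    void store_hit_write_back(struct cache_line *line,
                              uint32_t word_in_block, uint32_t value) {
        line->data[word_in_block] = value;
        line->dirty = true;
    }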

51 Cache Write Policies: Major Options 51

52 Write Policy Trade-offs 52

53 Write Performance
[Figure: the direct-mapped cache datapath (tag/index/block-offset address split, valid bit, tag compare producing HIT, 2^k lines), extended with a write-enable (WE) input on the data array.]

54 Reducing Write Hit Time
Problem: writes take two cycles in the memory stage, one cycle for the tag check plus one cycle for the data write if hit.
Solutions:
–Design a data RAM that can perform read and write in one cycle, restoring the old value after a tag miss
–Fully-associative (CAM tag) caches: the word line is only enabled if there is a hit
–Pipelined writes: hold the write data for a store in a single buffer ahead of the cache, and write the cache data during the next store's tag check

55 Pipelining Cache Writes
[Figure: tag array and data array with a delayed-write address/data register between the tag check and the data write; the address and store data arrive from the CPU, and load data is returned to the CPU.]
Data from a store hit is written into the data portion of the cache during the tag access of the subsequent store.

56 Avoiding the Stall for Write-Through 56

57 Write Buffer to Reduce Read Miss Penalty
[Figure: CPU with register file, L1 data cache, write buffer, and unified L2 cache; the write buffer holds evicted dirty lines for a write-back cache, or all writes for a write-through cache.]
The processor is not stalled on writes, and read misses can go ahead of writes to main memory.
Problem: the write buffer may hold the updated value of a location needed by a read miss.
–Simple scheme: on a read miss, wait for the write buffer to go empty.
–Faster scheme: check write buffer addresses against the read miss address; if there is no match, allow the read miss to go ahead of the writes, otherwise return the value in the write buffer.
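A minimal C sketch of the "faster scheme" above; the buffer depth and entry layout are illustrative assumptions, and each entry is simplified to a single word.

    #include <stdbool.h>
    #include <stdint.h>

    #define WB_ENTRIES 4                   /* illustrative buffer depth */

    struct wb_entry {
        bool     valid;
        uint32_t block_addr;               /* address of buffered block */
        uint32_t data;                     /* simplified: one word      */
    };

    static struct wb_entry write_buffer[WB_ENTRIES];

    /* On a read miss to `block_addr`: if the write buffer holds that
     * block, forward the buffered value; otherwise the read miss may
     * bypass the buffered writes and go straight to memory.
     * Returns true when the value was forwarded. */
    bool write_buffer_forward(uint32_t block_addr, uint32_t *data_out) {
        for (int i = 0; i < WB_ENTRIES; i++) {
            if (write_buffer[i].valid &&
                write_buffer[i].block_addr == block_addr) {
                *data_out = write_buffer[i].data;
                return true;               /* match: use buffered value */
            }
        }
        return false;                      /* no match: read goes ahead */
    }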

58 Write Buffers
If a write-through cache has a write buffer, what should happen on a read miss?
Are write buffers of any use for write-back caches?

59 Cache Write Policy: Allocation Options 59

60 Write Miss Actions 60

61 Typical Choices 61

62 Be Careful, Even with Write Hits 62

63 Splitting Caches 63

64 Block-level Optimizations
Tags are too large, i.e., too much overhead
–Simple solution: larger blocks, but the miss penalty could be large
Sub-block placement (a.k.a. sector cache)
–A valid bit is added to units smaller than the full block, called sub-blocks
–Only read a sub-block on a miss
–If a tag matches, is the word in the cache?
[Figure: cache lines with tags (e.g., 100, 300, 204) and a valid bit per sub-block; some sub-blocks of a line with a matching tag may be invalid.]

65 Multilevel Caches 65

66 Multilevel Caches
Problem: a memory cannot be both large and fast.
Solution: increase the size of the cache at each level (CPU -> L1$ -> L2$ -> DRAM).
–Local miss rate = misses in this cache / accesses to this cache
–Global miss rate = misses in this cache / CPU memory accesses
–Misses per instruction = misses in this cache / number of instructions
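A worked example with made-up rates (not from the slides): if L1 misses on 5% of CPU accesses and L2's local miss rate is 20%, L2 is reached only on L1 misses, so L2's global miss rate is 0.05 x 0.20 = 1% of all CPU accesses.

    #include <stdio.h>

    int main(void) {
        double l1_miss_rate       = 0.05;  /* L1 local = global miss rate */
        double l2_local_miss_rate = 0.20;  /* misses / accesses to L2     */

        /* L2 is only accessed on L1 misses, so its global miss rate is
         * the product of the two rates. */
        double l2_global = l1_miss_rate * l2_local_miss_rate;
        printf("L2 global miss rate = %.3f\n", l2_global);   /* 0.010 */
        return 0;
    }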

67 Presence of L2 Influences L1 Design
Use a smaller L1 if there is also an L2
–Trade an increased L1 miss rate for reduced L1 hit time and reduced L1 miss penalty
–Reduces average access energy
Use a simpler write-through L1 with an on-chip L2
–The write-back L2 cache absorbs the write traffic, so it doesn't go off-chip
–At most one L1 miss request per L1 access (no dirty victim write-back), which simplifies pipeline control
–Simplifies coherence issues
–Simplifies error recovery in L1 (can use just parity bits in L1 and reload from L2 when a parity error is detected on an L1 read)

68 Multilevel Cache Example 68

69 Example

70 Inclusion (可兼的) Policy
Inclusive multilevel cache:
–The inner cache holds copies of data in the outer cache
–External coherence snoop (窥探) accesses need only check the outer cache
Exclusive (专用的) multilevel caches:
–The inner cache may hold data not in the outer cache
–Swap lines between inner/outer caches on a miss
–Used in the AMD Athlon, with a 64 KB primary and 256 KB secondary cache
Why choose one type or the other?

71 Itanium-2 On-Chip Caches (Intel/HP, 2002)
–Level 1: 16 KB, 4-way set-associative, 64 B line, quad-port (2 load + 2 store), single-cycle latency
–Level 2: 256 KB, 4-way set-associative, 128 B line, quad-port (4 load or 4 store), five-cycle latency
–Level 3: 3 MB, 12-way set-associative, 128 B line, single 32 B port, twelve-cycle latency

72 Power 7 On-Chip Caches [IBM 2009]
–32 KB L1 I$ per core and 32 KB L1 D$ per core, 3-cycle latency
–256 KB unified L2$ per core, 8-cycle latency
–32 MB unified shared L3$ in embedded DRAM, 25-cycle latency to the local slice

73 Cache Design: Datapath + Control
[Figure: the cache datapath (tag and data/block arrays with Addr/Din/Dout ports toward the CPU and toward the lower-level memory), governed by a state machine controller.]
Most design errors come from incorrect specification of state machine behavior!
Common bugs: stalls, block replacement, write buffer handling.

74 Cache Control 74

75 Interface Signals 75

76 Finite State Machines 76

77 Cache Controller FSM 77

78 Prefetching (预取)
Speculate (推测) on future instruction and data accesses and fetch them into the cache(s)
–Instruction accesses are easier to predict than data accesses
Varieties of prefetching
–Hardware prefetching
–Software prefetching
–Mixed schemes
What types of misses does prefetching affect?

79 Issues in Prefetching
–Usefulness: should produce hits
–Timeliness: not late and not too early
–Cache and bandwidth pollution
[Figure: CPU with register file, separate L1 instruction and L1 data caches, and a unified L2 cache; prefetched data is brought in alongside demand fetches.]

80 Hardware Instruction Prefetching
Instruction prefetch in the Alpha AXP 21064:
–Fetch two blocks on a miss: the requested block (i) and the next consecutive (连贯的) block (i+1)
–The requested block is placed in the cache, and the next block in an instruction stream buffer
–If an access misses in the cache but hits in the stream buffer, move the stream buffer block into the cache and prefetch the next block (i+2)
[Figure: CPU with register file, L1 instruction cache, stream buffer, and unified L2 cache; the requested block is returned to the L1 cache and the prefetched instruction block goes into the stream buffer.]

81 Hardware Data Prefetching
Prefetch-on-miss:
–Prefetch b+1 upon a miss on b
One-Block Lookahead (OBL, 预先准备) scheme:
–Initiate a prefetch for block b+1 when block b is accessed
–Why is this different from doubling the block size?
–Can extend to N-block lookahead
Strided (跨步) prefetch:
–If a sequence of accesses to blocks b, b+N, b+2N is observed, then prefetch b+3N, etc.
Example: IBM Power 5 [2003] supports eight independent streams of strided prefetch per processor, prefetching 12 lines ahead of the current access.
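A minimal C sketch of the stride-detection idea for a single stream; the structure, field names, and issue_prefetch hook are assumptions for illustration (real prefetchers track many streams, with confidence counters and training tables).

    #include <stdint.h>

    /* Hypothetical hook: ask the memory system to start fetching a block. */
    extern void issue_prefetch(uint32_t block_addr);

    struct stride_stream {
        uint32_t last_block;   /* last block address seen    */
        int32_t  stride;       /* current stride, in blocks  */
        int      confirmed;    /* how many times it repeated */
    };

    /* Call on every demand access to `block`. Once the same stride has
     * been seen twice in a row (b, b+N, b+2N), prefetch b+3N. */
    void stride_prefetcher_access(struct stride_stream *s, uint32_t block) {
        int32_t new_stride = (int32_t)(block - s->last_block);
        if (new_stride != 0 && new_stride == s->stride) {
            s->confirmed++;
        } else {
            s->stride    = new_stride;
            s->confirmed = 0;
        }
        s->last_block = block;

        if (s->confirmed >= 1)                  /* pattern b, b+N, b+2N seen */
            issue_prefetch(block + (uint32_t)s->stride);
    }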

82 Software Prefetching

    for (i = 0; i < N; i++) {
        prefetch(&a[i + 1]);
        prefetch(&b[i + 1]);
        SUM = SUM + a[i] * b[i];
    }

83 Software Prefetching Issues
Timing is the biggest issue, not predictability (可预测性)
–If you prefetch very close to when the data is required, you might be too late
–Prefetch too early and you cause pollution
–Estimate how long it will take for the data to come into L1, so we can set P appropriately. Why is this hard to do?

    for (i = 0; i < N; i++) {
        prefetch(&a[i + P]);
        prefetch(&b[i + P]);
        SUM = SUM + a[i] * b[i];
    }

Must also consider the cost of the prefetch instructions.
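For reference, on GCC and Clang the generic prefetch(...) in the slides can be written with the __builtin_prefetch intrinsic; the prefetch distance P below is an illustrative value that would need tuning to the machine's memory latency.

    #include <stddef.h>

    #define P 16   /* illustrative prefetch distance, in iterations */

    /* Dot product with software prefetching P iterations ahead. */
    double dot_prefetch(const double *a, const double *b, size_t n) {
        double sum = 0.0;
        for (size_t i = 0; i < n; i++) {
            if (i + P < n) {
                __builtin_prefetch(&a[i + P], 0, 1);  /* read, low reuse */
                __builtin_prefetch(&b[i + P], 0, 1);
            }
            sum += a[i] * b[i];
        }
        return sum;
    }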

84 Compiler Optimizations (Software)
Restructuring (重组) code affects the data block access sequence
–Group data accesses together to improve spatial locality
–Re-order data accesses to improve temporal locality
Prevent data from entering the cache
–Useful for variables that will only be accessed once before being replaced
–Needs a mechanism for software to tell hardware not to cache the data ("no-allocate" instruction hints or page table bits)
Kill data that will never be used again
–Streaming (流) data exploits spatial locality but not temporal locality
–Replace into dead cache locations

85 Loop Interchange (循环交换)

    for (j = 0; j < N; j++) {
        for (i = 0; i < M; i++) {
            x[i][j] = 2 * x[i][j];
        }
    }

becomes

    for (i = 0; i < M; i++) {
        for (j = 0; j < N; j++) {
            x[i][j] = 2 * x[i][j];
        }
    }

What type of locality does this improve?

86 Loop Fusion (融合)

    for (i = 0; i < N; i++)
        a[i] = b[i] * c[i];
    for (i = 0; i < N; i++)
        d[i] = a[i] * c[i];

becomes

    for (i = 0; i < N; i++) {
        a[i] = b[i] * c[i];
        d[i] = a[i] * c[i];
    }

What type of locality does this improve?

87 Matrix Multiply, Naïve Code

    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++) {
            r = 0;
            for (k = 0; k < N; k++)
                r = r + y[i][k] * z[k][j];
            x[i][j] = r;
        }

[Figure: access patterns for x, y, and z, marking not-touched, old, and new accesses: x[i][j] is written once, row i of y is read, and column j of z is read.]

88 Matrix Multiply with Cache Tiling

    for (jj = 0; jj < N; jj = jj + B)
        for (kk = 0; kk < N; kk = kk + B)
            for (i = 0; i < N; i++)
                for (j = jj; j < min(jj + B, N); j++) {
                    r = 0;
                    for (k = kk; k < min(kk + B, N); k++)
                        r = r + y[i][k] * z[k][j];
                    x[i][j] = x[i][j] + r;
                }

What type of locality does this improve?
[Figure: the same x, y, and z access-pattern diagrams, marking not-touched, old, and new accesses under tiling.]

89 Main Memory: DRAM 89

90 Increasing Memory Bandwidth 90

91 Advanced DRAM Organization 91

92 DRAM Generations & Trends 92

93 Measuring Performance 93

94 Memory Performance 94

95 Worst-Case Simplicity 95

96 Simple Cache Performance Example 96

97 Performance Solution 97

98 Another Cache Problem 98

99 No Write Buffer 99

100 Perfect Write Buffer 100

101 Perfect Write Buffer + Cache Write Buffer

102 Fallacy: Caches are Transparent to Programmers

103 Matrix Multiplication Example 103

104 Pentium Blocked Matrix Multiply Performance 104

105 Fallacy: All Cache Hierarchies are the Same (Example: 3-Level Cache Organization)

106 Homework
Readings:
–Read Chapter 5.6-5.17 of Computer Organization and Design (COD), Fifth Edition
–Be familiar with SPIM
HW8 (5th edition), http://mypage.zju.edu.cn/wdwd/
–Project 1
–Project 2 (optional)
–Do exercises 5.4, 5.5, 5.6, 5.13

107 Acknowledgements
These slides contain material from the following courses:
–UCB CS152
–Stanford EE108B
–MIT 6.823

