1
COSC6385 Advanced Computer Architecture Lecture 6. Cache and Memory
Instructor: Olga Datskova Computer Science Department University of Houston
2
An Unbalanced System: CPU vs. I/O, memory, and cache
Source: Bob Colwell keynote, ISCA
3
Memory Issues
Latency: time to move through the longest circuit path (from the start of the request to the response)
Bandwidth: number of bits transported at one time
Capacity: size of the memory
Energy: cost of accessing memory (to read and write)
4
Model of Memory Hierarchy
Register file → L1 data and instruction caches → L2 cache → main memory → disk; the register file and caches are SRAM, main memory is DRAM.
5
Levels of the Memory Hierarchy
From the upper (faster, smaller) level to the lower (larger, cheaper) level, with who manages each level and the unit transferred between levels:
Registers: 100s of bytes, <10 ns; managed by the compiler; transfer unit: 1-8 bytes (instruction operands)
Cache: KBs, ~10 ns, roughly 1-0.1 cents/bit; managed by the cache controller; transfer unit: 8-128 bytes (cache lines) -- our focus
Main memory: MBs, 200-500 ns, a small fraction of a cent per bit; managed by the operating system; transfer unit: 512 bytes-4 KB (pages)
Disk: GBs, ~10 ms (10,000,000 ns), about 10^-5 to 10^-6 cents/bit; managed by the user; transfer unit: MBs (files)
Tape: effectively infinite capacity, seconds to minutes, about 10^-8 cents/bit
6
Topics Covered
Why caches work: the principle of program locality
Cache hierarchy: average memory access time (AMAT)
Types of caches: direct mapped, set-associative, fully associative
Cache policies: write back vs. write through; write allocate vs. no write allocate
7
Principle of Locality
Programs access a relatively small portion of the address space at any instant of time.
Two types of locality:
Temporal locality (locality in time): if an address is referenced, it tends to be referenced again soon (e.g., loops, data reuse)
Spatial locality (locality in space): if an address is referenced, neighboring addresses tend to be referenced soon (e.g., straight-line code, array accesses)
Traditionally, hardware has relied on locality for speed.
Locality is a program property that is exploited in machine design.
8
Example of Locality
int A[100], B[100], C[100], D;
for (i = 0; i < 100; i++) {
    C[i] = A[i] * B[i] + D;
}
The arrays are laid out contiguously in memory, so one cache line (one fetch) brings in several consecutive elements (e.g., A[0]-A[3]); every element fetched is used in turn, and D is reused on every iteration.
9
Modern Memory Hierarchy
By taking advantage of the principle of locality we can:
Present the user with as much memory as is available in the cheapest technology.
Provide access at the speed offered by the fastest technology.
The datapath and registers sit closest to the processor, backed by split L1 instruction and data caches, second- and third-level SRAM caches, DRAM main memory, secondary storage (disk), and tertiary storage (disk/tape).
10
Example: Intel Core 2 Duo
Two cores (Core0, Core1), each with split L1 instruction and data caches (IL1, DL1) and a shared L2.
L1: 32 KB, 8-way, 64-byte lines, LRU, write-back, 3-cycle latency
L2: 4.0 MB, 16-way, 64-byte lines, LRU, write-back, 14-cycle latency
11
Example: Intel Itanium 2
3 MB version: 180 nm process, 421 mm² die
6 MB version: 130 nm process, 374 mm² die
12
Intel Nehalem
Die photo: Core 0 and Core 1 (3 MB of cache shown beside each core) and a 24 MB L3.
13
Cache Terminology
Hit: the data appears in some line of the upper level
Hit rate: the fraction of memory accesses found in that level
Hit time: time to access the level (RAM access time + time to determine hit/miss)
Miss: the data must be retrieved from a block in the lower level (e.g., block Y)
Miss rate = 1 - hit rate
Miss penalty: time to replace a line in the upper level + time to deliver the block to the processor
Hit time << miss penalty
14
Average Memory Access Time
Average memory access time = hit time + miss rate × miss penalty
Miss penalty (time to fetch a block from the lower memory level) has two parts:
  access time: a function of latency
  transfer time: a function of the bandwidth between levels
Transfers move one cache line/block at a time, at the width of the memory bus.
15
Memory Hierarchy Performance
First-level cache (hit time ≈ 1 clk) backed directly by main memory (DRAM, ≈ 300 clks).
Average memory access time (AMAT) = hit time + miss rate × miss penalty = Thit(L1) + Miss%(L1) × T(memory)
Example: cache hit = 1 cycle, miss rate = 10% = 0.1, miss penalty = 300 cycles
AMAT = 1 + 0.1 × 300 = 31 cycles
Can we improve it?
16
Reducing Penalty: Multi-Level Cache
On-die: first-level cache (1 clk), second-level cache (10 clks), third-level cache (20 clks); main memory (DRAM) at 300 clks.
Average memory access time (AMAT) = Thit(L1) + Miss%(L1) × (Thit(L2) + Miss%(L2) × (Thit(L3) + Miss%(L3) × T(memory)))
17
AMAT of Multi-Level Memory
AMAT = Thit(L1) + Miss%(L1) × Tmiss(L1)
     = Thit(L1) + Miss%(L1) × [Thit(L2) + Miss%(L2) × Tmiss(L2)]
     = Thit(L1) + Miss%(L1) × {Thit(L2) + Miss%(L2) × [Thit(L3) + Miss%(L3) × T(memory)]}
18
AMAT Example
AMAT = Thit(L1) + Miss%(L1) × (Thit(L2) + Miss%(L2) × (Thit(L3) + Miss%(L3) × T(memory)))
Example: miss rate L1 = 10%, Thit(L1) = 1 cycle; miss rate L2 = 5%, Thit(L2) = 10 cycles; miss rate L3 = 1%, Thit(L3) = 20 cycles; T(memory) = 300 cycles
AMAT = 1 + 0.1 × (10 + 0.05 × (20 + 0.01 × 300)) = 2.115 cycles (compare to 31 cycles with no multi-level cache): a 14.7× speed-up, worked out in the small sketch below.
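A minimal C sketch of the two AMAT calculations above, using the example's hit times, miss rates, and memory latency; it just evaluates the formula and is not a model of any specific machine.

#include <stdio.h>

int main(void) {
    double t_mem = 300.0;                      /* main-memory latency (cycles) */

    /* Single-level cache: AMAT = Thit(L1) + Miss%(L1) * T(memory) */
    double amat_l1_only = 1.0 + 0.10 * t_mem;  /* = 31 cycles */

    /* Three-level hierarchy: expand the recursive formula bottom-up */
    double amat_l3 = 20.0 + 0.01 * t_mem;      /* miss penalty seen by L2 */
    double amat_l2 = 10.0 + 0.05 * amat_l3;    /* miss penalty seen by L1 */
    double amat_l1 = 1.0  + 0.10 * amat_l2;    /* = 2.115 cycles */

    printf("AMAT, L1 only : %.3f cycles\n", amat_l1_only);
    printf("AMAT, L1+L2+L3: %.3f cycles\n", amat_l1);
    printf("Speed-up      : %.1fx\n", amat_l1_only / amat_l1);
    return 0;
}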
19
Types of Caches
The type of cache determines how memory data is mapped to cache locations and how complex it is to search the cache.
Direct mapped (DM): a memory value can be placed at exactly one corresponding location in the cache; fast indexing mechanism
Set-associative (SA): a memory value can be placed in any of a set of locations in the cache; slightly more involved search mechanism
Fully associative (FA): a memory value can be placed in any location in the cache; extensive hardware resources required to search (CAM)
DM and FA can be thought of as special cases of SA: DM is 1-way SA, FA is all-way SA.
20
Direct Mapped Cache
(Figure: memory addresses 0-F map onto a 4-line DM cache; each cache entry holds one cache line/block.)
Cache location 0 is occupied by data from memory locations 0, 4, 8, and C.
Which one should we place in the cache?
How can we tell which one is in the cache?
21
Three (or Four) Cs (Cache Miss Terms)
Compulsory misses: cold-start misses; the cache holds no valid data at the start of the program
Capacity misses: the cache cannot hold the program's working set; remedy: increase the cache size
Conflict misses: too many lines map to the same location; remedy: increase cache size and/or associativity (associative caches reduce conflict misses)
Coherence misses: arise in multiprocessor systems (later lectures)
22
Example: 1 KB DM Cache, 32-Byte Lines
The lowest M bits of the address are the offset (line size = 2^M); the index is log2(# of sets) bits.
For a 32-bit address: bits [31:10] are the tag, bits [9:5] the index (e.g., 0x01), bits [4:0] the offset (e.g., 0x00).
Each set holds a valid bit, the cache tag, and 32 bytes of cache data (bytes 0-31 in set 0, bytes 32-63 in set 1, ..., bytes 992-1023 in set 31).
23
Example: 1 KB DM Cache, 32-Byte Lines
lw from 0x77FF1C68: the address is split into tag, index, and offset fields; the index selects one set of the DM cache, the stored tag for that set is compared against the address tag in the tag array, and the offset selects the requested word within the 32-byte line in the data array.
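A small C sketch that performs the same field split for this cache geometry (5 offset bits for 32-byte lines, 5 index bits for 32 sets, 22 tag bits); the values it prints for 0x77FF1C68 are computed by the program, not copied from the slide's figure.

#include <stdio.h>
#include <stdint.h>

#define OFFSET_BITS 5    /* 32-byte line -> 2^5 bytes          */
#define INDEX_BITS  5    /* 1 KB / 32 B  = 32 sets -> 2^5 sets */

int main(void) {
    uint32_t addr   = 0x77FF1C68u;
    uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);
    uint32_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
    uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);

    printf("addr=0x%08X  tag=0x%06X  index=%u  offset=%u\n",
           (unsigned)addr, (unsigned)tag, (unsigned)index, (unsigned)offset);
    return 0;
}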
24
DM Cache Speed Advantage
The index field drives the tag array and the data array at the same time, so tag and data access happen in parallel: faster cache access!
25
Associative Caches Reduce Conflict Misses
Set-associative (SA) cache: multiple possible locations within a set
Fully associative (FA) cache: any location in the cache
Hardware and speed overhead: comparators and multiplexors; data can be selected only after the hit/miss determination (i.e., after the tag comparison)
26
Set-Associative Cache (2-Way)
The cache index selects a set; the two tags in the set are compared in parallel, and data is selected based on the tag comparison result.
Each way holds a valid bit, a cache tag, and a cache line; the two per-way comparisons are ORed to form the hit signal, and a multiplexor (Sel0/Sel1) delivers the hitting way's cache line.
The additional circuitry compared with DM caches makes SA caches slower to access than a DM cache of comparable size.
27
Set-Associative Cache (2-Way)
lw from 0x77FF1C78 (32-bit address): the index field selects one set across both ways; tag array 0 and tag array 1 are probed in parallel, and the matching way's data array supplies the line.
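A small C sketch of the 2-way lookup, assuming illustrative sizes (64 sets, 64-byte lines) rather than the lecture's: the index picks one set, the tags of both ways are compared (the hardware does this in parallel; the loop stands in for the two comparators), and a hit selects the matching way.

#include <stdint.h>
#include <stdbool.h>

#define OFFSET_BITS 6                    /* 64-byte lines (illustrative) */
#define INDEX_BITS  6                    /* 64 sets (illustrative)       */
#define NUM_SETS    (1 << INDEX_BITS)
#define WAYS        2

typedef struct {
    bool     valid[WAYS];
    uint32_t tag[WAYS];
    uint8_t  data[WAYS][1 << OFFSET_BITS];
} cache_set_t;

static cache_set_t sets[NUM_SETS];

/* Returns the hitting way, or -1 on a miss. */
int lookup(uint32_t addr) {
    uint32_t index = (addr >> OFFSET_BITS) & (NUM_SETS - 1);
    uint32_t tag   = addr >> (OFFSET_BITS + INDEX_BITS);
    cache_set_t *set = &sets[index];

    for (int w = 0; w < WAYS; w++)                 /* both "comparators"  */
        if (set->valid[w] && set->tag[w] == tag)
            return w;                              /* select this way     */
    return -1;                                     /* no way matched: miss */
}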
28
Fully Associative Cache
The address is split into just a tag and an offset (no index); the tag is associatively searched against every stored tag, the matching entry's data is selected through a multiplexor, and the requested bytes are extracted with a rotate-and-mask step.
29
Fully Associative Cache
Every entry holds a tag and data, and the address tag is compared against all of them in parallel; the matching entry supplies the read data (or receives the write data).
The additional circuitry compared with DM caches is even more extensive than for SA caches, making FA caches slower to access than either DM or SA caches of comparable size.
30
Cache Write Policy
Write through: the value is written to both the cache line and the lower-level memory.
Write back: the value is written only to the cache line; the modified cache line is written to main memory only when it has to be replaced.
Is the cache line clean (holds the same value as memory) or dirty (holds a different value than memory)?
31
Write-Through Policy
A store updates the cache line and propagates to memory, so the processor, the cache, and memory all hold the same value (e.g., 0x1234).
32
Write Buffer
The processor writes data into the cache and into the write buffer; the memory controller drains the buffer's contents to DRAM.
The write buffer is a FIFO structure, typically 4 to 8 entries.
Desirable: writes occur much more slowly than DRAM can retire them.
Memory system designer's nightmare: write buffer saturation (writes arrive as fast as, or faster than, DRAM write cycles can absorb them).
33
Write-Back Policy
On a write, only the cache line is updated and marked dirty; memory still holds the old value. Only when that dirty line must be replaced (for example, on a later write miss that maps to the same cache location) is its value written back to memory before the line is reused for the new data.
34
On Write Miss
Write allocate: the line is allocated on a write miss, followed by the write-hit actions above; write misses first act like read misses.
No write allocate: write misses do not affect the cache; the line is modified only in the lower-level memory. Mostly used with write-through caches.
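A small C sketch contrasting the two write-miss policies; the printfs stand in for the actual cache and memory machinery, and the address is just the example address reused from earlier slides.

#include <stdio.h>
#include <stdint.h>

enum write_miss_policy { WRITE_ALLOCATE, NO_WRITE_ALLOCATE };

static void fill_line_from_memory(uint32_t a) { printf("  fetch line for 0x%08X into the cache\n", (unsigned)a); }
static void write_word_in_cache(uint32_t a)   { printf("  write word at 0x%08X in the cache line\n", (unsigned)a); }
static void write_word_to_memory(uint32_t a)  { printf("  write word at 0x%08X directly to memory\n", (unsigned)a); }

static void handle_write_miss(uint32_t addr, enum write_miss_policy p) {
    if (p == WRITE_ALLOCATE) {
        fill_line_from_memory(addr);   /* the miss first acts like a read miss ...    */
        write_word_in_cache(addr);     /* ... then completes as an ordinary write hit */
    } else {
        write_word_to_memory(addr);    /* cache untouched: modify lower-level memory only */
    }
}

int main(void) {
    puts("write allocate:");    handle_write_miss(0x77FF1C68u, WRITE_ALLOCATE);
    puts("no write allocate:"); handle_write_miss(0x77FF1C68u, NO_WRITE_ALLOCATE);
    return 0;
}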
35
Quick Recap
Processor-memory performance gap
The memory hierarchy exploits program locality to reduce AMAT
Types of caches: direct mapped, set-associative, fully associative
Cache policies: write through vs. write back; write allocate vs. no write allocate
36
Cache Replacement Policy
Random: replace a randomly chosen line
FIFO: replace the oldest line
LRU (Least Recently Used): replace the least recently used line
NRU (Not Recently Used): replace one of the lines that has not been used recently (used in the Itanium 2 L1 D-cache, L2, and L3 caches)
37
LRU Policy
A 4-entry set ordered from MRU to LRU, initially A B C D:
Access C: C A B D (hit)
Access D: D C A B (hit)
Access E: E D C A (miss, replacement needed)
Access C: C E D A (hit)
Access G: G C E D (miss, replacement needed)
38
LRU From Hardware Perspective
A state machine observes accesses to ways 0-3 and updates the LRU ordering on every access (e.g., an access to line D updates the state for that way).
The LRU policy increases cache access times, and additional hardware bits are needed for the LRU state machine.
39
LRU Algorithms
True LRU: expensive in terms of speed and hardware; it must remember the order in which all N lines were last accessed. There are N! possible orderings, so O(log N!) = O(N log N) LRU bits are needed.
  2 ways: AB, BA (2 = 2! orderings)
  3 ways: ABC, ACB, BAC, BCA, CAB, CBA (6 = 3! orderings)
Pseudo-LRU: O(N) bits; approximates the LRU policy with a binary tree, as in the sketch below.
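A small C sketch of tree-based pseudo-LRU for one 4-way set: three bits approximate the full LRU order (true LRU for 4 ways needs ceil(log2 4!) = 5 bits). The bit conventions here are one common choice, not necessarily what any particular processor implements.

#include <stdio.h>
#include <stdint.h>

typedef struct { uint8_t b0, b1, b2; } plru_t;   /* b0 picks the pair, b1/b2 pick within a pair */

/* Update the tree so every bit points *away* from the way just accessed. */
static void plru_touch(plru_t *t, int way) {
    if (way < 2) { t->b0 = 1; t->b1 = (way == 0); }   /* accessed left pair  -> victim on right */
    else         { t->b0 = 0; t->b2 = (way == 2); }   /* accessed right pair -> victim on left  */
}

/* Follow the bits to find the (approximately) least recently used way. */
static int plru_victim(const plru_t *t) {
    if (t->b0 == 0) return t->b1 ? 1 : 0;             /* victim in left pair  */
    else            return t->b2 ? 3 : 2;             /* victim in right pair */
}

int main(void) {
    plru_t t = {0, 0, 0};
    int accesses[] = {0, 2, 1, 3, 0};
    for (unsigned i = 0; i < sizeof accesses / sizeof *accesses; i++) {
        plru_touch(&t, accesses[i]);
        printf("touch way %d -> next victim would be way %d\n",
               accesses[i], plru_victim(&t));
    }
    return 0;
}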
40
Reducing Miss Rate
Enlarge the cache.
If the cache size is fixed: increase associativity, or increase the line size.
Does increasing the line size always work? No: beyond a point, larger lines increase cache pollution and the miss rate rises again (figure: miss rate vs. block size for several cache sizes).
41
Reduce Miss Rate: Code Optimization
Misses occur when sequentially accessed array elements come from different cache lines.
Code optimizations require no hardware change; they rely on programmers or compilers.
Examples:
Loop interchange: in nested loops, the outer loop becomes the inner loop and vice versa
Loop blocking: partition a large array into smaller blocks so the accessed array elements fit in the cache, enhancing cache reuse
42
Loop Interchange
x is stored in row-major order.
/* Before */
for (j = 0; j < 100; j++)
    for (i = 0; i < 5000; i++)
        x[i][j] = 2 * x[i][j];
What is the worst that could happen? Hint: DM cache.
/* After */
for (i = 0; i < 5000; i++)
    for (j = 0; j < 100; j++)
        x[i][j] = 2 * x[i][j];
The interchanged loop walks each row in memory order, improving cache efficiency.
43
Loop Blocking
/* Before */
for (i = 0; i < N; i++)
    for (j = 0; j < N; j++) {
        r = 0;
        for (k = 0; k < N; k++)
            r += y[i][k] * z[k][j];
        x[i][j] = r;
    }
For each x[i][j], the inner loop streams an entire row of y and an entire column of z; for large N this does not exploit locality.
44
Loop Blocking
Partition the loop's iteration space into many smaller chunks and ensure that the data stays in the cache until it is reused: the computation touches sub-blocks of x, y, and z at a time. A blocked version is sketched below.
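A C99 sketch of the blocked version, assuming x is zero-initialized before the call; BLK is a tuning knob, typically chosen so roughly three BLK×BLK sub-blocks fit in the cache.

#define BLK 32                                     /* block size, illustrative */

static int min(int a, int b) { return a < b ? a : b; }

void matmul_blocked(int n, double x[n][n], double y[n][n], double z[n][n]) {
    for (int jj = 0; jj < n; jj += BLK)            /* block of columns of z and x */
        for (int kk = 0; kk < n; kk += BLK)        /* block of the k dimension    */
            for (int i = 0; i < n; i++)
                for (int j = jj; j < min(jj + BLK, n); j++) {
                    double r = 0.0;
                    for (int k = kk; k < min(kk + BLK, n); k++)
                        r += y[i][k] * z[k][j];
                    x[i][j] += r;                  /* partial sum for this kk block */
                }
}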
45
Other Miss Penalty Reduction Techniques
Critical word first and early restart: send the requested data in the leading edge of the transfer; the trailing edge of the transfer continues in the background
Give priority to read misses over writes: use a write buffer (write-through) and a write-back buffer (write-back)
Combining writes: a write-combining buffer (cf. Intel's WC, write-combining, memory type)
Victim caches, assist caches, non-blocking caches, data prefetch mechanisms
46
Write Combining Buffer
Without combining, writes to addresses 100, 108, 116, and 124 occupy four buffer entries (each valid, each holding one word: Mem[100], Mem[108], Mem[116], Mem[124]) and initiate four separate writes back to lower-level memory.
A write-combining buffer merges neighboring addresses into one entry (write address 100 holding Mem[100], Mem[108], Mem[116], Mem[124]), so a single write goes back to lower-level memory.
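A small C sketch of a single write-combining entry: byte writes that fall into the same 32-byte block are merged into one entry (and thus one eventual bus write). The block size and the flush-on-conflict behavior are illustrative simplifications.

#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>
#include <string.h>

#define BLOCK 32

typedef struct {
    bool     valid;
    uint32_t block_addr;           /* base address of the 32-byte block */
    bool     byte_valid[BLOCK];    /* which bytes have been written     */
    uint8_t  data[BLOCK];
} wc_entry_t;

static void wc_write(wc_entry_t *e, uint32_t addr, uint8_t value) {
    uint32_t blk = addr & ~(uint32_t)(BLOCK - 1);
    if (!e->valid || e->block_addr != blk) {   /* (a real buffer would flush here) */
        memset(e, 0, sizeof *e);
        e->valid = true;
        e->block_addr = blk;
    }
    e->data[addr % BLOCK]       = value;       /* combine into the existing entry  */
    e->byte_valid[addr % BLOCK] = true;
}

int main(void) {
    wc_entry_t e = {0};
    /* Writes to 100, 108, 116, 124 all fall in the block starting at 96,
       so they occupy one entry and need one write back instead of four. */
    wc_write(&e, 100, 1); wc_write(&e, 108, 2);
    wc_write(&e, 116, 3); wc_write(&e, 124, 4);
    printf("block at %u buffered in a single entry\n", (unsigned)e.block_addr);
    return 0;
}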
47
Cache Penalty Reduction Techniques
Victim cache Assist cache Non-blocking cache Data Prefetch mechanism
48
3Cs Absolute Miss Rate (SPEC92)
Compulsory misses are a tiny fraction of the overall misses.
Capacity misses decrease with increasing cache size.
Conflict misses decrease with increasing associativity.
49
2:1 Cache Rule
The miss rate of a direct-mapped cache of size X is approximately the same as that of a 2-way set-associative cache of size X/2.
50
Victim Caching [Jouppi '90]
A victim cache (VC) is a small, fully associative structure between L1 and memory; it is especially effective with direct-mapped caches.
Whenever a line is displaced from the L1 cache, it is loaded into the VC.
The processor checks both L1 and the VC simultaneously.
If L1 misses and the VC hits, the data is swapped between the VC and L1.
When data has to be evicted from the VC, it is written back to memory.
51
% of Conflict Misses Removed
(Figure: percentage of conflict misses removed by the victim cache, shown for the D-cache and the I-cache.)
52
Assist Cache [Chan et al. '96]
The on-chip assist cache avoids thrashing in the main (off-chip) L1 cache; both run at full speed.
64 × 32-byte fully associative CAM.
Data enters the assist cache on a miss (FIFO replacement policy in the assist cache).
On eviction from the assist cache, data is conditionally moved to L1 or back to memory.
Lines brought in by "spatial locality hint" instructions are flushed back to memory, reducing pollution.
53
Multi-Lateral Cache Architecture
A fully connected multi-lateral cache architecture: the processor core is connected to two cache structures A and B, which both connect to memory.
Most cache architectures can be generalized into this form.
54
Cache Architecture Taxonomy
Instances of the general description (processor, caches A and B, memory) include a single-level cache, a two-level cache, a victim cache, and an assist cache; they differ in how A and B connect to the processor, to each other, and to memory.
55
Non-Blocking (Lockup-Free) Cache [Kroft '81]
Prevents the pipeline from stalling on cache misses: the cache continues to serve hits to other lines while servicing a miss on one or more lines.
Uses Miss Status Handling Registers (MSHRs) to track cache misses, allocating one entry per outstanding miss (called a fill buffer in the Intel P6 family).
A new cache miss is checked against the MSHRs; the pipeline stalls on a miss only when the MSHRs are full.
The number of MSHR entries should be chosen carefully to match the sustainable bus bandwidth.
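A small C sketch of the MSHR bookkeeping: a miss to a line that is already outstanding merges into its entry, a miss to a new line allocates a free entry, and the request stalls only when every entry is busy. The entry count and line size are illustrative.

#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

#define NUM_MSHR   2
#define LINE_BITS  6

typedef struct { bool busy; uint32_t line_addr; int pending; } mshr_t;
static mshr_t mshr[NUM_MSHR];

/* Returns false if the miss cannot be accepted (structural stall). */
static bool handle_miss(uint32_t addr) {
    uint32_t line = addr >> LINE_BITS;
    for (int i = 0; i < NUM_MSHR; i++)          /* line already being fetched?   */
        if (mshr[i].busy && mshr[i].line_addr == line) {
            mshr[i].pending++;                  /* merge the secondary miss      */
            return true;
        }
    for (int i = 0; i < NUM_MSHR; i++)          /* otherwise allocate a free one */
        if (!mshr[i].busy) {
            mshr[i] = (mshr_t){ true, line, 1 };
            return true;
        }
    return false;                               /* all MSHRs busy: stall         */
}

int main(void) {
    uint32_t misses[] = { 0x1000, 0x2000, 0x1004, 0x3000 };
    for (unsigned i = 0; i < 4; i++)
        printf("miss 0x%X -> %s\n", (unsigned)misses[i],
               handle_miss(misses[i]) ? "accepted" : "stall (MSHRs full)");
    return 0;
}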
56
Bus Utilization (MSHR = 2)
(Figure: memory-bus timeline for misses m1-m5. Each miss pays a lead-off latency and then transfers 4 data chunks, separated by the initiation interval; with only two MSHRs, later misses stall waiting for a free MSHR and the bus sits idle between transfers.)
57
Bus Utilization (MSHR = 4)
Time Stall Data Transfer Bus Idle BUS Memory bus utilization
58
Prefetch (Data/Instruction)
Predict what data will be needed in the future.
Pollution vs. latency reduction: a correct prediction reduces latency; a misprediction brings in unwanted data and pollutes the cache.
To determine effectiveness, ask:
  When to initiate a prefetch? (timeliness)
  Which lines to prefetch?
  How big a line to prefetch? (note that the cache-line mechanism already performs a form of prefetching)
  What to replace?
Software (data) prefetching vs. hardware prefetching.
59
Software-Controlled Prefetching
Uses instructions:
  An existing instruction: Alpha's load to r31 (hardwired to 0)
  Specialized instructions and hints: Intel SSE prefetchnta, prefetcht0/t1/t2; MIPS32 PREF; PowerPC dcbt (data cache block touch) and dcbtst (data cache block touch for store)
Prefetch instructions are inserted by the compiler or by hand.
60
Software-Controlled Prefetching
/* Original loop */
for (i = 0; i < N; i++) {
    prefetch(&a[i+1]);
    prefetch(&b[i+1]);
    sop = sop + a[i]*b[i];
}

/* Unroll loop 4 times */
for (i = 0; i < N-4; i += 4) {
    prefetch(&a[i+4]);
    prefetch(&b[i+4]);
    sop = sop + a[i]*b[i];
    sop = sop + a[i+1]*b[i+1];
    sop = sop + a[i+2]*b[i+2];
    sop = sop + a[i+3]*b[i+3];
}
sop = sop + a[N-4]*b[N-4];
sop = sop + a[N-3]*b[N-3];
sop = sop + a[N-2]*b[N-2];
sop = sop + a[N-1]*b[N-1];

The prefetch latency should be covered by the computation time of one iteration (prefetch latency <= computation time).
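As a concrete example, the generic prefetch() calls above can be written with GCC/Clang's __builtin_prefetch; the unrolling factor and the prefetch distance (16 elements ahead) are illustrative tuning knobs, and the builtin is only a hint to the hardware.

double dot_prefetch(const double *a, const double *b, int n) {
    double sop = 0.0;
    int i;
    for (i = 0; i + 4 <= n; i += 4) {            /* unrolled by 4 */
        __builtin_prefetch(&a[i + 16]);          /* prefetch a few lines ahead (hint only) */
        __builtin_prefetch(&b[i + 16]);
        sop += a[i]     * b[i];
        sop += a[i + 1] * b[i + 1];
        sop += a[i + 2] * b[i + 2];
        sop += a[i + 3] * b[i + 3];
    }
    for (; i < n; i++)                           /* epilogue for leftover elements */
        sop += a[i] * b[i];
    return sop;
}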
61
Hardware-Based Prefetching
Sequential prefetching: prefetch on miss, and tagged prefetch.
Both techniques are based on One Block Lookahead (OBL) prefetching: prefetch line (L+1) when line L is accessed, based on some criterion.
62
Sequential Prefetching
Prefetch on miss: initiate a prefetch of (L+1) whenever an access to L results in a miss. The Alpha does this for instructions (prefetched instructions are stored in a separate structure called a stream buffer).
Tagged prefetch: whenever there is a "first use" of a line (demand fetched or previously prefetched), prefetch the next one.
  One additional tag bit per cache line: the prefetched, not-yet-used line has Tag = 1.
  Tag bit = 0: the line was demand fetched, or a prefetched line has already been referenced for the first time.
  Prefetch (L+1) only if the tag bit of L is 1 when L is accessed.
63
Sequential Prefetching
Prefetch-on-miss, when accessing contiguous lines: a demand-fetched line is followed by a prefetched line; the prefetched line hits, but the line after it misses again, so misses and hits alternate.
Tagged prefetch, when accessing contiguous lines: after the initial demand miss, the first use of each prefetched line (tag = 1) triggers the next prefetch, so the remaining accesses all hit.
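A small C sketch simulating the tagged-prefetch behavior just described on a run of contiguous line accesses: only the very first access misses, because the first use of each prefetched line (tag = 1) triggers the prefetch of the next line. The structure sizes are illustrative.

#include <stdio.h>
#include <stdbool.h>

#define LINES 16
static bool present[LINES];   /* line is in the cache                */
static bool tagbit[LINES];    /* 1 = prefetched, not yet referenced  */

static void prefetch(int l) {
    if (l < LINES && !present[l]) { present[l] = true; tagbit[l] = true; }
}

static void access_line(int l) {
    if (!present[l]) {                     /* demand miss */
        printf("line %2d: miss\n", l);
        present[l] = true; tagbit[l] = false;
        prefetch(l + 1);                   /* prefetch the next line on a miss   */
    } else {
        printf("line %2d: hit\n", l);
        if (tagbit[l]) {                   /* first use of a prefetched line     */
            tagbit[l] = false;
            prefetch(l + 1);               /* keep running one block ahead (OBL) */
        }
    }
}

int main(void) {
    for (int l = 0; l < 6; l++) access_line(l);   /* only line 0 misses */
    return 0;
}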