1
COSC6385 Advanced Computer Architecture Lecture 6. Cache and Memory
Instructor: Olga Datskova Computer Science Department University of Houston
2
An Unbalanced System: CPU vs. I/O, memory, and cache
Source: Bob Colwell keynote, ISCA
3
Memory Issues
Latency: time to move through the longest circuit path (from the start of the request to the response)
Bandwidth: number of bits transported at one time
Capacity: size of the memory
Energy: cost of accessing memory (to read and write)
4
Model of Memory Hierarchy
Register file → L1 data and instruction caches → L2 cache → main memory → disk; the register file and caches are SRAM, main memory is DRAM.
5
Levels of the Memory Hierarchy
From the upper (faster, smaller) level to the lower (larger, cheaper) level, with who manages each level and the unit transferred between levels:
Registers: 100s of bytes, <10 ns; managed by the compiler; transfer unit: 1-8 bytes (instruction operands)
Cache: KBs, ~10 ns, roughly 1-0.1 cents/bit; managed by the cache controller; transfer unit: 8-128 bytes (cache lines) -- our focus
Main memory: MBs, 200-500 ns, a small fraction of a cent per bit; managed by the operating system; transfer unit: 512 bytes-4 KB (pages)
Disk: GBs, ~10 ms (10,000,000 ns), about 10^-5 to 10^-6 cents/bit; managed by the user; transfer unit: MBs (files)
Tape: effectively infinite capacity, seconds to minutes, about 10^-8 cents/bit
6
Topics Covered
Why caches work: the principle of program locality
Cache hierarchy: average memory access time (AMAT)
Types of caches: direct mapped, set-associative, fully associative
Cache policies: write back vs. write through; write allocate vs. no write allocate
7
Principle of Locality
Programs access a relatively small portion of the address space at any instant of time.
Two types of locality:
Temporal locality (locality in time): if an address is referenced, it tends to be referenced again soon (e.g., loops, data reuse)
Spatial locality (locality in space): if an address is referenced, neighboring addresses tend to be referenced soon (e.g., straight-line code, array accesses)
Traditionally, hardware has relied on locality for speed.
Locality is a program property that is exploited in machine design.
8
Example of Locality
int A[100], B[100], C[100], D;
for (i = 0; i < 100; i++) {
    C[i] = A[i] * B[i] + D;
}
The arrays are laid out contiguously in memory, so one cache line (one fetch) brings in several consecutive elements (e.g., A[0]-A[3]); every element fetched is used in turn, and D is reused on every iteration.
9
Modern Memory Hierarchy
By taking advantage of the principle of locality we can:
Present the user with as much memory as is available in the cheapest technology.
Provide access at the speed offered by the fastest technology.
The datapath and registers sit closest to the processor, backed by split L1 instruction and data caches, second- and third-level SRAM caches, DRAM main memory, secondary storage (disk), and tertiary storage (disk/tape).
10
Example: Intel Core 2 Duo
Two cores (Core0, Core1), each with split L1 instruction and data caches (IL1, DL1) and a shared L2.
L1: 32 KB, 8-way, 64-byte lines, LRU, write-back, 3-cycle latency
L2: 4.0 MB, 16-way, 64-byte lines, LRU, write-back, 14-cycle latency
11
Example: Intel Itanium 2
3 MB version: 180 nm process, 421 mm² die
6 MB version: 130 nm process, 374 mm² die
12
Intel Nehalem
Die photo: Core 0 and Core 1 (3 MB of cache shown beside each core) and a 24 MB L3.
13
Cache Terminology
Hit: the data appears in some line of the upper level
Hit rate: the fraction of memory accesses found in that level
Hit time: time to access the level (RAM access time + time to determine hit/miss)
Miss: the data must be retrieved from a block in the lower level (e.g., block Y)
Miss rate = 1 - hit rate
Miss penalty: time to replace a line in the upper level + time to deliver the block to the processor
Hit time << miss penalty
14
Average Memory Access Time
Average memory access time = hit time + miss rate × miss penalty
Miss penalty (time to fetch a block from the lower memory level) has two parts:
  access time: a function of latency
  transfer time: a function of the bandwidth between levels
Transfers move one cache line/block at a time, at the width of the memory bus.
15
Memory Hierarchy Performance
First-level cache (hit time ≈ 1 clk) backed directly by main memory (DRAM, ≈ 300 clks).
Average memory access time (AMAT) = hit time + miss rate × miss penalty = Thit(L1) + Miss%(L1) × T(memory)
Example: cache hit = 1 cycle, miss rate = 10% = 0.1, miss penalty = 300 cycles
AMAT = 1 + 0.1 × 300 = 31 cycles
Can we improve it?
16
Reducing Penalty: Multi-Level Cache
On-die: first-level cache (1 clk), second-level cache (10 clks), third-level cache (20 clks); main memory (DRAM) at 300 clks.
Average memory access time (AMAT) = Thit(L1) + Miss%(L1) × (Thit(L2) + Miss%(L2) × (Thit(L3) + Miss%(L3) × T(memory)))
17
AMAT of Multi-Level Memory
AMAT = Thit(L1) + Miss%(L1) × Tmiss(L1)
     = Thit(L1) + Miss%(L1) × [Thit(L2) + Miss%(L2) × Tmiss(L2)]
     = Thit(L1) + Miss%(L1) × {Thit(L2) + Miss%(L2) × [Thit(L3) + Miss%(L3) × T(memory)]}
18
AMAT Example
AMAT = Thit(L1) + Miss%(L1) × (Thit(L2) + Miss%(L2) × (Thit(L3) + Miss%(L3) × T(memory)))
Example: miss rate L1 = 10%, Thit(L1) = 1 cycle; miss rate L2 = 5%, Thit(L2) = 10 cycles; miss rate L3 = 1%, Thit(L3) = 20 cycles; T(memory) = 300 cycles
AMAT = 1 + 0.1 × (10 + 0.05 × (20 + 0.01 × 300)) = 2.115 cycles (compare to 31 cycles with no multi-level cache): a 14.7× speed-up, worked out in the small sketch below.
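A minimal C sketch of the two AMAT calculations above, using the example's hit times, miss rates, and memory latency; it just evaluates the formula and is not a model of any specific machine.

#include <stdio.h>

int main(void) {
    double t_mem = 300.0;                      /* main-memory latency (cycles) */

    /* Single-level cache: AMAT = Thit(L1) + Miss%(L1) * T(memory) */
    double amat_l1_only = 1.0 + 0.10 * t_mem;  /* = 31 cycles */

    /* Three-level hierarchy: expand the recursive formula bottom-up */
    double amat_l3 = 20.0 + 0.01 * t_mem;      /* miss penalty seen by L2 */
    double amat_l2 = 10.0 + 0.05 * amat_l3;    /* miss penalty seen by L1 */
    double amat_l1 = 1.0  + 0.10 * amat_l2;    /* = 2.115 cycles */

    printf("AMAT, L1 only : %.3f cycles\n", amat_l1_only);
    printf("AMAT, L1+L2+L3: %.3f cycles\n", amat_l1);
    printf("Speed-up      : %.1fx\n", amat_l1_only / amat_l1);
    return 0;
}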
19
Types of Caches
The type of cache determines how memory data is mapped to cache locations and how complex it is to search the cache.
Direct mapped (DM): a memory value can be placed at exactly one corresponding location in the cache; fast indexing mechanism
Set-associative (SA): a memory value can be placed in any of a set of locations in the cache; slightly more involved search mechanism
Fully associative (FA): a memory value can be placed in any location in the cache; extensive hardware resources required to search (CAM)
DM and FA can be thought of as special cases of SA: DM is 1-way SA, FA is all-way SA.
20
Direct Mapped Cache
(Figure: memory addresses 0-F map onto a 4-line DM cache; each cache entry holds one cache line/block.)
Cache location 0 is occupied by data from memory locations 0, 4, 8, and C.
Which one should we place in the cache?
How can we tell which one is in the cache?
21
Three (or Four) Cs (Cache Miss Terms)
Compulsory misses: cold-start misses; the cache holds no valid data at the start of the program
Capacity misses: the cache cannot hold the program's working set; remedy: increase the cache size
Conflict misses: too many lines map to the same location; remedy: increase cache size and/or associativity (associative caches reduce conflict misses)
Coherence misses: arise in multiprocessor systems (later lectures)
22
Example: 1 KB DM Cache, 32-Byte Lines
The lowest M bits of the address are the offset (line size = 2^M); the index is log2(# of sets) bits.
For a 32-bit address: bits [31:10] are the tag, bits [9:5] the index (e.g., 0x01), bits [4:0] the offset (e.g., 0x00).
Each set holds a valid bit, the cache tag, and 32 bytes of cache data (bytes 0-31 in set 0, bytes 32-63 in set 1, ..., bytes 992-1023 in set 31).
23
Example: 1 KB DM Cache, 32-Byte Lines
lw from 0x77FF1C68: the address is split into tag, index, and offset fields; the index selects one set of the DM cache, the stored tag for that set is compared against the address tag in the tag array, and the offset selects the requested word within the 32-byte line in the data array.
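A small C sketch that performs the same field split for this cache geometry (5 offset bits for 32-byte lines, 5 index bits for 32 sets, 22 tag bits); the values it prints for 0x77FF1C68 are computed by the program, not copied from the slide's figure.

#include <stdio.h>
#include <stdint.h>

#define OFFSET_BITS 5    /* 32-byte line -> 2^5 bytes          */
#define INDEX_BITS  5    /* 1 KB / 32 B  = 32 sets -> 2^5 sets */

int main(void) {
    uint32_t addr   = 0x77FF1C68u;
    uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);
    uint32_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
    uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);

    printf("addr=0x%08X  tag=0x%06X  index=%u  offset=%u\n",
           (unsigned)addr, (unsigned)tag, (unsigned)index, (unsigned)offset);
    return 0;
}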
24
DM Cache Speed Advantage
The index field drives the tag array and the data array at the same time, so tag and data access happen in parallel: faster cache access!
25
Associative Caches Reduce Conflict Misses
Set-associative (SA) cache: multiple possible locations within a set
Fully associative (FA) cache: any location in the cache
Hardware and speed overhead: comparators and multiplexors; data can be selected only after the hit/miss determination (i.e., after the tag comparison)
26
Set-Associative Cache (2-Way)
The cache index selects a set; the two tags in the set are compared in parallel, and data is selected based on the tag comparison result.
Each way holds a valid bit, a cache tag, and a cache line; the two per-way comparisons are ORed to form the hit signal, and a multiplexor (Sel0/Sel1) delivers the hitting way's cache line.
The additional circuitry compared with DM caches makes SA caches slower to access than a DM cache of comparable size.
27
Set-Associative Cache (2-Way)
lw from 0x77FF1C78 (32-bit address): the index field selects one set across both ways; tag array 0 and tag array 1 are probed in parallel, and the matching way's data array supplies the line.
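A small C sketch of the 2-way lookup, assuming illustrative sizes (64 sets, 64-byte lines) rather than the lecture's: the index picks one set, the tags of both ways are compared (the hardware does this in parallel; the loop stands in for the two comparators), and a hit selects the matching way.

#include <stdint.h>
#include <stdbool.h>

#define OFFSET_BITS 6                    /* 64-byte lines (illustrative) */
#define INDEX_BITS  6                    /* 64 sets (illustrative)       */
#define NUM_SETS    (1 << INDEX_BITS)
#define WAYS        2

typedef struct {
    bool     valid[WAYS];
    uint32_t tag[WAYS];
    uint8_t  data[WAYS][1 << OFFSET_BITS];
} cache_set_t;

static cache_set_t sets[NUM_SETS];

/* Returns the hitting way, or -1 on a miss. */
int lookup(uint32_t addr) {
    uint32_t index = (addr >> OFFSET_BITS) & (NUM_SETS - 1);
    uint32_t tag   = addr >> (OFFSET_BITS + INDEX_BITS);
    cache_set_t *set = &sets[index];

    for (int w = 0; w < WAYS; w++)                 /* both "comparators"  */
        if (set->valid[w] && set->tag[w] == tag)
            return w;                              /* select this way     */
    return -1;                                     /* no way matched: miss */
}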
28
Fully Associative Cache
The address is split into just a tag and an offset (no index); the tag is associatively searched against every stored tag, the matching entry's data is selected through a multiplexor, and the requested bytes are extracted with a rotate-and-mask step.
29
Fully Associative Cache
Every entry holds a tag and data, and the address tag is compared against all of them in parallel; the matching entry supplies the read data (or receives the write data).
The additional circuitry compared with DM caches is even more extensive than for SA caches, making FA caches slower to access than either DM or SA caches of comparable size.
30
Cache Write Policy
Write through: the value is written to both the cache line and the lower-level memory.
Write back: the value is written only to the cache line; the modified cache line is written to main memory only when it has to be replaced.
Is the cache line clean (holds the same value as memory) or dirty (holds a different value than memory)?
31
Write-Through Policy
A store updates the cache line and propagates to memory, so the processor, the cache, and memory all hold the same value (e.g., 0x1234).
32
Write Buffer
The processor writes data into the cache and into the write buffer; the memory controller drains the buffer's contents to DRAM.
The write buffer is a FIFO structure, typically 4 to 8 entries.
Desirable: writes occur much more slowly than DRAM can retire them.
Memory system designer's nightmare: write buffer saturation (writes arrive as fast as, or faster than, DRAM write cycles can absorb them).
33
Write-Back Policy
On a write, only the cache line is updated and marked dirty; memory still holds the old value. Only when that dirty line must be replaced (for example, on a later write miss that maps to the same cache location) is its value written back to memory before the line is reused for the new data.
34
On Write Miss
Write allocate: the line is allocated on a write miss, followed by the write-hit actions above; write misses first act like read misses.
No write allocate: write misses do not affect the cache; the line is modified only in the lower-level memory. Mostly used with write-through caches.
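A small C sketch contrasting the two write-miss policies; the printfs stand in for the actual cache and memory machinery, and the address is just the example address reused from earlier slides.

#include <stdio.h>
#include <stdint.h>

enum write_miss_policy { WRITE_ALLOCATE, NO_WRITE_ALLOCATE };

static void fill_line_from_memory(uint32_t a) { printf("  fetch line for 0x%08X into the cache\n", (unsigned)a); }
static void write_word_in_cache(uint32_t a)   { printf("  write word at 0x%08X in the cache line\n", (unsigned)a); }
static void write_word_to_memory(uint32_t a)  { printf("  write word at 0x%08X directly to memory\n", (unsigned)a); }

static void handle_write_miss(uint32_t addr, enum write_miss_policy p) {
    if (p == WRITE_ALLOCATE) {
        fill_line_from_memory(addr);   /* the miss first acts like a read miss ...    */
        write_word_in_cache(addr);     /* ... then completes as an ordinary write hit */
    } else {
        write_word_to_memory(addr);    /* cache untouched: modify lower-level memory only */
    }
}

int main(void) {
    puts("write allocate:");    handle_write_miss(0x77FF1C68u, WRITE_ALLOCATE);
    puts("no write allocate:"); handle_write_miss(0x77FF1C68u, NO_WRITE_ALLOCATE);
    return 0;
}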
35
Quick Recap
Processor-memory performance gap
The memory hierarchy exploits program locality to reduce AMAT
Types of caches: direct mapped, set-associative, fully associative
Cache policies: write through vs. write back; write allocate vs. no write allocate
36
Cache Replacement Policy
Random: replace a randomly chosen line
FIFO: replace the oldest line
LRU (Least Recently Used): replace the least recently used line
NRU (Not Recently Used): replace one of the lines that has not been used recently (used in the Itanium 2 L1 D-cache, L2, and L3 caches)
37
LRU Policy
A 4-entry set ordered from MRU to LRU, initially A B C D:
Access C: C A B D (hit)
Access D: D C A B (hit)
Access E: E D C A (miss, replacement needed)
Access C: C E D A (hit)
Access G: G C E D (miss, replacement needed)
38
LRU From Hardware Perspective
A state machine observes accesses to ways 0-3 and updates the LRU ordering on every access (e.g., an access to line D updates the state for that way).
The LRU policy increases cache access times, and additional hardware bits are needed for the LRU state machine.
39
LRU Algorithms
True LRU: expensive in terms of speed and hardware; it must remember the order in which all N lines were last accessed. There are N! possible orderings, so O(log N!) = O(N log N) LRU bits are needed.
  2 ways: AB, BA (2 = 2! orderings)
  3 ways: ABC, ACB, BAC, BCA, CAB, CBA (6 = 3! orderings)
Pseudo-LRU: O(N) bits; approximates the LRU policy with a binary tree, as in the sketch below.
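A small C sketch of tree-based pseudo-LRU for one 4-way set: three bits approximate the full LRU order (true LRU for 4 ways needs ceil(log2 4!) = 5 bits). The bit conventions here are one common choice, not necessarily what any particular processor implements.

#include <stdio.h>
#include <stdint.h>

typedef struct { uint8_t b0, b1, b2; } plru_t;   /* b0 picks the pair, b1/b2 pick within a pair */

/* Update the tree so every bit points *away* from the way just accessed. */
static void plru_touch(plru_t *t, int way) {
    if (way < 2) { t->b0 = 1; t->b1 = (way == 0); }   /* accessed left pair  -> victim on right */
    else         { t->b0 = 0; t->b2 = (way == 2); }   /* accessed right pair -> victim on left  */
}

/* Follow the bits to find the (approximately) least recently used way. */
static int plru_victim(const plru_t *t) {
    if (t->b0 == 0) return t->b1 ? 1 : 0;             /* victim in left pair  */
    else            return t->b2 ? 3 : 2;             /* victim in right pair */
}

int main(void) {
    plru_t t = {0, 0, 0};
    int accesses[] = {0, 2, 1, 3, 0};
    for (unsigned i = 0; i < sizeof accesses / sizeof *accesses; i++) {
        plru_touch(&t, accesses[i]);
        printf("touch way %d -> next victim would be way %d\n",
               accesses[i], plru_victim(&t));
    }
    return 0;
}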
40
Reducing Miss Rate
Enlarge the cache.
If the cache size is fixed: increase associativity, or increase the line size.
Does increasing the line size always work? No: beyond a point, larger lines increase cache pollution and the miss rate rises again (figure: miss rate vs. block size for several cache sizes).
41
Reduce Miss Rate: Code Optimization
Misses occur when sequentially accessed array elements come from different cache lines.
Code optimizations require no hardware change; they rely on programmers or compilers.
Examples:
Loop interchange: in nested loops, the outer loop becomes the inner loop and vice versa
Loop blocking: partition a large array into smaller blocks so the accessed array elements fit in the cache, enhancing cache reuse
42
Loop Interchange
x is stored in row-major order.
/* Before */
for (j = 0; j < 100; j++)
    for (i = 0; i < 5000; i++)
        x[i][j] = 2 * x[i][j];
What is the worst that could happen? Hint: DM cache.
/* After */
for (i = 0; i < 5000; i++)
    for (j = 0; j < 100; j++)
        x[i][j] = 2 * x[i][j];
The interchanged loop walks each row in memory order, improving cache efficiency.
43
Loop Blocking
/* Before */
for (i = 0; i < N; i++)
    for (j = 0; j < N; j++) {
        r = 0;
        for (k = 0; k < N; k++)
            r += y[i][k] * z[k][j];
        x[i][j] = r;
    }
For each x[i][j], the inner loop streams an entire row of y and an entire column of z; for large N this does not exploit locality.
44
Loop Blocking
Partition the loop's iteration space into many smaller chunks and ensure that the data stays in the cache until it is reused: the computation touches sub-blocks of x, y, and z at a time. A blocked version is sketched below.
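A C99 sketch of the blocked version, assuming x is zero-initialized before the call; BLK is a tuning knob, typically chosen so roughly three BLK×BLK sub-blocks fit in the cache.

#define BLK 32                                     /* block size, illustrative */

static int min(int a, int b) { return a < b ? a : b; }

void matmul_blocked(int n, double x[n][n], double y[n][n], double z[n][n]) {
    for (int jj = 0; jj < n; jj += BLK)            /* block of columns of z and x */
        for (int kk = 0; kk < n; kk += BLK)        /* block of the k dimension    */
            for (int i = 0; i < n; i++)
                for (int j = jj; j < min(jj + BLK, n); j++) {
                    double r = 0.0;
                    for (int k = kk; k < min(kk + BLK, n); k++)
                        r += y[i][k] * z[k][j];
                    x[i][j] += r;                  /* partial sum for this kk block */
                }
}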
45
Other Miss Penalty Reduction Techniques
Critical word first and early restart: send the requested data in the leading edge of the transfer; the trailing edge of the transfer continues in the background
Give priority to read misses over writes: use a write buffer (write-through) and a write-back buffer (write-back)
Combining writes: a write-combining buffer (cf. Intel's WC, write-combining, memory type)
Victim caches, assist caches, non-blocking caches, data prefetch mechanisms
46
Write Combining Buffer
Without combining, writes to addresses 100, 108, 116, and 124 occupy four buffer entries (each valid, each holding one word: Mem[100], Mem[108], Mem[116], Mem[124]) and initiate four separate writes back to lower-level memory.
A write-combining buffer merges neighboring addresses into one entry (write address 100 holding Mem[100], Mem[108], Mem[116], Mem[124]), so a single write goes back to lower-level memory.
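A small C sketch of a single write-combining entry: byte writes that fall into the same 32-byte block are merged into one entry (and thus one eventual bus write). The block size and the flush-on-conflict behavior are illustrative simplifications.

#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>
#include <string.h>

#define BLOCK 32

typedef struct {
    bool     valid;
    uint32_t block_addr;           /* base address of the 32-byte block */
    bool     byte_valid[BLOCK];    /* which bytes have been written     */
    uint8_t  data[BLOCK];
} wc_entry_t;

static void wc_write(wc_entry_t *e, uint32_t addr, uint8_t value) {
    uint32_t blk = addr & ~(uint32_t)(BLOCK - 1);
    if (!e->valid || e->block_addr != blk) {   /* (a real buffer would flush here) */
        memset(e, 0, sizeof *e);
        e->valid = true;
        e->block_addr = blk;
    }
    e->data[addr % BLOCK]       = value;       /* combine into the existing entry  */
    e->byte_valid[addr % BLOCK] = true;
}

int main(void) {
    wc_entry_t e = {0};
    /* Writes to 100, 108, 116, 124 all fall in the block starting at 96,
       so they occupy one entry and need one write back instead of four. */
    wc_write(&e, 100, 1); wc_write(&e, 108, 2);
    wc_write(&e, 116, 3); wc_write(&e, 124, 4);
    printf("block at %u buffered in a single entry\n", (unsigned)e.block_addr);
    return 0;
}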
47
Cache Penalty Reduction Techniques
Victim cache Assist cache Non-blocking cache Data Prefetch mechanism
48
3Cs Absolute Miss Rate (SPEC92)
Compulsory misses are a tiny fraction of the overall misses.
Capacity misses decrease with increasing cache size.
Conflict misses decrease with increasing associativity.
49
2:1 Cache Rule
The miss rate of a direct-mapped cache of size X is approximately the same as that of a 2-way set-associative cache of size X/2.
50
Victim Caching [Jouppi '90]
A victim cache (VC) is a small, fully associative structure between L1 and memory; it is especially effective with direct-mapped caches.
Whenever a line is displaced from the L1 cache, it is loaded into the VC.
The processor checks both L1 and the VC simultaneously.
If L1 misses and the VC hits, the data is swapped between the VC and L1.
When data has to be evicted from the VC, it is written back to memory.
51
% of Conflict Misses Removed
(Figure: percentage of conflict misses removed by the victim cache, shown for the D-cache and the I-cache.)
52
Assist Cache [Chan et al. '96]
The on-chip assist cache avoids thrashing in the main (off-chip) L1 cache; both run at full speed.
64 × 32-byte fully associative CAM.
Data enters the assist cache on a miss (FIFO replacement policy in the assist cache).
On eviction from the assist cache, data is conditionally moved to L1 or back to memory.
Lines brought in by "spatial locality hint" instructions are flushed back to memory, reducing pollution.
53
Multi-Lateral Cache Architecture
A fully connected multi-lateral cache architecture: the processor core is connected to two cache structures A and B, which both connect to memory.
Most cache architectures can be generalized into this form.
54
Cache Architecture Taxonomy
Instances of the general description (processor, caches A and B, memory) include a single-level cache, a two-level cache, a victim cache, and an assist cache; they differ in how A and B connect to the processor, to each other, and to memory.
55
Non-Blocking (Lockup-Free) Cache [Kroft '81]
Prevents the pipeline from stalling on cache misses: the cache continues to serve hits to other lines while servicing a miss on one or more lines.
Uses Miss Status Handling Registers (MSHRs) to track cache misses, allocating one entry per outstanding miss (called a fill buffer in the Intel P6 family).
A new cache miss is checked against the MSHRs; the pipeline stalls on a miss only when the MSHRs are full.
The number of MSHR entries should be chosen carefully to match the sustainable bus bandwidth.
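A small C sketch of the MSHR bookkeeping: a miss to a line that is already outstanding merges into its entry, a miss to a new line allocates a free entry, and the request stalls only when every entry is busy. The entry count and line size are illustrative.

#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

#define NUM_MSHR   2
#define LINE_BITS  6

typedef struct { bool busy; uint32_t line_addr; int pending; } mshr_t;
static mshr_t mshr[NUM_MSHR];

/* Returns false if the miss cannot be accepted (structural stall). */
static bool handle_miss(uint32_t addr) {
    uint32_t line = addr >> LINE_BITS;
    for (int i = 0; i < NUM_MSHR; i++)          /* line already being fetched?   */
        if (mshr[i].busy && mshr[i].line_addr == line) {
            mshr[i].pending++;                  /* merge the secondary miss      */
            return true;
        }
    for (int i = 0; i < NUM_MSHR; i++)          /* otherwise allocate a free one */
        if (!mshr[i].busy) {
            mshr[i] = (mshr_t){ true, line, 1 };
            return true;
        }
    return false;                               /* all MSHRs busy: stall         */
}

int main(void) {
    uint32_t misses[] = { 0x1000, 0x2000, 0x1004, 0x3000 };
    for (unsigned i = 0; i < 4; i++)
        printf("miss 0x%X -> %s\n", (unsigned)misses[i],
               handle_miss(misses[i]) ? "accepted" : "stall (MSHRs full)");
    return 0;
}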
56
Bus Utilization (MSHR = 2)
(Figure: memory-bus timeline for misses m1-m5. Each miss pays a lead-off latency and then transfers 4 data chunks, separated by the initiation interval; with only two MSHRs, later misses stall waiting for a free MSHR and the bus sits idle between transfers.)
57
Bus Utilization (MSHR = 4)
Time Stall Data Transfer Bus Idle BUS Memory bus utilization
58
Prefetch (Data/Instruction)
Predict what data will be needed in the future.
Pollution vs. latency reduction: a correct prediction reduces latency; a misprediction brings in unwanted data and pollutes the cache.
To determine effectiveness, ask:
  When to initiate a prefetch? (timeliness)
  Which lines to prefetch?
  How big a line to prefetch? (note that the cache-line mechanism already performs a form of prefetching)
  What to replace?
Software (data) prefetching vs. hardware prefetching.
59
Software-Controlled Prefetching
Uses instructions:
  An existing instruction: Alpha's load to r31 (hardwired to 0)
  Specialized instructions and hints: Intel SSE prefetchnta, prefetcht0/t1/t2; MIPS32 PREF; PowerPC dcbt (data cache block touch) and dcbtst (data cache block touch for store)
Prefetch instructions are inserted by the compiler or by hand.
60
Software-Controlled Prefetching
/* Original loop */
for (i = 0; i < N; i++) {
    prefetch(&a[i+1]);
    prefetch(&b[i+1]);
    sop = sop + a[i]*b[i];
}

/* Unroll loop 4 times */
for (i = 0; i < N-4; i += 4) {
    prefetch(&a[i+4]);
    prefetch(&b[i+4]);
    sop = sop + a[i]*b[i];
    sop = sop + a[i+1]*b[i+1];
    sop = sop + a[i+2]*b[i+2];
    sop = sop + a[i+3]*b[i+3];
}
sop = sop + a[N-4]*b[N-4];
sop = sop + a[N-3]*b[N-3];
sop = sop + a[N-2]*b[N-2];
sop = sop + a[N-1]*b[N-1];

The prefetch latency should be covered by the computation time of one iteration (prefetch latency <= computation time).
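As a concrete example, the generic prefetch() calls above can be written with GCC/Clang's __builtin_prefetch; the unrolling factor and the prefetch distance (16 elements ahead) are illustrative tuning knobs, and the builtin is only a hint to the hardware.

double dot_prefetch(const double *a, const double *b, int n) {
    double sop = 0.0;
    int i;
    for (i = 0; i + 4 <= n; i += 4) {            /* unrolled by 4 */
        __builtin_prefetch(&a[i + 16]);          /* prefetch a few lines ahead (hint only) */
        __builtin_prefetch(&b[i + 16]);
        sop += a[i]     * b[i];
        sop += a[i + 1] * b[i + 1];
        sop += a[i + 2] * b[i + 2];
        sop += a[i + 3] * b[i + 3];
    }
    for (; i < n; i++)                           /* epilogue for leftover elements */
        sop += a[i] * b[i];
    return sop;
}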
61
Hardware-Based Prefetching
Sequential prefetching: prefetch on miss, and tagged prefetch.
Both techniques are based on One Block Lookahead (OBL) prefetching: prefetch line (L+1) when line L is accessed, based on some criterion.
62
Sequential Prefetching
Prefetch on miss: initiate a prefetch of (L+1) whenever an access to L results in a miss. The Alpha does this for instructions (prefetched instructions are stored in a separate structure called a stream buffer).
Tagged prefetch: whenever there is a "first use" of a line (demand fetched or previously prefetched), prefetch the next one.
  One additional tag bit per cache line: the prefetched, not-yet-used line has Tag = 1.
  Tag bit = 0: the line was demand fetched, or a prefetched line has already been referenced for the first time.
  Prefetch (L+1) only if the tag bit of L is 1 when L is accessed.
63
Sequential Prefetching
Prefetch-on-miss, when accessing contiguous lines: a demand-fetched line is followed by a prefetched line; the prefetched line hits, but the line after it misses again, so misses and hits alternate.
Tagged prefetch, when accessing contiguous lines: after the initial demand miss, the first use of each prefetched line (tag = 1) triggers the next prefetch, so the remaining accesses all hit.
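A small C sketch simulating the tagged-prefetch behavior just described on a run of contiguous line accesses: only the very first access misses, because the first use of each prefetched line (tag = 1) triggers the prefetch of the next line. The structure sizes are illustrative.

#include <stdio.h>
#include <stdbool.h>

#define LINES 16
static bool present[LINES];   /* line is in the cache                */
static bool tagbit[LINES];    /* 1 = prefetched, not yet referenced  */

static void prefetch(int l) {
    if (l < LINES && !present[l]) { present[l] = true; tagbit[l] = true; }
}

static void access_line(int l) {
    if (!present[l]) {                     /* demand miss */
        printf("line %2d: miss\n", l);
        present[l] = true; tagbit[l] = false;
        prefetch(l + 1);                   /* prefetch the next line on a miss   */
    } else {
        printf("line %2d: hit\n", l);
        if (tagbit[l]) {                   /* first use of a prefetched line     */
            tagbit[l] = false;
            prefetch(l + 1);               /* keep running one block ahead (OBL) */
        }
    }
}

int main(void) {
    for (int l = 0; l < 6; l++) access_line(l);   /* only line 0 misses */
    return 0;
}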