
2 Improving Cache Performance Four categories of optimisation: –Reduce miss rate –Reduce miss penalty –Reduce miss rate or miss penalty using parallelism –Reduce hit time AMAT = Hit time + Miss rate × Miss penalty
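
As a quick worked example of the AMAT formula, here is a minimal C sketch; the cycle counts are illustrative, not taken from the slides:

    /* AMAT = hit time + miss rate * miss penalty (illustrative numbers). */
    #include <stdio.h>

    int main(void) {
        double hit_time     = 1.0;    /* cycles for a hit                    */
        double miss_rate    = 0.05;   /* 5% of accesses miss                 */
        double miss_penalty = 100.0;  /* cycles to fetch from the next level */
        double amat = hit_time + miss_rate * miss_penalty;
        printf("AMAT = %.1f cycles\n", amat);   /* 1 + 0.05*100 = 6.0 */
        return 0;
    }

With these numbers a 5% miss rate turns a 1-cycle hit time into an average of 6 cycles per access, which is why the following slides attack each of the three terms.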

3 5.5. Reducing Miss Rate Three sources of misses: –Compulsory: “cold start” misses –Capacity: the cache is full –Conflict: the set is full/block is occupied Techniques: increase block size, increase cache size, increase degree of associativity

4 Larger Block Size Bigger blocks reduce compulsory misses –Spatial locality BUT: –Increased miss penalty More data to transfer –Possibly increased overall miss rate More conflict and capacity misses as there are fewer blocks

5 Effect of Block Size (figure: miss rate, miss penalty (access time + transfer time) and AMAT each plotted against block size)

6 Larger Caches Reduces capacity misses Increases hit time and cost

7 Higher Associativity Miss rates improve with higher associativity Two rules of thumb: –8-way set associative caches are almost as effective as fully associative But much simpler! –2:1 cache rule A direct mapped cache of size N has about the same miss rate as a 2-way set associative cache of size N/2

8 Way Prediction Set-associative cache predicts which block will be needed on next access to the set Only one tag check is done –If mispredicted the whole set must be checked E.g. Alpha 21264 instruction cache –Prediction rate > 85% –Correct prediction: 1 cycle hit –Misprediction: 3 cycles

9 Pseudo-Associative Caches Check a direct mapped cache for a hit as usual If it misses, check a second block –Invert MSB of index One fast and one slow hit time
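
A minimal sketch of the second probe described above, assuming a 10-bit cache index (the width is illustrative): the alternate block is found by inverting the most-significant bit of the index.

    /* Pseudo-associative lookup sketch: on a miss in the primary block,
       probe the block whose index has its most-significant bit inverted. */
    #define INDEX_BITS 10   /* illustrative index width */

    unsigned alt_index(unsigned index) {
        return index ^ (1u << (INDEX_BITS - 1));
    }
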

10 Compiler Optimisations Compilers can optimise code to minimise miss rates: –Reordering procedures –Aligning basic blocks with cache blocks –Reorganising array element accesses
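
One example of the array-access reorganisation mentioned above is loop interchange. The sketch below assumes C's row-major array layout; the array size is illustrative.

    /* Loop interchange: traverse a row-major array in row order so that
       consecutive accesses fall within the same cache block. */
    #define N 1024
    double a[N][N];

    /* Column-order traversal: each access touches a different cache block. */
    void sum_column_order(double *total) {
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                *total += a[i][j];
    }

    /* Row-order traversal: exploits spatial locality within each block. */
    void sum_row_order(double *total) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                *total += a[i][j];
    }

The two functions compute the same sum; only the order of accesses, and therefore the miss rate, differs.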

11 5.6. Reduce Miss Rate or Miss Penalty via Parallelism Three techniques that overlap instruction execution with memory access

12 Nonblocking caches Dynamic scheduling allows CPU to continue with other instructions while waiting for data Nonblocking cache allows other cache accesses to continue while waiting for data

13 Hardware Prefetching Fetch data/instructions before they are requested by the processor –Either into cache or another buffer Particularly useful for instructions –High degree of spatial locality UltraSPARC III –Special prefetch cache for data –Increases effectiveness by about four times

14 Compiler Prefetching Compiler inserts “prefetch” instructions Two types: –Prefetch register value –Prefetch data cache block Can be faulting or non-faulting Cache continues as normal while data is prefetched
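
A sketch of what compiler-inserted (or hand-inserted) prefetching can look like, shown here with GCC's __builtin_prefetch; the prefetch distance PF_DIST and the loop itself are illustrative.

    /* Software prefetching sketch: hint that x[i + PF_DIST] will be read
       soon, so the block arrives before the load that needs it. */
    #define PF_DIST 16   /* illustrative prefetch distance, in elements */

    void scale(const double *x, double *y, double k, int n) {
        for (int i = 0; i < n; i++) {
            if (i + PF_DIST < n)
                __builtin_prefetch(&x[i + PF_DIST], 0 /* read */, 1);
            y[i] = k * x[i];
        }
    }
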

15 SPARC V9 Prefetch: prefetch [%rs1 + %rs2], fcn prefetch [%rs1 + imm13], fcn where fcn is the prefetch function: 0 = prefetch for several reads, 1 = prefetch for one read, 2 = prefetch for several writes, 3 = prefetch for one write, 4 = prefetch page

16 5.7. Reducing Hit Time Hit time is critical –It often affects the CPU clock cycle time

17 Small, simple caches Small usually equals fast in hardware A small cache may reside on the processor chip –Decreases communication –Compromise: tags on chip, data separate Direct mapped –Data can be read in parallel with tag checking
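
To illustrate why direct-mapped lookup is simple, the sketch below splits an address into tag, index and block offset; the 32-byte block and 512-set geometry are assumptions for the example, not from the slides.

    /* Direct-mapped address split (illustrative 16kB cache, 32-byte blocks):
       the index selects exactly one block, so its data can be read while the
       tag comparison is still in progress. */
    #include <stdint.h>

    #define BLOCK_BITS 5    /* 32-byte blocks */
    #define INDEX_BITS 9    /* 512 blocks     */

    void split_address(uint64_t addr, uint64_t *tag,
                       uint64_t *index, uint64_t *offset) {
        *offset = addr & ((1u << BLOCK_BITS) - 1);
        *index  = (addr >> BLOCK_BITS) & ((1u << INDEX_BITS) - 1);
        *tag    = addr >> (BLOCK_BITS + INDEX_BITS);
    }
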

18 Avoiding address translation Physical caches –Use physical addresses Address translation must happen before cache lookup Virtual caches –Use virtual addresses –Protection issues –High context switching overhead

19 Virtual caches Minimising context switch overhead: –Add a process-identifier tag to the cache Multiple virtual addresses may refer to a single physical address (aliasing) –Hardware enforces anti-aliasing –Software requires the least-significant address bits to be the same (page colouring)
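
A minimal sketch of that software condition, assuming the cache is indexed by the low COLOUR_BITS address bits (the width is illustrative): two virtual aliases are safe only if they agree in those bits.

    /* Page-colouring check sketch: aliases must share the cache-indexing bits. */
    #include <stdbool.h>
    #include <stdint.h>

    #define COLOUR_BITS 14   /* illustrative: cache index + block offset bits */

    bool same_cache_colour(uint64_t va1, uint64_t va2) {
        uint64_t mask = (1u << COLOUR_BITS) - 1;
        return (va1 & mask) == (va2 & mask);
    }
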

20 Avoiding address translation (cont.) Choice of page size: –Bigger than cache index + offset –Address translation and tag lookup can happen in parallel (figure: the CPU address splits into tag, index and offset; the page number passes through VM translation while the page offset indexes the cache)
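
A small sketch of that page-size condition: parallel translation and cache lookup work when the bytes covered by the cache index plus block offset fit within one page. The function and its parameters are illustrative.

    /* Parallel TLB/cache lookup is possible when the cache index and block
       offset come entirely from the page offset. */
    #include <stdbool.h>

    bool parallel_lookup_ok(unsigned page_bytes,
                            unsigned cache_bytes,
                            unsigned associativity) {
        /* Bytes addressed by index + offset = cache size / associativity. */
        unsigned index_plus_offset_bytes = cache_bytes / associativity;
        return index_plus_offset_bytes <= page_bytes;
    }

Raising the associativity shrinks the index, which is one reason larger caches that still want parallel lookup tend to be more associative.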

21 Pipelining cache access Split cache access into several stages –Increases branch and load delays

22 Trace caches Blocks follow program flow rather than spatial locality! Branch prediction is taken into account by the cache Used by the Intel NetBurst microarchitecture Complicates address mapping Minimises wasted space within blocks

23 Cache Optimisation Summary Cache optimisation is very complex –Improving one factor may have a negative impact on another

24 Main Memory Latency and bandwidth are both important Latency is composed of two factors: –Access time –Cycle time Two main technologies: –DRAM –SRAM

25 Virtual Memory Physical memory is divided into blocks –Allocated to processes –Provides protection –Allows swapping to disk –Simplifies loading Historically: –Overlays Programmer-controlled swapping

26 Terminology Block: –Page –Segment Miss: –Page fault –Address fault Memory mapping (address translation) –Virtual address → physical address

27 Characteristics Block size –4kB to 64kB Hit time –50 to 150 cycles Miss penalty –1,000,000 to 10,000,000 cycles Miss rate –0.00001% to 0.001%

28 Categorising VM Systems Fixed block size –Pages Variable block size –Segments –Difficult replacement Hybrid approaches –Paged segments –Multiple page sizes (2^n × the smallest page size)

29 Q1: Block placement? Anywhere in memory –“Fully associative” –Minimises miss rate

30 Q2: Block identification? Page/segment number gives the physical page address –Paging: offset is concatenated –Segmentation: offset is added Uses a page table –One entry per page in the virtual address space –To save space: inverted page table (one entry per page of physical memory)
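
A minimal paging sketch of the translation just described, using a flat page table indexed by virtual page number; the 8kB page size and the table layout are assumptions for illustration.

    /* Translate a virtual address: look up the physical page number with the
       virtual page number, then concatenate the unchanged page offset. */
    #include <stdint.h>

    #define PAGE_BITS 13                 /* illustrative 8kB pages */
    #define PAGE_SIZE (1u << PAGE_BITS)

    uint64_t translate(const uint64_t *page_table, uint64_t vaddr) {
        uint64_t vpn    = vaddr >> PAGE_BITS;        /* virtual page number  */
        uint64_t offset = vaddr & (PAGE_SIZE - 1);   /* page offset          */
        uint64_t ppn    = page_table[vpn];           /* physical page number */
        return (ppn << PAGE_BITS) | offset;          /* concatenate          */
    }
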

31 Q3: Block replacement? Least-recently used (LRU) –Minimises miss rate –Hardware provides a use bit or reference bit

32 Q4: Write strategy? Write back –With a dirty bit You won’t become famous by being the first to try write through!

33 Fast Address Translation Page tables are big –Stored in memory themselves –Two memory accesses for every datum! Principle of locality –Cache recent translations –Translation look-aside buffer (TLB), or translation buffer (TB)
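
A sketch of the TLB idea: a small, fully-associative cache of recent translations consulted before the page table. The 64-entry size echoes the UltraSPARC TLB described later; the structure itself is illustrative.

    /* Tiny fully-associative TLB sketch: hit avoids the page-table access. */
    #include <stdbool.h>
    #include <stdint.h>

    #define TLB_ENTRIES 64

    struct tlb_entry { bool valid; uint64_t vpn; uint64_t ppn; };
    static struct tlb_entry tlb[TLB_ENTRIES];

    bool tlb_lookup(uint64_t vpn, uint64_t *ppn_out) {
        for (int i = 0; i < TLB_ENTRIES; i++) {
            if (tlb[i].valid && tlb[i].vpn == vpn) {   /* hit */
                *ppn_out = tlb[i].ppn;
                return true;
            }
        }
        return false;                                  /* miss: walk page table */
    }
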

34 Alpha 21264 TLB

35 Selecting a Page Size Big –Smaller page table –Allows parallel cache access –Efficient disk transfers –Reduces TLB misses Small –Less memory wastage (internal fragmentation) –Quicker process startup

36 Putting it ALL Together! SPARC Revisited

37 Two SPARCs SuperSPARC –1992 –32-bit superscalar design UltraSPARC –Late 1990s –64-bit design –Graphics support (VIS)

38 UltraSPARC Four-way superscalar execution Two integer ALUs FP unit –Five functional units Graphics unit

39 Pipeline 9 stages: –Fetch –Decode –Grouping –Execution –Cache access –Load miss –Integer pipe wait (for FP/graphics pipelines) –Trap resolution –Writeback

40 Branch Handling Dynamic branch prediction –Two-bit scheme –Every second instruction in the cache has prediction bits (predicts up to 2048 branches) –88% success rate (integer code) Target prediction –Fetches from the predicted path
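
A minimal sketch of a two-bit saturating-counter predictor of the kind referred to above; a single counter is shown for simplicity, whereas a real predictor keeps one per prediction slot.

    /* Two-bit saturating counter: 0,1 predict not taken; 2,3 predict taken.
       Two mispredictions in a row are needed to flip the prediction. */
    #include <stdbool.h>

    static unsigned counter = 2;   /* start weakly taken */

    bool predict_taken(void) { return counter >= 2; }

    void update_predictor(bool taken) {
        if (taken && counter < 3)       counter++;
        else if (!taken && counter > 0) counter--;
    }
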

41 FPU Five functional units: –Add –Multiply –Divide/square root –Two graphics units (add and multiply) Mostly fully pipelined (latency 3 cycles) –Except divide and square root (not pipelined, latency is 22 cycles for 64-bit)

42 Memory Hierarchy On-chip instruction and data caches –Data: 16kB direct-mapped, write-through –Instructions: 16kB 2-way set associative –Both virtually addressed External cache –Up to 4MB

43 Virtual Memory 64-bit virtual addresses → 44-bit physical addresses TLB –64-entry, fully-associative cache

44 Multimedia Support (VIS) Integrated with FPU Partitioned operations –Multiple smaller values in 64-bits Video compression instructions –E.g. motion estimation instruction replaces 48 simple instructions for MPEG compression
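
A software sketch of a partitioned operation: four 16-bit lanes packed into one 64-bit word and added lane by lane with no carries between lanes, which is the style of operation VIS performs in hardware. The 16-bit lane width is chosen for illustration.

    /* Partitioned 16-bit add: each lane wraps independently. */
    #include <stdint.h>

    uint64_t partitioned_add16(uint64_t a, uint64_t b) {
        uint64_t result = 0;
        for (int lane = 0; lane < 4; lane++) {
            uint16_t x = (uint16_t)(a >> (16 * lane));
            uint16_t y = (uint16_t)(b >> (16 * lane));
            result |= (uint64_t)(uint16_t)(x + y) << (16 * lane);
        }
        return result;
    }
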

45 The End!
