
1  An Intelligent Cache System with Hardware Prefetching for High Performance
Jung-Hoon Lee, Seh-woong Jeong, Shin-Dug Kim, and C. C. Weems
IEEE Transactions on Computers, Vol. 52, Issue 5, pp. 607-616, May 2003

2  Abstract
 In this paper, we present a high performance cache structure with a hardware prefetching mechanism that enhances exploitation of spatial and temporal locality. The proposed cache, which we call a Selective-Mode Intelligent (SMI) cache, consists of three parts: a direct-mapped cache with a small block size, a fully associative spatial buffer with a large block size, and a hardware prefetching unit. Temporal locality is exploited by selectively moving small blocks into the direct-mapped cache after monitoring their activity in the spatial buffer for a time period. Spatial locality is enhanced by intelligently prefetching a neighboring block when a spatial buffer hit occurs.
 The overhead of this prefetching operation is shown to be negligible. We also show that the prefetch operation is highly accurate: over 90 percent of all prefetches generated are for blocks that are subsequently accessed. Our results show that the system enables the cache size to be reduced by a factor of four to eight relative to a conventional direct-mapped cache while maintaining similar performance. Also, the SMI cache can reduce the miss ratio by around 20 percent and the average memory access time by 10 percent, compared with a victim-buffer cache configuration.

3  What's the Problem
 Most cache systems tend to emphasize only one of spatial locality and temporal locality
 The two impose contradictory requirements on the hardware structure
 Existing hardware prefetching mechanisms often come with high overhead
 The prefetch generation rate is frequently high, which raises power consumption and increases the memory cycles per instruction (MCPI)

4  Related Work
 The stream cache (direct-mapped cache + additional small buffer)
 Besides the missed data, several consecutive words are prefetched into the stream buffer
 The victim cache (direct-mapped cache + additional small buffer)
 The victim buffer holds the blocks that are discarded from the main cache, reducing conflict misses
 The selective victim cache (direct-mapped cache + additional small buffer)
 Places incoming blocks in the main cache or the victim buffer based on their history of use
 The assist cache (direct-mapped cache + additional small buffer)
 Blocks are first loaded into the assist buffer and only promoted into the main cache if they exhibit temporal locality; temporal locality detection is provided statically by the compiler
 All of these schemes use different associativities with the same block size

5  Related Work (cont.)
 The selective cache (temporal cache + spatial cache)
 Uses a locality prediction table to decide whether the requested data has temporal or spatial locality; data may reside in just one of the two subcaches, depending on the predicted type of locality for a given memory access
 The split temporal/spatial cache (STS)
 At compile time, data accesses are classified as exhibiting either temporal or spatial locality, and tagged for one of these caches
 Both focus on how to detect the type of locality and how to handle a reference based on its predicted locality
 These schemes use the same associativity with different block sizes

6  The Proposed SMI Cache System
 The SMI cache is constructed in three parts:
 A direct-mapped cache with a small block size
 Exploits temporal locality (increases the number of blocks in the cache)
 A fully associative spatial buffer with a large block size
 Exploits spatial locality; the large block size is a multiple of the small block size
 A hardware prefetching unit
 The SMI cache exploits both types of locality by determining whether a block in the spatial buffer also has temporal locality (see the sketch below):
 1) Load a large block containing the missed block into the spatial buffer
 2) Monitor whether the missed block shows strong temporal locality while it is resident in the spatial buffer
 3) When the large block is replaced from the spatial buffer, move the small blocks that showed temporal locality into the direct-mapped cache
 Storing these blocks in the direct-mapped cache extends their lifetime, so temporal locality is enhanced
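The minimal Python sketch below is an illustrative software model of the three components and their per-block state, not the authors' hardware; all names (e.g. new_sb_entry) are my own, and the sizes follow the example on slide 10 (8KB main cache with 8-byte blocks, 1KB spatial buffer with 32-byte blocks).

```python
# Illustrative model of the three SMI components (a sketch under the
# stated size assumptions, not the authors' design).
from collections import OrderedDict

SMALL_BLOCK = 8                      # bytes per main-cache (small) block
LARGE_BLOCK = 32                     # bytes per spatial-buffer (large) block
MAIN_SETS = 8 * 1024 // SMALL_BLOCK  # 1024 sets, direct-mapped

main_cache = [None] * MAIN_SETS      # each entry: (tag, dirty_bit)
spatial_buffer = OrderedDict()       # insertion (FIFO) order: large_tag -> state
prefetch_buffer = {}                 # large_tag -> prefetched data (stub)

def new_sb_entry():
    # One spatial-buffer entry: a hit (H) bit per small block plus one
    # prefetch (P) bit per large block, as on slide 7.
    return {"hit": [False] * (LARGE_BLOCK // SMALL_BLOCK), "P": False}
```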

7  SMI Cache System Structure
 [Figure: block diagram of the SMI cache. Annotations: hit bits distinguish referenced small blocks from unreferenced ones; the prefetch controller determines whether to generate a prefetch operation; the tag of the (i+1)th large block is generated, and the prefetch proceeds only if that block is not in the spatial buffer; just one of the banks is activated per access.]

8  Basic Operation of the SMI Cache
 The main cache and the spatial buffer are searched in parallel (see the sketch below)
 Hit in the main cache (direct-mapped cache): handled as a hit in a conventional L1 cache
 Miss in the main cache, but hit in the spatial buffer: fetch the corresponding small block from the spatial buffer and set its hit bit
 Miss in both the main cache and the spatial buffer: a large block is fetched into the spatial buffer
 Temporal locality is enhanced by
 Increasing the number of blocks in the cache
 Extending the lifetime of small blocks that exhibit temporal locality, by storing those blocks in the direct-mapped cache
 Spatial locality is enhanced by
 Fetching a large block
 Intelligently prefetching data that exhibit spatial locality
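Continuing the sketch from slide 6, a hedged rendering of this lookup order; the hardware searches both structures in parallel, which is modeled sequentially here for clarity:

```python
def access(addr):
    # Parallel main-cache / spatial-buffer lookup (sequential in this model).
    index = (addr // SMALL_BLOCK) % MAIN_SETS
    small_tag = addr // (SMALL_BLOCK * MAIN_SETS)
    large_tag = addr // LARGE_BLOCK
    line = main_cache[index]
    if line is not None and line[0] == small_tag:
        return "main cache hit"              # like a conventional L1 hit
    if large_tag in spatial_buffer:
        sub = (addr % LARGE_BLOCK) // SMALL_BLOCK
        spatial_buffer[large_tag]["hit"][sub] = True   # set the hit bit
        return "spatial buffer hit"
    return "global miss"                     # fetch a large block (slide 10)
```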

9  Operational Model in the Case of Cache Hits
 Hit in the main cache
 Read hit -> transmit the requested data to the CPU without delay
 Write hit -> perform the write and set the dirty bit for the block
 Hit in the spatial buffer
 The corresponding small block is sent to the CPU and its hit bit (H) is set
 The prefetch controller generates a prefetch signal when
 A large block is accessed
 Its prefetch bit (P) is not set
 Multiple hit bits (H) in the large block are set
 Two operations are then performed by the prefetch controller (see the sketch below):
 1) Search the tags of the spatial buffer (one-cycle penalty)
 If the (i+1)th large block is already in the spatial buffer -> stop the prefetch and set the P bit
 If the (i+1)th large block is not in the spatial buffer -> the second operation is performed
 2) Prefetch the block into the prefetch buffer and set the P bit of the ith large block
 Thus, if the P bit is set, the consecutive large block must be present in either the spatial buffer or the prefetch buffer
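A sketch of this decision, continuing the model above; PREFETCH_THRESHOLD is my naming for the prefetch-2/prefetch-4 parameter discussed on slide 14:

```python
PREFETCH_THRESHOLD = 2   # 2 = "Prefetch-2"; slide 14 also evaluates 4

def maybe_prefetch(large_tag):
    # Called on a spatial-buffer hit to large block i.
    entry = spatial_buffer[large_tag]
    if entry["P"] or sum(entry["hit"]) < PREFETCH_THRESHOLD:
        return                          # no prefetch signal is generated
    next_tag = large_tag + 1            # the (i+1)th large block
    if next_tag in spatial_buffer:      # 1) tag search: one-cycle penalty
        entry["P"] = True               # already present: stop the prefetch
    else:                               # 2) prefetch and set the P bit
        prefetch_buffer[next_tag] = "pending"
        entry["P"] = True
```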

10  Operational Model in the Case of Cache Misses
 Miss in both the main cache and the spatial buffer
 Bring a large block containing the missed small block into the spatial buffer; two possible cases:
 The spatial buffer is not full
 The spatial buffer is full
 The oldest entry is replaced according to a FIFO policy
 The small blocks in that entry whose hit bits are set are moved into the main cache
 Example of the move operation between the two caches: 8KB main cache with 8-byte blocks, 1KB spatial buffer with 32-byte blocks
 In the main cache, the tag is 19 bits, the index is 10 bits, and the offset is 3 bits
 In the spatial buffer, the tag is 27 bits and the offset is 5 bits
 Suppose only the hit bit of the first small block is set: its two-bit small-block offset '00' is appended to the spatial buffer's 27-bit tag (the offsets for the four small blocks are '00', '01', '10', '11')
 The resulting 29-bit address (27-bit tag + 2-bit offset) is then split into the 19-bit tag and 10-bit index for the main cache (the worked example below traces this)
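A worked rendering of this address formation, with the field widths stated on the slide:

```python
def main_cache_placement(sb_tag, sub_block):
    # The 27-bit spatial-buffer tag is concatenated with the 2-bit
    # small-block offset ('00'..'11') to form a 29-bit small-block
    # address, which splits into a 19-bit tag and a 10-bit index.
    small_block_addr = (sb_tag << 2) | sub_block   # 27 + 2 = 29 bits
    index = small_block_addr & ((1 << 10) - 1)     # low 10 bits
    tag = small_block_addr >> 10                   # high 19 bits
    return tag, index
```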

11  Avoiding Data Incoherence Between the Subcaches
 Mechanism that avoids incoherence (sketched below)
 When a global miss occurs, search the tags of the main cache
 Detect whether any of the small blocks being fetched are already in the main cache
 If a match is detected:
 1) Invalidate the corresponding small blocks in the main cache
 2) Use those dirty small blocks to update the new entry in the spatial buffer
 [Figure: a small block in the main cache holds a modified copy, while the large block being fetched contains the original (stale) version; the stale copy was brought in earlier due to a miss on other blocks. After the update, the valid copy is referenced from the spatial buffer, and the small block is copied back into the main cache when its large block is replaced.]
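A hedged sketch of this check, reusing the helpers above; the dirty-data merge is only indicated in a comment since the model carries no data payloads:

```python
def install_large_block(large_tag):
    # On a global miss: before installing the fetched large block,
    # search the main-cache tags for copies of its four small blocks.
    entry = new_sb_entry()
    for sub in range(LARGE_BLOCK // SMALL_BLOCK):
        tag, index = main_cache_placement(large_tag, sub)
        line = main_cache[index]
        if line is not None and line[0] == tag:
            main_cache[index] = None   # 1) invalidate the stale copy
            # 2) if it was dirty (line[1]), its data would overwrite the
            #    corresponding small block in the new spatial-buffer entry
    spatial_buffer[large_tag] = entry
```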

12  Transfer From the Prefetch Buffer to the Spatial Buffer
 A large block in the prefetch buffer is transferred to the spatial buffer when a global miss occurs, while the cache controller handles the miss
 Therefore, the transfer time can be hidden
 But the missed block may already be in the prefetch buffer
 The tag of the prefetched block is compared with the generated address during the transfer (see the sketch below)
 If the comparison matches:
 1) The requested data is transferred to the CPU and the spatial buffer
 2) The cache controller cancels the ongoing miss signal
 If a global miss occurs while a prefetch operation is being performed, miss handling is deferred until the ongoing prefetch completes
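Continuing the model, the comparison and cancellation might look like this:

```python
def handle_global_miss(addr):
    # While the miss is handled, the missed address is compared against
    # the block held in the prefetch buffer.
    large_tag = addr // LARGE_BLOCK
    if large_tag in prefetch_buffer:      # tag match during the transfer
        prefetch_buffer.pop(large_tag)    # 1) data goes to CPU and buffer
        install_large_block(large_tag)
        return "miss cancelled"           # 2) ongoing miss signal cancelled
    install_large_block(large_tag)        # ordinary miss handling
    return "fetched from next level"
```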

13  Observations on the SMI Cache
 No write-back occurs from the spatial buffer
 Because any referenced small block is moved to the main cache
 Write-back occurs only from the main cache
 This effectively reduces the write traffic to memory: write-back operations involve only 8-byte small blocks
 The following three operations incur no additional delay (see the eviction sketch below)
 Moving small blocks that exhibit temporal locality into the main cache
 Searching the tags of the main cache (for cache coherence)
 Transferring a large block from the prefetch buffer to the spatial buffer
 Because these operations are accomplished while the cache controller is handling a miss
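A sketch of the FIFO eviction from slide 10, which makes the no-write-back property visible: the large block itself is simply discarded, and only displaced main-cache victims can be dirty.

```python
def evict_oldest_spatial_entry():
    # FIFO replacement: promote every referenced (hit-bit-set) small
    # block into the main cache before discarding the large block, so
    # the spatial buffer never writes back to memory.
    large_tag, entry = spatial_buffer.popitem(last=False)   # oldest first
    for sub, was_hit in enumerate(entry["hit"]):
        if not was_hit:
            continue
        tag, index = main_cache_placement(large_tag, sub)
        victim = main_cache[index]
        if victim is not None and victim[1]:
            pass   # write back the dirty 8-byte victim to memory (stub)
        main_cache[index] = (tag, False)   # promote the small block
```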

14  Threshold for Prefetch Generation
 The threshold is the number of hit bits that must be set before a prefetch is generated
 Prefetch-2 achieves more performance gain but greater overhead, due to the increased prefetch frequency
 Prefetch-4 achieves the highest prefetch accuracy
 [Chart: prefetch behavior of Prefetch-2 vs. Prefetch-4. Note that prefetch generation != prefetches actually performed.]

15  Prefetch Overhead
 When a prefetch operation is performed
 Searching the tags of the spatial buffer adds one cycle
 Total of two cycles: the normal access cycle plus the search cycle
 Prefetch-2 tends to have greater search overhead
 The overhead increases because the rate at which the block to be prefetched is already in the spatial buffer increases
 [Chart legend: Case A: P not set, block not in the spatial buffer (the prefetch actually occurs); Case B: P not set, block already in the spatial buffer; Case C: P set.]

16  Rate of Prefetch Operations Actually Performed
 Prefetch-2 has a higher rate of prefetches actually performed
 This rate decreases as the rate at which the block to be prefetched is absent from the spatial buffer decreases, a consequence of the higher prefetch generation rate
 Prefetch-4 has a higher rate of prefetched blocks actually referenced
 This rate increases because the lower prefetch generation rate decreases the rate of prefetches actually performed
 The prefetching accuracy of Prefetch-4 is over 90%

17  Comparison of a Conventional Cache With the SMI Cache
 Assume the SMI cache operates in the "Prefetch-4" configuration
 The average miss ratio of the 8KB SMI cache in nonprefetching mode equals that of a 32KB direct-mapped cache
 Cache size is reduced by a factor of four
 The average miss ratio of the 8KB SMI cache in prefetching mode equals that of a 64KB direct-mapped cache
 Cache size is reduced by a factor of eight

18  Comparison of a Victim Cache With the SMI Cache
 When a victim buffer hit occurs
 Contents are swapped between the main cache and the victim buffer (one-cycle penalty)
 On a miss in both the main cache and the victim buffer
 The content-swap penalty can be hidden
 The SMI cache (8KB main cache with 8-byte blocks, 1KB spatial buffer with 32-byte blocks) outperforms an 8KB victim cache with a 1KB victim buffer and 32-byte blocks
 The SMI cache can further reduce write traffic to memory, due to its smaller block size

19  Relation Between Cost and Performance
 The SMI cache shows about 60% and 80% area reduction compared with 32KB and 64KB direct-mapped caches, respectively
 The SMI cache size can be reduced by a factor of 4 to 8 relative to a direct-mapped cache while maintaining similar performance
 The SMI cache reduces the miss ratio by around 20% and the average memory access time by 10% versus the victim cache
 [Chart: cost vs. performance; the SMI cache is the best configuration in performance.]

20  Conclusions
 Proposed a simple, high-performance, and low-cost cache system
 It exploits both types of locality effectively:
 A direct-mapped cache with a small block size, for exploiting temporal locality
 A fully associative spatial buffer with a large block size, for exploiting spatial locality
 An intelligent hardware prefetching mechanism, enhancing spatial locality
 The SMI cache overcomes the structural drawbacks of direct-mapped caches (e.g., conflict misses and thrashing)

