Jinglei Wang; Yibo Xue; Haixia Wang; Dongsheng Wang. Dept. of Comput. Sci. & Technol., Tsinghua Univ., Beijing, China. This paper appears in: Embedded and Multimedia Computing, 2009 (EM-Com 2009), 4th International Conference. Publication Date: 10-12 Dec. 2009
The large working sets of commercial and scientific workloads favor a shared L2 cache design that maximizes the aggregate cache capacity and minimizes off-chip memory requests in Chip Multiprocessors (CMPs). Two important hurdles restrict the scalability of these chip multiprocessors: the on-chip memory cost of the directory, and the long L1 miss latencies.
Network on Chip (NoC) In a NoC system, modules such as processor cores, memories and specialized IP blocks exchange data using a network as a "public transportation" sub-system for the information traffic. A NoC is constructed from multiple point-to-point data links interconnected by routers, so that messages can be relayed from any source module to any destination module over several links, by making routing decisions at the routers.
Victim Cache A victim cache is a cache used to hold blocks evicted from a CPU cache upon replacement. The victim cache lies between the main cache and its refill path. The victim cache is usually fully associative, and is intended to reduce the number of conflict misses. Only a small fraction of the memory accesses of the program require high associativity. The victim cache exploits this property by providing high associativity to only these accesses.
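The victim-cache behavior described above can be sketched in a few lines of Python. This is an illustrative model, not the paper's implementation; the FIFO replacement and the class/method names are assumptions.

```python
# Minimal sketch of a fully associative victim cache that holds blocks
# evicted from the main cache. Replacement here is FIFO (oldest victim
# leaves first) for simplicity; real designs may use LRU.
from collections import OrderedDict

class VictimCache:
    def __init__(self, num_entries):
        self.entries = OrderedDict()   # tag -> block data, in insertion order
        self.capacity = num_entries

    def insert(self, tag, block):
        # A block evicted from the main cache is kept here; if the
        # victim cache is full, the oldest victim is dropped.
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)
        self.entries[tag] = block

    def lookup(self, tag):
        # On a main-cache miss, a hit here avoids the slower refill path;
        # the entry is removed so the block can move back into the main cache.
        return self.entries.pop(tag, None)
```

Because only the few conflicting blocks land here, a small fully associative structure is enough to give those accesses the high associativity they need.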
Baseline Architecture The tiled CMP is organized as a 2D array of replicated tiles, each with a core, a private L1 cache, an L2 cache slice, and a router that connects the tile to the network on chip. The L2 cache slices form a logically shared L2 cache. L1 cache misses are sent to the corresponding home tile, which looks up the directory information and performs the actions needed to ensure coherence. L1 caches are kept coherent using a directory-based cache coherence protocol.
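The home tile of a block can be sketched as a static mapping on block-address bits. Interleaving on the low-order bits of the block address is a common convention assumed here; the paper does not give its exact mapping, so the constants below are illustrative.

```python
# Hypothetical static home-tile mapping for a tiled CMP with a logically
# shared L2: each block's directory lives at a fixed home tile chosen by
# low-order bits of the block address (assumed interleaving).
NUM_TILES = 16     # e.g. a 4x4 mesh (assumption)
BLOCK_BITS = 6     # 64-byte cache blocks (assumption)

def home_tile(addr):
    block_addr = addr >> BLOCK_BITS   # drop the block-offset bits
    return block_addr % NUM_TILES     # interleave blocks across tiles
```

An L1 miss on address `addr` is routed to `home_tile(addr)`, where the directory lookup and coherence actions take place.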
Baseline Router Architecture In a tiled CMP, the L1 and L2 caches are attached to the router through a Network Interface Component (NIC). Routers are connected by four directional interfaces to form a 2D network on chip.
The Network Victim Cache (NVC) The difference from the baseline router architecture is the modified network interface component: a VC and a DC are added to it. Directory information is removed from the L2 caches and stored in Directory Caches (DCs) in the network interface components to save memory space. The saved directory space is used as Victim Caches (VCs) to capture and store evictions from local L1 caches, reducing subsequent L1 miss latencies.
At the home tile, the DC captures L1 miss requests in the network interface component and looks up the directory information for the requested block. It fetches the data block from the local L2 cache and sends a reply back to the requestor. If an L1 cache line is evicted because of a conflict or capacity miss, a copy of the victim line is kept in the VC to reduce the latency of subsequent accesses to the same line. (Figure: an L1 miss request flows through the DC to the L2 cache, which returns the fetched data block; lines evicted by conflict or capacity misses are placed in the VC.)
All L1 misses first check the VC as they flow through the network interface component, in case it holds a valid copy of the block. On a VC miss, the request continues on to the home tile. On a VC hit, the block is invalidated in the VC and moved into the L1 cache. (Figure: an L1 miss request probes the VC; a hit invalidates the VC entry and moves the block back into L1, while a miss is forwarded to the DC at the home tile.)
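The miss path above can be sketched as a single function. The dict-based caches and the function names are illustrative assumptions, not the paper's code.

```python
# Sketch of the NVC L1-miss path in the network interface component:
# the miss first probes the local victim cache; on a hit the entry is
# invalidated there and the block moves back into L1, otherwise the
# request is forwarded to the home tile.

def handle_l1_miss(tag, l1_cache, victim_cache, send_to_home):
    block = victim_cache.pop(tag, None)   # VC probe; a hit removes (invalidates) the entry
    if block is not None:
        l1_cache[tag] = block             # move the block back into the L1 cache
        return block                      # satisfied locally: no trip to the home tile
    return send_to_home(tag)              # VC miss: request travels on to the home tile
```

A hit both services the miss locally and frees the VC entry, which is how NVC shortens miss latency and removes inter-tile messages for blocks recently evicted by conflict or capacity misses.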
Simulation Environment We use the GEMS simulator to evaluate the performance of NVC against the baseline CMP. The number of VC entries equals that of the L1 cache, and the number of DC entries is twice that of the L1 cache. Workloads: 8 benchmarks from the SPLASH-2 and PARSEC suites, run on the Solaris 10 operating system. (Table: detailed system parameters.)
Impact on L1 cache miss latency NVC decreases L1 cache miss latencies by 21-49%, and by 31% on average. For the water benchmark, the small working set allows most L1 misses to be satisfied in the local victim cache, reducing the L1 miss latency by 49%. (Figure: normalized average L1 cache miss latency.)
Impact on execution time NVC reduces the execution time of each benchmark by 10-34%; the execution times of lu and water are reduced by 34%. For the water benchmark, the small working set allows most L1 misses to be satisfied in the local victim cache, leading to better performance. NVC improves the performance of the CMP by 23% on average. (Figure: normalized execution time.)
On-Chip Network Traffic Reduction An additional benefit of NVC is the reduction of on-chip coherence traffic. NVC reduces the number of coherence messages of each benchmark by 16-48%, and by 28% on average. NVC eliminates some inter-tile messages when accesses can be resolved in local victim caches.
Scalability Compared to the conventional shared L2 cache design, NVC increases on-chip storage by only 0.18%. As the number of cores increases, the directory storage saved from the L2 cache grows significantly, while the storage overhead of the proposed scheme grows far more slowly. NVC therefore scales much better than the conventional shared L2 cache design as the number of cores increases.
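The storage trade-off can be illustrated with back-of-the-envelope arithmetic. All parameters below are assumptions (16 tiles, 32 KB L1 and 1 MB L2 slice per tile, 64 B blocks, ~20 directory bits and ~30 tag bits per entry); they show the shape of the argument and do not reproduce the paper's exact 0.18% figure.

```python
# Back-of-the-envelope storage sketch for one tile, under assumed
# parameters: directory bits are stripped from every L2 line, while a
# DC (2x L1 entries) and a VC (1x L1 entries, holding full blocks) are
# added in the network interface component.
L1_ENTRIES = 32 * 1024 // 64          # 512 L1 blocks per tile (assumed)
L2_ENTRIES = 1024 * 1024 // 64        # 16384 L2 blocks per tile (assumed)
DIR_BITS, TAG_BITS, BLOCK_BITS = 20, 30, 64 * 8   # assumed sizes

removed = L2_ENTRIES * DIR_BITS                    # directory removed from L2 lines
dc_bits = 2 * L1_ENTRIES * (DIR_BITS + TAG_BITS)   # DC: twice the L1 entry count
vc_bits = L1_ENTRIES * (BLOCK_BITS + TAG_BITS)     # VC: one data block + tag per entry

net_added = dc_bits + vc_bits - removed            # net on-chip storage change per tile
total_bits = (32 + 1024) * 1024 * 8                # L1 + L2 data storage per tile
print(f"net overhead: {100 * net_added / total_bits:.4f}% of cache storage")
```

With these numbers the added DC and VC roughly balance the removed directory, leaving a net overhead that is a tiny fraction of total cache storage; as core counts grow, the per-line directory (and hence the savings) grows with the sharer vector, while DC and VC sizes track only the L1.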