Analyzing the Impact of Data Prefetching on Chip MultiProcessors
Naoto Fukumoto, Tomonobu Mihara, Koji Inoue, Kazuaki Murakami
Kyushu University, Japan
ACSAC 13, August 4, 2008
Background
CMP (Chip MultiProcessor):
– Several processor cores integrated on a chip
– High performance through parallel processing
– New feature: cache-to-cache data transfer
Limiting factor of CMP performance:
– The memory-wall problem is more critical
– High frequency of off-chip accesses
– Off-chip bandwidth does not scale with the number of cores
[Figure: a CMP chip with two cores, per-core L1 caches, and a shared L2 cache]
Data prefetching is even more important in CMPs
Motivation & Goal
Motivation
– Conventional prefetch techniques were developed for uniprocessors
– It is not clear that these techniques achieve high performance in CMPs
– Is it necessary for prefetch techniques to consider CMP features?
– We need to know the effect of prefetching on CMPs
Goal
– Analysis of the prefetch effect on CMPs
Outline
Introduction
Prefetch Taxonomy
– for multiprocessors
– extension for CMPs
Quantitative Analysis
Conclusions
Classification of Prefetches According to Impact on Memory Performance
Focusing on each individual prefetch
Definition of the prefetch states
– Initial state: the state just after a block is prefetched into the cache
– Final state: the state when the block is evicted from the cache
– The state transitions based on events during the lifetime of the prefetched block in the cache
Definition of Events
Event 1: The prefetched block is accessed by the local core
Event 2: The local core accesses a block that was evicted from the cache by the prefetch
Event 3: The prefetch causes a downgrade, followed by a subsequent upgrade, in a remote core
[Figure: Event 1 — the local core prefetches block A and later loads it, hiding the off-chip access latency]
Definition of Events (cont.)
[Figure: Event 2 — the prefetch of block A evicts block B; a later load of B by the local core misses in the cache]
Definition of Events (cont.)
[Figure: Event 3 — the local core prefetches block A; a remote core then stores to A, sending an invalidation request]
The State Transition of Prefetch in the Local Core
– Useless (initial state): # of memory accesses is increased in the local core
– Event 1: Useless → Useful: # of local L1 cache misses is decreased
– Event 2: Useless → Useless/Conflict: # of local L1 cache misses and # of memory accesses are increased in the local core
– Event 1: Useless/Conflict → Useful/Conflict: # of memory accesses is increased in the local core
[Figure: the prefetch of block A evicts block B; a load of A hits, while a load of B misses]
The State Transition of Prefetch in Local and Remote Cores*
States: Useless, Useless/Conflict, Useful, Useful/Conflict (Events 1 and 2 as in the local-core diagram)
* Jerger, N., Hill, E., and Lipasti, M., ``Friendly Fire: Understanding the Effects of Multiprocessor Prefetching,'' in Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2006
The State Transition of Prefetch in Local and Remote Cores* (cont.)
Event 3 adds two states:
– Useless → Harmful: # of invalidation requests and # of memory accesses are increased
– Useless/Conflict → Harmful/Conflict: # of invalidation requests, # of memory accesses, and # of cache misses are increased
[Figure: the local core prefetches block A; a remote store to A triggers an invalidation request, and a later local load of A misses]
* Jerger, N., Hill, E., and Lipasti, M., ``Friendly Fire: Understanding the Effects of Multiprocessor Prefetching,'' in Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2006
Considering Cache-to-Cache Data Transfer
Event 4: The prefetched block, loaded from the L2 or main memory, is accessed by a remote core
[Figure: the local core prefetches block A; a remote load of A is served by cache-to-cache transfer, hiding the off-chip access latency]
The State Transition in CMPs
States and Events 1–3 are carried over from the multiprocessor taxonomy: Useless, Useless/Conflict, Useful, Useful/Conflict, Harmful, Harmful/Conflict
The State Transition in CMPs (cont.)
Event 4 adds two states:
– Useless → Useless/Remote: # of L2 accesses is decreased in the remote core
– Useless/Conflict → Useless/Conflict/Remote: # of L2 accesses is decreased in the remote core, # of cache misses is increased in the local core
[Figure: the local core prefetches block A, evicting block B; a remote load of A is served by cache-to-cache transfer, while a local load of B misses]
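The per-block state machine built up over the last few slides can be sketched in code. This is my reading of the transition diagrams, not the authors' implementation; the event names E1–E4 follow the event definitions, and any transition not shown on the slides is assumed to leave the state unchanged.

```python
# Events from the slides:
#   E1: prefetched block is accessed by the local core
#   E2: local core accesses a block evicted by the prefetch
#   E3: prefetch causes a downgrade, then an upgrade, in a remote core
#   E4: prefetched block is accessed by a remote core (cache-to-cache)
TRANSITIONS = {
    ("Useless", "E1"): "Useful",
    ("Useless", "E2"): "Useless/Conflict",
    ("Useless", "E3"): "Harmful",
    ("Useless", "E4"): "Useless/Remote",
    ("Useless/Conflict", "E1"): "Useful/Conflict",
    ("Useless/Conflict", "E3"): "Harmful/Conflict",
    ("Useless/Conflict", "E4"): "Useless/Conflict/Remote",
    ("Useless/Remote", "E2"): "Useless/Conflict/Remote",
}

def classify(events):
    """Final state of one prefetched block, given the events observed
    during its lifetime in the cache (initial state: Useless)."""
    state = "Useless"
    for e in events:
        # Events with no arrow in the diagram keep the current state.
        state = TRANSITIONS.get((state, e), state)
    return state
```

Counting the final state of every prefetched block at eviction time then yields the per-benchmark breakdowns shown in the analysis slides.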
Classification of Prefetches in CMPs
– Useful (best case): one cache miss is decreased in the local core
– Useful/Conflict
– Useless: one memory access is increased in the local core
– Useless/Conflict: one cache miss is increased in the local core
– Harmful (worse case): one memory access in the local core and one invalidation request in the remote core are increased
– Harmful/Conflict (worst case)
– Useless/Remote (better case): one L2 access is decreased in the remote core
– Useless/Conflict/Remote
Outline
Introduction
Prefetch Taxonomy
– for multiprocessors
– for CMPs
Quantitative Analysis
Conclusions
Simulation Environment
Simulator
– M5: CMP simulator
– Prefetch mechanism attached to the L1 caches
– Stride prefetch and tagged prefetch
– MOESI coherence protocol
Benchmark programs
– SPLASH-2: scientific computation programs
Configuration: per-core L1 I/D caches (64KB, 2-way), shared L2 (4MB, 8-way)
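The two prefetchers named here are standard techniques; the sketch below shows the core idea of each. Table layout and trigger policy are illustrative assumptions, not M5's actual code.

```python
class StridePrefetcher:
    """Per-PC stride detection: after two consecutive accesses from the
    same PC with the same stride, prefetch addr + stride."""
    def __init__(self):
        self.table = {}  # pc -> (last_addr, last_stride)

    def access(self, pc, addr):
        prefetch = None
        if pc in self.table:
            last_addr, last_stride = self.table[pc]
            stride = addr - last_addr
            if stride == last_stride and stride != 0:
                prefetch = addr + stride  # stride confirmed: issue prefetch
            self.table[pc] = (addr, stride)
        else:
            self.table[pc] = (addr, 0)
        return prefetch


class TaggedPrefetcher:
    """Tagged sequential prefetch: fetch block b+1 on a demand miss to b,
    and again on the first demand reference to a prefetched block."""
    def __init__(self):
        self.cache = set()   # resident block numbers (capacity ignored)
        self.tagged = set()  # resident blocks brought in by prefetch

    def access(self, block):
        if block not in self.cache:       # demand miss
            self.cache.add(block)
            self.cache.add(block + 1)     # prefetch the next block
            self.tagged.add(block + 1)
        elif block in self.tagged:        # first hit on a prefetched block
            self.tagged.discard(block)
            self.cache.add(block + 1)
            self.tagged.add(block + 1)
```

For a strided stream (100, 108, 116, …) the stride prefetcher starts issuing prefetches from the third access on; the tagged prefetcher keeps running ahead of any sequential stream as long as its prefetched blocks are actually referenced.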
Can Conventional Prefetch Techniques Exploit Cache-to-Cache Data Transfer?
The percentage of Useless/Remote and Useless/Conflict/Remote prefetches is only 5%
→ Conventional prefetch techniques do not exploit cache-to-cache data transfer effectively
[Figure: breakdown for FMM, LU, Radix, and Water with (1) stride prefetch and (2) tagged prefetch]
Are Prefetched-Block Invalidations a Serious Problem for CMPs?
Harmful and Harmful/Conflict prefetches are extremely few (0.2% on average)
→ Invalidations of prefetched blocks are negligible
[Figure: breakdown for FMM, LU, Radix, and Water with (1) stride prefetch and (2) tagged prefetch]
Multiprocessor vs. Chip Multiprocessor
Harmful and Harmful/Conflict prefetches
– 0.01–0.70% in CMPs (tagged prefetch) → small negative impact
– 2–18% in MPs* (sequential prefetch) → large negative impact
Why does this difference occur?
* Jerger, N., Hill, E., and Lipasti, M., ``Friendly Fire: Understanding the Effects of Multiprocessor Prefetching,'' in Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2006
The Reason for the Difference in Invalidation Rates
Difference in the lifetime of prefetched blocks in the cache
– Long lifetime (large cache) → high probability of invalidation
– Short lifetime (small cache) → low probability of invalidation
If the cache size is large, the negative impact is large (as in MPs); if the cache size is small, the negative impact is small (as in CMPs)
[Figure: a CMP with small per-core L1s and a shared L2, versus a multiprocessor with a large private L1 and L2 per core; a remote core loads the prefetched block and keeps coherence]
The Invalidation Rate of Prefetched Blocks with Varying L1 Cache Size (tagged prefetch)
Larger cache → large negative impact (as in MPs)
Smaller cache → small negative impact (as in CMPs)
[Figure: invalidation rate versus L1 cache size]
Summary
Contributions
– A new method to analyze prefetch effects on CMPs
– Quantitative analysis of two types of prefetch
Observations
– Conventional prefetch techniques DO NOT exploit cache-to-cache data transfer effectively
– Harmful prefetches are NOT harmful in CMPs
Future work
– Propose a novel prefetch technique exploiting the features of CMPs
Any Questions? ~Please speak slowly~ Thank you
Average Memory Access Time (AMAT)
[Figure: memory hierarchy — L1 cache, remote L1 cache via shared bus, L2 cache, memory bus, main memory]
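The hierarchy drawn on this backup slide suggests the usual AMAT recurrence, extended with a cache-to-cache path: an L1 miss is served either by a remote L1 or by the L2, and an L2 miss goes to main memory. The formula shape and all parameter values below are illustrative assumptions, not the paper's model or numbers.

```python
def amat(l1_hit, l1_miss_rate,
         c2c_latency, c2c_frac,
         l2_latency, l2_miss_rate,
         mem_latency):
    """Average memory access time for an L1 / remote-L1 / L2 / memory
    hierarchy. c2c_frac is the fraction of L1 misses served by a
    remote L1 via cache-to-cache transfer."""
    l2_path = l2_latency + l2_miss_rate * mem_latency
    miss_penalty = c2c_frac * c2c_latency + (1 - c2c_frac) * l2_path
    return l1_hit + l1_miss_rate * miss_penalty
```

With a cheap cache-to-cache transfer (e.g. 20 cycles against an L2-plus-memory path of 50 cycles), raising `c2c_frac` directly lowers AMAT, which is why prefetches that end up serving remote cores (Useless/Remote) are the "better case" in the taxonomy.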
Harmful and Harmful/Conflict Prefetches with Varying Number of Cores
[Figure]
MultiProcessor Traffic and Miss Taxonomy (MPTMT) [Jerger'06]
– An extended version of the uniprocessor taxonomy (Srinivasan et al.)
– Prefetches are classified according to their effects on memory performance
– By counting the classified prefetches, we can measure the prefetch effects precisely