Analyzing the Impact of Data Prefetching on Chip MultiProcessors Naoto Fukumoto, Tomonobu Mihara, Koji Inoue, Kazuaki Murakami Kyushu University, Japan.


1 Analyzing the Impact of Data Prefetching on Chip MultiProcessors
Naoto Fukumoto, Tomonobu Mihara, Koji Inoue, Kazuaki Murakami
Kyushu University, Japan
ACSAC 13, August 4, 2008

2 Background
CMP (Chip MultiProcessor):
– Several processor cores integrated on one chip
– High performance through parallel processing
– New feature: cache-to-cache data transfer
Limiting factor of CMP performance
– The memory-wall problem is more critical: off-chip accesses are frequent, and off-chip bandwidth does not scale with the number of cores
Data prefetching is more important in CMPs
[Figure: a CMP in which each core has a private L1 cache and the cores share an on-chip L2 cache]

3 Motivation & Goal
Motivation
– Conventional prefetch techniques were developed for uniprocessors
– It is not clear that these techniques achieve high performance even in CMPs
– Do prefetch techniques need to take CMP features into account?
– We need to know the effect of prefetching on CMPs
Goal
– Analysis of the prefetch effect on CMPs

4 Outline
– Introduction
– Prefetch Taxonomy for multiprocessors
– Extension for CMPs
– Quantitative Analysis
– Conclusions

5 Classification of Prefetches According to Impact on Memory Performance
Focusing on each prefetch
Definition of the prefetch states
– Initial state: the state just after a block is prefetched into cache memory
– Final state: the state when the block is evicted from cache memory
– The state changes in response to events that occur during the lifetime of the prefetched block in cache memory

6 Definition of Events
Event1. The prefetched block is accessed by the local core
Event2. The local core accesses a block that was evicted from the cache by the prefetch
Event3. The prefetch causes a downgrade followed by a subsequent upgrade in a remote core
[Figure, illustrating Event1: the local core prefetches block A from main memory into its L1 cache; a later Load A by the local core hits, hiding the off-chip access latency]

7 Definition of Events (continued)
Event2. The local core accesses a block that was evicted from the cache by the prefetch
[Figure, illustrating Event2: the local core prefetches block A, and block B is evicted from its L1 cache; a later Load B by the local core causes a cache miss]

8 Definition of Events (continued)
Event3. The prefetch causes a downgrade followed by a subsequent upgrade in a remote core
[Figure, illustrating Event3: block A is held by both the local and the remote core after the local core's prefetch of A; a Store A by the remote core issues an invalidate request]

9 The State Transition of Prefetch in Local Core
[State diagram: states Useless (initial state), Useful, Useless/Conflict, and Useful/Conflict; transitions labeled Event1 and Event2]
– Useless (initial state): # of memory accesses is increased in the local core
– Useful: # of local L1 cache misses is decreased
– Useless/Conflict: # of local L1 cache misses and # of memory accesses are increased in the local core
– Useful/Conflict: # of memory accesses is increased in the local core
[Figure: the local core prefetches block A, and block B is evicted; Load A hits in the local L1 cache, while Load B causes a cache miss]

10 The State Transition of Prefetch in Local and Remote Cores*
[State diagram: states Useless, Useless/Conflict, Useful, and Useful/Conflict; transitions labeled Event1 and Event2]
* Jerger, N., Hill, E., and Lipasti, M., "Friendly Fire: Understanding the Effects of Multiprocessor Prefetching," in Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2006.

11 The State Transition of Prefetch in Local and Remote Cores*
[State diagram: states Useless, Useless/Conflict, Useful, Useful/Conflict, Harmful, and Harmful/Conflict; transitions labeled Event1, Event2, and Event3]
– Harmful: # of invalidation requests and # of memory accesses are increased
– Harmful/Conflict: # of invalidation requests, # of memory accesses, and # of cache misses are increased
[Figure: block A is held by both the local and the remote core after the local core's prefetch of A; a Store A by the remote core issues an invalidate request and the prefetched block is invalidated; a Load B by the local core causes a cache miss]
* Jerger, N., Hill, E., and Lipasti, M., "Friendly Fire: Understanding the Effects of Multiprocessor Prefetching," in Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2006.

12 Considering Cache-to-Cache Data Transfer
Event4. The prefetched block loaded from L2 or main memory is accessed by a remote core
[Figure: the local core prefetches block A into its L1 cache; a later Load A by the remote core is served by cache-to-cache transfer, hiding the off-chip access latency]

13 The State Transition in CMPs
[State diagram: states Useless, Useless/Conflict, Useful, Useful/Conflict, Harmful, and Harmful/Conflict; transitions labeled Event1, Event2, and Event3]
[Figure: local core and remote core, each with a private L1 cache, sharing the L2 cache and main memory; blocks A and B]

14 The State Transition in CMPs
[State diagram: the six states of the multiprocessor taxonomy (Useless, Useless/Conflict, Useful, Useful/Conflict, Harmful, Harmful/Conflict) plus two new states, Useless/Remote and Useless/Conflict/Remote, reached through Event4]
– Useless/Remote: # of L2 accesses is decreased in the remote core
– Useless/Conflict/Remote: # of L2 accesses is decreased in the remote core, and # of cache misses is increased in the local core
[Figure: the local core prefetches block A; a Load A by the remote core is served from the local core's L1 cache; a Load B by the local core causes a cache miss]
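The taxonomy can be read as a small state machine that follows one prefetched block from its initial state to its final state. The sketch below is a reconstruction, not the authors' code: the state names and Event1 to Event4 come from the slides, but the individual arrows in `TRANSITIONS` are inferred from the diagram labels and should be treated as assumptions. The names `Event`, `TRANSITIONS`, and `classify` are hypothetical.

```python
from enum import Enum


class Event(Enum):
    LOCAL_HIT = 1       # Event1: the local core accesses the prefetched block
    LOCAL_CONFLICT = 2  # Event2: the local core accesses the block the prefetch evicted
    REMOTE_UPGRADE = 3  # Event3: downgrade followed by an upgrade in a remote core
    REMOTE_HIT = 4      # Event4: a remote core accesses the prefetched block


# ASSUMPTION: arrows reconstructed from the state and event labels on the slides.
TRANSITIONS = {
    ("Useless", Event.LOCAL_HIT): "Useful",
    ("Useless", Event.LOCAL_CONFLICT): "Useless/Conflict",
    ("Useless", Event.REMOTE_UPGRADE): "Harmful",
    ("Useless", Event.REMOTE_HIT): "Useless/Remote",
    ("Useless/Conflict", Event.LOCAL_HIT): "Useful/Conflict",
    ("Useless/Conflict", Event.REMOTE_UPGRADE): "Harmful/Conflict",
    ("Useless/Conflict", Event.REMOTE_HIT): "Useless/Conflict/Remote",
    ("Harmful", Event.LOCAL_CONFLICT): "Harmful/Conflict",
    ("Useless/Remote", Event.LOCAL_CONFLICT): "Useless/Conflict/Remote",
}


def classify(events):
    """Return the final state of one prefetch given its lifetime events in order."""
    state = "Useless"  # initial state, just after the block is prefetched
    for event in events:
        # Pairs without an arrow in the table leave the state unchanged.
        state = TRANSITIONS.get((state, event), state)
    return state
```

For example, `classify([Event.LOCAL_CONFLICT, Event.REMOTE_HIT])` yields `Useless/Conflict/Remote`, while a prefetch that sees no event before eviction stays `Useless`.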

15 Classification of Prefetches in CMPs
Classes: Useless, Useless/Conflict, Harmful, Harmful/Conflict, Useful, Useful/Conflict, Useless/Remote, Useless/Conflict/Remote
– Useful: one cache miss is decreased in the local core
– Useless: one memory access is increased in the local core
– Harmful: one memory access in the local core and one invalidate request in the remote core are increased
– Conflict: one cache miss is increased in the local core
– Remote: one L2 access is decreased in the remote core
[Figure: the classes arranged on a scale labeled Best case, Better case, Worse case, and Worst case]
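Counting prefetches per class is what lets the taxonomy turn into an estimate of memory-system effects. The sketch below tabulates the per-class annotations that appear on the taxonomy slides and sums them over a set of prefetch counts. It assumes each annotated increase or decrease is exactly one event per prefetch (the slides say "one" only for the basic classes), and the names `EFFECTS` and `net_effect` as well as the counts in the usage note are hypothetical, not simulation results.

```python
from collections import Counter

# Per-prefetch effect of each class, following the slide annotations.
# ASSUMPTION: every annotated change is treated as a unit change (+1 or -1).
EFFECTS = {
    "Useful": {"local_cache_misses": -1},
    "Useful/Conflict": {"local_memory_accesses": +1},
    "Useless": {"local_memory_accesses": +1},
    "Useless/Conflict": {"local_cache_misses": +1, "local_memory_accesses": +1},
    "Useless/Remote": {"remote_l2_accesses": -1},
    "Useless/Conflict/Remote": {"remote_l2_accesses": -1, "local_cache_misses": +1},
    "Harmful": {"invalidation_requests": +1, "local_memory_accesses": +1},
    "Harmful/Conflict": {
        "invalidation_requests": +1,
        "local_memory_accesses": +1,
        "local_cache_misses": +1,
    },
}


def net_effect(prefetch_counts):
    """Sum the per-class effects over a mapping of class name to prefetch count."""
    total = Counter()
    for prefetch_class, count in prefetch_counts.items():
        for counter, delta in EFFECTS[prefetch_class].items():
            total[counter] += delta * count
    return dict(total)
```

With made-up counts, `net_effect({"Useful": 3, "Useless": 2, "Harmful": 1})` reports three fewer local cache misses, three more local memory accesses, and one more invalidation request.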

16 Outline
– Introduction
– Prefetch Taxonomy for Multiprocessors and for CMPs
– Quantitative Analysis
– Conclusions

17 Simulation Environment
Simulator
– M5: CMP simulator
– Prefetch mechanism attached to the L1 cache
– Stride prefetch and tagged prefetch
– MOESI coherence protocol
Benchmark programs
– SPLASH-2: scientific computation programs
[Figure: four cores, each with private L1 data (D) and instruction (I) caches (64KB, 2-way), sharing an L2 cache (4MB, 8-way) connected to main memory]
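The slides name the two prefetchers but do not describe them. As a rough illustration, the sketch below follows the textbook descriptions: a tagged prefetcher fetches the next block on a demand miss and again on the first reference to a prefetched block, and a stride prefetcher issues a prefetch once the same load instruction repeats a non-zero stride. This is not M5's implementation. The class names, the 64-byte block size, and the unbounded cache without eviction are all simplifying assumptions.

```python
BLOCK_SIZE = 64  # ASSUMPTION: block size in bytes, not given on the slides


class TaggedPrefetcher:
    """Next-block prefetch on a demand miss or on first use of a prefetched block."""

    def __init__(self):
        # block number -> tag bit (True means prefetched and not yet referenced)
        self.cache = {}

    def access(self, addr):
        """Return the block number prefetched by this access, or None."""
        block = addr // BLOCK_SIZE
        if block not in self.cache:   # demand miss: fetch the block, prefetch the next
            self.cache[block] = False
            return self._prefetch(block + 1)
        if self.cache[block]:         # first reference to a prefetched block
            self.cache[block] = False
            return self._prefetch(block + 1)
        return None                   # ordinary hit: no prefetch

    def _prefetch(self, block):
        if block in self.cache:
            return None
        self.cache[block] = True
        return block


class StridePrefetcher:
    """Prefetch addr + stride once a load PC repeats the same non-zero stride."""

    def __init__(self):
        self.table = {}  # load PC -> (last address, last stride)

    def access(self, pc, addr):
        """Return the address to prefetch for this access, or None."""
        entry = self.table.get(pc)
        if entry is None:
            self.table[pc] = (addr, 0)
            return None
        last_addr, last_stride = entry
        stride = addr - last_addr
        self.table[pc] = (addr, stride)
        if stride != 0 and stride == last_stride:
            return addr + stride
        return None
```

A sequential scan keeps the tagged prefetcher one block ahead, while the stride prefetcher needs three accesses from the same load instruction before it issues its first prefetch.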

18 Can Conventional Prefetch Techniques Exploit Cache-to-Cache Data Transfer?
– The percentage of Useless/Remote and Useless/Conflict/Remote prefetches is only 5%
→ Conventional prefetch techniques do not exploit cache-to-cache data transfer effectively
[Figure: breakdown of prefetch classes for FMM, LU, Radix, and Water; bar 1 is stride prefetch and bar 2 is tagged prefetch; the Useless/Remote and Useless/Conflict/Remote portions are labeled]

19 Are Prefetched-Block Invalidations a Serious Problem for CMPs?
– Harmful and Harmful/Conflict prefetches are extremely few (0.2% on average)
→ Invalidations of prefetched blocks are negligible
[Figure: breakdown of prefetch classes for FMM, LU, Radix, and Water; bar 1 is stride prefetch and bar 2 is tagged prefetch]

20 Multiprocessor vs. Chip Multiprocessor
Harmful and Harmful/Conflict prefetches
– 0.01~0.70% in CMP (tagged prefetch) → small negative impact
– 2~18% in MP* (sequential prefetch) → large negative impact
Why does this difference occur?
* Jerger, N., Hill, E., and Lipasti, M., "Friendly Fire: Understanding the Effects of Multiprocessor Prefetching," in Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2006.

21 The Reason for the Difference in Invalidation Rate
Difference in the lifetime of prefetched blocks in the cache
– Long lifetime (large cache size) → high possibility of invalidation
– Short lifetime (small cache size) → low possibility of invalidation
If the cache size is large, the negative impact is large (like MPs)
If the cache size is small, the negative impact is small (like CMPs)
[Figure: a CMP, where each core has a private L1 cache and the L2 cache is shared, next to a multiprocessor, where each core has its own L1 and L2 caches; label: load prefetched block and keep coherence]

22 The Invalidation Rate of Prefetched Blocks with Varying L1 Cache Size (tagged prefetch)
– Larger cache → large negative impact (like MPs)
– Smaller cache → small negative impact (like CMPs)
[Figure: invalidation rate of prefetched blocks versus L1 cache size]

23 Summary
Contributions
– New method to analyze prefetch effects on CMPs
– Quantitative analysis for two types of prefetches
Observations
– Conventional prefetch techniques DO NOT exploit cache-to-cache data transfer effectively
– Harmful prefetches are NOT harmful in CMPs
Future work
– Propose a novel prefetch technique exploiting the features of CMPs

24 Any Questions ? ~Please speak slowly~ Thank you

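25 Average Memory Access Time (AMAT)
[Figure: memory hierarchy with the local L1 cache, a remote L1 cache, and the L2 cache connected by a shared bus, and main memory connected by the memory bus]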
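The transcript preserves no formula for this slide. One common way to write AMAT for the hierarchy in the figure, given here only as a sketch under stated assumptions, splits L1 misses by where they are served. All symbols are hypothetical: \(t_{\mathrm{L1}}\) is the L1 hit time, \(m_{\mathrm{L1}}\) the L1 miss rate, and \(p_x\) and \(t_x\) the fraction of L1 misses served by level \(x\) (remote L1, L2, or main memory) and the latency of that level.

```latex
\[
\mathrm{AMAT} = t_{\mathrm{L1}} + m_{\mathrm{L1}}
  \left( p_{\mathrm{remote}}\, t_{\mathrm{remote}}
       + p_{\mathrm{L2}}\, t_{\mathrm{L2}}
       + p_{\mathrm{mem}}\, t_{\mathrm{mem}} \right),
\qquad
p_{\mathrm{remote}} + p_{\mathrm{L2}} + p_{\mathrm{mem}} = 1
\]
```

Under this decomposition, a prefetch that lets a remote core find the block in another L1 cache moves weight from \(p_{\mathrm{L2}}\) or \(p_{\mathrm{mem}}\) to \(p_{\mathrm{remote}}\).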

26 Harmful and Harmful/Conflict Prefetches with Varying # of Cores
[Figure: share of Harmful and Harmful/Conflict prefetches versus the number of cores]

27 MultiProcessor Traffic and Miss Taxonomy (MPTMT [Jerger'06])
– An extended version of the uniprocessor taxonomy (Srinivasan et al.)
– Prefetches are classified according to their effects on memory performance
– By counting the classified prefetches, we can measure the prefetch effects precisely

