Analyzing the Impact of Data Prefetching on Chip MultiProcessors Naoto Fukumoto, Tomonobu Mihara, Koji Inoue, Kazuaki Murakami Kyushu University, Japan.

Analyzing the Impact of Data Prefetching on Chip MultiProcessors
Naoto Fukumoto, Tomonobu Mihara, Koji Inoue, Kazuaki Murakami
Kyushu University, Japan
ACSAC-13, August 4, 2008

Background
CMP (Chip MultiProcessor):
– Several processor cores integrated on one chip
– High performance through parallel processing
– New feature: cache-to-cache data transfer
Limiting factor of CMP performance:
– The memory-wall problem is more critical: high frequency of off-chip accesses, and off-chip bandwidth does not scale with the number of cores
Data prefetching is therefore even more important in CMPs.
(Figure: a CMP chip with two cores, per-core L1 caches, and a shared L2 cache.)

Motivation & Goal
Motivation
– Conventional prefetch techniques have been developed for uniprocessors
– It is not clear that these prefetch techniques achieve high performance in CMPs
– Is it necessary for prefetch techniques to consider CMP features?
– We need to know the effect of prefetching on CMPs
Goal
– Analysis of the effect of prefetching on CMPs

Outline
– Introduction
– Prefetch Taxonomy for Multiprocessors
– Extension for CMPs
– Quantitative Analysis
– Conclusions

Classification of Prefetches According to Impact on Memory Performance
Focusing on each individual prefetch, we define states for a prefetched block:
– Initial state: the state just after the block is prefetched into the cache
– Final state: the state when the block is evicted from the cache
– The state transitions based on events during the lifetime of the prefetched block in the cache

Definition of Events
– Event 1. The prefetched block is accessed by the local core
– Event 2. The local core accesses a block that was evicted from the cache by the prefetch
– Event 3. The prefetch causes a downgrade, followed by a subsequent upgrade in a remote core
(Figure: Event 1. The local core prefetches block A and later loads it, hiding the off-chip access latency.)

Definition of Events (cont.)
(Figure: Event 2. The prefetch of block A evicts block B; a subsequent load of B by the local core misses in the cache.)

Definition of Events (cont.)
(Figure: Event 3. A store to block A by the remote core sends an invalidate request, invalidating the prefetched copy in the local core.)

The State Transition of a Prefetch in the Local Core
– Useless (initial state): the number of memory accesses in the local core is increased
– Useless → Useful on Event 1: the number of local L1 cache misses is decreased
– Useless → Useless/Conflict on Event 2: the numbers of local L1 cache misses and memory accesses in the local core are increased
– Useless/Conflict → Useful/Conflict on Event 1
(Figure: the state diagram, with the load and eviction examples from the event definitions.)

The State Transition of a Prefetch in Local and Remote Cores*
The four states above (Useless, Useless/Conflict, Useful, Useful/Conflict) and the transitions on Events 1 and 2 carry over.
* Jerger, N., Hill, E., and Lipasti, M., "Friendly Fire: Understanding the Effects of Multiprocessor Prefetching," in Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2006.

The State Transition of a Prefetch in Local and Remote Cores (cont.)
– Useless → Harmful on Event 3: the numbers of invalidation requests and memory accesses are increased
– Useless/Conflict → Harmful/Conflict on Event 3: the numbers of invalidation requests, memory accesses, and cache misses are increased
(Figure: the remote core's store to A invalidates the prefetched copy in the local core; a later local load misses in the cache.)

Considering Cache-to-Cache Data Transfer
– Event 4. The prefetched block, loaded from the L2 or main memory, is accessed by a remote core
(Figure: the local core prefetches block A; the remote core's load of A is served by a cache-to-cache transfer, hiding the off-chip access latency.)

The State Transition in CMPs
The six states (Useless, Useless/Conflict, Useful, Useful/Conflict, Harmful, Harmful/Conflict) and the transitions on Events 1 to 3 carry over to CMPs.

The State Transition in CMPs (cont.)
– Useless → Useless/Remote on Event 4: the number of L2 accesses in the remote core is decreased
– Useless/Conflict → Useless/Conflict/Remote on Event 4: the number of L2 accesses in the remote core is decreased, while the number of cache misses in the local core is increased
(Figure: the local core prefetches A and evicts B; the remote core's load of A is served directly from the local core's cache.)

Classification of Prefetches in CMPs
– Useful (best case): one cache miss is decreased in the local core
– Useful/Conflict
– Useless: one memory access is increased in the local core
– Useless/Remote (better case): one L2 access is decreased in the remote core
– Useless/Conflict (worse case): one cache miss is increased in the local core
– Useless/Conflict/Remote
– Harmful: one memory access in the local core and one invalidate request in the remote core are increased
– Harmful/Conflict (worst case)
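The taxonomy above can be sketched as a small state machine. This is an illustrative sketch only: the state and event names follow the slides, but the dictionary encoding and the `classify` helper are my own, not the authors' implementation, and transitions the slides do not show are treated as self-loops.

```python
# Illustrative sketch of the prefetch-state taxonomy from the slides.
# Events observed during a prefetched block's lifetime in the cache:
#   E1: the local core accesses the prefetched block
#   E2: the local core accesses a block the prefetch evicted
#   E3: a remote core's write invalidates the prefetched block
#   E4: a remote core reads the block via a cache-to-cache transfer

TRANSITIONS = {
    ("Useless", "E1"): "Useful",
    ("Useless", "E2"): "Useless/Conflict",
    ("Useless", "E3"): "Harmful",
    ("Useless", "E4"): "Useless/Remote",
    ("Useless/Conflict", "E1"): "Useful/Conflict",
    ("Useless/Conflict", "E3"): "Harmful/Conflict",
    ("Useless/Conflict", "E4"): "Useless/Conflict/Remote",
}

def classify(events):
    """Final state of a prefetched block, starting from the Useless state."""
    state = "Useless"
    for ev in events:
        # Pairs not shown on the slides are assumed to leave the state unchanged.
        state = TRANSITIONS.get((state, ev), state)
    return state

# A prefetch that evicts a useful block and is later invalidated remotely:
print(classify(["E2", "E3"]))  # -> Harmful/Conflict
```

Counting the final states of all prefetched blocks this way is exactly how the talk's quantitative analysis breaks down prefetches into best and worst cases.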

Outline
– Introduction
– Prefetch Taxonomy (for multiprocessors; for CMPs)
– Quantitative Analysis
– Conclusions

Simulation Environment
– Simulator: M5 (CMP simulator)
– Prefetch mechanism attached to the L1 caches: stride prefetch and tagged prefetch
– MOESI coherence protocol
– Benchmark programs: SPLASH-2 (scientific computation programs)
– Configuration: per-core 64KB, 2-way L1 I/D caches; shared 4MB, 8-way L2 cache; main memory
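Of the two prefetchers evaluated, tagged prefetching is the simpler to illustrate: on a demand miss, or on the first hit to a block that was brought in by a prefetch, the next sequential block is prefetched. The sketch below is a minimal, hypothetical model (a fully associative cache as a dict, block-granularity addresses, no replacement policy); it is not the M5 implementation used in the talk.

```python
# Minimal sketch of tagged (one-block-lookahead) prefetching on an L1 cache.
# Illustrative assumptions: fully associative cache modeled as a dict,
# addresses are block numbers, evictions are not modeled.

class TaggedPrefetchCache:
    def __init__(self):
        self.blocks = {}  # block number -> tag bit (True = prefetched, unused)

    def access(self, blk):
        """Demand access to a block; returns the list of blocks prefetched."""
        prefetched = []
        if blk not in self.blocks:       # demand miss: fetch and prefetch ahead
            self.blocks[blk] = False
            nxt = self._prefetch(blk + 1)
            if nxt is not None:
                prefetched.append(nxt)
        elif self.blocks[blk]:           # first hit on a prefetched block
            self.blocks[blk] = False     # clear the tag, prefetch ahead again
            nxt = self._prefetch(blk + 1)
            if nxt is not None:
                prefetched.append(nxt)
        return prefetched

    def _prefetch(self, blk):
        if blk not in self.blocks:
            self.blocks[blk] = True      # tagged: prefetched but not yet used
            return blk
        return None

c = TaggedPrefetchCache()
c.access(0)  # miss on block 0, so block 1 is prefetched
c.access(1)  # first hit on prefetched block 1, so block 2 is prefetched
```

A stride prefetcher differs only in what it predicts: instead of always fetching the next block, it tracks the per-instruction stride between successive addresses and prefetches ahead along that stride.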

Can Conventional Prefetch Techniques Exploit Cache-to-Cache Data Transfer?
The percentage of Useless/Remote and Useless/Conflict/Remote prefetches is only 5%. Conventional prefetch techniques therefore do not exploit cache-to-cache data transfer effectively.
(Graph: breakdown of prefetch classes for FMM, LU, Radix, and Water under (1) stride prefetch and (2) tagged prefetch.)

Are Invalidations of Prefetched Blocks a Serious Problem for CMPs?
Harmful and Harmful/Conflict prefetches are extremely few (0.2% on average). Invalidations of prefetched blocks are therefore negligible.
(Graph: breakdown of prefetch classes for FMM, LU, Radix, and Water under (1) stride prefetch and (2) tagged prefetch.)

Multiprocessor vs. Chip Multiprocessor
Harmful and Harmful/Conflict prefetches:
– 0.01–0.70% in CMPs (tagged prefetch): small negative impact
– 2–18% in MPs (sequential prefetch)*: large negative impact
Why does this difference occur?
* Jerger, N., Hill, E., and Lipasti, M., "Friendly Fire: Understanding the Effects of Multiprocessor Prefetching," in Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2006.

The Reason for the Difference in Invalidation Rate
The difference comes from the lifetime of prefetched blocks in the cache:
– Long lifetime (large cache): high probability of invalidation
– Short lifetime (small cache): low probability of invalidation
In a traditional multiprocessor, each node keeps prefetched blocks coherent in a large L1 + L2 hierarchy, so the negative impact is large; in a CMP, prefetched blocks live only in the small per-core L1 caches, so the negative impact is small.
(Figure: a CMP with per-core L1 caches and a shared L2, versus a multiprocessor with a full L1 + L2 hierarchy per node.)

The Invalidation Rate of Prefetched Blocks with Varying L1 Cache Size (Tagged Prefetch)
– Larger cache: larger negative impact (like MPs)
– Smaller cache: smaller negative impact (like CMPs)
(Graph: invalidation rate versus L1 cache size.)

Summary
Contributions
– A new method to analyze prefetch effects on CMPs
– Quantitative analysis of two types of prefetching
Observations
– Conventional prefetch techniques DO NOT exploit cache-to-cache data transfer effectively
– Harmful prefetches are NOT harmful in CMPs
Future work
– Propose a novel prefetch technique that exploits the features of CMPs

Any Questions? (Please speak slowly.) Thank you.

Average Memory Access Time (AMAT)
(Figure: the memory hierarchy modeled: L1 cache, remote L1 cache via the shared bus, L2 cache, memory bus, main memory.)
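The hierarchy on this backup slide suggests an AMAT expression with four service points: local L1, remote L1 (cache-to-cache transfer over the shared bus), shared L2, and main memory. The sketch below uses made-up latencies and hit fractions purely for illustration; none of the numbers come from the talk.

```python
# Illustrative AMAT for the hierarchy on this slide:
# local L1 -> remote L1 (cache-to-cache) -> shared L2 -> main memory.
# All latencies (cycles) and rates below are assumed example values.

def amat(l1_latency, l1_rate,
         c2c_latency, c2c_rate,   # served by a remote L1 over the shared bus
         l2_latency, l2_rate,
         mem_latency):
    """Expected cycles per access. Rates are the fractions of all accesses
    served at each level; the remainder goes to main memory."""
    assert l1_rate + c2c_rate + l2_rate <= 1.0
    mem_rate = 1.0 - (l1_rate + c2c_rate + l2_rate)
    return (l1_rate * l1_latency +
            c2c_rate * c2c_latency +
            l2_rate * l2_latency +
            mem_rate * mem_latency)

# e.g. 90% local L1 hits (2 cycles), 2% remote-L1 transfers (20 cycles),
# 6% L2 hits (15 cycles), and the remaining 2% going to memory (200 cycles):
print(amat(2, 0.90, 20, 0.02, 15, 0.06, 200))  # -> about 7.1 cycles
```

The c2c term is what distinguishes the CMP model: a prefetch classified Useless/Remote converts what would have been an L2 or memory access in the remote core into a cheaper cache-to-cache transfer.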

Harmful and Harmful/Conflict Prefetches with a Varying Number of Cores
(Graph omitted.)

MultiProcessor Traffic and Miss Taxonomy (MPTMT) [Jerger'06]
– An extended version of the uniprocessor taxonomy of Srinivasan et al.
– Prefetches are classified according to their effects on memory performance
– By counting the classified prefetches, the prefetch effects can be measured precisely