Sequential Hardware Prefetching in Shared-Memory Multiprocessors Fredrik Dahlgren, Member, IEEE Computer Society, Michel Dubois, Senior Member, IEEE, and Per Stenstrom, Member, IEEE.

Presentation transcript:

Sequential Hardware Prefetching in Shared-Memory Multiprocessors Fredrik Dahlgren, Member, IEEE Computer Society, Michel Dubois, Senior Member, IEEE, and Per Stenstrom, Member, IEEE

INTRODUCTION Why prefetching? Motivations for prefetching data. Types of prefetching: SOFTWARE and HARDWARE.

SIMPLEST & MOST OBVIOUS PREFETCHING TECHNIQUE: INCREASE BLOCK SIZE!!!

IMPACT OF BLOCK SIZE ON ACCESS PENALTIES AND TRAFFIC TYPES OF MISSES: COLD, REPLACEMENT, COHERENCE (TRUE SHARING and FALSE SHARING)

EFFECT OF INCREASING THE BLOCK SIZE ON THE DIFFERENT TYPES OF MISSES EFFECT OF BLOCK SIZE ON MEMORY TRAFFIC EFFECT OF BLOCK SIZE ON WRITE PENALTY

SIMULATED NODE ARCHITECTURE

TWO SIMPLE HARDWARE-CONTROLLED SEQUENTIAL PREFETCHING TECHNIQUES 1. FIXED SEQUENTIAL PREFETCHING
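In the fixed scheme, every miss to a block triggers a prefetch of the next K consecutive blocks, with K fixed at design time. A minimal Python sketch of that policy follows; the class name and the unbounded-set cache model are illustrative assumptions, not the paper's hardware:

```python
class FixedSequentialPrefetcher:
    """Sketch of fixed sequential prefetching: on a miss to block b,
    also fetch the K blocks that sequentially follow b."""

    def __init__(self, degree=1):
        self.degree = degree   # K: prefetch degree, fixed at design time
        self.cache = set()     # block addresses currently cached (simplified)

    def access(self, block):
        """Return True on a hit; on a miss, fetch the block plus K successors."""
        if block in self.cache:
            return True
        self.cache.add(block)                  # demand fetch
        for i in range(1, self.degree + 1):
            self.cache.add(block + i)          # sequential prefetch
        return False

pf = FixedSequentialPrefetcher(degree=1)
pf.access(10)    # miss: fetches block 10 and prefetches block 11
pf.access(11)    # hit, thanks to the prefetch
```

With degree=1 this matches the K=1 configuration the paper evaluates; larger degrees trade extra traffic for potentially fewer misses.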

2. ADAPTIVE SEQUENTIAL PREFETCHING K is controlled by the LookAhead Counter. Mechanisms needed: Prefetch bit, Zero bit, LookAhead Counter, Prefetch Counter, Useful Counter.

ADAPTIVE SEQUENTIAL PREFETCHING ALGORITHM MEASURES PREFETCH EFFICIENCY BY COUNTING THE FRACTION OF USEFUL PREFETCHES. COUNTING PREFETCHED BLOCKS: INCREMENT THE PREFETCH COUNTER ON EVERY PREFETCH. COUNTING USEFUL PREFETCHES: INCREMENT THE USEFUL COUNTER WHENEVER A BLOCK WITH ITS PREFETCH BIT = 1 IS ACCESSED. WHEN THE PREFETCH COUNTER REACHES ITS MAXIMUM, CHECK THE USEFUL COUNTER: USEFUL COUNTER > UPPER THRESHOLD: LOOKAHEAD COUNTER INCREASED. USEFUL COUNTER < LOWER THRESHOLD: LOOKAHEAD COUNTER DECREASED. LOWER THRESHOLD <= USEFUL COUNTER <= UPPER THRESHOLD: LOOKAHEAD COUNTER UNAFFECTED.
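The counter logic above can be sketched in a few lines of Python. The threshold values, window size, and K cap below are illustrative assumptions; the paper's hardware uses saturating counters rather than floating-point ratios:

```python
class AdaptiveLookahead:
    """Sketch of the adaptive scheme's lookahead adjustment: measure the
    fraction of useful prefetches per window and raise or lower K."""

    def __init__(self, max_count=16, lo=0.25, hi=0.75, k_max=8):
        self.k = 1                 # LookAhead Counter: current prefetch degree K
        self.prefetch_count = 0    # prefetches issued in the current window
        self.useful_count = 0      # prefetched blocks actually referenced
        self.max_count = max_count # window size (Prefetch Counter maximum)
        self.lo, self.hi = lo, hi  # lower / upper usefulness thresholds
        self.k_max = k_max         # cap on K (illustrative)

    def on_prefetch(self):
        self.prefetch_count += 1
        if self.prefetch_count == self.max_count:
            self._adjust()

    def on_useful_access(self):
        # Called when a block with its prefetch bit set is accessed.
        self.useful_count += 1

    def _adjust(self):
        ratio = self.useful_count / self.max_count
        if ratio > self.hi:
            self.k = min(self.k + 1, self.k_max)   # prefetches pay off: go deeper
        elif ratio < self.lo:
            self.k = max(self.k - 1, 0)            # mostly useless: back off
        # between the thresholds: K is left unchanged
        self.prefetch_count = 0
        self.useful_count = 0
```

Note that K can drop to zero, turning prefetching off entirely until usefulness recovers, which is how the adaptive scheme avoids polluting the cache on irregular access patterns.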

EXPERIMENTAL RESULTS READ, WRITE, AND SYNCHRONIZATION TRAFFIC FOR THE THREE SCHEMES

RELATIVE READ STALL TIMES FOR FIXED AND ADAPTIVE PREFETCHING NORMALIZED TO NO PREFETCHING.

CONCLUSIONS Prefetching improves efficiency. Fixed sequential prefetching, analyzed for K = 1: read misses decrease by 25 - 45 %, read stall time decreases by 20 - 35 %. Under adaptive sequential prefetching, read stall time is reduced by 58 % and execution time decreases by 25 %.

QUESTIONS? THANK YOU