
Improving Memory Bank-Level Parallelism in the Presence of Prefetching
Chang Joo Lee, Veynu Narasiman, Onur Mutlu*, Yale N. Patt
Electrical and Computer Engineering, The University of Texas at Austin
* Electrical and Computer Engineering, Carnegie Mellon University

Main Memory System
– Crucial to high-performance computing
– Made of DRAM chips
– Multiple banks → each bank can be accessed independently

Memory Bank-Level Parallelism (BLP)
[Figure: two requests, Req B0 and Req B1, are serviced concurrently in Bank 0 and Bank 1; their service times overlap, and data for both return over the data bus → DRAM throughput increased]
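The throughput benefit shown in the figure can be illustrated with a toy timing model (not from the slides; the bank latency is an assumed constant):

```python
from collections import Counter

# Toy model: requests to distinct banks are serviced in parallel, while
# requests to the same bank serialize. BANK_LATENCY is an assumed,
# illustrative value.
BANK_LATENCY = 100  # cycles per bank access (illustrative)

def total_service_time(bank_of_request):
    """Total DRAM service time for a list of requests, one bank id each."""
    per_bank = Counter(bank_of_request)
    # Each bank serializes its own requests; different banks operate
    # independently, so the busiest bank determines the total time.
    return max(per_bank.values()) * BANK_LATENCY
```

For example, `total_service_time([0, 1])` is 100 cycles (the two accesses fully overlap), while `total_service_time([0, 0])` is 200 cycles (same bank, serialized).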

Memory Latency-Tolerance Mechanisms
– Out-of-order execution, prefetching, runahead execution, etc.
– These increase the number of outstanding memory requests on the chip: Memory-Level Parallelism (MLP) [Glew'98]
  - The hope is that many requests will be serviced in parallel in the memory system
– Higher performance can be achieved when BLP is exposed to the DRAM controller

Problems
– On-chip buffers, e.g., Miss Status Holding Registers (MSHRs), are limited in size
  - They limit the BLP exposed to the DRAM controller
  - E.g., requests to the same bank fill up the MSHRs
– In CMPs, memory requests from different cores are mixed together in the DRAM request buffers
  - This destroys the BLP of each application running on the CMP
→ Request issue policies are critical to the BLP exploited by the DRAM controller

Goals and Proposal
Goals:
1. Maximize the BLP exposed from each core to the DRAM controller → increase DRAM throughput for useful requests
2. Preserve the BLP of each application in CMPs → increase system performance
Proposal:
– BLP-Aware Prefetch Issue (BAPI): decides the order in which prefetches are sent from the prefetcher to the MSHRs
– BLP-Preserving Multi-core Request Issue (BPMRI): decides the order in which memory requests are sent from each core to the DRAM request buffers

DRAM BLP-Aware Request Issue Policies
– BLP-Aware Prefetch Issue (BAPI)
– BLP-Preserving Multi-core Request Issue (BPMRI)

What Can Limit DRAM BLP?
– Miss Status Holding Registers (MSHRs) are not large enough to handle many memory requests [Tuck, MICRO'06]
  - MSHRs keep track of all outstanding misses for a core → total number of demand/prefetch requests ≤ total number of MSHR entries
  - Complex, latency-critical, and power-hungry → not scalable
→ The request issue policy to the MSHRs affects the level of BLP exploited by the DRAM controller

What Can Limit DRAM BLP? (example)
[Figure: with the MSHRs full and a FIFO issue policy (as in Intel Core), prefetches to the same bank (Pref B0 behind Dem B0) occupy MSHR entries while Bank 1 sits idle; a BLP-aware policy issues Pref B1 first, so Bank 0 and Bank 1 are serviced in parallel and time is saved]
– Increasing the number of requests ≠ high DRAM BLP
→ A simple issue policy improves DRAM BLP

BLP-Aware Prefetch Issue (BAPI)
– Sends prefetches to the MSHRs based on the current BLP exposed in the memory system
  - Sends a prefetch mapped to the least busy DRAM bank
– Adaptively limits the issue of prefetches based on prefetch accuracy estimation
  - Low prefetch accuracy → fewer prefetches issued to the MSHRs
  - High prefetch accuracy → maximize BLP

Implementation of BAPI
– FIFO prefetch request buffer per DRAM bank: stores the prefetches mapped to the corresponding DRAM bank
– MSHR occupancy counter per DRAM bank: keeps track of the number of outstanding requests to the corresponding DRAM bank
– Prefetch accuracy register: stores the periodically estimated prefetch accuracy

BAPI Policy
Every prefetch issue cycle:
1. Make the oldest prefetch to each bank valid only if the bank's MSHR occupancy counter ≤ prefetch send threshold
2. Among valid prefetches, select the request to the bank with the minimum MSHR occupancy counter value
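The two steps above can be sketched as follows. The structures (per-bank prefetch FIFOs, per-bank MSHR occupancy counters, a prefetch send threshold) follow the slides; the function name and the dict/deque representation are illustrative, not the paper's implementation.

```python
from collections import deque

def bapi_select(prefetch_fifos, mshr_occupancy, send_threshold):
    """Pick the next prefetch to issue to the MSHRs, or None.

    prefetch_fifos: dict bank -> deque of prefetch ids (oldest first)
    mshr_occupancy: dict bank -> outstanding requests to that bank
    """
    # Step 1: the oldest prefetch to a bank is valid only if that bank's
    # MSHR occupancy counter is at or below the prefetch send threshold.
    valid_banks = [b for b, fifo in prefetch_fifos.items()
                   if fifo and mshr_occupancy[b] <= send_threshold]
    if not valid_banks:
        return None
    # Step 2: among valid candidates, pick the bank with the minimum
    # occupancy, i.e. the least busy bank, to maximize exposed BLP.
    best = min(valid_banks, key=lambda b: mshr_occupancy[b])
    return prefetch_fifos[best].popleft()
```

With `prefetch_fifos = {0: deque(['P0']), 1: deque(['P1'])}` and occupancies `{0: 3, 1: 1}`, a threshold of 4 issues `'P1'`, since bank 1 is the least busy.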

Adaptivity of BAPI
– The prefetch send threshold reserves MSHR entries for prefetches to different banks
– It is adjusted based on prefetch accuracy:
  - Low prefetch accuracy → low prefetch send threshold
  - High prefetch accuracy → high prefetch send threshold
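A minimal sketch of this adaptation, using the accuracy ranges and threshold values given in the simulation-methodology slide later in the deck (0~40% → 1, 40~85% → 7, 85~100% → 27); the function name is illustrative:

```python
def prefetch_send_threshold(accuracy_pct):
    """Map estimated prefetch accuracy (%) to the prefetch send threshold."""
    if accuracy_pct < 40:
        return 1    # inaccurate prefetcher: reserve almost all MSHR entries
    if accuracy_pct < 85:
        return 7
    return 27       # accurate prefetcher: let prefetches fill the MSHRs
```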

DRAM BLP-Aware Request Issue Policies
– BLP-Aware Prefetch Issue (BAPI)
– BLP-Preserving Multi-core Request Issue (BPMRI)

BLP Destruction in CMP Systems
– DRAM request buffers are shared by multiple cores
  - To exploit the BLP of a core, that BLP must be exposed to the DRAM request buffers
  - The BLP potential of a core can be destroyed by interference from other cores' requests
→ The request issue policy from each core to the DRAM request buffers affects the BLP of each application

Why Is DRAM BLP Destroyed?
[Figure: a round-robin request issuer interleaves requests from Core A and Core B into the DRAM request buffers, serializing each core's requests at the DRAM controller so that each core stalls longer; a BLP-preserving issuer sends each core's requests consecutively, saving cycles for Core A at a small increase in cycles for Core B]
→ The issue policy should preserve DRAM BLP

BLP-Preserving Multi-Core Request Issue (BPMRI)
– Consecutively sends requests from one core to the DRAM request buffers
– Limits the maximum number of consecutive requests sent from one core
  - Prevents starvation of memory non-intensive applications
– Prioritizes memory non-intensive applications
  - The impact of delaying requests from a memory non-intensive application > the impact of delaying requests from a memory-intensive application

Implementation of BPMRI
– Last-level (L2) cache miss counter per core: stores the number of L2 cache misses from the core
– Rank register per core: fewer L2 cache misses → higher rank; more L2 cache misses → lower rank

BPMRI Policy
Every request issue cycle:
  if consecutive requests from the selected core ≥ request send threshold then
      selected core ← highest-ranked core
  issue the oldest request from the selected core
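The policy above can be sketched as one issue-cycle step. The ranking rule (fewer L2 misses → higher rank) follows the implementation slide; the function name, the mutable `state` dict, and the list-based queues are illustrative assumptions.

```python
def bpmri_step(state, l2_miss_counts, request_queues, request_send_threshold):
    """One request issue cycle.

    state: dict with 'selected_core' and 'consecutive' (requests issued
           so far from the selected core).
    request_queues: dict core -> list of pending requests (oldest first).
    """
    if state['consecutive'] >= request_send_threshold:
        # Rank cores: fewer L2 cache misses (memory non-intensive) -> higher
        # rank; switch to the highest-ranked core that has pending requests.
        for core in sorted(request_queues, key=lambda c: l2_miss_counts[c]):
            if request_queues[core]:
                state['selected_core'] = core
                state['consecutive'] = 0
                break
    core = state['selected_core']
    if request_queues[core]:
        state['consecutive'] += 1
        return core, request_queues[core].pop(0)  # oldest request first
    return core, None
```

For example, with a threshold of 2 and Core A already having issued 2 consecutive requests, the next step switches to the less memory-intensive Core B and issues its oldest request.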

Simulation Methodology
– x86 cycle-accurate simulator
– Baseline processor configuration:
  - Per core: 4-wide issue, out-of-order, 128-entry ROB; stream prefetcher (prefetch degree: 4, prefetch distance: 64); 32-entry MSHRs; 512KB 8-way L2 cache
  - Shared: on-chip, demand-first FR-FCFS memory controller(s); 1, 2, 4 DRAM channels for the 1-, 4-, and 8-core systems; 64-, 128-, 512-entry DRAM request buffers for the 1-, 4-, and 8-core systems; DDR DRAM, ns, 8KB row buffer

Simulation Methodology (cont.)
– Workloads:
  - 14 most memory-intensive SPEC CPU 2000/2006 benchmarks for the single-core system
  - 30 and 15 multiprogrammed SPEC 2000/2006 workloads, pseudo-randomly chosen, for the 4- and 8-core CMPs
– BPMRI's request send threshold: 10
– BAPI's prefetch send threshold, set by estimated prefetch accuracy:

  Prefetch accuracy (%) | 0~40 | 40~85 | 85~100
  Threshold             |  1   |   7   |   27

– Prefetch accuracy estimation and rank decisions are made every 100K cycles

Performance of BLP-Aware Issue Policies
[Figure: performance improvements of 8.5% on the 1-core system, 13.8% on the 4-core system, and 13.6% on the 8-core system]

Hardware Storage Cost for a 4-core CMP

  Component | Cost (bits)
  BAPI      | 94,368
  BPMRI     | 72
  Total     | 94,440

– Total storage: 94,440 bits (11.5KB), 0.6% of the L2 cache data storage
– The logic is not on the critical path: the issue decision can be made slower than the processor cycle
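A quick check of the storage arithmetic quoted above:

```python
# Verify the totals above: BAPI + BPMRI storage in bits, and the
# kilobyte figure (bits -> bytes -> KB).
bapi_bits, bpmri_bits = 94_368, 72
total_bits = bapi_bits + bpmri_bits   # 94,440 bits
kilobytes = total_bits / 8 / 1024     # about 11.5 KB
```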

Conclusion
– Uncontrolled memory request issue policies limit the level of BLP exploited by the DRAM controller
– BLP-Aware Prefetch Issue increases the BLP of useful requests from each core exposed to the DRAM controller
– BLP-Preserving Multi-core Request Issue ensures that requests from the same core can be serviced in parallel by the DRAM controller
– Both are simple and have low storage cost
– They significantly improve DRAM throughput and performance for both single- and multi-core systems
– Applicable to other memory technologies

Questions?