A Staged Memory Resource Management Method for CMP Systems


1 A Staged Memory Resource Management Method for CMP Systems
Yangguo Liu, Junlin Lu, Dong Tong, and Xu Cheng
Microprocessor Research and Development Center, Peking University

2 Outline
Background and Prior Work
Motivation
- Bank Level Parallelism
- Dynamic Bandwidth Throttling
Staged Memory Resource Management Method (SRM)
- Memory Partitioning
- Integrating Bandwidth Throttling and Memory Partitioning
Evaluation
Summary

3 DRAM Memory System
[Figure: memory controller connected over the DRAM bus to banks 0-7; each bank is a grid of rows and columns with a row buffer.]
First, a brief introduction to DRAM memory systems. A modern memory system consists of one or more channels. Each channel has independent address, data, and command buses, so memory accesses to different channels can proceed in parallel. A channel typically consists of one or several ranks, and each rank consists of multiple banks (e.g., 8 banks per rank in DDR3). A bank is organized as a two-dimensional structure of rows and columns, and each bank has a row buffer that provides temporary storage for one DRAM row.
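To make the hierarchy concrete, the sketch below decodes a physical address into channel, rank, bank, row, and column indices for the 2-channel, 2-rank, 8-bank configuration used later in this talk. The bit layout (and the 64B line offset) is an assumed interleaving for illustration; real controllers choose their own address mapping.

```python
# Illustrative address mapping for 2 channels, 2 ranks/channel,
# 8 banks/rank, 64B cache lines (bit positions are assumptions).
CHANNEL_BITS, RANK_BITS, BANK_BITS = 1, 1, 3
COLUMN_BITS, LINE_BITS = 10, 6     # 1024 columns, 64B line offset

def decode(addr: int) -> dict:
    addr >>= LINE_BITS                            # drop the line offset
    channel = addr & ((1 << CHANNEL_BITS) - 1); addr >>= CHANNEL_BITS
    rank    = addr & ((1 << RANK_BITS) - 1);    addr >>= RANK_BITS
    bank    = addr & ((1 << BANK_BITS) - 1);    addr >>= BANK_BITS
    column  = addr & ((1 << COLUMN_BITS) - 1);  addr >>= COLUMN_BITS
    row     = addr                 # remaining high bits select the row
    return {"channel": channel, "rank": rank, "bank": bank,
            "row": row, "column": column}

print(decode(0x12345678))
```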

4 DRAM Memory System
Row buffer
- Provides temporary storage for one DRAM row
- Row-buffer hit: only a column access is needed
- Row-buffer conflict: precharge, then activate, then column access
- Preserving row-buffer locality keeps access latency low
Multi-bank
- Multiple banks allow multiple memory requests to proceed in parallel
- Bank-level parallelism (BLP) overlaps and hides memory access latency
If a subsequent memory access goes to the row currently held in the row buffer, only a column access is necessary; this is a row-buffer hit. Otherwise there is a row-buffer conflict: the memory controller has to precharge the open row, activate the requested row, and then perform the column access. The multi-bank structure allows memory requests to proceed in parallel.
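The hit/conflict distinction amounts to a small timing model: a conflict pays precharge and activate on top of the column access. The latencies below are placeholder values, not DDR3 datasheet timings.

```python
# Minimal per-bank row-buffer model; latencies are illustrative
# placeholders, not real DDR3 timing parameters.
T_PRECHARGE, T_ACTIVATE, T_COLUMN = 15, 15, 15   # in DRAM cycles

class Bank:
    def __init__(self):
        self.open_row = None   # row currently held in the row buffer

    def access(self, row: int) -> int:
        """Return the latency of accessing `row` in this bank."""
        if self.open_row == row:          # row-buffer hit
            return T_COLUMN
        latency = T_ACTIVATE + T_COLUMN   # empty row buffer: activate first
        if self.open_row is not None:     # conflict: precharge the open row
            latency += T_PRECHARGE
        self.open_row = row
        return latency

bank = Bank()
print(bank.access(7), bank.access(7), bank.access(3))   # 30, 15, 45
```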

5 Shared DRAM MPSoC Systems
[Figure: cores 0-7 with private caches share a memory controller and DRAM banks 0-7 over the data bus.]
In a multi-core system, the on-chip cores share the DRAM system. As a result, when applications executing on these cores generate DRAM requests, the requests interfere with each other in the DRAM system, causing inter-application interference.

6 Inter-application Interference in DRAM System
Memory streams of different applications are interleaved and interfere with each other at the DRAM memory. This destroys the original row-buffer locality of individual applications, leading to:
- A high row-buffer conflict rate
- High memory access latency

7 Prior Work
Out-of-order memory scheduling
- Reorders memory requests to recover the original row-buffer locality
- Its recovering ability is restricted by the arrival interval between memory requests of individual applications and by the scheduling buffer size
- Cannot eliminate inter-application interference
Bank partitioning
- Divides memory banks among cores
- Isolates the memory access streams of different applications
- Eliminates inter-application interference

8 Prior Work
Bank partitioning via OS-based page coloring
- Assigns different page colors (disjoint sets of banks) to different cores
- Eliminates inter-application interference
Drawback
- Previous bank partitioning only addresses the interference among memory-intensive applications, but ignores the interference caused by memory non-intensive applications.
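A minimal sketch of page-coloring-based bank partitioning, assuming the bank index occupies three (hypothetical) bits of the physical frame number. The OS keeps one free list per color and serves each core only from the colors that core owns.

```python
# Page-coloring sketch: steer each core's page allocations to frames
# whose bank-index bits (the "color") belong to that core's partition.
# The bit positions within the frame number are assumptions.
BANK_SHIFT, BANK_BITS = 2, 3

def color_of(pfn: int) -> int:
    return (pfn >> BANK_SHIFT) & ((1 << BANK_BITS) - 1)

class ColoredAllocator:
    def __init__(self, num_frames: int):
        self.free = {c: [] for c in range(1 << BANK_BITS)}
        for pfn in range(num_frames):
            self.free[color_of(pfn)].append(pfn)

    def alloc(self, allowed_colors: list) -> int:
        """Allocate a frame whose color is in the core's partition."""
        for c in allowed_colors:
            if self.free[c]:
                return self.free[c].pop()
        raise MemoryError("no free frame in this core's bank partition")

allocator = ColoredAllocator(num_frames=1024)
pfn = allocator.alloc(allowed_colors=[0, 1])   # core owns banks 0 and 1
assert color_of(pfn) in (0, 1)
```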

9 Outline
Background and Prior Work
Motivation
- Bank Level Parallelism
- Dynamic Bandwidth Throttling
Staged Memory Resource Management Method (SRM)
- Memory Partitioning
- Integrating Bandwidth Throttling and Memory Partitioning
Evaluation
Summary

10 Bank Level Parallelism
- Memory non-intensive applications are not sensitive to the number of banks
- Memory-intensive applications, whether with high or low row-buffer locality, are sensitive to the number of banks
- Most applications give peak performance at 8 or 16 banks
- These applications are BLP-sensitive

11 Trade Off between Parallelism and Locality
- Bank partitioning (BP): preserves row-buffer locality
- Shared: improves bank-level parallelism at the cost of interference
- For applications with low row-buffer locality, improving BLP brings more benefit than eliminating interference
We study the trade-off between bank-level parallelism and row-buffer locality for memory-intensive applications with low spatial locality. In our experiments, we run two applications concurrently. The column BP means the two applications have their own memory banks; Shared means the two applications share all banks evenly. BP preserves row-buffer locality at the cost of reduced BLP; however, since the original spatial locality of these applications is low, recovering row-buffer locality brings limited benefit. Shared improves BLP at the cost of inter-application interference. When the number of banks rises to 16 or 32, increasing bank-level parallelism outperforms reducing interference. As the number of banks rises to 64, increasing bank-level parallelism by evenly sharing all banks negatively impacts DRAM performance, since most applications give peak performance at 8 or 16 banks. Two instances of bwaves running together show similar results, and when two different applications (mcf and bwaves) run together, evenly sharing their memory banks also improves system throughput.

12 Dynamic Bandwidth Throttling
- BTT (bandwidth throttling threshold): sets the maximum bandwidth share that a core is allowed to use. Once this level is reached, the arbiter no longer accepts requests from that core until its bandwidth usage drops below the threshold.
- MAFR (memory-access frequency ratio): the ratio of the memory intensity of memory-intensive applications to that of memory non-intensive applications.
We study how bandwidth throttling influences system performance. We define bandwidth throttling as the blocking of new memory requests being sent from a particular application to the memory controller. First, we observe that as the BTT increases, system performance also increases up to a point (approximately 20%). Second, past that point the throttling becomes too aggressive, while below it the throttling is too conservative. Third, as MAFR increases, the impact of bandwidth throttling on system performance grows; cases with low MAFR (e.g., < 1) show almost no performance variation as the BTT varies from 0% to 20%.
Configuration: CPU: 8 cores, 4 GHz, out-of-order; L1: 32KB private; L2: 8MB shared. DRAM: FR-FCFS, open-page, 2 channels, 2 ranks/channel, 8 banks/rank, DDR3-1333.
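A sketch of the BTT rule defined above. The per-interval bookkeeping (a fixed request budget per interval, reset at interval boundaries) is an assumption; the slide only fixes the rule itself.

```python
# Sketch of BTT-based throttling; the fixed per-interval request
# budget is an assumed implementation detail.
REQUESTS_PER_INTERVAL = 1000      # assumed interval capacity, in requests

class ThrottlingArbiter:
    def __init__(self, btt_per_core: dict):
        self.btt = btt_per_core   # e.g. {core_id: 0.20} = 20% share
        self.used = {c: 0 for c in btt_per_core}

    def admit(self, core: int) -> bool:
        """Refuse a core's request once its bandwidth share reaches its BTT."""
        if self.used[core] / REQUESTS_PER_INTERVAL >= self.btt[core]:
            return False          # throttled until the next interval
        self.used[core] += 1
        return True

    def new_interval(self):
        self.used = {c: 0 for c in self.used}

arbiter = ThrottlingArbiter({0: 0.20, 1: 0.05})
print(sum(arbiter.admit(0) for _ in range(300)))  # 200: core 0 capped at 20%
```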

13 Dynamic Bandwidth Throttling
- Most workloads achieve their best performance with a BTT ranging from 5% to 20%
- As the BTT increases, system performance also increases up to a point (approximately 20%), because the associated reduction in memory-access congestion alleviates interference
- Past a certain point the BTT is too aggressive; an incorrectly set threshold (over- or under-throttling) degrades performance
- As MAFR increases, the impact of bandwidth throttling on system performance becomes greater

14 Outline
Background and Prior Work
Motivation
- Bank Level Parallelism
- Dynamic Bandwidth Throttling
Staged Memory Resource Management Method (SRM)
- Memory Partitioning
- Integrating Bandwidth Throttling and Memory Partitioning
Evaluation
Summary

15 Staged Memory Resource Management Method (SRM)
High-level architecture of the CMP system. SRM introduces a three-stage design. The first stage, bandwidth throttling, dynamically assigns memory bandwidth to different applications according to their memory-access intensities and throttles memory-intensive applications from consuming excessive bandwidth, in order to alleviate interference. The second stage, the DRAM command scheduler, schedules memory accesses while satisfying all DRAM timing constraints. The last stage, memory partitioning, dynamically divides banks and channels among applications to address interference at the DRAM device.
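The second stage is a conventional DRAM command scheduler. Since the configuration (slide 12) and the baseline (slide 24) use FR-FCFS, a minimal sketch of that selection policy is shown here: among queued requests, prefer row-buffer hits, then the oldest request. The request representation is an assumption.

```python
# Minimal FR-FCFS selection: prefer row-buffer hits, then oldest.
from dataclasses import dataclass

@dataclass
class Request:
    arrival: int   # arrival time; lower = older
    bank: int
    row: int

def fr_fcfs(queue: list, open_rows: dict) -> Request:
    """Pick the next request: row-hits first, ties broken by age."""
    hits = [r for r in queue if open_rows.get(r.bank) == r.row]
    candidates = hits if hits else queue
    return min(candidates, key=lambda r: r.arrival)

queue = [Request(1, bank=0, row=5), Request(2, bank=0, row=9)]
print(fr_fcfs(queue, open_rows={0: 9}))   # picks the row-hit despite age
```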

16 The Process
Step 1: Profiling applications' memory characteristics
- Memory intensity: MAPI (memory accesses per interval)
- Row-buffer locality: RBH (row-buffer hit rate)
Step 2: Grouping applications
- Memory non-intensive cluster (NI cluster)
- Memory intensive cluster (MI cluster), split into:
  - Low row-buffer locality (MI-LRL cluster)
  - High row-buffer locality (MI-HRL cluster)
Step 3: Memory partitioning
Step 4: Bandwidth throttling
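A minimal sketch of Steps 1-2 as a classification function. The MAPI and RBH cutoffs are assumptions (the slides do not give threshold values); the cluster names come from the slide.

```python
# Grouping sketch; thresholds are assumed, cluster names are from
# the slides (NI, MI-LRL, MI-HRL).
MAPI_THRESHOLD = 100   # assumed: accesses/interval above this = MI
RBH_THRESHOLD = 0.5    # assumed: row-buffer hit rate above this = HRL

def classify(mapi: float, rbh: float) -> str:
    if mapi < MAPI_THRESHOLD:
        return "NI"    # memory non-intensive
    return "MI-HRL" if rbh >= RBH_THRESHOLD else "MI-LRL"

profiles = {"appA": (450, 0.2), "appB": (300, 0.8), "appC": (5, 0.9)}
clusters = {app: classify(*p) for app, p in profiles.items()}
print(clusters)   # {'appA': 'MI-LRL', 'appB': 'MI-HRL', 'appC': 'NI'}
```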

17 Locality-aware Bank Partitioning
Step 1: Profiling applications' memory characteristics
Step 2: Grouping applications
Step 3: Memory partitioning. Objectives:
- Isolate the memory access streams of MI-HRL applications from other MI applications
- Improve BLP for BLP-sensitive applications
Step 4: Dynamic bandwidth throttling

18 Memory Partitioning
Memory Channel Partitioning (MCP) + Dynamic Bank Partitioning (DBP). We assign a dedicated memory channel to MI-HRL applications and further isolate their memory accesses from each other by allocating an equal number of dedicated memory banks to each of them. The MI-LRL applications share the other channel with NI applications. For MI-LRL applications, we provide as many banks as possible to increase bank-level parallelism: these applications share all the banks of the channel with the NI applications.

19 Memory Partitioning
MI-HRL cluster
- Isolate MI-HRL applications from other memory-intensive applications
- Allocate a dedicated channel
MI-LRL cluster
- Shares the channel with NI applications
- Shares all the banks of the allocated channel
NI cluster
- Shares banks with the MI-LRL group
- Saves memory banks for BLP-sensitive applications
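A minimal sketch of this policy for a 2-channel, 8-banks-per-channel system: MI-HRL applications get private bank ranges on a dedicated channel, while MI-LRL and NI applications share all banks of the other channel. The data structures are hypothetical.

```python
# Partitioning sketch for 2 channels x 8 banks (assumed layout).
def partition(clusters: dict, banks_per_channel: int = 8) -> dict:
    hrl = [a for a, c in clusters.items() if c == "MI-HRL"]
    others = [a for a, c in clusters.items() if c != "MI-HRL"]
    plan = {}
    share = banks_per_channel // max(len(hrl), 1)
    for i, app in enumerate(hrl):     # channel 0: private bank ranges
        plan[app] = ("ch0", list(range(i * share, (i + 1) * share)))
    for app in others:                # channel 1: all banks, shared
        plan[app] = ("ch1", list(range(banks_per_channel)))
    return plan

print(partition({"appA": "MI-LRL", "appB": "MI-HRL", "appC": "NI"}))
```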

20 Integrating Bandwidth Throttling and Memory Partitioning
Memory partitioning
- Divides memory banks among applications
- Resolves inter-application interference
- Does not consider system throughput and fairness
Bandwidth throttling
- Considers applications' memory access behavior and system fairness
- Prioritizing light applications can improve system throughput [Kim et al., HPCA 2010; Muralidhara et al., MICRO 2011; Zheng et al., ICPP 2008]
- Improves system throughput and fairness
- Cannot eliminate inter-application interference
Memory partitioning solves inter-application memory interference purely through OS-based page coloring, but it does not take system throughput and fairness into account. In contrast, bandwidth throttling considers applications' memory access behavior and system fairness. Previous work shows that servicing memory requests from "light" applications allows them to quickly resume computation, which improves system throughput. Memory scheduling can effectively improve system throughput and fairness, but it cannot eliminate inter-application interference. The two kinds of method are therefore complementary: memory partitioning can benefit from better scheduling, and memory scheduling can benefit from improved spatial locality.

21 Integrating Bandwidth Throttling and Bank Partitioning
Memory partitioning
- Dynamically divides memory banks among applications
- Mitigates inter-application interference
- Improves bank-level parallelism
Bandwidth throttling
- Proportionally throttles MI applications
- Applications in the NI cluster are relatively prioritized

22 Outline
Background and Prior Work
Motivation
- Bank Level Parallelism
- Dynamic Bandwidth Throttling
Staged Memory Resource Management Method (SRM)
- Memory Partitioning
- Integrating Bandwidth Throttling and Memory Partitioning
Evaluation
Summary

23 Evaluation Methodology
Simulation model
- 8 cores, 2 channels, 2 ranks/channel, 8 banks/rank
Core model
- Out-of-order
- 32KB instruction / 32KB data L1, 4-way, 64B line, LRU
- 8MB shared L2 cache, 16-way, 64B line
- DRAMSim2 with DDR3 [P. Rosenfeld et al., CAL 2011]
Workloads
- SPEC CPU 2006 multi-programmed workloads (categorized by memory intensity)
Metrics
- System throughput and system fairness (slides 25 and 26)
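For reference, the same parameters gathered into a single structure. This only restates the slide; it is not DRAMSim2's configuration-file format.

```python
# Summary of the simulated system (slide parameters only).
CONFIG = {
    "cores": 8, "core_type": "out-of-order", "freq_ghz": 4,
    "l1": {"size_kb": 32, "assoc": 4, "line_b": 64, "repl": "LRU",
           "scope": "private, split I/D"},
    "l2": {"size_mb": 8, "assoc": 16, "line_b": 64, "scope": "shared"},
    "dram": {"simulator": "DRAMSim2", "type": "DDR3-1333",
             "channels": 2, "ranks_per_channel": 2, "banks_per_rank": 8,
             "scheduler": "FR-FCFS", "page_policy": "open-page"},
}
```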

24 Comparison with Other Methods
Baseline
- FR-FCFS [Rixner et al., ISCA 2000]: prioritizes row-hit requests, then older requests
Others
- MCP: memory channel partitioning
- DBP: dynamic bank partitioning
- TCM: thread cluster memory scheduler (previously the best scheduling algorithm for system performance)
- DPBT: dynamically proportional bandwidth throttling
- MCP+DBP
- SRM: staged memory resource management method (this work)
- DBP+TCM: integrating dynamic bank partitioning with thread cluster memory scheduling [Mingli Xie et al., HPCA 2014]

25 System Throughput
Integrating bandwidth throttling and memory partitioning provides better system throughput than either technique alone.

26 System Fairness

27 Summary
Inter-application interference in the DRAM memory of CMP systems degrades system performance.
Memory partitioning
- Resolves interference
- Improves intra-application bank-level parallelism
Integrating memory partitioning with bandwidth throttling
- Dynamic memory partitioning resolves interference
- Bandwidth throttling proportionally throttles MI applications; applications in the NI cluster are relatively prioritized

28 Thank you. Questions?

