1 Lecture 4: Memory Scheduling, Refresh
Topics: scheduling policies, refresh basics

2 Scheduling Policies Basics
Must honor several timing constraints for each bank/rank
Commands: PRE, ACT, COL-RD, COL-WR, REF, Power-Up/Dn
Must handle reads and writes on the same DDR3 bus
Must issue refreshes on time
Must maximize row buffer hit rates and parallelism
Must maximize throughput and fairness
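To make the timing-constraint bullet concrete, here is a minimal sketch of the per-bank bookkeeping a controller might use before issuing a command. The constraint names are standard DDR terms, but the cycle values and the class shape are placeholder assumptions, not DDR3 datasheet numbers.

```python
# Per-bank bookkeeping: record the earliest cycle each command may issue.
T = {"tRCD": 13, "tRP": 13, "tRFC": 280}   # placeholder cycle counts

class BankState:
    def __init__(self):
        self.earliest = dict.fromkeys(
            ["PRE", "ACT", "COL-RD", "COL-WR", "REF"], 0)

    def issue(self, cmd, now):
        assert now >= self.earliest[cmd], "timing constraint violated"
        if cmd == "ACT":       # row must be open tRCD before column cmds
            self.earliest["COL-RD"] = now + T["tRCD"]
            self.earliest["COL-WR"] = now + T["tRCD"]
        elif cmd == "PRE":     # precharge delay before the next activate
            self.earliest["ACT"] = now + T["tRP"]
        elif cmd == "REF":     # refresh blocks the bank for tRFC
            for c in self.earliest:
                self.earliest[c] = now + T["tRFC"]
```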

3 Address Mapping Policies
Consecutive cache lines can be placed in the same row to boost row buffer hit rates
Consecutive cache lines can be placed in different ranks to boost parallelism
Example address mapping policies:
  row:rank:bank:channel:column:blkoffset
  row:column:rank:bank:channel:blkoffset
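A minimal sketch of how a controller might decode an address under the first policy above (row:rank:bank:channel:column:blkoffset, lowest field last). The field widths are illustrative assumptions: 64B lines, 128 columns of lines per row, 4 channels, 8 banks, 2 ranks.

```python
# Peel fields off the low end of the physical address, low bits first.
FIELDS = [("blkoffset", 6), ("column", 7), ("channel", 2),
          ("bank", 3), ("rank", 1), ("row", 16)]

def decode(addr):
    out = {}
    for name, bits in FIELDS:
        out[name] = addr & ((1 << bits) - 1)
        addr >>= bits
    return out

# Two consecutive cache lines (64 bytes apart) differ only in 'column',
# so they land in the same row of the same bank: a row buffer hit.
print(decode(0x12345000))
print(decode(0x12345040))
```

Under the second policy, the rank/bank/channel bits sit below the column bits, so consecutive lines spread across channels and ranks instead, trading row buffer locality for parallelism.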

4 Reads and Writes
A single bus is used for both reads and writes
The bus direction must be reversed when switching between reads and writes; this takes time and leads to bus idling
Hence, writes are performed in bursts: a write buffer stores pending writes until a high water mark is reached
Writes are then drained until a low water mark is reached
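A minimal sketch of the high/low water mark drain policy; the buffer size, thresholds, and request representation are illustrative assumptions.

```python
HIGH_WM, LOW_WM = 48, 16   # entries in an assumed 64-entry write buffer

class Channel:
    def __init__(self):
        self.write_q, self.read_q = [], []
        self.draining = False

    def tick(self):
        # Enter write-drain mode only when the buffer fills, so the
        # expensive bus turnaround is amortized over a burst of writes.
        if len(self.write_q) >= HIGH_WM:
            self.draining = True
        if self.draining and len(self.write_q) <= LOW_WM:
            self.draining = False
        if self.draining and self.write_q:
            return ("WRITE", self.write_q.pop(0))
        if self.read_q:
            return ("READ", self.read_q.pop(0))
        return None
```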

5 Refresh Example
tREFI (gap between refresh commands) = 7.8 us; tRFC (time to complete a refresh command) = 350 ns
JEDEC: must issue 8 refresh commands in an 8*tREFI time window
This allows some flexibility in when each refresh command is issued
Elastic refresh: issue a refresh command when there is a lull in activity; any unissued refreshes are handled at the end of the 8*tREFI window
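A minimal sketch of the elastic refresh decision; the constants come from the slide, but the function shape and the slack check are a simplification, not the published mechanism.

```python
TREFI_NS, TRFC_NS, WINDOW = 7800, 350, 8

def refresh_decision(now_ns, issued, window_start_ns, bus_idle):
    """Refresh opportunistically during lulls; force any postponed
    refreshes before the 8*tREFI window closes."""
    deadline = window_start_ns + WINDOW * TREFI_NS
    owed = WINDOW - issued          # refreshes still owed this window
    # If back-to-back refreshes (tRFC each) only barely fit before the
    # deadline, refreshing is now mandatory.
    if now_ns + owed * TRFC_NS >= deadline:
        return True
    return bus_idle                 # otherwise refresh only during a lull
```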

6 Maximizing Row Buffer Hit Rates
FCFS: issue the first read or write in the queue that is ready for issue (not necessarily the oldest in program order)
First-Ready FCFS (FR-FCFS): among ready requests, first issue row buffer hits if you can
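A minimal FR-FCFS sketch. Requests are (arrival, bank, row) tuples and open_rows maps bank to its currently open row; both shapes are assumptions for illustration, and all queued requests are assumed to be timing-ready.

```python
def fr_fcfs_pick(queue, open_rows):
    # First Ready: prefer requests that hit in an open row buffer...
    hits = [r for r in queue if open_rows.get(r[1]) == r[2]]
    pool = hits if hits else queue
    # ...then FCFS: among those, pick the oldest arrival.
    return min(pool, key=lambda r: r[0])

queue = [(0, 0, 7), (1, 0, 3), (2, 1, 5)]
print(fr_fcfs_pick(queue, {0: 3, 1: 9}))   # -> (1, 0, 3), a row buffer hit
```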

7 STFM (Mutlu and Moscibroda, MICRO'07)
When multiple threads run together, threads with row buffer hits are prioritized by FR-FCFS
Each thread has a slowdown: S = T_shared / T_alone, where T is the number of cycles the ROB is stalled waiting for memory
Unfairness is estimated as S_max / S_min
If unfairness is higher than a threshold, thread priorities override other priorities (Stall-Time Fair Memory scheduling)
Estimating T_alone requires some book-keeping: does an access delay critical requests from other threads?
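A minimal sketch of the fairness check; the threshold value and the stall-time numbers are illustrative assumptions, and in the real scheme T_alone must be estimated online with the book-keeping noted above.

```python
UNFAIRNESS_THRESHOLD = 1.10    # assumed value; tunable in the real scheme

def stfm_priority_override(t_shared, t_alone_est):
    """Per-thread memory stall cycles: measured shared, estimated alone.
    Returns a thread to prioritize over FR-FCFS, or None."""
    slowdown = {t: t_shared[t] / t_alone_est[t] for t in t_shared}
    unfairness = max(slowdown.values()) / min(slowdown.values())
    if unfairness > UNFAIRNESS_THRESHOLD:
        return max(slowdown, key=slowdown.get)   # most-slowed thread
    return None    # fairness acceptable: keep the baseline policy

print(stfm_priority_override({"A": 900, "B": 400},
                             {"A": 300, "B": 350}))   # -> 'A'
```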

8 PAR-BS (Mutlu and Moscibroda, ISCA'08)
A batch of requests (per bank) is formed: each thread can only contribute R requests to this batch; batch requests have priority over non-batch requests
Within a batch, priority is first given to row buffer hits, then to threads with a higher "rank", then to older requests
Rank is computed based on the thread's memory intensity; low-intensity threads are given higher priority; this policy improves batch completion time and overall throughput
By using rank, requests from a thread are serviced in parallel; hence, parallelism-aware batch scheduling
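A PAR-BS-style sketch: cap each thread at R requests per batch, then order within the batch by (row hit, thread rank, age). The request shape (age, thread, bank, row) and the value of R are assumptions.

```python
R = 5

def form_batch(queue):
    taken, batch = {}, []
    for req in sorted(queue):                  # oldest first by age
        if taken.get(req[1], 0) < R:
            batch.append(req)
            taken[req[1]] = taken.get(req[1], 0) + 1
    return batch    # batch requests outrank all non-batch requests

def pick(batch, open_rows, rank):
    # rank[t] is higher for low-intensity threads ("shortest job first")
    return min(batch, key=lambda r: (open_rows.get(r[2]) != r[3],
                                     -rank[r[1]], r[0]))
```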

9 TCM (Kim et al., MICRO 2010)
Organize threads into latency-sensitive and bandwidth-sensitive clusters based on memory intensity; the former gets higher priority
Within the bandwidth-sensitive cluster, priority is based on rank
Rank is determined based on the "niceness" of a thread, and the ranks are periodically shuffled with insertion shuffling or random shuffling (the former is used if there is a big gap in niceness)
Threads with low row buffer hit rates and high bank-level parallelism are considered "nice" to others
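A minimal sketch of the two-cluster split; the intensity cutoff and the niceness formula below are simplified assumptions, not the paper's exact definitions.

```python
INTENSITY_CUTOFF = 10.0   # misses per kilo-instruction, assumed

def cluster(threads):
    """threads: {name: (mpki, row_hit_rate, bank_parallelism)}"""
    latency = [t for t, v in threads.items()
               if v[0] < INTENSITY_CUTOFF]     # always higher priority
    bw = [t for t in threads if t not in latency]
    # Nice threads (low row-hit rate, high bank-level parallelism)
    # interfere less, so they are ranked higher within the bw cluster.
    niceness = {t: threads[t][2] - threads[t][1] for t in bw}
    return latency, sorted(bw, key=niceness.get, reverse=True)
```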

10 Minimalist Open-Page Kaseridis et al., MICRO 2011 Place 4 consecutive cache lines in one bank, then the next 4 in a different bank and so on – provides the best balance between row buffer locality and bank-level parallelism Don’t have to worry as much about fairness Scheduling first takes priority into account, where priority is determined by wait-time, prefetch distance, and MLP in thread A row is precharged after 50 ns, or immediately following a prefetch-dictated large burst

11 Other Scheduling Ideas
Using reinforcement learning: Ipek et al., ISCA 2008
Coordinating across multiple MCs: Kim et al., HPCA 2010
Coordinating requests from GPU and CPU: Ausavarungnirun et al., ISCA 2012
Several schedulers in the Memory Scheduling Championship at ISCA 2012
Predicting the number of row buffer hits: Awasthi et al., PACT 2011

12 Refresh Basics
A cell is expected to have a retention time of 64 ms; every cell must be refreshed within a 64 ms window
The refresh task is broken into 8K refresh operations; a refresh operation is issued every tREFI = 7.8 us
If you assume that a row of cells on a chip is 8Kb and there are 8 banks, then every refresh operation in a 4Gb chip must handle 8 rows in each bank
Each refresh operation takes time tRFC = 300 ns
Larger chips have more cells, so tRFC will grow
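A worked check of the slide's arithmetic, back-of-envelope, assuming the slide's 8Kb rows and 8 banks (real chips vary).

```python
CHIP_BITS   = 4 * 2**30     # 4Gb chip
ROW_BITS    = 8 * 2**10     # 8Kb of cells per row
BANKS       = 8
REFRESH_OPS = 8 * 2**10     # 8K refresh operations per 64 ms window

print(64e6 / REFRESH_OPS)   # tREFI = 7812.5 ns, i.e. ~7.8 us
rows_per_op = (CHIP_BITS // ROW_BITS) // REFRESH_OPS
print(rows_per_op, rows_per_op // BANKS)   # 64 rows per REF, 8 per bank
```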

13 More Refresh Details
To refresh a row, it needs to be activated and precharged
Refresh pipeline: the banks are handled sequentially; each bank draws the max available current to refresh a row in many subarrays in parallel; the process ends with a recovery period to restore the charge pumps
"Row" on the previous slide refers to the size of the available row buffer when you do an Activate; an Activate only deals with some of the subarrays in a bank; refresh performs an activate in all subarrays in a bank, so it can handle multiple rows in a bank in parallel

14 Fine Granularity Refresh
Will be used in DDR4
Breaks refresh into smaller tasks; this helps reduce read queuing delays (see example)
In a future 32Gb chip, tRFC = 640 ns, tRFC_2x = 480 ns, tRFC_4x = 350 ns; note the high overhead from the recovery period, which is paid on every (now more frequent) refresh command, so tRFC_4x is far more than a quarter of tRFC
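A back-of-envelope view of the trade-off, using the slide's projected 32Gb numbers. The halved/quartered tREFI in 2x/4x modes follows the DDR4 rule that finer modes refresh proportionally more often; the "busy fraction" metric is a framing choice, not from the slide.

```python
MODES = {"1x": (7800, 640), "2x": (3900, 480), "4x": (1950, 350)}
for mode, (trefi_ns, trfc_ns) in MODES.items():
    print(f"{mode}: worst-case read stall {trfc_ns} ns, "
          f"rank busy refreshing {trfc_ns / trefi_ns:.1%} of the time")
# 1x: 8.2%, 2x: 12.3%, 4x: 17.9% busy. Finer modes cut the worst-case
# stall behind a REF, but pay the recovery overhead more often.
```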

15 What Makes Refresh Worse
Refresh operations are issued per rank; LPDDR does allow per-bank refresh
Can refresh all ranks simultaneously: this reduces memory unavailable time, but increases memory peak power
Can refresh ranks in a staggered manner: this increases memory unavailable time, but reduces memory peak power
High temperatures increase the leakage rate and require faster refresh (above 85 degrees C, tREFI drops to 3.9 us)

16 Next Class
Refresh optimizations: elastic refresh, refresh pausing, smart refresh, Flikker, preemptive command drain, refresh and commands together, etc.
