Page-based Commands for DRAM Systems Aamer Jaleel Brinda Ganesh Lei Zong.

Outline
- Memory System Overview
- Experiment Setup
- Page-Level Access Measurements
- Solution
- Expected Speedup

Memory Access Time

CPU → L1 → L2 → MC → DRAM

    CPU access    Time (cycles)
    L1            3
    L2            28
    DRAM          181

Data for a 1.8 GHz Opteron (www.aceshardware.com/)

Memory Access Applications
- Initialization
- Data movement
- Stream operations
- Operating system: task creation, system calls, page allocation and management

Library routines used:
- Memset, Clear User (MEMZERO)
- Memcpy, Copy From User, Copy To User

Experiment Setup
- Workstation based:
  - 2.4 GHz P4 (wonko.sca.umd.edu)
  - 750 MHz PIII (majikthise.eng.umd.edu)
  - 900 MHz PIII (jaleel.eng.umd.edu)
- Bochs x86 emulator
- Operating system: Mandrake 9.0, Linux Kernel v
- Applications: SPEC2000 integer benchmarks, using glibc-2.2.5

Resources Used
[Block diagram: CPU core with IL1/DL1 and a unified L2, memory controller, and DRAM (HW); system libraries and the Mandrake Linux 9.0 kernel (SW) running under the Bochs emulator. User-level and kernel-level routines run on the same OS.]

MEMSET – SPECINT 2000

MEMSET Overhead – SPECINT 2000

MEMCPY – SPECINT 2000

MEMCPY Overhead – SPECINT 2000

OS Behavior: MEMZERO/MEMCPY (live data shown during the talk)

Page-based Commands
- SET_PAGE #(CONS), #ADDR, #(SIZE): ADDR ← CONS
- COPY_PAGE #DST, #SRC, #(SIZE): DST ← SRC
- Page-level stream operations: A ← B + C, A ← B − C

Issues w/ Page-based Commands
- Data partially present in the cache (cache–memory consistency issues):
  - SET_PAGE: add logic in the cache to latch in data; if a cache block is dirty, write it to memory
  - COPY_PAGE: if the destination is in the cache, evict it
- Address not page aligned: will require accessing 2 rows
[Instruction stream figure: SET_PAGE and COPY_PAGE commands interleaved with ordinary instructions.]

How much data is actually in the cache?

    Function               % Hit Rate (Boot + Halt)   % Hit Rate (SPEC workload)
    Memset                 7.23%                      0.23%
    Memcpy (Source)        (value lost in transcript)
    Memcpy (Destination)   < 0.01%

Page-based MEMSET

void *memset(void *s, int c, size_t n)

Current implementation:
    end ← s + n
    while (s < end)
        MEM[s++] ← c

Proposed implementation:
    end ← s + n
    while (n ≥ PageSize)
        SET_PAGE (c), s, (PageSize)
        n ← n − PageSize
        s ← s + PageSize
    while (s < end)
        MEM[s++] ← c

Expected Speedup

Average memset time for a 4 KB page with a 128-byte cache line size:
    Row read time × #rows + misc = 100 ns × 32 + X = 3.2 μs + X
Measured average: 4 μs

Expected time using page-based commands:
    Max #rows/page × row read time + cache-coherence logic + misc = 2 × 100 ns + X = 200 ns + X

Expected speedup: ≥ 50% (approximation)

Conclusions
- Memory is frequently accessed at a page granularity
- Page-based commands can replace existing routines that do their work one cache line at a time
- If implemented, we expect significant speedups

Related Work
- IRAM – on-chip DRAM
  - Advantages: energy efficient; eliminates much of the off-chip memory access
  - Disadvantages: not much performance increase; doesn't work with conventional microprocessors
- Active Pages – bring computation to the DRAM
  - Breaks memory into fixed-size pages and adds reconfigurable logic to the DRAM
- Elimination of compulsory cache misses due to dynamic initialization