Page-based Commands for DRAM Systems
  Aamer Jaleel, Brinda Ganesh, Lei Zong
Outline
  Memory System Overview
  Experiment Setup
  Page-Level Access Measurements
  Solution
  Expected Speedup
Memory Access Time
  Path: CPU, L1, L2, MC, DRAM
  CPU access time (cycles)
    L1      3
    L2      8
    DRAM  181
  Data for 1.8 GHz Opteron (www.aceshardware.com/)
Memory Access
  Applications
    Initialization
    Data movement
    Stream operations
  Operating system
    Task creation
    System calls
    Page allocation, management
  Library routines used
    Memset, Clear User (MEMZERO)
    Memcpy, Copy from User, Copy to User
Experiment Setup
  Workstation based
    2.4 GHz P4 (wonko.sca.umd.edu)
    750 MHz PIII (majikthise.eng.umd.edu)
    900 MHz PIII (jaleel.eng.umd.edu)
  Bochs x86 emulator
  Operating system
    Mandrake 9.0
    Linux Kernel v
  Applications
    SPEC2000 Integer benchmarks using glibc-2.2.5
[Figure: system under study. Software (SW): UPROC, system libraries, Mandrake Linux 9.0 kernel; user-level and kernel-level routines running on the same OS. Hardware (HW), in Bochs: CPU core, IL1, DL1, unified L2, memory controller, DRAM.]
MEMSET – SPECINT 2000
MEMSET Overhead – SPECINT 2000
MEMCPY – SPECINT 2000
MEMCPY Overhead – SPECINT 2000
OS Behavior: MEMZERO/MEMCPY SHOW LIVE DATA
Page-Based Commands
  SET_PAGE #(CONS), #ADDR, #(SIZE)
    ADDR <- CONS
  COPY_PAGE #DST, #SRC, #(SIZE)
    DST <- SRC
  Page-level stream operations
    A <- B + C
    A <- B - C
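The intended effect of the two commands can be modeled in software. The sketch below is only an illustration: the function names set_page and copy_page, the 4096-byte PAGE_SIZE, the one-page size limit, and the use of memset/memmove as stand-ins for the in-DRAM operation are all assumptions, not part of the proposal.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 4096u  /* assumed page size in bytes */

/* Software stand-in for SET_PAGE #(CONS), #ADDR, #(SIZE):
 * fill `size` bytes starting at `addr` with the constant `cons`.
 * Returns 0 on success, -1 if the request is larger than one page
 * (an assumed limit for this model). */
int set_page(uint8_t cons, void *addr, size_t size)
{
    if (size > PAGE_SIZE)
        return -1;
    memset(addr, cons, size);
    return 0;
}

/* Software stand-in for COPY_PAGE #DST, #SRC, #(SIZE):
 * copy `size` bytes from `src` to `dst`. memmove is used so that
 * overlapping ranges are still handled correctly in the model. */
int copy_page(void *dst, const void *src, size_t size)
{
    if (size > PAGE_SIZE)
        return -1;
    memmove(dst, src, size);
    return 0;
}
```

In real hardware the point is that one command replaces a loop of cache-line-sized stores; this model only captures the resulting memory contents.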
Issues with Page-Based Commands
  Data partially present in cache? Cache-memory consistency issues
    SET_PAGE
      Add logic in cache to latch in data
      If cache block is dirty, write it to memory
    COPY_PAGE
      If destination is in cache, evict it
  Address is not page aligned
    Will require accessing 2 rows
  [Figure: instruction stream. SET_PAGE #(CONS), #ADDR, #(SIZE) ~~~ ~~~~ COPY_PAGE #DST, #SRC, #(SIZE) ~~~~ SET_PAGE #(CONS), #ADDR, #(SIZE) ~~~~]
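The alignment issue can be made concrete with a small calculation. This is a sketch under the assumption that one DRAM row holds exactly one 4096-byte page; the name rows_touched and the ROW_SIZE constant are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ROW_SIZE 4096u  /* assumed: one DRAM row holds one 4 KB page */

/* Number of DRAM rows covered by the byte range [addr, addr + size).
 * A page-sized request that starts on a row boundary touches 1 row;
 * the same request starting anywhere else straddles 2 rows. */
unsigned rows_touched(uintptr_t addr, size_t size)
{
    if (size == 0)
        return 0;
    uintptr_t first = addr / ROW_SIZE;               /* row of first byte */
    uintptr_t last  = (addr + size - 1) / ROW_SIZE;  /* row of last byte  */
    return (unsigned)(last - first + 1);
}
```

For example, rows_touched(0, 4096) is 1, while rows_touched(128, 4096) is 2, which matches the "2 rows" case on the slide.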
How much data is actually in the cache?
  Columns: Function, % Hit Rate (Boot + Halt), % Hit Rate (SPEC workload)
  Memset: 7.23%, 0.23
  Memcpy (Source): %
  Memcpy (Destination): < 0.01%
Page-Based MEMSET
  void *memset(void *s, int c, size_t n)
  Current implementation
    end <- s + n
    while (s < end)
      MEM[s++] <- c
  Proposed implementation
    end <- s + n
    while (n >= PageSize)
      SET_PAGE #(c), #s, #(PageSize)
      n <- n - PageSize
      s <- s + PageSize
    while (s < end)
      MEM[s++] <- c
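The proposed memset can be sketched in C as follows. This is not the authors' implementation: set_page is a software stand-in for the SET_PAGE command, set_page_calls is a hypothetical counter added only to make the behavior observable, PAGE_SIZE is assumed to be 4096 bytes, and the byte-wise head loop that advances to the next page boundary is an added assumption to cover unaligned start addresses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 4096u  /* assumed page size in bytes */

/* Hypothetical counter: how many emulated SET_PAGE commands were issued. */
unsigned long set_page_calls = 0;

/* Software stand-in for the SET_PAGE command. */
void set_page(unsigned char c, unsigned char *s, size_t n)
{
    memset(s, c, n);
    set_page_calls++;
}

/* memset variant that uses one SET_PAGE per whole, aligned page. */
void *page_memset(void *s, int c, size_t n)
{
    unsigned char *p = s;
    unsigned char *end = p + n;
    unsigned char v = (unsigned char)c;

    /* Head: store byte by byte until p reaches a page boundary. */
    while (p < end && ((uintptr_t)p % PAGE_SIZE) != 0)
        *p++ = v;

    /* Body: one page-based command per full page. */
    while ((size_t)(end - p) >= PAGE_SIZE) {
        set_page(v, p, PAGE_SIZE);
        p += PAGE_SIZE;
    }

    /* Tail: remaining bytes that do not fill a page. */
    while (p < end)
        *p++ = v;

    return s;
}
```

Handling the head separately means every SET_PAGE touches a single page, which sidesteps the two-row case at the cost of a short byte loop.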
Expected Speedup
  Average memset time for a 4 KB page with 128-byte cache lines
    Row Read Time * #Rows + Misc = 100 ns * 32 + X = 3.2 µs + X
    Measured average: 4 µs
  Expected time using page-based commands
    Max #Rows/page * Row Read Time + Cache Coherence Logic + Misc = 2 * 100 ns + X = 200 ns + X
  Expected speedup: >= 50% (approximation)
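The estimate above can be reproduced with a few lines of arithmetic. In this sketch the overhead term X is left as a parameter because the slide does not give its value; the function names are hypothetical, and the 100 ns row read time, 4 KB page, and 128-byte line come from the slide.

```c
#include <assert.h>

#define ROW_READ_NS 100UL   /* row read time from the slide */
#define PAGE_BYTES  4096UL  /* 4 KB page */
#define LINE_BYTES  128UL   /* cache line size */

/* Cache-line-based memset: one row access per cache line (32 per page),
 * plus an unspecified overhead X in nanoseconds. */
unsigned long line_based_ns(unsigned long misc_ns)
{
    return ROW_READ_NS * (PAGE_BYTES / LINE_BYTES) + misc_ns;
}

/* Page-based memset: at most 2 row accesses per page, plus coherence
 * logic and other overhead lumped into one parameter. */
unsigned long page_based_ns(unsigned long overhead_ns)
{
    return 2UL * ROW_READ_NS + overhead_ns;
}

/* Fraction of time saved when going from old_ns to new_ns. */
double time_reduction(unsigned long old_ns, unsigned long new_ns)
{
    return 1.0 - (double)new_ns / (double)old_ns;
}
```

With X = 0 on both sides this gives 3200 ns versus 200 ns. If the measured 4 µs is read as roughly 800 ns of overhead, and the same overhead is assumed for the page-based path, the estimate becomes 1000 ns, a 75% reduction. Both figures are illustrations only, since X is not specified.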
Conclusions
  Memory is frequently accessed at page granularity
  Use page-based commands to replace existing routines that work on a cache-line basis
  If implemented, we expect significant speedups
Related Work
  IRAM: on-chip DRAM
    Advantage: energy efficient, eliminates much of the off-chip memory access
    Disadvantage: not much performance increase, doesn’t work with conventional microprocessors
  Active Pages: bring computation to DRAM
    Break the memory into fixed-size pages and add reconfigurable logic to DRAM
  Elimination of compulsory cache misses due to dynamic initialization