
1 Micro-Pages: Increasing DRAM Efficiency with Locality-Aware Data Placement. Kshitij Sudan, Niladrish Chatterjee, David Nellans, Manu Awasthi, Rajeev Balasubramonian, Al Davis. School of Computing, University of Utah. ASPLOS-2010

2 DRAM Memory Constraints Modern machines spend nearly 25%-40% of total system power on memory. Some commercial servers already have larger power budgets for memory than for the CPU. Main memory access is one of the largest performance bottlenecks. We address both the performance and power concerns of DRAM memory accesses.

3 DRAM Access Mechanism [Figure: CPU → Memory Controller → memory bus or channel → DIMM → rank → DRAM chip or device → bank → array; each chip supplies 1/8th of the row buffer and outputs one word of data.] The CPU makes a memory request, and the memory controller converts it into the appropriate DRAM commands. An access within a device begins by selecting a bank, then a row, which is read into the row buffer. A few column bits then select data from the row buffer, and those bits are output from the device. Many bits are read from the DRAM cells to service a single CPU request!
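The bank/row/column selection above can be sketched as a simple address decode. This is a minimal illustration with an assumed, hypothetical geometry (10 column bits, 3 bank bits, 14 row bits), not the configuration used in the paper:

```python
# Sketch of how a memory controller might split a physical address into
# DRAM coordinates. Bit widths are illustrative assumptions only.

COL_BITS, BANK_BITS, ROW_BITS = 10, 3, 14  # hypothetical geometry

def decode(addr: int):
    """Split a physical address into (row, bank, column) fields."""
    col = addr & ((1 << COL_BITS) - 1)
    bank = (addr >> COL_BITS) & ((1 << BANK_BITS) - 1)
    row = (addr >> (COL_BITS + BANK_BITS)) & ((1 << ROW_BITS) - 1)
    return row, bank, col

row, bank, col = decode(0x1234567)
```

Real controllers also interleave channel and rank bits, and often swizzle bits to spread traffic across banks.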

4 DRAM Access Inefficiencies - I Over-fetch due to large row-buffers: 8 KB is read into the row buffer to serve a 64-byte cache line, so row-buffer utilization for a single request is < 1%. Why are row buffers so large? Large arrays minimize cost-per-bit, and striping a cache line across multiple chips (arrays) improves data transfer bandwidth.
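The under-1% utilization claim follows directly from the slide's figures:

```python
# Back-of-the-envelope row-buffer utilization for a single request,
# using the numbers from the slide: an 8 KB row buffer is filled to
# serve one 64-byte cache line.

row_buffer_bytes = 8 * 1024
cache_line_bytes = 64

utilization = cache_line_bytes / row_buffer_bytes
print(f"{utilization:.2%}")  # 0.78% -- under 1%, as stated
```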

5 DRAM Access Inefficiencies - II Open-page policy: row buffers are kept open in the hope that subsequent requests will be row-buffer hits. FR-FCFS request scheduling (First-Ready FCFS): the memory controller schedules requests to open row-buffers first. Locality diminishes with multi-cores.

                 Access Latency   Access Energy
Row-buffer Hit   ~75 cycles       ~18 nJ
Row-buffer Miss  ~225 cycles      ~38 nJ
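FR-FCFS can be sketched in a few lines. This is a simplified model, not the paper's controller: requests are bare (bank, row) pairs, and the oldest row-buffer hit wins, falling back to plain FCFS when no request hits an open row:

```python
# Minimal FR-FCFS sketch: prefer the oldest request that hits an open
# row buffer; if none hit, service the oldest request overall.

def fr_fcfs(queue, open_rows):
    """queue: list of (bank, row), oldest first.
    open_rows: dict mapping bank -> currently open row."""
    for i, (bank, row) in enumerate(queue):   # first-ready pass
        if open_rows.get(bank) == row:        # row-buffer hit
            return queue.pop(i)
    return queue.pop(0)                       # plain FCFS fallback

queue = [(0, 5), (1, 7), (0, 3)]
open_rows = {0: 3}                            # bank 0 has row 3 open
print(fr_fcfs(queue, open_rows))              # (0, 3): a hit beats age
```

This preference for open rows is exactly why falling row-buffer hit-rates in multi-cores hurt: the scheduler has fewer hits to exploit.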

6 DRAM Row-buffer Hit-rates [Figure: row-buffer hit-rates across benchmarks.] With increasing core counts, DRAM row-buffer hit-rates decrease.

7 Key Observation Cache Block Access Pattern Within OS Pages For heavily accessed pages in a given time interval, accesses usually touch only a few cache blocks.

8 Outline DRAM Basics. Motivation. Basic Idea. Software Only Implementation (ROPS). Hardware Implementation (HAM). Results.

9 Basic Idea Gather the heavily accessed chunks of independent OS pages and map them to the same DRAM row. [Figure: 4 KB OS pages are divided into 1 KB micro-pages; the hottest micro-pages are mapped into a reserved DRAM region, while the coldest remain in regular DRAM memory.]

10 Basic Idea Identifying "hot" micro-pages: memory-controller counters and an OS daemon. Reserved rows in DRAM for hot micro-pages: simplifies book-keeping overheads; 4 MB capacity loss from a 4 GB system (< 0.1%). Epoch-based schemes: expose the epoch length to the OS for flexibility.
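The counter-plus-daemon mechanism above can be sketched as follows. This is a simplified model, not the paper's hardware: counters are an unbounded Python dict, whereas a real controller would use a small fixed counter table.

```python
# Epoch-based hot micro-page identification: count accesses per 1 KB
# micro-page; at the end of each epoch, pick the hottest ones for the
# reserved region. With 4 MB reserved out of 4 GB, 4096 micro-pages fit.

from collections import Counter

MICRO_PAGE = 1024                                   # 1 KB micro-pages
RESERVED_SLOTS = (4 * 1024 * 1024) // MICRO_PAGE    # 4096 slots

counters = Counter()

def record_access(phys_addr):
    counters[phys_addr // MICRO_PAGE] += 1

def end_of_epoch():
    """Return the hottest micro-pages, then reset for the next epoch."""
    hot = [mp for mp, _ in counters.most_common(RESERVED_SLOTS)]
    counters.clear()
    return hot
```

Clearing the counters each epoch is what makes the scheme adapt when the hot set shifts between epochs.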

11 Software Only Implementation (ROPS) [Figure: Baseline: a CPU memory request with virtual address X is translated by the TLB to physical address Y in the 4 GB main memory. ROPS: the OS page size is shrunk to 1 KB; the TLB maps X to physical address Z, with hot micro-pages placed in a 4 MB reserved DRAM region and cold micro-pages elsewhere.] Every epoch: 1. Migrate hot micro-pages (TLB shoot-down and page-table update). 2. Promote cold micro-pages to a superpage (page table/TLB updated).

12 Software Only Implementation (ROPS) Reduced OS Page Size (ROPS): throughout the system, reduce the page size to 1 KB. Migrate hot micro-pages via DRAM-copy, so hot micro-pages live in the same row-buffer in the reserved DRAM region. Mitigate the reduction in TLB reach by promoting cold micro-pages to 4 KB superpages. Superpage creation is facilitated by "reservation-based" page allocation: allocate the four 1 KB micro-pages of a page to contiguous DRAM frames, so contiguous virtual addresses land in contiguous physical addresses → superpage creation becomes easy.
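Reservation-based allocation can be sketched like this. The frame numbering and the alignment check are illustrative assumptions, not the paper's allocator:

```python
# Sketch of reservation-based allocation: the four 1 KB micro-pages of
# each virtual 4 KB page get contiguous, aligned physical frames, so
# they can later be promoted to a superpage without copying.

def allocate_reserved(vpage, next_free_frame):
    """Map vpage's four micro-pages to contiguous frames.
    Returns (mapping, updated next_free_frame)."""
    mapping = {(vpage, i): next_free_frame + i for i in range(4)}
    return mapping, next_free_frame + 4

def can_promote(mapping, vpage):
    """Promotion needs the four micro-frames contiguous and 4-aligned."""
    frames = [mapping[(vpage, i)] for i in range(4)]
    return (frames == list(range(frames[0], frames[0] + 4))
            and frames[0] % 4 == 0)
```

Because the reservation keeps the contiguity invariant, promotion is just a page-table/TLB update, matching the superpage step described above.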

13 Hardware Implementation (HAM) [Figure: Baseline: a CPU memory request to physical address X goes directly to the 4 GB main memory. HAM: a Mapping Table of (old address X, new address Y) entries redirects migrated addresses into a 4 MB reserved DRAM region.]

14 Hardware Implementation (HAM) Hardware Assisted Migration (HAM). A new level of address indirection: place data wherever you want in DRAM. Maintain a Mapping Table (MT): preserve the old physical addresses of migrated micro-pages. DRAM-copy hot micro-pages to the reserved rows. Populate/update the MT every epoch.
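The Mapping Table indirection can be sketched in a few lines. This is a behavioral model only: addresses stand in for micro-page numbers, and the real MT is a small hardware structure, not a dict.

```python
# Sketch of HAM's extra level of indirection: the MT records
# old -> new addresses for migrated micro-pages; every memory request
# consults it, and unmigrated addresses pass through unchanged.

mapping_table = {}   # old micro-page -> new micro-page in reserved rows

def migrate(old, new):
    mapping_table[old] = new          # the DRAM-copy would happen here

def translate(addr):
    return mapping_table.get(addr, addr)   # identity if not migrated

migrate(0x1000, 0x10)
print(hex(translate(0x1000)))  # 0x10 (migrated)
print(hex(translate(0x2000)))  # 0x2000 (unchanged)
```

Since the MT sits below the page tables, no TLB shoot-downs are needed, which is why HAM's overheads are lower than ROPS's.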

15 Results Schemes Evaluated Baseline. Oracle/Profiled: best-effort estimate of the expected benefit in the next epoch, based on a prior profile run. Epoch-based ROPS and HAM: evaluated epoch lengths of 5M, 10M, 50M, and 100M cycles; trends are similar, with the best performance at 5M and 10M. Simics simulation platform. DRAMSim-based DRAM timing. DRAM timing and energy figures from Micron datasheets. [Table: Simulation Parameters]

16 Results Accesses to Micro-Pages in Reserved Rows in an Epoch [Figure: per-benchmark percentage of total accesses that go to micro-pages in the reserved rows, and the total number of 4 KB pages touched in an epoch.]

17 Results [Figure: percent change in performance for ROPS, HAM, and Oracle with a 5M-cycle epoch.] Applications with room for improvement show an average performance improvement of 9%. Hardware-assisted migration offers better returns due to lower TLB-management overheads. Apart from the performance gains, our schemes also save energy at the same time!

18 Results [Figure: percent reduction in DRAM sub-system energy for ROPS, HAM, and Oracle.]

19 Conclusions On average, for applications with room for improvement and with our best-performing scheme: average performance ↑ 9% (max. 18%); average memory energy consumption ↓ 18% (max. 62%); average row-buffer utilization ↑ 38%. Hardware-assisted migration offers better returns due to the lower overheads of TLB shoot-downs and misses. Future work: co-locate hot micro-pages that are accessed around the same time.

20 That's all for today … Questions? http://www.cs.utah.edu/arch-research

