Memory Controller Innovations for High-Performance Systems
Rajeev Balasubramonian, School of Computing, University of Utah
Sep 25th, 2013

1 Memory Controller Innovations for High-Performance Systems
Rajeev Balasubramonian, School of Computing, University of Utah, Sep 25th, 2013

2 Micron Road Trip
[Map labels: Micron, Boise; Salt Lake City]

3 DRAM Chip Innovations

4 Feedback - I
Don't bother modifying the DRAM chip.

5 Feedback - II
We love what you're doing with the memory controller and OS.

6 Academic Research Agendas
- Not giving up on memory device innovations
- Several examples of academic papers resonating with commercial innovations
- Greater focus on memory controller improvements

7 This Talk's Focus: The Memory Controller
- More relevant to Intel, Micron
- Cores are being commoditized, but memory controller features are still evolving: new devices (buffer chips, HMC), chipkill, compression
- Lots of room for improvement: MCs haven't seen the same innovation frenzy as the cores

8 Example IBM Server
Source: P. Bose, WETI Workshop, 2012

9 Power Contributions
[Chart: percentage of total server power, processor vs. memory]

10 Power Contributions
[Chart: percentage of total server power, processor vs. memory]

11 Memory Basics
[Diagram labels: host multi-core processor, MC, …, x8]

12 Outline
- Background
- Focusing on the memory controller
- Memory basics
- Implementing memory compression (MemZip)
- Implementing chipkill (LESS-ECC)
- Voltage and current aware scheduling (MICRO 2013)

13 Making a Case for Compression
- Prior work: IBM MXT, Ekman and Stenstrom, LCP, Alameldeen and Wood, etc.
- Can improve several metrics: primarily memory capacity; secondary benefit in apps with locality: bandwidth, energy
- Typically worsens access complexity and introduces data copies
- The MemZip approach: focus on other metrics; no change in memory capacity; improvements in energy, bandwidth, reliability, complexity

14 MemZip
[Diagram labels: host multi-core processor, MC, …, x8, MDC]
- Rank subsetting
- Data fetch in 8-byte increments
- Need metadata
- Modified data layout

15 Cache Line Format
[Figure labels: Base-Delta-Immediate; Frequent Pattern Compression]

16 Using Spare Space for Energy and Reliability
[Figure: a 26-byte compressed cache line shown against 8 B, 16 B, 24 B, and 32 B boundaries, with room for ECC and DBI codes]
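The arithmetic behind this figure can be sketched as follows, assuming the 8-byte fetch granularity from the MemZip slide and a 64-byte uncompressed cache line (the line size is an assumption; the slides do not state it). The function name `fetch_plan` is hypothetical.

```python
FETCH_BYTES = 8   # fetch granularity, from the MemZip slide
LINE_BYTES = 64   # assumed uncompressed cache line size (not stated on the slides)


def fetch_plan(compressed_bytes: int) -> tuple[int, int]:
    """Return (number of 8-byte fetches, spare bytes left in the last fetch)."""
    if not 0 < compressed_bytes <= LINE_BYTES:
        raise ValueError("compressed size must be between 1 and LINE_BYTES")
    fetches = -(-compressed_bytes // FETCH_BYTES)  # ceiling division
    spare = fetches * FETCH_BYTES - compressed_bytes
    return fetches, spare


fetches, spare = fetch_plan(26)
print(f"{fetches} fetches, {spare} spare bytes")
```

For the 26-byte line in the figure, four fetches cover 32 bytes, which leaves 6 bytes that could hold ECC and DBI codes.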

17 Making the ECC Access More Efficient
- Baseline ECC: ECC code is fetched in parallel from the 9th chip
- Subranking with embedded-ECC: no extra chip; ECC is located in the same row as data; need extra COL-RDs to fetch ECC codes
- MemZip with embedded-ECC: in many cases, the ECC is fetched with no additional COL-RD

18 DBI for Energy Efficiency
- Data Bus Inversion: to save energy, either send data or the inverse of data
- Break the cache line into small words; each word needs an inversion bit; the inversion bits make up the DBI code
- We use either 0, 1, 2, or 3 bytes of DBI codes
[Figure: two bus transfers shown as original data and with DBI encoding; labels: bits for DBI code size, DBI code]
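A minimal sketch of data bus inversion, assuming the transition-minimizing variant: each word is sent inverted when that flips fewer bus wires relative to the previous transfer. The slides do not say which DBI variant is used, and the names `dbi_encode` and `dbi_decode`, the 8-bit word width, and the all-zero initial bus state are illustrative assumptions.

```python
def dbi_encode(words, width=8, prev=0):
    """Encode words for a `width`-bit bus, inverting a word when that
    reduces the number of wires that toggle versus the previous transfer.
    Returns (words as sent on the bus, one inversion bit per word)."""
    mask = (1 << width) - 1
    sent, flags = [], []
    for w in words:
        w &= mask
        flips = bin(w ^ prev).count("1")  # wires that would toggle
        if flips > width // 2:
            w = ~w & mask                 # send the inverse instead
            flags.append(1)
        else:
            flags.append(0)
        sent.append(w)
        prev = w
    return sent, flags


def dbi_decode(sent, flags, width=8):
    """Undo the inversion using the per-word inversion bits."""
    mask = (1 << width) - 1
    return [(~w & mask) if f else w for w, f in zip(sent, flags)]


data = [0x00, 0xFF, 0x0F]
sent, flags = dbi_encode(data)
assert dbi_decode(sent, flags) == data
```

The inversion bits in `flags` play the role of the DBI code on the slide. This sketch does not count the toggling of the inversion bits themselves, which a real energy estimate would have to include.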

19 Methodology
- Simics (8 out-of-order cores) and USIMM memory system timing
- Micron power calculator for DRAM power estimates
- Collection of workloads from SPEC2k6, NASPB, Parsec, CloudSuite; multi-programmed and multi-threaded

20 Effect of Sub-Ranking
- 2-way sub-ranking has best performance
- 8-way sub-ranking is worse than baseline

21 Effect of Compression on Performance
- 20% performance improvement
- With compression, 4-way is the best, but only slightly better than 8x2-way

22 Effect on Memory Energy
- 8x2-way has lowest traffic and energy
- Additional 17% reduction in activity with DBI

23 Outline
- Background
- Focusing on the memory controller
- Memory basics
- Implementing memory compression (MemZip)
- Implementing chipkill (LESS-ECC)
- Voltage and current aware scheduling (MICRO 2013)

24 Chipkill Overview
- Chipkill: the ability to recover from an entire memory chip failure
- Commercial symbol-based chipkill: 4 check symbols are required to recover from 1 data symbol corruption; hence needs 32+4 x4 DRAM chips per access (two channels)
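The chip count above can be checked with a few lines of arithmetic. This is only a sketch of the bookkeeping; the function name `chipkill_footprint` is hypothetical, and it assumes nothing beyond the 32 data chips, 4 check chips, and x4 width quoted on the slide.

```python
def chipkill_footprint(data_chips=32, check_chips=4, chip_width_bits=4):
    """Chips touched per access, total bus width, and storage overhead."""
    chips = data_chips + check_chips
    return {
        "chips_per_access": chips,
        "bus_bits": chips * chip_width_bits,
        "data_bits": data_chips * chip_width_bits,
        "storage_overhead": check_chips / data_chips,
    }


print(chipkill_footprint())
```

The 144-bit total corresponds to two standard 72-bit ECC channels operating together, which matches the "two channels" note on the slide.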

25 LOT-ECC
- 1st level: checksums for error detection and location
- 2nd and 3rd levels: parity for error recovery
[Figure labels: A0, A1, …, A7, A8; CA0, CA1, …, CA7, CA8; PA; PPA]

26 LESS-ECC
- 1st level: parity for error detection and recovery
- 2nd level: checksums for error location detection
[Figure labels: A0, A1, …, A7, XA; CA0, CA1, …, CA7, CXA]
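The two levels can be sketched roughly as follows, assuming eight 8-byte data segments A0..A7, an XOR parity segment XA, and one checksum per segment. The additive 8-bit checksum and all function names are hypothetical; the slides do not name the checksum function or describe the implementation.

```python
from functools import reduce


def checksum8(segment: bytes) -> int:
    """Hypothetical 8-bit additive checksum (an assumption, not from the slides)."""
    return sum(segment) % 256


def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def encode(data):
    """data: segments A0..A7. Returns (segments plus XA, checksums plus CXA)."""
    segments = list(data) + [reduce(xor_bytes, data)]
    return segments, [checksum8(s) for s in segments]


def read(segments, checksums):
    """Return the data segments, repairing one faulty segment if needed."""
    syndrome = reduce(xor_bytes, segments)
    if not any(syndrome):
        return segments[:-1]  # level 1: parity is clean, checksums not consulted
    # Level 2: checksums locate the faulty segment.
    bad = [i for i, s in enumerate(segments) if checksum8(s) != checksums[i]]
    if len(bad) != 1:
        raise RuntimeError("detected but unrecoverable error (DUE)")
    fixed = list(segments)
    others = [s for j, s in enumerate(segments) if j != bad[0]]
    fixed[bad[0]] = reduce(xor_bytes, others)  # level 1 parity rebuilds the data
    return fixed[:-1]


data = [bytes([i] * 8) for i in range(8)]
segments, checksums = encode(data)
segments[3] = b"\xff" * 8  # simulate a failed chip
assert read(segments, checksums) == data
```

In this sketch a checksum that fails to flag the faulty segment leads to a raised error, not to wrong data being returned, which is the DUE-instead-of-SDC behavior the Error Rates slide describes.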

27 Reducing Storage
- Checksums can be made large/small
- Small checksums can also be effectively cached on chip
- In LESS-ECC, the checksum can be sized in several ways:
  - basic: 8-bit checksum for 64 bits of data
  - ES1: 8-bit checksum for 512 bits of data
  - ES2: 64-bit checksum for 8Kb of data
  - ES3: 64-bit checksum for 4Gb of data

Storage Overhead
        LOT-ECC   LESS-ECC
x8      26%       13%
x16     52%       26%
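The checksum-only storage cost of each option is just the ratio of checksum bits to data bits. The sketch below assumes 1 Kb = 1024 bits and 1 Gb = 2^30 bits; the dictionary and function names are hypothetical.

```python
# (checksum bits, data bits covered) for each option on the slide
CONFIGS = {
    "basic": (8, 64),
    "ES1": (8, 512),
    "ES2": (64, 8 * 1024),      # assumes 1 Kb = 1024 bits
    "ES3": (64, 4 * 2**30),     # assumes 1 Gb = 2**30 bits
}


def checksum_overhead(checksum_bits: int, data_bits: int) -> float:
    """Fraction of extra storage spent on checksums alone."""
    return checksum_bits / data_bits


for name, (c, d) in CONFIGS.items():
    print(f"{name}: {100 * checksum_overhead(c, d):.6g}%")
```

These ratios cover the checksums only. The totals in the Storage Overhead table presumably also include parity storage, which this sketch does not model.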

28 Error Rates
- Checksums have a small probability of failing to detect an error
- LOT-ECC uses checksums in the 1st level and hence causes SDC (silent data corruption)
- LESS-ECC uses checksums in the 2nd level and hence causes DUE (detected but unrecoverable error); DUEs are more favorable

29 LESS-ECC Summary
- Benefits: energy, parallelism, storage, SDC
- Disadvantage: checksum cache and more logic at the memory controller

30 LESS-ECC Performance

31 LESS-ECC Memory Energy

32 LESS-ECC Energy Efficiency
- LESS-ECC-x8 has 0.5% lower energy than LOT-ECC-x8 but 15% less energy per usable byte
- LESS-ECC-x16 has 26% lower energy than LOT-ECC-x8 (both have similar storage overhead of 26%)

33 Outline
- Background
- Focusing on the memory controller
- Memory basics
- Implementing memory compression (MemZip)
- Implementing chipkill (LESS-ECC)
- Voltage and current aware scheduling (MICRO 2013)

34 Current Constraints and IR-Drop
- MC ensures that requests are scheduled appropriately; many timing constraints, such as tFAW
- A new constraint emerges in future 3D-stacked devices: IR-drop

35 Many Possible Solutions
- Note that charge pumps and decaps scale with dies, but IR-drop gets worse
- Provide higher voltage (power increase!)
- Provide more pins and TSVs (cost increase!)
- Alternative: use an architectural solution; the MC schedules requests in a manner that does not violate IR-drop limits
- Similar to the power tokens used in some PCM papers, but we have to be aware of where activity is happening
- Place data such that IR-drop-prone regions are avoided

36 Example Voltage Map
[Figure: voltage map; axis label: Y Coordinate]

37 IR-Drop Regions

38 Scheduling Constraints
- For a given part of the device, identify the worst-case set of requests that will cause an IR-drop violation
- For example, for region A-Top, if you issue one COL-RD to the furthest bank, that region cannot handle any other request
- Continue to widen the list of constraints by considering larger regions on the device

39 Scheduling Constraints
1 COL-RD = 1 COL-WR = 2 ACTs = 6 PREs
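One way to read this equivalence is as a per-region budget check in the memory controller. The sketch below is an illustration under two assumptions not stated on the slides: command costs add linearly when commands are mixed, and a region can sustain activity worth one COL-RD at a time (as in the A-Top example). Costs are expressed in PRE units so they stay integers; `COST`, `REGION_BUDGET`, and `can_issue` are hypothetical names.

```python
# Costs in units of one PRE, from 1 COL-RD = 1 COL-WR = 2 ACTs = 6 PREs
COST = {"COL-RD": 6, "COL-WR": 6, "ACT": 3, "PRE": 1}

# Assumption: a region tolerates activity equal to one COL-RD at a time.
REGION_BUDGET = 6


def can_issue(in_flight, cmd, budget=REGION_BUDGET):
    """True if `cmd` fits in the region's remaining IR-drop budget.

    in_flight: commands currently active in the same region.
    """
    used = sum(COST[c] for c in in_flight)
    return used + COST[cmd] <= budget


print(can_issue([], "COL-RD"))         # an idle region accepts a read
print(can_issue(["COL-RD"], "PRE"))    # a region busy with a read accepts nothing
```

A real controller would keep one such budget per region and per group of regions, since the previous slide widens the constraint list to larger areas of the device.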

40 Overcoming Scheduling Limitations
- Starvation: if B-Top always has 2 accesses, A-Top requests will be starved; prioritize requests that have much longer than average wait times
- Page placement: dynamically identify frequently accessed pages and move them to favorable regions
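The starvation rule can be sketched as a selection policy: a request whose wait exceeds some multiple of the average wait is chosen ahead of everything else. The threshold `STARVATION_FACTOR`, the `Request` type, and the `pick_next` signature are all hypothetical; the slides only say "much longer than average".

```python
from dataclasses import dataclass


@dataclass
class Request:
    region: str
    arrival: int  # cycle at which the request entered the queue


STARVATION_FACTOR = 4  # hypothetical threshold for "much longer than average"


def pick_next(queue, now, avg_wait, can_issue):
    """Pick the oldest starved request if any, else the oldest issuable one.

    avg_wait: running average wait time, maintained by the caller.
    can_issue: callable telling whether a request meets current constraints.
    """
    starved = [r for r in queue if now - r.arrival > STARVATION_FACTOR * avg_wait]
    if starved:
        # Returned even if it cannot issue yet, so the caller can hold back
        # competing requests until this one is able to go.
        return min(starved, key=lambda r: r.arrival)
    ready = [r for r in queue if can_issue(r)]
    return min(ready, key=lambda r: r.arrival) if ready else None


queue = [Request("A-Top", 0), Request("B-Top", 90), Request("B-Top", 95)]
only_b_top = lambda r: r.region == "B-Top"
print(pick_next(queue, now=100, avg_wait=20, can_issue=only_b_top))
```

In the usage above the A-Top request has waited 100 cycles against an average of 20, so it is selected even though B-Top traffic would otherwise keep winning.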

41 Performance Impact
- With all constraints (real PDN), performance falls by 4.6X
- With starvation management, the gap is reduced to 1.47X
- Profiled page placement with starvation control is within 15% of the unrealistic ideal PDN

42 Summary
- Many features expected of future memory controllers: handling compression, errors, new devices
- Lots of low-hanging fruit
- Significant energy/performance benefits from compression
- Energy-efficient and storage-efficient chipkill possible, but requires some effort in the MC
- More scheduling constraints being imposed as technology evolves; we show in an IR-drop case study for 3D-stacked devices that the performance impacts can be large

43 Acks
- Students in the Utah Arch Lab (Amirali Boroumand, Nil Chatterjee, Seth Pugsley, Ali Shafiee, Manju Shevgoor, Meysam Taassori)
- Other collaborators from Samsung (Jung-Sik Kim), HP Labs (Naveen Muralimanohar), ARM (Ani Udipi), U. Nebrija (Pedro Reviriego), Utah (Al Davis)
- Funding sources: NSF, Samsung, HP, IBM

