Designing Efficient Memory for Future Computing Systems. Aniruddha N. Udipi, University of Utah, Ph.D. Dissertation.

1 Designing Efficient Memory for Future Computing Systems
Aniruddha N. Udipi, University of Utah
Ph.D. Dissertation Defense, March 7, 2012
Advisor: Rajeev Balasubramonian

2 My other computer is...

3 Scaling server farms
Facebook: 30,000 servers, 80 billion images stored, 600,000 photos served per second, 25 TB of data logged per day. The statistics go on.
The primary challenge to scaling: efficient supply of data to thousands of cores
It’s all about the memory!

4 Performance Trends
Demand-side
– Multi-socket, multi-core, multi-thread
– Large datasets: big data analytics, scientific computation models
– RAMCloud-like designs
– 1 TB/s per node by 2017
Supply-side
– Pin count, per-pin bandwidth, capacity
– Severely power limited
Sources: ZDNet; Tom’s Hardware

5 Energy Trends
Datacenters consume ~2% of all power generated in the US
– Operation + cooling: 100 billion kWh, $7.4 billion
% of total power in large systems consumed in memory
As processors get simpler, this fraction is likely to increase

6 Cost-per-bit
Traditionally the holy grail of DRAM design
Operational expenditure over 3 years == capital expenditure in datacenter servers
– Cost-per-bit less important than before

7 Complexity Trends
The job of the memory controller is hard
– 18+ timing parameters for DRAM!
– Maintenance operations
  • Refresh, scrub, power down, etc.
Several DIMM and controller variants
– Hard to provide interoperability
– Need processor-side support for new memory features
Now throw in heterogeneity
– Memristors, PCM, STT-RAM, etc.

8 Reliability Trends
Shrinking feature sizes are not helping
Nor is the scale
– 64 x DRAM cells in a typical datacenter
DRAM errors are the #1 reason for servers at Google to enter repair
Datacenters are the backbone of web-connected infrastructure
– Reliability is essential
Server downtime has a huge economic impact
– Breached SLAs, for example

9 Thesis statement
Main memory systems are at an inflection point
– Convergence of several trends
Major overhaul required to achieve a system that is
– Energy-efficient, high-performance, low-complexity, reliable, and cost-effective
Combination of two things
– Prudent application of novel technologies
– Fundamental rethinking of conventional design decisions

10 Designing Future Memory Systems
[Figure labels: CPU, MC, DIMM]
Memory Chip Architecture – reducing overfetch & increasing parallelism [ISCA ’10]
Memory Interconnect – prudent use of silicon photonics, without modifying DRAM dies [ISCA ’11]
Memory Protocol – streamlined slot-based interface with semi-autonomous memory [ISCA ’11]
Memory Reliability – efficient RAID-based high-availability Chipkill memory [ISCA ’12]

11 PART 1 – Memory Chip Organization

12 Key bottleneck
[Figure labels: RAS, CAS, Cache Line, DRAM Chip, Row Buffer. One bank shown in each chip.]

13 Why this is a problem

14 …

15 SSA Architecture
[Figure labels: MEMORY CONTROLLER; 8, 8; ADDR/CMD BUS; DATA BUS; DIMM; ONE DRAM CHIP; Bank; Subarray; Bitlines; Row buffer; Global Interconnect to I/O; 64 Bytes]

16 SSA Operation
[Figure labels: Address; Cache Line; DRAM Chip and Subarray (repeated per chip); Sleep Mode (or other parallel accesses)]

17 SSA Impact
Energy reduction
– Dynamic: fewer bitlines activated
– Static: smaller activation footprint, more and longer spells of inactivity, better power down
Latency impact
– Limited pins per cache line: serialization latency
– Higher bank-level parallelism: shorter queuing delays
Area increase
– More peripheral circuitry and I/O at finer granularities: area overhead (< 5%)
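To make the overfetch argument concrete, here is a minimal sketch comparing how many bytes are activated per cache-line read in a conventional rank versus an SSA-style access. The 64-byte cache line comes from the SSA Architecture slide; the chip count, the row size per chip, and the subarray row size are illustrative assumptions, not figures from the talk, and the function names are hypothetical.

```python
CACHE_LINE_BYTES = 64  # from the slides: one 64-byte line served by one subarray


def activated_bytes_conventional(chips_per_rank: int, row_bytes_per_chip: int) -> int:
    """Conventional access: every chip in the rank activates a full row."""
    return chips_per_rank * row_bytes_per_chip


def activated_bytes_ssa(subarray_row_bytes: int) -> int:
    """SSA-style access: a single subarray in a single chip is activated."""
    return subarray_row_bytes


def overfetch_factor(activated: int, useful: int = CACHE_LINE_BYTES) -> float:
    """Bytes activated per byte actually returned to the processor."""
    return activated / useful


# Assumed geometry (hypothetical): 8 chips per rank, 1 KB row per chip,
# and a subarray row that holds exactly one cache line.
conventional = activated_bytes_conventional(chips_per_rank=8, row_bytes_per_chip=1024)
ssa = activated_bytes_ssa(subarray_row_bytes=64)

print(overfetch_factor(conventional))  # 128.0 under the assumed geometry
print(overfetch_factor(ssa))           # 1.0 under the assumed geometry
```

Under these assumed numbers the conventional rank activates 128 times more data than it returns, which is the source of the dynamic-energy savings listed on the slide.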

18 Key Contributions
Up to 6X reduction in DRAM chip dynamic energy
Up to 5X reduction in DRAM chip static energy
Up to 50% improvement in performance for applications limited by bank contention
All for ~5% increase in area

19 PART 2 – Memory Interconnect

20 Key Bottleneck
Fundamental nature of electrical pins
– Limited pin count, per-pin bandwidth, memory capacity, etc.
Diverging growth rates of core count and pin count
Limited by physics, not engineering!

21 Silicon Photonic Interconnects
We need something that can break the edge-bandwidth bottleneck
Ring modulator based photonics
– Off-chip light source
– Indirect modulation using resonant rings
– Relatively cheap coupling on- and off-chip
DWDM for high bandwidth density
– As many as 67 wavelengths possible
– Limited by Free Spectral Range, and coupling losses between rings
DWDM: 64 λ × 10 Gbps/λ = 80 GB/s per waveguide
Source: Xu et al., Optics Express 16(6), 2008
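The DWDM figure on this slide is a unit conversion: 64 wavelengths at 10 Gbps each give 640 Gbps, which is 80 GB/s. A small sketch of that arithmetic follows; the function name is hypothetical.

```python
def waveguide_bandwidth_gbytes_per_s(wavelengths: int, gbps_per_wavelength: float) -> float:
    """Aggregate DWDM bandwidth of one waveguide in GB/s (8 bits per byte)."""
    return wavelengths * gbps_per_wavelength / 8


# Values from the slide: 64 wavelengths, 10 Gbps per wavelength.
print(waveguide_bandwidth_gbytes_per_s(64, 10))  # 80.0
```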

22 The Questions We’re Trying to Answer
Should we replace all interconnects with photonics? On-chip too?
Should we be designing photonic DRAM dies? Stacks? Channels?
How do we make photonics less invasive to memory die design?
What should the role of 3D be in an optically connected memory?
What should the role of electrical signaling be?

23 Design Considerations – I
Photonic interconnects
– Large static power dissipation: ring tuning
  • Rings are designed to resonate at a specific frequency
  • Processing defects and temperature change this
  • Need to heat the rings to correct for this
– Much lower dynamic energy consumption, relatively independent of distance
Electrical interconnects
– Relatively small static power dissipation
– Large dynamic energy consumption
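The static-versus-dynamic trade-off above can be sketched as a simple energy-per-bit model: static power is amortized over however many bits the link actually carries, so a link with high static power only pays off when it is kept busy. All numbers below are hypothetical placeholders chosen to show the shape of the trade-off; none come from the talk, and the `Link` class is an illustrative construct.

```python
from dataclasses import dataclass


@dataclass
class Link:
    static_mw: float           # power burned regardless of traffic (e.g. ring tuning)
    dynamic_pj_per_bit: float  # energy spent per transferred bit
    peak_gbps: float           # peak link bandwidth

    def energy_per_bit_pj(self, activity: float) -> float:
        """Total energy per bit at a given activity factor in (0, 1]."""
        if not 0 < activity <= 1:
            raise ValueError("activity must be in (0, 1]")
        bits_per_second = self.peak_gbps * 1e9 * activity
        static_pj_per_bit = self.static_mw * 1e-3 / bits_per_second * 1e12
        return static_pj_per_bit + self.dynamic_pj_per_bit


# Hypothetical parameters: photonics has high static, low dynamic cost;
# the electrical link is the reverse.
photonic = Link(static_mw=10.0, dynamic_pj_per_bit=0.1, peak_gbps=10.0)
electrical = Link(static_mw=1.0, dynamic_pj_per_bit=2.0, peak_gbps=10.0)

for activity in (1.0, 0.1):
    print(activity, photonic.energy_per_bit_pj(activity), electrical.energy_per_bit_pj(activity))
```

With these placeholder values the photonic link is cheaper per bit when fully utilized and more expensive at 10% utilization, which is why the activity factor of the photonic channel matters.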

24 Design Considerations – II
Should not over-provision photonic bandwidth; use only where necessary
Use photonics where they’re really useful
– To break the off-chip pin barrier
Exploit 3D-stacking and TSVs
– High bandwidth, low static power, decouples memory dies
Exploit low-swing wires
– Cheap on-chip communication

25 Proposed Design
[Figure labels: Processor, Memory controller, Waveguide, DIMM, DRAM chips, Photonic Interface die]
ADVANTAGE 1: Increased activity factor, more efficient use of photonics
ADVANTAGE 2: Rings are co-located; easier to isolate or tune thermally
ADVANTAGE 3: Not disruptive to the design of commodity memory dies

26 Key Contributions
23% reduced energy consumption
4X capacity per channel
Potential for performance improvements due to increased bank count
Less disruptive to memory die design
[Figure labels: Processor, Memory controller, Waveguide, DIMM, DRAM chips, Photonic Interface die]
Makes the job of the memory controller difficult!

27 PART 3 – Memory Access Protocol

28 Key Bottleneck
Large capacity, high bandwidth, and evolving technology trends will increase pressure on the memory interface
Memory controller micro-manages every operation of the memory system
– Processor-side support required for every memory innovation
– Several signals between processor and memory
  • Heavy pressure on address/command bus
  • Worse with several independent banks, large amounts of state

29 Proposed Solution
Release the MC’s tight control, make the memory stack more autonomous
Move mundane tasks to the interface die
– Maintenance operations (refresh, scrub, etc.)
– Routine operations (DRAM precharge, NVM wear leveling)
– Timing control (18+ constraints for DRAM alone)
– Coding and any other special requirements
Processor-side controller only schedules requests and controls the data bus

30 Memory Access Operation
[Figure: timeline marking Arrival, Issue, Start looking, First free slot (S1), and Backup slot (S2), with intervals labeled ML and > ML; reserved slots marked X]
Slot – Cache line data bus occupancy
X – Reserved slot
ML – Memory Latency = Addr. latency + Bank access + Data bus latency
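One possible reading of the timeline above, as a minimal sketch: the processor-side controller starts looking ML after a request arrives, reserves the first free data-bus slot (S1), and then reserves a backup slot (S2) more than ML after S1. Time is measured in slot units. The function name `reserve_slots`, the set-based bus model, and the exact backup-slot rule are assumptions made for illustration; the slide only shows the labels.

```python
from typing import Set, Tuple


def reserve_slots(arrival: int, ml: int, reserved: Set[int]) -> Tuple[int, int]:
    """Reserve a primary (S1) and a backup (S2) data-bus slot for one read.

    `reserved` holds slot indices already taken on the data bus and is
    updated in place.
    """
    def first_free(start: int) -> int:
        slot = start
        while slot in reserved:
            slot += 1
        return slot

    # Start looking one memory latency after arrival.
    s1 = first_free(arrival + ml)
    reserved.add(s1)

    # Assumed reading of "> ML": the backup is strictly more than ML after S1.
    s2 = first_free(s1 + ml + 1)
    reserved.add(s2)
    return s1, s2


bus = {10, 11}                      # slots already reserved by earlier requests
print(reserve_slots(0, 10, bus))    # (12, 23)
print(reserve_slots(2, 10, bus))    # (13, 24)
```

In this sketch the controller needs no knowledge of bank state or DRAM timing: it only tracks which data-bus slots are taken, which matches the division of labor described on the Proposed Solution slide.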

31 Performance Impact – Synthetic Traffic
< 9% latency impact, even at maximum load
Virtually no impact on achieved bandwidth

32 Performance Impact – PARSEC/STREAM
Apps have very low BW requirements
Scaled-down system, similar trends

33 Key Contributions
Plug and play
– Everything is interchangeable and interoperable
– Only interface-die support required (communicate ML)
Better support for heterogeneous systems
– Easier DRAM-NVM data movement on the same channel
More innovation in the memory system
– Without processor-side support constraints
Fewer commands between processor and memory
– Energy, performance advantages

34 PART 4 – Memory Reliability

35 Key Bottleneck
Increased access granularity
– Every data access is spread across 36 DRAM chips
– DRAM industry standards define minimum access granularity from each chip
– Massive overfetch of data at multiple levels
  • Wastes energy
  • Wastes bandwidth
  • Occupies ranks/banks for longer, hurting performance
x4 device width restriction
– Fewer ranks for given DIMM real estate
– x8/x16/x32 more power efficient per capacity
Reliability level: 1 failed chip out of 36
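The access-granularity point can be illustrated with a short calculation. The 36-chip, x4 organization is from the slide; the burst length of 8 and the split of 32 data chips plus 4 check chips are assumptions typical of DDR3-era chipkill systems, not values stated in the talk. The 9-chip x8 case is likewise an assumed configuration included for comparison.

```python
def data_bytes_per_access(chips: int, device_width_bits: int,
                          check_chips: int, burst_length: int) -> int:
    """Data bytes delivered by one burst from a group of lock-stepped chips."""
    data_bits_per_beat = (chips - check_chips) * device_width_bits
    return data_bits_per_beat * burst_length // 8


# 36 x4 chips (assumed 32 data + 4 check), assumed burst length of 8.
print(data_bytes_per_access(36, 4, 4, 8))  # 128

# A single 9-chip rank of x8 devices (assumed 8 data + 1 check), same burst length.
print(data_bytes_per_access(9, 8, 1, 8))   # 64
```

Under these assumptions the 36-chip organization returns 128 data bytes per access, twice a 64-byte cache line, while a single 9-chip rank returns exactly one line.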

36 A new approach: LOT-ECC
Operate on a single rank of memory: 9 chips
– and support failure of 1 chip per rank (9 chips)
Multiple tiers of localized protection
– Tier 1: Local Error Detection (checksums)
– Tier 2: Global Error Correction (parity)
– Tiers 3 and 4 handle specific failure cases
Error correction data stored in data memory
Data mapping handled by memory controller with firmware support
– Transparent to OS, caches, etc.
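The first two tiers can be sketched as follows: a per-chip checksum detects and localizes the error, and an XOR parity across chips reconstructs the failed chip's data. This is a toy illustration of the detect-then-correct idea only. The 8-bit additive checksum, the 8-byte segments, and all function names are assumptions; the sketch also keeps the checksums outside the chips for simplicity and ignores Tiers 3 and 4, so it is not the LOT-ECC layout itself.

```python
from functools import reduce
from typing import List, Tuple


def checksum(segment: bytes) -> int:
    """Toy local error detection code: 8-bit additive checksum (assumed)."""
    return sum(segment) % 256


def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def encode(segments: List[bytes]) -> Tuple[List[int], bytes]:
    """Return per-chip checksums (LED) and cross-chip XOR parity (GEC)."""
    leds = [checksum(s) for s in segments]
    gec = reduce(xor_bytes, segments)
    return leds, gec


def read_line(segments: List[bytes], leds: List[int], gec: bytes) -> bytes:
    """Verify each chip's segment; rebuild one failed segment from parity."""
    bad = [i for i, (s, c) in enumerate(zip(segments, leds)) if checksum(s) != c]
    if len(bad) > 1:
        raise ValueError("more than one chip failed; beyond single-chip protection")
    if bad:
        good = [s for i, s in enumerate(segments) if i != bad[0]]
        segments = list(segments)
        # XOR of the parity with all healthy segments yields the missing one.
        segments[bad[0]] = reduce(xor_bytes, good, gec)
    return b"".join(segments)


line = bytes(range(64))
chips = [line[i * 8:(i + 1) * 8] for i in range(8)]   # one segment per data chip
leds, gec = encode(chips)

chips[3] = bytes(8)                 # simulate chip 3 returning all zeros
assert read_line(chips, leds, gec) == line
```

The common case (no error) costs only the checksum comparison; parity is consulted only after a checksum mismatch identifies which chip to rebuild.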

37 LOT-ECC Design

38 The Devil is in the Details
We’re borrowing one bit from [data + LED] to use in the GEC
– Put them all in the same DRAM row
When a cache line is written,
– Write data, LED, GEC: all “self-contained”
– No read-before-write
– Guaranteed row-buffer hit
[Figure: GEC layout across Chip 0, Chip 1, ..., Chip 7, Chip 8, with 7b and 1b fields labeled PA 0-6, PA 7-13, ..., PA 56, and T4; surplus bit borrowed from data + LED]

39 Key Benefits
Energy Efficiency: Fewer chips activated per access, reduced access granularity, reduced static energy through better use of low-power modes
Performance Gains: More rank-level parallelism, reduced access granularity
Improved Protection: Can handle 1 failed chip out of 9, compared to 1 in 36 currently
Flexibility: Works with a single rank of x4 DRAMs or more efficient wide-I/O x8/x16 DRAMs
Implementation Ease: Changes to memory controller and system firmware only; commodity processor/memory/OS

40 Power Results

41 Performance Results
Latency reduction:
– LOT-ECC x8: 43%
– + GEC Coalescing: 47%
– Oracular: 57%

42 Exploiting features in SSA

43 Putting it all together

44 Summary
Tremendous pressure on the memory system
– Bandwidth, energy, complexity, reliability
Prudently apply novel technologies
– Silicon photonics
– Low-swing wires
– 3D-stacking
Rethink some fundamental design choices
– Micromanagement by the memory controller
– Overfetch in the face of diminishing locality
– Conventional ECC codes

45 Impact
Significant static/dynamic energy reduction
– Memory core, channel, controller, reliability
Significant performance improvement
– Bank parallelism, channel bandwidth, reliability
Significant complexity reduction
– Memory controller
Improved reliability

46 Synergies
[Diagram labels: SSA, Photonics; Photonics, Autonomous memory; SSA, Reliability]
SSA, Photonics, and LOT-ECC provide additive energy benefits
– Each targets one of three major sources of energy consumption: DRAM array, off-chip channel, reliability
SSA, Photonics, and LOT-ECC also provide additive performance benefits
– Each targets one of three major performance bottlenecks: bank contention, off-chip BW, reliability

47 Research Contributions
Memory reliability [ISCA 2012]
Memory access protocol [ISCA 2011]
Memory channel architecture [ISCA 2011]
Memory chip microarchitecture [ISCA 2010]
On-chip networks [HPCA 2010]
Non-uniform power caches [HiPC 2009]
3D stacked cache design [HPCA 2009]

48 Future Work
Future project ideas include
– Memory architectures for graphics/throughput-oriented applications
– Memory optimizations for handheld devices
  • Tightly integrated software support
  • Managing heterogeneity, reconfigurability
  • Novel memory hierarchies
– Memory autonomy and virtualization
– Refresh management in DRAM

49 Acknowledgements
Rajeev
Naveen
Committee: Al, Norm, Erik, Ken
Awesome lab-mates
Karen, Ann, Emily… front office
Parents & family
Friends
