Designing Efficient Memory for Future Computing Systems
Aniruddha N. Udipi, University of Utah
www.cs.utah.edu/~udipi
Ph.D. Dissertation Defense, March 7, 2012
Advisor: Rajeev Balasubramonian

My other computer is...

Scaling server farms
Facebook: 30,000 servers, 80 billion images stored, serves 600,000 photos a second, logs 25 TB of data per day… the statistics can go on.
The primary challenge to scaling: efficient supply of data to thousands of cores
It’s all about the memory!

Performance Trends
Demand-side
–Multi-socket, multi-core, multi-thread
–Large datasets: big data analytics, scientific computation models
–RAMCloud-like designs
–1 TB/s per node by 2017
Supply-side
–Pin count, per pin BW, capacity
–Severely power limited
[Figure sources: ZDNet, Tom’s Hardware]

Energy Trends
Datacenters consume ~2% of all power generated in the US
–Operation + cooling: 100 billion kWh, $7.4 billion
25-40% of total power in large systems is consumed in memory
As processors get simpler, this fraction is likely to increase

Cost-per-bit
Traditionally the holy grail of DRAM design
Operational expenditure over 3 years == Capital expenditure in datacenter servers
–Cost-per-bit less important than before
[Figure labels: $3.00, 13 W; $0.30, 60 W]

Complexity Trends
The job of the memory controller is hard
–18+ timing parameters for DRAM!
–Maintenance operations
  –Refresh, scrub, power down, etc.
Several DIMM and controller variants
–Hard to provide interoperability
–Need processor-side support for new memory features
Now throw in heterogeneity
–Memristors, PCM, STT-RAM, etc.

Reliability Trends
Shrinking feature sizes not helping
Nor is the scale
–64 × 10^15 DRAM cells in a typical datacenter
DRAM errors the #1 reason for servers at Google to enter repair
Datacenters are the backbone of web-connected infrastructure
–Reliability is essential
Server downtime has huge economic impact
–Breached SLAs, for example

Thesis statement
Main memory systems are at an inflection point
–Convergence of several trends
Major overhaul required to achieve a system that is
–Energy-efficient, high-performance, low-complexity, reliable, and cost-effective
Combination of two things
–Prudent application of novel technologies
–Fundamental rethinking of conventional design decisions

Designing Future Memory Systems
[Diagram: CPU, MC, and DIMM, with numbered callouts 1-4 for the areas below]
Memory Chip Architecture – reducing overfetch & increasing parallelism [ISCA ’10]
Memory Interconnect – Prudent use of Silicon Photonics, without modifying DRAM dies [ISCA ’11]
Memory Protocol – Streamlined Slot-based Interface with semi-autonomous memory [ISCA ’11]
Memory Reliability – Efficient RAID-based high-availability Chipkill memory [ISCA ’12]

PART 1 – Memory Chip Organization

Key bottleneck
[Figure labels: RAS, CAS, Cache Line, DRAM Chip, Row Buffer. One bank shown in each chip.]

Why this is a problem

SSA Architecture
[Figure labels: Memory controller, ADDR/CMD bus, data bus (labeled 8 at each chip), DIMM, one DRAM chip, bank, subarray, bitlines, row buffer, global interconnect to I/O, 64 bytes]

SSA Operation
[Figure labels: Address, Cache Line, DRAM Chip, Subarray, Sleep Mode (or other parallel accesses)]

SSA Impact
Energy reduction
–Dynamic – fewer bitlines activated
–Static – smaller activation footprint – more and longer spells of inactivity – better power down
Latency impact
–Limited pins per cache line – serialization latency
–Higher bank-level parallelism – shorter queuing delays
Area increase
–More peripheral circuitry and I/O at finer granularities – area overhead (< 5%)
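The serialization cost listed under "Latency impact" can be estimated with simple arithmetic. The sketch below assumes a 64-byte cache line, a conventional 64-bit data bus shared by all chips in a rank, and an SSA access served entirely by one chip through 8 data pins. The 8 and 64-byte figures come from the SSA Architecture figure labels; the 64-bit conventional bus and the function name are assumptions made for illustration.

```python
def data_bus_beats(line_bytes: int, data_pins: int) -> int:
    """Return the number of data-bus transfers needed to move one cache line."""
    bits = line_bytes * 8
    # Ceiling division: a partially filled final transfer still costs one beat.
    return -(-bits // data_pins)


# Conventional access: the line is striped across all chips (assumed 64 pins).
conventional_beats = data_bus_beats(line_bytes=64, data_pins=64)

# SSA access: the whole line comes from a single chip (8 pins).
ssa_beats = data_bus_beats(line_bytes=64, data_pins=8)

print(conventional_beats, ssa_beats)
```

Under these assumptions the single-chip access needs eight times as many transfers per line, which is the serialization latency the slide trades against shorter queuing delays.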

Key Contributions
Up to 6X reduction in DRAM chip dynamic energy
Up to 5X reduction in DRAM chip static energy
Up to 50% improvements in performance in applications limited by bank contention
All for ~5% increase in area

PART 2 – Memory Interconnect

Key Bottleneck
Fundamental nature of electrical pins
–Limited pin count, per pin bandwidth, memory capacity, etc.
Diverging growth rates of core count and pin count
Limited by physics, not engineering!

Silicon Photonic Interconnects
We need something that can break the edge-bandwidth bottleneck
Ring modulator based photonics
–Off-chip light source
–Indirect modulation using resonant rings
–Relatively cheap coupling on- and off-chip
DWDM for high bandwidth density
–As many as 67 wavelengths possible
–Limited by Free Spectral Range, and coupling losses between rings
DWDM: 64 λ × 10 Gbps/λ = 80 GB/s per waveguide
Source: Xu et al., Optics Express 16(6), 2008
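The per-waveguide bandwidth figure on this slide follows from a unit conversion, shown here as a small worked calculation. The function name is illustrative; the 64 wavelengths and 10 Gbps per wavelength are the slide's numbers.

```python
def waveguide_bandwidth_gbytes_per_s(wavelengths: int, gbps_per_wavelength: float) -> float:
    """Aggregate DWDM bandwidth of one waveguide, in gigabytes per second."""
    total_gbps = wavelengths * gbps_per_wavelength  # gigabits per second
    return total_gbps / 8                           # 8 bits per byte


bandwidth = waveguide_bandwidth_gbytes_per_s(wavelengths=64, gbps_per_wavelength=10)
print(bandwidth)  # 640 Gbps / 8 = 80.0 GB/s
```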

The Questions We’re Trying to Answer
Should we replace all interconnects with photonics? On-chip too?
Should we be designing photonic DRAM dies? Stacks? Channels?
How do we make photonics less invasive to memory die design?
What should the role of 3D be in an optically connected memory?
What should the role of electrical signaling be?

Design Considerations – I
Photonic interconnects
–Large static power dissipation: ring tuning
  –Rings are designed to resonate at a specific frequency
  –Processing defects and temperature change this
  –Need to heat the rings to correct for this
–Much lower dynamic energy consumption – relatively independent of distance
Electrical interconnects
–Relatively small static power dissipation
–Large dynamic energy consumption
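The static-versus-dynamic trade-off above can be made concrete with a simple energy-per-bit model: static power is amortized over the bits actually delivered, and dynamic energy is paid per bit. The following is a minimal sketch, not a model from the dissertation. All parameter values are hypothetical placeholders chosen only to show the shape of the trade-off; they are not measurements.

```python
def energy_per_bit_pj(static_mw: float, dynamic_pj_per_bit: float,
                      peak_gbps: float, utilization: float) -> float:
    """Energy per delivered bit in picojoules for a link at a given utilization."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    delivered_gbps = peak_gbps * utilization
    # mW / Gbps = (1e-3 J/s) / (1e9 bit/s) = 1e-12 J/bit = 1 pJ/bit
    return static_mw / delivered_gbps + dynamic_pj_per_bit


# Hypothetical link parameters (placeholders, not data from the source).
PHOTONIC = dict(static_mw=100.0, dynamic_pj_per_bit=0.1, peak_gbps=100.0)
ELECTRICAL = dict(static_mw=10.0, dynamic_pj_per_bit=3.0, peak_gbps=100.0)

for util in (0.01, 0.1, 1.0):
    p = energy_per_bit_pj(utilization=util, **PHOTONIC)
    e = energy_per_bit_pj(utilization=util, **ELECTRICAL)
    print(f"utilization={util:>4}: photonic={p:.2f} pJ/bit, electrical={e:.2f} pJ/bit")
```

With these placeholder values the photonic link is worse at low utilization, where ring-tuning power dominates, and better at high utilization, which is why a high activity factor matters for photonics.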

Design Considerations – II
Should not over-provision photonic bandwidth, use only where necessary
Use photonics where they’re really useful
–To break the off-chip pin barrier
Exploit 3D-Stacking and TSVs
–High bandwidth, low static power, decouples memory dies
Exploit low-swing wires
–Cheap on-chip communication

Proposed Design
[Figure labels: Processor, Memory controller, Waveguide, DIMM, DRAM chips, Photonic interface die]
ADVANTAGE 1: Increased activity factor, more efficient use of photonics
ADVANTAGE 2: Rings are co-located; easier to isolate or tune thermally
ADVANTAGE 3: Not disruptive to the design of commodity memory dies

Key Contributions
23% reduced energy consumption
4X capacity per channel
Potential for performance improvements due to increased bank count
Less disruptive to memory die design
Makes the job of the memory controller difficult!
[Figure labels: Processor, Memory controller, Waveguide, DIMM, DRAM chips, Photonic interface die]

PART 3 – Memory Access Protocol

Key Bottleneck
Large capacity, high bandwidth, and evolving technology trends will increase pressure on the memory interface
Memory controller micro-manages every operation of the memory system
–Processor-side support required for every memory innovation
–Several signals between processor and memory
  –Heavy pressure on address/command bus
  –Worse with several independent banks, large amounts of state

Proposed Solution
Release MC’s tight control, make memory stack more autonomous
Move mundane tasks to the interface die
–Maintenance operations (refresh, scrub, etc.)
–Routine operations (DRAM precharge, NVM wear leveling)
–Timing control (18+ constraints for DRAM alone)
–Coding and any other special requirements
Processor-side controller only schedules requests and controls data bus

Memory Access Operation
[Timeline figure labels: Arrival, Issue, Start looking, First free slot (S1), Backup slot (S2), ML, > ML, Time]
Slot – Cache line data bus occupancy
X – Reserved slot
ML – Memory Latency = Addr. latency + Bank access + Data bus latency
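The timeline labels suggest the following reservation scheme: when a request arrives, the controller starts looking for a free data-bus slot ML after arrival, reserves the first free one, issues the request ML before that slot, and also reserves a later backup slot. Below is a minimal sketch of that reading. Time is counted in slot units, and the class name, method names, and the fixed `backup_gap` rule for placing the backup slot are assumptions; the slide only says the backup lies more than ML after issue.

```python
class SlotScheduler:
    """Hypothetical slot-based data-bus reservation, with time in slot units."""

    def __init__(self, memory_latency: int, backup_gap: int):
        if backup_gap < 1:
            raise ValueError("backup slot must come after the primary slot")
        self.ml = memory_latency      # ML: addr latency + bank access + data bus latency
        self.backup_gap = backup_gap  # assumed spacing between primary and backup search
        self.reserved = set()         # data-bus slots already promised to requests

    def _first_free(self, start: int) -> int:
        slot = start
        while slot in self.reserved:
            slot += 1
        return slot

    def schedule(self, arrival: int) -> tuple:
        """Return (issue_time, primary_slot, backup_slot) for a request."""
        primary = self._first_free(arrival + self.ml)  # "start looking" at arrival + ML
        self.reserved.add(primary)
        backup = self._first_free(primary + self.backup_gap)
        self.reserved.add(backup)
        issue = primary - self.ml                      # data returns exactly ML after issue
        return issue, primary, backup


scheduler = SlotScheduler(memory_latency=4, backup_gap=5)
first = scheduler.schedule(arrival=0)
second = scheduler.schedule(arrival=0)
print(first, second)
```

In this sketch the second request is issued one slot later than the first because its earliest free data-bus slot is one slot later. The memory side would use the backup slot only when it cannot meet the primary one, for example because the target bank is busy.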

Performance Impact – Synthetic Traffic
< 9% latency impact, even at maximum load
Virtually no impact on achieved bandwidth

Performance Impact – PARSEC/STREAM
Apps have very low BW requirements
Scaled down system, similar trends

Key Contributions
Plug and play
–Everything is interchangeable and interoperable
–Only interface-die support required (communicate ML)
Better support for heterogeneous systems
–Easier DRAM-NVM data movement on the same channel
More innovation in the memory system
–Without processor-side support constraints
Fewer commands between processor and memory
–Energy, performance advantages

PART 4 – Memory Reliability

Key Bottleneck
Increased access granularity
–Every data access is spread across 36 DRAM chips
–DRAM industry standards define minimum access granularity from each chip
–Massive overfetch of data at multiple levels
  –Wastes energy
  –Wastes bandwidth
  –Occupies ranks/banks for longer, hurting performance
x4 device width restriction
–Fewer ranks for given DIMM real estate
–x8/x16/x32 more power efficient per capacity
Reliability level: 1 failed chip out of 36

A new approach: LOT-ECC
Operate on a single rank of memory: 9 chips
–and support failure of 1 chip per rank (9 chips)
Multiple tiers of localized protection
–Tier 1: Local Error Detection (checksums)
–Tier 2: Global Error Correction (parity)
–Tiers 3 & 4 to handle specific failure cases
Error correction data stored in data memory
Data mapping handled by memory controller with firmware support
–Transparent to OS, caches, etc.
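The division of labor between the first two tiers can be sketched as follows: a per-chip checksum (LED) identifies which chip returned bad data, and an XOR parity across chips (GEC) rebuilds that chip's segment. This is a simplified illustration, not the LOT-ECC implementation. The 7-bit additive checksum is a stand-in chosen for brevity, the LED and GEC values are assumed to be read back intact, and the physical data layout and Tiers 3 and 4 are omitted.

```python
from functools import reduce


def checksum(segment: bytes) -> int:
    """Hypothetical 7-bit local error detection code for one chip's segment."""
    return sum(segment) % 128


def xor_parity(segments: list) -> bytes:
    """Byte-wise XOR across equal-length segments."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*segments))


def write_line(segments: list) -> tuple:
    """Compute per-chip LED values and the global parity (GEC) on a write."""
    return [checksum(s) for s in segments], xor_parity(segments)


def read_line(segments: list, leds: list, gec: bytes) -> list:
    """Detect a single failed chip via LED, then rebuild its segment via GEC."""
    bad = [i for i, (s, c) in enumerate(zip(segments, leds)) if checksum(s) != c]
    if not bad:
        return list(segments)
    if len(bad) > 1:
        raise ValueError("more than one failed chip: beyond this sketch")
    failed = bad[0]
    survivors = [s for i, s in enumerate(segments) if i != failed]
    repaired = list(segments)
    repaired[failed] = xor_parity(survivors + [gec])  # XOR of the rest recovers it
    return repaired


original = [bytes([i, i + 1, i + 2]) for i in range(9)]  # one segment per chip
leds, gec = write_line(original)

corrupted = list(original)
corrupted[3] = b"\xff\xff\xff"                            # chip 3 returns garbage
recovered = read_line(corrupted, leds, gec)
print(recovered == original)
```

Keeping detection local to each chip is what lets the common, error-free read touch only one rank; the global parity is needed only after a checksum mismatch.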

LOT-ECC Design

The Devil is in the Details
We’re borrowing one bit from [data + LED] to use in the GEC
–Put them all in the same DRAM row
When a cache line is written,
–Write data, LED, GEC – all “self-contained”
–No read-before-write
–Guaranteed row-buffer hit
[Figure: GEC layout across Chip 0, Chip 1, Chip 7, Chip 8. Labels: PA 0-6, PA 7-13, PA 49-55, PA 56, PP A, T4; field widths 7b and 1b; surplus bit borrowed from data + LED]

Key Benefits
Energy Efficiency: Fewer chips activated per access, reduced access granularity, reduced static energy through better use of low-power modes
Performance Gains: More rank-level parallelism, reduced access granularity
Improved Protection: Can handle 1 failed chip out of 9, compared to 1 in 36 currently
Flexibility: Works with a single rank of x4 DRAMs or more efficient wide-I/O x8/x16 DRAMs
Implementation Ease: Changes to memory controller and system firmware only; commodity processor/memory/OS

Power Results
[Figure annotation: -55%]

Performance Results
Latency Reduction:
–LOT-ECC x8 – 43%
–+GEC Coalescing – 47%
–Oracular – 57%

Exploiting features in SSA

Putting it all together

Summary
Tremendous pressure on the memory system
–Bandwidth, energy, complexity, reliability
Prudently apply novel technologies
–Silicon photonics
–Low-swing wires
–3D-stacking
Rethink some fundamental design choices
–Micromanagement by the memory controller
–Overfetch in the face of diminishing locality
–Conventional ECC codes

Impact
Significant static/dynamic energy reduction
–Memory core, channel, controller, reliability
Significant performance improvement
–Bank parallelism, channel bandwidth, reliability
Significant complexity reduction
–Memory controller
Improved reliability

Synergies
[Figure labels: SSA, Photonics; Photonics, Autonomous memory; SSA, Reliability]
SSA, Photonics, and LOT-ECC provide additive energy benefits
–Each targets one of three major sources of energy consumption: DRAM array, off-chip channel, reliability
SSA, Photonics, and LOT-ECC also provide additive performance benefits
–Each targets one of three major performance bottlenecks: bank contention, off-chip BW, reliability

Research Contributions
Memory reliability [ISCA 2012]
Memory access protocol [ISCA 2011]
Memory channel architecture [ISCA 2011]
Memory chip microarchitecture [ISCA 2010]
On-chip networks [HPCA 2010]
Non-uniform power caches [HiPC 2009]
3D stacked cache design [HPCA 2009]

Future Work
Future project ideas include
–Memory architectures for graphics/throughput-oriented applications
–Memory optimizations for handheld devices
  –Tightly integrated software support
  –Managing heterogeneity, reconfigurability
  –Novel memory hierarchies
–Memory autonomy and virtualization
–Refresh management in DRAM

Acknowledgements
Rajeev
Naveen
Committee: Al, Norm, Erik, Ken
Awesome lab-mates
Karen, Ann, Emily… front office
Parents & family
Friends
