1 Towards Scalable and Energy-Efficient Memory System Architectures Rajeev Balasubramonian, Al Davis, Ani Udipi, Kshitij Sudan, Manu Awasthi, Nil Chatterjee,


1 1 Towards Scalable and Energy-Efficient Memory System Architectures Rajeev Balasubramonian, Al Davis, Ani Udipi, Kshitij Sudan, Manu Awasthi, Nil Chatterjee, Seth Pugsley, Manju Shevgoor School of Computing University of Utah

2 2 Towards Scalable and Energy-Efficient Memory System Architectures

3 3 Convergence of Technology Trends Energy Reliability New Memory Technologies BW, Capacity, and Locality for Multi-Cores Overhaul of main memory architecture!

4 4 High Level Approach Explore changes to memory chip microarchitecture Must cause minimal disruption to density Explore changes to interfaces and standards Major change appears inevitable! Explore system and memory controller innovations Most attractive, but order-of-magnitude improvement unlikely Design solutions that are technology-agnostic

5 5 Projects Memory Chip Reduce overfetch Support reliability Handle PCM drift Promote read/write parallelism Memory Interface Interface with photonics Organize channel for high capacity Memory Controller Maximize use of row buffer Schedule for low latency and energy Exploit mini-ranks CPU MC DIMM …

6 6 Talk Outline Mature work: SSA architecture – Single Subarray Access (ISCA’10) Support for reliability (ISCA’10) Interface with photonics (ISCA’11) Micro-pages – data placement for row buffer efficiency (ASPLOS’10) Handling multiple memory controllers (PACT’10) Managing resistance drift in PCM cells (NVMW’11) Preliminary work: Handling read/write parallelism Enabling high capacity Handling DMA scheduling Exploiting rank subsetting for performance and thermals

7 7 Minimizing Overfetch with Single Subarray Access Ani Udipi CPU MC DIMM … Primary Impact

8 Problem 1 - DRAM Chip Energy On every DRAM access, multiple arrays in multiple chips are activated Was useful when there was good locality in access streams –Open page policy Helped keep density high and reduce cost-per-bit With multi-thread, multi-core and multi-socket systems, there is much more randomness –“Mixing” of access streams when finally seen by the memory controller 8

9 Rethinking DRAM Organization Limited use for designs based on locality As much as 8kbytes read in order to service a 64byte cache line request Termed “overfetch” –Substantially increases energy consumption Need a new architecture that –Eliminates overfetch –Increases parallelism –Increases opportunity for power-down –Allows efficient reliability 9

10 Proposed Solution – SSA Architecture 10

11 SSA Basics Entire DRAM chip divided into small “subarrays” Width of each subarray is exactly one cache line Fetch entire cache line from a single subarray in a single DRAM chip – SSA Groups of subarrays combined into “banks” to keep peripheral circuit overheads low Close page policy and “posted-RAS” Data bus to processor essentially split into 8 narrow buses 11

12 SSA Architecture Impact Energy reduction –Dynamic – fewer bitlines activated –Static – smaller activation footprint – more and longer spells of inactivity – better power down Latency impact –Limited pins per cache line – serialization latency –Higher bank-level parallelism – shorter queuing delays Area increase –More peripheral circuitry and I/O at finer granularities – area overhead (< 5%) 12

13 Area Impact Smaller arrays – more peripheral overhead More wiring overhead in the on-chip interconnect between arrays and pin pads We did a best-effort area impact calculation using a modified version of CACTI 6.5 –Analytical model, has its limitations More feedback in this specific regard would be awesome! More info on exactly where in the hierarchy overfetch stops would be great too 13

14 14 Support for Chipkill Reliability Ani Udipi CPU MC DIMM … Primary Impact

15 Problem 2 – DRAM Reliability Many server applications require chipkill-level reliability – failure of an entire DRAM chip One example of existing systems –Consider baseline 64-bit word plus 8-bit ECC –Each of these 72 bits must be read out of a different chip, else a chip failure will lead to a multi-bit error in the 72-bit field – unrecoverable! –Reading 72 chips - significant overfetch! Chipkill even more of a concern for SSA since entire cache line comes from a single chip 15

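16 Proposed Solution
Approach similar to RAID-5: cache lines are striped across the DRAM devices on the DIMM, and the parity position rotates from row to row.
Row 0: L0 C, L1 C, L2 C, L3 C, L4 C, L5 C, L6 C, L7 C, P0 C
Row 1: L9 C, L10 C, L11 C, L12 C, L13 C, L14 C, L15 C, P1 C, L8 C
...
Last row: L56 C, L57 C, L58 C, L59 C, L60 C, L61 C, L62 C, L63 C, P7 C
Legend: L – Cache Line, C – Local Checksum, P – Global Parity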

17 Chipkill design: two-tier error protection
Tier 1 protection – self-contained error detection
– 8-bit checksum per cache line – about 1.6% storage overhead (8 bits per 512-bit line)
– Every cache line read is now slightly longer
Tier 2 protection – global error correction
– RAID-like striped parity across 8+1 chips
– 12.5% storage overhead
Error-free access (common case)
– 1 chip reads
– 2 chip writes – leads to some bank contention
– 12% IPC degradation
Erroneous access
– 9 chip operation
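The two-tier scheme on this slide can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the byte-sum checksum, the function names, and the in-memory list standing in for eight data chips plus one parity chip are all hypothetical.

```python
from functools import reduce

LINE_BYTES = 64  # cache line size used throughout the talk


def checksum8(line: bytes) -> int:
    """Hypothetical 8-bit checksum (tier 1): byte sum modulo 256."""
    return sum(line) % 256


def xor_lines(lines):
    """Bytewise XOR of equally sized lines (tier 2 parity)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*lines))


def read_line(stripe, checksums, parity, idx):
    """Return line idx, rebuilding it from parity if its checksum fails."""
    line = stripe[idx]
    if checksum8(line) == checksums[idx]:
        return line  # common case: a single chip is read
    others = [l for i, l in enumerate(stripe) if i != idx]
    return xor_lines(others + [parity])  # error case: all nine chips are read


# Eight data lines (one per chip) plus one parity line.
stripe = [bytes((17 * i + j) % 256 for j in range(LINE_BYTES)) for i in range(8)]
checksums = [checksum8(l) for l in stripe]
parity = xor_lines(stripe)

original = stripe[3]
stripe[3] = bytes(LINE_BYTES)  # simulate a failed chip returning all zeros
recovered = read_line(stripe, checksums, parity, 3)
```

The checksum only detects the failure; the correction comes from XOR-ing the seven healthy lines with the parity line, which is why the erroneous case touches all nine chips.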

18 Questions What are the common failure modes in DRAM? PCM? Do entire chips fail? Do parts of chips fail? –Which parts? Bitlines? Wordlines? Capacitors? –Entire arrays? –Entire banks? –I/O? Should all these failures be handled the same way? 18

19 19 Designing Photonic Interfaces Ani Udipi CPU MC DIMM … Primary Impact

20 Problem 3 – Memory interconnect Electrical interconnects are not scaling well –Where can photonics make an impact, both on energy and performance? Various levels in the DRAM interconnect –Memory cell to sense-amp - addressed by SSA –Row buffer to I/O – currently electrical (on-chip) –I/O pins to processor – currently electrical (off-chip) Photonic interconnects –Large static power component – laser/ring tuning –Much lower dynamic component – relatively unaffected by distance Electrical interconnects –Relatively small static component –Large dynamic component Cannot overprovision photonic bandwidth, use only where necessary 20

21 Consideration 1 – How much photonics on a die? 21 Electrical Energy Photonic Energy

22 Consideration 2 - Increasing Capacity 3D stacking is imminent There will definitely be several dies on the channel –Each die has photonic components that are constantly burning static power –Need to minimize this! TSVs available within a stack; best of both worlds –Large bandwidth –Low static energy –Need to exploit this! 22

23 Proposed Design 23 Processor DIMM Waveguide DRAM chips Photonic Interface die + Stack controller Memory controller

24 Proposed Design – Interface Die Exploit 3D die stacking to move all photonic components to a separate interface die, shared by several memory dies –Use photonics where there is heavy utilization – shared bus between processor and interface die i.e. the off-chip interconnect – Helps break pin barrier for efficient I/O, substantially improves socket-edge BW –On-stack, where there is low utilization, use efficient low- swing interconnects and TSVs 24

25 Advantages of the proposed system Reduction in energy consumption –Fewer photonic resources, without loss in performance –Rings, couplers, trimming Industry considerations –Does not affect design of commodity memory dies –Same memory die can be used with both photonic and electrical systems –Same interface die can be used with different kinds of memory dies – DRAM, PCM, STT-RAM, Memristors 25

26 Problem 4 – Communication Protocol Large capacity, high bandwidth, and evolving technology trends will increase pressure on the memory interface –Need to handle heterogeneous memory modules, each with its own maintenance requirements, further complicates scheduling –Very little interoperability – affects both consumers (too many choices!) and vendors (stock-keeping and manufacturing) –Heavy pressure on address/command bus – several commands to micro-manage every operation of the DRAM –Several independent banks – need to maintain large amounts of state to schedule requests efficiently –Simultaneous arbitration for multiple resources (address bus, data bank, data bus) to complete a single transaction 26

27 Proposed Solution – Packet-based interface Release most of the tight control the memory controller holds today Move mundane tasks to the memory modules themselves (on the interface die) – make them more autonomous – maintenance operations (refresh, scrub, etc.) – routine operations (DRAM precharge, NVM wear handling) – timing control (DRAM alone has almost 20 different timing constraints to be respected) – coding and any other special requirements The only information the memory module needs is the address and read/write identification; time slots are reserved a priori for data return

28 Advantages Better interoperability, plug and play – As long as the interface die has the necessary information, everything is interchangeable Better support for heterogeneous systems – Allows easier data movement between DRAM and NVM, for example, on the same channel Reduces memory controller complexity Allows innovation and value addition in the memory, without being constrained by processor-side support Reduces bit transport energy on the address/command bus

29 29 Data Placement with Micro-Pages To Boost Row Buffer Utility Kshitij Sudan CPU MC DIMM … Primary Impact

30 DRAM Access Inefficiencies Over fetch due to large row-buffers 8 KB read into row buffer for a 64 byte cache line Row-buffer utilization for a single request < 1% Diminishing locality in multi-cores Increasingly randomized memory access stream Row-buffer hit rates bound to go down Open page policy and FR-FCFS request scheduling Memory controller schedules requests to open row-buffers first Goal Improve row-buffer hit-rates for Chip Multi-Processors 30
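The "< 1%" figure follows directly from the two sizes on the slide. A short check, where the function name is only illustrative:

```python
ROW_BUFFER_BYTES = 8 * 1024  # 8 KB read into the row buffer (from the slide)
CACHE_LINE_BYTES = 64        # one cache line request (from the slide)


def row_buffer_utilization(lines_used: int) -> float:
    """Fraction of the row buffer consumed by lines_used cache lines."""
    return lines_used * CACHE_LINE_BYTES / ROW_BUFFER_BYTES


print(f"{row_buffer_utilization(1):.2%}")  # 0.78%
```

A row holds 128 cache lines, so every additional row-buffer hit before the row is closed adds roughly 0.78 percentage points of utilization.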

31 Key Observation Gather all heavily accessed chunks of independent OS pages and map them to the same DRAM row Cache Block Access Pattern Within OS Pages For heavily accessed pages in a given time interval, accesses are usually to a few cache blocks 31

32 Basic Idea [Figure: 4 KB OS pages in DRAM memory are divided into 1 KB micro-pages, ranked from hottest to coldest; a reserved DRAM region is shown alongside main memory.]

33 Hardware Implementation (HAM) [Figure: baseline vs. Hardware Assisted Migration (HAM). A CPU memory request to physical address X in the 4 GB main memory is looked up in a mapping table (old address X, new address Y) and redirected to new address Y in the 4 MB reserved DRAM region.]
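The remapping idea can be sketched as follows. This is a simplified model, not the hardware described in the paper: the base address of the reserved region, the function names, and the per-epoch trace format are assumptions, and the sketch ignores what happens to data displaced from the reserved region.

```python
from collections import Counter

MICRO_PAGE = 1024                          # 1 KB micro-page (from the slides)
RESERVED_BYTES = 4 << 20                   # 4 MB reserved DRAM region (from the slides)
RESERVED_BASE = 0xFFC0_0000                # hypothetical: top 4 MB of a 4 GB space
RESERVED_SLOTS = RESERVED_BYTES // MICRO_PAGE


def build_mapping(epoch_addresses, slots=RESERVED_SLOTS):
    """Map the hottest micro-pages of one epoch into the reserved region."""
    counts = Counter(addr // MICRO_PAGE for addr in epoch_addresses)
    hottest = [mp for mp, _ in counts.most_common(slots)]
    return {mp: RESERVED_BASE + i * MICRO_PAGE for i, mp in enumerate(hottest)}


def translate(addr, mapping):
    """Redirect an address if its micro-page was migrated, else pass it through."""
    micro_page, offset = divmod(addr, MICRO_PAGE)
    base = mapping.get(micro_page)
    return addr if base is None else base + offset


trace = [0x1000] * 5 + [0x1010] * 3 + [0x9000]   # hypothetical access trace
mapping = build_mapping(trace, slots=1)          # room for one micro-page only
```

With one slot, only the micro-page containing `0x1000` and `0x1010` is migrated; `0x9000` keeps its original address.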

34 Results 5M cycle EPOCH, ROPS, HAM and ORACLE Apart from average 9% performance gains, our schemes also save DRAM energy at the same time! Percent change in performance 34

35 Conclusions On average, for applications with room for improvement and with our best performing scheme Average performance ↑ 9% (max. 18%) Average memory energy consumption ↓ 18% (max. 62%). Average row-buffer utilization ↑ 38% Hardware assisted migration offers better returns due to fewer overheads of TLB shoot-down and misses 35

36 36 Data Placement Across Multiple Memory Controllers Kshitij Sudan CPU MC DIMM … Primary Impact

37 DRAM NUMA Latency [Figure: four sockets, each with four cores, an on-chip memory controller (MC), and DRAM DIMMs on a memory channel; the sockets are connected by the QPI interconnect, with socket boundaries marked.]

38 Problem Summary Pin limitations → increasing queuing delay Almost 8x increase in queuing delays from single core/one thread to 16 cores/16 threads Multi-cores → increasing row-buffer interference Increasingly randomized memory access stream Longer on- and off-chip wire delays → increasing NUMA factor NUMA factor already at 1.5x today Goal Improve application performance by reducing queuing delays and NUMA latency 38

39 Policies to Manage Data Placement Among MCs
Adaptive First Touch: assign each new virtual page to a DRAM (physical) page belonging to the MC $j$ that minimizes a cost function
$cost_j = \alpha \times load_j + \beta \times rowhits_j + \lambda \times distance_j$
Dynamic Page Migration: programs change phases → imbalance in MC load; migrate pages between MCs at runtime
$cost_k = \Lambda \times distance_k + \Gamma \times rowhits_k$
Integrating Heterogeneous Memory Technologies
$cost_j = \alpha \times load_j + \beta \times rowhits_j + \lambda \times distance_j + \tau \times LatencyDimmCluster_j + \mu \times Usage_j$
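The Adaptive First Touch cost function can be evaluated per memory controller and the minimum taken. The sketch below follows the first formula on the slide term by term; the weights, the statistics, and the function names are hypothetical, and the slide does not say how the terms are normalized or signed.

```python
def aft_cost(load, rowhits, distance, alpha=1.0, beta=1.0, lam=1.0):
    """cost_j = alpha * load_j + beta * rowhits_j + lambda * distance_j."""
    return alpha * load + beta * rowhits + lam * distance


def pick_mc(stats, **weights):
    """Return the MC id with the lowest cost.

    stats maps an MC id to a (load, rowhits, distance) tuple.
    """
    return min(stats, key=lambda j: aft_cost(*stats[j], **weights))


# Hypothetical per-MC statistics seen by the core touching a new page.
stats = {0: (10, 2, 1), 1: (3, 2, 2)}
chosen = pick_mc(stats)
```

Here MC 0 is closer but heavily loaded (cost 13), so the page goes to MC 1 (cost 7). Raising `lam` shifts the choice back toward the nearer controller.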

40 Summary Multiple on-chip MCs will be common in future CMPs Multiple cores sharing one MC, MCs controlling different types of memories Intelligent data mapping needed Adaptive First Touch policy (AFT) Increases performance by 6.5% in homogeneous and by 1.6% in DRAM – PCM hierarchy. Dynamic page migration, improvement on AFT Further improvement over AFT - 8.9% over baseline in homogeneous, and by 4.4% in best performing DRAM-PCM hierarchy. 40

41 41 Managing Resistance Drift in PCM Cells Manu Awasthi CPU MC DIMM … Primary Impact

42 Quick Summary Multi level cells in PCM appear imminent A number of proposals exist to handle hard errors and lifetime issues of PCM devices Resistance Drift is a less explored phenomenon – Will become increasingly significant as number of levels/cell increases – primary cause of “soft errors” – Naïve techniques based on DRAM-like refresh will be extremely costly for both latency and energy – Need to explore holistic solutions to counter drift 42

43 What is Resistance Drift? [Figure: resistance vs. time (T0 to Tn) for states A and B, between the crystalline and amorphous extremes; the drifted cell is marked ERROR!!]

44 Resistance Drift Data
Cell type: drift time at room temperature (secs)
Median 11 cell: (value missing in transcript)
Worst case 11 cell: 10^15
Median 10 cell: 10^24
Worst case 10 cell: 5.94
Median 01 cell: 10^8
Worst case 01 cell: 1.81
[Figure: resistance states (11), (10), (01), (00)]

45 Resistance Drift - Issues Programmed resistance drifts according to a power-law equation: $R_{drift}(t) = R_0 \times t^{\alpha}$ – $R_0$ and $\alpha$ usually follow a Gaussian distribution Time to drift (error) depends on – Programmed resistance ($R_0$), and – Drift coefficient ($\alpha$) – Is highly unpredictable!!
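Inverting the power law gives the time at which a cell crosses into the next resistance band. The sketch below assumes time is expressed in normalized units, as in the slide's formula, and uses made-up values for $R_0$, $\alpha$, and the band boundary; none of these numbers come from the talk.

```python
def drift_resistance(r0: float, alpha: float, t: float) -> float:
    """R_drift(t) = R0 * t**alpha, with t in normalized time units."""
    return r0 * t ** alpha


def time_to_error(r0: float, alpha: float, r_boundary: float) -> float:
    """Solve R0 * t**alpha = r_boundary for t."""
    return (r_boundary / r0) ** (1.0 / alpha)


# Hypothetical cell programmed to 10 kOhm with a band boundary at 100 kOhm.
typical = time_to_error(1e4, 0.1, 1e5)   # 10 ** 10 time units
worst = time_to_error(1e4, 0.2, 1e5)     # 10 ** 5 time units
```

Doubling the drift coefficient cuts the time to error by five orders of magnitude in this example, which illustrates why worst-case cells would dictate the scrub rate.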

46 Resistance Drift - How it happens Median case cell: typical R0, typical α Worst case cell: high R0, high α Scrub rate will be dictated by the worst case R0 and worst case α Naive refresh/scrub will be extremely costly! [Figure: distribution of number of cells over resistance, drifting from R0 to Rt until some cells reach ERROR!!]

47 Architectural Solutions - Headroom Assumes support for Light Array Reads for Drift Detection (LARDDs) & ECC-N Headroom-h scheme – scrub is triggered if N-h errors are detected + Decreases probability of errors slipping through – Increases frequency of full scrub and hence decreases lifetime Gradual Headroom scheme: start with a long LARDD interval, increase LARDD frequency as errors increase [Flowchart boxes: Read Line, Check for Errors, Errors < N-h (True/False), Scrub Line, After N cycles]
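The Headroom-h trigger reduces to one comparison per LARDD. A minimal sketch, assuming ECC-N means the code can handle up to N errors per line; the function name and the example values are hypothetical.

```python
def should_scrub(errors_detected: int, ecc_n: int, headroom: int) -> bool:
    """Headroom-h: trigger a full scrub once N - h errors are detected."""
    if not 0 <= headroom < ecc_n:
        raise ValueError("headroom must satisfy 0 <= h < N")
    return errors_detected >= ecc_n - headroom


# Hypothetical ECC-8 line with a headroom of 3: scrub at 5 or more errors.
decisions = [should_scrub(e, ecc_n=8, headroom=3) for e in range(9)]
```

A larger headroom scrubs earlier, which lowers the chance of exceeding the ECC budget between two LARDDs at the cost of more frequent writes.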

48 Reducing Overheads with Circuit Level Solution Invoking ECC on every LARDD increases energy consumption A parity-like error detection circuit is used to signal the need for a full-fledged ECC error detect – The number of drift-prone states in each line is counted when the line is written into memory (a single bit represents odd/even) – At every LARDD, parity is verified Reduces need for ECC read-compare at every LARDD cycle [Figure: resistance states (11), (10), (01), (00)]
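The parity-like check can be modeled in software as below. Which states count as drift-prone is an assumption here (the two intermediate states, 01 and 10); the function names and the example line are likewise hypothetical.

```python
DRIFT_PRONE = {0b01, 0b10}  # assumption: the intermediate MLC states


def drift_parity(cells) -> int:
    """Odd/even bit for the number of cells in drift-prone states."""
    return sum(cell in DRIFT_PRONE for cell in cells) & 1


def needs_full_ecc(cells_now, stored_parity: int) -> bool:
    """At a LARDD, invoke full ECC only if the parity no longer matches."""
    return drift_parity(cells_now) != stored_parity


written = [0b11, 0b10, 0b01, 0b00]      # line contents at write time
stored = drift_parity(written)          # single bit kept with the line
drifted = [0b11, 0b00, 0b01, 0b00]      # one cell has left its drift-prone state
```

As with any single parity bit, an even number of cells changing category between two LARDDs would go unnoticed by this check alone.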

49 More Solutions Precise Writes – More write iterations to program state closer to mean, reduce chance of drift – Increases energy consumption, write time and decreases lifetime! Non Uniform Guardbanding – Resistance is equally distributed between all n states – Expand resistance range for drift prone states at expense of non-drift prone ones 49

50 Results 50 LARDD Interval (seconds) Errors

51 Conclusions Resistance drift will worsen with MLC scaling Naïve solutions based on ECC support are costly for PCM – Increased write energy, decreased lifetimes Holistic solutions need to be explored to counter drift at device, architectural and system levels – 39% reduction in energy, 4x fewer errors, 102x increase in lifetime

52 52 Handling Read/Write Parallelism Nil Chatterjee CPU MC DIMM … Primary Impact

53 The Problem 53 Writes are not on the critical path for program execution, but they can slow down reads through resource contention In future chipkill correct systems, each data write will necessitate an update of the ECC codes and the impact of writes will be more evident. In PCM, the problem is exacerbated by the significantly longer write times. Abstracting the writes away improves read latency by 48% in non-ECC DRAM systems.

54 Impact of Writes on Reads Write draining affects read latencies by – Increasing the queuing delay – Reducing the read stream’s row-buffer locality 54

55 Bank Contention from Writes Reads are not scheduled in the middle of the WQ drain because that would require multiple bus turnarounds, incurring tWTR and tOST delays. The data bus bandwidth is underutilized during WQ draining, leading to performance loss. However, opportunities to schedule read accesses to idle banks might exist in this interval.

56 Example 56

57 Solution : Increasing R/W overlap During a WQ drain cycle, schedule partial reads to idle banks. – Following a column read command, the data is fetched from the sense amplifiers into a small buffer (64byte) near the I/O pads. – Data will be streamed out only after the WQ reaches the low watermark - no turnaround delays. Immediately following the WQ drain, a flurry of prefetched reads can occupy the data bus. 57

58 Solution : Increasing R/W overlap. 58

59 Impact A small pool of partial read registers can help increase the data bus utilization post writes. In PCM system, where writes are very expensive, partial reads can have higher impact. The JEDEC standard must be augmented to support a partial read command. 59

60 60 Organizing Channels for High Capacity Kshitij Sudan CPU MC DIMM … Primary Impact

61 Increasing DRAM Capacity by Re-Architecting Memory Channel Increase DRAM capacity, while minimizing power Re-architect CPU-to-DRAM channel Study effects of bus width and protocol (serial vs. parallel) CMPs might have changed the playfield! 61

62 Increasing DRAM Capacity by Re-Architecting Memory Channel Organize modules as binary tree, and move some MC functionality to “Buffer Chip” Reduces module depth from O(n) to O(log n) Reduces worst case latency, and improves signal integrity Buffer chip manages low- level DRAM operations and channel arbitration Not limited by worst-case access latency like FB- DIMM NUMA like DRAM access – leverage data mapping 62
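The O(n) versus O(log n) claim can be made concrete by counting worst-case hops. The sketch assumes one module per node of a complete binary tree and one hop per level; both are simplifications not stated on the slide.

```python
def chain_depth(n_modules: int) -> int:
    """Worst-case hops when modules are daisy-chained: O(n)."""
    return n_modules


def tree_depth(n_modules: int) -> int:
    """Levels in a complete binary tree with n_modules nodes: O(log n)."""
    return n_modules.bit_length()  # equals floor(log2(n)) + 1 for n >= 1


for n in (1, 7, 15, 63):
    print(n, chain_depth(n), tree_depth(n))
```

With 63 modules the farthest module is 63 hops away in a chain but only 6 levels deep in the tree.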

63 63 Handling DMA Scheduling Kshitij Sudan CPU MC DIMM … Primary Impact

64 Handling DMA Scheduling Reduce conflicts between CPU-generated DRAM requests and DMA-generated DRAM requests

65 Handling DMA Scheduling Study interference from DMA requests on CPU-generated DRAM requests With on-chip MCs, it is unclear how DMA requests compete with CPU requests. Devise scheduling policies to minimize DMA and CPU access conflicts Infer how DMA and CPU requests are arbitrated at the MC No CPU manufacturer documentation available publicly!

66 66 Variable Rank Subsetting Seth Pugsley CPU MC DIMM … Primary Impact

67 Motivation for Rank Subsetting Rank Subsetting – Split up a rank+data channel into multiple, smaller ranks+data channels Prior motivations: reduce dynamic energy and overfetch 67

68 Rank Size Options
Standard 8 chip-wide rank: 1x64-bit data bus, 2 banks, 1x8KB row buffer, 64 byte cache line in 8 clock edges, all transfers sequential
4 chip-wide narrow rank: 2x32-bit data buses, 4 banks, 2x4KB row buffers, 64 byte cache line in 16 clock edges, can transfer 2 cache lines in parallel
2 chip-wide narrow rank: 4x16-bit data buses, 8 banks, 4x2KB row buffers, 64 byte cache line in 32 clock edges, can transfer 4 cache lines in parallel
1 chip-wide narrow rank: 8x8-bit data buses, 16 banks, 8x1KB row buffers, 64 byte cache line in 64 clock edges, can transfer 8 cache lines in parallel
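All four configurations follow from one parameter, the number of chips per narrow rank. The sketch below derives the slide's numbers from the standard-rank baseline (8 chips of 8 bits each, 2 banks, 8 KB row buffer); the function and field names are illustrative.

```python
def rank_option(chips_per_rank, total_chips=8, chip_width_bits=8,
                base_banks=2, row_bytes=8192, line_bytes=64):
    """Derive bus, bank, and transfer parameters for one rank size."""
    subranks = total_chips // chips_per_rank
    bus_bits = chips_per_rank * chip_width_bits
    return {
        "data_buses": subranks,
        "bus_bits": bus_bits,
        "banks": base_banks * subranks,
        "row_buffer_bytes": row_bytes // subranks,
        "clock_edges_per_line": line_bytes * 8 // bus_bits,
        "parallel_lines": subranks,
    }


options = {chips: rank_option(chips) for chips in (8, 4, 2, 1)}
for chips, opt in options.items():
    print(chips, opt)
```

Each halving of the rank width doubles the number of independent buses and banks, halves the row buffer per rank, and doubles the transfer time of a single cache line.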

69 Impact on Queuing Delay [Timing diagram: repeated core accesses, each labeled Access (16 cyc) and DB (4 cyc)] Behavior with a single bank: data bus utilization of 25%

70 Impact on Queuing Delay [Timing diagrams: the single-bank case from the previous slide, and a two-bank case in which Core 0 and Core 1 accesses (16 cyc each) overlap] Behavior with a single bank: data bus utilization of 25% Behavior with two banks: data bus utilization of 50%
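The 25% and 50% figures are consistent with a simple model in which each bank delivers one 4-cycle data burst per 16-cycle access window. That reading of the timing diagram is an assumption, as is the function name.

```python
def data_bus_utilization(banks: int, access_cycles: int = 16,
                         burst_cycles: int = 4) -> float:
    """Fraction of cycles the data bus is busy when banks overlap accesses."""
    return min(1.0, banks * burst_cycles / access_cycles)


for banks in (1, 2, 4):
    print(banks, f"{data_bus_utilization(banks):.0%}")
```

Under this model four independent banks are enough to saturate the bus, which is the intuition behind using narrower ranks to expose more banks.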

71 Advantages of Rank Subsetting More open rows – Each open row is narrower (still OK hit rates) Reduced Queuing Delay – More banks available and better data bus utilization 71

72 Performance for Static Rank Subsetting 72

73 Variable Rank Subsetting Use a different size rank for each memory op – e.g., 1-wide transaction on data bus at same time as 2-wide and 4-wide transactions – Scheduling can get pretty hairy – Many wasted data bus slots [Figure: data bus lanes D0 to D7 over time, with slots marked 1-wide, 2-wide, 4-wide, 8-wide, or wasted]

74 More Sensible Variable Rank Subsetting Still can use a different size rank for each memory op Limit rank size to only 2 options – Software chooses mode for newly allocated pages – Scheduling is much easier than the previous example [Figure: data bus lanes D0 to D7 over time, with slots marked 4-wide, 8-wide, or wasted]

75 75 Exploiting Rank Subsetting to Alleviate Thermal Constraints Manju Shevgoor CPU MC DIMM … Primary Impact

76 The problem - DRAM is getting hot DRAM temperatures can rise up to 95°C Refresh rate needs to double once DRAM crosses 85°C Thermal emergencies due to elevated temperatures adversely affect performance Cooling systems are expensive [Figures: full DIMM heat spreader, Zhu et al., ITHERM’08; typical cooling system, Liu et al., HPCA’11]

77 77 Current Thermal Throttling Techniques CPU Throttling Reduces overall activity Thermal Shutdown Stop all requests to over-heated chips Memory Bandwidth Throttling Lower channel bandwidth to reduce DRAM activity All DRAM chips are affected by these techniques irrespective of their temperature Even cool chips which could otherwise be operating at optimal throughput are also throttled

78 Refresh Overhead As memory chips get denser, this problem only worsens Integer workloads can have up to 13% IPC degradation because of refresh Chips working at Extended Temperature Range will cause larger IPC degradation [Figure: Elastic Refresh, Stuecheli et al., MICRO’10]

79 Temperature Profile along a DIMM Proximity to the hot processor results in unequal temperature Position with respect to airflow also impacts the temperature Temperature difference between the hottest and coolest chips can be 10°C 79 Typical Temperature Profile Along the RDIMM Source: Zhu et al., ITHERM’08 Typical Cooling System, Liu et al., HPCA’11

80 Baseline All chips are grouped into 1 Rank Not all chips are ‘HOT’ Not all chips need to be throttled! 80 Buffer Rank 1 DIMM Baseline Rank Organization

81 Proposed Solution Proposed rank organization: statically split the DIMM into multiple ranks based on temperature Not all ranks are equally hot, so penalize only the hottest ranks Control refresh rate at rank granularity Only the hottest chips are refreshed every 32ms; the rest can be refreshed every 64ms [Figure: DIMM with a buffer and Ranks 1 to 4; Rank 1 is labeled coolest and Rank 3 warmest]

82 Fine-Grained DRAM Throttling Need a throttling mechanism which can be applied at a finer granularity Temperature Aware Cache Replacement – Modify LRU to preferentially evict lines belonging to Cool Ranks – Will reduce activity only in Hot Ranks [Figure: cache sets ordered from MRU to LRU, with each line tagged by its rank (R1 to R4)]
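One way to bias LRU toward cool ranks is to search a small window at the LRU end of the stack for a cool-rank line. This is a sketch of the idea only: the window size, the data layout, and the function name are assumptions, and the slide does not specify how the modified policy picks its victim.

```python
def choose_victim(lru_order, rank_of, hot_ranks, window=4):
    """Pick an eviction victim, preferring lines that map to cool ranks.

    lru_order lists the lines in a set from LRU (index 0) to MRU.
    rank_of maps each line to the DRAM rank that holds it.
    """
    for line in lru_order[:window]:
        if rank_of[line] not in hot_ranks:
            return line          # oldest line whose rank is cool
    return lru_order[0]          # every candidate is hot: plain LRU


# Hypothetical 4-way set; rank 3 is currently hot.
lru_order = ["a", "b", "c", "d"]
rank_of = {"a": 3, "b": 1, "c": 3, "d": 2}
victim = choose_victim(lru_order, rank_of, hot_ranks={3})
```

Line `a` is the true LRU line but lives in the hot rank, so `b` is evicted instead; keeping `a` cached avoids a future access to the hot rank.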

83 Rank-wise Refresh Refresh only as fast as needed Only Ranks operating at Extended Temperature Range are refreshed every 32ms Ranks operating at Normal Temperature Range are refreshed every 64ms [Figure: DIMM with a buffer and Ranks 1 to 4; Rank 1 is labeled coolest and Rank 3 warmest, with ranks grouped into Extended and Normal Temperature Ranges]
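The saving from rank-wise refresh can be estimated by counting full refresh passes per second. The 85°C threshold and the 32 ms and 64 ms intervals come from the slides; the rank temperatures and function names below are hypothetical.

```python
EXTENDED_RANGE_THRESHOLD_C = 85.0  # refresh rate doubles above this (from the slides)


def refresh_interval_ms(temp_c: float) -> int:
    """32 ms in the Extended Temperature Range, 64 ms otherwise."""
    return 32 if temp_c > EXTENDED_RANGE_THRESHOLD_C else 64


def refresh_passes_per_second(rank_temps_c) -> float:
    """Total full-refresh passes per second across all ranks."""
    return sum(1000 / refresh_interval_ms(t) for t in rank_temps_c)


rank_temps = [70.0, 80.0, 90.0, 84.0]                    # hypothetical readings
rank_wise = refresh_passes_per_second(rank_temps)        # only one rank at 32 ms
uniform = refresh_passes_per_second([90.0] * 4)          # whole DIMM treated as hot
```

In this example rank-wise control needs 78.125 passes per second instead of 125, a 37.5% reduction in refresh activity.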

84 Summary 84 Split DIMM into Mini Ranks Model Temperature of Chips Throttle Activity of Hot Ranks Increase Refresh Rate of Hot Rank Only Penalize Hot Ranks ONLY!! Keeps the Chips from Reaching High Temp. Maintains Data Integrity of Chips once they get Hot

85 85 Summary Converging technology trends require an overhaul of main memory architectures Multi-pronged approach required for significant improvements: memory chip, controller, interface, OS Future memory chips must also optimize for energy and reliability, and not just latency and density Publications:

86 86 Acknowledgments Collaborators at HP Labs, IBM, Intel Funding from NSF, Intel, HP, University of Utah Thanks for hosting!

