Rajeev Balasubramonian

Presentation transcript:

CACTI 7: New Tools for Interconnect Exploration in Innovative Off-Chip Memories
Rajeev Balasubramonian, Andrew B. Kahng, Naveen Muralimanohar, Ali Shafiee, Vaishnav Srinivas

Main Memory Matters
The innovation hub is moving to memory:
- Software: in-memory DBs, key-value stores, graph algorithms, deep learning
- Architecture: commodity CPUs, accelerators
- Technology: DDR4, HMC, HBM, NVM
Shift in bottlenecks. Example innovations: NDP; DDR to GDDR5, giving 3x TOPS in the TPU.

Two Silos
- CACTI 7 can be used out-of-the-box when defining memory parameters for traditional memory systems.
- CACTI 7 primitives can be leveraged to model and evaluate new memory architectures.

Talk Outline
- CACTI for the main memory: inputs/outputs, the nuts and bolts, modeling I/O power, design space exploration
- Case studies, two novel architectures: Cascaded Channels and Narrow Channels

CACTI for Memory: Inputs and Outputs
Inputs:
- DRAM type: DDR3, DDR4
- Power parameters
- Access pattern: bandwidth, row buffer hits, Rd/Wr ratio
- #channels, ECC vs. not
Internals: cost table, bandwidth table, exhaustive search over channel configs
Outputs: capacity, cost, bandwidth, energy per access
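The exhaustive search over channel configurations can be sketched in a few lines. This is a toy illustration, not CACTI 7's actual API: the DIMM entries, the bandwidth-derating rule, and the function name are all assumptions standing in for the tool's calibrated cost and bandwidth tables.

```python
# Hypothetical sketch of CACTI 7's exhaustive design-space search:
# enumerate channel configurations, keep those meeting the demanded
# capacity, and report the cheapest and the highest-bandwidth options.
from itertools import product

# (name, capacity_GB, cost_$, bandwidth_MHz) -- toy entries, not real data
DIMMS = [
    ("RDIMM-8GB",   8,   64, 800),
    ("RDIMM-16GB", 16,  122, 800),
    ("LRDIMM-32GB", 32, 287, 667),
]

def explore(demanded_capacity_gb, max_channels=4, max_dpc=3):
    configs = []
    for (name, cap, cost, bw), nchan, dpc in product(
            DIMMS, range(1, max_channels + 1), range(1, max_dpc + 1)):
        total_cap = cap * nchan * dpc
        if total_cap < demanded_capacity_gb:
            continue
        # Bandwidth scales with channels; dividing by DPC is a crude
        # stand-in for CACTI 7's load-dependent bandwidth table.
        total_bw = bw * nchan / dpc
        configs.append((name, nchan, dpc, cost * nchan * dpc, total_bw))
    cheapest = min(configs, key=lambda c: c[3])
    fastest = max(configs, key=lambda c: c[4])
    return cheapest, fastest

cheapest, fastest = explore(64)
```

With these toy numbers, a 64 GB demand is met most cheaply by registered 16 GB DIMMs, and at highest bandwidth by spreading them one-per-channel.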

DIMM Cost
- Cost factors: technology, capacity, support for ECC, max bandwidth, vendor
- Costs aggregated from online sources; cost is volatile and should be updated periodically
- The cost-capacity relationship is not linear

Cost in dollars:
              4GB   8GB   16GB   32GB   64GB
DDR3 UDIMM     40    76
DDR3 RDIMM     42    64    122    304
DDR3 LRDIMM                211    287   1079
DDR4 UDIMM     26    46
DDR4 RDIMM     33    60    126    310
DDR4 LRDIMM                279    331   1474
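The non-linearity shows up directly in dollars-per-GB computed from the slide's DDR3 price points (the capacity each price maps to is inferred from the table's left-to-right order, so treat the mapping as an assumption, and the prices themselves as dated snapshots):

```python
# Dollars-per-GB from three DDR3 price points on the slide.
prices = {  # (DIMM type, capacity GB) -> cost in dollars
    ("RDIMM", 4): 42,
    ("RDIMM", 32): 304,
    ("LRDIMM", 64): 1079,
}

per_gb = {k: cost / k[1] for k, cost in prices.items()}
# RDIMMs hold roughly $9.5-$10.5/GB from 4GB to 32GB,
# but the 64GB LRDIMM jumps to nearly $17/GB.
```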

Bandwidth
Bandwidth depends on load (DIMMs per channel), voltage, and DIMM type. The slide's table lists peak frequencies at 1 DPC (MHz) for DDR3 and 1.2V DDR4 across UDIMM-DR, RDIMM-DR, RDIMM-QR, and LRDIMM-QR configurations, spanning 533-800 MHz for DDR3 and 933-1066 MHz for DDR4.

Power Modeling
Extends CACTI-I/O:
- DDR4 and SerDes support added; SerDes parameters taken from the literature for different lengths/speeds
- For parallel buses, support for more accurate termination power, calibrated with HSPICE simulations
- Different termination models for each bus type: different frequency, DIMMs per channel, on-DIMM vs. on-board, and different range (short or long)
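The termination models above boil down to resistor networks. A rough sketch of how termination power for a parallel bus might be estimated follows; the resistor values, duty factor, and the function itself are illustrative assumptions, not CACTI 7's HSPICE-calibrated model:

```python
def termination_power_w(vdd, r_on, r_tt1, r_tt2, n_lines, activity):
    """Back-of-the-envelope termination power for a parallel bus.

    vdd      : supply voltage (V)
    r_on     : driver on-resistance (ohm)
    r_tt1/2  : split on-die-termination pull-up/pull-down resistances (ohm)
    n_lines  : number of signal lines (e.g., 64 data bits)
    activity : fraction of time the bus is actively driven
    """
    # Static divider current through the split termination (always on).
    p_divider = vdd ** 2 / (r_tt1 + r_tt2)
    # Thevenin equivalent of the split termination, biased at mid-rail.
    r_term = (r_tt1 * r_tt2) / (r_tt1 + r_tt2)
    # While driving, roughly vdd/2 is dropped across driver + termination.
    p_drive = (vdd / 2) ** 2 / (r_on + r_term)
    return (p_divider + p_drive * activity) * n_lines

# DDR3-style numbers: 1.5 V, 34-ohm driver, 120/120-ohm split ODT, 64 lines
p = termination_power_w(1.5, 34, 120, 120, 64, 0.7)
```

Even this crude model shows why termination dominates I/O power on multi-drop buses: the divider burns power whether or not data is moving.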

Interconnect Model API

Power Analysis (DDR3)

Power Analysis (DDR4)

Cost and Bandwidth Analysis
- Highest possible bandwidth for the demanded capacity
- Lowest possible cost for the demanded capacity

Two Case Studies
Key observations: higher DPC means less bandwidth; more channels give high bandwidth at low cost.
- New Idea I: Cascaded Channels. Each segment has few DIMMs, so each segment can run at higher bandwidth.
- New Idea II: Narrow Channels. Partition the channel into many parallel channels; fewer DIMMs per data wire and a new ECC scheme yield higher bandwidth, with lower power on the DIMM.

Cascaded Channels
A Relay-on-Board (RoB) chip splits the channel into cascaded segments.
- Same DPC, higher bandwidth: a baseline channel limited to 533 MHz can run each segment behind the RoB at 667 MHz.
- Same bandwidth, lower cost: e.g., 64 GB on one channel can be split into 32 GB segments, all at 667 MHz.
- Cost: one memory cycle of added latency per relay.

Cascaded Hybrid Memory
NVM is slow, so software is optimized to access DRAM more. With one channel of DRAM and one channel of NVM, the channel load is unbalanced; cascading a DRAM frontend with an NVM backend on the same channel balances the load.

Narrow Channels
- Partition the channel into narrower parallel channels; the command/address bus is shared between them.
- Higher bandwidth, but higher latency per access.
- Lower frequency and power for the DRAM chips!
- ECC on the DIMM and a CRC on the link reduce the bandwidth overhead.
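The latency side of the trade-off is easy to see from channel width alone. A toy calculation, assuming 64 data bits for a standard channel and 32/24 data bits for the x36/x24 splits mentioned later (the exact framing and ECC/CRC bit placement in the actual design are assumptions here):

```python
# Bus beats needed to move one 64-byte cache line at various data widths.
CACHE_LINE_BITS = 64 * 8  # 512 bits

def beats_per_line(data_width_bits):
    """Transfers needed for one cache line (ceiling division)."""
    return -(-CACHE_LINE_BITS // data_width_bits)

# Assumed data widths: full channel, 2-way x36 split, 3-way x24 split.
beats = {w: beats_per_line(w) for w in (64, 32, 24)}
```

A narrower channel needs a longer burst per line (higher latency), but several narrow channels operate in parallel, which is where the aggregate bandwidth gain comes from.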

Methodology
- Trace-based simulation: traces generated by Simics, fed to USIMM
- Memory-intensive benchmarks (NPB and SPEC2006)
- 8 cores at 3.2 GHz; L1D = 32KB, L1I = 32KB, L2 = 8MB
- Power modeled with CACTI 7

Cascaded Channels: Performance
- 25% higher bandwidth (DDR3), 13% higher bandwidth (DDR4)
- 22% higher IPC

Cascaded Latency

Cascaded Power: DRAM Cartridge
Baseline: one channel at 533 MHz, 70% utilization. Cascaded: 667 MHz segments at 70% and 35% utilization.

Config     DIMM    BoB    I/O     Total   Energy/BW
Baseline   23.2W   5.5W   9.4W    38.1W   7.9 nJ/B
Cascaded   22.6W   6.4W   12.2W   41.2W   6.7 nJ/B
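The power and energy-per-byte figures on this slide are internally consistent, which can be checked by recovering the implied bandwidth from them (a sanity check on the reported numbers, not part of the authors' methodology):

```python
# Energy-per-byte = total power / sustained bandwidth, so the implied
# bandwidth can be recovered from the slide's totals and nJ/B figures.
def implied_bandwidth_gbps(total_power_w, energy_nj_per_byte):
    # W / (nJ/B) = (J/s) / (1e-9 J/B) = 1e9 B/s = GB/s
    return total_power_w / energy_nj_per_byte

baseline = implied_bandwidth_gbps(38.1, 7.9)  # ~4.8 GB/s
cascaded = implied_bandwidth_gbps(41.2, 6.7)  # ~6.1 GB/s
gain = cascaded / baseline - 1                # ~27% more bandwidth
```

So the cascaded design draws ~8% more total power but delivers ~27% more bandwidth, which is why its energy per byte drops from 7.9 to 6.7 nJ/B.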

Cascaded Cost

Cascaded Hybrid: Percentage of Load on DRAM

Narrow Channel: Performance
Performance improvement: 18% with 2 channels of x36; 17% with 3 channels of x24.

Narrow Channel: Power
23% overall memory power reduction.

Conclusion
- CACTI 7 models off-chip memories and I/O, with a detailed I/O power model
- Design space exploration analyzes trade-offs among capacity, power, bandwidth, and cost
- Two novel architectures: Cascaded Channels and Narrow Channels