Dong Hyuk Woo Nak Hee Seong Hsien-Hsin S. Lee

Slides:

Advertisements

Similar presentations

Zehan Cui, Yan Zhu, Yungang Bao, Mingyu Chen Institute of Computing Technology, Chinese Academy of Sciences July 28, 2011.

Advertisements

Jaewoong Sim Alaa R. Alameldeen Zeshan Chishti Chris Wilkerson Hyesoon Kim MICRO-47 | December 2014.

Virtual Exclusion: An Architectural Approach to Reducing Leakage Energy in Multiprocessor Systems Mrinmoy Ghosh Hsien-Hsin S. Lee School of Electrical.

CML Efficient & Effective Code Management for Software Managed Multicores CODES+ISSS 2013, Montreal, Canada Ke Bai, Jing Lu, Aviral Shrivastava, and Bryce.

A Scalable and Reconfigurable Search Memory Substrate for High Throughput Packet Processing Sangyeun Cho and Rami Melhem Dept. of Computer Science University.

University of Michigan Electrical Engineering and Computer Science University of Michigan Electrical Engineering and Computer Science University of Michigan.

Smart Refresh: An Enhanced Memory Controller Design for Reducing Energy in Conventional and 3D Die-Stacked DRAMs Mrinmoy Ghosh Hsien-Hsin S. Lee School.

Synonymous Address Compaction for Energy Reduction in Data TLB Chinnakrishnan Ballapuram Hsien-Hsin S. Lee Milos Prvulovic School of Electrical and Computer.

Evaluating an Adaptive Framework For Energy Management in Processor- In-Memory Chips Michael Huang, Jose Renau, Seung-Moon Yoo, Josep Torrellas.

System Design Tricks for Low-Power Video Processing Jonah Probell, Director of Multimedia Solutions, ARC International.

Understanding a Problem in Multicore and How to Solve It

SAFER: Stuck-At-Fault Error Recovery for Memories Nak Hee Seong † Dong Hyuk Woo † Vijayalakshmi Srinivasan ‡ Jude A. Rivers ‡ Hsien-Hsin S. Lee † ‡†

Citadel: Efficiently Protecting Stacked Memory From Large Granularity Failures Dec 15 th 2014 MICRO-47 Cambridge UK Prashant Nair - Georgia Tech David.

A Cache-Like Memory Organization for 3D memory systems CAMEO 12/15/2014 MICRO Cambridge, UK Chiachen Chou, Georgia Tech Aamer Jaleel, Intel Moinuddin K.

3D-MAPS: 3D Massively Parallel Processor with Stacked Memory Dae Hyun Kim, Krit Athikulwongse, Michael Healy, Mohammad Hossain, Moongon Jung, et al. Georgia.

Lecture 12: DRAM Basics Today: DRAM terminology and basics, energy innovations.

Power Efficient IP Lookup with Supernode Caching Lu Peng, Wencheng Lu*, and Lide Duan Dept. of Electrical & Computer Engineering Louisiana State University.

1 Lecture 15: DRAM Design Today: DRAM basics, DRAM innovations (Section 5.3)

Restrictive Compression Techniques to Increase Level 1 Cache Capacity Prateek Pujara Aneesh Aggarwal Dept of Electrical and Computer Engineering Binghamton.

1 Lecture 7: Caching in Row-Buffer of DRAM Adapted from “A Permutation-based Page Interleaving Scheme: To Reduce Row-buffer Conflicts and Exploit Data.

1 Coordinated Control of Multiple Prefetchers in Multi-Core Systems Eiman Ebrahimi * Onur Mutlu ‡ Chang Joo Lee * Yale N. Patt * * HPS Research Group The.

1 Lecture 1: Introduction and Memory Systems CS 7810 Course organization:  5 lectures on memory systems  5 lectures on cache coherence and consistency.

1 Towards Scalable and Energy-Efficient Memory System Architectures Rajeev Balasubramonian School of Computing University of Utah.

Xiaodong Wang  Dilip Vasudevan Hsien-Hsin Sean Lee University of College Cork  Georgia Tech Global Built-In Self-Repair for 3D Memories with Redundancy.

ERD and Memory Architectures Paul Franzon Department of Electrical and Computer Engineering

ECE 526 – Network Processing Systems Design Network Processor Architecture and Scalability Chapter 13,14: D. E. Comer.

Hao-Hsuan, Liu IEE5011 –Autumn 2013 Memory Systems 3D DRAM using TSV technology Hao-Hsuan, Liu Department of Electronics Engineering National Chiao Tung.

L i a b l eh kC o m p u t i n gL a b o r a t o r y Yield Enhancement for 3D-Stacked Memory by Redundancy Sharing across Dies Li Jiang, Rong Ye and Qiang.

McPAT: An Integrated Power, Area, and Timing Modeling Framework for Multicore and Manycore Architectures Runjie Zhang Dec.3 S. Li et al. in MICRO’09.

Storage Class Memory Architecture for Energy Efficient Data Centers Bruce Childers, Sangyeun Cho, Rami Melhem, Daniel Mossé, Jun Yang, Youtao Zhang Computer.

A Hardware-based Cache Pollution Filtering Mechanism for Aggressive Prefetches Georgia Institute of Technology Atlanta, GA ICPP, Kaohsiung, Taiwan,

Comparing Memory Systems for Chip Multiprocessors Leverich et al. Computer Systems Laboratory at Stanford Presentation by Sarah Bird.

Timing Channel Protection for a Shared Memory Controller Yao Wang, Andrew Ferraiuolo, G. Edward Suh Feb 17 th 2014.

A Framework for Coarse-Grain Optimizations in the On-Chip Memory Hierarchy Jason Zebchuk, Elham Safi, and Andreas Moshovos

ISLPED’99 International Symposium on Low Power Electronics and Design

ACSAC’04 Choice Predictor for Free Mongkol Ekpanyapong Pinar Korkmaz Hsien-Hsin S. Lee School of Electrical and Computer Engineering Georgia Institute.

Row Buffer Locality Aware Caching Policies for Hybrid Memories HanBin Yoon Justin Meza Rachata Ausavarungnirun Rachael Harding Onur Mutlu.

1 Tuning Garbage Collection in an Embedded Java Environment G. Chen, R. Shetty, M. Kandemir, N. Vijaykrishnan, M. J. Irwin Microsystems Design Lab The.

Srihari Makineni & Ravi Iyer Communications Technology Lab

A Row Buffer Locality-Aware Caching Policy for Hybrid Memories HanBin Yoon Justin Meza Rachata Ausavarungnirun Rachael Harding Onur Mutlu.

NC STATE UNIVERSITY Tapping ZettaRAM™ for Low-Power Memory Systems Center for Embedded Systems Research (CESR) Department of Electrical & Computer Engineering.

© 2000 Morgan Kaufman Overheads for Computers as Components Energy/power optimization  Energy: ability to do work.  Most important in battery-powered.

Yi-Lin, Tu 2013 IEE5011 –Fall 2013 Memory Systems Wide I/O High Bandwidth DRAM Yi-Lin, Tu Department of Electronics Engineering National Chiao Tung University.

Energy Reduction for STT-RAM Using Early Write Termination Ping Zhou, Bo Zhao, Jun Yang, *Youtao Zhang Electrical and Computer Engineering Department *Department.

Energy Efficient D-TLB and Data Cache Using Semantic-Aware Multilateral Partitioning School of Electrical and Computer Engineering Georgia Institute of.

BEAR: Mitigating Bandwidth Bloat in Gigascale DRAM caches

Improving Disk Throughput in Data-Intensive Servers Enrique V. Carrera and Ricardo Bianchini Department of Computer Science Rutgers University.

Trading Cache Hit Rate for Memory Performance Wei Ding, Mahmut Kandemir, Diana Guttman, Adwait Jog, Chita R. Das, Praveen Yedlapalli The Pennsylvania State.

Copyright © 2010 Houman Homayoun Houman Homayoun National Science Foundation Computing Innovation Fellow Department of Computer Science University of California.

1 November 11, 2015 A Massively Parallel, Hybrid Dataflow/von Neumann Architecture Yoav Etsion November 11, 2015.

1 Lecture 2: Memory Energy Topics: energy breakdowns, handling overfetch, LPDRAM, row buffer management, channel energy, refresh energy.

Eager Writeback — A Technique for Improving Bandwidth Utilization

Simultaneous Multi-Layer Access Improving 3D-Stacked Memory Bandwidth at Low Cost Donghyuk Lee, Saugata Ghose, Gennady Pekhimenko, Samira Khan, Onur Mutlu.

1 Lecture 3: Memory Energy and Buffers Topics: Refresh, floorplan, buffers (SMB, FB-DIMM, BOOM), memory blades, HMC.

Contemporary DRAM memories and optimization of their usage Nebojša Milenković and Vladimir Stanković, Faculty of Electronic Engineering, Niš.

L i a b l eh kC o m p u t i n gL a b o r a t o r y Modeling TSV Open Defects in 3D-Stacked DRAM Li Jiang †, Liu Yuxi †, Lian Duan ‡, Yuan Xie ‡, and Qiang.

Chapter 11 System Performance Enhancement. Basic Operation of a Computer l Program is loaded into memory l Instruction is fetched from memory l Operands.

Die Stacking (3D) Microarchitecture Bryan Black, Murali Annavaram, Ned Brekelbaum, John DeVale, Lei Jiang, Gabriel H. Loh1, Don McCauley, Pat Morrow, Donald.

THREE-DIMENSIONAL INTEGRATED CIRCUITS As Integrated circuits become more and more miniaturized, the only direction to go may be … up! Raymond Do Rory NordeenRyan.

Hang Zhang1, Xuhao Chen1, Nong Xiao1,2, Fang Liu1

Seth Pugsley, Jeffrey Jestes,

Parallel-DFTL: A Flash Translation Layer that Exploits Internal Parallelism in Solid State Drives Wei Xie1 , Yong Chen1 and Philip C. Roth2 1. Texas Tech.

‘99 ACM/IEEE International Symposium on Computer Architecture

Energy-Efficient Address Translation

Experiment Evaluation

Row Buffer Locality Aware Caching Policies for Hybrid Memories

Computer System Design Lecture 9

MICRO-50 Swamit Tannu Zachary Myers Douglas Carmean Prashant Nair

CANDY: Enabling Coherent DRAM Caches for Multi-node Systems

Increasing Effective Cache Capacity Through the Use of Critical Words

Presentation transcript:

Dong Hyuk Woo Nak Hee Seong Hsien-Hsin S. Lee Heterogeneous Die Stacking of SRAM Row Cache and 3-D DRAM: An Empirical Design Evaluation Dong Hyuk Woo Nak Hee Seong Hsien-Hsin S. Lee Electrical and Computer Engineering Georgia Tech Intel Labs The 54th IEEE International Midwest Symposium on Circuits and Systems

Modern DRAM Design Challenges Scaling challenge Less capacity Higher leakage Increasing manufacturing cost Energy efficiency pressure Smart phone / tablet Cloud / Exa-scale computing

Future Solutions Heterogeneous stacking [Kawano et al., IEDM 2006] Dedicating a logic layer for I/O circuit Better performance, lower energy consumption Homogeneous stacking [Kang et al., JSSC 2010] Increasing density without scaling the device

New Opportunity for Processor Architects SPACE AVAILABLE (404) 894-9483

SRAM Row Cache?

Motivation An Optimized 3D-Stacked Memory Architecture by Exploiting Excessive, High-Density TSV Bandwidth by D. H. Woo, N. H. Seong, D. L. Lewis and H.-H. S. Lee in IEEE International Symposium on High-Performance Computer Architecture (HPCA-16), 2010.

Row Buffer Conflicts in a Multi-core Address Stream 0x0000 0000 0x0000 0040 Conventional 3D DRAM 0x1000 0000 0x1000 0040 0x0000 0080 0x0000 00c0 Row Buffer Hit rate ~ 50% One entry / bank 3 row misses 3 row hits One cache line

Eliminating Redundant Array Lookup Address Stream 0x0000 0000 0x0000 0040 Heterogeneous SRAM row cache + 3D DRAM 0x1000 0000 0x1000 0040 0x0000 0080 0x0000 00c0 Row cache Hits 2 row misses 4 cache hits One row cache line Row Cache

SRAM Row Cache Stacking Large set-associative SRAM row cache Eliminating redundant DRAM look-ups caused by conflict misses in row buffers High bandwidth, low energy communication through TSVs

Conventional DRAM Bank Structure 2-D transfer is still energy hungry! Large area overhead of TSV TSVs One bank per die Not drawn to scale

Folded, Scalable DRAM Bank Structure Short transfer of large data (a row) Long transfer of small data (a cache line) 64x64TSVs 64x16TSVs One half-row = 4Kb (64x64)

Final Design: SRAM Row Cache + 3-D DRAM One SRAM cache bank per DRAM bank

Performance: Overall Speedup Performance: Row Hit Rate Performance Results Performance: Overall Speedup Performance: Row Hit Rate

Energy: Relative DRAM Lookup Energy Energy Results Energy: Relative DRAM Lookup Energy

Energy Breakdown

Conclusion 3-D stacking  new light for architects SRAM row cache for 3-D DRAM Folded DRAM bank design Optimize 2-D Traffic Significant energy savings

That’s All, Folks! Georgia Tech ECE MARS Lab http://arch.ece.gatech.edu

BACKUP FOILS

Simulation Results