3D Systems with On-Chip DRAM for Enabling Low-Power High-Performance Computing

Presentation transcript:

3D Systems with On-Chip DRAM for Enabling Low-Power High-Performance Computing
Jie Meng, Daniel Rossell, and Ayse K. Coskun
Performance and Energy Aware Computing Lab (PEAC-Lab)
Electrical and Computer Engineering Department, Boston University
HPEC’11 – September 22, 2011

Performance and Energy Aware Computing Laboratory
- Energy and thermal management of manycore systems: scheduling, memory architecture, message passing / shared memory, ...
- 3D stacked architectures: performance modeling, thermal verification, heterogeneous integration (e.g., DRAM stacking), ...
Figure: IBM Zurich & EPFL.

- Green software: software optimization, parallel workloads, scientific & modeling applications, ...
- Energy efficiency and real-time design in cyber-physical systems
Figure: Argonne's Blue Gene/P supercomputer.

Multi-core to Many-core Architectures
Challenges in many-core systems:
- Memory access latency
- Interconnect delay & power
- Yield
- Chip power & temperature
In addition to interconnect delay, the speed gap between the chip and memory grows.
Figures: Intel's 48-core Single-Chip Cloud Computer; Tilera TILEPro 64-core processor.

3D Stacking
- Shorter interconnects: low power and high speed
- Ability to integrate different technologies in a single chip
Figures: Ray Yarema, Fermilab; IMEC; LSM, EPFL.

Energy Efficiency and Temperature
Energy problem:
- High cost: a 10 MW data center spends millions of dollars per year on operational and cooling costs
- Adverse effects on the environment
Temperature-induced challenges: cooling cost, leakage, performance, reliability.
Thermal challenges accelerate in high-performance systems! At the same time, DRAM stacking can boost performance.

Contributions
- A model for estimating memory access latency in 3D systems with on-chip DRAM
- A novel methodology to jointly evaluate performance, power, and temperature of 3D systems (prior 3D work examines performance or temperature only)
- Analysis of 3D multicore systems and comparisons with equivalent 2D systems, demonstrating:
  - Up to 3X improvement in throughput, resulting in up to 76% higher power consumption per core
  - Temperatures remain within safe margins for high-end systems; embedded 3D systems are subject to severe thermal problems
Figure: evaluation flow (performance, power, and thermal models feeding thermal management policies) for 3D systems with on-chip DRAM running parallel applications.

Outline
- System description: 2D baseline vs. 3D target system configurations
- Methodology: performance, power, and thermal modeling; thread allocation policy
- Evaluation: exploring the performance, power, and thermal behavior of the 2D baseline vs. the 3D system with DRAM stacking

Target System
16-core processor; cores based on the cores in the Intel SCC [Howard, ISSCC'10]. Manufactured at 45 nm, with a die area of 128.7 mm2.

Core architecture:
- CPU clock: 1.0 GHz
- Branch predictor: tournament predictor
- Issue width: 2-way out-of-order
- Functional units: 2 IntAlu, 1 IntMult, 1 FPALU, 1 FPMultDiv
- Physical registers: 128 Int, 128 FP
- Instruction queue: 64 entries
- L1 ICache / DCache: 16 KB @ 2 ns (2 cycles)
- L2 caches: 16 private L2 caches; each 4-way set-associative, 64 B blocks, 512 KB @ 5 ns (5 cycles)

3D System with On-chip DRAM 11.7mm 11mm Core L2 System Interface + I/O pad 2-layer 8Gb DRAM (4Gb each layer) core + L2s HeatSink Memory Controller pad 11mm 9 mm 11.7 mm 11.5 mm DRAM Layer 2.4mm Core L2 1.625mm 1.3mm

Memory Access Latency: 2D vs. 3D

Memory controller (MC):
- 2D baseline: 4-cycle controller-to-core delay, 116-cycle queuing delay, 5-cycle MC processing time
- 3D with on-chip DRAM: 50-cycle queuing delay

Main memory:
- 2D baseline: off-chip 1 GB SDRAM, tRAS = 40 ns, tRP = 15 ns, 10 ns chipset request/return
- 3D with on-chip DRAM: on-chip 1 GB SDRAM, tRAS = 30 ns, tRP = 15 ns, no chipset request/return

Memory bus:
- 2D baseline: off-chip memory bus, 200 MHz, 8-byte bus width
- 3D with on-chip DRAM: on-chip memory bus, 2 GHz, 128-byte bus width
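The parameters above can be combined into a back-of-the-envelope latency comparison. This is a hypothetical illustration, not the paper's actual latency model: the assumption that the 3D design keeps the same controller-to-core and MC processing delays, and that the chipset delay is paid once each way, are mine.

```python
import math

CORE_CLK_GHZ = 1.0  # 1 GHz core clock, so 1 core cycle = 1 ns

def bus_transfer_ns(line_bytes, bus_mhz, bus_width_bytes):
    """Time to move one cache line across the memory bus, in ns."""
    beats = math.ceil(line_bytes / bus_width_bytes)
    return beats * 1000.0 / bus_mhz  # one bus cycle = 1000/MHz ns

def access_ns(queue_cyc, mc_cyc, t_ras_ns, t_rp_ns, chipset_ns, bus_ns):
    # queuing + MC delays (core cycles at 1 GHz, i.e. ns) + DRAM row
    # activate/precharge timing + chipset hops + bus transfer
    return (queue_cyc + mc_cyc) / CORE_CLK_GHZ + t_ras_ns + t_rp_ns \
        + chipset_ns + bus_ns

LINE = 64  # bytes; the L2 block size from the target-system slide

# 2D: 116-cycle queue, 4+5-cycle MC, tRAS=40, tRP=15, 10 ns chipset
# each way (assumption), 200 MHz 8-byte bus
lat_2d = access_ns(116, 4 + 5, 40, 15, 2 * 10, bus_transfer_ns(LINE, 200, 8))
# 3D: 50-cycle queue, same MC delays (assumption), tRAS=30, tRP=15,
# no chipset hops, 2 GHz 128-byte bus
lat_3d = access_ns(50, 4 + 5, 30, 15, 0, bus_transfer_ns(LINE, 2000, 128))
print(f"2D: {lat_2d:.1f} ns  3D: {lat_3d:.1f} ns")
```

Under these assumptions the 3D stack cuts a miss from roughly 240 ns to roughly 105 ns, with the queuing delay and the wide on-chip bus accounting for most of the difference.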

Outline
- System description: 2D baseline vs. 3D target system configurations
- Methodology: performance, power, and thermal modeling; thread allocation policy
- Evaluation: exploring the performance, power, and thermal behavior of the 2D baseline vs. the 3D system with DRAM stacking

Performance Model
- Performance metric: application IPC
- Full-system simulator: M5 (gem5) [Binkert, IEEE Micro'06]
- Thread binding in an unmodified Linux 2.6 operating system
- Parallel benchmarks: PARSEC benchmark suite [Bienia, Princeton 2011], sim-large input sets, measured in the region of interest (ROI)

Power Model M5 McPAT Processor power: McPAT simulator [Li, MICRO’ 06] Calibration step to match the average power values of the Intel SCC cores M5 McPAT IPC Cache misses ...... Dynamic power Leakage power L2 cache power: CACTI 5.3 [HPLabs 2008] Dynamic power computed using L2 cache access rate 3D DRAM power: MICRON’s DRAM power calculator [www.micron.com] Takes the memory read and write access rates as inputs

Thermal Model
HotSpot 5.0 [Skadron, ISCA'03]; includes basic 3D features.

Thermal simulation parameters:
- Chip thickness: 0.1 mm
- Silicon thermal conductivity: 100 W/mK
- Silicon specific heat: 1750 kJ/m3K
- Sampling interval: 0.01 s
- Spreader thickness: 1 mm
- Spreader thermal conductivity: 400 W/mK
- DRAM thickness: 0.02 mm
- DRAM thermal conductivity:
- Interface material thickness:
- Interface material conductivity: 4 W/mK

Thermal Model (Cont'd)
We consider two additional packages representing smaller-size and lower-cost embedded packages.
Tool flow: M5 statistics (IPC, cache misses, ...) drive McPAT; the resulting dynamic and leakage power feed HotSpot, which outputs temperature.

Heat sink parameters for the three packages:
- High Performance: 6.9 mm thickness, 0.1 K/W resistance
- No Heatsink (Embedded A): 10 μm thickness
- Medium Cost (Embedded B): 1.0 K/W resistance
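A zeroth-order estimate shows why the package resistances above matter: in steady state, temperature rises roughly as T = T_ambient + P_total * R_thermal. This sketch is not from the paper; the total chip power and ambient temperature are assumed values for illustration.

```python
# First-order steady-state temperature estimate from junction-to-ambient
# thermal resistance. Real behavior (heat spreading, per-unit power
# density, transients) is what HotSpot models in detail.
def steady_state_temp(p_watts, r_k_per_w, t_ambient_c=45.0):
    return t_ambient_c + p_watts * r_k_per_w

chip_power = 40.0  # hypothetical total chip power (W)
print(steady_state_temp(chip_power, 0.1))  # high-performance package, 0.1 K/W
print(steady_state_temp(chip_power, 1.0))  # medium-cost package, 1.0 K/W
```

Even this crude model shows a tenfold difference in temperature rise between the high-performance and medium-cost packages at the same power, which is why the embedded packages exhibit the more severe thermal behavior reported later.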

Outline
- System description: 2D baseline vs. 3D target system configurations
- Methodology: performance, power, and thermal modeling; thread allocation policy
- Evaluation: exploring the performance, power, and thermal behavior of the 2D baseline vs. the 3D system with DRAM stacking

Thread Allocation Policy
Based on the balance_location policy [Coskun, SIGMETRICS '09]: it assigns the threads with the highest IPCs to the cores at the coolest locations on the die.
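The core idea can be sketched as a simple greedy pairing; the actual balance_location policy is more involved, so treat this as an assumed minimal version of "hottest threads onto coolest cores".

```python
# Hypothetical sketch: rank threads by IPC (descending) and cores by
# temperature (ascending), then pair them off.
def allocate(thread_ipcs, core_temps):
    """thread_ipcs: {tid: IPC}; core_temps: {core_id: temp in C}.
    Returns a {tid: core_id} mapping."""
    busiest_threads = sorted(thread_ipcs, key=thread_ipcs.get, reverse=True)
    coolest_cores = sorted(core_temps, key=core_temps.get)
    return dict(zip(busiest_threads, coolest_cores))

mapping = allocate({"t0": 0.9, "t1": 1.4, "t2": 0.5},
                   {0: 62.0, 1: 55.0, 2: 70.0})
print(mapping)  # highest-IPC thread t1 lands on coolest core 1
```

Placing the highest-activity (and therefore highest-power) threads at the coolest die locations evens out the temperature distribution instead of concentrating heat in one region.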

Performance Evaluation
3D DRAM stacking achieves an average IPC improvement of 72.55% compared to 2D.

Temporal Performance Behavior
streamcluster and fluidanimate improve their average IPC by 211.8% and 49.8%, respectively.

Power Evaluation
Per-core power increases by 16.6% on average for the 3D system.
Figure: power of the 3D system and the 2D baseline running PARSEC benchmarks.

DRAM Power and Temperature
DRAM power changes following the variations in memory access rate: it stays low during phases of low memory access and rises during phases of high memory access (e.g., parallel accesses).
Figure: DRAM layer power and temperature traces for the dedup benchmark.

Temperature Analysis
Temperature decreases because the lower-power DRAM layer shares the heat of the hotter cores.
Figure: peak core temperature for the default high-performance package.

Temperature Analysis (Cont'd)
Temperatures increase more noticeably in 3D systems with small-size and low-cost embedded packages.
Figures: peak temperatures for the small-size embedded package and the low-cost embedded package.

Conclusion
- We provide a comprehensive simulation framework for 3D systems with on-chip DRAM.
- We explore the performance, power, and temperature characteristics of a 3D multi-core system running parallel applications.
- Average IPC increases by 72.6% and average core power by 16.6% compared to the equivalent 2D system.
- Temperature changes in the 3D systems with DRAM stacking are limited with respect to the 2D baseline.
- Future work: detailed DRAM power models, higher-bandwidth memory access, new 3D system architectures, and new thermal/energy management policies.

Collaborators: EPFL, Switzerland; IBM; Oracle; Intel; Brown University; University of Bologna, Italy
Funding: DAC Richard Newton Award; Dean's Catalyst Award, BU; Oracle; VMware
Contact: http://www.bu.edu/peaclab | http://people.bu.edu/acoskun | acoskun@bu.edu