Die Stacking (3D) Microarchitecture -- from Intel Corporation

Presentation transcript:

Die Stacking (3D) Microarchitecture -- from Intel Corporation EEL6935 Paper Presentation Wang, Dexiang

3D Stacking Structure:

3D Stacking Recent Application: Embedded processor systems. Goals: low power, small die area. 3D Architecture Forum, “3D Architectures for Semiconductor Integration and Packaging”: dedicated to transforming 3D design research ideas into products.

3D stacking advantages and challenge:
1. Wire length ↓ → latency ↓
2. Higher inter-die bandwidth
3. Power consumption ↓ (wires consumed 30% of power in previous designs)
Challenge: Thermal density may ↑
Trade-off: Performance, Power, Die Area and Hotspot Temperature

Configuration 1: Memory + Logic Original Baseline

Configuration 1: Memory + Logic (cont.) 3D Variations

Configuration 1: Memory + Logic (cont.)
Simulation Infrastructure (heavy data access): Multi-threaded application → full-system multi-processor simulator / trace generator → trace records (CPU id, memory access address, instruction pointer address, unique identification number of an earlier trace record) → trace-driven multi-processor memory hierarchy simulator → CPMA (cycles per memory access)
Benchmarks: RMS (Recognition, Mining and Synthesis), two-threaded. Two categories: Applications and Kernels. Each benchmark: 1 billion total memory references (2.5 billion executed instructions)
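The trace-driven flow above can be illustrated with a minimal sketch. It assumes a single shared LRU cache and fixed hit/miss latencies; the class names, field names, and latency values are hypothetical and are not taken from the presentation. The dependency field is carried in the record but ignored by this simplified model.

```python
from collections import OrderedDict
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class TraceRecord:
    """One trace record with the four fields listed on the slide."""
    cpu_id: int
    address: int               # memory access address
    ip: int                    # instruction pointer address
    depends_on: Optional[int]  # id of an earlier record (unused in this sketch)


class LruCache:
    """Fully associative LRU cache of fixed-size lines (an assumed organization)."""

    def __init__(self, capacity_lines: int, line_bytes: int = 64) -> None:
        self.capacity_lines = capacity_lines
        self.line_bytes = line_bytes
        self.lines: "OrderedDict[int, None]" = OrderedDict()

    def access(self, address: int) -> bool:
        """Return True on a hit, False on a miss; update LRU state either way."""
        line = address // self.line_bytes
        if line in self.lines:
            self.lines.move_to_end(line)    # mark as most recently used
            return True
        self.lines[line] = None             # fill the line on a miss
        if len(self.lines) > self.capacity_lines:
            self.lines.popitem(last=False)  # evict the least recently used line
        return False


def cpma(trace: Iterable[TraceRecord], cache: LruCache,
         hit_cycles: int = 16, miss_cycles: int = 200) -> float:
    """Cycles per memory access: total access cycles divided by access count.

    hit_cycles and miss_cycles are placeholder latencies, not measured values.
    """
    total_cycles = 0
    accesses = 0
    for record in trace:
        total_cycles += hit_cycles if cache.access(record.address) else miss_cycles
        accesses += 1
    return total_cycles / accesses if accesses else 0.0


# Tiny synthetic trace: two CPUs touching three distinct cache lines.
example_trace = [
    TraceRecord(cpu_id=0, address=0, ip=0x400000, depends_on=None),
    TraceRecord(cpu_id=1, address=64, ip=0x400010, depends_on=None),
    TraceRecord(cpu_id=0, address=0, ip=0x400004, depends_on=0),
    TraceRecord(cpu_id=1, address=128, ip=0x400014, depends_on=1),
    TraceRecord(cpu_id=0, address=64, ip=0x400008, depends_on=2),
]
```

Running `cpma(example_trace, LruCache(capacity_lines=2))` and then repeating with a larger capacity shows the effect a bigger cache has on CPMA for the same trace.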

Configuration 1: Memory + Logic (cont.) Applications Kernels

Configuration 2: Logic + Logic 2D to 3D transform: With a 3D design, the number of pipeline stages can shrink, and power consumption can be reduced by cutting the number of latches, repeaters, and similar structures.

Configuration 2: Logic + Logic (cont.) Simulation Model: Tool: single-threaded microarchitecture performance simulator (developed by the Pentium 4 design team) that models the wire delays due to block interconnections. Benchmarks: over 650 single-threaded benchmark traces (SPECINT, SPECFP, hand-written kernels, multimedia, internet, productivity and workstation applications)

Thermal Model: Chip-generated heat → IHS (integrated heat spreader) / Heat Sink → Forced Convection / Natural Convection
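As a rough illustration of this heat path, a lumped series thermal-resistance estimate can be sketched as follows. This is a much simpler technique than the FEM simulation used in the presentation, and it yields only a single average temperature, not a hotspot map. The resistance values and ambient temperature are hypothetical; only the 92 W total power figure appears in the slides.

```python
from typing import Mapping


def junction_temperature(power_w: float, ambient_c: float,
                         resistances_k_per_w: Mapping[str, float]) -> float:
    """Steady-state die temperature for heat flowing through resistances in series.

    T_junction = T_ambient + P * sum(R_i), with each R_i in K/W.
    """
    return ambient_c + power_w * sum(resistances_k_per_w.values())


# Hypothetical resistances for each stage of the heat path (K/W).
heat_path = {
    "die_to_ihs": 0.10,       # die and thermal interface material to heat spreader
    "ihs_to_heat_sink": 0.05,
    "heat_sink_to_air": 0.25, # forced convection; natural convection would be higher
}

estimate_c = junction_temperature(power_w=92.0, ambient_c=40.0,
                                  resistances_k_per_w=heat_path)
```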

Thermal Formula: Energy equation (based on Fourier’s law): T is a function of space (x, y, z) and time (t). Boundary condition:
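The equations on this slide are not reproduced in the transcript. A standard form of the transient heat-conduction energy equation that follows from Fourier's law, with a convective boundary condition, is sketched below. The symbols (density \rho, specific heat c_p, thermal conductivity k, volumetric heat generation Q, heat transfer coefficient h, ambient temperature T_a, outward surface normal n) are assumptions and may differ from the presenter's notation.

```latex
% Energy equation: heat storage = net conduction + internal generation
\rho c_p \frac{\partial T(x,y,z,t)}{\partial t}
  = \nabla \cdot \bigl( k \, \nabla T(x,y,z,t) \bigr) + Q(x,y,z,t)

% Convective boundary condition on the exposed surface
-k \, \frac{\partial T}{\partial n} = h \, \bigl( T - T_a \bigr)
```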

Thermal simulation method: FEM. The Finite Element Method (FEM) is a numerical technique for obtaining approximate solutions to partial differential equations. It partitions the spatial domain into small elements and transforms the differential equations into a large system of linear algebraic equations assembled from the contributions of each element. The boundary and initial conditions supply the spatial and temporal information needed to obtain a unique solution. The smaller the elements, the more accurate the results.
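To make the idea concrete, here is a minimal one-dimensional sketch, assuming steady-state conduction with uniform heat generation and fixed temperatures at both ends. The function names and all parameter values are hypothetical, and the presentation's actual model is three-dimensional. Each linear element contributes a 2x2 stiffness block and a load term to a global linear system, which is then solved directly.

```python
from typing import List


def gauss_solve(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def solve_1d_heat_fem(n_elems: int, length: float, k: float, q: float,
                      t_left: float, t_right: float) -> List[float]:
    """Nodal temperatures for -k * T'' = q on [0, length] with linear elements."""
    n_nodes = n_elems + 1
    h = length / n_elems
    stiffness = [[0.0] * n_nodes for _ in range(n_nodes)]
    load = [0.0] * n_nodes
    for e in range(n_elems):
        i, j = e, e + 1
        ke = k / h  # element conductance
        stiffness[i][i] += ke
        stiffness[i][j] -= ke
        stiffness[j][i] -= ke
        stiffness[j][j] += ke
        load[i] += q * h / 2  # uniform source split between the two nodes
        load[j] += q * h / 2
    # Fixed-temperature boundary conditions: replace the boundary rows.
    for node, value in ((0, t_left), (n_nodes - 1, t_right)):
        stiffness[node] = [0.0] * n_nodes
        stiffness[node][node] = 1.0
        load[node] = value
    return gauss_solve(stiffness, load)


temperatures = solve_1d_heat_fem(n_elems=8, length=1.0, k=2.0, q=10.0,
                                 t_left=300.0, t_right=350.0)
```

For this particular problem the analytic solution is T(x) = 300 + 50x + 2.5x(1 - x), which gives a convenient check on the nodal values.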

Heat Dissipation Sensitivity: Because heat dissipation is more sensitive to the Cu metal layers than to the bonding layer itself, 3D stacking is not a fundamental thermal limitation.

Simulation Parameters for Memory + Logic: Microarchitecture Parameters for Intel Core 2 Duo

Simulation results for Memory + Logic Stacking: On average, increasing the L2 cache from 4 MB to 32 MB reduces the off-die bandwidth requirement by 3x and CPMA by 13%, with a peak CPMA reduction of 50%.

Simulation results for Memory + Logic Stacking: (cont.) Thermal Analysis: Total power: 92 W. The FP units, reservation stations (RS), and load/store units consume the most power.

Simulation results for Memory + Logic Stacking: (cont.) 3D-stacked DRAM consumes less power than DDR3 because the die-to-die interconnect uses far less power than traditional off-die I/O.

Simulation results for Memory + Logic Stacking: (cont.)

Simulation results for Logic + Logic Stacking: On average, 25% of all pipe stages are eliminated, resulting in a 15% performance improvement.

Simulation results for Logic + Logic Stacking: (cont.) The risk of 3D stacking: careful block placement and iterative optimization lead to 1.3x power density (14 °C temperature increase); the worst case leads to 2.0x power density (26 °C temperature increase).

Trade-off between performance and temperature: We can trade performance for better thermal conditions:
15% performance gain / 15% power reduction → 14 °C temperature increase
8% performance gain / 34% power reduction → no peak temperature change
0% performance gain / 54% power reduction → 22 °C temperature reduction

Questions?