SoC Design Methodology for Exascale Computing

Presentation transcript:

SoC Design Methodology for Exascale Computing
Shekhar Borkar, Intel Corp.
June 7, 2015

This research was, in part, funded by the U.S. Government (DOE and DARPA). The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. Government.

Outline
- Exascale motivation & challenges
- Future Exascale straw-man system design: compute, memory, interconnect
- Design challenges and solutions
- Choice of process technology
- SoC challenges

From Giga to Exa, via Tera & Peta
Each 1000x step in system performance has come less from the transistor and more from parallelism:
- Giga to Tera: 32x from the transistor, 32x from parallelism
- Tera to Peta: 8x from the transistor, 128x from parallelism
- Peta to Exa: 1.5x from the transistor, 670x from parallelism
System performance now comes predominantly from parallelism.
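
The arithmetic behind these steps is easy to check; a minimal sketch using the per-step factors from the slide:

```python
# Each ~1000x performance step, split into transistor speedup x parallelism.
steps = {
    "Giga -> Tera": (32, 32),    # 32 * 32   = 1024x
    "Tera -> Peta": (8, 128),    # 8 * 128   = 1024x
    "Peta -> Exa":  (1.5, 670),  # 1.5 * 670 = 1005x
}
for step, (transistor, parallelism) in steps.items():
    print(f"{step}: {transistor} * {parallelism} = {transistor * parallelism:.0f}x")
```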

Where is the Energy Consumed?
A Teraflop system today dissipates roughly 1 KW, bloated with inefficient architectural features; the goal is ~20 W, split roughly as:

Component   Basis                          Today    Goal
Compute     50 pJ per FLOP                 50 W     ~5 W
Memory      0.1 B/FLOP @ 1.5 nJ per Byte   150 W    ~3 W
Com         100 pJ com per FLOP            100 W    ~5 W
Disk        10 TB @ 1 TB/disk @ 10 W       100 W    ~5 W
Overhead    (see below)                    600 W    ~2 W

Overhead: decode and control, address translations, power supply losses, and other inefficient architectural features.
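
The "today" column follows directly from the per-operation energies at a sustained teraflop; a quick check:

```python
# Rough power budget of a 1-Teraflop system, from the per-op energies above.
FLOPS = 1e12                     # sustained FLOP/s

compute = 50e-12 * FLOPS         # 50 pJ/FLOP            -> 50 W
memory  = 0.1 * 1.5e-9 * FLOPS   # 0.1 B/FLOP @ 1.5 nJ/B -> 150 W
com     = 100e-12 * FLOPS        # 100 pJ/FLOP           -> 100 W
disk    = 10 * 10                # 10 drives @ 10 W each -> 100 W
other   = 600                    # decode/control, translations, supply losses

print(f"total ~ {compute + memory + com + disk + other:.0f} W")  # ~1000 W
```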

The UHPC* Challenge: 20 pJ per Operation
At 20 pJ/operation, power scales with performance as:
- 20 mW at Giga
- 2 W at 100 Giga
- 20 W at Tera
- 20 KW at Peta
- 20 MW at Exa
*DARPA Ubiquitous High Performance Computing program
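
The same energy-per-operation arithmetic, in a few lines:

```python
# A fixed 20 pJ/operation budget implies these system powers at each scale.
E_OP = 20e-12  # joules per operation (UHPC target)
for name, ops_per_s in [("Giga", 1e9), ("100 Giga", 1e11), ("Tera", 1e12),
                        ("Peta", 1e15), ("Exa", 1e18)]:
    print(f"{name:>8}: {E_OP * ops_per_s:g} W")
```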

Top Exascale Challenges
- System power & energy
- A new, efficient memory subsystem
- Extreme parallelism, data locality, gentle-slope programmability
- A new execution model comprehending self-awareness with introspection
- Resiliency to provide system reliability
- System efficiency & cost

Exascale HW Design Challenges
1. NTV logic & memory for low energy: break the Vmin barrier; FPUs, logic & latches; register files, SRAM
2. Coping with variations due to NTV: frequency degradation, variation in performance, functionality loss
3. Fine-grain power management: integrated voltage regulators (buck or switched-capacitor), power gating, frequency control
4. Instrumentation for introspection: energy measurement, performance feedback, ambient conditions
5. Hierarchical, heterogeneous interconnect: buses, crossbars, circuit & packet switched
6. Overhaul of the memory subsystem

Exascale SW Challenges
1. Extreme parallelism, O(billion): programming model, data locality, legacy compatibility
2. Programming model & system: data-flow inspired; gentle-slope programming (from productivity to high performance)
3. New execution model: event driven, asynchronous, dynamic scheduling, a runtime system (see the sketch below)
4. Self-awareness: observation based; monitor and continuously adapt; objective-function-based runtime optimization
5. Challenge applications: new algorithms, locality friendly, immunity to tapered BW
6. System-level resiliency research
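
To make item 3 concrete, here is a minimal sketch of an event-driven, dynamically scheduled task runtime in the spirit the slide describes; all names are illustrative and not taken from any particular exascale runtime:

```python
# Minimal event-driven task runtime: a task becomes ready only when all of
# its input events have fired; a dynamic scheduler runs whatever is ready.
from collections import deque

class Task:
    def __init__(self, fn, deps):
        self.fn, self.waiting = fn, set(deps)

class Runtime:
    def __init__(self):
        self.ready = deque()
        self.blocked = {}                     # event name -> tasks waiting

    def submit(self, fn, deps=()):
        task = Task(fn, deps)
        if not task.waiting:
            self.ready.append(task)
        for event in task.waiting:
            self.blocked.setdefault(event, []).append(task)
        return task

    def fire(self, event):                    # e.g. data arrival, completion
        for task in self.blocked.pop(event, []):
            task.waiting.discard(event)
            if not task.waiting:
                self.ready.append(task)

    def run(self):                            # dynamic scheduling loop
        while self.ready:
            self.ready.popleft().fn(self)

rt = Runtime()
rt.submit(lambda rt: print("consumer: runs only after 'data' fires"),
          deps=("data",))
rt.submit(lambda rt: (print("producer: computing"), rt.fire("data")))
rt.run()
```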

Exascale SoC Processor Straw-man
Hierarchy: a simple core (decode, RF, FPMAC, I$, 64KB D$) is replicated into a cluster with a large shared cache/mem; clusters form a block with another large shared cache/mem; blocks plus a memory controller, IO (PCIe), and the interconnect fabric make up the SoC.

SoC targets (5 nm):
  Die size             ~25 mm
  Frequency            >2 GHz
  Cores                ~4096
  FP performance       >16 Tera-Flops
  Power                <100 Watts
  Energy efficiency    4-6 pJ/FLOP, >200 GF/Watt
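
The targets are roughly self-consistent; a quick cross-check (the 2 FLOPs per core per cycle for a fused multiply-accumulate is my assumption, not stated on the slide):

```python
# Cross-check the straw-man targets for internal consistency.
cores, freq_hz = 4096, 2e9
flops_per_cycle = 2                   # assumed: one FPMAC = 2 FLOPs/cycle
perf = cores * freq_hz * flops_per_cycle
print(f"peak: {perf/1e12:.1f} TFLOPS")          # ~16.4 TFLOPS (target: >16)

power_w = 100
print(f"{power_w / perf * 1e12:.1f} pJ/FLOP")   # ~6.1 pJ/FLOP (target: 4-6)
print(f"{perf / 1e9 / power_w:.0f} GF/Watt")    # ~164 GF/Watt at the full
# 100 W cap; the >200 GF/Watt target implies running below the power budget.
```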

Processor Node
- Multiple SoCs per node, connected by the interconnect fabric
- ~1 TB of 3D-stacked LPDDR at 1 TB/s total BW: high performance, low power
- 4 TB of DDRx: high capacity, low cost

Interconnect Switch Straw-man
- Electrical PHY and optical PHY
- Local electrical switch network
- Data buffers, arbitration logic, switching fabric

On-die Data Movement vs Compute
[Figure: energy per compute operation vs on-die interconnect energy per mm across technology generations; compute energy improves ~6X while interconnect energy improves only ~60%]

Having looked at the energy efficiency of compute, let us now focus on communication. The graph on the left compares the energy of a compute operation across successive technology generations with the energy of moving a bit one mm over on-die interconnect. Interconnect energy reduces much more slowly than compute energy, so interconnect-related energy will start to dominate future designs, necessitating restricted data movement. Off-die interconnect energy and bandwidth are shown on the right: energy is reducing and bandwidth is increasing, but probably not at the rate needed to keep up with compute. Note also the big gap between the energy efficiency and bandwidth reported in research papers and what is actually realized in production. We need to pay much more attention to both on-die and off-die communication than in the past.

- Interconnect energy (per mm) reduces more slowly than compute energy
- On-die data movement energy will start to dominate
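
A back-of-the-envelope illustration of the takeaway; the specific energy values here are representative assumptions, not numbers from the slide:

```python
# Illustrative only: representative (assumed) energies for an advanced node.
E_FLOP_PJ = 10           # assumed energy of one double-precision FLOP, pJ
E_WIRE_PJ_PER_MM = 0.1   # assumed on-die wire energy per bit per mm, pJ
BITS_MOVED = 2 * 64      # two 64-bit operands fetched per FLOP

for mm in (1, 5, 20):
    move_pj = BITS_MOVED * E_WIRE_PJ_PER_MM * mm
    print(f"operands moved {mm:2d} mm: {move_pj:6.1f} pJ "
          f"({move_pj / E_FLOP_PJ:.1f}x the FLOP itself)")
```

Even with these rough numbers, hauling operands across the die quickly costs many times the arithmetic itself, which is the slide's point about restricting data movement.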

Interconnect vs Switch Energy
[Figure: repeated-wire delay and energy vs switch delay and energy, and how switch energy improves with scaling]

Interconnect Structures

Structure                   Energy          Distance   Scalability
Shared bus                  1-10 fJ/bit     0-5 mm     Limited
Multi-ported shared memory  10-100 fJ/bit   1-5 mm     Limited
Crossbar switch             0.1-1 pJ/bit    2-10 mm    Moderate
Packet-switched network     1-3 pJ/bit      >5 mm      Scalable to board, cabinet, system

Buses serve short distances and packet-switched networks scale across the system: hierarchy & heterogeneity.
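
A small sketch of the heterogeneity argument: pick the cheapest structure whose reach covers the distance, using the upper-bound figures from the table:

```python
# Pick the cheapest interconnect structure that can span a given distance,
# using the upper-bound reach and energy figures from the table (in fJ/bit).
structures = [  # (name, max reach in mm, energy in fJ/bit)
    ("shared bus",              5,             10),
    ("multi-ported memory",     5,             100),
    ("crossbar switch",         10,            1000),
    ("packet-switched network", float("inf"), 3000),
]

def cheapest(distance_mm):
    reachable = [s for s in structures if distance_mm <= s[1]]
    return min(reachable, key=lambda s: s[2])

for d in (2, 8, 40):
    name, _, fj_per_bit = cheapest(d)
    print(f"{d:2d} mm -> {name} (~{fj_per_bit} fJ/bit)")
```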

Exascale HW System

Bandwidth Tapering
[Figure: bandwidth vs hierarchy level for naïve (4X per level) and severe tapering]
Intelligent BW tapering is necessary.
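
To see why naïve tapering is untenable, here is what a fixed 4X cut per level leaves of the core-level bandwidth; the levels and starting bandwidth are illustrative assumptions:

```python
# Naive tapering: bandwidth is cut 4X at every level of the hierarchy.
bw_gb_s = 1000.0  # core-level bandwidth, GB/s (illustrative)
for level in ("core", "cluster", "block", "SoC", "node", "cabinet"):
    print(f"{level:8}: {bw_gb_s:7.1f} GB/s")
    bw_gb_s /= 4  # five cuts leave ~0.1% of the core-level bandwidth
```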

NTV Operation for Energy Efficiency
- Up to a 10X increase in energy efficiency near threshold
- But leakage power dominates at low voltage
- And variations become worse: die-to-die frequency spread, and across temperature

H. Kaul et al., "A 320mV 56μW 411GOPS/Watt Ultra-Low-Voltage Motion-Estimation Accelerator in 65nm CMOS," ISSCC 2008.
H. Kaul et al., "Near-Threshold Voltage (NTV) Design—Opportunities and Challenges," DAC 2012.
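
A first-order model shows why energy per operation bottoms out just above threshold; the constants below are illustrative assumptions, not fitted to any process:

```python
# First-order NTV model (illustrative normalized constants, not fitted to
# any process): dynamic energy/op ~ C*V^2; frequency follows the alpha-power
# law f ~ K*(V - Vth)^a / V; leakage energy/op ~ V*I_leak/f.
VTH, ALPHA = 0.3, 1.3
C, K, I_LEAK = 1.0, 1.0, 0.02

def energy_per_op(v):
    f = K * (v - VTH) ** ALPHA / v      # slows sharply as V approaches Vth
    return C * v ** 2 + v * I_LEAK / f  # dynamic + leakage per operation

sweep = [0.35 + 0.01 * i for i in range(76)]   # 0.35 V .. 1.10 V
e_min, v_min = min((energy_per_op(v), v) for v in sweep)
print(f"minimum energy/op near {v_min:.2f} V, "
      f"{energy_per_op(1.10) / e_min:.1f}x better than at 1.10 V")
```

The minimum sits just above Vth because dropping voltage further makes the clock so slow that leakage energy per operation takes over, which is exactly the leakage and variation trouble the slide lists.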

NTV Design Considerations (1)
- Register files: replace the conventional dual-ended write RF cell with an NTV-friendly, dual-ended transmission-gate RF cell
- NTV-friendly flip-flop and NTV-friendly vector flip-flop

H. Kaul et al., "Near-Threshold Voltage (NTV) Design—Opportunities and Challenges," DAC 2012.

NTV Design Considerations (2)
- NTV-friendly multiplexer
- Ultra-low-voltage, split-output level shifter
- Two-stage, cascaded split-output level shifter

Integration of IP Blocks
- Integrated voltage regulators with control (buck converter: Vdd, L, Vout) enable DVFS and NTV operation
- Synthesized clock gating (clock enable) and power gating (power enable / power enable#)
- Other traditional IP blocks: accelerators, memory controller, PCIe, etc.
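
As an illustration of what the per-domain DVFS control enabled by integrated regulators might look like, here is a minimal policy sketch; the operating-point table and thresholds are hypothetical, not from the slide or any product:

```python
# Hypothetical per-domain DVFS policy: step the (V, f) operating point up or
# down from utilization and power feedback, staying within a power budget.
OPERATING_POINTS = [  # (volts, GHz), from near-threshold up to nominal
    (0.40, 0.3), (0.55, 0.8), (0.70, 1.4), (0.85, 2.0), (1.00, 2.5),
]

def next_point(idx, utilization, power_w, budget_w):
    if power_w > budget_w or utilization < 0.5:
        return max(idx - 1, 0)                          # step down toward NTV
    if utilization > 0.9 and power_w < 0.8 * budget_w:
        return min(idx + 1, len(OPERATING_POINTS) - 1)  # headroom: step up
    return idx

idx = 2  # start at the middle operating point
for util, pwr in [(0.95, 40), (0.95, 60), (0.30, 70), (0.20, 30)]:
    idx = next_point(idx, util, pwr, budget_w=80)
    volts, ghz = OPERATING_POINTS[idx]
    print(f"util={util:.2f} power={pwr}W -> {volts:.2f} V, {ghz:.1f} GHz")
```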

Choice of Technology
[Figure: 22 nm SoC platform technology transistor options; trading roughly a 30% drop in transistor performance buys about a 100X drop in leakage]

C.-H. Jan et al., "A 22nm SoC Platform Technology Featuring 3-D Tri-Gate and High-k/Metal Gate, Optimized for Ultra-Low-Power, High-Performance and High-Density SoC Applications," IEDM 2012.

SoC Design Challenges
Challenges and potential solutions:
- 1,000s of cores, large die, within-die (WD) & die-to-die (DD) variations: variation tolerance; dynamic, system-level adaptation
- Voltage scaling to NTV: a new standard cell library?
- Data movement dominates: interconnect optimization
- Multi-ported RFs and memories of various sizes: memory and RF compilers
- Integration of custom and off-the-shelf IP blocks: soft IP, not hard IP
- Complexity: formal verification
- Silicon cost: architectural simplicity
- Design for system efficiency: system-level optimization tools
- Process technology: HP vs LP, a judicious choice