HPEC 2007, Lexington, MA, 18-20 September 2007. On-Chip Photonic Communications for High Performance Multi-Core Processors. Keren Bergman, Luca Carloni, Columbia University.


Slide 1: On-Chip Photonic Communications for High Performance Multi-Core Processors
Keren Bergman, Luca Carloni (Columbia University); Jeffrey Kash, Yurii Vlasov (IBM Research)
HPEC 2007, Lexington, MA, 18-20 September 2007

Slide 2: Chip MultiProcessors (CMPs)
Examples: Niagara (Sun, 2004), CELL BE (IBM, 2005), Montecito (Intel, 2004), Terascale (Intel, 2007), Barcelona (AMD, 2007).

Slide 3: Networks on Chip (NoC)
- Shared, packet-switched, optimized for communications:
  - Resource efficiency
  - Design simplicity
  - IP reusability
  - High performance
- But... no true relief in power dissipation. (Kolodny, 2005)

Slide 4: Chip MultiProcessors (CMPs)
IBM Cell, Sun Niagara, Intel Montecito, ...
[Figure: IBM Cell]

Slide 5: Why Photonics for CMP NoC?
OPTICS:
- Modulate/receive an ultra-high-bandwidth data stream once per communication event
- Transparency: a broadband switch routes the entire multi-wavelength, high-bandwidth stream
- Low-power, scalable switch fabric
- Off-chip and on-chip can use essentially the same technology
- Off-chip BW = on-chip BW for the same power
ELECTRONICS:
- Buffer, receive, and re-transmit at every switch
- Off-chip is pin-limited and power hungry
Photonics changes the rules for bandwidth-per-Watt.

Slide 6: Recent Advances in Photonic Integration
Examples: Infinera (2005), IBM (2007), Lipson, Cornell (2005), Luxtera (2005), Bowers, UCSB (2006).

Slide 7: 3DI CMP System Concept
- Future CMP system at 22 nm
- Chip size ~625 mm²
- 3D layer stacking used to combine:
  - Multi-core processing plane
  - Several memory planes
  - Photonic NoC
- At 22 nm, scaling will enable 36 multithreaded cores similar to today's Cell
- Estimated on-chip local memory per complex core: ~0.5 GB
[Figure: processor system stack]

Slide 8: Optical NoC: Design Considerations
- Design to exploit optical advantages:
  - Bit-rate transparency: transmission/switching power independent of bandwidth
  - Low loss: power independent of distance
  - Bandwidth: exploit WDM for maximum effective bandwidth across the network; (over)provision maximum bandwidth per port and maximize effective communications bandwidth
  - Seamless optical I/O to external memory at the same bandwidth
- Design must address optical challenges:
  - No optical buffering
  - No optical signal processing
  - Network routing and flow control managed in electronics (distributed vs. central control; electronic control-path provisioning latency)
- Packaging constraints: CMP chip layout, avoid long electronic interfaces, network gateways must be in close proximity on the photonic plane
- Design for photonic building blocks: low switch radix

Slide 9: Photonic On-Chip Network
- Goal: design a NoC for a chip multiprocessor (CMP)
- Electronics:
  - Integration density: abundant buffering and processing
  - Power dissipation grows with data rate
- Photonics:
  - Low loss, large bandwidth, bit-rate transparency
  - Limited processing, no buffers
- Our solution, a hybrid approach with a dual-network design:
  - Data transmission in a photonic network
  - Control in an electronic network
  - Paths reserved before transmission, so no optical buffering is needed
[Figure: grid of processors (P) with photonic gateways (G)]

10 HPEC 2007, Lexington, MA September, 2007 P G P G P G P G P G P G P G P G P G On-Chip Optical Network Architecture Bufferless, Deflection-switch based Cell Core (on processor plane) Gateway to Photonic NoC (on processor and photonic planes) Thin Electrical Control Network (~1% BW, small messages) Photonic NoC Deflection Switch

Slide 11: Building Blocks (1): High-Speed Photonic Modulator
- Ring-resonator structure achieves optical data modulation
- Compact: ~10 µm diameter for high-density integration
- Ultra-low power: ~1 pJ/bit today, scalable to 0.1 pJ/bit
- 12.5 Gb/s demonstrated, extendable to 40 Gb/s
[Figure: ring-resonator modulator, recent 12.5 GHz demonstration]

Slide 12: Building Blocks (2): Broadband Deflection Switch
- Broadband ring-resonator switch
- OFF state: passive waveguide crossover, negligible power
- ON state: carrier injection couples the signal into the ring and switches it; ~0.5 mW

Slide 13: Building Blocks (3): Detector
- Lateral PIN design, direct Ge growth on thin SOI (IBM)
- Low capacitance and dark current
- 20 GHz bandwidth
- Ultra-low power: 0.1 pJ/bit today, scalable to 0.01 pJ/bit

Slide 14: 4x4 Photonic Switch Element
- 4 deflection switches grouped with electronic control
- 4 waveguide pairs as I/O links
- Electronic router: high-speed, simple logic; links optimized for high speed
- Small area (~0.005 mm²)
- Nearly no power consumption in the OFF state

Slide 15: Non-Blocking 4x4 Switch Design
- The original switch is internally blocking
  - Addressed by the routing algorithm in the original design
  - Limited topology choices
- New design:
  - Strictly non-blocking (U-turns not allowed)
  - Same number of rings
  - Negligible additional loss
  - Larger area
[Figure: original and non-blocking 4x4 switch layouts with N/S/E/W ports]

Slide 16: Design of a Strictly Non-Blocking Photonic Mesh
- The non-blocking 4x4 switch enables a non-blocking mesh topology
- The network is strictly non-blocking (derived from a crossbar)
- Link bidirectionality is exploited
- Two gateways inject on each row; two gateways eject on each column
[Figure: processor layout]

17 HPEC 2007, Lexington, MA September, 2007 gw EC -demultiplexer 1 × 2 injection switch E/O modulators drivers electronic control logic receivers -mux N S EW PSE injection/ejection switch N S W gateway electronic pathway electronic control EC network slice Detailed layout

Slide 18: Comparative Power Analysis [DAC '07]
- 6x6 tiled CMP
- Very large bandwidths per core: peak 800 Gb/s, average 512 Gb/s
- Compared designs: electronic on-chip network vs. hybrid photonic on-chip network
- Metric: performance per Watt
[Figure: 6x6 grid of processors with gateways]

Slide 19: Power Analysis Results [DAC '07]
- Electronic NoC:
  - Copper lines are bandwidth-limited
  - Parallelism is used to attain large bandwidth
  - Wide busses and large buffers are power hungry
  - Multiple hops require regeneration
  - NoC power exceeds 100 W (prediction for 22 nm)
- Photonic NoC:
  - Message generation: 2.3 W (assuming 0.11 pJ/bit)
  - Photonic switching: 0.04 W, practically negligible
  - Network control: 0.8 W (scaling down with technology)
  - Total: 3.2 W
  - Optical I/O off-chip to external memory at the same bandwidth for very little additional power

Slide 20: Performance Analysis
- Goal: evaluate the performance-per-Watt advantage of a CMP system with a photonic NoC
- Developed a network simulator using OMNeT++, a modular, open-source, event-driven simulation environment
  - Modules for the photonic building blocks, assembled into the network
  - Multithreaded model for the complex cores
- Evaluate NoC performance under a uniform random traffic distribution
- Evaluate performance-per-Watt gains of the photonic NoC on an FFT application

Slide 21: Multithreaded Complex Core Model
- Each complex core is modeled as a multithreaded processor with many computational threads executing in parallel
- Each thread independently makes communication requests to any core
- Three main blocks:
  - Traffic generator: simulates the threads' data-transfer requests; requests are stored in a back-pressure FIFO queue
  - Scheduler: extracts requests from the FIFO, generates path setup over the electronic interface, re-queues blocked requests, and avoids head-of-line (HoL) blocking
  - Gateway: photonic interface; sends/receives messages and reads/writes data to local memory

Slide 22: Throughput per Core
- Throughput-per-core = ratio of the time a core transmits photonic messages to total simulation time
  - A metric of average path setup time
  - A function of message length and network topology
- Offered load is measured when a core is ready to transmit
- For an uncongested network: throughput-per-core = offered load
- Simulation parameters:
  - 36 multithreaded cores
  - DMA transfers of fixed-size 16 kB messages
  - Line rate = 960 Gb/s; photonic message duration = 134 ns (see the sketch below)
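The quoted 134 ns message duration follows directly from the message size and line rate. A minimal Python sketch of that check; whether "16 kB" means 16,000 or 16,384 bytes is an assumption made here:

```python
# Serialization time of one photonic message from the slide's parameters.
LINE_RATE_BPS = 960e9      # photonic line rate, bits per second
MESSAGE_BYTES = 16_000     # 16 kB DMA transfer (decimal interpretation, assumed)

duration_ns = MESSAGE_BYTES * 8 / LINE_RATE_BPS * 1e9
print(f"Photonic message duration: {duration_ns:.0f} ns")  # ~133 ns, close to the quoted 134 ns
```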

Slide 23: Throughput per Core for the 36-Node Photonic NoC
- Multithreading enables better exploitation of the photonic NoC's high bandwidth: a gain of 26% over a single thread
- The non-blocking mesh has a shorter average path, improving throughput by 13% over the crossbar

Slide 24: FFT Computation Performance
- We consider execution of the Cooley-Tukey FFT algorithm using 32 of the 36 available cores
- First phase: each core processes k = m/M sample elements
  - m = array size of input samples
  - M = number of cores
- After the first phase, log M iterations of a computation step followed by a communication step in which the cores exchange data in a butterfly pattern
- The time to perform the FFT computation depends on the core architecture; the time for data movement is a function of NoC line rate and topology
- Reported results for the FFT on the Cell processor: a 2^24-sample FFT executes in ~43 ms using Bailey's algorithm
- We assume a Cell core with (2x) 256 MB of local-store memory, double precision
- Bailey's algorithm completes the first phase of Cooley-Tukey in 43 ms
- Cooley-Tukey requires 5k log k floating-point operations; each iteration after the first phase takes ~1.8 ms for k = 2^24
- Assuming 960 Gb/s, the CMP with the non-blocking mesh NoC can execute a 2^29-point FFT in 66 ms (see the sketch below)
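A hedged sketch tying these numbers together: the 43 ms first phase, ~1.8 ms per-iteration compute time, and 960 Gb/s line rate come from the slide, while the per-step communication volume (each core exchanging its full 256 MB block) and the neglect of other overheads are assumptions made here:

```python
import math

CORES          = 32
FIRST_PHASE_S  = 43e-3          # Bailey's algorithm on k = 2**24 local samples (slide)
ITER_COMPUTE_S = 1.8e-3         # per butterfly iteration for k = 2**24 (slide)
LINE_RATE_BPS  = 960e9          # photonic line rate (slide)
BLOCK_BYTES    = 256 * 2**20    # assumed data exchanged per core per step (256 MB)

iterations  = int(math.log2(CORES))              # log M = 5 communication steps
comm_step_s = BLOCK_BYTES * 8 / LINE_RATE_BPS    # ~2.2 ms per exchange
total_s = FIRST_PHASE_S + iterations * (ITER_COMPUTE_S + comm_step_s)
print(f"Estimated FFT time: {total_s * 1e3:.0f} ms")  # ~63 ms, in line with the quoted ~66 ms
```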

Slide 25: FFT Computation Power Analysis
- Photonic NoC:
  - The hop between two switches is 2.78 mm, with an average path of 11 hops and 4 switch-element turns
  - With 32 blocks of 256 MB and a line rate of 960 Gb/s, each connection dissipates 105.6 mW at the interfaces and 2 mW in switch turns
  - Total power dissipation: 3.44 W
- Electronic NoC:
  - Assume an equivalent electronic circuit-switched network
  - Power is dissipated only over the length of optimally repeated wire at 22 nm: 0.26 pJ/bit/mm
- Summary: computation time is a function of the line rate, independent of the medium
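The 105.6 mW interface figure is exactly 960 Gb/s at the 0.11 pJ/bit energy from slide 19; a minimal sketch, under that reading, reproducing the 3.44 W total:

```python
CONNECTIONS   = 32            # cores active in the FFT
LINE_RATE_BPS = 960e9
E_INTERFACE   = 0.11e-12      # J/bit for modulation + detection (slide 19)
TURNS         = 4             # switch-element turns per path
P_SWITCH_ON   = 0.5e-3        # W per active ring switch (slide 12)

p_interface = LINE_RATE_BPS * E_INTERFACE          # ~105.6 mW per connection
p_per_conn  = p_interface + TURNS * P_SWITCH_ON    # ~107.6 mW per connection
print(f"Total photonic NoC power: {CONNECTIONS * p_per_conn:.2f} W")  # ~3.44 W
```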

Slide 26: FFT Computation Performance Comparison
[Figure: FFT computation time ratio and power ratio as a function of line rate]

Slide 27: Performance per Watt
- To achieve the same execution time (time ratio = 1), the electronic NoC must operate at the same 960 Gb/s line rate, dissipating 7.6 W per connection, ~70x the photonic power
- The total dissipated power is then ~244 W
- To achieve the same power (power ratio = 1), the electronic NoC must operate at a line rate of 13.5 Gb/s, a reduction of 98.6%
- Execution time then grows to ~1 s, 15x longer than with the photonic NoC
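A hedged check of the 7.6 W per-connection figure: carrying 960 Gb/s over the average 11-hop path at the slide-25 wire energy of 0.26 pJ/bit/mm gives roughly that number. Treating the whole path as optimally repeated wire (no router energy) follows the slide's stated assumption:

```python
LINE_RATE_BPS = 960e9
E_WIRE        = 0.26e-12      # J/bit/mm, optimally repeated wire at 22 nm (slide 25)
HOP_MM        = 2.78          # distance between switches (slide 25)
HOPS          = 11            # average path length (slide 25)
CONNECTIONS   = 32

path_mm = HOPS * HOP_MM                         # ~30.6 mm average path
p_conn  = LINE_RATE_BPS * E_WIRE * path_mm      # ~7.6 W per connection
print(f"Electronic: {p_conn:.1f} W/connection, {CONNECTIONS * p_conn:.0f} W total")
print(f"Ratio vs photonic (~0.108 W/connection): ~{p_conn / 0.1076:.0f}x")  # slide rounds to ~70x
```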

Slide 28: Summary
- Electronic NoCs dissipate prohibitively high power; a technology shift is required
- Remarkable advances in silicon nanophotonics
- Photonic NoCs provide enormous capacity at the dramatically lower power consumption required for future CMPs, both on- and off-chip (optical I/O)
- Performance-per-Watt gains on communication-intensive applications
- CMPs are clearly emerging for power-efficient, high-performance computing
- Future on-chip interconnects must provide large bandwidth to many cores

Power Analysis: Electronic On-Chip Network
Assumptions:
- 6x6 mesh, uniform traffic
- Link length (l): 1.67 mm
- Bus width (w): 168 bits
- Signaling rate (f): 5 GHz
- Injection rate (IR): 0.625
Results:
- Peak bandwidth: BW_PEAK = w x f = 840 Gb/s
- Average bandwidth: BW_AVG = w x f x IR = 525 Gb/s
- Link traversal energy: E_link ≈ 0.34 pJ/bit/mm (estimated for 32 nm)
- Router energy: E_router = 0.83 pJ/bit (estimated for 32 nm)
- Energy per flit-hop: E_flit-hop = (E_link x l + E_router) x w = 235 pJ
- 6x6 mesh has 120 links
- Average link utilization (uniform traffic): U_AVG = 0.75
- Total network power = U_AVG x N_LINKS x E_flit-hop x f = 106 W
[Figure: 6x6 grid of processors with gateways]
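A minimal sketch of the calculation above. The per-mm link energy did not survive the transcript, so 0.34 pJ/bit/mm is back-derived here to match the quoted 235 pJ per flit-hop and should be read as an assumption:

```python
W_BITS   = 168        # bus width, bits
F_HZ     = 5e9        # signaling rate
L_MM     = 1.67       # link length
E_LINK   = 0.34e-12   # J/bit/mm (assumed, back-derived from the 235 pJ figure)
E_ROUTER = 0.83e-12   # J/bit
N_LINKS  = 120        # links in a 6x6 mesh
U_AVG    = 0.75       # average link utilization, uniform traffic

e_flit_hop = (E_LINK * L_MM + E_ROUTER) * W_BITS    # ~235 pJ per flit-hop
p_total    = U_AVG * N_LINKS * e_flit_hop * F_HZ    # ~106 W
print(f"E_flit_hop = {e_flit_hop * 1e12:.0f} pJ, total network power = {p_total:.0f} W")
```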

Power Analysis: (1) Photonic Network
- 6x6 CMP (36 gateways)
- 12x12 photonic mesh
- 960 Gb/s peak bandwidth
- Injection rate: 0.6; average bandwidth: 576 Gb/s
- 4 turns per message
- 86 switches ON (~0.5 mW each)
- Network power: 43 mW
[Figure: 6x6 grid of processors with gateways]

Power Analysis: (2) Photonic Gateways
- Generating/receiving very high bandwidths is costly, comparable to a single electronic link
- But the data are modulated/detected only once, while routing is nearly free

                          Current       Expected scaling
  Modulation              ~1 pJ/bit     0.1 pJ/bit
  Detection               ~0.1 pJ/bit   0.01 pJ/bit
  Total (36 x 576 Gb/s)   23 W          2.3 W
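A minimal sketch multiplying out the table above (36 gateways, each sustaining the 576 Gb/s average bandwidth):

```python
GATEWAYS   = 36
AVG_BW_BPS = 576e9
ENERGIES = {
    "current": (1.0e-12, 0.1e-12),    # (modulation, detection) J/bit
    "scaled":  (0.1e-12, 0.01e-12),
}

for label, (e_mod, e_det) in ENERGIES.items():
    p_total = GATEWAYS * AVG_BW_BPS * (e_mod + e_det)
    print(f"{label}: {p_total:.1f} W")   # ~22.8 W today, ~2.3 W with expected scaling
```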

Power Analysis: (3) Electronic Control Network
- A low-bandwidth electronic NoC carries only control packets; the bulk of the data is transmitted on the photonic network
- Assumptions:
  - 2x path length (overprovisioning)
  - 64 control bits per 2 KB photonic message
  - Carries only 0.8% of the traffic
- P = P_Electronic-NoC x 0.8% = 106 W x 0.8%
- Electronic control layer: ~0.82 W
- Will scale down with electronic technology
[Figure: 6x6 grid of processors with gateways]
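A minimal sketch combining the three contributions into the ~3.2 W photonic-NoC budget quoted on slide 19; the exact rounding of the 0.8% control fraction is an assumption here:

```python
P_ELECTRONIC_NOC = 106.0     # W, full electronic NoC (backup slide above)
CONTROL_FRACTION = 0.008     # control network carries ~0.8% of the traffic

p_control  = P_ELECTRONIC_NOC * CONTROL_FRACTION  # ~0.85 W (slide quotes ~0.82 W)
p_gateways = 2.3                                  # W, 36 x 576 Gb/s at scaled device energies
p_switches = 0.043                                # W, 86 active ring switches at 0.5 mW

total = p_control + p_gateways + p_switches
print(f"Control ~{p_control:.2f} W, total photonic NoC ~{total:.1f} W")  # ~3.2 W
```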