Energy-Proportional Photonic Interconnects Nikos Hardavellas – Parallel Architecture Group Northwestern University Team: Y. Demir, P. Yan, S. Song,

Presentation transcript:

Energy-Proportional Photonic Interconnects Nikos Hardavellas – Parallel Architecture Group Northwestern University Team: Y. Demir, P. Yan, S. Song, J. Kim, G. Memik

Technology Scaling Runs Out of Steam
Transistor counts increase exponentially, but…
- Can no longer power the entire chip (voltage, cooling do not scale): Power Wall
- Can no longer feed all cores with data fast enough (package pins do not scale): Bandwidth Wall
- Can no longer keep costs at bay (process variation, defects): Low Yield
Monolithic (single-chip) processor designs are running out of steam too
© Hardavellas 2–4

Galaxy: Optically-Connected Disintegrated Processors [WINDS-2010, ICS-2014]
Physical constraints limit the performance of single-chip designs
- Area, Yield, Power, Bandwidth
Multi-chip designs break free of these limitations
- Processor disintegration
- Macro-chip integration
© Hardavellas 5

Outline
Introduction
➔ Background
Scalable Multi-Chip System Design with Silicon Photonics
- Galaxy Architecture
- Experimental Results
Energy-Proportional Photonic Interconnects
- EcoLaser
- ProLaser
Conclude
© Hardavellas 6

Nanophotonic Components (figure labels: off-chip laser source, coupler, resonant modulators, resonant detectors, Ge-doped waveguide) © Hardavellas 7

Modulation and Detection © Hardavellas wavelengths DWDM 10Gbps per link μm waveguide pitch TB/s/mm bandwidth density

Outline: Scalable Multi-Chip System Design with Silicon Photonics ➔ Galaxy Architecture © Hardavellas 9

Optical Crossbar (figure: Clusters 0–3 connected to Clusters 0–3) © Hardavellas 10

Routing Example (figure labels: optical fiber bundle, waveguide bundle, endpoints A and B) © Hardavellas 11

Galaxy Architecture (5-chiplet example) © Hardavellas 12

Why Fibers?
Traditional alternatives are:
Electrical strips (SerDes) on FR4 board
- Fibers are 10x more efficient: 180 fJ/bit vs. 2.5 pJ/bit for 4''
- Fibers offer 8 TB/s/mm vs. pin interface (<200 GB/s)
Electrical wires on a silicon interposer
- Fibers are 3x more efficient: 180 fJ/bit vs. 0.5 pJ/bit
- Fibers have a reach of several feet, vs. ~4 mm
- Fibers transmit one bit per 4-16 um pitch, vs. ~70 um pitch
SOI waveguides on a silicon wafer
- Fibers are twice as fast: 0.676c vs. 0.286c
- Fibers have negligible optical loss: 0.2 dB/km vs. 0.3 dB/cm
Do not confine the design to a single board, package, or wafer
© Hardavellas 13
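
The comparison on this slide is simple arithmetic, and a minimal sketch can make the ratios explicit. It uses only the figures quoted on the slide, taking 0.676c as the fiber speed and 0.286c as the SOI waveguide speed (the assignment consistent with fibers being the faster medium). The function names and the 0.08 m link length are illustrative assumptions, not part of the presentation.

```python
# Figures quoted on the "Why Fibers?" slide; all names are illustrative.
FIBER_FJ_PER_BIT = 180.0
SERDES_FJ_PER_BIT = 2500.0      # 2.5 pJ/bit over 4 inches of FR4
INTERPOSER_FJ_PER_BIT = 500.0   # 0.5 pJ/bit
FIBER_SPEED_C = 0.676           # propagation speed as a fraction of c
SOI_SPEED_C = 0.286
SPEED_OF_LIGHT_M_PER_S = 299_792_458.0


def energy_advantage(other_fj_per_bit, fiber_fj_per_bit=FIBER_FJ_PER_BIT):
    """How many times less energy per bit the fiber link uses."""
    return other_fj_per_bit / fiber_fj_per_bit


def time_of_flight_ns(length_m, speed_fraction_of_c):
    """Propagation delay of a link of the given length, in nanoseconds."""
    return length_m / (speed_fraction_of_c * SPEED_OF_LIGHT_M_PER_S) * 1e9


print(f"vs SerDes:     {energy_advantage(SERDES_FJ_PER_BIT):.1f}x")
print(f"vs interposer: {energy_advantage(INTERPOSER_FJ_PER_BIT):.1f}x")
# 0.08 m is an assumed link length, chosen only for illustration.
print(f"fiber delay: {time_of_flight_ns(0.08, FIBER_SPEED_C):.2f} ns")
print(f"SOI delay:   {time_of_flight_ns(0.08, SOI_SPEED_C):.2f} ns")
```

The energy ratios come out near 13.9x and 2.8x, which the slide rounds to "10x" and "3x".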

Dense Off-Chip Coupling
Dense optical fiber array [Lee, OSA/OFC/NFOEC 2010]
- Connects a fiber array to an on-chip waveguide array at a chip’s edge
- ~3.8 dB loss, 8 Tbps/mm demonstrated
- Misalignment within -> loss <1 dB
© Hardavellas 14

Outline: Scalable Multi-Chip System Design with Silicon Photonics ➔ Experimental Results © Hardavellas 15

Modeling Infrastructure (figure labels: 3D-stack model; SimFlex sampling, 95% confidence; photonic-layer ring heating) © Hardavellas 16

Impact of Disintegration: Speedup Over Single-Chip © Hardavellas 17 Processor Disintegration with Galaxy: 2–3x speedup M=Concentrated Mesh w/Exp.Links, C=Corona, F=Firefly, G=Galaxy

Impact of Disintegration: Speedup Over “Unlimited” © Hardavellas 18 Galaxy matches the performance of “unlimited” designs M=Concentrated Mesh w/Exp.Links, C=Corona, F=Firefly, G=Galaxy

Macrochip Integration with Galaxy
Fiber Galaxy: 2.5x speedup over Oracle Macrochip (6.8x max)
Galaxy’s lasers each consume 6x less power
© Hardavellas 19

80-core 5-chiplet Galaxy Thermal CFD Modeling
8 cm spacing allows cooling with cheap passive heatsinks
© Hardavellas 20

9-chiplet Dense Array (Oracle Macrochip)
Tight arrangement points to liquid cooling requirement
© Hardavellas 21

9-chiplet Galaxy 3D
Flexible fibers allow “virtual chip” to break free of 2D planar designs
© Hardavellas 22

Galaxy Summary
“Virtual chips” with the performance of unlimited designs
Breaks free of typical physical constraints
- Large aggregate area
- Improved yield (break-even point: 60% yield for photonics)
- Tb/s/mm bandwidth density
- Pushes back power wall
Processor disintegration
- 2.4x – 3.2x avg. speedup (3.4x max)
- 2.4x – 2.8x avg. smaller EDP (7.1x max)
Macrochip integration
- 2.5x speedup over Oracle Macrochip (6.8x max)
- 6x more power efficient links
© Hardavellas 23

Problem 1: High Laser Power
Silicon photonics is emerging as a promising technology for high-bandwidth, low-latency, and energy-efficient communication in many-cores
However, lasers are really power-hungry
- Optical devices induce optical loss (13+ dB is typical)
- WDM-compatible lasers are 5-10% efficient
-> 10-20x higher laser power than required optical output
© Hardavellas 24
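
The two overheads named on this slide can be worked through numerically. The sketch below is an illustration under stated assumptions, not the authors' power model: it treats optical loss as a multiplicative factor of 10^(dB/10) and laser efficiency as a divisor, and the function names are hypothetical. A 5-10% efficient laser alone draws 10-20x its optical output, which matches the slide's range; a 13 dB loss is a further factor of about 20 between the laser output and the detector.

```python
def db_to_linear(loss_db):
    """Convert a loss in dB to a linear power ratio (input / output)."""
    return 10.0 ** (loss_db / 10.0)


def electrical_power_mw(detector_power_mw, loss_db, wall_plug_efficiency):
    """Electrical power a laser must draw so that `detector_power_mw`
    still arrives after `loss_db` of optical loss (simplified model)."""
    optical_output_mw = detector_power_mw * db_to_linear(loss_db)
    return optical_output_mw / wall_plug_efficiency


# Efficiency overhead alone (no optical loss), per 1 mW of optical output.
for efficiency in (0.10, 0.05):
    print(efficiency, electrical_power_mw(1.0, 0.0, efficiency))

# Additional factor contributed by a 13 dB optical loss.
print(round(db_to_linear(13.0), 2))
```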

Problem 2: Laser Power is Wasted
Interconnect may stay idle for long times
- Compute-intensive execution phases of workloads
- 30% server utilization in Google data centers
But the laser always stays on!
- …even during periods of interconnect inactivity
Up to 88% energy waste in real-world workloads
© Hardavellas 25

Solution: Laser Power Gating
Turn the lasers off when the interconnect is idle
Turn the lasers on before the sender transmits
Overlooked until now
- Traditional comb lasers are slow to turn on
New enabling technology: Germanium lasers
- Turn on/off in 1 ns
- On-chip -> simplify design and lower cost
© Hardavellas 26

Outline: Energy-Proportional Photonic Interconnects ➔ EcoLaser © Hardavellas 27

EcoLaser: Adapt Laser to Interconnect Traffic [ISLPED-2014]
First paper on laser power gating
- Power down lasers when not needed
- Relaxed turn-off to facilitate opportunistic senders
Adaptive mechanism to determine stay-on time
- Monitors interconnect activity
Result Highlights
- 24 – 77% energy savings on real workloads
- 1.1 – 2x speedup
- Within 2-6% of a perfect (ideal) scheme
© Hardavellas 28

SWMR Optical Bus (figure: Router 0 (Home), Router 1, …, Router N-2, Router N-1 on a data bus carrying D0, D1, … with a reservation channel carrying R0, R1, …) © Hardavellas

MWSR Optical Bus (figure: Router 0, Router 1, …, Router N-2, Router N-1 (Home) on a data bus carrying D0, D1, … with a token stream carrying T0, T1, …) © Hardavellas

EcoLaser Design - MWSR
Laser turn-on request via token stream -> Laser Turn On
© Hardavellas 31

Adaptive Laser Control
The laser stays on for K cycles each time it turns on
Static-K laser control
- K is statically set, stays fixed across time
- We model a range of static schemes
Adaptive laser control
- Approximate ideal value of K at each time interval
- Monitor the laser turn-on signals
- Too many -> increase K -> higher performance
- Too few -> lower K -> higher energy savings
Balance energy savings with interconnect performance
© Hardavellas 32
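
The adaptive rule on this slide (too many turn-on signals: raise K; too few: lower K) can be sketched as a small controller. This is a minimal illustration, not the EcoLaser implementation: the thresholds, the doubling/halving step, the bounds on K, and the class name are all assumptions, since the slide only gives the direction of the adjustment.

```python
from dataclasses import dataclass


@dataclass
class AdaptiveKController:
    """Adjusts the laser stay-on time K once per monitoring interval."""
    k: int = 4               # current stay-on time in cycles (assumed start)
    k_min: int = 1           # assumed lower bound
    k_max: int = 64          # assumed upper bound
    high_threshold: int = 8  # assumed: more turn-ons than this means K is too short
    low_threshold: int = 2   # assumed: fewer turn-ons than this means K is too long

    def update(self, turn_ons_in_interval: int) -> int:
        """Return the K to use in the next interval."""
        if turn_ons_in_interval > self.high_threshold:
            # Laser keeps being woken up: stay on longer (favor performance).
            self.k = min(self.k * 2, self.k_max)
        elif turn_ons_in_interval < self.low_threshold:
            # Laser is rarely needed: turn off sooner (favor energy savings).
            self.k = max(self.k // 2, self.k_min)
        return self.k


controller = AdaptiveKController()
for observed in (10, 12, 5, 0, 0):
    print(observed, controller.update(observed))
```

A static-K scheme corresponds to never calling `update`, so K keeps its initial value.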

Interconnect Performance - MWSR
Static: saturate early (56% throughput for Static-1)
Adaptive: provides max interconnect throughput
© Hardavellas 33

Interconnect Energy - MWSR
Static: fail to capture all energy savings
Adaptive: within 3% of the Perfect scheme
© Hardavellas 34

EcoLaser Speedup – radix-64 MWSR (axis label: Measured Injection Rate)
Higher laser power -> higher performance impact
Adaptive: 2x speedup at 29% laser energy (within 6% of Perfect)
© Hardavellas 35

Impact on Energy × Delay – radix-64 MWSR
Radix-64 impractical to implement without laser control
Adaptive: 3.8x lower EDP, within 7% of Perfect
© Hardavellas 36

EcoLaser Summary
Power down lasers when not needed
- Relaxed turn-off to facilitate opportunistic senders
- Monitor & adapt to interconnect activity
Result Highlights
- 24 – 77% energy savings on real workloads
- 1.1 – 2x speedup
- Within 2-6% of a perfect (ideal) scheme
But
- Complicated token scheme
- Can do much better
Yes, we can improve 2x over this “Perfect” scheme
© Hardavellas 37

Outline: Energy-Proportional Photonic Interconnects ➔ ProLaser © Hardavellas 38

ProLaser: Segregate Data from Control [IEEE Photonics ]
Switch on only the necessary interconnect portion
(figure: two DFB laser banks, each supplying wavelengths λ1, λ2, …, λN; one feeds the Data-Only Bits and the other the Common Bits of the data bus)
© Hardavellas 39

ProLaser: Proactively Switch On Laser
Bloom filters + coherence protocol -> predict accesses
(figure: router with an L2 cache slice and Bloom filter, L2 cache requests and replies, inject/eject ports 1…C, virtual channels VC0–VC2, switch allocator and VC allocator, laser controller and lasers, E/O and O/E conversion, reservation channels RCH 1…N, data channels CH 1…N, and a common channel)
© Hardavellas 40
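
For readers unfamiliar with the data structure named on this slide, the following is a generic Bloom filter sketch. The slide does not say what ProLaser stores in its filters or how the prediction feeds the laser controller, so the choice of cache-block addresses as keys, the sizes, and the `should_prewarm_laser` helper are all hypothetical and only illustrate the "membership test as a cheap predictor" idea.

```python
import hashlib


class BloomFilter:
    """Set-membership summary with no false negatives and some false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for simplicity

    def _positions(self, key):
        # Derive `num_hashes` bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        return all(self.bits[pos] for pos in self._positions(key))


def should_prewarm_laser(cached_blocks, block_address):
    """Hypothetical policy: if the block is probably present, a reply is
    likely to follow, so the laser could be switched on ahead of time."""
    return cached_blocks.might_contain(block_address)


cached = BloomFilter()
for address in (0x1000, 0x2040, 0x3FC0):
    cached.add(address)
print(should_prewarm_laser(cached, 0x2040))
```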

ProLaser: Interconnect Performance
ProLaser: almost perfect saturation; EcoLaser saturates early
© Hardavellas 41

ProLaser: Interconnect Energy
ProLaser saves 49-88% of laser power
ProLaser is ~2x better than EcoLaser; within 2-6% of Perfect
© Hardavellas 42

ProLaser: Performance Impact
60% speedup over No-Ctrl; 40% over flattened butterfly
© Hardavellas 43

Sensitivity to Laser Turn-on Delay
Tolerates high laser delays (7x increase -> 15% penalty)
© Hardavellas 44

Conclusions
Galaxy breaks free of typical physical constraints
- “Virtual chips” with the performance of unlimited designs
- Processor disintegration: 3.2x speedup, 2.8x smaller EDP (7x max)
- Macrochip integration: 6.8x speedup, 6x lower power
- Provides system design flexibility
Adaptive Laser Control
- Makes power-hungry photonic interconnects practical
- Saves 49-88% of the laser energy
- Provides 50-70% speedup
© Hardavellas 45

Thank You! Questions? © Hardavellas 46

TECHNOLOGY BACKUP SLIDES © Hardavellas 47

Chip Power Scaling © Hardavellas 48 Chip power does not scale [Azizi 2010]

Demand for High-Performance Computing Grows
Large Hadron Collider in March ’11: 1.6 PB data (Tier-1)
Large Synoptic Survey Telescope: 30 TB/night
- i.e., 2x Sloan Digital Sky Surveys/night
- Sloan: more data than entire history of astronomy before it
Data grows faster than Moore’s Law
More data -> more computing power to process them
© Hardavellas 49

Voltage Scaling Has Slowed © Hardavellas 50 In last decade: 13x transistors but 30% lower voltage Cannot run all transistors fast enough

Pin Bandwidth Scaling © Hardavellas 51 [TU Berlin] Cannot feed cores with data fast enough to keep them busy

Electrical vs. Photonic Links © Hardavellas 52 [Nitta et al., 2013]

Electrical (SerDes) vs. SOI Waveguides vs. Fibers © Hardavellas

SWMR vs. MWSR Crossbar
Single-Writer Multiple-Reader
- Broadcast bus
- All receivers always read
- On-rings -> optical loss
- High laser power
Multiple-Writer Single-Reader
- Only one receiver reads
- Only one ring is on -> low loss
- Low laser power
- Needs arbitration
© Hardavellas 54

GALAXY BACKUP SLIDES © Hardavellas 55

Single Chiplet Connectivity © Hardavellas 56

Galaxy MWSR Optical Crossbar © Hardavellas 57 MWSR avoids broadcast data bus, but requires arbitration

Token-Based Arbitration © Hardavellas 58 8 cycles on average for token arbitration (5 chiplets)

Modeling Infrastructure (figure labels: 3D-stack model; SimFlex sampling, 95% confidence; photonic-layer ring heating) © Hardavellas 59

Architectural Parameters © Hardavellas 60

Nanophotonic Parameters © Hardavellas 61

Load Latency (uniform random traffic) © Hardavellas 62

Load-Latency Curves © Hardavellas tokens provide optimal buffer depth

Impact of Disintegration: Speedup Over “Unlimited” © Hardavellas 64 Galaxy matches the performance of “unlimited” designs M=Concentrated Mesh w/Exp.Links, C=Corona, F=Firefly, G=Galaxy

Performance Against “Realistic” Designs
Realistic: within power and bandwidth envelopes
Galaxy chiplets within °C -> chiplets run at max speed
Galaxy: 2.4x - 3.2x speedup on average (3.4x max)
Galaxy: 2.4x - 2.8x smaller EDP on average (up to 7.1x smaller)
© Hardavellas 65

Comparison Against Multi-Chip Alternatives © Hardavellas 66

Tapered vs. Optical Proximity Couplers © Hardavellas 67 6x less laser power than Oracle Macrochip with demonstrated couplers

Laser Power Sensitivity to Optical Parameters (panels: Coupler Loss; Off-Ring Loss; Waveguide & Filter Drop Loss; Modulator Insertion Loss)
Highly sensitive to coupler loss, insensitive to other losses
© Hardavellas 68

Sensitivity to Fiber Density
116 mm2 chiplets -> 43 mm along the chip edge
Enough room for μm pitch
fibers: within 3% of max performance
© Hardavellas

Energy-Delay Product © Hardavellas 70 Galaxy: 2.4x-2.8x smaller EDP on average (7.1x max)
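
As a reminder of how the metric on this slide is formed, energy-delay product (EDP) is energy multiplied by execution time, so a design that is faster and no more energy-hungry improves EDP on both factors. The sketch below is a generic illustration; the helper names and the example inputs are assumptions and are not meant to reproduce the Galaxy measurements.

```python
def edp(energy_joules, delay_seconds):
    """Energy-delay product: lower is better."""
    return energy_joules * delay_seconds


def edp_improvement(speedup, energy_ratio):
    """Factor by which EDP shrinks when a design runs `speedup` times faster
    and consumes `energy_ratio` times the baseline energy."""
    return speedup / energy_ratio


# Illustrative inputs only: 2x faster at 80% of the baseline energy.
baseline = edp(energy_joules=10.0, delay_seconds=4.0)
improved = edp(energy_joules=8.0, delay_seconds=2.0)
print(baseline / improved, edp_improvement(2.0, 0.8))
```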

Energy per Instruction © Hardavellas 71 Galaxy: 12-20% lower energy/instruction on average (up to 2.3x less)

9-chiplet Galaxy 2D
Cooling 9 chiplets with passive heatsinks
© Hardavellas 72

ECOLASER BACKUP SLIDES © Hardavellas 73

Laser Power Consumption (figure labels: Modulator Insertion Loss, Off-Ring Loss, Waveguide Loss, Filter Drop Loss, 10x Wall-plug Laser Power) © Hardavellas 74

EcoLaser Design - SWMR
Message in injection buffers -> Laser Turn On
© Hardavellas 75

EcoLaser Token Design
Traditional token provides arbitration only
- 1 bit is sufficient
EcoLaser token needs to
- T: Facilitate arbitration
- L: Indicate light presence on data bus
- S: Provide laser turn-on signal
- Check if the laser is on first, before sending the turn-on signal
- Laser turn-on signal should trail T/L by one cycle
- Denote dedicated slot (to avoid starvation)
(figure labels: T L S T)
© Hardavellas 76
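
The three token fields listed above can be sketched as bit flags, together with the check a writer performs before requesting the laser. This is a simplified illustration and not the FSM from the following slides: the bit positions and the `writer_step` function are assumptions, and the one-cycle trailing of the turn-on signal and the dedicated-slot mechanism are deliberately left out.

```python
from typing import Tuple

# Assumed bit layout for the 3-bit token; the slide names the fields only.
T_BIT = 0b100  # token is free and can be claimed for arbitration
L_BIT = 0b010  # light is present on the data bus
S_BIT = 0b001  # laser turn-on request


def writer_step(token: int, has_data: bool) -> Tuple[int, bool]:
    """Return (updated_token, may_send) for a writer observing `token`."""
    if not has_data:
        return token, False            # nothing to send: pass the token on
    if not token & L_BIT:
        # Laser is off: ask for it to be turned on, but do not send yet.
        return token | S_BIT, False
    if token & T_BIT:
        # Light is present and the token is free: claim it and transmit.
        return token & ~T_BIT, True
    return token, False                # token already claimed by another writer


print(writer_step(T_BIT, has_data=True))          # laser off -> request it
print(writer_step(T_BIT | L_BIT, has_data=True))  # laser on  -> claim and send
```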

EcoLaser 3-bit Token and Laser Controller FSM © Hardavellas 77

EcoLaser Writer Node FSM © Hardavellas 78

MWSR Laser Control Example (animation, slides 79–97; t = 0 to 18)
Figure: a token stream (T0–T7) and a data stream (D0–D7) circulate past routers R0–R3 and the laser source, with bit values shown on individual tokens.
Laser state: Off for t = 0–3, On for t = 4–14, Off for t = 15–18.
© Hardavellas 79–97

EcoLaser Nanophotonic Parameters © Hardavellas 98

Interconnect Performance – radix-16 MWSR
Static: saturate early (56% throughput for Static-1)
Adaptive: provides max interconnect throughput
© Hardavellas 99

Interconnect Energy – radix-16 MWSR
Static: fail to capture all energy savings
Adaptive: within 3% of the Perfect scheme
© Hardavellas 100

Interconnect Performance – radix-16 SWMR 101 © Hardavellas

Interconnect Energy – radix-16 SWMR 102 © Hardavellas

EcoLaser Speedup – radix-16 MWSR (axis label: Measured Injection Rate)
Laser power savings leave more power for cores -> faster
Adaptive: 1.1x speedup at 50% laser energy (within 2% of Perfect)
© Hardavellas 103

EcoLaser Speedup – radix-16 MWSR (axis label: Measured Injection Rate)
Laser power savings leave more power for cores -> faster
Adaptive: 5% speedup at 50% laser energy (within 2% of Perfect)
© Hardavellas 104

EcoLaser Speedup – Radix-16 SWMR © Hardavellas 105

EcoLaser Speedup – radix-64 MWSR (axis label: Measured Injection Rate)
Higher laser power -> higher performance impact
Adaptive: 2x speedup at 29% laser energy (within 6% of Perfect)
© Hardavellas 106

EcoLaser Speedup for Radix-64 MWSR
EcoLaser Power Savings -> ~2x Speedup
© Hardavellas 107

EcoLaser Speedup – Radix-64 SWMR © Hardavellas 108

EcoLaser Speedup for Radix-64 SWMR
EcoLaser Power Savings -> ~2x Speedup
© Hardavellas 109

Impact of Latency Overhead
Static-1 is 19% slower than No-Ctrl on average (30% maximum).
Adaptive saves 45% laser energy and is 4.8% slower than Perfect.
© Hardavellas 110

Impact of Latency Overhead 111 © Hardavellas

Impact of Latency Overhead 112 © Hardavellas

Impact of Latency Overhead 113 © Hardavellas

Energy × Delay – radix-16 MWSR
No-Ctrl: more energy efficient than Static-1, Power_Eq
Adaptive: 13% lower EDP, within 2% of Perfect
© Hardavellas

Energy × Delay – radix-16 SWMR © Hardavellas 115

Impact on Energy × Delay – radix-64 MWSR
Radix-64 impractical to implement without laser control
Adaptive: 3.8x lower EDP, within 7% of Perfect
© Hardavellas 116

Energy × Delay – radix-64 SWMR 117 © Hardavellas

Backup Slides: Why not use Off-Chip Laser?
- Pro: Higher efficiency & off the chip power budget
- Con: Coupler loss and intrinsic loss*
Conclusion: An off-chip laser source might increase the total system power consumption. An on-chip laser source with control is more efficient than off-chip lasers.
Ge-based lasers have been manufactured with a footprint of 1.6 um x 4 mm, and could be smaller.
© Hardavellas 118

Experimental Methodology
CMP Size: 64 cores, 480 mm2
Processing Core: UltraSPARC III ISA, up to 5 GHz, OoO, 4-wide dispatch/retirement, 96-entry ROB
L1 Cache: Split I/D, 64 KB 2-way, 2-cycle load-to-use, 2 ports, 64-byte blocks, 32 MSHRs, 16-entry victim cache
L2 Cache: Shared, 512 KB per core, 16-way, 64-byte blocks, 14-cycle hit, 32 MSHRs, 16-entry victim cache
Memory Controller: One per 4 cores, 1 channel per memory controller, round-robin page interleaving
Main Memory: Optically connected memory [2], 10 ns access
Network: SWMR and MWSR crossbars, radix-16 and bit wide 10GHz, 20-flit-deep buffers, 3-cycle router delay
© Hardavellas 119

Laser Power Consumption
DWDM: 64 (Radix-16), 16 (Radix-64)
WG Loss: 3 dB
Non-Linearity: 1 dB
Modulator Ins.: 0.5 dB
Ring Through: 10.24 dB
Filter Drop: 1.2 dB
Photodetector: 0.1 dB
Total Loss: 16.04 dB
Laser Power: 0.401 mW
Total Laser Power: 20.1 W (Radix-16), 78.1 W (Radix-64)
© Hardavellas 120
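
The loss budget in this table can be checked with a few lines of arithmetic: the individual losses in dB add up to the quoted total, and a total loss in dB converts to a linear factor of 10^(dB/10). The sketch below assumes a photodetector sensitivity of 0.01 mW (10 µW); that value does not appear on the slide and is chosen here only because it yields roughly the listed 0.401 mW laser power. The dictionary keys and function names are likewise illustrative.

```python
# Per-component optical losses from the table, in dB.
LOSSES_DB = {
    "waveguide": 3.0,
    "non_linearity": 1.0,
    "modulator_insertion": 0.5,
    "ring_through": 10.24,
    "filter_drop": 1.2,
    "photodetector": 0.1,
}

# Assumption, not from the slide: detector sensitivity of 10 microwatts.
ASSUMED_SENSITIVITY_MW = 0.01


def total_loss_db(losses_db):
    """Losses in dB add along the optical path."""
    return sum(losses_db.values())


def required_laser_power_mw(sensitivity_mw, loss_db):
    """Optical power the laser must emit so that `sensitivity_mw` survives the loss."""
    return sensitivity_mw * 10.0 ** (loss_db / 10.0)


loss = total_loss_db(LOSSES_DB)
print(f"total loss: {loss:.2f} dB")
print(f"laser power: {required_laser_power_mw(ASSUMED_SENSITIVITY_MW, loss):.3f} mW")
```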

Optical Component Count (columns: Radix-16 count, area | Radix-64 count, area)
DWDM: 64 | 16
WG: 80160 mm mm2
Ring Resonators: 77K, 7.7 mm2 | 1.2 M, 100 mm2
Lasers: mm2 | 19K, 125 mm2
© Hardavellas 121

Workloads
Fmm: Input 128K
Moldyn: 15, 20, 3.2 M
Barnes: Input 64K
Tomcatv: 4096, 10
Appbt: in.24x24x24x8bit
Ocean: 1026, 9600
Em3d: 400K, 2, 15,
© Hardavellas

PROLASER BACKUP SLIDES © Hardavellas 123

Off-Chip Ge-based Laser Source (figure: an off-chip laser die with DFB lasers and a laser switch supplies wavelengths λ1, λ2, …, λN over optical fiber to SOI waveguides on the network-on-chip, feeding the Data-Only Bits and Common Bits of the data bus)

ProLaser – Architectural Parameters © Hardavellas 125

ProLaser – Nanophotonic Parameters © Hardavellas 126