1 Avinash K. Kodi and Randy W. Morris, Jr. Department of Electrical Engineering and Computer Science Ohio University, Athens, OH 45701


1 Design of a Scalable Nanophotonic Interconnect for Future Multicores
Avinash K. Kodi and Randy W. Morris, Jr.
Department of Electrical Engineering and Computer Science, Ohio University, Athens, OH 45701
E-mail: kodi@ohio.edu, rmorris@cs.ohiou.edu
ACM/IEEE Symposium on Architectures for Networking and Communications Systems, Princeton, New Jersey, October 19-20, 2009

2 Talk Outline
- Section I: Motivation & Background
- Section II: PROPEL Architecture
- Section III: E-PROPEL Architecture
- Section IV: Performance Analysis
- Section V: Conclusion

3 Chip Multi-Processor
- Multicores have arrived: future processors will comprise hundreds to thousands of cores
- Intel Tera-FLOPS, 80 cores, 65 nm, 2007 [1]
- SPARC processor, 16 cores, 65 nm, 2008 [2]
- IBM Cell processor, 8 cores, 90 nm, 2004 [3]
[1] Y. Hoskote, S. Vangal, A. Singh, N. Borkar, and S. Borkar, "A 5-GHz mesh interconnect for a teraFLOPS processor," IEEE Micro, pp. 51-61, September/October 2007.
[2] G. Konstadinidis et al., "Architecture and physical implementation of a third generation 65 nm, 16 cores, 32 thread chip-multithreading SPARC processor," IEEE Journal of Solid-State Circuits, no. 1, p. 717, January 2009.
[3] The Cell project at IBM Research, http://www.research.ibm.com/cell/home.html

4 Network-on-Chip (NoC)
[Figure: 4 × 4 mesh of cores, Core (0,0) through Core (3,3), each attached to a router and joined by links. Router detail: processing core, crossbar switch, ports +X, -X, +Y, -Y, route computation (RC), virtual channel (VC), switch allocator (SA), credits in/out]
- Overcomes the problems of scalability and wire delay

5 Power Dissipation
[Figure: tile power, Intel Tera-FLOPS (65 nm) [2]; highlighted share: 28%]
- Recent NSF-sponsored workshop on On-Chip Interconnection Networks [1]: power consumption of NoCs implemented with current techniques exceeds expected needs by a factor of 10
- Potential solutions: nanophotonics, wireless/RF, 3D stacking
[1] J. D. Owens, W. J. Dally, R. Ho, D. N. Jayasimha, S. W. Keckler, and L. S. Peh, "Research Challenges for On-Chip Interconnection Networks," IEEE Micro, vol. 27, no. 5, pp. 96-108, September-October 2007.
[2] Y. Hoskote, "A 5-GHz Mesh Interconnect for a Teraflops Processor," IEEE Computer Society, 2007, pp. 51-61.

6 Why use Nanophotonics?
- CMOS compatible
- Low power (0.1 mW)
- Small footprint (~10 µm)
- High bandwidth (~10 Gbps)
- Low latency (10.45 ps/mm)
1. M. Lipson, "Compact Electro-Optic Modulators on a Silicon Chip," IEEE J. Sel. Top. Quant., Vol. 12, No. 6, Nov.-Dec. 2006, p. 1520-6.
2. M. Lipson, "Guiding, Modulating and Emitting Light on Silicon - Challenges and Opportunities," IEEE Journal of Lightwave Technologies, Vol. 23, No. 12, 12 December 2005 (invited).

7 Optical Interconnect
[Figure: on-chip optical link spanning an optical layer and an electronics layer. Labels: off-chip laser, on-chip modulator, transmission medium, photodetector, TIA, limiting amplifier, buffer chain, driver]
- On-chip modulator: Mach-Zehnder modulator or micro-ring resonator
- Transmission medium: free space or waveguide (polymer or silicon)
- Photodetectors: GaAs, III-V materials, Ge-on-SOI (silicon-on-insulator)

8 Micro-ring Resonators
Resonant wavelength ($\lambda_0$):
$$\lambda_0 \, m = n_{\mathrm{eff}} \cdot 2\pi R$$
where $m$ is an integer, $n_{\mathrm{eff}}$ is the effective refractive index, and $R$ is the radius of the ring resonator.
[Figure: micro-ring resonator with n+ / p+ / n+ doped regions and ring voltage V_R, shown for V_R = V_OFF and V_R = V_ON; ports: Input Port 0, Output Port 0, Output Port 1]
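The resonance condition on this slide, λ0 · m = n_eff · 2πR, can be turned into a small calculation. The sketch below is illustrative only: the function names are hypothetical, and the values n_eff = 2.5 and R = 5 µm are assumptions chosen for the example, not figures from the talk.

```python
import math


def resonant_wavelength(n_eff, radius_m, m):
    """Solve lambda_0 * m = n_eff * 2 * pi * R for lambda_0 (metres)."""
    if m <= 0:
        raise ValueError("mode order m must be a positive integer")
    return n_eff * 2.0 * math.pi * radius_m / m


def mode_orders_in_band(n_eff, radius_m, lam_min_m, lam_max_m):
    """List the integer mode orders whose resonance lies in [lam_min, lam_max]."""
    optical_path = n_eff * 2.0 * math.pi * radius_m
    m_low = math.ceil(optical_path / lam_max_m)
    m_high = math.floor(optical_path / lam_min_m)
    return list(range(m_low, m_high + 1))


# Assumed example values (not from the slides): n_eff = 2.5, R = 5 micrometres.
N_EFF = 2.5
RADIUS_M = 5e-6

for order in mode_orders_in_band(N_EFF, RADIUS_M, 1.5e-6, 1.6e-6):
    lam_nm = resonant_wavelength(N_EFF, RADIUS_M, order) * 1e9
    print(f"m = {order}: lambda_0 = {lam_nm:.1f} nm")
```

Because both n_eff and R enter the product, a small change in either one shifts every resonance, which is how an applied ring voltage can move a ring on or off a given wavelength.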

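9 Electrical Interconnect
[Figure: repeated RC link built from inverters of size s_opt spaced l_opt apart, annotated with r_s, C_0, C_p and the wire parameters R, C]
RC link parameters:
- R = wire resistance per unit length
- C = wire capacitance per unit length
- C_p = inverter output capacitance
- C_0 = inverter input capacitance
- r_s = inverter resistance
- s_opt = inverter size
- l_opt = wire distance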

10 ITRS 2007 Transistor & Link Parameters
Electrical link device parameters for various VLSI technologies:

| Device | 90 nm | 65 nm | 45 nm | 32 nm | 22 nm |
|---|---|---|---|---|---|
| V_dd | 1.2 | 1.1 | 1 | 0.9 | 0.8 |
| f_clk | 3.088 | 4.7 | 5.875 | 7.344 | 9.18 |
| R | 122 | 220 | 312 | 382 | 455 |
| C | 170 | 165 | 160 | 155 | 150 |
| C_p | 1 | 0.9 | 0.8 | 0.712 | 0.544 |
| C_0 | 0.5 | 0.45 | 0.4 | 0.356 | 0.272 |
| R_s | 1890 | 2200 | 3500 | 4700 | 6900 |
| S_opt | 72.5 | 60.5 | 66.9 | 73.1 | 91.4 |
| L_opt | 0.45 | 0.35 | 0.25 | 0.18 | 0.13 |
| I_offn (nA/micron) | 50 | 70 | 100 | 150 | 220 |

I_shortckt (nA/micron): 65, 100
- Increase in wire delay due to the RC constant
- Increase in the I_offn and I_shortckt current parameters
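As a cross-check on the table, the S_opt row appears consistent with the textbook repeater-sizing expression s_opt = sqrt(r_s · C / (R · C_0)). The slides do not state this formula, so treat it as an assumption; the sketch below simply recomputes S_opt from the tabulated R, C, C_0 and R_s values and compares it with the slide's row. The function and variable names are hypothetical.

```python
import math

# Values copied from the slide's table, in the slide's own (unstated) units.
NODES = ["90 nm", "65 nm", "45 nm", "32 nm", "22 nm"]
R_WIRE = [122, 220, 312, 382, 455]           # R: wire resistance per length
C_WIRE = [170, 165, 160, 155, 150]           # C: wire capacitance per length
C_IN = [0.5, 0.45, 0.4, 0.356, 0.272]        # C_0: inverter input capacitance
R_INV = [1890, 2200, 3500, 4700, 6900]       # R_s: inverter resistance
S_OPT_SLIDE = [72.5, 60.5, 66.9, 73.1, 91.4]  # S_opt as printed on the slide


def optimal_repeater_size(r_s, c_wire, r_wire, c_0):
    """Assumed textbook sizing rule: sqrt(r_s * C / (R * C_0)).

    The ratio is dimensionless as long as both resistances share one unit
    and both capacitances share one unit.
    """
    return math.sqrt((r_s * c_wire) / (r_wire * c_0))


S_OPT_COMPUTED = [
    optimal_repeater_size(r_s, c, r, c_0)
    for r_s, c, r, c_0 in zip(R_INV, C_WIRE, R_WIRE, C_IN)
]

for node, computed, printed in zip(NODES, S_OPT_COMPUTED, S_OPT_SLIDE):
    print(f"{node}: computed {computed:.2f}, slide {printed}")
```

The recomputed values agree with the slide to within about 0.1, which suggests the table's S_opt entries were truncated rather than rounded.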

11 Waveguide & Receiver

| Waveguide | Pitch (µm) | Propagation time (ps/mm) | Optical loss (dB/cm) |
|---|---|---|---|
| Si [1] | 5.5 | 10.45 | 1.3 |
| Polymer [1] | 20 | 4.93 | 1.0 |

| Receiver | Power (mW/Gbps) | Area (mm^2) |
|---|---|---|
| Si-CMOS-Amplifier [2] | 1.1 | 0.02625 |
| 80 nm CMOS [3] | 2.5 | 0.0625 |
| SiGe BiCMOS [4] | 24.5 | 1.07 |

[1] N. Kirman et al., "Leveraging Optical Technology in Future Bus-based Chip Multiprocessors," 39th Annual IEEE/ACM International Symposium on Microarchitecture, 2006, Vol. 9, Iss. 13, Dec. 2006, pg. 492 – 50
[2] S. Koester et al., "Ge-on-SOI-Detector/Si-CMOS-Amplifier Receivers for High-Performance Optical-Communication Applications," Journal of Lightwave Technology, Vol. 25, No. 1, January 2007.
[3] C. Kromer et al., "A 100-mW 4×10 Gb/s Transceiver in 80-nm CMOS for High-Density Optical Interconnects," IEEE Journal of Solid-State Circuits, Vol. 40, No. 12, December 2005.
[4] D. Kuchta et al., "120-Gb/s VCSEL-based parallel-optical interconnect and custom 120-Gb/s testing station," Journal of Lightwave Technology, Vol. 22, No. 9, pp. 2200-2212, Sept. 2004.

12 Electrical/Optical Comparison
[Figure: power-delay product at various technology nodes for a 5 mm link]
- Optics becomes more advantageous at 52 nm for global interconnects and at 45 nm for semi-global interconnects

13 Critical Length
[Figure: critical length compared with core-to-core distance]
- Critical length is the link distance at which optical interconnect becomes more advantageous than electrical interconnect

14 Why PROPEL?
- Related work: Corona (ISCA 2008), Circuit-switch (IEEE Transactions 2008), Shared-bus (MICRO 2006)
- Reduce hardware complexity
  - Currently proposed nanophotonic networks use a large number of optical components
- Nanophotonics for communication (links) and electronics for switching
  - No optical arbitration required
  - Balance between cheaper electronics and more costly optics
- Scalable network design

15 Proposed Architecture: PROPEL
[Figure: 4 × 4 grid of tiles, Tile (0,0) through Tile (3,3), each with an L2 cache; an off-chip laser source (labels 0, 1, 2, …) feeds the x-direction and y-direction waveguides]
- Concentration = 4

16 PROPEL's Routing & Wavelength Assignment (x-direction)
[Figure: Tiles (3,0) through (3,3) holding Cores 0-15, four cores per tile, connected by Home Channels 0-3. Wavelength labels on the channels include λ1(0,0), λ2(0,0), λ3(0,0), λ0(1,0), λ2(1,0), λ3(1,0), and the combined groups λ1(0,0) + λ2(0,0) + λ3(0,0) and λ0(1,0) + λ2(1,0) + λ3(1,0)]
Legend: λa(b,c), where a = wavelength, b = destination tile, c = x-direction

17 Communication Example (x-direction)
- Tile (3,3) communicates with Tile (0,0)
[Figure: x-direction hop. Tile (3,3) (Cores 12-15) and Tile (3,0) (Cores 0-3), each with a crossbar (Xbar), driven by the laser; Router (3,0) detail shows a crossbar switch with ports X0, X1, X2, Y0, Y1, Y2 and the L2 cache; Tile (0,0) (Cores 48-51) is the final destination]

18 Communication Example (y-direction)
- Tile (3,3) communicates with Tile (0,0)
[Figure: y-direction hop from Tile (3,0) to Tile (0,0), completing the transfer that started at Tile (3,3)]
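The two communication-example slides show Tile (3,3) reaching Tile (0,0) with an x-direction hop to Tile (3,0) followed by a y-direction hop. A minimal sketch of that two-step route is given below. It assumes tiles are labeled (row, column), that an x-direction hop changes only the column, and that each dimension is covered in a single optical hop; the function name `propel_route` is hypothetical and not taken from the talk.

```python
def propel_route(src, dst):
    """Return the tiles visited from src to dst, x-direction first, then y.

    Tiles are (row, column) tuples. Assumption: one optical hop per
    dimension, with electronic switching at the intermediate tile.
    """
    src_row, src_col = src
    dst_row, dst_col = dst
    path = [src]
    if src_col != dst_col:
        # x-direction hop: stay in the source row, move to the destination column.
        path.append((src_row, dst_col))
    if src_row != dst_row:
        # y-direction hop: stay in the destination column, move to the destination row.
        path.append((dst_row, dst_col))
    return path


# The example from the slides: Tile (3,3) to Tile (0,0) via Tile (3,0).
print(propel_route((3, 3), (0, 0)))
```

Under these assumptions any pair of tiles is at most two optical hops apart, with the intermediate router switching the packet electronically between the x and y channels.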

19 Need for E-PROPEL
- Related work: Corona (ISCA 2008), Processor-DRAM (Hot Interconnects 2008), Firefly (ISCA 2009)
- Issues with the 256-core version of PROPEL: crossbar size (15×15), area (waveguides), power dissipation
- Advantages of E-PROPEL: non-blocking crossbar, multiple roots (fat tree), fewer components than PROPEL

20 E-PROPEL Design
- Combine 4 PROPELs with nanophotonic crossbars
- RE-PROPEL: only the top and bottom tiles are connected
[Figure: Clusters 0-3 joined by non-blocking optical crossbars (Xbar); top and bottom tiles labeled]

21 Crossbar Functionality
4-input, 64-wavelength AWG crossbar. The superscript in parentheses is the input port; the range is the wavelength band.
Inputs:
- Input 0: λ(0)(0-15), λ(0)(16-31), λ(0)(32-47), λ(0)(48-63)
- Input 1: λ(1)(0-15), λ(1)(16-31), λ(1)(32-47), λ(1)(48-63)
- Input 2: λ(2)(0-15), λ(2)(16-31), λ(2)(32-47), λ(2)(48-63)
- Input 3: λ(3)(0-15), λ(3)(16-31), λ(3)(32-47), λ(3)(48-63)
Outputs:
- Output 0: λ(0)(0-15), λ(1)(16-31), λ(2)(32-47), λ(3)(48-63)
- Output 1: λ(1)(0-15), λ(2)(16-31), λ(3)(32-47), λ(0)(48-63)
- Output 2: λ(2)(0-15), λ(3)(16-31), λ(0)(32-47), λ(1)(48-63)
- Output 3: λ(3)(0-15), λ(0)(16-31), λ(1)(32-47), λ(2)(48-63)
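The input/output listing on this slide follows a cyclic pattern: if the four output groups are read in order as Outputs 0 to 3 (an assumption about how the flattened labels line up), a wavelength in band b arriving on input i leaves on output (i - b) mod 4. The sketch below encodes that rule; the names `awg_output_port`, `BAND_WIDTH` and `NUM_PORTS` are hypothetical.

```python
NUM_PORTS = 4      # 4-input, 4-output AWG crossbar
BAND_WIDTH = 16    # 64 wavelengths split into four bands of 16


def awg_output_port(input_port, wavelength):
    """Output port for a wavelength index (0-63) entering on input_port (0-3).

    Assumed cyclic rule read off the slide: output = (input - band) mod 4.
    """
    if not 0 <= input_port < NUM_PORTS:
        raise ValueError("input_port out of range")
    if not 0 <= wavelength < NUM_PORTS * BAND_WIDTH:
        raise ValueError("wavelength index out of range")
    band = wavelength // BAND_WIDTH
    return (input_port - band) % NUM_PORTS


def inputs_seen_at(output_port):
    """For each band, which input port delivers that band to output_port."""
    return [(output_port + band) % NUM_PORTS for band in range(NUM_PORTS)]


for out in range(NUM_PORTS):
    print(f"Output {out} receives bands 0-3 from inputs {inputs_seen_at(out)}")
```

Each output receives every band from a different input, so no two inputs contend for the same wavelength on the same output; the sender selects its destination purely by choosing the band it transmits on.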

22 Nanophotonic Crossbar (single ring)
[Figure: single-ring crossbar with Inputs 0-3 from clusters 0-3 and Outputs 0-3 to clusters 0-3; wavelength bands λ(0-15), λ(16-31), λ(32-47), λ(48-63); Input 0 and Input 1 are both shown on λ(32-47)]

23 Nanophotonic Crossbar (double ring)
[Figure: double-ring crossbar with Inputs 0-3 and Outputs 0-3; wavelength bands λ(0-15), λ(16-31), λ(32-47), λ(48-63)]

24 Performance Evaluation
- Optical & electrical component comparison
- Synthetic traffic
  - Simulated with OPTISIM
  - Uniform, Bit-reversal, Butterfly, Complement, Matrix transpose, Perfect shuffle
- SPLASH-2
  - Traces collected on Simics with GEMS
  - FFT, LU, Radiosity, Ocean, Raytrace, Radix, Water, FMM and Barnes
- Network topologies
  - Electrical: Mesh, Cmesh and Flattened-butterfly
  - Optical: Circuit-switch, Shared-bus and Corona

25 Component Comparison: PROPEL

| Network | Wavelengths | Waveguides | Micro-rings | Photodetectors | Power loss (dB) | Optical area (mm^2) | Electrical area (mm^2) |
|---|---|---|---|---|---|---|---|
| Shared-Bus | 4 | 168 | 2,688 | 1,536 | 37 | 16 | 60 |
| Circuit-Switch | 24 | 64 | 16,576 | 2,016 | 39.2 | 49 | 55 |
| Corona | 64 | 99 | 72,192 | 7,424 | 49.2 | 64.6 | 195 |
| PROPEL | 64 | 32 | 3,072 | 1,536 | 32.1 | 17 | 50 |

- PROPEL is the most cost-effective NoC of those compared

26 Component Comparison: E-PROPEL

| Network | Wavelengths | Waveguides | Micro-rings | Photodetectors | Power loss (dB) | Optical area (mm^2) | Electrical area (mm^2) |
|---|---|---|---|---|---|---|---|
| Corona | 64 | 387 | 1,081,344 | 32,768 | 49 | 337 | 860 |
| PROPEL | 64 | 256 | 28,672 | 14,336 | 44 | 181 | 395 |
| E-PROPEL | 64 | 192 | 19,968 | 9,216 | 42 | 96 | 280 |
| RE-PROPEL | 64 | 160 | 16,128 | 7,680 | 41 | 85 | 240 |
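To make the comparison on this slide easier to read, the sketch below computes the percentage reduction in each component count relative to PROPEL, using only the numbers from the table. The dictionary layout and the helper name `reduction_percent` are illustrative choices, not part of the talk.

```python
# Component counts copied from the E-PROPEL comparison table on the slide.
COMPONENTS = {
    "Corona":    {"waveguides": 387, "micro_rings": 1_081_344, "photodetectors": 32_768},
    "PROPEL":    {"waveguides": 256, "micro_rings": 28_672,    "photodetectors": 14_336},
    "E-PROPEL":  {"waveguides": 192, "micro_rings": 19_968,    "photodetectors": 9_216},
    "RE-PROPEL": {"waveguides": 160, "micro_rings": 16_128,    "photodetectors": 7_680},
}


def reduction_percent(baseline, candidate):
    """Percentage by which candidate is smaller than baseline."""
    return 100.0 * (baseline - candidate) / baseline


baseline = COMPONENTS["PROPEL"]
for network in ("E-PROPEL", "RE-PROPEL"):
    parts = []
    for component, count in COMPONENTS[network].items():
        saved = reduction_percent(baseline[component], count)
        parts.append(f"{component} -{saved:.1f}%")
    print(f"{network} vs PROPEL: " + ", ".join(parts))
```

The same helper can be pointed at the Corona row to compare against that design instead.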

27 Power Dissipation Evaluation
- Buffers: 8.06 mW [1]
- Xbar: 8.66 mW [2]
- Electrical links: 44 mW [3]
- Modulator: 0.1 mW/Gb [4]
- TIA/amplifier: 1.1 mW/Gb [5]
[1, 2] B. Grot, J. Hestness, S. W. Keckler, and O. Mutlu, "Express cube topologies for on-chip interconnects," in Proceedings of the 15th International Symposium on High Performance Computer Architecture, February 2009, pp. 163-174.
[3] Y. Pan, P. Kumar, J. Kim, G. Memik, Y. Zhang, and A. Choudhary, "Firefly: Illuminating future network-on-chip with nanophotonics," in Proceedings of the 36th Annual International Symposium on Computer Architecture, 2009.
[4] Q. Xu, S. Manipatruni, B. Schmidt, J. Shakya, and M. Lipson, "12.5 Gbit/s carrier-injection-based silicon micro-ring silicon modulators," Optics Express: The International Electronic Journal of Optics, vol. 15, no. 2, January 2007.
[5] S. J. Koester, C. L. Schow, L. Schares, and G. Dehlinger, "Ge-on-SOI-detector/Si-CMOS-amplifier receivers for high-performance optical-communication applications," Journal of Lightwave Technology, vol. 25, no. 1, pp. 46-57, January 2007.

28 Uniform Traffic
[Figure: throughput and latency plots]
- Throughput: 25% increase in performance over Mesh
- Over 2× increase in performance over Circuit-switch, Cmesh and Shared-bus

29 Throughput: Synthetic Traffic Traces
- 50% increase over Mesh for bit-reversal, matrix transpose, and perfect shuffle

30 Power Dissipation: Synthetic Traffic
- PROPEL decreases power consumption by a factor of 5

31 SPLASH-2 Speed-up
- PROPEL speeds up LU, Ocean, Radix, Water, FMM and Barnes by a factor of 2
- FFT, Radiosity and Raytrace have a speed-up of about 1.5×

32 SPLASH-2 Power Dissipation
- PROPEL decreases power consumption by a factor of 10

33 E-PROPEL Throughput
- E-PROPEL throughput is similar to PROPEL except for Uniform, Matrix Transpose, and Perfect Shuffle
- RE-PROPEL only slightly decreases performance relative to E-PROPEL
- E-PROPEL improves performance by 2× over Mesh

34 E-PROPEL Power
- E-PROPEL and RE-PROPEL reduce power dissipation by a factor of 3

35 Conclusion
- PROPEL and E-PROPEL are both low-power, high-bandwidth NoCs for future many-core processors
- PROPEL and E-PROPEL use electronics for packet switching and optics for inter-router communication, allowing for a reduction in electrical and optical components
- PROPEL and E-PROPEL outperform well-known network topologies and dissipate less power
- Future work: incorporate an adaptive routing technique to balance the load across the entire network


37 SPLASH-2 Setup

| Application | Benchmark |
|---|---|
| FFT | 16 K particles |
| LU | 512 × 512 particles |
| Radiosity | Largeroom |
| Ocean | 258 × 258 |
| Radix | 1 M integers |
| Water | 512 molecules |
| FMM | 16 K particles |
| Barnes | 16 K particles |

38 Simulation Parameters (Electrical)

| Parameter | Mesh | Cmesh | Flattened-Butterfly |
|---|---|---|---|
| Router size (xbar) | 5×5 | 8×8 | 10×10 |
| VCs (per input) | 4 | 4 | 4 |

- Bisection bandwidth (Tbps): 4.096, 8.192
- Electrical channel rate (Gbps): 256

39 Simulation Parameters (Optical)

| Parameter | Shared-Bus | Circuit-switch | Corona | PROPEL |
|---|---|---|---|---|
| Router size (xbar) | 4×4 | 5×5 | - | 8×8 |
| VCs (per input) | 4 | 4 | 4 | 4 |
| Optical channel rate (Gbps) | 240 | 128 | 2560 | 160 |

- Bisection bandwidth (Tbps): 15.40.5140.965.12 (digits run together in the source)
- Electrical channel rate (Gbps): 64, 256, -

