ESE532: System-on-a-Chip Architecture

Presentation transcript:

ESE532: System-on-a-Chip Architecture Day 19: March 29, 2017 Network-on-a-Chip (NoC) Penn ESE532 Spring 2017 -- DeHon

Today: Ring. 2D Mesh Networks. Design issues: buffering and deflection; dynamic and static routing.

Message: Scalable interconnect for locality has a rich design space. Customize to compute and application. Support real-time with statically scheduled communication.

Interconnect (Day 8): Will need an infrastructure for programmable connections. Rich design space to tune area-bandwidth-locality. Will explore more later in course.

Interconnect Concerns: Avoid being a bottleneck (bandwidth, latency). Competes for area and energy against compute and memory.

Crossbar: Connect any I inputs, O outputs. Area ~ I×O. For N PEs, scales as N^2.

Today’s SoC: Large. At 1 mm^2 per A9, can put 100 on a 1 cm^2 chip. 120-core MIPS on Stratix V FPGA [FPGA 2017]. 1680-core RISC-V on Xilinx UltraScale: http://fpga.org/2017/01/12/grvi-phalanx-joins-the-kilocore-club/ Scaling to 100s and 1000s of processing elements (PEs) that need interconnect.

Locality: Delay and energy are proportional to distance. Want to keep communications short: data near compute, and from compute block to compute block. How do we build the network? Scalable (Area ~ N = things connected?). Supports locality.

Day 8 Mesh

Bus to Ring

Ring

Preclass 1: Traffic pattern. Similar bandwidth? One has higher bandwidth?

Bidirectional Ring

Interleaved Layout: What problem does this layout solve?

2D Layout

Scaling: How does area scale with N? How does neighbor distance scale with N? How does worst-case distance in ring scale with N (unidirectional, bidirectional)?
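One way to explore the worst-case distance question is to count hops on an N-node ring. The sketch below is an illustration, not part of the slides: it counts switch-to-switch hops rather than physical wire length, and the function names are hypothetical.

```python
def ring_hops(src: int, dst: int, n: int, bidirectional: bool = False) -> int:
    """Hops from node src to node dst on an n-node ring."""
    forward = (dst - src) % n          # hops if we may only travel one way
    if bidirectional:
        return min(forward, n - forward)  # take the shorter way around
    return forward


def worst_case_hops(n: int, bidirectional: bool = False) -> int:
    """Largest hop count from node 0 to any destination (the ring is symmetric)."""
    return max(ring_hops(0, d, n, bidirectional) for d in range(n))
```

Under these assumptions the unidirectional worst case is N-1 hops and the bidirectional worst case is floor(N/2) hops, so both grow linearly with N.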

1D to 2D

Ring Abstract

Row and Column Rings

Mesh as Row & Column Rings

Directional Mesh (Torus)

Mesh Datapath

Bidirectional Mesh

2D Mesh Scaling: How does area scale with N? How does neighbor distance scale with N? How does worst-case distance in mesh scale with N?
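For comparison with the ring, here is a rough hop-count sketch for the mesh built from row and column rings. It assumes N is a perfect square arranged as sqrt(N) by sqrt(N) and counts hops, not wire length; the function name is hypothetical.

```python
import math


def torus_worst_case_hops(n: int, bidirectional: bool = False) -> int:
    """Worst-case hops on a sqrt(n) x sqrt(n) mesh of row and column rings."""
    k = math.isqrt(n)
    if k * k != n:
        raise ValueError("n must be a perfect square")
    # Worst case on one k-node ring, then the same again in the other dimension.
    per_dimension = k // 2 if bidirectional else k - 1
    return 2 * per_dimension
```

For N = 64 this gives 14 hops (directional) or 8 hops (bidirectional), against 63 and 32 for a single ring of the same size: worst-case hops grow as sqrt(N) instead of N.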

Specifying Destination: Simple: add destination address. Ring or mesh wires carry: Valid bit + Address + Payload (Data).
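A minimal sketch of what "Valid bit + Address + Payload" could look like as a bundle of wires. The field order and widths here are assumptions chosen for illustration; the slides do not specify them.

```python
from typing import NamedTuple


class Word(NamedTuple):
    valid: bool
    address: int
    payload: int


def pack(word: Word, addr_bits: int, data_bits: int) -> int:
    """Pack a word as [valid | address | payload], valid in the top bit."""
    assert 0 <= word.address < (1 << addr_bits)
    assert 0 <= word.payload < (1 << data_bits)
    return (int(word.valid) << (addr_bits + data_bits)) \
        | (word.address << data_bits) | word.payload


def unpack(bits: int, addr_bits: int, data_bits: int) -> Word:
    """Inverse of pack()."""
    payload = bits & ((1 << data_bits) - 1)
    address = (bits >> data_bits) & ((1 << addr_bits) - 1)
    valid = bool((bits >> (addr_bits + data_bits)) & 1)
    return Word(valid, address, payload)
```

With a 6-bit address (64 PEs) and a 32-bit payload, each link in this sketch is 1 + 6 + 32 = 39 wires wide.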

Mesh Routing: Route in Y until reach row. Then route in X until reach column. Consume from PE when arrives.
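The Y-then-X rule can be sketched as a path computation. This assumes a directional k by k mesh (torus) in which each hop increments the row or column index modulo k; that direction convention and the function name are hypothetical.

```python
def route_yx(src, dst, k):
    """Switches visited from src to dst, as (row, col) pairs, on a k x k torus."""
    row, col = src
    path = [(row, col)]
    while row != dst[0]:          # route in Y until we reach the destination row
        row = (row + 1) % k
        path.append((row, col))
    while col != dst[1]:          # then route in X until we reach the column
        col = (col + 1) % k
        path.append((row, col))
    return path                   # the last entry is the consuming PE
```

For example, `route_yx((0, 0), (2, 1), 4)` visits (0,0), (1,0), (2,0), (2,1): two hops in Y, then one in X.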

Mesh Routing (does not deal with congestion):
Yout = Yin.valid & row(Yin.address)!=row & Yin + Pin.valid & P
Xout = Xin.valid & column(Xin.address)!=column & Xin + Yin.valid & row(Yin.address)==row & Yin

Mesh Routing:
Yout = Yin.valid & row(Yin.address)!=row & Yin + Pin.valid & P
Xout = Xin.valid & column(Xin.address)!=column & Xin + Yin.valid & row(Yin.address)==row & Yin
Complexity of route function can impact area, cycle time, and route latency.
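The routing equations above can be read as two small multiplexers. The following sketch models one switch in software under a few assumptions: the destination is stored as separate `row` and `col` fields instead of a packed address, and the function raises an error when two inputs want the same output, since this version of the equations does not handle congestion. All names are hypothetical.

```python
from typing import NamedTuple


class Word(NamedTuple):
    valid: bool
    row: int      # destination row, standing in for row(address)
    col: int      # destination column, standing in for column(address)
    payload: int


IDLE = Word(False, 0, 0, 0)


def switch(yin: Word, xin: Word, pin: Word, row: int, col: int):
    """Return (yout, xout) for the switch at (row, col)."""
    y_through = yin.valid and yin.row != row   # keep travelling in Y
    y_turn = yin.valid and yin.row == row      # reached its row, turn into X
    x_through = xin.valid and xin.col != col   # keep travelling in X
    inject = pin.valid                         # PE wants to send on Yout
    if (y_through and inject) or (x_through and y_turn):
        raise RuntimeError("congestion: two inputs want the same output")
    yout = yin if y_through else pin if inject else IDLE
    xout = xin if x_through else yin if y_turn else IDLE
    return yout, xout
```

An Xin word whose column matches is not forwarded on either output; in this sketch that stands for delivery to the local PE.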

Mesh Congestion

Mesh Congestion: What happens when inputs from 2 sides want to travel out the same output? (here Xin, Yin)

Dealing with Congestion: Don’t let it happen (offline/static): schedule to avoid. Online/dynamic: store in place (buffer) or misroute (deflect).

Congestion 1D: For simplicity, we look at congestion in the 1D case (Preclass 2).

Preclass 2a: Complete table; identify uncongested latencies.

Preclass 2b: Cycles from simulation?

Observe: Did have congestion. Ran slower than the single-link case. How we make decisions matters: who gets to route, which is stalled. Best, global decision can be better than local decisions.

Offline vs. Online [Kapre et al., FCCM 200]

Dealing with Congestion: Don’t let it happen (offline/static): schedule to avoid. Online/dynamic: store in place (buffer) or misroute (deflect).

Congestion: Buffer. Store inputs that must wait until path available. Typically store in FIFO buffer. How big do we make the FIFO? FIFO buffers cost space, often more than multiplexers.

Congestion: Buffer. Store inputs that must wait until path available. Typically store in FIFO buffer. How big do we make the FIFO? What if FIFO full?
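One common answer to the full-FIFO question, offered here as an assumption and not as the slide's answer, is backpressure: the buffer refuses the word and the sender holds it and retries. A minimal software sketch with hypothetical names:

```python
from collections import deque


class BoundedFifo:
    """Fixed-depth FIFO that signals backpressure instead of dropping words."""

    def __init__(self, depth: int):
        self.depth = depth
        self.items = deque()

    def can_accept(self) -> bool:
        return len(self.items) < self.depth

    def push(self, word) -> bool:
        """Store word if there is room; False tells the sender to stall and retry."""
        if not self.can_accept():
            return False
        self.items.append(word)
        return True

    def pop(self):
        """Remove and return the oldest word, or None when empty."""
        return self.items.popleft() if self.items else None
```

The depth parameter is the area knob: deeper buffers absorb longer bursts but cost more storage at every switch input.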

Congestion: Deflect. Misroute (deflection routing): send in an available (wrong) direction. Avoids buffers. Requires balance of ins and outs. Can make work on mesh. How much more traffic do we create by misrouting?

Mesh Routing (gives preference to X):
Yout = Yin.valid & row(Yin.address)!=row & Yin + Pin.valid & P + row(Yin.address)==row & column(Xin.address)!=column & Yin
Xout = Xin.valid & column(Xin.address)!=column & Xin + Yin.valid & row(Yin.address)==row & Yin

Mesh Routing:
Yout = Yin.valid & row(Yin.address)!=row & Yin + Pin.valid & P + row(Yin.address)==row & column(Xin.address)!=column & Yin
Xout = Xin.valid & column(Xin.address)!=column & Xin + Yin.valid & row(Yin.address)==row & Yin
Alternates: random selection; preference based on aging (keep track of number of times misrouted).
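The deflection rule with preference to X can be sketched in the same style as the routing equations. Assumptions in this sketch: the destination is kept as separate `row` and `col` fields, in-flight Y traffic takes priority over PE injection (the equations do not state a priority), and the names are hypothetical.

```python
from typing import NamedTuple


class Word(NamedTuple):
    valid: bool
    row: int      # destination row, standing in for row(address)
    col: int      # destination column, standing in for column(address)
    payload: int


IDLE = Word(False, 0, 0, 0)


def deflect_switch(yin: Word, xin: Word, pin: Word, row: int, col: int):
    """Return (yout, xout, injected) for the switch at (row, col)."""
    x_through = xin.valid and xin.col != col
    y_through = yin.valid and yin.row != row
    y_turn = yin.valid and yin.row == row
    deflected = y_turn and x_through        # Yin loses Xout and stays in Y
    xout = xin if x_through else yin if y_turn else IDLE
    y_busy = y_through or deflected
    yout = yin if y_busy else pin if pin.valid else IDLE
    injected = pin.valid and not y_busy     # False means the PE must retry
    return yout, xout, injected
```

A deflected word travels around its column ring and reaches the same switch again later, which is the extra traffic the deflection slide asks about.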

Static Schedule: Store per-cycle instruction for switch. Doesn’t need address header on route. Static, local memories control destination.
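A rough software model of a statically scheduled switch: a small instruction memory says, for each cycle, which input drives each output, so the data words need no address. The dictionary-based instruction format and the names are assumptions made for illustration.

```python
def run_static_switch(schedule, inputs_per_cycle):
    """Apply a repeating per-cycle schedule to a stream of switch inputs.

    schedule: list of instructions; each maps an output name to the input
              name that drives it this cycle, or None to leave it idle.
    inputs_per_cycle: list of dicts mapping input name to payload.
    """
    outputs = []
    for cycle, inputs in enumerate(inputs_per_cycle):
        instruction = schedule[cycle % len(schedule)]  # local instruction memory
        outputs.append({
            out: (inputs[src] if src is not None else None)
            for out, src in instruction.items()
        })
    return outputs
```

For example, a two-entry schedule `[{"xout": "xin", "yout": "pin"}, {"xout": "yin", "yout": None}]` passes X traffic and injects from the PE on even cycles, then turns Y traffic into X on odd cycles.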

Alternate Static Schedule: Control injection cycle from processor so never have conflict. Simple datapath logic to select available data. Needs address header on routed data.

Mesh: Packet switched, 32b. Split-Merge FIFO (bidirectional): 1800 LUTs. Hoplite Deflection (unidirectional): 60 LUTs. Big difference in area costs. Need to look at area and benefits. [Kapre+Gray, FPL 2015]

Deflection Route: What concerns might we have about deflection route?

Buffer vs. Deflection [Kapre+Gray, FPL 2015]

Take 2, they are small [Kapre+Gray, FPL 2015]

Tune Bandwidth: Add channels to tune bandwidth: rings per row, column. Single Hoplite channel ~60 LUTs, two around 120, still << 1800.

Mesh Area Deflection PS/TM [Kapre FCCM 2015]

Static Schedule vs. Deflection [Kapre FCCM 2015]

Static Schedule vs. Deflection Routing: 142K message add20 benchmark. Marathon statically schedule PS [Kapre FCCM 2015]

Mesh Customization

Tuning Down Bandwidth: If need less bandwidth, cluster multiple PEs to share a router.

Simple Bandwidth/Area Control: Width of channels. Like SIMD: all bits going to same destination.

Packets: Simple story is, each “word” routed on mesh is address+payload. Alternately: multiword packet with single address. Share “address” across larger payload. Control width of datapath separate from size of payload. Additional control issues to route packet together and buffer.
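To see why sharing one address across a larger payload helps, the sketch below computes the fraction of transferred bits that are payload. It is a simplified model under stated assumptions: the address is sent once per packet, the valid bit and any extra packet control are ignored, and the function name is hypothetical.

```python
from fractions import Fraction


def payload_fraction(addr_bits: int, payload_bits_per_word: int,
                     words_per_packet: int = 1) -> Fraction:
    """Fraction of transferred bits that are payload, one address per packet."""
    total_payload = payload_bits_per_word * words_per_packet
    return Fraction(total_payload, addr_bits + total_payload)
```

With an 8-bit address and 32-bit words, a single-word packet is 4/5 payload, while a 4-word packet is 16/17 payload. The cost, as the slide notes, is extra control to keep the words of a packet together.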

Customization: Bandwidth (width, clustering, channels). Directional/bidirectional. Online dynamic/offline static. Buffer/deflect. Buffer depth. Route function sophistication.

Large VLIW: Natural to use static network with VLIW clusters. Network routing becomes part of long instruction word. Extreme: one operator per mesh PE. Tune bandwidth by clustering.

Big Ideas: Scalable interconnect for locality has a rich design space. Customize to compute and application. Support real-time with statically scheduled communication.

Admin: Project Design Space Milestone due Friday. Next milestone out by Friday: 4x, area estimate.