
1 ECE669: Lecture 24 aSoC: A Scalable On-Chip Communication Architecture Russell Tessier, Jian Liang, Andrew Laffely, and Wayne Burleson University of Massachusetts, Amherst Reconfigurable Computing Group Supported by National Science Foundation Grants CCR-081405 and CCR-9988238

2 ECE669: Lecture 24 Outline
- Design philosophy
- Communication architecture
- Mapping tools / simulation environment
- Benchmark designs
- Experimental results
- Prototype layout

3 ECE669: Lecture 24 Design Goals / Philosophy
- Low-overhead core interface for on-chip streams
- On-chip bus substitute for streaming applications
- Allows for interconnect of heterogeneous cores
  - Differing sizes and clock rates
- Based on static scheduling
  - Support for some dynamic events, run-time reconfiguration
- Development of complete system
  - Architecture, prototype layout, simulator, mapping tools, target applications

4 ECE669: Lecture 24 aSoC Architecture
[Diagram: mesh of heterogeneous tiles (FPGA, uProc, multipliers) joined by point-to-point connections; each tile pairs an IP core with a communication interface linking it to its North, South, East, and West neighbors]

5 ECE669: Lecture 24 Point-to-Point Data Transfer
- Data transfers from tile to tile on each communication cycle
- Schedule repeats based on required communication patterns
[Diagram: tiles A-F passing stream data hop by hop across cycles 1-4 of the repeating schedule]
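The tile-to-tile transfer described above can be sketched as follows. This is an illustrative model, not the aSoC hardware: a static, repeating schedule moves each packet one tile per communication cycle, and the tile names and 4-cycle period are hypothetical.

```python
# schedule[(tile, slot)] names the neighbor a packet at `tile` is
# forwarded to in that slot of the repeating period; otherwise the
# tile holds its data for that cycle.

def step(schedule, period, tile, cycle):
    # forward only if the schedule names a transfer for this tile/slot
    return schedule.get((tile, cycle % period), tile)

def deliver(schedule, period, src, cycles):
    """Follow a packet injected at `src` for `cycles` communication cycles."""
    tile = src
    for c in range(cycles):
        tile = step(schedule, period, tile, c)
    return tile

# Tile A sends to B in slot 0, B to E in slot 1, E to F in slot 2.
SCHEDULE = {("A", 0): "B", ("B", 1): "E", ("E", 2): "F"}
print(deliver(SCHEDULE, 4, "A", 3))   # packet reaches tile F
```

Because the schedule repeats every period, a packet injected one period later follows exactly the same hops.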

6 ECE669: Lecture 24 Core and Communication Interface
[Diagram: IP core connected through a coreport to the interface crossbar; an interconnect memory with PC/decoder logic, an interface controller with flow control, and the communication data memory (CDM) drive the crossbar's North, South, East, and West ports]

7 ECE669: Lecture 24 Communication Interface Overview
- Interconnect memory controls crossbar configuration
  - Programmable state machine
  - Allows multiple streams
- Interface controller manages flow control
  - Supports simple protocol based on single-packet buffering
- Communication data memory (CDM) buffers stream data
  - Single storage location per stream
- Coreport provides storage and synchronization
  - Storage for input and output streams
  - Requires minimal support for synchronization

8 ECE669: Lecture 24 Interface Control Circuitry

9 ECE669: Lecture 24 Data Dependent Stream Control
- Two types of branches
  - Unconditional branch – end of schedule reached
  - Conditional branch – test data value to modify schedule sequence
- Provides minimal support for reconfiguration
- Requires core interface support
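A minimal sketch of a schedule sequencer with the two branch types named above. The instruction encoding ("XFER"/"BR"/"BRC") is hypothetical, and branches are simplified to consume a cycle; only the two branch behaviors follow the slide.

```python
def run(instrs, data_value, max_cycles):
    """Execute schedule instructions; return the crossbar settings issued."""
    pc, issued = 0, []
    for _ in range(max_cycles):
        op = instrs[pc]
        if op[0] == "XFER":    # issue a crossbar configuration this cycle
            issued.append(op[1])
            pc += 1
        elif op[0] == "BR":    # unconditional: end of schedule, wrap around
            pc = op[1]
        elif op[0] == "BRC":   # conditional: test a data value to reroute
            pc = op[1] if data_value else pc + 1
    return issued

sched = [("XFER", "W->core"),  # 0: deliver west input to the core
         ("BRC", 3),           # 1: skip the eastbound transfer if value set
         ("XFER", "core->E"),  # 2: forward core output east
         ("BR", 0)]            # 3: end of schedule, repeat

print(run(sched, 0, 4))   # ['W->core', 'core->E']
print(run(sched, 1, 4))   # ['W->core', 'W->core']
```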

10 ECE669: Lecture 24 Inter-tile Flow Control / Buffer
- Provide minimum amount of storage per stream at each node (1 packet)
- First priority: transfer data from storage
- Send and acknowledge simultaneously
- Can't send same stream on consecutive cycles
[Diagram: single-packet data buffer with full indicator]
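The single-packet buffering rule above can be sketched as a tiny model (an illustration, not the hardware; class and field names are hypothetical): a receiver holds at most one packet per stream, drains the buffered packet first, and back-pressures the sender when full, which is why the same stream cannot be sent on consecutive cycles.

```python
class StreamBuffer:
    """One packet of storage for one stream at one node."""
    def __init__(self):
        self.data, self.full = None, False

    def try_send(self, packet):
        """Upstream tile offers a packet; returns True if accepted."""
        if self.full:
            return False               # back-pressure: sender must retry
        self.data, self.full = packet, True
        return True

    def forward(self):
        """First priority: drain the stored packet, freeing the slot."""
        if not self.full:
            return None
        pkt, self.data, self.full = self.data, None, False
        return pkt

buf = StreamBuffer()
assert buf.try_send("p0")        # accepted into the empty buffer
assert not buf.try_send("p1")    # full: same stream must skip a cycle
assert buf.forward() == "p0"     # drained (acknowledgment implied)
assert buf.try_send("p1")        # next cycle the stream may send again
```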

11 ECE669: Lecture 24 Inter-tile Flow Control
[Diagram: CDM with a valid bit per stream; incoming data from the west is either forwarded directly to the crossbar or captured in the CDM, with read/write addresses and clear signals managed by the flow-control logic]

12 ECE669: Lecture 24 Coreport Interface to Communication
- Data buffer provides synchronization with flow control
- Stream indicators (CPO, CPI) provide access to flow control bits
[Diagram: input and output coreports, each with data and valid bits, connected between the core and the interface crossbar; flow-control bits come from the interconnect memory]

13 ECE669: Lecture 24 Adapting the IP Core
- Multiplier example
- State machine sequencer
[Diagram: state machine loads operands A and B from the input coreport, drives the multiplier core, and writes the product to the output coreport with its valid bit]
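The multiplier wrapper can be sketched as a two-state loader (a software model of the slide's LD A / LD B / MUL sequence; state names and the port representation are hypothetical): operands are accepted only when the input coreport's valid bit is set, and each pair produces one product on the output coreport.

```python
def mul_wrapper(in_port):
    """in_port: iterable of (valid, data) samples from the input coreport.
    Returns the products written to the output coreport."""
    state, a, out = "LD_A", None, []
    for valid, data in in_port:
        if not valid:
            continue                 # stall until the coreport has data
        if state == "LD_A":
            a, state = data, "LD_B"  # latch operand A
        elif state == "LD_B":
            out.append(a * data)     # MUL: write product (valid bit set)
            state = "LD_A"
    return out

# one invalid cycle in the middle simply stalls the state machine
print(mul_wrapper([(1, 3), (0, 0), (1, 4), (1, 5), (1, 6)]))  # [12, 30]
```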

14 ECE669: Lecture 24 Design Mapping Tool Flow
- Support multiple core clock speeds and design formats
- Automate scheduling/routing
- Allow feedback between core characteristics and mapping decisions
- Generate both core and communication programming information
- Lots of room for improvement (StreamIt, HW/SW partitioning, estimators)

15 ECE669: Lecture 24 Design Mapping Tools
[Flow diagram: source is parsed and optimized by a SUIF-based front end into a graph-based intermediate format; basic blocks are partitioned and assigned to cores, streams are assigned and scheduled with inter-core synchronization, and code generation emits R4000 instructions, FPGA bit streams, and communication instructions, with execution-time estimates fed back into mapping]

16 ECE669: Lecture 24 Design Mapping Tool Front End
- Current system isolates computation into basic blocks
  - Stream-oriented front-end (e.g. StreamIt) more appropriate
- Front-end preprocessing
  - Built on SUIF
  - Performs standard optimizations
- Intermediate form used for subsequent partitioning, placement, and scheduling (routing)
- User interface allows for interaction and feedback

17 ECE669: Lecture 24 Partitioning and Assignment
- Clustering used to collect blocks based on cost function
- Cost function takes both computation and communication into account
  - T_compute = estimated overall compute time
  - T_overlap = estimated overall time of overlapping communication
  - c_total = estimated overall communication time
- Swap-based approach used to minimize cost across cores based on performance estimates
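The slide names the cost terms but not how they combine. One plausible combination (an assumption, not the authors' formula) charges compute time plus only the communication that cannot be hidden behind computation:

```python
def partition_cost(t_compute, t_overlap, c_total):
    """Hypothetical cost for one candidate partition.
    Communication overlapped with computation is free; the rest is exposed."""
    exposed_comm = max(0.0, c_total - t_overlap)
    return t_compute + exposed_comm

# 3 of the 5 communication units overlap computation, so 2 are exposed
assert partition_cost(10.0, 3.0, 5.0) == 12.0
```

A swap-based minimizer would evaluate this cost before and after exchanging blocks between two cores and keep the swap only if the cost drops.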

18 ECE669: Lecture 24 Scheduled Routing
- Number and locations of streams known as a result of scheduling
- Stream paths routed as a function of required path bandwidth (channel capacity)
- Basic approach
  - Order nets by Manhattan length
  - Route streams using Prim's algorithm across time slices based on channel cost
  - Determine feasible path for all streams
  - Attempt to "fill in" unused bandwidth in schedule with additional stream transfers

19 ECE669: Lecture 24 Back-end Code Generation
- C code targeted to R4000 cores
  - Subsequently compiled with gcc
- Verilog code for FPGA blocks
  - Synthesized with Synopsys and Altera tools
- Interconnect memory instructions for each interconnect memory
  - Limited by size of interconnect memory

20 ECE669: Lecture 24 Simulation Overview
- Simulation takes place in two phases
  - Core simulator determines computation cycles between interface accesses
  - Cycle-accurate interconnect simulator determines data transfer between cores, taking core delay into account

21 ECE669: Lecture 24 Simulation Environment
[Diagram: AppMapper produces core code (C for the R4000 via SimpleScalar, Verilog for the FPGA via Quartus, C models for MAC and MEM) plus core configurations and CI instructions; core simulation supplies computation delays and communication events to the network simulator, which combines them with topology, core speed, and core location to produce system statistics]

22 ECE669: Lecture 24 Core Simulators
- SimpleScalar (D. Burger/T. Austin – U. Wisconsin)
  - Models R4000-like architecture at the cycle level
  - Breakpoints used to calculate cycle counts between communication network interactions
- Cadence Verilog-XL
  - Used to model 484-LUT FPGA block designs
  - Modeled at RTL and LUT level
- Custom C simulation
  - Cycle counts generated for memory and multiply-accumulate blocks
- Simulators invoked via scripts

23 ECE669: Lecture 24 Interconnect Simulator
- Based on NSIM (MIT NuMesh simulator – C. Metcalf)
- Each tile modeled as a separate process
- Interconnect memory instructions used to control cycle-by-cycle operation
- Core speeds and flow control circuitry modeled accurately
- Adapted for a series of on-chip interconnect architectures (bus-based architectures)

24 ECE669: Lecture 24 Target Architectural Models
- FPGA blocks contain 121 4-LUT clusters
- Custom MAC and 32Kx8 SRAM (Mem) blocks
- Same configurations used for all benchmarks

25 ECE669: Lecture 24 Example: MPEG-2
- Design partitioned across eleven cores
- Other applications: IIR filter, image processing, FFT
[Diagram: MPEG-2 encoder mapped to tiles – input and reference frame buffers (MEM), motion estimation (ME), DCT and IDCT on FPGA blocks, MAC cores MAC0-MAC4, and an R4000 control core exchanging source, reconstructed, DCT-block, and motion-error streams]

26 ECE669: Lecture 24 Core Parameters
- Communication interface, MAC, FPGA, and MEM sizes determined through layout (TSMC 0.18um)
- ** R4000 size from MIPS web page

  Block            Speed   Area (λ²)
  Comm. Interface  2.5 ns  2500 x 3500
  MIPS R4000       5 ns    4.3 x 10^7 **
  MAC              5 ns    1500 x 1000
  FPGA             10 ns   30000 x 30000
  MEM (32Kx8)      5 ns    10000 x 10000

27 ECE669: Lecture 24 Mapping Statistics
- Number of interconnect memory instructions (CI Instruct.) deceptively small
- Likely need to better fold streams in schedule

  Design  No. Cores  No. Streams  Max CI Instruct.  Max Streams Per CI  Max CPort Mem. Depth
  IIR     9          11           2                 5                   5
  IIR     16         20           2                 5                   5
  IMG     9          8            2                 3                   3
  IMG     16         15           4                 4                   4
  FFT     16         25           6                 7                   7
  MPEG    16         19           4                 8                   8

28 ECE669: Lecture 24 Comparison to IBM CoreConnect
- Still work to do on mapping environment to boost aSoC link utilization

                             9-Core Model     16-Core Model
  Execution Time (ms)        IIR     IMG      IIR     IMG     FFT     MPEG
  R4000                      0.049   327.0    0.350   327.0   0.79    152
  CoreConnect                0.012   22.0     0.016   30.5    0.12    173
  CoreConnect (burst)        0.012   18.9     0.015   24.3    0.12    172
  aSoC                       0.006   9.6      0.006   7.3     0.09    83
  aSoC speed-up vs. burst    2.0     2.3      2.5     3.3     1.3     2.1
  Used aSoC links            8       8        33      27      41      26
  aSoC max. link usage       10%     8%       37%     28%     2%      25%
  aSoC ave. link usage       7%      7%       22%     25%     2%      5%
  CoreConnect busy (burst)   91%     100%     100%    99%     32%     67%

29 ECE669: Lecture 24 Comparison to Hierarchical CoreConnect
- Multiple levels of arbitration slows down hierarchical CoreConnect

                        9-Core Model     16-Core Model
  Execution Time (ms)   IIR     IMG      IIR    IMG    FFT    MPEG
  Hier. CoreConnect     0.013   26.0     15.7   37.4   0.15   178
  aSoC                  0.006   9.6      7.0    7.3    0.09   83
  aSoC speedup          2.1     2.7      2.2    5.1    1.6    2.2

30 ECE669: Lecture 24 aSoC Comparison to Dynamic Network
- Direct comparison to oblivious routing network [1]

                        9-Core Model     16-Core Model
  Execution Time (ms)   IIR     IMG      IIR    IMG    MPEG
  Dynamic routing       0.008   14.4     8.7    9.7    162.0
  aSoC                  0.006   6.1      7.0    7.3    82.5
  aSoC speedup          1.3     2.4      1.3    1.3    2.0

[1] W. Dally and H. Aoki, "Deadlock-Free Adaptive Routing in Multicomputer Networks Using Virtual Channels", IEEE Transactions on Parallel and Distributed Systems, April 1993

31 ECE669: Lecture 24 aSoC Layout

32 ECE669: Lecture 24 aSoC Multi-core Layout
- Comm. interface consumes about 6% of tile
- Critical path in flow control between tiles
- Currently integrating additional cores

33 ECE669: Lecture 24 Future Work: Dynamic Voltage Scaling
- Data transfer rate to/from core used to control voltage and clock
- Counter and CAM used to select sources
- May be software controlled
[Diagram: tile with local frequency and voltage control alongside the coreports, instruction memory, PC controller, and North/South/East/West ports]

34 ECE669: Lecture 24 Future Work: Dynamic Voltage Scaling
- CAM allows selection
[Diagram: data-rate measurement at the coreports feeds a CAM that selects among supply voltages V1-V4 and clock dividers (/1 through /128) derived from the global clock, with a critical-path check guarding the local clock and supply]
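The rate-driven selection above can be sketched as a table lookup (an illustration only: the measurement window, the rate threshold rule, and any voltage pairing are assumptions; only the /1 through /128 divider chain comes from the slide):

```python
# Divider chain from the slide: the global clock divided by each value.
DIVIDERS = [128, 64, 32, 16, 8, 4, 2, 1]

def select_divider(words_per_128_cycles):
    """Pick the largest (slowest) divider whose resulting local clock
    still delivers at least the measured coreport data rate, so the
    core runs no faster than its streams require."""
    for div in DIVIDERS:
        if 128 // div >= words_per_128_cycles:
            return div
    return 1   # demand exceeds every divided rate: run at full speed

assert select_divider(1) == 128    # one word per window: slowest clock
assert select_divider(5) == 16     # 128/16 = 8 words/window covers 5
assert select_divider(100) == 1
```

A matching supply voltage would then be chosen for the selected frequency, which is where the CAM in the slide comes in.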

35 ECE669: Lecture 24 Future Work
- Improved software mapping environment
- Integration of more cores
- Mapping of substantial applications
  - Turbo codes
  - Viterbi decoder
- More integrated simulation environment

36 ECE669: Lecture 24 Summary
- Goal: Create low-overhead interconnect environment for on-chip stream communication
- IP core augmented with communication interface
- Flow control and some stream reconfiguration included in the architecture
- Mapping tools and simulation environment assist in evaluating design
- Initial results show favorable comparison to bus and high-overhead dynamic networks

