A hardware-software co-design approach with separated verification/synthesis between computation and communication Masahiro Fujita VLSI Design and Education.


1 A hardware-software co-design approach with separated verification/synthesis between computation and communication. Masahiro Fujita, VLSI Design and Education Center, The University of Tokyo.

2 State-of-the-art SoC design. System on a chip (SoC): C-based design description down to implementation. [Diagram: a C description, void main() { a = read(); b = read(); c = func(a, b); write(c); }, is mapped onto an SoC in which CPU, DSP, HW1, HW2, Mem1 and Mem2 are connected by an interconnect. Each block comes from the IP library as an IP core with its own bus IF: IP core1 (DSP), IP core2 (Mem1), IP core3 (HW1), IP core4 (HW2), IP core5 (Mem2), IP core6 (CPU).]

3 Design reuse is extremely important in SoC designs. IP (Intellectual Property) core reuse: existing designs have been verified, but their interfaces may or may not match. Example: MPEG video system; select IPs with the required functionality from the IP library. [Diagram: IP library with Bus1, Bus2, CPU1, CPU2, Memory Controller1, Memory Controller2; target system with Bus, CPU, Memory Controller, MPEG, Analog I/F, Power Source.]

4 Need protocol transducers. The functionality of an IP is satisfied, but its interface does not match: communication on the interconnect (bus) is based on different on-chip bus protocols (Protocol A, Protocol B). Solution: insert a protocol transducer for the conversions. The protocol transducer should be generated automatically. [Diagram: a system of CPU, MPEG, RAM, custom HW and DMAC is built from CPU (IP), RAM (IP) and DMAC (IP); a transducer is placed between the Protocol A side and the Protocol B side of the interconnect.]

5 Basic ways of thinking and our proposal. We would like to come up with a methodology for large and complicated system designs. Design reuse is a key. Separation of concerns is essential: computation and communication (control and datapath) must be clearly separated in some way. What we propose: a new way to design communication protocols (and special mechanisms for rectification after manufacturing). [Diagram: an SoC with SDRAM, CPU, H/W, DSP, Bus and Mem; multiple such SoCs on a PCB, together with analog and mechanical parts.]

6 Today's topic. Propose design methods for communication interface design with clear separation between computation and communication, and show how the separation helps design efficiency. [Diagram: the SoC of IP cores with bus IFs (DSP, Mem1, HW1, HW2, Mem2, CPU) on an interconnect; each block is split into interface/communication, computation, interface/communication, and the interface is described by request (Req.) and response (Res.) automata that start from the initial state i.]

7 Our ongoing related research. Propose various rectification methods for computation and communication, so that designs can be debugged after manufacturing. Propose different mechanisms for computation and communication. [Diagram: an IP core with an in-field programmable bus protocol IF; the original circuit is augmented with programmable elements (LUT).]

8 Outline. Motivation. Background: state-of-the-art design methodology, C-based design. Proposed method and its application to interface designs for computing elements: key technology for IP reuse; separation of concerns between computation and communication (control and datapath). Application to dynamically reconfigurable computing (if time allows). [Diagram: CPU, MPEG, RAM, custom HW and DMAC built from CPU (IP), RAM (IP) and DMAC (IP) on a bus, with a transducer between Protocol A (A1) and Protocol B.]

9 For improving design productivity. Start the design at higher abstraction levels: C language based HW descriptions are 100 to 10,000 times more compact than gate-level descriptions, and the number of lines that one designer can describe per day is limited. Reuse existing designs: so-called IP reuse in LSI designs. The key is to separate computation and communication. [Diagram: abstraction levels from high-level C/C++ to RTL to gates; a block split into interface/communication, computation, interface/communication.]

10 Design issues for large complicated systems. Starting from C/C++ designs/specifications: extraction of parallelism; partitioning into HW and SW based on profiling, where performance-critical parts (mostly loops) are assigned to HW. [Diagram: the C code void main() { a = read(); b = read(); c = func(a, b); write(c); } goes through compilation and high-level synthesis, with IP reuse design from the IP library, to a system with SDRAM, CPU, H/W, DSP, Bus and Mem.]

11 C/C++ based design and specification languages for SoC designs. SystemC and SpecC are the most common. They are based on C/C++ and add: structural hierarchy (behavior, module); parallelism with event-based synchronization (par; wait, notify); channels; and others. They support hardware-software co-design. [Diagram: behaviors b1 and b2, written in C, run in par, synchronize with notify and wait, and communicate through channels or through shared variables.]

12 Claim in this talk: separation of concerns. Separating computation from communication (interface/communication, computation, interface/communication) is not sufficient! Even inside the interface, control and datapath should be clearly separated. A protocol is a collection of sequences, and each sequence can operate independently. [Diagram: a protocol consists of a hardware definition (port, signal names, etc.) and Sequence1 to Sequence4 (read, write, 4-burst read, 4-burst write). Each sequence has Automaton1 (for request or blocking) and Automaton2 (for response), with transitions such as (stb==1), ack<=0, ack<=1, ack<=0. All sequences share the initial state i.]

13 Goal: automatic generation of correct bus interfaces. A communication protocol can be arbitrarily complicated: blocking, non-blocking, out-of-order, tags, etc. Deal only with state-of-the-art bus protocols: specification documents are over 200 pages, but mostly only subsets of them are actually used. Formally verify the definition of a protocol as automata and automatically generate interface circuits from it. If necessary, change their functionality in the field. C-based designs are assumed.

14 State-of-the-art on-chip bus protocol example: OCP (Open Core Protocol). Interface protocol proposed by OCP-IP. Configurable interface protocol: data/address width, burst/out-of-order features, ... In the basic configuration, the interface has 8 signals (including clock and reset). The full specification is over 200 pages, with more than 30 different transactions/sequences. [Diagram: OCP master and OCP slave connected by MCmd, MAddr, MData, SCmdAccept, SResp and SData.]

15 What a protocol transducer does. It changes protocol A to protocol B. Protocols can be very complicated: over 30 different commands are defined, manuals run over 200 pages, and there are transactions (sequences) such as burst and out-of-order modes. Each transaction (sequence) is sent/received one at a time. Protocols can be defined with FSMs/automata, but state-of-the-art protocols may need extremely large and complicated FSMs/automata. [Diagram: MPEG, RAM, custom HW and DMAC with RAM (IP), DMAC (IP) and CPU (IP); a transducer between Protocol A and Protocol B.]

16 Our scenario. Start with C/C++ based descriptions of the SoC. Apply control/data flow analysis for computation. Use the protocol converter generator in communication interface synthesis: convert the protocol in the description to the target protocol. Protocols themselves are formally verified with model checkers; the number of states to be checked is much smaller than in the actual designs. [Diagram: original designs in C/C++ go through HW/SW synthesis (manual/automatic), with verification through SDG traversal, to generated (scheduled/allocated) HW/SW; protocol extraction yields the protocol in the design, which the protocol converter generator combines with the protocol library (target protocol, with model checking on protocol definitions) to produce the protocol converter in HW/SW.] [1] K. Tanabe, S. Sasaki, and M. Fujita, "Program Slicing for System Level Designs in SpecC," in Proc. of the IASTED, pp. , Nov. [2] S. Sasaki, M. Fujita, et al., FSEN 05.

17 Example: non-blocking protocol conversion. MASTER: OCP (Single Read, Non-Posted Write). SLAVE: OCP (Single Read, Single Write). [Diagram: request and response automata for Single Read and Non-Posted Write on the master side, and for Single Read and Single Write on the slave side.]

18 Conversion example. [Diagram: between master and slave, the transducer consists of an FSM for requests (FIFO-ready), an FSM for responses (FIFO-ready) and a FIFO (2 bit x 4). Master-side signals: M_MCmd, M_MAddr, M_MData, M_SCmdAccept, M_SResp, M_SData. Slave-side signals: S_MCmd, S_MAddr, S_MData, S_SCmdAccept, S_SResp, S_SData. FIFO signals: WData, PUSH, RData, PUSH, RST, CLK, D. Request side: Single Read Request to Single Read Request, Non-Posted Write Request to Single Write Request. Response side: Single Read Response to Single Read Response, and Non-Posted Write Response.]

19 How a protocol transducer is realized. Intuitive understanding of the problem: take the FSMs/automata that define the two protocols, compute their product, and follow it. [Diagram: Protocol A master and Protocol B slave exchange requests and responses through the protocol transducer; the definitions of Protocol A and Protocol B are given as FSMs/automata with clock-wise behavior, e.g. (stb==1), ack<=0, ack<=1, ack<=0; the product is explored toward a target ([1] + ours).] [1] R. Passerone, J. A. Rowson, A. Sangiovanni-Vincentelli, "Automatic Transducer Synthesis of Interfaces between Incompatible Protocols," DAC 98, pp. 8-13.

20 Simple computation of the product [1]. Follow the two automata, compute the product of the two, and eliminate nodes/paths that violate data dependencies. [Diagram: Protocol A master with states A, B, C (8-bit data DI plus ctrl) and Protocol B slave with states D, E, F (8-bit data DO plus ctrl), with the transducer between them. Transition labels: {Ctrl=0}, {Ctrl=1, DI:=data1}, {Ctrl=1, D:=data1}, {Ctrl=0, DI:=data2}, {Ctrl=0, D:=data2}; transducer labels: {Ctrl=0}, {Ctrl=1, Rcvf1:=DO}, {Ctrl=0, Rcvd2:=DO}, {Ctrl=0}. Product states: AD, AE, BD, BE, BF, CD, CE, CF. Marked states are dependency violations (data not yet received but already sent); among the remaining paths the minimum-latency path is chosen.]
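The product-and-prune idea can be sketched on a toy version of the example. Everything below is an assumption made for illustration: the two three-state automata, the rule that either side may wait, and the rule that a data item may be forwarded only in a cycle after it was received. It is not the algorithm of [1], only a small instance of the same idea.

```python
from collections import deque

# Hypothetical toy protocols. Each move is (next_state, data_index); the data
# index says which data item is transferred when the move is taken.
MASTER = {"A": [("B", 1)], "B": [("C", 2)], "C": []}  # master delivers data1, data2
SLAVE = {"D": [("E", 1)], "E": [("F", 2)], "F": []}   # slave consumes data1, data2


def explore_product(master, slave, m0="A", s0="D"):
    """Breadth-first product exploration that drops dependency-violating moves."""
    start = (m0, s0, 0)  # third field: number of data items received so far
    dist = {start: 0}    # BFS distance = minimum latency in cycles
    violations = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        m, s, received = node
        m_moves = master[m] + [(m, None)]  # take a transition, or wait
        s_moves = slave[s] + [(s, None)]
        for m2, d_in in m_moves:
            for s2, d_out in s_moves:
                if d_in is None and d_out is None:
                    continue  # nobody moved
                if d_out is not None and d_out > received:
                    # data would be sent before it has been received
                    violations.append((m + s, m2 + s2))
                    continue
                nxt = (m2, s2, max(received, d_in or 0))
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
    return dist, violations


dist, violations = explore_product(MASTER, SLAVE)
reachable = {m + s for m, s, _ in dist}
print(sorted(reachable), dist[("C", "F", 2)])
```

In this toy instance the product states AE and BF are never reached, which corresponds to the "data not yet received but sent" marking on the slide.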

21 Need separation between control and datapath. If there is a loop in the automata, the product computation never terminates: data values are different each time the loop is traversed, so a product state such as AD reached on a later iteration is not the same state as the first AD, and the product has to be expanded more and more. [Diagram: Protocol A master with states A, B, C (8-bit data DI plus ctrl) and Protocol B slave with states D, E, F (8-bit data DO plus ctrl), now with a loop; transition labels {Ctrl=0}, {Ctrl=1, DI:=data1}, {Ctrl=1, D:=data1}, {Ctrl=0, DI:=data2}, {Ctrl=0, D:=data2}; transducer labels {Ctrl=0}, {Ctrl=1, Rcvf1:=DO}, {Ctrl=0, Rcvd2:=DO}, {Ctrl=0}; product path AD, BD, CE, CF, and AD again with different data values.]
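The termination problem can be reproduced with a few lines. This is a sketch under assumptions: a three-state control loop A, B, C, a data value that changes on every step, and an arbitrary exploration limit of 1000 states; none of these numbers come from the talk.

```python
CTRL = {"A": "B", "B": "C", "C": "A"}  # control loop back to the first state


def explore(step, start, limit=1000):
    """Follow successors until no new state appears or the limit is hit."""
    seen = {start}
    frontier = [start]
    while frontier:
        if len(seen) >= limit:
            return len(seen), False  # gave up: state space keeps growing
        nxt = step(frontier.pop())
        if nxt not in seen:
            seen.add(nxt)
            frontier.append(nxt)
    return len(seen), True


def concrete_step(state):
    """Control state paired with a data value that differs on every pass."""
    ctrl, data = state
    return CTRL[ctrl], data + 1


def abstract_step(state):
    """Control state only; data values are abstracted away."""
    return CTRL[state]


print(explore(concrete_step, ("A", 0)))
print(explore(abstract_step, "A"))
```

With data values in the state, every pass through the loop creates fresh states; with control only, the exploration closes after three states.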

22 Protocols can be very complicated. State-of-the-art protocols introduce many features for higher throughput. [Timing diagrams between protocol master and protocol slave, with request (address/data) and response (data) over time t. Blocking (low throughput): Req1, Res1, Req2, Res2. Split transaction / non-blocking: Req1, Req2, Req3, then Res1, Res2. Out-of-order transaction: Req1, Req2, Req3, then Res1, Res3, Res2. Burst transaction: Addr1 to Addr4 with Data1 to Data4. Single-address burst transaction: Addr1 with Data1 to Data4.]

23 Problems and solutions. Simple product computation has essential problems for realistic on-chip bus protocols: if there is a loop in the control, it does not terminate; if the automata become large, it may not terminate in practice; the protocol must be represented in an automaton; it cannot deal with non-blocking protocols. These problems come from the non-separation of control and datapath. Solutions: with separation of computation and communication (control and datapath), the following can be realized: hiding loops; protocols are represented hierarchically with automata.

24 Separation of communication and computation (data transfer). Data values are abstracted away: only the data id is watched in the communication part, and the actual data transfer is realized by the computation part. Id matching is guaranteed by agreement between computation and communication. A new request is accepted only after the previous request has been accepted. If necessary, a FIFO (buffer) is inserted to keep not-yet-serviced sequences. Multiple simultaneous responses may arrive before the current response is finished.
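A minimal sketch of this split, assuming a hypothetical class SplitTransducer in which the control part sees only ids held in a bounded FIFO and the data values live in a separate store keyed by the same ids. The class name, the method names and the FIFO depth are illustrative, not taken from the authors' implementation.

```python
from collections import deque


class SplitTransducer:
    """Control part tracks ids only; the data part maps ids to values."""

    def __init__(self, depth=4):
        self.depth = depth
        self.pending = deque()  # communication part: ids of not-yet-serviced requests
        self.store = {}         # computation part: id -> actual data value
        self.next_id = 0

    def accept_request(self, data):
        """Return the id of the accepted request, or None if the master must wait."""
        if len(self.pending) == self.depth:
            return None
        rid = self.next_id
        self.next_id += 1
        self.store[rid] = data     # data is moved by the computation part
        self.pending.append(rid)   # control only remembers the id
        return rid

    def service(self):
        """Forward the oldest pending request; id matching ties control to data."""
        if not self.pending:
            return None
        rid = self.pending.popleft()
        return rid, self.store.pop(rid)


t = SplitTransducer(depth=2)
ids = [t.accept_request(x) for x in ("a", "b", "c")]  # third one has to wait
first = t.service()
retry = t.accept_request("c")
print(ids, first, retry)
```

The control logic never inspects "a", "b" or "c", so its state space does not depend on the data width.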

25 Separation of computation and communication inside protocol transducers. In the protocol definition, control and data are specified separately. Introduce two FSMs, one for requests and one for responses, to describe complicated protocols uniformly. The FIFO can be made arbitrarily complicated if we like; even arithmetic computation is possible. [Diagram: a transducer with a single FSM between Protocol A master and Protocol B slave, compared with a transducer that contains a Req. FSM and a Res. FSM between Protocol A master and Protocol B slave.]

26 Protocols can be very complicated. State-of-the-art protocols introduce many features for higher throughput. [Timing diagrams between protocol master and protocol slave, with request (address/data) and response (data) over time t. Blocking (low throughput): Req1, Res1, Req2, Res2. Split transaction / non-blocking: Req1, Req2, Req3, then Res1, Res2. Out-of-order transaction: Req1, Req2, Req3, then Res1, Res3, Res2. Burst transaction: Addr1 to Addr4 with Data1 to Data4. Single-address burst transaction: Addr1 with Data1 to Data4.]

27 For more complicated protocols ... [Diagram: the protocol definitions of Protocol A and Protocol B each have request (Req.) and response (Res.) automata. The transducer between Protocol A master and Protocol B slave contains a Req. FSM, a Send FSM and a Recv FSM, plus a newly introduced FIFO with its own read/write control (X FIFO WR, X FIFO RD).] Pros: can deal with more complicated protocols. Cons: more latency due to multiple FIFOs.

28 Now we can resolve it [2]. Elimination of loops (to initial states): introduce an ending state e, explore, then eliminate the ending states. Elimination of intermediate loops: loops are replaced with super states (SS) before exploration. Concentrate on controls only; data parts are processed separately! [Diagram: automata i, A, B and i, C, D whose loops back to i are redirected to e; automaton i, Y, X, Z, U, W whose intermediate loop is replaced with a super state.] [2] S. Watanabe, K. Seto, Y. Ishikawa, S. Komatsu, M. Fujita, "Protocol Transducer Synthesis using Divide and Conquer approach," Proc. of the 12th Asia and South Pacific Design Automation Conference, pp. , .
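Both loop eliminations amount to small graph rewrites. The sketch below assumes automata given as plain edge lists and uses made-up helper names (cut_loops_to_initial, collapse_loop); which states form the intermediate loop is also an assumption, since the slide diagram is not recoverable. It is an illustration of the idea, not the procedure of [2].

```python
def cut_loops_to_initial(edges, initial="i", end="e"):
    """Redirect every edge that returns to the initial state to an ending state."""
    return [(src, end if dst == initial else dst) for src, dst in edges]


def collapse_loop(edges, loop_states, name="SS"):
    """Replace the states of an intermediate loop with a single super state."""
    def rename(state):
        return name if state in loop_states else state

    out = []
    for src, dst in edges:
        a, b = rename(src), rename(dst)
        if a == name and b == name:
            continue  # edge inside the loop: hidden in the super state
        if (a, b) not in out:
            out.append((a, b))
    return out


# Loop back to the initial state: i -> A -> B -> i
sequence = [("i", "A"), ("A", "B"), ("B", "i")]
print(cut_loops_to_initial(sequence))

# Intermediate loop between X and Z (assumed for this example)
looping = [("i", "Y"), ("Y", "X"), ("X", "Z"), ("Z", "X"), ("Z", "U")]
print(collapse_loop(looping, {"X", "Z"}))
```

After both rewrites the control graph is acyclic, so a product exploration over it is guaranteed to finish; the number of loop iterations is left to the data part.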

29 How to deal with multiple complicated transactions [2]. A protocol is a collection of sequences, and each sequence can operate independently. This is true for state-of-the-art protocols with separation between computation and communication. [Diagram: a protocol consists of a hardware definition (port, signal names, etc.) and Sequence1 to Sequence4 (read, write, 4-burst read, 4-burst write). Each sequence has Automaton1 (for request or blocking) and Automaton2 (for response), with transitions such as (stb==1), ack<=0, ack<=1, ack<=0. All sequences share the initial state i.]

30 Hierarchical synthesis owing to the separation of computation and communication [2]. Sequence-level synthesis followed by a merge process: exploration over the sequences of Protocol A (Sequence A1, Sequence A2) and Protocol B (Sequence B1, Sequence B2) produces partial transducer 1 and partial transducer 2; the generated FSMs, which have the same initial state i, are then merged into the transducer between Protocol A and Protocol B.
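The merge step can be sketched as follows, assuming each partial transducer is a dictionary from state to labelled moves and that only the initial state i is shared. The state-renaming scheme (t1., t2., ...) and the example read/write partial transducers are hypothetical.

```python
def merge(partials, initial="i"):
    """Merge partial transducers that share the same initial state."""
    merged = {initial: []}
    for k, fsm in enumerate(partials, start=1):
        def rename(state, k=k):
            # keep the shared initial state, make all other states unique
            return state if state == initial else f"t{k}.{state}"

        for state, moves in fsm.items():
            merged.setdefault(rename(state), []).extend(
                (label, rename(dst)) for label, dst in moves
            )
    return merged


# Two hypothetical partial transducers, each synthesized from one sequence pair.
read_part = {"i": [("read_req", "r1")], "r1": [("read_res", "i")]}
write_part = {"i": [("write_req", "w1")], "w1": [("write_res", "i")]}

transducer = merge([read_part, write_part])
print(transducer)
```

Because the expensive exploration is done per sequence and the merge is a simple union at the initial state, the cost grows with the number of sequences rather than with the product of all of them.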

31 The protocol transducer synthesis (1): transducers including blocking protocols. [Diagram: Protocol A master, transducer FSM and Protocol B slave, with Req. and Res. For a blocking protocol, automaton-level synthesis is applied to its automaton. For a non-blocking (out-of-order) protocol with separate Req. and Res. automata, a blocking automaton is generated by composition, and it is composed with the automata for the other sequences.]

32 The protocol transducer synthesis (2). Out-of-order to out-of-order: out-of-order processing, tags are sent out as they are. Non-blocking and out-of-order: in-order processing, the transducer generates tags and reorders responses. The FIFO memorizes sequences whose responses have not yet been received. [Diagram: Req. and Res. automata of Protocol A and Protocol B, each going through automaton-level synthesis; the protocol transducer between Protocol A master and Protocol B slave contains a Req. FSM, a Res. FSM and the FIFO.]
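For the in-order master and out-of-order slave case, the tag generation and reordering can be sketched like this. The class ReorderingTransducer and its methods are assumptions for illustration; the real transducer is an FSM with a FIFO, not a Python object.

```python
from collections import deque


class ReorderingTransducer:
    """Generates tags for an in-order master and reorders out-of-order responses."""

    def __init__(self):
        self.next_tag = 0
        self.order = deque()  # FIFO of tags whose responses are still owed
        self.done = {}        # tag -> response that arrived early

    def request(self):
        """Forward a master request to the slave with a freshly generated tag."""
        tag = self.next_tag
        self.next_tag += 1
        self.order.append(tag)
        return tag

    def response(self, tag, data):
        """Accept a slave response; release whatever is now deliverable in order."""
        self.done[tag] = data
        released = []
        while self.order and self.order[0] in self.done:
            released.append(self.done.pop(self.order.popleft()))
        return released


t = ReorderingTransducer()
tags = [t.request() for _ in range(3)]
early = t.response(1, "res1")   # arrives before res0: must be held back
both = t.response(0, "res0")    # unblocks res0 and res1
last = t.response(2, "res2")
print(tags, early, both, last)
```

In the out-of-order to out-of-order case no such buffer is needed, since the tags can simply be passed through.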

33 The protocol transducer synthesis (3). In state-of-the-art on-chip bus protocols, all masters have waiting mechanisms for requests, but some slaves do not have waiting mechanisms for responses. Example: OCP. OCP request (read sequence): waits with the SCmdAccept signal. OCP response (read sequence): finishes in exactly one cycle (no waiting mechanism). Restrictions on protocol definitions in automaton-level synthesis: the master does not have waiting mechanisms but the slave has (request); the slave does not have waiting mechanisms but the master has (response). The next transaction may start before the transducer returns to its initial state, so some requests/responses may not be processed.

34 The protocol transducer synthesis (4): responses are guaranteed to be processed with a FIFO. [Diagram: the protocol definitions of Protocol A and Protocol B each have Req. and Res. automata. The transducer between Protocol A master and Protocol B slave contains a Req. FSM, a Send FSM and a Recv FSM, obtained by automaton-level synthesis, plus a newly introduced FIFO (X FIFO WR, X FIFO RD); automaton-level synthesis is done with a FIFO control automaton (no waiting) that controls FIFO read and write.] Pros: can deal with more complicated protocols. Cons: more latency due to multiple FIFOs.
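Why the FIFO guarantees that responses are processed can be seen in a cycle-by-cycle toy simulation. The function deliver and the example traces are assumptions for this sketch: the slave emits each response in exactly one cycle and cannot wait, while the receiving side is not ready in every cycle.

```python
from collections import deque


def deliver(responses, receiver_ready, use_fifo):
    """Simulate one response channel; return (delivered, lost) responses."""
    fifo, delivered, lost = deque(), [], []
    for resp, ready in zip(responses, receiver_ready):
        if resp is not None:
            if use_fifo:
                fifo.append(resp)        # always captured in the same cycle
            elif ready:
                delivered.append(resp)
            else:
                lost.append(resp)        # sender cannot wait: response is gone
        if use_fifo and ready and fifo:
            delivered.append(fifo.popleft())
    return delivered, lost


responses = ["res1", "res2", None, None]   # one-cycle responses from the slave
receiver_ready = [False, True, True, False]

print(deliver(responses, receiver_ready, use_fifo=False))
print(deliver(responses, receiver_ready, use_fifo=True))
```

The price is visible in the second run: res2 reaches the receiver one cycle later than it would have without the FIFO, which matches the latency drawback noted on the slide.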

35 Tool implementation. Planned to be distributed freely from OCP-IP. Currently under evaluation at Toshiba.

36 Experimental results. Environment: Athlon 64 2 GHz + 1 GB RAM. Implemented as over 12,000 lines of code in C++. Input: hierarchical automaton descriptions in XML. Output: synthesizable RTL Verilog. Logic synthesis: Xilinx ISE. RTL simulator: ModelSim XE. Types: NB = non-blocking, BK = blocking, OoO = out-of-order.
Master's protocol | Slave's protocol | Type | Sequences | Synth. time | Gate count
OCP | AHB | (NB,BK) | 4 | 1.1 [s] | 2,352
AHB | OCP | (BK,NB) | 4 | 1.3 [s] | 1,843
OCP | | (NB,NB) | 2 | 1.9 [s] | 1,568
OCP | Tagged OCP | (NB,OoO) | 2 | 2.2 [s] | 3,514
Tagged OCP | AXI | (OoO,OoO) | 2 | 4.8 [s] | 1,377
AXI | OCP | (OoO,NB) | 2 | 4.9 [s] | 1,731
OCP | AXI | (NB,OoO) | | [s] | 61,205
No one has ever synthesized these before!

37 Rectification after manufacturing. The transducer FSM can be implemented with programmable devices, which makes run-time change of protocols possible. [Diagram: a protocol consists of a hardware definition (port, signal names, etc.) and Sequence1 to Sequence4 (read, write, 4-burst read, 4-burst write).]

38 Conclusion. The following have been shown through an example, protocol transducer synthesis. Separation of concerns is essential: hierarchical definition of protocols; complete separation between computation and communication. State-of-the-art protocols can be processed efficiently. Even formal verification becomes possible. Rectification after manufacturing can be handled.

39 Future issues. Bit-width conversion, e.g. one 16-bit write converted to two 8-bit writes: needs ways to compose multiple sequences. Dynamic change of the number of transfers in burst mode: use super states (SS); separate the data transfer part, determine the loop count, and repeat the super state by the loop count. [Diagram: with existing methods, sequence Write8bit and sequence Write8bit are composed into sequence Write8bit*2, which is then related to sequence Write16bit.]
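On the data side, the 16-bit to 2 x 8-bit conversion is simple; the open issue on the slide is composing the control sequences. The helper below only illustrates the data part and rests on assumptions not stated in the talk: byte addressing, consecutive addresses, and little-endian order by default.

```python
def split_write16(addr, value, little_endian=True):
    """Turn one 16-bit write into two 8-bit writes as (address, byte) pairs."""
    if not 0 <= value <= 0xFFFF:
        raise ValueError("value does not fit in 16 bits")
    low, high = value & 0xFF, value >> 8
    first, second = (low, high) if little_endian else (high, low)
    return [(addr, first), (addr + 1, second)]


print(split_write16(0x10, 0xABCD))
print(split_write16(0x10, 0xABCD, little_endian=False))
```

Each returned pair would be issued as one Write8bit sequence, so the transducer control has to run that sequence twice for every Write16bit it receives.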

40 Application to dynamically reconfigurable processors/protocol transducers

41 Hardware OS. Portions to be reconfigured in dynamically reconfigurable architectures: load and unload functional blocks dynamically; schedule functional blocks dynamically; communicate among functional blocks. Hardware OS (operating system): self (partial) reconfiguration on an FPGA; loads and unloads circuit blocks (hardware tasks), just like processes in multitasking software; provides ways to communicate among hardware tasks.

42 Example of hardware OS: Walder and Platzner. Task slot: rectangular areas into which hardware tasks are loaded. Interconnect: shared bus for communication. OS module: schedules and loads hardware tasks. [Diagram: OS module, task slots and interconnect; hardware tasks (circuit blocks) are placed in the dynamically reconfigurable task slots.] Herbert Walder, Marco Platzner, "Reconfigurable Hardware Operating Systems: From Design Concepts to Realizations," Proceedings of ERSA03, pp. , 2003.

43 Interconnect. Various topologies have been proposed, assuming that all functional blocks use the same protocol. This does not hold in general, so protocol transducers are needed. [Diagram: a) shared bus, b) mesh network, c) tree network, connecting functional blocks FB1 to FB4 through switch boxes (SW BOX).]

44 How to build protocol transducers. Proposal: dynamically reconfigurable protocol transducers. Optimizing protocol transducers: a universal protocol transducer between {A, D} and {B, C} is simply too complicated and consumes too many hardware resources. Load minimum protocol transducers dynamically to save hardware resources. [Diagram: IP1 (Protocol A) and IP2 (Protocol B) with an A-to-B transducer; after reconfiguration, IP1 (Protocol A) and IP3 (Protocol C) with an A-to-C transducer; after reconfiguration, IP4 (Protocol D) and IP3 (Protocol C) with a D-to-C transducer.]

45 Basic idea: use our protocol transducer synthesis method and select partial protocol transducers dynamically. [Diagram: in the design phase, our synthesis method produces a library of partial transducers 1 to 5. At run time, for IP1 (Protocol A) and IP2 (Protocol B), partial transducers 1 and 3 are selected from the library as the A-to-B protocol transducer (they are composed in the case of static architectures); partial transducers 2 and 4 serve A to C.]

46 Architecture of dynamically reconfigurable protocol transducers. Place functional blocks and partial protocol transducers in task slots, as in a hardware OS. Partial protocol transducers are dynamically loaded and unloaded. Functional blocks communicate directly when their protocols match, and through a protocol transducer, dynamically loaded when necessary, when their protocols do not match. [Diagram: functional blocks and partial transducers placed in task slots on a shared bus.]

47 Dynamic reconfiguration of protocol transducers. When functional blocks are loaded, their partial protocol transducers are also loaded. Partial protocol transducers that are no longer in use are unloaded. [Diagram: for a required conversion (AC:Write, AC:Read), the functional block library and the partial transducer library are searched, and the functional block and the partial transducer (AC:Read) are loaded and placed.]
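The search, load and unload bookkeeping can be sketched with a small manager. Everything here is hypothetical: the class TransducerManager, the library keyed by (source protocol, target protocol, sequence), the bitstream file names and the use-count policy are assumptions for illustration, not the architecture's actual interface.

```python
class TransducerManager:
    """Loads partial transducers on demand and unloads them when unused."""

    def __init__(self, library, slots):
        self.library = library  # (src, dst, sequence) -> partial transducer image
        self.slots = slots      # number of task slots available for transducers
        self.loaded = {}        # key -> number of functional blocks using it

    def load(self, src, dst, sequence):
        if src == dst:
            return None  # protocols match: direct communication
        key = (src, dst, sequence)
        if key not in self.library:
            raise KeyError(key)  # search in the library failed
        if key not in self.loaded:
            if len(self.loaded) >= self.slots:
                raise RuntimeError("no free task slot")
            self.loaded[key] = 0
        self.loaded[key] += 1
        return self.library[key]

    def unload(self, src, dst, sequence):
        key = (src, dst, sequence)
        self.loaded[key] -= 1
        if self.loaded[key] == 0:
            del self.loaded[key]  # no longer in use: free the task slot


library = {("A", "C", "read"): "ac_read.bit", ("A", "C", "write"): "ac_write.bit"}
manager = TransducerManager(library, slots=1)
image = manager.load("A", "C", "read")
direct = manager.load("A", "A", "read")
manager.unload("A", "C", "read")
print(image, direct, manager.loaded)
```

Only the partial transducers for the conversions actually in use occupy task slots, which is the resource saving the slide argues for.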

