Winter-Spring 2001, Codesign of Embedded Systems, slide 1
Co-Synthesis Algorithms: HW/SW Partitioning
Part of the HW/SW Codesign of Embedded Systems course (CE 40-226)

Topics
- Introduction
- Preliminaries
- Hardware/Software Partitioning
- Distributed System Co-Synthesis

Topics
- Introduction
- A Classification
- Examples
  - Vulcan
  - Cosyma

Introduction to HW/SW Partitioning
- The first variety of co-synthesis applications
- Definition
  - A HW/SW partitioning algorithm implements a specification on some sort of multiprocessor architecture
- Usually
  - Multiprocessor architecture = one CPU + some ASICs on the CPU bus

Introduction to HW/SW Partitioning (cont’d)
- Terminology
  - Allocation: synthesis methods which design the multiprocessor topology along with the PEs and the SW architecture
  - Scheduling: the process of assigning PE (CPU and/or ASIC) time to processes so that they get executed

Introduction to HW/SW Partitioning (cont’d)
- In most partitioning algorithms
  - The type of CPU is fixed and given
  - The ASICs must be synthesized
    - What function should be implemented on each ASIC?
    - What characteristics should the implementation have?
  - The problem is a single-rate synthesis problem
  - A CDFG is the starting model

HW/SW Partitioning (cont’d)
- Normal use of architectural components
  - The CPU performs the less computationally intensive functions
  - ASICs are used to accelerate core functions
- Where to use?
  - High-performance applications
    - No CPU is fast enough for the operations
  - Low-cost applications
    - ASIC accelerators allow use of a much smaller, cheaper CPU

A Classification
- Criterion: optimization strategy
  - Trade-off between performance and cost
- Primal approach
  - Performance is the primary goal
  - First, put all functionality in ASICs. Progressively move more to the CPU to reduce cost.
- Dual approach
  - Cost is the primary goal
  - First, put all functions in the CPU. Move operations to the ASIC to meet the performance goal.

A Classification (cont’d)
- Classification by optimization strategy (cont’d)
  - Example co-synthesis systems
    - Vulcan (Stanford): primal strategy
    - Cosyma (Braunschweig, Germany): dual strategy

Co-Synthesis Algorithms: HW/SW Partitioning
HW/SW Partitioning Examples: Vulcan

Partitioning Examples: Vulcan
- Gupta, De Micheli, Stanford University
- Primal approach
  1. All-HW initial implementation.
  2. Iteratively move functionality to the CPU to reduce cost.
- System specification language
  - HardwareC
  - Is compiled into a flow graph

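Partitioning Examples: Vulcan (cont’d)
[Figure: two HardwareC fragments and their flow graphs. The fragment "x=a; y=b;" becomes a nop node with two edges, each labeled 1, leading to the nodes x=a and y=b. The fragment "if (c>d) x=e; else y=f;" becomes a cond node with an edge labeled c>d leading to x=e and an edge labeled c<=d leading to y=f.]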

Partitioning Examples: Vulcan (cont’d)
- Flow graph definition
  - A variation of a (single-rate) task graph
  - Nodes
    - Represent operations
    - Typically low-level operations: mult, add
  - Edges
    - Represent data dependencies
    - Each carries a Boolean condition under which the edge is traversed

Partitioning Examples: Vulcan (cont’d)
- Flow graph
  - Is executed repeatedly at some rate
  - Can have initiation-time constraints between nodes, bounding the initiation time of node v_j relative to that of node v_i:
    $t(v_i) + l_{ij} \le t(v_j) \le t(v_i) + u_{ij}$
  - Can have rate constraints on each node:
    $m_i \le R_i \le M_i$
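The two kinds of flow-graph constraints can be checked mechanically. The following is a minimal sketch, assuming the initiation-time constraint bounds the start of node v_j relative to node v_i by a lower bound l_ij and an upper bound u_ij. The function names and data layout are illustrative and are not taken from Vulcan.

```python
def meets_initiation_constraints(t, constraints):
    """Check t(v_i) + l_ij <= t(v_j) <= t(v_i) + u_ij for every constraint.

    t           -- dict mapping node name to its initiation time
    constraints -- iterable of (i, j, l_ij, u_ij) tuples (hypothetical layout)
    """
    return all(t[i] + l <= t[j] <= t[i] + u for i, j, l, u in constraints)


def meets_rate_constraints(rates, bounds):
    """Check m_i <= R_i <= M_i for every constrained node.

    rates  -- dict mapping node name to its execution rate R_i
    bounds -- dict mapping node name to a (m_i, M_i) pair
    """
    return all(lo <= rates[n] <= hi for n, (lo, hi) in bounds.items())


# Toy data: node "b" must start between 2 and 5 time units after node "a".
ok_timing = meets_initiation_constraints({"a": 0, "b": 3}, [("a", "b", 2, 5)])
bad_timing = meets_initiation_constraints({"a": 0, "b": 6}, [("a", "b", 2, 5)])
ok_rate = meets_rate_constraints({"a": 10}, {"a": (5, 20)})
```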

Partitioning Examples: Vulcan (cont’d)
- Vulcan co-synthesis algorithm
  - The partitioning quantum is a thread
    - The algorithm divides the flow graph into threads and allocates them
    - A thread boundary is determined by
      1. (always) a non-deterministic delay element, such as a wait for an external variable
      2. (on choice) other points of the flow graph
  - Target architecture
    - CPU + co-processor (multiple ASICs)

Partitioning Examples: Vulcan (cont’d)
- Vulcan co-synthesis algorithm (cont’d)
  - Allocation
    - Primal approach
  - Scheduling
    - Is done by a scheduler on the target CPU
    - The scheduler is generated as part of the synthesis process
    - It schedules all threads (both HW and SW threads)
    - It cannot be static, because some threads have non-deterministic initiation times

Partitioning Examples: Vulcan (cont’d)
- Vulcan co-synthesis algorithm (cont’d)
  - Cost estimation
    - SW implementation
      - Code size: relatively straightforward
      - Data size: the biggest challenge. Vulcan puts some effort into finding bounds for each thread
    - HW implementation: ?

Partitioning Examples: Vulcan (cont’d)
- Vulcan co-synthesis algorithm (cont’d)
  - Performance estimation
    - For both SW and HW implementations
    - Derived from the flow graph and the basic execution times of the operators

Partitioning Examples: Vulcan (cont’d)
- Algorithm details
  - Partitioning goal
    - Allocate each thread to one of two partitions
      - CPU set: $\Phi_S$
      - Co-processor set: $\Phi_H$
    - The required execution rate must be met, and the total cost minimized

Partitioning Examples: Vulcan (cont’d)
- Algorithm details (cont’d)
  - Algorithm steps
    1. Put all threads in the $\Phi_H$ set
    2. Iteratively do
       2.1. Move some operations to $\Phi_S$
            2.1.1. Select a group of operations to move to $\Phi_S$
            2.1.2. Check performance feasibility by computing the worst-case delay through the flow graph, given the new thread times
            2.1.3. Do the move, if feasible
       2.2. Incrementally update the cost function to reflect the new partition
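The steps above can be sketched as a greedy loop. This is a minimal illustration under stated assumptions, not Vulcan's actual implementation: the candidate order, the `worst_case_delay` callback, the `deadline` parameter, and the toy timing data are all hypothetical, and the incremental cost update of step 2.2 is left out. Queuing the successors of a moved thread first is one simple way to prefer neighbors.

```python
def primal_partition(threads, successors, worst_case_delay, deadline):
    """Greedy all-HW-first partitioning sketch.

    threads          -- list of thread names (initial candidate order)
    successors       -- dict: thread -> list of immediate successor threads
    worst_case_delay -- callable(hw_set, sw_set) -> estimated worst-case delay
    deadline         -- largest acceptable worst-case delay
    """
    hw, sw = set(threads), set()      # step 1: every thread starts in hardware
    worklist = list(threads)
    while worklist:                   # step 2: try one candidate move at a time
        t = worklist.pop(0)
        if t in sw:                   # already moved; moves are never undone
            continue
        trial_hw, trial_sw = hw - {t}, sw | {t}
        # step 2.1.2: feasibility check on the trial partition
        if worst_case_delay(trial_hw, trial_sw) <= deadline:
            hw, sw = trial_hw, trial_sw          # step 2.1.3: commit the move
            # evaluate the moved thread's successors next
            worklist = [s for s in successors.get(t, []) if s in hw] + worklist
    return hw, sw


# Toy data (assumed): three threads in a chain, 1 time unit in HW, 3 in SW.
HW_TIME = {"a": 1, "b": 1, "c": 1}
SW_TIME = {"a": 3, "b": 3, "c": 3}


def chain_delay(hw, sw):
    """Worst-case delay of a purely serial chain: the sum of thread times."""
    return sum(HW_TIME[t] for t in hw) + sum(SW_TIME[t] for t in sw)


hw, sw = primal_partition(["a", "b", "c"], {"a": ["b"], "b": ["c"]},
                          chain_delay, deadline=7)
print(sorted(hw), sorted(sw))  # prints ['c'] ['a', 'b']
```

With a deadline of 7, threads a and b can move to software (delay 3 + 3 + 1 = 7), while moving c as well would give a delay of 9 and is rejected.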

Partitioning Examples: Vulcan (cont’d)
- Algorithm details (cont’d)
  - Vulcan cost function
    $f(w) = c_1 S_h(\Phi_H) - c_2 S_s(\Phi_S) + c_3 B - c_4 P + c_5 |m|$
    - c_1 ... c_5: weight constants
    - S_h(), S_s(): size functions
    - B: bus utilization (< 1)
    - P: processor utilization (< 1)
    - m: total number of variables to be transferred between the CPU and the co-processor
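As a worked example, the cost function can be evaluated directly. The sketch below transcribes the formula term by term with the signs given on the slide. The function name, the argument names, and the unit default weights are assumptions made for illustration.

```python
def vulcan_cost(hw_size, sw_size, bus_util, cpu_util, n_transfers,
                c=(1.0, 1.0, 1.0, 1.0, 1.0)):
    """Evaluate c1*S_h - c2*S_s + c3*B - c4*P + c5*|m|.

    hw_size     -- S_h: size of the hardware partition
    sw_size     -- S_s: size of the software partition
    bus_util    -- B: bus utilization, expected to be below 1
    cpu_util    -- P: processor utilization, expected to be below 1
    n_transfers -- |m|: variables moved between CPU and co-processor
    c           -- the five weight constants (default values are an assumption)
    """
    c1, c2, c3, c4, c5 = c
    return (c1 * hw_size - c2 * sw_size
            + c3 * bus_util - c4 * cpu_util + c5 * n_transfers)


# Toy numbers: 100 - 40 + 0.5 - 0.25 + 8
example_cost = vulcan_cost(100, 40, 0.5, 0.25, 8)
print(example_cost)  # prints 68.25
```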

Partitioning Examples: Vulcan (cont’d)
- Algorithm details (cont’d)
  - Complementary notes
    - A heuristic to minimize communication
      - Once a thread is moved to $\Phi_S$, its immediate successors are placed in the list for evaluation in the next iteration
    - No back-tracking
      - Once a thread is assigned to $\Phi_S$, it remains there
    - Experimental results
      - The designs produced are considerably faster than all-SW implementations, yet much cheaper than all-HW designs

Co-Synthesis Algorithms: HW/SW Partitioning
HW/SW Partitioning Examples: Cosyma

Partitioning Examples: Cosyma
- Rolf Ernst, et al.: Technical University of Braunschweig, Germany
- Dual approach
  1. All-SW initial implementation.
  2. Iteratively move basic blocks to the ASIC accelerator to meet the performance objective.
- System specification language
  - C^x
  - Is compiled into an ESG (Extended Syntax Graph)
    - An ESG is much like a CDFG

Partitioning Examples: Cosyma (cont’d)
- Cosyma co-synthesis algorithm
  - The partitioning quantum is a basic block
    - A basic block is a branch-free block of program code
  - Target architecture
    - CPU + accelerator ASIC(s)
  - Scheduling
  - Allocation
  - Cost estimation
  - Performance estimation
  - Algorithm details

Partitioning Examples: Cosyma (cont’d)
- Cosyma co-synthesis algorithm (cont’d)
  - Performance estimation
    - SW implementation
      - Done by examining the object code that a compiler generates for the basic block
    - HW implementation
      - Assumes one operator per clock cycle
      - Creates a list schedule for the DFG of the basic block
      - The depth of the list gives the number of clock cycles required
    - Communication
      - Done by data-flow analysis of the adjacent basic blocks
      - In shared memory: proportional to the number of variables to be accessed
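The hardware estimate can be illustrated with a small list scheduler. This is a sketch under assumptions, not Cosyma's code: it reads "one operator per clock cycle" as every operator taking exactly one cycle, and the optional `units` limit on how many operators may start per cycle is a hypothetical parameter. The DFG is given as a dict from each operator to the set of operators it depends on.

```python
def list_schedule_depth(preds, units=None):
    """Return the number of clock cycles of a unit-delay list schedule.

    preds -- dict: operator -> set of operators whose results it needs
    units -- max operators started per cycle; None means no limit (assumed)
    """
    done, remaining, cycles = set(), set(preds), 0
    while remaining:
        # An operator is ready once all of its predecessors have finished.
        ready = sorted(op for op in remaining if set(preds[op]) <= done)
        if not ready:
            raise ValueError("dependency cycle in the data-flow graph")
        issued = ready if units is None else ready[:units]
        done |= set(issued)
        remaining -= set(issued)
        cycles += 1
    return cycles


# DFG of (a*b) + (c*d): two multiplications feeding one addition.
dfg = {"mul1": set(), "mul2": set(), "add": {"mul1", "mul2"}}
depth_unlimited = list_schedule_depth(dfg)       # both mults share cycle 1
depth_one_unit = list_schedule_depth(dfg, 1)     # one operator per cycle
print(depth_unlimited, depth_one_unit)  # prints 2 3
```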

Partitioning Examples: Cosyma (cont’d)
- Algorithm steps
  - Change in execution time caused by moving basic block b from the CPU to the ASIC:
    $\Delta c(b) = w \, \bigl( t_{HW}(b) - t_{SW}(b) + t_{com}(Z) - t_{com}(Z \cup b) \bigr) \times It(b)$
    - w: constant weight
    - t_HW(b), t_SW(b): execution time of basic block b in hardware and in software
    - t_com(Z): estimated communication time between the CPU and the accelerator ASIC, given a set Z of basic blocks implemented on the ASIC
    - It(b): total number of times that b is executed
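A small numeric example makes the formula concrete. The sketch below transcribes it term by term, with the signs as given on the slide. The function name, the argument names, and the toy timing values are assumptions for illustration only.

```python
def delta_c(t_hw, t_sw, t_com_z, t_com_z_with_b, iterations, w=1.0):
    """Evaluate w * (t_HW(b) - t_SW(b) + t_com(Z) - t_com(Z u b)) * It(b).

    t_hw, t_sw     -- execution time of block b in hardware and in software
    t_com_z        -- t_com(Z): communication time before b joins the ASIC set
    t_com_z_with_b -- t_com(Z u b): communication time after b joins it
    iterations     -- It(b): number of times b is executed
    w              -- constant weight (default of 1.0 is an assumption)
    """
    return w * (t_hw - t_sw + t_com_z - t_com_z_with_b) * iterations


# Toy numbers: (4 - 10 + 3 - 5) * 100
change = delta_c(t_hw=4, t_sw=10, t_com_z=3, t_com_z_with_b=5, iterations=100)
print(change)  # prints -800.0
```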

Partitioning Examples: Cosyma (cont’d)
- Experimental results
  - Moving only basic blocks to HW
    - Typical speedup of only 2x
    - Reason: limited intra-basic-block parallelism
    - Cure: implement several control-flow optimizations to increase parallelism in the basic block, and hence in the ASIC
      - Examples: loop pipelining, speculative branch execution with multiple branch prediction, operator pipelining
  - Result
    - Speedups: 2.7 to 9.7
    - CPU times: 35 to 304 seconds on a typical workstation

What we learned today
- HW/SW partitioning: one broad category of co-synthesis algorithms
- Criteria by which a co-synthesis algorithm is categorized
