Design & Co-design of Embedded Systems


1 Design & Co-design of Embedded Systems
HW/SW Partitioning Algorithms + Process Scheduling. Maziar Goudarzi.

2 Today's Program
Introduction; preliminaries; Hardware/Software Partitioning; Distributed System Co-Synthesis (next session). Reference: Wayne Wolf, “Hardware/Software Co-Synthesis Algorithms,” Chapter 2 of Hardware/Software Co-Design: Principles and Practice, J. Staunstrup and W. Wolf (Eds.), Kluwer Academic Publishers, 1997.

3 Topics
Introduction; a classification; examples: Vulcan and Cosyma.

4 Introduction to HW/SW Partitioning
The first variety of co-synthesis applications. Definition: a HW/SW partitioning algorithm implements a specification on some sort of multiprocessor architecture. Usually, the multiprocessor architecture = one CPU + some ASICs on the CPU bus.

5 Introduction to HW/SW Partitioning (cont’d)
Terminology. Allocation: synthesis methods that design the multiprocessor topology along with the PEs and SW architecture. Scheduling: the process of assigning PE (CPU and/or ASIC) time to processes for execution.

6 Introduction to HW/SW Partitioning (cont’d)
In most partitioning algorithms, the type of CPU is fixed and given, while the ASICs must be synthesized: what function should each ASIC implement, and what characteristics should the implementation have? These are single-rate synthesis problems; the CDFG is the starting model.

7 HW/SW Partitioning (cont’d)
Normal use of architectural components: the CPU performs the less computationally intensive functions, while ASICs are used to accelerate core functions. Where is this used? In high-performance applications, where no CPU is fast enough for the operations, and in low-cost applications, where ASIC accelerators allow use of a much smaller, cheaper CPU.

8 A Classification
Criterion: optimization strategy, i.e. the trade-off between performance and cost. Primal approach: performance is the primary goal; first place all functionality in ASICs, then progressively move more to the CPU to reduce cost. Dual approach: cost is the primary goal; first place all functions in the CPU, then move operations to the ASIC to meet the performance goal.

9 A Classification (cont’d)
Classification by optimization strategy (cont’d). Example co-synthesis systems: Vulcan (Stanford), primal strategy; Cosyma (Braunschweig, Germany), dual strategy.

10 Co-Synthesis Algorithms: HW/SW Partitioning
HW/SW Partitioning Examples: Vulcan

11 Partitioning Examples: Vulcan
Gupta and De Micheli, Stanford University. Primal approach: 1. All-HW initial implementation. 2. Iteratively move functionality to the CPU to reduce cost. System specification language: HardwareC, which is compiled into a flow graph.

12 Partitioning Examples: Vulcan (cont’d)
[Figure: two HardwareC fragments and their flow graphs. The parallel assignments “x=a; y=b;” become two operation nodes joined through a nop node; “if (c>d) x=e; else y=f;” becomes a cond node whose edges to x=e and y=f are guarded by the conditions c>d and c<=d.]

13 Partitioning Examples: Vulcan (cont’d)
Flow graph definition: a variation of a (single-rate) task graph. Nodes represent operations, typically low-level ones such as multiply and add. Edges represent data dependencies; each carries a Boolean condition under which the edge is traversed.

14 Partitioning Examples: Vulcan (cont’d)
The flow graph is executed repeatedly at some rate. It can have initiation-time constraints between nodes, t(vi) + lij ≤ t(vj) ≤ t(vi) + uij, and rate constraints on each node, mi ≤ Ri ≤ Mi (a small check routine follows).
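
These constraints are simple to test once the times are known. A minimal C sketch, not part of Vulcan; checking a single edge and node in isolation is an illustrative assumption:

    #include <stdbool.h>

    /* Illustrative only: check one initiation-time constraint
       t(vi) + lij <= t(vj) <= t(vi) + uij for an edge (vi, vj). */
    bool initiation_time_ok(double t_vi, double t_vj,
                            double lij, double uij)
    {
        return (t_vi + lij <= t_vj) && (t_vj <= t_vi + uij);
    }

    /* Illustrative only: check one rate constraint mi <= Ri <= Mi. */
    bool rate_ok(double Ri, double mi, double Mi)
    {
        return (mi <= Ri) && (Ri <= Mi);
    }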

15 Partitioning Examples: Vulcan (cont’d)
Vulcan co-synthesis algorithm. The partitioning quantum is a thread: the algorithm divides the flow graph into threads and allocates them. A thread boundary is determined by 1. (always) a non-deterministic delay element, such as a wait on an external variable, and 2. (by choice) other points of the flow graph. Target architecture: CPU + co-processor (multiple ASICs).

16 Partitioning Examples: Vulcan (cont’d)
Vulcan co-synthesis algorithm (cont’d). Allocation: primal approach. Scheduling: done by a scheduler that runs on the target CPU and is generated as part of the synthesis process; it schedules all threads (both HW and SW). The schedule cannot be static, due to some threads’ non-deterministic initiation times.

17 Partitioning Examples: Vulcan (cont’d)
Vulcan co-synthesis algorithm (cont’d). Cost estimation for the SW implementation: code size is relatively straightforward; data size is the biggest challenge, and Vulcan expends some effort to find bounds for each thread. Cost estimation for the HW implementation: ?

18 Partitioning Examples: Vulcan (cont’d)
Vulcan co-synthesis algorithm (cont’d). Performance estimation, for both the SW and the HW implementation, is derived from the flow graph and basic execution times for the operators.

19 Partitioning Examples: Vulcan (cont’d)
Algorithm Details. Partitioning goal: map each thread to one of two partitions, the CPU set FS or the co-processor set FH, such that the required execution rate is met and the total cost is minimized.

20 Partitioning Examples: Vulcan (cont’d)
Algorithm Details (cont’d). Algorithm steps (a toy sketch follows): 1. Put all threads in the FH set. 2. Iterate: 2.1. Move some operations to FS: select a group of operations to move, check performance feasibility by computing the worst-case delay through the flow graph given the new thread times, and commit the move if feasible. 2.2. Incrementally update the cost function to reflect the new partition.
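
As a rough, self-contained illustration of this primal loop, the toy C sketch below greedily moves whole threads and models worst-case delay as a plain sum of thread delays. Real Vulcan evaluates groups of operations against the full flow-graph delay; the numbers and the serial-delay model here are assumptions for illustration only:

    #include <stdio.h>

    #define N 4
    int main(void)
    {
        double t_hw[N] = {1, 2, 1, 3};  /* thread delay if left in FH  */
        double t_sw[N] = {4, 5, 6, 9};  /* thread delay if moved to FS */
        double deadline = 20.0;         /* performance constraint      */
        int in_sw[N] = {0};             /* step 1: all threads in FH   */

        for (int i = 0; i < N; i++) {   /* candidate move: thread i    */
            double delay = 0.0;         /* worst-case delay if moved   */
            for (int j = 0; j < N; j++)
                delay += (in_sw[j] || j == i) ? t_sw[j] : t_hw[j];
            if (delay <= deadline) {    /* feasible: commit the move   */
                in_sw[i] = 1;
                printf("thread %d -> FS (delay %.0f)\n", i, delay);
            }
        }
        return 0;
    }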

21 Partitioning Examples: Vulcan (cont’d)
Algorithm Details (cont’d). Vulcan cost function (a sketch follows):
f(w) = c1·Sh(FH) − c2·Ss(FS) + c3·B − c4·P + c5·|m|
where c1..c5 are weight constants, Sh() and Ss() are size functions, B is the bus utilization (< 1), P is the processor utilization (< 1), and |m| is the total number of variables to be transferred between the CPU and the co-processor.
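
Evaluating the function is plain arithmetic once the estimates exist. A minimal C sketch; the parameter names and the idea of passing precomputed estimates are assumptions, not Vulcan's actual interfaces:

    /* f(w) = c1*Sh(FH) - c2*Ss(FS) + c3*B - c4*P + c5*|m| */
    double vulcan_cost(const double c[5], /* weight constants c1..c5   */
                       double sh_FH,      /* HW size of set FH         */
                       double ss_FS,      /* SW size of set FS         */
                       double B,          /* bus utilization (< 1)     */
                       double P,          /* CPU utilization (< 1)     */
                       int m_count)       /* |m|: variables exchanged  */
    {
        return c[0] * sh_FH - c[1] * ss_FS
             + c[2] * B - c[3] * P
             + c[4] * (double)m_count;
    }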

22 Partitioning Examples: Vulcan (cont’d)
Algorithm Details (cont’d). Complementary notes. A heuristic to minimize communication: once a thread is moved to FS, its immediate successors are placed on the list for evaluation in the next iteration. No backtracking: once a thread is assigned to FS, it remains there. Experimental results: the designs produced are considerably faster than all-SW implementations, yet much cheaper than all-HW ones.

23 Co-Synthesis Algorithms: HW/SW Partitioning
HW/SW Partitioning Examples: Cosyma

24 Partitioning Examples: Cosyma
Rolf Ernst et al., Technical University of Braunschweig, Germany. Dual approach: 1. All-SW initial implementation. 2. Iteratively move basic blocks to the ASIC accelerator to meet the performance objective. System specification language: Cx, which is compiled into an ESG (Extended Syntax Graph), much like a CDFG.

25 Partitioning Examples: Cosyma (cont’d)
Cosyma co-synthesis algorithm. The partitioning quantum is a basic block: a branch-free block of a program. Target architecture: CPU + accelerator ASIC(s). (Scheduling, allocation, cost estimation, performance estimation, and algorithm details are covered on the following slides.)

26 Partitioning Examples: Cosyma (cont’d)
Cosyma co-synthesis algorithm (cont’d). Performance estimation. SW implementation: done by examining the object code the compiler generates for the basic block. HW implementation: assumes one operator per clock cycle and creates a list schedule for the DFG of the basic block; the depth of the list gives the number of clock cycles required (see the sketch below). Communication: done by data-flow analysis of the adjacent basic blocks; with shared memory, it is proportional to the number of variables to be accessed.
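
To make the HW estimate concrete, here is a small self-contained C sketch that computes the depth of a hypothetical four-node DFG under the one-operator-per-cycle assumption. With no resource limits the list schedule reduces to ASAP levels; the graph itself is invented:

    #include <stdio.h>

    #define NODES 4
    #define EDGES 3
    int main(void)
    {
        /* dependency edges of a made-up DFG: 0->2, 1->2, 2->3 */
        int edge[EDGES][2] = { {0, 2}, {1, 2}, {2, 3} };
        int level[NODES] = {1, 1, 1, 1};  /* ASAP cycle of each node */

        /* Relax all edges NODES times so the result does not depend
           on node ordering (a tiny longest-path computation).       */
        for (int pass = 0; pass < NODES; pass++)
            for (int e = 0; e < EDGES; e++)
                if (level[edge[e][0]] + 1 > level[edge[e][1]])
                    level[edge[e][1]] = level[edge[e][0]] + 1;

        int depth = 0;
        for (int i = 0; i < NODES; i++)
            if (level[i] > depth) depth = level[i];
        printf("estimated clock cycles: %d\n", depth);  /* prints 3 */
        return 0;
    }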

27 Partitioning Examples: Cosyma (cont’d)
Algorithm Steps. The change in execution time caused by moving basic block b from the CPU to the ASIC, given a set Z of basic blocks already implemented on the ASIC, is
Δc(b) = w · (tHW(b) − tSW(b) + tcom(Z) − tcom(Z ∪ {b})) · It(b)
where w is a constant weight, tHW(b) and tSW(b) are the HW and SW execution times of b, tcom(·) is the estimated communication time between the CPU and the accelerator ASIC for a given set of blocks on the ASIC, and It(b) is the total number of times b is executed (a sketch follows).
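
The formula maps directly to code. A minimal C sketch; all inputs are assumed to be precomputed estimates, and a negative result indicates an expected speedup from the move:

    /* Dc(b) = w * (tHW(b) - tSW(b) + tcom(Z) - tcom(Z U {b})) * It(b) */
    double cosyma_delta_c(double w,       /* constant weight            */
                          double t_hw_b,  /* HW execution time of b     */
                          double t_sw_b,  /* SW execution time of b     */
                          double tcom_Z,  /* comm. time with set Z      */
                          double tcom_Zb, /* comm. time with Z U {b}    */
                          long it_b)      /* times b is executed        */
    {
        return w * (t_hw_b - t_sw_b + tcom_Z - tcom_Zb) * (double)it_b;
    }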

28 Partitioning Examples: Cosyma (cont’d)
Experimental Results. Moving only basic blocks to HW yields a typical speedup of only 2x, because intra-basic-block parallelism is limited. The cure is to apply several control-flow optimizations that increase parallelism in the basic block, and hence in the ASIC: loop pipelining, speculative branch execution with multiple branch prediction, and operator pipelining. Result: speedups of 2.7 to 9.7, with co-synthesis CPU times of 35 to 304 seconds on a typical workstation.

29 Summary
What co-synthesis is; various keywords used in classifying co-synthesis algorithms; HW/SW partitioning as one broad category of co-synthesis algorithms; criteria by which a co-synthesis algorithm is categorized.

30 Processes and operating systems
Scheduling policies: RMS; EDF. Scheduling modeling assumptions. Interprocess communication. Power management. Reference (and slides): Wayne Wolf, “Computers as Components: Principles of Embedded Computing System Design,” Chapter 6 (Processes and Operating Systems), MKP, 2001.

31 Metrics
How do we evaluate a scheduling policy? By its ability to satisfy all deadlines; by CPU utilization, the percentage of time devoted to useful work; and by scheduling overhead, the time required to make a scheduling decision.

32 Rate monotonic scheduling
RMS (Liu and Layland): widely used, analyzable scheduling policy. The analysis is known as Rate-Monotonic Analysis (RMA).

33 RMA model
All processes run on a single CPU. Zero context-switch time. No data dependencies between processes. Process execution time is constant. Deadline is at the end of the period. The highest-priority ready process runs.

34 Process parameters
Ti is the computation time of process i; ti is the period of process i (a minimal struct follows).
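
In code, the model needs nothing more than these two numbers per process. A minimal C sketch; the struct and field names are assumptions:

    /* Per-process parameters of the RMA model. */
    struct process {
        double T;  /* Ti: worst-case computation time of process i */
        double t;  /* ti: period of process i (deadline = period)  */
    };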

35 Rate-monotonic analysis
Response time: the time required to finish a process. Critical instant: the scheduling state that gives the worst response time. The critical instant occurs when all higher-priority processes are ready to execute.

36 Critical instant
[Figure: the critical instant for P4 occurs when the interfering higher-priority processes P1, P2, and P3 all become ready simultaneously, giving P4 its worst-case response time.]

37 RMS priorities
Optimal (fixed) priority assignment: the shortest-period process gets the highest priority, i.e. priority is inversely proportional to period; break ties arbitrarily. No fixed-priority scheme does better.

38 RMS example
[Figure: RMS schedule over time 0 to 10: P1, with the shorter period, runs at the start of each of its periods; P2 runs in the remaining time of its longer period.]

39 RMS CPU utilization
Utilization for n processes is Σi Ti / ti. As the number of tasks approaches infinity, the maximum guaranteed utilization approaches ln 2 ≈ 69% (a worked check follows).
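
A self-contained C sketch of the utilization test, using the Liu-Layland bound n(2^(1/n) − 1); the three-process set is hypothetical:

    #include <math.h>
    #include <stdio.h>

    struct process { double T, t; };  /* computation time, period */

    int main(void)
    {
        struct process p[] = { {1, 4}, {2, 6}, {1, 10} };
        int n = (int)(sizeof p / sizeof p[0]);

        double U = 0.0;                   /* U = sum of Ti / ti */
        for (int i = 0; i < n; i++)
            U += p[i].T / p[i].t;

        double bound = n * (pow(2.0, 1.0 / n) - 1.0); /* -> ln 2 */
        printf("U = %.3f, bound = %.3f\n", U, bound);
        if (U <= bound)                   /* sufficient, not necessary */
            printf("schedulable under RMS\n");
        return 0;
    }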

40 RMS CPU utilization, cont’d.
RMS cannot guarantee 100% use of the CPU, even with zero context-switch overhead: idle cycles must be kept available to handle the worst-case scenario. In exchange, RMS guarantees that all processes will always meet their deadlines, provided the utilization bound is respected.

41 RMS implementation
Efficient implementation: scan the processes; choose the highest-priority active process (a sketch follows).
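
A minimal C sketch of that scan, assuming the task table is kept sorted by ascending period so that table order equals RMS priority order; the data layout is an assumption:

    struct task { double period; int ready; };

    /* Return the index of the highest-priority ready task, or -1 if
       none is ready. tasks[] is sorted by ascending period, so the
       first ready entry is the shortest-period (highest-priority) one. */
    int rms_pick(const struct task tasks[], int n)
    {
        for (int i = 0; i < n; i++)
            if (tasks[i].ready)
                return i;
        return -1;
    }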

42 Earliest-deadline-first scheduling
EDF: a dynamic priority scheduling scheme. The process closest to its deadline has the highest priority, so priorities must be recalculated at every timer interrupt.

43 EDF example
[Figure: EDF schedule of P1 and P2; at each point the ready process with the nearest deadline runs.]

44 EDF analysis
EDF can use 100% of the CPU. But EDF may miss a deadline (under overload it gives no guarantee about which deadlines are missed).

45 EDF implementation
On each timer interrupt: compute the time to each deadline; choose the ready process closest to its deadline (a sketch follows). Generally considered too expensive to use in practice.
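
A minimal C sketch of that per-interrupt choice. Absolute deadlines and the O(n) scan are assumptions; a real kernel would keep a deadline-sorted queue precisely because of this overhead:

    struct edf_task { double deadline; int ready; }; /* absolute deadline */

    /* Return the index of the ready task with the nearest deadline,
       or -1 if no task is ready. Called on each timer interrupt.    */
    int edf_pick(const struct edf_task tasks[], int n)
    {
        int best = -1;
        for (int i = 0; i < n; i++) {
            if (!tasks[i].ready)
                continue;
            if (best < 0 || tasks[i].deadline < tasks[best].deadline)
                best = i;
        }
        return best;
    }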

46 Fixing scheduling problems
What if your set of processes is unschedulable? Change deadlines in requirements. Reduce execution times of processes. Get a faster CPU.

47 Priority inversion
Priority inversion: a low-priority process keeps a high-priority process from running. Improper use of system resources can cause scheduling problems: a low-priority process grabs an I/O device; a high-priority process then needs the device but can’t get it until the low-priority process is done. This can cause deadlock.

48 Solving priority inversion
Give priorities to system resources. Have a process inherit the priority of a resource that it requests: the low-priority process inherits the priority of the device if it is higher (a sketch follows).
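
A toy C sketch of this scheme, where higher number = higher priority; the structures and the idea of storing a priority per resource are illustrative assumptions:

    #include <stddef.h>

    struct task     { int base_prio, cur_prio; };
    struct resource { int prio; struct task *holder; };

    /* On acquisition, the process runs at the resource's priority
       if that is higher than its own, so no mid-priority process
       can preempt it while it holds the resource.                 */
    void acquire(struct resource *r, struct task *t)
    {
        r->holder = t;
        if (r->prio > t->cur_prio)
            t->cur_prio = r->prio;      /* inherit device priority */
    }

    void release_resource(struct resource *r)
    {
        if (r->holder != NULL)
            r->holder->cur_prio = r->holder->base_prio;  /* revert */
        r->holder = NULL;
    }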

49 Data dependencies
Data dependencies allow us to improve utilization: they restrict which combinations of processes can run simultaneously. [Figure: a dependency edge from P1 to P2, so P1 and P2 can’t run simultaneously.]

50 Context-switching time
Non-zero context-switch time can push the limits of a tight schedule. Its effects are hard to calculate, since they depend on the order of context switches. In practice, OS context-switch overhead is small.

51 What about interrupts?
Interrupts take time away from processes, so perform the minimum work possible in the interrupt handler. [Figure: execution trace where the OS preempts P1 to run the interrupt handler, then dispatches P2 and P3.]

52 Device processing structure
The interrupt service routine (ISR) performs minimal I/O: it gets and puts register values. An interrupt service process/thread performs most of the device function (a sketch follows).
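
A hedged C sketch of this split. The device register address, the ring-buffer scheme, and the commented-out RTOS call are all assumptions; real ISR registration is platform-specific:

    #include <stdint.h>

    #define DEV_DATA (*(volatile uint8_t *)0x40001000) /* assumed register */
    #define BUF_SIZE 64u

    static volatile uint8_t rx_buf[BUF_SIZE];
    static volatile unsigned rx_head, rx_tail;

    /* ISR: minimal I/O only -- copy one register value and return. */
    void device_isr(void)
    {
        rx_buf[rx_head++ % BUF_SIZE] = DEV_DATA;
        /* wake_service_thread();  hypothetical RTOS primitive */
    }

    /* Service thread: performs most of the device function. */
    void device_service_thread(void)
    {
        while (rx_tail != rx_head) {
            uint8_t byte = rx_buf[rx_tail++ % BUF_SIZE];
            /* ... parse protocol, update device state, notify clients ... */
            (void)byte;
        }
    }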

53 Summary
Two major scheduling policies: RMS and EDF. Other factors affecting scheduling: priority inversion, data dependencies, context-switching time, and interrupts.

54 Assignment
Questions 6.17 to 6.26 from Chapter 6 of the textbook. Due date: two weeks from now, Sunday, Azar 27th.

