Design & Co-design of Embedded Systems


1 Design & Co-design of Embedded Systems
HW/SW Partitioning Algorithms + Process Scheduling. Maziar Goudarzi.

2 Today's Program
Introduction; preliminaries; Hardware/Software Partitioning; Distributed System Co-Synthesis (next session). Reference: Wayne Wolf, “Hardware/Software Co-Synthesis Algorithms,” Chapter 2 of Hardware/Software Co-Design: Principles and Practice, J. Staunstrup and W. Wolf (Eds.), Kluwer Academic Publishers, 1997.

3 Topics
Introduction; a classification; examples: Vulcan and Cosyma.

4 Introduction to HW/SW Partitioning
The first variety of co-synthesis applications. Definition: a HW/SW partitioning algorithm implements a specification on some sort of multiprocessor architecture. Usually, the multiprocessor architecture = one CPU + some ASICs on the CPU bus.

5 Introduction to HW/SW Partitioning (cont’d)
Terminology. Allocation: synthesis methods that design the multiprocessor topology along with the PEs and SW architecture. Scheduling: the process of assigning PE (CPU and/or ASIC) time to processes for execution.

6 Introduction to HW/SW Partitioning (cont’d)
In most partitioning algorithms, the type of CPU is fixed and given, while the ASICs must be synthesized: what function should each ASIC implement, and what characteristics should the implementation have? These are single-rate synthesis problems; the CDFG is the starting model.

7 HW/SW Partitioning (cont’d)
Normal use of architectural components: the CPU performs the less computationally intensive functions, while ASICs are used to accelerate core functions. Where is this used? In high-performance applications, where no CPU is fast enough for the operations, and in low-cost applications, where ASIC accelerators allow use of a much smaller, cheaper CPU.

8 A Classification
Criterion: optimization strategy, i.e. the trade-off between performance and cost. Primal approach: performance is the primary goal; first place all functionality in ASICs, then progressively move more to the CPU to reduce cost. Dual approach: cost is the primary goal; first place all functions in the CPU, then move operations to the ASIC to meet the performance goal.

9 A Classification (cont’d)
Classification by optimization strategy (cont’d). Example co-synthesis systems: Vulcan (Stanford), primal strategy; Cosyma (Braunschweig, Germany), dual strategy.

10 Co-Synthesis Algorithms: HW/SW Partitioning
HW/SW Partitioning Examples: Vulcan

11 Partitioning Examples: Vulcan
Gupta and De Micheli, Stanford University. Primal approach: 1. All-HW initial implementation. 2. Iteratively move functionality to the CPU to reduce cost. System specification language: HardwareC, which is compiled into a flow graph.

12 Partitioning Examples: Vulcan (cont’d)
[Figure: two HardwareC fragments and their flow graphs. The parallel assignments “x=a; y=b;” become two operation nodes joined through a nop node; “if (c>d) x=e; else y=f;” becomes a cond node whose edges to x=e and y=f are guarded by the conditions c>d and c<=d.]

13 Partitioning Examples: Vulcan (cont’d)
Flow graph definition: a variation of a (single-rate) task graph. Nodes represent operations, typically low-level ones such as multiply and add. Edges represent data dependencies; each carries a Boolean condition under which the edge is traversed.

14 Partitioning Examples: Vulcan (cont’d)
The flow graph is executed repeatedly at some rate. It can have initiation-time constraints between nodes, t(vi) + lij ≤ t(vj) ≤ t(vi) + uij, and rate constraints on each node, mi ≤ Ri ≤ Mi (a small check routine follows).
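
These constraints are simple to test once the times are known. A minimal C sketch, not part of Vulcan; checking a single edge and node in isolation is an illustrative assumption:

    #include <stdbool.h>

    /* Illustrative only: check one initiation-time constraint
       t(vi) + lij <= t(vj) <= t(vi) + uij for an edge (vi, vj). */
    bool initiation_time_ok(double t_vi, double t_vj,
                            double lij, double uij)
    {
        return (t_vi + lij <= t_vj) && (t_vj <= t_vi + uij);
    }

    /* Illustrative only: check one rate constraint mi <= Ri <= Mi. */
    bool rate_ok(double Ri, double mi, double Mi)
    {
        return (mi <= Ri) && (Ri <= Mi);
    }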

15 Partitioning Examples: Vulcan (cont’d)
Vulcan co-synthesis algorithm. The partitioning quantum is a thread: the algorithm divides the flow graph into threads and allocates them. A thread boundary is determined by 1. (always) a non-deterministic delay element, such as a wait on an external variable, and 2. (by choice) other points of the flow graph. Target architecture: CPU + co-processor (multiple ASICs).

16 Partitioning Examples: Vulcan (cont’d)
Vulcan co-synthesis algorithm (cont’d). Allocation: primal approach. Scheduling: done by a scheduler that runs on the target CPU and is generated as part of the synthesis process; it schedules all threads (both HW and SW). The schedule cannot be static, due to some threads’ non-deterministic initiation times.

17 Partitioning Examples: Vulcan (cont’d)
Vulcan co-synthesis algorithm (cont’d). Cost estimation for the SW implementation: code size is relatively straightforward; data size is the biggest challenge, and Vulcan expends some effort to find bounds for each thread. Cost estimation for the HW implementation: ?

18 Partitioning Examples: Vulcan (cont’d)
Vulcan co-synthesis algorithm (cont’d). Performance estimation, for both the SW and the HW implementation, is derived from the flow graph and basic execution times for the operators.

19 Partitioning Examples: Vulcan (cont’d)
Algorithm Details. Partitioning goal: map each thread to one of two partitions, the CPU set FS or the co-processor set FH, such that the required execution rate is met and the total cost is minimized.

20 Partitioning Examples: Vulcan (cont’d)
Algorithm Details (cont’d). Algorithm steps (a toy sketch follows): 1. Put all threads in the FH set. 2. Iterate: 2.1. Move some operations to FS: select a group of operations to move, check performance feasibility by computing the worst-case delay through the flow graph given the new thread times, and commit the move if feasible. 2.2. Incrementally update the cost function to reflect the new partition.
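
As a rough, self-contained illustration of this primal loop, the toy C sketch below greedily moves whole threads and models worst-case delay as a plain sum of thread delays. Real Vulcan evaluates groups of operations against the full flow-graph delay; the numbers and the serial-delay model here are assumptions for illustration only:

    #include <stdio.h>

    #define N 4
    int main(void)
    {
        double t_hw[N] = {1, 2, 1, 3};  /* thread delay if left in FH  */
        double t_sw[N] = {4, 5, 6, 9};  /* thread delay if moved to FS */
        double deadline = 20.0;         /* performance constraint      */
        int in_sw[N] = {0};             /* step 1: all threads in FH   */

        for (int i = 0; i < N; i++) {   /* candidate move: thread i    */
            double delay = 0.0;         /* worst-case delay if moved   */
            for (int j = 0; j < N; j++)
                delay += (in_sw[j] || j == i) ? t_sw[j] : t_hw[j];
            if (delay <= deadline) {    /* feasible: commit the move   */
                in_sw[i] = 1;
                printf("thread %d -> FS (delay %.0f)\n", i, delay);
            }
        }
        return 0;
    }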

21 Partitioning Examples: Vulcan (cont’d)
Algorithm Details (cont’d). Vulcan cost function (a sketch follows):
f(w) = c1·Sh(FH) − c2·Ss(FS) + c3·B − c4·P + c5·|m|
where c1..c5 are weight constants, Sh() and Ss() are size functions, B is the bus utilization (< 1), P is the processor utilization (< 1), and |m| is the total number of variables to be transferred between the CPU and the co-processor.
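
Evaluating the function is plain arithmetic once the estimates exist. A minimal C sketch; the parameter names and the idea of passing precomputed estimates are assumptions, not Vulcan's actual interfaces:

    /* f(w) = c1*Sh(FH) - c2*Ss(FS) + c3*B - c4*P + c5*|m| */
    double vulcan_cost(const double c[5], /* weight constants c1..c5   */
                       double sh_FH,      /* HW size of set FH         */
                       double ss_FS,      /* SW size of set FS         */
                       double B,          /* bus utilization (< 1)     */
                       double P,          /* CPU utilization (< 1)     */
                       int m_count)       /* |m|: variables exchanged  */
    {
        return c[0] * sh_FH - c[1] * ss_FS
             + c[2] * B - c[3] * P
             + c[4] * (double)m_count;
    }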

22 Partitioning Examples: Vulcan (cont’d)
Algorithm Details (cont’d). Complementary notes. A heuristic to minimize communication: once a thread is moved to FS, its immediate successors are placed on the list for evaluation in the next iteration. No backtracking: once a thread is assigned to FS, it remains there. Experimental results: the designs produced are considerably faster than all-SW implementations, yet much cheaper than all-HW ones.

23 Co-Synthesis Algorithms: HW/SW Partitioning
HW/SW Partitioning Examples: Cosyma

24 Partitioning Examples: Cosyma
Rolf Ernst et al., Technical University of Braunschweig, Germany. Dual approach: 1. All-SW initial implementation. 2. Iteratively move basic blocks to the ASIC accelerator to meet the performance objective. System specification language: Cx, which is compiled into an ESG (Extended Syntax Graph), much like a CDFG.

25 Partitioning Examples: Cosyma (cont’d)
Cosyma co-synthesis algorithm. The partitioning quantum is a basic block: a branch-free block of a program. Target architecture: CPU + accelerator ASIC(s). (Scheduling, allocation, cost estimation, performance estimation, and algorithm details are covered on the following slides.)

26 Partitioning Examples: Cosyma (cont’d)
Cosyma co-synthesis algorithm (cont’d). Performance estimation. SW implementation: done by examining the object code the compiler generates for the basic block. HW implementation: assumes one operator per clock cycle and creates a list schedule for the DFG of the basic block; the depth of the list gives the number of clock cycles required (see the sketch below). Communication: done by data-flow analysis of the adjacent basic blocks; with shared memory, it is proportional to the number of variables to be accessed.
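
To make the HW estimate concrete, here is a small self-contained C sketch that computes the depth of a hypothetical four-node DFG under the one-operator-per-cycle assumption. With no resource limits the list schedule reduces to ASAP levels; the graph itself is invented:

    #include <stdio.h>

    #define NODES 4
    #define EDGES 3
    int main(void)
    {
        /* dependency edges of a made-up DFG: 0->2, 1->2, 2->3 */
        int edge[EDGES][2] = { {0, 2}, {1, 2}, {2, 3} };
        int level[NODES] = {1, 1, 1, 1};  /* ASAP cycle of each node */

        /* Relax all edges NODES times so the result does not depend
           on node ordering (a tiny longest-path computation).       */
        for (int pass = 0; pass < NODES; pass++)
            for (int e = 0; e < EDGES; e++)
                if (level[edge[e][0]] + 1 > level[edge[e][1]])
                    level[edge[e][1]] = level[edge[e][0]] + 1;

        int depth = 0;
        for (int i = 0; i < NODES; i++)
            if (level[i] > depth) depth = level[i];
        printf("estimated clock cycles: %d\n", depth);  /* prints 3 */
        return 0;
    }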

27 Partitioning Examples: Cosyma (cont’d)
Algorithm Steps. The change in execution time caused by moving basic block b from the CPU to the ASIC, given a set Z of basic blocks already implemented on the ASIC, is
Δc(b) = w · (tHW(b) − tSW(b) + tcom(Z) − tcom(Z ∪ {b})) · It(b)
where w is a constant weight, tHW(b) and tSW(b) are the HW and SW execution times of b, tcom(·) is the estimated communication time between the CPU and the accelerator ASIC for a given set of blocks on the ASIC, and It(b) is the total number of times b is executed (a sketch follows).
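
The formula maps directly to code. A minimal C sketch; all inputs are assumed to be precomputed estimates, and a negative result indicates an expected speedup from the move:

    /* Dc(b) = w * (tHW(b) - tSW(b) + tcom(Z) - tcom(Z U {b})) * It(b) */
    double cosyma_delta_c(double w,       /* constant weight            */
                          double t_hw_b,  /* HW execution time of b     */
                          double t_sw_b,  /* SW execution time of b     */
                          double tcom_Z,  /* comm. time with set Z      */
                          double tcom_Zb, /* comm. time with Z U {b}    */
                          long it_b)      /* times b is executed        */
    {
        return w * (t_hw_b - t_sw_b + tcom_Z - tcom_Zb) * (double)it_b;
    }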

28 Partitioning Examples: Cosyma (cont’d)
Experimental Results. Moving only basic blocks to HW yields a typical speedup of only 2x, because intra-basic-block parallelism is limited. The cure is to apply several control-flow optimizations that increase parallelism in the basic block, and hence in the ASIC: loop pipelining, speculative branch execution with multiple branch prediction, and operator pipelining. Result: speedups of 2.7 to 9.7, with co-synthesis CPU times of 35 to 304 seconds on a typical workstation.

29 Summary
What co-synthesis is; various keywords used in classifying co-synthesis algorithms; HW/SW partitioning as one broad category of co-synthesis algorithms; criteria by which a co-synthesis algorithm is categorized.

30 Processes and operating systems
Scheduling policies: RMS; EDF. Scheduling modeling assumptions. Interprocess communication. Power management. Reference (and slides): Wayne Wolf, “Computers as Components: Principles of Embedded Computing System Design,” Chapter 6 (Processes and Operating Systems), MKP, 2001.

31 Metrics
How do we evaluate a scheduling policy? By its ability to satisfy all deadlines; by CPU utilization, the percentage of time devoted to useful work; and by scheduling overhead, the time required to make a scheduling decision.

32 Rate monotonic scheduling
RMS (Liu and Layland): widely used, analyzable scheduling policy. The analysis is known as Rate-Monotonic Analysis (RMA).

33 RMA model
All processes run on a single CPU. Zero context-switch time. No data dependencies between processes. Process execution time is constant. Deadline is at the end of the period. The highest-priority ready process runs.

34 Process parameters
Ti is the computation time of process i; ti is the period of process i (a minimal struct follows).
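
In code, the model needs nothing more than these two numbers per process. A minimal C sketch; the struct and field names are assumptions:

    /* Per-process parameters of the RMA model. */
    struct process {
        double T;  /* Ti: worst-case computation time of process i */
        double t;  /* ti: period of process i (deadline = period)  */
    };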

35 Rate-monotonic analysis
Response time: the time required to finish a process. Critical instant: the scheduling state that gives the worst response time. The critical instant occurs when all higher-priority processes are ready to execute.

36 Critical instant
[Figure: the critical instant for P4 occurs when the interfering higher-priority processes P1, P2, and P3 all become ready simultaneously, giving P4 its worst-case response time.]

37 RMS priorities
Optimal (fixed) priority assignment: the shortest-period process gets the highest priority, i.e. priority is inversely proportional to period; break ties arbitrarily. No fixed-priority scheme does better.

38 RMS example
[Figure: RMS schedule over time 0 to 10: P1, with the shorter period, runs at the start of each of its periods; P2 runs in the remaining time of its longer period.]

39 RMS CPU utilization
Utilization for n processes is Σi Ti / ti. As the number of tasks approaches infinity, the maximum guaranteed utilization approaches ln 2 ≈ 69% (a worked check follows).
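
A self-contained C sketch of the utilization test, using the Liu-Layland bound n(2^(1/n) − 1); the three-process set is hypothetical:

    #include <math.h>
    #include <stdio.h>

    struct process { double T, t; };  /* computation time, period */

    int main(void)
    {
        struct process p[] = { {1, 4}, {2, 6}, {1, 10} };
        int n = (int)(sizeof p / sizeof p[0]);

        double U = 0.0;                   /* U = sum of Ti / ti */
        for (int i = 0; i < n; i++)
            U += p[i].T / p[i].t;

        double bound = n * (pow(2.0, 1.0 / n) - 1.0); /* -> ln 2 */
        printf("U = %.3f, bound = %.3f\n", U, bound);
        if (U <= bound)                   /* sufficient, not necessary */
            printf("schedulable under RMS\n");
        return 0;
    }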

40 RMS CPU utilization, cont’d.
RMS cannot guarantee 100% use of the CPU, even with zero context-switch overhead: idle cycles must be kept available to handle the worst-case scenario. In exchange, RMS guarantees that all processes will always meet their deadlines, provided the utilization bound is respected.

41 RMS implementation
Efficient implementation: scan the processes; choose the highest-priority active process (a sketch follows).
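
A minimal C sketch of that scan, assuming the task table is kept sorted by ascending period so that table order equals RMS priority order; the data layout is an assumption:

    struct task { double period; int ready; };

    /* Return the index of the highest-priority ready task, or -1 if
       none is ready. tasks[] is sorted by ascending period, so the
       first ready entry is the shortest-period (highest-priority) one. */
    int rms_pick(const struct task tasks[], int n)
    {
        for (int i = 0; i < n; i++)
            if (tasks[i].ready)
                return i;
        return -1;
    }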

42 Earliest-deadline-first scheduling
EDF: a dynamic priority scheduling scheme. The process closest to its deadline has the highest priority, so priorities must be recalculated at every timer interrupt.

43 EDF example
[Figure: EDF schedule of P1 and P2; at each point the ready process with the nearest deadline runs.]

44 EDF analysis
EDF can use 100% of the CPU. But EDF may miss a deadline (under overload it gives no guarantee about which deadlines are missed).

45 EDF implementation
On each timer interrupt: compute the time to each deadline; choose the ready process closest to its deadline (a sketch follows). Generally considered too expensive to use in practice.
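
A minimal C sketch of that per-interrupt choice. Absolute deadlines and the O(n) scan are assumptions; a real kernel would keep a deadline-sorted queue precisely because of this overhead:

    struct edf_task { double deadline; int ready; }; /* absolute deadline */

    /* Return the index of the ready task with the nearest deadline,
       or -1 if no task is ready. Called on each timer interrupt.    */
    int edf_pick(const struct edf_task tasks[], int n)
    {
        int best = -1;
        for (int i = 0; i < n; i++) {
            if (!tasks[i].ready)
                continue;
            if (best < 0 || tasks[i].deadline < tasks[best].deadline)
                best = i;
        }
        return best;
    }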

46 Fixing scheduling problems
What if your set of processes is unschedulable? Change deadlines in requirements. Reduce execution times of processes. Get a faster CPU.

47 Priority inversion
Priority inversion: a low-priority process keeps a high-priority process from running. Improper use of system resources can cause scheduling problems: a low-priority process grabs an I/O device; a high-priority process then needs the device but can’t get it until the low-priority process is done. This can cause deadlock.

48 Solving priority inversion
Give priorities to system resources. Have a process inherit the priority of a resource that it requests: the low-priority process inherits the priority of the device if it is higher (a sketch follows).
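
A toy C sketch of this scheme, where higher number = higher priority; the structures and the idea of storing a priority per resource are illustrative assumptions:

    #include <stddef.h>

    struct task     { int base_prio, cur_prio; };
    struct resource { int prio; struct task *holder; };

    /* On acquisition, the process runs at the resource's priority
       if that is higher than its own, so no mid-priority process
       can preempt it while it holds the resource.                 */
    void acquire(struct resource *r, struct task *t)
    {
        r->holder = t;
        if (r->prio > t->cur_prio)
            t->cur_prio = r->prio;      /* inherit device priority */
    }

    void release_resource(struct resource *r)
    {
        if (r->holder != NULL)
            r->holder->cur_prio = r->holder->base_prio;  /* revert */
        r->holder = NULL;
    }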

49 Data dependencies
Data dependencies allow us to improve utilization: they restrict which combinations of processes can run simultaneously. [Figure: a dependency edge from P1 to P2, so P1 and P2 can’t run simultaneously.]

50 Context-switching time
Non-zero context-switch time can push the limits of a tight schedule. Its effects are hard to calculate, since they depend on the order of context switches. In practice, OS context-switch overhead is small.

51 What about interrupts?
Interrupts take time away from processes, so perform the minimum work possible in the interrupt handler. [Figure: execution trace where the OS preempts P1 to run the interrupt handler, then dispatches P2 and P3.]

52 Device processing structure
The interrupt service routine (ISR) performs minimal I/O: it gets and puts register values. An interrupt service process/thread performs most of the device function (a sketch follows).
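
A hedged C sketch of this split. The device register address, the ring-buffer scheme, and the commented-out RTOS call are all assumptions; real ISR registration is platform-specific:

    #include <stdint.h>

    #define DEV_DATA (*(volatile uint8_t *)0x40001000) /* assumed register */
    #define BUF_SIZE 64u

    static volatile uint8_t rx_buf[BUF_SIZE];
    static volatile unsigned rx_head, rx_tail;

    /* ISR: minimal I/O only -- copy one register value and return. */
    void device_isr(void)
    {
        rx_buf[rx_head++ % BUF_SIZE] = DEV_DATA;
        /* wake_service_thread();  hypothetical RTOS primitive */
    }

    /* Service thread: performs most of the device function. */
    void device_service_thread(void)
    {
        while (rx_tail != rx_head) {
            uint8_t byte = rx_buf[rx_tail++ % BUF_SIZE];
            /* ... parse protocol, update device state, notify clients ... */
            (void)byte;
        }
    }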

53 Summary
Two major scheduling policies: RMS and EDF. Other factors affecting scheduling: priority inversion, data dependencies, context-switching time, and interrupts.

54 Assignment
Questions 6.17 to 6.26 from Chapter 6 of the textbook. Due date: two weeks from now, Sunday, Azar 27th.

