Hardware/Software Codesign of Embedded Systems


1 Hardware/Software Codesign of Embedded Systems
PARTITIONING — Voicu Groza, SITE Hall, Room 5017, ext. 2159

2 Outline
Allocation of system components
Estimation: metrics and cost functions; how good is the estimation?
Partitioning: basic algorithms, HW partitioning algorithms, HW/SW partitioning algorithms, system partitioning algorithms

3 Hardware/Software Codesign

4 Functionality to be implemented in software or in hardware?
The decision is based on hardware/software partitioning, a special case of hardware/software codesign.

5 Exploration: Allocation, Partitioning, Transformation, Estimation
These problems are not solved in the given order; instead, we iterate many times before we are satisfied with the system-level design.

6 The Partitioning Problem
Definition: the partitioning problem is to assign n objects O = {o1, ..., on} to m blocks (also called partitions) P = {p1, ..., pm}, such that
p1 ∪ p2 ∪ ... ∪ pm = O,
pi ∩ pj = ∅ for all i, j with i ≠ j,
and the cost c(P) is minimized.
In system synthesis: objects = problem graph nodes; blocks = architecture graph nodes.
[L. Thiele, ETH — Swiss Federal Institute of Technology]
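The two set conditions (the blocks cover all objects and are pairwise disjoint) can be checked mechanically. A minimal sketch in Python; the object and block names are illustrative:

```python
def is_valid_partition(objects, blocks):
    """Check that blocks cover all objects and are pairwise disjoint."""
    union = set()
    for block in blocks:
        if union & block:          # overlap with a previously seen block
            return False
        union |= block
    return union == set(objects)

# Example: 4 objects split into 2 blocks.
objs = {"o1", "o2", "o3", "o4"}
print(is_valid_partition(objs, [{"o1", "o3"}, {"o2", "o4"}]))  # True
print(is_valid_partition(objs, [{"o1"}, {"o1", "o2"}]))        # False (overlap)
```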

7 Quality Metrics
HW cost metrics: design area (# transistors, gates, registers, etc.); packaging cost (# pins)
SW cost metrics: program memory size; data memory size
Performance metrics
Other metrics

8 Cost Functions
A cost function measures the quality of a design point; it may include:
C ... system cost in [$]
L ... latency in [sec]
P ... power consumption in [W]
Estimation is required to find C, L and P.
Example: a linear cost function with penalty terms,
f = k1·hC(C) + k2·hL(L) + k3·hP(P),
where hC, hL, hP denote how strongly C, L, P violate the design constraints Cmax, Lmax, Pmax, and k1, k2, k3 provide weighting and normalization.

9 Cost Functions (cont.)
Costfct = k1 · F(component1.size, component1.size_constr)
        + k2 · F(component2.size, component2.size_constr)
        + k3 · F(component1.IO, component1.IO_constr) + ...
The k's are user-provided constants indicating the relative importance of each metric, and F indicates the desirability of a metric's value. A common form of F returns the degree of constraint violation, normalized such that 0 = no violation and 1 = very large violation. This form of F causes the cost function to return zero when a partition meets all constraints, making the goal of partitioning to obtain a cost of zero.
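A hedged sketch of such a cost function, with F returning the normalized degree of constraint violation; the metric names, constants, and the exact normalization are illustrative, not taken from the slide:

```python
def violation(value, constraint):
    """Degree of constraint violation, normalized to [0, 1]:
    0 = constraint met, 1 = very large violation (here: >= 100% over)."""
    if value <= constraint:
        return 0.0
    excess = (value - constraint) / constraint
    return min(excess, 1.0)

def cost(partition_metrics, constraints, weights):
    """Weighted sum of normalized violations; 0 means all constraints met."""
    return sum(weights[m] * violation(partition_metrics[m], constraints[m])
               for m in partition_metrics)

metrics     = {"hw_size": 1200, "io_pins": 40}
constraints = {"hw_size": 1000, "io_pins": 64}
weights     = {"hw_size": 1.0,  "io_pins": 0.5}
print(cost(metrics, constraints, weights))  # 0.2: only hw_size is violated
```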

10 Behavior Closeness Metrics
Connectivity: based on the number of wires shared between the sets of behaviors. Grouping behaviors that share wires should result in fewer pins.
Communication: based on the number of bits of data transferred between the sets of behaviors, independent of the number of wires used to transfer the data. Grouping heavily communicating behaviors should result in better performance, due to decreased communication time.
Hardware sharing: based on the estimated percentage of hardware that can be shared between two sets of behaviors. Grouping behaviors that can share hardware should result in a smaller overall hardware size.

11 Behavior Closeness Metrics (cont.)
Common accessors: based on the number of behaviors that access both sets of behaviors. Grouping such behaviors should result in fewer overall wires.
Sequential execution: based on the ability to execute behaviors sequentially without loss in performance.
Constrained communication: based on the amount of communication between the sets of behaviors that contributes to each performance constraint. Grouping such behaviors should help ensure that performance constraints are met.
Balanced size: based on the size of the sets of behaviors. Grouping smaller behaviors should eventually lead to groups of balanced size.

12 Performance
A behavior's execution time is calculated as the sum of the behavior's internal computation time (ict) and its communication time (commtime). The ict is the execution time on a particular component, assuming all accessed behaviors and variables take zero time. The communication time includes the time to transfer data to/from accessed behaviors and variables, and the time for such accessed behaviors to execute (e.g., the time for a called procedure to execute and return). This model leads to some inaccuracy, since some computation and communication could occur in parallel, but it provides reasonable accuracy while enabling rapid estimation.

13 Performance Metrics
Effect of the clock cycle on execution time and resources required.

14 Execution Time
b.exectime = b.ict_p + b.commtime
b.commtime = Σ (over ck ∈ b.outchannels) ck.accfreq × (ck.ttime_bus + (ck.dst).exectime)
ck.ttime_bus = bus.time × ⌈ck.bits / bus.width⌉
bus.time = bus.time_same if (ck.dst).p = p; bus.time_diff otherwise.
Pre-estimation: a behavior's ict is based on profiling, which determines the execution count of each basic block (a sequence of statements not containing a branch).
Online estimation: given a partition of every functional object to a component, the actual ict, bus values, and bus times become known, so the execution time can be evaluated.

15 HW Estimation Model
CLOCK CYCLE ESTIMATION:
clk ≥ delay(SR) + delay(CL) + delay(RF) + delay(Mux) + delay(FU) + delay(NS) + delay(SR) + setup(SR) + delay(ni)
Maximum-operator-delay method: clk(MOD) ≥ max_i [delay(ti)]

16 Control Step Estimation: Operator-use Method
Estimate the number of control steps required to execute a behavior, given the resources. The method partitions all statements into a set of nodes such that all statements in a node can be executed concurrently.
Example statements:
u1 := u x dx; u2 := 5 x w; u3 := 3 x y; y1 := i x dx; w := w + dx; u4 := u1 x u2; u5 := dx x u3; y := y + y1; u6 := u - u4; u := u6 - u5
(Table: for each operator type t — add, mult, sub — the number of available units num(t) and cycles clocks(t).)
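Given the concurrency nodes, the operator-use method estimates control steps per node from the available units. A sketch; the grouping into nodes, unit counts, and clock counts below are assumed for illustration, not taken from the slide's table:

```python
import math

def csteps_for_node(op_uses, units, clocks):
    """Control steps for one concurrency node: each operator type t needs
    ceil(uses(t) / num(t)) issue rounds of clocks(t) cycles each."""
    return max((math.ceil(n / units[t]) * clocks[t]
                for t, n in op_uses.items()), default=0)

def csteps(nodes, units, clocks):
    """Nodes execute one after another; total steps is the sum."""
    return sum(csteps_for_node(node, units, clocks) for node in nodes)

# One possible grouping of the example statements (operator counts per node):
nodes = [
    {"*": 4},          # u1, u2, u3, y1 in parallel
    {"*": 2, "+": 2},  # u4, u5, w, y
    {"-": 1},          # u6
    {"-": 1},          # u
]
units  = {"*": 2, "+": 1, "-": 1}   # assumed resource allocation
clocks = {"*": 2, "+": 1, "-": 1}   # assumed cycles per operation
print(csteps(nodes, units, clocks))  # 4 + 2 + 1 + 1 = 8
```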

17 Clock Cycle Estimation: Clock Slack
Minimize the idle time of functional units. Clock slack = the portion of the clock cycle for which the FU is idle.
delay(+) = 48 ns, delay(-) = 56 ns, delay(x) = 163 ns

18 Slack-minimization Method
For a 65 ns clock:
slack(65, x) = (3 x 65) - 163 = 32 ns
slack(65, -) = (1 x 65) - 56 = 9 ns
slack(65, +) = (1 x 65) - 48 = 17 ns
utilization(65 ns) = 1 - (average slack)/65 = 1 - 24.4/65 = 0.62 = 62%
The clock utilization computation is repeated for all clock values from 14 ns to 163 ns; the maximum of 92% was achieved at a clock of 56 ns.
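The slack computation can be reproduced directly. A sketch assuming the example's operation counts (6 multiplies, 2 subtracts, 2 adds, as in the statements of slide 16) and the delays above; with these assumptions the 62% utilization at a 65 ns clock matches, while the slide's 92% optimum may rest on a slightly different utilization definition or counts:

```python
import math

delays = {"*": 163, "-": 56, "+": 48}   # ns
counts = {"*": 6, "-": 2, "+": 2}       # assumed operation occurrences

def slack(clk, op):
    """Idle time within the clock cycles occupied by one operation."""
    return math.ceil(delays[op] / clk) * clk - delays[op]

def utilization(clk):
    total_ops = sum(counts.values())
    avg_slack = sum(counts[op] * slack(clk, op) for op in delays) / total_ops
    return 1 - avg_slack / clk

print(round(utilization(65), 2))   # 0.62
best = max(range(14, 164), key=utilization)
print(best, round(utilization(best), 2))
```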

19 Control Steps
The control unit sequences operations through a series of control steps; one control step corresponds to a single state. The number of control steps affects the complexity of the control logic in the implementation.
Communication: a message generated by one behavior (the producer) is received by another behavior (the consumer); channel metrics include avgrate(C) and peakrate(C). Further metrics: execution time and inter-event timing.
(Figure: behavior B ("begin A := A + 1 end B") in process P communicating with process Q over channel C, with constraints such as min 50 ns in time, max 10 ms out time, and max 10 Mb/s on the channel.)

20 Functionality Partitioning
Hardware partitioning techniques aim to partition functionality among hardware modules (ASICs or blocks on an ASIC). Most such techniques partition at the granularity of arithmetic operations. Partitioning functionality among a hardware/software architecture is done at the level of: statements, statement sequences, or subroutines/tasks.

21 Manual Partitioning
Provide the ability to manually relocate objects.
Allow user control of the relative weights of the various metrics in the cost function.
Automatically provide hints of what changes might yield improvements to the current partition: closeness hints provide a list of object pairs, sorted by the closeness of the objects in each pair; closeness is based on a weighted function of various closeness metrics.

22 Hardware/Software Partitioning
Functionality to be implemented in software or in hardware?
No need to consider special-purpose hardware in the long run? Correct for fixed functionality, but wrong in general, since "by the time MPEG-n can be implemented in software, MPEG-n+1 has been invented" [de Man].

23 General Partitioning Methods
Exact methods: enumeration; Integer Linear Programming (ILP)
Heuristic methods:
- constructive methods: random mapping, hierarchical clustering
- iterative methods: Simulated Annealing, Evolutionary Algorithms (EA)

24 Example of HW/SW Partitioning
Inputs: target technology, design constraints, required behavior.

25 HW/SW Codesign: Approach
(Figure: a specification is mapped onto processors P1 and P2 and dedicated hardware.)
[Niemann, Hardware/Software Co-Design for Data Flow Dominated Embedded Systems, Kluwer Academic Publishers, 1998 — a comprehensive mathematical model.]

26 Steps of a Partitioning Algorithm (1)
Translation of the behavior into an internal graph model.
Translation of the behavior of each node from VHDL into C.
Compilation: all C programs are compiled for the target processor; the resulting program size is computed and the resulting execution time estimated (simulation input data might be required).
Synthesis of hardware components: for each leaf node, application-specific hardware is synthesized. High-level synthesis is sufficiently fast.

27 Steps of a Partitioning Algorithm (2)
Flattening of the hierarchy: the granularity used by the designer is maintained; cost and performance information is added to the nodes, so the precise information required for partitioning is pre-computed.
Generating and solving a mathematical model of the optimization problem: an integer programming (IP) model for optimization, optimal with respect to the cost function (communication time is approximated).

28 Steps of a partitioning algorithm (3)
Iterative improvements: Adjacent nodes mapped to the same hardware component are now merged.

29 Steps of a partitioning algorithm (4)
Interface synthesis: After partitioning, the glue logic required for interfacing processors, application-specific hardware and memories is created.

30 Integer Programming Models
Ingredients: a cost function and constraints, both involving linear expressions of integer variables from a set X = {xi}.
Cost function: C = Σ ai·xi, with ai ∈ ℝ, xi ∈ ℕ  (1)
Constraints: ∀ j ∈ J: Σ bi,j·xi ≥ cj, with bi,j, cj ∈ ℝ  (2)
Def.: the problem of minimizing (1) subject to the constraints (2) is called an integer programming (IP) problem. If all xi are constrained to be either 0 or 1, the IP problem is said to be a 0/1 integer programming problem.

31 Example
(Figure: a small IP instance with its cost function C and the optimal solution.)
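Small 0/1 IP instances can be solved by plain enumeration. A sketch; the instance below is illustrative, not the slide's original example:

```python
from itertools import product

def solve_01_ip(costs, constraints):
    """Minimize sum(a_i * x_i) subject to sum(b_ij * x_i) >= c_j, x_i in {0,1}.
    'constraints' is a list of (coefficient vector b_j, bound c_j) pairs."""
    n = len(costs)
    best_x, best_cost = None, float("inf")
    for x in product((0, 1), repeat=n):  # enumerate all 2^n assignments
        if all(sum(b[i] * x[i] for i in range(n)) >= c for b, c in constraints):
            total = sum(costs[i] * x[i] for i in range(n))
            if total < best_cost:
                best_x, best_cost = x, total
    return best_x, best_cost

# Minimize 5*x1 + 6*x2 + 4*x3  subject to  x1 + x2 + x3 >= 2
x, c = solve_01_ip([5, 6, 4], [([1, 1, 1], 2)])
print(x, c)  # (1, 0, 1) 9
```

Exhaustive search is only viable for a handful of variables; real partitioners hand models like this to an IP solver.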

32 Remarks on Integer Programming
Maximizing the cost function can be done by setting C' = -C.
Integer programming is NP-complete. In practice, running times can increase exponentially with the size of the problem, but problems of some thousands of variables can still be solved with commercial solvers, depending on the size and structure of the problem.
IP models can be a good starting point for modelling, even if in the end heuristics have to be used to solve them.

33 An IP Model for HW/SW Partitioning
Notation:
Index set I denotes task graph nodes; each i ∈ I corresponds to a task graph node.
Index set L denotes task graph node types; each ℓ ∈ L corresponds to a task graph node type, e.g. square root, DCT (Discrete Cosine Transform), or FFT.
Index set KH denotes hardware component types; e.g. there is one index value k1 ∈ KH for the DCT hardware component type and another one k2 ∈ KH for the FFT hardware component type. For each hardware component type there may be multiple copies or "instances"; each instance is identified by an index j ∈ J.
Index set KP denotes processors; all processors are assumed to be of the same type.

34 An IP Model for HW/SW Partitioning (cont.)
Xi,k = 1 if node vi is mapped to hardware component type k ∈ KH, and 0 otherwise.
Yi,k = 1 if node vi is mapped to processor k ∈ KP, and 0 otherwise.
NYℓ,k = 1 if at least one node of type ℓ is mapped to processor k ∈ KP, and 0 otherwise.
T is a mapping from task graph nodes to their types: T : I → L.
The cost function accumulates the cost of hardware units:
C = cost(processors) + cost(memories) + cost(application-specific hardware)

35 Operation Assignment Constraints (1)
All task graph nodes have to be mapped either to software or to hardware:
∀ i ∈ I: Σ (k ∈ KH) Xi,k + Σ (k ∈ KP) Yi,k = 1
All decision variables (Xi,k and Yi,k) are assumed to be non-negative integers; together with this constraint, they are guaranteed to be either 0 or 1.

36 Operation Assignment Constraints (2)
∀ ℓ ∈ L, ∀ i : T(vi) = ℓ, ∀ k ∈ KP : NYℓ,k ≥ Yi,k
For all types ℓ of operations and for all nodes i of this type: if i is mapped to some processor k (i.e., Yi,k = 1), then that processor must implement the functionality of ℓ, i.e., a copy of the SW that implements that functionality must be in the processor's memory.
Decision variables must also be 0/1 variables: ∀ ℓ ∈ L, ∀ k ∈ KP : NYℓ,k ≤ 1.

37 Resource & Design Constraints
∀ k ∈ KH: the cost (area) used for components of that type, calculated as the sum of the costs of the components of that type, should not exceed its maximum.
∀ k ∈ KP: the cost for the associated data storage area should not exceed its maximum.
∀ k ∈ KP: the cost for storing instructions should not exceed its maximum.
The total cost (over k ∈ KH) of HW components should not exceed its maximum.
The total cost of data memories (over k ∈ KP) should not exceed its maximum.
The total cost of instruction memories (over k ∈ KP) should not exceed its maximum.

38 Scheduling / Precedence Constraints
For all nodes vi1 and vi2 that are potentially mapped to the same processor or hardware component instance, introduce a binary decision variable bi1,i2 with bi1,i2 = 1 if vi1 is executed before vi2, and bi1,i2 = 0 otherwise.
Define constraints of the type:
(end time of vi1) ≤ (start time of vi2) if bi1,i2 = 1
(end time of vi2) ≤ (start time of vi1) if bi1,i2 = 0
Ensure that the schedule for executing operations is consistent with the precedence constraints in the task graph.

39 Other Constraints
Timing constraints: these can be used to guarantee that certain time constraints are met. Some less important constraints are omitted here.

40 Example
HW types H1, H2 and H3 with costs of 20, 25, and 30. Processors of type P. Tasks T1 to T5.
(Table: execution times of tasks T1-T5 on H1 ($20), H2 ($25), H3 ($30), and P.)

41 Operation Assignment Constraints (1)
A maximum of one processor (P1) is used:
X1,1 + Y1,1 = 1 (task 1 is mapped either to H1 or to P1)
X2,2 + Y2,1 = 1 (task 2 is mapped either to H2 or to P1)
X3,3 + Y3,1 = 1 (task 3 is mapped either to H3 or to P1)
X4,3 + Y4,1 = 1 (task 4 is mapped either to H3 or to P1)
X5,1 + Y5,1 = 1 (task 5 is mapped either to H1 or to P1)

42 Operation Assignment Constraints (2)
Assume that the types of tasks T1 to T5 are ℓ = 1, 2, 3, 3, and 1, respectively. Then, from ∀ ℓ ∈ L, ∀ i : T(vi) = ℓ, ∀ k ∈ KP : NYℓ,k ≥ Yi,k:
If node 1 (T1) is mapped to the processor P1, then the function ℓ = 1 must be implemented on that processor; the same holds for task T5, which is of the same type.

43 Other Equations
Time constraints lead to: application-specific hardware is required for time constraints under 100 time units.
#(...) represents the number of instances of HW components.
Cost function: C = 20·#(H1) + 25·#(H2) + 30·#(H3) + cost(processor) + cost(memory)

44 Result
For a time constraint of 100 time units and cost(P) < cost(H3), the solution (by educated guessing) is:
T1 → H1, T2 → H2, T3 → P, T4 → P, T5 → H1

45 Separation of Scheduling and Partitioning
Combined scheduling/partitioning is very complex, hence the heuristic:
1. Compute an estimated schedule.
2. Perform partitioning for the estimated schedule.
3. Perform final scheduling.
4. If the final schedule does not meet the time constraint, go to 1 using a reduced overall timing constraint.
(Figure: specification, approximate execution time in the 1st iteration, actual execution time, and the new specification for the 2nd iteration.)

46 Application Example
Audio lab (mixer, fader, echo, equalizer, balance units); slow SPARC processor; 1 µ ASIC library; allowable delay of … µs (~ 44.1 kHz).
Architecture: SPARC processor, ASIC (Compass, 1 µ), external memory.
Outdated technology; just a proof of concept.

47 Running time for COOL optimization
Only simple models can be solved optimally.

48 Deviation from optimal design
Hardly any loss in design quality.

49 Running time for heuristic

50 Design Space for Audio Lab
Everything in software: … µs, … (area)
Everything in hardware: … µs, 457.9×10⁶ (area)
Lowest cost for the given sample rate: … µs, … ×10⁶ (area)

51 Final Remarks
COOL approach: shows that a formal model of hardware/software codesign is beneficial; IP modeling can lead to a useful implementation even if the optimal result is available only for small designs.
Other approaches for HW/SW partitioning:
- start with everything mapped to hardware; gradually move objects to software as long as the timing constraint is met;
- start with everything mapped to software; gradually move objects to hardware until the timing constraint is met;
- binary search.
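The "start in software, move to hardware until the timing constraint is met" approach can be sketched as a greedy loop. All task names, numbers, and the additive latency model are illustrative assumptions, not from the slide:

```python
def greedy_partition(tasks, deadline):
    """tasks: {name: (sw_time, hw_time)}. Start with everything in SW; while
    the latency misses the deadline, move the task with the largest SW-vs-HW
    saving into hardware."""
    in_hw = set()

    def latency():
        # Crude model: execution times simply add up (no parallelism,
        # no communication cost).
        return sum(hw if t in in_hw else sw for t, (sw, hw) in tasks.items())

    while latency() > deadline:
        candidates = [t for t in tasks if t not in in_hw]
        if not candidates:
            return None  # infeasible even with everything in hardware
        best = max(candidates, key=lambda t: tasks[t][0] - tasks[t][1])
        in_hw.add(best)
    return in_hw

tasks = {"T1": (50, 10), "T2": (30, 20), "T3": (40, 5)}
print(sorted(greedy_partition(tasks, deadline=70)))  # ['T1', 'T3']
```

Greedy moves by largest saving are a heuristic; unlike the IP model, they carry no optimality guarantee.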

52 System Level Hardware/Software Partitioning Based on Simulated Annealing and Tabu Search
Petru Eles¹, Zebo Peng¹, Krzysztof Kuchcinski¹, Alex Doboli²
¹ Embedded Systems Laboratory (ESLAB), Department of Computer and Information Science, Linköping University, Sweden
² VLSI Systems Design Lab, Electrical and Computer Engineering Department, State University of New York at Stony Brook, USA

53 Outline
Partitioning is performed at the granularity of blocks, loops, subprograms, and processes, and is formulated as a graph partitioning problem. Simulated annealing and tabu search are employed; metric values are defined; real-life examples are evaluated.

54 Co-Synthesis Environment
Accepts input designs specified in an extended VHDL. Processes are the basic modules; they interact using a synchronous message-passing mechanism with predefined send/receive commands. Hardware processes are produced by high-level synthesis. Communication channels are VHDL signals. Communication interfaces between processes can be modified during automatic partitioning, when new processes are created or functionality is moved from one process to another.

55 Target Architecture 1. There is a single programmable component (microprocessor) executing the software processes (with a run-time system performing dynamic scheduling); 2. The microprocessor and the hardware coprocessor are working in parallel (the architecture does not enforce a mutual exclusion between the software and hardware); 3. Reducing the amount of communication between the microprocessor (software partition) and the hardware coprocessor (hardware partition) improves the overall performance of the application.

56 Partitioning Objectives
1. To identify basic regions (processes, subprograms, loops, and blocks of statements) which are responsible for most of the execution time, in order to assign them to the hardware partition; 2. To minimize communication between the hardware and software domains; 3. To increase parallelism within the resulting system at the following three levels: - internal parallelism of each hardware process (during high-level synthesis, operations are scheduled to be executed in parallel by the available functional units); - parallelism between processes assigned to the hardware partition; - parallelism between the hardware coprocessor and the microprocessor executing the software processes.

57 Statistics Used
1. The computation load (CL) of a basic region is a quantitative measure of the total computation executed by that region, considering all its activations during the simulation process. It is expressed as the total number of operations (at the level of the internal representation) executed inside that region, where each operation is weighted with a coefficient depending on its relative complexity:
CLi = Σ (opj ∈ BRi) N_actj × w(opj)
where N_actj is the number of activations of operation opj belonging to the basic region BRi, and w(opj) is the weight associated to that operation.
The relative computation load (RCL) of a block of statements, loop, or subprogram is the computation load of the respective basic region divided by the computation load of the process the region belongs to. The relative computation load of a process is the computation load of that process divided by the total computation load of the system.
2. The communication intensity (CI) on a channel connecting two processes is expressed as the total number of send operations executed on the respective channel.
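The CL/RCL definitions translate directly into code. A sketch; the operation weights and activation counts are assumed for illustration:

```python
def computation_load(activations, weights):
    """CL_i = sum over operations j of N_act_j * w(op_j)."""
    return sum(n_act * weights[op] for op, n_act in activations.items())

# Operation weights by relative complexity (assumed values).
weights = {"add": 1, "mul": 2, "div": 8}

region  = {"add": 100, "mul": 50}              # activations inside one region
process = {"add": 300, "mul": 100, "div": 10}  # activations in the whole process

cl_region  = computation_load(region, weights)    # 100*1 + 50*2 = 200
cl_process = computation_load(process, weights)   # 300 + 200 + 80 = 580
rcl = cl_region / cl_process                      # relative computation load
print(cl_region, cl_process, round(rcl, 3))  # 200 580 0.345
```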

58 Partitioning Steps
1. Extraction of blocks of statements, loops, and subprograms:
- by identifying a certain region to be extracted and assigning it to the hardware or software partition, or
- by imposing two boundary values: a threshold X on the RCL (relative computation load) of processes that are examined for basic-region extraction, and a threshold Y on the RCL of a block, loop, or subprogram to be considered for basic-region extraction.
2. The process graph is generated as the internal structure.
3. Partitioning of the process graph: HW/SW partitioning is formulated as a graph partitioning problem.
4. Process merging: during the first step, one or several child processes are possibly extracted from a parent process. If, as a result of step 3, some of the child processes are assigned to the same partition as their parent process, they are, optionally, merged back together.

59 Process Graph
Each node in the graph corresponds to a process, and an edge connects two nodes if and only if there exists at least one direct communication channel between the corresponding processes. The graph partitioning algorithm takes into account weights associated to each node and edge. Node weights reflect the degree of suitability for hardware implementation of the corresponding process. Edge weights measure communication and mutual synchronization between processes.
Information extracted from static analysis of the system specification, or of the internal representation resulting after its compilation:
Nr_opi: total number of operations in the dataflow graph of process i;
Nr_kind_opi: number of different operations in process i;
L_pathi: length of the critical path through process i.

60 Process (Computation) Weights
The weight assigned to process node i has two components. The first is equal to the CL of the respective process. The second is calculated by a formula combining: the RCL of process i (a measure of the computation load); a measure of the uniformity of operations in process i; a measure of the potential parallelism inside process i; and a term capturing the suitability of the operations of process i for a SW implementation.

61 Edge (Communication) Weights
The weight assigned to an edge connecting nodes i and j has two components, both depending on the amount of communication between processes i and j:
- the first is a measure of the total data quantity transferred between the two processes;
- the second does not consider the number of bits transferred, but only the degree of synchronization between the processes, expressed as the total number of mutual interactions they are involved in;
where Chij is the set of channels used for communication between processes i and j, wdck is the width (number of transported bits) of channel ck, and CIck is the communication intensity on channel ck.

62 Cost Function
The HW/SW partitioning heuristics are guided by a cost function whose terms capture:
- the total amount of HW–SW communication (over the set of edges crossing the cut);
- a term that stimulates placement into HW of processes which have reduced interaction with the rest of the system;
- a term that pushes processes with a high node weight into the HW partition and those with a low node weight into SW, by increasing the difference between the average weight of nodes in the two partitions.
Here Hw and Sw are the sets representing the HW and SW partitions; NH and NS are the cardinalities of the two sets; cut is the set of edges connecting the two partitions; (ij) is the edge connecting nodes i and j; and (i) represents node i.
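A hedged sketch of a cost function with this structure — cut cost minus the HW/SW average-node-weight gap. The exact weighting of the paper is not reproduced; the node and edge weights below are illustrative:

```python
def partition_cost(node_w, edge_w, hw, sw, q1=1.0, q2=1.0):
    """Lower is better: penalize communication across the cut, reward a large
    gap between average node weights in HW (high) and SW (low)."""
    cut = sum(w for (i, j), w in edge_w.items()
              if (i in hw) != (j in hw))           # edges crossing the cut
    avg_hw = sum(node_w[i] for i in hw) / len(hw) if hw else 0.0
    avg_sw = sum(node_w[i] for i in sw) / len(sw) if sw else 0.0
    return q1 * cut - q2 * (avg_hw - avg_sw)

node_w = {"P1": 9.0, "P2": 8.0, "P3": 2.0, "P4": 1.0}
edge_w = {("P1", "P2"): 5.0, ("P2", "P3"): 1.0, ("P3", "P4"): 4.0}
good = partition_cost(node_w, edge_w, hw={"P1", "P2"}, sw={"P3", "P4"})
bad  = partition_cost(node_w, edge_w, hw={"P1", "P3"}, sw={"P2", "P4"})
print(good, bad)  # the heavy, tightly coupled pair in HW scores lower (better)
```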

63 Cost Constraints
The total HW and SW cost have to be within specified limits. Nodes with a weight smaller than a given limit have to go into SW, and those with a weight greater than a certain limit have to be assigned to HW. Cost estimation has to be performed before graph partitioning, for both the HW implementation alternative (in terms of design area) and the SW implementation alternative (in terms of memory size) of each process.

64 Simulated Annealing Algorithm
We take random walks through the problem space, looking for points with low energies. The probability of taking a step is determined by the Boltzmann distribution:
p = 1 if E(i+1) < E(i); p = e^(-(E(i+1) - E(i)) / kT) otherwise.
In other words, a step will always occur if the new energy is lower. If the new energy is higher, the transition can still occur; its likelihood grows with the temperature T and shrinks with the energy difference E(i+1) - E(i). This probability of taking a step that gives higher energy is what allows simulated annealing to escape local minima.
An initial guess is supplied. The temperature T is initially set to a high value, and a random walk is carried out at that temperature. Then the temperature is lowered slightly according to a cooling schedule, for example T → T/µT where µT is slightly greater than 1, and another random walk is taken.

65 Simulated Annealing Algorithm
Step 1: construct an initial configuration xnow := (Hw0, Sw0)
Step 2: initialize the temperature T := TI
Step 3: for i := 1 to TL do
  generate randomly a neighboring solution x' ∈ N(xnow)
  compute the change of the cost function ΔC := C(x') - C(xnow)
  if ΔC ≤ 0 then xnow := x'
  else generate q := random(0, 1); if q < e^(-ΔC/T) then xnow := x'
Step 3.2: set the new temperature T := a · T
Step 4: if the stopping criterion is not met, go to Step 3
Step 5: return the solution corresponding to the minimum cost function

66 Notation
x denotes one solution, consisting of the two sets Hw and Sw. xnow represents the current solution, and N(xnow) denotes the neighborhood of xnow in the solution space.
Stopping criterion: the system is considered frozen if no new solution has been accepted for three consecutive temperatures.
Neighborhood move: a node is randomly selected to be moved to the other partition; the configuration resulting from this move becomes the candidate solution x'. Random node selection is repeated if the transfer of the selected node violates some design constraint.
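The algorithm of slide 65, with the random node move and the three-frozen-temperatures stopping criterion, can be written out directly. The toy cost function, parameters, and fixed seed are illustrative:

```python
import math
import random

def simulated_annealing(nodes, cost, t_init=100.0, alpha=0.95, tl=50):
    """Slide-65 scheme: at each temperature, try TL random neighbor moves;
    accept worse solutions with probability e^(-dC/T); cool by T := alpha*T.
    A solution maps each node to 'HW' or 'SW'."""
    random.seed(1)                           # fixed seed for reproducibility
    x_now = {n: "SW" for n in nodes}         # initial configuration
    best, best_cost = dict(x_now), cost(x_now)
    t, idle = t_init, 0
    while idle < 3:                          # frozen: 3 temps with no accept
        accepted = False
        for _ in range(tl):
            x_new = dict(x_now)
            n = random.choice(nodes)         # move one node across the cut
            x_new[n] = "HW" if x_new[n] == "SW" else "SW"
            dc = cost(x_new) - cost(x_now)
            if dc <= 0 or random.random() < math.exp(-dc / t):
                x_now, accepted = x_new, True
                if cost(x_now) < best_cost:
                    best, best_cost = dict(x_now), cost(x_now)
        idle = 0 if accepted else idle + 1
        t *= alpha
    return best, best_cost

# Toy cost: distance to a known target partition (for demonstration only).
target = {"a": "HW", "b": "HW", "c": "SW", "d": "SW"}
cost_fn = lambda x: sum(x[n] != target[n] for n in x)
sol, c = simulated_annealing(list(target), cost_fn)
print(sol == target, c)
```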

67 Specification Refinement
System specification consists of functional objects: behaviors, variables, communication channels System design: group these functional objects into a set of system components such as processors, ASICs, memories, buses Updating the specification to reflect the transformation of the functional objects into system components is called specification refinement.

68 Refining Variable Grouping: Variable Folding/Memory Width Mapping

69 Other Refining Operations
Channel refinement (bus width, bus rate, etc.)
Refining incompatible interfaces: communication protocols; transducer synthesis (glue logic that connects two blocks); protocol converters.

70 Scheduling and Communication
(Figure: task graph nodes v1-v11 with FIR1 and FIR2 blocks mapped onto processor p1 and ASIC h1; edges e3 and e4 are scheduled on communication channel c1.)
Overview of the codesign tool COOL: COOL consists of a specification tool, COSYS, and a partitioning tool, COPA. COSYS leads to an intermediate system representation from which COPA calculates a partitioned system; this partitioned system then has to be realized and simulated with the traditional cosynthesis approach.

71 Design Quality Estimation

72 Estimation
More accurate estimates require more time!
Pre-estimation: each functional object (behavior, variable, channel) is annotated with information, such as the number of bytes for a behavior when compiled to a particular processor, the average frequency of channel access, or the number of channel bits. Pre-estimation occurs only once, at the beginning of exploration, is independent of any particular partition and allocation, and may take seconds to minutes.
Online estimation: pre-estimated annotations are combined in complex expressions to obtain metric values for a particular partition and allocation. Online estimation occurs hundreds or thousands of times during manual or automated exploration, so it must take milliseconds.

73 Typical Estimation Models
Design model and additional tasks (accuracy and fidelity increase, speed decreases, down the list):
- Mem: memory allocation (low accuracy, fast)
- Mem + FUs: FU allocation
- Mem + FUs + Reg: lifetime analysis
- Mem + FUs + Reg + Muxes: FU binding
- Mem + FUs + Reg + Muxes + Wiring: floor planning (high accuracy, slow)
(Mem = memories, FUs = functional units, Reg = registers, Muxes = multiplexers)

74 Accuracy vs. Speed
Accuracy (A) measures how close the estimate (E) is to the actual value (M) of the metric measured after design implementation (D): A = 1 - |E(D) - M(D)| / M(D).
Simplified estimation models yield fast estimators, but with less accuracy.

75 Fidelity of Estimation
Fidelity = the percentage of correctly predicted comparisons between design implementations.
Let {d1, ..., dn} be a set of implementations of a given specification. Define
µij = 1 if the estimates E(di), E(dj) are ordered the same way as the measured values M(di), M(dj), and 0 otherwise.
The fidelity F of an estimation method can then be defined as the percentage of correct predictions:
F = 100 × (2 / (n(n-1))) × Σ (i<j) µij

76 Estimation Fidelity
(Figure: a quality metric plotted for design points A, B, C, measured vs. estimated; one estimator orders all pairs correctly, giving 100% fidelity, the other only one of the three pairs, giving 33%.)
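Fidelity as the percentage of correctly predicted pairwise comparisons can be computed directly. The estimate/measured values below are illustrative stand-ins for the A, B, C example:

```python
from itertools import combinations

def fidelity(estimated, measured):
    """Percentage of design-point pairs whose ordering the estimator predicts
    correctly (mu_ij = 1 iff estimate and measurement agree on the pair)."""
    def sign(x):
        return (x > 0) - (x < 0)
    pairs = list(combinations(range(len(estimated)), 2))
    correct = sum(sign(estimated[i] - estimated[j]) ==
                  sign(measured[i] - measured[j]) for i, j in pairs)
    return 100.0 * correct / len(pairs)

# Three design points A, B, C.
measured = [10.0, 20.0, 15.0]
good_est = [12.0, 25.0, 18.0]   # preserves every pairwise ordering
poor_est = [20.0, 10.0, 5.0]    # gets only one of three pairs right
print(fidelity(good_est, measured), round(fidelity(poor_est, measured), 1))
# 100.0 33.3
```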

77 SpecSyn
Daniel D. Gajski, Frank Vahid, Sanjiv Narayan and Jie Gong, University of California, Irvine

78 Arbiter Generation
A refinement task that inserts an arbitration mechanism into the specification whenever there is resource contention in the system.
Arbitration models:
- static arbitration model: accesses by a behavior are assigned to a specific port of the memory;
- dynamic arbitration model: behaviors may access the memory through different ports at different times, depending on their availability.

79 Static Arbitration Model
(Figure: behaviors P, Q, R access memory MEM via addr/data lines through MemArbiter ports port1 and port2.)
A behavior accesses data through the port assigned to it throughout the lifetime of the system.

80 Dynamic Arbitration Model
(Figure: the same structure as the static model, but the port assignment varies at run time.)
Pros: higher utilization of the two ports => faster execution times. Cons: requires a more complex implementation.

81 Arbitration Schemes: Fixed Priority
Statically assigns a priority to each behavior. The fixed priorities depend on some metric which is to be optimized; the mean waiting time can be approximated by metrics that can be evaluated relatively easily, such as the size of the data or the frequency of accesses made by the behavior to the shared resource. The system designer may use either a single criterion or a weighted combination of several criteria.

82 Arbitration Schemes: Dynamic Priority
Determines the priority of a behavior according to the state of the system at run time. Solutions:
- the round-robin scheme assigns the lowest priority to the behavior that most recently accessed the shared resource;
- the first-come-first-served scheme grants access to behaviors in the order they requested it.
Both are characterized by the absence of any absolute order in which behaviors are granted access to a resource. Dynamic arbitration schemes are expected to be fair, i.e., a behavior will not have to wait indefinitely to gain access to the shared resource.
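The round-robin scheme can be sketched as a simple arbiter model; the interface (requester ids, a grant per step) is an illustrative assumption:

```python
class RoundRobinArbiter:
    """Grant access to one of n requesters; the most recently served
    requester drops to the lowest priority."""
    def __init__(self, n):
        self.order = list(range(n))   # front of the list = highest priority

    def grant(self, requests):
        """requests: set of requester ids currently asking for the resource."""
        for req in self.order:
            if req in requests:
                # The served requester moves to the back of the priority order.
                self.order.remove(req)
                self.order.append(req)
                return req
        return None                   # no pending request

arb = RoundRobinArbiter(3)
print(arb.grant({0, 1, 2}))  # 0
print(arb.grant({0, 1, 2}))  # 1
print(arb.grant({0, 2}))     # 2
print(arb.grant({0, 2}))     # 0  -- 0 is highest priority again
```

Note the fairness property: after being served, a requester must wait for every other pending requester before being granted again.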

