
1 Optimization techniques for the development of embedded applications on single-chip multiprocessor platforms Michela Milano michela.milano@unibo.it DEIS Università di Bologna

2 Digital Convergence – Mobile Example Broadcasting, telematics, imaging, computing, communication, entertainment: one device, multiple functions, at the center of a ubiquitous media network. Smart mobile devices are the next driver for the semiconductor industry.

3 SoC: Enabler for Digital Convergence From today's SoCs to future SoCs: > 100X improvements in performance, low power, complexity, and storage.

4 Design as optimization Design space: the set of “all” possible design choices. Constraints: rule out solutions that we are not willing to accept. Cost function: a property we are interested in (execution time, power, reliability, ...).

5 Embedded system design (MOTIVATION & CONTEXT) System design with MPSoCs: exploit application and platform parallelism to achieve real-time performance. Design flow: given a platform description and an application abstraction, compute an allocation & schedule, then verify the results & perform changes. The allocation & scheduling algorithm plays a crucial role.

6 MPSoC platform (PROBLEM DESCRIPTION) Resources: identical processing elements (PEs) with DVS, local storage devices, remote on-chip memory, shared bus. Constraints: PE frequencies, local device capacity, bus bandwidth (additive resource), architecture-dependent constraints.

7 Application (PROBLEM DESCRIPTION) The application is a task graph: nodes are tasks/processes, arcs are data communications. Each task reads data for each ingoing arc (RD), performs some computation (EXEC), and writes data for each outgoing arc (WR). Each task is annotated with its communication buffer (local/remote), PE, program data (local/remote), and frequency.

8 Application (PROBLEM DESCRIPTION) Task durations depend on memory allocation (remote memory is slower than the local ones) and on the execution frequency. Different phases (RD, EXEC, WR) have different bus requirements.

9 Problem variants (PROBLEM VARIANTS) We focused on problem variants with different objective functions (bus traffic, energy consumption, makespan), graph features (pipelined, generic, generic with conditional branches), and frequency handling (DVS, no DVS).

10 Objective function (PROBLEM VARIANTS) Bus traffic: tasks produce traffic when they access the bus; it depends entirely on the memory allocation. Makespan: depends entirely on the computed schedule. Energy (DVS): the higher the frequency, the higher the energy consumption; frequency switching has a time & energy cost; energy is both allocation-dependent and schedule-dependent.

11 Graph structure (PROBLEM VARIANTS) Pipelined: typical of stream-processing applications. Generic. Generic with conditional branches: arcs are labelled with conditions (a, !a, b, !b) and branch probabilities (e.g. 0.3, 0.7, 0.5), which makes the problem stochastic; the objective function becomes an expected value.

12 Application Development Flow Characterization phase: a simulator produces application profiles and a conditional task graph (CTG). Optimization phase: the optimizer computes an allocation and a schedule, yielding an optimal SW application implementation. Application development support then leads to platform execution.

13 When & Why Offline Optimization? Plenty of design-time knowledge: applications are pre-characterized at design time, with dynamic transitions between different pre-characterized scenarios. Aggressive exploitation of system resources reduces overdesign (lowers cost) and gives strong performance guarantees. Applicable to many embedded applications.

14 Question: complete or incomplete solver? Can I solve the instances with a complete solver? Look at the structure of the problem and the average instance size. If yes, which technique should be used? If no, what is the quality of the heuristic solution proposed?

15 Optimization in system design Complete solvers find the optimal solution and prove its optimality. The system design community uses Integer Programming techniques for every optimization problem regardless of the structure of the problem itself, yet scheduling is poorly handled by IP. Incomplete solvers (problem decomposition with sequential solution of each subproblem, local search/metaheuristic algorithms) lack an estimated optimality gap and require a lot of tuning.

16 Optimization techniques We will consider two techniques: Constraint Programming and Integer Programming. A problem has two aspects: feasibility and optimality. By merging the two techniques, one can obtain better results.

17 Constraint Programming A relatively recent, declarative programming paradigm. It inherits from logic programming, operations research, software engineering, and AI constraint solving.

18 Constraint Programming Problem model: variables, domains, constraints. Problem solving: constraint propagation and search.

19 Mathematical Constraints Example 1: X::[1..10], Y::[5..15], X > Y. Arc-consistency: for each value v of X, there should be a value in the domain of Y consistent with v. After propagation: X::[6..10], Y::[5..9]. Example 2: X::[1..10], Y::[5..15], X = Y. After propagation: X::[5..10], Y::[5..10]. Example 3: X::[1..10], Y::[5..15], X ≠ Y. No propagation.
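As an illustration of the arc-consistency reasoning above, here is a minimal Python sketch of bounds propagation for X > Y on interval domains (the function and variable names are illustrative, not taken from any CP library):

```python
# Minimal sketch of bounds propagation for the constraint X > Y on interval domains.
def propagate_gt(x, y):
    """Prune the interval domains x = (lo, hi), y = (lo, hi) for X > Y."""
    x_lo, x_hi = x
    y_lo, y_hi = y
    x_lo = max(x_lo, y_lo + 1)   # X must exceed the smallest possible Y
    y_hi = min(y_hi, x_hi - 1)   # Y must stay below the largest possible X
    if x_lo > x_hi or y_lo > y_hi:
        raise ValueError("inconsistent: a domain became empty")
    return (x_lo, x_hi), (y_lo, y_hi)

# Example 1 above: X::[1..10], Y::[5..15], X > Y
print(propagate_gt((1, 10), (5, 15)))   # ((6, 10), (5, 9))
```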

20 Constraint Interaction Every variable is involved in many constraints: each change in the domain of a variable can affect many constraints (agent perspective). Example: X::[1..5], Y::[1..5], Z::[1..5] with X = Y + 1, Y = Z + 1, X = Z - 1.

21 Constraint Interaction X::[1..5], Y::[1..5], Z::[1..5] X = Y + 1 Y = Z + 1 X = Z - 1 First propagation of X = Y + 1 leads to X::[2..5] Y::[1..4] Z::[1..5] X = Y + 1 suspended

22 Constraint Interaction Second propagation of Y = Z + 1 leads to X::[2..5] Y::[2..4] Z::[1..3], and Y = Z + 1 is suspended. The domain of Y has changed, so X = Y + 1 is awakened, leading to X::[3..5] Y::[2..4] Z::[1..3]; X = Y + 1 is suspended again.

23 Constraint Interaction Third propagation of X = Z - 1 leads to X::[] Y::[2..4] Z::[1..3]: FAIL, since the domain of X becomes empty. The order in which constraints are considered does not affect the result, BUT it can influence the efficiency.
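A minimal Python sketch of the propagation loop for this example, run to a fixpoint (illustrative code, not from the slides); whatever order the constraints are woken in, it ends with a failure:

```python
# Fixpoint propagation for X = Y + 1, Y = Z + 1, X = Z - 1 over interval domains.
def prop_eq_plus(a, b, k):
    """Propagate A = B + k on interval domains; return the pruned (a, b)."""
    a_lo, a_hi = a
    b_lo, b_hi = b
    a_lo, a_hi = max(a_lo, b_lo + k), min(a_hi, b_hi + k)
    b_lo, b_hi = max(b_lo, a_lo - k), min(b_hi, a_hi - k)
    if a_lo > a_hi or b_lo > b_hi:
        raise ValueError("FAIL: a domain became empty")
    return (a_lo, a_hi), (b_lo, b_hi)

dom = {"X": (1, 5), "Y": (1, 5), "Z": (1, 5)}
constraints = [("X", "Y", 1), ("Y", "Z", 1), ("X", "Z", -1)]  # each is A = B + k

try:
    changed = True
    while changed:                      # re-wake constraints until a fixpoint
        changed = False
        for a, b, k in constraints:
            new_a, new_b = prop_eq_plus(dom[a], dom[b], k)
            if (new_a, new_b) != (dom[a], dom[b]):
                dom[a], dom[b] = new_a, new_b
                changed = True
    print(dom)
except ValueError as err:
    print(err)                          # FAIL: a domain became empty
```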

24 Global Constraints cumulative([S1,..,Sn], [D1,..,Dn], [R1,..,Rn], L) where S1,...,Sn are the activity starting times (domain variables), D1,...,Dn the durations (domain variables), R1,...,Rn the resource consumptions (domain variables), and L the resource capacity. Given the interval [min, max] with min = min_i {S_i} and max = max_i {S_i + D_i} - 1, the cumulative constraint holds iff for every instant i in [min, max]: Σ_{j | S_j ≤ i < S_j + D_j} R_j ≤ L.

25 Global Constraints cumulative([1,2,4],[4,2,3],[1,2,2],3)
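A minimal Python sketch (illustrative, not a real propagator) that checks this cumulative instance by building the resource usage profile over time:

```python
# Check cumulative([1,2,4], [4,2,3], [1,2,2], 3) by building the usage profile.
def cumulative_holds(starts, durs, reqs, capacity):
    lo = min(starts)
    hi = max(s + d for s, d in zip(starts, durs)) - 1
    for t in range(lo, hi + 1):
        usage = sum(r for s, d, r in zip(starts, durs, reqs) if s <= t < s + d)
        if usage > capacity:
            return False
    return True

print(cumulative_holds([1, 2, 4], [4, 2, 3], [1, 2, 2], 3))  # True: peak usage is 3
```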

26 Propagation One of the propagation algorithms used in resource constraints is based on obligatory parts: if the latest start S_max of an activity is smaller than its earliest end S_min + D, the activity certainly occupies the interval [S_max, S_min + D), its obligatory part, and this mandatory usage can be used to prune the other activities.
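A minimal sketch of the obligatory-part computation (the function name and the example bounds are illustrative assumptions):

```python
# Obligatory (compulsory) part of an activity, given bounds on its starting time.
def obligatory_part(s_min, s_max, duration):
    """Return the interval [start, end) the activity surely occupies, or None."""
    start, end = s_max, s_min + duration   # latest start .. earliest end
    return (start, end) if start < end else None

print(obligatory_part(2, 5, 8))    # (5, 10): occupied whatever the exact start time
print(obligatory_part(2, 12, 8))   # None: the start window is too wide
```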

27 Propagation Another propagation algorithm is based on edge finding [Baptiste, Le Pape, Nuijten, IJCAI95]. Consider a unary resource R and three activities that should be scheduled on R: S1 with time window [0, 17] and duration 6, S2 with time window [1, 11] and duration 4, S3 with time window [1, 12] and duration 3.

28 Propagation We can deduce that the earliest start time of S1 is 8. In fact, S1 must be executed after both S2 and S3. Global reasoning: if S1 were executed before the other two, there would be no space left for executing both S2 and S3 within their windows; the earliest time at which both S2 and S3 can have completed is 1 + 4 + 3 = 8.
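A minimal Python sketch of this edge-finding deduction (a simplified version of the rule, using the windows reconstructed above; it is illustrative, not the algorithm of the cited paper):

```python
# Simplified edge finding on a unary resource: if activity i cannot complete before
# the set Omega does, it must run after all of Omega and its earliest start is raised.
acts = {  # name: (earliest start, latest end, duration)
    "S1": (0, 17, 6),
    "S2": (1, 11, 4),
    "S3": (1, 12, 3),
}

def edge_find(i, omega):
    est_all = min(acts[a][0] for a in list(omega) + [i])
    lct_omega = max(acts[a][1] for a in omega)
    total_dur = sum(acts[a][2] for a in list(omega) + [i])
    if est_all + total_dur > lct_omega:          # i cannot end before Omega ends
        new_est = min(acts[a][0] for a in omega) + sum(acts[a][2] for a in omega)
        return max(acts[i][0], new_est)          # so i runs after all of Omega
    return acts[i][0]

print(edge_find("S1", ["S2", "S3"]))  # 8: the earliest start time of S1 is raised
```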

29 CP solving process The solution process interleaves propagation and search. Constraints propagate as much as possible; when no more propagation can be performed, either the process fails because one domain is empty, or search is performed. The search heuristic chooses which variable to select and which value to assign to it. Optimization problems are solved via a sequence of feasibility problems with constraints on the objective-function variable.

30 Pros and Cons Declarative programming: the user states the constraints, the solver takes care of propagation and search. Strong on the feasibility side. Constraints are symbolic and mathematical: expressivity. Adding a constraint helps the solution process: flexibility. Weak optimality pruning if the link between the problem decision variables and the objective function is loose. No use of relaxations.

31 Integer Programming Standard form of a combinatorial optimization problem (IP): min z = Σ_{j=1..n} c_j x_j subject to Σ_{j=1..n} a_ij x_j = b_i for i = 1..m, x_j ≥ 0 for j = 1..n, x_j integer (the integrality requirement may make the problem NP-complete). An inequality y ≥ 0 is recast as y - s = 0 with a surplus variable s ≥ 0; maximization is expressed by negating the objective function.

32 0-1 Integer Programming Many combinatorial optimization problems can be expressed in terms of 0-1 variables: (IP) min z = Σ_{j=1..n} c_j x_j subject to Σ_{j=1..n} a_ij x_j = b_i for i = 1..m, x_j ∈ {0,1}. The 0-1 (integrality) restriction may make the problem NP-complete.

33 Linear Relaxation min z = Σ_{j=1..n} c_j x_j subject to Σ_{j=1..n} a_ij x_j = b_i for i = 1..m, x_j ≥ 0 for j = 1..n; the integrality constraint "x_j integer" is removed. The linear relaxation is solvable in POLYNOMIAL TIME. The SIMPLEX ALGORITHM is the technique of choice even though it is exponential in the worst case.
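To illustrate the bound provided by the linear relaxation, here is a minimal sketch using scipy.optimize.linprog on a tiny 0-1 knapsack; the instance is invented for the example, not taken from the slides:

```python
# LP relaxation of a tiny 0-1 knapsack: max 10*x1 + 6*x2 + 4*x3
# s.t. 5*x1 + 4*x2 + 3*x3 <= 8, x in {0,1}^3, relaxed to 0 <= x <= 1.
from scipy.optimize import linprog

res = linprog(c=[-10, -6, -4],              # negate: linprog minimizes
              A_ub=[[5, 4, 3]], b_ub=[8],
              bounds=[(0, 1)] * 3, method="highs")

print(res.x)      # fractional optimum: [1.0, 0.75, 0.0]
print(-res.fun)   # 14.5, an upper bound on the 0-1 optimum (which is 14, with x1 = x3 = 1)
```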

34 Geometric Properties The set of constraints defines a polytope, and the optimal solution is located on one of its vertices. The simplex algorithm starts from one vertex and moves to an adjacent one with a better value of the objective function.

35 IP solving process The optimal LP solution is in general fractional: it violates the integrality constraint, but it provides a bound on the solution of the overall problem. The solution process is branch and bound, which interleaves relaxation and search.

36 Pros and Cons Declarative programming: the user states the constraints, the solver takes care of relaxation and search. Strong on the optimality side. The structure of many real problems has been deeply studied. Only linear constraints can be used. If sophisticated techniques are used, we lose flexibility. No pruning on feasibility (only some preprocessing).

37 Resource-Efficient Application Mapping for MPSoCs (multimedia applications) Given a platform: 1. achieve a specified throughput; 2. minimize usage of shared resources.

38 Allocation and scheduling Given: a hardware platform with processors, (local and remote) storage devices, and a communication channel; a pre-characterized task graph representing a functional abstraction of the application we should run. Find: an allocation and a schedule of tasks to resources respecting real-time constraints (task deadlines), precedences among tasks, and the capacity of all resources, such that the communication cost is minimized.

39 Allocation and scheduling The platform is a multi-processor system with N nodes; each node includes a processor and a scratchpad memory. The bus is a shared communication channel. In addition, we have a remote memory of unlimited capacity (a realistic assumption for our application, but easily generalizable). The task graph has a pipeline workload: real-time video graphics processing on the pixels of a digital image. The graph specifies task dependencies (arcs between tasks) and computation, communication, and storage requirements.

40 Allocation and scheduling (MOTIVATION & CONTEXT) Design flow: given a platform description and an application abstraction, compute an allocation & schedule.

41 MPSoC platform (PROBLEM DESCRIPTION) Resources: identical processing elements (PEs, unary resources), local storage devices (limited capacity), shared BUS (limited bandwidth), remote on-chip memory (assumed to be infinite). Constraints: local device capacity, bus bandwidth (additive resource), architecture-dependent constraints.

42 Application (PROBLEM DESCRIPTION) The application is a task graph: nodes are tasks/processes, arcs are data communications. Each task reads data for each ingoing arc (RD), performs some computation (EXEC), and writes data for each outgoing arc (WR). Each task is annotated with its communication buffer (local/remote), PE, program data (local/remote), and frequency.

43 Application (PROBLEM DESCRIPTION) Task durations depend on memory allocation (remote memory is slower than the local ones). Different phases (RD, EXEC, WR) have different bus requirements.

44 Problem structure As a whole it is a scheduling problem with alternative resources: a very tough problem. It smoothly decomposes into allocation and scheduling. Allocation is better handled with IP techniques (not with CP, due to the complex objective function). Scheduling is better handled with CP techniques (not with IP, since we would have to model, for each task, every possible starting time with a 0/1 variable). The interaction between the two is regulated via cutting planes.

45 Logic-Based Benders Decomposition The problem is decomposed into 2 sub-problems. Master: allocation, solved with INTEGER PROGRAMMING, with memory constraints and objective function = communication cost of tasks. Subproblem: scheduling, solved with CONSTRAINT PROGRAMMING, with timing constraints and a secondary objective function = makespan. The master passes a valid allocation to the subproblem; when the subproblem rejects it, a no-good (a linear constraint) is returned to the master.
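A minimal sketch of the logic-based Benders loop (the functions solve_allocation and schedule_feasible and the no-good representation are hypothetical placeholders, shown only to make the control flow concrete):

```python
# Sketch of a logic-based Benders decomposition loop (placeholders, not a real solver).
def solve_allocation(no_goods):
    """Hypothetical IP master: return an allocation minimizing communication cost,
    subject to memory constraints and all accumulated no-good cuts, or None."""
    raise NotImplementedError

def schedule_feasible(allocation):
    """Hypothetical CP subproblem: try to schedule the tasks with this static
    allocation under the real-time constraints; return a schedule or None."""
    raise NotImplementedError

def benders_loop():
    no_goods = []                             # linear cuts fed back to the master
    while True:
        allocation = solve_allocation(no_goods)
        if allocation is None:
            return None                       # master infeasible: no solution exists
        schedule = schedule_feasible(allocation)
        if schedule is not None:
            # The master objective does not depend on the schedule, so the first
            # schedulable master-optimal allocation is globally optimal.
            return allocation, schedule
        no_goods.append(allocation)           # forbid this allocation in the master
```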

46 Master Problem model Assignment of tasks and memory slots (master problem). Variables: T_ij = 1 if task i executes on processor j, 0 otherwise; Y_ij = 1 if task i allocates its program data on processor j's memory, 0 otherwise; Z_ij = 1 if task i allocates its internal state on processor j's memory, 0 otherwise; X_ij = 1 if exactly one of tasks i and i+1 executes on processor j, 0 otherwise. Each task must be allocated to exactly one processor: Σ_j T_ij = 1 for all i. Link between variables X and T: X_ij = |T_ij - T_{i+1,j}| for all i and j (can be linearized). If a task is not allocated to a processor, its required memories cannot be allocated there either: T_ij = 0 ⇒ Y_ij = 0 and Z_ij = 0. Objective function: Σ_i Σ_j [ mem_i (T_ij - Y_ij) + state_i (T_ij - Z_ij) + data_i X_ij / 2 ].
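The linearization of X_ij = |T_ij - T_{i+1,j}| mentioned above uses four standard linear inequalities over the binaries. A minimal sketch follows, written with the PuLP modelling library and toy dimensions (both are assumptions for illustration, not what the original work used):

```python
# Sketch: linearizing X_ij = |T_ij - T_{i+1,j}| over binary variables (PuLP assumed).
from pulp import LpProblem, LpMinimize, LpVariable, lpSum

n_tasks, n_procs = 4, 2                      # toy dimensions, not from the slides
prob = LpProblem("master_sketch", LpMinimize)

T = {(i, j): LpVariable(f"T_{i}_{j}", cat="Binary")
     for i in range(n_tasks) for j in range(n_procs)}
X = {(i, j): LpVariable(f"X_{i}_{j}", cat="Binary")
     for i in range(n_tasks - 1) for j in range(n_procs)}

# Each task runs on exactly one processor: sum_j T_ij = 1
for i in range(n_tasks):
    prob += lpSum(T[i, j] for j in range(n_procs)) == 1

# X_ij = |T_ij - T_{i+1,j}|, linearized with four inequalities
for i in range(n_tasks - 1):
    for j in range(n_procs):
        prob += X[i, j] >= T[i, j] - T[i + 1, j]
        prob += X[i, j] >= T[i + 1, j] - T[i, j]
        prob += X[i, j] <= T[i, j] + T[i + 1, j]
        prob += X[i, j] <= 2 - T[i, j] - T[i + 1, j]
```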

47 Improvement of the model With the proposed model, the allocation problem solver tends to pack all tasks on a single processor and all the required memory on the local memory, so as to have a ZERO communication cost: a TRIVIAL SOLUTION. To improve the model we add a relaxation of the subproblem to the master problem model: for each set S of consecutive tasks whose sum of worst-case durations exceeds the real-time requirement, we impose that they cannot all run on the same processor: if Σ_{i∈S} WCET_i > RT then Σ_{i∈S} T_ij ≤ |S| - 1 for all j.

48 Sub-Problem model Task scheduling with a static resource assignment (subproblem). We have to schedule the tasks, so we have to decide when they start. Activity starting time: Start_i :: [0..Deadline_i]. Precedence constraints: Start_i + Dur_i ≤ Start_j. Real-time constraint over the activities running on the same processor: Σ_i (Start_i + Dur_i) ≤ RT. Cumulative constraints on resources: processors are unary resources, cumulative([Start], [Dur], [1], 1); memories are additive resources, cumulative([Start], [Dur], [MR], C). What about the bus?
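A minimal sketch of this scheduling subproblem written with Google OR-Tools CP-SAT; the solver choice and the task data are assumptions for illustration, not the setup used in the original work:

```python
# Sketch of the CP scheduling subproblem with a static allocation (OR-Tools CP-SAT).
from ortools.sat.python import cp_model

durations = [4, 2, 3, 3]         # hypothetical task durations
proc_of = [0, 0, 1, 1]           # static allocation decided by the master problem
mem_req = [2, 1, 2, 1]           # hypothetical memory requirements
precedences = [(0, 1), (2, 3)]   # arcs of the task graph
RT = 10                          # real-time requirement

model = cp_model.CpModel()
starts, intervals = [], []
for i, d in enumerate(durations):
    s = model.NewIntVar(0, RT - d, f"start_{i}")
    starts.append(s)
    intervals.append(model.NewIntervalVar(s, d, s + d, f"task_{i}"))

# Precedence constraints: Start_i + Dur_i <= Start_j
for i, j in precedences:
    model.Add(starts[i] + durations[i] <= starts[j])

# Each processor is a unary resource: its tasks must not overlap
for p in set(proc_of):
    model.AddNoOverlap([intervals[i] for i in range(len(durations)) if proc_of[i] == p])

# Memory treated as an additive (cumulative) resource with capacity C
C = 4
model.AddCumulative(intervals, mem_req, C)

# Secondary objective: minimize the makespan
makespan = model.NewIntVar(0, RT, "makespan")
model.AddMaxEquality(makespan, [starts[i] + durations[i] for i in range(len(durations))])
model.Minimize(makespan)

solver = cp_model.CpSolver()
if solver.Solve(model) in (cp_model.OPTIMAL, cp_model.FEASIBLE):
    print([solver.Value(s) for s in starts], solver.Value(makespan))
```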

49 Bus model Additive bus model: the bus bandwidth (bits/sec) used over time is the sum of the requirements of the concurrently active task phases, bounded by the maximum bus bandwidth (e.g. task0 accesses its input data with BW = MaxBW/NoProc, reads its state, executes, and writes its state while task1 also uses the bus). The model does not hold under heavy bus congestion, so bus traffic has to be minimized.

50 Results (algorithm search time) The combined approach dominates; its higher complexity shows only on simple system configurations.

51 Energy-Efficient Application Mapping for MPSoCs (multimedia applications) Given a platform: 1. achieve a specified throughput; 2. minimize power consumption.

52 Logic-Based Benders Decomposition The problem is decomposed into 2 sub-problems. Master: allocation & assignment (& frequency setting), solved with INTEGER PROGRAMMING, with memory constraints and objective function = communication cost & energy consumption during execution. Subproblem: scheduling, solved with CONSTRAINT PROGRAMMING, with timing constraints and objective function = e.g. energy consumption during frequency switching. No-goods and cutting planes are returned from the subproblem to the master.

53 Allocation problem model X_tfp = 1 if task t executes on processor p at frequency f; W_ijfp = 1 if tasks i and j run on different cores and task i, on core p, writes data for j at frequency f; R_ijfp = 1 if tasks i and j run on different cores and task j, on core p, reads data from i at frequency f. Each task executes on exactly one processor at exactly one frequency. Each communication between tasks is executed exactly once, and each write corresponds to one read. The objective function minimizes the energy consumption associated with task execution and communication.

54 Allocation problem model The objective function accounts for: computation energy for all tasks in the system; communication energy for writes to shared memory (writes are carried out at the same frequency as the task); communication energy for reads from shared memory (reads are carried out at the same frequency as the task).

55 Scheduling problem model Each task has INPUT, EXEC and OUTPUT phases. The objective function minimizes the energy consumption associated with frequency switching. Processors are modelled as unary resources; the bus is modelled as an additive resource. The duration of task i is now fixed, since its mode (frequency) is fixed. Three cases are distinguished between consecutive tasks i and j: tasks running on the same processor at the same frequency, tasks running on the same processor at different frequencies (which incurs a switching cost), and tasks running on different processors.

56 Computational efficiency Search time for an increasing number of tasks (a similar plot holds for an increasing number of processors). Standalone CP and IP proved not comparable even on a simpler problem. We varied the RT constraint: with a tight deadline there are few feasible solutions, with a very loose deadline the solution is trivial, and search times stay within 1 order of magnitude. Experiments ran on a P4 2GHz with 512 MB RAM, using professional solving tools (ILOG).

57 Conditional Task Graphs With conditional task graphs the problem becomes stochastic, since the outcomes of the conditions labelling the arcs are known only at execution time; we only know their probability distribution. We minimize the expected value of the objective function: minimizing the communication cost is easier, minimizing the makespan is much more complicated. Promising results.

58 Other problems What if the task durations are not known and only the worst and best cases are available? Scheduling only the worst case is not enough (scheduling anomalies). Change of platform: we are targeting the CELL BE processor architecture, modelling synchronous data flow applications and communication-intensive applications.

59 Challenge: the Abstraction Gap The abstraction gap between high-level optimization tools and standard application programming models can introduce unpredictable and undesired behaviours. Programmers must be aware of the simplified assumptions made in the optimization tools. Flow: platform modelling, optimization, analysis, optimal solution, starting implementation, (abstraction gap), final implementation, platform execution.

60 Validation of optimizer solutions: throughput The optimal allocation & schedule produced by the optimizer was validated on a virtual platform over 250 instances. MAX error lower than 10%; AVG error equal to 4.51%, with a standard deviation of 1.94. All deadlines are met. (Plot: probability (%) vs. throughput difference (%).)

61 Validation of optimizer solutions: power The optimal allocation & schedule produced by the optimizer was validated on a virtual platform over 250 instances. MAX error lower than 10%; AVG error equal to 4.80%, with a standard deviation of 1.71. (Plot: probability (%) vs. energy consumption difference (%).)

62 GSM Encoder Throughput required: 1 frame/10 ms, with 2 processors and 4 possible frequency & voltage settings. Task graph: 10 computational tasks and 15 communication tasks. Energy without optimizations: 50.9 μJ; with optimizations: 17.1 μJ (a 66.4% reduction).

63 Challenge: programming environment A software development toolkit to help programmers with software implementation: a generic customizable application template (OFFLINE SUPPORT) and a set of high-level APIs (ONLINE SUPPORT) in an RT-OS (RTEMS). The main goals are predictable application execution after the optimization step and guarantees on high performance and constraint satisfaction. Starting from a high-level task and data-flow graph, software developers can easily and quickly build their application infrastructure; programmers can intuitively translate the high-level representation into C code using our facilities and library.

