Martino Ruggiero, Michele Lombardi, Michela Milano and Luca Benini

Name: Martino Ruggiero, Michele Lombardi, Michela Milano and Luca Benini
Uploaded: 2017-07-31T07:03:33+00:00
Duration: PTM20S56
Channel: Lindsay Newman

Communication-Aware Stochastic Allocation and Scheduling Framework for Conditional Task Graphs in Multi-Processor Systems-on-Chip
Martino Ruggiero, Michele Lombardi, Michela Milano and Luca Benini
University of Bologna, DEIS - Italy

Outline
  Motivations;
  Our approach;
  Problem Model;
  Methodology;
  Experimental Results;
  Conclusions.

[Figure: a task graph (T1 to T8) is allocated to processors Proc. 1, Proc. 2, ..., Proc. N, each with a private memory and connected by an interconnect, and is then scheduled over time and resources against a deadline.]
Many realistic applications can only be specified as conditional task graphs.
The problem of allocating and scheduling conditional task graphs on processors in a distributed real-time system is NP-hard.
New tool flows for efficient mapping of multi-task applications onto hardware platforms.

Design flow graph
[Figure: design flow. Optimization and development are separated by an abstraction gap. Stages shown: platform modelling, starting implementation, optimization analysis, optimal solution, final implementation, platform execution.]
The abstraction gap between high-level optimization tools and standard application programming models can introduce unpredictable and undesired behaviours.
Programmers must be aware of the simplifying assumptions made in optimization tools.
New methodology for multi-task application development on MPSoCs.

Our approach
Our Focus:
  Statically scheduled Conditional Task Graph applications.
Our Objectives:
  Complete approach to allocation and scheduling:
    High computational efficiency w.r.t. commercial solvers;
    High accuracy of generated solutions;
  New methodology for multi-task application development:
    To quickly develop multi-task applications;
    To easily apply the optimal solution found by our optimizer.

Target architecture - 1
An architectural template for a message-oriented distributed memory MPSoC:
  Support for message exchange between the computation tiles;
  Single-token communication;
  Availability of local memory devices at the computation tiles and of remote memories for program data.
Several MPSoC platforms available on the market match this template:
  The Silicon Hive Avispa-CH1 processor;
  The Cradle CT3600 family of multiprocessors;
  The Cell processor;
  The ARM MPCore platform.
The throughput requirement is reflected in the maximum tolerable scheduling period T of each processor.
[Figure: activities Act. A, Act. B, ..., Act. N executed within one period T.]

Target architecture - 2
Homogeneous computation tiles:
  ARM cores (including instruction and data caches);
  Tightly coupled software-controlled scratch-pad memories (SPM);
  AMBA AHB;
  DMA engine;
  RTEMS OS;
  Cores use non-cacheable shared memory to communicate;
  Semaphore and interrupt facilities are used for synchronization;
  Private on-chip memory to store data.

Target Application: Conditional Task Graph (CTG)
Target applications seldom behave the same way across executions: they contain cycles, conditional jumps or other elements of variability.
A CTG is a triple <T,A,C>, where:
  T is the set of nodes modelling generic tasks (e.g. elementary operations, subprograms, ...);
  A is the set of arcs modelling precedence constraints (e.g. due to data communication);
  C is a set of conditions, each one associated with an arc, modelling what must be true in order to choose that branch during execution (e.g. the condition of an if-then-else construct).
Extension of the generic task graph model with stochastic elements:
  Conditional Branches;
  Conditional Nodes;
  Branch Nodes.
[Figure: example CTG with FORK, AND, BRANCH and OR nodes.]
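The triple <T,A,C> can be pictured as a small data structure. The following is a minimal sketch, not the authors' representation: the names ctg_arc, ctg, ctg_activate and the demo graph are hypothetical, tasks are assumed to be numbered in topological order, and each node is assumed to be either an AND node (all incoming arcs must be active) or an OR node (one active incoming arc is enough).

```c
#include <assert.h>

/* One arc of A. cond < 0 means unconditional; otherwise the arc is taken
   only when outcome[cond] == value (one element of C). Hypothetical layout. */
struct ctg_arc { int src, dst, cond, value; };

struct ctg {
    int n_tasks;                 /* |T| */
    int n_arcs;                  /* |A| */
    const struct ctg_arc *arcs;
    const int *is_or;            /* 1 = OR node, 0 = AND node */
};

/* Demo graph (assumption): T0 branches on condition 0 to T1 or T2,
   and both paths join again in the OR node T3. */
static const struct ctg_arc demo_arcs[] = {
    {0, 1, 0, 1}, {0, 2, 0, 0}, {1, 3, -1, 0}, {2, 3, -1, 0}
};
static const int demo_is_or[] = {0, 0, 0, 1};
static const struct ctg demo_ctg = {4, 4, demo_arcs, demo_is_or};

/* Fills runs[i] with 1 if task i executes in the scenario described by
   outcome[], assuming every arc satisfies src < dst. */
void ctg_activate(const struct ctg *g, const int *outcome, int *runs)
{
    assert(g != 0 && outcome != 0 && runs != 0);
    for (int i = 0; i < g->n_tasks; i++) {
        int incoming = 0, active = 0;
        for (int r = 0; r < g->n_arcs; r++) {
            const struct ctg_arc *a = &g->arcs[r];
            if (a->dst != i)
                continue;
            assert(a->src < i);  /* topological numbering assumption */
            incoming++;
            if (runs[a->src] && (a->cond < 0 || outcome[a->cond] == a->value))
                active++;
        }
        if (incoming == 0)
            runs[i] = 1;                    /* source node always runs */
        else if (g->is_or[i])
            runs[i] = active > 0;
        else
            runs[i] = active == incoming;
    }
}
```

Calling ctg_activate(&demo_ctg, outcome, runs) with outcome[0] set to 1 marks T0, T1 and T3 as executing and leaves T2 out of that scenario.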

Task memory requirements
[Figure: two computation tiles (#1, #2), each with an ARM core, interrupt controller, SPM and semaphores, connected through the system bus to the private memories.]
Each task has three kinds of memory requirements:
  Program Data;
  Internal State;
  Communication queues.
Program Data & Internal State can be allocated by the Optimizer:
  On the local SPM;
  On the remote Private Memory.
The communicating task might run:
  On the same processor → negligible communication cost;
  On a remote processor → costly message exchange procedure.
Optimizer constraint: Communication queues only in SPM → more efficient message passing.

Logic Based Benders Decomposition
[Figure: the ALLOCATION master problem (INTEGER PROGRAMMING; memory constraints; objective function: communication cost) passes a valid allocation to the SCHEDULING sub-problem (CONSTRAINT PROGRAMMING; real-time constraint), which returns a no-good as a linear constraint.]
Decomposes a problem into 2 sub-problems:
  Allocation → IP;
  Scheduling → CP.
The process continues until the master problem and the sub-problem converge, providing the same value.
The methodology has been proven to converge to the optimal solution [J.N.Hooker and G.Ottosson].
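The master/sub-problem loop can be illustrated on a toy instance. This is only a sketch under loud assumptions: brute-force enumeration stands in for the integer-programming master, a per-processor load check stands in for the constraint-programming scheduler, there are exactly two processors, and the instance data (durations, arc sizes) and the name benders_solve are invented for illustration.

```c
#include <assert.h>
#include <limits.h>

#define N_TASKS 4
#define N_ARCS 3
#define MAX_CUTS 64
#define ALL_TASKS ((1u << N_TASKS) - 1u)

/* Hypothetical instance: a chain T0 -> T1 -> T2 -> T3. */
static const int duration[N_TASKS] = {4, 3, 3, 2};
static const int arc_src[N_ARCS]   = {0, 1, 2};
static const int arc_dst[N_ARCS]   = {1, 2, 3};
static const int arc_data[N_ARCS]  = {5, 1, 4};

/* Bit i of alloc is the processor (0 or 1) of task i. An arc costs its
   data size only when its endpoints sit on different processors. */
static int bus_cost(unsigned alloc)
{
    int cost = 0;
    for (int r = 0; r < N_ARCS; r++)
        if (((alloc >> arc_src[r]) & 1u) != ((alloc >> arc_dst[r]) & 1u))
            cost += arc_data[r];
    return cost;
}

static int load_on(unsigned task_set)
{
    int load = 0;
    for (int i = 0; i < N_TASKS; i++)
        if ((task_set >> i) & 1u)
            load += duration[i];
    return load;
}

/* Returns the optimal bus cost, or -1 if no allocation meets the deadline. */
int benders_solve(int deadline, unsigned *alloc_out)
{
    unsigned cuts[MAX_CUTS];   /* no-goods: task sets that may not share a CPU */
    int n_cuts = 0;
    assert(deadline > 0 && alloc_out != 0);

    for (;;) {
        /* Master (stand-in for IP): cheapest allocation not cut off. */
        int best_cost = INT_MAX;
        unsigned best = 0;
        for (unsigned alloc = 0; alloc <= ALL_TASKS; alloc++) {
            unsigned on1 = alloc, on0 = ~alloc & ALL_TASKS;
            int cut_off = 0;
            for (int c = 0; c < n_cuts; c++)
                if ((on0 & cuts[c]) == cuts[c] || (on1 & cuts[c]) == cuts[c])
                    cut_off = 1;
            if (!cut_off && bus_cost(alloc) < best_cost) {
                best_cost = bus_cost(alloc);
                best = alloc;
            }
        }
        if (best_cost == INT_MAX)
            return -1;

        /* Sub-problem (stand-in for CP): does each CPU meet the deadline? */
        unsigned sets[2] = { ~best & ALL_TASKS, best };
        int feasible = 1;
        for (int p = 0; p < 2; p++) {
            if (load_on(sets[p]) > deadline) {
                feasible = 0;
                assert(n_cuts < MAX_CUTS);
                cuts[n_cuts++] = sets[p];   /* no-good fed back to the master */
            }
        }
        if (feasible) {
            *alloc_out = best;
            return best_cost;
        }
    }
}
```

Because the master always proposes the cheapest allocation that no cut excludes, the first allocation the sub-problem accepts is optimal for the toy model. Each no-good removes at least the rejected allocation, so the loop terminates.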

Allocation problem model
Tij = 1 if task i executes on processor j;
Mij = 1 if task i allocates the program data on the SPM of PE j;
Sij = 1 if task i allocates the internal state on the SPM of PE j;
Crj = 1 if arc r is allocated on the SPM of PE j.
Constraints:
  Each process can execute on only one processor;
  Program data and internal state can be allocated locally on a PE only if the task runs on it;
  The communication queue of arc r can be allocated locally only if both the source and the destination tasks run on PE j;
  The sum of locally allocated structures cannot exceed the SPM capacity.
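The four constraints stated in words can be written out as linear constraints over the binary variables. The formulation below is a reconstruction from the verbal statements, not the equations shown on the original slide. The symbols h(r) and k(r) (source and destination task of arc r), m_i, s_i, c_r (sizes of program data, internal state and queue) and Cap_j (SPM capacity of PE j) are assumed names, and T_{ij} is the slide's Tij.

```latex
\begin{align}
\sum_{j} T_{ij} &= 1 && \forall i \\
M_{ij} \le T_{ij}, \quad S_{ij} &\le T_{ij} && \forall i, j \\
C_{rj} \le T_{h(r)j}, \quad C_{rj} &\le T_{k(r)j} && \forall r, j \\
\sum_{i} \left( m_i M_{ij} + s_i S_{ij} \right) + \sum_{r} c_r C_{rj} &\le \mathit{Cap}_j && \forall j
\end{align}
```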

Allocation problem model
The objective function: the minimization of the amount of data transferred on the bus.
Tij = 1 if task i executes on processor j;
Mij = 1 if task i allocates the program data on the SPM of PE j;
Sij = 1 if task i allocates the internal state on the SPM of PE j;
Crj = 1 if arc r is allocated on the SPM of PE j.
[Figure: CPU and memory connected by the bus.]

Bus Traffic modelling
[Equation: bus traffic, with the following annotated terms.]
  Equal to 1 if task i internal state is remotely allocated;
  Equal to 1 if task i program data is remotely allocated;
  Activation function equal to 1 if task i executes;
  Activation function equal to 1 if tasks i and k execute;
  Equal to 1 if the communication queue is remotely allocated.

Bus Traffic modelling
Given an allocation, these two terms are constants.
The minimization of a stochastic function is a very complex operation (even more than exponential).

Bus Traffic modelling
[Equation annotations: existence and coexistence probabilities of tasks; constant terms.]
Every stochastic dependence is removed and the expected value is reduced to a deterministic expression.
We developed two polynomial-cost algorithms to compute these probabilities.
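Once the existence and coexistence probabilities are known, the expected bus traffic becomes a plain weighted sum. The sketch below assumes a simplified form of the objective (one traffic constant per task and one per arc). The function name and parameters are hypothetical, and the probabilities are taken as inputs rather than computed with the authors' algorithms, which the slides do not detail.

```c
#include <assert.h>

/* Expected bus traffic for a fixed allocation (simplified, assumed form):
   p_exist[i]      probability that task i executes
   task_traffic[i] bus traffic of task i's remotely allocated data
   p_coexist[r]    probability that both end tasks of arc r execute
   arc_traffic[r]  bus traffic of arc r's remotely allocated queue */
double expected_bus_traffic(int n_tasks, const double *p_exist,
                            const double *task_traffic,
                            int n_arcs, const double *p_coexist,
                            const double *arc_traffic)
{
    double total = 0.0;
    assert(n_tasks >= 0 && n_arcs >= 0);
    for (int i = 0; i < n_tasks; i++)
        total += p_exist[i] * task_traffic[i];
    for (int r = 0; r < n_arcs; r++)
        total += p_coexist[r] * arc_traffic[r];
    return total;
}
```

With the probabilities fixed, the expression contains no random variable, which is what lets a deterministic solver minimize it.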

Scheduling problem model
Five-phase behaviour:
  INPUT = input data reading;
  RS = internal state reading;
  EXEC = computation activity;
  WS = internal state writing;
  OUTPUT = output data writing.
Activities are not breakable.
The adopted schema and precedence relations vary with the type of the corresponding node (or/and, branch/fork).
Since the objective function depends only on the allocation, scheduling is just a feasibility problem.
We decided to provide a unique worst-case schedule, forcing each task to execute after all its predecessors in any scenario.
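The five-phase task model and the "after all its predecessors" rule can be sketched as a small feasibility check. This is an illustrative sketch only: the type and function names are hypothetical, phases are assumed to run back to back without preemption, and the check covers precedence but not resource conflicts or deadlines.

```c
#include <assert.h>

/* Phase order taken from the slide: INPUT, RS, EXEC, WS, OUTPUT. */
enum phase { PH_INPUT, PH_RS, PH_EXEC, PH_WS, PH_OUTPUT, PH_COUNT };

struct task_timing {
    int start;           /* start time of the INPUT phase */
    int len[PH_COUNT];   /* duration of each phase */
};

/* End time of a task whose five phases run back to back. */
int task_end(const struct task_timing *t)
{
    int end;
    assert(t != 0);
    end = t->start;
    for (int p = 0; p < PH_COUNT; p++)
        end += t->len[p];
    return end;
}

/* Returns 1 if, for every arc {src, dst}, dst starts no earlier than
   src ends, i.e. each task follows all of its predecessors. */
int respects_precedence(const struct task_timing *tasks,
                        const int (*arcs)[2], int n_arcs)
{
    for (int r = 0; r < n_arcs; r++)
        if (tasks[arcs[r][1]].start < task_end(&tasks[arcs[r][0]]))
            return 0;
    return 1;
}
```

Checking every arc regardless of which branch is taken mirrors the single worst-case schedule: the same start times must hold in any scenario.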

Efficient Application Development Support
Optimization tools generally rely on many simplifying assumptions.
Neglecting these assumptions in the software implementation can:
  generate unpredictable and undesired system-level interactions;
  make the overall system error-prone.
We propose an entire framework to help programmers in software implementation:
  a generic customizable application template → OFFLINE SUPPORT;
  a set of high-level APIs → ONLINE SUPPORT.
The main goals of our development framework are:
  exact and reliable execution of the application after the optimization step;
  guarantees about high performance and constraint satisfaction.

Customizable Application Template
Starting from a high-level task and data flow graph, software developers can easily and quickly build their application infrastructure.
Programmers can intuitively translate the high-level representation into C code using our facilities and library.
Users can specify:
  the number of tasks included in the target application;
  their nature (e.g. branch, fork, or-node, and-node);
  their precedence constraints (e.g. due to data communication);
thus quickly drawing the application's CTG schema.
Programmers can focus on the functionality of the tasks: the main effort goes to the more specific and critical sections of the application.

OS-level and Task-level APIs
Users can easily reproduce optimizer solutions, thus:
  Indirectly neglecting the optimizer's abstractions:
    Task model;
    Communication model;
    OS overheads;
  Obtaining the needed application constraint satisfaction.
Programmers can allocate to the right hardware resources:
  Tasks;
  Program data;
  Queues.
Scheduling support APIs.
Communication issues:
  Shared queues;
  Semaphores;
  Interrupts.

Example
Number of nodes: 12
Graph of activities:
  Node type: Normal, Branch, Conditional, Terminator;
  Node behaviour: Or, And, Fork, Branch.
Number of CPUs: 2
Task Allocation
Task Scheduling
Arc priorities
[Figure: example CTG with fork, branch, or and and nodes connected by arcs a1 to a14, and the resulting schedule on the two CPUs over time against the deadline.]
    #define TASK_NUMBER 12
    #define N_CPU 2
    //Node Type: 0 NORMAL; 1 BRANCH ; 2 STOCHASTIC
    uint node_type[TASK_NUMBER] = {1,2,2,1,..};
    //Node Behaviour: 0 AND ; 1 OR; 2 FORK; 3 BRANCH
    uint node_behaviour[TASK_NUMBER] = {2,3,3,..};
    uint queue_consumer [..] [..] = { {0,1,1,0,..}, {0,0,0,1,1,.}, {0,0,0,0,0,1,1..}, {0,0,0,0,..}..};
    uint task_on_core[TASK_NUMBER] = {1,1,2,1};
    int schedule_on_core[N_CPU][TASK_NUMBER] = {{1,2,4,8}..};

Queue ordering optimization
[Figure: tasks T1 to T6 on CPU1 and CPU2 exchanging messages through queues C1 to C5; depending on the queue order, T4 either waits or runs.]
Communication ordering affects system performance.

Synchronization among tasks
[Figure: tasks T1 to T4 on Proc. 1 and Proc. 2 communicating through queues C1 to C3; T4 is suspended and later re-activated.]
Non-blocking semaphores.

Application Development Methodology
[Figure: methodology flow.
  Characterization Phase: Simulator, Application Profiles, CTG;
  Optimization Phase: Optimizer, Allocation, Scheduling;
  Application Development Support: Optimal SW Application Implementation, Platform Execution.]

Computational Efficiency
2 groups of instances:
  slightly structured: very short tracks, which quite often contain singleton nodes;
  completely structured: one head, one tail, long tracks.
The solution times are of the same order as in the deterministic case.

Validation of optimizer solutions
Optimal allocation & schedule → virtual platform validation:
  MAX error lower than 10%;
  AVG error equal to 4.8%, with standard deviation of 2.41.

Validation of optimizer solutions
Differences are marginal; All the deadline constraints are satisfied.

Conclusions
  Cooperative framework to solve the allocation and scheduling problem to optimality for conditional task graphs onto MPSoCs;
  Logic-Based Benders Decomposition;
  New development methodology;
  Solutions validated by means of a complete MPSoC virtual platform;
  Experimental results proved the accuracy of the problem model.
