
1 Particle Swarm Optimization for Run-Time Task Decomposition and Scheduling in Evolvable MPSoC Shervin Vakili, Sied Mehdi Fakhraie, Siamak Mohammadi, and Ali Ahmadi International Conference on Computer Engineering and Technology 2009 (ICCET 2009) January 24, 2009

2 Outline ► Why MPSoC? ► Introduction to EvoMP ► Processing Platform Hardware Architecture ► PSO Hardware Core ► Simulation and Synthesis Results

3 Why MPSoC? ► An emerging trend in designing high-performance computing architectures ► Retains most of the desirable advantages of single-processor solutions, such as short time-to-market, post-fabrication reusability, flexibility, and programmability ► The trend is moving toward a large number of simple processors on a chip

4 MPSoC Development Challenges ► Programming models: MP systems require concurrent software. Two main solutions:  Software development using parallel programming models, e.g., OpenMP and MPI ► “Software developers have been well-trained by sixty years of computing history to think in terms of sequentially defined applications code” [2] ► Requires a huge investment to redevelop existing software

5 MPSoC Development Challenges (2)  Automatic parallelization at compile-time ► Does not require reprogramming, but does require re-compilation ► Such a compiler must solve two complex problems:  Decomposition of the program into tasks  Scheduling of the tasks among cooperating processors ► Both task decomposition and scheduling are NP-complete problems ► G. Martin [2]: “Decomposition of an application described in a serial fashion into a set of concurrent or parallel tasks that can cooperate in an orderly and predictable way is one of the most difficult jobs imaginable and despite forty or more years of intensive research in this area there are very few applications for which this can be done automatically”.

6 MPSoC Development Challenges (3) ► All MPSoCs can be divided into two categories:  Static scheduling ► Task scheduling is performed before run-time ► The number of contributing processors must be predetermined  Dynamic scheduling (e.g., current multi-core PC processors) ► A run-time scheduler (in hardware, middleware, or the OS) is in charge of task scheduling ► Does not require prior information about the number of available processors (desirable for fault tolerance)

7 Introduction to EvoMP ► A NoC-based homogeneous multiprocessor system with evolvable task decomposition and scheduling ► Features:  Distributed control and computing  Scalable  Does not need parallel programming ► Parallel programming is one of the main difficulties in parallel processing ► It would require reprogramming all existing (sequential) software

8 Introduction to EvoMP (2) ► Features:  All computational units hold a copy of the entire program  A hardware PSO core is exploited in the EvoMP architecture to generate a bit-string ► The bit-string specifies which processor each instruction must be executed on  The first version of EvoMP used a genetic algorithm core instead [8]

9 Introduction to EvoMP (3) ► Target applications: applications that perform a fixed computation on a stream of data, e.g.:  Digital signal processing of video and audio signals  Different codec standards  Processing of large volumes of sensory data  Packet processing in network applications  …

10 EvoMP Top View ► The PSO core produces a bit-string (particle) that determines the location of execution of each instruction at the beginning of each iteration ► Figure: a 2x2 mesh (Cell-00, Cell-01, Cell-10, Cell-11, connected through switches SW00, SW01, SW10, SW11) with the PSO core attached; every cell holds the same program:
1- MOV R1, 0
2- MOV R2, 0
L1: ;Loop
3- MOV R1, Input
4- MUL R3, R1, Coe1
5- MUL R4, R2, Coe2
6- ADD R1, R3, R4
7- MOV Output, R1
8- MOV R1, R2
9- JUMP L1
Particle: …11

11 How Chromosome Codes the Scheduling Data ► Streaming applications have two main parts:  Initialization  Infinite (or semi-infinite) loop
;Initial
1- MOV R1, 0
2- MOV R2, 0
L1: ;Loop
3- MOV R1, Input
4- MUL R3, R1, Coe1
5- MUL R4, R2, Coe2
6- ADD R1, R3, R4
7- MOV Output, R1
8- MOV R1, R2
9- PSO
10- JUMP L1

12 How EvoMP Works ► The following process is repeated for each iteration:  At the beginning of each iteration, the PSO core generates and sends the bit-string (particle) to all processors  The processors then execute this iteration of the program with the decomposition and scheduling scheme specified by this bit-string  An internal counter in the PSO core counts the number of clock cycles spent during execution of each iteration  When all processors have reached the end of the loop, the PSO core uses the output of this counter as the fitness value of the last generated particle
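The steps above can be sketched as a small evaluation loop. The `pso_core` and processor interfaces below are hypothetical stand-ins (the real core is hardware, and the cycle counter is internal to it); the sketch only illustrates the broadcast-run-measure protocol.

```python
def evomp_iterations(pso_core, processors, n_iterations):
    """Schematic of the EvoMP evaluation loop: broadcast a particle,
    run one iteration on every processor, report elapsed cycles as fitness."""
    for _ in range(n_iterations):
        particle = pso_core.next_particle()  # generated at iteration start
        # The iteration ends when ALL processors reach the end of the loop,
        # so the slowest processor determines the measured cycle count.
        cycles = max(proc.run(particle) for proc in processors)
        pso_core.report_fitness(particle, cycles)  # fewer cycles = fitter particle
```

Note that fitness is simply the wall-clock cost of one loop iteration, so the swarm directly minimizes execution time.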

13 How EvoMP Works (2) ► The system has three main states ► Initialize:  Used only for the first population  The PSO core generates random particles ► Evolution:  The PSO core produces the new population through particular computations using the best previously archived particles  When the termination condition is met, the system goes to the final state ► Final:  The best particle achieved in the evolution stage is used as the constant output of the PSO core  When one of the processors becomes faulty, the system returns to the evolution stage to perform re-evolution (beneficial for the fault-tolerance capability of EvoMP) ► State diagram: Initialize → Evolution; Evolution → Final on terminate; Final → Evolution on fault detected
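The three-state control flow reduces to a small transition function. The event names below are illustrative labels for the conditions described on the slide, not identifiers from the paper.

```python
def next_state(state, event):
    """EvoMP scheduler states: Initialize -> Evolution -> Final,
    with Final -> Evolution on a detected processor fault (re-evolution)."""
    transitions = {
        ("Initialize", "first_population_ready"): "Evolution",
        ("Evolution", "terminate"): "Final",
        ("Final", "fault_detected"): "Evolution",  # re-evolve around the fault
    }
    # Events that do not apply to the current state leave it unchanged.
    return transitions.get((state, event), state)
```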

14 How Chromosome Codes the Scheduling Data (1) ► Each bit-string (particle) consists of several small words (sub-particles) ► Each sub-particle contains two fields:  A processor number  A bounded count specifying how many consecutive instructions must be executed on the processor named in the first field
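As a concrete illustration of this encoding, the sketch below decodes a particle into a per-instruction processor map. The field widths and the wrap-around policy are assumptions for the example, not taken from the paper.

```python
def decode_particle(bits, n_instr, proc_bits=2, count_bits=3):
    """Split a particle bit-string into sub-particles of the form
    (processor number, instruction count) and expand them into an
    instruction -> processor assignment."""
    assignment = []
    i = 0
    while len(assignment) < n_instr:
        if i + proc_bits + count_bits > len(bits):
            i = 0  # assumption: reuse the string if instructions remain
        proc = int(bits[i:i + proc_bits], 2)
        # Guard against a zero count so decoding always makes progress.
        count = max(1, int(bits[i + proc_bits:i + proc_bits + count_bits], 2))
        assignment.extend([proc] * count)
        i += proc_bits + count_bits
    return assignment[:n_instr]
```

For a 2x2 mesh (2 processor bits), the string `01011` followed by `10101` assigns 3 instructions to processor 1 and the next 5 to processor 2.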

15 How Chromosome Codes the Scheduling Data (2) ► Assume that we have a 2x2 mesh ► Figure: a particle split into sub-particle fields (processor number, # of instructions), mapped onto the program below:
;Initial
1- MOV R1, 0
2- MOV R2, 0
L1: ;Loop
3- MOV R1, Input
4- MUL R3, R1, Coe1
5- MUL R4, R2, Coe2
6- ADD R1, R3, R4
7- MOV Output, R1
8- MOV R1, R2
9- GENETIC
10- JUMP L1

16 Inter-Processor Data Dependencies ► Inter-processor data dependencies  Are detected in the source processor using architectural mechanisms  The source processor transmits the required data to the destination processor(s) through the NoC  No request-send scheme is required

17 Architecture of Each Processor ► The number of FUs is a configurable parameter ► Supports out-of-order execution ► The first free FU grabs the instruction from the Instr bus and sends a signal to the Fetch_Issue unit to fetch the next instruction

18 Particle Swarm Optimization Algorithm ► A stochastic, population-based evolutionary algorithm ► Tries to find the optimum solution over the search space by  sampling points and converging the swarm on the most promising regions  The number of these sampling points (called particles) is constant (the population size)  Each sampling point is a candidate solution
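For reference, a minimal software sketch of the standard continuous PSO update. The EvoMP core implements a discrete, asynchronous hardware variant [6], so the real-valued positions and the parameter values here are illustrative only.

```python
import random

def pso(fitness, dim, n_particles=16, iters=100, w=0.7, c1=1.5, c2=1.5):
    """Standard PSO minimizing `fitness` over a dim-dimensional space:
    each particle is pulled toward its personal best and the global best."""
    pos = [[random.uniform(-5, 5) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                      # personal best positions
    pbest_fit = [fitness(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]     # global best so far
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = random.random(), random.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            f = fitness(pos[i])
            if f < pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i][:], f
                if f < gbest_fit:
                    gbest, gbest_fit = pos[i][:], f
    return gbest, gbest_fit
```

In EvoMP the "position" is the scheduling bit-string and the fitness is the cycle count of one loop iteration, so smaller is better, as here.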

19 PSO Core ► Figure: the PSO core attached to the 2x2 mesh (Cell-00, Cell-01, Cell-10, Cell-11; switches SW00, SW01, SW10, SW11) [6]

20 Evolution Phase Results: DCT-16 ► Parameters:  Population size = 16  NoC connection width = 16 ► 324 instructions ► 128 multiplications ► Execution results of the 16-point Discrete Cosine Transform on different-size EvoMPs ► Best fitness shows the number of clock cycles required to execute one iteration using the best particle found so far.

21 Evolution Phase Results: MATRIX-5x5 ► 406 instructions ► 125 multiplications ► Execution results of 5x5 matrix multiplication on different-size EvoMPs ► Parameters:  Population size = 16  NoC connection width = 16

22 Final Evolution Phase Results ► The following table shows the final results achieved in the evolution phase (and the corresponding evolution time) for both the genetic-based and PSO-based EvoMPs. ► These results show a small improvement in both final results and convergence time for the PSO-based system. ► Table columns: # of Instr., # of Multiplies, Particle length (bits) — shared by both systems — then Fit/Time for Genetic and PSO on 1, 2, 3, and 5 processors; rows: FIR, DCT-8, DCT-16, MAT-5x5 (the 1-processor FIR case is unevaluated; numeric entries not preserved in this transcript).

23 Synthesis Results ► The following table shows the synthesis results of both the PSO and genetic cores on a Virtex-II (XC2V3000) FPGA. ► Table columns: Population size, Sub-particle length, Total LUTs, Max. Freq. (MHz); rows: PSO-based scheduler, Genetic-based scheduler (numeric entries not preserved in this transcript).

24 References
[1] A. A. Jerraya and W. Wolf, Multiprocessor Systems-on-Chips, San Francisco: Morgan Kaufmann Publishers.
[2] G. Martin, “Overview of the MPSoC design challenge,” Proc. Design and Automation Conf., July 2005.
[3] M. Hubner, K. Paulsson, and J. Becker, “Parallel and flexible multiprocessor system-on-chip for adaptive automotive applications based on Xilinx MicroBlaze soft-cores,” Proc. Int. Symp. Parallel and Distributed Processing, 2005.
[4] D. Gohringer, M. Hubner, V. Schatz, and J. Becker, “Runtime adaptive multi-processor system-on-chip: RAMPSoC,” Proc. Int. Symp. Parallel and Distributed Processing, Apr. 2008.
[5] A. Klimm, L. Braun, and J. Becker, “An adaptive and scalable multiprocessor system for Xilinx FPGAs using minimal sized processor cores,” Proc. Int. Symp. Parallel and Distributed Processing, April 2008.
[6] A. Farmahini-Farahani, S. Vakili, S. M. Fakhraie, S. Safari, and C. Lucas, “Parallel scalable hardware implementation of asynchronous discrete particle swarm optimization,” Elsevier J. of Engineering Applications of Artificial Intelligence, submitted for publication.
[7] A. J. Page and T. J. Naughton, “Dynamic task scheduling using genetic algorithms for heterogeneous distributed computing,” Proc. Int. Symp. Parallel and Distributed Processing, April 2005.
[8] S. Vakili, S. M. Fakhraie, and S. Mohammadi, “EvoMP: a novel MPSoC architecture with evolvable task decomposition and scheduling,” to appear in IET Comp. & Digital Tech.
[9] E. Carvalho, N. Calazans, and F. Moraes, “Heuristics for dynamic task mapping in NoC-based heterogeneous MPSoCs,” Proc. Intl. Rapid System Prototyping Workshop, 2007.

25 Fetch_Issue Unit ► The PC1–Instr bus is used for executive instructions ► The PC2–Invalidate_Instr bus is used for data-dependency detection

26 Example: 2-Order FIR Filter ► Both cells (Cell1 and Cell2) hold the same program:
MOV R1, 0
MOV R2, 0
L1: MOV R1, Input
MUL R3, R1, Coe1
MUL R4, R2, Coe2
ADD R1, R3, R4
MOV Output, R1
MOV R1, R2
GENETIC
JUMP L1
► Figure: a cycle-by-cycle trace of PC1 and PC2 in both cells, including the transfer of R3 between cells and a stall while waiting for R4 ► Total time: 18 clock cycles
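The listing computes a 2-tap FIR, y[n] = Coe1·x[n] + Coe2·x[n-1]. A Python equivalent of the loop body (assuming the final MOV is intended to update the delay register with the current input):

```python
def fir2(samples, coe1, coe2):
    """2-tap FIR filter: each output mixes the current input with the
    previous one, mirroring the R1/R2 registers in the listing above."""
    prev = 0           # R2: the delayed input, initialized to 0
    out = []
    for x in samples:  # MOV R1, Input
        out.append(coe1 * x + coe2 * prev)  # MUL, MUL, ADD; MOV Output, R1
        prev = x       # update the delay register for the next iteration
    return out
```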

