Particle Swarm Optimization for Run-Time Task Decomposition and Scheduling in Evolvable MPSoC Shervin Vakili, Sied Mehdi Fakhraie, Siamak Mohammadi, and Ali Ahmadi International Conference on Computer Engineering and Technology 2009 (ICCET 2009) January 24, 2009
Why MPSoC? ► An emerging trend in the design of high-performance computing architectures. ► Have most of the desirable advantages of single-processor solutions, such as short time-to-market, post-fabrication reusability, flexibility, and programmability ► Moving toward a large number of simple processors on a chip
MPSoC Development Challenges ► Programming models: MP systems require concurrent software. Two main solutions: Software development using parallel models, e.g. OpenMP and MPI ► “Software developers have been well-trained by sixty years of computing history to think in terms of sequentially defined applications code” ► Requires huge investment to re-develop existing software
MPSoC Development Challenges (2) Automatic parallelization at compile-time ► Does not require reprogramming but requires recompilation ► Such a compiler must solve two complex problems: Decomposition of the program into tasks Scheduling the tasks among cooperating processors ► Both task decomposition and scheduling are NP-complete problems ► G. Martin: “Decomposition of an application described in a serial fashion into a set of concurrent or parallel tasks that can cooperate in an orderly and predictable way is one of the most difficult jobs imaginable and despite forty or more years of intensive research in this area there are very few applications for which this can be done automatically”.
MPSoC Development Challenges (3) ► All MPSoCs can be divided into two categories: Static scheduling ► Task scheduling is performed before run-time ► The number of contributing processors must be predetermined Dynamic scheduling (e.g. current multi-core PC processors) ► A run-time scheduler (in hardware, middleware, or OS) is in charge of task scheduling ► Does not require prior information about the number of available processors (desirable for fault tolerance)
Introduction to EvoMP ► An NoC-based homogeneous multi-processor system with evolvable task decomposition and scheduling ► Features: Distributed control and computing Scalable Does not need parallel programming ► Parallel programming is one of the main difficulties in parallel processing ► It requires reprogramming all existing (sequential) software
Introduction to EvoMP (2) ► Features: All computational units have one copy of the entire program A hardware PSO core is used in the EvoMP architecture to generate a bit-string ► The bit-string specifies the processor on which each instruction must be executed Our first version of EvoMP used a genetic algorithm core
Introduction to EvoMP (3) ► Target Applications: Applications that perform a single computation on a stream of data, e.g.: Digital signal processing of video and audio signals Different codec standards Huge sensory data processing Packet processing in network applications …
How Chromosome Codes the Scheduling Data ► Streaming applications have two main parts: Initialization Infinite (or semi-infinite) Loop
;Initial
1- MOV R1, 0
2- MOV R2, 0
L1: ;Loop
3- MOV R1, Input
4- MUL R3, R1, Coe1
5- MUL R4, R2, Coe2
6- ADD R1, R3, R4
7- MOV Output, R1
8- MOV R1, R2
9- PSO
10- JUMP L1
How EvoMP Works ► The following process is repeated for each iteration: At the beginning of each iteration, the PSO core generates the bit-string (particle) and sends it to all processors Each processor then executes this iteration of the program with the decomposition and scheduling scheme specified by this bit-string An internal counter in the PSO core counts the clock cycles spent during the execution of each iteration When all processors have reached the end of the loop, the PSO core uses the value of this counter as the fitness value of the last generated particle
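The fitness measurement described on this slide can be illustrated in software. The following is a minimal sketch, not the hardware mechanism: the function names, the per-instruction latencies, and the cost model (each processor runs its instructions back to back, with dependencies and NoC traffic ignored) are all assumptions made for illustration. In EvoMP the value comes from a real cycle counter in the PSO core.

```python
from typing import Dict, List

def cycles_for_iteration(schedule: List[int], latency: List[int]) -> int:
    """Toy stand-in for the PSO core's cycle counter (hypothetical cost model).

    schedule[i] is the processor assigned to instruction i and latency[i] is
    its assumed cost in cycles. The iteration ends when the busiest processor
    finishes; data dependencies and NoC delays are ignored here.
    """
    busy: Dict[int, int] = {}
    for proc, cost in zip(schedule, latency):
        busy[proc] = busy.get(proc, 0) + cost
    return max(busy.values(), default=0)

def evaluate_population(schedules: List[List[int]], latency: List[int]) -> List[int]:
    """One loop iteration per particle; the cycle count is that particle's fitness."""
    return [cycles_for_iteration(s, latency) for s in schedules]

# Assumed latencies for the six loop-body instructions (MOV, MUL, MUL, ADD, MOV, MOV).
LATENCY = [1, 3, 3, 1, 1, 1]
fitness = evaluate_population([[0, 0, 0, 0, 0, 0], [0, 0, 1, 0, 0, 0]], LATENCY)
```

Lower fitness is better, since the value is a cycle count. In this toy model, moving one multiplication to a second processor shortens the iteration.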
How EvoMP Works (2) ► The system has three main states: Initialize: ► Used only for the first population ► PSO core generates random particles Evolution: ► PSO core produces each new population through computations based on the best previously archived particles ► When the termination condition is met, the system goes to the Final state Final: ► The best particle achieved in the evolution stage is used as the constant output of the PSO core ► When one of the processors becomes faulty, the system returns to the evolution stage to perform re-evolution (beneficial for the fault tolerance of EvoMP) State diagram: Initialize → Evolution; Evolution → Final (Terminate); Final → Evolution (Fault detected)
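The three states and their transitions can be written as a small state machine. This is a sketch under assumptions: the flag names `population_ready`, `terminated`, and `fault_detected` are hypothetical, and the slides do not say how the hardware signals these events.

```python
from enum import Enum, auto

class State(Enum):
    INITIALIZE = auto()
    EVOLUTION = auto()
    FINAL = auto()

def next_state(state: State, *, population_ready: bool = False,
               terminated: bool = False, fault_detected: bool = False) -> State:
    """Return the next controller state (flag names are illustrative assumptions)."""
    if state is State.INITIALIZE:
        # Random particles are generated only for the first population.
        return State.EVOLUTION if population_ready else State.INITIALIZE
    if state is State.EVOLUTION:
        # Leave evolution once the termination condition is met.
        return State.FINAL if terminated else State.EVOLUTION
    # FINAL: keep outputting the best particle until a processor fails.
    return State.EVOLUTION if fault_detected else State.FINAL
```

For example, `next_state(State.FINAL, fault_detected=True)` returns `State.EVOLUTION`, which corresponds to the re-evolution path used for fault tolerance.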
How Chromosome Codes the Scheduling Data (1) ► Each bit-string (particle) consists of several small words (sub-particles) ► Each sub-particle contains two fields: A processor number A bounded count specifying how many instructions must be executed on the processor named in the first field
How Chromosome Codes the Scheduling Data (2) ► Assume that we have a 2x2 mesh
[Figure: example particle split into sub-particles, each with a processor-number field and a "# of Instructions" field. Bit values shown in the figure: 10001, 00010, 11000, 10101 and 10, 00, 11, 10, 00, 10, 01, 11]
;Initial
1- MOV R1, 0
2- MOV R2, 0
L1: ;Loop
3- MOV R1, Input
4- MUL R3, R1, Coe1
5- MUL R4, R2, Coe2
6- ADD R1, R3, R4
7- MOV Output, R1
8- MOV R1, R2
9- PSO
10- JUMP L1
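The sub-particle encoding described on these slides can be sketched as a decoder that turns a bit-string into a per-instruction processor assignment. The field widths (`PROC_BITS`, `COUNT_BITS`), the function names, and the rule for instructions left over after the last sub-particle are assumptions for illustration. The slides do not specify them.

```python
from typing import List, Tuple

# Hypothetical field widths; the slides do not state them.
PROC_BITS = 2   # enough to name the 4 processors of a 2x2 mesh
COUNT_BITS = 3  # the "limited number" of instructions per sub-particle

def split_particle(bits: str) -> List[Tuple[int, int]]:
    """Split a bit-string into (processor, instruction_count) sub-particles."""
    width = PROC_BITS + COUNT_BITS
    if len(bits) % width != 0:
        raise ValueError("particle length must be a multiple of the sub-particle width")
    subs = []
    for i in range(0, len(bits), width):
        word = bits[i:i + width]
        subs.append((int(word[:PROC_BITS], 2), int(word[PROC_BITS:], 2)))
    return subs

def assign_instructions(bits: str, n_instructions: int) -> List[int]:
    """Return, for each instruction in program order, the processor that runs it.

    Assumption: if the sub-particles cover fewer instructions than the loop
    contains, the remaining instructions stay on the last named processor.
    """
    schedule: List[int] = []
    last_proc = 0
    for proc, count in split_particle(bits):
        last_proc = proc
        schedule.extend([proc] * count)
        if len(schedule) >= n_instructions:
            break
    schedule.extend([last_proc] * (n_instructions - len(schedule)))
    return schedule[:n_instructions]

# Three sub-particles: (proc 2, 2 instr), (proc 0, 3 instr), (proc 3, 1 instr).
example = assign_instructions("10010" + "00011" + "11001", 6)
```

With these assumed widths, the six loop-body instructions (3 to 8 in the listing) are assigned to processors 2, 2, 0, 0, 0, 3 in program order.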
Inter-Processor Data Dependencies ► Inter-processor data dependencies are detected in the source processor using architectural mechanisms The source processor transmits the required data to the destination processor(s) through the NoC No request-send scheme is required
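As a software analogue of the source-side detection described here, the sketch below lists the register values that a producing processor would have to push to another processor for a given assignment. It is not the architectural mechanism itself. The `Instr` and `Transfer` types and the `find_transfers` function are hypothetical names, and only read-after-write dependencies inside one loop iteration are considered (loop-carried values are ignored).

```python
from typing import Dict, List, NamedTuple, Tuple

class Instr(NamedTuple):
    dest: str
    srcs: Tuple[str, ...]

class Transfer(NamedTuple):
    reg: str
    producer: int   # index of the instruction that writes reg
    consumer: int   # index of the instruction that reads reg
    src_proc: int
    dst_proc: int

def find_transfers(program: List[Instr], schedule: List[int]) -> List[Transfer]:
    """List values a producer must send because the consumer runs elsewhere."""
    last_writer: Dict[str, int] = {}
    transfers: List[Transfer] = []
    for i, ins in enumerate(program):
        for reg in ins.srcs:
            p = last_writer.get(reg)
            if p is not None and schedule[p] != schedule[i]:
                transfers.append(Transfer(reg, p, i, schedule[p], schedule[i]))
        last_writer[ins.dest] = i
    return transfers

# Loop body from the slides (instructions 3 to 8), as (dest, sources).
LOOP = [
    Instr("R1", ("Input",)),
    Instr("R3", ("R1", "Coe1")),
    Instr("R4", ("R2", "Coe2")),
    Instr("R1", ("R3", "R4")),
    Instr("Output", ("R1",)),
    Instr("R1", ("R2",)),
]
pushes = find_transfers(LOOP, [0, 0, 1, 0, 0, 0])
```

With the second multiplication placed on processor 1, the only transfer is R4 from processor 1 to processor 0, pushed by the producer without the consumer requesting it.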
Architecture of each Processor ► Number of FUs is a configurable parameter ► Supports out-of-order execution ► The first free FU grabs the instruction from the Instr bus and sends a signal to Fetch_Issue to fetch the next instruction
Particle Swarm Optimization Algorithm ► A stochastic population-based evolutionary algorithm ► Tries to find the optimum solution over the search space by sampling points and converging the swarm on the most promising regions The number of these sampling points (called particles) is constant (population size) Each sampling point is a candidate solution
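Because EvoMP particles are bit-strings, a binary variant of PSO is a natural illustration. The sketch below uses the well-known sigmoid-based binary PSO update and is not the hardware PSO core from the presentation. The coefficients `w`, `c1`, `c2`, the velocity clamp `v_max`, and the function name are assumed values chosen for the example. The cost function is minimized, matching a fitness measured in clock cycles.

```python
import math
import random
from typing import Callable, List, Tuple

def binary_pso(cost: Callable[[List[int]], float], n_bits: int,
               pop_size: int = 16, iterations: int = 50, seed: int = 0,
               w: float = 0.7, c1: float = 1.5, c2: float = 1.5,
               v_max: float = 4.0) -> Tuple[List[int], float, List[float]]:
    """Minimise cost over bit-strings; returns (best, best_cost, best-cost history)."""
    rng = random.Random(seed)
    xs = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    vs = [[0.0] * n_bits for _ in range(pop_size)]
    pbest = [x[:] for x in xs]
    pbest_cost = [cost(x) for x in xs]
    g = min(range(pop_size), key=pbest_cost.__getitem__)
    gbest, gbest_cost = pbest[g][:], pbest_cost[g]
    history = [gbest_cost]
    for _ in range(iterations):
        for i in range(pop_size):
            for d in range(n_bits):
                # Pull each bit toward the particle's own best and the swarm's best.
                v = (w * vs[i][d]
                     + c1 * rng.random() * (pbest[i][d] - xs[i][d])
                     + c2 * rng.random() * (gbest[d] - xs[i][d]))
                v = max(-v_max, min(v_max, v))
                vs[i][d] = v
                # The sigmoid of the velocity is the probability that the bit is 1.
                xs[i][d] = 1 if rng.random() < 1.0 / (1.0 + math.exp(-v)) else 0
            c = cost(xs[i])
            if c < pbest_cost[i]:
                pbest[i], pbest_cost[i] = xs[i][:], c
                if c < gbest_cost:
                    gbest, gbest_cost = xs[i][:], c
        history.append(gbest_cost)
    return gbest, gbest_cost, history

# Toy cost: Hamming distance to a fixed target bit-string.
TARGET = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
def hamming(x: List[int]) -> int:
    return sum(a != b for a, b in zip(x, TARGET))

best, best_cost, history = binary_pso(hamming, len(TARGET))
```

The swarm keeps a constant population (16 here, as in the reported experiments) and the best cost found so far can only stay the same or decrease from one iteration to the next.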
Evolution Phase Results: DCT-16 ► Parameters: Population size=16 NoC connection width=16 ► 324 instructions ► 128 multiplications ► Execution results of the 16-point Discrete Cosine Transform on EvoMPs of different sizes ► Best fitness is the number of clock cycles required to execute one iteration using the best particle found so far.
Evolution Phase Results: MATRIX-5x5 ► 406 instructions ► 125 multiplications ► Execution results of 5x5 matrix multiplication on EvoMPs of different sizes ► Parameters: Population size=16 NoC connection width=16
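Final Evolution Phase Results ► The following table shows the final results achieved in the evolution phase (and the corresponding evolution time) for both the genetic-based and PSO-based EvoMPs. ► These results show a small improvement in final results and convergence time for the PSO-based system.
Benchmark | # of Instr. | # of Multiplies | Particle length (bits) | 1 Processor (Both) Fit | 2 Processors Genetic Fit | 2 Processors Genetic Time | 2 Processors PSO Fit | 2 Processors PSO Time | 3 Processors Genetic Fit | 3 Processors Genetic Time | 3 Processors PSO Fit | 3 Processors PSO Time | 5 Processors Genetic Fit | 5 Processors Genetic Time | 5 Processors PSO Fit | 5 Processors PSO Time
FIR-16 | 74 | 16 | 240 | 350 | 214 | 23.7 | 211 | 12.3 | 171 | 30.1 | 174 | 14.3 | unevaluated | unevaluated | unevaluated | unevaluated
DCT-8 | 88 | 32 | 280 | 671 | 403 | 93.0 | 393 | 6.2 | 319 | 99.8 | 308 | 21.8 | 285 | 138.1 | 203 | 15.6
DCT-16 | 324 | 128 | 720 | 2722 | 1841 | 74.5 | 1831 | 41.7 | 1460 | 23.3 | 1439 | 45.3 | 1213 | 633.7 | 1191 | 98.3
MAT-5x5 | 406 | 125 | 800 | 3181 | 2344 | 198.3 | 2312 | 86.3 | 1868 | 294.8 | 1821 | 148.3 | 1596 | 546.7 | 1518 | 240.9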
Synthesis Results ► The following table shows the synthesis results of both the PSO and genetic cores on a Virtex-II (XC2V3000) FPGA
Scheduler | Population size | Sub-particle length | Total LUTs | Max. Freq. (MHz)
PSO-based scheduler | 16 | 4 | 1864 | 92.6
Genetic-based scheduler | 16 | - | 1642 | 68.4
References
A. A. Jerraya and W. Wolf, Multiprocessor Systems-on-Chips, San Francisco: Morgan Kaufmann Publishers, 2005.
G. Martin, “Overview of the MPSoC design challenge,” Proc. Design and Automation Conf., July 2005, pp. 274-279.
M. Hubner, K. Paulsson, and J. Becker, “Parallel and flexible multiprocessor system-on-chip for adaptive automotive applications based on Xilinx MicroBlaze soft-cores,” Proc. Int. Symp. Parallel and Distributed Processing, 2005, pp. 149.1.
D. Gohringer, M. Hubner, V. Schatz, and J. Becker, “Runtime adaptive multi-processor system-on-chip: RAMPSoC,” Proc. Int. Symp. Parallel and Distributed Processing, Apr. 2008, pp. 1-7.
A. Klimm, L. Braun, and J. Becker, “An adaptive and scalable multiprocessor system for Xilinx FPGAs using minimal sized processor cores,” Proc. Symp. Parallel and Distributed Processing, Apr. 2008, pp. 1-7.
A. Farmahini-Farahani, S. Vakili, S. M. Fakhraie, S. Safari, and C. Lucas, “Parallel scalable hardware implementation of asynchronous discrete particle swarm optimization,” Elsevier J. of Engineering Applications of Artificial Intelligence, submitted for publication.
A. J. Page and T. J. Naughton, “Dynamic task scheduling using genetic algorithms for heterogeneous distributed computing,” Proc. Int. Symp. Parallel and Distributed Processing, Apr. 2005, pp. 189.1.
S. Vakili, S. M. Fakhraie, and S. Mohammadi, “EvoMP: a novel MPSoC architecture with evolvable task decomposition and scheduling,” to appear in IET Comp. & Digital Tech.
E. Carvalho, N. Calazans, and F. Moraes, “Heuristics for dynamic task mapping in NoC-based heterogeneous MPSoCs,” Proc. Intl. Rapid System Prototyping Workshop, 2007, pp. 34-40.
Fetch_Issue Unit ► PC1-Instr bus is used for the instructions to be executed ► PC2-Invalidate_Instr bus is used for data dependency detection