1 Design and Implementation of a NoC-Based Cellular Computational System
By: Shervin Vakili
Supervisors: Dr. Sied Mehdi Fakhraie, Dr. Siamak Mohammadi
February 09, 2009

2 Outline
 Introduction and Motivations
 Basics of Evolvable Multiprocessor System (EvoMP)
 EvoMP Operational View
 EvoMP Architectural View
 Simulation and Synthesis Results
 Summary

3 Introduction and Motivations
Basics of Evolvable Multiprocessor System
EvoMP Operational View
EvoMP Architectural View
Simulation and Synthesis Results
Summary

4 Introduction and Motivations (1)
 Computing systems have played an important role in the advances of human life over the last four decades.
 The number and complexity of applications are continuously increasing.
 More computational power is required.
 Three main hardware design approaches, trading flexibility against performance:
-ASIC (hardware realization)
-Reconfigurable computing
-Processor-based designs (software realization)

5 Introduction and Motivations (2)
 Microprocessors are the most popular approach.
-Flexibility and reprogrammability
-Low performance
 Architectural techniques to improve processor performance:
-Pipelining, out-of-order execution, superscalar execution, VLIW, etc.
-These techniques seem to have saturated in recent years.

6 Introduction and Motivations (3)
 Emerging trends aim to achieve:
-More performance
-While preserving the classical software development process. [1]

7 Why Multi-Processor?
 One of the main trends is to increase the number of processors.
 Uses thread-level parallelism (TLP)
 Similarity to single-processor designs:
-Short time-to-market
-Post-fabrication reusability
-Flexibility and programmability
 Moving toward a large number of simple processors on a chip.

8 Number of Processing Cores in Different Products [3]

9 MPSoC Development Challenges (1)
 MP systems face some major challenges.
 Programming models:
-MP systems require concurrent software.
-Concurrent software development requires two operations:
-Decomposition of the program into tasks
-Scheduling of the tasks among cooperating processors
-Both are NP-complete problems.
-Both strongly affect performance.

10 MPSoC Development Challenges (2)
-Two main solutions:
1. Software development using parallel programming libraries
-e.g., MPI and OpenMP
-Performed manually by the programmer
-Requires a huge investment to re-develop existing software
2. Automatic parallelization at compile time
-Does not require reprogramming, but requires re-compilation
-The compiler performs both task decomposition and scheduling

11 MPSoC Development Challenges (3)
 Control and synchronization
-To address inter-processor data dependencies
 Debugging
-Tracking concurrent execution is difficult.
-Particularly in heterogeneous architectures with processors of different ISAs.

12 MPSoC Development Challenges (4)
 All MPSoCs can be divided into two categories:
-Static scheduling
-Task scheduling is performed before execution.
-Predetermined number of contributing processors.
-The scheduler has access to the entire program.
-Dynamic scheduling
-A run-time scheduler (in hardware or the OS) performs task scheduling.
-Does not depend on the number of processors.
-Only has access to pending tasks and available resources.

13 Introduction and Motivations
Basics of Evolvable Multiprocessor System
EvoMP Operational View
EvoMP Architectural View
Simulation and Synthesis Results
Summary

14 Proposal of Evolvable Multi-processor System (1)
 This thesis introduces a novel MPSoC:
-It uses evolutionary strategies for run-time task decomposition and scheduling.
-It is called EvoMP (Evolvable Multi-Processor system).
 Features:
-Can directly execute classical sequential code on an MP platform.
-Uses a hardware evolutionary algorithm core to perform run-time task decomposition and scheduling.
-Distributed control and computing
-Flexibility
-NoC-based, 2D mesh, and homogeneous

15 Proposal of Evolvable Multi-processor System (2)
 All computational units have one copy of the entire program.
 The EvoMP architecture exploits a hardware evolutionary core:
-to generate a bit-string (chromosome).
-This bit-string determines which processor is in charge of executing each instruction.
 The primary version of EvoMP uses a genetic algorithm core.

16 Target Applications
 Applications that perform a unique computation on a stream of data, e.g.:
-Digital signal processing
-Packet processing in network applications
-Processing of huge sensory data
-…

17 Streaming Applications Code Style
 Streaming programs have two main parts:
-Initialization
-Infinite (or semi-infinite) loop

Two-Tap FIR Filter:
;Initial
1-  MOV R1, 0
2-  MOV R2, 0
L1: ;Loop
3-  MOV R1, Input
4-  MUL R3, R1, Coe1
5-  MUL R4, R2, Coe2
6-  ADD R1, R3, R4
7-  MOV Output, R1
8-  MOV R1, R2
9-  GENETIC
10- JUMP L1
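The loop above computes y[n] = Coe1·x[n] + Coe2·x[n-1] on an endless input stream. A minimal Python sketch of the same two-tap FIR, for readers more comfortable with software (the coefficient values 0.5 are illustrative, not from the slides):

```python
def fir_two_tap(samples, coe1=0.5, coe2=0.5):
    """Two-tap FIR: y[n] = coe1*x[n] + coe2*x[n-1], applied to a stream."""
    prev = 0  # plays the role of R2: the previous input sample
    for x in samples:               # the (semi-)infinite loop body
        y = coe1 * x + coe2 * prev  # the MUL/MUL/ADD of the assembly listing
        prev = x                    # MOV R1, R2: advance the delay line
        yield y                     # MOV Output, R1

print(list(fir_two_tap([1, 2, 3, 4])))  # [0.5, 1.5, 2.5, 3.5]
```

A generator fits the streaming code style well: like the assembly loop, it keeps only the delay-line state between iterations and never needs the whole input at once.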

18 18 Introduction and Motivations Basics of Evolvable Multiprocessor System (EvoMP) EvoMP Operational View EvoMP Architectural View Simulation and Synthesis Results Summary

19 EvoMP Top View
 The genetic core produces a bit-string (chromosome)
-Determines the location of execution of each instruction
[Figure: 2x2 mesh — processors P-00, P-01, P-10, P-11 connected through switches SW00, SW01, SW10, SW11 and to the genetic core; every processor holds an identical copy of the two-tap FIR program. Chromosome: 0110110…11]

20 How EvoMP Works? (1)
 The following process is repeated in each iteration:
-At the beginning of each iteration, the genetic core generates and sends the bit-string (chromosome) to all processors.
-The processors execute this iteration with the decomposition and scheduling scheme that the chromosome determines.
-A counter in the genetic core counts the number of spent clock cycles.
-When all processors reach the end of the loop, the genetic core uses the output of this counter as the fitness value.
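In software terms, this is a generational genetic algorithm whose fitness function is "clock cycles per loop iteration" (lower is better). A hedged Python sketch — here `evaluate_cycles` merely stands in for running one iteration on the mesh and reading the genetic core's counter, and the operators shown (binary tournament, one-point crossover) are illustrative, not EvoMP's exact hardware operators:

```python
import random

def evolve_schedule(evaluate_cycles, chromosome_len, pop_size=16, generations=50):
    """Minimize clock cycles per iteration over bit-string chromosomes.

    evaluate_cycles(chromosome) -> int models executing one loop iteration
    under that chromosome's schedule and reading the hardware cycle counter."""
    population = [[random.randint(0, 1) for _ in range(chromosome_len)]
                  for _ in range(pop_size)]
    best = min(population, key=evaluate_cycles)
    for _ in range(generations):
        def pick():
            # binary tournament: keep the chromosome with fewer cycles
            a, b = random.sample(population, 2)
            return a if evaluate_cycles(a) <= evaluate_cycles(b) else b
        children = []
        while len(children) < pop_size:
            p1, p2 = pick(), pick()
            cut = random.randrange(1, chromosome_len)  # one-point crossover
            children.append(p1[:cut] + p2[cut:])
        population = children
        best = min(population + [best], key=evaluate_cycles)
    return best
```

The key difference from a textbook GA is that in the real system the fitness is measured rather than modeled: every processor actually runs the iteration under the candidate schedule while a hardware counter accumulates cycles.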

21 How EvoMP Works? (2)
 Three main working states:
-Initialize:
-Only for the first population.
-The genetic core generates random particles (chromosomes).
-Evolution:
-Uses recombination to produce new populations.
-When the termination condition is met, the system goes to the final state.
-Final:
-The best chromosome is used as the constant output of the genetic core.
-When one of the processors becomes faulty, the system returns to the evolution state.
[State diagram: Initialize → Evolution → (terminate) → Final; Final → Evolution on fault detected]
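The three working states form a small state machine, which can be written as a table-driven transition function; the state and event names below are paraphrased from the slide, not actual EvoMP signal names:

```python
def next_state(state, event):
    """Transition function for EvoMP's three working states."""
    transitions = {
        ("initialize", "first_population_ready"): "evolution",
        ("evolution", "terminate"): "final",
        ("final", "fault_detected"): "evolution",
    }
    # any (state, event) pair not listed leaves the state unchanged
    return transitions.get((state, event), state)

print(next_state("final", "fault_detected"))  # evolution
```

The "final → evolution" arc is what gives the system its graceful-degradation behavior: a detected fault simply re-opens the search under the new resource set.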

22 How Chromosome Codes the Scheduling Data? (1)
 Each chromosome consists of some small words (genes).
 Each word contains two fields:
-A processor number
-A number of instructions

23 How Chromosome Codes the Scheduling Data? (2)
 Assume that we have a 2x2 mesh.
[Figure: a chromosome made of four words (Word1–Word4: 10001, 01010, 11000, 10101); each word pairs a processor number (00, 10, 01, 11) with a number-of-instructions field, mapping the 10-instruction two-tap FIR program onto the mesh]
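Decoding such a chromosome into an instruction-to-processor assignment can be sketched as follows. The gene values are hypothetical (the per-field bit widths of the real encoding are not given on the slide); each gene's count says how many consecutive instructions run on that processor:

```python
def decode_chromosome(genes, num_instructions):
    """Expand (processor, count) genes into an instruction -> processor list."""
    assignment = []
    for proc, count in genes:
        assignment.extend([proc] * count)   # next `count` instructions on `proc`
    return assignment[:num_instructions]

# Hypothetical mapping of the 10-instruction FIR program onto a 2x2 mesh:
genes = [("00", 3), ("10", 2), ("01", 3), ("11", 2)]
print(decode_chromosome(genes, 10))
# ['00', '00', '00', '10', '10', '01', '01', '01', '11', '11']
```

This run-length style keeps the chromosome short: four genes suffice to place ten instructions, and crossover over genes moves whole instruction blocks between processors.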

24 Data Dependency Problem
 Data dependencies are the main challenge.
 They must be detected dynamically at run-time.
 This is addressed using:
-A particular machine code style
-Architectural techniques

25 EvoMP Machine Code Style
 Each source operand is replaced by the line number (ID) of the most recent instruction that has changed it.
 This enormously simplifies dependency detection.

10. ADD R1, R2, R3 ; R3 = R1 + R2
11. AND R2, R6, R7 ; R7 = R2 & R6
12. SUB R7, R3, R4 ; R4 = R7 - R3  →  becomes: 12. SUB (11), (10), R4
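This renaming is a single pass that tracks the last writer of each register. A Python sketch of the transformation (the tuple format for instructions is invented for illustration; destination is the last field, as in the slide's three-operand style):

```python
def rename_operands(program):
    """Rewrite source registers as the line number (ID) of the most recent
    instruction that wrote them, per the EvoMP machine-code style."""
    last_writer = {}  # register -> line number of its latest write
    out = []
    for line, op, src1, src2, dest in program:
        def ref(reg):
            # replace a source register by "(line)" if it has a known writer
            return f"({last_writer[reg]})" if reg in last_writer else reg
        out.append((line, op, ref(src1), ref(src2), dest))
        last_writer[dest] = line  # record this write *after* renaming sources
    return out

prog = [(10, "ADD", "R1", "R2", "R3"),
        (11, "AND", "R2", "R6", "R7"),
        (12, "SUB", "R7", "R3", "R4")]
print(rename_operands(prog)[-1])  # (12, 'SUB', '(11)', '(10)', 'R4')
```

With IDs instead of register names, a consumer instruction identifies its producers directly, so a processor can tell which results to wait for (or request over the NoC) without scanning other processors' register files.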

26 Introduction and Motivations
Basics of Evolvable Multiprocessor System (EvoMP)
EvoMP Operational View
EvoMP Architectural View
Simulation and Synthesis Results
Summary

27 Architecture of Each Processor
 The number of FUs is configurable.
 Homogeneous or heterogeneous policies can be used for the FUs.
 Supports out-of-order execution.
 The first free FU grabs the instruction from the Instr bus (daisy chain).

28 Fetch_Issue Unit
 The PC1-Instr bus is used for instructions to be executed.
 The PC2-Invalidate_Instr bus is used for data dependency detection.

29 Functional Unit
 Can be configured to execute different operations:
-Arithmetic operations
-Add
-Subtract
-Shift/rotate right/left
-Multiply (add-and-shift)
-Logical operations

30 Genetic Core
[Figure: 2x2 mesh — cells Cell-00, Cell-01, Cell-10, Cell-11 connected through switches SW00, SW01, SW10, SW11 to the genetic core]
 Population size and mutation rate are configurable.
 The elite count is constant and equal to two in order to reduce hardware complexity.

31 EvoMP Challenges
 The current version uses a centralized memory unit.
-Located at address "00".
-This address does not contain computational circuits.
-A major issue for scalability.
 The search space of the genetic algorithm is very large.
-It grows exponentially with a linear increase in the number of processors.

32 PSO Core [8]

33 Introduction and Motivations
Basics of Evolvable Multiprocessor System (EvoMP)
EvoMP Operational View
EvoMP Architectural View
Simulation and Synthesis Results
Summary

34 Configurable Parameters
 There are some configurable parameters in EvoMP:
-Word length of the system
-Size of the mesh (number of processors)
-Flit length: bit-length of the NoC switch links
-Population size
-Crossover rate

35 Simulation Results
 Two sets of applications are used for performance evaluation:
-Some DSP programs
-Some sample neural networks
 Two other decomposition and scheduling methods are implemented to enable comparison:
-Static Decomposition Genetic Scheduler (SDGS)
-Decomposition is performed statically, i.e., tasks are predetermined manually.
-The genetic core only specifies the scheduling scheme.
-Static Decomposition First Free Scheduler (FF)
-Assigns the first task in the job queue to the first free processor in the system.

36 16-Tap FIR Filter
 Parameters:
-16-bit mode
-Population size = 16
-Crossover rate = 8
-NoC connection width = 16
 74 instructions
 16 multiplications
 The best fitness shows the number of clock cycles required to execute one iteration using the best particle found so far.

37 8-Point DCT
 Parameters:
-16-bit mode
-Population size = 16
-Crossover rate = 8
-NoC connection width = 16
 88 instructions
 32 multiplications

38 16-Point DCT
 Parameters:
-16-bit mode
-Population size = 16
-Crossover rate = 6
-NoC connection width = 16
 320 instructions
 128 multiplications

39 5x5 Matrix Multiplication
 Parameters:
-16-bit mode
-Population size = 16
-Crossover rate = 6
-NoC connection width = 16
 406 instructions
 125 multiplications

40 [Table: for FIR-16, DCT-8, DCT-16, and MATRIX-5x5 (74/88/324/406 instructions; 16/32/128/125 multiply instructions) — fitness (clock cycles), speed-up, and evolution time (us) of the main design, SDGS, and First Free schemes on 1x2 (single-processor baseline, speed-up = 1), 1x3, and 2x2 meshes. Reported speed-ups (FIR-16/DCT-8/DCT-16/MATRIX-5x5): main design 1.63/1.66/1.47/1.37 on 1x3 and 2.04/2.10/1.86/1.70 on 2x2; SDGS 1.73/1.67/1.50/1.43 on 1x3 and 2.17/2.19/2.28/1.75 on 2x2; First Free 1.19/0.91/1.08/1.27 on 1x3 and 1.46/0.98/1.40/1.51 on 2x2.]

41 [Table: results on a 2x3 mesh for FIR-16, DCT-8, DCT-16, and MATRIX-5x5, with the 1x2 single-processor fitness as the speed-up baseline. Reported speed-ups for DCT-8/DCT-16/MATRIX-5x5 (FIR-16 unevaluated for the main design): main design 2.33/2.25/1.99, SDGS 2.62/2.46/2.01, First Free 1.35/1.71/1.75.]

42 Neural Network Case Study

Network  | #Instr | #Mult | 1x2 Fitness | 1x3 Fit / SU / Time | 2x2 Fit / SU / Time | 2x3 Fit / SU / Time
4-4-1    | 58     | 20    | 450 (SU=1)  | 281 / 1.60 / 125    | 245 / 1.83 / 52     | 207 / 2.17 / 262
3-9-2    | 95     | 45    | 905 (SU=1)  | 570 / 1.59 / 52     | 503 / 1.80 / 163    | 463 / 1.95 / 342
12-20-10 | 924    | 440   | 8304 (SU=1) | 5153 / 1.61 / 892   | 4365 / 1.90 / 1832  | 3813 / 2.18 / 3436

43 Fault Tolerance Results
 When a fault is detected in a processor, the evolutionary core excludes that processor from contributing in subsequent iterations.
 The system also returns to the evolution state to find a suitable solution for the new situation.
 The best obtained fitness in a 2x3 EvoMP running the 16-point DCT program is evaluated.
 Faults are injected into processors 010, 001, and 101 at 1000000 us, 2000000 us, and 3000000 us, respectively.

44 Genetic vs. PSO
 Population size in both experiments is 16.
[Table: fitness and evolution time of the genetic and PSO cores for FIR-16, DCT-8, DCT-16, and MAT-5x5 (particle lengths 240, 280, 720, and 800 bits) on 1x2 (identical for both cores), 1x3, 2x2, and 2x3 meshes; the FIR-16 2x3 case is unevaluated.]

45 Synthesis Results
 Synthesis results on a Virtex-II (XC2V3000) FPGA using Synplify Pro:

Unit:              NoC switch | Genetic core | PSO core  | MMU        | Processor  | Total system
Area (total LUTs): 729 (2%)   | 1864 (6%)    | 1642 (5%) | 3553 (12%) | 4433 (15%) | 20112 (70%)
Max freq. (MHz):   -          | 68.4         | 94.6      | -          | -          | 61.4

46 Introduction and Motivations
Basics of Evolvable Multiprocessor System (EvoMP)
EvoMP Operational View
EvoMP Architectural View
Simulation and Synthesis Results
Summary

47 Summary
 EvoMP, a novel MPSoC system, was presented.
 EvoMP exploits evolvable strategies to perform run-time task decomposition and scheduling.
 EvoMP does not require concurrent code because it can parallelize sequential code at run-time.
 It exploits a particular and novel processor architecture to address the data dependency problem.
 Experimental results confirm the applicability of EvoMP's novel ideas.

48 Main References
[1] N. S. Voros and K. Masselos, System Level Design of Reconfigurable Systems-on-Chip. Netherlands: Springer, 2005.
[2] G. Martin, "Overview of the MPSoC design challenge," Proc. Design and Automation Conf., July 2005, pp. 274-279.
[3] S. Amarasinghe, "Multicore programming primer and programming competition," class notes for 6.189, Computer Architecture Group, Massachusetts Institute of Technology. Available: www.cag.csail.mit.edu/ps3/lectures/6.189-lecture1-intro.pdf.
[4] M. Hubner, K. Paulsson, and J. Becker, "Parallel and flexible multiprocessor system-on-chip for adaptive automotive applications based on Xilinx MicroBlaze soft-cores," Proc. Intl. Symp. Parallel and Distributed Processing, 2005.
[5] D. Gohringer, M. Hubner, V. Schatz, and J. Becker, "Runtime adaptive multi-processor system-on-chip: RAMPSoC," Proc. Intl. Symp. Parallel and Distributed Processing, April 2008, pp. 1-7.
[6] A. Klimm, L. Braun, and J. Becker, "An adaptive and scalable multiprocessor system for Xilinx FPGAs using minimal sized processor cores," Proc. Symp. Parallel and Distributed Processing, April 2008, pp. 1-7.
[7] Z. Y. Wen and Y. J. Gang, "A genetic algorithm for tasks scheduling in parallel multiprocessor systems," Proc. Intl. Conf. Machine Learning and Cybernetics, Nov. 2003, pp. 1785-1790.
[8] A. Farmahini-Farahani, S. Vakili, S. M. Fakhraie, S. Safari, and C. Lucas, "Parallel scalable hardware implementation of asynchronous discrete particle swarm optimization," Elsevier J. of Engineering Applications of Artificial Intelligence, submitted for publication.

49 Main References (2)
[9] A. A. Jerraya and W. Wolf, Multiprocessor Systems-on-Chips. San Francisco: Morgan Kaufmann Publishers, 2005.
[10] A. J. Page and T. J. Naughton, "Dynamic task scheduling using genetic algorithms for heterogeneous distributed computing," Proc. Intl. Symp. Parallel and Distributed Processing, April 2005, pp. 189.1.
[11] E. Carvalho, N. Calazans, and F. Moraes, "Heuristics for dynamic task mapping in NoC based heterogeneous MPSoCs," Proc. Intl. Rapid System Prototyping Workshop, 2007, pp. 34-40.
[12] R. Canham and A. Tyrrell, "An embryonic array with improved efficiency and fault tolerance," Proc. NASA/DoD Conf. on Evolvable Hardware, July 2003, pp. 265-272.
[13] W. Barker, D. M. Halliday, Y. Thoma, E. Sanchez, G. Tempesti, and A. Tyrrell, "Fault tolerance using dynamic reconfiguration on the POEtic Tissue," IEEE Trans. Evolutionary Computing, vol. 11, no. 5, Oct. 2007, pp. 666-684.

50 Related Publications
 Journal:
1. S. Vakili, S. M. Fakhraie, and S. Mohammadi, "EvoMP: a novel MPSoC architecture with evolvable task decomposition and scheduling," submitted to IET Comp. & Digital Tech. (under revision).
2. S. Vakili, S. M. Fakhraie, and S. Mohammadi, "Low-cost fault tolerance in evolvable multiprocessor system: a graceful degradation approach," submitted to Journal of Zhejiang University SCIENCE A (JZUS-A).
 Conference:
1. S. Vakili, S. M. Fakhraie, and S. Mohammadi, "Designing an MPSoC architecture with run-time and evolvable task decomposition and scheduling," Proc. 5th IEEE Intl. Conf. Innovations in Information Technology, Dec. 2008.
2. S. Vakili, S. M. Fakhraie, S. Mohammadi, and A. Ahmadi, "Particle swarm optimization for run-time task decomposition and scheduling in evolvable MPSoC," Proc. IEEE Intl. Conf. Computer Engineering and Technology, Jan. 2009.

