# PARALLEL PROCESSING COMPARATIVE STUDY

CONTEXT How can we finish a piece of work in a short time? Solution: use a quicker worker. Drawback: a worker's speed has a limit, so this is inadequate for long works.

CONTEXT How can we finish a calculation in a short time? Solution: use a quicker calculator (a processor) [1960-2000]. Drawback: processor speed has reached a limit, so this is inadequate for long calculations.

CONTEXT How can we finish a piece of work in a short time? Solutions: 1. Use a quicker worker (inadequate for long works). 2. Use more than one worker concurrently.

CONTEXT How can we finish a calculation in a short time? Solutions: 1. Use a quicker processor (inadequate for long calculations). 2. Use more than one processor concurrently: parallelism.

CONTEXT Definition: Parallelism is the concurrent use of more than one processing unit (CPUs, processor cores, GPUs, or combinations of them) in order to carry out calculations more quickly.
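As a toy illustration of this definition, the sketch below splits a summation over several worker processes; Python's multiprocessing stands in for the processing units (the study itself uses MPI and CUDA, not Python).

```python
from multiprocessing import Pool

def partial_sum(bounds):
    # Each processing unit sums one contiguous block of the range.
    lo, hi = bounds
    return sum(range(lo, hi))

if __name__ == "__main__":
    n, units = 1_000_000, 4                          # 4 concurrent processing units
    step = n // units
    chunks = [(u * step, (u + 1) * step) for u in range(units)]
    with Pool(units) as pool:
        total = sum(pool.map(partial_sum, chunks))   # blocks are summed concurrently
    assert total == n * (n - 1) // 2                 # same answer as the sequential sum
```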

THE GOAL Parallelism needs: 1. A parallel computer (more than one processor). 2. Accommodating the calculation to the parallel computer.

THE GOAL Parallel computers: several are available on the hardware market, and they differ in their architecture. Several classifications exist: based on the instruction and data streams (Flynn's classification), based on the memory-sharing degree, and others.

THE GOAL Flynn's Classification A. Single Instruction, Single Data stream (SISD) B. Single Instruction, Multiple Data streams (SIMD) C. Multiple Instruction, Single Data stream (MISD) D. Multiple Instruction, Multiple Data streams (MIMD)

THE GOAL Memory-Sharing Degree Classification A. Shared memory B. Distributed memory C. Hybrid distributed-shared memory

THE GOAL Parallelism needs: 1. A parallel computer (more than one processor). 2. Accommodating the calculation to the parallel computer: dividing the calculation and the data between the processors, and defining the execution scenario (how the processors cooperate).

THE GOAL The accommodation of a calculation to a parallel computer is called parallel processing. It depends closely on the architecture.

THE GOAL Goal: a comparative study between 1. the shared-memory parallel processing approach and 2. the distributed-memory parallel processing approach.

PLAN 1. Distributed-memory parallel processing approach 2. Shared-memory parallel processing approach 3. Case study problems 4. Comparison results and discussion 5. Conclusion

DISTRIBUTED MEMORY PARALLEL PROCESSING APPROACH

DISTRIBUTED MEMORY PARALLEL PROCESSING APPROACH Distributed-Memory Computer (DMC) = Distributed-Memory System (DMS) = Massively Parallel Processor (MPP)

DISTRIBUTED MEMORY PARALLEL PROCESSING APPROACH Distributed-memory computer architecture

DISTRIBUTED MEMORY PARALLEL PROCESSING APPROACH Architecture of nodes. Nodes can be: identical processors → pure DMC; different types of processors → hybrid DMC; different types of nodes with different architectures → heterogeneous DMC.

DISTRIBUTED MEMORY PARALLEL PROCESSING APPROACH Architecture of the interconnection network. There is no shared memory space between nodes, so the network is the only way for nodes to communicate, and its performance directly influences the performance of a parallel program on a DMC. Network performance depends on: 1. the topology, 2. the physical connectors (wires, ...), 3. the routing technique. The evolution of DMCs closely depends on the evolution of networking.

DISTRIBUTED MEMORY PARALLEL PROCESSING APPROACH The DMC used in our comparative study: a heterogeneous DMC, a modest cluster of workstations with three nodes: a Sony laptop (Intel i3), an HP laptop (Intel i3), and an HP laptop (Intel Core 2 Duo). Communication network: 100 Mbit/s Ethernet.

DISTRIBUTED MEMORY PARALLEL PROCESSING APPROACH Parallel software development for a DMC. The designer's main tasks: 1. global calculation decomposition and task assignment, 2. data decomposition, 3. definition of the communication scheme, 4. synchronization study.
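The first two designer tasks, calculation and data decomposition, can be sketched as follows. This is an illustrative Python sketch of a plain block decomposition, one common choice; the text does not say which decomposition the study actually uses.

```python
def block_decompose(data, n_nodes):
    """Split data into n_nodes contiguous blocks whose sizes differ by at most 1."""
    base, extra = divmod(len(data), n_nodes)
    blocks, start = [], 0
    for node in range(n_nodes):
        size = base + (1 if node < extra else 0)  # spread the remainder evenly
        blocks.append(data[start:start + size])
        start += size
    return blocks

# Ten data items assigned to the three nodes of the cluster:
blocks = block_decompose(list(range(10)), 3)
# blocks == [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]
```

Each node then works on its own block, and the communication scheme (task 3) defines how the partial results are exchanged.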

DISTRIBUTED MEMORY PARALLEL PROCESSING APPROACH Parallel software development for a DMC. Important considerations for efficiency: 1. minimize communication, 2. avoid barrier synchronization.

DISTRIBUTED MEMORY PARALLEL PROCESSING APPROACH Implementation environments. Several exist, notably PVM (Parallel Virtual Machine) and MPI (Message Passing Interface).

DISTRIBUTED MEMORY PARALLEL PROCESSING APPROACH MPI application anatomy. All the nodes execute the same code, yet the nodes do not all do the same work. This apparent contradiction is resolved by the SPMD (Single Program, Multiple Data) application form: the single program branches on the process rank, so the processes can be organized as one controller and several workers.
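The controller/workers organization can be mimicked in plain Python as follows. This is only a sketch of the SPMD idea (every process runs the same `program`, branching on its rank), not actual MPI code, and the cyclic data split is an illustrative choice.

```python
from multiprocessing import Process, Queue

def program(rank, size, data, queue):
    # SPMD: every process executes this same program and branches on its rank.
    if rank == 0:
        # Controller: gather the partial sums produced by the workers.
        total = sum(queue.get() for _ in range(size - 1))
        queue.put(("total", total))
    else:
        # Worker: process its own cyclic slice of the data, then report back.
        queue.put(sum(data[rank - 1::size - 1]))

if __name__ == "__main__":
    data, size = list(range(100)), 4
    queue = Queue()
    procs = [Process(target=program, args=(r, size, data, queue)) for r in range(size)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    tag, total = queue.get()   # ("total", 4950)
```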

SHARED MEMORY PARALLEL PROCESSING APPROACH Several SMPCs are on the market, e.g. multi-core PCs (Intel i3/i5/i7, AMD). Which SMPC do we use? The GPU: originally designed for image processing, the GPU is now a domestic supercomputer. Characteristics: the cheapest and fastest shared-memory parallel computer, but with a hard parallel design.

SHARED MEMORY PARALLEL PROCESSING APPROACH The GPU architecture; the implementation environment

SHARED MEMORY PARALLEL PROCESSING APPROACH GPU architecture. Like a classical processing unit, the Graphics Processing Unit is composed of two main components: A. calculation units, B. storage units.

SHARED MEMORY PARALLEL PROCESSING The GPU architecture; the implementation environment: 1. CUDA, for GPUs manufactured by NVIDIA; 2. OpenCL, independent of the GPU architecture.

SHARED MEMORY PARALLEL PROCESSING CUDA Program Anatomy

SHARED MEMORY PARALLEL PROCESSING Q: How do we execute the code fragments to be parallelized on the GPU? A: By calling a kernel. Q: What is a kernel? A: A kernel is a function, callable from the host and executed on the device simultaneously by many threads in parallel.
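A rough analogy in Python (illustrative only; real kernels are written in CUDA C): the kernel body is written for one thread index `i`, and the device runs it for every `i` at once. The sequential loop below stands in for that parallel launch, and `saxpy_kernel` is a hypothetical example, not code from the study.

```python
def saxpy_kernel(i, a, x, y, out):
    # Body of the "kernel": what a single GPU thread with index i would compute.
    out[i] = a * x[i] + y[i]

n = 8
x = [float(i) for i in range(n)]
y = [1.0] * n
out = [0.0] * n
for i in range(n):            # on the GPU, all n threads execute concurrently
    saxpy_kernel(i, 2.0, x, y, out)
# out == [1.0, 3.0, 5.0, 7.0, 9.0, 11.0, 13.0, 15.0]
```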

SHARED MEMORY PARALLEL PROCESSING Kernel Launch

Design recommendations: utilize the shared memory to reduce the time spent accessing the global memory, and reduce the number of idle threads (control divergence) to fully utilize the GPU resources.

CASE STUDY PROBLEMS Two case study problems are used: matrix multiplication and Pi approximation.
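For the Pi approximation case study, the text does not give the exact algorithm; a common parallelizable formulation, used here only as an illustration, approximates π = ∫₀¹ 4/(1+x²) dx with the midpoint rule and splits the intervals between the processing units.

```python
def pi_partial(first, last, n):
    # Midpoint-rule contribution of intervals [first, last) out of n total.
    h = 1.0 / n
    return h * sum(4.0 / (1.0 + ((i + 0.5) * h) ** 2) for i in range(first, last))

n, units = 100_000, 4
bounds = [(u * n // units, (u + 1) * n // units) for u in range(units)]
pi_approx = sum(pi_partial(first, last, n) for first, last in bounds)
```

Each partial sum is independent, which is why this problem suits both the GPU (one interval per thread) and the cluster (one block of intervals per node).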

COMPARISON Comparison criteria, analysis, and conclusions

COMPARISON Criterion 1: Time-Cost factor, F = T ∗ C, where T is the parallel execution time (in milliseconds) and C is the hardware cost (in Saudi Riyals). Hardware costs (C): GPU: 5000 SAR; cluster of workstations: 9630 SAR.
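The factor is then a straightforward product. The execution times below are hypothetical placeholders (the study's measured times are in its result charts, which are not reproduced here); the costs are the ones quoted above.

```python
cost_sar = {"GPU": 5000.0, "cluster": 9630.0}   # hardware cost C (SAR)
time_ms = {"GPU": 120.0, "cluster": 340.0}      # hypothetical parallel times T (ms)

# Time-Cost factor F = T * C; a lower F means a better speed/price balance.
factor = {p: time_ms[p] * cost_sar[p] for p in cost_sar}
```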

COMPARISON Conclusion: the GPU is better when we need to perform a large number of calculations that each involve a small number of iterations. However, if the need is to perform one calculation with a big number of iterations, the cluster of workstations is the better choice.

COMPARISON Criterion 2: required memory, matrix multiplication problem. Graphics Processing Unit: the global-memory-based method requires M = 6 ∗ …; the shared-memory-based method requires M = 8 ∗ … Cluster of workstations: the cluster used contains three nodes, and each requires M = (19/3) ∗ …

COMPARISON Criterion 2: required memory, Pi approximation problem. Graphics Processing Unit: the size of the arrays depends on the number of threads used; the required memory = … ∗ … Cluster of workstations: a small amount of memory is used on each node, almost 15 ∗ …

COMPARISON Criterion 2 conclusion: we cannot judge which parallel approach is better on the required-memory criterion; it depends on the intrinsic characteristics of the problem at hand.

COMPARISON Criterion 3: the gap between the theoretical complexity and the effective complexity, calculated by Gap = ((T_exp / T_theo) − 1) × 100, where T_exp is the experimental parallel time, T_theo = T_seq / N is the theoretical parallel time, T_seq is the sequential time, and N is the number of processing units.
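As a sanity check of the formula, a small sketch with made-up times (the measured values live in the result charts):

```python
def gap_percent(t_exp, t_seq, n_units):
    # Gap = ((T_exp / T_theo) - 1) * 100, with T_theo = T_seq / N.
    t_theo = t_seq / n_units
    return (t_exp / t_theo - 1.0) * 100.0

# Example: sequential time 100 ms on 4 units gives T_theo = 25 ms;
# a measured 30 ms therefore corresponds to a gap of about 20%.
gap = gap_percent(30.0, 100.0, 4)
```

A gap of 0 means the run matched the ideal N-fold speedup; a negative gap means it beat it.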

COMPARISON Criterion 3 results: cluster of workstations

COMPARISON Criterion 3 results: graphics processing unit

COMPARISON Criterion 3 conclusion: on the GPU, the measured execution time of a parallel program can be lower than the theoretically expected time. This is impossible to achieve on a cluster of workstations because of the communication overhead. To minimize the gap, or keep it constant, on the cluster of workstations the designer has to keep the number and sizes of the communicated messages as constant as possible when the problem size increases.

COMPARISON Criterion 4 results: efficiency

COMPARISON Criterion 4 conclusion: the efficiency (speedup) is much better on the GPU than on the cluster of workstations.
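The standard definitions behind this criterion (speedup S = T_seq / T_par and efficiency E = S / N) can be written as short helpers; the numbers in the example are illustrative, not the study's measurements.

```python
def speedup(t_seq, t_par):
    # How many times faster the parallel run is than the sequential one.
    return t_seq / t_par

def efficiency(t_seq, t_par, n_units):
    # Speedup normalized by the number of processing units (1.0 = ideal).
    return speedup(t_seq, t_par) / n_units

s = speedup(100.0, 25.0)          # 4.0
e = efficiency(100.0, 25.0, 8)    # 0.5: each unit is used at 50% effectiveness
```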

COMPARISON Important Note

COMPARISON Criterion 5: hardness of development, CUDA vs. MPI.

COMPARISON Criterion 6: necessary hardware and software materials. GPU: an NVIDIA GT 525M. Cluster of workstations: three PCs, a switch, an internet modem, and wires.

CONCLUSION

Parallel Processing Comparative Study: the GPU and the cluster are the two main building blocks of the world's fastest computers (such as Shaheen). To compare the two approaches we used two different problems (matrix multiplication and Pi approximation) and six comparison criteria.

| Shared Memory Parallel Processing (Graphics Processing Unit) | Distributed Memory Parallel Processing (Cluster of workstations) |
|---|---|
| More adequate for the data-level parallelism form | More adequate for the task-level parallelism form |
| A big number of small calculations | A big calculation |
| Memory requirement depends on the problem characteristics | Memory requirement depends on the problem characteristics |
| Run time can beat the expected time (null or negative gap) | Impossible, because of the communication overhead |
| Complicated design and programming | Less complicated |
| Implementation environment very practical | Implementation environment complicated |

