
PARALLEL PROCESSING COMPARATIVE STUDY 1

CONTEXT How to finish a piece of work in a short time? Solution: use a quicker worker. Drawback: the speed of a worker has a limit; inadequate for long tasks. 2

CONTEXT How to finish a calculation in a short time? Solution: use a quicker calculator (processor). Drawback: the speed of a processor has reached a limit; inadequate for long calculations. 3

CONTEXT How to finish a piece of work in a short time? Solution 1. Use a quicker worker. (Inadequate for long tasks) 4

CONTEXT How to finish a piece of work in a short time? Solution 1. Use a quicker worker. (Inadequate for long tasks) 5

CONTEXT How to finish a piece of work in a short time? Solution 1. Use a quicker worker. (Inadequate for long tasks) 2. Use more than one worker concurrently 6

CONTEXT How to finish a calculation in a short time? Solution 1. Use a quicker processor (Inadequate for long calculations) 7

CONTEXT How to finish a calculation in a short time? Solution 1. Use a quicker processor (Inadequate for long calculations) 8

CONTEXT How to finish a calculation in a short time? Solution 1. Use a quicker processor (Inadequate for long calculations) 2. Use more than one processor concurrently 9

CONTEXT How to finish a calculation in a short time? Solution 1. Use a quicker processor (Inadequate for long calculations) 2. Use more than one processor concurrently  Parallelism 10

CONTEXT Definition Parallelism is the concurrent use of more than one processing unit (CPUs, processor cores, GPUs, or combinations of them) in order to carry out calculations more quickly 11

PROJECT GOAL Parallelism needs 1. A parallel computer (more than one processor) 2. Accommodating the calculation to the parallel computer 12

THE GOAL Parallelism needs 1. A parallel computer (more than one processor) 2. Accommodating the calculation to the parallel computer 13

THE GOAL Parallel Computer  Several parallel computers are on the hardware market  They differ in their architecture  Several classifications  Based on the instruction and data streams (Flynn classification)  Based on the memory sharing degree  … 14

THE GOAL Flynn Classification A. Single Instruction and Single Data stream 15

THE GOAL Flynn Classification B. Single Instruction and Multiple Data 16

THE GOAL Flynn Classification C. Multiple Instruction and Single Data stream 17

THE GOAL Flynn Classification D. Multiple Instruction and Multiple Data stream 18

THE GOAL Memory Sharing Degree Classification A. Shared Memory B. Distributed memory 19

THE GOAL Memory Sharing Degree Classification C. Hybrid Distributed-Shared Memory 20

THE GOAL Parallelism needs 1. A parallel computer (more than one processor) 2. Accommodating the calculation to the parallel computer: dividing the calculation and data between the processors; defining the execution scenario (how the processors cooperate) 21

THE GOAL Parallelism needs 1. A parallel computer (more than one processor) 2. Accommodating the calculation to the parallel computer: dividing the calculation and data between the processors; defining the execution scenario (how the processors cooperate) 22

THE GOAL Parallelism needs 1. A parallel computer (more than one processor) 2. Accommodating the calculation to the parallel computer: dividing the calculation and data between the processors; defining the execution scenario (how the processors cooperate) 23

THE GOAL The accommodation of a calculation to a parallel computer is called parallel processing and depends closely on the architecture 24

THE GOAL Goal : A comparative study between 1. Shared Memory Parallel Processing approach 2. Distributed Memory Parallel Processing approach 25

PLAN 1. Distributed Memory Parallel Processing approach 2. Shared Memory Parallel Processing approach 3. Case study problems 4. Comparison results and discussion 5. Conclusion 26

DISTRIBUTED MEMORY PARALLEL PROCESSING APPROACH 27

DISTRIBUTED MEMORY PARALLEL PROCESSING APPROACH Distributed-Memory Computers (DMC) = Distributed Memory System (DMS) = Massively Parallel Processor (MPP) 28

DISTRIBUTED MEMORY PARALLEL PROCESSING APPROACH The distributed-memory computer architecture 29

DISTRIBUTED MEMORY PARALLEL PROCESSING APPROACH Architecture of nodes. Nodes can be: identical processors  Pure DMC; different types of processors  Hybrid DMC; different types of nodes with different architectures  Heterogeneous DMC 30

DISTRIBUTED MEMORY PARALLEL PROCESSING APPROACH Architecture of the Interconnection Network  No shared memory space between nodes  The network is the only way for nodes to communicate  Network performance directly influences the performance of a parallel program on a DMC  Network performance depends on: 1. Topology 2. Physical connectors (wires, ...) 3. Routing technique  The evolution of DMCs closely depends on networking evolutions 31

DISTRIBUTED MEMORY PARALLEL PROCESSING APPROACH The DMC used in our comparative study: a heterogeneous DMC, a modest cluster of workstations. Three nodes: Sony laptop (i3 processor), HP laptop (i3 processor), HP laptop (Core 2 Duo processor). Communication network: 100 Mbit/s Ethernet. 32

DISTRIBUTED MEMORY PARALLEL PROCESSING APPROACH Parallel Software Development for DMC Designer's main tasks: 1. Global calculation decomposition and task assignment 2. Data decomposition 3. Communication scheme definition 4. Synchronization study 33
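As a small illustration of task 2 (a hedged sketch, not the project's code; the function name rows_for_rank and the sizes used in main are assumptions), a block decomposition of N matrix rows over P processes can be computed so that every process owns a contiguous slice whose size differs by at most one row:

#include <stdio.h>

/* Returns how many of the N rows process `rank` owns (out of P processes)
   and writes the index of its first row into *first.                      */
int rows_for_rank(int N, int P, int rank, int *first) {
    int base = N / P, extra = N % P;                  /* N = P*base + extra           */
    *first = rank * base + (rank < extra ? rank : extra);
    return base + (rank < extra ? 1 : 0);             /* first `extra` ranks get +1   */
}

int main(void) {
    int N = 10, P = 3;                                /* hypothetical problem sizes   */
    for (int r = 0; r < P; ++r) {
        int first, count = rows_for_rank(N, P, r, &first);
        printf("rank %d: rows %d..%d\n", r, first, first + count - 1);
    }
    return 0;
}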

DISTRIBUTED MEMORY PARALLEL PROCESSING APPROACH Parallel Software Development for DMC Important considerations for efficiency: 1. Minimize Communication 2. Avoid barrier synchronization 34

DISTRIBUTED MEMORY PARALLEL PROCESSING APPROACH Implementation environments Several implementation environments PVM (Parallel Virtual Machine) MPI (Message Passing Interface) 35

DISTRIBUTED MEMORY PARALLEL PROCESSING APPROACH MPI Application Anatomy  All the nodes execute the same code  Yet the nodes do not all do the same work (an apparent contradiction)  This is possible using the SPMD (Single Program, Multiple Data) application form: the processes are organized into one controller and several workers
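To make the SPMD controller/worker organization concrete, here is a minimal MPI sketch in C (a hypothetical numerical sum, not the project's program; the iteration count and the cyclic decomposition are assumptions). Every process runs the same executable; each worker computes a partial result over its own slice of the iterations, and rank 0 plays the controller that gathers and prints the result.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* who am I?                    */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many processes in total? */

    /* Every process runs this same code (SPMD), but each one works on its
       own slice of the iteration space (cyclic data decomposition).        */
    const long n = 100000000L;              /* total iterations (hypothetical) */
    double local = 0.0;
    for (long i = rank; i < n; i += size)
        local += 1.0 / (double)(i + 1);     /* dummy computation               */

    /* Controller/worker behaviour comes from a test on the rank:
       rank 0 gathers the partial sums and prints the final result.         */
    double total = 0.0;
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("result = %f (computed by %d processes)\n", total, size);

    MPI_Finalize();
    return 0;
}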

SHARED MEMORY PARALLEL PROCESSING APPROACH Several SMPCs on the market  Multi-core PCs: Intel i3/i5/i7, AMD  Which SMPC do we use? The GPU: originally designed for image processing, now a domestic supercomputer. Characteristics: the cheapest and fastest shared-memory parallel computer; hard parallel design 37

SHARED MEMORY PARALLEL PROCESSING APPROACH  The GPU Architecture  The implementation environment 38

SHARED MEMORY PARALLEL PROCESSING APPROACH GPU Architecture Like a classical processing unit, the Graphics Processing Unit is composed of two main components: A- Calculation Units B- Storage Unit 39

SHARED MEMORY PARALLEL PROCESSING APPROACH 40

SHARED MEMORY PARALLEL PROCESSING APPROACH 41

SHARED MEMORY PARALLEL PROCESSING  The GPU Architecture  The implementation environment 1. CUDA : for GPU S manufactured by NVIDIA 2. OpenCL: independent of the GPU architecture 42

SHARED MEMORY PARALLEL PROCESSING CUDA Program Anatomy 43

SHARED MEMORY PARALLEL PROCESSING Q: How do we execute the code fragments to be parallelized on the GPU? A: By calling a kernel. Q: What is a kernel? A: A kernel is a function callable from the host and executed on the device simultaneously by many threads in parallel 44
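As a minimal illustration (a generic vector-addition sketch, not the case-study code; the kernel name vecAdd and the sizes are assumptions), the kernel below is declared __global__, is called from the host with an execution configuration <<<blocks, threads>>>, and is executed simultaneously by one thread per element. The host code also shows the typical CUDA program anatomy: allocate device memory, copy inputs in, launch, copy the result back, free.

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

/* Kernel: callable from the host, executed on the device by many threads. */
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  /* global thread index */
    if (i < n)                                      /* guard the tail      */
        c[i] = a[i] + b[i];
}

int main(void) {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    /* 1. Allocate and fill host data. */
    float *hA = (float*)malloc(bytes), *hB = (float*)malloc(bytes), *hC = (float*)malloc(bytes);
    for (int i = 0; i < n; ++i) { hA[i] = 1.0f; hB[i] = 2.0f; }

    /* 2. Allocate device (GPU global) memory and copy the inputs to it. */
    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes); cudaMalloc(&dB, bytes); cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    /* 3. Launch the kernel: <<<number of blocks, threads per block>>>. */
    int threads = 256;
    int blocks  = (n + threads - 1) / threads;
    vecAdd<<<blocks, threads>>>(dA, dB, dC, n);

    /* 4. Copy the result back and release resources. */
    cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", hC[0]);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    free(hA); free(hB); free(hC);
    return 0;
}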

KERNEL LAUNCH 45 SHARED MEMORY PARALLEL PROCESSING

KERNEL LAUNCH 46 SHARED MEMORY PARALLEL PROCESSING

KERNEL LAUNCH 47 SHARED MEMORY PARALLEL PROCESSING

Design recommendations  Utilize the shared memory to reduce the time spent accessing the global memory.  Reduce the number of idle threads (control divergence) to fully utilize the GPU resources. 48
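The following is a hedged sketch (not the project's kernel; the names and the tile width are assumptions) of how the first recommendation is typically applied to matrix multiplication: each block stages TILE x TILE sub-blocks of the operands in __shared__ memory, so each global-memory element is read once per tile rather than once per multiply-add, and the boundary guards confine thread divergence to the matrix edges.

#define TILE 16   /* tile width; a common choice, assumed here */

__global__ void matMulTiled(const float *A, const float *B, float *C, int N) {
    __shared__ float As[TILE][TILE];     /* per-block staging buffers in   */
    __shared__ float Bs[TILE][TILE];     /* fast on-chip shared memory     */

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < (N + TILE - 1) / TILE; ++t) {
        /* Each thread loads one element of the current A and B tiles;
           out-of-range elements are zero-padded.                          */
        int aCol = t * TILE + threadIdx.x;
        int bRow = t * TILE + threadIdx.y;
        As[threadIdx.y][threadIdx.x] = (row < N && aCol < N) ? A[row * N + aCol] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (bRow < N && col < N) ? B[bRow * N + col] : 0.0f;
        __syncthreads();                 /* tile fully loaded before use   */

        for (int k = 0; k < TILE; ++k)   /* multiply-accumulate on the tile */
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();                 /* done with this tile            */
    }
    if (row < N && col < N)              /* guard keeps divergence at edges */
        C[row * N + col] = acc;
}

/* Hypothetical launch; device buffers dA, dB, dC and the order N are
   assumed to be set up as in the earlier anatomy sketch.                  */
void launchMatMul(const float *dA, const float *dB, float *dC, int N) {
    dim3 block(TILE, TILE);
    dim3 grid((N + TILE - 1) / TILE, (N + TILE - 1) / TILE);
    matMulTiled<<<grid, block>>>(dA, dB, dC, N);
}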

CASE STUDY PROBLEM 49

CASE STUDY PROBLEM 50
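The two case-study slides above are figure-only in this transcript; the summary slide at the end names the problems as matrix multiplication and Pi approximation. Below is a minimal CUDA sketch of the second one (not the authors' code; the kernel name piPartial, the interval count n and the launch configuration are assumptions): each thread accumulates a strided share of the midpoint-rule sum for the integral of 4/(1+x^2) over [0,1] into a per-thread partial-result array, matching the per-thread arrays mentioned later in the memory comparison, and the host reduces the partials.

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

/* One partial sum per thread is written to global memory. */
__global__ void piPartial(double *partial, long n, int totalThreads) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    double h = 1.0 / (double)n, sum = 0.0;
    for (long i = tid; i < n; i += totalThreads) {   /* strided share of the intervals */
        double x = h * ((double)i + 0.5);            /* midpoint of interval i         */
        sum += 4.0 / (1.0 + x * x);
    }
    partial[tid] = sum * h;
}

int main(void) {
    const long n = 100000000L;             /* number of intervals (hypothetical) */
    const int threads = 256, blocks = 128;
    const int total = threads * blocks;

    double *dPartial;
    cudaMalloc(&dPartial, total * sizeof(double));
    piPartial<<<blocks, threads>>>(dPartial, n, total);

    double *hPartial = (double*)malloc(total * sizeof(double));
    cudaMemcpy(hPartial, dPartial, total * sizeof(double), cudaMemcpyDeviceToHost);

    double pi = 0.0;                       /* final reduction on the host */
    for (int i = 0; i < total; ++i) pi += hPartial[i];
    printf("pi ~= %.10f\n", pi);

    cudaFree(dPartial); free(hPartial);
    return 0;
}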

COMPARISON  Comparison criteria  Analysis and conclusion 51

COMPARISON Criteria 1 : Time-Cost factor TC = T_p × C, where T_p is the parallel execution time (in milliseconds) and C is the hardware cost (in Saudi riyals, SAR). The hardware costs (C): GPU: 5000 SAR; Cluster of workstations: 9630 SAR. 52
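As a purely illustrative computation (the execution time here is hypothetical; only the hardware costs come from the slide): if both platforms solved a test case in T_p = 100 ms, then TC_GPU = 100 × 5000 = 500,000 and TC_cluster = 100 × 9630 = 963,000, so the platform with the smaller product, here the GPU, would be preferred under this criterion.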

COMPARISON 53

COMPARISON Conclusion: The GPU is better when we need to perform a large number of small calculations (few iterations each). However, if the need is to perform a single calculation with a large number of iterations, the cluster of workstations is the better choice. 54

COMPARISON Criteria 2 : Required memory. Matrix multiplication problem. Graphics Processing Unit:  the global-memory-based method requires Mem_global = 6 × N × N × s  the shared-memory-based method requires Mem_shared = 8 × N × N × s. Cluster of workstations:  the used cluster contains three nodes, and Mem_cluster = (19/3) × N × N × s (N: matrix order, s: size of one element). 55

COMPARISON Criteria 2 : Required memory. Pi approximation problem. Graphics Processing Unit: per-thread partial-result arrays are needed, so the required memory grows in proportion to the number of threads used. Cluster of workstations: only a small, essentially fixed amount of memory is used on each node (on the order of 15 scalar values). 56

COMPARISON Criteria 2 : Required memory. Conclusion: We cannot judge which parallel approach is better for the required-memory criterion; it depends on the intrinsic characteristics of the problem at hand. 57

COMPARISON Criteria 3 : The Gap between the Theoretical Complexity and Effective Complexity. The gap is calculated by: Gap = ((T_exp / T_theo) − 1) × 100, where T_exp is the experimental parallel time, T_theo = T_s / P is the theoretical parallel time, T_s is the sequential time, and P is the number of processing units. 58
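A worked illustration with invented numbers (hypothetical, not measurements from the study): for a sequential time T_s = 100 s on P = 4 processing units, T_theo = 100 / 4 = 25 s. An experimental time T_exp = 30 s gives Gap = ((30 / 25) − 1) × 100 = +20%, while T_exp = 22 s gives Gap = ((22 / 25) − 1) × 100 = −12%, a negative gap of the kind the conclusion on slide 61 says can occur on the GPU but not on the cluster.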

CLUSTER OF WORKSTATIONS 59 COMPARISON Criteria 3 : The Gap between the Theoretical Complexity and Effective Complexity

GRAPHICS PROCESSING UNIT 60 COMPARISON Criteria 3 : The Gap between the Theoretical Complexity and Effective Complexity

COMPARISON Conclusion: On the GPU, the measured execution time of the parallel program can be lower than the theoretically expected time. That is impossible to achieve on a cluster of workstations because of the communication overhead. To minimize the gap, or keep it constant, on the cluster of workstations, the designer has to keep the number and sizes of the communicated messages as constant as possible when the problem size increases. 61 Criteria 3 : The Gap between the Theoretical Complexity and Effective Complexity

COMPARISON 62

CRITERIA 4 : EFFICIENCY 63 COMPARISON

Criteria 4 : Efficiency Conclusion: The efficiency (speedup) is much better on the GPU than on the cluster of workstations. 64

IMPORTANT NOTE 65 COMPARISON

IMPORTANT NOTE

COMPARISON Criteria 5 : Difficulty of development: CUDA vs. MPI 67

COMPARISON Criteria 6 : Necessary hardware and software  GPU (NVIDIA GT 525M)  Cluster of workstations (3 PCs, switch, Internet modem and wires) 68

69

CONCLUSION 70

Parallel Processing Comparative Study
Shared Memory Parallel Processing Approach: Graphics Processing Unit (GPU) | Distributed Memory Parallel Processing Approach: Cluster of workstations
The GPU and the cluster are the two main building blocks of the world's fastest computers (such as Shaheen).
To compare them we used two different problems (matrix multiplication and Pi approximation) and six comparison criteria.
GPU: more adequate for the data-level parallelism form | Cluster: more adequate for the task-level parallelism form
GPU: a big number of small calculations | Cluster: one big calculation
Memory requirement ~ problem characteristics (for both)
GPU: run time can be better than the theoretically expected one | Cluster: a null or negative gap is impossible
GPU: complicated design and programming | Cluster: less complicated
GPU: very practical implementation environment | Cluster: complicated implementation environment

72