Parallel Computing
Department of Computer Engineering, Ferdowsi University
Hossain Deldari
Lecture Organization
- Parallel Processing
- Supercomputers and Parallel Computers
- Amdahl's Law, Speedup, Efficiency
- Parallel Machine Architecture
- Computational Models
- Concurrency Approaches
- Parallel Programming
- Cluster Computing
What is Parallel Processing?
Parallel processing is the division of work into smaller tasks, assigning many smaller tasks to multiple workers to work on simultaneously. It is the use of multiple processors to execute different parts of the same program at the same time.
Difficulties: coordinating, controlling, and monitoring the workers.
The main goals of parallel processing are to solve much bigger problems much faster:
- to reduce the wall-clock execution time of computer programs
- to increase the size of computational problems that can be solved
Supercomputers and Parallel Computers
What is a Supercomputer?
A supercomputer is a computer that is much faster than the computers normal people use. Note: this is a time-dependent definition.
TOP500 list, June 1993 (top entry): TMC CM-5/1024, Los Alamos National Laboratory, USA.
TOP500 list, June 2003 (top entry): NEC Earth-Simulator, Earth Simulator Center, Japan.
R max: maximal LINPACK performance achieved.
R peak: theoretical peak performance.
LINPACK is a benchmark.
Amdahl's Law, Speedup, Efficiency
Efficiency
Efficiency is a measure of the fraction of time that a processor spends performing useful work.
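The two measures above can be sketched numerically. Amdahl's law states that a program with serial fraction f on p processors achieves speedup S(p) = 1 / (f + (1 - f)/p), bounded above by 1/f, and efficiency is E(p) = S(p)/p. A minimal sketch (the function names and the example values are illustrative, not from the slides):

```python
# Amdahl's law: a program with serial fraction f on p processors
# achieves speedup S(p) = 1 / (f + (1 - f) / p), bounded above by 1/f.

def speedup(f, p):
    """Predicted speedup for serial fraction f on p processors."""
    return 1.0 / (f + (1.0 - f) / p)

def efficiency(f, p):
    """Efficiency E(p) = S(p) / p: fraction of time spent on useful work."""
    return speedup(f, p) / p

# A 5% serial fraction caps speedup near 20 no matter how many
# processors are added, so efficiency collapses as p grows.
for p in (4, 16, 1024):
    print(p, round(speedup(0.05, p), 2), round(efficiency(0.05, p), 3))
```

Note how efficiency falls from roughly 0.87 at p = 4 to about 0.02 at p = 1024: adding processors past the serial bottleneck wastes them.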
Shunt Operation
Parallel and Distributed Computers
- SIMD
- MIMD
- MISD
- Clusters
SIMD (Single Instruction Multiple Data)
MISD (Multiple Instruction, Single Data)
MIMD (Multiple Instruction Multiple Data)
MIMD (cont.)
Parallel Machine Architecture (Hardware)
- Shared memory model
  - Bus-based
  - Switch-based
  - NUMA
- Distributed memory model
- Distributed shared memory model
  - Page-based
  - Object-based
Shared memory model
Shared memory model (cont.)
- Also called a shared-memory machine or multiprocessor.
- OpenMP is a standard (C/C++/Fortran).
Advantage: easy programming.
Disadvantages: design complexity; not scalable.
Bus-based shared memory model
- The bus is a bottleneck.
- Not scalable.
Switch-based shared memory model
- Maintenance is difficult.
- Expensive.
- Scalable.
NUMA model
- NUMA stands for Non-Uniform Memory Access.
- Simulated shared memory.
- Better scalability.
Distributed memory model
- Multicomputer
- MPI (Message Passing Interface)
- Easy design, low cost, high scalability
- Difficult programming
Examples of Network Topology
- Linear array
- Ring
- Mesh
- Fully connected
Examples of Network Topology (cont.)
- Hypercubes (e.g., dimension d = 4)
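A d-dimensional hypercube connects 2^d nodes, and two nodes are adjacent exactly when their binary labels differ in one bit, so every node has d neighbors. A small illustrative sketch (the function name is my own, not from the slides):

```python
def hypercube_neighbors(node, d):
    """Neighbors of `node` in a d-dimensional hypercube:
    flip each of the d bits of its binary label."""
    return [node ^ (1 << i) for i in range(d)]

# In a 4-dimensional hypercube (16 nodes), node 0 has 4 neighbors.
print(hypercube_neighbors(0, 4))  # -> [1, 2, 4, 8]
```

This one-bit-flip structure is why a hypercube of 2^d nodes has diameter d: any label can be turned into any other in at most d flips.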
Distributed shared memory model
- Simpler abstraction; sharing data is easier
- Portability
- Easy design with easy programming
- Low performance (when communication is high)
Parallel and Distributed Architecture (Leopold, 2001)
Moving along the spectrum SIMD -> SMP -> NUMA -> Cluster:
- Degree of coupling: tight -> loose
- Memory: shared memory -> distributed memory
- Supported grain sizes: fine -> coarse
- Communication speed: fast -> slow
Computational Models
- RAM
- PRAM
- BSP
- LogP
- MPI
RAM Model
PRAM Model (Parallel Random Access Machine)
- Processors P1, P2, ..., Pp, each with a private memory, connected to a global shared memory under a common control.
- Synchronized read-compute-write cycle.
- Memory-access variants: EREW, ERCW, CREW, CRCW.
Bulk Synchronous Parallel (BSP) Model
- A generalization of the PRAM model.
- Processor-memory pairs connected by a communication network, with barrier synchronization.
- Computation proceeds in supersteps: processes execute local work, then communicate, then barrier-synchronize.
BSP Cost Model
Cost of a superstep = w + max(hs, hr) * g + l, where:
- w: maximum number of local operations
- hs: maximum number of packets sent
- hr: maximum number of packets received
- g: communication throughput (gap)
- l: synchronization latency
- p: number of processors
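The superstep cost formula above is simple enough to evaluate directly; a minimal sketch (the example parameter values are illustrative, not from the slides):

```python
def bsp_superstep_cost(w, hs, hr, g, l):
    """Cost of one BSP superstep: local work, plus communication
    dominated by the larger of packets sent/received, plus the barrier."""
    return w + max(hs, hr) * g + l

# e.g., 1000 local ops, 20 packets sent / 10 received, g = 4, l = 50:
print(bsp_superstep_cost(1000, 20, 10, 4, 50))  # -> 1130
```

Because only max(hs, hr) enters the formula, a balanced communication pattern (similar send and receive volumes per processor) costs no more than its one-sided equivalent.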
LogP Model
Closely related to BSP, but it models asynchronous execution.
Parameters:
- L: the message latency.
- o: the overhead, defined as the length of time that a processor is engaged in the transmission or reception of each message; during this time the processor cannot perform other operations.
- g: the gap, defined as the minimum time interval between consecutive message transmissions or receptions; the reciprocal of g corresponds to the available per-processor bandwidth.
- P: the number of processor/memory modules.
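From these parameters, the standard LogP accounting gives L + 2o as the end-to-end time of one small message (send overhead, network latency, receive overhead), and spaces back-to-back sends by max(o, g). A sketch under those assumptions (the function names are my own):

```python
def logp_message_time(L, o):
    """End-to-end time of one small message in LogP:
    send overhead + network latency + receive overhead = L + 2o."""
    return o + L + o

def logp_n_messages_time(n, L, o, g):
    """Time for one processor to send n back-to-back small messages:
    successive sends are spaced by max(o, g); after the last send,
    the final message still needs L + o to arrive and be absorbed."""
    return (n - 1) * max(o, g) + o + L + o

# e.g., L = 10, o = 2, g = 4: one message takes 14 time units.
print(logp_message_time(10, 2))        # -> 14
print(logp_n_messages_time(3, 10, 2, 4))
```

When g > o the gap, not the overhead, limits injection rate, which is exactly the bandwidth constraint the model is designed to capture.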
LogP (cont.)
MPI (Message Passing Interface)
What is MPI?
- A message-passing library specification
  - message-passing model
  - not a compiler specification
  - not a specific product
- For parallel computers, clusters, and heterogeneous networks
- Full-featured
- Designed to permit (unleash?) the development of parallel software libraries
- Designed to provide access to advanced parallel hardware for end users, library writers, and tool developers
MPI Layer
Task 1 (Node 1) and Task 2 (Node 2) communicate virtually at the application/MPI level; the real communication happens in the communication layer underneath.
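The virtual point-to-point communication pictured above can be mimicked in-process with per-task inboxes. This is a toy emulation of the message-passing model, not the real MPI API (a real program would use MPI_Send/MPI_Recv or a binding such as mpi4py); the send/recv helpers here are my own:

```python
import queue
import threading

# Toy emulation of message passing: in-process queues stand in for
# the network, one inbox per task (rank).
inboxes = {0: queue.Queue(), 1: queue.Queue()}

def send(dest, msg):
    """Deliver msg to the inbox of task `dest` (like a send to that rank)."""
    inboxes[dest].put(msg)

def recv(rank):
    """Block until a message arrives in the inbox of task `rank`."""
    return inboxes[rank].get()

result = []

def task0():
    send(1, "hello from task 0")   # analogous to sending to rank 1
    result.append(recv(0))         # analogous to receiving from rank 1

def task1():
    msg = recv(1)
    send(0, msg.upper())           # echo the message back, uppercased

threads = [threading.Thread(target=task0), threading.Thread(target=task1)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(result[0])  # -> HELLO FROM TASK 0
```

The point of the layering is exactly this: the tasks only see send/recv on ranks, while the transport (queues here, a real network in MPI) is hidden below.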
Matrix Multiplication Example
PRAM Matrix Multiplication: Cost of the PRAM Algorithm
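A standard CREW PRAM algorithm for n x n matrix multiplication uses n^3 processors: processor (i, j, k) computes A[i][k] * B[k][j] in one step, then each group of n products is summed by a binary tree reduction in O(log n) parallel steps. A sequential simulation of that schedule (the function name and structure are my own sketch, not the slide's exact algorithm):

```python
def pram_matmul(A, B):
    """Sequential simulation of CREW PRAM matrix multiplication:
    one parallel multiplication step, then O(log n) reduction steps."""
    n = len(A)
    # Step 1: all n^3 multiplications happen "in parallel".
    prod = [[[A[i][k] * B[k][j] for k in range(n)]
             for j in range(n)] for i in range(n)]
    # Step 2: binary tree reduction over k; each while-iteration
    # corresponds to ONE parallel time step, so O(log n) steps total.
    C = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            vals = prod[i][j]
            while len(vals) > 1:
                vals = ([vals[t] + vals[t + 1]
                         for t in range(0, len(vals) - 1, 2)]
                        + (vals[-1:] if len(vals) % 2 else []))
            C[i][j] = vals[0]
    return C
```

With p = n^3 processors the parallel time is O(log n), so the cost (processors x time) is O(n^3 log n), more than the O(n^3) work of the sequential algorithm: the algorithm is fast but not cost-optimal.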
BSP Matrix Multiplication: Cost of the Algorithm
Concurrency Approaches
- Control parallel
- Data parallel
Control Parallel
Data Parallel
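The contrast between the two approaches: control parallelism runs different tasks concurrently, while data parallelism applies the same operation to different slices of one data set. A minimal data-parallel sketch using Python's standard library (thread-backed pool; the worker function is illustrative):

```python
from multiprocessing.dummy import Pool  # thread-backed Pool, same API as process Pool

# Data parallelism: every worker applies the SAME operation
# (here, squaring) to a different element of the data set.
def square(x):
    return x * x

with Pool(4) as pool:
    squares = pool.map(square, range(8))

print(squares)  # -> [0, 1, 4, 9, 16, 25, 36, 49]
```

Note that `pool.map` preserves input order, so the result reads as if the loop had run sequentially; only the execution was parallel.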
The Best granularity for programming
Parallel Programming
Explicit parallel programming: Occam, MPI, PVM
Implicit parallel programming:
- Parallel functional programming: ML, ...
- Concurrent object-oriented programming: COOL, ...
- Data-parallel programming: Fortran 90, HPF, ...
Cluster Computing
A cluster system is:
- a parallel multicomputer built from high-end PCs and a conventional high-speed network;
- a system that supports parallel programming.
Cluster Computing (cont.)
Applications:
- Scientific computing: simulation, CFD, CAD/CAM, weather prediction, processing large volumes of data
- Super server systems: scalable Internet/web servers, database servers, multimedia/video/audio servers
Cluster Computing (cont.)
Cluster system building blocks (bottom to top):
- High-speed network (hardware)
- OS
- Single system image layer
- System tool layer
- Application layer
Cluster Computing (cont.)
Why cluster computing?
- Scalability: build a small system first, grow it later.
- Low cost: hardware based on the COTS (commodity off-the-shelf) model; software based on freeware from the research community.
- Easier to maintain.
- Vendor independent.
The End