Parallel Architecture

Parallel Architecture Sathish Vadhiyar

Motivations of Parallel Computing
- Faster execution times: from days or months down to hours or seconds (e.g., climate modelling, bioinformatics)
- Large amounts of data dictate parallelism
- Parallelism is more natural for certain kinds of problems, e.g., climate modelling
- Computer architecture trends: CPU speeds have saturated and memory bandwidths are slow

Classification of Architectures – Flynn's classification
In terms of parallelism in the instruction and data streams:
- Single Instruction Single Data (SISD): serial computers
- Single Instruction Multiple Data (SIMD): vector processors and processor arrays (examples: CM-2, Cray-90, Cray YMP, Hitachi 3600)
Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/

Classification of Architectures – Flynn's classification
- Multiple Instruction Single Data (MISD): not popular
- Multiple Instruction Multiple Data (MIMD): most popular – IBM SP and most other supercomputers, clusters, computational Grids, etc.
Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/

Classification of Architectures – Based on Memory
Shared memory – 2 types:
- UMA (Uniform Memory Access)
- NUMA (Non-Uniform Memory Access); examples: HP Exemplar, SGI Origin, Sequent NUMA-Q
Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/

Classification 2: Shared Memory vs Message Passing
- Shared memory machine: the n processors share a physical address space; communication can be done through this shared memory
- The alternative is sometimes referred to as a message passing machine or a distributed memory machine
[Figure: a message passing machine built from processor–memory (P, M) pairs connected by an interconnect, versus a shared memory machine in which processors reach a common main memory through an interconnect]

Shared Memory Machines
- The shared memory could itself be distributed among the processor nodes
- Each processor might have some portion of the shared physical address space that is physically close to it and therefore accessible in less time
- Terms: NUMA (Non-Uniform Memory Access) vs UMA (Uniform Memory Access) architecture

Classification of Architectures – Based on Memory
- Distributed memory
- Recently, multi-cores
- Yet another classification: MPPs, NOW (Berkeley), COW, Computational Grids
Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/

Shared memory and caches

Shared Memory Architecture: Caches
[Figure: P1 and P2 each read X (initially 0) into their caches; P1 then writes X = 1; P2's subsequent read of X hits its cached copy and returns the stale value 0 – wrong data!]

Cache Coherence Problem
- If each processor in a shared-memory multiprocessor has a data cache, there is a potential data consistency problem: the cache coherence problem
- A shared variable is modified in a private cache
- Objective: processes shouldn't read 'stale' data
- Solution: hardware cache coherence mechanisms

Cache Coherence Protocols
- Write update – on every write, propagate the updated cache line to the other processors
- Write invalidate – on a write, invalidate the other processors' copies; a processor fetches the updated cache line the next time it accesses the data
- Which is better?

Invalidation Based Cache Coherence
[Figure: P1 and P2 each read X (initially 0); when P1 writes X = 1, P2's cached copy is invalidated, so P2's next read of X misses and fetches the new value 1]

Cache Coherence using invalidate protocols
3 states associated with data items:
- Shared – the item is cached (clean) by two or more caches
- Invalid – another processor (say P0) has updated the data item, so this copy is stale
- Dirty – the state of the data item in P0, which holds the only valid, modified copy
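A minimal sketch in C of how a cache controller might track these per-line states under an invalidation protocol; the state and event names are invented for illustration (real protocols such as MESI have more states), and the main() replays the X example from the earlier slide from P2's point of view.

    #include <stdio.h>

    /* Per-cache-line states named as on the slide. */
    typedef enum { INVALID, SHARED, DIRTY } line_state_t;

    typedef enum { LOCAL_READ, LOCAL_WRITE, REMOTE_WRITE } event_t;

    /* Next state of a line in THIS cache for an invalidation protocol. */
    static line_state_t next_state(line_state_t s, event_t e) {
        switch (e) {
        case LOCAL_WRITE:  return DIRTY;    /* invalidate remote copies, own the line */
        case REMOTE_WRITE: return INVALID;  /* another cache wrote: our copy is stale */
        case LOCAL_READ:   return (s == INVALID) ? SHARED : s;  /* miss -> refetch */
        }
        return s;
    }

    int main(void) {
        line_state_t p2 = INVALID;
        p2 = next_state(p2, LOCAL_READ);    /* P2 reads X        -> SHARED  */
        p2 = next_state(p2, REMOTE_WRITE);  /* P1 writes X = 1   -> INVALID */
        p2 = next_state(p2, LOCAL_READ);    /* P2 reads X again  -> SHARED  */
        printf("final state = %d (1 = SHARED, i.e. a fresh copy)\n", p2);
        return 0;
    }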

Implementations of cache coherence protocols
Snoopy – for bus-based architectures
- A shared bus interconnect where all cache controllers monitor (snoop) all bus activity
- There is only one operation on the bus at a time; cache controllers can be built to take corrective action and enforce coherence in the caches
- Memory operations are propagated over the bus and snooped

Implementations of cache coherence protocols
Directory-based
- Instead of broadcasting memory operations to all processors, propagate coherence operations only to the relevant processors
- A central directory maintains the state of each cache block and the processors associated with it

Implementation of Directory Based Protocols
- Uses presence bits to record the owner/sharer processors
- Full bit vector scheme – O(M x P) storage for P processors and M cache lines
- This much storage is not necessary: modern processors use sparse or tagged directory schemes, with limited cache lines and limited presence bits
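A rough C sketch (names invented) of a full-bit-vector directory entry: one entry per memory block, each with one presence bit per processor, which is where the O(M x P) storage comes from. Using a uint64_t limits this sketch to 64 processors, which is exactly the scaling problem that pushes larger machines towards the sparse/tagged schemes mentioned above.

    #include <stdint.h>
    #include <stdbool.h>
    #include <stdio.h>

    /* Full-bit-vector directory entry for one memory block. */
    typedef struct {
        uint64_t presence;  /* bit i set => processor i caches the block */
        bool     dirty;     /* true => a single cache holds a modified copy */
    } dir_entry_t;

    /* Read by processor p: add it to the sharer set. */
    static void dir_read(dir_entry_t *e, int p) {
        e->presence |= (uint64_t)1 << p;
        e->dirty = false;          /* any dirty copy is written back first */
    }

    /* Write by processor p: every other set bit would receive an
       invalidation message; then record p as the sole (dirty) owner. */
    static void dir_write(dir_entry_t *e, int p) {
        e->presence = (uint64_t)1 << p;
        e->dirty = true;
    }

    int main(void) {
        dir_entry_t e = {0, false};
        dir_read(&e, 0);           /* P0 reads the block */
        dir_read(&e, 3);           /* P3 reads the block */
        dir_write(&e, 3);          /* P3 writes: P0's copy is invalidated */
        printf("presence = 0x%llx, dirty = %d\n",
               (unsigned long long)e.presence, e.dirty);   /* 0x8, 1 */
        return 0;
    }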

False Sharing
- Cache coherence operates at the granularity of cache lines – an entire cache line is invalidated at a time
- Modern cache lines are 64 bytes in size
- Consider a Fortran program dealing with a matrix, where each thread or process accesses a row of the matrix
- Because Fortran stores the matrix column-major, elements of different rows (and hence of different threads) sit next to each other in memory and share cache lines – this leads to false sharing

False sharing: Solutions
- Reorganize the code so that each processor accesses a set of rows
- This can still lead to overlapping cache lines if the matrix size is not divisible by the number of processors
- In such cases, employ padding
- Padding: dummy elements added to make the matrix size divisible (see the sketch below)
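A small illustrative C/OpenMP sketch of the padding idea (thread count, iteration count, and names are arbitrary): each thread increments its own counter, and padding each counter out to a 64-byte cache line keeps two threads' counters from ever sharing a line.

    #include <omp.h>
    #include <stdio.h>

    #define NTHREADS   8
    #define LINE_BYTES 64          /* typical cache-line size */

    /* Padded per-thread counter: each occupies a full cache line, so
       updates by different threads never touch the same line. */
    typedef struct {
        long count;
        char pad[LINE_BYTES - sizeof(long)];
    } padded_counter_t;

    int main(void) {
        padded_counter_t c[NTHREADS] = {0};

        #pragma omp parallel num_threads(NTHREADS)
        {
            int id = omp_get_thread_num();
            for (long i = 0; i < 10000000; i++)
                c[id].count++;     /* without pad[], neighbouring counters
                                      would share a cache line and every
                                      increment would invalidate it      */
        }

        long total = 0;
        for (int i = 0; i < NTHREADS; i++) total += c[i].count;
        printf("total = %ld\n", total);
        return 0;
    }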

Interconnection Networks

Interconnects
- Used in both shared memory and distributed memory architectures
- In shared memory: used to connect processors to memory
- In distributed memory: used to connect the different processors
Components:
- Interface (PCI or PCI-e) for connecting a processor to the network link
- Network link connected to a communication network (a network of connections)

Communication network
- Consists of switching elements to which processors are connected through ports
- Switch: a network of switching elements
- Switching elements are connected with each other using a pattern of connections; this pattern defines the network topology
- In shared memory systems, the memory units are also connected to the communication network

Parallel Architecture: Interconnections
- Routing techniques: how the route taken by a message from source to destination is decided
- Network topologies:
  - Static – point-to-point communication links among processing nodes
  - Dynamic – communication links are formed dynamically by switches

Network Topologies
Static:
- Bus
- Completely connected
- Star
- Linear array, ring (1-D torus)
- Mesh
- k-d mesh: d dimensions with k nodes in each dimension
- Hypercube: a k-d mesh with k = 2 and d = log p
- Trees – e.g., our campus network
Dynamic – communication links are formed dynamically by switches:
- Crossbar
- Multistage
For more details and an evaluation of topologies, refer to the book by Grama et al.

Network Topologies
- Bus, ring – used in small-scale shared memory systems
- Crossbar switch – used in some small-scale shared memory machines, and in small- or medium-scale distributed memory machines

Crossbar Switch
- Consists of a 2D grid of switching elements
- Each switching element has 2 input ports and 2 output ports
- An input port is dynamically connected to an output port through the switching logic
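A toy C model (invented for illustration) of a p x p crossbar as a connection table: any input can be wired to any output that is still free, which is why a crossbar is non-blocking as long as the requested outputs are distinct.

    #include <stdbool.h>
    #include <stdio.h>

    #define P 4   /* assumed number of inputs/outputs */

    static int  out_of[P];     /* out_of[i] = output driven by input i, or -1 */
    static bool out_busy[P];   /* true if the output port is already taken    */

    /* Try to close the crosspoint (i, j).  The only possible contention
       is two inputs asking for the same output. */
    static bool crossbar_connect(int i, int j) {
        if (out_busy[j]) return false;
        out_of[i] = j;
        out_busy[j] = true;
        return true;
    }

    int main(void) {
        for (int i = 0; i < P; i++) out_of[i] = -1;
        crossbar_connect(0, 2);
        crossbar_connect(1, 3);
        bool ok = crossbar_connect(2, 2);          /* fails: output 2 is taken */
        printf("request (2 -> 2) granted? %d\n", ok);
        for (int i = 0; i < P; i++)
            printf("input %d -> output %d\n", i, out_of[i]);
        return 0;
    }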

Multistage network – Omega network
- Used to reduce switching complexity
- Omega network – consists of log p stages, each with p/2 switching elements
- Contention: a crossbar is non-blocking; in an Omega network, contention can occur even when multiple communications involve disjoint source–destination pairs
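A sketch in C of destination-tag (self-) routing through an Omega network, assuming p is a power of two: at stage s each 2x2 switch inspects one bit of the destination address (most significant bit first) and routes to its upper output on 0 or its lower output on 1. Running it for two sources with the same destination illustrates how they can contend for the same switch outputs.

    #include <stdio.h>

    #define P      8       /* inputs/outputs, a power of two */
    #define STAGES 3       /* log2(P) stages, each with P/2 switches */

    /* Print the switch output taken at every stage when routing to dst. */
    static void omega_route(int src, int dst) {
        printf("route %d -> %d: ", src, dst);
        for (int s = 0; s < STAGES; s++) {
            int bit = (dst >> (STAGES - 1 - s)) & 1;
            printf("%s ", bit ? "lower" : "upper");
        }
        printf("\n");
    }

    int main(void) {
        omega_route(2, 6);
        omega_route(5, 6);   /* same destination tag as above: the two
                                messages may block each other           */
        return 0;
    }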

Mesh, Torus, Hypercubes, Fat-tree
- Commonly used network topologies in distributed memory architectures
- Hypercubes are networks with log p dimensions

Mesh, Torus, Hypercubes
[Figure: a 2D mesh, a torus (n = 2), and a hypercube (binary n-cube)]
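A short illustrative C helper that lists a node's neighbours in a binary n-cube: with p = 2^d nodes, two nodes are connected exactly when their labels differ in one bit, so each node has d = log2 p neighbours (d = 3 is an assumed example size).

    #include <stdio.h>

    #define DIM 3                    /* assumed d = 3, i.e. p = 8 nodes */

    /* Print the d neighbours of 'node': flip each address bit in turn. */
    static void hypercube_neighbors(unsigned node) {
        for (int b = 0; b < DIM; b++)
            printf("%u ", node ^ (1u << b));
        printf("\n");
    }

    int main(void) {
        hypercube_neighbors(0);   /* prints: 1 2 4 */
        hypercube_neighbors(5);   /* prints: 4 7 1 */
        return 0;
    }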

Fat Tree Networks
- Binary tree with processors arranged at the leaves; the other nodes correspond to switches
- Fundamental property: the number of links from a node to its children equals the number of links from the node to its parent
- Edges become fatter as we traverse up the tree

Fat Tree Networks
- Any pair of processors can communicate without contention: a non-blocking network
- Also called Constant Bisection Bandwidth (CBB) networks
- A two-level fat tree has a diameter of four

Evaluating Interconnection topologies
Diameter – the maximum distance between any two processing nodes:
- Fully connected – 1
- Star – 2
- Ring – p/2
- Hypercube – log p
Connectivity – the multiplicity of paths between 2 nodes; the minimum number of arcs to be removed from the network to break it into two disconnected networks:
- Linear array – 1
- 2-D mesh – 2
- 2-D mesh with wraparound – 4
- d-dimensional hypercube – d

Evaluating Interconnection topologies
Bisection width – the minimum number of links to be removed from the network to partition it into 2 equal halves:
- Ring – 2
- P-node 2-D mesh – √P
- Tree – 1
- Star – 1
- Completely connected – P²/4
- Hypercube – P/2

Evaluating Interconnection topologies
- Channel width – the number of bits that can be communicated simultaneously over a link, i.e., the number of physical wires between 2 nodes
- Channel rate – the peak rate of a single physical wire
- Channel bandwidth – channel rate times channel width
- Bisection bandwidth – the minimum volume of communication allowed between any two halves of the network, i.e., bisection width times channel bandwidth
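A small worked example with assumed numbers, just to tie the definitions together: take a p = 64 node hypercube with 16-wire links and a channel rate of 1 Gbit/s per wire.

    bisection width (hypercube, p = 64)        = p/2             = 32 links
    channel bandwidth (16 wires x 1 Gbit/s)    = 16 Gbit/s per link
    bisection bandwidth (32 links x 16 Gbit/s) = 512 Gbit/s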