Course Outline
Introduction to algorithms and applications
Parallel machines and architectures
– Overview of parallel machines, trends in the top-500, clusters, many-cores
Programming methods, languages, and environments
– Message passing (SR, MPI, Java)
– Higher-level language: HPF
Applications
– N-body problems, search algorithms
Many-core (GPU) programming (Rob van Nieuwpoort)

Parallel Machines
Based on: Parallel Computing – Techniques and Applications Using Networked Workstations and Parallel Computers (2nd edition), part of Section 1.4, Barry Wilkinson and Michael Allen, Pearson, 2005

Overview
Processor organizations
Types of parallel machines
– Processor arrays
– Shared-memory multiprocessors
– Distributed-memory multicomputers
Cluster computers
Blue Gene

Processor Organization
A network topology is a graph:
– A node is a processor
– An edge is a communication path
Evaluation criteria:
– Diameter: the maximum distance between any two nodes
– Bisection width: the minimum number of edges that must be removed to split the graph into two (almost) equal halves
– Number of edges per node
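As an illustration of the first criterion, here is a minimal C sketch (not from the slides; the graph representation and names are my own) that computes the diameter of a topology graph by running a breadth-first search from every node. Bisection width is much harder to compute for a general graph, which is why the slides quote it per topology.

#include <stdio.h>

#define N 8 /* number of nodes; the example builds a 3-dimensional hypercube */

/* adjacency matrix of the topology: adj[u][v] = 1 if u and v are connected */
static int adj[N][N];

/* BFS from source s; returns the largest distance found (the eccentricity of s) */
static int bfs_eccentricity(int s) {
    int dist[N], queue[N], head = 0, tail = 0, ecc = 0;
    for (int i = 0; i < N; i++) dist[i] = -1;
    dist[s] = 0;
    queue[tail++] = s;
    while (head < tail) {
        int u = queue[head++];
        for (int v = 0; v < N; v++)
            if (adj[u][v] && dist[v] < 0) {
                dist[v] = dist[u] + 1;
                if (dist[v] > ecc) ecc = dist[v];
                queue[tail++] = v;
            }
    }
    return ecc;
}

int main(void) {
    /* build a 3-dimensional hypercube: connect nodes that differ in one bit */
    for (int u = 0; u < N; u++)
        for (int b = 0; b < 3; b++)
            adj[u][u ^ (1 << b)] = 1;
    /* diameter = maximum eccentricity over all nodes */
    int diameter = 0;
    for (int s = 0; s < N; s++) {
        int e = bfs_eccentricity(s);
        if (e > diameter) diameter = e;
    }
    printf("diameter = %d\n", diameter); /* prints 3 for the 3-cube */
    return 0;
}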

Key issues in network design
Bandwidth:
– Number of bits transferred per second
Latency:
– Network latency: time to transfer a message through the network
– Communication latency: total time to send the message, including software overhead and interface delays
– Message latency (startup time): time to send a zero-length message
Diameter influences latency
Bisection width determines the bisection bandwidth (the collective bandwidth over the "removed" edges)
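A common first-order model (a sketch in my own notation, not from the slides) ties these parameters together: sending a message of n bytes costs roughly t(n) = t_startup + n / bandwidth. With, say, a 1 µs startup time and 1 GB/s of bandwidth, a 1 KB message takes about 1 µs + 1 µs = 2 µs, so for short messages the startup (latency) term dominates.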

Mesh
q-dimensional lattice; q = 2 gives a 2-D grid
For a 2-D mesh with p nodes:
– Number of nodes: k² (k = √p)
– Diameter: 2(k − 1)
– Bisection width: k
– Edges per node: 4

Binary Tree
– Number of nodes: 2^k − 1 (k ≈ log₂ p levels)
– Diameter: 2(k − 1)
– Bisection width: 1
– Edges per node: 3

Hypertree
Tree with multiple roots, giving a better bisection width
4-ary hypertree of depth k:
– Number of nodes: 2^k (2^(k+1) − 1)
– Diameter: 2k
– Bisection width: 2^(k+1)
– Edges per node: 6

Engineering solution: fat tree
A tree with more bandwidth on the links near the root
Example: the CM-5

Hypercube
k-dimensional cube: label each node with a k-bit binary value and connect nodes that differ in exactly one bit
– Number of nodes: 2^k
– Diameter: k
– Bisection width: 2^(k−1)
– Edges per node: k
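The one-bit-difference rule makes hypercube neighbors and distances easy to express in code; the C sketch below (an illustrative example, not from the slides) enumerates a node's neighbors by flipping each bit and counts the hops between two nodes by XOR-ing their labels.

#include <stdio.h>

/* hops between nodes a and b = number of differing bits (Hamming distance) */
static int hops(unsigned a, unsigned b) {
    unsigned x = a ^ b;
    int count = 0;
    while (x) { count += x & 1u; x >>= 1; }
    return count;
}

int main(void) {
    const int k = 4;               /* 4-dimensional hypercube: 16 nodes */
    unsigned node = 0x5;           /* node 0101 */
    /* the k neighbors of a node differ from it in exactly one bit */
    for (int b = 0; b < k; b++)
        printf("neighbor %d: %u\n", b, node ^ (1u << b));
    /* distance 0101 -> 1010 is 4 hops: all four bits differ */
    printf("hops(0x5, 0xA) = %d\n", hops(0x5, 0xA));
    return 0;
}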

Comparison

                  Mesh   Tree   Hypercube
Diameter           o      +      +
Bisection width    o      -      +
Edges per node     4      3      unlimited (k, grows with size)

Types of parallel machines
– Processor arrays
– Shared-memory multiprocessors
– Distributed-memory multicomputers

Processor Arrays
Instructions operate on scalars or vectors
A processor array = a front-end + synchronized processing elements (PEs)

Processor Arrays
Front-end:
– Sequential machine that executes the program
– Vector operations are broadcast to the PEs
Processing element:
– Performs the operation on its part of the vector
– Communicates with other PEs through a network
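The broadcast-one-operation, many-data-elements idea survives in today's SIMD extensions. A minimal C sketch (my own example, assuming a compiler that understands OpenMP pragmas) applies one logical vector operation across all elements, much as a front-end would broadcast it to the PEs:

#include <stdio.h>

#define N 1024

int main(void) {
    static float a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2 * i; }

    /* one logical vector operation, applied element-wise in lock-step;
       each SIMD lane plays the role of a processing element */
    #pragma omp simd
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    printf("c[10] = %f\n", c[10]); /* prints 30.0 */
    return 0;
}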

Examples of Processor Arrays
– ICL DAP (1970s); CM-200, MasPar MP-1 and MP-2 (late 1980s to early 1990s)
– Earth Simulator (Japan, 2002; former #1 of the top-500)
– The ideas are now applied in GPUs and CPU extensions such as MMX

Shared-Memory Multiprocessors
A single bus easily gets saturated => add caches to the CPUs
Central problem: cache coherency
– Snooping cache: monitor the bus, invalidate the local copy on a write
– Write-through or write-back (copy-back)
Bus-based multiprocessors do not scale
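From the programmer's side, such a machine presents all CPUs with one address space. A minimal shared-memory sketch (my own example, assuming an OpenMP-capable C compiler; not part of the original slides), in which all threads read and write the same array while the hardware's coherence protocol keeps their cached views consistent:

#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void) {
    static double a[N];
    double sum = 0.0;

    /* all threads operate on the same shared address space;
       the reduction clause safely combines their partial sums */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++) {
        a[i] = 1.0;
        sum += a[i];
    }
    printf("sum = %f (max threads: %d)\n", sum, omp_get_max_threads());
    return 0;
}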

Other Multiprocessor Designs (1/2)
Switch-based multiprocessors (e.g., using a crossbar)
Expensive: requires many very fast components

Other Multiprocessor Designs (2/2)
Non-Uniform Memory Access (NUMA) multiprocessors
– Memory is physically distributed
– Some memory is faster to access than other memory
Example: Teras at SARA, the Dutch national supercomputer (1024-node SGI)
The ideas are now applied in multi-cores

Distributed-Memory Multicomputers
Each processor has only a local memory
Processors communicate by sending messages over a network
Routing of messages:
– Packet-switched message routing: the message is split into packets, which are buffered at intermediate nodes (store-and-forward)
– Circuit-switched message routing: a path is established between source and destination
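Message passing is the natural programming model for such machines; a minimal MPI sketch in C (MPI is the course's message-passing library, but this particular program is illustrative, not from the slides):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        /* explicit message: there is no shared memory to write into */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("process 1 received %d\n", value);
    }
    MPI_Finalize();
    return 0;
}

Compiled with mpicc and started with, e.g., mpirun -np 2 ./a.out, each process runs the same program and branches on its rank.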

Packet-switched Message Routing
– Messages are forwarded one node at a time
– Forwarding is done in software
– Every processor on the path from source to destination is involved
– Latency proportional to distance × message length
Old examples: Parsytec GCel (T800 transputers), Intel iPSC

Circuit-switched Message Routing
– Each node has a routing module
– A circuit is set up between source and destination
– Latency proportional to distance + message length
Example: Intel iPSC/2

Modern routing techniques
– Circuit switching: needs to reserve all links on the path (cf. the old telephone system)
– Packet switching: high latency, needs buffering space (cf. postal mail)
– Cut-through routing: packet switching, but packets are forwarded immediately (without buffering) if the outgoing link is available
– Wormhole routing: transmit the head (a few bits) of the message; the rest follows like a worm
A latency-model sketch comparing these schemes follows below.
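The qualitative difference shows up in a standard first-order latency model (my own notation and parameter values, not from the slides): store-and-forward pays for the full message on every hop, while circuit-switched and wormhole routing pay for the distance only once.

#include <stdio.h>

/* first-order latency models (in seconds); parameter values are illustrative */
static double t_hop = 50e-9;     /* per-hop switching/header time */
static double bandwidth = 1e9;   /* bytes per second */

/* store-and-forward: the whole message is received before being forwarded */
static double store_and_forward(int hops, double n) {
    return hops * (t_hop + n / bandwidth);
}

/* circuit switching / wormhole: per-hop set-up (or header) cost, payload streams once */
static double wormhole(int hops, double n) {
    return hops * t_hop + n / bandwidth;
}

int main(void) {
    double n = 4096; /* a 4 KB message */
    for (int hops = 1; hops <= 8; hops *= 2)
        printf("%d hops: store-and-forward %.2f us, wormhole %.2f us\n",
               hops, 1e6 * store_and_forward(hops, n),
               1e6 * wormhole(hops, n));
    return 0;
}

As the hop count grows, store-and-forward latency grows proportionally, while the wormhole/circuit latency stays almost flat, which is exactly the behavior sketched on the next slide.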

Performance
[Figure: network latency versus distance (number of hops) for packet switching, circuit switching, and wormhole routing]

Distributed Shared Memory
Shared memory is easier to program, but doesn't scale
Distributed memory is hard to program, but does scale
Distributed Shared Memory (DSM): provides a shared-memory programming model on top of distributed-memory hardware
– Shared Virtual Memory (SVM): uses the memory-management hardware (paging) and copies pages over the network
– Object-based: provides replicated shared objects (e.g., the Orca language)
A hot research topic in the 1990s, but performance remained the bottleneck

Flynn's Taxonomy
Instruction stream: sequence of instructions
Data stream: sequence of data manipulated by the instructions

                      Single Data  Multiple Data
Single Instruction    SISD         SIMD
Multiple Instruction  MISD         MIMD

Flynn's Taxonomy
– SISD (Single Instruction, Single Data): traditional uniprocessors
– SIMD (Single Instruction, Multiple Data): processor arrays
– MISD (Multiple Instruction, Single Data): nonexistent?
– MIMD (Multiple Instruction, Multiple Data): multiprocessors and multicomputers