Copyright © 2009 Pearson Education, Inc. Publishing as Pearson Addison-Wesley. Principles of Parallel Programming, First Edition, by Calvin Lin and Lawrence Snyder. Chapter 2: Understanding Parallel Computers

Hardware
Changes quickly
Our designs need to be hardware independent
Concerns about stale cache lines

Figure 2.2 Logical organization of the AMD Dual Core Opteron. The processors address a private L2 cache; memory consistency is provided by the System Request Interface; HyperTransport technology connects to RAM and, possibly, other Opteron chips.

Figure 2.3

Figure 2.4 Sun Fire E25K. Eighteen boards are connected with crossbars for address, data, and response; each board contains four UltraSPARC IV Cu processors; the snoopy buses are shown as dashed lines.

Figure 2.5 Crossbar switch connecting four nodes. Notice the output and input channels; crossing wires do not connect unless a connection is shown. Each pair of nodes is directly connected by setting one of the open circles.

Heterogeneous chip design
A general-purpose processor performs the hard-to-parallelize portion of the algorithm
Attached processors perform the compute-intensive portion of the computation

Examples
Graphics processing units (GPUs)
Field-programmable gate arrays (FPGAs)
The Cell processor, designed for video games
–8 specialized cores, the synergistic processing elements (SPEs)
–High communication bandwidth among processors
–Does not provide coherent memory to the SPEs
–Designers chose performance over programmer convenience

Figure 2.6 Architecture of the Cell processor. The architecture is designed to move data: the high-speed I/O controllers have a capacity of 76.8 GB/s; each of the two channels to RAM runs at 12.8 GB/s; the EIB is theoretically capable of 204.8 GB/s.

Clusters
Gigabit Ethernet
Myrinet
–Lower protocol overhead
–Better throughput
Quadrics
–Company formed in 1996
–In 2003, 6 of the 10 fastest computers were based on its interconnect
InfiniBand – the most common interconnect in supercomputers
Fibre Channel – used to connect data storage

Blade servers
A stripped-down server computer with a modular design, optimized to minimize the use of physical space and energy
Whereas a standard rack-mount server can function with (at least) only a power cord and network cable, a blade server relies on a shared chassis for power, cooling, and networking

Supercomputers
BlueGene/L
–65,536 dual-core processors
–700 MHz clock (relatively slow)

Figure 2.7 Logical organization of a BlueGene/L node.

Figure 2.8 BlueGene/L communication networks: (a) 3D torus for standard interprocessor data transfer; (b) collective network for fast evaluation of reductions.

How do these models differ?
Shared address space available to all processors
–Core Duo, Dual Core Opteron, Sun Fire E25K
Distributed address space
–HP cluster, BlueGene/L
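As a concrete illustration of the shared-address-space model (a minimal sketch, not from the text): every thread reads and writes the same variable directly, so updates must be coordinated with a lock.

```c
/* Shared-address-space sketch using POSIX threads: the counter is one
 * memory location visible to every thread. Compile with -pthread. */
#include <pthread.h>
#include <stdio.h>

static long counter = 0;                          /* shared by all threads */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *work(void *arg) {
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);                /* serialize the update */
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t[4];
    for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, work, NULL);
    for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
    printf("counter = %ld\n", counter);           /* prints 400000 */
    return 0;
}
```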

Distributed vs. Shared Memory
Shared memory seems easier and more natural, but
–Delays occur with more processors
–Coherence issues increase memory-reference time
–In practice, message passing can be easier
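For contrast, a minimal message-passing sketch (assuming MPI, one common realization): no variables are shared, and data moves only through explicit send/receive pairs.

```c
/* Message-passing sketch: run with at least two MPI processes,
 * e.g. mpirun -np 2 ./a.out */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        value = 42;                               /* exists only on rank 0 */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);              /* explicit copy arrives */
        printf("rank 1 received %d\n", value);
    }
    MPI_Finalize();
    return 0;
}
```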

Flynn's Taxonomy
SISD – single instruction stream, single data stream
SIMD – single instruction stream, multiple data streams
–As in the Cell's SPEs
MISD – multiple instruction streams, single data stream
MIMD – multiple instruction streams, multiple data streams
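To make the SIMD category concrete (an illustrative sketch, not from the text): a loop like the one below applies a single instruction stream, the multiply-add, to many data elements, which is exactly the pattern compilers map onto vector units such as the Cell's SPEs.

```c
/* SIMD in miniature: the same operation on different data each iteration. */
void saxpy(int n, float a, const float *x, float *y) {
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```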

Computing model – sequential
The von Neumann (random-access) architecture
–Stores instructions and data without concern for the details
–Simplifies talking about programs

Figure 2.9 Two searching computations: (a) linear search, (b) binary search.
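For reference, minimal C versions of the two computations the figure compares (a sketch; the figure's details may differ):

```c
/* (a) Linear search: examines elements one by one, O(n) comparisons. */
int linear_search(const int *a, int n, int key) {
    for (int i = 0; i < n; i++)
        if (a[i] == key) return i;
    return -1;                      /* not found */
}

/* (b) Binary search on a sorted array: O(log n) comparisons. */
int binary_search(const int *a, int n, int key) {
    int lo = 0, hi = n - 1;
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;
        if (a[mid] == key) return mid;
        if (a[mid] < key) lo = mid + 1;
        else              hi = mid - 1;
    }
    return -1;                      /* not found */
}
```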

PRAM model (parallel random-access machine)
A single, unbounded shared memory
Processors follow their own threads of control
Execution proceeds in lock step
Ignores communication costs (unrealistic)
Does not guide programmers to the best solutions
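To see PRAM-style reasoning in action (my sketch, simulated sequentially in C): summing n values takes O(log n) lock-step rounds, because in each round every "processor" combines two elements, and every memory access is assumed to cost one unit. Real machines provide neither the lock step nor the uniform memory cost.

```c
/* PRAM-style tree reduction, n a power of two. Each pass of the inner
 * loop stands for one lock-step round executed by n/(2*stride)
 * processors at once; here it is simulated sequentially. */
void pram_sum(int *a, int n) {
    for (int stride = 1; stride < n; stride *= 2)
        for (int i = 0; i + stride < n; i += 2 * stride)
            a[i] += a[i + stride];
    /* a[0] now holds the total */
}
```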

Candidate Type Architecture (CTA)
Two types of memory references
–Inexpensive local references
–Expensive non-local references

CTA properties
P processors, each sequentially executing local instructions
Local memory access time is normal for sequential computers
Non-local memory access takes 2 to 5 orders of magnitude longer
A node has 1 or 2 active network transfers at a given time
A global controller performs basic operations
–Initiation, synchronization
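A back-of-the-envelope consequence (my illustration; λ is the ratio of non-local to local latency): a computation with n_local local references and n_remote non-local ones costs roughly n_local + λ·n_remote in units of one local access.

```c
/* Hypothetical helper: estimated memory cost in local-access units,
 * where lambda = non-local latency / local latency. */
double cta_cost(double n_local, double n_remote, double lambda) {
    return n_local + lambda * n_remote;
}
/* Example: lambda = 1000 (3 orders of magnitude), n_local = 1e6,
 * n_remote = 1e4 gives 1e6 + 1e7 = 1.1e7 -- the 1% of references
 * that are non-local account for roughly 91% of the memory time. */
```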

Figure 2.10

Figure 2.11 Common topologies used for interconnection networks: (a) 2-D torus, (b) binary 3-cube (see Exercise 8), (c) fat tree, (d) omega network.

Table 2.1 Estimates of λ for common architectures; speeds generally do not include congestion or other traffic delays.

The properties of the CTA lead to the Locality Rule: maximize the number of local memory references and minimize the number of non-local memory references.
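One way the rule plays out in code (my sketch, assuming MPI and an array already distributed across processes): each process first reduces its own block with purely local references, so only P partial sums, rather than all the elements, cross the network.

```c
#include <mpi.h>

/* Sum a distributed array: a local pass, then one collective. */
double distributed_sum(const double *mine, int n_mine) {
    double partial = 0.0, total = 0.0;
    for (int i = 0; i < n_mine; i++)    /* all local references */
        partial += mine[i];
    /* one non-local operation per process instead of n_mine */
    MPI_Allreduce(&partial, &total, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    return total;                       /* same total on every process */
}
```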

Example
P processors each need a random number
Solution 1
–One processor generates the next random number for everyone
Solution 2
–Send the seed to each processor; each generates its own number (see the sketch below)
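A minimal sketch of Solution 2 (my code; rand_r is a simple POSIX stand-in for a real parallel generator): each processor derives private generator state from the common seed and its own rank, so every later draw is purely local.

```c
#include <stdlib.h>

/* Derive a distinct, repeatable seed per processor from a common seed;
 * the large odd constant decorrelates nearby ranks. */
unsigned int seed_for(int rank, unsigned int common_seed) {
    return common_seed ^ ((unsigned int)rank * 2654435761u);
}

/* On processor `rank`:
 *   unsigned int state = seed_for(rank, 12345u);
 *   int r = rand_r(&state);    // no communication with other processors */
```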

CTA
A good model
Should scale
Abstracts the general features of MIMD machines