Lecture 1: Parallel Processing for Scientific Applications


1 Lecture 1 Parallel Processing for Scientific Applications

2 Parallel Computing Multiple processes cooperating to solve a single problem

3 Why Parallel Computing ? It is easy to pose huge computational problems *physical simulation in 3D: 100 x 100 x 100 = 10^6 grid points *oceanography example: 48 M cells, several variables per cell; one time step = 30 Gflop (30,000,000,000 floating-point operations)

4 Why Parallel Computing ? Numerical prototyping: *real phenomena are too complicated to model analytically *real experiments are too hard, too expensive, or too dangerous for a laboratory Examples: simulating aging effects on nuclear weapons (ASCI Project), oil reservoir simulation, large wind tunnels, galactic evolution, whole-factory or product life-cycle design and optimization, DNA matching (bioinformatics)

5 An Example -- Climate Prediction Grid point

6 An Example -- Climate Prediction What is Climate?  Climate (longitude, latitude, height, time) *returns a vector of 6 values: temperature, pressure, humidity, and wind velocity (3 components) Discretize: only evaluate at grid points *Climate(i, j, k, n), where t = n*dt, dt is a fixed time step, n an integer, and i, j, k are integers indexing the grid cells
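The discretized Climate(i, j, k, n) function amounts to a lookup in a 5-D array. A minimal Python sketch (the grid dimensions here are invented for illustration, not the slides' real resolution):

```python
import numpy as np

# state[i, j, k, n] holds the six values at grid cell (i, j, k) and
# time t = n * dt: temperature, pressure, humidity, wind (3 components).
NI, NJ, NK, NT = 30, 30, 11, 96   # illustrative grid and step counts
dt = 0.5                          # hours per time step
state = np.zeros((NI, NJ, NK, NT, 6))

def climate(i, j, k, n):
    """Return the 6-component state vector at t = n * dt."""
    return state[i, j, k, n]
```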

7 An Example -- Climate Prediction Area: 3000 x 3000 miles, Height: 11 miles -- a 3000 x 3000 x 11 cubic-mile domain Segment size: 0.1 x 0.1 x 0.1 cubic miles -- about 10^11 different segments Two-day period, dt = 0.5 hours -- 2x24x2 = 96 time steps 100 instructions per segment *the computation of parameters inside a segment uses the initial values and the values from neighboring segments

8 An Example -- Climate Prediction A single update of the parameters in the entire domain requires 10^11 x 100, or 10^13 instructions (10 trillion instructions). Updating 96 times: about 10^15 instructions Single-CPU supercomputer: *1000 MHz RISC CPU *Execution time: 280 hours. ??? Taking 280 hours to predict the weather for the next 48 hours.
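The slides' back-of-the-envelope estimate can be redone directly; this sketch just repeats the arithmetic from the two slides above and lands at roughly 264 hours, the same ballpark as the quoted 280:

```python
domain = 3000 * 3000 * 11      # cubic miles
segment = 0.1 ** 3             # cubic miles per segment
segments = domain / segment    # ~10**11 segments
per_update = segments * 100    # 100 instructions each -> ~10**13
steps = 2 * 24 * 2             # two days at dt = 0.5 h -> 96 updates
total = per_update * steps     # ~10**15 instructions
hours = total / 1e9 / 3600     # 1000 MHz CPU executes 10**9 instr/s
print(round(hours))            # -> 264
```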

9 Issues in Parallel Computing Design of Parallel Computers Design of Efficient Algorithms Methods for Evaluating Parallel Algorithms Parallel Programming Languages Parallel Programming Tools Portable Parallel Programs Automatic Programming of Parallel Computers

10 Some Basic Studies

11 Design of Parallel Computers Parallel computing is information processing that emphasizes the concurrent manipulation of data elements belonging to one or more processes solving a single problem [Quinn:1994] Parallel computer: a multiple-processor computer capable of parallel computing.

12 Efficient Algorithms Throughput: the number of results per second Speedup: S = T_1 / T_p (T_1 = time on one processor, T_p = time on P processors) Efficiency: E = S / P (P = no. of processors)
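These definitions translate directly into code. A minimal sketch (the timing numbers in the example are invented for illustration):

```python
def speedup(t1, tp):
    """S = T1 / Tp: time on one processor over time on P processors."""
    return t1 / tp

def efficiency(t1, tp, p):
    """E = S / P: speedup per processor."""
    return speedup(t1, tp) / p

# e.g., 280 hours on 1 processor vs 10 hours on 32 processors:
print(speedup(280, 10))         # -> 28.0
print(efficiency(280, 10, 32))  # -> 0.875
```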

13 Scalability Algorithmic scalability: an algorithm is scalable if the available parallelism increases at least linearly with problem size. Architectural scalability: an architecture is scalable if it continues to yield the same performance per processor, as the number of processors is increased and as the problem size is increased. *Solve larger problems in the same amount of time by buying a parallel computer with more processors. ($$$$$ ??)

14 Parallel Architectures SMP: Symmetric Multiprocessor (SGI Power Challenge, SUN Enterprise 6000) MPP: Massively Parallel Processors  INTEL ASCI Red: 9152 processors (1997)  SGI/Cray T3E 1200 LC: 1080 nodes (1998) Cluster: true distributed systems -- tightly-coupled software on loosely-coupled (LAN-based) hardware *NOW: Network of Workstations, COW: Cluster of Workstations, Pile-of-PCs (PoPC)

15 Levels of Abstraction Applications (Sequential ?) (Parallel ?) Programming Models (Shared Memory ?) (Message Passing ?) Addressing Space (Shared Memory?) (Distributed Memory ?) Hardware Architecture

16 Is Parallel Computing Simple ?

17 A Simple Example Take a paper and pen. Algorithm: * Step 1: Write a number on your pad * Step 2: Compute the sum of your neighbor's values * Step 3: Write the sum on the paper

18 ** Questions 1 How do you get values from your neighbors?

19 Shared Memory Model 5, 0, 4

20 Message Passing Model Hey !! What’s your number ?
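In the message-passing model each worker owns its number privately and must ask a neighbor for theirs explicitly. A minimal sketch with Python's multiprocessing (the values 5 and 0 echo the figure; the Queue-based exchange is just one possible protocol, not the slides' specific mechanism):

```python
from multiprocessing import Process, Queue

def worker(my_value, inbox, outbox, results):
    outbox.put(my_value)              # send: "here is my number"
    neighbor = inbox.get()            # receive the neighbor's number
    results.put(my_value + neighbor)  # each worker sums locally

if __name__ == "__main__":
    a_to_b, b_to_a, results = Queue(), Queue(), Queue()
    pa = Process(target=worker, args=(5, b_to_a, a_to_b, results))
    pb = Process(target=worker, args=(0, a_to_b, b_to_a, results))
    pa.start(); pb.start(); pa.join(); pb.join()
    print(results.get() + results.get())  # both workers computed 5
```

No variable is ever shared: all communication happens through the two queues, which is exactly what distinguishes this model from the shared-memory picture on the previous slide.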

21 ** Questions 2 Are you sure the sum is correct ?

22 Some processor starts earlier: 5+0+4 = 9

23 Synchronization Problem !! [figure: one processor is already writing its sum in Step 3 while a neighbor is still reading values in Step 2, so the neighbor's sum comes out as = 14 instead of the correct answer]
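The race on this slide disappears if no processor starts Step 2 until every processor has finished Step 1. A minimal sketch with Python threads and a barrier (the ring of four input values is invented for illustration):

```python
import threading

n = 4
values = [None] * n
sums = [0] * n
barrier = threading.Barrier(n)   # all n must arrive before any proceeds

def worker(i, my_value):
    values[i] = my_value               # Step 1: write a number
    barrier.wait()                     # wait until every Step 1 is done
    left = values[(i - 1) % n]         # Step 2: read the neighbors' values
    right = values[(i + 1) % n]
    sums[i] = left + my_value + right  # Step 3: write down the sum

inputs = [5, 0, 4, 2]
threads = [threading.Thread(target=worker, args=(i, v))
           for i, v in enumerate(inputs)]
for t in threads: t.start()
for t in threads: t.join()
print(sums)  # -> [7, 9, 6, 11], deterministic thanks to the barrier
```

Without the `barrier.wait()` call, a fast worker could read a neighbor's slot before it was written, which is precisely the synchronization problem the slide illustrates.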

24 ** Questions 3 How do you decide when you are done? (throw away the paper)

25 Some processor finished earlier = 9

26 Some processor finished earlier 9

27 Some processor finished earlier Sorry !! We closed !!

28 Some processor finished earlier Sorry !! We closed !! ?+5+0 = ? (a processor still at Step 2 cannot read a value from a neighbor that has already thrown its paper away)

29 Classification of Parallel Architectures

30 1. Based on Control Mechanism Flynn’s Classification: by instruction and data streams:  SISD: single instruction stream, single data stream  SIMD: single instruction stream, multiple data streams  MIMD: multiple instruction streams, multiple data streams  MISD: multiple instruction streams, single data stream

31 SIMD Examples: *Thinking Machines: CM-1, CM-2 *MasPar MP-1 and MP-2 Simple processor: e.g., 1- or 4-bit CPU Fast global synchronization (global clock) Fast neighborhood communication Applications: image/signal processing, numerical analysis, data compression,...

32 2. Based on Address-Space Organization Bell’s Classification of MIMD architectures *Message-passing architecture: local or private memory; multicomputer = MIMD message-passing computer (or distributed-memory computer) *Shared-address-space architecture: hardware support for one-sided communication (read/write); multiprocessor = MIMD shared-address-space computer

33 Address Space A region of a computer’s total memory within which addresses are contiguous and may refer to one another directly in hardware. A shared-memory computer has only one user-visible address space; a disjoint-memory computer can have several. Disjoint memory is more commonly called distributed memory, but the memory of many shared-memory computers (multiprocessors) is physically distributed.

34 Multiprocessors vs. Multicomputers Shared-Memory Multiprocessor Models  UMA: uniform memory access (all SMP servers)  NUMA: nonuniform memory access (DASH, T3E)  COMA: cache-only memory architecture (KSR) Distributed-Memory Multicomputer Model *message-passing network  NORMA model (no remote memory access) *IBM SP2, Intel Paragon, TMC CM-5, INTEL ASCI Red, clusters

35 Parallel Computers at HKU Symmetric Multiprocessors (SMPs): SGI PowerChallenge Cluster: IBM PowerPC Clusters Distributed Memory Machine: IBM SP2 (locations: CYC 807, CYC LG 102, Computer Center)

36 Symmetric Multiprocessors (SMPs) Processors are connected to a shared memory module through a shared bus Each processor has equal right to access the shared memory and all I/O devices A single copy of the OS serves all processors

37 [Block diagram: a Pentium Pro (P6) SMP. Processors share a Pentium Pro processor bus (32-bit address, 64-bit data, 533 MB/s); the MIC (memory interface controller) and a DRAM controller/data path connect the bus to interleaved memory (288-bit interleave data, 72-bit memory data); a PCI bridge links to a PCI bus (32-bit address, 32-bit data, 132 MB/s) carrying PCI devices and a NIC to the network. CYC414 SRG Lab.]

38 SMP Machine SGI POWER CHALLENGE POWER CHALLENGE XL –2-36 CPUs –16 GB memory (for 36 CPUs) –bus performance: up to 1.2 GB/sec Runs a 64-bit OS (IRIX 6.2) Memory is shared, making it suitable for single-address-space programming

39 Distributed Memory Machine Consists of multiple computers (nodes) Nodes communicate by message passing Each node is an autonomous computer: Processor(s) (may be an SMP) Local memory Disks, network adapter, and other I/O peripherals No remote memory access (NORMA)

40 Distributed Memory Machine IBM SP2 SP2 = Scalable POWERparallel System Developed from the RISC System/6000 workstation POWER2 processor, 66.6 MHz, 266 MFLOPS

41 SP2 - Message Passing

42 SP2 - High Performance Switch 8x8 switches route among the nodes simultaneously and quickly Maximum 40 MB/s point-to-point bandwidth

43 SP2 - Nodes (POWER2 processor) Two types of nodes: –Thin node (smaller capacity, used to process individual jobs): 4 Micro Channel slots, 96 KB cache, MB memory, 1-4 GB disk –Wide node (larger capacity, used as servers of the system): 8 Micro Channel slots, 288 KB cache, MB memory, 1-8 GB disk

44 SP2 – The largest SP machine (P2SC, 120 MHz): Pacific Northwest National Lab., U.S., 512 processors, TOP 26, 1998.

45 What’s a Cluster ? A cluster is a group of whole computers that work cooperatively as a single system to provide fast and efficient computing service.

46 [Figure: Node 1, Node 2, Node 3, and Node 4 on switched Ethernet; Node 1: “I need variable A from Node 2!”, Node 2: “OK!”, Node 1: “Thank You!”]

47 Clusters Advantages *Cheaper *Easy to scale *Coarse-grain parallelism Disadvantages *Poor communication performance (typically the latency) compared with other parallel systems

48 TOP 500 (1997) TOP 1 INTEL: ASCI Red at Sandia Nat’l Lab., USA, June 1997 TOP 2 Hitachi/Tsukuba: CP-PACS (2048 processors), Tflops, at Univ. Tsukuba, Japan, 1996 TOP 3 SGI/Cray: T3E 900 LC (696 processors), Tflops, at the UK Meteorological Office, 1997

49 TOP 500 (June, 1998) TOP 1 INTEL: ASCI Red (9152 Pentium Pro processors, 200 MHz), 1.3 Teraflops, at Sandia Nat’l Lab., U.S., since June 1997 TOP 2 SGI/Cray: T3E 1200 LC, 1080 processors, Tflops, U.S. government, installed 1998 TOP 3 SGI/Cray: T3E 900 LC, 1248 processors, Tflops, U.S. government

50 INTEL ASCI Red Compute nodes: 4536 (dual Pentium Pro 200 MHz sharing a 533 MB/s bus) Peak speed: 1.8 Teraflops (Trillion = 10^12) 1,600 square feet 85 cabinets

51 INTEL ASCI Red (Network) Split 2-D Mesh Interconnect Node-to-node bidirectional bandwidth: 800 Mbytes/sec 10 times faster than SP2 (one-way:40 MB/s)

52 Cray T3E 1200 Processor performance: 600 MHz, 1200 Mflops Overall system peak performance: 7.2 gigaflops to 2.5 teraflops, scaling to thousands of processors Interconnect: a three-dimensional bidirectional torus (peak interconnect speed of 650 MB/sec) Cray UNICOS/mk distributed OS Scalable GigaRing I/O system

53 Cray T3E Interconnect 3-D Torus

54 CP-PACS/2048, Japan Peak Perf TFLOPS CPU: PA-RISC 1.1, 150 MHz

55 CP-PACS Interconnect Comm. Bandwidth: 300 MB/s per link

56 TOP 500 (Asia) 1996: *Japan: (1) SR2201/1024 (1996) *Taiwan: (76) SP2/80 *Korea: (97) Cray Y-MP/16 *China: (231) SP2/32 *Hong Kong: (232) -- SP2/32 (HKU/CC) 1997: *Japan: (2)CP-PACS/2048 (1996) (5) SR2201/1024 (1996) *Korea: (34)T3E 900 LC (154) Ultra HPC 1000 *Taiwan: (167) SP2/80 *Hong Kong: (426) SGI Origin 2000 (CU) *(500): SP2/38 (UCLA)

57 TOP500 Asia 1998 Japan: *TOP 6: CP-PACS/2048 (1997, TOP 2) *TOP 12: NEC SX-4/128M4 *TOP 13: NEC SX-4/128H4 *TOP 14: Hitachi SR2201/1024 *…more Korea: SGI/Cray T3E900 LC (TOP 52),... Taiwan: IBM SP2/110, 1998 (TOP 241)

58 More Information TOP500: ASCI Red: Cray T3E 1200: t3e/1200/ Chapter: ,