©2003 Dror Feitelson Parallel Computing Systems Part I: Introduction Dror Feitelson Hebrew University

©2003 Dror Feitelson Topics Overview of the field Architectures: vectors, MPPs, SMPs, and clusters Networks and routing Scheduling parallel jobs Grid computing Evaluating performance

©2003 Dror Feitelson Today (and next week?) What is parallel computing Some history The Top500 list The fastest machines in the world Trends and predictions

©2003 Dror Feitelson What is a Parallel System? In particular, what is the difference between parallel and distributed computing?

©2003 Dror Feitelson What is a Parallel System? Chandy: it is related to concurrency. In distributed computing, concurrency is part of the problem. In parallel computing, concurrency is part of the solution.

©2003 Dror Feitelson Distributed Systems Concurrency because of physical distribution –Desktops of different users –Servers across the Internet –Branches of a firm –Central bank computer and ATMs Need to coordinate among autonomous systems Need to tolerate failures and disconnections

©2003 Dror Feitelson Parallel Systems High-performance computing: solve problems that are too big for a single machine –Get the solution faster (weather forecast) –Get a better solution (physical simulation) Need to parallelize algorithm Need to control overhead Can assume a friendly system?

©2003 Dror Feitelson The Convergence Use distributed resources for parallel processing Networks of workstations – use available desktop machines within organization Grids – use available resources (servers?) across organizations Internet computing – use personal PCs across the globe

©2003 Dror Feitelson Some History

©2003 Dror Feitelson Early HPC Parallel systems in academia/research –1974: C.mmp –1974: Illiac IV –1978: Cm* –1983: Goodyear MPP

©2003 Dror Feitelson Illiac IV – 1974 SIMD: all processors execute the same instruction Numerical calculations at NASA Now in the Boston Computer Museum

©2003 Dror Feitelson The Illiac IV in Numbers 64 processors arranged as an 8×8 grid Each processor has 10^4 ECL transistors Each processor has 2K 64-bit words (total is 8 Mbit) Arranged in 210 boards Packed in 16 cabinets 500 Mflops peak performance Cost: $31 million
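A quick check of the memory figure (my arithmetic, not on the original slide): 64 processors × 2K words × 64 bits per word = 64 × 2048 × 64 ≈ 8.4 million bits, i.e. the quoted total of 8 Mbit.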

©2003 Dror Feitelson Sustained vs. Peak Peak performance: product of clock rate and number of functional units Sustained rate: what you actually achieve on a real application Sustained is typically much lower than peak –Application does not require all functional units –Need to wait for data to arrive from memory –Need to synchronize –Best for dense matrix operations (Linpack) (Alternative definition of peak: a rate that the vendor guarantees will not be exceeded)
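A worked example of the peak formula (my arithmetic, using the Cray 1 figures quoted on the slides below: 80 MHz clock, one add and one multiply result per cycle): peak = 80 MHz × 2 flops per cycle = 160 Mflops. An application that cannot keep both pipelines busy, or that stalls on memory, sustains only a fraction of this.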

©2003 Dror Feitelson Early HPC Parallel systems in academia/research –1974: C.mmp –1974: Illiac IV –1978: Cm* –1983: Goodyear MPP Vector systems by Cray and Japanese firms –1976: Cray 1 rated at 160 Mflops peak –1982: Cray X-MP, later Y-MP, C90, … –1985: Cray 2, NEC SX-2

©2003 Dror Feitelson Cray’s Achievements Architectural innovations –Vector operations on vector registers –All memory is equally close: no cache –Trade off accuracy and speed Packaging –Short and equally long wires –Liquid cooling systems Style

©2003 Dror Feitelson Vector Supercomputers Vector registers store vectors of values for fast access Vector instructions operate on whole vectors of values –Overhead of instruction decode only once per vector –Pipelined execution of instruction on vector elements: one result per clock tick (at least after pipeline is full) –Possible to chain vector operations: start feeding second functional unit before finishing first one
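To make pipelining and chaining concrete, here is a minimal sketch in C (not from the original slides; the function and variable names are illustrative) of the kind of loop vector hardware handles well:

/* y = a*x + y over whole vectors.
 * A vectorizing compiler loads strips of x and y into vector registers,
 * issues one vector multiply and one vector add, and with chaining the
 * add unit starts consuming products before the multiply has finished,
 * giving roughly one result per clock once the pipelines are full. */
void axpy(int n, double a, const double *x, double *y)
{
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}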

©2003 Dror Feitelson Cray 1 – 1976 80 MHz clock, 160 Mflops peak Liquid cooling World's most expensive love seat Power supply and cooling under the seat Available in red, blue, black… No operating system

©2003 Dror Feitelson Cray 1 Wiring Round configuration for small and uniform distances Longest wire: 4 feet Wires connected manually by extra-small engineers

©2003 Dror Feitelson Cray X-MP – 1982 Peak approaching 1 Gflop Multiprocessor with 2 or 4 Cray 1-like processors Shared memory

©2003 Dror Feitelson Cray X-MP

©2003 Dror Feitelson Cray 2 – 1985 Smaller and more compact than the Cray 1 4 (or 8) processors Total immersion liquid cooling

©2003 Dror Feitelson Cray Y-MP – 1988 Up to 8 proc's Achieved 1 Gflop

©2003 Dror Feitelson Cray Y-MP – Opened

©2003 Dror Feitelson Cray Y-MP – From Back Power supply and cooling

©2003 Dror Feitelson Cray C90 ~1 Gflop per processor 8 or more processors

©2003 Dror Feitelson The MPP Boom 1985: Thinking Machines introduces the Connection Machine CM-1 –16K single-bit processors, SIMD –Followed by CM-2, CM-200 –Similar machines by MasPar Mid '80s: hypercubes become successful Also: Transputers used as building blocks Early '90s: big companies join –IBM, Cray

©2003 Dror Feitelson SIMD Array Processors '80s favorites –Connection Machine –MasPar Very many single-bit processors with attached memory – proprietary hardware Single control unit: everything is totally synchronized (SIMD = single instruction multiple data) Massive parallelism even with “correct counting” (i.e. divide by 32)

©2003 Dror Feitelson Connection Machine CM-2 Cube of 64K proc's Acts as backend Hypercube topology Data vault for parallel I/O

©2003 Dror Feitelson Hypercubes Early ’80s: Caltech 64-node Cosmic Cube Mid to late ’80s: Commercialized by several companies –Intel iPSC, iPSC/2, iPSC/860 –nCUBE, nCUBE 2 (later turned into a VoD server…) Early ’90s: replaced by mesh/torus –Intel Paragon – i860 processors –Cray T3D, T3E – Alpha processors

©2003 Dror Feitelson Transputers A microprocessor with built-in support for communication Programmed using Occam Used in Meiko and other systems
PAR
  SEQ
    x := 13
    c ! x
  SEQ
    c ? y
    z := y    -- z is 13
Synchronous communication: an assignment across processes
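For comparison with the message-passing libraries that appear later in these slides, the same synchronous exchange might look like this in C with MPI (a hypothetical sketch, not part of the original presentation; the ranks and tag value are arbitrary, and it must be run with at least two processes):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, x = 13, y = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        /* Synchronous send: completes only after the receive has started,
           roughly matching occam's channel semantics. */
        MPI_Ssend(&x, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&y, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("y = %d\n", y);   /* y is 13 */
    }
    MPI_Finalize();
    return 0;
}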

©2003 Dror Feitelson Attack of the Killer Micros Commodity microprocessors advance at a faster rate than vector processors Takeover point was around year 2000 Even before that, using many together could provide lots of power –1992: TMC uses SPARC in CM-5 –1992: Intel uses i860 in Paragon –1993: IBM SP uses RS/6000, later PowerPC –1993: Cray uses Alpha in T3D –Berkeley NoW project

©2003 Dror Feitelson Connection Machine CM-5 – 1992 SPARC-based Fat-tree network Dominant in early '90s Featured in Jurassic Park Support for gang scheduling!

©2003 Dror Feitelson Intel Paragon – 1992 i860 proc's per node: –Compute –Communication Mesh interconnect with spiffy display

©2003 Dror Feitelson Cray T3D/T3E 1993 – Cray T3D Uses commodity microprocessors (DEC Alpha) 3D Torus interconnect 1995 – Cray T3E

©2003 Dror Feitelson IBM SP – 1993 Racks of RS/6000 processors Each runs AIX (full Unix) Multistage network Flexible configurations First large IUCC machine

©2003 Dror Feitelson Berkeley NoW The building is the computer Just need some glue software…

©2003 Dror Feitelson Not Everybody is Convinced… Japan’s computer industry continues to build vector machines NEC –SX series of supercomputers Hitachi –SR series of supercomputers Fujitsu –VPP series of supercomputers Albeit with less style

©2003 Dror Feitelson Fujitsu VPP700

©2003 Dror Feitelson NEC SX-4

©2003 Dror Feitelson More Recent History 1994 – 1995 slump –Cold war is over –Thinking Machines files for Chapter 11 –Kendall Square Research (KSR) files for Chapter 11 Late '90s much better –IBM, Cray retain parallel machine market –Later also SGI, Sun, especially with SMPs –ASCI program is started 21st century: clusters take over –Based on SMPs

©2003 Dror Feitelson SMPs Machines with several CPUs Initially small scale: 8-16 processors Later scaled to much larger processor counts Global shared memory accessed via a bus Hard to scale further due to shared memory and cache coherence

©2003 Dror Feitelson SGI Challenge 1 to 16 processors Bus interconnect Dominated low end of Top500 list in mid ’90s Not only graphics…

©2003 Dror Feitelson SGI Origin An Origin 2000 installed at IUCC MIPS processors Remote memory access

©2003 Dror Feitelson Architectural Convergence Shared memory used to be uniform (UMA) –Based on bus or crossbar –Conventional load/store operations Distributed memory used message passing Newer machines support remote memory access –Nonuniform (NUMA): access to remote memory costs more –Put/get operations (but handled by NIC) –Cray T3D/T3E, SGI Origin 2000/3000
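The put/get style can be illustrated with MPI's one-sided operations (a sketch in C, under the assumption that MPI-2 windows are an acceptable stand-in for the machine-specific remote-memory interfaces named above; run with at least two processes):

#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, local = 0, value;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Expose one int of local memory for remote access. */
    MPI_Win_create(&local, sizeof(int), sizeof(int), MPI_INFO_NULL,
                   MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);
    if (rank == 0) {
        value = 42;
        /* "Put": write directly into rank 1's memory; the transfer is
           handled by the communication layer, not by rank 1's CPU. */
        MPI_Put(&value, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
    }
    MPI_Win_fence(0, win);   /* after this fence, rank 1 sees local == 42 */

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}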

©2003 Dror Feitelson The ASCI Program 1996: nuclear test ban leads to need for simulation of nuclear explosions Accelerated Strategic Computing Initiative: Moore’s law not fast enough… Budget of a billion dollars

©2003 Dror Feitelson The Vision [Chart: performance vs. time, showing market-driven progress, ASCI requirements, PathForward, and technology transfer]

©2003 Dror Feitelson ASCI Milestones

©2003 Dror Feitelson The ASCI Red Machine 9260 processors – PentiumPro 200 Arranged as 4-way SMPs in 86 cabinets 573 GB memory total 2.25 TB disk space total 2 miles of cables 850 KW peak power consumption 44 tons (+300 tons air conditioning equipment) Cost: $55 million

©2003 Dror Feitelson Clusters vs. MPPs Mix and match approach –PCs/SMPs/blades used as processing nodes –Fast switched network for interconnect –Linux on each node –MPI for software development –Something for management Lower cost to set up Non-trivial to operate effectively

©2003 Dror Feitelson SMP Nodes PCs, workstations, or servers with several CPUs Small scale (4-8) used as nodes in MPPs or clusters Access to shared memory via shared L2 cache SMP support (cache coherence) built into modern microprocessors

©2003 Dror Feitelson Myrinet 1995 Switched gigabit LAN –As opposed to Ethernet that is a broadcast medium Programmable NIC –Offload communication operations from the CPU Allows clusters to achieve communication rates of MPPs Very expensive Later: gigabit Ethernet

©2003 Dror Feitelson

Blades PCs/SMPs require resources –Floor space –Cables for interconnect –Power supplies and fans This is meaningful if you have thousands Blades provide dense packaging With vertical mounting get < 1U on average The hot new thing in 2002

©2003 Dror Feitelson SunFire Servers 16 servers in a rack-mounted box Used to be called “single-board computers” in the '80s (Makbilan)

©2003 Dror Feitelson The Cray Name 1972: Cray Research founded –Cray 1, X-MP, Cray 2, Y-MP, C90… –From 1993: MPPs T3D, T3E 1989: Cray Computer founded –GaAs efforts, closed 1996: SGI acquires Cray Research –Attempt to merge T3E and Origin 2000: sold to Tera –Use name to bolster MTA 2002: Cray sells Japanese NEC SX machines 2002: Announces new X1 supercomputer

©2003 Dror Feitelson Vectors are not Dead! 1994: Cray T90 –Continues Cray C90 line 1996: Cray J90 –Continues Cray Y-MP line 2000: Cray SV1 2002: Cray X1 –Only “Big-Iron” company left

©2003 Dror Feitelson Cray J90 Very popular continuation of the Y-MP 8, then 16, then 32 processors One installed at IUCC

©2003 Dror Feitelson Cray X1 – 2002 Up to 1024 nodes 4 custom vector proc's per node 12.8 Gflops peak each Torus interconnect

©2003 Dror Feitelson Confused?

©2003 Dror Feitelson The Top500 List List of the 500 most powerful computer installations in the world Separates academic chic from real impact Measured using Linpack –Dense matrix operations –Might not be representative of real applications
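For a feel of what Linpack exercises (my sketch, not part of the slides): the benchmark times the solution of a dense linear system Ax = b. A naive, unblocked version of the central elimination loop looks like this in C; the real benchmark adds pivoting and heavily tuned, blocked routines:

/* In-place LU-style elimination of an n-by-n dense matrix A
   (row-major, no pivoting). Nearly all the flops are in the innermost
   update loop, which is the kind of dense, regular work that lets a
   machine approach its peak rate. */
void eliminate(int n, double *A)
{
    for (int k = 0; k < n - 1; k++)
        for (int i = k + 1; i < n; i++) {
            double m = A[i*n + k] / A[k*n + k];
            A[i*n + k] = m;                      /* store the multiplier */
            for (int j = k + 1; j < n; j++)
                A[i*n + j] -= m * A[k*n + j];
        }
}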

©2003 Dror Feitelson

The Competition How to achieve a rank: Few vector processors –Maximize power per processor –High efficiency Many commodity processors –Ride technology curve –Power in numbers –Low efficiency

©2003 Dror Feitelson Vector Programming Conventional Fortran Automatic vectorization of loops by compiler Autotasking uses processors that happen to be available at runtime to execute chunks of loop iterations Easy for application writers Very high efficiency

©2003 Dror Feitelson MPP Programming Library added to programming language –MPI for distributed memory –OpenMP for shared memory Applications need to be partitioned manually Many possible efficiency losses –Fragmentation in allocating processors –Stalls waiting for memory and communication –Imbalance among threads Hard for programmers
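As a small illustration of the shared-memory side (a sketch, not from the slides): with OpenMP the partitioning of loop iterations among threads is expressed with a directive, but load imbalance and synchronization overhead remain the programmer's problem. Compile with an OpenMP-enabled compiler (e.g. -fopenmp):

/* Sum an array with the iterations split among the available threads.
   The reduction clause combines the per-thread partial sums at the end;
   an uneven split or a stalled thread shows up directly as lost efficiency. */
double array_sum(int n, const double *x)
{
    double s = 0.0;
    #pragma omp parallel for reduction(+:s)
    for (int i = 0; i < n; i++)
        s += x[i];
    return s;
}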

©2003 Dror Feitelson Also National Competition Japan Inc. is “more Cray than Cray” –Computers based on few but powerful proprietary vector processors –Numerical wind tunnel at rank 1 from 1993 to 1995 –CP-PACS at rank 1 in 1996 –Earth simulator at rank 1 in 2003 US industry switched to commodity microprocessors –Even Cray did –ASCI machines at rank 1 in 1997—2002

©2003 Dror Feitelson Vectors vs. MPPs – 1994 Feitelson, Int. J. High-Perf. Comput. App., 1999

©2003 Dror Feitelson Vectors vs. MPPs – 1997

©2003 Dror Feitelson Vectors vs. MPPs – 1997

©2003 Dror Feitelson The Current Situation

©2003 Dror Feitelson Real Usage Control functions for telecomm companies Reservoir modeling for oil companies Graphic rendering for Hollywood Financial modeling for Wall Street Drug design for pharmaceuticals Weather prediction Airplane design for Boeing and Airbus Hush-hush activities

©2003 Dror Feitelson

The Earth Simulator Operational in late 2002 Top rank in Top500 list Result of 5-year design and implementation effort Equivalent power to top 15 US machines (including all ASCI machines) Really big

©2003 Dror Feitelson

The Earth Simulator in Numbers 640 nodes 8 vector processors per node, 5120 total 8 Gflops per processor, 40 Tflops total 16 GB memory per node, 10 TB total 2800 km of cables 320 cabinets (2 nodes each) Cost: $350 million
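The totals are mutually consistent (my arithmetic, not on the original slide): 640 nodes × 8 processors = 5120 processors; 5120 × 8 Gflops = 40.96 ≈ 40 Tflops; 640 × 16 GB = 10,240 GB ≈ 10 TB.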

©2003 Dror Feitelson Trends

©2003 Dror Feitelson

Exercise Look at 10 years of Top500 lists and try to say something non-trivial about trends Are there things that grow? Are there things that stay the same? Can you make predictions?

©2003 Dror Feitelson Distribution of Vendors – 1994 Feitelson, Int. J. High-Perf. Comput. App., 1999

©2003 Dror Feitelson Distribution of Vendors – 1997

©2003 Dror Feitelson IBM in the Lists Arrows are the ANL SP1 with 128 processors Rank doubles each year

©2003 Dror Feitelson Minimal Parallelism

©2003 Dror Feitelson Min vs. Max

©2003 Dror Feitelson Power with Time Rmax of last machine doubles each year This is 8-fold in three years Degree of parallelism doubles every three years So power of each processor increases 4-fold in three years (=doubles in 18 months) Which is Moore’s Law…
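Spelling out the arithmetic of the last step (my reconstruction of the slide's reasoning): over three years total power grows by 2 × 2 × 2 = 8, while parallelism grows by 2, so per-processor power grows by 8 / 2 = 4 in three years; since 4 = 2 × 2, that is one doubling every 18 months.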

©2003 Dror Feitelson Distribution of Power Rank of machines doubles each year Power of rank 500 machine doubles each year So rank 250 machine this year has double the power of rank 500 machine this year And rank 125 machine has double the power of rank 250 machine In short, power drops polynomially with rank: it halves each time the rank doubles
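Written as a formula (my extrapolation of the argument above): halving the rank doubles the power, so Rmax(rank) ≈ Rmax(500) × 500 / rank, a power law in the rank rather than an exponential decay.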

©2003 Dror Feitelson Power and Rank

©2003 Dror Feitelson Power and Rank

©2003 Dror Feitelson Power and Rank

©2003 Dror Feitelson Power and Rank The slope is becoming flatter

©2003 Dror Feitelson Machine Ages in Lists

©2003 Dror Feitelson New Machines

©2003 Dror Feitelson Industry Share

©2003 Dror Feitelson Vector Share

©2003 Dror Feitelson Summary Invariants of the last few years: Power grows exponentially with time Parallelism grows exponentially with time But maximal usable parallelism is ~10000 Power drops polynomially with rank Age in the list drops exponentially About 300 new machines each year About 50% of machines in industry About 15% of power due to vector processors