Paths and Cul de Sacs to High Performance Computing: International Symposium on Modern Computing Celebrating the John Vincent Atanasoff Centenary, Iowa State University.

Presentation transcript:

Paths and Cul de Sacs to High Performance Computing
International Symposium on Modern Computing, Celebrating the John Vincent Atanasoff Centenary
Iowa State University, 31 October 2003
Gordon Bell, Microsoft Bay Area Research Center

The Paths
– DARPA's HPC for National Security… 5 > 3 > 2
  – Standards paradox: the greater the architectural diversity, the less the learning and program market size
– COTS evolution… if only we could interconnect and cool them, so that we can try to program it
– Terror Grid… the same research community that promised a clusters programming environment
– Response to the Japanese with another program… and then a miracle happens

A brief, simplified history of HPC
1. Cray formula evolves smPv for FORTRAN (US: 60-90)
2. VAXen threaten computer centers…
3. NSF response: Lax Report. Create 7 Cray centers
4. The Japanese are coming with the 5th AI Generation
5. DARPA SCI response: search for parallelism w/ scalables
6. Scalability is found: "bet the farm" on micro clusters. Beowulf standard forms (in spite of funders) >1995. "Do-it-yourself" Beowulfs negate computer centers, since everything is a cluster, enabling "do-it-yourself" centers! >2000. Result >95: EVERYONE needs to re-write codes!!
7. DOE's ASCI: petaflops clusters => "arms" race continues!
8. The Japanese came! Just like they said.
9. HPC for National Security response: 5 bets & 7 years
10. Next Japanese effort? Evolve? (Especially software) Red herrings or hearings
11. High-speed nets enable peer2peer & Grid or Teragrid
12. Atkins Report: spend $1.1B/year, form more and larger centers and connect them as a single center…
13. DARPA HP 2010 project: 5 > 3 (Cray, IBM, SUN) > 1 winner

Steve Squires & Gordon Bell at our "Cray" at the start of DARPA's SCI program (c1984). Years later: clusters of Killer Micros become the single standard.

"In Dec. … computers with 1,000 processors will do most of the scientific processing." Danny Hillis, 1990 (1 paper or 1 company)

RIP: Lost in the search for parallelism. ACRI (French-Italian program), Alliant (proprietary Crayette), American Supercomputer, Ametek, Applied Dynamics, Astronautics, BBN, CDC > ETA (ECL transition), Cogent, Convex > HP, Cray Computer > SRC (GaAs flaw), Cray Research > SGI > Cray (management), Culler-Harris, Culler Scientific (vapor…), Cydrome (VLIW), Dana/Ardent/Stellar/Stardent, Denelcor, Encore, Elexsi, ETA Systems (aka CDC; Amdahl flaw), Evans and Sutherland Computer, Exa, Flexible, Floating Point Systems (SUN savior), Galaxy YH-1, Goodyear Aerospace MPP (SIMD), Gould NPL, Guiltech, Intel Scientific Computers, International Parallel Machines, Kendall Square Research, Key Computer Laboratories (searching again), MasPar, Meiko, Multiflow, Myrias, Numerix, Pixar, Parsytec, nCube, Prisma, Pyramid (early RISC), Ridge, Saxpy, Scientific Computer Systems (SCS), Soviet Supercomputers, Supertek, Supercomputer Systems, Suprenum, Tera > Cray Company, Thinking Machines, Vitesse Electronics, Wavetracer (SIMD)

Sources of lack of success
Poor management
– Poor engineering, or lack of supporting technology
– Not understanding Moore's Law & CMOS (ECL, GaAs)
– Massive bureaucracy, no ideas or leader, minimal technology
Plan
– Research program, not a company or product
– Market channel: FPS, Cray CMOS knock-offs
– No sustaining technology or second product
– Lack of ability to get apps
– Just a RISC (one-time hit)
– Too expensive for the return
Bad ideas
– SIMD… lack of enough apps; not general purpose
– Vector/scalar > 10
– VLIW, including Itanium
– MTA
– Too many ideas

1987 Interview
July 1987, as first CISE AD:
Kicked off parallel processing initiative with 3 paths
– Vector processing was totally ignored
– Message-passing multicomputers, including distributed workstations and clusters
– smPs (multis): main line for programmability
– SIMDs might be low-hanging fruit
Kicked off Gordon Bell Prize. Goal: common applications parallelism
– 10x by 1992; 100x by 1997

1989 CACM (figure from the 1989 CACM article; several entries marked with X)

A narrowing of ideas
1990s: RISC, SIMD, systolic, MTA, NUMA, clusters. DARPA funds n startups. Vectors: disdain (Japan: evolve them). Interconnect = ???? many tries. Scalability = ???? Programmability = nil.
2003: COTS clusters of n node types*; Japan: vector evolution. DARPA funds 3 big companies + xx. Vectors: NEC & Fujitsu; Cray almost rediscovers. Interconnect = standard (Japan: cross-point). Scalability = CLUSTERS! Programmability = nil (Japan: HPC).
*1P, smP, smPv, NUMA, …

Perf (PAP) = c x $s x 1.6**(t-1992); c = 128 GF/$300M. '94 prediction: c = 128 GF/$30M
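As a rough illustration of the slide's formula (my own sketch, not from the talk; the function name and the reading of c as 128 GFlops per $300M of spending are assumptions):

    # Sketch of the prediction Perf(PAP) = c * $ * 1.6**(t - 1992),
    # assuming c is calibrated as 128 GFlops per $300M (the '94 prediction
    # used 128 GFlops per $30M).
    def predicted_peak_gflops(dollars, year, c=128.0 / 300e6):
        """Predicted peak (PAP) in GFlops for a machine costing `dollars`, delivered in `year`."""
        return c * dollars * 1.6 ** (year - 1992)

    # Example: a $300M machine delivered in 2002.
    print(predicted_peak_gflops(300e6, 2002))  # 128 * 1.6**10, roughly 14,000 GFlops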

(Chart: growth trend lines labeled 60%, 100%, 110%; ES and PS2 marked; caption "How Long?")

Bell Prize Performance Gain
26.58 TF / .00045 TF ≈ 59,000x in 15 years
Cost increase: $15M >> $300M? Say 20x. Inflation was 1.57x, so effective spending increase = 20/1.57 = 12.73x
59,000 / 12.73 = 4,639x
Price-performance: $2500/MFlops > $0.25/MFlops = 10^4 ($1K / 4 GFlops PC = $0.25/MFlops)
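A quick back-of-the-envelope check of the arithmetic above (my own sketch; the 0.45 GFlops starting point is inferred from the quoted 26.58 TF and 59,000x, not stated on the slide):

    # Re-derive the Bell Prize gain figures quoted on the slide.
    perf_ratio = 26.58e12 / 0.45e9        # 26.58 TFlops vs ~0.45 GFlops: ~59,000x in 15 years
    spend_ratio = 20 / 1.57               # ~20x nominal spending growth, 1.57x inflation: ~12.7x
    effective_gain = perf_ratio / spend_ratio   # ~4,600x spending-adjusted gain

    years = 15
    print(perf_ratio ** (1 / years))      # ~2.08x per year raw performance growth
    print(effective_gain ** (1 / years))  # ~1.76x per year spending-adjusted growth
    print(2500 / 0.25)                    # price-performance improvement: 10**4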

Bell Prize Performance Winners
– Vector: Cray X-MP, Y-MP, CM2* (2)
– Clustered: CM5, Intel 860 (2), Fujitsu (2), NEC (1) = 10
– Cluster of SMP (Constellation): IBM
– Cluster, single address, very fast net: Cray T3E
– NUMA: SGI… good idea, but not universal
– Special purpose (2)
– No winner: '91
By 1994, all were scalable (not: Cray X-MP/Y-MP, CM2). No x86 winners!
*Note: SIMD classified as a vector processor.

Heuristics
– Use dense matrices, or almost embarrassingly parallel (//) apps
– Memory BW… you get what you pay for (4-8 Bytes/Flop)
– RAP/$ is constant. Cost of memory bandwidth is constant.
– Vectors will continue to be an essential ingredient; the low-overhead formula to exploit the bandwidth, stupid
– Bad ideas: SIMD; multi-threading TBD
– Fast networks or larger memories decrease inefficiency
– Specialization really, really pays in performance/price! 2003: 50 Sony for $50K is good.
– COTS aka x86 for performance/price, BUT not performance
Bottom line: Memory BW, Interconnect BW <> Memory Size, FLOPs
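To make the 4-8 Bytes/Flop heuristic concrete, a minimal sketch (my illustration; the 4 GFlops peak and 1.6 GB/s node are hypothetical, roughly 2003-era PC numbers, not figures from the talk):

    # On bandwidth-bound code, sustainable Flops are capped by memory bandwidth
    # divided by the bytes each Flop must move from memory.
    def sustainable_gflops(mem_bw_gb_per_s, bytes_per_flop):
        return mem_bw_gb_per_s / bytes_per_flop

    peak_gflops = 4.0      # hypothetical node peak
    mem_bw = 1.6           # hypothetical memory bandwidth, GB/s
    for bpf in (4, 8):
        cap = sustainable_gflops(mem_bw, bpf)
        print(f"{bpf} B/Flop: {cap:.2f} GFlops sustainable, {100 * cap / peak_gflops:.0f}% of peak")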

HPC for National Security Schedule

Does the schedule make sense?
Early 90s-97: 4 yr. firm proposal; … yr. for SDV/compiler; … yr. useful system

System Components & Technology

What about software? Will we ever learn?
The working group did not establish a roadmap for software technologies. One reason for this is that progress on software technologies for HEC is less likely to result from focused efforts on specific point technologies, and more likely to emerge from large integrative projects and test beds: one cannot develop, in a meaningful way, software for high performance computing in the absence of high performance computing platforms.

All HPC systems are CLUSTERS and therefore very similar, but still diverse!
Nodes connected through a switch…
– Ethernet | fast switched | cross-point switches
– Messages | direct memory access
Most performance is lost by making all systems into uni-processors w/ excessive parallelism
Nodes have diversity that must be exploited
– Scalar | vector
– Uni | smP | NUMA smP | distributed smP (DSM)
Predictable: new multi-threaded micros will be a disaster!
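For the "Messages" node-to-node style the slide mentions, a minimal message-passing sketch using mpi4py (my choice of library for illustration; the talk does not prescribe one):

    # Two-rank ping: rank 0 sends a message across the interconnect, rank 1 receives it.
    # Run with, e.g.:  mpiexec -n 2 python ping.py
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    if comm.Get_rank() == 0:
        comm.send({"payload": list(range(4))}, dest=1, tag=0)
    elif comm.Get_rank() == 1:
        print("received:", comm.recv(source=0, tag=0))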

The State of HPC Software and Its Future
John M. Levesque, Senior Technologist, Cray Inc.
(Courtesy of John Levesque, Cray; NRC/CSTB Future of Supercomputing study)

Bottom Line
The Attack of the Killer Micros significantly hurt application development, and overall performance has declined over their 10-year reign.
– As peak performance has increased, the sustained percentage of peak performance has declined
The attack of "free software" is finishing off any hopes of recovery. Companies cannot build a business case to supply needed software. Free software does not result in productive software.
– While there are numerous "free software" databases, Oracle and other proprietary databases are preferred due to robustness and performance.

The Virtuous Economic Cycle drives the PC industry… & Beowulf
(Cycle diagram; labels: innovation, volume, competition, standards, utility/value, DOJ, greater/lower cost, creates apps, tools, training, attracts users, attracts suppliers)

PC Nodes Don't Make Good Large-Scale Technical Clusters
PC microprocessors are optimized for the desktop market (the highest volume, most competitive segment)
– Have very high clock rates to win desktop benchmarks
– Have very high power consumption and (except in small, zero-cache-miss-rate applications) are quite mismatched to memory
PC nodes are physically large
– To provide power and cooling (Papadopoulos: "computers… suffer from excessive cost, complexity, and power consumption")
– To support other components
High node-count clusters must be spread over many cabinets with cable-based multi-cabinet networks
– Standard (Ethernet) and faster, expensive networks have quite inadequate performance for technical applications… switching cost equals node cost
– Overall size, cabling, and power consumption reduce reliability… infrastructure cost equals node cost

Lessons from Beowulf
– An experiment in parallel computing systems ('92)
– Established vision: low-cost, high-end computing
– Demonstrated effectiveness of PC clusters for some (not all) classes of applications
– Provided networking software
– Provided cluster management tools
– Conveyed findings to broad community (tutorials and the book)
– Provided design standard to rally community!
– Standards beget: books, trained people, software… a virtuous cycle that allowed apps to form
– Industry began to form beyond a research project
(Courtesy, Thomas Sterling, Caltech)

A Massive Public Works Program… but will it produce an HPC for NS?
Furthermore, high-end computing laboratories are needed. These laboratories will fill a critical capability gap… to
– test system software on dedicated large-scale platforms,
– support the development of software tools and algorithms,
– develop and advance benchmarking and modeling and simulations for system architectures, and
– conduct detailed technical requirements analysis.
These functions would be executed by existing government and academic computer centers.

What I worry about: our direction
Overall: tiny market and need.
– Standards paradox: the more unique and/or greater the diversity of the architectures, the less the learning and market for software.
Resources, management, and engineering:
– Schedule for big ideas isn't in the ballpark, e.g. Intel & SCI (c1984)
– Are the B & B working on the problem? Or marginal people T & M?
– Proven good designers and engineers vs. proven mediocre | unproven!
– Creating a "government programming co." versus an industry
Architecture:
– Un-tried or proven-poor ideas? Architecture moves slowly!
– Evolution versus more radical, but completely untested ideas
– CMOS designs(ers): poor performance… from micros to switches
– Memory and disk access growing at 10% versus 40-60%
Software:
– No effort that is of the scale of the hardware
– Computer science versus working software
– Evolve-and-fix versus a start-over effort with the new & inexperienced

The End