
1 Scalable Scientific Computing at Compaq. CAS 2001, Annecy, France, October 29 – November 1, 2001. Dr. Martin Walker, Compaq Computer EMEA

2 Agenda of the entertainment
• From EV4 to EV7: four implementations of the Alpha microprocessor over ten years
• Performance on a few applications, including numerical weather forecasting
• The Terascale Computing System at the Pittsburgh Supercomputing Center
• Marvel: the next (and last) AlphaServer
• Grid Computing

3 Scientific basis for the vector processor choice for the Earth Simulator project
• Comparison of Cray T3D and Cray Y-MP/C90: J.J. Hack et al., “Computational design of the NCAR community climate model”, Parallel Computing 21 (1995)
• Fraction of peak performance achieved:
– 1-7% on Cray T3D
– 30% on Cray Y-MP/C90
• The Cray T3D used the Alpha EV4 processor from 1992

4 Key ratios that determine sustained application performance (U.S. DoD/DoE)

5 Alpha EV6 Architecture (pipeline diagram): stages FETCH, MAP, QUEUE, REG, EXEC, DCACHE; 4 instructions per cycle; 80 in-flight instructions plus 32 loads and 32 stores; 20-entry integer and 15-entry floating-point issue queues; 80-entry integer and 72-entry FP register files; branch predictors; 64KB 2-set L1 instruction and data caches with victim buffer; FP add, multiply, and divide/square-root units.

6 Weather Forecasting Benchmark
• LM = local model, German Weather Service (DWD); current version is RAPS 2.0
• Grid size is 325 × 325 × 35; predefined INPUT set dwd used for all benchmarks (see the sizing sketch below)
• First forecast hour timed (contains more I/O than subsequent forecast hours)
• Machines:
– Cray T3E/1200 (EV5/600 MHz) in Jülich, Germany
– AlphaServer SC40 (EV67/667 MHz) in Marlboro, MA
• Study performed by Pallas GmbH (www.pallas.com)
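To give a feel for the working set behind these runs, here is a minimal sizing sketch. The 8 bytes per grid point per field is an assumption for illustration (the LM code carries many such 3D fields), not a figure from the benchmark report:

```c
#include <stdio.h>

int main(void) {
    long nx = 325, ny = 325, nz = 35;
    long points = nx * ny * nz;                    /* ~3.7 million grid points        */
    double mb_per_field = points * 8.0 / 1.0e6;    /* assuming 8-byte reals per point */

    printf("grid points : %ld\n", points);         /* 3,696,875                       */
    printf("MB per field: %.1f\n", mb_per_field);  /* ~29.6 MB per 3D field           */
    return 0;
}
```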

7 Total time (AS SC40 vs. Cray T3E)

8 Performance comparisons
• The Alpha EV67/667 MHz in the AlphaServer SC40 delivers about 3 times the performance of the EV5/600 MHz in the Cray T3E on the LM application
• EV5 is running at about 6.7% of peak
• EV67 is running at about 18.5% of peak
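Those percent-of-peak figures are consistent with the roughly 3x claim. A back-of-the-envelope check, assuming for illustration that both the EV5 and the EV67 retire two floating-point results per cycle (one add plus one multiply), so peak is roughly twice the clock rate:

```c
#include <stdio.h>

int main(void) {
    /* Assumed peak rates: 2 flops/cycle (FP add + FP multiply pipelines). */
    double peak_ev5  = 2.0 * 600e6;     /* EV5  at 600 MHz -> 1.20 GFLOPS peak */
    double peak_ev67 = 2.0 * 667e6;     /* EV67 at 667 MHz -> 1.33 GFLOPS peak */

    /* Sustained rates implied by the percent-of-peak numbers on this slide. */
    double sustained_ev5  = 0.067 * peak_ev5;    /* ~0.080 GFLOPS */
    double sustained_ev67 = 0.185 * peak_ev67;   /* ~0.247 GFLOPS */

    printf("implied speedup: %.2fx\n", sustained_ev67 / sustained_ev5);  /* ~3.1x */
    return 0;
}
```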

9 Compilation Times
• Cray T3E
– Flags: -O3 -O aggress,unroll2,split1,pipeline2
– Compilation time: 41 min 37 sec
• Compaq EV6/500 MHz (EV67 is faster)
– Flags: -fast -O4
– Compilation time: 5 min 15 sec
• IBM SP3
– Flags: -O4 -qmaxmem=-1
– Compilation time: 40 min 19 sec
• Note: numeric_utilities.f90 had to be compiled with -O3 in order to avoid crashes

10 SWEEP3D
• 3D discrete ordinates (Sn) neutron transport
• Implicit wavefront algorithm (sketched below)
– Convergence to stable solution
• Target system: multitasked PVP / MPP
– Vector-style code
– High ratio of loads/stores to flops
• Sensitive to memory bandwidth and latency
• Performance is sensitive to grid size
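SWEEP3D itself is a 3D Fortran code; the minimal 2D C sketch below (a hypothetical toy update, not the real physics) only illustrates the wavefront dependence pattern: each cell needs its upstream neighbours, so cells on the same diagonal are independent and the sweep proceeds diagonal by diagonal.

```c
#include <stdio.h>
#define NI 4
#define NJ 4

int main(void) {
    double phi[NI][NJ] = {{0}};

    /* Inflow boundary values on the i = 0 and j = 0 faces. */
    for (int i = 0; i < NI; i++) phi[i][0] = 1.0;
    for (int j = 0; j < NJ; j++) phi[0][j] = 1.0;

    /* Wavefront order: every cell on diagonal d = i + j depends only on
     * cells of diagonal d - 1, so cells within one diagonal are independent. */
    for (int d = 2; d <= (NI - 1) + (NJ - 1); d++)
        for (int i = 1; i < NI; i++) {
            int j = d - i;
            if (j >= 1 && j < NJ)
                phi[i][j] = 0.5 * (phi[i - 1][j] + phi[i][j - 1]);  /* toy update */
        }

    printf("phi[NI-1][NJ-1] = %g\n", phi[NI - 1][NJ - 1]);
    return 0;
}
```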

11 SWEEP3D “as is” Performance

12 Optimizations to SWEEP3D
• Fuse inner loops (see the sketch below)
– Demote temporary vectors to scalars
– Reduce load/store count
• Separate loops with explicit values for “i2” = -1, 1
– Allows prefetch code to be generated
• Fixup code moved “outside” loop
– Loop unrolling, pipelining
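SWEEP3D is Fortran, so the following C sketch with hypothetical arrays is only an illustration of the first optimization above (loop fusion plus demoting a temporary vector to a scalar), not the actual SWEEP3D source:

```c
#include <stdio.h>
#define N 8

/* Hypothetical arrays for illustration only. */
static double a[N], w[N], b[N], flux[N], tmp[N];

/* Before: two inner loops communicating through a temporary vector tmp[]. */
static void sweep_before(void) {
    for (int i = 0; i < N; i++)
        tmp[i] = a[i] * w[i];
    for (int i = 0; i < N; i++)
        flux[i] += tmp[i] + b[i];
}

/* After: loops fused and tmp demoted to a scalar kept in a register,
 * cutting the load/store count of the inner loop.                        */
static void sweep_after(void) {
    for (int i = 0; i < N; i++) {
        double t = a[i] * w[i];   /* former tmp[i] */
        flux[i] += t + b[i];
    }
}

int main(void) {
    for (int i = 0; i < N; i++) { a[i] = i; w[i] = 2.0; b[i] = 1.0; flux[i] = 0.0; }
    sweep_before();
    double ref = flux[N - 1];

    for (int i = 0; i < N; i++) flux[i] = 0.0;
    sweep_after();                /* same numerical result, fewer memory operations */

    printf("before %g, after %g\n", ref, flux[N - 1]);
    return 0;
}
```

The same idea, applied per sweep direction with i2 fixed at -1 or +1, yields loops with compile-time-known strides that the compiler can prefetch, unroll, and pipeline.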

13 Instruction counts/iteration (+ measured cycles on EV6)

14 Optimized SWEEP3D Performance

15 AlphaServer ES45 (EV68/1.001 GHz) block diagram: Alpha 21264 CPUs with L2 caches connected through a crossbar switch (Typhoon chipset with quad controller and data slices) to SDRAM memory at 133 MHz (128 MB to 32 GB, banks 0-3) and to the I/O subsystem (64-bit PCI buses plus 4x AGP); key link bandwidths shown include 8.0 GB/s, 4.2 GB/s to memory, and 256-512 MB/s per PCI bus.

16 Pittsburgh Supercomputing Center (PSC)
• Cooperative effort of:
– Carnegie Mellon University
– University of Pittsburgh
– Westinghouse Electric
• Offices in Mellon Institute
– On the CMU campus
– Adjacent to the UofP campus

17 Westinghouse Electric
• Energy Center, Monroeville PA
• Major computing systems
• High-speed network connections

18 Terascale Computing System at Pittsburgh Supercomputing Center
• Sponsored by the U.S. National Science Foundation
• Integrated into the PACI program (Partnerships for Advanced Computational Infrastructure)
• Serving the “very high end” for academic computational science and engineering
• The largest open facility in the world
• PSC in collaboration with Compaq and with:
– Application scientists and engineers
– Applied mathematicians
– Computer scientists
– Facilities staff
• Compaq AlphaServer SC technology

19 System Block Diagram
• 3040 CPUs
• Tru64 UNIX
• 3 TB memory
• 41 TB disk
• 152 CPU cabinets
• 20 switch cabinets
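A quick sanity check of these headline numbers. The 4 CPUs per ES45 node and 2 flops per cycle per CPU are assumptions used for illustration (the 4-way figure is consistent with the 760-node count quoted on slide 27):

```c
#include <stdio.h>

int main(void) {
    int cpus = 3040;
    int cpus_per_node = 4;            /* assumed: ES45 is a 4-processor node    */
    double clock_hz = 1.0e9;          /* EV68 at ~1 GHz (1.001 GHz on slide 15) */
    double flops_per_cycle = 2.0;     /* assumed: one FP add + one FP multiply  */

    printf("nodes          : %d\n", cpus / cpus_per_node);                       /* 760  */
    printf("peak TFLOPS    : %.1f\n", cpus * clock_hz * flops_per_cycle / 1e12); /* ~6.1 */
    printf("memory per node: %.1f GB\n", 3.0 * 1024 / (cpus / cpus_per_node));   /* ~4.0 */
    return 0;
}
```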

20 ES45 nodes
– 5 per cabinet
– 3 local disks

21 Row upon row…

22 Quadrics Switches
• Rail 1 and Rail 0

23 Middle Aisle, Switches in Center

24 QSW switch chassis
• Fully wired switch chassis
• 1 of 42

25 Control nodes and concentrators

26 The Front Row

27 Installation: from 0 to TFLOPS in 29 days (latest: TFLOPS on 3024 CPUs)
• Deliveries & continual integration:
– 44 nodes arrived at PSC on Saturday,
– 50 nodes arrived on Friday,
– 30 nodes arrived on Saturday,
– 50 nodes arrived on Monday,
– 180 nodes arrived on Wednesday,
– 130 nodes arrived on Sunday,
– 180 nodes arrived on Thursday,
• To have shipped 12 September!
• Federated switch cabled/operational by
• 760 nodes clustered by
• TFLOPS Linpack by
• TFLOPS in Dongarra’s list dated Mon Oct 22 (67% of peak performance)

28 MM5

29

30 Alpha Microprocessor Summary
• EV6
– 0.35 µm, 600 MHz
– 4-wide superscalar
– Out-of-order execution
– High memory bandwidth
• EV67
– 0.25 µm, up to 750 MHz
• EV68
– 0.18 µm, 1000+ MHz
• EV7
– 0.18 µm, 1250 MHz
– L2 cache on-chip
– Memory control on-chip
– I/O control on-chip
– Cache-coherent inter-processor communication on-chip
• EV79
– 0.13 µm, ~1600 MHz

31 EV7 – The System is the Silicon…
• EV68 core with enhancements
• Integrated L2 cache
– 1.75 MB (ECC)
– 20 GB/s cache bandwidth
• Integrated memory controllers
– Direct Rambus (ECC)
– 12.8 GB/s memory bandwidth
– Optional RAID in memory
• Integrated network interface
– Direct processor-processor interconnects
– 4 links, GB/s aggregate bandwidth
– ECC (single error correct, double error detect)
– 3.2 GB/s I/O interface per processor
• SMP CPU interconnect used to be external logic… now it’s on the chip

32 Alpha EV7

33 EV7 – The System is the Silicon… The electronics for cache-coherent communication are placed within the EV7 chip.

34 Alpha EV7 Core (pipeline diagram): the EV6 pipeline (FETCH, MAP, QUEUE, REG, EXEC, DCACHE stages; 4 instructions per cycle; 80 in-flight instructions plus 32 loads and 32 stores; 64KB 2-set L1 instruction and data caches with victim buffer) plus a 1.75 MB 7-set L2 cache on-chip.

35 Virtual Page Size
• Current virtual page sizes
– 8K
– 64K
– 512K
– 4M
• New virtual page sizes (boot time selection)
– 64K
– 2M
– 64M
– 512M
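Larger pages matter because they multiply the amount of memory the TLB can map without misses. A minimal sketch of TLB reach, assuming for illustration a 128-entry data TLB (the entry count is an assumption, not a figure from this slide):

```c
#include <stdio.h>

int main(void) {
    long tlb_entries = 128;                  /* assumed DTLB entry count, illustration only */
    long page_kb[] = {8, 64, 4096, 524288};  /* 8K, 64K, 4M (current), 512M (new)           */

    for (int i = 0; i < 4; i++) {
        double reach_mb = (double)tlb_entries * page_kb[i] / 1024.0;
        printf("page %7ldK -> TLB reach %9.1f MB\n", page_kb[i], reach_mb);
    }
    return 0;   /* reach grows from 1 MB with 8K pages to 64 GB with 512M pages */
}
```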

36 Performance
• SPEC95
– SPECint95: 75
– SPECfp95: 160
• SPEC2000
– CINT
– CFP
• 59% higher than EV68/1GHz

37 Building Block Approach to System Design
• Key components: EV7 processor, IO7 I/O interface, dual-processor module
• Systems grow by adding:
– Processors
– Memory
– I/O

38

39

40

41 Two complementary views of the Grid
• The hierarchy of understanding:
– Data are uninterpreted signals
– Information is data equipped with meaning
– Knowledge is information applied in practice to accomplish a task
– The Internet is about information
– The Grid is about knowledge
(Tony Hey, Director, UK eScience Core Program)
• Main technologies developed by man:
– Writing captures knowledge
– Mathematics enables rigorous understanding, prediction
– Computing enables prediction of complex phenomena
– The Grid enables intentional design of complex systems
(Rick Stevens, ANL)

42 What is the Grid?
• “A computational grid is a hardware and software infrastructure that provides dependable, consistent, pervasive, and inexpensive access to high-end computing capabilities.”
– Ian Foster and Carl Kesselman, editors, “The GRID: Blueprint for a New Computing Infrastructure” (Morgan-Kaufmann Publishers, SF, 1999), 677 pp., ISBN
• The Grid is an infrastructure to enable virtual communities to share distributed resources to pursue common goals
• The Grid infrastructure consists of protocols, application programming interfaces, and software development kits to provide authentication, authorization, and resource location and access
– Foster, Kesselman, Tuecke: “The Anatomy of the Grid: Enabling Scalable Virtual Organizations”

43 Compaq and the Grid
• Sponsor of the Global Grid Forum (www.globalgridforum.org)
• Founding member of the New Productivity Initiative for Distributed Resource Management (www.newproductivity.org)
• Industrial member of the GridLab consortium (www.gridlab.org)
– 20 leading European and US institutions
– Infrastructure, applications, testbed
– Cactus “worm” demo at SC2001 (www.cactuscode.org)
• Intra-Grid within the Compaq firewall
– Nodes in Annecy, Galway, Nashua, Marlboro, Tokyo
– Globus, Cactus, GridLab infrastructure and applications
– iPAQ Pocket PC (www.ipaqlinux.com)

44 Potential dangers for the Grid
• Solution in search of a problem
• Shell game for cheap (free) computing
• Plethora of unsupported, incompatible, non-standard tools and interfaces

45 “Big Science”
• As with the Internet, scientific computing will be the first to benefit from the Grid. Examples:
– GriPhyN (US Grid Physics Network for Data-intensive Science): elementary particle physics, gravitational wave astronomy, optical astronomy (digital sky survey)
– DataGrid (led by CERN): analysis of data from scientific exploration
– There are also compute-intensive applications that can benefit from the Grid

46 Final Thoughts: all this will not be easy
• How good have we been as a community at making parallel computing easy and transparent?
• There are still some things we can’t do:
– Predict the El Niño phenomenon correctly
– Plate tectonics and Earth mantle convection
– Failure mechanisms in new materials
• Validation and verification of numerical simulation are crying needs

47 Thank You! Please visit our HPTC Web Site

48

49 Stability & Continuity for AlphaServer customers
• Commitment to continue implementing the Alpha roadmap according to the current plan-of-record:
– EV68, EV7 & EV79
– Marvel systems
– Tru64 UNIX support
– AlphaServer systems running Tru64 UNIX will be sold as long as customers demand, at least several years after EV79 systems arrive in 2004, with support continuing for a minimum of 5 years beyond that

50 Microprocessor and System Roadmaps (diagram). Alpha processors: EV68, EV7, EV79, then a next-generation Alpha processor. AlphaServer systems: DS (1-2P), ES (1-4P), GS, the EV68 product family, and the EV7/EV79 families at 2-8P (2P building block) and 8-64P (8P building block). Itanium™ Processor Family: Itanium, McKinley, Madison, with ProLiant servers (Itanium 1-32P, McKinley family) and a next-generation server family (8-64P, blades, 2P, 4P, 8P).

51 The New HP
• Chairman and CEO: Carly Fiorina
• President: Michael Capellas
• Imaging and Printing ($20B): Vyomesh Joshi
• Access Devices ($29B): Duane Zitzner
• IT Infrastructure ($23B): Peter Blackmore
• Services ($15B): Ann Livermore

