ECMWF Slide 1 Introduction to Parallel Computing George Mozdzynski March 2004

ECMWF Slide 2 Outline
- What is parallel computing?
- Why do we need it?
- Types of computer
- Parallel Computing today
- Parallel Programming Languages
- OpenMP and Message Passing
- Terminology

ECMWF Slide 3 What is Parallel Computing? The simultaneous use of more than one processor or computer to solve a problem

ECMWF Slide 4 Why do we need Parallel Computing?
- Serial computing is too slow
- Need for large amounts of memory not accessible by a single processor

ECMWF Slide 5 An IFS operational TL511L60 forecast model takes about one hour wall time for a 10-day forecast using 288 CPUs of our IBM Cluster 1600 system (total 1920 CPUs). How long would this model take using a fast PC with sufficient memory, e.g. a 3.2 GHz Pentium 4?

ECMWF Slide 6 Ans. About 8 days. This PC would need about 25 Gbytes of memory. 8 days is too long for a 10-day forecast! Even 2-3 hours is too long …
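
A rough back-of-envelope reading of those numbers (my reconstruction, not on the slides), ignoring all parallel overheads:

  \[
    288~\text{CPUs} \times 1~\text{h} = 288~\text{CPU-hours} \approx 12~\text{days},
    \qquad
    \frac{288~\text{h}}{8~\text{days} \times 24~\text{h/day}} = 1.5
  \]

so an answer of about 8 days amounts to assuming the single 3.2 GHz PC is roughly 1.5 times faster than one CPU of the cluster.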

ECMWF Slide 7 Amdahl's Law (named after Gene Amdahl): Wall Time = S + P/N_CPUs
IFS Forecast Model (TL511L60): Serial = 574 secs, Parallel = secs (calculated using Excel's LINEST function). (Plot: wall time against number of CPUs.)
If F is the fraction of a calculation that is sequential, and (1-F) is the fraction that can be parallelised, then the maximum speedup that can be achieved by using N processors is 1/(F + (1-F)/N).
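
The two formulas on this slide can be evaluated directly. Below is a minimal Fortran sketch (my illustration, not from the slides): the serial component S = 574 s is taken from the slide, while the parallel component P is an arbitrary placeholder because its fitted value did not survive transcription.

  program amdahl
    implicit none
    real    :: s, p, f, wall, speedup
    integer :: k, n

    s = 574.0       ! serial component S in seconds (from the slide)
    p = 120000.0    ! parallel component P in seconds (placeholder value)
    f = s / (s + p) ! sequential fraction F of the total work

    do k = 0, 9
       n = 2**k                                   ! 1, 2, 4, ..., 512 CPUs
       wall    = s + p / real(n)                  ! wall time = S + P/N_CPUs
       speedup = 1.0 / (f + (1.0 - f) / real(n))  ! maximum speedup on N CPUs
       print '(a,i4,a,f10.1,a,f8.2)', ' CPUs =', n, '   wall(s) =', wall, '   speedup =', speedup
    end do
  end program amdahl

Even with these placeholder numbers the speedup flattens out well before 1000 CPUs, which is the behaviour slide 10 refers to.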

ECMWF Slide 8 IFS Forecast Model (TL511L60): table of wall time, speed-up and efficiency against number of CPUs.

ECMWF Slide 9

ECMWF Slide 10 IFS model would be inefficient on large numbers of CPUs. But OK up to 512.

ECMWF Slide 11 Types of Parallel Computer (P=Processor, M=Memory, S=Switch). Shared memory: several processors connected to one common memory. Distributed memory: each processor has its own local memory, and the processors are connected by a switch.

ECMWF Slide 12 IBM Cluster 1600 (at ECMWF) (P=Processor, M=Memory, S=Switch). Each node is a set of processors sharing that node's memory; the nodes are connected by a switch.

ECMWF Slide 13 IBM Cluster 1600s at ECMWF (hpca + hpcb)

ECMWF Slide 14 ECMWF supercomputers
1979 CRAY 1A: Vector
CRAY XMP-2, CRAY XMP-4, CRAY YMP-8, CRAY C90-16: Vector + Shared Memory Parallel
Fujitsu VPP700, Fujitsu VPP5000: Vector + MPI Parallel
IBM p690: Scalar + MPI + Shared Memory Parallel

ECMWF Slide 15 ECMWF's first supercomputer: CRAY-1A, 1979

ECMWF Slide 16 Where have 25 years gone?

ECMWF Slide 17 Types of Processor

  DO J=1,1000
    A(J)=B(J) + C
  ENDDO

SCALAR PROCESSOR (single instruction processes one element):
  LOAD B(J)
  FADD C
  STORE A(J)
  INCR J
  TEST

VECTOR PROCESSOR (single instruction processes many elements):
  LOADV B -> V1
  FADDV V1,C -> V2
  STOREV V2 -> A

ECMWF Slide 18 Parallel Computing Today
Vector Systems
- NEC SX6
- CRAY X-1
- Fujitsu VPP5000
Scalar Systems
- IBM Cluster 1600
- Fujitsu PRIMEPOWER HPC2500
- HP Integrity rx2600 Itanium2
Cluster Systems (typically installed by an integrator)
- Virginia Tech, Apple G5 / InfiniBand
- NCSA, Dell PowerEdge 1750, P4 Xeon / Myrinet
- LLNL, MCR Linux Cluster Xeon / Quadrics
- LANL, Linux Networx AMD Opteron / Myrinet

ECMWF Slide 19 The TOP500 project
- Started in 1993
- Top 500 sites reported
- Report produced twice a year: EUROPE in JUNE, USA in NOV
- Performance based on the LINPACK benchmark

ECMWF Slide 20 Top 500 Supercomputers

ECMWF Slide 21 Where is ECMWF in the Top 500?
Rmax: Gflop/sec using the Linpack benchmark
Rpeak: peak hardware Gflop/sec (that will never be reached!)

ECMWF Slide 22 What performance do Meteorological Applications achieve?
Vector computers
- About 30 to 50 percent of peak performance
- Relatively more expensive
- Also have front-end scalar nodes
Scalar computers
- About 5 to 10 percent of peak performance
- Relatively less expensive
Both vector and scalar computers are being used in Met Centres around the world.
Is it harder to parallelize than vectorize?
- Vectorization is mainly a compiler responsibility
- Parallelization is mainly the user's responsibility

ECMWF Slide 23 Overview of Recent Supercomputers Aad J. van der Steen and Jack J. Dongarra

ECMWF Slide 24

ECMWF Slide 25

ECMWF Slide 26 Parallel Programming Languages?
High Performance Fortran (HPF)
- directive based extension to Fortran
- works on both shared and distributed memory systems
- not widely used (more popular in Japan?)
- not suited to applications using irregular grids
OpenMP
- directive based
- support for Fortran 90/95 and C/C++
- shared memory programming only
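
A minimal OpenMP sketch in Fortran 90 (my illustration, not part of the original slides): the directive distributes the iterations of the loop from slide 17 across the threads of one shared-memory node. It is compiled with the compiler's OpenMP option (for example -fopenmp with gfortran).

  program openmp_sketch
    use omp_lib
    implicit none
    integer, parameter :: n = 1000000
    real, allocatable  :: a(:), b(:)
    real    :: c
    integer :: j

    allocate(a(n), b(n))
    b = 1.0
    c = 2.0

    ! The loop of slide 17, with its iterations shared among the threads
  !$OMP PARALLEL DO PRIVATE(j) SHARED(a, b, c)
    do j = 1, n
       a(j) = b(j) + c
    end do
  !$OMP END PARALLEL DO

    print *, 'max threads =', omp_get_max_threads(), '  a(1) =', a(1)
  end program openmp_sketch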

ECMWF Slide 27 Most Parallel Programmers use…
Fortran 90/95, C/C++ with MPI for communicating between tasks (processes)
- works for applications running on shared and distributed memory systems
Fortran 90/95, C/C++ with OpenMP
- for applications whose performance needs are satisfied by a single node (shared memory)
Hybrid combination of MPI/OpenMP
- ECMWF's IFS uses this approach
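
A minimal hybrid MPI + OpenMP sketch in Fortran (my illustration of the general approach, not ECMWF IFS code): MPI tasks communicate between nodes, while OpenMP threads share memory within each node.

  program hybrid_hello
    use mpi
    use omp_lib
    implicit none
    integer :: ierr, rank, nranks, provided

    ! Ask for a thread support level that allows OpenMP regions between MPI calls
    call MPI_Init_thread(MPI_THREAD_FUNNELED, provided, ierr)
    call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
    call MPI_Comm_size(MPI_COMM_WORLD, nranks, ierr)

  !$OMP PARALLEL
    print *, 'MPI task', rank, 'of', nranks, &
             ', OpenMP thread', omp_get_thread_num(), 'of', omp_get_num_threads()
  !$OMP END PARALLEL

    call MPI_Finalize(ierr)
  end program hybrid_hello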

ECMWF Slide 28 The myth of automatic parallelization (2 common versions)
Compilers can do anything (but we may have to wait a while)
- Automatic parallelization makes it possible (or will soon make it possible) to port any application to a parallel machine and see wonderful speedups without any modifications to the source
Compilers can't do anything (now or never)
- Automatic parallelization is useless. It'll never work on real code. If you want to port an application to a parallel machine, you have to restructure it extensively. This is a fundamental limitation and will never be overcome

ECMWF Slide 29 Terminology
- Cache, cache line
- NUMA
- False sharing
- Data decomposition
- Halo, halo exchange
- FLOP
- Load imbalance
- Synchronization
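
Several of these terms meet in one very common pattern: the grid is data-decomposed across MPI tasks, each task keeps a halo of points owned by its neighbours, and the halo is refreshed by a halo exchange each time step. A minimal 1-D sketch (my illustration, with hypothetical names; tasks at the domain edge can pass MPI_PROC_NULL as a neighbour):

  subroutine halo_exchange(u, nlocal, left, right, comm)
    use mpi
    implicit none
    integer, intent(in)    :: nlocal, left, right, comm
    real,    intent(inout) :: u(0:nlocal+1)   ! u(0) and u(nlocal+1) are the halo points
    integer :: ierr, status(MPI_STATUS_SIZE)

    ! Send my last interior point to the right, receive my left halo from the left
    call MPI_Sendrecv(u(nlocal), 1, MPI_REAL, right, 0, &
                      u(0),      1, MPI_REAL, left,  0, comm, status, ierr)
    ! Send my first interior point to the left, receive my right halo from the right
    call MPI_Sendrecv(u(1),        1, MPI_REAL, left,  1, &
                      u(nlocal+1), 1, MPI_REAL, right, 1, comm, status, ierr)
  end subroutine halo_exchange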

ECMWF Slide 30 THANK YOU

ECMWF Slide 31 Cache (P=Processor, C=Cache, M=Memory). Left: a single processor with one cache between it and memory. Right: two processors, each with a private level-1 cache (C1), sharing a level-2 cache (C2) in front of memory.

ECMWF Slide 32 IBM node = 8 CPUs + 3 levels of cache ($): each pair of CPUs has private level-1 caches (C1) and shares a level-2 cache (C2); all 8 CPUs share a level-3 cache (C3) in front of memory.

ECMWF Slide 33 Cache is …
- Small and fast memory
- Cache line typically 128 bytes
- Cache line has state (copy, exclusive owner)
- Coherency protocol
- Mapping, sets, ways
- Replacement strategy
- Write through or not
Important for performance:
- Single-stride access is always the best!!!
- Try to avoid writes to the same cache line from different CPUs
But don't lose sleep over this
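
A small Fortran sketch of the stride advice (my illustration): Fortran stores arrays column-major, so making the innermost loop run over the first index touches consecutive memory locations and walks through each cache line once, instead of jumping to a new line on almost every access.

  program stride_demo
    implicit none
    integer, parameter :: n = 2000
    real, allocatable  :: a(:,:)
    integer :: i, j

    allocate(a(n, n))

    ! Cache-friendly: innermost loop over the first index gives stride-1 access
    do j = 1, n
       do i = 1, n
          a(i, j) = real(i + j)
       end do
    end do

    ! Cache-unfriendly: the same work with the loops swapped jumps n elements
    ! between successive accesses, so almost every access touches a new cache line
    do i = 1, n
       do j = 1, n
          a(i, j) = real(i + j)
       end do
    end do

    print *, 'a(n,n) =', a(n, n)
  end program stride_demo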

ECMWF Slide 34 IFS blocking in grid space (IBM p690 / TL159L60): optimal use of cache / subroutine call overhead
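
A schematic of what grid-space blocking looks like (my sketch of the general idea; the block-length name NPROMA follows IFS convention, but the routines below are hypothetical): grid points are processed in blocks small enough for a block's working set to stay in cache, while each block is long enough to amortise the subroutine call overhead.

  subroutine process_gridpoints(ngptot, nproma, field)
    implicit none
    integer, intent(in)    :: ngptot, nproma   ! total grid points, block length
    real,    intent(inout) :: field(ngptot)
    integer :: jkglo, icend

    do jkglo = 1, ngptot, nproma               ! loop over blocks of up to NPROMA points
       icend = min(nproma, ngptot - jkglo + 1) ! number of points in this block
       call physics_block(icend, field(jkglo)) ! one call per block, not one per point
    end do
  end subroutine process_gridpoints

  subroutine physics_block(npts, f)
    implicit none
    integer, intent(in)    :: npts
    real,    intent(inout) :: f(npts)
    integer :: jl
    do jl = 1, npts
       f(jl) = f(jl) + 1.0                     ! stands in for the real per-point computation
    end do
  end subroutine physics_block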