Manchester Computing – Supercomputing, Visualization & eScience
Zoe Chaplin, 11 September 2003, CAS2K3
Comparison of the Unified Model Version 5.3 on Various Platforms

Contents
- Why am I talking about the UM?
- The Platforms
- The Model Version
- Results
  – Global
  – Mesoscale
  – Discussion about the IBM
- Conclusions and Further Work

Why am I talking about the UM?

Why am I talking about the UM?
- SVE consists of CSAR (HPC group), MVC (visualisation) and eScience
- CSAR has a 512-processor Origin 3000 and an 816-processor Cray T3E, and is getting a 256-processor Altix (Itanium)
- Many academic users of the UM throughout the UK use the CSAR service (UGAMP – the UK Universities Global Atmospheric Modelling Programme)
- Links between Manchester Computing and the Met Office

The Platforms

The Platforms
- At CSAR
  – 'Green': Origin 3000, 400 MHz MIPS R12000 processors, 1 GB memory/processor, i.e. 512 GB memory in total
  – 'Turing': Cray T3E, 256 MB memory/processor, i.e. 209 GB memory in total
- At ECMWF
  – IBM p690: 2x30 p690 nodes, 16 POWER4 chips per node (the chips are dual-core, giving 32 processors/node). Each node is divided into 4 LPARs; most p690 nodes have 32 GB memory, 2x3 have 128 GB. An SP Switch2 connects the LPARs.

The Model Version

The Model Version
- All experiments performed at version 5.3, 'the New Dynamics'
- Semi-Lagrangian dynamics
- Semi-implicit physics
- Non-hydrostatic
- Arakawa C-grid in the horizontal
- Charney-Phillips grid in the vertical
- Must have an even number of processors in the x direction (see the sketch below)
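As a quick illustration of the decomposition constraint above, the following sketch (not part of the original presentation; the function name is illustrative) lists the (east-west, north-south) processor layouts that satisfy the even east-west requirement for a given processor count:

```python
def valid_decompositions(total_procs):
    """Enumerate (nproc_x, nproc_y) processor layouts for a given count.

    The UM 5.3 'New Dynamics' runs described here require an even number
    of processors in the x (east-west) direction, so only even nproc_x
    values are considered.  Illustrative sketch, not code from the UM.
    """
    layouts = []
    for nproc_x in range(2, total_procs + 1, 2):   # even east-west counts only
        if total_procs % nproc_x == 0:
            layouts.append((nproc_x, total_procs // nproc_x))
    return layouts

# Example: the 36-processor count used in the IBM discussion later (e.g. 4x9)
print(valid_decompositions(36))
# -> [(2, 18), (4, 9), (6, 6), (12, 3), (18, 2), (36, 1)]
```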

Global Runs
- N216, i.e. 432x325 points in the horizontal
- 38 vertical levels
- 20 minute timestep
- Simulation run for 1 day, i.e. 72 timesteps
- 5 point halos in both directions (see the decomposition sketch below)
- Up to 256 processors used (144 on the IBM)
- Fast solver used
- MPI used for communication
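To make the halo cost concrete, here is a rough sketch (illustrative only; it assumes a simple block decomposition with near-equal splits, which differs from the UM's actual decomposition in detail) of the local patch each MPI task would hold on the N216 grid, including the 5-point halos:

```python
def local_domain(nx, ny, nproc_x, nproc_y, halo=5):
    """Approximate size of one MPI task's patch of an nx x ny grid.

    Adds a fixed-width halo on all four sides of a near-equal block
    split; treat this purely as an illustration of why 5-point halos
    matter at high processor counts.
    """
    core_x = -(-nx // nproc_x)          # ceiling division
    core_y = -(-ny // nproc_y)
    return (core_x, core_y), (core_x + 2 * halo, core_y + 2 * halo)

# N216 global grid (432 x 325) on a 4 x 16 = 64 processor configuration
core, with_halo = local_domain(432, 325, 4, 16)
print(core, with_halo)
# -> (108, 21) (118, 31): the halo cells are a large fraction of each patch
```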

Mesoscale Runs
- The mesoscale domain (Mes) covers the UK, parts of Northern Europe and Scandinavia
- 146x182 points in the horizontal
- 38 vertical levels
- 5 minute timestep
- Simulation run for 36 hours, i.e. 432 timesteps

Mesoscale Runs (continued)
- 5 point halos in both directions
- 8 points for merging the lateral boundary conditions (LBCs) with the main field
- Up to 120 processors used
- Limited to a maximum of 10 processors east-west and 13 processors north-south (see the sketch below)
- MPI used for communication
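A small sketch (again illustrative, with our own function name) that combines the mesoscale limits above with the even east-west constraint to enumerate the allowed processor configurations:

```python
def mesoscale_configs(max_ew=10, max_ns=13):
    """Enumerate processor configurations allowed for the mesoscale runs.

    Combines the limits quoted on this slide (at most 10 processors
    east-west and 13 north-south) with the requirement for an even
    processor count east-west.  Illustrative sketch only.
    """
    return [(ew, ns, ew * ns)
            for ew in range(2, max_ew + 1, 2)      # even east-west counts
            for ns in range(1, max_ns + 1)]

largest = max(mesoscale_configs(), key=lambda c: c[2])
print(largest)   # -> (10, 13, 130), although at most 120 processors were used here
```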

Global Results

Global Results
- The T3E required a minimum of 24 processors, so comparisons are against this value (see the speedup sketch below)
- At lower processor numbers, the Origin 3000 proves to be the most scalable
- Above ~156 processors, the T3E overtakes the Origin
- The IBM is the least scalable, but was only run up to 144 processors (normally using < 8 CPUs/LPAR – discussed later)
- The curve showing IBM results by LPAR shows scalability tailing off above 4 LPARs
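The scalability comparisons above are relative to the 24-processor T3E baseline. The sketch below shows how such speedup and efficiency figures are derived; the timings in the example are placeholders, since the measured values appear only in the charts and timing tables, which were not preserved in this transcript:

```python
def scalability(timings, baseline_procs=24):
    """Speedup and parallel efficiency relative to a baseline run.

    `timings` maps processor count -> wall-clock time.  The global runs
    are compared against the 24-processor T3E minimum; the numbers used
    below are placeholders for illustration, not the measured values.
    """
    t_base = timings[baseline_procs]
    results = {}
    for procs, t in sorted(timings.items()):
        speedup = t_base / t
        efficiency = speedup * baseline_procs / procs
        results[procs] = (round(speedup, 2), round(efficiency, 2))
    return results

# Placeholder timings (seconds) purely to show the calculation:
example = {24: 1000.0, 48: 520.0, 96: 280.0, 144: 200.0}
print(scalability(example))
# -> {24: (1.0, 1.0), 48: (1.92, 0.96), 96: (3.57, 0.89), 144: (5.0, 0.83)}
```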

Global Results (chart)

Global Results (chart)

General Comments on the Global Results
- Up to 144 processors, normally better to use 4 processors in the east-west direction
- Below 32 processors on the IBM, results are more varied – sometimes 2 is better
- Origin between 1.33 and 1.63 times faster than the T3E
- IBM between 2.36 and 3.07 times faster than the T3E

Global Results (chart)

Some Timings for the Global Model
(Table: number of processors vs. run time on the T3E, Origin and IBM – the numeric values did not survive in this transcript.)

Mesoscale Results

Mesoscale Results
- The T3E required a minimum of 8 processors, so comparisons are against this value
- Up to 48 processors the IBM outperforms the other two machines (using < 8 CPUs/LPAR – discussed later)
- Above ~64 processors, the T3E has the greatest scalability
- The curve showing IBM results by LPAR shows scalability tailing off above 4 LPARs (using all 8 CPUs/LPAR)

Mesoscale Results (chart)

Mesoscale Results (chart)

General Comments on the Mesoscale Results
- For the T3E, better to use fewer processors in the east-west direction
- For the Origin, below 36 processors use 2 or 4 processors in the east-west direction; for 36 processors and above use 6 or even 8
- For the IBM, below 24 processors use more processors in the north-south direction than the east-west; for 24 and above, reverse this

General Comments on the Mesoscale Results (continued)
- Origin between 1.30 and 1.65 times faster than the T3E
- IBM between 1.93 and 3.82 times faster than the T3E
- The dip in the IBM results at 64 processors is due to having to use 8 processors/LPAR rather than 6 (discussed later)

Mesoscale Results (chart)

Some Timings for the Mesoscale Model
(Table: number of processors vs. run time on the T3E, Origin and IBM – the numeric values did not survive in this transcript.)

Discussion About the IBM

Discussion about the IBM
- For a given processor configuration, results indicate that it is better to use < 8 tasks/LPAR
- If you are charged by the number of LPARs used, then it is faster to use all the processors on each LPAR (i.e. increase the configuration)
- E.g. for a 4x9 configuration, use 6 LPARs (i.e. 6 tasks/LPAR)
- However, if you are using 6 LPARs, the run will be faster with a 4x12 configuration – but 4x12 will produce faster results on 8 LPARs rather than 6 (see the sketch below)
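The LPAR trade-off can be made concrete with a small sketch (our own helper, based on the 8-CPU LPAR size described earlier; not code used in the study):

```python
import math

def lpar_usage(nproc_x, nproc_y, tasks_per_lpar, cpus_per_lpar=8):
    """How many LPARs a given processor configuration occupies.

    Each p690 node here is split into 4 LPARs of 8 CPUs.  Running fewer
    than 8 MPI tasks per LPAR leaves CPUs idle but, per the slide, gives
    a faster run for a fixed configuration.  Assumes
    tasks_per_lpar <= cpus_per_lpar; names are illustrative.
    """
    tasks = nproc_x * nproc_y
    lpars = math.ceil(tasks / tasks_per_lpar)
    idle_cpus = lpars * cpus_per_lpar - tasks
    return {"tasks": tasks, "lpars": lpars, "idle_cpus": idle_cpus}

# The slide's 36-processor example: 4x9 spread over 6 LPARs at 6 tasks/LPAR...
print(lpar_usage(4, 9, tasks_per_lpar=6))   # {'tasks': 36, 'lpars': 6, 'idle_cpus': 12}
# ...whereas filling those same 6 LPARs completely gives a 4x12 configuration.
print(lpar_usage(4, 12, tasks_per_lpar=8))  # {'tasks': 48, 'lpars': 6, 'idle_cpus': 0}
```

Under per-LPAR charging both runs occupy the same six LPARs, which is why the slide suggests filling them with the larger 4x12 configuration rather than running 4x9 at 6 tasks/LPAR.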

Discussion about the IBM – Timings from the Global Model for the IBM
(Table with columns: No. of Processors, Processor Configuration, No. of LPARs, Tasks/LPAR, Timing – the row values did not survive in this transcript.)

Conclusions and Further Work

Conclusions
- The amount of time spent optimising the UM for the T3E is clearly reflected in the results
- Further work is needed to optimise the code adequately for the Origin and the IBM
- The best processor configuration for the IBM may depend on the charging mechanism

Conclusions (continued)
- For a given configuration, using < 8 tasks/LPAR will produce faster results
- On all machines, it is generally better to use as few processors in the east-west direction as possible

Further Work
- Optimisations for the Origin and the p690
- Perform similar tests on the SGI Altix (Itanium)

Manchester Computing – Supercomputing, Visualization & eScience
World Leading Supercomputing Service, Support and Research – Bringing Science and Supercomputers Together
Thanks to ECMWF