SOS7, 4.3.2003
Slide 1: Is a Grid cost-effective?
Ralf Gruber, EPFL-SIC/FSTI-ISE-LIN, Lausanne

Slide 2: HPC in Europe
- TOP500: 176 systems in Europe; 12 exceed 1 Tflop/s Linpack; the first is CEA-DAM at No. 7
- Germany: 71, UK: 39, France: 22, Italy: 16, others: 28
- Industry: 108 systems, the first (Telecom I) at No. 96; BMW: 11, Daimler-Chrysler: 5, Car F: 6
- Not one big machine, but many smaller ones
- HPC companies: Quadrics; Scali (SCI-based clusters, No. 51); SCS: see Toni's presentation
- Beowulf production: Paralline, Dalco, ...

Slide 3: The Swiss-Tx project
The Swiss-Tx machines (with TNet switch):
- 1998: prototype Swiss-T0 with 16 Alphas
- Swiss-T1 (Baby) with 16 Alphas
- Swiss-T1 with 70 Alphas
Know-how transfer to industry:
- 2001: GeneProt protein-sequencing machine with 1420 Alphas, peak performance = 1780 Gflop/s
- In June 2001 it would have been No. 12 in the TOP500 and 2nd in Europe, and it was the world's No. 1 industrial computer installation
- It would be No. 48 (= C-Plant) in the TOP500 list of November 2002 and is still the No. 2 industrial installation

Slide 4: Is a grid cost-effective? NO!
Reasons:
- For 25 years we have been able to use machines all over the world
- Those who needed good connections installed them (HEPNET, Swissprot, ...)
- Using Java runs counter to HPC

Slide 5: Parallel machines at EPFL and CSCS
EPFL-SIC:
- SGI Origin3800 (500 MHz), 128 processors
- HP Alpha ES45/Quadrics (1.25 GHz), 100 processors
Institutes:
- PC clusters (CFD, Chemistry, Mathematics, Physics)
- IBM SP-2 (EFD)
CSCS:
- NEC SX-5 (16 processors)
- IBM Regatta (256 processors, 1.3 GHz)

Slide 6:
- Parameterisation of: single processor, cluster, application
- Application-tailored Grid scheduling
- Optimal Grid scheduling

Slide 7: Characteristic single-processor parameters V_a and r_a
V_a = Operations (Ops) / Memory accesses (LS)
Examples:
- SAXPY: y = y + a*x; Ops = 2, LS = 3 (2 loads + 1 store), so V_a = 2/3
- Matrix*matrix multiply-and-add: V_a = n/2
r_a = min(R∞, R∞ * V_a / V_m) = min(R∞, M∞ * V_a)
- SAXPY: r_a = (2/3) * M∞
- Matrix*matrix: r_a = R∞
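
To make the bound concrete, here is a minimal Python sketch of r_a = min(R∞, M∞·V_a); the peak values R∞ and M∞ below are assumptions chosen for illustration, not machine data from the talk.

```python
# Sketch of r_a = min(R_inf, M_inf * V_a).
# R_inf and M_inf are assumed values, not figures from the slides.

def achievable_peak(R_inf, M_inf, V_a):
    """r_a: achievable peak of a kernel with operations-per-access ratio V_a."""
    return min(R_inf, M_inf * V_a)

R_inf, M_inf = 3000.0, 1000.0      # assumed: 3 Gflop/s peak, 1 Gword/s memory

V_a_saxpy  = 2.0 / 3.0             # 2 operations per 3 memory accesses
V_a_matmul = 1000 / 2.0            # n/2 with n = 1000 (blocked matrix*matrix)

print(achievable_peak(R_inf, M_inf, V_a_saxpy))    # 666.7 Mflop/s (memory bound)
print(achievable_peak(R_inf, M_inf, V_a_matmul))   # 3000.0 Mflop/s (compute bound)
```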

Slide 8: Results with MATMULT, V_a = 1 (double precision)
V_m = R∞ [Mflop/s] / M∞ [Mword/s]
R∞ [Mflop/s] = theoretical peak performance
M∞ [Mword/s] = theoretical peak memory bandwidth
Table columns: Machine | P | R∞ [Mflop/s] | r_a = M∞·V_a [Mword/s] | V_m | r | %
Machines: NEC SX, Pentium 4 1.5/R, Alpha, Pentium 4 1.7/S, AMD 1.2/S (numeric values missing)
Legend: r = measured performance; % = 100*r/r_a; /S = slow SDRAM memory; /R = fast Rambus (RDRAM) memory
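
Since the numeric table entries did not survive, the hedged sketch below only shows how the V_m and % columns are derived; every number in it is an assumption.

```python
# How the V_m and % columns of the MATMULT table are computed.
# All values are illustrative assumptions, not the slide's measurements.

R_inf = 1600.0     # assumed theoretical peak [Mflop/s]
M_inf = 800.0      # assumed theoretical peak memory bandwidth [Mword/s]
V_a   = 1.0        # MATMULT, double precision (as on the slide)
r     = 700.0      # assumed measured MATMULT performance [Mflop/s]

V_m     = R_inf / M_inf            # machine operations-per-access ratio
r_a     = min(R_inf, M_inf * V_a)  # predicted achievable performance
percent = 100.0 * r / r_a          # last column: fraction of the prediction reached

print(V_m, r_a, percent)           # 2.0  800.0  87.5
```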

Slide 9: Tailoring clusters to applications
Γ > 1

Slide 10: Tailoring clusters to applications
Γ = γ_a / γ_m
Application: γ_a = O / S
Machine: γ_m = r_a / b
O: number of operations, in Flops
S: number of words sent, in Words
r_a: theoretical peak performance of the application, in Mflop/s
b: peak network bandwidth per processor, in Mword/s
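
The Γ test follows directly from these definitions; a small sketch (the helper names are mine):

```python
# Gamma = gamma_a / gamma_m, written out from the definitions above.

def gamma_a(O, S):
    """Application: operations [Flops] per word sent."""
    return O / S

def gamma_m(r_a, b):
    """Machine: achievable peak [Mflop/s] per network bandwidth [Mword/s]."""
    return r_a / b

def Gamma(O, S, r_a, b):
    """Gamma > 1: computation dominates; Gamma < 1: communication dominates (slides 12-13)."""
    return gamma_a(O, S) / gamma_m(r_a, b)
```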

Slide 11: Cluster characterisation
Table: the γ_m values for MATMULT (double precision)
Columns: Machine | P | P*r_a [Mflop/s] | C [Mword/s] | γ_m
Machines: T1 (TNet), P = 32; T1 (Fast Ethernet), P = 32; IELNX (P4+FE) (numeric values missing)
γ_m = P * r_a [Mflop/s] / C [Mword/s]
γ_m = r_a / b, with b = C / P
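
Written out, the two equivalent γ_m expressions look as follows; P, r_a and C are assumed values, with C chosen so that b matches the 10 Mword/s quoted on the next slide.

```python
# gamma_m of a cluster from totals: b = C / P and gamma_m = P * r_a / C.
# P, r_a and C are assumptions; C is picked so that b = 10 Mword/s (slide 12).

P   = 32         # processors
r_a = 1000.0     # per-processor MATMULT performance [Mflop/s]
C   = 320.0      # total network capacity [Mword/s]

b       = C / P            # per-processor bandwidth [Mword/s]  -> 10.0
gamma_m = P * r_a / C      # identical to r_a / b               -> 100.0

assert gamma_m == r_a / b
print(b, gamma_m)
```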

Slide 12: LAUTREC on Swiss-T1 + TNet
Swiss-T1 (TNet): r_a = 1000 Mflop/s, b = 10 Mword/s, so γ_m = 100
Water molecules: γ_a = 5*P*(0.65*N_orb + 4.24*log2(V)) / (3*(P-1))
With P = 8, N_orb = 128, log2(V) = 20: γ_a ≈ 330
Γ = 3.3 (3.6 measured)
-> 25% of the overall time goes to communication, 75% to computation

Slide 13: LAUTREC on Swiss-T1 + Fast Ethernet
Swiss-T1 (FE): r_a = 2000 Mflop/s, b = 1.5 Mword/s, so γ_m = 1333
Water molecules: γ_a = 5*P*(0.65*N_orb + 4.24*log2(V)) / (3*(P-1))
With P = 8, N_orb = 128, log2(V) = 20: γ_a ≈ 330
Γ = 0.25 (0.25 measured)
-> 20% of the overall time goes to computation, 80% to communication
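
The two LAUTREC estimates can be reproduced from the slide's formula; the conversion of Γ into a time split below assumes t_compute/t_communicate ≈ Γ, which is my reading of slides 12-13, not a statement taken from them.

```python
# Reproduces the LAUTREC gamma_a and Gamma values of slides 12-13.
# The time split assumes t_compute / t_communicate ~ Gamma (my interpretation).

def gamma_a_lautrec(P, N_orb, log2V):
    # Formula quoted on the slides for the water-molecule test case.
    return 5 * P * (0.65 * N_orb + 4.24 * log2V) / (3 * (P - 1))

ga = gamma_a_lautrec(P=8, N_orb=128, log2V=20)     # 320; the slide quotes ~330

for name, r_a, b in [("TNet", 1000.0, 10.0), ("Fast Ethernet", 2000.0, 1.5)]:
    gm = r_a / b
    G = ga / gm
    comp = G / (1.0 + G)                           # fraction of time spent computing
    print(f"{name}: Gamma = {G:.2f}, ~{100*comp:.0f}% computation")

# TNet:          Gamma ~ 3.2, ~76% computation  (slide: 3.3, 75%)
# Fast Ethernet: Gamma ~ 0.24, ~19% computation (slide: 0.25, 20%)
```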

Slide 14: LAUTREC, effect of latency
- TNet/Swiss-T1: L = 13 µs MPI latency, b = 80 MB/s; break-even message length beml = L*b ≈ 1000 B
- Fast Ethernet: L = 100 µs MPI latency, b = 10 MB/s; break-even message length beml = L*b = 1000 B
- Average message length in LAUTREC: aml = …*V / (16*P^2)
- For the test case (V = 96**3, P = 8): aml = 40 kB >> beml
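
A quick numerical check of the break-even lengths quoted above, using only the latency and bandwidth figures from the slide:

```python
# Break-even message length beml = L * b, and the LAUTREC comparison.

def beml_bytes(latency_s, bandwidth_bytes_per_s):
    """Message length at which transfer time equals the MPI latency."""
    return latency_s * bandwidth_bytes_per_s

print(beml_bytes(13e-6, 80e6))     # TNet: 1040 B, rounded to ~1000 B on the slide
print(beml_bytes(100e-6, 10e6))    # Fast Ethernet: 1000 B

aml = 40e3                         # average LAUTREC message length from the slide
print(aml / beml_bytes(100e-6, 10e6))   # ~40: messages are far above break-even,
                                        # so latency is not the limiting factor here
```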

Slide 15: Point-to-point applications
γ_a = Operations (O) / Sends (S)
FE/FV: O ∝ number of volume nodes, ∝ square of the number of variables per node, ∝ number of non-zero matrix elements, ∝ number of operations per matrix element
FE/FV: S ∝ number of surface nodes, ∝ number of variables per node
FE/FV: γ_a ∝ number of nodes in one direction, ∝ number of variables per node, ∝ number of non-zero matrix elements, ∝ number of operations per matrix element, ∝ number of surfaces
γ_a (NS/FV/100**3) ≈ 2000
γ_a (Poisson/FD/100**3) ≈ 400
Reminder (Beowulf + Fast Ethernet): γ_m ≈ 250
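
Using only the orders of magnitude quoted on this slide, the Γ > 1 test for a Beowulf cluster with Fast Ethernet reads:

```python
# Gamma > 1 check with the orders of magnitude quoted on the slide.

gamma_m_beowulf_fe = 250    # Beowulf + Fast Ethernet (reminder on the slide)

for app, ga in [("Navier-Stokes / FV / 100**3", 2000),
                ("Poisson / FD / 100**3", 400)]:
    G = ga / gamma_m_beowulf_fe
    verdict = "computation dominated" if G > 1 else "communication dominated"
    print(f"{app}: Gamma = {G:.1f} -> {verdict}")

# Both point-to-point applications give Gamma > 1 on this commodity network.
```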

Slide 16: Other quantities
- Memory usage
- Price per 1 h of CPU time
- Engineering salary
- Energy consumption
- Maintenance/servicing/personnel costs
- User convenience
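
As a purely illustrative way of folding these quantities into a single figure, the sketch below aggregates them into a cost per CPU-hour; every rate is an assumption of mine, not a figure from the talk.

```python
# Hypothetical aggregation of the cost factors above into a cost per CPU-hour.
# Every rate is an assumed placeholder; only the arithmetic is the point.

def cost_per_cpu_hour(hardware_per_hour, power_kW, price_per_kWh,
                      yearly_personnel_cost, cpu_hours_delivered_per_year):
    energy    = power_kW * price_per_kWh                        # per CPU-hour
    personnel = yearly_personnel_cost / cpu_hours_delivered_per_year
    return hardware_per_hour + energy + personnel

print(cost_per_cpu_hour(hardware_per_hour=0.50,        # amortised purchase price
                        power_kW=0.3,                  # per processor, incl. cooling
                        price_per_kWh=0.15,
                        yearly_personnel_cost=50_000.0,
                        cpu_hours_delivered_per_year=200_000.0))   # -> 0.795
```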

Slide 17: Optimal Grid scheduling
Goal: add application-tailored Grid scheduling to the RMS.
- Estimate machine and application parameters by counting
- Measure machine and application parameters (PAPI, ...)
- Build up a database of these parameters
- Find and submit to the best-suited Grid resource (not always the optimum)
- Update the database dynamically
- Perform statistics on decisions and decision failures

Slide 18: Optimal Grid scheduling
Settle and apply rules to find the best-suited resource by:
- matching machine and application (MPI or not MPI)
- best price/performance ratio based on the parameterisation
- availability of the resources
- engineering costs
- energy consumption
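
One possible reading of this rule set as code; this is a sketch, and all field names and the ranking key are my assumptions, not part of the talk.

```python
# Hedged sketch of the rule set above. Resources and applications are plain
# dicts with hypothetical field names; the ranking key is one possible choice.

def pick_resource(app, resources):
    # Rules 1 and 3: keep available resources that match the application.
    candidates = [r for r in resources
                  if r["available"]
                  and (r["has_mpi"] or not app["needs_mpi"])
                  and app["gamma_a"] / r["gamma_m"] >= 1.0]
    if not candidates:
        return None   # feeds the "demanded but unavailable" statistics (slide 19)

    # Rules 2, 4 and 5: cheapest delivered Mflop/s, engineering and energy included.
    def cost_per_mflops(r):
        hourly = (r["price_per_hour"] + r["energy_per_hour"]
                  + r["engineering_per_hour"])
        return hourly / r["r_a"]

    return min(candidates, key=cost_per_mflops)
```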

Slide 19: Optimal Grid scheduling
Perform statistics to:
- detect resources that are demanded too often but unavailable
- detect the real costs of an application
- detect applications that should be parallelised/optimised to reduce costs
- guide decision making for the next purchase
- guide decisions on the attribution of R&D money

Slide 20: Is a grid cost-effective? Yes, it can be!
Minimise overall costs by:
- application-adapted job execution
- purchasing low-cost resources that are demanded but not available
- parallelising cost-ineffective applications
- reducing engineering and energy costs
Note: "cheap" resources do not have to be utilised 90% of the time.
Results:
- more computing resources for the same price
- more rapid increase of application efficiencies
Questions:
- Do computer manufacturers play the game?
- Do application owners play the game?
- Can we change users, decision makers and computing centres?

Slide 21: Reference
R. Gruber, P. Volgers, A. de Vita, M. Stengel, T.-M. Tran, "Parameterisation to tailor commodity clusters to applications", Future Generation Computer Systems 19 (2003).
See also: