SOS71 Is a Grid cost-effective? Ralf Gruber, EPFL-SIC/FSTI-ISE-LIN, Lausanne
SOS72 HPC in Europe
TOP500: 176 in Europe, 12 have more than 1 Tflop/s Linpack
- First is CEA-DAM: No. 7
- Germany: 71, UK: 39, France: 22, Italy: 16, Others: 28
- Industry: 108, first (Telecom I) at No. 96
- BMW: 11, Daimler-Chrysler: 5, Car F: 6
- Not one big, but many smaller machines
HPC companies:
- Quadrics
- Scali, SCI-based clusters: No. 51
- SCS: see Toni’s presentation
- Beowulf production: Paralline, Dalco
SOS73 Swiss-Tx project
The Swiss-Tx machines (with TNet switch):
- 1998: Prototype Swiss-T0 with 16 Alphas
- Swiss-T1 (Baby) with 16 Alphas
- Swiss-T1 with 70 Alphas
Know-how transfer to industry:
- 2001: GeneProt protein sequencing machine with 1420 Alphas
- Peak performance = 1780 Gflop/s
- In June 2001, it would have been No. 12 in the Top500 and 2nd in Europe, and it was world number 1 of industrial computer installations
- It would be No. 48 (= C-Plant) in the Top500 list of November 2002, and it is still number 2 of industrial computer installations
SOS74 Is a grid cost-effective?
NO!
Reasons:
- For 25 years, we have been able to use machines all over the world
- Those who needed good connections installed them (HEPNET, Swissprot, ...)
- Using Java is against HPC
SOS75 Parallel machines at EPFL and CSCS
EPFL-SIC:
- SGI Origin3800 (500 MHz), 128 processors
- HP Alpha ES45/Quadrics (1.25 GHz), 100 processors
Institutes:
- PC clusters (CFD, Chemistry, Mathematics, Physics)
- IBM SP-2 (EFD)
CSCS:
- NEC SX-5 (16 processors)
- IBM Regatta (256 processors, 1.3 GHz)
SOS76 Parameterisation of:
- Single processor
- Cluster
- Application
Application-tailored Grid scheduling
Optimal Grid scheduling
SOS77 Characteristic single processor parameters V_a and r_a
V_a = Operations (Ops) / Memory accesses (LS)
r_a = min(R, R * V_a / V_m) = min(R, M * V_a)
Examples:
- SAXPY: y = y + a * x; Ops = 2, LS = 3 (2 loads + 1 store); V_a = 2/3; r_a = 2/3 * M
- Matrix*matrix multiply and add: V_a = n/2; r_a = R
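
The two single-processor parameters can be checked numerically. The following is a minimal Python sketch, not code from the presentation: the function names, the machine values R = 1000 Mflop/s and M = 250 Mword/s, and the matrix size n = 100 are illustrative assumptions.

```python
def v_a(ops: float, memory_accesses: float) -> float:
    """V_a = Ops / LS: operations per memory access of a kernel."""
    return ops / memory_accesses


def r_a(R: float, M: float, V_a: float) -> float:
    """r_a = min(R, M * V_a), in Mflop/s.

    R: theoretical peak performance [Mflop/s]
    M: theoretical peak memory bandwidth [Mword/s]
    """
    return min(R, M * V_a)


# Hypothetical machine (assumed values, for illustration only)
R, M = 1000.0, 250.0

# SAXPY: y = y + a * x -> 2 operations, 3 memory accesses (2 loads + 1 store)
V_saxpy = v_a(ops=2, memory_accesses=3)
r_saxpy = r_a(R, M, V_saxpy)        # memory bound: 2/3 * M

# Matrix*matrix multiply and add with an assumed n = 100: V_a = n / 2
V_matmul = 100 / 2
r_matmul = r_a(R, M, V_matmul)      # compute bound: R

print(f"SAXPY:  V_a = {V_saxpy:.3f}, r_a = {r_saxpy:.1f} Mflop/s")
print(f"MATMUL: V_a = {V_matmul:.1f}, r_a = {r_matmul:.1f} Mflop/s")
```

On any machine where M * V_a stays below R, the kernel is limited by memory bandwidth rather than by peak floating-point performance, which is the SAXPY case here.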
SOS78 V_m = R [Mflop/s] / M [Mword/s]
R [Mflop/s] = Theoretical peak performance
M [Mword/s] = Theoretical peak memory bandwidth
Results with MATMULT, V_a = 1 (double precision)
Table columns: Machine, P, R, r_a = M, V, r, %
Machines: NEC SX; Pentium 4 1.5/R; Alpha; Pentium 4 1.7/S; AMD 1.2/S
r: Measured performance
%: 100 * r / r_a
/S: Slow SDRAM memory
/R: Fast Rambus or RDRAM memory
SOS79 Tailoring clusters to applications
Γ > 1
SOS710 Tailoring clusters to applications
Γ = γ_a / γ_m
Application: γ_a = O / S
Machine: γ_m = r_a / b
O: Number of operations in flops
S: Number of words sent in words
r_a: Theoretical peak performance of application in Mflop/s
b: Peak network bandwidth per processor in Mword/s
SOS711 Cluster characterisation
Table: The γ_m values for MATMULT (double precision)
Table columns: Machine, P, P*r_a [Mflop/s], C [Mword/s], γ_m
Machines: T1 (TNet), P = 32*; T1 (Fast Ethernet), P = 32*; IELNX (P4+FE)
γ_m = P * r_a [Mflop/s] / C [Mword/s]
γ_m = r_a / b
b = C / P
SOS712 LAUTREC on Swiss-T1 + TNet
Swiss-T1 (TNet): r_a = 1000 Mflop/s, b = 10 Mword/s, γ_m = 100
Water molecules: γ_a = 5*P*(0.65*N_orb + 4.24*log2 V) / (3*(P-1))
P = 8, N_orb = 128, log2 V = 20: γ_a = 330
Γ = 3.3 (3.6 measured)
-> 25% of overall time is due to communication, 75% is due to computation
SOS713 LAUTREC on Swiss-T1 + Fast Ethernet
Swiss-T1 (FE): r_a = 2000 Mflop/s, b = 1.5 Mword/s, γ_m = 1333
Water molecules: γ_a = 5*P*(0.65*N_orb + 4.24*log2 V) / (3*(P-1))
P = 8, N_orb = 128, log2 V = 20: γ_a = 330
Γ = 0.25 (0.25 measured)
-> 20% of overall time is due to computation, 80% is due to communication
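
The two LAUTREC cases can be reproduced from the quoted numbers. This Python sketch is not part of the presentation. It writes the application parameter O / S as gamma_a, the machine parameter r_a / b as gamma_m, and their ratio as big_gamma. It also assumes that this ratio equals computation time over communication time, so that the communication share of the run time is 1 / (1 + ratio). That assumption is not stated on the slides, but it gives shares close to the quoted 25% and 80%.

```python
def gamma_machine(r_a: float, b: float) -> float:
    """Machine parameter: r_a [Mflop/s] / b [Mword/s]."""
    return r_a / b


def communication_fraction(big_gamma: float) -> float:
    """Share of run time spent communicating.

    Assumption (not from the slides): big_gamma is the ratio of
    computation time to communication time.
    """
    return 1.0 / (1.0 + big_gamma)


gamma_a = 330.0  # water-molecule test case, value quoted on the slides

# Swiss-T1 with TNet: r_a = 1000 Mflop/s, b = 10 Mword/s
gamma_m_tnet = gamma_machine(1000.0, 10.0)
ratio_tnet = gamma_a / gamma_m_tnet
comm_tnet = communication_fraction(ratio_tnet)

# Swiss-T1 with Fast Ethernet: r_a = 2000 Mflop/s, b = 1.5 Mword/s
gamma_m_fe = gamma_machine(2000.0, 1.5)
ratio_fe = gamma_a / gamma_m_fe
comm_fe = communication_fraction(ratio_fe)

for name, ratio, comm in [("TNet", ratio_tnet, comm_tnet),
                          ("Fast Ethernet", ratio_fe, comm_fe)]:
    print(f"{name}: ratio = {ratio:.2f}, communication share = {comm:.0%}")
```

A ratio above 1 means the job spends more time computing than communicating, which is the regime the TNet configuration reaches and the Fast Ethernet configuration does not.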
SOS714 LAUTREC: Effect of latency
TNet/Swiss-T1: L = 13 μs MPI latency, b = 80 MB/s
Break-even message length: beml = L*b ≈ 1000 B
Fast Ethernet: L = 100 μs MPI latency, b = 10 MB/s
Break-even message length: beml = L*b = 1000 B
Average message length in Lautrec: aml= *V/16*P^2
For test case (V = 96**3, P = 8): aml = 40 kB >> beml
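
The break-even message length is the product of latency and bandwidth, which is easy to verify. The Python sketch below is illustrative only: it assumes the latencies are in microseconds and the bandwidths in bytes per second, and it takes 40 kB as 40 000 bytes.

```python
def break_even_message_length(latency_s: float, bandwidth_bytes_per_s: float) -> float:
    """beml = L * b: message size at which transfer time equals latency."""
    return latency_s * bandwidth_bytes_per_s


# Assumed units: latency in seconds, bandwidth in bytes per second
beml_tnet = break_even_message_length(13e-6, 80e6)    # TNet / Swiss-T1
beml_fe = break_even_message_length(100e-6, 10e6)     # Fast Ethernet

aml = 40_000  # average LAUTREC message length for the test case, in bytes

for name, beml in [("TNet", beml_tnet), ("Fast Ethernet", beml_fe)]:
    print(f"{name}: beml = {beml:.0f} B, aml / beml = {aml / beml:.0f}")
```

Because the average message is far longer than the break-even length on both networks, transfer time dominates latency for this test case.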
SOS715 Point-to-point applications
γ_a = Operations (O) / Sends (S)
FE/FV, O depends on:
- Number of volume nodes
- Number of variables per node, squared
- Number of non-zero matrix elements
- Number of operations per matrix element
FE/FV, S depends on:
- Number of surface nodes
- Number of variables per node
FE/FV, γ_a depends on:
- Number of nodes in one direction
- Number of variables per node
- Number of non-zero matrix elements
- Number of operations per matrix element
- Number of surfaces
γ_a (NS/FV/100**3) ≈ 2000
γ_a (Poisson/FD/100**3) ≈ 400
Reminder (Beowulf + Fast Ethernet): γ_m ≈ 250
SOS716 Other quantities
- Memory usage
- Price per hour of CPU time
- Engineering salary
- Energy consumption
- Maintenance/servicing/personnel costs
- User convenience
SOS717 Optimal Grid scheduling
Goal: Add application-tailored Grid scheduling to the RMS.
- Estimate machine and application parameters by counts
- Measure machine and application parameters (PAPI, ...)
- Build up a database of these parameters
- Find and submit to the best-suited Grid resource (not always the optimum)
- Update the database dynamically
- Perform statistics on decisions and decision failures
SOS718 Optimal Grid scheduling
Establish and apply rules to find the best-suited resource by:
- Matching machine and application (MPI or not MPI)
- Best price/performance ratio based on the parameterisation
- Availability of the resources
- Engineering costs
- Energy consumption
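
One way such rules could be combined is sketched below in Python. Everything in it is hypothetical: the class and field names, the prices, and the rule that an MPI job is only placed where the ratio of application parameter to machine parameter exceeds 1. It covers only machine/application matching, availability and price, not engineering or energy costs.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Resource:
    name: str
    supports_mpi: bool
    available: bool
    gamma_m: float             # machine parameter r_a / b
    price_per_cpu_hour: float  # hypothetical cost figure


@dataclass
class Application:
    name: str
    needs_mpi: bool
    gamma_a: float             # application parameter O / S
    cpu_hours: float


def best_suited(app: Application, resources: List[Resource]) -> Optional[Resource]:
    """Return the cheapest available resource that matches the application."""
    candidates = [
        r for r in resources
        if r.available
        and (r.supports_mpi or not app.needs_mpi)
        # Assumed rule: an MPI job needs gamma_a / gamma_m > 1 on the machine
        and (not app.needs_mpi or app.gamma_a / r.gamma_m > 1.0)
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda r: r.price_per_cpu_hour * app.cpu_hours)


# Illustrative pool; gamma_m values echo the two Swiss-T1 cases, prices are made up
pool = [
    Resource("tnet_cluster", True, True, 100.0, 1.0),
    Resource("fe_cluster", True, True, 1333.0, 0.2),
    Resource("vector_machine", True, False, 50.0, 5.0),
]
mpi_job = Application("lautrec_like", True, 330.0, 10.0)
serial_job = Application("serial_job", False, float("inf"), 10.0)

print(best_suited(mpi_job, pool).name)     # tnet_cluster
print(best_suited(serial_job, pool).name)  # fe_cluster
```

The communication-heavy MPI job is kept off the cheaper Fast Ethernet cluster, while the serial job goes to the cheapest available machine.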
SOS719 Optimal Grid scheduling
Perform statistics to:
- Detect resources that are requested too often while unavailable
- Detect the real costs of an application
- Detect applications that should be parallelised/optimised to reduce costs
- Guide decision making for the next purchase
- Guide decisions on the allocation of R&D money
SOS720 Is a grid cost-effective?
Yes, it can be!
- Minimise overall costs by application-adapted job execution
- Purchase low-cost resources that are in demand but not available
- Parallelise cost-ineffective applications
- Reduce engineering and energy costs
Note: “Cheap” resources do not have to be utilised 90% of the time
Results in:
- More computing resources for the same price
- More rapid increase of application efficiencies
Questions:
- Do computer manufacturers play the game?
- Do application owners play the game?
- Can we change users, decision makers and computing centres?
SOS721 R. Gruber, P. Volgers, A. de Vita, M. Stengel, T.-M. Tran, Parameterisation to tailor commodity clusters to applications, Future Generation Computer Systems 19 (2003) see also: Reference