Tests and tools for ENEA GRID. Performance test: HPL (High Performance Linpack). Network monitoring. A. Funel, December 11, 2007

HPL TEST
HPL measures the floating point execution rate for solving a system of linear equations AX = B.
HPL requires the availability of MPI and of linear algebra libraries (BLAS, VSIPL, ATLAS).
HPL is scalable: its parallel efficiency stays constant with respect to the per-processor memory usage.

HPL Results (1)
A(n × n) X = B
GFLOPS = [(2/3)n³ + (3/2)n²] / (t_h × 10⁹), where t_h = CPU time
Th. Peak = (# of cores) × (CPU clock speed) × (FPO issue rate)
LSF submission:
Linux (bw305): 15 cores, Th. Peak = 72 GFLOPS → test completed
AIX (sp4-2): 32 cores, Th. Peak ≈ 96 GFLOPS; AIX (sp4-3-4): 32 cores, Th. Peak ≈ 122 GFLOPS → test did not complete!
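For concreteness, the peak and GFLOPS arithmetic above can be written out as a short sketch. This is only an illustration: the 2.4 GHz clock and the issue rate of 2 FP operations per cycle are assumed values, not documented parameters of the ENEA machines.

```python
# Sketch of the arithmetic on this slide (illustrative only).
# The 2.4 GHz clock and the issue rate of 2 FP ops/cycle below are
# assumptions chosen so that 15 cores reproduce the quoted 72 GFLOPS.

def theoretical_peak_gflops(cores, clock_ghz, fpo_issue_rate):
    """Th. Peak = (# of cores) x (CPU clock speed) x (FPO issue rate)."""
    return cores * clock_ghz * fpo_issue_rate

def hpl_gflops(n, cpu_time_s):
    """GFLOPS = [(2/3)n^3 + (3/2)n^2] / (t_h x 10^9)."""
    flop = (2.0 / 3.0) * n**3 + (3.0 / 2.0) * n**2
    return flop / (cpu_time_s * 1.0e9)

print(theoretical_peak_gflops(15, 2.4, 2))   # 72.0, as quoted for bw305
```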

HPL Results (2)
Expected CPU time t_h = (# FPO) / (# of cores × CPU clock speed × FPO issue rate)
Used: ATLAS version 3.6
High user wait time (not CPU time) → possibly due to the network interconnects (which were public when the test was done)
Linux (bw305), P × Q = 3 × 5 cores (LSF submission) → HPL completed

n (matrix size)   bytes = 8 × n²   % memory (total = 12 GB)   obtained CPU time (s)   expected CPU time t_h (s)   GFLOPS
n ≈ … × 10³       …                ≈ 1.2 %                    …                       …                           …
n ≈ … × 10³       …                ≈ 4.0 %                    74.0                    …                           …
n ≈ … × 10³       …                ≈ 9.0 %                    …                       …                           …
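The memory and expected-time columns of the table follow directly from the formulas above. The sketch below shows that sizing arithmetic under stated assumptions; the clock speed and issue rate are hypothetical placeholders, not the measured machine parameters.

```python
# Illustrative sizing arithmetic for the table above (not the original script).
# An n x n double-precision matrix needs 8 * n^2 bytes; the expected CPU time
# follows the formula on this slide, with hypothetical clock/issue-rate values.
import math

def matrix_bytes(n):
    return 8 * n * n                       # 8 bytes per double-precision entry

def n_for_memory_fraction(total_bytes, fraction):
    """Largest n whose 8*n^2 footprint fits in `fraction` of the total memory."""
    return int(math.sqrt(fraction * total_bytes / 8.0))

def expected_cpu_time(n, cores, clock_hz, fpo_issue_rate):
    """t_h = (# FPO) / (# of cores x CPU clock speed x FPO issue rate)."""
    flop = (2.0 / 3.0) * n**3 + (3.0 / 2.0) * n**2
    return flop / (cores * clock_hz * fpo_issue_rate)

total = 12 * 2**30                         # 12 GB, as for bw305
n = n_for_memory_fraction(total, 0.04)     # roughly the 4% row of the table
print(n, matrix_bytes(n) / total)          # matrix size and memory fraction
print(expected_cpu_time(n, 15, 2.4e9, 2))  # seconds, assuming 2.4 GHz, 2 ops/cycle
```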

A POINT-TO-POINT COMMUNICATION TEST USING MPI
HPL point-to-point communication between processors is based on the MPI routines MPI_Send and MPI_Recv.
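As an illustration of such a point-to-point test, the sketch below implements a simple MPI ping-pong between two ranks. It uses the mpi4py bindings as a stand-in for the C MPI_Send/MPI_Recv calls used by HPL; the message size and repetition count are arbitrary choices, not the parameters of the original test.

```python
# Minimal MPI ping-pong sketch (not the actual ENEA GRID test):
# rank 0 sends a buffer to rank 1 and waits for it to come back,
# timing the round trip with MPI_Wtime.
# Run with two processes, e.g.: mpirun -np 2 python pingpong.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

nbytes = 1024                          # illustrative message size
buf = np.zeros(nbytes, dtype='b')      # byte buffer exchanged by the two ranks
reps = 100                             # illustrative repetition count

if rank == 0:
    t0 = MPI.Wtime()
    for _ in range(reps):
        comm.Send(buf, dest=1, tag=0)      # maps to MPI_Send
        comm.Recv(buf, source=1, tag=0)    # maps to MPI_Recv
    t1 = MPI.Wtime()
    print("average round trip: %.6f s" % ((t1 - t0) / reps))
elif rank == 1:
    for _ in range(reps):
        comm.Recv(buf, source=0, tag=0)
        comm.Send(buf, dest=0, tag=0)
```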

HPL Results (3)
Problems for AIX (LSF submission) → HPL makes the machines hang and the test does not complete, even if memory usage < 10%: only a few CPU seconds over days of running time! Under investigation.

Interactive submissions:
AIX (sp4-1), 4 × 8 = 32 cores → HPL completed

n (matrix size)   bytes = 8 × n²   % memory (total = 25.6 GB)   obtained CPU time (s)   expected CPU time t_h (s)   GFLOPS
n ≈ … × 10³       …                ≈ 10 %                       …                       …                           …

AIX (sp4-2), 4 × 8 = 32 cores, 20% of total (32 GB) memory → HPL not completed
AIX (ostro), 4 × 4 = 16 cores, 20% of total (16 GB) memory → HPL not completed

NETWORK MONITORING (coll. G. Guarnieri)
A tool has been provided to detect whether the communication speed between two hosts (client and server) of the ENEA GRID changes over time.
The test measures the round trip time it takes to send a small packet (10, 100, 1000 bytes) of data and receive it back.
Small packets: they are not chopped (no spurious delay effects), fast fluctuations are not hidden by the final integrated average, and no time is spent waiting for big packets.
[Diagram: client ↔ server round trip, start/stop timestamps.]
60 packets are sent in sequence each second.
Both client and server block until the full packet is sent/received: no loss of data.
TCP/IP protocol.
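The blocking round-trip measurement described above can be pictured with a small TCP echo pair like the one sketched below. This is only an illustration of the idea, not the actual ENEA GRID tool; the port number and the pacing of the packet sequence are placeholders.

```python
# Illustrative TCP round-trip-time probe (not the actual ENEA GRID tool).
# The server echoes each packet back unchanged; the client sends a small
# packet, blocks until the full echo has arrived, and records the round trip.
import socket
import time

PORT = 5000                      # hypothetical port number
PACKET_SIZES = (10, 100, 1000)   # small packet sizes (bytes), as on the slide
PACKETS_PER_RUN = 60             # packets sent in sequence

def recv_exact(sock, n):
    """Block until exactly n bytes have been read (no loss of data)."""
    data = b""
    while len(data) < n:
        chunk = sock.recv(n - len(data))
        if not chunk:
            raise ConnectionError("peer closed the connection")
        data += chunk
    return data

def echo_server(size, port=PORT):
    """Echo fixed-size packets back to the client (run on the server host)."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("", port))
    srv.listen(1)
    conn, _ = srv.accept()
    while True:
        conn.sendall(recv_exact(conn, size))

def rtt_probe(server_host, size, port=PORT, count=PACKETS_PER_RUN):
    """Send `count` packets of `size` bytes back to back and return the RTTs."""
    cli = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    cli.connect((server_host, port))
    payload = b"x" * size
    rtts = []
    for _ in range(count):
        start = time.time()
        cli.sendall(payload)          # blocking send of the full packet
        recv_exact(cli, size)         # block until the echo is complete
        rtts.append(time.time() - start)
    cli.close()
    return rtts
```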

NETWORK MONITORING (2)
[Plots of round trip time for two host pairs: client eurofel00 / server bw305-2, and client kleos / server feronix0.]
High spikes are clearly detected → overall communication delay.

Conclusions
HPL benchmark test:
Linux (LSF) → the test completes; however: 1. obtained CPU time >> expected CPU time, i.e. (peak)_exp < (peak)_th; 2. too much (user) time to complete.
AIX (LSF) → the test does not complete: only a few CPU seconds over days of running time!
AIX (interactive submission): only sp4-1 (32 cores, 10% of total memory) tested → test completed, but still obtained CPU time >> expected CPU time; user wait time ≈ 35 minutes.
Network monitoring:
A tool has been provided to detect variations in the communication speed between two hosts of the ENEA GRID; it is useful for improving the overall network efficiency.