A Commodity Cluster for Lattice QCD Calculations at DESY
Andreas Gellrich*, Peter Wegner, Hartmut Wittig (DESY)
CHEP03, 25 March 2003
Category 6: Lattice Gauge Computing

Initial Remark
This talk is held in conjunction with two other talks in the same session:
K. Jansen (NIC/DESY): Lattice Gauge Theory and High Performance Computing: The LATFOR Initiative in Germany
A. Gellrich (DESY): A Commodity Cluster for Lattice QCD Calculations at DESY
P. Wegner (DESY): LQCD benchmarks on cluster architectures

Contents
DESY Hamburg and DESY Zeuthen operate high-performance clusters for LQCD calculations, exploiting commodity hardware. This talk focuses on system aspects of the two installations.
- Introduction
- Cluster Concept
- Implementation
- Software
- Operational Experiences
- Conclusions

Cluster Concept
In clusters a set of main building blocks can be identified:
- computing nodes
- high-speed network
- administration network
- host system: login, batch system
- slow network: monitoring, alarming

Cluster Concept (cont'd)
[Diagram: nodes attached to a high-speed switch and an administration switch; host system with slow-control connections and an uplink to the WAN]

Cluster Concept (cont'd)
Aspects of cluster integration and software installation:
- operating system (Linux)
- security issues (login, open ports, private subnet)
- user administration
- (automatic) software installation
- backup and archiving
- monitoring
- alarming

Racks
Rack-mounted dual-CPU PCs.
Hamburg: 32 dual-CPU nodes (16 dual-Xeon P4 2.0 GHz, 16 dual-Xeon P4 1.7 GHz)
Zeuthen: 16 dual-CPU nodes (dual-Xeon P4 1.7 GHz)
[Photo: Hamburg racks]

Boxes
[Photos: front and back views of the boxes; host, console, and spare node labelled]

Nodes
- Intel Xeon (Pentium 4) CPUs with L2 cache
- RDRAM: 4 x 256 MB
- mainboard: Supermicro P4DC6, i860 chipset
- 18.3 GB IBM IC35L SCSI disk
- Myrinet2000 M3F-PCI64B: 2 x 2.0 Gbit/s optical link
- 100Base-TX Ethernet card
- 4U module

Switches
- Ethernet: Gigaline 2024M switch, 48 ports 100Base-TX, GigE uplink
- Myrinet: M3-E32 5-slot chassis with 4 M3-SW16 line cards

Software
- Network: nodes in a private subnet
- Operating system: Linux (S.u.S.E. 7.2); 2.4 SMP kernel
- Communication: MPI, based on GM (Myricom low-level communication library); see the sketch below
- Compilers: Fortran 77/90 and C/C++ in use (GNU, Portland Group, KAI, Intel)
- Batch system: PBS (OpenPBS)
- Cluster management: Clustware, (SCORE)
- Time synchronization via XNTP
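For illustration only (the talk shows no code): a minimal MPI program in C of the kind such a cluster would run, passing a token around a ring of processes. The assumption that the MPI-on-GM installation provides an mpicc compiler wrapper and an mpirun launcher is mine; the slide only states that communication is MPI on top of GM.

```c
/* Minimal MPI ring exchange, sketched as an example of an MPI job on the
 * cluster. Assumptions (not from the talk): an mpicc wrapper and mpirun
 * launcher provided by the MPI-on-GM installation. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, token;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int next = (rank + 1) % size;         /* right neighbour in the ring */
    int prev = (rank + size - 1) % size;  /* left neighbour in the ring  */

    /* send own rank to the next process, receive the previous one's rank */
    MPI_Sendrecv(&rank, 1, MPI_INT, next, 0,
                 &token, 1, MPI_INT, prev, 0,
                 MPI_COMM_WORLD, &status);

    printf("rank %d of %d received token %d\n", rank, size, token);

    MPI_Finalize();
    return 0;
}
```

A job like this would be handed to the nodes through OpenPBS in Zeuthen, or scheduled manually in Hamburg, where the batch system is not used (see Operational Experiences).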

Backup and Archiving
Backup:
- system data
- user home directories
- (hopefully) never accessed again ...
- incremental backup via DESY's TSM environment (Hamburg)
Archive:
- individual storage of large amounts of data, O(1 TB/year)
- DESY product dCache
- pseudo-NFS directory on the host system
- special dccp (dCache copy) command; see the sketch below
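As a hypothetical illustration of the archiving path: data is not written into the pseudo-NFS directory directly but transferred with dccp, so a program producing gauge configurations might simply shell out to that command. Only the dccp command itself is taken from the slide; the wrapper below and any /pnfs destination path passed to it are invented.

```c
/* Hypothetical sketch: pushing a result file into the dCache archive by
 * invoking the dccp (dCache copy) command mentioned on the slide.
 * The destination path supplied by the caller is site-specific. */
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <local file> <pnfs destination>\n", argv[0]);
        return 1;
    }

    char cmd[1024];
    snprintf(cmd, sizeof(cmd), "dccp %s %s", argv[1], argv[2]);

    int rc = system(cmd);   /* dccp exits non-zero on failure */
    if (rc != 0) {
        fprintf(stderr, "archiving with dccp failed (status %d)\n", rc);
        return 1;
    }
    return 0;
}
```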

Backup and Archiving (cont'd)
[Diagram: host and node filesystems (/, /home, /data, /local) on local disks of 18.3 GB, 25.5 GB and 8.3 GB; backup path to TSM and archive path to a 10 TB dCache store mounted via pnfs]

Monitoring
Requirements:
- web-based
- history
- alarming (e-mail, SMS)
clustware (by MEGWARE):
- no history kept
- exploits UDP
clumon (simple; home-made; Perl-based; exploits NFS); see the sketch below:
- deploys MRTG for history
- includes alarming via e-mail and/or SMS
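The slides describe clumon only at this level of detail. As a rough sketch of the underlying idea (each node publishes its state into an NFS-shared directory, where a host-side collector feeds MRTG and raises alarms), here is a hypothetical node-side reporter in C. The real clumon is a Perl tool; the status-file path and record format below are invented.

```c
/* Hypothetical clumon-style node reporter (the real tool is Perl-based).
 * Writes the node's 1-minute load average into an NFS-shared directory;
 * path and record format are invented for illustration. */
#include <stdio.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
    char host[64];
    if (gethostname(host, sizeof(host)) != 0) {
        perror("gethostname");
        return 1;
    }

    /* read the 1-minute load average from the Linux proc filesystem */
    FILE *in = fopen("/proc/loadavg", "r");
    if (in == NULL) { perror("/proc/loadavg"); return 1; }
    double load1 = 0.0;
    fscanf(in, "%lf", &load1);
    fclose(in);

    /* publish a one-line status record to the shared directory */
    char path[256];
    snprintf(path, sizeof(path), "/home/clumon/status/%s", host);
    FILE *out = fopen(path, "w");
    if (out == NULL) { perror(path); return 1; }
    fprintf(out, "%ld %s load=%.2f\n", (long)time(NULL), host, load1);
    fclose(out);
    return 0;
}
```

Run periodically (for example from cron), such a reporter lets a host-side script compare the timestamps against the current time and send e-mail or SMS alarms for nodes that have gone silent.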

Monitoring: clustware
[Screenshots of the clustware web interface]

Monitoring: clumon
[Screenshots of the clumon web interface]

Operational Experiences
Zeuthen:
- running since December 2001
- 18 user accounts; 6 power users
- batch system is used to submit jobs
- hardware failures of single nodes (half of the disks replaced)
- Myrinet failures (1 switch line card; 2 interface cards)
- uptime: server 362 days, nodes 172 days

Operational Experiences (cont'd)
Hamburg:
- running since January 2002
- 18 user accounts; 4 power users
- batch system is NOT used; manual scheduling
- hardware failures of single nodes (all local disks were replaced)
- Myrinet failures (3 switch line cards; 2 interface cards)
- 32 CPUs upgraded, incl. BIOS
- some kernel hang-ups (eth0 driver got lost, ...)
- uptime: server 109 days, 63 days (since switch repair)

Operational Experiences (cont'd)
In summary:
- several broken disks
- 1 broken motherboard
- 4 of 6 broken Myrinet switch line cards
- 4 of 48 broken Myrinet interface cards
- some kernel hang-ups
Possible improvements:
- the server is a single point of failure
- Linux installation procedure
- exploit serial console for administration

Conclusions
- Commodity-hardware-based PC clusters for LQCD
- Linux as operating system
- MPI-based parallelism
- Batch system optional
- Clusters in Hamburg and Zeuthen in operation for more than 1 year
- Hardware problems occur, but repairs are easy
A successful model for LQCD calculations!