
1  Jefferson Lab LQCD: Installation of the 2004 LQCD Compute Cluster
Walt Akers
Thomas Jefferson National Accelerator Facility, operated by the Southeastern Universities Research Association for the U.S. Department of Energy

2  Jefferson Lab's Scientific Purpose

3  LQCD: Lattice Quantum Chromodynamics
- Jefferson Lab and MIT lead a collaboration of 28 senior theorists from 14 institutions addressing the hadron physics portion of the U.S. National Lattice QCD Collaboration.
- The National LQCD Collaboration has three major hardware sites: BNL, FNAL, and JLab.
- A goal of the collaboration is to have access to tens of teraflops (sustained) in the very near future. Achieving this goal would make the U.S. a world leader in LQCD and put discovery potential in the hands of U.S. LQCD physicists.
- 2005 status: U.S. ~8 teraflops; world ~25 teraflops.

4  Computing Resources at Jefferson Lab
Jefferson Lab currently employs more than 1000 compute nodes for parallel and batch data processing. These are managed independently by the High Performance Computing Group and the Physics Computer Center.
- High Performance Computing (HPC) resources, used exclusively for parallel processing of lattice quantum chromodynamics:
  - 384-node Gig-E mesh cluster (Dell PowerEdge 2850, 2.8 GHz)
  - 256-node Gig-E mesh cluster (Supermicro)
  - 128-node Myrinet cluster (Supermicro)
  - 32-node test cluster (mixed systems)
- Physics Computer Center (CC) resources, used for sequential processing of experimental data:
  - Batch farm (mixed systems)

5  The LQCD-04 Cluster: Dell PowerEdge 2850
The 384-node compute cluster provided by Dell is our most recent addition. Each node is equipped with:
- A single 2.8 GHz processor
- 512 MBytes of RAM
- A 38 GByte SCSI drive
- 3 dual-port Intel gigabit network interface cards
- An Intelligent Platform Management Interface (IPMI)

6  The LQCD-04 Cluster: Interconnects
Interconnects:
- Each compute node uses 6 gigabit Ethernet links to perform nearest-neighbor communications in three dimensions.
- One onboard gigabit port is used to provide a connection to the service network.
- TCP was replaced with the VIA protocol, which provides less overhead, lower latency (18.75 usec), and higher throughput (500 MByte/second aggregate).
- This cluster can currently be employed as either a single 384-node machine or three distinct 128-node machines.
Why use a gigabit Ethernet mesh? Price/performance! Lattice quantum chromodynamics calculations deal almost exclusively with nearest-neighbor communication. A mesh solution is optimal. Direct gigabit connections deliver 2/3 of the throughput at 1/3 of the current cost of an Infiniband solution.
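
To make the nearest-neighbor layout concrete, here is a minimal sketch (an illustration, not JLab's production code) of how a node's linear rank might be mapped to 3-D mesh coordinates and to the six neighbor ranks it reaches over its gigabit links. The 8 x 8 x 6 grid shape and the periodic (torus) wrap-around are assumptions made for the example; the real cluster layout may differ.

DIMS = (8, 8, 6)  # hypothetical decomposition of the 384 nodes

def coords(rank, dims=DIMS):
    """Map a linear node rank to (x, y, z) coordinates in the mesh."""
    nx, ny, nz = dims
    return (rank % nx, (rank // nx) % ny, rank // (nx * ny))

def rank_of(x, y, z, dims=DIMS):
    """Map (x, y, z) coordinates back to a linear rank, with wrap-around."""
    nx, ny, nz = dims
    return (x % nx) + (y % ny) * nx + (z % nz) * nx * ny

def neighbors(rank, dims=DIMS):
    """Return the six nearest-neighbor ranks (+/-x, +/-y, +/-z)."""
    x, y, z = coords(rank, dims)
    return {"+x": rank_of(x + 1, y, z, dims), "-x": rank_of(x - 1, y, z, dims),
            "+y": rank_of(x, y + 1, z, dims), "-y": rank_of(x, y - 1, z, dims),
            "+z": rank_of(x, y, z + 1, dims), "-z": rank_of(x, y, z - 1, dims)}

if __name__ == "__main__":
    print(neighbors(0))  # the six links node 0 would use

Changing DIMS to a 128-node shape would correspond to operating the machine as three separate 128-node meshes rather than one 384-node machine.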

7  HPC Batch Management Software
TORQUE: Tera-scale Open-source Resource and QUEue manager
- TORQUE is an extension of OpenPBS that includes revisions allowing it to scale to thousands of nodes.
- TORQUE provides a queue-based infrastructure for batch submission and resource manager daemons that run on each node.
UnderLord Scheduling System:
- The UnderLord scheduler was developed at Jefferson Lab.
- It provides a hierarchical algorithm that selects jobs for execution based on a collection of weighted parameters.
- The UnderLord allows nodes to be associated with individual queues.
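
As a rough illustration of how work reaches the cluster through this stack, the sketch below builds a small TORQUE/PBS job script and hands it to qsub. The queue name, node count, walltime, and executable are hypothetical examples, not the actual JLab configuration.

import subprocess
import tempfile

JOB_SCRIPT = """#!/bin/sh
#PBS -N lqcd_demo
#PBS -q lqcd
#PBS -l nodes=128
#PBS -l walltime=12:00:00
# The queue name, node count, and command below are illustrative only.
cd $PBS_O_WORKDIR
./run_lattice_job
"""

def submit(script_text):
    """Write the job script to a temporary file and submit it with qsub."""
    with tempfile.NamedTemporaryFile("w", suffix=".pbs", delete=False) as f:
        f.write(script_text)
        path = f.name
    # qsub prints the new job identifier on success
    result = subprocess.run(["qsub", path], capture_output=True, text=True, check=True)
    return result.stdout.strip()

if __name__ == "__main__":
    print("submitted:", submit(JOB_SCRIPT))

Under this arrangement, TORQUE tracks the queued jobs and the per-node resources, while UnderLord decides which queued job runs next based on its weighted parameters.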

8  Considerations in Selecting a Cluster Vendor
1) Price/performance
   - Measured in sustained MFlops per dollar (the Dell cluster was $1.10 / MFlop)
2) Reliability
   - Quality of the individual components
3) Maintainability
   - Ease of replacement of failed components
   - Features for advanced detection of failures
   - Features for monitoring performance of the overall system
4) Service
   - Does the vendor provide a streamlined process for repair/replacement?
   - What is the time between failure and repair/replacement?

9  Performance per Dollar for Typical LQCD Applications
[Chart: MFlops per dollar for typical LQCD applications, comparing QCDSP, vector supercomputers (including the Japanese Earth Simulator), the JLab SciDAC prototype clusters, and QCDOC. Annotations: commodity compute nodes leverage the marketplace and Moore's law; systems large enough to exploit the cache boost need a low-latency, high-bandwidth network to exploit full I/O capability and keep up with cache performance; QCDOC will deliver significant capabilities in early 2004; future clusters will significantly extend scientific reach; an anticipated boost is due to SciDAC-funded software.]

10  Cluster Reliability
Infant mortality:
- On average, 7% of the machines in a large cluster acquisition will have some component failure upon delivery or during the first month of running.
- The most common failures are:
  - Shipping damage
  - Hard drive failures (often as a result of mishandling during shipping)
  - Improperly installed or fitted components resulting from accelerated production schedules
  - Manufacturing defects
The cluster provided by Dell had fewer than 2% early failures, and 2/3 of those failures were related to third-party Ethernet cards. We are very pleased with the early reliability of this cluster.

11  Service
Installation:
- The installation team sent by Dell and Ideal Systems was phenomenal.
- The team adapted quickly to shipping and delivery problems that were outside of its control and delivered an operational configuration on schedule.
Return/replacement protocol:
- We began exercising the return protocol during the first weeks of commissioning to replace several defective network cards.
- It took very little effort to develop a return strategy that was straightforward enough to be handled by our part-time student assistant.

12  Maintainability: Our Most Critical Requirement
The importance of maintainability:
- To remain competitive with other cluster technologies funded by the DOE, we must provide maximum system availability with minimal staff.
- The 800 nodes (in 4 computing clusters) operated by Jefferson Lab's High Performance Computing Group are managed, operated, and maintained by three regular staff members and one student assistant. These staff are also responsible for cluster software development.
- Because our configuration is highly parallel, the failure of a single node within a computing cluster renders the entire cluster unusable.
- Whenever possible, our compute nodes must be configured to detect hardware/software problems before they become critical and to take measures to correct themselves without operator intervention.

13  Sensors and Intelligent Platform Management
Constant monitoring:
- All systems are constantly monitored by a local daemon that collects hardware and software operating statistics.
- These results are combined with the sensor values obtained through lm_sensors or through IPMI (where available).
- Sensor data is consolidated on a centralized server, where it can be monitored and used by our system management utilities.
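
The sketch below gives a rough idea of what such a local agent could look like: it reads the node's sensors through ipmitool and forwards them to a central collector. This is not JLab's actual daemon; the collector hostname and port, the UDP transport, and the reliance on the ipmitool sensor output format are assumptions made for the example.

import json
import socket
import subprocess
import time

COLLECTOR = ("sensor-collector.example.org", 9999)  # hypothetical central server

def read_ipmi_sensors():
    """Parse 'ipmitool sensor' output into {sensor_name: reading} pairs."""
    out = subprocess.run(["ipmitool", "sensor"], capture_output=True, text=True)
    readings = {}
    for line in out.stdout.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) >= 2 and fields[1] not in ("na", ""):
            readings[fields[0]] = fields[1]
    return readings

def main(interval=60):
    host = socket.gethostname()
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    while True:
        sample = {"host": host, "time": time.time(), "sensors": read_ipmi_sensors()}
        sock.sendto(json.dumps(sample).encode(), COLLECTOR)  # one datagram per sample
        time.sleep(interval)

if __name__ == "__main__":
    main()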

14  Sensors and Intelligent Platform Management
Sensor summaries:
- Our sensor summary pages display the values of all of the critical system parameters.
- Actual values are presented in gauges that reflect their minimum and maximum, as well as low and high thresholds.
- Data collected at the machine level is then used to produce a 'rack summary' and is further condensed into a 'room overview' that displays the most severe conditions throughout the room (see the sketch below).
- Administrators can 'drill down' from the room overview to find most problems.
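
Here is a minimal sketch, using made-up severity levels and sample data, of the roll-up just described: per-node sensor states are condensed into a per-rack summary and then into a room overview that keeps only the most severe condition seen. It illustrates the aggregation idea only; the real summary pages are gauge-based displays.

SEVERITY = {"ok": 0, "warning": 1, "critical": 2}

def rack_summary(nodes):
    """nodes: {node_name: {sensor_name: state}} -> worst state per sensor in the rack."""
    summary = {}
    for sensors in nodes.values():
        for name, state in sensors.items():
            if SEVERITY[state] >= SEVERITY[summary.get(name, "ok")]:
                summary[name] = state
    return summary

def room_overview(racks):
    """racks: {rack_name: {node: {sensor: state}}} -> most severe state per rack."""
    overview = {}
    for rack, nodes in racks.items():
        states = rack_summary(nodes).values()
        overview[rack] = max(states, key=lambda s: SEVERITY[s], default="ok")
    return overview

if __name__ == "__main__":
    racks = {"rack01": {"node001": {"cpu_temp": "ok", "fan1": "warning"},
                        "node002": {"cpu_temp": "ok", "fan1": "ok"}}}
    print(room_overview(racks))  # {'rack01': 'warning'}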

15  Using SNMP to Monitor Our Infrastructure
Responding to power outages:
- The Dell computing cluster is on an individual Uninterruptible Power Supply (UPS) that is not generator backed.
- When a power failure is detected by our monitoring software and the remaining battery drops below 90%, IPMI is used to power down all compute nodes.
- Once power has been restored (for at least 5 minutes) and the battery has recharged to 95%, IPMI is used to power on all compute nodes.
- Any previously running batch job is restarted, and the system continues to operate.
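
The sketch below shows one way this shutdown/restart policy could be wired together from standard tools; it is not the production script. The UPS hostname, SNMP community string, BMC host names, and IPMI credentials are hypothetical, and the RFC 1628 UPS-MIB object IDs assume the UPS exposes the standard MIB rather than a vendor-specific one.

import subprocess
import time

UPS_HOST = "ups-lqcd.example.org"              # hypothetical UPS address
CHARGE_OID = "1.3.6.1.2.1.33.1.2.4.0"          # upsEstimatedChargeRemaining (%)
ON_BATTERY_OID = "1.3.6.1.2.1.33.1.2.2.0"      # upsSecondsOnBattery
NODES = ["node%03d-bmc" % n for n in range(1, 385)]  # hypothetical BMC host names

def snmp_get_int(oid):
    """Fetch a single integer value from the UPS over SNMP v1."""
    out = subprocess.run(["snmpget", "-v1", "-c", "public", "-Oqv", UPS_HOST, oid],
                         capture_output=True, text=True)
    return int(out.stdout.strip())

def ipmi_power(node, action):
    """Issue an IPMI-over-LAN chassis power command; credentials are placeholders."""
    subprocess.run(["ipmitool", "-I", "lan", "-H", node,
                    "-U", "admin", "-P", "secret", "chassis", "power", action])

def monitor(poll_seconds=30):
    powered_down = False
    restored_since = None
    while True:
        on_battery = snmp_get_int(ON_BATTERY_OID) > 0
        charge = snmp_get_int(CHARGE_OID)
        if on_battery and charge < 90 and not powered_down:
            for node in NODES:
                ipmi_power(node, "off")        # shed load before the UPS drains
            powered_down = True
        if not on_battery:
            restored_since = restored_since or time.time()
            if powered_down and time.time() - restored_since > 300 and charge >= 95:
                for node in NODES:
                    ipmi_power(node, "on")     # bring the cluster back up
                powered_down = False
        else:
            restored_since = None
        time.sleep(poll_seconds)

if __name__ == "__main__":
    monitor()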

16  What We Like Most About the Dell Cluster
- The installation team provided by Dell and Ideal Systems was fast, knowledgeable, and efficient.
- Compute nodes are easily disassembled and reassembled for repair or maintenance.
- Dell's IPMI implementation provides a wealth of system health information for cluster monitoring.
- The systems have demonstrated a high degree of stability and reliability so far.

17  What We Would Do Differently Next Time
- Start with more space, electricity, and cooling. A new 10,000 square foot computer facility is currently under construction and should be online in late
- Order our systems preinstalled in racks. This will minimize the shipping debris that we struggled with during the last installation and should greatly improve installation speed.
- If feasible, consider a single high-speed interconnect rather than a mesh topology. While the gig-e mesh provides adequate bandwidth at a very affordable price, it does represent a burden to install, troubleshoot, and maintain. Because the price of Infiniband is falling, we anticipate that our next cluster will use a switched network to provide greater configuration flexibility and reduced wire-management concerns.

18  How Dell Can Help Us on the Next Cluster
- Improve the DOS-based BIOS configuration utilities. Specifically, the bioscfg.exe utility had trouble writing changes to the boot order in the BIOS; we had to modify all of those by hand.
- Make sensor data available from the /proc filesystem in Linux. A Linux driver that provides local access to sensor data would give us a lot of troubleshooting flexibility.
- Provide a BMC console that allows administrators to remotely monitor the system boot process using IPMI. Since our Computer Center is currently located in a separate building from our offices, this would save everyone on our team a long walk through the cold.

