Presentation transcript:

LQCD benchmarks on cluster architectures
M. Hasenbusch, D. Pop, P. Wegner (DESY Zeuthen), A. Gellrich, H. Wittig (DESY Hamburg)
CHEP03, 25 March 2003, Category 6: Lattice Gauge Computing
Speaker: Peter Wegner, DESY

Outline:
- Motivation
- PC Cluster @ DESY
- Benchmark architectures: DESY cluster, E7500 systems, Infiniband blade servers, Itanium2
- Benchmark programs, results
- Future
- Conclusions, acknowledgements

PC Cluster Motivation: LQCD, Stream benchmark, Myrinet bandwidth

32/64-bit Dirac kernel, LQCD (Martin Lüscher, DESY/CERN, 2000):
- P4, 1.4 GHz, 256 MB Rambus, using SSE1(2) instructions including cache prefetch
- Time per lattice point: 0.926 µs (1503 MFLOPS, 32-bit arithmetic), 1.709 µs (814 MFLOPS, 64-bit arithmetic)

Stream benchmark, memory bandwidth:
- P4 (1.4 GHz, PC800 Rambus): 1.4 ... 2.0 GB/s
- PIII (800 MHz, PC133 SDRAM): 400 MB/s
- PIII (400 MHz, PC133 SDRAM): 340 MB/s

Myrinet, external bandwidth:
- 2.0+2.0 Gb/s optical connection, bidirectional, ~240 MB/s sustained
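For context, the Stream figures above are sustained-bandwidth numbers of the kind a simple "triad" loop measures. The following is a minimal, self-contained C sketch of such a measurement; the array size, repetition count and use of clock_gettime() are our assumptions, not the original STREAM benchmark code (compile with optimization, e.g. gcc -O2, and -lrt on older glibc):

```c
/* Minimal STREAM-style triad sketch for sustained memory bandwidth.
 * Assumptions: array size, repeat count, POSIX clock_gettime(). */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 23)     /* 8M doubles per array (64 MB), well beyond any cache */
#define NTRIES 10

int main(void)
{
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    double *c = malloc(N * sizeof *c);
    double best = 1e30, scalar = 3.0;

    if (!a || !b || !c)
        return 1;
    for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

    for (int k = 0; k < NTRIES; k++) {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < N; i++)        /* triad: a = b + s*c */
            a[i] = b[i] + scalar * c[i];
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double dt = (t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
        if (dt < best) best = dt;
    }

    /* three arrays are streamed per iteration: two loads and one store */
    printf("triad bandwidth: %.2f GB/s\n", 3.0 * N * sizeof(double) / best / 1e9);
    printf("check value: %g\n", a[0] + a[N / 2]);  /* keeps the stores alive */
    free(a); free(b); free(c);
    return 0;
}
```

The best time over several repetitions gives a number directly comparable to the GB/s figures quoted on the slide.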

Benchmark Architectures - DESY Cluster

Hardware:
- Nodes: Supermicro P4DC6 mainboard, 2 x XEON P4, 1.7 (2.0) GHz, 256 (512) kByte cache, 1 GByte (4 x 256 MByte) RDRAM, IBM 18.3 GB DDYS-T18350 U160 3.5" SCSI disk, Myrinet 2000 M3F-PCI64B-2 interface
- Network: Fast Ethernet switch Gigaline 2024M, 48 x 100BaseTX ports + GIGAline 2024 1000BaseSX-SC; Myrinet fast interconnect, M3-E32 5-slot chassis, 2 x M3-SW16 line cards

Installation: Zeuthen: 16 dual-CPU nodes, Hamburg: 32 dual-CPU nodes

Benchmark Architectures - DESY Cluster, i860 chipset problem

[Block diagram of the Intel i860 chipset: dual Xeon processors on a 400 MHz system bus, MCH with dual-channel RDRAM (3.2 GB/s, up to 4 GB), MRH and P64H hubs, ICH2, PCI slots (66 MHz/64 bit and 33 MHz/32 bit), AGP4X, dual IDE channels, USB, 10/100 Ethernet.]

Measured PCI throughput:
- bus_read (send) = 227 MBytes/s, bus_write (recv) = 315 MBytes/s, of a maximum of 528 MBytes/s
- External Myrinet bandwidth: 160 MBytes/s, 90 MBytes/s bidirectional

Benchmark Architectures - Intel E7500 chipset

[Block diagram of the Intel E7500 chipset.]

Benchmark Architectures - E7500 system

Par-Tec (Wuppertal), 4 nodes:
- Intel Xeon CPU, 2.60 GHz
- 2 GB ECC PC1600 (DDR-200) SDRAM
- Super Micro P4DPE-G2, Intel E7500 chipset, PCI 64/66
- 2 x Intel PRO/1000 network connection
- Myrinet M3F-PCI64B-2

Benchmark Architectures

Leibniz-Rechenzentrum Munich (single-CPU tests):
- Pentium IV, 3.06 GHz, with ECC Rambus memory
- Pentium IV, 2.53 GHz, with Rambus 1066 memory
- Xeon, 2.4 GHz, with PC2100 DDR SDRAM memory (probably FSB400)

Megware: 8 nodes, dual XEON 2.4 GHz, E7500 chipset, 2 GB DDR ECC memory, Myrinet2000, Supermicro P4DMS-6GM

University of Erlangen: Itanium2, 900 MHz, 1.5 MB cache, 10 GB RAM, zx1 chipset (HP)

Benchmark Architectures - Infiniband

Megware: 10 Mellanox ServerBlades:
- Single Xeon 2.2 GHz, 2 GB DDR RAM, ServerWorks GC-LE chipset
- InfiniBand 4X HCA
- RedHat 7.3, kernel 2.4.18-3
- MPICH-1.2.2.2 with the OSU patch for VIA/InfiniBand 0.6.5
- Mellanox firmware 1.14, Mellanox SDK (VAPI) 0.0.4
- Compiler: GCC 2.96

Dirac Operator Benchmark (SSE), 16×16³ lattice, single P4/XEON CPU

[Bar chart: MFLOPS for the Dirac operator and linear algebra kernels on the benchmarked CPUs.]
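The SSE/SSE2 figures in this and the following slides come from hand-vectorized kernels with cache prefetch (Lüscher's benchmark code). As a rough, hedged illustration of that technique, and explicitly not the actual Dirac kernel, the sketch below uses SSE1 intrinsics to scale an array of complex single-precision numbers; the function name, interleaved re/im layout, 16-byte alignment requirement and prefetch distance are assumptions chosen for a compact example:

```c
/* Illustration only: packed single-precision complex arithmetic with
 * software prefetch, the ingredients of the SSE Dirac kernels.
 * Assumptions: z holds n complex floats as re,im pairs, 16-byte
 * aligned, n even; prefetch distance chosen arbitrarily. */
#include <xmmintrin.h>          /* SSE1 intrinsics */

/* z[i] <- (re + i*im) * z[i] for all n complex entries of z */
void cscale_sse(float *z, int n, float re, float im)
{
    const __m128 cre  = _mm_set1_ps(re);
    const __m128 cim  = _mm_set1_ps(im);
    const __m128 sign = _mm_set_ps(1.0f, -1.0f, 1.0f, -1.0f); /* lanes: -,+,-,+ */

    for (int i = 0; i < 2 * n; i += 4) {
        _mm_prefetch((const char *)(z + i + 64), _MM_HINT_T0);     /* fetch ahead */
        __m128 v   = _mm_load_ps(z + i);                           /* re0 im0 re1 im1 */
        __m128 sw  = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 3, 0, 1)); /* im0 re0 im1 re1 */
        /* real part: re*z_re - im*z_im,  imag part: re*z_im + im*z_re */
        __m128 res = _mm_add_ps(_mm_mul_ps(cre, v),
                                _mm_mul_ps(sign, _mm_mul_ps(cim, sw)));
        _mm_store_ps(z + i, res);
    }
}
```

The real kernels apply the same idea to SU(3) matrix times spinor arithmetic and use explicit prefetching of the next lattice site, which is where the 1.5 GFLOPS-class sustained rates quoted earlier come from.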

Parallel (1-dim) Dirac Operator Benchmark (SSE), even-odd preconditioned, 2×16³ lattice, XEON CPUs, single-CPU performance

[Bar chart: MFLOPS over Myrinet2000; sustained bandwidth i860: 90 MB/s, E7500: 190 MB/s.]
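"Even-odd preconditioned" refers to the standard checkerboard Schur-complement construction for the Wilson-Dirac operator; in a common convention (the normalization may differ from the benchmark code) it reads:

```latex
% Even-odd (checkerboard) decomposition and Schur complement;
% standard convention, normalization may differ from the benchmark code.
\[
  M \;=\;
  \begin{pmatrix}
    1 & -\kappa\, D_{eo} \\
    -\kappa\, D_{oe} & 1
  \end{pmatrix},
  \qquad
  \hat{M} \;=\; 1 - \kappa^{2}\, D_{eo} D_{oe},
\]
\[
  \hat{M}\,\psi_{e} \;=\; \eta_{e} + \kappa\, D_{eo}\,\eta_{o},
  \qquad
  \psi_{o} \;=\; \eta_{o} + \kappa\, D_{oe}\,\psi_{e}.
\]
```

The solver then only acts on the even sublattice, halving the system size and improving the condition number, while each application of D_eo or D_oe still requires the nearest-neighbour (boundary) communication benchmarked here.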

Parallel (1-dim) Dirac Operator Benchmark (SSE), even-odd preconditioned, 2×16³ lattice, XEON CPUs, single-CPU performance, 2 and 4 nodes

Performance comparisons (MFLOPS):

                 SSE2    non-SSE
  Single node    446     328 (74%)
  Dual node      330     283 (85%)

ParaStation3 software, non-blocking I/O support (MFLOPS, non-SSE):

  blocking    non-blocking I/O
  308         367 (119%)
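The 19% gain from non-blocking I/O comes from overlapping the boundary exchange with the update of interior lattice sites. Below is a minimal MPI sketch of that overlap pattern; buffer names, counts and the interior/boundary split are illustrative placeholders, not the benchmark code, which runs on the ParaStation MPI implementation:

```c
/* Hedged sketch of overlapping a 1-dim halo exchange with computation.
 * Assumptions: illustrative function and buffer names, MPI_FLOAT data,
 * neighbours 'up' and 'dn' in the 1-dim decomposition. */
#include <mpi.h>

void apply_dirac_overlapped(float *snd_up, float *snd_dn,
                            float *rcv_up, float *rcv_dn,
                            int bnd_count, int up, int dn, MPI_Comm comm)
{
    MPI_Request req[4];

    /* 1. Post receives and sends for both boundaries. */
    MPI_Irecv(rcv_up, bnd_count, MPI_FLOAT, up, 0, comm, &req[0]);
    MPI_Irecv(rcv_dn, bnd_count, MPI_FLOAT, dn, 1, comm, &req[1]);
    MPI_Isend(snd_up, bnd_count, MPI_FLOAT, up, 1, comm, &req[2]);
    MPI_Isend(snd_dn, bnd_count, MPI_FLOAT, dn, 0, comm, &req[3]);

    /* 2. Apply the Dirac operator to interior sites while the
          transfers are in flight (placeholder). */
    /* dirac_interior(...); */

    /* 3. Complete the exchange, then update the boundary sites that
          need remote neighbours (placeholder). */
    MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
    /* dirac_boundary(rcv_up, rcv_dn, ...); */
}
```

Whether the transfer actually proceeds in the background depends on the MPI implementation, which is exactly the point made with the ParaStation numbers above.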

Maximal efficiency of external I/O

                                              MFLOPS           MFLOPS         Maximal            Efficiency
                                              (without comm.)  (with comm.)   bandwidth (MB/s)
  Myrinet (i860), SSE                         579              307            90 + 90            0.53
  Myrinet/GM (E7500), SSE                     631              432            190 + 190          0.68
  Myrinet/ParaStation (E7500), SSE            675              446            181 + 181          0.66
  ParaStation (E7500), non-blocking, non-SSE  406              368            hidden             0.91
  Gigabit Ethernet, non-SSE                   390              228            100 + 100          0.58
  Infiniband                                  370              297            210 + 210          0.80
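The efficiency column is the ratio of the sustained performance with communication to the pure compute performance of the same code; as a worked check against the Infiniband row:

```latex
\[
  \varepsilon \;=\; \frac{\text{MFLOPS (with communication)}}{\text{MFLOPS (without communication)}},
  \qquad
  \varepsilon_{\text{Infiniband}} \;=\; \frac{297}{370} \;\approx\; 0.80 .
\]
```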

Parallel (1-dim) Dirac Operator Benchmark (SSE), even-odd preconditioned, 2×16³ lattice, XEON/Itanium2 CPUs, single-CPU performance, 4 nodes

4 single-CPU nodes, Gbit Ethernet, non-blocking switch, full duplex:
- P4 (2.4 GHz, 0.5 MB cache), SSE: 285 MFLOPS, 88.92 + 88.92 MB/s
- P4 (2.4 GHz, 0.5 MB cache), non-SSE: 228 MFLOPS, 75.87 + 75.87 MB/s
- Itanium2 (900 MHz, 1.5 MB cache), non-SSE: 197 MFLOPS, 63.13 + 63.13 MB/s

Infiniband interconnect

- Links: high-speed serial, 1x, 4x and 12x; up to 10 Gbit/s (4X), bidirectional
- Switch: simple, low-cost, multistage network
- Host Channel Adapter (HCA): protocol engine, moves data via messages queued in memory
- Target Channel Adapter (TCA): interface to I/O controllers (SCSI, FC-AL, GbE, ...)
- Chips: IBM, Mellanox; PCI-X cards: Fujitsu, Mellanox, JNI, IBM
- http://www.infinibandta.org

[Block diagram: CPUs and system memory attached via memory controller and HCA to an InfiniBand switch, with TCAs connecting I/O controllers.]

Infiniband interconnect

[Figure only; no recoverable text.]

Parallel (2-dim) Dirac Operator Benchmark (Ginsparg-Wilson fermions), XEON CPUs, single-CPU performance, 4 nodes

Infiniband vs Myrinet performance, non-SSE (MFLOPS):

                                        XEON 1.7 GHz, Myrinet, i860    XEON 2.2 GHz, Infiniband, E7500
                                        32-bit     64-bit              32-bit     64-bit
  8×8³ lattice, 2×2 processor grid      370        281                 697        477
  16×16³ lattice, 2×4 processor grid    338        299                 609        480
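For reference, Ginsparg-Wilson fermions are defined by a lattice Dirac operator D that satisfies the Ginsparg-Wilson relation (standard form, lattice spacing a):

```latex
\[
  \gamma_{5}\, D + D\, \gamma_{5} \;=\; a\, D\, \gamma_{5}\, D .
\]
```

Applying such an operator costs considerably more arithmetic per lattice site than the Wilson operator, which also raises the ratio of floating-point work to memory and network traffic in this benchmark.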

Future - Low Power Cluster Architectures?

Future Cluster Architectures - Blade Servers?

- NEXCOM low-voltage blade server: 200 low-voltage Intel XEON CPUs (1.6 GHz, 30 W) in a 42U rack, integrated Gbit Ethernet network
- Mellanox Infiniband blade server: single-XEON blades connected via a 10 Gbit (4X) Infiniband network (MEGWARE, NCSA, Ohio State University)

Conclusions

- PC CPUs deliver a very high sustained LQCD performance when SSE/SSE2 (SIMD plus prefetch) is used, provided the local lattice is sufficiently large.
- The bottlenecks are memory throughput and external I/O bandwidth; both are improving (chipsets: i860 → E7500 → E7505 → ..., FSB: 400 MHz → 533 MHz → 667 MHz → ..., external I/O: Gbit Ethernet → Myrinet2000 → QsNet → Infiniband → ...).
- Non-blocking MPI communication can improve performance, given an MPI implementation that actually supports it (e.g. ParaStation).
- 32-bit architectures (e.g. IA32) have a much better price/performance ratio than 64-bit architectures (Itanium, Opteron?).
- Large, dense, low-voltage blade clusters could play an important role in LQCD computing (low-voltage XEON, Centrino?, ...).

Acknowledgements

We would like to thank Martin Lüscher (CERN) for the benchmark codes and the fruitful discussions about PCs for LQCD, and Isabel Campos Plasencia (Leibniz-Rechenzentrum Munich), Gerhard Wellein (University of Erlangen), Holger Müller (Megware), Norbert Eicker (Par-Tec) and Chris Eddington (Mellanox) for the opportunity to run the benchmarks on their clusters.