HELMHOLTZ INSTITUT MAINZ
Computing Infrastructure
Dalibor Djukanovic, Helmholtz-Institut Mainz
PANDA Collaboration Meeting, GSI, Darmstadt, 12.12.2011


HIMster - Intro

HIMster
- Cluster for experiment simulation = HIMster (HIM Cluster)
- Procurement started in late 2010
- Installation early 2011 by Megware
- Delivery and initial setup = 2 days
- 3 water-cooled racks of hardware

Hardware
- 134 compute nodes, each node:
  - 2 AMD CPUs (... GHz), 8 cores per CPU => 2144 cores in total
  - 2 Gbyte RAM / core, 14 nodes with 4 Gbyte / core => 4.7 Tbyte in total
  - InfiniBand host channel adapter, QDR 40 Gbit/s
  - No HDD (diskless setup)
- 1 frontend (+ 1 database) server:
  - Same CPUs
  - 4 Gbyte RAM / core
  - InfiniBand HCA, 10 Gbit NIC
  - Redundant power supplies / system fans
- Storage:
  - 15 k RPM SAS drives (frontend only, approx. 1 Tbyte):
    - /home – systemwide NFS-mounted, quota 10 Gbyte
    - /cluster – systemwide NFS-mounted software dir
  - Central parallel file system /data = 124 Tbyte disk space (7.2 k RPM drives)
    - Read/write performance: 700 Mbyte/s (single client)
    - Read/write performance: ... Mbyte/s (aggregated, multi-client)

HIMster Node (photo of a compute node with PSU, CPU, RAM and IB HCA labelled)

HIMster Node - CPU
- 1 CPU (package) = 2 dies ("Magny Cours"), ... GHz per die
- 512 kbyte L2 cache, 5 Mbyte shared L3 cache (per die)
- Magny Cours => 4 NUMA (non-uniform memory access) zones per node
- "Scatter" jobs across NUMA nodes (avoid foreign memory) => node allocation policy, process pinning (see the sketch below)
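One way to realize the pinning idea by hand is numactl; the NUMA node index and the binary name below are purely illustrative, and on HIMster this is handled by the batch system's node allocation policy rather than by users.

    # Inspect the NUMA layout of a compute node (4 NUMA nodes expected per HIMster node):
    numactl --hardware
    # Run a job pinned to the CPUs and the memory of NUMA node 0 (illustrative):
    numactl --cpunodebind=0 --membind=0 ./simulation_binary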

HIMster - Network
1. InfiniBand network:
   - Fat tree, blocking factor 2:1
   - 40 Gbit/s = 4 x QDR
   - 36-port managed switches (modular)
2. Gbit network:
   - Administration
   - Interactive usage
   - Switches with 10 Gbit uplink
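As a quick sanity check of the QDR links, the standard InfiniBand diagnostics can be queried on a node; this assumes the usual infiniband-diags / libibverbs tools are installed, which may differ on HIMster.

    ibstat          # HCA and port state; expect "Rate: 40" for a 4x QDR link
    ibv_devinfo     # low-level HCA details reported by libibverbs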

HIMster - Network

    osu_benchmarks]$ mpirun -np 2 -hostfile ./hostfile ./osu_bw
    # OSU MPI Bandwidth Test v3.1.2
    # Size        Bandwidth (MB/s)
    ...

    osu_benchmarks]$ mpirun -np 2 -hostfile ./hosts ./osu_latency
    # OSU MPI Latency Test v3.1.2
    # Size        Latency (us)
    ...

- High bandwidth: ... Gbyte/s (unidirectional)
- Low latency: approx. 1.4 microseconds

Storage - Topology
What we wanted:
1. Roughly 1 Gbyte/s throughput
2. Easy setup / maintenance
3. Stable system
FraunhoferFS / vendor promised:
1. Throughput in excess of 1.5 Gbyte/s ✔
2. Easy RPM-based installation ✔
3. High availability / high redundancy ? / ✔
Benefits of a parallel FS, while hiding the complexity.

Storage - Basic Setup
- 2 servers, each running 1 instance of:
  - Object storage server (4 RAID6 sets, XFS)
  - Metadata server (RAID10, ext4)
- Server hardware:
  - Dual-socket Intel 2.53 GHz (quad-core)
  - 12 Gbyte RAM (-> upgrade to 24 Gbyte)
  - Intel SSD, 40 Gbyte
  - InfiniBand QDR (40 Gbit/s), dual-port Fibre Channel (8 Gbit/s per port)
- Clients connect via kernel module (no costly context switches)
- Native InfiniBand support (RDMA)

Storage - Performance (Single Client)

Single client write performance:

    ~]$ dd if=/dev/zero of=/data/work/kphth/dalibor/test.file bs=1M count=...
    ... records in
    ... records out
    ... bytes (105 GB) copied, ... s, 685 MB/s

Single client read performance:

    ~]$ dd if=/data/work/kphth/dalibor/test.file of=/dev/null bs=1M
    ... records in
    ... records out
    ... bytes (105 GB) copied, ... s, 1.1 GB/s

Storage - Performance (Multi Client Write)

    ./IOR -a POSIX -b (100 GByte)/(# of nodes) -o /data/tests/testScaleing -i 3 -w -r -t 1m -d 10 -F

Storage - Performance (Multi Client Read)

    ./IOR -a POSIX -b (100 GByte)/(# of nodes) -o /data/tests/testScaleing -i 3 -w -r -t 1m -d 10 -F
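To get a feeling for what IOR's file-per-process mode (-F) measures, a crude stand-in is to let every MPI rank write its own file with dd; the rank environment variable shown is Open MPI's, and the file sizes and paths are placeholders, so this is only a sketch next to a real IOR run.

    # One file per rank, 1 GByte each, written in 1 MByte blocks (Open MPI assumed):
    mpirun -np 16 -hostfile ./hosts bash -c \
      'dd if=/dev/zero of=/data/tests/sketch.$OMPI_COMM_WORLD_RANK bs=1M count=1024'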

I/O - Profile - Generic Benchmarks
Application I/O profile:
- Tracing I/O using the standard Linux strace tool (e.g. ioapps)
- 200 events using Dmitry's macro
- Access pattern on the file is quite irregular => not easy to find a "generic" benchmark
- I/O profiling done with:

    strace -q -a1 -s0 -f -tttT -oOUT_FILE -e trace=file,desc,process,socket APPLICATION ARGUMENTS
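The raw trace produced by the command above is what replay tools such as ioapps work on; for a quick per-syscall timing overview, strace's built-in summary mode already yields a table in the format shown on the metadata slide below.

    # Aggregate time, call counts and errors per syscall instead of writing a full trace:
    strace -c -f -o SUMMARY_FILE APPLICATION ARGUMENTS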

I/O - Simple Scaling Estimate (Write)
Simple I/O run (see the sketch below):
- Program writes 100 kbyte, then sleeps for a second
- Run multiple threads
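A minimal bash sketch of such a run: every writer appends 100 kbyte to its own file on /data and then sleeps for a second; the number of writers, the scratch directory and the runtime are made-up parameters.

    #!/bin/bash
    # Simple write-scaling sketch: N writers, 100 kbyte per write, one write per second each.
    NWRITERS=8                              # hypothetical
    DIR=/data/work/scaling_sketch           # hypothetical scratch directory
    mkdir -p "$DIR"
    for i in $(seq 1 "$NWRITERS"); do
      (
        while true; do
          dd if=/dev/zero of="$DIR/writer.$i" bs=100k count=1 \
             oflag=append conv=notrunc 2>/dev/null
          sleep 1
        done
      ) &
    done
    sleep 60                                # let the writers run for one minute
    kill $(jobs -p)                         # stop all background writers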

Meta Data
- Distributed metadata helps
- Sometimes we see problems with ROOT macros?

    ./mdtest -n 10 -i 200 -u -t -d /data/work/kphth/dalibor/meta_test/

strace summary of such a run (timing numbers lost in transcription):

    Process detached
    % time     seconds  usecs/call     calls    errors  syscall
       ...         ...         ...       ...       ...  stat
       ...         ...         ...       ...       ...  wait
       ...         ...         ...       ...       ...  total

- FhGFS update promises to improve stat performance... let's see.
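For comparison with mdtest, the same kind of metadata load can be mimicked with a plain bash loop that creates, stats and removes many small files; the subdirectory and file count below are placeholders.

    #!/bin/bash
    # Crude metadata stress: create, stat and remove N small files on the FhGFS mount.
    DIR=/data/work/kphth/dalibor/meta_test/sketch   # hypothetical subdirectory
    N=1000                                          # hypothetical file count
    mkdir -p "$DIR"
    time for i in $(seq 1 "$N"); do touch "$DIR/f$i"; done             # create
    time for i in $(seq 1 "$N"); do stat "$DIR/f$i" > /dev/null; done  # stat
    time for i in $(seq 1 "$N"); do rm "$DIR/f$i"; done                # remove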

Policies / Environment
- User management: separate clusterwide NIS
- Default resources:
  - 10 Gbyte quota on /home
  - 1 Tbyte scratch on /data (FhGFS)
  - Fairshare 20 percent (+ queue-based restrictions)
- So far an account with the Inst. f. Kernphysik is mandatory
- Compute nodes may only be used via the batch system
- No access from outside (yet)
- PANDAGrid access (work in progress):
  - Approx. ... cores with adjusted priority
  - 5 Tbyte scratch

Batch System
- Combination of Torque/Maui
- Queue assignment depends on the walltime of the job (routing queues)
- Queue limits depend on max walltime
- Rules revisited regularly to match the cluster usage pattern

Queues:
  Route    - default (routing) queue
  Long     - walltime: 24 hours, max/user: 1700 jobs
  Big      - walltime:  2 hours, max/user: 1000 jobs
  Midrange - walltime:  4 hours, max/user:  500 jobs
  Medium   - walltime: 12 hours, max/user:  250 jobs
  Express  - debug queue, max: 16 cores, Mon-Fri 08:00 – 22:00
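For orientation, a minimal Torque job script as it could be submitted on such a setup; the job name, queue, resource request and program are illustrative and have to be adapted to the actual HIMster configuration.

    #!/bin/bash
    #PBS -N panda_sim            # job name (illustrative)
    #PBS -q medium               # target queue, e.g. the 12 h queue above
    #PBS -l walltime=10:00:00    # must stay below the queue's walltime limit
    #PBS -l nodes=1:ppn=8        # one node, 8 cores (illustrative)
    #PBS -j oe                   # merge stdout and stderr

    cd "$PBS_O_WORKDIR"                  # start in the submission directory
    ./simulation_binary > run.log 2>&1   # keep stdout small, write to a log file

Submitted with qsub, the routing queue then places the job according to the requested walltime.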

Summary - Outlook
- Cluster is in a stable production state
- Users seem to be happy so far
- No (too) big issues left:
  - Database server: max. ... queries / sec
  - Metadata might be a bottleneck under heavy load?
  - Job output to stdout quite large (several Mbyte / job => several Gbyte)
- Integrate cluster with Grid (on the way)
Questions?