High-Performance Storage System for the LHCb Experiment


High-Performance Storage System for the LHCb Experiment
Sai Suman Cherukuwada, CERN
Niko Neufeld, CERN
IEEE NPSS Real Time 2007, FNAL

LHCb Background
- LHCb: Large Hadron Collider beauty
- One of the 4 major CERN experiments at the LHC
- Single-arm forward spectrometer
- b-physics, CP violation in the interactions of b-hadrons

Storage at Experiment Site
LHCb data acquisition:
- Level 0 trigger: 1 MHz
- High Level Trigger: 2-5 kHz
- Accepted events are written to disk at the experiment site
- Data are then staged out to the tape centre
[Diagram: data flow from the L0 electronics and readout boards through the High-Level Trigger farm to the storage at the experiment site]

Storage Requirements
- Must sustain write operations for 2-5 kHz of event data at ~30 kB/event, with peak loads up to twice as much
- Must sustain matching read operations for staging data out to the tape centre
- Must support reading of data for analysis tasks
- Must be fault tolerant
- Must scale easily to higher storage capacity and/or throughput
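For orientation, the figures above imply roughly 60-150 MB/s of sustained write bandwidth, and about twice that at peak. A minimal back-of-envelope sketch, using only the event size and rate range quoted on this slide:

```cpp
// Back-of-envelope estimate of the write bandwidth implied by the figures above.
#include <cstdio>

int main() {
    const double event_size_kB = 30.0;          // ~30 kB per event (from the slide)
    for (double rate_kHz : {2.0, 5.0}) {        // HLT output rate range
        double sustained_MBps = rate_kHz * 1000.0 * event_size_kB / 1000.0;  // MB/s
        std::printf("%.0f kHz -> %.0f MB/s sustained, %.0f MB/s at 2x peak\n",
                    rate_kHz, sustained_MBps, 2.0 * sustained_MBps);
    }
    return 0;
}
// Prints: 2 kHz -> 60 MB/s sustained, 120 MB/s at 2x peak
//         5 kHz -> 150 MB/s sustained, 300 MB/s at 2x peak
```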

Architecture: Choices
- Fully partitioned, independent servers
- Cluster file system

Architecture: Shared-Disk File System Cluster
- Compute and storage components can be scaled independently
- Locking overhead is restricted to very few servers
- A unified namespace is simpler to manage
- The storage fabric delivers high throughput
[Diagram: shared-disk file system cluster]

Hardware: Components
- Dell PowerEdge 2950 servers: Intel Xeon quad-core (1.6 GHz), 4 GB FBD RAM
- QLogic QLE2462 4 Gbps Fibre Channel adapters
- DataDirect Networks S2A 8500 storage controllers with 2 Gbps host-side ports
- 50 x Hitachi 500 GB 7200 rpm SATA disks
- Brocade 200E 4 Gbps Fibre Channel switch

Hardware: Storage Controller
DirectRAID:
- Combines features of RAID 3, RAID 5, and RAID 0 in an 8 + 1p + 1s layout (8 data disks, 1 parity, 1 spare)
- Very low performance impact during disk rebuilds
- Large sector sizes (up to 8 kB) supported
- Eliminates host-side striping
Benchmark: IOZone file system benchmark with 8 threads, each writing 2 GB files, on one server. Tested first in "Normal" mode with all disks healthy, then in "Rebuild" mode with one disk in the process of being replaced by a global hot spare.

Software

Software: Shared-Disk File System
- Runs on RAID volumes exported by the storage arrays (LUNs, Logical Units)
- Can be mounted by multiple servers simultaneously
- A lock manager ensures consistency of operations
- Scales almost linearly up to (at least) 4 nodes
Test setup: IOZone with 8 threads using O_DIRECT I/O, LUNs striped over 100+ disks, 2 Gbps Fibre Channel connections to the disk array.

Writer Service: Design Goals
- Enable a large number of HLT farm servers to write to disk
- Write data to the shared-disk file system at close to the maximum disk throughput
- Failover and failback with no data loss
- Load balancing between instances
- Write hundreds of concurrent files per server

Writer Service: Discovery
- Discovery and status updates are performed through multicast
- A Service Table maintains the current status of all known hosts
- The Service Table contents are continuously relayed to all connected Gaudi Writer Processes in the HLT farm
[Diagram: Writer Service instances 1-3 exchange multicast messages and relay Service Table information to Gaudi Writer Processes 1 and 2]
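As a rough illustration, a multicast status announcement of this kind could look like the sketch below. The group address, port, and message format are illustrative assumptions, not details of the actual LHCb service:

```cpp
// Minimal sketch of a multicast status announcement for service discovery.
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cstdio>
#include <cstring>

int main() {
    int sock = socket(AF_INET, SOCK_DGRAM, 0);
    if (sock < 0) { std::perror("socket"); return 1; }

    sockaddr_in group{};
    group.sin_family = AF_INET;
    group.sin_port = htons(4446);                        // illustrative port
    inet_pton(AF_INET, "239.192.0.1", &group.sin_addr);  // illustrative group

    // One periodic announcement; listeners (other writer-service instances and
    // the Gaudi writer processes) would update their service tables from it.
    const char msg[] = "WRITER_SERVICE host=store01 status=UP";
    if (sendto(sock, msg, sizeof(msg), 0,
               reinterpret_cast<const sockaddr*>(&group), sizeof(group)) < 0) {
        std::perror("sendto");
    }
    close(sock);
    return 0;
}
```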

Writer Process: Writing
- Cache every event
- Send it to the Writer Service
- Wait for the acknowledgement
- Flush and free the cached copy
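A minimal sketch of this lifecycle, assuming a simple sequence-numbered chunk protocol; the class and member names are illustrative, not from the LHCb code base:

```cpp
// Every event is kept in memory until the writer service acknowledges it,
// and only then flushed and freed.
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

struct EventChunk {
    std::uint64_t seq;            // sequence number matched against acknowledgements
    std::vector<char> payload;    // serialised event data
};

class WriterClient {
public:
    // Cache the event and hand it to the writer service.
    void write(std::uint64_t seq, std::vector<char> data) {
        pending_.emplace(seq, EventChunk{seq, std::move(data)});
        sendToService(pending_.at(seq));
    }
    // Acknowledgement received: only now is it safe to flush and free.
    void onAck(std::uint64_t seq) { pending_.erase(seq); }

private:
    void sendToService(const EventChunk&) { /* network send, omitted */ }
    std::map<std::uint64_t, EventChunk> pending_;   // unacknowledged events
};

int main() { return 0; }  // the class above is the point of this sketch
```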

Writer Service: Failover
- Writer Processes are aware of all instances of the Writer Service
- Each data chunk is entirely self-contained
- Writing a data chunk is idempotent
- If a Writer Service instance fails, the Writer Process reconnects to the next entry in its Service Table, receives an updated Service Table, and replays the unacknowledged data chunks
[Diagram: a Gaudi Writer Process fails over from Writer Service 1 to Writer Service 2 and replays unacknowledged data chunks]
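In the same spirit, a sketch of the replay step, relying on the idempotence property stated on this slide: because each chunk names its target file and offset, resending a chunk that already reached the disk simply rewrites identical bytes. The types and the send callback are illustrative assumptions:

```cpp
// Replay of unacknowledged chunks after failing over to another writer service.
#include <cstdint>
#include <functional>
#include <map>
#include <string>
#include <vector>

struct DataChunk {
    std::string file;             // target file on the shared-disk file system
    std::uint64_t offset;         // absolute offset within that file
    std::vector<char> payload;    // event data
};

// Called after reconnecting to the next writer service in the service table:
// every chunk that was never acknowledged is replayed on the new connection.
void replayUnacknowledged(const std::map<std::uint64_t, DataChunk>& pending,
                          const std::function<void(const DataChunk&)>& send) {
    for (const auto& entry : pending)
        send(entry.second);       // safe to resend: the write is idempotent
}

int main() {
    std::map<std::uint64_t, DataChunk> pending;   // would hold unacknowledged chunks
    replayUnacknowledged(pending, [](const DataChunk&) { /* resend over the network */ });
    return 0;
}
```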

Writer Service: Throughput
- Cached (page-cache) concurrent write performance for large numbers of files is insufficient: large CPU and memory load from the extra memory copy
- O_DIRECT reduces CPU and memory usage
- Data must be page-aligned for O_DIRECT, but the incoming event data are not aligned to anything
Test setup: custom test writing 32 files per thread x 8 threads, write sizes from 32 bytes to 1 MB, LUNs striped over 16 disks, 2 Gbps Fibre Channel connections to the disk array.
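A minimal sketch of a page-aligned O_DIRECT write on Linux, showing the alignment constraint that forces the service to repack the unaligned event data before writing; this is not the LHCb writer code, and the path and sizes are arbitrary:

```cpp
// O_DIRECT requires the buffer address, transfer size, and file offset to be
// multiples of the block size; posix_memalign provides a suitable buffer.
#include <fcntl.h>      // open, O_DIRECT (Linux/GNU)
#include <unistd.h>     // write, close
#include <cstdio>       // perror
#include <cstdlib>      // posix_memalign, free
#include <cstring>      // memset

int main() {
    const size_t kAlign = 4096;      // page/sector alignment
    const size_t kSize  = 1 << 20;   // 1 MB, a multiple of kAlign

    void* buf = nullptr;
    if (posix_memalign(&buf, kAlign, kSize) != 0) return 1;
    std::memset(buf, 0, kSize);      // unaligned event data would be packed in here

    // O_DIRECT bypasses the page cache, avoiding the extra memory copy and the
    // CPU load it causes, but only accepts the aligned buffer prepared above.
    int fd = open("odirect_test.dat", O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0) { std::perror("open"); std::free(buf); return 1; }

    if (write(fd, buf, kSize) < 0) std::perror("write");

    close(fd);
    std::free(buf);
    return 0;
}
```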

Writer Service: Throughput per Server
- Scales up with the number of clients
- Write throughput within 3% of the maximum achievable through GFS
- CPU utilisation ~7-10%
Test setup: custom test with event sizes ranging from 1 byte to 2 MB, LUNs striped over 16 disks, 2 Gbps Fibre Channel connections to the disk array, 2 x 1 Gbps Ethernet connections to the server.

Conclusions & Future Work
- A solution that offers high read and write throughput with minimal overhead
- Can be scaled up easily by adding hardware
- Failover with no performance penalty
- A more sophisticated "Trickle" load-balancing algorithm is being prototyped
- A full-POSIX file system version may be worth implementing in the future

Thank You

Writer Service
- Linux-HA is not suited for load balancing
- Linux Virtual Server is not suited for write workloads
- NFS in sync mode is too slow; async mode can lead to data loss on failure