Computing for the LHC
Alberto Pace, CERN IT Department, Data & Storage Services (DSS)
CH-1211 Genève 23, Switzerland

Experiment data streams
A typical data stream for an LHC experiment:
–Detector output: 40 MHz (1000 TB/sec)
–Level 1 (special hardware): 75 kHz (75 GB/sec)
–Level 2 (embedded processors): 5 kHz (5 GB/sec)
–Level 3 (farm of commodity CPUs): 100 Hz (300 MB/sec), sent to data recording & offline analysis
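As a back-of-the-envelope check (not on the slide itself), a recorded rate of roughly 300 MB/sec integrated over an assumed 10⁷ seconds of beam per year gives a few petabytes per experiment, the same order as the ~15 PB/year quoted on the next slide for all experiments combined. A minimal sketch of that arithmetic, where the beam time and the number of experiments are assumptions:

# Back-of-the-envelope data-volume estimate (assumed figures, not official numbers).

RECORDED_RATE_MB_S = 300        # Level 3 output rate per experiment (from the slide)
SECONDS_OF_BEAM = 1e7           # assumed effective accelerator seconds per year
N_EXPERIMENTS = 4               # assumed number of large LHC experiments

per_experiment_pb = RECORDED_RATE_MB_S * SECONDS_OF_BEAM / 1e9   # MB -> PB
total_pb = per_experiment_pb * N_EXPERIMENTS

print(f"~{per_experiment_pb:.1f} PB/year per experiment, ~{total_pb:.0f} PB/year in total")
# -> ~3.0 PB/year per experiment, ~12 PB/year in total (same order as the ~15 PB quoted)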

LHC data flows
The accelerator will run for many years
Experiments produce about 15 million gigabytes (15 PB) of data each year (about 2 million DVDs!), nearly 1 petabyte per week
LHC data analysis requires a computing power equivalent to ~100,000 of today's fastest PC processors
Requires many cooperating computer centres, as CERN can only provide ~20% of the capacity
(Slide graphic: one year of LHC data as a stack ~20 km high, compared with Concorde at 15 km and Mont Blanc at 4.8 km)

Computing available at CERN
High-throughput computing based on commodity technology
–CPU: more than … cores in 6000 boxes (Linux)
–Disk: 14 petabytes on 41’000 drives (NAS disk storage)
–Tape: 34 petabytes on 45’000 tape slots

Organized data replication
Grid services are based on middleware which manages access to computing and storage resources distributed across hundreds of computer centres
The middleware and the experiments’ computing frameworks distribute the load wherever available resources are found

LCG Service Hierarchy
Tier-0: the accelerator centre
–Data acquisition & initial processing
–Long-term data preservation
–Distribution of data to the Tier-1 centres
Tier-1: “online” to the data acquisition process, high availability
–Managed mass storage, grid-enabled data service
–Data-heavy analysis
–National and regional support
–Canada – TRIUMF (Vancouver); France – IN2P3 (Lyon); Germany – Forschungszentrum Karlsruhe; Italy – CNAF (Bologna); Netherlands – NIKHEF/SARA (Amsterdam); Nordic countries – distributed Tier-1; Spain – PIC (Barcelona); Taiwan – Academia Sinica (Taipei); UK – CLRC (Oxford); US – FermiLab (Illinois) and Brookhaven (NY)
Tier-2: ~200 centres in ~35 countries
–Simulation
–End-user analysis, batch and interactive

Internet connections
(Slide graphic only)

LHC Grid Activity
Distribution of work across Tier-0 / Tier-1 / Tier-2 illustrates the importance of the grid system
–The Tier-2 contribution is around 50%; more than 85% of the work is external to CERN
Data distribution from CERN to Tier-1 sites
–The latest run shows that the data rates required for LHC start-up have been reached and can be sustained over long periods

Organized data replication
(Slide graphic only)

CERN as the core point
CERN has the highest storage requirements in the entire LHC computing grid
–All “raw” data must be stored at CERN (> 15 PB per year)
–All data must be replicated several times to the Tier-1 centres

The computer centre revolution
(Slide graphic: the CERN computer centre today)

Network data flows
(Slide graphic only)

Key figures
–CPU servers: O(20K) processors
–Disk servers: O(20K) disks, 7.5 PB
–Tape servers and tape library: 44 PB on tape
–Database servers: 270 CPU and disk servers, 140 TB (after RAID10)
–Network routers: 80 Gbit/s
–Electricity and cooling: 2.9 MW

Grid applications spinoffs
More than 25 applications from an increasing number of domains
–Astrophysics
–Computational Chemistry
–Earth Sciences
–Financial Simulation
–Fusion
–Geophysics
–High Energy Physics
–Life Sciences
–Satellite imagery
–Material Sciences
–…
Summary of applications report:

Virtual visit to the computer centre
(A series of photo slides follows. Photo credit: Alexey Tselishchev)

And the antimatter...
(Photo slide)

Areas of research in Data Management
Reliability, Scalability, Security, Manageability

Arbitrary reliability (1/4)
Decouple “service reliability” from “hardware reliability”
–Generate sufficient error-correction information, scattered on independent storage, to allow reconstruction of the data from any foreseeable hardware failure
–“Dispersed storage”
–Use commodity storage (or whatever minimizes the cost)
(Slide graphic: a file is split into M fixed-size data chunks plus N error-correction chunks, each with a checksum)

Arbitrary scalability (2/4)
Global namespace, but avoid a database bottleneck
Options for metadata:
–Metadata stored with the data but federated asynchronously
–Central metadata repository, asynchronously replicated locally
Metadata consistency “on demand”

Global and secure access (3/4)
Access as a “file system”, using internet protocols
–Xroot, NFS 4.1
Strong authentication, with “global identities”
–(e.g. X.509 certificates instead of local accounts)
Fine-grained authorization
–Supporting user aggregations (groups)
Detailed accounting and journaling
Quotas

High manageability (4/4)
Data migrates automatically to new hardware, freeing storage marked for retirement
In-depth monitoring, with constant performance measurement and alarm generation
Operational procedures are limited to “no-brainer” box replacement

Storage reliability
Reliability is related to the probability of losing data
–Definition: “the probability that a storage device will perform an arbitrarily large number of I/O operations without data loss during a specified period of time”

Reminder: types of RAID
RAID 0
–Disk striping
RAID 1
–Disk mirroring
RAID 5
–Parity information is distributed across all disks
RAID 6
–Uses (Reed–Solomon) error correction, allowing the loss of 2 disks in the array without data loss

Reed–Solomon error correction
... is an error-correcting code that works by oversampling a polynomial constructed from the data
Any k distinct points uniquely determine a polynomial of degree at most k − 1
The sender determines the polynomial (of degree k − 1) that represents the k data points. The polynomial is “encoded” by its evaluation at n (≥ k) points.
If at most n − k values are lost at known positions (erasures), or at most ⌊(n − k)/2⌋ values are corrupted at unknown positions, the receiver can recover the original polynomial.
Note: only for n − k ≤ 2 are there simple, efficient implementations
–n − k = 1 is RAID 5 (parity)
–n − k = 2 is RAID 6

Reed–Solomon (simplified) example
4 numbers to encode: { 1, −6, 4, 9 } (k = 4), taken as the coefficients of a polynomial of degree 3 (k − 1):
y = x³ − 6x² + 4x + 9
We encode the polynomial by evaluating it at n = 7 points (here x = −1, 0, 1, 2, 3, 4, 5), obtaining { −2, 9, 8, 1, −6, −7, 4 }
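A minimal Python sketch of this encoding step; the choice of sample points x = −1 … 5 is an assumption that reproduces the values on the slide:

# Reed-Solomon toy encoder: treat the k data values as polynomial coefficients
# and "encode" them as the polynomial's values at n >= k sample points.
# (Real implementations work in a finite field; plain integers are used here
# only to mirror the slide's simplified example.)

def encode(coefficients, xs):
    """Evaluate the polynomial with the given coefficients at each x."""
    def poly(x):
        # coefficients are ordered from the highest power down to the constant
        result = 0
        for c in coefficients:
            result = result * x + c   # Horner's rule
        return result
    return [poly(x) for x in xs]

data = [1, -6, 4, 9]                     # k = 4 values: y = x^3 - 6x^2 + 4x + 9
sample_points = [-1, 0, 1, 2, 3, 4, 5]   # n = 7 points (assumed choice)

print(encode(data, sample_points))       # -> [-2, 9, 8, 1, -6, -7, 4], as on the slide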

Reed–Solomon (simplified) example
To reconstruct the polynomial, any 4 of the 7 points are enough: we can lose any 3 points, provided we know which ones are missing (erasures).
If we do not know which points are corrupted, only ⌊(n − k)/2⌋ = 1 wrong value is guaranteed to be correctable: we look for the polynomial of degree 3 on which at least 6 of the 7 points are “aligned”.
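A sketch of the erasure case, continuing the toy integer example above: any 4 surviving (x, y) pairs rebuild the degree-3 polynomial by Lagrange interpolation and hence the original data (the helper names are illustrative):

from fractions import Fraction

def poly_mul(p, q):
    """Multiply two polynomials given as coefficient lists (constant term first)."""
    result = [Fraction(0)] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            result[i + j] += a * b
    return result

def lagrange_interpolate(points):
    """Return the coefficients (constant term first) of the unique polynomial
    of degree len(points)-1 passing through the given (x, y) points."""
    n = len(points)
    total = [Fraction(0)] * n
    for i, (xi, yi) in enumerate(points):
        basis = [Fraction(1)]              # start with the constant polynomial 1
        denom = Fraction(1)
        for j, (xj, _) in enumerate(points):
            if j != i:
                basis = poly_mul(basis, [Fraction(-xj), Fraction(1)])  # times (x - xj)
                denom *= (xi - xj)
        scale = Fraction(yi) / denom
        total = [t + scale * b for t, b in zip(total, basis)]
    return total

# Any 4 of the 7 encoded points are enough (here we pretend 3 were lost):
surviving = [(-1, -2), (1, 8), (3, -6), (5, 4)]
coeffs = lagrange_interpolate(surviving)           # constant term first
print([int(c) for c in reversed(coeffs)])          # -> [1, -6, 4, 9], the original data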

RAID 5 reliability
–Disks are grouped in sets of equal size. If c is the capacity of one disk and n is the number of disks, each set has a usable capacity of c (n − 1)
example: 6 disks of 1 TB can be aggregated into a “reliable” set of 5 TB
–The set is immune to the loss of 1 disk. The loss of 2 disks implies the loss of the entire set content.

RAID 6 reliability
–Disks are grouped in sets of arbitrary size. If c is the capacity of one disk and n is the number of disks, each set has a usable capacity of c (n − 2)
example: 12 disks of 1 TB can be aggregated into a “reliable” set of 10 TB
–The set is immune to the loss of 2 disks. The loss of 3 disks implies the loss of the entire set content.

Some calculations for RAID 5
Disk MTBF is between 3 × 10⁵ and 1.2 × 10⁶ hours
Replacement time of a failed disk is < 4 hours
Chain of estimates, using p(A and B) = p(A) · p(B|A), which for independent A and B is p(A) · p(B); the resulting numbers are worked out in the sketch after the RAID 6 slide:
–Probability of 1 disk failing within the next 4 hours
–Probability of having a failing disk within the next 4 hours in a 15 PB computer centre (15’000 disks)
–For a RAID set of 10 disks: probability of one of the remaining disks failing within 4 hours; increase this by two orders of magnitude, as the failure could be due to common factors (over-temperature, high noise, EMP, high voltage, a faulty common controller, ...)
–Probability of losing computer-centre data in the next 4 hours
–Probability of losing data in the next 10 years

Same calculations for RAID 6
–Probability of 1 disk failing within the next 4 hours
–For a RAID set of 10 disks: probability of one of the remaining 9 disks failing within 4 hours (increased by two orders of magnitude)
–Probability of another of the remaining 8 disks failing within 4 hours (also increased by two orders of magnitude)
–Probability of losing data in the next 4 hours
–Probability of losing data in the next 10 years
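A minimal sketch that recomputes the chain of estimates on these two slides, assuming the pessimistic MTBF of 3 × 10⁵ hours, the 4-hour replacement window, 15’000 disks, RAID sets of 10 disks and the ×100 common-mode factor mentioned above; the way the probabilities are combined is itself an assumption, intended only as an order-of-magnitude reproduction:

# Rough RAID 5 / RAID 6 data-loss estimate following the chain of steps on the
# slides. All inputs are the slide's assumptions; the arithmetic is only an
# order-of-magnitude exercise, not a precise reliability model.

MTBF_HOURS = 3e5            # pessimistic disk MTBF from the slide
WINDOW_HOURS = 4            # replacement time of a failed disk
N_DISKS_CENTRE = 15_000     # disks in a 15 PB computer centre
RAID_SET = 10               # disks per RAID set
COMMON_MODE = 100           # x100 for correlated failures (temperature, controller, ...)
TEN_YEARS_WINDOWS = 10 * 365 * 24 / WINDOW_HOURS   # number of 4-hour windows in 10 years

p_disk = WINDOW_HOURS / MTBF_HOURS                 # one disk fails in the next 4 hours
p_any = N_DISKS_CENTRE * p_disk                    # some disk in the centre fails

# RAID 5: after the first failure, data is lost if any of the other 9 disks fails too
p_second = (RAID_SET - 1) * p_disk * COMMON_MODE
p_loss_raid5 = p_any * p_second
p_loss_raid5_10y = 1 - (1 - p_loss_raid5) ** TEN_YEARS_WINDOWS

# RAID 6: two further failures in the same set are needed within the window
p_third = (RAID_SET - 2) * p_disk * COMMON_MODE
p_loss_raid6 = p_any * p_second * p_third
p_loss_raid6_10y = 1 - (1 - p_loss_raid6) ** TEN_YEARS_WINDOWS

print(f"p(single disk fails in 4 h)          = {p_disk:.2e}")
print(f"p(some disk in centre fails in 4 h)  = {p_any:.2e}")
print(f"RAID 5: loss in 4 h {p_loss_raid5:.2e}, in 10 years {p_loss_raid5_10y:.2%}")
print(f"RAID 6: loss in 4 h {p_loss_raid6:.2e}, in 10 years {p_loss_raid6_10y:.2%}")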

Arbitrary reliability
RAID is “disk” based, which lacks granularity
For increased flexibility, an alternative would be to use files... but files do not have a constant size
File “chunks” are the solution (see the sketch below):
–Split files into chunks of size “s”
–Group them in sets of “m” chunks
–For each group of “m” chunks, generate “n” additional chunks so that any “m” chunks chosen among the “m + n” suffice to reconstruct the missing ones
–Scatter the “m + n” chunks on independent storage
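A minimal sketch of the chunking step with single parity (n = 1, i.e. one XOR chunk per group of m data chunks), the simplest instance of this scheme; the chunk size and group size are illustrative assumptions, and larger n would need Reed–Solomon coding as discussed above:

# Split a file into fixed-size chunks, group them by m, and add one XOR parity
# chunk per group (n = 1). Any single missing chunk in a group can then be
# rebuilt by XOR-ing the m remaining ones.

CHUNK_SIZE = 1024 * 1024          # s: 1 MiB chunks (illustrative)
GROUP_SIZE = 8                    # m: data chunks per parity group (illustrative)

def split_into_chunks(data: bytes, size: int = CHUNK_SIZE):
    return [data[i:i + size] for i in range(0, len(data), size)]

def xor_parity(chunks):
    """Byte-wise XOR of the chunks, shorter chunks padded with zeros."""
    length = max(len(c) for c in chunks)
    acc = 0
    for chunk in chunks:
        acc ^= int.from_bytes(chunk.ljust(length, b"\x00"), "big")
    return acc.to_bytes(length, "big")

def encode(data: bytes):
    """Return a list of groups; each group is m data chunks plus 1 parity chunk."""
    chunks = split_into_chunks(data)
    groups = []
    for i in range(0, len(chunks), GROUP_SIZE):
        group = chunks[i:i + GROUP_SIZE]
        groups.append(group + [xor_parity(group)])
    return groups

def recover(group, missing_index):
    """Rebuild one missing chunk of a group from the remaining ones
    (exact for full-size chunks; a shorter last chunk comes back zero-padded)."""
    survivors = [c for i, c in enumerate(group) if i != missing_index]
    return xor_parity(survivors)

data = bytes(range(256)) * 20_000     # ~5 MB of test data
groups = encode(data)
rebuilt = recover(groups[0], 2)       # pretend chunk 2 of the first group was lost
assert rebuilt == groups[0][2]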

Arbitrary reliability with the “chunk” based solution
The reliability is independent of the size “s”, which is arbitrary
–Note: both very large and very small “s” impact performance
Whatever the reliability of the hardware, the system is immune to “n” simultaneous failures within a pool of “m + n” storage chunks
–Both “m” and “n” are arbitrary, therefore arbitrary reliability can be achieved
The fraction of raw storage space lost to redundancy is n / (n + m)
–The space overhead can also be reduced arbitrarily by increasing m, at the cost of increasing the amount of data lost if a failure beyond n chunks should ever happen

Analogy with the gambling world
–We just showed that you can achieve “arbitrary reliability” at the cost of an “arbitrarily low” amount of disk space, by increasing the amount of data you accept losing when a failure beyond the redundancy does happen.
–In the gambling world there are several playing schemes that allow you to win an arbitrary amount of money with an arbitrary probability.
–Example: you can easily win 100 euros with > 99 % probability by playing up to 7 times on the “Red” of a French roulette and doubling the bet until you win. The probability of not getting “Red” 7 times in a row is (19/37)⁷ ≈ 0.94 %, so you just need to take the risk of losing 12’700 euros with a 0.94 % probability (the bet-by-bet figures are worked out in the sketch below).
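The slide's bet-by-bet table did not survive the transcript; a minimal sketch that recomputes it from the doubling rule described above (initial bet of 100 euros, 18 red numbers out of 37):

# Recompute the martingale table for the roulette example: double the bet on
# "Red" (18 winning numbers out of 37) until it comes up, at most 7 times.

from fractions import Fraction

P_LOSE = Fraction(19, 37)          # probability that a single spin is not Red
INITIAL_BET = 100

cumulated = 0
print("round  bet  cumulated  p(win at this round)")
for k in range(1, 8):
    bet = INITIAL_BET * 2 ** (k - 1)
    cumulated += bet
    p_win_here = P_LOSE ** (k - 1) * (1 - P_LOSE)   # lose k-1 spins, then win
    # Winning at round k pays back all previous bets plus 100 euros of profit.
    print(f"{k:5d} {bet:5d} {cumulated:9d}  {float(p_win_here):.4f}")

p_lose_all = float(P_LOSE ** 7)
print(f"p(losing all 7 spins) = {p_lose_all:.4f}  ->  lose {cumulated} euros")
# -> cumulated stake after 7 doublings is 12 700 euros, lost with ~0.94 % probability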

Practical comments
In practice n is 1 or 2
–n = 1: parity
–n = 2: parity + Reed–Solomon (double parity)
–Although possible, n > 2 has a computational impact that affects performance
Any m chunks of an (m + n) set are enough to obtain the information; they must be saved on independent media
–Performance can depend on m (and thus on s, the size of the chunks): the larger m is, the more the reading can be parallelized, until the client bandwidth is reached

How to tune m and n
1. Start from the MTBF published by your hardware vendor
2. Decide the “arbitrary” replacement time for faulty hardware
3. Choose m and n in order to obtain the reliability that you want
4. While running the system you MUST constantly monitor and record
–Hardware failures (so that you constantly re-estimate the true MTBF)
–The replacement time of faulty hardware
With these updated parameters you recalculate new values of m and n (a minimal version of steps 1–3 is sketched below)
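A minimal sketch of steps 1–3, assuming independent failures and that a group is lost only when more than n of its m + n chunks fail within one replacement window; this binomial model and the default figures are assumptions, not the slide's own formula:

# Pick the smallest n such that the probability of losing a chunk group
# (more than n of its m+n chunks failing within one replacement window)
# stays below a target. Simple independent-failure binomial model.

from math import comb

def group_loss_probability(m, n, mtbf_hours, replace_hours):
    p = replace_hours / mtbf_hours          # one chunk's storage fails in the window
    total = m + n
    # Data is lost when more than n chunks of the group fail in the same window.
    return sum(comb(total, k) * p**k * (1 - p)**(total - k)
               for k in range(n + 1, total + 1))

def choose_n(m, mtbf_hours, replace_hours, target, n_max=6):
    for n in range(1, n_max + 1):
        if group_loss_probability(m, n, mtbf_hours, replace_hours) < target:
            return n
    return None                              # no n up to n_max meets the target

# Example (assumed figures): MTBF 3e5 h, 4 h replacement, m = 10, target 1e-15 per window
n = choose_n(m=10, mtbf_hours=3e5, replace_hours=4, target=1e-15)
print(f"smallest n meeting the target: {n}")
print(f"space overhead: {n / (n + 10):.1%}")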

Chunk transfers
Among many protocols, BitTorrent is the most popular
An SHA-1 hash (160-bit digest) is created for each chunk
All digests are assembled in a “torrent file” together with all relevant metadata
Torrent files are published and registered with a tracker, which maintains lists of the clients currently sharing the torrent’s chunks
In particular, torrent files have:
–an “announce” section, which specifies the URL of the tracker
–an “info” section, containing (suggested) names for the files, their lengths, and the list of SHA-1 digests
Reminder: it is the client’s duty to reassemble the initial file, and therefore it is the client that always verifies the integrity of the data received
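A minimal sketch of the per-chunk hashing step using hashlib's SHA-1; the chunk size and the file path are illustrative, and the torrent-file layout itself is not reproduced here:

# Compute one SHA-1 digest (160 bits) per fixed-size chunk of a file, the same
# kind of per-piece digests that end up in the "info" section of a torrent file.

import hashlib

CHUNK_SIZE = 256 * 1024          # illustrative piece size

def chunk_digests(path, chunk_size=CHUNK_SIZE):
    digests = []
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            digests.append(hashlib.sha1(chunk).hexdigest())
    return digests

# Example: print the digest of every chunk of a local file (hypothetical path)
for i, d in enumerate(chunk_digests("data.bin")):
    print(f"chunk {i:4d}: {d}")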

Reassembling the chunks
Two options (slide diagram):
–Data reassembled directly on the client (BitTorrent-style client)
–Reassembly done by the data management infrastructure (middleware)

Identify broken chunks
The SHA-1 hash (160-bit digest) guarantees chunk integrity. It tells you which chunks are corrupted, turning errors into erasures, so the n redundant chunks allow recovery from up to n bad chunks (instead of only ⌊n/2⌋ if you did not know which chunks are corrupted).
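A sketch of the verification step, reusing the per-chunk digest idea above: compare each received chunk against its expected digest and report the indices to treat as erasures (function and variable names are illustrative):

# Verify received chunks against their expected SHA-1 digests and return the
# indices of the corrupted ones, i.e. the erasure positions to pass to the
# chunk-reconstruction step.

import hashlib

def find_broken_chunks(chunks, expected_digests):
    broken = []
    for i, (chunk, expected) in enumerate(zip(chunks, expected_digests)):
        if hashlib.sha1(chunk).hexdigest() != expected:
            broken.append(i)
    return broken

# Toy usage: corrupt one chunk and detect it
chunks = [b"chunk-0", b"chunk-1", b"chunk-2"]
digests = [hashlib.sha1(c).hexdigest() for c in chunks]
chunks[1] = b"garbled"                       # simulate corruption in transit
print(find_broken_chunks(chunks, digests))   # -> [1]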

Questions?