1 First Experiences with Ceph on the WLCG Grid
Rob Appleyard, Shaun de Witt, James Adams, Brian Davies

2 Contents
- Who am I and where am I coming from?
- What is Ceph? What is an object store?
- Why are we interested in it?
- Comparative performance
- What will it cost us?

3 Introduction
- Me: Rob Appleyard
  - Sysadmin at the Rutherford Appleton Laboratory (RAL)
  - Working on LHC data storage at RAL for 2.5 years
- Where this talk is coming from…
  - Discussion of Ceph and how it works
  - Findings of an internal evaluation exercise

4 RAL – An LHC Tier 1 Computing Centre
- Our current situation: CASTOR disk/tape
  - ~17 PB of RAID 6 storage nodes
  - ~13 PB of tape
- Our plans
  - We need a better disk-only file system
  - Don't touch the tapes!

5 What is an Object Store?
- An object store manages each chunk of data that goes into it as one or more objects.
- The structure is flat: the system just has a collection of objects with associated metadata.
  - As opposed to a file system, which uses a file hierarchy.
  - Lower levels of storage are abstracted away.
- Capabilities:
  - Distributed/redundant metadata, separated from the data.
  - Scalable to multi-petabyte levels.
- You can then impose a filesystem/namespace on top of the object store (Ceph: CephFS).
  - Or do whatever else; the object store doesn't care.
  - (A short sketch of the flat object interface follows below.)
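To make the flat object model concrete, here is a minimal sketch using the python-rados bindings that ship with Ceph. The pool name 'data' and the object name 'event-001' are placeholders for illustration, not anything taken from the talk.

```python
# Minimal sketch of storing and reading an object with python-rados.
# Assumes a running Ceph cluster, /etc/ceph/ceph.conf, and an existing pool
# called "data" (pool and object names here are illustrative only).
import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
try:
    ioctx = cluster.open_ioctx('data')          # I/O context for one pool
    try:
        # Objects live in a flat namespace: just a name, some bytes,
        # and optional per-object metadata (xattrs).
        ioctx.write_full('event-001', b'raw detector payload')
        ioctx.set_xattr('event-001', 'run', b'2013-12')
        print(ioctx.read('event-001'))
    finally:
        ioctx.close()
finally:
    cluster.shutdown()
```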

6 Why do we want to use Ceph?
- Generic, FREE, non-domain-specific solution.
  - Incorporated into the Linux kernel.
- CERN's plan for CASTOR tape is to run Ceph as the file system underneath it.
  - Cut out the middleman!
- Improved resilience
  - Under CASTOR, the loss of one node loses all files on that node.
  - With Ceph, this is not a problem: distributed placement groups.
- Flexibility
  - Ceph is also planned for Tier 1 and departmental cloud storage.

7 Performance
- In an early-2013 performance comparison exercise for disk-only storage, Ceph looked… not great.

8 Performance (Ctd.)
- Why so slow?
  - The Ceph instance was set up for one master and one replica.
  - It waits for both copies to be written before acknowledging the write.
- Performance testing on a new test instance soon.
- New feature, 'tiering', could help (sketch below):
  - Manages a fast (SSD) cache sitting at the front,
  - then passes data to the back end.
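The tiering mentioned above is configured per pool. Below is a hedged sketch of the relevant ceph CLI calls, driven from Python for consistency with the other examples; the pool names 'ssd-cache' and 'disk-data' are invented, and exact behaviour varies between Ceph releases (tiering was brand new at the time of this talk).

```python
# Illustrative sketch only: put an SSD pool in front of a disk pool as a
# writeback cache tier. Pool names are hypothetical; try this on a test
# cluster, not production.
import subprocess

def ceph(*args):
    """Run a ceph admin command and fail loudly if it errors."""
    subprocess.run(["ceph", *args], check=True)

# Attach the cache pool to the backing data pool.
ceph("osd", "tier", "add", "disk-data", "ssd-cache")
# Absorb writes in the cache, flushing to the backing pool later.
ceph("osd", "tier", "cache-mode", "ssd-cache", "writeback")
# Redirect client I/O aimed at disk-data through the cache tier.
ceph("osd", "tier", "set-overlay", "disk-data", "ssd-cache")
```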

9 Cost Modelling: The Test
- Examine whether Ceph is a viable replacement for CASTOR from a hardware cost perspective.
  - Budgets are being squeezed… we can't exceed CASTOR's hardware budget.
- Use a vendor's website to price up nodes for CASTOR and Ceph.
- Different requirements:
  - CASTOR needs better drives, RAID controllers, etc.
  - …and headnodes (not included in the costing).
  - But Ceph needs more drives.

10 Cost Modelling: The Numbers
- Based on commodity nodes from Broadberry with 36 * 3 TB SATA drives; prices from Dec 2013…
- CASTOR:
  - RAID 6 at node level: $113/TB (we actually buy SAS drives, so this is an underestimate)
- Ceph:
  - 1 master copy with 2 additional replicas: $313/TB
  - 1 master copy with 1 additional replica: $208/TB
  - 1 master copy with 2 erasure-coded disks per 16 disks: $119/TB
  - 1 master copy with 1 erasure-coded disk per 16 disks: $111/TB
- (A worked sketch of these figures follows below.)
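The relative prices above follow directly from the storage overhead of each layout. Here is a small worked sketch, assuming a raw hardware price of roughly $104 per TB of disk; that figure is back-calculated by me rather than quoted in the talk, and it reproduces the slide's numbers only to within a few dollars.

```python
# Rough reconstruction of the $/TB comparison. The raw price per TB of disk
# is an assumption chosen to approximately match the quoted figures; the real
# numbers came from a vendor quote in Dec 2013.
RAW_USD_PER_TB = 104.0

def usd_per_usable_tb(overhead):
    """Effective price per usable TB given a raw-to-usable overhead factor."""
    return RAW_USD_PER_TB * overhead

layouts = {
    "CASTOR, RAID 6 (36 disks, 2 parity)":     36 / 34,  # ~ $113/TB on the slide
    "Ceph, 3x replication (master + 2)":       3 / 1,    # ~ $313/TB
    "Ceph, 2x replication (master + 1)":       2 / 1,    # ~ $208/TB
    "Ceph, erasure coding, 16 data + 2 parity": 18 / 16, # ~ $119/TB
    "Ceph, erasure coding, 16 data + 1 parity": 17 / 16, # ~ $111/TB
}

for name, overhead in layouts.items():
    print(f"{name:45s} ${usd_per_usable_tb(overhead):6.0f}/TB")
```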

11 Cost Modelling: The Conclusion
- Ceph must fit into CASTOR's budget, therefore we can't use straight replication.
- The cost difference between 1 and 2 erasure-coded copies is pretty small, and 2 is much better than 1!
- Not included: power, cooling, staff effort…
  - …but a lot of this should be similar to CASTOR (we hope!)

12 The Future
- RAL: large (1 PB) test instance
  - Performance should be better than last time.
  - 1 replica initially, then try erasure codes…
  - Look to deploy into production as a CASTOR replacement in early 2015.
  - One really big instance rather than one per experiment.
- Risks…
  - Big change.
  - Erasure coding not yet stable.
- Future development for WLCG?
  - CERN are working on a plug-in bundle optimised for XRootD,
  - and also an optimised file system to replace CephFS.

13 Any Questions? Contact: Rob.Appleyard@stfc.ac.uk Shaun.de-witt@stfc.ac.uk

15 Spare Slides…

16 Why are we interested in Ceph?
- Up to now, we have not seen a reason to move away from CASTOR.
  - We did a full survey of our options during 2012 and found nothing that was sufficiently superior to CASTOR to be worth the effort of migration.
- But things move on…
  - CASTOR is proving problematic with the new WLCG protocol (xroot).
  - CERN is seriously considering running Ceph under CASTOR for tape.
    - If we'll be running it anyway, why not cut out the middleman?
  - Some issues previously identified in Ceph are, or will soon be, addressed: erasure encoding, stability of CephFS.

17 Why Ceph?
- The CASTOR team want to drop support for the current file system.
- If Ceph works as well as planned, it gets us out of particle-physics-specific software.
  - Except that CERN are contributing to the code base.
- Improved resilience
  - Currently, the loss of 3 disks on a server will (probably) mean the loss of all files on it.
  - Under Ceph, a 3-disk loss will lose less (not quantified),
    - assuming suitable erasure encoding/duplication, e.g. 2 erasure-coded disks per 16 physical disks (sketch below).
- Improved support
  - Ceph is also planned for Tier 1 and SCD cloud storage.
  - More cross-team knowledge.
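For reference, a "16 data + 2 erasure" layout corresponds to an erasure-code profile with k=16 and m=2. A minimal sketch of how such a pool could be created with the ceph CLI, driven from Python for consistency with the other examples; the profile name, pool name, and placement-group count are invented and would need proper sizing.

```python
# Hypothetical sketch: create an erasure-coded pool with 16 data chunks and
# 2 coding chunks per object (k=16, m=2). Names and PG counts are illustrative.
import subprocess

def ceph(*args):
    subprocess.run(["ceph", *args], check=True)

# Define the erasure-code profile: any 2 of the 18 chunks can be lost.
ceph("osd", "erasure-code-profile", "set", "ec-16-2", "k=16", "m=2")

# Create a pool that uses the profile (128 placement groups as a placeholder).
ceph("osd", "pool", "create", "ec-data", "128", "128", "erasure", "ec-16-2")
```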

18 Plans and Times
- Currently developing a Quattor component.
- Plan is to deploy a 'small' test instance for all VOs:
  - 1 PB nominal capacity, less overhead.
  - Initially using CephFS and dual copy (see the sketch below); move to erasure encoding as soon as possible.
  - NO SRM.
  - Anticipate deployment late April/early May.
  - Integration of the XRootD RADOS plugin as soon as it is available.
- After some months of VO testing (Oct. 2014?):
  - Start migrating data from CASTOR to Ceph.
    - Need to work with the VOs to minimise pain: fewest possible files migrated.
  - Start moving capacity from CASTOR to Ceph.
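"Dual copy" here means a replicated pool with size 2 (one master plus one replica). Below is a hedged sketch of the corresponding pool settings; the pool name 'cephfs_data' is an assumption about how the CephFS data pool might be named, not something stated in the talk.

```python
# Illustrative only: configure a replicated pool for the initial dual-copy
# deployment. "cephfs_data" is an assumed pool name.
import subprocess

def ceph(*args):
    subprocess.run(["ceph", *args], check=True)

# Keep two copies of every object (one master + one replica)...
ceph("osd", "pool", "set", "cephfs_data", "size", "2")
# ...and allow the pool to keep serving I/O with a single surviving copy.
ceph("osd", "pool", "set", "cephfs_data", "min_size", "1")
```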

19 Costs
- Ceph with dual copy is too expensive in the long term; we need erasure encoding.
- Could deploy with the current set-up (1 copy, RAID 6, 1 hot spare)…
  - …but this is not recommended: we would lose the resilience advantage against disk loss.
- With erasure encoding…
  - A single erasure-encoded copy (without hot spare, 1 erasure disk per 17 data disks) is cheaper than the current setup, but less resilient.
  - A dual erasure-encoded copy (without hot spare, 2 erasure disks per 16 data disks) is about the same price, with better resilience.

20 Proposed Deployment
- Ideally a 'single' instance with quotas…
  - 'Single' meaning 1 disk instance plus 1 tape instance (still under CASTOR).
  - Using Ceph pools (quota sketch below):
    - Simpler to manage than the 4 instances currently set up.
    - Easier to shift space around according to demand.
- The problem may be the ALICE security model.
  - It may force us to run with 2 instances.
  - Work with ALICE to see if this can be mitigated.
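Per-experiment quotas on a single instance could be expressed as per-pool quotas. A hedged sketch, assuming one pool per VO; the pool names, PG counts, and quota sizes are invented for illustration.

```python
# Illustrative sketch: one Ceph pool per experiment, each capped with a byte
# quota, so a single shared instance can replace separate per-VO instances.
# Pool names and quota sizes are made up.
import subprocess

def ceph(*args):
    subprocess.run(["ceph", *args], check=True)

TIB = 1024 ** 4
vo_quotas_bytes = {"atlas": 400 * TIB, "cms": 300 * TIB, "lhcb": 200 * TIB}

for vo, quota in vo_quotas_bytes.items():
    ceph("osd", "pool", "create", vo, "128")   # 128 PGs as a placeholder
    ceph("osd", "pool", "set-quota", vo, "max_bytes", str(quota))
```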

21 Risks
Lots of them... this is just a sample:

Risk                          | Likelihood | Impact | Mitigation
Erasure encoding not ready    | Low        | High   | None
CephFS not performant         | Medium     | Medium | Switch to HTTP access
CephFS not stable             | Low        | High   | Switch to HTTP access
XRootD/RADOS plugin not ready | Medium     | High   | Use POSIX CephFS
Difficulty in migrating data  | High       | High   | Minimise data to be migrated
Difficult to administer       | Medium     | Medium | Use testing time to learn about the system
Ceph moves to support model   | Low        | Medium | Buy support from Inktank (or other vendor)

