Cloud computing and federated storage Doug Benjamin Duke University

Currency of the Realm
How does a physics experiment determine whether it is successful?
 o Refereed papers published and cited by peers (impact factor)
 o Students trained
Success is not measured in bytes delivered or events simulated.
High Energy Physics computing is by and large a means to an end, not the end in itself, very much like building and operating a detector subsystem.
Both cloud computing and federated storage, when done properly, help to enhance the physics output.

Physics analysis workflow – iterations required (Wouter Verkerke, NIKHEF)
NB: typical life span of an analysis = 6-9 months.
How often are analysis steps typically iterated, and why?
 – Simulation sample production: 1-3 times. Samples can be increased in size, new generators become available, a new version of the simulation/reconstruction software is required by the collaboration, etc.
 – Data sample production: 1-2 times. New data becomes available over time, the good-run list changes, data is reprocessed with a new version of the reconstruction.
 – Data analysis chain 1, 'ntuple making': 3-4 times. New input samples become available; changes/improvements to overlap-removal algorithms (which must be executed before preselection).
 – Data analysis chain 2, 'ntuple analysis': many times. The core of the physics analysis: need to test out many ideas on event selection algorithms and derived-quantity construction.
 – Statistical analysis of data: many times. Every time 'chain 2' is newly executed, plus more times to test new ideas on fitting, data modeling, etc.

Why cloud computing?
Clouds provide a mechanism to give the user a "private cluster" on shared resources.
 o To the user: he/she sees a cluster at their disposal.
 o To the organization: users run on shared resources, which hopefully levels out demand.
   o Not always true – peaks ahead of conferences.
Don't the resources already exist?
 o Yes, but...
 o In the US, the stimulus-funded clusters (more than 40 institutions received funding) will be 5 years old in 2015.
 o Need a solution for the future.

Some issues for a cloud analysis cluster
User workflow is such that users spin over their ntuples and derivatives many times.
Need persistent storage that can be used for intermediate output (think Dropbox).
Intermediate output will be input to further processing.
Need to be able to share files among members of the same analysis group – could be as small as two people or much larger.
Has to be as fast and as reliable as what people have now.

EC2 testing/configuration
In March, tested a small xrootd storage cluster on EC2: 1 redirector and 3 data servers.
Used spot instances and normal instances, with either ephemeral storage or EBS storage, to set up the xrootd data servers.
Measured rates of copying data into the EC2 appliance.
Work done in the Eastern zone (Virginia); input data from the US ATLAS federated xrootd system.
Work done over a month (remember this number).
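
For context, a minimal sketch of the kind of xrootd clustering configuration a one-redirector / N-data-server setup like this uses. Hostnames and the export path are hypothetical placeholders, and the exact directives depend on the xrootd release deployed:

    # shared config for redirector and data servers (hypothetical hostnames/paths)
    all.manager ec2-redirector.example.com:3121   # cmsd port (3121 is the default)
    all.role server                               # default role: data server
    if ec2-redirector.example.com
       all.role manager                           # this host acts as the redirector
    fi
    all.export /atlas/ec2data                     # path served to clients
    xrd.port 1094                                 # standard xrootd client port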

Write rates to EC2 ephemeral storage
Ephemeral storage (two partitions) joined with LVM, so the OS/xrootd sees one ext4 partition.
Avg 16 MB/s per data server; with the 3 data servers used, total avg ~45 MB/s write.
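
Joining the two ephemeral volumes into a single ext4 filesystem is a few standard LVM commands; a sketch follows, where the device names (/dev/xvdb, /dev/xvdc), volume group name, and mount point are assumptions, not the actual setup:

    pvcreate /dev/xvdb /dev/xvdc               # register both ephemeral disks with LVM
    vgcreate vg_ephemeral /dev/xvdb /dev/xvdc  # one volume group spanning both disks
    lvcreate -l 100%FREE -n lv_data vg_ephemeral
    mkfs.ext4 /dev/vg_ephemeral/lv_data        # the single ext4 partition seen by the OS/xrootd
    mount /dev/vg_ephemeral/lv_data /atlas/ec2data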

EBS write rates
Extreme copy (multi-site transfer): ~12 MB/s. Standard copy (site-to-site transfer): note the peak in the first bin, < 0.5 MB/s. Single LVM ext4 partition.
Can use the xrootd federation to load files into Amazon EC2 storage.
More work needs to be done to understand the lock-up with the frm_xfrd daemon.
Write rates: 16 MB/s avg per storage node with two ephemeral drives.
Need to further study the reliability of xrootd extreme copy mode.
Worth the effort due to the performance gains and potential reliability gains (multi-site sources).
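
As an illustration, loading a file into the EC2 cluster from the federation looks roughly like the commands below. The redirector hostnames and file paths are placeholders, and the -x extreme-copy switch is assumed from the xrdcp releases of that era rather than taken from the slides:

    # standard copy: pull one file from the ATLAS federation into the EC2 xrootd cluster
    xrdcp root://federation-redirector.example.org//atlas/data/ntup.root \
          root://ec2-redirector.example.com//atlas/ec2data/ntup.root

    # extreme copy: let xrdcp read the same file from multiple federated sources at once
    xrdcp -x root://federation-redirector.example.org//atlas/data/ntup.root \
          /atlas/ec2data/ntup.root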

Conclusions from EC2 work
EC2 gave reasonable write performance, though variable.
At the time, xrdcp in extreme mode was not functional due to a bug – subsequently fixed.
Costs were very surprising: ~$600 for several hundred GB of storage on the servers, instance running time, etc.
Not surprising, though, given the rates:
 o EC2 rates (today): $0.10/GB/month for normal EBS, $0.125/GB/month for IO-enhanced EBS, $0.125/GB/month for EBS snapshots
 o 1 TB -> $100/month to $125/month
 o Local group disk at over 10 sites in the US: > 50 TB/site, most > 100 TB

FutureGrid analysis cluster project

Analysis cluster project

FutureGrid setup
3 VMs (2 cores each): CentOS 5.7, xrootd, ROOT compiled locally, 10 GB disk per machine (total space).
1 xrootd redirector, 1 xrootd data server, 1 xrootd client machine.
All at the University of Florida – Nimbus cloud site.

Data transfer in the cloud
Xrdcp command run on the data server, referencing the local redirector machine and the redirector at BNL.
Avg: 11.2 MB/s, RMS: 2.1 MB/s.

Data processing rate
Test program:
 o ATLAS D3PD analysis – cut-flow job on a Standard Model D3PD
 o TTreeCache with learning phase
 o Activate only the branches used in the analysis
ROOT processing rate on a desktop at ANL with data over the LAN: ~1200 Hz.
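
To make that access pattern concrete, a minimal ROOT macro sketch is shown below. The federation URL, the tree name ("physics"), the branch name, and the cut are hypothetical placeholders, not the actual analysis code:

    // cutflow_sketch.C -- run with: root -l -b -q cutflow_sketch.C
    #include "TFile.h"
    #include "TTree.h"
    #include <iostream>

    void cutflow_sketch() {
       // Read the D3PD over the WAN via the xrootd federation (URL is a placeholder)
       TFile *f = TFile::Open("root://federation-redirector.example.org//atlas/d3pd/ntup.root");
       if (!f || f->IsZombie()) { std::cerr << "could not open file" << std::endl; return; }

       TTree *t = (TTree*) f->Get("physics");   // assumed D3PD tree name
       if (!t) { std::cerr << "tree not found" << std::endl; return; }

       t->SetCacheSize(30*1024*1024);           // 30 MB TTreeCache for remote reads
       t->SetCacheLearnEntries(100);            // learning phase over the first 100 entries

       t->SetBranchStatus("*", 0);              // deactivate everything ...
       t->SetBranchStatus("el_n", 1);           // ... then enable only what the cuts use
       Int_t el_n = 0;
       t->SetBranchAddress("el_n", &el_n);

       Long64_t npass = 0, nentries = t->GetEntries();
       for (Long64_t i = 0; i < nentries; ++i) {
          t->GetEntry(i);                       // reads only the active branch, via the cache
          if (el_n >= 2) ++npass;               // placeholder cut
       }
       std::cout << npass << " of " << nentries << " events pass" << std::endl;
       f->Close();
    }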

Future activities
Continue the work using an OpenStack instance on FutureGrid – measure performance.
Extend the activity to an OpenStack instance at BNL.
Set up an OpenStack cloud at the ANL Tier 3, serving data over the LAN:
 o Measure processing performance for analysis in a VM
 o Run the analysis on the physical hardware at the hypervisor layer
 o Gives an apples-to-apples comparison
Using data from the federation, repeat the measurements at ANL.
In October, use virtualized resources at the Midwest Tier 2.

Some issues
Cloud middleware is an issue – many different varieties: EC2, OpenStack, Nimbus, CloudStack, Google Compute Engine… the Wild West.
Contextualization is a challenge – need to understand the life-cycle management of the VMs.
VM startup time will affect the user's throughput.
Most of the effort in my experiment (ATLAS) is focused on MC production in the cloud; need to strengthen the focus on chaotic analysis.

Conclusion
Cloud analysis activities have moved from vaporware last year to concrete activities this year:
 o CMS processing on EC2
 o ATLAS tests on EC2 and FutureGrid – soon private clouds at BNL and ANL
Federated storage with WAN data access is key to all of these activities.
Much more work to be done, with a variety of workflows and input files.
ATLAS has to migrate users to code optimized for this type of activity – a big challenge.