ATLAS Scalability Tests of Tier-1 Database Replicas
WLCG Collaboration Workshop (Tier0/Tier1/Tier2), Victoria, British Columbia, Canada, September 1-2, 2007
Richard Hawkings (CERN), Alexandre Vaniachine (Argonne)

2. All Ten ATLAS Tier-1 Sites in Production Operation

Using the 3D infrastructure, ATLAS is running a 'mini Calibration Data Challenge':
- Regular conditions data updates on the Online RAC, testing propagation to the Offline RAC and further to the ten Tier-1s
- Since April, about 2500 runs and 110 GB of COOL data have been replicated to the Tier-1s

Thanks to the 3D Project, ATLAS Conditions DB worldwide replication is now in production with real data (from detector commissioning) and with data from MC simulations:
- Snapshot of real-time monitoring of 3D operations on the EGEE Dashboard (dashboard screenshot)

3. DB Replicas at Tier-1s: Critical for Reprocessing

The ATLAS replication workload uses multiple COOL schemas with a mix of data volumes and payload types:

Schema   | #folders | #chan | Chan payload      | N/run | Total GB
INDET    |          |       | char              |       |
CALO     |          |       | char              | 1     | 1.8
MDT      |          |       | CLOB: 3kB + 4.5kB |       |
GLOBAL   | 1        | 50    | 3 x float         |       |
TDAQ/DCS |          |       | x float           |       |
TRIGGER  |          |       | x float           |       |

Why are we replicating all these data? The ATLAS Computing Model sets the following requirements at the Tier-1s with respect to the Conditions DB:
- Running reconstruction re-processing: O(100) jobs in parallel
- Catering for other 'live' Conditions DB usage at the Tier-1 (Calibration and Analysis), and perhaps for the associated Tier-2/3s

4. Purpose of Scalability Tests

To provide input to future hardware purchases for the Tier-1s (how many servers are required? what balance between CPU, memory and disk?) we have to do Oracle scalability tests with the following considerations:
- Although reconstruction jobs last for hours, most conditions data is read at initialization, so we do not have to initialize O(100) jobs at once
- Tier-0 uses a file-based Conditions DB slice on AFS; at the Tier-1s, database access differs. Because of rate considerations we may have to stage and process files grouped by physical tape rather than by dataset. Will database data caching be of value for this Tier-1 access mode? Our scalability tests should not rely on data caching, so we should test a random data access pattern
- Find out where the bottlenecks are: hardware (database CPU, storage-system I/O per second) or COOL read-back queries (to learn whether we have problems there)
- Find out how many clients a Tier-1 database server can support at once, with clients using Athena and a realistic conditions data workload (a test-harness sketch follows this slide)
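The transcript does not include the actual ATLAS test driver, but the procedure described above (launch many concurrent clients, each reading conditions for a randomly chosen run, and measure the aggregate throughput) can be illustrated with a minimal sketch. Everything here is hypothetical: the client command, the run-number range and the concurrency levels are placeholders, not the real test configuration.

```python
#!/usr/bin/env python
# Minimal sketch of a scalability-test driver (hypothetical, not the ATLAS framework).
# It launches N concurrent "client jobs", each reading conditions for a random run,
# and reports the achieved throughput in jobs/hour for each concurrency level.
import random
import subprocess
import time
from concurrent.futures import ThreadPoolExecutor

CLIENT_CMD = ["athena.py", "CondReadTest.py"]   # placeholder client job
RUN_RANGE = (20000, 22500)                      # placeholder run-number range

def run_client(_):
    """Run one client job against a randomly chosen run (random access pattern)."""
    run_number = random.randint(*RUN_RANGE)
    subprocess.run(CLIENT_CMD + ["--run", str(run_number)], check=False)

def measure_throughput(n_jobs):
    """Launch n_jobs concurrent clients and return the observed jobs/hour."""
    start = time.time()
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        list(pool.map(run_client, range(n_jobs)))
    elapsed = time.time() - start
    return n_jobs / (elapsed / 3600.0)

if __name__ == "__main__":
    for n in (10, 20, 50, 100):                 # ramp up until the server is overloaded
        print(f"{n:4d} concurrent jobs -> {measure_throughput(n):8.1f} jobs/hour")
```

In the real tests the client was an Athena job reading the COOL workloads described on the following slides; the driver above only shows the shape of the measurement.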

5. ATLAS Conditions DB Tests of 3D Oracle Streams

The replicated data provide the read-back data for the scalability tests (the COOL schema table shown on slide 3).

The realistic conditions data workload is a 'best guess' for the ATLAS Conditions DB load in reconstruction, dominated by DCS data. Three workload combinations were used in the tests: "no DCS", "with DCS" and "10xDCS" (explained on slide 10).

6. How the Scalability Test Works

The first ATLAS scalability tests started at the French Tier-1 site in Lyon. Lyon has a 3-node 64-bit Solaris RAC cluster which is shared with another LHC experiment (LHCb).

In the scalability tests our goal is to overload the database cluster by launching many jobs in parallel. Initially, the more concurrent jobs are running (horizontal axis), the more processing throughput we get (vertical axis), until the server becomes overloaded: it then takes longer to retrieve the data, which limits the throughput (a toy illustration of this saturation behaviour follows this slide).

In the particular plot shown, the overload was caused by a lack of optimization in the COOL 2.1 version used in the very first test, but it was nice to see that our approach worked.
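As a hedged illustration of the saturation behaviour described above (not a formula from the talk): if a single client job takes a fixed wall time on an unloaded server and the server can sustain at most some maximum rate, the observed throughput grows roughly linearly with the number of concurrent clients until it flattens out. The numbers below are assumed values, not ATLAS measurements.

```python
# Toy saturation model (illustrative only, not a fit to the ATLAS measurements):
# below the server limit the throughput grows linearly with concurrency,
# above it the server becomes the bottleneck and the throughput flattens.
def expected_throughput(n_clients, t_single=120.0, t_max_jobs_per_hour=2000.0):
    """Throughput in jobs/hour for n_clients concurrent jobs.

    t_single: wall time (s) of one job on an unloaded server (assumed value).
    t_max_jobs_per_hour: server saturation limit (assumed value).
    """
    linear = n_clients * 3600.0 / t_single     # no contention: n jobs every t_single seconds
    return min(linear, t_max_jobs_per_hour)    # capped once the server is overloaded

for n in (5, 10, 50, 100):
    print(n, expected_throughput(n))
```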

7. First Scalability Tests at the Lyon CC IN2P3 Tier-1

Test jobs read the "no DCS" Conditions DB workload at random. The calculated throughput (jobs/hour) is for the RAC tested (3-node SPARC). An 80% CPU load indicates that the limit is close.

Even with that old COOL 2.1 version we got useful "no DCS" results by engaging "manual" optimization via actions on the server side.

8. First Scalability Tests at the Bologna CNAF Tier-1

Test jobs read the "no DCS" Conditions DB workload at random. The calculated throughput (jobs/hour) is for the RAC tested: a 2-node dual-core Linux Oracle RAC (dedicated to ATLAS).

In a similar way, "no DCS" results were obtained at CNAF with the old COOL 2.1.

9. Scalability Tests: Detailed Monitoring of What Is Going On

The ATLAS testing framework keeps many things in check and under control (monitoring screenshots from the IN2P3 and CNAF tests).

10. Latest Tests with COOL 2.2

The new COOL 2.2 version enabled more realistic workload testing, now "with DCS". Three different workloads have been used in these tests (a configuration sketch follows this slide):
- "no DCS": data from 19 folders of 32 channels each with a POOL reference (string) payload, plus 2 large folders each with 1174 channels (one with a 3k string per channel, one with a 4.5k string per channel, which gets treated by Oracle as a CLOB), plus 1 folder with 50 channels simulating detector-status information. This is meant to represent a reconstruction job reading calibration but no DCS data. The data is read once per run.
- "with DCS": as above, but with an additional 10 folders of 200 channels each containing 25 floats, and 5 folders of 1000 channels of 25 floats, representing some DCS data, again read once per run.
- "10xDCS": as above, but processing 10 events spaced in time so that all the DCS data is read again for each event. This represents a situation where the DCS data varies over the course of a run, so each job has to read in 10 separate sets of DCS data.
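The folder counts above can be collected into a small configuration structure. This is only a sketch of how the three workloads might be parameterized; the dictionary layout and field names are invented here and are not the ATLAS test framework's actual configuration.

```python
# Sketch of the three test workloads described on this slide (field names are hypothetical).
# Each entry: (number_of_folders, channels_per_folder, payload_description, reads_per_job)
CALIB_FOLDERS = [
    (19, 32,   "POOL reference (string)",         1),
    (1,  1174, "3k string per channel",           1),
    (1,  1174, "4.5k string per channel (CLOB)",  1),
    (1,  50,   "detector status",                 1),
]
DCS_FOLDERS = [
    (10, 200,  "25 floats per channel",           1),
    (5,  1000, "25 floats per channel",           1),
]

WORKLOADS = {
    "no DCS":   CALIB_FOLDERS,
    "with DCS": CALIB_FOLDERS + DCS_FOLDERS,
    # "10xDCS": same folders as "with DCS", but the DCS folders are read 10 times per job
    "10xDCS":   CALIB_FOLDERS + [(n, ch, payload, 10) for (n, ch, payload, _) in DCS_FOLDERS],
}

def folder_reads_per_job(workload):
    """Total number of (folder, read) operations a single test job performs."""
    return sum(n_folders * reads for (n_folders, _, _, reads) in WORKLOADS[workload])

for name in ("no DCS", "with DCS", "10xDCS"):
    print(name, folder_reads_per_job(name))
```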

11. Latest Scalability Tests at the Bologna CNAF Tier-1

The top query for the "no DCS" and "with DCS" cases is the COOL 'listChannels' call for multi-version folders; this query is not optimized in COOL 2.2. Its optimization is expected to result in a further increase in performance.

(For this RAC, 1 Oracle CPU = 1 Intel CPU, dual-core.)

12. Latest Scalability Tests at the Lyon CC IN2P3 Tier-1

It is too early to draw many conclusions from the comparison of these two sites at present, except that it shows the importance of doing tests at several sites.

These results are preliminary. (For this RAC, 1 Oracle CPU = 1 SPARC CPU, single-core.)

13. Importance of Tests at Different Sites

Among the reasons for the performance differences between the sites is that the Oracle optimizer chooses different query plans at Lyon and at CNAF:
- For some as yet unknown reason, at Lyon the optimizer chooses not to use an index, and thus gets worse performance. We don't yet know whether this is because of differences in hardware, different server parameters, different tuning, or something else (one way to inspect the chosen plan is sketched after this slide).
- Also, at Lyon different throughputs were observed at different times. Thus the performance is not yet fully understood; is it because of the shared server?

The calculated throughput (jobs/hour) is for the RAC tested (not per CPU). The sites had very different hardware configurations:
- at CNAF, Bologna: 4 Intel cores = 2 "Oracle CPUs"
- at CC IN2P3, Lyon: 3 SPARC cores = 3 "Oracle CPUs"
There were more actual CPUs at CNAF than at Lyon, which accounts for some of the difference in RAC performance observed.
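A hedged illustration of how one might compare the optimizer's choices at the two sites: the connection details, user and query below are placeholders, not the actual COOL statement that differed between Lyon and CNAF. Oracle's EXPLAIN PLAN and DBMS_XPLAN.DISPLAY are standard features; the cx_Oracle usage is a generic sketch, not part of the ATLAS test framework.

```python
# Sketch: dump the execution plan Oracle chooses for a query, to compare sites.
# Connection parameters and the SQL text are placeholders (hypothetical).
import cx_Oracle

def show_plan(dsn, user, password, sql):
    """Run EXPLAIN PLAN for the given statement and print the chosen plan."""
    conn = cx_Oracle.connect(user, password, dsn)
    cur = conn.cursor()
    cur.execute("EXPLAIN PLAN FOR " + sql)
    cur.execute("SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY())")
    for (line,) in cur.fetchall():
        print(line)
    conn.close()

if __name__ == "__main__":
    # Placeholder query standing in for the COOL read-back statement under study.
    show_plan("tier1-db.example.org/ATLAS", "reader", "secret",
              "SELECT * FROM some_cool_table WHERE channel_id = 42")
```

Running the same check at both sites would show directly whether the index is used at one site and skipped at the other.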

14. In the Ballpark

We estimate that ATLAS daily reconstruction and/or analysis job rates will be in the range of 100,000 to 1,000,000 jobs/day (current ATLAS production finishes up to 55,000 jobs/day). For each of the ten Tier-1 centres that corresponds to rates of 400 to 4,000 jobs/hour. For the many Tier-1s pledging ~5% of the capacities (rather than 1/10th), that corresponds to rates of 200 to 2,000 jobs/hour, and most of these will be analysis or simulation jobs, which do not need much Oracle Conditions DB access (the arithmetic is spelled out after this slide).

Thus our results from the initial scalability tests are promising: we got initial confirmation that the ATLAS capacities request to WLCG (3-node clusters at all Tier-1s) is close to what will be needed for reprocessing in the first year of ATLAS operations.
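The per-Tier-1 rates quoted above follow directly from the assumed total job rates; a short worked check using the 1/10 and ~5% shares quoted on the slide:

```python
# Worked check of the rates quoted on this slide.
for total_jobs_per_day in (100_000, 1_000_000):
    per_tier1_tenth = total_jobs_per_day * 0.10 / 24   # Tier-1 taking 1/10 of the load
    per_tier1_5pct  = total_jobs_per_day * 0.05 / 24   # Tier-1 pledging ~5% of capacity
    print(f"{total_jobs_per_day:9,d} jobs/day -> "
          f"{per_tier1_tenth:6.0f} jobs/hour (1/10 share), "
          f"{per_tier1_5pct:6.0f} jobs/hour (5% share)")
# ~417 and ~208 jobs/hour for 100k jobs/day; ~4167 and ~2083 for 1M jobs/day,
# i.e. the 400-4,000 and 200-2,000 jobs/hour ranges quoted above.
```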

15. Conclusions, Plans and Credits

A useful framework for Oracle scalability tests has been developed. The initial Oracle scalability tests indicate that the WLCG 3D capacities in deployment for ATLAS are in the ballpark of what ATLAS requested.

We plan to continue the scalability tests with new COOL releases; initial results indicate better COOL performance. Future scalability tests will allow a more precise determination of the actual ATLAS requirements for distributed database capacities.

Because of the large allocation of dedicated resources, scalability tests require careful planning and coordination with the Tier-1 sites that volunteered to participate in these tests:
- The Lyon test involved a collaborative effort beyond ATLAS DB (R. Hawkings, S. Stonjek, G. Dimitrov and F. Viegas); many thanks to the CC IN2P3 Tier-1 people: G. Rahal, J.R. Rouet, P.E. Macchi, and to E. Lancon and R.D. Schaffer for coordination.
- The Bologna test involved a collaborative effort beyond ATLAS DB (R. Hawkings, G. Dimitrov and F. Viegas); many thanks to the CNAF Tier-1 people: B. Martelli, A. Italiano, L. dell'Agnello, and to L. Perini and D. Barberis for coordination.