DØ Computing & Analysis Model


1 DØ Computing & Analysis Model
Tibor Kurča, IPN Lyon
Outline:
- Introduction
- DØ Computing Model
- SAM
- Analysis Farms - resources, capacity
- Data Model Evolution - where you can go wrong
- Summary
Mar 15, 2007, Clermont-Ferrand

2 Computing Enables Physics
HEP computing:
- Online: data taking
- Offline: data reconstruction, MC data production
- Analysis → physics results, the final goal of the experiment

3 Data Flow Analysis Real Data Monte Carlo Data Beam collisions
Event generation: software modelling beam particles interactions  production of new particles from those collisions Particles traverse detector Simulation: particles transport in the detectors Readout: Electronic detector signals written to tapes  raw data Digitization: Transformation of the particle drift times, energy deposits into the signals readout by electronics  the same format as real raw data Reconstruction: physics objects, i.e. particles produced in the beams collisions -- electrons, muons, jets… Physics Analysis Mar 15, 2007, Clermont-Ferrand Tibor Kurca, LCG France

4 DØ Computing Model
1997 – planning for Run II was formalized:
- critical look at Run I production and analysis use cases
- datacentric view – metadata (data about data)
- scalability to Run II data rates and anticipated budgets
Data volumes – intelligent file delivery → caching, buffering
- extensive bookkeeping about usage in a central DB
Access to the data:
- consistent interface to the data for anticipated global analysis
  → transport mechanisms and data stores transparent to the users
  → replication and location services
  → security, authentication and authorization
Centralization, in turn, required a client-server model for scalability, uptime and affordability
  → the client-server model was also applied to serving calibration data to remote sites
Resulting project: Sequential Access via Metadata (SAM)

5 SAM - Data Management System
Distributed data handling system for the Run II DØ and CDF experiments
- set of servers (stations) communicating via CORBA
- central DB (FNAL)
- designed for petabyte-sized datasets!
SAM functionalities:
- file storage from online and processing systems → MSS (FNAL Enstore, CCIN2P3 HPSS, ...) and disk caches around the world
- routed file delivery - the user doesn't care about file locations
- file metadata cataloging → creation of datasets based on file metadata (a conceptual sketch follows below)
- analysis bookkeeping → which files were processed successfully, by which application, when and where
- user interfaces via command line, web and Python API
- user authentication - registration as a SAM user
- local and remote monitoring capabilities
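The datacentric idea above is that users select data by metadata (data tier, run range, trigger) and SAM resolves file locations and delivery. The fragment below is a purely conceptual C++ sketch of that kind of metadata-driven dataset selection; the FileMetadata struct and selectDataset function are hypothetical illustrations, not the real SAM interface (which was reached through CORBA stations, the command line, the web, or Python).

```cpp
// Conceptual sketch only -- FileMetadata, selectDataset and the file names
// are hypothetical, not part of the real SAM interface.
#include <algorithm>
#include <iostream>
#include <iterator>
#include <string>
#include <vector>

struct FileMetadata {          // the kind of information SAM catalogues per file
    std::string name;          // logical file name (location is resolved by SAM)
    std::string dataTier;      // e.g. "raw", "DST", "TMB"
    int         runNumber;
    long        nEvents;
};

// Build a dataset as "all files whose metadata match", independent of location.
std::vector<FileMetadata> selectDataset(const std::vector<FileMetadata>& catalog,
                                        const std::string& tier,
                                        int runMin, int runMax) {
    std::vector<FileMetadata> dataset;
    std::copy_if(catalog.begin(), catalog.end(), std::back_inserter(dataset),
                 [&](const FileMetadata& f) {
                     return f.dataTier == tier &&
                            f.runNumber >= runMin && f.runNumber <= runMax;
                 });
    return dataset;
}

int main() {
    std::vector<FileMetadata> catalog = {      // toy catalogue entries
        {"d0_tmb_000001", "TMB", 180001, 250000},
        {"d0_dst_000002", "DST", 180002, 250000},
        {"d0_tmb_000003", "TMB", 190010, 250000},
    };
    for (const auto& f : selectDataset(catalog, "TMB", 180000, 185000))
        std::cout << f.name << '\n';           // prints d0_tmb_000001
}
```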

6 Computing Model I
DØ computing model built on SAM:
- first reconstruction done on FNAL farms
- all MC produced remotely
- all data centralized at FNAL (Enstore) → even MC
- no automatic replication
- remote Regional Analysis Centers (RACs): CCIN2P3, GridKa
  - usually prestaging data of interest
  - data routed via central-analysis → RACs → smaller sites
- DØ native computing grid – SAMGrid
- SAMGrid/LCG, SAMGrid/OSG interoperability

7 Computing Model II
[Diagram: SAM and Enstore at the centre, serving first reconstruction, MC production, reprocessing, fixing, analysis and individual production.]

8 Analysis Farm 2002 Central Analysis facility: D0mino
SGI Origin MHz processors and 30 TB50 TB fibre channel disk - RAID disk for system needs and user home areas - centralized, interactive and batch services for on & off-site users - provided also data movement into a cluster of Linux compute nodes 500 GHz CAB (Central Analysis Backend) SAM enables “remote” analysis - user can run analysis jobs on remote sites with SAM services - 2 analysis farm stations were pulling the majority of their files from tape  large load user data access at FNAL was a bottleneck Mar 15, 2007, Clermont-Ferrand Tibor Kurca, LCG France

9 Central Analysis Farms 2003+
SGI Origin ... starting to be phased out
D0mino0x: 2004 → new Linux-based interactive pool
Clued0: cluster of institutional desktops + rack-mounted nodes as large disk servers
- 1 Gb Ethernet connection, with batch system, SAM access (station), local project disk
- appears as a single integrated cluster to the user; managed by the users
- used for development of analysis tools, tests on small samples
CAB (Central Analysis Backend): Linux file servers and worker nodes (pioneered by CDF with FNAL/CD)
- full-sample analysis jobs, production of common analysis samples

10 Central Analysis Farms - 2007
Home areas on NetApp (Network Appliance)
CAB:
- Linux nodes - 3 THz of CPU
- 400 TB SAM cache
Clued0:
- desktop cluster + disk servers - 1+ THz
- SAM cache: 70 TB (nodes) + 160 TB (servers)
Before adding 100 TB of cache, 2/3 of transfers could be from tape (Enstore)
- practically all tape transfers occur within 5 min
- intra-station: 60% of cached files are delivered within 20 s

11 Data Model in Retrospective
Initial data model:
- STA: raw data + all reconstructed objects (too big ...)
- DST: reconstructed objects plus enough info to redo reconstruction
- TMB: compact format of selected reconstructed objects
- all catalogued and accessible via SAM
- formats supported by a standard C++ framework
- physics groups would produce and maintain their specific tuples
Reality:
- STA never implemented
- TMB wasn't ready when data started to come
- DST was ready, but initially people wanted the extra info in raw data
- ROOT tuple output intended for debugging was available → many started to use it for analysis (see the sketch below)
- the threshold for using the standard framework and SAM was high (complex, and inadequate documentation)
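Part of the pull of the debugging tuples was how little code a quick look required: a flat ROOT tree can be histogrammed in a couple of lines. The macro below is a minimal sketch of that pattern; the file, tree and branch names are invented for illustration and are not the actual DØ tuple contents.

```cpp
// quick_look.C -- minimal ROOT macro; run with:  root -l quick_look.C
// File, tree and branch names ("debug_tuple.root", "ntuple", "em_pt", "em_eta")
// are hypothetical examples.
#include "TFile.h"
#include "TTree.h"

void quick_look() {
    TFile *f = TFile::Open("debug_tuple.root");   // flat debugging tuple
    TTree *t = (TTree*)f->Get("ntuple");          // hypothetical tree name
    // One-line histogram of electron pT for central electrons:
    t->Draw("em_pt", "abs(em_eta) < 1.1");
}
```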

12 Data Model in Retrospective 2
TMB:
- finalized too late (8 months after data taking began) → data disk-resident, duplication of algorithm development
- slow for analysis (large unpacking times; changes required slow relinks)
Divergence between those using the standard framework vs. ROOT tuples
→ incompatibilities and complications, notably in standard object IDs
→ the need for a common format was finally recognized (a difficult process)
TMBTree: an effort to introduce a new common analysis format
- compatibility issues and inertia still prevented most ROOT tuple users from adopting it
- didn't have a clear support model → never caught on
TMB++: added calorimeter cell information & tracker hits

13 CAF - Common Analysis Format
2004: "CAF" project begins – Common Analysis Format:
- common ROOT tree format based on the existing TMB
→ central production & storage in SAM
→ efficiency gains:
  - easier sharing of data and analysis algorithms between physics groups
  - reduced development and maintenance effort required from the groups
→ faster turn-around between data taking and publication
café: a CAF environment has been developed
- single user-friendly, ROOT-based analysis system
- forms the basis for common tools development – standard analysis procedures such as trigger selection, object-ID selection, efficiency calculation
→ benefits for all physics groups (a generic sketch of such a tree-based event loop follows below)
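As an illustration of what a shared tree format buys, the skeleton below shows a generic ROOT event loop over a common-format tree of the TMB/CAF kind. It is a sketch only, not the actual café API; the file, tree and branch names are assumptions made up for the example.

```cpp
// event_loop.C -- generic ROOT event loop over a common-format tree.
// NOT the café framework API; "caf_sample.root", "caf_tree" and "met"
// are hypothetical names used only for this sketch.
#include "TFile.h"
#include "TTree.h"
#include "TH1F.h"
#include <iostream>

void event_loop() {
    TFile *f = TFile::Open("caf_sample.root");     // a CAF-style file (hypothetical)
    TTree *tree = (TTree*)f->Get("caf_tree");      // hypothetical tree name
    float met = 0.f;                               // example branch: missing ET
    tree->SetBranchAddress("met", &met);

    TH1F h("h_met", "Missing E_{T};MET [GeV];events", 50, 0., 200.);
    const Long64_t nEntries = tree->GetEntries();
    for (Long64_t i = 0; i < nEntries; ++i) {
        tree->GetEntry(i);                         // read one event
        if (met > 20.f) h.Fill(met);               // a trivial "selection"
    }
    std::cout << "selected " << h.GetEntries()
              << " of " << nEntries << " events\n";
}
```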

14 [Figure-only slide: no recoverable text]

15 CAF Use Taking Off
- 2004: "CAF" begins; CAF commissioned in 2006
- working to understand use cases; next focus is analysis
[Plot: events consumed per month, up to ~10B events/month; red = TMB access, blue = CAF, black = physics group samples]

16 CPU Usage - Efficiency
Cabsrv2: SAM_lo, CPU time / wall time, Sept '05 – April '06
[Plot: efficiency ranging between ~20% and ~70%]
- historical average is around 70% CPU/wall time; currently I/O dominated
- working to understand it; multiple "problems" or limitations seem likely → ROOT bug
- vitally important to understand analysis use cases/patterns in discussion with the physics groups
(A sketch of how to measure this CPU/wall ratio follows below.)
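The CPU/wall-time ratio quoted on this slide can be measured inside a ROOT job with TStopwatch. The fragment below is a small sketch of that measurement; doWork() is a placeholder standing in for any analysis loop.

```cpp
// cpu_efficiency.C -- measure CPU time vs. wall (real) time of a work loop.
// doWork() is a stand-in for event processing; the 70%/20% figures on the
// slide are ratios of exactly these two quantities.
#include "TStopwatch.h"
#include <cmath>
#include <cstdio>

static double doWork() {                          // placeholder workload
    double s = 0.;
    for (long i = 1; i < 50000000L; ++i) s += std::sqrt((double)i);
    return s;
}

void cpu_efficiency() {
    TStopwatch timer;
    timer.Start();
    doWork();
    timer.Stop();
    const double cpu  = timer.CpuTime();          // seconds of CPU actually used
    const double wall = timer.RealTime();         // elapsed wall-clock seconds
    printf("CPU %.2fs  wall %.2fs  efficiency %.0f%%\n",
           cpu, wall, 100. * cpu / wall);
}
```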

17 ROOT Bug
Many jobs were only getting 20% CPU on CAB:
- reported to experts (Paul Russo, Philippe Canal) and the problem was found: slow lookup of TRefs in ROOT
- fixed in a new patch of ROOT v4.4.2b; the p release has the new ROOT patch
Time breakdown of a job:
- 12% open the file, read TStreamerInfo
- 6% read the input tree from the file
- 7% clone the input tree (café)
- 10% processing
- 32% unzip tree data
- 26% move tree data from the ROOT I/O buffer to the user buffer
- 7% miscellaneous
Next: use the fixed code and measure CPU performance to see if any CPU issues remain. (A minimal illustration of the TRef pattern follows below.)
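For context, TRef is ROOT's persistent cross-reference class: an object stored in one place can be pointed to from another, and the pointer is resolved at read time with TRef::GetObject(), which is the lookup the patch sped up. The snippet below is a minimal, generic illustration of that pattern; the objects are plain TNamed stand-ins, not DØ's actual event classes.

```cpp
// tref_example.C -- minimal illustration of the TRef cross-reference pattern.
// The "track" object is a generic TNamed stand-in; only the TRef mechanics
// are the point here.
#include "TNamed.h"
#include "TRef.h"
#include <iostream>

void tref_example() {
    TNamed *track = new TNamed("track42", "a reconstructed track");

    TRef ref;
    ref = track;                                   // store a reference to the track

    // ... later (possibly after the objects were written out and read back),
    // the reference is resolved on demand:
    TNamed *resolved = (TNamed*)ref.GetObject();   // the lookup that was slow
    if (resolved)
        std::cout << "reference points to: " << resolved->GetName() << std::endl;

    delete track;
}
```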

18 Analysis over Time
Events consumed by stations since "the beginning of SAM time":
- integrates to 450B events consumed
[Plot: per-station consumption (clued0, cabsrv1, ...) from 2002 to 2006, reaching ~1 PB in 2006]

19 SAM Data Consumption/Month
Feb 2006 – Mar 2007: ~800 TB/month

20 SAM Cumulative Data Consumption
Feb 2006 – Mar 2007: > 10 PB/year, ~250B events/year

21 Summary - Conclusions
Analysis is the final step in the whole computing chain of a physics experiment:
- the most unpredictable usage of computing resources
- jobs that are by their nature I/O oriented
- two phases in the analysis procedure:
  1. developing analysis tools, testing on small samples
  2. large-scale analysis production
User-friendly environment and suitable tools:
- short learning curve
- missing user interfaces and a painful environment → user resistance
Lessons: it's not only about hardware resources & architecture ...
- common data tiers (formats) are very important
  - need a format that meets the needs of all users and that all agree on from day one
- simplicity of usage
- documentation must be ready to use
- use cases, surprises?
"The most basic user needs, in areas where they interact directly with the computing system, should be an extremely high priority"

