1 Analyzing ever growing datasets in PHENIX Chris Pinkenburg for the PHENIX collaboration

2 The PHENIX Detector Many subsystems for different physics. High-speed DAQ (>5 kHz), selective Lvl1 triggers in pp, MinBias in AuAu, max rate ~800 MB/s. Stored in reconstructed output: EMC clusters, muon candidates, charged central arm tracks

3 PHENIX Raw Data Volume PB-sized raw data sets will be the norm for PHENIX. Heavy ion runs produce more data than pp runs: pp runs use triggers with high rejection factors, heavy ion runs are mainly MinBias. Easy-to-remember run-to-year matching at RHIC: Run2 ended in 2002, Run3 ended in 2003, …

4 Reconstructed data (DST) size Total size: 700 TB (1 PB including Run10). Passing over all data sets the scale for the necessary weekly I/O: 700 TB/week ≈ 1.2 GB/sec. Copying data to local disk and passing multiple times over it keeps network I/O within acceptable limits and makes jobs immune to network problems while processing. Reduction ~30% compared to raw data. Average processing: about 500 TB/week
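As a quick sanity check of the quoted rate (assuming a 7-day week and decimal units, i.e. 1 TB = 10^12 B):

```latex
\frac{700\,\mathrm{TB}}{7 \times 24 \times 3600\,\mathrm{s}}
  = \frac{7 \times 10^{14}\,\mathrm{B}}{6.048 \times 10^{5}\,\mathrm{s}}
  \approx 1.2\,\mathrm{GB/s}
```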

5 Reconstructed data (DST) size Size does not scale with the number of events: Run7 had additional cuts applied on saved clusters/tracks; Run10 keeps the full information but uses half floats and an improved output structure. Run4: 1×10^9 events, 70 TB. Run7: 4.2×10^9 events, 200 TB. Run10: 10×10^9 events, 300 TB
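For orientation, the per-event DST size implied by the numbers above (decimal units, rounded):

```latex
\mathrm{Run4}:\ \frac{70\,\mathrm{TB}}{1 \times 10^{9}\ \mathrm{events}} \approx 70\ \mathrm{kB/event} \qquad
\mathrm{Run7}:\ \frac{200\,\mathrm{TB}}{4.2 \times 10^{9}\ \mathrm{events}} \approx 48\ \mathrm{kB/event} \qquad
\mathrm{Run10}:\ \frac{300\,\mathrm{TB}}{10 \times 10^{9}\ \mathrm{events}} = 30\ \mathrm{kB/event}
```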

6 Number of files Run4 came as a “surprise”, showing that 1 raw data file -> 1 DST is just not a good strategy. Aggregating output files and increasing their size (to now 9 GB) keeps the number of files at a manageable level. Staging 100,000 files out of tape storage is a real challenge

7 PHENIX Output Files Output is separated according to triggers. Data is split according to content: –central arm tracks –EMC clusters –muon candidates –detector-specific info. Reading files in parallel is possible; special synchronization makes sure we do not mix events. Recalibrators bring the data “up to date”
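The synchronization itself is handled inside the PHENIX analysis framework and is not shown on the slides; purely as an illustration of the idea, here is a self-contained C++ sketch (all class and function names are hypothetical) that advances several input streams until they agree on the same event number, so data from different events is never mixed:

```cpp
// sync_sketch.cpp -- illustration only, all names hypothetical (this is NOT
// the PHENIX synchronization code). Several parallel input streams are
// advanced until they all point at the same event number; only then is the
// event considered complete and handed to the analysis modules.
#include <cstddef>
#include <iostream>
#include <optional>
#include <vector>

// Stand-in for one DST input stream (central tracks, EMC clusters, ...).
struct InputStream {
  std::vector<int> eventNumbers;  // event numbers stored in this file
  std::size_t pos = 0;
  std::optional<int> current() const {
    return pos < eventNumbers.size() ? std::optional<int>{eventNumbers[pos]}
                                     : std::nullopt;
  }
  void advance() { ++pos; }
};

// Returns the next event number present in every stream, or nullopt when any
// stream is exhausted. Events missing from one stream are skipped.
std::optional<int> nextSyncedEvent(std::vector<InputStream>& streams) {
  while (true) {
    bool first = true;
    int target = 0;
    for (const auto& s : streams) {            // largest current event number
      auto evt = s.current();
      if (!evt) return std::nullopt;
      if (first || *evt > target) { target = *evt; first = false; }
    }
    bool aligned = true;
    for (auto& s : streams) {                  // let lagging streams catch up
      while (s.current() && *s.current() < target) s.advance();
      if (!s.current()) return std::nullopt;
      if (*s.current() != target) aligned = false;
    }
    if (aligned) {
      for (auto& s : streams) s.advance();     // consume the synchronized event
      return target;
    }
  }
}

int main() {
  // Two toy streams, each missing one event the other has.
  std::vector<InputStream> streams = {{{1, 2, 3, 5, 6}}, {{1, 3, 4, 5, 6}}};
  while (auto evt = nextSyncedEvent(streams))
    std::cout << "processing event " << *evt << '\n';  // prints 1, 3, 5, 6
}
```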

8 From the Analysis Train… The initial idea of an “analysis train” evolved from mid ‘04 to early ‘05 into the following plan: –Reserve a set of the RCF farm (fastest nodes, largest disks) –Stage as much of the data set as possible onto the nodes’ local disks; run all analysis modules (previously tested on a ~10% data sample: “the stripe”) –Delete the used data, stage the remaining files, run, repeat. One cycle took ~3 weeks: –Very difficult to organize and maintain the data –Getting ~200k files from tape was very inefficient –Even using more machines with enough space to keep the data disk resident was not feasible (machines down, disk crashes, forcing condor into submission, …) –Users unhappy with delays

9 … to the Analysis Taxi Since ~ autumn ‘05: –Add all existing distributed disk space into dCache pools –Stage and pin files that are in use (once during setup) –Close dCache to general use, only reconstruction and the taxi driver have access: performance when open to all users was disastrous - too many HPSS requests, frequent door failures, … –Users can “hop in” every Thursday, requirements are: code tests (valgrind), limits on memory and CPU time consumption, approval from the WG for output disk space –Typical time to run over one large data set: 1-2 days

10 RHIC Computing Facility PHENIX portion: ~600 compute nodes, ~4600 condor slots, ~2 PB distributed storage on the compute nodes in chunks of 1.5 TB-8 TB, managed by dCache and backed by HPSS; BlueArc NFS server ~100 TB

11 User interfaces Signup for the nightly rebuild (gets retired after 3 months, re-signup is a button click). Signup for a pass; a code test with valgrind is required. Module status page on the web. Taxi summary page on the web. A module can be removed from the current pass. The basic idea: the user hands us a macro and tells us the dataset and the output directory; the rest is our problem (job submission, removal of bad runs, error handling, rerunning failed jobs)

12 Job Submission The submit perl script creates the module output directory tree (log, data, core) and a Condor dir (1 per fileset) containing the Condor job file, run script, file lists and macros. All relevant information is kept in the DB: modules, filesetlist, module statistics, cvstags, and the DST types (Dst type 1, Dst type 2, …)

13 Job Execution The run script copies the data from dCache to local disk and does an md5 checksum, then runs an independent root job for each module, filling the module output directory tree (log, data, core) and updating the DB (modules, filesetlist, mod status, module statistics, cvstags, DST types)
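The copy and checksum step is part of the taxi run script and is not shown here; the following is only a stand-alone C++ sketch of that step, using OpenSSL's MD5 routines and taking the catalog checksum as a command line argument:

```cpp
// md5_check.cpp -- hedged sketch of the checksum step (the real check is part
// of the taxi run script, not this program). Recomputes the MD5 of a file
// copied from dCache to local disk and compares it with the catalog value,
// here simply passed on the command line. Link with -lcrypto; the MD5_*
// functions are deprecated in OpenSSL 3 but still available.
#include <openssl/md5.h>
#include <cstddef>
#include <cstdio>
#include <string>

std::string md5sum(const char* path) {
  std::FILE* f = std::fopen(path, "rb");
  if (!f) return "";
  MD5_CTX ctx;
  MD5_Init(&ctx);
  unsigned char buf[1 << 20];                       // 1 MB read buffer
  std::size_t n;
  while ((n = std::fread(buf, 1, sizeof(buf), f)) > 0)
    MD5_Update(&ctx, buf, n);
  std::fclose(f);
  unsigned char digest[MD5_DIGEST_LENGTH];
  MD5_Final(digest, &ctx);
  char hex[2 * MD5_DIGEST_LENGTH + 1];
  for (int i = 0; i < MD5_DIGEST_LENGTH; ++i)
    std::snprintf(hex + 2 * i, 3, "%02x", digest[i]);
  return hex;
}

int main(int argc, char** argv) {
  if (argc != 3) {
    std::fprintf(stderr, "usage: %s <local file> <md5 from catalog>\n", argv[0]);
    return 2;
  }
  const bool ok = (md5sum(argv[1]) == argv[2]);     // mismatch -> refetch the file
  std::fprintf(stderr, "%s: checksum %s\n", argv[1], ok ? "OK" : "MISMATCH");
  return ok ? 0 : 1;
}
```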

14 Weekly Taxi Usage We run between 20 and 30 modules/week. Crunch time before conferences (e.g. QM 2009) is followed by low activity afterwards. Run10 data became available before Run10 ended!

15 Condor Usage Statistics Jobs are typically started Fridays and are done before the weekend is over (yes, we got a few more CPUs after this plot was made, it’s now 4600 condor slots). Jobs often get resubmitted during the week to pick up stragglers. 1.5 GB/sec is marked in the plot; the observed peak rate is >5 GB/sec in and out

16 dCache Throughput Jan 2009: start of statistics. Feb 2009: use of fstat instead of fstat64 in the filecatalog disabled the detection of large files (>2GB) on local disk and forced direct reads from dCache. Between 1 PB and 2 PB/month; usage will increase when Run10 data becomes available
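The filecatalog code itself is not shown on the slide; the sketch below only illustrates the pitfall described here: without large file support, stat()/fstat() fails with EOVERFLOW on files larger than 2 GB, so the local copy appears unusable and the data is read directly from dCache instead.

```cpp
// largefile_check.cpp -- illustration of the fstat/fstat64 pitfall, not the
// actual PHENIX filecatalog code. On a 32-bit build without large file
// support, stat() on a file larger than 2 GB fails with errno == EOVERFLOW,
// so the local copy looks missing and the job falls back to reading directly
// from dCache. Compiling with -D_FILE_OFFSET_BITS=64 (or calling stat64
// explicitly) fixes the size check.
#include <sys/stat.h>
#include <cerrno>
#include <cstdio>

// Returns true if the local copy exists and its size could be determined.
bool localCopyUsable(const char* path, long long& size) {
  struct stat st;                 // 64-bit st_size when built with large file support
  if (stat(path, &st) != 0) {
    if (errno == EOVERFLOW)       // file exists but is too large for a 32-bit off_t
      std::fprintf(stderr, "%s: >2GB and no large file support in this build\n", path);
    return false;
  }
  size = static_cast<long long>(st.st_size);
  return true;
}

int main(int argc, char** argv) {
  const char* path = (argc > 1) ? argv[1] : "dst_copy.root";  // hypothetical local copy
  long long size = 0;
  if (localCopyUsable(path, size))
    std::printf("reading %s (%lld bytes) from local disk\n", path, size);
  else
    std::printf("falling back to direct read from dCache\n");
}
```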

17 Local disk I/O http://root.cern.ch/drupal/content/spin-little-disk-spin TTrees are optimized for efficient reading of subsets of the data, which causes a lot of head movement when reading multiple baskets. When always reading complete events, moving to a generic format would likely improve disk I/O and reduce the file size by removing the TFile overhead. The number of cores keeps increasing and we will reach a limit when we won’t be able to satisfy the I/O required to utilize all of them. One solution is to trade CPU for I/O by calculating variables instead of storing them (with Run10 we redo a major part of our EMC clustering during readback). If precision is not important, using half precision floats is a space-saving alternative
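As an illustration of the half-precision option mentioned above, ROOT supports truncated floating point storage in TTrees via Float16_t (leaflist code 'f'); a minimal macro sketch, independent of any PHENIX code:

```cpp
// halffloat_tree.C -- minimal ROOT macro illustrating truncated float storage.
// In a leaflist, 'F' is a normal 32-bit Float_t while 'f' is a Float16_t,
// which is a full float in memory but stored with a truncated mantissa on
// disk. Run with: root -l -b -q halffloat_tree.C
#include "TFile.h"
#include "TTree.h"
#include "TRandom3.h"

void halffloat_tree()
{
  TFile out("halffloat_example.root", "RECREATE");
  TTree tree("clusters", "example tree with truncated floats");

  Float_t   energy_full = 0;   // stored with full 32-bit precision
  Float16_t energy_half = 0;   // stored truncated, saving disk space

  tree.Branch("energy_full", &energy_full, "energy_full/F");
  tree.Branch("energy_half", &energy_half, "energy_half/f");  // lowercase f = Float16_t

  TRandom3 rng(0);
  for (int i = 0; i < 100000; ++i) {
    energy_full = rng.Exp(1.0);   // toy "cluster energy" in GeV
    energy_half = energy_full;
    tree.Fill();
  }
  tree.Write();
  out.Close();                    // compare the branch sizes with tree.Print()
}
```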

18 Train: Issues Disks crash, tapes break – reproducing old data is an ongoing task. Can we create files with content identical to a production that was run 6 years ago? If not, how much of a difference is acceptable? It is easy to overwhelm the output disks (which are always full; the run script won’t start a job if its output filesystem has <200 GB free). Live and learn (and improve): a farm is an error multiplier
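The free-space guard lives in the run script; as an illustrative, stand-alone C++ sketch of an equivalent check using statvfs (the path handling is hypothetical, the 200 GB threshold is the value quoted above):

```cpp
// space_guard.cpp -- hedged sketch of the output-disk guard described above
// (the real check is part of the taxi run script): refuse to start a job if
// the output filesystem has less than 200 GB of free space.
#include <sys/statvfs.h>
#include <cstdio>

bool enoughOutputSpace(const char* outdir, unsigned long long minFreeBytes)
{
  struct statvfs vfs;
  if (statvfs(outdir, &vfs) != 0) return false;     // be conservative on error
  unsigned long long freeBytes =
      static_cast<unsigned long long>(vfs.f_bavail) * vfs.f_frsize;
  return freeBytes >= minFreeBytes;
}

int main(int argc, char** argv)
{
  const char* outdir = (argc > 1) ? argv[1] : ".";  // output filesystem to check
  const unsigned long long threshold = 200ULL * 1000 * 1000 * 1000;  // 200 GB
  if (!enoughOutputSpace(outdir, threshold)) {
    std::fprintf(stderr, "%s: less than 200GB free, not starting job\n", outdir);
    return 1;
  }
  std::printf("%s: enough space, job can start\n", outdir);
  return 0;
}
```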

19 Summary Since 2005 this tool has enabled a weekly pass over any PHENIX data set (since Run3). We push 1 PB to 2 PB per month through the system. Analysis code is tagged, so results are reproducible. Automatic rerunning of failed jobs allows for 100% efficiency. Given ever-growing local disks, we have enough headroom for years to come. Local I/O will become an issue at some point

20 BACKUP


