Analyzing ever-growing datasets in PHENIX – Chris Pinkenburg, for the PHENIX collaboration

The PHENIX Detector hh h±h± 00 Many Subsystems for different Physics High speed daq (>5kHz), selective Lvl1 triggers in pp, MinBias in AuAu Max rate ~800MB/s Emc clusters Muon candidates Charged central arm tracks Stored in reconstructed output:

PHENIX Raw Data Volume
PB-sized raw data sets will be the norm for PHENIX. Heavy-ion runs produce more data than pp runs: pp runs use triggers with high rejection factors, while heavy-ion runs record mainly minimum bias.
Easy-to-remember mapping of run to year: Run2 ended in 2002, Run3 ended in 2003, …

Reconstructed Data (DST) Size
Total size: 700 TB (1 PB including Run10). Passing over all data sets the scale for the necessary weekly I/O: 700 TB/week, i.e. roughly 700 TB / (7 × 86400 s) ≈ 1.2 GB/s sustained.
Copying the data to local disk and passing over it multiple times keeps the network I/O within acceptable limits and makes the jobs immune to network problems while processing.
Reduction: ~30% relative to the raw data. Average processing: about 500 TB/week.

Reconstructed Data (DST) Size
Size does not scale with the number of events:
–Run4: 1×10^9 events, 70 TB
–Run7: 4.2×10^9 events, 200 TB (additional cuts applied to the saved clusters/tracks)
–Run10: 10×10^9 events, 300 TB (full information, but using half floats and an improved output structure)

Number of Files
Run4 came as a "surprise", showing that 1 raw data file -> 1 DST is just not a good strategy: staging 100,000 files out of tape storage is a real challenge. Aggregating output files and increasing their size (now 9 GB) keeps the number of files at a manageable level, as sketched below.
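One way such aggregation can be done with standard ROOT tooling is TFileMerger; this is purely an illustration of the idea, not necessarily how the production scripts do it, and the file names are hypothetical:

```cpp
// Illustrative only: merge several per-segment DST files into one large
// output file so far fewer files have to be staged from tape.
#include "TFileMerger.h"

void merge_segments()
{
   TFileMerger merger;                            // local merge
   merger.OutputFile("DST_run1234_merged.root");  // hypothetical output name
   merger.AddFile("DST_run1234_seg000.root");     // hypothetical input segments
   merger.AddFile("DST_run1234_seg001.root");
   merger.AddFile("DST_run1234_seg002.root");
   merger.Merge();                                // writes the aggregated file
}
```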

PHENIX Output Files
Output is separated according to triggers, and the data are split according to content:
–central arm tracks
–EMC clusters
–muon candidates
–detector-specific info
Reading these files in parallel is possible; special synchronization makes sure we do not mix events (see the sketch below). Recalibrators bring the data "up to date".
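As an illustration of synchronized reading (this is not the actual PHENIX I/O manager; file, tree, and branch names are hypothetical), ROOT friend trees with a (run, event) index keep two output streams aligned by event identity rather than by entry number:

```cpp
// Illustration only: two DST streams, split by content, read in lockstep.
#include "TFile.h"
#include "TTree.h"

void read_synchronized()
{
   TFile ftrk("central_tracks.root");   // hypothetical track stream
   TFile femc("emc_clusters.root");     // hypothetical EMC stream
   TTree *trk = (TTree*)ftrk.Get("T");
   TTree *emc = (TTree*)femc.Get("T");

   // Build a (run, event) index on the friend so entries are matched by
   // event identity, not by their position in the file.
   emc->BuildIndex("run", "event");
   trk->AddFriend(emc, "emc");

   Int_t run = 0, event = 0;
   trk->SetBranchAddress("run", &run);
   trk->SetBranchAddress("event", &event);

   for (Long64_t i = 0; i < trk->GetEntries(); ++i) {
      trk->GetEntry(i);   // also loads the matching emc entry via the index
      // ... use branches from both streams here ...
   }
}
```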

From the Analysis Train…
The initial idea of an "analysis train" evolved from mid '04 to early '05 into the following plan:
–Reserve a subset of the RCF farm (fastest nodes, largest disks)
–Stage as much of the data set as possible onto the nodes' local disks; run all analysis modules (each previously tested on a ~10% data sample, "the stripe")
–Delete used data, stage the remaining files, run, repeat
One cycle took ~3 weeks:
–Very difficult to organize and maintain the data
–Getting ~200k files from tape was very inefficient
–Even using more machines with enough space to keep the data disk-resident was not feasible (machines down, disk crashes, forcing Condor into submission, …)
–Users unhappy with delays

… to the Analysis Taxi
Since ~autumn '05:
–Add all existing distributed disk space into dCache pools
–Stage and pin the files that are in use (once, during setup)
–Close dCache to general use; only reconstruction and the taxi driver have access. Performance when open to all users was disastrous: too many HPSS requests, frequent door failures, …
–Users can "hop in" every Thursday; requirements are code tests (valgrind), limits on memory and CPU time consumption, and approval from the working group for output disk space
–Typical time to run over one large data set: 1-2 days

RHIC Computing Facility (PHENIX portion)
~600 compute nodes, ~4600 Condor slots
~2 PB of distributed storage on the compute nodes, in chunks of 1.5-8 TB, managed by dCache and backed by HPSS
BlueArc NFS server: ~100 TB

User Interfaces
–Sign up for the nightly rebuild; the signup is retired after 3 months, re-signup is a button click
–Sign up for a pass; a code test with valgrind is required
–Module status page on the web
–Taxi summary page on the web
–A module can be removed from the current pass
The basic idea: the user hands us a macro and tells us the dataset and the output directory. The rest is our problem (job submission, removal of bad runs, error handling, rerunning failed jobs). A sketch of such a macro follows.
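Purely as an illustration of how small the user-supplied piece can be (the class, library, and hook names here are made up, not the real PHENIX framework API), such a macro might look like:

```cpp
// Hypothetical ROOT macro handed to the taxi: it only loads the user's
// tested analysis library and registers one module; the dataset, file
// lists, and output directory are supplied by the taxi machinery.
void RunMyAnalysis(const char *outdir)
{
   gSystem->Load("libMyAnalysis.so");   // user's analysis code, valgrind-tested
   MyAnalysisModule *mod =
      new MyAnalysisModule(Form("%s/myana_hist.root", outdir)); // hypothetical class
   RegisterModule(mod);                 // hypothetical hook called by the taxi driver
}
```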

Job Submission
The submit perl script creates the module output directory tree (log, data, core) and one Condor directory per fileset, containing the Condor job file, the run script, file lists, and macros. All relevant information is kept in the DB: modules, filesetlist, module statistics, cvstags, and the DST types.

Job Execution
The run script copies the data from dCache to local disk and verifies the md5 checksum, then runs an independent ROOT job for each module (the copy-and-verify step is sketched below). Results go into the module output directory tree (log, data, core), and module status and statistics are recorded in the DB (modules, filesetlist, mod status, module statistics, cvstags).
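A minimal sketch of that copy-and-verify step (illustrative only; the paths are hypothetical and the expected checksum would come from the file catalog, and the exact commands in the production run script may differ):

```cpp
// Copy a file out of dCache with dccp and verify it against the
// catalogued MD5 before any analysis module touches it.
#include "TMD5.h"
#include "TSystem.h"
#include "TString.h"
#include <cstring>
#include <cstdio>

bool copy_and_verify(const char *dcache_path, const char *local_path,
                     const char *expected_md5)
{
   // dccp is the standard dCache copy client; a non-zero return means the
   // transfer failed and the job should retry or bail out.
   if (gSystem->Exec(Form("dccp %s %s", dcache_path, local_path)) != 0)
      return false;

   TMD5 *sum = TMD5::FileChecksum(local_path);   // ROOT's MD5 helper
   bool ok = (sum != 0) && (std::strcmp(sum->AsString(), expected_md5) == 0);
   delete sum;
   if (!ok) std::printf("checksum mismatch for %s\n", local_path);
   return ok;
}
```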

Weekly Taxi Usage
We run between 20 and 30 modules per week. Crunch time before conferences (e.g. QM 2009) is followed by low activity afterwards. Run10 data became available before Run10 ended!

Condor Usage Statistics
Jobs are typically started on Fridays and are done before the weekend is over; jobs often get resubmitted during the week to pick up stragglers. (Yes, we got a few more CPUs after this plot was made; it is now 4600 Condor slots.) Rates of 1.5 GB/s were observed, with peak rates >5 GB/s in and out.

dCache Throughput
Jan 2009: start of statistics. Feb 2009: the use of fstat instead of fstat64 in the file catalog disabled the detection of large files (>2 GB) on local disk and forced direct reads from dCache. Between 1 and 2 PB/month; usage will increase when the Run10 data becomes available.
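As an aside on that pitfall, a sketch of the issue (assuming a 32-bit build, which is where it bites): without large-file support, fstat() fails with EOVERFLOW on files larger than 2 GB, so a catalog relying on it cannot tell that the file already sits on local disk.

```cpp
// Defining _FILE_OFFSET_BITS=64 (before any system header) makes off_t
// 64 bit and maps fstat onto fstat64, so >2 GB files are handled correctly.
#define _FILE_OFFSET_BITS 64
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

long long local_file_size(const char *path)
{
   int fd = open(path, O_RDONLY);
   if (fd < 0) return -1;                 // not on local disk at all
   struct stat st;
   long long size = (fstat(fd, &st) == 0) ? (long long)st.st_size : -1;
   close(fd);
   return size;                           // -1: fall back to reading from dCache
}
```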

Local Disk I/O
TTrees are optimized for efficient reading of subsets of the data; this means a lot of head movement when reading multiple baskets. When always reading complete events, moving to a generic format would likely improve disk I/O and reduce the file size by removing the TFile overhead.
The number of cores keeps increasing, and we will reach a limit where we cannot satisfy the I/O required to utilize all of them. One solution is to trade CPU against I/O by calculating variables instead of storing them (with Run10 we redo a major part of our EMC clustering during readback), as illustrated below. If precision is not important, using half-precision floats is a space-saving alternative.
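A schematic example of the calculate-instead-of-store idea (branch, tree, and file names are made up; this is not the actual PHENIX cluster code): keep only the primitive quantities on disk and derive the rest at read time.

```cpp
// Illustration: store only px, py, pz and recompute derived kinematics
// (pT, eta) while reading, trading a little CPU for less I/O.
#include "TFile.h"
#include "TTree.h"
#include <cmath>

void derive_on_read()
{
   TFile f("tracks.root");               // hypothetical input file
   TTree *t = (TTree*)f.Get("tracks");   // hypothetical tree name

   float px, py, pz;                     // the only stored variables
   t->SetBranchAddress("px", &px);
   t->SetBranchAddress("py", &py);
   t->SetBranchAddress("pz", &pz);

   for (Long64_t i = 0; i < t->GetEntries(); ++i) {
      t->GetEntry(i);
      float pt = std::sqrt(px*px + py*py);                 // derived, not stored
      float p  = std::sqrt(pt*pt + pz*pz);
      float eta = 0.5f * std::log((p + pz) / (p - pz));    // derived, not stored
      (void)eta; // ... fill histograms, apply cuts, etc. ...
   }
}
```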

Train: Issues
Disks crash, tapes break; reproducing old data is an ongoing task. Can we create files with content identical to a production that was run 6 years ago? If not, how much of a difference is acceptable?
It is easy to overwhelm the output disks (which are always full; the run script won't start a job if its output filesystem has <200 GB free).
Live and learn (and improve): a farm is an error multiplier.

Summary
–Since 2005 this tool has enabled a weekly pass over any PHENIX data set (since Run3)
–We push 1 PB to 2 PB per month through the system
–Analysis code is tagged; results are reproducible
–Automatic rerunning of failed jobs allows for 100% efficiency
–Given ever-growing local disks, we have enough headroom for years to come
–Local I/O will become an issue at some point

BACKUP