Status report from the Deferred Trigger Study Group
John Baines, Giovanna Lehmann Miotto, Wainer Vandelli, Werner Wiedenmann, Eric Torrence, Armin Nairz
Use cases

Deferred Triggers: a subset of events is stored in the DAQ system & processed later in the run, in a separate stream - potentially useful when CPU is saturated at the start of a fill.

Broadly two different classes of use case:
- Deferred HLT processing: deferred stream based on L1. Build the event, cache it, then run the HLT later.
  - Caching at ~5-10 kHz, deferred processing ~1 s/event, rejection ~50-100
  - High cache rate => need high replay rate => shorter per-event processing time
  - EB rate for deferred + prompt must fit in the budget (~20 kHz for 2nd-generation ROS)
  - e.g. cache all L1 multi-jet events (3-5 kHz for 4J20 at 2-3x10^34) & run topoclustering
  - e.g. cache L1 multi-jet and/or high-pT dilepton triggers & run HLT tracking for a displaced-vertex trigger
- Post-HLT processing: deferred stream based on the HLT result. Very similar to the L4 case.
  - Caching at ~0.5-1 kHz, deferred processing ~10 s/event, rejection ~5-10
  - Lower cache rate => lower replay rate => longer per-event processing time allowed
  - Could be used to increase efficiency for the same Tier-0 rate: apply a looser selection in the HLT, then the deferred trigger runs a slower offline-style selection & applies tighter cuts
  - e.g. a deferred stream for triggers requiring full-event EF tracking, e.g. MET, b-jet, tau
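As a rough cross-check of why these streams must be deferred rather than processed live, a minimal sketch (Python; the helper and its outputs are illustrative, while the rates, times, and rejections are the figures above):

```python
# Back-of-envelope check: processing these streams live at the quoted rates
# would need thousands of dedicated cores, hence the deferral.
def deferred_load(cache_rate_hz, proc_time_s, rejection):
    """Return (rate surviving selection [Hz], cores needed to keep up live)."""
    return cache_rate_hz / rejection, cache_rate_hz * proc_time_s

# Deferred HLT processing: cache on L1 at ~5 kHz, ~1 s/event, rejection ~50
print(deferred_load(5_000, 1.0, 50))    # -> (100.0, 5000.0)

# Post-HLT processing: cache on HLT result at ~1 kHz, ~10 s/event, rejection ~10
print(deferred_load(1_000, 10.0, 10))   # -> (100.0, 10000.0)
```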
Deferred Trigger Processing Options

Processing options:
- Inter-fill processing: only process the deferred stream between physics fills
  - Different processes from the normal prompt triggers (file-based, like offline debug-stream recovery)
  - Started between fills when the farm is relatively idle; stopped when a new fill starts
  => Baseline option
- In-fill plus inter-fill processing: attempt to also make use of spare CPU capacity later in the run
  - Little gain if the LHC uses luminosity levelling
  - Could be in competition with end-of-fill triggers
  - Dynamic partitioning of the HLT farm: different processes for normal prompt processing & for processing cached events
    - Need to dynamically vary the partitioning as CPU usage for prompt processing changes
    - Delays to reconfigure the partition and start/stop processes
    - Significant difficulties for DAQ
    => Disfavoured option
  - Variable deferral fraction: still only inter-fill processing of cached events, but add the ability to process some fraction of the deferred triggers promptly in the normal trigger processes
    - Mechanism similar to pre-scales used to update the fraction of events cached during the run
    - Relatively small change for online, but events from the same LB are split between prompt and deferred files
    => Significant additional complexity for Tier-0 => Disfavoured option
Storage options

Distributed storage: local disks of the HLT nodes
+ Potentially large, ~1600 TB
+ Distributed => play-back not limited by data rates from disk
- Not RAID disk => not secure
- Book-keeping & operational difficulties; can't balance the load for playback
=> Disfavoured option

Central storage: expand the existing SFO
+ Secure storage; much higher fault tolerance
+ Can balance the load across the farm during play-back
+ Straightforward book-keeping
+ Minimizes the changes needed to the current system
- Playback limited to data rates of ~5 GB/s (2.5 kHz event rate)
=> Baseline option

Clustered storage: per-rack SFO-like disk-servers
+ Lower number of disks than the distributed scheme => retains some of the advantages of the central scheme
+ More distributed than the central scheme => higher playback rates
=> Solution for higher rates
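The 2.5 kHz quoted for central storage follows directly from the bandwidth limit once an event size is assumed; a quick check (Python; the ~2 MB event size is an inference from the rate/bandwidth pairs used throughout these slides, e.g. 1 kHz paired with 2 GB/s on the requirements slide):

```python
# Playback rate implied by the central-storage read-bandwidth limit.
# The ~2 MB event size is an assumption consistent with the rate/bandwidth
# pairs quoted elsewhere in this deck.
EVENT_SIZE_BYTES = 2e6
playback_bandwidth = 5e9                       # bytes/s
print(playback_bandwidth / EVENT_SIZE_BYTES)   # -> 2500.0 Hz, i.e. 2.5 kHz
```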
Order of Magnitude Cost Estimate

Baseline: inter-fill processing, central storage
- 1 kHz caching rate, 2.5 kHz playback
- 8 s/event processing time
- 210 TB disk cache
- Hardware cost: ~100 kCHF
- Possible use case: EF full-scan for MET/tau/b-jet

High-rate system (baseline x 10): inter-fill processing, clustered storage
- 10 kHz caching rate, 25 kHz playback
- 0.8 s/event processing time (a factor 10 less)
- 2100 TB disk cache
- Hardware cost: ~1 MCHF
- Possible use cases:
  - Multi-jets: topoclustering
  - Displaced-vertex trigger: L2 ID full-scan for multi-jets or high-pT muons

Processing power equivalent to 40% of the current farm capacity (for both systems).
Wall-time to process: <30 hours based on 2012 fill data; could be longer for a more efficient LHC.

Effort needed: 3.5 SY for online sw infrastructure changes + 0.25 SY for Tier-0 sw infrastructure changes
- excludes the effort to develop, configure & install hardware, and operational effort
Time-scale: 1 year for sw development + commissioning during an extended break, e.g. a winter shutdown
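The disk-cache sizes scale straightforwardly with the caching rate; a minimal sizing sketch (Python; the ~2 MB event size and the ~29 hours of back-to-back caching are assumptions chosen to be consistent with the 210 TB / 2100 TB figures above):

```python
# Rough disk-cache sizing: the cache fills at (caching rate x event size) and
# is only drained between fills, so the worst case is roughly the data cached
# over the longest stretch of back-to-back fills.
EVENT_SIZE_BYTES = 2e6      # ~2 MB/event (assumption)
HOURS_CACHED = 29           # ~a day of back-to-back fills (assumption)

for cache_rate_hz in (1_000, 10_000):   # baseline, high-rate
    tb = cache_rate_hz * EVENT_SIZE_BYTES * HOURS_CACHED * 3600 / 1e12
    print(f"{cache_rate_hz / 1000:.0f} kHz -> ~{tb:.0f} TB")
# -> "1 kHz -> ~209 TB" and "10 kHz -> ~2088 TB", close to the figures above
```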
Summary

A deferred stream could have significant benefits for a CPU-limited farm, BUT:
- Deferred-stream processing is only suitable for specific use cases (low rate, high processing time) - much less flexible than normal prompt processing
  - Preferable to address the need for added CPU by upgrading nodes or adding racks
- Significant cost: both hardware & effort

The preferred scheme is inter-fill processing:
- In-fill processing is unattractive due to the added complexity: online for dynamic partitioning, or offline for the variable deferral fraction
- Central or clustered storage preferred

A baseline infrastructure could provide:
- up to 2.5 kHz deferred-stream rate
- 8 s/event for processing
- processing completed within 48 hrs (under 2012 operating conditions)
In the case of a more efficient LHC, the deferred-stream rate would need to be lowered.
Additional Material
Introduction

Deferred Triggers: a subset of events stored in the DAQ system & processed later in the run.

Two processing options considered:
- Inter-fill processing: only process the deferred stream between physics fills
- Dynamic processing: process both in-fill and inter-fill - attempt to also make use of spare CPU capacity later in the run
  - Potential competition with end-of-fill triggers
  - ~50% decrease after 4 hours
Assumptions

Events are built before being cached
- may contain an intermediate HLT result if the HLT is run before caching

The deferred stream consists of a specific subset of triggers:
- must not include triggers needed by the calibration stream to produce constants for bulk processing

Deferred triggers are output to a separate stream. The deferred stream needs:
- Different constants - possibly from a different run
- Separate monitoring - relates to past, not current, conditions
- Independence from the state of the on-going run

Separate processes are needed for deferred-stream processing; file-based processing is the most straightforward.

Need to partition the farm between prompt and deferred processing and dynamically balance resources:
- Relatively straightforward in the inter-fill scheme
- Difficult in the dynamic scheme
=> The inter-fill scheme is the baseline
Disk size & Total Processing time

Inter-fill scheme: includes the delays due to the pausing of reprocessing during subsequent physics fills.

[Plots: Disk Usage by Deferred Stream (TB); Wall-time to process deferred stream (hours)]

Results of Eric's model based on 2012 fill information.
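A toy version of what such a model does - accumulate during fills, drain between them - might look like this (Python; the fill schedule and all names are illustrative, not Eric's actual model, which used the recorded 2012 fill data):

```python
# Toy inter-fill model: the cache grows at the caching rate during fills and
# drains at the playback rate between them.
EVENT_MB = 2.0   # assumed event size

def simulate(schedule, cache_hz, playback_hz):
    """schedule: list of (in_fill, duration_s) segments.
    Returns (peak disk usage [TB], wall-time to finish processing [h])."""
    backlog = peak = t = 0.0         # backlog counted in events
    for in_fill, dt in schedule:
        rate = cache_hz if in_fill else -playback_hz
        backlog = max(0.0, backlog + rate * dt)
        peak = max(peak, backlog)
        t += dt
    t += backlog / playback_hz       # drain what is left after the last fill
    return peak * EVENT_MB / 1e6, t / 3600.0

# e.g. three 12-hour fills separated by 4-hour inter-fill gaps
schedule = [(True, 12 * 3600), (False, 4 * 3600)] * 3
print(simulate(schedule, cache_hz=1_000, playback_hz=2_500))
```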
Time to process

[Plots of the time to process for three configurations: cache 0.5 kHz / playback 2.5 kHz; cache 1 kHz / playback 2.5 kHz; cache 2.5 kHz / playback 2.5 kHz]

Inter-fill scheme: includes the delays due to the pausing of reprocessing during subsequent physics fills.
Disk Usage

[Plots of the disk usage for three configurations: cache 0.5 kHz / playback 2.5 kHz; cache 1 kHz / playback 2.5 kHz; cache 2.5 kHz / playback 2.5 kHz]

Inter-fill scheme: includes the effect of the delays due to the pausing of reprocessing during subsequent physics fills.
Requirements - some examples: inter-fill processing

| Caching [kHz (GB/s)] | Playback [kHz (GB/s)] | Max. wall-time to process [h] | Max. disk usage [TB] | Average HLT processing time [s/event] | Effective increase in farm processing capacity |
|---|---|---|---|---|---|
| 0.5 (1) | 2.5 (5) | 23 | 85 | 8 | 20% |
| 1 (2) - Baseline | 2.5 (5) | 29 | 210 | 8 | 40% |
| 2.5 (5) | 2.5 (5) | 49 | 660 | 8 | 100% |
| 10 (20) - High-rate | 25 (50) | 29 | 2100 | 0.8 | 40% |
| 10 (20) | 10 (20) | 49 | 2640 | 2 | 100% |

Wall-time and disk usage are from the model. Average HLT processing time = 20k cores / playback rate. Effective increase = HLT processing time x caching rate / 20k = caching rate / playback rate.
The 10 kHz rows require clustered storage.
Current SFO: 6x21 TB + 3x10 TB disks => 156 TB; write: 1.6 GB/s; read: 2 GB/s.
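The last two columns follow from the two relations quoted under the table; a minimal check (Python; the ~20k-core farm size is the slide's own figure, the function names are illustrative):

```python
# The two relations quoted on the slide, with the ~20k-core farm size:
#   average HLT processing time = 20k cores / playback rate
#   effective capacity increase = caching rate / playback rate
FARM_CORES = 20_000

def hlt_proc_time_s(playback_hz):
    return FARM_CORES / playback_hz

def capacity_increase(caching_hz, playback_hz):
    return caching_hz / playback_hz

for caching_hz, playback_hz in [(500, 2_500), (1_000, 2_500), (2_500, 2_500),
                                (10_000, 25_000), (10_000, 10_000)]:
    print(f"{caching_hz / 1e3} kHz / {playback_hz / 1e3} kHz: "
          f"{hlt_proc_time_s(playback_hz)} s/event, "
          f"{capacity_increase(caching_hz, playback_hz):.0%}")
# reproduces the 8 / 8 / 8 / 0.8 / 2 s/event and 20/40/100/40/100% columns
```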
In-fill & Inter-fill processing

Dynamic partitioning of the farm has to take changes in the CPU requirement into account dynamically. Each change imposes delays to configure & start/abort processes => hard!
Relatively small potential gains (except in a special case), assuming 20% of the farm is used for prompt processing after 4 hours:

| Caching [kHz (GB/s)] | Playback [kHz (GB/s)] | Max. wall-time to process [h] | Max. disk usage [TB] |
|---|---|---|---|
| 0.5 (1) | 2.5 (5) | 0.8 c.f. 23 | 14 c.f. 85 |
| 1 (2) | 2.5 (5) | 25 c.f. 29 | 113 c.f. 210 |
| 1.5 (3) | 2.5 (5) | 31 | 253 |

Special case: in-fill processing rate = caching rate.

Would it be possible to use a mechanism similar to end-of-fill triggers? Define a variable deferral fraction (sketched below):
- Set to 1 at the start of the run
- Set to e.g. 0.8 during the run => 80% of deferred triggers cached, 20% processed promptly
- Big disadvantage: events from the same lumi block end up in output files produced up to 48 hrs apart
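A minimal sketch of such a prescale-like deferral fraction (Python; the class and its methods are hypothetical, not an existing online interface):

```python
import random

# Hypothetical prescale-like deferral fraction, as floated on this slide;
# purely illustrative, not an existing online mechanism.
class DeferralRouter:
    def __init__(self, fraction=1.0):
        self.fraction = fraction               # 1.0 => cache everything

    def update(self, fraction):
        """Analogous to a pre-scale update at a lumi-block boundary."""
        self.fraction = fraction

    def route(self):
        """Decide the fate of one deferred-trigger event."""
        return "cache" if random.random() < self.fraction else "prompt"

router = DeferralRouter(1.0)   # start of run: defer everything
router.update(0.8)             # later: 80% cached, 20% processed promptly
print(router.route())
```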
DAQ & HLT

Activation of deferred-stream processing should be automatic
- but it can be stopped/aborted by an expert

Error handling should not normally require operator intervention
- but alert an expert if the system cannot restart correctly

Must be possible to rapidly stop the partition when needed
- and to re-start again from that point when CPU becomes available

Need to define the action in case the disks become full:
- stop the deferred stream
- exceptionally, transfer events unprocessed to Tier-0? (if the rate is ~500 Hz)

An extensive book-keeping framework is needed:
- to drive the play-back
- to account for possible data losses
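For flavour, the kind of per-file record such book-keeping might track - a minimal sketch (Python; every field and state name here is an illustrative assumption, not an existing schema):

```python
from dataclasses import dataclass
from enum import Enum

# Illustrative per-file book-keeping record for driving play-back and
# accounting for losses.
class FileState(Enum):
    CACHED = "cached"          # written to the cache, awaiting play-back
    PROCESSING = "processing"  # assigned to HLT processes between fills
    DONE = "done"              # fully processed, safe to delete
    LOST = "lost"              # e.g. disk failure - must be accounted for

@dataclass
class CachedFile:
    run: int
    lumi_block: int
    path: str
    n_events: int
    state: FileState = FileState.CACHED
```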
Tier0

While it is technically possible to deal with delays > 48 hours, anything that deviates from the standard work-flow is significant extra work => should stay within 48 hours except in very rare exceptions.

Important that the output files are LB-aware, i.e. closed at LB boundaries (a sketch follows this slide).

In the case of the clustered or distributed options, a significant addition to the T0 would be needed to merge files:
- Multi-step RAW-file merging needed (more complicated than the current 1-step process)
- Currently ~10 files per LB; could be ~200 smaller files for clustered storage (even more for distributed storage)

Completeness of the dataset is an issue: completeness is relied on in many places
- e.g. the RAW merging job is only defined for complete data
- Would need to adapt the T0 workflow to enable processing of the prompt stream with only partially complete LBs

Extra infrastructure is needed if, in exceptional circumstances, unprocessed events are streamed to Tier-0:
- Complete HLT processing & re-streaming needed offline - similar to the debug reprocessing but on a much bigger scale (~10M events c.f. a few hundred)
- Retro-active insertion of the processed data into the handshake DB
- Merging of the many small files produced
- Need to add these to the files from the truncated online processing before bulk reconstruction
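The LB-aware requirement just means a writer rolls its output file whenever the lumi block changes; a minimal sketch (Python; the class is illustrative, not the SFO's actual implementation):

```python
# Illustrative LB-aware writer: the output file is closed and a new one
# opened whenever the lumi block changes, so no file spans an LB boundary.
class LBAwareWriter:
    def __init__(self, prefix):
        self.prefix = prefix
        self.current_lb = None
        self.f = None

    def write(self, lumi_block, event_bytes):
        if lumi_block != self.current_lb:
            self.close()
            self.current_lb = lumi_block
            self.f = open(f"{self.prefix}.lb{lumi_block:04d}.data", "ab")
        self.f.write(event_bytes)

    def close(self):
        if self.f is not None:
            self.f.close()
            self.f = None

# usage: w = LBAwareWriter("deferred"); w.write(lb, raw_event); w.close()
```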
DQ

Online monitoring should be separate.

Offline, it should be possible to treat the deferred stream in the same way as the other streams:
- Deferred triggers adequately represented in the express stream
- Deferred stream available for bulk processing within 48 hours of run-end
- Need a stream-dependent good-run list