1
Virtual Organization Approach for Running HEP Applications in Grid Environment
Łukasz Skitał (1), Łukasz Dutka (1), Renata Słota (2), Krzysztof Korcyl (3), Maciej Janusz (2), Jacek Kitowski (1,2)
(1) ACC CYFRONET AGH, Cracow
(2) Institute of Computer Science AGH-UST, Cracow
(3) Institute of Nuclear Physics PAN, Cracow
2
Outline
- Introduction
- Application Requirements
- Architecture
- Virtual Organization
  - Requirements
  - Certification
  - Monitoring
  - Dynamic Processing Tasks Pool
- Summary/Conclusions
3
Introduction
- Part of the int.eu.grid project
- State of the art: current interactive applications in the Grid
  - CrossGrid
  - RealityGrid
  - GridPP
- Why the Grid?
  - Gain more resources
  - Use the resources for other purposes when ATLAS is off
- Why a VO?
  - Provides a stable yet dynamic environment for the application
  - The HEP application must process all events without any loss of data
4
HEP Application in Brief
[Diagram: at the CERN Computing Center, the SFI feeds local Event Processing Farms (PFs), with a Back End Network delivering events via SFOs to mass storage; a Dispatcher connects, over the packet-switched GEANT WAN and a lightpath, to remote Processing Farms in Copenhagen, Edmonton, Kraków and Manchester.]
5
HEP Application in Brief
- Three filtering levels:
  - Hardware level (lvl 1)
  - Small local farm (lvl 2)
  - Complex event filtering (lvl 3)
- Abbreviations: SFI - SubFarm Interface; SFO - SubFarm Output; PF - Processing Farm
[Diagram: sensors feed the Lvl1 filters (latency 2.5 µs, 120 GB/s input); Lvl2 filters read from buffers (~10 ms); the LVL3 Event Filter with Event Filter Processors (EFPs) takes ~1 s per event; data flows from the SFI (~4 GB/s) through the Event Filter Network (EFN) to the SFO (~300 MB/s).]
6
HEP Requirements
- Real-time application
- High throughput (estimated):
  - 3500 events per second
  - 1.5 MB per event
  - On average, 1 second to compute one event on a typical processor
- Infrastructure monitoring (for load balancing)
- An efficient way to distribute events to worker nodes
  - The Grid job submission mechanism is not sufficient: a simple job submission takes minutes, while we have only seconds
- Failure recovery
  - Malfunctions of single nodes are acceptable, but they have to be detected
- Application monitoring
- Infrastructure monitoring (for availability checks)
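The throughput figures above fix the scale of the problem. A back-of-the-envelope sketch (using only the numbers stated on the slide) shows the input bandwidth the event distribution layer must sustain and, by Little's law, how many processors must run concurrently:

```python
# Sizing sketch using the figures from the requirements slide.
EVENT_RATE = 3500            # events per second
EVENT_SIZE_MB = 1.5          # average event size
CPU_SECONDS_PER_EVENT = 1.0  # average compute time on a typical processor

# Input bandwidth the distribution layer must sustain:
bandwidth_mb_s = EVENT_RATE * EVENT_SIZE_MB              # 5250 MB/s ≈ 5.25 GB/s
# Concurrent processors needed (arrival rate × service time):
processors_needed = EVENT_RATE * CPU_SECONDS_PER_EVENT   # 3500 worker nodes

print(f"{bandwidth_mb_s / 1000:.2f} GB/s, {processors_needed:.0f} processors")
```

This is why the slide argues that per-event Grid job submission (minutes of latency) cannot work: thousands of long-lived workers must already be in place when events arrive.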
7
HEP Application Integration with the Grid
- Job submission for each event is too slow
- Job submission for a bunch of events is still too slow
- We need interactive communication
- The pilot-job idea:
  - One job allocates a node and starts a PT (Processing Task)
  - A dedicated queue in the LRMS for HEP pilot jobs (HEP VO)
  - One PT processes many events
- Direct communication between the PT and the ATLAS experiment
  - Faster than job submission
  - The ATLAS experiment provides events (1.5 MB/event)
  - The PT responds with event analysis results (1 kB/event)
  - Asynchronous communication with event buffering
- Limited PT lifetime to allow dynamic resource allocation
  - Lifetime set by the queue or the PT configuration
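The pilot-job scheme above can be sketched as a loop: once the pilot has allocated a node, the PT repeatedly pulls events over the direct channel and answers with short results until its walltime runs out. This is a minimal illustration; `FakeChannel` and `analyse` are stand-ins for the real PT-experiment connection and the physics analysis, not part of the actual system:

```python
import time
from collections import deque

class FakeChannel:
    """Stand-in for the direct PT <-> experiment connection (illustrative only)."""
    def __init__(self, events):
        self.events = deque(events)   # events buffered on the experiment side
        self.results = []             # short analysis answers from the PT
    def request_event(self):
        return self.events.popleft() if self.events else None
    def send_result(self, event_id, result):
        self.results.append((event_id, result))

def analyse(event):
    # Placeholder for the ~1 s/event physics analysis; returns a tiny summary.
    return {"accept": sum(event["data"]) % 2 == 0}

def processing_task(channel, lifetime_s):
    """One PT processes many events until its walltime is nearly used up."""
    deadline = time.monotonic() + lifetime_s
    while time.monotonic() < deadline:
        event = channel.request_event()
        if event is None:
            break  # a real PT would wait asynchronously for more buffered events
        channel.send_result(event["id"], analyse(event))

channel = FakeChannel([{"id": i, "data": [i, i + 1]} for i in range(5)])
processing_task(channel, lifetime_s=10)
print(len(channel.results))  # → 5
```

The key point the sketch makes concrete: after the one-time pilot submission, per-event cost is a round-trip on an open channel, not a new Grid job.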
8
Proposed HEP Architecture
[Diagram: at CERN, the SFI feeds EFDs with local buffers; a Dispatcher and a proxyPT route events to the local PT farm and, through the Grid (UI/WMS, CEs, WNs), to remote PTs; Infrastructure Monitoring, Application Monitoring and the HEP VO Database support the HEP VO.]
9
Components
- EFD (Event Filter Dataflow)
  - Takes events from the SFI and places them in a local buffer
  - Events are distributed to PTs (local or remote)
  - Depending on the PT's answer, an event is stored or flushed
- Processing Task (PT)
  - Runs on worker nodes (WNs)
  - Processes events and answers with short analysis data
- ProxyPT
  - Interface to remote PTs
- Dispatcher
  - Coordinates task distribution from the EFD to PTs
- Infrastructure Monitoring
  - Network load/status, WN status
- Application Monitoring
  - PT application state
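The EFD's store-or-flush logic can be sketched as follows. Class and method names here are illustrative, not the real ATLAS API; the point is only the flow: buffer events from the SFI, hand each to a PT, keep accepted events for the SFO and drop the rest:

```python
from collections import deque

class EFD:
    """Sketch of the Event Filter Dataflow (names are illustrative)."""
    def __init__(self):
        self.buffer = deque()   # events held until a PT has answered
        self.stored = []        # accepted events, bound for SFO / mass storage
    def take_from_sfi(self, event):
        self.buffer.append(event)
    def dispatch(self, pt):
        while self.buffer:
            event = self.buffer.popleft()
            answer = pt.process(event)      # short analysis data from the PT
            if answer["accept"]:
                self.stored.append(event)   # store for the SFO
            # rejected events are simply flushed from the buffer

class LocalPT:
    def process(self, event):
        return {"accept": event["energy"] > 50}  # toy selection cut

efd = EFD()
for e in [{"id": 1, "energy": 80}, {"id": 2, "energy": 10}, {"id": 3, "energy": 60}]:
    efd.take_from_sfi(e)
efd.dispatch(LocalPT())
print([e["id"] for e in efd.stored])  # → [1, 3]
```

Buffering until the PT answers is what makes single-node failures tolerable: an unanswered event is still in the buffer and can be redistributed.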
10
HEP Virtual Organization
- Purpose
  - Provides the runtime environment for the HEP application
  - Fulfills the application's requirements
  - Realizes the site certification process
- Architecture
  - High level (static) - sites
    - Certification
      - Agreements
      - Configuration guidelines
      - Functional tests
  - Low level (dynamic) - resources
    - Runtime environment
      - Dynamic resource allocation
      - Monitoring and failover
      - Load balancing
11
Site Certification - Requirements
- Long-term ability to provide services and resources
- Legal issues/agreements
- LRMS configuration
  - Dedicated high-priority (but short) queues on computing elements for jobs from the HEP VO
- Ability to communicate safely between the site's WNs and the CERN HEP nodes:
  - A specified port opened on the CERN side
  - A specified port opened on the site side
  - A trusted proxy to set up two-way communication
  - Channel encryption
12
HEP VO Site Operation Process
- Certification phase
  - Long-term tests for reliability, performance and updates (application, databases)
  - Sites tested using artificial/calibration data
  - Communication between the site's WNs and the CERN HEP nodes
- Operation phase, with runtime-environment monitoring
  - Operates on production data
  - Checks during PT startup: proper environment, up-to-date application, databases, etc.
  - Infrastructure and application monitoring
  - Dynamic resource allocation
    - Excluding nodes/sites which are frequently unavailable
    - Temporarily excluded sites/nodes cannot process real data, but they can still receive test jobs
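The exclusion rule in the operation phase can be sketched as a simple classification over site availability statistics. The 0.95 threshold and the site uptime figures below are illustrative assumptions, not values from the slides:

```python
def classify_sites(availability, threshold=0.95):
    """Sketch: sites whose availability falls below a threshold are
    temporarily excluded from real data and receive test jobs only."""
    production, test_only = [], []
    for site, uptime in availability.items():
        (production if uptime >= threshold else test_only).append(site)
    return production, test_only

prod, test_sites = classify_sites(
    {"Krakow": 0.99, "Manchester": 0.97, "Edmonton": 0.80}
)
print(prod, test_sites)  # → ['Krakow', 'Manchester'] ['Edmonton']
```

Because excluded sites keep receiving test jobs, their availability statistics continue to accumulate and a recovered site can be readmitted automatically.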
13
HEP Virtual Organization
[Diagram: the high-level VO performs site certification (certification agreements, guidelines, functional tests); the low-level VO performs operation (dynamic resource allocation, monitoring); the two levels interact through management issues, communication and site availability statistics.]
14
Monitoring for the HEP VO
- Monitoring using external tools
  - Application Monitoring (with a tool like J-OCM):
    - deployed on every worker node running a HEP PT
    - provides information about the current execution status
    - monitors computation time
  - JIMS for Infrastructure Monitoring:
    - availability of worker nodes
    - load of worker nodes
    - free memory
    - network throughput between CERN and the remote computing farm
- Failover takes advantage of monitoring
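A failover decision based on these metrics can be sketched as a usability check per worker node. The thresholds and the field names below are illustrative assumptions, not values reported by JIMS:

```python
def node_usable(load, free_memory_mb, reachable,
                load_limit=0.9, min_free_mb=512):
    """Sketch: a WN is usable for the HEP VO only if it is reachable
    and has spare CPU and memory (thresholds are assumptions)."""
    return reachable and load < load_limit and free_memory_mb > min_free_mb

print(node_usable(load=0.5, free_memory_mb=2048, reachable=True))   # → True
print(node_usable(load=0.95, free_memory_mb=2048, reachable=True))  # → False
```

In the architecture above, a node failing this check would be dropped from the low-level VO's runtime environment and its buffered events redistributed.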
15
Dynamic Resource Allocation
- Dynamic Processing Tasks pool
  - A malfunctioning PT is excluded from the runtime environment (low-level VO)
  - PT lifetime is limited by the queue length (walltime)
    - each 'normal' job has its own lifetime, specified before execution
    - 'interactive' type of job
  - The pool has to be refreshed periodically
- Fair sharing of resources
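The periodic refresh of the PT pool can be sketched as: drop PTs that malfunctioned or whose walltime expired, then top the pool back up with new pilot jobs. `submit_pilot` stands in for submission to the dedicated HEP VO queue in the LRMS; the field names and sizes are illustrative:

```python
def refresh_pool(pool, target_size, submit_pilot):
    """Sketch of the dynamic PT pool refresh (names are illustrative)."""
    # Keep only healthy PTs that still have walltime left:
    alive = [pt for pt in pool if pt["healthy"] and pt["remaining_walltime"] > 0]
    # Top the pool back up with fresh pilot-job allocations:
    while len(alive) < target_size:
        alive.append(submit_pilot())
    return alive

pool = [
    {"id": 1, "healthy": True,  "remaining_walltime": 120},
    {"id": 2, "healthy": False, "remaining_walltime": 300},  # malfunctioning
    {"id": 3, "healthy": True,  "remaining_walltime": 0},    # walltime expired
]
next_id = [3]
def submit_pilot():
    next_id[0] += 1
    return {"id": next_id[0], "healthy": True, "remaining_walltime": 600}

new_pool = refresh_pool(pool, target_size=3, submit_pilot=submit_pilot)
print([pt["id"] for pt in new_pool])  # → [1, 4, 5]
```

Bounding each PT's lifetime by the queue walltime is what keeps the sharing fair: no site's nodes are held by the HEP VO indefinitely, yet the pool as a whole stays at its target size.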
16
Summary, Conclusions
- High/low-level VO
- Site certification and software validation for the HEP application
  - HEP-oriented site functionality tests
  - On-line validation of the site configuration
- Statistical analysis of HEP processing
- Dynamic Processing Task Pool