
1 Virtual Organization Approach for Running HEP Applications in Grid Environment
Łukasz Skitał (1), Łukasz Dutka (1), Renata Słota (2), Krzysztof Korcyl (3), Maciej Janusz (2), Jacek Kitowski (1,2)
(1) ACC CYFRONET AGH, Cracow
(2) Institute of Computer Science AGH-UST, Cracow
(3) Institute of Nuclear Physics PAN, Cracow

2 Outline
- Introduction
- Application
  - Requirements
  - Architecture
- Virtual Organization
  - Requirements
  - Certification
  - Monitoring
  - Dynamic Processing Tasks Pool
- Summary/Conclusions

3 Introduction
- Part of the int.eu.grid project
- State of the art: current interactive applications in the Grid
  - CrossGrid
  - RealityGrid
  - GridPP
- Why grid?
  - Gain more resources
  - Use resources for other purposes when ATLAS is off
- Why VO?
  - Provides a stable but dynamic environment for an application
- The HEP application has to process all events without any loss of data

4 HEP Application in Brief
[Figure: data flow between CERN and remote sites. Labels: SFI; PF; Local Event Processing Farms; Back End Network; SFOs; mass storage; CERN Computing Center; Dispatcher; Packet Switched WAN: GEANT; lightpath; Remote Processing Farms (PF) in Copenhagen, Edmonton, Kraków and Manchester.]

5 HEP Application in Brief
- Three filtering levels
  - Hardware level (lvl 1)
  - Small local farm (lvl 2)
  - Complex events filtering (lvl 3)
- SFI: SubFarm Interface
- SFO: SubFarm Output
- PF: Processing Farm
[Figure: filtering chain. Labels: Sensors; Lvl1 filters, 2.5 μs; 120 GB/s; Lvl2 filters, ~10 ms; Buffers; LVL3 Event Filter, ~ sec; EFP (Event Filter Processors); EFN (Event Filter Network); SFI; SFO; ~4 GB/s; ~300 MB/s.]

6 HEP Requirements
- Real-time application
- High throughput (estimated)
  - 3500 events per second
  - 1.5 MB per event
  - On average 1 second to compute one event on a typical processor
  - Infrastructure monitoring (for load balancing)
- Efficient way to distribute events to worker nodes
  - Grid job submission mechanism is not sufficient
    - Simple job submission takes minutes
    - We have seconds...
- Failure recovery
  - Malfunctions of single nodes are acceptable, but have to be detected
  - Application monitoring
  - Infrastructure monitoring (for availability checks)
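A quick back-of-the-envelope check of what the estimated figures above imply. This is only a sketch: the function names are illustrative, and the processor count assumes every event takes exactly the quoted average time with no scheduling or transfer overhead.

```python
import math

# Estimates quoted on the slide.
EVENTS_PER_SECOND = 3500
EVENT_SIZE_MB = 1.5
SECONDS_PER_EVENT = 1.0  # average compute time on a typical processor


def required_bandwidth_mb_s(rate_hz: float, event_size_mb: float) -> float:
    """Aggregate data rate needed to ship every event to a processor."""
    return rate_hz * event_size_mb


def required_processors(rate_hz: float, seconds_per_event: float) -> int:
    """Processors that must be busy at once to keep up with the input rate.

    This is arrival rate times service time (Little's law), rounded up.
    """
    return math.ceil(rate_hz * seconds_per_event)


if __name__ == "__main__":
    print(required_bandwidth_mb_s(EVENTS_PER_SECOND, EVENT_SIZE_MB), "MB/s")
    print(required_processors(EVENTS_PER_SECOND, SECONDS_PER_EVENT), "processors")
```

Under these assumptions the farm has to keep several thousand processors busy continuously, which is why a per-event submission latency of minutes is not workable.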

7 HEP application integration with GRID
- Job submission for each event is too slow
- Job submission for a bunch of events is still too slow
- We need interactive communication
- Pilot job idea
  - One job to allocate a node and start a PT (Processing Task)
  - Dedicated queue in the LRMS for HEP pilot jobs (HEP VO)
  - One PT processes many events
  - Direct communication between PT and ATLAS experiment
    - Faster than job submission
    - ATLAS experiment provides events (1.5MB/event)
    - PT responds with event analysis results (1Kb/event)
    - Asynchronous communication with event buffering
  - Limited lifetime of PT to allow dynamic resource allocation
    - Lifetime set by queue or PT configuration
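The pilot job idea can be illustrated with a minimal sketch of a Processing Task loop: one long-lived task drains many events from a buffer and stops when its lifetime runs out. Everything here is an assumption made for illustration: the function name, the use of in-process queues in place of the network channel to the experiment, and the choice to stop when the buffer is empty (a real PT would presumably wait for more events).

```python
import queue
import time


def run_processing_task(events, results, lifetime_s, analyse, clock=time.monotonic):
    """Process buffered events until the lifetime expires or the buffer is empty.

    events  -- queue.Queue of (event_id, payload) tuples (stand-in for the event stream)
    results -- queue.Queue receiving (event_id, analysis) tuples
    analyse -- hypothetical callable that turns a payload into a short result
    Returns the number of events processed.
    """
    deadline = clock() + lifetime_s
    processed = 0
    while clock() < deadline:
        try:
            event_id, payload = events.get_nowait()
        except queue.Empty:
            break  # sketch only: a real PT would block and wait for more events
        results.put((event_id, analyse(payload)))
        processed += 1
    return processed


if __name__ == "__main__":
    events, results = queue.Queue(), queue.Queue()
    for i in range(3):
        events.put((i, b"x" * (i + 1)))
    # len() stands in for the real event analysis.
    print(run_processing_task(events, results, lifetime_s=60.0, analyse=len))
```

The node is allocated once, by the pilot job, and the per-event cost is then only the exchange over the direct channel rather than a full Grid job submission.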

8 Proposed HEP Architecture
[Figure: architecture diagram spanning CERN and the GRID. Labels: SFI; EFD with Buffer (shown twice); Local PT Farm (PT); Dispatcher; proxyPT; UI; WMS; CE; HEP VO; HEP VO Database; Infrastructure monitoring; Application Monitoring; Events; Remote PTs running on WNs behind CEs.]

9 Components
- EFD (Event Filter Dataflow)
  - Takes events from the SFI and places them in a local buffer
  - Events are distributed to PTs (local or remote)
  - Depending on the PT's answer, an event is stored or flushed
- Processing Task (PT)
  - Runs on worker nodes (WNs)
  - Processes events and answers with short analysis data
- ProxyPT
  - Interface to remote PTs
- Dispatcher
  - Coordinates task distribution from EFD to PTs
- Infrastructure Monitoring
  - Network load/status, WN status
- Application Monitoring
  - PT application state
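The EFD behaviour described above (buffer events, hand them to PTs, then store or flush depending on the answer) can be sketched as follows. The class and method names are hypothetical, PTs are modelled as plain callables returning True to keep an event, and round-robin distribution is an assumption; in the presented architecture the Dispatcher coordinates the distribution.

```python
from collections import deque


class EventFilterDataflow:
    """Toy model of an EFD: buffers events and applies the PTs' verdicts."""

    def __init__(self):
        self.buffer = deque()  # events taken from the SFI, awaiting processing
        self.stored = []       # ids of events accepted by a PT
        self.flushed = []      # ids of events rejected by a PT

    def take_from_sfi(self, event_id, payload):
        self.buffer.append((event_id, payload))

    def dispatch(self, processing_tasks):
        """Send each buffered event to a PT, round-robin (an assumption)."""
        turn = 0
        while self.buffer:
            event_id, payload = self.buffer.popleft()
            pt = processing_tasks[turn % len(processing_tasks)]
            turn += 1
            if pt(payload):
                self.stored.append(event_id)
            else:
                self.flushed.append(event_id)


if __name__ == "__main__":
    efd = EventFilterDataflow()
    for i, payload in enumerate([5, 50, 7, 70]):
        efd.take_from_sfi(i, payload)
    # Two stand-in PTs that keep events whose payload exceeds a threshold.
    efd.dispatch([lambda p: p > 10, lambda p: p > 10])
    print(efd.stored, efd.flushed)
```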

10 HEP Virtual Organization
- Purpose
  - Provides runtime environment for HEP application
  - Fulfills application's requirements
  - Realizes site certification process
- Architecture
  - High level (static): sites
    - Certification
    - Agreements
    - Configuration guidelines
    - Functional tests
  - Low level (dynamic): resources
    - Runtime environment
    - Dynamic resource allocation
    - Monitoring and failover
    - Load balancing

11 Site certification - Requirements
- Long-term ability to provide services and resources
- Legal issues/agreements
- LRMS configuration
  - Dedicated high-priority (but short) queues on computing elements for jobs from the HEP VO
- Ability to communicate safely between the site's WNs and CERN HEP nodes:
  - Specified port opened on the CERN side
  - Specified port opened on the site side
  - Trusted proxy to set up two-way communication
  - Channel encryption

12 HEP VO site operation process
- Certification phase
  - Long-term tests of reliability, performance and updates (application, databases)
  - Sites tested using artificial/calibration data
  - Communication between the site's WNs and CERN HEP nodes
- Operation phase with runtime environment monitoring
  - Operates on production data
  - Checks during PT startup
    - Proper environment, up-to-date application, databases, etc.
  - Infrastructure and application monitoring
  - Dynamic resource allocation
  - Exclusion of nodes/sites which are frequently unavailable
    - Temporarily excluded sites/nodes cannot process real data, but they can still receive test jobs
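The exclusion rule in the operation phase could look roughly like the sketch below, which splits sites into those allowed to process production data and those limited to test jobs. The 90% availability threshold, the function name and the site names are illustrative assumptions; the slides do not say how "frequently unavailable" is measured.

```python
def classify_sites(availability_history, min_availability=0.9):
    """Split sites by observed availability.

    availability_history -- dict mapping site name to a list of booleans,
                            one per availability check (True = reachable)
    min_availability     -- assumed threshold, not taken from the slides
    Returns (production_sites, test_only_sites).
    """
    production, test_only = [], []
    for site, checks in availability_history.items():
        ratio = sum(checks) / len(checks) if checks else 0.0
        if ratio >= min_availability:
            production.append(site)
        else:
            # Excluded from real data, but still eligible for test jobs.
            test_only.append(site)
    return production, test_only


if __name__ == "__main__":
    history = {
        "site-a": [True] * 10,
        "site-b": [True] * 5 + [False] * 5,
    }
    print(classify_sites(history))
```

Keeping excluded sites on the test-job list gives them a way back: once their checks pass again, the same statistics readmit them to production.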

13 HEP Virtual Organization
[Figure: overview of the two VO levels. Labels: High level VO; Low level VO; Site certification; Operation; Certification; Agreements; Guidelines; Functional tests; Dynamic resource allocation; Monitoring; Management issues; Communication; Site availability statistics.]

14 Monitoring for HEP VO
- Takes advantage of monitoring
  - Monitoring using external tools
  - Application Monitoring (with a tool like J-OCM):
    - deployed on every worker node running a HEP PT
    - provides information about current execution status
    - monitors computation time
  - JIMS for Infrastructure Monitoring
    - availability of worker node
    - load of worker node
    - free memory
    - network throughput between CERN and remote computing farm
- Failover
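As an illustration of how the infrastructure metrics listed above could feed load balancing and failover, the sketch below picks the least-loaded available worker node with enough free memory. It does not use any J-OCM or JIMS interface: the record type, field names and memory threshold are all hypothetical stand-ins for whatever the monitoring tools report.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class WorkerStatus:
    """Hypothetical snapshot of one worker node's monitored state."""
    name: str
    available: bool
    load: float
    free_memory_mb: int


def pick_worker(workers: Iterable[WorkerStatus],
                min_free_memory_mb: int = 2048) -> Optional[WorkerStatus]:
    """Return the least-loaded usable worker, or None if there is none.

    Unavailable nodes are skipped, which gives a simple form of failover.
    """
    candidates = [
        w for w in workers
        if w.available and w.free_memory_mb >= min_free_memory_mb
    ]
    return min(candidates, key=lambda w: w.load, default=None)


if __name__ == "__main__":
    nodes = [
        WorkerStatus("wn1", True, 0.9, 4096),
        WorkerStatus("wn2", False, 0.1, 8192),  # down: never selected
        WorkerStatus("wn3", True, 0.3, 4096),
    ]
    chosen = pick_worker(nodes)
    print(chosen.name if chosen else "no worker available")
```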

15 Dynamic resource allocation
- Dynamic Processing Tasks pool
- Malfunctioning PTs excluded from runtime environment (low level VO)
- PT lifetime limited by queue length (walltime)
  - each 'normal' job has its own lifetime specified before execution
  - 'interactive' type of job
  - pool has to be refreshed periodically
- Fair sharing of resources
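A minimal sketch of the periodic pool refresh, assuming a fixed target pool size: PTs that have reached their walltime or were reported as malfunctioning are dropped, and new ones are started to fill the gap. The class name, the integer task ids and the idea that "starting" a PT is instantaneous are simplifications; in practice each new PT would be a pilot job submitted to the dedicated queue.

```python
class ProcessingTaskPool:
    """Toy pool that keeps a target number of time-limited PTs alive."""

    def __init__(self, target_size, walltime_s):
        self.target_size = target_size
        self.walltime_s = walltime_s
        self.tasks = {}      # task id -> expiry time
        self._next_id = 0

    def refresh(self, now, failed=()):
        """Drop expired or failed PTs, then top the pool up.

        Returns the ids of newly started PTs.
        """
        self.tasks = {
            tid: expiry
            for tid, expiry in self.tasks.items()
            if expiry > now and tid not in failed
        }
        started = []
        while len(self.tasks) < self.target_size:
            tid = self._next_id
            self._next_id += 1
            self.tasks[tid] = now + self.walltime_s
            started.append(tid)
        return started


if __name__ == "__main__":
    pool = ProcessingTaskPool(target_size=3, walltime_s=100)
    print(pool.refresh(now=0))               # initial fill
    print(pool.refresh(now=50, failed={1}))  # replace one malfunctioning PT
    print(pool.refresh(now=100))             # replace PTs that hit walltime
```

Because every PT eventually expires, resources return to the queue regularly, which is what allows them to be shared fairly with other users of the site.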

16 Summary, conclusions
- High/Low-level VO
- Site certification and software validation for HEP application
- HEP-oriented site functionality tests
- On-line validation of site configuration
- Statistical analysis of HEP processing
- Dynamic Processing Task Pool

