1 Tier 3 and Tier 3 monitoring. Ivan Kadochnikov, LIT JINR, 17.05.2012

2 Overview WLCG structure Tier3 T3mon concept Monitoring tools T3mon implementation

3 WLCG structure Goals of WLCG Hierarchic approach Production and analysis Argument for Tier3

4 Goals of WLCG. Turn detector signals into physics. ATLAS, for example: ~1 PB/s off the detector => the trigger => ~200 MB/s to storage, about 15 petabytes of raw data per year. To do: store, process, analyze.

5 Hierarchic approach
  Tier 0: the CERN computer centre
    - safe-keeping of the first copy of raw data
    - first-pass reconstruction
  Tier 1: 11 centres all around the world
    - safe-keeping of shares of raw, reconstructed, reprocessed and simulated data
    - reprocessing
  Tier 2: about 140 sites
    - production and reconstruction of simulated events
    - analysis

6 Production and analysis. Data selection algorithms improve and calibration data change, so all data gathered since LHC start-up are re-processed several times a year.

7 Argument for Tier3. Analysis on Tier2 is inconvenient, and institutions have local computing resources. Local access and resources dedicated to analysis improve user response time dramatically.

8 Tier3 What is Tier3? Types of Tier3 Compare and contrast: Tier2 Current status Need for monitoring

9 What is Tier3?
  - non-pledged computing resources
  - institutional computing, used for analysis
  - no formal commitment to WLCG
  - not really another level of the model

10 Types of Tier3 sites
  - Tier3 with Tier2 functionality
  - collocated with a Tier2
  - national analysis facilities
  - non-grid Tier3s

11 Compare and contrast: Tier2 vs Tier3
  - jobs: Tier2 runs many different kinds of jobs | Tier3 runs analysis only
  - users: Tier2 serves all VO users | Tier3 serves local/regional users
  - quality of service: Tier2 has strict QoS requirements | for Tier3 no QoS information is gathered globally (yet!)
  - resources: Tier2 resources are pledged | Tier3 resources are non-pledged
  - grid services: Tier2 runs a set of mandatory grid services on site | Tier3 often runs none, or a full or partial set
  - data: Tier2 processed data can go back into the grid | Tier3 data is not allowed back into the grid
  - control: Tier2 is under VO central control | Tier3 is under local control

12 Current status
  - more than 300 Tier3 sites right now
  - a survey by ATLAS in 2010 showed that Tier3 sites come in many different sizes, storage methods vary,
    different LRMSs are used, and WLCG is accessed in different ways
  - many don't have monitoring
  - limited manpower

13 Need for monitoring
  - any system needs monitoring
  - some information is required at the global level: dataset popularity, usage statistics
  - grid services may not be present, so existing grid monitoring systems can't be used
  - Tier3 sites have limited manpower and need an easy way to set up monitoring

14 T3mon concept Users and requirements What to monitor Structure Local monitoring Global monitoring

15 Users and requirements
  - local administrators: detailed fabric monitoring, resource management systems (LRMS), mass storage systems (MSS)
  - VO managers: general usage statistics and quality of service
  - global services: dataset popularity

16 What to monitor
  - local resource management systems (LRMS): Proof, PBS, Condor, Oracle Grid Engine
  - mass storage systems (MSS): XRootD, Lustre

17 Components
  Local monitoring:
    - detailed fabric monitoring
    - gathers data used by the global system
    - presents detailed data to local administrators
  Global monitoring:
    - aggregates metrics from local monitoring
    - gives necessary information to central services
    - presents data via the Dashboard

18 Local monitoring system [diagram: the monitored subsystems (Condor, Lustre, OGE, PBS, XRootD, Proof) feed a local DB; a publishing agent forwards the data to MSG]

19 Global monitoring system [diagram: several local monitoring systems publish (P) to MSG; consumers (C) feed the Dashboard and data management]

20 Tools
  - Ganglia: data flow, plug-in system
  - Dashboard
  - MSG / ActiveMQ

21 Ganglia: a distributed monitoring system for clusters and grids [diagram: the local monitoring system of slide 18 built around Ganglia: Condor, Lustre, OGE, PBS, XRootD and Proof feed Ganglia, and a publishing agent forwards the data to MSG]

22 Why Ganglia?
  - easy to set up fabric monitoring
  - popular choice among Tier3 sites
  - extension modules for LRMS and MSS monitoring

23 Ganglia data flow [diagram: gmond on each node sends numeric metrics as XDR via UDP to gmond on the head node; gmetad requests the full state as XML, stores numeric metrics with rrdtool and passes string metrics through to the web frontend]
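In practice, gmond answers any TCP connection on its listening port (8649 by default) with a full XML dump of the cluster state, which is what gmetad polls. A minimal sketch of reading that dump from Python, assuming a default gmond configuration; the host and port are placeholders for a real head node:

    import socket
    import xml.etree.ElementTree as ET

    def read_gmond_xml(host="localhost", port=8649):
        """Connect to gmond's TCP port and read the XML dump it sends back."""
        chunks = []
        with socket.create_connection((host, port), timeout=10) as sock:
            while True:
                data = sock.recv(4096)
                if not data:
                    break
                chunks.append(data)
        return b"".join(chunks)

    def print_metrics(xml_bytes):
        """Walk HOST/METRIC elements and print host, metric name and value."""
        root = ET.fromstring(xml_bytes)
        for host in root.iter("HOST"):
            for metric in host.iter("METRIC"):
                print(host.get("NAME"), metric.get("NAME"), metric.get("VAL"))

    print_metrics(read_gmond_xml())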

24 Ganglia web interface

25 Adding custom metrics to gmond [diagram: two paths into Ganglia: a gmond module whose callback polls the monitored subsystem, or a custom monitoring daemon that polls the subsystem and pushes values with gmetric]
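gmond can load metric modules written in Python: the module exports metric_init(), which registers metric descriptors and their callbacks, and gmond then calls the callbacks periodically. A minimal sketch of such a module; the metric name, group and the constant value returned are placeholders rather than anything from the actual T3mon modules:

    # Minimal gmond Python metric module sketch.

    def running_jobs_handler(name):
        """Callback invoked by gmond; must return the current metric value."""
        return 0  # placeholder: a real module would query the monitored subsystem here

    def metric_init(params):
        """Called once by gmond, with parameters from the module's .pyconf file."""
        return [{
            'name': 'running_jobs',            # metric name shown in Ganglia
            'call_back': running_jobs_handler,
            'time_max': 90,                    # maximum seconds between collections
            'value_type': 'uint',              # uint | float | double | string
            'units': 'jobs',
            'slope': 'both',
            'format': '%u',
            'description': 'Number of running batch jobs',
            'groups': 'batch',
        }]

    def metric_cleanup():
        """Called by gmond on shutdown."""
        pass

The second path on the slide is a standalone daemon that pushes values from outside gmond, for example with gmetric --name running_jobs --value 5 --type uint32 --units jobs.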

26 Dashboard. "The Experiment Dashboard's main goal is to collect and expose to users relevant information about the services and applications running on the grid environment." [diagram: Tier3 software publishes to the messaging system; Dashboard collectors and feeders write through the data access layer (DAO); the web application and other applications read the data back]
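As a rough illustration of the collector side only (not the actual Dashboard code), the sketch below subscribes to a topic on the message broker with the stomp.py client and hands each record on; the broker address, topic name and JSON payload are assumptions:

    import json
    import time
    import stomp  # stomp.py STOMP client

    class CollectorListener(stomp.ConnectionListener):
        def on_message(self, frame):
            # stomp.py >= 8 passes a Frame; older versions pass headers and body separately.
            record = json.loads(frame.body)
            print("received metric record:", record)  # a real collector would call the DAO layer

    conn = stomp.Connection([("msg-broker.example.org", 61613)])
    conn.set_listener("t3mon-collector", CollectorListener())
    conn.connect(wait=True)
    conn.subscribe(destination="/topic/t3mon.site.metrics", id=1, ack="auto")

    time.sleep(60)   # keep the connection open while messages arrive
    conn.disconnect()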

27 MSG, the WLCG Messaging System for Grids. "Aims to help the integration and consolidation of the various grid monitoring systems used in WLCG." Based on the ActiveMQ open-source message broker.
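On the publishing side, an agent typically talks to the ActiveMQ broker over the STOMP protocol. A minimal publishing sketch with stomp.py, assuming the broker accepts STOMP connections; the broker host, topic and message fields are illustrative, not the real MSG conventions:

    import json
    import stomp

    def publish_metric(broker_host, site, metric, value):
        """Send one metric record to the message broker as a JSON text message."""
        conn = stomp.Connection([(broker_host, 61613)])
        conn.connect(wait=True)
        body = json.dumps({"site": site, "metric": metric, "value": value})
        conn.send(destination="/topic/t3mon.site.metrics", body=body)
        conn.disconnect()

    publish_metric("msg-broker.example.org", "JINR-T3", "running_jobs", 42)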

28 T3Mon implementation Project structure Subsystem modules Proof monitoring module PBS monitoring module Condor monitoring module Lustre monitoring module XRootD monitoring module Testing infrastructure

29 Project structure
  - written in Python
  - SVN provided by CERN
  - RPM repository with a separate package for each monitoring module
  - each module handles one software system to be monitored on a Tier3
  - one configuration file for all modules (see the sketch below)
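A sketch of how one shared configuration file with a section per module might be read; the file path, section names and options are invented for illustration and are not the project's actual schema:

    import configparser

    # Hypothetical shared configuration, e.g. /etc/t3mon/t3mon.conf:
    #
    #   [common]
    #   msg_broker = msg-broker.example.org
    #
    #   [pbs]
    #   log_dir = /var/spool/pbs/server_priv/accounting
    #
    #   [lustre]
    #   proc_root = /proc/fs/lustre

    config = configparser.ConfigParser()
    config.read("/etc/t3mon/t3mon.conf")

    broker = config.get("common", "msg_broker")
    pbs_log_dir = config.get("pbs", "log_dir", fallback=None)  # module-specific option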

30 Proof monitoring module [diagram: PROOF fills a database; the Proof plug-in reads it and publishes metrics to Ganglia via gmond and to MSG]

31 PBS monitoring module [diagram: PBS writes log files; the PBS plug-in parses them and publishes metrics to Ganglia via gmond and to MSG]
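PBS/TORQUE accounting logs are usually semicolon-separated lines of the form timestamp;record_type;job_id;key=value ..., with record types such as Q (queued), S (started), E (ended) and D (deleted). The sketch below counts record types in one day's log; the file location is an assumption about a typical TORQUE installation, not necessarily what the plug-in reads:

    from collections import Counter

    def count_job_events(accounting_file):
        """Count PBS accounting record types (Q=queued, S=started, E=ended, D=deleted)."""
        counts = Counter()
        with open(accounting_file) as log:
            for line in log:
                parts = line.strip().split(";", 3)
                if len(parts) < 3:
                    continue  # skip malformed lines
                record_type = parts[1]
                counts[record_type] += 1
        return counts

    print(count_job_events("/var/spool/pbs/server_priv/accounting/20120517"))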

32 Condor monitoring module [diagram: the condor_master and condor_startd daemons report to condor_quill, which fills a database; the Condor plug-in reads that database and publishes metrics to Ganglia via gmond and to MSG]
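In the diagram the Condor job state comes from the condor_quill database; a simpler way to illustrate collecting the same job-state counts is to shell out to condor_q, whose JobStatus attribute encodes the state numerically (1 = idle, 2 = running, 5 = held). A sketch, not the module's actual approach:

    import subprocess
    from collections import Counter

    STATUS_NAMES = {1: "idle", 2: "running", 3: "removed", 4: "completed", 5: "held"}

    def condor_job_states():
        """Count queued Condor jobs per state by printing JobStatus for every job."""
        output = subprocess.check_output(
            ["condor_q", "-format", "%d\n", "JobStatus"], text=True
        )
        counts = Counter(int(line) for line in output.splitlines() if line.strip())
        return {STATUS_NAMES.get(code, str(code)): n for code, n in counts.items()}

    print(condor_job_states())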

33 Lustre monitoring module [diagram: the Lustre plug-in reads counters from /proc/fs/lustre and publishes metrics to Ganglia via gmond]
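Lustre exposes per-target counters as plain-text stats files under /proc/fs/lustre; a typical counter line looks like "write_bytes 23 samples [bytes] 4096 1048576 12345678". The sketch below reads OST counters on an OSS; the obdfilter path layout varies with the Lustre version and server role, so the glob pattern is an assumption:

    import glob
    import os

    def read_lustre_stats(proc_root="/proc/fs/lustre"):
        """Collect (target, counter, count) triples from Lustre stats files."""
        results = []
        for stats_path in glob.glob(os.path.join(proc_root, "obdfilter", "*", "stats")):
            target = os.path.basename(os.path.dirname(stats_path))
            with open(stats_path) as stats_file:
                for line in stats_file:
                    fields = line.split()
                    if len(fields) >= 2 and fields[1].isdigit():
                        results.append((target, fields[0], int(fields[1])))
        return results

    for target, counter, count in read_lustre_stats():
        print(target, counter, count)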

34 XRootD monitoring module [diagram: the xrootd and cmsd daemons send monitoring streams; mpxstats aggregates the summary statistics and summary_to_ganglia.py pushes them to Ganglia via gmetric and gmond; xrdsummond (xrootd.py) and xrddetmond collect summary and detailed monitoring into a database and publish to MSG]
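xrootd and cmsd can be configured (the xrd.report directive) to send periodic summary statistics as XML datagrams over UDP, and mpxstats merges such streams. As a rough illustration of the receiving side only, the sketch below listens for summary datagrams and pushes numeric counters to Ganglia with gmetric; the port and the assumed <stats id=...> layout of the XML are guesses, not the actual summary_to_ganglia.py logic:

    import socket
    import subprocess
    import xml.etree.ElementTree as ET

    def listen_for_summary(port=9931):
        """Receive XRootD summary XML datagrams and push numeric leaves to gmetric."""
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.bind(("", port))
        while True:
            datagram, _addr = sock.recvfrom(65535)
            root = ET.fromstring(datagram)
            for stats in root.iter("stats"):
                prefix = stats.get("id", "stats")
                for child in stats:
                    if child.text and child.text.strip().isdigit():
                        subprocess.call([
                            "gmetric",
                            "--name", "xrootd_%s_%s" % (prefix, child.tag),
                            "--value", child.text.strip(),
                            "--type", "uint32",
                            "--units", "count",
                        ])

    listen_for_summary()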

35 Testing infrastructure
  Goals:
    - document installing Ganglia on a cluster
    - document configuring Tier3 subsystems for monitoring
    - test modules in a minimal cluster environment
  Clusters:
    - PBS: 3 nodes (1 head node, 2 worker nodes)
    - Proof: 3 nodes (1 head node, 2 worker nodes)
    - Condor: 3 nodes (1 head node, 1 worker node, 1 client)
    - OGE: 3 nodes (1 head node, 2 worker nodes)
    - Lustre: 3 nodes (1 MDS, 1 OSS, 1 client)
    - XRootD: 3 nodes (1 manager, 2 servers)
    - XRootD II: 3 nodes (1 manager, 2 servers)
  Plus a development machine and an installation-testing machine

36 Virtual testing infrastructure
  - 23 nodes total on only 2 physical servers
  - running virtualization software (OpenVZ and Xen)
  - fast deployment and reconfiguration of nodes as required
  - performance is not a deciding factor

37 Results and plans
  - the project is nearing completion; most modules are done
  - the Proof and XRootD modules are already being tested on real clusters
  Next steps:
    - message consumers
    - OGE module
    - testing and support
    - data transfer monitoring project

38 Thank you!

