WLCG infrastructure monitoring proposal. Pablo Saiz, IT/SDC/MI, 16th August 2013.


1 WLCG infrastructure monitoring proposal
Pablo Saiz, IT/SDC/MI, 16th August 2013

2 Table of contents
16 August 2013, Infrastructure monitoring, P. Saiz
I. Summary of the progress
II. Desired structure of applications
III. Proposal for infrastructure monitoring

3 I. Summary

4 Motivation
- Reduction in the number of people
- Redefining scope of applications
- Combining expertise
- Step out and evaluate other alternatives
- Goal:
  - Offer (at least) the same QoS with fewer resources

5 Status so far
- WLCG monitoring consolidation group created
- Applications supported by the section
- Applications used
- … so now we know what to provide

6 How to provide it
- Input from our experience
- Input from other groups
- What is available out there
- Split into different areas of work:
  - Source of information
  - Transport
  - Storage
  - Aggregation
  - Visualization
  - Documentation
  - Deployment
  - Recurrent tasks
- Review of the areas

7 II. Structure of applications

8 Different layers of applications
- Collect information
- Transport
- Storage
- Visualize
- Aggregate
- Recurrent tasks
- Documentation
- Deployment

9 Deployment
- Using OpenStack, Puppet, Hiera, Foreman
- Quota of 100 nodes, 240 cores
- Multiple templates already created:
  - Development machine (7 nodes)
  - Web servers (SSB, SUM, WLCG transfers, Job: 16 nodes)
  - Elasticsearch (6 nodes), Hadoop (4 nodes)
- Currently working on Nagios installation
- Migrating machines from Quattor to AI
- Koji and Bamboo for build system and continuous integration

10 Source of information
- Gather info from external and internal sources
- Publish it in the transport layer
Diagram labels: Collect information; Nagios, GOCDB, REBUS, OIM, Savannah, other app

11 Transport
- Message broker
- Local files
- HTTP PUT/GET
- UDP
- (Table in DB)?
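Of the transport options listed above, "local files" is the easiest to sketch. The following is only an illustration under stated assumptions: the function names, the JSON-lines spool format, and the message fields are hypothetical and are not taken from the slides.

```python
import json
import tempfile
from pathlib import Path
from typing import Iterator


def publish_to_file(spool: Path, message: dict) -> None:
    """Append one message to a spool file, one JSON document per line."""
    with spool.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(message, sort_keys=True) + "\n")


def consume_from_file(spool: Path) -> Iterator[dict]:
    """Yield the messages from the spool file in the order they were written."""
    with spool.open(encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                yield json.loads(line)


# Hypothetical collector output; field names are assumptions for illustration.
with tempfile.TemporaryDirectory() as tmp:
    spool_path = Path(tmp) / "metrics.jsonl"
    publish_to_file(spool_path, {"instance": "SITE_A", "metric": "service_status", "value": "OK"})
    publish_to_file(spool_path, {"instance": "SITE_B", "metric": "service_status", "value": "CRITICAL"})
    received = list(consume_from_file(spool_path))
```

A message broker, HTTP PUT/GET, or UDP sender could expose the same publish/consume pair, so collectors would not need to know which transport is in use.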

12 Storage
- Metric storage
  - Accepts any data: #jobs, status of a service, downtime, pledges, channel status
  - Metric, Instance, Time Range, Value
- Archival
  - Long-term data
  - (Same format as metric storage)?
- Current metrics
  - Most common views
- Metadata
  - Profiles
  - Topology
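The single record format (Metric, Instance, Time Range, Value) can be illustrated with a small sketch. The class name, field names, sample values, and the half-open time-range convention are assumptions made for this example, not part of the proposal.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass(frozen=True)
class MetricRecord:
    """One stored entry: a metric value for an instance over a time range."""
    instance: str    # e.g. a site or service name (hypothetical values below)
    metric: str      # e.g. "service_status", "running_jobs"
    start: datetime  # start of the time range (inclusive, by assumption)
    end: datetime    # end of the time range (exclusive, by assumption)
    value: str       # stored as text so that any kind of data fits


def values_at(records: List[MetricRecord], instance: str, metric: str,
              when: datetime) -> List[str]:
    """Return the values of a metric for an instance valid at a given time."""
    return [r.value for r in records
            if r.instance == instance and r.metric == metric
            and r.start <= when < r.end]


records = [
    MetricRecord("SITE_A", "service_status",
                 datetime(2013, 8, 16, 0, 0), datetime(2013, 8, 16, 6, 0), "OK"),
    MetricRecord("SITE_A", "service_status",
                 datetime(2013, 8, 16, 6, 0), datetime(2013, 8, 16, 12, 0), "CRITICAL"),
    MetricRecord("SITE_A", "running_jobs",
                 datetime(2013, 8, 16, 0, 0), datetime(2013, 8, 16, 12, 0), "1200"),
]

status_at_7 = values_at(records, "SITE_A", "service_status", datetime(2013, 8, 16, 7, 0))
```

Because job counts, service status, downtimes, and pledges all share this shape, one query function serves every metric.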

13 Aggregation
- Treated as another metric
- Might collect input from previous metrics
- Current schema of ‘CMS Site readiness’
Diagram labels: Summary, Site readiness, Availability, Aggregate

14 Visualization
- Server:
  - HTML skeleton
  - REST API with JSON data
  - Cache: memcache, Varnish
- Client:
  - Common library + plugin: jQuery
  - Common MVC: no obvious choice…
  - Plots (interactive, exportable, embeddable): Highcharts
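To make "REST API with JSON data" concrete, here is a minimal sketch of how a server-side handler might serialize the history of one metric for the client-side plotting code. The function name, the payload layout, and the sample records are assumptions for illustration; the slides do not define the API.

```python
import json
from typing import List, Tuple

# (instance, metric, start, end, value); timestamps as plain integers here
Record = Tuple[str, str, int, int, str]


def metric_history_payload(records: List[Record], instance: str, metric: str) -> str:
    """Build the JSON body a REST endpoint could return for one metric."""
    rows = [
        {"start": start, "end": end, "value": value}
        for (ins, met, start, end, value) in records
        if ins == instance and met == metric
    ]
    rows.sort(key=lambda row: row["start"])  # oldest range first
    return json.dumps({"instance": instance, "metric": metric, "data": rows},
                      sort_keys=True)


records = [
    ("SITE_A", "service_status", 600, 1200, "CRITICAL"),
    ("SITE_A", "service_status", 0, 600, "OK"),
    ("SITE_B", "service_status", 0, 1200, "OK"),
]

payload = metric_history_payload(records, "SITE_A", "service_status")
```

The HTML skeleton stays static in such a design; the client fetches only this JSON, which also makes the responses easy to cache.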

15 III. Infrastructure monitoring

16 Current situation
- Big system, difficult to maintain/evolve
- Many internal dependencies
- Multiple schemas, aggregations:
  - SSB, MRS, ACE
- Scope much bigger than what we need
- Limit to WLCG
- Usage of probes
- Does not test what the experiments are doing!
- Non-trivial deployment of new tests
- Based on technologies available at the time of the design
- New requests from experiments:
  - Test whatever they want
  - Availability vs. usability
  - Combine Dashboard/SAM apps

17 Infrastructure monitoring
Diagram: components placed on the application layers (Collect information, Transport, Storage, Visualize, Aggregate, Recurrent tasks, Documentation, Deployment).
Diagram labels: Nagios, Pledge, Down, Pilot, HC, VO feed, MyWLCG, SSB, SUM, Trend Report, ACE, POEM, Archival, Metrics

18 And for the prototype…
Diagram labels: Nagios, Pledge, Down, Direct, HC, VO feed, MyWLCG, SSB, SUM, Trend Report, ACE, POEM, Archival, Metrics
SSB storage:
- Records status changes
- Same procedure as any other metric
- Diagram labels: New Data, Processed Data, consume2db, SSB format
Simplified MRS:
- Accepts any data
- No foreign keys!
- No status calculation
- 300K messages per day
All the data in storage have the same format:
- Instance, Metric, Time range, Value
- Source could be Nagios, pilot framework, VO-defined metrics, availabilities
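"Records status changes" can be read as collapsing repeated samples into time ranges, so that only transitions are stored. The sketch below illustrates that reading; it is not the actual consume2db logic, and the function name and tuple layout are assumptions.

```python
from typing import List, Optional, Tuple

Sample = Tuple[int, str]                # (timestamp, value)
Range = Tuple[int, Optional[int], str]  # (start, end, value); end is None while open


def to_status_changes(samples: List[Sample]) -> List[Range]:
    """Collapse consecutive samples with the same value into time ranges."""
    ranges: List[Range] = []
    for ts, value in sorted(samples):
        if ranges and ranges[-1][2] == value:
            continue  # same status as before: nothing new to record
        if ranges:
            start, _, previous = ranges[-1]
            ranges[-1] = (start, ts, previous)  # close the previous range
        ranges.append((ts, None, value))        # open a new range
    return ranges


samples = [(0, "OK"), (10, "OK"), (20, "CRITICAL"), (30, "CRITICAL"), (40, "OK")]
changes = to_status_changes(samples)
```

Five samples become three ranges in this example, one reason why storing changes instead of raw messages can keep the storage small.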

19 And now we can see metrics…

20 Aggregation
- Combination of ACE + SSB virtual columns
- Two types:
  - Horizontal: Ins_1(M_1 … M_n) → Ins_1(M_p)
  - Vertical: M_1(Ins_1 … Ins_n) → Ins_p(M_2)
- Initial options for “and”, “or” of current status
- Later on, might be extended to ‘sliding window’
- Full description
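The two aggregation types can be sketched with the "and"/"or" options on the current status. Everything below (function names, the boolean status map, the instance and metric names) is hypothetical and only illustrates the idea: horizontal combines several metrics of one instance, vertical combines one metric across several instances.

```python
from typing import Dict, Iterable, List, Tuple

# (instance, metric) -> True if the current status is OK
Status = Dict[Tuple[str, str], bool]


def combine(values: Iterable[bool], op: str) -> bool:
    """Apply the "and" / "or" option to a set of current statuses."""
    values = list(values)
    if op == "and":
        return all(values)
    if op == "or":
        return any(values)
    raise ValueError(f"unsupported operator: {op}")


def horizontal(status: Status, instance: str, metrics: List[str], op: str) -> bool:
    """Ins_1(M_1 ... M_n) -> Ins_1(M_p): several metrics of one instance."""
    return combine((status[(instance, m)] for m in metrics), op)


def vertical(status: Status, metric: str, instances: List[str], op: str) -> bool:
    """M_1(Ins_1 ... Ins_n) -> Ins_p(M_2): one metric over several instances."""
    return combine((status[(i, metric)] for i in instances), op)


status = {
    ("CE_1", "job_submit"): True,
    ("CE_1", "cert_valid"): False,
    ("CE_2", "job_submit"): True,
    ("CE_2", "cert_valid"): True,
}

ce1_critical = horizontal(status, "CE_1", ["job_submit", "cert_valid"], "and")
site_job_submit = vertical(status, "job_submit", ["CE_1", "CE_2"], "and")
site_cert_any = vertical(status, "cert_valid", ["CE_1", "CE_2"], "or")
```

Since the result is again a status for an (instance, metric) pair, it can be stored like any other metric and fed into further aggregations.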

21 Examples of aggregation
Figure labels: ATLAS_CRITICAL, WN, Site (expand this column)

22 Summary
- Lots of progress towards unified schema
- Data can be published from different sources:
  - Nagios, VO-defined metrics, ACE, (HC, Job Pilots)
- Single schema for storage
- Components talk to each other through API
- Getting close to a “proof of concept”
- Aggregation needs some work
- Visualization might need adjusting
- Other tasks can go in parallel:
  - NoSQL evaluation
  - Nagios configuration
  - Only active metrics

