Monitoring and Accounting in EGEE/LCG Jeremy Coles (for Dave Kant) ARM-6 Barcelona Based on GridPP15 talk.

1 Monitoring and Accounting in EGEE/LCG Jeremy Coles (for Dave Kant) ARM-6 Barcelona Based on GridPP15 talk

2 ARM-6 Barcelona, Jan 2006 - 2 Overview
Monitoring
- Service Availability Monitoring
  - Service Availability Monitoring Environment
  - The Sensors
  - Schema
Accounting
- Status of Batch Support in APEL
  - Condor and SGE
- LCG-RUS

3 Service Availability Monitoring
Grid Operations Activity (CERN lead) … with contributions from anyone who wants to participate
Work started at the 4th EGEE conference in Pisa (October)
Implementation of sensors, metrics and alarms for services in the EGEE/LCG infrastructure to ensure smooth grid operations
- Good sensors
- Meaningful metrics
- Controllable alarms
How to contribute: http://goc.grid.sinica.edu.tw/gocwiki/Service_Availability_Sensors

4 Contributions to Sensors
Substantial Metrics document in circulation which defines 50+ metrics
Home Page not yet available. Section 6 concerns the services.
Service | Region | Contributors
BDII | AsiaPacific | Min Tsai
Catalogue | CERN | James Casey
CE | CERN | Piotr Nyczyk
FTS | CERN | Gavin McCance
MyProxy | SEE-Greece / CERN | ? / Maarten Litmaath
RGMA | UKI and CERN | Laurence Field, Antony Wilson
RB | UKI and Italy | Dave Kant, Sergio Andreozzi
SRM | UKI | Dave Kant, Jens Jensen, Greg Cowan
VOMS | Italy | Valerio Venturi

5 Architecture
All sensors publish into RGMA using a common schema
Publish frequency depends on the sensor: SFT every 2 hours; RB every 30 mins; SRM once a day
Alarms are generated according to thresholds, e.g. an RB alarm if the matchmaking time exceeds 90 seconds
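
The threshold rule on this slide can be sketched in a few lines of Python. This is only an illustration: the names `PUBLISH_INTERVAL_S`, `RB_MATCHMAKE_THRESHOLD_S` and `check_rb_alarm` are hypothetical and not part of the sensor schema; only the publish intervals and the 90 second threshold come from the slide.

```python
# Publish intervals quoted on the slide, in seconds (the dictionary itself is illustrative).
PUBLISH_INTERVAL_S = {"SFT": 2 * 3600, "RB": 30 * 60, "SRM": 24 * 3600}

# Threshold quoted on the slide: raise an RB alarm above 90 seconds.
RB_MATCHMAKE_THRESHOLD_S = 90.0


def check_rb_alarm(matchmake_time_s, threshold_s=RB_MATCHMAKE_THRESHOLD_S):
    """Return True when the measured matchmaking time exceeds the threshold."""
    return matchmake_time_s > threshold_s
```

For example, `check_rb_alarm(131.0)` raises an alarm, while a measurement of exactly 90 seconds does not, because the slide says "exceeds".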

6 Timeline
Preliminary releases expected
Sensors: Feb 2006
Summary Generator
- Indian Team?
Metric Generator
- Re-use Lemon components?
Displays: Feb 2006
- Based on SFT
Alarm System: March 2006
- Sure/Lemon Piotr
- RSS Dave
- Integration with CIC portal (Lyon?)
Work in progress
Community working together

7 Sensors for Service Monitoring
RB Active Monitoring
- Track a test job through the Grid, from UI to Worker Node
- Functional test: can RBs match jobs to the resources requested?
- Frequent job submission: sample functionality every 30 minutes
- Tools on the UI (edg-get-job-info)
- RGMA publishers on RB and WN
- Screenshots: Job Summaries, RB Summaries, Metrics

8 Example: RB Service Monitoring
Our Experience
Maps not practical for day-to-day operational activities?

9 Example: RB Service Monitoring
Shows results of the latest round of jobs sent to RBs
View details of individual tests

10 Track a test job through the Grid, from UI to Worker Node
[Diagram labels: UI, edg-job-output, RB, L&B Info, RB Publisher, WN Publisher]

11 Recent History for a RB
Derive Metric Data
- Capture time to matchmake the job
- Capture availability in a 24 hr period: (number of jobs to reach DONE) / (total number of jobs submitted)
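
The availability ratio on this slide can be written out as a small function. This is a minimal sketch: the function name and the list-of-state-strings input are assumptions made for illustration, not the sensor's actual data model.

```python
def availability(job_states):
    """Fraction of submitted test jobs that reached the DONE state.

    job_states: final state strings for the jobs submitted in the window
    (assumed here to be the last 24 hours).
    """
    total = len(job_states)
    if total == 0:
        return None  # nothing submitted, so the ratio is undefined
    done = sum(1 for state in job_states if state == "DONE")
    return done / total
```

With one test job every 30 minutes, a 24 hr window holds at most 48 samples per RB, so a single failed job moves the metric by about two percentage points.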

12 Passive Monitoring
Passive Monitoring (Italy: Sergio Andreozzi)
Processing of log files: http://www.cnaf.infn.it/~andreozzi/wiki/Work/WMS
Workload Manager Component
- WaitingRequests
- InputFileListSize
Job Controller
- WaitingRequests
- InputFileListSize
Network Server
- submissionRate (requests/600s)
WM Proxy
- ServerPoolSize
Whole System
- InJobs in last 10 mins
- OutJobs in last 10 mins
Hosting Environment
- Load (1, 5, 15), memory (used, free, total, real, virtual)

13 General Issues
Will R-GMA/MySQL be able to cope with the volume of data?
- GSTAT (GIIS monitor) alone generates 5 GB of data per month
- CERN are considering moving to Oracle (RGMA supported or migrating data from the MySQL archiver)
- What plans are there for Oracle support in R-GMA?

14 Types of Accounting
Job Accounting AFTER the event (APEL Domain)
- Concept of a "Job" as a unit of resource consumption
- Determination of value after job execution
- Job usage record as a complete description of resource consumption
- Suitable for post-paid services
Real Time Accounting (DGAS, SGAS Domain)
- Incremental determination of resource value while the job is being executed
- Incremental decrement of account balance
- Can enforce user quotas
- Suitable for pre-paid services
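
The difference between the two models can be illustrated with a short sketch. None of the names below (`UsageRecord`, `PrepaidAccount`, `run_with_quota`) come from APEL, DGAS or SGAS; they are hypothetical and only contrast a record written once after the job with a balance that is decremented while the job runs.

```python
from dataclasses import dataclass


@dataclass
class UsageRecord:
    """After-the-event style: one complete record, written when the job ends."""
    job_id: str
    cpu_seconds: float


class PrepaidAccount:
    """Real-time style: the balance is decremented while the job is running."""

    def __init__(self, balance_cpu_seconds):
        self.balance = balance_cpu_seconds

    def charge(self, cpu_seconds):
        """Charge one increment; return False if the quota would be exceeded."""
        if cpu_seconds > self.balance:
            return False
        self.balance -= cpu_seconds
        return True


def run_with_quota(account, increments):
    """Charge each increment in turn and stop at the first refused charge."""
    used = 0.0
    for inc in increments:
        if not account.charge(inc):
            break
        used += inc
    return used
```

In this sketch a job that asks for three 40 second increments against a 100 second balance is stopped after 80 seconds, which is the quota enforcement that after-the-event accounting cannot provide.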

15 APEL, Job Accounting Flow Diagram
[1] Build Job Accounting Records at site
[2] Send Job Records to a central repository
[3] Data Aggregation

16 Accounting for Grid Jobs
Build Job Records at Site
APEL: mapping grid users to the resource usage on local farms

17 Diagram labels:
- Job Records In via RGMA
- RGMA MON
- SQL QUERY TO Accounting Server, 1 Query / Hour
- On-Demand Accounting Pages based on SQL queries to summary data
- 1 Record per Grid Job (millions of records expected)
- Summary data refreshed every hour (max records about 100K per year)
- Home Page, User queries, Graphs
- GOC Consolidation of Data

18 APEL Status
APEL has been in production for 1 year
156 sites, 5.4 million job records
100K job records per week
-> Linear rise (c.f. exponential) continues despite growth in CE.
-> More sites doesn't mean more jobs or more users.

19 Demos of Accounting Aggregation
Global views of resource consumption.
LHC View: http://goc.grid-support.ac.uk/gridsite/accounting/tree/treeview.php
- Data aggregation across countries
EGEE View: http://www2.egee.cesga.es/gridsite/accounting/CESGA/tree_egee.php
- Data aggregation across EGEE ROC
- Based on LHC View and data mining
- Displays official EGEE VOs (12) and regional VOs
- Tables to show which GOCDB sites haven't published recently, and which ones publish but are not listed in GOCDB
GridPP View: http://goc.grid-support.ac.uk/gridsite/accounting/tree/gridppview.php
- Specific view for GridPP accounting summaries for Tier-2s
- Comments from GridPP users -> Prototype -> EGEE view changes

20 Aggregation of Data for GridPP

21 Aggregation of Data for Tier2

22 Data Aggregation at Site Level
Breakdown of data per VO per month showing Njobs, CPUt, WCT, record history
Total CPU Usage per VO
Gantt Chart
NB: Gaps across all VOs are consistent with scheduled downtimes in GOCDB
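
The per-VO, per-month breakdown described here amounts to a group-by over individual job records. The sketch below assumes a hypothetical record layout (the keys `vo`, `end_date`, `cpu_s` and `wall_s` are illustrative, not the APEL schema) and shows how Njobs, CPU time and wall-clock time could be summed.

```python
from collections import defaultdict


def summarise(records):
    """Aggregate per-job records into (vo, 'YYYY-MM') summary rows.

    Each record is a dict with hypothetical keys:
    'vo', 'end_date' (ISO 'YYYY-MM-DD'), 'cpu_s', 'wall_s'.
    """
    summary = defaultdict(lambda: {"njobs": 0, "cpu_s": 0.0, "wall_s": 0.0})
    for rec in records:
        key = (rec["vo"], rec["end_date"][:7])  # month taken from the ISO date
        row = summary[key]
        row["njobs"] += 1
        row["cpu_s"] += rec["cpu_s"]
        row["wall_s"] += rec["wall_s"]
    return dict(summary)
```

Keeping only such summary rows is what lets the web views query a small table instead of one row per grid job.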

23 Batch Support in APEL
Currently available in LCG 2.6
OpenPBS, Torque, PBSPro and Vanilla PBS
- ~80% of sites in LCG/EGEE
Load Sharing Facility (Versions 5 and 6)
- CERN, Italy
In development
Condor (http://goc.grid-support.ac.uk/gridsite/accounting/condor.html)
- Requested by Canada, UK
- Was due for release in Nov/December but delayed
- Deal with multiple parses of large batch files: Condor does not self-manage its logs, so they grow to > 2 GB in size; multiple parses via APEL are inefficient.
Sun Grid Engine (http://goc.grid-support.ac.uk/gridsite/accounting/sge.html)
- Requested by UK (Imperial College)
- Format of log records unclear to us: missing information in message logs
- LCG-SGE job manager format is not LCG compliant (PBS, LSF and Condor all are!). Substantial changes to APEL required unless this is addressed more carefully.
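
One common way to avoid re-parsing a large, ever-growing log is to remember how far the previous run got. The sketch below illustrates that idea only; it is not how APEL handles Condor logs, and the function name and the side-car offset file are assumptions.

```python
import os


def read_new_lines(log_path, offset_path):
    """Return complete lines appended to log_path since the previous call.

    The byte offset reached is stored in offset_path (a hypothetical
    side-car file) so that the next call skips what was already parsed.
    """
    offset = 0
    if os.path.exists(offset_path):
        with open(offset_path) as f:
            offset = int(f.read().strip() or 0)
    if offset > os.path.getsize(log_path):
        offset = 0  # log was truncated or replaced; start from the beginning
    lines = []
    with open(log_path, "rb") as log:
        log.seek(offset)
        for raw in log:
            if not raw.endswith(b"\n"):
                break  # incomplete last line; leave it for the next run
            offset += len(raw)
            lines.append(raw.decode("utf-8", errors="replace").rstrip("\r\n"))
    with open(offset_path, "w") as f:
        f.write(str(offset))
    return lines
```

A production parser would hand each line to the record builder as it is read rather than collecting them in a list, which matters on the first pass over a multi-gigabyte file.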

24 APEL/RGMA Issues
Publishing Missing Records
- Options available to users are limited
- Republish means republish everything (all records): this exceeds internal memory limits in Java, causing APEL to crash.
RGMA Archiver is growing in size
- It takes longer to traverse the database
- About 2 minutes to run the summary generator
- Benefit to move to Oracle
Batch Support is still limited (!)
- Condor and SGE should be seen by the community to be important extensions to the application.
APEL and gLite (!)
- Will APEL work in this environment? NB: the web summary views are independent of APEL
Data Privacy and Security
- Sites don't publish User DN … it's private data
- Restrict access to private data via RGMA client
- Data needs to be shifted from producers to consumers in a secure way
Restricted to Fixed Schema in RGMA?
- Cannot easily add new fields to the database
- Unable to capture information about jobs in batch logs, e.g. exit status, time in queue, etc. (STEVE FISHER COMMENTS: NEW FIELDS CAN BE ADDED)

25 What Lies Ahead?
Challenges Ahead
World Wide Accounting Service for LCG

26 Wider Issues
How important is accounting?
- Compute resource viewed as a grid currency
- Need a guarantee that the data has not been tampered with in an unfair way
- How does normalisation fit into this? The concept of a raw usage record has no meaning if internal scaling is applied to heterogeneous farms.
Recognise that accounting isn't just about "job usage"; it's about resource usage, which encompasses many things:
- CPU usage
- Also storage and network usage
- Treated differently? CPU is consumed; storage is occupied and can be recycled
Getting Data from All Participants
- Hasn't been easy to get all sites in EGEE to send data to us.
- Many reasons: some technical, some political
- How do we account for usage in wider communities which span grid projects, e.g. LHC?
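
The normalisation question can be made concrete with a small example. The sketch below assumes that each node has a published power rating in some benchmark unit and that usage is scaled to a reference rating; the function name, the parameters and the reference value of 1000 are all illustrative assumptions, not the scheme used by any particular site.

```python
def normalised_cpu_hours(raw_cpu_seconds, node_power, reference_power=1000.0):
    """Scale raw CPU time by node power relative to a reference machine.

    node_power and reference_power are in the same (assumed) benchmark unit.
    """
    return raw_cpu_seconds / 3600.0 * (node_power / reference_power)
```

Under these assumptions one raw CPU hour on a node rated at 2000 counts as 2.0 normalised hours, while two raw hours on a node rated at 500 count as 1.0. If a batch system has already scaled its CPU times internally, the "raw" figure fed into such a function is no longer raw, which is the concern raised on this slide.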

27 Challenges Ahead
Data Collection
- Many implementations for collecting accounting data in the LCG world:
  - APEL/DGAS in EGEE
  - SGAS in SweGrid
  - Sites that implement their own systems (Fermilab: multiple grid job managers from different grids feed a single Condor pool)
  - Also OSG, who are interested in deploying APEL with their own transport mechanism
- Switching one for another doesn't resolve the problem of data sharing across the project. No mechanism is in place to share this data in a consistent way.
- GGF working on a Resource Usage Service
- What would the model for data sharing look like? Low level or high level?
- Low level: sensors publishing data via a web service?
- High level: data collected within the infrastructure, aggregated in a meaningful way, then reviewed and approved before it can be passed on (Fermilab)
- Some Tier-1 centres have concerns about data association: "LCG not EGEE", "Will the service be separate?"

28 Challenges Ahead
Usage Reporting at what Level?
- Anonymous level: how much resource has been provided to each VO
- Aggregation across: VOs, countries, regions, grids, organisations
- Granularity: summed over units of hours, days, weeks, months?
User Level Reporting?
- If 10,000 CPU hours were consumed by Atlas VO, who are the users that submitted the work?
- Data privacy laws
- A Grid "DN" is personal information which could be used to target an individual.
- Who has access to this data and how do you get it?
- Can CA policies change to support anonymous DNs and reverse DN mappings?
- What are the consequences?
Are there any lawyers in the audience?

29 World Wide Accounting Service for LCG
Project involves combining results from all three peer infrastructures and presenting an aggregated view of resource usage for LHC VOs to the RRB
- Peer infrastructures in LCG
  - Open Science Grid + Others (Ruth Pordes, Philippe Canal, Matteo Melani)
  - Nordugrid (Per Oster, Thomas Sandholm)
  - LCG/EGEE (Kors Bos, Dave Kant)
GRID-ACCOUNTING@LISTSERV.RL.AC.UK

30 Resource Usage Service
Based on emerging GGF standards and Web Services
- GGF UR, OGSI
An implementation exists in "Market for Computational Science", a UK e-Science project.
What does DGAS provide?
Use case might be:
- A user invokes the query service through a web browser, using SSL for client authentication, to ensure that usage information at user level belongs to the user. Servlet sends query to RUS web service and gets user data.
[Diagram labels: Service Interface, RUS WS, Application, ACL, DB, Web Service Container]
Work started with Akram Khan and Xiaoyu Chen at Brunel

31 Conclusions
Very busy year ahead
SGE and Condor support need to be completed
Improve some features of APEL that cause difficulties
Investigate LCG-RUS
Service Metrics Activity: very important, beginning to consume effort.

