D0 Grid Data Production 1 – D0 Grid Data Production Initiative: Coordination Mtg
Version 1.0 (meeting edition)
21 May 2009
Rob Kennedy and Adam Lyon
Attending: RDK, …

D0 Grid Data Production 2 – Overview

News and Summary
– Close-out prep meetings: D0 CAB on 5/22. Anticipate this to be brief, with an All-CAB2 focus.
– Coordination meetings: relocated to WH9SE Libra. Remaining meetings: 5/21, 5/28, 6/04.

Agenda
– News
– Monitoring
– AOB

D0 Grid Data Production 3 – Topics Remaining to Cover

5/14: CAB Configuration
– Optimize use of CAB resources, beyond just d0farm and CAB2 (providers' request)
– Retain turn-around/response time for analysis (customers'/users' request)
– Simplify production coordination, improve processing flexibility (everyone's request)
– How to proceed from here?

5/21: Monitoring – THIS WEEK
– Assess what we all have now, where our gaps are, and what would be most cost-effective to address
– See Gabriele's white paper on D0 grid job tracking (includes monitoring, with a focus on OSG) (CD DocDB 3129)
– May also reference Ruth's look into monitoring, which produced an inventory (CD DocDB 3106)
– How to proceed from here?

5/28: Condor Releases, SAMGrid Upgrade
– Leftover to-do from the Initiative: release the new SAMGrid with the added state feature
– Upgrade the production release of Condor with fixes – modify the Condor/VDT upgrade procedure?
– How to proceed from here?

6/04: Close-Out
– Transition of samgrid.fnal.gov support from FGS to FEF (or the appropriate group)
– Lessons learned, close-out festivities plan

D0 Grid Data Production 4 – Monitoring: Overview

Assess what we have now, what we lack, and what would be cost-effective to work on
– Flesh out the list of monitoring that is either available or desirable
  - Review some old slides for ideas already presented
  - Ask questions, visit the online plots, and BE SURE that what was used 3 months ago still works
– Review the formal work in this area by Gabriele (DocDB 3129)
  - The work documented in DocDB 3106 is an inventory; can consult it offline
– Prioritize suggestions by cost-benefit estimate as time permits

Management View – monitoring to present to the D0 Spokespersons and the CD Head
– CPU utilization, % of max capacity used (trend), production output, CPU/event = f(L)
  - Supports capacity management and financial management
– Effort required to coordinate Data Production: auxiliary plots to show resubmissions and error investigations?

Technical View – monitoring used by the service itself to judge technical performance and health
– Job state transition timing (a minimal sketch follows below)
– (Mine the XML DB – could deliver much on error categories of jobs)

Operations View – monitoring used by the coordinator and users to assess operations
– Job de-queuing time (analysis jobs) – addresses a stated D0 qualitative requirement (is there a plot yet?)
– Job id mapping (grid system id ↔ batch system id)
– Job status that follows the higher-level job workflow, rather than the implementation workflows
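To make the "job state transition timing" item under the Technical View concrete, here is a minimal sketch, assuming a hypothetical list of (timestamp, state) transitions for one job; the real data would come from the SAMGrid XML DB, whose schema is not reproduced here.

```python
from collections import defaultdict
from datetime import datetime

def time_in_states(transitions):
    """transitions: [(timestamp, state), ...] in time order for one job
    (hypothetical format, not the actual SAMGrid/XML DB schema).
    Returns total seconds the job spent in each state."""
    totals = defaultdict(float)
    for (t0, state), (t1, _next_state) in zip(transitions, transitions[1:]):
        totals[state] += (t1 - t0).total_seconds()
    return dict(totals)

# Example: a job that queued for 45 minutes, then ran for 3h45m.
ts = lambda s: datetime.strptime(s, "%Y-%m-%d %H:%M")
job = [(ts("2009-05-21 08:00"), "queued"),
       (ts("2009-05-21 08:45"), "running"),
       (ts("2009-05-21 12:30"), "done")]
print(time_in_states(job))   # {'queued': 2700.0, 'running': 13500.0}
```

Aggregating the "queued" portion across jobs would also give the job de-queuing time distribution wanted under the Operations View.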

D0 Grid Data Production 5 – Metric: Job Slot Utilization

Before: Job Slot Inefficiency
– Top plot: dips in the green trace
– ~10% unused job slots in Sep '08... part of the big problem
– Some of this was due to the job queue going empty (blue trace hits bottom)

After: Steady so far!
– Bottom plot: negligible dips in the green trace
– One instance of an empty queue; treat in the next round

Source Link
– See the plot near the bottom of the page

MONITORING NOW: Still available, along with plots from FermiGrid – good state. (A sketch of the slot-utilization calculation follows below.)
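To show how the job-slot figure of merit could be reduced to a single number alongside the plots, a minimal sketch, assuming hypothetical (used, total) slot samples; the real inputs are the CAB/FermiGrid occupancy plots linked above.

```python
def job_slot_utilization(samples):
    """samples: [(used_slots, total_slots), ...] taken at regular intervals
    (hypothetical input; the real data come from the CAB/FermiGrid plots).
    Returns the average fraction of job slots in use."""
    fractions = [used / total for used, total in samples if total > 0]
    return sum(fractions) / len(fractions) if fractions else 0.0

# A brief dip to 90% occupancy, as in the Sep '08 "before" picture:
samples = [(2000, 2000), (1800, 2000), (2000, 2000)]
print(f"{job_slot_utilization(samples):.1%}")   # 96.7%
```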

D0 Grid Data Production 6 – Metric: CPU Utilization

Before: Occupancy & Load Dips
– This metric convolutes job slot and CPU utilization; look for CPU inefficiency when load dips but job slot use does not
– Top plot: dips only in load (black trace)
– Example: recent file transfer daemon (fcpd) failure (fixed)
– Side effect of a more stable system: easier to see low-level issues AND to debug larger issues; less entanglement

After: Occupancy & Load Steady
– Bottom plot: steady load (the 4am bump is OK)
– Supports the Job Slot Utilization plot
– Now starting to identify low-level issues

Source Link; Supporting Metric: CPU/Wall time
– CPU time / wall clock used by "d0farm" as reported by CAB accounting. Fell from over 90% to 84% in Nov. Since deployment: 95%!

MONITORING NOW: Great plots from FermiGrid. PBS plots changed with the upgrade. Do we have CPU time per job for d0farm? We want the CPU/wall-clock ratio for d0farm over time; do we now have only an instantaneous value? (A sketch of the ratio calculation follows below.)

(Plot annotations: "This Week"; "fcpd fails, jobs wait for data"; "deployment")
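The supporting metric, CPU time over wall-clock time for d0farm, amounts to a simple aggregation over accounting records. A minimal sketch, assuming a hypothetical per-job (cpu_seconds, wall_seconds) list rather than the actual CAB accounting format:

```python
def cpu_over_wall(jobs):
    """jobs: [(cpu_seconds, wall_seconds), ...] for finished d0farm jobs
    (hypothetical format, not the actual CAB accounting output).
    Returns aggregate CPU time divided by wall-clock time."""
    cpu = sum(c for c, _ in jobs)
    wall = sum(w for _, w in jobs)
    return cpu / wall if wall else 0.0

# One CPU-bound job and one that stalled waiting for data (e.g. fcpd down):
print(f"{cpu_over_wall([(9500, 10000), (6500, 10000)]):.0%}")   # 80%
```

Computing this per day or per week, rather than only instantaneously, is exactly the gap flagged in the MONITORING NOW note above.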

D0 Grid Data Production 7 – Metric: Events Processed/Day

Before: Wild swings, low average
– Top-level metric, so exposed to many different potential issues
– Top plot: running average over 1 week (a sketch of the running average follows below)
– Service load swings due to services shared with MC Production
– Little "downtimes", especially on weekends
– >10% "known" utilization inefficiency from job slot utilization

After: Too early to be sure... higher?
– 9 days shown: 6.5 MEvts/day
– Need more statistics to be sure, but this is what was expected in Phase 1
– The eventual goal of 8 MEvts/day with the existing system (node count, etc.) will mean addressing more subtle issues
– Now: address the low-level issues becoming clearer with a stable system

Source Link (unmerged values)

MONITORING NOW: Plots/data from Mike provided the data for these plots; may require a little processing for presentation. We lack an "initial success" plot, which would help characterize the effort expended on "rework".

(Plot annotations: Sep-Nov '08 average 5.2 MEvts/day; early Dec '08 average 6.5 MEvts/day)
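The 1-week running average in the top plot is easy to recompute from the daily numbers. A minimal sketch with made-up daily values in MEvts/day (not the actual production data):

```python
def running_average(daily_mevts, window=7):
    """daily_mevts: events processed per day in MEvts (illustrative values).
    Returns the trailing running average once a full window is available."""
    return [sum(daily_mevts[i - window + 1 : i + 1]) / window
            for i in range(window - 1, len(daily_mevts))]

days = [5.0, 5.5, 4.8, 6.0, 5.1, 5.3, 4.7, 6.4, 6.6]
print([round(x, 2) for x in running_average(days)])   # [5.2, 5.4, 5.56]
```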

D0 Grid Data Production 8 – Execution Time = f(L)
("initial luminosity" at the beginning of the run, not at the beginning of the store)

– Past: ~30 sec/evt
– Now: ~60 sec/evt
– Eventually: ~120 sec/evt??? (watch, and have a plan in place)

(A back-of-the-envelope throughput sketch follows below.)
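Because reconstruction time per event rises with initial luminosity, throughput for a fixed farm falls accordingly. A back-of-the-envelope sketch; the slot count (5000) and CPU efficiency (95%) are illustrative assumptions, not measured values:

```python
def mevts_per_day(job_slots, sec_per_event, cpu_efficiency=0.95):
    """Rough throughput estimate for a fixed farm (slot count and
    efficiency are illustrative assumptions, not measured values)."""
    return job_slots * 86400.0 * cpu_efficiency / sec_per_event / 1e6

# Doubling the per-event time roughly halves the achievable throughput:
for spe in (30, 60, 120):
    print(f"{spe:3d} sec/evt -> {mevts_per_day(5000, spe):.1f} MEvts/day")
```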

D0 Grid Data Production 9 – Average Initial Luminosity
("initial luminosity" at the beginning of the run, not at the beginning of the store)

(Plot annotations: "Now: ~60 sec/evt"; "Long-term bracket?")

D0 Grid Data Production 10 – Job Tracking on the Open Science Grid for the DZero Virtual Organization – GG (DocDB 3126)

– D0 Grid Infrastructure
– Different Categories of Monitoring Information
  - Job Status Monitoring
  - Resource Characteristics
  - Job Internal Status
– Problems and Desired Properties of the Monitoring Infrastructure
  - Reliability
  - Completeness
  - Presentation
– Existing Mitigating Solutions – Input for D0
  - GlideinWMS
– Possible Infrastructural Improvements – Input for OSG
– Experience from Other VOs

D0 Grid Data Production 11 – Backup Slides

Not expected to be presented at the meeting.

D0 Grid Data Production 12 – Operations-Driven Projects

2.3.1 Monitoring
– Workshop to assess what we all have now, where our gaps are, and what would be most cost-effective to address
  - Can we "see" enough in real time? Collect what we all have, define requirements (within the resources available), and execute.
  - Can we trace jobs "up" as well as down? Enhance the existing script to automate the batch-job-to-grid-job "drill-up"? (See the sketch after this slide.)
  - Input includes Gabriele's paper covering D0 monitoring

2.3.2 Issues
– Address the "stuck state" issue affecting both Data and MC Production – PM has some examples. Update?
– Large-job issue from Mike? (RDK to research what this was. Was it 2+ GB memory use? If so, add memory to a few machines to create a small "BIG MEMORY" pool?)

2.3.3 Features
– Add state at the queuing node (from Phase 1 work). Waiting on Condor development. PM following up on this; GG to try to push it.
– FWD load balancing: distribute jobs "evenly" across the forwarding (FWD) nodes… however it is easiest to do or approximate

2.3.4 Processes
– Enable REX/Ops to deploy new Condor. In Phase 2, but lower priority. Condor deployments for bug fixes are coming up.
– Revisit capacity/configuration twice a year? Continuous Service Improvement Plan – RDK, towards the end of Phase 2

2.3.5 Phase 1 Follow-up
– Enable auto-update of gridmap files on the queuing nodes. Enable monitoring on the queuing nodes. AL: partly done.
– Lessons learned from recent ops experience (RDK to revive the list; reconsider towards the end of Phase 2)
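For the batch-job-to-grid-job "drill-up" under 2.3.1, a minimal sketch: it assumes a hypothetical grid-job-to-batch-job mapping (e.g. scraped from forwarding-node logs) and simply inverts it; a real script would have to parse actual SAMGrid and PBS/Condor identifiers.

```python
def build_drill_up(grid_to_batch):
    """Invert a grid-job -> [batch-job, ...] mapping so an operator can
    'drill up' from a local batch id to the owning grid job.
    The mapping source and id formats here are hypothetical."""
    drill_up = {}
    for grid_id, batch_ids in grid_to_batch.items():
        for batch_id in batch_ids:
            drill_up[batch_id] = grid_id
    return drill_up

grid_to_batch = {"samgrid_5501": ["cab.12345", "cab.12346"],
                 "samgrid_5502": ["cab.12400"]}
print(build_drill_up(grid_to_batch)["cab.12346"])   # samgrid_5501
```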

D0 Grid Data Production 13 – Phase 2 Work List Outline

2.1 Capacity Management: Data Production is not keeping up with data logging.
– Capacity Planning: model nEvents per day and forecast the CPU needed (see the sketch after this slide)
– Capacity Deployment: procure, acquire, or borrow CPU. We believe the infrastructure is capable.
– Resource Utilization: use what we have as much as possible; maintain the improvements.

2.2 Availability & Continuity Management: the expanded system needs higher reliability.
– Decoupling: deferred. Phase 1 work has proven sufficient for the near term.
– Stability, Reduced Effort: deeper queues. The goal is fewer manual submissions per week.
– Resilience: add/improve redundancy at the infrastructure-service and CAB levels.
– Configuration Recovery: capture configuration and artefacts in CVS consistently.

2.3 Operations-Driven Projects
– Monitoring: execute a workshop to share what we have, identify gaps and cost/benefits.
– Issues: address the "stuck state" issue affecting both Data and MC Production.
– Features: add state at the queuing node (from Phase 1). Distribute jobs "evenly" across the FWD nodes.
– Processes: enable REX/Ops to deploy new Condor… new bug fixes coming soon.
– Phase 1 Follow-up: a few minor tasks remain from the rush to deploy… dot the i's and cross the t's.

Deferred Work List: maintain it, with reasons for deferring work.
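For the Capacity Planning item under 2.1 ("model nEvents per day, forecast CPU needed"), a minimal sketch of the kind of forecast involved; the CPU efficiency and the per-event times are illustrative assumptions, not the official capacity model:

```python
def slots_needed(target_mevts_per_day, sec_per_event, cpu_efficiency=0.95):
    """How many continuously busy job slots are needed to sustain the target
    rate at a given reconstruction time per event (illustrative model only)."""
    wall_seconds_per_day = target_mevts_per_day * 1e6 * sec_per_event / cpu_efficiency
    return wall_seconds_per_day / 86400.0

# Slot count needed for the 8 MEvts/day goal as sec/evt rises with luminosity:
for spe in (30, 60, 120):
    print(f"{spe:3d} sec/evt -> ~{slots_needed(8, spe):.0f} slots")
```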