
D0 Grid Data Production Initiative: Coordination Mtg. Version 1.0 (meeting edition), 21 May 2009. Rob Kennedy and Adam Lyon.


Slide 1: D0 Grid Data Production Initiative: Coordination Mtg
Version 1.0 (meeting edition), 21 May 2009
Rob Kennedy and Adam Lyon
Attending: RDK, …

Slide 2: Overview
News and Summary
– Close-out Prep Meetings: D0 CAB on 5/22. Anticipate this to be brief, with an All-CAB2 focus.
– Coordination Meetings: re-located to WH9SE Libra. Remaining meetings: 5/21, 5/28, 6/04.
Agenda
– News
– Monitoring
– AOB

Slide 3: Topics Remaining to Cover
5/14: CAB Configuration
– Optimize use of CAB resources, beyond just d0farm and CAB2 (providers' request)
– Retain turn-around/response time for analysis (customers'/users' request)
– Simplify Production Coordination, improve processing flexibility (all request)
– How to proceed from here?
5/21: Monitoring (THIS WEEK)
– Assess what we all have now, where our gaps are, and what would be most cost-effective to address
– See Gabriele's white paper on D0 grid job tracking (includes monitoring, focus on OSG) (CD DocDB 3129)
– May also reference Ruth's look into monitoring, which produced an inventory (CD DocDB 3106)
– How to proceed from here?
5/28: Condor Releases, SAMGrid Upgrade
– Leftover to-do from the Initiative: release new SAMGrid with the added state feature
– Upgrade the production release of Condor with fixes; modify the Condor/VDT upgrade procedure?
– How to proceed from here?
6/04: Close-Out
– Transition of samgrid.fnal.gov support from FGS to FEF (or appropriate group)
– Lessons Learned, close-out festivities plan

Slide 4: Monitoring – Overview
Assess what we have now, what we lack, and what would be cost-effective to work on
– Flesh out the list of monitoring that is either available or desirable: review some old slides for ideas already presented; ask questions, visit the online plots, and BE SURE that what was used 3 months ago still works.
– Review the formal work in this area by Gabriele (3129). The work documented in 3106 is an inventory and can be consulted offline.
– Prioritize suggestions by cost-benefit estimate as time permits.
Management View – monitoring to present to D0 Spokes and the CD Head
– CPU Utilization, % Max Capacity Used trend, Production Output, CPU/event = f(L): capacity management, financial management.
– Effort required to coordinate Data Production: auxiliary plots to show resubmissions, error investigations?
Technical View – monitoring used by the service itself to judge technical performance and health
– Job state transition timing (a minimal sketch of this follows below).
– Mine the XML DB – this could deliver much on error categories of jobs.
Operations View – monitoring used by the coordinator and users to assess operations
– Job de-queuing time (analysis jobs): addresses a stated D0 qualitative requirement (is there a plot yet?).
– Job id mapping (grid system id → batch system id).
– Job status that follows a higher-level job workflow, rather than the implementation workflows.
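The job state transition timing item can be made concrete with a few lines of code. This is a minimal sketch only: the (job_id, state, timestamp) record layout is an assumption for illustration, and a real implementation would pull state histories from the SAMGrid/Condor logs or the XML DB mentioned above.

```python
# Minimal sketch of "job state transition timing" (Technical View).
# The record layout below is a hypothetical example, not the real schema.
from collections import defaultdict
from datetime import datetime

events = [  # hypothetical job-state history, ordered in time per job
    ("job_001", "queued",    "2009-05-21 08:00:00"),
    ("job_001", "running",   "2009-05-21 09:30:00"),
    ("job_001", "completed", "2009-05-21 14:10:00"),
    ("job_002", "queued",    "2009-05-21 08:05:00"),
    ("job_002", "running",   "2009-05-21 11:00:00"),
    ("job_002", "failed",    "2009-05-21 11:20:00"),
]

def parse(ts):
    return datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")

# Accumulate the seconds each job spent in each state before transitioning.
time_in_state = defaultdict(float)
last_seen = {}  # job_id -> (state, timestamp)
for job_id, state, ts in events:
    t = parse(ts)
    if job_id in last_seen:
        prev_state, prev_t = last_seen[job_id]
        time_in_state[prev_state] += (t - prev_t).total_seconds()
    last_seen[job_id] = (state, t)

for state, seconds in sorted(time_in_state.items()):
    print(f"{state:10s} {seconds/3600.0:6.2f} hours total")
```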

Slide 5: Metric: Job Slot Utilization
Before: Job Slot Inefficiency
– Top plot: dips in the green trace.
– About 10% of job slots unused in Sep '08... part of the big problem.
– Some of this was due to the job queue going empty (blue trace hits bottom).
After: Steady so far!
– Bottom plot: negligible dips in the green trace.
– One instance of an empty queue. Treat in the next round.
Source Link
– See the plot near the bottom of the page.
MONITORING NOW: still available, along with plots from FermiGrid – good state. (A small sketch of the metric itself follows below.)
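A minimal sketch of how the job-slot utilization number and the "empty queue" condition could be computed from occupancy samples. The sample values and field layout are assumptions; in practice the inputs are the CAB/FermiGrid occupancy data behind the plots referenced on this slide.

```python
# Hypothetical occupancy samples: (occupied_slots, total_slots, queued_jobs)
# at successive sample times. Real values would come from CAB/FermiGrid.
samples = [
    (2300, 2400, 150),
    (2380, 2400, 80),
    (2150, 2400, 0),   # queue went empty: slots drain and utilization dips
    (2390, 2400, 200),
]

used = sum(occ for occ, tot, q in samples)
available = sum(tot for occ, tot, q in samples)
utilization = used / available
empty_queue_samples = sum(1 for occ, tot, q in samples if q == 0)

print(f"Average job-slot utilization: {utilization:.1%}")
print(f"Samples with an empty queue:  {empty_queue_samples} of {len(samples)}")
```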

Slide 6: Metric: CPU Utilization
Before: Occupancy & Load Dips
– This metric convolutes job slot and CPU utilization; we look for CPU inefficiency when the load dips but job slot use does not.
– Top plot: dips only in load (black trace).
– Example: the recent file transfer daemon failure (fixed).
– Side effect of a more stable system: easier to see low-level issues AND to debug larger issues; less entanglement.
After: Occupancy & Load Steady
– Bottom plot: steady load (the 4am bump is OK).
– Supports the Job Slot Utilization plot.
– Now starting to identify ...
Source Link; Supporting Metric: CPU/Wall time
– CPU time / wall clock used by "d0farm" as reported by CAB accounting. Fell from over 90% to 84% in Nov. Since deployment: 95%! (A small sketch of this ratio follows below.)
MONITORING NOW: great plots from FermiGrid. PBS plots changed with the upgrade. Do we have CPU time per job for d0farm? We want the CPU/wall clock ratio for d0farm over time; do we now have only an instantaneous value?
(Plot annotations: "This Week"; "fcpd fails, jobs wait for data"; "deployment".)
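A minimal sketch of the supporting metric, the CPU time to wall-clock ratio for "d0farm" aggregated per day. The accounting-row format here is an assumption for illustration; the real numbers come from CAB (PBS) accounting, as the slide notes.

```python
# Sketch: daily CPU/Wall ratio for the d0farm group from accounting rows.
# Row format (date, group, cpu_seconds, wall_seconds) is hypothetical.
from collections import defaultdict

accounting = [
    ("2009-05-18", "d0farm", 82000, 86400),
    ("2009-05-18", "d0mc",   40000, 86400),
    ("2009-05-19", "d0farm", 83500, 86400),
]

cpu = defaultdict(float)
wall = defaultdict(float)
for date, group, cpu_s, wall_s in accounting:
    if group == "d0farm":
        cpu[date] += cpu_s
        wall[date] += wall_s

for date in sorted(cpu):
    print(f"{date}  CPU/Wall = {cpu[date]/wall[date]:.1%}")
```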

Slide 7: Metric: Events Processed/Day
Before: Wild swings, low average
– This is a top-level metric, so it is exposed to many different potential issues.
– Top plot: running average over 1 week.
– Service load swings due to services shared with MC Production.
– Little "downtimes", especially on weekends.
– >10% "known" utilization inefficiency from job slot utilization.
After: Too early to be sure... higher?
– 9 days shown: 6.5 MEvts/day.
– Need more statistics to be sure, but this is what was expected in Phase 1.
– The eventual goal of 8 MEvts/day with the existing system (node count, etc.) will mean addressing more subtle issues.
– Now: address the low-level issues becoming clearer with a stable system.
Source Link (unmerged values)
MONITORING NOW: plots/data from Mike, which provided the data for these plots; may require a little processing for presentation. We lack an "initial success" plot – it would help characterize the effort expended on "rework". (A small sketch of the running-average calculation follows below.)
(Plot annotations: Sep–Nov '08 average 5.2 MEvts/day; early Dec '08 average 6.5 MEvts/day.)
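A minimal sketch of the one-week running average used in the plot. The daily totals below are made-up placeholders; the real inputs are the daily production numbers from Mike referenced above.

```python
# Sketch: events processed per day with a 7-day running average.
# daily_events holds hypothetical daily totals, oldest first.
daily_events = [5.1e6, 4.8e6, 5.5e6, 6.2e6, 6.6e6, 6.4e6, 6.7e6, 6.5e6, 6.8e6]

window = 7
for i in range(len(daily_events)):
    lo = max(0, i - window + 1)          # start of the trailing window
    avg = sum(daily_events[lo:i + 1]) / (i - lo + 1)
    print(f"day {i+1:2d}: {daily_events[i]/1e6:4.1f} MEvts, "
          f"{i - lo + 1}-day avg {avg/1e6:4.1f} MEvts/day")
```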

Slide 8: Execution Time = f(L)
("Initial luminosity" here means luminosity at the beginning of the run, not at the beginning of the store.)
– Past: ~30 sec/evt
– Now: ~60 sec/evt
– Eventually: ~120 sec/evt ??? (watch, and have a plan in place)
(A rough throughput sketch follows below.)
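To show why this trend matters, here is a hedged back-of-the-envelope relating seconds per event to the number of continuously busy CPU cores needed for a given daily throughput. The 30/60/120 sec/evt values come from this slide and the 6.5 and 8 MEvts/day figures echo the Events Processed/Day slide; everything else is illustrative arithmetic, not a capacity plan.

```python
# Sketch: busy cores needed to sustain a daily event throughput at a given
# reconstruction time per event. Purely illustrative numbers.
SECONDS_PER_DAY = 86400

def busy_cores_needed(events_per_day, sec_per_event):
    """CPU-seconds of work per day divided by the seconds in a day."""
    return events_per_day * sec_per_event / SECONDS_PER_DAY

for sec_per_evt in (30, 60, 120):
    for target in (6.5e6, 8.0e6):
        cores = busy_cores_needed(target, sec_per_evt)
        print(f"{sec_per_evt:3d} sec/evt, {target/1e6:3.1f} MEvts/day "
              f"-> ~{cores:5.0f} busy cores")
```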

Slide 9: Average Initial Luminosity
("Initial luminosity" here means luminosity at the beginning of the run, not at the beginning of the store.)
(Plot annotations: Now: ~60 sec/evt; long-term bracket?)

Slide 10: Job Tracking on the Open Science Grid for the DZero Virtual Organization – GG (DocDB 3126)
D0 Grid Infrastructure
Different Categories of Monitoring Information
– Job Status Monitoring
– Resource Characteristics
– Job Internal Status
Problems and Desired Properties of the Monitoring Infrastructure
– Reliability
– Completeness
– Presentation
Existing Mitigating Solutions – Input for D0
– GlideinWMS
Possible Infrastructural Improvements – Input for OSG
Experience from Other VOs

Slide 11: Backup Slides (not expected to be presented at the meeting)

Slide 12: 2.3 Operations-Driven Projects
2.3.1 Monitoring
– Workshop to assess what we all have now, where our gaps are, and what would be most cost-effective to address.
– Can we "see" enough in real time? Collect what we all have, define requirements (within the resources available), and execute.
– Can we trace jobs "up" as well as down? Enhance the existing script to automate the batch-job to grid-job "drill-up" (a minimal sketch of this mapping follows below).
– Input includes Gabriele's paper covering D0 Monitoring.
2.3.2 Issues
– Address the "stuck state" issue affecting both Data and MC Production – PM has some examples. Update?
– Large-job issue from Mike? (RDK to research what this was. Was it 2+ GB memory use? If so, add memory to a few machines to create a small "BIG MEMORY" pool?)
2.3.3 Features
– Add state at the queuing node (from Phase 1 work). Waiting on Condor development. PM following up on this; GG to try to push this.
– FWD load balancing: distribute jobs "evenly" across FWD nodes... however easiest to do or approximate.
2.3.4 Processes
– Enable REX/Ops to deploy new Condor. In Phase 2, but lower priority. Condor deployments for bug fixes are coming up.
– Revisit capacity/config twice a year? Continuous Service Improvement Plan – RDK, towards the end of Phase 2.
2.3.5 Phase 1 Follow-up
– Enable auto-update of gridmap files on the queuing nodes. Enable monitoring on the queuing nodes. AL: partly done.
– Lessons learned from recent ops experience (RDK to revive the list, reconsider towards the end of Phase 2).
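A minimal sketch of the batch-to-grid "drill-up" in 2.3.1: start from the batch job id an operator sees on CAB and recover the owning grid job. The mapping table and id formats are hypothetical; the existing script mentioned above would build the mapping from forwarding-node or SAMGrid logs instead.

```python
# Sketch: invert a grid-job -> batch-job mapping so operators can "drill up"
# from the batch id they see on CAB. All ids below are hypothetical.
grid_to_batch = {
    "samgrid_12345.0": ["987001.cabsrv1", "987002.cabsrv1"],
    "samgrid_12346.0": ["987010.cabsrv1"],
}

# Invert the mapping: batch job id -> owning grid job id.
batch_to_grid = {b: g for g, batch_ids in grid_to_batch.items() for b in batch_ids}

def drill_up(batch_id):
    return batch_to_grid.get(batch_id, "unknown (not found in mapping)")

print(drill_up("987002.cabsrv1"))   # -> samgrid_12345.0
print(drill_up("999999.cabsrv1"))   # -> unknown (not found in mapping)
```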

Slide 13: Phase 2 Work List Outline
2.1 Capacity Management: Data Production is not keeping up with data logging.
– Capacity Planning: model nEvents per day and forecast the CPU needed (a rough sketch follows below).
– Capacity Deployment: procure, acquire, borrow CPU. We believe the infrastructure is capable.
– Resource Utilization: use what we have as much as possible; maintain improvements.
2.2 Availability & Continuity Management: the expanded system needs higher reliability.
– Decoupling: deferred. Phase 1 work has proven sufficient for the near term.
– Stability, Reduced Effort: deeper queues. Goal is fewer manual submissions per week.
– Resilience: add/improve redundancy at the infrastructure-service and CAB level.
– Configuration Recovery: capture configuration and artefacts in CVS consistently.
2.3 Operations-Driven Projects
– Monitoring: execute a workshop to share what we have, identify gaps and cost/benefits.
– Issues: address the "stuck state" issue affecting both Data and MC Production.
– Features: add state at the queuing node (from Phase 1). Distribute jobs "evenly" across FWD.
– Processes: enable REX/Ops to deploy new Condor... new bug fixes coming soon.
– Phase 1 Follow-up: a few minor tasks remain from the rush to deploy... dotting i's and crossing t's.
Deferred Work List: maintain it, with reasons for deferring work.
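A rough sketch of the 2.1 capacity-planning comparison: data-logging rate versus processing rate, and the extra cores needed to keep up. Every number below is a placeholder assumption rather than a measured value.

```python
# Sketch: does processing keep up with data logging, and if not, how many
# extra cores would be needed? All inputs are illustrative placeholders.
SECONDS_PER_DAY = 86400

logged_events_per_day = 8.0e6    # assumed raw-data logging rate
sec_per_event = 60.0             # current reconstruction time (slide 8)
busy_cores_now = 4500            # assumed cores effectively busy on d0farm

processed_per_day = busy_cores_now * SECONDS_PER_DAY / sec_per_event
deficit = logged_events_per_day - processed_per_day
extra_cores = max(0.0, deficit * sec_per_event / SECONDS_PER_DAY)

print(f"Processing rate: {processed_per_day/1e6:.1f} MEvts/day")
print(f"Logging rate:    {logged_events_per_day/1e6:.1f} MEvts/day")
print(f"Extra cores needed to keep up: ~{extra_cores:.0f}")
```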

