D0 Grid Data Production Initiative: Phase 1 Status. Version 1.0, 12 December 2008. Rob Kennedy and Adam Lyon.


1 D0 Grid Data Production Initiative: Phase 1 Status
Version 1.0, 12 December 2008
Rob Kennedy and Adam Lyon

2 Outline
Initiative Overview
Major Issues
Roadmap and Phase 1 WBS
Before and After Metrics
Next Steps, Conclusion

3 Initiative Overview
Initiative is an Umbrella Project to achieve a broad set of goals
Scope: D0 Grid Data Production (taking MC Production into consideration)
Charge
–Evaluate D0 Grid Data Production, especially Resource Utilization, by end Sep ’08 – DONE
–Create and execute a Work Plan to achieve goal… – Phase 1 DONE
–Goal: Stable Grid Data Production operations that efficiently utilize the resources available.
–Constraints: Achieve improvements ASARP. No explicit end date or staff level limits set.
Initiative Team (Execution Phase): October 2008 to present
–Project Manager: Rob Kennedy – CD OPMQA
–Project Co-Manager: Adam Lyon – D0 Collab and CD SCF/REX/Ops Group Leader
–Communication with broad set of stakeholders: Weekly meeting Thursdays at 9am
    D0 Production Coordinators – Mike Diesburg, Joel Snow; D0 Collaborators – Chip Brock, Qizhong Li
    CD: FermiGrid Svcs (Steve Timm, Keith Chadwick), SAM-Grid Dev (Gabriele Garzoglio, Parag Mhashilkar, Andrew Baranovski), REX Ops (Robert Illingworth, Joe Boyd), SCF Mgmt (Margaret Votava, Eileen Berman), Fermi Exp’t Facilities (Jason Allen, Glenn Cooper)
    OSG: Abhishek Singh Rana
–Documentation Home: http://d0db-prd.fnal.gov/rexipedia/common/SAMGridD0/GDPEval

4 Major Issues
Resource Utilization is lower than expected for a production system (the motivating concern).
–CPUs allotted to Data Production use are not kept busy, even though jobs & data are available to be run.
–Causes: Shallow queues must be refilled often. Grid System latencies lead to slow filling of open CPU slots.
D0 Grid System Uptime, First-Time Success Rates are lower than expected for production.
–Leads to re-running of jobs and/or manual checking of job records to determine success/failure
–Causes: Grid Batch System bugs (some known to be fixed in Condor 7), Context Event Server failures, …
D0 Grid System requires too much effort for customer (D0 Production Coordinator) to use.
–Hours per day looking at failed jobs or checking whether jobs failed. 1-2 touches per day to keep queues full (w/scripts).
Sum of the above significantly reduces the MEvents/day that D0 actually reconstructs.
–Mike Diesburg estimates, confirmed by historical record, BEFORE the Initiative:
    Max capacity of current system = 10 MEvents/day (million events per day).
    Realistic sustained level = 8-9 MEvents/day. We expect about 10% endemic inefficiency due to issues “not worth our fixing” like internal latencies, facility power outages, hardware failure recovery.
    Observed sustained level = 5.2 MEvents/day…. 60-65% of expected value.
    Absolute numbers are not the focus as yet; rather, the ratio is agreed by all to be unacceptably low.
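The ratio quoted above can be checked with a few lines of arithmetic. This is only a sketch using the figures on the slide; the constant and function names are illustrative and do not come from any D0 tool. Against the 8 MEvents/day end of the sustained range the observed level is 65% of expected; against 9 MEvents/day it comes out slightly under 60%, so the slide's "60-65%" is best read as a rounded range.

```python
# Figures quoted on the slide, in MEvents/day (million events per day).
MAX_CAPACITY = 10.0
ENDEMIC_INEFFICIENCY = 0.10          # "about 10%" per the slide
SUSTAINED_RANGE = (8.0, 9.0)         # realistic sustained level
OBSERVED = 5.2                       # observed sustained level


def fraction_of_expected(observed, expected):
    """Return observed throughput as a fraction of the expected level."""
    if expected <= 0:
        raise ValueError("expected must be positive")
    return observed / expected


# Upper end of the sustained range follows from capacity minus ~10% loss.
ceiling = MAX_CAPACITY * (1 - ENDEMIC_INEFFICIENCY)

ratios = [fraction_of_expected(OBSERVED, e) for e in SUSTAINED_RANGE]

print(f"capacity after endemic inefficiency: {ceiling:.1f} MEvents/day")
for expected, ratio in zip(SUSTAINED_RANGE, ratios):
    print(f"observed vs {expected:.0f} MEvents/day expected: {ratio:.0%}")
```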

5 Roadmap
September 2008: Planning – DONE
–Rob Kennedy, working with Adam Lyon, charged by Vicky White to lead the effort to pursue this.
–First stage is to list, understand, and prioritize the problems and the work in progress.
–Next, develop a broad coarse-grained plan to address issues and improve the efficiency.
October 2008 – December 2008: Phase 1 of the Initiative – DONE
–1.1. Server Expansion and Decoupling Data/MC Production at Services
–1.2. Condor 7.0 Upgrade and Support
–1.3. Small Quick Wins
–1.4. Metrics
–Follow-up on “newly exposed” issues as revealed: e.g. Installer products, Fcpd upgrade, restart script fix
January 2009: Formal Re-Assessment … with a long-term mindset
–Re-assess against metrics, downtime cause categorization, D0+CD staff-time in ops. Re-prioritize issues.
–Plan new work for the next “layer” of issues revealed. Ready to tackle MC Production-specific issues as well?
February 2009 – April 2009: Phase 2 … Finish long lead-time work + treat next “layer”.
–Some work for Data Production is constrained to execute in 2009, e.g. applying virtualization.

6 Phase 1 WBS
(Added since proposal. In progress, but delayed)
1.1. Add Servers, Decoupling of Data/MC Prod at Services
–Add 4th and 5th Forwarding Node, 2nd Queuing Node.
–Add new SAM Station and Context Server host.
–Thoroughly document and “productize” installation procedures.
–Configured to decouple Data & MC at Fwd, Que Services.
1.2. Condor 7 Upgrade (Grid Batch System layer)
–Major improvement to this Grid batch system.
–More predictable behavior and latencies
–By itself, this fixes some issues raised by Prod Coords.
–Now up-to-date: Ability to Leverage Condor Dev to Diagnose Problems (U.Wisc has offered)
1.3. Small Quick Wins (overlap with above tasks)
–SAM-Grid Job Status Information (development task)
    Add new “state” to reduce effort by Coordinators required to prevent data problems and superfluous jobs
    Feature ready, but deployment delayed by Condor 7 bug, fix actively being pursued. Getting schedule estimate now.
1.4. Metrics (used monitoring as available)
–Resource Utilization and nEvents/Day (see later slides)
Scheduled/tracked in MS Project (1.1, 1.2 shown)
–Successful pre-Thanksgiving deployment.

7 Metrics Relationships
[Diagram: metrics arranged by level. Levels shown: Top-most Customer (D0) View; Grid/Batch Level; Compute Level; Infrastructure Level. Metrics shown: Events Produced/Day (for given N job slots); Effort to Coordinate Production/Day (for given level of production); Job Slot Utilization; CPU Utilization; Job Processing Stability; Timely Input Data Delivery. Annotation: Before/After Plots on slides.]

8 Metric: Job Slot Utilization
Before: Job Slot Inefficiency
–Top: dips in green trace
–10% unused job slots in Sep ‘08... Part of the big problem.
–Some of this due to job queue going empty (blue trace hits bottom)
After: Steady so far!
–Bottom: negligible dips in green trace.
–One instance of empty queue. Treat next round.
Source Link
–See plot near bottom of page.
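As a rough illustration of how the two traces relate, the sketch below computes job slot utilization and counts empty-queue samples from periodic monitoring samples. The `Sample` record, its field names, and the four data points are made up for illustration; they are not taken from the actual monitoring source behind the plot.

```python
from typing import NamedTuple


class Sample(NamedTuple):
    """One hypothetical monitoring sample (field names are assumptions)."""
    busy_slots: int    # job slots running a job (the "green trace")
    total_slots: int   # job slots allotted to Data Production
    queued_jobs: int   # jobs waiting in the queue (the "blue trace")


def slot_utilization(samples):
    """Fraction of allotted job slots that were busy, summed over samples."""
    total = sum(s.total_slots for s in samples)
    if total == 0:
        raise ValueError("no job slots in samples")
    return sum(s.busy_slots for s in samples) / total


def empty_queue_count(samples):
    """Number of samples in which the job queue had gone empty."""
    return sum(1 for s in samples if s.queued_jobs == 0)


# Made-up data: one dip coincides with the queue running dry.
samples = [
    Sample(100, 100, 40),
    Sample(90, 100, 5),
    Sample(70, 100, 0),
    Sample(100, 100, 30),
]

print(f"utilization: {slot_utilization(samples):.0%}")
print(f"empty-queue samples: {empty_queue_count(samples)}")
```

Counting empty-queue samples separately helps distinguish dips caused by a shallow queue from dips caused by slow filling of open slots.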

9 Metric: CPU Utilization
Before: Occupancy & Load Dips
–Metric: convolutes Job Slot & CPU utilization. Looking for CPU inefficiency when load dips but job slot use does not
–Top: dips only in load (black trace)
–Example: recent file transfer daemon failure (fixed)
–Side effect of more stable system: Easier to see low-level issues AND debug larger issues. Less entanglement.
After: Occupancy & Load Steady
–Bottom: Steady Load (4am bump OK)
–Supports Job Slot Utilization Plot
–Now starting to identify
Source Link
Supporting Metric: Wall/CPU time ratio
–Wall clock / CPU time used by “d0farm” as reported by CAB accounting. Fell from over 90% to 84% in Nov.
–Since deployment: 95%!
[Plot annotations: This Week; fcpd fails, jobs wait for data; deployment]
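The supporting metric is labelled "Wall/CPU", but the quoted values (84%, 90%, 95%) only make sense as CPU time divided by wall-clock time. A minimal sketch under that assumption follows; the job records are invented, and the real CAB accounting format is not shown in this deck.

```python
def cpu_efficiency(jobs):
    """CPU time as a fraction of wall-clock time over a list of jobs.

    `jobs` is a list of (cpu_seconds, wall_seconds) pairs. Totals are
    summed before dividing so that long jobs are weighted by duration.
    """
    cpu = sum(c for c, _ in jobs)
    wall = sum(w for _, w in jobs)
    if wall <= 0:
        raise ValueError("total wall-clock time must be positive")
    return cpu / wall


# Invented accounting records: (cpu_seconds, wall_seconds) per job.
jobs = [(3420.0, 3600.0), (6840.0, 7200.0), (1710.0, 1800.0)]

print(f"CPU efficiency: {cpu_efficiency(jobs):.0%}")
```

A job that sits idle waiting for input data accumulates wall-clock time but no CPU time, which is why a file transfer daemon failure shows up as a drop in this ratio while job slots stay occupied.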

10 Metric: Events Processed/Day
Before: Wild swings, low average
–Top-level Metric, so exposed to many different potential issues
–Top: running average over 1 week
–Services load swings due to shared services with MC Production
–Little “downtimes”, esp. on weekends
–>10% “known” utilization inefficiency from job slot utilization
After: Too early to be sure… higher?
–Post-fcpd fix… to get clearer picture.
–9 days shown: 6.5 MEvts/day
–Need more statistics to be sure, but this is what was expected in Phase 1.
–Eventual goal = 8 MEvts/day with existing system (node count, etc) will mean addressing more subtle issues.
–Now: address low-level issues now becoming clearer with stable system.
Source Link
–“Unmerged” values used
[Plot annotations: Sep-Nov ‘08 Average 5.2 MEvts/day; Early Dec ‘08 Average 6.5 MEvts/day]
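For reference, a one-week running average and the before/after comparison can be sketched as below. The helper names are illustrative, and the only real data used are the three figures from the slide (5.2, 6.5, and the 8 MEvts/day goal). The plot's exact averaging method is not stated, so a trailing window is an assumption.

```python
from collections import deque


def running_average(values, window=7):
    """Trailing mean over the last `window` values.

    Output starts once the window is full, so the result has
    len(values) - window + 1 entries (or none if there are too few values).
    """
    if window <= 0:
        raise ValueError("window must be positive")
    buf = deque(maxlen=window)
    averages = []
    for value in values:
        buf.append(value)
        if len(buf) == window:
            averages.append(sum(buf) / window)
    return averages


def relative_change(before, after):
    """Fractional change going from `before` to `after`."""
    if before == 0:
        raise ValueError("before must be non-zero")
    return (after - before) / before


# Figures from the slide, in MEvts/day.
BEFORE, AFTER, GOAL = 5.2, 6.5, 8.0

print(f"Phase 1 gain over Sep-Nov average: {relative_change(BEFORE, AFTER):.0%}")
print(f"further gain needed to reach goal: {relative_change(AFTER, GOAL):.0%}")
```

With only 9 days shown after deployment, a 7-day trailing window yields just three averaged points, which is consistent with the slide's caution that more statistics are needed.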

11 Next Steps, Conclusion
MC Production: D0 sets OSG resource efficiency record (Abhishek Singh Rana, 12/4/2008)
–Joel Snow reports this is at least in part due to collateral improvements in MC Production from Initiative
–Separate work underway to get LCG side of MC Production up to capability level.
Next Steps
–Observe system over the December holidays. Identify next layer of work, associated cost/benefits.
–Tweak the configuration for deeper job queues after we have set this baseline.
–Work on metrics: more complete and/or direct measure of system capability, health, and demand
–Summarize existing support/estimates in a short-term SLA … to bootstrap Service Management
Conclusion: Phase 1 has succeeded, on time before holidays, but more work to be done
–The D0 Grid Data Production is certainly more stable than before.
–Issues (fcpd) are easier to address once and for all when they do arise.
–Too early to be sure of improvement in top-level metrics. More to do to reach goals anyway.
Further Steps towards Maturing the D0 Grid Production System as a Service
–More Robust, Capable, and Manageable System requiring less effort to use.
–Enable Service Management Functions: Capacity Planning, Managed Growth.

12 D0 Grid Production (simplified to show tools in stack)
Experiment Front-End Tools
–D0Repro-tools, automc
–Challenge: Error Management
SamGrid = SAM + Grid
–Jim_Client interface (jobs)
–Grid job queue interacting with brokers with site info
–Forwarding Node: Job held and then sent to execution site
–OSG Job Queuing System interacting with OSG broker
–Then, job to local batch system
–Durable Location for file space
D0Runjob
–Runs on Worker node to set up environment for Application
–Manages workflow, esp. where multiple Apps run in chain
D0 Application
–Must be runnable with RTE
–Inputs must be delivered via parameters passed down stack
[Diagram labels: User; Front-End Tools: D0Repro-tools .or. automc; SamGrid; D0 Runjob; D0 Application(s); Worker Node; Job File(s); SAM; jim_client; Grid Job Queue; Forwarding Node; OSG Job Queue; Local Batch System; OSG Site; Uses Condor]

