
Slide 1: D0 Grid Data Production Initiative: Phase 1 to Phase 2 Transition
Version 1.3 (v1.0 presented to D0 Spokes, CD Mgmt: 06 February 2009)
Presented to D0 CPB: 27 February 2009
Rob Kennedy and Adam Lyon

Slide 2: Outline
Background (historical reference)
 – Overview, Major Issues, Roadmap
Phase 1 Summary
 – Work Done and Outcome (as seen with more experience)
Assessment
 – Capacity Model, cpu/event = f(L)
Phase 2 Work Plan
 – Work List Outline
 – Capacity Timeline… skeleton draft

Slide 3: Initiative Overview (Sep 2008 presentation with updates in Green)
Initiative is an Umbrella Project to achieve a broad set of goals
Scope: D0 Grid Data Production (taking MC Production into consideration)
Charge
 – Evaluate D0 Grid Data Production, especially Resource Utilization, by end Sep ’08 – DONE
 – Create and execute a Work Plan to achieve the goal… – Phase 1 DONE
 – Goal: Stable Grid Data Production operations that efficiently utilize the resources available. – DONE for conditions at the beginning of the Initiative. Phase 2 to address evolving conditions.
 – Constraints: Achieve improvements ASARP. No explicit end date or staff level limits set.
Initiative Team (Execution Phase): October 2008 → present
 – Project Manager: Rob Kennedy – CD OPMQA
 – Project Co-Manager: Adam Lyon – D0 Collab and CD SCF/REX/PS Group Leader
 – Communication with a broad set of stakeholders: weekly meeting Thursdays at 9am
   D0 Production Coordinators – Mike Diesburg, Joel Snow; D0 Collaborators – Chip Brock, Qizhong Li
   CD: FermiGrid Svcs (Steve Timm, Keith Chadwick), SAM-Grid Dev (Gabriele Garzoglio, Parag Mhashilkar, Andrew Baranovski), REX (Robert Illingworth, Joe Boyd), SCF Mgmt (Margaret Votava, Eileen Berman), Fermi Exp’t Facilities (Jason Allen, Glenn Cooper)
   OSG: Abhishek Singh Rana
 – Documentation Home: http://d0db-prd.fnal.gov/rexipedia/common/SAMGridD0/GDPEval

Slide 4: Major Issues (Sep 2008 presentation)
Resource Utilization is lower than expected for a production system (the motivating concern).
 – CPUs allotted to Data Production are not kept busy, even though jobs and data are available to be run.
 – Causes: shallow queues must be refilled often. Something is leading to slow filling of open CPU slots.
D0 Grid System Uptime and First-Time Success Rates are lower than expected for production.
 – Leads to re-running of jobs and/or manual checking of job records to determine success/failure.
 – Causes: Grid Batch System bugs (some known to be fixed in Condor 7), Context Event Server failures, …
D0 Grid System requires too much effort for the customer (D0 Production Coordinator) to use.
 – Hours per day spent looking at failed jobs or determining whether jobs failed. 1-2 touches per day to keep queues full (w/scripts).
The sum of the above significantly reduces the MEvents/day that D0 actually reconstructs.
 – Mike Diesburg estimates (Sep 2008), confirmed by the historical record, BEFORE the Initiative:
   Max capacity of the current system = 10 MEvents/day (million events per day). Realistic sustained level = 8-9 MEvents/day. We expect about 10% endemic inefficiency due to issues “not worth our fixing” like internal latencies, facility power outages, hardware failure recovery.
   Observed sustained level = 5.2 MEvents/day… 60-65% of the expected value.
   Absolute numbers are not the focus as yet; rather, the ratio is agreed by all to be unacceptably low.

Slide 5: Roadmap (Sep 2008 presentation with updates in Green)
September 2008: Planning – DONE
 – Rob Kennedy, working with Adam Lyon, charged by Vicky White to lead the effort to pursue this.
 – First stage is to list, understand, and prioritize the problems and the work in progress.
 – Next, develop a broad coarse-grained plan to address issues and improve the efficiency.
October 2008 – December 2008: Phase 1 of the Initiative – DONE
 – 1.1. Server Expansion and Decoupling Data/MC Production at Services
 – 1.2. Condor 7.0 Upgrade and Support
 – 1.3. Small Quick Wins
 – 1.4. Metrics
 – Follow-up on “newly exposed” issues as revealed: e.g. installer products, Fcpd upgrade, restart script fix
January 2009: Formal Re-Assessment … with a long-term mindset – DONE
 – Re-assess against metrics, downtime cause categorization, D0+CD staff-time in ops. Re-prioritize issues. Capacity Management determined to be the primary theme for Phase 2 work.
 – Plan new work for the next “layer” of issues revealed. Ready to tackle MC Production-specific issues as well?
February 2009 – April 2009: Phase 2 … Finish long lead-time work + treat the next “layer”.
 – Some work for Data Production is constrained to execute in 2009, e.g. applying virtualization.

Slide 6: Phase 1 Summary
Work Done: Add Servers, Decoupling of Data/MC Prod at Services, Condor 7 Upgrade (Grid Batch System layer)
 – Add 4th and 5th Forwarding Node, 2nd Queuing Node. Add new SAM Station and Context Server host. Document, “productize” installation procedures. Configured to decouple Data & MC at Fwd, Que Services.
 – Condor 7 is a major improvement! Several major issues fixed. More predictable behavior and latencies.
Outcome: Successful Pre-Thanksgiving Deployment
 – Mike D.: the Dec/Jan holidays were one of the least eventful periods ever.
 – Smooth enough now: have begun testing hand-off of day-to-day coordination, with Mike D. oversight.
Numerous Operations issues resolved. Resource Utilization improved, reached goal
 – Periodic Expressions “1/day hang” cured. No more “Death Spirals” leading to downtimes.
 – Job Slot Utilization and CPU-time/Wall-time > 95% (in smooth operation). Confirmed over time: Success! (See the sketch after this slide.)
 – January 2009: some “next layer” issues seen.
“Events Processed per Day” not really improved
 – Increase in Tevatron Luminosity suspected... Confirmed.
 – CPU-time per Job increasing rapidly... Confirmed.
 – We have seen a ~2X increase Oct ’08 to Dec ’08/Jan ’09!
 – The 8E6 events/day goal was appropriate for lower luminosity.
 – Note: ~1 month delay from data logging to production.
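As an illustration of the two utilization metrics quoted above, here is a minimal sketch in Python. The record format, the sample job values, and the use of 1800 slots (the farm size quoted later in the deck) are assumptions for illustration, not output of the D0 accounting system.

```python
# Minimal sketch of the two Phase 1 utilization metrics, computed from
# hypothetical per-job accounting records. Field names and sample values
# are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class JobRecord:
    wall_sec: float   # wall-clock time the job occupied a slot
    cpu_sec: float    # CPU time actually consumed by the job

def cpu_over_wall(jobs):
    """CPU-time / Wall-time across all jobs (target: > 95% in smooth running)."""
    wall = sum(j.wall_sec for j in jobs)
    return sum(j.cpu_sec for j in jobs) / wall if wall else 0.0

def slot_utilization(jobs, n_slots, period_sec):
    """Fraction of available slot-time occupied by jobs over the period."""
    return sum(j.wall_sec for j in jobs) / (n_slots * period_sec)

if __name__ == "__main__":
    # Hypothetical day of running: 7000 jobs of ~6 h wall time each.
    jobs = [JobRecord(wall_sec=6 * 3600, cpu_sec=5.8 * 3600) for _ in range(7000)]
    day = 24 * 3600
    print(f"CPU-time/Wall-time  : {cpu_over_wall(jobs):.1%}")
    print(f"Job slot utilization: {slot_utilization(jobs, n_slots=1800, period_sec=day):.1%}")
```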

Slide 7: Assessment (January 2009)
Main focus: understanding “Events Produced per Day”
 – Calculate the expected production rate from the existing system:
   cpu/event with current Reco version = f(L)
   CPU “power” in the Data Production queue
   Luminosity increase in the Tevatron is the major driver of reduced production output
 – Consider the environment as well:
   Recent shutdown led to detector fixes. More good data per event = more CPU/event (small effect)
   Modest increase in CPU/event in the new Reco version at higher luminosity (small effect)
   Check CPU overheads (setting up, starting Reco) vs. Reco CPU consumption (small-ish effect)
 – Observe and compare system performance during smooth multi-day periods
Develop a Phase 2 Work Plan
 – Observation: Data Production is falling behind Data Logging now. This is our top priority to address: understand what CAN be done and report to D0 for their planning.
 – Capacity increase options being explored, as well as impact on infrastructure, configuration
 – Model development continues to ensure no hidden inefficiencies at the 10% level.
 – Consensus: last effort to reduce cpu consumption by D0 Reco → “no room” for improvement

Slide 8: Plots: Efficiency, cpu/evt = f(L)
This is the text for the next three plot slides…
Are there hidden inefficiencies? PBS Job Efficiency (CPU Use) – from Mike D.
 – Time base is the date the data was processed, not the date the data was recorded.
 – Job Efficiency = Run-time / (Run-time + Overheads)
 – After the Phase 1 Deployment, the metric is at ~95%... Very good!
 – Does not take into account the following:
   Jobs that started, had data, but failed (~1% effect) … Nodes which are down (~1% effect)
   Merge jobs included in this (~2% effect) … Jobs that do not really start due to data delivery failure (~1% effect)
   Overall Duty Cycle (~95%) to account for planned/unplanned downtimes
 – For long-term planning: use 85-90% CPU efficiency (available CPU cycles that are used on Reco)… still, very good.
Execution Time = f(L) – from Mike D.
 – This is for the current version of Reco (previously was for the old version). Some increase in CPU used, perhaps, at higher L.
 – Also, detector improvements after the shutdown → more good data/event, more combinatorics → more CPU/event.
 – GOOD FOR PHYSICS! … but a challenge for the Reconstruction Farm.
Average Initial Luminosity – from Mike D.
 – We appear to be around L = 165 E30 nowadays. Combining this with the Execution Time: about 60 cpu-sec/event, which gives…
 – 6 MEvents/day theoretically, and over the same time period, 5.1 MEvents/day observed under the same conditions.
 – Given the width and uncertainty of the measurements above, we cannot say these two numbers are different.
 – We can use a linear fit to the Luminosity plot as input to capacity planning for the near future… with max initial store L at 300 E30.
 – Consider +20% capacity as contingency for the theory-observed difference, allow headroom for special processing, etc.
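The capacity arithmetic above can be captured in a small model. Below is a minimal sketch: the ~60 cpu-sec/event at L ≈ 165E30, the 85-90% planning efficiency, and the +20% contingency are taken from this slide, while the cost-model slope and the total normalized CPU-seconds per day delivered by the Data Production queue are placeholders to be read off the Execution Time and luminosity plots.

```python
# Sketch of the capacity model: events/day = usable CPU-seconds per day
# divided by cpu-sec/event at the current luminosity.
# Values marked PLACEHOLDER are assumptions, not numbers read off the plots.

def cpu_sec_per_event(lumi_e30, ref_lumi_e30=165.0, ref_sec=60.0, slope=0.4):
    """Linear cost model for Reco vs. initial luminosity.
    The reference point (~60 cpu-sec/event at L ~ 165E30) is from the slides;
    the slope (cpu-sec/event per E30) is a PLACEHOLDER for the f(L) fit."""
    return ref_sec + slope * (lumi_e30 - ref_lumi_e30)

def events_per_day(total_cpu_sec_per_day, lumi_e30, cpu_efficiency=0.875):
    """Theoretical throughput, derated by the 85-90% planning CPU efficiency."""
    return total_cpu_sec_per_day * cpu_efficiency / cpu_sec_per_event(lumi_e30)

if __name__ == "__main__":
    # PLACEHOLDER: total normalized CPU-seconds/day delivered by the Data
    # Production queue (depends on node speeds, so not simply slots * 86400).
    queue_cpu_sec_per_day = 4.0e8

    for lumi in (165.0, 300.0):   # today, and the max initial store L considered
        rate = events_per_day(queue_cpu_sec_per_day, lumi)
        print(f"L = {lumi:3.0f}E30: ~{rate / 1e6:.1f} MEvents/day theoretical")

    # Planning note from the slide: add +20% capacity as contingency on top of
    # whatever CPU this model says is needed for the target keep-up rate.
```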

Slide 9: PBS Job Efficiency (CPU Use)
[Plot. Annotations: “Smooth Operation: Today >95%”; “Ops Issue”; “No major downtimes after Phase 1 Deploy”.]

Slide 10: Execution Time = f(L)
(“initial luminosity” at the beginning of the run, not of the store)
[Plot. Annotations: Past: ~30 sec/evt. Now: ~60 sec/evt. Eventually: ~120 sec/evt??? (watch, have a plan in place).]

Slide 11: Average Initial Luminosity
(“initial luminosity” at the beginning of the run, not of the store)
[Plot. Annotations: Now: ~60 sec/evt. Long-term bracket? ?]

Slide 12: Phase 2 Work List Outline
2.1 Capacity Management: Data Prod is not keeping up with data logging.
 – Capacity Planning: model nEvents per Day – forecast CPU needed (see the sketch after this slide)
 – Capacity Deployment: procure, acquire, borrow CPU. We believe the infrastructure is capable.
 – Resource Utilization: use what we have as much as possible. Maintain improvements.
2.2 Availability & Continuity Management: the expanded system needs higher reliability.
 – Decoupling: deferred. Phase 1 work has proven sufficient for the near term.
 – Stability, Reduced Effort: deeper queues. Goal is fewer manual submissions per week.
 – Resilience: add/improve redundancy at the infrastructure service and CAB level.
 – Configuration Recovery: capture configuration and artefacts in CVS consistently.
2.3 Operations-Driven Projects
 – Monitoring: execute a workshop to share what we have, identify gaps and cost/benefits.
 – Issues: address the “stuck state” issue affecting both Data and MC Production.
 – Features: add state at the queuing node (from Phase 1). Distribute jobs “evenly” across FWD nodes.
 – Processes: enable REX/Ops to deploy new Condor… new bug fixes coming soon.
 – Phase 1 Follow-up: a few minor tasks remain from the rush to deploy… dot the i’s and cross the t’s.
Deferred Work List: maintain with reasons for deferring work.
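For the Capacity Planning item, the cost model sketched earlier can be inverted to forecast how much CPU a keep-up system would need. This is a minimal sketch under the same assumptions: the keep-up target (events logged per day) and the cost-model slope are placeholder values, not numbers from the deck; only the 85-90% efficiency and +20% contingency come from the assessment slides.

```python
# Sketch: invert the capacity model to forecast the CPU needed for "keep-up".
# The keep-up target and the cost-model slope are PLACEHOLDER assumptions.

def cpu_sec_per_event(lumi_e30, ref_lumi_e30=165.0, ref_sec=60.0, slope=0.4):
    # Same placeholder linear cost model as in the earlier sketch.
    return ref_sec + slope * (lumi_e30 - ref_lumi_e30)

def cpu_needed_per_day(events_logged_per_day, lumi_e30,
                       cpu_efficiency=0.875, contingency=1.20):
    """Normalized CPU-seconds/day to provision, applying the 85-90% planning
    efficiency and the +20% contingency from the assessment slides."""
    raw = events_logged_per_day * cpu_sec_per_event(lumi_e30) / cpu_efficiency
    return contingency * raw

if __name__ == "__main__":
    logged_per_day = 7.0e6           # PLACEHOLDER keep-up target (events/day)
    for lumi in (165.0, 230.0, 300.0):
        need = cpu_needed_per_day(logged_per_day, lumi)
        print(f"L = {lumi:3.0f}E30: provision ~{need:.2e} CPU-sec/day")
```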

Slide 13: Data Flow for Data Production
[Diagram of data flow among: Enstore LTO4-G and LTO4-F; SAM Cache (d0srv071, d0srv072, shared with analysis users); Durable Store and Stager Space (d0srv063, d0srv065); and Worker Node scratch space. The numbered flows (0-7) cover raw data, unmerged TMB, and merged TMB tarballs, plus IN2P3 remote uploads and other data destined for tape storage; transfers are initiated by Reco jobs and by Merge jobs via gridftp. Notes on the diagram: no automated failover between ’63 and ’65; Durable Storage and Stager Space are on separate partitions; the cache nodes also hold the 0-bias skim and LCG cache.]

Slide 14: Capacity Timeline – Working Draft
March → April 2009: Keep-Up Level + Work-through-Backlog Level
 – Added 115 old, slow, retired CDF worker nodes. D0Farm from ~1600 to 1814 slots (as of 26 Feb 2009).
 – Upgrade PBS head nodes (FEF) during the March 10 downtime. Last infrastructure improvement needed.
 – All CAB2 analysis nodes for use by Data Production: March 10 → May 1 (or any other end condition met).
   Work through the 178 MEvt backlog (less 1 week). A backlog has been there, BUT NOW we can REALLY do something about it (see the drain-time sketch after this slide).
   Scale up in steps quickly to be sure the infrastructure can handle the load and to avoid waste of graciously allocated resources.
 – Exploit more opportunistic use of “Other VO” cpu during this same time period.
 – Purchase Req out in late March for more CPUs… will be in service towards the end of summer.
 – (End of April = End of Initiative. Task mgmt passes to existing CD groups. Close-out process in May 2009.)
May → July 2009: not Keep-Up Level. GAP TO BE FILLED.
 – Downsize the system as analysis CPU is returned and less opportunistic CPU is available.
 – May develop a backlog again, but too late anyway to fully process for summer conferences.
 – New CPU may arrive in July, but will have to be burned in, infrastructure tested, etc.
 – Purchase Req out in summer for more infrastructure servers (if need proven).
August → December 2009: Keep-Up Level (+ headroom?)
 – Add CPU and infrastructure (from procurement) to support a long-term “keep-up” system.
 – Make up the backlog from May through June 2009 for winter conferences.
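The work-through-backlog window can be sanity-checked with simple drain-time arithmetic. A minimal sketch follows: the 178 MEvt backlog and the 1800 + 1400 slot counts are from the slides, while the per-slot processing rate and the data-logging rate are placeholder assumptions.

```python
# Sketch: how long the temporarily expanded farm takes to drain the backlog.
# Backlog and slot counts are from the slides; both rates are PLACEHOLDERS.

def days_to_drain(backlog_events, processing_per_day, logging_per_day):
    """Days until the backlog is gone, assuming constant rates.
    Returns None if the farm is not even keeping up with logging."""
    net = processing_per_day - logging_per_day
    return backlog_events / net if net > 0 else None

if __name__ == "__main__":
    backlog = 178e6                          # 178 MEvt backlog quoted in the slides
    base_slots, loaned_slots = 1800, 1400    # d0 farm + temporary CAB2 slots

    per_slot_per_day = 3500.0                # PLACEHOLDER: events per slot per day
    logging_per_day = 7.0e6                  # PLACEHOLDER: Tevatron logging rate

    processing = (base_slots + loaned_slots) * per_slot_per_day
    print(f"Expanded farm: ~{processing / 1e6:.1f} MEvents/day processed")

    d = days_to_drain(backlog, processing, logging_per_day)
    if d is not None:
        print(f"Backlog cleared in about {d:.0f} days")
    else:
        print("Backlog still growing at these rates")
```

With these placeholder rates the backlog clears in roughly six weeks, comparable to the March 10 → May 1 window; the real check is to re-run the arithmetic with measured rates once the expanded farm is in service.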

Slide 15: CAB2 Temp Expanded Use
Early-March → April 2009: Keep-Up Level + Work-through-Backlog Level
 – Temp Expanded CAB2 Use by Data Production, 2/20/2009 via email:
   Regarding temporarily using the whole CAB2 for the production, D0 management has made a decision that from March 10, we will temporarily expand the d0 farm queue to be the whole CAB2. The purpose is to catch up the backlog in data production for the summer conference. This configuration is temporary. We will change it back to the current configuration when one of the following conditions happens:
   - when the backlog has been reduced to less than one week of data; or
   - May 1, 2009; or
   - when there is an analysis need for more CPUs than CAB1 can provide.
   Although the configuration change will be done by FEF (thanks to FEF!), the SamGrid team may need to plan to adjust related parameters to handle a much larger production farm. The current d0 farm queue has 1800 job slots. The new d0 farm queue will have 1800+1400 job slots, temporarily.
   Thank you, Qizhong

Slide 16: Next Steps, Conclusion
Conclusion: Phase 1 succeeded. Accommodate Tevatron success in Phase 2.
 – D0 Grid Data Production is certainly more stable than before. The improvement in resource utilization metrics appears genuine. The next layer of operations issues is addressable… can improve even further.
Next Steps
 – Phase 2: develop and implement a viable short-term and draft long-term Capacity Plan,
   and do so without losing the gains in stability and resource utilization achieved so far.
 – Work through the event backlog with loaned CAB2 slots.
 – Continue work on stability, resilience, optimal decoupled configuration, monitoring.
 – Take care though: service “scale-ups” like this have revealed new weaknesses and behaviors.
Further Steps towards Maturing the D0 Grid Production System as a Service
 – A more robust, capable, and manageable system requiring less effort to use.
 – Enable Service Management functions: Capacity Planning, Managed Growth.
 – Capacity Management can sensibly lead to a more formal statement of service levels.

