D0 Grid Data Production Initiative: Phase 1 Status
Version 1.0, 12 December 2008
Rob Kennedy and Adam Lyon



D0 Grid Data Production Initiative 2: Outline
–Initiative Overview
–Major Issues
–Roadmap and Phase 1 WBS
–Before and After Metrics
–Next Steps, Conclusion

D0 Grid Data Production Initiative 3: Initiative Overview
The Initiative is an umbrella project to achieve a broad set of goals.
Scope: D0 Grid Data Production (taking MC Production into consideration).
Charge:
–Evaluate D0 Grid Data Production, especially resource utilization, by end Sep ’08 – DONE
–Create and execute a work plan to achieve the goal – Phase 1 DONE
–Goal: stable Grid Data Production operations that efficiently utilize the available resources.
–Constraints: achieve improvements ASARP. No explicit end date or staff-level limits set.
Initiative Team (Execution Phase): October 2008 to present
–Project Manager: Rob Kennedy – CD OPMQA
–Project Co-Manager: Adam Lyon – D0 Collab and CD SCF/REX/Ops Group Leader
–Communication with a broad set of stakeholders: weekly meeting, Thursdays at 9am
D0 Production Coordinators – Mike Diesburg, Joel Snow; D0 Collaborators – Chip Brock, Qizhong Li
CD: FermiGrid Svcs (Steve Timm, Keith Chadwick), SAM-Grid Dev (Gabriele Garzoglio, Parag Mhashilkar, Andrew Baranovski), REX Ops (Robert Illingworth, Joe Boyd), SCF Mgmt (Margaret Votava, Eileen Berman), Fermi Exp’t Facilities (Jason Allen, Glenn Cooper); OSG: Abhishek Singh Rana
–Documentation Home: /common/SAMGridD0/GDPEval

D0 Grid Data Production Initiative 4: Major Issues
Resource utilization is lower than expected for a production system (the motivating concern).
–CPUs allotted to Data Production are not kept busy, even though jobs and data are available to be run.
–Causes: shallow queues must be refilled often; grid system latencies lead to slow filling of open CPU slots.
D0 Grid System uptime and first-time success rates are lower than expected for production.
–Leads to re-running of jobs and/or manual checking of job records to determine success/failure.
–Causes: Grid batch system bugs (some known to be fixed in Condor 7), Context Event Server failures, …
The D0 Grid System requires too much effort for the customer (D0 Production Coordinator) to use.
–Hours per day spent looking at failed jobs, or checking whether jobs failed. 1-2 touches per day to keep queues full (with scripts).
The sum of the above significantly reduces the MEvents/day that D0 actually reconstructs.
–Mike Diesburg estimates, confirmed by the historical record, BEFORE the Initiative:
Max capacity of current system = 10 MEvents/day (million events per day). Realistic sustained level = 8-9 MEvents/day, since we expect about 10% endemic inefficiency due to issues “not worth our fixing” such as internal latencies, facility power outages, and hardware failure recovery.
Observed sustained level = 5.2 MEvents/day, roughly 60% of the expected value.
Absolute numbers are not the focus as yet; rather, the ratio is agreed by all to be unacceptably low.
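The utilization gap can be sanity-checked with back-of-envelope arithmetic using only the figures quoted on this slide (the script below is purely illustrative; every number comes from the slide):

```python
# Figures quoted on this slide (MEvents/day = million events per day).
max_capacity = 10.0      # max capacity of the current system
endemic_loss = 0.10      # ~10% inefficiency "not worth our fixing"
observed = 5.2           # observed sustained level before the Initiative

# Realistic sustained expectation after endemic losses.
expected = max_capacity * (1 - endemic_loss)   # 9.0 MEvents/day

ratio = observed / expected
print(f"expected sustained: {expected:.1f} MEvents/day")
print(f"observed/expected = {ratio:.0%}")   # ~58%
```

Taking the lower end of the 8-9 MEvents/day range instead of the 9.0 used here moves the ratio to the low 60s in percent, which is why the slide's "roughly 60%" is a fair summary.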

D0 Grid Data Production Initiative 5: Roadmap
September 2008: Planning – DONE
–Rob Kennedy, working with Adam Lyon, was charged by Vicky White to lead the effort.
–First stage: list, understand, and prioritize the problems and the work in progress.
–Next: develop a broad, coarse-grained plan to address issues and improve efficiency.
October 2008 – December 2008: Phase 1 of the Initiative – DONE
–1.1. Server Expansion and Decoupling Data/MC Production at Services
–1.2. Condor 7.0 Upgrade and Support
–1.3. Small Quick Wins
–1.4. Metrics
–Follow-up on “newly exposed” issues as revealed: e.g. installer products, fcpd upgrade, restart script fix.
January 2009: Formal Re-Assessment … with a long-term mindset
–Re-assess against metrics, downtime cause categorization, and D0+CD staff time in operations. Re-prioritize issues.
–Plan new work for the next “layer” of issues revealed. Ready to tackle MC Production-specific issues as well?
February 2009 – April 2009: Phase 2 … finish long lead-time work and treat the next “layer”.
–Some work for Data Production is constrained to execute in 2009, e.g. applying virtualization.

D0 Grid Data Production Initiative 6: Phase 1 WBS
1.1. Add Servers, Decouple Data/MC Production at Services
–Add 4th and 5th Forwarding Nodes and a 2nd Queuing Node.
–Add a new SAM Station and Context Server host.
–Thoroughly document and “productize” installation procedures.
–Configured to decouple Data and MC at Forwarding and Queuing services.
1.2. Condor 7 Upgrade (Grid batch system layer)
–Major improvement to this Grid batch system: more predictable behavior and latencies.
–By itself, this fixes some issues raised by Production Coordinators.
–Now up-to-date: able to leverage Condor developers to diagnose problems (U. Wisconsin has offered).
1.3. Small Quick Wins (overlap with above tasks)
–SAM-Grid Job Status Information (development task; added since proposal; in progress, but delayed): add a new job “state” to reduce the effort required of Coordinators to prevent data problems and superfluous jobs.
–Feature ready, but deployment delayed by a Condor 7 bug; a fix is actively being pursued. Getting a schedule estimate now.
1.4. Metrics (used monitoring as available)
–Resource Utilization and nEvents/Day (see later slides).
Scheduled/tracked in MS Project (1.1, 1.2 shown). Successful pre-Thanksgiving deployment.

D0 Grid Data Production Initiative 7: Metrics Relationships (Before/After plots on following slides)
[Diagram: metrics by level]
–Top-most Customer (D0) View: Events Produced/Day (for given N job slots); Effort to Coordinate Production/Day (for given level of production)
–Grid/Batch Level: Job Slot Utilization
–Compute Level: CPU Utilization
–Infrastructure Level: Job Processing Stability; Timely Input Data Delivery

D0 Grid Data Production Initiative 8: Metric: Job Slot Utilization
Before: job slot inefficiency
–Top plot: dips in the green trace.
–~10% unused job slots in Sep ’08 … part of the big problem.
–Some of this is due to the job queue going empty (blue trace hits bottom).
After: steady so far!
–Bottom plot: negligible dips in the green trace.
–One instance of an empty queue. Treat in the next round.
Source link: see plot near bottom of page.
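The diagnosis above rests on correlating two traces: idle job slots coinciding with an empty queue indicate the shallow-queue problem rather than a batch-system fault. A minimal sketch of that check (the sample tuples and function name are invented; real values come from the D0 monitoring plots):

```python
# Each sample: (slots_total, slots_busy, jobs_queued), e.g. hourly readings.
samples = [
    (2000, 2000, 150),   # healthy: all slots busy, queue has depth
    (2000, 1750, 0),     # problem: idle slots AND an empty queue
    (2000, 1980, 40),
]

def idle_due_to_empty_queue(samples):
    """Flag samples where slots sit idle while no jobs are queued,
    i.e. the shallow-queue problem described on this slide."""
    return [i for i, (total, busy, queued) in enumerate(samples)
            if busy < total and queued == 0]

# Overall slot utilization across the sampled period.
utilization = sum(b for _, b, _ in samples) / sum(t for t, _, _ in samples)
print(f"slot utilization: {utilization:.1%}")
print("shallow-queue samples:", idle_due_to_empty_queue(samples))
```

Deepening the queues (a stated Phase 2 candidate) attacks exactly the samples this check flags.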

D0 Grid Data Production Initiative 9: Metric: CPU Utilization
Before: occupancy and load dips
–This metric convolutes job slot and CPU utilization; we look for CPU inefficiency when load dips but job slot use does not.
–Top plot: dips only in load (black trace).
–Example: recent file transfer daemon (fcpd) failure (fixed).
–Side effect of a more stable system: easier to see low-level issues AND debug larger issues. Less entanglement.
After: occupancy and load steady
–Bottom plot: steady load (the 4am bump is OK).
–Supports the Job Slot Utilization plot.
–Now starting to identify remaining low-level issues.
Source link.
Supporting metric: wall/CPU time ratio
–Wall clock / CPU time used by “d0farm” as reported by CAB accounting. Fell from over 90% to 84% in Nov. Since deployment: 95%!
[Plot annotations: “This Week”; “fcpd fails, jobs wait for data”; “deployment”]
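The supporting metric is just CPU time over wall-clock time summed across the d0farm jobs; a sketch assuming hypothetical accounting records in the style described (the record layout and field names are invented, not the CAB format):

```python
# Hypothetical accounting records for "d0farm" jobs; field names invented.
records = [
    {"user": "d0farm", "wall_s": 36000, "cpu_s": 34200},
    {"user": "d0farm", "wall_s": 28800, "cpu_s": 27360},
]

wall = sum(r["wall_s"] for r in records if r["user"] == "d0farm")
cpu = sum(r["cpu_s"] for r in records if r["user"] == "d0farm")

# A low ratio means slots spend wall time waiting (e.g. on data delivery,
# as during the fcpd failure) rather than computing.
print(f"CPU/wall ratio = {cpu / wall:.0%}")   # 95%
```

This is why the ratio recovering to 95% after deployment supports the occupancy plots: busy slots that are also burning CPU are doing real reconstruction work.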

D0 Grid Data Production Initiative 10: Metric: Events Processed/Day
Before: wild swings, low average
–Top-level metric, so exposed to many different potential issues.
–Top plot: running average over 1 week.
–Service load swings due to services shared with MC Production.
–Little “downtimes”, especially on weekends.
–>10% “known” utilization inefficiency from job slot utilization.
After: too early to be sure … higher?
–Post-fcpd fix … to get a clearer picture.
–9 days shown: 6.5 MEvts/day. Need more statistics to be sure, but this is what was expected in Phase 1.
–Eventual goal = 8 MEvts/day with the existing system (node count, etc.), which will mean addressing more subtle issues.
–Now: address the low-level issues becoming clearer with a stable system.
Source link: “unmerged” values used.
[Plot annotations: Sep–Nov ’08 average 5.2 MEvts/day; early Dec ’08 average 6.5 MEvts/day]
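The top plot described above is a one-week running average of daily event totals, which smooths out weekend dips while still exposing sustained trends. A minimal sketch with invented daily values (the real series comes from D0 production records):

```python
# Invented daily totals (MEvents reconstructed per day).
daily = [4.8, 5.5, 6.9, 6.2, 7.0, 6.4, 6.7, 6.8, 7.1]

WINDOW = 7  # one-week running average, as on the slide
running = [sum(daily[i - WINDOW + 1 : i + 1]) / WINDOW
           for i in range(WINDOW - 1, len(daily))]
print([round(v, 2) for v in running])  # [6.21, 6.5, 6.73]
```

With only 9 days of post-fix data, the averaged series has just a few points, which is exactly why the slide cautions that more statistics are needed before claiming the improvement is real.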

D0 Grid Data Production Initiative 11: Next Steps, Conclusion
MC Production: D0 sets OSG resource efficiency record (Abhishek Singh Rana, 12/4/2008)
–Joel Snow reports this is at least in part due to collateral improvements in MC Production from the Initiative.
–Separate work is underway to bring the LCG side of MC Production up to capability level.
Next steps
–Observe the system over the December holidays. Identify the next layer of work and associated costs/benefits.
–Tweak the configuration for deeper job queues after this baseline is set.
–Work on metrics: more complete and/or direct measures of system capability, health, and demand.
–Summarize existing support/estimates in a short-term SLA … to bootstrap Service Management.
Conclusion: Phase 1 has succeeded, on time before the holidays, but more work remains
–D0 Grid Data Production is certainly more stable than before.
–Issues (e.g. fcpd) are easier to address once and for all when they do arise.
–It is too early to be sure of improvement in the top-level metrics; more remains to be done to reach the goals in any case.
Further steps toward maturing the D0 Grid Production System as a service
–A more robust, capable, and manageable system requiring less effort to use.
–Enable Service Management functions: capacity planning, managed growth.

D0 Grid Data Production Initiative 12: D0 Grid Production (simplified to show tools in the stack)
Experiment Front-End Tools
–D0Repro-tools, automc
–Challenge: error management
SamGrid = SAM + Grid
–jim_client interface (jobs)
–Grid job queue interacting with brokers holding site info
–Forwarding node: job held and then sent to the execution site
–OSG job queuing system interacting with the OSG broker; then the job goes to the local batch system
–Durable location for file space
D0Runjob
–Runs on the worker node to set up the environment for the application.
–Manages workflow, especially where multiple applications run in a chain.
D0 Application
–Must be runnable with the RTE.
–Inputs must be delivered via parameters passed down the stack.
[Stack diagram: User Front-End Tools (D0Repro-tools or automc) → SamGrid (SAM, jim_client, Grid Job Queue, Forwarding Node, OSG Job Queue, Local Batch System; uses Condor) → D0Runjob → D0 Application(s) → Worker Node, Job, File(s) at the OSG site]