D0 Grid Data Production Initiative: Coordination Mtg 11

1 D0 Grid Data Production Initiative: Coordination Mtg 11
Version 1.0 (meeting edition)
20 November 2008
Rob Kennedy and Adam Lyon
Attending: …
D0 Grid Data Production

2 Outline
Summary and News
Deployment “Feature List”
  Details filling in on December Deployment
Task Status (4 slides)
  Focus on individual task status, what is needed
Deployment 1 Plan
  Focus on overall schedule, task order

3 Summary and News
Summary:
  Initiative Deployment 1 Planning Mtg 2 held Monday
  New Station move completed successfully
  FWD1-3 Upgrade (with FWD4-5 in prd) Wed done, but full service not yet restored (details to come)
  QUE1 Upgrade planned for Thu (today) if all agreed
  Focus Today: Resolve FWD1-3, Proceed or not with QUE1
News and Notes:
  ITSM all-day Workshops this Tue-Fri
    Running ahead, so Rob K. here today after all

4 Current Deployment “Feature” Lists
Deployment 1: Split Data/MC Production Services (NO CHANGE)
Time frame: November 13-17, with 1 week+ observation before holidays
  1. Config: Basic Splitting of Fwd,Que Services between Data and MC Production with 2 Fwd nodes assigned to each, plus 1 Fwd dedicated to all Merging
  2. Fwd4 deployed (w/o virtualization)
  3. Fwd5 deployed
  4. Que2 deployed, with client software to enable parallel use of 2 QUE nodes
  5. New SAM Station (moved off of FWD1)
  6. Condor 7 via “new” m official release from UWisc
  7. FileMax increase on all Fwd nodes to handle large nJob actions
  8. D0Runjob Upgrade for Data Production: Prerequisite for deploying new SAM-Grid release
Deployment 2: Optimize Data and MC Production Configurations
Time frame: December 8-10, with 1 week+ observation before holidays
  1. Config: Optimize Configurations separately for Data and MC Production, especially to increase Data Production “queue” length
  2. New SAM-Grid Release with support for new Job status value at Queuing node
  3. Uniform OS’s: Upgrade FWD1-3 and QUE1 to latest SLF 4.0, same as FWD4-5
  4. Formalize transfer of support for QUE1 (samgrid.fnal.gov) to FEF from FGS

5 Deployment 1 Schedule B
Mon 17 Nov 2008
  Depl Plan Mtg 2, 9am
  Test FWD4,5, QUE2 w/ new OpenFileMax (MD,…)
  Backup samgrid products area on FWD1-3. Also /etc/grid-security, globus-gatekeeper (AL)
  Request OpenFileMax change on FWD1-3 (AL)
  Plan QUE1 Upgrade in detail (AL,PM)
Tues 18 Nov 2008
  FWD2 certs expire.
  Test FWD4,5, QUE2 w/ new OpenFileMax
  Backup samgrid products area on QUE1. Also /etc/grid-security, globus-gatekeeper; job_queue, job_history (AL,PM)
  Automated administration/monitoring on QUE1,2: put into a product (AL)
Wed 19 Nov 2008
  FWD1-3 wipe/re-install via umbrella package (JB); Increase OpenFileMax (FEF/JB)
    MC Prod uses FWD4 while this happens
    Data Prod uses FWD5 while this happens
    Reboot FWD1-3 to pick up OpenFileMax change
    Any order of FWD work is OK: All at once or seq.
  If all goes well, then stop, announce, observe. QUE1 work starts next day.
  Fall-back: restore samgrid products from backup
  SAM Station: Move context server (RI) to new sam station host and observe.
Thu 20 Nov 2008
  Coordination Mtg 9am led by Adam
    Sign off on FWD work, proceed with QUE1 work
  QUE1 upgrade install via umbrella package (JB)
    QUE1 has brokering, web page not on QUE2
    AL: Be careful NOT to wipe state of old jobs…
    Brokering, Web page should not be touched. We have not fully tested the new deployment of these.
    Production can use QUE2 while this happens. This has the modest complication of using this queuing node for recovery jobs (impacts Data Prod).
  If all goes well, then stop, announce. Observe.
  Fall-back: restore samgrid products from backup
  Validate the overall configuration matches plan
  Check all monitoring, automated tasks.
Fri 21 to Mon 24 Nov 2008
  Observe system in production
Tues 25 Nov 2008
  Sign-off on D0 Grid Production System
Clean-up: Deferred to December Deployment
  SRM client cert with correct host address
  OS upgrades (old nodes on SLF 4.5 to SLF 4.7)

6 Deployment 1 Configuration (adapted from Oct 6 proposal, tweaked in meeting)
Reco
  FWD1: 1250 (now 750)
  FWD5: 1250
MC, MC Merge
  FWD2: 1250 (now 750)
  FWD4: 1250
Reco Merge
  FWD3: 750/300 grid each
QUE1: Reco, Reco Merge (keep here to maintain history)
QUE2: MC, MC Merge
SAM Station: All
Jim Client: can submit to QUE1 or QUE2 depending on qualifier
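As a reading aid, the node assignments on this slide can be written as a lookup table. This is only a sketch of the plan as listed here: `NODE_PLAN` and `nodes_for` are hypothetical names, and nothing below reflects how the Jim client actually selects a queuing node from its qualifier.

```python
# Illustrative sketch only. NODE_PLAN and nodes_for are hypothetical names,
# not part of SAM-Grid or the Jim client.
NODE_PLAN = {
    "Reco":       {"fwd": ("FWD1", "FWD5"), "que": "QUE1"},
    "Reco Merge": {"fwd": ("FWD3",),        "que": "QUE1"},
    "MC":         {"fwd": ("FWD2", "FWD4"), "que": "QUE2"},
    "MC Merge":   {"fwd": ("FWD2", "FWD4"), "que": "QUE2"},
}


def nodes_for(job_type):
    """Return the forwarding nodes and queuing node planned for a job type."""
    try:
        return NODE_PLAN[job_type]
    except KeyError:
        raise ValueError(f"no Deployment 1 assignment for job type {job_type!r}")
```

For example, `nodes_for("MC")` gives forwarding nodes FWD2 and FWD4 with queuing node QUE2, matching the split between Data and MC production described on the slide.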

7 Task Status (1 of 4) (Red = critical tasks, Green = done, Blue = in progress, Yellow = added notes)
1.1.1 Forwarding Node 4 (Fwd4)
  <Snip some completed tasks>
  Fwd4: Pre-Deployment OpenFileMax=16k Large-Scale Test | AL | JS,MD,JB | Fri 11/14/08 | Tue 11/18/08 | 3d
  Fwd4: Setup Automated Maintenance, Monitoring | AL | JB | Wed 11/12/08 | Fri 11/14/08 | 3d
  Milestone: Fwd4 Ready to Deploy | AL | Tue 11/18/08 | Tue 11/18/08 | 0d
1.1.2 Forwarding Node 5 (Fwd5)
  Fwd5: Pre-Deployment OpenFileMax=16k Large-Scale Test | AL | JS,MD,JB | Fri 11/14/08 | Tue 11/18/08 | 3d
  Fwd5: Setup Automated Maintenance, Monitoring | AL | JB | Wed 11/12/08 | Fri 11/14/08 | 3d
  Milestone: Fwd5 Ready to Deploy | AL | Tue 11/18/08 | Tue 11/18/08 | 0d
1.1.3 Queuing Node 2 (Que2)
  Que2: Setup Automated Maintenance, Monitoring | AL | REX | Thu 11/13/08 | Fri 11/14/08 | 2d
  Que2: Integration Test w/2-QUE Client | AL | JB | Thu 11/13/08 | Fri 11/14/08 | 2d
  Milestone: Que2 Ready to Deploy | AL | Fri 11/14/08 | Fri 11/14/08 | 0d
1.1.5 New Distinct Sam Station
  Milestone: SAM Station Ready to Deploy | AL | Fri 11/14/08 | Fri 11/14/08 | 0d
  JIRA “Figure out what to do with SRMs” contains “Request and Install SRM-related certs”
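The slides do not say how OpenFileMax is set on the nodes. Assuming it corresponds to the Linux system-wide open-file limit (the `fs.file-max` kernel setting, readable at `/proc/sys/fs/file-max`) and reading "16k" as 16384, a pre-deployment check could be sketched as follows. The function names and the target constant are illustrative, not taken from the deployment procedure.

```python
from pathlib import Path

# Assumption: "OpenFileMax=16k" means fs.file-max >= 16384.
TARGET_FILE_MAX = 16 * 1024


def meets_target(file_max_text, target=TARGET_FILE_MAX):
    """Return True if the text read from file-max is at least the target."""
    return int(file_max_text.strip()) >= target


def check_host(path="/proc/sys/fs/file-max"):
    """Check the local host; return None if the limit file is not present."""
    limit_file = Path(path)
    if not limit_file.exists():
        return None
    return meets_target(limit_file.read_text())
```

Calling `check_host()` on a forwarding node after the reboot would then report whether the new limit has taken effect, under the assumptions stated above.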

8 Deployment 1 Tasks (2 of 4) (Red = critical tasks, Green = done, Blue = in progress, Yellow = added notes)
1.1.6 Deployment Stage 1
  <Snip some completed tasks>
  Deployment 1: Execute | AL | REX | Fri 11/14/08 | Thu 11/20/08 | 5d
  SAM Station: Deactivate old station, Activate new station | AL | RI | Fri 11/14/08 | Fri 11/14/08 | 1d
  FWD1 Upgrade (App,Config,OpenFileMax) | AL | JB | Wed 11/19/08 | Thu 11/20/08 | 2d
  FWD2 Upgrade (App,Config,OpenFileMax) | AL | JB | Wed 11/19/08 | Thu 11/20/08 | 2d
  FWD3 Upgrade (App,Config,OpenFileMax) | AL | JB | Wed 11/19/08 | Thu 11/20/08 | 2d
  QUE1 Upgrade (App,Config) | AL | JB | Wed 11/19/08 | Thu 11/20/08 | 2d
  Establish Grid Production Configuration | AL | REX | Thu 11/20/08 | Thu 11/20/08 | 1d
  SAM Station: Setup Context Server | AL | RI | Wed 11/19/08 | Thu 11/20/08 | 2d
  Milestone: Deployment 1 Execution done | AL | Thu 11/20/08 | Thu 11/20/08 | 0d
  Deployment 1: Monitor | AL | REX | Fri 11/21/08 | Mon 11/24/08 | 2d
  Deployment 1: Sign-off | AL | REX | Tue 11/25/08 | Tue 11/25/08 | 1d
  MILE 1: Deployment 1 Completed | AL | Tue 11/25/08 | Tue 11/25/08 | 0d
  Deployment 1 Review | AL | Mon 12/1/08 | 1d
Not all starts/durations above are sync’d to the latest Monday plan
  Meeting on Monday 17 November produced the authoritative schedule (Sched B)
We cannot deploy later than 20 Nov. (Thursday)… no deploy on Friday or holiday week.
New Condor is in this deployment too, all FWD,QUE nodes. THIS is a major risk.
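The durations in the task list appear to count working days, inclusive of both the start and finish dates (for example, Deployment 1: Monitor runs Fri 11/21/08 to Mon 11/24/08 and is listed as 2d). The slides do not state this convention, so the sketch below is an assumption; `working_days` is a hypothetical helper for cross-checking a task row against it.

```python
from datetime import date, timedelta


def working_days(start, finish):
    """Count Monday-to-Friday days from start to finish, inclusive."""
    count = 0
    current = start
    while current <= finish:
        if current.weekday() < 5:  # 0-4 are Monday through Friday
            count += 1
        current += timedelta(days=1)
    return count


# Deployment 1: Monitor, Fri 11/21/08 to Mon 11/24/08, listed as 2d
monitor_days = working_days(date(2008, 11, 21), date(2008, 11, 24))  # 2
```

A row whose listed duration disagrees with this count is a candidate for re-syncing against the Monday plan.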

9 Task Status (3 of 4) (Red = critical tasks, Green = done, Blue = in progress, Yellow = added notes)
1.1.8 FWD and QUE Packaging with Version-Based Umbrella Product
  New FWD Install Proc/Doc hand-off to REX/Ops | AL | JB | Mon 11/17/08 | Mon 11/17/08 | 0d
  Umbrella Product: Update FWD Installation Procedure | AL | JB | Mon 11/24/08 | Tue 11/25/08 | 2d
  Add OpenFileMax setting to FWD Installation Procedure | AL | REX | Wed 11/19/08 | Wed 11/19/08 | 1d
  New QUE Install Proc/Doc hand-off to REX/Ops | AL | JB | Mon 11/17/08 | Mon 11/17/08 | 0d
  Umbrella Product: Update QUE Installation Procedure | AL | JB | Mon 11/24/08 | Tue 11/25/08 | 2d
  Umbrella Product: FWD and QUE Install. Proc. archived | AL | REX | Wed 11/26/08 | Wed 11/26/08 | 1d
  Milestone: FWD and QUE Packaging with Version-Based Umbrella Product done | GG,AL | Wed 11/26/08
Notes:

10 Task Status (4 of 4) (Red = critical tasks, Green = done, Blue = in progress, Yellow = added notes)
1.3.1 SAM-Grid Job Status Info
  New Job Status Value at QUE Node: Later Work | GG | PM | Tue 11/18/08 | Mon 11/24/08 | 5d
  Use "Same" Proxy for Gridftps | GG | PM | Thu 11/20/08 | Mon 11/24/08 | 3d
  SAM-Grid Release with Job Status Info feature | GG | PM | Tue 11/25/08 | Wed 11/26/08 | 2d
  Pre-deployment test of new SAM-Grid Release | AL | REX | Mon 12/1/08 | Fri 12/5/08 | 5d
  Upgrade D0Runjob version used by Data Production | AL | MD,AL | Thu 10/30/08 | Fri 10/31/08 | 2d
  Milestone: SAM-Grid Release Deployable for Data Production | AL | REX | Fri 12/5/08 | Fri 12/5/08 | 0d
1.3.2 Slow Fwd-CAB Job Transition
  Note: FileMax change requires a schedd restart (ST). Work into deployment plans.
1.3.3 Improved H/w Uptime
1.4 Metrics
  nSubmissions plot for Sep ’08 Mike?
Post-Deployment topics and tasks covered in the “Deployment 1 Review”
  Archiving Installation Instructions with all note-worthy comments in JIRA integrated
  Lists of new-machine certs and new-operator authorization: location, process, what uses it, manual or auto updated
  Cost-benefit: push FWD, QUE nodes to be appliances spec’d from OS (including OpenFileMax) to applications to grid system configuration
    rapid wipe and re-install
Past Notes:
  At mercy of off-site gridmap updates… need to use the existing automated system to keep all in sync
    Also: no remote site has new VDT (which has new VOMS)
  No installation instructions for durable locations server. Considering for Phase 2 of Initiative.

