TEVATRON DATA CURATION UPDATE
Gene Oleynik, Fermilab
Department Head, Data Movement and Storage

Current Tevatron Storage
- Tevatron data is located in two data centers on site
- 21 PB of tape storage, about equally split between D0 and CDF
- Combined front-end disk cache on the order of 3 PB
- Grid Computing Center: RAW data and online database backups; Tevatron uses 1/ slot libraries there
- Feynman Computing Center: reconstructed and other data
  - Three 10,000-slot SL8500 libraries shared with other experiments
  - Cache disks located here
  - 22 Run II T10000C drives
- Copy of CDF RAW to CNAF in progress

Current Tevatron Tape Transfers [chart: TB transferred to CNAF]

Distribution of Active Data [chart]

Technology Migration
Migration activities, obviously a continual process. Parenthetical numbers are the number of tapes migrated:
- LTO2 > LTO4 (7380)
- 9940A > LTO4 (1330)
- 9940B > LTO4 (18300)
- LTO1 > LTO4 (1075)
- LTO3 > LTO4, T2 (4000)
- LTO4 > T2 (18963/25028, in progress)
76% complete overall: 94% done with CDF, 58% with D0.
RAW data is complete; we need the spokespersons' signoff to remove the LTO4 media.
By the end of FY14 we will have migrated over 57,000 Tevatron media, with a final count of less than 4000!

Media distribution at GCC GS Library [chart]

Media distribution at FCC Libraries [chart]

Technology Migration
- Cyclically move to newer, denser tape technology. Run II is on T10000C (5 TB/cartridge).
- The same media can be used in T10000D drives (8 TB/cartridge), and T10000D can read T10000C-written media, so this technology will last the preservation period.
- Continue to use in-house developed and HEP-collaborative storage software:
  - Enstore
  - SAM, SAM cache
  - dCache (we are about to deploy a D0 dCache for preservation)

Moving Forward
- Minimize the differences in technologies used by the experiments and support sustainable ones:
  - Plan to stay on T2 media for some time; complete migration to T2 by the end of FY14
  - D0 is investigating moving to SAM + dCache rather than SAM cache
- Reduce the amount of equipment to support:
  - CDF plans to reduce its cache disk from the 2011 level down to about 6% of that; D0 will likely do similar
  - The T10000C tape drive count has been reduced from the LTO4 count
- Completion of Fermilab to CNAF copies

Conclusion
We are near completion of the migration to T2 media.
We are on track with our commitment to maintain the Tevatron data through the preservation period.
Questions?

Backup slides

Technology Migration
Over the course of the Tevatron (not even counting 8mm):
- > 90x increase in tape capacity
- > 24x increase in transfer rate
- Decommissioned 9310 & ADIC AML-2 tape libraries
- Migrated off 9940A, 9940B, LTO1, LTO2, LTO3 to LTO4
- Currently migrating LTO4 to T2 (5.4 TB/cartridge media)
Care is taken to ensure all migrated data is copied and correct (a sketch of this kind of check follows below):
- Read back and verify the checksum of every migrated file
- Validate that the metadata is correct
- Verify no file is left behind when disposing of older media (a new, extra-paranoid step)
Ramping up the migration took a lot of effort and time. We use up to 8 "Migration Stations" in parallel.
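A minimal sketch, in Python, of the read-back verification described above. The catalog format, function names, and use of Adler-32 checksums are illustrative assumptions rather than the actual Enstore tooling; the point is the two checks: every re-read file matches its recorded checksum, and no catalogued file is missing from the destination.

    import zlib
    from pathlib import Path

    def adler32_of(path, chunk_size=16 * 1024 * 1024):
        """Stream a file and return its Adler-32 checksum as an unsigned int."""
        value = 1  # Adler-32 starting value
        with open(path, "rb") as f:
            while chunk := f.read(chunk_size):
                value = zlib.adler32(chunk, value)
        return value & 0xFFFFFFFF

    def verify_migrated_batch(catalog, staging_dir):
        """catalog: {file_name: expected_checksum} recorded when the files were first written.
        staging_dir: directory holding the files read back from the new media."""
        staging = Path(staging_dir)
        staged = {p.name for p in staging.iterdir() if p.is_file()}
        missing = sorted(set(catalog) - staged)  # the "no file left behind" check
        corrupt = [name for name, expected in catalog.items()
                   if name in staged and adler32_of(staging / name) != expected]
        return missing, corrupt

A real migration station would drive a check like this per tape volume and record the outcome back into the metadata database before the old media are disposed of.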

Environment
Decreasing track pitch yields higher demands on the environment:
- Dimensional stability: temperature and humidity variations result in creep and can cause read problems
- Dust and other debris; fine dust is of most concern

Environment
- We try to stay within the recommended operating ranges for humidity and temperature.
- For dust, the recommendation is a Class 8 cleanroom. We are about class 9 (better).
- We still see dust building up in the libraries over the years, though it has not caused problems yet.

Integrity, Monitoring and Disaster Recovery
Data integrity:
- End-to-end checksumming; spot sampling of files' checksums (a small sketch follows below)
- Experiment accesses (very good coverage while running)
- Write-protect filled tapes
- Extensive proactive health monitoring (soft errors, rates, etc.)
Environment monitoring:
- Wireless temperature/humidity recorders at the libraries
- Portable industrial dust detector, sampled around the rooms
Disaster recovery:
- Data is mostly single copy, but CDF RAW is basically included in RECO, and the RAW and RECO are in separate data centers
- Online database backups are at a different data center than the databases
- Second copy efforts started for CDF (FNAL to CNAF)
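A small illustration, in Python, of the spot-sampling idea referenced above. The catalog shape, function name, and per-cycle sample sizes are assumptions for illustration, not the actual policy; the point is that each cycle mounts a few randomly chosen tapes and re-checksums a random set of the files on them, so one mount verifies many files.

    import random

    def pick_spot_sample(catalog, tapes, tapes_per_cycle=5, files_per_cycle=50):
        """catalog: list of (file_id, tape_label) pairs; tapes: all tape labels.
        Returns the tapes to mount and the files to re-read and re-checksum."""
        sampled_tapes = set(random.sample(tapes, min(tapes_per_cycle, len(tapes))))
        candidates = [f for f, t in catalog if t in sampled_tapes]
        files = random.sample(candidates, min(files_per_cycle, len(candidates)))
        return sampled_tapes, files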

Integrity Issues We Have Encountered
- Fine debris buildup in LTO4 drives (bad tape batch?), resulting in very slow transfers (like an hour to successfully read a GB file). No data loss, but it required close monitoring to proactively replace drives.
- Slitting debris in a batch of T2 media. No data loss.
- Several instances of insects on tape at the Grid Computing Center. Some data loss (CMS T2).
- Mangled tape (very rare, though we recently had such an incident with a CMS T2 media: 15,170/15,707 recovered).
- Other firmware bugs with potential data losses (we had copies).
- A number of unreadable files: 13 out of ~15M for the Tevatron.
We have never encountered a checksum error from tape, just sense media errors, so successful reading of files may be sufficient.

Integrity and Monitoring open questions
- We currently sample randomly selected tapes and files and verify checksums. Is this the right thing to do in data-preservation mode? Do we risk mangling tapes, inserting debris, or ...?
- How do we measure data loss? The real impact is lost statistical significance, and that varies (a calibration file vs. a RAW data file). The easiest measure is lost or potentially lost files (is a file lost if it exists elsewhere?). A work in progress.
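One simple way to write down the "lost files" metric discussed above, using the numbers from the previous slide; the weighted variant is only a possible refinement, not an adopted definition:

    f_{\mathrm{lost}} = \frac{N_{\mathrm{unreadable}}}{N_{\mathrm{total}}} = \frac{13}{15 \times 10^{6}} \approx 8.7 \times 10^{-7}

    L_{\mathrm{weighted}} = \frac{\sum_{i \in \mathrm{lost}} w_i}{\sum_{j \in \mathrm{all}} w_j}

where w_i reflects the analysis impact of file i (e.g. a RAW data file would carry more weight than a calibration file).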

So what does all this cost?
Amortized costs (M&S costs over the appropriate lifetimes):
- Tape and disk hardware
- Infrastructure equipment, servers, network switches
- Migration (media amortization ~6 years), tape trade-in, decrease in tape cost over time
Yearly costs:
- Salaries: 5.5 system administrators, 4.5 developers
- Facilities (electricity, building)
- Maintenance
- Lab overhead costs for staff and M&S
Duty factor: assume 50% of the library is occupied by customers.
Estimate: ~$25-35/TB/yr for tape; 5-10x this for disk.
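As a rough sketch of how such a per-TB figure can be assembled from the components listed above (the decomposition and symbols are illustrative, not the actual cost model):

    \mathrm{cost/TB/yr} \approx \frac{C_{\mathrm{hw}} / T_{\mathrm{amort}} + C_{\mathrm{salaries}} + C_{\mathrm{facilities}} + C_{\mathrm{maint}} + C_{\mathrm{overhead}}}{D \times \mathrm{Capacity}_{\mathrm{TB}}}

with D ≈ 0.5 for the 50% duty factor, T_amort on the order of 6 years for media, and C_hw covering the amortized tape, disk, server and network M&S.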

What keeps me awake
Large unplanned effort and costs:
- Is the software technology sustainable? Is the in-house and collaborative expertise sustainable? Moving away from it can be expensive: new interfaces, formats, data migration, etc. may all be needed.
- Reliance on proprietary vendor technology: we use widely adopted hardware, but a vendor issue could force a costly migration to a different technology.
- Dust: cleanup?

Tape track density trend [chart]