SAM: Past, Present, and Future
Lee Lueking, All Dzero Meeting, November 2, 2001


Part I: Past and Present
1. Stats: users, groups, datasets, projects, files. How is the system being utilized?
2. Cache and job management: How do the caching and fair-share mechanisms work?
3. Central analysis groups and queues.
4. Tape access: What are the encp stats for the last month? Tapes: good, bad, and recoverable.
5. Remote sites: data forwarding from remote MC processing centers.

Part II: Future (post shutdown)
1. New tape facilities
2. SAM on the farm and ClueD0
3. Storing user/group data into SAM
4. Delivering data to remote sites
5. Problems and concerns

Part I: Past and Present

SAM Usage Statistics

428 registered SAM users in production
  - 283 of them have at some time run at least one SAM project
  - 267 of them have run a SAM project in the past year
  - 181 of them have run a SAM project in the past 2 months
222 registered nodes
150,847 cached files on disk somewhere
  - 146,908 of them on d0mino
  - 1,299 on d0lxac1
  - 2,301 on a ClueD0 node
  - 337 on the Imperial College test machine in the UK
  - 503 on the Linux build machine
281,066 data files known to SAM
  - 43,534 raw files (all stored on tape)
  - 78,463 reconstructed files (76,305 of them actually stored)
  - 19,700 root-tuple files

Active Stations

Station Name               Description
protofarm                  Heidi's protofarm
Imperial Test              Imperial College
Lancs                      Lancaster
Ccin2p3-analysis           Lyon
Central-analysis           FNAL D0 analysis
hoeve                      NIKHEF farm
Fnal-farm                  Real FNAL farm
Clued0                     Distributed analysis
luhep                      Langston, Oklahoma
datalogger                 Online logger
msu                        East Lansing
Uta-hep                    Arlington
linux-analysis-cluster-1   Linux analysis cluster
d0nevis                    Columbia
D0small-01                 Small Linux station
Prague-test-station        Prague

Central-analysis Cache

All groups currently use the Least Recently Used (LRU) replacement algorithm.
Files can migrate from one group's cache to another if used frequently by the other group.
Currently, caches are large and there is little turnover.

Group        Cache Allocation
algo         100 GB
cal          78 GB
dzero        8 TB
emid         10 GB
thumbnail    50 GB
trigsim      20 GB
ttk1         2 TB
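To make the replacement policy concrete, here is a minimal sketch of a per-group LRU cache like the one described above. It is illustrative only, not the SAM station code; the class, its API, and the example group name and sizes are assumptions.

    from collections import OrderedDict

    class GroupCache:
        """Toy per-group disk cache with Least Recently Used replacement."""

        def __init__(self, quota_bytes):
            self.quota = quota_bytes
            self.files = OrderedDict()   # oldest access first, newest last
            self.used = 0

        def access(self, name, size):
            """Record a file access; evict least recently used files if over quota."""
            if name in self.files:
                self.files.move_to_end(name)          # now most recently used
                return []
            evicted = []
            while self.used + size > self.quota and self.files:
                victim, victim_size = self.files.popitem(last=False)  # LRU file
                self.used -= victim_size
                evicted.append(victim)
            self.files[name] = size
            self.used += size
            return evicted

    # Example with made-up numbers: a 78 GB cache for the "cal" group
    cal = GroupCache(quota_bytes=78 * 10**9)
    cal.access("raw_run123.dat", 2 * 10**9)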

Central-analysis Cache Turnover

(Plot of central-analysis cache turnover.)

Resource Management Approaches

Fair sharing (policies)
  - Allocation of resources and scheduling of jobs
  - The goal is to ensure that, in a busy environment, each group gets a fixed share of "resources" or a fixed share of "work" done
Co-allocation and reservation (optimization)

Fair Share and Computational Economy

Jobs, when executed, incur costs (through resource utilization) and realize benefits (through getting work done).
Maintain a tuple (vector) of cumulative costs/benefits for each group and compare it to the group's allocated fair share to set its priority higher or lower.
Incorporate all known resource types and benefit metrics; the scheme is totally flexible. Examples: tape mounts, tape reads, network, cache, CPU, and memory.
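As a rough illustration of this accounting, the sketch below keeps a cumulative cost vector per group and ranks groups by how far each one is below its allocated share. This is a sketch under assumptions: the metrics, numbers, and priority formula are illustrative, not the actual SAM scheduler.

    # Toy fair-share accounting: each group accumulates a cost vector over
    # several resource types; a group's scheduling priority rises when it has
    # consumed less than its allocated share and falls when it has consumed more.

    RESOURCES = ("tape_mounts", "tape_reads", "network", "cache", "cpu", "memory")

    def record_usage(usage, group, costs):
        """Add a finished job's resource costs to the group's cumulative vector."""
        vec = usage.setdefault(group, {r: 0.0 for r in RESOURCES})
        for resource, cost in costs.items():
            vec[resource] += cost

    def priority(usage, shares, group):
        """Positive when the group is under its fair share, negative when over."""
        total = sum(sum(v.values()) for v in usage.values()) or 1.0
        used_fraction = sum(usage.get(group, {}).values()) / total
        return shares[group] - used_fraction

    # Example with made-up numbers
    usage, shares = {}, {"top": 0.3, "higgs": 0.5, "tauid": 0.2}
    record_usage(usage, "top", {"tape_mounts": 10, "cpu": 500})
    record_usage(usage, "higgs", {"cpu": 100})
    print(sorted(shares, key=lambda g: priority(usage, shares, g), reverse=True))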

Job Control: Station Integration with the Abstract Batch System

Components in the slide diagram: Client, Local RM (Station Master), Batch System, Process Manager (SAM wrapper script), User Task, and Job Manager (Project Master). The Station Master provides fair-share job scheduling and resource co-allocation.

Numbered steps from the diagram: 1. user runs "sam submit"; 2. the client submits to the Station Master; 3. invoke; 4. submit to the batch system; 5. SAM condition satisfied; 6. dispatch; 7. started; 8. invoke; 9. setJobCount/stop; 10. resubmit; jobEnd.

Forwarding + Caching = Global Replication

(Diagram: WAN data flow from a user/producer at a remote replica site, through SAM stations and their caches, into the mass storage system; the example path runs from NIKHEF (Amsterdam) through SARA to the Fermilab D0robot over a 155 Mbps link.)

Enstore Statistics: Delivery

Start Date: "10/22/01 00:00:00"   End Date: "10/29/01 00:00:00"
Delivered Files: 938
Total Delivered Bytes: GB
Average File Size: / MB
Average Delivery Time: / s
Average Queue Wait Time: / s
Average Mount Time: / s
Average Seek Time: / s
Average Transfer Time: / s
Average Transfer Rate: / MB/s

File Delivery Error Statistics
Total Errors: 856
"USERERROR" Errors: 72 (8.41% of Total Errors)
"NOACCESS" Errors: 675 (78.86% of Total Errors)
"NOTALLOWED" Errors: 109 (12.73% of Total Errors)

Enstore Statistics: Store

Start Date: "10/22/01 00:00:00"   End Date: "10/29/01 00:00:00"

File Store Success Statistics
Stored Files: 1622
Total Stored Bytes: GB
Average File Size: / MB
Average Delivery Time: / s
Average Queue Wait Time: / s
Average Mount Time: / s
Average Seek Time: / s
Average Transfer Time: / s
Average Transfer Rate: / MB/s

File Store Error Statistics
Total Errors: 4
"USERERROR" Errors: 3 (75.00% of Total Errors)
"EEXIST" Errors: 1 (25.00% of Total Errors)

Current Tape Storage Summary

45 TB on tape
1,362 volumes altogether
Currently 18 volumes are marked noaccess
80 volumes are marked notallowed

Part II: The Future (post shutdown)

New Tape Facilities

STK 9940 drives
  - Very reliable (no problems in 30 TB)
  - 60 GB cartridge
Share an STK PowderHorn silo with other lab customers
  - 6-7 of the 9940 drives for us
  - 1,000 tape slots
In ~March, move to our own PowderHorn
  - Space in FCC is now being prepared
  - The robot is already here
  - Deploy and test starting Jan-Feb
Dzero STK PowderHorn silo
  - 9 x 9940 drives now, up to 20 drives later
  - 5,500 tape slots total

Use Existing AML/2 for MC

Replacing M2 drives with LTO
100 GB cartridge
Have 6 drives, will expand to 10 later
Very reliable in tests so far (1 problem in 30 TB)
Plan to use for all MC and some group data

SAM Distributed Cache

Two case studies follow: the Fnal-farm and ClueD0 stations.

Case Study: Distributed Reconstruction Farm

(Diagram: worker nodes 1..N and the D0bbin farm server on a LAN, with the Enstore mass storage system behind.)

No disks are cross-mounted. Worker nodes get files directly from the MSS via encp. Data is moved by SAM using rcp from where it is cached to where it is needed.
90 dual-processor Linux nodes (and growing), 30 GB of disk each, 100 Mbit Ethernet NICs on the workers. D0bbin is a 4-processor SGI O2000 with a Gigabit NIC.
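A schematic of the delivery choice just described, where a worker either fetches a file straight from the mass storage system or receives a cached copy from another node. This is a sketch under assumptions: the bookkeeping, the exact encp/rcp invocations, and all paths and host names are illustrative, not the actual SAM station implementation.

    import subprocess

    def deliver(file_name, dest_node, dest_path, cache_locations, mss_path):
        """Toy version of the station's delivery decision for one file.

        cache_locations maps a file name to (node, path) where a cached copy
        already lives. If a copy is cached somewhere, move it with rcp;
        otherwise fetch it directly from the mass storage system with encp.
        Commands, arguments, and paths here are illustrative only.
        """
        if file_name in cache_locations:
            src_node, src_path = cache_locations[file_name]
            cmd = ["rcp", f"{src_node}:{src_path}/{file_name}",
                   f"{dest_node}:{dest_path}/{file_name}"]
        else:
            cmd = ["encp", f"{mss_path}/{file_name}",
                   f"{dest_path}/{file_name}"]   # direct MSS-to-worker transfer
        subprocess.run(cmd, check=True)

    # Hypothetical usage (requires encp/rcp on the node):
    # deliver("raw_run123.dat", "worker07", "/sam/cache",
    #         cache_locations={}, mss_path="/pnfs/sam/dzero/raw")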

Case Study: Distributed Analysis Cluster (ClueD0)

(Diagram: 100+ desktop nodes and the ClueD0-ripon file server node, with mass storage behind.)

The ClueD0-ripon node has a 640 GB SAM cache disk; the 100+ Linux desktop nodes hold 4-5 TB of distributed SAM cache; 5 nodes are in SAM mode now.
All (tape) data enters the ClueD0 station through the main file server node, ClueD0-ripon. The station migrates data as needed and manages the cache distributed among the many desktop constituents.

Storing Group Data in SAM

Each group will have tapes allocated for specific tiers of data: gen, d0gstar, d0sim, reconstructed, root-tuples, and others.
Each group will have a tape allocation limit.
Group data will be added with the special tier designation "-bygroup" to distinguish it from farm and other production data.
A document describing the details, "Storing Group Data into SAM", is available under the SAM documentation.
Groups set up so far include top, higgs, and tauid.

Routing + Caching = Global Replication

(Diagram: WAN data flow between the mass storage system, SAM stations with their caches, and a user at a replica site.)

Issues

Tape problems should be under control.
The CORBA naming server has caused problems in the past. We are testing a new naming service with persistency that should resolve this; we plan to deploy it this month.
Some queries have caused the system to jam. We have split the user DB server away from the DB server used by the stations, and we are looking into how to deal with long (usually event-picking) queries.
User support is sometimes slower than people would like:
  - We are training many Dzero volunteers to help.
  - Lauri is available at Dzero every Wednesday on DAB5 (my office). She has not been overwhelmed by walk-ins.

Conclusion

SAM is heavily used by D0.
The cache management and fair-share resource allocations are designed to help control the use of resources in the system.
SAM provides easy storage of data for on-site and off-site production customers.
In spite of many tape problems, the Enstore system has been storing and serving lots of data.

Conclusion (2)

The new tape and robot technologies will make tape-based data storage and access extremely reliable.
SAM provides a framework within which to operate distributed processing and analysis clusters; these will be very important in the future.
We are ready to store group data into the system on a regular basis.
Delivery of data to remote stations from robot stores is coming.
We have addressed, and continue to address, many issues to make the system serve Dzero better than ever.