LQCD Computing Operations


LQCD Computing Operations
Chip Watson
LQCD Computing Review, Boston, May 24-25, 2005

Talk Outline
- Operations Scope (what needs to be done)
- Staffing (project funded + contributions)
- Other expenses

Operations Scope
High-level view:
- Provide basic services at each site (typical single-site computer center operations)
- Provide data persistence across the 3 sites
- Support meta-facility operations

Operations Scope (1)
Provide basic services at each site:
- Interactive services
  - software development: edit / (cross) compile / debug
  - interactive running: quick tests, interactive analysis
  - standard tools: make utility, scripting (perl, ...), others
- Batch services
  - allocations & priorities to conform to local allocations
  - future: auto-staging of required input files (even striping)
- File services
  - temporary (high-performance) and long-term storage
Note: these services, and the several environments implied, are being standardized so that users can easily move their work from one site to another: portable code, batch scripts, file naming, etc. (a sketch of such a site-neutral wrapper follows below).
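To make the portability point concrete, here is a minimal sketch, not part of the project's actual tooling, of a site-neutral submission wrapper. The site names, the qsub-style submit command, and the LQCD_SCRATCH environment variable are all assumptions for illustration; the idea is only that user batch scripts reference a common variable instead of site-specific paths.

```python
#!/usr/bin/env python
"""Illustrative sketch only: a site-neutral job submission wrapper of the kind
the standardization effort implies.  Site names, submit commands, and the
LQCD_SCRATCH variable are hypothetical, not the project's actual interfaces."""

import os
import subprocess
import sys

# Hypothetical per-site settings; real values would come from each center.
SITES = {
    "jlab": {"submit_cmd": ["qsub"], "scratch": "/scratch/lqcd"},
    "fnal": {"submit_cmd": ["qsub"], "scratch": "/local/scratch"},
    "bnl":  {"submit_cmd": ["qsub"], "scratch": "/qcdoc/scratch"},
}

def submit(site, script_path):
    """Hand a user batch script to the local batch system, exporting a common
    scratch location so the script itself stays portable across the 3 sites."""
    cfg = SITES[site]
    env = dict(os.environ, LQCD_SCRATCH=cfg["scratch"])
    return subprocess.call(cfg["submit_cmd"] + [script_path], env=env)

if __name__ == "__main__":
    # usage (hypothetical): submit.py jlab my_analysis_job.sh
    sys.exit(submit(sys.argv[1], sys.argv[2]))
```

A user's job script would then write its output under $LQCD_SCRATCH and run unchanged at any of the three sites.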

Operations Scope (2)
Basic services also include:
- repair of failed components; routine maintenance
- backup of user directories
- system administration
- cyber-security (specific needs of the LQCD systems)
- migration to each release of the user environment (profile)
- installation of new releases of SciDAC software

Operations Scope (3)
Provide data persistence:
- Silos at FNAL and JLab will provide long-term storage
- High-value data will be multiply stored, perhaps with one copy at NERSC or NSF centers
- In year 1, the primary flows will be the migration of configurations from BNL (QCDOC) to FNAL and JLab
Operate as a meta-facility, with increasing capabilities:
- basic file transfer (manual management)
- catalogs (especially a meta-data catalog; leverage ILDG work)
- data grid; policy-driven file migration; leverage SRM work
- (perhaps) computational grid (low priority)
A sketch of the policy-driven replication idea follows below.
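As an illustration of "policy-driven file migration", the following toy sketch computes which transfers are needed so that every high-value configuration has at least two site copies. The catalog layout, site names, and MIN_COPIES policy are assumptions made up for this example, not the ILDG or SRM interfaces the project would actually leverage.

```python
"""Illustrative sketch only: a replication policy over a toy replica catalog,
in the spirit of "high value data will be multiply stored".  The catalog
structure and policy constants are assumptions, not project interfaces."""

# Toy replica catalog: logical file name -> set of sites holding a copy.
CATALOG = {
    "configs/ensemble_a/cfg_0010": {"bnl"},
    "configs/ensemble_a/cfg_0020": {"bnl", "fnal"},
}

PREFERRED_ARCHIVES = ("fnal", "jlab")   # silo sites providing long-term storage
MIN_COPIES = 2                          # policy for high-value data

def plan_replication(catalog):
    """Return (lfn, source_site, target_site) transfers needed to meet the policy."""
    transfers = []
    for lfn, sites in catalog.items():
        missing = MIN_COPIES - len(sites)
        for target in PREFERRED_ARCHIVES:
            if missing <= 0:
                break
            if target not in sites:
                transfers.append((lfn, next(iter(sites)), target))
                missing -= 1
    return transfers

if __name__ == "__main__":
    for lfn, src, dst in plan_replication(CATALOG):
        print(f"replicate {lfn}: {src} -> {dst}")
```

In year 1 this kind of plan would mostly describe the one-way flow of configurations from BNL to the FNAL and JLab silos; the actual transfers would be driven manually or by grid tools, not by this sketch.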

Uniformity
Goal: make the 3 sites appear uniform to the user.
- Common library interface (API)
- Common run-time environment
  - disk layout, batch scripting support, etc.
- Web interfaces
  - inter-linked documentation (common, site-specific)
  - inter-linked status pages
- Trouble tickets
  - integrate with the host institutions' systems
  - forward (most) software problems to SciDAC developers
  - forward multi-site problems to all affected sites
  (a sketch of this routing rule follows below)
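The ticket-routing rule above can be stated compactly. The sketch below is purely illustrative: the contact addresses and the Ticket structure are hypothetical, not any host institution's actual ticketing system.

```python
"""Illustrative sketch only: the routing rules described on the slide
(software problems to SciDAC developers, multi-site problems to all affected
sites).  Addresses and the Ticket structure are hypothetical."""

from dataclasses import dataclass, field

SITE_CONTACTS = {                 # hypothetical queues at each host institution
    "bnl": "lqcd-ops@bnl.example",
    "fnal": "lqcd-ops@fnal.example",
    "jlab": "lqcd-ops@jlab.example",
}
SCIDAC_DEVELOPERS = "scidac-software@example"

@dataclass
class Ticket:
    summary: str
    category: str                 # "software", "hardware", "account", ...
    sites: list = field(default_factory=list)   # sites where the problem is seen

def route(ticket):
    """Return the set of addresses a ticket should be forwarded to."""
    recipients = {SITE_CONTACTS[s] for s in ticket.sites}
    if ticket.category == "software":
        recipients.add(SCIDAC_DEVELOPERS)   # forward (most) software problems upstream
    return recipients

if __name__ == "__main__":
    t = Ticket("I/O errors staging configurations", "software", ["fnal", "jlab"])
    print(sorted(route(t)))
```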

Leverage from other projects
Software & software support (SciDAC project):
- software is developed by base and SciDAC funding
- algorithm optimizations yield more science sooner
- developers help with deployment and with troubleshooting
General computer center services (host laboratories):
- wide-area networking & cyber-security (first line of defense)
- account management
- silo systems at FNAL and JLab
- depth of expertise in power, A/C, disk servers, ...
The LQCD computing project imposes only small impacts on the host labs, but receives big benefits in return.

Funding Profile

Operations: Staffing
Operations staffing: project funded + contributions.
- Staffing is lean, as an incremental cost onto an operational computer center and as an increment onto SciDAC prototyping.
- Operations support is 5 shifts/week, with some basic support possible for additional hours by leveraging site computer center staff (details to be worked out at each site).
- FTEs come from this project and from base + SciDAC funding (MOUs still to be finalized).
- Software development & support, and prototype R&D (SciDAC), are not shown here.

FTEs (project + base/SciDAC), as transcribed from the slide (unfilled cells were not captured):

  Site   sysadmin / technician   software & user support   site management
  BNL    0.75                    0.5                       0.25
  FNAL   1.0 + 0.75              0.25 + 0.25
  JLAB   0.4 + 0.25

Operations: Expenses
Other expenses:
- expansion of disk space: this is part of the "deployment" WBS element; disk capacity at each site will need to grow as compute capacity is added
- tape (media) at FNAL and JLab (annual expense)
- tape drive(s): as bandwidth requirements grow, the project may need to procure additional tape drives at FNAL and JLab to ensure adequate access speed
- space, power, and air conditioning are contributed by the host laboratories
These other expenses are approximately 3% of total costs (may vary year to year).

Current Status
- Basic services are currently running at a low level
  - the current system is an outgrowth of the SciDAC & QCDOC projects
  - the batch system is not yet fully operational for the QCDOC (?)
  - new project funding will enable a higher quality of operations and user support
- Allocations & accounting are initially single site x 3, and at BNL are implemented by pre-defined partitioning
- The LQCD API is maturing rapidly (details to follow)
- The common run-time environment is substantially defined
  - implementation is resource constrained; we expect to have an initial version operational by project start
- Meta-facility operations will be "basic only" at start
- Trouble tickets may be operational by project start (perhaps single site x 3 only)

Final Comments
- We intend to carefully balance hardware procurements and staffing so as to optimize science; we will always wish for more staff to make the system nicer and more user friendly.
- In this lean mode, a fair amount of user training and support will be distributed into the user community (where there is considerable expertise).
- LQCD has a rather small set of projects for a facility of this size (typically a few active users per site doing the production running for a particular large collaboration), and all active users will become proficient in dealing with minor inconveniences.
- These two facts will help to yield a good return on this investment for DOE.