Fabric Automation: The Challenge of LHC Scale Fabrics
LHC Computing Grid Workshop
David Foster, LCG Project
12 March 2002



How Do We Deal with LHC Reality?
- Several thousand machines
  - Software and hardware maintenance
- Geographically distributed system
  - Multiple system administrators and policies
  - Wide-area networking
- Tens of PB of data
  - Random access patterns
- Thousands of physicists
  - Many potential failure modes
  - No clear solutions for QoS, fault tolerance, etc.

The challenge is "creative simplification".

Distributed Model
- Run jobs at CERN
  - LXBatch
- Run jobs across the European Grid
  - EDG testbeds
- Run jobs across a world-wide Grid
  - Start with the US-EU interconnect
  - GLUE, DataTAG initiatives

The GRID is currently the technology framework.

GRID Framework
- Application environment
  - RedHat, Grid services (e.g. GAT from GridLab), etc.
- High-level middleware services
  - Resource Broker, Information Server, Storage Element, etc.
- Distributed components
- Common protocols
  - TCP/IP, HTTP, SOAP, XML document exchange, etc.

While everything works, there is no problem …
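The common-protocol layer is the glue between these components: services interoperate by exchanging structured XML documents over HTTP rather than through shared binaries. A minimal sketch of such an exchange, assuming a hypothetical job-status request (the service and element names are invented for illustration, not taken from any actual EDG interface):

```python
# Minimal sketch of SOAP-style XML document exchange between middleware
# components. JobStatusRequest and jobId are hypothetical names.
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"

def build_request(job_id):
    """Build a SOAP 1.1 envelope asking for the status of one job."""
    ET.register_namespace("soap", SOAP_NS)
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    request = ET.SubElement(body, "JobStatusRequest")
    ET.SubElement(request, "jobId").text = job_id
    return ET.tostring(envelope, encoding="utf-8")

print(build_request("job-0042").decode())
```

The point is only the pattern: each side agrees on a document schema, so components written by different groups can interoperate without sharing code.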

LHC GRID System Requirements
- Predictable behaviour
  - Understand failure modes (disk, server, network)
- Change management
  - The dreaded "certification" process
- Scalability
  - Understand how scalable the systems really are
  - Experience needed
- Cost effectiveness

Architectural Considerations
What does it take to make a predictable system?
- Hardware, fabric, middleware, applications
- Maximise automated corrective actions
  - Must be part of the architectural design
  - Must be part of the software implementation
  - Must be part of the hardware selection

The cost of failures must be understood, so that a failure does not:
- Cause panic
- Require expert assistance in every case
- "Bring down" the system
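The "maximise automated corrective actions" principle can be sketched as a dispatch loop: known failure modes get an automated response, and only unrecognised failures escalate to an operator, after the node is isolated so it cannot bring down the system. All failure names and actions below are illustrative assumptions, not an actual LCG interface:

```python
# Sketch: automate the known failure modes, escalate only the unknown.
# Failure names and actions are invented for illustration.
ACTIONS = {
    "disk_full": "clean scratch space and re-enable node",
    "daemon_dead": "restart daemon",
    "node_unreachable": "drain queue and reboot node",
}

def handle_failure(failure):
    """Return the corrective action taken for one reported failure."""
    if failure in ACTIONS:
        return f"automated: {ACTIONS[failure]}"
    # Unknown failure mode: isolate the node first so it cannot
    # "bring down" the system, then involve a human.
    return "escalate: node isolated, operator notified"

print(handle_failure("disk_full"))
print(handle_failure("kernel_panic"))
```

The design point is that expert assistance is the exception path, not the default, which is what keeps per-failure cost low at fabric scale.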

Cost Effectiveness
- Computer center organisation is important
  - Managing 20,000 machines: physical logistics, "locatability", upgrade strategies
- Increased automation of tasks
  - Even one manual intervention per machine is not tolerable
  - Causal analysis
- Human tasks must be kept manageable

Overall Architecture Needs
A "systems level" approach:
- How do we manage state in the grid?
  - Workflow
  - Recovery strategies
- What functions are distributed/replicated, where and how, and what are the location strategies?
- Data storage/management strategies
  - PB disk farms? SAN/NAS interconnects? Data organisation?
- Node management strategies
  - Node downtime impact

Fabric Management Overview
Fabric management means:
- Managing the nodes
  - Configuration, installation, maintenance, monitoring, scheduling, fault tolerance
- Interfacing to the Grid
  - Grid-to-local policy mapping
- Monitoring the resource
  - Information on activity, performance, etc. to other grid functions
  - Error reporting
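The grid-to-local policy mapping mentioned above can be sketched as a simple translation table: a grid-level identity (here a virtual organisation) is mapped onto local fabric policy such as a batch queue and disk quota. The VO names, queue names and quotas are invented for illustration:

```python
# Sketch of grid-to-local policy mapping: grid identity in, local
# fabric policy out. All names and numbers are illustrative.
LOCAL_POLICY = {
    "atlas": {"queue": "lxbatch_long", "quota_gb": 500},
    "cms":   {"queue": "lxbatch_long", "quota_gb": 500},
    "dteam": {"queue": "lxbatch_test", "quota_gb": 10},
}

def map_grid_user(vo):
    """Translate a VO membership into local batch/storage policy."""
    policy = LOCAL_POLICY.get(vo)
    if policy is None:
        # Sites retain local control: unmapped VOs get no resources.
        raise PermissionError(f"VO '{vo}' has no local mapping")
    return policy

print(map_grid_user("atlas"))
```

The key property is that the site keeps authority over its own resources: the grid expresses *who* is asking, and local policy decides *what* they get.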

Current Activities
- EDG-WP4
  - Working on producing an automated testbed by end
- Many computer centers have their own solutions
- Some commercial products exist
  - Solutions target specific problems and environments
- CERN computing services
  - Running production services (~1000 servers)
  - Should integrate WP4 functions as they become available
- GGF
  - No apparent focus on fabric management, although parts appear in other activities, e.g. grid-to-local policy mapping and various monitoring activities

LCG Strategy
- Move towards increasingly automated production environments as soon as possible
- Work with existing initiatives
  - Accelerate some developments
  - Work within the constraints of existing production environments
- Inject resources into "productisation"
  - Address reliability, manageability and cost-effectiveness

Implementation Strategy
Work from the "bottom up":
- Review periodically to understand the hardware technology strategy in the 3-year timeframe
- Theory and practice:
  - Productise the best in configuration management
  - Productise the best in installation management
  - Productise the best in monitoring
  - Work with a common middleware software environment

Work from the "top down":
- Define a common application environment
- Design overall workflow and resource strategies
- Define farm management strategies

High-Level FM Objectives 2002
- Complete a first technology review
- Configuration system for managing software installations and updates in production
  - Work with EDG-WP4 technologies
- Monitoring system to collect data and provide alerts in production
  - Gain initial experience using scalable technologies
  - Combine work of WP4 and the IT/FIO group (SCADA)
- Production hierarchy: Testbed -> LXProto -> LXShare -> LXBatch (LXPlus)

Configuration Management Actions (Target: September 2002)
- Complete work to put an EDG-LCFG configuration management server in place
  - Test the HLD components in practice
- Create EDG-LCFG client scripts to replace SUE and BIS capabilities
- Create infrastructure to enable automated RPM installation
  - Scalable access to RPM files
  - RPM management
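The core of automated RPM installation is a desired-state comparison: the configuration server publishes the package list a node should have, and the client works out which RPMs to install or remove. A minimal sketch, with invented package names; a real client would query the local RPM database (e.g. `rpm -qa`) rather than take a list as a parameter:

```python
# Sketch of desired-state package reconciliation behind automated
# RPM installation. Package names are invented for illustration.
def plan_updates(desired, installed):
    """Return (to_install, to_remove) to bring a node to desired state."""
    to_install = sorted(set(desired) - set(installed))
    to_remove = sorted(set(installed) - set(desired))
    return to_install, to_remove

to_install, to_remove = plan_updates(
    desired={"kernel-2.4.18", "openssh-3.1", "lcfg-client-1.0"},
    installed={"kernel-2.4.9", "openssh-3.1"},
)
print("install:", to_install)
print("remove:", to_remove)
```

Because the plan is derived from declared state rather than hand-run commands, the same mechanism scales from one node to thousands, which is exactly what makes the "scalable access to RPM files" item matter: every node fetches its missing packages from the repository at once.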

Monitoring Actions (SCADA)
Prototype 0 (March):
- Re-implementation of PEM with PVSS
- Monitor major farms (LXPlus, LXShare, LXBatch)
- Capable of 100 parameters/machine

Prototype 1 (mid-year):
- Monitor software components (e.g. CASTOR)
- Automate simple actions
- Connect to the configuration management system
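The collect-and-alert pattern behind these prototypes can be sketched as follows: each machine reports parameter values, and a central check raises an alert whenever a value exceeds its threshold. The metric names and thresholds are illustrative assumptions, not taken from PEM or PVSS:

```python
# Sketch of threshold-based alerting over per-machine metrics.
# Metric names and limits are invented for illustration.
THRESHOLDS = {"load": 10.0, "disk_used_pct": 90.0, "temp_c": 70.0}

def check_node(node, metrics):
    """Return alert strings for every metric exceeding its threshold."""
    alerts = []
    for name, value in metrics.items():
        limit = THRESHOLDS.get(name)
        if limit is not None and value > limit:
            alerts.append(f"{node}: {name}={value} exceeds limit {limit}")
    return alerts

print(check_node("lxbatch042", {"load": 14.2, "disk_used_pct": 55.0}))
```

At 100 parameters per machine across several farms this per-node check is the easy part; the scalability work in the prototypes is in transporting and storing the resulting data streams centrally.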

Conclusions
- There is broad consensus that production environments are important now
  - We must evolve our understanding of what makes a predictable environment
  - Architectural, development and technology issues
  - We will start taking developments into production
  - Much is to be done and much work is still underway
- Simplification is the key
- We must track developments
  - Web services and architectural evolution
  - Commercial interest in providing solutions