What is SAM-Grid? Job Handling, Data Handling, Monitoring and Information


Problems To Solve
- How can a large, geographically distributed, dynamic physics collaboration work together?
- How can this collaboration make use of available distributed computing resources?
- How can it handle the huge amount of data (PBs) generated by the experiment?

Answers – The GRID & SAM-Grid
- GRID: a network of middleware services that ties together distributed resources (the fabric: processors, storage).
- SAM-Grid: integrates the standard middleware into a complete job, data, and information management infrastructure, thereby enabling fully distributed computing.

SAM-Grid Architecture

Job Management
- Grid-level (global) job scheduling (selection of a cluster to run on) is distinguished from local scheduling (distribution of the job within the cluster).
- We distinguish structured jobs from unstructured ones:
  - Structured jobs have their details known to the Grid middleware.
  - Unstructured jobs are mapped as a whole onto a cluster.
- The scheduler is interfaced with the data handling system: for data-intensive jobs, sites are ranked by the amount of data cached at the site.
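The data-aware ranking mentioned in the last bullet can be illustrated with a small sketch. This is a hypothetical toy, not the actual SAM-Grid matchmaker; the site names and data structures are invented for illustration:

```python
# Hypothetical sketch: rank candidate execution sites for a data-intensive
# job by the fraction of the job's input dataset already cached at each site.

def rank_sites(sites, dataset_files):
    """Return site names sorted by how much of the dataset they cache.

    sites         -- mapping of site name -> set of cached file names
    dataset_files -- set of file names the job needs
    """
    def cached_fraction(site):
        return len(sites[site] & dataset_files) / len(dataset_files)
    return sorted(sites, key=cached_fraction, reverse=True)

sites = {
    "fnal":    {"f1", "f2", "f3"},
    "gridka":  {"f1"},
    "ccin2p3": set(),
}
print(rank_sites(sites, {"f1", "f2", "f3", "f4"}))
# "fnal" ranks first: it caches 3 of the 4 requested files
```

A real broker would combine this rank with load, availability, and policy, but the cache overlap is the data-handling input to the decision.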

Job Handling
[Diagram: a user interface submits a job through the submission service; the resource selector / match-making service, fed by an information collector and an optional external algorithm, picks an execution site; each execution site (#1 .. #n) runs grid sensors, a computing element behind the Grid/Fabric interface, and generic services.]

Data Handling – SAM
- SAM is a distributed data movement and management service.
- SAM stations are resources pooled together to enable data management.
- Data replication is achieved by the use of disk caches during file routing.
- SAM is a fully functional meta-data catalog.
- A station can access a remote resource via the services offered by other connected stations.
[Diagram: local and remote stations, each with disk caches, exchanging control flow and data flow with mass storage systems (MSS – Mass Storage System).]
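The routing-and-caching behavior described above can be sketched in a few lines. This is an illustrative model only (the class and method names are invented, not the real SAM API): a station serves a file from its local cache when possible, otherwise routes the request through a connected peer, and caching the file as it passes through is what produces replication:

```python
# Toy model of SAM-style file routing between stations (names invented).

class Station:
    def __init__(self, name, cache=None, peers=None):
        self.name = name
        self.cache = set(cache or [])   # file names held on local disk
        self.peers = peers or []        # other connected stations

    def fetch(self, filename):
        """Return the name of the station the file was found at, or None.

        Routing a file through this station leaves a replica in its cache.
        """
        if filename in self.cache:
            return self.name
        for peer in self.peers:
            found_at = peer.fetch(filename)
            if found_at is not None:
                self.cache.add(filename)   # replica created by routing
                return found_at
        return None

mss = Station("mss", cache={"raw_run_001"})
remote = Station("remote", peers=[mss])
local = Station("local", peers=[remote])
print(local.fetch("raw_run_001"))    # found via the remote station's MSS
print("raw_run_001" in local.cache)  # a replica now lives in the local cache
```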

Data Handling
[Diagram: per-site station servers (Stations 1 .. n) with locally shared mass storage system(s); globally shared components: central database server(s), name server, global resource manager(s), and log server services. Arrows indicate control and data flow.]

Monitoring and Information
This includes:
- a configuration framework
- resource description for job brokering
- infrastructure for monitoring
Main features:
- monitoring of sites (resources), services, and jobs
- distributed knowledge about jobs, etc.
- incremental knowledge building
- Grid Monitoring Architecture for current-state inquiries; logging for recent-history studies
- all Web based

Monitoring and Information
[Diagram: Web browsers query Web servers (one or more per site) over IP; each Web server is backed by its site's information system, across Sites 1 .. N.]

Challenges with the Grid/Fabric Interface
The Globus toolkit Grid/Fabric interfaces are not sufficiently...
- ...flexible: they expect a "standard" batch system configuration.
- ...scalable: a process per grid job is started on the gateway machine; we want/need aggregation.
- ...comprehensive: they interface to the batch system only. What about data handling, local monitoring, databases, etc.?
- ...robust: if the batch system forgets about a job, they cannot react.

Flexibility
- Addressing the peculiarities of each batch system's configuration requires modifications to the Globus toolkit job-manager.
- We address the problem by writing job-managers that use a level of abstraction on top of the batch systems.
- Each batch system adapter can be locally configured to conform to the local batch system interface.
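The adapter idea above can be sketched as follows. This is a minimal illustration of the abstraction layer, with invented class and method names (the real SAM-Grid job-managers are not structured exactly like this): the grid job-manager programs against one interface, and each site plugs in an adapter configured for its local batch system:

```python
# Sketch of a batch-system abstraction layer (names invented for illustration).

class BatchAdapter:
    """Uniform interface the grid job-manager programs against."""
    def submit_command(self, job_script):
        raise NotImplementedError

class PBSAdapter(BatchAdapter):
    def __init__(self, queue="default"):
        self.queue = queue               # locally configured queue name
    def submit_command(self, job_script):
        return ["qsub", "-q", self.queue, job_script]

class CondorAdapter(BatchAdapter):
    def submit_command(self, job_script):
        return ["condor_submit", job_script]

def grid_submit(adapter, job_script):
    # The job-manager only ever sees the abstract interface, so local
    # peculiarities never require patching the grid-level code.
    return adapter.submit_command(job_script)

print(grid_submit(PBSAdapter(queue="d0farm"), "job.sh"))
```

Swapping batch systems at a site then means swapping (or reconfiguring) the adapter, not modifying the job-manager.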

Scalability
- The Globus gatekeeper starts a process on the gateway node for every job entering the site. This limits the number of grid jobs at a site to around 300 on a typical commodity computer.
- The SAM-Grid job-managers split a single grid job into multiple batch processes. This not only increases scalability but also improves the manageability of the job.
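The fan-out can be sketched as below. This is a hedged illustration (the job-id scheme, the per-process file split, and the numbers are all made up): one grid job, described by its input files, expands into several batch processes at the gateway, so a single gatekeeper-side process manages many workers:

```python
# Toy sketch: expand one grid job into multiple local batch processes.

def split_grid_job(job_id, input_files, files_per_process=2):
    """Map one grid job onto a list of local batch process descriptions."""
    batch = []
    for i in range(0, len(input_files), files_per_process):
        batch.append({
            "process_id": f"{job_id}.{i // files_per_process}",
            "files": input_files[i:i + files_per_process],
        })
    return batch

procs = split_grid_job("grid42", ["f1", "f2", "f3", "f4", "f5"])
print(len(procs))              # 3 batch processes for one grid job
print(procs[0]["process_id"])  # grid42.0
```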

Comprehensiveness
- The standard job-managers interface only to the local batch system.
- We notify other fabric services when a job enters a site:
  - data handling, for data pre-staging
  - monitoring, to monitor a job that is not yet running
  - database, to aggregate queries
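The fan-out of notifications can be illustrated with a small sketch. The service interface and messages here are invented for illustration, not the real SAM-Grid services: on job entry, the job-manager tells each fabric service about the job rather than talking to the batch system alone:

```python
# Illustrative only: notify the site's fabric services when a job enters.

class FabricServices:
    def __init__(self):
        self.log = []
    def notify_data_handling(self, job):
        self.log.append(f"pre-stage input for {job}")
    def notify_monitoring(self, job):
        self.log.append(f"register {job} as queued")
    def notify_database(self, job):
        self.log.append(f"aggregate DB queries for {job}")

def on_job_entry(job, services):
    services.notify_data_handling(job)   # input data starts moving early
    services.notify_monitoring(job)      # the job is visible before it runs
    services.notify_database(job)        # DB access can be batched per job

svc = FabricServices()
on_job_entry("grid42", svc)
print(svc.log)
```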

Robustness
- The standard job-managers cannot react to temporary failures of the local batch system.
- In our experience, PBS, Condor, and BQS have all failed to report the status of a job at some point.
- We write wrappers around the batch systems that implement this extra robustness; we call them "idealizers".
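A toy "idealizer" in the spirit described above (this is not the real SAM-Grid code; the function names and retry policy are invented) wraps the batch system's status query and retries transient failures instead of declaring the job lost the first time the batch system fails to answer:

```python
# Toy idealizer: retry a flaky batch-system status query before giving up.

import time

def idealized_status(query_fn, job_id, retries=3, delay=0.0):
    """Query job status via query_fn, tolerating transient batch failures."""
    for _ in range(retries):
        try:
            status = query_fn(job_id)
            if status is not None:       # batch system answered sensibly
                return status
        except RuntimeError:             # e.g. a qstat/condor_q hiccup
            pass
        time.sleep(delay)                # back off before retrying
    return "unknown"                     # degrade gracefully, don't crash

# Simulated flaky batch system: no answer twice, then a real status.
answers = iter([None, None, "running"])
print(idealized_status(lambda j: next(answers), "grid42.0"))  # running
```

The real idealizers sit between the job-manager and the batch commands, so the rest of the system sees an "idealized" batch system that always answers.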