Distributed Data Access and Resource Management in the D0 SAM System
I. Terekhov, Fermi National Accelerator Laboratory, for the SAM project: L. Carpenter, L. Lueking, C. Moore, J. Trumbo, S. Veseli, M. Vranicar, S. White, V. White
Plan of Attack
- The domain: D0 overview and applications
- SAM as a Data Grid: metadata, file replication, initial resource management
- SAM and generic Grid technologies
- Comprehensive resource management
D0: A Virtual Organization
- High Energy Physics (HEP) collider experiment, multi-institutional
- Collaboration of 500+ scientists, 72+ institutions, 18+ countries
- Physicists generate and analyze data
- Coordinated resource sharing (networks, MSS, etc.) for solving a common problem: physics analysis
Applications and Data Intensity
- Real data taking from the detector
- Monte Carlo data simulation
- Reconstruction
- Analysis: the gist of experimental HEP
- Extremely I/O intensive
- Recurrent processing of datasets: caching is highly beneficial
Data Handling as the Core of D0 Meta-Computing
- HEP applications are data-intensive
- The computational economy is extremely data-centric, because costs are driven by data-handling (DH) resources
- SAM is primarily and historically a DH system: a working Data Grid prototype
- Job control is being added in the Grid context (the D0-PPDG project)
SAM as a Data Grid
[Architecture diagram: layered Data Grid services]
- High-level services: Replication Cost Estimation, Replica Selection, Data Replication, Comprehensive Resource Management
- Generic Grid services
- Core services: Metadata, Resource Management, Mass Storage Systems (external to SAM)
Based on: A. Chervenak, I. Foster, C. Kesselman, C. Salisbury, S. Tuecke, "The Data Grid: Towards an Architecture for the Distributed Management and Analysis of Large Scientific Datasets," to appear in the Journal of Network and Computer Applications.
Data Replica Management Levels
- A Processing Station is a (locally distributed, semi-autonomous) collection of hardware resources (disk, CPU, etc.) together with the software component that manages them
- Local data replication: for parallel processing within a single batch system, i.e., within a Station
- Global data replication: worldwide data exchange among Stations and MSSs
Local Data Replication
- Consider a cluster with a physically distributed disk cache
- The cache is logically partitioned by research groups (controlled, coordinated sharing)
- Each group runs an independent cache-replacement algorithm (FIFO, LRU, many flavors; sketched below)
- The replica catalog is updated in the course of cache replacement
- The access history of each local replica is maintained persistently in the metadata
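A minimal sketch of how one group's cache partition might run LRU replacement while keeping the replica catalog in sync. The `GroupCache` class and the catalog's `add_replica`/`remove_replica` calls are illustrative assumptions, not SAM's actual interfaces:

```python
# Hypothetical per-group cache partition with LRU replacement.
from collections import OrderedDict

class GroupCache:
    """One research group's slice of the distributed disk cache."""
    def __init__(self, quota_bytes):
        self.quota = quota_bytes
        self.used = 0
        self.files = OrderedDict()  # file name -> size, oldest first

    def access(self, name, size, catalog):
        if name in self.files:
            self.files.move_to_end(name)       # refresh LRU position
            return
        # Evict least-recently-used replicas until the new file fits.
        while self.used + size > self.quota and self.files:
            victim, vsize = self.files.popitem(last=False)
            self.used -= vsize
            catalog.remove_replica(victim)     # keep replica catalog in sync
        self.files[name] = size
        self.used += size
        catalog.add_replica(name)              # record the new local replica
```

Swapping the `OrderedDict` ordering policy yields the other "flavors" the slide mentions (FIFO simply skips the `move_to_end` refresh on hits).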
Local Data Replication, cont'd
- While Resource Managers strive to keep jobs and their data in proximity (see below), the Batch System does not always dispatch a job to where its data lies
- The Station then performs intra-cluster data replication on demand, fully transparently to the user (see the sketch below)
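A sketch of the on-demand step under stated assumptions: the `replica_catalog` lookup, `copy_fn` transfer hook, and `add_replica` update are hypothetical names standing in for the Station's internal machinery:

```python
# Hypothetical on-demand intra-cluster replication.
def ensure_local(filename, job_node, replica_catalog, copy_fn):
    """Pull a replica to the job's node if it is not already there."""
    nodes = replica_catalog.locations(filename)   # nodes holding a replica
    if job_node in nodes:
        return                                    # data already in proximity
    source = nodes[0]                             # any peer replica will do
    copy_fn(source, job_node, filename)           # intra-cluster transfer
    replica_catalog.add_replica(filename, job_node)
```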
Forwarding + Caching = Global Replication
[Diagram: global replication data flow over the WAN among the User (producer), Stations holding site replicas, and Mass Storage Systems]
Goals of Resource Management
- Implement experiment policies on prioritization and fair sharing of resource usage, by user categories (access modes, research group, etc.)
- Maximize throughput in terms of real work done (i.e., user jobs, not internal system jobs such as data transfers)
RM Approaches
- Fair sharing (policies): allocation of resources and scheduling of jobs. The goal is to ensure that, in a busy environment, each abstract user gets a fixed share of "resources" or a fixed share of "work" done
- Co-allocation and reservation (optimization)
FS and Computational Economy
- Jobs, when executed, incur costs (through resource utilization) and realize benefits (by getting work done)
- Maintain a tuple (vector) of cumulative costs/benefits for each abstract user and compare it against the user's allocated fair share to raise or lower priority (see the sketch below)
- Incorporates all known resource types and benefit metrics; fully flexible
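One way the comparison could work, as a hedged sketch: collapse each user's cumulative cost vector to a scalar and boost whoever lags their allocated share. The metric names and weights below are illustrative assumptions, not SAM's actual configuration:

```python
# Hypothetical cost/benefit fair-share priority computation.
COST_WEIGHTS = {"tape_mounts": 10.0, "bytes_transferred": 1e-9, "cpu_seconds": 0.01}

def scalar_cost(cost_vector):
    """Collapse a per-resource cost vector to one comparable number."""
    return sum(COST_WEIGHTS.get(k, 0.0) * v for k, v in cost_vector.items())

def priority(user_costs, fair_share, all_costs):
    """Higher when the user has consumed less than the allocated share."""
    total = sum(scalar_cost(c) for c in all_costs.values()) or 1.0
    consumed_fraction = scalar_cost(user_costs) / total
    return fair_share - consumed_fraction
```

Because everything funnels through the weight table, new resource types or benefit metrics are added by extending the vector, which matches the flexibility the slide claims.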
The Hierarchy of Resource Managers
- Global RM: spans sites connected by the WAN; applies experiment policies, fair-share allocations, and cost metrics
- Site RM: spans Stations and MSSs connected by LANs
- Local RM (the Station): manages batch queues and disks
Job Control: Station Integration with the Abstract Batch System
[Diagram: the client runs "sam submit"; the Local RM (Station Master), which performs fair-share job scheduling and resource co-allocation, invokes the Job Manager (Project Master); the Job Manager submits to the Batch System (setJobCount/stop); the Batch System dispatches the Process Manager (a SAM wrapper script), which invokes the User Task and resubmits it until the SAM condition is satisfied; jobEnd is reported back]
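A rough sketch of the resubmission loop the diagram implies, under stated assumptions: `batch_system.submit`, `job.wait`, and the `sam_condition` callable are hypothetical stand-ins for the wrapper-script and Project Master interactions:

```python
# Hypothetical SAM wrapper resubmission loop.
def run_project(batch_system, user_task, sam_condition, max_jobs):
    """Resubmit the user task until the SAM condition is satisfied."""
    submitted = 0
    while not sam_condition() and submitted < max_jobs:
        job = batch_system.submit(user_task)   # wrapper dispatches the task
        job.wait()                             # jobEnd notification
        submitted += 1                         # resubmit if work remains
```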
SAM as a Data Grid (services mapped to SAM mechanisms)
- Replication Cost Estimation: cached data, file-transfer queues, Site RM "weather conditions" (a cost-based selection sketch follows below)
- Replica Selection: preferred locations
- Data Replication: caching, forwarding, pinning
- Comprehensive Resource Management: DH-batch system integration, fair-share allocation, MSS access control, network access control
- Core services: Metadata (replica catalog, system configuration, cost/benefit metrics); Resource Management (batch-system internal RM, MSS internal RM; external to SAM); Mass Storage Systems (external to SAM)
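An illustrative sketch of cost-based replica selection built from the inputs this slide names (cached data, transfer-queue depth, "weather conditions"). The `Location` accessor methods and the penalty weights are assumptions for the example, not SAM's API:

```python
# Hypothetical cost model for choosing among candidate replica locations.
def replication_cost(loc, filename):
    if loc.has_cached(filename):
        return 0.0                              # already cached: free
    queue_penalty = 5.0 * loc.transfer_queue_depth()
    weather_penalty = 100.0 * (1.0 - loc.link_quality())   # 0..1 "weather"
    mss_penalty = 50.0 if loc.requires_tape(filename) else 0.0
    return queue_penalty + weather_penalty + mss_penalty

def select_replica(locations, filename):
    """Pick the location with the lowest estimated replication cost."""
    return min(locations, key=lambda loc: replication_cost(loc, filename))
```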
SAM Grid Work (D0-PPDG)
- Enhance the system by adding Grid services (Grid authentication, replica selection, etc.)
- Adapt the system to generic Grid services: replace proprietary tools and internal protocols with Grid standards
- Collaborate with computer scientists to develop new Grid technologies, using SAM as a testbed for testing and validating them
Initial PPDG Work: Condor/D0 Job Scheduling, Preliminary Architecture
[Diagram: Job Management (a Grid meta-scheduler built from Condor MMS, DAGMan, Condor, and Condor-G) talks to Data Management (the D0 Data Grid: "sam submit", data and DH resources, the SAM abstract batch system) through Condor/SAM-Grid and SAM/Condor-Grid adapters over standard Grid protocols; the meta-scheduler schedules jobs and queries the costs of candidate job placements]
Conclusions
- D0 SAM is not only a production meta-computing system but also a functioning Data Grid prototype, with data replication and resource management at an advanced/mature stage
- Work continues to fully Grid-enable the system
- We hope some of our components and services will be of interest to the Grid community