Virtual Data Management for CMS Simulation Production: A GriPhyN Prototype

1 Virtual Data Management for CMS Simulation Production: A GriPhyN Prototype

2 Goals (10/2001)
- Explore
  - virtual data dependency tracking
  - data derivability
  - integrate virtual data catalog functionality
  - use of DAGs in virtual data production
- Identify
  - architectural issues: planners, catalogs, interfaces
  - hard issues in executing real production physics applications
- Create prototypes
  - tools that can go into the VDT
- Test virtual data concepts on something "real"

3 Which Part of GriPhyN
[Figure: GriPhyN architecture diagram. Component labels: Application, Planner, Executor, Catalog Services, Info Services, Policy/Security, Monitoring, Repl. Mgmt., Reliable Transfer Service, Compute Resource, Storage Resource. Data-flow labels: cDAG, aDAG. Technology labels: DAGMAN, Kangaroo; GRAM; GridFTP, GRAM, SRM; GSI, CAS; MDS; MCAT, GriPhyN catalogs; GDMP; MDS; Globus. Legend text: "= initial solution is operational".]

4 What Was Done
- Created:
  - A virtual data catalog
    - A catalog schema for an RDBMS
  - A "virtual data language" → VDL
  - A VDL command interpreter
  - Simple DAGs for the CMS pipeline
  - Complex DAGs for a canonical test application
    - Kanonical executable for GriPhyN → keg
- These DAGs actually execute on a Condor-Globus Grid

5 The CMS Challenge
- Remember Rick's slides and the complexity!
- Types of executables (4)
- Parameters, inputs, and outputs
- Templates of parameter lists
- Sensitivities of binaries
  - Dynamic libraries
  - Environment variables
  - Condor-related environment issues (less obvious)

6 Assumptions
- Grid activity takes place as sub-jobs
- Some subset of sub-jobs create tracked, durable data products; these are tracked in the virtual data catalog
- Job execution mechanisms execute VDL functions to describe their virtual data manipulations: dependencies and derivations
- The products of sub-jobs are physical instances of logical files (i.e., physical files)
- Planners decide where physical files should be created
- Physical copies of logical files are tracked in the replica catalog

7 The VDL
begin v /bin/cat
  arg -n
  file i filename1
  file i filename2
  stdout filename3
  env key=value
end
Resulting invocation:
setenv …
/bin/cat -n filename1 filename2 filename3
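To make the mapping from a VDL block to an invocation concrete, here is a minimal Python sketch. It is not the prototype's interpreter: the function names `parse_vdl` and `to_command` are hypothetical, the keyword set is only what appears on the slides (`arg`, `file i`, `file o`, `stdin`, `stdout`, `env`), and rendering `stdin`/`stdout` as shell redirections and leaving `file o` names off the command line are assumptions.

```python
import shlex


def parse_vdl(text):
    """Parse one 'begin v <program> ... end' block into a plain dict."""
    tokens = text.split()
    if len(tokens) < 4 or tokens[0] != "begin" or tokens[-1] != "end":
        raise ValueError("expected 'begin v <program> ... end'")
    # tokens[1] is the 'v' marker shown on the slide; tokens[2] is the program.
    spec = {"program": tokens[2], "args": [], "inputs": [], "outputs": [],
            "stdin": None, "stdout": None, "env": {}}
    body = tokens[3:-1]
    i = 0
    while i < len(body):
        key = body[i]
        if key == "file":
            direction, name = body[i + 1], body[i + 2]
            spec["inputs" if direction == "i" else "outputs"].append(name)
            i += 3
        elif key == "arg":
            spec["args"].append(body[i + 1])
            i += 2
        elif key in ("stdin", "stdout"):
            spec[key] = body[i + 1]
            i += 2
        elif key == "env":
            name, _, value = body[i + 1].partition("=")
            spec["env"][name] = value
            i += 2
        else:
            raise ValueError(f"unknown VDL keyword: {key}")
    return spec


def to_command(spec):
    """Render a shell command line (assumption: stdin/stdout become redirections)."""
    parts = [spec["program"], *spec["args"], *spec["inputs"]]
    cmd = " ".join(shlex.quote(p) for p in parts)
    if spec["stdin"]:
        cmd += " < " + shlex.quote(spec["stdin"])
    if spec["stdout"]:
        cmd += " > " + shlex.quote(spec["stdout"])
    return cmd
```

For the `/bin/cat` block above, `to_command(parse_vdl(...))` yields the program, the `-n` argument, and the two input names, followed by a redirection into `filename3`; the `env` pair is kept separately in `spec["env"]` so the caller can set it before execution.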

8 Dependent Programs
begin v /bin/phys1
  arg -n
  file i f1
  file i f2
  stdout f3
  env key=value
end
begin v /bin/phys2
  arg -m
  file i f1
  file i f3
  file o f4
  env key=value
end
…note that dependencies can be complex graphs

9 The Interpreter
- How program invocations are formed
  - Environment variables
  - Regular parameters
  - Input files
  - Output file
- How DAGs are formed
  - Recursive determination of dependencies
  - Parallel execution
- How scripts are formed
  - Recursive determination of dependencies
  - Serial execution (now); parallel is possible
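The "recursive determination of dependencies" step for serial scripts can be sketched as a depth-first walk from the requested file back through its producers. This is an illustrative Python sketch, not the interpreter's code: `serial_order`, `producer`, and `inputs` are hypothetical names, and the example data is taken from the phys1/phys2 pair on the Dependent Programs slide.

```python
def serial_order(target, producer, inputs):
    """Return derivations in an order that can be run serially to build `target`.

    producer: maps a logical file name to the derivation that writes it
    inputs:   maps a derivation to the logical files it reads
    """
    order, seen = [], set()

    def visit(lfn):
        deriv = producer.get(lfn)
        if deriv is None or deriv in seen:
            return  # primary input file, or derivation already scheduled
        seen.add(deriv)
        for needed in inputs[deriv]:
            visit(needed)  # schedule everything this derivation reads first
        order.append(deriv)

    visit(target)
    return order


# phys1 reads f1, f2 and writes f3; phys2 reads f1, f3 and writes f4.
producer = {"f3": "phys1", "f4": "phys2"}
inputs = {"phys1": ["f1", "f2"], "phys2": ["f1", "f3"]}
print(serial_order("f4", producer, inputs))
```

Each derivation is appended only after everything it reads has been handled, so the resulting list is safe to emit as a serial script. A parallel DAG would use the same walk but record edges instead of a flat order.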

10 Virtual Data Catalog: Relational Database Structure, As Implemented

11 Virtual Data Catalog: Conceptual Data Structure
TRANSFORMATION
  /bin/physapp1
  version 1.2.3b(2)
  created on 12 Oct 1998
  owned by physbld.orca
DERIVATION
  ^ paramlist
  ^ transformation
FILE
  LFN=filename1
  PFN1=/store1/1234987
  PFN2=/store9/2437218
  PFN3=/store4/8373636
  ^ derivation
FILE
  LFN=filename2
  PFN1=/store1/1234987
  PFN2=/store9/2437218
  ^ derivation
PARAMETER LIST
  PARAMETER i filename1
  PARAMETER O filename2
  PARAMETER E PTYPE=muon
  PARAMETER p -g
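One way to read the conceptual structure is as a handful of linked records, where each "^" on the slide is a reference to another record. The Python sketch below is only an illustration of that reading: the class and field names are hypothetical, and interpreting the parameter kinds as input file (`i`), output file (`O`), environment setting (`E`), and plain parameter (`p`) is an assumption, since the slide does not spell them out.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Transformation:
    path: str      # e.g. /bin/physapp1
    version: str
    created: str
    owner: str


@dataclass
class Parameter:
    kind: str      # 'i', 'O', 'E' or 'p' as written on the slide
    value: str


@dataclass
class Derivation:
    transformation: Transformation   # "^ transformation"
    params: List[Parameter]          # "^ paramlist"

    def inputs(self):
        """Logical file names this derivation reads (assumes kind 'i' means input)."""
        return [p.value for p in self.params if p.kind == "i"]


@dataclass
class LogicalFile:
    lfn: str
    pfns: List[str]          # physical copies of this logical file
    derivation: Derivation   # "^ derivation": how the file can be regenerated
```

With records like these, asking how `filename2` was made amounts to following `LogicalFile.derivation` to the transformation and its parameter list.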

12 DAGs & Data Structures
- DAGMan Example
  - TOP generates even random number
  - LEFT and RIGHT divide number by 2
  - BOTTOM sums
[Figure: diamond DAG with jobs random, half, sum and files f.a, f.b, f.c, f.d]

13 DAGs & Data Structures II
begin v random
  stdout f.a
end
begin v half
  stdin f.a
  stdout f.b
end
begin v half
  stdin f.a
  stdout f.c
end
begin v sum
  file i f.b
  file i f.c
  stdout f.d
end
rc f.a out.a
rc f.b out.b
rc f.c out.c
rc f.d out.d
[Figure: diamond DAG with jobs random, half, sum and files f.a, f.b, f.c]
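The diamond shape is implied entirely by which files each derivation reads and writes. The Python sketch below derives the edges from the four derivations above and prints them using DAGMan's `JOB` and `PARENT ... CHILD` keywords. The node names `half_left` and `half_right` and the `.sub` submit-file names are hypothetical, chosen only to tell the two `half` runs apart.

```python
# Files read and written by each derivation on the slide.
derivations = {
    "random":     {"in": [],             "out": ["f.a"]},
    "half_left":  {"in": ["f.a"],        "out": ["f.b"]},
    "half_right": {"in": ["f.a"],        "out": ["f.c"]},
    "sum":        {"in": ["f.b", "f.c"], "out": ["f.d"]},
}


def dag_edges(derivs):
    """Return sorted (parent, child) pairs: parent writes a file that child reads."""
    producer = {lfn: name for name, d in derivs.items() for lfn in d["out"]}
    edges = set()
    for name, d in derivs.items():
        for lfn in d["in"]:
            if lfn in producer:
                edges.add((producer[lfn], name))
    return sorted(edges)


def dagman_lines(derivs):
    """Render the graph as DAGMan-style JOB and PARENT/CHILD lines."""
    lines = [f"JOB {name} {name}.sub" for name in derivs]
    lines += [f"PARENT {p} CHILD {c}" for p, c in dag_edges(derivs)]
    return lines


print("\n".join(dagman_lines(derivations)))
```

The two `half` nodes share a parent and a child but have no edge between them, which is what lets them run in parallel.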

14 DAGs & Data Structures III
XFORM
xid  cu  prg
1    v   rnd
2    v   half
3    v   sum
PARAM
pid  value
1    f.a
2    f.b
3    f.c
4    f.d
DERIVED
xid  pid  ddid  pos  flg
1    1    1          O
2    1    2          I
2    2    2          O
2    1    3          I
2    3    3          O
3    2    4     0    i
3    3    4     1    i
3    4    4          O
RC
pid  pfn
1    out.a
2    out.b
3    out.c
4    out.d
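To show how such tables answer a derivability question, the sketch below loads the diamond example into an in-memory SQLite database and asks which files a given file was derived from. It is an illustration only: the column split of the flattened slide (XFORM: xid, cu, prg; PARAM: pid, value; DERIVED: xid, pid, ddid, pos, flg; RC: pid, pfn) is a reading of the transcript, treating `O` as an output and `I`/`i` as inputs is an assumption, and `inputs_of` is a hypothetical helper.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE xform   (xid INTEGER PRIMARY KEY, cu TEXT, prg TEXT);
CREATE TABLE param   (pid INTEGER PRIMARY KEY, value TEXT);
CREATE TABLE derived (xid INTEGER, pid INTEGER, ddid INTEGER, pos INTEGER, flg TEXT);
CREATE TABLE rc      (pid INTEGER, pfn TEXT);
""")
con.executemany("INSERT INTO xform VALUES (?, ?, ?)",
                [(1, "v", "rnd"), (2, "v", "half"), (3, "v", "sum")])
con.executemany("INSERT INTO param VALUES (?, ?)",
                [(1, "f.a"), (2, "f.b"), (3, "f.c"), (4, "f.d")])
con.executemany("INSERT INTO derived VALUES (?, ?, ?, ?, ?)", [
    (1, 1, 1, None, "O"),
    (2, 1, 2, None, "I"), (2, 2, 2, None, "O"),
    (2, 1, 3, None, "I"), (2, 3, 3, None, "O"),
    (3, 2, 4, 0, "i"), (3, 3, 4, 1, "i"), (3, 4, 4, None, "O"),
])
con.executemany("INSERT INTO rc VALUES (?, ?)",
                [(1, "out.a"), (2, "out.b"), (3, "out.c"), (4, "out.d")])


def inputs_of(lfn):
    """Logical files read by the derivation (ddid) that writes `lfn`."""
    rows = con.execute("""
        SELECT p_in.value
        FROM derived AS d_out
        JOIN param   AS p_out ON p_out.pid = d_out.pid
        JOIN derived AS d_in  ON d_in.ddid = d_out.ddid
        JOIN param   AS p_in  ON p_in.pid  = d_in.pid
        WHERE p_out.value = ? AND d_out.flg = 'O' AND d_in.flg IN ('I', 'i')
        ORDER BY p_in.value
    """, (lfn,)).fetchall()
    return [row[0] for row in rows]


print(inputs_of("f.d"))
```

Calling `inputs_of` repeatedly on its own results walks the diamond back to `f.a`, which has no inputs because `random` only writes.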

15 Abstract and Concrete DAGs
- Abstract DAGs
  - Resource locations unspecified
  - File names are logical
  - Data destinations unspecified
- Concrete DAGs
  - Resource locations determined
  - Physical file names specified
  - Data delivered to and returned from physical locations
- Translation is the job of the "planner"
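The abstract-to-concrete translation can be pictured as filling in the three things the abstract node leaves open: where it runs, which physical copies it reads, and where its outputs land. The following Python sketch is a toy stand-in for a planner, not the GriPhyN one: `concretize`, the site name, and the storage directory are hypothetical, and the replica catalog is reduced to a dict from logical to physical names.

```python
import posixpath


def concretize(job, replica_catalog, site, storage_dir):
    """Turn one abstract DAG node into a concrete one.

    job:             {"program": ..., "inputs": [lfn, ...], "outputs": [lfn, ...]}
    replica_catalog: maps logical file names to a known physical file name
    """
    missing = [lfn for lfn in job["inputs"] if lfn not in replica_catalog]
    if missing:
        raise LookupError(f"no physical copy known for: {missing}")
    return {
        "program": job["program"],
        "site": site,  # resource location is now determined
        "inputs": [replica_catalog[lfn] for lfn in job["inputs"]],
        "outputs": [posixpath.join(storage_dir, lfn) for lfn in job["outputs"]],
    }


abstract_half = {"program": "half", "inputs": ["f.a"], "outputs": ["f.b"]}
concrete_half = concretize(abstract_half, {"f.a": "out.a"}, "siteA", "/scratch")
print(concrete_half)
```

A real planner would also choose among several replicas and sites, for example by cost; this sketch takes both as given.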

16 What We Tested
- DAG structures
  - Diamond DAG
  - Canonical "keg" app in complex DAGs
  - The CMS pipeline
- Execution environments
  - Local execution
  - Grid execution via DAGMan

17 Generality
- simple fabric → very powerful DAGs
- DAGs of this pattern with >260 nodes were run.

18 What We Have Learned
- UNIX program execution semantics is messy but manageable
  - Command line execution is manageable
  - File accesses can be trapped and tracked
- Dynamic loading makes reproducibility more difficult; it should be avoided if possible
- Object handling *obviously* needs concentrated research effort

19 Future Work
- Working with OO Databases
- Handling navigational access
- Refining notion of signatures
- Dealing with fuzzy dependencies and equivalence
- Cost tracking and calculations (w/ planner)
- Automating the cataloging process
  - Integration with portals
  - Uniform execution language
  - Analysis of scripts (shell, Perl, Python, Tcl)
- Refinement of data staging paradigms
- Handling shell details
  - Pipes, 3>&1 (fd #s)

20 Future Work II
- Design of Staging Semantics
  - What files need to be moved where to start a computation
  - How do you know (exactly) where the computation will run, and how to get the file "there" (NFS, local, etc.)
  - How/when to get the results back
  - How/when to trust the catalog
  - Double-check file's existence/safe arrival when you get there to use it
  - DB marking of file's existence: schema, timing
  - Mechanisms to audit and correct consistency of catalog vs. reality
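As a small illustration of the "audit catalog vs. reality" item, the Python sketch below checks every physical file name recorded for a logical file and reports the ones that are not present on the local file system. It assumes the physical names are locally reachable paths, which is the simplest possible case; `audit` and the catalog layout (a dict from logical name to a list of paths) are hypothetical.

```python
import os


def audit(replica_catalog):
    """Return (lfn, pfn) pairs the catalog lists but the file system lacks."""
    stale = []
    for lfn, pfns in sorted(replica_catalog.items()):
        for pfn in pfns:
            if not os.path.isfile(pfn):
                stale.append((lfn, pfn))
    return stale
```

A corrective step could then drop the stale entries or schedule a re-derivation. Checking existence alone does not establish "safe arrival"; a size or checksum comparison would be a natural extension.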

