
Slide 1: GriPhyN & iVDGL Architectural Issues
GGF5 BOF: Data Intensive Applications - Common Architectural Issues and Drivers
Edinburgh, 23 July 2002
Mike Wilde, Argonne National Laboratory
GriPhyN (Grid Physics Network) / iVDGL (International Virtual Data Grid Laboratory)

Slide 2: Project Summary
- Principal requirements
  - IT research: virtual data and transparent execution
  - Grid building: deploy an international grid laboratory at scale
- Components developed/used
  - Virtual Data Toolkit; Linux deployment platform
  - Virtual Data Catalog, request planner and executor, DAGMan, NeST
- Scale of current testbeds
  - ATLAS Test Grid: 8 sites
  - CMS Test Grid: 5 sites
  - Compute nodes: ~900 @ UW, UofC, UWM, UTB, ANL
  - >50 researchers and grid-builders working on IT research challenge problems and demos
- Future directions (2002 & 2003)
  - Extensive work on virtual data, planning, catalog architecture, and fault tolerance

Slide 3: Chimera Overview
- Concept: Tools to support management of transformations and derivations as community resources
- Technology: Chimera virtual data system, including the virtual data catalog and virtual data language; use of the GriPhyN Virtual Data Toolkit for automated data derivation
- Results: Successful early applications to CMS and SDSS data generation/analysis
- Future: Public release of prototype, new applications, knowledge representation, planning

Slide 4: "Chimera" Virtual Data Model
- Transformation designers create programmatic abstractions
  - Simple or compound; augment with metadata
- Production managers create bulk derivations
  - Can materialize data products or leave them virtual
- Users track their work through derivations
  - Augment (replace?) the scientist's log book
- Definitions can be augmented with metadata
  - The key to intelligent data retrieval
  - Issues relating to metadata propagation

Slide 5: CMS Pipeline in VDL-0
(Pipeline stages in the diagram: pythia_input → pythia.exe → cmsim_input → cmsim.exe → writeHits → writeDigis)

begin v /usr/local/demo/scripts/cmkin_input.csh
  file i ntpl_file_path
  file i template_file
  file i num_events
  stdout cmkin_param_file
end

begin v /usr/local/demo/binaries/kine_make_ntpl_pyt_cms121.exe
  pre cms_env_var
  stdin cmkin_param_file
  stdout cmkin_log
  file o ntpl_file
end

begin v /usr/local/demo/scripts/cmsim_input.csh
  file i ntpl_file
  file i fz_file_path
  file i hbook_file_path
  file i num_trigs
  stdout cmsim_param_file
end

begin v /usr/local/demo/binaries/cms121.exe
  condor copy_to_spool=false
  condor getenv=true
  stdin cmsim_param_file
  stdout cmsim_log
  file o fz_file
  file o hbook_file
end

begin v /usr/local/demo/binaries/writeHits.sh
  condor getenv=true
  pre orca_hits
  file i fz_file
  file i detinput
  file i condor_writeHits_log
  file i oo_fd_boot
  file i datasetname
  stdout writeHits_log
  file o hits_db
end

begin v /usr/local/demo/binaries/writeDigis.sh
  pre orca_digis
  file i hits_db
  file i oo_fd_boot
  file i carf_input_dataset_name
  file i carf_output_dataset_name
  file i carf_input_owner
  file i carf_output_owner
  file i condor_writeDigis_log
  stdout writeDigis_log
  file o digis_db
end

Slide 6: Data Dependencies - VDL-1

TR tr1( out a2, in a1 ) {
  profile hints.exec-pfn = "/usr/bin/app1";
  argument stdin = ${a1};
  argument stdout = ${a2};
}

TR tr2( out a2, in a1 ) {
  profile hints.exec-pfn = "/usr/bin/app2";
  argument stdin = ${a1};
  argument stdout = ${a2};
}

DV x1->tr1( a2=@{out:file2}, a1=@{in:file1} );
DV x2->tr2( a2=@{out:file3}, a1=@{in:file2} );

(Dependency graph shown on the slide: file1 → x1 → file2 → x2 → file3)
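The derivations above induce an executable dependency chain: x2 cannot run until x1 has produced file2. As a purely illustrative sketch (the submit-file names are invented, and slide 7 introduces the tool itself), this chain could be handed to Condor DAGMan as:

# Hypothetical DAG for the derivation chain above
# x1 runs tr1 (file1 -> file2); x2 runs tr2 (file2 -> file3)
JOB x1 x1.sub
JOB x2 x2.sub
PARENT x1 CHILD x2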

Slide 7: Executor Example - Condor DAGMan
- Directed Acyclic Graph Manager
- Specify the dependencies between Condor jobs using a DAG data structure
- Manage dependencies automatically
  - e.g., "Don't run job B until job A has completed successfully."
- Each job is a "node" in the DAG
- Any number of parent or child nodes
- No loops
(Diagram: Job A is the parent of Jobs B and C; Job D is the child of both.)
Slide courtesy Miron Livny, U. Wisconsin
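As a minimal sketch, the diamond-shaped example on this slide corresponds to a DAGMan input file along these lines (the submit-file names are hypothetical placeholders); such a file would be submitted with condor_submit_dag:

# diamond.dag - hypothetical description of the four-job diamond
JOB A a.sub
JOB B b.sub
JOB C c.sub
JOB D d.sub
# B and C depend on A; D depends on both B and C
PARENT A CHILD B C
PARENT B C CHILD D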

Slide 8: Chimera Application - Sloan Digital Sky Survey Analysis
Joint work with Jim Annis and Steve Kent, FNAL
(Diagram: the question "What is the size distribution of galaxy clusters?" is answered by combining the Chimera Virtual Data System + GriPhyN Virtual Data Toolkit + iVDGL Data Grid (many CPUs) to produce the galaxy cluster size distribution.)

Slide 9: Cluster-Finding Data Pipeline
(Pipeline diagram: numbered stages transform tsObj and field inputs through brg and core steps into cluster and catalog outputs.)

Slide 10: Small SDSS Cluster-Finding DAG (figure only)

Slide 11: And Even Bigger - 744 Files, 387 Nodes
(Figure: a larger cluster-finding DAG; the counts 108, 168, 60, and 50 label groups of nodes in the graph.)

Slide 12: Vision - Distributed Virtual Data Service
(Diagram: applications at Tier 1 centers, regional centers, and local sites all access a distributed virtual data service built from VDCs.)

Slide 13: Knowledge Management - Strawman Architecture
- Knowledge-based requests are formulated in terms of science data
  - E.g., "Give me a specific transform of channels c, p, & t over time range t0-t1"
- Finder finds the data files
  - Translates the range "t0-t1" into a set of files
- Coder creates an execution plan and defines derivations from known transformations
  - Can deal with missing files (e.g., file c in the LIGO example)
- The knowledge request is answered in terms of datasets
- Coder translates datasets into logical files (or objects, queries, tables, ...)
- Planner translates logical entities into physical entities

Slide 14: GriPhyN/PPDG Data Grid Architecture
(Architecture diagram: an Application submits an abstract DAG to the Planner, which produces a concrete DAG for the Executor (DAGMan, Kangaroo); the Executor drives Compute Resources (GRAM) and Storage Resources (GridFTP; GRAM; SRM) via a Reliable Transfer Service. Supporting services: Catalog Services (MCAT; GriPhyN catalogs), Info Services (MDS), Monitoring (MDS), Replica Management (GDMP), and Policy/Security (GSI, CAS), built largely on Globus.)

Slide 15: Common Problem #1 - (Evolving) View of the Data Grid Stack
(Stack diagram; components include:)
- Data Transport (GridFTP)
- Storage Element / Storage Element Manager
- Local Replica Catalog (flat or hierarchical)
- Reliable File Transfer
- Replica Location Service
- Publish-Subscribe Service (GDMP)
- Reliable Replication

Slide 16: Architectural Complexities (figure only)

Slide 17: Common Problem #2 - Request Planning
- Map of grid resources
- Incoming work to plan
  - Queue? With lookahead?
- Status of grid resources
  - State (up/down)
  - Load (current, queued, and anticipated)
  - Reservations
- Policy
  - Allocation (commitment of a resource to a VO or group based on policy)
- Ability to change decisions dynamically

Slide 18: Policy
- Focus is on resource allocation (not on security)
- Allocation examples:
  - "CMS should get 80% of the resources at Caltech" (averaged monthly)
  - "The Higgs group has high priority at BNL until 8/1"
- Need to apply fair-share scheduling to the grid
- Need to understand the allocation models dictated by funders and data centers

Slide 19: Grids as Overlays on Shared Resources (figure only)

Slide 20: Grid Scheduling Problem
- Given an abstract DAG representing logical work:
  - Where should each compute job be executed?
    - What does site and VO policy say?
    - What does grid "weather" dictate?
  - Where is the required data now?
  - Where should data results be sent?
- Stop and re-schedule computations?
- Suspend or de-prioritize work in progress to let higher-priority work go through?
- Degree of policy control?
- Is a "grid" an entity? An "aggregator" of resources?
- How is data placement coordinated with planning?
- Use of an execution profiler in the planner architecture:
  - Characterize the resource needs of an application over time
  - Parameterize the resource requirements of an application by its parameters
- What happens when things go wrong?

Slide 21: Policy and the Planner
- The planner considers:
  - Policy (fairly static, from CAS/SAS)
  - Grid status
  - Job (user/group) resource consumption history
  - Job profiles (resources over time) from Prophesy

Slide 22: Open Issues - Planner (1)
- Does the planner have a queue? If so, how does it manage that queue?
- How many planners are there? Is the planner a service?
- How is responsibility partitioned between the planner and the executor (cluster scheduler)?
- How many other entities need to be coordinated?
  - RFT, DAPman, SRM, NeST, ...?
  - How to wait on reliable file transfers?
- How does the planner estimate times if it has only partial responsibility for when/where things run?
- How is data placement planning coordinated with request planning?

Slide 23: Open Issues - Planner (2)
- Clearly need incremental planning (e.g., for analysis)
- Stop and re-schedule computations?
- Suspend or de-prioritize work in progress to let higher-priority work go through?
- Degree of policy control?
- Is the "grid" an entity?
- Use of an execution profiler in the planner architecture:
  - Characterize the resource requirements of an application over time
  - Parameterize the resource requirements of an application w.r.t. its (salient) parameters
- What happens when things go wrong?

Slide 24: Issue Summary
- Consolidate the data grid stack
  - Reliable file transfer
  - Reliable replication
  - Replica catalog and virtual data catalog scaled for global use
- Define interfaces and locations of planners
- Unify job workflow representation around DAGs
- Define how to state and manage policy
- Strategies for fault tolerance: similar to re-planning for weather and policy changes?
- Evolution of services to OGSA

