San Diego Supercomputer Center, University of California at San Diego Grid Physics Network (GriPhyN) University of Florida Data Grid and Gridflow Management.

Slides:



Advertisements
Similar presentations
3 September 2004NVO Coordination Meeting1 Grid-Technologies NVO and the Grid Reagan W. Moore George Kremenek Leesa Brieger Ewa Deelman Roy Williams John.
Advertisements

National Partnership for Advanced Computational Infrastructure San Diego Supercomputer Center Data Grids for Collection Federation Reagan W. Moore University.
GFS OGF-22 Global Resource Naming Developers: Reagan Moore Arcot Mike.
The Storage Resource Broker and.
The Storage Resource Broker and.
Overview of the SDSC Storage Resource Broker Wayne Schroeder (and other SRB team members) May, 2004 San Diego Supercomputer Center, University of California.
Data Grid: Storage Resource Broker Mike Smorul. SRB Overview Developed at San Diego Supercomputing Center. Provides the abstraction mechanisms needed.
National Partnership for Advanced Computational Infrastructure San Diego Supercomputer Center Data Grids Reagan W. Moore San Diego Supercomputer Center.
National Partnership for Advanced Computational Infrastructure San Diego Supercomputer Center Data Grids, Digital Libraries and Persistent Archives Reagan.
NATIONAL PARTNERSHIP FOR ADVANCED COMPUTATIONAL INFRASTRUCTURE SAN DIEGO SUPERCOMPUTER CENTER Particle Physics Data Grid PPDG Data Handling System Reagan.
San Diego Supercomputer Center, University of California at San Diego Grid Physics Network (GriPhyN) University of Florida A Data Storage Language for.
San Diego Supercomputer Center National Partnership for Advanced Computational Infrastructure Integration of Data Grids, Digital Libraries, and Persistent.
San Diego Supercomputer Center NARA Research Prototype Persistent Archive Building Preservation Environments with Data Grid Technology (NARA Research Prototype.
San Diego Supercomputer CenterNational Partnership for Advanced Computational Infrastructure1 Grid Based Solutions for Distributed Data Management Reagan.
A Very Brief Introduction to iRODS
Applying Data Grids to Support Distributed Data Management Storage Resource Broker Reagan W. Moore Ian Fisk Bing Zhu University of California, San Diego.
Jean-Yves Nief, CC-IN2P3 Wilko Kroeger, SCCS/SLAC Adil Hasan, CCLRC/RAL HEPiX, SLAC October 11th – 13th, 2005 BaBar data distribution using the Storage.
Robust Tools for Archiving and Preserving Digital Data Joseph JaJa, Mike Smorul, and Mike McGann Institute for Advanced Computer Studies Department of.
Modern Data Management Overview Storage Resource Broker Reagan W. Moore
Web-based Portal for Discovery, Retrieval and Visualization of Earth Science Datasets in Grid Environment Zhenping (Jane) Liu.
Architecture of Grid File System (GFS) - Based on the outline draft - Arun swaran Jagatheesan San Diego Supercomputer Center Global Grid Forum 11 Honolulu,
San Diego Supercomputer Center National Partnership for Advanced Computational Infrastructure San Diego Supercomputer Center National Partnership for Advanced.
San Diego Supercomputer Center National Partnership for Advanced Computational Infrastructure San Diego Supercomputer Center National Partnership for Advanced.
National Partnership for Advanced Computational Infrastructure Digital Library Architecture Reagan Moore Chaitan Baru Amarnath Gupta George Kremenek Bertram.
San Diego Supercomputer CenterUniversity of California, San Diego Preservation Research Roadmap Reagan W. Moore San Diego Supercomputer Center
San Diego Supercomputer Center Grid Physics Network (GriPhyN) University of Florida Programming Gridflows using Matrix Arun Jagatheesan Architect, SDSC.
San Diego Supercomputer Center SDSC Storage Resource Broker SRB as data grid solution (Chinese version) Arun Jagatheesan San Diego Supercomputer.
DISTRIBUTED COMPUTING
San Diego Supercomputer Center Grid Physics Network (GriPhyN) University of Florida Dataflows in SRB using SDSC Matrix Arun Jagatheesan Architect & Team.
Rule-Based Data Management Systems Reagan W. Moore Wayne Schroeder Mike Wan Arcot Rajasekar {moore, schroede, mwan, {moore, schroede, mwan,
San Diego Supercomputer Center Grid Physics Network (GriPhyN) University of Florida Data Grid Management Systems (DGMS) Arun Swaran Jagatheesan San Diego.
San Diego Supercomputer Center National Partnership for Advanced Computational Infrastructure San Diego Supercomputer Center National Partnership for Advanced.
San Diego Supercomputer Center SDSC Storage Resource Broker Data Grid Automation Arun Jagatheesan et al., San Diego Supercomputer Center University of.
San Diego Supercomputer Center National Partnership for Advanced Computational Infrastructure San Diego Supercomputer Center National Partnership for Advanced.
Production Data Grids SRB - iRODS Storage Resource Broker Reagan W. Moore
San Diego Supercomputer Center National Partnership for Advanced Computational Infrastructure San Diego Supercomputer Center National Partnership for Advanced.
CYBERINFRASTRUCTURE FOR THE GEOSCIENCES Data Replication Service Sandeep Chandra GEON Systems Group San Diego Supercomputer Center.
Data Grid Management Systems (DGMS) Arun Jagatheesan San Diego Supercomputer Center
San Diego Supercomputer Center National Partnership for Advanced Computational Infrastructure SRB + Web Services = Datagrid Management System (DGMS) Arcot.
Rule-Based Preservation Systems Reagan W. Moore Wayne Schroeder Mike Wan Arcot Rajasekar Richard Marciano {moore, schroede, mwan, sekar,
National Partnership for Advanced Computational Infrastructure San Diego Supercomputer Center Persistent Management of Distributed Data Reagan W. Moore.
National Partnership for Advanced Computational Infrastructure San Diego Supercomputer Center Persistent Archive for the NSDL Reagan W. Moore Charlie Cowart.
San Diego Supercomputer Center Grid Physics Network (GriPhyN) University of Florida DGL: The Assembly Language for Grid Computing Arun swaran Jagatheesan.
The Replica Location Service The Globus Project™ And The DataGrid Project Copyright (c) 2002 University of Chicago and The University of Southern California.
Ames Research CenterDivision 1 Information Power Grid (IPG) Overview Anthony Lisotta Computer Sciences Corporation NASA Ames May 2,
NA-MIC National Alliance for Medical Image Computing UCSD: Engineering Core 2 Portal and Grid Infrastructure.
GRID Overview Internet2 Member Meeting Spring 2003 Sandra Redman Information Technology and Systems Center and Information Technology Research Center National.
San Diego Supercomputer CenterNational Partnership for Advanced Computational Infrastructure1 Data Grids, Digital Libraries, and Persistent Archives Reagan.
The Global Land Cover Facility is sponsored by NASA and the University of Maryland.The GLCF is a founding member of the Federation of Earth Science Information.
SAN DIEGO SUPERCOMPUTER CENTER By: Roman Olschanowsky An Introduction to the.
Introduction to The Storage Resource.
San Diego Supercomputer Center National Partnership for Advanced Computational Infrastructure San Diego Supercomputer Center National Partnership for.
7. Grid Computing Systems and Resource Management
Biomedical Informatics Research Network The Storage Resource Broker & Integration with NMI Middleware Arcot Rajasekar, BIRN-CC SDSC October 9th 2002 BIRN.
National Archives and Records Administration1 Integrated Rules Ordered Data System (“IRODS”) Technology Research: Digital Preservation Technology in a.
Rights Management for Shared Collections Storage Resource Broker Reagan W. Moore
San Diego Supercomputer Center National Partnership for Advanced Computational Infrastructure San Diego Supercomputer Center National Partnership for Advanced.
Building Preservation Environments with Data Grid Technology Reagan W. Moore Presenter: Praveen Namburi.
Collection-Based Persistent Archives Arcot Rajasekar, Richard Marciano, Reagan Moore San Diego Supercomputer Center Presented by: Preetham A Gowda.
Preservation Data Services Persistent Archive Research Group Reagan W. Moore October 1, 2003.
Grid Services for Digital Archive Tao-Sheng Chen Academia Sinica Computing Centre
Building Preservation Environments from Federated Data Grids Reagan W. Moore San Diego Supercomputer Center Storage.
Data Grids, Digital Libraries and Persistent Archives: An Integrated Approach to Publishing, Sharing and Archiving Data. Written By: R. Moore, A. Rajasekar,
GGF OGSA-WG, Data Use Cases Peter Kunszt Middleware Activity, Data Management Cluster EGEE is a project funded by the European.
Collection Based Persistent Archives
Policy-Based Data Management integrated Rule Oriented Data System
Arcot Rajasekar Michael Wan Reagan Moore (sekar, mwan,
San Diego Supercomputer Center University of California, San Diego
VORB Virtual Object Ring Buffers
Presentation transcript:

San Diego Supercomputer Center, University of California at San Diego Grid Physics Network (GriPhyN) University of Florida Data Grid and Gridflow Management Systems Arun swaran Jagatheesan San Diego Supercomputer Center (SDSC) University of California at San Diego

San Diego Supercomputer Center University of California at San Diego University of Florida 2 “On Demand” Calibration of Content Been here before? a)Data / Storage in Grid - Concepts b)Data / Storage in Grid - Implementation c)Web Services in Grids d)Grid Workflow e)All the above f)Who would be elected in US?

San Diego Supercomputer Center University of California at San Diego University of Florida 3 Talk Outline Introduction Grid Computing, Data Grids Data Grid Infrastructures Data Grid Management System (DGMS) Basic Concepts Implementation Example (SRB) Grid Middleware as Service Gridflows Related Topics and Research Issues Demo / Hands on session

San Diego Supercomputer Center University of California at San Diego University of Florida 4 Acknowledgement: SDSC SRB Team  Arun Jagatheesan  George Kremenek  Sheau-Yen Chen  Arcot Rajasekar  Reagan Moore  Michael Wan  Roman Olschanowsky  Bing Zhu  Charlie Cowart Not In Picture:  Wayne Schroeder  Tim Warnock (BIRN)  Lucas Gilbert  Marcio Faerman (SCEC)  Antoine De Torcy Students: Jonathan Weinberg Yufang Hu Daniel Moore Grace Lin Allen Ding Yi Li Emeritus: Vicky Rowley (BIRN) Qiao Xin Ethan Chen Reena Mathew Erik Vandekieft Xi (Cynthia) Sheng

San Diego Supercomputer Center University of California at San Diego University of Florida 5 Distributed Computing © Images courtesy of Computer History Museum

San Diego Supercomputer Center University of California at San Diego University of Florida 6 The “Grid” Vision

San Diego Supercomputer Center University of California at San Diego University of Florida 7 Data Grids Coordinated Cyberinfrastructure Formed by coordination of multiple autonomous organizations Preserves local autonomy and provides global consistency Logical Namespaces (Virtualizations) Virtualization mechanisms for resources (including storage space, data, metadata, processing pipelines and inter-organizational users)

San Diego Supercomputer Center University of California at San Diego University of Florida 8 Data Grid Using a Data Grid – in Abstract Ask for data User asks for data/storage from the data grid Data delivered The data/storage is found and returned Where & how details are managed by data grid But access controls are specified by owner

San Diego Supercomputer Center University of California at San Diego University of Florida 9 Data Grids – Hype Factor/Reality? Forrester Research “Data and infrastructure are top of mind for grid at more than 50 percent of firms … The vision of data grids will become part of a greater vision of storage virtualization and information life cycle management” – May 2004 CIO Magazine “While most people think of computational grids, enterprises are looking into data grids” – May 2004 Why talk about Busine$$ in an IEEE conference? Necessity drives business; business drives standards and technology evolution; …; grid is not just technology, but also standards

San Diego Supercomputer Center University of California at San Diego University of Florida 10 Talk Outline Introduction Grid Computing, Data Grids Data Grid Infrastructures Data Grid Management System (DGMS) Basic Concepts Implementation Example (SRB) Grid Middleware as Service Gridflows Related Topics and Research Issues Demo / Hands on session

San Diego Supercomputer Center University of California at San Diego University of Florida 11 DGMS Technology usage NSF Southern California Earthquake Center digital library Worldwide Universities Network data grid NASA Information Power Grid NASA Goddard Data Management System data grid DOE BaBar High Energy Physics data grid NSF National Virtual Observatory data grid NSF ROADnet real-time sensor collection data grid NIH Biomedical Informatics Research Network data grid NARA research prototype persistent archive NSF National Science Digital Library persistent archive NHPRC Persistent Archive Test bed

San Diego Supercomputer Center University of California at San Diego University of Florida 12 Southern California Earthquake Center

San Diego Supercomputer Center University of California at San Diego University of Florida 13 Southern California Earthquake Center Build community digital library Manage simulation and observational data Anelastic wave propagation output 10 TBs, 1.5 million files Provide web-based interface Support standard services on digital library Manage data distributed across multiple sites USC, SDSC, UCSB, SDSU, SIO Provide standard metadata Community based descriptive metadata Administrative metadata Application specific metadata

San Diego Supercomputer Center University of California at San Diego University of Florida 14 SCEC Data Management Technologies Portals Knowledge interface to the library, presenting a coherent view of the services Knowledge Management Systems Organize relationships between SCEC concepts and semantic labels Process management systems Data processing pipelines to create derived data products Web services Uniform capabilities provided across SCEC collections Data grid Management of collections of distributed data Computational grid Access to distributed compute resources Persistent archive Management of technology evolution

San Diego Supercomputer Center University of California at San Diego University of Florida 15

San Diego Supercomputer Center University of California at San Diego University of Florida 16 NASA Data Grids NASA Information Power Grid NASA Ames, NASA Goddard Distributed data collection using the SRB ESIP federation Led by Joseph JaJa (U Md) Federation of ESIP data resources using the SRB NASA Goddard Data Management System Storage repository virtualization (Unix file system, Unitree archive, DMF archive) using the SRB NASA EOS Petabyte store Storage repository virtualization for EMC persistent store using the Nirvana version of SRB

San Diego Supercomputer Center University of California at San Diego University of Florida 17 NCSA 6+2 TF 4 TB Memory 400 TB disk SDSC 4.1 TF 2 TB Memory 500 TB SAN Caltech 0.5 TF.4 TB Memory 86 TB disk ANL 1 TF.25 TB Memory 25 TB disk TeraGrid: 13.6 TF, 6.8 TB memory, 900 TB network disk, 10 PB archive HPSS 9 PB ESnet HSCC MREN/Abilene Starlight Juniper M160 OC-12 OC-48 OC p IA-32 Chiba City 128p Origin HR Display & VR Facilities 256p HP X-Class 128p HP V p IA-32 Myrinet Chicago & LA DTF Core Switch/Routers Cisco 65xx Catalyst Switch (256 Gb/s Crossbar) OC-12 OC-3 vBNS Abilene MREN 1176p IBM SP 1.7 TFLOPs Blue Horizon OC-48 NTON x Sun E10K 4 15xxp Origin UniTree 1024p IA p IA vBNS Abilene Calren ESnet OC-12 OC-3 8 Sun Server 16 GbE 24 Extreme Blk Diamond OC-12 ATM Calren

San Diego Supercomputer Center University of California at San Diego University of Florida 18 NIH BIRN SRB Data Grid Biomedical Informatics Research Network Access and analyze biomedical image data Data resources distributed throughout the country Medical schools and research centers across the US Stable high performance grid based environment Coordinate data sharing Federate collections Support data mining and analysis

San Diego Supercomputer Center University of California at San Diego University of Florida 19 BIRN: Inter-organizational Data

San Diego Supercomputer Center University of California at San Diego University of Florida 20 SRB Collections at SDSC

San Diego Supercomputer Center University of California at San Diego University of Florida 21 Commonality in all these projects Distributed data management Data Grids, Digital Libraries, Persistent Archives, Workflow/dataflow Pipelines, Knowledge Generation Data/storage provisioning – multiple domains Common logical namespace for data and storage Data publication Browsing and discovery of data in collections Data Preservation Management of technology evolution

San Diego Supercomputer Center University of California at San Diego University of Florida 22 Talk Outline Introduction Grid Computing, Data Grids Data Grid Infrastructures Data Grid Management System (DGMS) Basic Concepts Implementation Example (SRB) Grid Middleware as Service Gridflows Related Topics and Research Issues Demo / Hands on session

San Diego Supercomputer Center University of California at San Diego University of Florida 23 Common Data Grid Components Federated Middleware (servers and clients) Servers can talk to each other independently of the client Infrastructure independent naming Logical names for users, resources, files, applications Collective ownership of data Collection-owned data, with infrastructure independent access control lists Context management Record state information in a metadata catalog from data grid services such as replication Abstractions for dealing with heterogeneity

San Diego Supercomputer Center University of California at San Diego University of Florida 24 Physical Layer (Real World) Distributed digital entities Heterogeneous and distributed storage resources Autonomous Organizations Distributed Users, distributed authentication Heterogeneous authorization schemes Users; sub-organizations; organizations/enterprises; virtual organizations

San Diego Supercomputer Center University of California at San Diego University of Florida 25 Data Grid Transparencies/Virtualizations (bits,data,information,..) Storage Resource Transparency Storage Location Transparency E:\srbVault\image.jpg /users/srbVault/image.jpg Select … from srb.mdas.td where... Data Identifier Transparency image_0.jpg…image_100.jpg Data Replica Transparency image.sqlimage.cgiimage.wsdl Virtual Data Transparency Semantic data Organization (with behavior) patientRecordsCollectionmyActiveNeuroCollection Inter- organizational Information Storage Management

San Diego Supercomputer Center University of California at San Diego University of Florida 26 Data Grid Transparencies Find data without knowing the identifier Descriptive attributes Access data/storage without knowing the location Logical name space Access data without knowing the type of storage Storage repository abstraction Provide transformations for any data collection Data behavior abstraction

San Diego Supercomputer Center University of California at San Diego University of Florida 27 Data Grid Abstractions Storage repository virtualization Standard operations supported on storage systems Data virtualization Logical name space for files - Global persistent identifier Information repository virtualization Standard operations to manage collections in databases Access virtualization Standard interface to support alternate APIs Latency management mechanisms Aggregation, parallel I/O, replication, caching Security interoperability GSSAPI, inter-realm authentication, collection-based authorization

San Diego Supercomputer Center University of California at San Diego University of Florida 28 Data Organization Physical Organization of the data Distributed Data Heterogeneous resources Multiple formats (structured and unstructured) Logical Organization Impose logical structure for data sets Collections of semantically related data sets Users create their own views (collections) of the data grid

San Diego Supercomputer Center University of California at San Diego University of Florida 29 Data Identifier Transparency Four Types of Data Identifiers: Unique name OID or handle Descriptive name Descriptive attributes – meta data Semantic access to data Collective name Logical name space of a collection of data sets Location independent Physical name Physical location of resource and physical path of data

San Diego Supercomputer Center University of California at San Diego University of Florida 30 Mappings on Resource Name Space Define logical resource name List of physical resources Replication Write to logical resource completes when all physical resources have a copy Load balancing Write to a logical resource completes when copy exist on next physical resource in the list Fault tolerance Write to a logical resource completes when copies exist on “k” of “n” physical resources

San Diego Supercomputer Center University of California at San Diego University of Florida 31 Data Replica Transparency Replication Improve access time Improve reliability Provide disaster backup and preservation Physically or Semantically equivalent replicas Replica consistency Synchronization across replicas on writes Updates might use “m of n” or any other policy Distributed locking across multiple sites Versions of files Time-annotated snapshots of data

San Diego Supercomputer Center University of California at San Diego University of Florida 32 Latency Management -Bulk Operations Bulk register Create a logical name for a file Bulk load Create a copy of the file on a data grid storage repository Bulk unload Provide containers to hold small files and pointers to each file location Bulk delete Mark as deleted in metadata catalog After specified interval, delete file Bulk metadata load Requests for bulk operations for access control setting, …

San Diego Supercomputer Center University of California at San Diego University of Florida 33 Talk Outline Introduction Grid Computing, Data Grids Data Grid Infrastructures Data Grid Management System (DGMS) Basic Concepts Implementation Example (SRB) Grid Middleware as Service Gridflows Related Topics and Research Issues Demo / Hands on session

San Diego Supercomputer Center University of California at San Diego University of Florida 34 Storage Resource Broker Distributed data management technology Developed at San Diego Supercomputer Center (Univ. of California, San Diego) DARPA Massive Data Analysis DARPA/USPTO Distributed Object Computation Test bed 2000 to present - NSF, NASA, NARA, DOE, DOD, NIH, NLM, NHPRC Applications Data grids - data sharing Digital libraries - data publication Persistent archives - data preservation Used in national and international projects in support of Astronomy, Bio-Informatics, Biology, Earth Systems Science, Ecology, Education, Geology, Government records, High Energy Physics, Seismology

San Diego Supercomputer Center University of California at San Diego University of Florida 35 Data Grid Federation Data grids provide the ability to name, organize, and manage data on distributed storage resources Federation provides a way to name, organize, and manage data on multiple data grids.

San Diego Supercomputer Center University of California at San Diego University of Florida 36 SRB Zones Each SRB zone uses a metadata catalog (MCAT) to manage the context associated with digital content Context includes: Administrative, descriptive, authenticity attributes Users Resources Applications

San Diego Supercomputer Center University of California at San Diego University of Florida 37 SRB Peer-to-Peer Federation Mechanisms to impose consistency and access constraints on: Resources Controls on which zones may use a resource User names (user-name / domain / SRB-zone) Users may be registered into another domain, but retain their home zone, similar to Shibboleth Data files Controls on who specifies replication of data MCAT metadata Controls on who manages updates to metadata

San Diego Supercomputer Center University of California at San Diego University of Florida 38 Peer-to-Peer Federation 1. Occasional Interchange - for specified users 2. Replicated Catalogs - entire state information replication 3. Resource Interaction - data replication 4. Replicated Data Zones - no user interactions between zones 5. Master-Slave Zones - slaves replicate data from master zone 6. Snow-Flake Zones - hierarchy of data replication zones 7. User / Data Replica Zones - user access from remote to home zone 8. Nomadic Zones “SRB in a Box” - synchronize local zone to parent 9. Free-floating “myZone” - synchronize without a parent zone 10. Archival “BackUp Zone” - synchronize to an archive SRB Version released December 19, 2003

San Diego Supercomputer Center University of California at San Diego University of Florida 39 Talk Outline Introduction Grid Computing, Data Grids Data Grid Infrastructures Data Grid Management System (DGMS) Basic Concepts Implementation Example (SRB) Grid Middleware as Service Gridflows Related Topics and Research Issues Demo / Hands on session

San Diego Supercomputer Center University of California at San Diego University of Florida 40 Why Data Grids Need Services? Standardization Grid vision can not be achieved without standards Dynamic discovery and late-binding of resources on demand Resources as Services Each resource could be treated as an endpoint Flexibility XML internally helps in interoperable resource descriptions

San Diego Supercomputer Center University of California at San Diego University of Florida 41 Grid Middleware as Services Standard protocols / interfaces Heterogeneous implementation infrastructure Standard PortType for a Data Grid Service Provider Standard Schemes Flexible open standards for operations on Data Grids Adding an organization to a data grid programmatically Adding resources on demand Standard Data models for Resources (Schemas) Description of Resources in a Grid Resources become Services

San Diego Supercomputer Center University of California at San Diego University of Florida 42 Challenges in using Services for Data Grids High Performance Requirement Can verbose XML between clients and grid middleware? Scalability Will XML based services scale or become bottleneck? Wait for SOA Standardization (e.g) WSDL necessary wait though Community Resistance “I will use the regular API – at least for now” Tools for development

San Diego Supercomputer Center University of California at San Diego University of Florida 43 Standardization of Data Grid Services Global Grid Forum (GGF) Data Area Grid File System, DAIS, GSM, … Open Grid Services Architecture (OGSA) WS-Resource Framework Slowly evolving as a community effort

San Diego Supercomputer Center University of California at San Diego University of Florida 44 GFS Service Provider /…/text1.txt /…//text2.txt GRP /txt3.txt GRP Storage-R-Us Resource Providers data + storage (50) Finance Department data + storage (40) Research Lab data + storage (10) /home/arun.sdsc/exp1 /home/arun.sdsc/exp1/text1.txt /home/arun.sdsc/exp1/text2.txt /home/arun.sdsc/exp1/text3.txt data + storage (100) Logical Namespace (Need not be same as physical view of resources )

San Diego Supercomputer Center University of California at San Diego University of Florida 45 Talk Outline Introduction Grid Computing, Data Grids Data Grid Infrastructures Data Grid Management System (DGMS) Basic Concepts Implementation Example (SRB) Grid Middleware as Service Gridflows Related Topics and Research Issues Demo / Hands on session

San Diego Supercomputer Center University of California at San Diego University of Florida 46 Work in progress GfMS is ‘Hard Hat Area’ (Research)

San Diego Supercomputer Center University of California at San Diego University of Florida 47 Gridflow in SCEC (data  information pipeline) Metadata derivation Ingest Metadata Ingest Data Determine analysis pipeline Initiate automated analysis Organize result data into distributed data grid collections Use the optimal set of resources based on the task – on demand Pipeline could be triggered by input at data source or by a data request from user All gridflow activities stored for data flow provenance

San Diego Supercomputer Center University of California at San Diego University of Florida 48 Gridflows Grid Workflow (Gridflow) is the automation of a execution pipeline in which data or tasks are processed through multiple autonomous grid resources according to a set of procedural rules Gridflows are executed on resources that are dynamically obtained through confluence of one or more autonomous administrative domains (peers)

San Diego Supercomputer Center University of California at San Diego University of Florida 49 Data  Discovery Digital entities Meta-data Services State New data updates relationships among data in collections Services invoked to analyze new relationships DGMS applications get notified of state updates Digital entities Meta-data Services State

San Diego Supercomputer Center University of California at San Diego University of Florida 50 What they want? We know the business (scientific) process CyberInfrastructure is all we care (why bother about atoms or DNA)

San Diego Supercomputer Center University of California at San Diego University of Florida 51 What they want? Use DGL to describe your process logic with abstract references to datagrid infrastructure dependencies

San Diego Supercomputer Center University of California at San Diego University of Florida 52 Need for Gridflows Data-intensive and/or compute-intensive processes Long run processes or pipelines on the Grid (e.g) If job A completes execute jobs x, y, z; else execute job B. Self-organization/management of data Semi-automation of data, storage distribution, curation processes (e.g) After each data insert into a collection, update the meta-data information about the collection or replicate the collection Knowledge Generation Offline data analysis and knowledge generation pipelines (e.g) What inferences can be assumed from the new seismology graphs added to this collection? Which domain scientist will be interested to study these new possible pre-results?

San Diego Supercomputer Center University of California at San Diego University of Florida 53 Gridflow Description Requirements Import and export Import or export Gridflows (embedded gridflows) Support and extend existing standards like XQuery, BPEL, SOAP etc., Rules Dynamic rules to control the execution of gridflow Query Runtime Query on status of gridflow Granular Metadata Metadata associated with the steps in a gridflow execution that can be queried Gridflow Patterns Scientific Computing - more looping structures Interest in execution of each iteration and the changes in interested attributes

San Diego Supercomputer Center University of California at San Diego University of Florida 54 Data Grid Language Assembly Language for Grid Computing? (ok, its hype) Describes Gridflow Both structure-based and state-based gridflow patterns Described ECA based rules Inbuilt support to define data grid datatypes like collections, … Query Gridflow Query on the execution of any gridflow (any granular detail) XQuery is used to query on the status of gridflow and its attributes Manage Gridflow Start or stop the gridflow in execution

San Diego Supercomputer Center University of California at San Diego University of Florida 55 Structure and state based Gridflow patterns Simple Sequential Execute steps in a gridflow in a sequence one after another Simple Parallel Start all the steps in a gridflow at the same time For Loop Iteration Execute steps changing some iterator value until a given state is achieved While Block (Milestone) Execute steps while some mile stone can be achieved IF-Else Block Branch based on the evaluation of a state condition Switch-choice(s) Split to execute any of the possible cases based on the context For-each

San Diego Supercomputer Center University of California at San Diego University of Florida 56 SDSC Matrix Project R&D effort that is used in academic projects Gridflow Protocols Gridflow Language Descriptions Version 3.2 released, implements DGL Community based Both Industry and Academia can benefit by participation Involves University of Florida, UCSD, … (Are you In?)

San Diego Supercomputer Center University of California at San Diego University of Florida 57 Gridflow Process I End User using DGBuilder Gridflow Description Data Grid Language

San Diego Supercomputer Center University of California at San Diego University of Florida 58 Gridflow Process II Abstract Gridflow using Data Grid Language Concrete Gridflow Planner

San Diego Supercomputer Center University of California at San Diego University of Florida 59 Gridflow Process III Concrete Gridflow Gridflow P2P Network Gridflow Processor

San Diego Supercomputer Center University of California at San Diego University of Florida 60 SDSC Matrix Project: Open source gridflow effort The growth of the SDSC Matrix Project is made possible by developers and grid-prophets like you (Thank you)

San Diego Supercomputer Center University of California at San Diego University of Florida 61 Talk Outline Introduction Grid Computing, Data Grids Data Grid Infrastructures Data Grid Management System (DGMS) Basic Concepts Implementation Example (SRB) Grid Middleware as Service Gridflows Related Topics and Research Issues Demo / Hands on session

San Diego Supercomputer Center University of California at San Diego University of Florida 62 DGMS Philosophy Collective view of Inter-organizational data Operations on datagrid space Local autonomy and global state consistency Collaborative datagrid communities Multiple administrative domains or “Grid Zones” Self-describing and self-manipulating data Horizontal and vertical behavior Loose coupling between data and behavior (dynamically) Relationships between a digital entity and its Physical locations, Logical names, Meta-data, Access control, Behavior, “Grid Zones”.

San Diego Supercomputer Center University of California at San Diego University of Florida 63 DGMS Research Issues Self-organization of datagrid communities Using knowledge relationships across the datagrids Inter-datagrid operations based on semantics of data in the communities (different ontologies) High speed data transfer Terabyte to transfer Protocols, routers needed Latency Management Data source speed >> data sink speed Datagrid Constraints Data placement and scheduling How many replicas, where to place them…

San Diego Supercomputer Center University of California at San Diego University of Florida 64 Active Datagrid Collections SDSC 121.Event Thit.xml National Lab getEvents() 121.Event Hits.sql University of Gators addEvent() Resources Data Sets Behavior

San Diego Supercomputer Center University of California at San Diego University of Florida 65 Active Datagrid Collections Heterogeneous, distributed physical data SDSC Dynamic or virtual data 121.Event Thit.xml National Lab getEvents() 121.Event Hits.sql University of Gators addEvent() National Lab University of Gators

San Diego Supercomputer Center University of California at San Diego University of Florida 66 Active Datagrid Collections myHEP-Collection SDSC 121.Event Thit.xml National Lab 121.Event Hits.sql University of Gators Logical Collection gives location and naming transparency SDSC Meta-data

San Diego Supercomputer Center University of California at San Diego University of Florida 67 Active Datagrid Collections myHEP-Collection SDSC 121.Event Thit.xml National Lab 121.Event Hits.sql University of Gators Now add behavior or services to this logical collection Meta-data SDSC Collection state and services Horizontal Services getEvents() addEvent()

San Diego Supercomputer Center University of California at San Diego University of Florida 68 Active Datagrid Collections myHEP-Collection SDSC 121.Event Thit.xml National Lab 121.Event Hits.sql University of Gators Meta-data SDSC Collection state and services Horizontal Services getEvents() addEvent() ADC specific Operations + Model View Controllers ADC Logical view of data & operations

San Diego Supercomputer Center University of California at San Diego University of Florida 69 Active Datagrid Collections Digital entities Meta-data Services State Horizontal datagrid services and vertical domain specific services (portType) or pipelines (DGL) Events, collective state, mappings to domain services to be invoked Standardized schema with domain specific schema extensions Physical and virtual data present in the datagrid

San Diego Supercomputer Center University of California at San Diego University of Florida 70 Talk Outline Introduction Grid Computing, Data Grids Data Grid Infrastructures Data Grid Management System (DGMS) Basic Concepts Implementation Example (SRB) Grid Middleware as Service Gridflows Related Topics and Research Issues Demo / Hands on session

San Diego Supercomputer Center University of California at San Diego University of Florida 71 Related Technologies/Links – Data Grids A complete history of the Grid SDSC Storage Resource Broker Globus Data Grid The Legion Project To go a link, use them in Google’s “I’m Feeling Lucky” option (or look into the notes in this presentation)

San Diego Supercomputer Center University of California at San Diego University of Florida 72 Talk Outline Introduction Grid Computing, Data Grids Data Grid Infrastructures Data Grid Management System (DGMS) Basic Concepts Implementation Example (SRB) Grid Middleware as Service Gridflows Related Topics and Research Issues Demo / Hands on session

San Diego Supercomputer Center University of California at San Diego University of Florida 73 Summary Data Grids – next generation data management Lot of possibilities and use cases Transparencies for distributed storage and data Data Grid Management System Need for standard Services Gridflows Peer-2-peer Grid Workflows Research Issues