Distributed Data for Science Workflows Data Architecture Progress Report December 2008

Challenges and Opportunities
– TeraGrid is larger than ever before, meaning data is more widely distributed and needs to be more mobile
– As previously reported, the balance of FLOPS to available storage has changed drastically
– The TeraGrid user portal and science gateways have matured, and interfaces to TG resources have diversified
– We need greater emphasis on unified interfaces to data and on integrating data into common workflows

Constraints on the Architecture
– We cannot address the issue of available storage
– Limited opportunity to improve data transfer performance at the high end
– We cannot introduce drastic changes to TG infrastructure at this stage of the project
– We remain dependent on the availability of technology and resources for wide-area file systems

Goals for the Data Architecture
– Improve the experience of working with data in the TeraGrid for the majority of users: reliability, ease of use, performance
– Integrate data management into the user workflow
– Balance performance goals against usability
– Avoid overdependence on data location
– Support the most common use cases as transparently as possible:
  – Move data in, run job, move data out as the basic pattern (sketched below)
  – Organize, search, and retrieve data from large “collections”
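To make the basic pattern concrete, here is a minimal sketch of "move data in, run job, move data out" as a single script. All hostnames, paths, and the simulation command are hypothetical placeholders rather than actual TeraGrid resources; only scp itself is a real tool.

```python
"""Minimal sketch of the "move data in, run job, move data out" pattern.

Hostnames, paths, and the ./simulate command are hypothetical placeholders.
"""
import subprocess

ARCHIVE_IN = "user@archive.example.org:/archive/project/input.dat"  # hypothetical archive source
SCRATCH_IN = "/scratch/user/input.dat"                              # scratch space on the HPC system
SCRATCH_OUT = "/scratch/user/output.dat"
VIZ_DEST = "user@viz.example.org:/data/project/"                    # hypothetical analysis/viz system


def run(cmd):
    """Run a command, raising an error if it fails."""
    subprocess.run(cmd, check=True)


# 1. Move data in: stage the input from the archive to local scratch.
run(["scp", ARCHIVE_IN, SCRATCH_IN])

# 2. Run the job against the staged input (placeholder application).
run(["./simulate", "--input", SCRATCH_IN, "--output", SCRATCH_OUT])

# 3. Move data out: push the results to the analysis/visualization resource.
run(["scp", SCRATCH_OUT, VIZ_DEST])
```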

Areas of Interest
– Simplifying command-line data movement
– Extending the reach of WAN file systems
– Developing a unified data replication and management infrastructure
– Extending and unifying user portal interfaces to data
– Integrating data into scheduling and workflows
– Providing common access mechanisms to diverse, distributed data resources

Command-Line Tools
– Many users are still oriented towards shell access
– GridFTP is too difficult to use
– SSH is widely known but has limited usefulness in the current configuration
– We need a new approach and/or tool that provides common, easy-to-use data movement without compromising performance (see the sketch below)
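As an illustration of the kind of common, easy-to-use tool this slide calls for, the sketch below is a hypothetical wrapper that selects GridFTP (globus-url-copy) when a gsiftp:// URL is involved and falls back to scp otherwise. The wrapper is not an existing TeraGrid utility; only the two underlying commands are real, and their exact options vary by installation.

```python
"""Hypothetical "tg-copy" wrapper: one familiar front end over two transports."""
import os
import subprocess
import sys


def to_url(path: str) -> str:
    """globus-url-copy expects URLs, so wrap bare local paths as file:// URLs."""
    return path if "://" in path else "file://" + os.path.abspath(path)


def tg_copy(src: str, dst: str) -> None:
    """Copy src to dst, choosing the transport from the addressing style."""
    if src.startswith("gsiftp://") or dst.startswith("gsiftp://"):
        # GridFTP path: better performance, but the certificate setup and URL
        # syntax are exactly what many shell-oriented users find difficult.
        cmd = ["globus-url-copy", to_url(src), to_url(dst)]
    else:
        # SSH path: familiar user@host:path addressing, limited performance.
        cmd = ["scp", src, dst]
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    if len(sys.argv) != 3:
        sys.exit("usage: tg-copy <source> <destination>")
    tg_copy(sys.argv[1], sys.argv[2])
```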

Extending Wide-Area File Systems
– A “wide-area” file system is one available on multiple resources; a “global” file system is one available on all TeraGrid resources
– Indiana and SDSC each have a WAN-FS in production now
– We need an honest assessment of the potential for global file systems, while making WAN file systems available on more resources

Unified Data Management
– Management of both data and metadata, which may be stored at one or more locations in TeraGrid
– Multiple sites support data collections using SRB, iRODS, databases, web services, etc.
– This diversity is good, but it is also confusing to new users
– We need a single service, possibly utilizing multiple technologies, to provide a common entry point for users (see the sketch below)
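One way to picture the "single service, multiple technologies" idea is a thin common interface that dispatches to whichever backend hosts a given collection. The Collection classes, the registry, and the collection names below are hypothetical; iget/iput and Sget/Sput are the standard iRODS and SRB client commands, though exact invocation depends on the deployment.

```python
"""Sketch of a common entry point over heterogeneous collection backends."""
import subprocess


class Collection:
    """Operations a user sees, regardless of the backend technology."""

    def get(self, logical_path: str, local_path: str) -> None:
        raise NotImplementedError

    def put(self, local_path: str, logical_path: str) -> None:
        raise NotImplementedError


class IrodsCollection(Collection):
    def get(self, logical_path, local_path):
        subprocess.run(["iget", logical_path, local_path], check=True)

    def put(self, local_path, logical_path):
        subprocess.run(["iput", local_path, logical_path], check=True)


class SrbCollection(Collection):
    def get(self, logical_path, local_path):
        subprocess.run(["Sget", logical_path, local_path], check=True)

    def put(self, local_path, logical_path):
        subprocess.run(["Sput", local_path, logical_path], check=True)


# Hypothetical registry: users name a collection, not a technology or a site.
BACKENDS = {
    "climate-archive": IrodsCollection(),  # served by iRODS at one site
    "legacy-survey": SrbCollection(),      # served by SRB at another
}


def open_collection(name: str) -> Collection:
    """Single entry point: look up the backend behind a logical collection name."""
    return BACKENDS[name]


# Example: the user works with logical names only.
# open_collection("climate-archive").get("/tgZone/home/user/run42.nc", "run42.nc")
```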

Interfaces to Data
– SSH and “ls” are not effective interfaces to large, complex datasets
– Portal and gateway interfaces to data have proven useful and popular, but they may not be able to access all resources and may require significant gateway developer effort
– Extend the user portal to support WAN file systems and distributed data management
– Could user portal internals be exposed to ease the development of gateways?

Integrating Data into Workflows
– Almost all tasks run on TeraGrid require some data management and multiple storage resources:
  – Moving data into an HPC system
  – Moving results to an analysis or visualization system
  – Moving results to an archive
– We need to make these tasks less human-intensive
– Users should be able to include these steps as part of their job submission (see the sketch below)
– Tools such as DMOVER and PetaShare already exist but are not widely available in TeraGrid
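To make staging as part of job submission concrete, the sketch below generates and submits a batch script that wraps the user's compute step with stage-in and stage-out commands. It is illustrative only and is not DMOVER or PetaShare; the helper function, hostnames, and $SCRATCH variable are hypothetical, while qsub and the #PBS walltime directive are standard PBS.

```python
"""Sketch: express stage-in and stage-out as part of a single job submission."""
import subprocess
import tempfile


def submit_with_staging(stage_in, compute, stage_out, walltime="01:00:00"):
    """Generate a batch script that moves data in, runs the job, then moves data out."""
    lines = ["#!/bin/bash", f"#PBS -l walltime={walltime}"]
    lines += [f"scp {src} {dst}" for src, dst in stage_in]   # move data in
    lines.append(compute)                                    # run the job
    lines += [f"scp {src} {dst}" for src, dst in stage_out]  # move data out
    with tempfile.NamedTemporaryFile("w", suffix=".pbs", delete=False) as f:
        f.write("\n".join(lines) + "\n")
        script_path = f.name
    subprocess.run(["qsub", script_path], check=True)        # hand the script to the scheduler


# Hypothetical usage: stage from an archive, simulate, stage the results back.
submit_with_staging(
    stage_in=[("user@archive.example.org:/archive/input.dat", "$SCRATCH/input.dat")],
    compute="./simulate --input $SCRATCH/input.dat --output $SCRATCH/output.dat",
    stage_out=[("$SCRATCH/output.dat", "user@archive.example.org:/archive/")],
)
```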

Some Implementation Plans
– Extend the current iRODS-based data management infrastructure to additional sites
– Test the use of REDDnet for distributed data storage and access in TeraGrid
– Provide a TGUP interface to Lustre-WAN
– Provide a TGUP interface to distributed data and metadata management
– Extend the current production IU Lustre-WAN and GPFS-WAN to as many compatible resources as possible

More Implementation Plans
– Port DMOVER to additional schedulers and deploy it across TeraGrid
– Develop and execute a plan for PSC-based Lustre-WAN and GPFS/pNFS testing and eventual production deployment (already underway)
– Work with the Gateways group to provide appropriate interfaces to data movement through the user portal or other mechanisms
– Make simple changes to SSH/SCP configuration:
  – Support SCP-based access to data mover nodes
  – Support simpler addressing of data resources

The Cutting, not the Bleeding Edge
– The primary goal is to improve the availability of robust, production technologies for data
– Balancing performance, usability, and reliability will always be a challenge
– We need to be agile in assessing new technologies and improvements to existing ones
– The Data Working Group should focus on configuration improvements for a few production components
– Make consistent, well-planned efforts to evaluate new components

To-Do List for December
– Understand the level of required vs. available effort
– Work with other areas and working groups to place the Data Architecture in context (CTSS, Gateways, etc.)
– Set priorities and order tasks
– Develop timelines and milestones for execution
– Present an integrated Data Architecture description and plan in early January