Data Area Report
Chris Jordan, Data Working Group Lead, TACC
Kelly Gaither, Data and Visualization Area Director, TACC
April 2009

PY4 Data Area Characteristics
Relatively stable software and user tools
Relatively dynamic site/machine configuration
–New sites and systems
–Older systems being retired
TeraGrid emphasis on broadening participation
–Campus Champions
–Science Gateways
–Underrepresented disciplines

PY4 Areas of Emphasis
Improve campus-level access mechanisms
Provide support for gateways and other “mobile” computing models
Improve clarity of documentation
Enhance users' ability to manage complex datasets across multiple resources
Develop a comprehensive plan for future developments in the Data area
Production deployments of Lustre-WAN; a path to global file systems

Data Working Group Coordination
Led by Chris Jordan
Meets bi-weekly to discuss current issues
Has membership from each RP
Attendees are a blend of system administrators, software developers, and users

Wide-Area and Global File Systems
A TeraGrid global file system is a highly requested service
A global file system implies that a single file system is mounted on most TeraGrid resources
–No production-ready global file system solution currently exists
Wide-area file systems give the look and feel of a single file system, using technologies such as GPFS-WAN or Lustre-WAN
–GPFS-WAN has licensing issues and is not available for all platforms
–Lustre-WAN is preferable for both licensing and compatibility reasons
pNFS is a possible path to a global file system, but is far from viable today
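The distinction matters operationally: a wide-area file system only helps a job if the resource it runs on actually mounts it. The following is a minimal, hypothetical sketch (not an official TeraGrid tool) of how a workflow script might detect a wide-area Lustre mount before deciding whether explicit data movement is needed; the mount point name /lustre/wan is an assumption for illustration.

```python
# Sketch: check whether this resource already mounts a wide-area Lustre
# file system by inspecting /proc/mounts. The "/lustre/wan" mount point
# is an assumed name, not an actual TeraGrid path.

def lustre_mounts(proc_mounts="/proc/mounts"):
    """Return (device, mount_point) pairs for all Lustre mounts on this host."""
    mounts = []
    with open(proc_mounts) as f:
        for line in f:
            device, mount_point, fs_type = line.split()[:3]
            if fs_type == "lustre":
                mounts.append((device, mount_point))
    return mounts

if __name__ == "__main__":
    found = lustre_mounts()
    if any(mp == "/lustre/wan" for _, mp in found):
        print("Wide-area Lustre file system available at /lustre/wan")
    else:
        print("No wide-area Lustre mount found; fall back to explicit data movement")
```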

Lustre-WAN Progress
Initial production deployment of Indiana's Data Capacitor Lustre-WAN on IU's BigRed and PSC's Pople
–Declared production in PY4 (involved testing and implementation of security enhancements)
In PY4, successful testing and a commitment to production on LONI's QueenBee, TACC's Ranger/Lonestar, NCSA's Mercury/Abe, and SDSC's IA64
–Additional sites (NICS, Purdue) will begin testing this year
Also in PY4, ongoing work to improve performance and the authentication infrastructure
–Proceeding in parallel with the production deployment

CTSS Efforts in the Data Area
In PY4, created data kits:
–data movement kit – 20 TG resources
–data management kit (SRB) – 4 TG resources
–wide-area file system kits – GPFS-WAN (5), Lustre-WAN (2)
Currently reworking the data kits to include:
–new client-level kits to express functionality and accessibility more clearly
–new server-level kits to report more accurate information on server configurations
–broadened use cases
–requirements for more complex functionality (managing, not just moving, data)
–improved information services to support science gateways and automated resource selection (see the sketch below)
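To suggest how improved information services could support automated resource selection, here is a hypothetical sketch that queries an information-services endpoint for resources advertising a data-movement capability. The URL and XML schema are invented for illustration and do not describe the actual TeraGrid information services.

```python
# Hypothetical sketch of automated resource selection driven by kit metadata:
# fetch an XML listing of resources that advertise a data-movement capability
# and extract their GridFTP endpoints. Endpoint URL and schema are assumptions.
import urllib.request
import xml.etree.ElementTree as ET

INFO_URL = "https://info.example.org/kits/data-movement.xml"  # assumed endpoint

def resources_with_gridftp(url=INFO_URL):
    with urllib.request.urlopen(url) as response:
        tree = ET.parse(response)
    # Assumed schema: <resource name="..."><endpoint>gsiftp://...</endpoint></resource>
    return {r.get("name"): r.findtext("endpoint") for r in tree.iter("resource")}

if __name__ == "__main__":
    for name, endpoint in resources_with_gridftp().items():
        print(f"{name}: {endpoint}")
```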

Data/Collections Management (PY4)
Tested new infrastructure for data replication and management across TeraGrid resources (iRODS)
Assessed archive replication and transition challenges
Gathered requirements for data management clients in CTSS
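For context, the following is a minimal sketch of the ingest-and-replicate pattern that the iRODS testing exercises, driven through the standard icommands (iput, irepl, ils). The resource names, collection path, and file name are placeholders, not actual TeraGrid deployments.

```python
# Sketch: ingest a file into an iRODS collection on one storage resource,
# replicate it to a second resource, and list the replicas for verification.
# Resource and collection names below are placeholders.
import subprocess

def run(cmd):
    """Run an icommand and raise if it fails."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

def ingest_and_replicate(local_file, collection, primary_resc, replica_resc):
    logical_path = f"{collection}/{local_file}"
    run(["iput", "-R", primary_resc, local_file, logical_path])  # ingest to primary resource
    run(["irepl", "-R", replica_resc, logical_path])             # replicate to a second resource
    run(["ils", "-l", logical_path])                             # show all replicas

if __name__ == "__main__":
    ingest_and_replicate("results.tar", "/tgZone/home/user/project",
                         "site-a-disk-resc", "site-b-tape-resc")
```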

Data Architecture
Two primary categories of use for data movement tools in the TeraGrid:
–Users moving data to or from a location outside the TeraGrid
–Users moving data between TeraGrid resources
–(Frequently, users will need to do both within the span of a given workflow)
Moving data to/from a location outside the TeraGrid:
–Tends to involve smaller numbers of files and less overall data
–Problems are primarily with usability, due to tool availability or ease of use

Data Architecture (2)
Moving data between TeraGrid resources:
–Datasets tend to be larger
–Users are more concerned with performance, high reliability, and ease of use
General trend: as the need for data movement has increased, both the complexity of the deployments and the frustration of users have increased. A typical inter-resource transfer is sketched below.
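To make the inter-resource case concrete, here is a minimal sketch of a third-party GridFTP transfer using globus-url-copy with parallel streams. The hostnames and paths are placeholders, a valid grid credential is assumed to already exist, and this is an illustration of the kind of transfer users perform, not a prescribed TeraGrid procedure.

```python
# Sketch: third-party GridFTP transfer between two resources with
# globus-url-copy, using parallel streams and a larger TCP buffer.
# Hostnames and paths are placeholders.
import subprocess

SRC = "gsiftp://gridftp.source.example.org/scratch/user/dataset.tar"  # assumed source
DST = "gsiftp://gridftp.dest.example.org/work/user/dataset.tar"       # assumed destination

def gridftp_copy(src, dst, streams=8, tcp_buffer="8M"):
    cmd = [
        "globus-url-copy",
        "-vb",                   # report transfer performance
        "-p", str(streams),      # parallel TCP streams
        "-tcp-bs", tcp_buffer,   # TCP buffer size per stream
        src, dst,
    ]
    subprocess.run(cmd, check=True)

if __name__ == "__main__":
    gridftp_copy(SRC, DST)
```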

Data Architecture (3)
This is an area in which we think we can have a significant impact:
–Users want reliability, ease of use, and in some cases high performance
–How the technology is implemented should be transparent to the user
–User-initiated data movement, particularly on large systems, has proven to create contention for disk resources

Data Architecture (4)
Data movement requirements:
–R1: Users need reliable, easy-to-use file transfer tools for moving data from outside the TeraGrid to resources inside the TeraGrid.
–R2: Users need reliable, high-performance, easy-to-use file transfer tools for moving data from one TeraGrid resource to another.
–R3: Tools providing transparent data movement are needed on large systems with a low storage-to-flops ratio.
Candidate tools:
–SSH/SCP with the high-performance networking patches (HPN-SCP)
–SCP-based transfers to GridFTP nodes (RSSH)
–TGUP data mover
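As a concrete illustration of the R1 case, here is a minimal, hypothetical sketch of a user pushing results from a workstation to a TeraGrid data node with plain scp. The hostname, username, and paths are placeholders, and the comment about HPN-SSH reflects its general behavior rather than any specific TeraGrid configuration.

```python
# Sketch: push a modest result set from a workstation to a remote data node
# with scp. Where the HPN-SSH patches are installed on both ends, the same
# command generally benefits from larger, auto-tuned buffers without extra
# flags (a general characterization, not a site-specific claim).
import subprocess

def scp_push(local_path, user, host, remote_path):
    subprocess.run(["scp", "-r", local_path, f"{user}@{host}:{remote_path}"], check=True)

if __name__ == "__main__":
    scp_push("./results", "username", "data.example.teragrid.org", "/work/username/results")
```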

Data Architecture (5)
The network architecture of the petascale systems is proving to be a challenge: only a few router nodes are connected directly to wide-area networks, and the remaining compute nodes are routed through them. Wide-area file systems often need direct access. It has become clear that no single solution will provide a production global wide-area file system.
–R4: The “look and feel” of a global wide-area file system, with high availability and high reliability (Lustre-WAN, pNFS).

Data Architecture (6)
Until recently, visualization and, in many cases, data analysis have been treated as post-processing tasks requiring some form of data movement. With the introduction of petascale systems, we are seeing dataset sizes that prohibit data movement or make it necessary to minimize it. Scheduled data movement is anticipated to be one way to guarantee that data is present at the time it is needed.
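As an illustration of what scheduled data movement can look like with ordinary batch tooling (rather than any specific TeraGrid service), here is a minimal sketch that submits a data-staging job and holds the analysis job until staging succeeds. The script names and the PBS/Torque dependency syntax are assumptions about the local batch environment.

```python
# Sketch: "schedule" data movement by chaining batch jobs. A staging job
# (e.g., one that runs globus-url-copy) is submitted first, and the analysis
# job is only released once it completes successfully.
import subprocess

def submit(script, depends_on=None):
    """Submit a batch script with qsub, optionally after another job succeeds."""
    cmd = ["qsub"]
    if depends_on:
        cmd += ["-W", f"depend=afterok:{depends_on}"]
    cmd.append(script)
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout.strip()   # qsub prints the new job ID

if __name__ == "__main__":
    stage_id = submit("stage_data.pbs")                  # stages the dataset
    analysis_id = submit("analyze.pbs", depends_on=stage_id)
    print(f"staging job {stage_id} -> analysis job {analysis_id}")
```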

Data Architecture (7)
Visualization and data analysis tools have not been designed to be data-aware; they assume the data can be read into memory and that applications and tools need not be concerned with exotic file access mechanisms.
–R5: Ability to schedule data availability for post-processing tasks. (DMOVER)
–R6: Availability of data mining/data analysis tools that are more data-aware. (Currently working with the VisIt developers to modify the open-source software, leveraging work done on parallel Mesa.)

Data Architecture (8)
Many TeraGrid sites provide effectively unlimited archival storage to compute-allocated users. The volume of data flowing into and out of particular archives is already increasing drastically, in some cases exponentially, beyond the capacity of the disk caches and tape drives currently allocated.
–R7: The TeraGrid must provide better organized, more capable, and more logically unified access to archival storage for the user community. (Proposal to NSF for a unified approach to archival storage and data replication.)

Plans for PY5
Implement Data Architecture recommendations
–User portal integration
–Data collections infrastructure
–Archival replication services
–Continued investigation of new location-independent access mechanisms (PetaShare, ReDDnet)
Complete production deployments of Lustre-WAN
Develop plans for next-generation Lustre-WAN and pNFS technologies
Work with the CTSS team on continued improvements to the data kit implementations