1 Data Area Report
Chris Jordan, Data Working Group Lead, TACC
Kelly Gaither, Data and Visualization Area Director, TACC
April 2009

2 PY4 Data Area Characteristics
Relatively stable software and user tools
Relatively dynamic site/machine configuration
–New sites and systems
–Older systems being retired
TeraGrid emphasis on broadening participation
–Campus Champions
–Science Gateways
–Underrepresented disciplines

3 PY4 Areas of Emphasis
Improve campus-level access mechanisms
Provide support for gateways and other "mobile" computing models
Improve clarity of documentation
Enhance user ability to manage complex datasets across multiple resources
Develop comprehensive plan for future developments in the Data area
Production deployments of Lustre-WAN, path to global file systems

4 Data Working Group Coordination
Led by Chris Jordan
Meets bi-weekly to discuss current issues
Has membership from each RP
Attendees are a blend of system administrators and software developers

5 Wide-Area and Global File Systems
Providing a TeraGrid global file system is one of the most frequently requested user services
A global file system implies that a file system is mounted on most TeraGrid resources
–No single file system can be mounted across all TG resources
Deploying wide-area file systems, however, is possible with technologies such as GPFS-WAN
–GPFS-WAN has licensing issues and isn't available for all platforms
Lustre-WAN is promising for both licensing and compatibility reasons
Additional technologies such as pNFS will be necessary to make a file system global
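
Since no single file system is mounted on every resource, jobs and scripts often have to detect which shared file system, if any, is available where they run. Below is a minimal Python sketch of that check; the mount points and fallback path are hypothetical, not actual TeraGrid paths.

```python
import os

# Hypothetical mount points for wide-area file systems on a TeraGrid resource;
# actual paths differ per site and per file system (GPFS-WAN, Lustre-WAN, ...).
CANDIDATE_MOUNTS = ["/gpfs-wan", "/lustre-wan"]

def pick_shared_filesystem(fallback="/scratch"):
    """Return the first wide-area file system that is actually mounted here,
    or a local scratch path if none of them are available."""
    for path in CANDIDATE_MOUNTS:
        if os.path.ismount(path):
            return path
    return fallback

if __name__ == "__main__":
    print("Using data directory:", pick_shared_filesystem())
```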

6 Lustre-WAN Progress
Initial production deployment of Indiana's Data Capacitor Lustre-WAN on IU's BigRed and PSC's Pople
–Declared production in PY4 (involved testing and implementation of security enhancements)
In PY4, successful testing and commitment to production on LONI's QueenBee, TACC's Ranger/Lonestar, NCSA's Mercury/Abe, and SDSC's IA64 (expected to go into production before PY5)
–Additional sites (NICS, Purdue) will begin testing in Q4 PY4
Ongoing PY4 work to improve performance and authentication infrastructure
–Proceeds in parallel with production deployment

7 CTSS Efforts in the Data Area
In PY4, created data kits:
–data movement kit
–data management kit
–wide-area file systems kit
Currently reworking data kits to include:
–new client-level kits to express functionality and accessibility more clearly
–new server-level kits to report more accurate information on server configurations
–broadened use cases
–requirements for more complex functionality (managing, not just moving, data)
–improved information services to support science gateways and automated resource selection (see the sketch below)
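
As a rough illustration of the automated resource selection mentioned above, the Python sketch below filters resources by the data kits they advertise. The resource names and kit labels are made up for the example; a real gateway would query the TeraGrid information services rather than a hard-coded list.

```python
# Illustrative resource descriptions; in practice this information would come
# from the TeraGrid information services, not a hard-coded list.
RESOURCES = [
    {"name": "ranger.tacc", "kits": {"data-movement", "wide-area-fs"}},
    {"name": "bigred.iu",   "kits": {"data-movement", "data-management"}},
    {"name": "pople.psc",   "kits": {"data-movement"}},
]

def select_resources(required_kits):
    """Return the resources that advertise every required data kit."""
    required = set(required_kits)
    return [r["name"] for r in RESOURCES if required <= r["kits"]]

if __name__ == "__main__":
    # e.g. a science gateway that needs both data movement and data management
    print(select_resources(["data-movement", "data-management"]))
```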

8 Data/Collections Management
In PY4, tested new infrastructure for data replication and management across TeraGrid resources (iRODS)
In PY4, assessed archive replication and transition challenges
In PY4, gathered requirements for data management clients in CTSS
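
For readers unfamiliar with iRODS, the following sketch shows what a simple register-and-replicate workflow might look like using the standard icommands driven from Python. The resource and collection names are hypothetical, and this is not the specific infrastructure configuration tested in PY4.

```python
import subprocess

def run(cmd):
    """Run an iRODS icommand and fail loudly if it returns non-zero."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Hypothetical iRODS resource and collection names, for illustration only.
SOURCE_RESC = "tacc-disk"
REPLICA_RESC = "iu-archive"
COLLECTION = "/teragridZone/home/user/project-data"

if __name__ == "__main__":
    # Register a local file into the iRODS collection on one resource ...
    run(["iput", "-R", SOURCE_RESC, "results.dat", COLLECTION + "/results.dat"])
    # ... then replicate it to a second resource at another site.
    run(["irepl", "-R", REPLICA_RESC, COLLECTION + "/results.dat"])
```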

9 Data Collection Highlights
Large data collections:
–MODIS satellite imagery of the Earth. Remote sensing data from the Center for Space Research. Grows by ~2.4 GB/day, widely used by earth scientists, many derivative products produced. (6 TB)
–Purdue Terrestrial Observatory. Remote sensing data. (1.4 TB)
–Alaska Herbarium collection. High-resolution scans of >223,000 plant specimens from Alaska and the Circumpolar North. (1.5 TB)
Hosting of data collection services within VMs (provides efficient delivery of services related to modest-scale data sets):
–FlyBase: key resource for Drosophila genomics. Front end hosted within a VM. (2.3 GB)
–MutDB: web services data resource; delivers info on the known effects of mutations in genes (across taxa).

10 Data Architecture (1)
Two primary categories of use for data movement tools in the TeraGrid:
–Users moving data to or from a location outside the TeraGrid
–Users moving data between TeraGrid resources
–(Frequently, users will need to do both within the span of a given workflow)
Moving data to/from a location outside the TeraGrid:
–Tends to involve smaller numbers of files and less overall data
–Problems are primarily with usability, due to tool availability or ease of use
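
As one concrete example of an inbound transfer, the sketch below wraps a GridFTP transfer with globus-url-copy, using parallel streams and a larger TCP buffer, two settings that commonly matter on wide-area links. The host name, paths, and tuning values are assumptions for illustration only.

```python
import subprocess

# Hypothetical endpoints: a file on the user's local machine and a GridFTP
# server at a TeraGrid site. Real host names and paths will differ.
SRC = "file:///home/user/input.tar"
DST = "gsiftp://gridftp.example-site.teragrid.org/scratch/user/input.tar"

def gridftp_copy(src, dst, streams=4, tcp_buffer_bytes=4194304):
    """Copy a file with globus-url-copy, using parallel streams (-p) and a
    larger TCP buffer size in bytes (-tcp-bs) for wide-area performance."""
    cmd = ["globus-url-copy", "-p", str(streams),
           "-tcp-bs", str(tcp_buffer_bytes), src, dst]
    subprocess.run(cmd, check=True)

if __name__ == "__main__":
    gridftp_copy(SRC, DST)
```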

11 Data Architecture (2)
Moving data between TeraGrid resources:
–Datasets tend to be larger
–Users are more concerned with performance, high reliability, and ease of use
General trend we have seen: as the need for data movement has increased, both the complexity of deployments and the frustration of users have increased.

12 Data Architecture (3)
This is an area in which we think we can have a significant impact
–Users want reliability, ease of use, and in some cases high performance
–How the technology is implemented should be transparent to the user
–User-initiated data movement, particularly on large systems, has proven to create problems with contention for disk resources

13 Data Architecture (4)
Data Movement Requirements:
–R1: Users need reliable, easy-to-use file transfer tools for moving data from outside the TeraGrid to resources inside the TeraGrid.
–R2: Users need reliable, high-performance, easy-to-use file transfer tools for moving data from one TeraGrid resource to another.
–R3: Tools for providing transparent data movement are needed on large systems with a low storage-to-flops ratio. (SSH/SCP with the high-performance networking patches (HPN-SCP), SCP-based transfers to GridFTP nodes, RSSH)

14 Data Architecture (5)
Users continue to request a single file system that is shared across all resources.
Wide-area file systems have proven to be a real possibility through the production operation of GPFS-WAN.
There are still significant technical and licensing issues that prevent GPFS-WAN from becoming a global WAN-FS solution.

15 Data Architecture (6)
Network architecture on the petascale systems is proving to be a challenge: only a few router nodes are connected directly to wide-area networks, and the rest of the compute nodes are routed through them. Wide-area file systems often need direct network access.
It has become clear that no single solution will provide a production global wide-area network file system.
–R4: The "look and feel", or the appearance, of a global wide-area file system with high availability and high reliability. (Lustre-WAN, pNFS)

16 Data Architecture (7)
Until recently, visualization and, in many cases, data analysis have been considered post-processing tasks requiring some sort of data movement.
With the introduction of petascale systems, we are seeing data set sizes grow to the point that data movement is prohibited, or at least must be minimized.
Scheduled data movement is anticipated to be one way to guarantee that the data is present at the time it is needed.
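
The sketch below illustrates the pre-staging idea in its simplest form: copy the data to local scratch before the analysis step and verify that it arrived intact. It is only an illustration under assumed paths, not the DMOVER implementation.

```python
import hashlib
import shutil
from pathlib import Path

# Hypothetical locations: a wide-area file system holding the source data and
# a node-local scratch directory where the analysis will run.
SOURCE = Path("/lustre-wan/project/run42/output.dat")
STAGED = Path("/scratch/user/run42/output.dat")

def checksum(path, chunk_size=1 << 20):
    """Compute an MD5 checksum by streaming the file in 1 MB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def stage_data(src, dst):
    """Stage data before the analysis job starts, verifying the copy."""
    dst.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(src, dst)
    if checksum(src) != checksum(dst):
        raise RuntimeError("staged copy does not match source")
    return dst

if __name__ == "__main__":
    print("Analysis can read from:", stage_data(SOURCE, STAGED))
```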

17 Data Architecture (8)
Visualization and data analysis tools have not been designed to be data-aware: they assume that the data can be read into memory and that the applications and tools need not be concerned with exotic file access mechanisms.
–R5: Ability to schedule data availability for post-processing tasks. (DMOVER)
–R6: Availability of data mining/data analysis tools that are more data-aware. (Currently working with VisIt developers to modify the open source software; leveraging work done on parallel Mesa)
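
To make "data-aware" concrete in the simplest case, the sketch below computes basic statistics over a file too large to hold in memory by streaming it in fixed-size chunks. The file format (a flat binary stream of float32 values) and the path are assumptions for the example.

```python
import numpy as np

# Hypothetical dataset: a flat binary file of float32 samples, larger than memory.
DATA_FILE = "/scratch/user/run42/output.dat"

def streaming_stats(path, chunk_elems=1_000_000):
    """Compute count, mean, min, and max without loading the whole file,
    reading chunk_elems float32 values at a time."""
    count, total = 0, 0.0
    lo, hi = np.inf, -np.inf
    with open(path, "rb") as f:
        while True:
            chunk = np.fromfile(f, dtype=np.float32, count=chunk_elems)
            if chunk.size == 0:
                break
            count += chunk.size
            total += float(chunk.sum())
            lo = min(lo, float(chunk.min()))
            hi = max(hi, float(chunk.max()))
    return {"count": count, "mean": total / count, "min": lo, "max": hi}

if __name__ == "__main__":
    print(streaming_stats(DATA_FILE))
```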

18 Data Architecture (9)
Many TeraGrid sites provide effectively unlimited archival storage to compute-allocated users. Almost none of these sites have a firm policy requiring or allowing them to delete data after a triggering event.
The volume of data flowing into and out of particular archives is already increasing drastically, in some cases exponentially, beyond the capacity of the disk caches and tape drives currently allocated.
–R7: The TeraGrid must provide better organized, more capable, and more logically unified access to archival storage for the user community. (Proposal to NSF for a unified approach to archival storage)

19 Plans for PY5
Implement Data Architecture recommendations
–User portal integration
–Data Collections infrastructure
–Archival replication services
–Continued investigation of new location-independent access mechanisms (Petashare, Reddnet)
Complete production deployments of Lustre-WAN
Develop plans for next-generation Lustre-WAN and pNFS technologies
Work with CTSS team on continued improvements to Data kit implementations

