Ischia, Italy - 9-21 July 2006. Session 9, Wednesday 12th July. Malcolm Atkinson.

Presentation transcript:


Plan for talk
Reminder of distributed system realities
Motivation for Distributed Data Management
– Example applications
Data lifecycle
– Generation, Storage, Use & Update, Archiving & Deletion/De-allocation
Data movement isn’t free
– But it is needed in many forms
OGSA Data Architecture
Examples of Data Services in Grid systems
Summary & Conclusions

Principles of Distributed Computing
Issues you can’t avoid
– Lack of Complete Knowledge (LOCK)
– Latency
– Heterogeneity
– Autonomy
– Unreliability
– Change
A challenging goal
– Balance technical feasibility
– against virtual homogeneity, stability and reliability
Appropriate balance between usability and productivity
– while remaining affordable, manageable and maintainable
This is still NOT easy

Reminder of Engineering Trade-offs
Challenge                          Goal
Heterogeneity & Variety            Simple operational model
Complex platform behaviour         Simple application model
Partial failures                   Simple user model
Partial failures + large tasks     Minimal resource wastage
Autonomy – owner’s rights          Stability & uniformity
Independent provision              Simple resource access
Scale, costs & latency             Good performance
Vulnerable to misuse               Dependable protection
Diverse & Evolving                 Flexible & agile
Valuable assets                    IPR & assets well protected
– Reputation, equipment, teams, data, algorithms, working practices


Compound Causes of Data Growth
Faster devices
Cheaper devices
Higher-resolution
– all ~ Moore’s law
Increased processor throughput
– → more derived data
Cheaper & higher-volume storage
Remote data more accessible
– Public policy to make research data available
– Bandwidth increases
– Latency doesn’t get less though
A product of effects → faster than Moore’s law

Diverse Data Sources
Output from Modelling and Simulation
– Increasingly sophisticated and detailed models
– Longer model runs
– More model runs
Data from Observation
– Arrays of deployed instruments
    Oceans, biosphere, atmosphere, geophysics, space, …
    Exploration systems
    Engineering & built-environment monitoring
    People monitoring – health, safety, security, finance, epidemiology, …
– Systematic data collection
    Space-based earth observation
    Extensive sky surveys
    Large experiments – ITER, LHC, …
    Extensive automated laboratories – crystallography, biochemistry, medicine, …
Powered by Grids

Diverse Data Sources 2
Commercial and Industrial data
– Customer tracking
– Production and product tracking
– Digital entertainment media
– Financial tracking & transactions
Governmental & Socio-economic data
– Census, surveys, enquiries, legal
– Spatio-temporal socio-economic & historic data
Derived data
– Analysis, calibration & summaries → more data
– Computer-based composition, e.g. automated annotation of sequences
Powered by Grids


Immense potential
Increasing size of data collections
– Allow smaller scale phenomena
– Rarer phenomena
– To be investigated / detected
Increasing scope of data collections
– Allow larger scale phenomena to be investigated
Diverse data collections
– Allow discovery by combining data from multiple sources
    E.g. the earthquake fault in China (see Highlights)
Business intelligence from data
– A crucial competitive advantage
Growth in number of data collections
– Generates a combinatorial expansion of the opportunities

Exploiting that potential
There are knowledge nuggets in the data
But the data are in many places
Mining them is hard
– Finding, extracting & fetching relevant data
– Processing the sheer volumes of data
– Using more sophisticated matching
    Transformations to deal with data collection systems
    Transformations to remove “known” phenomena, hiding new phenomena
    Combinatorial space searches
    Delicate matching criteria
    Sophisticated statistics

Exploiting that potential 2
Requires large amounts of data management
– Acquisition, storage, cataloguing, movement, archiving, discard
Requires large amounts of computation
Requires the usual AAA controls
– and sometimes privacy mechanisms
Requires provenance trails & metadata records
– E.g. for attribution & legal requirements
– To handle re-computations efficiently
This combination requires grids
Data needs more

Interpretational Challenges
Finding & Accessing data
– Variety of mechanisms & policies
Interpreting data
– Variety of forms, representations, value systems & ontologies
Independent provision & ownership
– Autonomous changes in availability, form, policy, …
– Regional variations in legal requirements
Processing data
– Understanding how it may be related
– Devising models that expose the relationships
Presenting results
– Humans need either
    Derived small volumes of statistics
    Visualisations
Requires insight & creativity


Motivation
Entering an age of data
– Data Explosion
– CERN: LHC will generate 1GB/s = 10PB/y
– VLBA (NRAO) generates 1GB/s today
– Pixar generate 100 TB/Movie
– Storage getting cheaper
Data stored in many different ways
– Data resources
– Relational databases
– XML databases
– Flat files
Need ways to facilitate
– Data discovery
– Data access
– Data integration
Empower e-Business and e-Science
– The Grid is a vehicle for achieving this

Composing Observations in Astronomy
Data and images courtesy Alex Szalay, Johns Hopkins
No. & sizes of data sets as of mid-2002, grouped by wavelength
12 waveband coverage of large areas of the sky
Total about 200 TB data
Doubling every 12 months
Largest catalogues near 1B objects

Biomedical data – making connections
acatttctac caacagtgga tgaggttgtt ggtctatgtt ctcaccaaat ttggtgttgt cagtctttta aattttaacc tttagagaag agtcatacag tcaatagcct tttttagctt gaccatccta atagatacac agtggtgtct cactgtgatt ttaatttgca ttttcctgct gactaattat gttgagcttg ttaccattta gacaacttca ttagagaagt gtctaatatt taggtgactt gcctgttttt ttttaattgg
Slide provided by Carole Goble: University of Manchester

Database Growth
PDB: 33,367 protein structures
EMBL DB: 111,416,302,701 nucleotides
Slide provided by Richard Baldock: MRC HGU Edinburgh

GODIVA Data Portal (China Workshops, December 2005)
Grid for Ocean Diagnostics, Interactive Visualisation and Analysis
Daily Met Office Marine Forecasts and gridded research datasets
National Centre for Ocean Forecasting
~3Tb climate model datastore via Web Services
Interactive Visualisations inc. Movies
~30 accesses a day worldwide
Other GODIVA software produces 3D/4D Visualisations reading data remotely via Web Services
Online Movies

GODIVA Visualisations
Unstructured Meshes
Grid Rotation/Interpolation
GeoSpatial Databases v. Files (Postgres, IBM, Oracle)
Perspective 3D Visualisation
Google maps viewer

NERC Data Grid
The DataGrid focuses on federation of NERC Data Centres
Grid for data discovery, delivery and use across sites
Data can be stored in many different ways (flat files, databases, …)
Strong focus on Metadata and Ontologies
Clear separation between discovery and use of data
Prototype focussing on Atmospheric and Oceanographic data

Distributed Aircraft Maintenance Environment: Leeds, Oxford, Sheffield & York, Jim Austin
Global in-flight engine diagnostics
Diagram labels: in-flight data; airline; maintenance centre; ground station; global network, e.g. SITA; internet, …, pager; DS&S Engine Health Center; data centre
100,000 aircraft; 0.5 GB/flight; 4 flights/day; 200 TB/day
Now BROADEN
Significant in getting Boeing 787 engine contract
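The 200 TB/day figure on this slide follows from the three numbers listed before it. A minimal check of that arithmetic, assuming decimal units (1 TB = 1000 GB); the function and constant names are illustrative only:

```python
# Back-of-envelope check of the slide's figures.
# Assumption: decimal units, 1 TB = 1000 GB.
AIRCRAFT = 100_000
GB_PER_FLIGHT = 0.5
FLIGHTS_PER_DAY = 4


def daily_volume_tb(aircraft, gb_per_flight, flights_per_day):
    """Total engine data generated per day, in terabytes."""
    return aircraft * gb_per_flight * flights_per_day / 1000


daily_tb = daily_volume_tb(AIRCRAFT, GB_PER_FLIGHT, FLIGHTS_PER_DAY)
print(daily_tb)  # 200.0
```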

FireGrid Technologies (Asif Usmani)
Diagram labels: Maps, models, scenarios; Super-real-time simulation (HPC); KBS and Planning; Emergency Responders; 1000s of sensors & gateway processing; Grid


Terabyte → Petabyte
                     Terabyte                  Petabyte
RAM time to move     15 minutes                2 months
1GB WAN move time    10 hours ($1000)          14 months ($1 million)
Disk cost            7 disks = $5000 (SCSI)    6800 Disks units + 32 racks = $7 million
Disk power           100 Watts                 100 Kilowatts
Disk weight          5.6 Kg                    33 Tonnes
Disk footprint       Inside machine            60 m²
Approximately correct in May 2003
Distributed Computing Economics, Jim Gray, Microsoft Research, MSR-TR
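The WAN row can be sanity-checked by deriving the effective throughput implied by "10 hours per terabyte" and scaling it to a petabyte. This is a sketch under stated assumptions: decimal units (1 TB = 10^12 bytes, 1 PB = 10^15 bytes) and an average month of 30.44 days. Neither assumption comes from the slide.

```python
# Assumptions: decimal byte units and an average month length of 30.44 days.
TB = 10**12
PB = 10**15
SECONDS_PER_HOUR = 3600
SECONDS_PER_DAY = 86400
DAYS_PER_MONTH = 30.44


def effective_rate(bytes_moved, hours):
    """Effective throughput in bytes per second."""
    return bytes_moved / (hours * SECONDS_PER_HOUR)


def months_to_move(bytes_to_move, rate_bytes_per_s):
    """Time to move a volume at a fixed rate, in months."""
    seconds = bytes_to_move / rate_bytes_per_s
    return seconds / SECONDS_PER_DAY / DAYS_PER_MONTH


rate = effective_rate(TB, 10)              # implied by the terabyte column
print(round(rate / 1e6, 1))                # 27.8 (MB/s)
print(round(months_to_move(PB, rate), 1))  # 13.7 (months)
```

At the same effective rate a petabyte takes roughly 13.7 months, which matches the slide's rounded figure of 14 months.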

Mohammed & Mountains
Petabytes of Data cannot be moved (often)
– It stays where it is produced or curated
    Hospitals, observatories, European Bioinformatics Institute, …
– A few caches and a small proportion cached
– Sometimes replicated
Diverse data collections
– Discovery depends on insights
– Unpredictable or unexpected use of data
– Remote users
– Composition from multiple sites

Move computation to the data
Assumption: code size << data size
– Minimise data transport
Provision combined storage & compute resources
Develop the database philosophy for this?
– Queries are programs safe to run near data
Develop the storage architecture for this?
– Selection computation hosted close to storage
Develop experiment, sensor & simulation architectures
– That take code to select and digest data as an output control
– That automatically attach the provenance & metadata
Data Cutter a step in this direction
– Sub-setting and aggregation of datasets using filters executed close to data

Reduce data movement: Caching Strategies
Caching
– Based on coherence in demand for data
– The same applications, individuals or groups
– Request the same or similar data repeatedly
– Until their focus of interest moves
Save data locally
– To save re-fetching

Reduce data movement: Caching Strategies (continued)
Challenges
– Choosing the right amount to store locally
– Balance storage costs v re-fetch costs
– Detect stale data
    An update on the original has made local data invalid
– Detect a scan much larger than cache
    A cache would then generate extra costs for no return
– But much research data is static or has regular update patterns
    Easier to cache
– Some researchers accept bounded staleness
    E.g. 1 day, 1 week
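The "bounded staleness" option can be sketched as a local cache that serves an entry only while it is younger than an agreed maximum age, and re-fetches otherwise. This is a minimal illustration, not taken from any grid middleware. The class name, the `fetch` callable and the injectable `clock` are all hypothetical.

```python
import time


class BoundedStalenessCache:
    """Serve a cached entry only while it is younger than max_age seconds."""

    def __init__(self, fetch, max_age, clock=time.monotonic):
        self._fetch = fetch        # hypothetical callable: key -> value from the remote source
        self._max_age = max_age    # accepted staleness bound, in seconds
        self._clock = clock        # injectable so the example can control time
        self._entries = {}         # key -> (value, time_fetched)
        self.remote_fetches = 0    # counts the costly remote transfers

    def get(self, key):
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and now - hit[1] <= self._max_age:
            return hit[0]          # fresh enough: no data movement
        value = self._fetch(key)   # missing or too old: re-fetch
        self.remote_fetches += 1
        self._entries[key] = (value, now)
        return value


# Usage with a fake clock and a staleness bound of one day.
fake_now = [0.0]
cache = BoundedStalenessCache(
    fetch=lambda key: f"data for {key}",
    max_age=86400,
    clock=lambda: fake_now[0],
)
cache.get("run-42")
cache.get("run-42")          # served locally
fake_now[0] = 90000.0        # more than one day later
cache.get("run-42")          # bound exceeded, fetched again
print(cache.remote_fetches)  # 2
```

The sketch trades correctness for simplicity: it never asks the source whether the data changed, so it suits the static or regularly updated research data the slide mentions rather than data with unpredictable updates.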

Reduce data movement: Caching Strategies (continued)
Who handles those challenges?
– Automated services
– Application developers
    Exploiting general application-specific properties
    Requires infrastructure & application knowledge
    May be out-of-date when used
– End-users
    Exploiting particular run properties
    But requires infrastructure and application knowledge
    Hard to develop
    Liable to be out of date
– Adaptive controllers
    Learning on coherent workloads

Reduce data movement: Replication Strategies
Replication for
– Reliability & Performance
– Increase the chance that data is near where it is used or computed
Challenges
– Logical to physical map maintenance and use
– Deciding what to replicate
    Which data, which subsets/supersets of data
– Propagating updates
– Recycling storage space used for replicas
– Physically making the copies
    More data movement
Similar decision options

Reduce delays from data movement
Pre-load required data near execution host
Pre-load required code near data
Challenges
– Storage availability in right place
– Licence availability
– Clean up after execution
– Clean up after failures

Streaming
What is streaming?
– C.f. Unix pipes & video / audio delivery
Despatch and delivery of data in increments
– Continuous or long stream of blocks
– Data generator writes to stream when data ready
    May block if stream capacity reached
– Data consumer reads from stream when needs data
    Will block when stream empty
– Stream management system
    May organise intermediate buffer sizes and storage
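The blocking behaviour described on this slide can be sketched with a bounded in-memory queue standing in for the stream. This is an illustration using Python's standard `queue` and `threading` modules, not a grid streaming service. The buffer size of 4 and the block names are arbitrary.

```python
import queue
import threading

END = object()  # sentinel marking the end of the stream


def producer(stream, n_blocks):
    """Data generator: writes each block as soon as it is ready."""
    for i in range(n_blocks):
        stream.put(f"block-{i}")   # blocks while the stream is at capacity
    stream.put(END)


def consumer(stream, received):
    """Data consumer: reads a block whenever it needs more data."""
    while True:
        block = stream.get()       # blocks while the stream is empty
        if block is END:
            break
        received.append(block)


stream = queue.Queue(maxsize=4)    # the intermediate buffer, capacity 4 blocks
received = []
threads = [
    threading.Thread(target=producer, args=(stream, 10)),
    threading.Thread(target=consumer, args=(stream, received)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(received))  # 10
```

Ten blocks pass through a buffer that never holds more than four, so the volume processed can exceed the store capacity.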

Streaming 2
Why stream
– The consumer can start as soon as some data is ready
    Overlap execution times, e.g. on different processors
    Pipelined execution
– The cost of intermediary storage may be avoided
    Read and write times to secondary memory
    Local RAM (and processor cache) consumption
– The scale of data processed can exceed store capacity
– The stream may flow from continuous recorders
– The stream may flow to continuous enactors
– Enables computational steering
    Stream of visualisable data showing computation or experiment progress

Streaming 3
Why not stream
– For small data transfers
    Stream administration may outweigh simple transfer time
– Operation may require all of the data
    E.g. sort, aggregate calculation (count, sum, average, standard deviation)
    Can stream in or out in some cases – but not if both source & destination require all of the data
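The aggregate case illustrates "can stream in": the input can be consumed one value at a time in constant memory, but no result is available until the whole stream has been read. The sketch below uses Welford's online algorithm for the variance, which is a choice made for this example and is not named on the slide.

```python
import math


def running_stats(values):
    """Consume an iterable once; return (count, sum, mean, population std dev).

    Uses Welford's online update, so memory use is constant, but the
    result exists only after the last value has been seen.
    """
    count = 0
    total = 0.0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the mean
    for x in values:
        count += 1
        total += x
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    std = math.sqrt(m2 / count) if count else 0.0
    return count, total, mean, std


# Usage: the input is a generator, so it is never held in memory as a whole.
count, total, mean, std = running_stats(x for x in [2, 4, 4, 4, 5, 5, 7, 9])
print(count, total)
```

A sort differs: it must hold, or spill to storage, all of its input before emitting the first output block, so streaming in brings no saving in intermediate storage.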


OGSA Capabilities
Security
– Cross-organizational users
– Trust nobody
– Authorized access only
Information Services
– Registry
– Notification
– Logging/auditing
Execution Management
– Job description & submission
– Scheduling
– Resource provisioning
Data Services
– Common access facilities
– Efficient & reliable transport
– Replication services
Self-Management
– Self-configuration
– Self-optimization
– Self-healing
Resource Management
– Discovery
– Monitoring
– Control
Diagram labels: OGSA; OGSA “profiles”; Web services foundation
Hiro Kishimoto: Keynote GGF17

Data Services
The basic problem: manage, transfer and access distributed data services and resources
Diagram labels: Issues; Find; Describe; Access; Data; Formats; Protocols; Use cases; Move/Copy/Replicate; Metadata; Manage; Common access; Derived data; Catalog; Sensor; Data stream; Text file; Relational database
Hiro Kishimoto: Keynote GGF17

Basic Data Services
Diagram labels: Data Resources; Managed Storage; Transfer Protocols; Storage Management; Data Management; Other Data Services; Transfer; Registries; Non-OGSA client APIs & other services; Service interface; Resource interface
Hiro Kishimoto: Keynote GGF17

Data Services
Diagram labels: Data Service 1; Data Service 2; Data Service n; Composite Data Services; Replication; Cache; Federation
Hiro Kishimoto: Keynote GGF17

Data Services
Diagram labels: Name Service; Data Service 1; Data Service 2; File System Data Services; File Service

Basic Data Interfaces
Storage Management
– e.g. Storage Resource Management (SRM)
Data Access
– ByteIO
– Data Access & Integration (DAI)
Data Transfer
– Data Movement Interface Specification (DMIS)
– Protocols (e.g. GridFTP)
Replica management
Metadata catalog
Cache management
Hiro Kishimoto: Keynote GGF17


Storage Resource Manager (SRM)
De facto & written standard in physics, …
Collaborative effort
– CERN, FNAL, JLAB, LBNL and RAL
Essential bulk file storage
– (Pre) allocation of storage
    Abstraction over storage systems
– File delivery / registration / access
– Data movement interfaces
    E.g. GridFTP
Rich function set
– Space management, permissions, directory, data transfer & discovery

Storage Resource Broker (SRB)
SDSC developed
Widely used
– Archival document storage
– Scientific data: bio-sciences, medicine, geo-sciences, …
Manages
– Storage resource allocation
    Abstraction over storage systems
– File storage
– Collections of files
– Metadata describing files, collections, etc.
– Data transfer services

OMII Data Management Services
FTP
– File Transfer Service
OGSA-DAI
– Access to structured data

Condor Data Management
Stork
– Manages File Transfers
– May manage reservations
Nest
– Manages Data Storage
– C.f. GridFTP with reservations
    Over multiple protocols

Globus Tools and Services for Data Management
GridFTP
– A secure, robust, efficient data transfer protocol
The Reliable File Transfer Service (RFT)
– Web services-based, stores state about transfers
The Data Access and Integration Service (DAIS)
– Service for access to data resources, particularly relational and XML databases
The Replica Location Service (RLS)
– Distributed registry that records locations of data copies
The Data Replication Service
– Web services-based, combines data replication and registration functionality
Slides from Ann Chervenak

A Replica Location Service
A Replica Location Service (RLS) is a distributed registry that records the locations of data copies and allows replica discovery
– RLS maintains mappings between logical identifiers and target names
– Must perform and scale well: support hundreds of millions of objects, hundreds of clients
E.g., LIGO (Laser Interferometer Gravitational Wave Observatory) Project
– RLS servers at 10 sites
– Maintain associations between 6 million logical file names & 40 million physical file locations
Slides from Ann Chervenak

RLS Features
Local Replica Catalogs (LRCs) contain consistent information about logical-to-target mappings
Replica Location Index (RLI) nodes aggregate information about one or more LRCs
LRCs use soft state update mechanisms to inform RLIs about their state: relaxed consistency of index
Optional compression of state updates reduces communication, CPU and storage overheads
Diagram labels: Replica Location Indexes (RLI); Local Replica Catalogs (LRC)
Slides from Ann Chervenak
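The two-level structure can be sketched as follows. This is a toy model of the idea only, not the Globus RLS API: the class names, method names, timeout value, site names and `gsiftp://` URLs are hypothetical, and the optional compression of updates is omitted. Each LRC holds authoritative logical-to-target mappings, and the RLI holds only soft state (which LRC knew about which logical names, and when it last said so), which expires unless refreshed.

```python
class LocalReplicaCatalog:
    """Authoritative logical-name -> target-name mappings for one site."""

    def __init__(self, name):
        self.name = name
        self._mappings = {}  # logical name -> set of target names

    def add(self, logical, target):
        self._mappings.setdefault(logical, set()).add(target)

    def lookup(self, logical):
        return sorted(self._mappings.get(logical, ()))

    def logical_names(self):
        return set(self._mappings)


class ReplicaLocationIndex:
    """Aggregates which LRCs hold which logical names, as expiring soft state."""

    def __init__(self, timeout):
        self._timeout = timeout
        self._state = {}  # LRC name -> (set of logical names, time received)

    def soft_state_update(self, lrc, now):
        # A full snapshot replaces the previous one; nothing is ever deleted
        # explicitly, stale entries simply time out.
        self._state[lrc.name] = (lrc.logical_names(), now)

    def which_lrcs(self, logical, now):
        return sorted(
            name
            for name, (names, received) in self._state.items()
            if logical in names and now - received <= self._timeout
        )


# Usage with explicit timestamps in seconds.
lrc_a = LocalReplicaCatalog("site-a")
lrc_a.add("lfn://run1", "gsiftp://a.example.org/data/run1")
lrc_b = LocalReplicaCatalog("site-b")
lrc_b.add("lfn://run1", "gsiftp://b.example.org/store/run1")

rli = ReplicaLocationIndex(timeout=60)
rli.soft_state_update(lrc_a, now=0)
rli.soft_state_update(lrc_b, now=0)
print(rli.which_lrcs("lfn://run1", now=30))   # ['site-a', 'site-b']

rli.soft_state_update(lrc_a, now=50)          # only site-a refreshes
print(rli.which_lrcs("lfn://run1", now=100))  # ['site-a']
```

A client first asks the index which catalogs to contact, then asks those catalogs for the target names. Because the index may lag behind the catalogs, its answer is a hint, which is the relaxed consistency the slide refers to.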

Components of RLS Implementation
Common server implementation for LRC and RLI
Front-End Server
– Multi-threaded
– Written in C
– Supports GSI Authentication using X.509 certificates
Back-end Server
– MySQL, PostgreSQL and Oracle Relational Database
Client APIs: C, Java, Python
Client Command line tool
Slides from Ann Chervenak

RLS in Production Use: LIGO
Laser Interferometer Gravitational Wave Observatory
Currently use RLS servers at 10 sites
– Contain mappings from 6 million logical files to over 40 million physical replicas
Used in customized data management system: the LIGO Lightweight Data Replicator System (LDR)
– Includes RLS, GridFTP, custom metadata catalog, tools for storage management and data validation
Slides from Ann Chervenak

RLS in Production Use: ESG
Earth System Grid: Climate modeling data (CCSM, PCM, IPCC)
RLS at 4 sites
Data management coordinated by ESG portal
Datasets stored at NCAR
– TB in total files
– 1230 portal users
IPCC Data at LLNL
– TB in 59,300 files
– 400 registered users
– Data downloaded: TB in 263,800 files
– Avg. 300GB downloaded/day
– 200+ research papers being written
Slides from Ann Chervenak

gLite Data Management
FTS
– File Transfer Service
LFC
– Logical file catalogue
Replication Service
– Accessed through LFC
AMGA
– Metadata services

Data Management Services (Enabling Grids for E-sciencE; EGEE Review, CERN - gLite Middleware Status)
FiReMan catalog
– Resolves logical filenames (LFN) to physical location of files (URL understood by SRM) and storage elements
– Oracle and MySQL versions available
– Secure services, using VOMS groups, ACL support for DNs
– Full set of Command Line tools
– Simple API for C/C++ wrapping a lot of the complexity for easy usage
– Attribute support
– Symbolic link support
– Exposing ServiceIndex and DLI (for matchmaking)
– Separate catalog available as a keystore for data encryption (‘Hydra’)
– Deployed on the Pre-Production Service and DILIGENT testbed
gLite I/O
– Posix-like access to Grid files
– Castor, dCache and DPM support
– Added a remove method to be able to delete files
– Changed the configuration to match all other CLI configuration to service-discovery
– Improved error reporting
– Has been used for the BioMedical Demo
    Encryption and DICOM SRM
– Deployed on the Pre-Production Service and the DILIGENT testbed
AMGA MetaData Catalog
– NA4 contribution
    Result of JRA1 & NA4 prototyping together with PTF assessment
    Used by the LHCb experiment
    Has been used for the BioMedical Demo

File Transfer Service
Reliable file transfer
Full scalable implementation
– Java Web Service front-end, C++ Agents, Oracle or MySQL database support
– Support for Channel, Site and VO management
– Interfaces for management and statistics monitoring
– Gsiftp, SRM and SRM-copy support
Has been in use by the Service Challenges for the last 5 months
– Evolved together with the Service Challenges Team
– Daily meetings
FTS evolved over summer to include
– Support for MySQL and Oracle
– Multi-VO support
– GridFTP and SRM copy support
– MyProxy server as a CLI argument
– Many small changes/optimizations revealed by SC3 usage
FTS workshop with LHC experiments on November 16
– Issues, Feedback and short term plans


Summary: Take home message
Distributed data management
– A motivation for grids
– Grids will not work without it
Principal Requirements and functions
– Abstracted storage management
    Data lifetime
    Bulk data: files & collections of files
    Creation, cataloguing, description, protection, movement, access & update, deletion/de-allocation
– Varieties of data movement
– Replication & caching
