EUDAT Towards a pan-European Collaborative Data Infrastructure Ari Lukkarinen CSC-IT Center for Science, Finland APA Conference, November 6th, 2012.

Big (Chaotic) Data
Scientific data problem:
- Cheap CPU capacity
- Systems easy to build
- Hard drive price erosion
- Lots of data
Diagram labels: PROBLEM, T10, T10_H50, T10_H50_RAMP, T10_H50_RAMP_UNIFORM, FCC, D012, D012_GAP020, D012_GAP010_1000
"Unorganized library, where books are written with disappearing ink."

What is needed? A standardized storage service and processes for data management.

How Do We Achieve This?
- Capturing Community Requirements (Community Interviews): How is data organised? What are the first wishes?
- Service Deployment and Operations (Operations team): How to deploy services on the distributed infrastructure?
- Technology Appraisal and Service building (Service Taskforces): What services can be built to match the requirements?

EUDAT Consortium

Data centers and Communities

Communities
- Capturing Community Requirements (Community Interviews): How is data organised? What are the first wishes?
- Service Deployment and Operations (Operations team): How to deploy services on the distributed infrastructure?
- Technology Appraisal and Service building (Service Taskforces): What services can be built to match the requirements?

The collaboration concept

Communities and Data Centers
- What are the basic requirements?
- Which common services are needed?

Common services
1) Safe replication: Enable communities to safely replicate data to selected data centers for storage, and do this in a robust, reliable and highly available way.
2) Dynamic replication: Enable communities to perform (HPC) computations on the replicated data.
3) Metadata: A common metadata domain for all data stored by EUDAT data centers and a searchable catalogue covering all the data stored within EUDAT, allowing data searches.

Common services, cont.
4) Research data store: An easy-to-use service that will enable researchers and scientists to upload, store and share data that are not part of the officially managed data sets of the research communities.
5) PID: A PID system that can be used within the communities and by EUDAT.
6) AAI: Federated AAI.

Use cases
Research community:
- Provides services to research groups.
Research group:
- Active data is stored in the research data store.
- "Finalized" data is moved to the storage service. A PID will be generated automatically. Metadata is required. Data will be replicated to a secondary site.
- When needed, stored data will be copied to an HPC environment for further analysis.
Other scientists (open data?):
- A centralized metadata service helps to locate the data.
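The "finalized data" step above can be sketched in a few lines of Python. This is only an illustration of the stated rules (metadata is required, a PID is generated automatically, data ends up on a secondary site). The names `StoredObject`, `finalize`, the site labels and the `hypothetical-prefix` PID prefix are assumptions made for this example and are not part of any EUDAT service.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class StoredObject:
    """A finalized data set as held by a hypothetical storage service."""
    name: str
    metadata: dict
    pid: str
    sites: list = field(default_factory=list)


def finalize(name, metadata, primary_site, secondary_site,
             prefix="hypothetical-prefix"):
    """Move a finalized data set to storage, following the slide's rules."""
    # Metadata is required before data can leave the research data store.
    if not metadata:
        raise ValueError("metadata is required before data can be finalized")
    # The PID is generated automatically; a random UUID stands in for the suffix.
    pid = f"{prefix}/{uuid.uuid4()}"
    # The data is held on the primary site and replicated to a secondary site.
    return StoredObject(name, dict(metadata), pid, [primary_site, secondary_site])
```

Calling `finalize("run42", {"title": "Test run"}, "site-a", "site-b")` returns an object with a generated PID and both sites listed, while an empty metadata dictionary is rejected.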

What are the Requirements?
6 service/use cases identified:
- Safe replication: Enable communities to safely replicate data to selected data centers for storage, and do this in a robust, reliable and highly available way.
- Dynamic replication: Enable communities to perform (HPC) computations on the replicated data.
- Metadata: Create a common metadata domain for all data stored by EUDAT data centres and a searchable catalogue covering all the data stored within EUDAT, allowing data searches.
- Research data store: Create an easy-to-use service that will enable researchers and scientists to upload, store and share data that are not part of the officially managed data sets of the research communities.
- AAI: A solution for a working AAI system in a federation scenario.
- PID: A robust, highly available and effective PID system that can be used within the communities and by EUDAT.

INFRASTRUCTURE

Welcome to the 1st EUDAT Conference!
October 2012, Barcelona
- International event with keynotes from Europe and US
- A forum to discuss the future of data infrastructures
- Project presentations and poster sessions
- Parallel session on Sustainability and Funding Models
- Training tutorials

SAFE REPLICATION
Objective: Enable communities to easily replicate data to selected data centers for storage in a robust and reliable manner.
Key benefits: data bit stream preservation, more optimal data curation, better accessibility.
Description: Data replication management based on users' requirements and constraints; data replication solutions and services embedded into critical security policies, including firewall setups and user accounting procedures.
Technology: iRODS to be used as an initial replication middleware, implemented across the community centers and data centers; as more user communities join the task force, other storage technologies may be added, depending on user needs.
- Production setup expected by 2013, such that users will be able to safely replicate data across different user community centres and data centres.
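Bit stream preservation means every replica must be byte-for-byte identical to the source. The following is a minimal sketch of that idea using only the Python standard library and local directories as stand-ins for data centers. It does not use iRODS, and the function names `sha256_of` and `replicate` are assumptions made for this example.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def replicate(source, replica_dirs):
    """Copy a file to each replica directory and verify every copy."""
    source = Path(source)
    expected = sha256_of(source)
    replicas = []
    for directory in replica_dirs:
        target_dir = Path(directory)
        target_dir.mkdir(parents=True, exist_ok=True)
        target = target_dir / source.name
        shutil.copyfile(source, target)
        # A replica only counts if its checksum matches the source.
        if sha256_of(target) != expected:
            raise IOError(f"checksum mismatch for replica {target}")
        replicas.append(target)
    return expected, replicas
```

A production replication service would also record the checksum alongside the object so that replicas can be re-verified later, which this sketch leaves out.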

DATA STAGING
Objective: Enable communities to perform (HPC) computations on the replicated data.
Key benefits: Access to large computing facilities.
Description: This service will allow the EUDAT communities to dynamically replicate subsets of their data stored in EUDAT to HPC machine workspaces for processing.
Differences with the safe replication scenario:
- Replicated data are discarded when the analysis application ends.
- Persistent Identifier (PID) references are not applied to data replicated into HPC workspaces.
- Users initiate the process of replicating data, while in the safe replication scenario data are replicated automatically on a policy basis.
Technologies: GridFTP, Griffin, gTransfer, FTS (under appraisal).
Diagram labels: EUDAT Storage, HPC Facility, CINECA, SARA, PRACE, PID, Community Storage, EPOS.
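The first difference listed above (staged copies are discarded when the analysis ends) maps naturally onto a temporary workspace. The sketch below assumes local files and a temporary directory in place of an HPC workspace. It does not use GridFTP or any of the transfer tools under appraisal, and `staged_workspace` is a hypothetical name.

```python
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def staged_workspace(stored_files):
    """Copy stored files into a scratch workspace for the duration of a job."""
    workspace = Path(tempfile.mkdtemp(prefix="hpc-workspace-"))
    try:
        # The user initiates staging by entering the context.
        for stored in stored_files:
            shutil.copy(stored, workspace / Path(stored).name)
        yield workspace
    finally:
        # Staged copies are discarded when the analysis ends;
        # the stored originals are left untouched.
        shutil.rmtree(workspace, ignore_errors=True)
```

An analysis would run inside `with staged_workspace([...]) as ws:` and read its inputs from `ws`. No identifiers are assigned to the staged copies, in line with the second difference above.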

METADATA
Objective: Create a joint metadata domain for all data stored by EUDAT data centers and a catalogue which exposes the data stored within EUDAT, allowing data searches.
Key benefits: Advertising platform for data sets, metadata service for less mature communities.
Description: EUDAT will handle metadata for more resources than just those deposited within the EUDAT CDI. In the initial phase we will target mainly resources contributed by the participating communities, augmented with those of interested, well-organized communities that are ready to contribute. Later, other interested communities can be approached, depending on their capabilities.
Technology: OAI-PMH, with domain-specific metadata embedded as XML within the OAI-PMH record.
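To illustrate what "metadata embedded as XML within the OAI-PMH record" looks like, the sketch below parses one record with Python's standard XML library. The record is invented for this example (identifier, title and creator are placeholders), and Dublin Core is used only as a stand-in for a community's domain-specific schema. The namespace URIs are the standard OAI-PMH 2.0 and Dublin Core ones.

```python
import xml.etree.ElementTree as ET

# Standard namespaces for OAI-PMH 2.0 and the oai_dc / Dublin Core formats.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Invented example record: the <metadata> element carries the embedded XML.
RECORD = """<record xmlns="http://www.openarchives.org/OAI/2.0/">
  <header>
    <identifier>oai:example.org:dataset-1</identifier>
    <datestamp>2012-11-06</datestamp>
  </header>
  <metadata>
    <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
               xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:title>Example data set</dc:title>
      <dc:creator>Example Community</dc:creator>
    </oai_dc:dc>
  </metadata>
</record>"""


def summarize(record_xml):
    """Extract the fields a searchable catalogue might index from one record."""
    record = ET.fromstring(record_xml)
    return {
        "identifier": record.findtext("oai:header/oai:identifier", namespaces=NS),
        "title": record.findtext("oai:metadata/oai_dc:dc/dc:title", namespaces=NS),
        "creator": record.findtext("oai:metadata/oai_dc:dc/dc:creator", namespaces=NS),
    }
```

A harvester would obtain such records from a repository's OAI-PMH endpoint. That network step is left out here so the example stays self-contained.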

SIMPLE STORE
Objective: Create an easy-to-use service that will enable researchers and scientists to upload, store and share data that are not part of the officially managed data sets of the research communities.
Key benefits: Store, share, and retrieve smaller sets of data not officially handled.
Description: This service will address the long tail of "small" data, and the researchers/citizen scientists creating and manipulating it. Typically this type of data comes in a wide range of formats including text, spreadsheets, number series, audio and video files, photographs and other images. The Research Data Store is complementary to the other EUDAT services that manage the large volumes of official community data.
Technologies: Invenio, figshare, BeeHub and myExperiment.

PID
Objective: Deploy a robust, highly available and effective PID service that can be used within the communities and by EUDAT.
Description: Keeping track of the "names" of data sets or other digital artefacts deposited with the CDI requires more robust mechanisms than "noting down the filename". The PID service will be required by many other CDI services, from Data Movement to Search and Query.
Technologies: Currently considering use of both EPIC for data objects, and DataCite to register DOIs (Digital Object Identifiers) for published collections.
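The reason a PID is more robust than "noting down the filename" is the level of indirection: the identifier stays fixed while the locations behind it can change or multiply. A minimal in-memory sketch of that idea follows. `PidRegistry` and its methods are hypothetical and do not reflect the EPIC or DataCite interfaces, and the prefix and locations in the usage note are placeholders.

```python
class PidRegistry:
    """Toy resolver mapping stable identifiers to current storage locations."""

    def __init__(self, prefix):
        self.prefix = prefix
        self._locations = {}
        self._next = 1

    def register(self, location):
        """Mint a new PID of the form prefix/suffix for a stored object."""
        pid = f"{self.prefix}/{self._next:06d}"
        self._next += 1
        self._locations[pid] = [location]
        return pid

    def add_replica(self, pid, location):
        """Record an additional location, for example after replication."""
        self._locations[pid].append(location)

    def resolve(self, pid):
        """Return a copy of all known locations; raises KeyError if unknown."""
        return list(self._locations[pid])
```

With `registry = PidRegistry("hypothetical-prefix")`, registering a location and later adding a replica leaves the PID unchanged while `resolve` returns both locations.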

AAI
Objective: Provide a solution for a working AAI system in a federated scenario.
Description: Design the AA infrastructure to be used during the EUDAT project and beyond.
Key tasks:
- Leveraging existing identification systems within communities and/or data centers
- Establishing a network of trust among the AA actors: Identity Providers (IdPs), Service Providers (SPs), Attribute Authorities and Federations
- Attribute harmonization
Technologies: OAuth2, OpenID, RADIUS, SAML2, X.509, XACML, etc.

How Are Requirements Shared?
Services: SR, DR, MD, SS, PID, AAI
Community entries:
CLARIN: X + X X + X
ENES: X X X + X
EPOS: X X X X
VPH: X X X X
LifeWatch: X + X + + X
NB: "X" = this service is relevant to this community; "+" = this community has interest in this service but at a later stage, or has a similar service already running in production.

Dynamic replication to HPC workspace for processing

Thank you
Ari Lukkarinen

OPERATION TEAM

How Do We Sustain This?
Organisational Model
- How do we move from a project collaboration to a federated infrastructure?
- Who are the actors of this infrastructure and what are their roles?
- How do we integrate new members?
- How will the infrastructure interact with other infrastructures and projects?
Costs and Funding Models
- Who will pay for the infrastructure and the shared services?
- What are the costs of the services?
- How to define a business model that best supports the interests of research communities, data centers and funders?
- What role for Knowledge Exchange?
