Neil Chue Hong Project Manager, EPCC +44 131 650 5957 Data Services What, Why, How e-Research Meeting NeSC, 2 nd.

1 Neil Chue Hong, Project Manager, EPCC
Data Services: What, Why, How
e-Research Meeting, NeSC, 2nd March 2005

2 e-Research within The University of Edinburgh
Overview
The difficulty with data
Data Services
Data Middleware
Data Repositories

3 The Data Deluge
Entering an age of data
  – Data Explosion
  – CERN: LHC will generate 1 GB/s = 10 PB/y
  – VLBA (NRAO) generates 1 GB/s today
  – Pixar generate 100 TB/Movie
  – Storage getting cheaper
Data stored in many different ways
  – Data resources
  – Relational databases
  – XML databases / files
  – Result files
Need ways to facilitate
  – Data discovery
  – Data access
  – Data integration
Empower e-Business and e-Science
  – The Grid is a vehicle for achieving this
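As a rough check on the LHC figure above: a sustained 1 GB/s over a full calendar year would come to about 31.5 PB, so 10 PB/y corresponds to roughly 10^7 seconds of data-taking per year. The sketch below assumes decimal units (1 PB = 10^6 GB); the running time is inferred from the two quoted numbers and is not stated on the slide.

```python
# Assumption: decimal units, so 1 PB = 1,000,000 GB.
GB_PER_PB = 1_000_000
SECONDS_PER_YEAR = 365 * 24 * 3600  # 31,536,000 seconds


def petabytes_per_year(rate_gb_per_s: float, seconds_running: float) -> float:
    """Volume in PB produced at a steady rate over the given running time."""
    return rate_gb_per_s * seconds_running / GB_PER_PB


def seconds_for_volume(rate_gb_per_s: float, volume_pb: float) -> float:
    """Running time in seconds needed to produce volume_pb at a steady rate."""
    return volume_pb * GB_PER_PB / rate_gb_per_s


# 1 GB/s sustained for a whole calendar year.
full_year_pb = petabytes_per_year(1.0, SECONDS_PER_YEAR)

# Running time implied by the slide's 10 PB/y at 1 GB/s.
implied_seconds = seconds_for_volume(1.0, 10.0)
implied_fraction_of_year = implied_seconds / SECONDS_PER_YEAR

print(f"Full year at 1 GB/s: {full_year_pb:.1f} PB")
print(f"10 PB/y implies {implied_seconds:.0f} s of running "
      f"({implied_fraction_of_year:.0%} of a year)")
```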

4 What is e-Science?
Goal: to enable better research
Method: Invention and exploitation of advanced computational methods
  – to generate, curate and analyse research data
  – From experiments, observations and simulations
  – Quality management, preservation and reliable evidence
  – to develop and explore models and simulations
  – Computation and data at extreme scales
  – Trustworthy, economic, timely and relevant results
  – to enable dynamic distributed virtual organisations
  – Facilitate collaboration with resource sharing
  – Security, reliability, accountability, and manageability
Multiple, independently managed sources of data, each with its own time-varying structure
Creative researchers discover new knowledge by combining data from multiple sources

5 Composing Observations in Astronomy
No. & sizes of data sets as of mid-2002, grouped by wavelength
12 waveband coverage of large areas of the sky
Total about 200 TB data
Doubling every 12 months
Largest catalogues near 1B objects
Data and images courtesy Alex Szalay, Johns Hopkins
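The growth figures above (about 200 TB, doubling every 12 months) imply exponential growth. A minimal sketch of that projection follows; it assumes the doubling period stays constant, which the slide only reports as of mid-2002, and the function names are illustrative.

```python
import math


def projected_size_tb(initial_tb: float, years: float,
                      doubling_period_years: float = 1.0) -> float:
    """Size after `years`, assuming steady doubling every doubling_period_years."""
    return initial_tb * 2 ** (years / doubling_period_years)


def years_to_reach(initial_tb: float, target_tb: float,
                   doubling_period_years: float = 1.0) -> float:
    """Years until the collection reaches target_tb under the same assumption."""
    return doubling_period_years * math.log2(target_tb / initial_tb)


# Starting point quoted on the slide: about 200 TB in mid-2002.
size_after_three_years = projected_size_tb(200.0, 3.0)

# Time to reach 1 PB, taking 1 PB = 1000 TB (decimal units, an assumption).
years_to_one_pb = years_to_reach(200.0, 1000.0)

print(f"After 3 years: {size_after_three_years:.0f} TB")
print(f"Years to reach 1 PB: {years_to_one_pb:.2f}")
```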

6 Data Services: motives
Key to Integration of Scientific Methods
  – Publication and sharing of results
  – Primary data from observation, simulation & experiment
  – Encourages novel uses
  – Allows validation of methods and derivatives
  – Enables discovery by combining data collected independently
Key to Large-scale Collaboration
  – Economies: data production, publication & management
  – Sharing cost of storage, management and curation
  – Many researchers contributing increments of data
  – Pooling annotation leads to rapid incremental publication
  – Accommodates global distribution
  – Data & code travel faster and more cheaply
  – Accommodates temporal distribution
  – Researchers assemble data
  – Later (other) researchers access data

7 Data Services: challenges
Scale
  – Many sites, large collections, many uses
Longevity
  – Research requirements outlive technical decisions
Diversity
  – No one-size-fits-all solution will work
  – Primary Data, Data Products, Meta Data, Administrative data, …
Many Data Resources
  – Independently owned & managed
  – No common goals
  – No common design
  – Work hard for agreements on foundation types and ontologies
  – Autonomous decisions change data, structure, policy, …
  – Geographically distributed
and I haven't even mentioned security yet!

8 The Discovery Process
Choosing data sources
  – How do you find them?
  – How do they describe and advertise them?
  – Is the equivalent of Google possible?
Obtaining access to that data
  – Overcoming administrative barriers
  – Overcoming technical barriers
Understanding that data and extracting from multiple sources
  – The parts you care about for your research
Combining them using sophisticated models
  – The picture of reality in your head
Analysis on scales required by statistics
  – Coupling data access with computation
Repeated Processes
  – Examining variations, covering a set of candidates
  – Monitoring the emerging details

9 Small problems
Not just Grand Challenges!
  – Also the small problems
For instance:
  – What happens to data when a researcher leaves a team?
  – How can a research leader point to popular data when a new researcher joins?
  – How can you manage your data when you start to run out of local storage space?
  – How do I get my data from one format/database to another?
  – How do I combine my data with your data?
You need to manage your data: metadata

10 What is a data service?
An interface to a stored collection of data
  – e.g. Google and Amazon
  – web services
But the data could be:
  – replicated
  – shared
  – federated
  – virtual
  – incomplete
Don't care about the underlying representation
  – do care about the information it represents
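The idea that callers care about the information and not the underlying representation can be sketched as one interface with interchangeable backends. Everything here is hypothetical: the `DataService` interface, the two backends and the sample records are illustrations only and do not come from any tool named in the talk.

```python
import sqlite3
from abc import ABC, abstractmethod
from typing import Dict, List


class DataService(ABC):
    """Hypothetical interface: callers look records up by name, nothing more."""

    @abstractmethod
    def lookup(self, name: str) -> List[Dict]:
        ...


class InMemoryService(DataService):
    """Backend 1: records held in a plain Python list."""

    def __init__(self, records: List[Dict]) -> None:
        self._records = list(records)

    def lookup(self, name: str) -> List[Dict]:
        return [dict(r) for r in self._records if r["name"] == name]


class SqliteService(DataService):
    """Backend 2: the same records held in a relational table."""

    def __init__(self, records: List[Dict]) -> None:
        self._conn = sqlite3.connect(":memory:")
        self._conn.execute("CREATE TABLE objects (name TEXT, value REAL)")
        self._conn.executemany(
            "INSERT INTO objects VALUES (?, ?)",
            [(r["name"], r["value"]) for r in records],
        )

    def lookup(self, name: str) -> List[Dict]:
        rows = self._conn.execute(
            "SELECT name, value FROM objects WHERE name = ?", (name,)
        ).fetchall()
        return [{"name": n, "value": v} for n, v in rows]


# Made-up sample records, used only to exercise the two backends.
records = [{"name": "obj-1", "value": 1.0}, {"name": "obj-2", "value": 2.5}]
services = [InMemoryService(records), SqliteService(records)]

# The caller's code is identical whichever representation sits underneath.
answers = [service.lookup("obj-1") for service in services]
print(answers)
```

A replicated, federated or virtual collection would, under the same assumption, be another class behind the same `lookup` call.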

11 Examples of Data Services
Many Data Services and applications
  – Commercial databases
  – Web interfaces
  – Applications developed individually by groups and projects
Also many places to get hold of public data
  – Publications and citation servers
  – Results servers
Highlight a few of these
  – principally ones trying to bridge the gap between local and distributed
But… no such thing as a free lunch
  – Things are not yet Plug and Play
  – You will need to expend some effort to use these tools effectively

12 OGSA-DAI / DQP
Data Access and Integration / Distributed Query Processing
  – http://www.ogsadai.org.uk
  – Provides a way to access and query heterogeneous, structured data resources
  – Relational databases
  – XML databases
  – files
  – Provides a framework for extending services
  – more smarts, closer to the data
  – Everything looks like a database
National Grid Service starting to host
  – both through OGSA-DAI and Oracle
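To illustrate the "everything looks like a database" idea in general terms, the sketch below exposes a file and a relational table through the same query language. It uses only Python's standard `csv` and `sqlite3` modules and is not the OGSA-DAI API; the table names and sample values are made up for the example.

```python
import csv
import io
import sqlite3

# Made-up file content standing in for a result file on disk.
CSV_TEXT = "sample_id,reading\ns1,0.5\ns2,1.5\n"

conn = sqlite3.connect(":memory:")

# A resource that is already relational.
conn.execute("CREATE TABLE samples (sample_id TEXT, site TEXT)")
conn.executemany(
    "INSERT INTO samples VALUES (?, ?)",
    [("s1", "north"), ("s2", "south")],
)

# A file resource, loaded so that it can be queried like a table.
conn.execute("CREATE TABLE readings (sample_id TEXT, reading REAL)")
reader = csv.DictReader(io.StringIO(CSV_TEXT))
conn.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [(row["sample_id"], float(row["reading"])) for row in reader],
)

# One query now spans both resources.
joined = conn.execute(
    "SELECT s.site, r.reading "
    "FROM samples AS s JOIN readings AS r ON s.sample_id = r.sample_id "
    "ORDER BY s.sample_id"
).fetchall()
print(joined)
```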

13 SRB
Storage Resource Broker
  – http://www.sdsc.edu/srb/
  – Provides a way to access data sets and resources based on their attributes and/or logical names rather than their names or physical locations.
  – may be heterogeneous, distributed and/or replicated
  – Many different ways of connecting
  – Can connect SRB systems together
  – zoneSRB
  – Everything looks like a filesystem
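Access by logical name or attribute, as opposed to physical location, can be pictured as a small catalogue that maps each logical name to its replicas. This is a toy sketch under that assumption, not the SRB interface; the `Catalogue` class, its methods and the paths are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Entry:
    """One logical data set: its descriptive attributes and where copies live."""
    logical_name: str
    attributes: Dict[str, str] = field(default_factory=dict)
    replicas: List[str] = field(default_factory=list)


class Catalogue:
    """Hypothetical catalogue mapping logical names to physical replicas."""

    def __init__(self) -> None:
        self._entries: Dict[str, Entry] = {}

    def register(self, logical_name: str, physical_location: str,
                 **attributes: str) -> None:
        # Registering the same logical name again adds another replica.
        entry = self._entries.setdefault(logical_name, Entry(logical_name))
        entry.attributes.update(attributes)
        entry.replicas.append(physical_location)

    def resolve(self, logical_name: str) -> List[str]:
        """Physical locations holding a copy of the named data set."""
        return list(self._entries[logical_name].replicas)

    def find(self, **wanted: str) -> List[str]:
        """Logical names whose attributes match every requested value."""
        return sorted(
            name for name, entry in self._entries.items()
            if all(entry.attributes.get(k) == v for k, v in wanted.items())
        )


catalogue = Catalogue()
catalogue.register("survey/run1", "siteA:/disk1/run1.dat", instrument="cam1")
catalogue.register("survey/run1", "siteB:/archive/0001", instrument="cam1")
catalogue.register("survey/run2", "siteA:/disk1/run2.dat", instrument="cam2")

# Users name what they want; the catalogue knows where the copies are.
run1_replicas = catalogue.resolve("survey/run1")
cam1_sets = catalogue.find(instrument="cam1")
print(run1_replicas, cam1_sets)
```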

14 SRM and more
Storage Resource Managers
  – http://forge.gridforum.org/projects/gsm-wg/
  – a joint effort between a number of institutions
  – EU DataGrid/CERN, FermiLab, LBNL, JL
  – to define a standardised interface to Storage Resource Managers so that different implementations can work together
  – principally between physics communities, extending further now
Many other examples of data middleware
  – Replication management and location: RLS, QCDGrid
  – Many datagrids: SciDAC, Gfarm
  – GridFTP for efficient transfer
  – Packaged software: Virtual Data Toolkit

15 EDINA and friends
EDINA
  – http://edina.ac.uk/
  – Offers the UK tertiary education and research community networked access to a library of data, information and research resources, e.g. geographical data
Digital Curation Centre
  – http://www.dcc.ac.uk
  – supports UK institutions to store, manage and preserve these data to ensure their enhancement and their continuing long-term use
Other national data centres:
  – MIMAS, UKDA, CCLRC DataPortal…

16 Summary
Data is important to research
  – across all disciplines
There is already a large amount of data
  – but it's sometimes difficult to find and bring together
Data Services are built to standards
  – which define particular functionality
Data Services should be composable
  – so that it is easier to work with data
There is already software out there
  – so it is possible to evaluate against your requirements
