Digital Curation Centre research agenda


1 Digital Curation Centre research agenda
a centre of expertise in data curation and preservation
Michael Day, Digital Curation Centre, UKOLN, University of Bath
The Digital Curation Centre: a NIEeS community awareness day, Centre for Mathematical Sciences, Cambridge, UK, 16 June 2005
Funded by:

2 DCC research
The DCC research team
- Led by Professor Peter Buneman (School of Informatics, University of Edinburgh)
- Distributed throughout all four DCC partner organisations
- Strong links with other DCC components, through multi-team working, etc.
Links with other research groups
- Visitors programme
- Research agenda

The DCC research team is led by Professor Peter Buneman and has four main goals:
- To draw together the various functions of curation, from traditional archival functions to the maintenance and publication of evolving knowledge as seen in scientific databases.
- To identify, through direct research collaboration and through interaction with the service arm of the DCC, the key projects in which research is needed.
- To conduct research in areas already identified by the partners as crucial to digital curation.
- To institute two-way conduits between research and service, in which practical issues can be drawn to the attention of researchers and the products of research can be tested in practice.

Current research priorities are:
- Data integration and publication
- Performance and optimisation
- Annotation
- Appraisal and long-term preservation
- Socio-economic and legal context: rights, responsibilities and viability
- Cost-benefit analysis of the data curation process
- Security: safe and effective data analysis environments
- Automation of metadata extraction

The DCC hosts a Visitors Programme in which researchers engaged in cutting-edge work are brought to the UK to present their findings and engage with DCC staff. See upcoming and previous events for more information. If you have any questions, comments, or offers of collaboration, contact the research team by email.

3 Research goals (1)
- To draw together the various functions of curation, from the traditional archival functions to the maintenance and publication of evolving knowledge as seen in scientific databases
- To conduct research in areas already identified by the partners as crucial to digital curation

4 Research goals (2)
- To identify, through direct research collaboration and through interaction with the service arm of the DCC, the key projects in which research is needed
- To institute two-way conduits between research and service, in which practical issues can be drawn to the attention of researchers and the products of research can be tested in practice

5 Current priorities (1)
Data integration and publication
- Review of techniques
- Publishing data that conforms to a given format or schema
Performance and optimisation
- Safe data analysis environments within data centres
- Initial testbed based on sky survey databases (in collaboration with the Wide Field Astronomy Unit and AstroGrid)

Data integration and publication
Review of techniques. A report will be delivered within the first year. A special emphasis of this project will be to examine integration techniques in the context of digital preservation metadata.
Enhanced publishing systems. A new form of data integration arises from the need to publish data that conforms to some format or schema. The situation is typically one in which a community supports several data resources and wants to publish one or more integrated views of those resources. A longer-term research goal is to provide some synthesis between these projects, with the aim of building in constraints, security features and tools for coping with schema evolution.

Performance and optimisation
A safe and effective data analysis environment in the data centre. We propose to address this problem in collaboration with the Wide Field Astronomy Unit and AstroGrid, developing a testbed system based on SuperCOSMOS and the WFCAM Science Archive, two TB-scale sky survey databases created and curated in Edinburgh.
Deliverables:
- Year 1: A report assessing the scientific requirements for a safe data analysis environment and reviewing existing work in this area.
- Year 2: Development and deployment of a testbed data analysis system based on the SuperCOSMOS and WFCAM Science Archives.
- Year 3: Generalisation of the testbed data analysis system, in collaboration with partners in other communities (e.g. bioinformatics, geological sciences) where this problem is important.
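The idea of publishing an integrated view of several resources only when records conform to a declared schema can be sketched as follows. This is an illustrative toy, not DCC software: the field names, the schema format and the `publish_view` function are all hypothetical.

```python
# Hypothetical sketch: publish an integrated view of several data resources,
# admitting only records that conform to a declared schema.
# Field names (id, ra, dec) are illustrative, not a DCC specification.

SCHEMA = {"id": str, "ra": float, "dec": float}  # required fields and types

def conforms(record, schema=SCHEMA):
    """True if the record has exactly the required fields, correctly typed."""
    return set(record) == set(schema) and all(
        isinstance(record[key], typ) for key, typ in schema.items()
    )

def publish_view(*resources):
    """Merge several resources into one view, setting aside bad records."""
    view, rejected = [], []
    for resource in resources:
        for record in resource:
            (view if conforms(record) else rejected).append(record)
    return view, rejected

catalogue_a = [{"id": "S1", "ra": 10.68, "dec": 41.27}]
catalogue_b = [{"id": "S2", "ra": 83.82, "dec": -5.39},
               {"id": "S3", "ra": "bad"}]  # missing field, wrong type
view, rejected = publish_view(catalogue_a, catalogue_b)
```

A real publishing system would also have to cope with schema evolution, which is exactly the longer-term research goal noted above.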

6 Current priorities (2)
Performance and optimisation (continued)
- Automated metadata extraction and generation
- Essential for testing the scalability of metadata-based preservation strategies
- Review of tools, assessment of text mining techniques
Annotation
- Survey of the forms of annotation
- Annotation and provenance
- A model for data transformations that maintains annotation and provenance

Automation of metadata extraction
These issues are important in the scientific domain as well as within the digital library community. In many disciplines, vast quantities of data products are archived, and adequate metadata must be made available to make their subsequent retrieval efficient. Our research on this topic will focus on a review of currently available tools for automatic metadata extraction, together with an assessment of how text mining techniques may be applied to this problem.
Deliverables:
- Year 1: A study of automated metadata extraction to aid long-term digital preservation, and of the use of text mining techniques in digital curation.
- Year 2: A report summarising the tools available for automated metadata extraction and their place in the curation process, together with an assessment of what further tools should be developed.

Annotation
Forms of annotation. A survey is needed of the various forms of annotation. Of particular interest will be the extent to which forms of annotation can be predicted when metadata formats or databases are designed. We know from examples that this is not always possible; in these cases, the difficulty of subsequently attaching annotations will be investigated.
Annotation and provenance. The grand challenge here is to develop a model for data transformations in which annotation and provenance are fully described. BIO-DAS, for example, is a system in which annotation is carried with the provenance of data items. This project will also investigate the annotation of special (e.g. spatial) structures and the attachment of annotation to data.

7 Current priorities (3)
Appraisal and long-term preservation
- Appraisal techniques: investigating the applicability and scalability of traditional appraisal techniques in 'data-intensive' contexts
- Dynamic databases: preservation techniques for evolving metadata and databases

Appraisal and long-term preservation
A study of appraisal techniques. This study involves the questions of when and how to retain and preserve data. It requires two-way communication of expertise between the "library/archives" and "scientific database" components of the proposal. The flood of raw scientific data will defeat Moore's law, so some form of appraisal of experimental scientific data is essential. This is an area in which library and archival expertise may be of use to scientists.
Development and field-testing of database and dynamic data set preservation software. We also need to archive dynamic data: the fluid datasets that constitute preservation metadata and much scientific data (especially annotation data). Recent work has shown that this kind of data has properties that allow all past versions of a database to be preserved efficiently. Tests on existing scientific data sets indicate that all versions of such a database over a year can be stored in an XML file that is typically only 10-15% larger than a single version of the data. The frequency of archiving is limited only by the speed of the algorithm, which, in its basic form, is the time taken to scan the archival file and the most recent version. The method has other useful properties: it interacts well with compression techniques, and, with the appropriate indexing structures, it permits temporal queries on objects in the file.
Development of preservation techniques for evolving metadata and databases. We also need to investigate how database attributes change in their interpretation (meaning) when a database is active over very long periods of time. Some e-Science databases will continue to grow either for long but fixed periods (e.g. possibly sky surveys?) or indefinitely (e.g. datasets of genetic information). Over that time, concepts and attitudes are likely to change, and the data may come to be interpreted differently. Sometimes this will add value (new meanings discovered in old data); sometimes it will obscure value. As an example from another field, the meaning of a credit in a university student record system will change over time. Explicit recognition must be made of this.
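The version-archiving idea above — store all past versions in one file that is only slightly larger than a single version, and support temporal queries over it — can be sketched as follows. This toy merges successive snapshots of a key-to-value dataset, recording for each value the range of versions in which it was valid, so unchanged data is stored once. It is a minimal illustration of the principle, not the DCC's actual preservation software, and the function names are hypothetical.

```python
# Toy sketch of version archiving: successive snapshots are merged into one
# archive in which each value carries the interval of versions it was valid
# for, so data that does not change between versions is stored only once.

def archive(versions):
    """versions: list of dicts (key -> value), oldest first.
    Returns an archive: key -> list of (value, first_version, last_version)."""
    store = {}
    for v, snapshot in enumerate(versions, start=1):
        for key, value in snapshot.items():
            runs = store.setdefault(key, [])
            if runs and runs[-1][0] == value and runs[-1][2] == v - 1:
                runs[-1] = (value, runs[-1][1], v)  # unchanged: extend the run
            else:
                runs.append((value, v, v))          # new or changed: new run
    return store

def retrieve(store, version):
    """Reconstruct one complete version from the archive (a temporal query)."""
    return {key: val for key, runs in store.items()
            for val, first, last in runs if first <= version <= last}

v1 = {"obj1": "mag=12.3", "obj2": "mag=15.1"}
v2 = {"obj1": "mag=12.3", "obj2": "mag=15.4"}  # only obj2 changed
store = archive([v1, v2])
```

Because unchanged values are stored once with an extended interval, archiving a year of mostly-stable versions costs little more space than a single snapshot, which is the behaviour the 10-15% figure above describes.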

8 Current priorities (4)
Socio-economic and legal contexts
- Networks of trusted repositories: varying preservation roles for repositories; roles for co-operation, exchange formats, replication, etc.
- Economic cost-benefit analysis of curation processes: quantifying costs and benefits; testing the economic viability of curation processes

Socio-economic and legal context: rights, responsibilities and viability
The organisational dynamics of a network of trusted repositories. In the near future there is likely to be a variety of trusted repositories that will need to interact both with each other and with their designated communities. One of the key roles of the DCC will be to help synergise effort between these organisations.
The first stage of this will be a research study examining the organisational dynamics of trusted repositories and how a future network of repositories might function in the UK higher and further education and research contexts and in the wider global network. This would identify the full range of potential stakeholders and propose ways in which they could co-operate to prevent duplication of effort, e.g. on common technical approaches to the curation of digital data, repository certification, etc. The study would also help initiate a debate on the long-term curation role of institution-based repositories and how they might link with services based at national or international level. For example, this might include the replication of data between repositories or the development of policies that deal with institutional impermanence. The project will produce a report on the organisational dynamics of a network of trusted repositories.
Economic cost-benefit analysis of the data curation process. Research is needed to help funding bodies, and others, to quantify the costs as well as the benefits of data curation. The OAIS Reference Model stresses the importance of the "Designated User Community/Communities" and the need to understand their knowledge bases in determining the needs for preserving/curating information in a useful way. Ontologies and data models form part of this, but just as important are the expected levels of knowledge that users might have, and the "Gödel ends". It is likely that digital curation to the level encouraged by the DCC will imply additional costs, which must be estimated. Benefits are more difficult to quantify; for research funding bodies, publication and citation statistics would be obvious measures. Other measures must be constructed in discussion with the funding bodies and other stakeholders, and a range of such measures is likely to be needed to suit a variety of types of archives.
Economic analysis and modelling techniques will be applied in this research study. The information obtained will provide evidence for the economic viability of the data curation process and the associated data repositories. Where appropriate, additional expertise in economic analysis and modelling methods will be sought to complement the expertise within the consortium. The project will produce a preliminary analysis of the economic costs and benefits of the data curation process.

9 Current priorities (5)
Socio-economic and legal contexts (continued)
Rights and responsibilities
- The legal contexts of curation, e.g. impacts of the Database Directive
- Complexity of rights held in databases; impacts on the aggregation and reuse of data

Rights and responsibilities
Where tools are built specifically to enable the aggregation and re-use of existing works and data, issues of IP ownership and the exploitation of the results of aggregation become difficult to resolve. While rights management systems may be able to deal with some ownership issues, complexity can quickly develop where different ownership, access and re-use conditions apply. Pressure to release and further share results may be impeded by IPRs, and upstream IPR claims may inhibit downstream exploitation. How could the tools, and more particularly the legal framework, enable research and development while ensuring that the rights of third parties are not compromised? Where might the balance be struck between respecting existing rights and the general public interest in furthering research and development, and what strategies might assist in achieving that goal? The project will produce a scoping report identifying intellectual property rights and responsibilities in the development of digital curation tools and their uses.

10 Further information
Digital Curation Centre
Web site:
Contact:

