Requirements for Long-Term Preservation. David Giaretta, 1st October 2009, Helsinki.

1 Requirements for Long-Term Preservation. David Giaretta, 1st October 2009, Helsinki

2 Digital Preservation… Easy to do… …as long as you can provide money forever Easy to test claims about repositories… …as long as you live a long time

3 Digital Preservation activities: Infrastructure; Information about users and practices; ISO standard: OAIS; ISO standard: OAIS update; ISO standards: Audit and Certification; Tools; Relationship to related work and community practices

4 Alliance for Permanent Access. The Alliance aims to develop a shared vision and framework for a sustainable organisational infrastructure for permanent access to scientific information. Members: The British Library; European Organization for Nuclear Research [CERN]; CSC IT Center for Science; Delegation of the Finnish Academies of Science and Letters; Deutsche Nationalbibliothek; Digital Preservation Coalition; European Science Foundation [ESF]; European Space Agency [ESA]; Helmholtz-Gemeinschaft Deutscher Forschungszentren; International Association of Scientific, Technical & Medical Publishers; Joint Information Systems Committee [JISC]; Koninklijke Bibliotheek; Max-Planck-Gesellschaft; NESTOR Kompetenznetzwerk; Nationale Coalitie Digitale Duurzaamheid [NCDD]; Portico; Science & Technology Facilities Council [STFC]

5 Alliance for Permanent Access (aims and members as on slide 4). PARSE.Insight

6 Preservation is a social activity. Sometimes our activities are personal: preserve for your future self [Australia]. In the short term: for re-use by colleagues and other people. In the long term: for re-use by future generations. Neeri Oct 2009, Helsinki

7 Definitions (OAIS). Long Term Preservation: The act of maintaining information, Independently Understandable by a Designated Community, and with evidence supporting its Authenticity, over the Long Term. Long Term: A period of time long enough for there to be concern about the impacts of changing technologies, including support for new media and data formats, and of a changing Designated Community, on the information being held in an OAIS. This period extends into the indefinite future. Notes: not just BIT preservation; not just rendering; Information, not just DATA or Documents; Authenticity.

8 Information is the important thing. What information? Documents… Data… Original bits? Look and feel? Behaviour? Performance? Explicit / Implicit / Tacit. Information: Any type of knowledge that can be exchanged. In an exchange, it is represented by data. Long Term is long enough to be concerned with the impacts of changing technologies, including support for new media and data formats, or with a changing user community. Long Term may extend indefinitely. Ensure that the information to be preserved is Independently Understandable to (and usable by) the Designated Community.

9 Things change/disappear: Software; Hardware; Environment (e.g. network links to related information); People; What is common knowledge. How can we ensure that the information trapped in the bits remains understandable despite all these changes?

10 Just Format? sfqsftfoubujpo jogpsnbujpo svmft representation information rules You have a file JHOVE tells you it is WORD version 7 Format – necessary but not sufficient: formats can be used for multiple purposes e.g. audio files used to store configuration parameters
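The scrambled phrase on this slide appears to be the plain phrase with every letter moved one place forward in the alphabet. The sketch below is only an illustration of the slide's point (knowing the encoding rule is itself a piece of Representation Information); the function name `shift_letters` is a hypothetical helper, not something from the presentation.

```python
def shift_letters(text: str, offset: int) -> str:
    """Shift each ASCII letter by `offset` places, wrapping around the alphabet.

    Characters that are not ASCII letters (e.g. spaces) are left unchanged.
    """
    out = []
    for ch in text:
        if "a" <= ch <= "z":
            out.append(chr((ord(ch) - ord("a") + offset) % 26 + ord("a")))
        elif "A" <= ch <= "Z":
            out.append(chr((ord(ch) - ord("A") + offset) % 26 + ord("A")))
        else:
            out.append(ch)
    return "".join(out)


# The scrambled text from the slide; undoing a shift of one recovers the words.
encoded = "sfqsftfoubujpo jogpsnbujpo svmft"
decoded = shift_letters(encoded, -1)
print(decoded)
```

Without the rule "shift back by one", the bytes are intact but the information is lost, which is the distinction between bit preservation and information preservation that the slide draws.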

11 XML enough? John Mary Paul URL of data file used to create this table. Target name U0lNUExFICA9ICAgICAgICAgICAgICAgICAgICBUIC8gU3RhbmRhcmQgRklUUyBm b3JtYXQgICAgICAgICAgICAgICAgICAgICAgICAgICBCSVRQSVggID0gICAgICAg ICAgICAgICAgICAgIDggLyBDaGFyYWN0ZXIgZGF0YSAgICAgICAgICAgICAgICAg ICAgICAgICAgICAgICAgIE5BWElTICAgPSAgICAgICAgICAgICAgICAgICAgMCAv IE5vIGltYWdlLCBqdXN0IGV4dGVuc2lvbnMgICAgICAgICAgICAgICAgICAgICAg
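The opaque block on this slide looks like Base64 text embedded in an XML document. As a small sketch, assuming it is standard Base64, decoding just its first twelve characters already shows the opening keyword of a FITS header. Only the prefix is decoded here because the transcript may have altered the line breaks of the full block.

```python
import base64

# First twelve characters of the encoded block shown on the slide.
blob_prefix = "U0lNUExFICA9"

# Twelve Base64 characters decode to exactly nine bytes, so no padding is needed.
decoded = base64.b64decode(blob_prefix)
print(decoded)
```

The XML wrapper tells a reader where the data sits, but not that it is Base64, nor that the decoded bytes follow the FITS conventions; both are further Representation Information that has to be recorded somewhere.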

12 Data… Level 2 GOME Satellite instrument data

13 Complex container objects

14 What can we rely on over time? Words on paper (or titanium sheets) that people can read: carvings in stone and books have proven track records of preserving information over hundreds of years. The information, such as Representation Information, which is collected: a somewhat recursive assumption, but it is difficult to make progress without it. Some kind of remote access: network access is the natural assumption, but in principle other methods of obtaining information from a given address/location would suffice, for example fax or horse-back rider. Some kind of computers. People? Organisations? Identifiers? Clearly we cannot assume that any given URI, for example, will remain valid forever, nor even that the DNS will resolve names.

15 Key OAIS Concepts. Claiming "This is being preserved" is untestable: essentially meaningless, except for BIT PRESERVATION. How can we make it testable? Claim to be able to continue to do something with it: understand/use it. This needs Representation Information. Still meaningless… things are too interrelated and Representation Information is potentially unlimited, which is why a Designated Community is needed. Many other concepts are identified: a finer grained taxonomy than simply saying "Metadata", which allows one to ask if one has all the required types. Available from:

16 Representation Information The Information Model is key Recursion ends at KNOWLEDGEBASE of the DESIGNATED COMMUNITY (this knowledge will change over time and region)

17 OAIS Archival Information Package (AIP)

18 Representation Information Network

19 Preservation and Re-use Unfamiliar information Preservation Digitally encoded information which must be usable and understandable Unfamiliar because of separation in time E-Science/GRID/CyberInfrastructure for data Digitally encoded information which must be usable and understandable Unfamiliar because of separation in discipline or location – even if created yesterday Support automated usage where possible

20 Rep Info /DISCIPLINE Virtualisation

21 Insight: stakeholders Research Research institutes (non-profit) Universities Academic libraries Data management (preservation) Data centres (profit / non-profit) Libraries Archives Funding/policy National Funding organisations European funding Corporate funding Publishing General (cross-community) publishers Specific (community) publishers

22 Surveys to stakeholders Research Elsevier mailinglist (35,000 people), ESF, MCFA, Eurodoc, ALLEA, YEAR, Digital Humanities Observatory, etc. Data management (preservation) LIBER, DPE, DPC, NCDD, DCC, D-lib Magazine, PADI, JISC mailing lists, CASPAR, Planets, etc. Funding/policy ESF, Alliance for Permanent Access, national funding agencies Publishing International Association of STM publishers, Directory of Open Access Journals (DOAJ)

23 Surveys to stakeholders Research 1397 responses Data management (preservation) 273 responses Funding/policy < responses Publishing 186 responses

24 Threats to preservation 1. Users may be unable to understand or use the data e.g. the semantics, format or algorithms involved. 2. Lack of sustainable hardware, software or support of computer environment may make the information inaccessible. 3. Evidence may be lost because the origin and authenticity of the data may be uncertain. 4. Access and use restrictions (e.g. Digital Rights Management) may not be respected in the future. 5. Loss of ability to identify the location of data. 6. The current custodian of the data, whether an organisation or project, may cease to exist at some point in the future. 7. The ones we trust to look after the digital holdings may let us down.

25 Threats to preservation (R) The ones we trust to look after the digital holdings may let us down The current custodian of the data may cease to exist Loss of ability to identify the location of data Access and use restrictions may not be respected in the future Evidence may be lost Lack of sustainable hardware/software Users may be unable to understand or use the data

26 Threats to preservation (R) Users may be unable to understand or use the data e.g. the semantics, format or algorithms involved.

27 Threats to preservation (DM) The ones we trust to look after the digital holdings may let us down The current custodian of the data may cease to exist Loss of ability to identify the location of data Access and use restrictions may not be respected in the future Evidence may be lost Lack of sustainable hardware/software Users may be unable to understand or use the data

28 Threats to preservation (P) The ones we trust to look after the digital holdings may let us down The current custodian of the data may cease to exist Loss of ability to identify the location of data Access and use restrictions may not be respected in the future Evidence may be lost Lack of sustainable hardware/software Users may be unable to understand or use the data

29 Threats and requirements for solutions
Threat: Users may be unable to understand or use the data, e.g. the semantics, format, processes or algorithms involved. Requirement: Ability to create and maintain adequate Representation Information.
Threat: Non-maintainability of essential hardware, software or support environment may make the information inaccessible. Requirement: Ability to share information about the availability of hardware and software and their replacements/substitutes.
Threat: The chain of evidence may be lost and there may be lack of certainty of provenance or authenticity. Requirement: Ability to bring together evidence from diverse sources about the Authenticity of a digital object.
Threat: Access and use restrictions may make it difficult to reuse data, or alternatively may not be respected in future. Requirement: Ability to deal with Digital Rights correctly in a changing and evolving environment.
Threat: Loss of ability to identify the location of data. Requirement: An ID resolver which is really persistent.
Threat: The current custodian of the data, whether an organisation or project, may cease to exist at some point in the future. Requirement: Brokering of organisations to hold data and the ability to package together the information needed to transfer information between organisations ready for long term preservation.
Threat: The ones we trust to look after the digital holdings may let us down. Requirement: Certification process so that one can have confidence about whom to trust to preserve data holdings over the long term.

30 FUTURE Users may be unable to understand or use the data e.g. the semantics, format, processes or algorithms involved Non-maintainability of essential hardware, software or support environment may make the information inaccessible The chain of evidence may be lost and there may be lack of certainty of provenance or authenticity Access and use restrictions may not be respected in the future Loss of ability to identify the location of data The current custodian of the data, whether an organisation or project, may cease to exist at some point in the future The ones we trust to look after the digital holdings may let us down

31 Roadmap PARSE.Insight produced draft Preservation Infrastructure Roadmap Now a SCIENCE DATA INFRASTRUCTURE ROADMAP after consultation with EU

32 Infrastructures for preservation Social / Legal / Financial / Organisational Agreements / Trust / Standards Costs/ Benefits/ Rewards Technical components

33 Lessons from other Infrastructures Need to grow, encourage, foster rather than build include organisational, financial, legal & marketing Provide services rather than specific technologies Tackle choke points Various phases of development

34 Encouraging organisational and social change. Policies: mandates for depositing research data and funding agencies' requirements. Robust and reliable deposit places, where researchers can be sure their data will not get lost, be corrupted or be misused, with correct access-rights mechanisms. Elements that increase comfort levels so that new users will know how to use and interpret the available data. Communication and awareness around these issues. Have publication of data as valued and as referenceable as publication of a paper in a journal.

35 Repository Audit and Certification Standard for certification in OAIS Roadmap Initial work produced TRAC Now an official CCSDS Working Group Open virtual meetings, notes and documents: Draft standard submitted to CCSDS/ISO to form the basis of an international audit and certification process

36 CASPAR Consortium: EU FP6 Integrated Project. Total spend approx. 16 MEuro (8.8 MEuro from EU). Started April 2006, for 42 months.

37

38 Preservation Data Flows and Strategies More strategies than just emulate or transform

39 Creating an OAIS Archival Information Package

40 Modules and Dependencies: defining the Designated Community README.txt TEXT EDITOR ENGLISH LANGUAGE WINDOWS XP FITS FILE FITS STANDARD PDF STANDARD FITS JAVA s/w JAVA VM PDF s/w FITS DICTIONARY SPECIFICATION UNICODE SPECIFICATION XML SPECIFICATION MULTIMEDIA PERFORMANCE DATA C3D DirectXMAX/MSP 3D motion data files 3D scene data files motion to music mapping strategy

41 Modules and Dependencies: Examples (Semantic Web data) ns4 ns2 ns1 ns3 RDF/S modules and dependencies

42 Scenario: Intelligibility-aware Packaging FITS STANDARD PDF STANDARD FITS DICTIONARY SPECIFICATION UNICODE SPECIFICATION XML SPECIFICATION o2o1 P1 P2 C3D DirectX MAX/MSP o3 P3 ZIP Gap(o2,P1) = Gap(o2,P2) = – {FITS, FITS_STANDARD, FITS_DICTIONARY, DICTIONARY_SPECIFICATION} Gap(o2,P3) = – {FITS, FITS_STANDARD, FITS_DICTIONARY, DICTIONARY_SPECIFICATION, PDF_STANDARD, XML_SPECIFICATION, UNICODE_SPECIFICATION} Gap(o3,P3) = – {ZIP} Gap(o3, ) = – {ZIP, C3D, DirectX, MAX/MSP}
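The Gap notation on this slide can be read as "what an object depends on, minus what a community profile already knows". The sketch below illustrates that reading under stated assumptions: the dependency edges in `DEPENDS_ON`, the profile contents, and the helper names `closure` and `gap` are all hypothetical and are not taken from the slide's diagram. Only the module names are borrowed from the slide.

```python
from typing import Dict, Iterable, Set

# Hypothetical dependency graph: module -> modules it directly depends on.
# These edges are illustrative assumptions, not the slide's actual diagram.
DEPENDS_ON: Dict[str, Set[str]] = {
    "FITS_FILE": {"FITS_STANDARD", "FITS_DICTIONARY"},
    "FITS_STANDARD": {"PDF_STANDARD"},
    "FITS_DICTIONARY": {"DICTIONARY_SPECIFICATION"},
    "DICTIONARY_SPECIFICATION": {"XML_SPECIFICATION"},
    "XML_SPECIFICATION": {"UNICODE_SPECIFICATION"},
}


def closure(modules: Iterable[str], graph: Dict[str, Set[str]]) -> Set[str]:
    """Return the given modules plus everything reachable through dependencies."""
    seen: Set[str] = set()
    stack = list(modules)
    while stack:
        module = stack.pop()
        if module not in seen:
            seen.add(module)
            stack.extend(graph.get(module, ()))
    return seen


def gap(obj: str, profile: Iterable[str], graph: Dict[str, Set[str]]) -> Set[str]:
    """Modules the object needs (transitively) that the profile does not cover.

    Assumption: knowing a module implies knowing everything it depends on.
    """
    required = closure(graph.get(obj, ()), graph)
    known = closure(profile, graph)
    return required - known


# A hypothetical community that already understands PDF and XML.
partial_profile = {"PDF_STANDARD", "XML_SPECIFICATION"}
print(sorted(gap("FITS_FILE", partial_profile, DEPENDS_ON)))

# A community that knows nothing needs every dependency packaged.
print(sorted(gap("FITS_FILE", set(), DEPENDS_ON)))
```

Under this reading, an intelligibility-aware package for a given community would carry exactly the modules in the gap, so the same object yields a smaller package for a better-informed Designated Community.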

43 Example: Identification of an Attribution Right. [Diagram combining Works, Provenance, Legislation and Rights ontologies (CIDOC-CRM, FRBRoo): the actor Kia Ng carries out an activity of improvisation on the violin, which creates an expression of that improvisation and generates an Attribution Right, documented in a written norm (Art. X of Law Y); Kia claiming authorship is an activity allowed by that right. Diagram labels: "100% recall, <100% precision" and "100% precision".] Thanks to MetaWare

44 Provenance: Performing Arts Thanks to ULeeds and CNRS

45 Authenticity

46 Threats and requirements for solutions (same table as slide 29)

47 Threats and CASPAR components
Threat: Users may be unable to understand or use the data, e.g. the semantics, format, processes or algorithms involved. CASPAR component: RepInfo toolkit, Packager and Registry, to create and store Representation Information. In addition the Orchestration Manager and Knowledge Gap Manager help to ensure that the RepInfo is adequate.
Threat: Non-maintainability of essential hardware, software or support environment may make the information inaccessible. CASPAR component: Registry and Orchestration Manager to exchange information about the obsolescence of hardware and software, amongst other changes. The Representation Information will include such things as software source code and emulators.
Threat: The chain of evidence may be lost and there may be lack of certainty of provenance or authenticity. CASPAR component: Authenticity toolkit will allow one to capture evidence from many sources which may be used to judge Authenticity.
Threat: Access and use restrictions may make it difficult to reuse data, or alternatively may not be respected in future. CASPAR component: Digital Rights and Access Rights tools allow one to virtualise and preserve the DRM and Access Rights information which exist at the time the Content Information is submitted for preservation.
Threat: Loss of ability to identify the location of data. CASPAR component: Persistent Identifier system: such a system will allow objects to be located over time.
Threat: The current custodian of the data, whether an organisation or project, may cease to exist at some point in the future. CASPAR component: Orchestration Manager will, amongst other things, allow the exchange of information about datasets which need to be passed from one curator to another.
Threat: The ones we trust to look after the digital holdings may let us down. CASPAR component: The Audit and Certification standard to which CASPAR has contributed will allow a certification process to be set up.

48 Conclusions. Preservation is a complex process and involves more than just bits and formats; "metadata" is too vague a term. Transparency is vital: what is being preserved, for whom, and for how long. OAIS is a good basis for preservation. Recursion is an important concept in preservation. Preservation threats must be countered by specific tools and shared infrastructure components.

49 Additional links CASPAR: PARSE.Insight: Alliance for Permanent Access: Digital Curation Centre: Audit and certification: wiki.digitalrepositoryauditandcertification.org OAIS:

50 END

51

52 Summary What is digital preservation? Transparency What is needed for digital preservation? Many strategies –Need to be clear about the scope of each Document/rendered object? Scientific data – processed/combined to produce new results? Other? –How are all of the threats being addressed? What exactly is being preserved? For whom is it being preserved? –Designated Community must be specified –Testability through understandability/usability How will it be handed on to future custodians

53 Umbrella framework Need to integrate in some sense many different Systems Disciplines Funding Requirements Projects producing preservation artefacts Representation Information Significant Properties Provenance etc

54 About researchers EU 44%, USA 33%, Other 23% Per category

55 Data spectrum (R)

56 Cross-disciplinary use of research data

57 Sharing of data (R) Did you ever need digital research data gathered by other researchers that was not available?

58 Sharing of data (R) Do you presently make use of research data gathered by other researchers?

59 Sharing of data (R) Would you like to make use of research data gathered by other researchers? Within discipline / Outside discipline

60 Sharing of data (R) How open is your data?

61 Sharing of data (R) Which constraints do you see in making data open?

62 Sharing of data (R) How do you locate and access digital research data?

63 Linking of data (R) As a researcher, do you think it is useful to link underlying research data to formal literature?

64 Linking of data (P) Do you link references in your journals to underlying digital research data?

65 Linking of data (P) Do you as publisher charge separate fees when users want to access data associated with publications?

66 Linking of data (P) Can authors submit their underlying digital research data with their publication to the publisher?

67 About funding Researchers say: Data managers say : Publishers say: Government (national funding) Who should pay for data preservation? Who should pay for preservation of publications? Researchers say: Data managers say : Publishers say: Government (national funding)

68 Who should pay? (P) For preservation of other research output

