1 The National Archives Digital Records Infrastructure Catalogue First Steps to Creating a Semantic Digital Archive Rob Walpole DeveXe Limited The National Archives

2 Disclaimer This presentation is in no way intended to express views or opinions of The National Archives and is solely the work of Rob Walpole, an employee of DeveXe Limited who are currently contracted to assist in the development of the Digital Records Infrastructure at Kew in London. Apart from providing a case study of developing a semantic digital archive, this presentation discusses the opportunities permitted by such development. It should not be assumed that these developments will occur and DeveXe Limited take no responsibility for any perceived inaccuracies.

3 Background

5 Background – The National Archives The National Archives (TNA) Over 11 million historical government and public records From the Domesday Book to the Agreement on a Referendum on Independence for Scotland But not births, deaths and marriages; these are held by the General Register Office! Photo by Chris Hill

6 Background – The National Archives Most of these documents are currently held on paper - or even parchment... Photo by Liz West

8 Background – The National Archives But soon this will be overtaken by a tsunami of digital files, including office documents, emails, images, videos and much more. Photo by Marco Mazzei

10 Background – Digital Records Infrastructure There are many challenges around digital preservation including:- format recognition; software preservation; compatibility; degradation of media. Many of these issues were highlighted by the BBC Domesday Project (1986).

11 Background – Digital Records Infrastructure TNA have been at the forefront of meeting this digital preservation challenge:- PRONOM – file format registry; DROID – file format identification tool; Legislation.gov.uk – all UK legislation on-line; UK Government Web Archive (http://www.nationalarchives.gov.uk/webarchive/); The London Gazette – published by HMSO (part of TNA).

12 Background – Digital Records Infrastructure In 2006 TNA deployed the Digital Repository System (DRS), which provided terabyte-scale long-term storage. In 2012 TNA started to build DRI (Digital Records Infrastructure) on the foundations of DRS to deliver extensible storage to the petabyte scale and beyond.

14 Background – Digital Records Infrastructure 80,000 digitised Home Guard records from World War 2 were ingested into DRI as a proof of concept. Many more have followed, including LOCOG (2012 Olympic Games) and the Leveson Inquiry.

15 Background – Digital Records Infrastructure At its core this massive storage is provided by a robot tape library, with frequently requested and low-resolution copies of data held in a disk cache. Photo by Cory Doctorow

16 Background – The DRI Catalogue The DRI Catalogue is essentially an inventory of the items held in the archive. It is distinct from the TNA Catalogue which is a comprehensive catalogue system covering both paper and digital documents. Public access to the TNA Catalogue is provided by Discovery.

18 Background – The DRI Catalogue Rich XML metadata is stored in the archive itself, alongside the original document, and a copy is sent to Discovery. This comes from a variety of sources: record provider; archiving process; document transcription; archivists. And there is a very good reason for using XML... it's human readable!

19 Requirements

20 Requirements – The DRI Catalogue Apart from being an inventory, the DRI Catalogue is needed to help manage:- closure information; record opening; export lists; export status.

21 Requirements - Closure Closure can be very fine-grained. For example, Home Guard records have an open description (individual's name, battalion etc.), but the service record is closed until the individual is deceased and the medical record is closed until the record is 100 years old.
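
As a rough illustration of how fine-grained closure might be expressed, the sketch below encodes the three Home Guard rules from this slide as plain Java predicates. The class and method names are hypothetical and are not taken from DRI; treating an unknown date of death as "still closed" is also an assumption.

```java
import java.time.LocalDate;

/** Hypothetical sketch of per-part closure rules; not the DRI implementation. */
final class ClosureSketch {

    /** The description (individual's name, battalion etc.) is always open. */
    static boolean isDescriptionOpen() {
        return true;
    }

    /**
     * Service record: closed until the individual is deceased.
     * A null dateOfDeath means no death is recorded, so the record stays closed (assumption).
     */
    static boolean isServiceRecordOpen(LocalDate dateOfDeath, LocalDate today) {
        return dateOfDeath != null && !dateOfDeath.isAfter(today);
    }

    /** Medical record: closed until the record itself is 100 years old. */
    static boolean isMedicalRecordOpen(LocalDate recordDate, LocalDate today) {
        return !today.isBefore(recordDate.plusYears(100));
    }

    public static void main(String[] args) {
        LocalDate today = LocalDate.of(2013, 1, 1);
        System.out.println("description open: " + isDescriptionOpen());
        System.out.println("service record open: " + isServiceRecordOpen(null, today));
        System.out.println("medical record open: "
                + isMedicalRecordOpen(LocalDate.of(1942, 6, 1), today));
    }
}
```

Because each part of a record carries its own rule, an export has to evaluate closure per part rather than per record, which is what made the original relational queries expensive.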

22 Requirements – Record Export The export process itself is in the form of a work-flow with many steps. The DRI Catalogue must maintain the status and other information about the export...

23 Requirements – The Problem Initially the DRI Catalogue was held in an RDBMS. However the fine-grained nature of closure meant very slow queries when attempting to export large numbers of records – sometimes taking hours to complete! Another approach was needed...!

24 Requirements – Initial Analysis Three different proposals were made for modelling the catalogue, so a trial was conducted to establish the best approach. The three models trialled were:- Relational – optimising the existing SQL queries against a modified table structure; Graph – running SPARQL queries against an RDF store; Hierarchical – running XQuery against an XML database.

27 Requirements – Analysis Results Relational – reduced query time from hours to minutes; Graph – reduced query time to seconds; Hierarchical – approach abandoned.

29 Requirements – Analysis Results The hierarchical approach was abandoned because:- the graph approach provided a good solution; the graph approach offered a path towards Linked Data; cost overheads and deadlines obliged us to move on. A hierarchical approach may have offered comparable performance and opportunity; we simply don't know...

30 Requirements – Analysis Conclusion The issues of closure and export had led to fundamental questions about the nature of the catalogue. We don't know exactly what information will need to go into DRI but we know it will be information about people, organisations, their relationships and activities. These things are complex and varied – just like the world around us! A graph approach not only resolved the issues with closure and export but provides a powerful and flexible tool for discovering information within the archive.

31 Design

32 Design - Technology Technologies used during the trial included:- D2RQ; Apache Jena framework (including TDB and Fuseki); Turtle (RDF); SPARQL 1.1 (Query and Update). The Jena framework was chosen because:- excellent Java API; open source.

33 Design - Technology UK Government Service Design Manual states... “...it remains the policy of the government that, where there is no significant overall cost difference between open and non-open source products that fulfil minimum and essential capabilities, open source will be selected on the basis of its inherent flexibility.” “Use open standards and common Government platforms (e.g. Identity Assurance) where available”

34 Design – The Catalogue Services

35 Design – DRI Vocabulary W3C recommend re-using vocabularies wherever possible and DRI already does this extensively in the XML metadata. But we needed to be able to talk about things very specific to DRI such as Closure and Export. So we extended RDF Schema (RDFS) with a few of our own classes and properties such as:- – rdf:type rdfs:Class . – rdf:type rdf:Property .

36 Design – DRI Vocabulary This allows us to talk about DRI exports such as:- a dri:Export ; dri:exportMember ; dri:exportMember .
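
To make the shape of such an export description concrete, here is a minimal sketch that holds the statements as subject/predicate/object triples in plain Java collections and looks up an export's members. It deliberately does not use Jena. The `ex:` identifiers are hypothetical placeholders, since the actual resource URIs are not shown on the slide; only `dri:Export` and `dri:exportMember` come from the slide.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical in-memory triple sketch; a stand-in for an RDF store, not Jena. */
final class ExportTriplesSketch {

    /** One RDF-style statement: subject, predicate, object. */
    static final class Triple {
        final String subject;
        final String predicate;
        final String object;

        Triple(String subject, String predicate, String object) {
            this.subject = subject;
            this.predicate = predicate;
            this.object = object;
        }
    }

    /** Builds a graph describing one export with two members (ex: names are made up). */
    static List<Triple> sampleGraph() {
        List<Triple> graph = new ArrayList<>();
        graph.add(new Triple("ex:export1", "rdf:type", "dri:Export"));
        graph.add(new Triple("ex:export1", "dri:exportMember", "ex:item1"));
        graph.add(new Triple("ex:export1", "dri:exportMember", "ex:item2"));
        return graph;
    }

    /** Returns the objects of every triple with the given subject and predicate, in insertion order. */
    static List<String> objects(List<Triple> graph, String subject, String predicate) {
        List<String> result = new ArrayList<>();
        for (Triple t : graph) {
            if (t.subject.equals(subject) && t.predicate.equals(predicate)) {
                result.add(t.object);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(objects(sampleGraph(), "ex:export1", "dri:exportMember"));
    }
}
```

In a real RDF store the same lookup would be a single triple pattern in a SPARQL query rather than a loop.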

38 Design – The Catalogue Services The Apache Jena Framework provides a straightforward approach to reading, writing, updating and deleting data using W3C standards... Reading – SPARQL 1.1 Query Language. Writing – creating and persisting new RDF triples (e.g. Turtle); SPARQL 1.1 Graph Store Protocol. Updating and Deleting – SPARQL 1.1 Update Language.

39 Design – The Catalogue Services However... having to learn SPARQL can be a hurdle to widespread acceptance of this technology! The answer... Elda (a Linked Data API implementation) provides RESTful access to pre-configured SPARQL queries:- spec:collectionList a apivc:ListEndpoint ; apivc:uriTemplate "/collection" ; apivc:selector [ apivc:where " ?item a dri:Collection . " ] .
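
The idea behind this configuration can be sketched conceptually in a few lines of Java: a URI template is registered against a WHERE clause, and a request path is turned into a full SPARQL query. This is only an illustration of the principle and not how Elda is implemented; the class and method names are hypothetical, and only the `/collection` template and its WHERE clause come from the slide.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of mapping REST paths to pre-configured SPARQL; not Elda's code. */
final class EndpointSketch {

    private final Map<String, String> whereClauses = new LinkedHashMap<>();

    /** Registers a WHERE clause for a URI template such as "/collection". */
    void register(String uriTemplate, String whereClause) {
        whereClauses.put(uriTemplate, whereClause.trim());
    }

    /**
     * Returns the SELECT query for a request path, or null if no endpoint is configured.
     * A real query would also need PREFIX declarations (e.g. for dri:), omitted here
     * because the namespace URIs are not given on the slide.
     */
    String queryFor(String path) {
        String where = whereClauses.get(path);
        if (where == null) {
            return null;
        }
        return "SELECT ?item WHERE { " + where + " }";
    }

    public static void main(String[] args) {
        EndpointSketch endpoints = new EndpointSketch();
        endpoints.register("/collection", " ?item a dri:Collection . ");
        System.out.println(endpoints.queryFor("/collection"));
    }
}
```

A client then only needs to know the path `/collection`; the SPARQL stays on the server side.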

40 Design - Implementation So how did we actually do it...? 1) Create a mapping from RDBMS to vocabulary terms. 2) Export data from RDBMS to N-Quads using D2RQ. 3) Load N-Quads into Jena TDB (embedded version). 4) Write SPARQL transform (CONSTRUCT) queries to refine RDF. 5) Run queries in Fuseki, download results and reload into clean database instances.

41 Design - Implementation

42 Design – Catalogue Services API RESTful JAX-RS web application providing a very simple API e9f3c8e9-e883-4fcf-a9a3-5caf0c808c5d Why XML? Why not JSON? The web services are consumed by Java applications. JSON is used in some circumstances, i.e. for a JavaScript tree editor.

43 Design – Insights Issues and Limitations Elda – Linked Data API Implementation

44 Design – Insights Issues and Limitations

46 Xturtle Shortage of RDF/SPARQL editors and IDEs! Xturtle provides a useful syntax highlighting plug-in for Eclipse...

48 Design – Insights Issues and Limitations Scardf - http://code.google.com/p/scardf/
Jena (Java):
    Model model = ModelFactory.createDefaultModel();
    model.createResource( "http://somewhere/JohnSmith" )
         .addProperty( N, model.createResource()
             .addProperty( Given, "John" )
             .addProperty( Family, "Smith" ) );
Scardf (Scala):
    Graph( UriRef( "http://somewhere/JohnSmith" ) -N-> Branch( Given -> "John", Family -> "Smith" ) )

50 Design – Insights Issues and Limitations Scale and Performance Will the DRI Catalogue cope with the tsunami? We think it will... 1) This solution was chosen because of its performance. 2) We are confident we can scale horizontally. In fact a catalogue for each collection makes some sense. You could then create a catalogue of catalogues to search everything! 3) If the existing framework fails to scale satisfactorily, the fact that we are using open standards means moving to another framework should be straightforward.

51 The Future

54 The Future The story so far:- remodelling of the DRI Catalogue; solution for Closure and Export. So what next? More metadata into the Catalogue, starting with the rich XML that we already have.

55 The Future – Named Entity Recognition So what could this mean for members of the public viewing records on Discovery?

56 The Future – Named Entity Recognition Records cease to be just text and become machine readable with context and meaning...

57 The Future – Ontology-driven NLP Natural Language Processing (NLP) tools can be used in conjunction with RDF to extract meaning... “From 1 Aug 44 to 20 Oct 44 Bayeux, Rouen and Antwerp. During the period from 1 Aug to date this officer has carried the principal strain of establishing and re-establishing the hospital in three situations. His unrelenting energy, skill and patience have been the mainstay of the unit. His work as a quartermaster is the most outstanding I have met in my service. (A.R.ORAM) Colonel Comdg. No.9 (Br) General Hospital”

61 The Future – Semantic Search Searching for “George John Potter” in Discovery currently returns 361 hits... that's 360 irrelevant ones, as there is only one record for a person with that name. A semantic search would allow you to search for a “person”, a “soldier” or an “officer” with that name. This is known as query string extension.

62 The Future – Semantic Search Semantic search also allows you to search for terms closely associated with your matches – known as cross referencing. In this case we would receive information about Colonel A.R.Oram as he also had an entry in Discovery...

63 The Future – Semantic Search Because related concepts are held in a graph it is possible to do exploratory search into a particular area of interest. In this case we might discover that Colonel Oram was himself awarded a medal for his work with No.9 British General Hospital...

64 The Future – Semantic Search It also becomes possible to do reasoning whereby rules can be applied creating new statements that are implied rather than explicit. For example we could say Colonel Oram “served with” Captain Potter...
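
A rule of this kind can be sketched in plain Java as a small forward-inference step: given statements that attach people to a unit, it derives "served with" statements that were never stated explicitly. This is a hand-rolled illustration and not a Jena reasoner. The assumption that both officers are recorded against No.9 British General Hospital, and the `servedWith` wording of the derived statement, are illustrative only.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of rule-based inference over (person, unit) statements. */
final class ServedWithSketch {

    /** Explicit facts as {person, unit} pairs (illustrative assumptions, not catalogue data). */
    static List<String[]> sampleMemberships() {
        return Arrays.asList(
                new String[] {"Colonel Oram", "No.9 British General Hospital"},
                new String[] {"Captain Potter", "No.9 British General Hospital"});
    }

    /** Rule: if two different people are attached to the same unit, infer "X servedWith Y". */
    static Set<String> inferServedWith(List<String[]> memberships) {
        Set<String> inferred = new LinkedHashSet<>();
        for (String[] a : memberships) {
            for (String[] b : memberships) {
                boolean sameUnit = a[1].equals(b[1]);
                boolean differentPeople = !a[0].equals(b[0]);
                if (sameUnit && differentPeople) {
                    inferred.add(a[0] + " servedWith " + b[0]);
                }
            }
        }
        return inferred;
    }

    public static void main(String[] args) {
        for (String statement : inferServedWith(sampleMemberships())) {
            System.out.println(statement);
        }
    }
}
```

The derived statements exist only because the rule was applied; nothing in the input says that the two officers served together.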

66 The Future – Linked Data While TNA is a huge national (and international) source of information it is not an authority on all things. Linked Data, the brainchild of WWW inventor Sir Tim Berners-Lee, provides a way of un-siloing and linking datasets using RDF-based machine-readable formats standardised by the W3C. TNA data could be linked to Linked Data sources such as DBpedia, Ordnance Survey, the British Library and the Smithsonian.

67 The Future – Crowd-sourced linking Even the best machine reading will miss key facts and links. Digitised documents rely on transcriptions for metadata as OCR still has a long way to go. Crowd-sourced linking would allow users to link established vocabulary terms to specific documents. Discovery already allows tagging but users tend to create very personal terminology which doesn't necessarily help others...

68 The Future – Open World Assumption Using a semantic approach allows for an Open World Assumption. That is to say that it is... “implicitly assumed that a knowledge base may always be incomplete” [Hitzler, Krötzsch, Rudolph – Foundations of Semantic Web Technologies] This means that TNA can always add new information to the DRI Catalogue as it is discovered – without needing to redesign the storage architecture. Exactly what you want for an archive!

69 Thank you

