Semantic Annotation of Grey Literature from an Archaeological Digital Library Andreas Vlachidis, Doug Tudhope Hypermedia Research Unit University of Glamorgan.

1 Semantic Annotation of Grey Literature from an Archaeological Digital Library Andreas Vlachidis, Doug Tudhope Hypermedia Research Unit University of Glamorgan http://hypermedia.research.glam.ac.uk/kos/star/ http://andronikos.kyklos.co.uk avlachid@glam.ac.uk

2 Semantic Technologies for Archaeology Resources (STAR)
- AHRC-funded three-year project
- Investigates semantic technologies for integrating and cross-searching datasets and associated grey literature
- Acknowledgements: Ceri Binding (Glamorgan), Keith May (English Heritage)

3 Is the Web machine-readable? Yes

4 Is the Web machine-understandable? No
Diagram: Netherlands has_capital Amsterdam; Amsterdam type City

5 Machine-readable vs. machine-understandable
- What we say to the machine: The Cat in the Hat; ISBN: 0007158440; Author: Dr. Seuss; Publisher: Collins
- What the machine understands: bla bla bla bla

6 STAR
- The current situation is one of fragmented datasets and applications, with different terminology systems
- Need for an integrative conceptual framework
  - English Heritage extended CIDOC CRM ontology for archaeology
- Need for terminology control
  - English Heritage thesauri
  - Recording Manual glossaries, augmented with dataset glossaries

7 General Architecture (diagram)
- Sources: RRAD, RPRE, LEAP, STAN, IADB; grey literature; EH thesauri and glossaries
- Processing: data mapping / normalisation, conversion, indexing
- RDF-based common ontology data layer (CRM / CRMEH / SKOS)
- Access: Web Services, SQL, SPARQL
- Applications: server side, rich client, browser

8 STAR outcomes
- Aim: “To investigate the potential of semantic terminology tools for widening access to digital archaeology resources, including disparate data sets and associated grey literature”
- Research demonstrator
- Rich semantic indexing of OASIS grey literature reports
- More specific focus than the complementary ADS Archaeotools project

9 Background and Definitions
- Information Extraction (IE): a Natural Language Processing technique, defined as a text analysis task aimed at extracting targeted information from context. A textual input is analysed to form a textual output suitable for further manipulation.
- Semantic Annotation: a specific metadata generation and usage schema (usually described by an ontology) that aims to automate the identification of concepts and their relationships in documents.

10 Rule Based Information Extraction
- Aim: to enable 'rich', semantics-aware indexing of archaeological fieldwork reports (grey literature) with respect to the CRM-EH Conceptual Reference Model (ontology)
- Grey literature: source materials that cannot be found through the conventional means of publication
- OASIS: Online AccesS to the Index of archaeological investigationS
  - Coordinated by ADS
  - Online index to archaeological grey literature
  - Accessed via the ADS ArchSearch online service (http://www.oasis.ac.uk)

11 General Architecture for Text Engineering (GATE)
- Infrastructure for processing human language. Provides an architecture, a framework and a development environment for developing and deploying natural language software components.
- Diagram labels: ADS OASIS grey literature; gazetteer lists; EH Thesaurus; ontology (CIDOC CRM-EH); Java pattern engine; XML structures to represent semantic properties

12 Rule Based Information Extraction
- Java Annotation Pattern Engine (JAPE): provides finite state transduction over annotations, based on regular expressions (patterns, rules).
- IE pipeline: a set of phases, i.e. a cascading mechanism that sequentially runs a set of JAPE rules and text analysis processes (tokenization, POS tagging, etc.)
- Example patterns (figure):
  - E49 E49: “Late Bronze Age or Early Iron Age”
  - E49 E19: “prehistoric pottery”
  - E53 ( / ): “Ditch containing prehistoric pottery”
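
The cascade idea on this slide can be illustrated with a minimal Python sketch. This is not JAPE and not the project's rule set: it only assumes that annotations arrive as an ordered list of (type, text) pairs, and that a rule is a regular expression over the sequence of annotation types. The output type names `DatedFind` and `DepositionEvent`, the `Token` type and the three rules are hypothetical, chosen to mirror the examples on the slide.

```python
import re
from typing import List, NamedTuple, Tuple


class Annotation(NamedTuple):
    kind: str  # annotation type, e.g. "E49"; must not contain spaces
    text: str  # the text the annotation covers


def apply_rule(annotations: List[Annotation], pattern: str, new_kind: str) -> List[Annotation]:
    """Replace every run of annotations whose types match `pattern` with one annotation."""
    kinds = [a.kind for a in annotations]
    joined = " ".join(kinds)
    # offsets[i] is the position in `joined` where the i-th type name starts
    offsets, pos = [], 0
    for k in kinds:
        offsets.append(pos)
        pos += len(k) + 1
    # The lookahead forces a match to end on a type-name boundary
    regex = re.compile(r"(?:%s)(?= |$)" % pattern)
    out, i = [], 0
    while i < len(annotations):
        m = regex.match(joined, offsets[i])
        if m and m.group(0):
            n = m.group(0).count(" ") + 1  # number of annotations consumed
            text = " ".join(a.text for a in annotations[i:i + n])
            out.append(Annotation(new_kind, text))
            i += n
        else:
            out.append(annotations[i])
            i += 1
    return out


# Hypothetical phases, run in order; later phases see the output of earlier ones.
# A real rule would also constrain the token text ("or", "containing").
PHASES: List[Tuple[str, str]] = [
    (r"E49 Token E49", "E49"),                    # period "or" period
    (r"E49 E19", "DatedFind"),                    # period + physical object
    (r"E53 Token DatedFind", "DepositionEvent"),  # place + token + dated find
]


def run_pipeline(annotations: List[Annotation]) -> List[Annotation]:
    for pattern, new_kind in PHASES:
        annotations = apply_rule(annotations, pattern, new_kind)
    return annotations


ditch = [
    Annotation("E53", "Ditch"),
    Annotation("Token", "containing"),
    Annotation("E49", "prehistoric"),
    Annotation("E19", "pottery"),
]
print(run_pipeline(ditch))
```

The point of the sketch is the ordering: the third rule can only fire because the second phase has already produced a `DatedFind` annotation for it to match.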

13 The KBIE Process
The Knowledge Based Information Extraction (KBIE) process is completed in three phases:
- Pre-processing: identifies parts/sections of the document that assist further processing. Not related to the ontology.
- Named Entity Recognition (NER): exploits the KR to provide Lookup-based mentions of selected concepts. Related to the CIDOC CRM ontology.
- Events Recognition (Relation Extraction): identifies connections between previously identified CIDOC CRM entities in text. Related to the CRM-EH ontology.

14 Pre-processing Phase
- The pre-processing phase is targeted to extract the following document sections:
  - Headings
  - Negation phrases
  - Summary sections
  - Noun phrases
  - Verb phrases
- All of the above types are used by the second phase (NER) to validate Lookup matches generated by the GATE gazetteers
- The ANNIE application of GATE is used to produce the noun and verb phrases

15 Named Entity Recognition (NER)
- The NER phase is targeted to extract the following annotation types with respect to the CIDOC CRM:
  - E4.Period
  - E19.Physical Object
  - E53.Place
  - E57.Material
- The phase supports disambiguation between Material and Physical Object Lookups, based on word-pair disambiguation and the use of Part of Speech (POS) tags: determiners and adjectives
- No Lookup is generated for matches that fall within Heading, Tabular Data or Negation sections
- All Lookup matches must be part of a noun phrase
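
The two validation constraints on this slide can be sketched in a few lines of Python. This is an illustration only, not the GATE implementation: it assumes that Lookups, sections and noun phrases are all available as character-offset spans, and the `Span` type, the helper names and the sample sentence are hypothetical.

```python
from typing import List, NamedTuple


class Span(NamedTuple):
    start: int
    end: int
    kind: str  # e.g. "E19", "Negation", "NP"


# Section kinds in which Lookup matches are discarded (from the slide)
EXCLUDED_KINDS = {"Heading", "TabularData", "Negation"}


def find_span(text: str, phrase: str, kind: str, start: int = 0) -> Span:
    """Helper for the example: locate `phrase` in `text` and wrap it as a Span."""
    i = text.index(phrase, start)
    return Span(i, i + len(phrase), kind)


def overlaps(a: Span, b: Span) -> bool:
    return a.start < b.end and b.start < a.end


def inside(inner: Span, outer: Span) -> bool:
    return outer.start <= inner.start and inner.end <= outer.end


def validate_lookups(lookups: List[Span], sections: List[Span], noun_phrases: List[Span]) -> List[Span]:
    """Keep a Lookup only if it avoids excluded sections and lies within a noun phrase."""
    excluded = [s for s in sections if s.kind in EXCLUDED_KINDS]
    return [
        lookup for lookup in lookups
        if not any(overlaps(lookup, s) for s in excluded)
        and any(inside(lookup, np) for np in noun_phrases)
    ]


TEXT = "No pottery was found. The ditch contained Roman pottery."
sections = [find_span(TEXT, "No pottery was found", "Negation")]
noun_phrases = [
    find_span(TEXT, "No pottery", "NP"),
    find_span(TEXT, "The ditch", "NP"),
    find_span(TEXT, "Roman pottery", "NP"),
]
lookups = [
    find_span(TEXT, "pottery", "E19"),            # inside the negation phrase
    find_span(TEXT, "ditch", "E53"),
    find_span(TEXT, "pottery", "E19", start=22),  # second mention
]
kept = validate_lookups(lookups, sections, noun_phrases)
print([TEXT[s.start:s.end] for s in kept])
```

The first mention of "pottery" is dropped because it sits inside a negation phrase, even though it is also part of a noun phrase; the exclusion test is applied before the noun-phrase test.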

16 NER: Semantic Expansion over Thesauri
- How much of the available thesaurus structures should be used?
- The IE pipeline is configurable to run in 4 different modes
- EH glossary terms are used as entry points
  - Synonym: glossary terms plus their synonyms
  - Hyponym: Synonym plus narrower terms
  - Hypernym: Hyponym plus broader terms
  - All available resources
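
A minimal sketch of the four cumulative modes, assuming the thesaurus is available as a plain dictionary of relations. The toy entries, the relation keys and the `expand` function are hypothetical and are not taken from the EH thesauri; the sketch also expands only one level from each glossary term, which the slide does not specify.

```python
from typing import Dict, List, Set

MODES = ["synonym", "hyponym", "hypernym", "all"]

# Hypothetical toy thesaurus: term -> relations
THESAURUS: Dict[str, Dict[str, List[str]]] = {
    "pit": {"synonyms": ["pits"], "narrower": ["rubbish pit"], "broader": ["cut feature"]},
    "hearth": {"synonyms": ["fireplace"], "narrower": [], "broader": ["feature"]},
    # Not a glossary entry point, so only reached in "all" mode
    "kiln": {"synonyms": [], "narrower": ["pottery kiln"], "broader": ["feature"]},
}


def expand(glossary_terms: List[str], thesaurus: Dict[str, Dict[str, List[str]]], mode: str) -> Set[str]:
    """Return the set of terms the gazetteer would match on in the given mode."""
    if mode not in MODES:
        raise ValueError("unknown mode: %s" % mode)
    if mode == "all":
        # Every term in the thesaurus, plus everything it is related to
        terms = set(glossary_terms) | set(thesaurus)
        for entry in thesaurus.values():
            for related in entry.values():
                terms.update(related)
        return terms
    # Each mode adds one more relation to the previous one
    relations = ["synonyms", "narrower", "broader"][: MODES.index(mode) + 1]
    terms = set(glossary_terms)
    for term in glossary_terms:
        entry = thesaurus.get(term, {})
        for relation in relations:
            terms.update(entry.get(relation, []))
    return terms


for mode in MODES:
    print(mode, sorted(expand(["pit"], THESAURUS, mode)))
```

Because each mode is a superset of the one before it, moving down the list can only add candidate matches.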

17 Targeting CRM-EH Entities and Events
The third phase is targeted to extract a range of CRM-EH entities and events, including:
- Context
- Context Find
- Context Event Time Appellation
- Context Find Material
- Context Find Time Appellation
- Context Event
- Context Find Production Event
- Context Find Deposition Event

18 Meaningful Connections between Entities
- Context Event Time Appellation with Context, e.g. “Roman deposits”
- Context Find Production Event Time Appellation with Context Find, e.g. “Mediaeval Pottery”
- Context Find Deposition Event: Context with Context Find, e.g. “Ditch containing coins”
- Consists of: Material and Context Find, e.g. “Copper alloy artefacts”

19 CRM-EH Entities and Events (Example)

20 CRM-EH Entities and Events (XML output)

21 CRM-EH Entities and Events (RDF triples)

22 Using the IE Output
- The STAR demonstrator
  - Makes use of the decoupled RDF files
  - Cross-searching between grey literature and datasets
  - A SPARQL engine supports the semantic search
- Semantic search examples
  - Context of type X containing Find of type Y: “hearth” containing “coin”
  - Context Find of type X within Context of type Y: “Animal Remains” within “pit”, e.g. “the test pit produced a range of artefactual material which included animal bone”
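
The first search pattern ("Context of type X containing Find of type Y") can be illustrated over a toy set of triples. The demonstrator itself uses a SPARQL engine over RDF; the sketch below swaps that for plain Python tuples so it stays self-contained. The identifiers `ctx1`, `find1`, and the predicate names `type` and `contains`, are hypothetical and are not CRM-EH property names.

```python
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

# Hypothetical triples standing in for the RDF produced by the IE pipeline
TRIPLES: List[Triple] = [
    ("ctx1", "type", "hearth"),
    ("find1", "type", "coin"),
    ("ctx1", "contains", "find1"),
    ("ctx2", "type", "pit"),
    ("find2", "type", "animal remains"),
    ("ctx2", "contains", "find2"),
]


def contexts_containing(triples: List[Triple], context_type: str, find_type: str) -> List[Tuple[str, str]]:
    """Return (context, find) pairs where a context of one type contains a find of another."""
    # Index the type statements so each 'contains' triple can be checked directly
    types: Dict[str, Set[str]] = {}
    for subject, predicate, obj in triples:
        if predicate == "type":
            types.setdefault(subject, set()).add(obj)
    return sorted(
        (subject, obj)
        for subject, predicate, obj in triples
        if predicate == "contains"
        and context_type in types.get(subject, set())
        and find_type in types.get(obj, set())
    )


print(contexts_containing(TRIPLES, "hearth", "coin"))
print(contexts_containing(TRIPLES, "pit", "animal remains"))
```

The second pattern on the slide ("Context Find of type X within Context of type Y") is the same join read in the opposite direction, so the same function covers it with the arguments swapped.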

23 Example of Grey Literature Annotations

24 Evaluation
- Information Extraction challenges
  - Domain-specific issues
  - Language ambiguity
  - False positives
  - Coverage of knowledge base resources
- Evaluation of IE
  - Input from experts is critical
  - Basis for assessing and improving the IE system
- Evaluation method
  - The 'gold standard' is a test set of human-annotated documents
  - It is used for comparison with the automatic annotations produced by the system
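
A minimal sketch of how precision, recall and F-measure are computed against a gold standard. It assumes annotations are compared as exact (start, end, type) matches; the slides do not state the matching criterion used in the study, and the sample annotations are hypothetical.

```python
from typing import Set, Tuple

AnnotationKey = Tuple[int, int, str]  # (start offset, end offset, annotation type)


def precision_recall_f(gold: Set[AnnotationKey], system: Set[AnnotationKey]) -> Tuple[float, float, float]:
    """Score system annotations against gold annotations using exact matching."""
    true_positives = len(gold & system)
    precision = true_positives / len(system) if system else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical example: 5 gold annotations, 4 system annotations, 3 in common
gold = {(0, 5, "E53"), (10, 17, "E19"), (20, 25, "E49"), (30, 36, "E57"), (40, 44, "E53")}
system = {(0, 5, "E53"), (10, 17, "E19"), (20, 25, "E49"), (50, 55, "E19")}

p, r, f = precision_recall_f(gold, system)
print("precision=%.2f recall=%.2f f=%.2f" % (p, r, f))
```

Scoring one human annotator's set against another's in the same way gives a pairwise agreement figure.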

25 Evaluation Results: Precision, Recall, F-measure

                Synonym  Hyponym  Hypernym  All Res.
Precision-KM    0.82     0.85     0.76      0.68
Precision-PC    0.72              0.7       0.62
Precision-TB    0.61     0.62     0.6       0.55
Recall-KM       0.67     0.72     0.77      0.71
Recall-PC       0.6      0.62     0.68      0.63
Recall-TB       0.66     0.69     0.72      0.7
F-Measure-KM    0.73     0.78     0.76      0.68
F-Measure-PC    0.65     0.66     0.68      0.62
F-Measure-TB    0.6      0.62     0.61      0.56

26 Evaluation Results: Inter-Annotator Agreement Score

            Precision  Recall  F-Measure
KM-PC-TB    0.7        0.62    0.65
KM-PC       0.7        0.68    0.69
KM-TB       0.74       0.62    0.67
PC-TB       0.65       0.56    0.61

27 Evaluation Results
- The results validate the initial hypothesis about optimum semantic expansion
  - Too little expansion causes recall to suffer
  - Too much expansion causes precision to suffer
- The best performance (for 2 out of 3 annotators) is in the Hyponym mode
- There is a trade-off between the Hyponym and Hypernym modes in precision and recall

28 Evaluation Results
- Overall, the annotators agree 65%
  - This is a good percentage: Archaeotools reported an Inter-Annotator Agreement (IAA) of around 60%, and IAA in cultural heritage is usually low
- KM and PC agree 69%
- KM and TB agree 67%
- TB and PC agree 61%
- Not all annotators distinguished Material from Physical Object
- Time appellations such as 'Phase' or 'episode of flooding' cause disagreement

29 Questions and References
- Questions?
- URLs
  - http://hypermedia.research.glam.ac.uk/
  - http://hypermedia.research.glam.ac.uk/kos/CRM/
  - http://hypermedia.research.glam.ac.uk/resources/star-demonstrator/
  - http://www.oasis.ac.uk/
  - http://gate.ac.uk/

30 avlachid@glam.ac.uk, dstudhope@glam.ac.uk

