Using Human Language Technology for Automatic Annotation and Indexing of Digital Library Content Kalina Bontcheva, Diana Maynard, Hamish Cunningham, Horacio Saggion


1 Using Human Language Technology for Automatic Annotation and Indexing of Digital Library Content Kalina Bontcheva, Diana Maynard, Hamish Cunningham, Horacio Saggion University of Sheffield

2 The Challenge Lower the cost of annotating document collections with metadata and semantic information New ways to access digital collections via indexes of events, people, etc. The solution: use Human Language Technology (HLT) which requires little or no adaptation to the types of texts being processed

3 (Semi-)Automatic Annotation with Semantic Information Old Bailey – 18th century English Collection

4 Indexing and Search by Semantic Content
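A minimal sketch of how an index by semantic content might be organised, assuming the documents have already been annotated with (type, text) entity pairs. The document IDs, entity values, and function names are hypothetical illustrations; they are not taken from GATE or from the Old Bailey system.

```python
from collections import defaultdict


def build_entity_index(annotated_docs):
    """Map (entity_type, lower-cased text) to a sorted list of document ids."""
    index = defaultdict(set)
    for doc_id, entities in annotated_docs.items():
        for entity_type, text in entities:
            index[(entity_type, text.lower())].add(doc_id)
    return {key: sorted(ids) for key, ids in index.items()}


def search(index, entity_type, text):
    """Return the ids of documents that mention the given entity."""
    return index.get((entity_type, text.lower()), [])


# Hypothetical annotated collection (invented for illustration).
docs = {
    "trial-001": [("Person", "John Smith"), ("Date", "1745")],
    "trial-002": [("Person", "Mary Jones"), ("Person", "John Smith")],
    "trial-003": [("Date", "1745")],
}

index = build_entity_index(docs)
print(search(index, "Person", "john smith"))  # ['trial-001', 'trial-002']
```

Keying the index on the entity type as well as the text is what lets a reader ask for the person "John Smith" rather than for every document containing those two words.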

5 Information Extraction Technology Identify named entities (domain independent) Persons Dates Numbers Organizations Identify domain-specific events and terms Players Teams Events: goal, foul, etc.

6

7 Question: Which of these Human Language Technology (HLT) tools can I use in other digital library applications? Without modification in any domain With domain-specific customisations

8 Domain-Independent Named Entity Recognition Specifically designed for many genres and domains Work on a variety of document formats Person names, dates, numbers, organisations, monetary expressions, etc. Annotations can be exported as document markup (e.g. XML) for further processing and/or storage or indexed in Oracle Multilingual support via Unicode Support for distributed documents, e.g., WWW
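A minimal sketch of exporting annotations as document markup, assuming the annotations are stored as non-overlapping (start, end, type) character offsets. The tag names and the `to_inline_xml` function are hypothetical; this is not GATE's own export format or API.

```python
from xml.sax.saxutils import escape


def to_inline_xml(text, annotations):
    """Wrap each annotated span in an element named after its type.

    Assumes annotations are (start, end, type) tuples that do not overlap.
    """
    parts = []
    cursor = 0
    for start, end, entity_type in sorted(annotations):
        parts.append(escape(text[cursor:start]))  # text before the entity
        parts.append(f"<{entity_type}>{escape(text[start:end])}</{entity_type}>")
        cursor = end
    parts.append(escape(text[cursor:]))  # trailing text
    return "".join(parts)


text = "President Bush will visit Canada in June."
annotations = [(10, 14, "Person"), (26, 32, "Location"), (36, 40, "Date")]

xml = to_inline_xml(text, annotations)
print(xml)
# President <Person>Bush</Person> will visit <Location>Canada</Location> in <Date>June</Date>.
```

Keeping the annotations as offsets and generating markup only on export leaves the source text untouched, so the same annotations could equally be written to a database or an index.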

9 Domain-Independent Named Entity Recognition (2) Low-overhead customisation possible by non-computer scientists Used successfully in a number of projects, including adaptation to new languages (Bengali, Bulgarian, etc.) Publicly available, Java-based modules at gate.ac.uk as part of Sheffield’s General Architecture for Text Engineering (GATE)

10 Named Entity Annotated Example President visit President Bush will visit Canada in June. Bush is expected to…

11 Correcting the Computer’s Mistakes Less time-consuming than full manual annotation 85-90% correctness is sometimes enough
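A rough sketch of how automatic output could be compared with a human-corrected version, assuming both are sets of (start, end, type) annotations. The slide does not say which measure "correctness" refers to; precision, recall, and F1 are used here as one common choice, and all data below is invented.

```python
def score(predicted, gold):
    """Return (precision, recall, f1) for exact-match annotations."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def corrections_needed(predicted, gold):
    """Count spurious annotations to delete plus missing ones to add."""
    predicted, gold = set(predicted), set(gold)
    return len(predicted - gold) + len(gold - predicted)


# Invented example: four gold annotations, one missed and one spurious.
gold = {(0, 4, "Person"), (10, 16, "Location"), (20, 24, "Date"), (30, 35, "Organization")}
predicted = {(0, 4, "Person"), (10, 16, "Location"), (20, 24, "Date"), (40, 45, "Person")}

precision, recall, f1 = score(predicted, gold)
print(precision, recall, f1)
print(corrections_needed(predicted, gold))
```

In this invented case the annotator makes two corrections instead of creating four annotations from scratch, which is the kind of saving the slide describes.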

12 Other Human Language Technology Automatic speech recognition can be used in combination with IE to annotate sound/video material – results improved with training Domain-specific terms and events can be annotated by modifying the linguistic resources of the IE modules or training them on human-marked texts

13 Building and Customising HLT Modules for New Domains/Applications Facilitated by existing tools such as the graphical development environment provided by GATE GATE comes with a useful starting set Tokeniser Gazetteer list lookup Sentence detection module Part-of-speech tagging module A pattern-matching engine with grammars Information Retrieval support, etc. Try for free from http://gate.ac.uk
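The way these components fit together can be sketched as a toy pipeline in plain Python. This is an illustration under stated assumptions, not GATE's API: the gazetteer entries, the title list, the single pattern rule, and every function name are hypothetical, and part-of-speech tagging is left out.

```python
import re

# Hypothetical linguistic resources; a real system would load much larger lists.
GAZETTEER = {"Canada": "Location", "Sheffield": "Location", "June": "Month"}
TITLES = {"President", "Dr", "Professor"}


def tokenise(text):
    """Split text into (token, start, end) triples for words and punctuation."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\w+|[^\w\s]", text)]


def split_sentences(text):
    """Naive sentence detection: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def gazetteer_lookup(tokens):
    """Annotate tokens that appear in the gazetteer list."""
    return [(start, end, GAZETTEER[tok]) for tok, start, end in tokens if tok in GAZETTEER]


def title_rule(tokens):
    """Pattern rule: a known title followed by a capitalised token is a Person."""
    found = []
    for (tok, _, _), (nxt, start, end) in zip(tokens, tokens[1:]):
        if tok in TITLES and nxt[:1].isupper():
            found.append((start, end, "Person"))
    return found


def annotate(text):
    """Run the tokeniser, then the gazetteer and the pattern rule over its output."""
    tokens = tokenise(text)
    return sorted(gazetteer_lookup(tokens) + title_rule(tokens))


text = "President Bush will visit Canada in June. The visit was announced in Sheffield."
labelled = [(label, text[start:end]) for start, end, label in annotate(text)]
print(split_sentences(text))
print(labelled)
```

Adapting such a pipeline to a new domain then mostly means editing the lists and rules rather than the code, which is the sense in which customisation stays low-overhead.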

14 Why Are Digital Libraries Good for HLT? Digital libraries are challenging for HLT as they require robustness and scalability Cultural heritage DLs are particularly challenging as they pose new types of problems Example: Nouns in 18th century English texts were capitalised so the NE recogniser had to deal with less reliable orthographic information
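The capitalisation problem can be illustrated with a small sketch. It assumes a deliberately naive heuristic (any capitalised word that is not sentence-initial is a name candidate), and both sentences are invented to imitate the style; neither is taken from the Old Bailey collection.

```python
import re


def capitalised_candidates(sentence):
    """Return capitalised words after the first word as name candidates."""
    words = re.findall(r"[A-Za-z]+", sentence)
    return [w for w in words[1:] if w[0].isupper()]


# Invented sentences: modern orthography versus 18th-century-style capitalised nouns.
modern = "The prisoner stole a watch from John Brown in London."
historical = "The Prisoner stole a Watch from John Brown in London."

print(capitalised_candidates(modern))      # ['John', 'Brown', 'London']
print(capitalised_candidates(historical))  # ['Prisoner', 'Watch', 'John', 'Brown', 'London']
```

On the historical-style sentence the heuristic proposes common nouns alongside the real names, so a recogniser for such texts has to rely more on other evidence, such as gazetteer lists and context.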

15 Further information Demos: contact me during a coffee break E-mail: kalina@dcs.shef.ac.uk Web: http://gate.ac.uk Try NE recognition online: http://gate.ac.uk/annie/index.jsp

