Named Entity Recognition for Digitised Historical Texts by Claire Grover, Sharon Givon, Richard Tobin and Julian Ball (UK), presented by Thomas Packer

Presentation transcript:

Named Entity Recognition for Digitised Historical Texts by Claire Grover, Sharon Givon, Richard Tobin and Julian Ball (UK) presented by Thomas Packer 1

A Need for Named Entity Recognition 2

Goal and High-Level Process 3
Automatically index historical texts so they are searchable.
1. Scanning
2. OCR
3. Text processing
4. Named Entity Recognition (NER)
5. Indexing
6. Search and Retrieval
7. Highlighting

Challenges: Document Structure 4

Challenges: Placement of Tokens and Capitalization 5

Challenges: OCR Errors Abound 6

Challenges Related to ML: Capitalization 7
CoNLL 2003 NER shared task: systems trained and tested on English and German.
– Highest F-score for English: 88.8
– Highest F-score for German: 72.4
Capitalization is likely to be the main reason (it is usually a key feature in ML-based NER).
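
To make the point about capitalization concrete, here is a minimal sketch of the kind of capitalization cues an ML-based tagger might compute per token. The function name and the feature names are hypothetical and are not taken from the paper or from any CoNLL 2003 system.

```python
def capitalization_features(token: str) -> dict:
    """Return simple capitalization cues for one token (illustrative only)."""
    return {
        # First letter upper case, the rest lower case, e.g. "Warwick".
        "is_title": token.istitle(),
        # Every cased character is upper case, e.g. "LONDON".
        "is_upper": token.isupper(),
        # Every cased character is lower case, e.g. "warwick".
        "is_lower": token.islower(),
        # An upper-case letter after the first position, e.g. "McDonald".
        "has_internal_upper": any(c.isupper() for c in token[1:]),
    }


if __name__ == "__main__":
    for word in ("Warwick", "warwick", "LONDON"):
        print(word, capitalization_features(word))
```

When OCR or page layout disturbs capitalization, cues like these become unreliable, which is the concern this slide raises.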

Challenges Related to ML: POS 8
(Untested guess:) POS tagging would do poorly on OCR text.
Poor POS tagging would produce poor ML-based NER.

Challenges Related to ML: Hand Labeling 9
Quality of OCR text and kinds of OCR errors vary greatly from document to document.
Traditional ML requires hand labeling for each document type.

Data 10
Annotated Data:
– British parliamentary proceedings
– Files: 91 (1800s); 46 (1600s)
– Person names: 2,751 (1800s); 3,199 (1600s)
– Location names: 2,021 (1800s); 164 (1600s)
Document Split:
– Training: none
– Dev-test: 45 docs (1800s)
– Test: 46 files (1800s); 46 files (1600s)

IAA, Balanced F-Score 11
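
For reference, the balanced F-score is the harmonic mean of precision and recall, F = 2PR / (P + R). The sketch below computes it from raw counts. The function name and the counts in the usage example are hypothetical, and treating one annotator as the reference when scoring IAA is a common convention assumed here, not something the slide states.

```python
def balanced_f_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Balanced F-score (F1): harmonic mean of precision and recall."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    print(balanced_f_score(true_positives=8, false_positives=2, false_negatives=2))
```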

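Process 12
[Pipeline diagram: OCR Engine, Clean-up, and Extraction stages, with XML passed between stages.]
Clean-up components: UTF-8 Conversion, Re-tokenize, Token IDs, Mark-up Noise.
Extraction components: Specialist Person, Specialist Place, Generalist Person, Generalist Place.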

Clean-up 13
Convert to UTF-8.
Separate trailing whitespace and punctuation from tokens.
Add IDs to tokens.
Mark up noise:
– Marginal notes
– Quotes
– Unusual characters
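
The first three clean-up steps can be sketched roughly as follows. This is a simplified illustration, not the authors' XML-based implementation: the function names, the assumed Latin-1 source encoding, and the `w1`, `w2`, ... ID scheme are all hypothetical.

```python
import string


def to_utf8(raw: bytes, source_encoding: str = "latin-1") -> bytes:
    """Re-encode OCR output as UTF-8 (the source encoding is an assumption)."""
    return raw.decode(source_encoding).encode("utf-8")


def retokenize(text: str) -> list:
    """Split on whitespace, then detach trailing punctuation into its own token."""
    tokens = []
    for raw_token in text.split():
        core = raw_token.rstrip(string.punctuation)
        trailing = raw_token[len(core):]
        if core:
            tokens.append(core)
        if trailing:
            tokens.append(trailing)
    return tokens


def add_ids(tokens: list, prefix: str = "w") -> list:
    """Pair each token with a sequential ID such as w1, w2, ..."""
    return [(f"{prefix}{i}", token) for i, token in enumerate(tokens, start=1)]


if __name__ == "__main__":
    print(add_ids(retokenize("Earl of Warwick, spoke.")))
```

Note that this naive splitter also detaches the full stop from abbreviations such as "Mr.", so a real clean-up stage would need extra handling for them.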

Extraction: Four Grammars 14
Specialist names (monarchs, earls, lords, dukes, churchmen), e.g. Earl of Warwick.
Common names, e.g. Mr. Stratford.
High-confidence place names, e.g. Town of London.
Lower-confidence place names.

Extraction: Several Lexicons 15
Male Christian names
Female Christian names
Surnames
Earldom place names

Extraction: Rule Application 16
Apply higher-precision rules first:
– Specialist rules before generalist rules.
– Person rules before place rules.
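
The idea of applying higher-precision rules first can be sketched as an ordered rule list in which a later rule may not claim text that an earlier rule has already marked. The regular expressions below are toy stand-ins for the four grammars, not the authors' rules, and "York" is a made-up example. The ordering used (specialist person, specialist place, generalist person, generalist place) is one reading of the two constraints on this slide and is an assumption.

```python
import re

# (entity type, tier, pattern), listed in assumed order of application.
RULES = [
    ("person", "specialist", re.compile(r"\b(?:Earl|Duke|Lord) of [A-Z][a-z]+")),
    ("place", "specialist", re.compile(r"\b(?:Town|City) of [A-Z][a-z]+")),
    ("person", "generalist", re.compile(r"\bMr\. [A-Z][a-z]+")),
    # Lower-confidence rule: a capitalized word after a preposition.
    ("place", "generalist", re.compile(r"\b(?:of|from|at) ([A-Z][a-z]+)")),
]


def extract_entities(text: str) -> list:
    """Apply rules in order; skip matches that overlap an earlier match."""
    taken = [False] * len(text)
    entities = []
    for entity_type, tier, pattern in RULES:
        for match in pattern.finditer(text):
            # Use the capturing group when the pattern defines one.
            start, end = match.span(1 if pattern.groups else 0)
            if any(taken[start:end]):
                continue  # Already claimed by a higher-precision rule.
            for i in range(start, end):
                taken[i] = True
            entities.append((start, end, entity_type, tier, text[start:end]))
    return sorted(entities)


if __name__ == "__main__":
    sample = "Mr. Stratford and the Earl of Warwick came from York to the Town of London."
    for entity in extract_entities(sample):
        print(entity)
```

In this sample the lower-confidence place rule also matches "of Warwick" and "of London", but both are skipped because the specialist rules claimed those spans first.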

Output 17

Results 18

Conclusions: Good & Bad 19
Good:
– Blind test set (the developers did not look at it).
– Computed IAA.
– Lots of entities labeled.
Bad:
– Reported IAA scores were apparently not corrected after an error was discovered. (The annotations were corrected, but the IAA score was not.)
– Why is extraction tied to XML processing?

Future Work 20
Improve rules and lexicons.
Correct more OCR errors with ML.
Extract organizations and dates.
Extract relations and events.

Successful Named Entity Recognition 21

The Inconvenience of Good Named Entity Recognition 22

Questions 23