FAT – Finding All Taxa (in Text Documents) Guido Sautter, Donat Agosti, Klemens Böhm Universität Karlsruhe (TH) Research University – founded 1825.

Presentation transcript:


FAT – Basic Principle
- Generate taxon name candidates
- Find out which candidates actually are taxon names
- Divides text into
  - Sure positives
  - Sure negatives
  - Candidates
- Use sure positives and negatives to deal with candidates
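The three-way split can be pictured with a minimal Python sketch. The function `partition` and the two word lists are hypothetical stand-ins for illustration; the slides do not show how FAT represents them internally.

```python
def partition(phrases, known_taxa, common_words):
    """Split phrases into sure positives, sure negatives, and open candidates."""
    positives, negatives, candidates = [], [], []
    for phrase in phrases:
        words = phrase.lower().split()
        if phrase in known_taxa:
            positives.append(phrase)      # listed as a known taxon name
        elif any(w in common_words for w in words):
            negatives.append(phrase)      # contains an ordinary-language word
        else:
            candidates.append(phrase)     # undecided, left for later rules
    return positives, negatives, candidates


# Assumed sample data, not taken from the presentation.
phrases = ["Apis mellifera", "In the", "Formica rufa"]
pos, neg, cand = partition(phrases, {"Apis mellifera"}, {"in", "the"})
```

Here `Formica rufa` stays a candidate because neither list decides it; the later rules exist to resolve exactly such cases.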

FAT – Detail Overview
Find all parts of text that might be taxon names using
- Morphological structure (in the form of regular expressions)
- Known taxon names (as positive gazetteer lists)
Successively rule candidates to be taxa or not using
- Morphological structure (in the form of regular expressions)
- Known taxon names (as positive gazetteer lists)
- Textual hints (name labels, e.g. "sp. nov.")
- Ruled-out words (as negative gazetteer lists)
- Common dictionaries (as negative gazetteer lists)
- Document-internal contradictions
- User feedback (as a last resort)

FAT – Basic Benefits / Deficits
Benefits
- All available knowledge is used
- Newly added knowledge is used as early as possible
- Can learn new taxa through use of structure
- User can avoid errors through feedback at little effort
Deficits
- Regular expression patterns somewhat inflexible regarding
  - Automated adaptation to different document styles
  - Language-dependent capitalization schemes (e.g. in German)
- Gazetteer lists somewhat susceptible to
  - Misspellings / OCR errors
  - Unseen languages

Morphological Rules
Exploit (Linnaean / ICZN) rules of nomenclature
Challenges:
- Different schemes of in-taxon-name punctuation
- Embedded author names (differing styles, strange names)
Implementation:
- Editor for basic building blocks, including
  - line-broken and indented layout
  - syntax check and test facilities
- Actual expressions assembled dynamically at runtime, so (almost) all parts are maintainable in one place
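Assembling expressions from building blocks at runtime might look roughly like the following Python sketch. The block names and the patterns themselves are assumptions made for illustration; FAT's actual building blocks are not shown in the slides.

```python
import re

# Hypothetical building blocks, each maintained in exactly one place.
BLOCKS = {
    "genus": r"[A-Z][a-z]+",
    "subgenus": r"\([A-Z][a-z]+\)",
    "species": r"[a-z]{2,}",
}


def assemble(blocks):
    """Build 'Genus (Subgenus)? species' from the named blocks at call time."""
    source = (
        r"\b(?P<genus>" + blocks["genus"] + r")"
        + r"(?:\s+(?P<subgenus>" + blocks["subgenus"] + r"))?"
        + r"\s+(?P<species>" + blocks["species"] + r")\b"
    )
    return re.compile(source)


pattern = assemble(BLOCKS)
match = pattern.search("Camponotus (Myrmosericus) rufoglaucus was collected.")
parts = match.groupdict()
```

Changing one entry in `BLOCKS`, for example to allow hyphenated epithets, changes every expression assembled from it, which is the maintainability benefit the slide points to.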

Gazetteer Lists
Storage for known taxon names / epithets / authors
Challenges:
- Huge amount of data (main memory footprint)
- Misspellings (source text or OCR)
Implementation:
- Editor for lists, including
  - import / export
  - add / intersect / subtract functions
- Centralized access point, so lists are loaded and stored only once
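A central access point that loads each list only once, together with the add / intersect / subtract operations, can be sketched in Python with a small cache and set operators. The class `GazetteerStore`, the loader function, and the sample entries are hypothetical and only illustrate the idea.

```python
class GazetteerStore:
    """Central access point: each named list is loaded once and then shared."""

    def __init__(self, loader):
        self._loader = loader   # hypothetical callable: list name -> iterable of entries
        self._cache = {}
        self.loads = 0          # counts real loads, for demonstration only

    def get(self, name):
        if name not in self._cache:
            self._cache[name] = frozenset(self._loader(name))
            self.loads += 1
        return self._cache[name]


def fake_loader(name):
    # Assumed sample data standing in for list files on disk.
    data = {"genera": ["Apis", "Formica"], "authors": ["Linnaeus", "Formica"]}
    return data[name]


store = GazetteerStore(fake_loader)
genera = store.get("genera")
authors = store.get("authors")
store.get("genera")              # served from the cache, no second load

combined = genera | authors      # add
shared = genera & authors        # intersect
genera_only = genera - authors   # subtract
```

Keeping one shared, immutable copy per list addresses the memory footprint concern only partly; for very large lists a more compact structure would likely be needed.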

Running FAT (Overview)
[Pipeline diagram. Stages shown: Recall Rules, Label Rules, Dictionary Filter, Lexicon Rules, Precise Rules, Known Data Rules, Author Name Rules, Negative Rules, Data Rules, Dynamic Lexicon Rules, User Feedback. Inputs and outputs shown: Document Text, Candidates, Taxon Names, Not Taxon Names.]

Recall Rules
Create candidates:
- morphological structure
- filter out matches that contain stop words
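A rough Python sketch of this step follows: a deliberately loose pattern proposes candidates, and any match containing a stop word is dropped. Both the pattern and the stop-word list are assumptions chosen for the example, not FAT's actual rules.

```python
import re

# Assumed, deliberately loose pattern: capitalized word followed by a lowercase word.
CANDIDATE = re.compile(r"\b[A-Z][a-z]+ [a-z]{2,}\b")
# Assumed sample stop-word list.
STOP_WORDS = {"the", "of", "in", "and", "is", "was", "from"}


def recall_candidates(text):
    """Return loose 'Capitalized lowercase' matches that contain no stop word."""
    result = []
    for m in CANDIDATE.finditer(text):
        words = m.group(0).lower().split()
        if not any(w in STOP_WORDS for w in words):
            result.append(m.group(0))
    return result


found = recall_candidates("Workers of Formica rufa were found. The nest was large.")
```

The pattern also matches `Workers of` and `The nest`; the stop-word check removes both and leaves `Formica rufa`.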

Dictionary Filter
Filter candidates:
- gazetteer based
- filter out candidates with common language words in epithet positions (+ stemming for English)
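The filter can be illustrated with a short Python sketch. The dictionary, the helper `crude_stem`, and the choice to treat every word after the first as an epithet position are assumptions for this example; a real system would use a proper stemmer and a full dictionary.

```python
# Assumed sample dictionary of common English words (in stem form).
COMMON_WORDS = {"nest", "worker", "collect", "large"}


def crude_stem(word):
    """Very rough English suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def passes_dictionary_filter(candidate):
    """Reject a candidate if any word after the genus position is a common word."""
    epithets = candidate.split()[1:]
    return not any(
        e.lower() in COMMON_WORDS or crude_stem(e.lower()) in COMMON_WORDS
        for e in epithets
    )


kept = [
    c
    for c in ["Formica rufa", "Several workers", "Ants collected"]
    if passes_dictionary_filter(c)
]
```

Stemming is what lets the inflected forms `workers` and `collected` hit the dictionary entries `worker` and `collect`.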

Lexicon Rules
Exploit known epithets:
- candidates → matches
- create further candidates

Label Rules
Exploit taxon name labels:
- labeled candidates → matches
- e.g. "Genus species, sp. nov."
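As a minimal Python sketch, a label rule can check whether the text directly after a candidate starts with a new-taxon label. The label set beyond "sp. nov." and the function `promote_labeled` are assumptions for illustration.

```python
import re

# Assumed label set; the slides name only "sp. nov." as an example.
LABEL = re.compile(r"^\s*,?\s*(?:sp\.|gen\.|subsp\.)\s*(?:nov\.|n\.)")


def promote_labeled(text, candidates):
    """Return candidates whose first occurrence is directly followed by a label."""
    matches = []
    for cand in candidates:
        start = text.find(cand)   # only the first occurrence is checked here
        if start != -1 and LABEL.match(text[start + len(cand):]):
            matches.append(cand)
    return matches


text = "Formica nova, sp. nov. is described here; Formica rufa occurs nearby."
promoted = promote_labeled(text, ["Formica nova", "Formica rufa"])
```

Only `Formica nova` is promoted, because `Formica rufa` is followed by ordinary text rather than a label.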

Precise Rules
Exploit morphology:
- candidates with distinctive structure → matches
- e.g. "Genus species st. race"

Known Data Rules
Exploit prior runs:
- Extract epithets from candidates
- Candidates with a known epithet combination → matches

Author Name Rules
Exclude candidates with author names in genus or subgenus position

Negative Rules
Exclude candidates with words from negatives (all text excluded so far) in epithet positions

Data Rules
Candidates with known epithets in last position → matches

Dynamic Lexicon Rules
Exploit matches & negatives:
- Works as a combination of the preceding lexicon-based rules
- But uses data from the current document
- Compute transitive hull
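One way to read "compute transitive hull" is as a fixed-point iteration: every newly decided candidate contributes words that may decide further candidates, and the loop repeats until nothing changes. The Python sketch below follows that reading. The concrete promotion and exclusion rules in it are assumptions chosen to keep the example small, not the rules FAT applies.

```python
def dynamic_lexicon(matches, negatives, candidates):
    """Repeat document-internal lexicon rules until no candidate changes status."""
    matches, negatives = set(matches), set(negatives)
    undecided = set(candidates)
    changed = True
    while changed:
        changed = False
        # Words seen in matches so far, and words seen only in negatives.
        known = {w.lower() for m in matches for w in m.split()}
        ruled_out = {w.lower() for n in negatives for w in n.split()} - known
        for cand in sorted(undecided):
            words = [w.lower() for w in cand.split()]
            if words[0] in known or words[-1] in known:
                matches.add(cand)        # assumed rule: shares a name part with a match
            elif any(w in ruled_out for w in words[1:]):
                negatives.add(cand)      # assumed rule: ruled-out word in epithet position
            else:
                continue
            undecided.discard(cand)
            changed = True
    return matches, negatives, undecided


m, n, u = dynamic_lexicon(
    {"Formica rufa"},
    {"Building material"},
    {"Formica fusca", "Serviformica fusca", "Nest material", "Myrmica rubra"},
)
```

In this run `Formica fusca` is promoted in the first pass via the known genus, which makes `fusca` known and lets `Serviformica fusca` follow in the second pass. `Myrmica rubra` remains undecided and would go on to user feedback.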

User Feedback
- Ask user to decide on remaining candidates (displaying some context)
- Optional step, can be omitted

Questions?
Browse Madagascar Corpus at
Download GoldenGATE from