Published by Maryann Sutton. Modified over 8 years ago.
1
FAT – Finding All Taxa (in Text Documents)
Guido Sautter, Donat Agosti, Klemens Böhm
Universität Karlsruhe (TH)
Research University – founded 1825
2
FAT – Basic Principle
Generate taxon name candidates
Find out which candidates actually are taxon names
Divide the text into
  – Sure positives
  – Sure negatives
  – Candidates
Use sure positives and negatives to deal with candidates
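The three-way division on this slide can be sketched in a few lines. This is only an illustration, not FAT's actual implementation: the `Verdict` enum, the `classify` and `partition` functions, and the two tiny word sets are hypothetical.

```python
from enum import Enum


class Verdict(Enum):
    POSITIVE = "sure positive"
    NEGATIVE = "sure negative"
    CANDIDATE = "candidate"


# Hypothetical toy knowledge; real lists would be far larger.
KNOWN_TAXA = {"Formica rufa"}
KNOWN_NON_TAXA = {"In this"}


def classify(phrase):
    """Assign a phrase to one of the three classes."""
    if phrase in KNOWN_TAXA:
        return Verdict.POSITIVE
    if phrase in KNOWN_NON_TAXA:
        return Verdict.NEGATIVE
    return Verdict.CANDIDATE


def partition(phrases):
    """Group phrases by verdict, preserving input order within each group."""
    groups = {verdict: [] for verdict in Verdict}
    for phrase in phrases:
        groups[classify(phrase)].append(phrase)
    return groups
```

Anything that lands in the `CANDIDATE` group is what the later rule stages would have to decide on.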
3
FAT – Detail Overview
Find all parts of text that might be taxon names using
  – Morphological structure (in the form of regular expressions)
  – Known taxon names (as positive gazetteer lists)
Successively rule candidates to be taxa or not using
  – Morphological structure (in the form of regular expressions)
  – Known taxon names (as positive gazetteer lists)
  – Textual hints (name labels, e.g. “sp. nov.”)
  – Ruled-out words (as negative gazetteer lists)
  – Common dictionaries (as negative gazetteer lists)
  – Document-internal contradictions
  – User feedback (as a last resort)
4
FAT – Basic Benefits / Deficits
Benefits
  – All available knowledge is used
  – Newly added knowledge is used as early as possible
  – Can learn new taxa through use of structure
  – User can avoid errors through feedback with little effort
Deficits
  – Regular expression patterns are somewhat inflexible regarding
    - Automated adaptation to different document styles
    - Language-dependent capitalization schemes (e.g. in German)
  – Gazetteer lists are somewhat susceptible to
    - Misspellings / OCR errors
    - Unseen languages
5
Morphological Rules
Exploit (Linnaean / ICZN) rules of nomenclature
Challenges:
  – Different schemes of in-taxon-name punctuation
  – Embedded author names (differing styles, strange names)
Implementation:
  – Editor for basic building blocks, including
    - line-broken and indented layout
    - syntax check and test facilities
  – Actual expressions are assembled dynamically at runtime, so (almost) all parts are maintainable in one place
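Assembling expressions from named building blocks at runtime might look like the sketch below. The block names, their patterns, and the `assemble` helper are assumptions made for illustration; the slides do not show FAT's actual patterns.

```python
import re

# Hypothetical building blocks, each maintained in exactly one place.
BLOCKS = {
    "genus": r"[A-Z][a-z]+",
    "subgenus": r"\([A-Z][a-z]+\)",
    "species": r"[a-z]{3,}",
    "author": r"[A-Z][A-Za-z\-]+",
    "year": r"\d{4}",
}


def assemble(template):
    """Replace each <name> placeholder with its building block and compile."""
    pattern = re.sub(
        r"<(\w+)>",
        lambda m: "(?:" + BLOCKS[m.group(1)] + ")",
        template,
    )
    return re.compile(pattern)


# Genus, optional subgenus, species, optional author with optional year.
BINOMIAL = assemble(
    r"<genus>(?: <subgenus>)? <species>(?: <author>(?:,? <year>)?)?"
)
```

Changing a block, for example to allow hyphenated genus names, then affects every assembled expression that uses it.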
6
Gazetteer Lists
Storage for known taxon names / epithets / authors
Challenges:
  – Huge amount of data (main memory footprint)
  – Misspellings (source text or OCR)
Implementation:
  – Editor for lists, including
    - import / export
    - add, intersect, and subtract functions
  – Centralized access point, so each list is loaded and stored only once
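The list functions and the load-once access point can be illustrated with plain Python sets and a cached loader. Everything named here (`_SOURCES`, `gazetteer`, `LOAD_COUNT`, the list names and their entries) is hypothetical; a real system would read the lists from storage.

```python
from functools import lru_cache

# Hypothetical in-memory sources standing in for list files on disk.
_SOURCES = {
    "epithets_a": ["rufa", "fusca"],
    "epithets_b": ["fusca", "niger"],
    "ruled_out": ["fusca"],
}

# Counts actual loads, to show that each list is read only once.
LOAD_COUNT = {"n": 0}


@lru_cache(maxsize=None)
def gazetteer(name):
    """Central access point: load a list on first use, then reuse it."""
    LOAD_COUNT["n"] += 1
    return frozenset(word.lower() for word in _SOURCES[name])


a = gazetteer("epithets_a")
b = gazetteer("epithets_b")

added = a | b                                # add (union)
shared = a & b                               # intersect
cleaned = added - gazetteer("ruled_out")     # subtract

gazetteer("epithets_a")                      # served from the cache, no reload
```

Using `frozenset` keeps the shared lists immutable, so one component cannot silently change a list another component relies on.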
7
Running FAT (Overview)
[Figure: pipeline diagram. Input: Document Text. Stages: Recall Rules, Dictionary Filter, Lexicon Rules, Label Rules, Precise Rules, Known Data Rules, Author Name Rules, Negative Rules, Data Rules, Dynamic Lexicon Rules, User Feedback. Outputs: Taxon Names, Not Taxon Names, Candidates.]
8
Recall Rules
Create candidates:
  - morphological structure
  - filter out matches that contain stop words
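A high-recall candidate generator with a stop-word filter could be sketched as follows. The pattern and the stop-word list are deliberately simplistic assumptions, not the expressions FAT uses.

```python
import re

# Deliberately loose pattern: a capitalized word followed by a lowercase word.
CANDIDATE = re.compile(r"\b[A-Z][a-z]+ [a-z]+\b")

# Hypothetical stop-word list.
STOP_WORDS = {"the", "and", "of", "in", "is", "was", "this"}


def recall_candidates(text):
    """Return structural matches that contain no stop word."""
    candidates = []
    for match in CANDIDATE.finditer(text):
        words = match.group(0).lower().split()
        if not any(word in STOP_WORDS for word in words):
            candidates.append(match.group(0))
    return candidates
```

Note that `finditer` returns non-overlapping matches, so a real implementation would need extra care when a stop-word match swallows the first word of a genuine name.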
9
Dictionary Filter
Filter candidates:
  - gazetteer based
  - filter out candidates with common language words in epithet positions (+ stemming for English)
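The filter described here might work roughly like this. The dictionary is a tiny hypothetical sample, and `naive_stem` is a crude suffix stripper standing in for a proper English stemmer.

```python
# Hypothetical sample of a common-language dictionary.
COMMON_WORDS = {"worker", "nest", "large", "found"}


def naive_stem(word):
    """Strip a few English suffixes; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def passes_dictionary_filter(candidate):
    """Keep a candidate only if no epithet position holds a common word."""
    _genus, *epithets = candidate.split()
    return not any(
        epithet in COMMON_WORDS or naive_stem(epithet) in COMMON_WORDS
        for epithet in epithets
    )
```

Only the epithet positions are checked; the first word is left alone because a capitalized sentence-initial word says little about whether a genus follows.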
10
Lexicon Rules
Exploit known epithets:
  - candidates become matches
  - create further candidates
11
Label Rules
Exploit taxon name labels:
  - labeled candidates become matches
  - “Genus species, sp. nov.”
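A label rule for the “Genus species, sp. nov.” pattern can be approximated with one regular expression. Only “sp. nov.” comes from the slides; the other labels in `LABELS` and the function name are assumptions.

```python
import re

# "sp. nov." is from the slide; the other labels are assumed additions.
LABELS = r"(?:sp|gen|ssp|comb|stat)\. nov\."

# A name (genus plus optional epithets), an optional comma, then a label.
LABELED = re.compile(r"([A-Z][a-z]+(?: [a-z]+)*),? (" + LABELS + ")")


def labeled_names(text):
    """Return (name, label) pairs for names followed by a name label."""
    return [(m.group(1), m.group(2)) for m in LABELED.finditer(text)]
```

A name found this way is strong evidence, because authors attach such labels only to taxon names.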
12
Precise Rules
Exploit morphology:
  - candidates with distinctive structure become matches
  - “Genus species st. race”
13
Known Data Rules
Exploit prior runs:
  - Extract epithets from candidates
  - Candidates with a known epithet combination become matches
14
Author Name Rules
Exclude candidates with author names in genus or subgenus position
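As a rough sketch, the exclusion check could look like this. The author set and the assumption that a subgenus is written in parentheses as the second word are illustrative only.

```python
# Hypothetical author gazetteer.
AUTHORS = {"Linnaeus", "Forel", "Emery"}


def has_author_in_genus_position(candidate):
    """True if the genus or (parenthesized) subgenus is a known author name."""
    words = candidate.split()
    if not words:
        return False
    genus = words[0]
    subgenus = None
    if len(words) > 1 and words[1].startswith("("):
        subgenus = words[1].strip("()")
    return genus in AUTHORS or subgenus in AUTHORS
```

Candidates for which the function returns `True` would be moved to the negatives.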
15
Negative Rules
Exclude candidates with words from negatives (all text excluded so far) in epithet positions
16
Data Rules
Candidates with known epithets in last position become matches
17
Dynamic Lexicon Rules
Exploit matches & negatives:
  - Works like a combination of the preceding lexicon-based rules
  - But uses data from the current document
  - Compute the transitive hull
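One possible reading of "compute the transitive hull" is a fixpoint iteration: keep applying the lexicon-style rules with the current document's matches and negatives until no candidate changes status. The concrete rules below (known genus, known last epithet, ruled-out epithet word) are assumptions chosen to mirror the preceding slides, and the function name is hypothetical.

```python
def transitive_hull(matches, negatives, candidates):
    """Reclassify candidates repeatedly until nothing changes (a fixpoint).

    All phrases are assumed to be non-empty, space-separated strings.
    """
    matches, negatives = set(matches), set(negatives)
    candidates = set(candidates)
    changed = True
    while changed:
        changed = False
        # Lexicons are rebuilt each pass from what is known so far.
        genera = {m.split()[0] for m in matches}
        epithets = {w for m in matches for w in m.split()[1:]}
        ruled_out = {w for n in negatives for w in n.split()}
        for cand in sorted(candidates):
            genus, *rest = cand.split()
            if any(word in ruled_out for word in rest):
                negatives.add(cand)      # negative-rule analogue
            elif genus in genera or (rest and rest[-1] in epithets):
                matches.add(cand)        # lexicon / data-rule analogue
            else:
                continue                 # still undecided
            candidates.remove(cand)
            changed = True
    return matches, negatives, candidates
```

The loop terminates because every pass that changes anything removes at least one candidate. A name decided in one pass can unlock others in the next, which is the transitive effect.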
18
User Feedback
Ask user to decide on remaining candidates (displaying some context)
Optional step, can be omitted
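The feedback step can be sketched as a loop that shows each remaining candidate with some surrounding text and records the answer. The function names, the prompt wording, and the `decide` hook (which lets the interactive `input` call be replaced in tests) are all hypothetical.

```python
def context(text, phrase, width=30):
    """Return the phrase with up to `width` characters of text on each side."""
    i = text.find(phrase)
    if i < 0:
        return phrase
    return text[max(0, i - width): i + len(phrase) + width]


def ask_user(text, candidates, decide=None):
    """Split candidates into (taxa, non_taxa) according to the user's answers."""
    if decide is None:
        # Default: ask on the console; any answer starting with "y" means yes.
        def decide(prompt):
            return input(prompt).strip().lower().startswith("y")

    taxa, non_taxa = [], []
    for cand in candidates:
        prompt = f"...{context(text, cand)}...\nIs '{cand}' a taxon name? [y/n] "
        (taxa if decide(prompt) else non_taxa).append(cand)
    return taxa, non_taxa
```

Because the step is optional, a caller could simply skip `ask_user` and leave the remaining candidates undecided.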
19
Questions?
Browse Madagascar Corpus at http://plazi.org/GgSRS/search
Download GoldenGATE from http://idaho.ipd.uka.de/GoldenGATE/