FAT – Finding All Taxa (in Text Documents) Guido Sautter, Donat Agosti, Klemens Böhm Universität Karlsruhe (TH) Research University – founded 1825.

1 FAT – Finding All Taxa (in Text Documents)
Guido Sautter, Donat Agosti, Klemens Böhm
Universität Karlsruhe (TH), Research University – founded 1825

2 FAT – Basic Principle
Generate taxon name candidates
Find out which candidates actually are taxon names
Divide the text into
– Sure positives
– Sure negatives
– Candidates
Use the sure positives and negatives to decide on the candidates
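The three-way division could be sketched as follows. This is only an illustration of the principle, not FAT's actual code: the function name `partition` and the idea of testing whole tokens against two sets are assumptions made for the example.

```python
def partition(tokens, sure_positives, sure_negatives):
    """Split tokens into sure positives, sure negatives, and open candidates."""
    positives, negatives, candidates = [], [], []
    for token in tokens:
        if token in sure_positives:
            positives.append(token)      # known to be a taxon name
        elif token in sure_negatives:
            negatives.append(token)      # known not to be a taxon name
        else:
            candidates.append(token)     # undecided, handled by later rules
    return positives, negatives, candidates
```

Later steps would then use the first two lists as evidence for deciding on the third.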

3 FAT – Detail Overview
Find all parts of the text that might be taxon names using
– Morphological structure (in the form of regular expressions)
– Known taxon names (as positive gazetteer lists)
Successively rule candidates to be taxa or not using
– Morphological structure (in the form of regular expressions)
– Known taxon names (as positive gazetteer lists)
– Textual hints (name labels, e.g. "sp. nov.")
– Ruled-out words (as negative gazetteer lists)
– Common dictionaries (as negative gazetteer lists)
– Document-internal contradictions
– User feedback (as a last resort)

4 FAT – Basic Benefits / Deficits
Benefits
– All available knowledge is used
– Newly added knowledge is used as early as possible
– Can learn new taxa through the use of structure
– User can prevent errors through feedback at little effort
Deficits
– Regular expression patterns are somewhat inflexible regarding
  - Automated adaptation to different document styles
  - Language-dependent capitalization schemes (e.g. in German)
– Gazetteer lists are somewhat susceptible to
  - Misspellings / OCR errors
  - Unseen languages

5 Morphological Rules
Exploit the (Linnaean / ICZN) rules of nomenclature
Challenges:
– Different schemes of in-taxon-name punctuation
– Embedded author names (differing styles, strange names)
Implementation:
– Editor for basic building blocks, including
  - line-broken and indented layout
  - syntax check and test facilities
– Actual expressions assembled dynamically at runtime
→ (almost) all parts maintainable in one place
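Assembling expressions from building blocks at runtime might look like the sketch below. The block names, the patterns themselves, and the helper `assemble` are hypothetical; the slides do not show FAT's real expressions, and these patterns are far simpler than what actual nomenclature requires.

```python
import re

# Hypothetical building blocks; each one is maintained in exactly one place.
BLOCKS = {
    "genus": r"[A-Z][a-z]+",
    "subgenus": r"\([A-Z][a-z]+\)",
    "epithet": r"[a-z]{2,}",
    "author": r"[A-Z][a-z]+(?:,? \d{4})?",
}

def assemble(template):
    """Replace each {name} placeholder with its building block and compile."""
    return re.compile(template.format(**BLOCKS))

# Assembled at runtime: genus, optional subgenus, epithet, optional author.
BINOMIAL = assemble(r"\b{genus}(?: {subgenus})? {epithet}(?: {author})?\b")
```

Changing the `author` block here would change every assembled expression that uses it, which is the maintainability benefit the slide names. A pattern this loose also matches ordinary capitalized sentence starts, so it can only serve as a source of candidates.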

6 Gazetteer Lists
Storage for known taxon names / epithets / authors
Challenges:
– Huge amount of data (main memory footprint)
– Misspellings (source text or OCR)
Implementation:
– Editor for lists, including
  - import / export
  - add, intersect, and subtract functions
– Centralized access point
→ loaded and stored only once

7 Running FAT (Overview)
[Pipeline diagram. Input: Document Text. Stages: Recall Rules, Label Rules, Dictionary Filter, Lexicon Rules, Precise Rules, Known Data Rules, Author Name Rules, Negative Rules, Data Rules, Dynamic Lexicon Rules, User Feedback. Results: Taxon Names, Not Taxon Names, Candidates.]

8 Recall Rules
Create candidates:
- morphological structure
- filter out matches that contain stop words
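A minimal sketch of this step, assuming a deliberately loose two-word pattern and a tiny sample stop word list (both hypothetical, not FAT's actual rules):

```python
import re

STOP_WORDS = {"the", "and", "of", "in", "was", "were", "with"}  # assumed sample
# Deliberately loose: a capitalized word followed by a lowercase word.
LOOSE = re.compile(r"\b[A-Z][a-z]+ [a-z]+\b")

def recall_candidates(text):
    """Return loose structural matches that contain no stop word."""
    candidates = []
    for match in LOOSE.finditer(text):
        words = match.group(0).lower().split()
        if not any(word in STOP_WORDS for word in words):
            candidates.append(match.group(0))
    return candidates
```

The pattern favors recall over precision, so sentence starts such as "The nest" match structurally and are only removed by the stop word check.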

9 Dictionary Filter
Filter candidates:
- gazetteer-based
- filter out candidates with common-language words in epithet positions (+ stemming for English)
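The filter could be sketched as below. The word list is a made-up sample, and the suffix stripping in `stem` is a crude stand-in for a real English stemmer; the slides do not say which stemmer FAT uses.

```python
COMMON_WORDS = {"nest", "worker", "collect", "large", "find"}  # assumed sample
SUFFIXES = ("ing", "ed", "es", "s")  # crude stand-in for a real stemmer

def stem(word):
    """Strip the first matching suffix if at least three letters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def passes_dictionary_filter(candidate):
    """Reject a candidate if any epithet (any word after the first) is common."""
    epithets = candidate.split()[1:]
    return not any(
        epithet in COMMON_WORDS or stem(epithet) in COMMON_WORDS
        for epithet in epithets
    )
```

Stemming lets one dictionary entry such as "collect" also rule out "collected" in an epithet position.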

10 Lexicon Rules
Exploit known epithets:
- candidates → matches
- create further candidates

11 Label Rules
Exploit taxon name labels:
- labeled candidates → matches
- "Genus species, sp. nov."
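One way to sketch the label check: a candidate counts as labeled if a name label follows it directly in the text. The slides name only "sp. nov."; the other two labels in the pattern and the function name `labeled` are assumptions for the example.

```python
import re

# Assumed sample of labels; only "sp. nov." appears on the slide.
LABEL = re.compile(r",?\s*(?:sp\. nov\.|n\. sp\.|gen\. nov\.)")

def labeled(text, candidates):
    """Return the candidates that are directly followed by a name label."""
    matches = []
    for candidate in candidates:
        start = text.find(candidate)
        while start != -1:
            # Look at the text right after this occurrence of the candidate.
            if LABEL.match(text[start + len(candidate):]):
                matches.append(candidate)
                break
            start = text.find(candidate, start + 1)
    return matches
```

Every occurrence of a candidate is checked, because a name is typically labeled only where it is first introduced.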

12 Precise Rules
Exploit morphology:
- candidates with distinctive structure → matches
- "Genus species st. race"

13 Known Data Rules
Exploit prior runs:
- Extract epithets from candidates
- Candidates with a known epithet combination → matches

14 Author Name Rules
Exclude candidates with author names in genus or subgenus position
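A minimal sketch of this exclusion, assuming that the genus is the first word, that a subgenus is a parenthesized second word, and that author names come from a small sample gazetteer (all hypothetical simplifications):

```python
import re

AUTHORS = {"Forel", "Emery", "Smith"}  # assumed sample author gazetteer
SUBGENUS = re.compile(r"\(([A-Za-z]+)\)")

def has_author_as_genus(candidate):
    """True if an author name sits in genus or subgenus position."""
    words = candidate.split()
    if not words:
        return False
    if words[0] in AUTHORS:               # genus position
        return True
    if len(words) > 1:
        subgenus = SUBGENUS.fullmatch(words[1])
        if subgenus and subgenus.group(1) in AUTHORS:   # subgenus position
            return True
    return False
```

Candidates for which this returns true would move to the negatives, for example a sentence fragment that starts with an author's surname.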

15 Negative Rules
Exclude candidates with words from the negatives (all text excluded so far) in epithet positions
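This rule could be sketched as follows, assuming the negatives are kept as a list of excluded phrases and that epithet positions are all words after the first. The function name and the case-insensitive comparison are choices made for the example.

```python
def apply_negative_rules(candidates, excluded_so_far):
    """Split candidates into (kept, newly_excluded) using words from the negatives."""
    negative_words = {
        word.lower() for phrase in excluded_so_far for word in phrase.split()
    }
    kept, newly_excluded = [], []
    for candidate in candidates:
        epithets = candidate.split()[1:]
        if any(epithet.lower() in negative_words for epithet in epithets):
            newly_excluded.append(candidate)
        else:
            kept.append(candidate)
    return kept, newly_excluded
```

The newly excluded candidates would be added to the negatives in turn, so the pool of ruled-out words grows as the run proceeds.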

16 Data Rules
Candidates with known epithets in last position → matches
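As a small sketch, assuming candidates are whitespace-separated strings and the known epithets are available as a set (the function name is hypothetical):

```python
def apply_data_rules(candidates, known_epithets):
    """Promote candidates whose last word is a known epithet."""
    matches, remaining = [], []
    for candidate in candidates:
        words = candidate.split()
        # Require at least two words so the last word is an epithet, not the genus.
        if len(words) > 1 and words[-1] in known_epithets:
            matches.append(candidate)
        else:
            remaining.append(candidate)
    return matches, remaining
```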

17 Dynamic Lexicon Rules
Exploit matches & negatives:
- Works as a combination of the preceding lexicon-based rules
- But uses the data of the current document
- Compute the transitive hull
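Computing the transitive hull can be read as a fixed-point iteration: every new match adds its words to a document-level lexicon, which may in turn resolve further candidates. The sketch below is one possible reading, not FAT's actual algorithm; the acceptance criterion (first or last word already in the lexicon) and all names are assumptions.

```python
def dynamic_lexicon(candidates, matches, negative_words):
    """Repeat the lexicon rules on document data until nothing changes."""
    lexicon = {word for name in matches for word in name.split()}
    accepted, rejected, open_candidates = list(matches), [], list(candidates)
    changed = True
    while changed:
        changed = False
        still_open = []
        for candidate in open_candidates:
            words = candidate.split()
            if any(word in negative_words for word in words[1:]):
                rejected.append(candidate)       # negative word in epithet position
                changed = True
            elif words and (words[0] in lexicon or words[-1] in lexicon):
                accepted.append(candidate)       # genus or last epithet is known
                lexicon.update(words)            # its parts become known as well
                changed = True
            else:
                still_open.append(candidate)
        open_candidates = still_open
    return accepted, rejected, open_candidates
```

The loop ends because each pass either resolves at least one candidate or stops, so it runs at most once per candidate plus one final pass. Whatever remains open would go to user feedback.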

18 User Feedback
Ask the user to decide on the remaining candidates (displaying some context)
Optional step, can be omitted

19 Questions?
Browse the Madagascar Corpus at http://plazi.org/GgSRS/search
Download GoldenGATE from http://idaho.ipd.uka.de/GoldenGATE/

