Using Automatically Extracted Information in Species Page Retrieval: a use case Xiaoya Tang and P. Bryan Heidorn Biodiversity Standards Conference September.


1 Using Automatically Extracted Information in Species Page Retrieval: a use case Xiaoya Tang and P. Bryan Heidorn Biodiversity Standards Conference September 2007 Bratislava, Slovakia

2 Legacy (NL) - Modernist (RDF) bridge
- Taxonomic literature contains facts
- Large collections are available (BDL); OCR and extraction tools limit current use
- Need rapid search and discovery beyond full text
- Need concept normalization

3 Criticism
Knowledge extraction tools are not 100% complete or accurate, so extraction is not worth doing. Bunk!!!
1. If you use Google you are already doing probabilistic search, and it is useful.
2. Controlled-experiment evidence shows that it is useful.

4 Goal: Keys + Google together
- Information needed for plant identification: key-like information
  - Accurate
  - Specific
- Keyword-based retrieval on semi-structured collections
  - Keywords are poor content representations
  - Keyword queries are difficult to create, especially for end users
  - Cannot make use of the document structure

5 An Example Document Excerpt
……….. Plants, flowering to 2 m. Leaves 20–75, many-ranked, spreading and recurved, not twisted, gray-green (rarely variegated with linear cream stripes), to 1 m × 1.5–3.5 cm, finely appressed-scaly; sheath pale or slightly rust colored, ovate, not inflated, not forming pseudobulb, 6–15 cm wide; blade linear-triangular, leathery, channeled to involute, apex attenuate. Inflorescences: scape, erect, 20–50 cm, 6–12 mm diam.; bracts densely imbricate proximally, often lax distally, erect to spreading, like leaves but gradually smaller; spikes very laxly 6–11-flowered, erect to spreading, 2–3-pinnate, linear, with laxly appressed bracts, 15–40 × 10–15 cm, apex acute; branches 5–40 (rarely simple). Floral bracts widely spaced, erect, green or tinged purple, exposing most of rachis at anthesis, ovate, not keeled, 1.2–2 cm, leathery, venation slight, apex acute, glabrous. Flowers 10–200, conspicuous; sepals free, elliptic, not keeled, 1.4–2 cm, thin-leathery, veined, apex obtuse; corolla tubular, somewhat bilaterally symmetric, petals erect, slightly twisted, white, ligulate, to 4 cm; stamens exserted; stigma exserted, conduplicate-spiral. Fruits to 4 cm. n = 25. …………..

6 SDD + TDWG-Lit ?=
- SDD: all structured + NL
- Literature: rich, human-friendly and semi-structured facts
- We want to associate a set of characters and states with links to evidence in the text for an assertion, without destroying the text.
- Mixture of key and text retrieval

7 Location of expressions
- External standoff markup: is-a External Document/Object; requires unique text identifier + offset
- Internal standoff markup: part-of Literature Document Markup; requires offset
- Internal integrated markup: impossible

8 Why bother?
- Full structured coding of natural language taxonomic descriptions is out of our reach
- Partial extraction of facts can aid identification
- There is a need to accumulate information over time
- Prior fact patterns can be used to find similar patterns in new texts without human intervention
- Potential for ontology induction

9 Location of expressions
- External standoff markup: is-a External Document/Object; requires unique text identifier + offset
  - Allows information merging from multiple sources
- Internal standoff markup: part-of Literature Document Markup; requires offset
- Internal integrated markup: impossible
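A minimal sketch of what external standoff markup could look like, assuming a fact record that points at its evidence by text identifier plus character offsets. The class name `StandoffFact`, the field names, and the identifier `fna:example-1` are hypothetical and do not come from the presentation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StandoffFact:
    """A fact stored outside the text, pointing back at its evidence."""
    doc_id: str     # unique text identifier (hypothetical format)
    start: int      # character offset where the evidence begins
    end: int        # character offset just past the evidence
    character: str  # e.g. "leaf color"
    state: str      # e.g. "gray-green"


def evidence(fact, texts):
    """Return the text span that supports the fact; the text is not modified."""
    return texts[fact.doc_id][fact.start:fact.end]


texts = {
    "fna:example-1": "Leaves 20–75, many-ranked, spreading and recurved, "
                     "not twisted, gray-green",
}
original = texts["fna:example-1"]

span = "gray-green"
start = original.index(span)
fact = StandoffFact("fna:example-1", start, start + len(span),
                    "leaf color", "gray-green")

print(evidence(fact, texts))
```

Because the fact lives in its own record, facts from several sources can point at the same identifier without any of them editing the text.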

10 Example 1: False match between query and index terms
User query: 3 leaves
Document: …... Leaves 20–75, many-ranked, spreading and recurved, not twisted, gray-green (rarely variegated with linear cream stripes), to 1 m × 1.5–3.5 cm, ……... Inflorescences: ……. spikes very laxly 6–11-flowered, erect to spreading, 2–3-pinnate, …….
False match: “3 and leaves”

11 Example 2: Different vocabularies in queries and documents
User query: Long leaves
Document: …... Leaves 20–75, many-ranked, spreading and recurved, not twisted, gray-green (rarely variegated with linear cream stripes), to 1 m × 1.5–3.5 cm, ……... Inflorescences: ……. spikes very laxly 6–11-flowered, erect to spreading, 2–3-pinnate, …….
Description of leaf length in texts: “to 1 m”

12 What’s the Problem?
- Keyword-based retrieval only allows queries and documents to be matched
  - Based on string occurrence
  - Not based on semantic meaning
- Earlier examples revisited
  - Query “3 leaves”: string match is “3 and leaves”; semantic match is “Leaf number is 3”
  - Query “long leaves”: string match is “long and leaves”; semantic match is “Leaf length > 50cm”
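The string-match behaviour can be reproduced with a small sketch. It assumes a very simple conjunctive keyword matcher (every query token must occur somewhere in the document); the tokenizer and the function names are illustrative assumptions, not the retrieval engine used in the study.

```python
import re


def tokens(text):
    """Split text into lowercase word and number tokens."""
    return set(re.findall(r"[a-z]+|\d+", text.lower()))


def keyword_match(query, document):
    """Conjunctive string match: every query token occurs in the document."""
    return tokens(query) <= tokens(document)


excerpt = (
    "Leaves 20–75, many-ranked, spreading and recurved, not twisted, "
    "gray-green, to 1 m × 1.5–3.5 cm. Inflorescences: spikes very laxly "
    "6–11-flowered, erect to spreading, 2–3-pinnate."
)

# "3" occurs in "3.5 cm" and "2–3-pinnate", so the query matches
# even though the plant has 20–75 leaves.
print(keyword_match("3 leaves", excerpt))

# The leaves are up to 1 m long, but the word "long" never appears.
print(keyword_match("long leaves", excerpt))
```

The first query matches for the wrong reason and the second misses a relevant document, which are the two failure cases on slides 10 and 11.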

13 Approach
- Identify useful semantic information within full-text documents using Information Extraction techniques
- Allow users to search based on semantic meaning via structured semantic information

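14 Approach (diagram)
Text → identifying semantic information → semantic information (Number of leaves 3; Leaf length 1 m; …..)
Keyword query → semantic information (Number of leaves = 3; Leaf length > 50cm; …..)
Text and keyword query are compared by string match; the two sets of semantic information are compared by semantic match.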
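A minimal sketch of a semantic match, assuming each extracted fact is stored as a numeric (low, high) range and each query is a list of (attribute, operator, value) constraints. The attribute names, the range encoding, and the lower bound of 0 for "to 1 m" are assumptions made for illustration only.

```python
def satisfies(fact_range, op, value):
    """Check one constraint against a (low, high) range extracted from text."""
    low, high = fact_range
    if op == "=":
        return low <= value <= high
    if op == ">":
        return high > value
    if op == "<":
        return low < value
    raise ValueError(f"unsupported operator: {op}")


def semantic_match(query, facts):
    """A document matches when every constraint holds for a known attribute."""
    return all(
        attr in facts and satisfies(facts[attr], op, value)
        for attr, op, value in query
    )


# Facts for the example excerpt: "Leaves 20–75 ... to 1 m".
# The lower bound 0 for leaf length is an assumption, since the text
# only gives an upper bound.
doc_facts = {
    "number_of_leaves": (20, 75),
    "leaf_length_cm": (0, 100),
}

print(semantic_match([("number_of_leaves", "=", 3)], doc_facts))
print(semantic_match([("leaf_length_cm", ">", 50)], doc_facts))
```

Under these assumptions the "3 leaves" query no longer matches the excerpt, while the "long leaves" query, expressed as leaf length > 50 cm, does.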

15 Add facts to the text

16 Morphological Information Extraction System
System features: automatic, portable, accurate, computationally inexpensive
IE techniques: machine learning, partial parsing, knowledge bases

17 Machine Learning (diagram)
Training: Doc1, Doc2, ... Doc60, each paired with its extracted information, are fed to the information extraction system, which produces rules.

18 IE System Adaptation (diagram labels)
Extraction Rules; Knowledge bases; Templates for useful information; FNA documents; Structured information; Automatically learned in the new domain; Updated in the new domain; Modified in the new domain; Rule creation module; Query analysis; Pre-processing

19 IE System Training (diagram labels)
Learning module; Learned Rules; Knowledge bases; Manually tagged instances; Pre-processing module; Training documents

20 Information Extraction From FNA
User log analysis
Templates for useful information: Leaf_Shape, Leaf_Margin, Leaf_Apex, Leaf_Base, Blade_Dimension, …..
Original documents: ……….. Leaf blade obovate to nearly orbiculate, 3--9 × 3--8 cm, leathery, base obtuse to broadly cuneate, margins flat, coarsely and often irregularly doubly serrate to nearly dentate. ………………
Knowledge bases: ….. PartBlade: Leaf blade; Blades; blade ……
Extraction rules:
Pattern:: * ' ' * ( ) ',' * Output:: leaf {leafShape $1}
Pattern:: * * ', ' ( ' ' * ) * Output:: leaf {bladeDimension $1}
Structured information: Leaf_Shape obovate; Leaf_Shape orbiculate; Blade_Dimension 3--9 × 3--8 cm; …………..
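The rule patterns on this slide lost part of their content in transcription, so they cannot be reproduced exactly. As a rough illustration of the idea only, the sketch below uses ordinary regular expressions in place of the system's learned rules. The `PART_BLADE` entries come from the slide; `SHAPE_TERMS`, both regular expressions, and the function name are hypothetical stand-ins.

```python
import re

# Knowledge base entry shown on the slide.
PART_BLADE = ["Leaf blade", "Blades", "blade"]
# Hypothetical stand-in for a knowledge base of shape terms.
SHAPE_TERMS = {"obovate", "orbiculate"}

part = "|".join(re.escape(p) for p in PART_BLADE)
# Hand-written substitutes for the learned rules (assumed, not the originals).
SHAPE_RULE = re.compile(rf"(?:{part})\s+([^,]+),")
DIMENSION_RULE = re.compile(r"(\d+--\d+\s*×\s*\d+--\d+\s*(?:mm|cm|m))")


def extract(text):
    """Fill Leaf_Shape and Blade_Dimension template slots from a description."""
    facts = []
    shape = SHAPE_RULE.search(text)
    if shape:
        for word in shape.group(1).split():
            if word in SHAPE_TERMS:
                facts.append(("Leaf_Shape", word))
    dimension = DIMENSION_RULE.search(text)
    if dimension:
        facts.append(("Blade_Dimension", dimension.group(1)))
    return facts


description = ("Leaf blade obovate to nearly orbiculate, 3--9 × 3--8 cm, "
               "leathery, base obtuse to broadly cuneate, margins flat")

for slot, value in extract(description):
    print(slot, value)
```

On the slide's example sentence this yields the same three slot fillers listed under structured information.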

21 Results - IE
Recall = correct/possible; Precision = correct/actual
Type of information   Possible   Correct   Incorrect   Actual   Recall (%)   Precision (%)
Genus                 44         44        0           44       100          100
Species               13         13        0           13       100          100
Distribution          11         11        0           11       100          100
Leaf shape            158        82        15          97       51.9         84.54
Leaf margin           47         30        1           31       63.83        96.77
Leaf apex             45         32        6           38       71.11        84.21
Leaf base             113        90        16          106      79.65        84.90
Leaf arrangement      6          2         0           2        33.33        100
Blade dimension       35         23        5           28       65.71        82.14
Leaf color            34         21        1           22       61.76        95.45
Fruit/nut shape       192        108       35          143      56.25        75.52
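The two definitions on this slide can be checked directly against the reported counts. The sketch below assumes that "actual" is the sum of correct and incorrect extractions, which agrees with the rows where all counts are legible; the function name is illustrative.

```python
def recall_precision(possible, correct, incorrect):
    """Return (recall %, precision %) using the slide's definitions."""
    actual = correct + incorrect  # assumption: actual = correct + incorrect
    recall = 100 * correct / possible
    precision = 100 * correct / actual
    return recall, precision


# (possible, correct, incorrect) for three rows of the results table.
rows = {
    "Leaf shape": (158, 82, 15),
    "Leaf margin": (47, 30, 1),
    "Fruit/nut shape": (192, 108, 35),
}

for name, counts in rows.items():
    recall, precision = recall_precision(*counts)
    print(f"{name}: recall {recall:.2f}%, precision {precision:.2f}%")
```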

22 Retrieval System Design and User Evaluation (diagram labels)
FNA collection; Information extraction; SEARF: Keyword retrieval; SEARFA: Retrieval with keywords + structured semantic information; User evaluation; Performance comparison

23 2007

24

25 Results – System Performance
Group          NT      NTH     TSR     SSR     NSST    TST     NDVST
SEARFA         6.75    8.078   0.860   0.210   4.779   338.8   11.16
SEARF          4.50    3.598   0.568   0.053   9.584   435.2   14.75
Sig. (ANOVA)   0.005           0.000   0.011   0.000   0.72    0.162
NT: number of tasks accomplished in total
NTH: number of tasks accomplished per hour
TSR: task success rate
SSR: search success rate
NSST: number of searches to accomplish a task
TST: time spent to accomplish a task
NDVST: number of documents viewed to accomplish a task

26 Results – User Satisfaction
Group          Completeness   Usefulness   Ease_use   Ease_learn   Overall_satisfaction
SEARFA         5.3333                      4.9167     6.0000       4.5833
SEARF          3.6667         4.2500       3.3333     5.4167       3.0833
Sig. (ANOVA)   .001           .059         .043       .255         .005

27 Limitations and Future Work
- Generalization of text collections
  - Other collections in the same domain and other domains
- Generalization of IE applications
  - Document representations
  - A wider range of attributes
- Query formulation and interface design
  - Online term definitions
  - Visualized search interface
- Retrieval algorithms
  - More accurate matching

