
1 Populating a Database from Parallel Texts using “Ontology-based” Information Extraction
Mary McGee Wood, Shenghui Wang: Dept of Computer Science, U. of Manchester
Valentin Tablan, Diana Maynard, Hamish Cunningham: Dept of Computer Science, U. of Sheffield
Susannah Lydon: Earth Science Education Unit, U. of Keele

2 The hypothesis

3 Overview
- Parallel texts
- Legacy data in the natural sciences
- “Ontology-based” Information Extraction

4 NLDB’04 - a few running threads
- Multiple / semi-overlapping text sources
- Sophisticated vs shallow or statistical text processing
- “Ontologies” are not the same as gazetteers or lexicons (or semantic nets!)
- Autonomous agents vs HCC (Human-Computer Collaborative) approaches

5 We are doing…
- Highly homogeneous data sources
- Shallow text processing
- “Ontologies” only as a last resort
- HCC approach

6 We are not doing…
- Heterogeneous data sources
- Sophisticated language processing
- Improvement of single-source IE or question-answering
- Autonomous agents

7 Parallel texts
- Text descriptions in the traditional descriptive sciences.
- Descriptions of protein sequences and functions in molecular biology.
- Press coverage of news stories.
- Police witness-of-crime reports.
- (Semi-) automatic marking of free text answers in examinations.

8 Legacy data in the natural sciences
Text descriptions in the traditional descriptive sciences:
- Species descriptions in botany and zoology
- Descriptions of diseases in medicine.

9 Data sources
- Five species of Ranunculus (buttercups)
- Six botanists’ text descriptions (Floras)

10 Typical data
R. acris L. - Meadow Buttercup. Erect perennial to 1m; basal leaves deeply palmately lobed, pubescent; flowers 15-25mm across; sepals not reflexed; achenes 2-3.5mm, glabrous, smooth, with short hooked beak; 2n=14.
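The regular shape of this description (semicolon-separated clauses, measurements written as ranges with units) is what makes shallow processing plausible. The following is a minimal sketch of that idea only, not the GATE pipeline used in the project; the names `split_clauses`, `find_measures`, and the `MEASURE` pattern are hypothetical.

```python
import re

# Description text taken from the slide above.
DESCRIPTION = (
    "Erect perennial to 1m; basal leaves deeply palmately lobed, pubescent; "
    "flowers 15-25mm across; sepals not reflexed; "
    "achenes 2-3.5mm, glabrous, smooth, with short hooked beak; 2n=14."
)

# Assumed pattern: a number, an optional "-number" upper bound, then a unit.
MEASURE = re.compile(r"(\d+(?:\.\d+)?)(?:-(\d+(?:\.\d+)?))?\s*(mm|cm|m)\b")


def split_clauses(text):
    """Split a description into clauses at semicolons, dropping empty pieces."""
    return [c.strip(" .") for c in text.split(";") if c.strip(" .")]


def find_measures(clause):
    """Return (low, high, unit) tuples; a single value is reported as low == high."""
    found = []
    for m in MEASURE.finditer(clause):
        low = float(m.group(1))
        high = float(m.group(2)) if m.group(2) else low
        found.append((low, high, m.group(3)))
    return found


clauses = split_clauses(DESCRIPTION)
measures = {clause: find_measures(clause) for clause in clauses}
```

Note that this sketch treats "to 1m" as a single value; a fuller treatment would record it as an upper bound.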

11 Hand Parsing & Correlation

12 Results of hand-analysis of Ranunculus descriptions from six sources
- Most data from one source only
- Individual texts contain on average 39% of the total information for each species
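One way to read these two figures is as set arithmetic over the facts each source states. The sketch below assumes each source is reduced to a set of fact identifiers; the data in `TOY_SOURCES` is invented purely for illustration and is not the Ranunculus data.

```python
def coverage(sources):
    """Fraction of all known facts that each source contains (assumes non-empty input)."""
    all_facts = set().union(*sources.values())
    return {name: len(facts) / len(all_facts) for name, facts in sources.items()}


def single_source_share(sources):
    """Fraction of all known facts that appear in exactly one source."""
    all_facts = set().union(*sources.values())
    unique = [f for f in all_facts if sum(f in s for s in sources.values()) == 1]
    return len(unique) / len(all_facts)


# Hypothetical toy data: three sources, five distinct facts.
TOY_SOURCES = {
    "flora_a": {"f1", "f2", "f3"},
    "flora_b": {"f3", "f4"},
    "flora_c": {"f5"},
}

per_source = coverage(TOY_SOURCES)
average_coverage = sum(per_source.values()) / len(per_source)
unique_share = single_source_share(TOY_SOURCES)
```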

13 MultiFlora I
Automatic compilation of accurate taxonomic databases from multiple non-computerised sources
Department of Botany, Natural History Museum, London: Rob Huxley, David Sutton
Department of Computer Science, University of Manchester: Mary McGee Wood, David Rydeheard, Susannah Lydon
Supported by the BBSRC / EPSRC joint Bioinformatics Initiative, grant reference number 34/BIO12072

14 GATE I

15 Tagger output

16 Parse trees

17 Names & verbs
‘Basal leaves more or less deeply divided…’
1231 semantics (qlf:[ne_tag(e13, offsets(179, 184)), name(e13, 'Basal'), realisation(e13, offsets(179, 184)), leave(e12), time(e12, present), aspect(e12, simple), voice(e12, active), realisation(e12, offsets(185, 191)), lsubj(e12, e13)])
1247 semantics (qlf:[divide(e14), adv(e14, less), adv(e14, deeply), time(e14, none), aspect(e14, simple), voice(e14, passive), into(e14, e15), count(e15, 3), realisation(e15, offsets(225, 226)), realisation(e14, offsets(200, 226)), realisation(e14, offsets(200, 226))])

18 Template output (1)
Erect perennial to 1m; basal leaves deeply palmately lobed, pubescent;
HEAD KIND FEATURE TYPE KIND
Erect Perennial to 1m measure unknown basal position pubescent leaves Prefix deeply palmately lobed

19 Template output (2)
flowers 15-25mm across; sepals not reflexed; achenes 2-3.5mm, glabrous, smooth, with short hooked beak;
HEAD KIND FEATURE TYPE KIND NEGATION
flowers 15-25mm measure width across sepals reflexed true achenes short hooked smooth glabrous 2-3.5mm measure unknown

20 MultiFlora II: Combining Information Extraction and Knowledge Representation for Biodiversity Informatics
Department of Computer Science, University of Manchester: Mary McGee Wood, Susannah Lydon, Alan Rector
Department of Botany, Natural History Museum, London: Rob Huxley
Natural Language Processing Group, University of Sheffield: Hamish Cunningham, Valentin Tablan, Diana Maynard
Supported by the BBSRC Bioinformatics and E-science Programme, grant reference number 34/BEP17049

21 GATE II

22 “Ontology-based” Information Extraction
- “Ontology” – classes of heads, properties, and features
- Gazetteers – instances of these classes
- (Lexicons – not currently used)

23 Head categories
Specific plant parts:
- Flower : Flower, floret, Fl
- Leaf : leaf, leaves, Fronds
- Petal : petal, honey-leaf, vexillum
Collective categories:
- PlantSeparatablePart : appendage, glume, tuber
- PlantUnseparatablePart : beak, lobe, segment
- SpecificRegionOfWhole : apex, border, head

24 Ontology: Heads (figure: ontology-heads.eps)

25 Properties
- 2DShape : arching, linear, toothed
- 3DShape : branching, thickened, tube
- Colour : glossy, golden, greenish
- Count : numerous, several

26 Ontology: Properties

27 Features
- Habit : bush, shrub, succulent
- MorphologicalProperty : dense, contiguous, separate
- SurfaceProperty : pilose, pitted, rugose
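The split between ontology classes and gazetteer instances can be illustrated with a plain lookup table. This is a sketch under simplifying assumptions (exact lowercase matching, no stemming, no multi-word terms), populated only with the example terms listed on the Head categories, Properties, and Features slides; `GAZETTEER`, `LOOKUP`, and `annotate` are hypothetical names, not GATE components.

```python
import re

# Class name -> example instances, copied (lowercased) from the slides.
GAZETTEER = {
    # Head categories
    "Flower": ["flower", "floret", "fl"],
    "Leaf": ["leaf", "leaves", "fronds"],
    "Petal": ["petal", "honey-leaf", "vexillum"],
    "PlantSeparatablePart": ["appendage", "glume", "tuber"],
    "PlantUnseparatablePart": ["beak", "lobe", "segment"],
    "SpecificRegionOfWhole": ["apex", "border", "head"],
    # Properties
    "2DShape": ["arching", "linear", "toothed"],
    "3DShape": ["branching", "thickened", "tube"],
    "Colour": ["glossy", "golden", "greenish"],
    "Count": ["numerous", "several"],
    # Features
    "Habit": ["bush", "shrub", "succulent"],
    "MorphologicalProperty": ["dense", "contiguous", "separate"],
    "SurfaceProperty": ["pilose", "pitted", "rugose"],
}

# Invert to term -> class for constant-time lookup.
LOOKUP = {term: cls for cls, terms in GAZETTEER.items() for term in terms}


def annotate(text):
    """Return (token, class) pairs for every token found in the gazetteer."""
    tokens = re.findall(r"[a-z][a-z-]*", text.lower())
    return [(tok, LOOKUP[tok]) for tok in tokens if tok in LOOKUP]


annotations = annotate("Leaves linear, toothed; achenes with a short beak, pilose")
```

Because matching is exact, a form such as "petals" is missed unless it is listed; that limitation is a property of this sketch, not a claim about the project's gazetteers.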

28 Ontology: Features

29 More typical data
Perennial herb with overwintering lf-rosettes from the short oblique to erect premorse stock up to 5 cm, rarely longer and more rhizome-like; roots white, rather fleshy, little branched.

30 System output
Head Class    | Head        | Property       | FeatClass   | Feature
Plant         | herb        | hasLifeform    | Lifeform    | Perennial
Leaf          | lf-rosettes | hasLifeform    | Lifeform    | overwintering
PlantSepPart  | stock       | hasRelProperty | RelProperty | short
PlantSepPart  | stock       | hasOrientation | Orientation | oblique to erect
PlantSepPart  | stock       | hasLength      | Length      | up to 5 cm
PlantSepPart  | stock       | hasRelProperty | RelProperty | rhizome-like
Root          | roots       | hasColour      | Colour      | white
Root          | roots       | hasShape3D     | Shape3D     | rather fleshy
Root          | roots       | hasShape3D     | Shape3D     | little branched

31 Precision (table)
Columns: R. acris | R. bulbosus | R. hederaceus | Avg
Rows:
- Single description, average
- Single description, average, for whole template
- Merged, for whole template

32 Recall (table)
Columns: R. acris | R. bulbosus | R. hederaceus | Avg
Rows:
- Single description, average
- Single description, average, for whole template
- Merged, for whole template

33 F-measure (table)
Columns: R. acris | R. bulbosus | R. hederaceus | Avg
Rows:
- Single description, average
- Single description, average, for whole template
- Merged, for whole template
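For reference, the three measures can be computed from a set of extracted slot fills and a hand-built gold set. The slides do not say which F-measure variant was used; the sketch below assumes the balanced form F = 2PR / (P + R). The function name `prf` and the example triples are illustrative only, and the numbers they produce are not results from the project.

```python
def prf(extracted, gold):
    """Precision, recall, and balanced F-measure over sets of slot fills."""
    extracted, gold = set(extracted), set(gold)
    correct = len(extracted & gold)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Hypothetical gold standard and system output as (head, property, value) triples.
gold = {
    ("roots", "hasColour", "white"),
    ("roots", "hasShape3D", "rather fleshy"),
    ("stock", "hasLength", "up to 5 cm"),
    ("herb", "hasLifeform", "Perennial"),
}
extracted = {
    ("roots", "hasColour", "white"),
    ("herb", "hasLifeform", "Perennial"),
    ("stock", "hasColour", "short"),  # deliberately wrong fill
}

precision, recall, f_measure = prf(extracted, gold)
```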

34 Information merging (table)
- Of all instances of missed information, percentage compensated for by merging
- Of total number of slots in template, percentage where merging allowed compensation for missed information

35 Information merging
- These figures based on human judgement
- Automated “merging reasoner” under active construction
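The merging reasoner itself is not described in the slides. As a rough illustration of what merging parallel descriptions involves, the sketch below takes the union of slot fills across sources and flags any slot where sources disagree, leaving the decision to a human in the HCC spirit. The functions `merge` and `needs_review` and the second source's values are assumptions made for the example.

```python
from collections import defaultdict


def merge(descriptions):
    """Union slot fills across sources; each slot keeps (value, source) pairs."""
    merged = defaultdict(set)
    for source, fills in descriptions.items():
        for head, prop, value in fills:
            merged[(head, prop)].add((value, source))
    return merged


def needs_review(merged):
    """Slots for which the sources give more than one distinct value."""
    return sorted(
        slot for slot, pairs in merged.items()
        if len({value for value, _ in pairs}) > 1
    )


descriptions = {
    # Values from the "Typical data" slide; property names are hypothetical.
    "source_a": [
        ("flowers", "hasWidth", "15-25mm"),
        ("sepals", "hasOrientation", "not reflexed"),
    ],
    # Hypothetical second description, invented for this example.
    "source_b": [
        ("flowers", "hasWidth", "18-25mm"),
        ("achenes", "hasLength", "2-3.5mm"),
    ],
}

merged = merge(descriptions)
conflicts = needs_review(merged)
```

Here `source_b` fills a slot that `source_a` misses (achene length), while the two differing flower widths are flagged rather than resolved automatically.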

36 Future work – short term
- Fine-tuning to improve precision
- (Semi-) automatic template correlation heuristics
- (Semi-) automatic data correlation heuristics
- Extend coverage and evaluation

37 Future targets
Techniques:
- Merging reasoner
- Temporal reasoner
Data types:
- Large-scale legacy data in biodiversity studies
- Free text annotations in Bioinformatics databases
- …
