Using Automatically Extracted Information in Species Page Retrieval: a use case Xiaoya Tang and P. Bryan Heidorn Biodiversity Standards Conference September.

Presentation transcript:

Using Automatically Extracted Information in Species Page Retrieval: a use case Xiaoya Tang and P. Bryan Heidorn Biodiversity Standards Conference September 2007 Bratislava, Slovakia

Legacy (NL) - Modernist (RDF) bridge
• Taxonomic literature contains facts
• Large collections available (BDL)
  • OCR and extraction tools limit current use
• Need rapid search and discovery beyond full text
• Need concept normalization

Criticism
"Knowledge extraction tools are not 100% complete or accurate, so extraction is not worth doing." Bunk!!!
1. If you use Google, you are already doing probabilistic search, and it is useful.
2. Controlled experiments provide evidence that it is useful.

Goal: Keys + Google together
• Information needed for plant identification
  • Key-like information: accurate, specific
• Keyword-based retrieval on semi-structured collections
  • Keywords as poor content representations
  • Difficulties in creating keyword queries, esp. for end users
  • Not able to make use of the document structure

An Example Document Excerpt
… Plants, flowering to 2 m. Leaves 20–75, many-ranked, spreading and recurved, not twisted, gray-green (rarely variegated with linear cream stripes), to 1 m × 1.5–3.5 cm, finely appressed-scaly; sheath pale or slightly rust colored, ovate, not inflated, not forming pseudobulb, 6–15 cm wide; blade linear-triangular, leathery, channeled to involute, apex attenuate. Inflorescences: scape, erect, 20–50 cm, 6–12 mm diam.; bracts densely imbricate proximally, often lax distally, erect to spreading, like leaves but gradually smaller; spikes very laxly 6–11-flowered, erect to spreading, 2–3-pinnate, linear, with laxly appressed bracts, 15–40 × 10–15 cm, apex acute; branches 5–40 (rarely simple). Floral bracts widely spaced, erect, green or tinged purple, exposing most of rachis at anthesis, ovate, not keeled, 1.2–2 cm, leathery, venation slight, apex acute, glabrous. Flowers 10–200, conspicuous; sepals free, elliptic, not keeled, 1.4–2 cm, thin-leathery, veined, apex obtuse; corolla tubular, somewhat bilaterally symmetric, petals erect, slightly twisted, white, ligulate, to 4 cm; stamens exserted; stigma exserted, conduplicate-spiral. Fruits to 4 cm. n = 25. …

SDD + TDWG-Lit ?=
• SDD: all structured + NL
• Literature: rich, human-friendly and semi-structured facts
• We want to associate a set of characters and states with links to evidence in the text for an assertion, without destroying the text.
• Mixture of key and text retrieval

Location of expressions
• External standoff markup: is-a External Document/Object; requires unique text identifier + offset
• Internal standoff markup: part-of Literature Document Markup; requires offset
• Internal integrated markup: impossible

Why bother?
• Full structured coding of natural language taxonomic descriptions is out of our reach.
• Partial extraction of facts can aid identification.
• There is a need to accumulate information over time.
• Prior fact patterns can be used to find similar patterns in new texts without human intervention.
• Potential for ontology induction

Location of expressions
• External standoff markup: is-a External Document/Object; requires unique text identifier + offset
  • Allows information merging from multiple sources
• Internal standoff markup: part-of Literature Document Markup; requires offset
• Internal integrated markup: impossible
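The external standoff idea can be illustrated with a minimal sketch. The document identifier scheme, the field names, and the use of character offsets are assumptions made for illustration; the slides only say that a unique text identifier plus an offset is required. Each assertion lives outside the text and points back to its evidence, so the text itself is never modified.

```python
from dataclasses import dataclass

# Hypothetical document store: identifier -> untouched source text.
DOCS = {
    "fna:example-1": "Leaves 20–75, many-ranked, spreading and recurved, "
                     "not twisted, gray-green",
}

@dataclass(frozen=True)
class StandoffAnnotation:
    doc_id: str     # unique text identifier (assumed naming scheme)
    start: int      # character offset where the evidence begins
    end: int        # character offset just past the evidence
    character: str  # e.g. "leaf_count"
    state: str      # e.g. "20–75"

    def evidence(self, docs):
        """Return the span of source text this assertion points to."""
        return docs[self.doc_id][self.start:self.end]

def annotate(docs, doc_id, phrase, character, state):
    """Build an annotation by locating `phrase` in the stored text."""
    start = docs[doc_id].index(phrase)  # raises ValueError if absent
    return StandoffAnnotation(doc_id, start, start + len(phrase), character, state)

count = annotate(DOCS, "fna:example-1", "Leaves 20–75", "leaf_count", "20–75")
color = annotate(DOCS, "fna:example-1", "gray-green", "leaf_color", "gray-green")
print(count.evidence(DOCS))
print(color.evidence(DOCS))
```

Because annotations are keyed by document identifier, sets produced by different tools or at different times can be merged into one list without touching the source text.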

Example 1: False match between query and index terms
User query: 3 leaves
Document text: … Leaves 20–75, many-ranked, spreading and recurved, not twisted, gray-green (rarely variegated with linear cream stripes), to 1 m × 1.5–3.5 cm, … Inflorescences: … spikes very laxly 6–11-flowered, erect to spreading, 2–3-pinnate, …
False match: "3 and leaves"

Example 2: Different vocabularies in queries and documents
User query: Long leaves
Document text: … Leaves 20–75, many-ranked, spreading and recurved, not twisted, gray-green (rarely variegated with linear cream stripes), to 1 m × 1.5–3.5 cm, … Inflorescences: … spikes very laxly 6–11-flowered, erect to spreading, 2–3-pinnate, …
The texts describe leaf length with measurements, not with the word "long".

What's the Problem?
• Keyword-based retrieval only allows queries and documents to be matched
  • Based on string occurrence
  • Not based on semantic meaning
• Early examples revisited
  • Query "3 leaves": string match is "3 and leaves"; semantic match is "Leaf number is 3"
  • Query "long leaves": string match is "long and leaves"; semantic match is "Leaf length > 50cm"
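The difference between the two kinds of match can be sketched in a few lines. This is an illustrative toy, not the retrieval system from the talk: the tokenizer, the function names, and the hand-written fact dictionary (standing in for what an extraction step might produce from the excerpt) are all assumptions.

```python
import re

DOC_TEXT = (
    "Leaves 20–75, many-ranked, spreading and recurved, to 1 m × 1.5–3.5 cm. "
    "Inflorescences: spikes very laxly 6–11-flowered, 2–3-pinnate."
)

# Assumed extraction output, read off the excerpt by hand:
# 20 to 75 leaves, leaf length up to 1 m (100 cm).
DOC_FACTS = {"leaf_count_range": (20, 75), "leaf_length_max_cm": 100}

def keyword_match(query, text):
    """True if every query term occurs as a token anywhere in the text."""
    tokens = set(re.findall(r"[a-z]+|\d+", text.lower()))
    return all(term in tokens for term in query.lower().split())

def leaf_count_is(facts, n):
    """Semantic test: does the recorded leaf-count range include n?"""
    low, high = facts["leaf_count_range"]
    return low <= n <= high

def leaf_longer_than(facts, cm):
    """Semantic test: can a leaf exceed the given length in cm?"""
    return facts["leaf_length_max_cm"] > cm

# "3 leaves" matches on strings ("3" comes from "2–3-pinnate"), but the
# plant does not have 3 leaves.
print(keyword_match("3 leaves", DOC_TEXT), leaf_count_is(DOC_FACTS, 3))
# "long leaves" fails on strings, but the leaves do reach 1 m.
print(keyword_match("long leaves", DOC_TEXT), leaf_longer_than(DOC_FACTS, 50))
```

The string test gives a false hit for the first query and misses the second, while the structured tests give the intended answers in both cases.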

Approach
• Identify useful semantic information within full-text documents using Information Extraction techniques
• Allow users to search based on semantic meaning via structured semantic information

Approach
[Diagram labels: Text; Identifying Semantic Information; Semantic information (Number of leaves: 3; Leaf length: 1 m; …); Keyword query; String match; Semantic information (Number of leaves = 3; Leaf length > 50 cm; …); Semantic match]

Add facts to the text

Morphological Information Extraction
• System features: automatic, portable, accurate, computationally inexpensive
• IE techniques: machine learning, partial parsing, knowledge bases

Machine Learning
[Diagram labels: Training; Doc1 with extracted information for Doc1; Doc2 with extracted information for Doc2; … Doc60 with extracted information for Doc60; Rules; Information extraction system]

IE System Adaptation
[Diagram labels: FNA documents; Pre-processing; Query analysis; Rule creation module; Extraction Rules; Knowledge bases; Templates for useful information; Structured information. Annotations: automatically learned in the new domain; updated in the new domain; modified in the new domain]

IE System Training
[Diagram labels: Training documents; Pre-processing module; Manually tagged instances; Knowledge bases; Learning module; Learned Rules]

Information Extraction From FNA
User log analysis
Templates for useful information: Leaf_Shape, Leaf_Margin, Leaf_Apex, Leaf_Base, Blade_Dimension, …
Original documents: … Leaf blade obovate to nearly orbiculate, 3--9 × 3--8 cm, leathery, base obtuse to broadly cuneate, margins flat, coarsely and often irregularly doubly serrate to nearly dentate. …
Knowledge bases: … PartBlade: Leaf blade, Blades, blade …
Extraction Rules:
Pattern:: * ' ' * ( ) ',' * Output:: leaf {leafShape $1}
Pattern:: * * ', ' ( ' ' * ) * Output:: leaf {bladeDimension $1}
Structured information: Leaf_Shape obovate; Leaf_Shape orbiculate; Blade_Dimension 3--9 x 3--8 cm; …
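The flow on this slide (document text plus a knowledge base plus patterns yields template slots) can be approximated with ordinary regular expressions. The sketch below is not the system's learned rule syntax, which is only partly legible above; the shape vocabulary, the function name, and both regular expressions are assumptions chosen to reproduce the slide's example output.

```python
import re

# Tiny stand-in knowledge base of shape terms (assumed; the real one is not shown).
LEAF_SHAPES = {"obovate", "orbiculate", "ovate", "elliptic", "linear"}

TEXT = ("Leaf blade obovate to nearly orbiculate, 3--9 × 3--8 cm, leathery, "
        "base obtuse to broadly cuneate, margins flat")

# A range such as "3--9" or "1.5–3.5".
RANGE = r"\d[\d.]*\s*(?:--|–)\s*\d[\d.]*"
# Two ranges joined by a multiplication sign, followed by a unit.
DIMENSION = re.compile(rf"({RANGE}\s*[×x]\s*{RANGE}\s*(?:cm|mm|m)\b)")
# The clause that follows "Leaf blade" up to the first comma.
BLADE_CLAUSE = re.compile(r"[Ll]eaf blades?\s+([^,]*),")

def extract_leaf_facts(text):
    """Return (slot, value) pairs for leaf shape and blade dimension."""
    facts = []
    clause = BLADE_CLAUSE.search(text)
    if clause:
        # Whole-word lookup, so "obovate" is not also counted as "ovate".
        for word in re.findall(r"[a-z]+", clause.group(1)):
            if word in LEAF_SHAPES:
                facts.append(("Leaf_Shape", word))
    dim = DIMENSION.search(text)
    if dim:
        facts.append(("Blade_Dimension", dim.group(1)))
    return facts

for slot, value in extract_leaf_facts(TEXT):
    print(slot, value)
```

Hand-written patterns like these are brittle; the talk's approach of learning rules from tagged training documents is aimed at avoiding exactly this manual step.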

Results - IE
Recall = correct/possible
Precision = correct/actual
Table columns: Type of information; possible; correct; incorrect; actual; Recall (%); Precision (%)
Table rows: Genus; Species; Distribution; Leaf shape; Leaf margin; Leaf apex; Leaf base; Leaf arrangement; Blade dimension; Leaf color; Fruit/nut shape
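The two definitions on this slide translate directly into code. The counts below are hypothetical and chosen only to show the arithmetic; they are not results from the talk. The sketch also assumes that "actual" is the sum of correct and incorrect extractions, which the slide does not state.

```python
def recall(correct, possible):
    """Recall (%) = correct / possible."""
    return 100.0 * correct / possible if possible else 0.0

def precision(correct, actual):
    """Precision (%) = correct / actual."""
    return 100.0 * correct / actual if actual else 0.0

# Hypothetical counts for one type of information (illustration only).
possible = 50    # instances present in the documents
correct = 40     # instances extracted correctly
incorrect = 5    # instances extracted wrongly
actual = correct + incorrect  # assumption: everything the system output

print(f"Recall: {recall(correct, possible):.1f}%")
print(f"Precision: {precision(correct, actual):.1f}%")
```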

Retrieval System Design and User Evaluation
[Diagram labels: FNA collection; Information extraction; SEARF: keyword retrieval; SEARFA: retrieval with keywords + structured semantic information; User evaluation (one per system); Performance comparison]


Results – System Performance
Table columns: Group; NT; NTH; TSR; SSR; NSST; TST; NDVST
Table rows: SEARFA; SEARF; Sig. (ANOVA)
• NT: number of tasks accomplished in total
• NTH: number of tasks accomplished per hour
• TSR: task success rate
• SSR: search success rate
• NSST: number of searches to accomplish a task
• TST: time spent to accomplish a task
• NDVST: number of documents viewed to accomplish a task

Results – User Satisfaction
Table columns: Group; Completeness; Usefulness; Ease_use; Ease_learn; Overall_satisfaction
Table rows: SEARFA; SEARF; Sig. (ANOVA)

Limitations and Future Work
• Generalization of text collections
• Other collections in the same domain and other domains
• Generalization of IE applications
• Document representations
• A wider range of attributes
• Query formulation and interface design
• Online term definitions
• Visualized search interface
• Retrieval algorithms
• More accurate matching