A Pattern Based Approach to Answering Factoid, List and Definition Questions
Mark A. Greenwood and Horacio Saggion
Natural Language Processing Group, Department of Computer Science, University of Sheffield, UK

Outline of Talk
What is Question Answering?
  Different Question Types
System Description
  Factoid and List Questions
    System Architecture
    Surface Matching Text Patterns
    Fallback to Semantic Entities
  Definition Questions
    System Architecture
    Knowledge Acquisition
    Locating Possible Definitions
Results and Evaluation
  Factoid and List Questions
  Definition Questions
Conclusions and Future Work

What is Question Answering?
The main aim of QA is to present the user with a short answer to a question rather than a list of possibly relevant documents.
As it becomes more and more difficult to find answers on the WWW using standard search engines, question answering technology will become increasingly important.
Answering questions using the web is already enough of a problem for it to appear in fiction (Marshall, 2002): "I like the Internet. Really, I do. Any time I need a piece of shareware or I want to find out the weather in Bogotá… I'm the first guy to get the modem humming. But as a source of information, it sucks. You got a billion pieces of data, struggling to be heard and seen and downloaded, and anything I want to know seems to get trampled underfoot in the crowd."

Different Question Types
Clearly there are many different types of question which a user can ask. The system discussed in this presentation attempts to answer:
  Factoid questions usually require a single fact as the answer and include questions such as "How high is Everest?" or "When was Mozart born?".
  List questions require multiple facts to be returned in answer to a question. Examples are "Name 22 cities that have a subway system" or "Name companies which manufacture tractors".
  Definition questions, such as "What is aspirin?", require answers covering essential (e.g. "aspirin is a drug") as well as non-essential (e.g. "aspirin is a blood thinner") descriptions of the definiendum (the term being defined).
The system makes no attempt to answer other question types. For example, speculative questions such as "Is the airline industry in trouble?" are not handled.

System Description
As the three types of question require different techniques to answer them, the system consists of two sub-systems:
  Factoid: This system answers both the factoid and list questions. For factoid questions the system returns the best answers and for list questions it returns all the answers it found.
  Definition: This system is only responsible for answering the definition questions.
The rest of this section provides an overview of both systems and how patterns are used to answer the differing question types.

Factoid System Architecture (architecture diagram)

Surface Text Patterns
Learning patterns which can be used to find answers involves a two-stage process:
  The first stage is to learn a set of patterns from a set of question-answer pairs.
  The second stage involves assigning a precision to each pattern and discarding those patterns which are tied to a specific question-answer pair.
To explain the process we will use questions of the form "When was X born?":
  As a concrete example we will use "When was Mozart born?".
  For this question the question-answer pair is: Mozart / 1756.

Surface Text Patterns
The first stage is to learn a set of patterns from the question-answer pairs for a specific question type:
  For each example the question and answer terms are submitted to Google and the top ten documents are downloaded.
  Each document then has the question and answer terms replaced by AnCHoR and AnSWeR respectively.
  Depending upon the question type other replacements are also made, e.g. any dates may be replaced by the tag DatE.
  Those sentences which contain both AnCHoR and AnSWeR are retained and joined together to create a single document.
  This generated document is then used to build a token-level suffix tree, from which repeated strings containing both AnCHoR and AnSWeR, and which do not span a sentence boundary, are extracted as patterns.
A simplified sketch of this extraction step is given below.
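The following Python sketch illustrates the repeated-string extraction. It is a simplified, hypothetical version of the step described above: rather than building a true token-level suffix tree it counts repeated token n-grams containing both AnCHoR and AnSWeR, and the function names and example sentences are ours.

```python
# Simplified sketch of stage-one pattern extraction. Assumes question and answer
# terms have already been replaced by AnCHoR / AnSWeR in each retained sentence.
# A real implementation uses a token-level suffix tree; counting repeated token
# n-grams directly gives a similar result for short patterns.
from collections import Counter

def extract_patterns(sentences, min_len=3, max_len=8, min_count=2):
    """Return repeated token strings that contain both AnCHoR and AnSWeR."""
    counts = Counter()
    for sentence in sentences:                  # patterns never span a sentence boundary
        tokens = sentence.split()
        for n in range(min_len, max_len + 1):
            for i in range(len(tokens) - n + 1):
                ngram = tokens[i:i + n]
                if "AnCHoR" in ngram and "AnSWeR" in ngram:
                    counts[" ".join(ngram)] += 1
    return [p for p, c in counts.items() if c >= min_count]

# Hypothetical sentences for "When was Mozart born?" after replacement:
sents = ["AnCHoR ( AnSWeR - DatE ) was a composer",
         "the composer AnCHoR ( AnSWeR - DatE )"]
print(extract_patterns(sents))
```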

Surface Text Patterns
The result of the first stage is a set of patterns. For questions of the form "When was X born?" these may include:
  AnCHoR ( AnSWeR – From
  AnCHoR ( AnSWeR – DatE )
  AnCHoR ( AnSWeR
Unfortunately some of these patterns may be specific to the question used to generate them, so the second stage of the approach is concerned with filtering out these specific patterns to produce a set which can be used to answer unseen questions.

Surface Text Patterns
The second stage of the approach requires a different set of question-answer pairs to those used in the first stage:
  Within each of the top ten documents returned by Google (retrieved using only the question term), the question term is replaced by AnCHoR, the answer (if it is present) by AnSWeR, and any other replacements made in the first stage are also carried out.
  Those sentences which contain AnCHoR are retained.
  All of the patterns from the first stage are converted to regular expressions designed to capture the token which appears in place of AnSWeR.
  Each regular expression is then matched against each sentence, and along with each pattern two counts are maintained: C_a, the total number of times the pattern has matched, and C_c, the number of times AnSWeR was selected as the answer.
  After a pattern has been matched against every sentence, if C_c is less than 5 the pattern is discarded; otherwise its precision is calculated as C_c / C_a and the pattern is retained only if the precision is greater than 0.1.
The sketch below illustrates this filtering step.
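A minimal sketch of the filtering step, assuming the patterns have already been converted to regular expressions with a single capture group in place of AnSWeR. The thresholds follow the slide (C_c >= 5, precision > 0.1); the function name is ours.

```python
# Sketch of stage-two pattern filtering: count total matches (C_a) and matches
# that capture the known answer (C_c), then keep sufficiently precise patterns.
import re

def filter_patterns(regex_patterns, sentences, answer, min_cc=5, min_precision=0.1):
    """Return (precision, pattern) pairs for patterns that survive filtering."""
    kept = []
    for pattern in regex_patterns:
        c_a = 0  # total number of times this pattern matched
        c_c = 0  # number of times the captured token was the known answer
        for sentence in sentences:              # sentences all contain AnCHoR
            for match in re.finditer(pattern, sentence):
                c_a += 1
                if match.group(1) == answer:
                    c_c += 1
        if c_c >= min_cc and c_a > 0 and (c_c / c_a) > min_precision:
            kept.append((c_c / c_a, pattern))
    return sorted(kept, reverse=True)
```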

Surface Text Patterns
The result of assigning precision to patterns in this way is a set of precisions and regular expressions such as:
  0.967: AnCHoR \( ([^ ]+) - DatE
  0.566: AnCHoR \( ([^ ]+)
  0.263: AnCHoR ([^ ]+) –
These patterns can then be used to answer unseen questions:
  The question term is submitted to Okapi and the top 20 returned documents have the question term replaced with AnCHoR; any other necessary replacements are also made.
  Those sentences which contain AnCHoR are extracted and combined to make a single document.
  Each pattern is then applied to each sentence to extract possible answers.
  All the answers found are sorted firstly on the precision of the pattern which selected them and secondly on the number of times the same answer was found.
A sketch of this answer-extraction step follows.
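The following sketch shows how the precision-tagged patterns could be applied to an unseen question. It assumes retrieval (Okapi) has already produced the AnCHoR-bearing sentences; the ranking key (pattern precision, then answer frequency) follows the slide, while the helper name is ours.

```python
# Sketch of answer extraction: apply each (precision, regex) pattern to every
# sentence, then rank candidate answers by best pattern precision and frequency.
import re
from collections import defaultdict

def extract_answers(patterns, sentences):
    best_precision = defaultdict(float)
    frequency = defaultdict(int)
    for precision, pattern in patterns:
        for sentence in sentences:              # sentences contain AnCHoR
            for match in re.finditer(pattern, sentence):
                answer = match.group(1)
                frequency[answer] += 1
                best_precision[answer] = max(best_precision[answer], precision)
    return sorted(frequency,
                  key=lambda a: (best_precision[a], frequency[a]),
                  reverse=True)
```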

Fallback to Semantic Entities
Q: "How high is Everest?"
  D1: "Everest's 29,035 feet is 5.4 miles above sea level…"
  D2: "At 29,035 feet the summit of Everest is the highest…"
If Q contains 'how' and 'high' then the expected semantic class, S, is measurement:distance.
Known entities and their counts in the Okapi-retrieved passages:
  measurement:distance('5.4 miles')  1
  measurement:distance('29,035 feet')  2
  location('Everest')  2
The entity of the expected class found most often, "29,035 feet", is selected as the answer.
A sketch of this fallback strategy follows.
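A hedged sketch of the fallback strategy: map the question to an expected semantic class and return the most frequent entity of that class found in the retrieved passages. The rule table and the entity input format below are illustrative, not the system's actual resources.

```python
# Sketch of the semantic-entity fallback used when no surface pattern fires.
from collections import Counter

EXPECTED_CLASS = {
    ("how", "high"): "measurement:distance",   # e.g. "How high is Everest?"
    ("when", "born"): "date",
}

def fallback_answer(question, entities):
    """entities: (semantic_class, text) pairs tagged in the retrieved passages."""
    q = question.lower()
    for cue_words, sem_class in EXPECTED_CLASS.items():
        if all(word in q for word in cue_words):
            counts = Counter(text for cls, text in entities if cls == sem_class)
            if counts:
                return counts.most_common(1)[0][0]
    return None

entities = [("measurement:distance", "5.4 miles"),
            ("measurement:distance", "29,035 feet"),
            ("measurement:distance", "29,035 feet"),
            ("location", "Everest"), ("location", "Everest")]
print(fallback_answer("How high is Everest?", entities))  # -> 29,035 feet
```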

Definition System
Definition questions such as "What is Goth?" contain very little information which can be used to retrieve relevant documents, as they have almost nothing in common with potential answers:
  "a subculture that started as one component of the punk rock scene"
  "horror/mystery literature that is dark, eerie, and gloomy"
Having extra knowledge about the definiendum is important:
  217 sentences in AQUAINT contain the term "Goth".
  If we know that "Goth" seems to be associated with "subculture" in definition passages then we can narrow the search space.
  Only 6 sentences in AQUAINT contain both "Goth" and "subculture", e.g. "the Goth subculture", "gloomy subculture known as Goth".

Definition System
To extract extra information about the definiendum we use a set of linguistic patterns which we instantiate with the definiendum, for example:
  "X is a"
  "such as X"
  "X consists of"
The patterns match many sentences, some of which are definition bearing and some of which are not:
  "Goth is a subculture"
  "Becoming a Goth is a process that demands lots of effort"
These patterns can be used to find terms which regularly appear along with the definiendum, outside of the target collection. A sketch of the pattern instantiation is given below.
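The sketch below shows how the linguistic patterns could be instantiated with a definiendum and matched against candidate sentences. The pattern templates mirror the examples on the slide; the helper names are ours.

```python
# Sketch of instantiating the definition patterns and testing a sentence.
PATTERN_TEMPLATES = ["{X} is a", "such as {X}", "{X} consists of"]

def instantiate_patterns(definiendum):
    return [t.format(X=definiendum) for t in PATTERN_TEMPLATES]

def matches_pattern(sentence, patterns):
    s = sentence.lower()
    return any(p.lower() in s for p in patterns)

goth_patterns = instantiate_patterns("Goth")
print(matches_pattern("Goth is a subculture that started in the punk scene",
                      goth_patterns))  # True
```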

Definition System Architecture (architecture diagram)

Knowledge Acquisition
We parse the question in order to extract the definiendum.
We then use the linguistic patterns ("Goth is a", "such as Goth"…) to find definition-bearing passages in:
  WordNet
  Britannica
  the Web
From these sources we extract words (nouns, adjectives, verbs) and their frequencies from definition-bearing sentences. A sentence is considered definition bearing as follows:
  WordNet: the gloss of the definiendum and any associated hypernyms.
  Britannica: only if the sentence contains the definiendum.
  Web: only if the sentence contains one of the linguistic patterns.
A sketch of the word-frequency extraction appears below.
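A sketch of the word-frequency extraction from definition-bearing sentences. The system keeps nouns, adjectives and verbs; the stop-word list below stands in for a real POS tagger and is an assumption of ours.

```python
# Sketch of counting content words in definition-bearing sentences.
from collections import Counter
import re

STOP_WORDS = {"the", "a", "an", "is", "are", "of", "in", "as", "that", "and", "to", "one"}

def content_word_frequencies(definition_sentences):
    counts = Counter()
    for sentence in definition_sentences:
        for word in re.findall(r"[a-zA-Z]+", sentence.lower()):
            if word not in STOP_WORDS:
                counts[word] += 1
    return counts

sents = ["Goth is a subculture that started as one component of the punk rock scene"]
print(content_word_frequencies(sents).most_common(3))
```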

Knowledge Acquisition
We retain all the words extracted from WordNet and all those words which occurred more than once. The words are sorted based on their frequency of occurrence.
A list of n secondary terms to be used for query expansion is formed (see the sketch below):
  All terms found in WordNet, m
  A maximum of (n – m) / 2 terms from Britannica
  The list is expanded to size n with terms found on the web
Example secondary terms per source:
  aspirin: WordNet: analgesic; anti-inflammatory; antipyretic; drug; …; Britannica: inhibit; prostaglandin; ketofren; synthesis; …; Web: drug; drugs; blood; ibuprofen; medication; pain; …
  Aum Shinrikyo: WordNet: *NOTHING*; Britannica: *NOTHING*; Web: group; groups; cult; religious; japanese; …
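A minimal sketch of assembling the n secondary terms, following the composition rule above: all m WordNet terms, at most (n - m) / 2 Britannica terms, and the remainder filled from the web. Inputs are assumed to be frequency-sorted term lists; the function name and n value are illustrative.

```python
# Sketch of building the secondary-term list used for query expansion.
def build_secondary_terms(wordnet_terms, britannica_terms, web_terms, n=20):
    terms = list(wordnet_terms)                 # all terms found in WordNet (m of them)
    m = len(terms)
    britannica_quota = max((n - m) // 2, 0)     # at most (n - m) / 2 from Britannica
    for term in britannica_terms[:britannica_quota]:
        if term not in terms:
            terms.append(term)
    for term in web_terms:                      # expand to size n with web terms
        if len(terms) >= n:
            break
        if term not in terms:
            terms.append(term)
    return terms[:n]

print(build_secondary_terms(["analgesic", "drug"],
                            ["inhibit", "prostaglandin"],
                            ["drug", "blood", "ibuprofen", "pain"], n=6))
```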

Locating Possible Definitions
An IR query consisting of all the words in the question as well as the acquired secondary terms is submitted to Okapi and the 20 most relevant passages are retrieved.
Sentences which pass one of the following tests are then extracted as definition candidates (see the sketch below):
  The sentence matches one of the linguistic patterns.
  The sentence contains the definiendum and at least 3 secondary terms.
To avoid the inclusion of unnecessary information we discard the sentence prefix which does not contain either the definiendum or any secondary terms.
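The sketch below illustrates the two candidate tests and the prefix trimming. The helper names are ours; `patterns` are the instantiated linguistic patterns and `secondary_terms` the acquired expansion terms.

```python
# Sketch of definition-candidate selection and prefix trimming.
def is_candidate(sentence, definiendum, patterns, secondary_terms, min_terms=3):
    s = sentence.lower()
    if any(p.lower() in s for p in patterns):            # test 1: matches a pattern
        return True
    hits = sum(1 for t in secondary_terms if t.lower() in s)
    return definiendum.lower() in s and hits >= min_terms  # test 2: definiendum + 3 terms

def trim_prefix(sentence, definiendum, secondary_terms):
    """Drop the leading words before the first definiendum or secondary term."""
    keys = {definiendum.lower()} | {t.lower() for t in secondary_terms}
    words = sentence.split()
    for i, word in enumerate(words):
        if word.strip('.,;:"').lower() in keys:
            return " ".join(words[i:])
    return sentence
```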

Locating Possible Definitions
Equivalent definitions are identified via the vector space model using the cosine similarity measure, and only one definition is retained. For example, the following two definitions are similar and only one would be retained by the system:
  "the Goth subculture"
  "gloomy subculture known as Goth"
A sketch of this redundancy check follows.
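A minimal sketch of the redundancy check: a candidate whose bag-of-words cosine similarity to an already retained definition exceeds a threshold is dropped. The 0.5 threshold is illustrative, not the system's actual setting.

```python
# Sketch of removing near-duplicate definitions with cosine similarity.
from collections import Counter
import math

def cosine(a, b):
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = (math.sqrt(sum(v * v for v in va.values())) *
            math.sqrt(sum(v * v for v in vb.values())))
    return dot / norm if norm else 0.0

def deduplicate(candidates, threshold=0.5):
    kept = []
    for cand in candidates:
        if all(cosine(cand, prev) < threshold for prev in kept):
            kept.append(cand)
    return kept

print(deduplicate(["the Goth subculture", "gloomy subculture known as Goth"]))
# -> ['the Goth subculture']
```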

Results and Evaluation
The system was independently evaluated as part of the TREC 2003 question answering evaluation. This consisted of answering 413 factoid questions, 37 list questions and 50 definition questions. For further details on the evaluation metrics used by NIST see (Voorhees, 2003).

Results & Evaluation: Factoid
Unfortunately only 12 of the 413 factoid questions were suitable to be answered by the pattern sets.
  Even worse is the fact that none of the patterns were able to select any answers, correct or otherwise.
The fallback system correctly identified the answer type for 241 of the 413 questions:
  53 were given an incorrect type.
  119 were outside the scope of the system.
Okapi only located relevant documents for 131 of the questions the system could answer, giving:
  a maximum attainable score of 131/413 (≈ 0.317);
  an official score of 57/413, which contained 15 correct NIL responses, so…
  the system answered 42 questions, giving a score of 0.102, 32% of the maximum score.

Results & Evaluation: List
Similar problems occurred when the system was used to answer list questions.
  Over the 37 questions only 20 distinct correct answers were returned, giving a low official F-score.
The ability of the system to locate a reasonable number of correct answers was offset by the many answers returned per question.
  There are seven known answers (in AQUAINT) to the question "What countries have won the men's World Cup for soccer?"
  This system returned 32 answers, only two of which were correct.
  This gives a recall of 0.286 but a precision of only 0.062.

Results & Evaluation: Definition
Definition systems are evaluated based on their ability to return information nuggets (snippets of text containing information that helps define the definiendum). Some of these nuggets are considered essential, i.e. a full definition must contain them.
Our system produced answers for 28 of the 50 questions, 23 of which contained at least one essential nugget.
The official score placed the system 9th out of the 25 participants.
The knowledge acquisition step provided relevant secondary terms for a number of questions:
  WordNet helped in 4 cases
  Britannica helped in 5 cases
  the Web helped in 39 cases

Conclusions
When using patterns for answering factoid and list questions, the surface text patterns should probably be acquired from a source with a writing style similar to the collection from which answers will be drawn.
  Here we used the web to acquire the patterns and used them to find answers in the AQUAINT collection, which have differing writing styles.
Using patterns to answer definition questions, while more successful than the factoid system, still has its problems:
  The filters used to determine if a passage is definition bearing are too restrictive.
Despite these failings, the use of patterns for answering factoid, list and definition questions shows promise.

Future Work
For the factoid and list QA system future work could include:
  acquiring a wider range of pattern sets to cover more question types;
  using the full question, not just the question term, for passage retrieval.
For the definition QA system future research could include:
  ranking the extracted secondary terms, perhaps using IDF values, to help eliminate inappropriate matches ("aspirin is a great choice for active people");
  implementing a syntactic technique that prunes parse trees to extract better definition strings;
  using coreference information in combination with the extraction patterns.

Any Questions? Copies of these slides can be found at:

Bibliography
Hamish Cunningham, Diana Maynard, Kalina Bontcheva and Valentin Tablan. GATE: A Framework and Graphical Development Environment for Robust NLP Tools and Applications. In Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics, 2002.
Mark A. Greenwood and Robert Gaizauskas. Using a Named Entity Tagger to Generalise Surface Matching Text Patterns for Question Answering. In Proceedings of the Workshop on Natural Language Processing for Question Answering (EACL03), pages 29–34, Budapest, Hungary, April 14, 2003.
Michael Marshall. The Straw Men. HarperCollins Publishers, 2002.
Ellen M. Voorhees. Overview of the TREC 2003 Question Answering Track. In Proceedings of the 12th Text REtrieval Conference, 2003.