Learning Surface Text Patterns for a Question Answering System Deepak Ravichandran Eduard Hovy Information Sciences Institute University of Southern California

From Proceedings of the ACL Conference, 2002

Goal Explore power of surface text patterns for open-domain QA systems

Why This Paper Fall 2001 NLP project - QA system

Winning Team Matt Myers & Henry Longmore –"If we were asked to design another question answering system, we would keep the same basic system as a foundation. We would then use more patterns and variations of patterns in the NE recognizer. We would use Machine Learning techniques, particularly for learning patterns for the NE recognizer."

Meanwhile, back at the batcave... Automatic learning of surface text patterns for open-domain question answering

Recent Open Domain Systems External knowledge, tools –Named Entity taggers –WordNet –parsers –hand-tagged corpora –ontology lists

Recent O-D Systems (cont.) Recent TREC-10 evaluation –winning system used just 1 resource –extensive list of surface patterns –surprised many

Basic Idea Investigate potential of surface patterns –Learn patterns –Measure accuracy

Characteristic Phrases "When was born” –Typical answers "Mozart was born in 1756.” "Gandhi ( )...” –Suggests phrases like " was born in ” " ( -” –as Regular Expressions can help locate correct answer

Auto-learn Patterns from Web Tagged corpus using AltaVista Hand-crafted examples of each question type Bootstrapping to build large tagged corpus as in Information Extraction (Riloff, 96) Abundance of data on web - reliable statistical estimates

The System Assume sentence is a simple sequence of words Search for repeated word orderings Evidence for useful answer phrases

System (cont.) Suffix trees to extract substrings of optimal length Suffix trees from Computational Biology (Gusfield, 97) Used to detect DNA sequences Linear time on size of corpus Don't restrict length of substrings
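
For illustration only (not the paper's implementation), a naive word-level substring counter that mimics what the suffix tree computes; the real suffix tree does this in linear time, while this sketch is quadratic per sentence:

    from collections import Counter

    def count_substrings(sentences):
        """Count, for each word-level substring, how many sentences
        contain it (naive stand-in for a generalized suffix tree)."""
        counts = Counter()
        for s in sentences:
            words = s.split()
            spans = set()  # count each substring once per sentence
            for i in range(len(words)):
                for j in range(i + 1, len(words) + 1):
                    spans.add(" ".join(words[i:j]))
            counts.update(spans)
        return counts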

Pattern Learning Algorithm Select example for question type –BIRTHYEAR questions select "Mozart 1756” "Mozart" is question term "1756" is answer term Submit Q & A terms to AltaVista Require both terms to be present

Pattern Learning (cont.) Download top 1000 documents returned Apply sentence breaker to documents Keep only those sentences with both terms present
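
A hedged sketch of this harvesting step; search_web() is a hypothetical stand-in for the AltaVista query, and the sentence breaker is deliberately crude:

    import re

    def harvest_sentences(q_term, a_term, search_web, top_n=1000):
        """Download top documents for the Q & A terms and keep only
        sentences mentioning both (sketch of this step)."""
        kept = []
        for doc in search_web(f'"{q_term}" "{a_term}"', top_n):
            for sent in re.split(r"(?<=[.!?])\s+", doc):
                if q_term in sent and a_term in sent:
                    kept.append(sent)
        return kept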

Pattern Learning (cont.) Terms can be present in various forms –e.g. Mozart as: Wolfgang Amadeus Mozart Mozart, Wolfgang Amadeus Amadeus Mozart Mozart

Pattern Learning (cont.) Specify ways in which Q term and A term can appear in text Easy to do for BIRTHDATE Not so for Q types like DEFINITION –Many acceptable answers, all answers need to be used to ensure high confidence in precision

Pattern Learning (cont.) Process (tokenize, smooth whitespace, remove tags, etc.) –simplify input for egrep (or other regular expression tool) Pass sentence through suffix tree constructor –finds substrings (and counts) of all lengths

Pattern Learning (cont.) Example: “The great composer Mozart (1756-1791) achieved fame at a young age” “Mozart (1756-1791) was a genius” “The whole world would always be indebted to the great music of Mozart (1756-1791)” –Longest matching substring for all 3 sentences is "Mozart (1756-1791)” –Suffix tree would extract "Mozart (1756-1791)" as an output, with score of 3

Pattern Learning (cont.) Filter phrases in suffix tree Keep phrases containing Q & A terms Replace question term with <NAME> Replace answer term with <ANSWER>

Pattern Learning (cont.) Repeat with different examples of same question type –“Gandhi 1869”, “Newton 1642”, etc. Some patterns learned for BIRTHDATE –a. born in <ANSWER>, <NAME> –b. <NAME> was born on <ANSWER>, –c. <NAME> ( <ANSWER> - –d. <NAME> ( <ANSWER> - )
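
A minimal sketch (mine, not the paper's code) of the tag-substitution step that turns a harvested substring into a reusable pattern:

    def to_pattern(phrase, q_term, a_term):
        """Replace the question and answer terms with <NAME> and
        <ANSWER>; the paper additionally handles term variants such
        as "Wolfgang Amadeus Mozart"."""
        return phrase.replace(q_term, "<NAME>").replace(a_term, "<ANSWER>")

    # to_pattern("Mozart ( 1756 -", "Mozart", "1756")  ->  "<NAME> ( <ANSWER> -"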

Pattern Learning (last one!) Strings partly overlapping (c & d) saved separately –Separate counts of occurrence frequencies –Can distinguish (in this case) between pattern for person still living (d) and more general pattern (c)

Calculate Precision Submit query to AltaVista using only Q term ("Mozart") Download top 1000 returned documents Segment into sentences as in pattern learning algorithm Keep sentences containing Q term

Calculate Precision (cont.) For each pattern learned, check presence of pattern in sentence –pattern with <ANSWER> tag matched by any word –pattern with <ANSWER> tag matched by correct A term "Mozart was born in <ANSWER>" "Mozart was born in 1756"

Calculate Precision (cont.) Calculate precision of each pattern P = Ca/Co –Ca = total # of pattern matches with <ANSWER> matched by the correct answer term –Co = total # of pattern matches with <ANSWER> matched by any word Keep only patterns matching sufficient # of examples (e.g. >5)
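
A sketch of the P = Ca/Co computation, under the simplifying assumption that <ANSWER> matches a single word (and Python 3.7+, where re.escape leaves "<" and ">" alone):

    import re

    def pattern_precision(pattern, sentences, q_term, correct_answers):
        """Ca = matches where <ANSWER> is a known correct answer,
        Co = matches where <ANSWER> is any word (illustrative only)."""
        regex = re.compile(
            re.escape(pattern)
            .replace("<NAME>", re.escape(q_term))
            .replace("<ANSWER>", r"(\w+)")
        )
        ca = co = 0
        for sent in sentences:
            for word in regex.findall(sent):
                co += 1
                if word in correct_answers:
                    ca += 1
        return ca / co if co else 0.0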

Calculate Precision (cont.) Obtain table of Regular Expression patterns, 1 table per question type –each pattern stored with its precision –precision is the probability that the pattern contains the correct answer –estimated by the principle of maximum likelihood

Calculate Precision (cont.) BIRTHDATE table:
1.0 <NAME> ( <ANSWER> - )
0.85 <NAME> was born on <ANSWER>,
0.6 <NAME> was born in <ANSWER>
0.59 <NAME> was born <ANSWER>
0.53 <ANSWER> <NAME> was born
0.50 - <NAME> ( <ANSWER>
0.36 <NAME> ( <ANSWER> -

Calculate Precision (cont.) Good range of patterns obtained with as few as 10 examples Rather long list difficult to come up with manually Largest number of examples the system required to get a good range of patterns?

Calculate Precision (cont.) Precision of patterns learned from one QA pair calculated for other examples of same question type Helps eliminate dubious patterns –Contents of two or more sites are the same –Same document appears in search engine output for learning & precision stages

Finding Answers To new questions! Use existing QA system (Hovy et al., 2002b;2001) Determine type of new question Identify Question term

Finding Answers (cont.) Create query from Q term & do IR –use answer document corpus such as TREC-10 or web search Segment returned documents into sentences & process as before Replace Q term by Q tag –e.g. <NAME> in case of BIRTHYEAR type

Finding Answers (cont.) Using pattern table developed for Q type, search for presence of each pattern Select words matching <ANSWER> as potential answer Sort answers by pattern's precision scores Discard duplicate answers (string compare) Return top 5
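
Putting the pieces together, a hedged sketch of the answer-extraction loop; pattern_table entries are (precision, tagged pattern) pairs like the tables shown below:

    import re

    def find_answers(q_term, sentences, pattern_table, top_k=5):
        """Apply each tagged pattern, collect <ANSWER> matches, sort
        by the pattern's precision, drop duplicates, return top 5."""
        scored = []
        for precision, pattern in pattern_table:
            regex = re.compile(
                re.escape(pattern)
                .replace("<NAME>", re.escape(q_term))
                .replace("<ANSWER>", r"(\w+)")
            )
            for sent in sentences:
                for ans in regex.findall(sent):
                    scored.append((precision, ans))
        scored.sort(key=lambda pa: -pa[0])
        answers, seen = [], set()
        for _, ans in scored:
            if ans not in seen:
                seen.add(ans)
                answers.append(ans)
        return answers[:top_k]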

Experiments 6 different Q types –from Webclopedia QA Typology (Hovy et al., 2002a) BIRTHDATE LOCATION INVENTOR DISCOVERER DEFINITION WHY-FAMOUS

Experiments (cont.) (BIRTHYEAR - previously shown) INVENTOR
1.0 <ANSWER> invents <NAME>
1.0 the <NAME> was invented by <ANSWER>
1.0 <ANSWER> invented the <NAME> in
–all have precision of 1.0

Experiments (cont.) DISCOVERER
1.0 when <ANSWER> discovered <NAME>
1.0 <ANSWER>'s discovery of <NAME>
0.9 <NAME> was discovered by <ANSWER> in
DEFINITION
1.0 <NAME> and related <ANSWER>s
1.0 form of <ANSWER>, <NAME>
0.94 as <NAME>, <ANSWER> and

Experiments (cont.) WHY-FAMOUS
1.0 <ANSWER> <NAME> called
1.0 laureate <ANSWER> <NAME>
0.71 <NAME> is the <ANSWER> of
LOCATION
1.0 <ANSWER>'s <NAME>
1.0 regional : <ANSWER> : <NAME>
0.92 near <NAME> in <ANSWER>

Experiments (cont.) For each Q type, extract questions from TREC-10 set Run through testing phase (precision) Two sets of experiments

Experiments (cont.) Set one –TREC corpus is input –IR done by IR component of their QA system (Lin, 2002) Set two –Web is input –IR performed by AltaVista

Results Measured by Mean Reciprocal Rank (MRR) TREC corpus:
Question type   # of Q's   MRR
BIRTHYEAR       8          0.48
INVENTOR        6          0.17
DISCOVERER      4          0.13
DEFINITION      102        0.34
WHY-FAMOUS      3          0.33
LOCATION        16         0.75
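
For reference, a short sketch of the MRR metric: each question scores the reciprocal rank of its first correct answer (0 if none is returned), averaged over all questions:

    def mean_reciprocal_rank(ranked_lists, gold_sets):
        """ranked_lists[i] is the system's ranked answers for question i;
        gold_sets[i] is the set of acceptable answers for that question."""
        total = 0.0
        for answers, gold in zip(ranked_lists, gold_sets):
            for rank, ans in enumerate(answers, start=1):
                if ans in gold:
                    total += 1.0 / rank
                    break
        return total / len(ranked_lists)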

Results (cont.) Web:
Question type   # of Q's   MRR
BIRTHYEAR       8          0.69
INVENTOR        6          0.58
DISCOVERER      4          0.88
DEFINITION      102        0.39
WHY-FAMOUS      3          0.00
LOCATION        16         0.86

Results (cont.) System performs better on web data than on TREC corpus Abundant web data makes it easier for system to locate answers with high precision scores TREC corpus does not have enough candidate answers with high precision score –must settle for answers from low precision patterns WHY-FAMOUS exception - may be due to small # of test Q's

Shortcomings & Extensions Need for POS &/or semantic types "Where are the Rocky Mountains?” "Denver's new airport, topped with white fiberglass cones in imitation of the Rocky Mountains in the background, continues to lie empty” in NE tagger &/or ontology could enable system to determine "background" is not a location

Shortcomings... (cont.) DEFINITION Q's - matched term too general, though technically correct "What is nepotism?” –pattern "form of <ANSWER>, <NAME>" matches "...in the form of widespread bureaucratic abuses: graft, nepotism...” "What is sonar?” –pattern "<NAME> and related <ANSWER>s" matches "...while its sonar and related underseas systems are built...”

Shortcomings... (cont.) Long distance dependencies "Where is London?” "London, which has one of the most busiest airports in the world, lies on the banks of the river Thames” would require pattern like: <QUESTION>, (<any_word>)*, lies on <ANSWER> –Abundance & variety of Web data helps system to find an instance of patterns w/o losing answers to long distance dependencies
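
Not from the paper: one way such a long-distance pattern could be approximated as a regex, with a bounded non-greedy word gap standing in for (<any_word>)*:

    import re

    # Hypothetical rendering for the question term "London";
    # (?:\s+\S+){0,15}? is the bounded (<any_word>)* gap.
    pattern = re.compile(r"London,(?:\s+\S+){0,15}? lies on (.+)")
    sent = ("London, which has one of the most busiest airports in the "
            "world, lies on the banks of the river Thames")
    m = pattern.search(sent)
    if m:
        print(m.group(1))  # -> "the banks of the river Thames"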

Shortcomings... (cont.) More info in patterns regarding length of expected answer phrase –Searches within a range of 50 bytes of answer phrase to capture pattern –fails under some conditions "When was Lyndon B. Johnson born?” "...lost to democratic Sen. Lyndon B. Johnson, who ran for both re-election and the vice presidency”

Shortcomings... (cont.) Lacks info that <ANSWER> in this case should be replaced by exactly 1 word Could extend system to search for answer in range of 1-2 chunks –basic English phrases, NP, VP, PP, etc.

Shortcomings... (cont.) System doesn't work for Q types requiring multiple words from question to be in answer "In which county does the city of Long Beach lie?” "Long Beach is situated in Los Angeles County” required pattern: <NAME> situated in <ANSWER>

Shortcomings... (cont.) Performance of system depends greatly on having only 1 anchor word Multiple anchor points –would help eliminate candidate answers –require all anchor words be present in candidate answer sentence

Shortcomings... (cont.) Does not use case "What is micron?” "...a spokesman for Micron, a maker of semiconductors, said SIMMs are..." If Micron had been capitalized in question, would be a perfect answer

Shortcomings... (cont.) Canonicalization of words BIRTHDATE for Gandhi: 1869; Oct. 2, 1869; 2nd October 1869; October 2, 1869; 02 October 1869; etc. –Use date tagger to cluster all variations and tag with same term –Extend idea to smooth out variations in Q term for names: Gandhi, Mahatma Gandhi, Mohandas Karamchand Gandhi, etc.
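
A rough sketch of the clustering idea for dates, assuming a handful of common formats (a real date tagger covers far more):

    import re

    MONTHS = {"jan": 1, "feb": 2, "mar": 3, "apr": 4, "may": 5, "jun": 6,
              "jul": 7, "aug": 8, "sep": 9, "oct": 10, "nov": 11, "dec": 12}

    def canonicalize_date(text):
        """Map variants like "Oct. 2, 1869", "2nd October 1869" and
        "02 October 1869" to one (year, month, day) key."""
        # Month-first: "Oct. 2, 1869"
        m = re.search(r"([A-Za-z]+)\.?\s+(\d{1,2})(?:st|nd|rd|th)?,?\s+(\d{4})", text)
        if m and m.group(1).lower()[:3] in MONTHS:
            return (int(m.group(3)), MONTHS[m.group(1).lower()[:3]], int(m.group(2)))
        # Day-first: "2nd October 1869", "02 October 1869"
        m = re.search(r"(\d{1,2})(?:st|nd|rd|th)?\s+([A-Za-z]+)\.?,?\s+(\d{4})", text)
        if m and m.group(2).lower()[:3] in MONTHS:
            return (int(m.group(3)), MONTHS[m.group(2).lower()[:3]], int(m.group(1)))
        return None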

Conclusion Web results easily outperform TREC results Suggests need to integrate outputs from Web & TREC Word count to help eliminate unlikely answers –helped (+) for BIRTHDATE, LOCATION –unclear (?) for DEFINITION

Conclusion (cont.) But what about DEFINITION? 102 Q's in TREC Corpus and in Web –the most Q's of any type MRR-TREC == 0.34, MRR-Web == 0.39 All other Q types have # < 20, most < 10 If enough Q's are asked, will the difference in performance on Web data vs. TREC data diminish?

Conclusion (cont.) Simplicity - "perfect" for multilingual QA systems –Low resource requirement - no NE taggers, no parsers, no ontologies, etc. –No adaptation of these to a new language required –Only need to create manual training terms & use appropriate web search engine

Regular Expressions from ask_iggy
“place called\\s+($cap_pattern+)”
“home called\\s+($cap_pattern+)”
“at\\s((the)?\\s+($cap_pattern+))”
“to\\s+($cap_pattern+)”
“place\\s+in\\s+($cap_pattern+)called\\s+($cap_pattern+)”
“in\\s+($cap_pattern+)”
“up\\s+($cap_pattern+)”
“left\\s+($cap_pattern+)”
“(($cap_pattern+)[Ii]slands)”
“(northern|southern|eastern|western)\\s+($cap_pattern+)”
“from\\s+($cap_pattern+)”
“far\\s+as\\s+($cap_pattern+)”
“place\\s+in\\s+($cap_pattern+)”
“home\\s+town”
“city\\s+of\\s+($cap_pattern+)”
“middle\\s+of\\s+((the)?\\s+($cap_pattern+))”
“(($cap_pattern+)[Ii]slands\\s+of\\s+($cap_pattern+))”
“place{1,1}d?\\s+near\\s+($cap_pattern+)”
“((above|over)\\s+($cap_pattern+))”