Question Answering Question Answering Available from: Mark A. Greenwood MEng
Overview What is Question Answering? Approaching Question Answering. A brief history of Question Answering. Question Answering at the Text REtrieval Conferences (TREC). Top performing systems. Progress to date. The direction of future work.
What is Question Answering?
The main aim of QA is to present the user with a short answer to a question rather than a list of possibly relevant documents. As it becomes more and more difficult to find answers on the WWW using standard search engines, question answering technology will become increasingly important. Answering questions using the web is already enough of a problem for it to appear in fiction (Marshall, 2002): “I like the Internet. Really, I do. Any time I need a piece of shareware or I want to find out the weather in Bogota… I’m the first guy to get the modem humming. But as a source of information, it sucks. You got a billion pieces of data, struggling to be heard and seen and downloaded, and anything I want to know seems to get trampled underfoot in the crowd.”
Approaching Question Answering
Question answering can be approached from one of two existing NLP research areas: Information Retrieval: QA can be viewed as short passage retrieval. Information Extraction: QA can be viewed as open-domain information extraction. Question answering can also be approached from the perspective of machine learning (see Soubbotin, 2001).
A Brief History of Question Answering
A Brief History Of… Question answering is not a new research area: Simmons (1965) reviews no fewer than fifteen English-language QA systems. Question answering systems can be found in many areas of NLP research, including: natural language database systems, dialog systems, reading comprehension systems, and open domain question answering.
Natural Language Database Systems These systems work by analysing the question to produce a database query. For example, “List the authors who have written books about business” would generate the following database query (using Microsoft English Query): SELECT firstname, lastname FROM authors, titleauthor, titles WHERE authors.id = titleauthor.authors_id AND titleauthor.title_id = titles.id These are some of the oldest examples of question answering systems. Early systems such as BASEBALL and LUNAR were sophisticated, even by modern standards (see Green et al., 1961 and Woods, 1973).
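To make the question-to-query step concrete, here is a minimal sketch of the idea, not how Microsoft English Query works internally. The single regular-expression template, the demo rows, and the `titles.subject` column used to filter on the topic are all assumptions made for illustration; only the three-table join comes from the slide.

```python
import re
import sqlite3

# Hypothetical question template; real natural language database systems
# use much richer grammars than one regular expression.
TEMPLATE = re.compile(
    r"list the authors who have written books about (\w+)", re.IGNORECASE
)


def question_to_sql(question):
    """Return (sql, params) for a recognised question, or None."""
    match = TEMPLATE.search(question)
    if match is None:
        return None
    sql = (
        "SELECT firstname, lastname "
        "FROM authors, titleauthor, titles "
        "WHERE authors.id = titleauthor.authors_id "
        "AND titleauthor.title_id = titles.id "
        "AND titles.subject = ?"  # assumed column, not part of the slide's query
    )
    return sql, (match.group(1).lower(),)


def build_demo_db():
    """Create a tiny in-memory database with invented rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE authors (id INTEGER PRIMARY KEY, firstname TEXT, lastname TEXT);
        CREATE TABLE titles (id INTEGER PRIMARY KEY, title TEXT, subject TEXT);
        CREATE TABLE titleauthor (authors_id INTEGER, title_id INTEGER);
        INSERT INTO authors VALUES (1, 'Ann', 'Example'), (2, 'Bob', 'Sample');
        INSERT INTO titles VALUES (10, 'Running a Shop', 'business'),
                                  (20, 'Garden Birds', 'nature');
        INSERT INTO titleauthor VALUES (1, 10), (2, 20);
        """
    )
    return conn


def answer(conn, question):
    """Translate the question and run it; unrecognised questions return []."""
    translated = question_to_sql(question)
    if translated is None:
        return []
    sql, params = translated
    return conn.execute(sql, params).fetchall()
```

Calling `answer(build_demo_db(), "List the authors who have written books about business")` runs the generated query against the demo tables; any question the template does not recognise simply returns an empty list.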
Dialog Systems By definition, dialog systems have to include the ability to answer questions, if for no other reason than to confirm user input. Systems such as SHRDLU were limited to working in a small domain (Winograd, 1972) and still had no real understanding of what they were discussing. This is still an active research area (including work in our own research group).
Reading Comprehension Systems Reading comprehension tests are frequently used to test the reading level of school children. Researchers realised that these tests could be used to test the language understanding of computer systems. One of the earliest systems designed to answer reading comprehension tests was QUALM (see Lehnert, 1977)
Reading Comprehension Systems How Maple Syrup is Made Maple syrup comes from sugar maple trees. At one time, maple syrup was used to make sugar. This is why the tree is called a "sugar" maple tree. Sugar maple trees make sap. Farmers collect the sap. The best time to collect sap is in February and March. The nights must be cold and the days warm. The farmer drills a few small holes in each tree. He puts a spout in each hole. Then he hangs a bucket on the end of each spout. The bucket has a cover to keep rain and snow out. The sap drips into the bucket. About 10 gallons of sap come from each hole. Who collects maple sap? (Farmers) What does the farmer hang from a spout? (A bucket) When is sap collected? (February and March) Where does the maple sap come from? (Sugar maple trees) Why is the bucket covered? (to keep rain and snow out)
Reading Comprehension Systems Modern systems such as Quarc and Deep Read (see Riloff and Thelen, 2000 and Hirschman et al., 1999) claim results of between 30% and 40% on these tests. These systems, however, only select the sentence which best answers the question rather than just the answer. These results are very respectable when you consider that each question is answered from a small piece of text, in which the answer is likely to occur only once. Both of these systems use a set of pattern matching rules augmented with one or more natural language techniques.
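Sentence selection of this kind can be sketched with a plain word-overlap score. This is only an illustration of the general idea, not the rule sets used by Quarc or Deep Read; the stopword list and the one-line plural stripper are assumptions chosen to keep the example short.

```python
import re

# Small illustrative stopword list (an assumption, not taken from either system).
STOPWORDS = {
    "a", "an", "the", "is", "are", "was", "does", "do", "from", "in", "of",
    "to", "on", "what", "who", "when", "where", "why", "he", "each", "then",
}


def stem(word):
    """Very crude stemmer: strip a plural or third-person 's'."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def content_stems(text):
    """Lower-case the text, drop stopwords and return the set of stems."""
    words = re.findall(r"[a-z]+", text.lower())
    return {stem(w) for w in words if w not in STOPWORDS}


def split_sentences(passage):
    """Split on whitespace that follows sentence-final punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", passage) if s.strip()]


def best_sentence(passage, question):
    """Return the sentence sharing the most content stems with the question."""
    question_stems = content_stems(question)
    return max(
        split_sentences(passage),
        key=lambda sentence: len(question_stems & content_stems(sentence)),
    )


PASSAGE = (
    "Farmers collect the sap. "
    "The farmer drills a few small holes in each tree. "
    "He puts a spout in each hole. "
    "Then he hangs a bucket on the end of each spout. "
    "The bucket has a cover to keep rain and snow out. "
    "The sap drips into the bucket."
)

print(best_sentence(PASSAGE, "What does the farmer hang from a spout?"))
```

Ties are broken by sentence order, which is one reason a bare overlap score struggles on questions whose words appear in several sentences.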
Open Domain Question Answering In open domain question answering there are no restrictions on the scope of the questions which a user can ask. For this reason most open domain systems use large text collections from which they attempt to extract a relevant answer. In recent years the World Wide Web has become a popular choice of text collection for these systems, although using such a large collection can have its own problems.
Question Answering at the Text REtrieval Conferences (TREC)
Question Answering at TREC Question answering at TREC consists of answering a set of 500 fact-based questions, such as: “When was Mozart born?”. For the first three years systems were allowed to return 5 ranked answers to each question. From 2002 the systems are only allowed to return a single exact answer, and the notion of confidence has been introduced.
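Ranked answers of this kind are usually scored with mean reciprocal rank (MRR), the measure that also appears in the results table later in this talk. The sketch below assumes exact string matching against a set of acceptable answers, which is a simplification; the function names are illustrative.

```python
def reciprocal_rank(ranked_answers, acceptable, max_answers=5):
    """Return 1/rank of the first acceptable answer in the top max_answers, else 0."""
    for rank, answer in enumerate(ranked_answers[:max_answers], start=1):
        if answer in acceptable:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """runs is a list of (ranked_answers, acceptable_answers) pairs, one per question."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(ranked, gold) for ranked, gold in runs) / len(runs)


runs = [
    (["1791", "1756"], {"1756"}),  # correct answer at rank 2 scores 0.5
    (["Paris"], {"London"}),       # no correct answer scores 0.0
]
print(mean_reciprocal_rank(runs))
```

A system that always puts the correct answer first scores 1.0, and a question with no correct answer in the top five contributes nothing.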
The TREC Document Collection The current collection uses news articles from the following sources: AP newswire, New York Times newswire, Xinhua News Agency newswire. In total there are 1,033,461 documents in the collection. Clearly this is too much text to process using advanced NLP techniques, so the systems usually consist of an initial information retrieval phase followed by more advanced processing.
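The two-phase arrangement can be sketched as a cheap filter followed by an expensive step that only sees the survivors. The overlap-based ranking below stands in for a real IR engine, and `deep_process` is a hypothetical placeholder for whatever advanced processing a system applies.

```python
import re


def tokens(text):
    """Lower-cased word set used by the cheap retrieval phase."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question, documents, k=2):
    """Phase 1: rank documents by word overlap with the question, keep the top k."""
    question_tokens = tokens(question)
    ranked = sorted(
        documents.items(),
        key=lambda item: len(question_tokens & tokens(item[1])),
        reverse=True,
    )
    return [doc_id for doc_id, _ in ranked[:k]]


def answer_question(question, documents, deep_process, k=2):
    """Phase 2: run the expensive step only on the k retrieved documents."""
    return [deep_process(question, documents[doc_id])
            for doc_id in retrieve(question, documents, k)]


docs = {
    "d1": "Mozart was born in Salzburg in 1756.",
    "d2": "The stock market fell sharply.",
    "d3": "Mozart composed many operas.",
}
print(retrieve("When was Mozart born?", docs))
```

With a million documents the point of the first phase is simply that `deep_process` is called k times rather than once per document.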
The Performance of TREC Systems The main task has been made more difficult each year: Each year the questions used have been selected to better reflect the real world. The questions are no longer guaranteed to have a correct answer within the collection. Only one exact answer is allowed instead of five ranked answers. Even though the task has become harder year-on-year, the systems have also been improved by the competing research groups. Hence, the best and average systems perform roughly the same each year.
Top Performing Systems
For the first few years of the TREC evaluations the best performing systems were those using a vast array of NLP techniques (see Harabagiu et al., 2000). Currently the best performing systems at TREC can answer approximately 70% of the questions (see Soubbotin, 2001). These systems are relatively simple: They make use of a large collection of surface matching patterns. They do not make use of NLP techniques such as syntactic and semantic parsing.
Top Performing Systems These systems use a large collection of questions and corresponding answers along with a text collection (usually the web) to generate a large number of surface matching patterns. For example, questions such as “When was Mozart born?” generate a list of patterns similar to: ( - ) was born on, born in, was born These patterns are then used to answer unseen questions of the same type with a high degree of accuracy.
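Applying surface patterns to text can be sketched with regular expressions. The two templates below are hypothetical stand-ins written for this example, not the patterns any TREC system learned: `{name}` is filled with the subject of the question and the single capture group marks where the answer is expected.

```python
import re

# Hypothetical birth-date templates. Doubled braces survive str.format
# and become ordinary regex quantifiers such as \d{4}.
BIRTH_PATTERNS = [
    r"{name} \((\d{{4}}) ?- ?\d{{4}}\)",      # e.g. "Mozart (1756 - 1791)"
    r"{name} was born (?:on|in) ([^,.]+)",    # e.g. "Mozart was born on ..."
]


def find_answers(name, text, templates=BIRTH_PATTERNS):
    """Instantiate each template for the given name and collect captured answers."""
    answers = []
    for template in templates:
        pattern = template.format(name=re.escape(name))
        answers.extend(m.group(1).strip() for m in re.finditer(pattern, text))
    return answers


text = "Mozart (1756 - 1791) was a composer. Mozart was born on 27 January 1756."
print(find_answers("Mozart", text))
```

Because the templates only look at surface strings, the same list can be reused for any person's name without parsing the sentence.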
Progress to Date
Most of the work undertaken this year has been to improve the existing QA system for entry to TREC. This has included developing a few ideas, of which the following were beneficial: Combining Semantically Similar Answers. Increasing the ontology by incorporating WordNet. Boosting Performance Through Answer Redundancy.
Combining Semantically Similar Answers Often the system will propose two semantically similar answers. These can be grouped into two categories: 1. The answer strings are identical. 2. The answers are similar but the strings are not identical, e.g. Australia and Western Australia. The first group is easy to combine, as a simple string comparison will show they are the same. The second group is harder to deal with, and the approach taken is similar to that used in Brill et al. (2001).
Combining Semantically Similar Answers The test to see if two answers, A and B, are similar is: A and B are similar if the stem of every non-stopword in A matches the stem of a non-stopword in B, or vice versa. As well as allowing multiple similar answers to be combined, a useful side-effect is the expansion and clarification of some answer strings, for example: Armstrong becomes Neil A. Armstrong. Davis becomes Eric Davis. This method is not 100% accurate, as two answers which appear similar, based on the above test, may in fact be different when viewed against the question.
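The similarity test can be sketched in a few lines. The stopword list and the suffix-stripping `stem` function are assumptions standing in for whatever stopword list and stemmer the system actually uses, which the talk does not name.

```python
import re

# Illustrative stopword list (an assumption).
STOPWORDS = {"a", "an", "the", "of", "and", "in"}


def stem(word):
    """Crude suffix stripper used in place of a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def stems(answer):
    """Set of stems of the non-stopwords in an answer string."""
    words = re.findall(r"[a-z0-9]+", answer.lower())
    return {stem(w) for w in words if w not in STOPWORDS}


def covered_by(a, b):
    """True if every non-stopword stem in a also occurs in b."""
    stems_a, stems_b = stems(a), stems(b)
    return bool(stems_a) and stems_a <= stems_b


def similar(a, b):
    """The test from the slide: a is covered by b, or vice versa."""
    return covered_by(a, b) or covered_by(b, a)


print(similar("Armstrong", "Neil A. Armstrong"))  # the shorter answer is covered
print(similar("Eric Davis", "Miles Davis"))       # neither covers the other
```

When two answers pass the test, keeping the longer string gives the expansion effect described above, such as Armstrong becoming Neil A. Armstrong.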
Incorporating WordNet The semantic similarity between a possible answer and the question variable (i.e. what we are looking for) was computed as the reciprocal of the distance between the two corresponding entities in the ontology. Because the ontology is relatively small, there is often no path between two entities, and so they are not deemed to be similar (e.g. neither house nor abode is in the ontology, but they are clearly similar in meaning). The solution to this was to use WordNet (Miller, 1995) and specifically the Leacock and Chodorow semantic similarity measure (1998).
Incorporating WordNet This measure uses the hypernym (… is a kind of …) relationships in WordNet to construct a path between two entities. For example these hypernym relationships for fish and food are present in WordNet
Incorporating WordNet We then work out all the paths between fish and food using the generated hypernym trees. It turns out there are three distinct paths. The shortest path is between the first definition of fish and the definition of food, as shown. The Leacock-Chodorow similarity is calculated as $\mathrm{sim}(a, b) = -\log\left(\frac{\mathrm{len}(a, b)}{2D}\right)$, where $\mathrm{len}(a, b)$ is the length of the shortest path between the two entities and $D$ is the maximum depth of the hierarchy. However, to match our existing measure we use just the reciprocal of the distance, i.e. 1/3.
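The path-based measures can be sketched over a toy hierarchy. The "is a kind of" links below are invented for illustration and are not WordNet's real entries; path length is counted in nodes, and `max_depth` is a parameter the caller must supply, since the depth used in the talk is not stated.

```python
import math
from collections import deque

# Toy hypernym links (child -> parents), invented for this example.
HYPERNYMS = {
    "trout": ["fish"],
    "fish": ["seafood"],
    "seafood": ["food"],
    "bread": ["food"],
    "food": ["substance"],
}


def path_length(a, b, hypernyms=HYPERNYMS):
    """Number of nodes on the shortest path between a and b, or None if unconnected."""
    neighbours = {}
    for child, parents in hypernyms.items():
        for parent in parents:
            neighbours.setdefault(child, set()).add(parent)
            neighbours.setdefault(parent, set()).add(child)
    queue = deque([(a, 1)])
    seen = {a}
    while queue:  # breadth-first search finds the shortest path first
        node, length = queue.popleft()
        if node == b:
            return length
        for nxt in neighbours.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, length + 1))
    return None


def reciprocal_similarity(a, b):
    """Reciprocal of the path length; 0.0 when there is no path."""
    length = path_length(a, b)
    return 0.0 if length is None else 1.0 / length


def leacock_chodorow(a, b, max_depth):
    """-log(len / (2 * D)) with D supplied by the caller; None when there is no path."""
    length = path_length(a, b)
    return None if length is None else -math.log(length / (2.0 * max_depth))


print(reciprocal_similarity("fish", "food"))
```

Entities with no connecting path, such as a word missing from the hierarchy, score 0.0 under the reciprocal measure, which is the failure case that motivated moving to a larger resource.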
Using Answer Redundancy Numerous groups have reported that the more instances of an answer there are, the more likely it is that a system will find the answer (see Light et al., 2001). There are two ways of boosting the number of answer instances: 1. Use more documents (from the same source). 2. Use documents from more than one source (usually the web).
Using Answer Redundancy Our approach was to use documents from both the TREC collection and the WWW, using Google as the IR engine, an approach also taken by Brill et al. (2001), although their system differs from ours in the way it makes use of the extra document collection. We used only the snippets returned by Google for the top ten documents, not the full documents themselves. The system produces two lists of possible answers, one for each collection.
Using Answer Redundancy The two answer lists are combined, making sure that each answer still references a document in the TREC collection. So if an answer appears only in the list from Google then it is discarded, as there is no TREC document to link it to. The improvement from this approach is small but worth the extra effort.
Collection    MRR    Not Found (%)
TREC                 (68%)
Google               (68%)
Combined             (65%)
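The combination step can be sketched as follows. The talk does not say how the scores from the two lists are merged, so adding them is an assumption, as are the data shapes, the exact-string matching of answers, and the document identifiers in the example.

```python
def combine_answers(trec_answers, web_answers):
    """Merge two answer lists, keeping only answers backed by a TREC document.

    trec_answers maps answer -> (score, trec_doc_id).
    web_answers maps answer -> score (from web snippets).
    Returns (answer, (combined_score, trec_doc_id)) pairs, best first.
    """
    combined = {}
    for answer, (score, doc_id) in trec_answers.items():
        # Assumption: web evidence simply adds to the TREC score.
        combined[answer] = (score + web_answers.get(answer, 0.0), doc_id)
    # Answers found only on the web are never added, so they are discarded.
    return sorted(combined.items(), key=lambda item: item[1][0], reverse=True)


trec = {"1791": (0.5, "doc-A"), "1756": (0.25, "doc-B")}
web = {"1756": 0.5, "Salzburg": 0.9}
for answer, (score, doc_id) in combine_answers(trec, web):
    print(answer, score, doc_id)
```

In this example the web evidence lifts 1756 above 1791, while Salzburg is dropped because no TREC document supports it.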
Question Answering over the Web Lots of groups have tried to do this (see Bucholz and Daelemans, 2001 and Kwok et al., 2001). Some use full web pages; others use just the snippets of text returned by a web search engine (usually Google). Our implementation, known as AskGoogle, uses just the top ten snippets of text returned by Google. The method of question answering is then the same as in the normal QA system.
As previously mentioned, quite a few groups have had great success with very simple pattern matching systems. These systems currently use no advanced NLP techniques. The intention is to implement a simple pattern matching system, and then augment it with NLP techniques such as: Named Entity Tagging, Anaphora Resolution, Syntactic and Semantic Parsing, …
Any Questions? Thank you for listening.
Bibliography
E. Brill, J. Lin, M. Banko, S. Dumais and A. Ng. Data-Intensive Question Answering. Proceedings of the Tenth Text REtrieval Conference (TREC 2001).
S. Bucholz and W. Daelemans. Complex Answers: A Case Study using a WWW Question Answering System. Journal of Natural Language Engineering, Vol. 7, No. 4 (2001).
B. F. Green, A. K. Wolf, C. Chomsky and K. Laughery. BASEBALL: An Automatic Question Answerer. In Proceedings of the Western Joint Computer Conference 19 (1961).
S. Harabagiu, D. Moldovan, M. Paşca, R. Mihalcea, M. Surdeanu, R. Bunescu, R. Gîrju, V. Rus and P. Morărescu. FALCON: Boosting Knowledge for Answer Engines. The Ninth Text REtrieval Conference (TREC 9) (2000).
L. Hirschman, M. Light, E. Breck and J. Burger. Deep Read: A Reading Comprehension System. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics (1999).
C. Leacock and M. Chodorow. Combining Local Context and WordNet Similarity for Word Sense Identification. In C. Fellbaum, editor, WordNet: An Electronic Lexical Database, chapter 11. MIT Press (1998).
W. Lehnert. A Conceptual Theory of Question Answering. Proceedings of the Fifth International Joint Conference on Artificial Intelligence (1977).
D. Lin and P. Pantel. Discovery of Inference Rules for Question Answering. Journal of Natural Language Engineering, Vol. 7, No. 4 (2001).
C. Kwok, O. Etzioni and D. Weld. Scaling Question Answering to the Web. ACM Transactions on Information Systems, Vol. 19, No. 3, July (2001).
M. Light, G. Mann, E. Riloff and E. Breck. Analyses for Elucidating Current Question Answering Technology. Journal of Natural Language Engineering, Vol. 7, No. 4 (2001).
M. Marshall. The Straw Men. HarperCollins Publishers (2002).
G. A. Miller. WordNet: A Lexical Database. Communications of the ACM, Vol. 38, No. 11, pages 39-41, November (1995).
E. Riloff and M. Thelen. A Rule-based Question Answering System for Reading Comprehension Tests. ANLP/NAACL-2000 Workshop on Reading Comprehension Tests as Evaluation for Computer-Based Language Understanding Systems (2000).
R. F. Simmons. Answering English Questions by Computer: A Survey. Communications of the ACM, 8(1):53-70 (1965).
M. M. Soubbotin. Patterns of Potential Answer Expressions as Clues to the Right Answers. Proceedings of the Tenth Text REtrieval Conference (TREC 2001).
J. Weizenbaum. ELIZA – A Computer Program for the Study of Natural Language Communication Between Man and Machine. Communications of the ACM, 9, pages 36-45.
T. Winograd. Understanding Natural Language. Academic Press, New York (1972).
W. Woods. Progress in Natural Language Understanding – An Application to Lunar Geology. In AFIPS Conference Proceedings, volume 42 (1973).