A Language Independent Method for Question Classification COLING 2004.

Introduction
An important subtask of a QA system is question analysis, because it provides useful clues for identifying potential answers in a large collection of texts. An error analysis of an open-domain QA system showed that 36.4% of its errors originated in the question classification module (Moldovan et al., 2003). We present a method for question classification that is language independent because it needs no complex natural language processing tools.

Related Work
Zhang and Lee (2003) present a method for question classification using Support Vector Machines.
– It uses a tree kernel function that can represent the syntactic structure of questions.
– It achieves an accuracy of 90%.
– A parser is needed to acquire the syntactic information.
Li and Roth (2002) report a hierarchical approach to question classification based on the SNoW learning architecture.
– The hierarchical classifier discriminates among 5 coarse classes, which are then refined into 50 more specific classes.
– Accuracy: 98.80% for the coarse classification.
– The learners are trained using lexical and syntactic features.

Learning Question Classifiers
Questions are short sentences, so each question instance carries little information. Question classification approaches therefore try to use other information (e.g. chunks and named entities) besides the words within the questions. The main disadvantage of relying on semantic analyzers, named entity taggers and the like is that for some languages these tools are not yet well developed.

Learning Question Classifiers
This is our prime motivation for seeking different information, easier to gather, to solve the question classification problem. Our learning scenario uses as attributes the prefixes of words, in combination with attributes whose values are obtained from the Internet. These Internet-based attributes are intended to extract evidence of the possible semantic class of the question.

Learning Question Classifiers - Using Internet
We propose using the Internet to acquire information that can be used as attributes in our classification problem. The procedure for gathering this information from the web is as follows:
– Use a set of heuristics to extract from the question a word w, or set of words, that will complement the queries submitted for the search.
– Go to a search engine, in this case Google, and submit queries using the word w in combination with all the possible semantic classes.

Learning Question Classifiers - Using Internet
The procedure for gathering this information from the web (continued):
– We count the number of results returned by Google for each query and normalize the counts by their sum.
– The resulting numbers are the values of the attributes used by the learning algorithm.
An additional advantage of using the Internet is that, by approximating the attribute values in this way, we take into account words or entities belonging to more than one class (polysemy).
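The normalization step can be sketched in a few lines of Python. This is only an illustration: the hit counts are made-up numbers, not real search results, and returning all zeros when every count is zero is an assumption, since the slides do not say how that case is handled.

```python
def normalize_counts(hit_counts):
    """Divide each raw hit count by the sum of the counts over all classes."""
    total = sum(hit_counts.values())
    if total == 0:
        # Assumption: with no evidence for any class, use all-zero attributes
        # instead of dividing by zero.
        return {cls: 0.0 for cls in hit_counts}
    return {cls: count / total for cls, count in hit_counts.items()}


# Hypothetical hit counts for one extracted word (illustrative values only).
example_counts = {
    "person": 600,
    "place": 100,
    "date": 0,
    "measure": 50,
    "organization": 250,
}
features = normalize_counts(example_counts)
```

Because the values sum to 1, a word with evidence for several classes spreads its weight across them, which is how the polysemy mentioned above shows up in the attributes.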

Learning Question Classifiers - Using Internet
For instance, given the question “Who is the President of the French Republic?”, we extract the word President using our heuristics and run 5 queries in the search engine, one for each possible class. These queries take the following form:
– “President is a person”
– “President is a place”
– “President is a date”
– “President is a measure”
– “President is an organization”
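A minimal Python sketch of how such queries could be generated from the extracted word. The helper names and the vowel-based choice between "a" and "an" are assumptions made for illustration; the slides only show the resulting query strings.

```python
CLASSES = ["person", "place", "date", "measure", "organization"]


def article_for(word):
    """Pick 'an' before a vowel letter, 'a' otherwise (a simplification)."""
    return "an" if word[0].lower() in "aeiou" else "a"


def build_queries(w, classes=CLASSES):
    """Build one '<w> is a/an <class>' query per semantic class."""
    return [f"{w} is {article_for(cls)} {cls}" for cls in classes]


queries = build_queries("President")
for query in queries:
    print(query)
```

Each query would then be submitted to the search engine and its result count recorded as the raw evidence for that class.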

Learning Question Classifiers - Heuristics
We begin by eliminating from the questions all words that appear in our stop list. The remaining words are sent to the search engine in combination with the possible semantic classes, as described above. If no results are returned for any of the semantic classes, we start eliminating words from right to left until the search engine returns results for at least one of the semantic categories.

Learning Question Classifiers - Heuristics
As an example, consider the question posed previously: “Who is the President of the French Republic?”
We eliminate the stop-list words and formulate queries with the remaining words:
– “President French Republic is a s_i”, where s_i ∈ {Person, Organization, Place, Date, Measure}.
If no results are returned, we start eliminating words from right to left:
– “President French is a s_i”
– “President is a s_i”, which returns results for all the semantic classes except Date.
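The back-off heuristic can be sketched as follows. Everything here is a stand-in chosen for illustration: the tiny stop list, the function names, and in particular `fake_hit_count`, which replaces the real search engine with a stub that mimics the outcome described in the example.

```python
STOP_WORDS = {"who", "is", "the", "of"}  # tiny illustrative stop list
CLASSES = ["Person", "Organization", "Place", "Date", "Measure"]


def content_words(question):
    """Drop punctuation at the ends and remove stop-list words."""
    tokens = question.strip("?¿ ").split()
    return [t for t in tokens if t.lower() not in STOP_WORDS]


def back_off(words, classes, hit_count):
    """Drop words from the right until some class query returns results."""
    while words:
        phrase = " ".join(words)
        counts = {c: hit_count(f"{phrase} is a {c}") for c in classes}
        if any(counts.values()):
            return phrase, counts
        words = words[:-1]  # eliminate the rightmost word and retry
    return "", {c: 0 for c in classes}


def fake_hit_count(query):
    """Stub search engine: only 'President is a <class>' gets hits, except Date."""
    if query.startswith("President is a ") and not query.endswith(" Date"):
        return 10
    return 0


words = content_words("Who is the President of the French Republic?")
phrase, counts = back_off(words, CLASSES, fake_hit_count)
```

With this stub, the loop tries "President French Republic", then "President French", and stops at "President", matching the sequence in the example.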

Experimental Evaluation - Data sets
The data set used in this work consists of the questions provided in the DISEQuA Corpus (Magnini et al., 2003).
– The questions are simple, mostly short, straightforward and factual queries.
– The corpus contains 450 questions, each one formulated in four languages: Dutch, English, Italian and Spanish.
– The questions are classified into seven categories: Person, Organization, Measure, Date, Object, Other and Place.
The experiments performed in this work used the English, Italian and Spanish versions of these questions.

Experimental Evaluation
In our experiments we used the WEKA implementation of SVM (Witten and Frank, 1999). In this setting, multi-class problems are solved using pairwise classification. The most common approach to question classification is bag-of-words, so we decided to compare the results of using bag-of-words against using just prefixes of the words in the questions.
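As a rough illustration of pairwise classification, the sketch below builds one binary decision per pair of categories and lets them vote. It is not WEKA's code: the binary classifiers are hand-written stubs, and the majority vote with arbitrary tie-breaking is an assumption about the general scheme rather than a description of the experiments.

```python
from collections import Counter
from itertools import combinations

CATEGORIES = ["Person", "Organization", "Measure", "Date", "Object", "Other", "Place"]

# One binary problem per unordered pair of classes: 7 * 6 / 2 = 21 pairs.
PAIRS = list(combinations(CATEGORIES, 2))


def pairwise_predict(x, binary_classifiers):
    """Ask every pairwise classifier for a winner and return the most voted class."""
    votes = Counter(binary_classifiers[pair](x) for pair in PAIRS)
    return votes.most_common(1)[0][0]


def make_stub(pair):
    """Stub classifier: prefers 'Person' when it is in the pair, else the first class."""
    return lambda x: "Person" if "Person" in pair else pair[0]


stub_classifiers = {pair: make_stub(pair) for pair in PAIRS}
prediction = pairwise_predict("Who is the President of the French Republic?", stub_classifiers)
```

In a real system each stub would be a trained binary SVM over the word, prefix or Internet-based attributes.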

Experimental Evaluation
Table 3: Experimental results of accuracy when training SVM with words, prefixes and Internet-based attributes
Table 4: Experimental results combining Internet-valued attributes with words and prefixes

Experimental Evaluation
For Spanish, however, the best results were achieved when using prefixes of size 5. This may be because some of the interrogative words that by themselves can define the semantic class of a question in this language share a prefix of size 4:
– Cuándo (When) and Cuánto (How much) reduce to the same prefix of size 4, i.e. Cuán.
– With prefixes of size 5, these two words form two different prefixes, Cuánd and Cuánt, which reduces the loss of information compared with prefixes of size 4.
For all three languages the Internet-based attributes alone yielded rather low accuracies, the lowest being for Italian.
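The prefix collision is easy to reproduce. The sketch below is illustrative only: the function name is hypothetical, and normalizing to NFC (so that each accented letter counts as a single character) is an assumption about how the text is encoded.

```python
import unicodedata


def prefix(word, size):
    """Return the first `size` characters of a word, after NFC normalization."""
    return unicodedata.normalize("NFC", word)[:size]


# Size 4: both interrogatives collapse to the same attribute.
same_at_4 = prefix("Cuándo", 4) == prefix("Cuánto", 4)

# Size 5: the two interrogatives stay distinct.
distinct_at_5 = prefix("Cuándo", 5) != prefix("Cuánto", 5)

print(prefix("Cuándo", 4), prefix("Cuándo", 5), prefix("Cuánto", 5))
```

With size 4 the classifier cannot tell a Date cue from a Measure cue on this word alone, which is consistent with the explanation given for the Spanish results.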

Experimental Evaluation

Comparing our results with those of previous work, we can say that our method is promising. For instance, Zhang and Lee (2003) reported an accuracy of 90% for English questions, while Li and Roth (2002) achieved 98.8% accuracy. However, they used a training set of 5,500 questions and a test set of 500 questions, while in our experiments we trained on 405 questions for every 45 test questions. When Zhang and Lee used only 1,000 questions for training, they achieved an accuracy of 80.2%. Machine learning algorithms generally perform better when a bigger training set is available, so we expect that experiments with our method on a larger training set will give improved results.

Conclusions
We have presented experimental results for a language independent question classification method. The method does not use semantic or syntactic information. We believe that it can be successfully applied to other languages, such as Romanian, French, Portuguese and Catalan, that share the morphological characteristics of the three languages tested here. We expect that experiments with a larger training set will give improved results.