Question Classification using Support Vector Machines. Dell Zhang, National University of Singapore. Wee Sun Lee, National University of Singapore. SIGIR 2003.

Introduction (1/3) What a current information retrieval system or search engine can do is just "document retrieval": given some keywords, it only returns the relevant documents that contain those keywords. However, what a user really wants is often a precise answer to a question. Given "What was the first American in space?", the user wants the answer "Alan Shepard", not to read through lots of documents that contain the words "first", "American", "space", etc.

Introduction (2/3) In order to correctly answer a question, one usually needs to understand what the question asks for. Question classification, i.e., putting questions into several semantic categories, can not only impose constraints on the plausible answers but also suggest different processing strategies. For instance, if the system understands that the question "What was the first American in space?" asks for a person name, the search space of plausible answers is significantly reduced. The accuracy of question classification is therefore very important to the overall performance of a question answering system.

Introduction (3/3) Although it is possible to manually write heuristic rules for question classification, doing so usually requires a tremendous amount of tedious work. In contrast, a machine learning approach can automatically construct a high-performance question classification program that leverages thousands or more features of questions. However, there are only a few papers describing machine learning approaches to question classification, and some of them are pessimistic.

Question Classification Question classification means putting questions into several semantic categories. Here only TREC-style questions, i.e., open-domain factual questions, are considered. We follow the two-layered question taxonomy, which contains 6 coarse-grained categories and 50 fine-grained categories. Most question answering systems use a coarse-grained category definition.

Datasets We use the publicly available training and testing datasets provided by USC, UIUC, and TREC. All these datasets have been manually labeled by UIUC according to the coarse- and fine-grained categories in Table 1. There are about 5500 labeled questions, randomly divided into 5 training datasets of size 1000, 2000, 3000, 4000, and 5500. The testing dataset contains 500 labeled questions from the TREC10 QA track. For each question, we extract two kinds of features: bag-of-words and bag-of-ngrams. Every question is represented as a binary feature vector, because the term frequency (tf) of each word or ngram in a question is usually 0 or 1.
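As a rough illustration of this feature representation (not the authors' actual code), the sketch below builds binary bag-of-words and bag-of-ngrams vectors with scikit-learn; the sample questions and the choice of CountVectorizer are assumptions for the example.

# Hedged sketch: binary bag-of-words / bag-of-ngrams features for questions.
# The questions below are illustrative; the paper uses the UIUC/TREC datasets.
from sklearn.feature_extraction.text import CountVectorizer

questions = [
    "What was the first American in space ?",
    "Which university did the president graduate from ?",
]

# binary=True yields 0/1 features, matching the observation that tf is usually 0 or 1.
bow = CountVectorizer(binary=True)                          # bag-of-words
ngrams = CountVectorizer(binary=True, ngram_range=(1, 2))   # bag-of-ngrams (here: unigrams + bigrams)

X_bow = bow.fit_transform(questions)
X_ng = ngrams.fit_transform(questions)
print(X_bow.shape, X_ng.shape)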

Support Vector Machine (1/2) Support Vector Machines are linear functions of the form f(x) = ⟨w, x⟩ + b, where ⟨w, x⟩ is the inner product between the weight vector w and the input vector x. The SVM can be used as a classifier by setting the class to 1 if f(x) > 0 and to -1 otherwise. The main idea of the SVM is to select a hyperplane that separates the positive and negative examples while maximizing the minimum margin, where the margin for example x_i is y_i * f(x_i) and y_i belongs to {-1, 1}. This corresponds to minimizing ||w||^2 subject to y_i * (⟨w, x_i⟩ + b) >= 1 for all i. Large margin classifiers are known to have good generalization properties.

Support Vector Machine (2/2) To deal with cases where there may be no separating hyperplane, the soft-margin SVM has been proposed. The soft-margin SVM minimizes ||w||^2 + C * Σ_i ξ_i subject to y_i * (⟨w, x_i⟩ + b) >= 1 - ξ_i and ξ_i >= 0, where C is a parameter that controls the amount of training error allowed.
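A minimal sketch of training a soft-margin linear SVM on the binary question features is given below; scikit-learn's LinearSVC is used as a stand-in (the slides do not say which SVM implementation the authors used), and the labels and C value are illustrative.

# Hedged sketch: soft-margin linear SVM for question classification.
# C trades margin size against training errors, as described above.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

train_questions = ["What was the first American in space ?",
                   "Which university did the president graduate from ?"]
train_labels = ["HUM", "LOC"]   # illustrative coarse-grained labels

clf = make_pipeline(CountVectorizer(binary=True), LinearSVC(C=1.0))
clf.fit(train_questions, train_labels)
print(clf.predict(["Who was the first American in space ?"]))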

Other Learning Algorithms (1/2) Besides the SVM, we have tried four other machine learning algorithms: Nearest Neighbors (NN), Naïve Bayes (NB), Decision Tree (DT), and Sparse Network of Winnows (SNoW). The NN algorithm is a simplified version of the well-known kNN algorithm, which has been successfully applied to document classification. Given an unlabeled instance, the NN algorithm finds its nearest (most similar) neighbors among the training examples and uses the dominant class label of these nearest neighbors as its class label. The similarity between two instances is simply defined as the number of overlapping features between them.

Other Learning Algorithms (2/2) The Naïve Bayes algorithm is regarded as one of the top-performing methods for document classification. Its basic idea is to estimate the parameters of a multinomial generative model, then find the most probable class for a given instance using Bayes' rule. The Decision Tree (DT) algorithm is a method for approximating a discrete-valued target function, in which the learned function is represented by a tree of arbitrary degree that classifies instances. The SNoW algorithm is specifically tailored for learning in the presence of a very large number of features and can be used as a general-purpose multi-class classifier. The learned classifier is a sparse network of linear functions.
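To make the NN similarity concrete, here is a small sketch (an assumption about the exact implementation, not the authors' code) that measures similarity as the count of overlapping binary features and labels a question by the dominant label among its nearest training neighbors.

# Hedged sketch: nearest-neighbor classification with overlap similarity.
from collections import Counter

def features(question):
    # Binary bag-of-words features as a set of tokens (illustrative tokenization).
    return set(question.lower().split())

def nn_classify(question, train, k=1):
    # train: list of (question_text, label) pairs.
    ranked = sorted(train,
                    key=lambda ex: len(features(question) & features(ex[0])),
                    reverse=True)
    top_labels = [label for _, label in ranked[:k]]
    return Counter(top_labels).most_common(1)[0][0]   # dominant class label

train = [("What was the first American in space ?", "HUM"),
         ("What is the capital of France ?", "LOC")]
print(nn_classify("Who was the first woman in space ?", train))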

Experiment Result (1/2) We trained the above learning algorithms on the 5 training datasets of different sizes and tested them on the TREC10 questions. For the SVM algorithm, we observed that the bag-of-ngrams features are not much better than the bag-of-words features.

Experiment Result (2/2) In summary, the experiment results show that with only surface text features the SVM outperforms the other four methods for this task.

Tree Kernel's Concept (1/4) We propose to use a special kernel function called a tree kernel to enable the SVM to take advantage of the syntactic structures of questions. For a question, we first parse it into a syntactic tree using a parser, and then represent it as a vector in a high-dimensional space defined over tree fragments. The tree fragments of a syntactic tree are all its sub-trees that include at least one terminal symbol (word) or one production rule. Implicitly we enumerate all the possible tree fragments 1, 2, ..., m. These tree fragments are the axes of this m-dimensional space.

Tree Kernel's Concept (2/4) Every syntactic tree T is represented by an m-dimensional vector V(T) = (V_1(T), ..., V_m(T)). The i-th element V_i(T) is the weight of tree fragment i if that fragment occurs in T, and zero otherwise. The tree kernel of two syntactic trees T1 and T2 is the inner product of their vectors, k(T1, T2) = ⟨V(T1), V(T2)⟩. Each tree fragment i is weighted by α^s(i) * β^d(i), where s(i) is its size, α is the weighting factor for size, d(i) is its depth, and β is the weighting factor for depth. The size of a tree fragment is defined as the number of production rules it contains. The depth of a tree fragment is defined as the depth of the tree fragment's root in the entire syntactic tree.

Tree Kernel's Concept (3/4) We use the question "What is atom ?" to illustrate the above concepts.

Tree Kernel's Concept (4/4) The advantage of the tree kernel is that it can weight the tree fragments according to their depth so as to identify the question focus. For example, in "Which university did the president graduate from ?", the focus is "university", not "president", since the depth of "university" is less than that of "president".

Tree Kernel's Computation (1/3) We describe how the tree kernel can be computed with polynomial time complexity by dynamic programming. Consider a syntactic tree T and let N denote the set of nodes in T. Define the binary indicator function I_i(n) to be 1 if tree fragment i is seen rooted at node n and 0 otherwise, and let d(n) denote the depth of node n. Thus, V_i(T) can be written as a sum of the weighted indicator I_i(n) over all nodes n in N. Given two syntactic trees T1 and T2, the tree kernel k(T1, T2) can then be computed as a double sum over node pairs, k(T1, T2) = Σ_{n1 in N1} Σ_{n2 in N2} C(n1, n2), where C(n1, n2) is the weighted count of common tree fragments rooted at n1 and n2.

Tree Kernel's Computation (2/3) The value of C(n1, n2) can be computed recursively as follows:
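The recursion itself appeared as a figure on the original slide. As a stand-in, the block below gives the standard Collins & Duffy (2001) recursion with a single decay factor λ; the paper's actual version additionally folds in the depth weighting, so treat the exact weights here as an assumption.

\[
C(n_1, n_2) =
\begin{cases}
0 & \text{if the productions at } n_1 \text{ and } n_2 \text{ differ},\\
\lambda & \text{if } n_1, n_2 \text{ are pre-terminals with the same production},\\
\lambda \displaystyle\prod_{j=1}^{nc(n_1)} \bigl(1 + C(\mathrm{ch}_j(n_1), \mathrm{ch}_j(n_2))\bigr) & \text{otherwise,}
\end{cases}
\]

where nc(n_1) is the number of children of n_1 and ch_j(n) is the j-th child of node n.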

Tree Kernel's Computation (3/3) Following the dynamic programming idea, we can traverse T1 and T2 in post-order, fill in a |N1| x |N2| matrix of C(n1, n2) values, and then sum up the matrix entries. Thus, the time complexity of this algorithm is at most O(|N1| * |N2|).
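A compact sketch of this computation, under the same simplifying assumption of a single decay factor λ (trees are nested (label, children) tuples; the bracketing of the example question is also an assumption, and this is illustrative code rather than the authors' implementation):

# Hedged sketch: tree kernel via dynamic programming over node pairs.
# A node is (label, [children]); children are nodes or plain strings (the words).

def nodes(tree):
    label, children = tree
    found = [tree]
    for child in children:
        if not isinstance(child, str):
            found.extend(nodes(child))
    return found

def production(node):
    label, children = node
    return (label, tuple(c if isinstance(c, str) else c[0] for c in children))

def is_preterminal(node):
    return all(isinstance(c, str) for c in node[1])

def tree_kernel(t1, t2, lam=0.9):
    cache = {}
    def C(n1, n2):
        key = (id(n1), id(n2))
        if key not in cache:
            if production(n1) != production(n2):
                cache[key] = 0.0
            elif is_preterminal(n1):
                cache[key] = lam
            else:
                value = lam
                for c1, c2 in zip(n1[1], n2[1]):
                    if isinstance(c1, str):
                        continue    # matched words add no further sub-structure
                    value *= 1.0 + C(c1, c2)
                cache[key] = value
        return cache[key]
    # Sum the C(n1, n2) matrix over all node pairs, as described above.
    return sum(C(a, b) for a in nodes(t1) for b in nodes(t2))

# Tiny illustrative parse of "What is atom ?".
t = ("SBARQ", [("WHNP", [("WP", ["What"])]),
               ("SQ", [("VBZ", ["is"]), ("NP", [("NN", ["atom"])])])])
print(tree_kernel(t, t))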

Evaluation (1/2) To find appropriate values for the tree kernel's two weighting factors, we run 4-fold cross validation on 1000 training examples; the selected values are 0.1 and 0.9. Under the coarse-grained category definition, the SVM based on the tree kernel brings about a 20% error reduction.

Evaluation (2/2) Under the fine-grained category definition, the SVM based on the tree kernel brings only slight improvements. After carefully checking the misclassifications, we believe that the syntactic tree does not normally contain the information required to distinguish between the various fine categories within a coarse category. Nevertheless, if we classify the questions into coarse-grained categories first and then classify the questions inside each coarse-grained category into its fine-grained categories, we may still expect better results from the SVM based on the tree kernel.

Conclusion The main contributions of this paper are as follows. We show that with only surface text features the SVM outperforms four other machine learning methods (NN, NB, DT, SNoW) for question classification. We find that the syntactic structures of questions are really helpful for question classification. We propose to use a special kernel function called the tree kernel to enable the SVM to take advantage of the syntactic structures of questions, and we describe how the tree kernel can be computed efficiently by dynamic programming. The main future direction of this research is to exploit semantic knowledge (such as WordNet) for question classification.