Transductive Inference for Text Classification using Support Vector Machines Thorsten Joachims Universität Dortmund, LS VIII, 44221 Dortmund, Germany joachims@ls8.cs.uni-dortmund.de Presented by Yueng-Tien Lo

Outline Introduction Text Classification Transductive Support Vector Machines What Makes TSVMs Especially Well Suited for Text Classification Conclusions and Outlook

Introduction In recent years, text classification has become one of the key techniques for organizing online information, e.g., filtering spam from people's email or learning users' news-reading preferences. The work presented here tackles the problem of learning from small training samples by taking a transductive, instead of an inductive, approach.

Introduction In the inductive setting, the learner tries to induce a decision function which has a low error rate on the whole distribution of examples for the particular learning task. In many situations, however, we do not care about the particular decision function; rather, we want to classify a given set of examples (i.e., a test set) with as few errors as possible.

Introduction Examples: Relevance Feedback: the user is interested in a good classification of the test set into those documents relevant or irrelevant to the query. Netnews Filtering: given the few training examples the user labeled on previous days, he or she wants today's most interesting articles. Reorganizing a document collection.

Text Classification The goal of text classification is the automatic assignment of documents to a fixed number of semantic categories. Each such problem answers the question of whether or not a document should be assigned to a particular category.

Documents, which typically are strings of characters, have to be transformed into a representation suitable for the learning algorithm and the classification task.
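To make this representation concrete, here is a minimal plain-Python sketch of the usual tf-idf bag-of-words mapping. The function name and weighting details are our illustration (the paper uses a similar word-frequency representation); stemming and stop-word removal are omitted.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Map each document (a string) to a sparse tf-idf dict over its words.

    A minimal sketch of the bag-of-words representation; real text
    classification systems would also apply stemming and stop-word removal.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    # document frequency: in how many documents does each word occur?
    df = Counter(w for toks in tokenized for w in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vec = {w: tf[w] * math.log(n / df[w]) for w in tf}
        # normalize to unit length, as is common for text SVMs
        norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
        vectors.append({w: v / norm for w, v in vec.items()})
    return vectors
```

Rare words receive higher weight than common ones, so a word occurring in a single document dominates that document's vector.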

Transductive Support Vector Machines For a learning task, the learner is given a hypothesis space H of functions h: X → {−1, +1} and an i.i.d. sample S_train of n training examples (x_1, y_1), ..., (x_n, y_n). Each training example consists of a document vector x_i and a binary label y_i ∈ {−1, +1}. In contrast to the inductive setting, the learner is also given an i.i.d. sample S_test of k test examples x_1*, ..., x_k* from the same distribution.

Transductive Support Vector Machines The transductive learner aims to select a function h from H, using S_train and S_test, so that the expected number of erroneous predictions on the test examples, (1/k) Σ_{j=1}^{k} Θ(h(x_j*), y_j*), is minimized; Θ(a, b) is zero if a = b, otherwise it is one.
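The quantity the transductive learner minimizes can be sketched directly (function and variable names here are illustrative, not the paper's notation):

```python
import numpy as np

def transductive_test_error(h, X_test, y_test):
    """Average 0/1 loss Theta(h(x), y) over the k given test examples.

    This is the fraction of erroneous predictions on the *fixed* test
    sample, rather than an expectation over the whole distribution as
    in the inductive setting.
    """
    predictions = np.array([h(x) for x in X_test])
    return float(np.mean(predictions != np.asarray(y_test)))
```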

Transductive Support Vector Machines Vapnik gives bounds on the relative uniform deviation of training error and test error: with probability 1 − η, the test error exceeds the training error by at most a confidence interval, where the confidence interval depends on the number of training examples n, the number of test examples k, and the VC-dimension d of H.

Transductive Support Vector Machines What information do we get from studying the test sample (3), and how can we use it? The training and the test sample split the hypothesis space H into a finite number of equivalence classes: two functions fall into the same class if they agree on all n + k examples. This reduces the learning problem from finding a function in the possibly infinite set H to finding one of finitely many equivalence classes. We can use these equivalence classes to build a structure of increasing VC-dimension for structural risk minimization.

Transductive Support Vector Machines Vapnik shows that through the size of the margin we can control the maximum number of equivalence classes (i.e., the VC-dimension).

Theorem 1 Consider hyperplanes h(x) = sign(w · x + b) as hypothesis space H. If the attribute vectors of a training sample (2) and a test sample (3) are contained in a ball of diameter D, then the number of equivalence classes that contain a separating hyperplane with margin at least ρ on all n + k examples is bounded, with the bound depending only on the ratio D²/ρ².

Transductive Support Vector Machines Note that the VC-Dimension does not necessarily depend on the number of features, but can be much lower than the dimensionality of the space. Structural risk minimization tells us that we get the smallest bound on the test error if we select the equivalence class from the structure element which minimizes (6).

Transductive SVM (lin. sep. case) Optimization Problem 1 Minimize over (y_1*, ..., y_k*, w, b): (1/2) ||w||², subject to ∀ i ∈ {1, ..., n}: y_i (w · x_i + b) ≥ 1 and ∀ j ∈ {1, ..., k}: y_j* (w · x_j* + b) ≥ 1.

Transductive SVM (non-sep. case) Optimization Problem 2 Minimize over (y_1*, ..., y_k*, w, b, ξ, ξ*): (1/2) ||w||² + C Σ_{i=1}^{n} ξ_i + C* Σ_{j=1}^{k} ξ_j*, subject to ∀ i: y_i (w · x_i + b) ≥ 1 − ξ_i, ∀ j: y_j* (w · x_j* + b) ≥ 1 − ξ_j*, and ξ_i ≥ 0, ξ_j* ≥ 0. C and C* are parameters set by the user.
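A small sketch of evaluating this objective for a candidate hyperplane and a candidate test labeling, assuming the standard hinge-slack form ξ = max(0, 1 − y(w·x + b)) at the constraint optimum (all names are ours):

```python
import numpy as np

def tsvm_objective(w, b, X_train, y_train, X_test, y_test_labels, C, C_star):
    """Value of the OP2 objective for a hyperplane (w, b) and a candidate
    labeling of the test examples, with each slack taken at its smallest
    feasible value xi = max(0, 1 - y * (w.x + b))."""
    def slacks(X, y):
        margins = np.asarray(y) * (np.asarray(X) @ w + b)
        return np.maximum(0.0, 1.0 - margins)
    return (0.5 * float(w @ w)
            + C * slacks(X_train, y_train).sum()
            + C_star * slacks(X_test, y_test_labels).sum())
```

Because the test labels y_j* are themselves variables, minimizing this objective is a (partly) combinatorial problem, which motivates the search procedure described later.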

What Makes TSVMs Especially Well Suited for Text Classification The text classification task is characterized by a special set of properties. High-dimensional input space: each (stemmed) word is a feature. Document vectors are sparse: for each document, the corresponding document vector contains few entries that are not zero. Few irrelevant features: experiments in [Joachims, 1998] suggest that most words are relevant.

What Makes TSVMs Especially Well Suited for Text Classification But how can TSVMs be any better? When asking the search engine AltaVista about all documents containing the words pepper and salt, it returns 327,180 web pages. When asking for the documents with the words pepper and physics, we get only 4,220 hits, although physics is a more popular word on the web than salt. It is this co-occurrence information that TSVMs exploit as prior knowledge about the learning task.

What Makes TSVMs Especially Well Suited for Text Classification Imagine document D1 was given as a training example for class A and document D6 as a training example for class B. How should we classify documents D2 to D4 (the test set)?

What Makes TSVMs Especially Well Suited for Text Classification The reason we choose this classification of the test data over the others stems from our prior knowledge about the properties of text and common text classification tasks.

What Makes TSVMs Especially Well Suited for Text Classification Note again that we arrived at this classification by studying the location of the test examples, which is not possible for an inductive learner. We see that the maximum-margin bias reflects our prior knowledge about text classification well. By analyzing the test set, we can exploit this prior knowledge for learning.

Solving the Optimization Problem Training a transductive SVM means solving the (partly) combinatorial optimization problem OP2. The key idea of the algorithm is that it begins with a labeling of the test data based on the classification of an inductive SVM. Then it improves the solution by switching the labels of test examples so that the objective function decreases.

Solving the Optimization Problem It starts by training an inductive SVM on the training data and classifying the test data accordingly. Then it uniformly increases the influence of the test examples by incrementing the cost factors C*_− and C*_+ up to the user-defined value of C* (loop 1). While the criterion in the condition of loop 2 identifies two examples for which changing the class labels leads to a decrease in the current objective function, these examples are switched.

Algorithm for training Transductive Support Vector Machines Input: training examples (x_1, y_1), ..., (x_n, y_n); test examples x_1*, ..., x_k*. Parameters: C, C*: parameters from OP2; num_+: number of test examples to be assigned to class +1. Output: predicted labels y_1*, ..., y_k* of the test examples. Classify the test examples using the inductive SVM. The num_+ test examples with the highest value of w · x + b are assigned to class +1; the remaining test examples are assigned to class −1.

Algorithm for training Transductive Support Vector Machines

while ((C*_− < C_−) or (C*_+ < C_+)) {                                      // loop 1
    (w, b, ξ, ξ*) := solve_svm_qp(training and test examples, C, C*_−, C*_+)
    while (∃ m, l : y_m* · y_l* < 0 and ξ_m* > 0 and ξ_l* > 0 and ξ_m* + ξ_l* > 2) {   // loop 2
        y_m* := −y_m* ;  y_l* := −y_l*        // switching this pair decreases the objective
        (w, b, ξ, ξ*) := solve_svm_qp(training and test examples, C, C*_−, C*_+)
    }
    C*_− := min(2 · C*_−, C_−) ;  C*_+ := min(2 · C*_+, C_+)
}
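A rough, self-contained Python sketch of this label-switching scheme. It substitutes a crude subgradient-descent linear SVM for the paper's solve_svm_qp, and simplifies the cost schedule to a single test-cost factor; every function name, parameter, and the toy data below are our own illustration, not the paper's implementation.

```python
import numpy as np

def train_linear_svm(X, y, costs, epochs=200, lr=0.01):
    """Stand-in for solve_svm_qp: stochastic subgradient descent on the
    cost-weighted hinge loss with L2 regularization (a crude
    approximation, not a QP solver)."""
    rng = np.random.default_rng(0)
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for i in rng.permutation(len(X)):
            # subgradient of 0.5*||w||^2 + sum_i costs[i]*max(0, 1 - y_i(w.x_i + b))
            grad_w, grad_b = w.copy(), 0.0
            if y[i] * (X[i] @ w + b) < 1:
                grad_w -= costs[i] * y[i] * X[i]
                grad_b -= costs[i] * y[i]
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b

def tsvm_sketch(X_train, y_train, X_test, C=1.0, C_star=1.0):
    """Label-switching loop in the spirit of the algorithm above: start
    from the inductive labeling of the test set, then repeatedly raise
    the influence of the test examples (loop 1) while switching pairs of
    opposite test labels that satisfy the swap criterion (loop 2)."""
    X_all = np.vstack([X_train, X_test])
    n, k = len(X_train), len(X_test)
    w, b = train_linear_svm(X_train, np.asarray(y_train), np.full(n, C))
    y_star = np.where(X_test @ w + b >= 0, 1, -1)   # inductive labeling
    c_star = 1e-3 * C_star                          # small initial test cost
    while c_star < C_star:                          # loop 1
        costs = np.concatenate([np.full(n, C), np.full(k, c_star)])
        w, b = train_linear_svm(X_all, np.concatenate([y_train, y_star]), costs)
        switched = True
        while switched:                             # loop 2
            switched = False
            xi = np.maximum(0.0, 1.0 - y_star * (X_test @ w + b))
            for m in range(k):
                for l in range(k):
                    if (y_star[m] * y_star[l] < 0 and xi[m] > 0
                            and xi[l] > 0 and xi[m] + xi[l] > 2):
                        y_star[m], y_star[l] = -y_star[m], -y_star[l]
                        w, b = train_linear_svm(
                            X_all, np.concatenate([y_train, y_star]), costs)
                        xi = np.maximum(0.0, 1.0 - y_star * (X_test @ w + b))
                        switched = True
        c_star = min(2.0 * c_star, C_star)          # uniformly raise test influence
    return w, b, y_star
```

On well-separated data the inner loop rarely fires; its role is to escape inductive labelings that force large combined slacks on pairs of test examples.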

Inductive SVM (primal) Optimization Problem 3 The function solve_svm_qp refers to quadratic programs of the following type: minimize over (w, b, ξ) the objective (1/2) ||w||² + Σ_i C_i ξ_i, subject to ∀ i: y_i (w · x_i + b) ≥ 1 − ξ_i and ξ_i ≥ 0, where C_i is the cost factor assigned to example i.

Theorem 2 Algorithm 1 converges in a finite number of steps. The condition in loop 2 requires that the examples to be switched have different class labels.

Theorem 2 Let J denote the value of the objective function of OP2 at the solution of solve_svm_qp for a given labeling of the test examples, so that we can write each label switch in loop 2 as a strict decrease of J.

Theorem 2 The inequality holds due to the selection criterion in loop 2, since ξ_m* > 0, ξ_l* > 0, and ξ_m* + ξ_l* > 2. This means that loop 2 is exited after a finite number of iterations, since there is only a finite number of permutations of the test examples. Loop 1 also terminates after a finite number of iterations, since C*_− and C*_+ are doubled in each iteration and bounded above by C_− and C_+.

Conclusions and Outlook By taking a transductive instead of an inductive approach, the test set can be used as an additional source of information about margins. On all data sets, the transductive approach showed improvements over the currently best-performing method, most substantially for small training samples and large test sets.