
In the name of Allah, the Most Gracious, the Most Merciful. And say: "My Lord, increase me in knowledge."

Text Classification: Vector Space Classification & Support Vector Machines (SVM). Machine learning on documents.

Objectives: classifier types; linear classifiers; vector space classification; vector space classification methods (Rocchio, kNN); support vector machines and machine learning on documents.

Classifier types: linear classifiers and non-linear classifiers.

LINEAR CLASSIFIER We define a linear classifier as a two-class classifier that decides class membership by comparing a linear combination of the features to a threshold. The two learning methods Naive Bayes and Rocchio are instances of linear classifiers, perhaps the most important group of text classifiers.

LINEAR SEPARABILITY If there exists a hyperplane that perfectly separates the two classes, then we call the two classes linearly separable. In fact, if linear separability holds, there are infinitely many linear separators, as illustrated in the figure.

LINEAR CLASSIFIER In two dimensions, a linear classifier is a line. These lines have the functional form w1x1 + w2x2 = b. The classification rule of a linear classifier is to assign a document to c if w1x1 + w2x2 > b and to the complement of c if w1x1 + w2x2 ≤ b. Here, (x1, x2)T is the two-dimensional vector representation of the document and (w1, w2)T is the parameter vector.

LINEAR CLASSIFIER Linear classification algorithm

LINEAR CLASSIFIER We can generalize this 2D linear classifier to higher dimensions by defining a hyperplane wTx = b. The assignment criterion then is: assign to c if wTx > b and to the complement of c if wTx ≤ b. We call a hyperplane that we use as a linear classifier a decision hyperplane.
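To make the decision rule concrete, here is a minimal Python sketch (not part of the original slides; the weights, document vector, and threshold are made up for illustration):

import numpy as np

def linear_classify(w, x, b):
    """Assign x to class c if w.x > b, otherwise to the complement class.
    w: hyperplane normal vector (one weight per term)
    x: document vector (e.g., tf-idf weights)
    b: threshold / intercept
    """
    return "c" if np.dot(w, x) > b else "not c"

# Hypothetical three-term vocabulary and weights
w = np.array([0.6, 0.9, -0.2])
x = np.array([1.0, 2.0, 0.0])
print(linear_classify(w, x, b=1.5))   # "c", since 0.6*1 + 0.9*2 = 2.4 > 1.5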

Non-linear case: a nonlinear problem.

NOISE DOCUMENT & NOISE FEATURE As is typical in text classification, there are some noise documents in the figure (marked with arrows) that do not fit well into the overall distribution of the classes. A noise feature is a misleading feature that, when included in the document representation, on average increases the classification error. Analogously, a noise document is a document that, when included in the training set, misleads the learning method and increases classification error. Noise documents are one reason why training a linear classifier is hard.

NOISE DOCUMENT & NOISE FEATURE A linear problem with noise. In this hypothetical web page classification scenario, Chinese-only web pages are solid circles and mixed Chinese-English web pages are squares. The two classes are separated by a linear class boundary (dashed line, short dashes), except for three noise documents (marked with arrows).

Vector space classification The vector space model represents each document as a vector with one real-valued component, usually a tf-idf weight, for each term. Thus, the document space X, the domain of the classification function γ, is R^|V|. Notation: w_t,d is the weight of term t in document d; tf is the term frequency of the term in the document; df is the document frequency, the number of documents the term occurs in; N is the total number of documents.

Vector space Length Normalization A vector can be (length-) normalized by dividing each of its components by its length; for this we use the L2 norm: ||x||2 = √(Σi xi²).
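As a rough sketch (not from the slides), the tf-idf weighting and length normalization can be coded as follows; the (1 + log10 tf) × log10(N/df) weighting is assumed because it matches the worked example later in this lecture:

import math

def tfidf_vector(term_counts, df, N, vocab):
    # w_t,d = (1 + log10 tf_t,d) * log10(N / df_t) if tf_t,d > 0, else 0
    vec = []
    for t in vocab:
        tf = term_counts.get(t, 0)
        vec.append((1 + math.log10(tf)) * math.log10(N / df[t]) if tf > 0 else 0.0)
    return vec

def l2_normalize(vec):
    # length-normalize by dividing each component by the L2 norm
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else vec

# d1 from the later example: "chinese" twice, "beijing" once, N = 4 training docs
vocab = ["chinese", "beijing"]
df = {"chinese": 4, "beijing": 1}
print(l2_normalize(tfidf_vector({"chinese": 2, "beijing": 1}, df, N=4, vocab=vocab)))
# -> [0.0, 1.0]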

Classification Using Vector Spaces The training set is a set of documents, each labeled with its class (e.g., topic). In vector space classification, this set corresponds to a labeled set of points (or, equivalently, vectors) in the vector space. Premise 1: documents in the same class form a contiguous region of space. Premise 2: documents from different classes don't overlap (much). We define surfaces to delineate classes in the space.

Documents in a Vector Space Government Science Arts

Test Document of what class? Government Science Arts

Test Document = Government. Is this similarity hypothesis true in general? (Classes: Government, Science, Arts.) Our main objective is how to find good separators.

Vector space classification methods: Rocchio and kNN.

Using Rocchio for text classification Use standard tf-idf weighted vectors to represent text documents. For each category, compute a prototype vector by averaging the vectors of the training documents in the category (prototype = centroid of the members of the class). Assign test documents to the category with the closest prototype vector, based on cosine similarity.

Definition of centroid μ(c) = (1/|Dc|) Σ_{d∈Dc} v(d), where Dc is the set of all documents that belong to class c and v(d) is the vector space representation of d.

Rocchio Properties Forms a simple generalization of the examples in each class (a prototype). Classification is based on similarity to class prototypes. Does not guarantee classifications are consistent with the given training data. Why not?

Rocchio classification Rocchio classification divides the vector space into regions centered on centroids or prototypes, one for each class, computed as the center of mass of all documents in the class. Rocchio classification is simple and efficient

Rocchio classification

Rocchio classification The CENTROID of a class c is computed as the vector average or center of mass of its members: μ(c) = (1/|Dc|) Σ_{d∈Dc} v(d), where Dc is the set of documents in D whose class is c: Dc = {d : <d, c> ∈ D}, and v(d) denotes the (length-normalized) vector of d. The boundary between two classes in Rocchio classification is the set of points with equal distance from the two centroids. For example, |a1| = |a2|,

Rocchio classification |b1| = |b2|, and |c1| = |c2| in the figure. This set of points is always a line. The generalization of a line in M-dimensional space is a hyperplane, which we define as the set of points x that satisfy wTx = b, where w is the M-dimensional normal vector of the hyperplane and b is a constant.

Rocchio classification This definition of hyperplanes includes lines (any line in 2D can be defined by w1x1 + w2x2 = b) and 2-dimensional planes (any plane in 3D can be defined by w1x1 + w2x2 + w3x3 = b). A line divides a plane in two, a plane divides 3-dimensional space in two, and hyperplanes divide higher-dimensional spaces in two.

Rocchio classification Thus, the boundaries of class regions in Rocchio classification are hyperplanes. The classification rule in Rocchio is to classify a point in accordance with the region it falls into. Equivalently, we determine the centroid μ(c) that the point is closest to and then assign it to c. As an example, consider the star in the previous figure: it is located in the China region of the space, and Rocchio therefore assigns it to China.

Rocchio Algorithm Rocchio classification: Training and testing
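A compact sketch of the training/testing procedure (a simplified illustration, not the exact pseudocode shown on the slide; documents are assumed to be length-normalized tf-idf vectors, and Euclidean distance to the centroid is used):

import numpy as np

def train_rocchio(docs, labels):
    # one prototype (centroid) per class: the mean of its member vectors
    return {c: np.mean([d for d, y in zip(docs, labels) if y == c], axis=0)
            for c in set(labels)}

def apply_rocchio(centroids, d):
    # assign d to the class whose centroid is closest
    return min(centroids, key=lambda c: np.linalg.norm(centroids[c] - d))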

Rocchio classification example

Rocchio classification example Vectors and class centroids for the data in the Table

Rocchio classification example The Table shows the tf-idf vector representations of the five documents, using the formula (1 + log10 tf_t,d) × log10(4/df_t) if tf_t,d > 0 (i.e., wf-idf_t,d = wf_t,d × idf_t). Explanation for d1 (chinese, beijing): for chinese, w = (1 + log10 2) × log10(4/4) = 1.3 × 0 = 0; for beijing, w = (1 + log10 1) × log10(4/1) = 1 × 0.6 = 0.6. Applying length normalization to d1: |d1| = √(0² + 0.6²) = 0.6, so d1(chinese) = 0/0.6 = 0 and d1(beijing) = 0.6/0.6 = 1.

Rocchio classification example Explanation for d4 (tokyo, japan, chinese): d4(tokyo) = (1 + log10 1) × log10(4/1) = 1 × 0.6 = 0.6; d4(japan) = (1 + log10 1) × log10(4/1) = 1 × 0.6 = 0.6; d4(chinese) = (1 + log10 1) × log10(4/4) = 1 × 0 = 0. Applying length normalization to d4: |d4| = √(0.6² + 0.6² + 0²) = √0.72 = 0.85, so d4(tokyo) = 0.6/0.85 = 0.71, d4(japan) = 0.6/0.85 = 0.71, and d4(chinese) = 0/0.85 = 0.
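The arithmetic above can be checked with a few lines of Python (the helper name w is just for illustration):

import math

def w(tf, df, N=4):
    return (1 + math.log10(tf)) * math.log10(N / df) if tf > 0 else 0.0

# d1: chinese (tf=2, df=4), beijing (tf=1, df=1)
d1 = [w(2, 4), w(1, 1)]                     # [0.0, 0.602]
n1 = math.sqrt(sum(x * x for x in d1))      # 0.602
print([round(x / n1, 2) for x in d1])       # [0.0, 1.0]

# d4: tokyo (tf=1, df=1), japan (tf=1, df=1), chinese (tf=1, df=4)
d4 = [w(1, 1), w(1, 1), w(1, 4)]            # [0.602, 0.602, 0.0]
n4 = math.sqrt(sum(x * x for x in d4))      # 0.851
print([round(x / n4, 2) for x in d4])       # [0.71, 0.71, 0.0]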

Rocchio classification example The two class centroids are: for the China class c (computed from d1, d2, d3): μc(chinese) = 1/3(0+0+0) = 0, μc(japan) = 1/3(0+0+0) = 0, μc(tokyo) = 1/3(0+0+0) = 0, μc(macao) = 1/3(0+0+1) ≈ 0.33, μc(beijing) = 1/3(1+0+0) ≈ 0.33, μc(shanghai) = 1/3(0+1+0) ≈ 0.33; for the complement class (not China, computed from d4 alone): μ(chinese) = 0, μ(japan) = 0.71, μ(tokyo) = 0.71, μ(macao) = 0, μ(beijing) = 0, μ(shanghai) = 0.

Rocchio classification example The distances of the test document d5 from the two centroids are: from the China centroid, |μc − d5| = √((0−0)² + (0−0.71)² + (0−0.71)² + (0.33−0)² + (0.33−0)² + (0.33−0)²) ≈ 1.15; from the not-China centroid, |μ − d5| = √((0−0)² + (0.71−0.71)² + (0.71−0.71)² + (0−0)² + (0−0)² + (0−0)²) = 0. Thus, Rocchio assigns d5 to the not-China class.

Rocchio classification example The boundary between the two centroids has normal vector w = μc − μ and offset b = 0.5 × ((√(0² + 0² + 0² + 0.33² + 0.33² + 0.33²))² − (√(0² + 0² + 0.71² + 0.71² + 0² + 0²))²) ≈ −0.34. This hyperplane separates the documents as desired; in particular, d2 (i = 2) and d3 (i = 3) fall on the China side of the boundary.
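Assuming the term order (chinese, japan, tokyo, macao, beijing, shanghai) and the normalized tf-idf vectors from the table, this sketch reproduces the centroids, the distances, and the boundary offset b; the i = 2 and i = 3 checks correspond to d2 and d3 landing on the China side:

import numpy as np

# normalized tf-idf vectors, term order: chinese, japan, tokyo, macao, beijing, shanghai
d1 = np.array([0, 0, 0, 0, 1, 0.0])
d2 = np.array([0, 0, 0, 0, 0, 1.0])
d3 = np.array([0, 0, 0, 1, 0, 0.0])
d4 = np.array([0, 0.707, 0.707, 0, 0, 0.0])
d5 = np.array([0, 0.707, 0.707, 0, 0, 0.0])   # test document

mu_c, mu_notc = (d1 + d2 + d3) / 3, d4        # China centroid, not-China centroid
print(np.linalg.norm(mu_c - d5))              # ~1.15
print(np.linalg.norm(mu_notc - d5))           # 0.0 -> d5 is assigned to not-China

w = mu_c - mu_notc                            # normal vector of the boundary
b = 0.5 * (mu_c @ mu_c - mu_notc @ mu_notc)   # ~ -0.33
print(np.dot(w, d2) > b, np.dot(w, d3) > b)   # True True: d2, d3 are on the China side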

k nearest neighbor classification k nearest neighbor or kNN classification determines the decision boundary locally. For 1NN we assign each document to the class of its closest neighbor. For kNN we assign each document to the majority class of its k closest neighbors where k is a parameter. The rationale of kNN classification is that, based on the contiguity hypothesis, we expect a test document d to have the same label as the training documents located in the local region surrounding d.

k nearest neighbor classification Voronoi tessellation and decision boundaries (double lines) in 1NN classification. The three classes are: X, circle and diamond.

k nearest neighbor classification 1NN is not very robust. The classification decision of each test document relies on the class of a single training document, which may be incorrectly labeled. kNN for k > 1 is more robust. It assigns documents to the majority class of their k closest neighbors.

A probabilistic version of this kNN classification algorithm: we can estimate the probability of membership in class c as the proportion of the k nearest neighbors in c. The figure gives an example for k = 3. Probability estimates for class membership of the star are P̂(circle class | star) = 1/3, P̂(X class | star) = 2/3, and P̂(diamond class | star) = 0.

A probabilistic version of this kNN classification algorithm (cont.): the 3NN estimate P̂3(circle class | star) = 1/3 and the 1NN estimate P̂1(circle class | star) = 1 differ, with 3NN preferring the X class and 1NN preferring the circle class. The parameter k in kNN is often chosen based on experience or knowledge; the value of k is typically odd to avoid ties, and 3 and 5 are most common.

k nearest neighbor classification (Sec. 14.3) Example: k = 6 (6NN). What is P(science | test document)? (Classes: Government, Science, Arts.)

k Nearest Neighbor Classification kNN = k Nearest Neighbor. To classify a document d into class c: define the k-neighborhood N as the k nearest neighbors of d; count the number i of documents in N that belong to c; estimate P(c|d) as i/k; choose as class argmax_c P(c|d) (= the majority class).

k nearest neighbor classification algorithm
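A minimal kNN sketch (Euclidean distance over normalized document vectors is assumed; cosine similarity is the other common choice), which also returns the probability estimates P(c|d) = (neighbors in c)/k from the previous slides:

import numpy as np
from collections import Counter

def knn_classify(train_vecs, train_labels, d, k=3):
    # rank training documents by distance to d and keep the k nearest
    dists = [np.linalg.norm(v - d) for v in train_vecs]
    nearest = np.argsort(dists)[:k]
    # estimate P(c|d) as the fraction of the k neighbours labelled c
    votes = Counter(train_labels[i] for i in nearest)
    probs = {c: n / k for c, n in votes.items()}
    return max(probs, key=probs.get), probs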

k nearest neighbor classification example The distances of the test document from the four training documents are computed with the Euclidean distance; for example, its distance from d1 is √((0−0)² + (0−0.71)² + (0−0.71)² + (0−0)² + (1−0)² + (0−0)²) = √((−0.71)² + (−0.71)² + 1²) = √(0.5 + 0.5 + 1) = 1.41.
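Recomputing that distance (the unrounded component 0.707 is used so the result matches √2 ≈ 1.41):

import math

# test document vs. d1; term order: chinese, japan, tokyo, macao, beijing, shanghai
d5 = [0, 0.707, 0.707, 0, 0, 0]
d1 = [0, 0,     0,     0, 1, 0]
print(round(math.sqrt(sum((a - b) ** 2 for a, b in zip(d5, d1))), 2))   # 1.41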

Support vector machines: The linearly separable case The SVM in particular defines the criterion to be looking for a decision surface that is maximally far away from any data point. This distance from the decision surface to the closest data point determines the margin of the classifier. The decision function for an SVM is fully specified by a (usually small) subset of the data which defines the position of the separator. These points are referred to as the support vectors (in a vector space, a point can be thought of as a vector between the origin and that point). The support vectors are the 5 points right up against the margin of the classifier.

Support vector machines: The linearly separable case A classifier with a large margin makes no low-certainty classification decisions; an SVM classifier insists on a large margin around the decision boundary. Compared to placing an arbitrary decision hyperplane, if you have to place a fat separator between classes, you have fewer choices of where it can be put. As a result, the memory capacity of the model is decreased, and hence we expect its ability to correctly generalize to test data to be increased.

Another intuition If you have to place a fat separator between classes, you have fewer choices, and so the capacity of the model has been decreased.

Formalize an SVM with algebra A decision hyperplane can be defined by an intercept term b and a decision hyperplane normal vector w (the weight vector), which is perpendicular to the hyperplane. Because the hyperplane is perpendicular to the normal vector, all points x on the hyperplane satisfy wTx + b = 0.

Maximum Margin: Formalization w: decision hyperplane normal vector; xi: data point i; yi: class of data point i (+1 or −1; NB: not 1/0). The classifier is f(xi) = sign(wTxi + b). The functional margin of xi is yi(wTxi + b); note that we can increase this margin simply by scaling w and b. The functional margin of the dataset is twice the minimum functional margin over all points (the factor of 2 comes from measuring the whole width of the margin).

Geometric Margin The distance from an example x to the separator is r = y(wTx + b)/|w|. Examples closest to the hyperplane are support vectors. The margin ρ of the separator is the width of separation between the support vectors of the two classes. Derivation of r: the dotted line x′ − x in the figure is perpendicular to the decision boundary and so parallel to w. The unit vector is w/|w|, so the segment is rw/|w| and x′ = x − yrw/|w|. Since x′ lies on the decision boundary, it satisfies wTx′ + b = 0, i.e., wT(x − yrw/|w|) + b = 0. Recalling that |w| = √(wTw) and solving for r gives r = y(wTx + b)/|w|.
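The two margin notions translate directly into code (a small sketch with made-up numbers):

import numpy as np

def functional_margin(w, b, x, y):
    # functional margin of a single example: y * (w.x + b)
    return y * (np.dot(w, x) + b)

def geometric_margin(w, b, x, y):
    # geometric margin r = y * (w.x + b) / |w|: signed distance to the hyperplane
    return functional_margin(w, b, x, y) / np.linalg.norm(w)

w, b = np.array([3.0, 4.0]), -5.0
print(geometric_margin(w, b, np.array([3.0, 4.0]), y=+1))   # (9 + 16 - 5) / 5 = 4.0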

Linear SVM Mathematically: the linearly separable case Assume that all data is at least distance 1 from the hyperplane; then the following two constraints follow for a training set {(xi, yi)}: wTxi + b ≥ 1 if yi = 1, and wTxi + b ≤ −1 if yi = −1. For support vectors, the inequality becomes an equality. Then, since each example's distance from the hyperplane is r = y(wTx + b)/|w|, the margin is ρ = 2/|w|.

Linear Support Vector Machine (SVM) The decision hyperplane is wTx + b = 0, with the extra scale constraint min_{i=1,…,n} |wTxi + b| = 1. The support vectors xa and xb on either side satisfy wTxa + b = 1 and wTxb + b = −1. This implies wT(xa − xb) = 2, so the margin is ρ = ||xa − xb||2 = 2/||w||2.

Linear SVMs Mathematically (cont.) Then we can formulate the quadratic optimization problem: find w and b such that ρ = 2/||w|| is maximized and, for all {(xi, yi)}, wTxi + b ≥ 1 if yi = 1 and wTxi + b ≤ −1 if yi = −1. A better formulation (min ||w|| = max 1/||w||): find w and b such that Φ(w) = ½wTw is minimized and, for all {(xi, yi)}, yi(wTxi + b) ≥ 1.
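In practice this quadratic program is handed to a solver; a hedged scikit-learn sketch (assuming scikit-learn is installed; a very large C approximates the hard-margin formulation above, and the data points are made up):

import numpy as np
from sklearn.svm import SVC

# tiny made-up linearly separable training set, labels in {+1, -1}
X = np.array([[2.0, 2.0], [2.0, 3.0], [0.0, 0.0], [1.0, 0.0]])
y = np.array([1, 1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)
w, b = clf.coef_[0], clf.intercept_[0]
print("w =", w, "b =", b)
print("margin rho =", 2 / np.linalg.norm(w))   # rho = 2 / ||w||
print("support vectors:", clf.support_vectors_)
print("prediction:", clf.predict([[2.0, 0.5]]))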

To be continued …. Thank you