1
In the name of Allah, the Most Gracious, the Most Merciful. "And say: My Lord, increase me in knowledge."
2
Vector space classification & Support vector machine (SVM)
Text Classification: Vector space classification & Support vector machine (SVM), Machine learning on documents
3
Objectives
Classifier types, Linear classifiers, Vector space classification, Vector space classification methods (Rocchio, kNN), Support vector machines and machine learning on documents
4
Classifier types: Linear classifiers and Non-linear classifiers
5
LINEAR CLASSIFIER We define a linear classifier as a two-class classifier that decides class membership by comparing a linear combination of the features to a threshold. The two learning methods Naive Bayes and Rocchio are instances of linear classifiers, perhaps the most important group of text classifiers.
6
LINEAR SEPARABILITY If there exists a hyperplane that perfectly separates the two classes, then we call the two classes linearly separable. In fact, if linear separability holds, then there is an infinite number of linear separators, as illustrated in the figure.
7
LINEAR CLASSIFIER In two dimensions, a linear classifier is a line.
These lines have the functional form w1x1 + w2x2 = b. The classification rule of a linear classifier is to assign a document to c if w1x1 + w2x2 > b and to the complement class c̄ if w1x1 + w2x2 ≤ b. Here, (x1, x2)T is the two-dimensional vector representation of the document and (w1, w2)T is the parameter vector.
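A minimal sketch of this decision rule in Python; the weight vector, threshold, and document vector below are illustrative values, not taken from the slides:

```python
import numpy as np

def linear_classify(x, w, b):
    """Assign a document vector x to class c (True) if w.x > b, otherwise to the complement class."""
    return float(np.dot(w, x)) > b

# Illustrative example: these numbers are made up.
w = np.array([0.4, 0.9])   # parameter vector (w1, w2)
b = 0.5                    # threshold
x = np.array([1.0, 0.2])   # document representation (x1, x2)
print(linear_classify(x, w, b))  # True -> assign to c
```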
8
LINEAR CLASSIFIER Linear classification algorithm
9
LINEAR CLASSIFIER We can generalize this 2D linear classifier to higher dimensions by defining a hyperplane wTx = b. The assignment criterion then is: assign to c if wTx > b and to c̄ if wTx ≤ b. We call a hyperplane that we use as a linear classifier a decision hyperplane.
10
Nonlinear case: a nonlinear problem.
11
NOISE DOCUMENT & NOISE FEATURE
As is typical in text classification, there are some noise documents in the figure (marked with arrows) that do not fit well into the overall distribution of the classes. A noise feature is a misleading feature that, when included in the document representation, on average increases the classification error. Analogously, a noise document is a document that, when included in the training set, misleads the learning method and increases classification error. Noise documents are one reason why training a linear classifier is hard.
12
NOISE DOCUMENT & NOISE FEATURE
A linear problem with noise. In this hypothetical web page classification scenario, Chinese-only web pages are solid circles and mixed Chinese-English web pages are squares. The two classes are separated by a linear class boundary (dashed line, short dashes), except for three noise documents (marked with arrows).
13
Vector space classification
Vector space model represents each document as a vector with one real-valued component, usually a tf-idf weight, for each term. Thus, the document space X, the domain of the classification function γ, is R^|V|.
Notation: w_t,d is the weight of term t in document d; tf_t,d is the term frequency of t in d; df_t is the document frequency, i.e., the number of documents that t occurs in; N is the total number of documents.
14
Vector space Length Normalization
A vector can be (length-) normalized by dividing each of its components by its length; for this we use the L2 norm: ||x||2 = √(Σi xi²).
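A small Python sketch of this weighting and normalization, using the (1 + log10 tf) × log10(N/df) scheme from the worked example later in the deck and document d1 (chinese beijing chinese) from that example:

```python
import math
from collections import Counter

def tfidf_vector(doc_tokens, df, N, vocab):
    """(1 + log10 tf) * log10(N/df) weights, then L2 length normalization."""
    tf = Counter(doc_tokens)
    weights = []
    for term in vocab:
        if tf[term] > 0:
            w = (1 + math.log10(tf[term])) * math.log10(N / df[term])
        else:
            w = 0.0
        weights.append(w)
    length = math.sqrt(sum(w * w for w in weights))  # L2 norm
    return [w / length if length > 0 else 0.0 for w in weights]

# d1 = "chinese beijing chinese"; df and N follow the worked example (N = 4 documents).
vocab = ["chinese", "beijing"]
df = {"chinese": 4, "beijing": 1}
print(tfidf_vector(["chinese", "beijing", "chinese"], df, 4, vocab))  # [0.0, 1.0]
```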
15
Classification Using Vector Spaces
The training set is a set of documents, each labeled with its class (e.g., topic) In vector space classification, this set corresponds to a labeled set of points (or, equivalently, vectors) in the vector space Premise 1: Documents in the same class form a contiguous region of space Premise 2: Documents from different classes don’t overlap (much) We define surfaces to delineate classes in the space
16
Documents in a Vector Space
Figure: documents from three classes (Government, Science, Arts) in the vector space.
17
Test Document of what class?
Figure: a test document plotted among the Government, Science, and Arts classes.
18
Test Document = Government
Is this similarity hypothesis true in general? Our main objective is how to find good separators.
19
Vector space classification methods
Rocchio kNN
20
Using Rocchio for text classification
Use standard tf-idf weighted vectors to represent text documents For training documents in each category, compute a prototype vector by summing the vectors of the training documents in the category. Prototype = centroid of members of class Assign test documents to the category with the closest prototype vector based on cosine similarity.
21
Definition of centroid
The centroid of class c is μ(c) = (1/|Dc|) Σ_{d ∈ Dc} v(d), where Dc is the set of all documents that belong to class c and v(d) is the vector space representation of d.
22
Rocchio Properties Forms a simple generalization of the examples in each class (a prototype). Classification is based on similarity to class prototypes. Does not guarantee classifications are consistent with the given training data. Why not?
23
Rocchio classification
Rocchio classification divides the vector space into regions centered on centroids or prototypes, one for each class, computed as the center of mass of all documents in the class. Rocchio classification is simple and efficient
24
Rocchio classification
25
Rocchio classification
The CENTROID of a class c is computed as the vector average or center of mass of its members: μ(c) = (1/|Dc|) Σ_{d ∈ Dc} v(d), where Dc is the set of documents in D whose class is c: Dc = {d : <d, c> ∈ D}, and v(d) denotes the normalized vector of d. The boundary between two classes in Rocchio classification is the set of points with equal distance from the two centroids. For example, |a1| = |a2|,
26
Rocchio classification
|b1| = |b2|, and |c1| = |c2| in the figure. This set of points is always a line. The generalization of a line in M-dimensional space is a hyperplane, which we define as the set of points x that satisfy wTx = b, where w is the M-dimensional normal vector of the hyperplane and b is a constant.
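The worked example later in the deck uses b = 0.5 × (|μ1|² − |μ2|²); for completeness, here is a short derivation (added here, not on the original slide) of why the set of points equidistant from two centroids is exactly such a hyperplane:

```latex
\|\vec{x}-\vec{\mu}_1\|^2 = \|\vec{x}-\vec{\mu}_2\|^2
\;\Longleftrightarrow\;
-2\,\vec{\mu}_1\cdot\vec{x} + \|\vec{\mu}_1\|^2
 = -2\,\vec{\mu}_2\cdot\vec{x} + \|\vec{\mu}_2\|^2
\;\Longleftrightarrow\;
(\vec{\mu}_1-\vec{\mu}_2)\cdot\vec{x}
 = \tfrac{1}{2}\bigl(\|\vec{\mu}_1\|^2-\|\vec{\mu}_2\|^2\bigr)
% i.e. a hyperplane \vec{w}^{T}\vec{x} = b with
% \vec{w} = \vec{\mu}_1-\vec{\mu}_2 and b = \tfrac{1}{2}(\|\vec{\mu}_1\|^2-\|\vec{\mu}_2\|^2).
```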
27
Rocchio classification
This definition of hyperplanes includes lines (any line in 2D can be defined by w1x1 + w2x2 = b) and 2-dimensional planes (any plane in 3D can be defined by w1x1 + w2x2 + w3x3 = b). A line divides a plane in two, a plane divides 3-dimensional space in two, and hyperplanes divide higher-dimensional spaces in two.
28
Rocchio classification
Thus, the boundaries of class regions in Rocchio classification are hyperplanes. The classification rule in Rocchio is to classify a point in accordance with the region it falls into. Equivalently, we determine the centroid that the point is closest to and then assign it to that centroid's class. As an example, consider the star in the previous figure: it is located in the China region of the space and Rocchio therefore assigns it to China.
29
Rocchio Algorithm Rocchio classification: Training and testing
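The slide's pseudocode is in a figure not reproduced here; the following is a minimal Python sketch of Rocchio training and testing, assuming documents are given as length-normalized tf-idf vectors and that test documents are assigned to the nearest centroid by Euclidean distance, as in the worked example that follows:

```python
import numpy as np

def train_rocchio(X, y):
    """X: (n_docs, n_terms) array of normalized tf-idf vectors; y: list of class labels.
    Returns one centroid (prototype vector) per class."""
    centroids = {}
    y = np.asarray(y)
    for c in set(y):
        centroids[c] = X[y == c].mean(axis=0)  # center of mass of the class
    return centroids

def apply_rocchio(centroids, x):
    """Assign test vector x to the class whose centroid is closest."""
    return min(centroids, key=lambda c: np.linalg.norm(centroids[c] - x))
```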
30
Rocchio classification example
31
Rocchio classification example
Vectors and class centroids for the data in the Table
32
Rocchio classification example
The Table shows the tf-idf vector representations of the five documents, using the formula wf-idf_t,d = (1 + log10 tf_t,d) × log10(4/df_t) if tf_t,d > 0, and 0 otherwise.
Explanation for d1 (chinese, beijing):
In d1, w(chinese) = (1 + log(2)) × log(4/4) = 1.3 × 0 = 0
In d1, w(beijing) = (1 + log(1)) × log(4/1) = 1 × 0.6 = 0.6
Applying length normalization to d1: |d1| = √(0² + 0.6²) = 0.6, so d1(chinese) = 0/0.6 = 0 and d1(beijing) = 0.6/0.6 = 1.
33
Rocchio classification example
Explanation for d4 (tokyo, japan, chinese):
d4(tokyo) = (1 + log(1)) × log(4/1) = 1 × 0.6 = 0.6
d4(japan) = (1 + log(1)) × log(4/1) = 1 × 0.6 = 0.6
d4(chinese) = (1 + log(1)) × log(4/4) = 1 × 0 = 0
Applying length normalization to d4: |d4| = √(0.6² + 0.6² + 0²) = √0.72 ≈ 0.85, so d4(tokyo) = 0.6/0.85 = 0.71, d4(japan) = 0.6/0.85 = 0.71, and d4(chinese) = 0/0.85 = 0.
34
Rocchio classification example
The two class centroids are:
Centroid of class c (China), averaged over d1, d2, d3:
μc(chinese) = 1/3 (0 + 0 + 0) = 0
μc(japan) = 1/3 (0 + 0 + 0) = 0
μc(tokyo) = 1/3 (0 + 0 + 0) = 0
μc(macao) = 1/3 (0 + 0 + 1) = 0.33
μc(beijing) = 1/3 (1 + 0 + 0) = 0.33
μc(shanghai) = 1/3 (0 + 1 + 0) = 0.33
Centroid of class c̄ (not China), from d4 alone:
μc̄(chinese) = 1/1 (0) = 0
μc̄(japan) = 1/1 (0.71) = 0.71
μc̄(tokyo) = 1/1 (0.71) = 0.71
μc̄(macao) = 1/1 (0) = 0
μc̄(beijing) = 1/1 (0) = 0
μc̄(shanghai) = 1/1 (0) = 0
35
Rocchio classification example
The distances of the test document d5 from the two centroids are:
|μc − d5| = √((0 − 0)² + (0 − 0.71)² + (0 − 0.71)² + (0.33 − 0)² + (0.33 − 0)² + (0.33 − 0)²) ≈ 1.15
|μc̄ − d5| = √((0 − 0)² + (0.71 − 0.71)² + (0.71 − 0.71)² + (0 − 0)² + (0 − 0)² + (0 − 0)²) = 0
Thus, Rocchio assigns d5 to c̄ (not China).
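A small Python check of these numbers; the normalized vectors below are taken from the preceding slides (term order: chinese, japan, tokyo, macao, beijing, shanghai), and d5's vector is read off from the distance computation above:

```python
import numpy as np

# Normalized tf-idf vectors (term order: chinese, japan, tokyo, macao, beijing, shanghai)
d1 = np.array([0, 0,    0,    0, 1, 0])
d2 = np.array([0, 0,    0,    0, 0, 1])
d3 = np.array([0, 0,    0,    1, 0, 0])
d4 = np.array([0, 0.71, 0.71, 0, 0, 0])
d5 = np.array([0, 0.71, 0.71, 0, 0, 0])  # test document

mu_c    = (d1 + d2 + d3) / 3  # centroid of class c (China)
mu_cbar = d4                  # centroid of class c-bar (not China), a single document

print(np.linalg.norm(mu_c - d5))     # ~1.15
print(np.linalg.norm(mu_cbar - d5))  # 0.0
```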
36
Rocchio classification example
The separating hyperplane between the two class regions has normal vector w = μc − μc̄ and b = 0.5 × (|μc|² − |μc̄|²) = 0.5 × (3 × 0.33² − 2 × 0.71²) ≈ −0.34. This hyperplane separates the documents as desired; checking, for example, the training documents d2 and d3 confirms that they fall on the China side of the boundary.
37
k nearest neighbor classification
k nearest neighbor or kNN classification determines the decision boundary locally. For 1NN we assign each document to the class of its closest neighbor. For kNN we assign each document to the majority class of its k closest neighbors where k is a parameter. The rationale of kNN classification is that, based on the contiguity hypothesis, we expect a test document d to have the same label as the training documents located in the local region surrounding d.
38
k nearest neighbor classification
Voronoi tessellation and decision boundaries (double lines) in 1NN classification. The three classes are: X, circle and diamond.
39
k nearest neighbor classification
1NN is not very robust. The classification decision of each test document relies on the class of a single training document, which may be incorrectly labeled. kNN for k > 1 is more robust. It assigns documents to the majority class of their k closest neighbors.
40
a probabilistic version of this kNN classification algorithm.
We can estimate the probability of membership in class c as the proportion of the k nearest neighbors in c. The figure gives an example for k = 3. Probability estimates for class membership of the star are P̂(circle class | star) = 1/3, P̂(X class | star) = 2/3, and P̂(diamond class | star) = 0.
41
a probabilistic version of this kNN classification algorithm.
The 3NN estimate P̂(circle class | star) = 1/3 and the 1NN estimate P̂(circle class | star) = 1 differ, with 3NN preferring the X class and 1NN preferring the circle class. The parameter k in kNN is often chosen based on experience or knowledge; the value of k is typically odd to avoid ties, and 3 and 5 are most common.
42
k nearest neighbor classification
Example: k = 6 (6NN). What is P(science | test document)? The three classes in the figure are Government, Science, and Arts.
43
k Nearest Neighbor Classification
kNN = k Nearest Neighbor. To classify a document d into class c: define the k-neighborhood N as the k nearest neighbors of d; count the number of documents i in N that belong to c; estimate P(c|d) as i/k; choose as class argmax_c P(c|d) [= the majority class]. A minimal code sketch of this rule follows below.
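A minimal Python sketch of this rule, assuming training documents are given as rows of a vector matrix with labels, and assuming Euclidean distance as in the earlier examples:

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x, k=3):
    """Return (predicted class, P(c|d) estimates) for test vector x.
    P(c|d) is estimated as the fraction of the k nearest neighbors labeled c."""
    dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distance to every training doc
    nearest = np.argsort(dists)[:k]               # indices of the k closest training docs
    counts = Counter(y_train[i] for i in nearest)
    probs = {c: n / k for c, n in counts.items()}
    return max(probs, key=probs.get), probs
```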
44
k nearest neighbor classification algorithm
45
k nearest neighbor classification example
The distances of the test document d5 from the four training documents are computed in the same way; for example, its distance from d1 is
|d1 − d5| = √((0 − 0)² + (0 − 0.71)² + (0 − 0.71)² + (0 − 0)² + (1 − 0)² + (0 − 0)²) = √(0.5 + 0.5 + 1) ≈ 1.41.
46
Support vector machines: The linearly separable case
The SVM in particular defines the criterion to be looking for a decision surface that is maximally far away from any data point. This distance from the decision surface to the closest data point determines the margin of the classifier. The decision function for an SVM is fully specified by a (usually small) subset of the data which defines the position of the separator. These points are referred to as the support vectors (in a vector space, a point can be thought of as a vector between the origin and that point). The support vectors are the 5 points right up against the margin of the classifier.
47
Support vector machines: The linearly separable case
A classifier with a large margin makes no low-certainty classification decisions: an SVM classifier insists on a large margin around the decision boundary. Compared to a plain decision hyperplane, if you have to place a fat separator between classes, you have fewer choices of where it can be put. As a result, the capacity of the model has been decreased, and hence we expect that its ability to correctly generalize to test data is increased.
48
Another intuition: if you have to place a fat separator between classes, you have fewer choices, and so the capacity of the model has been decreased.
49
Formalize an SVM with algebra
A decision hyperplane can be defined by an intercept term b and a decision hyperplane normal vector w (the weight vector), which is perpendicular to the hyperplane. Because the hyperplane is perpendicular to the normal vector, all points x on the hyperplane satisfy wTx + b = 0.
50
Maximum Margin: Formalization
w: decision hyperplane normal vector
xi: data point i
yi: class of data point i (+1 or −1; NB: not 1/0)
The classifier is: f(xi) = sign(wTxi + b)
The functional margin of xi is: yi(wTxi + b)
But note that we can increase this margin simply by scaling w and b.
The functional margin of the dataset is twice the minimum functional margin for any point; the factor of 2 comes from measuring the whole width of the margin.
51
Geometric Margin
The distance from an example to the separator is r = y(wTx + b)/|w|.
Examples closest to the hyperplane are support vectors. The margin ρ of the separator is the width of separation between the support vectors of the classes.
Derivation of r: the dotted line from x to x′ is perpendicular to the decision boundary, so it is parallel to w. The unit vector in that direction is w/|w|, so the segment is rw/|w| and x′ = x − yrw/|w|. Since x′ lies on the decision boundary, it satisfies wTx′ + b = 0, so wT(x − yrw/|w|) + b = 0. Recalling that |w| = √(wTw) and solving for r gives r = y(wTx + b)/|w|.
52
Linear SVM Mathematically The linearly separable case
Assume that all data is at least distance 1 from the hyperplane. Then the following two constraints follow for a training set {(xi, yi)}:
wTxi + b ≥ 1 if yi = 1
wTxi + b ≤ −1 if yi = −1
For support vectors, the inequality becomes an equality. Then, since each example's distance from the hyperplane is r = y(wTx + b)/|w|, the margin is ρ = 2/|w|.
53
Linear Support Vector Machine (SVM)
Hyperplane: wTx + b = 0. The support-vector hyperplanes are wTxa + b = 1 and wTxb + b = −1.
Extra scale constraint: min_{i=1,…,n} |wTxi + b| = 1.
This implies wT(xa − xb) = 2, so ρ = ||xa − xb||2 = 2/||w||2.
54
Linear SVMs Mathematically (cont.)
Then we can formulate the quadratic optimization problem:
Find w and b such that ρ = 2/||w|| is maximized, and for all {(xi, yi)}: wTxi + b ≥ 1 if yi = 1; wTxi + b ≤ −1 if yi = −1.
A better formulation (min ||w|| = max 1/||w||):
Find w and b such that Φ(w) = ½ wTw is minimized, and for all {(xi, yi)}: yi(wTxi + b) ≥ 1.
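As an illustration (not part of the slides), a library such as scikit-learn solves essentially this quadratic program for a linear SVM; the toy data below is made up, and the margin is recovered as 2/||w||:

```python
import numpy as np
from sklearn.svm import SVC

# Toy, linearly separable data (illustrative only)
X = np.array([[0.0, 0.0], [1.0, 0.5], [0.5, 0.0],
              [3.0, 3.0], [3.5, 4.0], [4.0, 3.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

# A very large C approximates the hard-margin (linearly separable) formulation
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
print("w =", w, "b =", b)
print("margin rho = 2/||w|| =", 2 / np.linalg.norm(w))
print("support vectors:\n", clf.support_vectors_)
```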
55
To be continued …. Thank you