# Text Classification Slides by Tom Mitchell (NB), William Cohen (kNN), Ray Mooney and others at UT-Austin, me


Outline

- Problem definition and applications
- Very Quick Intro to Machine Learning and Classification
  - Learning bounds
  - Bias-variance tradeoff, No free lunch theorem
- Maximum Entropy Models
- Other Classification Techniques
- Representations
  - Vector Space Model (and variations)
  - Feature Selection
  - Dimensionality Reduction
  - Representations and independence assumptions
  - Sparsity and smoothing

Spam or not Spam?

Most people who’ve ever used email have developed a hatred of spam. In the days before Gmail (and still today), you could get hundreds of spam messages per day. “Spam filters” were developed to automatically classify, with high (but not perfect) accuracy, which messages are spam and which aren’t.

Terminology and Definitions

- Let D be an input space, e.g., the space of all possible English documents.
- Let C be an output space, e.g., {S, N} for Spam and Not-Spam.
- Let F be the space of all possible functions $f: D \to C$.
- A hypothesis space for D and C is any subset H of F.

CIS 8590 NLP

Loss Function: Measuring “Accuracy”

A loss function is a function $L: H \times D \times C \to [0,1]$.

Given a hypothesis h, document d, and class c, L(h,d,c) returns the error or loss of h when making a prediction on d.

Simple example: L(h,d,c) = 0 if h(d) = c, and 1 otherwise. This is called 0-1 loss.
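The 0-1 loss can be sketched in a few lines of Python. The hypothesis `contains_free` and the helper `average_loss` are hypothetical names introduced only for illustration; they are not from the slides.

```python
from typing import Callable, Iterable, Tuple

# A hypothesis maps a document (here, a plain string) to a class label.
Hypothesis = Callable[[str], str]


def zero_one_loss(h: Hypothesis, d: str, c: str) -> int:
    """Return 0 if h classifies document d as class c, and 1 otherwise."""
    return 0 if h(d) == c else 1


def average_loss(h: Hypothesis, examples: Iterable[Tuple[str, str]]) -> float:
    """Mean 0-1 loss of h over labeled (document, class) pairs."""
    examples = list(examples)
    return sum(zero_one_loss(h, d, c) for d, c in examples) / len(examples)


def contains_free(d: str) -> str:
    """Toy hypothesis (an assumption): Spam iff the document contains 'free'."""
    return "S" if "free" in d.lower().split() else "N"
```

Averaging the 0-1 loss over a labeled sample gives the error rate, which is one minus the accuracy.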

Machine Learning Problem 6

Example Text Mining Applications

- News topic classification (e.g., Google News): C = {politics, sports, business, health, tech, …}
- “SafeSearch” filtering: C = {pornography, not pornography}
- Language classification: C = {English, Spanish, Chinese, …}
- Sentiment classification: C = {positive review, negative review}
- Email sorting: C = {spam, meeting reminders, invitations, …} (user-defined!)


Concrete Example

Let C = {“Spam”, “Not Spam”} or {S, N}.

Let H be the set of conjunctive rules, like: “if document d contains ‘free credit score’ AND ‘click here’ → Spam”.

A Simple Learning Algorithm

1. Pick a class c (S or N).
2. Find the term t that correlates best with c.
3. Construct a rule r: “If d contains t → c”.
4. Repeatedly find more terms that correlate with c.
5. Add the new terms to r, until the accuracy stops improving on the training data.
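The steps above can be sketched as a greedy search over conjunctive rules. This is only one possible reading of the slide: it assumes documents are sets of terms, that there are exactly two classes, and that "correlates best" can be approximated by "improves training accuracy most". The names `learn_rule`, `rule_matches`, and `accuracy` are hypothetical.

```python
from typing import List, Set, Tuple

Example = Tuple[Set[str], str]  # (set of terms in the document, class label)


def rule_matches(rule: List[str], doc: Set[str]) -> bool:
    """A conjunctive rule fires only if the document contains every term."""
    return all(term in doc for term in rule)


def accuracy(rule: List[str], data: List[Example], target: str, other: str) -> float:
    """Training accuracy of: 'if d contains all terms in rule -> target, else other'."""
    correct = sum(
        (target if rule_matches(rule, doc) else other) == label
        for doc, label in data
    )
    return correct / len(data)


def learn_rule(data: List[Example], target: str, other: str) -> Tuple[List[str], float]:
    """Greedily add terms to the rule until training accuracy stops improving."""
    vocab = sorted({term for doc, _ in data for term in doc})
    rule: List[str] = []
    best = accuracy(rule, data, target, other)
    while True:
        candidates = [t for t in vocab if t not in rule]
        if not candidates:
            break
        # Assumption: pick the term whose addition gives the highest accuracy.
        term = max(candidates, key=lambda t: accuracy(rule + [t], data, target, other))
        score = accuracy(rule + [term], data, target, other)
        if score <= best:
            break
        rule.append(term)
        best = score
    return rule, best
```

Because the stopping test looks only at training accuracy, a learner like this can overfit, which is one motivation for the bias-variance discussion later in the deck.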

4 Things Everyone Should Know About Machine Learning

1. Assumptions
2. Generalization Bounds and Occam’s Razor
3. Bias-Variance Tradeoff
4. No Free Lunch

1. Assumptions

Machine learning traditionally makes two important (and often unrealistic) assumptions.

1. There is a probability distribution P (not necessarily known, but it’s assumed to exist) from which all examples d are drawn (training and test examples).
2. Each example is drawn independently from this distribution.

Together, these are known as “i.i.d.”: independent and identically distributed.

Why are the assumptions important? Basically, it’s hard to make a prediction about a document if all of your training examples are totally different. With these assumptions, you’re saying it’s very unlikely (with enough training data) that you’ll see a test example that’s totally different from all of your training data. 13

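2. Generalization Bounds

Theorem (generalization bound by Vapnik-Chervonenkis): With probability 1 − δ over the choice of training data, the true error of a hypothesis is bounded by its training error plus a complexity term that grows with v and shrinks as the number of training examples grows.

Here, v is the VC-dimension of the hypothesis space. If the hypothesis space is complex, v is big. If it’s simple, v is small.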

2. Bounds and Occam’s Razor Occam’s Razor: All other things being equal, the simplest explanation is the best. Generalization bounds lend some theoretical credence to this old rule-of-thumb. 15

3. Bias and Variance

Bias: The built-in tendency of a learning machine or hypothesis class to find a hypothesis in a pre-determined region of the space of all possible classifiers. E.g., our rule hypotheses are biased towards axis-parallel lines.

Variance: The degree to which a learning algorithm is sensitive to small changes in the training data.

- If a small change in training data causes a large change in the resulting classifier, then the learning algorithm has “high variance”.

3. Bias-Variance Tradeoff As a general rule, the more biased a learning machine, the less variance it has, and the more variance it has, the less biased it is. 17

4. No Free Lunch Theorem Simply put, this famous theorem says: If your learning machine has no bias at all, then it’s impossible to learn anything. The proof is simple, but out of the scope of this lecture. You should check it out. 18


Machine Learning Techniques for NLP

NLP people tend to favor certain kinds of learning machines:

- Maximum entropy (or log-linear, or logistic regression, or logit) models (gaining in popularity lately)
- Bayesian networks (directed graphical models, like Naïve Bayes)
- Support vector machines (but only for certain things, like text classification and information extraction)

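Hypothesis Class

A maximum entropy/log-linear model (ME) is any function with this form:

$$P(c \mid d) = \frac{1}{Z(d)} \exp\Big(\sum_i \lambda_i f_i(c, d)\Big)$$

Normalization function:

$$Z(d) = \sum_{c'} \exp\Big(\sum_i \lambda_i f_i(c', d)\Big)$$

“Log-linear”: If you take the log, it’s a linear function.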

Feature Functions

The functions $f_i$ are called feature functions (or sometimes just features). These must be defined by the person designing the learning machine.

Example: $f_i(c,d)$ = [If c = S, count of how often “free” appears in d. Otherwise, 0.]

Parameters

The $\lambda_i$ are called the parameters of the model. During training, the learning algorithm tries to find the best value for the $\lambda_i$.
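As a rough illustration of how feature functions and parameters combine, the sketch below evaluates a log-linear model of the usual form, P(c | d) proportional to exp(Σ λ_i f_i(c, d)). The feature `f_free` follows the slide's example; `f_meeting`, the class list, and the whitespace tokenization are assumptions made for the example.

```python
import math
from typing import Callable, List, Sequence

CLASSES = ["S", "N"]  # Spam / Not-Spam


def f_free(c: str, d: str) -> int:
    """The slide's example feature: count of 'free' in d if c is Spam, else 0."""
    return d.split().count("free") if c == "S" else 0


def f_meeting(c: str, d: str) -> int:
    """Hypothetical second feature: count of 'meeting' in d if c is Not-Spam."""
    return d.split().count("meeting") if c == "N" else 0


FEATURES: List[Callable[[str, str], int]] = [f_free, f_meeting]


def score(lambdas: Sequence[float], c: str, d: str) -> float:
    """Linear part of the model: sum_i lambda_i * f_i(c, d)."""
    return sum(lam * f(c, d) for lam, f in zip(lambdas, FEATURES))


def prob(lambdas: Sequence[float], c: str, d: str) -> float:
    """P(c | d): exponentiated score divided by the normalizer Z(d)."""
    z = sum(math.exp(score(lambdas, c2, d)) for c2 in CLASSES)
    return math.exp(score(lambdas, c, d)) / z
```

With all parameters set to zero, every class gets the same probability; training moves the λ values away from zero so that informative features shift probability mass.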

Example ME Hypothesis 24

Why is it “Maximum Entropy”? Before we get into how to train one of these, let’s get an idea of why people use it. The basic intuition is from Occam’s Razor: we want to find the “simplest” probability distribution P(c | d) that explains the training data. Note that this also introduces bias: we’re biasing our search towards “simple” distributions. But what makes a distribution “simple”? 25

Entropy

Entropy is a measure of how much uncertainty is in a probability distribution.

$$H(p_1, \ldots, p_N) = -\sum_{i=1}^{N} p_i \log_2 p_i$$

Examples:

- Entropy of a deterministic event (taking 0 log 0 = 0): H(1, 0) = −1 log 1 − 0 log 0 = (−1)(0) − 0 = 0
- Entropy of flipping a coin: H(1/2, 1/2) = −1/2 log 1/2 − 1/2 log 1/2 = −(1/2)(−1) − (1/2)(−1) = 1
- Entropy of rolling a six-sided die: H(1/6, …, 1/6) = −1/6 log 1/6 − … − 1/6 log 1/6 = −6 · (1/6)(−2.585) ≈ 2.585
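The three entropy examples can be checked numerically. This is a minimal sketch using base-2 logarithms (so the result is in bits) and the usual convention that terms with probability 0 contribute nothing.

```python
import math
from typing import Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in bits; outcomes with p = 0 are skipped (0 log 0 = 0)."""
    return sum(-p * math.log2(p) for p in probs if p > 0)


deterministic = entropy([1.0, 0.0])   # no uncertainty
fair_coin = entropy([0.5, 0.5])       # one bit
fair_die = entropy([1 / 6] * 6)       # log2(6) bits
```

Skewing a distribution away from uniform always lowers this value, which is why the uniform distribution is the maximum entropy setting.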

Entropy

Entropy of a biased coin flip: let P(Heads) represent the probability that the biased coin lands on Heads.

Maximum entropy setting for P(Heads): P(Heads) = P(not Heads).

If event X has N possible outcomes, the maximum entropy setting for $p(x_1), p(x_2), \ldots, p(x_N)$ is $p(x_1) = p(x_2) = \cdots = p(x_N) = 1/N$.

Occam’s Razor for Distributions

Given a set of empirical expectations of the form $E_{\text{Train}}[f_i(c,d)]$, find a distribution $P(c \mid d)$ such that

- it provides the same expectations (matches the training data): $E_{P(c \mid d)}[f_i(c,d)] = E_{\text{Train}}[f_i(c,d)]$
- it maximizes the entropy H(P) (Occam’s Razor bias).

Theorem

The maximum entropy distribution for $P(c \mid d)$, subject to the constraints $E_{P(c \mid d)}[f_i(c,d)] = E_{\text{Train}}[f_i(c,d)]$, must have log-linear form.

Thus, max-ent models have to be log-linear models.

Training a ME model

Training is an optimization problem: find the value for λ that maximizes the conditional log-likelihood (CLL) of the training data:

$$\text{CLL}(\lambda) = \sum_{(d, c) \in \text{Train}} \log P(c \mid d; \lambda)$$

Quiz

Let’s assume that I give you some feature functions $f_i(c,d)$.

1. What is the hypothesis class H of MaxEnt models using the feature functions $f_i$?
2. What is the loss function L?

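Training a ME model

Optimization is normally performed using some form of gradient method. Because the CLL is being maximized, the update below is gradient ascent on the CLL (equivalently, gradient descent on the negative CLL). Here t indexes training iterations, not features:

0. Initialize $\lambda^{(0)}$ to 0.
1. Compute the gradient: $\nabla \text{CLL}$.
2. Take a step in the direction of the gradient: $\lambda^{(t+1)} = \lambda^{(t)} + \alpha \nabla \text{CLL}(\lambda^{(t)})$.
3. Repeat until the CLL doesn’t improve: stop when $|\text{CLL}(\lambda^{(t+1)}) - \text{CLL}(\lambda^{(t)})| < \varepsilon$.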

Gradient Descent: Geometry

Graph of $f(x,y) = -(\cos^2 x + \cos^2 y)^2$. On the bottom plane, the gradient of f is projected as a vector field.

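Training a ME model

Computing the gradient:

$$\frac{\partial\, \text{CLL}}{\partial \lambda_i} = \sum_{(d, c) \in \text{Train}} \Big[ f_i(c, d) - \sum_{c'} P(c' \mid d)\, f_i(c', d) \Big]$$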
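A minimal sketch of the training loop, assuming the standard maxent gradient (observed feature counts minus expected feature counts under the model) and a fixed step size. The function names, the default `alpha`, `eps`, and `max_iter` values, and the string-based documents are all assumptions for illustration.

```python
import math
from typing import Callable, List, Sequence, Tuple

Feature = Callable[[str, str], float]   # f_i(c, d)
Example = Tuple[str, str]               # (document, class)


def class_probs(lambdas, feats: Sequence[Feature], classes, d: str) -> List[float]:
    """P(c | d) for every class, with the max score subtracted for stability."""
    scores = [sum(l * f(c, d) for l, f in zip(lambdas, feats)) for c in classes]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def cll(lambdas, feats, classes, train: Sequence[Example]) -> float:
    """Conditional log-likelihood of the training data."""
    return sum(
        math.log(class_probs(lambdas, feats, classes, d)[classes.index(c)])
        for d, c in train
    )


def gradient(lambdas, feats, classes, train) -> List[float]:
    """Observed minus expected feature values, summed over training examples."""
    grad = [0.0] * len(feats)
    for d, c in train:
        p = class_probs(lambdas, feats, classes, d)
        for i, f in enumerate(feats):
            expected = sum(pc * f(c2, d) for pc, c2 in zip(p, classes))
            grad[i] += f(c, d) - expected
    return grad


def train_maxent(feats, classes, train, alpha=0.1, eps=1e-6, max_iter=1000):
    """Gradient ascent on the CLL, stopping when the improvement is below eps."""
    lambdas = [0.0] * len(feats)
    prev = cll(lambdas, feats, classes, train)
    for _ in range(max_iter):
        g = gradient(lambdas, feats, classes, train)
        lambdas = [l + alpha * gi for l, gi in zip(lambdas, g)]
        cur = cll(lambdas, feats, classes, train)
        if abs(cur - prev) < eps:
            break
        prev = cur
    return lambdas
```

On linearly separable training data the unregularized weights keep growing, so the loop may end at `max_iter` rather than at the `eps` test; practical implementations usually add regularization and use a quasi-Newton optimizer instead of a fixed step.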

Overfitting

[Figure: error rate versus training iteration, with one curve for the training set and one for the test set.]

Regularization Regularizing an objective (or loss) function is the act of penalizing certain subsets of a hypothesis class. Typically the penalty is based on prior beliefs, like simpler models are better than more complex ones. CIS 8590 NLP 38

Regularizing MaxEnt Models

Add a penalty term to the objective function:

$$\text{CLL}(\lambda) - \alpha \sum_i |\lambda_i| \qquad \text{(L1 Regularization)}$$

$$\text{CLL}(\lambda) - \alpha \sum_i \lambda_i^2 \qquad \text{(L2 Regularization)}$$

Both L1 and L2 Regularization penalize hypotheses with weights far from zero. The α term is called the “regularization parameter”. It’s typically set using a grid search.

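Training a Regularized ME model

Add regularization terms to the existing gradient of the CLL:

$$\frac{\partial}{\partial \lambda_i} \Big[ \text{CLL}(\lambda) - \alpha \sum_k \lambda_k^2 \Big] = \frac{\partial\, \text{CLL}}{\partial \lambda_i} - 2 \alpha \lambda_i$$

Note that when $\lambda_i$ is positive, the contribution of the regularizer to the gradient is negative, and vice versa.

(This is for L2 Regularization. For L1, the partials don’t exist at zero, so a more complicated procedure is required.)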
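In code, the L2 case amounts to a one-line adjustment of whatever gradient the unregularized model produces. The sketch assumes the penalty is α Σ λ_i², so each partial derivative gains a −2αλ_i term; the function name is hypothetical.

```python
from typing import List, Sequence


def l2_regularized_gradient(
    cll_grad: Sequence[float], lambdas: Sequence[float], alpha: float
) -> List[float]:
    """Add the L2 penalty's contribution (-2 * alpha * lambda_i) to each partial."""
    return [g - 2.0 * alpha * lam for g, lam in zip(cll_grad, lambdas)]
```

A positive weight is pulled down and a negative weight is pulled up, so the regularizer always pushes weights towards zero.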


Classification Techniques

Book mentions three:

- Naïve Bayes
- k-Nearest Neighbor
- Support Vector Machines

Others (besides ME):

- Rule-based systems
  - Decision lists (e.g., Ripper)
  - Decision trees (e.g., C4.5)
- Perceptron and Neural Networks

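Bayes Rule

$$P(Y \mid X) = \frac{P(X \mid Y)\, P(Y)}{P(X)}$$

Which is shorthand for: for every pair of values $y_i$, $x_j$,

$$P(Y = y_i \mid X = x_j) = \frac{P(X = x_j \mid Y = y_i)\, P(Y = y_i)}{P(X = x_j)}$$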

For code, see www.cs.cmu.edu/~tom/mlbook.html and click on “Software and Data”.

How can we implement this if the $a_i$ are continuous-valued attributes?

Also called “Gaussian distribution”

Gaussian

Assume $P(a_i \mid v_j)$ follows a Gaussian distribution, and use the training data to estimate its mean and variance.
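A minimal sketch of that estimation step, assuming each training example is a list of numeric attribute values and using the maximum-likelihood (divide by n) variance. The function names are hypothetical, and the sketch does not handle a zero variance, which occurs when a class has a single example or a constant attribute.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def fit_gaussians(
    X: Sequence[Sequence[float]], y: Sequence[str]
) -> Dict[str, Tuple[List[float], List[float]]]:
    """For each class v_j, estimate the mean and variance of every attribute a_i."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    params = {}
    for label, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        variances = [
            sum((v - m) ** 2 for v in col) / n for col, m in zip(columns, means)
        ]
        params[label] = (means, variances)
    return params


def gaussian_pdf(x: float, mean: float, var: float) -> float:
    """Density of N(mean, var) at x; used as P(a_i | v_j) in Naive Bayes."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)
```

At prediction time a Naive Bayes classifier would multiply the class prior by `gaussian_pdf` for each attribute, usually in log space to avoid underflow.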

K-nearest neighbor methods

William Cohen, 10-601, April 2008

BellCore’s MovieRecommender

- Participants sent email to videos@bellcore.com.
- System replied with a list of 500 movies to rate on a 1-10 scale (250 random, 250 popular).
  - Only a subset needs to be rated.
- New participant P sends in rated movies via email.
- System compares ratings for P to ratings of (a random sample of) previous users.
- Most similar users are used to predict scores for unrated movies (more later).
- System returns recommendations in an email message.

Suggested Videos for: John A. Jamus.

Your must-see list with predicted ratings:

- 7.0 "Alien (1979)"
- 6.5 "Blade Runner"
- 6.2 "Close Encounters Of The Third Kind (1977)"

Your video categories with average ratings:

- 6.7 "Action/Adventure"
- 6.5 "Science Fiction/Fantasy"
- 6.3 "Children/Family"
- 6.0 "Mystery/Suspense"
- 5.9 "Comedy"
- 5.8 "Drama"

The viewing patterns of 243 viewers were consulted. Patterns of 7 viewers were found to be most similar.

Correlation with target viewer:

- 0.59 viewer-130 (unlisted@merl.com)
- 0.55 bullert,jane r (bullert@cc.bellcore.com)
- 0.51 jan_arst (jan_arst@khdld.decnet.philips.nl)
- 0.46 Ken Cross (moose@denali.EE.CORNELL.EDU)
- 0.42 rskt (rskt@cc.bellcore.com)
- 0.41 kkgg (kkgg@Athena.MIT.EDU)
- 0.41 bnn (bnn@cc.bellcore.com)

By category, their joint ratings recommend:

- Action/Adventure: "Excalibur" 8.0, 4 viewers; "Apocalypse Now" 7.2, 4 viewers; "Platoon" 8.3, 3 viewers
- Science Fiction/Fantasy: "Total Recall" 7.2, 5 viewers
- Children/Family: "Wizard Of Oz, The" 8.5, 4 viewers; "Mary Poppins" 7.7, 3 viewers
- Mystery/Suspense: "Silence Of The Lambs, The" 9.3, 3 viewers
- Comedy: "National Lampoon's Animal House" 7.5, 4 viewers; "Driving Miss Daisy" 7.5, 4 viewers; "Hannah and Her Sisters" 8.0, 3 viewers
- Drama: "It's A Wonderful Life" 8.0, 5 viewers; "Dead Poets Society" 7.0, 5 viewers; "Rain Man" 7.5, 4 viewers

Correlation of predicted ratings with your actual ratings is: 0.64. This number measures ability to evaluate movies accurately for you. 0.15 means low ability. 0.85 means very good ability. 0.50 means fair ability.

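Algorithms for Collaborative Filtering 1: Memory-Based Algorithms (Breese et al, UAI98)

- $v_{i,j}$ = vote of user i on item j
- $I_i$ = items for which user i has voted
- Mean vote for i is

$$\bar{v}_i = \frac{1}{|I_i|} \sum_{j \in I_i} v_{i,j}$$

- Predicted vote for “active user” a is a weighted sum

$$p_{a,j} = \bar{v}_a + \kappa \sum_{i=1}^{n} w(a,i)\,(v_{i,j} - \bar{v}_i)$$

where $w(a,i)$ are the weights of the n similar users and $\kappa$ is a normalizer.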
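A small sketch of the memory-based prediction, assuming each user's votes are stored as a dict from item to rating, that the similarity weights are supplied by the caller, and that the normalizer is one over the sum of absolute weights. The function names and the fallback to the active user's mean are illustrative choices.

```python
from typing import Dict, Sequence

Votes = Dict[str, float]  # item -> vote for one user


def mean_vote(votes: Votes) -> float:
    """Mean vote of a user over the items that user has voted on."""
    return sum(votes.values()) / len(votes)


def predict_vote(
    active: Votes, others: Sequence[Votes], weights: Sequence[float], item: str
) -> float:
    """Active user's mean plus a normalized weighted sum of the other users'
    deviations from their own means on this item."""
    rated = [(w, u) for w, u in zip(weights, others) if item in u]
    norm = sum(abs(w) for w, _ in rated)
    if norm == 0:
        # Assumption: with no usable neighbors, fall back to the mean vote.
        return mean_vote(active)
    deviation = sum(w * (u[item] - mean_vote(u)) for w, u in rated)
    return mean_vote(active) + deviation / norm
```

Using deviations from each user's own mean compensates for users who rate everything high or everything low.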

Basic k-nearest neighbor classification

Training method:

- Save the training examples.

At prediction time:

- Find the k training examples $(x_1, y_1), \ldots, (x_k, y_k)$ that are closest to the test example x.
- Predict the most frequent class among those $y_i$’s.

Example: http://cgm.cs.mcgill.ca/~soss/cs644/projects/simard/
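The procedure above fits in a few lines. This sketch assumes numeric feature vectors and Euclidean distance, and it scans every training example, so prediction cost grows linearly with the training set; the function names are hypothetical.

```python
import math
from collections import Counter
from typing import Callable, List, Sequence, Tuple

Example = Tuple[Sequence[float], str]  # (feature vector, class label)


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Straight-line distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(
    train: List[Example],
    x: Sequence[float],
    k: int = 3,
    dist: Callable[[Sequence[float], Sequence[float]], float] = euclidean,
) -> str:
    """'Training' is just storing the examples; prediction is a majority vote
    over the k stored examples closest to x."""
    neighbors = sorted(train, key=lambda ex: dist(ex[0], x))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]
```

The `dist` parameter is where the rescaling and weighting ideas discussed later would plug in.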

What is the decision boundary?

Voronoi diagram

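Convergence of 1-NN

[Figure: a test point x with label y and its nearest neighbors $x_1$ (label $y_1$) and $x_2$ (label $y_2$); the distributions $P(Y \mid x_1)$ and $P(Y \mid x)$ are assumed equal.]

Let $y^* = \arg\max_y \Pr(y \mid x)$.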

Basic k-nearest neighbor classification

Training method:

- Save the training examples.

At prediction time:

- Find the k training examples $(x_1, y_1), \ldots, (x_k, y_k)$ that are closest to the test example x.
- Predict the most frequent class among those $y_i$’s.

Improvements:

- Weighting examples from the neighborhood
- Measuring “closeness”
- Finding “close” examples in a large training set quickly

K-NN and irrelevant features

[Figure, shown over three slides: + and o training examples with an unlabeled query point “?”.]

Ways of rescaling for KNN

- Normalized L1 distance
- Scale by IG
- Modified value distance metric

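Ways of rescaling for KNN

- Dot product: $x \cdot y = \sum_i x_i y_i$
- Cosine distance: $\cos(x, y) = \dfrac{x \cdot y}{\|x\|\,\|y\|}$
- TFIDF weights for text: for doc j, feature i: $x_i = tf_{i,j} \cdot idf_i$, where $tf_{i,j}$ is the number of occurrences of term i in doc j and $idf_i = \log \dfrac{\#\text{docs in corpus}}{\#\text{docs in corpus that contain term } i}$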

Combining distances to neighbors

- Standard KNN
- Distance-weighted KNN


Computing KNN: pros and cons

- Storage: all training examples are saved in memory.
  - A decision tree or linear classifier is much smaller.
- Time: to classify x, you need to loop over all training examples (x’, y’) to compute the distance between x and x’.
  - However, you get predictions for every class y. KNN is nice when there are many, many classes.
  - Actually, there are some tricks to speed this up, especially when data is sparse (e.g., text).

Efficiently implementing KNN (for text)

IDF is nice computationally.

Tricks with fast KNN

K-means using r-NN:

1. Pick k points $c_1 = x_1, \ldots, c_k = x_k$ as centers.
2. For each $x_i$, find $D_i$ = Neighborhood($x_i$).
3. For each $x_i$, let $c_i$ = mean($D_i$).
4. Go to step 2.

Efficiently implementing KNN

Selective classification: given a training set and test set, find the N test cases that you can most confidently classify.

Support Vector Machines Slides by Ray Mooney et al. U. Texas at Austin machine learning group

Perceptron Revisited: Linear Separators

Binary classification can be viewed as the task of separating classes in feature space:

- Separator: $w^T x + b = 0$
- One side: $w^T x + b < 0$
- Other side: $w^T x + b > 0$

$$f(x) = \text{sign}(w^T x + b)$$
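The decision function is simple enough to write out directly. This sketch represents w and x as plain lists and, as an arbitrary assumption, assigns points lying exactly on the hyperplane to the −1 class.

```python
from typing import Sequence


def linear_classify(w: Sequence[float], b: float, x: Sequence[float]) -> int:
    """f(x) = sign(w^T x + b), returning +1 or -1."""
    activation = sum(wi * xi for wi, xi in zip(w, x)) + b
    # Assumption: an activation of exactly 0 is mapped to -1.
    return 1 if activation > 0 else -1
```

Every linear classifier in this deck, including the SVM, predicts with a function of this shape; they differ only in how w and b are chosen.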

Linear Separators Which of the linear separators is optimal?

Classification Margin

- Distance from example $x_i$ to the separator is $r = \dfrac{|w^T x_i + b|}{\|w\|}$.
- Examples closest to the hyperplane are support vectors.
- Margin ρ of the separator is the distance between support vectors.

Maximum Margin Classification Maximizing the margin is good according to intuition and PAC theory. Implies that only support vectors matter; other training examples are ignorable.

Linear SVM Mathematically

Let training set $\{(x_i, y_i)\}_{i=1..n}$, $x_i \in \mathbb{R}^d$, $y_i \in \{-1, 1\}$ be separated by a hyperplane with margin ρ. Then for each training example $(x_i, y_i)$:

$$w^T x_i + b \le -\rho/2 \ \text{ if } y_i = -1, \qquad w^T x_i + b \ge \rho/2 \ \text{ if } y_i = 1$$

which is equivalent to

$$y_i (w^T x_i + b) \ge \rho/2$$

For every support vector $x_s$ the above inequality is an equality. After rescaling w and b by ρ/2 in the equality, we obtain that the distance between each $x_s$ and the hyperplane is

$$r = \frac{y_s (w^T x_s + b)}{\|w\|} = \frac{1}{\|w\|}$$

Then the margin can be expressed through (rescaled) w and b as:

$$\rho = 2r = \frac{2}{\|w\|}$$

Linear SVMs Mathematically (cont.)

Then we can formulate the quadratic optimization problem:

Find w and b such that $\rho = \dfrac{2}{\|w\|}$ is maximized and, for all $(x_i, y_i)$, $i = 1..n$: $y_i (w^T x_i + b) \ge 1$.

Which can be reformulated as:

Find w and b such that $\Phi(w) = \|w\|^2 = w^T w$ is minimized and, for all $(x_i, y_i)$, $i = 1..n$: $y_i (w^T x_i + b) \ge 1$.

Solving the Optimization Problem

Find w and b such that $\Phi(w) = w^T w$ is minimized and, for all $(x_i, y_i)$, $i = 1..n$: $y_i (w^T x_i + b) \ge 1$.

- Need to optimize a quadratic function subject to linear constraints.
- Quadratic optimization problems are a well-known class of mathematical programming problems for which several (non-trivial) algorithms exist.
- The solution involves constructing a dual problem where a Lagrange multiplier $\alpha_i$ is associated with every inequality constraint in the primal (original) problem:

Find $\alpha_1 \ldots \alpha_n$ such that

$$Q(\alpha) = \sum_i \alpha_i - \tfrac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j x_i^T x_j$$

is maximized and

1. $\sum_i \alpha_i y_i = 0$
2. $\alpha_i \ge 0$ for all $\alpha_i$

The Optimization Problem Solution

Given a solution $\alpha_1 \ldots \alpha_n$ to the dual problem, the solution to the primal is:

$$w = \sum_i \alpha_i y_i x_i \qquad b = y_k - \sum_i \alpha_i y_i x_i^T x_k \ \text{ for any } \alpha_k > 0$$

Each non-zero $\alpha_i$ indicates that the corresponding $x_i$ is a support vector.

Then the classifying function is (note that we don’t need w explicitly):

$$f(x) = \sum_i \alpha_i y_i x_i^T x + b$$

Notice that it relies on an inner product between the test point x and the support vectors $x_i$; we will return to this later.

Also keep in mind that solving the optimization problem involved computing the inner products $x_i^T x_j$ between all training points.

Soft Margin Classification

What if the training set is not linearly separable?

Slack variables $\xi_i$ can be added to allow misclassification of difficult or noisy examples; the resulting margin is called soft.

Soft Margin Classification Mathematically

The old formulation:

Find w and b such that $\Phi(w) = w^T w$ is minimized and, for all $(x_i, y_i)$, $i = 1..n$: $y_i (w^T x_i + b) \ge 1$.

Modified formulation incorporates slack variables:

Find w and b such that $\Phi(w) = w^T w + C \sum_i \xi_i$ is minimized and, for all $(x_i, y_i)$, $i = 1..n$: $y_i (w^T x_i + b) \ge 1 - \xi_i$, $\xi_i \ge 0$.

Parameter C can be viewed as a way to control overfitting: it “trades off” the relative importance of maximizing the margin and fitting the training data.

Soft Margin Classification: Solution

Dual problem is identical to the separable case (it would not be identical if the 2-norm penalty for slack variables $C \sum_i \xi_i^2$ was used in the primal objective; we would need additional Lagrange multipliers for slack variables):

Find $\alpha_1 \ldots \alpha_N$ such that

$$Q(\alpha) = \sum_i \alpha_i - \tfrac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j x_i^T x_j$$

is maximized and

1. $\sum_i \alpha_i y_i = 0$
2. $0 \le \alpha_i \le C$ for all $\alpha_i$

Again, $x_i$ with non-zero $\alpha_i$ will be support vectors.

Solution to the dual problem is:

$$w = \sum_i \alpha_i y_i x_i \qquad b = y_k (1 - \xi_k) - \sum_i \alpha_i y_i x_i^T x_k \ \text{ for any } k \text{ s.t. } \alpha_k > 0$$

Again, we don’t need to compute w explicitly for classification:

$$f(x) = \sum_i \alpha_i y_i x_i^T x + b$$

Theoretical Justification for Maximum Margins

Vapnik has proved the following: The class of optimal linear separators has VC dimension h bounded from above as

$$h \le \min\left\{ \left\lceil \frac{D^2}{\rho^2} \right\rceil, m_0 \right\} + 1$$

where ρ is the margin, D is the diameter of the smallest sphere that can enclose all of the training examples, and $m_0$ is the dimensionality.

Intuitively, this implies that regardless of dimensionality $m_0$ we can minimize the VC dimension by maximizing the margin ρ.

Thus, complexity of the classifier is kept small regardless of dimensionality.

Linear SVMs: Overview

- The classifier is a separating hyperplane.
- Most “important” training points are support vectors; they define the hyperplane.
- Quadratic optimization algorithms can identify which training points $x_i$ are support vectors with non-zero Lagrangian multipliers $\alpha_i$.
- Both in the dual formulation of the problem and in the solution, training points appear only inside inner products:

Find $\alpha_1 \ldots \alpha_N$ such that

$$Q(\alpha) = \sum_i \alpha_i - \tfrac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j x_i^T x_j$$

is maximized and

1. $\sum_i \alpha_i y_i = 0$
2. $0 \le \alpha_i \le C$ for all $\alpha_i$

$$f(x) = \sum_i \alpha_i y_i x_i^T x + b$$

Non-linear SVMs

- Datasets that are linearly separable with some noise work out great.
- But what are we going to do if the dataset is just too hard?
- How about mapping data to a higher-dimensional space?

[Figures: one-dimensional datasets plotted along x, and the same data mapped to the two-dimensional space $(x, x^2)$.]

Non-linear SVMs: Feature spaces General idea: the original feature space can always be mapped to some higher-dimensional feature space where the training set is separable: Φ: x → φ(x)

The “Kernel Trick”

The linear classifier relies on the inner product between vectors $K(x_i, x_j) = x_i^T x_j$.

If every datapoint is mapped into high-dimensional space via some transformation $\Phi: x \to \varphi(x)$, the inner product becomes:

$$K(x_i, x_j) = \varphi(x_i)^T \varphi(x_j)$$

A kernel function is a function that is equivalent to an inner product in some feature space.

Example: 2-dimensional vectors $x = [x_1 \ x_2]$; let $K(x_i, x_j) = (1 + x_i^T x_j)^2$. Need to show that $K(x_i, x_j) = \varphi(x_i)^T \varphi(x_j)$:

$$
\begin{aligned}
K(x_i, x_j) &= (1 + x_i^T x_j)^2 \\
&= 1 + x_{i1}^2 x_{j1}^2 + 2 x_{i1} x_{j1} x_{i2} x_{j2} + x_{i2}^2 x_{j2}^2 + 2 x_{i1} x_{j1} + 2 x_{i2} x_{j2} \\
&= [1 \ \ x_{i1}^2 \ \ \sqrt{2} x_{i1} x_{i2} \ \ x_{i2}^2 \ \ \sqrt{2} x_{i1} \ \ \sqrt{2} x_{i2}]^T \, [1 \ \ x_{j1}^2 \ \ \sqrt{2} x_{j1} x_{j2} \ \ x_{j2}^2 \ \ \sqrt{2} x_{j1} \ \ \sqrt{2} x_{j2}] \\
&= \varphi(x_i)^T \varphi(x_j)
\end{aligned}
$$

where $\varphi(x) = [1 \ \ x_1^2 \ \ \sqrt{2} x_1 x_2 \ \ x_2^2 \ \ \sqrt{2} x_1 \ \ \sqrt{2} x_2]$.

Thus, a kernel function implicitly maps data to a high-dimensional space (without the need to compute each $\varphi(x)$ explicitly).
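The identity in the worked example can be checked numerically: the sketch below evaluates the degree-2 polynomial kernel directly and compares it with the inner product of the explicit 6-dimensional feature maps. The function names `poly_kernel`, `phi`, and `dot` are hypothetical.

```python
import math
from typing import List, Sequence


def poly_kernel(x: Sequence[float], z: Sequence[float]) -> float:
    """K(x, z) = (1 + x^T z)^2 for 2-dimensional inputs."""
    return (1 + x[0] * z[0] + x[1] * z[1]) ** 2


def phi(x: Sequence[float]) -> List[float]:
    """Explicit feature map whose inner product reproduces poly_kernel."""
    x1, x2 = x
    r2 = math.sqrt(2)
    return [1.0, x1 ** 2, r2 * x1 * x2, x2 ** 2, r2 * x1, r2 * x2]


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Plain inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


x, z = (1.0, 2.0), (3.0, -1.0)
implicit = poly_kernel(x, z)     # never builds the 6-dimensional vectors
explicit = dot(phi(x), phi(z))   # builds them, then takes the inner product
```

The kernel takes a handful of arithmetic operations in the original 2-dimensional space, while the explicit route has to construct both mapped vectors first; the gap widens quickly with higher degrees and dimensions.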

What Functions are Kernels?

For some functions $K(x_i, x_j)$, checking that $K(x_i, x_j) = \varphi(x_i)^T \varphi(x_j)$ can be cumbersome.

Mercer’s theorem: Every positive semi-definite symmetric function is a kernel.

Positive semi-definite symmetric functions correspond to a positive semi-definite symmetric Gram matrix:

$$
K = \begin{pmatrix}
K(x_1, x_1) & K(x_1, x_2) & K(x_1, x_3) & \cdots & K(x_1, x_n) \\
K(x_2, x_1) & K(x_2, x_2) & K(x_2, x_3) & \cdots & K(x_2, x_n) \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
K(x_n, x_1) & K(x_n, x_2) & K(x_n, x_3) & \cdots & K(x_n, x_n)
\end{pmatrix}
$$

Examples of Kernel Functions

- Linear: $K(x_i, x_j) = x_i^T x_j$
  - Mapping $\Phi: x \to \varphi(x)$, where $\varphi(x)$ is x itself.
- Polynomial of power p: $K(x_i, x_j) = (1 + x_i^T x_j)^p$
  - Mapping $\Phi: x \to \varphi(x)$, where $\varphi(x)$ has $\binom{d+p}{p}$ dimensions.
- Gaussian (radial-basis function): $K(x_i, x_j) = \exp\left(-\dfrac{\|x_i - x_j\|^2}{2\sigma^2}\right)$
  - Mapping $\Phi: x \to \varphi(x)$, where $\varphi(x)$ is infinite-dimensional: every point is mapped to a function (a Gaussian); a combination of functions for support vectors is the separator.
- Higher-dimensional space still has intrinsic dimensionality d (the mapping is not onto), but linear separators in it correspond to non-linear separators in the original space.

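Non-linear SVMs Mathematically

Dual problem formulation:

Find $\alpha_1 \ldots \alpha_n$ such that

$$Q(\alpha) = \sum_i \alpha_i - \tfrac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j K(x_i, x_j)$$

is maximized and

1. $\sum_i \alpha_i y_i = 0$
2. $\alpha_i \ge 0$ for all $\alpha_i$

The solution is (with the kernel evaluated between each support vector $x_i$ and the test point x):

$$f(x) = \sum_i \alpha_i y_i K(x_i, x) + b$$

Optimization techniques for finding the $\alpha_i$’s remain the same!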

SVM applications

- SVMs were originally proposed by Boser, Guyon and Vapnik in 1992 and gained increasing popularity in the late 1990s.
- SVMs are currently among the best performers for a number of classification tasks ranging from text to genomic data.
- SVMs can be applied to complex data types beyond feature vectors (e.g. graphs, sequences, relational data) by designing kernel functions for such data.
- SVM techniques have been extended to a number of tasks such as regression [Vapnik et al. ’97], principal component analysis [Schölkopf et al. ’99], etc.
- Most popular optimization algorithms for SVMs use decomposition to hill-climb over a subset of $\alpha_i$’s at a time, e.g. SMO [Platt ’99] and [Joachims ’99].
- Tuning SVMs remains a black art: selecting a specific kernel and parameters is usually done in a try-and-see manner.

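Performance Comparison (?)

| | NB | Rocchio | Dec. Trees | kNN | linear SVM (C=0.5) | linear SVM (C=1) | rbf-SVM |
|---|---|---|---|---|---|---|---|
| earn | 96.0 | 96.1 | 96.1 | 97.8 | 98.0 | 98.2 | 98.1 |
| acq | 90.7 | 92.1 | 85.3 | 91.8 | 95.5 | 95.6 | 94.7 |
| money-fx | 59.6 | 67.6 | 69.4 | 75.4 | 78.8 | 78.5 | 74.3 |
| grain | 69.8 | 79.5 | 89.1 | 82.6 | 91.9 | 93.1 | 93.4 |
| crude | 81.2 | 81.5 | 75.5 | 85.8 | 89.4 | 89.4 | 88.7 |
| trade | 52.2 | 77.4 | 59.2 | 77.9 | 79.2 | 79.2 | 76.6 |
| interest | 57.6 | 72.5 | 49.1 | 76.7 | 75.6 | 74.8 | 69.1 |
| ship | 80.9 | 83.1 | 80.9 | 79.8 | 87.4 | 86.5 | 85.8 |
| wheat | 63.4 | 79.4 | 85.5 | 72.9 | 86.6 | 86.8 | 82.4 |
| corn | 45.2 | 62.2 | 87.7 | 71.4 | 87.5 | 87.8 | 84.6 |
| microavg. | 72.3 | 79.9 | 79.4 | 82.6 | 86.7 | 87.5 | 86.4 |

SVM classifier break-even F from (Joachims, 2002a, p. 114). Results are shown for the 10 largest categories and for microaveraged performance over all 90 categories on the Reuters-21578 data set.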

Choosing a classifier

| Technique | Train time | Test time | “Accuracy” | Interpretability | Bias-Variance | Data Complexity |
|---|---|---|---|---|---|---|
| Naïve Bayes | \|W\| + \|C\| \|V\| | \|C\| * \|V_d\| | Medium-low | Medium | High-bias | Low |
| k-NN | \|W\| | \|V\| * \|V_d\| | Medium | Low | ? | High |
| SVM | \|C\| \|D\|^3 \|V_ave\| | \|C\| * \|V_d\| | High | Low | Mixed | Medium-low |
| Neural Nets | ? | \|C\| * \|V_d\| | High | Low | High-variance | High |
| Log-linear | ? | \|C\| * \|V_d\| | High | Medium | High-variance/mixed | Medium |
| Ripper | ? | ? | Medium | High | High-bias | ? |

“Accuracy”: reputation for accuracy in experimental settings. Note that it is impossible to say beforehand which classifier will be most accurate on any given problem.

C = set of classes. W = bag of training tokens. V = set of training types. D = set of train docs. V_d = types in test document d. V_ave = average number of types per doc in training.


Vector Space Model Idea: represent each document as a vector. Why bother? How can we make a document into a vector? 104

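Documents as Vectors

Example:

- Document D1: “yes we got no bananas”
- Document D2: “what you got”
- Document D3: “yes I like what you got”

| | yes | we | got | no | bananas | what | you | I | like |
|---|---|---|---|---|---|---|---|---|---|
| Vector V1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 |
| Vector V2 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 |
| Vector V3 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 1 |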

Documents as Vectors

Generically, we convert a document into a vector by:

1. Determine the vocabulary V, or set of all terms in the collection of documents.
2. For each document d, compute a score $s_v(d)$ for every term v in V.
   - For instance, $s_v(d)$ could be the number of times v appears in d.
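The two steps can be sketched as follows, using the three example documents from the slides. Whitespace tokenization and ordering the vocabulary by first appearance are simplifying assumptions, and the function names are hypothetical.

```python
from typing import List, Sequence


def build_vocabulary(docs: Sequence[str]) -> List[str]:
    """Step 1: collect every distinct term, in order of first appearance."""
    vocab: List[str] = []
    for doc in docs:
        for term in doc.split():
            if term not in vocab:
                vocab.append(term)
    return vocab


def count_vector(doc: str, vocab: Sequence[str]) -> List[int]:
    """Step 2: score each vocabulary term by how often it appears in doc."""
    terms = doc.split()
    return [terms.count(v) for v in vocab]


docs = ["yes we got no bananas", "what you got", "yes I like what you got"]
vocab = build_vocabulary(docs)
vectors = [count_vector(d, vocab) for d in docs]
```

Real systems would normally lowercase, strip punctuation, and store the vectors sparsely, since most entries are zero.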

Why Bother? The vector space model has a number of limitations (discussed later). But two major benefits: 1.Convenience (notational & mathematical) 2.It’s well-understood –That is, there are a lot of side benefits, like similarity and distance metrics, that you get for free. 107

Handy Tools Euclidean distance and norm Cosine similarity Dot product 108

Measuring Similarity

Similarity metric: the size of the angle between document vectors.

“Cosine Similarity”:

$$\cos(x, y) = \frac{x \cdot y}{\|x\|\,\|y\|}$$
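A short sketch of cosine similarity over plain Python lists. Returning 0.0 for an all-zero vector is an assumption made to avoid dividing by zero, not something the slides specify.

```python
import math
from typing import Sequence


def cosine_similarity(x: Sequence[float], y: Sequence[float]) -> float:
    """Cosine of the angle between x and y: dot product over product of norms."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # assumption: treat an empty document as dissimilar to all
    return dot / (norm_x * norm_y)


# Count vectors for "what you got" and "yes I like what you got".
v2 = [0, 0, 1, 0, 0, 1, 1, 0, 0]
v3 = [1, 0, 1, 0, 0, 1, 1, 1, 1]
similarity = cosine_similarity(v2, v3)
```

Because the vectors are divided by their norms, a long document and a short one with the same term proportions get a similarity of 1.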

Variations of the VSM

What should we include in V?

- Stoplists
- Phrases and ngrams
- Feature selection

How should we compute $s_v(d)$?

- Binary (Bernoulli)
- Term frequency (TF) (multinomial)
- Inverse Document Frequency (IDF)
- TF-IDF
- Length normalization
- Other …

What should we include in the Vocabulary?

Example:

- Document D1: “yes we got no bananas”
- Document D2: “what you got”
- Document D3: “yes I like what you got”

All three documents include “got” → it’s not very informative for discriminating between the documents.

In general, we’d like to include all and only informative features.

Zipf Distribution of Language

Languages contain a few high-frequency words, a large number of medium-frequency words, and a ton of low-frequency words.

High-frequency words are generally not indicative of one class or another, so they are not useful.

Low-frequency words are often very indicative of one class or another, but we may never (or rarely) see them during training → data sparsity.

Zipf’s Law

[Figure: frequencies of the top-10 most-common words in Project Gutenberg, next to a plot of C / rank(word).]

Zipf’s Law: log-log plot

[Figure: frequencies of the top-10 most-common words in Project Gutenberg, next to a plot of C / rank(word), with both axes on a logarithmic scale.]

Stop words and stop lists

A simple way to get rid of uninteresting features is to eliminate the high-frequency ones.

These are often called “stop words”, e.g., “the”, “of”, “you”, “got”, “was”, etc.

Systems often contain a list (“stop list”) of ~100 stop words, which are pruned from the vocabulary.

Pruning rare words

A less common trick, but sometimes useful: remove all words with frequency < K.

Typical values of K are small: 2 or 3, and in any case K < 10.

This can greatly reduce the size of the vocabulary, since (by Zipf’s Law) there are MANY terms with frequency < K.

However, for larger K, you start to throw out some potentially very useful features.

Beyond terms It would be great to include multi-word features like “New York”, rather than just “New” and “York” But: including all pairs of words, or all consecutive pairs of words, as features creates WAY too many to deal with, and many are very sparse. In order to include such features, we need to know more about feature selection (upcoming) 117


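Score for a feature in a document

Example: Document: “yes we got no bananas no bananas we got we got”

| | yes | we | got | no | bananas | what | you | I | like |
|---|---|---|---|---|---|---|---|---|---|
| Binary | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 |
| Term Frequency | 1 | 3 | 3 | 2 | 2 | 0 | 0 | 0 | 0 |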

Inverse Document Frequency

An alternative method of scoring a feature.

Intuition: words that are common to many documents are less informative, so give them less weight.

$$\text{IDF}(v) = \log \frac{\#\text{Documents}}{\#\text{Documents containing } v}$$

TF-IDF

Term Frequency Inverse Document Frequency

$$\text{TF-IDF}_v(d) = \text{TF}_v(d) \cdot \text{IDF}(v)$$

$$\text{TF-IDF}_v(d) = (\#\text{times } v \text{ occurs in } d) \cdot \log \frac{\#\text{Documents}}{\#\text{Documents containing } v}$$

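TF-IDF weighted vectors

Example:

- Document D1: “yes we got no bananas”
- Document D2: “what you got”
- Document D3: “yes I like what you got”

On the slide, only two entries are shown with their TF-IDF weights: “yes” in V1 (.18) and “like” in V3 (.48). The remaining entries are the unweighted counts.

| | yes | we | got | no | bananas | what | you | I | like |
|---|---|---|---|---|---|---|---|---|---|
| Vector V1 | .18 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 |
| Vector V2 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 |
| Vector V3 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | .48 |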
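The weights for the three example documents can be computed with the TF-IDF formula given earlier. The sketch assumes base-10 logarithms, which is the base that reproduces the .18 and .48 values on the slide, and whitespace tokenization; the function names are hypothetical.

```python
import math
from typing import List, Sequence


def idf(term: str, docs: Sequence[str]) -> float:
    """log10(#documents / #documents containing term).
    Assumes the term occurs in at least one document."""
    df = sum(1 for d in docs if term in d.split())
    return math.log10(len(docs) / df)


def tf_idf_vector(doc: str, vocab: Sequence[str], docs: Sequence[str]) -> List[float]:
    """TF-IDF score of every vocabulary term for one document."""
    terms = doc.split()
    return [terms.count(v) * idf(v, docs) for v in vocab]


docs = ["yes we got no bananas", "what you got", "yes I like what you got"]
vocab = ["yes", "we", "got", "no", "bananas", "what", "you", "I", "like"]
v1 = tf_idf_vector(docs[0], vocab, docs)
v3 = tf_idf_vector(docs[2], vocab, docs)
```

Note that “got”, which appears in every document, receives weight 0 in every vector, which matches the earlier observation that it is uninformative.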

Limitations

The vector space model has the following limitations:

- Long documents are poorly represented because they have poor similarity values.
- Search keywords must precisely match document terms; word substrings might result in a false positive match.
- Semantic sensitivity: documents with similar context but different term vocabulary won't be associated, resulting in a false negative match.
- The order in which the terms appear in the document is lost in the vector space representation.

Curse of Dimensionality

If the data x lies in a high-dimensional space, then an enormous amount of data is required to learn distributions, decision rules, or clusters.

Example: 50 dimensions, where each dimension has 2 possible values. This gives a total of $2^{50} \approx 10^{15}$ cells. But the number of data samples will be far less, so there will not be enough data samples to learn.

Dimensionality Reduction

- Goal: reduce the dimensionality of the space, while preserving distances.
- Basic idea: find the dimensions that have the most variation in the data, and eliminate the others.
- Many techniques (SVD, MDS).
- May or may not help.

Feature Selection and Dimensionality Reduction in NLP

- TF-IDF (reduces the weight of some features, increases the weight of others)
- Mutual Information (MI)
- Pointwise Mutual Information (PMI)
- Latent Semantic Analysis (next week)
- Information Gain (IG)
- Chi-square or other independence tests
- Pure frequency

Mutual Information

What makes “Shanghai” a good feature for classifying a document as being about “China”?

Intuition: four cases, (+Shanghai, +China), (+Shanghai, −China), (−Shanghai, +China), (−Shanghai, −China). How common is each?

If all four cases are equally common, MI = 0.

| | +China | −China |
|---|---|---|
| +Shanghai | X | X |
| −Shanghai | X | X |

MI grows when one (or two) case(s) becomes much more common than the others.

| | +China | −China |
|---|---|---|
| +Shanghai | 10X | 0 |
| −Shanghai | X | X |

That’s also the case where the feature is useful!

| | +China | −China |
|---|---|---|
| +Shanghai | 10X | ~0 |
| −Shanghai | X | X |

Mutual Information

$$\text{MI}(X; Y) = \sum_{x} \sum_{y} P(x, y) \log \frac{P(x, y)}{P(x)\, P(y)}$$
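For a binary feature and a binary class, mutual information can be computed from the four counts in the 2×2 table. This is a sketch assuming maximum-likelihood probability estimates and base-2 logarithms; the argument names (`n11` for documents with the feature and in the class, and so on) are hypothetical.

```python
import math


def mutual_information(n11: int, n10: int, n01: int, n00: int) -> float:
    """MI in bits between a binary feature and a binary class.

    n11: feature present, class present    n10: feature present, class absent
    n01: feature absent,  class present    n00: feature absent,  class absent
    """
    n = n11 + n10 + n01 + n00
    cells = [
        (n11, n11 + n10, n11 + n01),
        (n10, n11 + n10, n10 + n00),
        (n01, n01 + n00, n11 + n01),
        (n00, n01 + n00, n10 + n00),
    ]
    mi = 0.0
    for joint, feature_total, class_total in cells:
        if joint > 0:  # empty cells contribute 0 (0 log 0 = 0)
            mi += (joint / n) * math.log2(joint * n / (feature_total * class_total))
    return mi
```

With all four cells equal the result is 0, and it grows as the table becomes lopsided, matching the intuition in the preceding slides.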

Pointwise Mutual Information

What makes “Shanghai” a good feature for classifying a document as being about “China”?

PMI focuses on just the (+, +) case: How much more likely than chance is it for Shanghai to appear in a document about China?

| | +China | −China |
|---|---|---|
| +Shanghai | 10X | ~0 |
| −Shanghai | X | X |

Pointwise Mutual Information

$$\text{PMI}(x, y) = \log \frac{P(x, y)}{P(x)\, P(y)}$$
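PMI for the (+feature, +class) cell can be computed from the same 2×2 counts. The sketch assumes maximum-likelihood estimates and base-2 logarithms, and it requires `n11 > 0`, since the logarithm of zero is undefined; the argument names are hypothetical.

```python
import math


def pmi(n11: int, n10: int, n01: int, n00: int) -> float:
    """PMI (in bits) of 'feature present' with 'class present'.

    n11: feature present, class present    n10: feature present, class absent
    n01: feature absent,  class present    n00: feature absent,  class absent
    Assumes n11 > 0.
    """
    n = n11 + n10 + n01 + n00
    p_joint = n11 / n
    p_feature = (n11 + n10) / n
    p_class = (n11 + n01) / n
    return math.log2(p_joint / (p_feature * p_class))
```

A value of 0 means the feature and class co-occur exactly as often as chance predicts; positive values mean they co-occur more often than chance.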

Feature Engineering This is where domain experts and human judgement come into play. Not much to say …. except that it matters a lot, often more than choosing a classifier 134
