
1  Text Categorization
- Assigning documents to a fixed set of categories
- Applications:
  - Web pages
    - Recommending pages
    - Yahoo-like classification hierarchies
    - Categorizing bookmarks
  - Newsgroup messages / news feeds / micro-blog posts
    - Recommending messages, posts, tweets, etc.
    - Message filtering
  - News articles
    - Personalized news
  - Email messages
    - Routing
    - Folderizing
    - Spam filtering

2  Learning for Text Categorization
- Text categorization is an application of classification
- Typical learning algorithms:
  - Bayesian (naïve)
  - Neural network
  - Relevance feedback (Rocchio)
  - Nearest neighbor
  - Support vector machines (SVM)

3  Nearest-Neighbor Learning Algorithm
- Learning is just storing the representations of the training examples in the data set D
- For a test instance x:
  - Compute the similarity between x and all examples in D
  - Assign x the category of the most similar example in D
- Does not explicitly compute a generalization or category prototypes (i.e., no "modeling")
- Also called:
  - Case-based
  - Memory-based
  - Lazy learning

4  K Nearest-Neighbor
- Using only the single closest example to determine the categorization is subject to errors due to:
  - A single atypical example
  - Noise (i.e., an error) in the category label of a single training example
- A more robust alternative is to find the k most-similar examples and return the majority category of these k examples (see the majority-vote sketch below)
- The value of k is typically odd to avoid ties; 3 and 5 are the most common choices
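A minimal sketch of the majority-vote step, assuming the labels of the k nearest neighbors have already been retrieved (the function and variable names are illustrative, not from the slides):

```python
from collections import Counter

def majority_category(neighbor_labels):
    """Return the most common category among the k nearest neighbors.

    With an odd k and two categories a tie is impossible; with more
    categories, Counter.most_common breaks ties arbitrarily.
    """
    counts = Counter(neighbor_labels)
    return counts.most_common(1)[0][0]

# Example: k = 3 neighbors, two of which are labeled "sports"
print(majority_category(["sports", "politics", "sports"]))  # -> "sports"
```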

5  Similarity Metrics
- The nearest-neighbor method depends on a similarity (or distance) metric
- Simplest for a continuous m-dimensional instance space: Euclidean distance
- Simplest for an m-dimensional binary instance space: Hamming distance (the number of feature values that differ)
- For text, cosine similarity of TF-IDF-weighted vectors is typically most effective (see the sketch below)
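A small sketch of the three metrics mentioned above, written with plain Python lists (the function names are illustrative):

```python
import math

def euclidean_distance(x, y):
    """Euclidean distance between two continuous m-dimensional vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def hamming_distance(x, y):
    """Number of positions at which two binary feature vectors differ."""
    return sum(a != b for a, b in zip(x, y))

def cosine_similarity(x, y):
    """Cosine of the angle between two (e.g., TF-IDF) vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y) if norm_x and norm_y else 0.0
```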

6  Basic Automatic Text Processing
- Parse documents to recognize structure and meta-data
  - e.g., title, date, other fields, HTML tags, etc.
- Scan for word tokens
  - Lexical analysis to recognize keywords, numbers, special characters, etc.
- Stopword removal
  - Common words such as "the", "and", "or" which are not semantically meaningful in a document
- Stem words
  - Morphological processing to group word variants (e.g., "compute", "computer", "computing", "computes", ... can be represented by a single stem "comput" in the index); a toy pipeline covering tokenization, stopword removal, and stemming is sketched below
- Assign weights to words
  - Using frequency in documents and across documents
- Store the index
  - Stored as a term-document matrix ("inverted index") which represents each document as a vector of keyword weights
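A toy sketch of the tokenize, remove-stopwords, and stem steps, assuming a tiny stopword list and a crude suffix-stripping stemmer (a real system would use something like the Porter stemmer):

```python
import re

STOPWORDS = {"the", "and", "or", "a", "an", "of", "to", "in", "is"}  # tiny illustrative list
SUFFIXES = ("ing", "ers", "er", "es", "s")  # crude stand-in for a real stemmer

def tokenize(text):
    """Lexical analysis: lowercase and keep alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def stem(token):
    """Strip one common suffix; only a rough approximation of real stemming."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Tokenize, drop stopwords, and stem the remaining tokens."""
    return [stem(t) for t in tokenize(text) if t not in STOPWORDS]

print(preprocess("The computers and the computing servers"))
# -> ['comput', 'comput', 'serv']
```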

7  tf x idf Weights
- The tf x idf measure combines:
  - Term frequency (tf)
  - Inverse document frequency (idf) -- a way to deal with the problems of the Zipf distribution
- Recall the Zipf distribution
- We want to weight terms highly if they are:
  - Frequent in relevant documents ... BUT
  - Infrequent in the collection as a whole
- Goal: assign a tf x idf weight to each term in each document

8  tf x idf
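The formula on this slide did not survive the transcript (it was most likely an image); based on the worked example on slide 10, the weighting is presumably the standard one:

```latex
% Presumed weighting, consistent with the slide-10 example (idf = log2(N/df)):
%   w_{ij}  = weight of term i in document j
%   tf_{ij} = frequency of term i in document j
%   df_i    = number of documents containing term i,  N = total number of documents
w_{ij} = tf_{ij} \times \log_2\!\left(\frac{N}{df_i}\right)
```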

9  Inverse Document Frequency
- IDF provides high values for rare words and low values for common words
- For example, with the N = 6 documents of the next slide, a term appearing in 2 documents gets idf = log2(6/2) = 1.58, while a term appearing in every document gets idf = log2(6/6) = 0

10  tf x idf Example

The initial term x doc matrix (inverted index):

        Doc 1  Doc 2  Doc 3  Doc 4  Doc 5  Doc 6   df   idf = log2(N/df)
  T1      0      2      4      0      1      0      3        1.00
  T2      1      3      0      0      0      2      3        1.00
  T3      0      1      0      2      0      0      2        1.58
  T4      3      0      1      5      4      0      4        0.58
  T5      0      4      0      0      0      1      2        1.58
  T6      2      7      2      1      3      0      5        0.26
  T7      1      0      0      5      5      1      4        0.58
  T8      0      1      1      0      0      3      3        1.00

The tf x idf term x doc matrix (documents represented as vectors of weighted terms):

        Doc 1  Doc 2  Doc 3  Doc 4  Doc 5  Doc 6
  T1    0.00   2.00   4.00   0.00   1.00   0.00
  T2    1.00   3.00   0.00   0.00   0.00   2.00
  T3    0.00   1.58   0.00   3.17   0.00   0.00
  T4    1.75   0.00   0.58   2.92   2.34   0.00
  T5    0.00   6.34   0.00   0.00   0.00   1.58
  T6    0.53   1.84   0.53   0.26   0.79   0.00
  T7    0.58   0.00   0.00   2.92   2.92   0.58
  T8    0.00   1.00   1.00   0.00   0.00   3.00
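A short sketch that recomputes the table above from the raw term-frequency matrix (using numpy; the variable names are illustrative):

```python
import numpy as np

# Raw term frequencies: 8 terms (rows T1..T8) x 6 documents (columns Doc 1..Doc 6)
tf = np.array([
    [0, 2, 4, 0, 1, 0],   # T1
    [1, 3, 0, 0, 0, 2],   # T2
    [0, 1, 0, 2, 0, 0],   # T3
    [3, 0, 1, 5, 4, 0],   # T4
    [0, 4, 0, 0, 0, 1],   # T5
    [2, 7, 2, 1, 3, 0],   # T6
    [1, 0, 0, 5, 5, 1],   # T7
    [0, 1, 1, 0, 0, 3],   # T8
])

N = tf.shape[1]                      # number of documents (6)
df = np.count_nonzero(tf, axis=1)    # number of documents containing each term
idf = np.log2(N / df)                # idf = log2(N / df), as on the slide
tfidf = tf * idf[:, np.newaxis]      # weight each term's row by its idf

print(np.round(idf, 2))              # [1.   1.   1.58 0.58 1.58 0.26 0.58 1.  ]
print(np.round(tfidf, 2))            # matches the tf x idf matrix above
```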

11  K Nearest Neighbor for Text

Training:
  For each training example x ∈ D:
    Compute the corresponding TF-IDF vector, d_x, for document x

Test instance y:
  Compute the TF-IDF vector d for document y
  For each x ∈ D:
    Let s_x = cosSim(d, d_x)
  Sort the examples x in D by decreasing value of s_x
  Let N be the first k examples in D (the most similar neighbors)
  Return the majority class of the examples in N
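A compact, runnable sketch of this procedure, assuming scikit-learn's TfidfVectorizer for the TF-IDF vectors; the training documents and labels are made up for illustration:

```python
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def knn_classify(train_docs, train_labels, test_doc, k=3):
    """k-NN text categorization with cosine similarity over TF-IDF vectors."""
    vectorizer = TfidfVectorizer()
    d_train = vectorizer.fit_transform(train_docs)     # "training": just store the TF-IDF vectors
    d_test = vectorizer.transform([test_doc])          # TF-IDF vector d for the test document
    sims = cosine_similarity(d_test, d_train).ravel()  # s_x = cosSim(d, d_x) for every x in D
    neighbors = sims.argsort()[::-1][:k]               # indices of the k most similar examples
    votes = Counter(train_labels[i] for i in neighbors)
    return votes.most_common(1)[0][0]                  # majority class among the neighbors

# Illustrative toy data (not from the slides)
docs = ["stock market rises", "election results tonight",
        "quarterly earnings beat forecast", "candidates debate policy"]
labels = ["finance", "politics", "finance", "politics"]
print(knn_classify(docs, labels, "earnings report lifts stock", k=3))  # -> "finance"
```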

