An Introduction To Categorization Soam Acharya, PhD 1/15/2003.

1 An Introduction To Categorization Soam Acharya, PhD 1/15/2003

2 What is Categorization? {c_1 … c_m}: set of predefined categories {d_1 … d_n}: set of candidate documents Fill decision matrix with values {0,1} Categories are symbolic labels
Decision matrix (rows are categories, columns are documents):
        d_1    …    d_n
c_1     a_11   …    a_1n
…       …      …    …
c_m     a_m1   …    a_mn

3 Uses Document organization Document filtering Word sense disambiguation Web –Internet directories –Organization of search results Clustering

4 Categorization Techniques Knowledge systems Machine Learning

5 Knowledge Systems Manually build an expert system –Makes categorization judgments –Sequence of rules per category –If <condition> then category –If document contains "buena vista home entertainment" then document category is Home Video
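
The rule on this slide can be sketched as a tiny phrase-matching classifier. This is only an illustration under simple assumptions (case-insensitive substring matching), not a description of how any particular product works. The `RULES` table and the `classify` function are hypothetical names; the single rule is the one quoted on the slide.

```python
# Hypothetical rule table: category -> phrases that trigger it.
# Only the rule quoted on the slide is included.
RULES = {
    "Home Video": ["buena vista home entertainment"],
}


def classify(text, rules=RULES):
    """Return the sorted list of categories whose rules fire on the text."""
    lowered = text.lower()
    return sorted(
        category
        for category, phrases in rules.items()
        if any(phrase in lowered for phrase in phrases)
    )
```

Each new category needs its own hand-written phrases, which gives a feel for the build and tuning effort that the later "Knowledge System Issues" slide lists under scalability.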

6 UltraSeek Content Classification Engine

7 UltraSeek CCE

8 Knowledge System Issues Scalability –Build –Tune Requires Domain Experts Transferability

9 Machine Learning Approach Build a classifier for a category –Training set –Hierarchy of categories Submit candidate documents for automatic classification Expend effort in building a classifier, not in knowing the knowledge domain

10 Machine Learning Process [Diagram; labels: documents, DB, Document Pre-processing, Classifier Training, taxonomy, Training Set]

11 Training Set Initial corpus can be divided into: –Training set –Test set Role of workflow tools

12 Document Preprocessing Document Conversion: –Converts file formats (.doc, .ppt, .xls, .pdf, etc.) to text Tokenizing/Parsing: –Stemming –Document vectorization Dimension reduction

13 Document Vectorization Convert document text into bag of words Each document is a vector of n weighted terms Example document vector (term: weight): Federal express: 3, Severe: 3, Mountain: 2, Exactly: 1, Simple: 5, Flight: 2, Y2000-Q3: 1
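
A bag of words can be sketched in a few lines. The tokenizer below is an assumption (lowercase, alphanumeric runs only), and `bag_of_words` is a hypothetical helper name. It counts single words, so a multi-word term such as "Federal express" in the slide's example would need an extra phrase-detection step that is not shown here.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text, split it into alphanumeric tokens, and count them."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(tokens)
```

Word order is discarded: two documents with the same words in a different order map to the same vector.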

14 Document Vectorization Use tfidf function for term weighting tfidf value may be normalized –All vectors of equal length –[0,1]
$$\mathrm{tfidf}(t_k, d_j) = \#(t_k, d_j) \cdot \log \frac{|T_r|}{\#(t_k)}$$
where #(t_k, d_j) is the number of times t_k occurs in d_j, #(t_k) is the number of documents where t_k occurs at least once, and |T_r| is the cardinality of the training set
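
The weighting on this slide can be sketched as follows. The function name `tfidf_vectors` and the input layout (one token list per training document) are assumptions for illustration. Normalization is done by cosine (unit length) scaling, which is one common way to get vectors of equal length with weights in [0,1].

```python
import math


def tfidf_vectors(training_docs):
    """training_docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(training_docs)  # |T_r|, the cardinality of the training set

    # #(t_k): number of training documents in which the term occurs at least once.
    doc_freq = {}
    for tokens in training_docs:
        for term in set(tokens):
            doc_freq[term] = doc_freq.get(term, 0) + 1

    vectors = []
    for tokens in training_docs:
        weights = {}
        for term in set(tokens):
            term_count = tokens.count(term)  # #(t_k, d_j)
            weights[term] = term_count * math.log(n_docs / doc_freq[term])
        # Cosine normalization: scale the vector to unit length.
        norm = math.sqrt(sum(w * w for w in weights.values()))
        if norm > 0:
            weights = {t: w / norm for t, w in weights.items()}
        vectors.append(weights)
    return vectors
```

A term that occurs in every training document gets weight 0, because log(|T_r| / #(t_k)) is log 1.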

15 Dimension Reduction Reduce dimensionality of vector space Why? –Reduce computational complexity –Address overfitting problem Overtuning classifier How? –Feature selection –Feature extraction

16 Feature Selection Also known as term space reduction Remove stop words Identify best words to be used in categorizing per topic –Document frequency of terms Keep terms that occur in highest number of documents –Other measures Chi square Information gain
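
Selection by document frequency can be sketched as below. `select_by_document_frequency` is a hypothetical helper, and the stop-word list is whatever the caller supplies. Ties are broken alphabetically only to keep the output deterministic. Chi-square and information gain would replace the ranking key and are not shown.

```python
from collections import Counter


def select_by_document_frequency(docs, stop_words, keep):
    """docs: list of token lists. Return the `keep` terms found in the most documents."""
    stop = set(stop_words)
    doc_freq = Counter()
    for tokens in docs:
        # Count each term once per document, skipping stop words.
        doc_freq.update(set(tokens) - stop)
    ranked = sorted(doc_freq.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:keep]]
```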

17 Feature Extraction Synthesize new features from existing features Term clustering –Use clusters/centroids instead of terms –Co-occurrence and co-absence Latent Semantic Indexing –Compresses vectors into a lower dimensional space

18 Creating a Classifier Define a function, Categorization Status Value (CSV), that for a document d: –CSV_i : D -> [0,1] –Confidence that d belongs in c_i Boolean Probability Vector distance

19 Creating a Classifier Define a threshold, thresh, such that if CSV_i(d) > thresh(i) then categorize d under c_i, otherwise don't CSV thresholding –Fixed value across all categories –Vary per category Optimize via testing
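
Thresholding can be sketched as below. `assign_categories` is a hypothetical name. The `default` value stands in for a fixed threshold shared by all categories, and the `thresholds` dict holds per-category overrides. The value 0.5 is an arbitrary placeholder, since the slide says the thresholds should be optimized via testing.

```python
def assign_categories(csv_scores, thresholds=None, default=0.5):
    """csv_scores: {category: CSV value in [0, 1]} for one document.

    Returns the sorted categories whose score exceeds their threshold.
    """
    thresholds = thresholds or {}
    return sorted(
        category
        for category, score in csv_scores.items()
        if score > thresholds.get(category, default)
    )
```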

20 Naïve Bayes Classifier Probability of doc d_j belonging in category c_i Training set terms/weights present in d_j used to calculate probability of d_j belonging to c_i

21 Naïve Bayes Classifier If w_kj is binary (0, 1) and p_ki is short for P(w_kx = 1 | c_i) After further derivation, the original equation looks like: [Equation image not captured in transcript; its annotations: "Can be used for CSV", "Constants for all docs"]
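
The slide's equation is not reproduced in this transcript, so the sketch below assumes the usual binary-independence form of Naive Bayes. In that form the log-odds of d_j belonging to c_i splits into terms that are the same for every document plus a sum over the terms present in the document: sum over k of w_kj * log[ p_ki (1 - q_ki) / (q_ki (1 - p_ki)) ], where q_ki is P(w_kx = 1 | not c_i). Only that document-dependent sum is computed. The names `estimate_p` and `nb_csv` and the add-one smoothing are assumptions, not taken from the slides.

```python
import math


def estimate_p(docs, vocabulary):
    """Estimate P(w_k = 1 | class) per term from the documents of one class.

    docs: list of sets of terms. Add-one smoothing (an assumption here)
    keeps every estimate strictly between 0 and 1.
    """
    n = len(docs)
    return {t: (sum(t in d for d in docs) + 1) / (n + 2) for t in vocabulary}


def nb_csv(doc_terms, p_in, p_out):
    """Document-dependent part of the log-odds that the document is in the category.

    p_in[t]  = P(w_t = 1 | c_i), p_out[t] = P(w_t = 1 | not c_i).
    Terms outside the vocabulary are ignored.
    """
    score = 0.0
    for term in doc_terms:
        if term in p_in:
            p, q = p_in[term], p_out[term]
            score += math.log(p * (1 - q) / (q * (1 - p)))
    return score
```

The score is a log-odds value, so it is not confined to [0,1]. A positive value favours the category, and a threshold on it plays the role described on the thresholding slide.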

22 Naïve Bayes Classifier Independence assumption Feature selection can be counterproductive

23 k-NN Classifier Compute closeness between candidate documents and category documents Similarity between d_j and training set document d_z Confidence score indicating whether d_z belongs to category c_i

24 k-NN Classifier k nearest neighbors –Find k nearest neighbors from all training documents and use their categories –K can also indicate the number of top ranked training documents per category to compare against Similarity computation can be: –Inner product –Cosine coefficient
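
A minimal k-NN sketch, assuming documents are sparse {term: weight} dicts and each training document carries a single category label. `cosine` and `knn_scores` are hypothetical names. The score for a category is the summed cosine similarity of the k nearest training documents that belong to it, which is one common choice and not necessarily the exact formula behind the slides.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine coefficient between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def knn_scores(doc, training, k):
    """training: list of (vector, category) pairs.

    Returns {category: summed similarity over the k nearest training documents}.
    """
    ranked = sorted(
        ((cosine(doc, vec), cat) for vec, cat in training),
        key=lambda pair: pair[0],
        reverse=True,
    )
    scores = defaultdict(float)
    for similarity, category in ranked[:k]:
        scores[category] += similarity
    return dict(scores)
```

Every candidate document is compared against every training document, so classification cost grows with the size of the training set.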

25 Support Vector Machines Find the decision surface (hyperplane) that best separates data points in two classes Support vectors are the training docs that best define the hyperplane [Figure labels: Optimal hyperplane, Max. margin]

26 Support Vector Machines Training process involves finding the support vectors Only care about support vectors in the training set, not other documents

27 Neural Networks Train net to learn from a mapping of input words to a category One neural net per category –Too expensive One network overall Perceptron approach without a hidden layer Three layered

28 Classifier Committees Combine multiple classifiers Majority voting Category specialization Mixed results
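
Majority voting for a single category can be sketched as below. Each committee member is assumed to be a callable that returns True when it would file the document under the category. `committee_vote` and the toy members in the usage note are hypothetical.

```python
def committee_vote(classifiers, doc):
    """Return True if a strict majority of classifiers accept the document."""
    votes = sum(1 for classifier in classifiers if classifier(doc))
    return votes > len(classifiers) / 2
```

For example, with members `lambda d: "dvd" in d`, `lambda d: "video" in d` and `lambda d: len(d) > 100`, the text "new dvd video release" is accepted by two of three members and so by the committee.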

29 Classification Performance Category ranking evaluation –Recall = (categories found and correct) / (total categories correct) –Precision = (categories found and correct) / (total categories found) Micro and Macro averaging over categories
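
The two averaging schemes can be sketched as below, assuming per-category counts of true positives (found and correct), false positives (found but wrong) and false negatives (correct but missed). The function names are hypothetical. Macro averaging computes precision and recall per category and then averages them; micro averaging pools the counts first.

```python
def precision_recall(tp, fp, fn):
    """Precision = tp / (tp + fp); recall = tp / (tp + fn). Empty denominators give 0.0."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def micro_macro(counts):
    """counts: {category: (tp, fp, fn)}.

    Returns ((micro_precision, micro_recall), (macro_precision, macro_recall)).
    """
    per_category = [precision_recall(tp, fp, fn) for tp, fp, fn in counts.values()]
    n = len(per_category)
    macro = (
        sum(p for p, _ in per_category) / n,
        sum(r for _, r in per_category) / n,
    )
    total_tp = sum(tp for tp, _, _ in counts.values())
    total_fp = sum(fp for _, fp, _ in counts.values())
    total_fn = sum(fn for _, _, fn in counts.values())
    micro = precision_recall(total_tp, total_fp, total_fn)
    return micro, macro
```

Micro averages are dominated by categories with many documents, while macro averages give every category equal weight, so rare categories move the macro figures more.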

30 Classification Performance Hard Two studies –Yiming Yang, 1997 –Yiming Yang and Xin Liu, 1999 SVM, kNN >> Neural Net > Naïve Bayes Performance converges for common categories (with many training docs)

31 Computational Bottlenecks Quiver –# of topics –# of training documents –# of candidate documents

32 Categorization and the Internet Classification as a service –Standardizing vocabulary –Confidentiality –Performance Use of hypertext in categorization –Augment existing classifiers to take advantage

33 Hypertext and Categorization An already categorized document links to documents within same category Neighboring documents in a similar category Hierarchical nature of categories Metatags

34 Augmenting Classifiers Inject anchor text for a document into that document –Treat anchor text as separate terms Depends on dataset Mixed experimental results Links may be noisy –Ads –Navigation

35 Topics and the Web Topic distillation –Analysis of hyperlink graph structure Authorities –Popular pages Hubs –Links to authorities [Figure labels: hubs, authorities]

36 Topic Distillation Kleinberg's HITS algorithm An initial set of pages: root set –Use this to create an expanded set Weight propagation phase –Each node: authority score and hub score –Alternate Authority = sum of current hub weights of all nodes pointing to it Hub = sum of authority scores of all pages it points to –Normalize node scores and iterate until convergence Output is a set of hubs and authorities
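
The weight propagation phase can be sketched as below. This is a simplified illustration: the graph is passed in as a hypothetical `links` dict, the root-set expansion step is not modelled, and a fixed number of iterations stands in for a convergence test.

```python
import math


def hits(links, iterations=50):
    """links: {page: [pages it links to]}. Returns (hub_scores, authority_scores)."""
    nodes = set(links) | {q for targets in links.values() for q in targets}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}

    for _ in range(iterations):
        # Authority = sum of current hub weights of all nodes pointing to it.
        auth = {n: 0.0 for n in nodes}
        for page, targets in links.items():
            for target in targets:
                auth[target] += hub[page]

        # Hub = sum of authority scores of all pages it points to.
        hub = {n: sum(auth[q] for q in links.get(n, [])) for n in nodes}

        # Normalize both score vectors to unit length.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values()))
            if norm > 0:
                for n in scores:
                    scores[n] /= norm

    return hub, auth
```

With `links = {"h1": ["a", "b"], "h2": ["a"]}`, page "a" ends up as the strongest authority because both hubs point to it, and "h1" as the strongest hub because it points to both authorities.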

37 Conclusion Why Classify? The Classification Process Various Classifiers Which ones are better? Other applications
