Text Mining Application Programming Chapter 9 Text Categorization


1 Text Mining Application Programming Chapter 9 Text Categorization
Manu Konchady, 2006

3 Definition
A taxonomy is a classification of organisms into groups based on similarities in structure or origin.

4 Assignment of documents to categories

5 Categorization Problem
The problem of categorization can be described as the classification of documents into multiple categories. The n categories are predefined with specific keywords that differentiate each category from the others. The process of identifying these keywords is called feature extraction.

6 Documents are assigned to one or more categories based on the degree of similarity with a category description. A classifier uses a similarity measure to evaluate documents against categories to find the closest category.
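A minimal sketch of the classifier described on this slide, assuming a bag-of-words representation and cosine similarity as the similarity measure (the slide does not name a specific measure). The category names and descriptions are hypothetical.

```python
import math
from collections import Counter


def vectorize(text):
    """Turn text into a bag-of-words term-frequency vector."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def closest_category(document, category_descriptions):
    """Score the document against every category description and
    return the best-scoring category name together with all scores."""
    doc_vec = vectorize(document)
    scores = {
        name: cosine_similarity(doc_vec, vectorize(description))
        for name, description in category_descriptions.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical category descriptions, for illustration only.
CATEGORIES = {
    "sports": "game team score player match season",
    "finance": "market stock price bank investment money",
}

best, scores = closest_category("the team won the match", CATEGORIES)
print(best, scores)
```

Here the document shares two terms with the "sports" description and none with "finance", so "sports" is the closest category.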

7 Several questions unanswered
How many categories are sufficient for the collection? What is the maximum size for a category? Are categories organized in a flat or hierarchical structure? Should documents be assigned to one or more categories?

8 In a dynamic collection, it is difficult to predict the contents of all documents that will be added to the collection. If we have too few categories or the description of a category is very general, then the size of a category can be excessive. When categories are too specific, retrieval is harder: without knowledge of the specific keywords, it takes more time to find the right category. For a large set of categories, it makes sense to organize categories in a hierarchy.

9 The decision to assign a document to a category is usually made based on a measure of similarity with other documents or with a set of features of the category. When the similarity measure exceeds a threshold, the document is included in the category. The threshold is one of the control parameters used to create loosely or tightly focused categories.
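The effect of the threshold can be sketched as follows. This is an illustration under assumptions, not the author's method: similarity is taken to be the Jaccard overlap between the document's words and a category's feature set, and the feature sets are hypothetical.

```python
def jaccard(a, b):
    """Jaccard overlap between two collections of words (0.0 to 1.0)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def assign_categories(document, category_features, threshold):
    """Return every category whose similarity exceeds the threshold.
    A document may land in zero, one, or several categories."""
    words = document.lower().split()
    return sorted(
        name
        for name, features in category_features.items()
        if jaccard(words, features) > threshold
    )


# Hypothetical feature sets, for illustration only.
FEATURES = {
    "sports": {"team", "match", "score"},
    "finance": {"stock", "price", "money"},
}

doc = "team match price"
print(assign_categories(doc, FEATURES, threshold=0.1))  # loose categories
print(assign_categories(doc, FEATURES, threshold=0.3))  # tighter categories
```

With these inputs the similarities are 0.5 for "sports" and 0.2 for "finance", so a low threshold admits the document to both categories and a higher one keeps it in "sports" only.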

10 Striking a balance in the specificity of a category, so that it becomes neither too large nor too small, is difficult to do beforehand for a dynamic collection. Categories are periodically adjusted to match the current state of the document collection.

11 Filter Spam
Unsolicited mail. Junk mail. The first methods used to filter spam were simply lists of words that frequently occurred in spam: free, money, click, sex, and so on. Problem: ?
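A word-list filter of this kind can be sketched in a few lines. The word list is the one given on the slide; the tokenization (lowercased runs of letters) and the function names are assumptions made for the example.

```python
import re

# Word list taken from the slide.
SPAM_WORDS = {"free", "money", "click", "sex"}


def spam_word_hits(message, spam_words=SPAM_WORDS):
    """Return the listed words that occur in the message, sorted."""
    tokens = re.findall(r"[a-z]+", message.lower())
    return sorted(set(tokens) & spam_words)


def is_spam(message, spam_words=SPAM_WORDS):
    """Flag a message as spam if it contains any listed word."""
    return bool(spam_word_hits(message, spam_words))


print(is_spam("Click here for FREE money"))      # listed words present
print(is_spam("Lunch is free today, my treat"))  # legitimate mail, still flagged
print(is_spam("Fr33 m0ney"))                     # obfuscated spelling slips through
```

In this sketch, an ordinary message that happens to contain "free" is flagged, while a deliberately misspelled message is not. These are two plausible weaknesses of a fixed word list, offered as an illustration rather than as the slide's intended answer.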

12 Filter spam using a list of rules
Is the from Does the body of the message contain the word money? Check subject text for the word free.
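The rule list can be expressed as a set of named predicates over an email. This is a hedged sketch: the `Email` and `Rule` types are hypothetical, only the two rules that are fully stated on the slide are encoded, and the truncated "from" rule is left out because its condition is not recoverable.

```python
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Email:
    sender: str
    subject: str
    body: str


@dataclass
class Rule:
    name: str
    matches: Callable[[Email], bool]


def words(text: str) -> List[str]:
    """Lowercased alphabetic tokens of a text."""
    return re.findall(r"[a-z]+", text.lower())


# Only the two complete rules from the slide are encoded here.
RULES = [
    Rule("body contains 'money'", lambda e: "money" in words(e.body)),
    Rule("subject contains 'free'", lambda e: "free" in words(e.subject)),
]


def fired_rules(email: Email, rules: List[Rule] = RULES) -> List[str]:
    """Names of all rules that match the email, in rule order."""
    return [rule.name for rule in rules if rule.matches(email)]


spam = Email("a@example.com", "Free offer", "Send money now!")
ham = Email("b@example.com", "Meeting", "Agenda attached")
print(fired_rules(spam))
print(fired_rules(ham))
```

Keeping each rule as a separate named entry makes it easy to see which rule fired, and adding a new rule means appending another `Rule` to the list.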

13 One of the problems with rule-based systems is that new rules must be devised to handle dynamic data.

14 Email classification process

15 Features of Spam
Source domain of email. Number of non-alphanumeric characters in text. Location of word features. Number of recipients.
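The four features listed on this slide could be extracted roughly as follows. Several details are assumptions made for the sketch: the function signature, the choice to count only body characters that are neither alphanumeric nor whitespace, and the reduction of "location of word features" to whether a single probe word appears in the subject, the body, or not at all.

```python
import re


def extract_features(sender, recipients, subject, body, word="free"):
    """Compute the four slide features for one email as a dict."""
    # Source domain: the part of the sender address after the last '@'.
    domain = sender.rsplit("@", 1)[-1].lower() if "@" in sender else ""

    # Non-alphanumeric characters in the body, ignoring whitespace.
    non_alnum = sum(
        1 for ch in body if not ch.isalnum() and not ch.isspace()
    )

    # Location of the probe word: subject takes precedence over body.
    if word in re.findall(r"[a-z]+", subject.lower()):
        location = "subject"
    elif word in re.findall(r"[a-z]+", body.lower()):
        location = "body"
    else:
        location = "absent"

    return {
        "source_domain": domain,
        "non_alphanumeric_count": non_alnum,
        "word_location": location,
        "recipient_count": len(recipients),
    }


features = extract_features(
    sender="promo@deals.example",
    recipients=["a@x.example", "b@x.example"],
    subject="FREE gift!!!",
    body="Click now $$$ !!!",
)
print(features)
```

A feature dictionary like this is the kind of input a trained classifier would consume, in place of hand-written rules.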

16 Requirements for a spam detector
A good classifier for spam should have the following characteristics: it should be customizable; it must adapt to changes in the environment; and the process of training should be easy.

