
1 Text Feature Extraction

2 Text Classification
Text classification has many applications
–Spam email detection
–Automated tagging of streams of news articles, e.g., Google News
–Automated creation of Web-page taxonomies
Data Representation
–“Bag of words” is most commonly used: either counts or binary
–Can also use “phrases” for commonly occurring combinations of words
Classification Methods
–Naïve Bayes: widely used (e.g., for spam email)
   –Fast and reasonably accurate
–Support vector machines (SVMs)
   –Typically the most accurate method in research studies
   –But more complex computationally
–Logistic regression (regularized)
   –Not as widely used, but can be competitive with SVMs (e.g., Zhang and Oles, 2002)
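As a minimal sketch of the “bag of words” representation mentioned on this slide, the following builds both count and binary vectors. The whitespace tokenizer, the function names, and the two toy documents are assumptions made for illustration; they are not from the slides.

```python
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def bag_of_words(docs, binary=False):
    """Return (vocab, vectors); each vector is aligned with the sorted vocab."""
    counts = [Counter(tokenize(d)) for d in docs]
    vocab = sorted(set().union(*counts))
    vectors = []
    for c in counts:
        if binary:
            # Binary representation: 1 if the term occurs at all, else 0.
            vectors.append([1 if c[t] > 0 else 0 for t in vocab])
        else:
            # Count representation: number of occurrences of each term.
            vectors.append([c[t] for t in vocab])
    return vocab, vectors


# Toy documents, invented for this example.
docs = ["cheap pills cheap offer", "meeting agenda for monday"]
vocab, count_vectors = bag_of_words(docs)
_, binary_vectors = bag_of_words(docs, binary=True)
```

The two representations differ only for terms that repeat within a document, such as “cheap” in the first toy document.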

3 Further Reading on Text Classification
Web-related text mining in general
–S. Chakrabarti, Mining the Web: Discovering Knowledge from Hypertext Data, Morgan Kaufmann, 2003.
–See chapter 5 for discussion of text classification
General references on text and language modeling
–C. Manning and H. Schütze, Foundations of Statistical Natural Language Processing, MIT Press, 1999.
–D. Jurafsky and J. Martin, Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, Prentice Hall, 2000.
SVMs for text classification
–T. Joachims, Learning to Classify Text Using Support Vector Machines: Methods, Theory and Algorithms, Kluwer, 2002.

4 Common Data Sets used for Evaluation
Reuters
–10700 labeled documents
–10% of documents have multiple class labels
Yahoo! Science Hierarchy
–95 disjoint classes with 13,598 pages
20 Newsgroups data
–18800 labeled USENET postings
–20 leaf classes, 5 root-level classes
WebKB
–8300 documents in 7 categories such as “faculty”, “course”, “student”
Industry
–6449 home pages of companies partitioned into 71 classes

5 Trimming the Vocabulary
Stopword removal:
–Remove “non-content” words, i.e., very frequent “stop words” such as “the”, “and”, …
–Remove very rare words, e.g., words that occur only a few times in 100k documents
Stemming:
–Reduce all variants of a word to a single term
–E.g., {draw, drawing, drawings} -> “draw”
–The Porter stemming algorithm (1980) relies on a preconstructed suffix list with associated rules
   –E.g., if suffix = IZATION and the prefix contains at least one vowel followed by a consonant, replace with suffix = IZE
   –BINARIZATION => BINARIZE
This still often leaves p ~ O(10^4) terms => a very high-dimensional classification problem!
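The trimming steps above can be sketched as follows. This is an illustration under stated assumptions, not the full Porter algorithm: the stop-word list, the `min_doc_freq` threshold, and all function names are invented for the example, only the single IZATION -> IZE rule from the slide is implemented, and the letter “y” is always treated as a consonant.

```python
import re
from collections import Counter

# Illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"the", "and", "a", "of", "to", "in"}


def trim_vocabulary(tokenized_docs, stop_words=STOP_WORDS, min_doc_freq=2):
    """Keep terms that are not stop words and occur in >= min_doc_freq documents."""
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))  # count each term once per document
    return {t for t, df in doc_freq.items()
            if t not in stop_words and df >= min_doc_freq}


def measure(stem):
    """Count vowel-run-followed-by-consonant-run sequences in the stem."""
    pattern = "".join("V" if ch in "aeiou" else "C" for ch in stem.lower())
    collapsed = re.sub(r"(.)\1+", r"\1", pattern)  # e.g. "CCVVC" -> "CVC"
    return collapsed.count("VC")


def ization_to_ize(word):
    """Apply only the IZATION -> IZE rule described on the slide."""
    lower = word.lower()
    if lower.endswith("ization"):
        stem = lower[:-len("ization")]
        # The rule fires only if the stem has a vowel followed by a consonant.
        if measure(stem) > 0:
            return stem + "ize"
    return lower


# Toy data, invented for this example.
toy_docs = [["the", "stock", "rate"], ["the", "stock", "bond"], ["and", "rate", "zebra"]]
kept_terms = trim_vocabulary(toy_docs)
stemmed = ization_to_ize("BINARIZATION")
```

For real use, an established stemmer implementation is a better choice than a single hand-written rule; the sketch only shows how one suffix rule and its condition fit together.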

6 Feature Selection
Performance of text classification algorithms can be improved by selecting only a subset of the most discriminative terms
–See classification results later in these slides
Greedy search
–Start from the empty set or the full set and add/delete one term at a time
–Heuristics for adding/deleting
–Methods tend not to be particularly sensitive to the specific heuristic used for feature selection, but some form of feature selection often improves performance
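A minimal sketch of the greedy search described above, in its forward form (start from the empty set and add one term at a time). The `score` callable stands in for whatever heuristic is chosen; in practice it might be validation accuracy of a classifier trained on the subset. The toy `usefulness` weights and the per-term penalty below are invented purely to make the example runnable.

```python
def greedy_forward_selection(features, score, max_features):
    """Repeatedly add the single feature that most improves score(subset)."""
    selected = []
    best_score = score(selected)
    remaining = list(features)
    while remaining and len(selected) < max_features:
        # Evaluate every one-feature extension of the current subset.
        candidates = [(score(selected + [f]), f) for f in remaining]
        top_score, top_feature = max(candidates, key=lambda pair: pair[0])
        if top_score <= best_score:
            break  # no remaining feature improves the score
        selected.append(top_feature)
        remaining.remove(top_feature)
        best_score = top_score
    return selected


# Hypothetical per-term usefulness, invented for this example.
usefulness = {"stocks": 0.30, "rate": 0.20, "interest": 0.15, "the": 0.01}


def toy_score(subset):
    """Toy heuristic: summed usefulness minus a small cost per selected term."""
    return sum(usefulness[f] for f in subset) - 0.05 * len(subset)


chosen = greedy_forward_selection(usefulness, toy_score, max_features=4)
```

The backward variant is symmetric: start from the full set and delete the term whose removal hurts the score least.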

7 Example of Role of Feature Selection
–9600 documents from US Patent database
–20,000 raw features (terms)

8 Classifying Term Vectors
Typically, several different words may be helpful in identifying a particular class, e.g.,
–Class = “finance”
–Words = “stocks”, “return”, “interest”, “rate”, etc.
–Thus, classifiers that combine multiple features often do well, e.g., Naïve Bayes, logistic regression, SVMs, etc.
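To illustrate how a classifier combines evidence from several words, here is a small multinomial Naïve Bayes sketch with add-one (Laplace) smoothing. The four training documents, the “sports” class, the function names, and the smoothing constant `alpha` are assumptions made for this example; the slides do not specify an implementation.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labeled_docs, alpha=1.0):
    """labeled_docs: list of (tokens, label). Returns (log_priors, log_likelihoods, vocab)."""
    class_doc_counts = Counter()
    term_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in labeled_docs:
        class_doc_counts[label] += 1
        term_counts[label].update(tokens)
        vocab.update(tokens)
    total_docs = sum(class_doc_counts.values())
    log_priors = {c: math.log(n / total_docs) for c, n in class_doc_counts.items()}
    log_likelihoods = {}
    for c in class_doc_counts:
        # Smoothed estimate of P(term | class).
        denom = sum(term_counts[c].values()) + alpha * len(vocab)
        log_likelihoods[c] = {
            t: math.log((term_counts[c][t] + alpha) / denom) for t in vocab
        }
    return log_priors, log_likelihoods, vocab


def classify(tokens, model):
    """Return the class with the highest log posterior; unseen terms are ignored."""
    log_priors, log_likelihoods, vocab = model
    scores = {
        c: log_priors[c] + sum(log_likelihoods[c][t] for t in tokens if t in vocab)
        for c in log_priors
    }
    return max(scores, key=scores.get)


# Toy training set, invented for this example.
training = [
    (["stocks", "return", "interest", "rate"], "finance"),
    (["interest", "rate", "rise"], "finance"),
    (["team", "win", "match"], "sports"),
    (["match", "score", "team"], "sports"),
]
model = train_naive_bayes(training)
predicted = classify(["interest", "rate", "stocks"], model)
```

Each word contributes its own log-likelihood term, so several weakly indicative words can together give a confident decision.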

9 In-Class Practice: Format Your Own Text Data
Data
–Your own collected text data
Method
–Stop-word removal
–Stemming
–Keyword frequency calculation
Software
–Write code or use a text editor

10 Format Your Own Text Data
Requirements
–File format: plain text
–Length of sample: maximum length for one instance is 250 words
–Delimiter: single space
–Data cleaning: stop words removed
–Class label: folder name
–Example: Text_Example.txt provided on Moodle
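One way to produce data in the required format is sketched below. Several details are assumptions, since the slide does not settle them: the stop-word list is illustrative, text is lowercased, the 250-word limit is applied after stop-word removal, and the function names and the `root/<class_label>/*.txt` folder layout are hypothetical.

```python
from pathlib import Path

# Illustrative stop-word list (an assumption, not the course's list).
STOP_WORDS = {"the", "and", "a", "of", "to", "in", "is"}
MAX_WORDS = 250  # maximum length for one instance, per the requirements


def format_instance(raw_text, stop_words=STOP_WORDS, max_words=MAX_WORDS):
    """Remove stop words, truncate to max_words, and join with single spaces."""
    tokens = [t for t in raw_text.lower().split() if t not in stop_words]
    return " ".join(tokens[:max_words])


def load_dataset(root):
    """Read root/<class_label>/*.txt and return a list of (text, class_label)."""
    dataset = []
    class_dirs = sorted(p for p in Path(root).iterdir() if p.is_dir())
    for class_dir in class_dirs:
        for txt_file in sorted(class_dir.glob("*.txt")):
            text = txt_file.read_text(encoding="utf-8")
            # The folder name serves as the class label.
            dataset.append((format_instance(text), class_dir.name))
    return dataset


example = format_instance("The price of  the stock\nis rising")
```

Calling `load_dataset` on a folder that contains one subfolder per class returns the cleaned instances paired with their folder-name labels.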

