Text Feature Extraction


Text Classification

Text classification has many applications
  – Spam email detection
  – Automated tagging of streams of news articles, e.g., Google News
  – Automated creation of Web-page taxonomies

Data Representation
  – “Bag of words” most commonly used: either counts or binary
  – Can also use “phrases” for commonly occurring combinations of words

Classification Methods
  – Naïve Bayes: widely used (e.g., for spam email)
      Fast and reasonably accurate
  – Support vector machines (SVMs)
      Typically the most accurate method in research studies
      But more complex computationally
  – Logistic regression (regularized)
      Not as widely used, but can be competitive with SVMs (e.g., Zhang and Oles, 2002)
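
The “bag of words” representation above can be sketched in a few lines of Python. This is a minimal illustration, not the method used in the lecture: the tokenizer, the function names and the two sample documents are all assumptions made for the example. It builds one vocabulary over the collection and produces both the count and the binary variant of each document vector.

```python
from collections import Counter

def tokenize(text):
    """Lowercase, split on whitespace, strip surrounding punctuation (a simplistic assumption)."""
    tokens = (tok.strip(".,;:!?\"'()") for tok in text.lower().split())
    return [tok for tok in tokens if tok]

def bag_of_words(docs):
    """Return (vocabulary, count_vectors, binary_vectors) for a list of strings."""
    tokenized = [tokenize(d) for d in docs]
    # One shared, sorted vocabulary defines the vector dimensions.
    vocab = sorted({tok for toks in tokenized for tok in toks})
    count_vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        count_vectors.append([counts[term] for term in vocab])
    # Binary variant: 1 if the term occurs at all, 0 otherwise.
    binary_vectors = [[1 if c > 0 else 0 for c in vec] for vec in count_vectors]
    return vocab, count_vectors, binary_vectors

# Hypothetical two-document collection.
docs = ["Buy cheap stocks now, buy now!", "Interest rate news"]
vocab, counts, binary = bag_of_words(docs)
print(vocab)
print(counts[0])
print(binary[0])
```

Word order is discarded in both variants; only which terms occur, and how often, is kept.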

Further Reading on Text Classification

Web-related text mining in general
  – S. Chakrabarti, Mining the Web: Discovering Knowledge from Hypertext Data, Morgan Kaufmann
  – See chapter 5 for discussion of text classification

General references on text and language modeling
  – C. Manning and H. Schütze, Foundations of Statistical Natural Language Processing, MIT Press
  – Dan Jurafsky and James Martin, Speech and Language Processing: An Introduction to Natural Language Processing, Prentice Hall

SVMs for text classification
  – T. Joachims, Learning to Classify Text using Support Vector Machines: Methods, Theory and Algorithms, Kluwer, 2002

Common Data Sets Used for Evaluation

Reuters
  – 10,700 labeled documents
  – 10% of documents have multiple class labels
Yahoo! Science Hierarchy
  – 95 disjoint classes with 13,598 pages
20 Newsgroups data
  – 18,800 labeled USENET postings
  – 20 leaf classes, 5 root-level classes
WebKB
  – 8,300 documents in 7 categories such as “faculty”, “course”, “student”
Industry
  – 6,449 home pages of companies partitioned into 71 classes

Trimming the Vocabulary

Stopword removal:
  – Remove “non-content” words: very frequent “stop words” such as “the”, “and”, ...
  – Remove very rare words, e.g., those that occur only a few times in 100k documents

Stemming:
  – Reduce all variants of a word to a single term
  – E.g., {draw, drawing, drawings} -> “draw”
  – Porter stemming algorithm (1980) relies on a preconstructed suffix list with associated rules
      e.g., if suffix=IZATION and prefix contains at least one vowel followed by a consonant, replace with suffix=IZE
      BINARIZATION => BINARIZE

This still often leaves p ~ O(10^4) terms => a very high-dimensional classification problem!
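
The trimming steps above can be sketched as follows. This is an illustration under stated assumptions, not the Porter stemmer: the stop-word set is a tiny made-up list, only the single IZATION -> IZE rule quoted on the slide is implemented, and the letter “y” is always treated as a consonant. The function names are hypothetical.

```python
import re
from collections import Counter

STOP_WORDS = {"the", "and", "a", "of", "to", "in", "is"}  # illustrative, not a standard list
VOWELS = set("aeiou")  # simplification: "y" is always treated as a consonant

def has_vowel_then_consonant(stem):
    """True if the stem contains a vowel immediately followed by a consonant."""
    return any(a in VOWELS and b not in VOWELS for a, b in zip(stem, stem[1:]))

def stem_ization(word):
    """Apply only the IZATION -> IZE rule; every other word passes through unchanged."""
    suffix = "ization"
    if word.endswith(suffix):
        stem = word[: -len(suffix)]
        if has_vowel_then_consonant(stem):
            return stem + "ize"
    return word

def preprocess(text):
    """Lowercase, keep alphabetic tokens, drop stop words, apply the single stemming rule."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stem_ization(t) for t in tokens if t not in STOP_WORDS]

def drop_rare(tokenized_docs, min_df=2):
    """Remove terms that occur in fewer than min_df documents."""
    doc_freq = Counter(t for doc in tokenized_docs for t in set(doc))
    return [[t for t in doc if doc_freq[t] >= min_df] for doc in tokenized_docs]

tokens = preprocess("The binarization of the image")
print(tokens)
```

A full stemmer applies many such rules in sequence; an existing library implementation is usually preferable to hand-written rules.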

Feature Selection

Performance of text classification algorithms can often be improved by selecting only a subset of discriminative terms
  – See classification results later in these slides

Greedy search
  – Start from the empty set or the full set and add/delete one term at a time
  – Heuristics for adding/deleting
  – Methods tend not to be particularly sensitive to the specific heuristic used for feature selection, but some form of feature selection often improves performance
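
A greedy forward search of the kind described above might look like the sketch below. It starts from the empty set and adds, one at a time, the term that most improves a score. The scoring function, the toy documents and the “predict positive if any selected term occurs” rule are assumptions made for illustration; in practice the score would typically be a classifier's accuracy on held-out data.

```python
def greedy_forward_selection(features, score, max_features):
    """Add one feature at a time, keeping the one that raises the score most."""
    selected = []
    best_score = score(selected)
    remaining = list(features)
    while remaining and len(selected) < max_features:
        candidates = [(score(selected + [f]), f) for f in remaining]
        top_score, top_feature = max(candidates, key=lambda pair: pair[0])
        if top_score <= best_score:
            break  # no single addition helps any more
        selected.append(top_feature)
        remaining.remove(top_feature)
        best_score = top_score
    return selected, best_score

# Toy labeled collection: (set of terms, label), with label 1 = "finance".
DOCS = [({"stocks", "the"}, 1), ({"rate", "the"}, 1), ({"goal", "the"}, 0), ({"match"}, 0)]

def toy_accuracy(selected):
    """Accuracy of the rule: predict 1 if the document contains any selected term."""
    correct = sum((any(t in doc for t in selected)) == bool(label) for doc, label in DOCS)
    return correct / len(DOCS)

features = ["goal", "match", "rate", "stocks", "the"]
selected, best = greedy_forward_selection(features, toy_accuracy, max_features=3)
print(selected, best)
```

Backward elimination follows the same pattern, starting from the full set and deleting one term per step.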

Example of Role of Feature Selection

  – 9600 documents from US Patent database
  – 20,000 raw features (terms)

Classifying Term Vectors

Typically several different words help identify a particular class, e.g.,
  – Class = “finance”
  – Words = “stocks”, “return”, “interest”, “rate”, etc.
  – Thus, classifiers that combine multiple features often do well, e.g., Naïve Bayes, logistic regression, SVMs, etc.
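
To make the “combine multiple features” point concrete, here is a small multinomial Naïve Bayes sketch with add-one smoothing. The training documents, the “sports” class and the function names are invented for the example; it is a teaching sketch, not a production classifier.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labeled_docs, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods from (tokens, label) pairs."""
    class_doc_counts = Counter()
    term_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in labeled_docs:
        class_doc_counts[label] += 1
        term_counts[label].update(tokens)
        vocab.update(tokens)
    n_docs = sum(class_doc_counts.values())
    model = {}
    for label, n in class_doc_counts.items():
        denom = sum(term_counts[label].values()) + alpha * len(vocab)
        model[label] = {
            "log_prior": math.log(n / n_docs),
            "log_likelihood": {
                t: math.log((term_counts[label][t] + alpha) / denom) for t in vocab
            },
        }
    return model

def predict(model, tokens):
    """Return the class with the highest summed log score; unseen terms are ignored."""
    def score(label):
        params = model[label]
        return params["log_prior"] + sum(
            params["log_likelihood"][t] for t in tokens if t in params["log_likelihood"]
        )
    return max(model, key=score)

# Hypothetical training data.
train = [
    (["stocks", "return", "rate"], "finance"),
    (["interest", "rate", "stocks"], "finance"),
    (["goal", "match", "team"], "sports"),
    (["team", "match", "score"], "sports"),
]
model = train_naive_bayes(train)
label = predict(model, ["interest", "return", "team"])
print(label)
```

In the test document, “team” alone points to “sports”, but the combined evidence from “interest” and “return” outweighs it.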

In-Class Practice: Format Your Own Text Data

Data
  – Your own collected text data
Method
  – Stop word removal
  – Stemming
  – Keyword frequency calculation
Software
  – Write code or use a text editor

Format Your Own Text Data: Requirements

  – File format: pure text
  – Length of sample: maximum length for one instance is 250 words
  – Delimiter: single space
  – Data cleaning: stop words removed
  – Class label: folder name
  – Example: Text_Example.txt provided on Moodle
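
One way to meet these requirements in code is sketched below. The stop-word set, the example path and the function names are hypothetical, and the choice to truncate an over-long text to its first 250 words (rather than split it into several instances) is an assumption; the layout of Text_Example.txt is not reproduced here.

```python
from pathlib import PurePath

STOP_WORDS = {"the", "and", "a", "of", "to", "in", "is"}  # illustrative, not a standard list
MAX_WORDS = 250  # maximum length of one instance, per the requirements

def format_instance(raw_text, stop_words=STOP_WORDS, max_words=MAX_WORDS):
    """Drop stop words, cap the length, and join the words with single spaces."""
    words = [w for w in raw_text.split() if w.lower() not in stop_words]
    return " ".join(words[:max_words])  # assumption: truncate rather than split

def class_label(file_path):
    """The class label is the name of the folder that contains the file."""
    return PurePath(file_path).parent.name

instance = format_instance("The  stocks\nand the   rate")
label = class_label("data/finance/doc1.txt")  # hypothetical path
print(instance, label)
```

Splitting on any whitespace and re-joining with one space also removes tabs and line breaks, which keeps the output pure text with a single delimiter.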