Text Mining Application Programming Chapter 9 Text Categorization

Text Mining Application Programming Chapter 9 Text Categorization Manu Konchady, 2006

Definition A taxonomy is a classification of organisms into groups based on similarities in structure or origin.

Assignment of documents to categories

Categorization Problem The problem of categorization can be described as the classification of documents into multiple categories. The n categories are predefined with specific keywords that differentiate each category from the others. The process of identifying these keywords is called feature extraction.

Documents are assigned to one or more categories based on the degree of similarity with a category description. A classifier uses a similarity measure to evaluate documents against categories to find the closest category.
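The idea of evaluating a document against category descriptions can be sketched as follows. This is only an illustration, not the method used in the book: it assumes a bag-of-words representation and cosine similarity as the similarity measure, and the category descriptions and the names term_vector, cosine_similarity and closest_category are hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical category descriptions, invented for this illustration.
CATEGORY_DESCRIPTIONS = {
    "sports": "football soccer match team score league player",
    "finance": "stock market bank interest loan investment price",
    "health": "doctor medicine hospital patient disease treatment",
}


def term_vector(text):
    """Lowercase the text and count its alphabetic word tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors (0.0 if either is empty)."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def closest_category(document, descriptions=CATEGORY_DESCRIPTIONS):
    """Return the best-scoring category name and the score of every category."""
    doc_vector = term_vector(document)
    scores = {
        name: cosine_similarity(doc_vector, term_vector(description))
        for name, description in descriptions.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


best, scores = closest_category("The bank raised the interest rate on every loan")
print(best, scores)
```

A real classifier would normally weight terms (for example by how rare they are in the collection) rather than use raw counts; raw counts are used here only to keep the sketch short.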

Several questions remain unanswered: How many categories are sufficient for the collection? What is the maximum size for a category? Are categories organized in a flat or hierarchical organization? Should documents be assigned to one or more categories?

In a dynamic collection, it is difficult to predict the contents of all documents that will be added to the collection. If we have too few categories or the description of a category is very general, then the size of a category can be excessive. When categories are too specific, retrieval is harder without knowledge of the specific keywords, and it takes more time to find the right category. For a large set of categories, it makes sense to organize categories in a hierarchy.

The decision to assign a document to a category is usually made based on a measure of similarity with other documents or a set of features of the category. When the similarity measure exceeds a threshold, a document is included in the category. The threshold is one of the control parameters used to create loosely or tightly focused categories.
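The effect of the threshold can be seen in a small sketch. It assumes, purely for illustration, that each category is described by a set of feature words and that similarity is the fraction of those words found in the document; the feature sets and the names feature_overlap and assign_categories are hypothetical.

```python
import re

# Hypothetical feature words per category, invented for this illustration.
CATEGORY_FEATURES = {
    "sports": {"football", "match", "team", "score", "league"},
    "finance": {"stock", "market", "bank", "loan", "price"},
}


def feature_overlap(document, features):
    """Fraction of a category's feature words that occur in the document."""
    words = set(re.findall(r"[a-z]+", document.lower()))
    return len(words & features) / len(features)


def assign_categories(document, threshold, categories=CATEGORY_FEATURES):
    """Return every category whose similarity exceeds the threshold."""
    return sorted(
        name
        for name, features in categories.items()
        if feature_overlap(document, features) > threshold
    )


doc = "The team owner sold stock after the match to repay a bank loan"
print(assign_categories(doc, threshold=0.3))  # loose: more categories accepted
print(assign_categories(doc, threshold=0.5))  # tighter: fewer categories accepted
```

Raising the threshold gives tightly focused categories at the cost of leaving some documents unassigned; lowering it gives looser categories and allows a document to fall into several of them.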

For a dynamic collection, it is difficult to predict beforehand how specific a category should be so that it becomes neither too large nor too small. Categories are therefore periodically adjusted to match the current state of the document collection.

Filter Email Spam: unsolicited mail, junk mail. The first methods to filter spam were simply lists of words that frequently occurred in spam: free, money, click, sex, and so on. Problem?

Filter spam using a list of rules Is the email from someone@spam.com? Does the body of the message contain the word money? Check subject text for the word free.
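The three rules on this slide can be written down directly. The sketch below is only an illustration: the message is assumed to be a plain dictionary with sender, subject and body keys, and the names contains_word, RULES, matched_rules and is_spam are hypothetical.

```python
import re


def contains_word(text, word):
    """True if `word` occurs in `text` as a whole word, ignoring case."""
    return re.search(r"\b" + re.escape(word) + r"\b", text, re.IGNORECASE) is not None


# Each rule is a (description, test) pair; the tests mirror the slide.
RULES = [
    ("sender is someone@spam.com", lambda m: m["sender"].lower() == "someone@spam.com"),
    ("body contains 'money'", lambda m: contains_word(m["body"], "money")),
    ("subject contains 'free'", lambda m: contains_word(m["subject"], "free")),
]


def matched_rules(message):
    """Return the descriptions of all rules that fire for this message."""
    return [description for description, test in RULES if test(message)]


def is_spam(message):
    """Flag the message as spam if at least one rule fires."""
    return bool(matched_rules(message))


legitimate = {
    "sender": "friend@example.org",
    "subject": "Are you free for lunch?",
    "body": "See you at noon.",
}
print(is_spam(legitimate), matched_rules(legitimate))
```

The example message is legitimate but is still flagged because its subject contains the word free, which shows how brittle a fixed rule list can be.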

One of the problems with rule-based systems is that new rules must be devised to handle dynamic data.

Email classification process

Features of Spam: source domain of email; number of non-alphanumeric characters in email text; location of word features; number of email recipients.
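A sketch of how these four features might be computed is given below. Several choices are assumptions made for illustration and do not come from the slides: whitespace is not counted as a non-alphanumeric character, "location of word features" is read as whether a listed word appears in the subject or in the body, and the word list and the name extract_features are hypothetical.

```python
import re


def extract_features(sender, recipients, subject, body,
                     spam_words=("free", "money", "click")):
    """Turn one email into a dictionary of simple spam features."""

    def has_spam_word(text):
        # Whole-word, case-insensitive match against the (assumed) word list.
        return any(
            re.search(r"\b" + re.escape(word) + r"\b", text, re.IGNORECASE)
            for word in spam_words
        )

    return {
        # Source domain: everything after the last "@" in the sender address.
        "source_domain": sender.rsplit("@", 1)[-1].lower(),
        # Characters that are neither letters, digits nor whitespace (assumption).
        "non_alphanumeric_count": sum(
            1 for ch in body if not ch.isalnum() and not ch.isspace()
        ),
        # Location of word features: subject versus body (assumption).
        "spam_word_in_subject": has_spam_word(subject),
        "spam_word_in_body": has_spam_word(body),
        "recipient_count": len(recipients),
    }


features = extract_features(
    sender="Promo@Spam.com",
    recipients=["a@example.org", "b@example.org"],
    subject="FREE prize!!!",
    body="Click here $$$ now!",
)
print(features)
```

A feature dictionary of this kind is what a trained classifier would consume in place of hand-written rules.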

Requirements for a spam detector A good classifier for spam should have the following characteristics: It should be customizable. The classifier must adapt to changes in the environment. The process of training should be easy.
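One way to meet these requirements is a classifier that learns from labeled messages and can be retrained at any time. The sketch below uses a small naive Bayes classifier, which is a common choice for spam filtering but is not named on these slides, so treat it as a substituted technique. The class name NaiveBayesSpamFilter, the "spam" and "ham" labels, and the smoothing scheme are assumptions.

```python
import math
import re
from collections import Counter


class NaiveBayesSpamFilter:
    """Minimal word-count naive Bayes classifier with two labels: spam and ham."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.message_counts = {"spam": 0, "ham": 0}

    @staticmethod
    def tokenize(text):
        return re.findall(r"[a-z]+", text.lower())

    def train(self, text, label):
        """Add one labeled message; can be called at any time to adapt the filter."""
        self.word_counts[label].update(self.tokenize(text))
        self.message_counts[label] += 1

    def classify(self, text):
        """Return the label with the higher log-probability score."""
        vocabulary = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_messages = sum(self.message_counts.values())
        scores = {}
        for label in ("spam", "ham"):
            # Smoothed prior so that an untrained label never has probability 0.
            score = math.log((self.message_counts[label] + 1) / (total_messages + 2))
            total_words = sum(self.word_counts[label].values())
            # Add-one smoothing, with one extra slot reserved for unseen words.
            denominator = total_words + len(vocabulary) + 1
            for word in self.tokenize(text):
                score += math.log((self.word_counts[label][word] + 1) / denominator)
            scores[label] = score
        return max(scores, key=scores.get)


spam_filter = NaiveBayesSpamFilter()
spam_filter.train("free money click now", "spam")
spam_filter.train("win free prize money", "spam")
spam_filter.train("meeting agenda for monday", "ham")
spam_filter.train("lunch on monday with the team", "ham")
print(spam_filter.classify("claim your free money"))
print(spam_filter.classify("agenda for the team meeting"))
```

Training is a single call per message, so each user can customize the filter with their own mail, and feeding it newly labeled messages lets it follow changes in what spam looks like.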