Presentation transcript:

Goal: Learn to automatically
- File e-mails into folders
- Filter spam e-mail

Motivation
- Information overload: we are spending more and more time filtering e-mails and organizing them into folders in order to facilitate retrieval when necessary
- Weaknesses of the programmable automatic filtering provided by modern e-mail software (rules to organize mail into folders, or spam filtering based on keywords):
  - Most users do not create such rules, as they find the software difficult to use or simply avoid customizing it
  - Manually constructing robust rules is difficult, as users are constantly creating, deleting and reorganizing their folders
  - The nature of the e-mails within a folder may well drift over time, and the characteristics of spam (e.g. topics, frequent terms) also change over time => the rules must be constantly tuned by the user, which is time-consuming and error-prone

LINGER: Based on Text Categorization
- Bag-of-words representation: all unique words in the entire training corpus are identified
- Feature selection chooses the most important words and reduces dimensionality: Information Gain (IG), Variance (V)
- Feature representation: a normalized weighting for every word, representing its importance in the document
  - Weightings: binary, term frequency, term frequency-inverse document frequency (tf-idf)
  - Normalization at 3 levels (e-mail, mailbox, corpus)
- Classifier: a neural network (NN). Why? NNs require considerable time for parameter selection and training (-), but can achieve very accurate results and have been successfully applied in many real-world applications (+)
  - The NN is trained with backpropagation; 1 output for each class, 1 hidden layer of neurons; early stopping based on a validation set or a maximum number of epochs (10,000)
- (A runnable sketch of this pipeline follows the folder-filing results below.)

Pre-processing for Word Extraction
- E-mail fields used: body, sender (From, Reply-To), recipient (To, CC, BCC) and Subject (attachments are treated as part of the body)
  - These fields are treated equally and a single bag-of-words representation is created for each e-mail
  - No stemming or stop-word removal was applied
- Words that appear only once in a corpus are discarded
- Words longer than 20 characters are removed from the body
- The number of unique words in a corpus is thereby reduced from 9000 to 1000

Corpora
- Filing into folders: personal mailboxes of users U1-U5 [corpus statistics table not preserved in the transcript]
- Spam filtering: 4 versions of PU1 and LingSpam, depending on whether stemming and a stop-word list were used (bare, lemm, stop and lemm_stop)

LINGER – A Smart Personal Assistant for E-mail Classification
James Clark, Irena Koprinska, and Josiah Poon
School of Information Technologies, University of Sydney, Sydney, Australia, {jclark, irena,

Results

Performance Measures
- Accuracy (A), recall (R), precision (P) and the F1 measure
- Stratified 10-fold cross-validation

Filing Into Folders: Overall Performance
- The simpler feature selector, V, is more effective than IG
- U2 and U4 were harder to classify than U1, U3 and U5:
  - Different classification styles: U1, U3 and U5 file based on the topic and sender; U2 on the action performed (e.g. Read&Keep); U4 on the topic, sender, action performed and also on when e-mails need to be acted upon (e.g. ThisWeek)
  - A large ratio of the number of folders to the number of e-mails for U2 and U4
- Comparison with other classifiers [results table not preserved in the transcript]
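The pipeline described above (bag-of-words, tf-idf weighting, feature selection, a one-hidden-layer network trained by backpropagation with early stopping, evaluated by stratified 10-fold cross-validation) can be sketched in Python with scikit-learn. This is a minimal illustration, not the authors' implementation: the toy corpus is invented, mutual information stands in for the IG and V selectors, and the feature and layer sizes are placeholders.

# Minimal sketch (not the authors' code) of a LINGER-style pipeline:
# bag-of-words -> tf-idf -> feature selection -> one-hidden-layer NN
# trained by backpropagation with early stopping.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

# Invented toy corpus; LINGER used real mailboxes (U1-U5, PU1, LingSpam).
work = ["meeting moved to friday", "project report attached",
        "project meeting friday", "report deadline next week",
        "attached the project report", "meeting agenda for monday",
        "deadline moved to monday", "weekly report attached",
        "agenda for the project meeting", "next week deadline reminder"]
spam = ["win a free prize today", "cheap pills buy now",
        "free prize winner click here", "buy cheap pills online",
        "click here to win free money", "winner of a prize today",
        "cheap offer buy now", "free money offer today",
        "click now to buy online", "prize winner free offer"]
emails = work + spam
labels = ["work"] * len(work) + ["spam"] * len(spam)

pipeline = make_pipeline(
    # min_df=2 discards words that appear only once; the token pattern
    # drops words longer than 20 characters (see the pre-processing slide).
    # The built-in L2 norm corresponds to e-mail-level normalization only.
    TfidfVectorizer(min_df=2, token_pattern=r"(?u)\b\w{1,20}\b"),
    # Stand-in for the IG/V selectors; the slides keep ~1000 of ~9000 words.
    SelectKBest(mutual_info_classif, k=10),
    # One hidden layer, early stopping on a validation set, epoch cap.
    MLPClassifier(hidden_layer_sizes=(50,), early_stopping=True,
                  max_iter=10_000, random_state=0),
)

# Stratified 10-fold cross-validation, as in the evaluation.
scores = cross_val_score(pipeline, emails, labels,
                         cv=StratifiedKFold(n_splits=10, shuffle=True,
                                            random_state=0))
print(f"mean accuracy: {scores.mean():.2f}")

Note that LINGER also normalizes weights at the mailbox and corpus levels, which TfidfVectorizer does not provide out of the box; the comparison of normalization levels follows below.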
Effect of Normalization and Weighting
- Accuracy [%] for the various normalization levels (e = e-mail, m = mailbox, g = global) and weightings (freq. = frequency, tf-idf, boolean) [table not preserved in the transcript]
- Best results: mailbox-level normalization, with tf-idf and frequency weighting

Spam Filtering: Cost-Sensitive Performance Measures
- Blocking a legitimate message is λ times more costly than not blocking a spam message
- Weighted accuracy (WA): when a legitimate e-mail is misclassified/correctly classified, it counts as λ errors/successes

Overall Performance
- Performance on spam filtering for the lemm corpora
- 3 scenarios: λ=1, no cost (spam is merely flagged); λ=9, a moderately accurate filter (senders are notified about blocked e-mails); λ=999, a highly accurate filter (completely automatic scenario)
- LingerIG achieved perfect results on both PU1 and LingSpam for all λ; LingerV was outperformed only by stumps and boosted trees

Effect of Stemming and Stop-Word List
- They do not help; performance on the 4 different versions of LingSpam [table not preserved in the transcript]

Anti-Spam Filter Portability Across Corpora
- Cases a) and b): low spam precision (SP); many false positives (legitimate e-mail classified as spam)
  - Reason: the different nature of the legitimate e-mail in LingSpam (linguistics-related) and U5Spam (more diverse)
  - Features selected from LingSpam are too specific and are not good predictors for U5Spam (case a)
- Case b): low spam recall (SR) as well; many false negatives (spam classified as non-spam)
  - Reason: U5Spam is considerably smaller than LingSpam
- [Figure: typical confusion matrices for cases a) and b)]
- Cases c) and d): good results
  - Feature selection is not perfect, but the NN is able to recover through training
- More extensive experiments with diverse, non-topic-specific corpora are needed to determine the portability of anti-spam filters across different users
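To make the weighted-accuracy measure above concrete: with cost factor λ, each legitimate message counts λ times, so WA = (λ · n_legit_correct + n_spam_correct) / (λ · N_legit + N_spam), following the cost-sensitive evaluation scheme used with the PU1 and LingSpam benchmarks. A small sketch; the confusion-matrix counts are made up:

# Weighted accuracy: misclassifying a legitimate e-mail is lam times
# as costly as letting a spam message through (lam = 1, 9 or 999 above).
def weighted_accuracy(legit_ok, n_legit, spam_ok, n_spam, lam):
    return (lam * legit_ok + spam_ok) / (lam * n_legit + n_spam)

# Made-up counts: 610 legitimate e-mails (600 classified correctly),
# 480 spam (460 caught).
for lam in (1, 9, 999):
    wa = weighted_accuracy(600, 610, 460, 480, lam)
    print(f"lambda = {lam:>3}: WA = {wa:.4f}")

Note how legitimate-mail errors dominate as λ grows: at λ=999, even 10 false positives out of 610 legitimate messages pull WA down to about 0.984, which is why only near-perfect filters are usable in the completely automatic scenario.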