Goal: Learn to automatically file e-mails into folders and to filter spam.

Motivation
Information overload - we spend more and more time filtering e-mails and organizing them into folders to facilitate later retrieval.
Weaknesses of the programmable automatic filtering provided by modern e-mail software (rules to organize mail into folders, or spam filtering based on keywords):
- Most users do not create such rules, as they find the software difficult to use or simply avoid customizing it
- Manually constructing robust rules is difficult, as users are constantly creating, deleting and reorganizing their folders
- The nature of the e-mails within a folder may well drift over time, and the characteristics of spam (e.g. topics, frequent terms) also change over time => the rules must be constantly tuned by the user, which is time-consuming and error-prone

LINGER
Based on text categorization:
- Bag-of-words representation - all unique words in the entire training corpus are identified
- Feature selection chooses the most important words and reduces dimensionality - Information Gain (IG) or Variance (V)
- Feature representation - a normalized weighting for every word, representing its importance in the document
  - Weightings: binary, term frequency, term frequency inverse document frequency (tf-idf)
  - Normalization at 3 levels (e-mail, mailbox, corpus)
- Classifier - neural network (NN). Why? NNs require considerable time for parameter selection and training (-), but can achieve very accurate results and have been successfully applied in many real-world applications (+)
- NN trained with backpropagation, 1 output for each class, 1 hidden layer of neurons; early stopping based on a validation set or a max.
number of epochs (10,000)

Pre-processing for word extraction
E-mail fields used: body, sender (From, Reply-to), recipient (To, CC, Bcc) and Subject (attachments are treated as part of the body)
- These fields are treated equally and a single bag-of-words representation is created for each e-mail
- No stemming or stop-word removal was applied
- Words that appear only once in a corpus are discarded
- Words longer than 20 characters are removed from the body
- The number of unique words in a corpus is reduced from 9000 to 1000

Corpora
- Filing into folders: personal mailboxes U1-U5
- Spam filtering: 4 versions of PU1 and LingSpam, depending on whether stemming and a stop-word list were used (bare, lemm, stop and lemm_stop)

LINGER – A Smart Personal Assistant for E-mail Classification
James Clark, Irena Koprinska, and Josiah Poon
School of Information Technologies, University of Sydney, Sydney, Australia, {jclark, irena,

Results
Performance measures: accuracy (A), recall (R), precision (P) and F1 measure
Stratified 10-fold cross-validation

Filing into Folders
Overall performance
- The simpler feature selector V is more effective than IG
- U2 and U4 were harder to classify than U1, U3 and U5:
  - Different classification styles: U1, U3 and U5 file based on topic and sender; U2 on the action performed (e.g. Read&Keep); U4 on topic, sender, action performed and also on when e-mails need to be acted upon (e.g. ThisWeek)
  - Large ratio of the number of folders to the number of e-mails for U2 and U4
Comparison with other classifiers
Effect of normalization and weighting
- Accuracy [%] for various normalization levels (e - e-mail, m - mailbox, g - global) and weightings (freq.
– frequency, tf-idf and boolean)
- Best results: mailbox-level normalization, with tf-idf and frequency weighting

Spam Filtering
Cost-sensitive performance measures
- Blocking a legitimate message is λ times more costly than not blocking a spam message
- Weighted accuracy (WA): when a legitimate e-mail is misclassified/correctly classified, this counts as λ errors/successes
Overall performance
- Performance on spam filtering for the lemm corpora
- 3 scenarios: λ=1, no cost (flagging spam e-mails); λ=9, moderately accurate filter (notifying senders about blocked e-mails); λ=999, highly accurate filter (completely automatic scenario)
- LingerIG - perfect results on both PU1 and LingSpam for all λ; LingerV outperformed only by stumps and boosted trees
Effect of stemming and stop-word list - they do not help
- Performance on the 4 different versions of LingSpam
Anti-spam filter portability across corpora (typical confusion matrices a-d)
- a) and b) - low SP (spam precision): many false positives (non-spam classified as spam)
  - Reason: different nature of the legitimate e-mails in LingSpam (linguistics-related) and U5Spam (more diverse)
  - Features selected from LingSpam are too specific and not good predictors for U5Spam
- b) - low SR (spam recall) as well: many false negatives (spam classified as non-spam)
  - Reason: U5Spam is considerably smaller than LingSpam
- c) and d) - good results
  - Feature selection is not perfect, but the NN is able to recover through training
More extensive experiments with diverse, non-topic-specific corpora are needed to determine the portability of anti-spam filters across different users
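The bag-of-words tf-idf representation described earlier can be sketched as follows. This is a minimal illustration, not LINGER's actual implementation: the poster does not give formulas, so the cosine (unit-length) normalization and all function names here are assumptions.

```python
import math
from collections import Counter

def tfidf_vectors(documents):
    """Compute tf-idf weights for tokenized documents (lists of words).

    A sketch of the bag-of-words / tf-idf feature representation;
    the exact weighting and normalization used by LINGER may differ.
    """
    n_docs = len(documents)
    # Document frequency: the number of documents containing each word
    df = Counter()
    for doc in documents:
        df.update(set(doc))
    vectors = []
    for doc in documents:
        tf = Counter(doc)  # term frequency within this document
        vec = {w: tf[w] * math.log(n_docs / df[w]) for w in tf}
        # Normalize to unit length (cosine normalization, an assumption)
        norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
        vectors.append({w: v / norm for w, v in vec.items()})
    return vectors

# Toy corpus of three already-tokenized "e-mails"
docs = [["spam", "offer", "free"], ["meeting", "free"], ["free", "offer"]]
vecs = tfidf_vectors(docs)
# "free" occurs in every document, so its idf (and hence its weight) is 0
```

A word occurring in every document gets idf = log(1) = 0, which is why feature selection (IG or V) and this weighting combine to suppress uninformative terms.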
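The cost-sensitive weighted accuracy used in the spam-filtering evaluation can be sketched as below, following the standard definition in the anti-spam literature (each legitimate message counts λ times); the parameter names are mine, not from the poster.

```python
def weighted_accuracy(legit_correct, legit_total, spam_correct, spam_total, lam):
    """Weighted accuracy (WA): each legitimate e-mail counts lam times.

    A sketch of the standard cost-sensitive measure; with lam=1 it
    reduces to plain accuracy, while lam=9 or lam=999 penalize blocking
    a legitimate message far more than letting a spam message through.
    """
    return (lam * legit_correct + spam_correct) / (lam * legit_total + spam_total)

# Hypothetical filter: 95/100 legitimate and 80/100 spam e-mails correct
wa_flag = weighted_accuracy(95, 100, 80, 100, lam=1)    # flagging scenario
wa_auto = weighted_accuracy(95, 100, 80, 100, lam=999)  # fully automatic
```

Under λ=999 the 5 misclassified legitimate e-mails dominate the score, which is why only a near-perfect filter (such as LingerIG on PU1 and LingSpam) is usable in the completely automatic scenario.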