Presentation on theme: "Goal: Learn to automatically file e-mails into folders and filter spam e-mail" — Presentation transcript:
Goal: Learn to automatically
 File e-mails into folders
 Filter spam e-mail

Motivation
 Information overload – we are spending more and more time filtering e-mails and organizing them into folders in order to facilitate retrieval when necessary
 Weaknesses of the programmable automatic filtering provided by modern e-mail software (rules to organize mail into folders, or spam filtering based on keywords):
- Most users do not create such rules, as they find the software difficult to use or simply avoid customizing it
- Manually constructing robust rules is difficult, as users are constantly creating, deleting and reorganizing their folders
- The nature of the e-mails within a folder may well drift over time, and the characteristics of spam e-mail (e.g. topics, frequent terms) also change over time => the rules must be constantly tuned by the user, which is time-consuming and error-prone

LINGER – Based on Text Categorization
 Bag-of-words representation – all unique words in the entire training corpus are identified
 Feature selection chooses the most important words and reduces dimensionality – Information Gain (IG) or Variance (V)
 Feature representation – a normalized weighting for every word, representing its importance in the document
- Weightings: binary, term frequency, term frequency–inverse document frequency (tf-idf)
- Normalization at 3 levels (e-mail, mailbox, corpus)
 Classifier – neural network (NN). Why a NN?
- Requires considerable time for parameter selection and training (-), but can achieve very accurate results and has been successfully applied in many real-world applications (+)
- NN trained with backpropagation, one output for each class, one hidden layer of 20–40 neurons; early stopping based on a validation set or a max.
# epochs (10,000)

Pre-processing for Word Extraction
 E-mail fields used – body, sender (From, Reply-To), recipient (To, CC, BCC) and Subject (attachments are treated as part of the body)
- These fields are treated equally and a single bag-of-words representation is created for each e-mail
- No stemming or stop-word removal was applied
 Words that appear only once in a corpus are discarded
 Words longer than 20 characters are removed from the body
 The number of unique words in a corpus is reduced from 9,000 to 1,000

Corpora
 Two tasks: filing into folders and spam e-mail filtering
 4 versions of PU1 and LingSpam, depending on whether stemming and a stop-word list were used (bare, lemm, stop and lemm_stop)

LINGER – A Smart Personal Assistant for E-mail Classification
James Clark, Irena Koprinska, and Josiah Poon
School of Information Technologies, University of Sydney, Sydney, Australia, e-mail: {jclark, irena, josiah}@it.usyd.edu.au

Results
Performance Measures
 Accuracy (A), recall (R), precision (P) and F1 measure
 Stratified 10-fold cross-validation

Filing Into Folders – Overall Performance
 The simpler feature selector V is more effective than IG
 U2 and U4 were harder to classify than U1, U3 and U5:
- Different classification styles: U1, U3 and U5 file based on the topic and sender; U2 on the action performed (e.g. Read&Keep); U4 on the topic, sender, action performed, and also on when e-mails need to be acted upon (e.g. ThisWeek)
- Large ratio of the number of folders to the number of e-mails for U2 and U4
 Comparison with other classifiers

Effect of Normalization and Weighting
Accuracy [%] for various normalization levels (e – e-mail, m – mailbox, g – global) and weighting schemes (freq.
– frequency, tf-idf and boolean)
 Best results: mailbox-level normalization, with tf-idf and frequency weighting

Spam Filtering
Cost-Sensitive Performance Measures
 Blocking a legitimate message is λ times more costly than not blocking a spam message
 Weighted accuracy (WA): when a legitimate e-mail is misclassified/correctly classified, this counts as λ errors/successes

Overall Performance
Performance on spam filtering for the lemm corpora
 3 scenarios: λ=1, no cost (flagging spam e-mail); λ=9, moderately accurate filter (notifying senders about blocked e-mails); λ=999, highly accurate filter (completely automatic scenario)
 LingerIG – perfect results on both PU1 and LingSpam for all λ; LingerV outperformed only by stumps and boosted trees

Effect of Stemming and Stop-Word List – they do not help
Performance on the 4 different versions of LingSpam

Anti-Spam Filter Portability Across Corpora
[Figure: typical confusion matrices for cases a) and b)]
 a) and b) – low SP; many false positives (non-spam classified as spam)
- Reason: different nature of the legitimate e-mail in LingSpam (linguistics-related) and U5Spam (more diverse)
- Features selected from LingSpam are too specific and are not a good predictor for U5Spam (a)
 b) – low SR as well; many false negatives (spam classified as non-spam)
- Reason: U5Spam is considerably smaller than LingSpam
 c) and d) – good results
- Feature selection is not perfect, but the NN is able to recover through training
 More extensive experiments with diverse, non-topic-specific corpora are needed to determine the portability of anti-spam filters across different users
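The pre-processing steps described above (one bag of words per e-mail, no stemming or stop-word removal, discarding words that appear only once in the corpus, and dropping words longer than 20 characters) can be sketched as follows. This is a minimal illustration; the function names and the tokenizer regex are assumptions, not from the paper.

```python
import re
from collections import Counter

MAX_WORD_LEN = 20  # words longer than this are dropped (as in the slides)

def extract_words(text):
    # Lower-case tokenization; no stemming or stop-word removal,
    # mirroring the pre-processing described in the slides.
    # The exact tokenizer is an assumption.
    return re.findall(r"[a-z0-9'-]+", text.lower())

def build_vocabulary(emails):
    # Bag-of-words vocabulary over a corpus of e-mails:
    # over-long words are removed, and words that appear
    # only once in the whole corpus are discarded.
    counts = Counter()
    for email in emails:
        counts.update(w for w in extract_words(email)
                      if len(w) <= MAX_WORD_LEN)
    return sorted(w for w, c in counts.items() if c > 1)
```

In practice the body, sender, recipient and subject fields would all be concatenated into the text passed to `extract_words`, since the slides treat these fields equally.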

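The tf-idf feature representation with normalization could look like the sketch below. Unit-length (Euclidean) normalization per e-mail is an assumption standing in for the e-mail-level normalization the slides mention; the slides do not spell out the formula.

```python
import math
from collections import Counter

def tf_idf_vectors(emails, vocab):
    # tf-idf weighting for each e-mail over a fixed vocabulary.
    # Each vector is scaled to unit Euclidean length (one plausible
    # reading of e-mail-level normalization; an assumption here).
    n = len(emails)
    tokenized = [Counter(e.lower().split()) for e in emails]
    # document frequency of each vocabulary word
    df = {w: sum(1 for t in tokenized if w in t) for w in vocab}
    vectors = []
    for t in tokenized:
        vec = [t[w] * math.log(n / df[w]) if df[w] else 0.0
               for w in vocab]
        norm = math.sqrt(sum(x * x for x in vec))
        vectors.append([x / norm for x in vec] if norm else vec)
    return vectors
```

Frequency and binary weightings are simpler variants of the same loop: replace the tf-idf product with the raw count `t[w]` or with `1 if t[w] else 0`.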

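The cost-sensitive weighted accuracy and the standard spam precision (SP), recall (SR) and F1 measures used above can be written down directly from their definitions. Treating spam as the positive class is an assumption of this sketch.

```python
def weighted_accuracy(tp, fp, tn, fn, lam=9):
    # Weighted accuracy (WA): misclassifying a legitimate e-mail
    # counts as lam errors, correctly classifying one counts as
    # lam successes. Spam is the positive class, so fp = legitimate
    # e-mail blocked as spam, tn = legitimate e-mail passed.
    return (lam * tn + tp) / (lam * (tn + fp) + tp + fn)

def precision_recall_f1(tp, fp, fn):
    # Standard spam precision (SP), recall (SR) and F1.
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```

With λ=1 this reduces to plain accuracy; λ=9 and λ=999 correspond to the moderately accurate and completely automatic scenarios in the results above.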