Introduction to Automatic Email Classification
Shih-Wen (George) Ke, 7th Dec 2005


Overview
- Introduction to the Enron Corpus
- Traditional Text Classification vs Email Classification
- Recent Work on the Enron Corpus
- Our Work on the Enron Corpus
- Summary
- Future Research Directions in Information Retrieval
- Further Discussion

Overview
The nature of email classification is very different from that of traditional text classification tasks. Email is time-dependent, poorly structured and written in an informal style, and no standard ways of preparing and evaluating email datasets have been proposed.

Introduction
- Automatic email classification dates back to the mid-1990s
- Email classification received little attention until recently because no standard dataset was available
- The Enron Corpus became available in March 2004

Introduction - Enron Corpus
- Distributed by William Cohen at Carnegie Mellon University
- Consists of 517,431 messages that belong to 150 users of Enron Corporation
- Most users use folders to categorise their emails
- The upper bound for the number of folders appears to be the log of the number of messages (Klimt & Yang, 2004)

Email Classification: Assumptions
- Categorise emails into folders, a.k.a. email foldering
- Only personal and professional emails are considered here
- Assume that users use folders to organise their emails
- Other methods of organising emails, e.g. flags or labels, are not considered here, although they may provide more information for email classification

Recent Work on Enron Corpus

Bekkerman et al. (2004)
- Task: mono- and multiple-classification
- Measure: accuracy (TP/N)
- SVM performed best in most cases, but the difference was not statistically significant
- Newly created folders adversely affect performance
- Performance does not necessarily improve as the training set size grows
- Incoming emails are more related to those recently received than to those received long ago

Klimt & Yang (2004)
- Task: multiple-classification
- Measures: precision and recall (P&R), micro- and macro-averaged F1
- Enron is suitable for email classification evaluation
- The body field is the most useful feature, followed by 'From'
- Email threads can be a valuable asset to email classification, but they are difficult to detect and evaluate
- Foldering strategies differ from user to user
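The evaluation measures named above (accuracy, micro- and macro-averaged F1) can be sketched as follows. This is a minimal illustration for the single-label foldering case only; the function names and the toy folder labels are assumptions made for the example, not taken from either paper.

```python
from collections import Counter

def f1(tp, fp, fn):
    """F1 from raw counts; taken as 0.0 when there are no true positives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def micro_macro_f1(gold, predicted):
    """gold, predicted: parallel lists of folder labels, one label per email."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # p was predicted but wrong
            fn[g] += 1  # g was the true folder and was missed
    folders = set(gold) | set(predicted)
    # Macro: average the per-folder F1, so every folder counts equally.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in folders) / len(folders)
    # Micro: pool the counts first, so every email counts equally.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    accuracy = sum(tp.values()) / len(gold)
    return micro, macro, accuracy

# Hypothetical toy data: a classifier that files everything under "work".
gold = ["work", "work", "work", "work", "personal", "legal"]
pred = ["work", "work", "work", "work", "work", "work"]
micro, macro, accuracy = micro_macro_f1(gold, pred)
print(micro, macro, accuracy)
```

In this toy case the macro average is much lower than the micro average, because the two small folders that are never predicted each contribute an F1 of zero. That is why macro-averaging is the more sensitive measure when performance on small folders matters.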

Our Work on Enron Corpus - Introduction
- Users sometimes forget which folders they have created or which folder they should file an email under
- So users tend to create new (duplicate) folders
- Newly created folders adversely affect performance (Bekkerman et al., 2004)
- Aim: reduce the likelihood of users creating duplicate folders by improving the accuracy of assigning incoming emails to the folders that were created in the first place
- Compare state-of-the-art classifiers (kNN, SVM) with our own classifier, PERC, in a simulation of a real-time situation using various parameter settings
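One way to simulate a real-time situation is to train only on emails that arrived before the ones being classified. The slides do not specify the protocol used, so the sketch below is an assumption: the function name `incremental_splits`, the `time` field and the `step` parameter are all hypothetical.

```python
def incremental_splits(messages, step):
    """Yield (train, test) pairs in which every training email predates the test emails.

    messages: list of dicts with a sortable "time" key (hypothetical schema).
    step: number of emails added to the training set at each round.
    """
    ordered = sorted(messages, key=lambda m: m["time"])
    for end in range(step, len(ordered), step):
        # Train on everything seen so far, test on the next batch to arrive.
        yield ordered[:end], ordered[end:end + step]

# Hypothetical toy mailbox of five emails with integer timestamps.
mailbox = [{"time": t, "folder": "work"} for t in (3, 1, 5, 2, 4)]
splits = list(incremental_splits(mailbox, step=2))
for train, test in splits:
    print([m["time"] for m in train], "->", [m["time"] for m in test])
```

A split like this also lets newly created folders appear in the test batch before any training example exists for them, which is the situation reported to hurt performance.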

Our Work on Enron Corpus - The PERC
The PERC Classifier (PERsonal Email Classifier)
- Find a centroid c_i for each category C_i
- For each test document x:
  - Find the k nearest neighbouring training documents to x
  - The similarity between x and each neighbouring training document d_j is added to the similarity between x and c_i
  - Sort the similarity scores sim(x, C_i) in descending order
  - The decision to assign x to C_i can be made using various thresholding strategies

Our Work on Enron Corpus - The PERC
The PERC Classifier (PERsonal Email Classifier)
In the scoring formula for sim(x, C_i): y(d_j, C_i) ∈ {0,1} is the classification of training document d_j with respect to category C_i; sim(x, d_j) is the similarity between test document x and training document d_j; and sim(x, c_i) is the similarity between test document x and the centroid c_i of the category that d_j belongs to.
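The steps and definitions above can be sketched in code. The transcript does not preserve the scoring formula itself, so the way the two similarities are combined here is an assumption: each of the k nearest neighbours d_j adds sim(x, d_j) + sim(x, c_i) to the score of its own category. Cosine similarity over sparse term-weight dictionaries, the helper names and the toy vectors are likewise illustrative choices, not the authors' implementation.

```python
import math
from collections import defaultdict

def cosine(u, v):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def centroids(training):
    """training: list of (vector, folder). Returns folder -> mean vector c_i."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for vec, folder in training:
        counts[folder] += 1
        for term, weight in vec.items():
            sums[folder][term] += weight
    return {f: {t: w / counts[f] for t, w in s.items()} for f, s in sums.items()}

def perc_scores(x, training, cents, k):
    """Rank folders for test document x, highest score first."""
    neighbours = sorted(training, key=lambda d: cosine(x, d[0]), reverse=True)[:k]
    scores = defaultdict(float)
    for vec, folder in neighbours:
        # ASSUMPTION: neighbour similarity plus centroid similarity, summed
        # over the neighbours that belong to this folder (y(d_j, C_i) = 1).
        scores[folder] += cosine(x, vec) + cosine(x, cents[folder])
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical toy mailbox: two "work" emails and one "social" email.
training = [
    ({"meeting": 1.0, "agenda": 1.0}, "work"),
    ({"meeting": 1.0, "budget": 1.0}, "work"),
    ({"party": 1.0, "friday": 1.0}, "social"),
]
cents = centroids(training)
x = {"meeting": 1.0, "budget": 1.0, "friday": 1.0}
ranked = perc_scores(x, training, cents, k=3)
print(ranked)
```

Taking the top-ranked folder is the simplest thresholding strategy; a score cut-off or a fixed number of folders per email could be applied to the same ranked list.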

Rationale for the Hybrid Approach
- The centroid method overcomes data sparseness: emails tend to be short.
- kNN allows the topic of a folder to drift over time.
- Considering the vector space locally allows matching against features which are currently dominant.

Our Work on Enron Corpus - Results
SVM1 (c=1, j=1), SVM2 (c=0.01, j=1)
Micro-averaged and macro-averaged F1 over all users, with standard deviation, for kNN, SVM and PERC
For macro-averaged evaluations, PERC significantly outperformed kNN (t=2.786, p=0.032), SVM1 (t=2.533, p=0.044) and SVM2 (t=5.926, p=0.001)
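The slide reports t statistics without naming the test. Assuming a paired t-test over per-user macro F1 scores (an assumption, not stated in the source), the statistic can be computed as below. The score lists are hypothetical values chosen for illustration and are not the results from the slides.

```python
import math
from statistics import mean, stdev

def paired_t(a, b):
    """Paired t statistic and degrees of freedom for two matched score lists."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    # Mean difference divided by its standard error (sample standard deviation).
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1

# Hypothetical per-user macro F1 for two classifiers (NOT the reported data).
system_a = [0.50, 0.60, 0.70, 0.40]
system_b = [0.40, 0.55, 0.60, 0.35]
t, df = paired_t(system_a, system_b)
print(t, df)
```

The p-value then follows from the t distribution with `df` degrees of freedom, for example via a statistics library or a printed table.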

Our Work on Enron Corpus - Conclusions
- PERC has the highest accuracy when assigning test documents to small folders
- kNN and PERC performed better with smaller k
- The parameters of SVM can be sensitive to the number of training documents available
- Further work: investigate various parameter settings and training/test set splits
- The use of time will be investigated
- A questionnaire-based study is being conducted to find out how real users behave in email management

Future Research Directions in IR
- Use of time information
- Training/test set splits
- Feature extraction, selection
- Document representation
- Qualitative evaluation
- Thread detection, TDT for email
- Mining sequential patterns
- Burst of activity (Kleinberg, 2002)

References
Bekkerman, R., McCallum, A. and Huang, G. (2004) Automatic Categorization of Email into Folders: Benchmark Experiments on Enron and SRI Corpora. Technical Report IR-418, CIIR, University of Massachusetts.
Kleinberg, J. (2002) Bursty and Hierarchical Structure in Streams. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
Klimt, B. and Yang, Y. (2004) The Enron Corpus: A New Dataset for Email Classification Research. European Conference on Machine Learning.