Published by Eleanor Gibble. Modified over 2 years ago.

1
Hierarchical Classification of Real Life Documents
Ke Wang, Senqiang Zhou (Simon Fraser University)
Yu He (National University of Singapore)

2
Hierarchical & Multi-classed Documents
Topics are organized into a hierarchy of increasing specificity.
A document is classified into all relevant classes. For example, a document on Dance could be reached from both the Arts:Performing_Arts and the Recreation topics in Yahoo.

3
New Issues
Misclassification is non-symmetric
–Travel Outdoor vs. Travel Software
Documents are multi-classed
–Traditional way: only one class attached
Class space is sparse
–$2^k - 1$ subsets of classes for $k$ classes
–Exploring the similarities between classes

4
A New Classification Model
The model of documents:
–$\{t_1, t_2, \ldots, t_n \mid C_1, \ldots, C_k\}$, where $t_1, t_2, \ldots, t_n$ are keywords and $C_1, \ldots, C_k$ are classes from a given class hierarchy
–$\{C_1, \ldots, C_k\}$ is called a classset (CS)
Construct a classifier that assigns a “good” classset to a given new document
–consisting of rules of the form $\{t_{i_1}, \ldots, t_{i_p}\} \rightarrow \{C_{i_1}, \ldots, C_{i_p}\}$

5
Class Similarity
Two classsets are similar if they “cover” similar documents.
Anc(CS): the set of classes in a classset CS plus all their ancestor classes.
$CS_1$ is more general than $CS_2$ if $\mathrm{Anc}(CS_1) \subseteq \mathrm{Anc}(CS_2)$
–{Dance} is more general than {Fast-Dance, Music} because $\mathrm{Anc}(\{\text{Dance}\}) \subseteq \mathrm{Anc}(\{\text{Fast-Dance}, \text{Music}\})$
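The "more general" test can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the `PARENT` table is a hypothetical toy hierarchy (the slides do not give one), and the relation lost from the slide is read as set inclusion.

```python
# Hypothetical toy hierarchy: child -> parent (None marks a top-level class).
# The placement of Music under Arts is an assumption made for illustration.
PARENT = {
    "Arts": None,
    "Dance": "Arts",
    "Fast-Dance": "Dance",
    "Music": "Arts",
}

def anc(classset):
    """Return the classes in `classset` plus all of their ancestors."""
    result = set()
    for cls in classset:
        while cls is not None:
            result.add(cls)
            cls = PARENT[cls]
    return result

def more_general(cs1, cs2):
    """True if cs1 is more general than cs2, i.e. Anc(cs1) is a subset of Anc(cs2)."""
    return anc(cs1) <= anc(cs2)

print(more_general({"Dance"}, {"Fast-Dance", "Music"}))
print(more_general({"Fast-Dance", "Music"}, {"Dance"}))
```

Under this toy hierarchy, {Dance} is more general than {Fast-Dance, Music}, but not the other way around.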

6
Class Similarity (Cont.)
A document d is covered by a classset CS if CS is more general than the classset of d.
Cover(CS) denotes the set of documents covered by CS.
$\mathrm{Cover}(CS_1) \cap \mathrm{Cover}(CS_2) = \mathrm{Cover}(CS_1 \cup CS_2)$
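A small sketch of Cover(CS), checking that the cover of a union of two classsets equals the intersection of their covers on a toy collection. The hierarchy `PARENT` and the document table `DOCS` are hypothetical, and "more general" is taken to mean that Anc(CS) is a subset of the ancestors of the document's classset.

```python
# Hypothetical toy hierarchy (child -> parent) and document collection.
PARENT = {"Arts": None, "Dance": "Arts", "Fast-Dance": "Dance", "Music": "Arts"}
DOCS = {
    "d1": {"Fast-Dance"},
    "d2": {"Music"},
    "d3": {"Fast-Dance", "Music"},
    "d4": {"Dance"},
}

def anc(classset):
    """Classes in `classset` plus all of their ancestors."""
    result = set()
    for cls in classset:
        while cls is not None:
            result.add(cls)
            cls = PARENT[cls]
    return result

def cover(classset, docs=DOCS):
    """Ids of documents whose classset is at least as specific as `classset`."""
    target = anc(classset)
    return {doc_id for doc_id, cs_d in docs.items() if target <= anc(cs_d)}

cs1, cs2 = {"Dance"}, {"Music"}
print(cover(cs1) & cover(cs2) == cover(cs1 | cs2))
```

The identity holds because Anc of a union is the union of the Anc sets, so a document is covered by both classsets exactly when it is covered by their union.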

7
Class Similarity (Cont.)
The dissimilarity of $CS_1$ and $CS_2$ is defined as the normalized difference of their coverage:
$$E(CS_1, CS_2) = \frac{|\mathrm{Cover}(CS_2) - \mathrm{Cover}(CS_1)| + |\mathrm{Cover}(CS_1) - \mathrm{Cover}(CS_2)|}{|\mathrm{Cover}(CS_1) \cup \mathrm{Cover}(CS_2)|}$$
The similarity is defined as $1 - E(CS_1, CS_2)$.
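As a worked example, the dissimilarity can be computed directly from two cover sets. This sketch assumes the normalizing denominator is the size of the union of the two covers, and it returns 0.0 when both covers are empty (a convention chosen here, not stated in the slides). The document ids are made up.

```python
def dissimilarity(cover1, cover2):
    """E(CS1, CS2): size of the symmetric difference of the covers over the size of their union."""
    union = cover1 | cover2
    if not union:
        # Assumption: two classsets that cover nothing are treated as identical.
        return 0.0
    only_in_2 = len(cover2 - cover1)
    only_in_1 = len(cover1 - cover2)
    return (only_in_2 + only_in_1) / len(union)

def similarity(cover1, cover2):
    """Similarity is 1 - E(CS1, CS2)."""
    return 1.0 - dissimilarity(cover1, cover2)

# Hypothetical cover sets for two classsets.
cover_dance = {"d1", "d3", "d4"}
cover_music = {"d2", "d3"}
print(dissimilarity(cover_dance, cover_music))  # 3 differing documents out of 4
print(similarity(cover_dance, cover_music))
```

With this reading, the similarity is the Jaccard coefficient of the two cover sets.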

8
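The Confidence
$\mathrm{Match}(T \rightarrow CS)$: the set of documents that contain all the terms in $T$.
The confidence of $T \rightarrow CS$ is defined as:
$$\mathrm{Conf}_g(T \rightarrow CS) = \frac{|\mathrm{Match}(T \rightarrow CS)| - \sum_{d \in \mathrm{Match}(T \rightarrow CS)} E(CS_d, CS)}{|\mathrm{Match}(T \rightarrow CS)|}$$
where $CS_d$ is the classset of document $d$.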

9
What’s behind $\mathrm{Conf}_g$?
Intuitively, $\mathrm{Conf}_g(T \rightarrow CS)$ measures the average similarity between $CS$ and the classsets of the documents that match $T \rightarrow CS$.
If $E(CS_d, CS)$ is binary, i.e., 1 or 0, $\mathrm{Conf}_g(T \rightarrow CS)$ degenerates to the standard confidence.
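The following sketch computes the generalized confidence and illustrates the binary special case. The document records, the field names `terms` and `classes`, and the `binary_dissim` function are assumptions for illustration; any dissimilarity function E with values in [0, 1] could be passed in instead.

```python
def conf_g(terms, classset, docs, dissim):
    """Conf_g(T -> CS): (|Match| - sum of E(CS_d, CS) over matched docs) / |Match|."""
    matched = [d for d in docs if terms <= d["terms"]]
    if not matched:
        # Assumption: a rule that matches no document gets confidence 0.
        return 0.0
    total_error = sum(dissim(d["classes"], classset) for d in matched)
    return (len(matched) - total_error) / len(matched)

def binary_dissim(cs_d, cs):
    """Binary E: 0 if the classsets are identical, 1 otherwise."""
    return 0.0 if cs_d == cs else 1.0

# Hypothetical training documents.
DOCS = [
    {"terms": {"ballet", "stage"}, "classes": {"Dance"}},
    {"terms": {"ballet", "shoes"}, "classes": {"Dance"}},
    {"terms": {"ballet", "ticket"}, "classes": {"Music"}},
    {"terms": {"guitar"}, "classes": {"Music"}},
]

# With a binary E this is the standard confidence: 2 of the 3 matching documents have classset {Dance}.
print(conf_g({"ballet"}, {"Dance"}, DOCS, binary_dissim))
```

With a graded E, a matching document whose classset is close to CS still contributes partial credit instead of counting as a full miss.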

10

11

12
Construction of Classifier
Step 1: find association rules
–Generate all association rules of the form $T \rightarrow CS$ that satisfy a user-specified minimum support and minimum confidence.

13
Construction of Classifier (Cont.)
Step 2: rank the rules
–A document is classified by the matching rule that has the highest confidence.
–This selection is called most confidence first (MCF).
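A minimal sketch of MCF selection follows. The rules, their confidence values, and the default classset are made up for illustration, and ties are broken by the original rule order, which the slides do not specify.

```python
# Hypothetical rules: (terms T, classset CS, confidence).
RULES = [
    ({"ballet"}, {"Dance"}, 0.80),
    ({"ballet", "concert"}, {"Dance", "Music"}, 0.95),
    ({"guitar"}, {"Music"}, 0.90),
]
DEFAULT_CLASSSET = {"Arts"}  # assumed fallback when no rule matches

def classify_mcf(doc_terms, rules=RULES, default=DEFAULT_CLASSSET):
    """Return the classset of the highest-confidence rule whose terms all occur in the document."""
    ranked = sorted(rules, key=lambda rule: rule[2], reverse=True)
    for terms, classset, _confidence in ranked:
        if terms <= doc_terms:  # the rule matches: every term of T is in the document
            return classset
    return default

print(classify_mcf({"ballet", "concert", "ticket"}))
print(classify_mcf({"ballet", "shoes"}))
print(classify_mcf({"poetry"}))
```

In practice the ranked list would be sorted once and reused for every document; sorting inside the function only keeps the sketch short.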

14
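Construction of Classifier (Cont.)
Step 3: remove rules of low accuracy
–Let D be the set of training documents classified by rule $T \rightarrow CS$. The accuracy of $T \rightarrow CS$, written $\mathrm{Accu}(T \rightarrow CS)$, is defined as [formula not preserved in the slide text]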

15
Construction of Classifier (Cont.)
–$\mathrm{Conf}_g(T \rightarrow CS)$ is defined with respect to all the documents that match the rule, whereas $\mathrm{Accu}(T \rightarrow CS)$ is defined with respect to the documents classified by the rule.
–Remove the rules with accuracy below a certain threshold because they contribute negatively to overall accuracy.

16
Construction of Classifier (Cont.)
Step 4: cut off the ranked list
–Suppose we cut off the list of rules $r_1, \ldots, r_m$ after the first $i$ rules, $r_1, \ldots, r_i$
–Cutoff error = PrefixError($r_i$) + DefaultError($r_i$)
–PrefixError($r_i$) is the sum of the rule errors Error($r_j$) for all rules $r_j$, $1 \le j \le i$
–DefaultError($r_i$) is the error caused by assigning the default classset to all the documents not classified by any rule $r_j$, $1 \le j \le i$
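The cutoff error can be tabulated for every prefix of the ranked list with a running sum. In this sketch the per-rule errors and default errors are invented numbers, and picking the prefix with the smallest cutoff error is an assumption about how the cutoff point is chosen; the slides only define the error itself.

```python
from itertools import accumulate

def cutoff_errors(rule_errors, default_errors):
    """Cutoff error for each prefix length i = PrefixError(r_i) + DefaultError(r_i)."""
    prefix_errors = list(accumulate(rule_errors))  # running sum of Error(r_j), j <= i
    return [p + d for p, d in zip(prefix_errors, default_errors)]

def best_cutoff(rule_errors, default_errors):
    """Return (i, error) for the prefix length i (1-based) with the smallest cutoff error."""
    errors = cutoff_errors(rule_errors, default_errors)
    best_index = min(range(len(errors)), key=errors.__getitem__)
    return best_index + 1, errors[best_index]

# Hypothetical values: Error(r_j) for the ranked rules, and DefaultError(r_i) for each prefix.
RULE_ERRORS = [1.0, 0.5, 2.0, 3.0]
DEFAULT_ERRORS = [10.0, 6.0, 5.5, 5.0]

print(cutoff_errors(RULE_ERRORS, DEFAULT_ERRORS))
print(best_cutoff(RULE_ERRORS, DEFAULT_ERRORS))
```

Keeping more rules lowers the default error but adds rule error, so the cutoff error typically falls and then rises along the ranked list.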

17
Experiments

18
Experimental Results
The results on the IBM data set
–The error: Coverage beats the others.
–The size: Confidence gets smaller.
–The time: Coverage takes longer.

19

20
Classification Error

21
Size & Execution Time
