Hierarchical Classification of Real Life Documents Ke Wang, Senqiang Zhou Simon Fraser University Yu He National University of Singapore.

1 Hierarchical Classification of Real Life Documents
Ke Wang, Senqiang Zhou, Simon Fraser University
Yu He, National University of Singapore

2 Hierarchical & Multi-classed Documents
Topics are organized into a hierarchy of increasing specificity.
A document is classified into all relevant classes. For example, a document on Dance could be reached from both the Arts:Performing_Arts and Recreation topics in Yahoo.

3 New Issues
Misclassification is non-symmetric
–Travel → Outdoor vs. Travel → Software
Documents are multi-classed
–Traditional way: only one class attached
Class space is sparse
–$2^k - 1$ subsets of classes for k classes
–Exploring the similarities between classes

4 A New Classification Model
The model of documents:
–$\{t_1, t_2, \ldots, t_n \mid C_1, \ldots, C_k\}$, where $t_1, t_2, \ldots, t_n$ are keywords and $C_1, \ldots, C_k$ are classes from a given class hierarchy
–$\{C_1, \ldots, C_k\}$ is called a classset (CS)
Construct a classifier
–consisting of rules of the form $\{t_{i_1}, \ldots, t_{i_p}\} \to \{C_{i_1}, \ldots, C_{i_p}\}$, that assigns a “good” classset to a given new document

5 Class Similarity
Two classsets are similar if they “cover” similar documents.
Anc(CS): the set of classes in a classset CS plus all their ancestor classes.
$CS_1$ is more general than $CS_2$ if $\mathrm{Anc}(CS_1) \subseteq \mathrm{Anc}(CS_2)$
–{Dance} is more general than {Fast-Dance, Music} because $\mathrm{Anc}(\{\text{Dance}\}) \subseteq \mathrm{Anc}(\{\text{Fast-Dance}, \text{Music}\})$
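
The "more general" test can be sketched in a few lines of Python. The hierarchy below is a hypothetical toy example (only the class names Dance, Fast-Dance, Music and Arts:Performing_Arts come from the slides; the parent links are assumed for illustration).

```python
# Hypothetical toy hierarchy: child -> parent (None marks a top-level class).
PARENT = {
    "Arts": None,
    "Performing_Arts": "Arts",
    "Dance": "Performing_Arts",
    "Fast-Dance": "Dance",
    "Music": "Arts",
}


def anc(classset):
    """Return the classes in `classset` plus all of their ancestors."""
    result = set()
    for cls in classset:
        # Walk up the hierarchy; stop early once we reach a class already seen,
        # because its ancestors were added when it was added.
        while cls is not None and cls not in result:
            result.add(cls)
            cls = PARENT[cls]
    return result


def is_more_general(cs1, cs2):
    """CS1 is more general than CS2 if Anc(CS1) is a subset of Anc(CS2)."""
    return anc(cs1) <= anc(cs2)


print(is_more_general({"Dance"}, {"Fast-Dance", "Music"}))  # True
print(is_more_general({"Fast-Dance", "Music"}, {"Dance"}))  # False
```

Under this assumed hierarchy, Anc({Dance}) is {Dance, Performing_Arts, Arts}, all of which are also ancestors of Fast-Dance, so the subset test succeeds in one direction only.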

6 Class Similarity (Cont.)
A document d is covered by a classset CS if CS is more general than the classset of d.
Cover(CS) denotes the set of documents covered by CS.
$\mathrm{Cover}(CS_1) \cap \mathrm{Cover}(CS_2) = \mathrm{Cover}(CS_1 \cup CS_2)$
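
A minimal sketch of Cover(CS), assuming a small hypothetical hierarchy and four made-up training documents (`PARENT`, `DOCS` and the document ids are illustrative, not from the slides). It also checks the intersection identity on one example.

```python
# Hypothetical hierarchy (child -> parent) and labelled documents.
PARENT = {
    "Arts": None,
    "Performing_Arts": "Arts",
    "Dance": "Performing_Arts",
    "Music": "Arts",
    "Recreation": None,
    "Travel": "Recreation",
}

DOCS = {
    "d1": {"Dance"},
    "d2": {"Dance", "Music"},
    "d3": {"Music"},
    "d4": {"Travel"},
}


def anc(classset):
    """Classes in `classset` plus all of their ancestors."""
    result = set()
    for cls in classset:
        while cls is not None and cls not in result:
            result.add(cls)
            cls = PARENT[cls]
    return result


def cover(classset):
    """Documents whose own classset is at least as specific as `classset`."""
    target = anc(classset)
    return {doc for doc, doc_cs in DOCS.items() if target <= anc(doc_cs)}


print(sorted(cover({"Dance"})))           # ['d1', 'd2']
print(sorted(cover({"Music"})))           # ['d2', 'd3']
print(sorted(cover({"Dance", "Music"})))  # ['d2']
```

Here Cover({Dance}) ∩ Cover({Music}) = {d2}, which matches Cover({Dance, Music}) as the identity predicts.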

7 Class Similarity (Cont.)
The dissimilarity of $CS_1$ and $CS_2$ is defined as the normalized difference of their coverage:
$$E(CS_1, CS_2) = \frac{|\mathrm{Cover}(CS_2) - \mathrm{Cover}(CS_1)| + |\mathrm{Cover}(CS_1) - \mathrm{Cover}(CS_2)|}{|\mathrm{Cover}(CS_1) \cup \mathrm{Cover}(CS_2)|}$$
The similarity is defined as $1 - E(CS_1, CS_2)$.
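
As a rough illustration, the dissimilarity can be computed directly from two cover sets. The sketch assumes the normalizer is the size of the union of the two covers and treats two empty covers as identical; the function name and document ids are hypothetical.

```python
def dissimilarity(cover1, cover2):
    """E(CS1, CS2): size of the symmetric difference over size of the union."""
    union = cover1 | cover2
    if not union:
        # Assumption: two empty covers are treated as identical (E = 0).
        return 0.0
    return (len(cover2 - cover1) + len(cover1 - cover2)) / len(union)


# One shared document out of three in total: E = 2/3, similarity = 1/3.
e = dissimilarity({"d1", "d2"}, {"d2", "d3"})
print(round(e, 3), round(1 - e, 3))

print(dissimilarity({"d1"}, {"d1"}))  # 0.0 (identical covers)
print(dissimilarity({"d1"}, {"d2"}))  # 1.0 (disjoint covers)
```

With this normalizer E always lies between 0 and 1, so the similarity 1 - E does as well.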

8 The Confidence
$\mathrm{Match}(T \to CS)$: the set of documents that contain all the terms in T.
The confidence of $T \to CS$ is defined as:
$$\mathrm{Conf}_g(T \to CS) = \frac{|\mathrm{Match}(T \to CS)| - \sum_{d \in \mathrm{Match}(T \to CS)} E(CS_d, CS)}{|\mathrm{Match}(T \to CS)|}$$
where $CS_d$ is the classset of document d.

9 What’s behind $\mathrm{Conf}_g$?
Intuitively, $\mathrm{Conf}_g(T \to CS)$ measures the average similarity between CS and the classsets of the documents that match $T \to CS$.
If $E(CS_d, CS)$ is binary, i.e., 1 or 0, $\mathrm{Conf}_g(T \to CS)$ degenerates to the standard confidence.
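
A small sketch of the generalized confidence, assuming the dissimilarities E(CS_d, CS) of the matching documents have already been computed and are passed in as a list (the function name and the sample values are hypothetical).

```python
def generalized_confidence(errors):
    """Conf_g for one rule, given E(CS_d, CS) for every matching document d.

    Equivalent to the average similarity 1 - E over the matching documents.
    """
    n = len(errors)
    if n == 0:
        raise ValueError("the rule matches no documents")
    return (n - sum(errors)) / n


# Binary errors: 3 of 4 matching documents carry exactly the rule's classset,
# so Conf_g equals the standard confidence 3/4.
print(generalized_confidence([0, 0, 1, 0]))      # 0.75

# Graded errors: partially similar classsets earn partial credit.
print(generalized_confidence([0, 0.5, 1, 0.5]))  # 0.5
```

The first call shows the degenerate case: with 0/1 errors the numerator simply counts the matching documents that the rule labels correctly.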

12 Construction of Classifier
Step 1: Find association rules
–Generate all association rules of the form $T \to CS$ that satisfy some user-specified minimum support and confidence.

13 Construction of Classifier (Cont.)
Step 2: Rank the rules
–A document is classified by the matching rule that has the highest confidence.
–This selection is called most confidence first (MCF).
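
Most confidence first can be sketched as a sort followed by a first-match scan. The `Rule` type, the sample rules and their confidence values, and the `DEFAULT` classset are hypothetical; a document "matches" a rule here when it contains all of the rule's terms.

```python
from typing import NamedTuple


class Rule(NamedTuple):
    terms: frozenset     # T: keywords the document must contain
    classset: frozenset  # CS: classes the rule assigns
    confidence: float    # Conf_g of the rule (made-up values below)


def classify(doc_terms, rules, default):
    """Return the classset of the highest-confidence rule matching the document."""
    ranked = sorted(rules, key=lambda r: r.confidence, reverse=True)
    for rule in ranked:
        if rule.terms <= doc_terms:  # every term of the rule occurs in the document
            return rule.classset
    return default  # assumption: fall back to a default classset


RULES = [
    Rule(frozenset({"dance"}), frozenset({"Dance"}), 0.7),
    Rule(frozenset({"dance", "music"}), frozenset({"Dance", "Music"}), 0.9),
    Rule(frozenset({"hiking"}), frozenset({"Outdoor"}), 0.8),
]
DEFAULT = frozenset({"Recreation"})

print(sorted(classify({"dance", "music", "stage"}, RULES, DEFAULT)))  # ['Dance', 'Music']
print(sorted(classify({"dance", "stage"}, RULES, DEFAULT)))           # ['Dance']
print(sorted(classify({"cooking"}, RULES, DEFAULT)))                  # ['Recreation']
```

The first document matches two rules; the one with confidence 0.9 wins because it is ranked first.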

14 Construction of Classifier (Cont.)
Step 3: Remove rules of low accuracy
–Let D be the set of training documents classified by rule $T \to CS$. The accuracy of $T \to CS$, $\mathrm{Accu}(T \to CS)$, is defined as [formula not preserved in the transcript]

15 Construction of Classifier (Cont.)
–$\mathrm{Conf}_g(T \to CS)$ is defined with respect to all the documents that match the rule, whereas $\mathrm{Accu}(T \to CS)$ is defined with respect to the documents classified by the rule.
–Remove the rules with accuracy below a certain threshold because they contribute negatively to overall accuracy.

16 Construction of Classifier (Cont.)
Step 4: Cut off the ranked list
–If we cut off the list of rules $r_1, \ldots, r_m$ after the first i rules, $r_1, \ldots, r_i$, then Cutoff error = PrefixError($r_i$) + DefaultError($r_i$)
–PrefixError($r_i$) is the sum of the rule errors Error($r_j$) for all rules $r_j$, $1 \le j \le i$
–DefaultError($r_i$) is the error caused by assigning the default classset to all the documents not classified by any of the rules $r_1, \ldots, r_i$
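
The cutoff computation can be sketched as a running sum plus a per-prefix default error. The sketch assumes the list is cut at the prefix with the smallest cutoff error, which the slide suggests but does not state outright; the function name and the error values are made up for illustration.

```python
from itertools import accumulate


def best_cutoff(rule_errors, default_errors):
    """Pick the prefix length i that minimizes the cutoff error.

    rule_errors[j-1]    = Error(r_j) for the j-th ranked rule
    default_errors[i-1] = DefaultError(r_i), the error of giving the default
                          classset to documents not classified by r_1..r_i
    Returns (i, cutoff_error) with i counted from 1.
    """
    prefix_errors = list(accumulate(rule_errors))  # PrefixError(r_i)
    cutoff_errors = [p + d for p, d in zip(prefix_errors, default_errors)]
    best = min(range(len(cutoff_errors)), key=cutoff_errors.__getitem__)
    return best + 1, cutoff_errors[best]


# Prefix errors are [1, 3, 4, 8]; cutoff errors are [11, 8, 7, 10].
print(best_cutoff([1, 2, 1, 4], [10, 5, 3, 2]))  # (3, 7)
```

In this example the fourth rule adds more error than it saves by shrinking the default set, so the list is cut after three rules.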

17 Experiments

18 Experimental Results
The result on the IBM data set
–The error: Coverage beats the others.
–The size: Confidence gets smaller.
–The time: Coverage takes longer.

20 Classification Error

21 Size & Execution Time

