Dynamic Category Profiling for Text Filtering and Classification

Rey-Long Liu, Dept. of Medical Informatics, Tzu Chi University, Taiwan, R.O.C.

Goal
Promote both the precision and the recall of text filtering and classification by:
- Finding more suitable features (terms) to measure the degree of acceptance (DOA) of a document d with respect to a category c
- Deriving a method to make filtering and classification decisions based on the DOA estimation

Motivation
Previous techniques often find and employ features (terms) that are:
- Representative of a category
- Discriminative, so that the category can be distinguished from others
Unfortunately, content overlapping was often ignored: a document d may be classified into a category c only if they share the same content to some extent.

An example
Two categories:
- c1: computer networks
- c2: computer animations
Previous techniques tend to employ terms like “network” and “animation” as features, since they are both representative and discriminative.
Unfortunately, the term “computer” may NOT be selected, although it helps filter out documents that are about networks but NOT about computers (e.g. traffic networks).

Therefore, features of a category should be selected dynamically when a document is entered.
To discriminate c from others (considered by the underlying classifier):
- Features that correlate with c
- Features that correlate with other categories
To validate content overlapping (not considered by the underlying classifier; covered by dynamic profiling):
- Features that appear in c but do not appear in d
- Features that do not appear in c but appear in d

The method: DP4FC
DP4FC: Dynamic Profiling for Filtering & Classification
Associating various classifiers with DP4FC
[Figure: DP4FC architecture.
Training: Documents for Classifier Building, Classifier Building, Underlying Classifier; Documents for Threshold Tuning, DOA Estimation (by the underlying classifier and by Dynamic Profiling), Threshold Tuning.
Testing: Document for TF & TC, DOA Estimation, Integrated TF & TC, Filtered Documents and Classified Documents.]

DOA estimation by dynamic profiling
Procedure DOAEstimationByDP(c, d), where
  (1) c is a category,
  (2) d is a document for thresholding or testing
Return: DOA value of d with respect to c
Begin
  (1) DOAbyDP = 0;
  (2) For each term t in c but not in d, do
    (2.1) DOAReduction = Support(t, c) × log2(IDF of t in training data and d);
    (2.2) DOAbyDP = DOAbyDP - DOAReduction;
  (3) For each term t in d but not in c, do
    (3.1) DOAReduction = Support(t, d) × log2(IDF of t in training data and d);
    (3.2) DOAbyDP = DOAbyDP - DOAReduction;
  (4) Return DOAbyDP;
End.
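The procedure can be sketched in Python as follows. This is only an illustrative reading of the slide, not the author's implementation: the slide does not define Support or the exact IDF formula, so the sketch assumes that supports are given as term-to-weight dictionaries, that step (3.1) uses the support of t in d, and that IDF is the number of documents divided by the document frequency of t. The function and parameter names are hypothetical.

```python
import math


def doa_by_dp(category_support, doc_support, doc_freq, n_docs):
    """Estimate the DOA of a document d with respect to a category c.

    category_support: dict term -> Support(t, c)  (weighting scheme assumed)
    doc_support:      dict term -> Support(t, d)  (weighting scheme assumed)
    doc_freq:         dict term -> number of documents containing the term,
                      counted over the training data plus d
    n_docs:           number of documents in the training data plus d
    The result is never positive; 0.0 means no mismatch was found.
    """

    def log_idf(term):
        # Assumed IDF definition: n_docs / document frequency.
        return math.log2(n_docs / doc_freq[term])

    doa = 0.0
    # Step (2): terms that the category has but the document lacks.
    for term, support in category_support.items():
        if term not in doc_support:
            doa -= support * log_idf(term)
    # Step (3): terms that the document has but the category lacks.
    for term, support in doc_support.items():
        if term not in category_support:
            doa -= support * log_idf(term)
    return doa


# Toy data for category c1 (computer networks); all numbers are made up.
c1 = {"computer": 0.5, "network": 0.5}
on_topic = {"computer": 0.5, "network": 0.5}
traffic_doc = {"traffic": 0.6, "network": 0.4}
doc_freq = {"computer": 4, "network": 4, "traffic": 1}

print(doa_by_dp(c1, on_topic, doc_freq, n_docs=8))
print(doa_by_dp(c1, traffic_doc, doc_freq, n_docs=8))
```

In this toy run the traffic-network document is penalized twice, once for lacking “computer” and once for containing “traffic”, so its DOA is lower than that of the on-topic document.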

Making a filtering and classification decision
Two thresholds are derived for each category:
- One is based on the DOA values produced by the underlying classifier
- The other is based on the DOA values produced by DP4FC
A document may be classified into a category only if its DOA values are greater than or equal to the corresponding thresholds of the category.
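A minimal sketch of this decision rule is shown below. The slide states the two-threshold test as a necessary condition; the sketch assumes, for illustration only, that passing both thresholds is also sufficient, and that a document accepted by no category counts as filtered out. The names `accept` and `classify` and the data layout are hypothetical.

```python
def accept(classifier_doa, dp_doa, classifier_threshold, dp_threshold):
    """Return True if both DOA values reach the category's thresholds."""
    return classifier_doa >= classifier_threshold and dp_doa >= dp_threshold


def classify(doa_scores, thresholds):
    """Return the categories that accept a document.

    doa_scores: dict category -> (classifier DOA, dynamic-profiling DOA)
    thresholds: dict category -> (classifier threshold, DP threshold)
    An empty result means the document is filtered out.
    """
    return [
        category
        for category, (classifier_doa, dp_doa) in doa_scores.items()
        if accept(classifier_doa, dp_doa, *thresholds[category])
    ]


# Made-up scores and thresholds for two categories.
thresholds = {"c1": (0.30, -1.0), "c2": (0.30, -1.0)}
scores = {"c1": (0.45, -0.2), "c2": (0.50, -2.3)}
print(classify(scores, thresholds))
```

Here c2 is rejected even though the underlying classifier scores it highly, because its dynamic-profiling DOA falls below the second threshold.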

Experiment
Aspects and settings:
(1) Source of experimental data
  (A) Reuters-21578
  (B) Yahoo text hierarchy
(2) Split of test data
  (A) In-space test data (for evaluating TC)
  (B) Out-space test data (for evaluating TF)
(3) Split of the training data for classifier building (CB) and threshold tuning (TT)
  (A) 50% for CB; 50% for TT (with 2-fold cross validation)
  (B) 80% for CB; 20% for TT (with 5-fold cross validation)
(4) Parameter settings for the classifier
  Different sizes of feature sets on which the classification methodologies were built

The underlying classifier
The Rocchio classifier (RO):
$\eta_1 \cdot \sum_{Doc \in P} Doc / |P| \; - \; \eta_2 \cdot \sum_{Doc \in N} Doc / |N|$
- P and N are the sets of positive and negative documents, respectively
- RO is often tested in both text filtering and classification
Parameter setting: $\eta_1 = 16$; $\eta_2 = 4$
- Previous studies showed that this setting is good for RO
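As a rough illustration of how such a profile is built, the sketch below computes the weighted difference between the mean positive vector and the mean negative vector, using the two weights from the slide (16 and 4) as defaults. It assumes documents are represented as term-to-weight dictionaries, which the slide does not specify; the names `rocchio_profile`, `w_pos`, and `w_neg` are hypothetical.

```python
from collections import defaultdict


def rocchio_profile(positive_docs, negative_docs, w_pos=16.0, w_neg=4.0):
    """Build a Rocchio category profile.

    positive_docs, negative_docs: lists of dicts term -> weight
    Returns a dict term -> profile weight, i.e.
    w_pos * mean(positive vectors) - w_neg * mean(negative vectors).
    """
    profile = defaultdict(float)
    for doc in positive_docs:
        for term, weight in doc.items():
            profile[term] += w_pos * weight / len(positive_docs)
    for doc in negative_docs:
        for term, weight in doc.items():
            profile[term] -= w_neg * weight / len(negative_docs)
    return dict(profile)


# Toy documents with binary term weights (made up).
positives = [{"network": 1.0, "computer": 1.0}, {"network": 1.0}]
negatives = [{"animation": 1.0, "computer": 1.0}]
profile = rocchio_profile(positives, negatives)
print(profile)
```

Note that “computer” ends up with a much lower weight than “network” because it also occurs in the negative document, which matches the earlier observation that such shared terms tend not to be selected as features.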

Evaluation criteria
For text classification:
- Precision (P), Recall (R), and F1 = 2PR/(P+R)
For text filtering:
- Filtering Ratio (FR) = # out-space documents filtered out / # out-space documents
- Average Misclassifications (AM) = # misclassifications / # out-space documents misclassified into the category space
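The three criteria can be computed as in the sketch below. It assumes that the system output for each out-space document is a list of assigned categories (empty if the document was filtered out) and that every assignment of an out-space document counts as one misclassification; both are interpretations of the slide, and the function names are hypothetical.

```python
def f1_score(precision, recall):
    """F1 = 2PR / (P + R); defined here as 0.0 when P + R is 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def filtering_ratio(assignments):
    """Fraction of out-space documents that received no category."""
    filtered_out = sum(1 for categories in assignments if not categories)
    return filtered_out / len(assignments)


def average_misclassifications(assignments):
    """Misclassifications per out-space document that was not filtered out."""
    misclassified = [categories for categories in assignments if categories]
    if not misclassified:
        return 0.0
    return sum(len(categories) for categories in misclassified) / len(misclassified)


# Four made-up out-space documents: two filtered out, two misclassified.
out_space = [[], [], ["c1"], ["c1", "c2"]]
print(f1_score(0.5, 0.5))
print(filtering_ratio(out_space))
print(average_misclassifications(out_space))
```

A higher FR and a lower AM are better: FR rewards rejecting out-space documents, while AM measures how widely the remaining ones are spread across categories.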

Results
[Figure: Performance (in F1) in processing in-space documents]
[Figure: Performance (in FR) in processing out-space documents]
[Figure: Performance (in AM) in processing out-space documents]

Conclusion
- For each category, most documents should be filtered out
- Content overlapping between a document and a category is thus important
  - It measures how much d talks about content that is not in c, and vice versa
  - Unfortunately, it is often ignored by previous techniques
- This calls for dynamic profiling for each category
- With dynamic profiling, the classifier's performance may be both better and more stable in both filtering and classification

Thank you
