On Compression-Based Text Classification
Yuval Marton (1), Ning Wu (2) and Lisa Hellerstein (2)
(1) University of Maryland, (2) Polytechnic University
ECIR-05, Santiago de Compostela, Spain. March 2005
Compression for Text Classification??
Proposed over the last ~10 years; not well understood why it works.
Objections: compression is "stupid"! slow! non-standard!
But using compression tools is easy. Does it work? (Controversy; messy results.)
Overview
- What is text classification? (problem setting)
- Compression-based text classification
- Classification procedures (+ do it yourself!)
- Compression methods (RAR, LZW, and gzip)
- Experimental evaluation
- Why does it work? (compression as a character-based method)
- Influence of sub-word / super-word / non-word features
- Conclusions and future work
Text Classification
Given a training corpus (labeled documents), learn how to label new (test) documents.
Our setting is single-class: each document belongs to exactly one class.
Tasks: 3 topic classification and 3 authorship attribution tasks.
Classification by Compression
Compression programs build a model or dictionary of their input (language modeling): a better model yields better compression.
Idea: compress a document using different class models; label it with the class achieving the highest compression rate.
Minimum Description Length (MDL) principle: select the model minimizing the combined length of the model plus the data compressed under it.
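In symbols (our notation, not from the slides), the MDL rule picks

    \hat{c}(T) = \arg\min_i \left( L(M_i) + L(T \mid M_i) \right)

where L(M_i) is the description length of class i's model and L(T | M_i) is the length of the test document T compressed under that model. The compression-rate rule used in the procedures below drops the model term and minimizes L(T | M_i) alone.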
Standard MDL (Teahan & Harper)
1. Concatenate the training documents of each class i into A_i.
2. Compress each A_i, yielding a class model M_i.
3. Compress the test document T using each model M_i.
4. Assign T to the class of its best compressor... and the winner is...
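A minimal sketch of this frozen-model procedure, assuming Python's zlib as a stand-in compressor (DEFLATE, not the PPM used here); zlib's preset dictionary only exposes the last 32 KB of A_i, so this only approximates a class model:

import zlib

def model_size(model: bytes, doc: bytes) -> int:
    """Length of doc compressed with model as a preset dictionary
    (DEFLATE actually uses only the last 32 KB of model)."""
    c = zlib.compressobj(level=9, zdict=model[-32768:])
    return len(c.compress(doc) + c.flush())

def mdl_classify(class_corpora: dict[str, bytes], test_doc: bytes) -> str:
    """Assign test_doc to the class whose (frozen) model compresses it best."""
    return min(class_corpora,
               key=lambda c: model_size(class_corpora[c], test_doc))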
Do it yourself Five minutes on how to classify text documents e.g., according to their topic or author, using only off-the-shelf compression tools (such as WinZip or RAR) …
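As a concrete starting point, a hedged sketch of measuring compressed size with an off-the-shelf tool; it assumes a rar binary on the PATH (any archiver with a comparable command line would do):

import os
import subprocess
import tempfile

def rar_size(path: str) -> int:
    """Size in bytes of `path` compressed with the off-the-shelf rar tool
    (`rar a <archive> <file>` adds the file to a fresh archive)."""
    with tempfile.TemporaryDirectory() as tmp:
        archive = os.path.join(tmp, "out.rar")
        subprocess.run(["rar", "a", archive, path],
                       check=True, stdout=subprocess.DEVNULL)
        return os.path.getsize(archive)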
AMDL (Khmelev / Kukushkina et al. 2001)
1. Concatenate the training documents of each class i into A_i.
2. Concatenate each A_i and the test document T, giving A_iT.
3. Compress each A_i and each A_iT; subtract compressed file sizes: v_i = |A_iT| - |A_i|.
4. Assign T to the class i with minimum v_i.
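A minimal sketch of AMDL, assuming zlib for the compressed-size measurements (rar_size above could be substituted to approximate the RAR rows of the results):

import zlib

def csize(data: bytes) -> int:
    """Compressed size in bytes (DEFLATE via zlib, maximum compression)."""
    return len(zlib.compress(data, 9))

def amdl_classify(class_corpora: dict[str, bytes], test_doc: bytes) -> str:
    """AMDL: pick the class i minimizing v_i = |A_i T| - |A_i|, where A_i is
    the concatenated training data of class i and T is the test document."""
    v = {label: csize(a + test_doc) - csize(a)
         for label, a in class_corpora.items()}
    return min(v, key=v.get)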
BCN (Benedetto et al. 2002)
Like AMDL, but concatenate each training document D_j with T, giving D_jT.
1. Compress each D_j and each D_jT; subtract compressed file sizes: v_DT = |D_jT| - |D_j|.
2. Assign T to the class of the document D_j with minimum v_DT.
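A matching sketch of BCN, reusing csize from the AMDL sketch above; this is effectively 1-nearest-neighbor with a compression-based distance:

def bcn_classify(train_docs: list[tuple[str, bytes]], test_doc: bytes) -> str:
    """BCN: for each training document D_j with class label c_j, compute
    v = |D_j T| - |D_j|; return the label of the minimizing document (1-NN)."""
    best_label, best_v = None, float("inf")
    for label, d in train_docs:
        v = csize(d + test_doc) - csize(d)
        if v < best_v:
            best_label, best_v = label, v
    return best_label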
Compression Methods
Gzip: Lempel-Ziv compression (LZ77). "Dictionary"-based; sliding window typically 32K.
LZW (Lempel-Ziv-Welch): dictionary-based (16-bit); dictionary fills up on big corpora (typically after ~300KB).
RAR (proprietary shareware): PPMII variant on text; Markov model over n-gram frequencies; (almost) unlimited memory of its input.
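A toy demonstration of why the 32K window matters (assumption: zlib's DEFLATE as a stand-in for gzip; exact sizes vary per run):

import os
import zlib

# An incompressible 2 KB block, repeated either inside or outside
# DEFLATE's ~32 KB sliding window.
chunk = os.urandom(2000)
near = chunk + chunk                     # second copy within the window
far = chunk + bytes(40_000) + chunk      # second copy ~40 KB later

print(len(zlib.compress(near, 9)))       # ~2 KB: second copy found as a match
print(len(zlib.compress(far, 9)))        # ~4 KB: first copy already forgotten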
Previous Work
- Khmelev et al. (+ Kukushkina): Russian authors.
- Thaper: LZ78, char- and word-based PPM.
- Frank et al.: compression (PPM) bad for topic.
- Teahan and Harper: compression (PPM) good.
- Benedetto et al.: gzip good for authors. Goodman: gzip bad!
- Khmelev and Teahan: RAR (PPM).
- Peng et al.: Markov language models.
Compression: Good or Bad?
Scoring: we measured accuracy = (total # correct classifications) / (total # tests), i.e., micro-averaged accuracy.
Why accuracy? Single-class labels, no tuning parameters.
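In code, "micro-averaged" just means pooling all test cases across classes before dividing (a trivial sketch; names are ours):

def micro_accuracy(results: list[tuple[str, str]]) -> float:
    """Micro-averaged accuracy over (gold_label, predicted_label) pairs:
    total correct classifications divided by total tests, pooled across classes."""
    correct = sum(1 for gold, pred in results if gold == pred)
    return correct / len(results)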
AMDL Results (accuracy)

Corpus             RAR          LZW    GZIP
Author
  Federalist (2)   0.94         0.83   0.67
  Gutenberg-10     0.82         0.65   0.62
  Reuters-9        0.78         0.66   0.79
Topic
  Reuters-10       0.87         0.84   0.83
  10news (20news)  0.96 (0.90)  0.66   0.56 (0.47)
  Sector (105)     0.90         0.61   0.19
RAR is a Star!
RAR is the best-performing method on all but the small Reuters-9 corpus.
Poor performance of gzip on large corpora is due to its 32K sliding window.
Poor performance of LZW: its dictionary fills up after ~300KB, among other reasons.
RAR on Standard Corpora: Comparison
90.5% for RAR on 20news, vs.:
- 89.2% Language Modeling (Peng et al. 2004)
- 86.2% Extended NB (Rennie et al. 2003)
- 82.1% PPMC (Teahan and Harper 2001)
89.6% for RAR on Sector, vs.:
- 93.6% SVM (Zhang and Oles 2001)
- 92.3% Extended NB (Rennie et al. 2003)
- 64.5% Multinomial NB (Ghani 2001)
AMDL vs. BCN

                 AMDL          BCN
Corpus           RAR    gzip   RAR    gzip
Federalist       0.94   0.67   –      0.78
Guten-10         0.82   0.62   0.75   0.72
Reuters-9        0.78   0.79   –      0.77

Gzip under BCN is good, since each training document is processed separately with T (a 1-NN effect); gzip under AMDL is bad.
BCN was slow, probably due to more system calls and disk I/O.
Why Good?!
Compression tools are character-based. (Stupid, remember?) Are they better than word-based methods? WHY?
Can they capture sub-word, word, super-word, and non-word features?
Pre-processing
- STD: no change to input.
- NoP: remove punctuation; replace whitespace (tabs, line, paragraph & page breaks) with spaces.
- WOS: NoP + word order scrambling.
- RSW: NoP + random-string words.
... and more ...
Example: "the more – the better!" becomes "the more the better" under NoP, "themorethebetter" in a further variant with spaces removed, and a random string such as "dqftmdwdqflkwe" under RSW.
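A hedged sketch of these transformations (the function names, and the exact RSW scheme of mapping each distinct word consistently to a random string of matching length, are our assumptions):

import random
import re
import string

def nop(text: str) -> str:
    """NoP: strip punctuation and collapse all whitespace to single spaces."""
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()

def wos(text: str, seed: int = 0) -> str:
    """WOS: NoP plus word-order scrambling (destroys super-word features)."""
    words = nop(text).split()
    random.Random(seed).shuffle(words)
    return " ".join(words)

def rsw(text: str, seed: int = 0) -> str:
    """RSW: NoP plus replacing each distinct word, consistently, with a
    random string of the same length (destroys sub-word features)."""
    rng = random.Random(seed)
    words = nop(text).split()
    table = {w: "".join(rng.choice(string.ascii_lowercase) for _ in w)
             for w in set(words)}
    return " ".join(table[w] for w in words)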
Non-words: Punctuation
Intuition: punctuation usage is characteristic of writing style (authorship attribution).
Results: accuracy remained the same, or even increased, in many cases. RAR is insensitive to punctuation removal.
Super-words (word sequences): Word Order Scrambling (WOS)
WOS removes punctuation and scrambles word order: it leaves sub-word and word information intact but destroys super-word relations.
RAR: accuracy declined on all but one corpus, so RAR seems to exploit word sequences (n-grams?). This is an advantage over state-of-the-art bag-of-words methods such as SVM.
LZW & gzip: no consistent accuracy decline.
Summary
Compared the effectiveness of compression for text classification (compression methods x classification procedures).
RAR (PPM) is a star, under AMDL:
- BCN (1-NN) is slower and never better in accuracy.
- Compression is good (Teahan and Harper).
- Character-based Markov models are good (Peng et al.).
Introduced pre-processing testing techniques: novel ways to test how compression and other character-based methods exploit sub-word / super-word / non-word features.
- RAR benefits from super-word info.
- This suggests word-based methods might benefit from it too.
Future Research
- Test / confirm results on more and bigger corpora.
- Compare to state-of-the-art techniques: other compression / character-based methods; SVM; word-based n-gram language modeling (Peng et al.).
- Word-based compression?
- Use standard MDL (Teahan and Harper): faster, better insight.
- Sensitivity to class training-data imbalance: when is throwing away data desirable for compression?
Thank you!