1 On Compression-Based Text Classification. Yuval Marton (1), Ning Wu (2), and Lisa Hellerstein (2). 1) University of Maryland; 2) Polytechnic University. ECIR-05, Santiago de Compostela, Spain, March 2005.

2 Compression for Text Classification?? Proposed over the last ~10 years. Not well understood why it works. Compression is stupid! Slow! Non-standard! But using compression tools is easy. Does it work? (Controversy. Mess.)

3 Overview
- What's text classification (problem setting)
- Compression-based text classification
- Classification procedures (+ Do it yourself!)
- Compression methods (RAR, LZW, and gzip)
- Experimental evaluation
- Why?? (Compression as a character-based method)
- Influence of sub-word/super-word/non-word features
- Conclusions and future work

4 Text Classification Given a training corpus (labeled documents), learn how to label new (test) documents. Our setting: single-class (each document belongs to exactly one class); 3 topic-classification and 3 authorship-attribution tasks.

5 Classification by Compression Compression programs build a model or dictionary of their input (language modeling). Better model → better compression. Idea: compress a document using different class models; label it with the class achieving the highest compression rate. Minimum Description Length (MDL) principle: select the model with the shortest total length of model + data.

6 Standard MDL (Teahan & Harper)
- Concatenate the training data of each class i → A_i
- Compress A_i → model M_i
- Compress the test document T using each model M_i
- Assign T to its best compressor… and the winner is…

7 Do it yourself Five minutes on how to classify text documents, e.g. according to their topic or author, using only off-the-shelf compression tools (such as WinZip or RAR)…

8 AMDL (Khmelev / Kukushkina et al. 2001)
- Concatenate the training data of each class i → A_i
- Concatenate A_i and the test document T → A_iT
- Compress each A_i and A_iT; subtract compressed file sizes: v_i = |A_iT| − |A_i|
- Assign T to the class i with minimum v_i
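The AMDL procedure can be sketched with an off-the-shelf compressor. Below is a minimal illustration using Python's stdlib gzip on toy byte strings (the paper's experiments ran the actual gzip/LZW/RAR tools on real corpora; the class names and data here are made up):

```python
import gzip

def compressed_size(data: bytes) -> int:
    """Length in bytes of the gzip-compressed representation."""
    return len(gzip.compress(data))

def amdl_classify(class_corpora: dict[str, bytes], test_doc: bytes) -> str:
    """AMDL: assign T to the class i minimizing v_i = |A_i T| - |A_i|."""
    best_class, best_v = None, None
    for label, a_i in class_corpora.items():
        v_i = compressed_size(a_i + test_doc) - compressed_size(a_i)
        if best_v is None or v_i < best_v:
            best_class, best_v = label, v_i
    return best_class

# Toy "classes": the test doc reuses patterns from the English corpus,
# so appending it there costs few extra compressed bytes.
corpora = {
    "english": b"the quick brown fox jumps over the lazy dog " * 50,
    "digits":  b"0123456789 9876543210 1357924680 " * 50,
}
print(amdl_classify(corpora, b"the lazy dog jumps over the quick fox"))  # english
```

Note that nothing is trained explicitly: the compressor's internal model of A_i does all the work.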

9 BCN (Benedetto et al. 2002)
- Like AMDL, but concatenate each document D_j with T → D_jT
- Compress each D_j and D_jT; subtract compressed file sizes: v_jT = |D_jT| − |D_j|
- Assign T to the class of the document D_j with minimum v_jT
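BCN is effectively 1-nearest-neighbor with a compression-based distance. A minimal sketch, again assuming Python's stdlib gzip and made-up toy documents:

```python
import gzip

def bcn_classify(docs: list[tuple[str, bytes]], test_doc: bytes) -> str:
    """BCN: 1-NN using the compression increment |D_j T| - |D_j| as distance.
    `docs` is a list of (class_label, document_bytes) pairs."""
    def v(d: bytes) -> int:
        return len(gzip.compress(d + test_doc)) - len(gzip.compress(d))
    label, _doc = min(docs, key=lambda item: v(item[1]))
    return label

# Hypothetical per-document training set (labels and texts invented):
docs = [
    ("hamilton", b"to the people of the state of new york " * 40),
    ("madison",  b"the powers delegated by the proposed constitution " * 40),
]
print(bcn_classify(docs, b"the people of the state of new york"))  # hamilton
```

Because every individual training document must be compressed with and without T, BCN issues far more compressor calls than AMDL, which matches the slowness reported on slide 16.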

10 Compression Methods
- Gzip: Lempel-Ziv compression (LZ77). "Dictionary"-based; sliding window typically 32K.
- LZW (Lempel-Ziv-Welch): dictionary-based (16-bit codes); dictionary fills up on big corpora (typically after ~300KB).
- RAR (proprietary shareware): PPMII variant on text; Markov model, n-gram frequencies; (almost) unlimited memory.
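To make the LZW limitation concrete, here is a minimal sketch of LZW's code-emitting loop (illustrative only, not the exact tool used in the paper). Once the 16-bit dictionary is full, no new patterns are learned, which is why it degrades on large training corpora:

```python
def lzw_compress(data: bytes, max_codes: int = 2 ** 16) -> list[int]:
    """Minimal LZW: emit one code per longest dictionary match.
    When the dictionary reaches max_codes (16-bit limit), it stops
    growing, so novel patterns late in a big corpus are never added."""
    dictionary = {bytes([i]): i for i in range(256)}  # seed with all bytes
    next_code = 256
    w = b""          # current match prefix
    out: list[int] = []
    for b in data:
        wc = w + bytes([b])
        if wc in dictionary:
            w = wc                       # extend the current match
        else:
            out.append(dictionary[w])    # emit code for longest match
            if next_code < max_codes:    # grow dictionary until the cap
                dictionary[wc] = next_code
                next_code += 1
            w = bytes([b])
    if w:
        out.append(dictionary[w])
    return out

print(lzw_compress(b"ababababab"))  # [97, 98, 256, 258, 257, 98]
```

Ten input bytes become six codes: repeated "ab" patterns are folded into newly learned dictionary entries, which is exactly the modeling that stops once the dictionary is full.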

11 Previous Work
- Khmelev et al. (+ Kukushkina): Russian authors.
- Thaper: LZ78, char- and word-based PPM.
- Frank et al.: compression (PPM) bad for topic.
- Teahan and Harper: compression (PPM) good.
- Benedetto et al.: gzip good for authors.
- Goodman: gzip bad!
- Khmelev and Teahan: RAR (PPM).
- Peng et al.: Markov language models.

12 Compression Good or Bad? Scoring: we measured accuracy = (total # correct classifications) / (total # tests), i.e. micro-averaged accuracy. Why? Single-class labels, no tuning parameters.

13 AMDL Results

Corpus             RAR          LZW    GZIP
Author:
  Federalist (2)   0.94         0.83   0.67
  Gutenberg-10     0.82         0.65   0.62
  Reuters-9        0.78         0.66   0.79
Topic:
  Reuters-10       0.87         0.84   0.83
  10news (20news)  0.96 (0.90)  0.66   0.56 (0.47)
  Sector (105)     0.90         0.61   0.19

14 RAR is a Star! RAR is the best-performing method on all but the small Reuters-9 corpus. Poor performance of gzip on large corpora is due to its 32K sliding window. Poor performance of LZW: its dictionary fills up after ~300KB (among other reasons).

15 RAR on Standard Corpora - Comparison
90.5% for RAR on 20news, vs.:
- 89.2% Language Modeling (Peng et al. 2004)
- 86.2% Extended NB (Rennie et al. 2003)
- 82.1% PPMC (Teahan and Harper 2001)
89.6% for RAR on Sector, vs.:
- 93.6% SVM (Zhang and Oles 2001)
- 92.3% Extended NB (Rennie et al. 2003)
- 64.5% Multinomial NB (Ghani 2001)

16 AMDL vs. BCN
Gzip/BCN good, due to processing each doc separately with T (1-NN behavior). Gzip/AMDL bad. BCN was slow, probably due to more system calls and disk I/O.

Corpus       AMDL/RAR  AMDL/gzip  BCN/RAR  BCN/gzip
Federalist     0.94      0.67        -       0.78
Guten-10       0.82      0.62       0.75     0.72
Reuters-9      0.78      0.79        -       0.77

17 Why Good?! Compression tools are character-based. (Stupid, remember?) Better than word-based? WHY? Can they capture sub-word, word, super-word, and non-word features?

18 Pre-processing
- STD: no change to input, e.g. "the more – the better!"
- NoP: remove punctuation; replace white spaces (tab, line, paragraph & page breaks) with spaces, e.g. "the more the better"
- WOS: NoP + word-order scrambling
- RSW: NoP + random-string words, e.g. "dqftmdwdqflkwe"
… and more …
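These pre-processing variants can be sketched as simple text transforms. This is a hypothetical re-implementation for illustration; the paper's exact tokenization and random-string generation may differ:

```python
import random
import re
import string

def nop(text: str) -> str:
    """NoP: strip punctuation; collapse all white space to single spaces."""
    no_punct = re.sub(r"[^\w\s]", " ", text)
    return " ".join(no_punct.split())

def wos(text: str, seed: int = 0) -> str:
    """WOS: NoP plus word-order scrambling.
    Sub-word and word info survive; super-word relations are destroyed."""
    words = nop(text).split()
    random.Random(seed).shuffle(words)
    return " ".join(words)

def rsw(text: str, seed: int = 0) -> str:
    """RSW: NoP, then map each distinct word to a random string of the
    same length (a guess at the scheme; destroys sub-word/word identity)."""
    rng = random.Random(seed)
    mapping: dict[str, str] = {}
    def repl(w: str) -> str:
        if w not in mapping:
            mapping[w] = "".join(rng.choice(string.ascii_lowercase) for _ in w)
        return mapping[w]
    return " ".join(repl(w) for w in nop(text).split())

print(nop("the more - the better!"))  # the more the better
```

Running the same classifier on STD/NoP/WOS/RSW versions of a corpus isolates which feature level (non-word, super-word, word, sub-word) the compressor is actually exploiting.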

19 Non-words: Punctuation Intuition: punctuation usage is characteristic of writing style (authorship attribution). Results: accuracy remained the same, or even increased, in many cases. RAR is insensitive to punctuation removal.

20 Super-words: Word Order Scrambling (WOS) WOS removes punctuation and scrambles word order: it leaves sub-word and word info intact but destroys super-word relations. RAR: accuracy declined on all but one corpus → it seems to exploit word sequences (n-grams?). A potential advantage over state-of-the-art bag-of-words methods such as SVM. LZW & gzip: no consistent accuracy decline.

21 Summary
Compared effectiveness of compression for text classification (compression methods × classification procedures).
RAR (PPM) is a star - under AMDL.
- BCN (1-NN) slow(er) and never better in accuracy.
- Compression good (Teahan and Harper).
- Character-based Markov models good (Peng et al.).
Introduced pre-processing testing techniques: novel ways to test how compression (and other character-based methods) exploits sub-word/super-word/non-word features.
- RAR benefits from super-word info.
- Suggests word-based methods might benefit from it too.

22 Future Research
- Test / confirm results on more and bigger corpora.
- Compare to state-of-the-art techniques: other compression / character-based methods; SVM; word-based n-gram language modeling (Peng et al.).
- Word-based compression?
- Use standard MDL (Teahan and Harper): faster, better insight.
- Sensitivity to class training-data imbalance.
- When is throwing away data desirable for compression?

23 Thank you!

