1 Optimizing Text Classification Mark Trenorden Supervisor: Geoff Webb

2 Introduction What is Text Classification? Naïve Bayes Event Models Binomial Model Binning Conclusion

3 Text Classification Grouping documents of the same topic, for example Sport, Politics, etc. A slow process for humans.

4 Naïve Bayes P(c_j | d_i) = P(c_j) P(d_i | c_j) / P(d_i) This is Bayes' theorem. Naïve Bayes assumes independence between attributes, in this case words. That assumption is not correct, yet the classifier still performs well.
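
The following is a minimal sketch of how Bayes' theorem is applied under the independence assumption; the classes, words and probability values are invented for illustration:

```python
import math

# Hypothetical class priors and per-word conditional probabilities P(w | c).
priors = {"sport": 0.5, "politics": 0.5}
word_probs = {
    "sport":    {"goal": 0.08, "election": 0.01},
    "politics": {"goal": 0.01, "election": 0.09},
}

def classify(words):
    """Pick the class maximising P(c) * prod_w P(w | c), in log space.

    P(d) is identical for every class, so it can be dropped when comparing.
    """
    scores = {}
    for c in priors:
        score = math.log(priors[c])
        for w in words:
            score += math.log(word_probs[c].get(w, 1e-6))  # small floor for unseen words
        scores[c] = score
    return max(scores, key=scores.get)

print(classify(["goal", "goal", "election"]))  # -> sport
```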

5 Event Models Different ways of viewing a document. In Bayes' rule this translates to different ways of calculating P(d_i | c_j). There are two frequently used models.

6 Multi-Variate Bernoulli Model In text classification terms: – A document (d_i) is an EVENT – Words (w_t) within the document are considered ATTRIBUTES of d_i – The number of occurrences of a word in a document is not recorded – When calculating the probability of class membership, all words in the vocabulary are considered, even if they don't appear in the document
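
A rough sketch of the Multi-Variate Bernoulli likelihood described above: every vocabulary word contributes whether or not it occurs in the document, and counts are ignored. The vocabulary and probabilities below are placeholders:

```python
import math

vocabulary = ["goal", "election", "bush"]

# Hypothetical P(word occurs at least once in a document | class).
bernoulli_probs = {"sport": {"goal": 0.7, "election": 0.1, "bush": 0.05}}

def log_likelihood_bernoulli(doc_words, c):
    """log P(d | c) under the multi-variate Bernoulli event model.

    Each vocabulary word contributes P(w | c) if it appears in the document
    and 1 - P(w | c) if it does not; occurrence counts are discarded.
    """
    present = set(doc_words)
    total = 0.0
    for w in vocabulary:
        p = bernoulli_probs[c][w]
        total += math.log(p if w in present else 1.0 - p)
    return total

print(log_likelihood_bernoulli(["goal", "goal"], "sport"))
```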

7 Multinomial Model The number of occurrences of a word is captured. Individual word occurrences are considered "events", and the document is considered a collection of events. Only the words that appear in the document, and their counts, are considered when calculating class membership.
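
For contrast, a sketch of the multinomial likelihood: only words that appear in the document contribute, weighted by their counts (the multinomial coefficient is dropped, since it is the same for every class). The probabilities are again placeholders:

```python
import math
from collections import Counter

# Hypothetical P(next word drawn = w | class), summing to 1 over the vocabulary.
multinomial_probs = {"sport": {"goal": 0.6, "election": 0.1, "bush": 0.3}}

def log_likelihood_multinomial(doc_words, c):
    """log P(d | c) under the multinomial event model, up to a class-independent constant.

    Each word occurrence is an event; only words that appear in the document
    are considered, each weighted by how often it occurs.
    """
    counts = Counter(doc_words)
    return sum(n * math.log(multinomial_probs[c][w]) for w, n in counts.items())

print(log_likelihood_multinomial(["goal", "goal", "election"], "sport"))
```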

8 Previous Comparison The Multi-Variate model is good for a small vocabulary, the Multinomial model is good for a large vocabulary, and the Multinomial is much faster than the Multi-Variate.

9 Binomial Model Want to capture occurrences and non-occurrences as well as word frequencies. P(d_i | c_j) = Π_w P(w | c_j)^N × P(¬w | c_j)^(L−N), where c = class, w = word, L = document length and N = number of occurrences of the word in the document; the class score combines this with the prior P(c_j).
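
A sketch of the binomial scoring implied by the formula above, assuming per-word probabilities estimated per class; the probability table and prior below are made-up values, not those used in the experiments:

```python
import math
from collections import Counter

# Hypothetical P(w | c) estimates and class prior.
word_probs = {"gwb": {"bush": 0.2, "cat": 0.05}}
priors = {"gwb": 0.5}

def log_score_binomial(doc_words, c):
    """log P(c) + sum_w log[ P(w | c)^N * (1 - P(w | c))^(L - N) ].

    L is the document length and N the count of word w, so both the
    occurrences and the non-occurrences of each word are captured.
    """
    L = len(doc_words)
    counts = Counter(doc_words)
    total = math.log(priors[c])
    for w, p in word_probs[c].items():
        n = counts.get(w, 0)
        total += n * math.log(p) + (L - n) * math.log(1.0 - p)
    return total

print(log_score_binomial(["bush", "bush", "cat"], "gwb"))
```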

10 Binomial Results Performed just as well as the Multinomial model with a large vocabulary, however it was much slower. It outperformed the Multi-Variate model once the vocabulary increased, but did worse than the existing techniques at smaller vocabulary sizes.

11 Binomial Results [Chart: % correctly classified vs. number of words in the vocabulary]

12 Document Length None of the techniques takes document length into account. Currently, P(d | c) = f(w ∈ d, c). However, we should incorporate the document length l: P(d | c) = f(w ∈ d, l, c).

13 Binning Discretization has been found to be effective for numeric variables in Naïve Bayes. Binning groups documents of similar lengths. The theory is that the word distributions will differ significantly for different lengths, so conditioning on length should help improve classification.

14 Binning For my tests the bin size = 1000 documents; if there are fewer than 2000 documents, only two bins are used. [Diagram: Bin 1 to Bin 4 ordered by increasing document size]
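
A small sketch of that bin-assignment rule, under the assumption that documents are sorted by length and grouped 1000 per bin; this is only an interpretation of the slide, not the exact experimental setup:

```python
def assign_bins(docs, bin_size=1000):
    """Group documents of similar length into bins of `bin_size` documents.

    Documents are sorted by length; a corpus with fewer than 2 * bin_size
    documents is simply split into two halves, as described on the slide.
    """
    ordered = sorted(docs, key=len)
    if len(ordered) < 2 * bin_size:
        mid = len(ordered) // 2
        return [ordered[:mid], ordered[mid:]]
    return [ordered[i:i + bin_size] for i in range(0, len(ordered), bin_size)]

# Four tiny documents -> fewer than 2000, so exactly two bins.
bins = assign_bins([["w"] * n for n in (5, 12, 30, 400)])
print([len(b) for b in bins])  # -> [2, 2]
```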

15 Binning Example Two bins are created. Tables of word counts for each class are created per bin, as opposed to the single table over all words used by the traditional methods. [Tables: counts for the words George, Bush and Cat in the GWB and Not GWB classes, one table for the 0-10 word bin and one for the 11-20 word bin]

16 Binning Given an unseen document, binning helps refine the probabilities. For example, with no bins the probability that the word 'Bush' occurs in the GWB class is 10/40, or 25%. If we know that the document is in the 0-10 words bin, the probability of 'Bush' appearing in GWB is 7/20, or 35%.
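
The refinement amounts to a per-bin lookup of the word counts rather than a pooled one; the numbers below come from this example (the 11-20 bin count of 3/20 follows from the 10/40 total):

```python
# Counts for "Bush" in the GWB class, per document-length bin.
bush_counts = {"0-10": (7, 20), "11-20": (3, 20)}

def p_bush_given_gwb(bin_label=None):
    """Probability of 'Bush' given GWB, pooled or restricted to one bin."""
    if bin_label is None:  # no binning: pool counts across all bins
        num = sum(n for n, _ in bush_counts.values())
        den = sum(d for _, d in bush_counts.values())
        return num / den
    n, d = bush_counts[bin_label]
    return n / d

print(p_bush_given_gwb())        # 10/40 = 0.25
print(p_bush_given_gwb("0-10"))  # 7/20  = 0.35
```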

17 Binning Results When applied to all datasets, binning improved classification accuracy for all techniques.

18 Binning Results 7 Sectors Dataset, Multi-Variate Method

19 Binning Results WebKB Dataset, Multi-Nomial Method

20 Conclusion/Future Goals Binning is the best solution and is applicable to all event models. In future, apply the event models and binning techniques to classification techniques other than Naïve Bayes.

