Published by Samuel Wilcox. Modified over 5 years ago.

1
**Scalable Text Mining with Sparse Generative Models**

Antti Puurula
PhD thesis presentation, University of Waikato, New Zealand, 8th June 2015

2
**Introduction**

- This thesis presents a framework of probabilistic text mining based on sparse generative models
- Models developed in the framework show state-of-the-art effectiveness in both text classification and retrieval tasks
- The proposed sparse inference for these models improves scalability, enabling text mining for very large-scale tasks

3
**Major Contributions of the Thesis**

- Formalizing multinomial modeling of text
  - Smoothing as two-state Hidden Markov Models
  - Fractional counts as probabilistic data
  - Weighted factors as log-linear models
- Scalable inference on text
  - Sparse inference using inverted indices for statistical models
  - Tied Document Mixture, a model benefiting from sparse inference
- Extensive evaluation using a combined experimental setup for classification and retrieval

4
**Defining Text Mining**

- “Knowledge Discovery in Textual databases” (KDT) [Feldman and Dagan]
- “Text Mining as Integration of Several Related Research Areas” [Grobelnik et al., 2000]
- Definition used in this thesis: text mining is an interdisciplinary field of research on the automatic processing of large quantities of text data for valuable information

5
**Related Fields and Application Domains**

6
**Volume of Text Mining Publications**

References per year found for related fields using academic search engines

9
**Scale of Text Data**

Existing collections:

- Google Books, 30M books (2013)
- Twitter, 200M users, 400M messages per day (2013)
- WhatsApp, 430M users, 50B messages per day (2014)

Available research collections:

- English Wikipedia, 4.5M articles (2014)
- Google n-grams, 5-grams estimated from 1T words (2007)
- Annotated English Gigaword, 4B words with metadata (2012)
- TREC KBA, 394M annotated documents for classification (2014)

10
**Text Mining Methodology in a Nutshell**

- Normalize and map documents into a structured representation, such as a vector of word counts
- Segment a problem into machine learning tasks
- Solve the tasks using algorithms, most commonly linear models
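The first step can be illustrated with a minimal Python sketch. The tokenization rule (lowercasing and keeping alphanumeric runs) and the function name `to_count_vector` are assumptions made for illustration, not the normalization used in the thesis.

```python
import re
from collections import Counter

def to_count_vector(text):
    """Normalize text (lowercase, alphanumeric tokens only) and count words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(tokens)

doc = "Text mining: mining large quantities of TEXT."
counts = to_count_vector(doc)
print(counts)
```

A `Counter` is a sparse representation: only words that occur in the document are stored, which is what later makes index-based inference possible.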

13
**Linear Models for Text Mining**

Multi-class linear scoring function:

14
**Multinomial Naive Bayes**

- Bayes model with multinomials conditioned on label variables
- Priors are categorical, label-conditionals are multinomial, and the normalizer is constant
- Directed generative graphical model
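Because the normalizer is constant, MNB scoring reduces to a linear function of the word counts in log space. The sketch below shows this under stated assumptions: add-one (Laplace) smoothing is used only to keep the toy example free of zero probabilities, and the names `train_mnb` and `score` are hypothetical.

```python
import math
from collections import Counter

def train_mnb(labeled_docs, pseudo_count=1.0):
    """Estimate log-priors and log label-conditionals from (label, Counter) pairs."""
    label_counts = Counter()
    word_counts = {}
    vocab = set()
    for label, counts in labeled_docs:
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(counts)
        vocab.update(counts)
    total_docs = sum(label_counts.values())
    log_prior = {l: math.log(c / total_docs) for l, c in label_counts.items()}
    log_cond = {}
    for l, wc in word_counts.items():
        # Assumption: add-one smoothing over the training vocabulary.
        denom = sum(wc.values()) + pseudo_count * len(vocab)
        log_cond[l] = {n: math.log((wc[n] + pseudo_count) / denom) for n in vocab}
    return log_prior, log_cond

def score(counts, log_prior, log_cond):
    """Linear score per label: log-prior (bias) plus count-weighted log-probabilities."""
    return {
        l: log_prior[l]
        + sum(c * log_cond[l][n] for n, c in counts.items() if n in log_cond[l])
        for l in log_prior
    }

train = [
    ("sports", Counter({"ball": 2, "team": 1})),
    ("politics", Counter({"vote": 2, "team": 1})),
]
log_prior, log_cond = train_mnb(train)
scores = score(Counter({"ball": 1, "team": 1}), log_prior, log_cond)
best = max(scores, key=scores.get)
print(best)
```

The scores are unnormalized log-posteriors; since the normalizer is the same for every label, taking the maximum gives the same decision as the full posterior.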

20
**Formalizing Smoothing of Multinomials**

- All smoothing methods for multinomials can be expressed as an interpolation between an unsmoothed label-conditional model and a background model, controlled by a smoothing weight
- Discounting of counts is applied to the unsmoothed label-conditional model
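The interpolation view can be sketched in Python as follows. The symbol names (`alpha` for the smoothing weight, `mu` for a Dirichlet prior parameter) are assumptions, since the slide's own notation is not preserved, and Dirichlet prior smoothing is brought in here only as a well-known example of a method that fits the same form.

```python
from collections import Counter

def unsmoothed(counts):
    """Relative-frequency estimate of a label-conditional multinomial."""
    total = sum(counts.values())
    return {n: c / total for n, c in counts.items()}

def interpolate(p_unsmoothed, p_background, alpha):
    """Smoothed model: (1 - alpha) * unsmoothed + alpha * background."""
    return {
        n: (1 - alpha) * p_unsmoothed.get(n, 0.0) + alpha * p_bg
        for n, p_bg in p_background.items()
    }

label_counts = Counter({"ball": 3, "team": 1})
background = {"ball": 0.25, "team": 0.25, "vote": 0.5}

# Jelinek-Mercer smoothing: a fixed smoothing weight.
jm = interpolate(unsmoothed(label_counts), background, alpha=0.2)

# Dirichlet prior smoothing: same form, with a count-dependent weight.
mu = 4.0
total = sum(label_counts.values())
dp = interpolate(unsmoothed(label_counts), background, alpha=mu / (total + mu))
print(jm, dp)
```

Both methods differ only in how the smoothing weight is chosen, which is the point of the unified formulation.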

22
**Two-State HMM Formalization of Smoothing**

- Replace the multinomial with a 0th order categorical state-emission HMM, with M=2 hidden states
- Component m=2 is shared between the 2-state HMMs for each label

23
**Two-State HMM Formalization of Smoothing (2)**

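- Label-conditionals can be rewritten as a sum over the two hidden components
- Choosing the component weights and component emission distributions to match the smoothing weight, the unsmoothed model, and the background model implements the smoothed multinomials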

24
**Two-State HMM Formalization of Smoothing (3)**

- Maximum likelihood estimation is difficult, due to the sum over hidden component assignments
- Given a prior distribution over component assignments, expected log-likelihood estimation decouples

26
**Formalizing Fractional Counts**

- Fractional counts are undefined for categorical and multinomial models
- Formalization is possible with probabilistic data
- A weight sequence matching a word sequence can be interpreted as probabilities of words occurring in data
- Expected log-likelihoods and log-probabilities given expected counts reproduce results from using fractional counts
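A small Python sketch of this interpretation, under the assumption that each weight lies in [0, 1] and is read as the probability that the word at that position occurred. The function names are hypothetical.

```python
from collections import defaultdict

def expected_counts(words, weights):
    """Sum per-position occurrence probabilities into expected word counts."""
    counts = defaultdict(float)
    for word, weight in zip(words, weights):
        counts[word] += weight
    return dict(counts)

def estimate(counts):
    """Relative frequencies of expected counts (maximizes expected log-likelihood)."""
    total = sum(counts.values())
    return {n: c / total for n, c in counts.items()}

words = ["ball", "team", "ball"]
weights = [0.5, 1.0, 0.25]
counts = expected_counts(words, weights)
model = estimate(counts)
print(counts, model)
```

The estimate is numerically identical to plugging the fractional counts directly into the usual multinomial estimator, which is why the formalization reproduces existing practice.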

27
**Formalizing Fractional Counts (2)**

Estimation with expected log-likelihood

28
**Formalizing Fractional Counts (3)**

Inference with expected log-probability

29
**Extending MNB with Scaled Factors**

MNB is extended with scaled factors for label priors and document lengths, where the label prior and document length factors are scaled and renormalized.

31
**Sparse Inference for MNB**

- Naive MNB posterior inference evaluates the label-conditional of every label for a document
- Sparse inference instead uses an inverted index with precomputed values
- This reduces the time complexity of inference
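The idea can be sketched in Python for a Jelinek-Mercer smoothed MNB. This is an illustrative reconstruction under assumptions, not the thesis implementation: a single smoothing weight `alpha` is shared by all labels, so the score of a word unseen under a label does not depend on the label, and the index stores only the difference from that default. All function and variable names are hypothetical.

```python
import math
from collections import defaultdict

def build_index(label_models, background, alpha):
    """Inverted index: word -> [(label, smoothed log-prob minus background-only log-prob)]."""
    index = defaultdict(list)
    for label, p_u in label_models.items():
        for word, p in p_u.items():
            smoothed = (1 - alpha) * p + alpha * background[word]
            delta = math.log(smoothed) - math.log(alpha * background[word])
            index[word].append((label, delta))
    return index

def naive_scores(doc, log_prior, label_models, background, alpha):
    """Score every label against every document word."""
    scores = {}
    for label, p_u in label_models.items():
        s = log_prior[label]
        for word, count in doc.items():
            s += count * math.log((1 - alpha) * p_u.get(word, 0.0) + alpha * background[word])
        scores[label] = s
    return scores

def sparse_scores(doc, log_prior, index, background, alpha):
    """Shared background score plus corrections from the postings of document words."""
    base = sum(count * math.log(alpha * background[word]) for word, count in doc.items())
    scores = {label: lp + base for label, lp in log_prior.items()}
    for word, count in doc.items():
        for label, delta in index.get(word, ()):
            scores[label] += count * delta
    return scores

label_models = {
    "sports": {"ball": 0.75, "team": 0.25},
    "politics": {"vote": 0.75, "team": 0.25},
}
background = {"ball": 0.25, "team": 0.25, "vote": 0.5}
log_prior = {"sports": math.log(0.5), "politics": math.log(0.5)}
alpha = 0.2
doc = {"ball": 2, "team": 1}

index = build_index(label_models, background, alpha)
naive = naive_scores(doc, log_prior, label_models, background, alpha)
sparse = sparse_scores(doc, log_prior, index, background, alpha)
print(naive, sparse)
```

Both functions return the same scores, but the sparse version only visits (word, label) pairs that actually occur in the index, so its cost follows the number of matching postings rather than the number of labels times the number of document words.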

33
**Sparse Inference for Structured Models**

- Sparse inference extends to hierarchically smoothed sequence models
- Inference complexity is reduced in the same way as for MNB

35
**Sparse Inference for Structured Models (2)**

- The model considered is a hierarchically smoothed sequence model
- With Jelinek-Mercer smoothing, marginalization can be computed sparsely
- This reduces the complexity of marginalization

37
**Tied Document Mixture**

Replace the label-conditional in MNB with a mixture over hierarchically smoothed document models.

38
**Experiments**

- Experiments on 16 text classification and 13 ranked retrieval datasets
- Development and evaluation segments used, both further split into training and testing segments
- Classification evaluated with micro-averaged F-score, retrieval with MAP and NDCG
- Models optimized for the evaluation measures using a Gaussian random search on the development test set
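A Gaussian random search can be sketched in a few lines of Python. This is a generic hill-climbing variant written under assumptions: the step size, iteration count, acceptance rule, and the toy objective are all hypothetical stand-ins, and the search procedure used in the thesis may differ in its details.

```python
import random

def gaussian_random_search(objective, start, step=0.5, iterations=200, seed=0):
    """Perturb the best parameters with Gaussian noise; keep any improvement."""
    rng = random.Random(seed)
    best, best_score = list(start), objective(start)
    for _ in range(iterations):
        candidate = [x + rng.gauss(0.0, step) for x in best]
        candidate_score = objective(candidate)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
    return best, best_score

def toy_objective(params):
    # Hypothetical stand-in for an evaluation measure on the development
    # test set; it peaks at (1, -2) with value 0.
    return -((params[0] - 1.0) ** 2 + (params[1] + 2.0) ** 2)

best, best_score = gaussian_random_search(toy_objective, [0.0, 0.0])
print(best, best_score)
```

In an actual experiment the objective would train a model with the candidate parameters and return the measure (for example micro-averaged F-score or MAP) on held-out development data.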

42
**Evaluated Modifications**

- MNB, TDM, VSM, LR, and SVM models compared with modifications
- A generalized TF-IDF is used, with one parameter for length scaling and one for IDF lifting

53
**Scalability Experiments**

- Large English Wikipedia dataset for multi-label classification, segmented into 2.34M training documents and 23.6k test documents
- Pruned into smaller sets by features (from 10 upward), documents (from 10 upward), and labelsets (from 1 upward)
- Scalability of naive vs. sparse inference evaluated on MNB and TDM
- Maximum of 4 hours of computing time allowed for each condition

55
**Summary of Experiment Results**

- Effectiveness improvements to MNB:
  - Choice of smoothing: small effect
  - Feature weighting and scaled factors: large effect
  - Tied Document Mixture: very large effect
- The resulting models outperformed BM25 for ranking and came close to a highly optimized SVM for classification
- Scalability from sparse inference:
  - 10× inference time reduction in the largest completed case

56
**Conclusion**

- Modified Bayes models are strong models for text mining tasks: sentiment analysis, spam classification, document categorization, ranked retrieval, …
- Sparse inference enables scalability for new types of tasks and models
- Possible future applications of the presented framework:
  - Text clustering
  - Text regression
  - N-gram language modeling
  - Topic models

57
**Conclusion (2)**

Thesis statement:

“Generative models of text combined with inference using inverted indices provide sparse generative models for text mining that are both versatile and scalable, providing state-of-the-art effectiveness and high scalability for various text mining tasks.”

Truisms in theory that should be reconsidered:

- Naive Bayes as the “punching bag of machine learning”
- “the curse of dimensionality” and “… is optimal time complexity”
