Scalable Text Mining with Sparse Generative Models

1 Scalable Text Mining with Sparse Generative Models
Antti Puurula. PhD thesis presentation, University of Waikato, New Zealand, 8th June 2015.

2 Introduction
This thesis presents a framework for probabilistic text mining based on sparse generative models. Models developed in the framework show state-of-the-art effectiveness in both text classification and retrieval tasks. The proposed sparse inference for these models improves scalability, enabling text mining for very large-scale tasks.

3 Major Contributions of the Thesis
Formalizing multinomial modeling of text:
- Smoothing as two-state Hidden Markov Models
- Fractional counts as probabilistic data
- Weighted factors as log-linear models
Scalable inference on text:
- Sparse inference using inverted indices for statistical models
- Tied Document Mixture, a model benefiting from sparse inference
Extensive evaluation using a combined experimental setup for classification and retrieval

4 Defining Text Mining
“Knowledge Discovery in Textual Databases” (KDT) [Feldman and Dagan, 1995]. “Text Mining as Integration of Several Related Research Areas” [Grobelnik et al., 2000]. Definition used in this thesis: text mining is an interdisciplinary field of research on the automatic processing of large quantities of text data for valuable information.

5 Related Fields and Application Domains

6 Volume of Text Mining Publications
References per year found for related fields using academic search engines

9 Scale of Text Data
Existing collections:
- Google Books, 30M books (2013)
- Twitter, 200M users, 400M messages per day (2013)
- WhatsApp, 430M users, 50B messages per day (2014)
Available research collections:
- English Wikipedia, 4.5M articles (2014)
- Google n-grams, 5-grams estimated from 1T words (2007)
- Annotated English Gigaword, 4B words with metadata (2012)
- TREC KBA, 394M annotated documents for classification (2014)

10 Text Mining Methodology in a Nutshell
1. Normalize and map documents into a structured representation, such as a vector of word counts
2. Segment a problem into machine learning tasks
3. Solve the tasks using algorithms, most commonly linear models
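The first step can be sketched as follows; the whitespace tokenizer and toy vocabulary are illustrative, not the preprocessing actually used in the thesis:

```python
from collections import Counter

def vectorize(text, vocabulary):
    # Minimal normalization: lowercase, split on whitespace, strip punctuation
    tokens = (t.strip(".,;:!?") for t in text.lower().split())
    counts = Counter(t for t in tokens if t in vocabulary)
    # Map the document to a vector of word counts over the fixed vocabulary
    return [counts[w] for w in vocabulary]

vocab = ["text", "mining", "model"]
vec = vectorize("Text mining builds a model, then more text.", vocab)
```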



13 Linear Models for Text Mining
Multi-class linear scoring function: $\hat{m} = \arg\max_m \mathbf{w}_m^{\top}\mathbf{x} + b_m$, where $\mathbf{x}$ is the document feature vector and $(\mathbf{w}_m, b_m)$ are the weight vector and bias for class $m$.
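A minimal sketch of such a scoring function; the toy weights are illustrative only:

```python
def linear_classify(x, W, b):
    # Score each class m as w_m . x + b_m, return the argmax class index
    scores = [sum(wi * xi for wi, xi in zip(w, x)) + bm
              for w, bm in zip(W, b)]
    return max(range(len(scores)), key=scores.__getitem__)

W = [[1.0, 0.0], [0.0, 1.0]]   # one weight vector per class
b = [0.0, 0.1]                 # one bias per class
label = linear_classify([2.0, 1.0], W, b)
```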

14 Multinomial Naive Bayes
Bayes model with multinomials conditioned on label variables: $p(m \mid \mathbf{x}) = p(m)\,p(\mathbf{x} \mid m)/p(\mathbf{x})$, with $p(\mathbf{x} \mid m) \propto \prod_w p(w \mid m)^{x_w}$. Priors $p(m)$ are categorical, label-conditionals $p(w \mid m)$ are multinomial, and the normalizer $p(\mathbf{x})$ is constant across labels. Directed generative graphical model.
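MNB scoring can be sketched directly from these equations; the two-label toy parameters below are hypothetical:

```python
import math

# Hypothetical toy parameters: categorical prior p(m), multinomial p(w|m)
priors = {"spam": 0.3, "ham": 0.7}
cond = {"spam": {"buy": 0.5, "now": 0.4, "hi": 0.1},
        "ham":  {"buy": 0.1, "now": 0.2, "hi": 0.7}}

def mnb_log_score(counts, label):
    # log p(m) + sum_w x_w log p(w|m); the multinomial coefficient and
    # the normalizer p(x) are constant across labels, so they are dropped
    s = math.log(priors[label])
    for w, x in counts.items():
        s += x * math.log(cond[label][w])
    return s

doc = {"buy": 2, "now": 1}
best = max(priors, key=lambda m: mnb_log_score(doc, m))
```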






20 Formalizing Smoothing of Multinomials
All smoothing methods for multinomials can be expressed as $p_s(w \mid m) = (1 - \alpha)\,p(w \mid m) + \alpha\,p(w)$, where $p(w \mid m)$ is an unsmoothed label-conditional model, $p(w)$ is the background model, and $\alpha$ is the smoothing weight. Discounting methods subtract discounts from the counts used to estimate $p(w \mid m)$.
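A sketch of the interpolated form above (Jelinek-Mercer smoothing); the toy distributions are illustrative:

```python
def jm_smooth(p_cond, p_bg, alpha):
    # p_s(w|m) = (1 - alpha) * p(w|m) + alpha * p(w)
    # Words unseen for the label fall back on the background model
    return {w: (1 - alpha) * p_cond.get(w, 0.0) + alpha * p
            for w, p in p_bg.items()}

background = {"buy": 0.25, "now": 0.25, "hi": 0.5}
conditional = {"buy": 0.6, "now": 0.4}   # "hi" unseen for this label
smoothed = jm_smooth(conditional, background, alpha=0.2)
```

Because both inputs are proper distributions, the smoothed model still sums to one, and every vocabulary word gets nonzero probability.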


22 Two-State HMM Formalization of Smoothing
Replace the multinomial with a 0th-order categorical state-emission HMM with M=2 hidden states: $p_s(w \mid m) = \sum_{k=1}^{2} a_{mk}\,p_k(w \mid m)$. Component $k=2$ is shared between the 2-state HMMs for each label.

23 Two-State HMM Formalization of Smoothing (2)
Label-conditionals can be rewritten: $p_s(w \mid m) = a_{m1}\,p_1(w \mid m) + a_{m2}\,p_2(w)$. Choosing $a_{m1} = 1 - \alpha$, $a_{m2} = \alpha$, and $p_2(w) = p(w)$ implements the smoothed multinomials.

24 Two-State HMM Formalization of Smoothing (3)
Maximum likelihood estimation is difficult, due to a sum over all $M^{N}$ hidden component assignments of an $N$-word sequence. Given a prior distribution over component assignments $p(k \mid w_n, m)$, expected log-likelihood estimation decouples: each component can be estimated separately from the expected counts assigned to it.


26 Formalizing Fractional Counts
Fractional counts are undefined for categorical and multinomial models. Formalization is possible with probabilistic data: a weight sequence matching a word sequence can be interpreted as the probabilities of the words occurring in the data. Expected log-likelihoods and log-probabilities given expected counts reproduce the results from using fractional counts.

27 Formalizing Fractional Counts (2)
Estimation with expected log-likelihood: $p(w \mid m) = \hat{x}_{wm} / \sum_{w'} \hat{x}_{w'm}$, where the expected counts are $\hat{x}_{wm} = \sum_n v_n\,[w_n = w]$ over training words $w_n$ with occurrence probabilities $v_n$.

28 Formalizing Fractional Counts (3)
Inference with expected log-probability: $\mathbb{E}[\log p(\mathbf{w} \mid m)] = \sum_n v_n \log p(w_n \mid m)$, weighting each word's log-probability by its probability of occurring.
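The two formalizations can be sketched together; the per-occurrence probabilities are hypothetical weights, not values from the thesis:

```python
import math

def expected_counts(words, probs):
    # E[x_w] = sum of the occurrence probabilities v_n of word w
    ec = {}
    for w, v in zip(words, probs):
        ec[w] = ec.get(w, 0.0) + v
    return ec

def expected_log_prob(ec, model):
    # Expected log-probability: sum_w E[x_w] * log p(w|m)
    return sum(c * math.log(model[w]) for w, c in ec.items())

ec = expected_counts(["a", "b", "a"], [0.5, 1.0, 0.25])
```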

29 Extending MNB with Scaled Factors
MNB with scaled factors for label priors and document lengths: $p(m \mid \mathbf{x}) \propto p_a(m)\,p_b(l \mid m)\prod_w p(w \mid m)^{x_w}$, where the label prior and document length factors are scaled by exponents and renormalized: $p_a(m) \propto p(m)^{a}$ and $p_b(l \mid m) \propto p(l \mid m)^{b}$.


31 Sparse Inference for MNB
Naive MNB posterior inference has complexity $O(M\,V)$ for $M$ labels and a vocabulary of $V$ words. Sparse inference uses an inverted index with precomputed values: $\log p(m \mid \mathbf{x}) = c + b_m + \sum_{w:\,x_w > 0} x_w\,\delta_{wm}$, with $c$ constant across labels, $b_m$ combining the label prior and the shared smoothing terms, $\delta_{wm}$ a correction stored only for label-word pairs with nonzero unsmoothed counts, and $\delta_{wm} = 0$ for all others. This has time complexity proportional to the inverted index postings matched by the document's words.
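The idea can be sketched as follows, with hypothetical precomputed values: every label starts from a dense base score, and the inverted index stores corrections only where the unsmoothed model is nonzero:

```python
def sparse_scores(doc_counts, index, base):
    # base[m]: label prior plus shared smoothing terms, precomputed per label
    # index[w]: postings of (label, score delta) for nonzero p(w|m) entries
    scores = list(base)
    for w, x in doc_counts.items():
        # Only labels appearing in this word's postings are updated;
        # all other labels keep their precomputed base score
        for m, delta in index.get(w, ()):
            scores[m] += x * delta
    return scores

base = [-1.0, -2.0]
index = {"buy": [(0, 0.5)], "hi": [(1, 1.5)]}
scores = sparse_scores({"buy": 2, "rare": 1}, index, base)
```

Note that the word "rare" has no postings, so it costs nothing; work is proportional to matched postings rather than to labels times vocabulary.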


33 Sparse Inference for Structured Models
Extension to hierarchically smoothed sequence models. Complexity reduced from time linear in the number of component models to time linear in the matched inverted index postings.


35 Sparse Inference for Structured Models (2)
A hierarchically smoothed sequence model: $p_s(w \mid d) = (1 - \alpha_1)\,p(w \mid d) + \alpha_1\big((1 - \alpha_2)\,p(w \mid m) + \alpha_2\,p(w)\big)$. With Jelinek-Mercer smoothing, marginalization is sparse: only models with nonzero unsmoothed counts for a word contribute beyond the shared background term. Marginalization complexity is reduced from time linear in the number of models to time linear in the matched postings.


37 Tied Document Mixture
Replace the label-conditional in MNB with a mixture over hierarchically smoothed document models: $p(\mathbf{x} \mid m) = \frac{1}{|D_m|}\sum_{d \in D_m}\prod_w p_s(w \mid d)^{x_w}$, where each document model $p_s(w \mid d)$ is smoothed towards its label model, and the label model towards the corpus model.
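A sketch of TDM scoring under these assumptions; `smooth` is a stand-in for the hierarchical smoothing of each document model, and the mixture is combined with log-sum-exp for numerical stability:

```python
import math

def tdm_log_score(doc_counts, label_docs, smooth):
    # Log-likelihood of the input document under each stored document model
    lls = [sum(x * math.log(smooth(d, w)) for w, x in doc_counts.items())
           for d in label_docs]
    # Uniform mixture over the label's documents, via log-sum-exp
    m = max(lls)
    return m + math.log(sum(math.exp(l - m) for l in lls)) - math.log(len(lls))
```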

38 Experiments
Experiments on 16 text classification and 13 ranked retrieval datasets. Development and evaluation segments used, both further split into training and testing segments. Classification evaluated with micro-averaged F-score, retrieval with MAP and NDCG. Models optimized for the evaluation measures using a Gaussian random search on the development test set.
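Micro-averaged F-score pools true/false positives and false negatives over all labels before computing precision and recall; a minimal sketch:

```python
def micro_fscore(tp, fp, fn):
    # Pooled precision and recall over all labels, then their harmonic mean
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```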




42 Evaluated Modifications
MNB, TDM, VSM, LR, and SVM models with modifications compared. Generalized TF-IDF weighting used, with one parameter for length scaling and one for IDF lifting.
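One plausible reading of such a weighting; the exponent `s` for length scaling and the additive constant `b` for IDF lifting are assumptions here, not the thesis's exact parameterization:

```python
import math

def generalized_tf_idf(counts, df, n_docs, s, b):
    # Scale term counts by document length ** s, lift IDF by adding b;
    # s = 0, b = 0 recovers plain count * log(N / df) weighting
    length = sum(counts.values())
    return {w: (x / length ** s) * (math.log(n_docs / df[w]) + b)
            for w, x in counts.items()}
```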











53 Scalability Experiments
Large English Wikipedia dataset for multi-label classification, segmented into 2.34M training documents and 23.6k test documents. Pruned by numbers of features (from 10 up), documents (from 10 up), and labelsets (from 1 up) into smaller sets. Scalability of naive vs. sparse inference evaluated on MNB and TDM. A maximum of 4 hours of computing time was allowed for each condition.


55 Summary of Experiment Results
Effectiveness improvements to MNB:
- Choice of smoothing: small effect
- Feature weighting and scaled factors: large effect
- Tied Document Mixture: very large effect
BM25 outperformed for ranking; close to a highly optimized SVM for classification. Scalability from sparse inference: 10× inference time reduction in the largest completed case.

56 Conclusion
Modified Bayes models are strong models for text mining tasks: sentiment analysis, spam classification, document categorization, ranked retrieval, … Sparse inference enables scalability for new types of tasks and models. Possible future applications of the presented framework:
- Text clustering
- Text regression
- N-gram language modeling
- Topic models

57 Conclusion (2) Thesis statement:
“Generative models of text combined with inference using inverted indices provide sparse generative models for text mining that are both versatile and scalable, providing state-of-the-art effectiveness and high scalability for various text mining tasks.”
Truisms in theory that should be reconsidered:
- Naive Bayes as the “punching bag of machine learning”
- “The curse of dimensionality”
- “$O(n)$ is optimal time complexity”
