LDA Training System 8/22/2012.

1 LDA Training System xueminzhao@tencent.com 8/22/2012

2 Outline
- Introduction
- SparseLDA
- Rethinking LDA: Why Priors Matter
- LDA Training System Design: MapReduce-LDA

4 Problem – Text Relevance
- Q1: apple pie
- Q2: iphone crack
- Doc1: Apple Computer Inc. is a well-known company located in California, USA.
- Doc2: The apple is the pomaceous fruit of the apple tree, species Malus domestica in the rose family.

5 Topic Models

6 Topic Model – Generative Process

7 Topic Model - Inference

8 Latent Dirichlet Allocation

9 Outline
- Introduction
- SparseLDA
- Rethinking LDA: Why Priors Matter
- LDA Training System Design: MapReduce-LDA

10 Gibbs Sampling for LDA

12 Document-Topic Statistics

13 Topic-Word Statistics

15 For each token,

20 Sample a new topic

21 For each token,

22 Summary so far
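
The per-token loop named on the preceding slides (remove the token from the document-topic and topic-word statistics, sample a new topic, add it back) can be sketched as follows. This is a minimal illustration of standard collapsed Gibbs sampling as described in the Griffiths and Steyvers reference, not the code of the system presented here. It assumes symmetric scalar priors `alpha` and `beta`; all function and variable names are hypothetical.

```python
import random

def init_state(docs, num_topics, vocab_size, rng):
    """Assign a random topic to every token and build the count statistics."""
    z = [[rng.randrange(num_topics) for _ in doc] for doc in docs]
    n_dt = [[0] * num_topics for _ in docs]               # document-topic counts
    n_tw = [[0] * vocab_size for _ in range(num_topics)]  # topic-word counts
    n_t = [0] * num_topics                                # tokens per topic
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[d][i]
            n_dt[d][t] += 1
            n_tw[t][w] += 1
            n_t[t] += 1
    return z, n_dt, n_tw, n_t

def gibbs_sweep(docs, z, n_dt, n_tw, n_t, alpha, beta, vocab_size, rng):
    """Run one collapsed Gibbs sweep over every token, updating counts in place."""
    num_topics = len(n_t)
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            old = z[d][i]
            # Remove the token's current assignment from all statistics.
            n_dt[d][old] -= 1
            n_tw[old][w] -= 1
            n_t[old] -= 1
            # Unnormalized conditional probability of each topic for this token.
            weights = [
                (alpha + n_dt[d][t]) * (beta + n_tw[t][w]) / (beta * vocab_size + n_t[t])
                for t in range(num_topics)
            ]
            # Draw a new topic with probability proportional to its weight.
            u = rng.random() * sum(weights)
            new = 0
            while new < num_topics - 1 and u >= weights[new]:
                u -= weights[new]
                new += 1
            # Add the token back under its new topic.
            z[d][i] = new
            n_dt[d][new] += 1
            n_tw[new][w] += 1
            n_t[new] += 1

# Toy corpus (assumed for illustration): documents are lists of word ids.
docs = [[0, 1, 0, 2], [3, 4, 3], [0, 3, 1, 4]]
rng = random.Random(0)
z, n_dt, n_tw, n_t = init_state(docs, num_topics=2, vocab_size=5, rng=rng)
for _ in range(20):
    gibbs_sweep(docs, z, n_dt, n_tw, n_t, alpha=0.1, beta=0.01, vocab_size=5, rng=rng)
```

Each token costs time linear in the number of topics here, because the full weight vector and its sum are recomputed every time.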

23 The normalizing constant

26 Statistics are sparse

27 Summary so far

28 Huge savings: time and memory
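
The savings come from splitting the normalizing constant into three buckets, as in the Yao, Mimno, and McCallum reference: a smoothing-only term `s`, a document-topic term `r`, and a topic-word term `q`. The sketch below is a hedged illustration of that decomposition, assuming symmetric scalar priors and sparse counts stored as dictionaries; the function names and the toy numbers are hypothetical and do not come from the slides.

```python
def sparse_lda_buckets(alpha, beta, vocab_size, n_t, doc_topics, word_topics):
    """Return the (s, r, q) bucket masses for one token.

    doc_topics:  {topic: count} for the current document, nonzero entries only
    word_topics: {topic: count} for the current word, nonzero entries only
    """
    denom = [beta * vocab_size + n for n in n_t]
    # s: smoothing-only mass; depends on neither the document nor the word.
    s = sum(alpha * beta / d for d in denom)
    # r: document-topic mass; only topics present in the document contribute.
    r = sum(c * beta / denom[t] for t, c in doc_topics.items())
    # q: topic-word mass; only topics in which the word occurs contribute.
    q = sum((alpha + doc_topics.get(t, 0)) * c / denom[t]
            for t, c in word_topics.items())
    return s, r, q

def dense_normalizer(alpha, beta, vocab_size, n_t, doc_topics, word_topics):
    """Normalizing constant computed the dense way, touching every topic."""
    return sum(
        (alpha + doc_topics.get(t, 0)) * (beta + word_topics.get(t, 0))
        / (beta * vocab_size + n_t[t])
        for t in range(len(n_t))
    )

# Toy statistics (assumed): 4 topics, the document uses 2, the word occurs in 1.
n_t = [10, 4, 0, 7]
doc_topics = {0: 3, 3: 1}
word_topics = {0: 2}
s, r, q = sparse_lda_buckets(0.1, 0.01, 50, n_t, doc_topics, word_topics)
total = dense_normalizer(0.1, 0.01, 50, n_t, doc_topics, word_topics)
```

The three buckets sum to the dense normalizer because (alpha + n_dt)(beta + n_tw) expands to alpha*beta + n_dt*beta + (alpha + n_dt)*n_tw. In this sketch `s` is still recomputed over all topics; a practical sampler would cache it and adjust it incrementally, so that per-token work scales with the nonzero entries of `doc_topics` and `word_topics`.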

29 Outline
- Introduction
- SparseLDA
- Rethinking LDA: Why Priors Matter
- LDA Training System Design: MapReduce-LDA

30 Priors for LDA

35 Comparing Priors for LDA

36 Optimizing m

37 Selecting T

38 Outline
- Introduction
- SparseLDA
- Rethinking LDA: Why Priors Matter
- LDA Training System Design: MapReduce-LDA

39 Overview

40 MapReduce Jobs
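
The slide text does not say how the jobs exchange statistics. One common scheme, described in the Newman et al. reference, lets each worker sample its shard against a private copy of the topic-word counts and then sums the per-worker changes in a reduce step. The sketch below illustrates only that merge step; it is an assumption about how such a system could work, not a description of this one, and the function name is hypothetical.

```python
def merge_topic_word_counts(global_counts, worker_counts):
    """Combine per-worker topic-word counts after one parallel sweep.

    Each worker started from a copy of global_counts and modified it locally,
    so its contribution is the difference (local - global).
    """
    merged = [row[:] for row in global_counts]
    for local in worker_counts:
        for t, row in enumerate(local):
            for w, count in enumerate(row):
                merged[t][w] += count - global_counts[t][w]
    return merged

# Toy example (assumed): 2 topics, 2 words, 2 workers.
global_counts = [[2, 1], [0, 3]]
worker_a = [[1, 1], [1, 3]]  # moved one token of word 0 from topic 0 to topic 1
worker_b = [[2, 2], [0, 2]]  # moved one token of word 1 from topic 1 to topic 0
merged = merge_topic_word_counts(global_counts, [worker_a, worker_b])
# merged == [[1, 2], [1, 2]]
```

The total token count is unchanged by the merge, since every worker only moves tokens between topics.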

41 Scalability
Hypothesis:
- memory 40GB per machine;
- 5 words per doc.
Scalability:
- if # limit;
- if # limit.

42 Experiment for Correctness Validation

43 References
- D. Blei, Andrew Ng, and M. Jordan. Latent Dirichlet Allocation. JMLR 2003.
- Thomas L. Griffiths and Mark Steyvers. Finding scientific topics. PNAS 2004.
- Gregor Heinrich. Parameter estimation for text analysis. Technical Report, 2009.
- Limin Yao, David Mimno, and Andrew McCallum. Efficient Methods for Topic Model Inference on Streaming Document Collections. KDD 2009.
- Hanna M. Wallach, David Mimno, and Andrew McCallum. Rethinking LDA: Why Priors Matter. NIPS 2009.
- David Newman, Arthur Asuncion, Padhraic Smyth, and Max Welling. Distributed Inference for Latent Dirichlet Allocation. NIPS 2007.
- Yi Wang, Hongjie Bai, Matt Stanton, Wen-Yen Chen, and Edward Y. Chang. PLDA: Parallel Latent Dirichlet Allocation for Large-scale Applications. AAIM 2009.
- Xueminzhao. LDA design doc. http://x.x.x.x/~xueminzhao/html_docs/internal/modules/lda.html
