1
LDA Training System xueminzhao@tencent.com 8/22/2012
2
Outline
- Introduction
- SparseLDA
- Rethinking LDA: Why Priors Matter
- LDA Training System Design: MapReduce-LDA
4
Problem – Text Relevance
Q1: apple pie
Q2: iphone crack
Doc1: Apple Computer Inc. is a well-known company located in California, USA.
Doc2: The apple is the pomaceous fruit of the apple tree, species Malus domestica in the rose family.
5
Topic Models
6
Topic Model – Generative Process
7
Topic Model - Inference
8
Latent Dirichlet Allocation
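The body of this slide did not survive extraction. As a minimal sketch of the standard LDA generative process (following Blei, Ng, and Jordan as cited in the references), the code below draws a small synthetic corpus. It assumes symmetric Dirichlet priors; the names `sample_dirichlet`, `generate_corpus`, `alpha`, and `beta` are illustrative and not taken from the deck.

```python
import random


def sample_dirichlet(rng, concentration, size):
    """Draw from a symmetric Dirichlet by normalizing Gamma variates."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [x / total for x in draws]


def generate_corpus(num_docs, doc_len, num_topics, vocab_size,
                    alpha=0.5, beta=0.1, seed=0):
    """Generate word ids and topic assignments from the LDA model.

    alpha and beta are assumed symmetric hyperparameters (illustrative values).
    """
    rng = random.Random(seed)
    # One word distribution per topic: phi_t ~ Dirichlet(beta).
    phi = [sample_dirichlet(rng, beta, vocab_size) for _ in range(num_topics)]
    docs, assignments = [], []
    for _ in range(num_docs):
        # One topic mixture per document: theta_d ~ Dirichlet(alpha).
        theta = sample_dirichlet(rng, alpha, num_topics)
        # For each token: draw a topic from theta, then a word from that topic.
        z = rng.choices(range(num_topics), weights=theta, k=doc_len)
        w = [rng.choices(range(vocab_size), weights=phi[t])[0] for t in z]
        docs.append(w)
        assignments.append(z)
    return docs, assignments, phi
```

Inference runs this process in reverse: given only the words, it recovers plausible topic assignments and distributions.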
9
Outline
- Introduction
- SparseLDA
- Rethinking LDA: Why Priors Matter
- LDA Training System Design: MapReduce-LDA
10
Gibbs Sampling for LDA
12
Document-Topic Statistics
13
Topic-Word Statistics
15
For each token,
20
Sample a new topic
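The per-token update on these slides can be sketched as follows. This is the standard collapsed Gibbs step described by Griffiths and Steyvers (cited in the references), assuming symmetric priors. The function name, argument layout, and data structures are hypothetical, since the slide bodies are not preserved.

```python
import random


def resample_token(rng, word, old_topic, doc_topic, topic_word, topic_total,
                   alpha, beta, vocab_size):
    """One collapsed Gibbs update for a single token; returns the new topic.

    doc_topic:   list of counts n_{t|d} for the token's document
    topic_word:  list (one dict per topic) mapping word -> n_{w|t}
    topic_total: list of counts n_t
    """
    # Remove the token's current assignment from the statistics.
    doc_topic[old_topic] -= 1
    topic_word[old_topic][word] -= 1
    topic_total[old_topic] -= 1

    # Unnormalized conditional probability of each topic for this token.
    weights = [
        (alpha + doc_topic[t])
        * (beta + topic_word[t].get(word, 0))
        / (beta * vocab_size + topic_total[t])
        for t in range(len(doc_topic))
    ]
    new_topic = rng.choices(range(len(weights)), weights=weights)[0]

    # Add the token back under its newly sampled topic.
    doc_topic[new_topic] += 1
    topic_word[new_topic][word] = topic_word[new_topic].get(word, 0) + 1
    topic_total[new_topic] += 1
    return new_topic
```

Note that this dense version touches every topic for every token, which is the cost SparseLDA is designed to avoid.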
21
For each token,
22
Summary so far
23
The normalizing constant
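The formula on this slide is not preserved. For reference, the decomposition used by SparseLDA (Yao, Mimno, and McCallum, cited in the references) splits the sampling weight into three terms whose sums form the normalizing constant. The notation below is an assumption: $n_{t|d}$ is the count of topic $t$ in document $d$, $n_{w|t}$ the count of word $w$ in topic $t$, $n_t$ the total tokens in topic $t$, and $V$ the vocabulary size.

```latex
\begin{align*}
p(z = t \mid w, d)
  &\propto \frac{(\alpha_t + n_{t|d})(\beta + n_{w|t})}{\beta V + n_t} \\
  &= \frac{\alpha_t \beta}{\beta V + n_t}
   + \frac{n_{t|d}\,\beta}{\beta V + n_t}
   + \frac{(\alpha_t + n_{t|d})\, n_{w|t}}{\beta V + n_t} \\[4pt]
Z &= s + r + q, \\
s &= \sum_t \frac{\alpha_t \beta}{\beta V + n_t}, \qquad
r  = \sum_t \frac{n_{t|d}\,\beta}{\beta V + n_t}, \qquad
q  = \sum_t \frac{(\alpha_t + n_{t|d})\, n_{w|t}}{\beta V + n_t}
\end{align*}
```

Only $q$ depends on the current word, $r$ is nonzero only for topics present in the document, and $s$ changes only when a topic total changes.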
26
Statistics are sparse
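To illustrate why sparsity helps, here is a rough sketch of the three SparseLDA bucket masses computed from sparse count maps, checked against the dense sum. It assumes a symmetric `alpha` for brevity; the function names and the dict-based layout are hypothetical and not the deck's implementation.

```python
def bucket_masses(doc_topics, word_topics, topic_total, alpha, beta, vocab_size):
    """Return (s, r, q) for one token.

    doc_topics:  dict topic -> n_{t|d}, nonzero entries only
    word_topics: dict topic -> n_{w|t} for the current word, nonzero only
    topic_total: list of n_t for every topic
    """
    denom = [beta * vocab_size + n for n in topic_total]
    # s visits every topic but depends on neither document nor word,
    # so a real sampler would cache it and update it incrementally.
    s = sum(alpha * beta / d for d in denom)
    # r visits only the topics that occur in this document.
    r = sum(n_td * beta / denom[t] for t, n_td in doc_topics.items())
    # q visits only the topics in which this word occurs.
    q = sum((alpha + doc_topics.get(t, 0)) * n_wt / denom[t]
            for t, n_wt in word_topics.items())
    return s, r, q


def dense_normalizer(doc_topics, word_topics, topic_total, alpha, beta, vocab_size):
    """Reference value: the same normalizer summed over all topics."""
    return sum(
        (alpha + doc_topics.get(t, 0)) * (beta + word_topics.get(t, 0))
        / (beta * vocab_size + topic_total[t])
        for t in range(len(topic_total))
    )
```

With many topics, a document typically uses few of them and a word appears in few of them, so the `r` and `q` loops are short compared with a pass over all topics.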
27
Summary so far
28
Huge savings: time and memory
29
Outline
- Introduction
- SparseLDA
- Rethinking LDA: Why Priors Matter
- LDA Training System Design: MapReduce-LDA
30
Priors for LDA
35
Comparing Priors for LDA
36
Optimizing m
37
Selecting T
38
Outline
- Introduction
- SparseLDA
- Rethinking LDA: Why Priors Matter
- LDA Training System Design: MapReduce-LDA
39
Overview
40
MapReduce Jobs
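The job layout on this slide is not recoverable from the transcript. As one plausible illustration only, the sketch below shows the count-merging step used in approximate distributed LDA (Newman et al., cited in the references): each shard samples against its own copy of the global topic-word counts, and a reduce step adds every shard's delta back onto the global table. The function name and the `(topic, word)` keying are assumptions.

```python
def merge_topic_word_counts(global_counts, shard_counts):
    """Merge per-shard copies of the topic-word counts after one iteration.

    global_counts: dict (topic, word) -> count at the start of the iteration
    shard_counts:  list of dicts, each shard's local copy after sampling
    Returns the new global table: old + sum over shards of (local - old).
    """
    merged = dict(global_counts)
    for local in shard_counts:
        for key in set(local) | set(global_counts):
            delta = local.get(key, 0) - global_counts.get(key, 0)
            if delta:
                merged[key] = merged.get(key, 0) + delta
    # Drop zero entries so the table stays sparse.
    return {k: v for k, v in merged.items() if v != 0}
```

In a MapReduce setting, the mapper would emit each shard's deltas and the reducer would sum them per key, which is what the loop above does in a single process.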
41
Scalability
Hypothesis
- memory 40GB per machine;
- 5 words per doc.
Scalability
- if # limit;
- if # limit.
42
Experiment for Correctness Validation
43
References
- D. Blei, Andrew Ng, and M. Jordan. Latent Dirichlet Allocation. JMLR 2003.
- Thomas L. Griffiths and Mark Steyvers. Finding scientific topics. PNAS 2004.
- Gregor Heinrich. Parameter estimation for text analysis. Technical Report, 2009.
- Limin Yao, David Mimno, and Andrew McCallum. Efficient Methods for Topic Model Inference on Streaming Document Collections. KDD 2009.
- Hanna M. Wallach, David Mimno, and Andrew McCallum. Rethinking LDA: Why Priors Matter. NIPS 2009.
- David Newman, Arthur Asuncion, Padhraic Smyth, and Max Welling. Distributed Inference for Latent Dirichlet Allocation. NIPS 2007.
- Yi Wang, Hongjie Bai, Matt Stanton, Wen-Yen Chen, and Edward Y. Chang. PLDA: Parallel Latent Dirichlet Allocation for Large-scale Applications. AAIM 2009.
- Xueminzhao. LDA design doc. http://x.x.x.x/~xueminzhao/html_docs/internal/modules/lda.html