Latent Association Analysis of Document Pairs Gengxin Miao University of California, Santa Barbara Presented at the IBM T.J. Watson Research Center Hawthorne,

1 Latent Association Analysis of Document Pairs Gengxin Miao University of California, Santa Barbara Presented at the IBM T.J. Watson Research Center Hawthorne, NY December 2, 2011

2 Networked Texts: texts flow on expert networks; semantically associated texts; interconnected text streams. Figure labels: DB2 logon; Diseases, Symptoms, Treatments; Users, Queries, Web pages (belong to the same search task).

3 Semantically Associated Documents

4 Applications: software system maintenance, root cause finding, problem prediction, machine translation, question answering, healthcare assistance.

5 Huge Datasets: beyond human learners' capability.

6 Modeling Options: word-level mapping, topic-level mapping, document-level mapping. Figure labels: Source Document Set, Target Document Set.

7 Word-Level Mapping (UAI09): learns a dictionary between the two document sets; applies to machine translation; word mappings are typically noisy.

8 Topic-Level Mapping (EMNLP09): assumes the associated documents share the same topic proportion; works well for translations between languages.

9 Document-Level Mapping (our work): one-to-many or many-to-one mappings are broken down into one-to-one document pairs; two documents are associated by their association factor.

10 Latent Association Analysis – Framework. Generative process: draw an association factor for each document pair; draw topic proportions for both the source and the target document; draw the words in each document.

11 Latent Association Analysis – An Instantiation. Canonical Correlation Analysis (CCA) captures the semantic association in document pairs. Correlated Topic Model (CTM) captures the document and word co-occurrence.

12 The Generative Process. A pair of documents arises from the following process: draw an L-dimensional association factor; for the source and the target document, draw the topic proportions; for each word in the documents, draw a topic and a word.
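
The three steps of the generative process can be sketched in Python. This is only an illustration under stated assumptions, not the authors' implementation: the slide's equations did not survive in the transcript, so the Gaussian association factor, the linear maps `T`, the offsets `mu`, the noise scale, and all function names below are hypothetical choices in the spirit of CCA (shared latent factor) and CTM (logistic-normal topic proportions).

```python
import numpy as np

def softmax(x):
    """Map real-valued topic logits to proportions on the simplex."""
    e = np.exp(x - x.max())
    return e / e.sum()

def random_params(rng, n_topics, n_factors, vocab_size):
    """Hypothetical parameters for one side (source or target) of the model."""
    T = rng.standard_normal((n_topics, n_factors))            # factor-to-topic map
    mu = np.zeros(n_topics)                                   # topic logit offset
    beta = rng.dirichlet(np.ones(vocab_size), size=n_topics)  # topic-word distributions
    return T, mu, beta

def sample_document(rng, y, T, mu, beta, n_words, noise=0.1):
    """Draw topic proportions, then a topic and a word for every position."""
    logits = T @ y + mu + noise * rng.standard_normal(mu.shape)
    theta = softmax(logits)                                   # topic proportions
    topics = rng.choice(len(theta), size=n_words, p=theta)    # one topic per word
    words = np.array([rng.choice(beta.shape[1], p=beta[k]) for k in topics])
    return theta, topics, words

def sample_pair(rng, n_factors, params_s, params_t, n_words=50):
    """Draw one shared association factor, then the source and target documents."""
    y = rng.standard_normal(n_factors)                        # L-dimensional factor
    source = sample_document(rng, y, *params_s, n_words)
    target = sample_document(rng, y, *params_t, n_words)
    return y, source, target

rng = np.random.default_rng(0)
L, K, V = 3, 4, 20  # factor dimension, topics per side, vocabulary size (all assumed)
params_s = random_params(rng, K, L, V)
params_t = random_params(rng, K, L, V)
y, (theta_s, topics_s, words_s), (theta_t, topics_t, words_t) = sample_pair(
    rng, L, params_s, params_t
)
```

Because both documents are generated from the same factor `y`, their topic proportions are statistically coupled even though their vocabularies and topics are separate.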

13 Problems. Inference: given a model M and a document pair, how do we determine the association factor, topic proportions, and topic assignments that best describe the document pair? Model fitting: given a set of document pairs, how do we calculate the parameters in M that best describe the entire document pair set?

14 Inference. Objective function: given a model and a document pair, calculate the topic assignments and the topic proportions. The posterior distribution is intractable to compute: the topic assignments and the topic proportions are correlated when conditioned on observations.

15 Variational Inference: decouple the parameters using a variational distribution Q; fit the variational parameters to approximate the true posterior distribution.
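
The general identity behind this approach can be written as follows. The notation is an assumption, since the slide's own equations are missing from the transcript: $w$ stands for the observed words of a document pair, $h$ for all latent variables (association factor, topic proportions, topic assignments), and $M$ for the model.

```latex
\log p(w \mid M)
  = \underbrace{\mathbb{E}_{Q}\!\left[\log p(w, h \mid M)\right]
      - \mathbb{E}_{Q}\!\left[\log Q(h)\right]}_{\text{lower bound}}
  + \mathrm{KL}\!\left(Q(h) \,\|\, p(h \mid w, M)\right)
```

Since the KL term is non-negative and the left-hand side does not depend on $Q$, maximizing the lower bound over the variational parameters is the same as minimizing the KL divergence from $Q$ to the true posterior.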

16 Variational Parameters

17 Model Fitting

18 LAA Ranking Methods. Direct ranking: ranking function for a candidate document pair; word frequency can distort the probability. Latent ranking.

19 Two-Step Ranking. Separate topic models: the source document has a topic proportion and the target document has a topic proportion. Topic-level mapping: Canonical Correlation Analysis captures the association between the topic proportions. Rank target documents.
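
A minimal sketch of the two-step idea, assuming the topic proportions of each side have already been estimated by separate topic models. The regularized CCA via whitening and SVD is a standard construction, but the scoring rule (Euclidean distance between canonical projections) and the names `fit_cca` and `rank_targets` are illustrative assumptions; the slide does not state the ranking function. The demo data are synthetic.

```python
import numpy as np

def fit_cca(X, Y, n_components, reg=1e-6):
    """Regularized CCA: whiten each side, then take the SVD of the cross-covariance."""
    mx, my = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - mx, Y - my
    n = X.shape[0]
    Sxx = Xc.T @ Xc / n + reg * np.eye(X.shape[1])
    Syy = Yc.T @ Yc / n + reg * np.eye(Y.shape[1])
    Sxy = Xc.T @ Yc / n
    Lx, Ly = np.linalg.cholesky(Sxx), np.linalg.cholesky(Syy)
    # Whitened cross-covariance: Lx^{-1} Sxy Ly^{-T}
    M = np.linalg.solve(Lx, Sxy) @ np.linalg.inv(Ly).T
    U, s, Vt = np.linalg.svd(M)
    A = np.linalg.solve(Lx.T, U[:, :n_components])     # source projection
    B = np.linalg.solve(Ly.T, Vt.T[:, :n_components])  # target projection
    return mx, my, A, B, s[:n_components]

def rank_targets(theta_s, Theta_t, model):
    """Order candidate targets by closeness to the source in canonical space."""
    mx, my, A, B, _ = model
    ps = (theta_s - mx) @ A
    pt = (Theta_t - my) @ B
    return np.argsort(np.linalg.norm(pt - ps, axis=1))  # best candidate first

def softmax_rows(Z):
    E = np.exp(Z - Z.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

# Synthetic topic proportions driven by a shared latent factor (assumed data).
rng = np.random.default_rng(0)
n, K, L = 500, 5, 2
F = rng.standard_normal((n, L))
Theta_s = softmax_rows(F @ rng.standard_normal((L, K)) + 0.1 * rng.standard_normal((n, K)))
Theta_t = softmax_rows(F @ rng.standard_normal((L, K)) + 0.1 * rng.standard_normal((n, K)))

model = fit_cca(Theta_s[:400], Theta_t[:400], n_components=2)
order = rank_targets(Theta_s[400], Theta_t[400:], model)  # indices into the 100 held-out targets
```

The small `reg` term matters here: topic proportions sum to one, so their covariance matrix is singular without it.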

20 Experiments. Datasets. IT-Change: changes made to an IT environment and the consequent problems; 24,317 document pairs; 20,000 used for training, the rest used for testing. IT-Solution: IT problems and their solutions; 19,696 document pairs; 15,000 used for training, the rest used for testing. Evaluation: randomly select 100 document pairs in the testing dataset; for each source document, rank the 100 target documents; use the rank of the correct target document as the accuracy measurement.
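
The evaluation protocol can be sketched as follows. The score matrix is made-up toy data, and the convention that higher scores are better and that ties are resolved in favor of the correct target is an assumption; the slide only specifies that the rank of the correct target is the measurement.

```python
import numpy as np

def correct_target_ranks(scores):
    """scores[i, j] is the score of target j for source i; the correct target
    for source i is target i. Returns the 1-based rank of the correct target
    for every source (1 means ranked first)."""
    correct = np.diag(scores)[:, None]
    return 1 + (scores > correct).sum(axis=1)

# Toy example with 3 sampled pairs instead of 100.
scores = np.array([
    [0.9, 0.1, 0.3],
    [0.8, 0.5, 0.2],
    [0.1, 0.7, 0.4],
])
ranks = correct_target_ranks(scores)
print(ranks.tolist())  # [1, 2, 2]
print(ranks.mean())    # average rank of the correct target
```

With 100 sampled pairs, `scores` would be a 100 by 100 matrix, and a lower average rank indicates a better ranking method.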

21 Accuracy Analysis

22 Example

23 Summary. The LAA framework is capable of modeling two document sets associated by a bipartite graph. One-to-many or many-to-one mappings of documents are taken into consideration. We instantiated LAA with CCA and CTM, but the framework can be used with other instantiations that fit specific applications. The LAA-latent ranking algorithm ranks the correct target document better than other state-of-the-art algorithms.

24 Acknowledgment: Prof. Louise E. Moser, Prof. Xifeng Yan, Dr. Shu Tao, Dr. Ziyu Guan, Dr. Nikos Anerousis.

25 Q & A? Thanks!

26 Unigram Model

27 Mixture of Unigrams

28 Probabilistic Latent Semantic Indexing

29 LDA and CTM (figure labels: topic 1, topic 2, topic 3)
