2015-5-11 Clustering Search Results Using PLSA 洪春涛.

1 Clustering Search Results Using PLSA 洪春涛

2 Outline: Motivation; Introduction to document clustering and the PLSA algorithm; Work progress and test results

3 Motivation: Current Internet search engines give us too much information. Clustering the search results may help users find the desired information quickly.

4 A demo of Google search results for 'Truman Capote': some results are about the writer Truman Capote, others about the film Truman Capote.

5 Document clustering Put the ‘similar’ documents together => How do we define ‘similar’?

6 Vector Space Model of documents: The Vector Space Model (VSM) sees a document as a vector of terms.
Doc1: I see a bright future.
Doc2: I see nothing.

        I    see  a    bright  future  nothing
doc1    1    1    1    1       1       0
doc2    1    1    0    0       0       1
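A minimal sketch of how such term vectors could be built for the two example sentences. It assumes lowercase tokenization and raw term counts; the function name `build_term_vectors` is hypothetical, and the slides do not say which weighting the author used.

```python
import re

def build_term_vectors(docs):
    """Return (vocabulary, vectors): one term-count vector per document."""
    # Assumption: tokens are lowercase alphabetic runs; no stemming or stop words.
    tokenized = [re.findall(r"[a-z]+", doc.lower()) for doc in docs]
    vocabulary = []
    for tokens in tokenized:
        for token in tokens:
            if token not in vocabulary:
                vocabulary.append(token)  # keep first-seen order
    vectors = [[tokens.count(term) for term in vocabulary] for tokens in tokenized]
    return vocabulary, vectors

vocabulary, vectors = build_term_vectors(["I see a bright future.", "I see nothing."])
```

Each document becomes one row over the shared vocabulary, so documents of different lengths can be compared directly.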

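7 Cosine as Distance Between Documents: The distance between doc1 and doc2 is then defined as the cosine of the angle between their term vectors: $\cos(\mathrm{doc1}, \mathrm{doc2}) = \frac{\mathrm{doc1} \cdot \mathrm{doc2}}{\lVert \mathrm{doc1} \rVert \, \lVert \mathrm{doc2} \rVert}$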
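As an illustration, the cosine for the two example sentences from slide 6 can be computed as below. The vectors assume term counts over the vocabulary (I, see, a, bright, future, nothing). Strictly speaking the cosine is a similarity; a distance such as 1 - cosine is one common choice, though the slide does not say which form was used.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length term vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention for empty documents (an assumption)
    return dot / (norm_u * norm_v)

doc1 = [1, 1, 1, 1, 1, 0]  # "I see a bright future."
doc2 = [1, 1, 0, 0, 0, 1]  # "I see nothing."
similarity = cosine_similarity(doc1, doc2)  # 2 shared terms: 2 / (sqrt(5) * sqrt(3))
```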

8 Problems with cosine similarity: Synonymy: different words may have the same meaning (car manufacturer = automobile maker). Polysemy: a word may have several different meanings (‘Truman Capote’ may mean the writer or the film). => We need a model that reflects the ‘meaning’.

9 Probabilistic Latent Semantic Analysis: Graphical model of PLSA (figure: two equivalent diagrams over the nodes D, Z, and W). D: document, Z: latent class, W: word.
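For reference, the two equivalent parameterizations in the standard PLSA formulation are sketched below. This is the textbook form, assumed here because the slide's own formulas are not in the transcript; the symmetric form on the right uses the parameters P(d|z), P(w|z), and P(z) named on the next slide.

```latex
P(d, w) = P(d) \sum_{z} P(z \mid d)\, P(w \mid z)
\quad\Longleftrightarrow\quad
P(d, w) = \sum_{z} P(z)\, P(d \mid z)\, P(w \mid z)
```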

10 Through maximum likelihood estimation, one gets the estimated parameters P(d|z), P(w|z), and P(z). P(d|z) is what we want: a document-topic matrix that reflects the meanings of the documents.

11 Our approach: 1. Get the P(d|z) matrix by PLSA, and 2. Use the k-means clustering algorithm on the matrix.
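Step 2 can be sketched as follows: a plain k-means over the rows of a P(d|z) matrix, one row per document. The matrix values are made up for illustration, and the function `kmeans`, its seeding, and the fixed iteration count are assumptions, not the author's implementation.

```python
import random

def kmeans(rows, k, iterations=20, seed=0):
    """Cluster rows with Lloyd's k-means; returns one cluster label per row."""
    rng = random.Random(seed)
    centroids = [list(r) for r in rng.sample(rows, k)]
    labels = [0] * len(rows)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        for i, row in enumerate(rows):
            labels[i] = min(
                range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(row, centroids[c])),
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [rows[i] for i in range(len(rows)) if labels[i] == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Hypothetical P(d|z) matrix: 4 documents x 2 latent classes (columns sum to 1).
p_d_z = [
    [0.45, 0.05],
    [0.40, 0.10],
    [0.05, 0.50],
    [0.10, 0.35],
]
labels = kmeans(p_d_z, k=2)
```

Documents whose mass sits on the same latent class end up with the same label, which is the grouping by 'meaning' that the cosine on raw terms cannot give.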

12 Problems with this approach: PLSA takes too much time. Solution: optimization & parallelization.

13 Algorithm Outline: Expectation Maximization (EM) algorithm, in its Tempered EM (TEM) variant, alternating an E-step and an M-step.
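The slide's formulas are not in the transcript. As a reference point, the tempered EM updates in the standard PLSA formulation look like this, where n(d,w) is the count of word w in document d and β is the tempering parameter (β = 1 gives plain EM). Treat this as the textbook form, assumed rather than copied from the slide.

```latex
\text{E-step:}\quad
P(z \mid d, w) =
\frac{P(z)\,\bigl[P(d \mid z)\,P(w \mid z)\bigr]^{\beta}}
     {\sum_{z'} P(z')\,\bigl[P(d \mid z')\,P(w \mid z')\bigr]^{\beta}}

\text{M-step:}\quad
P(w \mid z) \propto \sum_{d} n(d, w)\, P(z \mid d, w), \qquad
P(d \mid z) \propto \sum_{w} n(d, w)\, P(z \mid d, w), \qquad
P(z) \propto \sum_{d} \sum_{w} n(d, w)\, P(z \mid d, w)
```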

14 Basic Data Structures
p_w_z_current, p_w_z_prev: dense double matrix, W*Z
p_d_z_current, p_d_z_prev: dense double matrix, D*Z
p_z_current, p_z_prev: double array, Z
n_d_w: sparse integer matrix, N
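A rough Python analogue of these structures, assuming the sparse matrix is stored as a dictionary keyed by (document, word) and the dense matrices as nested lists. The helper `init_plsa_state` and its random initialization are hypothetical; the slides keep separate current and previous copies, of which only one is shown here.

```python
import random

def init_plsa_state(n_d_w, num_docs, num_words, num_topics, seed=0):
    """Allocate and randomly initialize the PLSA parameter tables."""
    rng = random.Random(seed)

    def random_columns(rows, cols):
        # rows x cols matrix of positive values; each column is normalized
        # to sum to 1, so column z is a distribution P(. | z).
        m = [[rng.random() + 1e-3 for _ in range(cols)] for _ in range(rows)]
        for z in range(cols):
            total = sum(m[r][z] for r in range(rows))
            for r in range(rows):
                m[r][z] /= total
        return m

    return {
        "p_w_z": random_columns(num_words, num_topics),  # W x Z
        "p_d_z": random_columns(num_docs, num_topics),   # D x Z
        "p_z": [1.0 / num_topics] * num_topics,          # length Z
        "n_d_w": n_d_w,                                  # sparse counts
    }

# Hypothetical co-occurrence counts: {(doc index, word index): count}.
n_d_w = {(0, 0): 2, (0, 1): 1, (1, 2): 3}
state = init_plsa_state(n_d_w, num_docs=2, num_words=3, num_topics=2)
```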

15 Lemur Implementation: On-demand calculation of p_z_d_w. Computational complexity: O(W*D*Z^2). For the new3 dataset, containing 9558 documents and [count not preserved] unique terms, it takes days to finish one TEM iteration.

16 Optimization of the Algorithm: Reduce complexity: calculate p_z_d_w just once per iteration, which reduces the complexity to O(N*Z). Reduce cache misses by interchanging loops: for(int d=1;d ...
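The complexity reduction can be sketched as one pass over the nonzero entries of the co-occurrence table, computing the posterior over latent classes once per entry and reusing it for all three accumulators. This assumes N is the number of nonzero entries and uses the standard tempered EM updates; the function `em_iteration` is a hypothetical illustration, not the Lemur code or the author's optimized version.

```python
def em_iteration(n_d_w, p_d_z, p_w_z, p_z, beta=1.0):
    """One tempered EM iteration in O(N*Z), N = number of nonzero counts."""
    num_docs, num_words, num_topics = len(p_d_z), len(p_w_z), len(p_z)
    new_d_z = [[0.0] * num_topics for _ in range(num_docs)]
    new_w_z = [[0.0] * num_topics for _ in range(num_words)]
    new_z = [0.0] * num_topics
    for (d, w), count in n_d_w.items():
        # E-step for this (d, w): posterior over z, computed exactly once.
        post = [p_z[z] * (p_d_z[d][z] * p_w_z[w][z]) ** beta
                for z in range(num_topics)]
        norm = sum(post)
        if norm == 0.0:
            continue
        # M-step accumulation: the same weight feeds all three tables.
        for z in range(num_topics):
            weight = count * post[z] / norm
            new_d_z[d][z] += weight
            new_w_z[w][z] += weight
            new_z[z] += weight
    total = sum(new_z)
    if total == 0.0:
        raise ValueError("no co-occurrence mass to normalize")
    for z in range(num_topics):
        if new_z[z] > 0.0:
            for d in range(num_docs):
                new_d_z[d][z] /= new_z[z]
            for w in range(num_words):
                new_w_z[w][z] /= new_z[z]
    return new_d_z, new_w_z, [v / total for v in new_z]

# Hypothetical toy input: 2 documents, 3 words, 2 latent classes.
n_d_w = {(0, 0): 2, (0, 1): 1, (1, 2): 3}
p_d_z = [[0.6, 0.4], [0.4, 0.6]]
p_w_z = [[0.5, 0.2], [0.3, 0.2], [0.2, 0.6]]
p_z = [0.5, 0.5]
new_d_z, new_w_z, new_z = em_iteration(n_d_w, p_d_z, p_w_z, p_z)
```

Because the loop touches only stored counts, the cost no longer depends on W*D, which is what makes large, sparse collections feasible.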

17 Parallelization: Access Pattern. Problem: data races. Solution: divide the co-occurrence table into blocks.
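One common way to make such blocks race-free is sketched below; it is an assumption, not necessarily the dispatching scheme shown on the following slides. The table is cut into P x P blocks, and in round r worker p takes the block at row block p and column block (p + r) mod P. Within a round no two workers share a document range or a word range, so their updates to the rows of p_d_z and p_w_z do not overlap. The p_z accumulator would still need per-worker partial sums. The names `block_schedule` and `block_of` are hypothetical.

```python
def block_schedule(num_workers):
    """Return rounds; each round lists (worker, row_block, col_block) tuples."""
    rounds = []
    for r in range(num_workers):
        # Shifted diagonal: every worker gets a distinct row and column block.
        rounds.append([(p, p, (p + r) % num_workers) for p in range(num_workers)])
    return rounds

def block_of(index, size, num_blocks):
    """Map a row or column index to its block id (contiguous, near-equal blocks)."""
    return index * num_blocks // size

schedule = block_schedule(3)
```

After P rounds every block has been processed exactly once, with a barrier needed only between rounds.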

18 Block Dispatching Algorithm

19 Block Dividing Algorithm (figure label: cranmed)

20 Experiment Setup

21 Speedup (figure panels: HPC134, Tulsa)

22 Memory Bandwidth Usage

23 Memory Related Pipeline Stalls

24 Available Memory Bandwidth of the Two Machines

25 END

26 Backup slides

27 Test Results: Table 1. F-score of PLSA and VSM (columns: PLSA, VSM; rows: Tr, K1b, sports; values not preserved)

28 Table 2. Time used in one EM iteration (in seconds); columns: size Z, Lemur, Optimized; values not preserved. Uses the k1b dataset (2340 docs; unique-term and term counts not preserved).

29 Thanks!
