Clustering Search Results Using PLSA 洪春涛 (2015-5-1)


1 Clustering Search Results Using PLSA 洪春涛

2 Outline
- Motivation
- Introduction to document clustering and the PLSA algorithm
- Work in progress and test results

3 Motivation
- Current Internet search engines return too much information.
- Clustering the search results may help users find the desired information quickly.

4 A demo of search results from Google: some results are about the writer Truman Capote, others about the film Truman Capote.

5 Document clustering
Put the ‘similar’ documents together.
=> How do we define ‘similar’?

6 Vector Space Model of documents
The Vector Space Model (VSM) sees a document as a vector of terms:
Doc1: I see a bright future.
Doc2: I see nothing.

        I    see  a    bright  future  nothing
doc1    1    1    1    1       1       0
doc2    1    1    0    0       0       1
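The term vectors for the two example documents can be reproduced with a small sketch. The helper names `tokenize`, `build_vocabulary`, and `term_vector` are hypothetical, and the tokenization rule (lowercase, letters only, whitespace split) is an assumption rather than anything the slides specify.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <sstream>
#include <string>
#include <vector>

// Split on whitespace, keep only letters, and lowercase them (assumed rule).
std::vector<std::string> tokenize(const std::string& text) {
    std::vector<std::string> tokens;
    std::istringstream in(text);
    std::string word;
    while (in >> word) {
        std::string clean;
        for (char c : word) {
            unsigned char uc = static_cast<unsigned char>(c);
            if (std::isalpha(uc)) clean += static_cast<char>(std::tolower(uc));
        }
        if (!clean.empty()) tokens.push_back(clean);
    }
    return tokens;
}

// Collect distinct terms in order of first appearance across all documents.
std::vector<std::string> build_vocabulary(const std::vector<std::string>& docs) {
    std::vector<std::string> vocab;
    for (const auto& doc : docs)
        for (const auto& tok : tokenize(doc))
            if (std::find(vocab.begin(), vocab.end(), tok) == vocab.end())
                vocab.push_back(tok);
    return vocab;
}

// Count how often each vocabulary term occurs in one document.
std::vector<int> term_vector(const std::vector<std::string>& vocab,
                             const std::string& doc) {
    std::vector<int> counts(vocab.size(), 0);
    for (const auto& tok : tokenize(doc)) {
        auto it = std::find(vocab.begin(), vocab.end(), tok);
        if (it != vocab.end()) counts[it - vocab.begin()] += 1;
    }
    return counts;
}
```

Calling `term_vector` on "I see a bright future." and "I see nothing." with the shared vocabulary yields the two rows of the table.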

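7 Cosine as Distance Between Documents
The distance between doc1 and doc2 is then defined as the cosine of the angle between their term vectors:
$\cos(\mathrm{doc1}, \mathrm{doc2}) = \frac{\mathrm{doc1} \cdot \mathrm{doc2}}{\lVert \mathrm{doc1} \rVert \, \lVert \mathrm{doc2} \rVert}$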
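As a worked example, the cosine for the two documents from the VSM slide can be computed with the sketch below. The function name `cosine_similarity` is hypothetical, and returning 0 for an all-zero vector is an assumed convention.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Cosine of the angle between two term vectors of equal length.
double cosine_similarity(const std::vector<double>& a,
                         const std::vector<double>& b) {
    double dot = 0.0, norm_a = 0.0, norm_b = 0.0;
    for (std::size_t i = 0; i < a.size() && i < b.size(); ++i) {
        dot += a[i] * b[i];
        norm_a += a[i] * a[i];
        norm_b += b[i] * b[i];
    }
    // Assumed convention: an empty document is similar to nothing.
    if (norm_a == 0.0 || norm_b == 0.0) return 0.0;
    return dot / (std::sqrt(norm_a) * std::sqrt(norm_b));
}
```

For doc1 = (1,1,1,1,1,0) and doc2 = (1,1,0,0,0,1) the dot product is 2 and the norms are sqrt(5) and sqrt(3), so the cosine is 2/sqrt(15), roughly 0.516.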

8 Problems with cosine similarity
- Synonymy: different words may have the same meaning, e.g. car manufacturer = automobile maker
- Polysemy: a word may have several different meanings, e.g. ‘Truman Capote’ may mean the writer or the film
=> We need a model that reflects the ‘meaning’

9 Probabilistic Latent Semantic Analysis
Graphical model of PLSA (D: document, Z: latent class, W: word)
[Figure: example graph over nodes D1, D2, Z1, W1 with edge probabilities 0.1, 0.9, 0.3, 0.7, 0.8, 0.2]
These can also be written as:

10 Through maximum likelihood estimation, one gets the estimated parameters:
- P(d|z): this is what we want, a document-topic matrix that reflects the meanings of the documents
- P(w|z)
- P(z)
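For reference, in the standard symmetric parameterization of PLSA these three parameter sets combine into a joint probability of a document-word pair, and maximum likelihood estimation maximizes the log-likelihood of the observed counts. The count notation n(d, w) is an assumption here, chosen to match the n_d_w structure named in the data-structures slide.

```latex
\[
P(d, w) = \sum_{z} P(z)\, P(d \mid z)\, P(w \mid z)
\]
\[
\mathcal{L} = \sum_{d} \sum_{w} n(d, w)\, \log P(d, w)
\]
```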

11 Our approach
1. Get the P(d|z) matrix by PLSA, and
2. Use the k-means clustering algorithm on the matrix
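The second step can be sketched as plain Lloyd's k-means over the rows of a document-topic matrix. This is a minimal illustration, not the presenter's implementation: the name `kmeans_rows` is hypothetical, and seeding the centroids with the first k rows is a simplifying assumption (real runs usually use random or k-means++ seeding).

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

double squared_distance(const std::vector<double>& a,
                        const std::vector<double>& b) {
    double sum = 0.0;
    for (std::size_t j = 0; j < a.size(); ++j) {
        double diff = a[j] - b[j];
        sum += diff * diff;
    }
    return sum;
}

// Lloyd's k-means over the rows of `rows`; returns one cluster label per row.
// Assumption: the first k rows serve as initial centroids.
std::vector<int> kmeans_rows(const Matrix& rows, std::size_t k, int max_iters) {
    const std::size_t n = rows.size();
    std::vector<int> label(n, -1);
    if (n == 0 || k == 0 || k > n) return label;
    const std::size_t dim = rows[0].size();
    Matrix centroid(rows.begin(), rows.begin() + static_cast<std::ptrdiff_t>(k));

    for (int iter = 0; iter < max_iters; ++iter) {
        // Assignment step: attach every row to its nearest centroid.
        bool changed = false;
        for (std::size_t i = 0; i < n; ++i) {
            int best = 0;
            double best_dist = std::numeric_limits<double>::max();
            for (std::size_t c = 0; c < k; ++c) {
                double dist = squared_distance(rows[i], centroid[c]);
                if (dist < best_dist) {
                    best_dist = dist;
                    best = static_cast<int>(c);
                }
            }
            if (label[i] != best) {
                label[i] = best;
                changed = true;
            }
        }
        if (!changed) break;

        // Update step: move each centroid to the mean of its members.
        Matrix sum(k, std::vector<double>(dim, 0.0));
        std::vector<int> count(k, 0);
        for (std::size_t i = 0; i < n; ++i) {
            for (std::size_t j = 0; j < dim; ++j) sum[label[i]][j] += rows[i][j];
            ++count[label[i]];
        }
        for (std::size_t c = 0; c < k; ++c) {
            if (count[c] == 0) continue;  // keep an empty cluster's old centroid
            for (std::size_t j = 0; j < dim; ++j) centroid[c][j] = sum[c][j] / count[c];
        }
    }
    return label;
}
```

Each row passed in stands for one document's entries in the P(d|z) matrix, so documents that load on the same latent classes end up with the same label.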

12 Problems with this approach
PLSA takes too much time.
Solution: optimization and parallelization

13 Algorithm Outline
Expectation Maximization (EM) Algorithm:
Tempered EM:
E-step:
M-step:
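The standard tempered EM updates for the symmetric PLSA parameterization are sketched below, under the assumption that the slide follows the usual formulation. Here n(d, w) is the count of word w in document d, and the tempering exponent β (with 0 < β ≤ 1) is assumed notation; β = 1 gives ordinary EM.

```latex
% E-step: posterior of the latent class for each observed (d, w) pair
\[
P(z \mid d, w) =
  \frac{P(z)\,\bigl[P(d \mid z)\, P(w \mid z)\bigr]^{\beta}}
       {\sum_{z'} P(z')\,\bigl[P(d \mid z')\, P(w \mid z')\bigr]^{\beta}}
\]
% M-step: re-estimate the parameters from the expected counts
\[
P(w \mid z) = \frac{\sum_{d} n(d, w)\, P(z \mid d, w)}
                   {\sum_{d} \sum_{w'} n(d, w')\, P(z \mid d, w')}
\qquad
P(d \mid z) = \frac{\sum_{w} n(d, w)\, P(z \mid d, w)}
                   {\sum_{d'} \sum_{w} n(d', w)\, P(z \mid d', w)}
\]
\[
P(z) = \frac{\sum_{d} \sum_{w} n(d, w)\, P(z \mid d, w)}
            {\sum_{d} \sum_{w} n(d, w)}
\]
```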

14 Basic Data Structures
- p_w_z_current, p_w_z_prev: dense double matrix, W*Z
- p_d_z_current, p_d_z_prev: dense double matrix, D*Z
- p_z_current, p_z_prev: double array, Z
- n_d_w: sparse integer matrix, N
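One way these structures might be laid out is sketched below. The field names follow the slide; everything else is an assumption: the type names `Cooccurrence` and `PlsaState`, the flat row-major storage, the coordinate-list form of n_d_w, and the uniform starting values.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One nonzero cell of the sparse count matrix n_d_w (coordinate format).
struct Cooccurrence {
    std::size_t doc;
    std::size_t word;
    int count;
};

struct PlsaState {
    std::size_t numDocs, numWords, numZ;
    // Dense matrices stored as flat row-major arrays.
    std::vector<double> p_w_z_current, p_w_z_prev;  // W*Z
    std::vector<double> p_d_z_current, p_d_z_prev;  // D*Z
    std::vector<double> p_z_current, p_z_prev;      // Z
    std::vector<Cooccurrence> n_d_w;                // N nonzero entries

    // Uniform values are placeholders only: EM needs a non-uniform
    // (for example random) start so the latent classes can diverge.
    PlsaState(std::size_t D, std::size_t W, std::size_t Z)
        : numDocs(D), numWords(W), numZ(Z),
          p_w_z_current(W * Z, 1.0 / W), p_w_z_prev(W * Z, 1.0 / W),
          p_d_z_current(D * Z, 1.0 / D), p_d_z_prev(D * Z, 1.0 / D),
          p_z_current(Z, 1.0 / Z), p_z_prev(Z, 1.0 / Z) {}

    // Element access into the current P(w|z) matrix.
    double& p_w_z(std::size_t w, std::size_t z) {
        return p_w_z_current[w * numZ + z];
    }

    // After an iteration, the freshly computed values become "prev".
    void swap_buffers() {
        p_w_z_current.swap(p_w_z_prev);
        p_d_z_current.swap(p_d_z_prev);
        p_z_current.swap(p_z_prev);
    }
};
```

Keeping a current and a previous copy of each parameter lets the M-step write new estimates while the E-step still reads the old ones.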

15 Lemur Implementation
- On-demand calculation of p_z_d_w
- Computational complexity: O(W*D*Z^2)
- For the new3 dataset (9558 documents, 83487 unique terms), one TEM iteration takes days

16 Optimization of the Algorithm
- Reduce complexity
  - calculate p_z_d_w just once per iteration
  - complexity reduced to O(N*Z)
- Reduce cache misses by reordering the loops:

    for (int d = 1; d < numDocs; d++) {                  // each document
        for (int w = 0; w < numTermsInThisDoc; w++) {    // each term occurring in this document
            for (int z = 0; z < numZ; z++) {             // each latent class
                ….
            }
        }
    }
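A minimal sketch of one tempered EM iteration in O(N*Z) follows: the posterior p_z_d_w is computed once per nonzero (d, w) entry and immediately accumulated into all three parameter sets. This is an illustration under assumptions, not the presenter's code: the names `Entry`, `Params`, and `tem_iteration`, the flat row-major layout, and the `beta` parameter are all hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Entry {             // one nonzero cell of n_d_w
    std::size_t d, w;
    int count;
};

struct Params {
    std::vector<double> p_w_z;  // W*Z, row-major
    std::vector<double> p_d_z;  // D*Z, row-major
    std::vector<double> p_z;    // Z
};

// One tempered EM iteration: reads `prev`, returns the re-estimated parameters.
Params tem_iteration(const std::vector<Entry>& n_d_w, const Params& prev,
                     std::size_t D, std::size_t W, std::size_t Z, double beta) {
    Params next{std::vector<double>(W * Z, 0.0),
                std::vector<double>(D * Z, 0.0),
                std::vector<double>(Z, 0.0)};
    std::vector<double> p_z_d_w(Z);

    for (const Entry& e : n_d_w) {
        // E-step for this (d, w) pair, computed exactly once.
        double norm = 0.0;
        for (std::size_t z = 0; z < Z; ++z) {
            p_z_d_w[z] = prev.p_z[z] *
                std::pow(prev.p_d_z[e.d * Z + z] * prev.p_w_z[e.w * Z + z], beta);
            norm += p_z_d_w[z];
        }
        if (norm <= 0.0) continue;
        // M-step accumulation: expected counts n(d,w) * P(z|d,w).
        for (std::size_t z = 0; z < Z; ++z) {
            double r = e.count * p_z_d_w[z] / norm;
            next.p_w_z[e.w * Z + z] += r;
            next.p_d_z[e.d * Z + z] += r;
            next.p_z[z] += r;
        }
    }

    // Normalize. The unnormalized p_z[z] is the shared denominator
    // of column z in both P(w|z) and P(d|z).
    double total = 0.0;
    for (std::size_t z = 0; z < Z; ++z) total += next.p_z[z];
    for (std::size_t z = 0; z < Z; ++z) {
        if (next.p_z[z] <= 0.0) continue;
        for (std::size_t w = 0; w < W; ++w) next.p_w_z[w * Z + z] /= next.p_z[z];
        for (std::size_t d = 0; d < D; ++d) next.p_d_z[d * Z + z] /= next.p_z[z];
    }
    if (total > 0.0)
        for (std::size_t z = 0; z < Z; ++z) next.p_z[z] /= total;
    return next;
}
```

The main loop touches each of the N nonzero entries once and does O(Z) work per entry. Keeping z as the innermost index means the Z values for a given document or word are contiguous in memory, which is one plausible reading of the cache-miss remark.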

17 Parallelization: Access Pattern
Data race
Solution: divide the co-occurrence table into blocks
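One common way to use such blocks without races is a rotation schedule, sketched below. It is an assumed scheme for illustration, not necessarily the dispatching algorithm of this work; the name `rotation_schedule` is hypothetical. The table is cut into a P x P grid of (document range, word range) blocks, and in round r worker t processes block (t, (t + r) mod P), so no two workers active in the same round share a document range or a word range.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// (document-range index, word-range index) of one block in a P x P grid.
using Block = std::pair<std::size_t, std::size_t>;

// rounds[r][t] is the block that worker t processes in round r.
std::vector<std::vector<Block>> rotation_schedule(std::size_t P) {
    std::vector<std::vector<Block>> rounds(P, std::vector<Block>(P));
    for (std::size_t r = 0; r < P; ++r) {
        for (std::size_t t = 0; t < P; ++t) {
            // Worker t keeps its document range and shifts its word range
            // by one position per round.
            rounds[r][t] = Block(t, (t + r) % P);
        }
    }
    return rounds;
}
```

Within a round the workers write to disjoint rows of the P(d|z) and P(w|z) accumulators. The P(z) sums would still need per-worker partial totals that are merged afterwards, and a barrier is needed between rounds.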

18 Block Dispatching Algorithm

19 Block Dividing Algorithm (figure label: cranmed)

20 Experiment Setup

21 Speedup (figure panels: HPC134, Tulsa)

22 Memory Bandwidth Usage

23 Memory Related Pipeline Stalls

24 Available Memory Bandwidth of the Two Machines

25 END

26 Backup slides

27 Test Results
Table 1. F-score of PLSA and VSM

          PLSA    VSM
Tr23      0.4977  0.5273
K1b       0.8473  0.5724
sports    0.7575  0.5563

28 Table 2. Time used in one EM iteration (in seconds)
Uses the k1b dataset (2340 docs, 21247 unique terms, 530374 terms)

size Z     10    20    50    100
Lemur      29    48    263   1015
Optimized  2     3.2   7     13

29 Thanks!

