
1
2015-5-1
Clustering Search Results Using PLSA
洪春涛

2
Outline
- Motivation
- Introduction to document clustering and the PLSA algorithm
- Work in progress and test results

3
Motivation
- Current Internet search engines return too much information.
- Clustering the search results may help users find the desired information quickly.

4
A demo of the search results from Google:
- The writer Truman Capote
- The film Truman Capote

5
Document Clustering
Put 'similar' documents together.
=> How do we define 'similar'?

6
Vector Space Model of Documents
The Vector Space Model (VSM) sees a document as a vector of terms:
Doc1: I see a bright future.
Doc2: I see nothing.

        I   see   a   bright   future   nothing
doc1    1    1    1     1        1         0
doc2    1    1    0     0        0         1

7
Cosine as Distance Between Documents
The distance between doc1 and doc2 is then defined via the cosine of the angle between their term vectors.
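The formula on the original slide was lost in extraction; the standard cosine similarity it refers to is:

```latex
\cos(d_1, d_2)
  = \frac{\vec{d_1} \cdot \vec{d_2}}{\lVert \vec{d_1} \rVert \, \lVert \vec{d_2} \rVert}
  = \frac{\sum_i w_{1,i}\, w_{2,i}}
         {\sqrt{\sum_i w_{1,i}^2}\,\sqrt{\sum_i w_{2,i}^2}}
```

For the two example documents above, doc1 = (1,1,1,1,1,0) and doc2 = (1,1,0,0,0,1), so the dot product is 2, the norms are sqrt(5) and sqrt(3), and the cosine is 2/sqrt(15), roughly 0.516.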

8
Problems with Cosine Similarity
- Synonymy: different words may have the same meaning, e.g. 'car manufacturer' = 'automobile maker'.
- Polysemy: a word may have several different meanings, e.g. 'Truman Capote' may mean the writer or the film.
=> We need a model that reflects the 'meaning'.

9
Probabilistic Latent Semantic Analysis
Graphical model of PLSA: D -> Z -> W
D: document, Z: latent class (topic), W: word
(Figure: the model can also be drawn with explicit probabilities on the arcs from documents to latent classes and from latent classes to words.)

10
Through maximum-likelihood estimation, one gets the estimated parameters P(z), P(d|z), and P(w|z).
P(d|z) is what we want: a document-topic matrix that reflects the meanings of the documents.

11
Our Approach
1. Get the P(d|z) matrix by PLSA, and
2. Run the k-means clustering algorithm on that matrix.

12
Problems with This Approach
- PLSA takes too much time.
- Solution: optimization and parallelization.

13
Algorithm Outline
Expectation-Maximization (EM) algorithm:
E-step:
  P(z|d,w) = P(z) P(d|z) P(w|z) / Σ_{z'} P(z') P(d|z') P(w|z')
M-step:
  P(w|z) ∝ Σ_d n(d,w) P(z|d,w)
  P(d|z) ∝ Σ_w n(d,w) P(z|d,w)
  P(z)   ∝ Σ_{d,w} n(d,w) P(z|d,w)
Tempered EM (TEM): raise the E-step numerator to a power β (0 < β ≤ 1) to avoid overfitting:
  P(z|d,w) ∝ [P(z) P(d|z) P(w|z)]^β

14
Basic Data Structures
- p_w_z_current, p_w_z_prev: dense double matrix, W x Z
- p_d_z_current, p_d_z_prev: dense double matrix, D x Z
- p_z_current, p_z_prev: double array, length Z
- n_d_w: sparse integer matrix, N non-zero entries

15
Lemur Implementation
- On-demand calculation of p_z_d_w
- Computational complexity: O(W * D * Z^2)
- For the new3 dataset (9,558 documents, 83,487 unique terms), it takes days to finish one TEM iteration.

16
Optimization of the Algorithm
- Reduce complexity: calculate p_z_d_w just once per iteration; complexity reduced to O(N * Z)
- Reduce cache misses by reordering the loops: for(int d=1;d

17
Parallelization: Access Pattern
Data race when threads update shared parameters.
Solution: divide the co-occurrence table into blocks.

18
Block Dispatching Algorithm

19
Block Dividing Algorithm
(Figure: cranmed dataset)

20
Experiment Setup

21
Speedup
(Figure: HPC134 and Tulsa)

22
Memory Bandwidth Usage

23
Memory-Related Pipeline Stalls

24
Available Memory Bandwidth of the Two Machines

25
END

26
Backup slides

27
Test Results

          PLSA     VSM
Tr23      0.4977   0.5273
K1b       0.8473   0.5724
sports    0.7575   0.5563

Table 1. F-score of PLSA and VSM

28
Z            10     20     50     100
Lemur        29     48     263    1015
Optimized    2      3.2    7      13

Table 2. Time used in one EM iteration (in seconds)
Uses the k1b dataset (2,340 docs, 21,247 unique terms, 530,374 terms)

29
Thanks!
