Clustering Search Results Using PLSA
洪春涛
2015-5-1
Outline
- Motivation
- Introduction to document clustering and the PLSA algorithm
- Work in progress and test results
Motivation
Current Internet search engines give us too much information. Clustering the search results may help users find the desired information quickly.
The writer Truman Capote vs. the film Truman Capote
(figure: a demo of the search results from Google)
Document clustering
Put the 'similar' documents together. => How do we define 'similar'?
Vector Space Model of documents
The Vector Space Model (VSM) sees a document as a vector of terms:
Doc1: I see a bright future.
Doc2: I see nothing.

        I   see   a   bright   future   nothing
doc1    1    1    1      1        1         0
doc2    1    1    0      0        0         1
Cosine as Distance Between Documents
The distance between doc1 and doc2 is then defined via the cosine of the angle between their term vectors:
cos(doc1, doc2) = (doc1 . doc2) / (|doc1| |doc2|)
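As a minimal sketch (not code from the slides), the cosine measure over VSM term vectors like the ones above can be computed as:

```cpp
#include <cmath>
#include <vector>

// Cosine similarity between two term-count vectors (VSM representation).
// Returns dot(a,b) / (|a| * |b|); assumes equal lengths and non-zero vectors.
double cosine_similarity(const std::vector<double>& a,
                         const std::vector<double>& b) {
    double dot = 0.0, na = 0.0, nb = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        dot += a[i] * b[i];
        na  += a[i] * a[i];
        nb  += b[i] * b[i];
    }
    return dot / (std::sqrt(na) * std::sqrt(nb));
}
```

For doc1 = (1,1,1,1,1,0) and doc2 = (1,1,0,0,0,1), this gives 2 / sqrt(5 * 3), about 0.516.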
Problems with cosine similarity
- Synonymy: different words may have the same meaning, e.g. 'car manufacturer' = 'automobile maker'
- Polysemy: a word may have several different meanings, e.g. 'Truman Capote' may mean the writer or the film
=> We need a model that reflects the 'meaning'
Probabilistic Latent Semantic Analysis
Graphical model of PLSA: D -> Z -> W
D: document, Z: latent class, W: word
(figure: the same model drawn with example probabilities on the edges)
Through maximum likelihood estimation, one gets the estimated parameters P(z), P(d|z), and P(w|z). P(d|z) is what we want: a document-topic matrix that reflects the meanings of the documents.
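In standard PLSA notation, consistent with the parameters listed above, the joint document-word probability factorizes through the latent class:

```latex
P(d, w) = \sum_{z} P(z)\, P(d \mid z)\, P(w \mid z)
```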
Our approach
1. Get the P(d|z) matrix by PLSA, and
2. Use the k-means clustering algorithm on the matrix
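The second step can be sketched as follows: plain Lloyd's k-means over the rows of the D x Z matrix P(d|z), one row per document. This is an illustrative sketch, not the authors' implementation; the naive first-k-rows seeding is an assumption.

```cpp
#include <limits>
#include <vector>

// Lloyd's k-means on row vectors (each row = P(d|z) for one document).
// Returns the cluster label of every row. Initial centroids are simply
// the first k rows; a real implementation would seed more carefully.
std::vector<int> kmeans(const std::vector<std::vector<double>>& rows,
                        int k, int iters = 100) {
    int n = (int)rows.size(), dim = (int)rows[0].size();
    std::vector<std::vector<double>> cent(rows.begin(), rows.begin() + k);
    std::vector<int> label(n, 0);
    for (int it = 0; it < iters; ++it) {
        // Assignment step: nearest centroid by squared Euclidean distance.
        for (int i = 0; i < n; ++i) {
            double best = std::numeric_limits<double>::max();
            for (int c = 0; c < k; ++c) {
                double d = 0.0;
                for (int j = 0; j < dim; ++j) {
                    double diff = rows[i][j] - cent[c][j];
                    d += diff * diff;
                }
                if (d < best) { best = d; label[i] = c; }
            }
        }
        // Update step: each centroid becomes the mean of its assigned rows.
        for (int c = 0; c < k; ++c) {
            std::vector<double> mean(dim, 0.0);
            int count = 0;
            for (int i = 0; i < n; ++i)
                if (label[i] == c) {
                    ++count;
                    for (int j = 0; j < dim; ++j) mean[j] += rows[i][j];
                }
            if (count > 0) {
                for (int j = 0; j < dim; ++j) mean[j] /= count;
                cent[c] = mean;
            }
        }
    }
    return label;
}
```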
Problems with this approach
PLSA takes too much time. Solution: optimization & parallelization.
Algorithm Outline
Expectation-Maximization (EM) algorithm and Tempered EM (TEM): each iteration alternates an E-step and an M-step.
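The slide's formula images did not survive; for reference, the standard PLSA update equations (with n(d,w) the document-word co-occurrence count) are:

```latex
% E-step: posterior of the latent class
P(z \mid d, w) = \frac{P(z)\, P(d \mid z)\, P(w \mid z)}
                      {\sum_{z'} P(z')\, P(d \mid z')\, P(w \mid z')}

% Tempered EM dampens the E-step with an exponent \beta \le 1
P_{\beta}(z \mid d, w) \propto \bigl[ P(z)\, P(d \mid z)\, P(w \mid z) \bigr]^{\beta}

% M-step: re-estimation from the counts n(d,w)
P(w \mid z) \propto \sum_{d} n(d, w)\, P(z \mid d, w) \qquad
P(d \mid z) \propto \sum_{w} n(d, w)\, P(z \mid d, w) \qquad
P(z) \propto \sum_{d, w} n(d, w)\, P(z \mid d, w)
```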
Basic Data Structures
- p_w_z_current, p_w_z_prev: dense double matrices of size W x Z
- p_d_z_current, p_d_z_prev: dense double matrices of size D x Z
- p_z_current, p_z_prev: double arrays of length Z
- n_d_w: sparse integer matrix with N non-zero entries
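One way these structures might be declared (a sketch; the slide gives only names and shapes, so the row-major layout and the CSR format for n_d_w are assumptions):

```cpp
#include <vector>

// Dense parameter tables, sized as on the slide (entries are probabilities).
// "current" is written during the M-step; "prev" is read during the E-step.
struct PlsaParams {
    int W, D, Z;
    std::vector<double> p_w_z_current, p_w_z_prev;  // W x Z, row-major
    std::vector<double> p_d_z_current, p_d_z_prev;  // D x Z, row-major
    std::vector<double> p_z_current, p_z_prev;      // length Z
    PlsaParams(int W_, int D_, int Z_)
        : W(W_), D(D_), Z(Z_),
          p_w_z_current(W_ * Z_), p_w_z_prev(W_ * Z_),
          p_d_z_current(D_ * Z_), p_d_z_prev(D_ * Z_),
          p_z_current(Z_), p_z_prev(Z_) {}
};

// Sparse document-word count matrix n_d_w with N non-zeros, CSR layout:
// row d's entries live in [row_ptr[d], row_ptr[d+1]).
struct SparseCounts {
    std::vector<int> row_ptr;  // length D+1
    std::vector<int> col;      // length N: word ids
    std::vector<int> val;      // length N: counts n(d,w)
};
```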
Lemur Implementation
- On-demand calculation of p_z_d_w
- Computational complexity: O(W*D*Z^2)
- For the new3 dataset, containing 9558 documents and 83487 unique terms, it takes days to finish a TEM iteration
Optimization of the Algorithm
Reduce complexity:
- calculate p_z_d_w just once in an iteration
- complexity reduced to O(N*Z)
Reduce cache misses by reordering the loops:

for (int d = 1; d < numDocs; d++) {
    for (int w = 0; w < numTermsInThisDoc; w++) {
        for (int z = 0; z < numZ; z++) {
            ...
        }
    }
}
Parallelization: Access Pattern
Data race. Solution: divide the co-occurrence table into blocks.
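One race-free schedule consistent with this idea (a sketch, not necessarily the authors' dispatcher): split documents and words into a P x P grid of blocks; in round r, worker t handles block (t, (t + r) mod P). Within a round no two workers share a document-row or a word-column, so updates to p_d_z and p_w_z never collide.

```cpp
#include <utility>
#include <vector>

// For P workers, produce P rounds of P blocks each. Within a round all
// row indices are distinct and all column indices are distinct, so
// concurrent workers never touch the same p_d_z row or p_w_z column.
std::vector<std::vector<std::pair<int, int>>> block_schedule(int P) {
    std::vector<std::vector<std::pair<int, int>>> rounds(P);
    for (int r = 0; r < P; ++r)
        for (int t = 0; t < P; ++t)
            rounds[r].push_back({t, (t + r) % P});  // worker t's block in round r
    return rounds;
}
```

This is the classic diagonal ("rotating") schedule for blocked co-occurrence updates; each of the P*P blocks is processed exactly once across the P rounds.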
Block Dispatching Algorithm
Block Dividing Algorithm
(figure: block division illustrated on the cranmed dataset)
Experiment Setup
Speedup
(figure: speedup on the HPC134 and Tulsa machines)
Memory Bandwidth Usage
Memory-Related Pipeline Stalls
Available Memory Bandwidth of the Two Machines
END
Backup slides
Test Results
Table 1. F-score of PLSA and VSM

           PLSA      VSM
Tr23      0.4977   0.5273
K1b       0.8473   0.5724
sports    0.7575   0.5563
Table 2. Time used in one EM iteration (in seconds), on the k1b dataset (2340 docs, 21247 unique terms, 530374 terms)

             Z=10   Z=20   Z=50   Z=100
Lemur         294    826   3101     --
Optimized       2    3.2      7     13
Thanks!