# 2015-5-11 Clustering Search Results Using PLSA 洪春涛.

2015-5-11 Clustering Search Results Using PLSA 洪春涛

Outline:

- Motivation
- Introduction to document clustering and the PLSA algorithm
- Work in progress and test results

Motivation: Current Internet search engines give us too much information. Clustering the search results may help users find the desired information quickly.

A demo of search results from Google. Figure labels: the writer Truman Capote; the film Truman Capote.

Document clustering: put the ‘similar’ documents together. => How do we define ‘similar’?

Vector Space Model of documents

The Vector Space Model (VSM) sees a document as a vector of terms:

- Doc1: I see a bright future.
- Doc2: I see nothing.

| | I | see | a | bright | future | nothing |
|---|---|---|---|---|---|---|
| doc1 | 1 | 1 | 1 | 1 | 1 | 0 |
| doc2 | 1 | 1 | 0 | 0 | 0 | 1 |
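The term vectors for the two example documents can be built mechanically. The following is a minimal sketch, assuming binary (present/absent) weights and a simple letters-only tokenizer; `tokenize` and `build_vectors` are hypothetical helper names, not part of any toolkit mentioned in the talk.

```python
import re

def tokenize(text):
    """Split on non-letters; case is kept so that 'I' stays as written."""
    return re.findall(r"[A-Za-z]+", text)

def build_vectors(docs):
    """Return (vocabulary, vectors), vocabulary in order of first appearance."""
    vocab = []
    for doc in docs:
        for term in tokenize(doc):
            if term not in vocab:
                vocab.append(term)
    vectors = []
    for doc in docs:
        present = set(tokenize(doc))
        # Binary weight: 1 if the term occurs in the document, else 0.
        vectors.append([1 if term in present else 0 for term in vocab])
    return vocab, vectors

docs = ["I see a bright future.", "I see nothing."]
vocab, vectors = build_vectors(docs)
print(vocab)       # ['I', 'see', 'a', 'bright', 'future', 'nothing']
print(vectors[0])  # [1, 1, 1, 1, 1, 0]
print(vectors[1])  # [1, 1, 0, 0, 0, 1]
```

Real systems usually weight terms (for example by frequency) rather than using 0/1 entries; binary weights are used here only because they match the example.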

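Cosine as Distance Between Documents

The distance between doc1 and doc2 is then defined through the cosine of the angle between their term vectors:

$$\cos(d_1, d_2) = \frac{d_1 \cdot d_2}{\lVert d_1 \rVert \, \lVert d_2 \rVert}$$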
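As a worked example, the cosine for the two example documents can be computed directly from their term vectors. This is a small sketch; `cosine_similarity` is an illustrative name, and the vectors are the binary ones from the VSM example (term order: I, see, a, bright, future, nothing).

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for empty documents
    return dot / (norm_u * norm_v)

doc1 = [1, 1, 1, 1, 1, 0]
doc2 = [1, 1, 0, 0, 0, 1]
similarity = cosine_similarity(doc1, doc2)
print(round(similarity, 4))  # 0.5164, i.e. 2 / sqrt(5 * 3)
```

The two documents share two terms ("I" and "see"), so the dot product is 2, and the vector lengths are sqrt(5) and sqrt(3).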

Problems with cosine similarity:

- Synonymy: different words may have the same meaning, e.g. car manufacturer = automobile maker
- Polysemy: a word may have several different meanings, e.g. ‘Truman Capote’ may mean the writer or the film

=> We need a model that reflects the ‘meaning’

Probabilistic Latent Semantic Analysis

Graphical model of PLSA (figure): nodes D1, D2, Z1, W1, with edge weights 0.1, 0.9, 0.3, 0.7, 0.8, 0.2.

- D: document
- Z: latent class
- W: word

These can also be written as: (formula not preserved in the transcript)
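For reference, the PLSA model is commonly written in two equivalent forms in the literature. The block below is a sketch in that standard notation, assuming n(d, w) denotes the count of word w in document d; it is not taken from the slide, whose own notation may differ.

```latex
\begin{align}
% Asymmetric form: pick a document, then a latent class, then a word
P(d, w) &= P(d) \sum_{z} P(z \mid d)\, P(w \mid z) \\
% Symmetric form: pick a latent class, then a document and a word independently
P(d, w) &= \sum_{z} P(z)\, P(d \mid z)\, P(w \mid z) \\
% Log-likelihood maximized during training
\mathcal{L} &= \sum_{d} \sum_{w} n(d, w)\, \log P(d, w)
\end{align}
```

The symmetric form is the one whose parameters are P(d|z), P(w|z) and P(z).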

Through maximum likelihood estimation, one gets the estimated parameters:

- P(d|z): this is what we want, a document-topic matrix that reflects the meanings of the documents
- P(w|z)
- P(z)

Our approach:

1. Get the P(d|z) matrix by PLSA, and
2. Use the k-means clustering algorithm on the matrix
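The second step can be sketched as follows: a plain Lloyd's k-means over the rows of a document-by-class matrix. The matrix values below are made up for illustration, and `kmeans` is a hypothetical helper written for this sketch, not the implementation used in the talk.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(rows, k, iterations=20, seed=0):
    """Plain Lloyd's k-means over matrix rows; returns one label per row."""
    rng = random.Random(seed)
    centroids = [list(r) for r in rng.sample(rows, k)]
    labels = [0] * len(rows)
    for _ in range(iterations):
        # Assignment step: nearest centroid for each document row.
        labels = [min(range(k), key=lambda c: squared_distance(row, centroids[c]))
                  for row in rows]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [rows[i] for i in range(len(rows)) if labels[i] == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Toy stand-in for a D x Z matrix (4 documents, 2 latent classes).
p_d_z = [
    [0.45, 0.02],
    [0.50, 0.03],
    [0.03, 0.40],
    [0.02, 0.55],
]
labels = kmeans(p_d_z, k=2)
print(labels)
```

Documents 0 and 1 load on the first latent class and documents 2 and 3 on the second, so they end up in two separate clusters; which cluster receives which label number depends on the random initialization.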

Problems with this approach: PLSA takes too much time. Solution: optimization & parallelization.

Algorithm Outline

- Expectation Maximization (EM) algorithm
- Tempered EM (TEM), with an E-step and an M-step
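For reference, the tempered EM updates for the symmetric PLSA parameterization are usually written as below. This is a sketch in the standard notation from the PLSA literature, assuming n(d, w) is the count of word w in document d and β is the tempering parameter (β = 1 gives plain EM); the notation used in the talk may differ.

```latex
\begin{align}
% E-step: posterior of the latent class for each (d, w) pair
P(z \mid d, w) &= \frac{P(z)\,\bigl[P(d \mid z)\,P(w \mid z)\bigr]^{\beta}}
                       {\sum_{z'} P(z')\,\bigl[P(d \mid z')\,P(w \mid z')\bigr]^{\beta}} \\
% M-step: re-estimate the parameters from the weighted counts
P(w \mid z) &= \frac{\sum_{d} n(d, w)\,P(z \mid d, w)}
                    {\sum_{d} \sum_{w'} n(d, w')\,P(z \mid d, w')} \\
P(d \mid z) &= \frac{\sum_{w} n(d, w)\,P(z \mid d, w)}
                    {\sum_{d'} \sum_{w} n(d', w)\,P(z \mid d', w)} \\
P(z) &= \frac{\sum_{d} \sum_{w} n(d, w)\,P(z \mid d, w)}
             {\sum_{d} \sum_{w} n(d, w)}
\end{align}
```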

Basic Data Structures

- `p_w_z_current`, `p_w_z_prev`: dense double matrix, W*Z
- `p_d_z_current`, `p_d_z_prev`: dense double matrix, D*Z
- `p_z_current`, `p_z_prev`: double array, Z
- `n_d_w`: sparse integer matrix, N

Lemur Implementation

- On-demand calculation of p_z_d_w
- Computational complexity: O(W*D*Z^2)
- For the new3 dataset, containing 9558 documents and 83487 unique terms, it takes days to finish a TEM iteration

Optimization of the Algorithm

- Reduce complexity
  - calculate p_z_d_w just once in an iteration
  - complexity reduced to O(N*Z)
- Reduce cache misses by reordering loops: `for(int d=1;d` …
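The complexity reduction can be illustrated with a sketch of one EM pass that visits only the nonzero entries of the co-occurrence table and computes the posterior p_z_d_w once per entry, giving O(N*Z) work. This is an illustrative Python version under stated assumptions, not the optimized implementation from the talk: the sparse table is modelled as a list of (d, w, count) triples, and `em_iteration` and `random_columns` are hypothetical names.

```python
import random

def random_columns(rows, cols, rng):
    """Random rows x cols matrix whose columns each sum to 1."""
    m = [[rng.random() + 0.1 for _ in range(cols)] for _ in range(rows)]
    for c in range(cols):
        s = sum(m[r][c] for r in range(rows))
        for r in range(rows):
            m[r][c] /= s
    return m

def em_iteration(n_d_w, p_d_z, p_w_z, p_z, beta=1.0):
    """One (tempered) EM pass over the nonzero counts; cost is O(N * Z)."""
    D, W, Z = len(p_d_z), len(p_w_z), len(p_z)
    new_d_z = [[0.0] * Z for _ in range(D)]
    new_w_z = [[0.0] * Z for _ in range(W)]
    new_z = [0.0] * Z
    for d, w, count in n_d_w:
        # E-step for this (d, w) pair: the posterior is computed exactly once.
        post = [p_z[z] * (p_d_z[d][z] * p_w_z[w][z]) ** beta for z in range(Z)]
        total = sum(post)
        if total == 0.0:
            continue
        for z in range(Z):
            # M-step accumulation, reusing the posterior just computed.
            weight = count * post[z] / total
            new_d_z[d][z] += weight
            new_w_z[w][z] += weight
            new_z[z] += weight
    grand_total = sum(new_z)
    if grand_total == 0.0:
        return p_d_z, p_w_z, p_z  # nothing to update
    # Normalize: each column of p_d_z and p_w_z sums to 1, and p_z sums to 1.
    for z in range(Z):
        if new_z[z] > 0.0:
            for d in range(D):
                new_d_z[d][z] /= new_z[z]
            for w in range(W):
                new_w_z[w][z] /= new_z[z]
    new_z = [value / grand_total for value in new_z]
    return new_d_z, new_w_z, new_z

# Toy data: two documents over six terms, one count per (d, w) pair.
n_d_w = [(0, 0, 1), (0, 1, 1), (0, 2, 1), (0, 3, 1), (0, 4, 1),
         (1, 0, 1), (1, 1, 1), (1, 5, 1)]
D, W, Z = 2, 6, 2
rng = random.Random(0)
p_d_z = random_columns(D, Z, rng)
p_w_z = random_columns(W, Z, rng)
p_z = [1.0 / Z] * Z
for _ in range(10):
    p_d_z, p_w_z, p_z = em_iteration(n_d_w, p_d_z, p_w_z, p_z)
print([round(v, 3) for v in p_z])
```

The inputs play the role of the `*_prev` arrays and the returned values the role of the `*_current` arrays. The sketch does not attempt to reproduce the loop reordering for cache behaviour, which matters for a compiled implementation rather than for Python lists.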

Parallelization: Access Pattern

Problem: data races. Solution: divide the co-occurrence table into blocks.
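One common way to use such blocks, shown here only as an assumption and not necessarily the dispatching scheme used in the talk, is to split the documents and the words into P groups each and process the P*P blocks in P rounds, so that no two workers in the same round share a document group or a word group. Updates to the rows of `p_d_z` and `p_w_z` then never collide within a round. The names `block_schedule` and `is_conflict_free` are hypothetical.

```python
def block_schedule(num_workers):
    """Return a list of rounds; each round holds one (doc_block, word_block) per worker."""
    rounds = []
    for r in range(num_workers):
        # Worker i takes document group i and a word group shifted by the round index.
        rounds.append([(i, (i + r) % num_workers) for i in range(num_workers)])
    return rounds

def is_conflict_free(round_blocks):
    """True if no document group and no word group is used twice in the round."""
    doc_blocks = [block[0] for block in round_blocks]
    word_blocks = [block[1] for block in round_blocks]
    return (len(set(doc_blocks)) == len(doc_blocks)
            and len(set(word_blocks)) == len(word_blocks))

schedule = block_schedule(4)
for round_blocks in schedule:
    print(round_blocks, is_conflict_free(round_blocks))
```

Quantities shared by all blocks, such as the per-class totals, would still need per-worker partial sums that are combined after each round.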

Block Dispatching Algorithm

Block Dividing Algorithm (figure panels: cran, med)

Experiment Setup

Speedup (figure panels: HPC134, Tulsa)

Memory Bandwidth Usage

Memory Related Pipeline Stalls

Available Memory Bandwidth of the Two Machines

END

Backup slides

Test Results

| Dataset | PLSA | VSM |
|---|---|---|
| Tr23 | 0.4977 | 0.5273 |
| K1b | 0.8473 | 0.5724 |
| sports | 0.7575 | 0.5563 |

Table 1. F-score of PLSA and VSM

| size Z | 10 | 20 | 50 | 100 |
|---|---|---|---|---|
| Lemur | 29 | 48 | 263 | 1015 |
| Optimized | 2 | 3.2 | 7 | 13 |

Table 2. Time used in one EM iteration (in seconds). Uses the k1b dataset (2340 docs, 21247 unique terms, 530374 terms in total).

Thanks!