
1
Clustering Search Results Using PLSA
洪春涛
2015-5-1

2
Outline
- Motivation
- Introduction to document clustering and the PLSA algorithm
- Work in progress and testing results

3
Motivation
Current Internet search engines return too much information. Clustering the search results may help users find the desired information quickly.

4
The writer Truman Capote vs. the film Truman Capote
A demo of the search results from Google.

5
Document clustering
Put the 'similar' documents together => How do we define 'similar'?

6
Vector Space Model of documents
The Vector Space Model (VSM) sees a document as a vector of terms:
Doc1: I see a bright future.
Doc2: I see nothing.

        I  see  a  bright  future  nothing
doc1    1   1   1    1       1       0
doc2    1   1   0    0       0       1

7
Cosine as Distance Between Documents
The distance between doc1 and doc2 is then defined via the cosine of the angle between their term vectors:
cos(doc1, doc2) = (doc1 · doc2) / (|doc1| |doc2|)

8
Problems with cosine similarity
Synonymy: different words may have the same meaning
- car manufacturer = automobile maker
Polysemy: a word may have several different meanings
- 'Truman Capote' may mean the writer or the film
=> We need a model that reflects the 'meaning'

9
Probabilistic Latent Semantic Analysis
Graphical model of PLSA: D -> Z -> W, where D is the document, Z the latent class (topic), and W the word. The joint probability factorizes as
P(d, w) = Σ_z P(z) P(d|z) P(w|z)
[Diagram with example transition probabilities not recoverable from the transcript.]

10
Through maximum-likelihood estimation, one gets the estimated parameters P(d|z), P(w|z), and P(z).
P(d|z) is what we want – a document-topic matrix that reflects the meanings of the documents.

11
Our approach
1. Get the P(d|z) matrix by PLSA, and
2. Run the k-means clustering algorithm on that matrix.

12
Problems with this approach
PLSA takes too much time.
Solution: optimization & parallelization.

13
Algorithm Outline
Expectation Maximization (EM) algorithm:
E-step: P(z|d,w) = P(z) P(d|z) P(w|z) / Σ_z' P(z') P(d|z') P(w|z')
M-step: P(w|z) ∝ Σ_d n(d,w) P(z|d,w),  P(d|z) ∝ Σ_w n(d,w) P(z|d,w),  P(z) ∝ Σ_d,w n(d,w) P(z|d,w)
Tempered EM: the E-step numerator is raised to a power β (0 < β ≤ 1).

14
Basic Data Structures
p_w_z_current, p_w_z_prev: dense double matrix, W x Z
p_d_z_current, p_d_z_prev: dense double matrix, D x Z
p_z_current, p_z_prev: double array of length Z
n_d_w: sparse integer matrix with N non-zero entries

15
Lemur Implementation
On-demand calculation of p_z_d_w.
Computational complexity: O(W*D*Z^2).
For the new3 dataset, containing 9,558 documents and 83,487 unique terms, it takes days to finish a TEM iteration.

16
Optimization of the Algorithm
Reduce complexity:
- calculate p_z_d_w just once per iteration
- complexity reduced to O(N*Z)
Reduce cache misses by reordering the loops (e.g., making the document loop for(int d = 1; d < D; ++d) the outer loop).

17
Parallelization: Access Pattern
Data race. Solution: divide the co-occurrence table into blocks.

18
Block Dispatching Algorithm

19
Block Dividing Algorithm
[Figure: block division on the cranmed dataset]

20
Experiment Setup

21
Speedup
[Figure: speedup curves on the two machines, HPC134 and Tulsa]

22
Memory Bandwidth Usage

23
Memory-Related Pipeline Stalls

24
Available Memory Bandwidth of the Two Machines

25
END

26
Backup slides

27
Test Results

         PLSA     VSM
Tr23    0.4977  0.5273
K1b     0.8473  0.5724
sports  0.7575  0.5563

Table 1. F-score of PLSA and VSM

28
Z            10     20     50    100
Lemur        29     48    263   1015
Optimized     2    3.2      7     13

Table 2. Time used in one EM iteration (in seconds). Uses the k1b dataset (2,340 docs, 21,247 unique terms, 530,374 total term occurrences)

29
Thanks!
