
1 Comparison of information retrieval techniques: Latent semantic indexing (LSI) and Concept indexing (CI) Jasminka Dobša Faculty of organization and informatics, Varaždin

2 Outline
Information retrieval in the vector space model (VSM), or bag-of-words representation
Techniques for conceptual indexing
–Latent semantic indexing
–Concept indexing
Comparison: academic example
Experiment
Further work

3 Information retrieval in VSM 1/3
The task of information retrieval: extract from a document collection the documents that are relevant to a user query.
In the VSM, documents are represented in a high-dimensional space.
The dimension of the space depends on the number of indexing terms chosen as relevant for the collection (4000-5000 in my experiments).
The VSM is implemented by forming a term-document matrix.

4 Information retrieval in VSM 2/3
The term-document matrix is an m×n matrix, where m is the number of terms and n is the number of documents.
–Row of the term-document matrix = term
–Column of the term-document matrix = document
Figure 1. Term-document matrix
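As an illustration, a minimal NumPy sketch of building such a matrix; the three-document mini-corpus and the whitespace tokenization are illustrative assumptions, not the collections used in the experiments:

```python
import numpy as np

# Hypothetical mini-corpus; the real collections and term lists are far larger.
docs = ["matrix algebra and its applications",
        "clustering of large data sets",
        "matrix analysis and applied linear algebra"]
vocab = sorted({w for d in docs for w in d.split()})
term_index = {t: i for i, t in enumerate(vocab)}

A = np.zeros((len(vocab), len(docs)))   # m x n term-document matrix
for j, doc in enumerate(docs):
    for w in doc.split():
        A[term_index[w], j] += 1        # raw term frequency
```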

5 Information retrieval in VSM 3/3
A query has the same shape as a document (an m-dimensional vector).
The measure of similarity between a query q and a document a_j is the cosine of the angle between the two vectors:
cos(q, a_j) = q^T a_j / (||q|| ||a_j||)
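A small sketch of this scoring step (the function name is mine, not from the talk):

```python
import numpy as np

def cosine_scores(q, A):
    """Cosine of the angle between query q (an m-vector) and each column of A."""
    norms = np.linalg.norm(A, axis=0) * np.linalg.norm(q)
    return (q @ A) / np.where(norms == 0, 1.0, norms)

# Retrieval: rank documents by descending cosine similarity to the query.
# scores = cosine_scores(q, A); ranking = np.argsort(-scores)
```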

6 Retrieval performance evaluation
Measures for evaluation:
–Recall
–Precision
–Average precision
Recall at rank i: R_i = r_i / r_n
Precision at rank i: P_i = r_i / i
where r_i is the number of relevant documents among the i highest-ranked documents and r_n is the total number of relevant documents in the collection.
Average precision = the average of the precision values at the distinct levels of recall.
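A sketch of these measures in Python; the non-interpolated form of average precision below (precision at each rank where a relevant document appears, averaged over r_n) is my assumption of the variant meant here:

```python
def precision_recall_at(ranking, relevant, i):
    """Precision and recall at cutoff i, per the definitions on this slide."""
    r_i = sum(1 for d in ranking[:i] if d in relevant)  # relevant among top i
    return r_i / i, r_i / len(relevant)

def average_precision(ranking, relevant):
    """Average of the precision values at the ranks where recall increases."""
    hits = [precision_recall_at(ranking, relevant, i + 1)[0]
            for i, d in enumerate(ranking) if d in relevant]
    return sum(hits) / len(relevant)
```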

7 Techniques for conceptual indexing
In the term-matching method, the similarity between a query and a document is assessed purely lexically.
Polysemy (one word having multiple meanings) and synonymy (multiple words having the same meaning) are two fundamental obstacles to efficient information retrieval.
Here we compare two techniques for conceptual indexing based on projecting document vectors (in the least-squares sense) onto a lower-dimensional space:
–Latent semantic indexing (LSI)
–Concept indexing (CI)

8 Latent semantic indexing
Introduced in 1990; improved in 1995.
S. Deerwester, S. Dumais, G. Furnas, T. Landauer, R. Harshman: Indexing by latent semantic analysis, J. American Society for Information Science, 41, 1990, pp. 391-407
M. W. Berry, S. T. Dumais, G. W. O'Brien: Using linear algebra for intelligent information retrieval, SIAM Review, 37, 1995, pp. 573-595
Based on the spectral analysis of the term-document matrix.

9 Latent semantic indexing
For every m×n matrix A there is a singular value decomposition (SVD):
A = U Σ V^T
–U is an orthogonal m×m matrix whose columns are the left singular vectors of A
–Σ is a diagonal matrix whose diagonal holds the singular values of A in descending order
–V is an orthogonal n×n matrix whose columns are the right singular vectors of A

10 Latent semantic indexing
For LSI the truncated SVD is used:
A_k = U_k Σ_k V_k^T
–U_k is an m×k matrix whose columns are the first k left singular vectors of A
–Σ_k is a k×k diagonal matrix whose diagonal is formed by the k leading singular values of A
–V_k is an n×k matrix whose columns are the first k right singular vectors of A
Rows of U_k = terms
Rows of V_k = documents
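A NumPy sketch of the truncated SVD; the query fold-in in the comment is the usual one from the LSI literature, not spelled out on this slide:

```python
import numpy as np

def truncated_svd(A, k):
    """Rank-k truncated SVD: A ~ U_k @ diag(s_k) @ Vt_k."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]

# Rows of Vt_k.T are the documents in LSI space; a query q is projected
# (folded in) as q_k = (1 / s_k) * (U_k.T @ q) before cosine matching.
```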

11 (Truncated) SVD

12 Latent semantic indexing
Using the truncated SVD we keep only the first k independent linear components of A (singular vectors and values).
Documents are projected, in the least-squares sense, onto the space spanned by the first k singular vectors of A (the LSI space).
The first k components capture the major associational structure in the term-document matrix and throw out the noise.
Minor differences in the terminology used in documents are ignored.
Closeness of objects (queries and documents) is determined by the overall pattern of term usage, so it is context based.
Documents which contain synonyms are closer in LSI space than in the original space; documents which use a polysemous word in different contexts are farther apart in LSI space than in the original space.

13 Concept indexing (CI)
Indexing using the concept decomposition (CD) instead of the SVD used in LSI.
Concept decomposition was introduced in 2001:
I. S. Dhillon, D. S. Modha: Concept decompositions for large sparse text data using clustering, Machine Learning, 42:1, 2001, pp. 143-175

14 Concept decomposition
First step: cluster the documents (columns of the term-document matrix A) into k groups.
Clustering algorithms:
–Spherical k-means algorithm
–Fuzzy k-means algorithm
Spherical k-means is a variant of the k-means algorithm which exploits the fact that the document vectors have unit norm (a sketch follows below).
Centroids of the groups = concept vectors
The concept matrix C_k = [c_1 … c_k] is the matrix whose columns are the group centroids, where c_j is the centroid of the j-th group.
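A minimal NumPy sketch of spherical k-means, under the assumptions that the columns of A are normalized to unit norm and that randomly chosen columns serve as initial concept vectors:

```python
import numpy as np

def spherical_kmeans(A, k, iters=20, seed=0):
    """Cluster unit-norm columns of A into k groups; return concept matrix C."""
    rng = np.random.default_rng(seed)
    X = A / np.linalg.norm(A, axis=0)                    # unit-norm documents
    C = X[:, rng.choice(X.shape[1], size=k, replace=False)].copy()
    for _ in range(iters):
        labels = np.argmax(C.T @ X, axis=0)              # cosine = inner product
        for i in range(k):
            members = X[:, labels == i]
            if members.size:                             # skip emptied groups
                c = members.sum(axis=1)
                C[:, i] = c / np.linalg.norm(c)          # renormalized centroid
    return C, labels
```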

15 Concept decomposition
Second step: calculate the concept decomposition.
The concept decomposition D_k of the term-document matrix A is the least-squares approximation of A on the space of the concept vectors:
D_k = C_k Z, where Z is the solution of the least-squares problem
Z = (C_k^T C_k)^{-1} C_k^T A
Rows of C_k = terms
Columns of Z = documents
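A sketch of this step, solving the least-squares problem with a library routine (numerically safer than forming the normal equations explicitly):

```python
import numpy as np

def concept_decomposition(A, C):
    """D_k = C @ Z, with Z minimizing ||A - C Z||_F (column-wise least squares)."""
    Z, *_ = np.linalg.lstsq(C, A, rcond=None)
    return C @ Z, Z
```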

16 Comparison: Academic example
Collection of 15 documents (titles of books):
–9 from the field of data mining
–5 from the field of linear algebra
–1 combining these fields (application of linear algebra to data mining)
The list of terms was formed by (a sketch of these steps follows below):
1)Keeping words contained in at least two documents
2)Removing words on a stop list
3)Stemming
To the term-document matrix we apply:
–Truncated SVD (k=2)
–Concept decomposition (k=2)
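A small Python sketch of the term-list construction; the stop list and the crude suffix-stripping stemmer here are illustrative stand-ins, not the resources used for the actual collection:

```python
from collections import Counter

STOP = {"the", "and", "of", "a", "an", "for", "its", "by", "to", "in"}

def stem(word):
    # Crude illustrative stemmer: strip a couple of common suffixes.
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def build_term_list(docs):
    tokenized = [[stem(w) for w in d.lower().split() if w not in STOP]
                 for d in docs]
    doc_freq = Counter(t for doc in tokenized for t in set(doc))
    return sorted(t for t, n in doc_freq.items() if n >= 2)  # in >= 2 documents
```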

17 Documents 1/2
D1: Survey of text mining: clustering, classification, and retrieval
D2: Automatic text processing: the transformation analysis and retrieval of information by computer
D3: Elementary linear algebra: A matrix approach
D4: Matrix algebra & its applications statistics and econometrics
D5: Effective databases for text & document management
D6: Matrices, vector spaces, and information retrieval
D7: Matrix analysis and applied linear algebra
D8: Topological vector spaces and algebras

18 Documents 2/2
D9: Information retrieval: data structures & algorithms
D10: Vector spaces and algebras for chemistry and physics
D11: Classification, clustering and data analysis
D12: Clustering of large data sets
D13: Clustering algorithms
D14: Document warehousing and text mining: techniques for improving business operations, marketing and sales
D15: Data mining and knowledge discovery

19 Terms
Data mining terms: text, mining, clustering, classification, retrieval, information, document, data
Linear algebra terms: linear, algebra, matrix, vector, space
Neutral terms: analysis, application, algorithm

20 Projection of terms by SVD

21 Projection of terms by CD

22 Queries
Q1: Data mining
–Relevant documents: all data mining documents
Q2: Using linear algebra for data mining
–Relevant document: D6

23 Projection of documents by SVD

24 Projection of documents by CD

25 Results of information retrieval (Q1)

26 Results of information retrieval (Q2)

27 Collections
MEDLINE
–1033 documents
–30 queries
–Relevance judgements
CRANFIELD
–1400 documents
–225 queries
–Relevance judgements

28 Test A
Comparison of the errors of approximating the term-document matrix by
1)rank-k SVD
2)rank-k CD
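The slide does not name the norm; assuming the Frobenius norm (the natural one for both approximations), the comparison might be computed as:

```python
import numpy as np

def relative_error(A, A_approx):
    """Relative Frobenius-norm error of an approximation of A."""
    return np.linalg.norm(A - A_approx) / np.linalg.norm(A)

# err_svd = relative_error(A, U_k @ np.diag(s_k) @ Vt_k)   # rank-k SVD
# err_cd  = relative_error(A, D_k)                         # rank-k CD
```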

29 MEDLINE - errors of approximation

30 CRANFIELD - errors of approximation

31 Test B
Average inner product between the concept vectors c_j, j=1,2,…,k
Comparison of the average inner product for:
–Concept vectors obtained by the spherical k-means algorithm
–Concept vectors obtained by the fuzzy k-means algorithm
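A sketch of this statistic; averaging over the k(k−1) off-diagonal pairs (i.e., excluding each vector's product with itself) is my assumption:

```python
import numpy as np

def avg_inner_product(C):
    """Mean pairwise inner product between distinct concept vectors (columns of C)."""
    G = C.T @ C                                   # Gram matrix of concept vectors
    k = G.shape[0]
    return (G.sum() - np.trace(G)) / (k * (k - 1))
```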

32 MEDLINE – average inner product

33 CRANFIELD – average inner product

34 Test C
Comparison of the mean average precision of information retrieval and of precision-recall plots
Mean average precision for the term-matching method:
–MEDLINE: 43.54
–CRANFIELD: 20.89

35 MEDLINE – mean average precision

36 CRANFIELD – mean average precision

37 MEDLINE – precision-recall plot

38 CRANFIELD – precision-recall plot

39 Test D
Correlation between mean average precision (MAP) and clustering quality
Measure of cluster quality: the generalized within-groups sum-of-squared-errors function
J_fuzz = Σ_{i=1..k} Σ_{j=1..n} (λ_ij)^b ||a_j − c_i||^2
where a_j, j=1,2,…,n are the document vectors, c_i, i=1,2,…,k are the concept vectors, λ_ij is the fuzzy membership degree of document a_j in the group whose concept is c_i, and b ∈ ⟨1,∞⟩ is the weight exponent.
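A direct NumPy transcription of this objective; the default exponent b = 2.0 is an illustrative assumption, not a value from the talk:

```python
import numpy as np

def j_fuzz(A, C, Lam, b=2.0):
    """Generalized within-groups SSE as defined on this slide.
    A: m x n documents, C: m x k concepts, Lam: k x n fuzzy memberships,
    b > 1: weight exponent (default is an illustrative assumption)."""
    sq_dist = ((A[:, None, :] - C[:, :, None]) ** 2).sum(axis=0)   # k x n
    return float(((Lam ** b) * sq_dist).sum())
```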

40 MEDLINE - Correlation (clustering quality and MAP)
46 observations for rank of approximation k ∈ [1,100]
Correlation between mean average precision and J_fuzz: r = −0.968198, with significance p << 0.01
Correlation between rank of approximation and mean average precision: r = 0.70247 (p << 0.01)
Correlation between rank of approximation and J_fuzz: r = −0.831071 (p << 0.01)

41 CRANFIELD - Correlation (clustering quality and MAP)
46 observations for rank of approximation k ∈ [1,100]
Correlation between mean average precision and J_fuzz: r = −0.988293, with significance p << 0.01
Correlation between rank of approximation and mean average precision: r = 0.914489 (p << 0.01)
Correlation between rank of approximation and J_fuzz: r = −0.904415 (p << 0.01)

42 Regression line: clustering quality and MAP (MEDLINE)

43 Regression line: clustering quality and MAP (CRANFIELD)

44 Conclusion 1/3
By the SVD approximation, the term-document matrix is projected onto the first k left singular vectors, which form an orthogonal basis for the LSI space.
By the CD approximation, the term-document matrix is projected onto the k group centroids (concept vectors).
The concept vectors form the basis for the CI space; they tend towards orthogonality as k grows.
Concept vectors obtained by the fuzzy k-means algorithm tend towards orthogonality faster than those obtained by the spherical k-means algorithm.
CI using the CD by the fuzzy k-means algorithm gives higher MAP of information retrieval than LSI on both collections we have used.

45 Conclusion 2/3
CI using the CD by the spherical k-means algorithm gives lower (but comparable) MAP of information retrieval than LSI on both collections we have used.
According to the MAP results, k=75 for the MEDLINE collection and k=200 for the CRANFIELD collection are good choices for the rank of approximation.
With LSI and CI, documents are represented by smaller matrices:
–For the MEDLINE collection the term-document matrix is stored in a 5940×1033 matrix; the approximations of the documents are stored in a 75×1033 matrix
–For the CRANFIELD collection the term-document matrix is stored in a 4758×1400 matrix; the approximations of the documents are stored in a 200×1400 matrix

46 Conclusion 3/3
LSI and CI work better on the MEDLINE collection.
When evaluated for different ranks of approximation, MAP is more stable for LSI than for CI.
There is a high correlation between MAP and clustering quality.

47 Further work
1)To apply CI to the problem of classification in a supervised setting
2)To propose solutions to the problem of adding new documents to a collection for the CI method
–Adding new documents to the collection requires recomputation of the SVD or the CD, which is computationally inefficient
–Two approximation methods have been developed for adding new documents to a collection for the LSI method
