
1 On Scaling Latent Semantic Indexing for Large Peer-to-Peer Systems Chunqiang Tang, Sandhya Dwarkadas, Zhichen Xu University of Rochester; Yahoo! Inc. ACM SIGIR 2004 Session: Dimensionality reduction

2 Abstract (1/2) Promising direction: combine IR with peer-to-peer technology for scalability, fault tolerance, and low administration cost. pSearch: places docs onto a p2p overlay network according to semantic vectors produced using Latent Semantic Indexing (LSI). Limitation (inherited from LSI): when the corpus is large, retrieval quality is poor, and the Singular Value Decomposition (SVD) in LSI is unscalable in terms of both memory and time.

3 Abstract (2/2) Contributions: To reduce the cost of SVD, we reduce the size of its input matrix through doc clustering and term selection. Proper normalization of semantic vectors for terms and docs improves recall by 76%. To further improve retrieval quality, we use low-dimensional subvectors of semantic vectors to cluster documents in the overlay and then use Okapi to guide the search and doc selection.

4 Introduction (1/3) Information grows exponentially, exceeding 10^18 bytes each year. P2P systems: scalability, fault tolerance, and a self-organizing nature raise hope for building large-scale IR systems. pSearch: populates docs in the network according to doc semantics derived from LSI; the search cost for a query is reduced to routing hops in the overlay.

5 Introduction (2/3) The limitations of pSearch: when the corpus is large, retrieval quality is poor, and the SVD that LSI uses to derive semantic vectors of docs is not scalable in terms of memory consumption and computation time. Techniques proposed to address these limitations: eLSI (efficient LSI), i.e., doc clustering and term selection; proper normalization of semantic vectors for terms and docs, which improves recall by 76%; LSI+Okapi, which uses low-dimensional subvectors of semantic vectors to implicitly cluster docs and then uses Okapi to guide the search process and doc selection.

6 Introduction (3/3) Contributions: deriving a low-dimensional representation for high-dimensional data is a common theme in many fields, e.g., Principal Component Analysis (PCA) and LSI. The proper configuration we found for LSI should be of general interest to the LSI community. Since nearest-neighbor search in a high-dimensional space is prohibitively expensive, we propose pSearch.

7 pSearch System Overview (1/4) An example of how the system works. pSearch uses a CAN to organize engine nodes and an extension of LSI, called pLSI, to answer queries. Vector Space Model (VSM) with ltc term weighting (a sketch follows).
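
As an illustration, here is a minimal sketch of SMART-style "ltc" weighting (logarithmic term frequency, idf, cosine normalization), assuming the standard SMART definitions; the paper's exact variant may differ:

    import numpy as np

    def ltc_weights(tf):
        """Build an ltc-weighted term-doc matrix from raw counts (terms x docs)."""
        t, d = tf.shape
        ltf = np.zeros_like(tf, dtype=float)
        nz = tf > 0
        ltf[nz] = 1.0 + np.log(tf[nz])        # l: logarithmic term frequency
        df = nz.sum(axis=1)                   # document frequency of each term
        idf = np.log(d / np.maximum(df, 1))   # t: idf = log(N / df)
        w = ltf * idf[:, None]
        norms = np.linalg.norm(w, axis=0)     # c: cosine-normalize each doc column
        return w / np.maximum(norms, 1e-12)

    # toy example: 4 terms x 3 docs
    tf = np.array([[2, 0, 1], [0, 3, 0], [1, 1, 1], [0, 0, 2]])
    print(np.linalg.norm(ltc_weights(tf), axis=0))  # each doc column has unit length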

8 pSearch System Overview (2/4) Latent Semantic Indexing: A is the term-doc matrix with rank r; LSI approximates A with a rank-k matrix by omitting all but the k largest singular values. Content-Addressable Network (CAN): a CAN partitions a d-dimensional Cartesian space into zones and assigns each zone to a node.
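
A minimal sketch of the rank-k approximation, using scipy's truncated sparse SVD; the matrix sizes are toy values, not the paper's:

    import numpy as np
    from scipy.sparse import random as sparse_random
    from scipy.sparse.linalg import svds

    t, d, k = 1000, 500, 50
    A = sparse_random(t, d, density=0.01, format="csc", random_state=0)

    Uk, Sk, VkT = svds(A, k=k)          # the k largest singular triplets
    A_k = Uk @ np.diag(Sk) @ VkT        # rank-k approximation of A

    q = np.random.rand(t)               # a query in term space
    q_hat = (Uk.T @ q) / Sk             # fold the query into the k-dim semantic space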

9 pSearch System Overview (3/4) The pLSI Algorithm: pLSI combines LSI and CAN to build pSearch. Upon reaching the destination, the query is flooded to nodes within a small radius r. Content-directed search algorithm: each node samples content stored on its neighbors and uses the samples to decide which node to search next (a sketch follows). LSI typically uses a k = 50 to 350 dimensional space for small corpora.
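
A hedged sketch of the content-directed search idea, assuming each node summarizes its neighbors' content as a centroid vector; all names and the greedy expansion policy here are illustrative, not the paper's protocol:

    import numpy as np

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

    def content_directed_search(query, neighbors, centroids, docs, budget=3):
        """Visit at most `budget` nodes, always expanding the frontier node
        whose sampled centroid is most similar to the query."""
        visited, results, frontier = set(), [], [0]   # node 0: where CAN routing landed
        while frontier and len(visited) < budget:
            node = max(frontier, key=lambda n: cosine(query, centroids[n]))
            frontier.remove(node)
            visited.add(node)
            results.extend(docs[node])                # search this node's local index
            frontier.extend(n for n in neighbors[node]
                            if n not in visited and n not in frontier)
        return results

    # toy overlay: 4 nodes in a ring
    neighbors = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}
    centroids = {i: np.random.rand(8) for i in range(4)}
    docs = {i: [f"doc{i}a", f"doc{i}b"] for i in range(4)}
    print(content_directed_search(np.random.rand(8), neighbors, centroids, docs))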

10 pSearch System Overview (4/4) Dimension mismatch between CAN and LSI: the real dimension of a CAN can't be higher than l = O(log(n)). pLSI partitions a k-dimensional semantic vector into multiple l-dimensional subvectors. Given a doc, we store its index at p places in the CAN, using its first p subvectors as DHT keys (p = 4); if two subvectors are similar, their full vectors are likely also similar (a sketch follows). Accuracy = |A ∩ B| / |A|, where A is the set of 15 docs retrieved for each TREC 7&8 query based on the 300-dimensional semantic vectors.
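
A minimal sketch of the subvector partitioning and the accuracy measure; the slide defines only A, so B is assumed here to be the docs retrieved via a low-dimensional subvector:

    import numpy as np

    def subvectors(v, l):
        """Partition vector v into consecutive l-dimensional subvectors."""
        return [v[i:i + l] for i in range(0, len(v), l)]

    k, l, p = 300, 25, 4
    v = np.random.rand(k)                 # a doc's semantic vector
    keys = subvectors(v, l)[:p]           # the first p subvectors serve as DHT keys

    def accuracy(A, B):
        """|A ∩ B| / |A| for two sets of retrieved doc ids."""
        return len(set(A) & set(B)) / len(A)

    print(accuracy(["d1", "d2", "d3"], ["d2", "d3", "d9"]))  # 0.666...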

11 Improving Retrieval Quality (1/5) Proper LSI configuration: term normalization, doc normalization, and the choice of projection matrix used to map vectors into the semantic space. Experiment: SVDPACK; corpus: disks 4 and 5 from TREC, 528,543 docs, 2 GB; queries: the title field of topics 351-450; ltc is used to generate the term-doc matrix for SVD. Due to memory limitations, only 15% of the TREC corpus is selected to construct an 83,098-term by 79,316-doc matrix for SVD, which projects vectors into a 300-dimensional space (memory: 1.7 GB, time: 57 minutes on a 2 GHz Pentium 4).

12 Improving Retrieval Quality (2/5) Improvement: retrieve 1,000 docs for each query and report the average number of relevant docs. With both normalizations, 76% more relevant docs are returned. Normalizing term vectors improves performance by properly emphasizing each term; normalizing docs corroborates the belief that cosine is a robust similarity measure (a sketch follows).
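
A minimal sketch of the doc-side normalization, assuming it amounts to scaling each semantic vector to unit length so that the dot product equals cosine similarity:

    import numpy as np

    def normalize_columns(M):
        """Scale every column (one semantic vector) to unit length."""
        return M / np.maximum(np.linalg.norm(M, axis=0), 1e-12)

    k, n_docs = 300, 5
    doc_vecs = normalize_columns(np.random.rand(k, n_docs))
    q = np.random.rand(k)
    q /= np.linalg.norm(q)
    scores = q @ doc_vecs               # cosine similarity, since all vectors are unit length
    print(scores.argsort()[::-1])       # doc indices ranked by similarity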

13 Improving Retrieval Quality (3/5) TREC vs. Medlars Corpus. Medlars: 1,033 docs and 30 queries; docs and queries are projected into a 50-dimensional space. 50 dimensions are sufficient for the small corpus; 300 dimensions are insufficient for the large corpus. Normalization is beneficial if the dimensionality of the semantic space is insufficient to capture the fine structure of the corpus.

14 Improving Retrieval Quality (4/5) LSI is bad for large corpora: LSI does not exploit doc length in ranking, and a 300-dimensional semantic space is insufficient for TREC. LSI's performance can be improved by increasing dimensionality. LSI+Okapi: use a 4-plane pLSI (each plane 25 dimensions); each plane retrieves 1,000 docs, and Okapi ranks the returned 4,000 docs (a sketch follows).
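
A hedged sketch of the LSI+Okapi combination: each plane contributes a candidate list, and Okapi BM25 re-ranks the merged candidates. The BM25 constants (k1 = 1.2, b = 0.75) are conventional defaults, not values from the paper:

    import math
    from collections import Counter

    def bm25_score(query_terms, doc_terms, df, N, avgdl, k1=1.2, b=0.75):
        tf, dl, score = Counter(doc_terms), len(doc_terms), 0.0
        for term in query_terms:
            if tf[term] == 0 or term not in df:
                continue
            idf = math.log((N - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * (1 - b + b * dl / avgdl))
        return score

    def lsi_plus_okapi(query_terms, plane_results, corpus, df, N, avgdl):
        """Merge the candidate doc ids from all planes, then rank them by BM25."""
        candidates = set().union(*plane_results)
        return sorted(candidates, reverse=True,
                      key=lambda d: bm25_score(query_terms, corpus[d], df, N, avgdl))

    # toy corpus: 2 docs, one candidate list per plane
    corpus = {1: ["latent", "semantic", "indexing"], 2: ["peer", "to", "peer", "search"]}
    df = {"latent": 1, "semantic": 1, "indexing": 1, "peer": 1, "to": 1, "search": 1}
    avgdl = sum(len(doc) for doc in corpus.values()) / len(corpus)
    print(lsi_plus_okapi(["semantic", "search"], [[1], [2]], corpus, df, 2, avgdl))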

15 Improving Retrieval Quality (5/5) Figures: precision-recall for TREC, precision-recall for Medlars, and high-end precision for TREC. P@i: precision when retrieving i docs for a query. The performance of LSI+Okapi: high-end precision approaches that of Okapi, but low-end precision still lags behind. Low-end precision can be improved by allowing each plane to return more candidate docs for Okapi to rank, but this would increase the search cost.
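
For reference, a tiny sketch of the P@i measure:

    def precision_at(i, retrieved, relevant):
        """P@i: fraction of the top-i retrieved docs that are relevant."""
        return sum(1 for d in retrieved[:i] if d in relevant) / i

    print(precision_at(5, ["d1", "d2", "d3", "d4", "d5"], {"d1", "d4"}))  # 0.4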

16 Improving the Efficiency of LSI (1/4) Traditionally, LSI uses the term-doc matrix as the input to SVD; for a matrix A ∈ R^(t×d) with about c nonzero elements per column, the time complexity of SVD is O(t·d·c). The eLSI algorithm: use spherical k-means to cluster docs, giving a centroid matrix C = [c_1 c_2 ... c_s] ∈ R^(t×s); compute the aggregate weight of each term i across the cluster centroids; select the subset of e rows of C corresponding to the top e terms with the largest aggregate weight to construct a row-reduced matrix (a sketch follows).
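
A minimal sketch of the eLSI preprocessing, assuming spherical k-means over unit-length doc vectors and the aggregate weight of a term computed as its total weight across the centroid matrix C; the paper's exact definitions may differ:

    import numpy as np

    def spherical_kmeans(X, s, iters=20, seed=0):
        """X: terms x docs with unit-length columns; returns centroids C (t x s)."""
        rng = np.random.default_rng(seed)
        C = X[:, rng.choice(X.shape[1], s, replace=False)]
        for _ in range(iters):
            assign = (C.T @ X).argmax(axis=0)        # nearest centroid by cosine
            for j in range(s):
                members = X[:, assign == j]
                if members.size:
                    c = members.sum(axis=1)
                    C[:, j] = c / max(np.linalg.norm(c), 1e-12)
        return C

    def reduce_matrix(C, e):
        """Keep the e rows (terms) of C with the largest aggregate weight."""
        weight = np.abs(C).sum(axis=1)               # aggregate weight of term i
        top = np.argsort(weight)[::-1][:e]
        return C[top], top

    X = np.random.rand(50, 200)
    X /= np.linalg.norm(X, axis=0)                   # unit-length doc vectors
    R, kept_terms = reduce_matrix(spherical_kmeans(X, s=10), e=20)
    print(R.shape)                                   # (20, 10): input to a much cheaper SVD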

17 Improving the Efficiency of LSI (2/4) For the TREC corpus: the complete term-doc matrix has 408,653 rows and 528,155 columns, while the row-reduced matrix has fewer than 2,000 rows and 2,000 columns. Projection: project terms into the semantic space using Vk; project a doc (or query) vector q into the semantic space and normalize it to unit length (a sketch follows).
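
A sketch of the projection step under the assumption that eLSI uses standard LSI folding-in on the row-reduced matrix R (SVD R = Uk Sk Vk^T, with doc/query vectors restricted to the e selected terms); which SVD factor the paper uses for terms vs. docs is not spelled out here, so treat this as illustrative:

    import numpy as np
    from scipy.sparse import csc_matrix
    from scipy.sparse.linalg import svds

    e, s, k = 200, 100, 20
    R = csc_matrix(np.random.rand(e, s))     # row-reduced matrix from eLSI
    Uk, Sk, VkT = svds(R, k=k)

    def project(q_full, kept_terms):
        """Fold a term-space doc/query vector into the k-dim semantic space."""
        q_e = q_full[kept_terms]             # restrict to the e selected terms
        q_hat = (Uk.T @ q_e) / Sk
        return q_hat / max(np.linalg.norm(q_hat), 1e-12)   # unit length

    q = np.random.rand(5000)                 # full term-space vector
    kept_terms = np.arange(e)                # placeholder for the real selection
    print(project(q, kept_terms).shape)      # (k,)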

18 Improving the Efficiency of LSI (3/4) Other dimensionality reduction methods: Random Projection (RP, sketched below). The first step of all the other algorithms partitions docs into k clusters, G = [g_1 g_2 ... g_k] ∈ R^(t×k): Concept Indexing (CI), and a third algorithm that solves a least-squares problem via QR decomposition.
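
A minimal sketch of Random Projection, assuming a Gaussian random matrix F scaled by 1/sqrt(k) in the Johnson-Lindenstrauss style; the paper's exact choice of F may differ:

    import numpy as np

    t, d, k = 5000, 1000, 300
    rng = np.random.default_rng(0)
    A = rng.random((t, d))                        # toy term-doc matrix
    F = rng.standard_normal((k, t)) / np.sqrt(k)  # random projection matrix
    A_reduced = F @ A                             # docs now live in k dimensions
    print(A_reduced.shape)                        # (300, 1000)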

19 Improving the Efficiency of LSI (4/4) RP-eLSI: F is a random matrix. Comparing dimension reduction methods: RP performs well when the dimensionality of the reduced space is sufficient to capture the real dimensionality of the data.

