On Scaling Latent Semantic Indexing for Large Peer-to-Peer Systems. Chunqiang Tang, Sandhya Dwarkadas, Zhichen Xu. University of Rochester; Yahoo! Inc. ACM SIGIR 2004, Session: Dimensionality Reduction.

Abstract (1/2)
Promising direction: combine IR with peer-to-peer technology for scalability, fault tolerance, and low administration cost.
pSearch: places docs onto a p2p overlay network according to semantic vectors produced using Latent Semantic Indexing (LSI).
Limitations (inherited from LSI): when the corpus is large, retrieval quality is poor, and the Singular Value Decomposition (SVD) used by LSI is unscalable in terms of both memory and time.

Abstract (2/2)
Contributions:
To reduce the cost of SVD, we reduce the size of its input matrix through doc clustering and term selection.
Proper normalization of the semantic vectors for terms and docs improves recall by 76%.
To further improve retrieval quality, we use low-dimensional subvectors of semantic vectors to cluster documents in the overlay, and then use Okapi to guide the search and doc selection.

Introduction (1/3)
Information grows exponentially, exceeding 10^18 bytes each year.
P2P systems: their scalability, fault tolerance, and self-organizing nature raise hope for building large-scale IR systems.
pSearch: populates docs in the network according to doc semantics derived from LSI; the search cost for a query is reduced to the cost of routing hops in the overlay.

Introduction (2/3)
Limitations of pSearch: when the corpus is large, retrieval quality is poor; the SVD that LSI uses to derive semantic vectors of docs is not scalable in terms of memory consumption and computation time.
Proposed techniques to address these limitations:
eLSI (efficient LSI): doc clustering and term selection.
Proper normalization of the semantic vectors for terms and docs improves recall by 76%.
LSI+Okapi: use low-dimensional subvectors of semantic vectors to implicitly cluster docs, and then use Okapi to guide the search process and doc selection.

Introduction (3/3)
Contributions: deriving a low-dimensional representation for high-dimensional data is a common theme in many fields, e.g., Principal Component Analysis (PCA) and LSI. The proper configuration we found for LSI should be of general interest to the LSI community.
Since nearest-neighbor search in a high-dimensional space is prohibitive, we propose pSearch.

pSearch System Overview (1/4)
An example of how the system works: pSearch uses a CAN to organize engine nodes, and answers queries with an extension of LSI called pLSI.
Vector Space Model (VSM): ltc term weighting.
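To make the ltc weighting concrete, here is a minimal NumPy sketch of SMART-style ltc weighting (logarithmic term frequency, idf, cosine normalization of each document column). The matrix layout and function name are illustrative assumptions, not code from the paper.

```python
import numpy as np

def ltc_weight(tf):
    """tf: term-by-document raw frequency matrix (t x d); returns the ltc-weighted matrix."""
    t, d = tf.shape
    # l: logarithmic term frequency, 1 + log(tf) where tf > 0
    l = np.zeros_like(tf, dtype=float)
    nz = tf > 0
    l[nz] = 1.0 + np.log(tf[nz])
    # t: inverse document frequency, log(N / df)
    df = np.count_nonzero(tf, axis=1)
    idf = np.log(d / np.maximum(df, 1))
    w = l * idf[:, None]
    # c: cosine-normalize each document column to unit length
    norms = np.linalg.norm(w, axis=0)
    return w / np.maximum(norms, 1e-12)
```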

pSearch System Overview (2/4)
Latent Semantic Indexing: A is the term-doc matrix with rank r; LSI approximates A with a rank-k matrix by omitting all but the k largest singular values.
Content-Addressable Network (CAN): a CAN partitions a d-dimensional Cartesian space into zones and assigns each zone to a node.
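The rank-k approximation can be sketched with a dense SVD for clarity (the paper uses SVDPACK on a large sparse matrix); the variable names below are assumptions:

```python
import numpy as np

def lsi(A, k):
    """A: t x d term-doc matrix. Keep only the k largest singular values."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    Uk, sk, Vtk = U[:, :k], s[:k], Vt[:k, :]
    A_k = Uk @ np.diag(sk) @ Vtk            # rank-k approximation of A
    doc_vectors = (np.diag(sk) @ Vtk).T     # one k-dim semantic vector per document
    return Uk, sk, Vtk, A_k, doc_vectors
```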

pSearch System Overview (3/4)
The pLSI algorithm: pLSI combines LSI and CAN to build pSearch.
Upon reaching the destination, the query is flooded to nodes within a small radius r.
Content-directed search algorithm: each node samples content stored on its neighbors and uses the samples to decide which node to search next.
LSI uses a k = 50~350 dimensional space for small corpora.
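One way to picture the content-directed search is as a greedy best-first walk over overlay neighbors, always visiting the neighbor whose sampled document vectors best match the query. The data structures below (a neighbor map and per-node samples of unit-length vectors) are assumptions for illustration only, not the paper's protocol.

```python
import heapq
import numpy as np

def content_directed_search(start, neighbors, samples, query, budget):
    """neighbors: node -> list of adjacent nodes; samples: node -> (m x k) sampled doc vectors;
    query: unit-length k-dim semantic vector; budget: max number of nodes to search."""
    visited, frontier, searched = set(), [(0.0, start)], []
    while frontier and len(searched) < budget:
        _, node = heapq.heappop(frontier)
        if node in visited:
            continue
        visited.add(node)
        searched.append(node)                                 # search this node's local index
        for nb in neighbors[node]:
            if nb not in visited:
                score = float(np.max(samples[nb] @ query))    # best cosine among sampled docs
                heapq.heappush(frontier, (-score, nb))        # max-heap via negated score
    return searched
```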

pSearch System Overview (4/4)
Dimension mismatch between CAN and LSI: the real dimension of a CAN can't be higher than l = O(log(n)).
Partition a k-dimensional semantic vector into multiple l-dimensional subvectors. Given a doc, we store its index at p places in the CAN, using its first p subvectors as DHT keys (p = 4). If two subvectors are similar, the full vectors are likely to be similar as well.
Accuracy = |A ∩ B| / |A|, where A is the set of 15 docs retrieved for each TREC 7&8 query based on 300-dimensional semantic vectors.
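The subvector partitioning and the accuracy measure quoted above are simple to state in code; a small sketch, assuming the semantic vector has at least p·l dimensions:

```python
import numpy as np

def subvectors(v, l, p):
    """Return the first p subvectors of length l from semantic vector v."""
    return [v[i * l:(i + 1) * l] for i in range(p)]

def accuracy(full_vector_results, subvector_results):
    """|A ∩ B| / |A|: A = docs retrieved with the full vectors, B = docs found via subvectors."""
    A, B = set(full_vector_results), set(subvector_results)
    return len(A & B) / len(A) if A else 0.0
```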

Improving Retrieval Quality (1/5)
Proper LSI configuration: term normalization, doc normalization, and the choice of the matrix used to project vectors.
Experiment: SVDPACK. Corpus: disks 4 and 5 from TREC, 528,543 docs, 2GB. Queries: the title field of TREC topics. ltc is used to generate the term-doc matrix for SVD.
Due to memory limitations, only 15% of the TREC corpus is selected to construct an 83,098-term by 79,316-doc matrix for SVD, which projects vectors into a 300-dimensional space (memory: 1.7GB; time: 57 minutes on a 2GHz Pentium 4).
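The term and document normalization studied here amounts to scaling vectors to unit length before cosine comparisons; a minimal sketch of that operation (where exactly it is applied in the pipeline is part of the paper's "proper configuration" and is not reproduced here):

```python
import numpy as np

def unit_rows(M, eps=1e-12):
    """Scale each row of M (one term or document semantic vector per row) to unit length."""
    return M / np.maximum(np.linalg.norm(M, axis=1, keepdims=True), eps)
```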

Improving Retrieval Quality (2/5)
Improvement: retrieve 1,000 docs for each query and report the average number of relevant docs.
With both term and doc normalization, 76% more relevant docs are returned. Normalizing terms improves performance by giving proper emphasis to terms; the benefit of normalizing docs corroborates the belief that cosine is a robust similarity measure.

Improving Retrieval Quality (3/5)
TREC vs. Medlars corpus: Medlars has 1,033 docs and 30 queries; its docs and queries are projected into a 50-dimensional space. 50 dimensions are sufficient for the small corpus, while 300 dimensions are insufficient for the large corpus.
Normalization is beneficial when the dimensionality of the semantic space is insufficient to capture the fine structure of the corpus.

Improving Retrieval Quality (4/5)
LSI is bad for a large corpus: LSI does not exploit doc length in ranking, and a 300-dimensional semantic space is insufficient for TREC. LSI's performance can be improved by increasing dimensionality.
LSI+Okapi: use a 4-plane pLSI (each plane 25 dimensions); each plane retrieves 1,000 docs, and Okapi ranks the returned 4,000 docs.
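The LSI+Okapi combination can be sketched as: take the union of the candidate lists returned by the p planes, then rank that union with an Okapi BM25 score. The BM25 formula and parameter values below are the standard ones, used as an assumption for what "Okapi" means here; the document statistics passed in are illustrative.

```python
import math

def bm25(query_terms, doc_tf, doc_len, avg_len, df, N, k1=1.2, b=0.75):
    """Standard Okapi BM25 score of one document for a bag of query terms."""
    score = 0.0
    for term in query_terms:
        tf = doc_tf.get(term, 0)
        if tf == 0 or df.get(term, 0) == 0:
            continue
        idf = math.log((N - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_len))
    return score

def lsi_plus_okapi(plane_results, query_terms, docs, df, N, avg_len, top_n=1000):
    """plane_results: one candidate doc-id list per plane (e.g. 4 lists of 1,000 ids each)."""
    candidates = set().union(*map(set, plane_results))
    scored = [(bm25(query_terms, docs[d]["tf"], docs[d]["len"], avg_len, df, N), d)
              for d in candidates]
    return [d for _, d in sorted(scored, reverse=True)[:top_n]]
```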

Improving Retrieval Quality (5/5)
Figures: precision-recall for TREC, precision-recall for Medlars, and high-end precision for TREC (precision when retrieving i docs for a query).
Performance of LSI+Okapi: its high-end precision approaches that of Okapi, but its low-end precision still lags behind. Low-end precision could be improved by allowing each plane to return more candidate docs for Okapi to rank, but this would increase the search cost.

Improving the Efficiency of LSI (1/4)
Traditionally, LSI uses the term-doc matrix as the input for SVD. For a matrix A ∈ R^{t×d} with about c nonzero elements per column, the time complexity of SVD is O(t·d·c).
The eLSI algorithm: use spherical k-means to cluster docs, giving a centroid matrix C = [c_1 c_2 … c_s] ∈ R^{t×s}. Compute the aggregate weight of each term i, and select the subset of e rows of C with the largest aggregate weight (the top e terms) to construct a row-reduced matrix.
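A minimal sketch of the eLSI idea: cluster the documents with spherical k-means, keep only the e rows of the centroid matrix with the largest aggregate weight, and run SVD on the resulting small matrix. The aggregate-weight definition used below (sum of a term's weights across centroids) is an assumption, since the slide does not reproduce the formula.

```python
import numpy as np

def spherical_kmeans(X, s, iters=20, seed=0):
    """X: t x d matrix with unit-length document columns; returns a t x s centroid matrix."""
    rng = np.random.default_rng(seed)
    C = X[:, rng.choice(X.shape[1], size=s, replace=False)].astype(float)
    for _ in range(iters):
        assign = np.argmax(C.T @ X, axis=0)          # nearest centroid by cosine similarity
        for j in range(s):
            members = X[:, assign == j]
            if members.shape[1]:
                c = members.sum(axis=1)
                C[:, j] = c / max(np.linalg.norm(c), 1e-12)
    return C

def elsi(X, s, e, k):
    """Cluster docs, select the top-e terms, and run SVD on the e x s row-reduced matrix."""
    C = spherical_kmeans(X, s)
    weight = C.sum(axis=1)                           # assumed aggregate weight of each term
    top_terms = np.argsort(weight)[::-1][:e]
    U, sv, Vt = np.linalg.svd(C[top_terms, :], full_matrices=False)
    return top_terms, U[:, :k], sv[:k], Vt[:k, :]
```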

Improving the Efficiency of LSI (2/4)
For the TREC corpus, the complete term-doc matrix has 408,653 rows and 528,155 columns; the row-reduced matrix has fewer than 2,000 rows and 2,000 columns.
Projection: terms are projected into the semantic space using Vk; a doc (or query) vector q is projected into the semantic space and normalized to unit length.
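Folding a document or query into the semantic space and normalizing it to unit length might look like the sketch below. Projecting with the term-side singular vectors is the standard LSI fold-in; which projection matrix the paper actually uses is part of its configuration choices, so treat this as an assumption.

```python
import numpy as np

def fold_in(q, Uk, selected_terms=None, eps=1e-12):
    """q: ltc-weighted term vector; Uk: term-by-k matrix of singular vectors (rows = terms)."""
    if selected_terms is not None:
        q = q[selected_terms]            # restrict to the e terms kept by eLSI
    v = Uk.T @ q                         # k-dimensional semantic vector
    return v / max(np.linalg.norm(v), eps)
```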

Improving the Efficiency of LSI (3/4)
Other dimensionality reduction methods:
Random Projection (RP).
The first step of all the other algorithms partitions the docs into k clusters, giving G = [g_1 g_2 … g_k] ∈ R^{t×k}.
Concept Indexing (CI).
The third algorithm solves a least-squares problem using QR decomposition.
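The two comparison methods named here can be sketched briefly: Random Projection multiplies by a random matrix, and Concept Indexing represents each document by its least-squares coordinates in a basis of cluster centroids (solvable with QR or, as below, a least-squares routine). Shapes and names are assumptions.

```python
import numpy as np

def random_projection(X, k, seed=0):
    """X: t x d term-doc matrix; returns a k x d randomly projected representation."""
    rng = np.random.default_rng(seed)
    F = rng.standard_normal((k, X.shape[0])) / np.sqrt(k)    # random projection matrix
    return F @ X

def concept_indexing(X, G):
    """G: t x k matrix of cluster centroids; returns the k x d least-squares coordinates."""
    coords, *_ = np.linalg.lstsq(G, X, rcond=None)           # minimizes ||G @ coords - X||
    return coords
```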

Improving the Efficiency of LSI (4/4)
RP-eLSI: F is a random matrix.
Comparing dimensionality reduction methods: RP performs well when the dimension of the reduced space is sufficient to capture the real dimensionality of the data.