Document retrieval
Similarity
–Vector space model
–Multi dimension
Search
–Range query
–KNN query
Query processing example



Range Query
[Figure: points a to m plotted on x and y axes and grouped into bounding rectangles E1 to E7; a tree rooted at Root indexes the rectangles, with the points stored in the leaves.]

Information retrieval in structured P2P overlays
High dimension -> low dimension
–Dimension reduction!
Support range queries and KNN queries
Guarantee precision and recall

pSearch: Information Retrieval in Structured Overlays

Peer-to-Peer VSM (pVSM)
VSM: vector space model
Basic ideas
–The m most heavily weighted terms t_i, i = 1, …, m, are identified
–The corresponding (h(t_i), index) pairs are stored in the DHT
    –Index: pointer to the actual document
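The pVSM idea above can be sketched in a few lines. This is only an illustration under stated assumptions: SHA-1 stands in for the hash h, an in-memory dictionary (the hypothetical ToyDHT class) stands in for the DHT, and the term weights are made-up example values.

```python
import hashlib
from collections import defaultdict


def h(term: str) -> int:
    """Hash a term to a 160-bit key (SHA-1 is an assumption, not from the slides)."""
    return int.from_bytes(hashlib.sha1(term.encode("utf-8")).digest(), "big")


def top_terms(weights: dict, m: int) -> list:
    """Return the m most heavily weighted terms (ties broken alphabetically)."""
    return sorted(weights, key=lambda t: (-weights[t], t))[:m]


class ToyDHT:
    """Hypothetical in-memory stand-in for a DHT: key -> set of document pointers."""

    def __init__(self):
        self.store = defaultdict(set)

    def put(self, key, value):
        self.store[key].add(value)

    def get(self, key):
        return self.store.get(key, set())


def publish(dht, doc_id, weights, m):
    """Store one (h(t_i), index) pair per top-m term; doc_id plays the role of the index."""
    for term in top_terms(weights, m):
        dht.put(h(term), doc_id)


dht = ToyDHT()
publish(dht, "doc1", {"chord": 0.9, "overlay": 0.7, "the": 0.1}, m=2)
publish(dht, "doc2", {"chord": 0.4, "index": 0.8, "tree": 0.6}, m=2)
```

A keyword lookup then hashes the query term and fetches the stored pointers, for example dht.get(h("chord")). Note that doc2 is not found under "chord" because that term is not among its two heaviest.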

Example

Peer-to-Peer LSI (pLSI)
LSI: latent semantic indexing
–Use SVD to transform and truncate the matrix of term vectors computed by VSM, to discover the semantics of terms and documents
Basic idea
–l: dimensionality of the LSI semantic space
–k: dimensionality of the CAN Cartesian space
–Make l = k

pLSI (cont.)
Challenges for pLSI
–Sphere distribution of semantic vectors
–Solution: transform the sphere space

Latent Semantic Indexing
–Vector space model -> SVD (truncated: TSVD) -> project new vectors -> compute similarity
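The LSI pipeline can be summarised in standard truncated-SVD notation. The symbols are assumptions introduced here for illustration: A is the t x d term-document matrix produced by the vector space model, l is the number of singular values kept, q is a new query (or document) vector in term space, and \hat{d}_j is the l-dimensional representation of document j.

```latex
% Truncated SVD: keep only the l largest singular values
A \approx A_l = U_l \, \Sigma_l \, V_l^{T},
\qquad U_l \in \mathbb{R}^{t \times l},\;
\Sigma_l \in \mathbb{R}^{l \times l},\;
V_l \in \mathbb{R}^{d \times l}

% Project a new vector q into the l-dimensional semantic space
\hat{q} = \Sigma_l^{-1} \, U_l^{T} \, q

% Compare it with document j (row j of V_l) by cosine similarity
\operatorname{sim}(\hat{q}, \hat{d}_j)
  = \frac{\hat{q} \cdot \hat{d}_j}{\lVert \hat{q} \rVert \, \lVert \hat{d}_j \rVert}
```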

M-Chord: A Scalable Distributed Similarity Search Structure

iDistance – Indexing the Distance
Space partitioning into n clusters
–Reference points p_i
Each cluster is mapped to an interval
Each object x is mapped to a 1-d value: iDist(x) = i*c + dist(p_i, x)
Values are indexed in a B+-tree
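A minimal sketch of the mapping iDist(x) = i*c + dist(p_i, x). Two points are assumptions rather than statements from the slides: each object is assigned to its nearest reference point, and the constant c is chosen larger than any cluster radius so that the per-cluster intervals do not overlap. Euclidean distance is used as the example metric.

```python
import math


def idistance_key(x, pivots, c):
    """Return (i, key): the cluster index i and the 1-d key i*c + dist(p_i, x).

    Assumption: x belongs to the cluster of its nearest reference point.
    """
    i = min(range(len(pivots)), key=lambda j: math.dist(pivots[j], x))
    return i, i * c + math.dist(pivots[i], x)


pivots = [(0.0, 0.0), (10.0, 0.0)]  # example reference points p_0, p_1
c = 100.0                           # example constant separating the intervals

key_a = idistance_key((3.0, 4.0), pivots, c)   # cluster 0, distance 5 from p_0
key_b = idistance_key((10.0, 2.0), pivots, c)  # cluster 1, distance 2 from p_1
```

Objects of cluster 0 land in [0, c) and objects of cluster 1 in [c, 2c), so a single ordered index over the keys keeps each cluster contiguous.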

Query R(q, r)
–A query intersects cluster i if dist(p_i, q) - r ≤ r_i, where r_i is the radius of cluster i
–Scan the interval [i*c + dist(p_i, q) - r, i*c + dist(p_i, q) + r]
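The range query can be sketched as follows. This is an illustration only: a sorted Python list searched with bisect stands in for the B+-tree, objects are assigned to their nearest reference point, and the scanned interval is clipped to the cluster's own key range (a small addition beyond the slide, so that neighbouring clusters are not scanned twice).

```python
import bisect
import math


def build_index(points, pivots, c):
    """Return (sorted (key, point) entries, cluster radii r_i)."""
    entries, radii = [], [0.0] * len(pivots)
    for x in points:
        i = min(range(len(pivots)), key=lambda j: math.dist(pivots[j], x))
        d = math.dist(pivots[i], x)
        radii[i] = max(radii[i], d)       # r_i: largest distance seen in cluster i
        entries.append((i * c + d, x))
    entries.sort()
    return entries, radii


def range_query(entries, radii, pivots, c, q, r):
    """Return all indexed points within distance r of q."""
    keys = [k for k, _ in entries]
    result = []
    for i, p in enumerate(pivots):
        dq = math.dist(p, q)
        if dq - r > radii[i]:             # query ball does not reach cluster i
            continue
        lo = i * c + max(dq - r, 0.0)     # clipped interval start
        hi = i * c + min(dq + r, radii[i])  # clipped interval end
        start = bisect.bisect_left(keys, lo)
        stop = bisect.bisect_right(keys, hi)
        for _, x in entries[start:stop]:
            if math.dist(q, x) <= r:      # final check with the real distance
                result.append(x)
    return result


pivots = [(0.0, 0.0), (10.0, 0.0)]
points = [(1.0, 1.0), (2.0, 2.0), (9.0, 1.0), (12.0, 0.0), (4.0, 5.0)]
entries, radii = build_index(points, pivots, c=100.0)
hits = range_query(entries, radii, pivots, 100.0, q=(2.0, 1.0), r=1.5)
```

The interval scan only produces candidates; the final distance check is still needed because two objects with similar distances to p_i can be far from each other.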

M-Chord
Basic principles
–Choose a set of n pivots p_0, …, p_{n-1} from an a priori given sample dataset
–Divide the set of indexed objects I into clusters C_0, …, C_{n-1}: [formula not preserved in the transcript]
–Every object x may be excluded without evaluating d(q, x) if [condition not preserved in the transcript]
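The exclusion condition itself is missing from the transcript. The sketch below shows the standard pivot-filtering test based on the triangle inequality, which is an assumption about what the slide intended: if |d(p_i, x) - d(p_i, q)| > r for some pivot p_i, then d(q, x) > r, so x can be skipped using only precomputed pivot distances. The function name can_exclude is hypothetical.

```python
import math


def can_exclude(x_pivot_dists, q_pivot_dists, r):
    """True if the triangle inequality proves d(q, x) > r.

    x_pivot_dists[i] = d(p_i, x), precomputed when x was indexed.
    q_pivot_dists[i] = d(p_i, q), computed once per query.
    """
    return any(abs(dx - dq) > r for dx, dq in zip(x_pivot_dists, q_pivot_dists))


pivots = [(0.0, 0.0), (10.0, 0.0)]
q, r = (1.0, 0.0), 2.0
q_dists = [math.dist(p, q) for p in pivots]

far = (9.0, 0.0)    # true distance to q is 8
near = (2.0, 0.0)   # true distance to q is 1
far_excluded = can_exclude([math.dist(p, far) for p in pivots], q_dists, r)
near_excluded = can_exclude([math.dist(p, near) for p in pivots], q_dists, r)
```

The test is one-sided: a True result is always safe, while a False result only means the object still has to be checked with the real distance.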

M-Chord
Pivot selection
–Influences the performance of the search algorithm
Publish
–Use iDistance to map the dataset into a one-dimensional domain and join this domain with the Chord protocol
–Use an order-preserving function h to map the domain onto the interval [0, 2^m)
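The slides do not say how h is defined. As a minimal sketch, the simplest order-preserving choice is a linear scaling of the iDistance domain [0, n*c) onto [0, 2^m); this is an assumption, and a function that also evens out the key distribution across nodes may be preferable in practice. The name h_linear is hypothetical.

```python
def h_linear(idist_key, n, c, m):
    """Map an iDistance key in [0, n*c) to an integer in [0, 2^m).

    Linear scaling is monotone, so key order is preserved (ties may appear
    after rounding down to an integer).
    """
    if not 0 <= idist_key < n * c:
        raise ValueError("key outside the iDistance domain [0, n*c)")
    return min(int(idist_key / (n * c) * 2 ** m), 2 ** m - 1)


# Example: n = 2 clusters, c = 100, identifiers of m = 8 bits
k_small = h_linear(5.0, n=2, c=100.0, m=8)
k_large = h_linear(102.0, n=2, c=100.0, m=8)
```

Because order is preserved, a contiguous interval of iDistance keys stays a contiguous interval of Chord identifiers, which is what makes interval scans possible on the ring.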

M-Chord
Data structure
–Chord routing information
–B+-tree storage for the (K_{i-1}, K_i] (mod 2^m) interval
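The responsibility rule, where the node with identifier K_i stores keys in (K_{i-1}, K_i] modulo 2^m, can be sketched as a successor lookup. This is a centralised illustration with a hypothetical helper name; in a real Chord ring the lookup is resolved by routing through finger tables rather than over a global sorted list.

```python
import bisect


def responsible_node(node_keys, key, m):
    """Return the identifier K_i of the node covering (K_{i-1}, K_i] mod 2^m.

    node_keys must be a sorted list of distinct node identifiers in [0, 2^m).
    """
    key %= 2 ** m
    idx = bisect.bisect_left(node_keys, key)  # first node identifier >= key
    return node_keys[idx % len(node_keys)]    # wrap around past the last node


nodes = [8, 20, 45]   # example ring with m = 6, identifiers in [0, 64)
owner_of_9 = responsible_node(nodes, 9, m=6)
owner_of_50 = responsible_node(nodes, 50, m=6)
```

Key 50 lies beyond the last node identifier, so it wraps around the ring to the first node.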

M-Chord
Range search
–For each cluster C_i, determine the interval I_i of keys to be scanned
–Send an INTERVALSEARCH(I_i, q, r) request to the node N_{I_i} responsible for the midpoint of the interval
–Wait for all responses and create the final answer set
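The three steps can be sketched from the initiating node's point of view. Several things here are assumptions: the interval formula is carried over from the iDistance query slide (the formula on this slide is not preserved), intervals are kept in the iDistance key domain with the mapping through h omitted, requests are issued one after another rather than in parallel, and send is a hypothetical transport callback that delivers INTERVALSEARCH to the node responsible for the given midpoint.

```python
import math


def range_search(pivots, radii, c, q, r, send):
    """Collect the answers of one INTERVALSEARCH request per intersected cluster.

    send(midpoint, interval, q, r) is a hypothetical callback returning the
    matching objects found by the node responsible for `midpoint`.
    """
    answer = set()
    for i, (p, r_i) in enumerate(zip(pivots, radii)):
        dq = math.dist(p, q)
        if dq - r > r_i:                  # cluster C_i cannot contain results
            continue
        lo = i * c + max(dq - r, 0.0)     # interval I_i, clipped to the cluster
        hi = i * c + min(dq + r, r_i)
        answer.update(send((lo + hi) / 2, (lo, hi), q, r))
    return answer


# Fake transport that records requests instead of contacting real nodes.
calls = []


def fake_send(midpoint, interval, q, r):
    calls.append((midpoint, interval))
    return {("hit", len(calls))}


pivots = [(0.0, 0.0), (10.0, 0.0)]
radii = [6.5, 2.0]
answer = range_search(pivots, radii, c=100.0, q=(2.0, 1.0), r=1.5, send=fake_send)
```

In this example only cluster 0 is intersected, so a single request is sent; cluster 1 is pruned because the query ball does not reach it.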

M-Chord
INTERVALSEARCH(I_i, q, r)

M-Chord
KNN search
–The iDistance approach to KNN query processing, a sequence of range queries with growing radius, is not suitable for a distributed environment
    –Multiple range iterations would result in an unpleasant number of successive message transmissions, increasing the overall response time
–Solution
    –Employ a low-cost heuristic to find k objects that are near q
    –Run the Range(q, Qk) query and return the nearest objects from the query result
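The two-phase KNN idea can be sketched as below. The slide does not define Qk or the heuristic, so both are assumptions here: Qk is taken to be the distance from q to the k-th nearest of the heuristic's candidates, and the heuristic is passed in as a function (the deliberately crude first_k is a hypothetical placeholder). A linear scan stands in for the distributed Range query.

```python
import heapq
import math


def knn(points, q, k, heuristic):
    """Return the k nearest points to q using one range query.

    Phase 1: the heuristic cheaply proposes k candidate objects.
    Phase 2: a single Range(q, radius) query, with radius equal to the
    k-th smallest candidate distance, is guaranteed to contain the true
    k nearest neighbours.
    """
    candidates = heuristic(points, q, k)
    if len(candidates) < k:
        raise ValueError("heuristic returned fewer than k objects")
    radius = sorted(math.dist(q, x) for x in candidates)[k - 1]
    in_range = [x for x in points if math.dist(q, x) <= radius]  # Range(q, radius)
    return heapq.nsmallest(k, in_range, key=lambda x: math.dist(q, x))


def first_k(points, q, k):
    """Hypothetical placeholder heuristic: just take the first k objects."""
    return points[:k]


points = [(9.0, 9.0), (8.0, 8.0), (1.0, 1.0), (2.0, 2.0), (0.0, 1.0)]
nearest = knn(points, q=(0.0, 0.0), k=2, heuristic=first_k)
```

Any heuristic gives a correct answer; a better one only shrinks the radius and therefore the number of nodes and objects the range query touches.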