A Search Engine Architecture Based on Collection Selection Diego Puppin University of Pisa, Italy Supervisors: D. Laforenza, M. Vanneschi.

Introduction

Diego Puppin, A Search Engine Architecture Based on Collection Selection, September 2007, University of Pisa

Motivations
- The Web keeps growing, and users are increasingly demanding.
- Precise results are needed very fast.
- The index is growing, due to added pages and advanced indexing.
- These are big IR problems for Web, book, and multimedia search engines.

Motivations (2)
- There is a need for new solutions that can give high-quality results with reduced computing load.
- Parallel computing looks like the most natural choice to help algorithms face this growth rate [Baeza-Yates et al. 2007a].
- Billions of pages and several TB of data are available: the index is still very big (about 5X the collection size).
- New approaches to partitioning are key to the next phase.

Parallel (Distributed) IRSs

Term vs Doc partitioning
- Term partitioning reduces the computing load: only the servers holding the relevant terms are involved.
  - Problems of load balancing
  - Heavier communication patterns
- Document partitioning gives better balancing, but all documents are scanned.
- How can the load be reduced with document partitioning?
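
The trade-off above can be illustrated with a toy sketch. The collection, the round-robin document assignment, and the alphabetical term assignment are all assumptions made for illustration; they are not taken from the slides.

```python
from collections import defaultdict

# Toy collection; ids and texts are invented for illustration.
DOCS = {
    0: "web search engine",
    1: "parallel search",
    2: "web index partitioning",
    3: "parallel index",
}


def build_doc_partitioned(docs, num_servers):
    """Each server holds a full inverted index over its own subset of documents."""
    servers = [defaultdict(set) for _ in range(num_servers)]
    for doc_id, text in docs.items():
        local_index = servers[doc_id % num_servers]  # round-robin (assumption)
        for term in text.split():
            local_index[term].add(doc_id)
    return servers


def build_term_partitioned(docs, num_servers):
    """Each server holds the complete posting lists of a subset of the terms."""
    vocabulary = sorted({t for text in docs.values() for t in text.split()})
    owner = {term: i % num_servers for i, term in enumerate(vocabulary)}
    servers = [defaultdict(set) for _ in range(num_servers)]
    for doc_id, text in docs.items():
        for term in text.split():
            servers[owner[term]][term].add(doc_id)
    return servers, owner


def servers_contacted_doc(servers):
    """Document partitioning: every server scans its local index."""
    return set(range(len(servers)))


def servers_contacted_term(owner, query):
    """Term partitioning: only servers owning a query term are contacted."""
    return {owner[t] for t in query.split() if t in owner}


doc_servers = build_doc_partitioned(DOCS, 2)
term_servers, term_owner = build_term_partitioned(DOCS, 2)
```

In this toy setup the query "parallel search" touches a single term-partitioned server but both document-partitioned servers, which is the load difference the slide points at.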

Main contributions
1. Query-vector document model
   - More efficient for partitioning and selection (co-clustering and PCAP)
2. Load-driven routing
   - Makes better use of the available load
   - Based on the effective load of the system
3. Incremental caching
   - Improves throughput AND quality

Acknowledgments
- Fabrizio Silvestri
- Raffaele Perego
- Ricardo Baeza-Yates
- Abdur Chowdhury, Ophir Frieder, Gerhard Weikum, and the various reviewers…

Other contributions
- More compact collection representation: 1/5 of CORI, while outperforming it
- A way to select documents (50%) to move out of the index
  - The documents in the supplemental index contribute to only 3% of top results
- A simple way to update the index in a document-partitioned system
- Extended simulation: 6 M documents, 800k test queries, real computing costs, several configurations tested

Reviewers' Requests: Frieder
- More detailed discussion of the co-clustering algorithm
- Improved cost scheme
- Experiments to be extended in the future

Reviewers' Requests: Weikum
- Improved description of the pipelined term-partitioned IR system
- Improved description of co-clustering
- Better definition of shingles
- New realistic cost model
- Deeper discussion of cache and silent documents

How to Improve Partitions

Partitioning Strategy
[Figure: a document collection split into partitions (p1, p2, ...)]
- Random
- Content-based (e.g. K-Means, link-based clustering)
- Usage-based

The QV Model
[Figure: co-clustering of the query-document matrix (axes: queries, documents). Legend: document j is returned in answer to query i; document j is not relevant to query i. Blocks mark query clusters and document clusters.]
- Each document cluster corresponds to a different partition. In this case, three partitions are generated.
- For each query cluster, a vocabulary is built out of all the distinct query terms of the queries in the cluster.
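
As a minimal sketch of the data behind this model, the snippet below builds a query-document matrix from training results and a vocabulary for a query cluster. The input format (a dict from query to scored results) and the function names are assumptions for illustration, not the author's code.

```python
def build_qv_matrix(results_per_query):
    """matrix[i][j] is the score of document j for query i, or 0.0 if not returned."""
    query_ids = sorted(results_per_query)
    doc_ids = sorted({d for res in results_per_query.values() for d in res})
    column = {d: j for j, d in enumerate(doc_ids)}
    matrix = [[0.0] * len(doc_ids) for _ in query_ids]
    for i, query in enumerate(query_ids):
        for doc, score in results_per_query[query].items():
            matrix[i][column[doc]] = score
    return query_ids, doc_ids, matrix


def query_vector(matrix, j):
    """Query-vector representation of document j: one entry per training query."""
    return [row[j] for row in matrix]


def cluster_vocabulary(queries):
    """All distinct terms of the queries assigned to one query cluster."""
    return {term for query in queries for term in query.split()}


# Hypothetical training data: query -> {document: score}.
training = {
    "web search": {"d1": 0.9, "d2": 0.4},
    "search engine": {"d2": 0.7},
    "cheap flights": {"d3": 0.8},
}
queries, docs, qv = build_qv_matrix(training)
```

Rows and columns of this matrix are what the co-clustering step groups into query clusters and document clusters.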

Theoretical Model of Co-clustering
- The algorithm we use [Dhillon et al., 2003] finds the clustering that minimizes the loss of information between the original matrix and the clustered matrix (given the number of row and column clusters).
- Efficient implementation, very robust solution
- Stable with respect to test period, number of clusters, training set used, and matrix model (scores, boolean, repeated)
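
For reference, the loss of information minimized in [Dhillon et al., 2003] is a loss in mutual information. In the sketch below, p(x, y) is the matrix normalized to a joint distribution and C_X, C_Y are the row and column cluster assignments; reading X as queries and Y as documents is an assumption based on the QV model slide.

```latex
\min_{C_X,\, C_Y} \; I(X;Y) - I(\hat{X};\hat{Y}),
\qquad \hat{X} = C_X(X), \quad \hat{Y} = C_Y(Y),
\qquad
I(X;Y) = \sum_{x}\sum_{y} p(x,y)\,\log\frac{p(x,y)}{p(x)\,p(y)}
```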

QV for Collection Selection
[Figure labels: query, query clusters, document clusters]
- Partitions are ranked according to their relevance to the query.
- We called this strategy PCAP.
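
The slide does not spell out the scoring formula. One possible reading, sketched here under stated assumptions, is to score the query against each query-cluster vocabulary and then propagate those scores to the document clusters through the co-clustered block matrix. The term-overlap score and the names `block_weights` and `pcap_rank` are hypothetical.

```python
def query_cluster_scores(query, cluster_vocabularies):
    """Fraction of query terms found in each query-cluster vocabulary.

    A simple stand-in for a real text-similarity score (assumption).
    """
    terms = query.split()
    if not terms:
        return [0.0] * len(cluster_vocabularies)
    return [sum(t in vocab for t in terms) / len(terms) for vocab in cluster_vocabularies]


def pcap_rank(query, cluster_vocabularies, block_weights):
    """Rank document clusters (partitions), best first.

    block_weights[c][p] is the weight linking query cluster c to document
    cluster p in the co-clustered matrix.
    """
    qc = query_cluster_scores(query, cluster_vocabularies)
    num_partitions = len(block_weights[0])
    scores = [
        sum(qc[c] * block_weights[c][p] for c in range(len(qc)))
        for p in range(num_partitions)
    ]
    ranking = sorted(range(num_partitions), key=lambda p: scores[p], reverse=True)
    return ranking, scores


# Hypothetical model: two query clusters, three document clusters.
vocabularies = [{"web", "search", "engine"}, {"cheap", "flights", "hotel"}]
weights = [[0.8, 0.1, 0.1], [0.0, 0.2, 0.8]]
ranking, scores = pcap_rank("cheap flights", vocabularies, weights)
```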

PCAP collection selection

Experimental Settings
- Experiments were carried out using WBR99: 5,939,061 documents; 22 GB of uncompressed text
  - A snapshot of the Brazilian Web (domain .br)
- A query log from todobr.com, relative to a January-October period
- Zettair as the IR core
- Training: 190,000 queries; test: 800,000 queries
- We created the document clusters and 128 query clusters.
- Model tested on the successive week (the fourth week).
- Metrics used:
  - Intersection: percentage of relevant results returned using only k servers out of 16+1 (from [Puppin et al., 2006]).
  - Competitive similarity: percentage of relevance score obtained using only k servers out of 16+1 (adapted from [Chierichetti et al., 2007]).
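
The two metrics can be sketched as follows. This is one reading of the slide's definitions, assuming that the "relevant results" are the top-n results of the full index and that scores are comparable across servers; function and argument names are hypothetical.

```python
def intersection_at(reference_top, polled_top, n):
    """Share of the top-n reference results also found in the top-n polled results."""
    reference = set(reference_top[:n])
    if not reference:
        return 0.0
    return len(reference & set(polled_top[:n])) / len(reference)


def competitive_similarity_at(reference_scores, polled_scores, n):
    """Score mass of the top-n polled results relative to the top-n reference results."""
    top_reference = sorted(reference_scores.values(), reverse=True)[:n]
    top_polled = sorted(polled_scores.values(), reverse=True)[:n]
    total = sum(top_reference)
    return sum(top_polled) / total if total else 0.0


# Hypothetical results: full index vs. a subset of polled servers.
full_scores = {"d1": 5.0, "d2": 4.0, "d3": 3.0, "d4": 2.0, "d5": 1.0}
polled_scores = {"d2": 4.0, "d5": 1.0, "d9": 1.0}
full_top = sorted(full_scores, key=full_scores.get, reverse=True)
polled_top = sorted(polled_scores, key=polled_scores.get, reverse=True)
```

Competitive similarity is more forgiving than intersection: a polled result that is not in the reference top-n still contributes its score.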

Quality Metrics

Very Effective Partitioning and Selection
[Table comparing the intersection of CORI on random partitioning, CORI on QV partitioning, and PCAP on QV partitioning]
- With random partitioning, CORI performs really badly: almost equal to relevants/Nclusters, e.g. 5/17 ≈ 0.3.
- CORI on QV performs about 5.2 times better than CORI on random.
- PCAP on QV performs about 5.8 times better than CORI on random.
- PCAP on QV performs about 1.1 times better than CORI on QV.
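
The random-partitioning baseline follows from simple arithmetic: if each relevant document is equally likely to sit on any partition, polling one server is expected to find relevants/Nclusters of them. The short sketch below works this out and checks that the three reported ratios are mutually consistent; the function name is hypothetical.

```python
def expected_random_intersection(num_relevant, num_partitions, servers_polled=1):
    """Expected number of relevant results found when documents are spread uniformly."""
    return num_relevant * servers_polled / num_partitions


baseline = expected_random_intersection(5, 17)  # 5/17, about 0.29
implied_pcap_vs_random = 5.2 * 1.1  # about 5.7, in line with the reported 5.8 given rounding
```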

Strength
- Popular queries drive the distribution
- Low-dimensional space to represent documents
- More efficient collection representation
- QV may be built while answering queries

Weakness
- Dependent on the training set
  - Actually… NOT!
- Cannot manage new query terms
  - Very small fraction; CORI does not help
  - Incremental caching can help
- Collection selection depends on the assignment
  - But addition does not break performance

Issues with Load Distribution

Load Balancing
- Still, the maximum load is ~25% of the maximum capacity available at each IR core.
- Load is measured as the maximum number of queries answered by each IR core within a sliding query window of 1000 queries.
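
A minimal sketch of this load measure, assuming the window slides one query at a time and that the peak per core is tracked over the whole run. The class and method names are hypothetical.

```python
from collections import Counter, deque


class SlidingWindowLoad:
    """Tracks, per core, how many of the last `window_size` queries it answered."""

    def __init__(self, window_size=1000):
        self.window_size = window_size
        self.window = deque()    # one frozenset of cores per recent query
        self.counts = Counter()  # current load per core within the window
        self.peak = Counter()    # maximum load seen per core

    def record(self, cores_answering):
        """Register one query and the cores that answered it."""
        cores = frozenset(cores_answering)
        self.window.append(cores)
        for core in cores:
            self.counts[core] += 1
        if len(self.window) > self.window_size:
            for core in self.window.popleft():  # the oldest query leaves the window
                self.counts[core] -= 1
        for core in cores:
            self.peak[core] = max(self.peak[core], self.counts[core])


# Tiny example with a window of 3 queries instead of 1000.
load = SlidingWindowLoad(window_size=3)
for answering in [{0}, {0, 1}, {0}, {1}, {1}]:
    load.record(answering)
```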

Load Balancing Strategies
- Load-driven basic
  - Servers are ranked according to their relevance, using a collection selection function.
  - The first gets priority 1, then linearly down to 1/17.
  - Every server i has to answer if: L(i) < p(i) * L
- Load-driven boost
  - Priority is 1 for the first T servers, then linearly down to 1/(17-T)
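
The two priority schemes and the answering rule can be sketched as below. The exact interpolation for the boost variant is an assumption (the slide only gives the end points), and T = 4 is a hypothetical value chosen for the example.

```python
def basic_priorities(num_servers=17):
    """Priority 1 for the top-ranked server, decreasing linearly to 1/num_servers."""
    return [(num_servers - rank) / num_servers for rank in range(num_servers)]


def boost_priorities(num_servers=17, boost=4):
    """Priority 1 for the first `boost` servers, then linearly down to 1/(num_servers - boost).

    Requires boost < num_servers. The interpolation used here is an assumption.
    """
    return [
        min(1.0, (num_servers - rank) / (num_servers - boost))
        for rank in range(num_servers)
    ]


def servers_to_poll(ranked_servers, priorities, current_load, load_threshold):
    """Server i answers only if L(i) < p(i) * L."""
    return [
        server
        for server, priority in zip(ranked_servers, priorities)
        if current_load[server] < priority * load_threshold
    ]


# Hypothetical example with three servers, best first.
polled = servers_to_poll(
    ["a", "b", "c"], [1.0, 0.5, 0.25], {"a": 90, "b": 40, "c": 30}, load_threshold=100
)
```

Less relevant servers get a lower effective threshold, so they drop out first when the system is loaded.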

Experimental Settings (2)
- The broker models the load in the cores as the number of queries served out of the last W queries.
- Assumption: cost = 1 for each query and collection
  - We will change this
- We count the number of relevant results we can get by polling the servers, up to the chosen load threshold.

Load Balancing Results
[Figures: Intersection (# of relevant results retrieved) and Competitive Similarity (% of rank score retrieved) for the FIXED 4, BASIC, and BOOST strategies]

Caching and Collection Selection

Interaction with a Cache
- Result caching is commonly used in WSEs [Baeza-Yates et al., 2007a; Baeza-Yates et al., 2007b].
- Caching has the effect of reshaping the power law underlying the query distribution [Baeza-Yates et al., 2007a].
- We designed a novel caching strategy (Incremental Caching) integrated with collection selection.

Incremental Caching
[Figure: IR Cores 1 to 4 and an incremental cache; for a recurring query Q, the cache entry records the results and the servers polled]
- An incremental cache is effective both at load reduction and at improving result quality.
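
A rough sketch of the idea, as read from the figure: a cache entry keeps both the results and the set of servers already polled, so a repeated query only polls servers that were skipped before and merges their results in. Class, function, and parameter names are hypothetical, and scores are assumed to be comparable across servers.

```python
class IncrementalCache:
    """Maps a query to its best results so far and the servers already polled."""

    def __init__(self):
        self.entries = {}

    def lookup(self, query):
        return self.entries.get(query)

    def update(self, query, new_results, servers_polled, top_n=10):
        entry = self.entries.setdefault(query, {"results": {}, "polled": set()})
        merged = {**entry["results"], **new_results}
        best = sorted(merged.items(), key=lambda item: item[1], reverse=True)[:top_n]
        entry["results"] = dict(best)
        entry["polled"] |= set(servers_polled)
        return entry


def answer(query, cache, ranked_servers, can_serve, search):
    """Poll only servers not yet polled for this query, then merge into the cache.

    `can_serve(server)` stands for the load check and `search(server, query)`
    for the local search returning {doc_id: score}; both are placeholders.
    """
    entry = cache.lookup(query)
    already_polled = entry["polled"] if entry else set()
    to_poll = [s for s in ranked_servers if s not in already_polled and can_serve(s)]
    new_results = {}
    for server in to_poll:
        new_results.update(search(server, query))
    return cache.update(query, new_results, to_poll)
```

Each repetition of a query can therefore add results from further servers instead of repeating work, which is how the cache can improve quality as well as reduce load.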

Incremental Caching Results
[Figures: Intersection (# of relevant results retrieved) and Competitive Similarity (% of rank score retrieved) for the BASIC, BOOST, and INCREMENTAL strategies]

Refined Cost Model and Prioritization

Collection Prioritization
- We move the load control from the broker to the cores.
- The broker broadcasts the query and sends info about the relative rank of each core (the priority).
- Each core serves the query if L(i) < p(i) * L.
- L(i) = sum of the computing cost (timing) of served queries
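
A core-side version of the rule can be sketched as follows. Summing costs over a sliding window of recent queries is an assumption carried over from the earlier load definition, and the cost is passed in as an already measured time for simplicity; the class name is hypothetical.

```python
from collections import deque


class Core:
    """An IR core that decides locally whether to serve a broadcast query."""

    def __init__(self, load_cap, window_size=1000):
        self.load_cap = load_cap                # L, in the same unit as the costs
        self.costs = deque(maxlen=window_size)  # cost per recent query, 0 if skipped

    def current_load(self):
        """L(i): total computing cost of the queries served within the window."""
        return sum(self.costs)

    def handle(self, priority, cost):
        """Serve the query if L(i) < p(i) * L, and record its cost."""
        serve = self.current_load() < priority * self.load_cap
        self.costs.append(cost if serve else 0.0)
        return serve


# Tiny example: load cap 10, window of 3 queries.
core = Core(load_cap=10, window_size=3)
decisions = [core.handle(1.0, 4), core.handle(0.5, 4), core.handle(0.5, 4), core.handle(1.0, 1)]
```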

Extended Tests
- We actually partitioned the documents onto different servers.
- We indexed locally, and we measured the timing of each query.
- The actual timing is used to compute the load and drive the system.
- The load cap is the AVERAGE load
  - The peak can vary heavily!

…the bill, please!

Conclusions
- We presented an architecture for a distributed search engine, based on collection selection.
- The load-driven strategy and incremental caching can retrieve very high-quality results with reduced load.
- Verified with an extensive simulation

Impact and Benefits
- If a given precision is expected, we can use FEWER servers.
- With a given number of servers, we get HIGHER precision.
- Confirmed with different metrics
- Smaller load for the IR system, with more focus on top results
- Nice trade-off between cost and quality

Impact and Benefits (2)
- Load-driven routing can be used
  - to absorb query peaks
  - to offer higher/lower quality results to selected users
- Consistent ranking due to local indexing
- Incremental caching can be used to reduce the negative effects of selection

Furthermore
- Caching posting lists is very effective on local indices
- Simple way to add new documents
- Incremental caching could help with impact-ordered posting lists
- Caching could be based on line value (query frequency, number of polled servers)

Future Work
- Comparison with other results in clustering (k-means, link-based, P2P, LSI, SVD)
- Test on a large-scale, real-world search engine
- Real-world implementation at Google
- TOIS paper to wrap up

References
- [Puppin et al., 2006] Diego Puppin, Fabrizio Silvestri, Domenico Laforenza. "Query-Driven Document Partitioning and Collection Selection". Invited Paper. Proceedings of INFOSCALE '06.
- [Puppin & Silvestri, 2006] Diego Puppin, Fabrizio Silvestri. "The Query-Vector Document Model". Proceedings of CIKM '06.
- [Puppin et al., 2007] Diego Puppin, Ricardo Baeza-Yates, Raffaele Perego, Fabrizio Silvestri. "Incremental Caching for Collection Selection Architectures". Proceedings of INFOSCALE '07.
- [Baeza-Yates et al., 2007a] Ricardo Baeza-Yates, Carlos Castillo, Flavio Junqueira, Vassilis Plachouras, Fabrizio Silvestri. "Challenges in Distributed Information Retrieval". Invited Paper. Proceedings of ICDE.
- [Chierichetti et al., 2007] F. Chierichetti, A. Panconesi, P. Raghavan, M. Sozio, A. Tiberi, E. Upfal. "Finding Near Neighbors Through Cluster Pruning". Proceedings of PODS.
- [Baeza-Yates et al., 2007b] Ricardo Baeza-Yates, Aristides Gionis, Flavio Junqueira, Vanessa Murdock, Vassilis Plachouras, Fabrizio Silvestri. "The Impact of Caching on Search Engines". Proceedings of SIGIR 2007.
- [Dhillon et al., 2003] I. S. Dhillon, S. Mallela, D. S. Modha. "Information-Theoretic Co-Clustering". Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2003).

Backup Slides

Adding Documents

Adding Documents
- It is important to assign new documents to the fittest clusters
  - New versions, new pages, etc.
- The new documents will be found along with the previously assigned documents.
- Hopefully, collection selection will find them together with similar documents.

A Modest Proposal
- The body of the new document is used as a query for the PCAP selection.
- The body is compared to the query clusters.
- We will find a similarity between the document body and the query clusters.
- We use PCAP to rank document collections.

Implementation
- The first 1000 bytes of the (stripped) document body are used.
- The new document is assigned to the document cluster with the top PCAP score.
- New documents are locally indexed.
- No need to re-train / re-assign
- New documents have consistent score and ranking.
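
The assignment step can be sketched as below, assuming the body has already been stripped of markup and that a ranking function such as PCAP is available as a callable. `rank_partitions` is a hypothetical placeholder that returns partition ids, best first.

```python
def assign_new_document(body, rank_partitions, prefix_bytes=1000):
    """Use the first `prefix_bytes` bytes of the body as a query and pick the top partition."""
    prefix = body.encode("utf-8")[:prefix_bytes].decode("utf-8", errors="ignore")
    ranking = rank_partitions(prefix)
    return ranking[0]


def toy_ranker(text):
    """Stand-in for PCAP: ranks two hypothetical partitions by term overlap."""
    vocabularies = [{"search", "engine"}, {"flights", "hotel"}]
    terms = set(text.split())
    overlap = [len(terms & vocabulary) for vocabulary in vocabularies]
    return sorted(range(len(vocabularies)), key=lambda p: overlap[p], reverse=True)


partition = assign_new_document("cheap flights and hotel deals", toy_ranker)
```

Because only the assignment changes, the existing clusters do not need to be re-trained, which matches the "no need to re-train / re-assign" point above.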

Test Configurations
