Distributed Information Retrieval
Jiaul Paik
The “Search” (aka Information Retrieval)
The Search Engine Black Box
A query and a collection of documents go in; ranked results come out.
Search Engine: Inside
Documents pass through a document representation step and are stored in an index. A query passes through a query representation step; query processing then matches it against the index with a scoring function to produce the results.
Information Retrieval: Main Parts
Indexing: organize the data for faster query processing.
Scoring: estimate the usefulness of an item/document with respect to a query.
Index Structure
The index has two columns: a dictionary-ordered list of terms (aid, all, back, brown, come, dog, fox, good, jump, lazy, men, now, over, party, quick, their, time) and a postings file that lists, for each term, the IDs of the documents containing it (postings lists such as 1, 3 or 2, 4, 6, 8). The sample collection: 1: "quick aid", 2: "come jump", 3: "good party", ..., 10: "all over".
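As a minimal sketch (function and variable names are illustrative, not from the slides), an inverted index like the one above can be built by mapping each term to the sorted set of documents that contain it:

```python
from collections import defaultdict

def build_index(docs):
    """Build an inverted index: term -> sorted list of document IDs."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.split():
            postings[term].add(doc_id)
    # Dictionary-ordered terms, each with a sorted postings list.
    return {term: sorted(ids) for term, ids in sorted(postings.items())}

# The sample documents from the slide.
docs = {1: "quick aid", 2: "come jump", 3: "good party"}
index = build_index(docs)
# index["aid"] == [1], index["come"] == [2], index["party"] == [3]
```

At query time, the postings lists for the query terms are intersected or merged, which is why they are kept sorted.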
Scoring
Score = Function(x, y, ..., z), where typical inputs include:
- the number of documents in the collection
- the number of keywords present in a document
- the number of documents in the collection containing a keyword
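The slide does not name a specific scoring function, but TF-IDF is one common function built from exactly these quantities; a hedged sketch:

```python
import math

def tfidf_score(query_terms, doc_terms, num_docs, doc_freq):
    """Score a document for a query with a simple TF-IDF sum.

    num_docs -- number of documents in the collection
    doc_freq -- dict: term -> number of documents containing the term
    """
    score = 0.0
    for term in query_terms:
        tf = doc_terms.count(term)   # keyword occurrences in this document
        df = doc_freq.get(term, 0)   # documents containing the keyword
        if tf and df:
            score += tf * math.log(num_docs / df)  # rare terms weigh more
    return score

# A term found in only 2 of 10 documents contributes tf * log(10/2).
score = tfidf_score(["fox"], ["quick", "brown", "fox"], 10, {"fox": 2})
```

This is only one instantiation; real engines use variants such as BM25, which combine the same three statistics with length normalization.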
Two Architectures
Centralized architecture: a single machine holds the data and the index.
Distributed architecture: data and indexes are stored on several connected machines.
Distributed Information Retrieval
Major Components
1. Resource Description
2. Resource Selection
3. Result Merging
Distributed Information Retrieval: Architectures
1. Peer-to-Peer Architecture
2. Broker-Based Architecture
Peer to Peer Architecture
1. Indexes are located with the resources.
2. Some parts of the indexes are distributed to other resources.
3. Queries are distributed across the resources, and results are merged by the peer that originated the query (in the diagram, a query from Peer A fans out to Peers B, C, and D).
Broker-based Architecture
The user sends a query to the broker, which forwards it to the data hosts; each data host holds its own data and index.
Rest of the Talk: Broker-Based Architecture
Resource Description
The broker needs to know which resource contains what.
Goal: build a high-level representation of each federated resource.
Resource Description: Structure
A resource description conveys the content of a resource. It may include:
- the full content
- an index of the full content
- metadata
- statistics: term frequency, term rareness, average document length, ...
- other information
Distributed Information Retrieval: Co-operation
There are two settings in which to obtain a resource description:
- Cooperative environment: provides full access to documents and indices, and responds to queries.
- Uncooperative environment: documents and indices cannot be accessed directly; it only responds to queries.
Resource Description in Cooperative Environments
This is very simple, since the broker has full access to the collection:
- The broker could harvest the full collection(s) and handle queries locally (not a good idea!).
- A resource could provide the broker with information (a description) useful for retrieval.
Resource Description: Protocol
Stanford Protocol Proposal for Internet Retrieval and Search (STARTS). For each resource, STARTS stores metadata and a content summary:
- query language (filter expressions, ranking expressions, ...)
- statistics (term frequency, document frequency, number of documents)
- score range
- stopword list (common words: in, the, a, ...)
- etc.
Resource Description in Un-cooperative Environments
This is far more difficult: the broker has no access to the full collections, metadata, or content summaries, and must acquire this information without any help from the resource. Important information to acquire for the resource description: collection size, term statistics, document scores. Note: the required information can only be estimated, NOT computed exactly.
Getting Resource Description: Query-based Sampling
1. Randomly select a word and send it as a query to the collection.
2. Get the top k (say 5) documents.
3. Extract the words and store them in a table.
4. Take another word from the table as the next query and repeat steps 2-4.
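The loop above can be sketched as follows; `search` stands in for the resource's query interface and is an assumed function, not part of any specific API:

```python
import random

def query_based_sampling(search, seed_word, num_queries=50, top_k=5):
    """Estimate a resource's vocabulary by query-based sampling.

    search(word, k) -- assumed callable: sends a one-word query to the
    resource and returns the text of its top-k documents.
    """
    vocabulary = set()   # the table of extracted words
    seen_docs = set()
    word = seed_word
    for _ in range(num_queries):
        for doc in search(word, top_k):           # step 2: top-k documents
            seen_docs.add(doc)
            vocabulary.update(doc.split())        # step 3: extract words
        if not vocabulary:
            break
        word = random.choice(sorted(vocabulary))  # step 4: next query term
    return vocabulary, seen_docs
```

Here the next query term is drawn from the sampled documents themselves (the LRD approach discussed below); drawing it from an external dictionary instead gives the ORD approach.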
Query-based Sampling: Issues
How to select queries? When to stop?
Selecting Sampling Queries: where from?
- Other Resource Description (ORD): select terms from a reference dictionary.
- Learned Resource Description (LRD): select terms from the retrieved documents based on term statistics.
Selecting Sampling Queries: how?
Queries can be selected by:
- random selection
- document frequency (df)
- collection term frequency (ctf)
- average term frequency (ctf/df)
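The four strategies differ only in the ranking key applied to the learned term statistics. A small sketch (the `stats` layout is an assumption for illustration):

```python
import random

def select_query_terms(stats, method="avg_tf", n=10):
    """Rank candidate sampling queries from learned term statistics.

    stats -- dict: term -> (df, ctf), where df is document frequency
             and ctf is collection (total) term frequency.
    """
    if method == "random":
        return random.sample(list(stats), n)
    key = {
        "df":     lambda t: stats[t][0],
        "ctf":    lambda t: stats[t][1],
        "avg_tf": lambda t: stats[t][1] / stats[t][0],  # ctf / df
    }[method]
    return sorted(stats, key=key, reverse=True)[:n]

stats = {"the": (100, 900), "fox": (10, 30), "party": (5, 40)}
print(select_query_terms(stats, "avg_tf", 2))  # ['the', 'party']
```

Note that df and ctf favor frequent (often stopword-like) terms, while ctf/df favors terms that are dense within the documents that contain them.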
Stopping Criteria
Not a well-studied problem; mostly approached heuristically:
- Stop after downloading a fixed number of unique documents.
- Stop after sampling some fraction of the documents (say 1%); this is difficult in an uncooperative setup, since the collection size is unknown.
Resource Selection
Goal: identify and rank relevant resources for a given query.
Why important? It reduces query response time and network congestion.
Resource Selection: Methods
Large Document Model: resources are selected based on the similarity between the resource descriptor (treated as a single unit) and a given query.
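One way to compute that similarity, sketched here under the assumption of a simple bag-of-words representation (the slides do not fix a particular similarity measure), is cosine similarity between the query and the descriptor treated as one large document:

```python
import math
from collections import Counter

def descriptor_similarity(query, descriptor_text):
    """Cosine similarity between a query and a resource descriptor
    treated as a single large bag-of-words document."""
    q = Counter(query.split())
    d = Counter(descriptor_text.split())
    dot = sum(q[t] * d[t] for t in q)
    norm = (math.sqrt(sum(v * v for v in q.values()))
            * math.sqrt(sum(v * v for v in d.values())))
    return dot / norm if norm else 0.0

# Rank resources by descriptor similarity to the query.
descriptors = {"A": "music jazz concert", "B": "python code compiler"}
ranked = sorted(descriptors, reverse=True,
                key=lambda r: descriptor_similarity("jazz concert", descriptors[r]))
# ranked[0] == "A"
```

Weighting the descriptor terms by their sampled statistics (as in CORI-style selection) is a common refinement of this idea.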
Resource Selection: Methods
Small document model Resources are selected based on the ranking of their documents for a given query
Small-Document Model
1. Rank documents from the resource descriptors for a given user's query.
2. Consider the top n documents from this ranking.
3. Calculate a score for a resource R based on its documents in the top n.
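A minimal instance of step 3, assuming the simplest choice of resource score (a count of the resource's sampled documents in the top n, as in ReDDE-style selection; the slides leave the exact scoring function open):

```python
def score_resources(ranked_docs, n):
    """Small-document model: score each resource by how many of its
    sampled documents appear in the top-n of a centralized ranking.

    ranked_docs -- list of (doc_id, resource_id) pairs, sorted by
                   relevance to the query, built from the descriptions.
    """
    scores = {}
    for _, resource in ranked_docs[:n]:       # only the top-n documents count
        scores[resource] = scores.get(resource, 0) + 1
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

ranking = [("d1", "A"), ("d2", "B"), ("d3", "A"), ("d4", "C"), ("d5", "A")]
print(score_resources(ranking, 4))  # [('A', 2), ('B', 1), ('C', 1)]
```

Resources whose sampled documents dominate the top of the centralized ranking are selected first.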
Score Normalization and Result Merging
Objective: normalize and merge the results from multiple resources.
Score Normalization: Some Simple Methods
Min-max:
  normalized score = (original score - minimum score) / (maximum score - minimum score)
Z-score:
  normalized score = (original score - mean score) / (standard deviation of scores)
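Both methods translate directly into code (a small sketch; the population standard deviation is assumed for the z-score, and the degenerate all-equal-scores case is mapped to 0):

```python
import statistics

def minmax(scores):
    """Min-max normalization: map scores into [0, 1]."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)  # all scores equal: nothing to spread
    return [(s - lo) / (hi - lo) for s in scores]

def zscore(scores):
    """Z-score normalization: subtract the mean, divide by the std dev."""
    mean = statistics.mean(scores)
    sd = statistics.pstdev(scores)  # population standard deviation
    if sd == 0:
        return [0.0] * len(scores)
    return [(s - mean) / sd for s in scores]

print(minmax([2.0, 4.0, 6.0]))  # [0.0, 0.5, 1.0]
```

Once each resource's scores are on a common scale, the broker can merge the per-resource result lists into a single ranking.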
Thank you!!