Distributed Information Retrieval
Jiaul Paik
The “Search” (aka Information Retrieval)
The Search Engine Black Box
A query and a collection of documents go in; ranked results come out.
Search Engine: Inside
Documents pass through a document representation step and are stored in an index. A query passes through a query representation step; query processing then matches it against the index with a scoring function to produce the results.
Information Retrieval: Main Parts
Indexing: organize the data for faster query processing.
Scoring: estimate the usefulness of an item/document with respect to a query.
Index Structure
The index has two columns: a dictionary-ordered list of terms (aid, all, back, brown, come, dog, fox, good, jump, lazy, men, now, over, party, quick, their, time) and a postings file that lists, for each term, the IDs of the documents containing it (postings lists such as 1, 3 or 2, 4, 6, 8). The sample collection: 1: "quick aid", 2: "come jump", 3: "good party", ..., 10: "all over".
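As a minimal sketch (function and variable names are illustrative, not from the slides), an inverted index like the one above can be built by mapping each term to the sorted set of documents that contain it:

```python
from collections import defaultdict

def build_index(docs):
    """Build an inverted index: term -> sorted list of document IDs."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.split():
            postings[term].add(doc_id)
    # Dictionary-ordered terms, each with a sorted postings list.
    return {term: sorted(ids) for term, ids in sorted(postings.items())}

# The sample documents from the slide.
docs = {1: "quick aid", 2: "come jump", 3: "good party"}
index = build_index(docs)
# index["aid"] == [1], index["come"] == [2], index["party"] == [3]
```

At query time, the postings lists for the query terms are intersected or merged, which is why they are kept sorted.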
Scoring
Score = Function(x, y, ..., z), where typical inputs include:
- the number of documents in the collection
- the number of keywords present in a document
- the number of documents in the collection containing a keyword
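The slide does not name a specific scoring function, but TF-IDF is one common function built from exactly these quantities; a hedged sketch:

```python
import math

def tfidf_score(query_terms, doc_terms, num_docs, doc_freq):
    """Score a document for a query with a simple TF-IDF sum.

    num_docs -- number of documents in the collection
    doc_freq -- dict: term -> number of documents containing the term
    """
    score = 0.0
    for term in query_terms:
        tf = doc_terms.count(term)   # keyword occurrences in this document
        df = doc_freq.get(term, 0)   # documents containing the keyword
        if tf and df:
            score += tf * math.log(num_docs / df)  # rare terms weigh more
    return score

# A term found in only 2 of 10 documents contributes tf * log(10/2).
score = tfidf_score(["fox"], ["quick", "brown", "fox"], 10, {"fox": 2})
```

This is only one instantiation; real engines use variants such as BM25, which combine the same three statistics with length normalization.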
Two Architectures
Centralized architecture: a single machine holds the data and the index.
Distributed architecture: data and indexes are stored on several connected machines.
Distributed Information Retrieval
Major Components
1. Resource Description
2. Resource Selection
3. Result Merging
Distributed Information Retrieval: Architectures
1. Peer-to-Peer Architecture
2. Broker-Based Architecture
Peer to Peer Architecture
1. Indexes are located with the resources.
2. Some parts of the indexes are distributed to other resources.
3. Queries are distributed across the resources, and results are merged by the peer that originated the query (in the diagram, a query from Peer A fans out to Peers B, C, and D).
Broker-based Architecture
The user sends a query to the broker, which forwards it to the data hosts; each data host holds its own data and index.
Rest of the Talk: Broker-Based Architecture
Resource Description
The broker needs to know which resource contains what.
Goal: build a high-level representation of each federated resource.
Resource Description: Structure
A resource description conveys the content of a resource. It may include:
- the full content
- an index of the full content
- metadata
- statistics: term frequency, term rareness, average document length, ...
- other information
Distributed Information Retrieval: Co-operation
There are two settings in which to obtain a resource description:
- Cooperative environment: provides full access to documents and indices, and responds to queries.
- Uncooperative environment: documents and indices cannot be accessed directly; it only responds to queries.
Resource Description in Cooperative Environments
This is very simple, since the broker has full access to the collection:
- The broker could harvest the full collection(s) and handle queries locally (not a good idea!).
- A resource could provide the broker with information (a description) useful for retrieval.
Resource Description: Protocol
Stanford Protocol Proposal for Internet Retrieval and Search (STARTS). For each resource, STARTS stores metadata and a content summary:
- query language (filter expressions, ranking expressions, ...)
- statistics (term frequency, document frequency, number of documents)
- score range
- stopword list (common words: in, the, a, ...)
- etc.
Resource Description in Un-cooperative Environments
This is far more difficult: the broker has no access to the full collections, metadata, or content summaries, and must acquire this information without any help from the resource. Important information to acquire for the resource description: collection size, term statistics, document scores. Note: the required information can only be estimated, NOT computed exactly.
Getting Resource Description: Query-based Sampling
1. Randomly select a word and send it as a query to the collection.
2. Get the top k (say 5) documents.
3. Extract the words and store them in a table.
4. Take another word from the table as the next query and repeat steps 2-4.
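The loop above can be sketched as follows; `search` stands in for the resource's query interface and is an assumed function, not part of any specific API:

```python
import random

def query_based_sampling(search, seed_word, num_queries=50, top_k=5):
    """Estimate a resource's vocabulary by query-based sampling.

    search(word, k) -- assumed callable: sends a one-word query to the
    resource and returns the text of its top-k documents.
    """
    vocabulary = set()   # the table of extracted words
    seen_docs = set()
    word = seed_word
    for _ in range(num_queries):
        for doc in search(word, top_k):           # step 2: top-k documents
            seen_docs.add(doc)
            vocabulary.update(doc.split())        # step 3: extract words
        if not vocabulary:
            break
        word = random.choice(sorted(vocabulary))  # step 4: next query term
    return vocabulary, seen_docs
```

Here the next query term is drawn from the sampled documents themselves (the LRD approach discussed below); drawing it from an external dictionary instead gives the ORD approach.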
Query-based Sampling: Issues
How to select queries? When to stop?
Selecting Sampling Queries: where from?
- Other Resource Description (ORD): select terms from a reference dictionary.
- Learned Resource Description (LRD): select terms from the retrieved documents based on term statistics.
Selecting Sampling Queries: how?
Queries can be selected by:
- random selection
- document frequency (df)
- collection term frequency (ctf)
- average term frequency (ctf/df)
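The four strategies differ only in the ranking key applied to the learned term statistics. A small sketch (the `stats` layout is an assumption for illustration):

```python
import random

def select_query_terms(stats, method="avg_tf", n=10):
    """Rank candidate sampling queries from learned term statistics.

    stats -- dict: term -> (df, ctf), where df is document frequency
             and ctf is collection (total) term frequency.
    """
    if method == "random":
        return random.sample(list(stats), n)
    key = {
        "df":     lambda t: stats[t][0],
        "ctf":    lambda t: stats[t][1],
        "avg_tf": lambda t: stats[t][1] / stats[t][0],  # ctf / df
    }[method]
    return sorted(stats, key=key, reverse=True)[:n]

stats = {"the": (100, 900), "fox": (10, 30), "party": (5, 40)}
print(select_query_terms(stats, "avg_tf", 2))  # ['the', 'party']
```

Note that df and ctf favor frequent (often stopword-like) terms, while ctf/df favors terms that are dense within the documents that contain them.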
Stopping Criteria
Not a well-studied problem; mostly approached heuristically:
- Stop after downloading a fixed number of unique documents.
- Stop after sampling some fraction of the documents (say 1%); this is difficult in an uncooperative setup, since the collection size is unknown.
Resource Selection
Goal: identify and rank relevant resources for a given query.
Why important? It reduces query response time and network congestion.
Resource Selection: Methods
Large Document Model: resources are selected based on the similarity between the resource descriptor (treated as a single unit) and a given query.
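One way to compute that similarity, sketched here under the assumption of a simple bag-of-words representation (the slides do not fix a particular similarity measure), is cosine similarity between the query and the descriptor treated as one large document:

```python
import math
from collections import Counter

def descriptor_similarity(query, descriptor_text):
    """Cosine similarity between a query and a resource descriptor
    treated as a single large bag-of-words document."""
    q = Counter(query.split())
    d = Counter(descriptor_text.split())
    dot = sum(q[t] * d[t] for t in q)
    norm = (math.sqrt(sum(v * v for v in q.values()))
            * math.sqrt(sum(v * v for v in d.values())))
    return dot / norm if norm else 0.0

# Rank resources by descriptor similarity to the query.
descriptors = {"A": "music jazz concert", "B": "python code compiler"}
ranked = sorted(descriptors, reverse=True,
                key=lambda r: descriptor_similarity("jazz concert", descriptors[r]))
# ranked[0] == "A"
```

Weighting the descriptor terms by their sampled statistics (as in CORI-style selection) is a common refinement of this idea.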
Resource Selection: Methods
Small document model Resources are selected based on the ranking of their documents for a given query
Small-Document Model
1. Rank documents from the resource descriptors for a given user's query.
2. Consider the top n documents from this ranking.
3. Calculate a score for a resource R based on its documents in the top n.
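A minimal instance of step 3, assuming the simplest choice of resource score (a count of the resource's sampled documents in the top n, as in ReDDE-style selection; the slides leave the exact scoring function open):

```python
def score_resources(ranked_docs, n):
    """Small-document model: score each resource by how many of its
    sampled documents appear in the top-n of a centralized ranking.

    ranked_docs -- list of (doc_id, resource_id) pairs, sorted by
                   relevance to the query, built from the descriptions.
    """
    scores = {}
    for _, resource in ranked_docs[:n]:       # only the top-n documents count
        scores[resource] = scores.get(resource, 0) + 1
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

ranking = [("d1", "A"), ("d2", "B"), ("d3", "A"), ("d4", "C"), ("d5", "A")]
print(score_resources(ranking, 4))  # [('A', 2), ('B', 1), ('C', 1)]
```

Resources whose sampled documents dominate the top of the centralized ranking are selected first.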
Score Normalization and Result Merging
Objective: normalize and merge the results from multiple resources.
Score Normalization: Some Simple Methods
Min-max:
  normalized score = (original score - minimum score) / (maximum score - minimum score)
Z-score:
  normalized score = (original score - mean score) / (standard deviation of scores)
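Both methods translate directly into code (a small sketch; the population standard deviation is assumed for the z-score, and the degenerate all-equal-scores case is mapped to 0):

```python
import statistics

def minmax(scores):
    """Min-max normalization: map scores into [0, 1]."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)  # all scores equal: nothing to spread
    return [(s - lo) / (hi - lo) for s in scores]

def zscore(scores):
    """Z-score normalization: subtract the mean, divide by the std dev."""
    mean = statistics.mean(scores)
    sd = statistics.pstdev(scores)  # population standard deviation
    if sd == 0:
        return [0.0] * len(scores)
    return [(s - mean) / sd for s in scores]

print(minmax([2.0, 4.0, 6.0]))  # [0.0, 0.5, 1.0]
```

Once each resource's scores are on a common scale, the broker can merge the per-resource result lists into a single ranking.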
Thank you!!