Chapter 5: Query Operations. Hassan Bashiri, April 2009.


1 Chapter 5: Query Operations. Hassan Bashiri, April 2009.

2 Cross-Language Retrieval. What is CLIR? Users enter their query in one language and the search engine retrieves relevant documents in other languages (for example, an English query against French documents).

3 A taxonomy of cross-language text retrieval:
Cross-Language Text Retrieval
- Controlled Vocabulary
- Free Text
  - Knowledge-based: Ontology-based, Dictionary-based, Thesaurus-based
  - Corpus-based
    - Parallel corpora: Term-aligned, Sentence-aligned, Document-aligned, Unaligned
    - Comparable corpora
Translation strategies: Query Translation vs. Document Translation (Text Translation, Vector Translation)

4 Query Language. Visual languages. Example: a library is shown on the screen and the user acts on it directly (takes books, opens catalogs, etc.). Are these better than Boolean queries such as "I need books by Cervantes AND Lope de Vega"?

5 IR Interface: query interface, selection interface, examination interface, document delivery.

6 Retrieval System Model (diagram): the user performs query formulation; the system performs detection against an index built by indexing the docs; then selection, examination, and delivery.

7 Starfield display.

8 Query Formulation
- Without detailed knowledge of the collection and the retrieval environment, it is difficult to formulate queries well designed for retrieval; good retrieval usually requires several query formulations.
- The first formulation is a naive attempt to retrieve relevant information.
- The documents initially retrieved are examined for relevance information and used to improve the query formulation so that additional relevant documents are retrieved.
- Query reformulation: expanding the original query with new terms, and reweighting the terms in the expanded query.

9 Three approaches
- Approaches based on feedback from users (relevance feedback)
- Approaches based on information derived from the set of initially retrieved documents (local analysis)
- Approaches based on global information derived from the document collection (global analysis)

10 User Relevance Feedback
- The most popular query reformulation strategy.
- Cycle: the user is presented with a list of retrieved documents and marks those which are relevant. In practice only the top-ranked documents are examined, and the feedback can be incremental.
- Select important terms from the documents assessed relevant by the user, and enhance the importance of these terms in a new query.
- Expected effect: the new query moves towards the relevant documents and away from the non-relevant documents.
- For instance: Q1 = "US Open" may become Q2 = "US Open Robocup".

11 User Relevance Feedback: two basic techniques
- Query expansion: add new terms taken from relevant documents.
- Term reweighting: modify the term weights based on the user's relevance judgements.

12 Query Expansion and Term Reweighting for the Vector Model
Basic idea:
- Relevant documents resemble each other.
- Non-relevant documents have term-weight vectors which are dissimilar from those of the relevant documents.
- The reformulated query is moved closer to the term-weight vector space of the relevant documents.

13 If the complete set C_r of relevant documents were known in advance, the optimal query vector would be
q_opt = (1/|C_r|) * sum_{d_j in C_r} d_j − (1/(N − |C_r|)) * sum_{d_j not in C_r} d_j

14 Query Expansion and Term Reweighting for the Vector Model (continued)
- C_r: the set of relevant documents among all documents in the collection
- D_r: the set of relevant documents among the retrieved documents, as identified by the user
- D_n: the set of non-relevant documents among the retrieved documents

15 User Relevance Feedback: Vector Space Model
- D_r: set of relevant documents, as identified by the user, among the retrieved documents
- D_n: set of non-relevant documents among the retrieved documents
- C_r: set of relevant documents among all documents in the collection
- |D_r|, |D_n|, |C_r|: number of documents in these sets, respectively
- alpha, beta, gamma: tuning constants

16 Calculating the modified query q_m:
- Standard Rocchio: q_m = alpha*q + (beta/|D_r|) * sum_{d_j in D_r} d_j − (gamma/|D_n|) * sum_{d_j in D_n} d_j
- Ide Regular: q_m = alpha*q + beta * sum_{d_j in D_r} d_j − gamma * sum_{d_j in D_n} d_j
- Ide Dec-Hi: q_m = alpha*q + beta * sum_{d_j in D_r} d_j − gamma * max_nonrelevant(d_j), where max_nonrelevant(d_j) is the highest-ranked non-relevant document
alpha, beta, gamma are tuning constants (usually beta > gamma); alpha = 1 (Rocchio, 1971); alpha = beta = gamma = 1 (Ide, 1971); gamma = 0 gives positive feedback only. The three techniques yield similar performance.
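The Rocchio reformulation above can be sketched in a few lines. This is a minimal illustration: the function name, the toy vectors, and the default constants are choices of this sketch, not values from the slides.

```python
import numpy as np

def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Standard Rocchio: q_m = alpha*q + (beta/|Dr|)*sum(Dr) - (gamma/|Dn|)*sum(Dn).
    Setting gamma = 0 gives positive feedback only."""
    q_m = alpha * np.asarray(query, dtype=float)
    if relevant:
        q_m = q_m + beta * np.mean(relevant, axis=0)
    if nonrelevant:
        q_m = q_m - gamma * np.mean(nonrelevant, axis=0)
    return np.maximum(q_m, 0.0)  # negative term weights are usually clipped to zero

q  = [1.0, 0.0, 0.0]      # original query vector
dr = [[1.0, 1.0, 0.0]]    # documents the user judged relevant
dn = [[0.0, 0.0, 1.0]]    # documents judged non-relevant
print(rocchio(q, dr, dn)) # moves toward dr and away from dn
```

Ide Regular replaces the two means by plain sums, and Ide Dec-Hi subtracts only the highest-ranked non-relevant document instead of the whole set.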

17 Analysis
- Advantages: simplicity; good results.
- Disadvantage: no optimality criterion is adopted.

18 User Relevance Feedback: Probabilistic Model
The similarity of a document d_j to a query q:
sim(d_j, q) ∝ sum_i w_{i,q} * w_{i,j} * [ log( P(k_i|R) / (1 − P(k_i|R)) ) + log( (1 − P(k_i|¬R)) / P(k_i|¬R) ) ]
- P(k_i|R): the probability of observing the term k_i in the set R of relevant documents
- P(k_i|¬R): the probability of observing the term k_i in the set of non-relevant documents
Initial search: P(k_i|R) = 0.5 and P(k_i|¬R) = n_i / N, where n_i is the number of documents containing k_i and N is the collection size.

19 User Relevance Feedback: Probabilistic Model
Feedback search: P(k_i|R) = |D_{r,i}| / |D_r| and P(k_i|¬R) = (n_i − |D_{r,i}|) / (N − |D_r|), where D_{r,i} is the subset of D_r whose documents contain k_i.

20 User Relevance Feedback: Probabilistic Model
In the feedback search only the term weights are recomputed; no query expansion occurs.

21 For small values of |D_r| and |D_{r,i}| (e.g., |D_r| = 1, |D_{r,i}| = 0) these estimates break down, so an adjustment factor is added:
- Alternative 1: P(k_i|R) = (|D_{r,i}| + 0.5) / (|D_r| + 1), P(k_i|¬R) = (n_i − |D_{r,i}| + 0.5) / (N − |D_r| + 1)
- Alternative 2: P(k_i|R) = (|D_{r,i}| + n_i/N) / (|D_r| + 1), P(k_i|¬R) = (n_i − |D_{r,i}| + n_i/N) / (N − |D_r| + 1)
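The probabilistic reweighting with the 0.5 adjustment (Alternative 1 above) can be sketched as follows; the function name and the example numbers are illustrative, not from the slides.

```python
import math

def feedback_weight(n_i, N, d_r, d_ri):
    """Term weight after feedback, using the 0.5 adjustment (Alternative 1).
    n_i: documents containing k_i; N: collection size;
    d_r = |D_r| (docs judged relevant); d_ri = |D_{r,i}| (relevant docs with k_i)."""
    p_rel = (d_ri + 0.5) / (d_r + 1)             # P(k_i | R)
    p_non = (n_i - d_ri + 0.5) / (N - d_r + 1)   # P(k_i | not R)
    return math.log(p_rel / (1 - p_rel)) + math.log((1 - p_non) / p_non)

# A rare term present in 1 of 2 judged-relevant documents gets a high weight;
# a common term present in none of them gets a negative one.
print(feedback_weight(n_i=5,   N=1000, d_r=2, d_ri=1))
print(feedback_weight(n_i=400, N=1000, d_r=2, d_ri=0))
```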

22 Analysis
- Advantages: the feedback process is directly related to the derivation of new weights for query terms; the term reweighting is optimal.
- Disadvantages: document term weights are not considered during feedback; no query expansion is used.

23 Query Expansion
- Global: Similarity Thesaurus; Statistical Thesaurus
- Local: Context Analysis; Clustering (Association Clustering, Metric Clustering, Scalar Clustering)

24 Automatic Local Analysis
- User relevance feedback: known relevant documents contain terms which can be used to describe a larger cluster of relevant documents, with assistance from the user (clustering).
- Automatic analysis: obtain a description (in terms of index terms) for a larger cluster of relevant documents automatically.
  - Global strategy: a thesaurus-like structure is built from all documents before querying.
  - Local strategy: terms from the documents retrieved for a given query are selected at query time.

25 Query Expansion Based on a Similarity Thesaurus
Query expansion is done in three steps:
1. Represent the query in the concept space used for the representation of the index terms.
2. Based on the global similarity thesaurus, compute a similarity sim(q, k_v) between each term k_v correlated to the query terms and the whole query q.
3. Expand the query with the top r ranked terms according to sim(q, k_v).

26 Query Expansion – Step 1
To the query q is associated a vector q in the term-concept space:
q = sum_{k_i in q} w_{i,q} * k_i
where w_{i,q} is a weight associated with the index term-query pair [k_i, q].

27 Query Expansion – Step 2
Compute a similarity sim(q, k_v) between each term k_v and the user query q:
sim(q, k_v) = q · k_v = sum_{k_i in q} w_{i,q} * c_{i,v}
where c_{i,v} is the correlation factor between k_i and k_v.

28 Query Expansion – Step 3
Add the top r ranked terms according to sim(q, k_v) to the original query q to form the expanded query q'. To each expansion term k_v in q' is assigned a weight w_{v,q'} given by
w_{v,q'} = sim(q, k_v) / sum_{k_i in q} w_{i,q}
The expanded query q' is then used to retrieve new documents for the user.
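Steps 2 and 3 can be sketched over a precomputed thesaurus of correlation factors. All names and numbers here are toy values for illustration, not taken from the slides.

```python
def expand_query(query_weights, c, r=2):
    """query_weights maps query terms ki to w_{i,q};
    c maps term pairs (ku, kv) to their correlation factor c_{u,v}."""
    # Step 2: sim(q, kv) = sum_{ki in q} w_{i,q} * c_{i,v}
    candidates = {v for (_, v) in c if v not in query_weights}
    sim = {kv: sum(w * c.get((ki, kv), 0.0) for ki, w in query_weights.items())
           for kv in candidates}
    # Step 3: add the top-r terms, each weighted w_{v,q'} = sim(q,kv) / sum w_{i,q}
    norm = sum(query_weights.values())
    expanded = dict(query_weights)
    for kv in sorted(sim, key=sim.get, reverse=True)[:r]:
        expanded[kv] = sim[kv] / norm
    return expanded

thesaurus = {("bank", "finance"): 0.8, ("bank", "river"): 0.3, ("bank", "loan"): 0.6}
print(expand_query({"bank": 1.0}, thesaurus, r=2))
```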

29 Query Expansion – Sample
Doc1 = D, D, A, B, C, A, B, C
Doc2 = E, C, E, A, A, D
Doc3 = D, C, B, B, D, A, B, C, A
Doc4 = A
c(A,A) =   c(A,C) =   c(A,D) =   c(D,E) =   c(B,E) =   c(E,E) =

30 Query Expansion – Sample
Query: q = A E E
sim(q,A) =   sim(q,C) =   sim(q,D) =   sim(q,B) =   sim(q,E) =
New query: q' = A C D E E
w(A,q') = 6.88   w(C,q') = 6.75   w(D,q') = 6.75   w(E,q') =

31 Query Expansion
Methods of local analysis extract information from the local set of documents retrieved in order to expand the query. An alternative is to expand the query using information from the whole set of documents in the collection.

32 Local Clustering: definitions
- V(s): a non-empty subset of words which are grammatical variants of each other, e.g., {polish, polishing, polished}. A canonical form s of V(s) is called a stem, e.g., polish.
- Local document set D_l: the set of documents retrieved for a given query.
- Local vocabulary V_l (stems S_l): the set of all distinct words (stems) in the local document set.

33 Local Clustering: basic concept
Expand the query with terms correlated to the query terms; the correlated terms are those present in local clusters built from the local document set.
Local clusters:
- Association clusters: based on the co-occurrence of pairs of terms in documents.
- Metric clusters: based on the distance between two terms inside a document.
- Scalar clusters: terms with similar neighborhoods have some synonymy relationship.

34 Association Clusters
Idea: based on the co-occurrence of stems (or terms) inside documents.
Association matrix:
- f_{si,j}: the frequency of stem s_i in document d_j (d_j in D_l)
- m = (f_{si,j}): an association matrix with |S_l| rows and |D_l| columns
- s = m * m^T: a local stem-stem association matrix

35 Each element s_{u,v} of the matrix expresses a correlation c_{u,v} between the stems s_u and s_v:
- Unnormalized: s_{u,v} = c_{u,v} = sum_{d_j in D_l} f_{su,j} * f_{sv,j}
- Normalized: s_{u,v} = c_{u,v} / (c_{u,u} + c_{v,v} − c_{u,v})
A local association cluster around the stem s_u: take the u-th row of the matrix and return the set S_u(n) of the n largest values s_{u,v} (u ≠ v).
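An association cluster over a small local document set can be sketched as follows; the documents and stem names are toy data, not from the slides.

```python
from collections import Counter

# Local document set (already stemmed); toy data for illustration.
docs = [["polish", "wax", "polish"], ["wax", "car"], ["polish", "car", "wax"]]
freq = [Counter(d) for d in docs]   # f_{s,j}: stem frequency per document

def c(u, v):
    """Unnormalized association correlation c_{u,v} = sum_j f_{u,j} * f_{v,j}."""
    return sum(f[u] * f[v] for f in freq)

def s(u, v):
    """Normalized correlation s_{u,v} = c_{u,v} / (c_{u,u} + c_{v,v} - c_{u,v})."""
    return c(u, v) / (c(u, u) + c(v, v) - c(u, v))

def association_cluster(u, vocab, n=1):
    """The n stems most strongly associated with u (the u-th matrix row)."""
    return sorted((v for v in vocab if v != u), key=lambda v: s(u, v), reverse=True)[:n]

vocab = {w for d in docs for w in d}
print(association_cluster("polish", vocab, n=2))
```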

36 Metric Clusters
Idea: consider the distance between two terms in the computation of their correlation factor.
Local stem-stem metric correlation matrix:
- r(k_i, k_j): the number of words between the keywords k_i and k_j in the same document
- c_{u,v} = sum_{k_i in V(s_u)} sum_{k_j in V(s_v)} 1 / r(k_i, k_j): metric correlation between the stems s_u and s_v

37 Metric Clusters (continued)
- Unnormalized: s_{u,v} = c_{u,v}
- Normalized: s_{u,v} = c_{u,v} / (|V(s_u)| * |V(s_v)|)
A local metric cluster around the stem s_u: take the u-th row of the matrix and return the set S_u(n) of the n largest values s_{u,v} (u ≠ v).
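The metric correlation for a single document can be sketched as below. This simplified version treats each stem as a single keyword and uses word-position difference as the distance r; the example sentence is illustrative.

```python
def metric_corr(doc, u, v):
    """c_{u,v} for one document: sum of 1/r over all occurrence pairs of u and v,
    where r is the word distance between the two occurrences."""
    pos_u = [i for i, w in enumerate(doc) if w == u]
    pos_v = [i for i, w in enumerate(doc) if w == v]
    return sum(1.0 / abs(i - j) for i in pos_u for j in pos_v if i != j)

doc = ["query", "expansion", "helps", "query", "reformulation"]
print(metric_corr(doc, "query", "expansion"))   # 1/1 + 1/2 = 1.5
```

Terms that co-occur close together contribute more to the correlation than terms that merely share a document.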

38 Scalar Clusters
Idea: two stems with similar neighborhoods have some synonymy relationship; the relationship is indirect, induced by the neighborhood.
Scalar association matrix: the row of the term-term correlation matrix corresponding to a stem forms its neighborhood. Let s_u and s_v be the rows for stems s_u and s_v; then
s_{u,v} = (s_u · s_v) / (|s_u| * |s_v|)
A local scalar cluster around the stem s_u: take the u-th row and return the set S_u(n) of the n largest values s_{u,v} (u ≠ v).
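The scalar correlation is just the cosine between two matrix rows; a minimal sketch, with toy row values of my own choosing:

```python
import math

def scalar_corr(row_u, row_v):
    """s_{u,v} = (s_u . s_v) / (|s_u| * |s_v|): cosine of the neighborhood rows."""
    dot = sum(a * b for a, b in zip(row_u, row_v))
    norm = math.sqrt(sum(a * a for a in row_u)) * math.sqrt(sum(b * b for b in row_v))
    return dot / norm

row_u = [5.0, 3.0, 1.0]   # correlations of stem u with every stem (its neighborhood)
row_v = [3.0, 3.0, 2.0]   # a stem with a similar neighborhood
print(scalar_corr(row_u, row_v))
```

A value near 1 means the two stems co-occur with roughly the same terms, which is what induces the (indirect) synonymy relationship.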

39 Interactive Search Formulation
Neighbors of a query term s_v: terms s_u belonging to clusters associated with s_v, i.e., s_u in S_v(n). Such an s_u is called a searchonym of s_v.

40 Similarity Thesaurus
The similarity thesaurus is based on term-to-term relationships rather than on a matrix of co-occurrence. These relationships are not derived directly from the co-occurrence of terms inside documents; they are obtained by considering the terms as concepts in a concept space. In this concept space each term is indexed by the documents in which it appears: terms assume the original role of documents, while documents are interpreted as indexing elements.

41 Similarity Thesaurus
Inverse term frequency for document d_j:
itf_j = log(t / t_j)
where
- t: number of terms in the collection
- N: number of documents in the collection
- f_{i,j}: frequency of occurrence of the term k_i in the document d_j
- t_j: number of distinct index terms in document d_j (its vocabulary)
- itf_j: inverse term frequency of document d_j
To each term k_i is associated a vector k_i = (w_{i,1}, w_{i,2}, ..., w_{i,N}).

42 Similarity Thesaurus (continued)
where w_{i,j} is a weight associated with the index term-document pair [k_i, d_j], computed as
w_{i,j} = ( (0.5 + 0.5 * f_{i,j} / maxf_i) * itf_j ) / sqrt( sum_{l=1}^{N} ( (0.5 + 0.5 * f_{i,l} / maxf_i) * itf_l )^2 )
where maxf_i is the maximum frequency of k_i over the documents.

43 Similarity Thesaurus
The relationship between two terms k_u and k_v is computed as a correlation factor c_{u,v}:
c_{u,v} = k_u · k_v = sum_{d_j} w_{u,j} * w_{v,j}
The global similarity thesaurus is built through the computation of the correlation factor c_{u,v} for each pair of index terms [k_u, k_v] in the collection.
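The weighting and correlation above can be sketched on a toy collection. The exact normalization here follows one common reading of the formula (unit-length term vectors, with the maximum frequency taken per term over all documents) and may differ in detail from the original slides.

```python
import math

docs = [["a", "b", "a"], ["b", "c"], ["a", "c", "c"]]   # toy collection
t = len({w for d in docs for w in d})                   # number of terms in the collection
itf = [math.log(t / len(set(d))) for d in docs]         # itf_j = log(t / t_j)

def raw(term, j):
    f = docs[j].count(term)
    if f == 0:
        return 0.0
    max_f = max(d.count(term) for d in docs)            # max frequency of the term
    return (0.5 + 0.5 * f / max_f) * itf[j]

def w(term, j):
    """w_{i,j}: raw weight normalized so each term-concept vector has unit length."""
    denom = math.sqrt(sum(raw(term, l) ** 2 for l in range(len(docs))))
    return raw(term, j) / denom

def corr(u, v):
    """c_{u,v}: inner product of the two term-concept vectors."""
    return sum(w(u, j) * w(v, j) for j in range(len(docs)))

print(corr("a", "c"))   # terms sharing documents get a positive correlation
print(corr("a", "a"))   # a term correlates perfectly with itself
```

With unit-length vectors, c_{u,v} is exactly the cosine between the two term-concept vectors, which is what makes the thesaurus a similarity measure.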

44 Step 1: represent the query in the concept space used for the representation of the index terms. Step 2: based on the global similarity thesaurus, compute a similarity sim(q, k_v) between each term k_v correlated to the query terms and the whole query q.

45 Step 3: expand the query with the top r ranked terms according to sim(q, k_v).

46 Similarity Thesaurus
This computation is expensive, but the global similarity thesaurus has to be computed only once and can be updated incrementally.

