
1 Server Ranking for Distributed Text Retrieval Systems on the Internet (Yuwono and Lee) presented by Travis Emmitt

2 General Architecture [Diagram: users (User_X, User_Y, User_Z) issue queries (Query_A, Query_B, Query_C) to brokers (Broker_1 ... Broker_M); brokers are clones, created when needed, and need DF info from the collections; the collections (Coll_1, Coll_2, Coll_3 ... Coll_N) each hold documents relevant to the different queries.]

3 Terminology
- cooperating autonomous index servers
- collection fusion problem
- collections = databases = sites
- broker servers =?= meta search engines
  - a collection of server-servers
- index servers = collection servers
- documents = information resources = texts

4 More Terminology
- words
  - before stemming and stopping
  - example: { the, computer, computing, French }
- terms
  - after stemming and stopping
  - example: { comput, French }
- keywords
  - meaning varies depending upon context

5 Subscripts
- Often see TF_{i,j} and IDF_j within the context of a single collection.
  - In a multiple-collection environment, this notational shorthand can lead to ambiguity.
  - Should instead use TF_{h,i,j} and IDF_{h,j}.
- h, i, and j are identifiers [possibly integers]:
  - c_h is a collection
  - doc_{h,i} is a document in collection c_h
  - t_{h,j} is a term in collection c_h

6 More Terminology
- N_h = number of documents in collection c_h
- V_h = vocabulary / set of all terms in c_h
- M_h = number of terms in collection c_h
  - M_q = number of terms in query q
  - M_h = |V_h|

7 TF_{h,i,j} = Term Frequency
- Definition: number of times term t_{h,j} occurs in document doc_{h,i}
- gGLOSS assumes TF_{h,i,j} = 0 or avgTF_{h,j}
  - avgTF_{h,j} = Σ_{i=1}^{N_h} TF_{h,i,j} / N_h
  - TFs are assumed to be identical for all documents in collection c_h that contain one or more occurrences of term t_{h,j}
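A minimal sketch of these two counts, assuming a toy in-memory collection of already-stemmed term lists (the `collection` data and function names below are illustrative, not from the paper):

```python
from collections import Counter

# Hypothetical toy collection c_h: each document is a list of already-stemmed terms.
collection = [
    ["comput", "french", "comput"],
    ["french", "cat"],
    ["comput"],
]

def tf(doc, term):
    """TF_{h,i,j}: number of times `term` occurs in document `doc`."""
    return Counter(doc)[term]

def avg_tf(collection, term):
    """avgTF_{h,j} = sum_i TF_{h,i,j} / N_h (the gGLOSS approximation)."""
    n_h = len(collection)
    return sum(tf(doc, term) for doc in collection) / n_h

print(tf(collection[0], "comput"))   # 2
print(avg_tf(collection, "comput"))  # (2 + 0 + 1) / 3 = 1.0
```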

8 TF_{h,i,max} = Maximum Term Frequency
- t_{h,i,max} = the term occurring most frequently in document doc_{h,i}
- TF_{h,i,max} = number of times that term t_{h,i,max} occurs in document doc_{h,i}
- Example: doc_{h,i} = "Cat cat dog cat cat"
  - t_{h,i,max} = "cat"
  - TF_{h,i,max} = 4

9 IDF_{h,j} = Inverse Document Frequency
- DF_{h,j} = document frequency
  - number of docs in collection c_h containing term t_{h,j}
- IDF_{h,j} = 1 / DF_{h,j}
  - the literal interpretation of "inverse"
- IDF_{h,j} = log(N_h / DF_{h,j})
  - how it's actually used
  - a normalization technique
- Note: term t_{h,j} must appear in at least one document in the collection, or DF_{h,j} will be 0 and IDF_{h,j} will be undefined
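A small sketch of DF and the log-form IDF under the same toy-collection assumption (the data and names are illustrative):

```python
import math

collection = [
    ["comput", "french", "comput"],
    ["french", "cat"],
    ["comput"],
]

def df(collection, term):
    """DF_{h,j}: number of documents in the collection containing `term`."""
    return sum(1 for doc in collection if term in doc)

def idf(collection, term):
    """IDF_{h,j} = log(N_h / DF_{h,j}); undefined if DF is 0, so we raise."""
    n_h = len(collection)
    d = df(collection, term)
    if d == 0:
        raise ValueError("term must appear in at least one document")
    return math.log(n_h / d)

print(df(collection, "comput"))   # 2
print(idf(collection, "comput"))  # log(3 / 2) ≈ 0.405
```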

10 W_{h,i,j}(scheme) = Term Weight
- Definition: the "weight" assigned to term t_{h,j} in document doc_{h,i} by a weighting scheme
- W_{q,j}(scheme) = the weight assigned to term t_{q,j} in query q by a weighting scheme
  - We drop one subscript b/c queries don't belong to collections, unless you consider the set of queries to be a collection in itself [no one seems to do this]
- Note: for single-term queries, weights might suffice

11 W_{h,i,j}(atn)
- "atn" is a code representing choices made during a three-part calculation process [a,t,n]
- X = 0.5 + 0.5 * TF_{h,i,j} / TF_{h,i,max} -- the TF part
- Y = log(N_h / DF_{h,j}) -- the IDF part
- W_{h,i,j}(atn) = X * Y
- Note: TF_{h,i,max} might be the maximum term frequency in doc_{h,i} with the added constraint that the max term must occur in the query. If so, then X is dependent upon query composition and must therefore wait until query time to be calculated.
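A sketch of the atn weight as laid out above; the function name and arguments are hypothetical, and whether TF_max is restricted to query terms is left to the caller:

```python
import math

def w_atn(tf, tf_max, n_h, df):
    """W_{h,i,j}(atn) = (0.5 + 0.5 * TF/TF_max) * log(N_h / DF_{h,j})."""
    x = 0.5 + 0.5 * tf / tf_max     # the TF part ("a")
    y = math.log(n_h / df)          # the IDF part ("t"); natural log used here
    return x * y                    # no normalization ("n")

# Example: term occurs 3 times, the max term occurs 4 times,
# 1000 docs in the collection, 50 of them contain the term.
print(w_atn(tf=3, tf_max=4, n_h=1000, df=50))  # ≈ 0.875 * 3.00 ≈ 2.62
```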

12 W_{h,i,j}(atc)
- X = 0.5 + 0.5 * TF_{h,i,j} / TF_{h,i,max} -- the TF part
- Y = log(N_h / DF_{h,j}) -- the IDF part
- Z = sqrt( Σ_{k=1}^{M_h} X_k^2 * Y_k^2 ) -- vector-length normalization over the document's terms
- W_{h,i,j}(atc) = X * Y / Z
- atc is atn with vector-length normalization
  - atc is better for comparing long documents
  - atn is better for comparing short documents, and is cheaper to calculate
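A sketch of atc, which divides each atn weight by the document's vector length; the per-document statistics passed in (term -> (TF, DF)) are a made-up representation, not the paper's data structures:

```python
import math

def w_atn_parts(tf, tf_max, n_h, df):
    """The X * Y product shared by atn and atc."""
    x = 0.5 + 0.5 * tf / tf_max
    y = math.log(n_h / df)
    return x * y

def w_atc(doc_stats, n_h, target_term):
    """doc_stats: {term: (tf, df)} for every term in the document.
    W(atc) = W(atn) / sqrt(sum over all document terms of W(atn)^2)."""
    tf_max = max(tf for tf, _ in doc_stats.values())
    weights = {t: w_atn_parts(tf, tf_max, n_h, df) for t, (tf, df) in doc_stats.items()}
    z = math.sqrt(sum(w * w for w in weights.values()))   # vector-length normalization
    return weights[target_term] / z

# Hypothetical document with three terms in a 1000-document collection.
doc_stats = {"cat": (4, 200), "dog": (1, 300), "comput": (2, 50)}
print(w_atc(doc_stats, n_h=1000, target_term="comput"))
```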

13 Query Time
- TFs, IDFs, and [possibly] Ws can be calculated prior to performing any queries.
- Queries are made up of one or more terms.
  - Some systems perceive queries as documents.
  - Others see them as sets of keywords.
- The job at query time is to determine how well each document/collection "matches" a query.
- We calculate a similarity score for each document/collection relative to a query.

14 S_{h,i,q}(scheme) = Similarity Score
- Definition: estimated similarity of document doc_{h,i} to query q using a scheme
- Also called relevance score
- S_{h,i,q}(scheme) = Σ_{j=1}^{M_q} W_{h,i,j}(scheme) * W_{q,j}(scheme) -- Eq 1
- CVV assumes that W_{q,j}(scheme) = 1 for all terms t_j that occur in query q, so:
  - S_{h,i,q}(atn) = Σ_{j=1}^{M_q} W_{h,i,j}(atn) -- Eq 3
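A sketch of Eq 3 under the CVV simplification W_{q,j} = 1: the similarity of a document to a query is just the sum of the document's pre-computed atn weights over the query's terms (the weights shown are illustrative):

```python
def similarity_atn(doc_weights, query_terms):
    """S_{h,i,q}(atn) = sum over query terms of W_{h,i,j}(atn),
    assuming W_{q,j} = 1 for every term in the query (Eq 3)."""
    return sum(doc_weights.get(term, 0.0) for term in query_terms)

# Hypothetical pre-computed atn weights for one document.
doc_weights = {"cat": 2.6, "dog": 1.1, "comput": 0.4}
print(similarity_atn(doc_weights, ["cat", "dog"]))  # 3.7
```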

15 Ranking and Returning the "Best" Documents
- Rank documents in descending order of similarity score to the query.
- One method: get all docs with similarity scores above a specified threshold theta.
- CVV retrieves the top-H+ documents:
  - Include all documents tied with the H-th best document.
  - Assume the H-th best doc's similarity score is > 0.
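A sketch of the top-H+ cutoff described above: take the top H documents by score, then keep anything tied with the H-th one (assuming, as the slide does, that the H-th score is positive):

```python
def top_h_plus(scores, h):
    """scores: {doc_id: similarity}.  Returns the top H docs plus any ties with the H-th."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    if len(ranked) <= h:
        return ranked
    cutoff = ranked[h - 1][1]                      # score of the H-th best document
    return [(d, s) for d, s in ranked if s >= cutoff]

scores = {"d1": 0.9, "d2": 0.7, "d3": 0.7, "d4": 0.2}
print(top_h_plus(scores, h=2))  # d1, d2, and d3 (tied with the 2nd best)
```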

16 Multiple Collection Search
- Also called collection selection
- In CVV, brokers need access to DFs
  - must be centralized, periodically updated
  - all IDFs then provided to collection servers
- Why?
  1) "N is the number of texts in the database" [page 2]
  2) "We refer to index servers as collection servers, as each of them can be viewed as a database carrying a collection of documents." [page 2]
  3) N and DF are both particular to a collection, so what extra-collection information is needed in Equation 3?

17 CVV = Cue-Validity Variance
- Also called the CVV ranking method
- Goodness can be derived completely from DF_{i,j} and N_i

18 CVV Terminology
- C = set of collections in the system
- |C| = number of collections in the system
- N_i = number of documents in collection c_i
- DF_{i,j} = # of times term t_j occurs in collection c_i, or # of documents in c_i containing term t_j
- CV_{i,j} = cue-validity of term t_j for collection c_i
- CVV_{i,j} = cue-validity variance of t_j for c_i
- G_{i,q} = goodness of collection c_i to query q

19 CVV: Calculation
- A = DF_{i,j} / N_i
- B = Σ_{k=1,k≠i}^{|C|} DF_{k,j} / Σ_{k=1,k≠i}^{|C|} N_k   ?=   Σ_{k=1,k≠i}^{|C|} (DF_{k,j} / N_k)
- CV_{i,j} = A / (A + B)
- avgCV_j = Σ_{i=1}^{|C|} CV_{i,j} / |C|
- CVV_{i,j} = Σ_{i=1}^{|C|} (CV_{i,j} - avgCV_j)^2 / |C|
- G_{i,q} = Σ_{j=1}^{|C|} (CVV_j * DF_{i,j})
- I assume that's what they meant (that M = |C|)
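A sketch of the CV / CVV / Goodness pipeline above, using the first reading of B (the ratio of summed DFs to summed Ns over the other collections); the `dfs` and `sizes` values are made up for illustration:

```python
def cue_validity(i, term, dfs, sizes):
    """CV_{i,j} = A / (A + B) with A = DF_{i,j}/N_i and
    B = sum_{k!=i} DF_{k,j} / sum_{k!=i} N_k."""
    a = dfs[i].get(term, 0) / sizes[i]
    other_df = sum(dfs[k].get(term, 0) for k in range(len(dfs)) if k != i)
    other_n = sum(sizes[k] for k in range(len(sizes)) if k != i)
    b = other_df / other_n
    return a / (a + b) if (a + b) > 0 else 0.0

def cvv(term, dfs, sizes):
    """CVV_j = variance of CV_{i,j} across the |C| collections."""
    c = len(dfs)
    cvs = [cue_validity(i, term, dfs, sizes) for i in range(c)]
    avg = sum(cvs) / c
    return sum((cv - avg) ** 2 for cv in cvs) / c

def goodness(i, query_terms, dfs, sizes):
    """G_{i,q} = sum over query terms of CVV_j * DF_{i,j}."""
    return sum(cvv(t, dfs, sizes) * dfs[i].get(t, 0) for t in query_terms)

# Two hypothetical collections: per-collection document frequencies and sizes.
dfs = [{"cat": 40, "dog": 5}, {"cat": 10, "dog": 30}]
sizes = [100, 120]
print(goodness(0, ["cat", "dog"], dfs, sizes))
print(goodness(1, ["cat", "dog"], dfs, sizes))
```

Swapping in the "?=" alternative reading of B would only change how `b` is computed inside `cue_validity`.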

20 Goodness
- ...of a collection relative to a query
- Denoted G_{i,q}, where i is a collection id and q is a query id
- G_{i,q} is a sum of scores, over all terms in the query
- Each score represents how well term q_j characterizes collection c_i [i is a collection id, j is a term id]
- G_{i,q} = Σ_{j=1}^{M} (CVV_j * DF_{i,j})
- The collection with the highest Goodness is the "best" [most relevant] collection for this query

21 Goodness: Example
- Query_A = "cat dog"
- q_1 = cat, q_2 = dog, M = |q| = 2
- You can look at this as [with user-friendly subscripts]:
  - G_{Coll_1,Query_A} = score_{Coll_1,cat} + score_{Coll_1,dog}
  - G_{Coll_2,Query_A} = score_{Coll_2,cat} + score_{Coll_2,dog}
  - ...
- Note: The authors overload the identifier q. At times it represents a query id [see Equation 1]. At other times, it represents a set [bag?] of terms: {q_i, i from 1 to M}.

22 Query Term Weights
- What if Query_A = "cat cat dog"?
  - Do we allow this? Should we weigh cat more heavily than dog? If so, how?
- Example: score_{Coll_1,cat} = 10, score_{Coll_1,dog} = 5; score_{Coll_2,cat} = 5, score_{Coll_2,dog} = 11
  - Intuitively, Coll_1 is more relevant to Query_A
- Scores might be computed prior to processing a query:
  - get all collections' scores for all terms in the vocab
  - add the appropriate pre-computed scores when given a query

23 QTW: CVV Assumptions
- The authors are concerned primarily with Internet queries [unlike us].
- They assume [based on their observations of users' query tendencies] that terms appear at most once in a query.
- Their design doesn't support query term weights; it only cares whether a term is present in the query.
- Their design cannot easily be used to "find me documents like this one".

24 QTW: Approach #1
- Approach #1: q_1 = cat, q_2 = dog
  - Ignore duplicates.
  - Results in a "binary term vector".
  - G_{Coll_1,Query_A} = 10 + 5 = 15
  - G_{Coll_2,Query_A} = 5 + 11 = 16 -- top guess
  - Here we see their algorithm would consider Coll_2 to be more relevant than Coll_1, which is counter to our intuition.

25 QTW: Approach #2
- Approach #2: q_1 = cat, q_2 = cat, q_3 = dog
  - You need to make q a bag [allows duplicate elements] instead of a set [doesn't allow dups]
  - G_{Coll_1,Query_A} = 10 + 10 + 5 = 25 -- top guess
  - G_{Coll_2,Query_A} = 5 + 5 + 11 = 21
  - Results in the "correct" answer.
  - Easy to implement once you have a bag set up.
  - However, primitive brokers will have to calculate [or locate, if pre-calculated] cat's scores twice.

26 QTW: Approach #3
- Approach #3: q_1 = cat, q_2 = dog, w_1 = 2, w_2 = 1
  - The "true" term vector approach.
  - G_{Coll_1,Query_A} = 10*2 + 5*1 = 25 -- top guess
  - G_{Coll_2,Query_A} = 5*2 + 11*1 = 21
  - Results in the "correct" answer.
  - Don't need to calculate scores multiple times.
  - If query term weights tend to be > 1, you save space: [cat, 50] instead of fifty "cat cat ..."; if almost all weights are 1, it's less efficient.
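A sketch contrasting Approach #3's weighted sum with the unweighted Approach #1, reusing the hypothetical per-collection scores from the running example:

```python
def goodness_weighted(scores, weighted_query):
    """Approach #3: G = sum over distinct query terms of score * weight."""
    return sum(scores.get(term, 0) * w for term, w in weighted_query.items())

scores_coll1 = {"cat": 10, "dog": 5}
scores_coll2 = {"cat": 5, "dog": 11}

# Approach #1 is the special case where every weight is 1.
print(goodness_weighted(scores_coll1, {"cat": 1, "dog": 1}))  # 15
print(goodness_weighted(scores_coll2, {"cat": 1, "dog": 1}))  # 16  <- picks Coll_2

# Approach #3 with weights cat=2, dog=1 (i.e. "cat cat dog").
print(goodness_weighted(scores_coll1, {"cat": 2, "dog": 1}))  # 25  <- picks Coll_1
print(goodness_weighted(scores_coll2, {"cat": 2, "dog": 1}))  # 21
```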

27 QTW: Approach #3 (cont)
- Approach #3 -- the most TREC-friendly
  - TREC queries often have duplicate terms
  - Approach #3 results in "correct" answers and is more efficient than Approach #2
- #3 is sometimes better for WWW search:
  - "Find me more docs like this" -- doc similarities
  - Iterative search engines can use query term weights to hone queries [example on next page]
  - Possibility of negative term weights [see example]

28 QTW: Iterative Querying (example)
- Query_1: "travis(5) emmitt(5) football(5)"
  - results in lots on Emmitt Smith, nothing on Travis
  - User tells the engine that "emmitt smith" is irrelevant
  - Engine adjusts each query term weight in the "black list" by -1, then performs a revised query:
- Query_2: "travis(5) emmitt(4) football(5) smith(-1)"
  - Hopefully yields less Emmitt Smith, more Travis
  - Repeat the cycle of user feedback, weight tweaking, and requerying until the user is satisfied [or gives up]
- Can't do this easily without term weights
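A sketch of the weight-tweaking step described above: decrement every query term on the user's "black list" by 1 (the step size and data follow the slide's example; this is not a prescribed relevance-feedback algorithm):

```python
def apply_feedback(query_weights, black_list, step=1):
    """Lower the weight of every black-listed term by `step`;
    a term not already in the query is added with a negative weight."""
    revised = dict(query_weights)
    for term in black_list:
        revised[term] = revised.get(term, 0) - step
    return revised

query_1 = {"travis": 5, "emmitt": 5, "football": 5}
query_2 = apply_feedback(query_1, black_list=["emmitt", "smith"])
print(query_2)  # {'travis': 5, 'emmitt': 4, 'football': 5, 'smith': -1}
```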

29 QTW: User Profiles
- Might also have user profiles:
  - Allison loves cats, hates XXX, likes football
  - Her profile: cats(+3), XXX(-3), football(+1)
  - Adjustments are made to every query she issues.
- Issues: "wearing different hats", relying on keywords, wanting sensitivity to context:
  - "XXX" is especially bad when JPEGs are present
  - "XXX" is not bad when in source code: "XXX:=1;"

30 QTW: Conclusion The bottom line is that query term weights can be useful, not just in a TREC scenario but in an Internet search scenario. CVV can probably be changed to support query term weights [might’ve already been] The QTW discussion was included mostly as a segue to interesting, advanced issues: iterative querying, user profiles, context.

31 Query Forwarding
- Single-Cast approach
  - Get documents from the best collection only.
  - Fast and simple. No result merging.
  - Question: How often will this in fact suffice?
- Multi-Cast approach
  - Get documents from the best n collections.
  - Slower, requires result merging.
  - Desired if the best collection isn't complete.

32 Result Merging
- local doc ranks -> global doc ranks
- r_{i,j} = rank of document doc_j in collection c_i
  - Ambiguous when dealing with multiple queries and multiple similarity estimation schemes [which is what we do].
  - Should actually be r_{i,j,q}(scheme)
- c_{min,q} = the collection with the least similarity to query q
- G_{min,q} = goodness score of c_{min,q} relative to query q

33 Result Merging (cont)
- D_i = estimated score distance between documents in ranks x and x+1
  - D_i = G_{min,q} / (H * G_{i,q})
- s_{i,j} = 1 - (r_{i,j} - 1) * D_i
  - global relevance score of the j-th ranked doc in c_i
  - need to re-rank documents globally
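A sketch of the merge step: convert each collection's local ranks into global scores s_{i,j} = 1 - (r_{i,j} - 1) * D_i with D_i = G_{min,q} / (H * G_{i,q}), then sort globally (the collection names and goodness values are made up):

```python
def merge_results(local_results, goodness, h):
    """local_results: {collection: [doc ids in local rank order]}
    goodness: {collection: G_{i,q}}.  Returns (doc, global score) pairs, best first."""
    g_min = min(goodness.values())
    merged = []
    for coll, docs in local_results.items():
        d_i = g_min / (h * goodness[coll])          # score distance between adjacent ranks
        for rank, doc in enumerate(docs, start=1):
            merged.append((doc, 1 - (rank - 1) * d_i))
    return sorted(merged, key=lambda pair: pair[1], reverse=True)

local_results = {"Coll_1": ["a1", "a2", "a3"], "Coll_2": ["b1", "b2"]}
goodness = {"Coll_1": 0.8, "Coll_2": 0.4}
print(merge_results(local_results, goodness, h=3))
```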

34 CVV Assumption #1
- Assumption 1: The best document in collection c_i is equally relevant to query q (has the same global score) as the best document in collection c_k, for any k != i and G_{i,q}, G_{k,q} > 0.
- Nitpick: if k = i, G_{i,q} = G_{k,q}, so no reason for k != i

35 CVV Assumption #1: Motivation They don’t want to require the same search algorithm at each site [collection server]. Sites will therefore tend to use different scales for Goodness; you can’t simply compare scores directly. They want a “collection containing a few but highly relevant documents to contribute to the final result.”

36 CVV Assumption #1: Critique
- What about collections with a few weak documents? Or a few redundant documents [that occur in other, "better" collections]?
- They omit collections with goodness scores less than half the highest goodness score.
  - The best document could exist by itself in an otherwise lame collection. The overall Goodness for that collection might be lower than half the max (since doc scores are used).

37 CVV Assumption #2 Assumption 2: The distance, in terms of absolute relevance score difference, between two consecutive document ranks in the result set of a collection is inversely proportional to the goodness score of the collection.

38 CVV vs gGLOSS
- Their characterization of gGLOSS:
  - "a keyword based distributed database broker system"
  - "relies on the weight-sum of every term in a collection."
  - assumes that within a collection c_i all docs contain either 0 or avgTF_{h,j} occurrences of term t_j
  - assumes document weights are computed similarly in all collections
  - Y = log(N^ / DF^_j), where N^ and DF^_j are "global"

39 Performance Comparisons
- Accuracy is calculated from the cosine of the angle between the estimated goodness vector and a baseline goodness vector.
  - Based on the top H+
  - Independent of precision and recall.
- They of course say that CVV is best
  - gGLOSS appears much better than CORI (!)

