
1 9/19/2000 Vector and Probabilistic Ranking. Ray Larson & Marti Hearst, University of California, Berkeley, School of Information Management and Systems. SIMS 202: Information Organization and Retrieval

2 Review: Inverted files; The Vector Space Model; Term weighting.

3 Inverted indexes. Permit fast search for individual terms. For each term you get a list consisting of: document ID; frequency of term in doc (optional); position of term in doc (optional). These lists can be used to solve Boolean queries: country -> d1, d2; manor -> d2; country AND manor -> d2. Also used for statistical ranking algorithms.
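The postings-list lookup and Boolean AND above can be sketched in a few lines of Python (the documents d1 and d2 are hypothetical stand-ins for the slide's example, storing only document IDs, not frequencies or positions):

```python
# Minimal inverted index: term -> set of document IDs.
from collections import defaultdict

docs = {
    "d1": "the country house",
    "d2": "the country manor",
}

index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

# Boolean AND is just intersection of the two postings lists.
result = index["country"] & index["manor"]
print(sorted(result))  # ['d2']
```

A real index would also sort the postings and store frequency/position payloads, but the set-intersection shape of the query stays the same.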

4 How Inverted Files are Used. [Figure: dictionary and postings file.] Query on "time" AND "dark": 2 docs with "time" in the dictionary -> IDs 1 and 2 from the postings file; 1 doc with "dark" in the dictionary -> ID 2 from the postings file. Therefore, only doc 2 satisfies the query.

5 Vector Space Model. Documents are represented as vectors in term space: terms are usually stems; documents are represented by binary vectors of terms. Queries are represented the same way as documents. Query and document weights are based on the length and direction of their vectors. A vector distance measure between the query and documents is used to rank retrieved documents.

6 Documents in 3D Space. Assumption: documents that are "close together" in space are similar in meaning.

7 Vector Space Documents and Queries. [Figure: documents D1-D11 plotted against terms t1, t2, t3, with Boolean term combinations.] Q is a query, also represented as a vector.

8 Documents in Vector Space. [Figure: documents D1-D11 plotted against terms t1, t2, t3.]

9 Assigning Weights to Terms. Options: binary weights; raw term frequency; tf x idf (recall the Zipf distribution: we want to weight terms highly if they are frequent in relevant documents BUT infrequent in the collection as a whole); automatically derived thesaurus terms.

10 Binary Weights. Only the presence (1) or absence (0) of a term is included in the vector.

11 Raw Term Weights. The frequency of occurrence of the term in each document is included in the vector.

12 Assigning Weights. The tf x idf measure combines: term frequency (tf); inverse document frequency (idf), a way to deal with the problems of the Zipf distribution. Goal: assign a tf x idf weight to each term in each document.

13 tf x idf. [Formula shown on slide.]

14 Inverse Document Frequency. IDF provides high values for rare words and low values for common words. [Table of example IDF values] for a collection of 10000 documents.
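The slide's table of example values is not reproduced here, but the usual idf_k = log2(N / n_k) for a 10,000-document collection gives a feel for the range (base 2 is one common choice; other bases only rescale the values):

```python
# IDF for a collection of N = 10,000 documents, where n_k is the
# number of documents containing term k.
import math

N = 10_000
for n_k in (10_000, 5_000, 100, 1):
    print(f"n_k = {n_k:>6}: idf = {math.log2(N / n_k):.3f}")
```

A term in every document gets idf 0 (useless for discrimination); a term in a single document gets the maximum, log2(10000) ≈ 13.3.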

15 Similarity Measures: simple matching (coordination level match); Dice's Coefficient; Jaccard's Coefficient; Cosine Coefficient; Overlap Coefficient.
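For binary term vectors, the five coefficients reduce to set operations; a sketch with a hypothetical pair of term sets:

```python
# Set-based similarity coefficients over binary term sets X and Y.
import math

def simple_match(x, y):   # coordination level match: count of shared terms
    return len(x & y)

def dice(x, y):
    return 2 * len(x & y) / (len(x) + len(y))

def jaccard(x, y):
    return len(x & y) / len(x | y)

def cosine(x, y):         # binary form of the cosine coefficient
    return len(x & y) / math.sqrt(len(x) * len(y))

def overlap(x, y):
    return len(x & y) / min(len(x), len(y))

X = {"star", "film", "tv"}      # hypothetical document terms
Y = {"star", "astronomy"}       # hypothetical query terms
print(simple_match(X, Y), dice(X, Y), jaccard(X, Y), overlap(X, Y))
```

Only simple matching is unnormalized; the others divide the overlap by some function of the set sizes, which is what makes them comparable across documents of different lengths.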

16 Vector Space Visualization. [Figure.]

17 Text Clustering. Clustering is "the art of finding groups in data." -- Kaufmann and Rousseeuw. [Figure: documents clustered in a Term 1 / Term 2 space.]

18 K-Means Clustering. 1. Create a pair-wise similarity measure. 2. Find K centers using agglomerative clustering: take a small sample; group bottom-up until K groups are found. 3. Assign each document to the nearest center, forming new clusters. 4. Repeat step 3 as necessary.
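Steps 3-4 above are the familiar k-means loop. A toy sketch (the seeding here just takes the first K points rather than the agglomerative-clustering sample the slide prescribes, and the 2-D document vectors are made up):

```python
# Toy k-means over 2-D document vectors.
import math

def assign(docs, centers):
    """Step 3: put each document in the cluster of its nearest center."""
    clusters = [[] for _ in centers]
    for d in docs:
        dists = [math.dist(d, c) for c in centers]
        clusters[dists.index(min(dists))].append(d)
    return clusters

def recenter(clusters):
    """Compute each cluster's centroid as the new center."""
    return [tuple(sum(xs) / len(xs) for xs in zip(*c)) for c in clusters]

docs = [(0.1, 0.2), (0.15, 0.25), (0.9, 0.8), (0.85, 0.9)]
centers = docs[:2]            # stand-in for step 2 (seeding)
for _ in range(10):           # step 4: repeat step 3
    centers = recenter(assign(docs, centers))
print(centers)
```

On this data the loop converges to one center per tight pair of points. In practice the seeding matters: the slide's agglomerative sample (a Buckshot-style approach) avoids the poor local optima that arbitrary seeds can produce.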

19 Scatter/Gather. Cutting, Pedersen, Tukey & Karger 92, 93; Hearst & Pedersen 95. Cluster sets of documents into general "themes", like a table of contents. Display the contents of the clusters by showing topical terms and typical titles. The user chooses subsets of the clusters and re-clusters the documents within; the resulting new groups have different "themes".

20 S/G Example: query on "star", Encyclopedia text: 14 sports; 8 symbols; 47 film, tv; 68 film, tv (p); 7 music; 97 astrophysics; 67 astronomy (p); 12 stellar phenomena; 10 flora/fauna; 49 galaxies, stars; 29 constellations; 7 miscellaneous. Clustering and re-clustering is entirely automated.


24 Another use of clustering. Use clustering to map the entire huge multidimensional document space into a huge number of small clusters. "Project" these onto a 2D graphical representation.

25 Clustering Multi-Dimensional Document Space (image from Wise et al 95).

26 Clustering Multi-Dimensional Document Space (image from Wise et al 95).

27 Concept "Landscapes": Pharmacology, Anatomy, Legal, Disease, Hospitals (e.g., Lin, Chen, Wise et al.). Issues: too many concepts, or too coarse; single concept per document; no titles; browsing without search.

28 Clustering. Advantages: see some main themes. Disadvantage: many ways documents could group together are hidden. Thinking point: what is the relationship to classification systems and facets?

29 Today: Vector Space Ranking; Probabilistic Models and Ranking (lots of math).

30 tf x idf normalization. Normalize the term weights so longer documents are not unfairly given more weight. To normalize usually means to force all values to fall within a certain range, usually between 0 and 1 inclusive.
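One common way to do this (and the usual reading of the slide's formula) is cosine normalization: divide each tf*idf weight by the Euclidean length of the document's weight vector. The tf and idf values below are made up for illustration:

```python
# Cosine-normalize tf*idf weights so each document becomes a unit vector;
# every normalized weight then falls in [0, 1].
import math

def tfidf_norm(tfs, idfs):
    raw = [tf * idf for tf, idf in zip(tfs, idfs)]
    length = math.sqrt(sum(w * w for w in raw))
    return [w / length for w in raw]

weights = tfidf_norm([3, 1, 0, 2], [1.0, 2.0, 3.0, 0.5])
print(weights)
```

After normalization the squared weights sum to 1, so a long document with many term repetitions no longer dominates a short one.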

31 Vector space similarity (use the weights to compare the documents).

32 Vector Space Similarity Measure. Combine tf x idf into a similarity measure. [Formula shown on slide.]

33 To Think About. How does this ranking algorithm behave? Make a set of hypothetical documents consisting of terms and their weights; create some hypothetical queries; how are the documents ranked, depending on the weights of their terms and the queries' terms?

34 Computing Similarity Scores. [Figure: query and document vectors plotted in two dimensions.]

35 Computing a similarity score. [Worked example shown on slide.]

36 Vector Space with Term Weights and Cosine Matching. [Figure: Q, D1 and D2 plotted against Term A and Term B.] D_i = (d_i1, w_di1; d_i2, w_di2; ...; d_it, w_dit); Q = (q_i1, w_qi1; q_i2, w_qi2; ...; q_it, w_qit). Example: Q = (0.4, 0.8), D1 = (0.8, 0.3), D2 = (0.2, 0.7).
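Running the slide's example vectors through the cosine measure shows why D2 outranks D1 even though D1 has the larger Term A weight: D2's direction is closer to Q's.

```python
# Cosine match for Q = (0.4, 0.8), D1 = (0.8, 0.3), D2 = (0.2, 0.7).
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

Q, D1, D2 = (0.4, 0.8), (0.8, 0.3), (0.2, 0.7)
print(round(cosine(Q, D1), 2), round(cosine(Q, D2), 2))  # 0.73 0.98
```

Cosine measures the angle between vectors, not their lengths, so it rewards agreement in direction (relative emphasis of terms) rather than sheer magnitude.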

37 Weighting schemes. We have seen something of: binary; raw term weights; tf*idf. There are many other possibilities: idf alone; normalized term frequency.

38 Term Weights in SMART. SMART is an experimental IR system developed by Gerard Salton (and continued by Chris Buckley) at Cornell. Designed for laboratory experiments in IR: easy to mix and match different weighting methods; really terrible user interface; intended for use by code hackers (and even they have trouble using it).

39 Term Weights in SMART. In SMART, weights are decomposed into three factors. [Formula shown on slide.]

40 SMART Frequency Components: binary; max-norm; augmented; log.
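The slide lists the component names without their formulas; the definitions below are the standard textbook forms and are an assumption about which SMART variants the slide intends (maxtf is the largest term frequency in the document):

```python
# Common tf components: binary, max-norm, augmented, log.
import math

def tf_binary(tf):
    return 1 if tf > 0 else 0

def tf_maxnorm(tf, maxtf):
    return tf / maxtf

def tf_augmented(tf, maxtf):
    return 0.5 + 0.5 * tf / maxtf

def tf_log(tf):
    return (1 + math.log(tf)) if tf > 0 else 0

# A term occurring 4 times where the most frequent term occurs 8 times:
print(tf_binary(4), tf_maxnorm(4, 8), tf_augmented(4, 8), round(tf_log(4), 3))
```

Each variant dampens raw frequency differently: binary throws it away, max-norm and augmented scale it relative to the document's dominant term, and log compresses large counts.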

41 Collection Weighting in SMART: inverse; squared; probabilistic; frequency.

42 Term Normalization in SMART: sum; cosine; fourth; max.

43 Problems with Vector Space. There is no real theoretical basis for the assumption of a term space: it is more for visualization than having any real basis; most similarity measures work about the same regardless of model. Terms are not really orthogonal dimensions: terms are not independent of all other terms.

44 Probabilistic Models. A rigorous formal model that attempts to predict the probability that a given document will be relevant to a given query. Ranks retrieved documents according to this probability of relevance (the Probability Ranking Principle). Relies on accurate estimates of probabilities.

45 Probability Ranking Principle. "If a reference retrieval system's response to each request is a ranking of the documents in the collections in the order of decreasing probability of usefulness to the user who submitted the request, where the probabilities are estimated as accurately as possible on the basis of whatever data has been made available to the system for this purpose, then the overall effectiveness of the system to its users will be the best that is obtainable on the basis of that data." Stephen E. Robertson, J. Documentation 1977

46 Model 1 -- Maron and Kuhns. Concerned with estimating probabilities of relevance at the point of indexing: if a patron came with a request using term t_i, what is the probability that she/he would be satisfied with document D_j?

47 Bayes Formula. Bayesian statistical inference is used in both models.
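The formula itself appeared as an image on the slide; in standard notation, Bayes' rule for events A and B is:

```latex
P(A \mid B) \;=\; \frac{P(B \mid A)\,P(A)}{P(B)}
```

Both models use this to turn quantities that can be estimated from indexing or usage data (the likelihoods) into the probability of relevance that the system actually needs.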

48 Model 1 Bayes. A is the class of events of using the library. D_i is the class of events of document i being judged relevant. I_j is the class of queries consisting of the single term I_j. P(D_i | A, I_j) = the probability that if a query is submitted to the system then a relevant document is retrieved.

49 Model 2 -- Robertson & Sparck Jones. Given a term t and a query q, cross-tabulate document relevance against document indexing:

               relevant     not relevant     total
  t present:   r            n - r            n
  t absent:    R - r        N - n - R + r    N - n
  total:       R            N - R            N

50 Robertson-Sparck Jones Weights. Retrospective formulation. [Formula shown on slide.]

51 Robertson-Sparck Jones Weights. Predictive formulation. [Formula shown on slide.]
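The predictive formulation's image is not reproduced above, but the standard form adds 0.5 point corrections to each cell of the contingency table (an assumption that the slide shows the usual RSJ weight):

```python
# Robertson-Sparck Jones weight, predictive form with 0.5 corrections.
# r: relevant docs containing t; R: relevant docs; n: docs containing t;
# N: collection size.
import math

def rsj_weight(r, R, n, N):
    return math.log(((r + 0.5) / (R - r + 0.5)) /
                    ((n - r + 0.5) / (N - n - R + r + 0.5)))

# Hypothetical figures: 1000 docs, term in 50, 10 judged relevant, 8 of
# which contain the term.
print(rsj_weight(r=8, R=10, n=50, N=1000))
```

The 0.5 corrections keep the weight finite when a cell of the table is zero, which is what makes the retrospective formula usable predictively.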

52 Model 1. "A patron submits a query (call it Q) consisting of some specification of her/his information need. Different patrons submitting the same stated query may differ as to whether or not they judge a specific document to be relevant. The function of the retrieval system is to compute for each individual document the probability that it will be judged relevant by a patron who has submitted query Q." Robertson, Maron & Cooper, 1982

53 Model 2. "Documents have many different properties; some documents have all the properties that the patron asked for, and other documents have only some or none of the properties. If the inquiring patron were to examine all of the documents in the collection she/he might find that some having all the sought after properties were relevant, but others (with the same properties) were not relevant. And conversely, he/she might find that some of the documents having none (or only a few) of the sought after properties were relevant, others not. The function of a document retrieval system is to compute the probability that a document is relevant, given that it has one (or a set) of specified properties." Robertson, Maron & Cooper, 1982

54 Probabilistic Models: Some Unifying Notation. D = all present and future documents; Q = all present and future queries; (D_i, Q_j) = a document-query pair; x = class of similar documents; y = class of similar queries. Relevance (R) is a relation. [Definition shown on slide.]

55 Probabilistic Models. Model 1 -- Probabilistic Indexing, P(R | y, D_i). Model 2 -- Probabilistic Querying, P(R | Q_j, x). Model 3 -- Merged Model, P(R | Q_j, D_i). Model 0 -- P(R | y, x). Probabilities are estimated based on prior usage or relevance estimation.

56 Probabilistic Models. [Figure: document space D and query space Q, with classes x and y containing D_i and Q_j.]

57 Logistic Regression. Based on work by William Cooper, Fred Gey and Daniel Dabney. Builds a regression model for relevance prediction based on a set of training data. Uses less restrictive independence assumptions than Model 2 (linked dependence).

58 Probabilistic Models: Logistic Regression. Estimates of relevance are based on a log-linear model with various statistical measures of document content as independent variables. Log odds of relevance is a linear function of the attributes; term contributions are summed; probability of relevance is the inverse of the log odds. [Formulas shown on slide.]
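The slide's formulas can be sketched directly: the log odds (logit) is a linear function of the attribute values, and the probability of relevance is its inverse logistic transform. The coefficients and attribute values below are made up for illustration:

```python
# Logistic regression for relevance: logit = c0 + sum(c_i * x_i),
# P(R) = 1 / (1 + exp(-logit)).
import math

def p_relevance(coefs, attrs, intercept):
    logit = intercept + sum(c * x for c, x in zip(coefs, attrs))
    return 1.0 / (1.0 + math.exp(-logit))

# Two hypothetical attributes with hypothetical trained coefficients.
print(p_relevance([0.9, -0.2], [1.5, 3.0], intercept=-0.5))
```

In training, the coefficients are fit from a sample of query-document pairs with known relevance judgments; at retrieval time only this cheap inverse-logit evaluation is needed per document.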

59 Logistic Regression. [Figure: relevance (0-100) plotted against term frequency in document (0-60).]

60 Probabilistic Models: Logistic Regression attributes. Average absolute query frequency; query length; average absolute document frequency; document length; average inverse document frequency; inverse document frequency; number of terms in common between query and document (logged).

61 Probabilistic Models: Logistic Regression. The probability of relevance is based on logistic regression from a sample set of documents, used to determine the values of the coefficients. At retrieval time the probability estimate is obtained from the six X attribute measures shown previously. [Formula shown on slide.]

62 Probabilistic Models. Advantages: strong theoretical basis; in principle should supply the best predictions of relevance given the available information; can be implemented similarly to Vector. Disadvantages: relevance information is required -- or is "guestimated"; important indicators of relevance may not be terms -- though terms only are usually used; optimally requires on-going collection of relevance information.

63 Vector and Probabilistic Models. Both support "natural language" queries; treat documents and queries the same; support relevance feedback searching; support ranked retrieval. They differ primarily in theoretical basis and in how the ranking is calculated: Vector assumes relevance; Probabilistic relies on relevance judgments or estimates.

64 Current use of Probabilistic Models. Virtually all the major systems in TREC now use the "Okapi BM25 formula", which incorporates the Robertson-Sparck Jones weights.

65 Okapi BM25. [Formula shown on slide.] Where: Q is a query containing terms T; K is k1((1-b) + b*dl/avdl); k1, b and k3 are parameters, usually 1.2, 0.75 and 7 to 1000; tf is the frequency of the term in a specific document; qtf is the frequency of the term in a topic from which Q was derived; dl and avdl are the document length and the average document length, measured in some convenient unit; w(1) is the Robertson-Sparck Jones weight.
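Putting the slide's pieces together, the per-term BM25 contribution is w(1) * ((k1+1)*tf)/(K+tf) * ((k3+1)*qtf)/(k3+qtf), summed over the query terms. A sketch with the slide's usual parameter values and made-up inputs:

```python
# One term's BM25 contribution; K = k1 * ((1-b) + b * dl/avdl).
def bm25_term(tf, qtf, dl, avdl, w1, k1=1.2, b=0.75, k3=7.0):
    K = k1 * ((1 - b) + b * dl / avdl)
    return (w1
            * ((k1 + 1) * tf) / (K + tf)
            * ((k3 + 1) * qtf) / (k3 + qtf))

# Hypothetical numbers: term appears 3x in a doc of exactly average
# length, once in the query, with RSJ weight 2.0.
print(bm25_term(tf=3, qtf=1, dl=100, avdl=100, w1=2.0))
```

The tf factor saturates as tf grows (repetitions give diminishing returns), and b controls how strongly document length discounts the score; at dl = avdl the length normalization is neutral (K = k1).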

66 Logistic Regression and Cheshire II. The Cheshire II system uses logistic regression equations estimated from TREC full-text data. Demo (?)

