Information Retrieval

Heng Ji November 1, 2016

Transfer of Information
Communication = transmission of information. Under this view, information is closely linked to communication. How does the “information as communication” view relate to the “information as process” view?
[Diagram: the sender encodes Thoughts → Words → Sounds and the recipient decodes Sounds → Words → Thoughts; the links between the two sides are labeled Telepathy? (thoughts), Writing (words), and Speech (sounds)]

Information Theory
Better called “communication theory”. Developed by Claude Shannon in the 1940s. Concerned with the transmission of electrical signals over wires: how do we send information quickly and reliably? Underlies modern electronic communication: voice and data traffic over copper, fiber optic, wireless, etc. Famous result: the Channel Capacity Theorem. Formal measure of information in terms of entropy: information = “reduction in surprise”. Here’s yet another view of information, this time from the point of view of electrical engineers.

The Noisy Channel Model
Communication = producing the same message at the destination that was sent at the source. The message must be encoded for transmission across a medium (called the channel), but the channel is noisy and can distort the message. Semantics (meaning) is irrelevant.
[Diagram: Source → Transmitter → message → channel (with noise) → message → Receiver → Destination]

A Synthesis
Information retrieval as communication over time and space, across a noisy channel. So we have three separate views? How do they fit together?
[Diagram: the noisy channel (Source → Transmitter → message → channel with noise → message → Receiver → Destination) mapped onto retrieval (Sender → Encoding by indexing/writing → message → storage with noise → message → Decoding by retrieval/reading → Recipient)]

Information Hierarchy
“Information” actually exists as part of a hierarchy, according to how refined it is.
[Diagram: pyramid, from bottom to top: Data, Information, Knowledge, Wisdom; higher levels are more refined and abstract]

Information Hierarchy
Data: the raw material of information.
Information: data organized and presented in a particular manner.
Knowledge: “justified true belief”; information that can be acted upon.
Wisdom: distilled and integrated knowledge; demonstrative of high-level “understanding”.

A (Facetious) Example
Data: 98.6º F, 99.5º F, 100.3º F, 101º F, …
Information: Hourly body temperature: 98.6º F, 99.5º F, 100.3º F, 101º F, …
Knowledge: If you have a temperature above 100º F, you most likely have a fever.
Wisdom: If you don’t feel well, go see a doctor.

What is Information Retrieval?
Most people equate IR with web-search highly visible, commercially successful endeavors leverage 3+ decades of academic research IR: finding any kind of relevant information web-pages, news events, answers, images, … “relevance” is a key notion

“Retrieval?”
“Fetch something” that’s been stored. Recover a stored state of knowledge. Search through stored messages to find some messages relevant to the task at hand. We’re communicating over time and space. The message recreates the mental state of the sender.
[Diagram: Sender → Encoding (indexing/writing) → message → storage (with noise) → message → Decoding (retrieval/reading) → Recipient]

History
Systemic approach: user outside the system; static/fixed information need; retrieval effectiveness measured; batch retrieval simulations.
User-centered approach: user part of the system, interacting with other components, trying to resolve an anomalous state of knowledge; task-oriented evaluation.

Interesting Examples Google image search Google video search
People Search Social Network Search

IR System
[Diagram: a query string and a document corpus are fed into the IR system, which outputs ranked documents]

What types of information?
Text (Documents and portions thereof) XML and structured documents Images Audio (sound effects, songs, etc.) Video Source code Applications/Web services Why would you want to search for each type of information? Types of information needs and types of information form a matrix: this course is mostly focused on one section thereof.

The IR Black Box Documents Query Hits

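Inside The IR Black Box
[Diagram: the Query passes through a Representation Function to give a Query Representation; the Documents pass through a Representation Function to give a Document Representation, which is stored in the Index; a Comparison Function matches the query representation against the index and outputs Hits]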

Building the IR Black Box
Different models of information retrieval Boolean model Vector space model Probabilistic models Language models PageRank Representing the meaning of documents How do we capture the meaning of documents? Is meaning just the sum of all terms? Indexing How do we actually store all those words? How do we access indexed terms quickly?

The Central Problem in IR
Why is IR hard? Because language is hard!
[Diagram: the information seeker turns Concepts into Query Terms; authors turn Concepts into Document Terms. Do these represent the same concepts?]

The Central Problem in IR: Variety
Why is IR hard? Because language is hard!
[Diagram: the searcher turns Concepts into Query Terms, e.g. “tragic love story”; the author turns Concepts into Document Terms, e.g. “fateful star-crossed romance”. Do these represent the same concepts?]

The Central Problem in IR: Ambiguity
The Yuri dolgoruky is the first in a series of new nuclear submarines to be commissioned this year but the bulava nuclear-armed missile developed to equip the submarine has failed tests and the deployment prospects are uncertain.

Relevance
Relevance is a subjective judgment and may include: being on the proper subject; being timely (recent information); being authoritative (from a trusted source); satisfying the goals of the user and his/her intended use of the information (information need).

IR Ranking
Early IR focused on set-based retrieval: Boolean queries, i.e. a set of conditions to be satisfied; a document either matches the query or not, like classifying the collection into relevant / non-relevant sets; still used by professional searchers and as “advanced search” in many systems.
Modern IR: ranked retrieval. A free-form query expresses the user’s information need; documents are ranked by decreasing likelihood of relevance; many studies have found it superior.

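A heuristic formula for IR
Rank docs by similarity to the query; suppose the query is “cryogenic labs”.
Similarity = # query words in the doc; this favors documents with both “labs” and “cryogenic”. Mathematically: $sim(D,Q) = \sum_{q \in Q} 1_D(q)$, where $1_D(q) = 1$ if $q$ occurs in $D$ and 0 otherwise.
Logical variations (set-based):
Boolean AND (require all words): $AND(D,Q) = \prod_{q \in Q} 1_D(q)$
Boolean OR (any of the words): $OR(D,Q) = 1 - \prod_{q \in Q} (1 - 1_D(q))$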
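The word-overlap heuristic and its two set-based variants can be sketched in a few lines. This is only an illustration: the function names and the three toy documents are made up for the example, and tokenization is a plain whitespace split.

```python
def overlap(query, doc):
    """Similarity = number of distinct query words that occur in the document."""
    return len(set(query) & set(doc))

def boolean_and(query, doc):
    """True only if every query word occurs in the document."""
    return set(query) <= set(doc)

def boolean_or(query, doc):
    """True if at least one query word occurs in the document."""
    return bool(set(query) & set(doc))

query = "cryogenic labs".split()
# Hypothetical documents, for illustration only.
docs = {
    "d1": "the cryogenic labs were closed".split(),
    "d2": "new labs opened last year".split(),
    "d3": "weather report for tuesday".split(),
}

for name, doc in docs.items():
    print(name, overlap(query, doc), boolean_and(query, doc), boolean_or(query, doc))
```

Ranking by `overlap` puts d1 first, then d2, then d3, whereas Boolean AND keeps only d1 and Boolean OR keeps d1 and d2 without ordering them.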

Term Frequency (TF)
Observation: key words tend to be repeated in a document.
Modify our similarity measure: give more weight if a word occurs multiple times: $sim(D,Q) = \sum_{q \in Q} tf_D(q)$, where $tf_D(q)$ is the number of occurrences of $q$ in $D$.
Problem: biased towards long documents (spurious occurrences); normalize by length: $sim(D,Q) = \sum_{q \in Q} \frac{tf_D(q)}{|D|}$

Inverse Document Frequency (IDF)
Observation: rare words carry more meaning (cryogenic, apollo); frequent words are linguistic glue (of, the, said, went).
Modify our similarity measure: give more weight to rare words … but don’t be too aggressive (why?): $sim(D,Q) = \sum_{q \in Q} \frac{tf_D(q)}{|D|} \cdot \log \frac{|C|}{df(q)}$, where $tf_D(q)$ is the number of occurrences of $q$ in $D$.
|C| … total number of documents; df(q) … total number of documents that contain q

TF normalization
Observation: D1 = {cryogenic, labs}, D2 = {cryogenic, cryogenic}. Which document is more relevant? Which one is ranked higher? (df(labs) > df(cryogenic))
Correction: the first occurrence is more important than a repeat (why?), so “squash” the linearity of TF.

State-of-the-art Formula
Common words → less important. Repetitions of query words → good. More query words → good. Penalize very long documents.
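One common way to combine these four ingredients is a BM25-style score: a saturating, length-normalized TF multiplied by a logarithmic IDF, summed over query words. The sketch below is an assumption about the general shape of such a formula, not necessarily the exact one intended here; the constant `k` and the toy collection are hypothetical.

```python
import math
from collections import Counter

def score(query, doc, collection, k=1.0):
    """BM25-like score: squashed, length-normalized TF times log IDF (assumed form)."""
    n_docs = len(collection)
    avg_len = sum(len(d) for d in collection) / n_docs
    tf = Counter(doc)
    total = 0.0
    for term in set(query):
        df = sum(1 for d in collection if term in d)
        if df == 0 or tf[term] == 0:
            continue  # term absent from the collection or from this document
        # tf / (tf + k * relative length): repeats help, but less and less,
        # and long documents are penalized.
        squashed_tf = tf[term] / (tf[term] + k * len(doc) / avg_len)
        idf = math.log(n_docs / df)  # common words get a small weight
        total += squashed_tf * idf
    return total

# Hypothetical collection; the first two documents are D1 and D2 from the TF normalization slide.
collection = [
    ["cryogenic", "labs"],
    ["cryogenic", "cryogenic"],
    ["labs", "of", "the", "university"],
    ["the", "news", "of", "the", "day"],
]
query = ["cryogenic", "labs"]
scores = [score(query, d, collection) for d in collection]
print(scores)
```

With the squashed TF, the document that matches both query words outranks the one that merely repeats a single query word.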

Strengths and Weaknesses
Strengths: precise, if you know the right strategies; precise, if you have an idea of what you’re looking for; implementations are fast and efficient.
Weaknesses: users must learn Boolean logic; Boolean logic is insufficient to capture the richness of language; no control over the size of the result set (either too many hits or none); when do you stop reading? All documents in the result set are considered “equally good”. What about partial matches? Documents that “don’t quite match” the query may be useful also.

Vector-space approach to IR
[Figure: documents plotted in a vector space whose axes are the terms cat, pig and dog; θ is the angle between two vectors]
Assumption: documents that are “close together” in vector space “talk about” the same things. Therefore, retrieve documents based on how close the document is to the query (i.e., similarity ~ “closeness”).

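Some formulas for Similarity
For a document $D = (a_1, \ldots, a_t)$ and a query $Q = (b_1, \ldots, b_t)$ over terms $t_1, \ldots, t_t$:
Dot product: $Sim(D,Q) = \sum_i a_i b_i$
Cosine: $Sim(D,Q) = \frac{\sum_i a_i b_i}{\sqrt{\sum_i a_i^2} \sqrt{\sum_i b_i^2}}$
Dice: $Sim(D,Q) = \frac{2 \sum_i a_i b_i}{\sum_i a_i^2 + \sum_i b_i^2}$
Jaccard: $Sim(D,Q) = \frac{\sum_i a_i b_i}{\sum_i a_i^2 + \sum_i b_i^2 - \sum_i a_i b_i}$
[Figure: vectors D and Q plotted against term axes t1 and t2]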

An Example A document space is defined by three terms:
hardware, software, users the vocabulary A set of documents are defined as: A1=(1, 0, 0), A2=(0, 1, 0), A3=(0, 0, 1) A4=(1, 1, 0), A5=(1, 0, 1), A6=(0, 1, 1) A7=(1, 1, 1) A8=(1, 0, 1). A9=(0, 1, 1) If the Query is “hardware and software” what documents should be retrieved?

An Example (cont.) In Boolean query matching:
document A4, A7 will be retrieved (“AND”) retrieved: A1, A2, A4, A5, A6, A7, A8, A9 (“OR”) In similarity matching (cosine): q=(1, 1, 0) S(q, A1)=0.71, S(q, A2)=0.71, S(q, A3)=0 S(q, A4)=1, S(q, A5)=0.5, S(q, A6)=0.5 S(q, A7)=0.82, S(q, A8)=0.5, S(q, A9)=0.5 Document retrieved set (with ranking)= {A4, A7, A1, A2, A5, A6, A8, A9}
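The cosine scores and the ranking in this example can be checked with a short script. The helper name `cosine` is just for illustration; the document vectors and the query are the ones given above.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two term vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Term order: (hardware, software, users)
docs = {
    "A1": (1, 0, 0), "A2": (0, 1, 0), "A3": (0, 0, 1),
    "A4": (1, 1, 0), "A5": (1, 0, 1), "A6": (0, 1, 1),
    "A7": (1, 1, 1), "A8": (1, 0, 1), "A9": (0, 1, 1),
}
q = (1, 1, 0)

scores = {name: round(cosine(q, vec), 2) for name, vec in docs.items()}
# Keep only documents with a non-zero score; sorted() is stable, so ties keep their original order.
ranking = sorted((n for n in docs if scores[n] > 0), key=lambda n: -scores[n])
print(scores)
print(ranking)
```

Unlike the Boolean “AND” result, the cosine ranking still returns the partial matches (A1, A2, A5, ...), only lower in the list.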

Probabilistic model
Given D, estimate P(R|D) and P(NR|D).
P(R|D) = P(D|R)*P(R)/P(D) ∝ P(D|R) (P(D), P(R) constant)
D = {t1=x1, t2=x2, …}

Prob. model (cont’d) For document ranking

Prob. model (cont’d)
How to estimate pi and qi? A set of N relevant and irrelevant samples:

                 Rel. doc.    Irrel. doc.         Doc.
with ti          ri           ni - ri             ni
without ti       Ri - ri      N - Ri - ni + ri    N - ni
total            Ri           N - Ri              N (samples)

Prob. model (cont’d)
Smoothing (Robertson-Sparck-Jones formula)
When no sample is available: pi = 0.5, qi = (ni+0.5)/(N+0.5) ≈ ni/N
May be implemented as VSM
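For reference, the usual form of the Robertson-Sparck Jones term weight adds 0.5 to each of the four counts derived from ri, ni, Ri and N. The sketch below assumes that standard form, which may differ in detail from the formula intended on this slide; the function name and the numbers in the usage lines are hypothetical.

```python
import math

def rsj_weight(r, n, R, N):
    """Robertson-Sparck Jones term weight with 0.5 smoothing (standard form, assumed).

    r: relevant docs containing the term    n: docs containing the term
    R: relevant docs                        N: all sample docs
    """
    return math.log(((r + 0.5) * (N - R - n + r + 0.5)) /
                    ((n - r + 0.5) * (R - r + 0.5)))

# With relevance samples: a term found in most relevant docs gets a high weight.
print(rsj_weight(r=8, n=10, R=10, N=100))
# With no relevance samples (r = R = 0) the weight reduces to an IDF-like quantity.
print(rsj_weight(r=0, n=10, R=0, N=100))
```

The second call illustrates why the model “may be implemented as VSM”: without relevance information the weight behaves like an inverse document frequency.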

An Appraisal of Probabilistic Models
Among the oldest formal models in IR Maron & Kuhns, 1960: Since an IR system cannot predict with certainty which document is relevant, we should deal with probabilities Assumptions for getting reasonable approximations of the needed probabilities: Boolean representation of documents/queries/relevance Term independence Out-of-query terms do not affect retrieval Document relevance values are independent

An Appraisal of Probabilistic Models
The difference between ‘vector space’ and ‘probabilistic’ IR is not that great: In either case you build an information retrieval scheme in the exact same way. Difference: for probabilistic IR, at the end, you score queries not by cosine similarity and tf-idf in a vector space, but by a slightly different formula motivated by probability theory

Language-modeling Approach
The query is a random sample from a “perfect” document; words are “sampled” independently of each other; rank documents by the probability of generating the query.
[Figure: a document D of 9 tokens and a 4-token query; P(query) = P(·) · P(·) · P(·) · P(·) = 4/9 * 2/9 * 4/9 * 3/9]

Naive Bayes and LM generative models
Naive Bayes: We want to classify document d. Classes: geographical regions like China, UK, Kenya. Assume that d was generated by the generative model. Key question: Which of the classes is most likely to have generated the document? Or: for which class do we have the most evidence?
LM for IR: We want to classify a query q. Each document in the collection is a different class. Assume that q was generated by a generative model. Key question: Which document (=class) is most likely to have generated the query q? Or: for which document (as the source of the query) do we have the most evidence?

Using language models (LMs) for IR
LM = language model We view the document as a generative model that generates the query. What we need to do: Define the precise generative model we want to use Estimate parameters (different parameters for each document’s model) Smooth to avoid zeros Apply to query and find document most likely to have generated the query Present most likely document(s) to user Note that x – y is pretty much what we did in Naive Bayes.

What is a language model?
We can view a finite state automaton as a deterministic language model. It generates strings of the form “I wish I wish I wish I wish …”. Cannot generate: “wish I wish” or “I wish I”. Our basic model: each document was generated by a different automaton like this, except that these automata are probabilistic.

A probabilistic language model
This is a one-state probabilistic finite-state automaton (a unigram language model) together with the state emission distribution for its one state q1. STOP is not a word, but a special symbol indicating that the automaton stops.
String: frog said that toad likes frog STOP
P(string) = 0.01 · 0.03 · 0.04 · 0.01 · 0.02 · 0.01 · 0.02 = 4.8 · 10^-13

A different language model for each document
String: frog said that toad likes frog STOP
P(string|Md1) = 0.01 · 0.03 · 0.04 · 0.01 · 0.02 · 0.01 · 0.02 = 4.8 · 10^-13
P(string|Md2) = 0.01 · 0.03 · 0.05 · 0.02 · 0.02 · 0.01 · 0.02 = 12 · 10^-13
P(string|Md1) < P(string|Md2)
Thus, document d2 is “more relevant” to the string “frog said that toad likes frog STOP” than d1 is.
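The two products above are easy to recompute. In this sketch each document model is a dictionary of unigram emission probabilities; only the entries needed for this string are listed, since the rest of each distribution is not given here, and the names `md1`, `md2` and `string_probability` are illustrative.

```python
import math

def string_probability(tokens, model):
    """Probability of a token sequence under a unigram model (independent emissions)."""
    return math.prod(model[t] for t in tokens)

# Emission probabilities read off the factors in the two products above.
md1 = {"frog": 0.01, "said": 0.03, "that": 0.04, "toad": 0.01, "likes": 0.02, "STOP": 0.02}
md2 = {"frog": 0.01, "said": 0.03, "that": 0.05, "toad": 0.02, "likes": 0.02, "STOP": 0.02}

tokens = "frog said that toad likes frog STOP".split()
p1 = string_probability(tokens, md1)
p2 = string_probability(tokens, md2)
print(p1, p2, p1 < p2)
```

The products come out at roughly 4.8e-13 and 1.2e-12. For longer strings the same computation is usually done as a sum of log probabilities to avoid floating-point underflow.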

Using language models in IR
Each document is treated as (the basis for) a language model. Given a query q, rank documents based on P(d|q) = P(q|d) P(d) / P(q). P(q) is the same for all documents, so ignore it. P(d) is the prior, often treated as the same for all d; but we can give a higher prior to “high-quality” documents, e.g., those with high PageRank. P(q|d) is the probability of q given d. So, with a uniform prior, ranking documents according to P(q|d) and according to P(d|q) is equivalent.

Where we are
In the LM approach to IR, we attempt to model the query generation process. Then we rank documents by the probability that a query would be observed as a random sample from the respective document model. That is, we rank according to P(q|d). Next: how do we compute P(q|d)?

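How to compute P(q|d)
We will make the same conditional independence assumption as for Naive Bayes:
$P(q|M_d) = P(\langle t_1, \ldots, t_{|q|} \rangle | M_d) = \prod_{1 \le k \le |q|} P(t_k|M_d)$
(|q|: length of q; tk: the token occurring at position k in q)
This is equivalent to:
$P(q|M_d) = \prod_{\text{distinct term } t \text{ in } q} P(t|M_d)^{tf_{t,q}}$
tft,q: term frequency (# occurrences) of t in q
Multinomial model (omitting constant factor)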

Parameter estimation
Missing piece: Where do the parameters P(t|Md) come from? Start with maximum likelihood estimates (as we did for Naive Bayes):
$\hat{P}(t|M_d) = \frac{tf_{t,d}}{|d|}$
(|d|: length of d; tft,d: # occurrences of t in d)
As in Naive Bayes, we have a problem with zeros. A single t with P(t|Md) = 0 will make P(q|Md) zero. We would give a single term “veto power”. For example, for query [Michael Jackson top hits] a document about “top songs” (but not using the word “hits”) would have P(t|Md) = 0 for t = “hits”. That’s bad. We need to smooth the estimates to avoid zeros.

Smoothing
Key intuition: a nonoccurring term is possible (even though it didn’t occur), but no more likely than would be expected by chance in the collection.
Notation: Mc: the collection model; cft: the number of occurrences of t in the collection; $T = \sum_t cf_t$: the total number of tokens in the collection.
We will use $\hat{P}(t|M_c) = cf_t / T$ to “smooth” P(t|d) away from zero.

Mixture model
P(t|d) = λP(t|Md) + (1 - λ)P(t|Mc)
Mixes the probability from the document with the general collection frequency of the word. High value of λ: “conjunctive-like” search, which tends to retrieve documents containing all query words. Low value of λ: more disjunctive, suitable for long queries. Correctly setting λ is very important for good performance.

Mixture model: Summary
$P(q|d) \propto \prod_{1 \le k \le |q|} \left( \lambda P(t_k|M_d) + (1 - \lambda) P(t_k|M_c) \right)$
What we model: the user has a document in mind and generates the query from this document. The equation represents the probability that the document that the user had in mind was in fact this one.

Example
Collection: d1 and d2
d1: Jackson was one of the most talented entertainers of all time
d2: Michael Jackson anointed himself King of Pop
Query q: Michael Jackson
Use mixture model with λ = 1/2
P(q|d1) = [(0/11 + 1/18)/2] · [(1/11 + 2/18)/2] ≈ 0.003
P(q|d2) = [(1/7 + 1/18)/2] · [(1/7 + 2/18)/2] ≈ 0.013
Ranking: d2 > d1
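The example can be reproduced with a small script. The function name `query_likelihood` is illustrative, tokenization is a plain whitespace split, and the collection model is estimated from the two documents only, as in the example.

```python
from collections import Counter

def query_likelihood(query, doc, collection, lam=0.5):
    """P(q|d) under the mixture model: lam * P(t|Md) + (1 - lam) * P(t|Mc) per query token."""
    doc_tf = Counter(doc)
    coll_tf = Counter(tok for d in collection for tok in d)
    total_tokens = sum(coll_tf.values())
    p = 1.0
    for t in query:
        p_doc = doc_tf[t] / len(doc)          # maximum likelihood estimate from the document
        p_coll = coll_tf[t] / total_tokens    # collection model used for smoothing
        p *= lam * p_doc + (1 - lam) * p_coll
    return p

d1 = "Jackson was one of the most talented entertainers of all time".split()
d2 = "Michael Jackson anointed himself King of Pop".split()
q = "Michael Jackson".split()

p_d1 = query_likelihood(q, d1, [d1, d2])
p_d2 = query_likelihood(q, d2, [d1, d2])
print(round(p_d1, 3), round(p_d2, 3))
```

Because of the collection term, d1 still gets a non-zero score even though it never mentions “Michael”.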

Exercise: Compute ranking
Collection: d1 and d2
d1: Xerox reports a profit but revenue is down
d2: Lucene narrows quarter loss but revenue decreases further
Query q: revenue down
Use mixture model with λ = 1/2
P(q|d1) = [(1/8 + 2/16)/2] · [(1/8 + 1/16)/2] = 1/8 · 3/32 = 3/256
P(q|d2) = [(1/8 + 2/16)/2] · [(0/8 + 1/16)/2] = 1/8 · 1/32 = 1/256
Ranking: d1 > d2

LMs vs. vector space model (1)
LMs have some things in common with vector space models. Term frequency is directly in the model, but it is not scaled in LMs. Probabilities are inherently “length-normalized”; cosine normalization does something similar for vector space. Mixing document and collection frequencies has an effect similar to idf: terms rare in the general collection, but common in some documents, will have a greater influence on the ranking.

LMs vs. vector space model (2)
LMs vs. vector space model: commonalities. Term frequency is directly in the model. Probabilities are inherently “length-normalized”. Mixing document and collection frequencies has an effect similar to idf.
LMs vs. vector space model: differences. LMs: based on probability theory. Vector space: based on similarity, a geometric / linear algebra notion. Collection frequency vs. document frequency. Details of term frequency, length normalization etc.

Language models for IR: Assumptions
Simplifying assumption: queries and documents are objects of the same type. Not true! There are other LMs for IR that do not make this assumption. The vector space model makes the same assumption.
Simplifying assumption: terms are conditionally independent. Again, the vector space model (and Naive Bayes) makes the same assumption.
Cleaner statement of assumptions than vector space; thus, better theoretical foundation than vector space … but “pure” LMs perform much worse than “tuned” LMs.

Relevance Using Hyperlinks
Number of documents relevant to a query can be enormous if only term frequencies are taken into account Using term frequencies makes “spamming” easy E.g., a travel agency can add many occurrences of the words “travel” to its page to make its rank very high Most of the time people are looking for pages from popular sites Idea: use popularity of Web site (e.g., how many people visit it) to rank site pages that match given keywords Problem: hard to find actual popularity of site Solution: next slide

Relevance Using Hyperlinks (Cont.)
Solution: use number of hyperlinks to a site as a measure of the popularity or prestige of the site Count only one hyperlink from each site (why? - see previous slide) Popularity measure is for site, not for individual page But, most hyperlinks are to root of site Also, concept of “site” difficult to define since a URL prefix like cs.yale.edu contains many unrelated pages of varying popularity Refinements When computing prestige based on links to a site, give more weight to links from sites that themselves have higher prestige Definition is circular Set up and solve system of simultaneous linear equations Above idea is basis of the Google PageRank ranking mechanism

PageRank in Google To be simple or to be useful?

PageRank in Google (Cont’)
[Figure: example link graph with pages such as B and I2]
Assign a numeric value to each page. The more a page is referred to by important pages, the more important this page is. d: damping factor (0.85). Many other criteria, e.g. proximity of query words: “…information retrieval …” is better than “… information … retrieval …”.
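The circular definition of importance can be resolved by simple iteration. The sketch below uses one widely quoted normalized form, PR(p) = (1 - d)/N + d · Σ PR(q)/outdegree(q) over the pages q that link to p. That form, the handling of pages without out-links, the fixed iteration count, and the four-page link graph are all assumptions made for illustration.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterative PageRank over a dict {page: [pages it links to]} (assumed standard form)."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - d) / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                # A page passes an equal share of its damped rank to each page it links to.
                share = d * rank[p] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Assumption: a page without out-links spreads its rank over all pages.
                for t in pages:
                    new_rank[t] += d * rank[p] / n
        rank = new_rank
    return rank

# Hypothetical link graph: I1 and I2 point into A and B, which point at each other.
links = {"I1": ["A"], "I2": ["A", "B"], "A": ["B"], "B": ["A"]}
rank = pagerank(links)
print(rank)
```

In this toy graph A ends up slightly ahead of B because it receives more of the incoming links, while I1 and I2, which nobody links to, keep only the baseline (1 - d)/N.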

Relevance Using Hyperlinks (Cont.)
Connections to social networking theories that ranked prestige of people E.g., the president of the U.S.A has a high prestige since many people know him Someone known by multiple prestigious people has high prestige Hub and authority based ranking A hub is a page that stores links to many pages (on a topic) An authority is a page that contains actual information on a topic Each page gets a hub prestige based on prestige of authorities that it points to Each page gets an authority prestige based on prestige of hubs that point to it Again, prestige definitions are cyclic, and can be got by solving linear equations Use authority prestige when ranking answers to a query

HITS: Hubs and authorities

HITS update rules
A: link matrix; h: vector of hub scores; a: vector of authority scores
HITS algorithm: Compute h = Aa. Compute a = A^T h. Iterate until convergence. Output (i) list of hubs ranked according to hub score and (ii) list of authorities ranked according to authority score.
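The two update rules translate directly into code. This sketch uses plain Python lists for the link matrix, runs a fixed number of iterations instead of testing for convergence, and rescales both vectors to unit length after each round; the rescaling is an assumption added to keep the scores bounded, and the four-page link matrix is hypothetical.

```python
import math

def _normalize(v):
    """Scale a vector to unit Euclidean length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else v

def hits(A, iterations=50):
    """Iterate h = A a and a = A^T h, where A[i][j] = 1 if page i links to page j."""
    n = len(A)
    a = [1.0] * n
    h = [1.0] * n
    for _ in range(iterations):
        h = _normalize([sum(A[i][j] * a[j] for j in range(n)) for i in range(n)])  # h = A a
        a = _normalize([sum(A[i][j] * h[i] for i in range(n)) for j in range(n)])  # a = A^T h
    return h, a

# Hypothetical graph: page 0 links to pages 2 and 3, page 1 links to page 2.
A = [
    [0, 0, 1, 1],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
h, a = hits(A)
print("hub scores:", h)
print("authority scores:", a)
```

Page 0 comes out as the best hub and page 2 as the best authority, which matches the intuition that good hubs point to good authorities and vice versa.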

Keyword Search
Simplest notion of relevance is that the query string appears verbatim in the document. Slightly less strict notion is that the words in the query appear frequently in the document, in any order (bag of words).

Problems with Keywords
May not retrieve relevant documents that include synonymous terms: “restaurant” vs. “café”; “PRC” vs. “China”.
May retrieve irrelevant documents that include ambiguous terms: “bat” (baseball vs. mammal); “Apple” (company vs. fruit); “bit” (unit of data vs. act of eating).

Query Expansion Most errors caused by vocabulary mismatch query: “cars”, document: “automobiles” solution: automatically add highly-related words Thesaurus / WordNet lookup: add semantically-related words (synonyms) cannot take context into account: “rail car” vs. “race car” vs. “car and cdr” Statistical Expansion: add statistically-related words (co-occurrence) very successful

Indri Query Examples
<parameters><query>#combine( #weight( #1(explosion) #1(blast) #1(wounded) #1(injured) #1(death) #1(deaths)) #weight( #1(Davao City international airport) #1(Tuesday) #1(DAVAO) #1(Philippines) #1(DXDC) #1(Davao Medical Center)))</query></parameters>

Synonyms and Homonyms
Synonyms: e.g., document: “motorcycle repair”, query: “motorcycle maintenance”. Need to realize that “maintenance” and “repair” are synonyms. System can extend query as “motorcycle and (repair or maintenance)”.
Homonyms: e.g., “object” has different meanings as noun/verb. Can disambiguate meanings (to some extent) from the context.
Extending queries automatically using synonyms can be problematic: need to understand intended meaning in order to infer synonyms, or verify synonyms with user; synonyms may have other meanings as well.

Concept-Based Querying
Approach For each word, determine the concept it represents from context Use one or more ontologies: Hierarchical structure showing relationship between concepts E.g., the ISA relationship that we saw in the E-R model This approach can be used to standardize terminology in a specific field Ontologies can link multiple languages Foundation of the Semantic Web (not covered here)

Indexing of Documents
An inverted index maps each keyword Ki to a set of documents Si that contain the keyword. Documents are identified by identifiers. The inverted index may record keyword locations within a document, to allow proximity-based ranking, and counts of the number of occurrences of a keyword, to compute TF.
and operation: finds documents that contain all of K1, K2, ..., Kn: the intersection S1 ∩ S2 ∩ ... ∩ Sn.
or operation: finds documents that contain at least one of K1, K2, …, Kn: the union S1 ∪ S2 ∪ ... ∪ Sn.
Each Si is kept sorted to allow efficient intersection/union by merging. “not” can also be efficiently implemented by merging of sorted lists.

Indexing of Documents Goal = Find the important meanings and create an internal representation Factors to consider: Accuracy to represent meanings (semantics) Exhaustiveness (cover all the contents) Facility for computer to manipulate What is the best representation of contents? Char. string (char trigrams): not precise enough Word: good coverage, not precise Phrase: poor coverage, more precise Concept: poor coverage, precise Coverage (Recall) Accuracy (Precision) String Word Phrase Concept

Indexer steps
Sequence of (Modified token, Document ID) pairs.
Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious

Multiple term entries in a single document are merged.
Frequency information is added.
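The indexer steps (emit pairs, sort, merge duplicates, add frequencies) can be sketched as follows. The tokenizer is a deliberately crude assumption (lowercase, letters and apostrophes only) and the function name `build_index` is illustrative; the two documents are the Shakespeare lines quoted above.

```python
import re
from collections import Counter, defaultdict

def build_index(docs):
    """Build {term: [(doc_id, term_frequency), ...]} with postings sorted by doc_id."""
    # Step 1: sequence of (modified token, document ID) pairs.
    pairs = []
    for doc_id, text in docs.items():
        for token in re.findall(r"[a-z']+", text.lower()):
            pairs.append((token, doc_id))
    # Step 2: sort by term, then by document ID.
    pairs.sort()
    # Step 3: merge multiple entries of a term in one document and record the frequency.
    index = defaultdict(list)
    for (token, doc_id), freq in Counter(pairs).items():  # Counter keeps the sorted order
        index[token].append((doc_id, freq))
    return dict(index)

docs = {
    1: "I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious",
}
index = build_index(docs)
print(index["brutus"], index["caesar"], index["killed"])
```

Each postings list comes out sorted by document ID, which is what the merge-based intersection of postings lists relies on.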

An example

Stopwords / Stoplist
Function words do not bear useful information for IR: of, in, about, with, I, although, …
Stoplist: contains stopwords, not to be used as index terms: prepositions; articles; pronouns; some adverbs and adjectives; some frequent words (e.g. document).
The removal of stopwords usually improves IR effectiveness. A few “standard” stoplists are commonly used.

Stemming
Reason: different word forms may bear similar meaning (e.g. search, searching): create a “standard” representation for them.
Stemming: removing some endings of words.
computer, compute, computes, computing, computed, computation → comput

Lemmatization
Transform to standard form according to syntactic category, e.g. verb + ing → verb; noun + s → noun. Needs POS tagging. More accurate than stemming, but needs more resources.
It is crucial to choose stemming/lemmatization rules well: noise vs. recognition rate, i.e. a compromise between precision and recall.
Light/no stemming: -recall, +precision. Severe stemming: +recall, -precision.

Simple conjunctive query (two terms)
Consider the query: BRUTUS AND CALPURNIA. To find all matching documents using the inverted index: locate BRUTUS in the dictionary; retrieve its postings list from the postings file; locate CALPURNIA in the dictionary; retrieve its postings list from the postings file; intersect the two postings lists; return the intersection to the user.

Intersecting two posting lists
This is linear in the length of the postings lists. Note: This only works if postings lists are sorted.
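A merge-style intersection walks both sorted lists with one pointer each and advances whichever pointer sits on the smaller document ID. The sketch below is one way to write it; the postings lists for BRUTUS and CALPURNIA are made-up document IDs for illustration.

```python
def intersect(p1, p2):
    """Intersect two postings lists that are sorted by document ID."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])  # document contains both terms
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1                # advance the pointer on the smaller ID
        else:
            j += 1
    return answer

# Hypothetical postings lists.
brutus = [1, 2, 4, 11, 31, 45, 173, 174]
calpurnia = [2, 31, 54, 101]
print(intersect(brutus, calpurnia))
```

Each step advances at least one pointer, so the loop runs at most len(p1) + len(p2) times. On unsorted lists the walk would skip matches, which is why the postings must be kept sorted.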

Does Google use the Boolean model?
On Google, the default interpretation of a query [w1 w2 . . . wn] is w1 AND w2 AND . . . AND wn. Cases where you get hits that do not contain one of the wi: anchor text; page contains a variant of wi (morphology, spelling correction, synonym); long queries (n large); the Boolean expression generates very few hits.
Simple Boolean vs. ranking of result set: simple Boolean retrieval returns matching documents in no particular order. Google (and most well designed Boolean engines) rank the result set: they rank good hits (according to some estimator of relevance) higher than bad hits.

IR Evaluation
Efficiency: time, space.
Effectiveness: How well is a system capable of retrieving relevant documents? Is a system better than another one?
Metrics often used (together): Precision = retrieved relevant docs / retrieved docs; Recall = retrieved relevant docs / relevant docs.
[Diagram: overlapping sets of relevant and retrieved documents; their intersection is the set of retrieved relevant documents]

IR Evaluation (Cont’) Information-retrieval systems save space by using index structures that support only approximate retrieval. May result in: false negative (false drop) - some relevant documents may not be retrieved. false positive - some irrelevant documents may be retrieved. For many applications a good index should not permit any false drops, but may permit a few false positives. Relevant performance metrics: precision - what percentage of the retrieved documents are relevant to the query. recall - what percentage of the documents relevant to the query were retrieved.

IR Evaluation (Cont’) Recall vs. precision tradeoff:
Can increase recall by retrieving many documents (down to a low level of relevance ranking), but many irrelevant documents would be fetched, reducing precision Measures of retrieval effectiveness: Recall as a function of number of documents fetched, or Precision as a function of recall Equivalently, as a function of number of documents fetched E.g., “precision of 75% at recall of 50%, and 60% at a recall of 75%” Problem: which documents are actually relevant, and which are not

General form of precision/recall
Precision change w.r.t. Recall (not a fixed point) Systems cannot compare at one Precision/Recall point Average precision (on 11 points of recall: 0.0, 0.1, …, 1.0)

An illustration of P/R calculation
List Rel? Doc1 Y Doc2 Doc3 Doc4 Doc5 … Assume: 5 relevant docs.

MAP (Mean Average Precision)
$MAP = \frac{1}{n} \sum_{Q_i} \frac{1}{|R_i|} \sum_{D_j \in R_i} \frac{j}{r_{ij}}$
rij = rank of the j-th relevant document for Qi; |Ri| = # rel. doc. for Qi; n = # test queries
E.g. Rank: st rel. doc. 5 8 2nd rel. doc. 10 3rd rel. doc.
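With those definitions, average precision for one query sums j / rij over its relevant documents and divides by |Ri|; MAP then averages over the n queries. The sketch below follows that reading. The function names and the two queries with their ranks are hypothetical and are not meant to reproduce the example above.

```python
def average_precision(relevant_ranks, n_relevant):
    """(1/|Ri|) * sum over retrieved relevant docs of j / r_ij, with ranks starting at 1."""
    ranks = sorted(relevant_ranks)
    return sum(j / r for j, r in enumerate(ranks, start=1)) / n_relevant

def mean_average_precision(queries):
    """queries: list of (ranks of the relevant docs found, total number of relevant docs)."""
    return sum(average_precision(ranks, n_rel) for ranks, n_rel in queries) / len(queries)

# Hypothetical results: query 1 finds its 3 relevant docs at ranks 1, 5 and 10;
# query 2 finds its 2 relevant docs at ranks 4 and 8.
queries = [([1, 5, 10], 3), ([4, 8], 2)]
map_score = mean_average_precision(queries)
print(round(map_score, 4))
```

A relevant document that is never retrieved contributes nothing to the sum but still counts in |Ri|, so missing documents lower the score.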

Some other measures
Noise = retrieved irrelevant docs / retrieved docs; Silence = non-retrieved relevant docs / relevant docs; Noise = 1 – Precision; Silence = 1 – Recall.
Fallout = retrieved irrel. docs / irrel. docs.
Single value measures: F-measure = 2 P * R / (P + R); Average precision = average at 11 points of recall; Precision at n documents (often used for Web IR); Expected search length (no. irrelevant documents to read before obtaining n relevant doc.)
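The set-based measures can all be computed from three quantities: the retrieved set, the relevant set, and the collection size. The function name, the dictionary keys, and the document IDs below are illustrative choices.

```python
def evaluate(retrieved, relevant, collection_size):
    """Set-based effectiveness measures for a single query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # retrieved relevant docs
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    irrelevant = collection_size - len(relevant)
    fallout = (len(retrieved) - hits) / irrelevant if irrelevant else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "noise": 1 - precision,
        "silence": 1 - recall,
        "fallout": fallout,
        "f_measure": f_measure,
    }

# Hypothetical query: 4 docs retrieved, 5 relevant docs, collection of 20 docs.
result = evaluate(retrieved=[1, 2, 3, 4], relevant=[2, 4, 5, 6, 7], collection_size=20)
print(result)
```

Here two of the four retrieved documents are relevant, giving precision 0.5 and recall 0.4; the F-measure sits between the two, closer to the lower value.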

Interactive system’s evaluation
Definition: Evaluation = the process of systematically collecting data that informs us about what it is like for a particular user or group of users to use a product/system for a particular task in a certain type of environment. Most of this is typically taught in HCI or Human Factors courses.

Problems Attitudes: Designers assume that if they and their colleagues can use the system and find it attractive, others will too Features vs. usability or security Executives want the product on the market yesterday Problems “can” be addressed in versions 1.x Consumers accept low levels of usability “I’m so silly” The photocopier story

Two main types of evaluation
Formative evaluation is done at different stages of development to check that the product meets users’ needs. Part of the user-centered design approach Supports design decisions at various stages May test parts of the system or alternative designs Summative evaluation assesses the quality of a finished product. May test the usability or the output quality May compare competing systems

What to evaluate Iterative design & evaluation is a continuous process that examines: Early ideas for conceptual model Early prototypes of the new system Later, more complete prototypes Designers need to check that they understand users’ requirements and that the design assumptions hold.

Four evaluation paradigms
‘quick and dirty’; usability testing; field studies; predictive evaluation

Quick and dirty ‘quick & dirty’ evaluation describes the common practice in which designers informally get feedback from users or consultants to confirm that their ideas are in-line with users’ needs and are liked. Quick & dirty evaluations are done any time. The emphasis is on fast input to the design process rather than carefully documented findings.

Usability testing Usability testing involves recording typical users’ performance on typical tasks in controlled settings. Field observations may also be used. As the users perform these tasks they are watched & recorded on video & their key presses are logged. This data is used to calculate performance times, identify errors & help explain why the users did what they did. User satisfaction questionnaires & interviews are used to elicit users’ opinions.

Usability testing It is very time consuming to conduct and analyze
Explain the system, do some training Explain the task, do a mock task Questionnaires before and after the test & after each task Pilot test is usually needed Insufficient number of subjects for ‘proper’ statistical analysis In laboratory conditions, subjects do not behave exactly like in a normal environment

Field studies Field studies are done in natural settings
The aim is to understand what users do naturally and how technology impacts them. In product design field studies can be used to: - identify opportunities for new technology - determine design requirements - decide how best to introduce new technology - evaluate technology in use

Predictive evaluation
Experts apply their knowledge of typical users, often guided by heuristics, to predict usability problems. Another approach involves theoretically based models. A key feature of predictive evaluation is that users need not be present Relatively quick & inexpensive

The TREC experiments Once per year
A set of documents and queries are distributed to the participants (the standard answers are unknown) (April) Participants work (very hard) to construct, fine-tune their systems, and submit the answers (1000/query) at the deadline (July) NIST people manually evaluate the answers and provide correct answers (and classification of IR systems) (July – August) TREC conference (November)

TREC evaluation methodology
Known document collection (>100K) and query set (50) Submission of 1000 documents for each query by each participant Merge 100 first documents of each participant -> global pool Human relevance judgment of the global pool The other documents are assumed to be irrelevant Evaluation of each system (with 1000 answers) Partial relevance judgments But stable for system ranking

Tracks (tasks) Ad Hoc track: given document collection, different topics Routing (filtering): stable interests (user profile), incoming document flow CLIR: Ad Hoc, but with queries in a different language Web: a large set of Web pages Question-Answering: When did Nixon visit China? Interactive: put users into action with system Spoken document retrieval Image and video retrieval Information tracking: new topic / follow up

CLEF and NTCIR
CLEF = Cross-Language Evaluation Forum
- For European languages
- Organized by Europeans
- One cycle per year (March – Oct.)
NTCIR:
- Organized by NII (Japan)
- For Asian languages
- Cycle of 1.5 years

Impact of TREC
Provide large collections for further experiments
Compare different systems/techniques on realistic data
Develop new methodology for system evaluation
Similar experiments are organized in other areas (NLP, machine translation, summarization, …)

IR on the Web
No stable document collection (spider, crawler)
Invalid documents, duplication, etc.
Huge number of documents (partial collection)
Multimedia documents
Great variation of document quality
Multilingual problem
…

Web Search
Application of IR to HTML documents on the World Wide Web.
Differences:
- Must assemble document corpus by spidering the web.
- Can exploit the structural layout information in HTML (XML).
- Documents change uncontrollably.
- Can exploit the link structure of the web.

Web Search System
[Figure: a Web spider assembles the document corpus; the IR system takes a query string and the corpus and returns ranked documents (1. Page1, 2. Page2, 3. Page3, …).]

Challenges
Scale, distribution of documents
Controversy over the unit of indexing
- What is a document? (hypertext)
- What does the user expect to be retrieved?
High heterogeneity
- Document structure, size, quality, level of abstraction / specialization
- User search or domain expertise, expectations
Retrieval strategies
- What do people want?
Evaluation

Web documents / data
No traditional collection
- Huge: time and space to crawl and index; IRSs cannot store copies of documents
- Dynamic, volatile, anarchic, un-controlled
- Homogeneous sub-collections
Structure
- In documents (un-/semi-/fully-structured)
- Between docs: network of inter-connected nodes
- Hyper-links: conceptual vs. physical documents

Web documents / data
Mark-up
- HTML – look & feel
- XML – structure, semantics
- Dublin Core Metadata
- Can webpage authors be trusted to correctly mark up / index their pages?
Multi-lingual documents
Multi-media

Theoretical models for indexing / searching
Content-based weighting
- As in traditional IRS, but trying to incorporate hyperlinks and the dynamic nature of the Web (page validity, page caching)
Link-based weighting
- Quality of webpages
- Hubs & authorities
- Bookmarked pages
- Iterative estimation of quality
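The “hubs & authorities” idea with iterative estimation of quality is usually associated with Kleinberg's HITS algorithm; the slides do not name it, so the sketch below is an assumption about what is meant. The function name `hits`, the link-graph format (a dict from page to the pages it links to), and the fixed iteration count are all illustrative choices.

```python
import math


def hits(links, iterations=50):
    """Estimate hub and authority scores by repeated mutual reinforcement.

    `links` maps each page to the list of pages it links to
    (an assumed input format for this sketch).
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}

    for _ in range(iterations):
        # A page is a good authority if good hubs link to it.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                auth[dst] += hub[src]
        norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        auth = {p: v / norm for p, v in auth.items()}

        # A page is a good hub if it links to good authorities.
        hub = {p: sum(auth[d] for d in links.get(p, [])) for p in pages}
        norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        hub = {p: v / norm for p, v in hub.items()}

    return hub, auth


# Two pages pointing at the same target: the target becomes the authority.
hub, auth = hits({"a": ["c"], "b": ["c"], "c": []})
```

In practice such scores are computed on a query-specific subgraph rather than on the whole Web, and they are combined with content-based weights rather than used alone.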

Architecture
Centralized
- Main server contains the index, built by an indexer, searched by a query engine
- Advantage: control, easy update
- Disadvantage: system requirements (memory, disk, safety/recovery)
Distributed
- Brokers & gatherers
- Advantage: flexibility, load balancing, redundancy
- Disadvantage: software complexity, update

User variability
Power and flexibility for expert users vs. intuitiveness and ease of use for novice users
Multi-modal user interface
- Distinguish between experts and beginners, offer distinct interfaces (functionality)
- Advantage: can make assumptions on users
- Disadvantage: habit formation, cognitive shift
Uni-modal interface
- Make essential functionality obvious
- Make advanced functionality accessible

Search strategies
Web directories
Query-based searching
Link-based browsing (provided by the browser, not the IRS)
“More like this”
Known site (bookmarking)
A combination of the above

Support for Relevance Feedback
RF can improve search effectiveness … but is rarely used
Voluntary vs. forced feedback
At document vs. word level
“Magic” vs. control

Some techniques to improve IR effectiveness
Interaction with user (relevance feedback)
- Keywords only cover part of the contents
- User can help by indicating relevant/irrelevant documents
The use of relevance feedback
- To improve query expression: $Q_{new} = \alpha \cdot Q_{old} + \beta \cdot \mathit{Rel}_d - \gamma \cdot \mathit{NRel}_d$
- where $\mathit{Rel}_d$ = centroid of relevant documents, $\mathit{NRel}_d$ = centroid of non-relevant documents, and $\alpha$, $\beta$, $\gamma$ are weighting coefficients
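The query update above (the Rocchio formula) can be sketched as follows, with queries and documents represented as sparse term-weight dicts. The function names, the default coefficient values, and the dropping of terms whose weight becomes non-positive are assumptions made for this sketch, not values given in the slides.

```python
def centroid(vectors):
    """Average a list of sparse term-weight vectors (dicts)."""
    if not vectors:
        return {}
    total = {}
    for vec in vectors:
        for term, weight in vec.items():
            total[term] = total.get(term, 0.0) + weight
    return {term: weight / len(vectors) for term, weight in total.items()}


def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Return Q_new = alpha*Q_old + beta*Rel_d - gamma*NRel_d.

    alpha, beta and gamma defaults are illustrative only. Terms whose
    new weight is not positive are dropped (an assumed convention).
    """
    rel_c = centroid(relevant)
    nrel_c = centroid(non_relevant)
    new_query = {}
    for term in set(query) | set(rel_c) | set(nrel_c):
        weight = (alpha * query.get(term, 0.0)
                  + beta * rel_c.get(term, 0.0)
                  - gamma * nrel_c.get(term, 0.0))
        if weight > 0:
            new_query[term] = weight
    return new_query


q_new = rocchio(
    query={"jaguar": 1.0},
    relevant=[{"jaguar": 1.0, "cat": 1.0}, {"cat": 1.0}],
    non_relevant=[{"car": 1.0}],
)
```

The new query keeps "jaguar", gains "cat" from the relevant documents, and never picks up "car", which only occurs in the non-relevant document.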

Modified relevance feedback
Users usually do not cooperate (e.g. AltaVista in early years)
Pseudo-relevance feedback (Blind RF)
- Using the top-ranked documents as if they are relevant: select m terms from n top-ranked documents
- One can usually obtain about 10% improvement
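A minimal sketch of blind relevance feedback, assuming documents are already tokenized and ranked by a first retrieval pass. The slides only say “select m terms from n top-ranked documents”; ranking candidate terms by how many of the top n documents contain them is an assumption of this sketch, as is the function name `expand_query`.

```python
from collections import Counter


def expand_query(query_terms, ranked_docs, n=10, m=5):
    """Add m terms taken from the n top-ranked documents to the query.

    `ranked_docs` is a list of token lists, best match first.
    Candidate terms are scored by the number of top-n documents
    containing them (an assumed selection criterion).
    """
    counts = Counter()
    for doc in ranked_docs[:n]:
        counts.update(set(doc))
    for term in query_terms:
        counts.pop(term, None)  # do not re-add terms already in the query
    # Sort by count, then alphabetically, so ties are resolved deterministically.
    best = sorted(counts, key=lambda term: (-counts[term], term))[:m]
    return list(query_terms) + best


ranked_docs = [
    ["apple", "pie", "recipe"],
    ["apple", "pie", "crust"],
    ["car", "engine"],
]
expanded = expand_query(["apple"], ranked_docs, n=2, m=2)
```

The expanded query is then run again; no user judgment is involved at any point, which is why the technique is called “blind”.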

Term clustering
Based on ‘similarity’ between terms
- Collocation in documents, paragraphs, sentences
Based on document clustering
- Terms specific for bottom-level document clusters are assumed to represent a topic
Use
- Thesauri
- Query expansion
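One way to turn collocation into a term ‘similarity’ is to compare the sets of text units (documents, paragraphs or sentences) in which two terms occur. The sketch below uses the Dice coefficient for that comparison; the slides do not specify a measure, so the coefficient and the function name `term_similarity` are assumptions.

```python
from collections import defaultdict
from itertools import combinations


def term_similarity(units):
    """Dice coefficient between terms, based on co-occurrence in text units.

    `units` is a list of token lists; each list is one document,
    paragraph or sentence.
    """
    occurs = defaultdict(set)
    for index, tokens in enumerate(units):
        for term in set(tokens):
            occurs[term].add(index)

    sims = {}
    for a, b in combinations(sorted(occurs), 2):
        shared = len(occurs[a] & occurs[b])
        if shared:
            sims[(a, b)] = 2 * shared / (len(occurs[a]) + len(occurs[b]))
    return sims


sims = term_similarity([
    ["web", "search"],
    ["web", "search", "engine"],
    ["engine", "car"],
])
```

Pairs with high similarity can be grouped into clusters, and the cluster members of a query term then become candidates for query expansion.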

User modelling
Build a model / profile of the user by recording
- the ‘context’
- topics of interest
- preferences
based on interpreting his/her actions:
- Implicit or explicit relevance feedback
- Recommendations from ‘peers’
- Customization of the environment

Personalised systems
Information filtering
- Ex: in a TV guide only show programs of interest
Use user model to disambiguate queries
- Query expansion
Update the model continuously
Customize the functionality and the look-and-feel of the system
- Ex: skins; remember the levels of the user interface
