
1 Chapter 1: Information Retrieval & Visualization

2 Outline
What is the IR problem?
How to organize an IR system? (Or: the main processes in IR)
Tokenizing
Indexing
Retrieval
System evaluation
Some current research topics

3 Information Retrieval
Information Retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).

4 IR vs. databases: Structured vs unstructured data
Structured data tends to refer to information in "tables":

Employee   Manager   Salary
Smith      Jones     50000
Chang      Smith     60000
Ivy        Smith     50000

Typically allows numerical range and exact match (for text) queries, e.g., Salary < 60000 AND Manager = Smith.

5 Unstructured data
Typically refers to free text. Allows:
Keyword queries, including operators
More sophisticated "concept" queries, e.g., find all web pages dealing with drug abuse

6 Unstructured (text) vs. structured (database) data in 1996

7 Unstructured (text) vs. structured (database) data in 2009

8 IR problem First applications: in libraries (1950s)
ISBN:
Author: Salton, Gerard
Title: Automatic text processing: the transformation, analysis, and retrieval of information by computer
Publisher: Addison-Wesley
Date: 1989
Content: <Text>
A bibliographic record has external attributes and an internal attribute (the content).
Search by external attributes = search in a DB; IR = search by content.

9 The problem of IR
Goal = find documents relevant to an information need from a large document set.
(Diagram: an information need is expressed as a query; the IR system retrieves from the document collection and returns an answer list.)

10 Example Google Web

11 The process of retrieving information

12 The process of retrieving information
1. Generating a logical view of the documents.
2. Building an index of the text database.
3. Initiating the retrieval process by specifying a user need.
4. Generating a logical view of the user need (the query).
5. Retrieving documents by processing the query.
6. Ranking documents by relevance score.
7. Initiating a user feedback cycle (modifying the user need / modifying the query).

13 Basic assumptions of Information Retrieval
Collection: a fixed set of documents.
Goal: retrieve documents with information that is relevant to the user's information need and helps the user complete a task.

14 Generating logical view of documents Inverted index construction
Documents to be indexed: "Friends, Romans, countrymen."
Tokenizer → token stream: Friends, Romans, Countrymen (more on these later).
Linguistic modules → modified tokens: friend, roman, countryman.
Indexer → inverted index: each term mapped to its postings list of docIDs.
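To make this pipeline concrete, here is a minimal Python sketch of a tokenizer → linguistic modules → indexer chain; the toy documents, the crude suffix stripping, and the function names are illustrative assumptions, not part of the original slides.

```python
import re
from collections import defaultdict

def tokenize(text):
    # Tokenizer: split on non-letters and drop empty strings.
    return [tok for tok in re.split(r"[^A-Za-z]+", text) if tok]

def normalize(token):
    # Stand-in for the linguistic modules: case-fold and crudely strip a trailing 's'.
    token = token.lower()
    return token[:-1] if token.endswith("s") else token

def build_inverted_index(docs):
    # Indexer: map each normalized term to the sorted list of docIDs containing it.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for tok in tokenize(text):
            index[normalize(tok)].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {1: "Friends, Romans, countrymen.", 2: "Romans count friends."}
print(build_inverted_index(docs))
# {'friend': [1, 2], 'roman': [1, 2], 'countrymen': [1], 'count': [2]}
```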

15 Generating logical view of documents Parsing a document
What format is it in? pdf/word/excel/html? What language is it in? What character set is in use? Each of these is a classification problem. But these tasks are often done heuristically …

16 Generating logical view of documents Tokenization
Input: "Friends, Romans and Countrymen"
Output: tokens Friends, Romans, Countrymen.
A token is an instance of a sequence of characters.
Each such token is now a candidate for an index entry, after further processing (described below).
But what are valid tokens to emit?

17 Generating logical view of documents Tokenization
Stop words: with a stop list, you exclude the commonest words from the dictionary entirely. Intuition: they have little semantic content: the, a, and, to, be, ...
Normalization to terms: we need to "normalize" words in the indexed text as well as query words into the same form, e.g., we want to match U.S.A. and USA.

18 Generating logical view of documents Tokenization
Case folding: reduce all letters to lower case. Exception: upper case in mid-sentence? e.g., General Motors; Fed vs. fed; SAIL vs. sail.
Thesauri and soundex: car = automobile, color = colour. What about spelling mistakes? One approach is Soundex, which forms equivalence classes of words based on phonetic heuristics.

19 Generating logical view of documents Tokenizing
Lemmatization: reduce inflectional/variant forms to the base form, e.g.,
am, are, is → be
car, cars, car's, cars' → car
"the boy's cars are different colors" → "the boy car be different color"

20 Generating logical view of documents Tokenizing
Stemming: reduce terms to their "roots" before indexing.
"Stemming" suggests crude affix chopping; it is language dependent.
e.g., automate(s), automatic, automation are all reduced to automat.
A stemmed sentence: "for example compressed and compression are both accepted as equivalent to compress" becomes "for exampl compress and compress ar both accept as equival to compress".

21 Generating logical view of documents Tokenizing
Porter's algorithm: the commonest algorithm for stemming English.
Suffix rules: sses → ss, ies → i, ational → ate, tional → tion.
Weight-of-word sensitive rules: (m > 1) EMENT → (null), so replacement → replac, but cement → cement (unchanged).
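If the NLTK package is available (an assumption; it is not referenced by the slides), its PorterStemmer implements these rules; a minimal sketch:

```python
# Assumes NLTK is installed: pip install nltk
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["caresses", "ponies", "replacement", "cement"]:
    print(word, "->", stemmer.stem(word))
# e.g. caresses -> caress, ponies -> poni, replacement -> replac, cement -> cement
```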

22 Building an index of the text database Term-document incidence matrix
A matrix entry is 1 if the play contains the word, 0 otherwise. Example query: Brutus AND Caesar BUT NOT Calpurnia.

23 Building an index of the text database Bigger collections
Consider N = 1 million documents, each with about 1000 words.
At an average of 6 bytes/word (including spaces/punctuation), that is about 6 GB of data in the documents.
Say there are M = 500K distinct terms among these.

24 Building an index of the text database Can’t build the matrix
A 500K x 1M matrix has half a trillion 0's and 1's, but it has no more than one billion 1's (the collection contains only 1M × 1000 = 10^9 token occurrences), so the matrix is extremely sparse.
What's a better representation? We record only the positions of the 1's. Why?

25 Building an index of the text database Inverted index
For each term t, we must store a list of all documents that contain t, identifying each by a docID (a document serial number).
Can we use fixed-size arrays for this?
Brutus → 1, 2, 4, 11, 31, 45, 173, 174
Caesar → 1, 2, 4, 5, 6, 16, 57, 132
Calpurnia → 2, 31, 54, 101
What happens if the word Caesar is added to document 14?

26 Building an index of the text database Inverted index
We need variable-size postings lists.
On disk, a contiguous run of postings is normal and best.
In memory, we can use linked lists or variable-length arrays; there are tradeoffs in size and ease of insertion.
Linked lists are generally preferred to arrays: dynamic space allocation and easy insertion of new postings, at the cost of the space overhead of pointers.
The index has two parts: the dictionary (Brutus, Caesar, Calpurnia, ...) and the postings lists (e.g., Brutus → 1, 2, 4, 11, 31, 45, 173, 174), sorted by docID (more later on why).

27 Building an index of the text database Indexer steps: Token sequence
Sequence of (modified token, document ID) pairs.
Doc 1: I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.

28 Building an index of the text database Indexer steps: Sort
Sort by terms, and then by docID. This is the core indexing step.

29 Building an index of the text database Indexer steps: Dictionary & Postings
Multiple entries of a term in a single document are merged.
The result is split into a Dictionary and Postings.
Document frequency information is added. Why frequency? We will discuss this later.
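A minimal Python sketch of these indexer steps (the helper name and the tiny toy input are assumptions): sort the (term, docID) pairs, merge duplicates, and split the result into a dictionary carrying document frequencies plus the postings lists.

```python
from itertools import groupby

def index_from_pairs(pairs):
    """pairs: iterable of (term, docID) tuples produced by tokenization."""
    # Core indexing step: sort by term, then by docID.
    pairs = sorted(pairs)
    dictionary, postings = {}, {}
    for term, group in groupby(pairs, key=lambda p: p[0]):
        # Merge multiple entries of the same term in the same document.
        doc_ids = sorted({doc_id for _, doc_id in group})
        dictionary[term] = len(doc_ids)   # document frequency (df)
        postings[term] = doc_ids          # postings list, sorted by docID
    return dictionary, postings

pairs = [("i", 1), ("did", 1), ("enact", 1), ("caesar", 1), ("caesar", 1),
         ("caesar", 2), ("brutus", 1), ("brutus", 2)]
print(index_from_pairs(pairs))
# ({'brutus': 2, 'caesar': 2, 'did': 1, 'enact': 1, 'i': 1},
#  {'brutus': [1, 2], 'caesar': [1, 2], 'did': [1], 'enact': [1], 'i': [1]})
```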

30 Building an index of the text database Where do we pay in storage?
We pay for the lists of docIDs (the postings), for the terms and their counts (the dictionary), and for pointers.
How do we index efficiently? How much storage do we need?

31 An example

32 Retrieval The index we just built
How do we process a query? What kinds of queries can we process?

33 Term-document incidence
A matrix entry is 1 if the play contains the word, 0 otherwise. Query: Brutus AND Caesar BUT NOT Calpurnia.

34 Incidence vectors So we have a 0/1 vector for each term.
To answer the query: take the vectors for Brutus, Caesar and Calpurnia (complemented) and bitwise AND them.
A 1 in the resulting vector means the corresponding play satisfies the condition.
Here plays 1 and 4 (Antony and Cleopatra, and Hamlet) satisfy the query.

35 Answers to query Antony and Cleopatra, Act III, Scene ii
Agrippa [Aside to DOMITIUS ENOBARBUS]: Why, Enobarbus, When Antony found Julius Caesar dead, He cried almost to roaring; and he wept When at Philippi he found Brutus slain.
Hamlet, Act III, Scene ii
Lord Polonius: I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me.

36 Query processing: AND Consider processing the query: Brutus AND Caesar
Locate Brutus in the dictionary and retrieve its postings.
Locate Caesar in the dictionary and retrieve its postings.
"Merge" (intersect) the two postings lists:
Brutus → 2, 4, 8, 16, 32, 64, 128
Caesar → 1, 2, 3, 5, 8, 13, 21, 34

37 The merge
Walk through the two postings lists simultaneously, in time linear in the total number of postings entries.
Brutus → 2, 4, 8, 16, 32, 64, 128
Caesar → 1, 2, 3, 5, 8, 13, 21, 34
Intersection → 2, 8
If the list lengths are x and y, the merge takes O(x+y) operations. Crucial: postings are sorted by docID.
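The merge can be coded as a two-pointer walk over the sorted lists; a sketch (the function name is an assumption):

```python
def intersect(p1, p2):
    """Intersect two postings lists sorted by docID in O(len(p1) + len(p2))."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1   # advance the pointer sitting on the smaller docID
        else:
            j += 1
    return answer

brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(brutus, caesar))   # [2, 8]
```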

38 Boolean queries: Exact match
The Boolean retrieval model lets you ask a query that is a Boolean expression:
Boolean queries use AND, OR and NOT to join query terms.
The model views each document as a set of words.
It is precise: a document either matches the condition or it does not.
Perhaps the simplest model to build an IR system on; it was the primary commercial retrieval tool for 3 decades.
Many search systems you still use are Boolean: library catalogs, Mac OS X Spotlight.

39 Ranking search results
Boolean queries give inclusion or exclusion of docs. Often we want to rank/group results Need to measure proximity from query to each doc. Need to decide whether docs presented to user are singletons, or a group of docs covering various aspects of the query.

40 Ranked retrieval Thus far, our queries have all been Boolean.
Documents either match or they don't.
Good for expert users with a precise understanding of their needs and the collection.
Also good for applications: applications can easily consume 1000s of results.
Not good for the majority of users: most users are incapable of writing Boolean queries (or they are, but they think it's too much work), and most users don't want to wade through 1000s of results.
This is particularly true of web search.

41 Ranked retrieval models
Rather than a set of documents satisfying a query expression, in ranked retrieval models the system returns an ordering over the (top) documents in the collection with respect to a query.
Free text queries: rather than a query language of operators and expressions, the user's query is just one or more words in a human language.
In principle, these are two separate choices, but in practice ranked retrieval models have normally been associated with free text queries and vice versa.

42 Feast or famine: not a problem in ranked retrieval
When a system produces a ranked result set, large result sets are not an issue; indeed, the size of the result set is not an issue.
We just show the top k (k ≈ 10) results, so we don't overwhelm the user.
Premise: the ranking algorithm works.

43 Scoring as the basis of ranked retrieval
We wish to return in order the documents most likely to be useful to the searcher How can we rank-order the documents in the collection with respect to a query? Assign a score – say in [0, 1] – to each document This score measures how well document and query “match”.

44 Query-document matching scores
We need a way of assigning a score to a query/document pair Let’s start with a one-term query If the query term does not occur in the document: score should be 0 The more frequent the query term in the document, the higher the score (should be) We will look at a number of alternatives for this.

45 Take 1: Jaccard coefficient
A commonly used measure of the overlap of two sets A and B:
jaccard(A,B) = |A ∩ B| / |A ∪ B|
jaccard(A,A) = 1
jaccard(A,B) = 0 if A ∩ B = ∅
A and B don't have to be the same size.
It always assigns a number between 0 and 1.
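A one-function sketch of the Jaccard coefficient over token sets (the example query and document are illustrative):

```python
def jaccard(a, b):
    """Jaccard coefficient of two sets: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0          # convention: jaccard of two empty sets is 1
    return len(a & b) / len(a | b)

query = "ides of march".split()
doc = "caesar died in march".split()
print(jaccard(query, doc))   # 1/6 ≈ 0.167
```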

46 Issues with Jaccard for scoring
It doesn't consider term frequency (how many times a term occurs in a document).
Rare terms in a collection are more informative than frequent terms; Jaccard doesn't consider this information.
We also need a more sophisticated way of normalizing for length: later, we'll use a length-normalized measure (cosine) instead of |A ∩ B| / |A ∪ B| (Jaccard).

47 Binary term-document incidence matrix
Each document is represented by a binary vector ∈ {0,1}^|V|.

48 Term-document count matrices
Consider the number of occurrences of a term in a document: each document is a count vector in ℕ^|V| (a column of the term-document count matrix).

49 Bag of words model
The vector representation doesn't consider the ordering of words in a document: "John is quicker than Mary" and "Mary is quicker than John" have the same vectors.
This is called the bag of words model.
In a sense, this is a step back: the positional index was able to distinguish these two documents.
We will look at "recovering" positional information later. For now: bag of words model.

50 NB: frequency = count in IR
Term frequency tf: the term frequency tf(t,d) of term t in document d is defined as the number of times that t occurs in d.
We want to use tf when computing query-document match scores. But how?
Raw term frequency is not what we want: a document with 10 occurrences of the term is more relevant than a document with 1 occurrence of the term, but not 10 times more relevant.
Relevance does not increase proportionally with term frequency.
(NB: frequency = count in IR.)

51 Log-frequency weighting
The log frequency weight of term t in d is w(t,d) = 1 + log10(tf(t,d)) if tf(t,d) > 0, and 0 otherwise.
Examples: 0 → 0, 1 → 1, 2 → 1.3, 10 → 2, 1000 → 4, etc.
Score for a document-query pair: sum over terms t in both q and d: score(q,d) = Σ_{t ∈ q ∩ d} (1 + log10 tf(t,d)).
The score is 0 if none of the query terms is present in the document.
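A small Python sketch of this scoring rule; the tf counts in the toy document are invented for illustration:

```python
import math

def log_tf_weight(tf):
    # w(t,d) = 1 + log10(tf) if tf > 0, else 0
    return 1 + math.log10(tf) if tf > 0 else 0.0

def score(query_terms, doc_tf):
    """Sum the log-frequency weights over query terms present in the document."""
    return sum(log_tf_weight(doc_tf.get(t, 0)) for t in query_terms)

doc_tf = {"caesar": 10, "brutus": 2}
print(score(["brutus", "caesar"], doc_tf))   # (1 + log10 2) + (1 + log10 10) ≈ 3.30
print(score(["calpurnia"], doc_tf))          # 0.0
```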

52 Document frequency Rare terms are more informative than frequent terms
Recall stop words.
Consider a term in the query that is rare in the collection (e.g., arachnocentric).
A document containing this term is very likely to be relevant to the query arachnocentric.
→ We want a high weight for rare terms like arachnocentric.

53 Document frequency, continued
Frequent terms are less informative than rare terms Consider a query term that is frequent in the collection (e.g., high, increase, line) A document containing such a term is more likely to be relevant than a document that doesn’t But it’s not a sure indicator of relevance. → For frequent terms, we want high positive weights for words like high, increase, and line But lower weights than for rare terms. We will use document frequency (df) to capture this.

54 idf weight
df(t) is the document frequency of t: the number of documents that contain t.
df(t) is an inverse measure of the informativeness of t, and df(t) ≤ N.
We define the idf (inverse document frequency) of t by idf(t) = log10(N / df(t)).
We use log(N/df(t)) instead of N/df(t) to "dampen" the effect of idf.
It will turn out that the base of the log is immaterial.

55 idf example, suppose N = 1 million
term        df(t)       idf(t)
calpurnia   1           6
animal      100         4
sunday      1,000       3
fly         10,000      2
under       100,000     1
the         1,000,000   0
There is one idf value for each term t in a collection.

56 Effect of idf on ranking
Does idf have an effect on ranking for one-term queries, like "iPhone"?
idf has no effect on the ranking of one-term queries; it affects the ranking of documents only for queries with at least two terms.
For the query capricious person, idf weighting makes occurrences of capricious count for much more in the final document ranking than occurrences of person.

57 Collection vs. Document frequency
The collection frequency of t is the number of occurrences of t in the collection, counting multiple occurrences.
Example: which word is a better search term (and should get a higher weight)?
Word        Collection frequency   Document frequency
insurance   10440                  3997
try         10422                  8760
Why do you get these numbers? This suggests df is better.

58 tf-idf weighting
The tf-idf weight of a term is the product of its tf weight and its idf weight: w(t,d) = (1 + log10 tf(t,d)) × log10(N / df(t)).
It is the best known weighting scheme in information retrieval.
Note: the "-" in tf-idf is a hyphen, not a minus sign! Alternative names: tf.idf, tf x idf.
The weight increases with the number of occurrences within a document and increases with the rarity of the term in the collection.
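A sketch combining the two factors; N, the df values, and the tf values below are invented for illustration:

```python
import math

def tf_idf(tf, df, N):
    """tf-idf weight: (1 + log10 tf) * log10(N / df), or 0 if the term is absent."""
    if tf == 0 or df == 0:
        return 0.0
    return (1 + math.log10(tf)) * math.log10(N / df)

N = 1_000_000
print(tf_idf(tf=3, df=1_000, N=N))    # rare term: (1 + log10 3) * 3 ≈ 4.43
print(tf_idf(tf=3, df=100_000, N=N))  # same tf, frequent term: ≈ 1.48
```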

59 Final ranking of documents for a query
Score(q,d) = Σ_{t ∈ q ∩ d} tf-idf(t,d)

60 Binary → count → weight matrix
Each document is now represented by a real-valued vector of tf-idf weights ∈ R^|V|.

61 Documents as vectors So we have a |V|-dimensional vector space
Terms are axes of the space Documents are points or vectors in this space Very high-dimensional: tens of millions of dimensions when you apply this to a web search engine These are very sparse vectors - most entries are zero.

62 Queries as vectors Key idea 1: Do the same for queries: represent them as vectors in the space Key idea 2: Rank documents according to their proximity to the query in this space proximity = similarity of vectors proximity ≈ inverse of distance Recall: We do this because we want to get away from the you’re-either-in-or-out Boolean model. Instead: rank more relevant documents higher than less relevant documents

63 Formalizing vector space proximity
Sec. 6.3
First cut: distance between two points (= the distance between the end points of the two vectors).
Euclidean distance? Euclidean distance is a bad idea, because Euclidean distance is large for vectors of different lengths.

64 Why distance is a bad idea
Sec. 6.3
The Euclidean distance between q and d2 is large even though the distribution of terms in the query q and the distribution of terms in the document d2 are very similar.

65 Use angle instead of distance
Sec. 6.3
Thought experiment: take a document d and append it to itself; call this document d′.
"Semantically" d and d′ have the same content.
The Euclidean distance between the two documents can be quite large, but the angle between the two documents is 0, corresponding to maximal similarity.
Key idea: rank documents according to their angle with the query.

66 From angles to cosines
Sec. 6.3
The following two notions are equivalent:
Rank documents in increasing order of the angle between query and document.
Rank documents in decreasing order of cosine(query, document).
Cosine is a monotonically decreasing function on the interval [0°, 180°].

67 Sec. 6.3 From angles to cosines But how – and why – should we be computing cosines?

68 Sec. 6.3 Length normalization
A vector can be (length-) normalized by dividing each of its components by its length; for this we use the L2 norm: ||x||_2 = sqrt(Σ_i x_i²).
Dividing a vector by its L2 norm makes it a unit (length) vector (on the surface of the unit hypersphere).
Effect on the two documents d and d′ (d appended to itself) from the earlier slide: they have identical vectors after length-normalization.
Long and short documents now have comparable weights.

69 cosine(query,document)
cos(q,d) = q·d / (|q| |d|) = Σ_i q_i d_i / (sqrt(Σ_i q_i²) sqrt(Σ_i d_i²))
q_i is the tf-idf weight of term i in the query; d_i is the tf-idf weight of term i in the document.
cos(q,d) is the cosine similarity of q and d, or, equivalently, the cosine of the angle between q and d.
See the Law of Cosines (Cosine Rule) Wikipedia page.

70 Cosine for length-normalized vectors
For length-normalized vectors, cosine similarity is simply the dot product (or scalar product): cos(q,d) = q·d = Σ_i q_i d_i, for q, d length-normalized.
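A sketch of cosine similarity as a dot product of L2-normalized vectors (the vector values are illustrative); it also reproduces the d vs. d′ thought experiment from the earlier slide:

```python
import math

def l2_normalize(v):
    # Divide each component by the vector's L2 norm.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else v

def cosine(u, v):
    """Cosine similarity as a dot product of length-normalized vectors."""
    u, v = l2_normalize(u), l2_normalize(v)
    return sum(x * y for x, y in zip(u, v))

d = [3.0, 2.0, 1.3, 0.0]
d_doubled = [2 * x for x in d]        # d appended to itself: same direction
print(cosine(d, d_doubled))           # 1.0 (angle 0, maximal similarity)
```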

71 Cosine similarity illustrated

72 Cosine similarity amongst 3 documents
How similar are the novels SaS (Sense and Sensibility), PaP (Pride and Prejudice), and WH (Wuthering Heights)?
Term frequencies (counts):
term        SaS   PaP   WH
affection   115   58    20
jealous     10    7     11
gossip      2     0     6
wuthering   0     0     38
Note: to simplify this example, we don't do idf weighting.

73 3 documents example contd.
Log frequency weighting:
term        SaS    PaP    WH
affection   3.06   2.76   2.30
jealous     2.00   1.85   2.04
gossip      1.30   0      1.78
wuthering   0      0      2.58

After length normalization:
term        SaS     PaP     WH
affection   0.789   0.832   0.524
jealous     0.515   0.555   0.465
gossip      0.335   0       0.405
wuthering   0       0       0.588

Length normalization for "affection" in SaS: 3.06 / sqrt(3.06² + 2.00² + 1.30² + 0²) ≈ 0.789.
cos(SaS,PaP) ≈ 0.789×0.832 + 0.515×0.555 + 0.335×0.0 + 0.0×0.0 ≈ 0.94
cos(SaS,WH) ≈ 0.79
cos(PaP,WH) ≈ 0.69
Why do we have cos(SaS,PaP) > cos(SaS,WH)?

74 Computing cosine scores
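The algorithm figure on this slide is not reproduced in the transcript; below is a sketch of a term-at-a-time cosine scoring loop of the kind the slide refers to, using per-document score accumulators over the postings. The data-structure layout, names, and toy values are assumptions.

```python
import heapq

def cosine_scores(query_weights, postings, doc_lengths, k=10):
    """query_weights: term -> query weight
       postings: term -> list of (docID, document term weight)
       doc_lengths: docID -> L2 norm of the document vector"""
    scores = {}
    for term, w_q in query_weights.items():
        for doc_id, w_d in postings.get(term, []):
            # Accumulate the partial dot product for this document.
            scores[doc_id] = scores.get(doc_id, 0.0) + w_q * w_d
    for doc_id in scores:
        scores[doc_id] /= doc_lengths[doc_id]   # length-normalize the document side
    return heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])

postings = {"car": [(1, 1.0), (2, 0.5)], "insurance": [(1, 1.3)]}
print(cosine_scores({"car": 0.52, "insurance": 0.78}, postings, {1: 1.92, 2: 1.0}))
```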

75 tf-idf weighting has many variants
n (natural) default is just the term frequency; ltc is the best known form of weighting.
The column headings of the SMART table are one-letter acronyms for weight schemes.
Why is the base of the log in idf immaterial?

76 Weighting may differ in queries vs documents
Sec. 6.4
Many search engines allow different weightings for queries vs. documents.
SMART notation denotes the combination in use in an engine, written ddd.qqq, using the acronyms from the previous table.
A very standard weighting scheme is lnc.ltc:
Document: logarithmic tf (l as first character), no idf, and cosine normalization.
Query: logarithmic tf (l in leftmost column), idf (t in second column), cosine normalization.
Leaving off idf weighting on documents is said to be good for both efficiency and effectiveness reasons. A bad idea?

77 tf-idf example: lnc.ltc
Document: car insurance auto insurance
Query: best car insurance
N = 1,000,000 documents; query weighting ltc, document weighting lnc.

Term        Query: tf-raw  tf-wt  df      idf  wt   n'lize | Doc: tf-raw  tf-wt  wt   n'lize | Prod
auto        0              0      5000    2.3  0    0      | 1            1      1    0.52   | 0
best        1              1      50000   1.3  1.3  0.34   | 0            0      0    0      | 0
car         1              1      10000   2.0  2.0  0.52   | 1            1      1    0.52   | 0.27
insurance   1              1      1000    3.0  3.0  0.78   | 2            1.3    1.3  0.68   | 0.53

Example cells: document tf-wt for insurance = 1 + log10(2) ≈ 1.3; query wt for insurance = 1 × log10(1,000,000/1000) = 3; document length = sqrt(1² + 0² + 1² + 1.3²) ≈ 1.92, so the normalized document weight for insurance = 1.3/1.92 ≈ 0.68; Prod = 0.78 × 0.68 ≈ 0.53.
Score = 0 + 0 + 0.27 + 0.53 = 0.8
Exercise: what is N, the number of docs? N = 1,000,000.
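A sketch that re-derives this lnc.ltc score in Python; N and the df values come from the table above, while the code structure and function names are assumptions:

```python
import math

def log_tf(tf):
    return 1 + math.log10(tf) if tf > 0 else 0.0

def l2_normalize(weights):
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else weights

N = 1_000_000
df = {"auto": 5000, "best": 50000, "car": 10000, "insurance": 1000}
query_tf = {"best": 1, "car": 1, "insurance": 1}
doc_tf = {"car": 1, "insurance": 2, "auto": 1}

# ltc query: log tf * idf, then cosine-normalize.
query_w = l2_normalize({t: log_tf(tf) * math.log10(N / df[t])
                        for t, tf in query_tf.items()})
# lnc document: log tf, no idf, then cosine-normalize.
doc_w = l2_normalize({t: log_tf(tf) for t, tf in doc_tf.items()})

score = sum(query_w[t] * doc_w.get(t, 0.0) for t in query_w)
print(round(score, 2))   # ≈ 0.8, matching the table
```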

78 How good are the retrieved docs?
Precision: fraction of retrieved docs that are relevant to the user's information need.
Recall: fraction of relevant docs in the collection that are retrieved.

79 Unranked retrieval evaluation: Precision and Recall
Sec. 8.3
Precision: fraction of retrieved docs that are relevant = P(relevant|retrieved)
Recall: fraction of relevant docs that are retrieved = P(retrieved|relevant)
Precision P = tp/(tp + fp)
Recall R = tp/(tp + fn)
                 Relevant   Nonrelevant
Retrieved        tp         fp
Not retrieved    fn         tn

80 Rank precision
Compute the precision values at some selected rank positions. This is mainly used in Web search evaluation.
For a Web search engine, we can compute precision for the top 5, 10, 15, 20, 25 and 30 returned pages, as the user seldom looks at more than 30 pages.
Recall is not very meaningful in Web search. Why? (The full set of relevant pages on the Web is unknown and can be huge, so recall cannot be computed reliably.)

81 Should we instead use the accuracy measure for evaluation?
Sec. 8.3
Given a query, an engine classifies each doc as "Relevant" or "Nonrelevant".
The accuracy of an engine is the fraction of these classifications that are correct: (tp + tn) / (tp + fp + fn + tn).
Accuracy is a commonly used evaluation measure in machine learning classification work.
Why is this not a very useful evaluation measure in IR?

82 Why not just use accuracy?
Sec. 8.3
How to build a search engine with near-perfect accuracy on a low budget: Snoogle.com, which for any search returns "0 matching results found."
Since almost all documents are nonrelevant, returning nothing is almost always "correct". But people doing information retrieval want to find something, and have a certain tolerance for junk.

83 Sec. 8.3 Precision/Recall You can get high recall (but low precision) by retrieving all docs for all queries! Recall is a non-decreasing function of the number of docs retrieved In a good system, precision decreases as either the number of docs retrieved or recall increases This is not a theorem, but a result with strong empirical confirmation

84 Sec. 8.3 A combined measure: F
The combined measure that assesses the precision/recall tradeoff is the F measure (weighted harmonic mean): F = 1 / (α/P + (1−α)/R) = (β² + 1)PR / (β²P + R), where β² = (1−α)/α.
People usually use the balanced F1 measure, i.e., with β = 1 (equivalently α = ½): F1 = 2PR / (P + R).
The harmonic mean is a conservative average.
See C.J. van Rijsbergen, Information Retrieval.
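A small sketch of precision, recall and F from the contingency counts; the tp/fp/fn numbers are invented:

```python
def precision_recall_f(tp, fp, fn, beta=1.0):
    """Precision, recall, and the weighted-harmonic-mean F measure."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    b2 = beta * beta
    f = (b2 + 1) * p * r / (b2 * p + r) if p + r else 0.0
    return p, r, f

p, r, f1 = precision_recall_f(tp=40, fp=10, fn=20)
print(p, round(r, 3), round(f1, 3))   # 0.8, 0.667, F1 ≈ 0.727
```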

85 Evaluating ranked results
Sec. 8.4
Evaluation of ranked results: the system can return any number of results.
By taking various numbers of the top returned documents (levels of recall), the evaluator can produce a precision-recall curve.

86 User feedback
Initiating a user feedback cycle depending on the results: modifying the user need / modifying the query.

87 Relevance feedback
Relevance feedback is one of the techniques for improving retrieval effectiveness. The steps:
1. The user first identifies some relevant (Dr) and irrelevant (Dir) documents in the initial list of retrieved documents.
2. The system expands the query q by extracting additional terms from the sample relevant and irrelevant documents to produce qe.
3. Perform a second round of retrieval.
Rocchio method (α, β and γ are parameters): qe = α·q + (β/|Dr|)·Σ_{d ∈ Dr} d − (γ/|Dir|)·Σ_{d ∈ Dir} d
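A vector-space sketch of the Rocchio update, with term-weight dictionaries standing in for vectors; the α, β, γ defaults are typical textbook choices, not values prescribed by the slide:

```python
def rocchio(query, relevant, irrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """query and each document are dicts mapping term -> weight."""
    terms = set(query)
    for d in relevant + irrelevant:
        terms |= set(d)
    expanded = {}
    for t in terms:
        pos = sum(d.get(t, 0.0) for d in relevant) / len(relevant) if relevant else 0.0
        neg = sum(d.get(t, 0.0) for d in irrelevant) / len(irrelevant) if irrelevant else 0.0
        w = alpha * query.get(t, 0.0) + beta * pos - gamma * neg
        if w > 0:
            expanded[t] = w     # negative weights are usually dropped
    return expanded

q = {"jaguar": 1.0}
rel = [{"jaguar": 0.9, "cat": 0.8}]
irr = [{"jaguar": 0.7, "car": 0.9}]
print(rocchio(q, rel, irr))   # expanded query qe with a new "cat" term
```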

88 Rocchio text classifier
In fact, a variation of the Rocchio method above, called the Rocchio classification method, can be used to improve retrieval effectiveness too, as can other machine learning methods. Why?
The Rocchio classifier is constructed by producing a prototype vector c_i for each class i (relevant or irrelevant in this case); in classification, cosine similarity to the prototype vectors is used.

89 Some current research topics
A bag-of-words model: while term frequency is certainly one of the most useful indicators of relevance, relying solely on term frequency amounts to viewing a document simply as a bag of words and ignores much of its structure.

90 Some current research topics
Information derived from document structure: term order is an example of structural information that some recent approaches utilize. Such approaches assign higher relevance scores to documents in which query terms appear in the same order as they appear in the query.

91 Some current research topics
Document-feature based approaches take into account the occurrence of query terms in locations of varying prominence, such as title, header text, or body text. The most prominent document-feature based approach is BM25F, built upon Okapi BM25. Document-feature techniques are most commonly used in web search, as features can be easily identified thanks to the presence of HTML tags.

92 Some current research topics
Term proximity is an idea that has seen recent interest. In this context, term proximity refers to the lexical distance between query terms, calculated as the number of words separating query terms in a document.

93 Some current research topics
Chronological Term Rank: the chronological rank of a term (CTR) within a document is defined as the rank of the term in the sequence of words in the original document. We call this statistic "chronological" to emphasize its correspondence to the sequential occurrence of the terms within the document, from beginning to end.
Let D = (t1, ..., tn) be a document, where the ti are terms (words) ordered according to their sequence in the original document. Let tr_t := i, where the chronological rank tr of term t is assigned as the subscript i of the earliest occurrence of t in D.
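A small sketch computing the chronological rank of each term as defined above (the 1-based position of its earliest occurrence); the example sentence is made up:

```python
def chronological_term_rank(tokens):
    """Map each term to the 1-based position of its first occurrence in the document."""
    ranks = {}
    for i, term in enumerate(tokens, start=1):
        ranks.setdefault(term, i)   # keep only the earliest occurrence
    return ranks

doc = "to be or not to be".split()
print(chronological_term_rank(doc))   # {'to': 1, 'be': 2, 'or': 3, 'not': 4}
```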

94 Index construction: easy! See the earlier example.

95 Different search engines
The real differences among search engines are:
their index weighting schemes, including the location of terms (e.g., title, body, emphasized words, etc.)
their query processing methods (e.g., query classification, expansion, etc.)
their ranking algorithms
Few of these are published by any of the search engine companies; they are tightly guarded secrets.

