
1 INFORMATION RETRIEVAL Yu Hong and Heng Ji October 15, 2014

2 Outline Introduction IR Approaches and Ranking Query Construction Document Indexing IR Evaluation Web Search INDRI

3 Information

4 Basic Function of Information Information = transmission of thought. Encoding: Thoughts → Words → Sounds (speech, writing, telepathy?). Decoding: Sounds → Words → Thoughts.

5 Information Theory Better called "communication theory". Developed by Claude Shannon in the 1940s. Concerned with the transmission of electrical signals over wires: how do we send information quickly and reliably? Underlies modern electronic communication: voice and data traffic over copper, fiber optic, wireless, etc. Famous result: the Channel Capacity Theorem. Formal measure of information in terms of entropy. Information = "reduction in surprise"

6 The Noisy Channel Model Information Transmission = producing the same message at the destination as the one sent at the source. The message must be encoded for transmission across a medium (called the channel). But the channel is noisy and can distort the message. (Diagram: Source → Transmitter → message → channel, with noise → message → Receiver → Destination)

7 A Synthesis Information retrieval as communication over time and space, across a noisy channel. (Diagram: Source/Sender → Transmitter (Encoding, indexing/writing) → message → channel/storage, with noise → message → Receiver (Decoding, acquisition/reading) → Destination/Recipient)

8 What is Information Retrieval? Most people equate IR with web-search highly visible, commercially successful endeavors leverage 3+ decades of academic research IR: finding any kind of relevant information web-pages, news events, answers, images, … “relevance” is a key notion

9 What is Information Retrieval (IR)? Most people equate IR with web-search highly visible, commercially successful endeavors leverage 3+ decades of academic research IR: finding any kind of relevant information web-pages, news events, answers, images, … “relevance” is a key notion

10 Interesting Examples Google image search Google video search People Search Social Network Search

11 Interesting Examples Google image search Google video search People Search Social Network Search

12 IR System (Diagram: a Query String and a Document corpus enter the IR System, which returns Ranked Documents: 1. Doc1 2. Doc2 3. Doc3 …; this mirrors the Sender/Recipient picture from the synthesis slide, with encoding/indexing/writing on one side, storage plus noise in the middle, and decoding/acquisition/reading on the other.)

13 The IR Black Box Documents Query Results

14 Inside The IR Black Box (Diagram: the Query goes through a Representation Function to produce a Query Representation; Documents go through a Representation Function to produce a Document Representation, stored in an Index; a Comparison Function matches the two and produces the Results.)

15 Building the IR Black Box Fetching model Comparison model Representation Model Indexing Model

16 Building the IR Black Box Fetching models Crawling model Gentle Crawling model Comparison models Boolean model Vector space model Probabilistic models Language models PageRank Representation Models How do we capture the meaning of documents? Is meaning just the sum of all terms? Indexing Models How do we actually store all those words? How do we access indexed terms quickly?

17 Outline Introduction IR Approaches and Ranking Query Construction Document Indexing IR Evaluation Web Search INDRI

18 Fetching model: Crawling Documents Web pages Search Engines

19 (Diagram: the IR black box from slide 14, extended with a Fetching Function that crawls Documents from the World Wide Web before they are represented and indexed.)

20 Fetching model: Crawling Q1: How many web pages should we fetch? As many as we can. More web pages = richer knowledge = a more intelligent search engine.

21 Fetching model: Crawling Q1: How many web pages should we fetch? As many as we can. The fetching model enriches the knowledge in the "brain" of the search engine. ("I know everything now, hahahahaha!")

22 Fetching model: Crawling Q2: How to fetch the web pages? First, we should know the basic network structure of the web. Basic structure: nodes and links (hyperlinks).

23 Fetching model: Crawling Q2: How to fetch the web pages? The crawling program (crawler) visits each node in the web by following hyperlinks.

24 Fetching model: Crawling Q2: How to fetch the web pages? Q2-1: What are the known nodes? They are the nodes whose addresses the crawler knows. The nodes are web pages, so the addresses are URLs (URL: Uniform Resource Locator), such as www.yahoo.com, www.sohu.com, www.sina.com, etc. Q2-2: What are the unknown nodes? They are the nodes whose addresses the crawler does not know. The seed nodes are the known ones: before dispatching the crawler, a search engine introduces some web page addresses to the crawler. These web pages are the earliest known nodes (the so-called seeds).

25 Fetching model: Crawling Q2: How to fetch the web pages? Q2-3: How can the crawler find the unknown nodes? (Figure: known and unknown nodes and their documents; the crawler: "I can do this. Believe me.")

29 Fetching model: Crawling Q2: How to fetch the web pages? Q2-3: How can the crawler find the unknown nodes? (Figure: the crawler's PARSER turns unknown nodes into known ones; "Good news for me.")

30 Fetching model: Crawling Q2: How to fetch the web pages? Q2-3: How can the crawler find the unknown nodes? If you introduce a web page to the crawler (i.e., let it know the web address), the crawler will use a source-code parser to mine lots of new web pages, and it then knows their addresses too. But if you don't tell the crawler anything, it will go on strike because it can do nothing. That is why we need the seed nodes (seed web pages) to wake the crawler up. "Give me some seeds."

31 Fetching model: Crawling Q2: How to fetch the web pages? To traverse the whole network of the web, the crawler needs some auxiliary equipment: a register with a FIFO (first in, first out) data structure, such as a QUEUE; an Access Control Program (ACP); a Source Code Parser (SCP); and the seed nodes. ("I need some equipment.")

32 Fetching model: Crawling Q2: How to fetch the web pages? Robotic crawling procedure (only five steps). Initialization: push the seed nodes (known web pages) into the empty queue. Step 1: Take a node out of the queue (FIFO) and visit it (ACP). Step 2: Steal the necessary information from the source code of the node (SCP). Step 3: Send the stolen text information (title, text body, keywords and language) back to the search engine for storage (ACP). Step 4: Push the newly found nodes into the queue. Step 5: Repeat Steps 1-4. ("I am working now.")
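
The five steps map directly onto a breadth-first traversal driven by a FIFO queue. Below is a minimal Python sketch of that loop; fetch_page, extract_links and store are hypothetical callables standing in for the ACP, the SCP and the storage component, and the seen set simply keeps a node from being queued twice.

```python
from collections import deque

def crawl(seed_urls, fetch_page, extract_links, store):
    """Minimal sketch of the five-step crawling loop described above."""
    queue = deque(seed_urls)            # Initialization: push the seeds into the FIFO register
    seen = set(seed_urls)               # remember known nodes so each is queued only once
    while queue:                        # the crawler works until the register is empty
        url = queue.popleft()           # Step 1: take a node out of the queue (FIFO) and visit it
        html = fetch_page(url)          # Step 2: read the node's source code
        store(url, html)                # Step 3: send the text information back for storage
        for link in extract_links(html):
            if link not in seen:        # Step 4: push newly found nodes into the queue
                seen.add(link)
                queue.append(link)
        # Step 5: the while-loop repeats Steps 1-4
```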

33 Fetching model: Crawling Q2: How to fetch the web pages? Through these steps, the number of known nodes grows continuously; this is the underlying reason why the crawler can traverse the whole web. The crawler keeps working until the register is empty. Once the register is empty, the information of all reachable nodes in the web has been stolen and stored on the server of the search engine. ("I control this.")

34 Fetching model: Crawling Problems 1) Actually, the crawler cannot traverse the whole web. For example, it can run into an infinite loop when it falls into a partially closed-circle network (a snare) in the web.

35 Fetching model: Crawling Problems 2) Crude crawling. A portal web site produces a series of homologous nodes in the register, e.g. https://screen.yahoo.com/live/, https://games.yahoo.com/, https://mobile.yahoo.com/, https://groups.yahoo.com/neo, https://answers.yahoo.com/, https://weather.yahoo.com/, https://autos.yahoo.com/, https://shopping.yahoo.com/, https://www.yahoo.com/health, https://www.yahoo.com/food, https://www.yahoo.com/style: a class of homologous web pages linking to a portal site. Following the FIFO rule, crawling these nodes one after another keeps visiting the shared server of those nodes. That is crude crawling.

36 Fetching model: Crawling Homework 1) How can we overcome the infinite loop caused by a partially closed-circle network in the web? 2) Please find a way to crawl the web like a gentleman (not crudely). Please select one of the problems as the topic of your homework. A short paper is required: no more than 500 words, but please include at least your idea and a methodology. The methodology can be described in natural language, as a flow diagram, or as an algorithm. Send it to me. Thanks.

37 Building the IR Black Box Fetching models Crawling model Gentle Crawling model Comparison models Boolean model Vector space model Probabilistic models Language models PageRank Representation Models How do we capture the meaning of documents? Is meaning just the sum of all terms? Indexing Models How do we actually store all those words? How do we access indexed terms quickly?

38 (Diagram: the IR black box again; Query → Representation Function → Query Representation; Documents → Representation Function → Document Representation → Index; Comparison Function → Results.)

39 (Same diagram as slide 38, with parts of it marked "Ignore Now".)

40 A heuristic formula for IR (Boolean model) Rank docs by similarity to the query. Suppose the query is "spiderman film". Relevance = # query words in the doc; this favors documents containing both "spiderman" and "film". Logical variations (set-based): Boolean AND (require all words): a document matches only if every query word appears in it. Boolean OR (any of the words): a document matches if at least one query word appears in it.
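
A tiny sketch of these three set-based variants, assuming a document is simply the set of terms it contains:

```python
def overlap_score(query_terms, doc_terms):
    # heuristic relevance: number of distinct query words found in the document
    return sum(1 for t in set(query_terms) if t in doc_terms)

def boolean_and(query_terms, doc_terms):
    # require every query word to appear in the document
    return all(t in doc_terms for t in query_terms)

def boolean_or(query_terms, doc_terms):
    # accept the document if any query word appears
    return any(t in doc_terms for t in query_terms)

doc = {"the", "new", "spiderman", "film", "review"}
query = ["spiderman", "film"]
print(overlap_score(query, doc), boolean_and(query, doc), boolean_or(query, doc))  # 2 True True
```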

41 Term Frequency (TF) Observation: key words tend to be repeated in a document. Modify our similarity measure: give more weight if a word occurs multiple times. Problem: this is biased towards long documents and spurious occurrences, so normalize by length: use tf(q, D) / |D| instead of the raw count.

42 Inverse Document Frequency (IDF) Observation: rare words carry more meaning: cryogenic, apollo. Frequent words are linguistic glue: of, the, said, went. Modify our similarity measure: give more weight to rare words … but don't be too aggressive (why?). A standard choice is idf(q) = log(|C| / df(q)), where |C| is the total number of documents and df(q) is the total number of documents that contain q.

43 TF normalization Observation: D1 = {cryogenic, labs}, D2 = {cryogenic, cryogenic}. Which document is more relevant? Which one is ranked higher? (df(labs) > df(cryogenic)) Correction: the first occurrence is more important than a repeat (why?). "Squash" the linearity of TF, e.g. by using log(1 + tf) rather than the raw count.

44 State-of-the-art Formula More query words → good. Repetitions of query words → good. Penalize very long documents. Common words less important.
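
As a hedged illustration of those ingredients (log-squashed TF, length normalization, idf down-weighting of common words), here is a toy scoring function in the spirit of the slide; it is not the exact state-of-the-art formula the slide refers to.

```python
import math

def score(query, doc_tf, doc_len, avg_doc_len, df, num_docs):
    """Toy tf.idf-style score: more (and repeated) query words help,
    long documents are penalized, common words count less."""
    total = 0.0
    for term in query:
        tf = doc_tf.get(term, 0)
        if tf == 0 or df.get(term, 0) == 0:
            continue
        tf_part = 1.0 + math.log(tf)            # repetitions help, but sub-linearly
        length_norm = doc_len / avg_doc_len     # penalize very long documents
        idf = math.log(num_docs / df[term])     # common words are less important
        total += (tf_part / length_norm) * idf
    return total

# Toy collection of 1000 documents; "film" is rarer than "the", so it weighs more.
print(score(["spiderman", "film"], {"spiderman": 3, "film": 1},
            doc_len=120, avg_doc_len=100,
            df={"spiderman": 20, "film": 200, "the": 990}, num_docs=1000))
```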

45 Strengths and Weaknesses Strengths Precise, if you know the right strategies Precise, if you have an idea of what you’re looking for Implementations are fast and efficient Weaknesses Users must learn Boolean logic Boolean logic insufficient to capture the richness of language No control over size of result set: either too many hits or none When do you stop reading? All documents in the result set are considered “equally good” What about partial matches? Documents that “don’t quite match” the query may be useful also

46 Vector-space approach to IR (Figure: documents and the query plotted as vectors over terms such as cat, pig, dog, with the angle θ between a document and the query.) Assumption: documents that are "close together" in vector space "talk about" the same things. Therefore, retrieve documents based on how close the document is to the query (i.e., similarity ~ "closeness").

47 Some formulas for Similarity: Dot product, Cosine, Dice, Jaccard. (Figure: document D and query Q as vectors over terms t1 and t2.)

48 An Example A document space is defined by three terms: hardware, software, users (the vocabulary). A set of documents is defined as: A1=(1, 0, 0), A2=(0, 1, 0), A3=(0, 0, 1), A4=(1, 1, 0), A5=(1, 0, 1), A6=(0, 1, 1), A7=(1, 1, 1), A8=(1, 0, 1), A9=(0, 1, 1). If the query is "hardware and software", what documents should be retrieved?

49 An Example (cont.) In Boolean query matching: document A4, A7 will be retrieved (“AND”) retrieved: A1, A2, A4, A5, A6, A7, A8, A9 (“OR”) In similarity matching (cosine): q=(1, 1, 0) S(q, A1)=0.71, S(q, A2)=0.71,S(q, A3)=0 S(q, A4)=1, S(q, A5)=0.5, S(q, A6)=0.5 S(q, A7)=0.82, S(q, A8)=0.5, S(q, A9)=0.5 Document retrieved set (with ranking)= {A4, A7, A1, A2, A5, A6, A8, A9}
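
A short sketch that reproduces the cosine scores above (values rounded to two decimals):

```python
import math

def cosine(q, d):
    dot = sum(qi * di for qi, di in zip(q, d))
    norm = math.sqrt(sum(x * x for x in q)) * math.sqrt(sum(x * x for x in d))
    return dot / norm if norm else 0.0

docs = {
    "A1": (1, 0, 0), "A2": (0, 1, 0), "A3": (0, 0, 1),
    "A4": (1, 1, 0), "A5": (1, 0, 1), "A6": (0, 1, 1),
    "A7": (1, 1, 1), "A8": (1, 0, 1), "A9": (0, 1, 1),
}
q = (1, 1, 0)  # query "hardware and software" over (hardware, software, users)
for name in sorted(docs, key=lambda n: -cosine(q, docs[n])):
    print(name, round(cosine(q, docs[name]), 2))
# A4 1.0, A7 0.82, A1 0.71, A2 0.71, A5/A6/A8/A9 0.5, A3 0.0
```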

50 Probabilistic model Given D, estimate P(R|D) and P(NR|D). P(R|D) = P(D|R)·P(R)/P(D); since P(D) and P(R) are constant for ranking, this is proportional to P(D|R), with D = {t1=x1, t2=x2, …}.

51 Prob. model (cont’d) For document ranking

52 Prob. model (cont'd) How to estimate p_i and q_i? A set of N relevant and irrelevant samples: with t_i: r_i relevant docs, n_i - r_i irrelevant docs, n_i docs in total; without t_i: R_i - r_i relevant docs, N - R_i - n_i + r_i irrelevant docs, N - n_i docs in total; overall: R_i relevant docs, N - R_i irrelevant docs, N samples.

53 Prob. model (cont'd) Smoothing (Robertson-Sparck-Jones formula). When no sample is available: p_i = 0.5, q_i = (n_i + 0.5)/(N + 0.5) ≈ n_i / N. May be implemented as VSM.
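
As a concrete illustration, here is a hedged sketch of the Robertson-Sparck-Jones term weight built from the contingency counts above; the exact smoothing constants vary between presentations, this uses the common +0.5 form.

```python
import math

def rsj_weight(r_i, R, n_i, N):
    """Robertson-Sparck-Jones term weight.
    r_i: relevant docs containing t_i, R: relevant docs,
    n_i: docs containing t_i, N: all sampled docs."""
    p_i = (r_i + 0.5) / (R + 1.0)             # estimate of P(t_i present | relevant)
    q_i = (n_i - r_i + 0.5) / (N - R + 1.0)   # estimate of P(t_i present | irrelevant)
    return math.log(p_i * (1 - q_i) / (q_i * (1 - p_i)))

# With no relevance sample (R = r_i = 0) this falls back to an idf-like weight,
# matching the p_i = 0.5, q_i ~ n_i / N case on the slide.
print(rsj_weight(r_i=0, R=0, n_i=100, N=10000))
```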

54 An Appraisal of Probabilistic Models  Among the oldest formal models in IR  Maron & Kuhns, 1960: Since an IR system cannot predict with certainty which document is relevant, we should deal with probabilities  Assumptions for getting reasonable approximations of the needed probabilities:  Boolean representation of documents/queries/relevance  Term independence  Out-of-query terms do not affect retrieval  Document relevance values are independent

55 An Appraisal of Probabilistic Models  The difference between ‘vector space’ and ‘probabilistic’ IR is not that great:  In either case you build an information retrieval scheme in the exact same way.  Difference: for probabilistic IR, at the end, you score queries not by cosine similarity and tf-idf in a vector space, but by a slightly different formula motivated by probability theory

56 Language-modeling Approach The query is a random sample from a "perfect" document; words are "sampled" independently of each other. Rank documents by the probability of generating the query: P(query|D) is the product of the individual word probabilities, e.g. = 4/9 * 2/9 * 4/9 * 3/9 for a four-word query drawn from a nine-word document D.

57 57 Naive Bayes and LM generative models  We want to classify document d. We want to classify a query q.  Classes: geographical regions like China, UK, Kenya. Each document in the collection is a different class.  Assume that d was generated by the generative model. Assume that q was generated by a generative model.  Key question: Which of the classes is most likely to have generated the document? Which document (=class) is most likely to have generated the query q?  Or: for which class do we have the most evidence? For which document (as the source of the query) do we have the most evidence?

58 Using language models (LMs) for IR ❶ LM = language model ❷ We view the document as a generative model that generates the query. ❸ What we need to do: ❹ Define the precise generative model we want to use ❺ Estimate parameters (different parameters for each document’s model) ❻ Smooth to avoid zeros ❼ Apply to query and find document most likely to have generated the query ❽ Present most likely document(s) to user ❾ Note that x – y is pretty much what we did in Naive Bayes.

59 59 What is a language model? We can view a finite state automaton as a deterministic language model. I wish I wish I wish I wish... Cannot generate: “wish I wish” or “I wish I”. Our basic model: each document was generated by a different automaton like this except that these automata are probabilistic.

60 A probabilistic language model This is a one-state probabilistic finite-state automaton (a unigram language model) and the state emission distribution for its one state q1. STOP is not a word, but a special symbol indicating that the automaton stops. Example string: "frog said that toad likes frog STOP". P(string) = 0.01 · 0.03 · 0.04 · 0.01 · 0.02 · 0.01 · 0.02 = 4.8 · 10^-13

61 A different language model for each document String: "frog said that toad likes frog STOP". P(string|M_d1) = 0.01 · 0.03 · 0.04 · 0.01 · 0.02 · 0.01 · 0.02 = 4.8 · 10^-13. P(string|M_d2) = 0.01 · 0.03 · 0.05 · 0.02 · 0.02 · 0.01 · 0.02 = 12 · 10^-13. P(string|M_d1) < P(string|M_d2). Thus, document d2 is "more relevant" to the string "frog said that toad likes frog STOP" than d1 is.

62 62 Using language models in IR  Each document is treated as (the basis for) a language model.  Given a query q  Rank documents based on P(d|q)  P(q) is the same for all documents, so ignore  P(d) is the prior – often treated as the same for all d  But we can give a prior to “high-quality” documents, e.g., those with high PageRank.  P(q|d) is the probability of q given d.  So to rank documents according to relevance to q, ranking according to P(q|d) and P(d|q) is equivalent.

63 63 Where we are  In the LM approach to IR, we attempt to model the query generation process.  Then we rank documents by the probability that a query would be observed as a random sample from the respective document model.  That is, we rank according to P(q|d).  Next: how do we compute P(q|d)?

64 How to compute P(q|d) We will make the same conditional independence assumption as for Naive Bayes: P(q|M_d) = Π_{1 ≤ k ≤ |q|} P(t_k|M_d) (|q|: length of q; t_k: the token occurring at position k in q). This is equivalent to P(q|M_d) = Π_{distinct t in q} P(t|M_d)^{tf_t,q}, where tf_t,q is the term frequency (# occurrences) of t in q. This is the multinomial model (omitting the constant factor).

65 Parameter estimation Missing piece: where do the parameters P(t|M_d) come from? Start with maximum likelihood estimates (as we did for Naive Bayes): P(t|M_d) = tf_t,d / |d| (|d|: length of d; tf_t,d: # occurrences of t in d). As in Naive Bayes, we have a problem with zeros: a single t with P(t|M_d) = 0 will make P(q|d) zero. We would give a single term "veto power". For example, for the query [Michael Jackson top hits] a document about "top songs" (but not using the word "hits") would have P(t|M_d) = 0. That's bad. We need to smooth the estimates to avoid zeros.

66 Smoothing Key intuition: a nonoccurring term is possible (even though it didn't occur), but no more likely than would be expected by chance in the collection. Notation: M_c: the collection model; cf_t: the number of occurrences of t in the collection; T: the total number of tokens in the collection. We will use P(t|M_c) = cf_t / T to "smooth" P(t|d) away from zero.

67 67 Mixture model  P(t|d) = λP(t|M d ) + (1 - λ)P(t|M c )  Mixes the probability from the document with the general collection frequency of the word.  High value of λ: “conjunctive-like” search – tends to retrieve documents containing all query words.  Low value of λ: more disjunctive, suitable for long queries  Correctly setting λ is very important for good performance.

68 68 Mixture model: Summary  What we model: The user has a document in mind and generates the query from this document.  The equation represents the probability that the document that the user had in mind was in fact this one.

69 Example Collection: d1 and d2. d1: Jackson was one of the most talented entertainers of all time. d2: Michael Jackson anointed himself King of Pop. Query q: Michael Jackson. Use the mixture model with λ = 1/2. P(q|d1) = [(0/11 + 1/18)/2] · [(1/11 + 2/18)/2] ≈ 0.003. P(q|d2) = [(1/7 + 1/18)/2] · [(1/7 + 2/18)/2] ≈ 0.013. Ranking: d2 > d1.
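
The example can be reproduced with a few lines of Python; this is a direct sketch of the mixture model above with maximum-likelihood document and collection models.

```python
def p_query(query, doc, collection, lam=0.5):
    """Query likelihood with the mixture model:
    P(q|d) = product over query terms of [ lam*P(t|M_d) + (1-lam)*P(t|M_c) ]."""
    score = 1.0
    for t in query:
        p_doc = doc.count(t) / len(doc)                 # P(t | M_d)
        p_coll = collection.count(t) / len(collection)  # P(t | M_c)
        score *= lam * p_doc + (1 - lam) * p_coll
    return score

d1 = "jackson was one of the most talented entertainers of all time".split()  # 11 tokens
d2 = "michael jackson anointed himself king of pop".split()                   # 7 tokens
collection = d1 + d2                                                          # 18 tokens
q = "michael jackson".split()
print(round(p_query(q, d1, collection), 3))   # 0.003
print(round(p_query(q, d2, collection), 3))   # 0.013
```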

70 Exercise: Compute ranking Collection: d1 and d2. d1: Xerox reports a profit but revenue is down. d2: Lucene narrows quarter loss but revenue decreases further. Query q: revenue down. Use the mixture model with λ = 1/2. P(q|d1) = [(1/8 + 2/16)/2] · [(1/8 + 1/16)/2] = 1/8 · 3/32 = 3/256. P(q|d2) = [(1/8 + 2/16)/2] · [(0/8 + 1/16)/2] = 1/8 · 1/32 = 1/256. Ranking: d1 > d2.

71 LMs vs. vector space model (1) LMs have some things in common with vector space models. Term frequency is directly in the model, but it is not scaled in LMs. Probabilities are inherently "length-normalized"; cosine normalization does something similar for vector space. Mixing document and collection frequencies has an effect similar to idf: terms rare in the general collection, but common in some documents, will have a greater influence on the ranking.

72 72 LMs vs. vector space model (2)  LMs vs. vector space model: commonalities  Term frequency is directly in the model.  Probabilities are inherently “length-normalized”.  Mixing document and collection frequencies has an effect similar to idf.  LMs vs. vector space model: differences  LMs: based on probability theory  Vector space: based on similarity, a geometric/ linear algebra notion  Collection frequency vs. document frequency  Details of term frequency, length normalization etc.

73 73 Language models for IR: Assumptions  Simplifying assumption: Queries and documents are objects of same type. Not true!  There are other LMs for IR that do not make this assumption.  The vector space model makes the same assumption.  Simplifying assumption: Terms are conditionally independent.  Again, vector space model (and Naive Bayes) makes the same assumption.  Cleaner statement of assumptions than vector space  Thus, better theoretical foundation than vector space  … but “pure” LMs perform much worse than “tuned” LMs.

74 Relevance Using Hyperlinks The number of documents relevant to a query can be enormous if only term frequencies are taken into account. Using term frequencies makes "spamming" easy: e.g., a travel agency can add many occurrences of the word "travel" to its page to make its rank very high. Most of the time people are looking for pages from popular sites. Idea: use the popularity of a Web site (e.g., how many people visit it) to rank site pages that match given keywords. Problem: it is hard to find the actual popularity of a site. Solution: next slide.

75 Relevance Using Hyperlinks (Cont.) Solution: use number of hyperlinks to a site as a measure of the popularity or prestige of the site Count only one hyperlink from each site (why? - see previous slide) Popularity measure is for site, not for individual page But, most hyperlinks are to root of site Also, concept of “site” difficult to define since a URL prefix like cs.yale.edu contains many unrelated pages of varying popularity Refinements When computing prestige based on links to a site, give more weight to links from sites that themselves have higher prestige Definition is circular Set up and solve system of simultaneous linear equations Above idea is basis of the Google PageRank ranking mechanism

76 PageRank in Google

77 PageRank in Google (Cont') Assign a numeric value to each page. The more a page is referred to by important pages, the more important this page is: PR(A) = (1 - d) + d · (PR(I1)/C(I1) + PR(I2)/C(I2) + …), where I1, I2, … are the pages linking to A, C(I) is the number of outgoing links of I, and d is a damping factor (0.85). Many other criteria are used as well, e.g. proximity of query words: "…information retrieval…" is better than "…information…retrieval…".
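
A compact sketch of the iterative computation, using the variant of the formula written above (some presentations divide the (1 - d) term by the number of pages; the ranking is the same either way). The toy graph is only illustrative.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterative PageRank sketch. links maps each page to the list of pages it
    links to; every page here has at least one outgoing link."""
    pr = {page: 1.0 for page in links}
    for _ in range(iterations):
        new = {}
        for page in links:
            incoming = sum(pr[src] / len(links[src])
                           for src in links if page in links[src])
            new[page] = (1 - d) + d * incoming
        pr = new
    return pr

# Toy graph in the spirit of the slide's A, B, I1, I2 picture.
print(pagerank({"A": ["I1"], "B": ["I1", "I2"], "I1": ["A"], "I2": ["A"]}))
```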

78 Relevance Using Hyperlinks (Cont.) Connections to social networking theories that ranked the prestige of people: e.g., the president of the U.S.A. has high prestige since many people know him; someone known by multiple prestigious people has high prestige. Hub and authority based ranking: a hub is a page that stores links to many pages (on a topic); an authority is a page that contains actual information on a topic. Each page gets a hub prestige based on the prestige of the authorities it points to; each page gets an authority prestige based on the prestige of the hubs that point to it. Again, the prestige definitions are cyclic, and can be obtained by solving linear equations. Use authority prestige when ranking answers to a query.

79 79 HITS: Hubs and authorities

80 80 HITS update rules  A: link matrix  h: vector of hub scores  a: vector of authority scores  HITS algorithm:  Compute h = Aa  Compute a = A T h  Iterate until convergence  Output (i) list of hubs ranked according to hub score and (ii) list of authorities ranked according to authority score
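
A short sketch of the HITS iteration exactly as stated above (h = A a, a = A^T h, with normalization); the 3-page link matrix is just an example.

```python
import numpy as np

def hits(A, iterations=50):
    """HITS update rules: A[i, j] = 1 iff page i links to page j."""
    n = A.shape[0]
    h = np.ones(n)
    a = np.ones(n)
    for _ in range(iterations):
        h = A @ a                 # hub score: sum of authority scores it points to
        a = A.T @ h               # authority score: sum of hub scores pointing to it
        h /= np.linalg.norm(h)    # normalize so the scores stay bounded
        a /= np.linalg.norm(a)
    return h, a

# Page 0 links to pages 1 and 2; page 1 links to page 2.
A = np.array([[0, 1, 1],
              [0, 0, 1],
              [0, 0, 0]], dtype=float)
hubs, authorities = hits(A)
print(hubs.round(2), authorities.round(2))
```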

81 Outline Introduction IR Approaches and Ranking Query Construction Document Indexing IR Evaluation Web Search INDRI

82 82 Keyword Search Simplest notion of relevance is that the query string appears verbatim in the document. Slightly less strict notion is that the words in the query appear frequently in the document, in any order (bag of words).

83 83 Problems with Keywords May not retrieve relevant documents that include synonymous terms. “restaurant” vs. “café” “PRC” vs. “China” May retrieve irrelevant documents that include ambiguous terms. “bat” (baseball vs. mammal) “Apple” (company vs. fruit) “bit” (unit of data vs. act of eating)

84 Query Expansion Most errors caused by vocabulary mismatch query: “cars”, document: “automobiles” solution: automatically add highly-related words Thesaurus / WordNet lookup: add semantically-related words (synonyms) cannot take context into account: “rail car” vs. “race car” vs. “car and cdr” Statistical Expansion: add statistically-related words (co-occurrence) very successful

85 Indri Query Examples #combine( #weight( #1(explosion) #1(blast) #1(wounded) #1(injured) #1(death) #1(deaths)) #weight( #1(Davao City international airport) #1(Tuesday) #1(DAVAO) #1(Philippines) #1(DXDC) #1(Davao Medical Center)))

86 Synonyms and Homonyms Synonyms E.g., document: “motorcycle repair”, query: “motorcycle maintenance” Need to realize that “maintenance” and “repair” are synonyms System can extend query as “motorcycle and (repair or maintenance)” Homonyms E.g., “object” has different meanings as noun/verb Can disambiguate meanings (to some extent) from the context Extending queries automatically using synonyms can be problematic Need to understand intended meaning in order to infer synonyms Or verify synonyms with user Synonyms may have other meanings as well

87 Concept-Based Querying Approach For each word, determine the concept it represents from context Use one or more ontologies: Hierarchical structure showing relationship between concepts E.g., the ISA relationship that we saw in the E-R model This approach can be used to standardize terminology in a specific field Ontologies can link multiple languages Foundation of the Semantic Web (not covered here)

88 Outline Introduction IR Approaches and Ranking Query Construction Document Indexing IR Evaluation Web Search INDRI

89 Indexing of Documents An inverted index maps each keyword K_i to the set of documents S_i that contain the keyword; documents are identified by identifiers. The inverted index may record keyword locations within documents to allow proximity-based ranking, and counts of the number of occurrences of the keyword to compute TF. and operation: finds documents that contain all of K_1, K_2, ..., K_n, i.e. the intersection S_1 ∩ S_2 ∩ ... ∩ S_n. or operation: finds documents that contain at least one of K_1, K_2, …, K_n, i.e. the union S_1 ∪ S_2 ∪ ... ∪ S_n. Each S_i is kept sorted to allow efficient intersection/union by merging; "not" can also be efficiently implemented by merging of sorted lists.

90 Indexing of Documents Goal = find the important meanings and create an internal representation. Factors to consider: accuracy in representing meanings (semantics), exhaustiveness (covering all the contents), facility for the computer to manipulate. What is the best representation of contents? Char. string (char trigrams): not precise enough. Word: good coverage, not precise. Phrase: poor coverage, more precise. Concept: poor coverage, precise. (Figure: these options plotted along a coverage (recall) vs. accuracy (precision) trade-off, from String to Word to Phrase to Concept.)

91 Indexer steps Sequence of (modified token, document ID) pairs. Doc 1: "I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me." Doc 2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious"

92 Multiple term entries in a single document are merged. Frequency information is added.
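
A minimal sketch of those indexer steps for the two Julius Caesar documents above, merging multiple entries per document and keeping frequencies; the tokenization here is deliberately crude.

```python
from collections import defaultdict

def build_index(docs):
    """Build a simple inverted index: term -> {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            token = token.strip(".,;'")                    # crude normalization
            index[token][doc_id] = index[token].get(doc_id, 0) + 1
    return index

docs = {
    1: "I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious",
}
index = build_index(docs)
print(index["caesar"])   # {1: 1, 2: 2}
print(index["brutus"])   # {1: 1, 2: 1}
print(index["killed"])   # {1: 2}
```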

93 An example

94 Stopwords / Stoplist Function words do not bear useful information for IR: of, in, about, with, I, although, … A stoplist contains stopwords that are not to be used as index terms: prepositions, articles, pronouns, some adverbs and adjectives, some frequent words (e.g. document). The removal of stopwords usually improves IR effectiveness. A few "standard" stoplists are commonly used.

95 Stemming Reason: different word forms may bear similar meaning (e.g. search, searching); create a "standard" representation for them. Stemming: removing some endings of words, e.g. computer, compute, computes, computing, computed, computation → comput.
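
A toy suffix stripper makes the idea concrete; this is only an illustrative sketch, not the Porter algorithm or any production stemmer.

```python
def crude_stem(word):
    """Toy suffix stripper: removes a few common endings so that
    related forms collapse to a shared stem."""
    for suffix in ("ation", "ing", "ed", "es", "er", "e", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word

for w in ("computer", "compute", "computes", "computing", "computed", "computation"):
    print(w, "->", crude_stem(w))      # every form maps to "comput"
```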

96 Lemmatization Transform words to a standard form according to syntactic category, e.g. verb + ing → verb, noun + s → noun. Needs POS tagging. More accurate than stemming, but needs more resources. It is crucial to choose the stemming/lemmatization rules: noise vs. recognition rate, a compromise between precision and recall (light/no stemming: lower recall, higher precision; severe stemming: higher recall, lower precision).

97 Simple conjunctive query (two terms) Consider the query: BRUTUS AND CALPURNIA. To find all matching documents using an inverted index: ❶ Locate BRUTUS in the dictionary ❷ Retrieve its postings list from the postings file ❸ Locate CALPURNIA in the dictionary ❹ Retrieve its postings list from the postings file ❺ Intersect the two postings lists ❻ Return the intersection to the user

98 98 Intersecting two posting lists  This is linear in the length of the postings lists.  Note: This only works if postings lists are sorted.
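
The linear merge itself is only a few lines; the BRUTUS and CALPURNIA postings below are just example document-ID lists.

```python
def intersect(p1, p2):
    """Linear-time merge of two sorted postings lists of document IDs."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

brutus = [1, 2, 4, 11, 31, 45, 173, 174]     # example postings list for BRUTUS
calpurnia = [2, 31, 54, 101]                 # example postings list for CALPURNIA
print(intersect(brutus, calpurnia))          # [2, 31]
```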

99 Does Google use the Boolean model? On Google, the default interpretation of a query [w1 w2 ... wn] is w1 AND w2 AND ... AND wn. Cases where you get hits that do not contain one of the w_i: anchor text; the page contains a variant of w_i (morphology, spelling correction, synonym); long queries (n large); the boolean expression generates very few hits. Simple Boolean vs. ranking of the result set: simple Boolean retrieval returns matching documents in no particular order; Google (and most well designed Boolean engines) rank the result set: they rank good hits (according to some estimator of relevance) higher than bad hits.

100 Outline Introduction IR Approaches and Ranking Query Construction Document Indexing IR Evaluation Web Search INDRI

101 IR Evaluation Efficiency: time, space. Effectiveness: how capable is a system of retrieving relevant documents? Is one system better than another? Metrics often used (together): Precision = retrieved relevant docs / retrieved docs; Recall = retrieved relevant docs / relevant docs. (Figure: Venn diagram of the relevant and retrieved document sets.)

102 IR Evaluation (Cont’) Information-retrieval systems save space by using index structures that support only approximate retrieval. May result in: false negative (false drop) - some relevant documents may not be retrieved. false positive - some irrelevant documents may be retrieved. For many applications a good index should not permit any false drops, but may permit a few false positives. Relevant performance metrics: precision - what percentage of the retrieved documents are relevant to the query. recall - what percentage of the documents relevant to the query were retrieved.

103 Recall vs. precision tradeoff: Can increase recall by retrieving many documents (down to a low level of relevance ranking), but many irrelevant documents would be fetched, reducing precision Measures of retrieval effectiveness: Recall as a function of number of documents fetched, or Precision as a function of recall Equivalently, as a function of number of documents fetched E.g., “precision of 75% at recall of 50%, and 60% at a recall of 75%” Problem: which documents are actually relevant, and which are not IR Evaluation (Cont’)

104 General form of precision/recall - Precision changes w.r.t. recall (it is not a fixed point) - Systems cannot be compared at a single precision/recall point - Average precision (at 11 points of recall: 0.0, 0.1, …, 1.0)

105 An illustration of P/R calculation Ranked list (relevant?): Doc1 (Y), Doc2, Doc3 (Y), Doc4 (Y), Doc5, … Assume: 5 relevant docs in total.

106 MAP (Mean Average Precision) MAP = (1/n) Σ_i [ (1/|R_i|) Σ_j ( j / r_ij ) ], where r_ij = rank of the j-th relevant document for Q_i, |R_i| = # rel. doc. for Q_i, and n = # test queries. E.g. Rank: 1, 4 (1st rel. doc.), 5, 8 (2nd rel. doc.), 10 (3rd rel. doc.).

107 Some other measures Noise = retrieved irrelevant docs / retrieved docs Silence = non-retrieved relevant docs / relevant docs Noise = 1 – Precision; Silence = 1 – Recall Fallout = retrieved irrel. docs / irrel. docs Single value measures: F-measure = 2 P * R / (P + R) Average precision = average at 11 points of recall Precision at n document (often used for Web IR) Expected search length (no. irrelevant documents to read before obtaining n relevant doc.)
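
A small sketch of precision, recall, F-measure and (per-query) average precision, applied to the P/R illustration a few slides back; averaging the last value over all test queries gives MAP.

```python
def precision_recall_f(retrieved, relevant, num_relevant=None):
    """Precision, recall and F-measure for a retrieved list; num_relevant lets
    you count relevant documents that were never retrieved."""
    num_relevant = num_relevant or len(relevant)
    rel_retrieved = sum(1 for d in retrieved if d in relevant)
    p = rel_retrieved / len(retrieved)
    r = rel_retrieved / num_relevant
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

def average_precision(ranked, relevant, num_relevant=None):
    """Average of the precision values at each relevant document's rank."""
    num_relevant = num_relevant or len(relevant)
    hits, total = 0, 0.0
    for k, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / k
    return total / num_relevant

# Illustration: Doc1, Doc3, Doc4 are relevant, 5 relevant docs exist in total.
ranked = ["Doc1", "Doc2", "Doc3", "Doc4", "Doc5"]
relevant = {"Doc1", "Doc3", "Doc4"}
print(precision_recall_f(ranked, relevant, num_relevant=5))            # (0.6, 0.6, 0.6)
print(round(average_precision(ranked, relevant, num_relevant=5), 2))   # 0.48
```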

108 Interactive system’s evaluation Definition: Evaluation = the process of systematically collecting data that informs us about what it is like for a particular user or group of users to use a product/system for a particular task in a certain type of environment.

109 Problems Attitudes: Designers assume that if they and their colleagues can use the system and find it attractive, others will too Features vs. usability or security Executives want the product on the market yesterday Problems “can” be addressed in versions 1.x Consumers accept low levels of usability “I’m so silly”

110 Two main types of evaluation Formative evaluation is done at different stages of development to check that the product meets users’ needs. Part of the user-centered design approach Supports design decisions at various stages May test parts of the system or alternative designs Summative evaluation assesses the quality of a finished product. May test the usability or the output quality May compare competing systems

111 What to evaluate Iterative design & evaluation is a continuous process that examines: Early ideas for conceptual model Early prototypes of the new system Later, more complete prototypes Designers need to check that they understand users’ requirements and that the design assumptions hold.

112 Four evaluation paradigms ‘quick and dirty’ usability testing field studies predictive evaluation

113 Quick and dirty ‘quick & dirty’ evaluation describes the common practice in which designers informally get feedback from users or consultants to confirm that their ideas are in-line with users’ needs and are liked. Quick & dirty evaluations are done any time. The emphasis is on fast input to the design process rather than carefully documented findings.

114 Usability testing Usability testing involves recording typical users’ performance on typical tasks in controlled settings. Field observations may also be used. As the users perform these tasks they are watched & recorded on video & their key presses are logged. This data is used to calculate performance times, identify errors & help explain why the users did what they did. User satisfaction questionnaires & interviews are used to elicit users’ opinions.

115 Usability testing It is very time consuming to conduct and analyze Explain the system, do some training Explain the task, do a mock task Questionnaires before and after the test & after each task Pilot test is usually needed Insufficient number of subjects for ‘proper’ statistical analysis In laboratory conditions, subjects do not behave exactly like in a normal environment

116 Field studies Field studies are done in natural settings The aim is to understand what users do naturally and how technology impacts them. In product design field studies can be used to: - identify opportunities for new technology - determine design requirements - decide how best to introduce new technology - evaluate technology in use

117 Predictive evaluation Experts apply their knowledge of typical users, often guided by heuristics, to predict usability problems. Another approach involves theoretically based models. A key feature of predictive evaluation is that users need not be present Relatively quick & inexpensive

118 The TREC experiments Once per year: a set of documents and queries are distributed to the participants (the standard answers are unknown) (April); participants work (very hard) to construct and fine-tune their systems, and submit their answers (1000/query) at the deadline (July); NIST people manually evaluate the answers and provide correct answers (and a classification of IR systems) (July – August); TREC conference (November).

119 TREC evaluation methodology Known document collection (>100K) and query set (50). Submission of 1000 documents for each query by each participant. Merge the first 100 documents from each participant -> global pool. Human relevance judgment of the global pool; the other documents are assumed to be irrelevant. Evaluation of each system (with 1000 answers). Partial relevance judgments, but stable for system ranking.

120 Tracks (tasks) Ad Hoc track: given document collection, different topics Routing (filtering): stable interests (user profile), incoming document flow CLIR: Ad Hoc, but with queries in a different language Web: a large set of Web pages Question-Answering: When did Nixon visit China? Interactive: put users into action with system Spoken document retrieval Image and video retrieval Information tracking: new topic / follow up

121 CLEF and NTCIR CLEF = Cross-Language Evaluation Forum, for European languages, organized by Europeans; once per year (March – Oct.). NTCIR: organized by NII (Japan), for Asian languages; a cycle of 1.5 years.

122 Impact of TREC Provide large collections for further experiments Compare different systems/techniques on realistic data Develop new methodology for system evaluation Similar experiments are organized in other areas (NLP, Machine translation, Summarization, …)

123 Outline Introduction IR Approaches and Ranking Query Construction Document Indexing IR Evaluation Web Search INDRI

124 IR on the Web No stable document collection (spider, crawler) Invalid document, duplication, etc. Huge number of documents (partial collection) Multimedia documents Great variation of document quality Multilingual problem …

125 125 Web Search Application of IR to HTML documents on the World Wide Web. Differences: Must assemble document corpus by spidering the web. Can exploit the structural layout information in HTML (XML). Documents change uncontrollably. Can exploit the link structure of the web.

126 126 Web Search System Query String IR System Ranked Documents 1. Page1 2. Page2 3. Page3. Document corpus Web Spider

127 Challenges Scale, distribution of documents. Controversy over the unit of indexing: what is a document? (hypertext) What does the user expect to be retrieved? High heterogeneity: document structure, size, quality, level of abstraction / specialization; user search or domain expertise, expectations. Retrieval strategies: what do people want? Evaluation.

128 Web documents / data No traditional collection Huge Time and space to crawl index IRSs cannot store copies of documents Dynamic, volatile, anarchic, un-controlled Homogeneous sub-collections Structure In documents (un-/semi-/fully-structured) Between docs: network of inter-connected nodes Hyper-links - conceptual vs. physical documents

129 Web documents / data Mark-up HTML – look & feel XML – structure, semantics Dublin Core Metadata Can webpage authors be trusted to correctly mark-up / index their pages ? Multi-lingual documents Multi-media

130 Theoretical models for indexing / searching Content-based weighting As in traditional IRS, but trying to incorporate hyperlinks the dynamic nature of the Web (page validity, page caching) Link-based weighting Quality of webpages Hubs & authorities Bookmarked pages Iterative estimation of quality

131 Architecture Centralized Main server contains the index, built by an indexer, searched by a query engine Advantage: control, easy update Disadvantage: system requirements (memory, disk, safety/recovery) Distributed Brokers & gatherers Advantage: flexibility, load balancing, redundancy Disadvantage: software complexity, update

132 User variability Power and flexibility for expert users vs. intuitiveness and ease of use for novice users Multi-modal user interface Distinguish between experts and beginners, offer distinct interfaces (functionality) Advantage: can make assumptions on users Disadvantage: habit formation, cognitive shift Uni-modal interface Make essential functionality obvious Make advanced functionality accessible

133 Search strategies Web directories Query-based searching Link-based browsing (provided by the browser, not the IRS) “More like this” Known site (bookmarking) A combination of the above

134 Support for Relevance Feedback RF can improve search effectiveness … but is rarely used Voluntary vs. forced feedback At document vs. word level “Magic” vs. control

135 Some techniques to improve IR effectiveness Interaction with the user (relevance feedback): keywords only cover part of the contents; the user can help by indicating relevant/irrelevant documents. The use of relevance feedback to improve the query expression: Q_new = α·Q_old + β·Rel_d - γ·NRel_d, where Rel_d = centroid of the relevant documents and NRel_d = centroid of the non-relevant documents.
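
A hedged sketch of that Rocchio-style update; the alpha/beta/gamma defaults below are common choices, not values prescribed by the slide, and the term-weight vectors are made up for illustration.

```python
import numpy as np

def rocchio(q_old, rel_docs, nonrel_docs, alpha=1.0, beta=0.75, gamma=0.15):
    """Q_new = alpha*Q_old + beta*centroid(rel) - gamma*centroid(nonrel)."""
    rel_centroid = np.mean(rel_docs, axis=0) if len(rel_docs) else np.zeros_like(q_old)
    nonrel_centroid = np.mean(nonrel_docs, axis=0) if len(nonrel_docs) else np.zeros_like(q_old)
    q_new = alpha * q_old + beta * rel_centroid - gamma * nonrel_centroid
    return np.maximum(q_new, 0.0)    # negative term weights are usually clipped to zero

q = np.array([1.0, 0.0, 1.0])                       # query weights over a 3-term vocabulary
rel = np.array([[1.0, 1.0, 0.0], [1.0, 0.0, 1.0]])  # judged relevant documents
nonrel = np.array([[0.0, 1.0, 1.0]])                # judged non-relevant document
print(rocchio(q, rel, nonrel))
```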

136 Modified relevance feedback Users usually do not cooperate (e.g. AltaVista in early years) Pseudo-relevance feedback (Blind RF) Using the top-ranked documents as if they are relevant: Select m terms from n top-ranked documents One can usually obtain about 10% improvement

137 Term clustering Based on `similarity’ between terms Collocation in documents, paragraphs, sentences Based on document clustering Terms specific for bottom-level document clusters are assumed to represent a topic Use Thesauri Query expansion

138 User modelling Build a model / profile of the user by recording the `context’ topics of interest preferences based on interpreting (his/her actions): Implicit or explicit relevance feedback Recommendations from `peers’ Customization of the environment

139 Personalised systems Information filtering Ex: in a TV guide only show programs of interest Use user model to disambiguate queries Query expansion Update the model continuously Customize the functionality and the look-and-feel of the system Ex: skins; remember the levels of the user interface

140 Autonomous agents Purpose: find relevant information on behalf of the user Input: the user profile Output: pull vs. push Positive aspects: Can work in the background, implicitly Can update the master with new, relevant info Negative aspects: control Integration with collaborative systems

141 Outline Introduction IR Approaches and Ranking Query Construction Document Indexing IR Evaluation Web Search INDRI

142 Document Representation Example document: "Department Descriptions: The following list describes … Agriculture … Chemistry … Computer Science … Electrical Engineering … … Zoology". Its representation is split into contexts, each with a list of extents: one context containing "department descriptions", one containing "agriculture chemistry … zoology" (extents: 1. agriculture, 2. chemistry, …, 36. zoology), and one containing "the following list describes … agriculture …" (extents such as "the following list describes agriculture …").

143 Model Based on original inference network retrieval framework [Turtle and Croft ’91] Casts retrieval as inference in simple graphical model Extensions made to original model Incorporation of probabilities based on language modeling rather than tf.idf Multiple language models allowed in the network (one per indexed context)

144 Model (Graphical model diagram.) Node types: document node D (observed); model hyperparameters α, β for the title, body and h1 contexts (observed); context language models θ_title, θ_body, θ_h1; representation nodes r_1 … r_N (terms, phrases, etc.); belief nodes q_1, q_2 (#combine, #not, #max); information need node I (a belief node).

145 Model (Network diagram repeated: D, θ_title, θ_body, θ_h1, r_1 … r_N, q_1, q_2, I, and the α, β hyperparameters.)

146 P( r | θ ) Probability of observing a term, phrase, or “concept” given a context language model r i nodes are binary Assume r ~ Bernoulli( θ ) “Model B” – [Metzler, Lavrenko, Croft ’04] Nearly any model may be used here tf.idf-based estimates (INQUERY) Mixture models

147 Model (Network diagram repeated.)

148 P( θ | α, β, D ) Prior over context language model determined by α, β Assume P( θ | α, β ) ~ Beta( α, β ) Bernoulli’s conjugate prior α w = μP( w | C ) + 1 β w = μP( ¬ w | C ) + 1 μ is a free parameter

149 Model (Network diagram repeated.)

150 P( q | r ) and P( I | r ) Belief nodes are created dynamically based on query Belief node CPTs are derived from standard link matrices Combine evidence from parents in various ways Allows fast inference by making marginalization computationally tractable Information need node is simply a belief node that combines all network evidence into a single value Documents are ranked according to: P( I | α, β, D)

151 Example: #AND Link matrix for Q = #AND(A, B): P(Q=true | a, b) = 0 for a=false, b=false; 0 for a=true, b=false; 0 for a=false, b=true; 1 for a=true, b=true.

152 Query Language Extension of INQUERY query language Structured query language Term weighting Ordered / unordered windows Synonyms Additional features Language modeling motivated constructs Added flexibility to deal with fields via contexts Generalization of passage retrieval (extent retrieval) Robust query language that handles many current language modeling tasks

153 Terms Stemmed term: dog matches all occurrences of dog (and its stems). Surface term: "dogs" matches exact occurrences of dogs (without stemming). Term group (synonym group): matches all occurrences of dogs (without stemming) or canine (and its stems). Extent match: #any:person matches any occurrence of an extent of type person.

154 Date / Numeric Fields #less(URLDEPTH 3): any URLDEPTH numeric field extent with value less than 3. #greater(READINGLEVEL 3): any READINGLEVEL numeric field extent with value greater than 3. #between(SENTIMENT 0 2): any SENTIMENT numeric field extent with value between 0 and 2. #equals(VERSION 5): any VERSION numeric field extent with value equal to 5. #date:before(1 Jan 1900): any DATE field before 1900. #date:after(June ): any DATE field after June 1, 2004. #date:between(1 Jun Sep 2001): any DATE field in summer 2000.

155 Proximity #odN(e1 … em) or #N(e1 … em), e.g. #od5(saddam hussein) or #5(saddam hussein): all occurrences of saddam and hussein appearing ordered within 5 words of each other. #uwN(e1 … em), e.g. #uw5(information retrieval): all occurrences of information and retrieval appearing in any order within a window of 5 words. #uw(e1 … em), e.g. #uw(john kerry): all occurrences of john and kerry appearing in any order within any sized window. #phrase(e1 … em), e.g. #phrase(#1(willy wonka) #uw3(chocolate factory)): system dependent implementation (defaults to #od m).

156 Context Restriction yahoo.title: all occurrences of yahoo appearing in the title context. yahoo.title,paragraph: all occurrences of yahoo appearing in both a title and a paragraph context (may not be possible). All occurrences of yahoo appearing in either a title context or a paragraph context. #5(apple ipod).title: all matching windows contained within a title context.

157 Context Evaluation google.(title): the term google evaluated using the title context as the document. google.(title, paragraph): the term google evaluated using the concatenation of the title and paragraph contexts as the document. google.figure(paragraph): the term google restricted to figure tags within the paragraph context.

158 Belief Operators INQUERY -> INDRI: #sum / #and -> #combine; #wsum* -> #weight; #or -> #or; #not -> #not; #max -> #max. (* #wsum is still available in INDRI, but should be used with discretion.)

159 Extent / Passage Retrieval #combine[section](dog canine): evaluates #combine(dog canine) for each extent associated with the section context. #combine[title, section](dog canine): same as previous, except it is evaluated for each extent associated with either the title context or the section context. #combine[passage100:50](white house): evaluates #combine(white house) over 100-word passages, treating every 50 words as the beginning of a new passage. #sum(#sum[section](dog)): returns a single score that is the #sum of the scores returned from #sum(dog) evaluated for each section extent. #max(#sum[section](dog)): same as previous, except it returns the maximum score.

160 Extent Retrieval Example Query: #combine[section](dirichlet smoothing). Example document with sections: Introduction ("Statistical language modeling allows formal methods to be applied to information retrieval. ..."), Multinomial Model ("Here we provide a quick review of multinomial language models. ..."), Multiple-Bernoulli Model ("We now examine two formal methods for statistically modeling documents and queries based on the multiple-Bernoulli distribution. ..."), … 1. Treat each section extent as a "document". 2. Score each "document" according to #combine(…). 3. Return a ranked list of extents with columns SCORE, DOCID, BEGIN, END (e.g. a top extent scoring 0.50 in document IR …).

161 Other Operators Filter require: #filreq(#less(READINGLEVEL 10) ben franklin): requires that documents have a reading level less than 10; documents are then ranked by the query ben franklin. Filter reject: #filrej(#greater(URLDEPTH 1) microsoft): rejects (does not score) documents with a URL depth greater than 1; documents are then ranked by the query microsoft. Prior: #prior(DATE): applies the document prior specified for the DATE field.

162 System Overview Indexing Inverted lists for terms and fields Repository consists of inverted lists, parsed documents, and document vectors Query processing Local or distributed Computing local / global statistics Features

