
Slide 1: Dragon Star Program Course: Information Retrieval (龙星计划课程: 信息检索)
Overview of Text Retrieval: Part 2
ChengXiang Zhai (翟成祥)
Department of Computer Science, Graduate School of Library & Information Science, Institute for Genomic Biology, Statistics, University of Illinois, Urbana-Champaign
http://www-faculty.cs.uiuc.edu/~czhai, czhai@cs.uiuc.edu
(Dragon Star Lecture at Beijing University, June 21-30, 2008; © 2008 ChengXiang Zhai)

Slide 2: Outline
- Other retrieval models
- Implementation of a TR system
- Applications of TR techniques

Slide 3: P-norm (Extended Boolean) (Salton et al. 83)
Motivation: how to rank documents with a Boolean query?
Intuitions:
- Docs satisfying the query constraint should get the highest ranks
- Partial satisfaction of the query constraint can be used to rank other docs
Question: how to capture "partial satisfaction"?

Slide 4: P-norm: Basic Ideas
- Normalized term weights for the doc representation ([0,1])
- Define the similarity between a Boolean query and a doc vector
[Figure: a document is a point (x, y) in the unit square with corners (0,0), (1,0), (0,1), (1,1), shown for Q = T1 AND T2 and for Q = T1 OR T2]

Slide 5: P-norm: Formulas
Since the similarity value is normalized to [0,1], these two formulas can be applied recursively.
[The formulas themselves are lost in the transcript; the parameter p interpolates between p = 1 (vector space) and p = infinity (Boolean/fuzzy logic).]

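Since the transcript drops the actual formulas, here is a minimal sketch of the p-norm similarity for a two-term query, assuming the standard extended Boolean form (Salton, Fox & Wu); the term weights x, y and the exponent p are illustrative.

```python
def pnorm_or(x, y, p):
    """Similarity of doc (x, y) to Q = T1 OR T2: distance from the worst corner (0,0)."""
    return ((x**p + y**p) / 2) ** (1 / p)

def pnorm_and(x, y, p):
    """Similarity of doc (x, y) to Q = T1 AND T2: 1 minus distance from the ideal corner (1,1)."""
    return 1 - (((1 - x)**p + (1 - y)**p) / 2) ** (1 / p)

# p = 1 behaves like the vector-space model (averaging of term weights);
# as p grows large, OR approaches max(x, y) and AND approaches min(x, y),
# i.e., fuzzy-logic / strict Boolean behavior.
print(pnorm_or(0.8, 0.1, 1), pnorm_or(0.8, 0.1, 10))
print(pnorm_and(0.8, 0.1, 1), pnorm_and(0.8, 0.1, 10))
```
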
Slide 6: P-norm: Summary
- A general (and elegant) similarity function for a Boolean query and a regular document vector
- Connects the Boolean model and the vector-space model, with models in between
- Allows different "confidence" in Boolean operators (a different p for different operators)
- A model worth more exploration (how to learn optimal p values from feedback?)

Slide 7: Probabilistic Retrieval Models

Slide 8: Overview of Retrieval Models
Relevance can be modeled in three broad ways:
- Similarity between Rep(q) and Rep(d), with different representations and similarity measures: vector space model (Salton et al. 75), probabilistic distribution model (Wong & Yao 89), ...
- Probability of relevance P(r=1|q,d), r in {0,1}:
  - Regression model (Fox 83)
  - Generative models: document generation, as in the classical probabilistic model (Robertson & Sparck Jones 76); query generation, as in the LM approach (Ponte & Croft 98; Lafferty & Zhai 01a)
  - Learning to rank (Joachims 02; Burges et al. 05)
- Probabilistic inference, P(d -> q) or P(q -> d), with different inference systems: probabilistic concept space model (Wong & Yao 95), inference network model (Turtle & Croft 91)

Slide 9: The Basic Question
What is the probability that THIS document is relevant to THIS query?
Formally: three random variables, query Q, document D, relevance R in {0,1}. Given a particular query q and a particular document d, what is p(R=1|Q=q,D=d)?

Slide 10: Probability of Relevance
Three random variables:
- Query Q
- Document D
- Relevance R in {0,1}
Goal: rank D based on P(R=1|Q,D)
- Evaluate P(R=1|Q,D)
- Actually, we only need to compare P(R=1|Q,D1) with P(R=1|Q,D2), i.e., rank documents
There are several different ways to refine P(R=1|Q,D).

Slide 11: Refining P(R=1|Q,D), Method 1: Conditional Models
Basic idea: relevance depends on how well a query matches a document.
- Define features on Q x D, e.g., # matched terms, the highest IDF of a matched term, doc length, ...
- P(R=1|Q,D) = g(f1(Q,D), f2(Q,D), ..., fn(Q,D), theta)
- Use training data (known relevance judgments) to estimate the parameters theta
- Apply the model to rank new documents
Special case: logistic regression

Slide 12: Logistic Regression (Cooper 92, Gey 94)
Logit function: logit(P) = log[P/(1-P)]; logistic (sigmoid) function: P(R=1|Q,D) = 1/(1+e^(-X)), where X is a linear combination of the features.
[Figure: the sigmoid curve of P(R=1|Q,D) against X, saturating at 1.0]
Uses 6 features X1, ..., X6.

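A minimal sketch of how such a conditional (logistic-regression) ranker scores a document; the feature values and weights below are purely illustrative placeholders, not the six features actually used by Cooper/Gey.

```python
import math

def logistic_score(features, weights, bias):
    """P(R=1|Q,D) = sigmoid(bias + sum_i w_i * x_i)."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical features for one (query, document) pair:
# [# matched terms, max IDF of a matched term, log(doc length)]
x = [3.0, 4.2, 5.1]
w = [0.7, 0.3, -0.2]      # weights estimated from relevance judgments (illustrative values)
print(logistic_score(x, w, bias=-1.5))   # rank documents by this probability
```
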
Slide 13: Features/Attributes
- Average absolute query frequency
- Query length
- Average absolute document frequency
- Document length
- Average inverse document frequency
- Inverse document frequency
- Number of terms in common between query and document (logged)

Slide 14: Logistic Regression: Pros & Cons
Advantages:
- An absolute probability of relevance is available
- May reuse all past relevance judgments
Problems:
- Performance is very sensitive to the selection of features
- Not much guidance on feature selection
In practice, performance tends to be average.

Slide 15: Refining P(R=1|Q,D), Method 2: Generative Models
Basic idea:
- Define P(Q,D|R)
- Compute O(R=1|Q,D) using Bayes' rule
Special cases:
- Document "generation": P(Q,D|R) = P(D|Q,R)P(Q|R)
- Query "generation": P(Q,D|R) = P(Q|D,R)P(D|R)
(In document generation, P(Q|R) does not depend on D and is ignored for ranking D.)

Slide 16: Document Generation
Model of relevant docs for Q vs. model of non-relevant docs for Q.
- Assume independent attributes A1 ... Ak (why?)
- Let D = d1 ... dk, where dk in {0,1} is the value of attribute Ak (similarly Q = q1 ... qk)
- Non-query terms are equally likely to appear in relevant and non-relevant docs

Slide 17: Robertson-Sparck Jones Model (Robertson & Sparck Jones 76)
Two parameters for each term Ai:
- pi = P(Ai=1|Q,R=1): the probability that term Ai occurs in a relevant doc
- qi = P(Ai=1|Q,R=0): the probability that term Ai occurs in a non-relevant doc
(the RSJ model)
How to estimate the parameters? Suppose we have relevance judgments; the "+0.5" and "+1" in the estimates can be justified by Bayesian estimation.

Slide 18: RSJ Model: No Relevance Info (Croft & Harper 79)
How to estimate the parameters when we do not have relevance judgments?
- Assume pi to be a constant
- Estimate qi by assuming all documents to be non-relevant
N: # documents in the collection; ni: # documents in which term Ai occurs.

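The RSJ weight and its smoothed estimates do not survive the transcript; the sketch below assumes the standard forms: the term weight log[pi(1-qi)/(qi(1-pi))], with pi = (ri+0.5)/(R+1) and qi = (ni-ri+0.5)/(N-R+1) when judgments are available, and the Croft-Harper variant log[(N-ni+0.5)/(ni+0.5)] when they are not.

```python
import math

def rsj_weight(r_i, R, n_i, N):
    """RSJ term weight with the '+0.5 / +1' smoothing.
    r_i: # relevant docs containing the term, R: # relevant docs,
    n_i: # docs containing the term, N: collection size."""
    p_i = (r_i + 0.5) / (R + 1)
    q_i = (n_i - r_i + 0.5) / (N - R + 1)
    return math.log(p_i * (1 - q_i) / (q_i * (1 - p_i)))

def croft_harper_weight(n_i, N):
    """No relevance info: constant p_i, all docs assumed non-relevant (an IDF-like weight)."""
    return math.log((N - n_i + 0.5) / (n_i + 0.5))

print(rsj_weight(r_i=8, R=10, n_i=50, N=10000))
print(croft_harper_weight(n_i=50, N=10000))
```
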
Slide 19: RSJ Model: Summary
- The most important classic probabilistic IR model
- Uses only term presence/absence, thus also referred to as the Binary Independence Model
- Essentially Naïve Bayes for doc ranking
- Most natural for relevance/pseudo feedback
- Without relevance judgments, the model parameters must be estimated in an ad hoc way
- Performance isn't as good as a tuned VS model

Slide 20: Improving RSJ: Adding TF
- Let D = d1 ... dk, where dk is the frequency count of term Ak
- Basic doc generation model: a 2-Poisson mixture model
- Many more parameters to estimate! (how many exactly?)

Slide 21: BM25/Okapi Approximation (Robertson et al. 94)
Idea: approximate p(R=1|Q,D) with a simpler function that shares similar properties.
Observations:
- log O(R=1|Q,D) is a sum of term weights Wi
- Wi = 0 if TFi = 0
- Wi increases monotonically with TFi
- Wi has an asymptotic limit
The simple function is the Okapi/BM25 TF transformation (the formula itself is lost in the transcript).

Slide 22: Adding Doc Length & Query TF
Incorporating doc length:
- Motivation: the 2-Poisson model assumes equal document lengths
- Implementation: "carefully" penalize long docs
Incorporating query TF:
- Motivation: appears to be not well justified
- Implementation: a similar TF transformation
The final formula is called BM25, achieving top TREC performance.

Slide 23: The BM25 Formula
"Okapi TF / BM25 TF" (the formula itself is lost in the transcript; see the sketch below).

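A minimal sketch assuming the common form of BM25 with the usual k1, b, k3 parameters; the exact variant on the original slide may differ (for example in the IDF component).

```python
import math

def bm25_term(tf, qtf, df, N, doclen, avgdl, k1=1.2, b=0.75, k3=1000):
    """One term's contribution to the BM25 score of a document."""
    idf = math.log((N - df + 0.5) / (df + 0.5))                       # RSJ-style IDF
    tf_part = (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * doclen / avgdl))  # saturating, length-normalized TF
    qtf_part = (qtf * (k3 + 1)) / (qtf + k3)                          # query TF transformation
    return idf * tf_part * qtf_part

# Score(d, q) = sum of bm25_term(...) over all query terms that occur in the document.
print(bm25_term(tf=3, qtf=1, df=50, N=10000, doclen=120, avgdl=100))
```
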
Slide 24: Extensions of "Doc Generation" Models
- Capture term dependence (Rijsbergen & Harper 78)
- Alternative ways to incorporate TF (Croft 83, Kalt 96)
- Feature/term selection for feedback (Okapi's TREC reports)
- Other possibilities (machine learning, ...)

Slide 25: Query Generation
Assuming a uniform document prior, ranking by O(R=1|Q,D) reduces to ranking by the query likelihood p(q|theta_d); the document prior is ignored under the uniform assumption.
Now, the question is how to compute p(q|theta_d). Generally this involves two steps:
(1) estimate a language model theta_d based on D
(2) compute the query likelihood according to the estimated model
This leads to the so-called "Language Modeling Approach" ...

Slide 26: What is a Statistical LM?
A probability distribution over word sequences, e.g.:
- p("Today is Wednesday") ~ 0.001
- p("Today Wednesday is") ~ 0.0000000000001
- p("The eigenvalue is positive") ~ 0.00001
Context-dependent!
Can also be regarded as a probabilistic mechanism for "generating" text, thus also called a "generative" model.

Slide 27: The Simplest Language Model (Unigram Model)
- Generate a piece of text by generating each word INDEPENDENTLY
- Thus, p(w1 w2 ... wn) = p(w1)p(w2)...p(wn)
- Parameters: {p(wi)}, with p(w1)+...+p(wN) = 1 (N is the vocabulary size)
- Essentially a multinomial distribution over words
- A piece of text can be regarded as a sample drawn according to this word distribution

Slide 28: Text Generation with Unigram LM
[Figure: sampling documents from a unigram language model p(w|theta)]
- Topic 1 (text mining): text 0.2, mining 0.1, association 0.01, clustering 0.02, ..., food 0.00001, ...; generates a text mining paper
- Topic 2 (health): food 0.25, nutrition 0.1, healthy 0.05, diet 0.02, ...; generates a food nutrition paper

Slide 29: Estimation of Unigram LM
[Figure: estimating p(w|theta) from a "text mining paper" with 100 words in total]
Observed counts: text 10, mining 5, association 3, database 3, algorithm 2, ..., query 1, efficient 1, ...
Estimates: p(text) = 10/100, p(mining) = 5/100, p(association) = 3/100, ..., p(query) = 1/100, ...

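A minimal sketch of the maximum-likelihood unigram estimate illustrated on the slide (relative frequencies in the document); the toy document below is made up.

```python
from collections import Counter

def unigram_mle(words):
    """Maximum-likelihood unigram model: p(w|d) = count(w, d) / |d|."""
    counts = Counter(words)
    total = len(words)
    return {w: c / total for w, c in counts.items()}

doc = "text mining text database text algorithm mining query".split()
model = unigram_mle(doc)
print(model["text"])                  # 3/8
print(model.get("retrieval", 0.0))    # unseen word gets probability 0, hence the need for smoothing
```
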
Slide 30: Language Models for Retrieval (Ponte & Croft 98)
[Figure: each document has its own language model; the text mining paper puts high probability on text, mining, association, clustering, ..., while the food nutrition paper puts high probability on food, nutrition, healthy, diet, ...]
Query = "data mining algorithms": which model would most likely have generated this query?

Slide 31: Ranking Docs by Query Likelihood
[Figure: documents d1, d2, ..., dN, each with its own doc LM; rank them by the query likelihoods p(q|theta_d1), p(q|theta_d2), ..., p(q|theta_dN)]

Slide 32: Retrieval as Language Model Estimation
- Document ranking based on query likelihood, where p(wi|d) is the document language model
- Retrieval problem => estimation of p(wi|d)
- Smoothing is an important issue, and it distinguishes different approaches

Slide 33: How to Estimate p(w|d)?
Simplest solution: the maximum likelihood estimator
- p(w|d) = relative frequency of word w in d
- What if a word doesn't appear in the text? Then p(w|d) = 0
In general, what probability should we give a word that has not been observed?
If we want to assign non-zero probabilities to such words, we'll have to discount the probabilities of the observed words.
This is what "smoothing" is about ...

Slide 34: Language Model Smoothing (Illustration)
[Figure: p(w) plotted over words w: the maximum likelihood estimate vs. the smoothed LM, which lowers the probabilities of seen words and gives unseen words non-zero probability]

Slide 35: How to Smooth?
All smoothing methods try to:
- discount the probability of words seen in a document
- re-allocate the extra counts so that unseen words will have a non-zero count
A simple method (additive smoothing): add a constant delta to the count of each word:
p(w|d) = (c(w,d) + delta) / (|d| + delta |V|), where c(w,d) is the count of w in d, |d| is the length of d (total counts), and |V| is the vocabulary size; delta = 1 gives "add one" (Laplace) smoothing.
Problems?

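A minimal sketch of the additive (add-delta) smoothing above; the vocabulary and document are toy examples.

```python
from collections import Counter

def additive_smoothing(words, vocab, delta=1.0):
    """p(w|d) = (c(w,d) + delta) / (|d| + delta * |V|)."""
    counts = Counter(words)
    denom = len(words) + delta * len(vocab)
    return {w: (counts[w] + delta) / denom for w in vocab}

vocab = {"text", "mining", "retrieval", "food"}
doc = "text mining text".split()
p = additive_smoothing(doc, vocab, delta=1.0)
print(p["text"], p["retrieval"], sum(p.values()))  # unseen words now get > 0; probabilities sum to 1 over V
```
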
Slide 36: A General Smoothing Scheme
All smoothing methods try to:
- discount the probability of words seen in a doc
- re-allocate the extra probability so that unseen words will have a non-zero probability
Most use a reference model (the collection language model) to discriminate unseen words:
p(w|d) = p_seen(w|d) (a discounted ML estimate) if w is seen in d, and alpha_d * p(w|C) (the collection language model) otherwise.

Slide 37: Smoothing & TF-IDF Weighting
Plugging the general smoothing scheme into the query likelihood retrieval formula, we obtain, for each matched term, a TF-like weight log[p_seen(w|d) / (alpha_d p(w|C))], whose 1/p(w|C) factor acts as IDF weighting, plus a doc length normalization term |q| log alpha_d (a long doc is expected to have a smaller alpha_d); a remaining term that depends only on the collection model can be ignored for ranking.
Smoothing with p(w|C) => TF-IDF weighting + length normalization.

Slide 38: Derivation of the Query Likelihood Retrieval Formula
The derivation rewrites log p(q|d) under the general smoothing scheme (a discounted ML estimate for seen words, alpha_d times the reference language model for unseen words); the key step is splitting the sum over query words into words seen and not seen in the document. Similar rewritings are very common when using LMs for IR.

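The derivation itself is lost in the transcript; the following is a hedged reconstruction of the standard rewriting, assuming p(w|d) = p_seen(w|d) for words seen in d and alpha_d p(w|C) otherwise, with |q| denoting the total query length.

```latex
\log p(q|d) = \sum_{w \in q} c(w,q)\,\log p(w|d)
            = \sum_{w \in q,\; c(w,d)>0} c(w,q)\,\log \frac{p_{\text{seen}}(w|d)}{\alpha_d\, p(w|C)}
              + |q|\,\log \alpha_d
              + \sum_{w \in q} c(w,q)\,\log p(w|C)
```

The last sum does not depend on the document and can be dropped for ranking, which is what yields the TF-IDF-like interpretation on the previous slide.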
Slide 39: More Smoothing Methods
Method 1 (absolute discounting): subtract a constant delta from the count of each word:
p(w|d) = max(c(w,d) - delta, 0)/|d| + (delta |d|_u / |d|) * p(w|REF), where |d|_u is the number of unique words in d, p(w|REF) is the reference model, and delta is a parameter.
Method 2 (linear interpolation, Jelinek-Mercer): "shrink" uniformly toward p(w|REF):
p(w|d) = (1 - lambda) * p_ML(w|d) + lambda * p(w|REF), where p_ML is the ML estimate and lambda in [0,1] is a parameter.

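A minimal sketch of Jelinek-Mercer and Dirichlet prior smoothing as given above; the collection model and parameter values are illustrative.

```python
from collections import Counter

def jelinek_mercer(counts, doclen, p_ref, lam=0.5):
    """p(w|d) = (1 - lam) * c(w,d)/|d| + lam * p(w|REF)."""
    return {w: (1 - lam) * counts.get(w, 0) / doclen + lam * p for w, p in p_ref.items()}

def dirichlet(counts, doclen, p_ref, mu=2000):
    """p(w|d) = (c(w,d) + mu * p(w|REF)) / (|d| + mu)."""
    return {w: (counts.get(w, 0) + mu * p) / (doclen + mu) for w, p in p_ref.items()}

doc = "text mining text".split()
counts, doclen = Counter(doc), len(doc)
p_ref = {"text": 0.01, "mining": 0.005, "retrieval": 0.002, "food": 0.02}  # toy collection LM
print(jelinek_mercer(counts, doclen, p_ref)["retrieval"])  # unseen word gets lam * p(w|REF)
print(dirichlet(counts, doclen, p_ref, mu=10)["text"])
```
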
Slide 41: Dirichlet Prior Smoothing
- ML estimator: M = argmax_M p(d|M)
- Bayesian estimator:
  - First consider the posterior: p(M|d) = p(d|M)p(M)/p(d)
  - Then consider the mean or the mode of the posterior distribution
- p(d|M): the sampling distribution (of the data)
- p(M) = p(theta_1, ..., theta_N): our prior on the model parameters
- Conjugate prior: the prior can be interpreted as "extra"/"pseudo" data
- The Dirichlet distribution is a conjugate prior for the multinomial sampling distribution, with "extra"/"pseudo" word counts alpha_i = mu * p(w_i|REF)

Slide 42: Dirichlet Prior Smoothing (cont.)
The posterior distribution of the parameters is again a Dirichlet, with counts c(w_i,d) + mu * p(w_i|REF).
The predictive distribution is the same as the posterior mean:
p(w_i|d) = (c(w_i,d) + mu * p(w_i|REF)) / (|d| + mu), i.e., Dirichlet prior smoothing.

Slide 43: Advantages of Language Models
- Solid statistical foundation
- Parameters can be optimized automatically using statistical estimation methods
- Can easily model many different retrieval tasks (to be covered more later)

Slide 44: What You Should Know
- The global relationship among different probabilistic models
- How logistic regression works
- How the Robertson-Sparck Jones model works
- The BM25 formula
- All document-generation models have trouble when no relevance judgments are available
- How the language modeling approach (query likelihood scoring) works
- How Dirichlet prior smoothing works
- 3 state-of-the-art retrieval models: Pivoted Normalization, Okapi/BM25, Query Likelihood (Dirichlet prior smoothing)

Slide 45: Implementation of an IR System

Slide 46: IR System Architecture
[Figure: the user issues a query through the interface; documents go through INDEXING into a doc representation and the query into a query representation; the ranking (SEARCHING) component matches the two to produce results; user judgments drive feedback and QUERY MODIFICATION back into the query representation]

Slide 47: Indexing
- Indexing = converting documents to data structures that enable fast search
- The inverted index is the dominant indexing method (used by all search engines)
- Other indices (e.g., a document index) may be needed for feedback

Slide 48: Inverted Index
- Fast access to all docs containing a given term (along with frequency and position information)
- For each term, we get a list of tuples (docID, freq, pos)
- Given a query, we can fetch the lists for all query terms and work on the involved documents
  - Boolean query: set operations
  - Natural language query: term weight summing
- More efficient than scanning docs (why?)

Slide 49: Inverted Index Example
Doc 1: "This is a sample document with one sample sentence"
Doc 2: "This is another sample document"
Dictionary (term, # docs, total freq): this 2 2; is 2 2; sample 2 3; another 1 1; ...
Postings (docID, freq): this -> (1,1), (2,1); is -> (1,1), (2,1); sample -> (1,2), (2,1); another -> (2,1); ...

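A minimal sketch of building such an inverted index in memory; document IDs, tokenization, and the structure are simplified, and position lists are omitted.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs: {doc_id: text}. Returns a dictionary {term: (df, total freq)} and postings {term: [(doc_id, freq), ...]}."""
    postings = defaultdict(list)
    for doc_id, text in sorted(docs.items()):
        counts = defaultdict(int)
        for term in text.lower().split():
            counts[term] += 1
        for term, freq in counts.items():
            postings[term].append((doc_id, freq))
    dictionary = {t: (len(pl), sum(f for _, f in pl)) for t, pl in postings.items()}
    return dictionary, postings

docs = {1: "This is a sample document with one sample sentence",
        2: "This is another sample document"}
dictionary, postings = build_inverted_index(docs)
print(dictionary["sample"])   # (2, 3)
print(postings["sample"])     # [(1, 2), (2, 1)]
```
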
Slide 50: Data Structures for an Inverted Index
Dictionary: modest size
- Needs fast random access
- Preferably kept in memory
- Hash table, B-tree, trie, ...
Postings: huge
- Sequential access is expected
- Can stay on disk
- May contain docID, term freq, term positions, etc.
- Compression is desirable

Slide 51: Inverted Index Compression
Observations:
- An inverted list is sorted (e.g., by docID or term freq)
- Small numbers tend to occur more frequently
Implications:
- "d-gaps" (store differences): d1, d2-d1, d3-d2, ...
- Exploit the skewed frequency distribution: fewer bits for small (high-frequency) integers
- Binary code, unary code, gamma-code, delta-code

Slide 52: Integer Compression Methods
In general, exploit the skewed distribution:
- Binary: equal-length coding
- Unary: x >= 1 is coded as x-1 one bits followed by a 0, e.g., 3 => 110; 5 => 11110
- Gamma-code: x => the unary code for 1 + floor(log2 x), followed by the offset x - 2^floor(log2 x) in floor(log2 x) bits, e.g., 3 => 101, 5 => 11001
- Delta-code: same as the gamma-code, but replace the unary prefix with its gamma-code, e.g., 3 => 1001, 5 => 10101

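A minimal sketch of d-gaps plus unary and gamma coding as described above; bit strings are returned as text for readability.

```python
def unary(x):
    """x >= 1 encoded as (x-1) one bits followed by a zero: 3 -> '110'."""
    return "1" * (x - 1) + "0"

def gamma(x):
    """Elias gamma: unary(1 + floor(log2 x)) followed by the offset in floor(log2 x) bits."""
    k = x.bit_length() - 1              # floor(log2 x)
    offset = x - (1 << k)
    return unary(k + 1) + format(offset, "b").zfill(k) if k else unary(1)

def d_gaps(doc_ids):
    """Store differences between successive (sorted) docIDs."""
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

print(gamma(3), gamma(5))          # 101 11001
print(d_gaps([3, 7, 9, 20]))       # [3, 4, 2, 11]
```
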
Slide 53: Constructing an Inverted Index
The main difficulty is building a huge index with limited memory.
- Memory-based methods: not usable for large collections
- Sort-based methods:
  - Step 1: collect local (termID, docID, freq) tuples
  - Step 2: sort the local tuples (to make "runs")
  - Step 3: pair-wise merge the runs
  - Step 4: output the inverted file

Slide 54: Sort-based Inversion
[Figure: parse & count documents (doc1 ... doc300 ...) into (termID, docID, freq) tuples, sort each run locally by term ID ("local" sort), then merge-sort the runs so that all information about term 1 ends up contiguous; a term lexicon (the=1, cold=2, days=3, a=4, ...) and a docID lexicon (doc1=1, doc2=2, doc3=3, ...) map strings to IDs]

Slide 55: Searching
Given a query, score documents efficiently.
Boolean query:
- Fetch the inverted lists for all query terms
- Perform set operations to get the subset of docs that satisfy the Boolean condition
- E.g., Q1 = "info" AND "security", Q2 = "info" OR "security"; with info: d1, d2, d3, d4 and security: d2, d4, d6, the results are {d2, d4} (Q1) and {d1, d2, d3, d4, d6} (Q2)

Slide 56: Ranking Documents
Assumption: score(d,q) = f[g(w(d,q,t1), ..., w(d,q,tn)), w(d), w(q)], where the ti's are the matched terms.
- Maintain a score accumulator for each doc to compute the function g
- For each query term ti:
  - Fetch the inverted list {(d1,f1), ..., (dn,fn)}
  - For each entry (dj,fj), compute w(dj,q,ti) and update the score accumulator for doc dj
- Adjust the scores to compute f, and sort

Slide 57: Ranking Documents: Example
Query = "info security"; s(d,q) = g(t1) + ... + g(tn) [sum of the frequencies of the matched terms]
Inverted lists: info: (d1,3), (d2,4), (d3,1), (d4,5); security: (d2,3), (d4,1), (d5,3)
Accumulators for (d1, d2, d3, d4, d5), starting at (0, 0, 0, 0, 0):
- (d1,3) from "info" => 3 0 0 0 0
- (d2,4) => 3 4 0 0 0
- (d3,1) => 3 4 1 0 0
- (d4,5) => 3 4 1 5 0
- (d2,3) from "security" => 3 7 1 5 0
- (d4,1) => 3 7 1 6 0
- (d5,3) => 3 7 1 6 3

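A minimal sketch of this term-at-a-time scoring with accumulators, using the raw term frequency as the weight exactly as in the example; a real system would plug in TF-IDF, BM25, or query-likelihood weights here.

```python
from collections import defaultdict

def score_term_at_a_time(query_terms, postings):
    """postings: {term: [(doc_id, freq), ...]}. Returns docs sorted by accumulated score."""
    accumulators = defaultdict(float)
    for term in query_terms:
        for doc_id, freq in postings.get(term, []):
            accumulators[doc_id] += freq          # w(d, q, t) = raw freq, as on the slide
    return sorted(accumulators.items(), key=lambda kv: kv[1], reverse=True)

postings = {"info": [("d1", 3), ("d2", 4), ("d3", 1), ("d4", 5)],
            "security": [("d2", 3), ("d4", 1), ("d5", 3)]}
print(score_term_at_a_time(["info", "security"], postings))
# [('d2', 7), ('d4', 6), ('d1', 3), ('d5', 3), ('d3', 1)]
```
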
Slide 58: Further Improving Efficiency
- Keep only the most promising accumulators
- Sort the inverted list in decreasing order of weights and fetch only the N entries with the highest weights
- Pre-compute as much as possible

Slide 59: Open Source IR Toolkits
- Smart (Cornell)
- MG (RMIT & Melbourne, Australia; Waikato, New Zealand)
- Lemur (CMU / Univ. of Massachusetts)
- Terrier (Glasgow)
- Lucene (open source)

Slide 60: Smart
- The most influential IR system/toolkit
- Developed at Cornell since the 1960s
- Vector space model with lots of weighting options
- Written in C
- The Cornell/AT&T groups have used the Smart system to achieve top TREC performance

Slide 61: MG
- A highly efficient toolkit for retrieval of text and images
- Developed by people at Univ. of Waikato, Univ. of Melbourne, and RMIT in the 1990s
- Written in C, running on Unix
- Vector space model with lots of compression and speed-up tricks
- People have used it to achieve good TREC performance

Slide 62: Lemur/Indri
- An IR toolkit emphasizing language models
- Developed at CMU and Univ. of Massachusetts in the 2000s
- Written in C++, highly extensible
- Vector space and probabilistic models, including language models
- Achieves good TREC performance with a simple language model

Slide 63: Terrier
- A large-scale retrieval toolkit with lots of applications (e.g., desktop search) and TREC support
- Developed at the University of Glasgow, UK
- Written in Java, open source
- "Divergence from randomness" retrieval model and other modern retrieval formulas

Slide 64: Lucene
- Open-source IR toolkit
- Initially developed by Doug Cutting in Java
- Has since been ported to some other languages
- Good for building IR/Web applications
- Many applications have been built using Lucene (e.g., the Nutch search engine)
- Currently the retrieval algorithms have poor accuracy

Slide 65: What You Should Know
- What an inverted index is
- Why an inverted index helps make search fast
- How to construct a large inverted index
- Simple integer compression methods
- How to use an inverted index to rank documents efficiently
- How to implement a simple IR system

Slide 66: Applications of Basic IR Techniques

Slide 67: Some "Basic" IR Techniques
- Stemming
- Stop words
- Weighting of terms (e.g., TF-IDF)
- Vector/unigram representation of text
- Text similarity (e.g., cosine)
- Relevance/pseudo feedback (e.g., Rocchio)
They are not just for retrieval!

Slide 68: Generality of Basic Techniques
[Figure: raw text -> stemming & stop-word removal -> tokenized text -> term weighting -> a term-document weight matrix (w11 ... wmn over terms t1 ... tn and docs d1 ... dm); from this representation, term similarity and doc similarity support CLUSTERING, vector centroids support CATEGORIZATION and META-DATA/ANNOTATION, and sentence selection supports SUMMARIZATION]

Slide 69: Sample Applications
- Information filtering (covered earlier)
- Text categorization
- Document/term clustering
- Text summarization

Slide 70: Text Categorization
- Pre-given categories and labeled document examples (the categories may form a hierarchy)
- Classify new documents
- A standard supervised learning problem
[Figure: a categorization system assigns incoming documents to categories such as Sports, Business, Education, Science, ...]

Slide 71: "Retrieval-based" Categorization
- Treat each category as representing an "information need"
- Treat the examples in each category as "relevant documents"
- Use feedback approaches to learn a good "query"
- Match all the learned queries against a new document
- A document gets the category (or categories) represented by the best matching query (or queries)

Slide 72: Prototype-based Classifier
Key elements ("retrieval techniques"):
- Prototype/document representation (e.g., term vector)
- Document-prototype distance measure (e.g., dot product)
- Prototype vector learning: Rocchio feedback
(The example on the slide is lost in the transcript; see the sketch below.)

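A minimal sketch of such a prototype (Rocchio-style) classifier: each category prototype is the centroid of its training vectors, and a new document gets the category of the closest prototype by dot product; the term-weight vectors here are toy examples, and only positive examples are used.

```python
def centroid(vectors):
    """Prototype = average of the category's training vectors (Rocchio with positive examples only)."""
    proto = {}
    for vec in vectors:
        for term, w in vec.items():
            proto[term] = proto.get(term, 0.0) + w / len(vectors)
    return proto

def dot(u, v):
    return sum(w * v.get(t, 0.0) for t, w in u.items())

def classify(doc_vec, prototypes):
    return max(prototypes, key=lambda cat: dot(doc_vec, prototypes[cat]))

train = {"sports": [{"game": 2, "team": 1}, {"team": 2, "score": 1}],
         "business": [{"stock": 2, "market": 1}, {"market": 2, "profit": 1}]}
prototypes = {cat: centroid(vecs) for cat, vecs in train.items()}
print(classify({"team": 1, "game": 1}, prototypes))   # -> 'sports'
```
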
Slide 73: K-Nearest Neighbor Classifier
- Keep all training examples
- Find the k examples that are most similar to the new document ("neighbor" documents)
- Assign the category that is most common among these neighbor documents (neighbors vote for the category)
- Can be improved by considering the distance of a neighbor (a closer neighbor has more influence)
Technical elements ("retrieval techniques"):
- Document representation
- Document distance measure

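A minimal k-NN sketch in the same toy vector representation, using cosine similarity as the document distance measure; both the data and k are illustrative.

```python
import math
from collections import Counter

def cosine(u, v):
    num = sum(w * v.get(t, 0.0) for t, w in u.items())
    den = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return num / den if den else 0.0

def knn_classify(doc_vec, labeled_docs, k=3):
    """labeled_docs: list of (vector, category). The k nearest neighbors vote for the category."""
    neighbors = sorted(labeled_docs, key=lambda dc: cosine(doc_vec, dc[0]), reverse=True)[:k]
    return Counter(cat for _, cat in neighbors).most_common(1)[0][0]

labeled = [({"game": 2, "team": 1}, "sports"), ({"team": 2, "score": 1}, "sports"),
           ({"stock": 2, "market": 1}, "business")]
print(knn_classify({"team": 1, "game": 1}, labeled, k=2))   # -> 'sports'
```
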
Slide 74: Example of K-NN Classifier
[Figure: the same test point classified with k = 1 and with k = 4; the neighborhood size can change the predicted category]

Slide 75: Examples of Text Categorization
- News article classification
- Meta-data annotation
- Automatic email sorting
- Web page classification

Slide 76: Sample Applications
- Information filtering
- Text categorization
- -> Document/term clustering
- Text summarization

Slide 77: The Clustering Problem
- Discover "natural structure"
- Group similar objects together
- Objects can be documents, terms, or passages
(The example on the slide is lost in the transcript.)

Slide 78: Similarity-based Clustering (as opposed to "model-based")
- Define a similarity function to measure the similarity between two objects
- Gradually group similar objects together in a bottom-up fashion
- Stop when some stopping criterion is met
- Variations: different ways to compute group similarity based on individual object similarity

Slide 79: Similarity-induced Structure
[Figure: the hierarchical, dendrogram-like structure induced by grouping similar objects bottom-up]

Slide 80: How to Compute Group Similarity?
Given two groups g1 and g2, three popular methods:
- Single-link algorithm: s(g1,g2) = similarity of the closest pair
- Complete-link algorithm: s(g1,g2) = similarity of the farthest pair
- Average-link algorithm: s(g1,g2) = average similarity over all pairs

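A minimal sketch of the three group-similarity definitions, given any pairwise similarity function; the similarity on one-dimensional points below is a toy example.

```python
from itertools import product

def group_similarity(g1, g2, sim, method="single"):
    """Single-link: max pair similarity; complete-link: min; average-link: mean over all pairs."""
    sims = [sim(a, b) for a, b in product(g1, g2)]
    if method == "single":
        return max(sims)
    if method == "complete":
        return min(sims)
    return sum(sims) / len(sims)        # average-link

sim = lambda a, b: 1.0 / (1.0 + abs(a - b))   # toy similarity between numbers
g1, g2 = [0.0, 1.0], [2.0, 5.0]
for m in ("single", "complete", "average"):
    print(m, group_similarity(g1, g2, sim, m))
```
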
Slide 81: Three Methods Illustrated
[Figure: two groups g1 and g2; the single-link algorithm uses the closest pair, the complete-link algorithm the farthest pair, and the average-link algorithm all pairs]

Slide 82: Examples of Doc/Term Clustering
- Clustering of retrieval results
- Clustering of documents in the whole collection
- Term clustering to define a "concept" or "theme"
- Automatic construction of hyperlinks
- In general, very useful for text mining

Slide 83: Sample Applications
- Information filtering
- Text categorization
- Document/term clustering
- -> Text summarization

Slide 84: The Summarization Problem
- Essentially "semantic compression" of text
- Selection-based vs. generation-based summaries
- In general, we need a purpose for summarization, but it's hard to define one

Slide 85: "Retrieval-based" Summarization
Observation: term vector -> summary?
Basic approach:
- Rank "sentences" and select the top N as a summary
Methods for ranking sentences:
- Based on term weights
- Based on the position of sentences
- Based on the similarity of the sentence vector and the document vector

Slide 86: Simple Discourse Analysis
[Figure: consecutive sentence vectors (vector 1, vector 2, vector 3, ..., vector n-1, vector n); the similarity between adjacent vectors is used to find segment boundaries]

Slide 87: A Simple Summarization Method
[Figure: within each segment, the sentence (sentence 1, 2, 3, ...) most similar to the doc vector is selected; the selected sentences form the summary]

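A minimal sketch of the selection-based method above: rank sentences by cosine similarity to the whole-document vector and keep the top N; the tokenization and weighting are deliberately naive and the sample sentences are made up.

```python
import math
from collections import Counter

def cosine(u, v):
    num = sum(w * v.get(t, 0.0) for t, w in u.items())
    den = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return num / den if den else 0.0

def summarize(sentences, top_n=2):
    """Select the sentences whose term vectors are most similar to the document vector."""
    sent_vecs = [Counter(s.lower().split()) for s in sentences]
    doc_vec = Counter(w for v in sent_vecs for w in v.elements())
    ranked = sorted(zip(sentences, sent_vecs), key=lambda sv: cosine(sv[1], doc_vec), reverse=True)
    return [s for s, _ in ranked[:top_n]]

sentences = ["Text retrieval ranks documents for a query.",
             "Smoothing is important for language models.",
             "The weather was nice yesterday."]
print(summarize(sentences, top_n=2))
```
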
Slide 88: Examples of Summarization
- News summaries
- Summarizing retrieval results: single-doc summaries, multi-doc summaries
- Summarizing a cluster of documents (automatic label creation for clusters)

Slide 89: What You Should Know
- Retrieval touches some basic issues in text information management (what are these basic issues?)
- How to apply simple retrieval techniques, such as the vector space model, to information filtering, text categorization, clustering, and summarization
- There are many other tasks that can potentially benefit from simple IR techniques

Slide 90: Roadmap
This lecture:
- Other retrieval models
- IR system implementation
- Applications of basic TR techniques
Next lecture: a more in-depth treatment of language models

