1 Vector Space Model CS 652 Information Extraction and Integration

2 Introduction
[Figure: the retrieval process. Docs are converted into index terms; the user's information need becomes a query; matching the query representation against the document representations produces a ranking.]

3 Introduction
A ranking is an ordering of the retrieved documents that (hopefully) reflects their relevance to the user query.
A ranking is based on fundamental premises regarding the notion of relevance, such as:
- common sets of index terms
- sharing of weighted terms
- likelihood of relevance
Each set of premises leads to a distinct IR model.

4 IR Models
User task: Retrieval or Browsing.
- Classic models: Boolean, Vector (Space), Probabilistic
- Set theoretic: Fuzzy, Extended Boolean
- Algebraic: Generalized Vector (Space), Latent Semantic Indexing, Neural Networks
- Probabilistic: Inference Network, Belief Network
- Structured models: Non-Overlapping Lists, Proximal Nodes
- Browsing: Flat, Structure Guided, Hypertext

5 Basic Concepts
Each document is described by a set of representative keywords or index terms.
Index terms are document words (mainly nouns) that carry meaning on their own and recall the main themes of a document.
Search engines, however, assume that all words are index terms (full-text representation).

6 Basic Concepts
Not all terms are equally useful for representing document contents.
The importance of an index term is represented by a weight associated with it.
Let
- k_i be an index term
- d_j be a document
- w_ij be the weight associated with the pair (k_i, d_j), which quantifies the importance of k_i for describing the contents of d_j

7 The Vector (Space) Model
Define:
- w_ij > 0 whenever k_i occurs in d_j
- w_iq >= 0, the weight associated with the pair (k_i, q)
- vec(d_j) = (w_1j, w_2j, ..., w_tj), the document vector of d_j
- vec(q) = (w_1q, w_2q, ..., w_tq), the query vector of q
The t index term vectors are assumed to be pairwise orthonormal, i.e., index terms are assumed to occur independently within the documents.
Queries and documents are thus represented as weighted vectors in a t-dimensional space.

8 The Vector (Space) Model
sim(q, d_j) = cos(θ) = [vec(d_j) • vec(q)] / (|d_j| × |q|)
            = Σ_{i=1..t} (w_ij × w_iq) / (√(Σ_{i=1..t} w_ij²) × √(Σ_{i=1..t} w_iq²))
where • is the inner product operator and |q| is the length of q.
Since w_ij ≥ 0 and w_iq ≥ 0, we have 0 ≤ sim(q, d_j) ≤ 1.
A document is retrieved even if it matches the query terms only partially.
[Figure: vectors d_j and q separated by angle θ in term space.]
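To make the cosine formula concrete, here is a minimal Python sketch (illustrative only; the function name and the example vectors are my own):

```python
import math

def cosine_sim(d, q):
    """Cosine similarity between two equal-length term-weight vectors."""
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm_d = math.sqrt(sum(w * w for w in d))
    norm_q = math.sqrt(sum(w * w for w in q))
    if norm_d == 0 or norm_q == 0:
        return 0.0          # an empty document or query matches nothing
    return dot / (norm_d * norm_q)

# A 3-term space: the document shares only one term with the query,
# yet still receives a nonzero score (partial matching).
print(cosine_sim([0.5, 0.8, 0.0], [1.0, 0.0, 1.0]))  # ~0.375
```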

9 The Vector (Space) Model
sim(q, d_j) = Σ_{i=1..t} (w_ij × w_iq) / (|d_j| × |q|)
How do we compute the weights w_ij and w_iq?
A good weight must take two effects into account:
- quantification of intra-document content (similarity): the tf factor, the term frequency within a document
- quantification of inter-document separation (dissimilarity): the idf factor, the inverse document frequency
w_ij = tf(i, j) × idf(i)

10 The Vector (Space) Model
Let
- N be the total number of documents in the collection
- n_i be the number of documents that contain k_i
- freq(i, j) be the raw frequency of k_i within d_j
A normalized tf factor is given by
  f(i, j) = freq(i, j) / max_l freq(l, j)
where the maximum is computed over all terms l that occur within the document d_j.
The inverse document frequency (idf) factor is
  idf(i) = log(N / n_i)
The log is used to make the values of tf and idf comparable. It can also be interpreted as the amount of information associated with the term k_i.

11 The Vector (Space) Model
The best term-weighting schemes use weights given by
  w_ij = f(i, j) × log(N / n_i)
This strategy is called a tf-idf weighting scheme.
For the query term weights, a common suggestion is
  w_iq = (0.5 + 0.5 × freq(i, q) / max_l freq(l, q)) × log(N / n_i)
The vector model with tf-idf weights is a good ranking strategy for general collections.
The VSM is usually as good as the known ranking alternatives, and it is simple and fast to compute.
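These formulas translate almost line for line into code. A small Python sketch (the function names, and the choice to zero out query terms with no occurrences, are my own assumptions):

```python
import math
from collections import Counter

def tf_idf_weights(doc_terms, collection, vocab):
    """Document weights w_ij = f(i, j) * log(N / n_i)."""
    N = len(collection)
    freq = Counter(doc_terms)
    max_freq = max(freq.values(), default=1)
    weights = []
    for term in vocab:
        n_i = sum(1 for d in collection if term in d)   # document frequency
        f_ij = freq[term] / max_freq                    # normalized tf
        idf = math.log(N / n_i) if n_i else 0.0         # idf(i) = log(N / n_i)
        weights.append(f_ij * idf)
    return weights

def query_weights(query_terms, collection, vocab):
    """Query weights w_iq = (0.5 + 0.5 * freq(i,q) / max_l freq(l,q)) * log(N / n_i)."""
    N = len(collection)
    freq = Counter(query_terms)
    max_freq = max(freq.values(), default=1)
    weights = []
    for term in vocab:
        if freq[term] == 0:
            weights.append(0.0)                         # term absent from the query
            continue
        n_i = sum(1 for d in collection if term in d)
        idf = math.log(N / n_i) if n_i else 0.0
        weights.append((0.5 + 0.5 * freq[term] / max_freq) * idf)
    return weights
```

Ranking a collection is then just cosine_sim(tf_idf_weights(d, ...), query_weights(q, ...)) for each document d, sorted in descending order.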

12 The Vector (Space) Model
Advantages:
- term weighting improves the quality of the answer set
- partial matching allows retrieval of documents that approximate the query conditions
- the cosine ranking formula sorts documents according to their degree of similarity to the query
- a popular IR model because of its simplicity and speed
Disadvantages:
- assumes mutual independence of index terms; it is not clear that this is actually harmful, though

13 Naïve Bayes Classifier CS 652 Information Extraction and Integration

14 Bayes Theorem
The basic starting point for inference problems using probability theory as logic.
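The theorem itself appeared as an image on the slide and did not survive the transcript; in the usual hypothesis-and-data notation (my reconstruction, following Mitchell's textbook treatment) it reads:

```latex
P(h \mid D) = \frac{P(D \mid h)\, P(h)}{P(D)}
```

where P(h) is the prior probability of hypothesis h, P(D | h) the likelihood of the data D under h, and P(h | D) the posterior.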

15 Bayes Theorem
Example: does a patient have cancer, given a positive lab test? The slide's probabilities:
  P(cancer) = .008      P(~cancer) = .992
  P(+ | cancer) = .98   P(- | cancer) = .02
  P(+ | ~cancer) = .03  P(- | ~cancer) = .97
Comparing the two unnormalized posteriors for a + result:
  P(+ | cancer) P(cancer)   = (.98)(.008) = .0078
  P(+ | ~cancer) P(~cancer) = (.03)(.992) = .0298
Since .0298 > .0078, the more probable hypothesis is still ~cancer.
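A quick arithmetic check, including the normalized posterior the slide stops short of (a worked sketch, not part of the original):

```python
p_cancer, p_not = 0.008, 0.992
p_pos_given_cancer, p_pos_given_not = 0.98, 0.03

joint_cancer = p_pos_given_cancer * p_cancer   # 0.00784
joint_not = p_pos_given_not * p_not            # 0.02976

# Normalizing gives the posterior probability of cancer given a + test:
posterior = joint_cancer / (joint_cancer + joint_not)
print(round(posterior, 3))   # 0.209: even after a positive test,
                             # cancer remains the less likely hypothesis
```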

16 Basic Formulas for Probabilities
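The formulas were an image and were not transcribed. Assuming the slide follows the standard presentation (e.g., Mitchell's Machine Learning, ch. 6), it most likely listed:

```latex
% Product rule:
P(A \land B) = P(A \mid B)\, P(B) = P(B \mid A)\, P(A)
% Sum rule:
P(A \lor B) = P(A) + P(B) - P(A \land B)
% Theorem of total probability, for mutually exclusive A_1, ..., A_n
% with \sum_{i=1}^{n} P(A_i) = 1:
P(B) = \sum_{i=1}^{n} P(B \mid A_i)\, P(A_i)
```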

17 Naïve Bayes Classifier
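Slides 17-19 were images. The standard development they presumably follow (my reconstruction, in Mitchell's notation) derives the classifier from the MAP rule plus a conditional-independence assumption:

```latex
v_{MAP} = \arg\max_{v_j \in V} P(v_j \mid a_1, \ldots, a_n)
        = \arg\max_{v_j \in V} P(a_1, \ldots, a_n \mid v_j)\, P(v_j)
% Naive Bayes assumption: attributes are conditionally independent
% given the class, so P(a_1, ..., a_n | v_j) = \prod_i P(a_i | v_j), giving
v_{NB} = \arg\max_{v_j \in V} P(v_j) \prod_i P(a_i \mid v_j)
```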

18 Naïve Bayes Classifier

19 Naïve Bayes Classifier

20 Naïve Bayes Algorithm
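The algorithm body was an image; in outline (my paraphrase of the standard version), it estimates the class priors and per-attribute conditional probabilities from the training examples, then classifies new instances with the v_NB rule above. A compact sketch under those assumptions:

```python
import math
from collections import Counter, defaultdict

def naive_bayes_learn(examples):
    """examples: list of (attribute_tuple, target_value) pairs."""
    class_counts = Counter(v for _, v in examples)
    priors = {v: n / len(examples) for v, n in class_counts.items()}
    cond = defaultdict(Counter)   # cond[(i, v)][a]: count of a_i = a given class v
    for attrs, v in examples:
        for i, a in enumerate(attrs):
            cond[(i, v)][a] += 1
    return priors, cond, class_counts

def naive_bayes_classify(instance, priors, cond, class_counts):
    """Pick the class maximizing log P(v) + sum_i log P(a_i | v)."""
    best, best_score = None, float("-inf")
    for v, prior in priors.items():
        score = math.log(prior)
        for i, a in enumerate(instance):
            p = cond[(i, v)][a] / class_counts[v]   # relative-frequency estimate
            score += math.log(p) if p > 0 else float("-inf")
        if best is None or score > best_score:
            best, best_score = v, score
    return best
```

The raw relative-frequency estimate used here is exactly what the next two slides refine.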

21 Naïve Bayes Subtleties

22 Naïve Bayes Subtleties: m-estimate of probability
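The formula itself was an image; the m-estimate as usually stated (again Mitchell's formulation, assumed here) smooths the raw relative frequency n_c / n:

```latex
P(a_i \mid v_j) \approx \frac{n_c + m\,p}{n + m}
```

where n is the number of training examples with class v_j, n_c the number of those that also have attribute value a_i, p a prior estimate of the probability, and m the equivalent sample size controlling how strongly p is weighted.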

23 Learning to Classify Text
- Classify text into manually defined groups
- Estimate probability of class membership
- Rank by relevance
- Discover groupings and relationships
  - between texts
  - between real-world entities mentioned in the text

24 Learn_Naïve_Bayes_Text(Examples, V)
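The procedure itself was an image. Following Mitchell's well-known version of this routine (my reconstruction; treat the details as assumptions), it concatenates the text of each class, then estimates P(v_j) and Laplace-smoothed word probabilities P(w_k | v_j):

```python
from collections import Counter

def learn_naive_bayes_text(examples, vocabulary):
    """examples: list of (list_of_words, class_label) pairs.
    Returns class priors and smoothed per-class word probabilities."""
    labels = [v for _, v in examples]
    priors = {v: labels.count(v) / len(examples) for v in set(labels)}
    word_prob = {}
    for v in priors:
        # Text_j: all word positions in documents of class v
        text_v = [w for words, label in examples if label == v for w in words]
        n = len(text_v)
        counts = Counter(text_v)
        # P(w_k | v_j) = (n_k + 1) / (n + |Vocabulary|)
        word_prob[v] = {w: (counts[w] + 1) / (n + len(vocabulary))
                        for w in vocabulary}
    return priors, word_prob
```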

25 Calculate_Probability_Terms
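The defining formulas were an image; based on the standard treatment (an assumption on my part), they are presumably the quantities estimated above:

```latex
P(v_j) = \frac{|\mathrm{docs}_j|}{|\mathrm{Examples}|}
\qquad
P(w_k \mid v_j) = \frac{n_k + 1}{n + |\mathrm{Vocabulary}|}
```

where docs_j is the subset of Examples with class v_j, n the total number of word positions in docs_j, and n_k the number of times word w_k occurs among them.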

26 Classify_Naïve_Bayes_Text(Doc)
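Again the body was an image; a sketch consistent with the learner above (words outside the vocabulary are ignored, as in the usual formulation):

```python
import math

def classify_naive_bayes_text(doc_words, priors, word_prob):
    """Return the class maximizing log P(v_j) plus the sum of
    log P(w_i | v_j) over all word positions in the document."""
    best, best_score = None, float("-inf")
    for v, prior in priors.items():
        score = math.log(prior)
        for w in doc_words:
            if w in word_prob[v]:
                score += math.log(word_prob[v][w])
        if best is None or score > best_score:
            best, best_score = v, score
    return best
```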

27 How to Improve
- More training data
- Better training data
- Better text representation
  - the usual IR tricks (term weighting, etc.)
  - manually constructed good predictor features
- Hand off hard cases to a human being

