1 CS 430: Information Discovery, Lecture 16: Thesaurus Construction

2 Course Administration
Midterm examination
- Grades will be mailed over the weekend
- Answer books will not be returned
- Most questions will be discussed in class
- Question paper will be posted on the course web site

3 Decisions in creating a thesaurus
1. Which terms should be included in the thesaurus?
2. How should the terms be grouped?

4 Terms to include
- Only terms that are likely to be of interest for content identification
- Ambiguous terms should be coded for the senses likely to be important in the document collection
- Each thesaurus class should have approximately the same frequency of occurrence
- Terms of negative discrimination should be eliminated
(after Salton and McGill)

5 Discriminant value
Discriminant value is the degree to which a term is able to discriminate between the documents of a collection:

$$\mathrm{DV}_k = (\text{average document similarity without term } k) - (\text{average document similarity with term } k)$$

Good discriminators decrease the average document similarity.
Note that this definition uses the document similarity.

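6 Incidence array
D1: alpha bravo charlie delta echo foxtrot golf
D2: golf golf golf delta alpha
D3: bravo charlie bravo echo foxtrot bravo
D4: foxtrot alpha alpha golf golf delta

|    | alpha | bravo | charlie | delta | echo | foxtrot | golf | Total |
|----|-------|-------|---------|-------|------|---------|------|-------|
| D1 | 1     | 1     | 1       | 1     | 1    | 1       | 1    | 7     |
| D2 | 1     |       |         | 1     |      |         | 1    | 3     |
| D3 |       | 1     | 1       |       | 1    | 1       |      | 4     |
| D4 | 1     |       |         | 1     |      | 1       | 1    | 4     |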
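The incidence array above can be derived mechanically from the four documents. The following is a minimal sketch, not part of the original slides; the names `docs`, `incidence_row`, and `row_totals` are illustrative assumptions.

```python
# The four example documents from the slide.
docs = {
    "D1": "alpha bravo charlie delta echo foxtrot golf",
    "D2": "golf golf golf delta alpha",
    "D3": "bravo charlie bravo echo foxtrot bravo",
    "D4": "foxtrot alpha alpha golf golf delta",
}

# Vocabulary in alphabetical order, which matches the column order on the slide.
terms = sorted({word for text in docs.values() for word in text.split()})


def incidence_row(text, vocabulary):
    """Return 1 for each vocabulary term present in the text, else 0."""
    present = set(text.split())
    return [1 if term in present else 0 for term in vocabulary]


incidence = {name: incidence_row(text, terms) for name, text in docs.items()}

# Number of distinct terms per document (the right-hand column on the slide).
row_totals = {name: sum(row) for name, row in incidence.items()}

for name, row in incidence.items():
    print(name, row, row_totals[name])
```

Repeated words such as "golf golf golf" in D2 count once, since an incidence array records presence only.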

7 Document similarity matrix

|    | D1   | D2   | D3   | D4   |
|----|------|------|------|------|
| D1 |      | 0.65 | 0.76 | 0.76 |
| D2 | 0.65 |      | 0.00 | 0.87 |
| D3 | 0.76 | 0.00 |      | 0.25 |
| D4 | 0.76 | 0.87 | 0.25 |      |

Average similarity = 0.55
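The slide does not name the similarity measure used for this matrix. As a sketch under that caveat, the cosine measure applied to the rows of the incidence array appears to reproduce the figures shown; the function and variable names below are illustrative assumptions.

```python
import math
from itertools import combinations

# Incidence rows; columns are alpha, bravo, charlie, delta, echo, foxtrot, golf.
incidence = {
    "D1": [1, 1, 1, 1, 1, 1, 1],
    "D2": [1, 0, 0, 1, 0, 0, 1],
    "D3": [0, 1, 1, 0, 1, 1, 0],
    "D4": [1, 0, 0, 1, 0, 1, 1],
}


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Similarity for every unordered pair of documents.
pair_similarity = {
    (a, b): cosine(incidence[a], incidence[b])
    for a, b in combinations(incidence, 2)
}
average_similarity = sum(pair_similarity.values()) / len(pair_similarity)

for pair, value in pair_similarity.items():
    print(pair, round(value, 2))
print("average", round(average_similarity, 2))
```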

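8 Discriminant value
Average similarity = 0.55

| Term removed | Average similarity without term | DV    |
|--------------|---------------------------------|-------|
| alpha        | 0.53                            | -0.02 |
| bravo        | 0.56                            | +0.01 |
| charlie      | 0.56                            | +0.01 |
| delta        | 0.53                            | -0.02 |
| echo         | 0.56                            | +0.01 |
| foxtrot      | 0.52                            | -0.03 |
| golf         | 0.53                            | -0.02 |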
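A sketch of how these discriminant values might be computed: drop one term's column, recompute the average pairwise document similarity, and subtract the average with all terms present. Cosine similarity on incidence rows is an assumption (the slides do not name the measure), and the helper names are illustrative. The DV column on the slide appears to be obtained by subtracting the two-decimal rounded averages, so the unrounded differences computed here can differ in the last digit.

```python
import math
from itertools import combinations

terms = ["alpha", "bravo", "charlie", "delta", "echo", "foxtrot", "golf"]
incidence = [
    [1, 1, 1, 1, 1, 1, 1],  # D1
    [1, 0, 0, 1, 0, 0, 1],  # D2
    [0, 1, 1, 0, 1, 1, 0],  # D3
    [1, 0, 0, 1, 0, 1, 1],  # D4
]


def cosine(u, v):
    """Cosine similarity; assumed here as the document similarity measure."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))
    return dot / norm if norm else 0.0


def average_similarity(rows):
    """Mean similarity over all unordered pairs of documents."""
    sims = [cosine(u, v) for u, v in combinations(rows, 2)]
    return sum(sims) / len(sims)


def without_column(rows, k):
    """Copy of the matrix with term column k removed."""
    return [row[:k] + row[k + 1:] for row in rows]


with_all_terms = average_similarity(incidence)

# For each term: (average similarity without it, unrounded discriminant value).
discriminant = {}
for k, term in enumerate(terms):
    without_k = average_similarity(without_column(incidence, k))
    discriminant[term] = (without_k, without_k - with_all_terms)

for term, (avg, dv) in discriminant.items():
    print(f"{term:8s} {avg:.2f} {dv:+.3f}")
```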

9 Similarities
Automatic thesaurus construction depends on a measure of similarity between terms.
One measure of similarity is the number of documents that have terms j and k in common:

$$S(t_j, t_k) = \sum_{i=1}^{n} t_{ij}\, t_{ik}$$

where $t_{ij} = 1$ if document i contains term j and 0 otherwise.
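As a small illustration of this count, the sketch below sums the products of two term columns of the example incidence array. The function name `cooccurrence` is an illustrative assumption.

```python
terms = ["alpha", "bravo", "charlie", "delta", "echo", "foxtrot", "golf"]
incidence = [
    [1, 1, 1, 1, 1, 1, 1],  # D1
    [1, 0, 0, 1, 0, 0, 1],  # D2
    [0, 1, 1, 0, 1, 1, 0],  # D3
    [1, 0, 0, 1, 0, 1, 1],  # D4
]


def cooccurrence(term_j, term_k):
    """Number of documents that contain both terms: sum over i of t_ij * t_ik."""
    j, k = terms.index(term_j), terms.index(term_k)
    return sum(row[j] * row[k] for row in incidence)


print(cooccurrence("alpha", "delta"))  # both occur in D1, D2 and D4
print(cooccurrence("alpha", "bravo"))  # both occur only in D1
```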

10 Similarity measures
Improved similarity measures can be generated by:
- Using term frequency matrix instead of incidence matrix
- Weighting terms by frequency:

cosine measure

$$S(t_j, t_k) = \frac{\sum_{i=1}^{n} t_{ij}\, t_{ik}}{|t_j|\,|t_k|}$$

dice measure

$$S(t_j, t_k) = \frac{\sum_{i=1}^{n} t_{ij}\, t_{ik}}{\sum_{i=1}^{n} t_{ik} + \sum_{i=1}^{n} t_{ij}}$$
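The two measures can be sketched as functions over term columns, applied either to the incidence matrix or to a term frequency matrix. The frequency counts below are taken from the four example documents; the helper names are illustrative assumptions. The Dice form follows the slide as written, which omits the factor of 2 found in the usual definition of the Dice coefficient.

```python
import math

terms = ["alpha", "bravo", "charlie", "delta", "echo", "foxtrot", "golf"]

# Term frequency matrix: rows are D1..D4, columns follow `terms`.
frequency = [
    [1, 1, 1, 1, 1, 1, 1],  # D1
    [1, 0, 0, 1, 0, 0, 3],  # D2
    [0, 3, 1, 0, 1, 1, 0],  # D3
    [2, 0, 0, 1, 0, 1, 2],  # D4
]
# Incidence matrix: 1 wherever the frequency is non-zero.
incidence = [[1 if count else 0 for count in row] for row in frequency]


def column(matrix, term):
    """Vector of one term's values across all documents."""
    j = terms.index(term)
    return [row[j] for row in matrix]


def cosine(tj, tk):
    """Sum of products divided by the product of the vector lengths."""
    dot = sum(a * b for a, b in zip(tj, tk))
    norm = math.sqrt(sum(a * a for a in tj)) * math.sqrt(sum(b * b for b in tk))
    return dot / norm if norm else 0.0


def dice(tj, tk):
    """Sum of products divided by the sum of the two column totals (slide form)."""
    total = sum(tj) + sum(tk)
    return sum(a * b for a, b in zip(tj, tk)) / total if total else 0.0


print(dice(column(incidence, "alpha"), column(incidence, "bravo")))
print(cosine(column(frequency, "alpha"), column(frequency, "golf")))
```

With the incidence matrix, alpha and golf occur in exactly the same documents and are indistinguishable; the frequency matrix separates them because golf occurs three times in D2.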

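11 Similarities: Incidence array
D1: alpha bravo charlie delta echo foxtrot golf
D2: golf golf golf delta alpha
D3: bravo charlie bravo echo foxtrot bravo
D4: foxtrot alpha alpha golf golf delta

|    | alpha | bravo | charlie | delta | echo | foxtrot | golf |
|----|-------|-------|---------|-------|------|---------|------|
| D1 | 1     | 1     | 1       | 1     | 1    | 1       | 1    |
| D2 | 1     |       |         | 1     |      |         | 1    |
| D3 |       | 1     | 1       |       | 1    | 1       |      |
| D4 | 1     |       |         | 1     |      | 1       | 1    |
| n  | 3     | 2     | 2       | 3     | 2    | 3       | 3    |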

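12 Term similarity matrix

|         | alpha | bravo | charlie | delta | echo | foxtrot | golf |
|---------|-------|-------|---------|-------|------|---------|------|
| alpha   |       | 0.2   | 0.2     | 0.5   | 0.2  | 0.33    | 0.5  |
| bravo   | 0.2   |       | 0.5     | 0.2   | 0.5  | 0.4     | 0.2  |
| charlie | 0.2   | 0.5   |         | 0.2   | 0.5  | 0.4     | 0.2  |
| delta   | 0.5   | 0.2   | 0.2     |       | 0.2  | 0.33    | 0.5  |
| echo    | 0.2   | 0.5   | 0.5     | 0.2   |      | 0.4     | 0.2  |
| foxtrot | 0.33  | 0.4   | 0.4     | 0.33  | 0.4  |         | 0.33 |
| golf    | 0.5   | 0.2   | 0.2     | 0.5   | 0.2  | 0.33    |      |

Using incidence matrix and dice weighting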

13 Clustering -- nearest neighbor
[Figure: nearest-neighbor clustering of the terms alpha, delta, golf, echo, bravo, charlie and foxtrot, with links numbered 1 to 6]
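One way to read this slide is as single-link (nearest-neighbor) agglomerative clustering of the seven terms, where the similarity between two clusters is that of their most similar pair of members. The sketch below makes that assumption and uses the Dice values from the incidence array; the function names are illustrative. Several pairs tie at 0.5, so the order of the early merges depends on tie-breaking and need not match the numbering in the original figure.

```python
from itertools import combinations

terms = ["alpha", "bravo", "charlie", "delta", "echo", "foxtrot", "golf"]
incidence = [
    [1, 1, 1, 1, 1, 1, 1],  # D1
    [1, 0, 0, 1, 0, 0, 1],  # D2
    [0, 1, 1, 0, 1, 1, 0],  # D3
    [1, 0, 0, 1, 0, 1, 1],  # D4
]


def dice(j, k):
    """Dice similarity of term columns j and k, in the form used on the slides."""
    shared = sum(row[j] * row[k] for row in incidence)
    total = sum(row[j] for row in incidence) + sum(row[k] for row in incidence)
    return shared / total


def link(a, b):
    """Single-link similarity: the best similarity between any two members."""
    return max(dice(i, j) for i in a for j in b)


def single_link(n_items):
    """Merge the two most similar clusters until one remains; return the merges."""
    clusters = [frozenset([i]) for i in range(n_items)]
    merges = []
    while len(clusters) > 1:
        a, b = max(combinations(clusters, 2), key=lambda pair: link(*pair))
        merged = a | b
        merges.append((merged, link(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [merged]
    return merges


merges = single_link(len(terms))
for step, (cluster, score) in enumerate(merges, start=1):
    print(step, sorted(terms[i] for i in cluster), round(score, 2))
```

Whatever the tie-breaking, the links at similarity 0.5 group alpha, delta and golf together and bravo, charlie and echo together; foxtrot joins at 0.4, and the two groups join last at 0.33.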

14 Phrase construction
In a thesaurus, term classes may contain phrases.
Informal definitions:
- pair-frequency (i, j) is the frequency with which a pair of words occurs in context (e.g., in succession within a sentence)
- a phrase is a pair of words, i and j, that occur in context with a higher frequency than would be expected from their overall frequency

$$\mathrm{cohesion}(i, j) = \frac{\text{pair-frequency}(i, j)}{\mathrm{frequency}(i) \cdot \mathrm{frequency}(j)}$$

15 Phrase construction
Salton and McGill algorithm
1. Compute pair-frequency for all terms.
2. Reject all pairs that fall below a certain threshold.
3. Calculate cohesion values.
4. If cohesion is above a threshold value, consider the word pair as a phrase.

Automatic phrase construction by statistical methods is rarely used in practice.
There is promising research on phrase identification using methods of computational linguistics.
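The four steps can be sketched in a few lines, taking "in context" to mean adjacent words within a sentence and using the cohesion ratio of pair-frequency over the product of the word frequencies. The example sentences, the thresholds, and the name `find_phrases` are hypothetical and chosen only for illustration.

```python
from collections import Counter


def find_phrases(sentences, min_pair_freq, min_cohesion):
    """Return {(word_i, word_j): cohesion} for pairs that pass both thresholds."""
    word_freq = Counter()
    pair_freq = Counter()
    # Step 1: count words and adjacent word pairs within each sentence.
    for sentence in sentences:
        words = sentence.lower().split()
        word_freq.update(words)
        pair_freq.update(zip(words, words[1:]))

    phrases = {}
    for (w1, w2), count in pair_freq.items():
        # Step 2: reject pairs whose pair-frequency is below the threshold.
        if count < min_pair_freq:
            continue
        # Step 3: cohesion = pair-frequency / (frequency(i) * frequency(j)).
        cohesion = count / (word_freq[w1] * word_freq[w2])
        # Step 4: keep the pair as a phrase if cohesion is above the threshold.
        if cohesion > min_cohesion:
            phrases[(w1, w2)] = cohesion
    return phrases


# Hypothetical sentences, not taken from the lecture.
sentences = [
    "the thesaurus groups index terms",
    "index terms describe the document",
    "the index terms are grouped",
    "the document is in the index",
]

print(find_phrases(sentences, min_pair_freq=2, min_cohesion=0.22))
```

In this toy collection "the index" occurs twice, but "the" is so frequent overall that its cohesion falls below the threshold, which is the effect the cohesion ratio is meant to capture.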

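16 Types of Information Discovery
[Figure: types of information discovery, organized by media type (text; image, video, audio, etc.) and by searching versus browsing. Labels in the diagram: linking; statistical; user-in-loop; catalogs, indexes (metadata); CS 502; natural language processing; CS 474]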

