1 Conventional Text-Retrieval Systems Automatic Text Processing by G. Salton, Addison-Wesley, 1989. (Chapter 9)


1 Conventional Text-Retrieval Systems
Automatic Text Processing by G. Salton, Addison-Wesley, 1989. (Chapter 9)

2 Database Management
• A specified set of attributes is used to characterize each record.
  EMPLOYEE(NAME, SSN, BDATE, ADDR, SEX, SALARY, DNO)
• Retrieval requires an exact match between the attributes used in query formulations and those attached to the stored records.
  SELECT BDATE, ADDR FROM EMPLOYEE WHERE NAME = 'John Smith'

3 Text-Retrieval Systems
• Content identifiers (keywords, index terms, descriptors) characterize the stored texts.
• Retrieval depends on the degree of coincidence between the sets of identifiers attached to queries and documents.
  » content analysis
  » query formulation

4 Possible Representation
• Document representation
  » unweighted index terms (term vectors)
  » weighted index terms
  » …
• Query
  » unweighted or weighted index terms
  » Boolean combinations (or, and, not)
  » …
• Search operation must be effective

5 File Structures
• Main requirements
  » fast access for various kinds of searches
  » large number of indices
• Alternatives
  » Inverted Files
  » Signature Files
  » PAT trees

6 Inverted Files
• The file is represented as an array of indexed documents.

7 Inverted-file process
• The document-term array is inverted (transposed).

8 Inverted-file process (Continued)
• Take two or more rows of an inverted term-document array, and produce a single combined list of document identifiers.
• Ex: Query = (term2 and term3)
    term2   1 1 0 0
    term3   …
    result  <-- D2
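
The inversion and the row combination on slides 7 and 8 can be sketched as follows. This is a minimal illustration, not the book's implementation: the array `DOC_TERM`, the term names, and the helper names `invert` and `conjunction` are hypothetical, chosen so that the query (term2 and term3) returns D2 as on the slide.

```python
# Hypothetical document-term array: one row per document, one column per term.
TERMS = ["term1", "term2", "term3"]
DOC_TERM = {
    "D1": [1, 1, 0],
    "D2": [0, 1, 1],
    "D3": [1, 0, 1],
}


def invert(doc_term, terms):
    """Transpose the array into term -> ordered list of document identifiers."""
    inverted = {term: [] for term in terms}
    for doc_id in sorted(doc_term):
        for term, present in zip(terms, doc_term[doc_id]):
            if present:
                inverted[term].append(doc_id)
    return inverted


def conjunction(inverted, term_a, term_b):
    """Documents whose rows contain both term_a and term_b."""
    b_docs = set(inverted[term_b])
    return [doc for doc in inverted[term_a] if doc in b_docs]


INVERTED = invert(DOC_TERM, TERMS)
print(INVERTED["term2"])                        # documents indexed by term2
print(conjunction(INVERTED, "term2", "term3"))  # query (term2 and term3)
```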

9 List-merging for two ordered lists
• The inverted-index operations that produce answers are based on a list-merging process.
• Example
    T1: {D1, D3}
    T2: {D1, D2}
    Merged(T1, T2): {D1, D1, D2, D3}
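
A short sketch of the merge on slide 9, using the slide's own lists. The function names are illustrative, and the sketch assumes that each input list is already ordered, has no internal duplicates, and that identifiers compare correctly as strings (true for D1 to D9). Under those assumptions, an identifier that appears twice in the merged list satisfies an "and" query, and the distinct identifiers answer an "or" query.

```python
def merge_ordered(list_a, list_b):
    """Merge two ordered lists of document identifiers, keeping duplicates."""
    merged = []
    i = j = 0
    while i < len(list_a) and j < len(list_b):
        if list_a[i] <= list_b[j]:
            merged.append(list_a[i])
            i += 1
        else:
            merged.append(list_b[j])
            j += 1
    merged.extend(list_a[i:])
    merged.extend(list_b[j:])
    return merged


def duplicates(merged):
    """Identifiers present in both inputs (adjacent equal entries): 'and'."""
    return [a for a, b in zip(merged, merged[1:]) if a == b]


def distinct(merged):
    """Identifiers present in either input: 'or'."""
    result = []
    for doc in merged:
        if not result or result[-1] != doc:
            result.append(doc)
    return result


T1 = ["D1", "D3"]
T2 = ["D1", "D2"]
MERGED = merge_ordered(T1, T2)
print(MERGED, duplicates(MERGED), distinct(MERGED))
```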

10 Extensions of Inverted Index Operations (Distance Constraints)
• Distance constraints
  » (A within sentence B): terms A and B must co-occur in a common sentence
  » (A adjacent B): terms A and B must occur adjacently in the text

11 Extensions of Inverted Index Operations (Distance Constraints)
• Implementation
  » include term-location in the inverted indexes
      information: {P345, P348, P350, …}
      retrieval: {P123, P128, P345, …}
  » include sentence-location in the indexes
      information: {P345, 25; P345, 37; P348, 10; P350, 8; …}
      retrieval: {P123, 5; P128, 25; P345, 37; P345, 40; …}

12 Extensions of Inverted Index Operations (Distance Constraints)
  » include paragraph numbers, sentence numbers within paragraphs, and word numbers within sentences in the indexes
      information: {P345, 2, 3, 5; …}
      retrieval: {P345, 2, 3, 6; …}
  » Query examples
      (information adjacent retrieval)
      (information within five words retrieval)
  » Cost: the size of indexes
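
The distance checks can be sketched over postings of the form (document, paragraph, sentence, word), using the two entries shown on slide 12. The function name and the `ordered` flag are hypothetical. Two interpretations are assumed here and are not stated on the slides: both terms must fall in the same sentence, and "adjacent" means the second term immediately follows the first.

```python
# Postings: term -> list of (document, paragraph, sentence, word) tuples.
POSTINGS = {
    "information": [("P345", 2, 3, 5)],
    "retrieval": [("P345", 2, 3, 6)],
}


def cooccurrences(postings, term_a, term_b, max_gap, ordered=False):
    """Documents where the two terms lie within max_gap words in one sentence.

    With ordered=True, term_b must come after term_a.
    """
    hits = set()
    for doc_a, par_a, sen_a, word_a in postings.get(term_a, []):
        for doc_b, par_b, sen_b, word_b in postings.get(term_b, []):
            if (doc_a, par_a, sen_a) != (doc_b, par_b, sen_b):
                continue  # different document, paragraph, or sentence
            gap = word_b - word_a
            if not ordered:
                gap = abs(gap)
            if 0 < gap <= max_gap:
                hits.add(doc_a)
    return sorted(hits)


# (information adjacent retrieval)
print(cooccurrences(POSTINGS, "information", "retrieval", 1, ordered=True))
# (information within five words retrieval)
print(cooccurrences(POSTINGS, "information", "retrieval", 5))
```

The nested loop is quadratic in the posting-list lengths; because the postings are ordered, a merge-style scan would avoid that, at the cost of a longer example.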

13 Term Weights
• Term weights
    D_i = {T_i1, 0.2; T_i2, 0.5; T_i3, 0.6}
• Issues
  » how to generate the term weights
  » how to apply the term weights
    – Sum the weights of all document terms that match the given query.
    – Rank the output documents in descending order of the summed weight.
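
A minimal sketch of the sum-and-rank rule. The two document vectors reuse the weights of D1 and D2 from the Boolean example on slide 15. The query and the function names are illustrative assumptions, as is the choice to drop documents that match no query term.

```python
DOCS = {
    "D1": {"T1": 0.2, "T2": 0.5, "T3": 0.6},
    "D2": {"T1": 0.7, "T2": 0.2, "T3": 0.1},
}


def score(doc_weights, query_terms):
    """Sum the weights of the document terms that match the query."""
    return sum(doc_weights.get(term, 0.0) for term in query_terms)


def rank(docs, query_terms):
    """Document identifiers in descending order of summed weight."""
    scored = [(score(weights, query_terms), doc) for doc, weights in docs.items()]
    matching = [(s, doc) for s, doc in scored if s > 0]
    return [doc for s, doc in sorted(matching, key=lambda pair: (-pair[0], pair[1]))]


RANKED = rank(DOCS, ["T1", "T2"])  # hypothetical query with terms T1 and T2
print(RANKED)
```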

14 Boolean Query with Term Weights
• Transform the Boolean expression into disjunctive normal form.
    T1 and (T2 or T3) = (T1 and T2) or (T1 and T3)
• For each conjunct, take the minimum weight among the document terms in that conjunct.
• The document weight is the maximum of all the conjunct weights.

15 Boolean Query with Term Weights
• Example: Q = (T1 and T2) or T3

    Document vectors                 Conjunct weights        Query weight
                                     (T1 and T2)   (T3)      (T1 and T2) or T3
    D1 = (T1,0.2; T2,0.5; T3,0.6)    0.2           0.6       0.6
    D2 = (T1,0.7; T2,0.2; T3,0.1)    0.2           0.1       0.2

• D1 is preferred.
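
The min/max rule from slide 14 applied to the two documents of this example can be sketched as below. The query is given directly in disjunctive normal form as a list of conjuncts. Treating a term that is absent from a document as weight 0.0 is an assumption of this sketch.

```python
DOCS = {
    "D1": {"T1": 0.2, "T2": 0.5, "T3": 0.6},
    "D2": {"T1": 0.7, "T2": 0.2, "T3": 0.1},
}

# Q = (T1 and T2) or T3, as a list of conjuncts.
QUERY_DNF = [["T1", "T2"], ["T3"]]


def conjunct_weight(doc_weights, conjunct):
    """Minimum weight among the document terms named in one conjunct."""
    return min(doc_weights.get(term, 0.0) for term in conjunct)


def query_weight(doc_weights, dnf):
    """Maximum of the conjunct weights."""
    return max(conjunct_weight(doc_weights, conjunct) for conjunct in dnf)


for doc_id, weights in DOCS.items():
    print(doc_id, query_weight(weights, QUERY_DNF))
```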

16 Synonym Specification
• Original query
    (T1 and T2) or T3
    Assume S1 is a synonym of T1.
    Assume S3 is a synonym of T3.
• Broader query
    ((T1 or S1) and T2) or (T3 or S3)
• The number of relevant items retrieved may be larger.
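
One way to sketch the broadening step is to hold the query as a nested tuple and replace each term that has synonyms by an "or" group. The tuple representation, the `SYNONYMS` table, and the `expand` helper are assumptions made for this illustration.

```python
SYNONYMS = {"T1": ["S1"], "T3": ["S3"]}

# (T1 and T2) or T3, written as (operator, operand, operand, ...).
ORIGINAL = ("or", ("and", "T1", "T2"), "T3")


def expand(query, synonyms):
    """Replace every term that has synonyms by (term or synonym ...)."""
    if isinstance(query, str):
        alternatives = synonyms.get(query, [])
        return ("or", query, *alternatives) if alternatives else query
    operator, *operands = query
    return (operator, *(expand(operand, synonyms) for operand in operands))


BROADER = expand(ORIGINAL, SYNONYMS)
print(BROADER)
```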

17 Stemming
• Term truncation
  » Remove suffixes and/or prefixes from context terms.
  » Example
      PSYCH*: psychiatrist, psychiatry, psychiatric, psychology, psychological, …

18 Term Truncation
• Implementation
  » Only suffix truncation
      The conventional inverted-index methodology can be maintained unchanged.
  » Only prefix truncation
      The term entries in the inverted index are alphabetized by their reversed spelling.
      antisymmetry --> yrtemmysitna

19 Term Truncation
  » Both prefix and suffix truncation
      *SYMM*: antisymmetric, asymmetry
      inverted-index entries that are alphabetized both forward and backward
  » Infix truncation
      wom*n: woman, women
      inverted index with entries for all possible "rotated" word forms
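
The suffix-only and prefix-only cases can be sketched with two sorted term lists, one in normal spelling and one in reversed spelling, searched by binary search. The term list and function names are hypothetical, and lowercase terms are used for simplicity.

```python
from bisect import bisect_left


def prefix_range(sorted_keys, prefix):
    """All keys in a sorted list that begin with the given prefix."""
    start = bisect_left(sorted_keys, prefix)
    end = start
    while end < len(sorted_keys) and sorted_keys[end].startswith(prefix):
        end += 1
    return sorted_keys[start:end]


TERMS = ["antisymmetry", "asymmetry", "psychiatry", "psychology", "symmetry", "woman"]
FORWARD = sorted(TERMS)                      # ordinary alphabetical index
BACKWARD = sorted(t[::-1] for t in TERMS)    # index of reversed spellings


def suffix_truncation(stem):
    """Query stem*: the ordinary index is enough."""
    return prefix_range(FORWARD, stem)


def prefix_truncation(ending):
    """Query *ending: search the reversed index with the reversed ending."""
    return sorted(key[::-1] for key in prefix_range(BACKWARD, ending[::-1]))


print(suffix_truncation("psych"))
print(prefix_truncation("symmetry"))
```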

20 Term Truncation
• Each term entry X = x1, x2, …, xn with individual characters xi is augmented by adding a special terminal character /.
    ABC   --> ABC/
    BABC  --> BABC/
    BCAB  --> BCAB/
• Each augmented term x1, x2, …, xn/ is rotated cyclically by wrapping the term around itself n+1 times.
    ABC/  --> /ABC, C/AB, BC/A, ABC/

21 Term Truncation
• Each resulting word form is then augmented by appending a blank character ^.
• The resulting file of word forms is sorted alphabetically, using the order
    ^, /, a, b, c, …, Z   (low to high)

22
    Term   Augmented   Rotated forms   Sorted file
    ABC    ABC/        /ABC^           /ABC^
                       C/AB^           /BABC^
                       BC/A^           /BCAB^
                       ABC/^           AB/BC^
    BABC   BABC/       /BABC^          ABC/^
                       C/BAB^          ABC/B^
                       BC/BA^          B/BCA^
                       ABC/B^          BABC/^
                       BABC/^          BC/A^
    BCAB   BCAB/       /BCAB^          BC/BA^
                       B/BCA^          BCAB/^
                       AB/BC^          C/AB^
                       CAB/B^          C/BAB^
                       BCAB/^          CAB/B^

23 Retrieval Strategies
• Query term X
    Look for index entries /X^ or X/^.
• Query term X*
    Look for /X*.
• Query term *X
    Look for X/^ => X/Y1, …, X/Yn.
    original patterns: X, Y1X, …, YnX
• Query term *X*
    Look for XY1/Z1, …, XYn/Zn.
    original patterns: Z1XY1, …, ZnXYn

24
    Term   Augmented   Rotated forms   Sorted file   *B*
    ABC    ABC/        /ABC^           /ABC^
                       C/AB^           /BABC^
                       BC/A^           /BCAB^
                       ABC/^           AB/BC^
    BABC   BABC/       /BABC^          ABC/^
                       C/BAB^          ABC/B^
                       BC/BA^          B/BCA^        BCAB
                       ABC/B^          BABC/^        BABC
                       BABC/^          BC/A^         ABC
    BCAB   BCAB/       /BCAB^          BC/BA^        BABC
                       B/BCA^          BCAB/^        BCAB
                       AB/BC^          C/AB^
                       CAB/B^          C/BAB^
                       BCAB/^          CAB/B^

25 Retrieval Strategies
• Query term X*Y
    Look for Y/XZ1, …, Y/XZm.
    original patterns: XZ1Y, …, XZmY
• Cost
    The number of index entries increases: a term of n characters contributes n+1 rotated entries.
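
The rotated-form index and the lookup rules of slides 23 and 25 can be sketched as below, using the three example terms ABC, BABC, and BCAB. This is an illustration under stated assumptions, not the book's implementation. The trailing blank ^ is omitted, Python's default string ordering is used (in ASCII, "/" sorts below the letters), and patterns are limited to the forms X, X*, *X, *X*, and X*Y. All function names are hypothetical.

```python
from bisect import bisect_left

END = "/"  # terminal character used on the slides


def rotations(term):
    """All n+1 cyclic rotations of the augmented term (term + END)."""
    augmented = term + END
    return [augmented[i:] + augmented[:i] for i in range(len(augmented))]


def build_index(terms):
    """Sorted list of (rotated form, original term) pairs."""
    return sorted((rot, term) for term in set(terms) for rot in rotations(term))


def rotated_prefix(pattern):
    """Translate a truncated query into the prefix to look up."""
    if pattern.startswith("*") and pattern.endswith("*"):
        return pattern.strip("*")            # *X*  -> X
    if pattern.endswith("*"):
        return END + pattern[:-1]            # X*   -> /X
    if pattern.startswith("*"):
        return pattern[1:] + END             # *X   -> X/
    head, tail = pattern.split("*")          # X*Y  -> Y/X
    return tail + END + head


def lookup(index, pattern):
    """Original terms that match the pattern."""
    exact = "*" not in pattern
    prefix = END + pattern if exact else rotated_prefix(pattern)
    position = bisect_left(index, (prefix, ""))
    found = set()
    while position < len(index) and index[position][0].startswith(prefix):
        rotated, term = index[position]
        if not exact or rotated == prefix:   # exact query: the whole form must match
            found.add(term)
        position += 1
    return found


INDEX = build_index(["ABC", "BABC", "BCAB"])
print(lookup(INDEX, "*B*"))   # every term containing B
print(lookup(INDEX, "B*B"))   # terms that start and end with B
```

Each query becomes a single prefix scan over one sorted file, which is the point of storing every rotation; the price is the larger index noted on slide 25.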

