IR Data Structures Making Matching Queries and Documents Effective and Efficient.


1 IR Data Structures Making Matching Queries and Documents Effective and Efficient

2 Lecture Objectives
• Learn an algorithm to stem without a dictionary
• Know principles of other stemming systems
• Understand other data structures which facilitate rapid access from keywords to documents

3 Stemming
• Reducing morphological variants of words to a standard underlying form
  – e.g. calculate, calculates, calculations to calculat-
• Improves recall at the expense of precision

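4 Porter Stemming Algorithm
• Well-known, effective stemmer which does not use a dictionary
• Uses the measure m of a word written in the form
  – [C](VC)^m[V]
  – where
    » C is a sequence of consonants
    » V is a sequence of vowels
    » square brackets mark optional parts, and m counts the repeated VC pairs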
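The measure can be computed by scanning a word and counting each point where a run of vowels is closed by a consonant. The following is a minimal sketch, assuming Porter's convention that y counts as a vowel when it follows a consonant; the function names `is_consonant` and `measure` are illustrative and not taken from the slides.

```python
def is_consonant(word: str, i: int) -> bool:
    """Return True if word[i] is a consonant under Porter's convention."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        # y is a consonant at the start of a word or after a vowel,
        # and a vowel after a consonant (e.g. the final y in "by").
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem: str) -> int:
    """Count the VC pairs in the form [C](VC)^m[V]."""
    stem = stem.lower()
    m = 0
    prev_is_vowel = False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_is_vowel:
            m += 1  # a vowel run has just been closed by a consonant
        prev_is_vowel = not cons
    return m
```

Under these assumptions, `measure("tree")` is 0, `measure("trouble")` is 1 and `measure("troubles")` is 2.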

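5 Porter Algorithm Step 1
Suffix    Replacement
-sses     -ss
-ing      (removed)
-at       -ate
-y        -i
Condition (slide text: “Stem only vowels”): in Porter's algorithm, -ing is removed and -y becomes -i only when the remaining stem contains a vowel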
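The four rules on this slide can be sketched as follows. This is a simplified illustration rather than the full Step 1 of Porter's algorithm, which has more rules; the names `contains_vowel` and `step1_sketch` are made up for this example, and the vowel test assumes that y counts as a vowel after a consonant.

```python
VOWELS = set("aeiou")


def contains_vowel(stem: str) -> bool:
    """True if the stem has a vowel; y counts when it follows a consonant."""
    for i, ch in enumerate(stem):
        if ch in VOWELS:
            return True
        if ch == "y" and i > 0 and stem[i - 1] not in VOWELS:
            return True
    return False


def step1_sketch(word: str) -> str:
    """Apply only the four suffix rules listed on the slide."""
    word = word.lower()
    if word.endswith("sses"):
        word = word[:-2]                  # -sses -> -ss
    elif word.endswith("ing") and contains_vowel(word[:-3]):
        word = word[:-3]                  # -ing -> (removed)
        if word.endswith("at"):
            word += "e"                   # -at -> -ate, tidy-up after removal
    if word.endswith("y") and contains_vowel(word[:-1]):
        word = word[:-1] + "i"            # -y -> -i
    return word
```

For example, "caresses" becomes "caress" and "conflating" becomes "conflate", while "sing" is left alone because the stem "s" has no vowel.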

6 Porter Algorithm Steps 2-4
Suffix    Replacement    Condition
-aliti    -al            Measure > 0
-icate    -ic            Measure > 0
-able     (removed)      Measure > 1
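A measure-guarded rule table can be sketched like this. It is an illustration only: it covers just the three rules shown on the slide, applies at most one of them, and uses a simplified measure in which only a, e, i, o and u count as vowels. The names `RULES` and `apply_rules` are hypothetical.

```python
import re


def measure(stem: str) -> int:
    """Count vowel-run-then-consonant-run pairs (VC) in the stem.

    Simplification: only a, e, i, o, u are treated as vowels.
    """
    return len(re.findall(r"[aeiou]+[^aeiou]+", stem.lower()))


# (suffix, replacement, measure the stem must exceed)
RULES = [
    ("aliti", "al", 0),   # step 2
    ("icate", "ic", 0),   # step 3
    ("able", "", 1),      # step 4
]


def apply_rules(word: str) -> str:
    """Rewrite the first matching suffix if the stem's measure is high enough."""
    word = word.lower()
    for suffix, replacement, min_measure in RULES:
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            if measure(stem) > min_measure:
                return stem + replacement
            return word  # suffix matched but the stem is too short
    return word
```

The measure condition is what stops short words from being damaged: "adjustable" becomes "adjust", but "table" is unchanged because its stem "t" has measure 0.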

7 Dictionary Based Stemmers
• Dictionary of stems
  – cf. vector-based methods
• Dictionary of words
  – effective handling of irregular forms
• Proper Name/Controlled Vocabulary Lists
• Equivalent Term/Thesauri
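A dictionary of words handles irregular forms by simple lookup. The sketch below assumes a tiny hand-made lexicon; the entries and the name `dictionary_stem` are invented for illustration, and a real system would load a much larger word list.

```python
# Hypothetical, hand-made lexicon: surface form -> stem.
LEXICON = {
    "mice": "mouse",
    "mouse": "mouse",
    "ran": "run",
    "running": "run",
    "runs": "run",
}


def dictionary_stem(word: str) -> str:
    """Look the word up; fall back to the word itself when it is unknown."""
    key = word.lower()
    return LEXICON.get(key, key)
```

Irregular forms such as "mice" map to "mouse" without any suffix rule, at the cost of maintaining the lexicon.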

8 Problems with stemming
• Always worsens precision in the hope of improving recall
• Causes (sometimes odd) misretrieval
  – “bled” vs “bleeding”
  – incorrect term conflation, e.g. “plastered” to “plaster”
• Do we really want to improve recall on the web?

9 N-Gram structures
• Store keywords broken down into fixed-length segments
  – e.g. trigrams: “sea colony” to
    » sea + col + olo + lon + ony
• Useful as an index structure, for stemming, and for spelling correction
  – e.g. “compuuter”
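Trigram splitting and its use for spelling correction can be sketched as follows. The Dice coefficient used to compare trigram sets is one common choice and is an assumption here, not something the slides specify; `ngrams`, `dice` and `suggest` are illustrative names.

```python
def ngrams(text: str, n: int = 3) -> list:
    """Split each word of the text into overlapping n-character segments."""
    grams = []
    for word in text.lower().split():
        grams.extend(word[i:i + n] for i in range(len(word) - n + 1))
    return grams


def dice(a: str, b: str) -> float:
    """Dice coefficient between the trigram sets of two strings."""
    ga, gb = set(ngrams(a)), set(ngrams(b))
    if not ga or not gb:
        return 0.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))


def suggest(word: str, vocabulary: list) -> str:
    """Return the vocabulary entry sharing the most trigrams with the word."""
    return max(vocabulary, key=lambda candidate: dice(word, candidate))
```

`ngrams("sea colony")` gives the five segments from the slide, and `suggest("compuuter", ["computer", "commuter", "colony"])` picks "computer" because the two share five trigrams.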

10 Index Data Structures
• Inverted Files
• PAT Data Structure
  – tree-based substrings
• Signature Files
• Hypertext Data Structure

11 Inverted Files
[Figure: the term “Alice” pointing to its list of document numbers 1, 5, 51182; the figure also shows the numbers 2, 887 and 42]
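An inverted file maps each term to the sorted list of documents that contain it. Here is a minimal sketch, assuming whitespace tokenisation and an in-memory dictionary; the document texts in the usage note are made up, and only the document numbers echo the slide.

```python
from collections import defaultdict


def build_inverted_file(docs: dict) -> dict:
    """Map each lower-cased term to the sorted ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}
```

With hypothetical documents `{1: "Alice was sitting", 5: "the kitten and Alice", 42: "a ball of worsted", 51182: "Alice had"}`, looking up `"alice"` in the result returns `[1, 5, 51182]`.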

12 Inverted Files Supporting Proximity
Alice: 1, 5, 51182
Text: “while Alice was sitting curled up in a corner of the great arm-chair, half talking to herself and half asleep, the kitten had been having a grand game of romps with the ball of worsted Alice had”
Positions: 167, 201,...
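Proximity queries need each posting to record where the term occurs, not just which document it is in. The sketch below assumes word offsets as positions and whitespace tokenisation; `build_positional_index` and `within` are illustrative names, and it does not reproduce the positions shown on the slide.

```python
from collections import defaultdict


def build_positional_index(docs: dict) -> dict:
    """Map term -> {doc_id: [word positions of the term in that document]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term][doc_id].append(pos)
    return index


def within(index: dict, a: str, b: str, k: int) -> list:
    """Return ids of documents where a and b occur at most k words apart."""
    postings_a = index.get(a, {})
    postings_b = index.get(b, {})
    hits = []
    # Only documents containing both terms can match.
    for doc_id in sorted(postings_a.keys() & postings_b.keys()):
        if any(abs(pa - pb) <= k
               for pa in postings_a[doc_id]
               for pb in postings_b[doc_id]):
            hits.append(doc_id)
    return hits
```

If the excerpt is split across two hypothetical documents, `within(index, "alice", "sitting", 2)` matches only the one in which the two words are close together.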

13 Hypertext Data Structure
• Nodes and Links
• File types imply a program to interpret (display/play) the data
• Tags in HTML imply how to load referenced data:
  – protocol
  – server
  – location at server

14 URL Example
http://www.cet.sunderland.ac.uk/~cs0jel/teaching/com268/Lglass.asc
• protocol: http://
• server: www.cet.sunderland.ac.uk/
• location: ~cs0jel/teaching/com268/Lglass.asc
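The same three-part split can be reproduced with Python's standard `urllib.parse` module. This is a small sketch; the helper name `split_url` and the dictionary keys are chosen here to mirror the slide's labels.

```python
from urllib.parse import urlsplit


def split_url(url: str) -> dict:
    """Break a URL into the protocol, server and location parts."""
    parts = urlsplit(url)
    return {
        "protocol": parts.scheme,   # e.g. "http"
        "server": parts.netloc,     # host name of the server
        "location": parts.path,     # path of the resource on that server
    }
```

Applied to the URL on the slide, it yields protocol `http`, server `www.cet.sunderland.ac.uk` and location `/~cs0jel/teaching/com268/Lglass.asc`.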

15 The Web

16 Conclusions
• Stemmers
  – Porter's Algorithm
  – Dictionary Based
  – disadvantages
• Inverted Files
• Hypertext
• N-grams and other Data Structures

