1 File Structures Information Retrieval: Data Structures and Algorithms by W.B. Frakes and R. Baeza-Yates (Eds.) Englewood Cliffs, NJ: Prentice Hall, 1992.


1 File Structures. Information Retrieval: Data Structures and Algorithms by W.B. Frakes and R. Baeza-Yates (Eds.) Englewood Cliffs, NJ: Prentice Hall, 1992. (Chapters 3-5)

2 File Structures for IR
- lexicographical indices
  - indices that are sorted
  - e.g. inverted files
  - e.g. Patricia (PAT) trees
- cluster file structures
- indices based on hashing
  - signature files

3 Inverted Files. Information Retrieval: Data Structures and Algorithms by W.B. Frakes and R. Baeza-Yates (Eds.) Englewood Cliffs, NJ: Prentice Hall, 1992. (Chapter 3)

4 Inverted Files
- Each document is assigned a list of keywords or attributes.
- Each keyword (attribute) is associated with operational relevance weights.
- An inverted file is the sorted list of keywords (attributes), with each keyword having links to the documents containing that keyword.
- Penalty
  - the size of inverted files ranges from 10% to 100% or more of the size of the text itself
  - need to update the index as the data set changes
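The definition above can be illustrated with a minimal sketch. The function name `build_inverted_file` and the two sample documents are hypothetical, and whitespace splitting stands in for whatever keyword assignment a real system would use; relevance weights are left out.

```python
from collections import defaultdict

def build_inverted_file(docs):
    """Map each keyword to the sorted list of document ids containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            postings[word].add(doc_id)
    # Emit keywords in sorted order, each linked to its documents.
    return {word: sorted(ids) for word, ids in sorted(postings.items())}

# Hypothetical sample collection.
docs = {1: "inverted files are sorted", 2: "signature files use hashing"}
index = build_inverted_file(docs)
```

Here `index["files"]` links to both documents, while `index["hashing"]` links only to document 2.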

5 Indexing Restrictions
- A controlled vocabulary, which is the collection of keywords that will be indexed. Words in the text that are not in the vocabulary will not be indexed.
- A list of stopwords that, for reasons of volume, will not be included in the index.
- A set of rules that decide the beginning of a word or a piece of text that is indexable.
- A list of character sequences to be indexed (or not indexed).
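A rough sketch of how these restrictions might be applied while parsing text. The stopword set, the word-boundary rule (runs of letters and digits), and the name `indexable_terms` are all assumptions chosen for illustration, not rules taken from the book.

```python
import re

# Hypothetical stopword list; real lists are much longer.
STOPWORDS = {"the", "of", "a", "and", "in"}

def indexable_terms(text, stopwords=STOPWORDS, vocabulary=None):
    """Return the terms of `text` that survive the indexing restrictions."""
    # Assumed rule for where a word begins and ends: runs of letters/digits.
    words = re.findall(r"[a-z0-9]+", text.lower())
    terms = []
    for word in words:
        if word in stopwords:
            continue  # stopwords are dropped for reasons of volume
        if vocabulary is not None and word not in vocabulary:
            continue  # outside the controlled vocabulary
        terms.append(word)
    return terms
```

With no controlled vocabulary, `indexable_terms("The size of the index")` keeps only `size` and `index`; passing `vocabulary={"index"}` narrows it further.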

6 Sorted array implementation of an inverted file

7 Structures used in Inverted Files
- Sorted Arrays
  - store the list of keywords in a sorted array
  - search using a standard binary search
  - advantage: easy to implement
  - disadvantage: updating the index is expensive
- Hashing Structures
- Tries (digital search trees)
- Combinations of these structures
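A small sketch of the sorted-array option, assuming two parallel Python lists (`keywords` and `postings`, both hypothetical names) in place of an on-disk array. Lookup is a binary search via the standard `bisect` module; adding a new keyword has to shift every later entry, which is the update cost noted above.

```python
from bisect import bisect_left

# Hypothetical index: keywords kept sorted, postings[i] belongs to keywords[i].
keywords = ["files", "hashing", "inverted", "sorted"]
postings = [[1, 2], [2], [1], [1]]

def lookup(keywords, postings, term):
    """Binary search for `term`; return its document list or []."""
    i = bisect_left(keywords, term)
    if i < len(keywords) and keywords[i] == term:
        return postings[i]
    return []

def add_keyword(keywords, postings, term, doc_id):
    """Insert a posting; a new keyword costs an O(n) shift of the arrays."""
    i = bisect_left(keywords, term)
    if i < len(keywords) and keywords[i] == term:
        if doc_id not in postings[i]:
            postings[i].append(doc_id)
    else:
        keywords.insert(i, term)
        postings.insert(i, [doc_id])

add_keyword(keywords, postings, "tries", 3)
```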

8 Sorted Arrays
1. The input text is parsed into a list of words along with their location in the text (a time- and storage-consuming operation).
2. This list is inverted from a list of terms in location order to a list of terms in alphabetical order.
3. Add term weights, or reorganize or compress the files.
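Steps 1 and 2 can be sketched in a few lines, assuming word positions count from 1 and that whitespace splitting is an acceptable parser; `parse` and `invert` are hypothetical names, and step 3 (weights, compression) is omitted.

```python
def parse(text):
    """Step 1: list of (word, location) pairs in location order."""
    return [(word, pos) for pos, word in enumerate(text.lower().split(), start=1)]

def invert(word_list):
    """Step 2: reorder the same pairs alphabetically (ties by location)."""
    return sorted(word_list)

word_list = parse("b a b")
inverted = invert(word_list)
```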

9 Inversion of Word List

10 Dictionary and postings file
- Idea: the file to be searched should be as short as possible, so split a single file into two pieces: a dictionary and a postings file.
- e.g. data set: 38,304 records, 250,000 unique terms
- postings entries: (document #, frequency)
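One possible way to picture the split is sketched below. The dictionary layout (term, number of postings, offset into the postings list) is an assumption made for illustration; only the (document #, frequency) postings format comes from the slide.

```python
from collections import Counter

def split_index(term_doc_pairs):
    """Split (term, doc#) occurrences into a dictionary and a postings file.

    dictionary entries: (term, number_of_postings, offset_into_postings)
    postings entries:   (doc#, frequency)
    """
    counts = Counter(term_doc_pairs)  # (term, doc#) -> frequency
    dictionary, postings = [], []
    for (term, doc_id), freq in sorted(counts.items()):
        if not dictionary or dictionary[-1][0] != term:
            dictionary.append([term, 0, len(postings)])  # new term starts here
        dictionary[-1][1] += 1
        postings.append((doc_id, freq))
    return [tuple(entry) for entry in dictionary], postings

# Hypothetical occurrences: "b" appears twice in document 1.
dictionary, postings = split_index([("b", 1), ("a", 2), ("b", 1), ("b", 3)])
```

A search then touches only the short dictionary and follows the offset into the postings file when a term matches.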

11 Producing an Inverted File for Large Data Sets without Sorting
- Idea: avoid the use of an explicit sort by using a right-threaded binary tree.
- Each tree node holds the current number of term postings and the storage location of the postings list.
- To produce the inverted file, traverse the binary tree and the linked postings lists.
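The following is a minimal in-memory sketch of a right-threaded binary tree used this way. The class names are hypothetical, and a Python list on each node stands in for the linked postings list and its storage location; the point is only that in-order traversal follows threads, so terms come out alphabetically without an explicit sort.

```python
class Node:
    def __init__(self, term):
        self.term = term
        self.left = None
        self.right = None       # right child, or in-order successor if is_thread
        self.is_thread = False
        self.postings = []      # stands in for the linked postings list

class ThreadedIndex:
    def __init__(self):
        self.root = None

    def add(self, term, doc_id):
        self._find_or_insert(term).postings.append(doc_id)

    def _find_or_insert(self, term):
        if self.root is None:
            self.root = Node(term)
            return self.root
        cur = self.root
        while True:
            if term == cur.term:
                return cur
            if term < cur.term:
                if cur.left is None:
                    new = Node(term)
                    new.right, new.is_thread = cur, True  # successor is parent
                    cur.left = new
                    return new
                cur = cur.left
            else:
                if cur.is_thread or cur.right is None:
                    new = Node(term)
                    # New node inherits the old thread (or the end marker).
                    new.right, new.is_thread = cur.right, cur.is_thread
                    cur.right, cur.is_thread = new, False
                    return new
                cur = cur.right

    @staticmethod
    def _leftmost(node):
        while node is not None and node.left is not None:
            node = node.left
        return node

    def in_order(self):
        """Yield (term, postings) alphabetically, with no stack or recursion."""
        cur = self._leftmost(self.root)
        while cur is not None:
            yield cur.term, cur.postings
            cur = cur.right if cur.is_thread else self._leftmost(cur.right)

idx = ThreadedIndex()
for term, doc in [("m", 1), ("c", 1), ("x", 2), ("d", 2), ("m", 3), ("a", 3)]:
    idx.add(term, doc)
result = list(idx.in_order())
```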

12 A Fast Inversion Algorithm
- Principle 1: large primary memories are available. If databases can be split into memory loads that can be rapidly processed and then combined, the overall cost will be minimized.
- Principle 2: exploit the inherent order of the input data. It is very expensive to use polynomial or even n log n sorting algorithms for large files.

13 FAST-INV algorithm See p. 13. concept postings/ pointers

14 Sample document vector
- Columns: document number, concept number (one concept number for each unique word)
- Similar to the document-word list shown on p. 7.
- The concept numbers are sorted within document numbers, and document numbers are sorted within the collection.

15 Preparation
- Terminology
  - HCN = highest concept number in dictionary, or the number of words to be indexed
  - L = number of document/concept pairs in the collection
  - M = available primary memory size
- Assumption
  - M >> HCN
  - M < L

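16 Figure annotations:
- the range of concepts for each primary load
- Read in each (Doc, Con) pair and look up Con in the Load table to determine which Load the pair belongs to.
- Invert each Load File in turn.
- The Offset in the CONPTR table shows the position where each entry should be written.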

17 Preparation
1. Allocate an array, con_entries_cnt, of size HCN.
2. For each entry in the document vector file: increment con_entries_cnt[con#]

Sample (con#, doc#) pairs, with the cumulative count after each concept:
(start) ................ 0
(1,1), (1,4) ........... 2
(2,3) .................. 3
(3,1), (3,2), (3,5) .... 6
(4,2), (4,3) ........... 8
…
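The counting pass can be sketched as follows. The document vector here is a hypothetical sample chosen so that its inversion matches the (con#, doc#) pairs listed on the slide; index 0 of the array is left unused so that concept numbers can be used directly as indices.

```python
from itertools import accumulate

# Hypothetical document vector file: (doc#, con#) pairs, sorted by document.
doc_vectors = [(1, 1), (1, 3), (2, 3), (2, 4), (3, 2), (3, 4), (4, 1), (5, 3)]
HCN = 4  # highest concept number

# Step 1: one counter per concept (slot 0 unused).
con_entries_cnt = [0] * (HCN + 1)

# Step 2: one pass over the document vector file.
for _doc, con in doc_vectors:
    con_entries_cnt[con] += 1

# Running totals give the cumulative counts 0, 2, 3, 6, 8.
cumulative = list(accumulate(con_entries_cnt))
```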

18 Preparation (continued)
5. For each pair obtained from con_entries_cnt: if there is no room for documents with this concept to fit in the current load, then create an entry in the load table and initialize the next load entry; otherwise update information for the current load table entry.

19 Building Load Table
- Terminology
  - LL = length of current load
  - S = spread of concept numbers in the current load
  - 8 bytes = space needed for each concept/weight pair
  - 4 bytes = space needed for each concept to store count of postings for it
- Constraints
  - 8*LL + 4*S < M
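A greedy sketch of load-table construction under the constraint above. The function name, the tuple layout of a load entry, and the 40-byte memory size are illustrative assumptions; the sketch also assumes no single concept is too large to fit in memory on its own.

```python
def build_load_table(con_entries_cnt, memory_bytes):
    """Group consecutive concepts into loads with 8*LL + 4*S < M.

    con_entries_cnt[c] is the number of postings for concept c (slot 0 unused).
    Returns a list of (first_con, last_con, LL) entries.
    Assumes every single concept fits in memory by itself.
    """
    loads = []
    first, ll = 1, 0
    for con in range(1, len(con_entries_cnt)):
        cnt = con_entries_cnt[con]
        spread = con - first + 1  # S if this concept joins the current load
        if ll > 0 and 8 * (ll + cnt) + 4 * spread >= memory_bytes:
            # No room: close the current load and start the next one here.
            loads.append((first, con - 1, ll))
            first, ll = con, 0
        ll += cnt
    if len(con_entries_cnt) > 1:
        loads.append((first, len(con_entries_cnt) - 1, ll))
    return loads

# Hypothetical counts for concepts 1..4 and a tiny 40-byte memory.
load_table = build_load_table([0, 2, 1, 3, 2], memory_bytes=40)
```

With these numbers, concepts 1 and 2 share a load (8*3 + 4*2 = 32 < 40), while concepts 3 and 4 each get their own.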

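20 Figure annotations:
- the range of concepts for each primary load
- Read in each (Doc, Con) pair and look up Con in the Load table to determine which Load the pair belongs to.
- Invert each Load File in turn.
- The Offset in the CONPTR table shows the position where each entry should be written.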
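The steps annotated in the figure can be approximated in memory as below. This is a sketch, not the published FAST-INV code: Python lists stand in for the load files, `fast_inv` and its argument layout are hypothetical, and the load ranges are assumed to be contiguous and in ascending order. Each pair is placed directly at its offset, so no sort is performed.

```python
from bisect import bisect_right

def fast_inv(doc_vectors, loads):
    """Invert (doc#, con#) pairs into (con#, doc#) order without sorting.

    loads: ascending, contiguous (first_con, last_con) concept ranges.
    """
    starts = [first for first, _last in loads]
    load_files = [[] for _ in loads]
    # Route each pair to the load whose concept range contains its concept.
    for doc, con in doc_vectors:
        load_files[bisect_right(starts, con) - 1].append((doc, con))

    inverted = []
    for (first, last), pairs in zip(loads, load_files):
        # Count postings per concept in this load.
        counts = [0] * (last - first + 1)
        for _doc, con in pairs:
            counts[con - first] += 1
        # CONPTR: starting offset of each concept's postings in the output.
        conptr, total = [], 0
        for c in counts:
            conptr.append(total)
            total += c
        # Drop each pair into the slot its offset points at, then advance it.
        out = [None] * total
        for doc, con in pairs:
            out[conptr[con - first]] = (con, doc)
            conptr[con - first] += 1
        inverted.extend(out)
    return inverted

# Hypothetical inputs: eight (doc#, con#) pairs and three concept ranges.
pairs = [(1, 1), (1, 3), (2, 3), (2, 4), (3, 2), (3, 4), (4, 1), (5, 3)]
inverted = fast_inv(pairs, loads=[(1, 2), (3, 3), (4, 4)])
```

Because the input is already ordered by document number, the documents within each concept come out in ascending order as well.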

