1 File Structures Information Retrieval: Data Structures and Algorithms by W.B. Frakes and R. Baeza-Yates (Eds.) Englewood Cliffs, NJ: Prentice Hall, 1992.

1 File Structures Information Retrieval: Data Structures and Algorithms by W.B. Frakes and R. Baeza-Yates (Eds.) Englewood Cliffs, NJ: Prentice Hall, 1992. (Chapters 3-5)

2 File Structures for IR
- lexicographical indices
  - indices that are sorted
  - e.g. inverted files
  - e.g. Patricia (PAT) trees
- cluster file structures
- indices based on hashing
  - signature files

3 Inverted Files Information Retrieval: Data Structures and Algorithms by W.B. Frakes and R. Baeza-Yates (Eds.) Englewood Cliffs, NJ: Prentice Hall, 1992. (Chapter 3)

4 Inverted Files
- Each document is assigned a list of keywords or attributes.
- Each keyword (attribute) is associated with operational relevance weights.
- An inverted file is the sorted list of keywords (attributes), with each keyword having links to the documents containing that keyword.
- Penalty
  - the size of inverted files ranges from 10% to 100% or more of the size of the text itself
  - the index must be updated as the data set changes
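
As a minimal sketch of the definition above, the following Python builds an inverted file from a few documents. The document ids and keywords are made up for illustration, and relevance weights are left out.

```python
from collections import defaultdict

def build_inverted_file(docs):
    """Map each keyword to the sorted list of document ids that contain it."""
    postings = defaultdict(set)
    for doc_id, keywords in docs.items():
        for kw in keywords:
            postings[kw].add(doc_id)
    # Emitting the keywords in sorted order gives the "sorted list of keywords".
    return {kw: sorted(postings[kw]) for kw in sorted(postings)}

# Hypothetical sample data: document id -> keywords assigned to it.
docs = {
    1: ["index", "file"],
    2: ["file", "hash"],
    3: ["index", "tree"],
}
inverted = build_inverted_file(docs)
for keyword, doc_ids in inverted.items():
    print(keyword, doc_ids)
```

Every change to `docs` means rebuilding or patching `inverted`, which is the update penalty listed on the slide.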

5 Indexing Restrictions
- A controlled vocabulary, which is the collection of keywords that will be indexed. Words in the text that are not in the vocabulary will not be indexed.
- A list of stopwords that, for reasons of volume, will not be included in the index.
- A set of rules that decide the beginning of a word or a piece of text that is indexable.
- A list of character sequences to be indexed (or not indexed).
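
The restrictions above could be combined in a tokenizer along these lines. This is only a sketch: the stoplist, the controlled vocabulary, and the word rule (a maximal run of letters) are all hypothetical choices, not taken from the slides.

```python
import re

STOPWORDS = {"the", "of", "a", "an", "is", "in"}         # hypothetical stoplist
VOCABULARY = {"inverted", "file", "index", "keyword"}    # hypothetical controlled vocabulary
WORD_RE = re.compile(r"[a-z]+")  # assumed rule: a word is a maximal run of letters

def indexable_terms(text):
    """Return (term, character offset) for every word that passes all restrictions."""
    terms = []
    for match in WORD_RE.finditer(text.lower()):
        word = match.group()
        if word in STOPWORDS:
            continue  # too frequent to be worth indexing
        if word not in VOCABULARY:
            continue  # not part of the controlled vocabulary
        terms.append((word, match.start()))
    return terms

print(indexable_terms("The index of an inverted file"))
```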

Sorted array implementation of an inverted file

7 Structures used in Inverted Files
- Sorted Arrays
  - store the list of keywords in a sorted array
  - search it with a standard binary search
  - advantage: easy to implement
  - disadvantage: updating the index is expensive
- Hashing Structures
- Tries (digital search trees)
- Combinations of these structures
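
A small sketch of the sorted-array option, using Python's standard `bisect` module for the binary search. The keywords and postings are invented sample data, and the two parallel lists are just one possible layout.

```python
from bisect import bisect_left

# Parallel arrays: keywords[i] owns postings[i]. Sample data, not from the slides.
keywords = ["file", "hash", "index", "tree"]
postings = [[1, 2], [2], [1, 3], [3]]

def lookup(term):
    """Binary search for term; return its postings, or an empty list."""
    i = bisect_left(keywords, term)
    if i < len(keywords) and keywords[i] == term:
        return postings[i]
    return []

def add_posting(term, doc_id):
    """Add doc_id under term, inserting the term in sorted position if it is new."""
    i = bisect_left(keywords, term)
    if i < len(keywords) and keywords[i] == term:
        if doc_id not in postings[i]:
            postings[i].append(doc_id)
    else:
        # Inserting into the middle shifts every later element: O(n) per new term.
        keywords.insert(i, term)
        postings.insert(i, [doc_id])
```

Lookups cost O(log n) comparisons, while each new keyword may shift the rest of the array, which is why updating is listed as the disadvantage.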

8 Sorted Arrays
1. The input text is parsed into a list of words along with their locations in the text. (This is a time- and storage-consuming operation.)
2. This list is inverted from a list of terms in location order to a list of terms in alphabetical order.
3. Term weights are added, or the files are reorganized or compressed.
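
The three steps can be sketched as follows. Locations are taken to be 1-based word positions and step 3 is reduced to a simple frequency count; both are assumptions made for this example.

```python
from itertools import groupby

def parse(text):
    """Step 1: list of (word, location) pairs in location order."""
    return [(word, loc) for loc, word in enumerate(text.lower().split(), start=1)]

def invert(word_list):
    """Step 2: reorder the pairs from location order to alphabetical order."""
    return sorted(word_list)

def add_frequencies(inverted):
    """Step 3 (one option): collapse duplicates into (term, frequency, locations)."""
    result = []
    for term, group in groupby(inverted, key=lambda pair: pair[0]):
        locs = [loc for _, loc in group]
        result.append((term, len(locs), locs))
    return result

words = parse("to be or not to be")
print(add_frequencies(invert(words)))
```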

Inversion of Word List

10 Dictionary and postings file
- Idea: the file to be searched should be as short as possible, so split a single file into two pieces.
- e.g. data set: 38,304 records, 250,000 unique terms
- postings: (document #, frequency)
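
One way to picture the split is the sketch below. The dictionary entry layout (term, number of postings, offset into the postings file) is an assumption for illustration, and the sample postings are invented.

```python
from bisect import bisect_left

def split_index(inverted):
    """Split term -> [(doc#, freq), ...] into a short dictionary and a flat postings file."""
    dictionary, postings = [], []
    for term in sorted(inverted):
        plist = inverted[term]
        # Assumed entry layout: (term, number of postings, offset into postings).
        dictionary.append((term, len(plist), len(postings)))
        postings.extend(plist)
    return dictionary, postings

def search(dictionary, postings, term):
    """Binary search the short dictionary, then read one slice of the postings file."""
    i = bisect_left(dictionary, (term,))  # (term,) sorts just before (term, count, offset)
    if i < len(dictionary) and dictionary[i][0] == term:
        _, count, offset = dictionary[i]
        return postings[offset:offset + count]
    return []

# Hypothetical sample data.
inverted = {
    "index": [(1, 1), (3, 4)],
    "file": [(1, 2), (2, 1)],
    "hash": [(2, 3)],
}
dictionary, postings = split_index(inverted)
print(search(dictionary, postings, "index"))
```

Only the dictionary is searched; the postings file is touched once, at the offset the dictionary supplies.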

Producing an Inverted File for Large Data Sets without Sorting
- Idea: avoid the use of an explicit sort by using a right-threaded binary tree.
- Each tree node holds the current number of term postings and the storage location of the postings list.
- To produce the inverted file, traverse the binary tree and the linked postings lists.
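
A minimal in-memory sketch of the idea: terms go into a right-threaded binary search tree as they are read, and an in-order walk that follows the threads yields them alphabetically with no sort step. A plain Python list stands in for the linked postings list on disk, and the class and function names are hypothetical.

```python
class Node:
    def __init__(self, term):
        self.term = term
        self.postings = []            # stand-in for the linked postings list
        self.left = None
        self.right = None
        self.right_is_thread = False  # True: `right` points to the in-order successor

def insert(root, term, doc_id):
    """Record (term, doc_id) in the tree and return the root."""
    if root is None:
        root = Node(term)
        root.postings.append(doc_id)
        return root
    cur = root
    while True:
        if term == cur.term:
            cur.postings.append(doc_id)
            return root
        if term < cur.term:
            if cur.left is None:
                node = Node(term)
                node.right, node.right_is_thread = cur, True  # successor is the parent
                cur.left = node
                cur = node
            else:
                cur = cur.left
        else:
            if cur.right is None or cur.right_is_thread:
                node = Node(term)
                # The new right child inherits the parent's thread (if any).
                node.right, node.right_is_thread = cur.right, cur.right is not None
                cur.right, cur.right_is_thread = node, False
                cur = node
            else:
                cur = cur.right

def in_order(root):
    """Yield (term, postings) alphabetically, following threads instead of a stack."""
    cur = root
    while cur is not None and cur.left is not None:
        cur = cur.left
    while cur is not None:
        yield cur.term, cur.postings
        followed_thread = cur.right_is_thread
        cur = cur.right
        if not followed_thread:
            while cur is not None and cur.left is not None:
                cur = cur.left

root = None
for term, doc_id in [("tree", 1), ("file", 1), ("index", 2), ("file", 3), ("hash", 2)]:
    root = insert(root, term, doc_id)
print(list(in_order(root)))
```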

12 A Fast Inversion Algorithm
- Principle 1: large primary memories are available. If databases can be split into memory loads that can be rapidly processed and then combined, the overall cost will be minimized.
- Principle 2: exploit the inherent order of the input data. It is very expensive to use polynomial or even n log n sorting algorithms for large files.

FAST-INV algorithm (see p. 13): concept postings/pointers

Sample document vector
- Columns: document number, concept number (one concept number for each unique word).
- Similar to the document-word list shown on p. 7.
- The concept numbers are sorted within document numbers, and document numbers are sorted within the collection.

15 Preparation
- Terminology
  - HCN = highest concept number in the dictionary, or the number of words to be indexed
  - L = number of document/concept pairs in the collection
  - M = available primary memory size
- Assumption
  - M >> HCN
  - M < L

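The range of concepts for each primary load.
1. Read in each (Doc, Con) pair.
2. Look up Con in the Load table to determine which load the pair belongs to.
3. Invert each Load File in turn. The Offset in the CONPTR table shows where each entry should be written.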
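
The flow above can be sketched in a few lines. This is an illustration under assumptions, not the published FAST-INV code: the load table is taken to be a list of hypothetical (first_con, last_con, length) entries, per-concept counts are assumed to be known already, and in-memory lists stand in for the load files.

```python
def fast_inv_sketch(doc_vectors, loads, con_entries_cnt):
    """doc_vectors: (doc, con) pairs in document order.
    loads: assumed (first_con, last_con, length) entries.
    con_entries_cnt[con]: number of postings for concept con (index 0 unused)."""
    # Pass 1: route every (doc, con) pair to the load whose concept range covers con.
    load_files = [[] for _ in loads]
    for doc, con in doc_vectors:
        for i, (first, last, _length) in enumerate(loads):
            if first <= con <= last:
                load_files[i].append((doc, con))
                break

    # Pass 2: invert each load in memory, placing pairs at precomputed offsets.
    inverted = []
    for (first, last, length), pairs in zip(loads, load_files):
        offset, running = {}, 0
        for con in range(first, last + 1):
            offset[con] = running            # where this concept's postings start
            running += con_entries_cnt[con]
        slots = [None] * length
        for doc, con in pairs:
            slots[offset[con]] = (con, doc)  # direct placement, no sorting
            offset[con] += 1
        inverted.extend(slots)
    return inverted

# Hypothetical input, sorted by document and then by concept.
doc_vectors = [(1, 1), (1, 3), (2, 3), (2, 4), (3, 2), (3, 4), (4, 1), (5, 3)]
con_entries_cnt = [0, 2, 1, 3, 2]
loads = [(1, 2, 3), (3, 3, 3), (4, 4, 2)]
print(fast_inv_sketch(doc_vectors, loads, con_entries_cnt))
```

Because the input arrives in document order, the postings of each concept come out already sorted by document number.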

Preparation
1. Allocate an array, con_entries_cnt, of size HCN.
2. For each entry in the document vector file: increment con_entries_cnt[con#].

Postings so far   (con#, doc#) pairs
0                 (1,1), (1,4)
2                 (2,3)
3                 (3,1), (3,2), (3,5)
6                 (4,2), (4,3)
8                 …
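
A sketch of the counting pass in steps 1 and 2. The second function, which turns the counts into running offsets, is an assumption read off the sample numbers on the slide (0, 2, 3, 6, 8); the document vector is hypothetical data chosen to reproduce them.

```python
def count_postings(doc_vectors, hcn):
    """Steps 1-2: one counter per concept; index 0 is unused (concepts start at 1)."""
    con_entries_cnt = [0] * (hcn + 1)
    for _doc, con in doc_vectors:
        con_entries_cnt[con] += 1
    return con_entries_cnt

def running_offsets(con_entries_cnt):
    """Assumed follow-up: start position of each concept's postings, plus the total."""
    offsets = [0] * len(con_entries_cnt)
    running = 0
    for con in range(1, len(con_entries_cnt)):
        offsets[con] = running
        running += con_entries_cnt[con]
    return offsets, running

# Hypothetical (doc#, con#) pairs that yield the (con#, doc#) groups listed above.
doc_vectors = [(1, 1), (1, 3), (2, 3), (2, 4), (3, 2), (3, 4), (4, 1), (5, 3)]
counts = count_postings(doc_vectors, hcn=4)
offsets, total = running_offsets(counts)
print(counts[1:], offsets[1:], total)
```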

Preparation (continued)
5. For each pair obtained from con_entries_cnt: if there is no room for the documents with this concept in the current load, create an entry in the load table and initialize the next load entry; otherwise, update the information for the current load table entry.

19 Building Load Table
- Terminology
  - LL = length of the current load
  - S = spread of concept numbers in the current load
  - 8 bytes = space needed for each concept/weight pair
  - 4 bytes = space needed for each concept to store the count of postings for it
- Constraints
  - 8*LL + 4*S < M
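
A greedy sketch of building the load table under the constraint above. The entry layout (first_con, last_con, LL) and the function name are assumptions, and M is given in bytes.

```python
def build_load_table(con_entries_cnt, m):
    """Group consecutive concepts into loads that satisfy 8*LL + 4*S < M.

    con_entries_cnt[con] = number of postings for concept con (index 0 unused).
    Returns assumed entries of the form (first_con, last_con, LL)."""
    loads = []
    start, ll = 1, 0
    for con in range(1, len(con_entries_cnt)):
        cnt = con_entries_cnt[con]
        if 8 * cnt + 4 >= m:
            raise ValueError(f"concept {con} does not fit in memory on its own")
        spread = con - start + 1  # S if this concept joins the current load
        if 8 * (ll + cnt) + 4 * spread >= m:
            # No room: close the current load and start a new one at this concept.
            loads.append((start, con - 1, ll))
            start, ll = con, 0
        ll += cnt
    if len(con_entries_cnt) > 1:
        loads.append((start, len(con_entries_cnt) - 1, ll))
    return loads

print(build_load_table([0, 2, 1, 3, 2], m=40))
```

With the sample counts and M = 40 bytes, concepts 1 and 2 share a load (8*3 + 4*2 = 32 < 40), while adding concept 3 would need 8*6 + 4*3 = 60 bytes, so it opens a new load.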
