1 File Structures Information Retrieval: Data Structures and Algorithms by W.B. Frakes and R. Baeza-Yates (Eds.) Englewood Cliffs, NJ: Prentice Hall, 1992.

Presentation transcript:

1 File Structures Information Retrieval: Data Structures and Algorithms by W.B. Frakes and R. Baeza-Yates (Eds.) Englewood Cliffs, NJ: Prentice Hall, 1992. (Chapters 3-5)

2 File Structures for IR
• lexicographical indices
  » indices that are sorted
  » e.g. inverted files
  » e.g. Patricia (PAT) trees
• cluster file structures
• indices based on hashing
  » signature files

3 Inverted Files Information Retrieval: Data Structures and Algorithms by W.B. Frakes and R. Baeza-Yates (Eds.) Englewood Cliffs, NJ: Prentice Hall, 1992. (Chapter 3)

4 Inverted Files
• Each document is assigned a list of keywords or attributes.
• Each keyword (attribute) is associated with operational relevance weights.
• An inverted file is the sorted list of keywords (attributes), with each keyword having links to the documents containing that keyword.
• Penalty
  » the size of inverted files ranges from 10% to 100% or more of the size of the text itself
  » need to update the index as the data set changes

5 Indexing Restrictions
• A controlled vocabulary, which is the collection of keywords that will be indexed. Words in the text that are not in the vocabulary will not be indexed.
• A list of stopwords that, for reasons of volume, will not be included in the index.
• A set of rules that decide the beginning of a word or a piece of text that is indexable.
• A list of character sequences to be indexed (or not indexed).

Sorted array implementation of an inverted file

7 Structures used in Inverted Files
• Sorted Arrays
  » store the list of keywords in a sorted array
  » search using a standard binary search
  » advantage: easy to implement
  » disadvantage: updating the index is expensive
• Hashing Structures
• Tries (digital search trees)
• Combinations of these structures
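As a rough illustration of the sorted-array option, the sketch below keeps the keywords in a sorted Python list and searches it with the standard `bisect` module. The names `keywords`, `postings`, `lookup`, and `add_posting` are hypothetical, chosen for this example only.

```python
from bisect import bisect_left

# Parallel arrays: keywords[i] is a term, postings[i] its sorted document numbers.
keywords = ["algorithm", "index", "retrieval", "text"]
postings = [[2, 5], [1, 2, 3], [1, 4], [3]]

def lookup(term):
    """Binary search over the sorted keyword array."""
    i = bisect_left(keywords, term)
    if i < len(keywords) and keywords[i] == term:
        return postings[i]
    return []

def add_posting(term, doc_id):
    """Updating is the costly part: a new term shifts every later entry."""
    i = bisect_left(keywords, term)
    if i < len(keywords) and keywords[i] == term:
        if doc_id not in postings[i]:
            postings[i].append(doc_id)
            postings[i].sort()
    else:
        keywords.insert(i, term)      # shifts all entries after position i
        postings.insert(i, [doc_id])  # same shift in the parallel array

add_posting("hashing", 7)
```

Lookup needs only a logarithmic number of comparisons, while inserting a new keyword moves, on average, half of the array, which is the update cost the slide refers to.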

8 Sorted Arrays
1. The input text is parsed into a list of words along with their location in the text (a time- and storage-consuming operation).
2. This list is inverted from a list of terms in location order to a list of terms in alphabetical order.
3. Add term weights, or reorganize or compress the files.
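The three steps above can be sketched on a toy collection as follows. This is a minimal illustration, not the book's implementation: the tokenizer (lowercase alphanumeric runs), the function names, and the choice of within-document frequency as the "weight" in step 3 are all assumptions made for the example.

```python
import re

def parse(docs):
    """Step 1: list of (term, doc_id, position) in location order."""
    pairs = []
    for doc_id, text in docs.items():
        for pos, word in enumerate(re.findall(r"[a-z0-9]+", text.lower())):
            pairs.append((word, doc_id, pos))
    return pairs

def invert(pairs):
    """Step 2: sort into alphabetical order and group locations by term."""
    index = {}
    for term, doc_id, pos in sorted(pairs):
        index.setdefault(term, []).append((doc_id, pos))
    return index

def add_frequencies(index):
    """Step 3 (one possible weighting): collapse positions into (doc_id, frequency)."""
    weighted = {}
    for term, locations in index.items():
        counts = {}
        for doc_id, _ in locations:
            counts[doc_id] = counts.get(doc_id, 0) + 1
        weighted[term] = sorted(counts.items())
    return weighted

docs = {1: "the cat sat on the mat", 2: "the dog sat"}
index = invert(parse(docs))
weighted = add_frequencies(index)
```

The explicit `sorted` call in step 2 is the expensive operation that the later slides try to avoid for large collections.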

Inversion of Word List

10 Dictionary and postings file
Idea: the file to be searched should be as short as possible, so split a single file into two pieces: a dictionary and a postings file.
e.g. data set: 38,304 records, 250,000 unique terms
Postings entries: (document #, frequency)
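One way to picture the split is sketched below. It assumes each dictionary entry stores a postings count and a start offset into a flat postings array of (document #, frequency) pairs; the exact record layout is not given on the slide, so the field choice and the function names here are hypothetical.

```python
from bisect import bisect_left

def split_index(freq_index):
    """Split {term: [(doc, freq), ...]} into a short dictionary and a postings array."""
    terms, entries, postings = [], [], []
    for term in sorted(freq_index):
        plist = freq_index[term]
        terms.append(term)
        entries.append((len(plist), len(postings)))  # (count, start offset)
        postings.extend(plist)
    return terms, entries, postings

def postings_for(term, terms, entries, postings):
    """Search only the short dictionary, then read one slice of the postings."""
    i = bisect_left(terms, term)
    if i == len(terms) or terms[i] != term:
        return []
    count, start = entries[i]
    return postings[start:start + count]

freq_index = {"the": [(1, 2), (2, 1)], "cat": [(1, 1)], "sat": [(1, 1), (2, 1)]}
terms, entries, postings = split_index(freq_index)
```

Only `terms` and `entries` take part in the search, so the searched structure stays small however long the postings grow.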

Producing an Inverted File for Large Data Sets without Sorting
Idea: avoid the use of an explicit sort by using a right-threaded binary tree.
• each tree node holds the current number of term postings and the storage location of the postings list
• to produce the inverted file, traverse the binary tree and the linked postings lists
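A minimal sketch of the idea follows, assuming a plain unbalanced binary search tree with right threads. A Python list stands in for the linked postings list, and `Node`, `insert`, and `inorder` are hypothetical names. The thread lets the final traversal visit terms in alphabetical order with no stack and no sort.

```python
class Node:
    def __init__(self, term, doc_id):
        self.term = term
        self.postings = [doc_id]  # stand-in for the linked postings list
        self.left = None
        self.right = None         # real right child, or thread to the in-order successor
        self.is_thread = False    # True when `right` is a thread

def insert(root, term, doc_id):
    """Add one (term, doc_id) occurrence; returns the (possibly new) root."""
    if root is None:
        return Node(term, doc_id)
    cur = root
    while True:
        if term == cur.term:
            cur.postings.append(doc_id)
            return root
        if term < cur.term:
            if cur.left is None:
                node = Node(term, doc_id)
                node.right, node.is_thread = cur, True  # parent is the successor
                cur.left = node
                return root
            cur = cur.left
        else:
            if cur.right is None or cur.is_thread:
                node = Node(term, doc_id)
                node.right = cur.right                  # inherit parent's successor
                node.is_thread = cur.right is not None
                cur.right, cur.is_thread = node, False
                return root
            cur = cur.right

def inorder(root):
    """Yield (term, postings) alphabetically by following threads."""
    cur = root
    while cur is not None and cur.left is not None:
        cur = cur.left
    while cur is not None:
        yield cur.term, cur.postings
        follow_thread = cur.is_thread
        cur = cur.right
        if not follow_thread:
            while cur is not None and cur.left is not None:
                cur = cur.left

root = None
for term, doc in [("sat", 1), ("cat", 1), ("the", 1), ("dog", 2), ("sat", 2), ("cat", 3)]:
    root = insert(root, term, doc)
result = list(inorder(root))
```

The number of postings for a term is simply `len(node.postings)` in this sketch; a disk-based version would keep that count and the list location in the node, as the slide describes.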

12 A Fast Inversion Algorithm
• Principle 1: large primary memories are available. If databases can be split into memory loads that can be rapidly processed and then combined, the overall cost will be minimized.
• Principle 2: exploit the inherent order of the input data. It is very expensive to use polynomial or even n log n sorting algorithms for large files.

FAST-INV algorithm (figure: concept postings/pointers; see p. 13)

Sample document vector (figure: columns are document number and concept number, with one concept number for each unique word). Similar to the document-word list shown on p. 7. The concept numbers are sorted within document numbers, and document numbers are sorted within the collection.

15 Preparation
• Terminology
  » HCN = highest concept number in dictionary, or the number of words to be indexed
  » L = number of document/concept pairs in the collection
  » M = available primary memory size
• Assumption
  » M >> HCN
  » M < L

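The range of concepts for each primary load. Read in each (Doc, Con) pair and look up Con in the load table to determine which load the pair belongs to. Then invert each load file in turn. The offset in the CONPTR table shows where each entry should be written.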

Preparation
1. Allocate an array, con_entries_cnt, of size HCN.
2. For each entry in the document vector file: increment con_entries_cnt[con#].

offset   (con#, doc#) entries
0        (1,1), (1,4)
2        (2,3)
3        (3,1), (3,2), (3,5)
6        (4,2), (4,3)
8        …
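The counting pass can be sketched as below. The sample `doc_vectors` list is reconstructed from the (con#, doc#) pairs in the figure and is limited to the four concepts shown there. Deriving start offsets as running totals of the counts is an assumption about how the numbers 0, 2, 3, 6, 8 in the figure arise, and the function names are hypothetical.

```python
def count_entries(doc_vectors, hcn):
    """Steps 1-2: one counter per concept number (index 0 unused)."""
    con_entries_cnt = [0] * (hcn + 1)
    for _doc, con in doc_vectors:
        con_entries_cnt[con] += 1
    return con_entries_cnt

def start_offsets(con_entries_cnt):
    """Assumed follow-up: running totals give each concept's first slot."""
    offsets = [0] * len(con_entries_cnt)
    running = 0
    for con in range(1, len(con_entries_cnt)):
        offsets[con] = running
        running += con_entries_cnt[con]
    return offsets, running

# (doc#, con#) pairs: sorted by document, and by concept within each document.
doc_vectors = [(1, 1), (1, 3), (2, 3), (2, 4), (3, 2), (3, 4), (4, 1), (5, 3)]
con_entries_cnt = count_entries(doc_vectors, hcn=4)
offsets, total = start_offsets(con_entries_cnt)
```

A single sequential read of the document vector file is enough for this pass, and the counter array fits in memory because M >> HCN.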

Preparation (continued)
5. For each pair obtained from con_entries_cnt: if there is no room for documents with this concept to fit in the current load, then create an entry in the load table and initialize the next load entry; otherwise update information for the current load table entry.

19 Building Load Table
• Terminology
  » LL = length of current load
  » S = spread of concept numbers in the current load
  » 8 bytes = space needed for each concept/weight pair
  » 4 bytes = space needed for each concept to store count of postings for it
• Constraints
  » 8*LL + 4*S < M
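A sketch of how the constraint might drive load-table construction is given below. It is one possible reading of step 5, not the published procedure: the tuple layout `(first_con, last_con, LL)`, the function name, and the error raised for a concept too large for any load are assumptions.

```python
def build_load_table(con_entries_cnt, memory_bytes):
    """Greedily pack consecutive concepts into loads with 8*LL + 4*S < M."""
    loads = []        # each entry: (first_con, last_con, LL)
    first, ll = 1, 0  # current load starts at concept 1 and is empty
    for con in range(1, len(con_entries_cnt)):
        cnt = con_entries_cnt[con]
        if 8 * cnt + 4 >= memory_bytes:
            raise ValueError(f"concept {con} alone does not fit in memory")
        spread = con - first + 1
        if 8 * (ll + cnt) + 4 * spread >= memory_bytes:
            # No room: close the current load and start the next one here.
            loads.append((first, con - 1, ll))
            first, ll = con, 0
        ll += cnt     # update the current load entry
    if len(con_entries_cnt) > 1:
        loads.append((first, len(con_entries_cnt) - 1, ll))
    return loads

con_entries_cnt = [0, 2, 1, 3, 2]  # index 0 unused; counts for concepts 1..4
small = build_load_table(con_entries_cnt, memory_bytes=48)
large = build_load_table(con_entries_cnt, memory_bytes=1000)
```

With M = 48 bytes the four concepts are spread over three loads; with M = 1000 bytes they all fit in one.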

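The range of concepts for each primary load. Read in each (Doc, Con) pair and look up Con in the load table to determine which load the pair belongs to. Then invert each load file in turn. The offset in the CONPTR table shows where each entry should be written.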
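The two passes described for this figure can be sketched in memory as follows. In-memory lists stand in for the load files, a per-load dictionary stands in for the CONPTR table, and the sample data and load table are hypothetical values chosen to match the small example used for the counting step.

```python
def fast_inv_sketch(doc_vectors, con_entries_cnt, loads):
    # Pass 1: route each (doc, con) pair to the load whose concept range holds con.
    load_files = [[] for _ in loads]
    for doc, con in doc_vectors:
        for i, (first, last, _ll) in enumerate(loads):
            if first <= con <= last:
                load_files[i].append((doc, con))
                break

    # Pass 2: invert each load in turn, writing every pair straight to its slot.
    inverted = []
    for (first, last, ll), pairs in zip(loads, load_files):
        conptr = {}   # next free offset for each concept within this load
        running = 0
        for con in range(first, last + 1):
            conptr[con] = running
            running += con_entries_cnt[con]
        slots = [None] * ll
        for doc, con in pairs:
            slots[conptr[con]] = (con, doc)
            conptr[con] += 1
        inverted.extend(slots)
    return inverted

doc_vectors = [(1, 1), (1, 3), (2, 3), (2, 4), (3, 2), (3, 4), (4, 1), (5, 3)]
con_entries_cnt = [0, 2, 1, 3, 2]             # index 0 unused
loads = [(1, 2, 3), (3, 3, 3), (4, 4, 2)]     # (first_con, last_con, LL)
inverted = fast_inv_sketch(doc_vectors, con_entries_cnt, loads)
```

Because the input is already ordered by document number, the documents under each concept come out in order without any comparison sort, which is the point of Principle 2.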