1 File Structures Information Retrieval: Data Structures and Algorithms by W.B. Frakes and R. Baeza-Yates (Eds.) Englewood Cliffs, NJ: Prentice Hall, 1992.

Presentation transcript:

1 File Structures. Information Retrieval: Data Structures and Algorithms by W.B. Frakes and R. Baeza-Yates (Eds.) Englewood Cliffs, NJ: Prentice Hall, 1992 (Chapters 3-5)

2 File Structures for IR
- lexicographical indices
  - indices that are sorted
  - e.g. inverted files
  - e.g. Patricia (PAT) trees
- cluster file structures
- indices based on hashing
  - signature files

3 Inverted Files. Information Retrieval: Data Structures and Algorithms by W.B. Frakes and R. Baeza-Yates (Eds.) Englewood Cliffs, NJ: Prentice Hall, 1992 (Chapter 3)

4 Inverted Files
- Each document is assigned a list of keywords or attributes.
- Each keyword (attribute) is associated with operational relevance weights.
- An inverted file is the sorted list of keywords (attributes), with each keyword having links to the documents containing that keyword.
- Penalty
  - the size of an inverted file ranges from 10% to 100% or more of the size of the text itself
  - the index must be updated as the data set changes

5 Indexing Restrictions
- A controlled vocabulary, which is the collection of keywords that will be indexed. Words in the text that are not in the vocabulary will not be indexed.
- A list of stopwords that, for reasons of volume, will not be included in the index.
- A set of rules that decide the beginning of a word or a piece of text that is indexable.
- A list of character sequences to be indexed (or not indexed).
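The restrictions above can be sketched as a small filter applied before indexing. This is only an illustration: the stopword list, the controlled vocabulary, the word rule (a run of letters), and the function name indexable_terms are all assumptions made for the example, not taken from the slides.

```python
import re

# Hypothetical example data: neither list comes from the slides.
STOPWORDS = {"the", "of", "a", "and", "is"}
CONTROLLED_VOCABULARY = {"inverted", "file", "index", "hashing", "trie"}

# Assumed rule for where an indexable word begins and ends: a run of letters.
WORD_RULE = re.compile(r"[a-z]+")


def indexable_terms(text, vocabulary=None, stopwords=STOPWORDS):
    """Return the words of `text` that pass the indexing restrictions."""
    terms = []
    for word in WORD_RULE.findall(text.lower()):
        if word in stopwords:
            continue  # excluded for reasons of volume
        if vocabulary is not None and word not in vocabulary:
            continue  # not in the controlled vocabulary
        terms.append(word)
    return terms
```

Passing vocabulary=None switches off the controlled vocabulary, so only the stopword list and the word rule apply.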

Sorted array implementation of an inverted file

7 Structures used in Inverted Files
- Sorted Arrays
  - store the list of keywords in a sorted array
  - search using a standard binary search
  - advantage: easy to implement
  - disadvantage: updating the index is expensive
- Hashing Structures
- Tries (digital search trees)
- Combinations of these structures

8 Sorted Arrays
1. The input text is parsed into a list of words along with their locations in the text (a time- and storage-consuming operation).
2. This list is inverted from a list of terms in location order to a list of terms in alphabetical order.
3. Term weights are added, or the files are reorganized or compressed.
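A minimal sketch of steps 1 and 2, followed by a binary-search lookup on the resulting sorted array. The tokenization rule and the names parse, invert, and lookup are assumptions made for this example. Step 3 (weights, reorganization, compression) is left out.

```python
import bisect
import re


def parse(documents):
    """Step 1: list (word, doc_id, position) entries in location order."""
    entries = []
    for doc_id, text in enumerate(documents, start=1):
        words = re.findall(r"[a-z]+", text.lower())
        for position, word in enumerate(words):
            entries.append((word, doc_id, position))
    return entries


def invert(entries):
    """Step 2: reorder the list alphabetically (ties broken by location)."""
    return sorted(entries)


def lookup(inverted, word):
    """Binary search the sorted array for every location of `word`."""
    # (word,) sorts just before any (word, doc_id, position) entry.
    i = bisect.bisect_left(inverted, (word,))
    hits = []
    while i < len(inverted) and inverted[i][0] == word:
        hits.append(inverted[i][1:])
        i += 1
    return hits
```

The explicit sort in invert is what makes this approach easy to implement, and it is also why updating is expensive: each new document means re-sorting or merging into the array.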

Inversion of Word List

10 Dictionary and postings file
Idea: the file to be searched should be as short as possible, so split a single file into two pieces.
e.g. data set: 38,304 records, 250,000 unique terms
(document #, frequency)
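One way to picture the split is sketched below. The layout is an assumption made for this example: each dictionary entry holds a term, its number of postings, and an offset into the postings list, and each posting is a (document #, frequency) pair. The function names are hypothetical.

```python
import bisect
from collections import Counter


def build_dictionary_and_postings(documents):
    """Split the index into a short dictionary and a separate postings list."""
    term_docs = {}  # term -> {doc_id: frequency}
    for doc_id, words in enumerate(documents, start=1):
        for term, freq in Counter(words).items():
            term_docs.setdefault(term, {})[doc_id] = freq

    dictionary = []  # sorted (term, postings_count, offset_into_postings)
    postings = []    # flat list of (document #, frequency)
    for term in sorted(term_docs):
        docs = sorted(term_docs[term].items())
        dictionary.append((term, len(docs), len(postings)))
        postings.extend(docs)
    return dictionary, postings


def postings_for(dictionary, postings, term):
    """Search only the short dictionary, then read one slice of postings."""
    i = bisect.bisect_left(dictionary, (term,))
    if i < len(dictionary) and dictionary[i][0] == term:
        _, count, offset = dictionary[i]
        return postings[offset:offset + count]
    return []
```

Only the dictionary is searched. The postings list is read at a single known offset, so its length does not affect search time.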

Producing an Inverted File for Large Data Sets without Sorting
Idea: avoid the use of an explicit sort by using a right-threaded binary tree.
- each node holds the current number of term postings and the storage location of the postings list
- traverse the binary tree and the linked postings list
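The idea can be sketched with a small right-threaded binary search tree. This is a simplified illustration, not the implementation from the book: a Python list stands in for the linked postings list, postings are document numbers only, and all names are hypothetical.

```python
class Node:
    def __init__(self, term):
        self.term = term
        self.left = None
        self.right = None             # right child, or thread to the successor
        self.right_is_thread = False
        self.postings = []            # stand-in for the linked postings list


def find_or_insert(root, term):
    """Return (root, node holding `term`), inserting a node if needed."""
    if root is None:
        node = Node(term)
        return node, node
    cur = root
    while True:
        if term == cur.term:
            return root, cur
        if term < cur.term:
            if cur.left is None:
                new = Node(term)
                new.right, new.right_is_thread = cur, True  # successor is cur
                cur.left = new
                return root, new
            cur = cur.left
        else:
            if cur.right is None or cur.right_is_thread:
                new = Node(term)
                # The new node inherits cur's thread (or None at the far right).
                new.right, new.right_is_thread = cur.right, cur.right_is_thread
                cur.right, cur.right_is_thread = new, False
                return root, new
            cur = cur.right


def _leftmost(node):
    while node is not None and node.left is not None:
        node = node.left
    return node


def in_order(root):
    """Visit nodes alphabetically by following threads, with no stack."""
    node = _leftmost(root)
    while node is not None:
        yield node
        node = node.right if node.right_is_thread else _leftmost(node.right)


def build_inverted_file(doc_terms):
    """doc_terms: iterable of (doc_id, list of terms), in document order."""
    root = None
    for doc_id, terms in doc_terms:
        for term in terms:
            root, node = find_or_insert(root, term)
            if not node.postings or node.postings[-1] != doc_id:
                node.postings.append(doc_id)
    return [(n.term, len(n.postings), n.postings) for n in in_order(root)]
```

The threads let the final traversal emit terms in alphabetical order without recursion or an explicit sort.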

12 A Fast Inversion Algorithm
- Principle 1: large primary memories are available. If databases can be split into memory loads that can be rapidly processed and then combined, the overall cost will be minimized.
- Principle 2: use the inherent order of the input data. It is very expensive to use polynomial or even n log n sorting algorithms for large files.

FAST-INV algorithm See p. 13. concept postings/ pointers

Sample document vector: each entry pairs a document number with a concept number (one concept number for each unique word). Similar to the document-word list shown on p. 7. The concept numbers are sorted within document numbers, and document numbers are sorted within the collection.

15 Preparation
- Terminology
  - HCN = highest concept number in dictionary, or the number of words to be indexed
  - L = number of document/concept pairs in the collection
  - M = available primary memory size
- Assumption
  - M >> HCN
  - M < L

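: the range of concepts for each primary load
Read in each (Doc, Con) pair and look up Con in the load table to determine which load the pair belongs to. Then invert each load file in turn. The offset in the CONPTR table shows the position where each entry should be written.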

Preparation
1. Allocate an array, con_entries_cnt, of size HCN.
2. For each entry in the document vector file: increment con_entries_cnt[con#]

(con#, doc#) pairs       running count
(start)                  0
(1,1), (1,4)             2
(2,3)                    3
(3,1), (3,2), (3,5)      6
(4,2), (4,3)             8
…
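The counting step and the running counts in the figure can be reproduced with a short sketch. The sample (doc#, con#) pairs below are reconstructed from the (con#, doc#) pairs in the figure. The function names are hypothetical, and the whole collection is treated as a single in-memory load, which is a simplification.

```python
from itertools import accumulate

# (doc#, con#) pairs sorted by document, then by concept (assumed sample data).
DOC_VECTORS = [(1, 1), (1, 3), (2, 3), (2, 4), (3, 2), (3, 4), (4, 1), (5, 3)]


def count_concept_entries(doc_vectors, hcn):
    """Steps 1 and 2: one counter per concept (index 0 is unused)."""
    con_entries_cnt = [0] * (hcn + 1)
    for _doc, con in doc_vectors:
        con_entries_cnt[con] += 1
    return con_entries_cnt


def invert_single_load(doc_vectors, hcn):
    """Place each pair directly at its final position, with no sorting."""
    counts = count_concept_entries(doc_vectors, hcn)
    totals = list(accumulate(counts))   # running count after each concept
    next_slot = [0] + totals[:-1]       # next_slot[c] = first slot of concept c
    inverted = [None] * len(doc_vectors)
    for doc, con in doc_vectors:
        inverted[next_slot[con]] = (con, doc)
        next_slot[con] += 1
    return inverted
```

Because the input is already ordered by document number, the postings of each concept come out in document order without any comparison sort.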

Preparation (continued)
5. For each pair obtained from con_entries_cnt: if there is no room for documents with this concept to fit in the current load, then create an entry in the load table and initialize the next load entry; otherwise update information for the current load table entry.

19 Building Load Table
- Terminology
  - LL = length of current load
  - S = spread of concept numbers in the current load
  - 8 bytes = space needed for each concept/weight pair
  - 4 bytes = space needed for each concept to store count of postings for it
- Constraints
  - 8*LL + 4*S < M
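A minimal sketch of building the load table under the constraint 8*LL + 4*S < M. It assumes loads are formed greedily from consecutive concept numbers. The tuple layout (first concept, last concept, LL) and the function names are hypothetical, not taken from the slides.

```python
import bisect


def build_load_table(con_entries_cnt, memory_bytes):
    """Group consecutive concepts into loads satisfying 8*LL + 4*S < M.

    con_entries_cnt[c] is the number of postings for concept c (index 0 unused).
    Returns a list of (first_concept, last_concept, LL) tuples.
    """
    hcn = len(con_entries_cnt) - 1
    loads = []
    first, ll = 1, 0
    for con in range(1, hcn + 1):
        cnt = con_entries_cnt[con]
        if 8 * cnt + 4 >= memory_bytes:
            raise ValueError(f"concept {con} does not fit in memory on its own")
        spread = con - first + 1  # S if this concept joins the current load
        if 8 * (ll + cnt) + 4 * spread >= memory_bytes:
            loads.append((first, con - 1, ll))  # no room: close current load
            first, ll = con, 0
        ll += cnt
    if hcn >= 1:
        loads.append((first, hcn, ll))
    return loads


def load_index(loads, con):
    """Look up which load a concept number falls into."""
    firsts = [first for first, _last, _ll in loads]
    return bisect.bisect_right(firsts, con) - 1
```

With the counts [0, 2, 1, 3, 2] and M = 40 bytes, concepts 1 and 2 share the first load (8*3 + 4*2 = 32 < 40), while concepts 3 and 4 each get a load of their own.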
