
1 CS 430: Information Discovery

Lecture 3: Inverted Files

2 Course Administration

Reading for Wednesday: since the textbooks have not yet arrived, photocopies of the reading are provided.

Assignment 1 will be posted during the next couple of days.

3 Course Administration

Course enrollment: We do not have enough laptops and wireless cards for everybody who has applied.

- Everybody who has both (a) pre-registered for the class and (b) applied for a laptop should have received an email.
- Other people may enroll for the course, but without laptops. No part of the course will require use of laptops.
- If you drop the course, we will re-issue your laptop to another student.
- If we have any extra laptops we will allocate them to people who applied, by drawing lots.

4 Inverted File (Basic)

Inverted file: a list of the words in a set of documents and the documents in which they appear.

  Word     Document
  abacus   3 19 22
  actor    2 19 29
  aspen    5
  atoll    11 34

Stop words are removed before building the index.
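A minimal sketch of how such an inverted file could be built. The stop list, the sample documents, and the function name `build_inverted_file` are illustrative assumptions, not part of the lecture.

```python
from collections import defaultdict

# Hypothetical stop list; real systems use much longer ones.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "on"}


def build_inverted_file(documents):
    """Map each non-stop word to the sorted list of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            if word not in STOP_WORDS:
                index[word].add(doc_id)
    # Words in alphabetical order, postings in document-id order.
    return {word: sorted(ids) for word, ids in sorted(index.items())}


# Made-up documents, keyed by document number.
docs = {
    3: "an abacus on the table",
    5: "the aspen",
    19: "actor and abacus",
}
inverted = build_inverted_file(docs)
```

Here `inverted["abacus"]` lists documents 3 and 19, and stop words such as "the" never enter the index.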

5 Inverted List

Inverted list: all the entries in an inverted file that apply to a specific word, e.g.,

  abacus   3 19 22

Posting: an entry in an inverted list, e.g., the postings for "abacus" are documents 3, 19, 22.

6 Boolean Search (Keyword)

Boolean query: two or more search terms, related by logical operators, e.g., and, or, not.

Examples:

  abacus and actor
  abacus or actor
  (abacus and actor) or (abacus and atoll)
  not actor
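The example queries can be sketched with set operations on the postings from the basic inverted file. The collection size (documents 1 to 40) is an assumption made here so that `not` has something to complement against.

```python
# Assumed collection: documents numbered 1..40 (not stated in the lecture).
ALL_DOCS = set(range(1, 41))

postings = {
    "abacus": {3, 19, 22},
    "actor": {2, 19, 29},
    "aspen": {5},
    "atoll": {11, 34},
}

both = postings["abacus"] & postings["actor"]        # abacus and actor
either = postings["abacus"] | postings["actor"]      # abacus or actor
combined = (postings["abacus"] & postings["actor"]) | (
    postings["abacus"] & postings["atoll"]
)                                                    # (abacus and actor) or (abacus and atoll)
not_actor = ALL_DOCS - postings["actor"]             # not actor
```

Note that `not actor` returns almost the whole collection, which is one reason a bare `not` is expensive to evaluate.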

7 Boolean Diagram

[Diagram: two sets, A and B, showing the regions for A and B, A or B, and not (A or B)]

8 Evaluating a Boolean Query

Example: abacus and actor

  Postings for abacus:   3 19 22
  Postings for actor:    2 19 29

Document 19 is the only document that contains both terms, "abacus" and "actor".

Since inverted lists may be very long, it is important to match postings efficiently.
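One common way to match postings efficiently is a linear merge of two lists that are already sorted by document number. The sketch below assumes sorted input; the function name `intersect` is illustrative.

```python
def intersect(a, b):
    """Return the document ids present in both sorted postings lists."""
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1  # a[i] cannot appear later in b, so skip it
        else:
            j += 1  # b[j] cannot appear later in a, so skip it
    return result


matches = intersect([3, 19, 22], [2, 19, 29])
```

Each list is scanned once, so the work grows with the combined length of the two lists rather than their product.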

9 Enhancements to Inverted Files

Location: The inverted file holds information about the location of each term within the document.

Uses:
- adjacency and near operators
- user interface design

Frequency: The inverted file includes the number of postings for each term.

Uses:
- term weighting
- query processing optimization
- user interface design

10 Inverted File (Enhanced)

  Word     Postings   Document   Location
  abacus   4          3          94
                      19         7
                      19         212
                      22         56
  actor    3          2          66
                      19         213
                      29         45
  aspen    1          5          43
  atoll    3          11         3
                      11         70
                      34         40

11 Adjacent and Near Operators

abacus adj actor
  The terms abacus and actor are next to each other, as in the string "abacus actor".

abacus near 4 actor
  The terms abacus and actor are near each other (here, within 4 words), as in the string "the actor has an abacus".

Some systems support other operators, such as with (two terms in the same sentence) or same (two terms in the same paragraph).

12 Evaluating an Adjacency Operation

Example: abacus adj actor

  Postings for abacus (document, location):   3 94    19 7    19 212    22 56
  Postings for actor (document, location):    2 66    19 213    29 45

Document 19, locations 212 and 213, is the only occurrence of the terms "abacus" and "actor" adjacent.
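A minimal sketch of evaluating these operators over (document, location) postings. It assumes that adj is ordered (the second term immediately follows the first) and that near N means within N locations in either order; systems differ on both points, and the function names are illustrative.

```python
abacus = [(3, 94), (19, 7), (19, 212), (22, 56)]
actor = [(2, 66), (19, 213), (29, 45)]


def adjacent(first, second):
    """Return (document, location) of `first` where `second` follows immediately."""
    second_set = set(second)
    return [(doc, loc) for doc, loc in first if (doc, loc + 1) in second_set]


def near(first, second, max_distance):
    """Return sorted ids of documents where the terms lie within max_distance locations."""
    locations_by_doc = {}
    for doc, loc in second:
        locations_by_doc.setdefault(doc, []).append(loc)
    hits = set()
    for doc, loc in first:
        others = locations_by_doc.get(doc, [])
        if any(abs(loc - other) <= max_distance for other in others):
            hits.add(doc)
    return sorted(hits)


adj_hits = adjacent(abacus, actor)
near_hits = near(abacus, actor, 4)
```

With the postings above, the only adjacency is in document 19, where "abacus" is at location 212 and "actor" at 213.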

13 Evaluation of Boolean Operators

Precedence of operators must be defined:

  adj, near   high
  and, not
  or          low

Example: A and B or C and B is evaluated as (A and B) or (C and B).

14 Efficiency Criteria

Storage
  Inverted files are big, typically 10% to 100% of the size of the collection of documents.

Update performance
  It must be possible, with a reasonable amount of computation, to:
  (a) Add a large batch of documents
  (b) Add a single document

Retrieval performance
  Retrieval must be fast enough to satisfy users and must not use excessive resources.

15 Efficiency and Query Languages

Some query options may require huge computation, e.g.,

Regular expressions
  If inverted files are stored in alphabetical order:
  - comp* can be processed efficiently
  - *comp cannot be processed efficiently

Boolean terms
  If A and B are search terms:
  - A or B can be processed by comparing two moderate sized lists
  - (not A) or (not B) requires two very large lists
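The difference between comp* and *comp can be sketched with a sorted term list and Python's `bisect` module. The term list is made up, and the `"\uffff"` sentinel assumes no term contains that character.

```python
from bisect import bisect_left

# Made-up vocabulary, held in alphabetical order.
terms = ["compact", "company", "compute", "decompose", "dog", "recomp"]


def prefix_matches(sorted_terms, prefix):
    """comp*: two binary searches bound one contiguous slice of the sorted list."""
    lo = bisect_left(sorted_terms, prefix)
    hi = bisect_left(sorted_terms, prefix + "\uffff")
    return sorted_terms[lo:hi]


def suffix_matches(sorted_terms, suffix):
    """*comp: matches are scattered through the list, so every term is examined."""
    return [term for term in sorted_terms if term.endswith(suffix)]


starts_with_comp = prefix_matches(terms, "comp")
ends_with_comp = suffix_matches(terms, "comp")
```

The prefix query touches only the matching slice, while the suffix query scans the whole vocabulary.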

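16 Index File Structures: Linear Index

  Term   Pointer to list of postings
  ant    -> inverted list
  bee    -> inverted list
  cat    -> inverted list
  dog    -> inverted list
  elk    -> inverted list
  fox    -> inverted list
  gnu    -> inverted list
  hog    -> inverted list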

17 Linear Index

Advantages
- Can be searched quickly, e.g., by binary search, O(log n)
- Good for sequential processing, e.g., comp*
- Convenient for batch updating
- Economical use of storage

Disadvantages
- Index must be rebuilt if an extra term is added

18 Index File Structures: Binary Tree

              elk
            /     \
         cat       hog
        /   \     /
     bee    dog  fox
     /             \
   ant             gnu

19 Binary Tree

Advantages
- Can be searched quickly
- Convenient for batch updating
- Easy to add an extra term
- Economical use of storage

Disadvantages
- Poor for sequential processing, e.g., comp*
- Tree tends to become unbalanced
- If the index is held on disk, each node may require a separate disk access

20 Binary Tree

Calculation of maximum depth of tree. Illustrates importance of balanced trees.
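The calculation from this slide did not survive in the transcript. As a stand-in, the sketch below shows how insertion order changes the depth of an unbalanced binary search tree for the eight example terms. The nested-dict representation and the function names are assumptions made for illustration.

```python
import math


def insert(tree, term):
    """Insert a term into an unbalanced binary search tree of nested dicts."""
    if tree is None:
        return {"term": term, "left": None, "right": None}
    side = "left" if term < tree["term"] else "right"
    tree[side] = insert(tree[side], term)
    return tree


def depth(tree):
    """Number of nodes on the longest path from the root to a leaf."""
    if tree is None:
        return 0
    return 1 + max(depth(tree["left"]), depth(tree["right"]))


def build(terms):
    tree = None
    for term in terms:
        tree = insert(tree, term)
    return tree


alphabetical = ["ant", "bee", "cat", "dog", "elk", "fox", "gnu", "hog"]
mixed_order = ["elk", "cat", "hog", "bee", "dog", "fox", "ant", "gnu"]

worst_depth = depth(build(alphabetical))   # degenerates into a linked list
mixed_depth = depth(build(mixed_order))    # matches the tree drawn for elk, cat, hog, ...
best_depth = math.ceil(math.log2(len(alphabetical) + 1))  # minimum for 8 nodes
```

Inserting terms in alphabetical order gives a tree as deep as the number of terms, so each lookup degrades from about log n steps to n steps.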

21 Impact of Secondary Storage

If an index is held on disk, search time is dominated by the number of disk accesses.

Suppose that an index has 100,000 terms. Each index entry consists of the term and a pointer to the inverted list, average 50 characters.

Size of index is 5 megabytes, which can easily be held in memory.
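The size estimate can be checked with a few lines of arithmetic, assuming one byte per character and 1 megabyte = 1,000,000 bytes (both assumptions, since the slide does not specify them).

```python
terms = 100_000        # number of index entries, from the slide
avg_entry_chars = 50   # term plus pointer, average, from the slide
BYTES_PER_CHAR = 1     # assumption: single-byte characters

index_bytes = terms * avg_entry_chars * BYTES_PER_CHAR
index_megabytes = index_bytes / 1_000_000
```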

