© 2014 A. Haeberlen, Z. Ives CIS 455/555: Internet and Web Systems 1 University of Pennsylvania Indexing February 5, 2014.


1 © 2014 A. Haeberlen, Z. Ives CIS 455/555: Internet and Web Systems 1 University of Pennsylvania Indexing February 5, 2014

2 © 2014 A. Haeberlen, Z. Ives Announcements. HW1 MS1 is due IN ONE WEEK. At this point, you should have a feature-complete prototype, so that you have time to debug and test your solution. Debugging tips: when in doubt about protocol details, please look in the HTTP/1.1 spec (RFC 2616; linked from "HTTP Made Really Easy"). Reminder: You have three jokers; the late penalty without jokers is 20% per day. Please use private questions on Piazza sparingly. Reading: D. Comer, "The Ubiquitous B-Tree", http://dl.acm.org/citation.cfm?id=356776

3 © 2014 A. Haeberlen, Z. Ives Plan for today: Inverted indices (NEXT); B+ trees.

4 © 2014 A. Haeberlen, Z. Ives Finding data by content. We've seen two approaches to search: (1) flood the network with requests (example: Gnutella) and do all the work at the data stores; (2) have a directory based on names (example: LDAP). Which of these is the 'best'? An alternative, two-step process: build a content index over what's out there. An index is a key → value map. It is typically limited in what kinds of queries can be supported. Most common instance: an index of document keywords.

5 © 2014 A. Haeberlen, Z. Ives A common model for search. Index the words in every document. Forward index: document (ID) → list of words. Inverted index: word → document (ID).
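
The two mappings on this slide can be sketched in a few lines of Python. This is only an illustration under assumptions that are not in the lecture: documents are a small in-memory dict, and words are split on whitespace. The names build_forward_index and invert are hypothetical.

```python
from collections import defaultdict

def build_forward_index(docs):
    """Forward index: document ID -> list of words (naive whitespace split)."""
    return {doc_id: text.lower().split() for doc_id, text in docs.items()}

def invert(forward_index):
    """Inverted index: word -> sorted list of document IDs containing it."""
    inverted = defaultdict(set)
    for doc_id, words in forward_index.items():
        for word in words:
            inverted[word].add(doc_id)
    return {word: sorted(ids) for word, ids in inverted.items()}

# Toy corpus (made up for illustration).
docs = {1: "the quick brown fox", 2: "the lazy dog", 3: "quick quick dog"}
forward = build_forward_index(docs)
inverted = invert(forward)
print(forward[2])         # words of document 2
print(inverted["quick"])  # documents containing "quick"
```

The forward index answers "which words are in this document?", while the inverted index answers the search question "which documents contain this word?".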

6 © 2014 A. Haeberlen, Z. Ives Inverted indices. A conceptually very simple map-multiset data structure. In its simplest form, each occurrence includes a document pointer (e.g., URI), perhaps a count and/or position. What might a count be useful for? A position? Requires two components: an indexer and a retrieval system. We'll consider the cost of building the index, plus searching the index using a single keyword. Storage efficiency is also a concern.
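
As a rough sketch of the two components named above, the following Python keeps, for each keyword, a document pointer together with the positions at which the word occurs; the count is simply the number of positions. The function names, the example URIs, and the "highest count first" ordering are assumptions made for illustration, not part of the lecture.

```python
from collections import defaultdict

def index_documents(docs):
    """Indexer: build keyword -> {uri: [positions]} from a dict of uri -> text."""
    index = defaultdict(dict)
    for uri, text in docs.items():
        for position, word in enumerate(text.lower().split()):
            index[word].setdefault(uri, []).append(position)
    return index

def lookup(index, keyword):
    """Retrieval: (uri, count, positions) per matching document, highest count first."""
    postings = index.get(keyword.lower(), {})
    results = [(uri, len(pos), pos) for uri, pos in postings.items()]
    return sorted(results, key=lambda r: (-r[1], r[0]))

# Hypothetical documents.
docs = {
    "http://example.org/a": "to be or not to be",
    "http://example.org/b": "be quick",
}
index = index_documents(docs)
for hit in lookup(index, "be"):
    print(hit)
```

One possible answer to the questions on the slide: a count can feed a simple ranking (as in the sort above), and positions make it possible to check whether two keywords are adjacent, which phrase queries need.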

7 © 2014 A. Haeberlen, Z. Ives How do we lay out an inverted index? Some data structures we could use: unordered list (e.g., a log); ordered list; tree; hash table.

8 © 2014 A. Haeberlen, Z. Ives Unordered and ordered lists. Assume that we have entries such as: What does ordering buy us? Assume that we adopt a model in which we use: Do we get any additional benefits? How about: where we fix the size of the keyword and the number of items?
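
The entry formats shown on this slide are not preserved in the transcript, so the sketch below assumes simple (keyword, document ID) pairs purely for illustration. It contrasts a linear scan over an unordered list with a binary search over an ordered one, which is one thing ordering buys us.

```python
import bisect

def find_unordered(entries, keyword):
    """Linear scan: every entry must be inspected."""
    return [doc_id for word, doc_id in entries if word == keyword]

def find_sorted(entries, keyword):
    """Binary search over (keyword, doc_id) tuples sorted by keyword."""
    # The 1-tuple (keyword,) sorts just before every (keyword, doc_id) pair.
    i = bisect.bisect_left(entries, (keyword,))
    hits = []
    while i < len(entries) and entries[i][0] == keyword:
        hits.append(entries[i][1])
        i += 1
    return hits

# Assumed entry format: (keyword, doc_id).
log = [("dog", 2), ("be", 1), ("be", 3), ("art", 1)]
ordered = sorted(log)
print(find_unordered(log, "be"))
print(find_sorted(ordered, "be"))
print(find_sorted(ordered, "cat"))
```

The sorted variant needs only a logarithmic number of comparisons to locate the first match, at the price of keeping the list sorted as entries are added.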

9 © 2014 A. Haeberlen, Z. Ives Tree-based indices. Trees have several benefits over lists: potentially logarithmic search time (as with a well-designed sorted list), provided the tree is balanced! Ability to handle variable-length records. We've already seen how trees might make a natural way of distributing data, as well. How does a binary search tree fare? Cost of building? Cost of finding an item in it?
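
To make the question about binary search trees concrete, here is a small experiment, written as a sketch with hypothetical helper names. It builds a plain (unbalanced) binary search tree twice: once from keys that arrive in sorted order, and once from the same keys shuffled with a fixed seed.

```python
import random

class Node:
    def __init__(self, key):
        self.key, self.left, self.right = key, None, None

def insert(root, key):
    """Insert key into a plain (unbalanced) binary search tree; return the root."""
    if root is None:
        return Node(key)
    node = root
    while True:
        side = "left" if key < node.key else "right"
        child = getattr(node, side)
        if child is None:
            setattr(node, side, Node(key))
            return root
        node = child

def height(root):
    """Number of nodes on the longest root-to-leaf path."""
    best, stack = 0, [(root, 1)]
    while stack:
        node, depth = stack.pop()
        if node is None:
            continue
        best = max(best, depth)
        stack.append((node.left, depth + 1))
        stack.append((node.right, depth + 1))
    return best

def build(keys):
    root = None
    for k in keys:
        root = insert(root, k)
    return root

keys = list(range(200))
sorted_height = height(build(keys))        # keys arrive in sorted order
shuffled = keys[:]
random.Random(0).shuffle(shuffled)
shuffled_height = height(build(shuffled))  # same keys, random order
print(sorted_height, shuffled_height)
```

With sorted input the tree degenerates into a chain as tall as the number of keys, so both building and searching lose the logarithmic behavior. This is the case that the height-balancing of the B+ tree is meant to rule out.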

10 © 2014 A. Haeberlen, Z. Ives Recap: Inverted indices. Useful for search. Different data structures can be used. Pros / cons.

11 © 2014 A. Haeberlen, Z. Ives Plan for today: Inverted indices; B+ trees (NEXT).

12 © 2014 A. Haeberlen, Z. Ives The B+ tree. A flexible, height-balanced, high-fanout tree. Insert/delete at log_F N cost (F = fanout, N = # leaf pages). Need to keep tree height-balanced. Minimum 50% occupancy (except for root): each node contains d <= m <= 2d entries; inner nodes contain up to 2d+1 pointers; d is called the order of the tree. Can search efficiently based on equality (or also range, though we don't need that here). [Figure: index entries on top (direct search); data entries (the "sequence set") at the bottom, connected as a linked list (compare to B-tree!).]
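
A quick way to get a feel for the log_F N cost and the occupancy rule is to compute them for a few values. The sketch below uses integer arithmetic only; the function names and the sample fanout and page counts are assumptions chosen for illustration.

```python
def levels_needed(fanout, leaf_pages):
    """Smallest h with fanout**h >= leaf_pages, i.e. ceil(log_F N)."""
    h, reach = 0, 1
    while reach < leaf_pages:
        reach *= fanout
        h += 1
    return h

def valid_entry_count(m, d, is_root=False):
    """Occupancy rule d <= m <= 2d; the root is allowed to hold fewer than d."""
    return m <= 2 * d and (is_root or m >= d)

print(levels_needed(100, 1_000_000))  # fanout 100, one million leaf pages
print(levels_needed(2, 1_000_000))    # a binary tree over the same pages
print(valid_entry_count(1, 2), valid_entry_count(1, 2, is_root=True))
```

With a fanout of 100, three levels of index pages are enough to reach a million leaf pages, whereas a binary tree would need twenty levels.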

13 © 2014 A. Haeberlen, Z. Ives Example B+ Tree. Data (inverted list pointers) is at the leaves; intermediate nodes have copies of search keys. Search begins at root, and key comparisons direct it to a leaf. Search for be, bobcat... Based on the search for bobcat*, we know it is not in the tree! [Figure: root with keys art, best, but, dog; leaves (a, am, an, ant), (art, be), (best, bit, bob), (but, can, cry), (dog, dry, elf, fox).]
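
The search procedure can be sketched as follows. The tree is the one from this slide as far as it can be read from the transcript (root keys art, best, but, dog over five leaves); the Leaf and Inner classes are hypothetical, and the leaves hold bare keywords rather than inverted list pointers.

```python
import bisect

class Leaf:
    def __init__(self, keys):
        self.keys = keys

class Inner:
    def __init__(self, keys, children):
        self.keys = keys          # separator keys, sorted
        self.children = children  # one more child than there are keys

def search(node, key):
    """Follow key comparisons from the root to a single leaf, then look there."""
    while isinstance(node, Inner):
        # Keys equal to a separator live in the subtree to its right.
        node = node.children[bisect.bisect_right(node.keys, key)]
    return key in node.keys

root = Inner(
    ["art", "best", "but", "dog"],
    [
        Leaf(["a", "am", "an", "ant"]),
        Leaf(["art", "be"]),
        Leaf(["best", "bit", "bob"]),
        Leaf(["but", "can", "cry"]),
        Leaf(["dog", "dry", "elf", "fox"]),
    ],
)
print(search(root, "be"))      # ends in the leaf (art, be)
print(search(root, "bobcat"))  # ends in the leaf (best, bit, bob)
```

Because the comparisons lead to exactly one leaf, not finding bobcat in the leaf (best, bit, bob) is enough to conclude that it is not in the tree.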

14 © 2014 A. Haeberlen, Z. Ives Inserting data into a B+ Tree. Find correct leaf L. Put data entry onto L. If L has enough space, we are done! Else, must split leaf node L (into L and a new node L2): redistribute entries evenly, copy up middle key, and insert index entry pointing to L2 into parent of L. This can happen recursively. To split index node, redistribute entries evenly, but push up middle key. (Contrast with leaf splits.) Splits grow tree; root split increases height. Tree growth: gets wider or one level taller at the top. [Figure: root with keys art, best, but, dog; leaves (a, am, an, ant), (art, be), (best, bit, bob), (but, can, cry), (dog, dry, elf, fox).]
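
The insertion steps above can be sketched compactly. This is an illustration rather than a complete B+ tree: it assumes order d = 2, stores bare keys instead of data entries, and omits the linked list between leaves as well as deletion. The Node class and function names are hypothetical. The example inserts "and" into the tree from the slides.

```python
import bisect

ORDER = 2  # d: a node holds at most 2d keys

class Node:
    def __init__(self, keys, children=None):
        self.keys = keys
        self.children = children  # None for a leaf

def _insert(node, key):
    """Insert below node; return (separator, new_right_node) if node split, else None."""
    if node.children is None:                 # leaf: put the entry here
        bisect.insort(node.keys, key)
        if len(node.keys) <= 2 * ORDER:
            return None                       # enough space, done
        mid = len(node.keys) // 2
        right = Node(node.keys[mid:])
        node.keys = node.keys[:mid]
        return right.keys[0], right           # copy up: key stays in the leaf
    i = bisect.bisect_right(node.keys, key)
    split = _insert(node.children[i], key)
    if split is None:
        return None
    sep, right_child = split
    node.keys.insert(i, sep)
    node.children.insert(i + 1, right_child)
    if len(node.keys) <= 2 * ORDER:
        return None
    mid = len(node.keys) // 2
    sep_up = node.keys[mid]
    right = Node(node.keys[mid + 1:], node.children[mid + 1:])
    node.keys, node.children = node.keys[:mid], node.children[:mid + 1]
    return sep_up, right                      # push up: key leaves this node

def insert(root, key):
    """Insert key; return the (possibly new) root."""
    split = _insert(root, key)
    if split is None:
        return root
    sep, right = split
    return Node([sep], [root, right])         # root split: tree grows one level

root = Node(
    ["art", "best", "but", "dog"],
    [Node(["a", "am", "an", "ant"]), Node(["art", "be"]),
     Node(["best", "bit", "bob"]), Node(["but", "can", "cry"]),
     Node(["dog", "dry", "elf", "fox"])],
)
root = insert(root, "and")
print(root.keys, [child.keys for child in root.children])
```

Inserting "and" overflows the first leaf, so "an" is copied up; that in turn overflows the old root, so "best" is pushed up into a new root and the tree becomes one level taller.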

15 © 2014 A. Haeberlen, Z. Ives Inserting "and", Example: Copy up. Want to insert here; no room, so split & copy up. [Figure: the leaf (a, am, an, ant) splits into (a, am) and (an, and, ant); "an" is the entry to be inserted in the parent node, whose keys are art, best, but, dog.] (Note that key "an" is copied up and continues to appear in the leaf.) But where? Parent node is already "full"!

16 © 2014 A. Haeberlen, Z. Ives Inserting "and", Example: Push up 1/2. Need to split node & push up. [Figure: the former root would have to hold an, art, best, but, dog; leaves (a, am), (an, and, ant), (art, be), (best, bit, bob), (but, can, cry), (dog, dry, elf, fox).]

17 © 2014 A. Haeberlen, Z. Ives Inserting "and", Example: Push up 2/2. [Figure: new root with key best; inner nodes (an, art) and (but, dog); leaves (a, am), (an, and, ant), (art, be), (best, bit, bob), (but, can, cry), (dog, dry, elf, fox).] Entry to be inserted in parent node. (Note that "best" is pushed up and only appears once in the index. Contrast this with a leaf split.)

18 © 2014 A. Haeberlen, Z. Ives Summary: Copying vs. splitting. Every keyword (search key) appears in at most one intermediate node. Hence, in splitting an intermediate node, we push up. Every inverted list entry must appear in a leaf. We may also need it in an intermediate node to define a partition point in the tree; we must copy up the key of this entry. Note that B+ trees easily accommodate multiple occurrences of a keyword.

19 © 2014 A. Haeberlen, Z. Ives Some details. How would you choose the order of the tree? How would you find all the words starting with the letters 'com'? How would you delete something? Do you always have to split/merge?
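
One possible answer to the 'com' question is a range scan: locate the first key that is >= the prefix, then walk forward through the sorted keys until the prefix no longer matches. The sketch below uses a sorted Python list as a stand-in for the linked leaf level of a B+ tree; the word list and the function name are made up for illustration.

```python
import bisect

def words_with_prefix(sorted_words, prefix):
    """Range scan: jump to the first candidate, then read forward while it matches."""
    start = bisect.bisect_left(sorted_words, prefix)
    matches = []
    for word in sorted_words[start:]:
        if not word.startswith(prefix):
            break  # sorted order: no later word can match either
        matches.append(word)
    return matches

# Stand-in for the keys in the leaf level, in sorted order.
words = sorted(["dog", "compute", "cat", "common", "cone", "com", "comet"])
print(words_with_prefix(words, "com"))
```

In an actual B+ tree, the initial jump would be an ordinary root-to-leaf search for the prefix, and the forward walk would follow the links between leaves.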

20 © 2014 A. Haeberlen, Z. Ives Virtues of the B+ Tree. B+ tree and other indices are quite efficient: height-balanced; log_F N cost to search. High fanout (F) means depth rarely more than 3 or 4. Almost always better than maintaining a sorted file. Typically, 67% occupancy on average. Berkeley DB library (C, C++, Java; Oracle) is a toolkit for B+ trees that you will be using later in the semester. Interface: open B+ Tree; get and put items based on key. Handles concurrency, caching, etc.

21 © 2014 A. Haeberlen, Z. Ives Example: B+ tree. Insert 15, 11, 12, 32, 74. [Figure: B+ tree; node keys as extracted from the slide: 65 130 187 9 25 45 70 80 101 138 150 159 122180 1 4 6 9 14 16 25 31 38 41 45 61 63 64 65 67 68 69 70 72 75 79]

22 © 2014 A. Haeberlen, Z. Ives How do we distribute a B+ Tree? We need to host the root at one machine and distribute the rest. What are the implications for scalability? Consider building the index as well as searching.

23 © 2014 A. Haeberlen, Z. Ives Eliminating the root. Sometimes we don't want a tree-structured system because the higher levels can be a central point of congestion or failure. Two strategies: modified tree structure (e.g., the BATON p2p tree; see Jagadish et al., VLDB 2005); non-hierarchical structure (distributed hash table, discussed in a couple of weeks).

24 © 2014 A. Haeberlen, Z. Ives Recap: B+ trees. A very common data structure for indices. Used, e.g., in many file systems and many DBMS. Very efficient: height-balanced; log_F N cost to search. High fanout (F) means depth rarely more than 3 or 4. Almost always better than maintaining a sorted file. Typically, 67% occupancy on average.

