Published by Jasper Newcombe
Modified over 2 years ago
CIS 455/555: Internet and Web Systems
Indexing
February 5, 2014
University of Pennsylvania
© 2014 A. Haeberlen, Z. Ives

Announcements
- HW1 MS1 is due IN ONE WEEK
  - At this point, you should have a feature-complete prototype, so you have time to debug and test your solution
  - Debugging tips
  - When in doubt about protocol details, please look in the HTTP/1.1 spec (RFC 2616; linked from "HTTP Made Really Easy")
- Reminder: You have three jokers; the late penalty without jokers is 20% per day
- Please: Use private questions on Piazza sparingly
- Reading: D. Comer: "The Ubiquitous B-Tree"

Plan for today
- Inverted indices (NEXT)
- B+ trees

Finding data by content
- We've seen two approaches to search:
  - Flood the network with requests (example: Gnutella), and do all the work at the data stores
  - Have a directory based on names (example: LDAP)
- Which of these is the 'best'?
- An alternative, two-step process:
  - Build a content index over what's out there
    - An index is a key → value map
    - Typically limited in what kinds of queries can be supported
  - Most common instance: an index of document keywords

A common model for search
- Index the words in every document
- Forward index: document (ID) → list of words
- Inverted index: word → document (ID)

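The relationship between the two indices can be sketched in a few lines of Python. This is a minimal illustration, not course code: the sample documents, the whitespace tokenizer, and the function names `build_forward_index` and `invert` are all assumptions made for the example.

```python
from collections import defaultdict

def build_forward_index(docs):
    """Forward index: document ID -> list of words (naive whitespace tokenizer)."""
    return {doc_id: text.lower().split() for doc_id, text in docs.items()}

def invert(forward_index):
    """Inverted index: word -> sorted list of document IDs containing it."""
    inverted = defaultdict(set)
    for doc_id, words in forward_index.items():
        for word in words:
            inverted[word].add(doc_id)
    return {word: sorted(ids) for word, ids in inverted.items()}

# Hypothetical sample documents, keyed by document ID.
docs = {1: "the quick brown fox", 2: "the lazy dog", 3: "quick quick dog"}
forward = build_forward_index(docs)
inverted = invert(forward)
```

Here `forward[2]` is the word list of document 2, while `inverted["quick"]` lists the documents that mention "quick", which is the direction a keyword search needs.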
Inverted indices
- A conceptually very simple map-multiset data structure
  - In its simplest form, each occurrence includes a document pointer (e.g., URI), perhaps a count and/or position
- What might a count be useful for? A position?
- Requires two components, an indexer and a retrieval system
- We'll consider the cost of building the index, plus searching the index using a single keyword
- Storage efficiency is also a concern

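A small sketch of the two components, assuming one possible layout for an occurrence (document pointer, count, positions). The names `index_documents` and `lookup` and the sample URIs are hypothetical. Ordering hits by count is one common use of the count (a crude relevance signal); positions are typically what phrase or proximity queries would need.

```python
from collections import defaultdict

def index_documents(docs):
    """Indexer: build word -> {doc_id: [positions of the word in that doc]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word].setdefault(doc_id, []).append(pos)
    return index

def lookup(index, word):
    """Retrieval: single-keyword search returning (doc_id, count, positions),
    highest count first, ties broken by document ID."""
    postings = index.get(word.lower(), {})
    hits = [(doc_id, len(positions), positions)
            for doc_id, positions in postings.items()]
    return sorted(hits, key=lambda hit: (-hit[1], hit[0]))

# Hypothetical documents, keyed by a URI-like pointer.
docs = {"a.html": "to be or not to be", "b.html": "to do"}
index = index_documents(docs)
hits = lookup(index, "to")
```

With these inputs, `hits` ranks `a.html` (two occurrences, at positions 0 and 4) ahead of `b.html` (one occurrence).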
How do we lay out an inverted index?
- Some data structures we could use:
  - Unordered list (e.g., a log)
  - Ordered list
  - Tree
  - Hash table

Unordered and ordered lists
- Assume that we have entries such as: [figure]
  - What does ordering buy us?
- Assume that we adopt a model in which we use: [figure]
  - Do we get any additional benefits?
- How about: [figure]
  - where we fix the size of the keyword and the number of items?

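One way to see what ordering buys: with an unordered log, a lookup has to scan every entry, while a sorted list allows binary search to the first match followed by a short contiguous read. The entry layout below, `(keyword, doc_id)` pairs, is an assumption made for illustration, since the slide's own layout is not preserved here.

```python
from bisect import bisect_left

def scan_lookup(entries, keyword):
    """Unordered list: every entry must be examined, O(n) comparisons."""
    return [doc for kw, doc in entries if kw == keyword]

def sorted_lookup(entries, keyword):
    """Sorted list: O(log n) binary search to the first match,
    then read the contiguous run of entries with the same keyword."""
    i = bisect_left(entries, (keyword,))  # (keyword,) sorts before (keyword, doc)
    docs = []
    while i < len(entries) and entries[i][0] == keyword:
        docs.append(entries[i][1])
        i += 1
    return docs

# Hypothetical entries in arrival order (a log), and the same entries sorted.
log = [("dog", 2), ("cat", 1), ("dog", 1), ("ant", 3)]
ordered = sorted(log)
```

Both functions return the same documents for "dog" (in different orders), but only the sorted variant avoids touching unrelated entries. The price is that every insert must keep the list sorted.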
Tree-based indices
- Trees have several benefits over lists:
  - Potentially logarithmic search time, as with a well-designed sorted list, but only if the tree is balanced!
  - Ability to handle variable-length records
- We've already seen how trees might make a natural way of distributing data, as well
- How does a binary search tree fare?
  - Cost of building?
  - Cost of finding an item in it?

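The balance caveat can be made concrete with a plain, unbalanced binary search tree. In this sketch (class and function names are illustrative), inserting keys in sorted order yields a tree as tall as a linked list, so lookups cost O(n), while a friendlier insertion order for the same keys gives logarithmic height.

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None

def insert(root, key):
    """Insert key into an unbalanced BST; returns the (possibly new) root."""
    if root is None:
        return Node(key)
    node = root
    while True:
        if key == node.key:
            return root                      # ignore duplicates
        side = "left" if key < node.key else "right"
        child = getattr(node, side)
        if child is None:
            setattr(node, side, Node(key))
            return root
        node = child

def height(node):
    """Number of nodes on the longest root-to-leaf path."""
    if node is None:
        return 0
    return 1 + max(height(node.left), height(node.right))

def build(keys):
    root = None
    for key in keys:
        root = insert(root, key)
    return root

# Same 15 keys, two insertion orders.
sorted_height = height(build(range(1, 16)))
mixed_height = height(build([8, 4, 12, 2, 6, 10, 14, 1, 3, 5, 7, 9, 11, 13, 15]))
```

The sorted insertion order produces a height of 15 (one node per level), whereas the second order produces a perfectly balanced tree of height 4.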
Recap: Inverted indices
- Useful for search
- Different data structures can be used
  - Pros / cons

Plan for today
- Inverted indices
- B+ trees (NEXT)

The B+ tree
- A flexible, height-balanced, high-fanout tree
- Insert/delete at log_F N cost (F = fanout, N = # leaf pages)
  - Need to keep tree height-balanced
- Minimum 50% occupancy (except for root)
  - Each node contains d <= m <= 2d entries
  - Inner nodes contain up to 2d+1 pointers
  - d is called the order of the tree
- Can search efficiently based on equality (or also range, though we don't need that here)
- Figure labels: index entries (direct search); data entries ("sequence set") connected as a linked list (compare to B-tree!)

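To get a feel for the log_F N bound, the following sketch counts how many index levels are needed above N leaf pages when every inner node has F children. The function name and the sample values of F and N are assumptions chosen for illustration; real trees are not completely full, so actual heights may be slightly larger.

```python
def index_levels(num_leaf_pages, fanout):
    """Smallest h with fanout**h >= num_leaf_pages, i.e. ceil(log_F N).
    Uses integer arithmetic to avoid floating-point rounding in log()."""
    levels, reach = 0, 1
    while reach < num_leaf_pages:
        reach *= fanout
        levels += 1
    return levels

# Hypothetical sizes: one million leaf pages.
high_fanout = index_levels(1_000_000, 100)   # B+-tree-like fanout
binary = index_levels(1_000_000, 2)          # binary-tree-like fanout
```

With a fanout of 100, three index levels suffice for a million leaf pages; with a fanout of 2, twenty levels are needed. This is why high fanout matters when each level costs a page access.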
Example B+ Tree
- Data (inverted list pointers) is at the leaves; intermediate nodes have copies of search keys
- Search begins at root, and key comparisons direct it to a leaf
- Search for be, bobcat...
- Based on the search for bobcat*, we know it is not in the tree!
- Figure:
  - Root: art | best | but | dog
  - Leaves: [a am an ant] [art be] [best bit bob] [but can cry] [dog dry elf fox]

Inserting data into a B+ Tree
- Find correct leaf L
- Put data entry onto L
  - If L has enough space, we are done!
  - Else, must split leaf node L (into L and a new node L2)
    - Redistribute entries evenly, copy up middle key
    - Insert index entry pointing to L2 into parent of L
- This can happen recursively
  - To split index node, redistribute entries evenly, but push up middle key. (Contrast with leaf splits.)
- Splits grow tree; root split increases height
  - Tree growth: gets wider or one level taller at the top

Inserting "and" example: Copy up
- Want to insert "and" into leaf [a am an ant]; no room, so split & copy up
- After the split: leaves [a am] and [an and ant]
- "an" is the entry to be inserted in the parent node. (Note that key "an" is copied up and continues to appear in the leaf.)
- But where? Parent node (root: art | best | but | dog) is already "full"!

Inserting "and" example: Push up 1/2
- Need to split node & push up
- Figure:
  - Overfull root: an | art | best | but | dog
  - Leaves: [a am] [an and ant] [art be] [best bit bob] [but can cry] [dog dry elf fox]

Inserting "and" example: Push up 2/2
- "best" is the entry to be inserted in the parent node. (Note that "best" is pushed up and only appears once in the index. Contrast this with a leaf split.)
- Figure:
  - Root: best
  - Inner nodes: an | art and but | dog
  - Leaves: [a am] [an and ant] [art be] [best bit bob] [but can cry] [dog dry elf fox]

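The copy-up and push-up steps of this example can be traced with a small sketch. It assumes order d = 2 (at most 2d = 4 entries per node), builds the starting tree by hand from the keys visible on the slides, and then inserts "and". The `Node` class and function names are illustrative. Data pointers and the leaf-level linked list are omitted for brevity.

```python
from bisect import bisect_right, insort

D = 2  # order of the tree: each node holds between D and 2*D entries

class Node:
    def __init__(self, keys, children=None):
        self.keys = keys            # sorted search keys
        self.children = children    # None for a leaf, else len(keys)+1 nodes

def insert(node, key):
    """Insert key below node. Returns None if node absorbed the insert,
    or (separator_key, new_right_sibling) if node had to split."""
    if node.children is None:                     # leaf
        insort(node.keys, key)
        if len(node.keys) <= 2 * D:
            return None
        right = Node(node.keys[D:])
        node.keys = node.keys[:D]
        return right.keys[0], right               # COPY up: key stays in leaf
    i = bisect_right(node.keys, key)              # child that covers key
    split = insert(node.children[i], key)
    if split is None:
        return None
    sep, right_child = split
    node.keys.insert(i, sep)
    node.children.insert(i + 1, right_child)
    if len(node.keys) <= 2 * D:
        return None
    middle = node.keys[D]                         # PUSH up: key leaves this node
    right = Node(node.keys[D + 1:], node.children[D + 1:])
    node.keys, node.children = node.keys[:D], node.children[:D + 1]
    return middle, right

def insert_root(root, key):
    """Insert into the tree; a root split makes the tree one level taller."""
    split = insert(root, key)
    if split is None:
        return root
    sep, right = split
    return Node([sep], [root, right])

# Starting tree, as read off the example slides.
leaves = [Node(words.split()) for words in
          ["a am an ant", "art be", "best bit bob", "but can cry", "dog dry elf fox"]]
root = Node(["art", "best", "but", "dog"], leaves)
root = insert_root(root, "and")
```

After the insert, the new root holds only "best", its two children hold "an", "art" and "but", "dog", and the first two leaves are [a am] and [an and ant]. Note that "an" appears both in an inner node and in a leaf (copied up), while "best" appears in only one index node (pushed up).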
Summary: Copying vs. splitting
- Every keyword (search key) appears in at most one intermediate node
  - Hence, in splitting an intermediate node, we push up
- Every inverted list entry must appear in a leaf
  - We may also need it in an intermediate node to define a partition point in the tree
  - We must copy up the key of this entry
- Note that B+ trees easily accommodate multiple occurrences of a keyword

Some details
- How would you choose the order of the tree?
- How would you find all the words starting with the letters 'com'?
- How would you delete something?
  - Do you always have to split/merge?

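One possible answer to the 'com' question is a range scan: search for the first key that is greater than or equal to the prefix, then walk the leaf-level sequence in order until a key no longer starts with it. The sketch below uses a sorted Python list as a stand-in for the linked leaves. The function name and the word list are assumptions for illustration.

```python
from bisect import bisect_left

def prefix_scan(sorted_keys, prefix):
    """Return all keys starting with prefix: binary search to the first
    candidate, then scan forward while the prefix still matches."""
    i = bisect_left(sorted_keys, prefix)
    matches = []
    while i < len(sorted_keys) and sorted_keys[i].startswith(prefix):
        matches.append(sorted_keys[i])
        i += 1
    return matches

# Hypothetical keyword set; sorting stands in for the leaf-level order.
keys = sorted(["cat", "comb", "come", "computer", "con", "dog", "com"])
com_words = prefix_scan(keys, "com")
```

This works because all keys sharing a prefix are adjacent in sorted order, so the scan stops at the first non-matching key ("con" here) without visiting the rest.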
Virtues of the B+ Tree
- B+ tree and other indices are quite efficient:
  - Height-balanced; log_F N cost to search
  - High fanout (F) means depth rarely more than 3 or 4
  - Almost always better than maintaining a sorted file
  - Typically, 67% occupancy on average
- Berkeley DB library (C, C++, Java; Oracle) is a toolkit for B+ trees that you will be using later in the semester:
  - Interface: open B+ Tree; get and put items based on key
  - Handles concurrency, caching, etc.

Example: B+ tree
- Insert 15, 11, 12, 32,

How do we distribute a B+ Tree?
- We need to host the root at one machine and distribute the rest
- What are the implications for scalability?
  - Consider building the index as well as searching

Eliminating the root
- Sometimes we don't want a tree-structured system because the higher levels can be a central point of congestion or failure
- Two strategies:
  - Modified tree structure (e.g., the BATON p2p tree; see Jagadish et al., VLDB 2005)
  - Non-hierarchical structure (distributed hash table, discussed in a couple of weeks)

Recap: B+ trees
- A very common data structure for indices
  - Used, e.g., in many file systems and many DBMS
- Very efficient
  - Height-balanced; log_F N cost to search
  - High fanout (F) means depth rarely more than 3 or 4
  - Almost always better than maintaining a sorted file
  - Typically, 67% occupancy on average