Published by Jasper Newcombe
Modified over 2 years ago
© 2014 A. Haeberlen, Z. Ives. CIS 455/555: Internet and Web Systems, University of Pennsylvania. Indexing. February 5, 2014
Announcements: HW1 MS1 is due IN ONE WEEK. At this point, you should have a feature-complete prototype, so you have time to debug and test your solution. Debugging tips. When in doubt about protocol details, please look in the HTTP/1.1 spec (RFC 2616; linked from HTTP Made Really Easy). Reminder: You have three jokers; the late penalty without jokers is 20% per day. Please: Use private questions on Piazza sparingly. Reading: D. Comer: "The Ubiquitous B-Tree"
Plan for today: Inverted indices (NEXT); B+ trees.
Finding data by content: We've seen two approaches to search: flood the network with requests (example: Gnutella) and do all the work at the data stores, or have a directory based on names (example: LDAP). Which of these is the 'best'? An alternative, two-step process: Build a content index over what's out there. An index is a key → value map. Typically limited in what kinds of queries can be supported. Most common instance: an index of document keywords.
A common model for search: Index the words in every document. Forward index: document (ID) → list of words. Inverted index: word → document (ID).
Inverted indices: A conceptually very simple map-multiset data structure. In its simplest form, each occurrence includes a document pointer (e.g., URI), perhaps a count and/or position. What might a count be useful for? A position? Requires two components, an indexer and a retrieval system. We'll consider the cost of building the index, plus searching the index using a single keyword. Storage efficiency is also a concern.
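To make the map-multiset idea concrete, here is a minimal sketch of an indexer in Python. The posting layout (word → document ID → list of positions) and the names `build_inverted_index` and `docs` are assumptions made for illustration, not something the slides prescribe.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each word to {doc_id: [positions]} (a hypothetical posting layout)."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        # Naive tokenization: lowercase and split on whitespace.
        for pos, word in enumerate(text.lower().split()):
            index[word].setdefault(doc_id, []).append(pos)
    return index

# Hypothetical toy corpus.
docs = {
    "doc1": "the quick brown fox",
    "doc2": "the lazy dog and the fox",
}
index = build_inverted_index(docs)

# A per-document count is simply the number of recorded positions.
counts = {doc_id: len(positions) for doc_id, positions in index["the"].items()}
```

One plausible answer to the slide's questions: a count can feed a ranking function (documents that mention a word more often may be more relevant), while positions make phrase or proximity queries possible.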
How do we lay out an inverted index? Some data structures we could use: unordered list (e.g., a log), ordered list, tree, hash table.
Unordered and ordered lists: Assume that we have entries such as: What does ordering buy us? Assume that we adopt a model in which we use: Do we get any additional benefits? How about: where we fix the size of the keyword and the number of items?
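One thing ordering buys is binary search. The sketch below assumes entries are (keyword, document ID) pairs, which is only a guess at the entry layout shown on the original slide; `lookup` is a hypothetical helper built on the standard `bisect` module.

```python
import bisect

# Assumed entry layout: (keyword, document ID), kept in sorted order.
entries = sorted([
    ("dog", "doc2"), ("art", "doc1"), ("fox", "doc1"),
    ("fox", "doc2"), ("bit", "doc3"),
])

def lookup(entries, keyword):
    """Return all document IDs for keyword using O(log n) binary search."""
    # (keyword,) sorts before every (keyword, doc) pair, so this finds
    # the first entry for the keyword, if there is one.
    i = bisect.bisect_left(entries, (keyword,))
    docs = []
    while i < len(entries) and entries[i][0] == keyword:
        docs.append(entries[i][1])
        i += 1
    return docs

fox_docs = lookup(entries, "fox")
cat_docs = lookup(entries, "cat")
```

With an unordered list, the same lookup would have to scan every entry; with the sorted list, the cost is a logarithmic search plus the length of the result.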
Tree-based indices: Trees have several benefits over lists: Potentially logarithmic search time (as with a well-designed sorted list), if the tree is balanced! Ability to handle variable-length records. We've already seen how trees might make a natural way of distributing data, as well. How does a binary search tree fare? Cost of building? Cost of finding an item in it?
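The "if it is balanced" caveat can be seen in a small experiment. This is a sketch of a plain, unbalanced binary search tree (nodes stored as `[key, left, right]` lists, a representation chosen here only for brevity); it compares the height after inserting keys in sorted order with the height after inserting the same keys shuffled.

```python
import random

def bst_insert(root, key):
    """Insert key into an unbalanced BST of [key, left, right] nodes."""
    if root is None:
        return [key, None, None]
    node = root
    while True:
        idx = 1 if key < node[0] else 2   # 1 = left child, 2 = right child
        if node[idx] is None:
            node[idx] = [key, None, None]
            return root
        node = node[idx]

def height(root):
    """Number of levels, computed level by level (no recursion)."""
    level, h = ([root] if root else []), 0
    while level:
        h += 1
        level = [c for n in level for c in (n[1], n[2]) if c is not None]
    return h

def build(keys):
    root = None
    for k in keys:
        root = bst_insert(root, k)
    return root

keys = list(range(500))
sorted_height = height(build(keys))      # degenerates into a chain
random.Random(0).shuffle(keys)
shuffled_height = height(build(keys))    # much shallower
```

Sorted input, which is quite likely when indexing keywords in order, turns the tree into a linked list, so both building and searching lose the logarithmic bound unless the tree is rebalanced.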
Recap: Inverted indices. Useful for search. Different data structures can be used. Pros / cons.
Plan for today: Inverted indices; B+ trees (NEXT).
The B+ tree: A flexible, height-balanced, high-fanout tree. Insert/delete at log_F N cost (F = fanout, N = # leaf pages). Need to keep tree height-balanced. Minimum 50% occupancy (except for root): each node contains d <= m <= 2d entries; inner nodes contain up to 2d+1 pointers; d is called the order of the tree. Can search efficiently based on equality (or also range, though we don't need that here). [Figure: index entries (direct search) above the data entries (the "sequence set"), which are connected as a linked list; compare to B-tree!]
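A quick way to get a feel for the log_F N cost is to compute how many index levels are needed to reach N leaf pages with fanout F. The helper below is a hypothetical sketch that assumes every index node is completely full, so it gives a lower bound on the real height.

```python
def index_levels(num_leaf_pages, fanout):
    """Smallest h with fanout**h >= num_leaf_pages, i.e. ceil(log_F N)."""
    levels, reach = 0, 1
    while reach < num_leaf_pages:
        reach *= fanout      # one more index level multiplies the reach by F
        levels += 1
    return levels

# Example figures chosen for illustration only.
small = index_levels(1_000_000, 100)
large = index_levels(100_000_000, 100)
```

Integer multiplication is used instead of `math.log` to avoid floating-point rounding at exact powers of the fanout.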
Example B+ Tree: Data (inverted list pointers) is at the leaves; intermediate nodes have copies of search keys. Search begins at root, and key comparisons direct it to a leaf. Search for be, bobcat... Based on the search for bobcat*, we know it is not in the tree! [Figure: root with keys art, best, but, dog; leaves (a, am, an, ant), (art, be), (best, bit, bob), (but, can, cry), (dog, dry, elf, fox).]
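The search procedure can be sketched in a few lines of Python over the example tree (root keys art, best, but, dog). The `Node` class and the convention that a key equal to a separator goes to the right child are assumptions of this sketch; the convention matches the example, where each separator is the first key of the leaf to its right.

```python
import bisect

class Node:
    def __init__(self, keys, children=None):
        self.keys = keys
        self.children = children   # None marks a leaf

def search(node, key):
    """Follow key comparisons from the root down to a leaf."""
    while node.children is not None:
        # Keys >= separator i live in child i + 1.
        node = node.children[bisect.bisect_right(node.keys, key)]
    return key in node.keys

root = Node(
    ["art", "best", "but", "dog"],
    [
        Node(["a", "am", "an", "ant"]),
        Node(["art", "be"]),
        Node(["best", "bit", "bob"]),
        Node(["but", "can", "cry"]),
        Node(["dog", "dry", "elf", "fox"]),
    ],
)

found_be = search(root, "be")
found_bobcat = search(root, "bobcat")
```

The search for bobcat ends at the leaf (best, bit, bob); because that is the only leaf where the key could live, its absence there proves it is not in the tree.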
Inserting data into a B+ Tree: Find correct leaf L. Put data entry onto L. If L has enough space, we are done! Else, must split leaf node L (into L and a new node L2): redistribute entries evenly, copy up middle key, and insert index entry pointing to L2 into parent of L. This can happen recursively. To split index node, redistribute entries evenly, but push up middle key. (Contrast with leaf splits.) Splits grow tree; root split increases height. Tree growth: gets wider or one level taller at the top. [Figure: root with keys art, best, but, dog; leaves (a, am, an, ant), (art, be), (best, bit, bob), (but, can, cry), (dog, dry, elf, fox).]
Inserting "and". Example: Copy up. Want to insert into leaf (a, am, an, ant); no room, so split & copy up. [Figure: the leaf splits into (a, am) and (an, and, ant).] Entry to be inserted in parent node: an. (Note that key an is copied up and continues to appear in the leaf.) But where? Parent node (root: art, best, but, dog) is already "full"!
Inserting "and". Example: Push up 1/2. Need to split node & push up. [Figure: overfull root with keys an, art, best, but, dog; leaves (a, am), (an, and, ant), (art, be), (best, bit, bob), (but, can, cry), (dog, dry, elf, fox).]
Inserting "and". Example: Push up 2/2. [Figure: new root with key best; index nodes (an, art) and (but, dog); leaves (a, am), (an, and, ant), (art, be), (best, bit, bob), (but, can, cry), (dog, dry, elf, fox).] Entry to be inserted in parent node: best. (Note that best is pushed up and only appears once in the index. Contrast this with a leaf split.)
Summary: Copying vs. splitting. Every keyword (search key) appears in at most one intermediate node. Hence, in splitting an intermediate node, we push up. Every inverted list entry must appear in a leaf. We may also need it in an intermediate node to define a partition point in the tree. We must copy up the key of this entry. Note that B+ trees easily accommodate multiple occurrences of a keyword.
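The difference between copy up and push up is easiest to see in code. The following is a minimal sketch of B+ tree insertion with order d = 2, replaying the insertion of "and" from the example. The `Node` class, the split point (middle index), and the omission of the leaf-level linked list and of deletion are simplifications assumed here, not the course's reference implementation.

```python
import bisect

D = 2  # order of the tree: a node holds at most 2 * D keys

class Node:
    def __init__(self, keys, children=None):
        self.keys = keys
        self.children = children   # None marks a leaf

def insert(root, key):
    """Insert key and return the (possibly new) root."""
    split = _insert(root, key)
    if split is None:
        return root
    sep, right = split
    return Node([sep], [root, right])   # root split: one level taller

def _insert(node, key):
    """Return (separator, new right sibling) if node had to split."""
    if node.children is None:
        bisect.insort(node.keys, key)
        if len(node.keys) <= 2 * D:
            return None
        mid = len(node.keys) // 2
        right = Node(node.keys[mid:])
        node.keys = node.keys[:mid]
        return right.keys[0], right     # copy up: key stays in the leaf
    i = bisect.bisect_right(node.keys, key)
    split = _insert(node.children[i], key)
    if split is None:
        return None
    sep, right = split
    node.keys.insert(i, sep)
    node.children.insert(i + 1, right)
    if len(node.keys) <= 2 * D:
        return None
    mid = len(node.keys) // 2
    up = node.keys[mid]
    new_right = Node(node.keys[mid + 1:], node.children[mid + 1:])
    node.keys, node.children = node.keys[:mid], node.children[:mid + 1]
    return up, new_right                # push up: key leaves this node

root = Node(
    ["art", "best", "but", "dog"],
    [
        Node(["a", "am", "an", "ant"]),
        Node(["art", "be"]),
        Node(["best", "bit", "bob"]),
        Node(["but", "can", "cry"]),
        Node(["dog", "dry", "elf", "fox"]),
    ],
)
root = insert(root, "and")
```

After the call, the leaf split has copied "an" up (it still appears in the leaf), and the index split has pushed "best" up into a new root, where it appears exactly once.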
Some details: How would you choose the order of the tree? How would you find all the words starting with the letters 'com'? How would you delete something? Do you always have to split/merge?
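One possible approach to the 'com' question is a range scan: locate the first key that is not smaller than the prefix, then walk forward through the sorted keys while they still start with it. The sketch below uses a sorted Python list as a stand-in for the linked leaves of a B+ tree; `prefix_range` and the word list are hypothetical.

```python
import bisect

def prefix_range(sorted_keys, prefix):
    """Return all keys that start with prefix, using one search plus a scan."""
    lo = bisect.bisect_left(sorted_keys, prefix)   # first key >= prefix
    hi = lo
    while hi < len(sorted_keys) and sorted_keys[hi].startswith(prefix):
        hi += 1
    return sorted_keys[lo:hi]

# Stand-in for the sequence set of a B+ tree.
words = sorted(["cat", "comet", "come", "computer", "con", "dog", "com"])
com_words = prefix_range(words, "com")
```

In a real B+ tree, the initial search would descend from the root to a leaf, and the forward walk would follow the leaf-level linked list.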
Virtues of the B+ Tree: B+ tree and other indices are quite efficient: Height-balanced; log_F N cost to search. High fanout (F) means depth rarely more than 3 or 4. Almost always better than maintaining a sorted file. Typically, 67% occupancy on average. Berkeley DB library (C, C++, Java; Oracle) is a toolkit for B+ trees that you will be using later in the semester: Interface: open B+ Tree; get and put items based on key. Handles concurrency, caching, etc.
Example: B+ tree. Insert 15, 11, 12, 32,
How do we distribute a B+ Tree? We need to host the root at one machine and distribute the rest. What are the implications for scalability? Consider building the index as well as searching.
Eliminating the root: Sometimes we don't want a tree-structured system because the higher levels can be a central point of congestion or failure. Two strategies: modified tree structure (e.g., the BATON p2p tree; see Jagadish et al., VLDB 2005), or non-hierarchical structure (distributed hash table, discussed in a couple of weeks).
Recap: B+ trees. A very common data structure for indices. Used, e.g., in many file systems and many DBMS. Very efficient: Height-balanced; log_F N cost to search. High fanout (F) means depth rarely more than 3 or 4. Almost always better than maintaining a sorted file. Typically, 67% occupancy on average.