Index Building



Overview:
- Database tables
- Building flow (logical)
- Sequential
- Drawbacks
- Parallel processing
- Recovery
- Helpful rules

Database tables: word index
- Z97: word dictionary
- Z98: bitmap
- Z980: cache of bitmap updates
- Z95: words in document

Database tables: Z97
- translation from word to internal representation (sequence)
- same character set as documents

Database tables: Z98
- "bitmap" of word occurrence in documents
- each bitmap is physically made up of one or more records
- compressed
- one bitmap for every combination of word and index
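As a rough illustration of the Z98 idea, the sketch below keeps one bitmap per (word, index) pair and uses a Python integer as the bit set. The class name `BitmapIndex`, the index code "WRD" in the usage note, and the in-memory layout are assumptions made for illustration; compression and the split into physical records are not modelled.

```python
from collections import defaultdict


class BitmapIndex:
    """Toy bitmap index: one bitmap per (word, index code) pair."""

    def __init__(self):
        # Each bitmap is a Python int; bit n set means document n contains the word.
        self.bitmaps = defaultdict(int)

    def add(self, word, index_code, doc_number):
        """Record that `word` occurs in `doc_number` under `index_code`."""
        self.bitmaps[(word, index_code)] |= 1 << doc_number

    def documents(self, word, index_code):
        """Return the sorted document numbers whose bit is set."""
        bits = self.bitmaps.get((word, index_code), 0)
        return [n for n in range(bits.bit_length()) if (bits >> n) & 1]

    def both(self, word_a, word_b, index_code):
        """Documents containing both words: a bitwise AND of two bitmaps."""
        bits = (self.bitmaps.get((word_a, index_code), 0)
                & self.bitmaps.get((word_b, index_code), 0))
        return [n for n in range(bits.bit_length()) if (bits >> n) & 1]
```

For instance, after adding "index" to documents 3 and 7 and "build" to document 7, `both("index", "build", "WRD")` yields only document 7.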

Database tables: Z980
- cache of bitmap updates
- increases speed of large bitmap updates
- 1/1000

Database tables: Z95
- list of words and their location in a document
- adjacency
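One way to picture why word locations help with adjacency: if each word's positions in a document are stored, checking whether two words are neighbours is a position comparison. This is a minimal sketch under that assumption; the function names and the per-document dictionary are hypothetical and do not describe the actual Z95 record layout.

```python
def build_positions(doc_words):
    """Map each word to the list of positions where it occurs in one document."""
    positions = {}
    for pos, word in enumerate(doc_words):
        positions.setdefault(word, []).append(pos)
    return positions


def adjacent(positions, first, second):
    """True if `second` occurs immediately after `first` somewhere in the document."""
    following = set(positions.get(second, []))
    return any(p + 1 in following for p in positions.get(first, []))
```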

Database tables: heading index
- Z01: phrase dictionary
- Z02: phrase -> document mapping

Database tables: Z01
- filing phrase
- connection to authority database
- hash key (display text)

Building flow (word), stage 1: Retrieval + Sort
- read document
- prepare list of words and locations
- for each word, find the list of indices it belongs to
- sort according to words
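The stage 1 steps above can be sketched roughly as follows. The tokenising rule (lower-cased `\w+` runs), the tuple layout, and the `indices_for_word` callback are all assumptions for illustration; the source does not say how words are extracted or how index membership is decided.

```python
import re


def stage1(documents, indices_for_word):
    """Return (word, index_code, doc_number, position) tuples sorted by word.

    documents: dict of doc_number -> text
    indices_for_word: hypothetical callback giving the index codes a word belongs to
    """
    rows = []
    for doc_number, text in documents.items():
        for position, word in enumerate(re.findall(r"\w+", text.lower())):
            for index_code in indices_for_word(word):
                rows.append((word, index_code, doc_number, position))
    rows.sort()  # tuple comparison orders by word first
    return rows
```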

Building flow (word), stage 2: Word Dictionary
- read intermediate file from stage 1
- build up word dictionary (check + load)
- replace word with internal representation
- create 2nd intermediate file
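A minimal sketch of the "check + load" step, assuming the internal representation is a running sequence number and that the stage 1 rows are simple tuples. The function name `stage2` and the row layout are hypothetical.

```python
def stage2(sorted_rows, dictionary=None):
    """Replace each word with its sequence number, adding new words as needed.

    sorted_rows: (word, doc_number, position) tuples from stage 1
    dictionary: existing word -> sequence mapping to check against; may be omitted
    """
    dictionary = {} if dictionary is None else dictionary
    next_seq = max(dictionary.values(), default=0) + 1
    out = []
    for word, doc_number, position in sorted_rows:
        if word not in dictionary:       # check
            dictionary[word] = next_seq  # load
            next_seq += 1
        out.append((dictionary[word], doc_number, position))
    return dictionary, out
```

When the dictionary starts empty and the input is sorted by word, the output comes out ordered by sequence number as well.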

Building flow (word), stage 3: Sort + Build Z95
- sort intermediate file from stage 2 by document number
- create Z95 records
- load Z95 sequential file to database

Building flow (word), stage 4: Merge + Build Z98
- intermediate file from stage 2 is already sorted by word number
- split words into a number of files according to range of word numbers
- merge into Z98 records
- load sequential files
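The split-then-merge idea in stage 4 can be sketched as below. It assumes rows of (word number, document number), fixed-width word-number ranges, and an integer used as the bitmap; none of these details come from the source, and the real Z98 records are compressed.

```python
from itertools import groupby


def split_by_range(rows, range_size):
    """Distribute (word_number, doc_number) rows into buckets by word-number range."""
    buckets = {}
    for word_number, doc_number in rows:
        buckets.setdefault(word_number // range_size, []).append((word_number, doc_number))
    return buckets


def merge_bucket(rows):
    """Merge one bucket's rows into {word_number: bitmap}; rows must be sorted by word."""
    merged = {}
    for word_number, group in groupby(rows, key=lambda r: r[0]):
        bitmap = 0
        for _, doc_number in group:
            bitmap |= 1 << doc_number  # set the bit for this document
        merged[word_number] = bitmap
    return merged
```

Because the buckets cover disjoint word ranges, each one can be merged independently of the others.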

Building flow (heading), stage 1: Retrieval + Sort
- read document
- prepare list of phrases
- for each phrase, find the list of indices it belongs to
- sort according to hash key

Building flow (heading), stage 2: Phrase Dictionary
- read intermediate file from stage 1
- build up phrase dictionary
- generate unique key (acc sequence)
- load Z01 sequential file to database
- build Z02 (non-unique)

Building flow (heading), stage 3: Sort + Load Z02
- sort non-unique Z02 sequential file
- load Z02 sequential file to database

Sequential (word):
- every stage is handled by a single process
- each stage proceeds only after the previous stage has handled its input
- stage 4 proceeds only after all other stages have finished

Sequential (word): example from version 12.1
csh -f p_manage_01_a $1 >& $data_scratch/p_manage_01_a.log &
csh -f p_manage_01_b $1 >& $data_scratch/p_manage_01_b.log &
csh -f p_manage_01_c $1 >& $data_scratch/p_manage_01_c.log &
csh -f p_manage_01_d $1 >& $data_scratch/p_manage_01_d.log
csh -f p_manage_01_e $1 >& $data_scratch/p_manage_01_e.log

Sequential (word): scripts
- p_manage_01_a: retrieval
- p_manage_01_b: sort (by word)
- p_manage_01_c: build Z97
- p_manage_01_d: build Z95
- p_manage_01_e: merge + build Z98

Drawbacks:
- minimal parallel processing
- single process per stage
- no recoverability: Z97 could be reused, but the whole building process had to be rerun
- computer resources not fully utilized
- long run time

Parallel processing:
- large databases, multiple processors
- identify stages that are not "workflow" bottlenecks
- coordinate parallel processes with an assignment/progress table

Parallel processing (word), stage 1: Retrieval + Sort
- retrieval is parallel: it is an "io" bottleneck, not a "workflow" bottleneck
- split into cycles, each covering a range of document numbers
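A rough sketch of how cycles over document-number ranges might be handed out to parallel processes. The "-" (not processed) and "?" (in process) signs follow the recovery slides; the dictionary layout and function names are assumptions, and a real assignment table shared between processes would also need locking, which is omitted here.

```python
def make_cycles(first_doc, last_doc, cycle_size):
    """Split a document-number range into cycles, all initially not processed ("-")."""
    cycles = []
    start = first_doc
    while start <= last_doc:
        end = min(start + cycle_size - 1, last_doc)
        cycles.append({"from": start, "to": end, "state": "-"})
        start = end + 1
    return cycles


def claim_next(cycles):
    """Mark the first unprocessed cycle as in process ("?") and return it, or None."""
    for cycle in cycles:
        if cycle["state"] == "-":
            cycle["state"] = "?"
            return cycle
    return None
```

Each retrieval process would call `claim_next` repeatedly until it returns `None`.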

Parallel processing (word): p_manage_01_a.cycles, initial state [table not preserved in transcript]

Parallel processing (word): p_manage_01_a.cycles, 3 processes, 1st retrieval cycle [table not preserved in transcript; surviving cells: 0001 ? ? ?]

Parallel processing (word): p_manage_01_a.cycles, 3 processes, 2nd retrieval cycle [table not preserved in transcript; surviving cells: ? ? ? ? ?]

Parallel processing (word):
- whenever possible, stages were split into separate sub-stages, usually in the case of non-parallel stages
- stages 2 and 3 were not made into parallel processes, because retrieval was by far the most costly stage

Parallel processing (word): stages 2 and 3 were subdivided into 3 sub-stages
- build Z97 + load
- sort intermediate file by document number
- build Z95 + load

Parallel processing (word): p_manage_01_a.cycles, example [table not preserved in transcript; surviving cells: ? ? ? ? ? ?]

Parallel processing (word): stage 4 is split into sub-stages
- pre-processing of intermediate files from stage 2 (distribution of words)
- build Z98 (parallel)
- load Z98 sequential file
Input files are compressed and stored in a separate directory.

Parallel processing (word): pre-processing
- generate histogram: number of lines per 5000 words
- determine range of words: no more than 1G in intermediate files
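The histogram-then-ranges idea can be sketched as follows. The source caps each range by intermediate-file size (1G); this sketch substitutes a line-count cap (`max_lines`) as a stand-in, and the function names and greedy grouping strategy are assumptions rather than the documented method.

```python
def histogram(word_numbers, bucket_width=5000):
    """Count intermediate-file lines per block of `bucket_width` word numbers."""
    counts = {}
    for word_number in word_numbers:
        bucket = word_number // bucket_width
        counts[bucket] = counts.get(bucket, 0) + 1
    return counts


def word_ranges(counts, max_lines, bucket_width=5000):
    """Group consecutive buckets into ranges holding at most `max_lines` lines each.

    A single bucket larger than `max_lines` still gets a range of its own.
    """
    ranges, start, total = [], None, 0
    for bucket in sorted(counts):
        if start is not None and total + counts[bucket] > max_lines:
            # Close the current range just before this bucket begins.
            ranges.append((start * bucket_width, bucket * bucket_width - 1))
            start, total = None, 0
        if start is None:
            start = bucket
        total += counts[bucket]
        last = bucket
    if start is not None:
        ranges.append((start * bucket_width, (last + 1) * bucket_width - 1))
    return ranges
```

Each resulting (low, high) word-number range would then correspond to one file handed to a parallel Z98 build.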

Parallel processing (word): p_manage_01_e.cycles [table not preserved in transcript]

Parallel processing (word): build Z98
- intermediate files are split into discrete ranges of words
- parallel merging and building of Z98

Parallel processing (word): p_manage_01_e.cycles, example [table not preserved in transcript; surviving cells: ? ? ?]

Parallel processing (heading), stage 1: Retrieval + Sort
- same handling as word index stage 1
- "io" bottleneck
- split into cycles, each covering a range of document numbers

Parallel processing (heading): p_manage_02.cycles [table not preserved in transcript]

Parallel processing (heading): stages 2 and 3 were subdivided into 3 sub-stages
- build Z01 + load + build Z02
- sort non-unique Z02 sequential file
- load Z02

Parallel processing (heading): p_manage_02.cycles, example [table not preserved in transcript; surviving cells: ? ? ? ? ? ?]

Parallel processing (heading): building of headings is conceptually and practically similar to word building, except for the building of bitmaps (Z98).

Recovery: word index
- stages 1-3 and stage 4 are separate
- stage 4 runs only after all processing is done in stage 3

Recovery: stage scenarios
- database tables need to be enlarged
- not enough disk space for intermediate files
- not enough disk space for sort
- general disaster?

Recovery: stages 1-3
- identify last successful section
- change "in process" signs (?) to "not processed" signs (-)
- rerun discrete stage scripts:
  - p_manage_01_a
  - p_manage_01_c
  - p_manage_01_d
  - p_manage_01_d1
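The sign reset in the recovery steps above could be scripted along these lines. The layout of the .cycles files is not shown in the transcript, so the whitespace-separated format used here is an assumption, as is the "+" used for finished entries in the usage note.

```python
def reset_in_process(lines):
    """Turn every "in process" sign (?) back into "not processed" (-).

    Each line is assumed to hold whitespace-separated fields; only fields that
    are exactly "?" are changed, so other text on the line is left alone.
    """
    reset = []
    for line in lines:
        fields = ["-" if field == "?" else field for field in line.split()]
        reset.append(" ".join(fields))
    return reset
```

For example, a line such as `0002 + ? -` would become `0002 + - -`, after which the stage script can be rerun and will pick that entry up again.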

Recovery: stage 4
- must be rerun in its entirety
- input files are saved and compressed ($word_compress_dir)
- p_manage_01_e

Helpful rules: stage 1 outrunning stages 2-3
- decide on the number of stage 1 processes to stop (p_manage_01_a)
- kill shell and program process
- reset associated cycle in p_manage_01_a.cycles

Helpful rules: log file names
- p_manage_01_a_{process_number}.log
- p_manage_01_e_{process_number}.log
- others are without process_number:
  - p_manage_01_c.log
  - p_manage_01_d.log
  - p_manage_01_d1.log
  - p_manage_01_e1.log
  - p_manage_01_e2.log

Helpful rules: cycle size
- # docs < 2M: 50k
- # docs < 4M: 100k
- otherwise: 200k
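The cycle-size rule is a simple threshold lookup. A minimal sketch, with the function name chosen here for illustration and the "<" comparisons taken literally from the rule:

```python
def cycle_size(num_docs):
    """Suggested cycle size (documents per cycle) for a given document count."""
    if num_docs < 2_000_000:
        return 50_000
    if num_docs < 4_000_000:
        return 100_000
    return 200_000
```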

Helpful rules: disk space calculation
- d = no. of documents
- c = no. of cycles
- p = no. of processors
- s = size of retrieval file

Helpful rules: sort space ($TMPDIR)
- sort = p*s + 20%, i.e. p*s*1.2
- covers stage 1 sort (parallel) + stage 2, 3 sorting (single file)

Helpful rules: scratch space
- scratch = p*1.5*s + c*s*1/3
- p*1.5*s: output from stage 1 (in process and not yet processed)
- c*s*1/3: output from stage 3

Helpful rules: example (UBU)
- d = 2M, cycle size = 50k
- p = 4, c = 40, s = ~0.5G
- sort = 4*0.5*1.2 = 2.4G
- scratch = 4*1.5*0.5 + 40*0.5*1/3 = 3G + 6.67G = 9.67G (the original slide gives 10.67G)
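The sort and scratch formulas from the helpful rules can be written as two small functions so the estimate is easy to rerun for other databases. The function names are chosen here for illustration; sizes are in gigabytes, and the results follow the formulas exactly as written.

```python
def sort_space(p, s):
    """Sort space in GB: p*s plus 20%."""
    return p * s * 1.2


def scratch_space(p, c, s):
    """Scratch space in GB: p*1.5*s for stage 1 output plus c*s/3 for stage 3 output."""
    return p * 1.5 * s + c * s / 3


if __name__ == "__main__":
    # Inputs from the UBU example: 4 processors, 40 cycles, ~0.5G retrieval file.
    print(f"sort    = {sort_space(4, 0.5):.2f}G")
    print(f"scratch = {scratch_space(4, 40, 0.5):.2f}G")
```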