Index Building.


Overview
  Database tables
  Building flow (logical)
  Sequential
  Drawbacks
  Parallel processing
  Recovery
  Helpful rules

Database tables
Word index:
  Z97 - word dictionary
  Z98 - bitmap
  Z980 - cache of bitmap updates
  Z95 - words in document

Database tables
Z97:
  translation from word to internal representation (sequence)
  same character set as documents

Database tables
Z98:
  "bitmap" of word occurrence in documents
  each bitmap is physically made up of one or more records
  compressed
  one bitmap for every combination of word and index
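The idea of a per-word occurrence bitmap can be illustrated with a minimal Python sketch. It is not the actual Z98 record layout: the function names, the use of a Python integer as the bitmap, and the (word, index) key are assumptions made for illustration, and compression and the split into multiple records are left out.

```python
def build_bitmaps(postings):
    """Build one bitmap per (word, index) pair.

    postings: iterable of (word_number, index_code, doc_number) tuples.
    Each bitmap is a Python int in which bit N is set when document N
    contains the word (hypothetical in-memory stand-in for a stored bitmap).
    """
    bitmaps = {}
    for word, index, doc in postings:
        key = (word, index)
        bitmaps[key] = bitmaps.get(key, 0) | (1 << doc)
    return bitmaps


def documents_in(bitmap):
    """Return the document numbers whose bits are set in the bitmap."""
    return [n for n in range(bitmap.bit_length()) if (bitmap >> n) & 1]


bitmaps = build_bitmaps([(1, "WRD", 3), (1, "WRD", 5), (2, "WRD", 3)])
docs_for_word_1 = documents_in(bitmaps[(1, "WRD")])
```

Combining two such bitmaps with `&` gives the documents that contain both words, which is the usual reason for storing occurrences this way.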

Database tables
Z980:
  cache of bitmap updates
  increases speed of large bitmap updates
  1/1000

Database tables
Z95:
  list of words and their location in a document
  adjacency

Database tables
Heading index:
  Z01 - phrase dictionary
  Z02 - phrase->document mapping

Database tables
Z01:
  filing phrase
  connection to authority database
  hash key (display text)

Building flow - word
Stage 1: Retrieval + Sort
  read document
  prepare list of words and locations
  for each word, find the list of indices it belongs to
  sort according to words

Building flow - word
Stage 2: Word Dictionary
  read intermediate file from stage 1
  build up word dictionary (check + load)
  replace word with internal representation
  create 2nd intermediate file

Building flow - word
Stage 3: Sort + Build Z95
  sort intermediate file from stage 2 by document number
  create Z95 records
  load Z95 sequential file to database

Building flow - word
Stage 4: Merge + Build Z98
  intermediate file from stage 2 is already sorted by word number
  split words into a number of files according to range of word numbers
  merge into Z98 records
  load sequential files
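The four word-index stages can be pictured as a small in-memory pipeline. The Python sketch below is only an illustration under simplifying assumptions: documents are plain strings split on whitespace, the lookup of which indices a word belongs to is omitted, everything fits in memory, and all function names are hypothetical rather than taken from the p_manage_01 scripts.

```python
from itertools import groupby


def stage1_retrieve_and_sort(docs):
    """Stage 1: list every (word, doc_number, position), sorted by word."""
    rows = [(word, doc_no, pos)
            for doc_no, text in docs.items()
            for pos, word in enumerate(text.split())]
    return sorted(rows)


def stage2_word_dictionary(rows):
    """Stage 2: assign each new word a sequence number and substitute it."""
    dictionary = {}
    out = []
    for word, doc_no, pos in rows:
        number = dictionary.setdefault(word, len(dictionary) + 1)
        out.append((number, doc_no, pos))
    return dictionary, out


def stage3_words_in_document(rows):
    """Stage 3: re-sort by document number and group (Z95-like records)."""
    by_doc = sorted(rows, key=lambda r: (r[1], r[2]))
    return {doc: [(num, pos) for num, _, pos in group]
            for doc, group in groupby(by_doc, key=lambda r: r[1])}


def stage4_occurrences(rows):
    """Stage 4: rows are still in word-number order; merge per word (Z98-like)."""
    return {num: sorted({doc for _, doc, _ in group})
            for num, group in groupby(rows, key=lambda r: r[0])}


docs = {1: "b a", 2: "a c"}
dictionary, numbered = stage2_word_dictionary(stage1_retrieve_and_sort(docs))
words_per_doc = stage3_words_in_document(numbered)
docs_per_word = stage4_occurrences(numbered)
```

Stage 4 can skip a sort here because stage 1 sorted by word and stage 2 handed out numbers in that same order, which mirrors the remark that the stage 2 file is already sorted by word number.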

Building flow - heading
Stage 1: Retrieval + Sort
  read document
  prepare list of phrases
  for each phrase, find the list of indices it belongs to
  sort according to hash key

Building flow - heading
Stage 2: Phrase Dictionary
  read intermediate file from stage 1
  build up phrase dictionary
  generate unique key - acc sequence
  load Z01 sequential file to database
  build Z02 - non unique

Building flow - heading
Stage 3: Sort + Load Z02
  sort non unique Z02 sequential file
  load Z02 sequential file to database

Sequential - word
  Every stage is handled by a single process
  A stage proceeds only after the previous stage has handled the data
  Stage 4 proceeds only after all other stages have finished

Sequential - word
Example from version 12.1:
  csh -f p_manage_01_a $1 >& $data_scratch/p_manage_01_a.log &
  csh -f p_manage_01_b $1 >& $data_scratch/p_manage_01_b.log &
  csh -f p_manage_01_c $1 >& $data_scratch/p_manage_01_c.log &
  csh -f p_manage_01_d $1 >& $data_scratch/p_manage_01_d.log
  csh -f p_manage_01_e $1 >& $data_scratch/p_manage_01_e.log

Sequential - word
  p_manage_01_a: retrieval
  p_manage_01_b: sort (by word)
  p_manage_01_c: build Z97
  p_manage_01_d: build Z95
  p_manage_01_e: merge + build Z98

Drawbacks
  Minimum parallel processing
  Single process per stage
  No recoverability - Z97 could be reused but the whole building process needed to be rerun
  Computer resources not fully utilized
  Long run time

Parallel processing
  Large databases - multiple processors
  Identify stages that are not "workflow" bottlenecks
  Coordinate parallel processes with assignment/progress table

Parallel processing (word)
Stage 1: Retrieval + Sort
  Retrieval is parallel - "io" not "workflow" bottleneck
  Split into cycles of document number ranges

Parallel processing (word)
p_manage_01_a.cycles - initial
  0001 - - - - 000000001 000010000
  0002 - - - - 000010001 000020000
  0003 - - - - 000020001 000030000
  0004 - - - - 000030001 000040000
  0005 - - - - 000040001 000050000
  0006 - - - - 000050001 000060000
  0007 - - - - 000060001 000070000
  0008 - - - - 000070001 000080000
  0009 - - - - 000080001 000090000
  0010 - - - - 000090001 000100000
  0011 - - - - 000100001 000110000
  0012 - - - - 000110001 000110511
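An initial cycles table of this shape can be derived from the highest document number and a cycle size. The Python sketch below assumes the layout seen in the example (4-digit cycle number, four status flags, 9-digit start and end document numbers); `make_cycles` is a hypothetical helper, not part of the real scripts.

```python
def make_cycles(last_doc, cycle_size, stages=4):
    """Return cycles-table lines covering documents 1..last_doc.

    Every stage flag starts as "-" (not processed). The last cycle is cut
    short so that it ends exactly at last_doc.
    """
    lines = []
    flags = " ".join("-" * stages)
    for number, start in enumerate(range(1, last_doc + 1, cycle_size), 1):
        end = min(start + cycle_size - 1, last_doc)
        lines.append(f"{number:04d} {flags} {start:09d} {end:09d}")
    return lines


cycles = make_cycles(last_doc=110511, cycle_size=10000)
```

With these inputs the helper yields twelve cycles, the last one covering documents 110001 to 110511.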

Parallel processing (word)
p_manage_01_a.cycles - 3 processes, 1st retrieval cycle
  0001 ? - - - 000000001 000010000
  0002 ? - - - 000010001 000020000
  0003 ? - - - 000020001 000030000
  0004 - - - - 000030001 000040000
  0005 - - - - 000040001 000050000
  0006 - - - - 000050001 000060000
  0007 - - - - 000060001 000070000
  0008 - - - - 000070001 000080000
  0009 - - - - 000080001 000090000
  0010 - - - - 000090001 000100000
  0011 - - - - 000100001 000110000
  0012 - - - - 000110001 000110511

Parallel processing (word)
p_manage_01_a.cycles - 3 processes, 2nd retrieval cycle
  0001 + + ? - 000000001 000010000
  0002 + ? - - 000010001 000020000
  0003 + - - - 000020001 000030000
  0004 ? - - - 000030001 000040000
  0005 ? - - - 000040001 000050000
  0006 ? - - - 000050001 000060000
  0007 - - - - 000060001 000070000
  0008 - - - - 000070001 000080000
  0009 - - - - 000080001 000090000
  0010 - - - - 000090001 000100000
  0011 - - - - 000100001 000110000
  0012 - - - - 000110001 000110511

Parallel processing (word)
  Whenever possible, stages were split into separate sub-stages, usually in the case of non-parallel stages
  Stages 2 and 3 were not made into parallel processes - retrieval was by far the most costly stage

Parallel processing (word)
Stages 2 and 3 were subdivided into 3 sub-stages:
  build Z97 + load
  sort intermediate file by document number
  build Z95 + load

Parallel processing (word)
p_manage_01_a.cycles - example
  0001 + + + + 000000001 000010000
  0002 + + + ? 000010001 000020000
  0003 + + ? - 000020001 000030000
  0004 + + - - 000030001 000040000
  0005 + ? - - 000040001 000050000
  0006 + - - - 000050001 000060000
  0007 ? - - - 000060001 000070000
  0008 ? - - - 000070001 000080000
  0009 ? - - - 000080001 000090000
  0010 - - - - 000090001 000100000
  0011 - - - - 000100001 000110000
  0012 - - - - 000110001 000110511
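One way to read the table is as a claim-and-complete protocol: a process takes the first cycle whose flag for its stage is "-" and whose previous stage is finished, marks it "?", and marks it "+" when done. The Python sketch below works under that assumption. The slides do not describe the actual coordination or locking mechanism, the meaning of "+" as "done" is inferred from the examples, and `claim` and `finish` are hypothetical names.

```python
def claim(table, stage):
    """Mark the first ready cycle as in process ("?") and return its number.

    table: list of dicts {"cycle": int, "flags": list of "-", "?" or "+"}.
    stage: 0-based column of the stage doing the claiming.
    A cycle is ready when this stage has not touched it yet and the
    previous stage, if there is one, has completed it.
    """
    for row in table:
        flags = row["flags"]
        previous_done = stage == 0 or flags[stage - 1] == "+"
        if flags[stage] == "-" and previous_done:
            flags[stage] = "?"
            return row["cycle"]
    return None  # nothing to do for this stage right now


def finish(table, cycle, stage):
    """Mark a claimed cycle as completed ("+") for the given stage."""
    for row in table:
        if row["cycle"] == cycle:
            row["flags"][stage] = "+"


table = [{"cycle": n, "flags": ["-"] * 4} for n in (1, 2, 3)]
first = claim(table, stage=0)      # a retrieval process takes cycle 1
second = claim(table, stage=0)     # another retrieval process takes cycle 2
blocked = claim(table, stage=1)    # stage 2 has no finished input yet
finish(table, first, stage=0)
next_for_stage_2 = claim(table, stage=1)
```

A real implementation would also need to make the read-modify-write of the table atomic across processes, which this in-memory version does not address.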

Parallel processing (word)
Stage 4 is split into sub-stages:
  pre-processing of intermediate files from stage 2 - distribution of words
  build Z98 - parallel
  load Z98 sequential file
Input files are compressed and stored in a separate directory

Parallel processing (word)
Pre-processing:
  generate histogram - # of lines per 5000 words
  determine range of words - no more than 1G in intermediate files
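The pre-processing step can be sketched as a histogram followed by a greedy split. In the Python sketch below the limit is expressed as a number of lines rather than bytes, the final range is left open-ended at 999999999 as in the p_manage_01_e.cycles example, and both function names are hypothetical.

```python
BUCKET = 5000            # words per histogram bucket, as on the slide
LAST_WORD = 999_999_999  # open-ended upper bound used for the final range


def histogram(word_numbers, bucket=BUCKET):
    """Count intermediate-file lines per bucket of `bucket` word numbers."""
    counts = {}
    for number in word_numbers:
        index = (number - 1) // bucket
        counts[index] = counts.get(index, 0) + 1
    return counts


def split_ranges(counts, max_lines, bucket=BUCKET):
    """Greedily group consecutive buckets so each range stays within max_lines.

    A single bucket that alone exceeds max_lines still gets its own range.
    """
    ranges = []
    start, total = 1, 0
    for index in sorted(counts):
        if total and total + counts[index] > max_lines:
            end = index * bucket          # last word of the previous bucket
            ranges.append((start, end))
            start, total = end + 1, 0
        total += counts[index]
    ranges.append((start, LAST_WORD))
    return ranges


lines = [1] * 3 + [5001] * 3 + [10001] * 2   # word number of each line
ranges = split_ranges(histogram(lines), max_lines=5)
```

Because the ranges follow the line counts rather than the word numbers, frequent low-numbered words end up in narrower ranges, which matches the uneven ranges in the example table.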

Parallel processing (word)
p_manage_01_e.cycles
  0001 - - 000000001 000600000
  0002 - - 000600001 000900000
  0003 - - 000900001 999999999

Parallel processing (word)
Build Z98:
  intermediate files - split into discrete ranges of words
  parallel merging and building of Z98

Parallel processing (word)
p_manage_01_e.cycles - example
  0001 + ? 000000001 000600000
  0002 ? - 000600001 000900000
  0003 ? - 000900001 999999999

Parallel processing (heading)
Stage 1: Retrieval + Sort
  same handling as word index stage 1
  "io" bottleneck
  Split into cycles of document number ranges

Parallel processing (heading)
p_manage_02.cycles
  0001 - - - - 000000001 000005000
  0002 - - - - 000005001 000010000
  0003 - - - - 000010001 000015000
  0004 - - - - 000015001 000020000
  0005 - - - - 000020001 000025000
  0006 - - - - 000025001 000030000
  0007 - - - - 000030001 000035000
  0008 - - - - 000035001 000040000
  0009 - - - - 000040001 000045000
  0010 - - - - 000045001 000048435

Parallel processing (heading)
Stages 2 and 3 were subdivided into 3 sub-stages:
  build Z01 + load + build Z02
  sort non unique Z02 sequential file
  load Z02

Parallel processing (heading)
p_manage_02.cycles - example
  0001 + + + ? 000000001 000005000
  0002 + + ? - 000005001 000010000
  0003 + + - - 000010001 000015000
  0004 + ? - - 000015001 000020000
  0005 + - - - 000020001 000025000
  0006 ? - - - 000025001 000030000
  0007 ? - - - 000030001 000035000
  0008 ? - - - 000035001 000040000
  0009 - - - - 000040001 000045000
  0010 - - - - 000045001 000048435

Parallel processing (heading)
  Building of headings is conceptually and practically similar to word building, except for the building of bitmaps (Z98)

Recovery
Word index:
  stages 1-3 and stage 4 are separate
  stage 4 runs only after all processing is done in stage 3

Recovery
Stages 1-3 - scenarios:
  database tables need to be enlarged
  not enough disk space - intermediate files
  not enough disk space - sort
  general disaster?

Recovery
Stages 1-3:
  identify last successful section
  change "in process" signs (?) to "not processed" signs (-)
  rerun discrete stage scripts:
    p_manage_01_a
    p_manage_01_c
    p_manage_01_d
    p_manage_01_d1
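Resetting the "in process" signs is a mechanical edit of the cycles file. The Python sketch below shows one way to do it on a single line, assuming the whitespace-separated layout from the examples (cycle number, status flags, start and end document numbers); `reset_in_process` is a hypothetical helper, not one of the shipped scripts.

```python
def reset_in_process(line):
    """Turn every "?" status flag on a cycles-table line back into "-".

    Only the flag columns are touched; the cycle number and the two
    document numbers at the end of the line are passed through unchanged.
    """
    fields = line.split()
    cycle, flags, doc_range = fields[0], fields[1:-2], fields[-2:]
    flags = ["-" if flag == "?" else flag for flag in flags]
    return " ".join([cycle, *flags, *doc_range])


before = [
    "0002 + + + ? 000010001 000020000",
    "0005 + ? - - 000040001 000050000",
    "0007 ? - - - 000060001 000070000",
]
after = [reset_in_process(line) for line in before]
```

Completed ("+") flags are left alone, so only the interrupted work is repeated when the stage scripts are rerun.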

Recovery
Stage 4:
  must be rerun in totality
  input files are saved and compressed: $word_compress_dir
  p_manage_e

Helpful rules
Stage 1 outrunning stages 2-3:
  decide on number of stage 1 processes to stop (p_manage_01_a)
  kill shell and program process
  reset associated cycle in p_manage_01_a.cycles

Helpful rules
Log file names:
  p_manage_01_a_{process_number}.log
  p_manage_01_e_{process_number}.log
Others are without process_number:
  p_manage_01_c.log
  p_manage_01_d.log
  p_manage_01_d1.log
  p_manage_01_e.log
  p_manage_01_e2.log

Helpful rules
Cycle size:
  # docs < 0.5M - 20k
  # docs < 2M - 50k
  otherwise - 200k
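The cycle-size rule of thumb translates directly into a small lookup. In the Python sketch below the thresholds are read as strict "less than" comparisons, exactly as written on the slide; `cycle_size` is a hypothetical name.

```python
def cycle_size(num_docs):
    """Suggested number of documents per cycle for a database of num_docs."""
    if num_docs < 500_000:
        return 20_000
    if num_docs < 2_000_000:
        return 50_000
    return 200_000


small = cycle_size(110_511)
medium = cycle_size(1_500_000)
large = cycle_size(5_000_000)
```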

Helpful rules
Disk space calculation:
  d = number of documents
  c = number of cycles
  p = number of processors
  s = size of retrieval file

Helpful rules
Sort space ($TMPDIR):
  sort = p*s + 20%
  stage 1 sort (parallel) + stage 2,3 sorting (single file)

Helpful rules
Scratch space:
  scratch = p*1.5*s + c*s*1/3
  output from stage 1 (in process and not yet processed) + output from stage 3

Helpful rules
Example: UBU
  d = 2M, cycle size = 50k
  p = 4, c = 40, s = ~0.5G
  sort = 4*0.5*1.2 = 2.4G
  scratch = 4*1.5*0.5 + 40*0.5*1/3 = 3G + 6.67G = 9.67G
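The two space formulas are easy to evaluate for other configurations. The Python sketch below restates them as functions, with sizes in gigabytes and "p*s + 20%" read as p*s*1.2, matching the worked example; the function names are hypothetical.

```python
import math


def sort_space(p, s):
    """Sort space in $TMPDIR: p retrieval files of size s, plus 20%."""
    return p * s * 1.2


def scratch_space(p, c, s):
    """Scratch space: stage 1 output (p*1.5*s) plus stage 3 output (c*s/3)."""
    return p * 1.5 * s + c * s / 3


docs, docs_per_cycle = 2_000_000, 50_000
c = math.ceil(docs / docs_per_cycle)   # number of cycles
p, s = 4, 0.5                          # processors, retrieval file size in G

sort_g = sort_space(p, s)
scratch_g = scratch_space(p, c, s)
```

Here `c` is derived from the document count and cycle size rather than entered by hand, so changing either input keeps the estimate consistent.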