Index Building.

1 Index Building

2 Overview
Database tables
Building flow (logical)
Sequential
Drawbacks
Parallel processing
Recovery
Helpful rules

3 Database tables
Word index:
  Z97 - word dictionary
  Z98 - bitmap
  Z980 - cache of bitmap updates
  Z95 - words in document

4 Database tables
Z97
  translation from word to internal representation (sequence)
  same character set as documents

5 Database tables
Z98
  “bitmap” of word occurrence in documents
  each bitmap is physically made up of one or more records
  compressed
  one bitmap for every combination of word and index
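The bitmap idea can be illustrated with a minimal sketch. It assumes a bitmap is modeled as a Python integer whose bit n is set when document n contains the word. The slides do not describe the actual Z98 record layout or compression, so the function names and the index names `WTI` and `WRD` are hypothetical.

```python
from collections import defaultdict

def build_bitmaps(postings):
    """Build one bitmap per (word, index) combination.

    postings: iterable of (word, index_name, doc_number) tuples.
    Each bitmap is an int; bit n is set if document n contains the word.
    """
    bitmaps = defaultdict(int)
    for word, index_name, doc_number in postings:
        bitmaps[(word, index_name)] |= 1 << doc_number
    return dict(bitmaps)

def docs_in(bitmap):
    """Return the sorted document numbers whose bits are set."""
    return [n for n in range(bitmap.bit_length()) if (bitmap >> n) & 1]

# Hypothetical postings: the same word appears in two different indices.
postings = [
    ("index", "WTI", 1),
    ("index", "WTI", 4),
    ("index", "WRD", 4),
    ("build", "WRD", 2),
]
bitmaps = build_bitmaps(postings)
print(docs_in(bitmaps[("index", "WTI")]))
```

With this representation, intersecting two bitmaps with `&` yields the documents that match both terms, which is the usual motivation for a bitmap index.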

6 Database tables
Z980
  cache of bitmap updates
  increases speed of large bitmap updates
  1/1000

7 Database tables
Z95
  list of words and their locations in a document
  adjacency
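A short sketch of why per-document word positions support adjacency searching. It assumes a Z95 record can be modeled as a mapping from word to a list of positions within one document. The real record format is not given in the slides, and `adjacent` is a hypothetical helper.

```python
def adjacent(doc_words, first, second):
    """Return True if `second` occurs immediately after `first` in the document.

    doc_words: dict mapping word -> list of positions in one document
    (a stand-in for a Z95 record).
    """
    following = set(doc_words.get(second, []))
    return any(pos + 1 in following for pos in doc_words.get(first, []))

# Hypothetical record for a document reading
# "index building ... index flow" (positions 0, 1, 5, 6).
z95_record = {"index": [0, 5], "building": [1], "flow": [6]}
print(adjacent(z95_record, "index", "building"))
```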

8 Database tables
Heading index:
  Z01 - phrase dictionary
  Z02 - phrase->document mapping

9 Database tables
Z01:
  filing phrase
  connection to authority database
  hash key (display text)

10 Building flow - word
Stage 1: Retrieval + Sort
  read document
  prepare list of words and locations
  for each word find list of indices it belongs to
  sort according to words
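The steps of stage 1 can be sketched as follows. This is an illustration only: the tokenizer, the row layout, and the `indices_for` callback (which decides which indices a word belongs to) are assumptions, not the actual retrieval program.

```python
import re

def retrieve_and_sort(documents, indices_for):
    """Stage 1 sketch: emit (word, index, doc_number, position) rows sorted by word.

    documents: dict mapping doc_number -> text.
    indices_for: hypothetical callback returning the index names for a word.
    """
    rows = []
    for doc_number, text in documents.items():
        words = re.findall(r"\w+", text.lower())
        for position, word in enumerate(words):
            for index_name in indices_for(word):
                rows.append((word, index_name, doc_number, position))
    rows.sort()  # tuples sort by word first, then index, document, position
    return rows

documents = {1: "Index building", 2: "Building flow"}
rows = retrieve_and_sort(documents, lambda word: ["WRD"])
for row in rows:
    print(row)
```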

11 Building flow - word
Stage 2: Word Dictionary
  read intermediate file from stage 1
  build up word dictionary (check + load)
  replace word with internal representation
  create 2nd intermediate file
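A minimal sketch of the check + load step, assuming the internal representation is a running sequence number and the intermediate file is a list of `(word, doc_number, position)` rows. Both the row layout and the function name are assumptions for illustration.

```python
def apply_word_dictionary(rows, dictionary=None):
    """Stage 2 sketch: replace each word by its sequence number.

    rows: iterable of (word, doc_number, position), sorted by word.
    dictionary: optional existing word -> sequence mapping to extend.
    Returns (dictionary, second_intermediate_rows).
    """
    dictionary = {} if dictionary is None else dictionary
    next_seq = max(dictionary.values(), default=0) + 1
    output = []
    for word, doc_number, position in rows:
        if word not in dictionary:       # check
            dictionary[word] = next_seq  # load a new dictionary entry
            next_seq += 1
        output.append((dictionary[word], doc_number, position))
    return dictionary, output

stage1_rows = [("building", 1, 1), ("building", 2, 0), ("flow", 2, 1), ("index", 1, 0)]
word_dict, stage2_rows = apply_word_dictionary(stage1_rows)
print(word_dict)
print(stage2_rows)
```

Because the input is sorted by word and numbers are handed out in that order, the output of a fresh build comes out ordered by word number as well.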

12 Building flow - word
Stage 3: Sort + Build Z95
  sort intermediate file from stage 2 by document number
  create Z95 records
  load Z95 sequential file to database
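As an illustration of stage 3, the sketch below re-sorts the stage 2 rows by document number and groups them into one record per document. The `(word_number, doc_number, position)` row layout and the dictionary output are assumptions standing in for the sequential file.

```python
from itertools import groupby
from operator import itemgetter

def build_z95(rows):
    """Stage 3 sketch: sort by document number and group rows per document.

    rows: iterable of (word_number, doc_number, position).
    Returns {doc_number: [(word_number, position), ...]}.
    """
    by_doc = sorted(rows, key=itemgetter(1, 0, 2))
    return {
        doc: [(word_number, position) for word_number, _, position in group]
        for doc, group in groupby(by_doc, key=itemgetter(1))
    }

stage2_rows = [(1, 1, 1), (1, 2, 0), (2, 2, 1), (3, 1, 0)]
z95 = build_z95(stage2_rows)
print(z95)
```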

13 Building flow - word
Stage 4: Merge + Build Z98
  intermediate file from stage 2 already sorted by word number
  split words into a number of files according to range of word numbers
  merge into Z98 records
  load sequential files
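The split and merge steps can be sketched as below. It assumes rows of `(word_number, doc_number)` and models each merged record as an integer bitmap. The bucket boundaries are passed in by hand here, and all names are hypothetical.

```python
from bisect import bisect_left

def split_by_word_range(rows, upper_bounds):
    """Distribute (word_number, doc_number) rows into buckets by word-number range.

    upper_bounds: ascending, inclusive upper word number of each bucket.
    """
    buckets = [[] for _ in upper_bounds]
    for word_number, doc_number in rows:
        i = bisect_left(upper_bounds, word_number)
        if i == len(upper_bounds):
            raise ValueError(f"word number {word_number} is above the last bound")
        buckets[i].append((word_number, doc_number))
    return buckets

def merge_bucket(bucket):
    """Merge one bucket into word_number -> bitmap (bit n set for document n)."""
    bitmaps = {}
    for word_number, doc_number in bucket:
        bitmaps[word_number] = bitmaps.get(word_number, 0) | (1 << doc_number)
    return bitmaps

rows = [(1, 1), (1, 2), (2, 2), (3, 1)]
buckets = split_by_word_range(rows, [2, 3])
print([merge_bucket(b) for b in buckets])
```

Since the buckets cover disjoint word ranges, each one can be merged without reference to the others.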

14 Building flow - heading
Stage 1: Retrieval + Sort
  read document
  prepare list of phrases
  for each phrase find list of indices it belongs to
  sort according to hash key

15 Building flow - heading
Stage 2: Phrase Dictionary
  read intermediate file from stage 1
  build up phrase dictionary
  generate unique key - acc sequence
  load Z01 sequential file to database
  build Z02 - non unique
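A rough sketch of heading stage 2, under several assumptions: the input rows are `(hash_key, phrase, doc_number)` sorted by hash key, the unique key is a running sequence number, and Z02 is a list of `(sequence, doc_number)` pairs that may contain duplicates. How the hash key itself is computed is not described in the slides, so it is taken as given.

```python
def build_phrase_dictionary(rows):
    """Heading stage 2 sketch.

    rows: iterable of (hash_key, phrase, doc_number), sorted by hash_key.
    Returns (z01, z02):
      z01 maps hash_key -> (acc_sequence, phrase), one entry per phrase;
      z02 lists (acc_sequence, doc_number) pairs, duplicates included.
    """
    z01, z02 = {}, []
    for hash_key, phrase, doc_number in rows:
        if hash_key not in z01:
            z01[hash_key] = (len(z01) + 1, phrase)  # new unique sequence
        z02.append((z01[hash_key][0], doc_number))
    return z01, z02

# Hypothetical rows; "h1" and "h2" stand in for real hash keys.
rows = [
    ("h1", "Smith, John", 1),
    ("h1", "Smith, John", 1),
    ("h2", "Index building", 2),
]
z01, z02 = build_phrase_dictionary(rows)
print(z01)
print(z02)
```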

16 Building flow - heading
Stage 3: Sort + Load Z02
  sort non unique Z02 sequential file
  load Z02 sequential file to database

17 Sequential - word
Every stage was handled by a single process
A stage proceeded only after the previous stage had handled the data
Stage 4 proceeded only after all other stages were finished

18 Sequential - word
Example from version 12.1:
csh -f p_manage_01_a $1 >& $data_scratch/p_manage_01_a.log &
csh -f p_manage_01_b $1 >& $data_scratch/p_manage_01_b.log &
csh -f p_manage_01_c $1 >& $data_scratch/p_manage_01_c.log &
csh -f p_manage_01_d $1 >& $data_scratch/p_manage_01_d.log
csh -f p_manage_01_e $1 >& $data_scratch/p_manage_01_e.log

19 Sequential - word
p_manage_01_a: retrieval
p_manage_01_b: sort (by word)
p_manage_01_c: build Z97
p_manage_01_d: build Z95
p_manage_01_e: merge + build Z98

20 Drawbacks
Minimal parallel processing
Single process per stage
No recoverability - Z97 could be reused but the whole building process needed to be rerun
Computer resources not fully utilized
Long run time

21 Parallel processing
Large databases - multiple processors
Identify stages that are not “workflow” bottlenecks
Coordinate parallel processes with assignment/progress table

22 Parallel processing (word)
Stage 1: Retrieval + Sort
  retrieval is parallel - “io” not “workflow” bottleneck
  split into cycles, each covering a range of document numbers
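Splitting the document range into cycles might look like the following sketch. The four-digit cycle numbers follow the cycle tables in the slides. The function name, the tuple layout, and the assumption that document numbers run contiguously from a first to a last value are illustrative only.

```python
def make_cycles(first_doc, last_doc, cycle_size):
    """Return [(cycle_number, start_doc, end_doc), ...] covering first_doc..last_doc.

    Both ends are inclusive; the final cycle may be shorter than cycle_size.
    """
    cycles = []
    starts = range(first_doc, last_doc + 1, cycle_size)
    for n, start in enumerate(starts, start=1):
        end = min(start + cycle_size - 1, last_doc)
        cycles.append((f"{n:04d}", start, end))
    return cycles

for cycle in make_cycles(1, 120_000, 50_000):
    print(cycle)
```

Each cycle is an independent unit of retrieval work, so several processes can each take a different cycle at the same time.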

23 Parallel processing (word)
p_manage_01_a.cycles - initial

24 Parallel processing (word)
p_manage_01_a.cycles - 3 processes, 1st retrieval cycle
0001 ? 0002 ? 0003 ?

25 Parallel processing (word)
p_manage_01_a.cycles - 3 processes, 2nd retrieval cycle
? ? 0004 ? 0005 ? 0006 ?
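How three processes might pick up work from a cycle table can be sketched as follows. The signs `?` (in process) and `-` (not processed) are the ones the slides use. The `+` sign for a finished cycle, the in-memory dictionary, and the function names are assumptions. A real shared file would also need locking so that two processes cannot claim the same cycle.

```python
NOT_PROCESSED = "-"
IN_PROCESS = "?"
DONE = "+"  # hypothetical sign for a finished cycle

def claim_next_cycle(table):
    """Mark the first not-processed cycle as in process and return its number.

    table: dict mapping cycle_number -> sign, in cycle order.
    Returns None when there is nothing left to claim.
    """
    for cycle, sign in table.items():
        if sign == NOT_PROCESSED:
            table[cycle] = IN_PROCESS
            return cycle
    return None

table = {f"{n:04d}": NOT_PROCESSED for n in range(1, 7)}

# 1st retrieval cycle: three processes each claim one entry.
first_round = [claim_next_cycle(table) for _ in range(3)]

# The three finish, then claim again for the 2nd retrieval cycle.
for cycle in first_round:
    table[cycle] = DONE
second_round = [claim_next_cycle(table) for _ in range(3)]
print(first_round, second_round, table)
```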

26 Parallel processing (word)
Whenever possible, stages were split into separate sub-stages, usually in the case of non-parallel stages
Stages 2 and 3 were not made into parallel processes - retrieval was by far the most costly stage

27 Parallel processing (word)
Stages 2 and 3 were subdivided into 3 sub-stages:
  build Z97 + load
  sort intermediate file by document number
  build Z95 + load

28 Parallel processing (word)
p_manage_01_a.cycles - example
? ? ? 0007 ? 0008 ? 0009 ?

29 Parallel processing (word)
Stage 4 is split into sub-stages:
  pre-processing of intermediate files from stage 2 - distribution of words
  build Z98 - parallel
  load Z98 sequential file
Input files are compressed and stored in a separate directory

30 Parallel processing (word)
Pre-processing:
  generate histogram - # of lines per 5000 words
  determine range of words - no more than 1G in intermediate files
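The pre-processing step can be sketched as a histogram followed by a greedy grouping. This is one plausible reading, not the actual program: line counts stand in for file size, the limit is passed as `max_lines` rather than 1G, and the function names are hypothetical. The bin size defaults to the 5000 words named on the slide.

```python
from collections import Counter

def histogram(word_numbers, bin_size=5000):
    """Count intermediate-file lines per bin of bin_size consecutive word numbers.

    Bin b covers word numbers b*bin_size + 1 through (b + 1)*bin_size.
    """
    return Counter((w - 1) // bin_size for w in word_numbers)

def word_ranges(hist, max_lines, bin_size=5000):
    """Greedily group consecutive bins so each range holds at most max_lines lines.

    A single bin above max_lines still gets a range of its own.
    Returns [(first_word, last_word), ...] with inclusive bounds.
    """
    ranges, start, total, last = [], None, 0, None
    for b in sorted(hist):
        if start is not None and total + hist[b] > max_lines:
            ranges.append((start * bin_size + 1, b * bin_size))
            start, total = None, 0
        if start is None:
            start = b
        total += hist[b]
        last = b
    if start is not None:
        ranges.append((start * bin_size + 1, (last + 1) * bin_size))
    return ranges

# Small bins keep the example readable: one word number per line.
lines = [1, 2, 3, 11, 12, 21, 22, 23, 24, 35]
hist = histogram(lines, bin_size=10)
print(word_ranges(hist, max_lines=5, bin_size=10))
```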

31 Parallel processing (word)
p_manage_01_e.cycles

32 Parallel processing (word)
Build Z98:
  intermediate files - split into discrete ranges of words
  parallel merging and building of Z98

33 Parallel processing (word)
p_manage_01_e.cycles - example
? 0002 ? 0003 ?

34 Parallel processing (heading)
Stage 1: Retrieval + Sort
  same handling as word index stage 1
  “io” bottleneck
  split into cycles, each covering a range of document numbers

35 Parallel processing (heading)
p_manage_02.cycles

36 Parallel processing (heading)
Stages 2 and 3 were subdivided into 3 sub-stages:
  build Z01 + load + build Z02
  sort non unique Z02 sequential file
  load Z02

37 Parallel processing (heading)
p_manage_02.cycles - example
? ? ? 0006 ? 0007 ? 0008 ?

38 Parallel processing (heading)
Building of headings is conceptually and practically similar to word building, except for the building of bitmaps (Z98)

39 Recovery
Word index: stages 1-3 and stage 4 are separate
Stage 4 runs only after all processing is done in stage 3

40 Recovery
Stages 1-3 - scenarios:
  database tables need to be enlarged
  not enough disk space - intermediate files
  not enough disk space - sort
  general disaster?

41 Recovery
Stages 1-3:
  identify last successful section
  change “in process” signs (?) to “not processed” sign (-)
  rerun discrete stage scripts:
    p_manage_01_a
    p_manage_01_c
    p_manage_01_d
    p_manage_01_d1
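Resetting the signs by hand is easy to get wrong on a long table, so a small helper could do it. The sketch assumes each line of the cycles file starts with the cycle number followed by one or more signs, which matches the residue of the tables in the slides but is not confirmed by them. The `+` sign for a finished entry is hypothetical.

```python
def reset_in_process(lines):
    """Change every in-process sign (?) to not processed (-).

    lines: list of cycle-table lines such as "0002 ?".
    Returns (fixed_lines, reset_cycles), where reset_cycles lists the
    cycle numbers that were touched and therefore need to be rerun.
    """
    fixed, reset_cycles = [], []
    for line in lines:
        if "?" in line:
            reset_cycles.append(line.split()[0])
            line = line.replace("?", "-")
        fixed.append(line)
    return fixed, reset_cycles

table = ["0001 +", "0002 ?", "0003 -", "0004 ?"]
fixed, to_rerun = reset_in_process(table)
print(fixed)
print(to_rerun)
```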

42 Recovery
Stage 4: must be rerun in totality
  input files are saved and compressed
  $word_compress_dir
  p_manage_e

43 Helpful rules
Stage 1 outrunning stages 2-3:
  decide on number of stage 1 processes to stop (p_manage_01_a)
  kill shell and program process
  reset associated cycle in p_manage_01_a.cycles

44 Helpful rules
Log file names:
  p_manage_01_a_{process_number}.log
  p_manage_01_e_{process_number}.log
Others are without process_number:
  p_manage_01_c.log
  p_manage_01_d.log
  p_manage_01_d1.log
  p_manage_01_e.log
  p_manage_01_e2.log

45 Helpful rules
Cycle size:
  # docs < .5M - 20k
  # docs < 2M - 50k
  otherwise k

46 Helpful rules
Disk space calculation:
  d = number of documents
  c = number of cycles
  p = number of processors
  s = size of retrieval file

47 Helpful rules
Sort space ($TMPDIR):
  sort = p*s + 20%
  stage 1 sort (parallel) + stage 2,3 sorting (single file)

48 Helpful rules
Scratch space:
  scratch = p*1.5*s + c*s*1/3
  output from stage 1 (in process and not yet processed) + output from stage 3

49 Helpful rules
Example: UBU
  d = 2M, cycle size = 50k
  p = 4, c = 40, s = ~0.5G
  sort = 4*0.5*1.2 = 2.4G
  scratch = 4*1.5*0.5 + 40*0.5*1/3 = 3G + 6.67G = 9.67G (the slide's total reads 10.67G)
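The two space formulas are simple enough to script. The sketch below evaluates sort = p*s + 20% and scratch = p*1.5*s + c*s*1/3 for the inputs of the example (p=4, c=40, s=0.5G). The function names are hypothetical, and s is treated as exactly 0.5 although the slide gives it as approximate.

```python
def sort_space(p, s):
    """Sort space in the units of s: p parallel retrieval files plus 20%."""
    return p * s * 1.2

def scratch_space(p, c, s):
    """Scratch space in the units of s: stage 1 output plus stage 3 output."""
    return p * 1.5 * s + c * s / 3

p, c, s = 4, 40, 0.5  # processors, cycles, retrieval file size in G
print(f"sort    = {sort_space(p, s):.2f}G")
print(f"scratch = {scratch_space(p, c, s):.2f}G")
```

Running the numbers before a build gives an early warning for the "not enough disk space" scenarios listed under Recovery.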

