Index Building: Overview, Database tables, Building flow (logical), Sequential, Drawbacks, Parallel processing, Recovery, Helpful rules


1 Index Building

2 Overview: Database tables; Building flow (logical); Sequential; Drawbacks; Parallel processing; Recovery; Helpful rules.

3 Database tables. Word index: Z97 - word dictionary; Z98 - bitmap; Z980 - cache of bitmap updates; Z95 - words in document.

4 Database tables. Z97: translation from word to internal representation (sequence); same character set as documents.

5 Database tables. Z98: “bitmap” of word occurrence in documents; each bitmap is physically made up of one or more records; compressed; one bitmap for every combination of word and index.

6 Database tables. Z980: cache of bitmap updates; increases speed of large bitmap updates; 1/1000.
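
The relationship between the Z98 bitmaps and the Z980 update cache can be illustrated with a minimal sketch. It is not the product's storage format: the class name `WordBitmaps`, the `flush_threshold` parameter, and the index code `"WRD"` are hypothetical. A Python integer stands in for the compressed, multi-record bitmap, and the batching behavior of the cache is an assumption.

```python
from collections import defaultdict


class WordBitmaps:
    """Toy stand-in for Z98 (bitmaps) with a Z980-style cache of pending updates."""

    def __init__(self, flush_threshold=1000):
        # (word_number, index_code) -> int used as an arbitrary-length bitmap
        self.bitmaps = defaultdict(int)
        # (word_number, index_code) -> document numbers not yet merged into the bitmap
        self.pending = defaultdict(set)
        self.flush_threshold = flush_threshold  # hypothetical tuning knob

    def add(self, word_number, index_code, doc_number):
        """Record an occurrence in the cache; merge once enough updates pile up."""
        key = (word_number, index_code)
        self.pending[key].add(doc_number)
        if len(self.pending[key]) >= self.flush_threshold:
            self.flush(key)

    def flush(self, key):
        """Merge all cached document numbers for one bitmap in a single pass."""
        bits = self.bitmaps[key]
        for doc_number in self.pending.pop(key, ()):
            bits |= 1 << doc_number
        self.bitmaps[key] = bits

    def documents(self, word_number, index_code):
        """Return the sorted document numbers whose bit is set."""
        key = (word_number, index_code)
        self.flush(key)
        bits = self.bitmaps[key]
        return [d for d in range(bits.bit_length()) if (bits >> d) & 1]
```

One bitmap is kept per (word, index) pair, as the Z98 slide states. The cache exists so that a large bitmap is rewritten once per batch rather than once per document.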

7 Database tables. Z95: list of words and their location in a document; adjacency.

8 Database tables. Heading index: Z01 - phrase dictionary; Z02 - phrase->document mapping.

9 Database tables. Z01: filing phrase; connection to authority database; hash key (display text).

10 Building flow - word. Stage 1: Retrieval + Sort. Read document; prepare list of words and locations; for each word find list of indices it belongs to; sort according to words.

11 Building flow - word. Stage 2: Word Dictionary. Read intermediate file from stage 1; build up word dictionary (check + load); replace word with internal representation; create 2nd intermediate file.

12 Building flow - word. Stage 3: Sort + Build Z95. Sort intermediate file from stage 2 by document number; create Z95 records; load Z95 sequential file to database.

13 Building flow - word. Stage 4: Merge + Build Z98. Intermediate file from stage 2 is already sorted by word number; split words into a number of files according to range of word numbers; merge into Z98 records; load sequential files.
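
The four word-index stages can be traced end to end in a small in-memory sketch. It assumes a single index (the "find list of indices" step is left out), whitespace tokenization, and Python dicts in place of the database tables and intermediate files. The function name `build_word_index` is hypothetical.

```python
from collections import defaultdict
from itertools import groupby


def build_word_index(documents):
    """documents: dict of document number -> text. Returns (z97, z95, z98)."""
    # Stage 1: retrieval + sort. One (word, doc, position) row per occurrence,
    # sorted by word.
    rows = [
        (word, doc, pos)
        for doc, text in documents.items()
        for pos, word in enumerate(text.lower().split())
    ]
    rows.sort()

    # Stage 2: word dictionary. Replace each word with a sequence number.
    z97 = {}
    numbered = []
    for word, doc, pos in rows:
        word_number = z97.setdefault(word, len(z97) + 1)
        numbered.append((word_number, doc, pos))

    # Stage 3: sort by document number and build Z95 (words and locations per document).
    z95 = defaultdict(list)
    for word_number, doc, pos in sorted(numbered, key=lambda r: (r[1], r[2])):
        z95[doc].append((pos, word_number))

    # Stage 4: the stage 2 output is already in word-number order, so occurrences
    # of one word are adjacent and can be merged into one bitmap per word.
    z98 = {}
    for word_number, group in groupby(numbered, key=lambda r: r[0]):
        bits = 0
        for _, doc, _ in group:
            bits |= 1 << doc
        z98[word_number] = bits
    return z97, dict(z95), z98
```

Because sequence numbers are handed out while reading the word-sorted rows, the stage 2 output needs no further sort before the stage 4 merge. That matches the "already sorted by word number" remark on the stage 4 slide.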

14 Building flow - heading. Stage 1: Retrieval + Sort. Read document; prepare list of phrases; for each phrase find list of indices it belongs to; sort according to hash key.

15 Building flow - heading. Stage 2: Phrase Dictionary. Read intermediate file from stage 1; build up phrase dictionary; generate unique key - acc sequence; load Z01 sequential file to database; build Z02 - non-unique.

16 Building flow - heading. Stage 3: Sort + Load Z02. Sort non-unique Z02 sequential file; load Z02 sequential file to database.
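
The heading flow can be sketched the same way. The normalization in `filing_form`, the use of SHA-256 for the hash key, and the record layouts are all assumptions made for illustration. The slides only say that phrases are sorted by hash key, that each new phrase receives a unique acc sequence, and that Z02 maps phrases to documents.

```python
import hashlib


def filing_form(phrase):
    """Hypothetical normalization: lowercase and collapse whitespace."""
    return " ".join(phrase.lower().split())


def hash_key(phrase):
    """Hypothetical hash key derived from the filing form of the phrase."""
    return hashlib.sha256(filing_form(phrase).encode("utf-8")).hexdigest()[:16]


def build_heading_index(documents):
    """documents: dict of document number -> list of phrases. Returns (z01, z02)."""
    # Stage 1: retrieval + sort by hash key.
    rows = sorted(
        (hash_key(phrase), phrase, doc)
        for doc, phrases in documents.items()
        for phrase in phrases
    )

    # Stage 2: phrase dictionary. Each new hash key receives the next acc sequence;
    # every occurrence yields one (acc sequence, document) row for Z02.
    z01 = {}  # hash key -> (acc sequence, display text)
    z02 = []
    for key, phrase, doc in rows:
        if key not in z01:
            z01[key] = (len(z01) + 1, phrase)
        z02.append((z01[key][0], doc))

    # Stage 3: sort the Z02 rows before loading.
    z02.sort()
    return z01, z02
```

Unlike the word flow, nothing here builds a bitmap: the phrase-to-document mapping is stored as plain rows.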

17 Sequential - word. Every stage was handled by a single process. A stage proceeded only after the previous stage had handled its input; stage 4 proceeded only after all other stages had finished.

18 Sequential - word. Example from version 12.1:
csh -f p_manage_01_a $1 >& $data_scratch/p_manage_01_a.log &
csh -f p_manage_01_b $1 >& $data_scratch/p_manage_01_b.log &
csh -f p_manage_01_c $1 >& $data_scratch/p_manage_01_c.log &
csh -f p_manage_01_d $1 >& $data_scratch/p_manage_01_d.log
csh -f p_manage_01_e $1 >& $data_scratch/p_manage_01_e.log

19 Sequential - word.
p_manage_01_a: retrieval
p_manage_01_b: sort (by word)
p_manage_01_c: build Z97
p_manage_01_d: build Z95
p_manage_01_e: merge + build Z98

20 Drawbacks. Minimum parallel processing; single process per stage; no recoverability - Z97 could be reused but the whole building process needed to be rerun; computer resources not fully utilized; long run time.

21 Parallel processing. Large databases - multiple processors; identify stages that are not “workflow” bottlenecks; coordinate parallel processes with an assignment/progress table.

22 Parallel processing (word). Stage 1: Retrieval + Sort. Retrieval is parallel - an “io” bottleneck, not a “workflow” bottleneck; split into cycles, each covering a range of document numbers.

23 Parallel processing (word). p_manage_01_a.cycles - initial
0001 - - - - 000000001 000010000
0002 - - - - 000010001 000020000
0003 - - - - 000020001 000030000
0004 - - - - 000030001 000040000
0005 - - - - 000040001 000050000
0006 - - - - 000050001 000060000
0007 - - - - 000060001 000070000
0008 - - - - 000070001 000080000
0009 - - - - 000080001 000090000
0010 - - - - 000090001 000100000
0011 - - - - 000100001 000110000
0012 - - - - 000110001 000110511
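
An initial cycles file of this shape can be generated from the last document number and a cycle size. The sketch below is an illustration of the layout seen on the slide, not the actual tool; the function name `make_cycles` and its parameters are hypothetical.

```python
def make_cycles(last_doc, cycle_size, stages=4):
    """Return cycles-file lines covering documents 1..last_doc.

    Each line holds a 4-digit cycle number, one "-" flag per stage column,
    and the first and last document number of the cycle (9 digits each).
    """
    lines = []
    first = 1
    number = 1
    while first <= last_doc:
        last = min(first + cycle_size - 1, last_doc)
        flags = " ".join("-" * stages)
        lines.append(f"{number:04d} {flags} {first:09d} {last:09d}")
        first = last + 1
        number += 1
    return lines
```

Calling `make_cycles(110511, 10000)` yields twelve lines, the last of which covers documents 000110001 to 000110511.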

24 Parallel processing (word). p_manage_01_a.cycles - 3 processes, 1st retrieval cycle
0001 ? - - - 000000001 000010000
0002 ? - - - 000010001 000020000
0003 ? - - - 000020001 000030000
0004 - - - - 000030001 000040000
0005 - - - - 000040001 000050000
0006 - - - - 000050001 000060000
0007 - - - - 000060001 000070000
0008 - - - - 000070001 000080000
0009 - - - - 000080001 000090000
0010 - - - - 000090001 000100000
0011 - - - - 000100001 000110000
0012 - - - - 000110001 000110511

25 Parallel processing (word). p_manage_01_a.cycles - 3 processes, 2nd retrieval cycle
0001 + + ? - 000000001 000010000
0002 + ? - - 000010001 000020000
0003 + - - - 000020001 000030000
0004 ? - - - 000030001 000040000
0005 ? - - - 000040001 000050000
0006 ? - - - 000050001 000060000
0007 - - - - 000060001 000070000
0008 - - - - 000070001 000080000
0009 - - - - 000080001 000090000
0010 - - - - 000090001 000100000
0011 - - - - 000100001 000110000
0012 - - - - 000110001 000110511

26 Parallel processing (word). Whenever possible, stages were split into separate sub-stages, usually in cases of non-parallel stages. Stages 2 and 3 were not made into parallel processes, since retrieval was by far the most costly stage.

27 Parallel processing (word). Stages 2 and 3 were subdivided into 3 sub-stages: build Z97 + load; sort intermediate file by document number; build Z95 + load.

28 Parallel processing (word). p_manage_01_a.cycles - example
0001 + + + + 000000001 000010000
0002 + + + ? 000010001 000020000
0003 + + ? - 000020001 000030000
0004 + + - - 000030001 000040000
0005 + ? - - 000040001 000050000
0006 + - - - 000050001 000060000
0007 ? - - - 000060001 000070000
0008 ? - - - 000070001 000080000
0009 ? - - - 000080001 000090000
0010 - - - - 000090001 000100000
0011 - - - - 000100001 000110000
0012 - - - - 000110001 000110511
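
How such an assignment/progress table might coordinate the processes can be sketched as follows. The reading of the columns is inferred from the examples rather than stated in the slides: one flag per sub-stage (retrieval, then the three sub-stages of stages 2 and 3), with "-" for not processed, "?" for in process, and "+" for done. The function names are hypothetical. The real scripts would also need file locking, since several processes update the same file, and that is omitted here.

```python
NOT_PROCESSED, IN_PROCESS, DONE = "-", "?", "+"


def parse_cycles(text):
    """Parse lines such as '0003 + ? - - 000020001 000030000'."""
    cycles = []
    for line in text.strip().splitlines():
        fields = line.split()
        cycles.append({
            "cycle": fields[0],
            "flags": fields[1:-2],          # one flag per sub-stage column
            "first_doc": int(fields[-2]),
            "last_doc": int(fields[-1]),
        })
    return cycles


def claim_next(cycles, stage):
    """Mark the first cycle that is ready for `stage` as in process.

    A cycle is ready when its flag for `stage` is "-" and the previous
    stage (if any) is done. Returns the claimed entry, or None.
    """
    for entry in cycles:
        flags = entry["flags"]
        ready = stage == 0 or flags[stage - 1] == DONE
        if flags[stage] == NOT_PROCESSED and ready:
            flags[stage] = IN_PROCESS
            return entry
    return None


def mark_done(entry, stage):
    """Record that `stage` finished for this cycle."""
    entry["flags"][stage] = DONE


def format_cycles(cycles):
    """Render the entries back into cycles-file lines."""
    return [
        f"{e['cycle']} {' '.join(e['flags'])} {e['first_doc']:09d} {e['last_doc']:09d}"
        for e in cycles
    ]
```

With this rule, several retrieval processes can each claim a different cycle in column 1, while each later sub-stage picks up cycles in order as soon as the column to its left shows "+".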

29 Parallel processing (word). Stage 4 is split into sub-stages: pre-processing of intermediate files from stage 2 - distribution of words; build Z98 - parallel; load Z98 sequential file. Input files are compressed and stored in a separate directory.

30 Parallel processing (word). Pre-processing: generate histogram - # of lines per 5000 words; determine ranges of words - no more than 1G in intermediate files.
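
The pre-processing step can be sketched as a histogram followed by a greedy split. The sketch makes two assumptions: the size limit is expressed as a line budget (`max_lines`) standing in for the 1G limit, and the last range is left open at 999999999, as in the p_manage_01_e.cycles listing. The function names are hypothetical.

```python
from collections import Counter

BUCKET = 5000  # word numbers per histogram bucket, as on the slide


def histogram(word_numbers, bucket=BUCKET):
    """Count intermediate-file lines per bucket of `bucket` word numbers."""
    return Counter((w - 1) // bucket for w in word_numbers)


def word_ranges(counts, max_lines, bucket=BUCKET, last=999_999_999):
    """Greedily group buckets into word-number ranges of at most max_lines lines.

    A single bucket larger than max_lines still gets a range of its own.
    """
    ranges = []
    start = 1
    running = 0
    for b in sorted(counts):
        if running and running + counts[b] > max_lines:
            end = b * bucket            # last word number before bucket b
            ranges.append((start, end))
            start = end + 1
            running = 0
        running += counts[b]
    ranges.append((start, last))        # final range is open-ended
    return ranges
```

Each resulting range can then be handed to a separate merge process, because no word number falls into two ranges.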

31 Parallel processing (word). p_manage_01_e.cycles
0001 - - 000000001 000600000
0002 - - 000600001 000900000
0003 - - 000900001 999999999

32 Parallel processing (word). Build Z98: intermediate files - split into discrete ranges of words; parallel merging and building of Z98.

33 Parallel processing (word). p_manage_01_e.cycles - example
0001 + ? 000000001 000600000
0002 ? - 000600001 000900000
0003 ? - 000900001 999999999

34 Parallel processing (heading). Stage 1: Retrieval + Sort. Same handling as word index stage 1; “io” bottleneck; split into cycles, each covering a range of document numbers.

35 Parallel processing (heading). p_manage_02.cycles
0001 - - - - 000000001 000005000
0002 - - - - 000005001 000010000
0003 - - - - 000010001 000015000
0004 - - - - 000015001 000020000
0005 - - - - 000020001 000025000
0006 - - - - 000025001 000030000
0007 - - - - 000030001 000035000
0008 - - - - 000035001 000040000
0009 - - - - 000040001 000045000
0010 - - - - 000045001 000048435

36 Parallel processing (heading). Stages 2 and 3 were subdivided into 3 sub-stages: build Z01 + load + build Z02; sort non-unique Z02 sequential file; load Z02.

37 Parallel processing (heading). p_manage_02.cycles - example
0001 + + + ? 000000001 000005000
0002 + + ? - 000005001 000010000
0003 + + - - 000010001 000015000
0004 + ? - - 000015001 000020000
0005 + - - - 000020001 000025000
0006 ? - - - 000025001 000030000
0007 ? - - - 000030001 000035000
0008 ? - - - 000035001 000040000
0009 - - - - 000040001 000045000
0010 - - - - 000045001 000048435

38 Parallel processing (heading). Building of headings is conceptually and practically similar to word building, except for the building of bitmaps (Z98).

39 Recovery. Word index: stages 1-3 and stage 4 are separate; stage 4 runs only after all processing is done in stage 3.

40 Recovery. Stages 1-3 - scenarios: database tables need to be enlarged; not enough disk space - intermediate files; not enough disk space - sort; general disaster?

41 Recovery. Stages 1-3: identify last successful section; change “in process” signs (?) to “not processed” signs (-); rerun discrete stage scripts: p_manage_01_a, p_manage_01_c, p_manage_01_d, p_manage_01_d1.
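
The flag reset for stages 1-3 is a mechanical edit of the cycles file and can be sketched as a small text transformation. The function name `reset_in_process` is hypothetical, and the sketch assumes the line layout shown in the p_manage_01_a.cycles listings (cycle number, flags, two document numbers). Partial output left behind by the interrupted processes would still need to be cleaned up separately.

```python
def reset_in_process(cycles_text):
    """Return the cycles-file text with every "?" flag changed back to "-".

    Only the flag columns are touched; the cycle number and the two
    document-number columns are copied through unchanged.
    """
    out = []
    for line in cycles_text.splitlines():
        fields = line.split()
        if not fields:                  # keep blank lines as they are
            out.append(line)
            continue
        flags = ["-" if f == "?" else f for f in fields[1:-2]]
        out.append(" ".join([fields[0], *flags, *fields[-2:]]))
    return "\n".join(out)
```

After the reset, rerunning a stage script lets it pick up the interrupted cycles again, while cycles already marked "+" are left alone.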

42 Recovery. Stage 4: must be rerun in totality; input files are saved and compressed in $word_compress_dir; rerun p_manage_01_e.

43 Helpful rules. Stage 1 outrunning stages 2-3: decide on number of stage 1 processes to stop (p_manage_01_a); kill shell and program process; reset associated cycle in p_manage_01_a.cycles.

44 Helpful rules. Log file names:
p_manage_01_a_{process_number}.log
p_manage_01_e_{process_number}.log
Others are without process_number:
p_manage_01_c.log
p_manage_01_d.log
p_manage_01_d1.log
p_manage_01_e1.log
p_manage_01_e2.log

45 Helpful rules. Cycle size: # docs < 2M - 50k; # docs < 4M - 100k; otherwise - 200k.
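
The cycle-size rule translates directly into a lookup, and the number of cycles follows from it. The function names are hypothetical, and the thresholds are read as strict "less than", as written on the slide.

```python
from math import ceil


def cycle_size(doc_count):
    """Cycle size by database size: <2M docs -> 50k, <4M docs -> 100k, else 200k."""
    if doc_count < 2_000_000:
        return 50_000
    if doc_count < 4_000_000:
        return 100_000
    return 200_000


def cycle_count(doc_count):
    """Number of cycles needed to cover all documents at the chosen cycle size."""
    return ceil(doc_count / cycle_size(doc_count))
```

A database just under 2M documents therefore gets 40 cycles of 50k documents each.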

46 Helpful rules. Disk space calculation: d = no. of documents; c = no. of cycles; p = no. of processors; s = size of retrieval file.

47 Helpful rules. Sort space ($TMPDIR): sort = p*s + 20%; covers stage 1 sort (parallel) + stage 2,3 sorting (single file).

48 Helpful rules. Scratch space: scratch = p*1.5*s + c*s*1/3; covers output from stage 1 (in process and not yet processed) + output from stage 3.

49 Helpful rules. Example: UBU, d=2M, cycle size=50k, p=4, c=40, s=~0.5G; sort = 4*0.5*1.2 = 2.4G; scratch = 4*1.5*0.5 + 40*0.5*1/3 = 3G + 6.67G = 9.67G.
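
The two disk-space formulas are easy to evaluate for other configurations. The sketch below restates them as functions, with sizes in gigabytes; the function names are hypothetical and "+ 20%" is read as a factor of 1.2.

```python
def sort_space(p, s):
    """Sort space in $TMPDIR: p * s plus 20%, with s in GB."""
    return p * s * 1.2


def scratch_space(p, c, s):
    """Scratch space: p * 1.5 * s for stage 1 output plus c * s / 3 for stage 3 output."""
    return p * 1.5 * s + c * s / 3


# Inputs from the UBU example: p=4 processors, c=40 cycles, s=0.5 GB per retrieval file.
ubu_sort = sort_space(4, 0.5)
ubu_scratch = scratch_space(4, 40, 0.5)
```

With the UBU inputs, the sort space comes to 2.4G, and the scratch space to 3G + 6.67G, roughly 9.67G.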

