# Algorithms and data structures, 1.5.2015

## Presentation transcript

Algorithms and data structures, FER. Licensed under Creative Commons BY-NC-SA 3.0 HR (http://creativecommons.org/licenses/by-nc-sa/3.0/hr/), 1.5.2015

## Addressing techniques

- Basics
- Retrieval procedures
- Hashing

## Basics

- Knowing the key of a record, the question arises how to find that record
- Primary key: defines a record uniquely
  - E.g. StudentID
- Concatenated (composite) keys: necessary for unique identification of some types of records
  - E.g. StudentID & CourseCode & ExamDate uniquely define a record of an examination (a possible examination by a committee due to a student's complaint is neglected!)
- Secondary key: need not define the record uniquely, but points at some attribute value
  - E.g. YearOfStudy in a record with course data

## Sequential search

- Searching the file record by record is the most primitive way
- Used for sequential files, where all the records have to be read anyhow
- Other terms: linear search, serial search
- The records need not be sorted
- On average n/2 records are read
- Complexity: O(n)
  - Best case: O(1)
  - Worst case: O(n)
- Algorithm IspisiTrazi (PrintSearch):
  - Repeat for all the records: if the current record equals the searched one, the record is found; leave the loop
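The IspisiTrazi (PrintSearch) idea above can be sketched in C as follows; the function name and integer keys are illustrative, not from the course materials:

```c
#include <stddef.h>

/* Sequential (linear) search: scan record by record until the key matches.
   Returns the index of the first match, or -1 if the key is absent.
   Best case O(1) (first record), worst case O(n) (all n records read). */
int sequential_search(const int *records, size_t n, int key)
{
    for (size_t i = 0; i < n; ++i) {
        if (records[i] == key)   /* current record equals the searched one */
            return (int)i;       /* record is found: leave the loop */
    }
    return -1;                   /* all n records read without a match */
}
```

Note that the records need not be sorted; the loop simply visits every record until it finds the key or runs out of data.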

## Sequential search of sorted records

- How to improve the sequential search? Sort the records according to a key!
- What are the complexities in the best, worst and average case when searching sorted records?
- Algorithm:
  - Sort the records
  - Repeat for all records:
    - If the current record equals the searched one: the record is found; leave the loop
    - If the current record is larger than the searched one: the record does not exist; leave the loop
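The early-exit rule above can be sketched in C like this (illustrative names; the array is assumed already sorted in ascending order):

```c
#include <stddef.h>

/* Sequential search over ascending sorted records: as soon as the current
   record is larger than the searched key, the key cannot appear later,
   so the scan stops early and reports "does not exist". */
int sorted_sequential_search(const int *sorted, size_t n, int key)
{
    for (size_t i = 0; i < n; ++i) {
        if (sorted[i] == key)
            return (int)i;   /* record is found */
        if (sorted[i] > key)
            break;           /* record does not exist: leave the loop early */
    }
    return -1;
}
```

An unsuccessful search now stops, on average, halfway through the data instead of always reading all n records.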

## Questions and exercises

1. A colleague tells you that s/he wrote a sequential search algorithm with complexity O(log n). Should you congratulate him or laugh at him?
2. In the best case, the record is found after the minimum number of comparisons. Where was this record located?
3. Where was the record found after the maximum number of comparisons?

## Block-wise reading

- In direct (random) access files (all records of the same length!) sorted by the primary key, it is not indispensable to check all the records
  - E.g. only every hundredth record is examined
  - When the block that must contain the searched key is located, that block is searched sequentially
- How to find the optimal block size?

## Example – places in Croatia

- We are looking for the town of Malinska in a list of F = 6935 places; each page (block) contains B = 60 places
- There are ⌈F / B⌉ = 116 leading records (and corresponding pages)
- (Figure: sample pages, each summarised by a leading record)
  - Page 1: Ada, Adamovec, Adžamovci, …, Bair, Bajagić, Bajčići
  - Page 2: Bajići, Bajkini, Bakar-dio, …, Barilović, Barkovići, Barlabaševec
  - Page 60: Mali Gradac, Mali Grđevac, Mali Iž, …, Malinska, …, Manja Vas, Manjadvorci, Manjerovići
  - Page 61: Maovice, Maračići, …, Martin, Martina, Martinac
  - Page 115: Zvijerci, Zvjerinac, Zvoneća, …, Žitomir, Živaja, Živike
  - Page 116: Živković Kosa, Živogošće, Žlebec Gorički, …, Žutnica, Žužići
- Algorithm: CitanjePoBlokovima (ReadingBlockWise)
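A minimal C sketch of the CitanjePoBlokovima (ReadingBlockWise) idea, assuming a sorted array of names: first only the last (leading) entry of each block of size B is compared, then the located block is searched sequentially. With B ≈ √F both phases cost about √F comparisons, which answers the optimal-block-size question. Names and signature are illustrative:

```c
#include <stddef.h>
#include <string.h>

/* Block-wise search over f sorted names with block size b.
   Phase 1: skip whole blocks while the block's last name precedes the key.
   Phase 2: search the located block sequentially.
   Returns the index of the key, or -1 if absent. */
int blockwise_search(const char **names, size_t f, size_t b, const char *key)
{
    size_t start = 0;
    while (start + b < f && strcmp(names[start + b - 1], key) < 0)
        start += b;                           /* skip a whole block */
    size_t end = (start + b < f) ? start + b : f;
    for (size_t i = start; i < end; ++i)      /* sequential search in block */
        if (strcmp(names[i], key) == 0)
            return (int)i;
    return -1;
}
```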

## Binary search

- Binary search starts at the half of the file/array and continues by constantly halving the interval where the searched record could be found
  - Prerequisite: the data must be sorted!
- Average number of search steps: log₂ n
- The procedure is inappropriate for disk memories with direct access, due to the time-consuming positioning of the reading heads; it is recommendable for core (main) memory
- It exploits the fact that the array is sorted: in each step the search area is halved
- Complexity: O(log₂ n) — for n elements, about log₂ n search steps

## Example of binary search

- Looking for 25 in the sorted array: 2, 5, 6, 8, 9, 12, 15, 21, 23, 25, 31, 39

## Algorithm for binary search (BinarnoPretrazivanje / BinarySearch)

    lower_bound = 0
    upper_bound = total_elements_count
    Repeat
        Find the middle record
        If the middle record equals the searched one
            Record is found; leave the loop
        If the lower bound is greater than or equal to the upper bound
            Record is not found; leave the loop
        If the middle record is smaller than the searched one
            Set the lower bound to the position of the middle record + 1
        If the middle record is larger than the searched one
            Set the upper bound to the position of the middle record - 1
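A C sketch of the pseudocode above, using the common inclusive-bounds variant (the slide initialises the upper bound to the element count; here it is count − 1, which is equivalent in effect):

```c
/* Binary search over an ascending sorted array of n elements.
   Keeps halving [lower, upper] until the middle record matches or the
   bounds cross. Worst case about log2(n) steps. Returns index or -1. */
int binary_search(const int *sorted, int n, int key)
{
    int lower = 0, upper = n - 1;
    while (lower <= upper) {
        int mid = lower + (upper - lower) / 2;  /* middle record */
        if (sorted[mid] == key)
            return mid;              /* record is found */
        if (sorted[mid] < key)
            lower = mid + 1;         /* continue in the upper half */
        else
            upper = mid - 1;         /* continue in the lower half */
    }
    return -1;                       /* bounds crossed: record not found */
}
```

Running it on the slide's example array finds 25 at index 9 after a few halvings.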

## Questions

1. What are the execution times of binary search of n records in the best, worst and average case? (The average case takes just one step fewer than the worst case; a proof can be seen at http://www.mcs.sdsmt.edu/ecorwin/cs251/binavg/binavg.htm, retrieved March 24th, 2014.)
2. For the search of places in Croatia (a list of 6935 places), what is the maximum number of steps necessary to locate the searched place?
3. Will binary search always be faster than sequential search, even for a large set of data?

## Problem

- Suppose that n unsorted data in a set can be sorted in time O(n log₂ n), and that you have to perform n searches in this data set. What is better:
  - to use sequential search, or
  - to sort the data and then apply binary search?
- Solution: it makes more sense to sort first and then search binarily!
  - n sequential searches: n · O(n) = O(n²)
  - sort + binary: O(n log₂ n) + n · O(log₂ n) = O(n log₂ n)
  - O(n log₂ n) < O(n²)

## Index-sequential files

- Every record contains a key as a unique identifier
- If a file is sorted by the key, it is appropriate to form a look-up table
  - Input to the table is the key of the searched record
  - Output is information about the more precise location of the searched record
- Such a table is called an index
  - The index need not point to each record, but only to a block
    - In the places-in-Croatia example it was shown that the optimal block size is √F
  - For large files there are indices on multiple levels; with the optimal organisation, in the worst case the same number of records is read in each of the indices and in the data file
  - Optimal sizes of indices on 2 levels are F^(1/3) and F^(2/3), giving O(F^(1/3)) reads
  - For k levels: F^(1/(k+1)), F^(2/(k+1)), …, F^((k-1)/(k+1)), F^(k/(k+1)), giving O(F^(1/(k+1))) reads

## Index-sequential files

- Insertion and deletion
  - In a traditional sequential file (magnetic tape), they are possible only by copying the whole file while adding and/or deleting records
  - In direct (random) access files, deletion is performed logically, i.e. a tag is written to mark a deleted record
- After a certain number of revisions the file has to be reorganised
  - Data are rewritten sequentially by key value and indexing is repeated (so-called file maintenance)

## Search procedures

- Index non-sequential files
  - If search by multiple keys is required, or if addition and deletion of records is frequent (volatile files), it becomes very difficult (in the first case practically impossible) to maintain the requirement that the records are sorted and contained within their initial block
  - In that case the index should contain the address (relative or absolute) of each single record
- The key contains the address
  - The simplest case is to form the key so that some part of it contains the record address
    - E.g. at an entrance examination for enrolment, the application number can serve as the key and simultaneously be the ordinal number of the record in a direct access file
    - Very often such a simple procedure is not possible, because the coding scheme cannot be adapted to each single application

## The idea of hashing

- Problem: a company employs about a hundred thousand people; every person has a unique identifier ID from the interval [0, 1 000 000]. Record reads and writes must be fast. How to organise the file?
  - A direct file with the key equal to the ID?
    - 1 000 000 × 4 bytes ≈ 4 MB, and 90% of the space is unused!
- It is possible to devise procedures that transform the key into an address or, even better, into some ordinal number
  - The position of the record is stored under this ordinal number
  - This modification improves flexibility

## Hashing

- Suppose M buckets (blocks of records sharing the same starting block address) are available
- A pseudo-random number from the interval 0 to M-1 is calculated from the key value using a hash function
  - This number is the address of a group of data (a bucket) holding all the keys that are transformed into that same pseudo-random number
  - A collision happens when two different keys are transformed into the same address
  - If a bucket is full, it is possible to insert a pointer to an overflow area, or insertion is attempted in the next neighbouring bucket (the "bad neighbour" policy)
- In hashing, the following parameters can vary:
  - Bucket capacity
  - Packing density

## Example

- Store the names into a hash table with 4 buckets (0–3)
- hash function = (sum of ASCII codes) % (number of buckets)
- Names: Vanja, Matija, Andrea, Doris, Saša, Alex, Sandi, Perica, Iva — into which bucket does each name go?
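The slide's toy hash function can be written in C as below (function name is illustrative). Note that summing codes ignores letter order, so anagrams always collide — a weakness relevant to the later discussion of good transformations:

```c
#include <string.h>

/* Toy hash from the slide: sum the character codes of the name and take
   the remainder modulo the number of buckets. */
unsigned ascii_sum_hash(const char *name, unsigned buckets)
{
    unsigned sum = 0;
    for (size_t i = 0; name[i] != '\0'; ++i)
        sum += (unsigned char)name[i];  /* accumulate ASCII codes */
    return sum % buckets;               /* bucket address in [0, buckets-1] */
}
```

For example, with 4 buckets "Alex" (65+108+101+120 = 394) lands in bucket 2 and "Iva" (73+118+97 = 288) in bucket 0.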

## Bucket capacity

- A pseudo-random number is generated through a transformation of the key, yielding the bucket address
  - If the bucket capacity equals 1, overflow is frequent
  - As the bucket size increases, the probability of overflow decreases, but reading a single bucket becomes more time-consuming and the amount of sequential search within a bucket increases
- It is recommendable to match the bucket size with the physical size of the record block on external memory (the disk block)

## Packing density

- After the bucket size has been chosen, the packing density can be selected, i.e. the number of buckets to store the foreseen number of records
  - To reduce the number of overflows, a larger total capacity is chosen
  - Packing density = number of records / total capacity
    - N = number of records to be stored
    - M = number of buckets
    - C = number of records within a bucket (bucket capacity)
- Packing density = N / (M · C)
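The formula above is a one-liner in C; the example values below anticipate the later worked example (350 students, 25 buckets, 18 records per bucket):

```c
/* Packing density = N / (M * C): stored records over total capacity. */
float packing_density(int n_records, int m_buckets, int c_per_bucket)
{
    return (float)n_records / ((float)m_buckets * (float)c_per_bucket);
}
```

For N = 350, M = 25, C = 18 the density is 350 / 450 ≈ 0.78, i.e. about 22% spare capacity to absorb overflows.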

## How to deal with overflow?

- Using the primary area
  - If a bucket is full, use the next one, etc.
  - After the last one comes the first one (cyclically)
  - Efficient if the bucket size exceeds 10
- Separate chaining
  - Buckets are organised as linear lists

## Statistics of hashing

- Let M be the number of buckets and N the amount of input data. The probability of directing exactly x records into a certain bucket obeys the binomial distribution:
  - P(x) = (N choose x) · (1/M)^x · (1 − 1/M)^(N−x)
  - Probability of Y overflows from a bucket of capacity C: P(C + Y)
  - Expected number of overflows from a given bucket: s = Σ over Y ≥ 1 of Y · P(C + Y)
- Total expected number of overflows: M · s (as a percentage of the input data: 100 · s · M / N)
- The average number of records that can be entered into the hash table before the first collision ≈ 1.25 √M
- The average total number of entered records before every bucket contains at least 1 record is M ln M

## Transformation of the key into an address

- Generally, the key is transformed into the address in 3 steps:
  - If the key is not numeric, it is transformed into a number, preferably without loss of information
  - An algorithm is applied to transform the key, as uniformly as possible, into a pseudo-random number of the order of magnitude of the bucket count
  - The result is multiplied by a constant ≤ 1 for transformation into an interval of relative addresses equal to the number of buckets
    - Relative addresses are converted into absolute ones on a concrete physical unit; as a rule, that is the task of system programs
- An ideal transformation: the probability of transforming 2 different keys in a table of size M into the same address is 1/M

## Characteristics of a good transformation

- The output value depends only on the input data
  - If it also depended on some other variable, that variable's value would have to be known at search time as well
- The function uses all the information from the input data
  - If it did not, small variations of the input data would produce a large number of equal outputs, and the distribution would depart from the desired one
- It distributes the output values uniformly
  - Otherwise efficiency decreases
- For similar input data it produces very different output values
  - In reality the input data are often very similar, while a uniform distribution at the output is required
- Which of these requirements are not fulfilled by our hash function from the previous example?

## Usage of hashing

- When is it appropriate?
  - Compilers use it for recording declared variables
  - For spelling checkers and dictionaries
  - In games, to store the positions of players
  - For equality checks (in information security)
    - If two elements result in different hash values, they must be different
  - Whenever quick and frequent search is required
- When is it not appropriate?
  - When records are searched by a non-key attribute value
  - When the data need to be sorted
    - E.g. to find the minimum value of the key

## Example: transformation of a key into an address

- A 6-digit key, 7000 buckets; key: 172148
- Method: the middle digits of the key squared
  - The key squared yields a 12-digit number; digits 5 to 8 are used
  - 172148² = 0296 3493 3904 → middle digits: 3493
  - The middle 4 digits are then transformed into the interval [0, 6999]
    - The pseudo-random number takes values from the interval [0, 9999], while the bucket addresses can be from the interval [0, 6999], so the multiplication factor is 6999/9999 ≈ 0.7
  - Bucket address = 3493 · 0.7 = 2445
  - The result corresponds to the behaviour of a roulette wheel
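The mid-square computation above can be sketched in C like this (function name is illustrative; the scaling factor buckets/10000 equals the slide's ≈ 0.7 for 7000 buckets):

```c
/* Mid-square method for a 6-digit key: square the key, take digits 5-8 of
   the 12-digit square as a pseudo-random number in [0, 9999], then scale
   it into the bucket interval [0, buckets-1]. */
int mid_square_address(long key, int buckets)
{
    unsigned long long sq = (unsigned long long)key * (unsigned long long)key;
    unsigned middle = (unsigned)((sq / 10000ULL) % 10000ULL); /* digits 5-8 */
    return (int)(middle * (double)buckets / 10000.0);         /* scale down */
}
```

For the slide's key 172148 and 7000 buckets: 172148² = 029634933904, middle digits 3493, and 3493 · 0.7 truncates to bucket 2445.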

## Methods for transformation of a key into an address

- Root of the middle digits of the key squared
  - As in the previous example, but after squaring, the square root of the 8 middle digits is calculated and truncated to obtain a four-digit integer:
  - √96349339 ≈ 9815
- Division
  - The key is divided by a prime number slightly less than or equal to the number of buckets (e.g. 6997)
  - The remainder after division is the bucket address
    - Bucket address = 172148 mod 6997 = 4220
  - Keys forming a sequence of consecutive values are well distributed
- Shifting of digits and addition
  - E.g. key = 17207359
    - 1720 + 7359 = 9079
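The division and shift-and-add methods above fit in a few lines of C each (function names are illustrative):

```c
/* Division method: the remainder after dividing the key by a prime close
   to the bucket count is the bucket address. */
long division_address(long key, long prime)
{
    return key % prime;
}

/* Shifting of digits and addition: split the decimal key into 4-digit
   halves and add them. */
long shift_add_address(long key)
{
    return key / 10000 + key % 10000;
}
```

Using the slide's numbers: 172148 mod 6997 = 4220, and 1720 + 7359 = 9079.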

## Methods for transformation of a key into an address

- Overlapping
  - Overlapping resembles shifting, but is more appropriate for long keys
  - E.g. key = 172407359
    - 407 + 953 + 271 = 1631
- Change of the counting base
  - The number is evaluated as if its digits were written in another counting base B
  - E.g. B = 11, key = 172148
    - 1·11⁵ + 7·11⁴ + 2·11³ + 1·11² + 4·11¹ + 8·11⁰ = 266373
  - The necessary number of least significant digits is selected and transformed into the address interval: bucket address = 6373 · 0.7 = 4461
- The best method can be chosen after simulation for a concrete application
  - Generally, division gives the best results
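The change-of-base method can be sketched in C as follows (function name illustrative; the factor 0.7 and the choice of 4 least significant digits follow the slide's 7000-bucket example):

```c
/* Change of counting base: reinterpret the decimal digits of the key as if
   written in base B, keep the last 4 decimal digits of the result, and
   scale into the bucket interval with factor 0.7. */
long base_change_address(long key, int base)
{
    long digits[20];
    int n = 0;
    while (key > 0) {                    /* extract decimal digits */
        digits[n++] = key % 10;
        key /= 10;
    }
    long value = 0;
    for (int i = n - 1; i >= 0; --i)     /* re-evaluate them in base B */
        value = value * base + digits[i];
    return (long)((value % 10000) * 0.7); /* last 4 digits -> [0, 6999] */
}
```

For B = 11 and key 172148 the re-evaluated number is 266373; its last four digits 6373 scale to bucket 4461, matching the slide.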

## Determination of parameters

- Example: there are 350 students enrolled in a study programme. Their ID, an identification number (11 characters), and their family name (14 characters) should be stored, under the requirement that records are retrieved fast using the ID.
- Remark:
  - The ID contains 11 digits
  - The last digit is a control (check) digit; it can, but need not, be stored if the rule for calculating it from the remaining digits is known

## Solution

- A record contains 11+1 + 14+1 = 27 bytes
- Let the physical block on the disk be 512 bytes
  - The bucket size should be equal to or less than that
    - 512 / 27 = 18.963
  - Therefore a bucket shall contain data for 18 students, with 26 bytes of unused space
  - A somewhat larger table capacity, e.g. 30% larger, shall be provided to reduce the number of expected overflows
    - That gives ⌈350/18 · 1.3⌉ ≈ 25 buckets
    - The ID should be transformed into a bucket address from the interval [0, 24]
  - The ID is rather long, so overlapping can be considered
    - Methods can be combined: after overlapping, division can be performed
    - The address shall be calculated by division with a prime number close to the bucket count, e.g. 23

## Writing of records into buckets of the hash table

(Figure: a record with fields ID and FamilyName is passed through HASH to obtain a bucket address in [0, M-1]; each of the M buckets holds C records and corresponds to a BLOCK on the disk.)

## Examples of key transformation

- If a bucket is full, entering into the next bucket is attempted cyclically (bad neighbour policy)
- Overlapping of the ID followed by division modulo 23:
  - ID = 5702926036x: 2926 + 630 + 075 = 3631; 3631 mod 23 = 20
  - ID = 6702926037x: 2926 + 730 + 076 = 3732; 3732 mod 23 = 6
  - ID = 5702926037x: 2926 + 730 + 075 = 3731; 3731 mod 23 = 5
  - ID = 6702926036x: 2926 + 630 + 076 = 3632; 3632 mod 23 = 21

## Algorithm - 1

    Create an empty table on a disk
    Read ID and family name sequentially, while there are data
        If the control digit is not correct
            "Incorrect ID"
        Else
            Set the tag that the record is not written
            Calculate the bucket address
            Remember it as the initial address
            Repeat
                Read the existing records from the bucket
                Repeat for all the records in the bucket
                    If the record is not empty
                        If the already written ID is equal to the input one
                            "Record already exists"
                            Set the tag that the record is written
                            Leave the loop
                    Else

## Algorithm - 2

                        Write the input record
                        Set the tag that the record is written
                        Leave the loop
                If the record is not written
                    Increment the bucket address by 1, modulo the bucket count
                    If the resulting address equals the initial one
                        Table is full
            Until record written or table full
    End   (Hash)
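The core of the two-part algorithm above — insertion with the bad neighbour policy — can be sketched in C. This is a simplified in-memory version with one record per bucket (the slides use buckets of capacity C on disk); the table size, struct layout and function name are illustrative:

```c
#include <string.h>

#define TABLE_SIZE 25              /* bucket count from the worked example */

struct record {
    char id[12];                   /* empty string marks a free bucket */
    char name[15];
};

/* Insert (id, name) starting at the precomputed bucket address `start`.
   On a full bucket, try the next one cyclically (bad neighbour policy).
   Returns 1 = inserted, 0 = record already exists, -1 = table is full. */
int hash_insert(struct record *table, int start, const char *id, const char *name)
{
    int pos = start;
    do {
        if (table[pos].id[0] == '\0') {      /* empty: write the record */
            strcpy(table[pos].id, id);
            strcpy(table[pos].name, name);
            return 1;
        }
        if (strcmp(table[pos].id, id) == 0)  /* record already exists */
            return 0;
        pos = (pos + 1) % TABLE_SIZE;        /* next bucket, cyclically */
    } while (pos != start);
    return -1;                               /* came back around: full */
}
```

A colliding record simply lands in the next free bucket, exactly as in the "Examples of key transformation" slide.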

## Exercises

- Update the function Hash so that instead of JMBG (a 13-character ID) it uses OIB (an 11-character ID). Update the function "Kontrola" that checks the control digit. Implement deletion by ID.
- Let products be the name of an unformatted file organised using hashing. Each record consists of an ID (int), a name (50+1 char), a quantity (int) and a price (float). A record is considered empty if its ID equals zero. Count the number of non-empty records in the file. The size of the block on the disk is BLOCK, and the expected maximum number of records is MAXREC; these parameters are contained in parameters.h.
- Let an unformatted file be organised using hashing. Each record of the file consists of a name (50+1 char), a quantity (int) and a price (float). A record is considered empty if its quantity equals zero. The size of the block on the disk is BLOCK, which is contained in parameters.h. Write the function that finds the packing density. The prototype of the function is: float density(const char *file_name);

## Exercises

- Write the function for inserting an ID (int) and a name (char 20+1) into a memory-resident hash table with 500 buckets. Each bucket consists of a single record. If a bucket is full, the next bucket is used (cyclically). The input arguments are the bucket address (previously computed), the ID and the name. The function returns 1 if the insertion is successful, 0 if the record already exists, and -1 if the table is full so the record cannot be inserted.
- Write the function that finds an ID (int) and the company name (30+1 char) in a memory-resident hash table with 200 buckets. Each bucket consists of a single record. If a bucket is full and does not contain the searched key value, the next bucket is used (cyclically). The input arguments are the bucket address (previously computed) and the ID. The output argument is the company name. The function returns 1 if the record is found and 0 if it is not.

## Exercises

- Let the key be a 7-digit telephone number. Write the function for transforming the key into an address. The hash table consists of M buckets; the division method should be used. The prototype of the function is: int address(int m, long teleph);
- Let products be the name of an unformatted file organised using hashing. A record consists of an ID (4-digit integer), a name (up to 30 char) and a price (float). A record is considered empty if its ID equals zero. Write the function for emptying the file. The size of the block on the disk is BLOCK, and the expected maximum number of records is MAXREC; these parameters are contained in parameters.h.
- Write the function that computes the bucket address in a table of 500 buckets. The key is a 4-digit ID, and the method is the root of the central digits of the key squared.