Data Types and Data Structures

Containers
Dictionaries
Priority Queues
Data Structures
Hash Tables
Binary Search Trees

Data types & structures
There are numerous options for data structures for many commonly used abstract data types:
  Containers
  Dictionaries
  Priority Queues
Changing data structures should not change the correctness of a program, but it can have a dramatic effect on the speed.
Classic, simple data structures include linked lists and arrays. Arrays index to a specific element in constant time ( O(1) ); linked lists take linear time ( O(n) ) to reach it.
So, if we were going to make a container for maintaining a database of people that use a particular web page, which structure should we use? If we need to pick people at random, it might be easiest to use an array: pick a random number from 1 to n and index into the array.
But what happens if we suddenly need to remove people from this structure at their request? How long would it take us to remove k people?
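As a rough illustration of the array trade-off raised above, the following sketch uses a Python list as the array. The function names and the sample user names are made up for this example. Removing by name needs a scan plus a shift, while removing at a known position can be done in constant time if order does not matter.

```python
import random

def pick_random(users):
    """Pick a user uniformly at random: one O(1) index into the array."""
    return users[random.randrange(len(users))]

def remove_user(users, name):
    """Remove one user by name, preserving order.

    list.remove scans for the name and then shifts the tail left,
    so one removal is O(n) and k removals are O(k * n).
    """
    users.remove(name)

def remove_user_unordered(users, position):
    """Remove the user at a known position in O(1) by overwriting it
    with the last element. The original order is not preserved."""
    users[position] = users[-1]
    users.pop()

users = ["ann", "bob", "cat", "dan"]
remove_user(users, "bob")          # users is now ["ann", "cat", "dan"]
remove_user_unordered(users, 0)    # users is now ["dan", "cat"]
```

The constant-time variant only helps when the position is already known; finding a person by name in an unsorted array is still a linear scan.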

Choosing a Data Structure
It is important to choose the proper data structure when you first design an algorithm. There are many data structures that can handle common operations: insertion, deletion, sorting, searching, finding the maximum or minimum, predecessor or successor, etc. Different data structures each take a different amount of time for the different operations.

Guidelines...
  Building an algorithm around a properly chosen data structure leads to both a clean algorithm and good performance.
    * This is just saying that choosing the right structure is very important.
  Using an incorrect data structure can be disastrous, but you don’t always need the best structure.
    * For example, we don’t want to use an array if we’re going to need to do a lot of insertions and deletions in the middle, while we don’t want to use a linked list if we’re going to have to index into it. Fortunately, there are many other structures to choose from if we’re going to do both.
  Sorting is at the heart of many good algorithms.
    * This is something I’m going to cover soon: analyzing sorting and explaining why it’s really so important.
  Common algorithm design paradigms include divide-and-conquer, randomization, incremental construction, and dynamic programming.
    * Each of these will be studied in turn later in the course.

Fundamental Data Types
An abstract data type is a collection of well-defined operations that can be performed on a particular structure. Different data structures make different tradeoffs that make certain operations (say, insertion) faster at the cost of others (say, searching). Often there will be other considerations that make one structure more desirable than others.

Containers
Hold data for later retrieval
Operations:
  Insert(item)
  Retrieve(); typically removing item from container
Simple data structures for implementing containers:
  Stack: LIFO
  Queue: FIFO
  Table: retrieve by index
Implementation: linked list or array
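The three simple containers can be sketched with Python built-ins. This is only an illustration: a Python list stands in for the array-backed stack and table, and collections.deque stands in for the queue.

```python
from collections import deque

stack = []                 # LIFO: insert and retrieve at the same end
stack.append("a")
stack.append("b")
top = stack.pop()          # "b", the most recently inserted item

queue = deque()            # FIFO: insert at the back, retrieve from the front
queue.append("a")
queue.append("b")
front = queue.popleft()    # "a", the earliest inserted item

table = ["x", "y", "z"]    # table: retrieve by index, item stays in place
second = table[1]          # "y"
```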

Dictionaries
Dictionaries are a form of container that permits access to data items by content (key).
Operations:
  Insert(key)
  Delete(pointer to item)
  Search(key)
Linked list implementation (no sorting): how long do Insert, Delete, and Search take?
Sorted array implementation: how long does Search take?
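To make the two implementations concrete, here is a minimal sketch. The class and function names are chosen for this example only. It shows an unsorted singly linked list with constant-time insert and linear-time search, next to binary search on a sorted array using the standard bisect module. Delete is left out; with a pointer to the item it is constant time only if the list is doubly linked.

```python
from bisect import bisect_left

class ListNode:
    def __init__(self, key, next=None):
        self.key = key
        self.next = next

class LinkedListDict:
    """Dictionary on an unsorted singly linked list (illustrative only)."""

    def __init__(self):
        self.head = None

    def insert(self, key):
        # O(1): the new node simply becomes the head.
        self.head = ListNode(key, self.head)

    def search(self, key):
        # O(n): walk the list until the key is found or the list ends.
        node = self.head
        while node is not None and node.key != key:
            node = node.next
        return node

def sorted_array_search(keys, key):
    """Binary search in a sorted list: O(log n) comparisons.

    Returns the index of key, or None if it is absent.
    """
    i = bisect_left(keys, key)
    return i if i < len(keys) and keys[i] == key else None
```

The sorted array pays for its fast search elsewhere: inserting a new key means shifting elements, which is linear time.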

Priority Queues
Insert(x): Given an item x, insert it into the priority queue.
Find-Maximum(): Return the item with the maximal priority.
Delete-Maximum(): Remove the item from the queue whose key is maximum.
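One possible implementation of these three operations, sketched on top of Python's heapq module. heapq provides a min-heap, so this sketch stores negated priorities and assumes the priorities are numbers; the class name is made up for this example.

```python
import heapq

class MaxPriorityQueue:
    """Max-priority queue built on heapq's min-heap by negating priorities.

    Assumes priorities are numbers, so that negation reverses their order.
    """

    def __init__(self):
        self._heap = []

    def insert(self, x):
        # O(log n)
        heapq.heappush(self._heap, -x)

    def find_maximum(self):
        # O(1): the smallest negated value sits at index 0.
        return -self._heap[0]

    def delete_maximum(self):
        # O(log n): remove and return the maximum.
        return -heapq.heappop(self._heap)
```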

Data Structures
Ways to implement data types:
  Linked lists
  Arrays with auxiliary data
  Hash table
  Binary search tree
  Others, of course

Hash Tables
Maintain an array to hold your items
“Hash” the key to determine the index the specific item should be stored at
Good hash functions
Methods for dealing with collisions
Chaining
Universal hash functions
Open addressing

Direct-address hash table
Assumptions:
  Universe of keys is small (size m)
  Set of keys can be mapped to {0, 1, …, m-1}
  No elements have the same key
Use an array of size m:
  Array contents can be pointer to element
  Array can directly store element
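Under these assumptions a direct-address table is little more than an array indexed by key. A minimal sketch, with a class name chosen for this example, storing the element directly in the slot:

```python
class DirectAddressTable:
    """Direct-address table for integer keys in {0, 1, ..., m-1}.

    Assumes no two elements share a key. Every operation is O(1).
    """

    def __init__(self, m):
        self.slots = [None] * m   # one slot per possible key

    def insert(self, key, element):
        self.slots[key] = element

    def search(self, key):
        return self.slots[key]    # None means no element has this key

    def delete(self, key):
        self.slots[key] = None
```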

Hash Functions
Problem with direct-addressed tables:
  Universe of possible keys U is too large
  Set of keys used K may be much smaller
Hash function:
  Use an array of size Θ(m)
  Use function h(k) = x to determine slot x
  h: U → {0, 1, …, m-1}
Collision:
  When h(k1) = h(k2) for two distinct keys k1 and k2

Good Hash Functions
Each key is equally likely to hash to any of the m slots, independently of where any other key has hashed to
  Difficult to achieve as this requires knowledge of the distribution of keys
Good characteristics:
  Must be able to evaluate quickly
  May want keys that are “close” to map to slots that are far apart

Hashing by Height
[Figure: hash table slots labeled by height, 1’ through 9’]
If I were doing this for all mammals, this might be a reasonable function.

Collisions unavoidable
Even if we have a good function, we will still have collisions.
[Figure: slots labeled by month, Jan through Dec]

Chaining
Create a linked list to store all elements that map to the same table slot
Running time:
  Insert(T,k): how long? what assumptions?
  Search(T,k): how long?
  Delete(T,x): pointer to element x, how long, what assumptions?
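A small sketch of chaining follows. It makes two simplifications that are not in the slides: each chain is a Python list rather than a linked list, and delete takes a key rather than a pointer to the element, so it has to scan the chain. The hash function is simply Python's built-in hash reduced modulo m.

```python
class ChainedHashTable:
    """Hash table with chaining; each slot holds a list of (key, value) pairs."""

    def __init__(self, m):
        self.m = m
        self.slots = [[] for _ in range(m)]

    def _h(self, key):
        # Illustrative hash function: built-in hash reduced to a slot index.
        return hash(key) % self.m

    def insert(self, key, value):
        # O(1), assuming the key is not already present (no duplicate check).
        self.slots[self._h(key)].append((key, value))

    def search(self, key):
        # Time proportional to the length of this key's chain.
        for k, v in self.slots[self._h(key)]:
            if k == key:
                return v
        return None

    def delete(self, key):
        # Scans the chain because we are given a key, not a pointer.
        chain = self.slots[self._h(key)]
        for i, (k, _) in enumerate(chain):
            if k == key:
                del chain[i]
                return True
        return False
```

With a doubly linked chain and a pointer to the element, deletion would not need the scan.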

Search time
Notation:
  n items
  m slots
  load factor α = n/m
Worst-case search time? What is worst case?
Expected search time:
  Simple uniform hashing: each element is equally likely to hash to any of the m slots, independent of where any other element has hashed to.
  Expected search time?
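For reference, the standard answers under chaining can be worked out as follows. This is a sketch of the usual argument for an unsuccessful search, writing \alpha = n/m for the load factor and k_1, \dots, k_n for the stored keys (names introduced here for the derivation).

```latex
% Worst case: all n keys hash to the same slot, so a search walks one chain of length n.
T_{\text{worst}}(n) = \Theta(n)

% Expected case under simple uniform hashing, for a key k not in the table:
% each stored key k_i lands in slot h(k) with probability 1/m.
E\bigl[\text{length of chain at } h(k)\bigr]
  = \sum_{i=1}^{n} \Pr\bigl[h(k_i) = h(k)\bigr]
  = n \cdot \frac{1}{m}
  = \alpha

% Adding the O(1) cost of computing h(k):
T_{\text{expected}} = \Theta(1 + \alpha)
```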

Universal hashing
In the worst case, for any fixed hash function, the keys may be exactly the worst case for your function.
Avoid this by choosing the hash function randomly, independent of the keys to be hashed.
Key distinction from probabilistic analysis:
  A universal hash function will work well with high probability on EVERY input instance, but may perform poorly with low probability on EVERY input instance.
  Probabilistic analysis of a static hash function h says h will work well on most input instances every time, but may perform poorly on some input instances every time.

Definition and analysis
Let H be a finite collection of hash functions that map U into {0, …, m-1}.
This collection is universal if, for each pair of distinct keys k and q in U, the number of hash functions h in H for which h(k) = h(q) is at most |H|/m.
If we choose our hash function randomly from H, this implies that there is at most a 1/m chance that h(k) = h(q).
This leads to the expected length of a chain being n/m.
Note that we assume chaining and not open addressing in the analysis.

An example of universal hash functions
Choose a prime p larger than all possible keys.
Let Z_p = {0, …, p-1} and Z_p* = {1, …, p-1}.
Clearly p > m. Why?
For any a in Z_p* and b in Z_p, define h_{a,b}(k) = ((ak + b) mod p) mod m.
H_{p,m} = {h_{a,b} | a in Z_p* and b in Z_p}
This family has a total of p(p-1) hash functions.
This family of hash functions is universal.
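The family above is easy to sketch in code. The function names below are made up for this example, and the caller is assumed to supply a prime p larger than every key. The second function counts, by brute force, how many members of the family make two given keys collide, which is the quantity the universality definition bounds by |H|/m = p(p-1)/m.

```python
import random

def make_universal_hash(p, m, rng=random):
    """Draw h_{a,b}(k) = ((a*k + b) mod p) mod m at random from the family.

    Assumes p is prime and larger than every key; a is drawn from
    {1, ..., p-1} and b from {0, ..., p-1}.
    """
    a = rng.randrange(1, p)
    b = rng.randrange(0, p)

    def h(k):
        return ((a * k + b) % p) % m

    return h

def collision_count(p, m, k, q):
    """Count the functions h_{a,b} in the family with h(k) == h(q)."""
    count = 0
    for a in range(1, p):
        for b in range(p):
            if ((a * k + b) % p) % m == ((a * q + b) % p) % m:
                count += 1
    return count
```

For small values such as p = 17 and m = 6, one can check exhaustively that collision_count never exceeds p(p-1)/m for any pair of distinct keys below p.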

Open addressing
Store all elements in the table
Probe the hash table in the event of a collision
Key idea: the probe sequence is NOT the same for each element; it depends on the initial key
  h: U x {0, 1, …, m-1} → {0, 1, …, m-1}
Permutation requirement:
  h(k,0), h(k,1), …, h(k,m-1) is a permutation of (0, …, m-1)

Operations
Insert, search straightforward
Why can we not simply mark a slot as deleted?
If keys need to be deleted, open addressing may not be the right choice

Probing schemes
uniform hashing: each of the m! permutations is equally likely
  not typically achieved
linear probing: h(k,i) = (h’(k) + i) mod m
  Clustering effect
  Only m possible probe sequences are considered
quadratic probing: h(k,i) = (h’(k) + ci + di^2) mod m
  constraints on c, d, m
  better than linear probing as the clustering effect is not as bad
  Only m possible probe sequences are considered, and keys that map to the same initial position have identical probe sequences
double hashing: h(k,i) = (h’(k) + i q(k)) mod m
  q(k) must be relatively prime to m
  m^2 probe sequences considered
  Much closer to uniform hashing
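To see the difference between linear probing and double hashing, the sketch below generates full probe sequences for a small table. The choices here are assumptions made for the example: m = 7 is prime, the primary hash is k mod m, and the secondary hash is 1 + (k mod (m-1)), which is never zero and is therefore relatively prime to a prime m. Quadratic probing is omitted because it needs extra constraints on its constants.

```python
def linear_probe(h1, m):
    """Probe function h(k, i) = (h1(k) + i) mod m."""
    return lambda k, i: (h1(k) + i) % m

def double_hash_probe(h1, h2, m):
    """Probe function h(k, i) = (h1(k) + i * h2(k)) mod m."""
    return lambda k, i: (h1(k) + i * h2(k)) % m

def probe_sequence(h, k, m):
    """All m slots visited for key k, in probe order."""
    return [h(k, i) for i in range(m)]

m = 7                                  # prime table size (example choice)
h1 = lambda k: k % m                   # primary hash (example choice)
h2 = lambda k: 1 + (k % (m - 1))       # secondary hash, always in 1..m-1

linear = linear_probe(h1, m)
double = double_hash_probe(h1, h2, m)

# Keys 10 and 17 both start at slot 3.
print(probe_sequence(linear, 10, m))   # [3, 4, 5, 6, 0, 1, 2]
print(probe_sequence(linear, 17, m))   # [3, 4, 5, 6, 0, 1, 2]
print(probe_sequence(double, 10, m))   # [3, 1, 6, 4, 2, 0, 5]
print(probe_sequence(double, 17, m))   # [3, 2, 1, 0, 6, 5, 4]
```

Under linear probing the two colliding keys follow the same path, which is what produces clustering; under double hashing they diverge after the first probe.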

Search time
Preliminaries:
  n elements, m slots, α = n/m with n <= m
  Assumption of uniform hashing
Expected search time on a miss:
  Given that h(k,i) is non-empty, what is the probability that h(k,i+1) is empty?
  What is the expected search time then?
  Expected insertion time is essentially the same. Why?
Expected search time on a hit:
  If the entry was the (i+1)st element added, the expected search time is at most 1/(1 – i/m) = m/(m-i)
  Average this over all n elements and you get (1/α)(H_m – H_{m-n})
  This can be bounded by (1/α) ln(1/(1-α))
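The bounds quoted above follow from the usual textbook argument, sketched here under the uniform hashing assumption with \alpha = n/m < 1. X denotes the number of probes in an unsuccessful search and H_j the j-th harmonic number; both symbols are introduced for this derivation.

```latex
% Miss: each further probe is needed only if all earlier probed slots were occupied.
\Pr[X \ge i] \le \alpha^{\,i-1}
\quad\Longrightarrow\quad
E[X] = \sum_{i \ge 1} \Pr[X \ge i] \le \sum_{i \ge 1} \alpha^{\,i-1} = \frac{1}{1-\alpha}

% Hit: searching for the (i+1)st inserted key repeats its insertion probes,
% which took at most m/(m-i) expected probes. Average over the n keys:
\frac{1}{n} \sum_{i=0}^{n-1} \frac{m}{m-i}
  = \frac{1}{\alpha} \sum_{j=m-n+1}^{m} \frac{1}{j}
  = \frac{1}{\alpha} \left( H_m - H_{m-n} \right)
  \le \frac{1}{\alpha} \int_{m-n}^{m} \frac{dx}{x}
  = \frac{1}{\alpha} \ln \frac{1}{1-\alpha}
```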

Binary search trees
Supports search, min, max, predecessor, successor, insert, delete, and list all efficiently
  Thus can be used for more than just dictionary applications
Basic tree property: for any node x,
  left subtree has nodes <= x
  right subtree has nodes >= x

Binary Trees
Left = LESS THAN
Right = GREATER THAN

Example Search Trees
How do we search in a tree like this?
How long does search take?
How do we find successor and predecessor?

Operations
Search procedure? Search time?
Minimum node in tree rooted at node x?
Maximum node in tree rooted at node x?
Listing all nodes in sorted order? Time to list?
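One way to answer these questions in code is sketched below; the Node class and function names are chosen for this example. Search, minimum, and maximum each follow a single root-to-leaf path, so they take time proportional to the height h of the tree, while an in-order walk lists all n keys in sorted order in O(n) time.

```python
class Node:
    def __init__(self, key, left=None, right=None):
        self.key = key
        self.left = left
        self.right = right

def search(x, key):
    """Return the node holding key in the subtree rooted at x, or None. O(h)."""
    while x is not None and x.key != key:
        x = x.left if key < x.key else x.right
    return x

def minimum(x):
    """Leftmost node of the subtree rooted at x. O(h)."""
    while x.left is not None:
        x = x.left
    return x

def maximum(x):
    """Rightmost node of the subtree rooted at x. O(h)."""
    while x.right is not None:
        x = x.right
    return x

def inorder(x):
    """Keys of the subtree rooted at x in sorted order. O(n)."""
    if x is None:
        return []
    return inorder(x.left) + [x.key] + inorder(x.right)

# Example tree:      8
#                  /   \
#                 3     10
#                / \      \
#               1   6      14
root = Node(8, Node(3, Node(1), Node(6)), Node(10, None, Node(14)))
```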

Successor and Predecessor
Successor: Find the minimal entry in the right sub-tree, if there is a right sub-tree. Otherwise find the first ancestor v such that the entry is in v’s left sub-tree.
Predecessor: Find the maximal entry in the left sub-tree, if there is a left sub-tree. Otherwise find the first ancestor v such that the entry is in v’s right sub-tree.
In either case, if the root node is reached without finding such an ancestor, no predecessor/successor exists.
Show examples of each case.
How long do these operations take?
NEXT: Insertion and Deletion.
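The two rules translate directly into code if each node keeps a pointer to its parent. That parent pointer, the Node class, and the attach helper are assumptions of this sketch. Each operation walks either down or up a single path, so it takes time proportional to the height of the tree.

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None
        self.parent = None

def attach(parent, left=None, right=None):
    """Helper for building the example tree: set children and parent links."""
    parent.left, parent.right = left, right
    for child in (left, right):
        if child is not None:
            child.parent = parent

def minimum(x):
    while x.left is not None:
        x = x.left
    return x

def maximum(x):
    while x.right is not None:
        x = x.right
    return x

def successor(x):
    if x.right is not None:
        return minimum(x.right)
    # Climb until we leave a left sub-tree; that ancestor is the successor.
    v = x.parent
    while v is not None and x is v.right:
        x, v = v, v.parent
    return v            # None if we climbed past the root

def predecessor(x):
    if x.left is not None:
        return maximum(x.left)
    # Climb until we leave a right sub-tree; that ancestor is the predecessor.
    v = x.parent
    while v is not None and x is v.left:
        x, v = v, v.parent
    return v            # None if we climbed past the root

# Example tree: 8 at the root, 3 and 10 below it, then 1, 6 and 14.
nodes = {k: Node(k) for k in (8, 3, 10, 1, 6, 14)}
attach(nodes[8], nodes[3], nodes[10])
attach(nodes[3], nodes[1], nodes[6])
attach(nodes[10], None, nodes[14])
```

In this tree, the successor of 3 comes from its right sub-tree (6), the successor of 6 is found by climbing to the ancestor 8, and 14 has no successor.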

Simple Insertion and Deletion
Insertion: Traverse the tree as you would when searching. When the required branch does not exist, attach the new entry at that location.
Deletion: Three possible cases exist:
  a) Entry is a leaf: Just delete it.
  b) Entry has one child: Remove the entry, replacing it with its child.
  c) Entry has two children: Replace the entry with its successor. The successor has at most one child (why?); use step a or b on it.
Ideally you want a balanced tree. We will talk about how to have slightly more complicated insertion and deletion later in the class.
How long do these operations take?
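A recursive sketch of simple insertion and the three deletion cases follows. It assumes distinct keys and no parent pointers; the Node class and function names are chosen for this example. Each function returns the (possibly new) root of the subtree it was given, and both follow a single path, so they take time proportional to the height of the tree.

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None

def insert(root, key):
    """Insert key and return the root of the updated subtree."""
    if root is None:
        return Node(key)              # the missing branch: attach here
    if key < root.key:
        root.left = insert(root.left, key)
    else:
        root.right = insert(root.right, key)
    return root

def delete(root, key):
    """Delete key (if present) and return the root of the updated subtree."""
    if root is None:
        return None
    if key < root.key:
        root.left = delete(root.left, key)
    elif key > root.key:
        root.right = delete(root.right, key)
    else:
        # Cases a and b: zero or one child, so splice the node out.
        if root.left is None:
            return root.right
        if root.right is None:
            return root.left
        # Case c: two children. Copy the successor's key here, then
        # delete the successor, which has no left child.
        succ = root.right
        while succ.left is not None:
            succ = succ.left
        root.key = succ.key
        root.right = delete(root.right, succ.key)
    return root

def inorder(root):
    if root is None:
        return []
    return inorder(root.left) + [root.key] + inorder(root.right)
```

Usage follows the pattern root = insert(root, key) and root = delete(root, key), starting from root = None.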

Simple binary search trees
What is the expected height of a binary search tree?
Difficult to compute if we allow both insertions and deletions.
With insertions only (in random order), the analysis of section 12.4 shows that the expected height is O(log n).

Tree-Balancing Algorithms
Red-Black Trees
Splay Trees
Others:
  AVL Trees
  2-3 Trees and Trees

Manipulating Search Trees

Red-Black Trees
All nodes in the tree are either red or black.
Every null-child is included and colored black.
All red nodes must have two black children.
Every path from the root to a leaf must have the same number of black nodes.
How balanced a tree will this produce?
How hard will it be to maintain?
  At most 2 rotations on an insert, 3 on deletion.

Example Red-Black Tree

Splay trees
No adjustment is done in a splay tree when nodes are inserted or removed.
All rotations occur within the Search function: the element being searched for is rotated to the root of the tree.
Individual operations may take O(n) time.
However, it can be shown that any sequence of m operations, including n insertions, starting with an empty tree takes O(m log n) time.

Splay trees
Dynamic optimality conjecture: splay trees are asymptotically as fast on any sequence of operations as any other type of search tree with rotations.
What does this mean?
  The worst-case sequence of splay tree operations takes amortized O(log n) time per operation.
  Some sequences of operations take less, e.g. accessing the same ten items over and over again.
  A splay tree should then take less on these sequences as well.
One special case that has been proven: when searching in order from the smallest key to the largest key, the total time for all n operations is O(n).

Splay Tree Example

Specialized Data Structures
Strings
Geometric shapes
Graphs
Sets
Schedules
