# Order Statistics Sorted


Order Statistics Sorted
Find the key that is smaller than exactly k of the n keys

Order Statistics
Statistics: Methods for combining a large amount of data (such as the scores of the whole class on a homework) into a single number or small set of numbers that gives a representative value of the data. The phrase "order statistics" refers to statistical methods that depend only on the ordering of the data and not on its numerical values. The average of the data, while easy to compute and very important as an estimate of a central value, is NOT an order statistic.

Concept of robustness of estimation
Order Statistics
Mode (most commonly occurring value) also does not depend on ordering, yet the most efficient methods for computing it in a comparison-based model involve sorting algorithms.
Median: The most commonly used order statistic, the value in the middle position in the sorted order of the values. The median can be obtained easily in O(n log n) time via sorting. Is it possible to do better?

Randomized Algorithms
An algorithm that uses random "bits" to guide its behavior so as to achieve good "average case" performance. Formally, the algorithm's performance is a random variable that depends on the random bits. The "worst case" is typically so unlikely to occur that it can be ignored.

Randomized Algorithms
A randomized algorithm has access to a source of independent, unbiased random bits (in practice, pseudo-random numbers) and is allowed to use these random bits to influence its computation.
[Figure: the algorithm takes the input and the random bits and produces the output.]

Randomized Algorithms
Las Vegas algorithms: A randomized algorithm that always outputs the correct answer; there is only a small probability that it takes a long time to execute.
Monte Carlo algorithms: Sometimes we want the algorithm to always complete quickly, but allow a small probability of error.
Any Las Vegas algorithm can be converted into a Monte Carlo algorithm by outputting an arbitrary, possibly incorrect answer if it fails to complete within a specified time.

Randomized Quick Sort
In traditional Quick Sort, we always pick the first element as the pivot for partitioning. The worst-case runtime is O(n^2), while the expected runtime is O(n log n) when averaged over the set of all inputs. Therefore, some inputs are bound to have a long runtime, e.g., an inversely sorted list.

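Randomized Quick Sort
In randomized Quick Sort, we pick an element uniformly at random as the pivot for partitioning. The expected runtime on any input is O(n log n). Even a pivot that is far from the median, for example one that yields a 90%/10% split at every level, still gives O(n log n).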
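The random pivot choice described above can be sketched as follows. This is a minimal illustration, not the in-place partitioning the slides presumably have in mind: it builds new lists for clarity, and the function name `randomized_quicksort` and the `rng` parameter are assumptions made for this example.

```python
import random

def randomized_quicksort(items, rng=random):
    """Return a new sorted list, choosing each pivot uniformly at random."""
    if len(items) <= 1:
        return list(items)
    pivot = rng.choice(items)                     # uniform random pivot
    smaller = [x for x in items if x < pivot]
    equal = [x for x in items if x == pivot]      # keeps duplicates together
    larger = [x for x in items if x > pivot]
    return (randomized_quicksort(smaller, rng)
            + equal
            + randomized_quicksort(larger, rng))
```

Because the pivot no longer depends on the input order, an inversely sorted list is treated like any other input.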

Randomized Algorithms: Motivating Example
Problem: Find an 'a' in an array of n elements, given that half are 'a's and the other half are 'b's.
Solution: Look at each element of the array in turn. This requires n/2 + 1 lookups in the worst case, when the array is ordered with all the 'b's first followed by the 'a's. Checking in the reverse order, or checking every second element, has a similar drawback.

Randomized Algorithms: Motivating Example
With any strategy that checks elements in a fixed order, i.e., a deterministic algorithm, we cannot guarantee that the algorithm will complete quickly for all possible inputs. On the other hand, if we check array elements at random, then we will quickly find an 'a' with high probability, whatever the input.
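The random-probing idea, in both its Las Vegas and Monte Carlo forms, might look like the sketch below. The function names and the `max_probes` parameter are hypothetical, chosen only for this illustration.

```python
import random

def find_a_las_vegas(arr, rng=random):
    """Probe random positions until an 'a' is found; the answer is always correct.

    Assumes arr contains at least one 'a'; otherwise this loops forever.
    """
    while True:
        i = rng.randrange(len(arr))
        if arr[i] == 'a':
            return i

def find_a_monte_carlo(arr, max_probes, rng=random):
    """Probe at most max_probes random positions; return None if all miss."""
    for _ in range(max_probes):
        i = rng.randrange(len(arr))
        if arr[i] == 'a':
            return i
    return None  # gave up: bounded running time, small chance of failure
```

When half the elements are 'a's, each probe succeeds with probability 1/2, so the Las Vegas version needs 2 probes in expectation and the Monte Carlo version fails with probability (1/2)^max_probes, independent of how the array is arranged.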

Order Statistics
The ith order statistic in a set of n elements is the ith smallest element.
The minimum is thus the 1st order statistic.
The maximum is the nth order statistic.
The median is the n/2 order statistic. If n is even, there are 2 medians.
How can we calculate order statistics? What is the running time?

Selection problem
Given a list of n items, and a number k between 1 and n, find the item that would be kth if we sorted the list. The median is the special case of this for which k=n/2. We'll see two algorithms: a randomized one based on quicksort ("quickselect") and a deterministic one. The randomized one is easier to understand and better in practice, so we'll do it first. Let's warm up with some cases of selection that don't have much to do with medians (because k is very far from n/2).

Selection problem: 2nd best search
If k=1, the selection problem is trivial: just select the minimum element. As usual we maintain a value x that is the minimum seen so far, and compare it against each successive value, updating it when something smaller is seen.

    min(L) {
        x = L[1]
        for (i = 2; i <= n; i++)
            if (L[i] < x) x = L[i]
        return x
    }

What if you want to select the second best?

Selection problem: 2nd best search
One possibility: Follow the same general strategy, but modify min(L) to keep two values, the best and second best seen so far. Compare each new value against the second best, to tell whether it is in the top two, but then if we discover that a new value is one of the top two so far we need to tell whether it's best or second best.
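A minimal sketch of this two-value strategy is given below, with "best" taken to mean smallest as in min(L). The function name `second_smallest` and the returned comparison counter are assumptions added for illustration; the counter tallies only comparisons between list values.

```python
def second_smallest(values):
    """Return (second smallest value, number of value comparisons made)."""
    if len(values) < 2:
        raise ValueError("need at least two values")
    comparisons = 1                      # order the first two values
    if values[0] <= values[1]:
        best, second = values[0], values[1]
    else:
        best, second = values[1], values[0]
    for x in values[2:]:
        comparisons += 1                 # first comparison: is x in the top two?
        if x < second:
            comparisons += 1             # second comparison: best or second best?
            if x < best:
                best, second = x, best
            else:
                second = x
    return second, comparisons
```

On a list in decreasing order both comparisons happen in every iteration, while on a list in increasing order only the first one does.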

Selection problem: 2nd best search
Some interesting behavior shows up when we try to analyze it.
Worst case: The list may be sorted in decreasing order, so each of the n-2 iterations of the loop performs 2 comparisons. Together with the one comparison between the first two values, the total is then 2n-3 comparisons.
Average case: (assuming any permutation of L is equally likely) the first comparison in each iteration still always happens. But the second only happens when L[i] is one of the two smallest values among the first i. Each of the first i values is equally likely to be one of these two, so this is true with probability 2/i. The total expected number of times we make the second comparison is
$$\sum_{i=3}^{n} \frac{2}{i} = 2\sum_{i=3}^{n} \frac{1}{i}.$$

Selection problem: 2nd best search
Conclusion: The sum (for i from 1 to n) of 1/i, known as the nth harmonic number (a partial sum of the harmonic series), is ln n + O(1); this can be proved using calculus, by comparing the sum to a similar integral. Therefore the total expected number of comparisons overall is n + O(log n). This small increase over the n-1 comparisons needed to find the minimum gives us hope that we can perform selection faster than sorting.
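The arithmetic behind this conclusion can be written out as follows. The derivation assumes the accounting used in the worst-case analysis: one comparison between the first two values, then n-2 loop iterations for i = 3, ..., n. Here H_n denotes the sum of 1/i for i from 1 to n.

```latex
\begin{align*}
E[\text{comparisons}]
  &= 1 + (n-2) + \sum_{i=3}^{n} \frac{2}{i} \\
  &= n - 1 + 2\left(H_n - \tfrac{3}{2}\right) \\
  &= n - 4 + 2H_n, \\
\ln(n+1) \;\le\; H_n &\;\le\; 1 + \ln n
  \quad \text{(compare the sum with } \textstyle\int \frac{dx}{x}\text{)}, \\
E[\text{comparisons}] &= n + 2\ln n + O(1) = n + O(\log n).
\end{align*}
```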

Linear-Time Median Selection
Random-Select(S, i), where n = |S|
1. If |S| = 1 then return the single element of S.
2. Choose a random element y uniformly from S.
3. Compare all elements of S to y. Let S1 = {x ∈ S : x ≤ y} and S2 = {x ∈ S : x > y}.
4. If |S1| = n then
4.1 If i = n return y, else S1 = S1 – {y}.
5. If |S1| ≥ i then return Random-Select(S1, i), else return Random-Select(S2, i - |S1|).
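The Random-Select steps above could be sketched like this. It is an iterative rendering under a few assumptions made for this example: the input is a list of comparable values, i is 1-based, and the name `random_select` and the `rng` parameter are illustrative.

```python
import random

def random_select(items, i, rng=random):
    """Return the i-th smallest element of items (i is 1-based)."""
    if not 1 <= i <= len(items):
        raise ValueError("i must be between 1 and len(items)")
    s = list(items)
    while True:
        if len(s) == 1:                  # step 1
            return s[0]
        y = rng.choice(s)                # step 2: uniform random element
        s1 = [x for x in s if x <= y]    # step 3: partition around y
        s2 = [x for x in s if x > y]
        if len(s1) == len(s):            # step 4: y is a maximum of s
            if i == len(s):
                return y
            s1.remove(y)                 # drop one copy of y so s shrinks
        if len(s1) >= i:                 # step 5: continue on the correct side
            s = s1
        else:
            i -= len(s1)
            s = s2
```

For example, `random_select(data, (len(data) + 1) // 2)` returns the lower median of `data`. Step 4 is what guarantees progress: without removing y, a maximal pivot would leave S1 equal to S.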

Linear-Time Median Selection
Given a "black box" O(n) median algorithm, what can we do?
ith order statistic:
Find median x
Partition input around x
if (i ≤ (n+1)/2) recursively find ith element of first half
else find (i - (n+1)/2)th element in second half
T(n) = T(n/2) + O(n) = O(n)
Can you think of an application to sorting?
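The reduction from selection to a median oracle might be sketched as below. Two assumptions are labeled here and in the code: `black_box_median` is only a stand-in built on `statistics.median_low`, which sorts internally and is therefore O(n log n) rather than the O(n) black box the slide posits, and the partition is three-way (less, equal, greater) so that duplicate keys are handled, which differs slightly from the two halves on the slide.

```python
import statistics

def black_box_median(items):
    # Stand-in only: median_low sorts internally, so this is O(n log n),
    # not the linear-time black box assumed in the text.
    return statistics.median_low(items)

def select_via_median(items, i, median=black_box_median):
    """Return the i-th smallest element (1-based) using a median oracle."""
    if not 1 <= i <= len(items):
        raise ValueError("i must be between 1 and len(items)")
    s = list(items)
    while True:
        x = median(s)
        lower = [v for v in s if v < x]
        equal_count = sum(1 for v in s if v == x)
        if i <= len(lower):
            s = lower                       # answer is in the first half
        elif i <= len(lower) + equal_count:
            return x                        # answer is the median itself
        else:
            i -= len(lower) + equal_count   # answer is in the second half
            s = [v for v in s if v > x]
```

Each round keeps at most half of the elements, which is where the T(n) = T(n/2) + O(n) recurrence comes from once a truly linear-time median is plugged in.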

Linear-Time Median Selection
Worst-case O(n lg n) quicksort
Find median x and partition around it
Recursively quicksort two halves
T(n) = 2T(n/2) + O(n) = O(n lg n)
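A sketch of quicksort with the median as pivot follows. As an assumption for this example, the median oracle defaults to `statistics.median_low`, which itself sorts and so does not deliver the stated bound; the O(n lg n) worst case only holds when a genuinely linear-time median routine is supplied. The name `median_pivot_quicksort` is illustrative.

```python
import statistics

def median_pivot_quicksort(items, median=statistics.median_low):
    """Return a new sorted list, always partitioning around the median."""
    if len(items) <= 1:
        return list(items)
    x = median(items)                         # assumed O(n) in the analysis
    lower = [v for v in items if v < x]
    equal = [v for v in items if v == x]
    upper = [v for v in items if v > x]
    # Each side holds at most half of the elements, so T(n) = 2T(n/2) + O(n).
    return (median_pivot_quicksort(lower, median)
            + equal
            + median_pivot_quicksort(upper, median))
```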