Boosting Textual Compression in Optimal Linear Time.


Disclaimer The author of this presentation, henceforth referred to as The Author, should not be held accountable for any mental illness, confusion, disorientation, or general lack of will to live caused, directly or indirectly, by prolonged exposure to this material.

Introduction A boosting technique, in very informal terms, can be seen as a method that, when applied to a particular class of algorithms, yields improved algorithms in terms of one or more parameters characterizing their performance in the class. General boosting techniques have a deep significance for Computer Science. Using such techniques, one can, informally, take a good algorithm and, applying the boosting technique on it, get a very high-quality algorithm, again in terms of the parameters characterizing the nature of the problem.

Introduction (cont) In the past weeks, I am sure we have all been convinced of the importance of textual compression to our field of study. Given that, we would like to come up with a boosting technique that improves existing textual compression algorithms while incurring the smallest possible loss in the algorithms' asymptotic time and space complexity. In general, such efficient boosting techniques are very hard to come by. In this class I will present one such boosting technique for improving textual compression algorithms.

Presentation Outline For a change, this presentation will begin with the results of the boosting technique; only then will I elaborate further. As with all previous presentations, I will have to introduce many new definitions, and repeat a few that we have already seen. It's not going to be easy, so bear with me. Once the new definitions are all clear, we will see the pseudocode for the boosting technique. Assuming that the definitions are indeed clear, the technique itself is quite straightforward. To conclude this presentation, I will show some remaining open problems.

Statement of Results Let s be a string over a finite alphabet Σ. Let $H_k(s)$ denote the k-th order empirical entropy of s, and let $H_k^*(s)$ be the k-th order modified empirical entropy of s, both of which will be defined soon enough. Also, let us recall the Burrows-Wheeler Transform that, given a string s, computes a permutation of that string, hereby denoted BWT(s). Let us consider a compression algorithm A that compresses any string z# in at most $\lambda |z| H_0(z) + \eta |z| + \mu$ bits, where λ, η and μ are constants independent of z, and # is a special symbol not appearing elsewhere in z. A general outline of the boosting technique will be shown in the next slide.

Statement of Results (cont) Here are the three major steps of the technique: 1. Compute $BWT(s^R)$, where $s^R$ is the reversal of s. 2. Using the suffix tree of $s^R$, greedily partition $BWT(s^R)$ so that a suitably defined objective function is minimized. 3. Compress each substring of the partition, separately, using algorithm A.

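Statement of Results (cont) We will show that for any $k \ge 0$, the length in bits of the string resulting from the boosting is bounded by: $\lambda |s| H_k(s) + \log |s| + \eta |s| + g_k$, where $g_k$ is a constant independent of s. If we rely on the stronger assumption that A compresses every string z# in at most $\lambda |z| H_0^*(z) + \mu$ bits then the following improved bound can be achieved: $\lambda |s| H_k^*(s) + \log |s| + g_k$.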

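Definitions Let s be a string over the alphabet $\Sigma = \{a_1, \ldots, a_h\}$ and, for each $a_i \in \Sigma$, let $n_i$ be the number of occurrences of $a_i$ in s. We will assume that all logarithms are taken to base 2 and that $0 \log 0 = 0$. The zeroth order empirical entropy of s is: $H_0(s) = -\sum_{i=1}^{h} \frac{n_i}{|s|} \log \frac{n_i}{|s|}$. For any string w, we denote by $w_s$ the string of single symbols following the occurrences of w in s, taken from left to right. For example, if s = mississippi and w = si then $w_s$ = sp. We define the k-th order entropy as: $H_k(s) = \frac{1}{|s|} \sum_{w \in \Sigma^k} |w_s| \, H_0(w_s)$.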
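
The definitions on this slide can be checked on small strings with a short sketch. It assumes the standard formulas (base-2 logarithms, $H_0(s) = -\sum_i (n_i/|s|) \log (n_i/|s|)$, and $H_k(s)$ as the weighted average of $H_0(w_s)$ over all contexts w of length k); the function names `h0`, `following` and `hk` are made up for this illustration.

```python
from collections import Counter
from math import log2


def h0(s: str) -> float:
    """Zeroth order empirical entropy of s in bits per symbol (0 for the empty string)."""
    n = len(s)
    if n == 0:
        return 0.0
    return -sum((c / n) * log2(c / n) for c in Counter(s).values())


def following(s: str, w: str) -> str:
    """w_s: the single symbols following the occurrences of w in s, left to right."""
    k = len(w)
    return "".join(s[i + k] for i in range(len(s) - k) if s[i:i + k] == w)


def hk(s: str, k: int) -> float:
    """k-th order empirical entropy: (1/|s|) * sum over contexts w of |w_s| * H_0(w_s)."""
    # Contexts that never occur in s contribute nothing, so only observed ones are visited.
    contexts = {s[i:i + k] for i in range(len(s) - k)}
    total = 0.0
    for w in contexts:
        w_s = following(s, w)
        total += len(w_s) * h0(w_s)
    return total / len(s)
```

For s = mississippi, `following(s, "si")` gives the string sp from the example above, and `hk(s, 0)` coincides with `h0(s)`.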

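Definitions (cont) Now, we shall define the zeroth order modified empirical entropy: $H_0^*(s) = 0$ if $|s| = 0$; $H_0^*(s) = (1 + \lfloor \log |s| \rfloor) / |s|$ if $|s| \neq 0$ and $H_0(s) = 0$; and $H_0^*(s) = H_0(s)$ otherwise. To define the k-th order modified empirical entropy, I will introduce the notion of suffix cover: we say that a set $S_k$ of substrings of s of length at most k is a suffix cover of $\Sigma^k$, and write $S_k \preceq \Sigma^k$, if every string in $\Sigma^k$ has a unique suffix in $S_k$. For example, if $\Sigma = \{a, b\}$ and k = 3 then both $\{a, b\}$ and $\{a, ab, abb, bbb\}$ are suffix covers for $\Sigma^3$.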
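
A minimal sketch of the two notions on this slide. It assumes the usual definition of the modified entropy (0 for the empty string, $(1 + \lfloor \log |s| \rfloor)/|s|$ for a string made of one repeated symbol, $H_0(s)$ otherwise) and checks the suffix cover property by brute force over $\Sigma^k$; the names `h0_star` and `is_suffix_cover` are illustrative only.

```python
from collections import Counter
from itertools import product
from math import floor, log2


def h0(s: str) -> float:
    """Zeroth order empirical entropy of s (0 for the empty string)."""
    n = len(s)
    if n == 0:
        return 0.0
    return -sum((c / n) * log2(c / n) for c in Counter(s).values())


def h0_star(s: str) -> float:
    """Modified zeroth order entropy: never 0 for a non-empty string."""
    n = len(s)
    if n == 0:
        return 0.0
    if len(set(s)) == 1:  # H_0(s) = 0: only the length of s has to be encoded
        return (1 + floor(log2(n))) / n
    return h0(s)


def is_suffix_cover(cover: set, alphabet: str, k: int) -> bool:
    """True if every string of length k over the alphabet has exactly one suffix in cover."""
    for letters in product(alphabet, repeat=k):
        x = "".join(letters)
        if sum(1 for w in cover if x.endswith(w)) != 1:
            return False
    return True
```

For instance, `is_suffix_cover({"a", "ab", "abb", "bbb"}, "ab", 3)` holds, while dropping bbb from the set leaves the string bbb without any suffix in the cover.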

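Definitions (cont) We now define, for every suffix cover $S_k$: $H_{S_k}^*(s) = \frac{1}{|s|} \sum_{w \in S_k} |w_s| \, H_0^*(w_s)$. Now we can finally define the k-th order modified empirical entropy of s: $H_k^*(s) = \min_{S_k \preceq \Sigma^k} H_{S_k}^*(s) = H_{S_k^*}^*(s)$, for some optimal suffix cover $S_k^*$.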

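Definitions (cont) Three more notations worth mentioning, but briefly, are $\overleftarrow{w}_s$, $\overleftarrow{H}_k(s)$ and $\overleftarrow{H}_k^*(s)$, where $\overleftarrow{w}_s$ is the string of single characters preceding the occurrences of w in s from left to right, and the two entropies are defined like $H_k(s)$ and $H_k^*(s)$ with $\overleftarrow{w}_s$ in place of $w_s$. I also wish to introduce the notion of prefix cover, which is equivalent to the notion of suffix cover, just with prefixes instead of suffixes. That is, $P_k$ is a prefix cover of $\Sigma^k$ if every string in $\Sigma^k$ has a unique prefix in $P_k$.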

Definitions (cont) Let us recall that BWT(s) constructs a matrix whose rows are the cyclic shifts of s\$ sorted in lexicographical order and returns the last column of that matrix. Let w be a substring of s. Then, by the matrix's construction, all of the rows prefixed by w are consecutive (because the matrix is sorted in lexicographical order). This means that the single symbols preceding every occurrence of w in s are grouped together in a set of consecutive positions of the string $\hat{s} = BWT(s)$. We denote this substring by $\hat{s}[w]$. It is easy to see that $\hat{s}[w]$ is a permutation of $\overleftarrow{w}_s$, the string of single symbols preceding the occurrences of w in s.

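Definitions (cont) Example: BWT matrix for the string t = mississippi. Let w = s. The four occurrences of s in t are in the last four rows of the BWT matrix. Then the corresponding symbols of BWT(t) are $\hat{t}[w]$ = ssii, and that is indeed a permutation of isis, the symbols preceding the occurrences of s in t.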
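
The example can be reproduced with a naive sketch that builds the BWT matrix by sorting all cyclic shifts (quadratic space, for illustration only; practical implementations go through suffix arrays). The helper names `bwt_rows`, `bwt`, `block` and `preceding` are made up here, and `preceding` ignores an occurrence of w at the very start of the string, which has no preceding symbol.

```python
def bwt_rows(s: str, end: str = "$") -> list:
    """Rows of the BWT matrix: all cyclic shifts of s + end, sorted lexicographically."""
    t = s + end
    return sorted(t[i:] + t[:i] for i in range(len(t)))


def bwt(s: str, end: str = "$") -> str:
    """Last column of the BWT matrix."""
    return "".join(row[-1] for row in bwt_rows(s, end))


def block(s: str, w: str, end: str = "$") -> str:
    """Last-column symbols of the (consecutive) rows prefixed by w."""
    return "".join(row[-1] for row in bwt_rows(s, end) if row.startswith(w))


def preceding(s: str, w: str) -> str:
    """Symbols preceding the occurrences of w in s, taken from left to right."""
    k = len(w)
    return "".join(s[i - 1] for i in range(1, len(s) - k + 1) if s[i:i + k] == w)
```

With t = mississippi, `bwt(t)` is ipssm\$pissii, `block(t, "s")` is ssii, and `preceding(t, "s")` is isis, so the two strings contain the same symbols in a different order.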

Definitions (cont) Let T be the suffix tree of the string s\$. We assume that the suffix tree edges are sorted lexicographically. If so, then the i-th leaf (counting from the left) of the suffix tree corresponds to the i-th row of the BWT matrix. We associate the i-th leaf of T with the i-th symbol of the string $\hat{s} = BWT(s)$. I'll denote the i-th leaf of T by $\ell_i$ and the symbol associated with it by $\hat{\ell}_i$. By definition, $\hat{s} = \hat{\ell}_1 \hat{\ell}_2 \cdots \hat{\ell}_{|s|+1}$.

Definitions (cont) Let w be a substring of s. The locus of w, denoted $\tau[w]$, is the node of T associated with the shortest string prefixed by w.

Definitions (cont) Example: suffix tree for the string s = mississippi\$. The locus of the substrings ss and ssi is the node reachable by the path labelled by ssi.
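
The locus can be computed naively without building the tree, as a sketch: in the suffix tree of a string that ends with a unique end marker, the strings spelled by the nodes are the empty string (root), every suffix (leaves), and every substring that is followed by at least two distinct symbols (internal nodes). The helpers `node_strings` and `locus` below are hypothetical names, and the enumeration takes quadratic time and space.

```python
def node_strings(t: str) -> set:
    """Path labels of all nodes of the suffix tree of t (t must end with a unique marker)."""
    n = len(t)
    nodes = {""} | {t[i:] for i in range(n)}  # root and leaves
    followers = {}
    for i in range(n):
        for j in range(i + 1, n):  # substring t[i:j] is followed by the symbol t[j]
            followers.setdefault(t[i:j], set()).add(t[j])
    # Internal nodes are exactly the substrings that can continue in two or more ways.
    nodes |= {x for x, f in followers.items() if len(f) >= 2}
    return nodes


def locus(t: str, w: str) -> str:
    """Shortest node label that has w as a prefix (raises ValueError if w is not in t)."""
    return min((x for x in node_strings(t) if x.startswith(w)), key=len)
```

For t = mississippi\$, both `locus(t, "ss")` and `locus(t, "ssi")` return ssi, matching the example, because ss is always followed by i and therefore does not end at a node.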

Definitions (cont) Another very important notion I would like to introduce is that of the leaf cover. Given a suffix tree T we say that a subset L of its nodes is a leaf cover if every leaf of the suffix tree has a unique ancestor in L. For every node u of T we will denote by $\hat{s}\langle u \rangle$ the substring of $\hat{s} = BWT(s)$ obtained by concatenating, from left to right, the symbols associated with the leaves descending from node u. For example, in the suffix tree from the previous slide, if u is the locus of the string i then $\hat{s}\langle u \rangle$ = pssm.

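Definitions (cont) Note that these symbols are exactly the single symbols preceding i in mississippi\$. That is, for any string w with locus $\tau[w]$ we have $\hat{s}\langle \tau[w] \rangle = \hat{s}[w]$, the block of BWT symbols taken from the rows prefixed by w.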

Definitions (cont) A key observation in this article is the natural relation between leaf covers and prefix covers. Let $P_k^*$ be the optimal prefix cover defining $\overleftarrow{H}_k^*(s)$ and let $L(P_k^*)$ be the set of nodes $\{\tau[w] : w \in P_k^*\}$, that is, the loci of the strings in $P_k^*$. Since $P_k^*$ is a prefix cover of $\Sigma^k$, every leaf of T corresponding to a suffix of length greater than k has a unique ancestor in $L(P_k^*)$. On the other hand, leaves of T corresponding to suffixes of length smaller than k might not have an ancestor in $L(P_k^*)$. We would like to enhance $L(P_k^*)$ in a way that will make it a leaf cover of T.

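Definitions (cont) We will denote by $Q_k$ the set of leaves corresponding to suffixes of s\$ of length at most k which are not prefixed by a string in the optimal prefix cover $P_k^*$. We set $L_k^* = L(P_k^*) \cup Q_k$, where $L(P_k^*)$ is the set of loci of the strings in $P_k^*$. Note that $|Q_k| \le k$, because s\$ has at most k suffixes of length at most k. This relation is exploited next.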

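Definitions (cont) The Cost of a Leaf Cover: Let C denote the function which associates to every string x over $\Sigma \cup \{\$\}$, with at most one occurrence of \$, the positive real value where are constants and x' is the string x with the symbol \$ removed, if it was present. We will now define the value of C for a leaf cover L as the sum of the costs of the substrings of $\hat{s} = BWT(s)$ attached to its nodes: $C(L) = \sum_{u \in L} C(\hat{s}\langle u \rangle)$.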

Definitions (cont) In this section, I only have the following lemma left to prove: For any given $k \ge 0$ there exists a constant $g_k$ such that for any string s: The next three slides detail the proof of the lemma.

Definitions (cont) Let us recall that and that by definition. If so, then the following equation obviously holds: Observe that every is a leaf of T. By the definition of C we get that for every :

Definitions (cont) Also, recall that. Combined, we get that summation (2) is bounded by. To evaluate summation (1), recall that every is the locus of a string. By the relation between the suffix tree and the BWT matrix we have that. Also,. Then we get:

Definitions (cont) For the last step, recall that is a permutation of and therefore and, obviously,. Finally, we get:

Computing the Optimal Leaf Cover Now that we're finally done with all of the required definitions, we can get down to business. Perhaps the most important aspect of this boosting technique is that the optimal leaf cover can be computed in time linear in |s|. In the following slides I will present an algorithm that computes that optimal leaf cover in linear time, and prove its correctness and time complexity.

Computing the Optimal Leaf Cover (cont) Before I show the actual algorithm, I will prove the following lemma: An optimal leaf cover for the subtree rooted at u consists of either the single node u, or of the union of optimal leaf covers of the subtrees rooted at the children of u in T.

Computing the Optimal Leaf Cover (cont) Proof: Let $L_{min}(u)$ denote the optimal leaf cover for the subtree of T rooted at u. If u is a leaf then the result obviously holds. We assume then that u is an internal node and that $u_1, \ldots, u_c$ are its children. It's obvious that $\{u\}$ and $\bigcup_{i=1}^{c} L_{min}(u_i)$ are both leaf covers of the subtree rooted at u. I will show that one of them is optimal.

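Computing the Optimal Leaf Cover (cont) Let's assume that $L_{min}(u) \neq \{u\}$. We can then say that $L_{min}(u) = \bigcup_{i=1}^{c} L(u_i)$, where $u_1, \ldots, u_c$ are the children of u and each $L(u_i)$ is a leaf cover (not necessarily the optimal one) for the subtree rooted at $u_i$. Then the following holds: $C(L_{min}(u)) = \sum_{i=1}^{c} C(L(u_i)) \ge \sum_{i=1}^{c} C(L_{min}(u_i)) = C\left(\bigcup_{i=1}^{c} L_{min}(u_i)\right)$.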

Computing the Optimal Leaf Cover (cont) Since the cost of the optimal leaf cover is smaller than or equal to that of any other leaf cover, we get that: $C(L_{min}(u)) = C\left(\bigcup_{i=1}^{c} L_{min}(u_i)\right)$, where $u_1, \ldots, u_c$ are the children of u. This means that the union of the optimal leaf covers of the trees rooted at the children of u is indeed an optimal leaf cover for the tree rooted at u.

Computing the Optimal Leaf Cover (cont) The following algorithm computes the optimal leaf cover in linear time: The algorithm's correctness follows immediately from the previous lemma. I will show that it runs in O(|s|) time.

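Computing the Optimal Leaf Cover (cont) The only nontrivial operation in the algorithm is the calculation of $C(\hat{s}\langle u \rangle)$ at each step, where $\hat{s}\langle u \rangle$ is the string of BWT symbols associated with the leaves below u. To do that, we have to know the number of occurrences of each symbol of the alphabet in the string $\hat{s}\langle u \rangle$ (because in order to calculate the cost of a string, we have to calculate its zeroth order modified empirical entropy). Doing this is possible in constant time for each node because if u is a leaf then each symbol in the alphabet appears either once or never in $\hat{s}\langle u \rangle$.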

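Computing the Optimal Leaf Cover (cont) If u is not a leaf, then the number of occurrences of each symbol in $\hat{s}\langle u \rangle$ is the sum of the numbers of its occurrences in $\hat{s}\langle u_1 \rangle, \ldots, \hat{s}\langle u_c \rangle$, where $u_1, \ldots, u_c$ are the children of u (recall that $\hat{s}\langle u \rangle$ is the concatenation of $\hat{s}\langle u_1 \rangle, \ldots, \hat{s}\langle u_c \rangle$). Now we are finally ready to see the actual algorithm describing the boosting technique.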
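
The bottom-up computation described on the last few slides can be sketched as follows. This is not the linear-time algorithm from the slides: the tree is simulated by recursing over ranges of the sorted suffixes of s\$, which is quadratic in the worst case and recursion-depth bound, so it is meant for small inputs only. The cost function is an assumption made for this illustration, $C(x) = \lambda |x'| H_0^*(x') + \mu$ with arbitrary default constants; it is evaluated from symbol counts alone, and the counts of a node are obtained by summing those of its children, as explained above.

```python
from collections import Counter
from math import floor, log2


def cost(counts: Counter, lam: float = 1.0, mu: float = 2.0) -> float:
    """Assumed cost lam * |x'| * H0*(x') + mu, computed from the symbol counts of x."""
    sizes = [c for sym, c in counts.items() if sym != "$"]  # x' drops the end marker
    n = sum(sizes)
    if n == 0:
        return mu
    if len(sizes) == 1:  # one repeated symbol: |x'| * H0*(x') = 1 + floor(log2 n)
        return lam * (1 + floor(log2(n))) + mu
    return lam * sum(c * log2(n / c) for c in sizes) + mu


def optimal_leaf_cover(s: str):
    """Return (cost, pieces); pieces split BWT(s) according to a minimum-cost leaf cover."""
    t = s + "$"
    suffixes = sorted(t[i:] for i in range(len(t)))          # leaves, left to right
    hat = "".join(t[len(t) - len(x) - 1] for x in suffixes)  # symbol attached to each leaf

    def visit(lo: int, hi: int):
        """Handle the node whose leaves are suffixes[lo:hi]; return (counts, cost, cover)."""
        if hi - lo == 1:
            counts = Counter(hat[lo])
            return counts, cost(counts), [(lo, hi)]
        d = 0  # length of the node's label: common prefix of its first and last leaf
        while suffixes[lo][d] == suffixes[hi - 1][d]:
            d += 1
        counts, children_cost, cover, start = Counter(), 0.0, [], lo
        for i in range(lo + 1, hi + 1):
            if i == hi or suffixes[i][d] != suffixes[start][d]:  # a child ends at row i
                child_counts, child_cost, child_cover = visit(start, i)
                counts += child_counts      # counts of a node = sum over its children
                children_cost += child_cost
                cover += child_cover
                start = i
        own = cost(counts)
        if own <= children_cost:            # either the node alone ...
            return counts, own, [(lo, hi)]
        return counts, children_cost, cover  # ... or the union of the children's covers

    _, best, cover = visit(0, len(t))
    return best, [hat[lo:hi] for lo, hi in cover]
```

Calling `optimal_leaf_cover("mississippi")` returns the minimum cost together with substrings whose concatenation is BWT(mississippi) = ipssm\$pissii. Larger values of `mu` favour fewer, longer pieces, since every piece pays the additive constant once.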

The Boosting Technique The following algorithm describes the technique:

The Boosting Technique First, any compression algorithm we wish to use the boosting technique on has to satisfy the following property: A is a compression algorithm such that, given an input string x, A first appends an end-of-string symbol # to x and then compresses x# with the following space and time bounds: 1. A compresses x# in at most bits. 2. The running time of A is T(|x|) and its working space is S(|x|), where T is convex and S is non-decreasing.

The Boosting Technique The boosting algorithm can be used on any algorithm satisfying the previous property to boost its compression up to the k-th order entropy, for any k, without any asymptotic loss in time efficiency and with only a slightly larger working space.
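
To make the overall shape of the technique concrete, here is a deliberately simplified sketch. Two things are swapped in and are not part of the technique itself: the partition of the BWT is taken from fixed contexts of length k (rows sharing their first k symbols) rather than from the optimal leaf cover, and `zlib.compress` stands in for the base compressor A even though it is not a zeroth order compressor of the kind the property above asks for. A real encoder would also have to store the block boundaries and the position of the end marker.

```python
import zlib


def context_partition(s: str, k: int) -> list:
    """Split BWT(s) into blocks of consecutive rows that share their first k symbols."""
    t = s + "$"
    suffixes = sorted(t[i:] for i in range(len(t)))      # same order as the BWT rows
    blocks = {}
    for x in suffixes:
        symbol = t[len(t) - len(x) - 1]                   # BWT symbol of this row
        blocks.setdefault(x[:k], []).append(symbol)       # equal prefixes are adjacent
    return ["".join(b) for b in blocks.values()]


def boost(s: str, k: int, compress=zlib.compress) -> list:
    """Compress every block of the partition separately with the base compressor."""
    return [compress(block.encode("ascii")) for block in context_partition(s, k)]
```

For s = mississippi and k = 1 the blocks are i, pssm, \$, pi and ssii, whose concatenation is BWT(s); decompressing the outputs of `boost` block by block returns exactly these strings.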

The Boosting Technique Theorem: Given a compression algorithm A that satisfies the aforementioned property, our boosting technique yields the following results: 1. If applied to s, it compresses it within bits, for any k. 2. If applied to, it compresses it within bits, for any k.