# Embedding the Ulam metric into $\ell_1$. For the course “Advanced Data Structures”. Antonis Achilleos (Αντώνης Αχιλλέως)


Metrics

A metric space is a pair $(X, d)$ such that $X$ is a set, $d : X^2 \to \mathbb{R}$, and for all $x, y, z \in X$:

1. $d(x, y) \ge 0$, and $d(x, y) = 0$ iff $x = y$;
2. $d(x, y) = d(y, x)$;
3. $d(x, y) \le d(x, z) + d(z, y)$ (the triangle inequality).

Two metric spaces: edit distance

Let $\Sigma$ be a set of symbols and $\Sigma^n$ the set of strings ($n$-tuples) of characters from $\Sigma$. The edit operations on a string are the following:

- inserting a character,
- deleting a character,
- replacing a character.

For strings $x, y$, let $\mathrm{ed}(x, y)$ be the minimum number of edit operations needed to transform $x$ into $y$. Then $(\Sigma^n, \mathrm{ed})$ is a metric space.

Two metric spaces: the Ulam metric of dimension n

Let $\Sigma$ be as before, but let $P_n$ be the set of strings of $n$ distinct characters from $\Sigma$, where $n = |\Sigma|$. For $x, y \in P_n$, define $\mathrm{UL}(x, y)$ to be the minimum number of character moves needed to transform $x$ into $y$. Then $(P_n, \mathrm{UL})$ is a metric space.

These definitions are limited: we need pairs of strings with different characters, so we let $n < |\Sigma|$ and use $\mathrm{ed}$ instead of $\mathrm{UL}$. For permutations $x, y$ of the same characters,

$$\mathrm{UL}(x, y) \le \mathrm{ed}(x, y) \le 2\,\mathrm{UL}(x, y).$$
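The move distance can be computed without searching over sequences of moves. The sketch below assumes the standard fact that, for two permutations of the same characters, the number of moves equals $n$ minus the length of their longest common subsequence, and that this subsequence becomes a longest increasing subsequence once $x$ is rewritten in terms of positions in $y$. The function names `lis_length` and `ulam_distance` are illustrative and not from the slides.

```python
from bisect import bisect_left


def lis_length(seq):
    """Length of a longest strictly increasing subsequence (patience sorting)."""
    tails = []  # tails[k] = smallest last value of an increasing subsequence of length k + 1
    for value in seq:
        k = bisect_left(tails, value)
        if k == len(tails):
            tails.append(value)
        else:
            tails[k] = value
    return len(tails)


def ulam_distance(x, y):
    """Number of character moves turning x into y.

    Assumes x and y are permutations of the same distinct characters.
    """
    if len(set(x)) != len(x) or sorted(x) != sorted(y):
        raise ValueError("x and y must be permutations of the same distinct characters")
    position_in_y = {c: i for i, c in enumerate(y)}
    return len(x) - lis_length([position_in_y[c] for c in x])


print(ulam_distance("abcde", "bcdea"))  # 1: move 'a' to the end
print(ulam_distance("abcd", "dcba"))    # 3
```

Patience sorting keeps the running time at $O(n \log n)$, which is why the Ulam metric is often regarded as the easier relative of edit distance.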

Embeddings

An embedding of a metric space $(X, d)$ into a target metric space $(Y, m)$ is a mapping $f : X \to Y$ such that there are real numbers $C, s > 0$ with, for all $x, y \in X$,

$$d(x, y) \le s \cdot m(f(x), f(y)) \le C \cdot d(x, y).$$

The minimum $C$ that satisfies the above inequality for some $s$ is called the distortion of the embedding $f$.
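For a finite set of points the distortion can be read off directly from the definition: the best scale $s$ is the largest ratio $d(x,y)/m(f(x),f(y))$, and the best $C$ is that ratio divided by the smallest one. The following is a minimal sketch under that reading; the function name `distortion` and the toy example (points on a line mapped by squaring) are illustrative choices, not part of the presentation.

```python
from itertools import combinations


def distortion(points, d, m, f):
    """Smallest C with d(x,y) <= s*m(f(x),f(y)) <= C*d(x,y) for some s > 0.

    Assumes at least two points and d(x, y) > 0 for distinct points.
    """
    ratios = []
    for x, y in combinations(points, 2):
        image_distance = m(f(x), f(y))
        if image_distance == 0:
            return float("inf")  # f collapses two points, so no finite C works
        ratios.append(d(x, y) / image_distance)
    # s must be at least the largest ratio; C must be at least s / (smallest ratio).
    return max(ratios) / min(ratios)


def line_metric(a, b):
    return abs(a - b)


# Points 0..3 on the real line, mapped by x -> x^2.
print(distortion([0, 1, 2, 3], line_metric, line_metric, lambda x: x * x))
```

An isometry, or any uniform rescaling, gives distortion 1; the squaring map stretches far-apart points more than nearby ones, so its distortion grows with the range of the points.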

Edit distance algorithm in $O(n^2)$

If $\mathrm{LCS}(x, y)$ is the length of a longest common subsequence of $x$ and $y$, where $x, y$ are strings of length $n$, then

$$n - \mathrm{LCS}(x, y) \le \mathrm{ed}(x, y) \le 2(n - \mathrm{LCS}(x, y)).$$
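The slide names the quadratic algorithm without spelling it out. A minimal sketch of the textbook dynamic programs for edit distance and LCS follows, so the sandwich inequality can be checked on examples; the function names are illustrative.

```python
def edit_distance(x, y):
    """Minimum number of insertions, deletions and replacements turning x into y."""
    prev = list(range(len(y) + 1))  # distances from the empty prefix of x
    for i, cx in enumerate(x, start=1):
        cur = [i]
        for j, cy in enumerate(y, start=1):
            cur.append(min(prev[j] + 1,                 # delete cx
                           cur[j - 1] + 1,              # insert cy
                           prev[j - 1] + (cx != cy)))   # replace, or match for free
        prev = cur
    return prev[-1]


def lcs_length(x, y):
    """Length of a longest common subsequence of x and y."""
    prev = [0] * (len(y) + 1)
    for cx in x:
        cur = [0]
        for j, cy in enumerate(y, start=1):
            cur.append(prev[j - 1] + 1 if cx == cy else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


x, y = "abcdef", "bcdefa"
n = len(x)
gap = n - lcs_length(x, y)
print(gap, edit_distance(x, y), 2 * gap)  # 1 2 2
```

Both tables have $(n+1)^2$ cells filled in constant time each, which gives the $O(n^2)$ bound; only two rows are kept in memory at a time.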

Theorem

For every $n$, the Ulam metric of dimension $n$ can be embedded into $\ell_1^{O(|\Sigma|^2)}$ with distortion $O(\log n)$.

Let $n$ be an integer, which we assume to be a power of 2, and let $m = |\Sigma|$, so that we may take $\Sigma = \{1, 2, \ldots, m\}$. The embedding is the following.

The embedding

The embedding is $f : P_n \to \ell_1^{\binom{m}{2}}$. Associate every coordinate of the target space with a distinct pair $\{a, b\}$, where $a, b \in \Sigma$ and $a \ne b$ (written with $a < b$). Every permutation $p \in P_n$ receives the following coordinates in the new space:

$$f(p)_{\{a,b\}} = \begin{cases} \dfrac{1}{p^{-1}(b) - p^{-1}(a)} & \text{if } a \text{ and } b \text{ both appear in } p, \\ 0 & \text{otherwise.} \end{cases}$$

The proof is given by the following two lemmas.

Lemma 1 (Expansion)

Let $p$ and $q$ be permutations of length $n$. Then

$$\|f(p) - f(q)\|_1 \le O(\log n) \cdot \mathrm{ed}(p, q).$$

Proof: First notice that $f$ can be extended to strings of length less than $n$. So, by the triangle inequality, we only need to show the inequality for the case $\mathrm{ed}(x, y) = 1$, where $x$ has size $n$ and $y$ has size $n - 1$. Also, we treat a substitution as a character deletion followed by an insertion.

Proof of Lemma 1 (cont.)

$q$ is obtained from $p$ by deleting $p[s]$ for some $s$. So $p[i] = q[i]$ for $i < s$, and $p[i+1] = q[i]$ for $i \ge s$. Then

$$\|f(p) - f(q)\|_1 = \sum_{a, b \in \Sigma} \left| f(p)_{\{a,b\}} - f(q)_{\{a,b\}} \right|.$$

Ignore pairs $\{a, b\}$ not entirely in $p$. So $a = p[i]$ and $b = p[j]$ with $i < j$, and the cases are distinguished by the position of $s$ relative to $i$ and $j$, in particular $i < s < j$ (worked out on the whiteboard). QED
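The whiteboard computation is not recorded in the transcript. One way to carry it out is sketched below; it is a reconstruction under the definitions above, not necessarily the presenter's argument. Here $H_n$ denotes the $n$-th harmonic number (notation not used in the slides). Pairs with both positions before $s$, or both after $s$, keep the same position difference and contribute nothing. Pairs containing $p[s]$ have coordinate $0$ in $f(q)$. For pairs straddling $s$ the position difference drops from $j - i$ to $j - i - 1$, and at most $d - 1$ such pairs have $j - i = d$.

```latex
\begin{align*}
\|f(p) - f(q)\|_1
  &= \underbrace{\sum_{j \ne s} \frac{1}{|j - s|}}_{\text{pairs containing } p[s]}
   + \underbrace{\sum_{i < s < j} \left| \frac{1}{j - i} - \frac{1}{j - i - 1} \right|}_{\text{pairs straddling } s} \\
  &\le 2 H_n + \sum_{d = 2}^{n} (d - 1) \cdot \frac{1}{d (d - 1)} \\
  &\le 3 H_n = O(\log n).
\end{align*}
```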

Definitions needed

- $\mathrm{LIS}(p)$: the length of a longest increasing subsequence of $p$.
- breakpoint: a position $i \in [k-1]$ such that $p[i] > p[i+1]$.
- $b(p)$: the number of breakpoints in $p$.
- $p_0, p_1$ are a partition of $p$ if they are disjoint subsequences of $p$ and every $x$ of $p$ appears in $p_0$ or $p_1$.
- block: a pair of positions $\{2i - 1, 2i\}$.
- a partition $p_0, p_1$ is block-balanced if it also splits every block, with one element of the block in each part.

Proposition 1

Let $p$ be a permutation of length $k$, $k$ even. Then, for every block-balanced partition of $p$ into $p_0, p_1$,

$$\mathrm{LIS}(p) \ge \mathrm{LIS}(p_0) + \mathrm{LIS}(p_1) - 2b(p).$$

We will prove that $\mathrm{LIS}(p) \ge 2\,\mathrm{LIS}(p_0) - 2b(p)$. By symmetry the same holds for $p_1$, and averaging the two inequalities gives the claim. The argument follows.

Argument points

- We try to augment a longest increasing subsequence of $p_0$ with points from $p_1$.
- If $j$ is a position in $p_0$, let $j'$ be the position such that $\{j', j\}$ is a block.
- If $j$ is in the chosen increasing subsequence of $p_0$, then $j'$ is a candidate.
- The number of candidates is $\mathrm{LIS}(p_0)$.
- The increasing subsequence of $p_0$ can always be augmented by at least $\mathrm{LIS}(p_0) - 2b(p)$ candidates.
- This is because every breakpoint can be blamed for at most 2 rejected candidates.
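Proposition 1 is easy to test numerically. The sketch below draws random permutations of even length, splits every block at random as in the definition of a block-balanced partition, and looks for a violation of the inequality. The helper names and the use of a seeded `random.Random` are choices made for this illustration.

```python
import random
from bisect import bisect_left


def lis_length(seq):
    """Length of a longest strictly increasing subsequence."""
    tails = []
    for value in seq:
        k = bisect_left(tails, value)
        if k == len(tails):
            tails.append(value)
        else:
            tails[k] = value
    return len(tails)


def breakpoints(p):
    """b(p): number of positions i with p[i] > p[i+1]."""
    return sum(1 for i in range(len(p) - 1) if p[i] > p[i + 1])


def random_block_balanced_partition(p, rng):
    """Send one element of each block {2i-1, 2i} to p0 and the other to p1."""
    p0, p1 = [], []
    for i in range(0, len(p), 2):
        first, second = p[i], p[i + 1]
        if rng.random() < 0.5:
            first, second = second, first
        p0.append(first)
        p1.append(second)
    return p0, p1


def find_counterexample(k, trials, seed=0):
    """Return (p, p0, p1) violating Proposition 1, or None if none is found."""
    rng = random.Random(seed)
    for _ in range(trials):
        p = list(range(1, k + 1))
        rng.shuffle(p)
        p0, p1 = random_block_balanced_partition(p, rng)
        if lis_length(p) < lis_length(p0) + lis_length(p1) - 2 * breakpoints(p):
            return p, p0, p1
    return None


print(find_counterexample(k=8, trials=2000))  # None
```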

Lemma 2 (Contraction)

Let $p$ and $q$ be permutations of length $n$, and assume that $n$ is a power of 2. Then

$$\|f(p) - f(q)\|_1 \ge \tfrac{1}{16}\,\mathrm{ed}(p, q).$$

For the proof assume:

- $p$ and $q$ have the same characters;
- $q = (1, 2, 3, \ldots, n)$.

So $\mathrm{ed}(p, q) \le 2(n - \mathrm{LCS}(p, q)) = 2(n - \mathrm{LIS}(p))$.

Proof of Lemma 2

Partition $p$ into $p_0, p_1$ at random, splitting every block uniformly. Partition $p_0$ into $p_{00}, p_{01}$ at random in the same way, and so on recursively, until we have singleton subsequences $p_\sigma$ for $\sigma \in \{0,1\}^{\log n}$. Let $\varepsilon$ be the empty string, so $p_\varepsilon = p$. Then

$$\mathrm{LIS}(p) \ge E[\mathrm{LIS}(p_0) + \mathrm{LIS}(p_1)] - 2b(p) \ge \cdots \ge \sum_{\sigma \in \{0,1\}^{\log n}} E[\mathrm{LIS}(p_\sigma)] - 2 \sum_{k \le \log n} \sum_{\sigma \in \{0,1\}^{k-1}} E[b(p_\sigma)].$$

Proof of Lemma 2 (cont.)

Each of the $n$ singletons $p_\sigma$ has $\mathrm{LIS}(p_\sigma) = 1$, so

$$n - \mathrm{LIS}(p) \le 2E\Big[\sum_{k \le \log n} \sum_{\sigma \in \{0,1\}^{k-1}} b(p_\sigma)\Big] \le 8 \sum_{\substack{i > j \\ p[i] < p[j]}} \frac{1}{i - j}.$$

For such $i, j$, $f(p)_{\{p[i],p[j]\}} = 1/(j - i) < 0$, while $f(q)_{\{p[i],p[j]\}} > 0$ because $q$ is the identity. So

$$\left| f(p)_{\{p[i],p[j]\}} - f(q)_{\{p[i],p[j]\}} \right| > \frac{1}{i - j}.$$

And $8\|f(p) - f(q)\|_1 \ge n - \mathrm{LIS}(p) \ge \tfrac{1}{2}\,\mathrm{ed}(p, q)$, which ends the proof.
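Together the two lemmas say that the ratio $\|f(p) - f(q)\|_1 / \mathrm{ed}(p, q)$ stays between $1/16$ and $O(\log n)$. The sketch below measures that ratio on random permutations against the identity, for a small power of two. It is an empirical illustration only; the sparse dictionary representation and the function names are assumptions of this example.

```python
import random
from itertools import combinations


def embed(p):
    """f(p) as a sparse dict: key (a, b) with a < b, value 1/(pos(b) - pos(a))."""
    pos = {a: i for i, a in enumerate(p, start=1)}
    return {(a, b): 1.0 / (pos[b] - pos[a]) for a, b in combinations(sorted(pos), 2)}


def l1_distance(u, v):
    """L1 distance between sparse vectors; missing coordinates count as 0."""
    return sum(abs(u.get(key, 0.0) - v.get(key, 0.0)) for key in u.keys() | v.keys())


def edit_distance(x, y):
    """Textbook O(n^2) dynamic program for edit distance."""
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, start=1):
        cur = [i]
        for j, cy in enumerate(y, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (cx != cy)))
        prev = cur
    return prev[-1]


def ratio_range(n, trials, seed=0):
    """Smallest and largest observed ||f(p) - f(q)||_1 / ed(p, q), q = identity."""
    rng = random.Random(seed)
    q = list(range(1, n + 1))
    fq = embed(q)
    ratios = []
    for _ in range(trials):
        p = q[:]
        rng.shuffle(p)
        ed = edit_distance(p, q)
        if ed > 0:  # skip the identity permutation itself
            ratios.append(l1_distance(embed(p), fq) / ed)
    return min(ratios), max(ratios)


print(ratio_range(n=8, trials=500))
```

On such small instances the observed ratios sit well inside the interval allowed by the lemmas; the logarithmic upper bound only becomes visible as $n$ grows.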

Applications (some definitions)

- $X_{n,t}$ includes all $t$-non-repetitive strings of length $n$ over $\Sigma$.
- $B_{n,t}$ includes all $t$-bounded-occurrence strings of length $n$ over $\Sigma$.
- $X_{n,r,t}$ includes all $(t, r)$-non-repetitive strings of length $n$ over $\Sigma$.

Non-repetitive strings

$(X_{n,t}, \mathrm{ed})$ embeds with distortion $2t$ into the Ulam metric of dimension $n - t + 1$ and alphabet size $2^t$. Consequently, it embeds into $\ell_1$ with distortion $O(\log n)$.

Take the new alphabet to be $\{0, 1\}^t$, and for $x \in \{0,1\}^n$ define $f(x)$ by $f(x)_j = x[j] \ldots x[j + t - 1]$. Then

$$\tfrac{1}{2}\,\mathrm{ed}(x, y) \le \mathrm{ed}(f(x), f(y)) \le t \cdot \mathrm{ed}(x, y)$$

(proof …)
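The map $f$ is just a sliding window. The sketch below builds it and rejects inputs with a repeated window, on the assumption that "$t$-non-repetitive" means all length-$t$ substrings are distinct (the slides do not spell out the definition). The function name `window_embedding` is illustrative.

```python
def window_embedding(x, t):
    """Map x to the sequence of its length-t windows, one symbol per window.

    Assumes t-non-repetitive means that all length-t substrings are distinct,
    so the result is a string of distinct symbols over the alphabet {0,1}^t.
    """
    windows = [x[j:j + t] for j in range(len(x) - t + 1)]
    if len(set(windows)) != len(windows):
        raise ValueError("x has a repeated length-t substring")
    return windows


print(window_embedding("0010111", 3))  # ['001', '010', '101', '011', '111']
```

The output has $n - t + 1$ symbols, matching the dimension stated above, and a single edit in $x$ touches at most $t$ windows, which is where the factor $t$ in the upper bound comes from.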

Bounded-occurrence strings

$(B_{n,t}, \mathrm{ed})$ embeds with distortion $t$ into the Ulam metric of dimension $n$ over an extended alphabet of size $t|\Sigma|$. Consequently, it embeds into $\ell_1$ with distortion $O(\log n)$.

Replace each $a \in \Sigma$ with $a_1, a_2, \ldots, a_t$ to obtain the extended alphabet $\Sigma'$ of size $t|\Sigma|$. To get $f(x)$, substitute the $j$-th occurrence of $a$ in $x$ with $a_j$. Then

$$\mathrm{ed}(x, y) \le \mathrm{ed}(f(x), f(y)) \le t \cdot \mathrm{ed}(x, y)$$

follows.
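The relabeling step is short enough to write out. In this sketch the new symbol $a_j$ is represented as the pair `(a, j)`; that representation and the function name `relabel_occurrences` are choices made here for illustration.

```python
from collections import Counter


def relabel_occurrences(x, t):
    """Replace the j-th occurrence of each symbol a by the new symbol (a, j).

    Raises ValueError if some symbol occurs more than t times, i.e. if x is
    not a t-bounded-occurrence string.
    """
    seen = Counter()
    relabeled = []
    for a in x:
        seen[a] += 1
        if seen[a] > t:
            raise ValueError(f"symbol {a!r} occurs more than {t} times")
        relabeled.append((a, seen[a]))
    return relabeled


print(relabel_occurrences("abab", 2))  # [('a', 1), ('b', 1), ('a', 2), ('b', 2)]
```

The output always consists of distinct symbols, so it is a valid point of the Ulam metric over the extended alphabet.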

Sketching t-non-repetitive strings

For every $k$, there exists a polynomial-time sketching algorithm that solves the $k$ vs $\Omega(k\,t \log n)$ gap edit distance problem on $t$-non-repetitive strings of length $n$, using sketches of size $O(1)$.

We use the following: for all $k$ and $\varepsilon > 0$, there exists a polynomial-time sketching algorithm that solves the $k$ vs $(1+\varepsilon)k$ gap Hamming distance problem on binary strings of length $n$, using a sketch of size $O(1/\varepsilon^2)$.

Sketching t-non-repetitive strings

Convert $\ell_1$ into the Hamming metric:

1. Round each coordinate to a multiple of $1/(Cn^2)$ for a sufficiently large $C > 0$ (the distortion increases by a factor of 2).
2. Convert the result to an element of the Hamming space…
3. Use the sketching algorithm for Hamming distance.

Locally non-repetitive strings

For every $t$ and every $k$, there exists an embedding $f$ of the $(t, 180tk)$-non-repetitive strings into $\ell_1$ such that for every two strings $x, y$,

$$\Omega\!\left(\min\left\{k, \frac{\mathrm{ed}(x, y)}{t \log(tk)}\right\}\right) \le \|f(x) - f(y)\|_1 \le \mathrm{ed}(x, y).$$

… Proof …

The embedding

- Let $x$ be a $(t, 180tk)$-non-repetitive string and $W = 56tk$. Append to $x$ the string $a_1 a_2 \ldots a_{2W+t}$ of new symbols.
- Use anchors $\alpha_1, \alpha_2, \ldots, \alpha_{r_x}$, where $r_x = O(n/tk)$. Define $\varphi_i$.
- Embed the $\varphi_i$'s into $\ell_1^{O(tk)}$. Each $\varphi_i$ is a string of length at most $2W + t \le 180tk$, so it is $t$-non-repetitive.
- Concatenate the results to get $\varphi(x)$ in $\ell_1^{O(n)}$.
- Choose $r \in \{0, 1\}^{O(n)}$ of the same length such that $r_i = 1$ independently with probability $1/(kt \log(kt))$.
- Then $f'(x) = r \cdot \varphi(x) \bmod 2$.

Two lemmas that complete the proof

1. If $x$ and $y$ are $(t, 180tk)$-non-repetitive strings, then $\Pr[f'(x) \ne f'(y)] \le O(\mathrm{ed}(x, y)/k)$.
2. If $x$ and $y$ are $(t, 180tk)$-non-repetitive strings, then $\Pr[f'(x) \ne f'(y)] \ge \Omega(\min\{\mathrm{ed}(x, y)/(kt \log(kt)), 1\})$.

Also, if $f(x)$ is the concatenation of $k$ results of $f'$, it follows that

$$\|f(x) - f(y)\|_1 = k\,E[|f'(x) - f'(y)|] = k \Pr[f'(x) \ne f'(y)] \le O(\mathrm{ed}(x, y)).$$

Resulting…

For every $t, k$, there exists a polynomial-time sketching algorithm that solves the $k$ vs $\Omega(t\,k \log k)$ gap edit distance problem for $(t, 180tk)$-non-repetitive strings using sketches of size $O(1)$.

This improves a previous result and gives a sketching algorithm for the Ulam metric for this gap (with $t = 1$).

Embed(x) (of the Ulam metric)

Here x is the inverse of the permutation: x[a] is the position of a, and x[a] = 0 if a is not in the permutation.

    var A[1..m][1..m]: array of real; i, j: int;
    begin
      for i := 1 to m do
        for j := 1 to i - 1 do
          if x[i] * x[j] <> 0 then A[j, i] := 1 / (x[i] - x[j])
          else A[j, i] := 0;
      output(A);
    end.
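A runnable counterpart of the pseudocode may help when experimenting. The sketch below keeps the same conventions (1-based positions, 0 for absent symbols, only coordinates with $j < i$ filled in) but returns a dictionary keyed by the pair instead of an $m \times m$ array, and uses exact fractions; both are choices of this illustration, as are the function names.

```python
from fractions import Fraction


def inverse_positions(p, m):
    """x[a] = 1-based position of symbol a in p, or 0 if a does not appear."""
    x = [0] * (m + 1)  # index 0 is unused so that symbols 1..m index directly
    for position, a in enumerate(p, start=1):
        x[a] = position
    return x


def embed(p, m):
    """Coordinates A[j, i] for 1 <= j < i <= m, as in the pseudocode above."""
    x = inverse_positions(p, m)
    coordinates = {}
    for i in range(1, m + 1):
        for j in range(1, i):
            if x[i] != 0 and x[j] != 0:
                coordinates[(j, i)] = Fraction(1, x[i] - x[j])
            else:
                coordinates[(j, i)] = Fraction(0)
    return coordinates


# p = (3, 1, 2) over the alphabet {1, 2, 3, 4}; symbol 4 does not appear.
for pair, value in embed([3, 1, 2], 4).items():
    print(pair, value)
```

The double loop touches each of the $\binom{m}{2}$ coordinates once, so the embedding is computed in $O(m^2)$ time.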

Embednr(x) (of a (t, 180tk)-non-repetitive string x of size n)

    const W = 56tk;
    var A[1..n+2W+t]: array of int; B: array of real vectors;
        c, i, j, l, mx, rx, h: int;
    begin
      h := 0;
      for all possible coin tosses do {
        h := h + 1; mx := 0;
        for j := 1 to n do { A[j] := x[j]; if mx < x[j] then mx := x[j] };
        for j := n+1 to n+2W+t do { mx := mx + 1; A[j] := mx };
        c := 1; i := 1;
        repeat
          1. s[i,j] := the j'th string of size t, starting at A[c+W+j-1];
          2. pick a random permutation Π on Σ^t;
          3. set a[i] := min{ s[i,j] } and l := the index such that s[i,l] = min{ s[i,j] } (by perm. Π);
          4. c := c + W + l; i := i + 1
        until c > n;
        rx := i - 1;
        for i := 1 to rx do φ[i] := substring of x starting after a[i-1], ending at the end of a[i];
        for i := 1 to rx do B[i] := Embed(φ[i]);
        φ' := concatenation of B;
        pick random r with r[i] = 1 with probability 1/(kt log(kt));
        f' := 0;
        for all i do f' := (f' + r[i] * φ'[i]) mod 2;
        f[h] := f'
      }
    end.

(We are done, you may wake up.)

End (Good night...)
