Embedding the Ulam metric into ℓ_1. For the course “Advanced Data Structures”. Antonis Achilleos (Αντώνης Αχιλλέως).

Metrics
A metric space is a pair (X, d) such that X is a set, d : X^2 → R, and for all x, y, z in X:
1. d(x, y) ≥ 0, and d(x, y) = 0 iff x = y
2. d(x, y) = d(y, x)
3. d(x, y) ≤ d(x, z) + d(z, y)

Two metric spaces: edit distance
Let Σ be a set of symbols and Σ^n the set of strings (n-tuples) of characters from Σ.
The edit operations on a string are adding a character, deleting a character, and replacing a character.
For x, y in Σ^n, let ed(x, y) be the minimum number of edit operations needed to transform x into y.
Then (Σ^n, ed) is a metric space.

Two metric spaces: the Ulam metric of dimension n
Let Σ be as before, and let P_n be the set of strings of n distinct characters from Σ, where n = |Σ|.
For x, y in P_n, define UL(x, y) to be the minimum number of character moves needed to transform x into y.
Then (P_n, UL) is a metric space.
This definition is limiting: we also need pairs of strings over different sets of characters. So we let n < |Σ| and use ed instead of UL. For x, y in P_n, UL(x, y) ≤ ed(x, y) ≤ 2 UL(x, y).

Embeddings
An embedding of a metric space (X, d) into a target metric space (Y, m) is a mapping f : X → Y such that there are real numbers C and s with, for all x, y in X,
d(x, y) ≤ s∙m(f(x), f(y)) ≤ C∙d(x, y)
The minimum C that satisfies the above inequality for some s is called the distortion of the embedding f.
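For a finite point set the distortion can be computed directly from the definition. The following is a minimal Python sketch, assuming f is injective on the points; the function name `distortion` and the toy example (squaring the points 0, 1, 2 on the real line) are illustrative and not from the slides.

```python
from itertools import combinations

def distortion(points, d, m, f):
    """Smallest C with d(x,y) <= s*m(f(x),f(y)) <= C*d(x,y) for some s > 0."""
    ratios = [m(f(x), f(y)) / d(x, y) for x, y in combinations(points, 2)]
    # s = 1/min(ratios) makes the left inequality tight for the most
    # contracted pair; the most expanded pair then forces C = max/min.
    return max(ratios) / min(ratios)

def abs_diff(u, v):
    return abs(u - v)

# Toy example: the points 0, 1, 2 on the line, mapped by x -> x*x.
print(distortion([0, 1, 2], abs_diff, abs_diff, lambda x: x * x))
```

The scaling factor s drops out of the result, which is why the distortion depends only on the ratio between the most expanded and the most contracted pair.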

Edit distance algorithm in O(n^2)
If LCS(x, y) is the length of the longest common subsequence of x and y, where x, y are strings of length n, then
n – LCS(x, y) ≤ ed(x, y) ≤ 2(n – LCS(x, y))
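The slide does not spell out the O(n^2) algorithm. A minimal sketch, assuming it is the standard row-by-row dynamic program, is given below together with the analogous LCS computation, so the two bounds can be compared on an example. The function names are illustrative.

```python
def edit_distance(x, y):
    """Minimum number of insertions, deletions and substitutions turning x into y."""
    prev = list(range(len(y) + 1))            # distances from "" to prefixes of y
    for i, cx in enumerate(x, start=1):
        cur = [i]                             # distance from x[:i] to ""
        for j, cy in enumerate(y, start=1):
            cur.append(min(prev[j] + 1,                 # delete cx
                           cur[j - 1] + 1,              # insert cy
                           prev[j - 1] + (cx != cy)))   # match or substitute
        prev = cur
    return prev[-1]

def lcs_length(x, y):
    """Length of the longest common subsequence of x and y."""
    prev = [0] * (len(y) + 1)
    for cx in x:
        cur = [0]
        for j, cy in enumerate(y, start=1):
            cur.append(prev[j - 1] + 1 if cx == cy else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

x, y = "abcde", "badce"
n = len(x)
gap = n - lcs_length(x, y)
print(gap, edit_distance(x, y), 2 * gap)      # lower bound, ed, upper bound
```

Both functions keep only two rows of the table, so they use O(n) space while still taking O(n^2) time.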

Theorem
For every n, the Ulam metric of dimension n can be embedded into ℓ_1^{O(|Σ|^2)} with distortion O(log n).
Let n be an integer, and let us suppose it is a power of 2. Let m = |Σ|, so we can suppose that Σ = {1, 2, …, m}. The embedding is the following:

The embedding
The embedding is f : P_n → ℓ_1^{C(m,2)}, where C(m,2) = m(m – 1)/2 is the number of unordered pairs of symbols.
Associate every coordinate of the target space with a distinct pair {a, b}, where a, b in Σ and a ≠ b (each pair is written with a < b). Every permutation p in P_n receives the following coordinates in the new space:
f(p)_{a,b} = 1/(p^{-1}(b) – p^{-1}(a)), if a and b both appear in p,
f(p)_{a,b} = 0, if they don't.
Here p^{-1}(a) is the position of a in p. The proof is given by the following two lemmas.
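The coordinates defined above are easy to compute directly. Here is a minimal Python sketch, assuming Σ = {1, …, m} and that each pair is taken with a < b; the names `ulam_embed` and `l1_distance` are illustrative, and `Fraction` is used only to keep the arithmetic exact.

```python
from fractions import Fraction
from itertools import combinations

def ulam_embed(p, m):
    """Map a string p of distinct symbols from {1..m} to its pair coordinates."""
    pos = {a: i for i, a in enumerate(p, start=1)}    # pos[a] = p^{-1}(a)
    return {
        (a, b): Fraction(1, pos[b] - pos[a]) if a in pos and b in pos else Fraction(0)
        for a, b in combinations(range(1, m + 1), 2)  # all pairs with a < b
    }

def l1_distance(u, v):
    """l_1 distance between two embeddings over the same set of pairs."""
    return sum(abs(u[key] - v[key]) for key in u)

fp = ulam_embed((1, 2, 3), 3)
fq = ulam_embed((2, 1, 3), 3)
print(l1_distance(fp, fq))
```

In this example the pair {1, 2} is inverted between the two strings, so its coordinate flips sign and contributes most of the distance.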

Lemma 1 - Expansion
Let p and q be permutations of length n. Then
║f(p) – f(q)║_1 ≤ O(log n)∙ed(p, q)
Proof: First notice that f can be extended to strings of length less than n. By the triangle inequality, we only need to show the inequality for the case ed(p, q) = 1, where p has length n and q has length n – 1. Also, we will treat a substitution as a character deletion followed by an insertion.

Proof of Lemma 1 (cont.)
q is obtained from p by deleting p[s] for some s. So p[i] = q[i] for i < s, and p[i+1] = q[i] for i ≥ s.
║f(p) – f(q)║_1 = ∑_{a,b in Σ} |f(p)_{a,b} – f(q)_{a,b}|
Ignore pairs {a, b} not entirely in p. So a = p[i], b = p[j] with i < j, and the cases are j < s, i > s, and i < s < j (on the whiteboard). QED

Definitions needed
LIS(p): the length of the longest increasing subsequence of p.
breakpoint: a position i in [k – 1] such that p[i] > p[i+1].
b(p): the number of breakpoints in p.
p_0, p_1 are a partition of p if they are disjoint subsequences of p and every element x of p appears in p_0 or in p_1.
block: a pair of positions {2i – 1, 2i}.
A partition p_0, p_1 is block-balanced if it also splits every block, with one element of the block in each part.

Proposition 1
Let p be a permutation of length k, k even. Then, for every block-balanced partition of p into p_0, p_1,
LIS(p) ≥ LIS(p_0) + LIS(p_1) – 2b(p)
We will prove that LIS(p) ≥ 2 LIS(p_0) – 2b(p). By symmetry the same holds for p_1, and averaging the two bounds gives the proposition. The argument follows.

Argument points
We try to augment a longest increasing subsequence of p_0, of length LIS(p_0), with points from p_1.
If j is a position in p_0, let j' be the other position of its block, so {j', j} is a block.
If j is in the chosen increasing subsequence of p_0, then j' is a candidate.
#candidates = LIS(p_0)
The subsequence can always be augmented by at least LIS(p_0) – 2b(p) candidates, because every breakpoint can be blamed for at most 2 rejected candidates.
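The definitions and Proposition 1 can be sanity-checked by brute force on small permutations. The sketch below is only an illustration, not part of the proof; all function names are hypothetical, and LIS is computed with the usual patience-sorting method.

```python
from bisect import bisect_left
from itertools import permutations, product

def lis_length(seq):
    """Length of the longest strictly increasing subsequence."""
    tails = []            # tails[r] = smallest last value of an increasing subsequence of length r+1
    for v in seq:
        r = bisect_left(tails, v)
        if r == len(tails):
            tails.append(v)
        else:
            tails[r] = v
    return len(tails)

def num_breakpoints(p):
    """b(p): number of positions i with p[i] > p[i+1]."""
    return sum(p[i] > p[i + 1] for i in range(len(p) - 1))

def block_balanced_partitions(p):
    """Yield every (p0, p1) that takes exactly one element from each block."""
    blocks = [(p[i], p[i + 1]) for i in range(0, len(p), 2)]
    for choice in product((0, 1), repeat=len(blocks)):
        p0 = [blk[c] for blk, c in zip(blocks, choice)]
        p1 = [blk[1 - c] for blk, c in zip(blocks, choice)]
        yield p0, p1

def proposition_holds(k):
    """Check LIS(p) >= LIS(p0) + LIS(p1) - 2 b(p) for all permutations of length k."""
    return all(
        lis_length(p) >= lis_length(p0) + lis_length(p1) - 2 * num_breakpoints(p)
        for p in permutations(range(1, k + 1))
        for p0, p1 in block_balanced_partitions(p)
    )

print(proposition_holds(4), proposition_holds(6))
```

The check enumerates all 2^(k/2) block-balanced partitions of every permutation, so it is only practical for very small even k.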

Lemma 2 - Contraction
Let p and q be permutations of length n, and assume that n is a power of 2. Then
║f(p) – f(q)║_1 ≥ (1/16) ed(p, q)
For the proof, assume:
p and q have the same characters
q = (1, 2, 3, …, n)
So ed(p, q) ≤ 2(n – LCS(p, q)) = 2(n – LIS(p))

Proof of Lemma 2
Partition p into p_0, p_1 at random, uniformly splitting every block.
Partition p_0 into p_00, p_01 at random, uniformly splitting every block, etc., recursively, until we have singleton subsequences p_σ, for σ in {0,1}^{log n}. Let ε be the empty string, with p_ε = p.
LIS(p) ≥ E[LIS(p_0) + LIS(p_1)] – 2b(p) ≥ ∑_{σ in {0,1}^{log n}} E[LIS(p_σ)] – 2 ∑_{k ≤ log n} ∑_{σ in {0,1}^{k–1}} E[b(p_σ)]
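The recursive random splitting used in this proof can be sketched as follows. This is a minimal illustration assuming the length of p is a power of 2; the names `split_blocks` and `recursive_split`, and the use of a seeded `random.Random`, are choices made for the example.

```python
import random

def split_blocks(p, rng):
    """Send one element of each block {2i-1, 2i} to p0 and the other to p1, uniformly."""
    p0, p1 = [], []
    for i in range(0, len(p), 2):
        a, b = p[i], p[i + 1]
        if rng.random() < 0.5:
            a, b = b, a
        p0.append(a)
        p1.append(b)
    return p0, p1

def recursive_split(p, rng, sigma=""):
    """Return {sigma: p_sigma} for every node of the splitting tree (len(p) a power of 2)."""
    tree = {sigma: list(p)}
    if len(p) > 1:
        p0, p1 = split_blocks(p, rng)
        tree.update(recursive_split(p0, rng, sigma + "0"))
        tree.update(recursive_split(p1, rng, sigma + "1"))
    return tree

tree = recursive_split([3, 1, 4, 2, 8, 6, 5, 7], random.Random(0))
leaves = {s: v for s, v in tree.items() if len(s) == 3}
print(leaves)
```

The key "" plays the role of ε, and after log n levels every p_σ is a singleton, which is why each leaf contributes exactly 1 to the sum of LIS values.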

Proof of Lemma 2 (cont.)
Each singleton p_σ has LIS(p_σ) = 1 and there are n of them, so
n – LIS(p) ≤ 2 E[∑_k ∑_σ b(p_σ)] ≤ 8 ∑ 1/(i – j),
where the last sum is over all i > j with p[i] < p[j].
For such i, j, f(p)_{p[i],p[j]} = 1/(j – i) < 0, while f(q)_{p[i],p[j]} > 0.
So |f(p)_{p[i],p[j]} – f(q)_{p[i],p[j]}| > 1/(i – j),
and 8║f(p) – f(q)║_1 ≥ n – LIS(p) ≥ (1/2) ed(p, q), which ends the proof.

Applications (some definitions)
X_{n,t} includes all t-non-repetitive strings of length n over Σ.
B_{n,t} includes all t-bounded-occurrence strings of length n over Σ.
X_{n,r,t} includes all (t, r)-non-repetitive strings of length n over Σ.

Non-repetitive strings
(X_{n,t}, ed) embeds with distortion 2t into the Ulam metric of dimension n – t + 1 and alphabet size 2^t. Consequently, it embeds into ℓ_1 with distortion O(t log n).
Take the alphabet {0, 1}^t, and for x in {0,1}^n define f(x) by f(x)_j = x[j] … x[j + t – 1].
½ ed(x, y) ≤ ed(f(x), f(y)) ≤ t ed(x, y) (proof …)
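The map f above is a sliding window over x. A minimal Python sketch follows; it assumes that t-non-repetitive means that all length-t substrings of x are distinct (the slides use the term without defining it), and the function names are illustrative.

```python
def window_map(x, t):
    """f(x)_j = x[j] ... x[j+t-1]: the sequence of all length-t substrings of x."""
    return [x[j:j + t] for j in range(len(x) - t + 1)]

def is_t_non_repetitive(x, t):
    """Assumed definition: all length-t substrings of x are distinct."""
    grams = window_map(x, t)
    return len(set(grams)) == len(grams)

print(window_map("0011", 2), is_t_non_repetitive("0011", 2))
print(window_map("0101", 2), is_t_non_repetitive("0101", 2))
```

When x is t-non-repetitive, f(x) is a string of n – t + 1 distinct symbols over {0, 1}^t, which is what makes it a point of the Ulam metric.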

Bounded-occurrence strings
(B_{n,t}, ed) embeds with distortion t into the Ulam metric of dimension n over an extended alphabet of size t|Σ|. Consequently, it embeds into ℓ_1 with distortion O(t log n).
Replace each a in Σ with t copies a_1, a_2, …, a_t to get the extended alphabet Σ', of size t|Σ|.
To obtain f(x), substitute the j-th occurrence of a in x with a_j.
ed(x, y) ≤ ed(f(x), f(y)) ≤ t ed(x, y) follows.
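The relabeling step can be sketched in a few lines of Python. Representing the new symbol a_j as the tuple (a, j) is an assumption made for the example, as is the function name.

```python
from collections import Counter

def relabel_occurrences(x):
    """Replace the j-th occurrence of symbol a in x by the new symbol (a, j)."""
    seen = Counter()
    out = []
    for a in x:
        seen[a] += 1                 # this is occurrence number seen[a] of a
        out.append((a, seen[a]))
    return out

print(relabel_occurrences("abab"))
```

If every symbol occurs at most t times in x, the output uses only symbols (a, j) with j ≤ t and all of them are distinct, so it is a valid input for the Ulam embedding.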

Sketching t-non-repetitive strings
For every k, there exists a polynomial-time sketching algorithm that solves the k vs Ω(k t log n) gap edit distance problem on t-non-repetitive strings of length n, using sketches of size O(1). We use the following:
For all k and ε > 0, there exists a polynomial-time sketching algorithm that solves the k vs (1+ε)k gap Hamming distance problem on binary strings of length n, using a sketch of size O(1/ε^2).

Sketching t-non-repetitive strings
Convert ℓ_1 into the Hamming metric:
Round each coordinate to a multiple of 1/(Cn^2) for a sufficiently large C > 0 (the distortion increases by a factor of at most 2).
Convert this to an element of the Hamming space…
Use the sketching algorithm for Hamming distance.
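The slide leaves the conversion to Hamming space unspecified. One common way to do it, assumed here purely for illustration, is to write each rounded coordinate in unary; the Hamming distance between two encodings then equals the ℓ_1 distance between the rounded vectors, scaled by Cn^2. The function names and the default C = 4 are hypothetical.

```python
def to_hamming(vec, n, C=4):
    """Round coordinates in [-1, 1] to multiples of 1/(C n^2), then unary-encode them."""
    scale = C * n * n
    bits = []
    for v in vec:
        q = round(v * scale) + scale                   # shifted to an integer in [0, 2*scale]
        bits.extend([1] * q + [0] * (2 * scale - q))   # unary code of q, padded to 2*scale bits
    return bits

def hamming(u, v):
    """Number of positions where two equal-length bit lists differ."""
    return sum(a != b for a, b in zip(u, v))

u = to_hamming([0.5, -1.0], n=2, C=1)
v = to_hamming([1.0, 0.0], n=2, C=1)
print(hamming(u, v))   # equals C*n^2 times the l_1 distance of the rounded vectors
```

The coordinates of the Ulam embedding are of the form 1/(position difference), so they lie in [-1, 1], which is the range this sketch assumes.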

Locally non-repetitive strings
For every t and every k, there exists an embedding f of the (t, 180tk)-non-repetitive strings into ℓ_1 such that for every two strings x, y,
Ω(min{k, ed(x, y)/(t log(tk))}) ≤ ║f(x) – f(y)║_1 ≤ ed(x, y)
… Proof …

The embedding
Let x be a (t, 180tk)-non-repetitive string and W = 56tk. Append to x the string a_1 a_2 … a_{2W+t} (new symbols).
Use anchors α_1, α_2, …, α_{r_x}, where r_x = O(n/tk). Define φ_i.
Embed the φ_i's into ℓ_1^{O(tk)}. Each φ_i is a string of length at most 2W + t ≤ 180tk, so it is t-non-repetitive.
Concatenate to get φ(x) in ℓ_1^{O(n)}.
Choose r in {0, 1}^{O(n)} of the same length such that r_i = 1 independently with probability 1/(kt log(kt)).
Then f'(x) = r∙φ(x) mod 2.

Two lemmas that complete the proof
1. If x and y are (t, 180tk)-non-repetitive strings, then Pr[f'(x) ≠ f'(y)] ≤ O(ed(x, y)/k).
2. If x and y are (t, 180tk)-non-repetitive strings, then Pr[f'(x) ≠ f'(y)] ≥ Ω(min{ed(x, y)/(kt log(kt)), 1}).
Also, if f(x) is the concatenation of k f' results, it follows that
║f(x) – f(y)║_1 = k E[|f'(x) – f'(y)|] = k Pr[f'(x) ≠ f'(y)] ≤ O(ed(x, y)).

Resulting…
For every t and k, there exists a polynomial-time sketching algorithm that solves the k vs Ω(tk log k) gap edit distance problem for (t, 180tk)-non-repetitive strings using sketches of size O(1).
This improves a previous result and gives a sketching algorithm for the Ulam metric for this gap (with t = 1).

Embed(x) (of the Ulam metric)
(x is the inverse of the permutation; if a is not in the permutation, then x[a] = 0)
var A: array [1..m, 1..m] of real; i, j: int;
begin
  for i := 1 to m do
    for j := 1 to i – 1 do
      if x[i]*x[j] <> 0 then A[j, i] := 1/(x[i] – x[j])
      else A[j, i] := 0;
  output(A);
end.

Embednr(x) (of a (t, 180tk)-non-repetitive string x of size n)
const W = 56tk;
var A, B: array [1..n+2W+t] of int; c, i, j, l, mx, h, rx: int;
begin
  h := 0;
  for all possible coin tosses do {
    h := h + 1; mx := 0;
    for j := 1 to n do { A[j] := x[j]; if mx < x[j] then mx := x[j] };
    for j := n+1 to n+2W+t do { mx := mx + 1; A[j] := mx };
    c := 1; i := 1;
    repeat
      1. s_ij := the j'th string of size t, starting at A[c+W+j–1];
      2. pick a random permutation Π on Σ^t;
      3. set a_i := min{s_ij} and l := the j such that s_il = min{s_ij} (ordered by Π);
      4. c := c + W + l; i := i + 1
    until c > n;
    rx := i;
    for i := 1 to rx do φ[i] := substring of x starting after a_{i–1} and ending at the end of a_i;
    for i := 1 to rx do B[i] := Embed(φ[i]);
    φ' := concatenate B;
    pick random r with r[i] = 1 with probability 1/(kt log(kt));
    f' := 0;
    for all i do f' := (f' + r[i]*φ'[i]) mod 2;
    f[h] := f'
  }
end.
(We are done, you can wake up.)

End (Good night...)