Flipping letters to minimize the support of a string Giuseppe Lancia, Franca Rinaldi, Romeo Rizzi University of Udine.


1 Flipping letters to minimize the support of a string Giuseppe Lancia, Franca Rinaldi, Romeo Rizzi University of Udine

2 Outline of talk:
1. Problem definition
2. Parametrized complexity
3. Polynomial cases
4. NP-hardness
5. ILP formulations

3 1. Problem definition

4-12 We are given a string s and a parameter k (e.g., k = 3). The string has a set of k-mers, its support, K(s).
s = 010010011
K(s) = { 010, 100, 001, 011 }, |K(s)| = 4

13-15 By flipping some bits, we could reduce the number of k-mers.
s = 010010011
s' = 010010010
K(s') = { 010, 100, 001 }, |K(s')| = 3
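The support computation in this example can be sketched in a few lines of Python. The function name `support` is an illustrative choice, not something taken from the talk.

```python
def support(s: str, k: int) -> set[str]:
    """Return K(s): the set of distinct length-k substrings (k-mers) of s."""
    return {s[i:i + k] for i in range(len(s) - k + 1)}


# The two strings from the slides, with k = 3.
print(sorted(support("010010011", 3)))  # ['001', '010', '011', '100']
print(sorted(support("010010010", 3)))  # ['001', '010', '100']
```

Flipping the last bit removes the k-mer 011 and introduces nothing new, which is why the support shrinks from 4 to 3.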

16-20 The Problem:
Ingredients:
- A string s over an alphabet Σ
- A parameter k (k-mer size)
- A budget B
Objective: change at most B letters in s so that the resulting s' has as few distinct k-mers as possible. Equivalently, find a string s' with d(s, s') <= B with the smallest number of k-mers.
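Written as a single optimization problem, under the assumption (confirmed later in the talk) that d is the Hamming distance and that s' has the same length as s, the objective reads roughly as follows. The index notation is ours.

```latex
\min_{s' \in \Sigma^{|s|}} \; |K(s')|
\quad \text{subject to} \quad d(s, s') \le B,
\qquad \text{where } K(s') = \{\, s'_i s'_{i+1} \cdots s'_{i+k-1} \;:\; 1 \le i \le |s| - k + 1 \,\}.
```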

21-22 Motivation:
Real: curiosity-driven (it's a cute combinatorial problem).
Fictitious: analysis of DNA sequences. The string atcgattgatccttta has 3-mers atc, tcg, cga, gat, ... 3-mers are amino acid codons. Protein complexity relates to the number of codons. Mutations may reduce complexity...

23 Our results: The problem has many parameters (|s|, |Σ|, k, B); we study all versions (where some of the parameters may be bounded).
- Polynomial special cases (e.g., for B fixed, or for both k and |Σ| fixed)
- NP-hard special cases (even for k = 2 or |Σ| = 2)

24 2. Parametrized complexity

25-32 [Table: a 4 x 4 grid of cases. One axis ranges over whether |Σ| and k are bounded (NO/YES for each); the other ranges over whether |s| and B are bounded (NO/YES for each).]
We can assume:
- k <= |s|
- B <= |s|
- |Σ| <= |s| (we don't need any symbol not already in s)
The grid is then marked with the polynomial cases and with the NP-hard cases: NP-hard for |Σ| = 2; NP-hard for k = 2.

33 3. Polynomial cases

34 The case |Σ| and k fixed:

35 [Case grid repeated.] NP-hard for |Σ| = 2; NP-hard for k = 2

36-46 The case |Σ| and k fixed: We start with this subproblem:
SUB(A): Given a set of k-mers A, can we correct s within budget so that it has all of its k-mers in A?
Example: s = 01000100101110, A = { 0100, 1001, 0010, 0001 }, B = 3
[Figure: a layered graph with one copy of the k-mers of A (0100, 1001, 0010, 0001) per layer and numeric weights on the arcs between consecutive layers.]
- Each path corresponds to a string s' with all its k-mers in A
- The length of the path is the Hamming distance d(s', s)
- SUB(A) has a solution iff the shortest path is <= B
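A minimal sketch of the shortest-path idea for SUB(A), written as a layer-by-layer dynamic program. It assumes that an arc joins k-mer a to k-mer b exactly when the last k-1 letters of a equal the first k-1 letters of b, and that the arc costs 1 when the last letter of b disagrees with s. The names `min_flips_into` and `sub_a` are ours, and this simple version runs in O(|A|^2 |s|) rather than the bound quoted in the talk.

```python
from math import inf


def min_flips_into(s: str, A: set[str], k: int) -> float:
    """Minimum Hamming distance from s to a same-length string whose
    k-mers all lie in A (inf if no such string exists)."""
    n = len(s)
    if n < k:
        return 0  # s has no k-mers, so nothing needs correcting
    # First layer: cost of turning the first k letters of s into each k-mer.
    dist = {a: sum(x != y for x, y in zip(a, s[:k])) for a in A}
    # Each later layer appends one letter; arcs require a (k-1)-letter overlap.
    for i in range(k, n):
        nxt = {}
        for a, d in dist.items():
            for b in A:
                if a[1:] == b[:-1]:
                    cost = d + (b[-1] != s[i])
                    if cost < nxt.get(b, inf):
                        nxt[b] = cost
        dist = nxt
    return min(dist.values(), default=inf)


def sub_a(s: str, A: set[str], k: int, B: int) -> bool:
    """SUB(A): can s be corrected within budget B so all its k-mers are in A?"""
    return min_flips_into(s, A, k) <= B


# The introductory example: one flip puts every 3-mer inside {010, 100, 001}.
print(sub_a("010010011", {"010", "100", "001"}, 3, 1))
```

Keeping only the best cost per k-mer in each layer is enough, because the cost of extending a path depends only on its last k-mer.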

47-48 The case |Σ| and k fixed: We start with this subproblem: SUB(A): Given a set of k-mers A, can we correct s within budget so that it has all of its k-mers in A?
- We can solve SUB(A) in polytime, O(|A| |Σ| |s|) = O(|s|), since |A| <= |Σ|^k is a constant
- There are "only" 2^(|Σ|^k) possible subsets A to try... ⇒ the problem is solved in polytime O(|s|)
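The counting behind this claim can be spelled out as follows, with K denoting the set of all k-mers over Σ. The combined bound is our own arithmetic from the per-subproblem cost stated above, not a formula copied from the slides.

```latex
|K| = |\Sigma|^{k}, \qquad
\#\{A : A \subseteq K\} = 2^{|\Sigma|^{k}}, \qquad
\text{total time} = O\!\left( 2^{|\Sigma|^{k}} \cdot |\Sigma|^{k} \cdot |\Sigma| \cdot |s| \right) = O(|s|)
\quad \text{for fixed } |\Sigma| \text{ and } k.
% Example: |\Sigma| = 2, k = 3 gives |K| = 8 and 2^8 = 256 candidate subsets A.
```

The constant is doubly exponential in k, so the result is of theoretical rather than practical interest beyond very small k.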

49 The case of B fixed:

50 [Case grid repeated.] NP-hard for |Σ| = 2; NP-hard for k = 2

51 The case of B fixed: For B fixed, we can try all possible solutions. There are O(|s|^B) possible choices of positions to change. Since |Σ| <= |s|, the number of ways to change them is bounded by |Σ|^B <= |s|^B. We can try them all, and count the number of k-mers.
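The exhaustive strategy for fixed B can be sketched as below. This is an illustrative brute force, not the authors' implementation: the function names are ours, and the alphabet defaults to the letters already present in s, following the earlier remark that no other symbol is needed.

```python
from itertools import combinations, product


def support_size(s: str, k: int) -> int:
    """Number of distinct k-mers of s."""
    return len({s[i:i + k] for i in range(len(s) - k + 1)})


def min_support_brute_force(s: str, k: int, B: int, alphabet=None):
    """Try every way of rewriting at most B positions of s.

    Returns (smallest support size found, one string achieving it).
    """
    letters = sorted(set(s)) if alphabet is None else list(alphabet)
    best = (support_size(s, k), s)
    for b in range(1, B + 1):
        for positions in combinations(range(len(s)), b):
            for choice in product(letters, repeat=b):
                t = list(s)
                for p, c in zip(positions, choice):
                    t[p] = c
                candidate = "".join(t)
                size = support_size(candidate, k)
                if size < best[0]:
                    best = (size, candidate)
    return best


size, witness = min_support_brute_force("010010011", 3, 1)
print(size, witness)
```

The loops visit O(|s|^B |Σ|^B) candidates, each scored in time polynomial in |s|, which is polynomial for constant B.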

52 4. NP-hardness

53 - Theorem: the problem is NP-hard even for k=2.

54 [Case grid repeated.] NP-hard for |Σ| = 2; NP-hard for k = 2

55-58 - Theorem: the problem is NP-hard even for k = 2.
- Proof: reduction from COMPACT BIPARTITE SUBGRAPH (CBS)
INSTANCE: a bipartite graph G = (U, V; E). Integers n, m
PROBLEM: does there exist a set X ⊆ U ∪ V such that (i) |X| <= n, (ii) X induces at least m edges of G?
[Figure: example bipartite graph with n = 6, m = 5]
(Note that CBS is NP-hard because it includes MAX BALANCED BIP GRAPH, for n = 2t, m = t^2.)

59-64 The reduction:
CBS: does there exist X ⊆ U ∪ V such that |X| <= n and X induces at least m edges?
[Figure: example bipartite graph on the nodes a, b, c, d, e, f, g, h, i, l]
Let Σ = {a, b, c, d, e, f, g, h, i, l} ∪ {$, #}, where $ and # stand for the two extra symbols, and let B = |E| - m.
IDEA: Encode an edge (i, j) as ... $ i # j $ ... and make all k-mers $x, x$ and [...] unavoidable (i.e., insert a LOT of each of them in s).
The only k-mers that can be destroyed are of the form x# or #x, and this is achieved by "flipping" the # into a $. This corresponds to removing the edge: $ i # j $ becomes $ i $ j $.
The set of k-mers of type #x or x# which remain defines the set X, which covers at least m edges.
s = $a$a...a$a$b$b...b$b...$l$l$l...$l$a#g$b#g$e#i$
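The shape of the constructed string can be illustrated with a short builder. Several details here are assumptions rather than facts from the talk: '$' and '#' stand in for the two extra symbols, every node is a single character, and "a LOT" is taken to be B + 1 repetitions per node, which is enough because one letter change can destroy at most one occurrence of each padded 2-mer.

```python
def build_instance(U, V, edges, m, sep="$", link="#"):
    """Build (s, k, B) from a bipartite graph, following the talk's layout.

    U, V  : iterables of single-character node names (hypothetical encoding)
    edges : list of (u, v) pairs with u in U and v in V
    m     : number of edges that must survive
    """
    B = len(edges) - m          # budget: at most |E| - m edges may be removed
    reps = B + 1                # assumed size of the "LOT" of padding copies
    # Padding: sep+x repeated, so the 2-mers sep+x and x+sep cannot all be destroyed.
    padding = "".join((sep + x) * reps for x in list(U) + list(V))
    # One gadget sep+u+link+v per edge; changing link to sep removes the edge.
    gadgets = "".join(sep + u + link + v for u, v in edges)
    return padding + gadgets + sep, 2, B


s, k, B = build_instance("abcde", "fghil", [("a", "g"), ("b", "g"), ("e", "i")], m=2)
print(s, k, B)
```

After the changes, the surviving 2-mers of the form x# and #x name the nodes that still have an incident edge, so minimizing the support corresponds to keeping at least m edges on as few nodes as possible.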

65 - With similar reductions, we can also prove the Theorem: the problem is NP-hard even for |Σ| = 2

66 5. Integer Linear Programming formulations

67 Exponential-size formulation (Σ = {0,1}): Let K be the set of all possible k-mers, e.g., K = { 000, 001, 010, 011, 100, 101, 110, 111 }. Define a 0/1 variable for each k-mer in K and a 0/1 variable for each position.

68-70 Exponential-size formulation (Σ = {0,1}): [Formulation: a minimization objective followed by constraints; the formulas are not preserved in this transcript.]
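Since the formulas themselves are missing here, the following is only one plausible way to write a model with a 0/1 variable per k-mer and a 0/1 variable per position for Σ = {0,1}; it is not necessarily the formulation shown in the talk. The names x_κ (k-mer κ is used) and z_i (bit i is flipped) are our assumptions.

```latex
\begin{align*}
\min \quad & \sum_{\kappa \in K} x_\kappa \\
\text{s.t.} \quad
& \sum_{\substack{t = 1 \\ \kappa_t = s_{j+t-1}}}^{k} \bigl(1 - z_{j+t-1}\bigr)
  \;+\; \sum_{\substack{t = 1 \\ \kappa_t \ne s_{j+t-1}}}^{k} z_{j+t-1}
  \;-\; (k - 1) \;\le\; x_\kappa
  && \forall \kappa \in K,\; 1 \le j \le |s| - k + 1 \\
& \sum_{i=1}^{|s|} z_i \le B \\
& x_\kappa \in \{0,1\}, \quad z_i \in \{0,1\}
\end{align*}
```

The left-hand side of the first constraint counts the positions of the window starting at j that agree with κ after flipping, minus k - 1, so it equals 1 exactly when the flipped window is κ and is at most 0 otherwise. The model has 2^k k-mer variables, which is where the exponential size comes from.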

71-72 Exponential number of variables/constraints ⇒ pricing & separation problems
- Our P&S strategy is exponential in the general case (polynomial for k fixed)
- Fractional solutions may lead to effective heuristics (e.g., solving SUB(A) with ...

73-75 Polynomial-size formulation (Σ = {0,1}):
- In addition to the variables z, there are variables w for positions i and j saying "is the k-mer starting at i identical to the k-mer starting at j?"
- The variables w depend on s and z via linear constraints
- To count only k-mers that have no identical k-mer following them, a further variable for position i is set to 1 if all k-mers starting after i are different from the k-mer at i
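The counting rule used by this formulation (count a position only if no identical k-mer starts later) can be checked directly on a fixed string. The sketch below only illustrates that identity and does not build the ILP; the function name is ours.

```python
def count_last_occurrences(s: str, k: int) -> int:
    """Count positions i whose k-mer does not reappear at any later position j > i."""
    starts = range(len(s) - k + 1)
    return sum(
        all(s[i:i + k] != s[j:j + k] for j in starts if j > i)
        for i in starts
    )


# Each distinct k-mer is counted exactly once, at its last occurrence,
# so the count equals the size of the support.
for text in ("010010011", "010010010"):
    distinct = len({text[i:i + 3] for i in range(len(text) - 2)})
    print(text, count_last_occurrences(text, 3), distinct)
```

The comparison over all pairs i < j is what the pairwise variables w encode, which is why the number of variables stays polynomial in |s|.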

76 Polynomial-size formulation (Σ = {0,1})
- The formulation (not shown) is basically a big boolean formula
- It yields a poor bound compared to the exponential formulation
- It can be improved in many ways (especially via valid cuts)
- At this stage it's not clear which method is best for not-too-small instances
- We'll run experiments with variants of both formulations
