On the k-Closest Substring and k-Consensus Pattern Problems


1 On the k-Closest Substring and k-Consensus Pattern Problems
Yishan Jiao, Jingyi Xu (Institute of Computing Technology, Chinese Academy of Sciences); Ming Li (University of Waterloo). July 5, 2004.

2 Outline
Motivation & background. Our contributions: a PTAS for the k-Closest Substring problem; the NP-hardness of (2-ε)-approximation of the HRC problem; a PTAS for the k-Consensus Pattern problem. Conclusion.
Speaker notes: 1. First, a general introduction to the main content of the paper. 2. The most closely related previous work.

3 Motivation
Given n protein sequences, find the "conserved" regions separately. [Figure: n sequences containing conserved regions of length L.]
Speaker notes: 1. Although the original starting point of this study was separating repeats in our DNA sequence assembly project, we quickly realized that the problems we abstracted relate to many widely studied problems in different areas, from geometric clustering to DNA multiple motif finding. 2. The red and blue regions are different conserved regions, or motifs. They do not have to be exactly the same; they match with higher scores than other regions.

4 Focused problem: the k-Closest Substring problem (k-CSS)
The definition of the k-Closest Substring problem is presented here. Given a set S of strings, each of length m, find k center strings c1, …, ck, each of length L. A special case is k = 2.

5 2-KCSS
[Figure: the string set S, with length-L substrings grouped into two clusters.] c1 and c2 are the center strings of two separate clusters.

6 Related work
Relationships: when L = m, the k-Closest Substring problem is the Hamming Radius k-clustering problem (HRC), whose geometric counterpart is the geometric k-center problem; when L = m, the Closest Substring problem is the Closest String problem.
Closest Substring problem: a PTAS; M. Li et al., JACM 49(2), 2002.
Hamming Radius O(1)-clustering problem (O(1)-HRC): an RPTAS; doctoral dissertation, J. Jansson, 2003.

7 Outline
Motivation & background. Our contributions: a PTAS for the k-Closest Substring problem; the NP-hardness of (2-ε)-approximation of the HRC problem; a PTAS for the k-Consensus Pattern problem. Conclusion.

8 The PTAS for k-CSS
Difficulties: How to choose the n closest substrings? How to partition the strings into k sets accordingly?
Method: Extend the random sampling strategy in [M. Li et al., JACM 49(2), 2002]. Construct a function h to approximate the Hamming distance.
Result: A PTAS for O(1)-CSS.

9 P-Q decomposition
[Figure: the L positions split into the sets Q and P, with sampled positions R drawn from P.]

10 P-Q decomposition

11 Random sampling strategy
R1 (R2): randomly pick O(log(mn)) positions from P1 (P2).
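The sampling step can be sketched as follows. The transcript does not preserve the definitions of P, Q, or h, so this is a minimal sketch under stated assumptions: Q is taken to be the positions where a few chosen substrings all agree, P is the remaining positions, and the estimate counts mismatches exactly on Q and rescales the mismatches seen on a random sample R of P. All function names are hypothetical, and the paper's actual h may differ.

```python
import random


def pq_decomposition(strings):
    """Assumed split: Q = positions where all given strings agree,
    P = positions where at least two of them differ."""
    L = len(strings[0])
    Q = [i for i in range(L) if len({s[i] for s in strings}) == 1]
    P = [i for i in range(L) if len({s[i] for s in strings}) > 1]
    return P, Q


def sample_positions(P, size, rng):
    """Draw a multiset R of `size` positions from P, uniformly with
    replacement. P must be non-empty."""
    return [rng.choice(P) for _ in range(size)]


def estimated_distance(x, t, P, Q, R):
    """Exact mismatches on Q, plus mismatches on the sample R rescaled
    by |P| / |R| to estimate the mismatches on all of P."""
    on_q = sum(x[i] != t[i] for i in Q)
    if not R:
        return float(on_q)
    on_r = sum(x[i] != t[i] for i in R)
    return on_q + len(P) / len(R) * on_r
```

When R covers every position of P exactly once, the estimate equals the true Hamming distance. With a smaller sample, the error comes only from the positions in P, which is the point of sampling from P rather than from the whole string.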

12 Random sampling strategy
h approximates the Hamming distance well.

13 Scheme of PTAS

14 Scheme of PTAS
5. Get the final approximating center strings.
Outputs (c1'', c2'') and {t1, t2, …, tn} in polynomial time that satisfy the approximation bound with high probability. Extension to the k = O(1) case: trivial.

15 Sum up: Here, we partition the string set S based on c1', c2', and h. According to Lemma 3 and Lemma 1, h approximates the Hamming distance within an error of O(dopt). According to Lemma 1, c1' and c2' approximate the optimal c1 and c2 within an error of O(dopt). If we used only the random sampling strategy without the P-Q decomposition, that is, if we sampled positions randomly over the whole string, Lemma 3 would only guarantee an approximation bound of O(length of string). This is not sufficient when dopt is small. Thanks to the P-Q decomposition, we can sample over the set P instead, so the approximation bound is O(dopt). Hence h approximates the true Hamming distance very well, and the partition and the choice of T1, T2 obtained from (c1', c2', h) are good approximations.

16 Outline
Motivation & background. Our contributions: a PTAS for the k-Closest Substring problem; the NP-hardness of (2-ε)-approximation of the HRC problem; a PTAS for the k-Consensus Pattern problem. Conclusion.

17 The NP-hardness of (2-ε)-approximation of the HRC problem
Main ideas: Given any instance G = (V, E) of the Vertex Cover problem with |V| = n and |E| = m', construct an instance <S, k> of the Hamming radius k-clustering problem that has a k-clustering with maximum cluster radius not exceeding 2 if and only if G has a vertex cover with k-m' vertices.

18 Thus finding an approximate solution within an approximation factor less than 2 is no easier than finding an exact solution.

19 We can prove: given k  2m', k-m' vertices in V can cover E
if and only if there is a k-clustering of S with the maximum cluster radius equal to 2. If there were a polynomial-time algorithm for the Hamming radius k-clustering problem with an approximation factor less than 2, the exact vertex cover number of any instance G could be computed in polynomial time. This is a contradiction (unless P = NP).

20 Outline
Motivation & background. Our contributions: a PTAS for the k-Closest Substring problem; the NP-hardness of (2-ε)-approximation of the HRC problem; a PTAS for the k-Consensus Pattern problem. Conclusion.
Speaker notes: Another contribution of our paper is a PTAS for the k-Consensus Pattern problem. It is a simple extension of previous work. Due to time limitations, we skip it.

21 Conclusion
A nice combination of a combinatorial argument (the P-Q decomposition) with the random sampling strategy in solving the k-CSS problem. An alternative and direct proof of the NP-hardness of (2-ε)-approximation of the HRC problem.
Speaker notes: Here, we present a nice combination of a combinatorial argument with the random sampling strategy in solving the k-Closest Substring problem. As mentioned in the key ideas of the proof, it is not sufficient to use either one alone. The role of such a combination is illustrated by our example.

22 Contact Us Authors Yishan Jiao, Jingyi Xu : {jys,xjy}@ict.ac.cn
Bioinformatics lab, Institute of Computing Technology, Chinese Academy of Sciences Ming Li: University of Waterloo

23 Thank You!

24 Outline
Motivation & background. Our contributions: the PTAS for the k-Closest Substring problem; the NP-hardness of (2-ε)-approximation of the HRC problem; the PTAS for the k-Consensus Pattern problem. Conclusion.

25 Deterministic PTAS for O(1)-Consensus Pattern problem (1)
k-Consensus Pattern problem. Most related work:
The Hamming O(1)-median clustering problem is the O(1)-Consensus Pattern problem when L = m. An RPTAS; R. Ostrovsky et al., JACM 49(2), 2002.
The Consensus Pattern problem is the k-Consensus Pattern problem when k = 1. A PTAS; M. Li et al., STOC'99.
We give a deterministic PTAS for the O(1)-Consensus Pattern problem and prove it.

26 DPTAS for O(1)-CP (1)
Outline:
1. Suppose the optimal solution is ({c1, c2}, {t1, t2, …, tn}, {C1, C2}), where C1 and C2 are instances of the Consensus Pattern problem.
2. By trying all possibilities, obtain substrings satisfying Lemma 3 in M. Li et al., STOC'99.

27 DPTAS for O(1)-CP (2)
Outline:
3. Get c1', c2': each is the column-wise majority string of a set of substrings obtained in step 2.
4. Partition the strings into C1' and C2'.
5. Get the closest substrings (tl') in T1', T2'.

28 DPTAS for O(1)-CP (3)
Outline:
6. Get a good approximation solution, where c1'', c2'' are the column-wise majority strings of all strings in T1', T2' respectively.
7. Conclusion: output, in polynomial time, a solution whose total cost meets the approximation bound.
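The column-wise majority string used in these steps can be sketched in a few lines. For a fixed set of equal-length strings, taking the most frequent letter in each column minimizes the sum of Hamming distances to the set, which is why it serves as the center in the Consensus Pattern setting. The function names and the tie-breaking rule (smallest letter) are assumptions made for this illustration.

```python
from collections import Counter


def column_majority(strings):
    """Column-wise majority string of equal-length strings.
    Ties are broken by the smallest letter (an arbitrary choice)."""
    out = []
    for column in zip(*strings):
        counts = Counter(column)
        top = max(counts.values())
        out.append(min(ch for ch, n in counts.items() if n == top))
    return "".join(out)


def total_cost(center, strings):
    """Sum of Hamming distances from the center to every string."""
    return sum(sum(a != b for a, b in zip(center, t)) for t in strings)
```

For the strings "ACGT", "ACGA" and "TCGA", the majority string is "ACGA", with total cost 2.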

29 PTAS for 2-Consensus Pattern problem

30 Definition of PTAS A family of approximation algorithms {A_ε} for a problem P is called a polynomial-time approximation scheme (PTAS) if each algorithm A_ε is a (1+ε)-approximation algorithm and its running time is polynomial in the size of the input for any fixed ε.

31 Vertex-cover problem Vertex cover: given an undirected graph G = (V, E), a subset V' ⊆ V such that if (u, v) ∈ E, then u ∈ V' or v ∈ V' (or both). Size of a vertex cover: the number of vertices in it. Vertex-cover problem: find a vertex cover of minimum size.
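The definition translates directly into a checker, sketched below together with an exhaustive search for a minimum cover. The function names are hypothetical, and the search is exponential in |V|, so it is meant only to illustrate the definition on small graphs.

```python
from itertools import combinations


def is_vertex_cover(edges, cover):
    """True if every edge has at least one endpoint in `cover`."""
    cover = set(cover)
    return all(u in cover or v in cover for u, v in edges)


def min_vertex_cover(vertices, edges):
    """Smallest vertex cover by trying subsets in order of size.
    Always returns, because the full vertex set covers every edge."""
    vertices = list(vertices)
    for size in range(len(vertices) + 1):
        for subset in combinations(vertices, size):
            if is_vertex_cover(edges, subset):
                return set(subset)
```

On the path 1-2-3-4, no single vertex covers all three edges, so a minimum cover has two vertices.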

32 Vertex-cover problem The vertex-cover problem is NP-complete. (See section ). Vertex-cover belongs to NP. Vertex-cover is NP-hard (CLIQUE ≤_P VERTEX-COVER): reduce a CLIQUE instance <G, k>, where G = <V, E>, to the vertex-cover instance <G', |V|-k>, where G' = <V, E'> and E' = {(u, v): u, v ∈ V, u ≠ v and (u, v) ∉ E}. So we look for an approximation algorithm.
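The reduction on this slide can be sketched as a small function that builds the complement graph and the new target size. The function names are hypothetical. The key fact it relies on is that a set C is a clique in G exactly when V minus C is a vertex cover of the complement graph G'.

```python
from itertools import combinations


def complement_edges(vertices, edges):
    """Edges of the complement graph: all pairs u != v that are not in E."""
    present = {frozenset(e) for e in edges}
    return [
        (u, v)
        for u, v in combinations(sorted(vertices), 2)
        if frozenset((u, v)) not in present
    ]


def clique_to_vertex_cover(vertices, edges, k):
    """Map the CLIQUE instance <G, k> to the vertex-cover instance <G', |V|-k>."""
    vertices = list(vertices)
    return vertices, complement_edges(vertices, edges), len(vertices) - k


def is_clique(edges, subset):
    """True if every pair of vertices in `subset` is joined by an edge."""
    present = {frozenset(e) for e in edges}
    return all(frozenset(p) in present for p in combinations(subset, 2))


def is_vertex_cover(edges, cover):
    """True if every edge has at least one endpoint in `cover`."""
    cover = set(cover)
    return all(u in cover or v in cover for u, v in edges)
```

For a triangle on {1, 2, 3} with an extra edge (3, 4) and k = 3, the complement graph has edges (1, 4) and (2, 4), the target cover size is 1, and {4} = V minus the clique {1, 2, 3} is indeed a cover.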

33

34 Conclusion for the approximation solution
Outline: Get a good approximation solution.
10. Conclusion: outputs (c1'', c2'') in polynomial time that satisfy the approximation bound with high probability. Can be derandomized by a standard method [MR95]. Extension to the k = O(1) case: trivial.

35 PTAS for 2-CSS

36 Notation

37 P-Q decomposition
[Figure: the L positions split into the sets Q and P, with sampled positions R drawn from P.]

