Common (Rigid) Subsequence
Longest Common Subsequence (LCS)
– combinatorial pattern matching
– longest common rigid subsequence
– LCS: comnienc
Longest Common Rigid Subsequence (LCRS)
– combinatorial pattern matching
– longest common rigid subsequence
– LCRS: comni, (1,1,3,5), where the numbers are the distances between consecutive matched letters and are the same in both strings
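The difference between the two notions can be checked mechanically on the slide's example. The sketch below is illustrative only: the function names `rigid_occurrences` and `is_subsequence` are hypothetical, and it assumes the spaces in the two phrases count as ordinary characters (the distances (1,1,3,5) only line up in the second phrase if they do).

```python
def rigid_occurrences(letters, gaps, text):
    """Return every start index where `letters` occurs in `text` with the
    given distances (`gaps`) between consecutive matched letters."""
    if len(gaps) != len(letters) - 1:
        raise ValueError("need exactly one gap between each pair of letters")
    # Turn the distances into offsets from the first matched letter.
    offsets = [0]
    for g in gaps:
        offsets.append(offsets[-1] + g)
    span = offsets[-1]
    return [
        start
        for start in range(len(text) - span)
        if all(text[start + off] == ch for off, ch in zip(offsets, letters))
    ]


def is_subsequence(pattern, text):
    """Ordinary (non-rigid) subsequence test: gaps may differ per string."""
    it = iter(text)
    return all(ch in it for ch in pattern)


s1 = "combinatorial pattern matching"
s2 = "longest common rigid subsequence"

print(rigid_occurrences("comni", (1, 1, 3, 5), s1))  # [0]
print(rigid_occurrences("comni", (1, 1, 3, 5), s2))  # [8]
print(is_subsequence("comnienc", s1), is_subsequence("comnienc", s2))  # True True
```

The rigid pattern has to keep the same distances in every string, while the ordinary subsequence may stretch differently in each one.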
Previous Results
LCS and LCRS of two strings:
– Polynomial-time solvable.
LCS of many strings:
– Cannot be approximated within ratio n^δ (for some constant δ > 0) in polynomial time unless P = NP (Jiang and Li 1995, SIAM J. Comput.).
– For random instances, a simple greedy algorithm gives an almost optimal solution with only a small error.
LCRS of many strings:
– Exponential-time algorithms.
– Our CPM paper addresses the time complexity.
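For the two-string LCS case, the polynomial bound comes from the textbook dynamic program. The sketch below is that standard O(|a|·|b|) recurrence, not anything specific to the talk; the name `lcs_length` is hypothetical.

```python
def lcs_length(a, b):
    """Length of a longest common subsequence of strings a and b."""
    # prev[j] holds the LCS length of the processed prefix of a and b[:j].
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


print(lcs_length("ABCBDAB", "BDCABA"))  # 4
```

Only two rows of the table are kept, so memory is linear in the length of the second string.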
Motivation in Bioinformatics
In biochemistry, a motif is a recurring pattern in DNA or protein sequences.
Example: a protein motif (the SH3 domain binding motif), from J. Biological Chemistry 269.
Many motifs can be found in the PROSITE database of ExPASy.
Motivation
Rigoutsos and Floratos proposed the following problem (Bioinformatics 14:55-67, 1998):
– Given n strings and a positive number K, find a longest “rigid pattern” (rigid subsequence) that occurs in at least K of the n strings.
When K = n, this is LCRS.
Exponential-time algorithms were studied; whether the problem is NP-hard was unknown.
Our Results
LCRS is MAX-SNP hard.
– Therefore, Rigoutsos and Floratos’ problem is also MAX-SNP hard.
For random instances, there is an algorithm that solves LCRS in quasi-polynomial average running time.
– With simple modifications, the algorithm also works for Rigoutsos and Floratos’ problem.
MAX-SNP hard
L-reduction from Max-Cut.
[Figure: the construction, with labels: vertex, edge, delimiter]
The construction of each edge
[Figure: three possible configurations of the segments aaa, aba, bab in an ungapped alignment; they contribute 0, 1, and 1, respectively]
The Algorithm
Let S_i be the set of length-i common rigid subsequences.
We only need to prove that
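The slide does not spell out the algorithm itself, so the following is only one natural way to enumerate the sets S_i level by level, not necessarily the authors' procedure. It relies on the fact that dropping the last letter of a common rigid subsequence leaves a common rigid subsequence, so S_{i+1} can be built by extending members of S_i to the right. The function names, the `(offset, letter)` pattern encoding, and the threshold parameter `k` (the K of Rigoutsos and Floratos' problem, defaulting to all strings) are assumptions made for illustration.

```python
from collections import defaultdict


def _frequent(candidates, k):
    """Keep patterns that occur in at least k distinct strings."""
    return {
        pat: occ
        for pat, occ in candidates.items()
        if len({idx for idx, _ in occ}) >= k
    }


def longest_common_rigid(strings, k=None):
    """Return one longest rigid pattern occurring in at least k strings,
    encoded as a tuple of (offset, letter) pairs, or None if none exists.
    Naive level-wise search; exponential in the worst case."""
    if k is None:
        k = len(strings)  # K = n is the LCRS problem
    # Level 1: single letters, with their occurrences (string index, start).
    level = defaultdict(set)
    for idx, s in enumerate(strings):
        for pos, ch in enumerate(s):
            level[((0, ch),)].add((idx, pos))
    level = _frequent(level, k)
    best = None
    while level:  # `level` plays the role of S_i
        best = min(level)  # deterministic representative of this level
        nxt = defaultdict(set)
        for pat, occ in level.items():
            last = pat[-1][0]
            for idx, start in occ:
                s = strings[idx]
                # Extend the occurrence by one letter to the right.
                for q in range(start + last + 1, len(s)):
                    nxt[pat + ((q - start, s[q]),)].add((idx, start))
        level = _frequent(nxt, k)  # this is S_{i+1}
    return best


print(longest_common_rigid(["xaybz", "qaubw"]))  # ((0, 'a'), (2, 'b'))
```

The running time of such a search is governed by the sizes of the sets S_i, which is why bounding their size is the quantity of interest.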
Sketch of Proof
– For each rigid subsequence in S_i: the probability that it occurs in one random string of length n.
– The probability that it occurs in every input string.
– The total number of length-i rigid subsequences.
– The bound can then be established by considering two cases, according to how i compares with 2 log n.
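The formulas from this slide did not survive, so the following is only a hedged reconstruction of how the first two quantities can be bounded, not the authors' exact expressions. It assumes (these symbols are not in the slides) an alphabet Σ, m input strings each of length n, and all characters drawn independently and uniformly. For a fixed rigid subsequence P with i letters and a fixed set of gaps:

```latex
% Assumptions (not from the slides): alphabet \Sigma, m input strings,
% each of length n, all characters independent and uniform.
\begin{align*}
\Pr[P \text{ occurs at one fixed position of a string}] &= |\Sigma|^{-i}, \\
\Pr[P \text{ occurs somewhere in one string}] &\le n\,|\Sigma|^{-i}
  && \text{(union bound over at most $n$ start positions)}, \\
\Pr[P \text{ occurs in all $m$ strings}] &\le \bigl(n\,|\Sigma|^{-i}\bigr)^{m}
  && \text{(independence of the strings)}.
\end{align*}
```

Under these assumptions the per-string bound drops below 1 once i exceeds log_{|Σ|} n, after which the probability of a pattern being common to all strings shrinks geometrically in m.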
Acknowledgement
Supported by NSERC, PREA and CRC.