Computational Molecular Biology Non-unique Probe Selection via Group Testing
My T. Thai
DNA Microarrays
- DNA microarrays are small, solid supports onto which the sequences from thousands of different genes are immobilized, or attached, at fixed locations.
- They hold a very large number of genes on a small chip.
- They are a tool for performing large numbers of DNA-RNA hybridization experiments in parallel.
Applications
- Quantitative analysis of expression levels of individual genes
  - Comparison of cell samples from different tissues
  - Computational diagnostics
- Qualitative analysis of an unknown sample
  - Identification of microbial organisms
  - Detection of contamination of biotechnological products
  - Identification of viral subtypes
Unique Probes vs. Non-unique Probes
- Unique probes
  - Also called gene-specific probes or signature probes.
  - Such probes are difficult to find.
- Non-unique probes
  - Hybridize to more than one target.
  - Designing a test based on non-unique probes is difficult.
Probe-Target Matrix
- 12 probe candidates, 4 targets (genes).
- For a target set S, define P(S) as the set of probes reacting to any target in S.
  - P({1, 2}) = {1, 2, 3, 4, 7, 8, 9, 10, 12}
  - P({2, 3}) = {1, 3, 4, 5, 6, 7, 8, 9, 12}
- Symmetric set difference: P({1, 2}) ∆ P({2, 3}) = {2, 5, 6, 10}. These are the probes that separate the two sets.
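The definitions of P(S) and the separating probes can be sketched in a few lines of Python. The incidence data below is hypothetical and much smaller than the 12-by-4 matrix on the slide, which is not reproduced in the text; the function names are illustrative as well.

```python
# Hypothetical probe-target incidence: probe id -> set of targets it reacts to.
# Illustrative values only; this is NOT the 12 x 4 matrix from the slide.
INCIDENCE = {
    1: {1, 2},
    2: {1},
    3: {2, 3},
    4: {3},
    5: {1, 3},
}


def probes_reacting(targets, incidence=INCIDENCE):
    """P(S): the probes that react to at least one target in S."""
    targets = set(targets)
    return {probe for probe, hit in incidence.items() if hit & targets}


def separating_probes(s1, s2, incidence=INCIDENCE):
    """P(S1) symmetric-difference P(S2): the probes that separate S1 from S2."""
    return probes_reacting(s1, incidence) ^ probes_reacting(s2, incidence)


if __name__ == "__main__":
    print(sorted(probes_reacting({1})))
    print(sorted(separating_probes({1}, {2})))
```

A probe separates two target sets exactly when it reacts to one of them and not the other, which is why the symmetric difference is the right set operation here.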
Probe-Target Matrix: a non-unique probe for each sequence group
[Figure: probes p1 to p5 against sequences s1 to s9 (targets t1 to t9), which fall into three groups, group1 to group3.]
The Problem
- Given: a sample with m items and a set of n non-unique probes.
- Goal: determine the presence or absence of targets.
The Approach
Three steps:
1. Pre-select suitable probe candidates and compute the probe-target incidence matrix H.
2. Select a minimal set of probes and compute a suitable design matrix D (a sub-matrix of H).
3. Decode the result.
An Example
[Figure: matrix H and its sub-matrix D.]
Observation
- What property must the sub-matrix D have?
- To identify at most d targets without testing errors, D should be either a d-separable matrix or a d-disjunct matrix.
- With at most e errors, the matrix should also be e-error-correcting: the Hamming distance between any two unions of columns must be at least 2e + 1.
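As a minimal sketch of the two properties, the brute-force checks below test a small 0/1 matrix (rows are probes, columns are targets). The function names are hypothetical, and the separability check uses the "at most d columns" variant as an assumption, since the slide speaks of identifying at most d targets. Both checks are exponential in d and are meant only for small examples.

```python
from itertools import combinations


def column_supports(matrix):
    """For each column, the set of row indices where the column has a 1."""
    n_cols = len(matrix[0])
    return [frozenset(r for r, row in enumerate(matrix) if row[c])
            for c in range(n_cols)]


def is_d_disjunct(matrix, d):
    """True if no column is contained in the union of d other columns."""
    cols = column_supports(matrix)
    for j, col in enumerate(cols):
        others = [c for i, c in enumerate(cols) if i != j]
        for group in combinations(others, min(d, len(others))):
            if col <= frozenset().union(*group):
                return False
    return True


def is_d_separable(matrix, d):
    """True if the unions of any two distinct sets of at most d columns differ.

    Assumption: the 'at most d' variant is used, so the empty set counts too.
    """
    cols = column_supports(matrix)
    seen = set()
    for size in range(d + 1):
        for subset in combinations(range(len(cols)), size):
            union = frozenset().union(*(cols[i] for i in subset))
            if union in seen:
                return False
            seen.add(union)
    return True


if __name__ == "__main__":
    identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
    print(is_d_disjunct(identity, 2), is_d_separable(identity, 2))
```

A matrix such as `[[1, 1, 0], [0, 1, 1]]` passes the separability check for d = 1 but fails the disjunctness check, because its first column is contained in its second.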
Non-unique Probe Selection via Group Testing
- Given a matrix H,
- find a sub-matrix D with the minimum number of rows such that D is d-separable or d-disjunct.
- (This definition extends easily to the error-correcting model.)
Min-d-DS
- Minimum-d-Disjunct Submatrix: given a binary matrix M, find a submatrix H with the minimum number of rows and the same number of columns such that H is d-disjunct.
Complexity
- Theorem: Min-d-DS is NP-hard for any fixed d ≥ 1.
- Proof: by reduction from 3-dimensional matching.
[Figure: 3-dimensional matching instance; labels X, Y, Z and a, b, c.]
Approximation
- Pair: (c_0, C), where c_0 is a column and C is a set of d other columns.
- Cover: a probe covers a pair (c_0, C) if its entry at c_0 is 1 and its entries at all columns of C are 0.
- Greedy approach: while some pair is still uncovered, choose a probe that covers the most uncovered pairs.
- Approximation ratio: 1 + (d+1) ln n.
- If NP is not contained in DTIME(n^{log log n}), then no polynomial-time approximation has performance ratio (1-ε) ln n for any ε > 0.
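The greedy approach can be sketched as a set-cover loop over pairs consisting of one column and d other columns. This is a minimal illustration under stated assumptions, not the author's implementation: the function name is hypothetical, ties are broken by the lowest row index, and all pairs are enumerated explicitly, so it only suits small matrices.

```python
from itertools import combinations


def greedy_d_disjunct_rows(matrix, d):
    """Greedily pick rows of a 0/1 matrix until every pair is covered.

    A pair is (c0, others), with `others` a tuple of d columns excluding c0.
    Returns the sorted list of chosen row indices, or None if the full
    matrix cannot cover every pair (i.e. it is not d-disjunct itself).
    """
    n_cols = len(matrix[0])
    uncovered = {
        (c0, others)
        for c0 in range(n_cols)
        for others in combinations([c for c in range(n_cols) if c != c0], d)
    }

    def covers(row, pair):
        c0, others = pair
        return row[c0] == 1 and all(row[c] == 0 for c in others)

    chosen = []
    while uncovered:
        best_row, best_gain = None, set()
        for r, row in enumerate(matrix):
            gain = {p for p in uncovered if covers(row, p)}
            if len(gain) > len(best_gain):
                best_row, best_gain = r, gain
        if best_row is None:  # no row covers any remaining pair
            return None
        chosen.append(best_row)
        uncovered -= best_gain
    return sorted(chosen)


if __name__ == "__main__":
    M = [[1, 1, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 1]]
    print(greedy_d_disjunct_rows(M, 1))
```

The result is always a feasible row set when one exists, but as with any greedy set-cover heuristic it need not be a minimum one.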
Pool Size is 2
- Consider the case in which each probe hybridizes with exactly 2 targets.
- The minimum 1-separable submatrix problem is also called the minimum test cover problem.
- Minimum test cover is APX-complete.
- Min-d-DS, by contrast, is polynomial-time solvable.
Lemma
- Consider a collection C of pools of size at most 2.
- Let G be the graph with all items as vertices and all pools of size 2 as edges.
- Then C gives a d-disjunct matrix if and only if every item not in a singleton pool has degree at least d+1 in G.
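The lemma turns the disjunctness test into a degree check, which the sketch below applies directly. The function name and the representation of pools as Python sets are assumptions made for illustration; degree is counted as the number of distinct neighbours, so duplicate pools do not inflate it.

```python
from collections import defaultdict


def is_d_disjunct_small_pools(pools, items, d):
    """Degree test from the lemma, for pools of size 1 or 2.

    pools: iterable of sets of items, each of size 1 or 2.
    items: iterable of all items (the columns of the matrix).
    """
    pools = [set(p) for p in pools]
    singleton_items = {next(iter(p)) for p in pools if len(p) == 1}
    neighbours = defaultdict(set)
    for p in pools:
        if len(p) == 2:
            u, v = tuple(p)
            neighbours[u].add(v)
            neighbours[v].add(u)
    return all(
        item in singleton_items or len(neighbours[item]) >= d + 1
        for item in items
    )


if __name__ == "__main__":
    triangle = [{1, 2}, {2, 3}, {1, 3}]
    print(is_d_disjunct_small_pools(triangle, [1, 2, 3], 1))
    print(is_d_disjunct_small_pools(triangle, [1, 2, 3], 2))
```

In the triangle every item has degree 2, so the pools are 1-disjunct but not 2-disjunct; adding a singleton pool for each item makes them d-disjunct for every d.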
Theorem
- Min-d-DS is polynomial-time solvable in the case that all given pools have size exactly 2.
- Proof: let G be the graph representing M. Finding a minimum d-DS is equivalent to finding a subgraph H of G with the minimum number of edges such that every vertex has degree at least d + 1 in H.
- This is equivalent to maximizing the number of edges in G - H such that every vertex v has degree at most d_G(v) - d - 1 in G - H, where d_G(v) is the degree of v in G.
- The latter is a degree-constrained subgraph (b-matching) problem, which is solvable in polynomial time.
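The equivalence used in the proof follows from a short counting argument, written out below as a sketch. Here G - H denotes the graph on the same vertices with edge set E(G) \ E(H), and deg and d_G are notation chosen for this note.

```latex
% Every edge of G lies in exactly one of H and G - H, so for each vertex v:
\deg_{H}(v) + \deg_{G-H}(v) = d_G(v), \qquad |E(H)| + |E(G-H)| = |E(G)| .
% Hence the degree constraint on H becomes a degree bound on G - H:
\deg_{H}(v) \ge d+1 \iff \deg_{G-H}(v) \le d_G(v) - d - 1 ,
% and minimizing |E(H)| is the same as maximizing |E(G-H)|.
```

A feasible H exists only if every vertex of G has degree at least d + 1, since otherwise the upper bound on the right-hand side is negative.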
Complexity
- Min-1-DS is NP-hard in the case that all given pools have size at most 2.
  - Proof: by reduction from Vertex-Cover.
- Min-d-DS is MAX SNP-complete in the case that all given pools have size at most 2, for d ≥ 2.
  - Proof: by reduction from VC-CUBIC.
  - VC-CUBIC: given a cubic graph G, find a minimum vertex cover of G.
Approximation
The algorithm consists of 2 steps:
- Step 1: compute a minimum solution H of the polynomial-time solvable problem described above (pools of size exactly 2).
- Step 2: choose all singleton pools at vertices with degree less than d + 1 in H.
Approximation Ratio Analysis
- Suppose all given pools have size at most 2.
- Let s be the number of given singleton pools.
- Then any feasible solution of Min-d-DS contains at least s + (n-s)(d+1)/2 pools.
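One way to see this lower bound is the following counting sketch, which relies on the degree lemma for pools of size at most 2. The symbols s' and m' (the numbers of singleton pools and size-2 pools in a feasible solution) are introduced here for illustration, and d ≥ 1 is assumed.

```latex
% At least n - s' items have no singleton pool, so each needs degree >= d+1.
% Every size-2 pool contributes to the degree of exactly two items:
2m' \;\ge\; (n - s')(d+1) .
% Total number of pools in the solution:
s' + m' \;\ge\; s' + \frac{(n-s')(d+1)}{2}
        \;=\; \frac{n(d+1)}{2} - \frac{s'(d-1)}{2} .
% The right-hand side is non-increasing in s' for d >= 1, and s' <= s, so
s' + m' \;\ge\; \frac{n(d+1)}{2} - \frac{s(d-1)}{2}
        \;=\; s + \frac{(n-s)(d+1)}{2} .
```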
Theorem
- The algorithm above is a polynomial-time approximation with performance ratio 1 + 2/(d+1).
Proof
- Suppose H contains m edges and k vertices of degree at least d+1, so the algorithm outputs m + (n-k) pools.
- Suppose an optimal solution contains s* singletons and m* pools of size 2.
- Then m ≤ m* and (n-k) - s* ≤ 2m*/(d+1).
- Hence (n-k) + m ≤ s* + m* + 2m*/(d+1) ≤ (s* + m*)(1 + 2/(d+1)).
More Challenging: Experimental Errors
- False negative: the pool (probe) contains some positive targets but returns a negative outcome.
- False positive: the pool contains only negative targets but returns a positive outcome.
An e-Error-Correcting Model
- Assume there are at most e errors in testing.
- (d, e)-disjunct matrix: every column t_j has at least e + 1 entries not contained in the union of any other d columns.
- Theorem: a (d+e)-disjunct matrix without any isolated column is a (d, e)-disjunct matrix.
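A brute-force sketch for small matrices: the hypothetical helper below reports the smallest number of "private" rows that any column has against any d other columns. Under the convention used in the decoding slides, where a (d, 2e)-disjunct matrix guarantees 2e + 1 such rows, a matrix is (d, e)-disjunct when this minimum is at least e + 1; that threshold is stated here as an assumption about the convention.

```python
from itertools import combinations


def min_private_rows(matrix, d):
    """Minimum, over every column j and every set of d other columns, of the
    number of rows with a 1 in column j and 0 in all d other columns."""
    n_cols = len(matrix[0])
    best = None
    for j in range(n_cols):
        others = [c for c in range(n_cols) if c != j]
        for group in combinations(others, min(d, len(others))):
            private = sum(
                1 for row in matrix
                if row[j] == 1 and not any(row[c] for c in group)
            )
            best = private if best is None else min(best, private)
    return best


if __name__ == "__main__":
    # Three stacked copies of the 3 x 3 identity matrix (illustrative data).
    stacked = [[1, 0, 0], [0, 1, 0], [0, 0, 1]] * 3
    print(min_private_rows(stacked, 2))
```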
Decoding Algorithm with e Errors
- Theorem: if the number of errors is at most e, then the number of negative pools containing a positive item is always smaller than the number of negative pools containing a negative item.
- Algorithm: assuming there are exactly d positive items, compute for each item the number of negative results containing it, and select the d items with the smallest counts.
- Time complexity: O(tn).
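A minimal sketch of this decoding rule, assuming the design is given as a 0/1 matrix with one row per pool and the outcomes as a 0/1 list (1 for a positive result). The function name is hypothetical. The sort adds an O(n log n) term on top of the O(tn) counting pass; a linear-time selection could replace it.

```python
def decode_exactly_d(matrix, outcomes, d):
    """Return the d items that appear in the fewest negative pools."""
    n_items = len(matrix[0])
    negative_hits = [0] * n_items
    for row, outcome in zip(matrix, outcomes):
        if outcome == 0:  # negative pool: count every item it contains
            for item, entry in enumerate(row):
                if entry:
                    negative_hits[item] += 1
    ranked = sorted(range(n_items), key=lambda item: negative_hits[item])
    return sorted(ranked[:d])


if __name__ == "__main__":
    # Illustrative design: three stacked copies of the 3 x 3 identity matrix.
    design = [[1, 0, 0], [0, 1, 0], [0, 0, 1]] * 3
    # Item 0 is the only positive; the first pool is a false negative.
    outcomes = [0, 0, 0, 1, 0, 0, 1, 0, 0]
    print(decode_exactly_d(design, outcomes, d=1))
```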
In an S(d, n) Sample
- What if the sample contains at most d positives, rather than exactly d?
- The previous theorem still holds, but the decoding algorithm no longer works.
- Decoding on a (d, e)-disjunct matrix is still possible, with time complexity O((n + t) t^e), where t is the number of selected probes.
Decoding
- Theorem: suppose testing is done on a (d, 2e)-disjunct matrix H with at most e errors. Then a positive item appears in at most e negative results.
- Proof: since there are at most e errors, a positive item can appear in at most e negative results (all due to errors). A negative item, however, appears in at least 2e + 1 - e = e + 1 > e negative results. Hence an item is positive if and only if it appears in at most e negative results.
Decoding Algorithm
- For each item, count the number of negative results containing it.
- If this number is at most e, the item is positive; otherwise it is negative.
- This is linear decoding.
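The counting rule can be sketched as a single pass over the design matrix. This assumes a (d, 2e)-disjunct design, at most d positives, and at most e errors, and it uses the threshold from the decoding theorem (a positive item appears in at most e negative results). The function name and the 0/1 encoding of outcomes are illustrative choices.

```python
def decode_threshold(matrix, outcomes, e):
    """Return the items that appear in at most e negative pools."""
    n_items = len(matrix[0])
    negative_hits = [0] * n_items
    for row, outcome in zip(matrix, outcomes):
        if outcome == 0:  # negative pool: count every item it contains
            for item, entry in enumerate(row):
                if entry:
                    negative_hits[item] += 1
    return [item for item in range(n_items) if negative_hits[item] <= e]


if __name__ == "__main__":
    # Illustrative design: three stacked identity matrices, which give every
    # item 3 = 2e + 1 private rows for e = 1.
    design = [[1, 0, 0], [0, 1, 0], [0, 0, 1]] * 3
    # Items 0 and 1 are positive; the fourth pool is a false negative.
    outcomes = [1, 1, 0, 0, 1, 0, 1, 1, 0]
    print(decode_threshold(design, outcomes, e=1))
```

Unlike selecting the d smallest counts, this rule does not need to know the exact number of positives, and its running time is linear in the size of the matrix.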