Approximation Algorithms for the Selection of Robust Tag SNPs

Speaker: Yao-Ting Huang
Advisor: Kun-Mao Chao
National Taiwan University, Department of Computer Science & Information Engineering
Algorithms and Computational Biology Lab.
2019/5/10

References
Zhang, K., Sun, F., Waterman, M.S., Chen, T. Dynamic programming algorithms for haplotype block partitioning: Applications to human chromosome 21 haplotype data. RECOMB, 2003.
Patil, N., et al. Blocks of limited haplotype diversity revealed by high-resolution scanning of human chromosome 21. Science 294, 2001.
Zhang, K., Deng, M., Chen, T., Waterman, M.S., Sun, F. A dynamic programming algorithm for haplotype block partitioning. Proceedings of the National Academy of Sciences of the United States of America, 2002.
Garey, M.R. and Johnson, D.S. Computers and Intractability. Freeman, New York, 1979.
Cormen, T.H., Leiserson, C.E., Rivest, R.L., and Stein, C. Introduction to Algorithms. MIT Press, 2001.

Outline
Introduction
Problem formulation
Algorithms for finding the robust tag SNPs
Algorithms for finding the auxiliary SNPs
Discussion

Biological background
A single nucleotide polymorphism (SNP) is a single-base variation in DNA.
Whether a variation is called a SNP or a mutation depends on its observed frequency in a population.
SNP: a single-base DNA variation found in more than 1% of the population.
Mutation: a single-base DNA variation found in less than 1% of the population.

Type       Sequence            Frequency in the general population
SNP        A C T T A G C T T   94%
SNP        A C T T A G C T C   6%
Mutation   A C T T A G C T T   99.9%
Mutation   A C T T A G C T C   0.1%

Introduction (1/2)
Recent studies have shown that chromosome recombination takes place only at some narrow hotspots.
Haplotype blocks are the segments between these hotspots, where little or no recombination occurs.
The SNPs within each haplotype block are highly correlated because of the low diversity in each block.
Tag SNPs are a small subset of the SNPs in a block that is sufficient to capture the entire block pattern.

Introduction (2/2)
Haplotype blocks with their corresponding tag SNPs are cost-effective in association studies: identifying a testing sample does not require genotyping all SNPs within the haplotype block.
Many studies have tried to minimize the number of tag SNPs required to identify each block, e.g., Patil et al., Zhang et al., and Bafna et al.
A tag SNP is genotyped as missing data if it does not pass the data-quality threshold.
The testing sample may fail to be identified when missing data occur.

An Example
[Figure: example over the SNPs S1, S2, S3, and S4. Legend: major allele, minor allele, missing data.]

Definition
Auxiliary SNPs are SNPs that work together with the tag SNPs to distinguish the haplotype block, e.g., S3 w.r.t. T2 and S2 w.r.t. T3.
Robust tag SNPs are SNPs that distinguish the haplotype block and avoid the ambiguity caused by missing data, e.g., S1, S2, S3, and S4 can distinguish any testing sample in which one missing datum occurs.
The number of robust tag SNPs required depends on the number of missing data that occur.

The problem of finding the robust tag SNPs
Input: An N × K matrix Mh and an integer m.
N: the number of SNPs in the haplotype block.
K: the number of training samples.
m: the number of missing data.
Output: The minimum subset of rows (SNPs) in Mh that can distinguish the K columns (block patterns) when m missing data occur.
[Figure: an example matrix Mh with N rows and K columns.]

Reduction to the set covering problem (1/2)
Define the set corresponding to the rth row of Mh as Sr = {(i, j) | Mh[r, i] ≠ Mh[r, j] and i < j}, e.g., the row {1,1,1,2} gives {(1,4), (2,4), (3,4)}.
Let P = {(i, j) | 1 ≤ i < j ≤ K} be the set that contains every pair of the K block patterns.
A set of rows can distinguish the K patterns iff the corresponding collection of sets Sr covers all elements of P.
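
The construction of Sr and the covering test can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, alleles are encoded as 1 and 2 as in the slide's example row, and pattern indices are 1-based to match the pairs shown on the slide.

```python
from itertools import combinations


def distinguishing_pairs(row):
    """Return S_r: the 1-based pairs (i, j), i < j, whose alleles differ in this row."""
    return {
        (i + 1, j + 1)
        for i, j in combinations(range(len(row)), 2)
        if row[i] != row[j]
    }


def all_pairs(k):
    """Return P: every pair (i, j) with 1 <= i < j <= k."""
    return set(combinations(range(1, k + 1), 2))


def can_distinguish(rows):
    """True iff the sets S_r of the given rows together cover all of P."""
    covered = set().union(*(distinguishing_pairs(r) for r in rows))
    return covered == all_pairs(len(rows[0]))


# The slide's example row: only pattern 4 differs from the others.
print(distinguishing_pairs([1, 1, 1, 2]))  # the pairs (1,4), (2,4), (3,4)
```

A single two-allele row can never separate four patterns on its own, which is why `can_distinguish([[1, 1, 1, 2]])` is false while two complementary rows such as `[1, 1, 2, 2]` and `[1, 2, 1, 2]` suffice.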

Reduction to the set covering problem (2/2)
A collection C of sets covers every element of P at least (m + 1) times iff C can distinguish the K patterns when m missing data occur.
[Figure: the sets S1, S2, S3, S4 and the pairs (1,2), (1,3), (1,4), (2,3), (2,4), (3,4) that they cover.]
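
The (m + 1)-fold covering condition can be checked directly by counting, for each pair of patterns, how many chosen SNPs distinguish it. The sketch below assumes a small hypothetical matrix `M_H` (not the one from the talk), hypothetical function names, and 0-based row and pattern indices.

```python
from itertools import combinations


def coverage_counts(matrix, chosen):
    """For each pattern pair, count how many chosen SNP rows distinguish it."""
    k = len(matrix[0])
    counts = {pair: 0 for pair in combinations(range(k), 2)}
    for r in chosen:
        for i, j in counts:
            if matrix[r][i] != matrix[r][j]:
                counts[(i, j)] += 1
    return counts


def tolerates_missing(matrix, chosen, m):
    """True iff every pair is covered at least m + 1 times by the chosen rows."""
    return all(c >= m + 1 for c in coverage_counts(matrix, chosen).values())


# Hypothetical block: 4 SNPs (rows) by 4 block patterns (columns).
M_H = [
    [1, 1, 2, 2],
    [1, 2, 1, 2],
    [1, 2, 2, 1],
    [1, 1, 1, 2],
]

print(tolerates_missing(M_H, [0, 1, 2], 1))  # True
print(tolerates_missing(M_H, [0, 1], 1))     # False
```

In this example, dropping any one of the three chosen rows still leaves every pair covered once, which is exactly what tolerating one missing datum means.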

Finding the robust tag SNPs
Finding the minimum set of robust tag SNPs is NP-complete.
Consider the restriction of this problem where m is set to 0: each element of P needs to be covered once by C, which is the same as the set covering problem.
We propose two greedy approximation algorithms to solve this problem.
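
For very small blocks, an exact answer can still be obtained by trying subsets in order of increasing size, which gives a baseline for judging the greedy algorithms. The sketch below is not part of the talk; the function name is hypothetical, indices are 0-based, and the running time is exponential in the number of SNPs.

```python
from itertools import combinations


def min_robust_tag_snps(matrix, m):
    """Exhaustive search for a smallest row subset covering every pair m + 1 times.

    Returns a list of 0-based row indices, or None if no subset works.
    """
    n, k = len(matrix), len(matrix[0])
    pairs = list(combinations(range(k), 2))
    for size in range(1, n + 1):
        for subset in combinations(range(n), size):
            if all(
                sum(matrix[r][i] != matrix[r][j] for r in subset) >= m + 1
                for i, j in pairs
            ):
                return list(subset)
    return None


# Hypothetical 4 x 4 block used only for illustration.
M_H = [
    [1, 1, 2, 2],
    [1, 2, 1, 2],
    [1, 2, 2, 1],
    [1, 1, 1, 2],
]

print(min_robust_tag_snps(M_H, 0))  # [0, 1]
print(min_robust_tag_snps(M_H, 1))  # [0, 1, 2]
print(min_robust_tag_snps(M_H, 2))  # None
```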

The first greedy algorithm
[Figure: a two-dimensional array whose columns are the pairs (1,2), (1,3), (1,4), (2,3), (2,4), (3,4) and whose grids are filled by the sets S1, S2, S3, S4.]
Suppose the bipartite graph is implemented by a two-dimensional array.
The greedy algorithm covers all elements in one row at each stage.
While covering the ith row, the algorithm picks a set S that covers the maximum number of uncovered elements in the ith row.
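
One possible reading of this row-by-row strategy is sketched below. It assumes the table has m + 1 rows, that a pair counts as covered in row i once at least i chosen SNPs distinguish it, and that each SNP is picked at most once; these details, the function name, and the 0-based indices are assumptions, since the slide only gives the outline.

```python
from itertools import combinations


def greedy_by_row(matrix, m):
    """Row-by-row greedy: stage i makes every pair covered at least i times."""
    n, k = len(matrix), len(matrix[0])
    pairs = list(combinations(range(k), 2))
    # covers[r] plays the role of S_r, with 0-based pattern indices.
    covers = [
        {(i, j) for i, j in pairs if matrix[r][i] != matrix[r][j]}
        for r in range(n)
    ]
    count = {p: 0 for p in pairs}  # chosen SNPs distinguishing each pair
    chosen = []
    for level in range(1, m + 2):  # rows 1 .. m + 1 of the table
        while True:
            uncovered = {p for p in pairs if count[p] < level}
            if not uncovered:
                break
            candidates = [r for r in range(n) if r not in chosen]
            best = max(
                candidates,
                key=lambda r: len(covers[r] & uncovered),
                default=None,
            )
            if best is None or not (covers[best] & uncovered):
                raise ValueError("no remaining SNP can cover the current row")
            chosen.append(best)
            for p in covers[best]:
                count[p] += 1
    return chosen
```

With the hypothetical matrix `[[1, 1, 2, 2], [1, 2, 1, 2], [1, 2, 2, 1], [1, 1, 1, 2]]`, `greedy_by_row(matrix, 1)` returns `[0, 1, 2]`: rows 0 and 1 complete the first table row, and row 2 completes the second.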

The approximation ratio of the first greedy algorithm
[Figure: the table over the pairs (1,2), (1,3), (1,4), (2,3), (2,4), (3,4), with example scores 1/2, 1/4, 1/2, 1/2.]
Let |Si’| be the number of elements covered by Si that contribute to the greedy algorithm, e.g., |S2’| = 2 since S2 covers two elements, (1,2) and (3,4), in the first row.
Assign each grid covered by Si the score 1/|Si’|.

Let Rik be the number of grids in the ith row remaining uncovered before the kth iteration.

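[Figure: the collection of sets picked by the greedy algorithm compared with the collection of sets picked by the optimal solution C*, drawn over the sets S1 to S7.]
There exists at least one set in C* with size at least Rik / |C*|, since the sets in C* together cover all Rik uncovered grids in the ith row.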

Because the algorithm always picks a set that covers the maximum number of uncovered elements in the row, the size of the next selected set (|Sk’|) must be at least Rik / |C*|.
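
The bound on the size of each selected set leads to an approximation ratio by the standard set-cover charging argument. The derivation below is a sketch of that standard argument, not a transcription of the speaker's formulas. It assumes that C is the collection picked by the greedy algorithm, C* is an optimal collection, Rik is the number of uncovered grids in row i before iteration k, Sk’ is the set of grids newly covered in iteration k, each such grid receives the score 1/|Sk’|, and |Sk’| is at least Rik / |C*|.

```latex
% Row i: every picked set contributes total score 1, so the row's total
% score equals the number of sets picked while covering that row.
\begin{align*}
\sum_{k} |S_k'| \cdot \frac{1}{|S_k'|}
  &\le \sum_{k} |S_k'| \cdot \frac{|C^*|}{R_{ik}}
  && \text{(since } |S_k'| \ge R_{ik} / |C^*| \text{)} \\
  &= |C^*| \sum_{k} \frac{R_{ik} - R_{i,k+1}}{R_{ik}}
  && \text{(since } |S_k'| = R_{ik} - R_{i,k+1} \text{)} \\
  &\le |C^*| \sum_{j=1}^{|P|} \frac{1}{j}
  \;\le\; |C^*| \, (1 + \ln |P|) \\
% Summing over the m + 1 rows of the table:
|C| &\le (m + 1) \, |C^*| \, (1 + \ln |P|),
  \qquad |P| = \frac{K(K-1)}{2}
\end{align*}
```

Under these assumptions the row-by-row greedy algorithm is within a factor of roughly (m + 1) ln(K(K-1)/2) of the optimum.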

The second greedy algorithm
[Figure: the table over the pairs (1,2), (1,3), (1,4), (2,3), (2,4), (3,4), with grids filled by the sets S1, S2, S4, S5.]
The greedy algorithm considers all elements in the table at each stage.
The algorithm always picks a set S that covers the maximum number of uncovered elements in the whole table.
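
A minimal sketch of this whole-table variant follows. It assumes a pair still has an uncovered grid as long as fewer than m + 1 chosen SNPs distinguish it, so a candidate's gain is the number of such pairs it distinguishes; the function name and 0-based indices are hypothetical.

```python
from itertools import combinations


def greedy_whole_table(matrix, m):
    """Whole-table greedy: pick the SNP filling the most still-empty grids."""
    n, k = len(matrix), len(matrix[0])
    pairs = list(combinations(range(k), 2))
    # covers[r] plays the role of S_r, with 0-based pattern indices.
    covers = [
        {(i, j) for i, j in pairs if matrix[r][i] != matrix[r][j]}
        for r in range(n)
    ]
    need = m + 1
    count = {p: 0 for p in pairs}
    chosen = []
    while any(count[p] < need for p in pairs):
        best, best_gain = None, 0
        for r in range(n):
            if r in chosen:
                continue
            # Each chosen SNP fills at most one empty grid per pair it covers.
            gain = sum(1 for p in covers[r] if count[p] < need)
            if gain > best_gain:
                best, best_gain = r, gain
        if best is None:
            raise ValueError("the remaining SNPs cannot fill the table")
        chosen.append(best)
        for p in covers[best]:
            count[p] += 1
    return chosen
```

On the hypothetical matrix `[[1, 1, 2, 2], [1, 2, 1, 2], [1, 2, 2, 1], [1, 1, 1, 2]]` with m = 1 this also returns `[0, 1, 2]`; the two strategies can differ on larger inputs because this one is free to fill grids in any row.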

The approximation ratio of the second greedy algorithm
[Figure: the table over the pairs (1,2), (1,3), (1,4), (2,3), (2,4), (3,4), with example scores 1/2, 1/4, 1/2, 1/4.]
The analysis is similar to that of the first greedy algorithm.

Finding auxiliary SNPs
Input: An N × K matrix Mh and a set S of tag SNPs.
Output: The minimum set of auxiliary SNPs A such that A ∪ S can identify the testing sample without ambiguity.
Finding auxiliary SNPs is NP-complete: when all tag SNPs are genotyped as missing data, the auxiliary SNPs must form another set of tag SNPs that distinguishes the K block patterns.
Auxiliary SNPs can be found efficiently when robust tag SNPs have been computed in advance.

Finding auxiliary SNPs
[Figure: the table over the pairs (1,2), (1,3), (1,4), (2,3), (2,4), (3,4) with the sets that cover each pair, and the testing samples T1 and T2.]
If only one pattern matches (e.g., T1), the testing sample is identified as that block pattern (e.g., B2) and we are done.
Otherwise (e.g., T2), for each pair of ambiguous patterns, traverse the corresponding column to find a set that can distinguish them.
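
The lookup can be sketched as follows. This is a simplified illustration, not the speaker's implementation: instead of a precomputed table it scans the matrix rows directly, the testing sample is assumed to be a dict mapping each tag SNP row to its observed allele (or None when the genotype is missing), and all names and the 0-based indices are hypothetical.

```python
from itertools import combinations


def matching_patterns(matrix, tags, sample):
    """Patterns consistent with every tag SNP that was genotyped successfully."""
    k = len(matrix[0])
    return [
        j for j in range(k)
        if all(sample[r] is None or matrix[r][j] == sample[r] for r in tags)
    ]


def auxiliary_snps(matrix, tags, sample):
    """Return (matching patterns, auxiliary SNP rows that resolve the ambiguity)."""
    matches = matching_patterns(matrix, tags, sample)
    aux = []
    for i, j in combinations(matches, 2):
        if any(matrix[r][i] != matrix[r][j] for r in aux):
            continue  # an auxiliary SNP chosen earlier already separates them
        for r in range(len(matrix)):
            if r not in tags and matrix[r][i] != matrix[r][j]:
                aux.append(r)
                break
        else:
            raise ValueError(f"no non-tag SNP separates patterns {i} and {j}")
    return matches, aux


# Hypothetical 4 x 4 block with tag SNPs in rows 0 and 1.
M_H = [
    [1, 1, 2, 2],
    [1, 2, 1, 2],
    [1, 2, 2, 1],
    [1, 1, 1, 2],
]

print(auxiliary_snps(M_H, [0, 1], {0: 2, 1: 1}))     # ([2], [])
print(auxiliary_snps(M_H, [0, 1], {0: 2, 1: None}))  # ([2, 3], [2])
```

In the second call the missing genotype at row 1 leaves patterns 2 and 3 ambiguous, and row 2 is returned as the auxiliary SNP because it is the first non-tag row in which the two patterns differ.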
