Approximation Algorithms for the Selection of Robust Tag SNPs


Approximation Algorithms for the Selection of Robust Tag SNPs Speaker: Yao-Ting Huang Advisor: Kun-Mao Chao National Taiwan University Department of Computer Science & Information Engineering Algorithms and Computational Biology Lab. 2019/5/10

References
Zhang, K., Sun, F., Waterman, M.S., Chen, T. Dynamic programming algorithms for haplotype block partitioning: applications to human chromosome 21 haplotype data. RECOMB, 2003.
Patil, N., et al. Blocks of limited haplotype diversity revealed by high-resolution scanning of human chromosome 21. Science 294: 1719-1723, 2001.
Zhang, K., Deng, M., Chen, T., Waterman, M.S., Sun, F. A dynamic programming algorithm for haplotype block partitioning. Proceedings of the National Academy of Sciences of the United States of America, 2002.
Garey, M.R. and Johnson, D.S. Computers and Intractability. Freeman, New York, 1979.
Cormen, T.H., Leiserson, C.E., Rivest, R.L., and Stein, C. Introduction to Algorithms. MIT Press, 2001.

Outline Introduction Problem formulation Algorithms for finding the robust tag SNPs Algorithms for finding the auxiliary SNPs Discussion

Biological background
A Single Nucleotide Polymorphism (SNP) is a single-base variation in DNA. Whether a variation is called a SNP or a mutation depends on its observed frequency in a population.
SNP: a single-base DNA variation found in more than 1% of the population.
Mutation: a single-base DNA variation found in less than 1% of the population.
Example (SNP): in the general population, A C T T A G C T T occurs at 94% and A C T T A G C T C at 6%.
Example (mutation): in the general population, A C T T A G C T T occurs at 99.9% and A C T T A G C T C at 0.1%.

Introduction (1/2) Recent studies have shown that chromosome recombination takes place only at narrow hotspots. Haplotype blocks are the segments between these hotspots, where little or no recombination occurs. The SNPs within each haplotype block are highly correlated because of the low diversity in the block. Tag SNPs are a small subset of the SNPs in a block that is sufficient to capture the entire block pattern.

Introduction (2/2) Haplotype blocks with their corresponding tag SNPs are cost-effective in association studies: identifying a testing sample does not require genotyping all SNPs within the haplotype block. Many studies have tried to minimize the number of tag SNPs required to identify each block, e.g., Patil et al., Zhang et al., and Bafna et al. However, a tag SNP is genotyped as missing data if it does not pass the data-quality threshold, and the testing sample may fail to be identified when missing data occur.

An Example [Figure: example over the SNPs S1, S2, S3, and S4; legend: major allele, minor allele, missing data.]

Definition Auxiliary SNPs are the SNPs which work together with the tag SNPs to distinguish the haplotype block, e.g., S3 w.r.t. T2 and S2 w.r.t. T3. Robust tag SNPs are the SNPs which distinguish the haplotype block and avoid the ambiguity caused by missing data, e.g., S1, S2, S3, and S4 can distinguish any testing sample when one missing data occurs. The number of robust tag SNPs required depends on the number of missing data that occur.

The problem of finding the robust tag SNPs
Input: An N × K matrix Mh and an integer m.
N: the number of SNPs in the haplotype block
K: the number of training samples
m: the number of missing data
e.g. (N = 4 rows, K = 4 columns),
1 1 2 2
1 1 1 2
1 2 2 1
1 2 1 2
Output: The minimum subset of rows (SNPs) in Mh which can distinguish these K columns (block patterns) when m missing data occur.

Reduction to the set covering problem (1/2) Define the set corresponding to the rth row in Mh as Sr = {(i, j) | Mh[r, i] ≠ Mh[r, j] and i < j}, e.g., {1,1,1,2} => {(1,4), (2,4), (3,4)}. Let P = {(i, j) | 1 ≤ i < j ≤ K} be the set that contains every pair of the K block patterns. A set of rows can distinguish the K patterns iff the corresponding collection of sets Sr covers all elements in P.
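The construction of Sr and P can be sketched in a few lines of Python. This is only an illustration under assumptions: the function names `pair_set`, `all_pairs`, and `distinguishes` are hypothetical, rows are plain lists of allele codes, and pairs use the 1-based pattern indices of the slide.

```python
from itertools import combinations

def pair_set(row):
    """S_r: all 1-based pattern pairs (i, j), i < j, whose alleles differ in this row."""
    return {(i + 1, j + 1)
            for i, j in combinations(range(len(row)), 2)
            if row[i] != row[j]}

def all_pairs(k):
    """P: every pair (i, j) with 1 <= i < j <= k."""
    return set(combinations(range(1, k + 1), 2))

def distinguishes(rows, k):
    """True iff the union of the rows' pair sets covers every element of P."""
    covered = set()
    for row in rows:
        covered |= pair_set(row)
    return covered == all_pairs(k)
```

For instance, `pair_set([1, 1, 1, 2])` yields the three pairs given in the example, and a single such row cannot distinguish four patterns because pairs such as (1,2) stay uncovered.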

Reduction to the set covering problem (2/2) A collection C of the sets Sr covers every element in P at least (m + 1) times iff C can distinguish these K patterns when m missing data occur. [Figure: bipartite graph between the sets S1, S2, S3, S4 and the pairs (1,2), (1,3), (1,4), (2,3), (2,4), (3,4).]
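The (m + 1)-fold covering condition can be checked directly by counting, for every pair of patterns, how many selected rows separate them. The sketch below is illustrative only; `coverage_counts` and `tolerates_missing` are hypothetical names, and rows are assumed to be lists of allele codes of equal length.

```python
from collections import Counter
from itertools import combinations

def coverage_counts(rows):
    """Count, for each 1-based pattern pair, how many rows distinguish it."""
    counts = Counter()
    for row in rows:
        for i, j in combinations(range(len(row)), 2):
            if row[i] != row[j]:
                counts[(i + 1, j + 1)] += 1
    return counts

def tolerates_missing(rows, k, m):
    """True iff every pair of the k patterns is covered at least m + 1 times."""
    counts = coverage_counts(rows)
    return all(counts[pair] >= m + 1
               for pair in combinations(range(1, k + 1), 2))
```

The intuition is that if m of the selected SNPs are lost, any pair covered m + 1 times still has at least one SNP left that tells the two patterns apart.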

Finding the robust tag SNPs Finding the minimum set of robust tag SNPs is NP-complete. Consider the restriction of this problem where m is set to 0: each element in P needs to be covered once by C, which is the same as the set covering problem. We propose two greedy approximation algorithms to solve this problem.
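To make the optimization target concrete, the following exhaustive search returns a smallest subset of rows that covers every pair at least m + 1 times. It is a baseline sketch for tiny inputs only (it tries all subsets, which is exponential in N) and is not one of the algorithms proposed in the slides; the function names and the 0-based row indices are assumptions.

```python
from itertools import combinations

def covers_enough(matrix, rows, m):
    """True iff every pair of columns differs in at least m + 1 of the given rows."""
    k = len(matrix[0])
    return all(
        sum(matrix[r][i] != matrix[r][j] for r in rows) >= m + 1
        for i, j in combinations(range(k), 2)
    )

def minimum_robust_tag_snps(matrix, m):
    """Smallest list of 0-based row indices tolerating m missing data, or None."""
    n = len(matrix)
    for size in range(1, n + 1):          # try small subsets first
        for rows in combinations(range(n), size):
            if covers_enough(matrix, rows, m):
                return list(rows)
    return None                            # no subset covers every pair m + 1 times
```

Such a baseline is mainly useful for comparing the size of a greedy solution against the true optimum on small test matrices.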

The first greedy algorithm Suppose the bipartite graph is implemented by a two-dimensional array. The greedy algorithm tries to cover all elements in one row at each stage. While covering the ith row, the algorithm picks a set S that covers the largest number of uncovered elements in the ith row. [Figure: table over the pairs (1,2), (1,3), (1,4), (2,3), (2,4), (3,4) and the sets S1, S2, S3, S4.]
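A minimal sketch of this row-by-row strategy follows. Several details are assumptions rather than statements from the slides: the table is taken to have m + 1 rows, it is represented by a per-pair coverage counter instead of an explicit array, a picked set that covers a pair already covered in the current row is credited to a later row, ties go to the lowest row index, and `greedy_row_by_row` is a hypothetical name.

```python
from itertools import combinations

def pair_set(row):
    """0-based column pairs (i, j), i < j, whose alleles differ in this row."""
    return {(i, j) for i, j in combinations(range(len(row)), 2) if row[i] != row[j]}

def greedy_row_by_row(matrix, m):
    """Pick SNP rows stage by stage; stage s ends when every pair is covered s times."""
    k = len(matrix[0])
    pairs = list(combinations(range(k), 2))
    sets = [pair_set(row) for row in matrix]
    count = dict.fromkeys(pairs, 0)       # how often each pair is covered so far
    chosen = []
    for stage in range(1, m + 2):         # one stage per table row
        while any(count[p] < stage for p in pairs):
            best, best_gain = None, 0
            for r in range(len(matrix)):
                if r in chosen:
                    continue
                # gain: pairs still uncovered in the current table row
                gain = sum(1 for p in sets[r] if count[p] < stage)
                if gain > best_gain:
                    best, best_gain = r, gain
            if best is None:
                raise ValueError("the SNPs cannot cover every pair m + 1 times")
            chosen.append(best)
            for p in sets[best]:
                count[p] += 1
    return chosen
```

The function returns 0-based row indices in the order they were picked and raises `ValueError` when the input cannot tolerate m missing data at all.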

The approximation ratio of the first greedy algorithm Let |Si’| be the number of elements covered by Si which contribute to the greedy algorithm, e.g., |S2’| = 2 since S2 covers two elements, (1,2) and (3,4), in the first row. Assign the score 1/|Si’| to each grid covered by Si. [Figure: table over the pairs (1,2), (1,3), (1,4), (2,3), (2,4), (3,4) with example scores 1/2, 1/4, 1/2, 1/2.]

The approximation ratio of the first greedy algorithm Let Rik be the number of grids in the ith row remaining uncovered before the kth iteration.

The approximation ratio of the first greedy algorithm Let C be the collection of sets picked by the greedy algorithm and C* the collection of sets picked by the optimal solution. There exists at least one set in C* with size at least Rik / |C*|. [Figure: the sets S1 to S7 divided between C and C*.]

The approximation ratio of the first greedy algorithm Because the algorithm always picks a set that covers the largest number of uncovered elements in the row, the size of the next selected set (Sk’) must be at least Rik / |C*|.

The approximation ratio of the first greedy algorithm
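One way such a ratio can be obtained is the standard harmonic-sum argument for greedy set covering, applied to each table row in turn. The derivation below is a sketch under stated assumptions, not necessarily the slide's own: each grid newly covered by the kth pick receives score 1/|S_k'|, the pick satisfies |S_k'| ≥ R_i^k / |C*| (with R_i^k the quantity written Rik in the text), and the symbols H and |P| = K(K-1)/2 are notation introduced here.

```latex
% Assumptions: score(g) = 1/|S_k'| for each grid g newly covered by the k-th pick,
% and |S_k'| >= R_i^k / |C^*|, so a grid covered while R grids remain has
% score <= |C^*| / R. Ordering the grids of row i by the time they are covered,
% the j-th grid from the end is covered while at least j grids remain.
\begin{align*}
\#\{\text{sets picked for row } i\}
  &= \sum_{g \,\in\, \text{row } i} \mathrm{score}(g)
   \;\le\; \sum_{j=1}^{|P|} \frac{|C^*|}{j}
   \;=\; |C^*|\, H_{|P|}
   \;\le\; |C^*|\,\bigl(1 + \ln |P|\bigr), \\[4pt]
|C| &\;\le\; (m+1)\,|C^*|\,\Bigl(1 + \ln \tfrac{K(K-1)}{2}\Bigr).
\end{align*}
```

Under these assumptions the first greedy algorithm would stay within a factor of roughly (m + 1) ln(K(K-1)/2) of the optimum, since the table has m + 1 rows and each row contributes at most one harmonic sum.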

The second greedy algorithm The greedy algorithm considers all elements in the table at each stage. The algorithm always picks a set S that covers the largest number of uncovered elements in the whole table. [Figure: table over the pairs (1,2), (1,3), (1,4), (2,3), (2,4), (3,4) and the sets S4, S1, S5, S2.]
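The table-wide strategy can be sketched in the same style. As before, this is an illustration under assumptions: the table is modelled by a per-pair coverage counter capped at m + 1, a set gains one uncovered element for every pair it separates that is still below that cap, ties go to the lowest row index, and `greedy_whole_table` is a hypothetical name.

```python
from itertools import combinations

def pair_set(row):
    """0-based column pairs (i, j), i < j, whose alleles differ in this row."""
    return {(i, j) for i, j in combinations(range(len(row)), 2) if row[i] != row[j]}

def greedy_whole_table(matrix, m):
    """Repeatedly pick the SNP row that covers the most still-needed table entries."""
    k = len(matrix[0])
    pairs = list(combinations(range(k), 2))
    sets = [pair_set(row) for row in matrix]
    need = m + 1                           # required coverage per pair
    count = dict.fromkeys(pairs, 0)
    chosen = []
    while any(count[p] < need for p in pairs):
        best, best_gain = None, 0
        for r in range(len(matrix)):
            if r in chosen:
                continue
            # gain: pairs that still lack coverage anywhere in the table
            gain = sum(1 for p in sets[r] if count[p] < need)
            if gain > best_gain:
                best, best_gain = r, gain
        if best is None:
            raise ValueError("the SNPs cannot cover every pair m + 1 times")
        chosen.append(best)
        for p in sets[best]:
            count[p] += 1
    return chosen
```

Compared with the row-by-row variant, the only change is the gain function: it looks at the total remaining demand of each pair rather than the demand of the current row.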

The approximation ratio of the second greedy algorithm The analysis is similar to that of the first greedy algorithm. [Figure: table over the pairs (1,2), (1,3), (1,4), (2,3), (2,4), (3,4) with example scores 1/2, 1/4, 1/2, 1/4.]

Finding auxiliary SNPs Input: An N × K matrix Mh, and a set S of tag SNPs. Output: The minimum set of auxiliary SNPs A such that A ∪ S can identify the testing sample without ambiguity. Finding auxiliary SNPs is NP-complete. Consider the case where all tag SNPs are genotyped as missing data: the auxiliary SNPs must then be another set of tag SNPs that distinguishes the K block patterns. Auxiliary SNPs can be found efficiently when robust tag SNPs have been computed in advance.

Finding auxiliary SNPs If there is only one pattern matched (e.g., T1), the testing sample is identified as that block pattern (e.g., B2) and we are done. Otherwise (e.g., T2), for each pair of the ambiguous patterns, traverse the corresponding column to find a set which can distinguish them. [Figure: table over the pairs (1,2), (1,3), (1,4), (2,3), (2,4), (3,4) listing the sets that cover each pair.]
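The lookup described above might be sketched as follows. The data layout is an assumption: the testing sample is a dict mapping a 0-based SNP row index to its observed allele, with `None` marking missing data, and patterns are 0-based column indices. The function names are hypothetical, and the first-fit choice of a distinguishing SNP is a simplification that does not guarantee a minimum set of auxiliary SNPs.

```python
from itertools import combinations

def matching_patterns(matrix, sample):
    """Columns (block patterns) consistent with every non-missing allele in the sample."""
    k = len(matrix[0])
    return [c for c in range(k)
            if all(allele is None or matrix[r][c] == allele
                   for r, allele in sample.items())]

def auxiliary_snps(matrix, sample):
    """Return (ambiguous patterns, auxiliary SNP rows that separate them)."""
    ambiguous = matching_patterns(matrix, sample)
    aux = []
    for i, j in combinations(ambiguous, 2):
        # skip pairs already separated by an auxiliary SNP picked earlier
        if any(matrix[r][i] != matrix[r][j] for r in aux):
            continue
        for r in range(len(matrix)):
            # only SNPs that were not genotyped in this sample are candidates
            if r not in sample and matrix[r][i] != matrix[r][j]:
                aux.append(r)
                break
    return ambiguous, sorted(aux)
```

When exactly one pattern matches, the returned list of auxiliary SNPs is empty and the sample is identified as that pattern; otherwise genotyping the returned SNPs separates the ambiguous pairs wherever a separating SNP exists.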