Efficient Algorithms for SNP Haplotype Block Selection Problems Yaw-Ling Lin ( 林耀鈴 ) Dept Computer Sci and Info Engineering College of Computing and Informatics.


Efficient Algorithms for SNP Haplotype Block Selection Problems Yaw-Ling Lin ( 林耀鈴 ) Dept Computer Sci and Info Engineering College of Computing and Informatics Providence University, Taiwan

Outline: Introduction; Motivation; Terminology Definition; Diversity Functions; Haplotype Block Selection; Dealing with Missing Data; Experiment; Conclusion

Introduction Mutation in DNA is the principal factor responsible for the phenotypic differences among human beings. SNPs (Single Nucleotide Polymorphisms) are the most common type of mutation.

Introduction (cont.) Recent studies have shown that chromosome recombination takes place only at some narrow hotspots. Haplotype blocks are the segments between these hotspots, where little or no recombination occurs. (Figure: haplotypes A B and a b, and the combinations A B, A b, a B, a b.)

Motivation The SNPs within a haplotype block are highly correlated due to the low diversity in each block. SNPs, haplotype patterns, and disease genes in the same block are associated with one another (linkage).

Terminology Definition (Figure: haplotype samples H1 to H4, shown as the sequences cgccttnnct, tgtntagccc, ngcgntagtt, catgaaacnc, with the major/minor alleles at the SNP sites: c/t, g/a, t/c, g/c, t/a, a/t, a/g, c/g, c/t.) Encoding: major ← 0, minor ← 1, n ← 3.
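The encoding on this slide (major ← 0, minor ← 1, n ← 3) can be sketched in a few lines of Python. This is only an illustration: the function name `encode_haplotypes` is hypothetical, and it assumes that the major allele of a site is simply its most frequent non-`n` base, with ties going to the base seen first. The slides do not state how ties are handled.

```python
from collections import Counter

MISSING = 3  # code the slide assigns to an unknown base 'n'

def encode_haplotypes(rows):
    """Encode equal-length haplotype strings as a 0/1/3 matrix.

    Assumption (not from the slides): the major allele of a site is the
    most frequent non-'n' base; ties go to the base seen first.
    """
    n_sites = len(rows[0])
    matrix = [[MISSING] * n_sites for _ in rows]
    for j in range(n_sites):
        counts = Counter(r[j] for r in rows if r[j] != 'n')
        if not counts:
            continue  # every sample is missing at this site
        major = counts.most_common(1)[0][0]
        for i, r in enumerate(rows):
            if r[j] != 'n':
                matrix[i][j] = 0 if r[j] == major else 1
    return matrix
```

For example, `encode_haplotypes(["ca", "cn", "tg", "ca"])` treats `c` and `a` as the major alleles of the two sites and marks the `n` in the second row with 3.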

Terminology Definition (cont.)

Diversity Functions Each different haplotype string s_i in a matrix is associated with a probability p_i. Example: p_i = 2/7, 2/7, 1/7, 1/7, 1/7.

Diversity Functions (cont.) Raising the square to an arbitrary power q. Information Entropy function:
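The formulas themselves did not survive in this transcript. As a hedged sketch, the code below assumes the common forms these descriptions suggest: a power diversity 1 − Σ p_i^q (q = 2 gives the "square" version) and the Shannon entropy −Σ p_i log2 p_i. Both forms and all function names are assumptions, not taken from the slides.

```python
import math
from collections import Counter
from fractions import Fraction

def pattern_probabilities(rows):
    """Relative frequency p_i of each distinct haplotype string."""
    counts = Counter(tuple(r) for r in rows)
    total = len(rows)
    return [Fraction(c, total) for c in counts.values()]

def power_diversity(probs, q=2):
    # Assumed form: 1 - sum_i p_i^q (q = 2 is the squared version).
    return 1 - sum(p ** q for p in probs)

def entropy_diversity(probs):
    # Assumed form: Shannon entropy in bits, -sum_i p_i * log2(p_i).
    return -sum(float(p) * math.log2(float(p)) for p in probs if p > 0)
```

With the probabilities 2/7, 2/7, 1/7, 1/7, 1/7 from the previous slide, the assumed squared form gives 1 − 11/49 = 38/49.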

Results

Results (cont.)

Haplotype Block Selection Computing diversities of all blocks: one block [i, j] takes O(mn) time. Total: n^2 (i, j) pairs. Total time complexity: O(mn^3).
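The brute-force baseline counted on this slide can be written directly. The sketch below is illustrative only: it uses 0-based inclusive column indices and assumes a squared diversity 1 − Σ p_i^2, which is not confirmed by the transcript. Each call to the hypothetical `block_diversity` scans m rows of up to n columns, and there are on the order of n^2 blocks, which gives the O(mn^3) total.

```python
from collections import Counter

def block_diversity(matrix, i, j):
    """Diversity of columns i..j (0-based, inclusive).

    Assumed form: 1 - sum of squared pattern frequencies.
    """
    m = len(matrix)
    counts = Counter(tuple(row[i:j + 1]) for row in matrix)
    return 1 - sum((c / m) ** 2 for c in counts.values())

def all_block_diversities(matrix):
    """Naive O(m n^3) table of diversities for every block [i, j]."""
    n = len(matrix[0])
    return {(i, j): block_diversity(matrix, i, j)
            for i in range(n) for j in range(i, n)}
```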

Haplotype Block Selection (cont.) Suffix tree T_1 (figure label: 1-suffix), with n leaves. Time complexity: O(n).

Haplotype Block Selection (cont.) Merge the m suffix trees into the total suffix tree T*, which has mn leaves. (Figure: the 1-suffix, ..., i-suffix, ..., m-suffix trees merged into T*.)

Lowest Common Ancestor

LCA (confluent) subtree

Confluent subtree – Illustration

Constructing confluent subtree

Haplotype Block Selection (cont.) LCA Tree: from T*, build the 1-LCA tree (1st suffix string for each row), ..., the i-LCA tree, ..., the n-LCA tree. An m×n haplotype matrix yields n LCA trees (each with m leaves).

Haplotype Block Selection (cont.) Event-List (Figure: the 8-LCA tree with leaf groups {h_1(8), h_6(8)}, {h_4(8), h_5(8)}, {h_3(8)}, {h_7(8)}, {h_2(8)}; event list for site 8: 1 [4,3], 2 [2,2,2,1], 4 [2,2,1,1,1].)

Haplotype Block Selection (cont.) (Figure: a BFS search over the 8-LCA tree, with leaf groups {h_1(8), h_6(8)}, {h_4(8), h_5(8)}, {h_3(8)}, {h_7(8)}, {h_2(8)}, produces the depth list [4,3], [2,2], [2,1], [2,2,1,1,1] and the event list 1 [4,3], 2 [2,2,2,1], 4 [2,2,1,1,1].)

Haplotype Block Selection (cont.) Farthest-sites (good partner) (Figure: sites i-1 and i with their farthest sites L[i-1] and L[i].)

Haplotype Block Selection (cont.)

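Dynamic Programming (Figure: two cases for the interval [i, j]. Either the blocks B_1, ..., B_k all lie within [i, j-1], or the last block B_k is [L[j], j] and B_1, ..., B_{k-1} lie within [i, L[j]-1].)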

Haplotype Block Selection (cont.) Dynamic Programming (Figure: table entry f(k, i, j) in row k is computed from f(k, i, j-1) in the same row and f(k-1, i, L[j]-1) in row k-1.)

Haplotype Block Selection (cont.) Dynamic Programming

Haplotype Block Selection (cont.) Dynamic Programming (Figure: the table for i = 1.)

Dealing with Missing Data Sometimes we may fail to distinguish two different haplotypes due to the ambiguity caused by missing data. Let A_ij ∈ {0, 1, 3}, where A_ij = 3 means that the j-th site of observation i is missing. One way to deal with missing data is to assign each A_ij = 3 to either 0 or 1 such that the resulting diversity is minimized.

Dealing with Missing Data (cont.) The minimum-diversity problem is NP-hard by a reduction from the minimum-clique-partition problem. Two rows i, j of A are different if there exists a column k such that {A_ik, A_jk} = {0, 1}. Two rows are compatible if they are not different. (Figure labels: row pairs (3,5), (2,4), (1,4), (1,3).)
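The definitions of "different" and "compatible" rows translate directly into code. The helper names below are illustrative; the value 3 for missing data follows the encoding used in these slides.

```python
MISSING = 3  # encoding used in these slides for missing data

def rows_differ(a, b):
    """True if some column holds a 0 in one row and a 1 in the other."""
    return any({x, y} == {0, 1} for x, y in zip(a, b))

def rows_compatible(a, b):
    """Rows are compatible when no column separates them."""
    return not rows_differ(a, b)
```

Compatibility is not transitive: `[0, 3]` is compatible with both `[0, 0]` and `[0, 1]`, yet those two rows differ. This is why grouping rows into as few mutually compatible sets as possible behaves like a clique-partition problem rather than a simple equivalence-class computation.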

Dealing with Missing Data (cont.) Our heuristic method: 1. Partition Phase (Figure: the rows are partitioned into T (rows with missing data) and S.)

Dealing with Missing Data (cont.) Our heuristic method: 1. Partition Phase (Figure: T (missing data) contains t_1, t_2, t_3; S contains s_1, s_2, s_3, s_4.)

Dealing with Missing Data (cont.) Our heuristic method: 2. Search Phase (Figure: T (missing data) with t_1, t_2, t_3; S with s_1, s_2, s_3, s_4.) 3. Assignment Phase (Consolidate) (Figure labels: count+1; Miss; s_5.)

Experiment
Experiment Method
o Data: Patil et al. (Blocks of limited haplotype diversity revealed by high-resolution scanning of human chromosome 21.)
o Chromosome: 21
o No. of SNPs: 24,047 SNPs from 20 individuals.
o Diversity threshold: 0.85 and 0.9
o No. of blocks: 100, 200, and 300
o Classification: block length < 15, 15 ≤ length ≤ 30, and 30 < length.

Experiment (cont.) Experiment Results (figure panels: D = 0.85, No. = 100; D = 0.9, No. = 100)

Experiment (cont.) Experiment Results (figure panels: D = 0.85, No. = 200; D = 0.9, No. = 200)

Experiment (cont.) Experiment Results (figure panels: D = 0.85, No. = 300; D = 0.9, No. = 300)

Conclusion
Contributions
o We develop a visualization tool that helps us observe the diversity of haplotype strings.
o We propose several efficient algorithms to select interesting haplotype blocks by using different diversity functions.
o We show that the minimum-diversity problem is NP-complete and propose a heuristic method for dealing with missing data.

Conclusion (cont.)
Future and continuing work:
o Explore and elaborate other meaningful diversity functions.
o Improve our diversity visualization tool.
o TagSNP selection in the haplotype block.
o Further experiments on related biomedical haplotype data.

Thank You! Any Questions?

Problem Definitions (1) Given a haplotype matrix A, find a segmentation S consisting of k blocks such that the coverage of common haplotypes in each block is more than α% and the total length of S is maximized.

Monotonic Diversity A diversity function δ is said to be monotonic if, for any block (interval) I = [i, j] of A, it follows that δ(i', j') ≤ δ(i, j) whenever [i', j'] ⊆ [i, j]; that is, the diversity of any subinterval of I is always no larger than the diversity of I. The coverage of common haplotypes does not satisfy the property of monotonic diversity in a haplotype sample with missing data. (Figure: interval [i, j] containing the subinterval [i', j'].)

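Longest Blocks Partitioning with Constraint on Diversity Dynamic programming algorithm (Figure: two cases for the interval [i, j]. Either the blocks B_1, ..., B_k all lie within [i, j-1], or the last block B_k is [L[j], j] and B_1, ..., B_{k-1} lie within [i, L[j]-1].)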

Longest Blocks Partitioning with Constraint on Diversity (cont.)
Preprocessing of farthest-sites (good partner)
o Given a haplotype matrix A and a diversity upper limit D; for each column j, find the farthest left marker i = L[j] so that δ(i, j) < D.
o We use the techniques of suffix tree and LCA to solve the problem in O(mn + n^2) time.
(Figure: column j and its farthest site L[j].)
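To make the definition of L[j] concrete, here is a simple two-pointer sketch. It is not the suffix-tree/LCA method of the slides: it recomputes each diversity from scratch and therefore runs in roughly O(mn^2) time. It assumes a monotonic diversity (so L[j] never moves left as j grows) and, for illustration only, the squared form 1 − Σ p_i^2. Indices are 0-based, and L[j] = j + 1 marks a column with no feasible block ending at it; these conventions and the function names are hypothetical.

```python
from collections import Counter

def diversity(matrix, i, j):
    """Assumed squared diversity of columns i..j (0-based, inclusive)."""
    m = len(matrix)
    counts = Counter(tuple(row[i:j + 1]) for row in matrix)
    return 1 - sum((c / m) ** 2 for c in counts.values())

def farthest_sites(matrix, limit):
    """L[j] = leftmost i with diversity(i, j) < limit, or j + 1 if none.

    Relies on monotonicity: L[j] is non-decreasing in j, so the left
    pointer i only ever moves to the right.
    """
    n = len(matrix[0])
    L = [0] * n
    i = 0
    for j in range(n):
        while i <= j and diversity(matrix, i, j) >= limit:
            i += 1
        L[j] = i
    return L
```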

Longest Blocks Partitioning with Constraint on Diversity (cont.) Time: O(nk) after the preprocessing of the L[j]'s. Space: O(nk). (Figure: a k × n table in which entry f(k, i, j) is computed from f(k, i, j-1) and f(k-1, i, L[j]-1).)
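A minimal sketch of this table, for i fixed at the first site, follows. The transcript shows only which entries f(k, i, j) depends on; the full recurrence used here, f(k, j) = max(f(k, j-1), f(k-1, L[j]-1) + (j - L[j] + 1)), is an assumption that matches those dependencies. The code uses prefix lengths so that `f[b][j]` covers the first j sites, takes L as a 0-based list where L[s] > s means no feasible block ends at site s, and allows at most k blocks. The function name is hypothetical.

```python
def longest_k_blocks(L, k):
    """Maximum total length of at most k disjoint feasible blocks.

    L[s] is the leftmost start (0-based) of a feasible block ending at
    site s; L[s] > s means no feasible block ends there.
    Runs in O(nk) time and O(nk) space, as stated on the slide.
    """
    n = len(L)
    f = [[0] * (n + 1) for _ in range(k + 1)]  # f[b][j]: first j sites
    for b in range(1, k + 1):
        for j in range(1, n + 1):
            s = j - 1                  # 0-based index of the last site
            best = f[b][j - 1]         # case 1: site s is left uncovered
            left = L[s]
            if left <= s:              # case 2: last block is [left, s]
                best = max(best, f[b - 1][left] + (s - left + 1))
            f[b][j] = best
    return f[k][n]
```

Taking the longest feasible block ending at each site is sufficient under a monotonic diversity, because any sub-block of a feasible block is itself feasible.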

Longest Blocks Partitioning with Constraint on Diversity (cont.) Linear space (Figure: arrays D_1, D_2, E_1, E_2 over the interval [i, j] and the cut-point x*, for the cases k = 1 and k > 1.)

Longest Blocks Partitioning with Constraint on Diversity (cont.) How to find the cut-point x* (Figure: candidate cut-points x = i, i+1, i+2, ..., j-2, j-1, with x* marked.)

Longest Blocks Partitioning with Constraint on Diversity (cont.) Time: O(nk) after the preprocessing of the L[j]'s and R[j]'s. Let T(n, k) denote the time needed for f(k, 1, n). Assume that T(n', k') ≤ c_2 n' k' for all n' < n, k' < k. According to the algorithm, we have:

Experimental Results
Algorithm
o Time: O(nk)
o Space: O(n)
Experiment Method
o 24,047 SNPs from 20 individuals (chromosome 21).
o Use the same criteria as in Patil et al. (coverage = 80%).
Experimental Results
o Patil et al.'s results (2001): 4,563 tagSNPs and a total of 4,135 blocks.
o Zhang et al.'s results (2002): 3,582 tagSNPs and 2,575 blocks.
o Our results: 4,588 tagSNPs and 1,707 haplotype blocks.
o 673 blocks suffice to cover 80% of the chromosome region.

Problem Definitions (2) Given a haplotype matrix A and a specific number t of tagSNPs, we wish to find a list of feasible blocks such that the coverage of common haplotypes in each block is more than α%, the total number of tagSNPs required for these blocks is less than t, and the total length is maximized.

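Longest Blocks Partitioning with Constraint on Diversity and TagSNPs Dynamic programming algorithm (Figure: two cases for the sites 1..i with tagSNP budget t. Either the blocks all lie within [1, i-1], or the last block is [k, i], which uses tag(k, i) tagSNPs and leaves a budget of t - tag(k, i) for the sites 1..k-1.)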

Longest Blocks Partitioning with Constraint on Diversity and TagSNPs (cont.)
Preprocessing
o Compute the set of left good partners L_i for each SNP marker i: L_i = {x | [x, i] is a feasible haplotype block}.
o Exhaustive search for tagSNP selection over all feasible blocks takes time that depends on the maximum number of tagSNPs required among all feasible blocks and on L, the number of all feasible blocks.
(Figure: marker i with candidate left partners i-1, i-2, i-3, ...)

Longest Blocks Partitioning with Constraint on Diversity and TagSNPs (cont.) Time: O(ntl), where l is the average size of L_i (or O(tL)), after the preprocessing of L_i for each SNP locus i and of the tagSNPs required for each feasible block. Space: O(nt). (Figure: a t × n table in which entry f(i, t) is computed from f(i-1, t) and from f(k-1, t - tag(k, i)) for k ∈ L_i, in O(l) time per entry.)
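The table described here can be sketched as a knapsack-style dynamic program. The recurrence used below, f(i, t) = max(f(i-1, t), max over k ∈ L_i of f(k-1, t - tag(k, i)) + (i - k + 1)), is an assumption consistent with the entries the slide names. Two further assumptions are made for illustration: the budget is treated as "at most t tagSNPs" (the problem statement says "less than t"), and the feasible blocks with their tagSNP counts are supplied as a hypothetical dictionary `tag` keyed by 1-based inclusive (k, i) pairs.

```python
def longest_blocks_with_tag_budget(n, budget, tag):
    """Maximum total length of disjoint feasible blocks within a budget.

    tag maps each feasible block (k, i), 1-based inclusive, to the
    number of tagSNPs it needs. Assumes the total may be at most budget.
    """
    left_partners = {}                       # i -> list of k in L_i
    for (k, i) in tag:
        left_partners.setdefault(i, []).append(k)

    f = [[0] * (budget + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for t in range(budget + 1):
            best = f[i - 1][t]               # site i is left uncovered
            for k in left_partners.get(i, []):
                cost = tag[(k, i)]
                if cost <= t:                # last block is [k, i]
                    best = max(best, f[k - 1][t - cost] + (i - k + 1))
            f[i][t] = best
    return f[n][budget]
```

Each table entry scans the left partners of one site, which gives the O(ntl) time and O(nt) space quoted on the slide.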

Longest Blocks Partitioning with Constraint on Diversity and TagSNPs (cont.)
Difference between our algorithm and Zhang's
o Zhang's algorithm partitions the entire haplotype sample into blocks while minimizing the number of tagSNPs.
o Our algorithm finds the longest segmentation consisting of some haplotype blocks for a specific tagSNP number t.

Experimental Results
Experiment Method
o Haplotype data the same as in Patil et al. (Blocks of limited haplotype diversity revealed by high-resolution scanning of human chromosome 21.)
o No. of SNPs: 24,047 SNPs from 20 individuals.
o Use the same criteria as in Patil et al. (coverage = 80%).
Experimental Results
o Patil et al.'s results (2001): 4,563 tagSNPs and a total of 4,135 blocks.
o Zhang et al.'s results (2002): 3,582 tagSNPs and 2,575 blocks.
o Our results: 3,260 tagSNPs and 2,266 haplotype blocks.

Experimental Results (cont.)

We can partition % of the chromosome region into blocks which do not require any tagSNPs. As the length of the chromosome region covered increases, more and more extra tagSNPs are needed. 1,045 tagSNPs suffice to capture 80% of the chromosome region information.

Experimental Results (cont.)

Conclusion and Discussion Compared with Patil et al.'s results, our method identifies longer blocks, and the numbers of blocks and tagSNPs required are reduced by 45.2% and 28.6%, respectively. The results found by our method are also superior to Zhang et al.'s. Our method shows that only a few blocks are sufficient to cover a wide range of the chromosome region, and that only a few tagSNPs are required to capture a large portion of the chromosome region information.

TagSNPs Selection For each block, we want to minimize the number of SNPs (tagSNPs) that uniquely distinguish at least 80% of the unambiguous haplotypes in the block.

TagSNPs Selection (cont.)
Strategy:
o Group common haplotypes into k distinct patterns.
o Determine the least number of groups needed.
o Select a set of loci consisting of the minimum number of SNPs on the haplotypes such that each pattern can be uniquely distinguished.
The exhaustive search algorithm enumerates the next r-combination in lexicographic order.
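The last step of this strategy can be illustrated with Python's `itertools.combinations`, which emits r-combinations in lexicographic order of the site indices. The sketch assumes the common haplotypes have already been grouped into distinct patterns over 0/1 values with no missing entries; the function name is hypothetical, and this is not claimed to be the authors' implementation.

```python
from itertools import combinations

def select_tag_snps(patterns):
    """Smallest set of site indices that tells all patterns apart.

    patterns: distinct equal-length sequences (no missing data assumed).
    Tries r = 0, 1, 2, ... and, for each r, every r-combination of
    sites in lexicographic order; returns the first set that works.
    """
    n_sites = len(patterns[0])
    for r in range(n_sites + 1):
        for sites in combinations(range(n_sites), r):
            projected = {tuple(p[s] for s in sites) for p in patterns}
            if len(projected) == len(patterns):
                return sites
    return None  # only reached if the input patterns are not distinct
```

A block with a single common pattern returns the empty tuple, meaning no tagSNP is needed for it.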

Experimental Results
Experiment Method
o Haplotype data the same as in Patil et al. (Blocks of limited haplotype diversity revealed by high-resolution scanning of human chromosome 21.)
o No. of SNPs: 24,047 SNPs from 20 individuals.
o Use the same criteria as in Patil et al. (coverage = 80%).
Experimental Results
o Patil et al.'s results (2001): 4,563 tagSNPs and a total of 4,135 blocks.
o Zhang et al.'s results (2002): 3,582 tagSNPs and 2,575 blocks.
o Our results: 4,588 tagSNPs and 1,707 haplotype blocks.
o 673 blocks suffice to cover 80% of the genome region.
o 2,159 tagSNPs suffice to capture 80% of the genome region information.

Experimental Results (cont.) A total of 564 blocks contain more than 15 SNPs per block. The average number of SNPs for all of the blocks is

Experimental Results (cont.) Only a few blocks are needed to cover a wide range of the genome region. 673 blocks suffice to cover 80% of the genome region.

Experimental Results (cont.) Our method needs only a few tagSNPs to capture most of the genome region information. 2,159 tagSNPs suffice to capture 80% of the genome region information.

Experimental Results (cont.) As the total block coverage of the genome region increases, fewer common SNPs are covered by each tagSNP on average. The marginal utility of tagSNPs decreases as the genome region covered increases.

Conclusion and Discussion Compared with Patil et al.'s results, our method identifies longer blocks, and the number of blocks is reduced by 58.7%. Our method shows that only a few blocks are sufficient to cover a wide range of the genome region, and that just a few tagSNPs are required to capture most of the genome region information. Implemented on a FreeBSD system (Pentium 3, 1 GHz CPU), our algorithm requires 4 minutes to find 673 blocks in the haplotype data; it takes another 9 minutes to select all 2,159 tagSNPs for these blocks.

Thank you. Q&A

What Weekday is Today? Magic Number: -4/4, 6/6, 8/8, 10/10, 12/12 -7/11, 9/5 [also 11/7, 5/9] -3/0? [implying 2/28, 2/0 = 1/31] Extension: -365 = 52 * Leap Year? 2009: 6 ; 2010: 7 ; 2011: 1 ; 2012:3