Efficient Algorithms for SNP Haplotype Block Selection Problems Yaw-Ling Lin ( 林耀鈴 ) Dept Computer Sci and Info Engineering College of Computing and Informatics.

Efficient Algorithms for SNP Haplotype Block Selection Problems Yaw-Ling Lin ( 林耀鈴 ) Dept Computer Sci and Info Engineering College of Computing and Informatics Providence University, Taiwan E-mail: yllin@pu.edu.tw http://www.cs.pu.edu.tw/~yawlin

Outline
o Introduction
o Motivation
o Terminology Definition
o Diversity Functions
o Haplotype Block Selection
o Dealing with Missing Data
o Experiment
o Conclusion

Introduction Mutation in DNA is the principal factor responsible for the phenotypic differences among human beings. SNPs (single nucleotide polymorphisms) are the most common type of mutation.

Introduction (cont.) Recent studies have shown that chromosome recombination takes place only at some narrow hotspots. Haplotype blocks are the segments between these hotspots, where little or no recombination occurs. [Figure: haplotypes A B and a b, and the four combinations A B, A b, a B, a b]

Motivation The SNPs within a haplotype block are highly correlated because of the low diversity in each block. SNPs, haplotype patterns, and disease genes in the same block are associated (linkage).

Terminology Definition
Haplotype sample (n = missing):
H4: cgccttnnct
H3: tgtntagccc
H2: ngcgntagtt
H1: catgaaacnc
Major/minor allele at each site: c/t g/a t/c g/c t/a a/t a/g c/g c/t c/t
Encoding: major ← 0, minor ← 1, n ← 3
H4: 0011013301
H3: 1003001000
H2: 3010310111
H1: 0100100030
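The encoding on this slide can be sketched in a few lines of Python. This is only an illustration: the function name `encode_haplotypes` and the string `MAJOR` are hypothetical, and the major alleles are read off the slide's own example rather than computed, since the extracted slide does not say how ties between alleles are resolved.

```python
def encode_haplotypes(rows, major_alleles, missing="n"):
    """Encode each haplotype string as digits: 0 = major, 1 = minor, 3 = missing."""
    encoded = []
    for row in rows:
        digits = []
        for allele, major in zip(row, major_alleles):
            if allele == missing:
                digits.append("3")      # missing observation
            elif allele == major:
                digits.append("0")      # major allele at this site
            else:
                digits.append("1")      # minor allele at this site
        encoded.append("".join(digits))
    return encoded


# Rows H4, H3, H2, H1 from the slide; MAJOR is an assumption read off the example.
ROWS = ["cgccttnnct", "tgtntagccc", "ngcgntagtt", "catgaaacnc"]
MAJOR = "cgtgtaaccc"

for name, code in zip(["H4", "H3", "H2", "H1"], encode_haplotypes(ROWS, MAJOR)):
    print(name, code)
```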

Terminology Definition (cont.)

Diversity Functions Each distinct haplotype string s_i in a matrix is associated with a probability p_i. Example: p_i = 2/7, 2/7, 1/7, 1/7, 1/7.

Diversity Functions (cont.) Raising the square to an arbitrary power q. Information Entropy function:
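The formulas on these slides did not survive extraction. As a rough sketch, the code below assumes two commonly used forms, 1 - Σ p_i^q (with q = 2 as the squared case) and the Shannon entropy -Σ p_i log2 p_i. These forms and the function names are assumptions, not necessarily the definitions used in the talk.

```python
import math
from collections import Counter


def haplotype_probabilities(strings):
    """Relative frequency p_i of each distinct haplotype string s_i."""
    counts = Counter(strings)
    total = len(strings)
    return [c / total for c in counts.values()]


def power_diversity(probs, q=2):
    """Assumed form: 1 - sum(p_i ** q); q = 2 is the squared case."""
    return 1.0 - sum(p ** q for p in probs)


def entropy_diversity(probs):
    """Assumed form: Shannon entropy in bits, -sum(p_i * log2(p_i))."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


# Seven haplotypes with frequencies 2/7, 2/7, 1/7, 1/7, 1/7 as on the slide.
sample = ["a", "a", "b", "b", "c", "d", "e"]
probs = haplotype_probabilities(sample)
print(power_diversity(probs), entropy_diversity(probs))
```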

Results

Results (cont.)

Haplotype Block Selection Computing the diversities of all blocks: each block [i, j] takes O(mn) time, and there are n^2 (i, j) pairs in total. Total time complexity: O(mn^3).
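The brute-force baseline counted on this slide can be sketched as follows. The diversity used here, 1 - Σ p^2, is an assumed stand-in, and the names `block_diversity` and `all_block_diversities` are hypothetical. Each block costs O(mn) for slicing and hashing, which gives the O(mn^3) total.

```python
from collections import Counter


def block_diversity(rows, i, j):
    """Assumed diversity 1 - sum(p^2) of block [i, j] (1-indexed, inclusive)."""
    m = len(rows)
    counts = Counter(row[i - 1:j] for row in rows)   # O(mn) slicing and hashing
    return 1.0 - sum((c / m) ** 2 for c in counts.values())


def all_block_diversities(rows):
    """Evaluate every block [i, j] with i <= j independently: O(mn^3) overall."""
    n = len(rows[0])
    return {(i, j): block_diversity(rows, i, j)
            for i in range(1, n + 1)
            for j in range(i, n + 1)}


rows = ["0010", "0011", "0110", "0111"]
table = all_block_diversities(rows)
print(table[(1, 4)], table[(3, 4)])
```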

Haplotype Block Selection (cont.) Suffix tree T_1 (the 1-suffix tree): n leaves. Time complexity: O(n).

Haplotype Block Selection (cont.) Merge the m suffix trees (1-suffix, ..., i-suffix, ..., m-suffix) into the total suffix tree T*, which has mn leaves.

Lowest Common Ancestor

LCA (confluent) subtree

Confluent subtree – Illustration

Constructing confluent subtree

Haplotype Block Selection (cont.) LCA trees [Figure: from T* and the m×n haplotype matrix, n LCA trees (1-LCA tree, ..., i-LCA tree, ..., n-LCA tree) are built, each with m leaves; the 1-LCA tree corresponds to the 1st suffix string of each row]

Haplotype Block Selection (cont.) Event-List [Figure: the 8-LCA tree with leaves h_1(8), h_6(8), h_4(8), h_5(8), h_3(8), h_7(8), h_2(8), and the event-list entries 1[4,3], 2[2,2,2,1], 4[2,2,1,1,1]]

Haplotype Block Selection (cont.) Depth-List and Event-List via BFS search [Figure: the 8-LCA tree; depth-list entries 8[4,3], 8[2,2], 8[2,1], 8[2,2,1,1,1]; event-list entries 1[4,3], 2[2,2,2,1], 4[2,2,1,1,1]]

Haplotype Block Selection (cont.) Farthest sites (good partners) [Figure: sites i-1 and i with their farthest sites L[i-1] and L[i]]

Haplotype Block Selection (cont.)

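Dynamic Programming [Figure: two cases for columns i..j: either the last block B_k is [L[j], j], preceded by B_1, ..., B_{k-1}, or all blocks B_1, ..., B_k lie within columns i..j-1]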

Haplotype Block Selection (cont.) Dynamic Programming [Figure: DP table; f(k, i, j) is computed from f(k, i, j-1) and f(k-1, i, L[j]-1)]

Haplotype Block Selection (cont.) Dynamic Programming

Haplotype Block Selection (cont.) Dynamic Programming [Figure: DP table for the case i = 1]

Dealing with Missing Data Sometimes we may fail to distinguish two different haplotypes because of the ambiguity caused by missing data. Let A_ij ∈ {0, 1, 3}, where A_ij = 3 means that the j-th site of observation i is missing. One way to deal with missing data is to assign each A_ij = 3 to either 0 or 1 such that the resulting diversity is minimized.

Dealing with Missing Data (cont.) The minimum-diversity problem is NP-hard by a reduction from the minimum-clique-partition problem. Two rows i, j of A are different if there exists a column k such that {A_ik, A_jk} = {0, 1}. Two rows are compatible if they are not different. [Figure: reduction example with a graph on vertices 1 to 5, the pairs (1,3), (1,4), (2,4), (3,5), and the corresponding matrix over {0, 1, 3}]
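The two definitions on this slide translate directly into code. The sketch below uses hypothetical function names and represents a row as a list over {0, 1, 3}. It also shows that compatibility is not transitive, which is one way to see why grouping compatible rows resembles a clique-partition problem.

```python
def rows_differ(a, b):
    """True if some column k has {a[k], b[k]} == {0, 1}."""
    return any({x, y} == {0, 1} for x, y in zip(a, b))


def rows_compatible(a, b):
    """Two rows are compatible if they are not different."""
    return not rows_differ(a, b)


# A missing entry (3) is compatible with both 0 and 1,
# so compatibility is not transitive.
r1, r2, r3 = [0, 1], [3, 1], [1, 1]
print(rows_compatible(r1, r2), rows_compatible(r2, r3), rows_compatible(r1, r3))
```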

Dealing with Missing Data (cont.) Our heuristic method: 1. Partition Phase [Figure: the rows are partitioned into T (missing data) and S]

Dealing with Missing Data (cont.) Our heuristic method: 1. Partition Phase [Figure: T (missing data) contains t_1, t_2, t_3; S contains s_1, s_2, s_3, s_4]

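Dealing with Missing Data (cont.) Our heuristic method: 2. Search Phase 3. Assignment Phase (Consolidate) [Figure: each of t_1, t_2, t_3 in T (missing data) is searched against s_1, ..., s_4 in S; on a match the count is incremented (count+1), and on a miss a new entry s_5 is added]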

Experiment: Experiment Method
o Data: Patil et al. (Blocks of limited haplotype diversity revealed by high-resolution scanning of human chromosome 21.)
o Chromosome: 21
o No. of SNPs: 24,047 SNPs from 20 individuals.
o Diversity threshold: 0.85 and 0.9
o No. of blocks: 100, 200, and 300
o Classification: block length < 15, 15 ≤ length ≤ 30, and 30 < length.

Experiment (cont.) Experiment Results [Figures: results for D = 0.85, No. = 100 and for D = 0.9, No. = 100]

Conclusion: Contributions
o We develop a visualization tool that helps us observe the diversity of haplotype strings.
o We propose several efficient algorithms that select interesting haplotype blocks using different diversity functions.
o We show that the minimum-diversity problem is NP-complete and propose a heuristic method for dealing with missing data.

Conclusion (cont.) Future and ongoing work:
o Explore and elaborate other meaningful diversity functions.
o Improve our diversity visualization tool.
o TagSNP selection in the haplotype block.
o Further experiments on related biomedical haplotype data.

Thank You! Any Questions?

Problem Definitions (1) Given a haplotype matrix A, find a segmentation S consisting of k blocks such that the coverage of common haplotypes in each block is more than α% and the total length of S is maximized.

Monotonic Diversity A diversity function δ is said to be monotonic if, for any block (interval) I = [i, j] of A, δ(i', j') ≤ δ(i, j) whenever [i', j'] ⊆ [i, j]; that is, the diversity of any subinterval of I is never larger than the diversity of I. When the haplotype sample contains missing data, the coverage of common haplotypes does not satisfy this monotonic property.

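Longest Blocks Partitioning with Constraint on Diversity: Dynamic programming algorithm [Figure: two cases for columns i..j: either the last block B_k is [L[j], j], preceded by B_1, ..., B_{k-1}, or all blocks B_1, ..., B_k lie within columns i..j-1]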

Longest Blocks Partitioning with Constraint on Diversity (cont.) Preprocessing of farthest sites (good partners)
o Given a haplotype matrix A and a diversity upper limit D, for each column j find the farthest-left marker i = L[j] such that δ(i, j) < D.
o We use suffix trees and LCA techniques to solve the problem in O(mn + n^2) time.
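For a monotonic diversity function, L[j] never decreases as j grows, so a two-pointer scan finds all farthest sites. The sketch below is a simpler substitute for the suffix-tree and LCA method named on the slide: it recomputes each diversity from scratch and therefore runs in O(mn^2), not O(mn + n^2). The diversity 1 - Σ p^2 and all function names are assumptions.

```python
from collections import Counter


def block_diversity(rows, i, j):
    """Assumed monotonic diversity 1 - sum(p^2) of block [i, j] (1-indexed)."""
    m = len(rows)
    counts = Counter(row[i - 1:j] for row in rows)
    return 1.0 - sum((c / m) ** 2 for c in counts.values())


def farthest_sites(rows, limit, diversity=block_diversity):
    """Return L with L[j] = smallest i such that diversity(i, j) < limit.

    L[0] is unused; L[j] == j + 1 means no feasible block ends at column j.
    """
    n = len(rows[0])
    L = [0] * (n + 1)
    i = 1
    for j in range(1, n + 1):
        # Monotonicity guarantees the left end never has to move back.
        while i <= j and diversity(rows, i, j) >= limit:
            i += 1
        L[j] = i
    return L


rows = ["0010", "0011", "0110", "0111"]
print(farthest_sites(rows, 0.6))
```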

Longest Blocks Partitioning with Constraint on Diversity (cont.) Time: O(nk) after the preprocessing of the L[j]'s. Space: O(nk). [Figure: DP table with n columns and k rows; f(k, i, j) is computed from f(k, i, j-1) and f(k-1, i, L[j]-1)]
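The recurrence itself is lost from the slide, but the terms that remain suggest f(k, i, j) = max(f(k, i, j-1), f(k-1, i, L[j]-1) + j - L[j] + 1). The sketch below assumes that form with i fixed at 1 and at most k blocks. The function name is hypothetical. It keeps only two rows of the table because it returns the optimal length alone; recovering the blocks themselves would need the full O(nk) table or the linear-space technique of the following slides.

```python
def longest_k_blocks(L, k):
    """Maximum total length of at most k disjoint feasible blocks in columns 1..n.

    L[j] (1-indexed, L[0] unused) is the farthest feasible left end for column j;
    L[j] == j + 1 means no feasible block ends at j.
    """
    n = len(L) - 1
    prev = [0] * (n + 1)                  # f(0, j) = 0 for every j
    for _ in range(k):
        cur = [0] * (n + 1)
        for j in range(1, n + 1):
            start = L[j]
            length = j - start + 1        # 0 when no feasible block ends at j
            # Either column j is not the end of the last block,
            # or the last block is [L[j], j].
            cur[j] = max(cur[j - 1], prev[start - 1] + length)
        prev = cur
    return prev[n]


L = [0, 1, 1, 3, 3, 3, 6]                 # hypothetical farthest sites for n = 6
print([longest_k_blocks(L, k) for k in (1, 2, 3)])
```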

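Longest Blocks Partitioning with Constraint on Diversity (cont.) Linear space [Figure: for k > 1, the interval [i, j] is split at a cut-point x* into subproblems (D_1, D_2 and E_1, E_2); for k = 1, a single subproblem D_1 remains]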

Longest Blocks Partitioning with Constraint on Diversity (cont.) How to find the cut-point x* [Figure: candidate cut-points x = i, i+1, i+2, ..., j-2, j-1 and the chosen cut-point x*]

Longest Blocks Partitioning with Constraint on Diversity (cont.) Time: O(nk) after the preprocessing of the L[j]'s and R[j]'s. Let T(n, k) denote the time needed to compute f(k, 1, n). Assume that T(n', k') ≤ c_2 n'k' for all n' < n and k' < k. According to the algorithm, we have:

Experimental Results
Algorithm
o Time: O(nk)
o Space: O(n)
Experiment Method
o 24,047 SNPs from 20 individuals (chromosome 21).
o Use the same criteria as Patil et al. (coverage = 80%).
Experimental Results
o Patil et al.'s results (2001): 4,563 tagSNPs and a total of 4,135 blocks.
o Zhang et al.'s results (2002): 3,582 tagSNPs and 2,575 blocks.
o Our results: 4,588 tagSNPs and 1,707 haplotype blocks.
o 673 blocks suffice to cover 80% of the chromosome region.

Problem Definitions (2) Given a haplotype matrix A and a number of tagSNPs t, find a list of feasible blocks such that the coverage of common haplotypes in each block is more than α%, the total number of tagSNPs required for these blocks is less than t, and the total length is maximized.

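Longest Blocks Partitioning with Constraint on Diversity and TagSNPs: Dynamic programming algorithm [Figure: two cases for columns 1..i with t tagSNPs: either the last block B_n is [k, i] and uses tag(k, i) tagSNPs, leaving t - tag(k, i) for blocks B_1, ..., B_{n-1} within columns 1..k-1, or all blocks lie within columns 1..i-1]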

Longest Blocks Partitioning with Constraint on Diversity and TagSNPs (cont.) Preprocessing
o Compute the set of left good partners L_i for each SNP marker i: L_i = {x | [x, i] is a feasible haplotype block}.
o Selecting tagSNPs for all feasible blocks by exhaustive search takes time that depends on the maximum number of tagSNPs required among all feasible blocks and on L, the number of all feasible blocks.

Longest Blocks Partitioning with Constraint on Diversity and TagSNPs (cont.) Time: O(ntl), where l is the average size of L_i (equivalently O(tL)), after preprocessing L_i for each SNP locus i and the tagSNPs required for each feasible block. Space: O(nt). [Figure: DP table with n columns and t rows; f(i, t) is computed from f(i-1, t) and from f(k-1, t - tag(k, i)) for k ∈ L_i, in O(l) time per entry]
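The terms on this slide suggest the recurrence f(i, t) = max(f(i-1, t), max over k ∈ L_i of f(k-1, t - tag(k, i)) + i - k + 1). The sketch below assumes that form and treats t as a budget of at most t tagSNPs, whereas the problem definition says "less than t". The function name and the `feasible` input layout (a dictionary mapping each right end i to {k: tag(k, i)}) are hypothetical.

```python
def longest_with_tag_budget(n, feasible, budget):
    """Maximum total length of disjoint feasible blocks using at most `budget` tagSNPs.

    feasible[i] maps each left good partner k of column i to tag(k, i),
    the number of tagSNPs required by block [k, i] (columns are 1-indexed).
    """
    # f[i][t]: best total length within columns 1..i with at most t tagSNPs.
    f = [[0] * (budget + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for t in range(budget + 1):
            best = f[i - 1][t]                        # column i is not covered
            for k, tags in feasible.get(i, {}).items():
                if tags <= t:                          # block [k, i] is affordable
                    best = max(best, f[k - 1][t - tags] + (i - k + 1))
            f[i][t] = best
    return f[n][budget]


# Hypothetical feasible blocks: [1,2] needs 1 tagSNP, [3,3] needs 0,
# [3,5] needs 2, and [4,5] needs 1.
feasible = {2: {1: 1}, 3: {3: 0}, 5: {3: 2, 4: 1}}
print([longest_with_tag_budget(5, feasible, t) for t in range(4)])
```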

Longest Blocks Partitioning with Constraint on Diversity and TagSNPs (cont.) Difference between our algorithm and Zhang's
o Zhang's algorithm partitions the entire haplotype sample into blocks while minimizing the number of tagSNPs.
o Our algorithm finds the longest segmentation consisting of haplotype blocks for a specified number of tagSNPs t.

Experimental Results
Experiment Method
o Haplotype data: the same as in Patil et al. (Blocks of limited haplotype diversity revealed by high-resolution scanning of human chromosome 21.)
o No. of SNPs: 24,047 SNPs from 20 individuals.
o Use the same criteria as Patil et al. (coverage = 80%).
Experimental Results
o Patil et al.'s results (2001): 4,563 tagSNPs and a total of 4,135 blocks.
o Zhang et al.'s results (2002): 3,582 tagSNPs and 2,575 blocks.
o Our results: 3,260 tagSNPs and 2,266 haplotype blocks.

Experimental Results (cont.)

We can partition 38.55% of the chromosome region into blocks that do not require any tagSNPs. As the length of the chromosome region covered increases, more and more extra tagSNPs are needed. 1,045 tagSNPs suffice to capture 80% of the chromosome region information.

Experimental Results (cont.)

Conclusion and Discussion Compared with Patil et al.'s results, our method identifies longer blocks, and the numbers of blocks and tagSNPs required are reduced by 45.2% and 28.6%, respectively. The results found by our method are also superior to Zhang et al.'s. Our method shows that only a few blocks are sufficient to cover a wide range of the chromosome region, and that only a few tagSNPs are required to capture a large portion of the chromosome region information.

TagSNPs Selection For each block, we want to minimize the number of SNPs (tagSNPs) that uniquely distinguish at least 80% of the unambiguous haplotypes in the block.

TagSNPs Selection (cont.) Strategy:
o Group common haplotypes into k distinct patterns.
o Determine the least number of groups needed.
o Select a set of loci consisting of the minimum number of SNPs on the haplotypes such that each pattern can be uniquely distinguished.
The exhaustive search algorithm enumerates the r-combinations in lexicographic order.
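The last step of the strategy can be sketched as an exhaustive search over r-combinations of sites for r = 0, 1, 2, ..., relying on `itertools.combinations`, which yields combinations in lexicographic order. This is a simplified illustration under stated assumptions: the patterns are already grouped and distinct, there is no missing data, and the 80% coverage criterion is not modelled. The function name is hypothetical.

```python
from itertools import combinations


def select_tag_snps(patterns):
    """Smallest set of site indices (0-based) that distinguishes all patterns.

    Tries r = 0, 1, 2, ... and, for each r, every r-combination of sites in
    lexicographic order; returns the first combination that separates the
    patterns, or None if the patterns are not all distinct.
    """
    n = len(patterns[0])
    for r in range(n + 1):
        for sites in combinations(range(n), r):
            projected = {tuple(p[s] for s in sites) for p in patterns}
            if len(projected) == len(patterns):   # every pattern stays unique
                return sites
    return None


print(select_tag_snps(["000", "011", "101"]))
```

The search is exponential in the block length in the worst case, which is consistent with the slides treating tagSNP selection as a separate preprocessing cost.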

Experimental Results
Experiment Method
o Haplotype data: the same as in Patil et al. (Blocks of limited haplotype diversity revealed by high-resolution scanning of human chromosome 21.)
o No. of SNPs: 24,047 SNPs from 20 individuals.
o Use the same criteria as Patil et al. (coverage = 80%).
Experimental Results
o Patil et al.'s results (2001): 4,563 tagSNPs and a total of 4,135 blocks.
o Zhang et al.'s results (2002): 3,582 tagSNPs and 2,575 blocks.
o Our results: 4,588 tagSNPs and 1,707 haplotype blocks.
o 673 blocks suffice to cover 80% of the genome region.
o 2,159 tagSNPs suffice to capture 80% of the genome region information.

Experimental Results (cont.) A total of 564 blocks contain more than 15 SNPs per block. The average number of SNPs for all of the blocks is 14.09.

Experimental Results (cont.) Only a few blocks are needed to cover a wide range of the genome region. 673 blocks suffice to cover 80% of the genome region.

Experimental Results (cont.) Our method needs only a few tagSNPs to capture most of the genome region information. 2,159 tagSNPs suffice to capture 80% of the genome region information.

Experimental Results (cont.) As the total block coverage of the genome region increases, each tagSNP covers fewer common SNPs on average. The marginal utility of tagSNPs decreases as the genome region covered increases.

Conclusion and Discussion Compared with Patil et al.'s results, our method identifies longer blocks, and the number of blocks is reduced by 58.7%. Our method shows that only a few blocks are sufficient to cover a wide range of the genome region, and that just a few tagSNPs are required to capture most of the genome region information. Implemented on a FreeBSD system (Pentium 3, 1 GHz CPU), our algorithm requires 4 minutes to find 673 blocks in the haplotype data; it takes another 9 minutes to select all tagSNPs (2,159) for these blocks.

Thank you. Q&A

What Weekday is Today? Magic Number: -4/4, 6/6, 8/8, 10/10, 12/12 -7/11, 9/5 [also 11/7, 5/9] -3/0? [implying 2/28, 2/0 = 1/31] Extension: -365 = 52 * 7 + 1 -Leap Year? 2009: 6 ; 2010: 7 ; 2011: 1 ; 2012:3
