
Some Algorithmic Problems Concerning the Inference and Analysis of TagSNPs, Haplotypes and Pedigrees. Ph.D. candidate: Lan Liu. Advisor: Tao Jiang.


2 Some Algorithmic Problems Concerning the Inference and Analysis of TagSNPs, Haplotypes and Pedigrees. Ph.D. candidate: Lan Liu. Advisor: Tao Jiang.

3 Outline. The haplotype inference problem. The tagSNP selection problem. The minimum common integer partition problem.

4 Outline. The haplotype inference problem: Biological background (current); Approximation and complexity of MRHC; Efficient algorithms for ZRHC; A linear-time algorithm for loop-free ZRHC. The tagSNP selection problem. The minimum common integer partition problem.

5 Introduction. Basic concepts. Example: Mendelian experiment. Mendelian law: one haplotype comes from the mother and the other comes from the father. [Figure: paternal and maternal haplotypes.]

6 Notations and Recombinant. [Figure: an example genotype and its haplotype configuration, and two father-mother-child trios whose haplotypes show 0 recombinants and 1 recombinant respectively.]

7 Pedigree An example: British Royal Family A mating loop: a cycle inside the pedigree.

8 Haplotype Reconstruction - Haplotype: useful, expensive - Genotype: cheaper to obtain Reconstruct haplotypes from genotypes

9 Problem Definitions. MRHC: given a pedigree and the genotype information for each member, find a haplotype configuration for each member that obeys the Mendelian law and minimizes the number of recombinants. ZRHC: MRHC with zero recombinants. Loop-free ZRHC: zero recombinants on a pedigree with no mating loops.

10 Outline. The haplotype inference problem: Biological background; Approximation and complexity of MRHC (current); Efficient algorithms for ZRHC; A linear-time algorithm for loop-free ZRHC. The tagSNP selection problem. The minimum common integer partition problem.

11 Approximation and Complexity of MRHC  The known hardness results for MRHC 2-locus-MRHC: 2 loci Tree-MRHC: pedigree having no mating loops

12 Our Hardness and Approximation Results  Tree-MRHC: no mating loop  Binary-tree-MRHC: 1 mate, 1 child  Binary-tree-MRHC*: 1 mate, 1 child, missing data  2-locus-MRHC: 2 loci  2-locus-MRHC*: 2 loci with missing data

13 Outline. The haplotype inference problem: Biological background; Approximation and complexity of MRHC; Efficient algorithms for ZRHC (current); A linear-time algorithm for loop-free ZRHC. The tagSNP selection problem. The minimum common integer partition problem.

14 The ZRHC problem Problem definition Given a pedigree and the genotype information for each member, find a recombination-free haplotype configuration for each member that obeys the Mendelian law of inheritance.

15 Previous work. Li and Jiang introduced a system of linear equations over F[2] and presented an O(m^3 n^3) time algorithm for ZRHC [LJ03], where m is the number of loci and n is the number of members in the pedigree. Recently, Chan et al. proposed a linear-time algorithm [CCC+06], which only works for pedigrees without mating loops. Methods based on fast matrix multiplication could achieve an asymptotic running time of O(k^2.376) on k equations with k unknowns. The Lanczos and conjugate gradient algorithms are only heuristics [GV96]. The Wiedemann algorithm has expected quadratic running time [W86].

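16 Our Result. We present a much faster algorithm for ZRHC [the running-time expression is not preserved in this transcript]. [Diagram: the system Ax = b is transformed and redundant equations are eliminated, leaving O(n log^2 n log log n) equations on O(n) unknowns.]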

17 The New Linear System. m: the number of loci. n: the number of members in the pedigree. Unknowns: p_j, the paternal haplotype vector of a member j; h_{j1,j}, the scalar encoding the inheritance information between a parent j1 and a child j.

18 The New Linear System. [Figure: a father j1, a mother j2 and a child j at four loci. The two haplotypes of each member are written as p_{j1} and p_{j1} + w_{j1}, p_{j2} and p_{j2} + w_{j2}, and p_j and p_j + w_j; the parent-child edges carry the variables h_{j1,j} and h_{j2,j}. Homozygous loci fix entries such as p_{j1,2} = 1 and p_{j1,3} = 0.]

19 The Linear System. O(mn) equations on O(mn) unknowns. Given a homozygous locus i on a member j (with a child j_1), p_j[i] and p_{j_1}[i] are pre-determined.

20 Pedigree Graph. [Figure: a pedigree on members 1 to 9 with genotypes, and the corresponding pedigree graph G.] #edges ≤ 2n.

21 Locus Graph. The locus graph G_i = (V, E_i), where E_i = {(k, j) | k is a parent of j, w_k[i] = 1}. p-variables: variables on vertices. h-variables: variables on edges, shared by all locus graphs. [Figure: (a) genotype information and (b) the locus graph for the 3rd locus of the example pedigree, with edge variables h_{1,4}, h_{4,9}, h_{8,9} and h_{6,8}.]

22 An Observation. For any cycle, or any path connecting two pre-determined vertices, in a locus graph, the sum of the h-variables along the path is a constant. We can use paths to denote constraints! (Proof sketch) Assume the path j_0, j_1, ..., j_k in locus graph G_i connects two pre-determined vertices j_0 and j_k. Each edge on the path gives one equation, in which the d-term is a known constant:
p_{j_0}[i] + d_{j_0,j_1} + h_{j_0,j_1} = p_{j_1}[i]
p_{j_1}[i] + d_{j_1,j_2} + h_{j_1,j_2} = p_{j_2}[i]
p_{j_2}[i] + d_{j_2,j_3} + h_{j_2,j_3} = p_{j_3}[i]
...
p_{j_{k-1}}[i] + d_{j_{k-1},j_k} + h_{j_{k-1},j_k} = p_{j_k}[i]
Adding these equations cancels the intermediate p-terms, so the h-variables along the path sum to a constant.
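The proof sketch can be completed by adding the edge equations along the path. The following is a worked version under the assumption that all arithmetic is over F[2] (so every term is its own negative) and that the d-terms are known constants; the summation index t is introduced here only for illustration.

```latex
\begin{align*}
p_{j_t}[i] + d_{j_t,j_{t+1}} + h_{j_t,j_{t+1}} &= p_{j_{t+1}}[i],
  \qquad t = 0, \dots, k-1 \\
\sum_{t=0}^{k-1} h_{j_t,j_{t+1}} &= p_{j_0}[i] + p_{j_k}[i]
  + \sum_{t=0}^{k-1} d_{j_t,j_{t+1}} \pmod 2
\end{align*}
```

Each intermediate p-term appears once on a left-hand side and once on a right-hand side, so it cancels. Since j_0 and j_k are pre-determined, the right-hand side of the second line is a constant. For a cycle, j_0 = j_k and the two remaining p-terms cancel as well.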

23 Examples of Linear Constraints. [Figure: three locus graphs of the example pedigree on members 1 to 9.]
(a) 1st locus graph: h_{6,8} + h_{8,9} = 1
(b) 2nd locus graph: h_{3,5} + h_{3,6} + h_{2,5} + h_{2,6} = 0
(c) 3rd locus graph: h_{4,9} + h_{2,4} + h_{2,5} + h_{3,5} + h_{3,6} + h_{6,8} = 0

24 Linear Constraints. Obviously, the linear constraints are necessary. We can also show that these constraints are sufficient. Moreover, we can upper bound the number of constraints in each locus graph by O(n), while the trivial analysis gives an upper bound of O(n^2). Total number of constraints = O(mn).

25 The ZRHC-PHASE algorithm
Algorithm ZRHC_PHASE
input: a pedigree G = (V, E) and genotypes {g_j}
output: a general solution of {p_j}
begin
  Step 1. Preprocessing
  Step 2. Linear constraint generation on h-variables
  Step 3. Solve h-variables by Gaussian elimination
  Step 4. Solve the p-variables by propagation from pre-determined p-variables to others
end
Our method: solve h-variables and p-variables separately; O(mn) linear equations on O(n) h-variables.
Traditional method: solve h-variables and p-variables together; O(mn) equations on O(mn) unknowns (O(mn) p-variables and O(n) h-variables).
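Step 3 relies on Gaussian elimination over F[2]. The following is a minimal sketch of that generic technique, not the authors' implementation: the function name `solve_gf2` and the input format (a list of index lists with a 0/1 right-hand side) are assumptions made for illustration, and free variables are simply set to 0 rather than returned as a general solution.

```python
def solve_gf2(equations, num_vars):
    """Solve a linear system over GF(2) by Gauss-Jordan elimination.

    equations: list of (indices, b), meaning XOR of x[i] for i in indices == b.
    Returns one solution as a list of 0/1 values (free variables set to 0),
    or None if the system is inconsistent.
    """
    # Encode each equation as an integer bitmask; bit num_vars stores b.
    rows = []
    for indices, b in equations:
        mask = 0
        for i in indices:
            mask ^= 1 << i  # XOR, so a variable listed twice cancels
        rows.append(mask | (b << num_vars))

    pivot_cols = []
    r = 0  # number of pivot rows found so far
    for col in range(num_vars):
        pivot = next((k for k in range(r, len(rows)) if (rows[k] >> col) & 1), None)
        if pivot is None:
            continue  # no pivot: this column is a free variable
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for k in range(len(rows)):
            if k != r and (rows[k] >> col) & 1:
                rows[k] ^= rows[r]  # eliminate this column from every other row
        pivot_cols.append(col)
        r += 1

    # Rows below the pivots now read "0 = b"; b = 1 means no solution.
    if any((rows[k] >> num_vars) & 1 for k in range(r, len(rows))):
        return None

    x = [0] * num_vars
    for k, col in enumerate(pivot_cols):
        x[col] = (rows[k] >> num_vars) & 1
    return x
```

With variables numbered 0, 1, 2, the call `solve_gf2([([0, 1], 1), ([1, 2], 0), ([0, 2], 1)], 3)` returns an assignment that satisfies all three parity equations.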

26 Our Method. [Diagram: the system Ax = b is transformed and redundant equations are eliminated, leaving O(n log^2 n log log n) equations on O(n) unknowns.]

27 Redundant Equation Elimination. An observation: given a cycle j_0, j_1, ..., j_k, assume that there are constraints among each pair of vertices. Originally, there are O(k^2) constraints. Notice that they are not independent. We can replace the original constraints by an equivalent set of constraints of size O(k). Remove the redundant equations without solving them! (Key lemma.) [Figure: a cycle on vertices j_0, j_1, j_2, ..., j_{k-2}, j_{k-1}, j_k, with constraints such as j_0 ~ j_2, j_2 ~ j_{k-1} and j_0 ~ j_{k-1}.]

28 Redundant Equation Elimination. Given a spanning tree, the stretch of an edge (k, j) is defined as the length of the unique path between k and j on the tree. Elkin, Emek, Spielman and Teng show that any graph has a low-stretch spanning tree with average stretch O(log^2 n log log n). The number of irredundant constraints can be bounded by the sum of cycle lengths, which is further bounded by the sum of stretches, O(n log^2 n log log n).
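The definition of stretch can be made concrete with a small computation. This sketch only measures the stretch of each edge with respect to a spanning tree that is already given; it does not construct a low-stretch tree. The function names and the edge-list input format are assumptions for illustration.

```python
from collections import deque


def tree_path_length(tree_adj, src, dst):
    """Number of edges on the unique src-dst path in a tree (BFS)."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            return dist[u]
        for v in tree_adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    raise ValueError("dst is not reachable from src in the tree")


def average_stretch(graph_edges, tree_edges):
    """Average, over all graph edges (k, j), of the tree distance between k and j."""
    tree_adj = {}
    for u, v in tree_edges:
        tree_adj.setdefault(u, []).append(v)
        tree_adj.setdefault(v, []).append(u)
    stretches = [tree_path_length(tree_adj, u, v) for u, v in graph_edges]
    return sum(stretches) / len(stretches)
```

For a 4-cycle with the path 1-2-3-4 as spanning tree, the three tree edges have stretch 1 and the remaining edge (4, 1) has stretch 3, so the average stretch is 1.5.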

29 Outline. The haplotype inference problem: Biological background; Approximation and complexity of MRHC; Efficient algorithms for ZRHC; A linear-time algorithm for loop-free ZRHC (current). The tagSNP selection problem. The minimum common integer partition problem.

30 The Loop-Free ZRHC problem Problem definition Given a pedigree without mating loops and the genotype information for each member, find a recombination-free haplotype configuration for each member that obeys the Mendelian law of inheritance.

31 Constraint Graphs. Given the constraints in a pedigree graph, we can construct the corresponding constraint graph. An example: (a) a pedigree graph with constraints; (b) the corresponding constraint graph.

32 A Key Lemma. There exists a solution to the loop-free ZRHC problem if and only if the weight sum of every cycle C in the corresponding constraint graph is 0. (Proof sketch) "=>": each h-variable occurs an even number of times in the constraint set S corresponding to C; the sum of the h-variables in S is equal to the weight sum of C; hence the weight sum of C is 0. "<=": done by a construction later. The constraints in S are not independent! [Figure: (a) the pedigree graph; (b) the corresponding constraint graph.]
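The cycle condition in the lemma can be checked on a small graph without enumerating cycles. The following is a minimal sketch of a standard parity-labelling check, assuming the constraint graph is given as a list of (u, v, w) edges with weights w in {0, 1}; this input format and the function name are illustrative and do not come from the slides.

```python
from collections import deque


def all_cycles_have_zero_weight(vertices, weighted_edges):
    """Return True iff every cycle has weight sum 0 (mod 2).

    Equivalent test: a 0/1 label can be given to every vertex such that
    each edge (u, v, w) satisfies label[u] XOR label[v] == w.
    """
    adj = {v: [] for v in vertices}
    for u, v, w in weighted_edges:
        adj[u].append((v, w))
        adj[v].append((u, w))

    label = {}
    for root in vertices:
        if root in label:
            continue
        label[root] = 0  # one BFS per connected component
        queue = deque([root])
        while queue:
            u = queue.popleft()
            for v, w in adj[u]:
                expected = label[u] ^ w
                if v not in label:
                    label[v] = expected
                    queue.append(v)
                elif label[v] != expected:
                    return False  # this edge closes a cycle of odd weight
    return True
```

A triangle with edge weights 1, 1, 0 passes the check, while a triangle with weights 1, 1, 1 fails it.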

33 A Mapping from Constraints to Edges. The constraints forming a spanning forest in the constraint graph are sufficient to represent all constraints. There are at most n-1 independent constraints. We can construct an injective mapping f from the independent constraints to edges in the pedigree graph: each constraint is mapped to an edge on the path corresponding to the constraint. [Figure: (a) a spanning forest for the constraint graph; (b) the pedigree graph.]

34 The ZRHC-PHASE algorithm
Algorithm ZRHC_PHASE
input: a pedigree G = (V, E) and genotypes {g_j}
output: a general solution of {p_j}
begin
  Step 1. Preprocessing
  Step 2. Linear constraint generation on h-variables
  Step 3. Solve h-variables by Gaussian elimination
  Step 4. Solve the p-variables by propagation from pre-determined p-variables to others
end
It takes O(n^3) time!

35 Solving h-variables. In order to obtain a linear-time algorithm, we want to avoid the Gaussian elimination method. An observation: given a constraint along a path j_0, j_1, ..., j_{k-1}, j_k,
h_{j_0,j_1} + h_{j_1,j_2} + ... + h_{j_{k-1},j_k} = b,
we can solve the constraint in the following way. Assign the h-variables on edges (j_0, j_1), (j_1, j_2), ..., (j_{k-2}, j_{k-1}) arbitrarily. Assign the h-variable on the last edge (j_{k-1}, j_k) the fixed value that satisfies the constraint:
h_{j_{k-1},j_k} = h_{j_0,j_1} + ... + h_{j_{k-2},j_{k-1}} + b.
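The observation above can be sketched for a single path constraint as follows. This is an illustration only: the function name, the representation of edges as tuples, and the use of random values for the "arbitrary" assignments are all assumptions, and the sketch handles one constraint in isolation rather than the full set of constraints.

```python
import random


def satisfy_path_constraint(path_edges, b, rng=random):
    """Assign 0/1 to the h-variable of each edge so that their sum is b (mod 2).

    path_edges: the edges (j_0, j_1), ..., (j_{k-1}, j_k) in path order.
    """
    *free_edges, last_edge = path_edges
    # All edges except the last one receive arbitrary values.
    h = {edge: rng.randint(0, 1) for edge in free_edges}
    # The last edge is forced: over F[2], h_last = (sum of the others) + b.
    h[last_edge] = (sum(h.values()) + b) % 2
    return h
```

For example, `satisfy_path_constraint([(6, 8), (8, 9)], 1)` returns values for h_{6,8} and h_{8,9} whose sum is odd, matching the constraint from the 1st locus graph in the earlier example.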

36 Solving h-variables Based on the Mapping f. We have constructed the injective mapping f: S -> E, where S is the constraint set and E is the edge set. The h-variables can be solved by a single BFS traversal. We solve the h-variables as follows. For each h-variable corresponding to an edge e not in f(S), assign an arbitrary value. For each h-variable corresponding to an edge e in f(S), assign a fixed value based on the constraint f^{-1}(e), such that the constraint is satisfied.

37 Outline. The haplotype inference problem: Biological background; Approximation and complexity of MRHC; Efficient algorithms for ZRHC; A linear-time algorithm for loop-free ZRHC. The tagSNP selection problem (current). The minimum common integer partition problem.

38 Motivation With the rapid development of genotyping technologies, there are more than 10 million verified single-nucleotide polymorphisms (SNPs) in dbSNP. We aim to select a subset of informative SNPs ( i.e. tagSNPs) to save the cost for genotyping all SNPs and performing disease association mapping.

39 r^2 Linkage Disequilibrium Statistics. Given a pair of genetic markers 1 and 2, the r^2 statistic is
r^2 = (p_AB - p_A. p_.B)^2 / (p_A. (1 - p_A.) p_.B (1 - p_.B)).
If r^2 is no less than a given threshold r_0, marker 1 (or marker 2) can tag marker 2 (or marker 1, respectively).
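The r^2 formula translates directly into code. In this sketch the argument names are assumptions: `p_ab` is taken to be the frequency of haplotypes carrying allele A at marker 1 and allele B at marker 2, and `p_a` and `p_b` are the marginal frequencies written p_A. and p_.B on the slide.

```python
def r_squared(p_ab, p_a, p_b):
    """r^2 = (p_AB - p_A. * p_.B)^2 / (p_A. (1 - p_A.) p_.B (1 - p_.B))."""
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("r^2 is undefined when a marker has a single allele")
    return (p_ab - p_a * p_b) ** 2 / denom


def can_tag(p_ab, p_a, p_b, r0):
    """True if the two markers can tag each other at threshold r0."""
    return r_squared(p_ab, p_a, p_b) >= r0
```

With `p_a = p_b = 0.5`, a joint frequency `p_ab = 0.5` gives r^2 = 1 (the markers are perfectly correlated), while `p_ab = 0.25` gives r^2 = 0 (the markers are independent).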

40 The TagSNP Selection Problem. Given a set V of SNP markers and LD patterns E = {(v_j1, v_j2) | r^2(v_j1, v_j2) is no less than a given threshold r_0, v_j1 and v_j2 are in V}, we want to select a subset V' of minimum cardinality such that, for any v in V, there exists a v' in V' where r^2(v, v') is no less than r_0. If we define G = (V, E), a tagSNP set is equivalent to a dominating set on G. [Figure: (a) SNP markers and their LD patterns in a population; (b) tagSNPs for the population.]
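Because a tagSNP set is a dominating set on G, a textbook greedy heuristic for dominating set gives a simple way to pick tags. The sketch below is that generic heuristic, not the GreedyTag algorithm of the talk (whose details are not given in the slides); the function name and input format are assumptions, and the result is not guaranteed to be of minimum size.

```python
def greedy_tag_snps(markers, ld_pairs):
    """Greedy dominating set: repeatedly pick the marker tagging most untagged markers.

    markers: list of marker ids (the vertex set V).
    ld_pairs: list of (u, v) pairs with r^2(u, v) >= r_0 (the edge set E).
    """
    # Closed neighbourhoods: every marker tags itself.
    neighbors = {v: {v} for v in markers}
    for u, v in ld_pairs:
        neighbors[u].add(v)
        neighbors[v].add(u)

    untagged = set(markers)
    tags = []
    while untagged:
        # Any untagged marker covers at least itself, so each round makes progress.
        best = max(markers, key=lambda v: len(neighbors[v] & untagged))
        tags.append(best)
        untagged -= neighbors[best]
    return tags
```

On a star-shaped LD pattern, `greedy_tag_snps([1, 2, 3, 4], [(1, 2), (1, 3), (1, 4)])` selects the single centre marker `[1]`.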

41 TagSNP Selection across Populations In two populations with different evolutionary histories, a pair of SNPs having remarkably different marker frequencies and very weak LD may show strong LD in the admixed population. Therefore, tagSNPs picked from the combined populations or one of the populations might not be sufficient to capture the variations in all populations.

42 Problem Definition Given a set of SNP markers and LD patterns in multiple populations, we want to find a minimum common tagSNP set for each of the populations. The above problem is called the minimum common tagSNP selection problem (MCTS). (a) SNP markers and their LD patterns in two populations. (b) The minimum TagSNP set for these two populations.
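A candidate set can be checked against this definition one population at a time. The following sketch assumes the LD patterns are given as a dictionary from a population name to its list of marker pairs; the function name and this input format are illustrative only.

```python
def is_common_tag_set(markers, ld_pairs_by_population, tags):
    """True if every marker is tagged by some selected marker in every population."""
    tags = set(tags)
    for ld_pairs in ld_pairs_by_population.values():
        covered = set(tags)  # a selected marker tags itself
        for u, v in ld_pairs:
            if u in tags:
                covered.add(v)
            if v in tags:
                covered.add(u)
        if not set(markers) <= covered:
            return False  # some marker is untagged in this population
    return True
```

For markers 1, 2, 3 with LD pairs (1, 2), (2, 3) in one population and only (1, 2) in another, the set {2} tags everything in the first population but leaves marker 3 untagged in the second, whereas {2, 3} is a common tag set.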

43 Our Algorithms. The MCTS problem can be easily formulated as an integer linear program. We first apply some data reduction rules, then use one of the following algorithms. A greedy algorithm: GreedyTag. A Lagrangian relaxation algorithm: LRTag. We calculate both the upper bound (i.e. the number of tagSNPs obtained by our algorithms) and the lower bound (i.e. the minimum number of tagSNPs needed). Lower bound: GreedyTag_lb and LRTag_lb.

44 Experimental Result. We apply our algorithms to real HapMap data (release #19, NCBI build 34, October 2005). There are four populations in the HapMap data. CEU: European descendants. CHB: Chinese people from Beijing. JPT: Japanese people from Tokyo. YRI: Yoruba people of Ibadan, Nigeria. We get tagSNPs for the following two datasets. ENCODE regions: all 10 ENCODE regions with 10,859 markers in total. Human genome: chromosomes 1-22 with 2,862,454 markers in total.

45 Experimental Result for ENCODE Regions. We compare our GreedyTag and LRTag with MultiPop-TagSelect (MPS). With the r^2 threshold set to 0.5, the gap between LRTag_lb and LRTag is at most two for each ENCODE region and six in total over all ENCODE regions. There is no gap with the r^2 threshold set to 0.8.

46 Experimental Result for Human Genome. On the entire human genome with 2,862,454 SNPs, the gap between our solution and the lower bound is 1061 SNPs with the r^2 threshold set to 0.5. The gap is 142 SNPs with the r^2 threshold set to 0.8. The numbers of tagSNPs selected by our algorithms are almost optimal.

47 Outline. The haplotype inference problem: Biological background; Approximation and complexity of MRHC; Efficient algorithms for ZRHC; A linear-time algorithm for loop-free ZRHC. The tagSNP selection problem. The minimum common integer partition problem (current).

48 Problem Definitions. P(n): given a positive integer n, a partition of n is a multiset of positive integers {n_1, n_2, ..., n_r} such that n_1 + n_2 + ... + n_r = n. Example: given n = 4, {2,2} is a P(4); given n = 3, {3} is a P(3). IP(S): given a multiset S = {x_1, ..., x_m}, an integer partition of S is a disjoint union of partitions P(x_1), ..., P(x_m). Example: given S = {3, 3, 4}, {2,2,3,3} is an IP({3,3,4}).

49 Examples. CIP(S_1, S_2, ..., S_k): given multisets S_1, S_2, ..., S_k, a common integer partition is a multiset that is an integer partition of every S_i. Example: given S = {3, 3, 4} and T = {2, 2, 6}, {2,2,3,3} is a CIP(S,T); {1,1,2,2,4} is also a CIP(S,T). MCIP(S_1, S_2, ..., S_k): a common integer partition with the minimum cardinality. Example: {2,2,3,3} is an MCIP(S,T). #P(100) = 190,569,292. MCIP is NP-hard.
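The examples can be verified mechanically with a small brute-force checker. This is a sketch for tiny inputs only: it uses exhaustive backtracking (exponential in the worst case), assumes all numbers are positive integers, and the function names are made up for illustration. It checks whether a given multiset is a common integer partition and does not search for a minimum one.

```python
def is_integer_partition(parts, targets):
    """True if `parts` can be split into groups whose sums are exactly `targets`."""
    parts = sorted(parts, reverse=True)  # placing large parts first prunes earlier
    remaining = list(targets)            # amount still to be filled per target
    if sum(parts) != sum(remaining):
        return False

    def place(k):
        if k == len(parts):
            return all(r == 0 for r in remaining)
        tried = set()  # skip targets with an identical remaining amount
        for idx, r in enumerate(remaining):
            if r >= parts[k] and r not in tried:
                tried.add(r)
                remaining[idx] -= parts[k]
                if place(k + 1):
                    return True
                remaining[idx] += parts[k]  # undo and try the next target
        return False

    return place(0)


def is_common_integer_partition(parts, multisets):
    """True if `parts` is an integer partition of every multiset in `multisets`."""
    return all(is_integer_partition(parts, s) for s in multisets)
```

With S = [3, 3, 4] and T = [2, 2, 6], both `[2, 2, 3, 3]` and `[1, 1, 2, 2, 4]` are accepted as common integer partitions, while `[5, 5]` is rejected even though it has the right total.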

50 Biological Applications (1). Genetic distance between two genomes, modelled as the distance between two strings: Minimum Common Substring Partition. [Figure: the strings a b c d e f g h i j k h and h i j k h e f g a b c d.]

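51 Biological Applications (2). MCIP is a special case of Minimum Common Substring Partition (MCSP): an instance MCIP(S', T') with S' = {x_1, x_2, ..., x_m} and T' = {y_1, y_2, ..., y_n} corresponds to an instance MCSP(S, T). [The strings S and T were shown as images and are not preserved in this transcript.]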

52 Our Result. 2-MCIP: MCIP on two input multisets. k-MCIP: MCIP on k input multisets. APX-hard: there is a constant c such that the problem cannot be approximated within ratio c unless P = NP.

53 Conclusion and Future Work. The haplotype inference problem: Biological background; Approximation and complexity of MRHC; Efficient algorithms for ZRHC; A linear-time algorithm for loop-free ZRHC. The tagSNP selection problem. The minimum common integer partition problem.

54 References
L. Liu and T. Jiang. Linear-Time Reconstruction of Zero-Recombinant Mendelian Inheritance on Pedigrees without Mating Loops. In submission.
L. Liu, Y. Wu, S. Lonardi and T. Jiang. Efficient Algorithms for Genome-wide TagSNP Selection across Populations via Linkage Disequilibrium Criterion. To appear in Proc. of the 6th Annual International Conference on Computational Systems Bioinformatics (CSB'2007).
Y. Wu, L. Liu, T. Close and S. Lonardi. Deconvoluting the BAC-gene Relationship Using a Physical Map. To appear in Proc. of the 6th Annual International Conference on Computational Systems Bioinformatics (CSB'2007).
J. Xiao, L. Liu, L. Xia and T. Jiang. Fast Elimination of Redundant Linear Equations and Reconstruction of Recombination-Free Mendelian Inheritance on a Pedigree. In Proc. of the ACM-SIAM Symposium on Discrete Algorithms (SODA'2007), pp. 655-664.
X. Chen, L. Liu, Z. Liu and T. Jiang. On the Minimum Common Integer Partition Problem. In Proc. of the 6th Conference on Algorithms and Complexity, Rome, Italy, pp. 236-247.
L. Liu, X. Chen, J. Xiao and T. Jiang. Complexity and Approximation of the Minimum Recombination Haplotype Configuration Problem. In Proc. of the 16th Annual International Symposium on Algorithms and Computation (ISAAC'05), pp. 370-379. [Best paper nominations: 5.35%]. To appear in Theoretical Computer Science.

55 Thanks for your time and attention!

