
1 National Taiwan University Department of Computer Science and Information Engineering Pattern Identification in a Haplotype Block * Kun-Mao Chao Department of Computer Science and Information Engineering National Taiwan University, Taiwan * We thank Yao-Ting Huang for his technical assistance.

2 Genetic Variations
Genetic variations in DNA sequences (e.g., insertions, deletions, and mutations) have a major impact on genetic diseases and phenotypic differences.
- All humans share about 99% of the same DNA sequence.
- Genetic variations in a coding region may change a codon and thereby alter the amino acid sequence.

3 Single Nucleotide Polymorphism
A Single Nucleotide Polymorphism (SNP), pronounced "snip," is a genetic variation in which a single nucleotide (i.e., A, T, C, or G) is altered and the change is passed on through heredity.
- SNP: a single DNA base variation found in >= 1% of the population.
- Mutation: a single DNA base variation found in < 1% of the population.
[Figure: two sequences, C T T A G C T T and C T T A G T T T. With frequencies of 94% and 6% the variation is a SNP; with frequencies of 99.9% and 0.1% it is a mutation.]
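The 1% cutoff above can be written as a small check. This is a minimal sketch: the function name `classify_variation` and the choice to pass in the frequency of the less common variant are assumptions made for illustration, not part of the slides.

```python
def classify_variation(minor_variant_frequency: float) -> str:
    """Label a single-base variation by the frequency of its less common variant."""
    if not 0.0 <= minor_variant_frequency <= 0.5:
        raise ValueError("frequency of the less common variant must lie in [0, 0.5]")
    # Slide convention: at least 1% is a SNP, below 1% is a mutation.
    return "SNP" if minor_variant_frequency >= 0.01 else "mutation"


print(classify_variation(0.06))   # the 94% / 6% example   -> SNP
print(classify_variation(0.001))  # the 99.9% / 0.1% example -> mutation
```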

4 Mutations and SNPs
[Figure: timeline from the common ancestor to the present; the genetic variations observed at present are labeled as mutations and SNPs.]

5 Single Nucleotide Polymorphism
SNPs are the most frequent form of genetic variation.
- 90% of human genetic variations come from SNPs.
- SNPs occur about once every 300 to 600 base pairs.
- Millions of SNPs have been identified (e.g., by HapMap and Perlegen).
SNPs have become the preferred markers for association studies because of their high abundance and the availability of high-throughput SNP genotyping technologies.

6 Single Nucleotide Polymorphism
A SNP is usually assumed to be a binary variable.
- The probability of a repeat mutation at the same SNP locus is quite small.
- Tri-allelic cases are usually attributed to genotyping errors.
The nucleotide at a SNP locus is called
- a major allele (if its allele frequency is > 50%), or
- a minor allele (if its allele frequency is < 50%).
[Figure: sequences A C T T A G C T T (94%) and A C T T A G C T C (6%); at the last position, T is the major allele and C is the minor allele.]

7 Haplotypes
A haplotype is a set of linked SNPs on the same chromosome.
- A haplotype can be treated simply as a binary string, since each SNP is binary.
[Figure: three sequences with SNP 1, SNP 2, and SNP 3 at the 2nd, 5th, and 9th bases.]
Sequence               SNP 1  SNP 2  SNP 3
-A C T T A G C T T-    C      A      T      (Haplotype 1)
-A A T T T G C T C-    A      T      C      (Haplotype 2)
-A C T T T G C T C-    C      T      C      (Haplotype 3)
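The binary-string view can be sketched in a few lines. The encoding below is an assumption for illustration: the major allele at each SNP is taken to be the most common base among the three haplotypes on this slide (not a population frequency), and it is written as 0 while the minor allele is written as 1. The function name `encode_haplotypes` is hypothetical.

```python
from collections import Counter


def encode_haplotypes(haplotypes):
    """Encode each haplotype as a binary string: '0' = major allele, '1' = minor allele."""
    n_snps = len(haplotypes[0])
    # Major allele per SNP = most common base in this small sample (an assumption).
    majors = [
        Counter(h[i] for h in haplotypes).most_common(1)[0][0]
        for i in range(n_snps)
    ]
    return [
        "".join("0" if h[i] == majors[i] else "1" for i in range(n_snps))
        for h in haplotypes
    ]


# Haplotypes 1 to 3 from the slide, read at SNP 1, SNP 2, and SNP 3.
haplotypes = ["CAT", "ATC", "CTC"]
print(encode_haplotypes(haplotypes))  # ['011', '100', '000']
```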

8 Tag SNP Selection
[Diagram: topic map listing SNP Database; Haplotype Inference (Maximum Parsimony, Perfect Phylogeny, Statistical Methods); Tag SNP Selection (Haplotype block, LD bin, Prediction Accuracy); …]

9 Problems of Using SNPs for Association Studies
The number of SNPs is too large for all of them to be used in association studies.
- There are millions of SNPs in the human genome.
- To reduce the SNP genotyping cost, we wish to use as few SNPs as possible for association studies.
Tag SNPs are a small subset of SNPs that is sufficient for performing association studies without losing the power of using all SNPs.
- We will first study the definition of tag SNPs based on the haplotype-block model.

10 Haplotype Blocks and Tag SNPs
Some studies have shown that a chromosome can be partitioned into haplotype blocks separated by recombination hotspots.
- Within a haplotype block, little or no recombination has occurred.
- The SNPs within a haplotype block tend to be inherited together.
Within a haplotype block, a small subset of SNPs (called tag SNPs) is sufficient to distinguish each pair of haplotype patterns in the block.
- We only need to genotype the tag SNPs instead of all SNPs within a haplotype block.

11 Recombination Hotspots and Haplotype Blocks
[Figure: a chromosome divided into haplotype blocks by recombination hotspots; haplotype patterns P1 to P4 are shown over SNP loci S1 to S12, with each cell marked as a major or minor allele.]

12 A Haplotype Block Example
Human chromosome 21 was partitioned into 4,135 haplotype blocks over 24,047 SNPs by Patil et al. (Science, 2001).
- Blue box: major allele
- Yellow box: minor allele

13 Examples of Tag SNPs
Suppose we wish to identify an unknown haplotype sample. We can genotype all SNPs to identify the haplotype sample.
[Figure: haplotype patterns P1 to P4 over SNP loci S1 to S12, with each cell marked as a major or minor allele, alongside an unknown haplotype sample.]

14 Examples of Tag SNPs
In fact, it is not necessary to genotype all SNPs. SNPs S3, S4, and S5 can form a set of tag SNPs.
[Figure: haplotype patterns P1 to P4 over SNP loci S1 to S12, with the columns S3, S4, and S5 extracted.]

15 Examples of Wrong Tag SNPs
SNPs S1, S2, and S3 cannot form a set of tag SNPs because P1 and P4 would be ambiguous: they agree on all three of these SNPs.
[Figure: haplotype patterns P1 to P4 over SNP loci S1 to S12, with the columns S1, S2, and S3 extracted.]

16 Examples of Tag SNPs
SNPs S1 and S12 can form a set of tag SNPs. This set of SNPs is the minimum solution in this example.
[Figure: haplotype patterns P1 to P4 over SNP loci S1 to S12, with the columns S1 and S12 extracted.]

17 Problems of Finding Tag SNPs
The problem of finding the minimum set of tag SNPs is known to be NP-hard.
- This problem is the minimum test set problem.
- A number of methods have been proposed to find the minimum set of tag SNPs.
Here we illustrate how to recast the tag SNP selection problem as the set cover problem and as an integer linear programming problem.

18 Problem Formulation
Given h patterns, we have $\binom{h}{2} = h(h-1)/2$ pairs of patterns. The relation between SNPs and pairs of haplotype patterns can be formulated as a bipartite graph.
- S1 can distinguish (P1, P3), (P1, P4), (P2, P3), and (P2, P4).
- S2 can distinguish (P1, P4), (P2, P4), and (P3, P4).
[Figure: bipartite graph linking SNPs S1 to S4 to the pattern pairs (1,2), (1,3), (1,4), (2,3), (2,4), and (3,4), next to the table of patterns P1 to P4 over S1 to S4.]
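The edges of this bipartite graph can be computed directly from a pattern table. In the sketch below, the 0/1 table `PATTERNS` is an assumption: the allele table itself is only a figure in the slides, so the values were chosen to be consistent with the pairs listed for S1 and S2 and with the sets D(P_j, P_k) given later in the deck. The helper name `distinguished_pairs` is hypothetical.

```python
from itertools import combinations

# Hypothetical 0/1 encoding (columns are S1..S4); only which patterns differ
# at each SNP matters, not which allele is called 0 or 1.
PATTERNS = {
    "P1": "0000",
    "P2": "0011",
    "P3": "1010",
    "P4": "1101",
}


def distinguished_pairs(patterns):
    """Map each SNP to the list of pattern pairs it distinguishes (graph edges)."""
    names = sorted(patterns)
    n_snps = len(patterns[names[0]])
    edges = {f"S{i + 1}": [] for i in range(n_snps)}
    for a, b in combinations(names, 2):
        for i in range(n_snps):
            if patterns[a][i] != patterns[b][i]:
                edges[f"S{i + 1}"].append((a, b))
    return edges


edges = distinguished_pairs(PATTERNS)
h = len(PATTERNS)
print(h * (h - 1) // 2)  # number of pattern pairs: 6
print(edges["S1"])       # [('P1', 'P3'), ('P1', 'P4'), ('P2', 'P3'), ('P2', 'P4')]
print(edges["S2"])       # [('P1', 'P4'), ('P2', 'P4'), ('P3', 'P4')]
```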

19 Set Cover
A group of SNPs can form a set of tag SNPs if each pair of patterns is connected to the group by at least one edge.
- e.g., S1 and S3 can form a set of tag SNPs.
- e.g., S1 and S2 cannot be tag SNPs, because neither has an edge to the pair (1,2).
[Figure: the bipartite graph between SNPs and the pattern pairs (1,2), (1,3), (1,4), (2,3), (2,4), and (3,4), with S1 and S3 highlighted, next to the table of patterns P1 to P4 over S1 to S4.]
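The covering condition amounts to checking that no two patterns look identical on the chosen SNPs. A minimal sketch follows; the 0/1 table `PATTERNS` is an assumed encoding chosen to agree with the distinguishing pairs stated in the slides, and `is_tag_set` is a hypothetical helper name.

```python
from itertools import combinations

# Hypothetical 0/1 encoding (columns are S1..S4), consistent with the slides.
PATTERNS = {
    "P1": "0000",
    "P2": "0011",
    "P3": "1010",
    "P4": "1101",
}


def is_tag_set(patterns, snp_indices):
    """True if the chosen SNP columns (0-based) distinguish every pair of patterns."""
    for a, b in combinations(patterns.values(), 2):
        if all(a[i] == b[i] for i in snp_indices):
            return False  # this pair is not covered by any chosen SNP
    return True


print(is_tag_set(PATTERNS, [0, 2]))  # S1 and S3 -> True
print(is_tag_set(PATTERNS, [0, 1]))  # S1 and S2 -> False (P1 and P2 collide)
```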

20 A Greedy Algorithm
[Figure: greedy selection on the bipartite graph; the edges of the selected SNPs S1 and S4 together cover all six pattern pairs (1,2), (1,3), (1,4), (2,3), (2,4), and (3,4).]
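The slide gives no pseudocode, so the following is a sketch of the standard greedy heuristic for set cover, assumed here to be what the figure depicts: repeatedly pick the SNP that covers the most still-uncovered pattern pairs. The dictionary `COVER` restates which pairs each SNP distinguishes in the running example; the name `greedy_tag_snps` is hypothetical. Ties are broken by dictionary order, so this sketch returns S1 and S3, whereas the figure shows S1 and S4; both are valid covers of size two.

```python
# Pattern pairs distinguished by each SNP in the running example.
COVER = {
    "S1": {(1, 3), (1, 4), (2, 3), (2, 4)},
    "S2": {(1, 4), (2, 4), (3, 4)},
    "S3": {(1, 2), (1, 3), (2, 4), (3, 4)},
    "S4": {(1, 2), (1, 4), (2, 3), (3, 4)},
}


def greedy_tag_snps(cover):
    """Greedy set cover: pick the SNP covering the most uncovered pairs each round."""
    uncovered = set().union(*cover.values())
    chosen = []
    while uncovered:
        # max() returns the first maximal key, so ties go to the earlier SNP.
        best = max(cover, key=lambda snp: len(cover[snp] & uncovered))
        chosen.append(best)
        uncovered -= cover[best]
    return chosen


print(greedy_tag_snps(COVER))  # ['S1', 'S3']
```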

21 Integer Linear Programming
Suppose there are n SNPs and h patterns. Let x_i be defined as follows.
- x_i = 1 if the i-th SNP is selected;
- x_i = 0 otherwise.
Let D(P_j, P_k) be the set of SNPs that can distinguish patterns P_j and P_k.
Integer programming formulation.
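With these definitions, the integer program can be written in the usual set cover form. The version below is a reconstruction under that assumption: the objective (minimize the number of selected SNPs) is inferred from the minimum tag SNP problem stated earlier, and each constraint says that at least one selected SNP must distinguish the pair (P_j, P_k).

```latex
\begin{aligned}
\text{minimize}   \quad & \sum_{i=1}^{n} x_i \\
\text{subject to} \quad & \sum_{S_i \in D(P_j, P_k)} x_i \;\ge\; 1
                          && \text{for all } 1 \le j < k \le h, \\
                        & x_i \in \{0, 1\}
                          && \text{for all } 1 \le i \le n.
\end{aligned}
```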

22 Problem Formulation
- D(P1, P2) = {S3, S4}
- D(P1, P3) = {S1, S3}
- D(P1, P4) = {S1, S2, S4}
- D(P2, P3) = {S1, S4}
- D(P2, P4) = {S1, S2, S3}
- D(P3, P4) = {S2, S3, S4}
[Figure: table of patterns P1 to P4 over SNPs S1 to S4.]
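For an instance this small, the integer program can be solved exactly by trying subsets in order of increasing size. This brute-force search is only an illustration (it is exponential in the number of SNPs, which is why the general problem needs other methods); the names `D`, `SNPS`, and `minimum_tag_snps` are assumptions introduced for the sketch.

```python
from itertools import combinations

# Distinguishing sets from the slide, one covering constraint per pattern pair.
D = {
    ("P1", "P2"): {"S3", "S4"},
    ("P1", "P3"): {"S1", "S3"},
    ("P1", "P4"): {"S1", "S2", "S4"},
    ("P2", "P3"): {"S1", "S4"},
    ("P2", "P4"): {"S1", "S2", "S3"},
    ("P3", "P4"): {"S2", "S3", "S4"},
}
SNPS = ["S1", "S2", "S3", "S4"]


def minimum_tag_snps(snps, d):
    """Return a smallest subset of SNPs that intersects every distinguishing set."""
    for size in range(1, len(snps) + 1):
        for subset in combinations(snps, size):
            chosen = set(subset)
            if all(chosen & required for required in d.values()):
                return list(subset)
    return None  # only reached if some pair has an empty distinguishing set


print(minimum_tag_snps(SNPS, D))  # ['S1', 'S3']
```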

23 An Iterative LP-relaxation Algorithm
- Linear programming relaxation.
- Randomized rounding method.
- Repeat the steps for the unsatisfied inequalities until all of them are satisfied.
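The rounding loop can be sketched as follows, under two stated simplifications. First, the fractional solution `X_FRAC` is supplied by hand instead of being computed by an LP solver; for the running example, x = (0.5, 0, 0.5, 0.5) satisfies every relaxed constraint with objective value 1.5, and adding the constraints for (P1,P2), (P1,P3), and (P2,P3) shows that no feasible solution of the relaxation does better. Second, the sketch reuses the same fractional values in each round instead of re-solving the LP for the unsatisfied inequalities, which the slide may intend. The name `iterative_rounding` is hypothetical.

```python
import random

# Distinguishing sets of the running example (one inequality per pattern pair).
D = {
    ("P1", "P2"): {"S3", "S4"},
    ("P1", "P3"): {"S1", "S3"},
    ("P1", "P4"): {"S1", "S2", "S4"},
    ("P2", "P3"): {"S1", "S4"},
    ("P2", "P4"): {"S1", "S2", "S3"},
    ("P3", "P4"): {"S2", "S3", "S4"},
}

# Hand-supplied optimal solution of the LP relaxation (an assumption of this sketch).
X_FRAC = {"S1": 0.5, "S2": 0.0, "S3": 0.5, "S4": 0.5}


def iterative_rounding(d, x_frac, rng, max_rounds=1000):
    """Select each SNP with probability x_i; repeat for unsatisfied inequalities."""
    chosen = set()
    unsatisfied = [req for req in d.values() if not (req & chosen)]
    rounds = 0
    while unsatisfied:
        rounds += 1
        if rounds > max_rounds:
            raise RuntimeError("rounding did not converge")
        # Only SNPs that appear in a still-unsatisfied inequality are rounded again.
        candidates = set().union(*unsatisfied) - chosen
        for snp in sorted(candidates):
            if rng.random() < x_frac[snp]:
                chosen.add(snp)
        unsatisfied = [req for req in d.values() if not (req & chosen)]
    return chosen


tags = iterative_rounding(D, X_FRAC, random.Random(0))
print(sorted(tags))  # a valid, though not necessarily minimum, set of tag SNPs
```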

24 Discussion
In this chapter, we illustrated how to recast the tag SNP selection problem as the set cover problem and as an integer linear programming problem.
- hard problems
- approximation algorithms
Related topics:
- missing data
- LD-bins
- a specified number of tag SNPs

