Presentation is loading. Please wait.

Presentation is loading. Please wait.

Wei-Bung Wang Tao Jiang

Similar presentations


Presentation on theme: "Wei-Bung Wang Tao Jiang"— Presentation transcript:

1 Wei-Bung Wang Tao Jiang
A New Model of Multi-Marker Correlation for Genome-Wide Tag SNP Selection Wei-Bung Wang Tao Jiang

2 Outline Introduction Problem Related Work Our Approach Result

3 Single Nucleotide Polymorphism
Introduction Single Nucleotide Polymorphism Single Nucleotide Polymorphism (SNP) A genetic variation C T T A G C T T C T T A G T T T SNP 94% 6% Modified from slide by Yao-Ting Huang, National Taiwan University Department of Computer Science and Information Engineering

4 SNPs SNPs are usually bi-allelic
Introduction SNPs SNPs are usually bi-allelic Major allele Minor allele Minor allele frequency > 1% (or 5%) Tri-allelic: very rare C T T A G C T T C T T A G T T T SNP 94% 6%

5 Haplotype -A C T T A G C T T- -A A T T T G C T C- -A C T T T G C T C-
Introduction Haplotype -A C T T A G C T T- -A A T T T G C T C- -A C T T T G C T C- Haplotype 2 Haplotype 3 C A T A T C C T C Haplotype 1 SNP1 SNP2 SNP3 SNP1 SNP2 SNP3 Among humans, there are more than 99.9% identical nucleotides, and we consider only those < 0.1% difference, called SNPs. A set of SNPs on a single chromosome May refer to as few as two loci or to an entire chromosome In this talk, we only consider SNPs, and disregard all other identical nucleotides Modified from slide by Yao-Ting Huang, National Taiwan University Department of Computer Science and Information Engineering

6 Introduction Tag SNP What is a tag SNP? Here I use some slides by Yao-Ting Huang and Kun-Mao Chao

7 Examples of Tag SNPs Haplotype patterns An unknown haplotype sample P1
Suppose we wish to distinguish an unknown haplotype sample. We can genotype all SNPs to identify the haplotype sample. S2 S3 S4 S5 SNP loci S6 S7 S8 S9 : Major allele S10 S11 : Minor allele S12

8 Examples of Tag SNPs Haplotype pattern P1 P2 P3 P4 S1
In fact, it is not necessary to genotype all SNPs. SNPs S3, S4, and S5 can form a set of tag SNPs. S2 S3 S4 S5 SNP loci S6 P1 P2 P3 P4 S7 S8 S3 Tag SNPs is a set of representative that is sufficient to distinguish what pattern it is or predict all other SNPs. S9 S4 S10 S5 S11 S12

9 Examples of Wrong Tag SNPs
Haplotype pattern P1 P2 P3 P4 S1 SNPs S1, S2, and S3 can not form a set of tag SNPs because P1 and P4 will be ambiguous. S2 S3 S4 S5 SNP loci S6 P1 P2 P3 P4 S7 S1 S8 S2 S9 S3 S10 S11 S12

10 Examples of Tag SNPs Haplotype pattern P1 P2 P3 P4
SNPs S1 and S12 can form a set of tag SNPs. This set of SNPs is the minimum solution in this example. S1 S2 S3 S4 S5 SNP loci S6 S7 S8 P1 P2 P3 P4 S9 S1 S10 S12 S11 S12

11 Problem Tag SNP selection How to select representatives?
Many different ways

12 ? Flowchart What we do A group of individuals (all SNPs are known)
Select A set of SNPs (tag SNPs) Relationships between tag SNPs and other SNPs ? This is a kind of standard flowchart. … We want to improve the selection, that is, try to select a smaller set of tag SNPs and save more money Assay Haplotype: tag SNPs Haplotype: all SNPs Save money here

13 Problem Perfect world Real life Minimum set of tag SNPs
Save most money NP-hard Real life Relatively small set Sufficient accuracy/confidence

14 A very frequently used method is Linkage Disequilibrium (LD)
A group of individuals (all SNPs are known) Select What we do A set of SNPs (tag SNPs) Relationships between tag SNPs and other SNPs ? Assay Haplotype: tag SNPs Haplotype: all SNPs Save money here

15 Linkage Disequilibrium (LD)
Related Work Linkage Disequilibrium (LD) Non-random association of alleles at two or more loci Correlated coefficient: estimation of dependency LD = correlated coefficient = r 2 Some combinations of alleles or genetic markers occur more or less frequently than would be expected, and some pattern don’t even exist. Therefore we have non-random association of alleles at two or more loci, and we call it linkage disequilibrium. To estimate the dependency between two SNPs, correlated coefficient is a frequently used statistical tool. In this talk ….

16 Linkage Disequilibrium
Related Work Linkage Disequilibrium r2 = 1: perfect correlation r2 = 0.9: strong correlation (0.95, etc.) r2 = 0: no correlation r 2 = ( P A B ) a b r 2 [ ; 1 ] ( A , B ; a b ) r 2 = 1 , P A B

17 An Example Individual SNP1 SNP2 SNP3 SNP4 SNP5 SNP6 1 A G C T 2 3 4 5
Related Work An Example Individual SNP1 SNP2 SNP3 SNP4 SNP5 SNP6 1 A G C T 2 3 4 5 A A T T C C

18 Minimum Dominating Set Problem
Related Work Minimum Dominating Set Problem Highly correlated This is a traditional way to solve this problem, but we want to go further. SNP

19 Examples of Tag SNPs Haplotype patterns An unknown haplotype sample P1
Suppose we wish to distinguish an unknown haplotype sample. S2 S3 S4 S5 SNP loci S6 S7 S8 S9 : Major allele S10 S11 : Minor allele S12

20 Examples of Tag SNPs Haplotype patterns An unknown haplotype sample P1
Suppose we wish to distinguish an unknown haplotype sample. S2 S3 S4 S5 SNP loci S6 S7 S8 S9 : Major allele S10 S11 : Minor allele S12

21 SNPs can work together and help each other
Examples of Tag SNPs Haplotype pattern P1 P2 P3 P4 SNPs S1 and S12 can form a set of tag SNPs. This set of SNPs is the minimum solution in this example. S1 S2 S3 S4 S5 SNPs can work together and help each other SNP loci S6 S7 S8 P1 P2 P3 P4 In a traditional single-marker model, we use one SNP to predict some other SNPs that are correlated to it. Now we want to use two or three SNPs to work together and predict other SNPs S9 S1 S10 S12 S11 S12

22 Our Approach We introduce new allele AC and AC Only one mistake snp1
haplotype1 A C G haplotype2 T haplotype3 haplotype4 haplotype5 haplotype6 haplotype7 haplotype8 haplotype9 haplotype10 We introduce new allele AC and AC Only one mistake : A C T . 7 G 1 2 3 8 A C G else T

23 Our Approach A C , G : T ( A C _ T ) , : (snp1, snp2) vs. snp3
haplotype1 A C G haplotype2 T haplotype3 haplotype4 haplotype5 haplotype6 haplotype7 haplotype8 haplotype9 haplotype10 (snp1, snp2) vs. snp3 (snp1, snp2) vs. snp4 A C , G : T ( A C _ T ) , :

24 In the Right Order A group of individuals (all SNPs are known) Select
Our Approach In the Right Order A group of individuals (all SNPs are known) Select A set of SNPs (tag SNPs) Relationships between tag SNPs and other SNPs Let’s recall that we need not only to select the set of tag SNPs, but also to provide the prediction rules to reconstruct other SNPs in the future. We have to determine the form of prediction rules before we are able to select tag SNPs accordingly. Second First

25 Our Approach Generate relationships …………
If SNP 1, 4, 10 are tag SNPs Predict SNP 17 with patterns … Accuracy / LD: 0.97 If SNP 5, 8, 13 are tag SNPs Predict SNP 11 with patterns … Accuracy / LD: 0.62 . A relationship looks like this. If SNP 1, 4, and 10 are all tag SNPs, they can form a composite SNP to predict SNP 17. We determine how to predict SNP 17 with some way that I’m going to talk about in the next slide, and it achieves some accuracy, such as 0.97.

26 How to Predict / Determine the Alleles?
Our Approach How to Predict / Determine the Alleles? LD: (tag) SNP 1, 2, 3 vs. SNP 4 Allele A/a, B/b, C/c, D/d abc P A B C D > d ) m a j o r c < i n abC Abc aBc AbC aBC SNP[123] becomes bi-allelic ABC ABc major bucket minor bucket ( A B C _ b c a ) = M D m d

27 Similar Work Ke Hao also did a similar work The same LD model
Our Approach Similar Work Ke Hao also did a similar work The same LD model Different way to determine alleles for composite SNPs Less flexibility A special case of our model Related paper: “Genome-wide selection of tag SNPs using multiple-marker correlation,” Bioinformatics, 2007

28 Sketch Get r2 value for all possible combinations
Our Approach Sketch Get r2 value for all possible combinations Find a small subset of SNPs according to LD

29 Tag SNPs are also covered by themselves
Our Approach Sketch Find a small subset of SNPs according to LD Tag SNPs Tag SNPs are also covered by themselves covered partial covered

30 Sketch Simple greedy algorithm (Ke Hao)
Our Approach Sketch Simple greedy algorithm (Ke Hao) Cover more SNPs in each iteration Modified greedy algorithm (my work) A SNP that can’t be covered by others High priority A SNP that is not picked but covered OK Break tie: partial cover

31 Our Approach Supersede No longer contributes

32 Our Approach Supersede

33 My Program: MMTagger Pick a SNP Data structure Our Approach A l g o r
1 T w - M a k e R q u : s f p S N P n c v d 2 3 Ã 4 5 6 ( ; j B ) 7 8 / * \ " 9 Pick a SNP Data structure

34 Complexity Computing r2 value Picking tag SNPs ­ ( T )
Our Approach Complexity Computing r2 value O(nk+1) for k-marker Picking tag SNPs where T is the number of relationships O(T log T) time algorithm ( T )

35 Result Our program: MMTagger Vs. Single-marker approach (LRTag)
A state-of-the-art program Single-marker Vs. Hao’s program (MultiTag) Multi-marker

36 Vs. Single-Marker Approach
Result Vs. Single-Marker Approach

37 Result MMTagger Vs. MultiTag

38 Conclusion We provide a new multi-marker model Performance
Size of tag SNP set 2- vs. 1-marker: apparently better 3- vs. 2-marker: slightly better 4-marker or more: slow, unacceptable Performance Our program outperforms the only other program with similar model

39 Thank you!


Download ppt "Wei-Bung Wang Tao Jiang"

Similar presentations


Ads by Google