Efficient Algorithms for Genome-wide TagSNP Selection across Populations via the Linkage Disequilibrium Criterion Authors: Lan Liu, Yonghui Wu, Stefano.


1 Efficient Algorithms for Genome-wide TagSNP Selection across Populations via the Linkage Disequilibrium Criterion Authors: Lan Liu, Yonghui Wu, Stefano Lonardi and Tao Jiang

2 Outline: Introduction; The MCTS Model; Our Algorithms; Experimental Results

3 Outline: Introduction (current section); The MCTS Model; Our Algorithms; Experimental Results

4 Motivation With the rapid development of genotyping technologies, there are more than 10 million verified single-nucleotide polymorphisms (SNPs) in the dbSNP database. We aim to select a subset of informative SNPs (i.e., tagSNPs) in order to save the cost of genotyping all SNPs and to perform disease association mapping.

5 TagSNP Selection Haplotype-based methods: require the information of the phased multilocus haplotypes. Haplotype-free methods: do not require haplotype information; an example is tagSNP selection via the $r^2$ linkage disequilibrium statistic.

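6 $r^2$ Linkage Disequilibrium Statistics Given a pair of genetic markers 1 and 2, with alleles A/a at marker 1 and B/b at marker 2, the $r^2$ statistic is

$$r^2 = \frac{(p_{AB} - p_{A\cdot}\,p_{\cdot B})^2}{p_{A\cdot}(1 - p_{A\cdot})\,p_{\cdot B}(1 - p_{\cdot B})}$$

where $p_{AB}$ is the frequency of haplotypes carrying both A and B, and $p_{A\cdot}$ and $p_{\cdot B}$ are the marginal frequencies of alleles A and B. If $r^2$ is no less than a given threshold $r_0$, marker 1 can tag marker 2, and marker 2 can tag marker 1.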
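The statistic above can be sketched in a few lines of Python. The function names `r_squared` and `can_tag`, the default threshold of 0.8, and the frequencies in the usage lines are illustrative assumptions, not part of the presentation.

```python
def r_squared(p_ab: float, p_a: float, p_b: float) -> float:
    """r^2 between two biallelic markers.

    p_ab: frequency of haplotypes carrying allele A and allele B
    p_a:  marginal frequency of allele A at marker 1
    p_b:  marginal frequency of allele B at marker 2
    """
    denom = p_a * (1.0 - p_a) * p_b * (1.0 - p_b)
    if denom <= 0.0:
        # r^2 is undefined when either marker is monomorphic.
        raise ValueError("both markers must be polymorphic (0 < frequency < 1)")
    d = p_ab - p_a * p_b  # deviation from independence
    return d * d / denom


def can_tag(p_ab: float, p_a: float, p_b: float, r0: float = 0.8) -> bool:
    """True if the two markers can tag each other at threshold r0 (assumed default)."""
    return r_squared(p_ab, p_a, p_b) >= r0


# Illustrative frequencies (not from the slides):
print(r_squared(0.4, 0.5, 0.5))  # moderate LD, roughly 0.36
print(can_tag(0.4, 0.5, 0.5))    # below the 0.8 threshold
print(can_tag(0.5, 0.5, 0.5))    # perfect LD
```

Because $r^2$ is symmetric in the two markers, a single call decides tagging in both directions.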

7 The TagSNP Selection Problem
Instance: a set $V$ of SNP markers and LD patterns $E = \{(v_{j1}, v_{j2}) \mid r^2(v_{j1}, v_{j2}) \ge r_0,\ v_{j1}, v_{j2} \in V\}$, where $r_0$ is a given threshold.
Feasible solution: a subset $V' \subseteq V$ such that for any $v \in V$ there exists a $v' \in V'$ with $r^2(v, v') \ge r_0$.
Objective: minimize $|V'|$.
If we define $G = (V, E)$, a tagSNP set is equivalent to a dominating set of $G$. This model was introduced by Carlson et al. (2004). It is a simple and popular tagging method.
Figure: (a) SNP markers and their LD patterns in a population; (b) tagSNPs for the population.
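The feasibility condition can be checked directly on the LD graph. The following is a minimal sketch; the function name `is_tag_set` and the toy markers and edges are assumptions made for illustration, and each marker is taken to tag itself (since $r^2(v, v) = 1$).

```python
from typing import Hashable, Iterable, Tuple


def is_tag_set(markers: Iterable[Hashable],
               ld_pairs: Iterable[Tuple[Hashable, Hashable]],
               candidate: Iterable[Hashable]) -> bool:
    """Check whether `candidate` is a dominating set of G = (markers, ld_pairs).

    ld_pairs holds the unordered pairs whose r^2 is at least the threshold r0.
    """
    chosen = set(candidate)
    tagged = set(chosen)  # every tagSNP tags itself
    for u, v in ld_pairs:
        if u in chosen:
            tagged.add(v)
        if v in chosen:
            tagged.add(u)
    return set(markers) <= tagged


# Toy instance (hypothetical): a path 1-2-3 plus a separate pair 4-5.
markers = [1, 2, 3, 4, 5]
ld_pairs = [(1, 2), (2, 3), (4, 5)]
print(is_tag_set(markers, ld_pairs, {2, 4}))  # every marker is tagged
print(is_tag_set(markers, ld_pairs, {2}))     # markers 4 and 5 are left untagged
```

Checking a candidate set is easy; finding a minimum one is the hard part, since minimum dominating set is NP-hard in general.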

8 Introduction  The MCTS Model Our Algorithms Experimental Result Outline

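9 $r^2$ Statistics in Single and Admixed Populations
SNP 1 has alleles A and a; SNP 2 has alleles B and b. Each table gives the joint haplotype frequencies, with the marginal allele frequencies in the last row and column.

Population 1 ($r^2 = 0$):
             b         B         total
    a        0.0025    0.0475    0.05
    A        0.0475    0.9025    0.95
    total    0.05      0.95

Population 2 ($r^2 = 0$):
             b         B         total
    a        0.9025    0.0475    0.95
    A        0.0475    0.0025    0.05
    total    0.95      0.05

Admixed population, 50% population 1 and 50% population 2 ($r^2 = 0.6561$):
             b         B         total
    a        0.4525    0.0475    0.5
    A        0.0475    0.4525    0.5
    total    0.5       0.5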
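The admixed value on this slide can be reproduced by hand from the two single-population tables. With equal mixing weights, each admixed frequency is the average of the two population frequencies:

$$p_{AB} = \tfrac{1}{2}(0.9025) + \tfrac{1}{2}(0.0025) = 0.4525, \qquad p_{A\cdot} = p_{\cdot B} = \tfrac{1}{2}(0.95) + \tfrac{1}{2}(0.05) = 0.5$$

$$r^2 = \frac{(0.4525 - 0.5 \cdot 0.5)^2}{0.5\,(1 - 0.5)\cdot 0.5\,(1 - 0.5)} = \frac{0.2025^2}{0.0625} = \frac{0.04100625}{0.0625} = 0.6561$$

In each single population, $p_{AB} = p_{A\cdot}\,p_{\cdot B}$ (for example $0.9025 = 0.95 \cdot 0.95$), so the numerator vanishes and $r^2 = 0$. The LD in the admixed sample arises purely from the difference in allele frequencies between the two populations.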

10 TagSNP Selection across Populations A pair of SNPs that have remarkably different marker frequencies and very weak LD in two populations with different evolutionary histories may show strong LD in the admixed population. TagSNPs picked from the admixed population, or from only one of the populations, might not be sufficient to capture the variation in all populations.

11 The MCTS Model Given a set of SNP markers and their LD patterns in multiple populations, we want to find a minimum common tagSNP set, i.e., a single smallest set that is a valid tagSNP set for each of the populations. This problem is called the minimum common tagSNP selection (MCTS) problem. Figure: (a) SNP markers and their LD patterns in two populations; (b) the minimum tagSNP set for these two populations.

12 Outline: Introduction; The MCTS Model; Our Algorithms (current section); Experimental Results

13 Our Algorithms The MCTS problem can be easily formulated as an integer linear program. We first apply some data reduction rules, then use one of the following algorithms: a greedy algorithm, GreedyTag, or a Lagrangian relaxation algorithm, LRTag. We calculate two bounds. Upper bound: the number of tagSNPs obtained by our algorithms. Lower bound (GreedyTag_lb, LRTag_lb): a bound from below on the minimum number of tagSNPs needed.
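The slides do not spell out the integer linear program. One natural set-cover-style formulation consistent with the MCTS definition is sketched below; the notation ($x_u$, $V_i$, $N_i(v)$, $k$) is assumed here and may differ from the authors' own. Let there be $k$ populations, let $V_i$ be the markers observed in population $i$, and let $N_i(v) = \{v\} \cup \{u : r_i^2(u, v) \ge r_0\}$ be the markers that can tag $v$ in population $i$.

$$\begin{aligned}
\text{minimize} \quad & \sum_{u \in V} x_u \\
\text{subject to} \quad & \sum_{u \in N_i(v)} x_u \ge 1 && \text{for all } i = 1, \dots, k \text{ and all } v \in V_i, \\
& x_u \in \{0, 1\} && \text{for all } u \in V.
\end{aligned}$$

Here $x_u = 1$ means marker $u$ is selected as a tagSNP, and there is one covering constraint per occurrence of a marker in a population. Any feasible solution gives an upper bound on the optimum, while relaxing the constraints (for example with Lagrangian multipliers) gives a lower bound.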

14 Data Reduction Rules Pick all irreplaceable markers (example: marker 7). Remove less informative markers (example: among markers 1, 2 and 6, remove markers 1 and 2). Remove less stringent occurrences (example: between the occurrences of markers 4 and 5 in population 2, remove the occurrence of marker 4).

15 A Greedy Algorithm Step 1: apply the data reduction rules. Step 2: while there is an untagged occurrence, pick the marker that tags the most of the remaining occurrences as a tagSNP. Step 3: when no untagged occurrence remains, output the tagSNPs.
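The greedy loop can be sketched as follows. This is not the authors' GreedyTag implementation: the data reduction step is omitted, and the input layout (one symmetric adjacency dictionary per population, with every marker of that population present as a key) and the name `greedy_common_tags` are assumptions made for illustration.

```python
from typing import Dict, Hashable, List, Set, Tuple

Population = Dict[Hashable, Set[Hashable]]  # marker -> markers in LD with it (r^2 >= r0)
Occurrence = Tuple[int, Hashable]           # (population index, marker)


def greedy_common_tags(populations: List[Population]) -> List[Hashable]:
    """Greedily pick markers until every occurrence in every population is tagged."""
    # coverage[m] = occurrences that marker m would tag if selected.
    coverage: Dict[Hashable, Set[Occurrence]] = {}
    untagged: Set[Occurrence] = set()
    for i, ld in enumerate(populations):
        for v, neighbours in ld.items():
            untagged.add((i, v))
            coverage.setdefault(v, set()).add((i, v))      # v tags itself
            for u in neighbours:
                coverage.setdefault(u, set()).add((i, v))  # u tags v in population i

    tags: List[Hashable] = []
    while untagged:
        # Pick the marker tagging the most remaining occurrences
        # (ties broken by first-seen order).
        best = max(coverage, key=lambda m: len(coverage[m] & untagged))
        tags.append(best)
        untagged -= coverage[best]
    return tags


# Toy instance (hypothetical): marker 3 is in LD with marker 2 only in population 0.
pop0 = {1: {2}, 2: {1, 3}, 3: {2}}
pop1 = {1: {2}, 2: {1}, 3: set()}
print(greedy_common_tags([pop0, pop1]))  # [2, 3]
```

In the toy instance, marker 2 alone would suffice for population 0, but marker 3 must also be selected because nothing else tags it in population 1.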

16 A Lagrangian Relaxation Algorithm Step 1: introduce the Lagrangian multipliers λ and obtain the relaxed integer program. Step 2: initialize λ, obtain the tagSNP set based on λ, and set iteration := 0. Step 3: while iteration++ < max_iter, update λ towards the subgradient direction and update the tagSNP set based on λ. Step 4: output the tagSNPs.
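The loop above follows the usual pattern of subgradient optimization. The sketch below applies the textbook Lagrangian relaxation of a unit-cost set-cover problem; it is not the authors' LRTag. The diminishing step size `step0 / (it + 1)`, the greedy repair step, and all names are assumptions. The input maps each marker to the set of occurrences it can tag.

```python
import math
from typing import Dict, Hashable, List, Set, Tuple


def _complete(coverage: Dict[Hashable, Set], chosen: Set[Hashable]) -> List[Hashable]:
    """Greedily extend `chosen` until every occurrence is covered."""
    tags = list(chosen)
    uncovered = set().union(*coverage.values())
    for m in tags:
        uncovered -= coverage[m]
    while uncovered:
        best = max(coverage, key=lambda m: len(coverage[m] & uncovered))
        tags.append(best)
        uncovered -= coverage[best]
    return tags


def lagrangian_tags(coverage: Dict[Hashable, Set], max_iter: int = 100,
                    step0: float = 1.0) -> Tuple[List[Hashable], int]:
    """Return (feasible tag set, integer lower bound on the optimum)."""
    occurrences = set().union(*coverage.values())
    lam = {o: 0.0 for o in occurrences}  # one multiplier per covering constraint
    best_tags = _complete(coverage, set())
    best_lb = 0.0
    for it in range(max_iter):
        # Reduced cost of each marker under the current multipliers.
        cost = {m: 1.0 - sum(lam[o] for o in cov) for m, cov in coverage.items()}
        chosen = {m for m, c in cost.items() if c < 0.0}
        # Value of the relaxed problem: a valid lower bound for any lam >= 0.
        lb = sum(lam.values()) + sum(c for c in cost.values() if c < 0.0)
        best_lb = max(best_lb, lb)
        # Repair the relaxed solution into a feasible one; keep the smallest seen.
        tags = _complete(coverage, chosen)
        if len(tags) < len(best_tags):
            best_tags = tags
        # Subgradient step: raise lam on uncovered occurrences, lower it on over-covered ones.
        step = step0 / (it + 1)
        for o in occurrences:
            covered = sum(1 for m in chosen if o in coverage[m])
            lam[o] = max(0.0, lam[o] + step * (1 - covered))
    # The optimum is an integer, so the bound can be rounded up.
    return best_tags, math.ceil(best_lb - 1e-9)


# Toy instance (hypothetical): occurrences are (population, marker) pairs.
coverage = {
    1: {(0, 1), (0, 2), (1, 1), (1, 2)},
    2: {(0, 1), (0, 2), (0, 3), (1, 1), (1, 2)},
    3: {(0, 2), (0, 3), (1, 3)},
}
tags, lower_bound = lagrangian_tags(coverage)
print(len(tags), lower_bound)  # size of the tag set found, and the bound from below
```

The difference between the size of the returned tag set and the lower bound plays the role of the gap between LRTag and LRTag_lb: when the two agree, the tag set is provably minimum.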

17 Outline: Introduction; The MCTS Model; Our Algorithms; Experimental Results (current section)

18 Experimental Results We apply our algorithms to real HapMap data (release #19, NCBI build 34, October 2005). There are four populations in the HapMap data: CEU (European descendants), CHB (Chinese, Beijing), JPT (Japanese, Tokyo) and YRI (Yoruba people of Ibadan, Nigeria). We select tagSNPs for the following two datasets. ENCODE regions: all 10 ENCODE regions, 10,859 markers. Human genome: chromosomes 1–22, 2,862,454 markers.

19 Experimental Results for ENCODE Regions We compare our GreedyTag and LRTag with MultiPop-TagSelect (MPS). MultiPop-TagSelect first generates the tagSNPs for each single population, then combines the obtained tagSNPs for multiple populations. The gap between LRTag_lb and LRTag: with an $r^2$ threshold of 0.5, it is at most two for each region and six in total for all regions; with an $r^2$ threshold of 0.8, there is no gap.

20 Experimental Results for Human Genome The numbers of tagSNPs selected by our algorithms are almost optimal. The gap between LRTag_lb and LRTag for the whole genome (2,862,454 SNPs in total) is 1,061 with an $r^2$ threshold of 0.5 and 142 with an $r^2$ threshold of 0.8.

21 Running Time of Our Algorithms Running environment: a 32-processor SGI Altix 4700 supercomputer system with 1.6 GHz CPUs and 64 GB of shared memory, running 15 threads in parallel. Running time with an $r^2$ threshold of 0.5: for the ENCODE regions, < 7 seconds for each region and < 1 minute for all regions; for the human genome, < 12 minutes for each chromosome and < 1 hour for the whole genome. With an $r^2$ threshold above 0.5, our algorithms run faster than the times above.

22 Outline: Introduction; The MCTS Model; Our Algorithms; Experimental Results

23 Thanks for your time and attention!

