

Efficient Algorithms for Genome-wide TagSNP Selection across Populations via the Linkage Disequilibrium Criterion Authors: Lan Liu, Yonghui Wu, Stefano Lonardi and Tao Jiang

Outline
- Introduction
- The MCTS Model
- Our Algorithms
- Experimental Result

Outline (current section: Introduction)

Motivation
With the rapid development of genotyping technologies, there are now more than 10 million verified single-nucleotide polymorphisms (SNPs) in the dbSNP database. We aim to select a subset of informative SNPs (i.e., tagSNPs) in order to:
- save the cost of genotyping all SNPs;
- perform disease association mapping.

TagSNP Selection
- Haplotype-based methods: require the phased multilocus haplotypes.
- Haplotype-free methods: do not require haplotype information. Example: tagSNP selection via the $r^2$ linkage disequilibrium statistic.

$r^2$ Linkage Disequilibrium Statistics
Given a pair of genetic markers 1 and 2, the $r^2$ statistic is

$$ r^2 = \frac{(p_{AB} - p_{A\cdot}\, p_{\cdot B})^2}{p_{A\cdot}\,(1 - p_{A\cdot})\; p_{\cdot B}\,(1 - p_{\cdot B})} $$

If $r^2$ is no less than a given threshold $r_0$, marker 1 can tag marker 2, and marker 2 can tag marker 1.
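The formula above can be evaluated directly from three frequencies. The sketch below is only an illustration: the function names `r_squared` and `can_tag` and the argument names are assumptions, not part of the slides.

```python
def r_squared(p_ab: float, p_a: float, p_b: float) -> float:
    """r^2 between two biallelic markers.

    p_ab: frequency of haplotypes carrying allele A at marker 1 and B at marker 2
    p_a:  frequency of allele A at marker 1 (p_A. on the slide)
    p_b:  frequency of allele B at marker 2 (p_.B on the slide)
    """
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        # A monomorphic marker makes the denominator vanish.
        raise ValueError("r^2 is undefined when a marker is monomorphic")
    d = p_ab - p_a * p_b  # deviation from independence
    return d * d / denom


def can_tag(p_ab: float, p_a: float, p_b: float, r0: float) -> bool:
    """True when the pair meets the threshold r0, so either marker can tag the other."""
    return r_squared(p_ab, p_a, p_b) >= r0
```

With `p_ab = p_a = p_b = 0.5` the two markers always co-occur and `r_squared` returns 1; with `p_ab = 0.25` they are independent and it returns 0.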

The TagSNP Selection Problem
- Instance: a set $V$ of SNP markers and LD patterns $E = \{ (v_{j1}, v_{j2}) \mid r^2(v_{j1}, v_{j2}) \ge r_0,\ v_{j1}, v_{j2} \in V \}$, where $r_0$ is a given threshold.
- Feasible solution: a subset $V' \subseteq V$ such that for any $v \in V$ there exists a $v' \in V'$ with $r^2(v, v') \ge r_0$.
- Objective: minimize $|V'|$.
If we define $G = (V, E)$, a tagSNP set is equivalent to a dominating set of $G$.
[Figure: (a) SNP markers and their LD patterns in a population; (b) tagSNPs for the population.]
This model was introduced by Carlson et al. It is a simple and popular tagging method.
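To make the dominating-set view concrete, here is a minimal sketch that builds the LD graph from pairwise $r^2$ values and checks whether a candidate set tags every marker. The helper names and the toy $r^2$ values are hypothetical.

```python
def build_ld_graph(markers, r2, r0):
    """Map each marker to its closed neighborhood: itself plus every
    marker whose pairwise r^2 with it is at least r0."""
    adj = {v: {v} for v in markers}
    for (u, v), value in r2.items():
        if value >= r0:
            adj[u].add(v)
            adj[v].add(u)
    return adj


def is_tag_set(adj, tags):
    """A set is a valid tagSNP set (dominating set) when every marker
    has at least one chosen marker in its closed neighborhood."""
    tags = set(tags)
    return all(adj[v] & tags for v in adj)


# Toy instance (made-up values): only the pairs (1,2) and (2,3) reach r0 = 0.8.
r2 = {(1, 2): 0.9, (2, 3): 0.85, (3, 4): 0.3}
adj = build_ld_graph([1, 2, 3, 4], r2, r0=0.8)
```

In this toy instance `{2, 4}` is a valid tagSNP set, while `{2}` is not because marker 4 has no LD partner above the threshold.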

Outline (current section: The MCTS Model)

$r^2$ Statistics in Single and Admixed Populations
[Figure: haplotype tables for SNP 1 (alleles A, a) and SNP 2 (alleles B, b), each with its $r^2$ value, for Population 1, Population 2, and an admixed population (50% population 1, 50% population 2).]

TagSNP Selection across Populations
- A pair of SNPs that have remarkably different marker frequencies and very weak LD in two populations with different evolutionary histories may show strong LD in the admixed population.
- TagSNPs picked from the admixed population, or from only one of the populations, might not be sufficient to capture the variation in all populations.
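A small numeric illustration of the first point, using made-up frequencies (not the values from the slide figure): two populations in which the SNPs are independent, but with very different allele frequencies, produce substantial $r^2$ once their haplotypes are pooled 50/50. The names `r_squared` and `mix` are assumptions.

```python
def r_squared(p_ab, p_a, p_b):
    """r^2 from the haplotype frequency p_ab and the allele frequencies p_a, p_b."""
    d = p_ab - p_a * p_b
    return d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))


def mix(pop1, pop2, weight=0.5):
    """Frequencies in a pooled sample are weighted averages of the
    per-population frequencies."""
    return tuple(weight * x + (1 - weight) * y for x, y in zip(pop1, pop2))


# (p_AB, p_A, p_B) per population; in both, p_AB = p_A * p_B (no LD).
pop1 = (0.81, 0.9, 0.9)
pop2 = (0.01, 0.1, 0.1)
admixed = mix(pop1, pop2)

print(r_squared(*pop1), r_squared(*pop2), r_squared(*admixed))
```

Here $r^2$ is essentially 0 within each population but about 0.41 in the pooled sample, so a tag relationship inferred from the admixed data would not hold in either population.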

The MCTS Model
Given a set of SNP markers and their LD patterns in multiple populations, we want to find a minimum set of tagSNPs that is a valid tagSNP set in every one of the populations. This problem is called the minimum common tagSNP selection problem (MCTS).
[Figure: (a) SNP markers and their LD patterns in two populations; (b) the minimum tagSNP set for these two populations.]
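For very small instances the MCTS definition can be checked by exhaustive search. The sketch below is exponential and only meant to pin down the definition; the representation (one dict of closed LD neighborhoods per population) and all names are assumptions.

```python
from itertools import combinations


def is_common_tag_set(populations, tags):
    """True when, in every population, every marker has a chosen marker
    in its closed LD neighborhood."""
    tags = set(tags)
    return all(adj[v] & tags for adj in populations for v in adj)


def brute_force_mcts(populations):
    """Try candidate sets in order of increasing size and return the first valid one."""
    markers = sorted(set().union(*populations))
    for k in range(len(markers) + 1):
        for candidate in combinations(markers, k):
            if is_common_tag_set(populations, candidate):
                return set(candidate)


# Toy instance: marker 3 is in LD with marker 2 in population 0 only.
populations = [
    {1: {1, 2}, 2: {1, 2, 3}, 3: {2, 3}},  # population 0
    {1: {1, 2}, 2: {1, 2}, 3: {3}},        # population 1
]
common = brute_force_mcts(populations)
```

In this toy instance `{2}` alone tags population 0, but it leaves marker 3 untagged in population 1, so the common set needs two markers.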

Outline (current section: Our Algorithms)

Our Algorithms
The MCTS problem can easily be formulated as an integer linear program. We first apply some data reduction rules, then use one of the following algorithms:
- a greedy algorithm: GreedyTag;
- a Lagrangian relaxation algorithm: LRTag.
We calculate:
- the upper bound: the number of tagSNPs obtained by our algorithms (GreedyTag, LRTag);
- the lower bound: a bound on the minimum number of tagSNPs needed (GreedyTag_lb, LRTag_lb).
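The slides do not spell out the integer linear program. One natural formulation, given here as an assumption, uses a binary variable $x_v$ per marker and one covering constraint per occurrence of a marker in a population, where $N_i[v]$ denotes the closed neighborhood of $v$ in the LD graph of population $i$ (notation introduced here for illustration):

```latex
\begin{aligned}
\min\ & \sum_{v \in V} x_v \\
\text{s.t.}\ & \sum_{u \in N_i[v]} x_u \ge 1
  && \text{for every population } i \text{ and every marker } v \text{ occurring in population } i, \\
& x_v \in \{0, 1\} && \text{for every } v \in V.
\end{aligned}
```

Under this reading, any feasible solution gives an upper bound on the optimum, and any relaxation of the program gives a lower bound, which is how the gap between the two can be reported.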

Data Reduction Rules
- Pick all irreplaceable markers. Example: marker 7.
- Remove less informative markers. Example: among markers 1, 2 and 6, remove markers 1 and 2.
- Remove less stringent occurrences. Example: between the occurrences of markers 4 and 5 in population 2, remove the occurrence of marker 4.
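The slide figure that these examples refer to is not available, so the sketch below rests on one possible reading of the first and third rules in covering terms: an occurrence (population, marker) is forced when only one marker can tag it, and an occurrence is less stringent when its set of possible taggers strictly contains that of another occurrence. The function names and the toy data are hypothetical.

```python
def occurrences(populations):
    """Map each occurrence (population index, marker) to the set of
    markers that can tag it (its closed LD neighborhood)."""
    return {(i, v): set(nbrs)
            for i, adj in enumerate(populations)
            for v, nbrs in adj.items()}


def irreplaceable_markers(occ):
    """Rule 1, as read here: a marker must be picked when it is the only
    possible tagger of some occurrence."""
    return {next(iter(taggers)) for taggers in occ.values() if len(taggers) == 1}


def drop_less_stringent(occ):
    """Rule 3, as read here: drop an occurrence whose tagger set strictly
    contains another occurrence's tagger set, since tagging the stricter
    occurrence automatically tags this one."""
    return {key: taggers for key, taggers in occ.items()
            if not any(other < taggers for other in occ.values())}


populations = [
    {1: {1, 2}, 2: {1, 2, 3}, 3: {2, 3}},  # population 0
    {1: {1, 2}, 2: {1, 2}, 3: {3}},        # population 1
]
occ = occurrences(populations)
forced = irreplaceable_markers(occ)
reduced = drop_less_stringent(occ)
```

In this toy instance marker 3 is forced by its isolated occurrence in population 1, and the occurrences of markers 2 and 3 in population 0 are dropped because stricter occurrences already imply them.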

A Greedy Algorithm
1. Apply the data reduction rules.
2. While there is an untagged occurrence, pick as a tagSNP the marker that tags the most remaining occurrences.
3. Output the tagSNPs.
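The loop above can be sketched as follows. This is an illustration, not the authors' GreedyTag: it skips the data reduction step, breaks ties by marker order, and assumes each population is given as a dict of closed LD neighborhoods.

```python
def greedy_tag(populations):
    """Repeatedly pick the marker that tags the most untagged occurrences."""
    untagged = {(i, v) for i, adj in enumerate(populations) for v in adj}
    markers = sorted({v for adj in populations for v in adj})

    def gain(marker):
        # Number of still-untagged occurrences this marker would tag.
        return sum(1
                   for i, adj in enumerate(populations) if marker in adj
                   for u in adj[marker] if (i, u) in untagged)

    tags = set()
    while untagged:
        best = max(markers, key=gain)
        tags.add(best)
        for i, adj in enumerate(populations):
            for u in adj.get(best, ()):
                untagged.discard((i, u))
    return tags


populations = [
    {1: {1, 2}, 2: {1, 2, 3}, 3: {2, 3}},  # population 0
    {1: {1, 2}, 2: {1, 2}, 3: {3}},        # population 1
]
tags = greedy_tag(populations)
```

Because every occurrence can tag itself, the best marker always tags at least one new occurrence, so the loop terminates.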

A Lagrangian Relaxation Algorithm
1. Introduce the Lagrangian multipliers λ.
2. Obtain the relaxed integer program.
3. Initialize λ.
4. Obtain the tagSNP set based on λ.
5. Set iteration := 0.
6. While iteration++ < max_iter: update λ in the subgradient direction, then update the tagSNP set based on λ.
7. Output the tagSNPs.
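The flow above matches the usual subgradient scheme for covering-type integer programs. The sketch below is a textbook version under stated assumptions, not the authors' LRTag: one multiplier per occurrence, a diminishing step size `step0 / (iteration + 1)`, and a simple repair-and-prune heuristic to turn the relaxed solution into a valid tagSNP set. All names are hypothetical.

```python
import math


def lr_tag(populations, max_iter=100, step0=1.0):
    """Return (tagSNP set, integer lower bound on the optimum)."""
    # One covering row per occurrence: the markers able to tag it.
    rows = [frozenset(nbrs) for adj in populations for nbrs in adj.values()]
    markers = sorted(set().union(*rows))
    covers = {m: [j for j, row in enumerate(rows) if m in row] for m in markers}

    def uncovered(tags):
        return [j for j, row in enumerate(rows) if not (row & tags)]

    lam = [0.0] * len(rows)
    best_tags, best_lb = set(markers), 0.0
    for iteration in range(max_iter):
        # Relaxed problem: pick every marker whose reduced cost is negative.
        cost = {m: 1.0 - sum(lam[j] for j in covers[m]) for m in markers}
        relaxed = {m for m in markers if cost[m] < 0}
        best_lb = max(best_lb, sum(lam) + sum(cost[m] for m in relaxed))

        # Repair: add cheap markers until every occurrence is tagged.
        tags = set(relaxed)
        for m in sorted(markers, key=cost.get):
            if any(j in covers[m] for j in uncovered(tags)):
                tags.add(m)
        # Prune: drop markers that became redundant.
        for m in sorted(tags, key=cost.get, reverse=True):
            if not uncovered(tags - {m}):
                tags.remove(m)
        if len(tags) < len(best_tags):
            best_tags = tags

        # Subgradient step on the multipliers, kept non-negative.
        step = step0 / (iteration + 1)
        for j, row in enumerate(rows):
            lam[j] = max(0.0, lam[j] + step * (1 - len(row & relaxed)))

    # The objective is an integer, so the bound can be rounded up.
    return best_tags, math.ceil(best_lb - 1e-9)


populations = [
    {1: {1, 2}, 2: {1, 2, 3}, 3: {2, 3}},  # population 0
    {1: {1, 2}, 2: {1, 2}, 3: {3}},        # population 1
]
tags, lower_bound = lr_tag(populations)
```

The returned set is always feasible, so its size is an upper bound, while the Lagrangian value is a lower bound for any non-negative λ; the difference between the two is the kind of gap reported in the experiments.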

Outline (current section: Experimental Result)

Experimental Result
We apply our algorithms to real HapMap data (release #19, NCBI build 34, October 2005). There are four populations in the HapMap data:
- CEU: European descendants.
- CHB: Chinese, Beijing.
- JPT: Japanese, Tokyo.
- YRI: Yoruba people of Ibadan, Nigeria.
We obtain tagSNPs for the following two datasets:

Dataset          Coverage                 Markers
ENCODE regions   all 10 ENCODE regions    10,859
Human genome     chromosomes 1–22         2,862,454

Experimental Result for ENCODE Regions
- We compare our GreedyTag and LRTag with MultiPop-TagSelect (MPS).
- MultiPop-TagSelect first generates the tagSNPs for each single population, then combines the obtained tagSNPs for multiple populations.
- The gap between LRTag_lb and LRTag:
  - $r^2 = 0.5$: at most two for each region, six in total over all regions.
  - $r^2 = 0.8$: there is no gap.

Experimental Result for Human Genome
The numbers of tagSNPs selected by our algorithms are almost optimal.
- The gap between LRTag_lb and LRTag for the whole genome (2,862,454 SNPs in total):
  - $r^2 = 0.5$: 1,061
  - $r^2 = 0.8$: 142

Running Time of Our Algorithms
Running environment:
- a 32-processor SGI Altix 4700 supercomputer system;
- 1.6 GHz CPUs;
- 64 GB shared memory;
- 15 threads in parallel.
Running time for $r^2 = 0.5$:
- ENCODE regions: < 7 seconds for each region, < 1 minute for all regions.
- Human genome: < 12 minutes for each chromosome, < 1 hour for the genome.
For $r^2 > 0.5$, our algorithms run faster than the times above.


Thanks for your time and attention!