Phasing of 2-SNP Genotypes Based on Non-Random Mating Model

Dumitru Brinza, Alexander Zelikovsky
Department of Computer Science, Georgia State University, Atlanta, USA

Outline
- Molecular biology terms
- Motivation
- Problem formulation
- Previous work
- Our contribution
- Phasing of 2-SNP genotypes
- Phasing of a complete genotype
- Results

Molecular biology terms
- Human genome: all the genetic material in the chromosomes, about $3 \times 10^9$ base pairs long. Differences between any two people occur in about 0.1% of the genome.
- SNP (single nucleotide polymorphism): a site where two or more different nucleotides occur in a large percentage of the population.
- Genotype: the entire genetic identity of an individual, including alleles, SNPs, or gene forms (e.g., AC CT TG AA AC TG).
- Haplotype: a single set of chromosomes, i.e. half of the full set of genetic material (e.g., A C T A A T).
- A genotype is a mixture of two haplotypes.

From ACTG to 0, 1, 2 notation
- Haplotypes: wild-type SNPs are denoted 0 and mutated SNPs are denoted 1.
- Genotypes: homozygous SNPs are denoted 0 or 1 (a mixture of 00 or of 11); heterozygous SNPs are denoted 2 (a mixture of 01 or of 10).
- Each individual carries two haplotypes, which together give the genotype of the individual.
(Figure: two haplotypes of one individual and the resulting genotype, with a homozygous and a heterozygous SNP marked.)
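The 0, 1, 2 encoding can be illustrated with a minimal Python sketch. The function name `genotype_from_haplotypes` and the example vectors are illustrative assumptions, not part of the original slides.

```python
def genotype_from_haplotypes(h1, h2):
    """Combine two 0/1 haplotypes into a 0/1/2 genotype.

    A site where both haplotypes agree is homozygous and keeps that
    allele (0 or 1); a site where they differ is heterozygous (2).
    """
    if len(h1) != len(h2):
        raise ValueError("haplotypes must have the same number of SNPs")
    return [a if a == b else 2 for a, b in zip(h1, h2)]


# Hypothetical example: sites 3 and 4 differ between the haplotypes.
print(genotype_from_haplotypes([0, 1, 1, 0], [0, 1, 0, 1]))
```

Note that the mapping loses information: the genotype does not record which haplotype carried which allele at a heterozygous site, which is exactly what phasing has to recover.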

Motivation
- Haplotypes may contain a large number of genetic markers that are responsible for human disease.
- Haplotypes may increase the power of association between marker loci and phenotypic traits.
- An evolutionary tree can be reconstructed from haplotypes.
- Physical phasing (experimental haplotype inference) is too expensive.
- There is a great need for computational methods that extract haplotype information from the given genotype information.
- Existing methods are either extremely slow or less accurate for genome-wide studies.

Phasing problem (haplotype inference)
- Haplotype inference, or genotype phasing, is the resolution of a genotype into two haplotypes.
- Given: n genotype vectors with entries 0, 1 or 2.
- Find: n pairs of haplotype vectors, one pair per genotype, that explain the genotypes.
- An individual genotype with h heterozygous sites is explained by $2^{h-1}$ possible haplotype pairs (h = 20k genome-wide).
- In addition, around 10% of the data are missing.
- The problem is hopeless without a genetic model.
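The count of $2^{h-1}$ candidate pairs can be checked by brute force on a small genotype. The sketch below is only an illustration of the counting argument (the name `haplotype_pairs` is an assumption); it ignores missing data and is not a practical phasing method.

```python
from itertools import product


def haplotype_pairs(genotype):
    """Enumerate all unordered haplotype pairs that explain a 0/1/2 genotype."""
    het = [k for k, g in enumerate(genotype) if g == 2]
    if not het:
        h = tuple(genotype)
        return [(h, h)]
    pairs = []
    # Fix the first heterozygous site to 0 in the first haplotype so that
    # (h1, h2) and (h2, h1) are not counted twice: 2^(h-1) assignments remain.
    for bits in product((0, 1), repeat=len(het) - 1):
        assignment = (0,) + bits
        h1, h2 = list(genotype), list(genotype)
        for site, allele in zip(het, assignment):
            h1[site] = allele
            h2[site] = 1 - allele
        pairs.append((tuple(h1), tuple(h2)))
    return pairs


# Three heterozygous sites give 2^(3-1) = 4 candidate pairs.
print(len(haplotype_pairs([2, 1, 2, 0, 2])))
```

The exponential growth in h is why the enumeration is only feasible for a handful of heterozygous sites.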

Previous work
- PHASE: Bayesian statistical method (Stephens et al., 2001, 2003)
- HAPLOTYPER: Monte Carlo approach (Niu et al., 2002)
- Phamily: phases trio families based on PHASE (Ackerman et al., 2003)
- GERBIL: statistical method using maximum likelihood (ML), MST and expectation-maximization (EM) (Kimmel and Shamir, 2005)
- SNPHAP: uses ML/EM assuming Hardy-Weinberg equilibrium (Clayton et al., 2004)

Contribution
- We explore the phasing of genotypes with 2 SNPs, which is ambiguous when both sites are heterozygous. There are two possible phasings, and the phasing problem reduces to inferring their frequencies.
- Given the phasing solution for 2-SNP genotypes, we propose an algorithm that infers the complete haplotypes for a given genotype. It is based on the maximum spanning tree of a complete graph whose vertices correspond to heterozygous sites and whose edge weights are given by the inferred 2-SNP frequencies.
- Extensive experimental validation of the proposed methods and comparison with previously known methods.

Phasing of 2-SNP genotypes
- At least one SNP is homozygous: the phasing is well defined. Example: 21 = 01 + 11.
- Both SNPs are heterozygous: the phasing is ambiguous.
  - Cis- phasing: 22 = 00 + 11
  - Trans- phasing: 22 = 01 + 10
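The case split above can be written out directly. This is a minimal sketch; `phase_two_snp` is an illustrative name, and the function simply lists the candidate haplotype pairs without choosing between cis and trans.

```python
def phase_two_snp(genotype):
    """Return the candidate haplotype pairs for a 2-SNP genotype (gi, gj).

    One candidate is returned when at least one SNP is homozygous;
    two candidates (cis first, then trans) when both are heterozygous.
    """
    gi, gj = genotype
    if gi != 2 and gj != 2:
        return [((gi, gj), (gi, gj))]
    if gi == 2 and gj != 2:
        return [((0, gj), (1, gj))]
    if gi != 2 and gj == 2:
        return [((gi, 0), (gi, 1))]
    cis = ((0, 0), (1, 1))
    trans = ((0, 1), (1, 0))
    return [cis, trans]


print(phase_two_snp((2, 1)))  # the slide's example: 21 = 01 + 11
print(phase_two_snp((2, 2)))  # ambiguous: cis or trans
```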

Certainty of cis- or trans- phasing
- Normally: the odds ratio of being phased cis- or trans-.
- A modified odds ratio better describes cis- or trans- phasing.
- LD (linkage disequilibrium) between endpoints i and j.

Certainty of cis- or trans- phasing
- LD is higher between pairs of SNPs that are closer together.
- We discard spurious LD found between non-linked SNPs that are far apart.
- The logarithm provides the sign:
  - $c_{ij} \le 0$ means cis- phasing with certainty $|c_{ij}|$ (sites i, j: 22 = 00 + 11)
  - $c_{ij} > 0$ means trans- phasing with certainty $|c_{ij}|$ (sites i, j: 22 = 01 + 10)
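The exact formula for $c_{ij}$ is not preserved in this transcript. The sketch below therefore assumes a plain log odds ratio, oriented to match the stated sign convention (non-positive for cis, positive for trans). The function names and the pseudocount `eps` are also assumptions, and the "modified" odds ratio mentioned on the slide is not reproduced here.

```python
import math


def certainty(f00, f01, f10, f11, eps=1e-6):
    """Assumed form: log of (trans product) / (cis product).

    Negative when haplotypes 00 and 11 dominate (cis), positive when
    01 and 10 dominate (trans). eps avoids division by zero and log(0).
    """
    return math.log(((f01 + eps) * (f10 + eps)) / ((f00 + eps) * (f11 + eps)))


def call_phase(c):
    """Apply the sign convention from the slide: c <= 0 is cis, c > 0 is trans."""
    return ("cis" if c <= 0 else "trans", abs(c))


# Hypothetical frequencies where 00 and 11 dominate: expected call is cis.
print(call_phase(certainty(0.4, 0.1, 0.1, 0.4)))
```

With equal frequencies the value is 0, i.e. the data give no evidence either way.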

Certainty of cis- or trans- phasing
- n: number of genotypes
- F00, F01, F10, F11: true haplotype frequencies (observed + expected in 22)
- Counting example; the count updates are consistent with i and j being the 3rd and 6th sites:

Genotype               (g_i, g_j)   Count update
? 1 0 2 1 1 0 1 0 1    (0, 1)       #01 + 2
1 1 0 0 1 0 0 2 0 1    (0, 0)       #00 + 2
0 1 2 0 1 2 0 1 0 1    (2, 2)       (#00 + 1, #11 + 1) or (#01 + 1, #10 + 1)
2 1 1 0 1 1 0 ? 0 1    (1, 1)       #11 + 2
0 1 1 0 1 2 0 0 2 1    (1, 2)       #10 + 1, #11 + 1
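The counting of observed 2-SNP haplotypes can be sketched as follows. The function name is hypothetical, missing sites (`?`) are encoded as `None`, and sites i and j are taken to be the 3rd and 6th positions, an inference from the listed count updates rather than something the slide states. Ambiguous 22 genotypes are only tallied here; distributing them as expected counts is left out.

```python
from collections import Counter


def count_pair_haplotypes(genotypes, i, j):
    """Count haplotypes 00/01/10/11 at sites (i, j) from unambiguous genotypes."""
    counts = Counter({"00": 0, "01": 0, "10": 0, "11": 0})
    ambiguous = 0  # genotypes that are 2 at both sites
    for g in genotypes:
        gi, gj = g[i], g[j]
        if gi is None or gj is None:
            continue  # missing data contributes nothing
        if gi == 2 and gj == 2:
            ambiguous += 1
            continue
        alleles_i = (0, 1) if gi == 2 else (gi, gi)
        alleles_j = (0, 1) if gj == 2 else (gj, gj)
        # Each genotype contributes its two haplotypes.
        for a, b in zip(alleles_i, alleles_j):
            counts[f"{a}{b}"] += 1
    return dict(counts), ambiguous


genotypes = [
    [None, 1, 0, 2, 1, 1, 0, 1, 0, 1],
    [1, 1, 0, 0, 1, 0, 0, 2, 0, 1],
    [0, 1, 2, 0, 1, 2, 0, 1, 0, 1],
    [2, 1, 1, 0, 1, 1, 0, None, 0, 1],
    [0, 1, 1, 0, 1, 2, 0, 0, 2, 1],
]
print(count_pair_haplotypes(genotypes, 2, 5))
```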

Haplotype frequencies in 22
- The haplotype frequencies in 22 genotypes are chosen to best fit Hardy-Weinberg equilibrium (HWE), adjusted to the observed deviation in the single-site genotype distribution.
- HWE for two SNPs:

$$(F_{00}+F_{01}+F_{10}+F_{11})^2 = F_{00}^2 + F_{01}^2 + F_{10}^2 + F_{11}^2 + 2F_{00}F_{01} + 2F_{00}F_{10} + 2F_{00}F_{11} + 2F_{01}F_{10} + 2F_{01}F_{11} + 2F_{10}F_{11}$$

  The terms give the genotype frequencies: $G_{00} = F_{00}^2$, $G_{01} = F_{01}^2$, $G_{10} = F_{10}^2$, $G_{11} = F_{11}^2$, $G_{02} = 2F_{00}F_{01}$, $G_{20} = 2F_{00}F_{10}$, $G_{22} = 2F_{00}F_{11} + 2F_{01}F_{10}$, $G_{21} = 2F_{01}F_{11}$, $G_{12} = 2F_{10}F_{11}$.
- Observed deviation from HWE in one SNP:

$$(F_0+F_1)(F_0+F_1+2x) = (F_0+x)^2 + (F_1+x)^2 + 2(F_0F_1-x^2)$$

  The three terms on the right correspond to genotypes 0, 1 and 2 and are labelled $xG_0$, $yG_1$ and $zG_2$ on the slide.
- HWE deviation in 2 SNPs is based on the HWE deviation in each SNP: the terms of the two-SNP expansion are labelled with products of the single-SNP labels, $xxG_{00}$, $xyG_{01}$, $yxG_{10}$, $yyG_{11}$, $xzG_{02}$, $zxG_{20}$, $zzG_{22}$, $zyG_{21}$, $yzG_{12}$.
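The two-SNP HWE expansion can be checked numerically. The sketch below (function name and example frequencies are assumptions) builds the nine genotype frequencies by summing over ordered haplotype pairs, so each pair of distinct haplotypes is counted twice, matching the factor of 2 in the cross terms. It covers only the plain HWE case, not the deviation adjustment.

```python
def hwe_two_snp_genotypes(F):
    """Map haplotype frequencies F['00'..'11'] to HWE genotype frequencies G."""
    G = {}
    haplotypes = ["00", "01", "10", "11"]
    for a in haplotypes:
        for b in haplotypes:
            # Per-site genotype: same allele stays, different alleles become 2.
            g = "".join(x if x == y else "2" for x, y in zip(a, b))
            G[g] = G.get(g, 0.0) + F[a] * F[b]
    return G


# Hypothetical haplotype frequencies summing to 1.
F = {"00": 0.4, "01": 0.1, "10": 0.2, "11": 0.3}
G = hwe_two_snp_genotypes(F)
print(G["22"], sum(G.values()))
```

Only the 22 genotype mixes two different haplotype pairs (00 + 11 and 01 + 10), which is why it is the only ambiguous case.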

Phasing of a complete genotype
- The genotype graph for a genotype g is a complete graph G(g) where:
  - vertices = heterozygous SNPs in g
  - weight w(i, j) of edge (i, j) = likelihood of cis- or trans- phasing
- Phasing of 2 heterozygous SNPs:
  - cis- edge: 22 = 00 + 11
  - trans- edge: 22 = 01 + 10
- Graph coloring: color all vertices in two colors such that any 2 vertices connected by a cis- edge have the same color, and any 2 vertices connected by a trans- edge have opposite colors.
- Example with heterozygous sites a, b, c, d:

              a   b     c   d
Genotype      2 1 2 0 1 2 0 2 0 1
Haplotype #1  1 1 0 0 1 0 0 1 0 1
Haplotype #2  0 1 1 0 1 1 0 0 0 1

Phasing of a complete genotype
- Graph coloring: conflicts are resolved using a Maximum Spanning Tree (MST).
(Figure: genotype graph with edge weights and its maximum spanning tree.)
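The spanning-tree step can be sketched as below. This is an illustration under stated assumptions, not the authors' implementation: it uses Kruskal's algorithm with a union-find, takes $|c_{ij}|$ as the edge weight, and treats $c_{ij} > 0$ as a trans- edge. The function name and the certainty values in the example are hypothetical; one of them deliberately conflicts with the rest to show that low-weight edges outside the tree are ignored.

```python
def phase_genotype(genotype, certainty):
    """Phase a 0/1/2 genotype from pairwise certainties {(a, b): c_ab}."""
    het = [k for k, g in enumerate(genotype) if g == 2]
    parent = {v: v for v in het}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    # Kruskal on edges sorted by decreasing |c| gives a maximum spanning tree.
    tree = {v: [] for v in het}
    for (a, b), c in sorted(certainty.items(), key=lambda kv: -abs(kv[1])):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb
            tree[a].append((b, c > 0))
            tree[b].append((a, c > 0))

    # Two-color the tree: cis keeps the color, trans flips it.
    color = {}
    for root in het:
        if root in color:
            continue
        color[root] = 0
        stack = [root]
        while stack:
            v = stack.pop()
            for w, is_trans in tree[v]:
                if w not in color:
                    color[w] = color[v] ^ int(is_trans)
                    stack.append(w)

    h1 = [color[k] if g == 2 else g for k, g in enumerate(genotype)]
    h2 = [1 - color[k] if g == 2 else g for k, g in enumerate(genotype)]
    return h1, h2


genotype = [2, 1, 2, 0, 1, 2, 0, 2, 0, 1]  # heterozygous at indices 0, 2, 5, 7
certainty = {(0, 2): 2.0, (2, 5): -3.0, (5, 7): 1.5,
             (0, 7): -0.5, (0, 5): 0.2, (2, 7): -0.1}  # hypothetical values
print(phase_genotype(genotype, certainty))
```

A tree has no cycles, so the coloring is always consistent on the tree edges; the weak, conflicting edge (2, 7) never enters the tree.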

2SNP algorithm
1. Collect statistics on haplotype/genotype frequencies for every pair of SNPs.
2. For each pair of SNPs, compute weights reflecting the likelihood of trans- or cis- phasing.
3. For each genotype g:
   - find the Maximum Spanning Tree of the complete graph G(g) whose vertices are the heterozygous sites;
   - color the vertices of G(g) and phase based on the colors.
4. Missing data recovery: recover each missing site from the closest haplotype (by Hamming distance) in which that site is phased.

Runtime (two bottlenecks)
- $O(nm)$: computing haplotype frequencies for 20×m pairs of SNPs in each genotype, where n is the number of genotypes and m is the number of SNPs.
- $O(n^2 m)$: missing data recovery, which compares the n genotypes by Hamming distance.
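The missing data recovery step can be sketched as a nearest-neighbor fill. The details below are assumptions: missing sites are encoded as `None`, the Hamming distance is computed only over sites known in both haplotypes, and ties go to the first donor found. The function names are illustrative.

```python
def hamming(a, b):
    """Hamming distance over the sites that are known in both haplotypes."""
    return sum(1 for x, y in zip(a, b)
               if x is not None and y is not None and x != y)


def recover_missing(haplotypes):
    """Fill each missing site from the closest haplotype that has it phased."""
    recovered = [list(h) for h in haplotypes]
    for idx, h in enumerate(haplotypes):
        for site, value in enumerate(h):
            if value is not None:
                continue
            donors = [d for k, d in enumerate(haplotypes)
                      if k != idx and d[site] is not None]
            if donors:
                best = min(donors, key=lambda d: hamming(h, d))
                recovered[idx][site] = best[site]
    return recovered


# Hypothetical data: the first haplotype is closest to the second one.
print(recover_missing([[0, 1, None, 1], [0, 1, 1, 1], [1, 0, 0, 0]]))
```

Comparing every haplotype against every other over m sites is what produces the quadratic dependence on n in the runtime.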

Datasets
- Chromosome 5q31: 129 genotypes with 103 SNPs derived from the 616 kb region of human chromosome 5q31 (Daly et al., 2001).
- Yoruba population (D): 30 genotypes with SNPs from 51 various genomic regions, with the number of SNPs per region ranging from 13 to 114 (Gabriel et al., 2002).
- Random matching 5q31: 128 genotypes, each with 89 SNPs from the 5q31 cytokine gene, generated by random matching from 64 haplotypes of 32 West African individuals (Hull et al., 2004).
- HapMap datasets: 30 genotypes of Utah residents and Yoruba residents available on HapMap as of December 2005. The number of SNPs varies from 52 to 1381 across 40 regions, including ENm010, ENm013, ENr112, ENr113 and ENr123, which span 500 kb regions of chromosome bands 7p15.2, 7q21.13, 2p16.3, 4q26 and 12q12 respectively, and two regions spanning the genes STEAP and TRPM8 plus 10 kb upstream and downstream.

Phasing validation for unrelated individuals
- Phasing methods can be validated on simulated data, where the haplotypes are known.
- Validation on real data is usually performed on trio data, where the offspring haplotypes are mostly known (inferred from the parents' haplotypes).

Error types
- Single-site error: the number of SNPs in the phased offspring haplotypes that differ from the SNPs inferred from trio data, divided by (total number of SNPs) × (total number of haplotypes).
- Individual error: the number of offspring genotypes phased with at least one single-site error, divided by the total number of genotypes.
- Switching error: the minimum number of switches that must be made in the pair of haplotypes of a phased offspring genotype so that both haplotypes coincide with the haplotypes inferred from trio data, divided by the total number of heterozygous positions in the offspring genotypes.
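The switching error can be sketched as follows, assuming that the inferred pair is consistent with the genotype, so that tracking one inferred haplotype at the heterozygous sites is enough. The function names and example vectors are illustrative. The denominator follows the definition above (total heterozygous positions).

```python
def switch_count(inferred_h1, true_h1, true_h2):
    """Return (minimum number of switches, number of heterozygous sites)."""
    het = [k for k, (a, b) in enumerate(zip(true_h1, true_h2)) if a != b]
    # At each heterozygous site, does the inferred haplotype follow true_h1?
    follows = [inferred_h1[k] == true_h1[k] for k in het]
    # A switch is needed wherever the followed haplotype changes between
    # consecutive heterozygous sites; the overall orientation is free.
    switches = sum(1 for x, y in zip(follows, follows[1:]) if x != y)
    return switches, len(het)


def switching_error(cases):
    """cases: iterable of (inferred_h1, true_h1, true_h2) per offspring."""
    total_switches = total_het = 0
    for inferred_h1, true_h1, true_h2 in cases:
        s, h = switch_count(inferred_h1, true_h1, true_h2)
        total_switches += s
        total_het += h
    return total_switches / total_het if total_het else 0.0


# Hypothetical example: one switch between the 2nd and 3rd heterozygous sites.
print(switch_count([0, 0, 1, 1, 1], [0, 0, 0, 0, 1], [1, 1, 1, 1, 1]))
```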

Results

Conclusion
Entire genome (30 trios from HapMap), average errors:
- Single-site: 3.3%
- Switching: 8.8%

#SNPs    Runtime
1.5K     2 sec
2.5K     8 sec
5.0K     25 sec
10.0K    55 sec
20.0K    220 sec
40.0K    17 min
60.0K    35 min
80.0K    70 min

2SNP method
- Several orders of magnitude faster
- Scalable to genome-wide studies
- As accurate as PHASE and GERBIL