Haplotype Inference Yao-Ting Huang Kun-Mao Chao.



Genetic Variations
Genetic variations in DNA sequences (e.g., insertions, deletions, and mutations) have a major impact on genetic diseases and phenotypic differences. All humans share about 99% of the same DNA sequence. A genetic variation in a coding region may change a codon and thereby alter the amino acid sequence.

Single Nucleotide Polymorphism
A single nucleotide polymorphism (SNP), pronounced “snip,” is a genetic variation in which a single nucleotide (i.e., A, T, C, or G) is altered and the change is kept through heredity.
SNP: a single DNA base variation found in more than 1% of the population.
Mutation: a single DNA base variation found in less than 1% of the population.
[Figure: SNP example, CTTAGCTT (94%) versus CTTAGTTT (6%); mutation example, CTTAGCTT (99.9%) versus CTTAGTTT (0.1%).]

Mutations and SNPs
[Figure: a timeline from a common ancestor to the present, showing which of the observed genetic variations are SNPs and which are mutations.]

Single Nucleotide Polymorphism
SNPs are the most frequent form of genetic variation: 90% of human genetic variations come from SNPs. SNPs occur about once every 300 to 600 base pairs. Millions of SNPs have been identified (e.g., by HapMap and Perlegen). SNPs have become the preferred markers for association studies because of their high abundance and the availability of high-throughput SNP genotyping technologies.

Single Nucleotide Polymorphism
A SNP is usually assumed to be a binary variable. The probability of a repeated mutation at the same SNP locus is quite small, and tri-allelic cases are usually considered to be the effect of genotyping errors. The nucleotide at a SNP locus is called the major allele if its allele frequency is above 50%, or the minor allele if its allele frequency is below 50%.
[Figure: ACTTAGCTT (94%), T is the major allele; ACTTAGCTC (6%), C is the minor allele.]
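
The major/minor distinction above can be illustrated with a minimal Python sketch. The function name `major_minor_alleles` and the 100-chromosome sample are hypothetical, chosen only to mirror the 94%/6% split in the figure; a strictly biallelic SNP is assumed.

```python
from collections import Counter

def major_minor_alleles(observed):
    """Return (major, minor) alleles of a biallelic SNP from observed bases."""
    counts = Counter(observed)
    if len(counts) != 2:
        # The slide assumes SNPs are binary, so anything else is rejected here.
        raise ValueError("expected exactly two distinct alleles")
    (major, _), (minor, _) = counts.most_common(2)
    return major, minor

# Hypothetical sample: 94 chromosomes carry T and 6 carry C at this locus.
sample = ["T"] * 94 + ["C"] * 6
print(major_minor_alleles(sample))
```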

Haplotypes
A haplotype is an ordered list of SNPs on the same chromosome. Since each SNP is binary, a haplotype can be treated simply as a binary string.
[Figure: three chromosome sequences and the haplotypes formed by their three SNP loci.]
Sequence -ACTTAGCTT-: Haplotype 1 = C A T (SNP1, SNP2, SNP3)
Sequence -AATTTGCTC-: Haplotype 2 = A T C (SNP1, SNP2, SNP3)
Sequence -ACTTTGCTC-: Haplotype 3 = C T C (SNP1, SNP2, SNP3)
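
As an illustration of the slide's example, the following Python sketch extracts the haplotypes from the three sequences and encodes them as binary strings. The helper names and the 0-based SNP positions (1, 4, 8) are read off the example; the convention "0 = major allele, 1 = minor allele" is an assumption, since the slide does not fix one.

```python
from collections import Counter

def extract_haplotype(sequence, snp_positions):
    """Keep only the bases at the given 0-based SNP positions."""
    return "".join(sequence[p] for p in snp_positions)

def to_binary(haplotypes):
    """Encode haplotypes as 0/1 strings (assumed: 0 = major allele, 1 = minor)."""
    # The most common base in each SNP column is taken as the major allele.
    majors = [Counter(col).most_common(1)[0][0] for col in zip(*haplotypes)]
    return [
        "".join("0" if base == major else "1" for base, major in zip(hap, majors))
        for hap in haplotypes
    ]

sequences = ["ACTTAGCTT", "AATTTGCTC", "ACTTTGCTC"]
snp_positions = [1, 4, 8]  # the three SNP loci in the example sequences

haplotypes = [extract_haplotype(s, snp_positions) for s in sequences]
print(haplotypes)             # ['CAT', 'ATC', 'CTC']
print(to_binary(haplotypes))  # ['011', '100', '000']
```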

Genotypes
The use of haplotype information has been limited because the human genome is diploid. In large sequencing projects, genotypes instead of haplotypes are collected because of cost considerations.
[Figure: haplotype data (A T and C G at SNP1, SNP2) versus genotype data (A/C at SNP1, G/T at SNP2).]
Haplotype data are not easy to obtain because the human genome is diploid, that is, each chromosome is present in two copies. To obtain haplotype data, we have to separate the two copies first and then extract the SNPs on each, which yields two haplotypes, AT and CG. In most large sequencing projects, however, the diploid chromosomes are not separated because of cost considerations, so we obtain less informative data called genotype data. From genotype data we know only the two alleles at each locus, not how the alleles at different loci combine. For example, we do not know whether the haplotypes are AT and CG or AG and CT. Since we are interested in the haplotype data, the problem is to determine which haplotype pair is the true one.

Problems of Genotypes
Genotypes only tell us the alleles at each SNP locus, not how the alleles at different SNP loci are connected. There can be several possible haplotype pairs for the same genotype, and we do not know which pair is real.
[Figure: the genotype A/C at SNP1 and G/T at SNP2 can be explained by the haplotype pair A T and C G, or by the pair A G and C T.]
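
The ambiguity described above can be made concrete with a short Python sketch that enumerates every haplotype pair consistent with a genotype. The representation of a genotype as a list of unordered allele pairs, and the name `resolving_pairs`, are assumptions made for this illustration.

```python
from itertools import product

def resolving_pairs(genotype):
    """Return all unordered haplotype pairs consistent with a genotype.

    genotype: list of (allele, allele) tuples, one per SNP locus.
    """
    pairs = set()
    # At each locus, choose which allele goes to the first haplotype;
    # the other allele goes to the second haplotype.
    for choice in product((0, 1), repeat=len(genotype)):
        h1 = "".join(site[c] for site, c in zip(genotype, choice))
        h2 = "".join(site[1 - c] for site, c in zip(genotype, choice))
        pairs.add(tuple(sorted((h1, h2))))
    return sorted(pairs)

g = [("A", "C"), ("G", "T")]  # A/C at SNP1, G/T at SNP2
print(resolving_pairs(g))     # [('AG', 'CT'), ('AT', 'CG')]
```

A genotype with k heterozygous loci has 2^(k-1) such pairs, which is why genotype data alone cannot identify the true pair.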

Research Directions of SNPs and Haplotypes in Recent Years
SNP Database
Haplotype Inference: Maximum Parsimony, Perfect Phylogeny, Statistical Methods
Tag SNP Selection: Haplotype block, LD bin, Prediction Accuracy
…

Haplotype Inference
The problem of inferring haplotypes from a set of genotypes is called haplotype inference. This problem is known to be not only NP-hard but also APX-hard. Most combinatorial methods solve it under the maximum parsimony model, which assumes that the number of distinct haplotypes in a natural population is small. The solution is a minimum set of haplotypes that can explain the given genotypes.

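Maximum Parsimony
Suppose we are given two genotypes, G1 and G2. Find a minimum set of haplotypes to explain the given genotypes.
[Figure: candidate haplotypes h1 = A T, h2 = C G, h3 = A G, h4 = C T.]
G1 (A/C at SNP1, G/T at SNP2) can be resolved by h1 and h2, or by h3 and h4.
G2 (A/A at SNP1, T/T at SNP2) can be resolved only by two copies of h1.
Because G2 requires h1, resolving G1 with h1 and h2 gives the minimum set {h1, h2}.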
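
For an instance this small, the minimum set can be found by exhaustive search. The sketch below is a brute-force illustration of the maximum parsimony objective only; it is not the authors' algorithm, it takes exponential time, and the function names and genotype representation are assumptions.

```python
from itertools import combinations, product

def resolving_pairs(genotype):
    """All unordered haplotype pairs consistent with a genotype."""
    pairs = set()
    for choice in product((0, 1), repeat=len(genotype)):
        h1 = "".join(site[c] for site, c in zip(genotype, choice))
        h2 = "".join(site[1 - c] for site, c in zip(genotype, choice))
        pairs.add(tuple(sorted((h1, h2))))
    return sorted(pairs)

def min_parsimony(genotypes):
    """Smallest haplotype set such that every genotype has a resolving pair in it."""
    options = [resolving_pairs(g) for g in genotypes]
    candidates = sorted({h for opts in options for pair in opts for h in pair})
    # Try candidate subsets in order of increasing size; the first hit is minimum.
    for size in range(1, len(candidates) + 1):
        for subset in combinations(candidates, size):
            chosen = set(subset)
            if all(any(a in chosen and b in chosen for a, b in opts)
                   for opts in options):
                return chosen
    return set()

g1 = [("A", "C"), ("G", "T")]  # resolved by AT+CG or AG+CT
g2 = [("A", "A"), ("T", "T")]  # resolved only by AT+AT
print(sorted(min_parsimony([g1, g2])))  # ['AT', 'CG']
```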

Our Results
We formulated this problem as an integer quadratic programming (IQP) problem. We proposed an iterative semidefinite programming (SDP) relaxation algorithm to solve the IQP problem. This algorithm finds an O(log n)-approximation solution. We implemented this algorithm in MATLAB and compared it with existing methods.
Huang, Y.-T., Chao, K.-M., and Chen, T., 2005, “An Approximation Algorithm for Haplotype Inference by Maximum Parsimony,” Journal of Computational Biology, 12: 1261-1274.

Problem Formulation
Input: A set of n genotypes and m possible haplotypes.
Output: A minimum set of haplotypes that can explain the given genotypes.
[Figure: G1 (A/C at SNP1, G/T at SNP2) is resolved by h1 = A T and h2 = C G; G2 (A/A at SNP1, T/T at SNP2) is resolved by two copies of h1 = A T. The output set is {h1, h2}.]

Integer Quadratic Programming (IQP)
The first step is to formulate the haplotype inference problem as an IQP problem. Let m be the number of haplotypes, and define x_i as an integer variable with value 1 or -1: x_i = 1 if the i-th haplotype is selected, and x_i = -1 if it is not. Minimizing the number of selected haplotypes amounts to minimizing an integer quadratic function with one term per haplotype: the term is 1 if the haplotype is selected and 0 if it is not, so the sum of the terms is exactly the number of selected haplotypes. For each genotype, at least one pair of haplotypes that resolves it must be selected. For example, G1 can be resolved by h1 and h2 or by h3 and h4; if h1 and h2 are both selected, the term for that pair is equal to 1.

Integer Quadratic Programming (IQP)
Each genotype must be resolved by at least one pair of haplotypes. For genotype G1, which can be resolved by h1 = A T and h2 = C G or by h3 = A G and h4 = C T, an integer quadratic constraint with one product term per resolving pair must be satisfied. Suppose h1 and h2 are selected: the term for that pair then equals 1, and the constraint holds.
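
The selection variables and the per-genotype constraint can be checked numerically with a small Python sketch. The mapping (1 + x) / 2 from {+1, -1} to {1, 0} is an assumption consistent with the description of the terms; the exact algebraic form used on the slides is not reproduced in this transcript, and the function names are hypothetical.

```python
def indicator(x):
    """Map x in {+1, -1} to 1 (selected) or 0 (not selected)."""
    return (1 + x) // 2

def num_selected(x):
    """Objective value: the number of selected haplotypes."""
    return sum(indicator(xi) for xi in x.values())

def is_resolved(x, pairs):
    """Constraint: at least one resolving pair has both haplotypes selected."""
    return sum(indicator(x[a]) * indicator(x[b]) for a, b in pairs) >= 1

# Selection from the example: h1 and h2 selected, h3 and h4 not selected.
x = {"h1": 1, "h2": 1, "h3": -1, "h4": -1}
g1_pairs = [("h1", "h2"), ("h3", "h4")]  # G1: h1+h2 or h3+h4
g2_pairs = [("h1", "h1")]                # G2: two copies of h1

print(num_selected(x))                                     # 2
print(is_resolved(x, g1_pairs), is_resolved(x, g2_pairs))  # True True
```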

Integer Quadratic Programming (IQP)
Maximum parsimony: find a minimum set of haplotypes (objective function) to resolve all genotypes (constraint functions).
We use the SDP-relaxation technique to solve this IQP problem.
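
As a hedged reconstruction, the objective and constraints can be written as follows. The symbols x_i and m follow the definitions on the earlier IQP slides, but the exact algebraic form is an assumption chosen to match the description (each term is 1 for a selected haplotype and 0 otherwise) and is not copied from the slides.

```latex
\begin{aligned}
\text{minimize}\quad
  & \sum_{i=1}^{m} \left( \frac{1 + x_i}{2} \right)^{2} \\
\text{subject to}\quad
  & \sum_{(h_j,\, h_k)\ \text{resolving}\ G}
    \frac{1 + x_j}{2} \cdot \frac{1 + x_k}{2} \;\ge\; 1
    \quad \text{for every genotype } G, \\
  & x_i \in \{1, -1\}, \quad i = 1, \dots, m .
\end{aligned}
```

For x_i in {1, -1}, each objective term equals 1 exactly when x_i = 1, so the sum counts the selected haplotypes, and each product in a constraint equals 1 exactly when both haplotypes of that pair are selected.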

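The Flow of the Iterative SDP Relaxation Algorithm
[Figure: flow chart of the algorithm.]
The integer quadratic program (NP-hard) is turned, by relaxing the integer constraint and reformulating through a vector formulation, into a semidefinite program (solvable in polynomial time).
An existing SDP solver produces an SDP solution.
Incomplete Cholesky decomposition converts the SDP solution into a vector solution.
Randomized rounding converts the vector solution into an integral solution.
All genotypes resolved? Yes: done. No: repeat this algorithm.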
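
The randomized rounding step can be sketched in Python as generic random-hyperplane rounding. This is an illustration of the general technique under stated assumptions, not necessarily the rounding scheme of the authors' algorithm: the `reference` vector that stands for the value +1, the function name, and the toy vectors are all hypothetical, and the SDP solving and Cholesky steps are not shown.

```python
import random

def hyperplane_round(vectors, reference, rng):
    """Round each vector to +1 or -1 using one random hyperplane.

    A vector gets +1 if it lies on the same side of the hyperplane as
    `reference` (an assumed vector standing for the value +1), else -1.
    """
    # A Gaussian vector gives a uniformly random hyperplane direction.
    r = [rng.gauss(0.0, 1.0) for _ in reference]

    def side(v):
        return 1 if sum(a * b for a, b in zip(v, r)) >= 0 else -1

    ref_side = side(reference)
    return [side(v) * ref_side for v in vectors]

rng = random.Random(0)
reference = [1.0, 0.0]
# Toy vector solution: aligned with, opposite to, and aligned with the reference.
vectors = [[1.0, 0.0], [-1.0, 0.0], [1.0, 0.0]]
print(hyperplane_round(vectors, reference, rng))  # [1, -1, 1]
```

In the iterative flow, a rounding like this would be followed by a check of which genotypes are resolved, with the procedure repeated on the rest.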