
1 National Taiwan University, Department of Computer Science and Information Engineering
Introduction to SNP and Haplotype Analysis
Algorithms and Computational Biology Lab, Department of Computer Science & Information Engineering, National Taiwan University, Taiwan.
Lecturer: Kun-Mao Chao
Assistant: Yao-Ting Huang
Thanks to Yao-Ting for preparing these wonderful lecture notes.

2 Genetic Variations
Genetic variations in DNA sequences (e.g., insertions, deletions, and mutations) have a major impact on genetic diseases and phenotypic differences.
- All humans share more than 99% of the same DNA sequence.
- A genetic variation in a coding region may change a codon and thereby alter the amino acid sequence.

3 Single Nucleotide Polymorphism
A Single Nucleotide Polymorphism (SNP), pronounced "snip," is a genetic variation in which a single nucleotide (i.e., A, T, C, or G) is altered and the change is passed on through heredity.
- SNP: a single DNA base variation found in more than 1% of the population.
- Mutation: a single DNA base variation found in less than 1% of the population.
Example (SNP): C T T A G C T T in 94% of the population, C T T A G T T T in 6%.
Example (mutation): C T T A G C T T in 99.9% of the population, C T T A G T T T in 0.1%.

4 Mutations and SNPs
[Figure: a timeline from a common ancestor to the present; the observed genetic variations are labeled as mutations or SNPs.]

5 Single Nucleotide Polymorphism
SNPs are the most frequent form of genetic variation.
- About 90% of human genetic variations come from SNPs.
- SNPs occur about once every 300 to 600 base pairs.
- Millions of SNPs have been identified (e.g., by HapMap and Perlegen).
SNPs have become the preferred markers for association studies because of their high abundance and the availability of high-throughput SNP genotyping technologies.

6 Single Nucleotide Polymorphism
A SNP is usually assumed to be a binary variable.
- The probability of a repeated mutation at the same SNP locus is quite small.
- Tri-allelic cases are usually attributed to genotyping errors.
The nucleotide at a SNP locus is called
- a major allele (if its allele frequency is > 50%), or
- a minor allele (if its allele frequency is < 50%).
Example: A C T T A G C T T occurs in 94% of the population and A C T T A G C T C in 6%, so at the last position T is the major allele and C is the minor allele.
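The major/minor distinction can be computed directly from observed alleles. The following is a minimal sketch, not part of the original slides; the function name `classify_alleles` and its return format are illustrative assumptions, and it rejects loci that are not biallelic.

```python
from collections import Counter


def classify_alleles(observed):
    """Return the major allele, minor allele, and minor allele frequency.

    `observed` is any sequence of single-letter alleles seen at one SNP locus
    (hypothetical input format). The locus must be biallelic.
    """
    counts = Counter(observed)
    if len(counts) != 2:
        raise ValueError("expected exactly two distinct alleles at a SNP locus")
    (major, _), (minor, minor_count) = counts.most_common(2)
    return {
        "major": major,
        "minor": minor,
        "minor_freq": minor_count / len(observed),
    }


# The slide's example: T in 94% of the samples, C in 6%.
result = classify_alleles("T" * 94 + "C" * 6)
```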

7 Haplotypes
A haplotype is a set of linked SNPs on the same chromosome.
- A haplotype can be regarded simply as a binary string, since each SNP is binary.
Example with three SNP loci (the 2nd, 5th, and 9th bases of each sequence):
Haplotype 1: -A C T T A G C T T-  gives C A T
Haplotype 2: -A A T T T G C T C-  gives A T C
Haplotype 3: -A C T T T G C T C-  gives C T C

8 Genotypes
The use of haplotype information has been limited because the human genome is diploid.
- In large sequencing projects, genotypes rather than haplotypes are collected because of cost.
Example: the haplotype data AT and CG (over SNP 1 and SNP 2) yield the genotype data {A, C} at SNP 1 and {G, T} at SNP 2.

9 Problems of Genotypes
Genotypes only tell us the alleles at each SNP locus.
- We do not know how the alleles at different SNP loci are connected.
- Several haplotype pairs may be consistent with the same genotype.
Example: the genotype {A, C} at SNP 1 and {G, T} at SNP 2 can be explained by the haplotype pair (AT, CG) or by the pair (AG, CT). We do not know which haplotype pair is real.
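To make the ambiguity concrete, the sketch below enumerates every haplotype pair consistent with a genotype. It is an illustration rather than code from the lecture; the genotype encoding (one two-letter string per SNP, alleles unordered) and the name `haplotype_pairs` are assumptions.

```python
from itertools import product


def haplotype_pairs(genotype):
    """List all unordered haplotype pairs that could produce `genotype`.

    `genotype` is a list with one two-letter string per SNP, e.g. ["AC", "GT"].
    A genotype with k heterozygous SNPs has 2**(k-1) distinct pairs (for k >= 1).
    """
    het = [i for i, g in enumerate(genotype) if g[0] != g[1]]
    pairs = set()
    for choice in product((0, 1), repeat=len(het)):
        h1 = [g[0] for g in genotype]
        h2 = [g[1] for g in genotype]
        for bit, i in zip(choice, het):
            if bit:  # swap which chromosome carries which allele at SNP i
                h1[i], h2[i] = h2[i], h1[i]
        pairs.add(tuple(sorted(("".join(h1), "".join(h2)))))
    return sorted(pairs)


# The slide's genotype: {A, C} at SNP 1 and {G, T} at SNP 2.
pairs = haplotype_pairs(["AC", "GT"])  # [("AG", "CT"), ("AT", "CG")]
```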

10 Research Directions of SNPs and Haplotypes in Recent Years
[Diagram of topics: Haplotype Inference, Tag SNP Selection, Maximum Parsimony, Perfect Phylogeny, Statistical Methods, Haplotype block, LD bin, Prediction Accuracy, SNP Database, ...]

11 Haplotype Inference
The problem of inferring haplotypes from a set of genotypes is called haplotype inference.
- This problem is known to be not only NP-hard but also APX-hard.
Most combinatorial methods adopt the maximum parsimony model to solve this problem.
- This model assumes that only a few distinct haplotypes actually occur in a natural population.
- The solution is a minimum set of haplotypes that can explain the given genotypes.

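12 Maximum Parsimony
Find a minimum set of haplotypes to explain the given genotypes.
- Genotype G_1 ({A, C} at SNP 1, {G, T} at SNP 2) can be resolved by h_1 = AT and h_2 = CG, or by h_3 = AG and h_4 = CT.
- Genotype G_2 ({A, A} at SNP 1, {T, T} at SNP 2) is resolved by h_1 = AT and h_1 = AT.
Choosing h_1 and h_2 explains both genotypes with two haplotypes, whereas resolving G_1 with h_3 and h_4 would require three haplotypes in total (h_1, h_3, h_4).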

13 Related Works
Statistical methods:
- Niu et al. (2002) developed a PL-EM algorithm called HAPLOTYPER.
- Stephens and Donnelly (2003) designed an MCMC algorithm based on Gibbs sampling called PHASE.
Combinatorial methods:
- Gusfield (2003) proposed an integer linear programming algorithm.
- Wang and Xu (2003) developed a branch-and-bound algorithm called HAPAR to find the optimal solution.
- Brown and Harrower (2004) proposed a new integer linear formulation of this problem.

14 Our Results
We formulated this problem as an integer quadratic programming (IQP) problem.
We proposed an iterative semidefinite programming (SDP) relaxation algorithm to solve the IQP problem.
- This algorithm finds an O(log n)-approximate solution.
We implemented this algorithm in MATLAB and compared it with existing methods.
- Huang, Y.-T., Chao, K.-M., and Chen, T. "An approximation algorithm for haplotype inference by pure parsimony," to appear in Journal of Computational Biology, 2005.

15 Problem Formulation
Input:
- A set of n genotypes and m possible haplotypes.
Output:
- A minimum set of haplotypes that can explain the given genotypes.
Example: G_1 ({A, C} at SNP 1, {G, T} at SNP 2) is explained by h_1 = AT and h_2 = CG; G_2 ({A, A} at SNP 1, {T, T} at SNP 2) is explained by h_1 = AT and h_1 = AT. The output is {h_1 = AT, h_2 = CG}.
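For tiny inputs, this formulation can be checked by exhaustive search. The sketch below is a brute-force illustration only; it takes exponential time and is not the SDP-based algorithm of the lecture. The names `explains` and `min_haplotype_set` and the genotype encoding (one two-letter string per SNP) are assumptions.

```python
from itertools import combinations


def explains(h1, h2, genotype):
    """True if the haplotype pair (h1, h2) yields `genotype` at every SNP."""
    return all(sorted((a, b)) == sorted(g) for a, b, g in zip(h1, h2, genotype))


def min_haplotype_set(genotypes, candidates):
    """Smallest subset of `candidates` that resolves every genotype.

    Tries subsets in order of increasing size, so the first hit is minimum.
    Returns None if even the full candidate set cannot explain the genotypes.
    """
    for k in range(1, len(candidates) + 1):
        for subset in combinations(candidates, k):
            if all(
                any(explains(h1, h2, g) for h1 in subset for h2 in subset)
                for g in genotypes
            ):
                return set(subset)
    return None


# The slide's example: G_1 = {A,C}/{G,T}, G_2 = {A,A}/{T,T}.
genotypes = [["AC", "GT"], ["AA", "TT"]]
candidates = ["AT", "CG", "AG", "CT"]
solution = min_haplotype_set(genotypes, candidates)  # {"AT", "CG"}
```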

16 Integer Quadratic Programming (IQP)
Define x_i as an integer variable with value 1 or -1.
- x_i = 1 if the i-th haplotype is selected.
- x_i = -1 if the i-th haplotype is not selected.
Minimizing the number of selected haplotypes amounts to minimizing an integer quadratic function of the x_i. [Equation not preserved in this transcript.]

17 Integer Quadratic Programming (IQP)
Each genotype must be resolved by at least one pair of haplotypes.
- For genotype G_1, an integer quadratic constraint must be satisfied. [Equation not preserved in this transcript.]
Example: G_1 ({A, C} at SNP 1, {G, T} at SNP 2) is resolved by h_1 = AT and h_2 = CG, or by h_3 = AG and h_4 = CT. Suppose h_1 and h_2 are selected; then G_1 is resolved.

18 Integer Quadratic Programming (IQP)
Maximum parsimony:
- Objective function: find a minimum set of haplotypes.
- Constraint functions: resolve all genotypes.
We use the SDP-relaxation technique to solve this IQP problem.
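The objective and constraint formulas are not preserved in this transcript. One plausible way to write them with the +1/-1 encoding of slide 16 is sketched below. This is an assumption for illustration and not necessarily the exact form used in the slides or in the cited paper; the notation R(G_t), for the set of haplotype index pairs that resolve genotype G_t, is hypothetical.

```latex
\begin{aligned}
\text{minimize}\quad & \sum_{i=1}^{m} \frac{1 + x_i}{2} \\
\text{subject to}\quad & \sum_{(j,k) \in R(G_t)} \frac{(1 + x_j)(1 + x_k)}{4} \;\ge\; 1,
  \qquad t = 1, \dots, n, \\
& x_i \in \{-1, +1\}, \qquad i = 1, \dots, m.
\end{aligned}
```

Each term (1 + x_i)/2 equals 1 when haplotype i is selected and 0 otherwise, so the objective counts selected haplotypes. Each product (1 + x_j)(1 + x_k)/4 equals 1 only when both haplotypes of a resolving pair are selected. For G_1 on slide 17 the constraint would read (1 + x_1)(1 + x_2)/4 + (1 + x_3)(1 + x_4)/4 >= 1, which holds when x_1 = x_2 = 1. The slides describe the objective as quadratic; multiplying each linear term by an extra homogenizing variable is a common device before an SDP relaxation, but whether the lecture does so is not recoverable from the transcript.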

19 Research Directions of SNPs and Haplotypes in Recent Years
[Diagram of topics: Haplotype Inference, Tag SNP Selection, Maximum Parsimony, Perfect Phylogeny, Statistical Methods, Haplotype block, LD bin, Prediction Accuracy, SNP Database, ...]

20 Problems of Using SNPs for Association Studies
The number of SNPs is still too large to be used for association studies.
- There are millions of SNPs in the human genome.
- To reduce the SNP genotyping cost, we wish to use as few SNPs as possible for association studies.
Tag SNPs are a small subset of SNPs that is sufficient for performing association studies without losing the power of using all SNPs.
- There are many definitions of tag SNPs.
- We will first study one definition of tag SNPs based on the haplotype block model.

21 Haplotype Blocks and Tag SNPs
Recent studies have shown that the chromosome can be partitioned into haplotype blocks interspersed by recombination hotspots (Daly et al., Patil et al.).
- Within a haplotype block, little or no recombination has occurred.
- The SNPs within a haplotype block tend to be inherited together.
Within a haplotype block, a small subset of SNPs (called tag SNPs) is sufficient to distinguish each pair of haplotype patterns in the block.
- We only need to genotype the tag SNPs instead of all SNPs within a haplotype block.

22 Recombination Hotspots and Haplotype Blocks
[Figure: a chromosome partitioned by recombination hotspots into haplotype blocks; haplotype patterns P_1 to P_4 over SNP loci S_1 to S_12, with major and minor alleles marked.]

23 A Haplotype Block Example
Chromosome 21 is partitioned into 4,135 haplotype blocks over 24,047 SNPs by Patil et al. (Science, 2001).
[Figure legend: blue box = major allele, yellow box = minor allele.]

24 Examples of Tag SNPs
Suppose we wish to identify an unknown haplotype sample.
We can genotype all SNPs to identify the haplotype sample.
[Figure: haplotype patterns P_1 to P_4 over SNP loci S_1 to S_12, with major and minor alleles marked, and an unknown haplotype sample.]

25 Examples of Tag SNPs
In fact, it is not necessary to genotype all SNPs.
SNPs S_3, S_4, and S_5 can form a set of tag SNPs.
[Figure: haplotype patterns P_1 to P_4 over SNP loci S_1 to S_12, and the same patterns restricted to S_3, S_4, and S_5.]

26 Examples of Wrong Tag SNPs
SNPs S_1, S_2, and S_3 cannot form a set of tag SNPs because P_1 and P_4 would be ambiguous.
[Figure: haplotype patterns P_1 to P_4 over SNP loci S_1 to S_12, and the same patterns restricted to S_1, S_2, and S_3.]

27 Examples of Tag SNPs
SNPs S_1 and S_12 can form a set of tag SNPs.
This set of SNPs is a minimum solution in this example.
[Figure: haplotype patterns P_1 to P_4 over SNP loci S_1 to S_12, and the same patterns restricted to S_1 and S_12.]
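The tag SNP examples can be reproduced with a small exhaustive search. The pattern figure is not preserved in this transcript, so the 0/1 matrix below is a hypothetical stand-in constructed only to agree with the three statements on the slides (S_3, S_4, S_5 work; S_1, S_2, S_3 confuse P_1 and P_4; S_1 and S_12 form a minimum set). The function names are likewise assumptions, and the search is exponential in the number of SNPs.

```python
from itertools import combinations


def is_tag_set(patterns, snps):
    """True if the SNP indices in `snps` give every pattern a unique signature."""
    signatures = {tuple(p[i] for i in snps) for p in patterns}
    return len(signatures) == len(patterns)


def min_tag_snps(patterns):
    """Smallest tuple of SNP indices (0-based) that distinguishes all patterns.

    Assumes the patterns are pairwise distinct; returns None otherwise.
    """
    n = len(patterns[0])
    for k in range(1, n + 1):
        for snps in combinations(range(n), k):
            if is_tag_set(patterns, snps):
                return snps
    return None


# Hypothetical patterns over S_1..S_12 (0 = major allele, 1 = minor allele).
patterns = [
    "000001010100",  # P_1
    "101001100110",  # P_2
    "110010101101",  # P_3
    "000101100001",  # P_4
]
best = min_tag_snps(patterns)  # (0, 11), i.e. S_1 and S_12
```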

28 Problems of Finding Tag SNPs
The problem of finding a minimum set of tag SNPs is known to be NP-hard.
- This problem is the minimum test set problem.
- A number of methods have been proposed to find a minimum set of tag SNPs (Bafna et al., Zhang et al.).
In practice, we may fail to obtain some tag SNPs if they do not pass the data-quality threshold.
- In the current genotyping environment, the missing rate of SNPs is around 5 to 10%.
- We proposed two greedy algorithms and one linear programming relaxation algorithm to solve this problem.

29 Introduction to Linkage Disequilibrium and Programming Assignment
Algorithms and Computational Biology Lab, Department of Computer Science & Information Engineering, National Taiwan University, Taiwan.
Speaker: Yao-Ting Huang

30 Research Directions of SNPs and Haplotypes in Recent Years
[Diagram of topics: Haplotype Inference, Tag SNP Selection, Maximum Parsimony, Perfect Phylogeny, Statistical Methods, Haplotype block, LD bin, Prediction Accuracy, SNP Database, ...]

31 Linkage Disequilibrium
The problem of finding tag SNPs can also be approached from a statistical point of view.
- We can measure the correlation between SNPs and identify sets of highly correlated SNPs.
- For each set of correlated SNPs, only one SNP needs to be genotyped, and it can be used to predict the values of the other SNPs.
Linkage Disequilibrium (LD) is a measure that estimates such correlation between two SNPs.
- We will formally introduce LD in detail later.

32 Linkage Disequilibrium Bins
The statistical methods for finding tag SNPs are based on the analysis of LD among all SNPs.
- An LD bin is a set of SNPs such that the SNPs within the same bin are highly correlated with each other.
- The value of a single SNP in an LD bin can predict the values of the other SNPs in the same bin.
- These methods try to identify a minimum set of LD bins.

33 An Example of LD Bins (1/3)
SNP 1 and SNP 2 cannot form an LD bin.
- e.g., A at SNP 1 may imply either G or A at SNP 2.

Individual  SNP 1  SNP 2  SNP 3  SNP 4  SNP 5  SNP 6
1           A      G      A      C      G      T
2           T      G      C      C      G      C
3           A      A      A      T      A      T
4           T      G      C      T      A      C
5           T      A      C      C      G      C
6           T      G      C      T      A      C
7           A      A      A      T      A      T
8           A      A      A      T      A      T

34 An Example of LD Bins (2/3)
SNP 1, SNP 3, and SNP 6 can form an LD bin: in the table, A at SNP 1 always occurs with A at SNP 3 and T at SNP 6, and T at SNP 1 always occurs with C at SNP 3 and C at SNP 6.
- Any SNP in this bin is sufficient to predict the values of the others.
(Same data table as on slide 33.)

35 An Example of LD Bins (3/3)
There are three LD bins, and only three tag SNPs need to be genotyped (e.g., SNP 1, SNP 2, and SNP 4).
(Same data table as on slide 33.)
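The three bins can be recovered mechanically from the table. The sketch below groups SNPs that are perfectly correlated, meaning each allele of one SNP always co-occurs with the same allele of the other. This is a simplification of the r^2 >= 0.8 criterion used later in the lecture, and the function names are assumptions.

```python
def perfectly_correlated(col_a, col_b):
    """True if the alleles of two SNP columns are in one-to-one correspondence."""
    pairs = set(zip(col_a, col_b))
    return len(pairs) == len(set(col_a)) == len(set(col_b))


def ld_bins(rows):
    """Greedily group SNP columns (0-based indices) into perfect-correlation bins.

    `rows` holds one string per individual, one character per SNP.
    A column joins the first bin whose members it all matches perfectly.
    """
    columns = list(zip(*rows))
    bins = []
    for j, col in enumerate(columns):
        for members in bins:
            if all(perfectly_correlated(col, columns[k]) for k in members):
                members.append(j)
                break
        else:
            bins.append([j])
    return bins


# The eight individuals from the table on slides 33 to 35.
rows = ["AGACGT", "TGCCGC", "AAATAT", "TGCTAC",
        "TACCGC", "TGCTAC", "AAATAT", "AAATAT"]
bins = ld_bins(rows)  # [[0, 2, 5], [1], [3, 4]]
```

In 1-based SNP numbering the bins are {SNP 1, SNP 3, SNP 6}, {SNP 2}, and {SNP 4, SNP 5}, so one tag SNP per bin (for example SNP 1, SNP 2, and SNP 4) is enough.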

36 Difference between Haplotype Blocks and LD Bins
Haplotype blocks are based on the assumption that SNPs in close proximity tend to be correlated with each other.
- Recombination between nearby SNPs is less likely.
LD bins can group correlated SNPs that are distant from each other.
- A disease is usually affected by multiple genes rather than a single one.
The SNPs in one LD bin can be shared by other bins.
- The SNPs in a haplotype block do not appear in another block.

37 Introduction to Linkage Disequilibrium
Two SNPs, SNP 1 and SNP 2, give four possible haplotypes: AB, Ab, aB, and ab.
A, B: major alleles
a, b: minor alleles
P_A: probability of the A allele at SNP 1
P_a: probability of the a allele at SNP 1
P_B: probability of the B allele at SNP 2
P_b: probability of the b allele at SNP 2
P_AB: probability of the AB haplotype
P_ab: probability of the ab haplotype

                 SNP 2
                 B      b      Total
SNP 1   A        P_AB   P_Ab   P_A
        a        P_aB   P_ab   P_a
        Total    P_B    P_b    1.0

38 Linkage Equilibrium
P_AB = P_A P_B
P_Ab = P_A P_b = P_A (1 - P_B)
P_aB = P_a P_B = (1 - P_A) P_B
P_ab = P_a P_b = (1 - P_A) (1 - P_B)
(Same haplotype frequency table as on slide 37.)

39 Linkage Disequilibrium
P_AB ≠ P_A P_B
P_Ab ≠ P_A P_b = P_A (1 - P_B)
P_aB ≠ P_a P_B = (1 - P_A) P_B
P_ab ≠ P_a P_b = (1 - P_A) (1 - P_B)
(Same haplotype frequency table as on slide 37.)

40 An Example of Linkage Disequilibrium
Suppose we have three haplotypes: AG, CG, and CC.
- There is no AC haplotype, i.e., P_AC = 0.
Allele frequencies: P_A = 1/3 and P_C = 2/3 at SNP 1; P_G = 2/3 and P_C = 1/3 at SNP 2.
Note that P_AC = 0 and P_A P_C = 1/9, so P_AC ≠ P_A P_C.
- These two SNPs are in linkage disequilibrium.

41 An Example of Linkage Equilibrium
Before recombination the haplotypes are AG, CG, and CC; after recombination they are AC, AG, CG, and CC.
Allele frequencies after recombination: P_A = 1/2 and P_C = 1/2 at SNP 1; P_G = 1/2 and P_C = 1/2 at SNP 2.
After recombination,
- P_AG = P_A P_G = 1/4,
- P_CG = P_C P_G = 1/4,
- P_CC = P_C P_C = 1/4, and
- P_AC = P_A P_C = 1/4.
These two SNPs are in linkage equilibrium.
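Both examples can be verified by comparing each observed haplotype frequency with the product of its allele frequencies. The following sketch uses exact fractions; the function names are illustrative assumptions, and each haplotype is a two-letter string (allele at SNP 1, then allele at SNP 2).

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def observed_vs_expected(haplotypes):
    """Map each possible haplotype to (observed frequency, product of allele frequencies)."""
    n = len(haplotypes)
    hap_count = Counter(haplotypes)
    count1 = Counter(h[0] for h in haplotypes)  # alleles at SNP 1
    count2 = Counter(h[1] for h in haplotypes)  # alleles at SNP 2
    table = {}
    for a, b in product(count1, count2):
        observed = Fraction(hap_count[a + b], n)
        expected = Fraction(count1[a], n) * Fraction(count2[b], n)
        table[a + b] = (observed, expected)
    return table


def in_linkage_equilibrium(haplotypes):
    """True if every haplotype frequency equals the product of its allele frequencies."""
    return all(obs == exp for obs, exp in observed_vs_expected(haplotypes).values())


before = ["AG", "CG", "CC"]        # slide 40: P_AC = 0 but P_A * P_C = 1/9
after = ["AC", "AG", "CG", "CC"]   # slide 41: every haplotype has frequency 1/4
```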

42 Linkage Disequilibrium
There are many formulas for computing LD between two SNPs, and most of them are normalized to the range -1 to 1 or 0 to 1.
- LD = 1 (perfect positive correlation)
- LD = 0 (no correlation, i.e., linkage equilibrium)
- LD = -1 (perfect negative correlation)
- LD = 0.8 (strong positive correlation)
- LD = 0.12 (weak positive correlation)

43 Linkage Disequilibrium Formulas
Mathematical formulas for computing LD:
- r^2 (also written Δ^2)
- D'
- Chi-square test
- P value
[Formulas not preserved in this transcript.]

44 Correlation Coefficient
The correlation between two random variables A and B can be measured by the correlation coefficient. [Equation not preserved in this transcript.]
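The formula itself is missing from the transcript. For two biallelic SNPs, the standard definition of the correlation coefficient, written here with the haplotype and allele frequencies of slide 37, is reproduced below as a reference; the exact layout on the original slide is unknown.

```latex
\begin{aligned}
D &= P_{AB} - P_A P_B, \\
r_{AB} &= \frac{D}{\sqrt{P_A \, P_a \, P_B \, P_b}}, \\
r^2 &= \frac{\left(P_{AB} - P_A P_B\right)^2}{P_A \, P_a \, P_B \, P_b}.
\end{aligned}
```

Under linkage equilibrium P_AB = P_A P_B, so D = 0 and r^2 = 0. When the two SNPs are perfectly correlated (only the AB and ab haplotypes occur), P_AB = P_A = P_B, which gives D = P_A P_a and r^2 = 1.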

45 Examples of Computing LD

Individual  SNP 1  SNP 2  SNP 3  SNP 4  SNP 5  SNP 6
1           A      T      A      A      G      T
2           G      T      C      C      T      T
3           G      A      C      A      G      T
4           G      A      C      C      T      T
5           G      A      C      A      G      C
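As a worked illustration of this table, the sketch below computes r^2 between two SNP columns using the standard definition r^2 = (P_AB - P_A P_B)^2 / (P_A P_a P_B P_b). The slide's own worked numbers are not preserved, so the results quoted afterwards come from this sketch and not from the lecture; the function name `r_squared` is an assumption.

```python
from fractions import Fraction


def r_squared(col_a, col_b):
    """Exact r^2 between two biallelic SNP columns of equal length.

    The first allele seen in each column is used as the reference allele;
    r^2 does not depend on that choice.
    """
    n = len(col_a)
    ref_a, ref_b = col_a[0], col_b[0]
    p_a = Fraction(sum(x == ref_a for x in col_a), n)
    p_b = Fraction(sum(y == ref_b for y in col_b), n)
    p_ab = Fraction(sum(x == ref_a and y == ref_b for x, y in zip(col_a, col_b)), n)
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("r^2 is undefined when a column has only one allele")
    d = p_ab - p_a * p_b
    return d * d / denom


# Columns of the table above (individuals 1 to 5).
snp1, snp2, snp3 = "AGGGG", "TTAAA", "ACCCC"
snp4, snp5 = "ACACA", "GTGTG"
```

With these columns, SNP 1 and SNP 3 give r^2 = 1, SNP 4 and SNP 5 give r^2 = 1, and SNP 1 and SNP 2 give r^2 = 3/8, which is below the 0.8 threshold used later.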

46 Minimum Clique Cover Problem
This problem asks for a minimum set of LD bins.
- The minimum LD value required between two SNPs in one bin is usually set to 0.8.
This problem is known to be the minimum clique cover problem (by Chao, K.-M., 2005).
- Consider each SNP as a node of a graph.
- There is an edge between two nodes iff the LD of the two SNPs is ≥ 0.8.

47 Relaxation of This Problem
The minimum clique cover problem is hard to approximate.
- The relaxed problem asks for a minimum set of LD bins such that at least one SNP in each LD bin has r^2 ≥ 0.8 with the other SNPs in the same bin.
The relaxed problem is known to be the minimum dominating set problem.
- The minimum dominating set problem is still NP-hard but is easier to approximate.

48 Minimum Dominating Set Problem
Given a graph G(V, E), a minimum dominating set C is a smallest set of nodes such that each node in V is either in C or has at least one edge connecting it to a node in C.
Consider each node as a SNP and each edge as strong LD (r^2 ≥ 0.8) between two SNPs.
- The minimum dominating set of this graph is the set of tag SNPs.
- Using only this set of SNPs, we can predict the other SNPs.
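One common heuristic for this problem is the greedy rule that repeatedly picks the node covering the most still-uncovered nodes. It is sketched below as an illustration, not as the lecture's prescribed solution. The graph encoding (a dict mapping each node to the set of its neighbours) and the function name are assumptions, and the result is a dominating set that is not guaranteed to be minimum.

```python
def greedy_dominating_set(adjacency):
    """Greedy dominating set for an undirected graph.

    `adjacency` maps every node to the set of its neighbours. Ties are broken
    in favour of the smallest node label so the result is deterministic.
    """
    uncovered = set(adjacency)
    chosen = []
    while uncovered:
        # A node covers itself and all of its neighbours.
        best = max(
            sorted(adjacency),
            key=lambda v: len(({v} | adjacency[v]) & uncovered),
        )
        chosen.append(best)
        uncovered -= {best} | adjacency[best]
    return chosen


# Strong-LD graph for the six SNPs of slides 33 to 35 (edges: 1-3, 1-6, 3-6, 4-5).
adjacency = {1: {3, 6}, 2: set(), 3: {1, 6}, 4: {5}, 5: {4}, 6: {1, 3}}
tags = greedy_dominating_set(adjacency)  # [1, 4, 2]
```

On this small graph the greedy choice coincides with the tag SNPs named on slide 35 (SNP 1, SNP 2, and SNP 4). On larger graphs an exact method would be needed to certify minimality.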

49 Experimental Data Sets
Hinds et al. (2005) identified 1,586,383 SNPs across three human populations.
- Americans of African, European, and Asian ancestry.
The database provides both genotype data and inferred haplotype data.

50 The Programming Assignment
Conduct an experiment on the Perlegen SNP database.
- http://www.perlegen.com
Find a minimum set of LD bins such that at least one SNP in each bin has strong LD (r^2 ≥ 0.8) with the other SNPs in the same bin.
- Please use r^2 ≥ 0.8 as the threshold for strong correlation between two SNPs.
- The focus of this project is to design algorithms for solving the minimum dominating set problem.

51 Haplotype Data Format
local_id: Local unique identifier for this SNP
accession: NCBI Build 34 sequence accession number
position: Position within the specified Build 34 sequence
alleles: The two SNP alleles; order is arbitrary
NA?????_A, NA?????_B: Two inferred haploid alleles
- Columns 5-50: African American haplotypes
- Columns 51-98: European American haplotypes
- Columns 99-146: Han Chinese haplotypes
Download phased haplotype data from http://genome.perlegen.com/browser/download.html.
- Please use the 24 phased haplotype data sets.

52 The Programming Assignment
Work in teams of up to 5 people.
- The program can be written in any programming language.
Exact and approximate algorithms are both welcome (more methods, higher grades).
- Please provide an analysis of the proposed algorithms (e.g., the time complexity).
- If you use an existing method, please add appropriate citations or references.

53 The Project Report
The project report should include at least the following contents (more information, higher grades).
- (1) Team member information,
- (2) description of your algorithms,
- (3) analysis of your algorithms (e.g., time complexity, approximation ratio),
- (4) summary of experimental results, and
- (5) contributions of each team member.

54 The Experimental Setup
The summary of your experimental results should include at least some statistics of the LD bins found by your algorithm.
- We encourage you to conduct a comprehensive experiment and analysis.

             All     Africa  European  Chinese
1-10 SNPs    15123   12134   13123     11134
≥10 SNPs     12341111 [column boundaries not recoverable from the transcript]
Total bins   16357   13245   14234     12245

55 The Programming Assignment
Due date: 12/14
Email your program (with a detailed running procedure) and project report to the TA.
- Yao-Ting Huang: d92023@csie.ntu.edu.tw
- We may ask you to demo your program in person if necessary.
Important messages will be announced on the following web page.
- http://www.csie.ntu.edu.tw/~kmchao/seq05fall/

