Ho Kim School of Public Health Seoul National University

Introduction to SNP and Haplotype Analysis

Contents
  SNP (Single Nucleotide Polymorphism)
  Haplotypes
  Linkage & Linkage disequilibrium
  Association study design
  SNP vs. Haplotype for association study
  Haplotype estimation
  Data analysis

SNPs (pronounced snips)

Mutation

Polymorphism – Definition
  Polymorphism: a sequence variation that occurs at least 1 percent of the time (>= 1%)
  90% of variations are SNPs
  Mutation: a variation that is present less than 1 percent of the time (< 1%)

SNPs in the Human Genome
  All humans share 99.9% of the same genetic sequence
  SNPs occur about every 1000 base pairs
  The human genome contains more than 2 million SNPs
  ~21,000 SNPs are found in genes
  SNPs are not evenly spaced along the sequence: there are SNP-rich regions and SNP-poor regions

SNPs as DNA Landmarks
  Help in DNA sequencing
  Help in the discovery of genes responsible for many major diseases: asthma, diabetes, heart disease, schizophrenia and cancer, among others

From SNP to Haplotype
  [Figure: three aligned DNA sequences with their SNP alleles, haplotypes and phenotypes]
  Phenotype   DNA sequence        Haplotype   Frequency
  Black eye   GATATTCGTACGGA-T    AG-         2/6
  Brown eye   GATGTTCGTACTGAAT    GTA         3/6
  Blue eye    GATATTCGTACGGAAT    AGA         1/6
  SNP: simple to measure and understand.
  Haplotype: in the appropriate circumstances, haplotypes carry more information about the genotype-phenotype link than do the underlying SNPs.

SNP & Haplotype
  SNP: Single Nucleotide Polymorphism
  Haplotype: a set of closely linked genetic markers present on one chromosome which tend to be inherited together (not easily separable by recombination).
  A set of SNP alleles on one chromosome (for example G, A, C) is a SNP haplotype.

Linkage and Linkage Disequilibrium (1)
  Linkage: the tendency of genes or other DNA sequences at specific loci to be inherited together as a consequence of their physical proximity on a single chromosome.
  Linkage disequilibrium (allelic association): particular alleles at two or more neighboring loci show allelic association if they occur together with frequencies significantly different from those predicted from the individual allele frequencies.
  Linkage is a relation between loci, but association is a relation between alleles.

Linkage and Linkage Disequilibrium (2)
  $\theta$ = recombination fraction
    No linkage: $\theta = 0.5$
    Perfect linkage: $\theta = 0$
  $\rho$ = probability of allelic association
    Linkage disequilibrium: $0 < \rho \le 1$
    Linkage equilibrium: $\rho = 0$
    Complete linkage disequilibrium: $\rho = 1$

Allelic Association (LD) Morton et al. (2001)
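                      Locus B
  Locus A             Allele 1      Allele 2      Allele frequency
  Allele 1            $\pi_{11}$    $\pi_{12}$    $Q$
  Allele 2            $\pi_{21}$    $\pi_{22}$    $1 - Q$
  Allele frequency    $R$           $1 - R$       1
  A, B: diallelic loci; $\pi_{11}, \pi_{12}, \pi_{21}, \pi_{22}$: frequencies of the four haplotypes; $\rho$: association probability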

Measures of LD
  Covariance: $D = |\pi_{11}\pi_{22} - \pi_{12}\pi_{21}|$
  Association: $\rho = D / [Q(1 - R)]$
  All other measures are functions of $Q$, $R$, and $\rho$.
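The two measures above can be computed directly from a 2x2 table of haplotype frequencies. The following is a minimal sketch, not the authors' code: the function name `ld_measures` is hypothetical, and it assumes that Q is the frequency of allele 1 at locus A (the first row total) and R the frequency of allele 1 at locus B (the first column total), with the table labelled so that the result lies between 0 and 1.

```python
def ld_measures(p11, p12, p21, p22):
    """Return (D, rho) for haplotype frequencies pi_11, pi_12, pi_21, pi_22."""
    total = p11 + p12 + p21 + p22
    if abs(total - 1.0) > 1e-9:
        raise ValueError("haplotype frequencies must sum to 1")
    q = p11 + p12  # assumption: Q = frequency of allele 1 at locus A
    r = p11 + p21  # assumption: R = frequency of allele 1 at locus B
    d = abs(p11 * p22 - p12 * p21)  # covariance D
    denom = q * (1.0 - r)
    rho = d / denom if denom > 0 else float("nan")  # association rho
    return d, rho


# Independent loci: every haplotype frequency is the product of its margins.
d_eq, rho_eq = ld_measures(0.25, 0.25, 0.25, 0.25)

# One haplotype is absent, so allele 1 at A always travels with allele 1 at B.
d_full, rho_full = ld_measures(0.2, 0.0, 0.3, 0.5)
```

In the first call both D and rho are zero (linkage equilibrium). In the second, the missing haplotype gives rho = 1 (complete linkage disequilibrium) even though D is only 0.1, which illustrates why D alone is hard to compare across allele frequencies.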

New Findings on Linkage Disequilibrium
  Along the chromosome there are blocks of limited haplotype diversity, in which more than 80% of a global human sample can typically be characterized by only three common haplotypes (Patil et al., Science 2001).
  Haplotype blocks are a more precise unit for describing genetic variation.
  Identification of haplotype structure, i.e., construction of a haplotype map, provides a basis for accurate and efficient association studies.

Daly et al. (2001). LD by distance from two markers

The Problem
  It is not yet easy to measure an individual's haplotypes (each person has only two)
  Molecular haplotyping (nucleotide sequencing) is the gold standard
  A more efficient strategy:
    Focus on regions, such as certain genes
    Estimate haplotypes from SNP data (genotypes)
    Use an LD map, and reduce the number of loci needed to represent the haplotype
    Use a haplotype map (DB) = key SNPs + haplotype blocks with strong LD

Haplotyping: Phase Problem
  [Figure: a diploid individual typed at two SNPs, SNP1 and SNP2]
  Observed genotypes: SNP1 G/T, SNP2 A/C
  Possible haplotype pairs: GA, TC or GC, TA
  n SNPs -> $2^n$ possible haplotypes
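The phase problem can be made concrete by listing every haplotype pair that is consistent with an unphased genotype. This is a small illustrative sketch; the function name `compatible_pairs` and the input layout (one pair of alleles per SNP) are assumptions made for the example.

```python
from itertools import product


def compatible_pairs(genotype):
    """List the unordered haplotype pairs consistent with an unphased genotype.

    genotype: list of (allele_a, allele_b) tuples, one per SNP.
    """
    pairs = set()
    # For each SNP, choose which of the two alleles sits on the first chromosome.
    for choice in product((0, 1), repeat=len(genotype)):
        h1 = "".join(site[c] for site, c in zip(genotype, choice))
        h2 = "".join(site[1 - c] for site, c in zip(genotype, choice))
        pairs.add(tuple(sorted((h1, h2))))  # sorting makes the pair unordered
    return pairs


# The slide's example: SNP1 is G/T and SNP2 is A/C.
example = compatible_pairs([("G", "T"), ("A", "C")])
```

For the slide's genotype this gives the two resolutions GA/TC and GC/TA. Homozygous sites add no ambiguity, so a genotype with k heterozygous sites (k >= 1) has 2^(k-1) compatible pairs.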

Molecular Haplotyping
  Hetero-duplex analysis, mismatch detection, allele-specific PCR:
    Have the potential for high throughput
    Only practical for short haplotypes (2-5 kb vs kb)
    Costly
  Rolling circle amplification method, etc.:
    Can handle larger sizes
    Difficult to automate

In-silico Haplotyping
  Alias: haplotype reconstruction, haplotype inference, computational haplotyping, statistical haplotyping, etc.
  Advantages:
    Cost effective
    High-throughput
  Difficulty:
    Phase ambiguity: the number of possible haplotypes increases exponentially with the number of SNPs

In-silico Haplotyping: Two Tasks
  I. Reconstruction of the haplotypes of the sampled individuals
  II. Estimation of haplotype frequencies in a population

In-silico Haplotyping: Approaches
  Clark's algorithm
  E-M algorithm (expectation-maximization algorithm)
  Bayesian algorithm

Clark's Algorithm 1) Find individuals that are homozygous at every locus or heterozygous at only one locus
  SNP1 T/T, SNP2 A/A, SNP3 C/C -> T-A-C (unambiguously defined)
  SNP1 T/T, SNP2 A/A, SNP3 C/G -> T-A-C and T-A-G

Clark's Algorithm 2) Try to solve each ambiguous genotype as a combination of an already solved haplotype and a new one
  SNP1 A/T, SNP2 A/A, SNP3 C/G
  T-A-C is an already solved haplotype -> the other haplotype must be A-A-G
  Continue until either all haplotypes have been solved or no more haplotypes can be found in this way
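The two steps can be sketched in a few lines. This is a simplified greedy illustration rather than Clark's published implementation: the names `clark` and `complement` are hypothetical, genotypes are given as one allele pair per SNP, and each ambiguous genotype is resolved with the first known haplotype that fits, so the result can depend on the order of the input.

```python
def complement(genotype, hap):
    """Return the haplotype that pairs with `hap` to give `genotype`, else None."""
    other = []
    for (a, b), h in zip(genotype, hap):
        if h == a:
            other.append(b)
        elif h == b:
            other.append(a)
        else:
            return None  # `hap` is not compatible with this genotype
    return "".join(other)


def clark(genotypes):
    known = []     # solved haplotypes, in order of discovery
    resolved = {}  # individual index -> (haplotype 1, haplotype 2)
    # Step 1: at most one heterozygous site means the phase is unambiguous.
    for i, g in enumerate(genotypes):
        if sum(a != b for a, b in g) <= 1:
            h1 = "".join(a for a, _ in g)
            h2 = "".join(b for _, b in g)
            resolved[i] = (h1, h2)
            for h in (h1, h2):
                if h not in known:
                    known.append(h)
    # Step 2: explain ambiguous genotypes as a known haplotype plus a new one.
    progress = True
    while progress:
        progress = False
        for i, g in enumerate(genotypes):
            if i in resolved:
                continue
            for h in known:
                other = complement(g, h)
                if other is not None:
                    resolved[i] = (h, other)
                    if other not in known:
                        known.append(other)
                    progress = True
                    break
    return resolved, known


# The three genotypes used on the slides.
example_genotypes = [
    [("T", "T"), ("A", "A"), ("C", "C")],
    [("T", "T"), ("A", "A"), ("C", "G")],
    [("A", "T"), ("A", "A"), ("C", "G")],
]
resolved, known = clark(example_genotypes)
```

On the slide data the third individual is resolved as T-A-C plus A-A-G, as in step 2. Individuals that never match a known haplotype are simply absent from `resolved`.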

Clark’s Algorithm problems
  No homozygotes or single-SNP heterozygotes -> the chain might never get started
  Many unsolved haplotypes may be left at the end
  Quite useful in practice!

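EM Algorithm
  Use a multinomial likelihood with HWE (Hardy-Weinberg equilibrium)
  $\Pr(\mathrm{AT}//\mathrm{AA}//\mathrm{CG}) = \Pr(\mathrm{AAC}/\mathrm{TAG}) + \Pr(\mathrm{AAG}/\mathrm{TAC}) \propto \Pr(\mathrm{AAC})\Pr(\mathrm{TAG}) + \Pr(\mathrm{AAG})\Pr(\mathrm{TAC})$
  (under HWE each pair of distinct haplotypes carries a factor of 2, which is common to both terms here)
  Fallin and Schork (2000) showed that EM is better than Clark's algorithm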
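The likelihood above leads to a simple iteration: weight each compatible haplotype pair by its probability under the current frequencies (E-step), then re-estimate the frequencies from the weighted counts (M-step). The sketch below is a minimal illustration under HWE, not the code used in the cited papers; the function names, the uniform starting values and the fixed number of iterations are assumptions.

```python
from collections import defaultdict
from itertools import product


def compatible_pairs(genotype):
    """Unordered haplotype pairs consistent with an unphased genotype."""
    pairs = set()
    for choice in product((0, 1), repeat=len(genotype)):
        h1 = "".join(site[c] for site, c in zip(genotype, choice))
        h2 = "".join(site[1 - c] for site, c in zip(genotype, choice))
        pairs.add(tuple(sorted((h1, h2))))
    return pairs


def em_haplotype_freqs(genotypes, n_iter=50):
    expansions = [sorted(compatible_pairs(g)) for g in genotypes]
    haps = sorted({h for pairs in expansions for pair in pairs for h in pair})
    freq = {h: 1.0 / len(haps) for h in haps}  # assumption: uniform start
    for _ in range(n_iter):
        counts = defaultdict(float)
        for pairs in expansions:
            # E-step: HWE probability of each pair (factor 2 for distinct haplotypes).
            weights = [(1.0 if h1 == h2 else 2.0) * freq[h1] * freq[h2]
                       for h1, h2 in pairs]
            total = sum(weights)
            for (h1, h2), w in zip(pairs, weights):
                counts[h1] += w / total
                counts[h2] += w / total
        # M-step: expected haplotype counts divided by the number of chromosomes.
        n_chrom = 2.0 * len(genotypes)
        freq = {h: counts[h] / n_chrom for h in haps}
    return freq


example_genotypes = [
    [("T", "T"), ("A", "A"), ("C", "C")],
    [("T", "T"), ("A", "A"), ("C", "G")],
    [("A", "T"), ("A", "A"), ("C", "G")],
]
freqs = em_haplotype_freqs(example_genotypes)
```

With these three genotypes the estimates move toward the AAG/TAC resolution of the ambiguous individual, because TAC is already common among the unambiguous individuals. Real implementations also monitor convergence and try several starting points, since EM can stop at a local maximum.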

A Gibbs sampler, Stephens et al (2001)
  $G = (G_1, \ldots, G_n)$: observed multilocus genotypes
  $H = (H_1, \ldots, H_n)$: unknown haplotype pairs
  $F = (F_1, \ldots, F_M)$: $M$ unknown population haplotype frequencies
  Choose an individual $i$ from all ambiguous individuals
  Sample $H_i^{(t+1)}$ from $\Pr(H_i \mid G, H_{-i}^{(t)})$
  Set $H_j^{(t+1)} = H_j^{(t)}$ for $j = 1, 2, \ldots, i-1, i+1, \ldots, n$
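The update scheme above can be sketched with a deliberately simple conditional distribution. Note the assumption: the sketch weights each compatible pair by the haplotype counts among the other individuals plus a small pseudo-count `alpha` (a Dirichlet-style prior), which is not the coalescent-based conditional that Stephens et al. use in their method. All function and parameter names are hypothetical.

```python
import random
from collections import Counter
from itertools import product


def compatible_pairs(genotype):
    """Unordered haplotype pairs consistent with an unphased genotype."""
    pairs = set()
    for choice in product((0, 1), repeat=len(genotype)):
        h1 = "".join(site[c] for site, c in zip(genotype, choice))
        h2 = "".join(site[1 - c] for site, c in zip(genotype, choice))
        pairs.add(tuple(sorted((h1, h2))))
    return pairs


def gibbs_haplotypes(genotypes, n_steps=200, alpha=0.1, seed=0):
    rng = random.Random(seed)
    options = [sorted(compatible_pairs(g)) for g in genotypes]
    # Start from an arbitrary compatible pair for every individual.
    current = [rng.choice(opts) for opts in options]
    ambiguous = [i for i, opts in enumerate(options) if len(opts) > 1]
    for _ in range(n_steps):
        if not ambiguous:
            break
        i = rng.choice(ambiguous)  # choose one ambiguous individual
        # Haplotype counts among everyone except individual i (this is H_{-i}).
        counts = Counter(h for j, pair in enumerate(current) if j != i for h in pair)
        # Simplified conditional: favour pairs built from haplotypes seen in others.
        weights = [(counts[h1] + alpha) * (counts[h2] + alpha)
                   for h1, h2 in options[i]]
        current[i] = rng.choices(options[i], weights=weights, k=1)[0]
        # All other individuals keep their current pair, as on the slide.
    return current


example_genotypes = [
    [("T", "T"), ("A", "A"), ("C", "C")],
    [("T", "T"), ("A", "A"), ("C", "G")],
    [("A", "T"), ("A", "A"), ("C", "G")],
]
sampled = gibbs_haplotypes(example_genotypes)
```

The function returns only the final state of the chain. A practical analysis would discard a burn-in period and summarise many sampled states, for example by taking the most frequently sampled pair for each individual.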

Haplotype Inference
  A: SNP data: 0 (MM), 1 (Mm), 2 (mm) for a single locus
  B: Haplotype data: 0 (M), 1 (m) for a single locus

  Individual   Haplotype pair
  #1           1, 2
  #2           1, 3
  #3           1, 4
  #4           1, 5
  #5           1, 1
  #6           1, 1
  Haplotype codes: 1 = 00000, 2 = 00100, 3 = 00010, 4 = 01001, 5 = 00001
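The two codings are linked by a simple rule: at each locus the SNP genotype code (0, 1 or 2) is the number of m alleles carried by the two haplotypes. The short sketch below shows the conversion; the function name is hypothetical, and the haplotype codes and pairs are one reading of the slide's table.

```python
def genotype_from_haplotypes(hap1, hap2):
    """Add two 0/1 haplotype strings locus by locus to get 0/1/2 genotype codes."""
    if len(hap1) != len(hap2):
        raise ValueError("haplotypes must cover the same loci")
    return [int(a) + int(b) for a, b in zip(hap1, hap2)]


# Haplotype codes and haplotype pairs as read from the slide's table.
haplotypes = {1: "00000", 2: "00100", 3: "00010", 4: "01001", 5: "00001"}
individuals = {"#1": (1, 2), "#2": (1, 3), "#3": (1, 4),
               "#4": (1, 5), "#5": (1, 1), "#6": (1, 1)}

snp_genotypes = {
    name: genotype_from_haplotypes(haplotypes[a], haplotypes[b])
    for name, (a, b) in individuals.items()
}
```

Going from haplotypes to genotypes is deterministic, while the reverse direction is the inference problem: any individual with a code of 1 at two or more loci has more than one compatible haplotype pair.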

An Example Data Set
  169 cases, 231 controls
  11 haplotypes
  Sex and age information

Logistic Regression Results
  Without adjusting for age and sex: haplotype 7 is most strongly associated, but the association is not statistically significant (p=0.07)
  Adjusting for age and sex: haplotype 11 is most strongly associated (p=0.03)
  The association is slightly stronger when the repeated measures (2 haplotypes per person) are accounted for with the GEE procedure (p=0.02)

Other Examples

Drysdale et al. PNAS 2000, 97(19) 10483–10488

Wallenstein, Hodge, and Weston, Genetic Epidemiology 15:173–181 (1998)

Cohort study Case-control study

Shaw et al. Am J of Medical Genet 114 205-213 (2002)

References
  Clark (1990). Inference of haplotypes from PCR-amplified samples of diploid populations. Mol Biol Evol 7:
  Excoffier and Slatkin (1995). Maximum likelihood estimation of molecular haplotype frequencies in a diploid population. Mol Biol Evol 12:
  Stephens, Smith, and Donnelly (2001). A new statistical method for haplotype reconstruction from population data. Am J Hum Genet 68,
  Niu, Qin, Xu and Liu (2002). Bayesian haplotype inference for multiple linked single-nucleotide polymorphisms. Am J Hum Genet 70;

Thank you! This file is available at /~hokim (Open Lecture Room, Seminar Materials)
