Presentation on theme: "Julia Krushkal 4/11/2017 The International HapMap Project: A Rich Resource of Genetic Information Julia Krushkal Lecture in Bioinformatics 04/15/2010."— Presentation transcript:
1 The International HapMap Project: A Rich Resource of Genetic Information. Julia Krushkal. Lecture in Bioinformatics, 04/15/2010.
2 The International HapMap Project. “…Determine the common patterns of DNA sequence variation in the human genome, by characterizing sequence variants, their frequencies, and correlations between them, in DNA samples from populations with ancestry from parts of Africa, Asia and Europe.” Nature (2003). Population-specific sequence variation. Allele frequencies. Linkage disequilibrium patterns. Haplotype information. Tag SNPs. Structural genome variation. Better understanding of human population dynamics and of the history of human populations. Cell lines available from the Coriell Institute for Medical Research. A rich resource for biomedical genetic analysis.
3 HapMap Population Samples. Project launched in 2002 to provide a public resource for accelerating medical genetic research. 270 individuals from 4 geographically diverse populations. YRI: 90 Yorubans from Ibadan, Nigeria (30 parent-offspring trios). CEU: 90 individuals of northern and western European descent living in Utah, USA, from the Centre d’Etude du Polymorphisme Humain (CEPH) collection. CHB: 45 unrelated Han Chinese from Beijing, China. JPT: 45 unrelated Japanese from Tokyo, Japan. CHB and JPT are combined in many analyses. (HapMap, NHGRI)
4 International HapMap Project Papers
The Int. HapMap Consortium. A second generation human haplotype map of over 3.1 million SNPs. Nature 449.
The Int. HapMap Consortium. A Haplotype Map of the Human Genome. Nature 437.
The Int. HapMap Consortium. The International HapMap Project. Nature 426.
The Int. HapMap Consortium. Integrating Ethics and Science in the International HapMap Project. Nature Reviews Genet 5.
Thorisson et al. The International HapMap Project Web site. Genome Res 15.
HapMap-related papers
Sabeti et al. Genome-wide detection and characterization of positive selection in human populations. Nature 449.
Clark et al. Ascertainment bias in studies of human genome-wide polymorphism. Genome Res 15.
Clayton et al. Population structure, differential bias and genomic control in a large-scale, case-control association study. Nature Genet 37(11).
de Bakker et al. Efficiency and power in genetic association studies. Nature Genet 37.
Goldstein, Cavalleri. Genomics: Understanding human diversity. Nature 437.
Hinds et al. Whole genome patterns of common DNA variation in three human populations. Science 307.
Myers et al. A fine-scale map of recombination rates and hotspots across the human genome. Science 310.
Nielsen et al. Genomic scans for selective sweeps using SNP data. Genome Res 15.
Smith et al. Sequence features in regions of weak and strong linkage disequilibrium. Genome Res 15.
Weir et al. Measures of human population structure show heterogeneity among genomic regions. Genome Res 15.
6 Human Chromosomes Contain DNA. 22 pairs of autosomes + sex chromosomes (X and Y) + mitochondrial genome. They contain functional units (genes) and other DNA. The human genome sequence is available as a reference, as a result of the Human Genome Project. A significant amount of inter-individual variation exists.
7 Chromosomes are sets of continuously linked genetic loci. Example: integrated map of chromosome 5 from the International HapMap Project.
8 Genetic Variation. Some DNA loci vary among individuals. Linked genetic loci are inherited non-independently. Loci may change with time (mutation, selection, genetic drift). Some DNA changes lead to quantitative changes in RNA expression and to quantitative or qualitative changes in protein production. Some genetic changes, even small ones, may lead to disease. A large amount of natural variation occurs in healthy individuals, i.e., many changes are neutral. Loci genetically linked to the disease-causing locus can be used as genetic markers to search for the disease locus (diagram labels: SNP1, SNP2). There are many types of DNA variation, e.g., sequence variation (AAAC/TGGCTA) and microsatellite repeats (…AATG AATG AATG AATG…).
9 Polymorphic Site. A locus with common DNA variation: ≥2 alleles in a population. Shows difference in DNA sequence among individuals. In most definitions: the most common allele has frequency < 99%, or minor allele frequency (MAF) ≥ 1%, or MAF ≥ 2%, or at least two alleles have frequencies ≥ 1%. A locus whose rare allele occurs in < 1% of the population is usually not considered a polymorphic site. 90% of sequence variation among individuals is due to common variation (MAF ≥ 1%); 10% is due to rare variants. Not all disease-predisposing variants are common.
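The MAF-based definition above can be illustrated with a minimal Python sketch. The function names and the toy allele sample are hypothetical, and the MAF is taken here to be the frequency of the second most common allele, which is one common convention and an assumption for multiallelic loci.

```python
from collections import Counter

def minor_allele_frequency(alleles):
    """Frequency of the second most common allele (0.0 if fewer than 2 alleles)."""
    counts = Counter(alleles)
    if len(counts) < 2:
        return 0.0
    freqs = sorted((c / len(alleles) for c in counts.values()), reverse=True)
    return freqs[1]

def is_polymorphic(alleles, maf_threshold=0.01):
    """Apply the 'MAF >= 1%' definition from the slide (threshold is adjustable)."""
    return minor_allele_frequency(alleles) >= maf_threshold

# Toy sample of 200 chromosomes: 180 carry A, 20 carry C.
sample = ["A"] * 180 + ["C"] * 20
print(minor_allele_frequency(sample), is_polymorphic(sample))
```

Passing `maf_threshold=0.02` or `0.05` gives the stricter definitions mentioned on the slides.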
10 SNP = Single Nucleotide Polymorphism. A SNP locus on the distal end of the long arm of human chromosome 5 (data from Ensembl). SNP locus rs: CAAATTCCATG[A or C]AGAAGGAAATACAT. A and C are alleles at SNP locus rs
11 A SNP locus on the distal end of the long arm of chromosome 5 SNP locus rs
12 Hardy-Weinberg Equilibrium. 2 alleles, A and B, with frequencies $p$ and $q$, $p + q = 1$. The allele frequencies remain constant through time. (Diagram: sperm × egg → F1.) Under Hardy-Weinberg equilibrium, the relative genotype frequencies are $P_{AA} = p^2$, $P_{AB} = 2pq$, $P_{BB} = q^2$; F1: $(p + q)^2$. In autosomal genes, and in absence of disturbing influences, this proportion is maintained through all subsequent generations. Departures can be characterized by disequilibrium.
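As a rough illustration of how a departure from these proportions might be checked, here is a minimal Python sketch of a one-degree-of-freedom chi-square test on genotype counts. The function name and the example counts are hypothetical, and the chi-square test is only one possible choice (exact tests are often preferred for small counts).

```python
import math

def hardy_weinberg_test(n_aa, n_ab, n_bb):
    """Return (p, expected counts, chi-square, p-value) for a biallelic locus."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("at least one genotyped individual is required")
    p = (2 * n_aa + n_ab) / (2 * n)   # frequency of allele A
    q = 1.0 - p                       # frequency of allele B
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return p, expected, chi2, p_value

# Counts exactly in Hardy-Weinberg proportions for p = q = 0.5.
print(hardy_weinberg_test(25, 50, 25))
# A sample with no heterozygotes departs strongly from the expectation.
print(hardy_weinberg_test(50, 0, 50))
```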
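13 Linkage Disequilibrium. Associations among alleles at different loci: locus A (alleles A1, A2) and locus B (alleles B1, B2). Linkage disequilibrium coefficient (coefficient of association): $D = p_{A_1B_1} - p_{A_1}p_{B_1}$. Normalized disequilibrium coefficient: $D' = D/|D|_{\max}$, with $|D|_{\max} = \min(p_{A_1}p_{B_2},\, p_{A_2}p_{B_1})$ when $D > 0$ (for $D < 0$ the bound is $\min(p_{A_1}p_{B_1},\, p_{A_2}p_{B_2})$), so that $-1 \le D' \le 1$. Squared correlation coefficient: $r^2 = D^2/(p_{A_1}p_{A_2}p_{B_1}p_{B_2})$. It ranges from 0 to 1, where 1 means absolute or perfect linkage; 0.8 is the cutoff often used. Extended to multiallelic markers.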
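The three measures can be computed directly from one haplotype frequency and two allele frequencies. The following Python sketch assumes two biallelic loci and phased (haplotype) frequencies; the function name and example numbers are illustrative only.

```python
def ld_measures(p_a1b1, p_a1, p_b1):
    """Return (D, D', r^2) for two biallelic loci.

    p_a1b1: frequency of the A1-B1 haplotype
    p_a1, p_b1: frequencies of alleles A1 and B1
    """
    p_a2, p_b2 = 1.0 - p_a1, 1.0 - p_b1
    d = p_a1b1 - p_a1 * p_b1
    # The normalizing bound depends on the sign of D.
    if d >= 0:
        d_max = min(p_a1 * p_b2, p_a2 * p_b1)
    else:
        d_max = min(p_a1 * p_b1, p_a2 * p_b2)
    d_prime = d / d_max if d_max > 0 else 0.0
    denom = p_a1 * p_a2 * p_b1 * p_b2
    r2 = d * d / denom if denom > 0 else 0.0
    return d, d_prime, r2

# Perfect association: A1 always occurs with B1.
print(ld_measures(0.5, 0.5, 0.5))
# Complete LD (D' = 1) but low r^2 because allele frequencies differ.
print(ld_measures(0.3, 0.3, 0.6))
```

The second call shows why D' and r² are reported separately: D' can reach 1 whenever one of the four haplotypes is absent, while r² reaches 1 only when the two loci also have matching allele frequencies.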
14 Regulatory Interactions: The ENCODE Project. 2003: pilot project launched (1% of the genome). 2007: pilot project completed; production phase launched on the entire genome. High-throughput experimental and computational approaches to studies of DNA regulatory sites, regulatory interactions, and DNA modification. Components: Production Scale Effort, Pilot Scale Effort, Data Coordination Center, Technology Development Effort.
16 HapMap SNP Density Coverage. Genome SNP variation: size of the human genome is 3.2 × 10^9 bp; 99.9% identical; 9-10 mln SNPs may have MAF ≥ 5%; 30,000 genes. Phase I (published in 2005): 931,340 SNPs passed quality control; 1 SNP / 3000 bp; 11,500 nsSNPs; 10 ENCODE regions, 500 kb each, with 17,944 SNPs (1 SNP / 279 bp). Phase II (published in 2007), consensus data set: 3,107,620 SNPs, QC+ in all panels, polymorphic in ≥1 panel; 1 SNP / 875 bp; 25-30% of all SNPs with MAF ≥ 5%. Figure caption: the cumulative number of non-redundant SNPs is shown as a solid line, the number of SNPs validated by genotyping as a dotted line, and double-hit status as a dashed line.
18 HapMap Phase II. 21,177 SNPs from Phase I that had an ambiguous position or other low-reliability feature were not included in Phase II. Chimpanzee and rhesus macaque were used for comparisons and to infer ancestral states of SNPs. 3,107,620 SNPs, QC+ in all panels, polymorphic in ≥1 panel; 1 SNP / 875 bp; 25-30% of all SNPs with MAF ≥ 5%. 98.6% of the genome is within 5 kb of the nearest polymorphic SNP. Better representation of rare variation / SNPs with MAF 1%. Phase II marker data capture the overwhelming majority of genome SNP variation, mean r2 of … for different populations.
21 SNP Differences among Individuals Far Exceed Differences among Populations. Phase I, autosomes: across the 1 million SNPs genotyped, only 11 have fixed differences between CEU and YRI, 21 between CEU and CHB/JPT, and 5 between YRI and CHB/JPT. X chromosome: 123 SNPs were completely differentiated between YRI and CHB/JPT, but only 2 between CEU and YRI and 1 between CEU and CHB/JPT.
22 Importance of Understanding Patterns of Human Genetic Variation. Without knowing the patterns of correlation, one would need to analyze millions of SNPs and other polymorphisms in the genome. Alleles at nearby loci occur non-independently. Knowledge about correlations among polymorphisms allows us to significantly reduce the number of genetic tests, while surveying extensively for variation patterns. Patterns of correlation are complex. Need to know local patterns of genetic variation rather than simply use SNPs at regular intervals.
23 Haplotype Maps of the Human Genome. Genome regions are decomposed into discrete haplotype blocks, which capture similarity in haplotype organization. Patil et al. 2001, Blocks of Limited Haplotype Diversity Revealed by High-Resolution Scanning of Human Chromosome 21. Science 294(5547).
24 Haplotype Maps Generated by The International HapMap Project. 3 steps of the HapMap construction: (a) SNPs are identified in DNA samples from multiple individuals. (b) Adjacent SNPs that are inherited together are compiled into haplotypes. (c) "Tag" SNPs are identified within haplotypes that uniquely describe those haplotypes. Source: The International HapMap Project.
25 Haplotype Maps of the Human Genome. Helmuth 2001, Science 293. Find correlations among groups of SNPs. Haplotypes were inferred for the HapMap project from trio data and from unrelated individuals using PHASE (Stephens 2001; Stephens and Donnelly 2003).
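Step (c) can be sketched as a small search problem: given the haplotypes from step (b), find the fewest SNP positions whose alleles still tell every haplotype apart. The Python sketch below is a brute-force illustration under that assumption, with a hypothetical function name and toy haplotypes; it is not the procedure used by the HapMap Project, and exhaustive search is only practical for short regions.

```python
from itertools import combinations

def find_tag_snps(haplotypes):
    """Smallest set of SNP positions that distinguishes all distinct haplotypes.

    haplotypes: equal-length strings, one character per SNP position.
    Returns a list of 0-based positions (the first minimal set found).
    """
    distinct = sorted(set(haplotypes))
    n_sites = len(distinct[0])
    for k in range(n_sites + 1):
        for positions in combinations(range(n_sites), k):
            # Project each haplotype onto the candidate tag positions.
            projected = {tuple(h[i] for i in positions) for h in distinct}
            if len(projected) == len(distinct):
                return list(positions)
    return list(range(n_sites))

# Four toy haplotypes over four SNPs; positions 1 and 2 never vary.
haps = ["ACGT", "ACGA", "TCGT", "TCGA"]
print(find_tag_snps(haps))
```

Here genotyping only the returned positions is enough to identify which of the four haplotypes a chromosome carries.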
26 Haplotype Block Partition Results for Three Populations. 1,586,383 SNPs genotyped in 71 Americans of European, African, and Asian ancestry (Hinds et al., Science).
Population          Blocks    Average size, kb*   Required SNPs**
African-American    235,663   8.8                 570,886
European-American   109,913   20.7                275,960
Han Chinese         89,994    25.2                220,809
* Average distance spanned by segregating sites in each block.
** Minimum number of SNPs required to distinguish common haplotype patterns with frequencies of 5% or higher.
27 Population differences in local bin structure (Hinds et al. 2005). Figure caption: extended LD bin and haplotype block structure around the CFTR gene. LD bins, where each bin has at least one SNP with r2 > 0.8 with every other SNP, are depicted as light horizontal bars, with the positions of constituent SNPs indicated by vertical tick marks as well as the extreme ends of the bars. Isolated SNPs are indicated by plain tick marks. Haplotype blocks, within which at least 80% of observed haplotypes could be grouped into common patterns with frequencies of at least 5%, are depicted as dark horizontal bars. Unlike haplotype blocks, which are by design sequential and nonoverlapping, SNPs in one LD bin can be interdigitated with SNPs in multiple other overlapping bins. Differences in allele and haplotype frequencies: “Although analysis panels are characterized both by different haplotype frequencies and, to some extent, different combinations of alleles, both common and rare haplotypes are often shared across populations” (The Int. HapMap Project, Nature, 2005).
28 Amount of Captured Sequence Variation in HapMap Phase II. For common variants (MAF ≥ 0.05) the mean maximum r2 of any SNP to a typed one is 0.90 in YRI, 0.96 in CEU and 0.95 in CHB/JPT. 1.09 million SNPs capture all common Phase II SNPs with r2 ≥ 0.8 in YRI. Very common SNPs with MAF ≥ 0.25 are captured extremely well (mean maximum r2 of 0.93 in YRI to 0.97 in CEU). Rarer SNPs with MAF < 0.05 are less well covered (mean maximum r2 of 0.74 in CHB/JPT to 0.76 in YRI).
29 Amount of Captured Sequence Variation in HapMap Phase II. Additional tag SNPs are unlikely to capture large groups of additional SNPs. HapMap haplotype information can be used to phase new data and to impute missing data.
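The "mean maximum r2" statistic can be sketched as follows: for each untyped SNP, take its highest r2 with any typed SNP, then average over the untyped SNPs. This Python sketch makes two assumptions that are not from the slides: genotypes are coded as 0/1/2 allele counts, and r2 is approximated by the squared Pearson correlation of those counts rather than computed from phased haplotypes. The SNP names and data are invented.

```python
def genotype_r2(x, y):
    """Squared Pearson correlation between two genotype vectors (0/1/2 coding)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        return 0.0  # a monomorphic SNP carries no correlation information
    return sxy * sxy / (sxx * syy)

def mean_max_r2(genotypes, typed):
    """Average, over untyped SNPs, of the best r2 to any typed SNP."""
    untyped = [snp for snp in genotypes if snp not in typed]
    best = [
        max(genotype_r2(genotypes[snp], genotypes[t]) for t in typed)
        for snp in untyped
    ]
    return sum(best) / len(best)

# Six individuals, four hypothetical SNPs; only snp1 is on the "chip".
genotypes = {
    "snp1": [0, 1, 2, 1, 0, 2],
    "snp2": [0, 1, 2, 1, 0, 2],   # identical to snp1
    "snp3": [2, 1, 0, 1, 2, 0],   # perfectly anti-correlated with snp1
    "snp4": [0, 0, 1, 2, 1, 2],   # weakly correlated with snp1
}
print(mean_max_r2(genotypes, typed={"snp1"}))
```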
31 DNA Chips and Resequencing: High-throughput Analysis of Sequence Variation. An easy way to access genome-wide variation. Both Affymetrix and Illumina DNA chips contain representative SNP and CNV probes. Affymetrix GeneChip 6.0: 1.8 million markers for genetic variation, including 906,000 SNPs and 946,000 copy number probes. Illumina 1M Bead Chip and 1M-duo Bead Chip: ~950,000 genome-spanning tag SNPs; ~100,000 additional non-HapMap SNPs; >565,000 SNPs in and near coding regions such as nsSNPs, promoter regions, 3’ and 5’ UTRs; dense coverage in ADME and MHC regions; ~260,000 markers located in novel and reported copy number polymorphic regions. Sequenom mass arrays (based on MALDI-TOF).
32 Common Ancestry and Segmental Sharing. Figure labels: relatedness (high, medium, low); homozygosity.
33 Recombination Hotspots. 32,996 recombination hotspots. 60% of genome recombination, 6% of sequence.
34 Recombination and tagSNPs. Recombination hotspots are frequently insufficient to break down allelic associations. Common haplotypes often span recombination hotspots. 0.5-1% of SNPs are untaggable: no SNPs with r2 ≥ 0.2 within 100 kb. Untaggable SNPs are not in segmental duplications. They often are in recombination hotspots; some may be due to genealogical structure, mutation hotspots, or gene conversion.
35 Demographic History of Human Populations. Genealogical history and allelic associations. Figure caption: the genealogy for the 13 haplotypes observed in a 40-kb region of Chromosome 1 (between SNPs rs and rs932087) where there is no evidence for recombination. Location of polymorphic mutations is indicated by circles. Relative frequency of each haplotype in the sample from each of the three panels (with white indicating 0% and black indicating 100%). The dotted line in the genealogy indicates a branch of the tree that is not present in the CEU sample and whose removal results in perfect association between SNPs rs and rs. Can track genealogical history. Complex patterns of stochastic mutation, recombination, selection, and genetic drift in evolutionary history shape the patterns of genome variation. McVean et al., PLoS Genetics 1:e54.
36 The Int. HapMap Consortium, Nature, 2005
37 HapMap 3. Funding agencies: National Institutes of Health, National Human Genome Research Institute (NHGRI); Wellcome Trust. Mirrors at Sanger Center and Baylor College of Medicine.
38 QC in HapMap 3. SNP filters: Hardy-Weinberg p > 0.000001 (per population); missingness < 0.05 (per population); < 3 Mendel errors (per population; only applies to YRI, CEU, ASW, MEX, MKK); SNP must have an rsID and map to a unique genomic location. The "consensus" data set contains data for 1115 individuals (558 males, 557 females; 924 founders and 191 non-founders), only keeping SNPs that passed QC in all populations (overall call rate is 0.998). The "consensus|polymorphic" data set has monomorphic SNPs (across the entire data set) removed.
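The filters listed above translate naturally into a per-population predicate followed by an intersection across populations. The Python sketch below is a hypothetical illustration: the `SnpStats` record, the field names, the population labels, and the SNP identifiers are invented, and populations without trios are assumed to report 0 Mendel errors.

```python
from dataclasses import dataclass

@dataclass
class SnpStats:
    rs_id: str             # empty string if the SNP has no rsID
    hwe_p: float           # Hardy-Weinberg p-value in this population
    missingness: float     # fraction of missing genotype calls
    mendel_errors: int     # 0 for populations without trios (assumption)
    unique_location: bool  # maps to a single genomic location

def passes_qc(s):
    """Per-population SNP filter using the thresholds quoted on the slide."""
    return (
        bool(s.rs_id)
        and s.unique_location
        and s.hwe_p > 1e-6
        and s.missingness < 0.05
        and s.mendel_errors < 3
    )

def consensus_snps(stats_by_population):
    """rsIDs of SNPs that pass QC in every population."""
    passing = [
        {s.rs_id for s in stats if passes_qc(s)}
        for stats in stats_by_population.values()
    ]
    return set.intersection(*passing) if passing else set()

data = {
    "POP_A": [SnpStats("rs_ex1", 0.5, 0.01, 0, True),
              SnpStats("rs_ex2", 1e-7, 0.00, 0, True)],   # fails HWE here
    "POP_B": [SnpStats("rs_ex1", 0.2, 0.02, 1, True),
              SnpStats("rs_ex2", 0.9, 0.00, 0, True)],
}
print(consensus_snps(data))
```

Removing monomorphic SNPs for the "consensus|polymorphic" set would be one further filter on allele counts across the whole data set, which this sketch leaves out.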
43 PCR RESEQUENCING DATA. “The sequence-based variant calls were generated by tiling with PCR primer sets spaced approximately 800 bases apart across the ENCODE 3 regions. Following filtering of low-quality reads, the data were analyzed with SNP Detector version 3 for polymorphic site discovery and individual genotype calling. Various QC filters were then applied. Specifically, we filtered out PCR amplicons with too many SNPs, and SNPs with discordant allele calls in multiple amplicons. Also filtered out were SNPs with low completeness in samples, or with too many conflicting genotype calls in two different strands. In the QC+ data set, … filtered out samples which had low completeness, and filtered out SNPs with low call rate in each population (<80%) and not in HWE (p<0.001). In the QC+ data set, the overall false positive rate is ~3.2%, based on a limited number of validation assays.”
44 Data Content: PCR RESEQUENCING DATA
Label   Number of samples
ASW     55
CEU     119
CHB     90
CHD     30
GIH     60
JPT     91
LWK     60
MEX     27
MKK     0
TSI     60
YRI     120
total   712
46 HapMap Project is a Unique Resource for Genome-Wide Association Studies. Resource for selection of representative tag SNPs from low diversity haplotype blocks or from highly correlated SNPs. Tag SNPs with r2 ≥ 0.8 chosen for popular SNP chips. Resource for selecting custom SNPs for dense genotyping in candidate regions, determined from genetic pathways of the 1st stage of multistage GWAS. LD and haplotype information utilized for missing SNP imputation for genotypic problems or in meta-analyses.
47 As of 04/15/10, this table includes 543 publications and 2658 SNPs.
48 Published Genome-Wide Associations through 6/2009: 439 published GWA at p < 5 × 10^-8. NHGRI GWA Catalog.
50 Genotype Imputation Using HapMap Information
51 Genotype Imputation Using HapMap Information. Software from Jonathan Marchini’s group.
52 Use of HapMap Resources in Meta-Analyses. Imputation. Common HapMap panel & tagSNPs.
53 Structural Genome Variation. HapMap samples are also used as a resource for CNV analysis. A large number of copy number variants (CNVs) and other genome rearrangements are found among individuals. Some variation is assumed normal; other variation may cause disease. Genome databases, e.g. the Database of Genomic Variants at the TCAG of the Hospital for Sick Children in Toronto, and the Copy Number Variation Project Map at the Sanger Center.
54 Segmental duplications are recombination hotspots, causing global genome rearrangements
56 Ethical, Legal, and Social Implications: Issues. Patterns of variation can be compared among individuals and populations, e.g. genetic profiling, racial profiling, population history studies. Extensive genetic information on donors is publicly available. Generalization of biomedical results, stigmatization, genetic determinism. Limitation of population identifiers: loose, self-described populations; limited number of representatives, e.g., individuals sampled from a residential community at Beijing Normal University do not represent all 56 officially recognized ethnicities in China. Future use of cell line samples from the same donors: they will not be able to withdraw their samples. Nature Reviews Genet.
57 Ethical, Legal, and Social Implications: Approaches. Ethical considerations incorporated from the inception of the HapMap project. Choice of several world populations. New samples obtained with appropriate consent, rather than use of previously stored samples. No personal identifiers included (CEPH samples have links to individuals, strictly confidential). No medical information included. Population and gender information included. Community engagement, taking into account international and local ethical guidelines. Community Advisory Groups established; they can withdraw community samples. Sensitivity to cultural issues. Nature Reviews Genet.
67 Beyond the HapMap: 1000 Genomes. An international research consortium formed to create the most detailed and medically useful picture to date of human genetic variation. The project involves sequencing the genomes of approximately 1200 people from around the world and receives major support from the Wellcome Trust Sanger Institute in Hinxton, England, the Beijing Genomics Institute Shenzhen in China and the National Human Genome Research Institute (NHGRI), part of the National Institutes of Health (NIH).
69 The goal of the 1000 Genomes Project is to find most genetic variants that have frequencies of at least 1% in the populations studied
73 Both Affymetrix and Illumina’s newest genotyping platforms include variants discovered in the 1000 genomes project.
74 Illumina HumanOmni1-Quad BeadChip Kits. 4-sample. Contains aggressively selected SNP and CNV probes. Markers derived from the 1,000 Genomes Project, all three HapMap phases, and recently published studies. >1 mln available assays per sample, containing carefully selected content that delivers dense coverage of the human genome and targets regions known to play a role in human disease. SNP selection optimized to maintain comprehensive genomic coverage, while reducing tag SNP redundancy. This has enabled the inclusion of additional content carefully chosen to target high-value regions of the genome and new coding variants identified by the 1000 Genomes Project, including: ABO blood typing SNPs, cSNPs, disease-associated SNPs, eSNPs, SNPs in mRNA splice sites, Absorption, Distribution, Metabolism and Excretion (ADME) genes, Ancestry-Informative Markers (AIMs), HLA complexes, indels, introns, MHC regions, miRNA binding sites, mitochondrial DNA, pseudoautosomal region (PAR), promoter regions, and Y-chromosome.
75 Illumina HumanOmni1-Quad BeadChip Kits. Includes 10,000 SNPs targeting four 1 Mb regions known to be associated with three or more human diseases; > 31,000 SNPs predicted to be non-synonymous; 40,000 SNPs covering an additional 100 intervals surrounding published peak markers from the NHGRI GWAS database; and the remaining top single-marker associated SNPs from the GWAS database. High density markers with a median spacing of 1.2 kb ensure the highest level of resolution for CNV identification in the industry. A complete optimization of the BeadChip design increases the available complexity, while reducing the amount of required DNA to 200 ng.