Alkes Price Harvard School of Public Health April 23 & April 25, 2019

EPI 511, Advanced Population and Medical Genetics
Week 12: Functional interpretation of genetic associations

EPI511, Advanced Population and Medical Genetics
Week 12: Functional interpretation of genetic associations
• Final project: due date is officially May 17 at 5pm, but anytime before May 20 at 6am is ok.
• Every registered student is encouraged to schedule a 15min meeting with Alkes May 1-10 to discuss the final project.
• Apr 30 guest lecture: Alexander (Sasha) Gusev, Assistant Professor, DFCI/Harvard Medical School

Outline
1. Introduction to functional annotation data
2. Functional architectures of diseases and complex traits
3. LD-dependent architectures of diseases and complex traits
4. Improving association and fine-mapping using functional data

Functional annotations can provide an interpretation of genetic associations
—Nat Genet editorial, July 2012

Functional annotation of the human genome
“Here, we assign biochemical functions for 80% of the genome” (ENCODE Consortium 2012 Nature)
“Here, we detail the many transgressions involved in assigning functionality to almost every nucleotide in the human genome” (Graur et al Genome Biol Evol)

What does “functional annotation” mean?
• Coding regions: DNA that codes for proteins (~1% of genome)
• Untranslated regions (UTR): flanking coding regions (~1%)
• Promoter: RNA polymerase binding => transcription (~2%)
• Cis-regulatory regions (up to 1,000,000bp away from genes)
• Enhancer: region where transcription factors bind to activate transcription.
Pennacchio et al Nat Rev Genet; Shlyueva et al Nat Rev Genet

Enhancer activity can vary by tissue or cell type!
Shlyueva et al Nat Rev Genet

How do we figure out where the enhancers are, for a given tissue or cell type?
i. Identify DNase I Hypersensitivity Sites (DHS) via DNase-seq.
ii. Identify histone marks via ChIP-Seq.
iii. Other experimental assays.
iv. Use gene expression data to perform eQTL mapping.
v. Binding sites of known transcription factors.
vi. Identify promoter-enhancer interactions using Hi-C.
The ENCODE Project Consortium 2012 Nature; Roadmap Epigenomics Consortium 2015 Nature
dsQTL: Degner et al Nature; ATAC-QTL: Gate et al Nat Genet; hQTL: Waszak et al Cell, Chen et al Cell, Ng et al Nat Neurosci
also see van de Geijn et al Nat Methods, Taudt et al Nat Rev Genet

i. Identify DNase I Hypersensitivity Sites (DHS) via DNase-seq.
• Closed chromatin => transcription factors cannot bind. Open chromatin => transcription factors can bind.
• DNase I = enzyme that cuts DNA at regions of open chromatin.
• DNase-seq = sequence regions cut by DNase I to identify regions of open chromatin (DHS).
Thurman et al Nature, Shlyueva et al Nat Rev Genet; also see Buenrostro et al Nat Methods (ATAC-seq), Gate et al Nat Genet (ATAC-QTL)

ii. Identify histone marks via ChIP-Seq (Park et al Nat Rev Genet).
Barski et al Cell, Ernst et al Nature, Shlyueva et al Nat Rev Genet; also see Roadmap Epigenomics Consortium 2015 Nature, Lesch et al Nat Genet
hQTL: Waszak et al Cell, Chen et al Cell, Ng et al Nat Neurosci

ii. Identify histone marks via ChIP-Seq.
• Use chromHMM to define chromatin states from histone marks.
• Cluster 2.3 million enhancer regions into 226 modules based on coordinated activity patterns across 111 cell types.
Ernst et al Nature; Shlyueva et al Nat Rev Genet; Roadmap Epigenomics Consortium 2015 Nature

iii. Other experimental assays.
• FANTOM5 project: experimentally determined enhancers from Cap Analysis of Gene Expression (CAGE).
• Claim: Enhancer reporter assay shows higher specificity (70% vs %) vs. predicted enhancers from ENCODE (The ENCODE Project Consortium 2012 Nature).
Andersson et al Nature; also see Core et al Nat Genet, Lara-Astiaso et al Science

iv. Use gene expression data to perform eQTL mapping.
Figure from Won et al Nature; also see Cheung et al Nature, Stranger et al Nat Genet, Lappalainen et al. 2013 Nature, Wright et al Nat Genet, GTEx Consortium 2017 Nature
Also see research connecting eQTL mapping to GWAS of diseases and complex traits: Nicolae et al PLoS Genet, Gamazon et al Nat Genet, Gusev et al. 2016 Nat Genet, Zhu et al Nat Genet, Franzen et al Science, Gusev et al. 2018 Nat Genet, Hormozdiari et al Nat Genet, Gamazon et al Nat Genet

v. Binding sites of known transcription factors.
• In vivo approach: use ChIP-Seq (Park et al Nat Rev Genet) to see where in the genome each transcription factor binds (can combine with DNase-seq data; Moyerbrailean et al PLoS Genet).
• In vitro approach: measure binding of each transcription factor to many short sequences, to predict binding genome-wide (e.g. DeepBind; Alipanahi et al Nat Biotechnol).
• However, the sequence “motifs” at which transcription factors bind are still poorly understood for most transcription factors.
Reviewed in Levo & Segal 2014 Nat Rev Genet

vi. Identify promoter-enhancer interactions using Hi-C.
• Hi-C data can be combined with evidence from other sources (Won et al Nature).
Dixon et al Nature, Rao et al Cell, Schmitt et al Cell Reports, Won et al Nature; reviewed in Pombo & Dillon 2015 Nat Rev Mol Cell Biol

Limited role for protein-coding regions in GWAS associations
• 465 GWAS-associated SNPs (NHGRI GWAS catalogue), based on 1 lead SNP per trait and genomic region.
• 11% of GWAS-associated SNPs lie in coding regions spanning 1% of the genome.
• 9% non-synonymous vs. 2% synonymous.
• 4-fold enrichment for non-synonymous SNPs when counting all SNPs with $P < 5 \times 10^{-8}$ and SNPs in LD with those SNPs.
Hindorff et al PNAS

Major role for DNase I Hypersensitivity Sites
• 5,134 GWAS-associated SNPs (NHGRI GWAS catalogue).
• 57% of these SNPs lie in DHS spanning 42% of the genome (from 349 cell/tissue types; ENCODE and Roadmap projects).
• Larger DHS enrichment for externally replicated GWAS SNPs.
Maurano et al Science; also see Thurman et al Nature, Trynka et al Am J Hum Genet
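The percentages on the coding-region and DHS slides can be turned into fold enrichments by dividing the share of associated SNPs in an annotation by the share of the genome the annotation spans. A minimal sketch; the function name `fold_enrichment` is ours for illustration and does not come from the cited papers:

```python
def fold_enrichment(prop_snps: float, prop_genome: float) -> float:
    """Share of associated SNPs in an annotation divided by the share of the genome it spans."""
    if not 0 < prop_genome <= 1:
        raise ValueError("prop_genome must be in (0, 1]")
    return prop_snps / prop_genome


# Percentages quoted on the slides (Hindorff et al; Maurano et al).
coding = fold_enrichment(0.11, 0.01)  # coding regions
dhs = fold_enrichment(0.57, 0.42)     # DNase I hypersensitivity sites
print(f"coding: {coding:.1f}x, DHS: {dhs:.2f}x")
```

By this crude measure coding regions are roughly 11x enriched and DHS roughly 1.4x enriched, yet DHS contain the majority of associated SNPs because they span far more of the genome.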

Linking enhancers to the genes they regulate
• Idea: Enhancers that regulate a gene will have similar patterns of activation as the promoter of the regulated gene, leading to enhancer-promoter correlation across cell types in DHS patterns.
• 419 GWAS-associated SNPs annotated as DHS in a subset of cell types are strongly correlated (> 0.7) to a “nearby” promoter in their DHS patterns.
Maurano et al Science; also see Thurman et al Nature, Trynka et al Am J Hum Genet
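The correlation idea can be sketched in a few lines: treat each element's DHS signal across cell types as a vector and keep promoters whose vector correlates with the enhancer's above 0.7. The data below are made-up toy values, and `linked_promoters` is a hypothetical helper rather than the procedure of Maurano et al, which also restricts to nearby promoters:

```python
import numpy as np


def linked_promoters(enhancer_dhs, promoter_dhs, min_r=0.7):
    """Return {promoter: r} for promoters whose DHS pattern across cell types
    has Pearson correlation above min_r with the enhancer's pattern."""
    enhancer = np.asarray(enhancer_dhs, dtype=float)
    links = {}
    for name, pattern in promoter_dhs.items():
        r = np.corrcoef(enhancer, np.asarray(pattern, dtype=float))[0, 1]
        if r > min_r:
            links[name] = float(r)
    return links


# Toy DHS signal in six cell types (hypothetical values).
enhancer = [0, 1, 1, 0, 1, 0]
promoters = {
    "promA": [0, 1, 1, 0, 1, 0.2],  # active in the same cell types
    "promB": [1, 0, 1, 1, 0, 1],    # mostly active in other cell types
}
links = linked_promoters(enhancer, promoters)
print(links)
```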

Histone QTLs are enriched for GWAS hits
Enrichment of hQTL in GWAS vs. random SNPs, matched for MAF and distance to nearest gene.
Waszak et al Cell; also see Chen et al Cell, Ng et al Nat Neurosci

Linking enhancers to the genes they regulate
• Idea: Histone mark enhancers that regulate a gene will lead to histone mark-gene expression correlation across individuals.
• 4,568 (22%) genes have expression correlated to a histone mark at 0.1% FDR.
Waszak et al Cell; also see Chen et al Cell, Ng et al Nat Neurosci

Identifying critical cell types for each disease
• Idea: for a given disease, histone marks (e.g. H3K4me3) associated with promoter or enhancer activity may occur specifically in critical cell types relevant to that disease, in which the target gene is regulated.
• For a given disease (e.g. T2D) + histone mark (e.g. H3K4me3) + set of GWAS-associated (+ tag) SNPs for that disease, identify cell types with excess of histone mark peaks near those SNPs.
Trynka et al Nat Genet; also see PGC-Schizophrenia Working Group 2014 Nature, Farh et al Nature

Polygenic architectures motivate polygenic analyses of functional data
Boyle et al Cell; also see Wray et al Cell

Defining $h_g^2$
Heritability explained by genotyped SNPs ($h_g^2$), a parameter which depends on the population studied + phenotype + set of genotyped SNPs, is defined as the maximum proportion of phenotypic variance that can be explained by a linear combination of genotyped SNPs i.
• This is defined in the entire population, not estimated in a finite sample.
• No assumptions needed for this definition.
Yang et al Nat Genet
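Because the definition is a population-level maximum over linear combinations, it corresponds to the $R^2$ of the best linear predictor built from the genotyped SNPs. The simulation below is a sketch under assumptions of our own (independent SNPs, additive effects, a sample large enough that in-sample $R^2$ is close to the population value); all parameter values are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_snps, true_h2g = 20_000, 50, 0.3

# Independent SNPs with allele frequencies between 5% and 50%, then standardized.
maf = rng.uniform(0.05, 0.5, size=n_snps)
genotypes = rng.binomial(2, maf, size=(n_samples, n_snps)).astype(float)
X = (genotypes - genotypes.mean(axis=0)) / genotypes.std(axis=0)

# Per-SNP effects rescaled so the genetic variance is exactly true_h2g.
beta = rng.normal(size=n_snps)
beta *= np.sqrt(true_h2g / np.sum(beta ** 2))
y = X @ beta + rng.normal(scale=np.sqrt(1 - true_h2g), size=n_samples)
y = y - y.mean()

# Best linear combination of genotyped SNPs (ordinary least squares).
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
residual = y - X @ beta_hat
r2 = 1 - (residual @ residual) / (y @ y)
print(f"variance explained by best linear predictor: {r2:.3f}")
```

With far more samples than SNPs the fitted $R^2$ sits close to the simulated value of 0.3; in real data with many SNPs and modest sample sizes, in-sample $R^2$ overfits, which is why dedicated estimators are needed.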

Defining $h_g^2$ across functional categories
Heritability explained by genotyped SNPs ($h_g^2$), a parameter defined in the entire population that depends on the phenotype + set of genotyped SNPs, is defined as the maximum proportion of phenotypic variance that can be explained by a linear combination of genotyped SNPs i.
With multiple functional categories f, even if SNPs are linked, this definition generalizes to the variances explained when jointly fitting weights $w_i$ for SNPs i in each category f.
Lee et al Nat Genet, Gusev et al Am J Hum Genet, Loh et al. 2015b Nat Genet

Assessing functional enrichment using raw genotype/phenotype data
• 72% of $h_g^2$ of height lies in genic regions spanning 49% of the genome (Yang, Manolio et al Nat Genet) (from Week 8)
• 31% of $h_g^2$ of schizophrenia lies in CNS+ gene regions spanning 20% of the genome (Lee et al Nat Genet)
• 38-79% of $h_g^2$ of 11 WTCCC diseases (6 autoimmune) lies in 16% of genome annotated as DHS (Gusev et al AJHG) [38% using genotyped SNPs; 79% using imputed SNPs]

Assessing functional enrichment using raw genotype/phenotype data
• $h_g^2$ is enriched in regions with high GC content
• N = 49,806 for schizophrenia; N = 54,734 for dyslipidemia and hypertension
Loh, Bhatia et al Nat Genet

Functional enrichment can be assessed using summary statistics
(Nat Genet editorial, July 2012)
Definition: Summary statistics consist of:
• GWAS association z-scores for each typed or imputed SNP
• Sample sizes on which z-scores were computed (may vary by SNP)
Note: Many applications also require LD information computed from a population reference panel, e.g. 1000 Genomes (2015 Nature).
Reviewed in Pasaniuc & Price 2017 Nat Rev Genet

Functional interpretation of $h_g^2$ using summary statistics
• Stratified LD score regression: Regress $\chi^2$ association statistics on LD (from reference panel) with different functional annotation categories.
Finucane et al Nat Genet

Average $\chi^2$ > 1 does not imply confounding
Linear regression: $E[\chi^2] = 1 + (h_g^2 N/M)\,\mathrm{LDscore}$, where M = #markers, N = #samples, and $\mathrm{LDscore}(m) = \sum_{m'} r^2(m, m')$.
Thus, average $\chi^2 = 1 + (h_g^2 N/M)\,\mathrm{LDscore}_{avg} = 1 + h_g^2 N/M_{eff}$, with $M_{eff} = M/\mathrm{LDscore}_{avg}$.
Inflation is not a bad thing! (see Yang et al Eur J Hum Genet)
(from Week 8) also see Yang et al Nat Genet

LD score regression (schizophrenia example)
$\lambda_{GC}$ = 1.484; average $\chi^2$ = 1.613; intercept i = 1.066
$\mathrm{LDscore}(m) = \sum_{m'} r^2(m, m')$
Regress $\chi^2 = i + s \cdot \mathrm{LDscore}$
Intercept i = 1 if no confounding, > 1 if confounding
(slide from Week 5) Bulik-Sullivan, Loh et al Nat Genet; SCZ data from PGC-SCZ 2014 Nature
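The two inflation summaries quoted for schizophrenia are simple functions of the association statistics. A minimal sketch, assuming z-scores are available as an array; `inflation_summary` is a hypothetical helper, and the demonstration uses simulated null z-scores rather than the PGC data:

```python
import numpy as np

CHI2_1DF_MEDIAN = 0.4549364  # median of a chi-square distribution with 1 d.f.


def inflation_summary(z_scores):
    """Return (lambda_GC, mean chi-square) for an array of association z-scores."""
    chi2 = np.asarray(z_scores, dtype=float) ** 2
    return float(np.median(chi2) / CHI2_1DF_MEDIAN), float(chi2.mean())


rng = np.random.default_rng(1)
null_z = rng.normal(size=200_000)  # simulated statistics with no signal and no confounding
lambda_gc, mean_chi2 = inflation_summary(null_z)
print(f"lambda_GC = {lambda_gc:.3f}, mean chi2 = {mean_chi2:.3f}")
```

Under the null both values are close to 1. Neither summary alone separates polygenic signal from confounding, which is what the regression intercept is used for.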

LD score regression: estimating $h_g^2$
$\mathrm{LDscore}(m) = \sum_{m'} r^2(m, m')$ ($r^2$ from reference panel, e.g. 1000 Genomes)
Linear regression: regress $\chi^2$ statistics on LD: $\chi^2 = i + s \cdot \mathrm{LDscore}$
• Intercept i = 1 if no confounding, > 1 if confounding
• Slope $s = h_g^2 N/M$, can be used to estimate $h_g^2$ (of all reference SNPs)
(from Week 8) Bulik-Sullivan, Loh et al Nat Genet
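The regression can be illustrated on synthetic data. The sketch below draws made-up LD scores, simulates $\chi^2$ statistics whose expectation follows the model with intercept 1, and recovers $h_g^2$ from the slope. It uses plain unweighted least squares as a simplification; published implementations use regression weights and a block jackknife, which are omitted here, and all parameter values are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(2)
n_snps, n_samples, true_h2g = 200_000, 50_000, 0.5

# Synthetic LD scores (a real analysis computes them from a reference panel).
ld_score = rng.uniform(1.0, 200.0, size=n_snps)

# Model: E[chi2] = 1 + (h2g * N / M) * LDscore, i.e. z ~ N(0, E[chi2]).
expected_chi2 = 1.0 + true_h2g * n_samples / n_snps * ld_score
chi2 = expected_chi2 * rng.chisquare(1, size=n_snps)

# Regress chi2 on LD score: chi2 = intercept + slope * LDscore.
design = np.column_stack([np.ones(n_snps), ld_score])
(intercept, slope), *_ = np.linalg.lstsq(design, chi2, rcond=None)
h2g_hat = slope * n_snps / n_samples
print(f"intercept = {intercept:.2f}, estimated h2g = {h2g_hat:.3f}")
```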

Stratified LD score regression: estimating $h_g^2$ for disjoint functional categories
$\mathrm{LDscore}_f(m) = \sum_{m' \in f} r^2(m, m')$ for functional category f ($r^2$ from reference panel, e.g. 1000 Genomes)
Multi-linear regression: regress $\chi^2$ statistics on LD with each category: $E[\chi^2] = 1 + \sum_f \left(N h_g^2(f)/M(f)\right) \mathrm{LDscore}_f$
Define enrichment(f) = (% $h_g^2$) / (% SNPs) = $\left(h_g^2(f)/h_g^2\right) / \left(M(f)/M\right)$
Finucane et al Nat Genet

Stratified LD score regression: estimating $h_g^2$ for overlapping functional categories
$\mathrm{LDscore}_f(m) = \sum_{m' \in f} r^2(m, m')$ for functional category f ($r^2$ from reference panel, e.g. 1000 Genomes)
Multi-linear regression: regress $\chi^2$ statistics on LD with each category: $E[\chi^2] = 1 + \sum_f N \tau_f\, \mathrm{LDscore}_f$
Define enrichment(f) = (% $h_g^2$) / (% SNPs) = $\left(h_g^2(f)/h_g^2\right) / \left(M(f)/M\right)$
Finucane et al Nat Genet
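A toy version of the overlapping-category regression is sketched below with two annotations: a base category containing every SNP and a second category covering about 20% of SNPs. The stratified LD scores are synthetic stand-ins, the τ values are hypothetical, and unweighted least squares replaces the weighting used in practice. Per-SNP heritability is taken as the sum of τ over the categories a SNP belongs to:

```python
import numpy as np

rng = np.random.default_rng(3)
n_snps, n_samples = 200_000, 50_000

# Annotation matrix: column 0 = all SNPs (base), column 1 = focal category.
in_category = rng.random(n_snps) < 0.2
annot = np.column_stack([np.ones(n_snps), in_category.astype(float)])
tau = np.array([1e-6, 4e-6])  # hypothetical per-SNP heritability contributions

# Synthetic stratified LD scores: LD to all SNPs and LD to SNPs in the category.
ld_all = rng.uniform(1.0, 200.0, size=n_snps)
ld_cat = ld_all * rng.uniform(0.0, 0.4, size=n_snps)
ld_scores = np.column_stack([ld_all, ld_cat])

# Model: E[chi2] = 1 + N * sum_f tau_f * LDscore_f.
expected_chi2 = 1.0 + n_samples * ld_scores @ tau
chi2 = expected_chi2 * rng.chisquare(1, size=n_snps)

# Multi-linear regression of chi2 on N * LDscore_f (plus an intercept).
design = np.column_stack([np.ones(n_snps), n_samples * ld_scores])
coef, *_ = np.linalg.lstsq(design, chi2, rcond=None)
tau_hat = coef[1:]


def enrichment(tau_vec):
    """(% of h2 in the focal category) / (% of SNPs in the focal category)."""
    per_snp_h2 = annot @ tau_vec
    prop_h2 = per_snp_h2[in_category].sum() / per_snp_h2.sum()
    return prop_h2 / in_category.mean()


true_enrichment = enrichment(tau)
estimated_enrichment = enrichment(tau_hat)
print(f"true {true_enrichment:.2f}x, estimated {estimated_enrichment:.2f}x")
```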

Functional annotations analyzed
Mark | Source/reference
Coding, 3’ UTR, 5’ UTR, Promoter, Intron | UCSC; Gusev et al AJHG
Digital Genomic Footprint, TFBS | ENCODE; Gusev et al AJHG
CTCF binding site, Promoter Flanking, Repressed, Transcribed, TSS, Enhancer, Weak Enhancer | ENCODE; Hoffman et al Nucleic Acids Res
DHS, H3K4me1, H3K4me3, H3K9ac | Trynka et al Nat Genet*
Conserved | Ward & Kellis 2012 Science
FANTOM5 Enhancer | Andersson et al Nature
lincRNAs | Cabili et al Genes Dev
DHS and DHS promoter | Maurano et al Science
H3K27ac | Roadmap; PGC Nature
*Post-processed from ENCODE and Roadmap data by S. Raychaudhuri and X. Liu labs
Note: 500bp windows around each category were also included as additional annotations.

Simulations: LD Score regression works well when the causal categories are in the model
• DHS (1x enr.) + non-DHS
• DHS (3x enr.) + non-DHS
• DHS only (5.5x enriched)

Simulations: LD Score regression works well when the causal categories are not in the model
• 200bp flanking DHS
• Coding
• FANTOM5 enhancer (Andersson et al Nature)

Next steps on LD score regression
If you feel like Dancin’ … -- Prince
If you feel like Data … -- Alkes

17 traits analyzed (summary statistics only)
*HLA locus excluded from all analyses

Coding, Enhancer, Super-enhancer, H3K4me3, Conserved, and FANTOM5 Enhancer enrichments

Cell-type specific histone mark enrichments inform trait biology
• 4 histone marks: H3K27ac, H3K4me1, H3K4me3, H3K9ac
• Up to 118 cell types; group into 10 cell-type groups
• For each trait (17 total) and each cell-type group (10 total): assess model fit when adding all cell-type specific annotations in that group as a single new annotation
Results for Schizophrenia, Height, Rheumatoid Arthritis:

Cell-type-specific heritability enrichments of specifically expressed genes inform trait biology
• 2 gene expression data sets:
  - GTEx (RNA-seq, 53 cell types; GTEx Consortium 2015 Science)
  - Franke lab (expression array, 152 cell types; Fehrmann et al Nat Genet)
• For each trait (48 total) and each cell type (205 total): assess model fit when adding cell-type-specific annotation of specifically expressed genes (±100kb) (Finucane et al Nat Genet)
Results for Schizophrenia, Height, Rheumatoid Arthritis:

Inferring low-frequency variant functional architectures by extending S-LDSC
Common variant enrichment (CVE) of an annotation = prop. of $h_c^2$ / prop. of common SNPs
Low-frequency variant enrichment (LFVE) of an annotation = prop. of $h_{lf}^2$ / prop. of low-frequency SNPs
• Separate annotations for common and low-frequency SNPs
• Also include binary annotations for 10 low-frequency MAF bins
• UK Biobank target samples + UK10K LD reference samples
• Simulations confirm robust results (not shown)
Gazal et al Nat Genet

LFVE is correlated to CVE; LFVE > CVE when CVE is large
[Figure: low-frequency variant enrichment (LFVE) vs. common variant enrichment (CVE)]
• 33 main annotations: r(LFVE,CVE) = 0.79
• Meta-analysis across 40 UK Biobank traits (average N = 363K); assoc. method: BOLT-LMM (Loh et al. biorxiv / in press Nat Genet)
• Non-synonymous variants: 17.3% of $h_{lf}^2$ vs. 2.1% of $h_c^2$ (even larger LFVE for non-synonymous variants predicted as damaging by PolyPhen-2, or in genes under strong selection as measured by shet)
• Regions conserved in primates: 43.5% of $h_{lf}^2$ vs. 21.2% of $h_c^2$

LFVE ≈ CVE for most regulatory annotations but LFVE > CVE for brain annotations
[Figure: low-frequency variant enrichment (LFVE) vs. common variant enrichment (CVE)]
• 637 cell-type-specific (CTS) annotation-trait pairs with significant CVE (Finucane et al Nat Genet)
• 55 brain annotation-trait pairs with LFVE/CVE > 2x
• H3K4me3 in brain DPFC-Neuroticism: 56.9% of $h_{lf}^2$ vs. 11.7% of $h_c^2$ (P = )
• H3K4me3 in germinal matrix-Age first birth: 63.2% of $h_{lf}^2$ vs. 11.1% of $h_c^2$ (P = 0.001)

CVE depends on both strength of selection and proportion of variants causal for trait
Forward simulations (SLiM2 + τEyre-Walker)
• $s_{dn}$ = avg selection coefficient of deleterious de novo variants
• π = probability of de novo variant to be causal for trait

LFVE/CVE ratio depends primarily on strength of selection
Forward simulations (SLiM2 + τEyre-Walker)
• Non-synonymous variants: LFVE/CVE = 5x, $s_{dn}$ = −0.003
• 55 brain annotation-trait pairs: LFVE/CVE > 2x, $s_{dn}$ < −0.0006

What does “LD-dependent architecture” mean?

• SNPs with higher LD have higher average $\chi^2$ association statistics due to increased tagging of causal variants (Pritchard & Przeworski 2001 Am J Hum Genet).
• Common SNPs have higher LD and higher causal variance than rare SNPs => SNPs with higher LD have higher causal variance (Zuk et al PNAS).
• Neither of these is what is meant here. “LD-dependent architecture”: dependence of causal effect sizes on the level of LD of a SNP, after conditioning on MAF.

LD-dependent architectures: why should we care?
LD-dependent architectures lead to bias in hg2 estimates (from Week 8) Gusev et al PLoS Genet also see Speed et al AJHG, Yang et al Nat Genet, Speed et al Nat Genet

LD-dependent architectures: why should we care?
Understanding which regions of the genome contribute most to the genetics of complex traits and diseases can:
• Help us understand biology and evolution
• Increase association power (Pickrell 2014 AJHG, Sveinbjornsson et al Nat Genet)
• Improve fine-mapping resolution (Kichaev et al PLoS Genet)

Enriched heritability in regulatory regions, which have lower LD, higher recombination rate
Finucane et al Nat Genet; also see Gusev et al Am J Hum Genet

BUT, in coding regions: more disease variants in regions with lower recombination rate
Hussin et al Nat Genet

Extending LD score regression: estimating $h_g^2$ for continuous functional annotations
$\mathrm{LDscore}_q(m) = \sum_{m'} a_q(m')\, r^2(m, m')$ for continuous annotation q, where $a_q(m')$ is the value of annotation q at SNP m'
Multi-linear regression: regress $\chi^2$ statistics on annotation LD scores: $E[\chi^2] = 1 + \sum_q N \tau_q\, \mathrm{LDscore}_q$
Normalized $\tau_q$ = effect size of annotation q conditioned on effects of other annotations included in model (proportionate change in trait $h^2$ per 1 s.d. increase in annot. q)
Gazal et al Nat Genet

Extending LD score regression: estimating $h_g^2$ for continuous LD-related annotations
Level of LD (LLD): MAF-adjusted LD score (MAF-stratified quantile normalization)
$\mathrm{LDscore}_{LLD}(m) = \sum_{m'} \mathrm{LLD}(m')\, r^2(m, m')$ for continuous LLD annotation
Multi-linear regression: regress $\chi^2$ statistics on annotation LD scores: $E[\chi^2] = 1 + N \tau_{LLD}\, \mathrm{LDscore}_{LLD} + \ldots$ (also include MAF bins + functional annotations from Finucane et al Nat Genet)
Gazal et al Nat Genet
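MAF-stratified quantile normalization can be sketched as follows: split SNPs into MAF bins, rank LD scores within each bin, and map ranks to standard normal quantiles, so that the resulting LLD has the same distribution in every MAF bin. The bin count, the equal-count binning, and the function name `maf_adjusted_lld` are assumptions made for illustration; the exact recipe of Gazal et al may differ:

```python
from statistics import NormalDist

import numpy as np


def maf_adjusted_lld(ld_scores, maf, n_bins=10):
    """Quantile-normalize LD scores to N(0, 1) separately within MAF bins."""
    ld_scores = np.asarray(ld_scores, dtype=float)
    maf = np.asarray(maf, dtype=float)
    lld = np.empty_like(ld_scores)
    # Equal-count MAF bins defined by empirical quantiles (an assumption).
    edges = np.quantile(maf, np.linspace(0, 1, n_bins + 1))
    bin_idx = np.clip(np.searchsorted(edges, maf, side="right") - 1, 0, n_bins - 1)
    inv_cdf = NormalDist().inv_cdf
    for b in range(n_bins):
        in_bin = np.flatnonzero(bin_idx == b)
        if in_bin.size == 0:
            continue
        ranks = ld_scores[in_bin].argsort().argsort()  # 0 = smallest LD score in bin
        quantiles = (ranks + 0.5) / in_bin.size        # strictly inside (0, 1)
        lld[in_bin] = [inv_cdf(q) for q in quantiles]
    return lld


# Toy data in which LD score increases with MAF (hypothetical values).
rng = np.random.default_rng(4)
maf = rng.uniform(0.05, 0.5, size=5_000)
ld_scores = 50 + 200 * maf + rng.normal(scale=30, size=5_000)
lld = maf_adjusted_lld(ld_scores, maf)
print(np.corrcoef(ld_scores, maf)[0, 1], np.corrcoef(lld, maf)[0, 1])
```

In the toy data the raw LD score is clearly correlated with MAF, while the adjusted LLD is nearly uncorrelated with it, which is the point of conditioning on MAF.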

Simulations: LD score regression produces unbiased estimates of LD-dependent architectures
Simulations using real genotypes, simulated phenotypes (τLLD = 0), inference via LD score regression using 1000G reference panel.

Next steps on LD-dependent architectures
If you feel like Dancin’ … -- Prince
If you feel like Data … -- Alkes

56 traits analyzed (summary statistics only) Public + 23andMe + UK Biobank (avg N = 101,401)
Phenotype | Reference/consortium | N
Age at menarche | Perry et al Nature | 132,989
Age at menopause | Day et al Nat Genet | 67,265
Anorexia | Boraska et al Mol Psych | 17,767
Autism Spectrum | PGC Cross-Disorder Group 2013 Lancet | 10,263
Bipolar Disorder | BIP Working Group of the PGC 2011 Nat Genet | 16,731
BMI | Speliotes et al Nat Genet | 123,912
Celiac Disease | Dubois et al Nat Genet | 15,283
Coronary Artery Disease | Schunkert et al Nat Genet | 86,995
Crohn’s Disease | Jostins et al Nature | 20,883
Educational Attainment | Rietveld et al Science | 101,069
Ever Smoked | TAG Consortium 2010 Nat Genet | 74,035
Fasting Glucose | Manning et al Nat Genet | 46,186
HDL | Teslovich et al Nature | 99,900
Height | Lango Allen et al Nature | 133,858
LDL | | 95,454
Primary Biliary Cirrhosis | Cordell et al Nat Commun | 13,239
Rheumatoid Arthritis | Okada et al Nature | 38,242
Schizophrenia | SCZ Working Group of the PGC 2014 Nature | 70,100
Systemic Lupus Erythematosus | Bentham et al Nat Genet | 14,267
Triglycerides | | 96,598
Type-2 Diabetes | Morris et al Nat Genet | 69,033
Ulcerative Colitis | | 27,432

SNPs with lower MAF-adjusted level of LD (LLD) have larger causal effect sizes
Same sign of effect across all 56 traits investigated!

Many annotations correlated to LD could contribute to LD-dependent architectures
LD-related annotations:
• Predicted allele age (ARGweaver; Rasmussen et al PLoS Genet)
• LLD in Africans (LLD-AFR)
• Recombination rate (±10kb window; Hussin et al Nat Genet)
• GC-content (±1Mb window; Loh et al. 2015b Nat Genet)
• Replication timing (Koren et al Am J Hum Genet)
• Background selection (1 − B statistic; McVicker et al PLoS Genet)
• Nucleotide diversity (SNPs per kb; ±10kb window)
• CpG content (±50kb window)
Functional annotations (“Baseline model”; Finucane et al Nat Genet):
• Coding, DHS, histone marks, enhancers, conserved, etc.

Many annotations correlated to LD could contribute to LD-dependent architectures
[Figure: LD-related annotations vs. functional annotations from “baseline model” (Finucane et al Nat Genet)]

Many LD-related annotations impact causal effect sizes
[Figure legend: Annotation + MAF; Annotation + baseline model + MAF. Meta-analysis of 31 independent traits]
• Recombination rate has discordant sign of effect (Hill & Robertson 1966 Genet Res).
• Heritability is enriched in SNPs with low LLD in low recombination rate regions; r = −0.63.

Many LD-related annotations impact causal effect sizes after conditioning on baseline model
[Figure legend: Annotation + MAF; Annotation + baseline model + MAF. Meta-analysis of 31 independent traits]
• LLD effect is 0.37x smaller when including annotations from baseline model: some, but not all, of LD-dependent architecture is due to DHS, enhancers, etc.
• LLD effect is 0.51x smaller after adding baseline model.
• Predicted allele age has largest effect.

SNPs with smaller (more recent) allele age have larger causal effect sizes
Same sign of effect across 55 of 56 traits investigated

Many LD-related annotations impact causal effect sizes in joint fit after conditioning on baseline model
[Figure legend: Annotation + MAF; Annotation + baseline model + MAF; Joint-fit annotations + baseline model + MAF. Meta-analysis of 31 independent traits]
• LLD effect is 0.51x smaller after adding baseline model.
• 6 significant annotations in joint fit.
• Predicted allele age has largest effect.

Quintiles illustrate large effects of LD-related annotations
[Figure: proportion of heritability (0% to 40%) by annotation quintile]
• Youngest 20% explain 3.8x more heritability than oldest 20% vs. 1.8x for MAF

No overall effect for recombination rate
[Figure: proportion of heritability (0% to 40%) by annotation quintile]
• Competing effects of LLD-AFR and RR imply no overall effect for RR

LD-related annotations tag negative selection
[Figure legend: Annotation + MAF; Annotation + baseline model + MAF; Joint-fit annotations + baseline model + MAF. 31 traits]
• Predicted allele age: deleterious variants are younger
• LLD-AFR: information on variant history?
• Recombination rate: low RR => less efficient selection (Hill & Robertson 1966 Genet Res)
• Diversity / Background selection: regions under selection have lower diversity and are biologically important
• CpG-Content: unknown functional elements?

Forward simulations confirm impact of many LD-related annotations on selection coefficient s
• Forward simulations using SLiM (Messer 2013 Genetics) under African-European demographic model (Gravel et al PNAS)
• Recombination rate and % of deleterious SNPs vary across regions; selection coeff s varies across deleterious SNPs
• Jointly regress selection coeff s on 4 LD-related annotations and minor allele frequency
[Figure: results in 31 traits (Annotation + MAF; Annotation + baseline model + MAF; Joint-fit annotations + baseline model + MAF) alongside forward simulations of impact on s]

Improving association power using functional data
Main idea: • If certain regions of the genome are more likely to contain true signal, then association methods should favor those regions while controlling the overall false-positive rate.

Improving association power using functional data: sFDR approach
Stratified false discovery rate (sFDR) approach:
• Use P-value distributions to attain a target false discovery rate (FDR; Benjamini & Hochberg 1995 JRSS, Storey et al PNAS) separately for each functional category.
• [Figures: True Discovery Rate, and Replication Rate (in independent data set), for Crohn’s disease]
• 381 SNPs at FDR=0.05, 452 at sFDR=0.05 for Crohn’s disease: +19% increase in power (+24% for height, +270% for SCZ).
• Caveat: #SNPs does not quantify independent associations. Definition of functional categories (LD score ≥1 to category; membership in multiple functional categories is allowed) may lead to more associations for non-independent high-LD SNPs.
Schork et al PLoS Genet; also see Sun et al Genet Epidemiol, Wang et al PLoS Genet
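The effect of stratification can be seen on a toy example. The sketch below applies the Benjamini-Hochberg step-up procedure once to all P-values pooled and once separately within each category; this is a simplified stand-in for the sFDR idea, the P-values are invented, and the procedure used by Schork et al may differ in detail:

```python
import numpy as np


def bh_reject(pvals, alpha=0.05):
    """Benjamini-Hochberg step-up procedure; returns a boolean rejection mask."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    reject = np.zeros(m, dtype=bool)
    if m == 0:
        return reject
    order = np.argsort(p)
    passed = p[order] <= alpha * np.arange(1, m + 1) / m
    if passed.any():
        k = np.max(np.flatnonzero(passed))  # largest rank meeting its threshold
        reject[order[: k + 1]] = True
    return reject


def stratified_bh(pvals, strata, alpha=0.05):
    """Apply Benjamini-Hochberg separately within each stratum (category)."""
    p = np.asarray(pvals, dtype=float)
    strata = np.asarray(strata)
    reject = np.zeros(p.size, dtype=bool)
    for label in np.unique(strata):
        members = np.flatnonzero(strata == label)
        reject[members] = bh_reject(p[members], alpha)
    return reject


# Toy example: a small category enriched for signal plus a large null-like category.
enriched = [0.0005, 0.004, 0.012, 0.02, 0.5]
background = list(np.linspace(0.05, 1.0, 45))
pvals = np.array(enriched + background)
strata = np.array(["enriched"] * 5 + ["background"] * 45)

pooled = bh_reject(pvals)
stratified = stratified_bh(pvals, strata)
print(int(pooled.sum()), int(stratified.sum()))
```

In this constructed example the pooled analysis rejects one hypothesis and the stratified analysis rejects four, all in the enriched category. The gain depends entirely on how well the categories separate signal from noise.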

Improving association power using functional data: Bayesian approach
• Use Bayesian hierarchical model to jointly model:
  - the prior probability that a region is associated, as a function of regional annotations of 2.5Mb regions (e.g. gene density)
  - the prior probability that a SNP in an associated region is the (unique) causal associated SNP in the region, as a function of annotations of individual SNPs (e.g. DHS).
• Infer posterior probability of association (PPA) for each region.
Pickrell 2014 Am J Hum Genet; also see Wen et al Am J Hum Genet, Wen 2016 Ann Appl Stat

Inferred enrichments for HDL (data: Teslovich et al Nature)
• Repressed, TSS annotations: chromHMM chromatin states (Hoffman et al NAR)
• HepG2: liver cancer-derived
• Enrichments shown with 95% CI
• Across 18 traits analyzed: % increase in # of regions identified with PPA > 0.9.
Pickrell 2014 AJHG

Improving association power using functional data: P-value weighting
P-value weighting approach:
• Assess enrichment of causal variants in a functional category by maximizing likelihood under single causal variant model (Maller et al Nat Genet).
• Set P-value weights equal to functional category enrichments (Genovese et al Biometrika, Roeder et al Am J Hum Genet; also see Eskin 2008 Genome Res, Darnell et al Bioinformatics).
• Evaluate number of independent association signals ($P < 3.5 \times 10^{-9}$, based on 14.2 million SNPs in WGS data; independent associations assessed via conditional analysis).
Sveinbjornsson et al Nat Genet

Improving association power using functional data: P-value weighting
Inferred enrichments using P < 1 x association signals:
• 181x enrichment for loss-of-function coding variants*.
• 37x enrichment for missense coding variants*.
• 1.3x enrichment for DHS non-coding variants.
*Note: coding variants account for >50% of significant associations in this WGS data set (vs. 11% in GWAS; Hindorff et al PNAS).
Increase in power due to P-value weighting, across 219 traits:
• 14% increase in 1 x < P < $3.5 \times 10^{-9}$ association signals; 3% increase in set of all $P < 3.5 \times 10^{-9}$ association signals.
Sveinbjornsson et al Nat Genet
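The genome-wide threshold quoted for this data set is a Bonferroni correction of 0.05 over 14.2 million SNPs, and weighting amounts to rescaling that threshold per category with weights that average to 1 across SNPs. In the sketch below the three enrichments are taken from the slide, while the "other" category, every SNP fraction, and both function names are hypothetical values and helpers chosen only to make the arithmetic concrete:

```python
def bonferroni_threshold(alpha: float, n_tests: float) -> float:
    """Per-test significance threshold controlling family-wise error at alpha."""
    return alpha / n_tests


def mean_one_weights(enrichment: dict, snp_fraction: dict) -> dict:
    """Weights proportional to enrichment, scaled so the SNP-weighted mean is 1."""
    mean_enrichment = sum(enrichment[c] * snp_fraction[c] for c in enrichment)
    return {c: enrichment[c] / mean_enrichment for c in enrichment}


threshold = bonferroni_threshold(0.05, 14.2e6)

# Enrichments from the slide, plus a hypothetical baseline category.
enrichment = {"loss_of_function": 181.0, "missense": 37.0,
              "dhs_noncoding": 1.3, "other": 1.0}
# Hypothetical fractions of SNPs in each category (must sum to 1).
snp_fraction = {"loss_of_function": 0.0005, "missense": 0.005,
                "dhs_noncoding": 0.2, "other": 0.7945}

weights = mean_one_weights(enrichment, snp_fraction)
# Comparing P / weight to the threshold is the same as comparing P to threshold * weight.
category_thresholds = {c: threshold * w for c, w in weights.items()}
print(f"unweighted threshold: {threshold:.2e}")
print({c: f"{t:.2e}" for c, t in category_thresholds.items()})
```

Keeping the average weight at 1 is what preserves control of the overall false-positive rate: categories with large enrichment get a more lenient threshold, paid for by a slightly stricter threshold everywhere else.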

Functional validation provides the best proof of biological causality
(from Week 6) Price et al Proc Biol Sci (Table S1)

Fine-mapping using single causal variant approach
Assumption: at most 1 biologically causal SNP.
(from Week 6) Maller et al Nat Genet

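Fine-mapping using single causal variant approach
Let X denote genotypes, and $X_i$ denote genotypes at SNP i. Let $M_i$ denote the model in which SNP i is the causal SNP, $M_0$ the null model of no association, and model M = union of models $M_i$ over all SNPs i.
Bayes factor: $BF_i = \frac{P(X \mid M_i)}{P(X \mid M_0)} = \frac{P(X_i \mid M_i)}{P(X_i \mid M_0)}$
Theorem 6: The posterior probability $P(M_i \mid X, M)$ is proportional to $BF_i$, assuming equal priors.
Proof: $P(M_i \mid X, M) = \frac{P(X \mid M_i)\,P(M_i \mid M)}{\sum_j P(X \mid M_j)\,P(M_j \mid M)}$, which is equal to $BF_i$ up to a constant factor (indep. of i). Q.E.D.
(from Week 6) Maller et al Nat Genet (see Section 6 of Supp Note); also see Faye et al PLoS Genet, Farh et al Nature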

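Improving fine-mapping using functional priors: single causal variant approach
Let X denote genotypes, and $X_i$ denote genotypes at SNP i. Let model M = union of models $M_i$ over all SNPs i, and let $M_0$ denote the null model of no association.
Bayes factor: $BF_i = \frac{P(X \mid M_i)}{P(X \mid M_0)} = \frac{P(X_i \mid M_i)}{P(X_i \mid M_0)}$
Use different priors based on category of SNP i, so that $P(M_i \mid X, M) \propto BF_i \cdot P(M_i \mid M)$.
Gusev et al Am J Hum Genet; also see Li & Kellis 2016 Nucleic Acids Res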

Single causal variant approach
Functional priors reduce 95% credible SNP sets at known loci (under single causal variant assumption).
Gusev et al Am J Hum Genet; also see Li & Kellis 2016 Nucleic Acids Res
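Under the single causal variant assumption, posterior probabilities are Bayes factors times priors, normalized over the locus, and a 95% credible set is the smallest set of top-ranked SNPs whose posteriors sum to at least 0.95. A minimal sketch with invented Bayes factors and an invented 10x prior weight for one SNP; the function names are ours:

```python
import numpy as np


def posterior_from_bf(log_bf, prior=None):
    """Posterior probability that each SNP is the single causal SNP:
    proportional to Bayes factor times prior, normalized over the locus."""
    log_bf = np.asarray(log_bf, dtype=float)
    if prior is None:
        prior = np.ones_like(log_bf)  # equal priors
    log_w = log_bf + np.log(np.asarray(prior, dtype=float))
    w = np.exp(log_w - log_w.max())   # subtract the max for numerical stability
    return w / w.sum()


def credible_set(posterior, level=0.95):
    """Indices of the smallest set of SNPs whose posteriors sum to >= level."""
    posterior = np.asarray(posterior, dtype=float)
    order = np.argsort(posterior)[::-1]
    cumulative = np.cumsum(posterior[order])
    k = min(int(np.searchsorted(cumulative, level)) + 1, posterior.size)
    return order[:k]


# Hypothetical Bayes factors for five SNPs at one locus.
log_bf = np.log([60.0, 30.0, 6.0, 3.0, 1.0])

flat = credible_set(posterior_from_bf(log_bf))
# Hypothetical functional prior: SNP 1 lies in an enriched category (10x weight).
informed = credible_set(posterior_from_bf(log_bf, prior=[1, 10, 1, 1, 1]))
print(len(flat), len(informed))
```

With equal priors the credible set holds three SNPs; with the informative prior it shrinks to two, which is the mechanism by which functional priors reduce credible set size.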

Increasing evidence in favor of multiple causal variants at a locus
• 8q24 locus in prostate cancer: “We identified seven risk variants spanning 430kb independently predicting risk for prostate cancer” (Haiman et al 2007 Nat Genet; Table S1 of Price et al Proc Biol Sci)
• FGFR2 locus in breast cancer: iCOGS chip genotyped in 89K Eur + 14K Asians + 2K AA: “rs , rs , rs are putative functional SNPs” (Meyer et al Am J Hum Genet; Table S1 of Price et al Proc Biol Sci)
• Local heritability analyses of 9 WTCCC diseases: “We find significant average increases in local heritability at known associated loci [compared to heritability of lead SNPs], consistent with multiple causal variants” (Gusev et al PLoS Genet)
(from Week 6)

Fine-mapping using multiple causal variant approach
Let X denote association z-scores for all m SNPs at the locus. There are $2^m$ different possibilities for the set S of causal variants. Let $M_S$ denote the model in which S is the set of causal variants.
Then $P(M_S \mid X) = \frac{P(X \mid M_S)\,P(M_S)}{\sum_{S'} P(X \mid M_{S'})\,P(M_{S'})}$, where $P(M_S) = \gamma^{|S|}(1-\gamma)^{m-|S|}$ with γ = 0.01
(for computational efficiency: consider only causal sets S with |S| ≤ 6; likewise, in denominator, consider only causal sets S’ with |S’| ≤ 6)
(from Week 6) Hormozdiari et al Genetics
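The enumeration over causal sets can be sketched directly for a small locus. Several ingredients below are assumptions of ours rather than details given on the slide: an independent per-SNP prior with γ = 0.01, a zero-mean Gaussian likelihood for z-scores with covariance $R + \sigma^2 R D_S R$ (R the LD matrix, $D_S$ the indicator diagonal of S) in the style of published multiple-causal-variant methods, and an effect-size variance `sigma2 = 25`. The z-scores and LD matrix are invented:

```python
import itertools

import numpy as np


def log_mvn_zero_mean(z, cov):
    """Log density of a zero-mean multivariate normal evaluated at z."""
    _, logdet = np.linalg.slogdet(cov)
    quad = z @ np.linalg.solve(cov, z)
    return -0.5 * (z.size * np.log(2 * np.pi) + logdet + quad)


def posterior_inclusion(z, R, gamma=0.01, max_causal=6, sigma2=25.0):
    """Posterior inclusion probability of each SNP, summing over causal sets S."""
    z = np.asarray(z, dtype=float)
    R = np.asarray(R, dtype=float)
    m = z.size
    sets, log_post = [], []
    for k in range(min(max_causal, m) + 1):
        for S in itertools.combinations(range(m), k):
            idx = list(S)
            # Assumed likelihood: z ~ N(0, R + sigma2 * R D_S R).
            cov = R + sigma2 * R[:, idx] @ R[idx, :]
            # Assumed prior: each SNP is causal independently with probability gamma.
            log_prior = k * np.log(gamma) + (m - k) * np.log(1 - gamma)
            sets.append(S)
            log_post.append(log_mvn_zero_mean(z, cov) + log_prior)
    log_post = np.array(log_post)
    weights = np.exp(log_post - log_post.max())
    weights /= weights.sum()
    pip = np.zeros(m)
    for S, w in zip(sets, weights):
        if S:
            pip[list(S)] += w
    return pip


# Toy locus: SNPs 0 and 1 are in LD (r = 0.8); SNPs 2 and 3 are independent.
R = np.eye(4)
R[0, 1] = R[1, 0] = 0.8
z = np.array([5.0, 4.0, 0.3, 4.5])
pip = posterior_inclusion(z, R)
print(np.round(pip, 3))
```

In this toy locus SNP 1's z-score is exactly what tagging of SNP 0 would produce, so the posterior concentrates on SNP 0 and on the independent SNP 3, giving two signals rather than one.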

Multiple causal variant approach
Let X denote association z-scores for all m SNPs at the locus. There are $2^m$ different possibilities for the set S of causal variants. Let $M_S$ denote the model in which S is the set of causal variants.
Then $P(M_S \mid X) \propto P(X \mid M_S)\,P(M_S)$, where the prior $P(M_S)$ depends on functional annotations through:
• $A_{ik}$ = 1 if SNP i is in functional annotation k, 0 otherwise
• $\gamma_k$ = effect size of functional annotation k on prob. of being causal (fit using EM algorithm)
Kichaev et al PLoS Genet

Multiple causal variants approach
Probabilistic model on z-scores (PAINTOR):
• Inferred enrichments for HDL (data: Teslovich et al Nature)
• 90% credible SNP sets for 4 lipid traits (data: Teslovich et al Nature)
Note: simulations show that 90% credible SNP sets are well-calibrated for PAINTOR, but mis-calibrated for methods that assume a single causal variant.
Kichaev et al PLoS Genet; also see Kichaev & Pasaniuc 2015 Am J Hum Genet, Kichaev et al Bioinformatics, Chen et al Genetics

Conclusions
• Valuable functional annotation data is being generated by ENCODE, Roadmap and other projects.
• Stratified LD Score regression detects functional enrichment for coding regions, conserved regions, enhancer regions, etc. and also detects cell-type-specific functional enrichments.
• Low-LD variants have larger causal effects (at a given MAF), consistent with the action of negative selection.
• Functional annotations can increase association power and improve fine-mapping.

Acknowledgements
• TA Margaux Hujoel.
• Guest lecturer Sasha Gusev.
• Authors of the 29 papers listed on the course syllabus.
• All of the other researchers whose work we have discussed.
