Download presentation

Published byKaila Kilbourne Modified over 4 years ago

1
**Association Analysis, Logistic Regression, R and S-PLUS**

Richard Mott

2
**Logistic Regression in Statistical Genetics**

Applicable to Association Studies Data: Binary outcomes (eg disease status) Dependent on genotypes [+ sex, environment] Aim is to identify which factors influence the outcome Rigorous tests of statistical significance Flexible modelling language Generalisation of Chi-Squared Test

3
**What is R ? Statistical analysis package Free**

Similar to commercial package S-PLUS Runs on Unix, Windows, Mac Many packages for statistical genetics, microarray analysis available in R Easily Programmable

4
**Modelling in R Data for individual labelled i=1…n: Response yi**

Genotypes gij for markers j=1..m

5
**Coding Unphased Genotypes**

Several possibilities: AA, AG, GG original genotypes 12, 21, 22 1, 2, 3 0, 1, 2 # of G alleles Missing Data NA default in R

6
**Using R Load genetic logistic regression tools**

> source(‘logistic.R’) Read data table from file > t <- read.table(‘geno.dat’, header=TRUE) Column names names(t) t$y response (0,1) t$m1, t$m2, …. Genotypes for each marker

7
**Contigency Tables in R ftable(t$y,t$m31) prints the contingency table**

>

8
**Chi-Squared Test in R > chisq.test(t$y,t$m31)**

Pearson's Chi-squared test data: t$y and t$m31 X-squared = , df = 2, p-value = Warning message: Chi-squared approximation may be incorrect in: chisq.test(t$y, t$m31) >

9
**The Logistic Model Prob(Yi=0) = exp(hi)/(1+exp(hi))**

hi = Sj xij bj - Linear Predictor xij – Design Matrix (genotypes etc) bj – Model Parameters (to be estimated) Model is investigated by estimating the bj’s by maximum likelihood testing if the estimates are different from 0

10
**The Logistic Function Prob(Yi=0) = exp(hi)/(1+exp(hi))**

11
**Types of genetic effect at a single locus**

AA AG GG Recessive 1 Dominant Additive 2 Genotype a b

12
**Additive Genotype Model**

Code genotypes as AA x=0, AG x=1, GG x=2 Linear Predictor h = b0 + xb1 P(Y=0|x) = exp(b0 + xb1)/(1+exp(b0 + xb1)) PAA = P(Y=0|x=0) = exp(b0)/(1+exp(b0)) PAG = P(Y=0|x=1) = exp(b0 + b1)/(1+exp(b0 + b1)) PGG = P(Y=0|x=2) = exp(b0 + 2b1)/(1+exp(b0 + 2b1))

13
**Additive Model: b0 = -2 b1 = 2 PAA = 0.12 PAG = 0.50 PGG = 0.88**

Prob(Y=0) h

14
**Additive Model: b0 = 0 b1 = 2 PAA = 0.50 PAG = 0.88 PGG = 0.98**

Prob(Y=0) h

15
**Recessive Model Code genotypes as Linear Predictor**

AA x=0, AG x=0, GG x=1 Linear Predictor h = b0 + xb1 P(Y=0|x) = exp(b0 + xb1)/(1+exp(b0 + xb1)) PAA = PAG = P(Y=0|x=0) = exp(b0)/(1+exp(b0)) PGG = P(Y=0|x=1) = exp(b0 + b1)/(1+exp(b0 + b1))

16
**Recessive Model: b0 = 0 b1 = 2 PAA = PAG = 0.50 PGG = 0.88**

Prob(Y=0) h

17
**Genotype Model Each genotype has an independent probability**

Code genotypes as (for example) AA x=0, y=0 AG x=1, y=0 GG x=0, y=1 Linear Predictor h = b0 + xb1+yb2 two parameters P(Y=0|x) = exp(b0 + xb1+yb2)/(1+exp(b0 + xb1+yb2)) PAA = P(Y=0|x=0,y=0) = exp(b0)/(1+exp(b0)) PAG = P(Y=0|x=1,y=0) = exp(b0 + b1)/(1+exp(b0 + b1)) PGG = P(Y=0|x=0,y=1) = exp(b0 + b2)/(1+exp(b0 + b2))

18
**Genotype Model: b0 = 0 b1 = 2 b2 = -1 PAA = 0.5 PAG = 0.88 PGG = 0.27**

Prob(Y=0) h

19
**Models in R response y genotype g**

AA AG GG model DF Recessive 1 y ~ dominant(g) Dominant y ~ recessive(g) Additive 2 y ~ additive(g) Genotype a b y ~ genotype(g)

20
**Data Transformation g <- t$m1**

use these functions to treat a genotype vector in a certain way: a <- additive(g) r <- recessive(g) d <- dominant(g) g <- genotype(g)

21
**Fitting the Model Equivalent models: genotype = dominant + recessive**

afit <- glm( t$y ~ additive(g),family=‘binomial’) rfit <- glm( t$y ~ recessive(g),family=‘binomial’) dfit <- glm( t$y ~ dominant(g),family=‘binomial’) gfit <- glm( t$y ~ genotype(g),family=‘binomial’) Equivalent models: genotype = dominant + recessive genotype = additive + recessive genotype = additive + dominant genotype ~ standard chi-squared test of genotype association

22
Parameter Estimates > summary(glm( t$y ~ genotype(t$m31), family='binomial')) Coefficients: Estimate Std. Error z value Pr(>|z|) b0 (Intercept) <2e-16 *** b1 genotype(t$m31) b2 genotype(t$m31) --- Signif. codes: 0 `***' `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1 >

23
**Analysis of Deviance Chi-Squared Test**

> anova(glm( t$y ~ genotype(t$m31), family='binomial')) Analysis of Deviance Table Model: binomial, link: logit Response: t$y Terms added sequentially (first to last) Df Deviance Resid. Df Resid. Dev NULL genotype(t$m31)

24
Model Comparison Compare general model with additive, dominant or recessive models: > afit <- glm(t$y ~ additive(t$m20)) > gfit <- glm(t$y ~ genotype(t$m20)) > anova(afit,gfit) Analysis of Deviance Table Model 1: t$y ~ additive(t$m20) Model 2: t$y ~ genotype(t$m20) Resid. Df Resid. Dev Df Deviance >

25
**Scanning all Markers > logscan(t,model=‘additive’)**

Deviance DF Pval LogPval m e e m e e m e e m e e m e e m e e m e e m e e m e e m e e m e e …

26
Multilocus Models Can test the effects of fitting two or more markers simultaneously Several multilocus models are possible Interaction Model assumes that each combination of genotypes has a different effect eg t$y ~ t$m10 * t$m15

27
Multi-Locus Models > f <- glm( t$y ~ genotype(t$m13) * genotype(t$m26) , family='binomial') > anova(f) Analysis of Deviance Table Model: binomial, link: logit Response: t$y Terms added sequentially (first to last) Df Deviance Resid. Df Resid. Dev NULL genotype(t$m13) genotype(t$m26) genotype(t$m13):genotype(t$m26) > pchisq(6.03,2,lower.tail=F) calculate p-value [1]

28
**Adding the effects of Sex and other Covariates**

Read in sex and other covariate data, eg. age from a file into variables, say a$sex, a$age Fit models of the form fit1 <- glm(t$y ~ additive(t$m10) + a$sex + a$age, family=‘binomial’) fit2 <- glm(t$y ~ a$sex + a$age, family=‘binomial’)

29
**Adding the effects of Sex and other Covariates**

Compare models using anova – test if the effect of the marker m10 is significant after taking into account sex and age anova(fit1,fit2)

30
Multiple Testing Take care interpreting significance levels when performing multiple tests Linkage disequilibrium can reduce the effective number of independent tests Permutation is a safe procedure to determine significance Repeat j=1..N times: Permute disease status y between individuals Fit all markers Record maximum deviance maxdev[j] over all markers Permutation p-value for a marker is the proportion of times the permuted maximum deviance across all markers exceeds the observed deviance for the marker logscan(t,permute=1000) slow!

31
**Haplotype Association**

Different from multiple genotype models Phase taken into account Haplotype association can be modelled in a similar logistic framework Treat haplotypes as extended alleles Fit additive, recessive, dominant & genotype models as before Eg haplotypes are h = AAGCAT, ATGCTT, etc y ~ additive(h) y ~ dominant(h) etc

Similar presentations

Presentation is loading. Please wait....

OK

Statistical methods for genetic association studies

Statistical methods for genetic association studies

© 2018 SlidePlayer.com Inc.

All rights reserved.

To make this website work, we log user data and share it with processors. To use this website, you must agree to our Privacy Policy, including cookie policy.

Ads by Google