Published by Kaila Kilbourne. Modified over 2 years ago.

1
Association Analysis, Logistic Regression, R and S-PLUS
Richard Mott

2
Logistic Regression in Statistical Genetics
Applicable to association studies
Data:
–Binary outcomes (e.g. disease status)
–Dependent on genotypes [+ sex, environment]
Aim is to identify which factors influence the outcome
Rigorous tests of statistical significance
Flexible modelling language
Generalisation of the chi-squared test

3
What is R?
Statistical analysis package
Free
Similar to the commercial package S-PLUS
Runs on Unix, Windows, Mac
Many packages for statistical genetics and microarray analysis are available in R
Easily programmable

4
Modelling in R
Data for individuals labelled $i = 1..n$:
–Response $y_i$
–Genotypes $g_{ij}$ for markers $j = 1..m$

5
Coding Unphased Genotypes
Several possibilities:
–AA, AG, GG (original genotypes)
–12, 21, 22
–1, 2, 3
–0, 1, 2 (number of G alleles)
Missing data:
–NA (default in R)
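
The "number of G alleles" coding can be derived from the original genotype strings. A minimal sketch, assuming genotypes are stored as two-letter strings; the vector g below is made-up example data, not taken from the presentation:

```r
# Hypothetical genotype calls; NA marks a missing genotype.
g <- c("AA", "AG", "GG", NA, "GA")

# Count G alleles: delete every character that is not G and
# measure the length of what remains.
n_G <- nchar(gsub("[^G]", "", g))

# Make sure missing genotypes stay missing.
n_G[is.na(g)] <- NA

print(n_G)
```

Because the genotypes are unphased, "AG" and "GA" receive the same code.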

6
Using R
Load genetic logistic regression tools
> source('logistic.R')
Read data table from file
> t <- read.table('geno.dat', header=TRUE)
Column names
–names(t)
–t$y response (0,1)
–t$m1, t$m2, … genotypes for each marker

7
Contingency Tables in R
ftable(t$y,t$m31) prints the contingency table
> ftable(t$y,t$m31)
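
Since geno.dat is not available here, the same call can be tried on a small made-up data set. The vectors y and m below are illustrative assumptions standing in for t$y and t$m31:

```r
# Hypothetical data: 8 individuals, a binary response and one marker.
y <- c(0, 0, 0, 1, 1, 1, 1, 0)
m <- c("AA", "AG", "AA", "GG", "AG", "GG", "AG", "AA")

# Cross-tabulate response (rows) against genotype (columns).
tab <- table(y, m)

# ftable() prints the same counts as a flat table.
print(ftable(tab))
```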

8
Chi-Squared Test in R
> chisq.test(t$y,t$m31)
Pearson's Chi-squared test
data: t$y and t$m31
X-squared = , df = 2, p-value =
Warning message:
Chi-squared approximation may be incorrect in: chisq.test(t$y, t$m31)
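
To see what chisq.test computes, the Pearson statistic can be rebuilt by hand from a table of counts. This is a sketch on invented counts (the 2 x 3 table below is an assumption, not the presentation's data):

```r
# Hypothetical 2 x 3 table: rows are response 0/1, columns are genotypes.
tab <- matrix(c(30, 10, 40, 40, 10, 30), nrow = 2,
              dimnames = list(y = c("0", "1"), g = c("AA", "AG", "GG")))

# Expected counts under independence: row total * column total / grand total.
expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)

# Pearson statistic and its degrees of freedom.
x2 <- sum((tab - expected)^2 / expected)
df <- (nrow(tab) - 1) * (ncol(tab) - 1)
p  <- pchisq(x2, df, lower.tail = FALSE)

# The built-in test should agree with the hand calculation.
res <- chisq.test(tab)
stopifnot(isTRUE(all.equal(unname(res$statistic), x2)))
```

A table with three genotypes and a binary response gives df = 2, matching the df shown on the slide.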

9
The Logistic Model
$\mathrm{Prob}(Y_i = 0) = \exp(\eta_i) / (1 + \exp(\eta_i))$
$\eta_i = \sum_j x_{ij} b_j$ – Linear Predictor
$x_{ij}$ – Design Matrix (genotypes etc.)
$b_j$ – Model Parameters (to be estimated)
Model is investigated by
–estimating the $b_j$'s by maximum likelihood
–testing if the estimates are different from 0

10
The Logistic Function
$\mathrm{Prob}(Y_i = 0) = \exp(\eta_i) / (1 + \exp(\eta_i))$
[Figure: plot of the logistic function; vertical axis Prob(Y=0)]
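
The logistic function is easy to evaluate directly. A small sketch: the function name logistic is introduced here for illustration, while plogis is base R's built-in equivalent:

```r
# Logistic function: maps any linear predictor to a probability in (0, 1).
logistic <- function(eta) exp(eta) / (1 + exp(eta))

# Evaluate on a few values of the linear predictor.
eta <- c(-4, -2, 0, 2, 4)
p <- logistic(eta)
print(round(p, 3))

# plogis() computes the same function.
stopifnot(isTRUE(all.equal(p, plogis(eta))))
```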

11
Types of genetic effect at a single locus
            AA  AG  GG
Recessive    0   0   1
Dominant     1   1   0
Additive     0   1   2
Genotype     0

12
Additive Genotype Model
Code genotypes as
–AA x=0
–AG x=1
–GG x=2
Linear Predictor
–$\eta = b_0 + x b_1$
$P(Y=0|x) = \exp(b_0 + x b_1) / (1 + \exp(b_0 + x b_1))$
$P_{AA} = P(Y=0|x=0) = \exp(b_0) / (1 + \exp(b_0))$
$P_{AG} = P(Y=0|x=1) = \exp(b_0 + b_1) / (1 + \exp(b_0 + b_1))$
$P_{GG} = P(Y=0|x=2) = \exp(b_0 + 2 b_1) / (1 + \exp(b_0 + 2 b_1))$

13
Additive Model: $b_0 = -2$, $b_1 = 2$
$P_{AA} = 0.12$, $P_{AG} = 0.50$, $P_{GG} = 0.88$
[Figure: vertical axis Prob(Y=0)]
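
The three probabilities for this parameter choice can be reproduced from the additive-model formula. A short sketch using only the parameter values quoted on the slide:

```r
# Parameters from the slide.
b0 <- -2
b1 <- 2

# Additive coding: number of G alleles per genotype.
x <- c(AA = 0, AG = 1, GG = 2)

# Linear predictor and logistic transform.
eta <- b0 + x * b1
p <- exp(eta) / (1 + exp(eta))

print(round(p, 2))
```

Rounded to two decimals this gives the slide's values of 0.12, 0.50 and 0.88.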

14
Additive Model: $b_0 = 0$, $b_1 = 2$
$P_{AA} = 0.50$, $P_{AG} = 0.88$, $P_{GG} = 0.98$
[Figure: vertical axis Prob(Y=0)]

15
Recessive Model
Code genotypes as
–AA x=0
–AG x=0
–GG x=1
Linear Predictor
–$\eta = b_0 + x b_1$
$P(Y=0|x) = \exp(b_0 + x b_1) / (1 + \exp(b_0 + x b_1))$
$P_{AA} = P_{AG} = P(Y=0|x=0) = \exp(b_0) / (1 + \exp(b_0))$
$P_{GG} = P(Y=0|x=1) = \exp(b_0 + b_1) / (1 + \exp(b_0 + b_1))$

16
Recessive Model: $b_0 = 0$, $b_1 = 2$
$P_{AA} = P_{AG} = 0.50$, $P_{GG} = 0.88$
[Figure: vertical axis Prob(Y=0)]

17
Genotype Model
Each genotype has an independent probability
Code genotypes as (for example)
–AA x=0, y=0
–AG x=1, y=0
–GG x=0, y=1
Linear Predictor
–$\eta = b_0 + x b_1 + y b_2$ (two parameters in addition to the intercept $b_0$)
$P(Y=0|x,y) = \exp(b_0 + x b_1 + y b_2) / (1 + \exp(b_0 + x b_1 + y b_2))$
$P_{AA} = P(Y=0|x=0,y=0) = \exp(b_0) / (1 + \exp(b_0))$
$P_{AG} = P(Y=0|x=1,y=0) = \exp(b_0 + b_1) / (1 + \exp(b_0 + b_1))$
$P_{GG} = P(Y=0|x=0,y=1) = \exp(b_0 + b_2) / (1 + \exp(b_0 + b_2))$

18
Genotype Model: $b_0 = 0$, $b_1 = 2$, $b_2 = -1$
$P_{AA} = 0.5$, $P_{AG} = 0.88$, $P_{GG} = 0.27$
[Figure: vertical axis Prob(Y=0)]
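
The genotype model can be written as a design matrix times a parameter vector. A sketch using the slide's parameter values; the matrix X is written out by hand here to mirror the (x, y) indicator coding:

```r
# Parameters from the slide.
b <- c(b0 = 0, b1 = 2, b2 = -1)

# Design matrix: intercept, x (AG indicator), y (GG indicator).
X <- rbind(AA = c(1, 0, 0),
           AG = c(1, 1, 0),
           GG = c(1, 0, 1))

# Linear predictor for each genotype, then the logistic transform.
eta <- drop(X %*% b)
p <- exp(eta) / (1 + exp(eta))

print(round(p, 2))
```

Unlike the additive model, the probability for GG is not forced to lie beyond the one for AG: each genotype gets its own value.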

19
Models in R
response y, genotype g
            AA  AG  GG  model             DF
Recessive    0   0   1  y ~ recessive(g)   1
Dominant     0   1   1  y ~ dominant(g)    1
Additive     0   1   2  y ~ additive(g)    1
Genotype     0          y ~ genotype(g)    2

20
Data Transformation
> g <- t$m1
Use these functions to treat a genotype vector in a certain way:
–a <- additive(g)
–r <- recessive(g)
–d <- dominant(g)
–g <- genotype(g)
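
The contents of logistic.R are not shown in the presentation, so the real definitions of these helpers are unknown. The following is only a guess at what such transformations might look like, assuming genotypes are stored as the strings "AA", "AG" and "GG" and that G is the allele being counted. The _sketch suffix marks the functions as hypothetical stand-ins:

```r
# Hypothetical helper: number of G alleles per genotype, NA stays NA.
count_G <- function(g) {
  n <- nchar(gsub("[^G]", "", g))
  n[is.na(g)] <- NA
  n
}

# Hypothetical stand-ins for the functions sourced from logistic.R.
additive_sketch  <- function(g) count_G(g)                    # 0, 1, 2
recessive_sketch <- function(g) as.numeric(count_G(g) == 2)   # 0, 0, 1
dominant_sketch  <- function(g) as.numeric(count_G(g) >= 1)   # 0, 1, 1
genotype_sketch  <- function(g) factor(g, levels = c("AA", "AG", "GG"))

# Made-up genotype vector with one missing value.
g <- c("AA", "AG", "GG", NA)
print(additive_sketch(g))
print(recessive_sketch(g))
print(dominant_sketch(g))
print(genotype_sketch(g))
```

Returning a factor from the genotype coding lets glm build one indicator column per non-reference genotype, which gives the two degrees of freedom of the genotype model.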

21
Fitting the Model
> afit <- glm( t$y ~ additive(g), family='binomial')
> rfit <- glm( t$y ~ recessive(g), family='binomial')
> dfit <- glm( t$y ~ dominant(g), family='binomial')
> gfit <- glm( t$y ~ genotype(g), family='binomial')
Equivalent models:
–genotype = dominant + recessive
–genotype = additive + recessive
–genotype = additive + dominant
–genotype ~ standard chi-squared test of genotype association
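
Without geno.dat and logistic.R, the fitting step can still be tried on simulated data. In this sketch the data, the effect sizes and the use of a plain numeric allele count and factor() in place of additive() and genotype() are all assumptions:

```r
set.seed(1)

# Simulated data: allele counts for one marker and a binary response.
n <- 500
x <- sample(0:2, n, replace = TRUE)
y <- rbinom(n, 1, plogis(-1 + 0.8 * x))

# Note: glm() with a 0/1 response models the probability that y equals 1.
afit <- glm(y ~ x, family = "binomial")          # additive: 1 slope
gfit <- glm(y ~ factor(x), family = "binomial")  # genotype: 2 indicators

print(coef(afit))
print(coef(gfit))
```

The additive model is nested inside the genotype model, so the genotype fit can never have a larger residual deviance.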

22
Parameter Estimates
> summary(glm( t$y ~ genotype(t$m31), family='binomial'))
Coefficients:
                        Estimate Std. Error z value Pr(>|z|)
$b_0$ (Intercept)                                    <2e-16 ***
$b_1$ genotype(t$m31)
$b_2$ genotype(t$m31)
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

23
Analysis of Deviance Chi-Squared Test
> anova(glm( t$y ~ genotype(t$m31), family='binomial'))
Analysis of Deviance Table
Model: binomial, link: logit
Response: t$y
Terms added sequentially (first to last)
                Df Deviance Resid. Df Resid. Dev
NULL
genotype(t$m31)

24
Model Comparison
Compare general model with additive, dominant or recessive models:
> afit <- glm(t$y ~ additive(t$m20), family='binomial')
> gfit <- glm(t$y ~ genotype(t$m20), family='binomial')
> anova(afit,gfit)
Analysis of Deviance Table
Model 1: t$y ~ additive(t$m20)
Model 2: t$y ~ genotype(t$m20)
Resid. Df Resid. Dev Df Deviance
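
The comparison made by anova is a likelihood ratio test: the drop in residual deviance between nested models is referred to a chi-squared distribution. A sketch on simulated data (data and effect sizes are assumptions; a numeric allele count and factor() stand in for additive() and genotype()):

```r
set.seed(2)

# Simulated allele counts and binary response.
n <- 400
x <- sample(0:2, n, replace = TRUE)
y <- rbinom(n, 1, plogis(-0.5 + 0.5 * x))

afit <- glm(y ~ x, family = "binomial")          # additive model
gfit <- glm(y ~ factor(x), family = "binomial")  # genotype model

# Likelihood ratio test by hand.
dev_diff <- deviance(afit) - deviance(gfit)
df_diff  <- df.residual(afit) - df.residual(gfit)
p_value  <- pchisq(dev_diff, df_diff, lower.tail = FALSE)

# The same comparison via anova(); test = "Chisq" adds the p-value column.
tab <- anova(afit, gfit, test = "Chisq")
print(tab)
```

A small p-value here would indicate that the genotype model fits better than the additive one, i.e. evidence of departure from additivity.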

25
Scanning all Markers
> logscan(t,model='additive')
Deviance DF Pval LogPval
[one output row per marker; the numeric values were not preserved]
…

26
Multilocus Models
Can test the effects of fitting two or more markers simultaneously
Several multilocus models are possible
The interaction model assumes that each combination of genotypes has a different effect
–e.g. t$y ~ t$m10 * t$m15

27
Multi-Locus Models
> f <- glm( t$y ~ genotype(t$m13) * genotype(t$m26), family='binomial')
> anova(f)
Analysis of Deviance Table
Model: binomial, link: logit
Response: t$y
Terms added sequentially (first to last)
                                Df Deviance Resid. Df Resid. Dev
NULL
genotype(t$m13)
genotype(t$m26)
genotype(t$m13):genotype(t$m26)
> pchisq(6.03,2,lower.tail=FALSE)  # calculate p-value
[1]
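
The pchisq call converts a deviance into an upper-tail p-value. For 2 degrees of freedom the chi-squared tail has the closed form exp(-x/2), which gives a quick sanity check on the call. A minimal sketch using the deviance and degrees of freedom from the slide:

```r
# Deviance and degrees of freedom as used in the pchisq() call above.
x <- 6.03

# Upper-tail probability from the built-in function.
p_builtin <- pchisq(x, df = 2, lower.tail = FALSE)

# Closed form for 2 degrees of freedom.
p_closed <- exp(-x / 2)

stopifnot(isTRUE(all.equal(p_builtin, p_closed)))
print(p_builtin)
```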

28
Adding the effects of Sex and other Covariates
Read in sex and other covariate data, e.g. age, from a file into variables, say a$sex, a$age
Fit models of the form
> fit1 <- glm(t$y ~ additive(t$m10) + a$sex + a$age, family='binomial')
> fit2 <- glm(t$y ~ a$sex + a$age, family='binomial')

29
Adding the effects of Sex and other Covariates
Compare models using anova – test if the effect of the marker m10 is significant after taking into account sex and age
> anova(fit1,fit2)
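
The covariate-adjusted test can be tried end to end on simulated data. Everything below (sample size, effect sizes, variable names sex, age and x) is an assumption made for illustration, with a numeric allele count standing in for additive(t$m10):

```r
set.seed(3)

# Simulated covariates and marker.
n   <- 300
sex <- factor(sample(c("F", "M"), n, replace = TRUE))
age <- round(runif(n, 20, 70))
x   <- sample(0:2, n, replace = TRUE)

# Simulated binary response depending on marker, sex and age.
eta <- -2 + 0.6 * x + 0.4 * (sex == "M") + 0.02 * age
y   <- rbinom(n, 1, plogis(eta))

fit1 <- glm(y ~ x + sex + age, family = "binomial")  # with marker
fit2 <- glm(y ~ sex + age, family = "binomial")      # covariates only

# Smaller model first; test = "Chisq" adds the p-value for the marker term.
cmp <- anova(fit2, fit1, test = "Chisq")
print(cmp)
```

The single degree of freedom in the comparison corresponds to the additive marker term, tested after sex and age are already in the model.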

30
Multiple Testing
Take care interpreting significance levels when performing multiple tests
Linkage disequilibrium can reduce the effective number of independent tests
Permutation is a safe procedure to determine significance
Repeat j=1..N times:
–Permute disease status y between individuals
–Fit all markers
–Record maximum deviance maxdev[j] over all markers
The permutation p-value for a marker is the proportion of permutations in which the maximum deviance across all markers exceeds the observed deviance for that marker
–logscan(t,permute=1000) slow!
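
The permutation procedure can be sketched in a few lines. This is not the logscan implementation (which is not shown in the presentation); scan_dev is a hypothetical helper, the data are simulated, and an additive model is assumed for every marker:

```r
set.seed(4)

# Simulated data: n individuals, m markers coded as allele counts.
n <- 200; m <- 5; N <- 200
geno <- matrix(sample(0:2, n * m, replace = TRUE), n, m,
               dimnames = list(NULL, paste0("m", 1:m)))
y <- rbinom(n, 1, plogis(-0.5 + 0.7 * geno[, "m3"]))

# Hypothetical helper: deviance explained by each marker on its own.
scan_dev <- function(y, geno) {
  apply(geno, 2, function(x) {
    fit <- glm(y ~ x, family = "binomial")
    fit$null.deviance - deviance(fit)
  })
}

# Observed deviances, one per marker.
obs <- scan_dev(y, geno)

# Permute y, rescan all markers, keep the maximum deviance each time.
maxdev <- replicate(N, max(scan_dev(sample(y), geno)))

# Proportion of permutations whose maximum reaches the observed deviance.
perm_p <- sapply(obs, function(d) mean(maxdev >= d))
print(perm_p)
```

Taking the maximum over all markers in each permutation is what makes the resulting p-values account for the number of tests, and permuting y while leaving the genotype matrix intact preserves the correlation between markers.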

31
Haplotype Association
–Different from multiple genotype models
–Phase taken into account
–Haplotype association can be modelled in a similar logistic framework
Treat haplotypes as extended alleles
Fit additive, recessive, dominant & genotype models as before
–E.g. haplotypes are h = AAGCAT, ATGCTT, etc.
–y ~ additive(h)
–y ~ dominant(h) etc.
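
Treating a haplotype as an extended allele means counting how many copies of it each individual carries. A sketch of that coding, assuming phased data stored as two haplotype strings per individual; the vectors hap1 and hap2 and the choice of target haplotype are made up for illustration:

```r
# Hypothetical phased data: two haplotypes per individual.
hap1 <- c("AAGCAT", "ATGCTT", "AAGCAT", "ATGCTT")
hap2 <- c("AAGCAT", "AAGCAT", "ATGCTT", "ATGCTT")

# Haplotype treated as the allele of interest.
target <- "AAGCAT"

# Additive coding: number of copies of the target haplotype (0, 1 or 2).
copies <- (hap1 == target) + (hap2 == target)

# Dominant and recessive codings follow from the copy number.
dom <- as.numeric(copies >= 1)
rec <- as.numeric(copies == 2)

print(copies)
```

These vectors can then be used as predictors in glm in the same way as the single-marker codings.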
