
1 Genome-wide association studies Usman Roshan

2 Recap
Single nucleotide polymorphism
Genome-wide association studies
–Relative risk; the odds ratio as an approximation to relative risk
–Chi-square test to determine significant SNPs
–Logistic regression model for determining the odds ratio

3 SNP genotype representation
The example
F: AACACAATTAGTACAATTATGAC
M: AACAGAATTAGTACAATTATGAC
(the paternal and maternal copies differ only at the SNP position, where F carries C and M carries G) is represented as CG at this SNP. Over several SNPs an individual's genotype is a string of such pairs, e.g. CG CC GG ...

4 SNP genotype encoding
If a SNP is A/B (alleles in alphabetical order), count the number of times we see B. The previous example becomes:

         A/T  C/T  G/T ...
    H0:  AA   TT   GG  ...  =>  0 2 0 ...
    H1:  AT   CC   GT  ...  =>  1 0 1 ...
    H2:  AA   CT   GT  ...  =>  0 1 1 ...

Now we have data in numerical format.
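A minimal Python sketch of this encoding (the helper name encode_genotypes and the input format are illustrative, not from the slides):

    # Count copies of the B allele (alphabetically second) at each SNP.
    def encode_genotypes(snp_alleles, genotypes):
        """snp_alleles: list of (A, B) pairs, alphabetically ordered.
        genotypes: list of two-letter genotype strings, one per SNP."""
        return [gt.count(b) for (a, b), gt in zip(snp_alleles, genotypes)]

    snps = [("A", "T"), ("C", "T"), ("G", "T")]
    print(encode_genotypes(snps, ["AA", "TT", "GG"]))  # [0, 2, 0]  (H0)
    print(encode_genotypes(snps, ["AT", "CC", "GT"]))  # [1, 0, 1]  (H1)
    print(encode_genotypes(snps, ["AA", "CT", "GT"]))  # [0, 1, 1]  (H2)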

5 Example GWAS

                A/T  C/G  A/G ...
    Case 1:     AA   CC   AA
    Case 2:     AT   CG   AA
    Case 3:     AA   CG   AA
    Control 1:  TT   GG   GG
    Control 2:  TT   CC   GG
    Control 3:  TA   CG   GG

6 Encoded data

                A/T  C/G  A/G        A/T  C/G  A/G
    Case 1:     AA   CC   AA    =>    0    0    0
    Case 2:     AT   CG   AA    =>    1    1    0
    Case 3:     AA   CG   AA    =>    0    1    0
    Control 1:  TT   GG   GG    =>    2    2    2
    Control 2:  TT   CC   GG    =>    2    0    2
    Control 3:  TA   CG   GG    =>    1    1    2

7 Example

               AA  AC  CC
    Case:      80  15   5
    Control:   70  15  15

Odds of AA in case = (80/100)/(20/100) = 4
Odds of AA in control = (70/100)/(30/100) = 7/3
Odds ratio of AA = 4/(7/3) = 12/7
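A few lines of Python that reproduce this arithmetic (the dictionaries simply mirror the table above):

    # Odds ratio of AA from the 2x3 table on this slide.
    case = {"AA": 80, "AC": 15, "CC": 5}
    control = {"AA": 70, "AC": 15, "CC": 15}

    def odds_of_aa(counts):
        p = counts["AA"] / sum(counts.values())
        return p / (1 - p)

    odds_case = odds_of_aa(case)        # 4.0
    odds_control = odds_of_aa(control)  # 7/3
    print(odds_case / odds_control)     # 12/7 ~= 1.714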

8 Chi-square statistic
Define the statistic

    \chi^2 = \sum_{i=1}^{n} \frac{(c_i - e_i)^2}{e_i}

where
c_i = observed frequency for the i-th outcome
e_i = expected frequency for the i-th outcome
n = total outcomes
The probability distribution of this statistic is given by the chi-square distribution with n-1 degrees of freedom. A proof can be found at http://ocw.mit.edu/NR/rdonlyres/Mathematics/18-443Fall2003/4226DF27-A1D0-4BB8-939A-B2A4167B5480/0/lec23.pdf
Great. But how do we use this to get a SNP p-value?

9 Null hypothesis for the case-control contingency table

               AA  AC  CC
    Case:      c1  c2  c3
    Control:   c4  c5  c6
    n = c1+c2+c3+c4+c5+c6

We have two random variables:
–D: disease status
–G: allele type
Null hypothesis: the two variables are independent of each other (unrelated).
Under independence:
–P(D,G) = P(D)P(G)
–P(D=case) = (c1+c2+c3)/n
–P(G=AA) = (c1+c4)/n
Expected values, e.g. for the (case, AA) cell:
–E(X1) = P(D=case) P(G=AA) n
We can calculate the chi-square statistic for a given SNP and, via the p-value, the probability of seeing data this extreme if the SNP is independent of disease status. SNPs with very small p-values deviate significantly from the independence assumption and are therefore considered important.

10 Chi-square statistic exercise
Compute the expected values, the chi-square statistic, and the p-value by referring to the chi-square distribution.

               AA  AC  CC
    Case:      80  15   5
    Control:   60  15  25
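One way to work the exercise, using scipy (an assumption: the slides do not prescribe a tool). Note that for an r x c contingency table the degrees of freedom are (r-1)(c-1), here 2:

    import numpy as np
    from scipy.stats import chi2_contingency

    table = np.array([[80, 15, 5],    # case:    AA, AC, CC
                      [60, 15, 25]])  # control: AA, AC, CC

    chi2, p_value, dof, expected = chi2_contingency(table)
    print("expected counts:\n", expected)  # row total * col total / n
    print("chi2 =", chi2, "dof =", dof, "p =", p_value)
    # A small p here means the SNP deviates from independence.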

11 Logistic regression
The odds ratio estimated directly from the contingency table has a skewed sampling distribution. A better (discriminative) approach is to model the log likelihood ratio log(Pr(G|D=case)/Pr(G|D=control)) as a linear function, where G is the number of copies of the risk allele. In other words:

    \log \frac{P(G|D=case)}{P(G|D=control)} = w_0' + wG

Why:
–The log likelihood ratio is a powerful statistic
–Modeling it as a linear function yields a simple algorithm for estimating the parameters
With some manipulation (absorbing the prior log odds log(P(D=case)/P(D=control)) into the intercept w_0) this becomes

    P(D=case|G) = \frac{1}{1 + e^{-(w_0 + wG)}}

12 How do we get the odds ratio from logistic regression? (I)
Using Bayes rule we have

    \frac{P(G|D=case)}{P(G|D=control)} = \frac{P(D=case|G)\,P(D=control)}{P(D=control|G)\,P(D=case)}

And by taking the ratio of the model at G=1 and at G=0 (so that w_0' cancels) we get

    \log \frac{P(G=1|case)/P(G=1|control)}{P(G=0|case)/P(G=0|control)} = (w_0' + w) - w_0' = w

By exponentiating both sides we get

    \frac{P(G=1|case)/P(G=1|control)}{P(G=0|case)/P(G=0|control)} = e^w

13 How do we get the odds ratio from logistic regression? (II)
Continued from the previous slide: by rearranging the terms in the numerator and denominator we get

    \frac{P(G=1|case)/P(G=0|case)}{P(G=1|control)/P(G=0|control)} = e^w

By the symmetry of the odds ratio (the odds ratio of allele given disease equals the odds ratio of disease given allele) this is

    \frac{P(case|G=1)/P(control|G=1)}{P(case|G=0)/P(control|G=0)} = e^w

Since the original ratio is equal to e^w and is equal to the odds ratio, we conclude that the odds ratio is given by e^w.

14 How to find w and w_0?
And so e^w is our odds ratio. But how do we find w and w_0?
–We assume that one's disease status D given their genotype G is a Bernoulli random variable.
–Using this we form the sample likelihood
–Differentiate the likelihood with respect to w and w_0
–Use gradient descent

15 Today
Basic classification problem
Maximum likelihood
Logistic regression likelihood
Algorithm for the logistic regression likelihood
Support vector machine

16 Supervised learning for two classes
We are given n training samples (x_i, y_i) for i=1..n drawn i.i.d. from a probability distribution P(x,y). Each x_i is a d-dimensional vector (x_i ∈ R^d) and y_i is +1 or -1.
Our problem is to learn a function f(x) for predicting the labels of test samples x_i' ∈ R^d for i=1..n', also drawn i.i.d. from P(x,y).

17 Loss function
Loss function: c(x, y, f(x))
Maps to [0, ∞)
Examples: the 0-1 loss, c(x,y,f(x)) = 1 if f(x) ≠ y and 0 otherwise, and the squared loss, c(x,y,f(x)) = (y - f(x))^2.
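The two example losses in Python, for concreteness (the slide's own examples were images; these are representative standard choices):

    def zero_one_loss(y, fx):
        # 1 when the prediction disagrees with the label, 0 otherwise.
        return 0.0 if y == fx else 1.0

    def squared_loss(y, fx):
        # Penalizes predictions by their squared distance from the label.
        return (y - fx) ** 2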

18 Test error
We quantify the test error as the expected error on the test set (in other words, the average test error). In the case of two classes:

    R_{test}[f] = \frac{1}{n'} \sum_{i=1}^{n'} \sum_{y \in \{+1,-1\}} P(y|x_i')\, c(x_i', y, f(x_i'))

We'd like to find the f that minimizes this, but we need P(y|x), which we don't have access to.

19 Expected risk
Suppose we didn't have test data x'. Then we average the test error over all possible data points x:

    R[f] = \int c(x, y, f(x))\, dP(x,y)

We want to find the f that minimizes this, but we don't have all data points. We only have training data.

20 Empirical risk
Since we only have training data we can't calculate the expected risk (we don't even know P(x,y)). Solution: we approximate P(x,y) with the empirical distribution

    p_{emp}(x,y) = \frac{1}{n} \sum_{i=1}^{n} \delta_{x_i}(x)\, \delta_{y_i}(y)

The delta function is defined by \delta_x(y) = 1 if x = y and 0 otherwise.

21 Empirical risk
We can now define the empirical risk as

    R_{emp}[f] = \frac{1}{n} \sum_{i=1}^{n} c(x_i, y_i, f(x_i))

Once the loss function is defined and training data is given, we can find the f that minimizes this.

22 Example of minimizing empirical risk (least squares)
Suppose we are given n data points (x_i, y_i) where each x_i ∈ R^d and y_i ∈ R. We want to determine a linear function f(x) = ax + b for predicting test points.
Loss function: c(x_i, y_i, f(x_i)) = (y_i - f(x_i))^2
What is the empirical risk?

23 Empirical risk for least squares

    R_{emp}[f] = \frac{1}{n} \sum_{i=1}^{n} (y_i - a x_i - b)^2

Now finding f has reduced to finding a and b. Since this function is convex in a and b, there is a global optimum, which is easy to find by setting the first derivatives to 0.
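A minimal sketch in one dimension, solving the zero-derivative conditions in closed form (function and variable names are illustrative):

    # Minimize (1/n) * sum (y_i - a*x_i - b)^2 for scalar x.
    # Setting the derivatives w.r.t. a and b to zero yields the
    # familiar closed-form solution below.
    def least_squares_1d(xs, ys):
        n = len(xs)
        mean_x = sum(xs) / n
        mean_y = sum(ys) / n
        cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        var = sum((x - mean_x) ** 2 for x in xs)
        a = cov / var
        b = mean_y - a * mean_x
        return a, b

    a, b = least_squares_1d([0, 1, 2, 3], [1, 3, 5, 7])
    print(a, b)  # 2.0 1.0 (recovers y = 2x + 1 exactly)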

24 Empirical risk for logistic regression
Recall the logistic regression model

    P(D=case|G) = \frac{1}{1 + e^{-(w_0 + wG)}}

where G is the number of copies of the risk allele. In order to use this to predict one's risk of disease, or to determine the odds ratio, we need to know w and w_0. We use maximum likelihood to find w and w_0, but it doesn't yield a simple closed-form solution like least squares.

25 Maximum likelihood
We can classify by simply selecting the model M that has the highest P(M|D), where D = data and M = model. Thus classification can also be framed as the problem of finding the M that maximizes P(M|D). By Bayes rule:

    P(M|D) = \frac{P(D|M)\, P(M)}{P(D)}

26 Maximum likelihood
Suppose we have k models to consider, each with the same probability. In other words, we have a uniform prior distribution P(M) = 1/k. Then

    P(M|D) = \frac{P(D|M)}{k\, P(D)} \propto P(D|M)

In this case we can solve the classification problem by finding the model that maximizes P(D|M). This is called the maximum likelihood optimization criterion.

27 Maximum likelihood
Suppose we have n i.i.d. samples (x_i, y_i) drawn from M. The likelihood P(D|M) is

    P(D|M) = \prod_{i=1}^{n} P(x_i, y_i | M)

Consequently the log likelihood is

    \log P(D|M) = \sum_{i=1}^{n} \log P(x_i, y_i | M)

28 Maximum likelihood and empirical risk
Maximizing the likelihood P(D|M) is the same as maximizing log(P(D|M)), which is the same as minimizing -log(P(D|M)). Set the loss function to

    c(x_i, y_i, f(x_i)) = -\log P(x_i, y_i | M)

Now minimizing the empirical risk is the same as maximizing the likelihood.

29 Maximum likelihood example
Consider a set of coin tosses produced by a coin with P(H) = p (and P(T) = 1-p). We want to determine the probability P(H) of the coin that produced HTHHHTHHHTHTH.
Solution:
–Form the log likelihood
–Differentiate it with respect to p
–Set the derivative to 0 and solve for p
How about the probability P(H) of a coin that produces k heads and n-k tails? (See the worked derivation below.)
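A worked version of the general case (a standard derivation; the specific sequence above has 9 heads in 13 tosses):

    \log L(p) = \log\left[p^k (1-p)^{n-k}\right] = k \log p + (n-k)\log(1-p)

    \frac{d}{dp}\log L(p) = \frac{k}{p} - \frac{n-k}{1-p} = 0
      \;\Rightarrow\; k(1-p) = (n-k)p \;\Rightarrow\; p = \frac{k}{n}

For HTHHHTHHHTHTH, k = 9 and n = 13, so the maximum likelihood estimate is p = 9/13.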

30 Classification by likelihood
Suppose we have two classes C_1 and C_2. Compute the likelihoods P(D|C_1) and P(D|C_2). To classify test data D', assign it to class C_1 if P(D'|C_1) is greater than P(D'|C_2), and to C_2 otherwise.

31 Logistic regression (in light of disease risk prediction)
We assume that the probability of disease given one's genotype is

    P(D=case|G) = \frac{1}{1 + e^{-(w_0 + wG)}}

where G is the genotype in numeric format. Problem: given training data, we want to estimate w and w_0 by maximum likelihood.

32 Maximum likelihood for logistic regression
Assume that one's disease status given their genotype is a Bernoulli random variable with probability P(D=case|G_i). In other words, training sample i is a case with probability P(D=case|G_i) and a control with probability 1 - P(D=case|G_i). Assume we have m cases and n-m controls, with the cases listed first. The likelihood is given by

    L(w, w_0) = \prod_{i=1}^{m} P(D=case|G_i) \prod_{i=m+1}^{n} \left(1 - P(D=case|G_i)\right)

33 Maximum likelihood for logistic regression
From the likelihood above, the negative log likelihood is

    -\log L(w, w_0) = -\sum_{i=1}^{m} \log P(D=case|G_i) - \sum_{i=m+1}^{n} \log\left(1 - P(D=case|G_i)\right)

Set the first derivatives with respect to w and w_0 to 0 and solve. There is no closed form, so we have to use gradient descent.

34 Gradient descent
Given a convex function f(x,y), how can we find the x and y that minimize it? The solution is gradient descent:
–The gradient vector points in the direction of greatest increase of the function.
–Therefore the solution is to move in small increments in the negative direction of the gradient until we reach the optimum.
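A minimal sketch of gradient descent on the negative log likelihood from slide 33 (the learning rate, iteration count, and toy data are illustrative assumptions):

    import math

    def sigmoid(z):
        return 1.0 / (1.0 + math.exp(-z))

    def fit_logistic(genotypes, labels, lr=0.1, steps=5000):
        """genotypes: 0/1/2 risk-allele counts; labels: 1=case, 0=control."""
        w, w0 = 0.0, 0.0
        n = len(genotypes)
        for _ in range(steps):
            # Gradient of the (averaged) negative log likelihood:
            # sum of (p_i - y_i) times the input (g for w, 1 for w0).
            grad_w = sum((sigmoid(w0 + w * g) - y) * g
                         for g, y in zip(genotypes, labels)) / n
            grad_w0 = sum(sigmoid(w0 + w * g) - y
                          for g, y in zip(genotypes, labels)) / n
            w -= lr * grad_w    # step against the gradient
            w0 -= lr * grad_w0
        return w, w0

    # Toy data: the encoded A/T SNP from slide 6, cases first.
    w, w0 = fit_logistic([0, 1, 0, 2, 2, 1], [1, 1, 1, 0, 0, 0])
    print(w, math.exp(w))  # e^w estimates the odds ratio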

35 Disease risk prediction
How exactly do we predict risk?
–Personal genomics companies: composite odds ratio score
–Academia: composite odds ratio score and, recently, other classifiers as well
–Still an open problem: performance depends on the classifier and on the set of SNPs

36 Composite odds ratio score
Recall that we can obtain the odds ratio from the logistic regression model. Define

    \lambda = e^w

Now we can predict the risk with n alleles by combining the per-SNP odds ratios, e.g. as the product

    \prod_{i=1}^{n} \lambda_i^{G_i}

where G_i is the individual's risk-allele count at SNP i.
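A sketch of such a composite score (assuming the product form above; the odds ratios and genotype counts are made-up numbers):

    # Multiply each SNP's odds ratio once per copy of its risk allele.
    def composite_odds_ratio(odds_ratios, allele_counts):
        score = 1.0
        for lam, g in zip(odds_ratios, allele_counts):
            score *= lam ** g
        return score

    # Hypothetical per-SNP odds ratios and one individual's 0/1/2 counts.
    print(composite_odds_ratio([1.3, 0.9, 1.7], [2, 1, 0]))  # 1.521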

37 Example of risk prediction study (type 1 diabetes)

38 Example of risk prediction study (arthritis)

