Genome-wide association studies Usman Roshan

Recap
Single nucleotide polymorphisms
Genome-wide association studies
– Relative risk, and the odds ratio as an approximation to relative risk
– Chi-square test to determine significant SNPs
– Logistic regression model for estimating the odds ratio

SNP genotype representation
Example: the paternal (F) and maternal (M) sequences
F: AACACAATTAGTACAATTATGAC
M: AACAGAATTAGTACAATTATGAC
differ at a single position, so this individual's genotype at the SNP is represented as CG (the possible genotypes at a C/G SNP being CC, CG, and GG) …

SNP genotype encoding
If the SNP is A/B (alleles ordered alphabetically) then count the number of times we see B. The previous example becomes:

      A/T  C/T  G/T  …          A/T  C/T  G/T  …
H0:   AA   TT   GG   …           0    2    0   …
H1:   AT   CC   GT   …    =>     1    0    1   …
H2:   AA   CT   GT   …           0    1    1   …

Now we have the data in numerical format.
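A minimal Python sketch of this encoding (not from the slides; the helper name encode_genotype is made up) reproduces the table above:

def encode_genotype(genotype, snp):
    # Count copies of the alphabetically later allele, giving 0, 1, or 2.
    b = max(snp)
    return sum(1 for allele in genotype if allele == b)

snps = [("A", "T"), ("C", "T"), ("G", "T")]
samples = {"H0": ["AA", "TT", "GG"],
           "H1": ["AT", "CC", "GT"],
           "H2": ["AA", "CT", "GT"]}
for name, genotypes in samples.items():
    print(name, [encode_genotype(g, s) for g, s in zip(genotypes, snps)])
# H0 [0, 2, 0]   H1 [1, 0, 1]   H2 [0, 1, 1]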

Example GWAS

             A/T  C/G  A/G  …
Case 1:      AA   CC   AA
Case 2:      AT   CG   AA
Case 3:      AA   CG   AA
Control 1:   TT   GG   GG
Control 2:   TT   CC   GG
Control 3:   TA   CG   GG

Encoded data

          A/T  C/G  A/G          A/T  C/G  A/G
Case 1:   AA   CC   AA            0    0    0
Case 2:   AT   CG   AA            1    1    0
Case 3:   AA   CG   AA    =>      0    1    0
Con 1:    TT   GG   GG            2    2    2
Con 2:    TT   CC   GG            2    0    2
Con 3:    TA   CG   GG            1    1    2

Example

          AA   AC   CC
Case      80   15    5
Control   70   15   15

Odds of AA in cases = (80/100)/(20/100) = 4
Odds of AA in controls = (70/100)/(30/100) = 7/3
Odds ratio of AA = 4/(7/3) = 12/7
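A minimal sketch of this calculation in Python (the function name is illustrative, not from the slides):

def odds_ratio(case_exposed, case_total, control_exposed, control_total):
    # Odds = P(exposed) / P(not exposed) within each group.
    odds_case = (case_exposed / case_total) / (1 - case_exposed / case_total)
    odds_control = (control_exposed / control_total) / (1 - control_exposed / control_total)
    return odds_case / odds_control

print(odds_ratio(80, 100, 70, 100))   # 12/7 ≈ 1.714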

Chi-square statistic
Define the statistic

$\chi^2 = \sum_{i=1}^{n} \frac{(c_i - e_i)^2}{e_i}$

where
c_i = observed frequency for the i-th outcome
e_i = expected frequency for the i-th outcome
n = total number of outcomes
The probability distribution of this statistic is given by the chi-square distribution with n − 1 degrees of freedom. Great. But how do we use this to get a SNP p-value?

Null hypothesis for the case-control contingency table

          AA   AC   CC
Case      c1   c2   c3
Control   c4   c5   c6

n = c1 + c2 + c3 + c4 + c5 + c6

We have two random variables:
– D: disease status
– G: genotype
Null hypothesis: the two variables are independent of each other (unrelated).
Under independence:
– P(D,G) = P(D)P(G)
– P(D=case) = (c1 + c2 + c3)/n
– P(G=AA) = (c1 + c4)/n
Expected values, for example:
– E(c1) = P(D=case) P(G=AA) n
We can calculate the chi-square statistic for a given SNP and, from it, the probability (p-value) of observing such counts if the SNP were independent of disease status. SNPs with very small p-values deviate significantly from the independence assumption and are therefore considered important.

Chi-square statistic exercise
Compute the expected values, the chi-square statistic, and the p-value by referring to the chi-square distribution.

          AA   AC   CC
Case      80   15    5
Control   60   15   25
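A sketch of the exercise in Python, using scipy (my choice of library, not mentioned on the slides):

import numpy as np
from scipy.stats import chi2_contingency

table = np.array([[80, 15, 5],     # cases:    AA, AC, CC
                  [60, 15, 25]])   # controls: AA, AC, CC
chi2, pvalue, dof, expected = chi2_contingency(table)
# For an r-by-c contingency table the degrees of freedom are (r-1)(c-1) = 2.
print(expected)                # expected counts under independence
print(chi2, dof, pvalue)       # chi2 ≈ 16.19, dof = 2, p ≈ 3.05e-04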

Logistic regression
The odds ratio estimated directly from the contingency table has a skewed sampling distribution. A better (discriminative) approach is to model the log likelihood ratio as a linear function of G, the number of copies of the risk allele:

$\log \frac{\Pr(G \mid D=\mathrm{case})}{\Pr(G \mid D=\mathrm{control})} = wG + w_0$

Why:
– The log likelihood ratio is a powerful statistic
– Modeling it as a linear function yields a simple algorithm to estimate the parameters
With some manipulation this becomes the logistic model

$\Pr(D=\mathrm{case} \mid G) = \frac{1}{1 + e^{-(wG + w_0)}}$

(with the intercept absorbing the log ratio of class priors).

How do we get the odds ratio from logistic regression? (I)
Using Bayes rule we have

$\frac{\Pr(G \mid \mathrm{case})}{\Pr(G \mid \mathrm{control})} = \frac{\Pr(\mathrm{case} \mid G)\,\Pr(\mathrm{control})}{\Pr(\mathrm{control} \mid G)\,\Pr(\mathrm{case})}$

And by taking the ratio of this quantity at G=1 and G=0 (the prior terms cancel), the linear model gives

$\log \frac{\Pr(G{=}1 \mid \mathrm{case})/\Pr(G{=}1 \mid \mathrm{control})}{\Pr(G{=}0 \mid \mathrm{case})/\Pr(G{=}0 \mid \mathrm{control})} = (w \cdot 1 + w_0) - (w \cdot 0 + w_0) = w$

By exponentiating both sides we get

$\frac{\Pr(G{=}1 \mid \mathrm{case})/\Pr(G{=}1 \mid \mathrm{control})}{\Pr(G{=}0 \mid \mathrm{case})/\Pr(G{=}0 \mid \mathrm{control})} = e^w$

How do we get the odds ratio from logistic regression? (II)
Continued from the previous slide: by rearranging the terms in the numerator and denominator we get

$e^w = \frac{\Pr(G{=}1 \mid \mathrm{case})/\Pr(G{=}0 \mid \mathrm{case})}{\Pr(G{=}1 \mid \mathrm{control})/\Pr(G{=}0 \mid \mathrm{control})}$

the odds of carrying the risk allele in cases over the same odds in controls. By symmetry of the odds ratio this is

$\frac{\Pr(\mathrm{case} \mid G{=}1)/\Pr(\mathrm{control} \mid G{=}1)}{\Pr(\mathrm{case} \mid G{=}0)/\Pr(\mathrm{control} \mid G{=}0)}$

Since the original ratio equals $e^w$ and equals the odds ratio, we conclude that the odds ratio is given by $e^w$.

How do we find w and w_0?
So $e^w$ is our odds ratio. But how do we find w and w_0?
– Assume that disease status D given genotype G is a Bernoulli random variable.
– Using this, form the sample likelihood.
– Differentiate the likelihood with respect to w and w_0.
– Use gradient descent.

Today
– Basic classification problem
– Maximum likelihood
– Logistic regression likelihood
– Algorithm for the logistic regression likelihood
– Support vector machines

Supervised learning for two classes
We are given n training samples (x_i, y_i) for i = 1..n drawn i.i.d. from a probability distribution P(x,y). Each x_i is a d-dimensional vector (x_i ∈ R^d) and y_i is +1 or −1. Our problem is to learn a function f(x) for predicting the labels of test samples x_i′ ∈ R^d for i = 1..n′, also drawn i.i.d. from P(x,y).

Loss function
A loss function c(x, y, f(x)) maps to [0, ∞]. Examples:
– 0/1 loss: c(x, y, f(x)) = 1 if f(x) ≠ y and 0 otherwise
– squared loss: c(x, y, f(x)) = (y − f(x))²

Test error
We quantify the test error as the expected error on the test set (in other words, the average test error). In the case of two classes:

$R_{\mathrm{test}}[f] = \frac{1}{n'}\sum_{i=1}^{n'}\Big[c(x_i',+1,f(x_i'))\,\Pr(y{=}{+}1 \mid x_i') + c(x_i',-1,f(x_i'))\,\Pr(y{=}{-}1 \mid x_i')\Big]$

We'd like to find the f that minimizes this, but we need P(y|x), which we don't have access to.

Expected risk
Suppose we didn't have test data x′. Then we average the test error over all possible data points x:

$R[f] = \int c(x,y,f(x))\, dP(x,y)$

We want to find the f that minimizes this, but we don't have all data points; we only have training data.

Empirical risk
Since we only have training data we can't calculate the expected risk (we don't even know P(x,y)). Solution: approximate P(x,y) with the empirical distribution

$p_{\mathrm{emp}}(x,y) = \frac{1}{n}\sum_{i=1}^{n}\delta_{x_i}(x)\,\delta_{y_i}(y)$

where the delta function δ_x(y) = 1 if x = y and 0 otherwise.

Empirical risk
We can now define the empirical risk as

$R_{\mathrm{emp}}[f] = \frac{1}{n}\sum_{i=1}^{n} c(x_i, y_i, f(x_i))$

Once the loss function is defined and training data are given, we can find the f that minimizes this.
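As an illustration (my own sketch, not from the slides), the empirical risk under the 0/1 loss is just the training error rate:

def empirical_risk(xs, ys, f, loss):
    # Average loss of f over the training sample.
    return sum(loss(x, y, f(x)) for x, y in zip(xs, ys)) / len(xs)

zero_one = lambda x, y, fx: 0.0 if fx == y else 1.0
f = lambda x: 1 if x >= 0 else -1            # a fixed threshold classifier
xs, ys = [-2.0, -0.5, 0.3, 1.2], [-1, -1, 1, -1]
print(empirical_risk(xs, ys, f, zero_one))   # 0.25: one of four points misclassified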

Example of minimizing empirical risk (least squares)
Suppose we are given n data points (x_i, y_i) where each x_i ∈ R^d and y_i ∈ R. We want to determine a linear function f(x) = a·x + b for predicting test points. Loss function: c(x_i, y_i, f(x_i)) = (y_i − f(x_i))². What is the empirical risk?

Empirical risk for least squares

$R_{\mathrm{emp}}[f] = \frac{1}{n}\sum_{i=1}^{n}(y_i - a \cdot x_i - b)^2$

Now finding f has been reduced to finding a and b. Since this function is convex in a and b, there is a global optimum, which is easy to find by setting the first derivatives to 0.
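A minimal one-dimensional sketch (assuming scalar x; numpy is my choice, not the slides'): setting the derivatives to zero gives the usual closed form.

import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 2.9, 5.1, 7.0])

# Closed-form solution of d/da, d/db of sum((y - a*x - b)^2) = 0.
a = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b = y.mean() - a * x.mean()
print(a, b)   # a ≈ 2.02, b ≈ 0.97: the least-squares line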

Empirical risk for logistic regression
Recall the logistic regression model

$\Pr(D=\mathrm{case} \mid G) = \frac{1}{1 + e^{-(wG + w_0)}}$

where G is the number of copies of the risk allele. In order to use this to predict one's risk of disease, or to determine the odds ratio, we need to know w and w_0. We use maximum likelihood to find them, but it does not yield a simple closed-form solution like least squares.

Maximum likelihood
We can classify by simply selecting the model M that has the highest P(M|D), where D = data and M = model. Thus classification can also be framed as the problem of finding the M that maximizes P(M|D). By Bayes rule:

$P(M \mid D) = \frac{P(D \mid M)\,P(M)}{P(D)}$

Maximum likelihood
Suppose we have k models to consider, each with the same prior probability; in other words, a uniform prior P(M) = 1/k. Then

$P(M \mid D) = \frac{P(D \mid M)}{k\,P(D)} \propto P(D \mid M)$

In this case we can solve the classification problem by finding the model that maximizes P(D|M). This is called the maximum likelihood optimization criterion.

Maximum likelihood
Suppose we have n i.i.d. samples (x_i, y_i) drawn from M. The likelihood is

$P(D \mid M) = \prod_{i=1}^{n} P(x_i, y_i \mid M)$

and consequently the log likelihood is

$\log P(D \mid M) = \sum_{i=1}^{n} \log P(x_i, y_i \mid M)$

Maximum likelihood and empirical risk
Maximizing the likelihood P(D|M) is the same as maximizing log P(D|M), which is the same as minimizing −log P(D|M). Set the loss function to

$c(x_i, y_i, f(x_i)) = -\log P(x_i, y_i \mid M)$

Now minimizing the empirical risk is the same as maximizing the likelihood.

Maximum likelihood example
Consider a set of coin tosses produced by a coin with P(H) = p (and P(T) = 1 − p). We want to determine the probability P(H) of the coin that produced HTHHHTHHHTHTH. Solution:
– Form the log likelihood
– Differentiate it with respect to p
– Set the derivative to 0 and solve for p
How about the probability P(H) of a coin that produces k heads and n − k tails? (See the worked derivation below.)
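A worked version of the general case (my derivation, following the steps on the slide):

$L(p) = p^k (1-p)^{n-k}$
$\log L(p) = k \log p + (n-k)\log(1-p)$
$\frac{d}{dp}\log L(p) = \frac{k}{p} - \frac{n-k}{1-p} = 0 \;\Rightarrow\; p = \frac{k}{n}$

For HTHHHTHHHTHTH there are k = 9 heads in n = 13 tosses, so the maximum likelihood estimate is p = 9/13.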

Classification by likelihood
Suppose we have two classes C_1 and C_2. Compute the likelihoods P(D|C_1) and P(D|C_2). To classify test data D′, assign it to class C_1 if P(D′|C_1) is greater than P(D′|C_2), and to C_2 otherwise.

Logistic regression (in light of disease risk prediction)
We assume that the probability of disease given genotype is

$\Pr(D=\mathrm{case} \mid G) = \frac{1}{1 + e^{-(wG + w_0)}}$

where G is the genotype in numeric (0/1/2) format. Problem: given training data, estimate w and w_0 by maximum likelihood.
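For illustration (the parameter values here are made up, not fitted), the model maps a genotype and parameters to a risk estimate:

import math

def disease_risk(g, w, w0):
    # Logistic model: Pr(case | G = g) = 1 / (1 + exp(-(w*g + w0))).
    return 1.0 / (1.0 + math.exp(-(w * g + w0)))

w, w0 = 0.4, -1.0                     # hypothetical parameter values
for g in (0, 1, 2):                   # copies of the risk allele
    print(g, round(disease_risk(g, w, w0), 3))   # 0.269, 0.354, 0.450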

Maximum likelihood for logistic regression
Assume that disease status given genotype is a Bernoulli random variable: training sample i is a case with probability P(D=case|G_i) and a control with probability 1 − P(D=case|G_i). Assume we have m cases and n − m controls. The likelihood is given by

$L(w, w_0) = \prod_{i=1}^{m} \Pr(D=\mathrm{case} \mid G_i) \prod_{i=m+1}^{n} \big(1 - \Pr(D=\mathrm{case} \mid G_i)\big)$

Maximum likelihood for logistic regression
The negative log likelihood is

$-\log L(w, w_0) = -\sum_{i=1}^{m} \log \Pr(D=\mathrm{case} \mid G_i) - \sum_{i=m+1}^{n} \log\big(1 - \Pr(D=\mathrm{case} \mid G_i)\big)$

Set the first derivatives with respect to w and w_0 to 0 and solve. There is no closed form, so we have to use gradient descent.

Gradient descent
Given a convex function f(x, y), how can we find the x and y that minimize it? The solution is gradient descent:
– The gradient vector points in the direction of greatest increase of the function.
– Therefore move in small increments in the negative direction of the gradient until we reach the optimum. (A sketch for the logistic regression likelihood follows.)
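A minimal sketch of gradient descent for the logistic regression likelihood (my own implementation under the slide's assumptions; the learning rate and iteration count are arbitrary):

import math

def fit_logistic(genotypes, labels, lr=0.05, iters=5000):
    # Minimize the negative log likelihood of the model
    # Pr(case | G) = 1 / (1 + exp(-(w*G + w0))) by gradient descent.
    w, w0 = 0.0, 0.0
    for _ in range(iters):
        gw, gw0 = 0.0, 0.0
        for g, y in zip(genotypes, labels):      # y = 1 case, 0 control
            p = 1.0 / (1.0 + math.exp(-(w * g + w0)))
            gw += (p - y) * g                    # d(-log L)/dw
            gw0 += (p - y)                       # d(-log L)/dw0
        w -= lr * gw
        w0 -= lr * gw0
    return w, w0

# Encoded A/T column from the earlier example: cases 0,1,0; controls 2,2,1.
w, w0 = fit_logistic([0, 1, 0, 2, 2, 1], [1, 1, 1, 0, 0, 0])
print(w, math.exp(w))   # e^w estimates the per-allele odds ratio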

Disease risk prediction
How exactly do we predict risk?
– Personal genomics companies: composite odds ratio score
– Academia: composite odds ratio score, and more recently other classifiers as well
– Still an open problem: accuracy depends upon the classifier and the set of SNPs

Composite odds ratio score
Recall that we can obtain the odds ratio from the logistic regression model: for SNP i, define

$\lambda_i = e^{w_i}$

Now we can predict the risk with n alleles by combining the per-SNP odds ratios:

$\mathrm{score} = \prod_{i=1}^{n} \lambda_i^{G_i}$

where G_i is the number of risk alleles at SNP i.
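A hedged sketch of the composite score (the product form above is a reconstruction; the per-SNP effect sizes here are invented):

import math

def composite_odds_ratio_score(genotypes, ws):
    # Multiply each SNP's odds ratio e^w_i raised to the allele count G_i.
    score = 1.0
    for g, w in zip(genotypes, ws):
        score *= math.exp(w) ** g
    return score

ws = [0.18, -0.05, 0.40]        # hypothetical per-SNP log odds ratios
print(composite_odds_ratio_score([0, 1, 2], ws))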

Example of risk prediction study (type 1 diabetes)

Example of risk prediction study (arthritis)