Linear and Nonlinear Modeling Class web site: Statistics for Microarrays

cDNA gene expression data Data on G genes for n samples, arranged as a genes × mRNA samples matrix. Gene expression level of gene i in mRNA sample j = (normalized) log(Red intensity / Green intensity). [Figure: data matrix with columns sample 1, sample 2, …, sample n]
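As a minimal sketch of the quantity defined above (the `red` and `green` intensity arrays are hypothetical, the base-2 log and median normalization are assumptions on my part, since the slide does not fix either):

```python
import numpy as np

# Hypothetical background-corrected intensities: G genes x n samples
rng = np.random.default_rng(0)
red = rng.gamma(shape=2.0, scale=500.0, size=(6, 5))    # G = 6 genes
green = rng.gamma(shape=2.0, scale=500.0, size=(6, 5))  # n = 5 samples

# Expression level of gene i in sample j = log(Red intensity / Green intensity)
M = np.log2(red / green)

# One simple normalization: subtract each sample's median log-ratio
M_normalized = M - np.median(M, axis=0, keepdims=True)
print(M_normalized.shape)  # (6, 5): genes x samples
```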

Identifying Differentially Expressed Genes Goal: Identify genes associated with a covariate or response of interest. Examples:
– Qualitative covariates or factors: treatment, cell type, tumor class
– Quantitative covariates: dose, time
– Responses: survival, cholesterol level
– Any combination of these!

Modeling Introduction Want to capture important features of the relationship between a (set of) variable(s) and one or more responses. Many models are of the form g(Y) = f(x) + error; they differ in the form of g and f and in the distributional assumptions about the error term.

Examples of Models
Linear: Y = β0 + β1x + ε
Linear: Y = β0 + β1x + β2x² + ε
(Intrinsically) Nonlinear: Y = β0 · x1^β1 · x2^β2 · x3^β3 + ε
Generalized Linear Model (e.g. Binomial): ln(p/[1−p]) = β0 + β1x1 + β2x2
Proportional Hazards (in Survival Analysis): h(t) = h0(t) exp(βx)

Linear Modeling
A simple linear model: E(Y) = β0 + β1x
Gaussian measurement model: Y = β0 + β1x + ε, where ε ~ N(0, σ²)
More generally: Y = Xβ + ε, where Y is n × 1, X is n × G, β is G × 1, ε is n × 1, often assumed N(0, σ²I) with I the n × n identity

Analysis of Designed Experiments An important use of linear models. Here, the response (Y) is the gene expression level. Define a (design) matrix X so that E(Y) = Xβ, where β is a vector of contrasts. Many ways to define the design matrix/contrasts.

Contrasts (I) Example: One-way layout, k classes: Y_ij = μ + α_j + ε_ij, i = 1,…,n_j; j = 1,…,k. Design matrix X = [1 X_a], a column of 1's followed by the k class-indicator columns. Problem: this specification is over-parametrized.

Contrasts (II) Could resolve by removing the column of 1's: Y_ij = μ_j + ε_ij, i = 1,…,n_j; j = 1,…,k. Design matrix X = [X_a], the matrix of class-indicator columns. Here, the parameters are the class means.

Contrasts (III) Define design matrix X* = [1 X_a C_a], with C_a (k × (k−1)) chosen so that X* has rank k (the number of columns). Parameters may become difficult to interpret. In the balanced case (equal number of observations in each class), can choose orthogonal contrasts (Helmert).
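A small sketch of the three parametrizations above for a one-way layout (the class labels are made up, and the Helmert-style contrast matrix is just one standard choice of C_a, not necessarily the one used in the lecture):

```python
import numpy as np

labels = np.array([0, 0, 1, 1, 2, 2])           # k = 3 classes, n_j = 2 each (balanced)
n, k = labels.size, 3

Xa = np.zeros((n, k))
Xa[np.arange(n), labels] = 1.0                  # class-indicator columns

ones = np.ones((n, 1))
X_over = np.hstack([ones, Xa])                  # [1 X_a]: k+1 columns but rank k (over-parametrized)
X_means = Xa                                    # cell-means coding: parameters are the class means

# Helmert-style contrasts C_a (k x (k-1)); full-rank design X* = [1  X_a C_a]
Ca = np.array([[-1.0, -1.0],
               [ 1.0, -1.0],
               [ 0.0,  2.0]])
X_star = np.hstack([ones, Xa @ Ca])

print(np.linalg.matrix_rank(X_over), np.linalg.matrix_rank(X_star))  # both k = 3
```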

Model Fitting For the standard (fixed effects) linear model, estimation is usually by least squares. Estimation can be more complicated with random effects, or when the x-variables are themselves subject to measurement error (which must be accounted for so that estimates are not biased).
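A minimal least-squares sketch for the fixed-effects model Y = Xβ + ε (the data and "true" coefficients are simulated purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
x = rng.uniform(0, 10, size=n)
X = np.column_stack([np.ones(n), x])              # design matrix: intercept + slope
beta_true = np.array([2.0, 0.5])                  # assumed true beta_0, beta_1
y = X @ beta_true + rng.normal(0, 1.0, size=n)    # Y = X beta + eps, eps ~ N(0, sigma^2)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)  # least-squares estimate of beta
residuals = y - X @ beta_hat                      # used below for model checking
print(beta_hat)
```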

Model Checking Examination of residuals:
– Normality
– Time effects
– Nonconstant variance
– Curvature
Detection of influential observations

Do we need robust methods? Tukey (1962): “A tacit hope in ignoring deviations from ideal models was that they would not matter; that statistical procedures which were optimal under the strict model would still be approximately optimal under the approximate model. Unfortunately it turned out that this hope was often drastically wrong; even mild deviations often have much larger effects than were anticipated by most statisticians.”

Robust Regression Idea: downweight observations that produce large residuals. More computationally intensive than least squares regression (which gives equal weight to each observation). Use maximum likelihood if a specific error distribution can be assumed; when it cannot, use M-estimators.

M-estimators ‘Maximum likelihood type’ estimators. Assume independent errors with distribution f(ε). The robust estimator minimizes Σ_i ρ(e_i/s) = Σ_i ρ{(Y_i − x_i′β)/s}, where ρ(·) is some function and s is an estimate of scale. ρ(u) = u² corresponds to minimizing the sum of squares.

M-estimation Procedure To minimize Σ_i ρ{(Y_i − x_i′β)/s} with respect to the β's, take derivatives and equate to 0. The resulting equations do not have an explicit solution in general. Solve by iteratively reweighted least squares.

Examples of Weight Functions
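A rough sketch of the iteratively reweighted least squares procedure, using the Huber weight function as one example of a weight function (Tukey's bisquare would be another); the tuning constant 1.345 and the MAD-based scale estimate are conventional choices, not taken from the slides:

```python
import numpy as np

def huber_weight(u, c=1.345):
    """Huber weights: 1 for small standardized residuals, c/|u| for large ones."""
    au = np.maximum(np.abs(u), 1e-12)
    return np.minimum(1.0, c / au)

def m_estimate(X, y, n_iter=20):
    # Start from the ordinary least-squares fit
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    for _ in range(n_iter):
        e = y - X @ beta
        s = np.median(np.abs(e - np.median(e))) / 0.6745   # robust scale estimate (MAD)
        w = huber_weight(e / s)
        # Weighted least-squares step: solve (X' W X) beta = X' W y
        WX = X * w[:, None]
        beta = np.linalg.solve(X.T @ WX, X.T @ (w * y))
    return beta
```

With the X and y from the least-squares sketch above, `m_estimate(X, y)` would give the corresponding robust fit.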

Generalized Linear Models (GLM/GLIM) Response Y assumed to have an exponential family distribution: f(y) = exp[a(y)b(θ) + c(θ) + d(y)]. Parameters β and explanatory variables X; linear predictor η = β1x1 + β2x2 + … + βpxp. Mean response μ, link function l(μ) = η. Allows unified treatment of statistical methods for several important classes of models.

Some Examples LinkBinomialGammaNormalPoisson logit Default probit X cloglog X identity XXX inverse Default Log XDefault Sqrt X

(BREAK)

Survival Modeling Response T is a (nonnegative) lifetime. Cumulative distribution function (cdf) F(t), density f(t). More usual to work with the survivor function S(t) = 1 − F(t) = P(T > t) and the instantaneous failure rate, or hazard function, h(t) = lim_{Δt→0} P(t ≤ T < t+Δt | T ≥ t)/Δt.

Relations Between Functions Cumulative hazard function H(t) = ∫₀ᵗ h(s) ds; h(t) = f(t)/S(t); H(t) = −log S(t).
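A quick numerical illustration of these identities using the exponential distribution, where the hazard is constant (this example is not from the slides; scipy is assumed here for the density and survivor function):

```python
import numpy as np
from scipy.stats import expon

lam = 0.3
t = np.linspace(0.1, 10, 50)
dist = expon(scale=1.0 / lam)     # exponential lifetime with rate lam

S = dist.sf(t)                    # survivor function S(t) = P(T > t)
f = dist.pdf(t)                   # density f(t)
h = f / S                         # hazard h(t) = f(t)/S(t)  -> constant lam here
H = -np.log(S)                    # cumulative hazard H(t) = -log S(t) = lam * t

print(np.allclose(h, lam), np.allclose(H, lam * t))   # True True
```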

Censoring Incomplete information on the lifetime: a censored observation is one whose value is incomplete due to factors that are random for each individual. Most commonly, observation begins at time t = 0 and ends before the outcome of interest is observed (right-censoring).

Estimation of Survivor Function The most commonly used estimate is the Kaplan-Meier (also called product-limit) estimator. Risk set r(t) = number of cases alive just before time t; d_i = number of deaths at time t_i. Ŝ(t) = ∏_{t_i ≤ t} [r(t_i) − d_i]/r(t_i).
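A bare-bones sketch of the product-limit formula above, on tiny made-up data (right-censored times are marked by event = 0):

```python
import numpy as np

# Hypothetical survival data: time and event indicator (1 = death, 0 = censored)
time  = np.array([2.0, 3.0, 3.0, 5.0, 8.0, 9.0, 12.0])
event = np.array([1,   1,   0,   1,   0,   1,   0])

def kaplan_meier(time, event):
    death_times = np.unique(time[event == 1])
    surv, S = [], 1.0
    for t in death_times:
        r = np.sum(time >= t)                     # risk set r(t): cases still under observation just before t
        d = np.sum((time == t) & (event == 1))    # deaths d_i at time t
        S *= (r - d) / r                          # product-limit update
        surv.append((t, S))
    return surv

for t, S in kaplan_meier(time, event):
    print(f"S({t:g}) = {S:.3f}")
```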

Cox Proportional Hazards Model Baseline hazard function h0(t), modified multiplicatively by covariates. Hazard function for an individual case is h(t) = h0(t) exp(β1x1 + β2x2 + … + βpxp). If nonproportionality:
– 1. Does it matter?
– 2. Is it real?
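One possible way to fit this model in practice; the lifelines package and the toy data frame here are assumptions on my part, not something the lecture specifies:

```python
import pandas as pd
from lifelines import CoxPHFitter

# Hypothetical data: survival time, event indicator, and two covariates
df = pd.DataFrame({
    "time":  [5.0, 8.0, 12.0, 3.0, 9.0, 15.0, 6.0, 11.0],
    "event": [1,   1,   0,    1,   0,   1,    1,   0],
    "x1":    [0.2, -1.1, 0.5, 1.3, -0.4, 0.0, 0.9, -0.7],
    "x2":    [1,    0,   1,   0,    1,   0,   1,    0],
})

cph = CoxPHFitter()
cph.fit(df, duration_col="time", event_col="event")   # h(t) = h0(t) * exp(b1*x1 + b2*x2)
cph.print_summary()                                   # coefficients, standard errors, p-values
# cph.check_assumptions(df) can be used to probe the proportional-hazards assumption
```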

Strategies for Gene Expression-based Modeling The biggest problem is the large number of variables (genes). One possibility is to first reduce the number of genes under consideration (e.g. by considering variability across samples, or the coefficient of variation). Screening/prioritizing: a one-gene-at-a-time approach, or two at a time, perhaps plus an interaction.

Example: Survival analysis with expression data Bittner et al. dataset:
– 15 of the 31 melanomas had associated survival times
– 3613 ‘strongly detected’ genes

[Figure: Average Linkage Hierarchical Clustering, with the ‘cluster’ and unclustered samples indicated]

Association of Variables Variables tested for association with cluster:
– Sex (p = 0.68, n = 27)
– Age (p = 0.14, n = 25)
– Mutation status (p = 0.17, n = 19)
– Biopsy site (p = 0.88, n = 24)
– Pigment (p = 0.26, n = 22)
– Breslow thickness (p = 0.26, n = 9)
– Clark level (p = 0.44, n = 11)
– Specimen type (p = 0.11, n = 23)

Survival analysis: Bittner et al. Bittner et al. also looked at differences in survival between the two groups (the ‘cluster’ and the ‘unclustered’ samples) ‘Cluster’ seemed associated with longer survival

Kaplan-Meier Survival Curves

[Figure: Average Linkage Hierarchical Clustering, survival samples only, with the unclustered and ‘cluster’ groups indicated]

Kaplan-Meier Survival Curves, new grouping

Identification of Genes Associated with Survival For each gene j, j = 1, …, 3613, model the instantaneous failure rate, or hazard function, h(t) with the Cox proportional hazards model: h(t) = h0(t) exp(β_j x_ij), and look for genes with both:
– a large effect size β̂_j
– a large standardized effect size β̂_j / SE(β̂_j)
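A sketch of that gene-by-gene screen, again leaning on lifelines as an assumed tool (`expr` is a hypothetical genes × samples matrix aligned with the survival data, and the `params_`/`standard_errors_` attribute names follow current lifelines, which is also an assumption):

```python
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

def screen_genes(expr, time, event):
    """Fit a univariate Cox model per gene; collect effect sizes and standardized effects."""
    results = []
    for j in range(expr.shape[0]):
        df = pd.DataFrame({"time": time, "event": event, "x": expr[j, :]})
        cph = CoxPHFitter()
        cph.fit(df, duration_col="time", event_col="event")
        beta = cph.params_["x"]                   # estimated beta_j
        se = cph.standard_errors_["x"]            # SE(beta_j)
        results.append((j, beta, beta / se))
    out = pd.DataFrame(results, columns=["gene", "beta", "beta_over_se"])
    # Rank genes by absolute standardized effect size
    return out.reindex(out["beta_over_se"].abs().sort_values(ascending=False).index)
```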

Sites Potentially Influencing Survival
Image Clone ID   UniGene Cluster   UniGene Cluster Title
—                Hs.—              Glutamate receptor interacting protein
—                Hs.57419          Transcriptional repressor
—                Hs.74649          Cytochrome c oxidase subunit VIc
—                Hs.—              ESTs, Highly similar to topoisomerase
—                Hs.77665          KIAA0102 gene product

Findings Top 5 genes by this method not in Bittner et al. ‘weighted gene list’ – Why?
– weighted gene list based on the entire sample; our method only used half
– weighting relies on Bittner et al. cluster assignment
– other possibilities?

Statistical Significance of Cox Model Coefficients

Advantages of Modeling
– Can address questions of interest directly. Contrast with what has become the ‘usual’ (and indirect) approach with microarrays: clustering, followed by tests of association between cluster group and variables of interest
– Great deal of existing machinery
– Quantitatively assess strength of evidence

Limitations of Single Gene Tests
– May be too noisy in general to show much
– Do not reveal coordinated effects of positively correlated genes
– Hard to relate to pathways

Not Covered…
Careful followup:
– Assessment of proportionality
– Inclusion of combinations of genes, interactions
– Consideration of alternative models
Power assessment:
– Not worth it here; there can’t be much!

Some ideas for further work
Expand models to include more genes, possibly two-way interactions:
– Issue of automation
– Still very small scale compared to probable pathway size, number of genes involved, etc.
Nonparametric tree-based modeling:
– Will require much larger sample sizes

Acknowledgements Debashis Ghosh Erin Conlon Sandrine Dudoit José Correa