Slide 1: BMS 617, Lecture 11: Models
Marshall University School of Medicine
Department of Biochemistry and Microbiology
Marshall University Genomics Core Facility
Slide 2: What is a model?
In general, a model is a (simpler) representation of something else.
– We use models to study complex phenomena
– A model is easier to manipulate than the real thing of interest
– A model makes it easier to focus on specific aspects
– E.g. we use mouse models to study human disease: it is easier to control the behavior and the genetics of the mouse
Slide 3: What is a mathematical model?
A mathematical model is an equation (or set of equations) that describes a physical state or process.
– It describes how values in the state or process are related to each other
The aim is not to provide a perfect model.
– A good model is simple enough to be easy to understand
– Yet complex enough to be useful
Slide 4: Statistical models
Statistical models are mathematical models that describe both the ideal predictions and the random "scatter" or "noise".
– They model both the population values and the "random" variation from those population values
– "Random" variation is really just variation not explained or accounted for by the model
Slide 5: Model terminology
A model is an equation (or set of equations). The equation defines the outcome, or dependent variable, as a function of:
– one or more independent variables, and
– one or more parameters
Each data point has its own values for the independent and dependent variables. The values of the parameters are properties of the population:
– they do not vary from data point to data point
Slide 6: Fitting a model to data
The parameters are properties of the population, so they are unknown. Typically, we collect a sample of data points. Assuming the model is correct, we can use the sample to estimate the parameters of the model.
– This is called "fitting a model to the data"
– It results in estimates and confidence intervals for each of the parameters
Slide 7: The simplest possible model
The simplest possible model for a data set involves no independent variable at all. We sample values from a population and assume the population values follow a normal distribution. Our model is Y = μ + ε.
Slide 8: The average as a model
In the simple model Y = μ + ε:
– Y is the dependent variable; it takes a different value for each data point
– μ is a parameter: the mean of the population, a single unknown value we will estimate from our data
– ε is the "random error": different for each data point, and assumed normally distributed with mean zero
We can make the roles of the variable types more explicit by writing Y_i = μ + ε_i.
Slide 9: Why the mean is important
If we assume the model is correct:
– our data are sampled from a population where the values are some fixed value, plus scatter that is normally distributed with mean zero
then we want to use our data to estimate μ. It turns out that, out of all possible choices of μ, the value that makes our observed data the most likely is the mean of our data.
– The mean is the maximum likelihood estimate of μ
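This claim can be checked numerically. The sketch below uses a made-up sample (the data and the grid search are illustrative, not part of the lecture): the value of μ that maximizes the normal log-likelihood turns out to be the sample mean.

```python
import numpy as np

# Hypothetical sample standing in for data drawn from the model Y_i = mu + eps_i
rng = np.random.default_rng(0)
sample = rng.normal(loc=5.0, scale=2.0, size=50)

# Normal log-likelihood as a function of mu (constants and sigma dropped,
# since they do not change which mu maximizes it)
def log_likelihood(mu, data):
    return -0.5 * np.sum((data - mu) ** 2)

# Search a fine grid of candidate values for mu
grid = np.linspace(sample.mean() - 1.0, sample.mean() + 1.0, 2001)
best_mu = grid[np.argmax([log_likelihood(m, sample) for m in grid])]

# The grid maximizer lands on (essentially) the sample mean
print(best_mu, sample.mean())
```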
Slide 10: A more sophisticated model: linear regression
Remember the example from linear regression:
– We measured insulin sensitivity and %C20-22 fatty acid content in 13 healthy men
– We hypothesized that an increase in %C20-22 content caused an increase in insulin sensitivity
– We used linear regression to fit the model Y = intercept + slope × X + scatter to the data, where Y is the insulin sensitivity and X the %C20-22 content
In more conventional notation: Y = β0 + β1 × X + ε, or Y_i = β0 + β1 × X_i + ε_i.
Slide 11: Linear regression as a statistical model
The linear regression model has two parameters:
– β0, the intercept
– β1, the slope
Both are properties of the population, and we use the data to estimate them.
– The method of "least squares" gives the maximum likelihood estimates for the two parameters: the values that maximize the chance of our data being observed
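A minimal sketch of a least-squares fit. The numbers below are made up (they stand in for the study data, which is not reproduced here); the point is that the closed-form least-squares estimates agree with a library fit.

```python
import numpy as np

# Hypothetical data standing in for the insulin-sensitivity example
rng = np.random.default_rng(1)
x = rng.uniform(17.0, 23.0, size=13)                 # e.g. %C20-22 values
y = 37.0 * x - 480.0 + rng.normal(0.0, 40.0, size=13)  # linear trend plus scatter

# Closed-form least-squares estimates of the slope (beta1) and intercept (beta0)
x_bar, y_bar = x.mean(), y.mean()
beta1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta0 = y_bar - beta1 * x_bar

# The same fit via numpy.polyfit, for comparison
slope, intercept = np.polyfit(x, y, deg=1)
print(beta1, beta0)
```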
Slide 12: Recap of models
The linear regression in this example gave an estimated slope of 37.2 and an estimated intercept of -486.5, so our estimated model is:
Insulin sensitivity = 37.2 × %C20-22 - 486.5 + ε
The model is not assumed to be perfect! It is simple, but powerful enough to draw some basic conclusions.
– Within the range of the data, an increase of one unit in %C20-22 results, on average, in an increase of 37.2 units in insulin sensitivity
Slide 13: Other types of model
We will look at other types of model in upcoming lectures:
– Multiple regression: more than one independent variable
– Logistic regression: a binary outcome variable, with one or more independent variables
– Proportional hazards regression: the outcome variable is a survival time, with one or more independent variables
Slide 14: Comparing models
In the linear regression example, we also computed a p-value.
– The null hypothesis was that the slope was zero
– i.e. we compared the model Y = β0 + β1 × X + ε to the model Y = β0 + ε
So we can think of this statistical test as a comparison between two models. In fact, we can think of most (perhaps all) statistical tests as comparisons between two models.
Slide 15: Hypothesis test of linear regression as a comparison of models
[Figure]
Slide 16: Why model comparison is not straightforward
It is not enough just to compare the "residuals" between two models.
– Remember that the residuals are the error terms in the model
– A model with more parameters will always come closer to the data
However, its confidence intervals will be wider, so the model will be less useful for predicting future values.
Slide 17: Comparing the models and R²

Hypothesis           Distance measured from  Sum of squares  Percentage of variation
Null                 Mean                    155,642.3       100%
Linear relationship  Straight line           63,361.37       40.7%
Difference           (Improvement)           92,280.93       59.3%

The total sum of squares of the distances of the points from the mean, i.e. the total variation, is 155,642.3. The sum of squares of the residuals is 63,361.37. The difference between these is 92,280.93, which is 59.3% of the total. So the linear model produces an improvement amounting to 59.3% of the total variation: this is the definition of R². Here R² = 0.593.
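The arithmetic on this slide is simple enough to verify directly from the sums of squares in the table:

```python
# Sums of squares from the table on this slide
ss_total = 155_642.3    # distances from the mean (null model)
ss_resid = 63_361.37    # distances from the regression line
ss_regression = ss_total - ss_resid   # the improvement: 92,280.93

# R-squared is the improvement as a fraction of the total variation
r_squared = ss_regression / ss_total
print(round(r_squared, 3))
```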
Slide 18: Interpreting the difference in variation
With a little algebra, you can show that the difference between the total sum of squares and the sum of the squared residuals equals the sum of the squared distances between the regression line and the mean. So the regression line "accounts for 59.3% of the variance".
Slide 19: Computing a p-value for model comparison
To compute a p-value for the comparison of models, we look at both the sum of squares and the degrees of freedom for each model.
– The number of degrees of freedom is the number of data points minus the number of parameters in the model
– We had 13 data points, so there are 12 degrees of freedom for the null-hypothesis model and 11 for the linear model
Slide 20: Mean squares and F-ratio

Source of variation  Sum of squares  Degrees of freedom  Mean squares  F-ratio
Regression           92,281          1                   92,281        16.0
Random               63,361          11                  5,760
Total                155,642         12

This is the same data presented in the format of an ANOVA (we will see this later). "Total" is the total variation in the data. "Random" is the variation of the data around the regression line. "Regression" is the difference between them: the sum of the squared distances from the regression line to the mean. The "mean squares" is the sum of squares divided by the degrees of freedom, and the F-ratio is the ratio of the mean squares.
Slide 21: Computing a p-value
The null hypothesis is that the "horizontal line model" is the correct model, i.e. that the slope in the regression model is zero. If the null hypothesis were true, the F-ratio would be close to 1 (this is not obvious!). Assuming the null hypothesis, the distribution of values of the F-ratio is a known distribution:
– called the F-distribution
– it depends on two different degrees of freedom
– so a p-value can be computed
The p-value in this example is p = 0.0021.
Slide 22: Recap
We re-examined the linear regression example and re-cast it as a comparison of statistical models.
– We can compute a p-value for the null hypothesis that the simpler model is "correct" ("as correct as the more complex model")
– This is the same p-value we computed before
– The R² value is the proportion of variance "explained by" the regression
We can do the same for other statistical tests!
Slide 23: A t-test considered as a comparison of models
Recall the GRHL2 expression in Basal-A and Basal-B cancer cells. We can re-cast this comparison as a linear regression:
– Let x = 0 for Basal A cells and x = 1 for Basal B cells
– Our linear model is Expression = β0 + β1 × x + ε, with the null hypothesis Expression = β0 + ε
What is β1?
– The slope: the increase in expression for an increase of one unit in x
– = the difference in expression between Basal A and Basal B
– = the difference in means…
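A sketch of this equivalence with hypothetical expression values (the group sizes and numbers below are made up, not the GRHL2 data): coding the group as a 0/1 variable and fitting a least-squares line recovers the group means exactly.

```python
import numpy as np

# Hypothetical expression values for two groups (standing in for Basal A / Basal B)
rng = np.random.default_rng(2)
basal_a = rng.normal(1.9, 0.6, size=14)
basal_b = rng.normal(0.1, 0.6, size=13)

# Code the group as the independent variable: x = 0 (Basal A), x = 1 (Basal B)
x = np.concatenate([np.zeros(len(basal_a)), np.ones(len(basal_b))])
y = np.concatenate([basal_a, basal_b])
beta1, beta0 = np.polyfit(x, y, deg=1)

# Intercept = Basal A mean; slope = difference between the group means
print(beta0, beta1)
```

With only two distinct x values, the least-squares line always passes through the two group means, which is why the slope is exactly the difference in means.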
Slide 24: t-test as a comparison of models
[Figure]
Slide 25: Results of running the t-test as a comparison of models
Running the linear regression gives an estimated intercept of 1.933 and an estimated slope of -1.861. The table of variances is:

Model       Sum of squares  DF  Mean squares  F-ratio
Regression  23.335          1   23.335        55.993
Residual    10.419          25  0.417
Total       33.753          26
Slide 26: Interpreting the table of variances
The total sum of squares (33.753) is the sum of the squared differences between each value and the overall mean.
– Divided by its df, it gives the sample variance: 33.753/26 = 1.298
The residual sum of squares is the sum of the squared differences between each expression value and its predicted value.
– The predicted value is just the mean for its basal type
– This is the "within-group" variation
The regression sum of squares is the sum of the squared differences between the predicted values and the overall mean.
– This is the sum of the squared differences between the group means and the overall mean, with one squared difference for each data point
These interpretations will be really useful to consider when we study ANOVA.
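The three sums of squares described on this slide satisfy an exact decomposition (total = within + between), which can be checked numerically with hypothetical two-group data:

```python
import numpy as np

# Hypothetical two-group data (standing in for the two basal types)
rng = np.random.default_rng(3)
g1 = rng.normal(1.9, 0.6, size=14)
g2 = rng.normal(0.1, 0.6, size=13)
all_vals = np.concatenate([g1, g2])
grand_mean = all_vals.mean()

# Total: squared distances of every value from the overall mean
ss_total = np.sum((all_vals - grand_mean) ** 2)
# Within ("residual"): squared distances of each value from its own group mean
ss_within = np.sum((g1 - g1.mean()) ** 2) + np.sum((g2 - g2.mean()) ** 2)
# Between ("regression"): squared distances of group means from the overall mean,
# counted once per data point
ss_between = len(g1) * (g1.mean() - grand_mean) ** 2 + len(g2) * (g2.mean() - grand_mean) ** 2

print(ss_total, ss_within + ss_between)  # the decomposition is exact
```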