Modeling in R Sanna Härkönen.

Model fitting: simple linear model
Important measures:
- Correlation r
- Coefficient of determination R2
- p-values
- Residuals (examining their distribution)

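PEARSON CORRELATION r
- Measures the linear relationship between two variables
- Even if the correlation is low, there can still be a strong (non-linear) relationship between the variables
- Can be positive or negative depending on the relationship (range -1..1)
- Equation: r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \, \sum_i (y_i - \bar{y})^2}}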
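A minimal sketch of the two points above, using made-up vectors (the data are hypothetical, not from the slides): the standard Pearson formula computed by hand agrees with R's `cor()`, and a perfectly quadratic relationship can still give a correlation of about zero.

```r
# Hypothetical data with a roughly linear relationship
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Pearson correlation from its definition
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

# Built-in equivalent (method = "pearson" is the default)
r_builtin <- cor(x, y)
print(c(manual = r_manual, builtin = r_builtin))

# Strong but non-linear relationship: y2 is fully determined by x2,
# yet the linear correlation is essentially zero
x2 <- seq(-3, 3, by = 0.5)
y2 <- x2^2
r_nonlinear <- cor(x2, y2)
print(r_nonlinear)
```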

EXAMPLE: SAME CORRELATION (0.816), BUT DIFFERENT RELATIONSHIPS
[Figure: scatterplots of datasets with the same correlation; a linear fit is appropriate for only one of them.]
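The slide's figure is not preserved here. The correlation of 0.816 matches Anscombe's quartet, so the sketch below assumes that is what was shown (this is an assumption, not stated in the slides). The quartet ships with R as the data frame `anscombe`.

```r
# Anscombe's quartet: columns x1..x4 and y1..y4 in the built-in data frame
data(anscombe)

# Pearson correlation for each of the four x/y pairs
r <- sapply(1:4, function(i) {
  cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]])
})
print(round(r, 3))

# Plot the four datasets with their least-squares lines for comparison
op <- par(mfrow = c(2, 2))
for (i in 1:4) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  plot(x, y, main = paste("Dataset", i))
  abline(lm(y ~ x))
}
par(op)
```

The correlations are nearly identical, while the plots show that a straight line describes only one of the datasets well. This is why plotting comes before fitting.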

REGRESSION ANALYSIS Examining relationships of variables
- Dependent variable: the variable that is explained by the independent variable(s)
- Coefficient of determination R2 = r^2, where r is the correlation (this equality holds for simple linear regression with one independent variable)
- For example, if D is expressed as a function of H, then D is the dependent variable and H is the independent variable.

SIMPLE LINEAR REGRESSION
Fitting a linear regression line between two variables: y = β0 + β1*x + ε
- y is the dependent (= response) variable
- x is the independent (= predictor) variable
- β0 is the constant (intercept), β1 is the slope and ε is the random error
Method: least squares regression, where the regression line is fitted so that the sum of squared model residuals, Σ(measured y - modeled y)^2, is minimized.
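As a rough illustration of least squares, the sketch below computes the slope and intercept from the standard closed-form formulas and compares them with `lm()`. It also checks that R2 equals r^2 in the one-predictor case. The variables `H` and `D` are simulated, hypothetical values, not the course data.

```r
set.seed(1)
H <- runif(50, 5, 30)                  # hypothetical heights
D <- 2 + 1.1 * H + rnorm(50, sd = 3)   # hypothetical diameters with random error

# Closed-form least-squares estimates for y = b0 + b1 * x
b1 <- sum((H - mean(H)) * (D - mean(D))) / sum((H - mean(H))^2)
b0 <- mean(D) - b1 * mean(H)

# The same model fitted with lm()
fit <- lm(D ~ H)
print(coef(fit))
print(c(b0 = b0, b1 = b1))

# For simple linear regression, R2 is the squared correlation
r2_from_r <- cor(H, D)^2
r2_from_fit <- summary(fit)$r.squared
print(c(r2_from_r, r2_from_fit))
```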

INTERPRETATION: r and R2
Relationship strength by |r| and R2:
|r| 0.0 to 0.4 (R2 0.0 to 0.16): non-significant
|r| 0.4 to 0.6 (R2 0.16 to 0.36): moderate
|r| 0.6 to 0.8 (R2 0.36 to 0.64): remarkable
|r| 0.8 to 1 (R2 0.64 to 1): strong
Examples:
- H explains ~14% of the variation in D. Poor fit.
- H explains ~90% of the variation in D. Very good fit.

FITTING A SIMPLE LINEAR MODEL
1. Import the data to R (command read.csv()).
2. Examine summary statistics of your variables (summary() command in R).
3. Examine the relationships of the variables by plotting them (plot() command in R).
4. If you see a linear relationship between the dependent variable and the explanatory variables, you can fit a linear model.
5. If the relationship is not linear, you can first try to linearize it by transforming the variable(s) (e.g. logarithm, exponential, …) and then apply linear regression to the transformed values.
6. Fit the linear model in R: command lm(y~x), where y is the dependent and x is the independent variable.
7. Examine the results of the regression (significance of variables, R2 etc.) using the summary() command.
8. Examine the residuals.
[Figures: linear relationship; non-linear relationship of X and Y; linear relationship of X and exp(Y)]
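The steps above can be sketched end to end as follows. Since the course data file is not available here, the example simulates a hypothetical dataset, writes it to a temporary CSV and reads it back. The column names `X` and `Y`, the file path and the choice of a log transformation are all assumptions made for illustration.

```r
set.seed(42)

# Hypothetical data: Y grows exponentially with X
sim <- data.frame(X = runif(100, 1, 10))
sim$Y <- exp(0.5 + 0.3 * sim$X + rnorm(100, sd = 0.2))
csv_path <- tempfile(fileext = ".csv")
write.csv(sim, csv_path, row.names = FALSE)

# 1. Import the data
a <- read.csv(csv_path)

# 2. Summary statistics
print(summary(a))

# 3. Plot the relationship (curved here, so not linear)
plot(a$X, a$Y)

# 5. Linearize: log(Y) against X looks linear
plot(a$X, log(a$Y))

# 6. Fit the linear model on the transformed response
fit <- lm(log(Y) ~ X, data = a)

# 7. Examine the regression results
print(summary(fit))

# 8. Examine the residuals
plot(fitted(fit), residuals(fit))
abline(h = 0)
```

Predictions from this model are on the log scale and need to be back-transformed with `exp()` before being compared with the original `Y`.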

Summary statistics
Dataset "a": output of summary(a)

Plotting
plot(a$D, a$TOTAL_VOLUME)

plot(a$BA, a$TOTAL_VOLUME)
Need for linearizing??

R example: BUILDING LINEAR MODEL in R
Building linear model for basal area (BA1) as a function of height (H1)

RESULTS OF REGRESSION ANALYSIS : R
The summary() output of a fitted model contains:
- Summary statistics of the residuals (= original_y - modeled_y)
- Intercept and slope for the model -> Y = intercept + slope * X
- Standard errors of the estimates
- t-test values (estimate/SE) and their p-values: they express whether the variable is significant at a given significance level
- Degrees of freedom: sample size minus the number of estimated coefficients (n - 2 for simple linear regression)
- F-test value and its p-value: they express whether the independent variables in the model are capable of explaining the dependent variable
- R-squared: R2
- Adjusted R-squared: takes into account the number of variables in the model. It is used when comparing regression models with different numbers of variables.
- Residual standard error: sqrt(sum((mod_y - orig_y)^2) / (n - 2))
How to interpret the p-value:
- < 0.01: very significant (significant at the 1% level)
- < 0.05: significant (significant at the 5% level)
- > 0.05: not significant
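The quantities listed above can also be pulled out of the summary object directly. The sketch below uses simulated, hypothetical data and recomputes the t-values and the residual standard error by hand to show where they come from.

```r
set.seed(7)
n <- 40
x <- runif(n, 0, 20)
y <- 3 + 0.8 * x + rnorm(n, sd = 2)    # hypothetical data
fit <- lm(y ~ x)
s <- summary(fit)

# Coefficient table: Estimate, Std. Error, t value, Pr(>|t|)
print(s$coefficients)

# t value = estimate / standard error
t_manual <- s$coefficients[, "Estimate"] / s$coefficients[, "Std. Error"]

# Residual degrees of freedom: n minus the two estimated coefficients
print(df.residual(fit))

# Residual standard error by hand, compared with the reported value
rse_manual <- sqrt(sum(residuals(fit)^2) / (n - 2))
print(c(manual = rse_manual, reported = s$sigma))

# R2, adjusted R2 and the F-test with its p-value
print(c(r2 = s$r.squared, adj_r2 = s$adj.r.squared))
f <- s$fstatistic
f_pvalue <- pf(f[["value"]], f[["numdf"]], f[["dendf"]], lower.tail = FALSE)
print(f_pvalue)
```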

Residuals Important to check after model fitting
Residuals : measured Y – modeled Y

Interpreting Residual Plots
[Figure panels, from [1]:]
- Residuals should look like this
- Variable transformation required
- Outliers
- Non-constant variance and outliers
- Variable Xj should be included in the model
[1] Vanclay, J. "Modelling Forest Growth and Yield. Application to Mixed Tropical Forests". CAB International. Via Blas Mola's slides.

Residuals: Y_measured – Y_Modeled
If the model is good, the residuals should
- be homoscedastic, i.e. show no trend with x
- follow a normal distribution
The R command plot(your_model) can be used for examining residuals (it calls the plot method for lm objects):
- Upper figure: residuals should be evenly distributed around the 0-line. In the example figure, however, there seems to be a decreasing trend in the residuals -> not good.
- Lower figure: all the residuals would lie on the straight line if they followed a normal distribution -> in the example figure they do not seem to follow a normal distribution completely.
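A small sketch of these residual checks, on simulated data that are deliberately curved so that a straight-line model leaves a visible pattern. The data and the quadratic alternative are assumptions chosen for illustration.

```r
set.seed(3)
x <- runif(80, 1, 10)
y <- 2 + 0.5 * x^2 + rnorm(80, sd = 2)   # hypothetical, truly curved relationship

fit_linear <- lm(y ~ x)                  # misspecified on purpose
res <- residuals(fit_linear)

# Diagnostic plots for an lm object:
# which = 1 is residuals vs fitted, which = 2 is the normal Q-Q plot
op <- par(mfrow = c(2, 1))
plot(fit_linear, which = 1)
plot(fit_linear, which = 2)
par(op)

# Refit with the transformed predictor; the residual pattern should disappear
fit_quad <- lm(y ~ I(x^2))
plot(fitted(fit_quad), residuals(fit_quad))
abline(h = 0)

print(c(linear = summary(fit_linear)$r.squared,
        quadratic = summary(fit_quad)$r.squared))
```

Least-squares residuals always average to zero when the model has an intercept, so the mean alone says nothing about the fit. The shape of the residual plot is what reveals a missing transformation.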

EXAMPLE

Exercises in GROUPS: which is the best model? Which is the worst? WHY?

R examples
- Multiple regression: lm(volume ~ height + diameter + basal_area)
- Using dummy variables for categorical predictors (e.g. species, forest type): lm(volume ~ height + factor(tree_species))
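The two calls above can be tried on a simulated data frame. Everything in the sketch below (the data frame `trees`, its values and the species names) is hypothetical. Adjusted R2 is used to compare models with different numbers of variables.

```r
set.seed(11)
n <- 120

# Hypothetical data, for illustration only
trees <- data.frame(
  height = runif(n, 8, 30),
  diameter = runif(n, 10, 45),
  basal_area = runif(n, 10, 40),
  tree_species = sample(c("pine", "spruce", "birch"), n, replace = TRUE)
)
species_shift <- c(pine = 0, spruce = 15, birch = -10)
trees$volume <- 5 + 4 * trees$height + 2 * trees$diameter +
  3 * trees$basal_area +
  unname(species_shift[as.character(trees$tree_species)]) +
  rnorm(n, sd = 10)

# One predictor, several predictors, and predictors plus a categorical variable
m1 <- lm(volume ~ height, data = trees)
m2 <- lm(volume ~ height + diameter + basal_area, data = trees)
m3 <- lm(volume ~ height + diameter + basal_area + factor(tree_species),
         data = trees)

# Compare the models by adjusted R2
adj <- sapply(list(m1 = m1, m2 = m2, m3 = m3),
              function(m) summary(m)$adj.r.squared)
print(round(adj, 3))
print(names(coef(m3)))
```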

Total volume as function of H

TOTAL VOLUME as function of H and BA

Total volume as function of H, BA and forest type (dummy)
Interpretation of the output when a dummy variable is used:
- Forest types 1-7 are present. Forest type 1 is the "base" category (no extra term).
- If the forest type is 2, the dummy variable factor(a$FOREST_TYPE)2 takes the value 1 and is multiplied by its estimate, so that estimate is added to the prediction. All other forest type dummies are then 0.
- The same logic applies to the other forest types.
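The dummy-variable logic can be checked by building a prediction by hand and comparing it with `predict()`. The sketch uses simulated, hypothetical data and leaves out BA for brevity. Because the model is fitted with `data = a`, the coefficient is named `factor(FOREST_TYPE)2` rather than `factor(a$FOREST_TYPE)2`.

```r
set.seed(5)

# Hypothetical data with seven forest types
a <- data.frame(
  H = runif(70, 5, 30),
  FOREST_TYPE = rep(1:7, each = 10)
)
type_effect <- c(0, 20, -5, 10, 0, 15, -10)   # made-up effects, type 1 is the base
a$TOTAL_VOLUME <- 10 + 6 * a$H + type_effect[a$FOREST_TYPE] + rnorm(70, sd = 5)

fit <- lm(TOTAL_VOLUME ~ H + factor(FOREST_TYPE), data = a)
b <- coef(fit)
print(names(b))   # intercept, H, and dummies for types 2..7 (type 1 is the base)

# Forest type 2 at H = 20: its dummy is 1, all other dummies are 0
pred_type2_manual <- b[["(Intercept)"]] + b[["H"]] * 20 +
  b[["factor(FOREST_TYPE)2"]] * 1
pred_type2 <- predict(fit, newdata = data.frame(H = 20, FOREST_TYPE = 2))

# Base category (forest type 1): all dummies are 0
pred_base_manual <- b[["(Intercept)"]] + b[["H"]] * 20
pred_base <- predict(fit, newdata = data.frame(H = 20, FOREST_TYPE = 1))

print(c(type2 = pred_type2_manual, base = pred_base_manual))
```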

Interpret these R summaries of the model fits.
- Write down the equations (y = a + b*x) of both models.
- Which model is better?
- Are the intercept and slope significant in both models?
- Are both models capable of estimating the desired variable?
- What else would you need to check when considering the goodness of the model?
