Lecture 4 - Model Formulation

Lecture 4. Model Formulation and Choice of Functional Forms: Translating Your Ideas into Models
C. D. Canham

Topics

Alternate models as multiple working hypotheses. Null models. Choice of functional forms.

The triangle of statistical inference
Figure: the triangle of statistical inference, labeled Data, Inference, Probability Model, and Scientific Model* (hypothesis).

Whenever you frame a research question you have, implicitly (in traditional parametric statistics) or explicitly (in the likelihood approach), three cornerstones around which you will build your research: a scientific model, a probability model, and your data. Let's spend a little time discussing what each of these corners is and is not. The first corner, the scientific model, is what traditional parametric statistics calls your hypothesis. Hilborn and Mangel discuss the distinction between model and hypothesis in detail (read Chapter 2 as an assignment). Models are SPECIFIC HYPOTHESES. In fact, we would like you to forget the word hypothesis altogether and to think about the world in terms of SPECIFIC MODELS. We hope that by the end of two weeks you will realize how much richer and more productive this approach is than traditional hypothesis testing. (Mention also Nester's article.)

*All hypotheses can be expressed as models!

The Scientific Method

“Science is a process for learning about nature in which competing ideas are measured against observations.” Feynman 1965

Scientific Process

Devise alternative hypotheses. Devise experiment(s) with alternative possible outcomes. Carry out experiments. Recycle procedure. -- Platt 1964 (Strong inference)

The Popperian view of science: DEDUCTION vs. INDUCTION. Induction (Bacon): use data to come up with an explanation; if the data do not support it, discard the theory or model. Deduction (Popper): grand theories guide the kinds of data we collect. In fact, theories are never accepted or rejected; rather, we use them or not, depending on the amount of evidence that exists for or against them. But this is time consuming and not very useful for many questions…

The method of multiple working hypotheses
“It differs from the simple working hypothesis in that it distributes the effort and divides the affection.”
“Bring up into review every rational explanation of the phenomenon in hand and to develop every tenable hypothesis relative to its nature.”
“Some of the hypotheses have already been proposed and used while others are the investigator's own creations.”
“An adequate explanation often involves the coordination of several causes.”
“When faithfully followed for a sufficient time it develops the habit of parallel or complex thought.”
“The power of viewing phenomena analytically and synthetically at the same time appears to be gained.”
-- T. C. Chamberlin, Science 15: 92.

T. C. Chamberlin: President of the University of Wisconsin; AAAS; geologist; founder of the Journal of Geology.

The Lakatosian view of science.

What is the best model to use?
This is the critical question in making valid inferences from data. Careful a priori consideration of alternative models will often require a major change in emphasis among scientists. Model specification is more difficult than the application of likelihood techniques.

Formulation of Candidate Models
Translating your qualitative ideas into a quantitative, algebraic model that can be tested against alternative models is conceptually difficult, subjective, and original and innovative. Models represent a scientific hypothesis. We are advocating a philosophy of science-based A PRIORI modeling. Hypothesis testing is a means of TESTING a model.

Where do models come from?
Scientific literature. Results of manipulative experiments. Personal experience. Scientific debate. Natural resource management questions. Monitoring programs. Judicial hearings. Then of course there is also thinking!! (About the literature.)

Are models truth?

Truth has infinite dimensions. Sample data are finite. Models should provide a good approximation to the data. Larger data sets will support more complex approximations to reality. When modeling populations, for instance, we often assume that there is a single parameter that FITS the data: we may assume an overall population survival rate even when it is clear that individuals differ in their survival probabilities.

Model selection is implicit in science

“...empiricism, like theory, is based on a series of simplifying assumptions… By choosing what to measure and what to ignore, an empiricist is making as many assumptions as does any theoretician.” David Tilman

Develop a set of a priori candidate models
Include a global model that includes all potentially relevant effects. Test the global model (R-square, goodness-of-fit tests). Develop alternative, simpler models. We are advocating a philosophy of science-based A PRIORI modeling. Hypothesis testing is a means of TESTING a model. Avoid collecting a lot of data about a particular question and then sorting out what looks important or relevant. This approach will produce many spurious correlations in the data and lead to invalid inference.

Assessing alternative models
How well does the model approximate “truth” relative to its competitors? (High accuracy, or low bias.) How repeatable is the prediction of a model relative to its competitors? (High precision, or low variance.)

Why do model selection at all? Principle of parsimony
Figure: squared bias (Bias²) and variance plotted against the number of parameters, from few to many.

Overfitted models have parameter estimates with very large variances. Underfitted models introduce bias because some effects that are not accounted for may be reflected in one of the included parameters, particularly if parameters are correlated. MUST BALANCE OVERFITTING AND UNDERFITTING.

Principle of parsimony applied to model selection
We typically penalize added complexity. A more complex model has to exceed a certain threshold of improvement over a simpler model. Added complexity usually makes a model more unstable. Complex models spread the data too thinly over parameters. Model selection is not about whether something is true or not but about whether we have enough information to characterize it properly.

Reality: Actual data

Suppose that this is truth. We generate data from this model using 3 parameter values: –0.3, 1, and 0.01 (the range of values for the error addition). Generate 10 data sets, each with 21 observations, from this model. Example from Burnham and Anderson.

A set of candidate models

We try to approximate this model using polynomial functions. Fit three models: a simple linear model, a quadratic model, and a fifth-order polynomial.

Too simple: High bias (low accuracy)

UNDERFITTING!! If we approximate with a first-degree polynomial, we underfit. The model is highly biased and not accurate: we are not predicting the value of Y correctly for many of the x's.

Too complicated: High variance (low precision)

OVERFITTING!! If we fit the model using a 5th-degree polynomial, the model is not biased (unlike before, when it tended to overpredict for many of the values of x) and is fairly accurate, but it has high variance: depending on the data set, the values of Y are all over the place. The approximation was built using bootstrapped samples of the original dataset. That is, the value is not precise.

The compromise: a parsimonious model

REASONABLE FIT. If we get it just right and use a 2nd-degree polynomial to approximate the function, we get relatively low bias and relatively high precision.
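The under- and overfitting comparison above can be reproduced numerically. The sketch below is only an illustration, not the Burnham and Anderson example: the "true" curve, the noise level, and the x range are assumptions chosen for convenience. Only the design (10 data sets of 21 points each, polynomials of degree 1, 2, and 5) follows the slides.

```python
import numpy as np
from numpy.polynomial import Polynomial


def true_curve(x):
    # Hypothetical "truth"; an assumption for illustration only.
    return x * np.exp(-0.3 * x)


def bias_variance(degree, n_sets=10, n_points=21, sigma=0.2, seed=1):
    """Fit a polynomial of the given degree to n_sets simulated data sets.

    Returns (mean squared bias, mean variance) of the fitted curves,
    averaged over the x values.
    """
    rng = np.random.default_rng(seed)
    x = np.linspace(0.0, 10.0, n_points)
    truth = true_curve(x)
    fits = np.empty((n_sets, n_points))
    for i in range(n_sets):
        y = truth + rng.normal(0.0, sigma, n_points)  # truth plus noise
        fits[i] = Polynomial.fit(x, y, degree)(x)     # least-squares fit
    mean_fit = fits.mean(axis=0)
    bias_sq = np.mean((mean_fit - truth) ** 2)        # accuracy
    variance = np.mean(fits.var(axis=0, ddof=1))      # precision
    return bias_sq, variance


for degree in (1, 2, 5):
    b2, var = bias_variance(degree)
    print(f"degree {degree}: bias^2 = {b2:.4f}, variance = {var:.4f}")
```

Because the same seed (and therefore the same simulated data sets) is reused for every degree, the differences between rows reflect the models rather than the random draws.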

Null Models

Parametric methods advocate testing hypotheses against a null expectation (H0). Often the null is probably false simply on a priori grounds (e.g., that the parameter θ has no effect). In likelihood terms this usually means the null model is the one that sets the value of parameter θ equal to 0 or 1.

States of mind of a null hypothesis tester
Table: the practical importance of the observed difference (not important, important) crossed with the statistical significance of the observed difference (not significant, significant).

Important but not significant: This is bad! Type II error. “If only I had had more money…”
Not important but significant: Type I error (false positive).

Model Selection Methods
Adjusted R-square. Likelihood ratio tests. Akaike's Information Criterion. We will talk about these topics later…
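As a preview of how a complexity penalty works, here is a minimal sketch of Akaike's Information Criterion for least-squares fits with normally distributed errors, where AIC = n ln(RSS/n) + 2k up to a constant shared by all models fitted to the same data. The function name and the residual sums of squares are hypothetical values made up for illustration; they are not results from the lecture.

```python
import math


def aic_least_squares(rss, n, n_coefficients):
    """AIC for a least-squares fit with normal errors (additive constant dropped).

    k counts the regression coefficients plus one for the error variance.
    """
    k = n_coefficients + 1
    return n * math.log(rss / n) + 2 * k


# Hypothetical (coefficients, residual sum of squares) for three candidates
candidates = {"linear": (2, 4.10), "quadratic": (3, 1.05), "quintic": (6, 0.98)}
n = 21  # observations per data set, as in the polynomial example

scores = {name: aic_least_squares(rss, n, p) for name, (p, rss) in candidates.items()}
best = min(scores, key=scores.get)  # lowest AIC is preferred
print(scores, best)
```

With these made-up numbers the quintic fits slightly better than the quadratic, but not by enough to pay for its three extra parameters.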

Choice of Functional Forms
Model formulation requires the specification of a functional form that formalizes the relationship between the predictive variables and the process we are trying to understand. The functional form should clarify the verbal description of the mechanisms driving the process under study. Choosing a functional form is a skill that needs to be developed over time.

Choice of Functional Forms: Mechanism vs. phenomenology
Mechanistic: based on some biological or ecological model. Phenomenological: functions that fit the data well or are simple/convenient to use.

Choice of functional forms: What matters?
Does it represent what happens in your model? Does the shape of the function resemble actual data? Is the range of data desired delivered by this function? Does the function allow for ready variation of the aspects of the question that the researcher wants to explore? What happens at either end (as x → 0 and x → ∞)? What happens in the middle? Critical points (maxima, minima).

Model Functions Vs. Probability Density Functions
Properties of pdf's. Figure: Prob(x) plotted against x.

Some useful functions (not necessarily pdf’s!)
Exponential. Weibull. Logistic. Lognormal. Power. Generalized Poisson. Logarithmic.

Exponential

Other kinds of functions, such as polynomials, are important mathematically, but in general we avoid them because they are not bounded: they tend to infinity (or minus infinity) as x gets larger.

Exponential: Decline in maximum potential growth as a function of crowding

Figure: effect on growth (growth multiplier, with a maximum of 1) plotted against NCI (Neighborhood Crowding Index) for Species A and Species B. (Exponential: switch to Excel.)
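A minimal sketch of such a crowding effect, assuming the simplest negative exponential form, multiplier = exp(-c × NCI). The equation on the slide is not preserved in this transcript, so the form, the function name, and the decay parameters below are assumptions for illustration.

```python
import math


def crowding_multiplier(nci, c):
    """Growth multiplier under crowding: 1 at NCI = 0, declining toward 0.

    c (> 0) controls how steeply growth declines with crowding.
    """
    return math.exp(-c * nci)


# Hypothetical decay parameters for two species
species_c = {"Species A": 0.2, "Species B": 0.8}
for name, c in species_c.items():
    curve = [round(crowding_multiplier(nci, c), 3) for nci in (0, 1, 2, 5)]
    print(name, curve)
```

The multiplier is bounded between 0 and 1 for any non-negative NCI, which is the boundedness that polynomials lack.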

Michaelis-Menten function
S is the slope of the increment. A is the asymptote. Figure: two example curves, one with a = 1.43, s = 0.76 and one with a = 1.63, s = 0.31.
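The slide's equation is not preserved in this transcript. A common parameterization that matches the description (a is the asymptote, s is the initial slope) is y = a·x / (a/s + x); the sketch below assumes that form and uses the two parameter pairs shown on the slide.

```python
def michaelis_menten(x, a, s):
    """Saturating response: slope s near x = 0, asymptote a as x grows.

    Assumed form: y = a * x / (a / s + x).
    """
    return a * x / (a / s + x)


# Parameter pairs from the slide's two example curves
for a, s in ((1.43, 0.76), (1.63, 0.31)):
    values = [round(michaelis_menten(x, a, s), 3) for x in (0, 1, 5, 20, 100)]
    print(f"a = {a}, s = {s}: {values}")
```

Writing the half-saturation constant as a/s makes both parameters directly interpretable from a plot: one sets the ceiling, the other the steepness at the origin.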

Weibull function

The exponential is a special case of the Weibull function (β = 1).

Weibull Example: Dispersal functions
The first part of the equation predicts the total number of recruits produced by a parent tree of 30 cm DBH. The exponent beta converts the diameter of trees of different sizes to reproductive output. The second part of the equation describes the mean density of recruits to be found in a 1-m square quadrat centered at a given distance from the parent tree. Does it describe the proportion of seeds that is to be found as we move away from the focal tree? D determines the rapidity of the decline as we move away from the parent tree. Low theta has the same effect. Methods (Ribbens et al., Ecology 75(6)): in fact, theta and D tended to trade off, so theta was fixed at 3. STR and beta also traded off, so beta was fixed at 2.
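The two-part structure described above can be sketched as follows. The equation itself is missing from this transcript, so the form used here, R = STR·(DBH/30)^β · (1/n)·exp(−D·m^θ), is an assumption reconstructed from the description, with n the integral of the kernel over the plane so that the kernel sums to 1. The defaults theta = 3 and beta = 2 are the fixed values reported by Ribbens et al.; the other parameter values are hypothetical.

```python
import math


def weibull_normalizer(D, theta):
    """Integral of exp(-D * m**theta) over the plane (polar coordinates):
    n = 2 * pi * Gamma(2 / theta) / (theta * D ** (2 / theta))."""
    return 2.0 * math.pi * math.gamma(2.0 / theta) / (theta * D ** (2.0 / theta))


def recruit_density(distance, dbh, STR, D, beta=2.0, theta=3.0):
    """Expected recruits per square meter at `distance` (m) from a parent
    tree of diameter `dbh` (cm). STR is the total for a 30 cm DBH tree."""
    total_recruits = STR * (dbh / 30.0) ** beta          # first part: fecundity
    kernel = math.exp(-D * distance ** theta) / weibull_normalizer(D, theta)
    return total_recruits * kernel                       # second part: dispersal


# Hypothetical parameter values, for illustration only
for m in (1, 5, 10, 20):
    print(m, recruit_density(m, dbh=45.0, STR=100.0, D=1e-3))
```

Because the kernel is normalized, STR scales the height of the curve while D and theta control only its shape, which separates "how many recruits" from "where they land".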

Logistic

Figure: Y (0 to 1) plotted against X (0 to 10) for two parameter sets, a = -2, b = 1 and a = 2, b = -1.
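A minimal sketch of a logistic curve, assuming the form y = exp(a + b·x) / (1 + exp(a + b·x)). The slide's equation is not preserved, so this form is an assumption; it does give one rising and one falling curve for the two parameter sets shown.

```python
import math


def logistic(x, a, b):
    """Logistic curve bounded between 0 and 1.

    Assumed form: exp(a + b*x) / (1 + exp(a + b*x)), written in two
    branches so that exp() is only called on non-positive arguments.
    """
    z = a + b * x
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


# The two parameter sets from the slide
for a, b in ((-2, 1), (2, -1)):
    print(f"a = {a}, b = {b}:", [round(logistic(x, a, b), 3) for x in range(0, 11, 2)])
```

The sign of b sets the direction of the curve, and the midpoint (y = 0.5) falls at x = -a/b, here x = 2 for both curves.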

Logistic: Probability of mortality as a function of storm severity
Canham et al. 2001

Lognormal

Lognormal: Leaf litterfall as a function of distance to the parent tree

Scaling issues: if you want the number to be a proportion, you have to scale the modifier to range from 0 to 1. In this case, for instance, we first try to figure out how much litter a tree produces, and then we want to know how much of that litter is spread at different distances from the parent tree. Essentially, you can take the same approach when modeling dispersal: a tree produces a given number of seeds, so what proportion of this total is spread at a certain radius from the parent tree? Data from GMF, CT.

Lognormal: Growth as a function of DBH
Figure: maximum potential growth (cm/yr) plotted against DBH (cm) for 11 species (CASARB, DACEXC, MANBID, INGLAU, SLOBER, CECSCH, TABHET, GUAGUI, ALCLAT, SCHMOR, BUCTET).

Another application of the lognormal is the modifier for the growth equation. We assume that a tree has a maximum potential growth, and then we modify that growth for a given size. At what size will the maximum growth occur? How fast does growth increase to, or decrease from, that maximum? The choice of form should also consider the range of values that you will analyze. For instance, if you are interested in understanding growth or survival only for saplings (1-2 cm DBH), you would not use this function. Rather, you would assume that all trees have the same potential maximum growth rate regardless of size, or that the relationship between growth and DBH is linear. In any case, you would consult some papers and draw your own conclusions about which functions may work best under your particular scenario. Data from LFDP, Puerto Rico.
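The two questions above (where the peak is, and how quickly growth changes around it) map onto two parameters of a lognormal-shaped curve. The sketch below assumes the form y = a·exp(−0.5·(ln(x/x0)/xb)²); the slide's equation is not preserved, and the parameter names and values are hypothetical.

```python
import math


def lognormal_growth(dbh, max_growth, x0, xb):
    """Lognormal-shaped size effect on growth.

    max_growth: height of the peak (maximum potential growth)
    x0:         DBH (cm) at which the peak occurs
    xb:         breadth of the curve; larger values give a flatter curve
    """
    return max_growth * math.exp(-0.5 * (math.log(dbh / x0) / xb) ** 2)


# Hypothetical species: peak growth of 1.2 cm/yr at 20 cm DBH
for dbh in (2, 10, 20, 40, 100):
    print(dbh, round(lognormal_growth(dbh, max_growth=1.2, x0=20.0, xb=0.8), 3))
```

The curve is symmetric on a log scale, so halving and doubling the peak DBH give the same growth. The function is defined only for DBH > 0.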

Power function: small mammal distribution as a function of canopy tree neighborhood

What do these parameters represent? A is the height of the curve: what is the maximum number of small mammals caught? B is the width of the curve: what is the habitat breadth of the mammals across the different habitats? C is the mode of the curve: where along the community ordination axis does the maximum number of captures occur? Schnurr et al.

Parameter trade-offs: More than one way to get there….
Trade-off? Investigate this with the seed dispersal dataset. Figure axis: NCI (Neighborhood Crowding Index).

Things to keep in mind

Scaling issues: pay attention to units, scales, and conversions. Multiplicative functions and parameter trade-offs. Computational issues: large exponent values, division by zero, logs of negative numbers.
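The three computational issues listed above tend to appear when an optimizer tries extreme parameter values. One possible way to guard against them is sketched below; the helper names and threshold values are assumptions for illustration, and clamping is only one of several reasonable strategies (rejecting the parameter value is another).

```python
import math

MAX_EXPONENT = 700.0  # math.exp overflows slightly above 709


def safe_exp(z):
    """exp(z) with the exponent clamped to avoid OverflowError."""
    return math.exp(max(-MAX_EXPONENT, min(MAX_EXPONENT, z)))


def safe_log(x, floor=1e-300):
    """log(x) with zero or negative arguments replaced by a tiny positive floor."""
    return math.log(max(x, floor))


def safe_div(numerator, denominator, eps=1e-12):
    """Division that replaces a (near-)zero denominator with +/- eps."""
    if abs(denominator) < eps:
        denominator = eps if denominator >= 0 else -eps
    return numerator / denominator


print(safe_exp(1e6), safe_log(-1.0), safe_div(1.0, 0.0))
```

Clamping keeps a likelihood evaluation finite so a search can continue, but a fit that ends up at a clamped value deserves a second look.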

Some useful references
Catalog of curves for curve fitting. British Columbia Ministry of Forests.
Abramowitz, M. and I. Stegun. Handbook of Mathematical Functions.
McGill, B. “Strong and weak tests of macroecological theory”. Oikos.
Vanclay, J. “Growth models for tropical forests: a synthesis of models and methods”. Forest Science.
