Lecture 22 Spatial Modelling 1 : Incorporating spatial modelling in a random effects structure
Lecture Contents: Introduction to spatial modelling; Nested random effect levels; House price dataset; Including distance as a fixed effect; Direction effects; Focused clustering (Falkirk dataset).
Spatial statistical modelling Here we require a statistical approach that accounts for the spatial location at which a response is collected. This means that the model fitted to the data needs to account for spatial effects. The aim may be to adjust for any effects due to location, or to predict values of the response at other locations via some form of interpolation that accounts for other predictor variables, the spatial location, or both.
Types of spatial data There are many forms of spatial data but we can broadly divide these into three types (Cressie 1993): 1. Geostatistical data: measurements are taken at a fixed number of chosen locations in a geographical area. 2. Lattice data: measurements are taken on a regular lattice, with one measurement collected at each point of the lattice. 3. Point process data: each observation is the location of a response, recorded by its co-ordinates.
Geostatistical data Such data are collected in various fields, particularly mining and earth sciences. A measurement, e.g. the percentage of coal ash, is taken at each of a number of locations. Methods such as variograms and spatial Kriging are used to analyse such data. Other application areas include weather maps and agricultural field trials. Note that such data are not ideally suited to standard random effect modelling.
Disease mapping One particular type of spatial modelling that is often linked with random effect modelling is disease mapping. Here cases of a disease (either human or animal) are observed over a chosen region e.g. a country. We then wish to infer the relative risk of the disease for a particular individual at a particular location based on the data collected. Both our practicals this afternoon will consider disease mapping datasets. The other two types of spatial data relate to disease data.
Lattice Data Such data is common in many fields, for example image analysis where the pixels in an image are found on a regular rectangular lattice. More importantly we will consider disease count data where counts of a disease are recorded for contiguous regions on a map. Although a map is not regular we can construct a lattice from a map by identifying neighbouring regions and linking neighbouring regions to form a lattice.
Example Here we see a map of 5 regions in the left hand picture, and on the right it has been converted to a lattice with connections between regions that share boundaries.
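The conversion from map to lattice can be sketched in a few lines. This is a minimal illustration only: the 5-region layout below is invented (it is not the map on the slide), the map is assumed to be rasterised onto a grid of region labels, and build_lattice is a hypothetical helper name.

```python
# Hypothetical map of 5 regions (labels 1-5) rasterised onto a small grid.
# This layout is invented for illustration; it is not the map from the slide.
region_grid = [
    [1, 1, 2, 2],
    [1, 3, 3, 2],
    [4, 3, 3, 5],
    [4, 4, 5, 5],
]


def build_lattice(grid):
    """Return {region: set of neighbouring regions} for regions sharing an edge."""
    neighbours = {label: set() for row in grid for label in row}
    n_rows, n_cols = len(grid), len(grid[0])
    for r in range(n_rows):
        for c in range(n_cols):
            here = grid[r][c]
            # Look right and down only; every shared cell edge is then visited once.
            for dr, dc in ((0, 1), (1, 0)):
                rr, cc = r + dr, c + dc
                if rr < n_rows and cc < n_cols and grid[rr][cc] != here:
                    neighbours[here].add(grid[rr][cc])
                    neighbours[grid[rr][cc]].add(here)
    return neighbours


lattice = build_lattice(region_grid)
for region in sorted(lattice):
    print(region, sorted(lattice[region]))
```

Regions that only touch at a corner are not linked here; whether corner contact counts as a shared boundary is a modelling choice.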
Point process data This data is also commonly found in disease mapping, although it may arise in any application where cases of an event are seen at particular locations. Each item of data consists of the location of an event, the response (type of event) and potentially predictor variables for the event. Note Rasmus has worked more extensively in this area and will be happy to answer questions here.
Disease point process modelling In disease mapping our data is typically binary, i.e. people are infected by (or die from) a disease or are not. The data occur in point process form but there are 2 problems with analysing them as a point process: 1. All our responses are 1, as we only observe the infected/dead people! 2. Due to confidentiality and the sensitive nature of medical data, the data often cannot be released as individual records. To counter point 1 we could sample control cases at random from the population. However, point 2 means that we typically total up cases for fixed areas and use a Poisson model on the lattice data that this creates.
Why might there be spatial effects? This depends on the response variable and application area. It is possible that geography is itself a predictor for our response or is a surrogate for other factors. Many factors that might influence the response can be linked to location, e.g. weather, deprivation, altitude, pollution, wealth. So if our response is influenced by any of these factors then accounting for spatial effects may improve our model.
Nested random effects/ levels of geography The simplest link to random effect models is to consider nested random effects. We have considered pupils nested in schools and cows nested in herds. In some sense the schools and herds are spatial units in that schools generally take children from their locality and a herd is based on a particular farm. However we could also fit where the pupils live as another classification of the data which is more spatial. On the next slide we consider a dataset with more levels of geography.
UK house prices dataset An MMath student of mine (David Goodacre) studied a dataset of house prices in the UK. The data supplied by the Nationwide building society consists of average house prices in areas of the UK over a 12 year period (1992-2003). The data is for 753 towns in the UK and there are 3 levels of geography (towns nested in counties nested in regions.) Note that if we had individual house sale information then we could have considered point process approaches but here we consider random effect modelling.
A 4-level VC model for the house price dataset The following model was fitted to the data where i indexes year, j indexes town, k indexes county and l indexes region. The response, y, is the log of the average price. This model can be fitted using both likelihood-based (frequentist) and Bayesian methods in packages that allow four levels in the model.
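A four-level variance components model of this kind could be written as below. This is a sketch consistent with the indices given above and with the parameter names in the estimates for this dataset; the assignment of f, v, u and e to region, county, town and year respectively is an assumption, as is the use of a quadratic in year without any stated centring.

```latex
\begin{aligned}
y_{ijkl} &= \beta_0 + \beta_1\,\text{year}_{i} + \beta_2\,\text{year}_{i}^2
            + f_l + v_{kl} + u_{jkl} + e_{ijkl}, \\
f_l &\sim N(0, \sigma^2_f), \quad
v_{kl} \sim N(0, \sigma^2_v), \quad
u_{jkl} \sim N(0, \sigma^2_u), \quad
e_{ijkl} \sim N(0, \sigma^2_e).
\end{aligned}
```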
Links with other topics It is worth noting that this house price dataset is a repeated measures dataset of the kind you considered yesterday. It also contains missing data, as any year in which there were fewer than 50 sales in a postal town leads to a missing observation. However, here we assume the data are missing at random (MAR) conditional on the model we are fitting.
Estimates for house price dataset Below are given IGLS estimates for the model:
Parameter   Estimate (SE)
β0          4.036 (0.067)
β1          -0.020 (0.002)
β2          0.009 (0.0001)
σ²_f        0.045 (0.021)
σ²_v        0.016 (0.004)
σ²_u        0.045 (0.003)
σ²_e        0.013 (0.0002)
Here we see that the model consists of parallel curves with both year and year² very significant. The variance is greatest between regions and between postal towns.
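To see how the variation splits across the levels, the four variance estimates can be turned into shares of the total. The sketch below uses only the estimates reported above; the labelling of f, v, u and e as region, county, town and year-level residual is an assumption.

```python
# Variance estimates from the IGLS fit reported above.
# Level labels are an assumption: f = region, v = county, u = town, e = residual.
variance_estimates = {
    "region (f)": 0.045,
    "county (v)": 0.016,
    "town (u)": 0.045,
    "residual (e)": 0.013,
}

total_variance = sum(variance_estimates.values())

# Share of the total variance attributed to each level.
shares = {level: est / total_variance for level, est in variance_estimates.items()}

for level, share in shares.items():
    print(f"{level}: {share:.1%} of total variance")
```

On these figures the region and town levels each account for roughly 38% of the total variance, which matches the remark that the variance is greatest between regions and between postal towns.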
Region Level Effects Here we see that the south east of the UK and London are the most expensive whilst Scotland, the North and Wales are the cheapest.
County level effects After accounting for regions the pattern of county effects is more sporadic. We can however pick out 2 counties, Cheshire in the North West and Surrey in the South East, that are more expensive than their neighbours.
Region level predictions Here we see a graph of region level predictions:
Further Modelling In his project Dave looked at random slopes models at the various levels of the model, so that we could pick out whether the increase in prices was different in different regions. He also looked at fitting models of a more spatial nature! See next lecture.
Why are spatial effects different? The main difference with spatial effects is that we have additional information about each (spatial) unit. For example, if we observe the average house price of a town in Grampian, a town in Surrey and 2 towns in Berkshire then we know something of the spatial relation of these towns. We might expect the prices in the 2 towns in Berkshire to be similar, and to be more similar to those in Surrey, which is also in the South East, than to those in Grampian, which is in Scotland. In our current models we will fit an effect for Berkshire which will capture some of the relationship between its 2 towns and a South East effect that will capture the link with the Surrey town.
Problems with the nested classification approach As we have seen the nested classification approach can capture much of the spatial variability however we have to decide on the geographic definitions of areas. We generally use easily available definitions e.g. county and region but there is no guarantee that these are the best classifications. We also have the problem of border effects, for example two towns on either side of a region border will not share either region or county effects but may have very similar prices. We will look at another approach here before studying more complex spatial approaches in the next lecture.
Including location in fixed effects There may be a trend; for example, house prices in the UK generally fall as we move north and west. We could therefore add in two (fixed effect) predictors giving the N/S and E/W co-ordinates of each point. If the unit of observation is an area, e.g. a postal town, we would generally use the co-ordinates of the centroid of the unit. If a linear relationship is not sensible then we could consider polynomial terms in each direction. For example (excluding random effects)
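One form such a fixed part could take is sketched below. The quadratic order is an illustrative choice, and the symbols N_j and E_j (the N/S and E/W co-ordinates of the centroid of unit j) are assumed notation.

```latex
y_j = \beta_0 + \beta_1 N_j + \beta_2 N_j^2 + \beta_3 E_j + \beta_4 E_j^2
```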
Distance effects Another possibility in terms of UK house prices is to consider the distance from London. This distance can be constructed from the co-ordinates of each point. The graph to the left gives the combined region and county effects and suggests a distance from London effect might be appropriate.
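Constructing the distance predictor from co-ordinates is a one-line calculation. The sketch below assumes the co-ordinates are eastings and northings on a flat grid in kilometres; the reference point and the town positions are invented for illustration and are not taken from the house price dataset.

```python
import math

# Hypothetical (easting, northing) co-ordinates in kilometres.
# These values are illustrative only, not taken from the dataset.
LONDON = (530.0, 180.0)

towns = {
    "Town A": (530.0, 180.0),
    "Town B": (533.0, 184.0),
    "Town C": (450.0, 120.0),
}


def distance_from(origin, point):
    """Euclidean distance between two (easting, northing) pairs."""
    return math.hypot(point[0] - origin[0], point[1] - origin[1])


# Distance of each town's centroid from the reference point.
distances = {name: distance_from(LONDON, xy) for name, xy in towns.items()}

for name, dist in distances.items():
    print(f"{name}: {dist:.1f} km from London")
```

With latitude and longitude rather than grid co-ordinates, a great-circle distance would be needed instead of the straight-line formula used here.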
Distance and direction effects In some scenarios the direction as well as the distance from a particular point is important. This is not the case with house prices, but in pollution data direction can be very important: a dominant wind direction means that areas in particular directions from the source will experience more pollution than others. We will next look at a dataset from Falkirk in Scotland that is analysed in Lawson, Browne & Vidal Rodeiro (2003).
Focused Clustering One research area in public health looks at the impact of sources of pollution on the health status of communities. The detection of patterns of health events associated with pollution sources is known as focused clustering. The statistical modelling involved usually relates to the point process nature of such data. Lawson, Browne & Vidal Rodeiro (2003) devote a whole chapter to Focused clustering and include some fairly complex models that can be considered in WinBUGS. Here we will look at some simpler models that can be fitted in MLwiN to a dataset from Falkirk in Scotland.
Respiratory cancer in Falkirk The figure to the right shows the census geographies of 26 regions found around a foundry (marked by *) in Falkirk, Scotland. It is thought conceivable that the foundry was an air pollution hazard in the early 1970s prior to the study. This could have an impact on the respiratory cancer experience of those living in the areas close to the foundry
Falkirk dataset The data consists of observed and expected counts of respiratory cancer cases in the time period 1978-1983. We first plot the standardised mortality ratios (SMRs), defined as observed/expected, against the locations of the centroids of the 26 areas in Falkirk (relative to the foundry) to look for patterns.
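The SMR calculation itself is simple division. In the sketch below the counts are hypothetical (the actual Falkirk counts are not reproduced in the text) and smr is an illustrative helper name.

```python
# Hypothetical observed and expected counts for three areas;
# the real Falkirk counts are not reproduced in the text.
observed = [4, 2, 7]
expected = [2.0, 4.0, 7.0]


def smr(obs, exp):
    """SMR for one area: observed count divided by expected count."""
    if exp <= 0:
        raise ValueError("expected count must be positive")
    return obs / exp


smrs = [smr(o, e) for o, e in zip(observed, expected)]
print(smrs)
```

An SMR above 1 indicates more cases than expected in that area, and an SMR below 1 indicates fewer.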
Position of the sites Note in the graphs to the right that the 3 highest SMRs are close to the source both in the N/S and E/W directions. We can convert these locations to distance and direction measures.
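The conversion from N/S and E/W offsets to distance and direction can be sketched as follows. The bearing convention (degrees clockwise from north) is an assumption, since the text does not state one, and to_distance_direction is a hypothetical helper name.

```python
import math


def to_distance_direction(east, north):
    """Convert an offset from the source into (distance, bearing in degrees).

    The bearing is measured clockwise from north, so 0 = north and 90 = east.
    This convention is an assumption; the source does not state one.
    """
    distance = math.hypot(east, north)
    # atan2(east, north) gives the angle from the north axis towards east.
    bearing = math.degrees(math.atan2(east, north)) % 360.0
    return distance, bearing


# Example: an area centroid 3 units east and 4 units north of the source.
dist, bearing = to_distance_direction(3.0, 4.0)
print(f"distance = {dist:.2f}, bearing = {bearing:.1f} degrees")
```

Because direction is circular (359 degrees is next to 0 degrees), it is usually not entered into a model as a raw angle.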
Distance and direction Here we see that there appears to be a negative relationship between distance and SMR but no obvious pattern with regard to the direction relationship.
(Extra) Poisson modelling We have modelled the effects of deprivation, distance and direction in the following Poisson model: Note that we have used 1st order MQL in MLwiN and allowed extra-Poisson variation. The fit shows less variation than a Poisson distribution would give, so we will also try fitting SMR as a Normally distributed response.
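As a sketch only, a Poisson model of this kind is usually written with the log of the expected count as an offset. The symbols below are assumed notation: O_i and E_i are the observed and expected counts in area i, dep_i is the deprivation score, d_i the distance from the foundry, and h(θ_i) stands for whatever coding of direction was used, which the text does not specify.

```latex
\begin{aligned}
O_i &\sim \text{Poisson}(\mu_i), \\
\log \mu_i &= \log E_i + \beta_0 + \beta_1\,\text{dep}_i + \beta_2\,d_i + h(\theta_i), \\
\operatorname{Var}(O_i \mid \mu_i) &= \phi\,\mu_i .
\end{aligned}
```

Allowing extra-Poisson variation means the scale parameter φ is estimated rather than fixed at 1; an estimate of φ below 1 corresponds to less variation than a Poisson distribution.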
Normal response model for SMR Here we see that none of the predictors has a significant effect, which is probably because the dataset is so small. We do see, however, that the estimated risk reduces as distance from the foundry increases and for areas with larger deprivation scores (suggesting higher rates in less deprived areas, but not significantly).
Information for the practical In the practical we will return to using nested random effects to account for spatial effects. Our data is from the European Community and consists of male deaths from malignant melanoma in 9 countries in the EU. The practical is a (modified) chapter from Browne (2003) and looks at MCMC methods for this dataset. It is also analysed using quasi-likelihood methods in the MLwiN User's Guide and you are welcome to also try these methods.