Applying Geostatistical Methods to Lattice Data: An Initial Examination of U.S. Presidential Elections in Iowa A.C. Thomas Statistics 225 December 14, 2004
Sources/Guides Main source: “Hierarchical Models”, chapters 2 and 3 (geostatistical and spatial data). Data sources: http://www.sos.state.ia.us/elections/results/ (1996/2000), http://www.cnn.com/ (2004). Special thanks: Brad Carlin (UMN), Andy Gelman (Columbia), Paul Edlefsen (Harvard). GeoR: P.J. Ribeiro and P.J. Diggle.
Motivation In this course, we have learned three different methods of examining spatial data, each suited to different conditions and partly interchangeable. Often we may not have the tools to examine a data set using one method (e.g. the shortcomings of R in manipulating lattice data). In this case, we compare the effectiveness of a geostatistical method applied to lattice data against a lattice method, using cross-validation within each data set.
Interrelationship Geostatistics and kriging: using variograms and distance relationships to predict quantities across distances. Lattices: using neighbour relationships to predict quantities across distances. Direct similarities: some weighting schemes across distances directly resemble covariograms.
Why election data? Why not? Spatial organization is well understood, constant in time (county borders have not changed across data sets), and built into R (maps library). While specific challengers change over time, parties are relatively constant, as are other control variables. Ramifications are germane to the functioning of society (and the insatiable appetite of news junkies and policy wonks).
Questions: For this data set, does a geostatistical approximation produce a result comparable in error to a lattice model? If so, can we use fitted information from one election to predict the complete results of the next one? (And how much are we off?)
Why Iowa? 99 counties of roughly equal area, which removes a possible nuisance (and they are rectilinear, so easier to draw). Swing state, with a rough vote balance over time. Not too big, not too small in either population or size.
Simplification: No third parties For now, consider only the votes for the Democratic and Republican candidates in the presidential elections from 1996 to 2004. Not so bad in 2000/2004, when the independent vote was about 3% of the total. Worse in 1996, when Perot’s campaign drew a lot, up to 10% of total votes.
Initial impressions There seems to be a tendency to vote more Republican the further west we look (Observation, courtesy Matt Anthony: as we go east, we hit Illinois, a Democratic core.) What is the population distribution by county over time?
Quick-and-dirty non-spatial analysis Question: how does population size correlate with the Democratic vote? Correlation between blue vote and “total” vote: 1996: r = 0.18; 2000: r = 0.30; 2004: r = 0.29. So population would appear to be an important covariate.
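A minimal sketch of the correlation check above, in Python rather than the R used for the talk. It assumes the "blue vote" is the Democratic share of the two-party vote and that the "total" vote stands in for population; the county numbers are made up for illustration and are not the Iowa data.

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical county records (NOT the Iowa data): (Democratic votes, Republican votes).
counties = [(4200, 5100), (15800, 12100), (2100, 3300), (52000, 38000), (7400, 7900)]
total_votes = [d + r for d, r in counties]           # proxy for population
dem_share = [d / (d + r) for d, r in counties]       # "blue" share of two-party vote
r = pearson_r(total_votes, dem_share)
```

Run once per election year, this would give one coefficient per year, comparable to the three values on the slide.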
Geostatistical analysis Locations: centroids of each county (obtained through centroid.polygon function in maps library of R) Data: Republican percentage of vote (arbitrarily chosen, not necessarily personal political affiliation)
Initial fitting Semivariogram appears to increase without bound, suggesting nonstationarity. Plan: use universal kriging with this semivariogram. Problem: the trend appears to be a power law with power greater than 2 (impossible to fit with conventional definitions, which require a power below 2). Possible solutions: a) remove trend from data; b) don’t care.
Plan A: Remove trend from data What it does: lets us remove known spatial dependence and look at other trends. Initial look: major discrepancies.
Plan B: Don’t care. The goodness of fit only tails off at the end. Preliminary results show the other option (Plan A) to be extremely inaccurate due to noise levels in the residual data.
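For readers who want to see how an unbounded semivariogram arises, here is a small Python sketch of the classical (Matheron) empirical estimator. It is not the geoR code used for the talk; the function name, the binning rule, and the toy transect are illustrative assumptions.

```python
import math
from collections import defaultdict

def empirical_semivariogram(coords, values, bin_width):
    """Matheron estimator: half the mean squared difference per distance bin."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    n = len(coords)
    for i in range(n):
        for j in range(i + 1, n):
            h = math.dist(coords[i], coords[j])
            b = int(h // bin_width)          # index of the distance bin
            sums[b] += (values[i] - values[j]) ** 2
            counts[b] += 1
    # One (bin midpoint, semivariance, number of pairs) tuple per non-empty bin.
    return [((b + 0.5) * bin_width, sums[b] / (2 * counts[b]), counts[b])
            for b in sorted(counts)]

# Hypothetical transect with a pure linear trend (not the Iowa data).
coords = [(float(x), 0.0) for x in range(5)]
values = [float(x) for x in range(5)]
sv = empirical_semivariogram(coords, values, bin_width=1.0)
```

On this toy transect the semivariance grows like the squared distance and never levels off, which is the signature of an unremoved trend rather than of a stationary process.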
Meaningful Kriging Since we want to test the predictive power of this method, we should test it on our current data through cross-validation. Key: remove one point at a time and use the semivariogram with the remaining points to interpolate the value at the removed centroid. Then return the trend to the data and compare with the original values. Use universal kriging with a second-degree trend.
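The leave-one-out procedure can be sketched as follows. This is a pure-Python illustration, not the geoR workflow from the talk: it folds the second-degree trend directly into the universal kriging system instead of detrending and re-adding, it takes the variogram as given rather than refitting it at each step, and it uses a power of 1.5 (not the fitted values near 2) to keep the toy system well conditioned. All names and data are hypothetical.

```python
import math

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[piv][col]) < 1e-12:
            raise ValueError("singular kriging system")
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def power_variogram(h, nugget, scale, power):
    """gamma(h) = nugget + scale * h**power for h > 0, and 0 at h = 0."""
    return 0.0 if h == 0 else nugget + scale * h ** power

def quad_basis(p):
    """Second-degree trend terms: 1, x, y, x^2, xy, y^2."""
    x, y = p
    return [1.0, x, y, x * x, x * y, y * y]

def uk_predict(coords, values, target, gamma):
    """Universal kriging prediction at target, using variogram function gamma."""
    n = len(coords)
    F = [quad_basis(p) for p in coords]
    k = len(F[0])
    # Block system [[Gamma, F], [F^T, 0]] [w; mu] = [gamma_0; f_0].
    A = [[gamma(math.dist(coords[i], coords[j])) for j in range(n)] + F[i]
         for i in range(n)]
    A += [[F[i][q] for i in range(n)] + [0.0] * k for q in range(k)]
    rhs = [gamma(math.dist(p, target)) for p in coords] + quad_basis(target)
    weights = solve(A, rhs)[:n]
    return sum(w * z for w, z in zip(weights, values))

def loo_cv(coords, values, gamma):
    """Leave each site out in turn and predict it from the remaining sites."""
    preds = []
    for i in range(len(coords)):
        rest_c = coords[:i] + coords[i + 1:]
        rest_v = values[:i] + values[i + 1:]
        preds.append(uk_predict(rest_c, rest_v, coords[i], gamma))
    return preds

# Hypothetical 4x4 grid of "centroids" with an exactly quadratic surface.
coords = [(float(x), float(y)) for x in range(4) for y in range(4)]
values = [0.5 + 0.03 * x - 0.01 * y + 0.002 * x * x + 0.001 * x * y
          for x, y in coords]
gamma = lambda h: power_variogram(h, nugget=0.01, scale=1.0, power=1.5)
cv_preds = loo_cv(coords, values, gamma)
```

Because the toy surface lies exactly in the span of the second-degree trend, the unbiasedness constraints force each held-out prediction to match the true value up to rounding. Real county data would leave residuals, and those residuals are what the error totals on the following slides summarise.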
1996 Redux – Predicted Values In total, Dole “receives” 9,726 more votes in the prediction than he actually got. Absolute error: 43,526. Total 2-party votes: 1,112,902.
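One way the signed and absolute vote errors could be computed from county-level predictions is sketched below in Python. The slides do not spell out this arithmetic, so treat it as an assumption: each county's predicted and actual Republican shares are converted to votes using that county's two-party total, then summed with and without sign. The numbers are hypothetical.

```python
def vote_errors(pred_share, actual_share, two_party_votes):
    """Return (net, absolute) error in votes, summed over counties."""
    net = 0.0
    absolute = 0.0
    for p, a, n in zip(pred_share, actual_share, two_party_votes):
        diff = (p - a) * n      # positive: prediction gives Republicans too many votes
        net += diff
        absolute += abs(diff)
    return net, absolute

# Hypothetical two-county example (not the Iowa data).
net, absolute = vote_errors([0.50, 0.60], [0.55, 0.58], [1000, 2000])
```

The net figure lets over- and under-predictions cancel, so it measures bias; the absolute figure does not cancel, which is why it is several times larger on each slide.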
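Fitting variograms between models For all years, the power model was the appropriate choice: $\gamma(t) = \tau^2 + \sigma^2 t^{\lambda}$. 1996: $\sigma^2$ = 9.24e-4, $\lambda$ = 1.98, $\tau^2$ = 0.031. 2000: $\sigma^2$ = 9.93e-4, $\lambda$ = 2.00, $\tau^2$ = 0. 2004: $\sigma^2$ = 1.16e-3, $\lambda$ = 2.00, $\tau^2$ = 0.025. All roughly identical, even with different total averages.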
2000 Predicted Prediction: Bush gets 26,000 more votes. Absolute error: 61,485. Total Bush/Gore votes: 1,272,890.
2004 Prediction Prediction: Bush gets 32,094 more votes. Absolute difference: 74,458. Total votes: 1,479,702.
“Naïve Neighbour” For a baseline comparison, take the simplest (stupidest) lattice cross-validation test: “ask your neighbour”, with trivial SAR weights. The predicted value for a county is simply the mean of its border-sharing neighbours (data is Republican percentage of vote).
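A minimal Python sketch of the "ask your neighbour" baseline, assuming the adjacency is supplied as a dictionary of border-sharing units. The unit names, adjacency, and shares below are made up for illustration.

```python
def naive_neighbour(shares, neighbours):
    """Predict each unit's value as the mean over its border-sharing neighbours."""
    preds = {}
    for unit, nbrs in neighbours.items():
        if not nbrs:
            raise ValueError(f"{unit} has no neighbours")
        # The unit's own value is never used, so this is already leave-one-out.
        preds[unit] = sum(shares[n] for n in nbrs) / len(nbrs)
    return preds

# Hypothetical 2x2 block of counties; names and Republican shares are invented.
neighbours = {"A": ["B", "C"], "B": ["A", "D"], "C": ["A", "D"], "D": ["B", "C"]}
rep_share = {"A": 0.52, "B": 0.48, "C": 0.55, "D": 0.45}
nn_pred = naive_neighbour(rep_share, neighbours)
```

Each prediction ignores the county's own value, so comparing `nn_pred` with `rep_share` is directly comparable to the leave-one-out kriging test.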
“NN” 1996 Dole: 10,819 more predicted. Total deviation: 40,923.
“NN” 2000 Bush gets 28,535 extra in prediction. Total deviation: 59,670.
“NN” 2004 Bush gets 37,175 more. Total deviation: 76,926.
Cross-validation summary
Year | Geostat error | NN error | Geostat total error | NN total error | Voting pop.
1996 | 9,726 | 10,819 | 43,526 | 40,923 | 1,112,902
2000 | 26,000 | 28,535 | 61,485 | 59,670 | 1,272,890
2004 | 32,094 | 37,175 | 74,458 | 76,926 | 1,479,702
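To put the totals on a common scale, the absolute errors in the summary can be divided by the two-party vote for each year. The Python sketch below uses only the figures from the summary table; the dictionary layout and key names are illustrative.

```python
# Absolute ("total") errors and two-party vote totals from the summary table.
summary = {
    1996: {"geostat_abs": 43526, "nn_abs": 40923, "votes": 1112902},
    2000: {"geostat_abs": 61485, "nn_abs": 59670, "votes": 1272890},
    2004: {"geostat_abs": 74458, "nn_abs": 76926, "votes": 1479702},
}

# Fraction of the two-party vote misallocated by each method, per year.
relative = {
    year: (row["geostat_abs"] / row["votes"], row["nn_abs"] / row["votes"])
    for year, row in summary.items()
}

for year, (geo, nn) in sorted(relative.items()):
    print(f"{year}: geostat {geo:.1%}, naive neighbour {nn:.1%}")
```

On these figures both methods misplace roughly 4 to 5 percent of the two-party vote each year, and the two stay within a fraction of a percentage point of each other.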
Conclusions Data is definitely not stationary, even after removing trends. Good kriging is about as effective as “naïve neighbour”, both without covariates. Prediction with these tools at this simple level is not yet accurate enough. Each method overpredicts the Republican vote. Fitting information for each year is very close.
Future Developments and Unanswered Questions – New! I’ve since introduced universal co-kriging with population, past voting behavior and second-degree spatial dependences using the gstat package. Needed: data from the last 4 elections, conveniently packaged. Other prediction using spatial methods.