
Nonparametric Survey Regression Estimation Using Penalized Splines
F. Jay Breidt *,** Colorado State University
Jean D. Opsomer ** Iowa State University
(+ more folks acknowledged soon)
Research supported by EPA STAR Grants R (*CSU) and R (**OSU)

The Usual Disclaimer
The work reported here was developed under STAR Research Assistance Agreements CR and CR awarded by the U.S. Environmental Protection Agency (EPA) to Colorado State University and Oregon State University. This presentation has not been formally reviewed by EPA. The views expressed here are solely those of the authors. EPA does not endorse any products or commercial services mentioned in this report.

Outline
- Background: scales of inference; specific versus generic; model-assisted and model-based inference
- Penalized splines: comparison to other smoothers; two-stage; small area
- Variations: network data, increment data
- Other: non-Gaussian time series
- Summary: status of STARMAP.2 and DAMARS.5

Scales of Inference in Surveys
- Large area: sample itself suffices for inference; no model needed
- Medium area: use auxiliary information through a model; model helps inference but is not critical
- Small area: sample size is small or zero; inference must be based on a model

Specific and Generic Inference
- Specific: one study variable, few population parameters
  - lots of modeling resources to specify, estimate, and diagnose a model
  - willingness to defend the model
- Generic: many study variables, many population parameters
  - no resources to model every variable
  - no single model is adequate/defensible

Generic Inferences in Aquatic Resources
- Generic inference is a common problem for federal, state, and tribal agencies
- Example: conduct a survey and prepare a report
  - analyze large numbers of chemical, biological, and physical variables
  - estimate means, quantiles, and distribution functions
  - break down both by political classifications and by various ecological classifications

Model-Assisted Survey Inference
- Scarce modeling resources for generic inference, so we don’t trust models
- Can we use a model without depending on the model?
- Model-assisted inference:
  - efficiency gains if model is right
  - sensible inference even if model is wrong

Model-Assisted Estimators
- Form of model-assisted estimator: (model-based prediction) + (design bias adjustment)
  - model incorporates auxiliary information
  - bias adjustment corrects for bad models
- Classical parametric model-assisted: prediction from linear regression model
- Our idea: nonparametric model-assisted
  - prediction from kernel regression or other “smoother” (JB & JO (2000), Annals of Stat)
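The form "(model-based prediction) + (design bias adjustment)" can be sketched as the usual generalized difference estimator of a population total. The notation here is assumed rather than taken from the slide: U is the finite population, s the sample, \pi_i the inclusion probability of unit i, and \hat{m}_i the model prediction for unit i.

```latex
\hat{t}_y \;=\;
\underbrace{\sum_{i \in U} \hat{m}_i}_{\text{model-based prediction}}
\;+\;
\underbrace{\sum_{i \in s} \frac{y_i - \hat{m}_i}{\pi_i}}_{\text{design bias adjustment}}
```

When the model fits well the residuals y_i - \hat{m}_i are small and the estimator gains efficiency; when the model is poor, the inverse-probability-weighted residual term still corrects the prediction on average over the sampling design.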

Why Nonparametric?
- More flexible model specification: smooth mean function, positive variance function
- Approximately correct more often
  - more opportunities for efficiency gains from auxiliary information
  - often, not a large efficiency loss if parametric specification is correct

Goals of Our Research
- Focus on generic inference
- Use flexible nonparametric models to reduce misspecification bias
  - model-assisted: medium area problem
  - model-based: small area problem
- Make the methods operationally feasible for state and tribal agencies
  - linear smoothers generate generic weights

Penalized Splines
- Very useful class of linear smoothers
- Readily fits into standard linear mixed model framework
- Modular, extensible, computationally convenient
- Automated smoothing parameter selection and fitting with standard software
- Several ongoing projects:
  - Model-assisted p-spline estimation (Gerda Claeskens, JO, JB); two-stage extensions (Mark Delorey)
  - Small area p-spline estimation (Gerda, Giovanna Ranalli, Goran Kauermann, JO, JB)
  - Smoothing on networks (Giovanna, JB)
  - Semiparametric mixed models for increment-averaged core data (Nan-Jung Hsu, Steve Ogle, JB)

Penalized Splines
- Truncated linear basis allows slope changes at each of many knots: [equation not preserved in transcript]
- Penalize for unnecessary slope changes: [equation not preserved in transcript]
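The two formulas on this slide did not survive in the transcript. A plausible reconstruction, assuming the standard truncated-linear p-spline setup and matching the 1/pi weighting and lambda^2 penalty used in the S-Plus code on the computation slide, is sketched below. The symbols \beta_0, \beta_1, b_k, \kappa_k, and K are assumed notation.

```latex
% Truncated linear basis with knots \kappa_1 < \dots < \kappa_K:
m(x;\beta,b) \;=\; \beta_0 + \beta_1 x + \sum_{k=1}^{K} b_k\,(x-\kappa_k)_+ ,
\qquad (u)_+ = \max(u,0)

% Penalized, design-weighted least squares over the sample s:
\min_{\beta,\,b}\;
\sum_{i \in s} \frac{\bigl(y_i - m(x_i;\beta,b)\bigr)^2}{\pi_i}
\;+\; \lambda^2 \sum_{k=1}^{K} b_k^2
```

Each b_k is the change in slope at knot \kappa_k, so the penalty shrinks slope changes toward zero: \lambda = 0 gives an unpenalized piecewise-linear fit, and a very large \lambda gives essentially a straight line.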

P-Splines: Influence of Penalty
[Figure: fits with increasing penalty parameter]

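Penalized Splines Computation
- Computation using S-Plus
- Set up design matrix + truncated linear splines:
    Z <- outer(x, knots, "-")            # n x K matrix of x[i] - knots[k]
    Z <- Z * (Z > 0)                     # truncate at zero: (x - knot)_+
    C <- cbind(one, x, Z)                # intercept, linear term, spline basis
- Solve for spline with fixed degrees of freedom:
    # pi is taken here to be the vector of inclusion probabilities (it masks the built-in constant pi)
    D <- diag(c(rep(0, 2), rep(1, K)))   # penalize only the K spline coefficients
    mhat <- C %*% solve(t(C) %*% diag(1/pi) %*% C + lambda^2 * D) %*% t(C) %*% diag(1/pi) %*% y
- For data-determined df/roughness penalty, can use lme() to select via REML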

Model-Assisted P-Spline Estimator
- Model-based prediction + design bias adjustment: [equation not preserved in transcript]
- Asymptotically design-unbiased and design consistent
- Asymptotic variance given by [equation not preserved in transcript]
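Because the p-spline is a linear smoother, the model-assisted estimator can be written as a weighted sum of the sampled responses, with weights that do not depend on y. The following is a minimal Python sketch of that idea, not the authors' implementation: the function names (design_row, solve, pspline_weights), the toy population, the knots, and the penalty value are all hypothetical, and only the standard library is used.

```python
def design_row(x, knots):
    # Row of the truncated linear basis: [1, x, (x - k_1)_+, ..., (x - k_K)_+]
    return [1.0, x] + [max(x - k, 0.0) for k in knots]


def solve(A, b):
    # Gaussian elimination with partial pivoting for the small system A z = b
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (M[r][n] - tail) / M[r][r]
    return z


def pspline_weights(x_pop, sample_idx, pi, knots, lam):
    # Returns weights w_k (one per sampled unit) such that the model-assisted
    # estimator of the population total is sum_k w_k * y_k.
    C_pop = [design_row(x, knots) for x in x_pop]
    C_s = [C_pop[i] for i in sample_idx]
    p, n = len(knots) + 2, len(C_s)
    # G = C_s' diag(1/pi) C_s + lam^2 * D, with D = diag(0, 0, 1, ..., 1)
    G = [[sum(C_s[k][a] * C_s[k][b] / pi[k] for k in range(n))
          for b in range(p)] for a in range(p)]
    for a in range(2, p):
        G[a][a] += lam ** 2
    # Population column totals and their Horvitz-Thompson estimates
    t_pop = [sum(row[a] for row in C_pop) for a in range(p)]
    t_ht = [sum(C_s[k][a] / pi[k] for k in range(n)) for a in range(p)]
    v = solve(G, [tp - th for tp, th in zip(t_pop, t_ht)])
    # w_k = (1 + c_k' v) / pi_k
    return [(1.0 + sum(C_s[k][a] * v[a] for a in range(p))) / pi[k]
            for k in range(n)]


# Toy example (all values hypothetical): N = 100 units, every 5th unit sampled
x_pop = [i / 99 for i in range(100)]
sample_idx = list(range(0, 100, 5))
pi = [0.2] * len(sample_idx)
w = pspline_weights(x_pop, sample_idx, pi, knots=[0.2, 0.4, 0.6, 0.8], lam=1.0)

# A response that is exactly linear in x is estimated without error
y_s = [2.0 + 3.0 * x_pop[i] for i in sample_idx]
t_hat = sum(wi * yi for wi, yi in zip(w, y_s))
```

In this sketch the weights are calibrated on the unpenalized columns (the intercept and x), so they sum to the population size and reproduce the population total of x. That is why the linear response above is recovered up to rounding. Individual weights can be negative, which is the quantity counted on the simulation-results slide.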

Design of Simulation Study
- Model-assisted estimators
  - Polynomial regression
  - Poststratification (piecewise constant)
  - Local polynomial regression (kernel)
  - Penalized spline
- Model-based estimator
  - Penalized spline
- All use common degrees of freedom: 3 or 6
- Eight response variables on one population
- Two noise levels
- N=1000
- Designs: SI or STSI
- 1000 replicate samples of size n=50

Estimator Comparisons: Common Degrees of Freedom

MSE Ratio Relative to Model-Assisted Penalized Splines

Further Results from Simulation
- Variance estimation
  - For all estimators, variance estimator has negative bias
  - Weighted residual variance estimator performs better
- Confidence interval coverage
  - Somewhat less than nominal for all estimators (90-92%)
  - Undercoverage not as severe as bias would suggest
- Negative weights: (2 df) x (2 designs) x (1000 reps) x (50 weights) = 200,000 weights
  - 902 negative REG weights
  - 145 negative LLR weights
  - 2 negative MA weights

Two-Stage P-Spline Estimation
- Available auxiliary information in two-stage sampling:
  - All clusters
  - All elements
  - All elements in sampled clusters
- Mark Delorey (poster): focus on first case
  - Simulation study comparing Horvitz-Thompson, regression, model-based p-spline, model-assisted p-spline with and without cluster random effects
  - Operational issues with df, cluster variance component
  - Some results: p-spline is good!

Semiparametric Small Area Estimation
- Gerda, Giovanna, Goran Kauermann, JO, JB
- Example: ANC level for Northeastern lakes
  - 557 observations over 113 HUCs
  - Average sample size/HUC: [value not preserved in transcript]
  - [count not preserved in transcript] HUCs contain fewer than 5 observations
  - Site-specific covariates: lake location and elevation
- Simple way to capture spatial effects?

Semiparametric Small Area Model
- Replace linear function of covariates by more general model:
  - direct estimator = truth + sampling error
  - truth = semiparametric regression + area-specific deviation
- Semiparametric regression expressed as linear mixed model
  - Thin plate splines
  - Low-rank radial basis functions
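The two verbal equations on this slide can be written in symbols as a rough sketch. The notation and the normality assumptions are the usual mixed-model ones and are assumed here, not taken from the slide: d indexes small areas (HUCs), x_d holds the area covariates, z_d holds the low-rank radial basis functions evaluated at x_d, and b collects the spline coefficients treated as random effects.

```latex
% direct estimator = truth + sampling error
\hat{\theta}_d = \theta_d + e_d , \qquad d = 1, \dots, D

% truth = semiparametric regression + area-specific deviation
\theta_d = \underbrace{x_d^{\top}\beta + z_d^{\top} b}_{\text{semiparametric regression}} + u_d ,
\qquad b \sim N(0, \sigma_b^2 I), \quad u_d \sim N(0, \sigma_u^2)
```

Treating b as a random effect is what lets the semiparametric regression be fitted as a linear mixed model, with the ratio of variance components playing the role of the smoothing parameter.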

Small Area Estimation Results
- EBLUP for this model easily handled with standard software (SAS proc mixed, S-Plus lme())

P-Splines for Increment Data
- Common for soil, sediment core data:
  - Datum represents not a single depth point but a depth increment (e.g., cylinder of soil 2.5cm in diameter x 15cm high, collected at cm)
  - Ignoring increment structure leads to biased, inconsistent estimators
- Integrate linear mixed model representation:
  - Definite integral of truncated linear basis (x-κ)_+ becomes differenced quadratic basis [(top-κ)_+]^2 - [(bottom-κ)_+]^2
- Immediate extension to small area estimation
  - E.g., soil mapping by map unit symbol
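The integration step can be checked numerically. The sketch below assumes the increment runs from a shallower depth `top` to a deeper depth `bottom` and keeps the factor 1/2 explicit (the slide's expression omits constants and its sign convention may differ). The function names and the example depths are hypothetical.

```python
def pos(u):
    # Positive part: (u)_+ = max(u, 0)
    return max(u, 0.0)


def integrated_basis(top, bottom, knot):
    # Closed form of the integral of (x - knot)_+ over [top, bottom]:
    # a difference of squared truncated terms, times 1/2
    return 0.5 * (pos(bottom - knot) ** 2 - pos(top - knot) ** 2)


def midpoint_integral(top, bottom, knot, n=100000):
    # Midpoint-rule approximation of the same integral, for comparison
    h = (bottom - top) / n
    return h * sum(pos(top + (j + 0.5) * h - knot) for j in range(n))


# Hypothetical 15cm increment starting at the surface, with a knot at 6cm depth
closed = integrated_basis(0.0, 15.0, 6.0)
numeric = midpoint_integral(0.0, 15.0, 6.0)
```

Dividing the integrated basis by the increment length (bottom - top) would give the increment average instead of the increment integral. Either way the model stays linear in the spline coefficients, so the mixed-model fitting machinery carries over unchanged.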

Carbon Sequestration (Nan-Jung Hsu, Steve Ogle, JB)
- Broad class of semiparametric mixed models for increment-averaged data

Smoothing on Networks
- Current research with post-doc, Giovanna Ranalli
  - have noisy data on stream network
  - have within-network distance measure (rather than “as the crow flies”)
  - want interpolations at unsampled locations in network
- Semiparametric methodology readily extends to this setting
  - low-rank radial basis functions
- Possible real data from EPA (John Faustini)

Smoothing on Stream Networks
- Toy stream network
  - Two first-order, one second-order stream segment
  - Regression function is exponential along straight reach (two segments), constant along remaining segment, continuous at intersection
  - n=150 noisy observations obtained along network

Toy Network Results
- Noisy observations smoothed via
  - Low-rank thin plate spline (2D, ignoring network structure)
  - Within-network radial basis functions (1D, accounts for network structure)
- Network smooth offers 25-30% reduction in MISE over spatial smooth

Non-Gaussian Time Series
- Potential models for one-dimensional spatial processes

Identification and Estimation
- In Gaussian case, models of differing causality/invertibility cannot be identified
- Identification in non-Gaussian case:
  - Fit causal/invertible ARMA via Gaussian quasi-MLE
  - Examine residuals for IID-ness
  - If not IID, fit All-Pass model (LAD [Breidt, Davis, Trindade, Ann. Stat. (2001)], MLE, rank estimation) to determine order of non-causality or non-invertibility
- Prediction and Estimation in non-Gaussian case:
  - Best MS prediction requires trickery
  - Exact MLE, Bayes for non-Gaussian MA
  - Exact and conditional MLE for MA with roots near unit circle [Rosenblatt, Davis, Breidt, Hsu]
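One way to see why second-order (Gaussian) methods cannot detect non-causality: in an all-pass model the roots of the MA polynomial are the reciprocals of the roots of the AR polynomial, so the filter has constant gain at every frequency and the series is uncorrelated even though it is not independent. The sketch below checks the constant gain for a first-order case. The parameterization (AR polynomial 1 - phi z, MA polynomial 1 - z/phi), the function name, and the value phi = 0.5 are assumptions chosen for illustration.

```python
import cmath
import math


def allpass_gain(phi, omega):
    # Modulus of the transfer function theta(z) / phi(z) at z = exp(-i * omega),
    # with AR polynomial 1 - phi*z (root 1/phi) and MA polynomial 1 - z/phi (root phi)
    z = cmath.exp(-1j * omega)
    return abs((1 - z / phi) / (1 - phi * z))


phi = 0.5
# Evaluate the gain on a grid of 64 frequencies in [0, 2*pi)
gains = [allpass_gain(phi, 2 * math.pi * k / 64) for k in range(64)]
spread = max(gains) - min(gains)
```

The gain equals 1/|phi| at every frequency, so the spectral density of the all-pass series is flat, like that of white noise. Only criteria that look beyond second moments (such as the LAD, MLE, and rank methods named on the slide) can distinguish it from an IID sequence.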

Asymptotic Results for All-Pass

Where Are We Now? DAMARS.5: Nonparametric model-assisted
1. Extensions
   1.1 continuous spatial domains (Siobhan; poster; Giovanna, work in progress)
   1.2 multiple phases (Kim (PhD 2004, ISU), working paper)
   1.3 multiple auxiliary variables (gam: Gretchen, Goran, JO, JB, JASA 2nd submission)
   alternative smoothing (Gerda, JO, JB, p-splines; Biometrika 2nd submission; Ranalli and Montanari, neural nets, JASA 2nd submission)
   Other: two-stage kernels (Kim, JO, JB; JRSS submission); two-stage splines (Mark, JB, poster)
2. Applications
   2.1 CDF estimation (Alicia, JO, JB; poster, CJS submission)
   2.2 “Medium” area (Siobhan, JO, JB; poster)
   2.3 Surveys over time (Jehad Al-Jararha, JO, JB, spam with partial overlap)
   2.4 Nonresponse (da Silva and Opsomer, Survey Methodology 2004)

Where Are We Now? STARMAP.2: Local Inferences
1. Small area
   - Nonparametric model-assisted for spatial (Siobhan, poster; Giovanna, work in progress)
   - Semiparametric (Gerda, Giovanna, Goran, JO, JB, working paper)
   - Increments (Nan-Jung, Steve, JB, working paper)
   1.1 MLE for all-pass (Beth, RD, JB, JMVA submission); rank for all-pass (Beth, RD, JB, working paper); Prediction for MA (Breidt and Hsu, Stat Sinica 2004); Exact MLE for MA (Nan-Jung, RD, JB)
   - Spatial trend detection (Hsin-Cheng Huang)
   - Design aspects: (Bill, JB, poster)
2. Deconvolution
   - Formulated as another small area estimation problem using constrained Bayes methods (Mark, JB, poster)
   - Methodology seems OK; example (88 HUCs in MAHA) still being tweaked; work in progress
3. Causal inference (Alix G)

Some Summaries (these projects only)
- Some Invited Talks and Seminars
  - Winemiller Symposium (Columbia, MO)
  - Computational Environmetrics (Chicago, IL)
  - Monitoring Symposium (Denver, CO)
  - ICSA (Singapore)
  - EMAP 2004 (Newport, RI)
  - ENAR (Pittsburgh, PA)
  - IWAP (Piraeus, Greece)
  - IMS-ASA (Calcutta, India)
  - Western Ecology Division, EPA (Corvallis, OR)
  - University of Maryland (Baltimore County, MD)
  - + Jean’s talks

More Summaries (these projects only)
- People
  - Students: Ji-Yeon Kim, ISU PhD completed Spring 2004 (JO and JB); Bill Coar, Mark Delorey, Jehad Al-Jararha, CSU PhD work in progress; ISU student?
  - Post-Doctoral Research Associate: Giovanna Ranalli
  - Visiting Research Scientists: Nan-Jung Hsu and Hsin-Cheng Huang
  - Unsuspecting Collaborators: Gerda Claeskens and Goran Kauermann
- Papers
  - 2 appeared, 2 tentatively accepted, 1 invited revision, 4 submitted, n working papers

Optimal Sampling Design under Frame Imperfections
- Motivated by problems with RF3 perennial classification
  - About 20% errors of omission and of commission!
- Previous work: logistic regression for probability of perennial as function of covariates (Bill Coar)
- Compare optimal biased and unbiased designs using anticipated MSE criterion
  - Account for differential costs (in frame, not in frame; perennial, non-perennial)
  - Minimize AMSE for fixed cost
- Further work
  - Asymptotic results for cases of negligible, non-negligible bias
  - Empirical results