# Nonparametric regression modeling BST 764 – FALL 2014 – DR. CHARNIGO – UNIT TWO.


Contents

- Formulation
- Kernel smoothing (Ref1)
- Local regression (Ref2)
- Compound estimation
- Tuning parameter selection
- Practical applications

Ref1: Härdle, W. (1990). Applied Nonparametric Regression. Cambridge University Press, Cambridge.

Ref2: Loader, C. (1999). Local Regression and Likelihood. Springer, New York.

Formulation

Have a look at the following graphs. There is clearly a relationship between x and y, but that relationship is neither linear nor quadratic.

Formulation

Why do you suppose that, in many situations, people do rely on linear and quadratic models? One could attempt to fit a cubic model to the data shown on the previous slide. Do you think that a cubic model would suffice?

When the mathematical relationship between x and y is unknown, we may agree to adopt a nonparametric regression model:

$$Y_i = \mu(x_i) + \varepsilon_i \quad \text{for } i = 1, \ldots, n.$$

Formulation

Typically, but not invariably, assumptions for this model are similar to the following:

1. Error terms are iid with mean zero and finite variance $\sigma^2$.
2. The explanatory variable is “fixed”, or our analysis is conditional upon realized values (i.e., randomness in $Y_i$ is inherited from $\varepsilon_i$, not $x_i$).
3. The mean response function $\mu$ has some known number $J$ of continuous derivatives.

Why do you suppose this model is called nonparametric? Is it reasonable to assume differentiability of $\mu$?

Kernel smoothing

If we adopt this model, then the issue becomes how to estimate $\mu$. To get some inspiration, let’s go online and look at a stock chart. We can see, for example, 10-day and 50-day moving averages. How do you think these are defined? Do you suppose that what we see in a stock chart fits the assumptions of a nonparametric regression model?

Kernel smoothing

Let $K$ denote a “kernel” function. Usually, but not invariably, a kernel function is nonnegative and symmetric about a peak at 0. Some examples follow, apart from multiplicative constants:

$$K_1(x) := \mathbf{1}_{\{|x| < 1\}}, \qquad K_2(x) := (1 - x^2)\,\mathbf{1}_{\{|x| < 1\}}, \qquad K_3(x) := \exp(-x^2/2).$$

How do the above kernel functions differ from each other?

Kernel smoothing

We may define the following as an estimator of $\mu(x_0)$, where $x_0$ is a fixed number. In what follows, we assume $x_0$ is not near a boundary.

$$\frac{\sum_{i=1}^n K[(x_i - x_0)\,h^{-1}]\; Y_i}{\sum_{i=1}^n K[(x_i - x_0)\,h^{-1}]}$$

This is a weighted average. How do you suppose it differs from the moving averages in the stock chart? We refer to $h$ as a bandwidth. How would you interpret it?

If $K$ is normalized to integrate to 1 and $L := \lim_{n \to \infty} (x_{i+1} - x_i)\, n > 0$ exists, then the denominator may be approximated by $nh/L$. Why?
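As a minimal sketch of the weighted average above, the following base R code evaluates the kernel estimator at a single point $x_0$. The function name `kernel_smooth`, the simulated data, and the default choice of the Gaussian kernel $K_3$ are assumptions made for illustration; they do not come from the slides.

```r
# Kernels from the earlier slide (up to multiplicative constants).
K1 <- function(u) as.numeric(abs(u) < 1)        # rectangular
K2 <- function(u) (1 - u^2) * (abs(u) < 1)      # Epanechnikov-type
K3 <- function(u) exp(-u^2 / 2)                 # Gaussian

# Kernel estimator of mu(x0): weighted average of y with weights K((x_i - x0) / h).
# Hypothetical helper; the name and interface are illustrative only.
kernel_smooth <- function(x0, x, y, h, K = K3) {
  w <- K((x - x0) / h)
  sum(w * y) / sum(w)
}

# Illustrative simulated data (assumed, not from the slides).
set.seed(1)
x <- seq(-3, 3, length.out = 500)
y <- sin(x) + rnorm(500, sd = 0.3)

est_small_h <- kernel_smooth(0.5, x, y, h = 0.1)  # few points get appreciable weight
est_large_h <- kernel_smooth(0.5, x, y, h = 1)    # many points get appreciable weight
```

With a compactly supported kernel such as `K1` and a very small `h`, the denominator can be zero if no observation falls within the window, so a production version would need to guard against that case.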

Kernel smoothing

Here are the results of kernel smoothing with the three kernel functions identified earlier and bandwidths of 0.1 and 1 respectively. What do you observe?

Kernel smoothing

Supposing that the denominator has been simplified to $nh/L$, what is the variance of a kernel estimator of $\mu(x_0)$?

Exercise. Show that the variance is approximately proportional to $1/(nh)$. Give an intuitive argument for why this is plausible.

What is the bias? The bias depends heavily on the unknown function $\mu$, so evaluating the bias-variance tradeoff is challenging. (For that matter, the variance depends on $\sigma^2$, which may be unknown, but estimating a parameter seems less difficult than estimating a function.)
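One possible route through the variance exercise is sketched below. It assumes independent errors with variance $\sigma^2$, the simplified denominator $nh/L$, and that sums over the design points can be replaced by integrals (which implicitly requires $h \to 0$ and $nh \to \infty$).

```latex
\begin{align*}
\mathrm{Var}\!\left[ \frac{L}{nh} \sum_{i=1}^n K\!\left( \frac{x_i - x_0}{h} \right) Y_i \right]
  &= \frac{L^2 \sigma^2}{n^2 h^2} \sum_{i=1}^n K^2\!\left( \frac{x_i - x_0}{h} \right) \\
  &\approx \frac{L^2 \sigma^2}{n^2 h^2} \cdot \frac{nh}{L} \int K^2(v)\, dv
   = \frac{L \sigma^2}{nh} \int K^2(v)\, dv .
\end{align*}
```

Intuitively, the number of observations receiving appreciable weight is of order $nh/L$, and averaging that many independent errors yields a variance of order $\sigma^2 L / (nh)$.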

Kernel smoothing

Nevertheless, we can gain some insight:

Exercise. The bias may be approximated by $\int K(v)\, \mu(x_0 + vh)\, dv - \mu(x_0)$.

If $J > 2$, then $\mu(x_0 + vh) \approx \mu(x_0) + vh\, \mu'(x_0) + v^2 h^2 \mu''(x_0) / 2$. Why?

What happens to $\int K(v)\, \mu(x_0)\, dv$, $h \int v K(v)\, \mu'(x_0)\, dv$, and $(h^2/2) \int v^2 K(v)\, \mu''(x_0)\, dv$?
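A sketch of how the three integrals combine, assuming that $K$ integrates to 1 and is symmetric about 0 (so that its first moment vanishes):

```latex
\begin{align*}
\int K(v)\, \mu(x_0 + vh)\, dv - \mu(x_0)
  &\approx \mu(x_0) \underbrace{\int K(v)\, dv}_{=\,1}
   + h\, \mu'(x_0) \underbrace{\int v K(v)\, dv}_{=\,0}
   + \frac{h^2}{2}\, \mu''(x_0) \int v^2 K(v)\, dv - \mu(x_0) \\
  &= \frac{h^2}{2}\, \mu''(x_0) \int v^2 K(v)\, dv .
\end{align*}
```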

Kernel smoothing

Conclusion: The variance is approximately proportional to $1/(nh)$, and the bias is approximately proportional to $h^2$ (unless $\mu''(x_0)$ should happen to equal 0, which is not typical, or $\int v^2 K(v)\, dv$ should happen to equal 0, which is impossible with a nonnegative kernel).

Exercise. If we wish to balance squared bias against variance, we should let $h$ be approximately proportional to $n^{-1/5}$. In this case, mean square error of estimation is approximately proportional to $n^{-4/5}$.

The practical problem, which we will revisit later, is how to choose $h$ for a fixed, finite $n$. (The guidance on proportionality only tells us how $h$ should change with $n$.)
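The balancing argument can be sketched as follows. The constants $A$ and $B$ are shorthand introduced here for illustration: $A h^4$ stands for the approximate squared bias and $B/(nh)$ for the approximate variance, with $A, B > 0$.

```latex
\begin{align*}
\mathrm{MSE}(h) &\approx A h^4 + \frac{B}{nh}, \\
\frac{d}{dh}\, \mathrm{MSE}(h) = 4 A h^3 - \frac{B}{n h^2} = 0
  &\;\Longrightarrow\; h = \left( \frac{B}{4A} \right)^{1/5} n^{-1/5}, \\
\mathrm{MSE}\!\left( c\, n^{-1/5} \right) &\approx \left( A c^4 + \frac{B}{c} \right) n^{-4/5}
  \quad \text{for any constant } c > 0 .
\end{align*}
```

Because $A$ depends on the unknown $\mu''(x_0)$ and $B$ on the unknown $\sigma^2$, this calculation gives the rate but not a usable value of $h$ for a particular data set.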

Kernel smoothing

Now suppose we wish to estimate the derivative $\mu'(x_0)$, for which I will provide some motivation when discussing practical applications of nonparametric regression. Then it seems we can use the following formula:

$$\frac{\sum_{i=1}^n -K'[(x_i - x_0)\,h^{-1}]\; Y_i}{nh^2/L}$$

Any concerns about this? If such concerns can be addressed, it turns out that mean square error of estimation is approximately proportional to $n^{-2/5}$. Why is estimating a derivative harder than estimating the function itself?

Kernel smoothing

Some final remarks for this section:

1. If we could effectively estimate the bias (or argue that doing so is unnecessary), we could make a 95% confidence interval for $\mu(x_0)$. Connecting these dots, as it were, would produce a confidence band for the function $\mu(x)$. However, the probability of capturing the function itself in the confidence band would be less than 95%. Why?
2. Testing for an association between x and y seems unnecessary; if we were unwilling to use linear regression in the first place, we have pretty much taken for granted the presence of an association. Nonetheless, we could test a hypothesis that $\mu(x)$ equals a particular function (special case: a constant which does not depend on x) by seeing whether that function fits inside a confidence band.
3. The error variance can be estimated using a residual mean square, as in linear regression, but care is required to define degrees of freedom.

Local regression

For an ordinary linear regression model, we have $Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$ for $i = 1, \ldots, n$. The parameters $\beta_0$ and $\beta_1$ are estimated by minimizing a sum of squares,

$$\sum_{i=1}^n (Y_i - \beta_0 - \beta_1 x_i)^2.$$

The idea of local regression in a nonparametric regression model with $J > 1$ is that, for $x$ close to $x_0$,

$$\mu(x) \approx \beta_0(x_0) + \beta_1(x_0)\,(x - x_0).$$

Note that, here, the intercept and the slope depend on $x_0$!

Local regression

For example, suppose that $\mu(x) = \sin x$. If $x_0 = 0$, then what are $\beta_0(x_0)$ and $\beta_1(x_0)$? How can we visualize this? What if $x_0 = \pi/2$?

With this in mind, consider minimizing a locally weighted sum of squares,

$$\sum_{i=1}^n w(x_0, x_i) \left( Y_i - \beta_0(x_0) - \beta_1(x_0)\{x_i - x_0\} \right)^2,$$

where the weight function $w(x_0, x_i)$ is larger when $|x_i - x_0|$ is smaller. Why is having a weight function indispensable?

Local regression

Note that the estimators of $\beta_0(x_0)$ and $\beta_1(x_0)$ can be obtained in closed form; this is just a two-dimensional calculus problem. However, that is typically done only to understand the estimators theoretically; in practice, one uses a computer, as with ordinary regression.

The interpretation is that, by estimating $\beta_0(x_0)$, we are really estimating $\mu(x_0)$. And, by estimating $\beta_1(x_0)$, we are really estimating $\mu'(x_0)$. Moreover, by letting $x_0$ roam through a continuum of values, we can obtain estimators of the functions $\mu(x)$ and $\mu'(x)$. Interestingly, the latter estimator is not generally equal to the derivative of the former estimator.
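A minimal sketch of the computation, using base R's `lm` with a `weights` argument to minimize the locally weighted sum of squares at a single $x_0$. The helper name `local_linear`, the Gaussian weight function, and the noise-free test data are assumptions chosen for illustration.

```r
# Local linear fit at x0 via weighted least squares.
# Returns c(b0, b1), where b0 estimates mu(x0) and b1 estimates mu'(x0).
# Hypothetical helper; name, interface, and Gaussian weights are assumptions.
local_linear <- function(x0, x, y, h) {
  w <- exp(-((x - x0) / h)^2 / 2)        # weights larger when |x_i - x0| is smaller
  fit <- lm(y ~ I(x - x0), weights = w)  # minimizes sum of w_i * (y_i - b0 - b1 * (x_i - x0))^2
  unname(coef(fit))
}

# Noise-free illustration with mu(x) = sin(x).
x <- seq(-3, 3, length.out = 601)
y <- sin(x)
at_zero    <- local_linear(0, x, y, h = 0.1)       # intercept near sin(0), slope near cos(0)
at_half_pi <- local_linear(pi / 2, x, y, h = 0.1)  # intercept near sin(pi/2), slope near cos(pi/2)
```

Calling `local_linear` over a grid of `x0` values traces out estimates of both $\mu(x)$ and $\mu'(x)$, one pair of coefficients per grid point.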

Local regression

There are different ways to specify the weight function $w(x_0, x_i)$. For example, one may take $w(x_0, x_i) := K(\{x_0 - x_i\}\, h^{-1})$, where $K$ is a kernel function and $h$ is a bandwidth. The bandwidth can also be allowed to vary with $x_0$, $h = h(x_0)$. Can you think of a reason for allowing that?

The preceding considerations also permit us to relate local regression to kernel smoothing. Suppose that

$$\sum_{i=1}^n K(\{x_0 - x_i\}\, h^{-1}) \left( Y_i - \beta_0(x_0) - \beta_1(x_0)\{x_i - x_0\} \right)^2$$

is replaced by

$$\sum_{i=1}^n K(\{x_0 - x_i\}\, h^{-1}) \left( Y_i - \beta_0(x_0) \right)^2.$$

Exercise: What happens when you minimize the latter?
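One way to see the connection, sketched under the assumptions that $K$ is symmetric about 0 and that at least one weight is positive, is to differentiate the local constant criterion with respect to $\beta_0(x_0)$ and set the result to zero:

```latex
\begin{align*}
\frac{\partial}{\partial \beta_0(x_0)} \sum_{i=1}^n K\!\left( \frac{x_0 - x_i}{h} \right) \left( Y_i - \beta_0(x_0) \right)^2
  &= -2 \sum_{i=1}^n K\!\left( \frac{x_0 - x_i}{h} \right) \left( Y_i - \beta_0(x_0) \right) = 0 \\
\Longrightarrow \quad \hat{\beta}_0(x_0)
  &= \frac{\sum_{i=1}^n K[(x_i - x_0)\,h^{-1}]\; Y_i}{\sum_{i=1}^n K[(x_i - x_0)\,h^{-1}]} .
\end{align*}
```

The minimizer is the kernel smoothing estimator of $\mu(x_0)$, so kernel smoothing can be read as local regression with a polynomial of degree 0.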

Local regression

We can go in the other direction, too. We can increase the order of the local polynomial to $P$, where $P < J$, and minimize

$$\sum_{i=1}^n w(x_0, x_i) \left( Y_i - \beta_0(x_0) - \beta_1(x_0)\{x_i - x_0\} - \cdots - \beta_P(x_0)\{x_i - x_0\}^P / P! \right)^2.$$

What advantages/disadvantages may there be to doing this?

Problem: Using data generated from the code below and the locfit package (REF), fit and display local regression models for $P = 0, 1, 2$ with rectangular kernel function and $h = 0.1, 0.5, 1$. Re-do with $h = h(x_0)$ equal to the 8th, 38th, and 68th percentiles of $|x_i - x_0|$.

```r
# Simulate the data set used in the problems of this unit.
set.seed(102)
x <- sort(rnorm(1000))  # 1000 sorted standard normal predictor values
# Mean response sin(2x) + exp(-x/2) + |x|/2, plus normal noise with sd 0.5.
y <- sin(2*x) + exp(-0.5*x) + 0.5*abs(x) + rnorm(1000)*0.5
```

REF: Loader, C. (2013). locfit: Local Regression, Likelihood and Density Estimation. R package version 1.5-9.1. http://CRAN.R-project.org/package=locfit.

Local regression

Let us now discuss the following works:

3. Cleveland, W. (1979). Robust locally weighted regression and smoothing scatterplots. Journal of the American Statistical Association, 74, 829.
4. Stone, C. (1980). Optimal rates of convergence for nonparametric estimators. Annals of Statistics, 8, 1348.

Compound estimation

Compound estimation, which I developed with Professor Srinivasan, is a new method for estimating not only $\mu(x)$ but its derivatives in such a way that the derivatives of the estimator of $\mu(x)$ are the estimators of the derivatives of $\mu(x)$. This interchangeability of differentiation and estimation is termed self-consistency.

The set-up is as follows. Use local regression or any other nonparametric regression method of your choice to obtain estimators of $\mu(x)$ and its derivatives at a finite grid of points $x = a_1, a_2, \ldots, a_M$.

Compound estimation

Let these be denoted $\hat{c}(p, a_m)$ for $p = 0, 1, \ldots, P$ and $m = 1, \ldots, M$. Thus, $\mu(x)$ in proximity to $a_m$ may be estimated using a local polynomial

$$\sum_{p=0}^P (x - a_m)^p\, \hat{c}(p, a_m) / p! \;=:\; \tilde{\mu}(x, P, a_m).$$

These local polynomials, in turn, may be combined to estimate $\mu(x)$ globally. What would be good or bad about using $\sum_{m=1}^M (1/M)\, \tilde{\mu}(x, P, a_m)$ to estimate $\mu(x)$ globally?

Compound estimation

Instead of using $\sum_{m=1}^M (1/M)\, \tilde{\mu}(x, P, a_m)$ to estimate $\mu(x)$ globally, we use

$$\sum_{m=1}^M w(x, a_m)\, \tilde{\mu}(x, P, a_m),$$

where the weight function is

$$w(x, a_m) := \frac{\exp[-\beta (x - a_m)^2]}{\sum_{k=1}^M \exp[-\beta (x - a_k)^2]}$$

for some $\beta > 0$. What does this weight function accomplish? What role does the tuning parameter $\beta$ play? Are there any other “hidden” tuning parameters?
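The weighted combination above can be sketched in base R as follows. The helper name `compound_estimate`, the matrix layout of the local coefficients, and the use of exact derivatives of $\sin x$ in place of estimated coefficients are assumptions made so that the example is self-contained; in practice the coefficients would come from local regression or another nonparametric method.

```r
# Compound estimator at a single point x, given local coefficient estimates.
# chat is an M x (P + 1) matrix: chat[m, p + 1] plays the role of c-hat(p, a_m).
# Hypothetical helper; the name and interface are illustrative only.
compound_estimate <- function(x, a, chat, beta) {
  P <- ncol(chat) - 1
  # Local polynomial centered at each grid point a[m], evaluated at x.
  local_poly <- sapply(seq_along(a), function(m) {
    sum(chat[m, ] * (x - a[m])^(0:P) / factorial(0:P))
  })
  # Normalized weights w(x, a_m) = exp(-beta (x - a_m)^2) / sum_k exp(-beta (x - a_k)^2).
  w <- exp(-beta * (x - a)^2)
  w <- w / sum(w)
  sum(w * local_poly)
}

# Illustration with M = 9 grid points and P = 2, using exact values of
# sin(x) and its first two derivatives as stand-ins for estimated coefficients.
a <- seq(-2, 2, length.out = 9)
chat <- cbind(sin(a), cos(a), -sin(a))
est <- compound_estimate(0.3, a, chat, beta = 10)
```

Larger values of `beta` concentrate the weight on the grid point nearest to `x`, whereas smaller values blend more local polynomials together.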

Compound estimation

We then define the estimator of $\mu'(x)$ to be the first derivative of $\sum_{m=1}^M w(x, a_m)\, \tilde{\mu}(x, P, a_m)$, and similarly for derivatives of order up to $P$. Notice that self-consistency is guaranteed by this definition, even if the underlying local polynomial coefficients lacked that property.

If $P = J$ and tuning parameters are chosen appropriately, then the convergence rates of the compound estimator and its derivatives are virtually indistinguishable from optimal.

Problem: Using the same data set as before, apply compound estimation with $M = 9$, $P = 2$, $h = 0.1, 0.5, 1$, and $\beta = 1, 10, 100$.

Compound estimation

Professor Srinivasan and I (along with my graduated Ph.D. student Limin Feng) also extended compound estimation to the following model:

$$Y_{jk} = \mu(x_{1jk}, x_{2jk}) + \delta_3 x_{3jk} + \delta_4 x_{4jk} + \alpha_j + \varepsilon_{jk}$$

We regard $j$ as an index of subjects and $k$ as an index of observations on a particular subject. There are four predictor variables, though this number can vary. Why may this model be called “semiparametric”?

Under certain conditions, $\mu(x_1, x_2)$ and its derivatives can be estimated at essentially optimal convergence rates, while $\delta_3$ and $\delta_4$ can be estimated at optimal convergence rates. What do you think is optimal for the latter?

Compound estimation

Let us now discuss the following works:

5. Charnigo, R. and Srinivasan, C. (2011). Self-consistent estimation of mean response functions and their derivatives. Can. J. Stat., 39, 280.
8. Charnigo, R., Feng, L., and Srinivasan, C. (2014). Nonparametric and semiparametric compound estimation in multiple covariates. Submitted for publication.

Tuning parameter selection

In local regression, for example, one must choose the bandwidth $h$. (In fact, one must choose the degree of the local polynomial and the kernel function as well. However, my impression is that statisticians are more willing to tolerate subjective choices of the degree and the kernel function than of the bandwidth.)

Let $\hat{\mu}_h(x)$ denote the estimator of $\mu(x)$ obtained with bandwidth $h$. What do you suppose may be undesirable about choosing $h$ to minimize a residual sum of squares, $\sum_{i=1}^n (Y_i - \hat{\mu}_h(x_i))^2$?

Tuning parameter selection

However, some approaches (see Loader’s book, Ref2 in the Contents, and references therein) entail modifications to a residual sum of squares.

For example, the $C_p$ criterion entails adding a penalty term to the residual sum of squares, so that the expected value of the $C_p$ criterion is close to that of $\sum_{i=1}^n (\hat{\mu}_h(x_i) - \mu(x_i))^2$.

The cross validation criterion entails calculating an alternative residual sum of squares in which $\hat{\mu}_h(x_i)$ is replaced by $\hat{\mu}_{h,-i}(x_i)$, where $-i$ indicates that the model has been re-fit without observation $i$ for the purpose of estimating $\mu(x_i)$. Have you seen anything like this (i.e., the $-i$) before? What may be a disadvantage of cross validation?
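As a rough sketch of the cross validation criterion, the following base R code re-fits a kernel smoother without observation $i$ before predicting at $x_i$, and compares candidate bandwidths. The helper names `nw_fit` and `loocv`, the Gaussian kernel, the simulated data, and the grid of candidate bandwidths are all assumptions made for illustration.

```r
# Kernel estimate of mu(x0) with a Gaussian kernel (hypothetical helper).
nw_fit <- function(x0, x, y, h) {
  w <- exp(-((x - x0) / h)^2 / 2)
  sum(w * y) / sum(w)
}

# Leave-one-out criterion: sum over i of (Y_i - estimate at x_i fitted without observation i)^2.
loocv <- function(h, x, y) {
  pred <- sapply(seq_along(x), function(i) nw_fit(x[i], x[-i], y[-i], h))
  sum((y - pred)^2)
}

# Illustrative simulated data (assumed, not from the slides).
set.seed(1)
x <- sort(runif(200, -3, 3))
y <- sin(2 * x) + rnorm(200, sd = 0.5)

h_grid <- c(0.02, 0.05, 0.1, 0.2, 0.5, 1, 2)   # candidate bandwidths
cv <- sapply(h_grid, loocv, x = x, y = y)
h_cv <- h_grid[which.min(cv)]                  # bandwidth minimizing the criterion

# For contrast: the ordinary residual sum of squares at each candidate bandwidth.
rss <- sapply(h_grid, function(h) {
  fit <- sapply(x, nw_fit, x = x, y = y, h = h)
  sum((y - fit)^2)
})
```

Comparing `rss` with `cv` across `h_grid` illustrates why the ordinary residual sum of squares is a poor guide: it tends to favor the smallest bandwidth, since each fitted value then leans almost entirely on its own observation. The loop in `loocv` also hints at the computational cost, with one re-fit per observation and per candidate bandwidth.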

Tuning parameter selection

The generalized cross validation criterion entails dividing the (ordinary) residual sum of squares by a quantity anticipated to become small when $h$ does. This criterion is actually derived as an approximation to cross validation using a special relationship between $\hat{\mu}_h(x_i)$ and $\hat{\mu}_{h,-i}(x_i)$ that exists for local regression but not (necessarily) other estimation methods.

All of the aforementioned criteria, though, presume that the goal is to optimize estimation of $\mu(x)$. Now suppose we are more interested in optimizing estimation of $\mu'(x)$. Does good estimation of $\mu(x)$ imply good estimation of $\mu'(x)$? In what direction, if any, should the bandwidth be adjusted if we are more interested in estimating the derivative than the mean response?

Tuning parameter selection

Professor Srinivasan and I (along with my graduated Ph.D. student Benjamin Hall) developed a generalized $C_p$ ($GC_p$) criterion for use when one wishes to emphasize estimation of $\mu'(x)$ (or of a higher order derivative) rather than of $\mu(x)$.

The basic idea is to define a proxy for the error sum of squares in estimating the derivative, $\sum_{i=1}^n (\hat{\mu}_h'(x_i) - \mu'(x_i))^2$. Note that $Y_i$ are just noise-corrupted versions of $\mu(x_i)$. If we had noisy versions of $\mu'(x_i)$, call them $Z_i$, then we could define a sort of residual sum of squares in estimating the derivative, $\sum_{i=1}^n (Z_i - \hat{\mu}_h'(x_i))^2$.

Tuning parameter selection

A penalty term could be appended to such a residual sum of squares to obtain a quantity with expected value close to that of $\sum_{i=1}^n (\hat{\mu}_h'(x_i) - \mu'(x_i))^2$.

One could attempt to define such a $Z_i$ using a difference quotient, such as $(Y_{i+1} - Y_i) / (x_{i+1} - x_i)$. Do you see any potential difficulty with that? Instead, we defined $Z_i$ to be a weighted average of such difference quotients, mitigating the aforementioned difficulty.
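One way to quantify the difficulty with a single difference quotient is to compute its variance. The sketch below assumes independent errors with variance $\sigma^2$ and design points spaced approximately $L/n$ apart, where $L$ is the limit defined in the kernel smoothing section.

```latex
\mathrm{Var}\!\left[ \frac{Y_{i+1} - Y_i}{x_{i+1} - x_i} \right]
  = \frac{2 \sigma^2}{(x_{i+1} - x_i)^2}
  \approx \frac{2 \sigma^2 n^2}{L^2}
```

Under these assumptions the noise in a raw difference quotient grows with $n$ rather than shrinking, which suggests why averaging several difference quotients is helpful.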

Tuning parameter selection

Our generalized $C_p$ criterion can accommodate other nonparametric regression methods besides local regression (in particular, compound estimation) and settings in which there are multiple predictor variables.

The main requirement is that, at any fixed value of the tuning parameter(s), the estimated derivative should have a linear representation in terms of observed outcomes,

$$\hat{\mu}_h'(x) = \sum_{i=1}^n L_h(x, x_i)\, Y_i.$$

Note that this linear representation also expedites calculations regarding the bias and variance of the estimator itself. However, once we select the tuning parameter(s), whether by $GC_p$ or another approach, the estimator is no longer truly linear. Why not?

Tuning parameter selection

Let us now discuss the following works:

7. Charnigo, R., Hall, B., and Srinivasan, C. (2011). A generalized Cp criterion for derivative estimation. Technometrics, 53, 238.
9. Charnigo, R. and Srinivasan, C. (2014). A multivariate generalized Cp and surface estimation. Biostatistics, in press.

Practical applications

In the papers examined so far, we have seen several practical applications of nonparametric (or semiparametric) regression in health, science, and engineering:

- Describing abrasion of rubber specimens
- Understanding effects of lead exposure
- Quantifying growth during childhood
- Inferring chemical composition of material
- Remote monitoring of Parkinson’s disease
- Ascertaining liver function inexpensively

Practical applications

One other interesting application is to a pattern recognition problem in nanoscale engineering, which I considered with Professor Srinivasan and four others (including two M.S. students in Statistics).

Suppose that nanoparticles with configuration “c” give rise to data via a nonparametric regression model $Y_i = \mu_c(x_i) + \varepsilon_i$ when scattering radiation, such that the predictor corresponds to the angle at which an observation is made and the response is a noisy version of a quantity that describes the scattering observed at that angle.

Assume that $\mu_c(x)$ and its derivatives are known with negligible error, although in practice one would have to estimate them.

Practical applications

Now suppose we have nanoparticles of an unknown configuration that we wish to infer. We can allow the nanoparticles to scatter radiation, collect data, and obtain an estimate $\hat{\mu}(x)$ of the underlying mean response as well as estimates $\hat{\mu}'(x)$ and $\hat{\mu}''(x)$ of the derivatives.

Then we can find the “c” that minimizes

$$\int_0^{180} |\hat{\mu}(x) - \mu_c(x)|\, dx \quad \text{or} \quad \int_0^{180} |\hat{\mu}'(x) - \mu_c'(x)|\, dx \quad \text{or} \quad \int_0^{180} |\hat{\mu}''(x) - \mu_c''(x)|\, dx$$

and thereby make an inference about the unknown configuration. Do you see any potential problems or weaknesses with this approach?
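The matching step can be sketched in base R by approximating the first integral with the trapezoid rule on a grid of angles. Everything concrete here is hypothetical: the helper name `l1_distance`, the three candidate profiles, and the noisy stand-in for $\hat{\mu}(x)$ are invented for illustration and do not represent actual scattering profiles.

```r
# Trapezoid-rule approximation to the integral of |f(x) - g(x)| over the grid x.
# Hypothetical helper; f and g are vectors of function values on x.
l1_distance <- function(f, g, x) {
  d <- abs(f - g)
  sum(diff(x) * (head(d, -1) + tail(d, -1)) / 2)
}

# Hypothetical candidate profiles mu_c on angles from 0 to 180 degrees.
angle <- seq(0, 180, by = 1)
profiles <- list(
  c1 = exp(-angle / 60),
  c2 = exp(-angle / 90),
  c3 = 0.5 + 0.5 * cos(angle * pi / 180)
)

# Stand-in for the estimate from the unknown configuration:
# the c2 profile plus a small amount of noise (an assumption for this sketch).
set.seed(1)
mu_hat <- exp(-angle / 90) + rnorm(length(angle), sd = 0.02)

distances <- sapply(profiles, l1_distance, g = mu_hat, x = angle)
best <- names(which.min(distances))   # inferred configuration
```

The same helper could be applied to derivative estimates and known derivative profiles to implement the second and third criteria.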

Practical applications

Let us now discuss the following work:

6. Charnigo, R., Francoeur, M., Mengüç, M., Brock, A., Leichter, M., and Srinivasan, C. (2007). Derivatives of Scattering Profiles: Tools for Nanoparticle Characterization. J. Opt. Soc. Amer. A, 24, 2578.

Exercise: Find an application of nonparametric (or semiparametric) regression distinct from any considered so far in this course. You may consider an application presented in a paper, but not in one of the papers I have assigned for reading.
