Inverse Problems and Data Errors. Conference in Honor of Douglas O. Gough on the Occasion of his 60th Birthday. Château de Mons, Caussens, France. P.B. Stark, Department of Statistics, University of California, Berkeley, CA

Acknowledgements. Media pirated from websites of –Global Oscillations Network Group (GONG) –Solar and Heliospheric Observatory (SOHO) Solar Oscillations Investigation. Much joint work, incl. S. Evans (UCB), I.K. Fodor (LLNL), C.R. Genovese (CMU), D.O. Gough (Cambridge), Y. Gu (GONG), R. Komm (GONG), T. Sekii (Natl. Astron. Obs. of Japan), M.J. Thompson (QMW). Source: sohowww.nascom.nasa.gov

The Difference between Theory and Practice In Theory, there is no difference between Theory and Practice. In Practice, there is.

Part I: Theory

Forward Problems in Statistics: Ingredients. Measurable space X of possible data. Set Θ of possible descriptions of the world--models. Family P = {P_θ : θ ∈ Θ} of probability distributions on X, indexed by models θ. Forward operator θ ↦ P_θ maps model θ into a probability measure on X. Data X are a sample from P_θ. P_θ is the whole story: stochastic variability in the “truth,” contamination by measurement error, systematic error, censoring, etc.

Models. Index set Θ that usually has special structure. For example, Θ could be a convex subset of a separable Banach space T. The forward mapping θ ↦ P_θ maps the index of the model to a probability distribution for the data. The physical significance of θ generally gives θ ↦ P_θ reasonable analytic properties, e.g., continuity.

Forward Problems in Physics and Applied Math. Composition of steps: –transform the correct description of the world into ideal, noise-free, infinite-dimensional data (“physics”) –censor the ideal data to retain only a finite list of numbers, because we can only measure, record, and compute with such lists –possibly corrupt the list with deterministic measurement error. Equivalent to a single-step procedure with corruption on a par with the physics, and a mapping incorporating the censoring.

Physical v. Statistical Forward Problems. The statistical framework for forward problems is more general: forward problems of applied math and physics are instances of statistical forward problems.

Parameters. A parameter of a model θ is the value g(θ) at θ of a continuous G-valued function g defined on Θ. (g can be the identity.)

Inverse Problems. Observe data X drawn from distribution P_θ for some unknown θ ∈ Θ. (Assume Θ contains at least two points; otherwise, the data are superfluous.) Use X and the knowledge that θ ∈ Θ to learn about θ; for example, to estimate a parameter g(θ).

Applied Math and Statistical Perspectives. Applied math: recover a parameter of a PDE or the solution of an integral equation from infinitely many data, noise-free or with deterministic error. –Common issues: existence, uniqueness, construction, stability for deterministic noise. Statistics: estimate or draw inferences about a parameter from finitely many noisy data. –Common issues: identifiability, consistency, bias, variance, efficiency, MSE, etc.

Many Connections. Identifiability--distinct parameter values yield distinct probability distributions for the observables--is similar to uniqueness--the forward operator maps at most one model into the observed data. Consistency--the parameter can be estimated with arbitrary accuracy as the number of data grows--is related to stability of a recovery algorithm--small changes in the data produce small changes in the recovered model. There are quantitative connections too.

Physical Inverse Problems. Inverse problems in applied physics often are tackled using applied-math methods for “ill-posed problems” (e.g., Tikhonov regularization, analytic inversions). Those methods are designed to answer different questions; they can behave poorly with data (bad bias and variance). Inference ≠ construction: the statistical viewpoint is more appropriate for data.

Elements of the Statistical View. Distinguish between characteristics of the problem and characteristics of the methods used to draw inferences. Fundamental property of a parameter: g is identifiable if for all η, ν ∈ Θ, {g(η) ≠ g(ν)} ⇒ {P_η ≠ P_ν}. In most inverse problems, g(θ) = θ is not identifiable, and few linear functionals of θ are identifiable.

Decision Rules. A (randomized) decision rule δ: X → M_1(A), x ↦ δ_x(·), is a measurable mapping from the space X of possible data to the collection M_1(A) of probability distributions on a separable metric space A of actions. A non-randomized decision rule is a randomized decision rule that, to each x ∈ X, assigns a unit point mass at some value a = a(x) ∈ A.

Estimators. An estimator of a parameter g(θ) is a decision rule for which the space A of possible actions is the space G of possible parameter values. ĝ is common notation for an estimator of g(θ). Usually write a non-randomized estimator as a G-valued function of x instead of an M_1(G)-valued function.

Comparing Estimators. There are infinitely many estimators. Which one to use? The best one! But what does “best” mean?

Loss and Risk. Formulate as a 2-player game: Nature v. Statistician. Nature picks θ from Θ; θ is secret. The Statistician picks δ from a set D of decision rules; δ is secret. Generate data X from P_θ, apply δ. The Statistician pays loss ℓ(θ, δ(X)). ℓ should be dictated by the scientific context, but… Risk is expected loss: r(θ, δ) = E_θ ℓ(θ, δ(X)). A good estimator has small risk, but what does “small” mean?
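
To make the definitions concrete, here is a small simulation sketch (toy estimators and squared-error loss chosen purely for illustration; none of this is from the talk) that approximates the risk r(θ, δ) = E_θ ℓ(θ, δ(X)) by Monte Carlo:

```python
# Sketch (not from the talk): Monte Carlo risk r(theta, delta) = E_theta loss(theta, delta(X))
# for two estimators of a normal mean under squared-error loss.
import numpy as np

rng = np.random.default_rng(0)

def risk(estimator, theta, n=10, reps=100_000):
    """Approximate expected loss: draw X ~ P_theta repeatedly, apply delta, average the loss."""
    X = rng.normal(loc=theta, scale=1.0, size=(reps, n))
    est = estimator(X)
    return np.mean((est - theta) ** 2)           # squared-error loss

sample_mean = lambda X: X.mean(axis=1)
shrunk_mean = lambda X: 0.9 * X.mean(axis=1)     # a deliberately biased competitor

for theta in (0.0, 0.5, 2.0):
    print(theta, risk(sample_mean, theta), risk(shrunk_mean, theta))
# Neither estimator dominates: the shrunken mean wins near theta = 0 and loses for large theta,
# illustrating why "best" needs a criterion such as minimax or Bayes risk.
```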

Strategy. It is rare that a single estimator has smallest risk for every θ ∈ Θ. An estimator is admissible if it is not dominated by another. A minimax estimator minimizes sup_{θ∈Θ} r(θ, δ) over δ ∈ D. A Bayes estimator minimizes the average risk ∫ r(θ, δ) π(dθ) over δ ∈ D for a given prior probability distribution π on Θ. Duality: minimax is Bayes for the least favorable prior.

Common Risk: Mean Distance Error (MDE). Let d_G denote the metric on G. The MDE at θ of an estimator ĝ of g is MDE_θ(ĝ, g) = E_θ[d_G(ĝ, g(θ))]. When the metric derives from a norm, MDE is called mean norm error (MNE). When the norm is Hilbertian, (MNE)² is called mean squared error (MSE).

Bias. When G is a Banach space, can define the bias at θ of ĝ: bias_θ(ĝ, g) = E_θ[ĝ − g(θ)] (when the expectation is well defined). If bias_θ(ĝ, g) = 0, say ĝ is unbiased at θ (for g). If ĝ is unbiased at θ for g for every θ ∈ Θ, say ĝ is unbiased for g. If such a ĝ exists, g is unbiasedly estimable. If g is unbiasedly estimable, then g is identifiable.
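
Taking MSE in its usual sense E_θ‖ĝ − g(θ)‖² (for Hilbertian G), bias and variance combine in the standard decomposition, written out here for reference:

```latex
\mathrm{MSE}_\theta(\hat g, g)
 \;=\; \mathbb{E}_\theta \big\| \hat g - g(\theta) \big\|^{2}
 \;=\; \underbrace{\mathbb{E}_\theta \big\| \hat g - \mathbb{E}_\theta \hat g \big\|^{2}}_{\text{variance}}
 \;+\; \big\| \mathrm{bias}_\theta(\hat g, g) \big\|^{2}.
```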

More Notation. Let T be a separable Banach space, T* its normed dual. Write the pairing between T and T* as ⟨·, ·⟩: T* × T → R.

Linear Forward Problems. A forward problem is linear if Θ is a subset of a separable Banach space T and, for some fixed sequence (κ_j)_{j=1}^n of elements of T*, X = (X_j)_{j=1}^n, where X_j = ⟨κ_j, θ⟩ + ε_j, θ ∈ Θ, and ε = (ε_j)_{j=1}^n is a vector of stochastic errors whose distribution does not depend on θ (so X = R^n).

Linear Forward Problems, contd. The functionals {κ_j} are the “representers” or “data kernels.” The distribution P_θ is the probability distribution of X. Typically, dim(Θ) = ∞; at the very least, n < dim(Θ), so estimating θ is an underdetermined problem. Define K: T → R^n, θ ↦ (⟨κ_j, θ⟩)_{j=1}^n. Abbreviate the forward problem by X = Kθ + ε, θ ∈ Θ.
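
A toy discretization of X = Kθ + ε (model, kernels, and dimensions all invented for illustration; real data kernels come from the physics):

```python
# Sketch (illustrative): a discretized linear forward problem X = K theta + epsilon
# with far fewer data than unknowns.
import numpy as np

rng = np.random.default_rng(1)
p, n = 200, 20                                   # dim(Theta) >> n: underdetermined

r = np.linspace(0.0, 1.0, p)                     # discretized model domain
theta = np.sin(3 * np.pi * r)                    # "true" model (made up for the demo)

# Representers kappa_j: here, smooth averaging kernels on the grid.
centers = np.linspace(0.1, 0.9, n)
K = np.exp(-0.5 * ((r[None, :] - centers[:, None]) / 0.05) ** 2)
K /= K.sum(axis=1, keepdims=True)                # each row sums to 1: a local average

sigma = 0.01
X = K @ theta + rng.normal(scale=sigma, size=n)  # data: X_j = <kappa_j, theta> + eps_j
```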

Linear Inverse Problems. Use X = Kθ + ε and the knowledge θ ∈ Θ to estimate or draw inferences about g(θ). The probability distribution of X depends on θ only through Kθ, so if there are two points θ_1, θ_2 ∈ Θ such that Kθ_1 = Kθ_2 but g(θ_1) ≠ g(θ_2), then g(θ) is not identifiable.

Backus-Gilbert++: Necessary Conditions. Let g be an identifiable real-valued parameter. Suppose ∃ θ_0 ∈ Θ, a symmetric convex set Ť ⊆ T, c ∈ R, and ğ: Ť → R such that: 1. θ_0 + Ť ⊆ Θ; 2. for t ∈ Ť, g(θ_0 + t) = c + ğ(t), and ğ(−t) = −ğ(t); 3. ğ(a_1 t_1 + a_2 t_2) = a_1 ğ(t_1) + a_2 ğ(t_2) for t_1, t_2 ∈ Ť, a_1, a_2 ≥ 0, a_1 + a_2 = 1; and 4. sup_{t ∈ Ť} |ğ(t)| < ∞. Then ∃ a 1×n matrix Λ s.t. the restriction of ğ to Ť is the restriction of Λ·K to Ť.

Backus-Gilbert++: Sufficient Conditions. Suppose g = (g_i)_{i=1}^m is an R^m-valued parameter that can be written as the restriction to Θ of Λ·K for some m×n matrix Λ. Then: 1. g is identifiable. 2. If E[ε] = 0, Λ·X is an unbiased estimator of g. 3. If, in addition, ε has covariance matrix Σ = E[εε^T], the covariance matrix of Λ·X is Λ·Σ·Λ^T, whatever P_θ may be.

Corollary: Backus-Gilbert. Let T be a Hilbert space; let Θ = T; let g ∈ T = T* be a linear parameter; and let {κ_j}_{j=1}^n ⊆ T*. The parameter g(θ) is identifiable iff g = Λ·K for some 1×n matrix Λ. In that case, if E[ε] = 0, then ĝ = Λ·X is unbiased for g. If, in addition, ε has covariance matrix Σ = E[εε^T], then the MSE of ĝ is Λ·Σ·Λ^T.
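
A numerical sketch of the corollary (Gaussian data kernels and a made-up target functional, chosen only for illustration): build Λ so that Λ·K is close to a target functional, then read off the estimate Λ·X and its variance Λ·Σ·Λ^T.

```python
# Sketch (illustrative) of the Backus-Gilbert corollary in action.
import numpy as np

rng = np.random.default_rng(2)
p, n = 200, 20
r = np.linspace(0.0, 1.0, p)
centers = np.linspace(0.1, 0.9, n)
K = np.exp(-0.5 * ((r[None, :] - centers[:, None]) / 0.05) ** 2)
K /= K.sum(axis=1, keepdims=True)

# Target functional g(theta) = <t, theta>: a narrow local average near r = 0.5 (hypothetical).
t = np.exp(-0.5 * ((r - 0.5) / 0.03) ** 2)
t /= t.sum()

# Least-squares Lambda: make Lambda.K as close to t as possible.
Lam, *_ = np.linalg.lstsq(K.T, t, rcond=None)

theta = np.sin(3 * np.pi * r)
sigma = 0.01
X = K @ theta + rng.normal(scale=sigma, size=n)

ghat = Lam @ X                                   # the linear estimate Lambda.X
var_ghat = sigma**2 * Lam @ Lam                  # Lambda Sigma Lambda^T, with Sigma = sigma^2 I
print(t @ theta, ghat, var_ghat)
# Lam @ X is exactly unbiased for <Lam.K, theta>; it estimates <t, theta> only as well as
# Lam.K approximates t -- the Backus-Gilbert resolution/variance trade-off.
```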

Consistency in Linear Inverse Problems. X_i = ⟨κ_i, θ⟩ + ε_i, i = 1, 2, 3, …; Θ a subset of a separable Banach space T; {κ_i} ⊆ T* linear, bounded on Θ; {ε_i} iid. θ is consistently estimable w.r.t. the weak topology iff ∃ {T_k}, T_k a Borel function of X_1, …, X_k, s.t. ∀θ ∈ Θ, ∀ε > 0, ∀λ ∈ T*, lim_k P_θ{|⟨λ, T_k⟩ − ⟨λ, θ⟩| > ε} = 0.

Importance of the Error Distribution. µ a prob. measure on R; µ_a(B) = µ(B − a), a ∈ R. The Hellinger distance between shifted copies of µ gives a pseudo-metric on T**. If its restriction to Θ converges to a metric compatible with the weak topology, θ can be estimated consistently in the weak topology. For a given sequence of functionals {κ_i}, the rougher µ is, the easier consistent estimation becomes.
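
The slide's formula for the Hellinger distance did not survive transcription; one common normalization (an assumption about which convention the talk used) is

```latex
H^{2}(\mu_a, \mu_b)
 \;=\; \frac{1}{2} \int \Big( \sqrt{\tfrac{\mathrm{d}\mu_a}{\mathrm{d}\nu}}
                            - \sqrt{\tfrac{\mathrm{d}\mu_b}{\mathrm{d}\nu}} \Big)^{2} \mathrm{d}\nu,
```

for any measure ν dominating µ_a and µ_b.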

Example: Linear Combinations of Splitting Kernels. [Figure: cuts through kernels for rotation. A: l = 15, m = 8. B: l = 28, m = 14. C: l = 28, m = 24. D: two targeted combinations: 0.7R, 60°; 0.82R, 30°. Thompson et al., Science 272, 1996.] [Figure: estimated rotation rate as a function of depth at three latitudes. Source: SOHO-SOI/MDI website.]

Part II: Practice

Linear Forward Problem for Rotation. Note the language change: θ is now latitude, and Ω is the model. The relationship assumes the eigenfunctions and radial structure are known. Observational errors usually are assumed to be zero-mean independent normal random variables with known variances.
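
A standard form of the relationship the slide alludes to (reconstructed from the helioseismology literature, not recovered from the slide image; conventions for absorbing the azimuthal order m into the kernel vary) is the rotational splitting integral:

```latex
\nu_{n\ell m} - \nu_{n\ell 0}
 \;=\; \int_0^{R_\odot} \!\! \int_{-\pi/2}^{\pi/2}
       K_{n\ell m}(r, \theta)\, \Omega(r, \theta)\, \mathrm{d}\theta\, \mathrm{d}r
 \;+\; \epsilon_{n\ell m},
```

where the splitting kernels K_{nlm} are computed from the (assumed known) eigenfunctions and radial structure, and the ε_{nlm} are the observational errors.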

Data Reduction. [Figure.] Harvey et al., Science 272, 1996.

GONG Data Pipeline
1. Read tapes from sites.
2. Correct for CCD characteristics.
3. Transform intensities to Doppler velocities.
4. Calibrate velocities using daily calibration images.
5. Find image geometry and modulation transfer function (atmospheric effects, lens dirt, instrument characteristics, …).
6. High-pass filter to remove steady flows.
7. Remap images to heliographic coordinates; interpolate, resample, correct for line of sight.
8. Transform to spherical harmonics: window, Legendre stack in latitude, FFT in longitude.
9. Adjust spherical harmonic coefficients for the estimated modulation transfer function.
10. Merge time series of spherical harmonic coefficients from different stations, weighting for relative uncertainties.
11. Fill data gaps of up to 30 minutes by ARMA modeling.
12. Take the periodogram of each time series of spherical harmonic coefficients.
13. Fit a parametric model to the power spectrum by iterative approximate maximum likelihood.
14. Identify quantum numbers; report frequencies, linewidths, background power, and uncertainties. (Steps 12-13 are sketched in code below.)
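
A minimal sketch of steps 12-13, with made-up cadence and mode parameters, and a single-mode fit standing in for the pipeline's iterative multi-mode fit:

```python
# Sketch (assumptions flagged in comments) of pipeline steps 12-13: periodogram of a
# spherical harmonic coefficient time series, then a Lorentzian-plus-background fit.
import numpy as np
from scipy.optimize import minimize

def periodogram(y, dt):
    """Step 12: raw power spectrum of a (here, gap-free) time series."""
    P = np.abs(np.fft.rfft(y)) ** 2 / len(y)
    f = np.fft.rfftfreq(len(y), dt)
    return f, P

def neg_log_like(params, f, P):
    """Step 13: the periodogram at each frequency is approximately the expected power S(f)
    times a chi^2_2/2 variable, so the negative log-likelihood is sum(log S + P/S)."""
    A, f0, w, b = params                       # height, centre, width, flat background
    S = A / (1 + ((f - f0) / w) ** 2) + b      # one Lorentzian mode (real fits use many)
    if not np.all(np.isfinite(S)) or np.any(S <= 0):
        return np.inf                          # keep the optimizer in the valid region
    return np.sum(np.log(S) + P / S)

rng = np.random.default_rng(3)
t = np.arange(4096) * 60.0                     # 60 s cadence (illustrative)
y = np.sin(2 * np.pi * 3e-3 * t) + rng.normal(scale=2.0, size=t.size)  # toy "mode" + noise
f, P = periodogram(y, 60.0)
fit = minimize(neg_log_like, x0=[P.max(), 3e-3, 1e-5, np.median(P)],
               args=(f[1:], P[1:]), method="Nelder-Mead")
print(fit.x)                                   # estimated height, frequency, width, background
```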

More Steps. [Figures: time series of spherical harmonic coefficients; spectra of the time series and fitted parametric models. Top: GONG website. Bottom: Hill et al., Science, 1996.]

Effect of Gaps. We don't observe the process of interest; we observe the process × window. The Fourier transform of the data is the FT of the process convolved with the FT of the window. The FT of the window has many large sidelobes. Convolution causes energy to “leak” from distant frequencies into any particular band of interest. [Figure: power spectrum of a 95% duty-cycle window.]
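
A minimal sketch of the point, with a random 95% duty-cycle pattern standing in for the real observing schedule:

```python
# Sketch: the spectral window of a gapped observing schedule.
import numpy as np

rng = np.random.default_rng(4)
T = 4096
window = (rng.random(T) < 0.95).astype(float)   # 1 where we observe, 0 in gaps

W = np.abs(np.fft.rfft(window)) ** 2            # power spectrum of the window
W /= W[0]                                       # normalize the central peak to 1

# The sidelobes of W are the leakage routes: the observed spectrum is the true spectrum
# convolved with W, so power at distant frequencies bleeds into every band of interest.
print("largest sidelobe, relative to peak:", W[1:].max())
```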

Tapering. Want the simplicity of the periodogram, but less leakage. Traditional approach: multiply the data by a smooth “taper” that vanishes where there are no data. A smoother taper gives smaller sidelobes, but more local smearing (loss of resolution). Pose choosing the taper as an optimization problem.

Optimal Tapering. What taper minimizes “leakage” while maximizing resolution? Leakage is a bias; optimality depends on its definition. Broad-band and asymptotic criteria yield eigenvalue problems (written out below).
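
The formulas the slide points to did not survive transcription; the broad-band criterion, as written in the Slepian/Thomson literature, maximizes the taper's energy concentration in a band [−w, w]:

```latex
\lambda(v) \;=\; \frac{\int_{-w}^{w} |V(f)|^{2}\,\mathrm{d}f}
                      {\int_{-1/2}^{1/2} |V(f)|^{2}\,\mathrm{d}f},
\qquad
V(f) \;=\; \sum_{t=0}^{T-1} v_t\, e^{-2\pi i f t},
```

and the maximizer solves the Toeplitz eigenvalue problem

```latex
\sum_{t'=0}^{T-1} \frac{\sin\!\big(2\pi w (t-t')\big)}{\pi (t-t')}\, v_{t'}
 \;=\; \lambda\, v_t, \qquad t = 0, \dots, T-1,
```

with the diagonal term read as 2w. With gaps, the sum runs only over observed times, which changes the eigenfunctions substantially.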

Prolate Spheroidal Tapers. Maximize the fraction of energy in a band [−w, w] around zero. Analytic solution when there are no gaps: –2wT tapers are nearly perfect –the others are very poor. Must choose w. The character is different with gaps. [Figure: tapers for T = 1024.] Fodor & Stark, IEEE Trans. Sig. Proc., 48, 2000.
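
For the gap-free case, SciPy computes these tapers directly (illustrative T, time-bandwidth product, and taper count; the gapped eigenproblem of Fodor & Stark is not in SciPy):

```python
# Sketch: discrete prolate spheroidal (Slepian) tapers for a gap-free series.
import numpy as np
from scipy.signal.windows import dpss

T = 1024
NW = 4.0                                        # time-bandwidth product: w = NW / T
tapers, ratios = dpss(T, NW, Kmax=8, return_ratios=True)

# 'ratios' are the concentrations lambda: close to 1 for roughly the first 2NW tapers,
# then falling off sharply -- why ~2wT tapers are "nearly perfect" and the rest are poor.
print(np.round(ratios, 6))
```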

Minimum Asymptotic Bias Tapers. Minimize the integral of the spectrum against frequency squared, the leading term in the asymptotic bias. [Figure: T = 1024.] Fodor & Stark, IEEE Trans. Sig. Proc., 48, 2000.

Sine Tapers. Without gaps, these approximate the minimum asymptotic bias tapers. With gaps, re-orthogonalize. [Figure.] Fodor & Stark, IEEE Trans. Sig. Proc., 48, 2000.
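
A sketch of sine tapers and one plausible re-orthogonalization on a gapped grid (QR on the observed samples; the paper's exact construction may differ):

```python
# Sketch: sine tapers, re-orthogonalized against an illustrative gap pattern.
import numpy as np

def sine_tapers(T, K):
    """Standard sine tapers v_k(t) = sqrt(2/(T+1)) sin(pi k (t+1)/(T+1)), orthonormal gap-free."""
    t = np.arange(T)
    k = np.arange(1, K + 1)
    return np.sqrt(2.0 / (T + 1)) * np.sin(np.pi * np.outer(k, t + 1) / (T + 1))

T, K = 1024, 12
V = sine_tapers(T, K)                            # rows are tapers

rng = np.random.default_rng(5)
observed = rng.random(T) < 0.95                  # 95% duty-cycle gap pattern (illustrative)
Vg = V * observed                                # zero the tapers in the gaps
Q, _ = np.linalg.qr(Vg.T)                        # re-orthonormalize on the observed grid
gapped_tapers = Q.T                              # rows vanish in the gaps and are orthonormal
```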

Optimization Problems. The prolate and minimum asymptotic bias tapers are the top eigenfunctions of large eigenvalue problems. The problems have special structure and can be solved efficiently (the top 6 tapers for T = 103,680 in under a day). Sine tapers are very cheap to compute.

Sample Concentration of Tapers (T = 1024). [Table: concentration by taper order, for prolate, prolate with gaps, and projected prolate tapers; numerical entries lost in transcription.] Fodor & Stark, IEEE Trans. Sig. Proc., 48, 2000.

Sample Asymptotic Bias of Tapers (T = 1024). [Table: asymptotic bias by taper order, for minimum asymptotic bias, minimum bias with gaps, and projected minimum bias tapers; numerical entries lost in transcription.] Fodor & Stark, IEEE Trans. Sig. Proc., 48, 2000.

Multitaper Estimation. The top several eigenfunctions have eigenvalues close to 1; the eigenvalues then drop to zero, abruptly in the no-gap case. Estimates using orthogonal tapers are asymptotically independent (under mild conditions). Averaging spectrum estimates from several “good” tapers can decrease variance without increasing bias much. The result is a rank-K quadratic estimator.

Multitaper Procedure. Compute K orthogonal tapers, each with good concentration. Multiply the data by each taper in turn. Compute the periodogram of each product. Average the periodograms. Special case: break the data into segments.
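
A minimal gap-free rendering of this procedure (illustrative NW and K; the gapped version substitutes re-orthogonalized tapers):

```python
# Sketch: basic multitaper spectrum estimate, gap-free case.
import numpy as np
from scipy.signal.windows import dpss

def multitaper(y, dt, NW=4.0, K=7):
    T = len(y)
    tapers = dpss(T, NW, Kmax=K)                 # K well-concentrated orthogonal tapers
    tapered = tapers * y                         # multiply data by each taper in turn
    eigspec = np.abs(np.fft.rfft(tapered, axis=1)) ** 2 * dt   # periodogram of each product
    f = np.fft.rfftfreq(T, dt)
    return f, eigspec.mean(axis=0)               # average the K estimates

rng = np.random.default_rng(6)
t = np.arange(4096) * 60.0
y = np.sin(2 * np.pi * 3e-3 * t) + rng.normal(size=t.size)
f, S = multitaper(y, 60.0)
```

Averaging the K nearly independent eigenspectra is what buys the variance reduction; the cost is the slight bandwidth widening set by NW.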

Cheapest is Fine. For simulated and real helioseismic time series of length T = 103,680, there is no discernible systematic difference among 12-taper multitaper estimates using the three families of tapers. Use re-orthogonalized sine tapers, because they are much cheaper to compute for each gap pattern in each time series.

Multitaper Simulation. Can combine with segmenting to decrease dependence. Prettier than the periodogram, with less leakage. Better, too? [Figure: T = 103,680; truth in grey. Left panels: periodogram. Right panels: 3-segment, 4-taper gapped sine taper estimate. Fodor & Stark, 2000.]

Multitaper: SOHO Data. Easier to identify mode parameters from the multitaper spectrum. Maximum likelihood is more stable; can identify 20% to 60% more modes (Komm et al., Ap. J., 519, 1999). [Figure: SOHO l = 85, m = 0; T = 103,680. Periodogram (left) and 3-segment, 4-sine-taper estimate (right).] Fodor & Stark, IEEE Trans. Sig. Proc., 48, 2000.

Error Bars: Confidence Level in Simulation.

Method | % Coverage
Parametric chi-square | 72
Simulation from estimate | 82
Jackknife | 75
Bootstrap normal | 77
Bootstrap-t | 78
Bootstrap percentile | 69
Blockwise bootstrap | 56
Bootstrap pivot percentile | 71
Pre-pivot bootstrap | 78
Iterated pre-pivot bootstrap | 91
Bootstrap percentile of percentiles | 90
Bootstrap pivot percentile of percentiles | 96

1,000 realizations of simulated normal-mode data; 95% target confidence level. Fodor & Stark, IEEE Trans. Sig. Proc., 48, 2000.
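
For concreteness, a toy version of the simplest method in the table, the bootstrap percentile interval, with a stand-in frequency estimator (everything here is illustrative, not the paper's simulation):

```python
# Sketch: bootstrap percentile interval for a toy mode-frequency estimate.
import numpy as np

rng = np.random.default_rng(7)

def fit_frequency(y, dt):
    """Stand-in estimator: periodogram peak, refined by quadratic interpolation in log power
    so the estimate is not quantized to the Fourier grid."""
    P = np.abs(np.fft.rfft(y)) ** 2
    f = np.fft.rfftfreq(len(y), dt)
    i = 1 + np.argmax(P[1:-1])
    a, b, c = np.log(P[i - 1 : i + 2])
    return f[i] + 0.5 * (a - c) / (a - 2 * b + c) * (f[1] - f[0])

dt = 60.0
t = np.arange(2048) * dt
y = np.sin(2 * np.pi * 3e-3 * t) + rng.normal(size=t.size)

f_hat = fit_frequency(y, dt)
signal_hat = np.sin(2 * np.pi * f_hat * t)       # amplitude and phase fixed at truth: a toy
resid = y - signal_hat

boot = []
for _ in range(999):
    # i.i.d. residual resampling; the table's blockwise and pre-pivoted variants differ
    # precisely in how this resampling and the interval construction are done.
    y_star = signal_hat + rng.choice(resid, size=resid.size, replace=True)
    boot.append(fit_frequency(y_star, dt))
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"95% bootstrap percentile interval: [{lo:.6g}, {hi:.6g}] Hz")
```

As the table shows, the naive percentile interval under-covers here; the iterated and percentile-of-percentiles refinements get much closer to the 95% target.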