ECE-7000: Nonlinear Dynamical Systems
12.5.1 Overfitting and model costs

ECE-7000: Nonlinear Dynamical Systems
Overfitting and model costs

Overfitting
- The more free parameters a model has, the better it can be adapted to the data.
- However, too many adjustable parameters cause global features of the system to be modelled erroneously.
- Overfitting becomes evident when the in-sample prediction error is significantly smaller than the out-of-sample error.
- Remedies: add a term for the model costs to the minimisation problem; equivalently, add an appropriate function of the number of adjustable parameters to the likelihood function.

Cost function
- Suppose we have a general function $f(x; a_1, \ldots, a_k)$ for the dynamics, depending on $k$ adjustable parameters.
- We want to find the particular set of parameters which maximises the probability of the observed data given the model.
- That is, we maximise the log-likelihood function $\ln L(a_1, \ldots, a_k)$.
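
To see the in-sample/out-of-sample gap concretely, the following sketch fits polynomial one-step predictors of increasing order to surrogate data (a noisy logistic map; the map, noise level and orders are all illustrative choices, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)

# Surrogate data: a logistic map observed with additive measurement noise.
x = np.empty(400)
x[0] = 0.3
for n in range(399):
    x[n + 1] = 3.8 * x[n] * (1.0 - x[n])
s = x + 0.02 * rng.standard_normal(x.size)

train, test = s[:300], s[300:]

# Polynomial one-step predictors of increasing order (more free parameters).
for k in (1, 2, 5, 9):
    coef = np.polyfit(train[:-1], train[1:], k)
    e_in = np.mean((np.polyval(coef, train[:-1]) - train[1:]) ** 2)
    e_out = np.mean((np.polyval(coef, test[:-1]) - test[1:]) ** 2)
    print(f"order {k:2d}: in-sample MSE {e_in:.5f}   out-of-sample MSE {e_out:.5f}")
```

The in-sample error can only decrease as the order grows, while the out-of-sample error eventually stagnates or grows; a widening gap between the two is the symptom of overfitting described above.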

ECE-7000: Nonlinear Dynamical Systems
Overfitting and model costs

Cost function (continued)
- If the errors are Gaussian with variance $\sigma^2$, the probability of the data is
  $$ p(\{s_n\} \mid a) \propto \prod_n \exp\!\left( -\frac{(s_{n+1} - f(s_n; a))^2}{2\sigma^2} \right). $$
- Estimating the variance by $\hat\sigma^2 = \frac{1}{N}\sum_n (s_{n+1} - f(s_n; a))^2$, we obtain
  $$ \ln L = -\frac{N}{2}\left( \ln \hat\sigma^2 + 1 + \ln 2\pi \right), $$
  which is maximal when the mean squared prediction error is minimal.
- The complexity of a model is now taken into account by adding a term $-\lambda k$ to the log-likelihood, giving the modified log-likelihood $\ln \tilde L = \ln L - \lambda k$.
- The choice of $\lambda$: $\lambda = 1$ is suitable when redundant parameters may lead to better predictions; $\lambda = \frac{1}{2}\ln N$ is suitable if the main interest is the model itself.

ECE-7000: Nonlinear Dynamical Systems
Overfitting and model costs

Cost function (continued)
- If we have $N$ data points, the number of relevant bits in each parameter will typically scale as $\frac{1}{2}\log_2 N$, so the total description length of the parameters grows as $\frac{k}{2}\ln N$.
- The modified log-likelihood functions are
  $$ \ln \tilde L = \ln L - k \quad \text{(Akaike)}, \qquad \ln \tilde L = \ln L - \frac{k}{2}\ln N \quad \text{(Rissanen)}. $$
- The corresponding mean-squares cost functions for Gaussian errors, to be minimised, are (see the sketch below):
  $$ \frac{N}{2}\ln \hat\sigma^2 + k \quad \text{(Akaike)}, \qquad \frac{N}{2}\ln \hat\sigma^2 + \frac{k}{2}\ln N \quad \text{(Rissanen)}, $$
  where $\hat\sigma^2$ is the mean squared prediction error.
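
A minimal sketch of these two penalised cost functions, assuming Gaussian one-step errors; the function name and interface are ours, not from the notes:

```python
import numpy as np

def penalized_cost(residuals, k, which="akaike"):
    """Mean-squares cost with model-cost penalty, for Gaussian errors.

    residuals : one-step prediction errors s_{n+1} - f(s_n; a)
    k         : number of adjustable parameters of the model
    """
    N = residuals.size
    sigma2 = np.mean(residuals ** 2)       # ML estimate of the error variance
    cost = 0.5 * N * np.log(sigma2)        # -ln L up to parameter-free constants
    if which == "akaike":
        return cost + k                    # penalty with lambda = 1
    return cost + 0.5 * k * np.log(N)      # Rissanen: lambda = (1/2) ln N
```

Selecting a model order then amounts to evaluating this cost on each candidate's residuals and keeping the candidate with the smallest value; the Rissanen penalty grows with $N$ and therefore prefers smaller models on long data sets.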

ECE-7000: Nonlinear Dynamical Systems
The errors-in-variables problem

Errors-in-variables problem
- Occurs when an ordinary least squares procedure is applied although both the independent and the dependent variables are noisy.
- Above a noise level of roughly 10%, the consequences become obvious.

Solutions to the errors-in-variables problem
- Treat all variables in a symmetric way: minimise the sum of squared orthogonal distances between the data points and the fitted surface.
- Fit the surface to the collection of points, ignoring the fact that the points are not mutually independent.
- This still leads to problems when the noise level exceeds 15-20%.

Other possible cost functions
- Auto-synchronisation: the cost is evaluated dynamically while the model is coupled to the data, which makes it a very attractive concept for real applications.
- The best predictor is always found by minimisation of the prediction error.
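
To make the symmetric treatment concrete, here is a sketch of an orthogonal-distance (total least squares) line fit via the SVD, compared against ordinary least squares on synthetic data with noise in both variables; the helper name and all values are illustrative:

```python
import numpy as np

def tls_line(x, y):
    """Fit y = a*x + b by minimising orthogonal (not vertical) distances."""
    X = np.column_stack([x - x.mean(), y - y.mean()])
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    nx, ny = Vt[-1]                        # normal vector of the best-fit line
    a = -nx / ny                           # slope (assumes a non-vertical line)
    b = y.mean() - a * x.mean()
    return a, b

rng = np.random.default_rng(1)
t = rng.uniform(0.0, 1.0, 200)
x = t + 0.1 * rng.standard_normal(200)     # the "independent" variable is noisy too
y = 2.0 * t + 1.0 + 0.1 * rng.standard_normal(200)

print("orthogonal fit :", tls_line(x, y))
print("ordinary LSQ   :", tuple(np.polyfit(x, y, 1)))  # slope biased towards zero
```

Ordinary least squares attributes all the scatter to $y$ and therefore systematically underestimates the slope when $x$ is noisy; the orthogonal fit treats both coordinates symmetrically, as the slide demands.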

ECE-7000: Nonlinear Dynamical Systems
12.6 Model verification

Model verification methods
- Up to now, the actual value of the cost function at its minimum is only a rough indicator of model quality.
- The simplest step is to subdivide the data set into a training set and a test set.
- The fit of the parameters is performed on the training set; the resulting model F is inserted into the cost function, whose value is then computed on the test set.
- If the error is significantly larger on the test set, something has gone wrong.

When the forecast error is a weak criterion
- If the difference between the embedding dimension and the attractor dimension is large, there is much freedom to construct dynamical equations with a small one-step error.
- Iterated points may nevertheless escape from the observed attractor.

Iterating the equations
- This is the most severe test: select one point as an initial condition and create a trajectory by iterating the fitted model (see the sketch below).
- In the ideal case, the resulting attractor should look like a skeleton inside the body of the noisy observations.
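
A sketch of this iteration test on the same kind of surrogate data as before (noisy logistic map; the quadratic model class and all values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy observations of a logistic map.
x = np.empty(400)
x[0] = 0.3
for n in range(399):
    x[n + 1] = 3.8 * x[n] * (1.0 - x[n])
s = x + 0.02 * rng.standard_normal(x.size)

# Fit a one-step model F and iterate it from a single initial condition.
coef = np.polyfit(s[:-1], s[1:], 2)
y = np.empty(1000)
y[0] = s[0]
for n in range(999):
    y[n + 1] = np.polyval(coef, y[n])

# The iterated trajectory should stay on a noise-free "skeleton" of the data;
# escaping from the observed range means the model fails this test.
print("observed range:", s.min(), s.max())
print("iterated range:", y.min(), y.max())
```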

ECE-7000: Nonlinear Dynamical Systems
Fokker-Planck equations from data

- The stochastic counterpart of the equation of motion of a deterministic dynamical system in continuous time is the Langevin equation:
  $$ \dot{\vec{x}} = \vec{f}(\vec{x}) + G(\vec{x})\,\vec{\Gamma}(t), \tag{1} $$
  where $\vec{x}$ is the phase space vector, $\vec{f}$ is a deterministic vector field in $\mathbb{R}^d$, and the noise term is composed of an $\vec{x}$-dependent $d \times d$ tensor $G$ and a white noise process $\vec{\Gamma}(t)$.
- Since individual solutions of equation (1) are highly unstable, one studies the time evolution of phase space densities instead. The equation of motion for the phase space density $p(\vec{x}, t)$ is called the Fokker-Planck equation:
  $$ \frac{\partial p(\vec{x},t)}{\partial t} = -\sum_i \frac{\partial}{\partial x_i}\!\left[ D_i^{(1)}(\vec{x})\, p \right] + \sum_{i,j} \frac{\partial^2}{\partial x_i \partial x_j}\!\left[ D_{ij}^{(2)}(\vec{x})\, p \right]. \tag{2} $$
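
For intuition, a minimal sketch that integrates a scalar Langevin equation with the Euler-Maruyama scheme; the Ornstein-Uhlenbeck drift $f(x) = -x$ and constant diffusion are illustrative choices whose stationary Fokker-Planck solution is known in closed form:

```python
import numpy as np

rng = np.random.default_rng(2)
D, dt, N = 0.5, 1e-3, 200_000

# Euler-Maruyama integration of  dx/dt = -x + sqrt(2 D) * Gamma(t).
x = np.empty(N)
x[0] = 0.0
for n in range(N - 1):
    x[n + 1] = x[n] - x[n] * dt + np.sqrt(2.0 * D * dt) * rng.standard_normal()

# The stationary density solving the corresponding Fokker-Planck equation is
# p(x) ~ exp(-x^2 / (2 D)), i.e. Gaussian with variance D.
print("sample variance:", x.var(), "  theory:", D)
```

Note that the sample path itself is irreproducible noise; only statistics such as the variance converge, which is exactly why the density description of equation (2) is preferred over individual solutions of equation (1).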

ECE-7000: Nonlinear Dynamical Systems
Fokker-Planck equations from data (continued)

- The transition from the Langevin equation to the Fokker-Planck equation and back is unique: the drift term $D^{(1)}$ and the diffusion tensor $D^{(2)}$ correspond to the deterministic and the stochastic parts of the Langevin equation,
  $$ D_i^{(1)}(\vec{x}) = f_i(\vec{x}), \qquad D_{ij}^{(2)}(\vec{x}) = \sum_k G_{ik}(\vec{x})\, G_{jk}(\vec{x}). $$
- If equations (1) and (2) on the previous slide are suitable descriptions for a given data set, one can either fit the Langevin equation to the observed data, or fit the Fokker-Planck equation to the observed data.
- Instead of these two methods, we can directly exploit the time series to estimate the drift and the diffusion terms of the Fokker-Planck equation.
- Under the assumption of time-independent parameters, drift and diffusion can be determined by the following conditional averages:
  $$ D_i^{(1)}(\vec{x}) = \lim_{\tau \to 0} \frac{1}{\tau} \left\langle y_i(t+\tau) - x_i \,\middle|\, \vec{y}(t) = \vec{x} \right\rangle, $$
  $$ D_{ij}^{(2)}(\vec{x}) = \lim_{\tau \to 0} \frac{1}{2\tau} \left\langle \bigl(y_i(t+\tau) - x_i\bigr)\bigl(y_j(t+\tau) - x_j\bigr) \,\middle|\, \vec{y}(t) = \vec{x} \right\rangle. $$

ECE-7000: Nonlinear Dynamical Systems
Fokker-Planck equations from data (continued)

- In practice, since $\vec{y}(t)$ equals $\vec{x}$ only with zero probability, one exploits the average over a neighbourhood $U_\epsilon(\vec{x})$ of $\vec{x}$ with a suitably chosen diameter $\epsilon$.
- The time interval $\tau$ is limited from below by the sampling interval $\Delta t$ of the data. A useful first-order correction in $\tau$ is given by the following estimate of the diffusion term:
  $$ \hat D_{ij}^{(2)}(\vec{x}) = \frac{1}{2\tau} \left\langle \left( y_i(t+\tau) - x_i - \tau \hat D_i^{(1)}(\vec{x}) \right)\!\left( y_j(t+\tau) - x_j - \tau \hat D_j^{(1)}(\vec{x}) \right) \middle|\, \vec{y}(t) \in U_\epsilon(\vec{x}) \right\rangle. $$
- In these expressions, knowledge of the time scale in physical units is not required for consistent estimates; both $\tau$ and $\Delta t$ can be measured in arbitrary temporal units. (A sketch of these estimators on simulated data follows below.)
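
The sketch below implements the conditional-average estimators on simulated Ornstein-Uhlenbeck data, where the true drift and diffusion are known; the neighbourhood diameter eps and all parameter values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
D, dt, N = 0.5, 1e-2, 500_000

# Simulated Ornstein-Uhlenbeck data: true drift D1(x) = -x, true diffusion D2 = 0.5.
y = np.empty(N)
y[0] = 0.0
for n in range(N - 1):
    y[n + 1] = y[n] - y[n] * dt + np.sqrt(2.0 * D * dt) * rng.standard_normal()

eps = 0.05                                        # neighbourhood diameter
for x0 in (-1.0, 0.0, 1.0):
    mask = np.abs(y[:-1] - x0) < eps              # condition: y(t) in U_eps(x0)
    dy = y[1:][mask] - y[:-1][mask]
    d1 = dy.mean() / dt                           # drift estimate
    d2 = np.mean((dy - d1 * dt) ** 2) / (2 * dt)  # diffusion, first-order corrected
    print(f"x = {x0:+.1f}:  D1 ~ {d1:+.3f} (true {-x0:+.1f})   D2 ~ {d2:.3f} (true {D})")
```

Here $\tau = \Delta t$, the smallest admissible choice; shrinking eps reduces the binning bias at the price of fewer points in each neighbourhood.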

ECE-7000: Nonlinear Dynamical Systems
Markov chains in embedding space

- Both autoregressive models and deterministic dynamical systems can be regarded as special classes of Markov models. Rather than providing a unique future state, a Markov model supplies a probability density for the occurrence of the future states.
- In continuous time, the Langevin equation generates a Markov model.
- In discrete time and discrete space, a simple transition matrix describes the Markov model.
- A univariate Markov model in continuous space but discrete time is often called a Markov chain, where the order of the Markov chain denotes how many past time steps are needed in order to define the current state vector.

Symbolic dynamics
- Here, time-discrete Markov models with a discrete state space occur.
- The big issue is to find a suitable partitioning of the state space.
- The probabilistic nature of the dynamics is introduced through the coarse graining (see the sketch below).
- A scalar, real-valued Markov chain of order $m$ produces a sequence of random variables $s_n$; its dynamics is fully characterised by the transition probability densities $p(s_n \mid s_{n-1}, \ldots, s_{n-m})$ from the last $m$ random variables onto the next one.
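
As a small illustration of coarse graining, the following sketch partitions a scalar series into four cells and estimates the first-order transition matrix; the test signal (a logistic map) and the partition boundaries are arbitrary choices for demonstration:

```python
import numpy as np

# Test signal: a logistic map trajectory, coarse-grained into four cells.
x = np.empty(10_000)
x[0] = 0.3
for n in range(9_999):
    x[n + 1] = 3.8 * x[n] * (1.0 - x[n])

symbols = np.digitize(x, bins=[0.25, 0.5, 0.75])   # symbols in {0, 1, 2, 3}

# Count first-order transitions and normalise each row to a probability.
T = np.zeros((4, 4))
for a, b in zip(symbols[:-1], symbols[1:]):
    T[a, b] += 1.0
T /= T.sum(axis=1, keepdims=True)
print(np.round(T, 2))                              # rows: P(next cell | current cell)
```

For a deterministic map, the apparent stochasticity of this matrix comes entirely from the coarse graining: within one cell, different states have different futures, so only transition probabilities survive.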

ECE-7000: Nonlinear Dynamical Systems
No embedding for Markov chains

Assume the following:
- A reasonable description of a noisy dynamical system is a vector-valued Langevin equation.
- The measurement apparatus provides us with a time series of a single observable only.

Then the Langevin dynamics generates a Markov process in continuous time in the original state space, and one might hope that:
- the scalar, real-valued time series represents a Markov chain of some finite order, and
- like in an embedding theorem, the order of this Markov chain is related to the dimensionality of the original space.

This is wrong: generically, the scalar observable does not form a Markov chain of any finite order. However, the memory decays fast, so a finite-order Markov chain may still be a reasonable approximation, where the errors made by ignoring some temporal correlations with the past are smaller than other modelling errors resulting from the finiteness of the data base.