Blind Source Separation by Independent Components Analysis
Professor Dr. Barrie W. Jervis, School of Engineering, Sheffield Hallam University, England

The Problem Temporally independent unknown source signals are linearly mixed in an unknown system to produce a set of measured output signals. It is required to determine the source signals.

Methods of solving this problem are known as Blind Source Separation (BSS) techniques. In this presentation the method of Independent Components Analysis (ICA) will be described. The arrangement is illustrated in the next slide.

Arrangement for BSS by ICA [Figure: block diagram. Sources s_1, …, s_n pass through the mixing matrix A to give the measured signals x_1, …, x_n; the unmixing matrix W produces the activations u_1, …, u_n; the nonlinearities g_i(·) give the outputs y_i = g_i(u_i).]

Neural Network Interpretation The s_i are the independent source signals, A is the linear mixing matrix, the x_i are the measured signals, W ≈ A⁻¹ is the estimated unmixing matrix, the u_i are the estimated source signals or activations (i.e. u_i ≈ s_i), the g_i(u_i) are monotonic nonlinear functions (sigmoids, hyperbolic tangents), and the y_i are the network outputs.
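As an illustration only, the following minimal sketch (Python/NumPy, with all variable names hypothetical) sets up the model just described: sources S are mixed by A to give the measured signals X, an unmixing matrix W recovers activations U ≈ S, and a sigmoidal g gives the outputs Y.

```python
import numpy as np

rng = np.random.default_rng(0)

n, T = 3, 1000                       # number of sources/sensors, time points
S = rng.laplace(size=(n, T))         # hypothetical independent source signals s_i
A = rng.normal(size=(n, n))          # unknown linear mixing matrix
X = A @ S                            # measured signals x_i (one row per channel)

# If an estimate W ~ A^-1 were available (taken here as the exact inverse,
# purely for illustration), the activations and outputs would be:
W = np.linalg.inv(A)                 # stand-in for the learned unmixing matrix
U = W @ X                            # activations u_i ~ s_i
Y = 1.0 / (1.0 + np.exp(-U))         # outputs y_i = g_i(u_i), sigmoidal g
```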

Principles of the Neural Network Approach Use information theory to derive an algorithm which minimises the mutual information between the outputs y = g(u). This also minimises the mutual information between the source-signal estimates u, since g(u) introduces no dependencies. The u_i are then temporally independent and are the estimated source signals.

Cautions I The magnitudes and signs of the estimated source signals are unreliable, since the magnitudes are not scaled and the signs are undefined: magnitude and sign information is shared between the source signal vector and the unmixing matrix, W. The order of the outputs is permuted compared with that of the inputs.

Cautions II Similar overlapping source signals may not be properly extracted. If the number of output channels is less than the number of source signals, the source signals of lowest variance will not be extracted. This is a problem when these signals are important.

Information Theory I If X is a vector of variables (messages) x_i which occur with probabilities P(x_i), then the average information content of a stream of N messages is

H(X) = -\sum_i P(x_i) \log_2 P(x_i)

bits, and is known as the entropy of the random variable X.
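A one-line numerical check of this formula (a sketch, not part of the original presentation; the function name is my own):

```python
import numpy as np

def entropy_bits(p):
    """H(X) = -sum_i P(x_i) * log2 P(x_i), in bits."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                     # zero-probability messages contribute nothing
    return -np.sum(p * np.log2(p))

print(entropy_bits([0.5, 0.5]))      # 1.0 bit for a fair coin
print(entropy_bits([0.9, 0.1]))      # about 0.47 bits: less uncertainty
```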

Information Theory II Note that the entropy is expressible in terms of probabilities. Given the probability density function (pdf) of X we can find the associated entropy. This link between entropy and the pdf is of the greatest importance in ICA theory.

Information Theory III The joint entropy of two random variables X and Y is given by

H(X, Y) = -\sum_x \sum_y P(x, y) \log_2 P(x, y).

For independent variables,

H(X, Y) = H(X) + H(Y).

Information Theory IV The conditional entropy of Y given X measures the average uncertainty remaining about y when x is known, and is

H(Y|X) = H(X, Y) - H(X).

The mutual information between Y and X is

I(Y, X) = H(Y) - H(Y|X).

In ICA, X represents the measured signals, which are applied to the nonlinear function g(u) to obtain the outputs Y.
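These relationships can be checked numerically for a small discrete case; the joint probability table below is invented purely for illustration.

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Hypothetical joint probability table P(x, y) for discrete X and Y
Pxy = np.array([[0.25, 0.25],
                [0.10, 0.40]])

Px, Py = Pxy.sum(axis=1), Pxy.sum(axis=0)
H_X, H_Y = entropy(Px), entropy(Py)
H_XY = entropy(Pxy.ravel())          # joint entropy H(X, Y)

H_Y_given_X = H_XY - H_X             # conditional entropy H(Y|X)
I_YX = H_Y - H_Y_given_X             # mutual information I(Y, X)
print(H_XY, H_Y_given_X, I_YX)
```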

Bell and Sejnowski's ICA Theory (1995) Aim: maximise the mutual information between the inputs X and the outputs Y of the neural network, i.e. the reduction in uncertainty about Y gained by knowing X. Y is a function of W and g(u). Here we seek to determine the W which produces u_i ≈ s_i, assuming the correct g(u).

Since I(Y, X) = H(Y) - H(Y|X), differentiating with respect to W gives

\frac{\partial I(Y, X)}{\partial W} = \frac{\partial H(Y)}{\partial W},

because \partial H(Y|X)/\partial W = 0 (that term did not come through W from X). So maximising this mutual information is equivalent to maximising the joint output entropy, which in turn is equivalent to minimising the mutual information between the outputs, and hence between the u_i, as desired.

The Functions g(u) The outputs y_i are amplitude-bounded random variables, and so the marginal entropies H(y_i) are maximised when the y_i are uniformly distributed (a known statistical result). With the H(y_i) maximised, the mutual information between the outputs zero, and the y_i uniformly distributed, the nonlinearity g_i(u_i) has the form of the cumulative distribution function of the probability density function of the s_i (a proven result).

Pause and review g(u) and W W has to be chosen to maximise the joint output entropy H(Y), which minimises the mutual information between the estimated source signals, u_i. The g(u) should be the cumulative distribution functions of the source signals, s_i. Determining the g(u) is a major problem.

One input and one output For a monotonic nonlinear function g(x), the pdf of the output is

p_y(y) = \frac{p_x(x)}{|\partial y / \partial x|}.

Also,

H(y) = -E[\ln p_y(y)].

Substituting:

H(y) = E\left[\ln \left|\frac{\partial y}{\partial x}\right|\right] - E[\ln p_x(x)],

where only the first term need be maximised (the second is independent of W).

A stochastic gradient ascent learning rule is adopted to maximise H(y) by assuming

\Delta w \propto \frac{\partial H(y)}{\partial w} = \frac{\partial}{\partial w}\left(\ln\left|\frac{\partial y}{\partial x}\right|\right).

Further progress requires knowledge of g(u). Assume for now, after Bell and Sejnowski, that g(u) is sigmoidal, i.e.

y = \frac{1}{1 + e^{-u}}.

Also assume

u = wx + w_0.

Learning Rule: 1 input, 1 output Hence, we find:

\Delta w \propto \frac{1}{w} + x(1 - 2y), \qquad \Delta w_0 \propto 1 - 2y.

Learning Rule: N inputs, N outputs Need

\Delta W \propto \frac{\partial H(\mathbf{y})}{\partial W}.

Assuming g(u) is sigmoidal again, we obtain:

\Delta W \propto \left[W^{\mathsf T}\right]^{-1} + (\mathbf{1} - 2\mathbf{y})\,\mathbf{x}^{\mathsf T}.

The network is trained until the changes in the weights become acceptably small at each iteration. Thus the unmixing matrix W is found.

The Natural Gradient The computation of the inverse matrix is time-consuming, and may be avoided by rescaling the entropy gradient by multiplying it by W^{\mathsf T} W. Thus, for a sigmoidal g(u) we obtain

\Delta W \propto \left[I + (\mathbf{1} - 2\mathbf{y})\,\mathbf{u}^{\mathsf T}\right] W.

This is the natural gradient, introduced by Amari (1998), and now widely adopted.
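For concreteness, a minimal batch implementation of this natural-gradient update is sketched below (Python/NumPy). It assumes zero-mean measured signals and a sigmoidal g(u), and is a bare-bones illustration rather than a production algorithm (no bias weights, annealing or convergence test).

```python
import numpy as np

def infomax_natural_gradient(X, lr=0.01, n_iter=500, seed=0):
    """Batch Infomax ICA with the natural-gradient update
    dW = lr * [I + (1 - 2y) u^T / T] W, for sigmoidal g(u).
    X is (channels x time); returns the estimated unmixing matrix W."""
    rng = np.random.default_rng(seed)
    n, T = X.shape
    W = np.eye(n) + 0.01 * rng.normal(size=(n, n))   # small random start
    I = np.eye(n)
    for _ in range(n_iter):
        U = W @ X                                    # current activations
        Y = 1.0 / (1.0 + np.exp(-U))                 # y = g(u)
        W += lr * (I + (1.0 - 2.0 * Y) @ U.T / T) @ W
    return W

# Usage: W = infomax_natural_gradient(X); U = W @ X gives the activations.
```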

The nonlinearity, g(u) We have already learnt that the g(u) should be the cumulative distribution functions of the individual source distributions. So far the g(u) have been assumed to be sigmoidal, so what are the pdfs of the s_i? The corresponding pdfs of the s_i are super-Gaussian.

Super- and sub-Gaussian pdfs [Figure: sketches of a Gaussian pdf, a super-Gaussian pdf (more sharply peaked, heavier-tailed) and a sub-Gaussian pdf (flatter).] Note: there are no strict mathematical definitions of super- and sub-Gaussians.

Super- and sub-Gaussians
Super-Gaussians: kurtosis (fourth-order central moment, which measures the flatness of the pdf) > 0; infrequent signals of short duration, e.g. evoked brain signals.
Sub-Gaussians: kurtosis < 0; signals mainly "on", e.g. 50/60 Hz electrical mains supply, but also eye blinks.

Kurtosis Kurtosis is based on the 4th-order central moment; for zero-mean u_i,

\operatorname{kurt}(u_i) = E\{u_i^4\} - 3\left(E\{u_i^2\}\right)^2,

and is seen to be calculated from the current estimates of the source signals. To separate the independent sources, information about their pdfs such as skewness (3rd moment) and flatness (kurtosis) is required; the first and 2nd moments (mean and variance) are insufficient.
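A small helper for estimating this quantity from the current activation estimates might look as follows (a sketch; the function name is my own).

```python
import numpy as np

def kurtosis(u):
    """Fourth-order central-moment kurtosis estimate:
    kurt(u) = E[(u - mu)^4] - 3 (E[(u - mu)^2])^2, zero for a Gaussian."""
    u = np.asarray(u, dtype=float) - np.mean(u)
    return np.mean(u**4) - 3.0 * np.mean(u**2)**2

# kurtosis(U[i, :]) > 0 suggests a super-Gaussian source, < 0 a sub-Gaussian one.
```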

A more generalised learning rule Girolami (1997) showed that tanh(u_i) and -tanh(u_i) could be used for super- and sub-Gaussians respectively. Cardoso and Laheld (1996) developed a stability analysis to determine whether the source signals were to be considered super- or sub-Gaussian. Lee, Girolami, and Sejnowski (1998) applied these findings to develop their extended infomax algorithm for super- and sub-Gaussians using a kurtosis-based switching rule.

Extended Infomax Learning Rule With super-Gaussians modelled as

p(u_i) \propto p_G(u_i)\,\operatorname{sech}^2(u_i)

(a Gaussian density p_G weighted by sech²), and sub-Gaussians as a Pearson mixture model

p(u_i) = \tfrac{1}{2}\left[N(\mu, \sigma^2) + N(-\mu, \sigma^2)\right],

the new extended learning rule is

\Delta W \propto \left[I - K\tanh(\mathbf{u})\mathbf{u}^{\mathsf T} - \mathbf{u}\mathbf{u}^{\mathsf T}\right]W, \qquad k_i = +1 \text{ (super-Gaussian)}, \; k_i = -1 \text{ (sub-Gaussian)}.

Switching Decision

k_i = \operatorname{sign}\!\left(E\{\operatorname{sech}^2(u_i)\}\,E\{u_i^2\} - E\{u_i \tanh(u_i)\}\right),

and the k_i are the elements of the N-dimensional diagonal matrix K. Modifications of the formula for k_i exist, but in our experience the extended algorithm has been unsatisfactory.
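A sketch of one batch step of the extended rule, including this switching decision, is given below (my own minimal NumPy rendering of the published update, not the authors' code).

```python
import numpy as np

def extended_infomax_step(W, X, lr=0.01):
    """One batch step of the extended Infomax rule
    dW = lr * [I - K tanh(U) U^T / T - U U^T / T] W,
    with k_i = sign( E[sech^2(u_i)] E[u_i^2] - E[u_i tanh(u_i)] )."""
    n, T = X.shape
    U = W @ X
    k = np.sign(np.mean(np.cosh(U)**-2, axis=1) * np.mean(U**2, axis=1)
                - np.mean(U * np.tanh(U), axis=1))
    K = np.diag(k)                      # +1 for super-, -1 for sub-Gaussian
    I = np.eye(n)
    return W + lr * (I - K @ np.tanh(U) @ U.T / T - U @ U.T / T) @ W
```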

Reasons for the unsatisfactory extended algorithm 1) The initial assumptions about the super- and sub-Gaussian distributions may be too inaccurate. 2) The switching criterion may be inadequate. Alternatives: postulate vague distributions for the source signals which are then developed iteratively during training, or use an alternative approach, e.g. a statistically based method such as JADE (Cardoso).

Summary so far We have seen how W may be obtained by training the network, and the extended algorithm for switching between super- and sub-Gaussians has been described. Alternative approaches have been mentioned. Next we consider how to obtain the source signals knowing W and the measured signals, x.

Source signal determination The system is: unknown sources s_i → mixing matrix A → measured signals x_i → unmixing matrix W → estimated activations u_i ≈ s_i → g(u) → outputs y_i. Hence U = W·x and x = A·S, where A ≈ W⁻¹ and U ≈ S. The rows of U are the estimated source signals, known as activations (as functions of time). The rows of x are the time-varying measured signals.

Source Signals [Figure: example estimated source-signal activations, plotted channel by channel against time (sample number).]

Expressions for the Activations We see that consecutive values of u_i are obtained by filtering consecutive columns of x with the same row of W:

u_i(t) = \sum_j w_{ij}\, x_j(t),

i.e. the ith row of U is the ith row of W multiplied by the columns of x.
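In matrix form this is simply U = W·x, as the following check illustrates (stand-in matrices, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(3, 3))          # stand-in unmixing matrix
X = rng.normal(size=(3, 500))        # measured signals: rows = channels, columns = time

U = W @ X                            # all activations at once; row i of U is u_i(t)
i = 0
u_i = W[i, :] @ X                    # ith row of W times the columns of x
assert np.allclose(u_i, U[i, :])
```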

Procedure
Record N time points from each of M sensors, where N ≥ 5M.
Pre-process the data, e.g. filtering, trend removal.
Sphere the data using Principal Components Analysis (PCA); a sketch is given below. This is not essential, but it speeds up the computation by first removing the first- and second-order moments.
Compute the u_i ≈ s_i, including desphering.
Analyse the results.
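The sphering step can be sketched as follows (an eigenvalue-based whitening; whether and how to sphere is a design choice, and this is only one reasonable version).

```python
import numpy as np

def sphere(X):
    """PCA-based sphering: remove the mean and decorrelate to unit variance.
    Returns the sphered data Z and the sphering matrix S, with Z = S (X - mean)."""
    Xc = X - X.mean(axis=1, keepdims=True)
    C = np.cov(Xc)                                   # channel covariance matrix
    vals, vecs = np.linalg.eigh(C)
    S = vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T
    return S @ Xc, S

# Z, S = sphere(X); train ICA on Z, then the overall unmixing matrix is W_total = W @ S.
```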

Optional Procedures I The contribution of each activation at a sensor may be found by “back-projecting” it to the sensor.

Optional Procedures II A measured signal which is contaminated by artefacts or noise may be recovered by "back-projecting" all the signal activations to the measurement electrode, setting the other activations to zero (an artefact and noise removal method), as sketched below.
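A minimal sketch of this back-projection (the indices in `keep` are whichever activations are judged to be genuine signal; the function name is my own):

```python
import numpy as np

def back_project(W, U, keep):
    """Back-project only the activations listed in `keep` to the sensors,
    setting all other activations to zero (artefact/noise removal)."""
    A = np.linalg.inv(W)             # A ~ W^-1; its columns map activations to sensors
    U_clean = np.zeros_like(U)
    U_clean[keep, :] = U[keep, :]
    return A @ U_clean               # cleaned estimates of the measured signals

# Example: X_clean = back_project(W, U, keep=[0, 2])  # e.g. drop an eye-blink component
```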

Current Developments Overcomplete representations - more signal sources than sensors. Nonlinear mixing. Nonstationary sources. General formulation of g(u).

Conclusions It has been shown how to extract temporally independent unknown source signals from their linear mixtures at the outputs of an unknown system using Independent Components Analysis. Some of the limitations of the method have been mentioned. Current developments have been highlighted.