
Blind Source Separation: PCA & ICA
HST.582J/6.555J/16.456J
Gari D. Clifford, Associate Director, Centre for Doctoral Training, IBME, University of Oxford
© G. D. Clifford

What is BSS?
Assume an observation (signal) is a linear mix of more than one unknown, independent source signal
The mixing (not the signals) is stationary
We have as many observations as unknown sources
To find the sources in the observations, we need to define a suitable measure of independence
For example, the cocktail party problem (the sources are the speakers and background noise):

The cocktail party problem - find Z
[Figure: sources z_1 … z_N are mixed through A to give observations x_1 … x_N: X^T = AZ^T]

Formal statement of the problem
N independent sources: Z (M x N)
Linear square mixing: A (N x N) (#sources = #sensors)
produces a set of observations: X (M x N)
X^T = AZ^T
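As a concrete illustration of this model, a minimal MATLAB sketch (the source waveforms and the values in A are arbitrary choices for illustration, not from the lecture):

```matlab
% Simulate the mixing model X^T = A Z^T with N = 2 sources and M samples
M = 1000;                                        % number of samples
Z = [sin(2*pi*0.01*(1:M))', sign(randn(M,1))];   % two independent sources, Z is M x N
A = [0.8 0.3; 0.4 0.9];                          % unknown square mixing matrix, N x N
X = (A * Z')';                                   % observations: X^T = A Z^T, so X is M x N
```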

Formal statement of the solution
'Demix' the observations X^T (N x M) into Y^T = WX^T
Y^T (N x M) ≈ Z^T, with W (N x N) ≈ A^-1
How do we recover the independent sources? (We are trying to estimate W ≈ A^-1)
We require a measure of independence!

[Figure: the 'signal' source and 'noise' sources Z^T, the observed mixtures X^T = AZ^T, and the recovered estimates Y^T = WX^T]

[Figure: matrix form of the mixing and demixing: X^T = AZ^T and Y^T = WX^T]

The Fourier Transform (Independence between components is assumed)

Recap: Non-causal Wiener filtering
x[n]: observation, y[n]: ideal signal, d[n]: noise component
[Figure: spectra of the ideal signal S_y(f), the noise power S_d(f), and the observation S_x(f)]
Filtered signal: S_filt(f) = S_x(f).H(f)
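For completeness, a sketch of the corresponding frequency-domain filter, H(f) = S_y(f) / (S_y(f) + S_d(f)) for uncorrelated signal and noise (the spectra below are invented purely for illustration):

```matlab
% Non-causal Wiener filtering in the frequency domain (illustrative spectra)
f  = linspace(0, 0.5, 256);               % normalised frequency axis
Sy = exp(-((f - 0.05)/0.03).^2);          % hypothetical ideal-signal spectrum
Sd = 0.2*ones(size(f));                   % hypothetical flat noise spectrum
Sx = Sy + Sd;                             % observed spectrum (signal and noise uncorrelated)
H     = Sy ./ (Sy + Sd);                  % Wiener filter frequency response
Sfilt = Sx .* H;                          % S_filt(f) = S_x(f).H(f)
plot(f, Sx, f, Sfilt); legend('S_x', 'S_{filt}');
```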

BSS is a transform?
Like Fourier, we decompose the observations into components by transforming them into another vector space, one that maximises the separation between interesting (signal) and unwanted (noise) components
Unlike Fourier, the separation is based on independence, not on frequency:
Sources can have the same frequency content
No assumptions are made about the signals, other than that they are independent and linearly mixed
So BSS can filter/separate in-band noise and signals

Principal Component Analysis
Second-order decorrelation = independence
Find a set of orthogonal axes in the data (independence metric = variance)
Project the data onto these axes to decorrelate
Independence is forced onto the data through the orthogonality of the axes
A conventional noise/signal separation technique

Singular Value Decomposition
Decompose the observation X = AZ into X = USV^T
S is a diagonal matrix of singular values, with elements arranged in descending order of magnitude (the singular spectrum)
The columns of V are the eigenvectors of C = X^T X (the orthogonal subspace: v_i . v_j = 0 for i ≠ j); they 'demix' or rotate the data
U is the matrix of projections of X onto the eigenvectors of C: the 'source' estimates

Singular Value Decomposition
[Figure: decomposing the observation X = AZ into X = USV^T]

Eigenspectrum of the decomposition
S = the singular matrix: zeros except on the leading diagonal
The diagonal elements S_ij (i = j) are the (eigenvalues)^1/2, placed in order of descending magnitude
They correspond to the magnitude of the projected data along each eigenvector
Eigenvectors are the axes of maximal variation in the data
Variance = power (analogous to Fourier components in power spectra)
Eigenspectrum = plot of the eigenvalues [stem(diag(S).^2)]

SVD: Method for PCA
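A minimal MATLAB sketch of PCA via the SVD, continuing the notation above (X is an M x N observation matrix, for example the simulated one from the earlier sketch; variable names are illustrative):

```matlab
% PCA of the observations via the SVD:  X = U S V'
Xc = X - repmat(mean(X,1), size(X,1), 1);   % subtract the column means first
[U, S, V] = svd(Xc, 'econ');                % economy-size SVD
stem(diag(S).^2);                           % the eigenspectrum, as on the previous slide
PCs = Xc * V;                               % projections onto the eigenvectors (equals U*S)
```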

SVD noise/signal separation
To perform SVD filtering of a signal, use a truncated SVD decomposition (using the first p eigenvectors): Y = US_pV^T
[Reduce the dimensionality of the data by setting the noise projections to zero (S_noise = 0), then reconstruct the data from the signal subspace alone]
Most of the signal is contained in the first few principal components. Discarding the remaining components and projecting back into the original observation space effects a noise filtering, or a noise/signal separation.
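A sketch of the truncated reconstruction described above (p is normally chosen by inspecting the eigenspectrum; here it is simply set by hand):

```matlab
% Truncated SVD: keep the first p 'signal' components, set the rest to zero
p  = 2;                        % number of principal components to retain
Sp = S;
Sp(p+1:end, p+1:end) = 0;      % discard the noise projections (S_noise = 0)
Xp = U * Sp * V';              % rank-p reconstruction, Y = U S_p V' (X_p on a later slide)
```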

[Figure: real data, and the same data with noise added]

[Figure: real data X, the eigenspectrum S^2, and truncated reconstructions X_p = US_pV^T for p = 2 and p = 4]

Two dimensional example

Independent Component Analysis
As in PCA, we are looking for N different vectors onto which we can project our observations to give a set of N maximally independent signals (sources)
Output data (discovered sources) dimensionality = dimensionality of the observations
Instead of using variance as our independence measure (i.e. decorrelating), as we do in PCA, we use a measure of how statistically independent the sources are

ICA: the basic idea...
Assume the underlying source signals (Z) are independent.
Assume a linear mixing matrix (A): X^T = AZ^T
To find Y (≈ Z), find W (≈ A^-1): Y^T = WX^T
How? Initialise W and iteratively update it to minimise or maximise a cost function that measures the (statistical) independence between the columns of Y^T.

Non-Gaussianity ⇒ statistical independence?
From the Central Limit Theorem: add enough independent signals together and the result has a Gaussian PDF
[Figure: sources Z with sub-Gaussian P(Z); mixtures X^T = AZ^T with near-Gaussian P(X)]

Recap: Moments of a distribution

Higher order moments (3rd: skewness) [figure]

Higher order moments (4th: kurtosis)
[Figure: super-Gaussian and sub-Gaussian distributions]
Gaussians are mesokurtic, with κ = 3
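A quick numerical check of these definitions, computing the (non-excess) sample kurtosis directly so no toolbox is needed (the three test distributions are arbitrary examples):

```matlab
% Sample kurtosis: kappa = E[(x - mu)^4] / sigma^4
kurt = @(x) mean((x - mean(x)).^4) ./ std(x,1).^4;
n = 1e5;
kurt(randn(n,1))        % Gaussian: mesokurtic, kappa ~ 3
kurt(rand(n,1))         % uniform: sub-Gaussian, kappa ~ 1.8
kurt(randn(n,1).^3)     % heavy-tailed: super-Gaussian, kappa >> 3
```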

Non-Gaussianity ⇒ statistical independence?
Central Limit Theorem: add enough independent signals together and the result has a Gaussian PDF, so make the data components non-Gaussian to find independent sources
[Figure: sources Z with P(Z) having κ < 3 (1.8); mixtures X^T = AZ^T with P(X) having κ = 3.4 and κ = 3]

Recall: trying to estimate W
Assume the underlying source signals (Z) are independent.
Assume a linear mixing matrix (A): X^T = AZ^T
To find Y (≈ Z), find W (≈ A^-1): Y^T = WX^T
Initialise W and iteratively update it by gradient ascent/descent to maximise kurtosis.

Gradient descent to find W
Given a cost function ε, we update each element w_ij of W at each step τ,
w_ij(τ+1) = w_ij(τ) + η ∂ε/∂w_ij   (for gradient ascent; the sign is reversed for descent),
and recalculate the cost function (η is the learning rate (~0.1), which controls the speed of convergence).
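A toy sketch of this loop for a 2 x 2 W, using a numerical gradient of a kurtosis-based cost. This is a simplification for illustration: it assumes the observation matrix X from the earlier mixing sketch, and practical implementations such as FastICA instead whiten the data and use analytic updates with an orthogonality constraint so the rows of W do not converge onto the same source.

```matlab
% Gradient ascent on a kurtosis-based cost to estimate the demixing matrix W.
% X is M x 2 (observations); y_i = X * W(i,:)' are the current source estimates.
kurt = @(y) mean((y - mean(y)).^4) / std(y,1)^4;
cost = @(W) abs(kurt(X*W(1,:)') - 3) + abs(kurt(X*W(2,:)') - 3);  % total non-Gaussianity
W   = eye(2);                              % initialise W
eta = 0.1;                                 % learning rate
for step = 1:200
    G = zeros(2);
    for i = 1:2                            % numerical gradient d(cost)/d(w_ij)
        for j = 1:2
            dW = zeros(2); dW(i,j) = 1e-4;
            G(i,j) = (cost(W + dW) - cost(W)) / 1e-4;
        end
    end
    W = W + eta * G / (norm(G) + eps);     % take a small step uphill in the cost
end
Y = (W * X')';                             % estimated sources, Y^T = W X^T
```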

[Figure: weight updates w_ij vs. iteration (gradient ascent), converging to find W = [1 3; -2 -1]]

Gradient descent
[Figure: cost surface; minimising 1/|κ_1| and 1/|κ_2| corresponds to maximising |κ|]

Gradient descent
[Figure: cost surface; ε = min of (1/|κ_1|, 1/|κ_2|), i.e. |κ| = max]
The cost function ε can be either a maximum (of κ) or a minimum (of 1/κ).

Gradient descent example
Imagine a 2-channel ECG comprising two sources: cardiac and noise, with SNR = 1
[Figure: the two observed channels X1 and X2]

Iteratively update W and measure the cost function
[Figure: current source estimates Y1 and Y2]

[Figures: successive iterations of the source estimates Y1 and Y2 as W is updated]

Maximized κ for the non-Gaussian signal
[Figure: final source estimates Y1 and Y2]

Outlier insensitive ICA cost functions

Measures of statistical independence
In general we require a measure of statistical independence which we maximise between each of the N components. Non-Gaussianity is one approximation, but it is sensitive to small changes in the distribution tail. Other measures include:
Mutual Information, I
Entropy (Negentropy, J)
Maximum (Log) Likelihood
(Note: all are related to κ)

Entropy-based cost function
Kurtosis is highly sensitive to small changes in distribution tails. A more robust measure of Gaussianity is based on the differential entropy H(y), via the negentropy
J(y) = H(y_gauss) − H(y),
where y_gauss is a Gaussian variable with the same covariance matrix as y. J(y) can be estimated from kurtosis.
Entropy is a measure of randomness: Gaussians are maximally random (for a given covariance).
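One standard kurtosis-based approximation, stated here for reference rather than taken from the slides (valid for zero-mean, unit-variance y, where kurt(y) denotes the excess kurtosis, i.e. κ − 3 in the notation of the earlier slides):

$$ J(y) \;\approx\; \tfrac{1}{12}\,E\{y^{3}\}^{2} \;+\; \tfrac{1}{48}\,\mathrm{kurt}(y)^{2}. $$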

Minimising Mutual Information
The mutual information (MI) between two vectors x and y is I = H_x + H_y − H_xy. It is always non-negative, and zero if the variables are independent; therefore we want to minimise MI.
MI can be re-written in terms of negentropy (up to a constant c): it differs from negentropy by a constant and a sign change.
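For N estimated components the same quantity generalises to (a standard identity, included for completeness):

$$ I(y_1,\dots,y_N) \;=\; \sum_{i=1}^{N} H(y_i) \;-\; H(y_1,\dots,y_N) \;\ge\; 0, $$

which is zero exactly when the components are independent, so minimising it with respect to W is a valid ICA criterion.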

Independent source discovery using Maximum Likelihood
Generative latent variable modelling: N observables, X, arise from N sources, z_i, through a linear mapping W = w_ij
The latent variables are assumed to be independently distributed
Find the elements of W by gradient ascent, iteratively updating by a step proportional to the learning rate η (a constant) times the gradient of the objective cost function, the log likelihood
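For reference, the log likelihood being climbed can be written in its standard ICA form (with assumed source densities p_i and observations x(t), t = 1 … T; added here for completeness):

$$ \log L(W) \;=\; \sum_{t=1}^{T} \sum_{i=1}^{N} \log p_i\!\big(\mathbf{w}_i^{T}\mathbf{x}(t)\big) \;+\; T \log\lvert\det W\rvert, $$

where w_i^T is the i-th row of W.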

… some real examples using ICA The cocktail party problem revisited

Observations: the separation of the mixed observations into source estimates is excellent, apart from:
The order of the sources has changed
The signals have been scaled
Why? In X^T = AZ^T, insert a permutation matrix B: X^T = AB B^-1 Z^T
B^-1 Z^T = the sources with a different column order; A → AB means the sources change by a scaling
ICA solutions are therefore order and scale independent, because κ is dimensionless

Separation of sources in the ECG

 =3  =11  =0  =3  =1  =5  =0  =1.5

Transformation inversion for filtering
Problem: we can never know whether the sources really reflect the actual source generators; there is no gold standard. De-mixing might alter the clinical relevance of the ECG features.
Solution: identify the unwanted sources, set the corresponding (p) columns of W^-1 to zero (giving W_p^-1), then multiply back through to remove the 'noise' sources and transform back into the original observation space.
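A sketch of this reconstruction step in MATLAB (W is the estimated demixing matrix, X the M x N observations, and p a vector indexing the sources judged to be noise; all names are illustrative):

```matlab
% Remove unwanted ICA sources and project back into the observation space
Yt   = W * X';            % source estimates, Y^T = W X^T  (N x M)
Winv = inv(W);            % inverse transform, W^-1 (approximately A)
Winv(:, p) = 0;           % zero the columns corresponding to the unwanted sources
Xfilt = (Winv * Yt)';     % X_filt = (W_p^-1 Y^T)^T, back in the original space (M x N)
```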

Transformation inversion for filtering

[Figure: real data X, sources Y = WX (≈ Z), and filtered data X_filt = W_p^-1 Y]

ICA results
[Figure: observations X, sources Y = WX (≈ Z), and filtered data X_filt = W_p^-1 Y]

Summary
PCA is good for Gaussian noise separation; ICA is good for non-Gaussian 'noise' separation
PCs have an obvious meaning: the highest-energy components
ICA-derived sources have arbitrary scaling/inversion and ordering, so an energy-independent heuristic is needed to identify signals/noise
The order of the ICs changes, because the IC space is derived from the data; the PC space only changes if the SNR changes
ICA assumes a linear mixing matrix
ICA assumes stationary mixing
De-mixing performance is a function of lead position
ICA requires as many sensors (ECG leads) as sources
Filtering: discard certain dimensions, then invert the transformation
In-band noise can be removed, unlike with Fourier!

Fetal ECG lab preparation

Fetal abdominal recordings
The maternal ECG is much larger in amplitude
Maternal and fetal ECG overlap in the time domain
Maternal features are broader, but the fetal ECG is in-band of the maternal ECG (they overlap in the frequency domain)
[Figure: 5-second window with maternal QRS and fetal QRS marked; maternal HR = 72 bpm, fetal HR = 156 bpm]

MECG & FECG spectral properties
[Figure: power spectra showing the fetal QRS power region, ECG envelope, movement artifact, QRS complex, P & T waves, muscle noise, and baseline wander]

Fetal / maternal mixture
[Figure: maternal, fetal, and noise components]

Appendix: Outlier insensitive ICA cost functions

Measures of statistical independence
In general we require a measure of statistical independence which we maximise between each of the N components. Non-Gaussianity is one approximation, but it is sensitive to small changes in the distribution tail. Other measures include:
Mutual Information, I
Entropy (Negentropy, J)
Maximum (Log) Likelihood
(Note: all are related to κ)

Entropy-based cost function
Kurtosis is highly sensitive to small changes in distribution tails. A more robust measure of Gaussianity is based on the differential entropy H(y), via the negentropy
J(y) = H(y_gauss) − H(y),
where y_gauss is a Gaussian variable with the same covariance matrix as y. J(y) can be estimated from kurtosis.
Entropy is a measure of randomness: Gaussians are maximally random (for a given covariance).

Minimising Mutual Information
The mutual information (MI) between M random variables is always non-negative, and zero if the variables are independent, so we minimise MI (for two variables, I = H_x + H_y − H_xy).
MI can be re-written in terms of negentropy (up to a constant c): it differs from negentropy by a constant and a sign change.

Independent source discovery using Maximum Likelihood
Generative latent variable modelling: N observables, X, arise from N sources, z_i, through a linear mapping W = w_ij
The latent variables are assumed to be independently distributed
Find the elements of W by gradient ascent, iteratively updating by a step proportional to the learning rate η (a constant) times the gradient of the objective cost function, the log likelihood

Appendix Worked example (see lecture notes)

Worked example