# Independent Component Analysis: The Fast ICA algorithm

Independent Component Analysis: The Fast ICA algorithm
Jonathan Kam EE 645

Overview
The Problem. Definition of ICA. Restrictions. Ways to solve ICA: nongaussianity, mutual information, maximum likelihood. Fast ICA algorithm. Simulations. Conclusion.

This is a brief overview of what I would like to talk about today. First we will go over the problem, that is, why people started to research ICA. I will then go over some of the restrictions that are needed in order for this algorithm to work. After that, I will talk about the different ways in which ICA can be solved, and a little about the pros and cons of each method. Then I will go into detail about the Fast ICA algorithm, go over some simulations, and conclude with other ways in which ICA can be used.

The Problem
Cocktail party problem: several sources, several sensors. Example: humans hear a mixed signal but are able to unmix it and concentrate on a sole source. Recover the source signals given only the mixtures, with no prior knowledge of the sources or the mixing matrix. Also known as Blind Source Separation (BSS).

The original problem stems from the cocktail party problem: you are at a party with several conversations going on around the room. Even though we hear these conversations as mixed signals in our ears, humans are able to unmix these signals into different components, so that we can concentrate on a sole conversation. The model of the problem is this. You have several sources about which you have no prior information. You then have several sensors, each of which picks up a linear mixture of these source signals. How can you recover the source signals given only the mixtures? This problem is commonly known as Blind Source Separation or Blind Signal Separation.

Assumptions
Source signals are statistically independent: knowing the value of one of the components does not give any information about the others. ICs have nongaussian distributions: the distributions are initially unknown, and at most one source may be Gaussian. Recovered sources can be permuted and scaled.

In order to solve this problem using ICA, several assumptions have to be made so that the independent components can be determined correctly. The first assumption is that all source signals are statistically independent, meaning that knowing the value of one component gives no information about the other components. The second is that the ICs have some unknown nongaussian distributions. If we knew the distributions of the sources, the problem would be simpler and we would not need ICA to solve it. Strictly speaking, this assumption is stronger than necessary, because the BSS problem can still be solved if at most one source is Gaussian; I will explain later why at most one Gaussian source is allowed. The last point is that the recovered sources will not be equal to the original sources: they will be permuted and scaled versions of the original signals.

Definition of ICA
Observe n linear mixtures $x_1, \dots, x_n$ of n independent components: $x_j = a_{j1}s_1 + a_{j2}s_2 + \dots + a_{jn}s_n$, for all j. The coefficients $a_{j1}, \dots, a_{jn}$ form row j of the mixing matrix A. Assume each mixture $x_j$ and each IC $s_k$ is a random variable; the time difference between mixes is dropped. The independent components are latent variables: they cannot be directly observed.

Now I will go over the basic definition of ICA. First is the input: ICA is given n observed linear mixtures $x_1$ through $x_n$ of n independent components. The coefficients $a_{jk}$ seen in the equation are the entries of the mixing matrix, and $s_k$ is the independent component that we wish to find. It is assumed that each mixture $x_j$ and each IC $s_k$ is a random variable. For simplicity, the time difference between the captured mixed signals has been dropped from the model. Because we receive only the mixtures, the independent components are latent variables, meaning they cannot be directly observed.

Definition of ICA
ICA mixture model: $x = As$, where A is the mixing matrix and s contains the source signals. Goal: find some matrix W so that $s = Wx$; ideally $W = A^{-1}$.

Now that we understand the model of our incoming mixtures, the ICA model can be defined as $x = As$, where A is some unknown mixing matrix, s is our source signals, and x is our observed mixtures. With this model, the goal of ICA is to find some matrix W so that $s = Wx$. The ideal solution would therefore be a W equal to the inverse of the mixing matrix A.
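The mixing model can be made concrete with a small NumPy sketch. The sources, the mixing matrix `A`, and the sample count are illustrative assumptions, not values from the presentation; in a real blind separation problem `A` is unknown and `W` has to be estimated.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical nongaussian sources, one row per source.
s = rng.uniform(-1.0, 1.0, size=(2, 1000))

# Hypothetical mixing matrix A; in a real BSS problem it is unknown.
A = np.array([[1.0, 0.5],
              [0.3, 2.0]])

x = A @ s                 # observed mixtures, x = As
W = np.linalg.inv(A)      # ideal unmixing matrix, W = A^{-1}
s_hat = W @ x             # recovered sources, s = Wx

print(np.allclose(s_hat, s))
```

Here `W` is computed from the known `A` only to show what the ideal answer looks like.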

Definition: Independence
Two variables $y_1$ and $y_2$ are independent if $E\{h_1(y_1)h_2(y_2)\} = E\{h_1(y_1)\}E\{h_2(y_2)\}$ for any functions $h_1$ and $h_2$. If variables are independent, they are uncorrelated. Uncorrelated variables are defined by $E\{y_1y_2\} - E\{y_1\}E\{y_2\} = 0$. Uncorrelatedness does not imply independence. Example: $(y_1, y_2)$ takes the values (0,1), (0,-1), (1,0), (-1,0) with equal probability, so that $E\{y_1^2y_2^2\} = 0 \neq \tfrac{1}{4} = E\{y_1^2\}E\{y_2^2\}$. ICA has to prove independence.

The key to solving the ICA problem is to find independent components. Two signals are independent if the joint expectation of any functions of them equals the product of the individual expectations: the expectation of $h_1(y_1)h_2(y_2)$ equals the expectation of $h_1(y_1)$ times the expectation of $h_2(y_2)$. By this definition, if variables are independent, they are also uncorrelated. However, even though variables may be uncorrelated, it does not mean that they are independent, and this slide shows an example of uncorrelated variables that are not independent. Making the mixtures uncorrelated can easily be done as a preprocessing step, because uncorrelated variables are a step toward independent variables, but ICA eventually has to find independent variables.
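The four-point example can be checked exactly. This is a minimal sketch that assumes the four points are equally likely and uses $h_1(y) = h_2(y) = y^2$; the helper `expect` is a name introduced here for illustration.

```python
from fractions import Fraction

# The four points from the slide's example, assumed equally likely.
points = [(0, 1), (0, -1), (1, 0), (-1, 0)]
p = Fraction(1, len(points))

def expect(f):
    """Expectation of f(y1, y2) under the uniform distribution on `points`."""
    return sum(p * f(y1, y2) for y1, y2 in points)

# Uncorrelated: E{y1 y2} - E{y1} E{y2} = 0.
cov = expect(lambda a, b: a * b) - expect(lambda a, b: a) * expect(lambda a, b: b)

# Not independent: with h1(y) = h2(y) = y^2 the factorization fails.
joint = expect(lambda a, b: a**2 * b**2)
product = expect(lambda a, b: a**2) * expect(lambda a, b: b**2)

print(cov, joint, product)
```

Exact fractions are used so that the comparison between 0 and 1/4 does not depend on floating-point rounding.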

ICA restrictions
Cannot determine variances: s and A are unknown, so a scalar multiplier on s can be canceled out by a divisor on A; the multiplier could even be -1. Cannot determine order: the order of the terms can be changed.

With this ICA model there are some restrictions on what the ICA algorithm can solve. The first restriction is that ICA cannot determine variances. Since s and A are both unknown, any scalar multiplier on a source in s can be canceled out by dividing the corresponding column of A by the same scalar. The output signal could even be scaled by -1, which would flip the sign of the signal. The next restriction is that, with this ICA model, the order in which the independent components are found is not determined.
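Both ambiguities can be seen numerically. In this sketch the sources, the mixing matrix, the scaling matrix `D`, and the permutation `P` are all hypothetical choices; the point is only that a rescaled and reordered source set, paired with a compensating mixing matrix, produces exactly the same mixtures.

```python
import numpy as np

rng = np.random.default_rng(1)
s = rng.uniform(-1.0, 1.0, size=(2, 500))    # hypothetical sources
A = np.array([[1.0, 0.5],
              [0.3, 2.0]])                    # hypothetical mixing matrix
x = A @ s

# Scale source 1 by -2 and source 2 by 0.5, then swap their order.
D = np.diag([-2.0, 0.5])
P = np.array([[0.0, 1.0],
              [1.0, 0.0]])

s_alt = P @ D @ s                             # rescaled, reordered sources
A_alt = A @ np.linalg.inv(D) @ P.T            # compensating mixing matrix
x_alt = A_alt @ s_alt                         # same observations as x

print(np.allclose(x, x_alt))
```

Since the observations are identical, no algorithm that sees only `x` can tell the two source sets apart.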

ICA restrictions
At most one Gaussian source. If $x_1$ and $x_2$ are Gaussian, uncorrelated, and of unit variance, their joint density $p(x_1, x_2) = \frac{1}{2\pi}\exp\left(-\frac{x_1^2 + x_2^2}{2}\right)$ is completely symmetric and does not contain information on the directions of the columns of the mixing matrix A.

The final restriction is that there can be at most one Gaussian source. For example, if we had two Gaussian sources $x_1$ and $x_2$ that were uncorrelated and of unit variance, their joint density would be the one given above. This density is completely symmetric, which means that it does not contain any information on the directions of the columns of the mixing matrix A, and therefore A cannot be estimated. The problem can also be restated: any orthogonal transformation of the Gaussian $x_1$ and $x_2$ has exactly the same distribution as $x_1$ and $x_2$, with components that are again independent. This means that for Gaussian variables ICA can only estimate the sources up to an orthogonal transformation.
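The invariance under orthogonal transformations follows in two lines from the standard density of two uncorrelated, unit-variance Gaussian variables. The symbol $Q$ for the orthogonal matrix is introduced here for illustration.

```latex
p(x_1, x_2) = \frac{1}{2\pi}\exp\left(-\frac{x_1^2 + x_2^2}{2}\right)
            = \frac{1}{2\pi}\exp\left(-\frac{\|x\|^2}{2}\right)

% For y = Qx with Q orthogonal (Q^T Q = I, |\det Q| = 1):
p_y(y) = p_x(Q^T y)\,\left|\det Q^T\right|
       = \frac{1}{2\pi}\exp\left(-\frac{\|Q^T y\|^2}{2}\right)
       = \frac{1}{2\pi}\exp\left(-\frac{\|y\|^2}{2}\right)
```

The density of $y$ is identical to that of $x$ for every $Q$, so the data carry no information about which rotation was applied.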

ICA estimation
Nongaussianity indicates independence. Estimate one component as $y = w^Tx$. Let $z = A^Tw$, so $y = w^TAs = z^Ts$. y is a linear combination of the $s_i$, therefore $z^Ts$ is more Gaussian than any of the $s_i$. $z^Ts$ becomes least Gaussian when it is equal to one of the $s_i$, and then $w^Tx = z^Ts$ equals an independent component. Maximizing the nongaussianity of $w^Tx$ gives us one of the independent components. Approaches: maximizing a measure of nongaussianity, minimizing mutual information, maximum likelihood.

The key to estimating the independent components is nongaussianity: nongaussianity indicates independence. For example, suppose an estimate of a source is $y = w^Tx$. If you let z equal the transpose of the mixing matrix A times w, you end up with the relationship $y = z^Ts$, which means that y is a linear combination of the sources. By the central limit theorem, a sum of independent random variables tends to be more Gaussian than the original variables, so $z^Ts$ is more Gaussian than any single component of s. In fact, $z^Ts$ is least Gaussian when it equals one of the sources, and in that case $w^Tx = z^Ts$ is an independent component. Therefore, if we maximize the nongaussianity of $w^Tx$, we can estimate one of the independent components. This can be done through several methods. The first is to maximize nongaussianity directly, using a statistical measure of the gaussianity of the variables. Other ways are to minimize the mutual information between the estimated components, or to use maximum likelihood estimation.
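The claim that a mixture is more Gaussian than its sources can be illustrated with kurtosis. It relies on the additivity property $\mathrm{kurt}(z_1s_1 + z_2s_2) = z_1^4\,\mathrm{kurt}(s_1) + z_2^4\,\mathrm{kurt}(s_2)$ for independent sources. The sketch assumes two zero-mean, unit-variance uniform sources (kurtosis -1.2) and a unit-norm `z`, which are illustrative choices, and `mixture_kurtosis` is a helper name introduced here.

```python
import math

def mixture_kurtosis(theta, kurt_s1, kurt_s2):
    """Kurtosis of y = z1*s1 + z2*s2 with z = (cos(theta), sin(theta)).

    Assumes s1 and s2 are independent with zero mean and unit variance,
    so kurt(y) = z1^4 * kurt(s1) + z2^4 * kurt(s2).
    """
    z1, z2 = math.cos(theta), math.sin(theta)
    return z1**4 * kurt_s1 + z2**4 * kurt_s2

# A unit-variance uniform source has kurtosis -1.2 (illustrative choice).
K = -1.2
angles = [i * math.pi / 8 for i in range(5)]      # 0, pi/8, ..., pi/2
for theta in angles:
    value = abs(mixture_kurtosis(theta, K, K))
    print(f"theta={theta:.3f}  |kurt(y)|={value:.3f}")
```

The absolute kurtosis is largest at the two ends of the sweep, where `z` picks out a single source, and smallest for the equal mixture in the middle.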

Measuring nongaussianity
Kurtosis: the fourth-order cumulant and the classical measure of nongaussianity, $\mathrm{kurt}(y) = E\{y^4\} - 3(E\{y^2\})^2$. For Gaussian y, the fourth moment equals $3(E\{y^2\})^2$, so the kurtosis of a Gaussian random variable is 0. Con: not a robust measure of nongaussianity, because it is sensitive to outliers.

In order to maximize nongaussianity you have to be able to measure nongaussianity. One method is kurtosis, the fourth-order cumulant; the classical equation for kurtosis is given on the slide. When a variable is Gaussian, its fourth moment is 3 times the square of its second moment, which means that the kurtosis of a Gaussian random variable is 0. The problem with using kurtosis as a measurement is that it is not a very robust measure of nongaussianity: it is sensitive to outliers, which may be strong enough to change the value of the kurtosis.
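A sample-based sketch of the kurtosis formula shows both the zero value for Gaussian data and the sensitivity to outliers. The sample size, the seed, and the outlier value 20 are arbitrary illustrative choices, and the data are assumed to be zero-mean.

```python
import random

def kurtosis(samples):
    """Sample version of kurt(y) = E{y^4} - 3 (E{y^2})^2 for zero-mean y."""
    n = len(samples)
    m2 = sum(v**2 for v in samples) / n
    m4 = sum(v**4 for v in samples) / n
    return m4 - 3.0 * m2**2

random.seed(0)
n = 100_000
gauss = [random.gauss(0.0, 1.0) for _ in range(n)]
uniform = [random.uniform(-3**0.5, 3**0.5) for _ in range(n)]   # unit variance

k_gauss = kurtosis(gauss)        # close to 0
k_uniform = kurtosis(uniform)    # close to -1.2

# One large outlier (hypothetical value 20) among 100,000 Gaussian samples.
k_outlier = kurtosis(gauss + [20.0])

print(k_gauss, k_uniform, k_outlier)
```

A single sample out of 100,001 is enough to move the estimate by more than the whole gap between the Gaussian and the uniform values, which is the nonrobustness the slide refers to.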

Measuring nongaussianity
Entropy (H): the degree of information that an observation gives. A Gaussian variable has the largest entropy among all random variables of equal variance. Negentropy J: based on the information-theoretic quantity of differential entropy, $J(y) = H(y_{gauss}) - H(y)$. Computationally difficult.

Another way to measure nongaussianity is negentropy. The entropy of a random variable is the degree of information that an observation of the variable gives. Negentropy is the difference between the entropy of a Gaussian variable with the same covariance as y and the entropy of y itself. Negentropy is hard to calculate because it would require an estimate of the pdf. Because negentropy is computationally difficult, a simpler approximation of negentropy is more useful.
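For a few densities the differential entropy has a closed form, so negentropy can be computed directly without estimating a pdf. This sketch assumes unit-variance Gaussian, uniform, and Laplace variables, chosen here only as examples.

```python
import math

# Closed-form differential entropies (in nats) of three unit-variance densities.
H_GAUSS = 0.5 * math.log(2.0 * math.pi * math.e)     # Gaussian
H_UNIFORM = math.log(2.0 * math.sqrt(3.0))           # uniform on [-sqrt(3), sqrt(3)]
H_LAPLACE = 1.0 + math.log(2.0 / math.sqrt(2.0))     # Laplace with scale 1/sqrt(2)

def negentropy(h_y):
    """J(y) = H(y_gauss) - H(y) for a unit-variance variable y."""
    return H_GAUSS - h_y

for name, h in [("gauss", H_GAUSS), ("uniform", H_UNIFORM), ("laplace", H_LAPLACE)]:
    print(f"{name:8s} H={h:.4f}  J={negentropy(h):.4f}")
```

Negentropy is zero for the Gaussian and positive for the other two, consistent with the Gaussian having the largest entropy at a given variance. For data with an unknown density no such closed form is available, which is why approximations are needed.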

Negentropy approximations
Classical method using higher-order moments: for zero-mean, unit-variance y, $J(y) \approx \frac{1}{12}E\{y^3\}^2 + \frac{1}{48}\mathrm{kurt}(y)^2$. Its validity is limited by the nonrobustness of kurtosis.

A simple way to approximate negentropy is through the use of kurtosis, as can be seen in the equation for J. This approximation, however, is weakened by the kurtosis term in the second half of the equation.

Negentropy approximations
Hyvärinen 1998b, maximum-entropy principle: $J(y) \propto [E\{G(y)\} - E\{G(v)\}]^2$, where G is some contrast function and v is a Gaussian variable of zero mean and unit variance. Taking $G(y) = y^4$ gives the kurtosis-based approximation.

Hyvärinen in 1998 came up with another way to approximate negentropy, based on the maximum-entropy principle. Here the negentropy is proportional to the squared difference between the expectation of some contrast function G applied to y and its expectation for a standard Gaussian variable v. This equation reduces to the kurtosis-based approximation by taking G(y) equal to y to the fourth power.
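The reduction to kurtosis can be written out in one step, assuming y has zero mean and unit variance and using the fact that a standard Gaussian variable v has fourth moment 3.

```latex
% Maximum-entropy approximation with contrast function G:
J(y) \propto \left[ E\{G(y)\} - E\{G(v)\} \right]^2

% Choose G(y) = y^4. For standard Gaussian v, E\{v^4\} = 3, and for
% unit-variance y, kurt(y) = E\{y^4\} - 3(E\{y^2\})^2 = E\{y^4\} - 3, so
J(y) \propto \left[ E\{y^4\} - 3 \right]^2 = \mathrm{kurt}(y)^2
```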

Negentropy approximations
Instead of the kurtosis function, choose a contrast function G that doesn't grow too fast: $G_1(u) = \frac{1}{a_1}\log\cosh(a_1u)$, $G_2(u) = -\exp(-u^2/2)$, where $1 \le a_1 \le 2$.

The kurtosis function grows too fast, which makes it not robust. Hyvärinen proposes that you choose a contrast function G that doesn't grow too fast. Two proposed contrast functions are shown as $G_1$ and $G_2$.
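A rough sketch of the contrast-function approximation is given below. It assumes the forms $G_1(u) = \frac{1}{a_1}\log\cosh(a_1u)$ and $G_2(u) = -\exp(-u^2/2)$, whose derivatives are the $g_1$ and $g_2$ used in the Fast ICA iteration, and it estimates all expectations by sample means. The function names, sample size, and seed are choices made for this example.

```python
import math
import random

def G1(u, a1=1.0):
    """Slowly growing contrast function, (1/a1) log cosh(a1 u)."""
    return math.log(math.cosh(a1 * u)) / a1

def G2(u):
    """Bounded contrast function, -exp(-u^2 / 2)."""
    return -math.exp(-u * u / 2.0)

def mean_G(G, samples):
    return sum(G(v) for v in samples) / len(samples)

def negentropy_proxy(G, y, v):
    """(E{G(y)} - E{G(v)})^2, i.e. negentropy up to a positive constant."""
    return (mean_G(G, y) - mean_G(G, v)) ** 2

random.seed(0)
n = 100_000
v = [random.gauss(0.0, 1.0) for _ in range(n)]                 # Gaussian reference
y_gauss = [random.gauss(0.0, 1.0) for _ in range(n)]           # Gaussian test signal
y_unif = [random.uniform(-3**0.5, 3**0.5) for _ in range(n)]   # nongaussian, unit variance

for G in (G1, G2):
    print(G.__name__, negentropy_proxy(G, y_gauss, v), negentropy_proxy(G, y_unif, v))
```

For both contrast functions the proxy is close to zero for the Gaussian signal and clearly larger for the uniform one. Because neither function grows faster than quadratically, a single outlier has far less influence than it has on the fourth-power term in kurtosis.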

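Minimizing mutual information
Mutual information I is defined as $I(y_1, \dots, y_n) = \sum_{i=1}^{n} H(y_i) - H(y)$. It is a measure of the dependence between random variables: I = 0 if the variables are statistically independent. Minimizing it is equivalent to maximizing negentropy.

I is the difference between the sum of the entropies of the individual variables and the entropy of the joint vector, where entropy is the degree of information that the observation of a variable gives.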
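The definition is easiest to see on a small discrete example. Note the assumption: the slide refers to differential entropy of continuous variables, while this sketch uses Shannon entropy of discrete probability tables, which follows the same formula. The two joint tables are made up for illustration.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def mutual_information(joint):
    """I(y1, y2) = H(y1) + H(y2) - H(y1, y2) for a joint probability table."""
    p1 = [sum(row) for row in joint]            # marginal of y1
    p2 = [sum(col) for col in zip(*joint)]      # marginal of y2
    flat = [p for row in joint for p in row]    # joint distribution
    return entropy(p1) + entropy(p2) - entropy(flat)

independent = [[0.25, 0.25],
               [0.25, 0.25]]     # joint = product of marginals
dependent = [[0.5, 0.0],
             [0.0, 0.5]]         # y2 is fully determined by y1

mi_independent = mutual_information(independent)
mi_dependent = mutual_information(dependent)
print(mi_independent, mi_dependent)
```

The independent table gives zero mutual information, and the fully dependent one gives log 2, the entropy of a single fair binary variable.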

Maximum Likelihood Estimation
Closely related to the infomax principle (Bell and Sejnowski, 1995): maximizing the output entropy of a neural network with nonlinear outputs. The densities of the ICs must be estimated properly; if the estimation is wrong, ML will give wrong results. If you use reasonable measures of nongaussianity, this problem does not arise.

Fast ICA
Preprocessing. Fast ICA algorithm: maximize nongaussianity. Unmixing signals.

Fast ICA: Preprocessing
Centering: subtract the mean vector from x to make x a zero-mean variable, so that the ICA algorithm does not need to estimate the mean. The mean vector of s can be estimated afterwards as $A^{-1}m$, where m is the mean that was subtracted. Estimating the mean vector is an optional last step in the Fast ICA algorithm.
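Centering is a one-line operation. In this sketch the mixing matrix and the nonzero-mean sources are hypothetical, and the true `A` is used to recover the source means only because it is known in a toy example; in practice the estimated mixing matrix would be used.

```python
import numpy as np

rng = np.random.default_rng(2)
A = np.array([[1.0, 0.5],
              [0.3, 2.0]])                          # hypothetical mixing matrix
offsets = np.array([[3.0], [-1.0]])                 # hypothetical source offsets
s = rng.uniform(0.0, 1.0, size=(2, 2000)) + offsets
x = A @ s                                           # mixtures with nonzero mean

m = x.mean(axis=1, keepdims=True)    # mean vector of the mixtures
x_centered = x - m                   # zero-mean data passed on to ICA

# Optional last step: the mean of s is A^{-1} m.
s_mean = np.linalg.inv(A) @ m
print(s_mean.ravel())
```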

Fast ICA: Preprocessing
Whitening: transform x so that its components are uncorrelated and their variances equal unity. Use the eigenvalue decomposition (EVD) of the covariance matrix, $E\{xx^T\} = EDE^T$, where D is the diagonal matrix of its eigenvalues and E is the orthogonal matrix of eigenvectors. The whitened data are $\tilde{x} = ED^{-1/2}E^Tx$.
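Whitening by eigenvalue decomposition can be sketched as follows, assuming the whitening matrix $ED^{-1/2}E^T$ and a sample estimate of the covariance. The function name `whiten` and the test data are choices made for this example.

```python
import numpy as np

def whiten(x):
    """Whiten zero-mean data x (one row per mixture) via EVD of its covariance."""
    cov = x @ x.T / x.shape[1]            # sample estimate of E{x x^T}
    d, E = np.linalg.eigh(cov)            # eigenvalues d, orthogonal eigenvectors E
    V = E @ np.diag(d ** -0.5) @ E.T      # whitening matrix E D^{-1/2} E^T
    return V @ x, V

rng = np.random.default_rng(3)
s = rng.uniform(-1.0, 1.0, size=(2, 5000))     # hypothetical sources
A = np.array([[1.0, 0.5],
              [0.3, 2.0]])                      # hypothetical mixing matrix
x = A @ s
x = x - x.mean(axis=1, keepdims=True)           # centering comes first

z, V = whiten(x)
cov_z = z @ z.T / z.shape[1]
print(np.round(cov_z, 6))
```

`np.linalg.eigh` is used because the covariance matrix is symmetric. The covariance of the whitened data is the identity matrix up to rounding.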

Fast ICA: Preprocessing
Whitening transforms the mixing matrix into $\tilde{A} = ED^{-1/2}E^TA$ and makes $\tilde{A}$ orthogonal. This lessens the number of parameters that have to be estimated from $n^2$ to $n(n-1)/2$. In large dimensions an orthogonal matrix contains approximately half the number of parameters.
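The orthogonality of the whitened mixing matrix can be verified directly. The sketch assumes independent unit-variance sources, so that the population covariance of the mixtures is $AA^T$, and uses a hypothetical 2 by 2 mixing matrix.

```python
import numpy as np

A = np.array([[1.0, 0.5],
              [0.3, 2.0]])               # hypothetical mixing matrix
n = A.shape[0]

# With independent unit-variance sources, E{x x^T} = A E{s s^T} A^T = A A^T.
cov_x = A @ A.T
d, E = np.linalg.eigh(cov_x)
V = E @ np.diag(d ** -0.5) @ E.T         # whitening matrix E D^{-1/2} E^T
A_tilde = V @ A                          # mixing matrix after whitening

print(np.allclose(A_tilde @ A_tilde.T, np.eye(n)))
print("free parameters:", n * n, "->", n * (n - 1) // 2)
```

For n = 2 the remaining unknown is a single rotation angle, which is what the ICA iteration then has to find.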

Fast ICA Algorithm
One-unit (one-component) version:
1. Choose an initial weight vector w.
2. Let $w^+ = E\{x\,g(w^Tx)\} - E\{g'(w^Tx)\}w$, where g is the derivative of the contrast function G: $g_1(u) = \tanh(a_1u)$, $g_2(u) = u\exp(-u^2/2)$.
3. Let $w = w^+/\|w^+\|$ (normalization step).
4. If not converged, go back to 2. The iteration has converged if $\|w_{new} - w_{old}\| < \xi$ or $\|w_{new} + w_{old}\| < \xi$, since w and -w define the same component; ξ is typically around …
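The four steps can be sketched in NumPy as below. This is an illustrative implementation under stated assumptions, not the presenter's code: it uses $g_1(u) = \tanh(a_1u)$ with $a_1 = 1$, a tolerance of 1e-6, two uniform sources, and a made-up mixing matrix.

```python
import numpy as np

def fastica_one_unit(z, rng, a1=1.0, tol=1e-6, max_iter=200):
    """One-unit fixed-point iteration on whitened data z (one row per mixture)."""
    w = rng.standard_normal(z.shape[0])
    w /= np.linalg.norm(w)                                  # step 1: initial w
    for _ in range(max_iter):
        u = w @ z                                           # w^T x for every sample
        g = np.tanh(a1 * u)                                 # g1(u)
        g_prime = a1 * (1.0 - g ** 2)                       # g1'(u)
        w_new = (z * g).mean(axis=1) - g_prime.mean() * w   # step 2
        w_new /= np.linalg.norm(w_new)                      # step 3
        # step 4: w and -w give the same component, so compare up to sign
        done = min(np.linalg.norm(w_new - w), np.linalg.norm(w_new + w)) < tol
        w = w_new
        if done:
            break
    return w

rng = np.random.default_rng(0)
s = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(2, 5000))   # hypothetical sources
A = np.array([[1.0, 0.5],
              [0.3, 2.0]])                                  # hypothetical mixing matrix
x = A @ s
x -= x.mean(axis=1, keepdims=True)                          # centering
d, E = np.linalg.eigh(x @ x.T / x.shape[1])
z = E @ np.diag(d ** -0.5) @ E.T @ x                        # whitening

w = fastica_one_unit(z, rng)
y = w @ z                                                   # one estimated component
corr = [abs(np.corrcoef(y, s_i)[0, 1]) for s_i in s]
print(max(corr))
```

The estimated component should correlate strongly with one of the two sources. Which one, and with which sign, depends on the random starting vector, in line with the ordering and scaling restrictions discussed earlier.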

Fast ICA Algorithm
Several-unit algorithm: define B as the mixing matrix and B′ as a matrix whose columns are the previously found columns of B. Add a projection step before step 3, which becomes:
3. Let $w^+ = w^+ - B'B'^Tw^+$, then $w = w^+/\|w^+\|$.
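A deflation version, in which components are found one by one and each new vector is projected away from the ones already found, might look like this. It is a sketch under the same assumptions as a basic one-unit iteration (tanh nonlinearity, whitened input, made-up sources and mixing matrix), and `fastica_deflation` is a name introduced here.

```python
import numpy as np

def fastica_deflation(z, n_components, rng, tol=1e-6, max_iter=200):
    """Estimate components one by one on whitened data z (one row per mixture)."""
    B = np.zeros((z.shape[0], 0))              # B': columns found so far
    for _ in range(n_components):
        w = rng.standard_normal(z.shape[0])
        w -= B @ (B.T @ w)                     # start orthogonal to found columns
        w /= np.linalg.norm(w)
        for _iteration in range(max_iter):
            g = np.tanh(w @ z)
            w_new = (z * g).mean(axis=1) - (1.0 - g ** 2).mean() * w
            w_new -= B @ (B.T @ w_new)         # projection step: w - B' B'^T w
            w_new /= np.linalg.norm(w_new)     # normalization step
            done = min(np.linalg.norm(w_new - w),
                       np.linalg.norm(w_new + w)) < tol
            w = w_new
            if done:
                break
        B = np.column_stack([B, w])
    return B

rng = np.random.default_rng(0)
s = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(2, 5000))   # hypothetical sources
A = np.array([[1.0, 0.5],
              [0.3, 2.0]])                                  # hypothetical mixing matrix
x = A @ s
x -= x.mean(axis=1, keepdims=True)
d, E = np.linalg.eigh(x @ x.T / x.shape[1])
z = E @ np.diag(d ** -0.5) @ E.T @ x                        # centered and whitened

B = fastica_deflation(z, 2, rng)
y = B.T @ z                                                 # estimated components
C = np.abs(np.corrcoef(y, s)[:2, 2:])                       # |corr(estimate, source)|
print(np.round(C, 3))
```

Each row of `C` should have one entry near 1, showing that every estimate matches one source up to sign and order. The projection keeps the columns of `B` orthonormal, which is what prevents the second run from converging to the component already found.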

Simple Simulation
Separation of 2 components. Figure 1: Two independent nongaussian wav samples. Used the tanh function.

Simple Simulation Figure 2: Mixed signals

Simple Simulation
Recovered signals vs original signals: amplitude scaled, not in any particular order. Figure 3: Recovered signals. Figure 4: Original signals.

Simulation Results
IC 1 recovered in 6 steps and IC 2 recovered in 2 steps. Retested with samples: requires approximately the same number of steps.

Gaussian Simulation Figure 5: 2 wav samples and noise signal

Gaussian Simulation Figure 6: 3 mixed signals

Gaussian Simulation Comparison of recovered signals vs original signals Figure 7: Recovered signals Figure 8: Original signal

Gaussian Simulation 2
Tried with 2 Gaussian components. The components were not estimated properly because there was more than one Gaussian component. Figure 10: Original signals. Figure 11: Recovered signals.

Conclusion
Fast ICA properties: no step size, unlike gradient-based ICA algorithms; finds the independent components of practically any nongaussian distribution using any nonlinearity g; components can be estimated one by one. Other applications: separation of artifacts in image data; finding hidden factors in financial data; reducing noise in natural images; medical signal processing such as fMRI, ECG, and EEG (Makeig).

References
[1] Aapo Hyvärinen and Erkki Oja, Independent Component Analysis: Algorithms and Applications. Neural Networks Research Centre, Helsinki University of Technology; Neural Networks, 13(4-5), 2000.
[2] Aapo Hyvärinen and Erkki Oja, A Fast Fixed-Point Algorithm for Independent Component Analysis. Helsinki University of Technology, Laboratory of Computer and Information Science; Neural Computation, 9:1483-1492, 1997.
[3] Anthony J. Bell and Terrence J. Sejnowski, The 'Independent Components' of Natural Scenes are Edge Filters. Howard Hughes Medical Institute, Computational Neurobiology Laboratory.
[4] Te-Won Lee, Mark Girolami, and Terrence J. Sejnowski, Independent Component Analysis Using an Extended Infomax Algorithm for Mixed Subgaussian and Supergaussian Sources. 1997.
[5] Antti Leino, Independent Component Analysis: An Overview. 2004.
[6] Erik G. Learned-Miller and John W. Fisher III, ICA Using Spacings Estimates of Entropy. Journal of Machine Learning Research 4 (2003).
