# 2nd-Order Statistics + A 3rd Data Dimension: Just Weight and See! Arie Yeredor, Sept. 28th, 2010


Outline
– Introduction
– A Better Outline (better motivated & better detailed…)

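The Classical Mixture Model: $X = A S$ (static, square-invertible, real-valued, noiseless), where $A$ is the $K \times K$ mixing matrix and the rows of the $K \times N$ matrices $S$ and $X$ hold the $K$ sources and the $K$ observed mixtures, respectively. All sources are zero-mean and mutually independent.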

SOS and ICA Traditionally, Second-Order Statistics have played Second Fiddle to Higher-Order Statistics in ICA. In classical ICA, where the sources have i.i.d. time-structures (or any different temporal structures which are ignored), SOS are insufficient for separation: SOS do about half of the job, enabling spatial whitening (sphering), but unable to resolve a possible residual orthogonal mixing.

SOS and ICA: However, when the sources have diverse temporal statistical structures, SOS can be sufficient for separation. Moreover, if the sources are Gaussian, SOS-based separation can even be optimal (in the sense of minimum residual Interference to Source Ratio (ISR)).

SOS and ICA: Classical SOS approaches can roughly be divided into two categories:
– Approaches exploiting the special congruence structure of the mixtures' correlation matrices through (approximate) joint diagonalization;
– Approaches based on the principle of Maximum Likelihood (which only in the Gaussian case can be based on SOS alone).
Yet, some other approaches offer an interesting insight into the borderline between the two.

The sources' temporal SOS structure: Let $s_k = [s_k[1], \ldots, s_k[N]]^T$ denote the vector of the observed segment of the $k$-th source, and let $C_k = E[s_k s_k^T]$ denote its $N \times N$ correlation (covariance) matrix.

A stationary white source

A stationary MA source

A stationary AR source

A block-stationary source, iid in each block

A block-stationary source, MA in each block

A cyclostationary source

A general nonstationary source
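The source types listed above differ only in the structure of the $N \times N$ covariance matrix of each source. The sketch below builds a few such matrices with NumPy. The function names are illustrative and not taken from the talk, and the driving noise is assumed to be white with unit variance.

```python
import numpy as np

def ma_covariance(N, b):
    """Toeplitz covariance of a stationary MA source x[n] = sum_i b[i] w[n-i],
    assuming unit-variance white driving noise w (b = [1.0] gives a white source)."""
    b = np.asarray(b, dtype=float)
    r = np.correlate(b, b, mode="full")[len(b) - 1:]   # autocorrelation at lags 0..q
    r_full = np.zeros(N)
    r_full[:min(N, len(r))] = r[:N]                    # lags beyond q are zero
    lags = np.abs(np.subtract.outer(np.arange(N), np.arange(N)))
    return r_full[lags]

def block_iid_covariance(block_lengths, variances):
    """Diagonal covariance of a block-stationary source that is iid within each block."""
    return np.diag(np.repeat(np.asarray(variances, dtype=float), block_lengths))

def modulated_covariance(C, envelope):
    """Covariance of envelope[n] * x[n], where x has covariance C
    (a simple way to obtain a general nonstationary structure)."""
    envelope = np.asarray(envelope, dtype=float)
    return C * np.outer(envelope, envelope)
```

A stationary source always yields a Toeplitz matrix, while the block-stationary and modulated cases do not; this is the diversity that the rest of the talk exploits.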

Most existing classical methods address the case of stationary sources:

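Joint-Diagonalization Based Approaches: First(?*) there was AMUSE (Tong et al., 1990):
– Estimate the zero-lag correlation matrix $R_x[0] = E[x[n]\,x^T[n]]$;
– Estimate the lagged correlation matrix $R_x[\tau]$ for some $\tau \neq 0$;
– Obtain a consistent estimate of $A$ from the exact joint diagonalization of $\hat{R}_x[0]$ and $\hat{R}_x[\tau]$.
*) also proposed two years earlier by Fêty and Van Uffelen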
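The AMUSE steps can be sketched as whitening followed by an eigendecomposition. This is a minimal illustration that takes the two correlation matrices as given; the function name is an assumption, and the lagged matrix is symmetrized before the eigendecomposition, which is a common practical choice rather than something stated on the slide.

```python
import numpy as np

def amuse_from_correlations(R0, R_tau):
    """Exact joint diagonalization of a zero-lag and a lagged correlation matrix.
    Returns a demixing matrix B such that B R0 B^T = I and B R_tau B^T is diagonal."""
    d, U = np.linalg.eigh(R0)                 # R0 is symmetric positive definite
    W = (U / np.sqrt(d)) @ U.T                # whitening matrix, W R0 W^T = I
    M = W @ R_tau @ W.T
    M = 0.5 * (M + M.T)                       # symmetrize the whitened lagged matrix
    _, V = np.linalg.eigh(M)                  # orthogonal factor left by whitening
    return V.T @ W
```

Separation succeeds only when the eigenvalues of the whitened lagged matrix are distinct, i.e. when the sources have different normalized correlations at the chosen lag.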

Joint-Diagonalization Based Approaches SOBI (Belouchrani et al., 1997), considered the approximate joint diagonalization of several estimated correlation matrices (at several lags); TDSEP (Ziehe and Müller, 1998) considered the joint diagonalization of linear combinations of estimated correlation matrices.

Likelihood Based Approaches QML (Pham and Garat, 1997): Solve a nonlinear set of estimating equations constructed from correlation matrices between filtered versions of the mixtures; EML (Dégerine and Zaïdi, 2004): Assume that the sources are all Auto-Regressive (AR), and jointly estimate their AR parameters and the mixing matrix, resulting in estimating equations similar in form to QML.

Borderline Approaches:
– GMI (Pham, 2001): Joint diagonalization of cross-spectral matrices in the frequency domain, using a special off-diagonality criterion (which reflects the Kullback-Leibler divergence between these matrices and their diagonal forms).
– WASOBI (Weights-Adapted SOBI, Yeredor, 2000; Tichavský et al., 2006): Apply optimal weighting to the approximate joint diagonalization used in SOBI. Can be regarded (asymptotically) as Maximum-Likelihood estimation with respect to the estimated correlation matrices, which are asymptotically Gaussian.

What about non-stationary sources? A few particular forms of non-stationarity have been considered, e.g.:
– Block-stationary sources, iid in each block: BGL (Pham and Cardoso, 2001), resulting in joint diagonalization of zero-lag correlation matrices estimated over different segments.
– Block-stationary sources, AR in each block: BARBI (Tichavský et al., 2009), weighted AJD of lagged correlation matrices estimated in different blocks.
– Cyclostationary sources: e.g., Liang et al., 1997; Abed-Meraim et al., 2001; Ferréol, 2004; Pham, 2007; CheViet et al., 2008…
– Sources with distinct time-frequency representations: Belouchrani & Amin, 1998; Zhang & Amin, 2000; Giulieri et al., 2003; Fadaili et al., 2005…

But… Quite often, the sources are neither:
– Stationary
– Block-stationary
– Cyclostationary
– Sparse in Time-Frequency
– …
Then what?... (this is the main subject of this talk)

We shall assume: Each ($k$-th) source has its own general $N \times N$ covariance matrix $C_k$, which is not specially structured in any way; all sources are jointly Gaussian and uncorrelated with each other (hence, mutually independent).

A Fully-Blind Scenario The mixing matrix, the sources and their covariance matrices are unknown.

A Semi-Blind Scenario The mixing matrix and the sources are unknown. The sources covariance matrices are known.

The Mixture Model Again: $X = A S$. The zero-mean sources are mutually uncorrelated, $E[s_k s_\ell^T] = 0$ for $k \neq \ell$, and each has its own general covariance matrix $C_k = E[s_k s_k^T]$.

We are now ready for the better outline (better motivated, better detailed…)

A Better Outline
– The induced Cramér-Rao Lower Bound (iCRLB) on the attainable Interference to Source Ratios (ISRs)
– The semi-blind case:
  – Derivation of the iCRLB
  – ML estimation leading to Hybrid Exact / Approximate joint Diagonalization (HEAD)
  – An example
– The fully-blind case:
  – The iCRLB is essentially the same
  – Iterative Quasi-ML estimation based on consistent estimation of the sources' covariance matrices, e.g. from multiple snapshots
– Comparative performance demonstration
– Things worth weighting for?

What is an Induced CRLB? The CRLB is a well-known lower bound on the mean square error (MSE) attainable in unbiased estimation of a vector of (deterministic) parameters. In the context of ICA the full set of parameters includes the elements of the mixing matrix or, equivalently, of the demixing matrix, and, in a fully-blind scenario, also some parameters related to the sources distributions. However, the MSE in the estimation of these parameters is usually of very little interest.

What is an Induced CRLB? A more interesting measure of performance is the Interference to Source Ratio (ISR), measuring the residual relative energy of other sources present in each separated source. Let $\hat{B}$ denote the estimated separation matrix. We define the contamination matrix as the matrix $T = \hat{B} A$.

What is an Induced CRLB? We have, for the reconstructed sources, $\hat{S} = \hat{B} X = \hat{B} A S = T S$, so $T$ is the residual mixing matrix after separation. Thus, the ratio $T_{k,\ell} / T_{k,k}$ describes the ratio between the gain of the $\ell$-th source and that of the $k$-th source in the reconstruction of the latter.

What is an Induced CRLB? The $(k,\ell)$-th element of the ISR matrix is defined as the mean square value of this ratio, multiplied by the ratio of energies of the $\ell$-th and $k$-th sources: $\mathrm{ISR}_{k,\ell} = E\left[ T_{k,\ell}^2 / T_{k,k}^2 \right] \cdot \mathrm{Tr}(C_\ell) / \mathrm{Tr}(C_k)$.
– Note that this quantity is no longer data-dependent.
– Also, it is insensitive to any scaling ambiguity in the rows of $\hat{B}$.

What is an Induced CRLB? Under a small-errors assumption (and with the scale ambiguity resolved), $T$ is close to the identity matrix, so $T_{k,k} \approx 1$ and $T_{k,\ell} / T_{k,k} \approx T_{k,\ell}$. The ISR can therefore be closely approximated as $\mathrm{ISR}_{k,\ell} \approx E\left[ T_{k,\ell}^2 \right] \cdot \mathrm{Tr}(C_\ell) / \mathrm{Tr}(C_k)$.

What is an Induced CRLB? Since $T = \hat{B} A$, each element of $T$ is a linear function of the elements of $\hat{B}$, and therefore the second moment of each element of $T$ can be determined from the second moments of the elements of $\hat{B}$, which are all bounded by the CRLB. Therefore, the CRLB on estimation of the elements of $B$ induces a bound on the ISR through this linear relation.
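In simulations, the ISR matrix is typically estimated by averaging over independent trials. The sketch below assumes the definition $\mathrm{ISR}_{k,\ell} = E[T_{k,\ell}^2 / T_{k,k}^2] \cdot \mathrm{Tr}(C_\ell)/\mathrm{Tr}(C_k)$ with $T = \hat{B}A$; the function name and the array layout are illustrative choices, not taken from the talk.

```python
import numpy as np

def empirical_isr(T_trials, energies):
    """Average ISR matrix over Monte Carlo trials.
    T_trials : array of shape (trials, K, K), contamination matrix T = B_hat @ A per trial
    energies : length-K array of source energies, e.g. trace(C_k)
    Entry (k, l) is the mean of (T_kl / T_kk)^2 times energy_l / energy_k."""
    T = np.asarray(T_trials, dtype=float)
    e = np.asarray(energies, dtype=float)
    diag = np.diagonal(T, axis1=1, axis2=2)          # T_kk for every trial
    ratio_sq = (T / diag[:, :, None]) ** 2           # row k divided by T_kk
    isr = ratio_sq.mean(axis=0) * (e[None, :] / e[:, None])
    np.fill_diagonal(isr, 0.0)                       # diagonal is not an interference term
    return isr
```

Dividing each row by its own diagonal element is what makes the measure insensitive to the row scaling of the estimated separation matrix.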

Derivation of the iCRLB: A key property of the iCRLB is its equivariance with respect to the mixing matrix. An equivariant estimator of $B = A^{-1}$ is any estimator satisfying $\hat{B}(A S) = \hat{B}(S)\,A^{-1}$ for every nonsingular $A$. It then follows that $T = \hat{B}(A S)\,A = \hat{B}(S)$, so the ISR matrix depends only on properties of the sources' distributions.

Derivation of the iCRLB: Most (but certainly not all) popular ICA algorithms lead to equivariant separation. Does this mean that the iCRLB is equivariant?
– Generally, no;
– However, it can be shown that the ML separator is equivariant, and since the CRLB (and therefore also the iCRLB) is attained (at least asymptotically) by the ML separator, it follows that the iCRLB is equivariant as well.

Derivation of the iCRLB: An appealing consequence of the equivariance is that we may compute the iCRLB for any value of the mixing matrix, knowing that the same result applies to any (nonsingular) mixing matrix. We choose the convenient non-mixing condition $A = I$.

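In order to compute the iCRLB we rearrange the mixing relation $X = A S$ in vector form: $x = (A \otimes I_N)\, s$, where $s = [s_1^T, \ldots, s_K^T]^T$ and $x = [x_1^T, \ldots, x_K^T]^T$ stack the $K$ source segments and the $K$ mixture segments (each of length $N$) into $KN \times 1$ vectors.

Derivation of the iCRLB (contd.): The covariance matrix of the zero-mean sources vector $s$ is given by the block-diagonal matrix $C_s = \mathrm{blkdiag}(C_1, \ldots, C_K)$. The covariance matrix of the zero-mean observations vector $x$ is therefore given by $C_x = (A \otimes I_N)\, C_s\, (A^T \otimes I_N)$.

Derivation of the iCRLB (contd.): Therefore, under our Gaussian model assumption the observations vector is also a zero-mean Gaussian vector, $x \sim \mathcal{N}(0, C_x)$. Note that $C_x^{-1} = (B^T \otimes I_N)\, C_s^{-1}\, (B \otimes I_N)$, with $B = A^{-1}$.

Derivation of the iCRLB (contd.): Beginning with the semi-blind scenario, the only unknown parameters are the elements of $A$, and the respective elements of the Fisher Information Matrix (FIM) are well-known to be given in this case by $J_{(k,\ell),(m,n)} = \tfrac{1}{2} \mathrm{Tr}\left( C_x^{-1} \frac{\partial C_x}{\partial A_{k,\ell}} C_x^{-1} \frac{\partial C_x}{\partial A_{m,n}} \right)$ (a $K^2 \times K^2$ matrix).

Derivation of the iCRLB (contd.): Using the relation $\frac{\partial C_x}{\partial A_{k,\ell}} = (E_{k,\ell} \otimes I_N)\, C_s\, (A^T \otimes I_N) + (A \otimes I_N)\, C_s\, (E_{\ell,k} \otimes I_N)$, where $E_{k,\ell}$ is the all-zeros $K \times K$ matrix with a single 1 at position $(k,\ell)$, and the non-mixing assumption $A = I$, these elements can be conveniently expressed as $J_{(k,\ell),(m,n)} = N\,\delta_{k,n}\delta_{\ell,m} + \mathrm{Tr}(C_\ell C_k^{-1})\,\delta_{k,m}\delta_{\ell,n}$. Let us define $\phi_{k,\ell} = \frac{1}{N} \mathrm{Tr}(C_k C_\ell^{-1})$.

Derivation of the iCRLB (contd.): Thus, with particular ordering of the elements of $A$, the FIM can take a block-diagonal form, with $K(K-1)/2$ blocks of size $2 \times 2$ (one for each pair $A_{k,\ell}, A_{\ell,k}$ with $k \neq \ell$) and $K$ scalar diagonal terms (one for each $A_{k,k}$): the blocks are $N \begin{bmatrix} \phi_{\ell,k} & 1 \\ 1 & \phi_{k,\ell} \end{bmatrix}$ and the diagonal terms are $2N$.

Derivation of the iCRLB (contd.): So the inverse of each $2 \times 2$ block is $\frac{1}{N(\phi_{k,\ell}\phi_{\ell,k} - 1)} \begin{bmatrix} \phi_{k,\ell} & -1 \\ -1 & \phi_{\ell,k} \end{bmatrix}$, and therefore $\mathrm{var}(\hat{A}_{k,\ell}) \geq \frac{1}{N} \cdot \frac{\phi_{k,\ell}}{\phi_{k,\ell}\phi_{\ell,k} - 1}$.

The iCRLB: Under the non-mixing condition we have $T = \hat{B} = \hat{A}^{-1} \approx I - (\hat{A} - I)$, so that $T_{k,\ell} \approx -\hat{A}_{k,\ell}$, and therefore the CRLB induces the following iCRLB: $\mathrm{ISR}_{k,\ell} \geq \frac{1}{N} \cdot \frac{\phi_{k,\ell}}{\phi_{k,\ell}\phi_{\ell,k} - 1} \cdot \frac{\mathrm{Tr}(C_\ell)}{\mathrm{Tr}(C_k)}$ (with $k \neq \ell$).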

The iCRLB – Key Properties:
– Invariance with respect to the mixing matrix;
– Invariance with respect to other sources: The bound on $\mathrm{ISR}_{k,\ell}$ depends only on the covariance matrices of the $k$-th and $\ell$-th sources, and is unaffected by the other sources;
– Invariance with respect to scale: The ISR bound is invariant to any scaling of the sources. Note that this property is not shared by the bound on the variance of the elements of the estimated mixing (or demixing) matrix alone.

The iCRLB – Key Properties: Non-identifiability condition: With $\phi_{k,\ell} = \frac{1}{N}\mathrm{Tr}(C_k C_\ell^{-1})$, if sources $k$ and $\ell$ have similar covariance matrices (i.e., $C_k$ is a scaled version of $C_\ell$), then $\phi_{k,\ell}\phi_{\ell,k} = 1$, implying an infinite bound on $\mathrm{ISR}_{k,\ell}$ and on $\mathrm{ISR}_{\ell,k}$, which in turn implies non-identifiability of elements of $A$. Otherwise, it can be shown that $\phi_{k,\ell}\phi_{\ell,k} > 1$, so if no two sources have similar covariance matrices, all ISR bounds are finite and $A$ is identifiable.
– Recall, however, that this bound was developed for Gaussian sources. With non-Gaussian sources this condition can be shown to be applicable to estimation based exclusively on SOS; and yet, when this condition is breached, the mixture may still be identifiable using HOS.

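The iCRLB – Key Properties: Resemblance to other bounds: Assuming equal-energy sources, we have: $\mathrm{ISR}_{k,\ell} \geq \frac{1}{N} \cdot \frac{\phi_{k,\ell}}{\phi_{k,\ell}\phi_{\ell,k} - 1}$, with $\phi_{k,\ell} = \frac{1}{N}\mathrm{Tr}(C_k C_\ell^{-1})$. The same general form is shared by the ISR bound obtained, e.g. by Tichavský et al. (2006) and by Ollila et al. (2008) for unit-variance sources with iid temporal structures: $\mathrm{ISR}_{k,\ell} \geq \frac{1}{N} \cdot \frac{\kappa_\ell}{\kappa_k \kappa_\ell - 1}$, with $\kappa_k = E\left[ \left( f_k'(s) / f_k(s) \right)^2 \right]$, where $f_k$ is the pdf of the $k$-th source.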
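The bound and its key properties can be checked numerically. The sketch below assumes the reconstructed form $\mathrm{ISR}_{k,\ell} \geq \frac{1}{N} \cdot \frac{\phi_{k,\ell}}{\phi_{k,\ell}\phi_{\ell,k} - 1} \cdot \frac{\mathrm{Tr}(C_\ell)}{\mathrm{Tr}(C_k)}$ with $\phi_{k,\ell} = \frac{1}{N}\mathrm{Tr}(C_k C_\ell^{-1})$; the function name and the numerical tolerance are illustrative assumptions.

```python
import numpy as np

def icrlb(covs, tol=1e-12):
    """Matrix of induced CRLBs on the ISR for Gaussian sources with covariances covs.
    covs : list of K symmetric positive-definite N x N matrices.
    Entry (k, l) bounds the interference of source l in the estimate of source k;
    it is infinite when C_k is a scaled version of C_l (non-identifiable pair)."""
    K = len(covs)
    N = covs[0].shape[0]
    phi = np.empty((K, K))
    for k in range(K):
        for l in range(K):
            # trace(C_l^{-1} C_k) equals trace(C_k C_l^{-1})
            phi[k, l] = np.trace(np.linalg.solve(covs[l], covs[k])) / N
    energy = np.array([np.trace(C) for C in covs])
    bound = np.zeros((K, K))
    for k in range(K):
        for l in range(K):
            if k == l:
                continue
            denom = phi[k, l] * phi[l, k] - 1.0
            if denom <= tol:
                bound[k, l] = np.inf
            else:
                bound[k, l] = phi[k, l] / denom / N * energy[l] / energy[k]
    return bound
```

Rescaling any single covariance matrix leaves the returned matrix unchanged, and the entry for a pair depends only on the two covariance matrices of that pair.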

But can the iCRLB be reached? Yes, asymptotically, using ML separation (still considering the semi-blind case)

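ML Separation (Semi-Blind Case): Recalling the notation $B = A^{-1}$, the log-likelihood of the observations $X$ is given (up to an additive constant) by $L(B) = N \log|\det B| - \tfrac{1}{2} \sum_{k=1}^{K} e_k^T B X C_k^{-1} X^T B^T e_k$, where $e_k$ is the $k$-th column of the $K \times K$ identity matrix. Differentiating with respect to each element of $B$ and equating zero, we obtain the likelihood equations: $\sum_{k=1}^{K} e_k e_k^T \hat{B} X C_k^{-1} X^T = N \hat{A}^T$ (with $\hat{A} = \hat{B}^{-1}$).

ML Separation (Semi-Blind Case): Define the generalized correlation matrices $Q_k = \frac{1}{N} X C_k^{-1} X^T$, $k = 1, \ldots, K$. With slight manipulations, the same likelihood equations can be written as $\hat{b}_\ell^T Q_k \hat{b}_k = \delta_{k,\ell}$ for all $k, \ell$ (equivalently, $\hat{B} Q_k \hat{B}^T e_k = e_k$), where $\hat{b}_k$ is the $k$-th column of $\hat{B}^T$, and $\delta_{k,\ell}$ is Kronecker's delta.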

The HEAD Problem: The set of equations $\hat{B} Q_k \hat{B}^T e_k = e_k$, $k = 1, \ldots, K$, can be seen as a hybrid exact-approximate joint diagonalization condition, termed the HEAD problem (Yeredor, 2009) and also, in a different context, termed Structured Joint Congruence transformation, StJoCo (Song et al., 2010).

What is HEAD?
– Classical Approximate Joint Diagonalization: Given a set of target matrices $Q_1, \ldots, Q_M$, each of dimensions $K \times K$, find a matrix $B$, such that the transformed matrices $B Q_m B^T$ are as diagonal as possible (often subject to some scaling constraints).
– HEAD: The number of matrices in the set equals the matrices' dimension ($M = K$). The $k$-th transformed matrix $B Q_k B^T$ is exactly diagonal in its $k$-th row and column, with the scaling constraint of being 1 at the $(k,k)$-th location. All other values are irrelevant.

HEAD (contd.) Note that HEAD is a set of nonlinear equations, not an optimization problem. It has been shown (Yeredor, 2009) that if all (symmetric) target-matrices are positive-definite, then a solution of HEAD must exist. The HEAD problem has already been encountered in the context of ICA (in slightly different forms) in QML (by Pham & Garat, 1997) and in exact ML separation of AR sources (Dégerine & Zaïdi, 2004). Different iterative solutions have been proposed by Pham & Garat, 1997, by Dégerine & Zaïdi, 2004, by Yeredor, 2009 and by Song et al., 2010.

HEAD for Nearly Jointly Diagonalizable Matrices

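Summary of ML Separation (the semi-blind scenario):
– Inputs: Observed mixtures: $X$ ($K \times N$); sources' covariance matrices: $C_1, \ldots, C_K$ (each $N \times N$).
– Construct generalized correlation matrices: $Q_k = \frac{1}{N} X C_k^{-1} X^T$, $k = 1, \ldots, K$.
– Obtain the estimated demixing matrix $\hat{B}$ as the solution to the HEAD problem, $\hat{B} Q_k \hat{B}^T e_k = e_k$, $k = 1, \ldots, K$.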
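The two steps of this summary can be sketched as follows. The HEAD solver here is a plain Newton iteration on the $K^2$ equations $B Q_k B^T e_k = e_k$, written for illustration only; it is not one of the published iterative solutions cited in the talk, it assumes a reasonably good initial guess (e.g. from an initial separation), and all function names are assumptions.

```python
import numpy as np

def generalized_correlations(X, covs):
    """Q_k = (1/N) X C_k^{-1} X^T for each known source covariance C_k (N x N)."""
    N = X.shape[1]
    return [X @ np.linalg.solve(C, X.T) / N for C in covs]

def head_residual(B, Q):
    """K x K matrix whose k-th column is B Q_k B^T e_k - e_k (zero at a HEAD solution)."""
    K = B.shape[0]
    I = np.eye(K)
    return np.column_stack([B @ Q[k] @ B[k] - I[:, k] for k in range(K)])

def head_newton(Q, B0, n_iter=20, tol=1e-12):
    """Newton iteration for HEAD using multiplicative updates B <- (I + E) B."""
    K = B0.shape[0]
    I = np.eye(K)
    B = np.array(B0, dtype=float)
    for _ in range(n_iter):
        M = [B @ Qk @ B.T for Qk in Q]              # currently transformed matrices
        J = np.zeros((K * K, K * K))
        rhs = np.zeros(K * K)
        for k in range(K):
            for l in range(K):
                r = k * K + l                        # equation: entry l of column k
                rhs[r] = I[l, k] - M[k][l, k]
                for j in range(K):
                    J[r, l * K + j] += M[k][j, k]    # from (E M_k) e_k
                    J[r, k * K + j] += M[k][l, j]    # from (M_k E^T) e_k
        E = np.linalg.solve(J, rhs).reshape(K, K)
        B = (I + E) @ B
        if np.max(np.abs(E)) < tol:
            break
    return B
```

A typical call would be `B_hat = head_newton(generalized_correlations(X, covs), B_init)`, followed by a check that `head_residual(B_hat, Q)` is close to zero. The linear system becomes singular when two sources have proportional covariance matrices, in line with the non-identifiability condition of the bound.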

An Example To capture the essence of the performance improvement relative to classical methods, we first consider two sources with parametrically-controlled temporal and spectral diversity.

Experiment Setup We generated two MA sources of length, multiplied by temporal envelopes. The zeros are at: –Source 1: –Source 2: The envelope is a Laplacian window of nominal half-width, centered around: – for source 1; – for source 2.

Experiment Setup Dependence of spectral diversity on :

Comparative Performance Demonstration We compare the following approaches: –SOBI: Using ordinary correlation matrices up to lag 4. Can only capture the spectral diversity; –BGL: Using zero-lag correlation matrices taken over two blocks, one for, one for. Can only capture the temporal diversity; –SOBGL: Jointly diagonalize the SOBI and BGL matrices. Can capture both, but is ad-hoc and sub-optimal. –ML: optimally captures both.

Comparative Performance Demonstration

Now let's consider the fully-blind case

The Fully-Blind Case: Obviously, the semi-blind scenario is often unrealistic. We therefore turn to consider the fully-blind scenario, where the sources' covariance matrices are unknown. In some fully-blind scenarios, the unknown covariance matrices can be estimated from the data.

The Fully-Blind Case If the sources covariance matrices are succinctly parameterized in any way, and the sources exhibit sufficient ergodicity, these parameters may be estimated from a single realization of each source. For example, if the sources are stationary AR processes, the AR parameters may be consistently estimated from a single (sufficiently long) realization of each source. An iterative approach can then be taken: apply any initial (consistent) separation, estimate the covariance matrix for each separated source, and then plug the estimated covariance matrices into the semi-blind ML estimation. Repeat to refine, if necessary. This approach is taken, e.g. in WASOBI for AR sources.

The Fully-Blind Case For non-stationary sources, estimation of the covariance from a single realization is usually impossible (depending on the parameterization, if any). However, sometimes repeated realizations (snapshots) of the nonstationary mixture may be available, each realization being triggered to some external stimulus.

The Fully-Blind Case: This is a 3rd Data Dimension. It is then possible to take a similar iterative approach, where following initial separation (of all the mixtures' realizations), the covariance matrix of each source is estimated from the estimated ensemble of sources' realizations. If the covariance matrices can be succinctly parameterized, the required number of snapshots may be relatively modest.

What About the Bound? In the fully-blind case the vector of unknown parameters is augmented with parameters related to the unknown covariance matrices (at most $N(N+1)/2$ per each matrix, but possibly fewer). However, it can be shown (Yeredor, 2010) that if the determinants of the sources' covariance matrices are all known (namely, do not depend on the unknown parameters), then the resulting FIM is block-diagonal, with the two distinct blocks accounting for the elements of the mixing (or demixing) matrix and for the unknown parameters of the covariance matrices.

What About the Bound? This implies, that if the determinants of the covariance matrices are known, then the CRLB on estimation of all elements of the demixing (or mixing) matrices is the same in the fully-blind as in the semi-blind case.

What About the Bound? Moreover, it can further be shown that for the iCRLB to be the same in the semi-blind and fully-blind scenarios, knowledge of the determinants is not necessary. When the determinants are unknown, the only off-block-diagonal elements in the FIM under a non-mixing condition ($A = I$) are those involving the diagonal elements of $A$, whose variance does not affect the iCRLB.

What About the Bound? Therefore, the iCRLB is indifferent to the knowledge of the sources' covariance matrices. Of course, this does not mean in general that the ISR attainable in the semi-blind case is always attainable in the fully-blind case as well. Nevertheless, it does imply that in scenarios involving multiple independent snapshots, the ML estimate, attaining the iCRLB asymptotically (in the number of snapshots), would indeed exhibit the same (optimal) asymptotic performance in the fully-blind as in the semi-blind case.

Blind-ML Separation (the fully-blind scenario):
– Inputs: Snapshots of the observed mixtures: $X^{(1)}, \ldots, X^{(M)}$ ($M$ matrices, each $K \times N$).
– Apply some initial, consistent separation $\hat{B}_0$, obtaining estimates of the sources' snapshots: $\hat{S}^{(m)} = \hat{B}_0 X^{(m)}$, $m = 1, \ldots, M$.
– From the obtained snapshots of estimated sources, estimate the covariance matrices of each source, either directly or via some succinct parameterization, obtaining: $\hat{C}_1, \ldots, \hat{C}_K$ ($K$ matrices, each $N \times N$).

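Blind-ML Separation (the fully-blind scenario):
– Using the estimated covariance matrices $\hat{C}_k$ and the initially separated snapshots $\hat{S}^{(m)} = \hat{B}_0 X^{(m)}$, construct generalized correlation matrices: $\hat{Q}_k = \frac{1}{MN} \sum_{m=1}^{M} \hat{S}^{(m)} \hat{C}_k^{-1} \hat{S}^{(m)T}$, $k = 1, \ldots, K$.
– Obtain the estimated residual demixing matrix $\hat{B}_1$ as the solution to the HEAD problem, $\hat{B}_1 \hat{Q}_k \hat{B}_1^T e_k = e_k$.
– The estimated overall demixing matrix is $\hat{B} = \hat{B}_1 \hat{B}_0$.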
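The covariance-estimation and matrix-construction steps of this procedure can be sketched as below, assuming the snapshots are stored in a single array of shape (snapshots, sources, samples). The direct sample-covariance estimate is used for simplicity; it is invertible only when the number of snapshots is at least the observation length, which is why a succinct parameterization is preferable in practice. Function names are illustrative.

```python
import numpy as np

def estimate_source_covariances(S_hat):
    """Direct ensemble estimate of each source's N x N covariance.
    S_hat : array of shape (M, K, N), M snapshots of K estimated zero-mean sources.
    Returns an array of shape (K, N, N)."""
    M = S_hat.shape[0]
    return np.einsum("mkn,mkp->knp", S_hat, S_hat) / M

def snapshot_generalized_correlations(S_hat, covs):
    """Q_k = (1/(M N)) sum_m S_m C_k^{-1} S_m^T, averaged over all snapshots."""
    M, K, N = S_hat.shape
    Q = []
    for C in covs:
        C_inv = np.linalg.inv(C)
        Q.append(np.einsum("mkn,np,mjp->kj", S_hat, C_inv, S_hat) / (M * N))
    return Q
```

The resulting matrices are then passed to a HEAD solver to obtain the residual demixing matrix, and the procedure may be repeated to refine the estimates.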

Simulation Results For the fully-blind case

A Single-Snapshot (nearly) Fully-Blind Experiment: We first consider a case of cyclostationary sources, where the covariance matrix can be consistently estimated from a single snapshot. We generated four nonstationary AR sources, driven by periodically-modulated driving sequences.

A Single-Snapshot Experiment (contd.) is a Gaussian white-noise process. Let and then

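A Single-Snapshot Experiment (contd.): We used the following parameters for the four sources (note that the first two sources are stationary). The labels of the first three parameter rows did not survive in this transcript:
– Row 1: 0, 0, 0.6, 0.5
– Row 2: –, –, 55°, 145°
– Row 3: –, –, 70, 50
– Poles magnitudes: 0.75, 0.7, 0.75 (only three values survive)
– Poles phases: ±85°, ±95°, ±85°, ±95°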

A Single-Snapshot Experiment (contd.): Consistent (though sub-optimal) covariance matrices estimation (via estimation of the parameters):
– Attain a near-separation condition;
– Apply Yule-Walker equations to each separated source to obtain estimates of its AR parameters;
– Use inverse filtering to recover (estimate) the driving sequences;
– Using a DTFT of the squared estimated driving sequences, obtain an estimate of the period;
– Using a linear LS fit of the squared amplitude, obtain estimates of the remaining modulation parameters.
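The Yule-Walker and inverse-filtering steps can be sketched as follows. The sign convention $x[n] = \sum_i a_i\, x[n-i] + w[n]$ and the use of the biased sample autocorrelation are assumptions made here for illustration; the talk does not specify them, and the function names are hypothetical.

```python
import numpy as np

def sample_autocorr(x, p):
    """Biased sample autocorrelation of x at lags 0..p."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    return np.array([x[:N - k] @ x[k:] / N for k in range(p + 1)])

def yule_walker(r):
    """Solve the Yule-Walker equations for an AR(p) model.
    r : autocorrelation at lags 0..p.
    Returns (a, sigma2) for x[n] = sum_i a[i] * x[n-1-i] + w[n], var(w) = sigma2."""
    r = np.asarray(r, dtype=float)
    p = len(r) - 1
    R = r[np.abs(np.subtract.outer(np.arange(p), np.arange(p)))]   # Toeplitz matrix
    a = np.linalg.solve(R, r[1:])
    sigma2 = r[0] - a @ r[1:]
    return a, sigma2

def inverse_filter(x, a):
    """Estimate the driving sequence w[n] = x[n] - sum_i a[i] * x[n-1-i]."""
    x = np.asarray(x, dtype=float)
    w = x.copy()
    for i, ai in enumerate(a):
        w[i + 1:] -= ai * x[:len(x) - i - 1]
    return w
```

For a separated source `s_hat`, one would call `a, _ = yule_walker(sample_autocorr(s_hat, p))` and then analyze the square of `inverse_filter(s_hat, a)` to estimate the modulation period.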

A Single-Snapshot Experiment (contd.) The estimated parameters are then used for constructing the estimated covariance matrix (for each source): with and then

A Single-Snapshot Experiment: Results The results are presented in terms of each element of the ISR matrix, vs. the observation length.

A Multiple-Snapshot (nearly) Fully-Blind Experiment Next, we consider time-varying AR (TVAR) processes (of order 4): with such that the instantaneous poles drift linearly:

A Multiple-Snapshot (nearly) Fully-Blind Experiment The sources covariance matrices are estimated via estimation of the time-varying AR parameters for each source from multiple snapshots, using the Dym-Gohberg algorithm (1981), which is roughly based on a local Yule-Walker equations approach. The information on the linear poles variation is not exploited, no particular relation between the TVAR parameters is assumed. Note that covariance estimation under the TVAR model requires the estimation of roughly parameters, rather than parameters per source, thus requiring considerably fewer snapshots for reliable estimation.

A Multiple-Snapshot (nearly) Fully-Blind Experiment Results are presented in terms of the overall mean ISR (for all five sources) vs. the number of snapshots. The observation length was.

Things Worth Weighting For?… We have seen that the asymptotically-optimal ML separation applies the HEAD solution to a set of generalized correlation matrices. For AR sources of a given order, WASOBI applies optimally-weighted AJD to a set of ordinary correlation matrices. Are these two (apparently) different approaches related in any way?

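Let's Compare: Denote the generalized correlation matrices $Q_k = \frac{1}{N} X C_k^{-1} X^T$, $k = 1, \ldots, K$. Denote the ordinary correlation matrices $R_\tau = \frac{1}{N} \sum_n x[n]\, x^T[n+\tau]$ for lags $\tau = 0, 1, \ldots$, where $x[n]$ is the $n$-th column of $X$.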

Let's Compare: Assume nearly-separated sources, so the residual mixing matrix is $A = I + E$, where $E$ is a small matrix, to be estimated from the data. In addition, since the sources are nearly separated, all generalized and ordinary correlation matrices are nearly diagonal, meaning that their off-diagonal elements are generally much smaller than their diagonal elements.

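Let's Compare: HEAD: We need to solve the HEAD equations $\hat{B} Q_k \hat{B}^T e_k = e_k$, $k = 1, \ldots, K$, where $Q_k$ are the generalized correlation matrices. With a residual mixing $A = I + E$ and $E$ small, $\hat{B} \approx I - E$, but (neglecting small terms to second order) $\hat{B} Q_k \hat{B}^T \approx Q_k - E Q_k - Q_k E^T$, so (still neglecting small terms to second order, now also products of $E$ with the small off-diagonal elements of $Q_k$) $Q_k[\ell,k] \approx E[\ell,k]\, Q_k[k,k] + Q_k[\ell,\ell]\, E[k,\ell]$ for $\ell \neq k$, so that the elements $E[k,\ell]$ and $E[\ell,k]$ are only relevant for the matrices $Q_k$ and $Q_\ell$.

Let's Compare: HEAD: We end up solving in pairs (for each $k \neq \ell$): $Q_k[k,k]\, E[\ell,k] + Q_k[\ell,\ell]\, E[k,\ell] = Q_k[\ell,k]$ and $Q_\ell[k,k]\, E[\ell,k] + Q_\ell[\ell,\ell]\, E[k,\ell] = Q_\ell[k,\ell]$, or, in matrix form, $\begin{bmatrix} Q_k[k,k] & Q_k[\ell,\ell] \\ Q_\ell[k,k] & Q_\ell[\ell,\ell] \end{bmatrix} \begin{bmatrix} E[\ell,k] \\ E[k,\ell] \end{bmatrix} = \begin{bmatrix} Q_k[\ell,k] \\ Q_\ell[k,\ell] \end{bmatrix}$.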

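Let's Compare: WASOBI: The LS AJD fit requires: $\min_{A, \{\Lambda_\tau\}} \sum_\tau \| R_\tau - A \Lambda_\tau A^T \|_F^2$, where $R_\tau$ are the ordinary correlation matrices and $\Lambda_\tau$ are diagonal matrices. With $A = I + E$ and neglecting small terms to second order, $R_\tau[k,\ell] \approx E[k,\ell]\, \Lambda_\tau[\ell,\ell] + \Lambda_\tau[k,k]\, E[\ell,k]$ for $k \neq \ell$. The equations for a given pair $(k,\ell)$ (for all $\tau$) involve only $E[k,\ell]$ and $E[\ell,k]$.

Let's Compare: WASOBI: The LS equations can therefore be decoupled for each pair, taking the form: $y_{k,\ell} \approx H_{k,\ell}\, \theta_{k,\ell}$, where the vector $y_{k,\ell}$ collects the elements $R_\tau[k,\ell]$ over all lags, the two columns of $H_{k,\ell}$ collect $\Lambda_\tau[\ell,\ell]$ and $\Lambda_\tau[k,k]$ over all lags, and $\theta_{k,\ell} = [E[k,\ell],\ E[\ell,k]]^T$. But in order to apply optimal weighting, we generally need the joint covariance matrix of all the vectors $y_{k,\ell}$, which, for convenience, we shall refer to as Off-Diagonal Terms (ODIT) vectors.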

Rearrangements of Elements: The ODIT Vectors

Rearrangements of Elements: The ODIT Vectors

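Let's Compare: WASOBI (contd.): Fortunately, however, if the sources are (nearly) separated, this covariance matrix can be easily shown to be block-diagonal, and the optimally weighted LS equations can be decoupled as well. Writing $y_{k,\ell}$ for the ODIT vector of the pair $(k,\ell)$ (the elements $R_\tau[k,\ell]$ over all lags), $H_{k,\ell}$ for the two-column matrix of the corresponding diagonal elements and $\theta_{k,\ell} = [E[k,\ell],\ E[\ell,k]]^T$, each pair has a weight matrix $W_{k,\ell} = \mathrm{Cov}(y_{k,\ell})^{-1}$, leading to $\hat{\theta}_{k,\ell} = (H_{k,\ell}^T W_{k,\ell} H_{k,\ell})^{-1} H_{k,\ell}^T W_{k,\ell}\, y_{k,\ell}$.

Let's Compare: WASOBI (contd.): But this equation can also be written as: $(H_{k,\ell}^T W_{k,\ell} H_{k,\ell})\, \hat{\theta}_{k,\ell} = H_{k,\ell}^T W_{k,\ell}\, y_{k,\ell}$, a $2 \times 2$ linear system per pair. Recall that the HEAD equations for nearly separated sources also reduce to a $2 \times 2$ linear system in $E[k,\ell]$ and $E[\ell,k]$ for each pair, built from elements of the generalized correlation matrices $Q_k$ and $Q_\ell$.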

Let's Compare: Conclusion: Indeed, it can be shown that when all the sources are Gaussian AR processes whose maximal order is matched to the number of lags used, the two sets of equations become asymptotically equivalent, supporting the claim of general optimality of the optimal WLS fit of the estimated ordinary correlation matrices. In other cases (nonstationary / non-AR / higher order AR sources) the ordinary correlation matrices are not a sufficient statistic, and therefore even if optimal weighting is used, optimal separation cannot be attained without the explicit use of the appropriate generalized correlation matrices.

Let's Compare: Conclusion: However, in general fully-blind scenarios, when the sources' covariance matrices are unknown and the generalized correlation matrices cannot be constructed, it might make sense to use AJD of ordinary correlations instead (compromising optimality). Rather than having to estimate the sources' covariance matrices (each $N \times N$), it would suffice to estimate the covariance matrices of the ODIT vectors (each $L \times L$, with $L$ the number of correlation matrices), which would be rather easy using the multiple snapshots (if available).

An Example We used the same TVAR sources as before, modifying WASOBI to use an empirical ODIT weighting approach, which employs the empirical estimates of the covariance of each ODIT vector, based on the multiple snapshots.
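The empirical ODIT weighting idea can be sketched as a pairwise weighted least-squares fit. This is an illustration under several assumptions, not the modified WASOBI used in the talk: the sources are already nearly separated, the correlation matrices are symmetric, the residual mixing is modelled as $A = I + E$ with the first-order relation $R_\tau[k,\ell] \approx E[k,\ell]\, R_\tau[\ell,\ell] + R_\tau[k,k]\, E[\ell,k]$, and the number of snapshots exceeds the number of lags so that the empirical covariance is invertible.

```python
import numpy as np

def odit_wls(R):
    """Pairwise weighted LS estimate of the residual mixing perturbation E (A ~ I + E).
    R : array of shape (S, L, K, K) holding, for each of S snapshots, L symmetric
        correlation matrices of the K nearly separated sources.
    The weight for each pair is the inverse empirical covariance of its ODIT vector."""
    S, L, K, _ = R.shape
    R_mean = R.mean(axis=0)
    E = np.zeros((K, K))
    for k in range(K):
        for l in range(k + 1, K):
            y = R[:, :, k, l]                               # (S, L): ODIT vector per snapshot
            W = np.linalg.inv(np.cov(y, rowvar=False))      # empirical L x L weight matrix
            H = np.column_stack([R_mean[:, l, l], R_mean[:, k, k]])
            theta = np.linalg.solve(H.T @ W @ H, H.T @ W @ y.mean(axis=0))
            E[k, l], E[l, k] = theta
    return E
```

The residual demixing matrix is then approximately `np.linalg.inv(np.eye(K) + E)`, applied on top of the initial separation. Only an $L \times L$ covariance per source pair is estimated, rather than an $N \times N$ covariance per source.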

Conclusion:
– We considered the framework of SOS-based separation for Gaussian sources with arbitrary temporal covariance structures;
– We derived the iCRLB on the attainable ISR;
– For the semi-blind scenario we have shown that the asymptotically-optimal ML separation is attained by solving the HEAD problem for generalized correlation matrices;
– For the fully-blind case the iCRLB remains the same, and asymptotically-optimal blind-ML separation is still possible if the sources' covariance matrices can be consistently estimated (e.g., from multiple snapshots);

Conclusion:
– We explored the relation between applying HEAD to generalized correlation matrices and applying weighted AJD to ordinary correlation matrices;
– A sub-optimal alternative to estimating the sources' full covariance matrices in a fully-blind scenario is to estimate the covariance matrices of the much smaller ODIT vectors, and use them as weights in the AJD process (following an initial separation stage).

A Better Conclusion? Good things come to those who Weight… Thank You!
