Presentation transcript:

A note about gradient descent: Consider the function f(x) = (x - x_0)^2. Its derivative is f'(x) = 2(x - x_0). By gradient descent, dx/dt = -η df/dx = -2η(x - x_0), so x converges to the fixed point x_0. (If f(x) is more complex we usually cannot solve explicitly for the convergence to the fixed points.)
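A minimal numerical sketch of this example; the step size, starting point, and iteration count are arbitrary illustrative choices, not from the slide:

```python
import numpy as np

def grad_descent(x0=2.0, x_init=-5.0, eta=0.1, steps=50):
    """Minimize f(x) = (x - x0)^2 by gradient descent.

    f'(x) = 2*(x - x0), so each step moves x toward the fixed point x0.
    """
    x = x_init
    for _ in range(steps):
        grad = 2.0 * (x - x0)   # derivative of (x - x0)^2
        x = x - eta * grad      # gradient descent step
    return x

print(grad_descent())  # converges to ~2.0 (= x0)
```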

Solving the differential equation dx/dt = -2η(x - x_0), or in the general form dx/dt = -a(x - x_0): what is the solution of this type of equation? Try x(t) = x_0 + C e^{-at}.
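A short check of the trial solution, written under the reconstruction above (a, C, x_0 as in the general form):

```latex
\frac{dx}{dt} = -a\,(x - x_0), \qquad \text{try } x(t) = x_0 + C e^{-at}.
\quad\Rightarrow\quad
\frac{dx}{dt} = -a C e^{-at} = -a\,\bigl(x(t) - x_0\bigr),
```

so the trial solution satisfies the equation, and x(t) → x_0 as t → ∞ for any a > 0.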

Objective function formulation: We can define a function R_m (Intrator 1992). This function is called a risk / objective / energy / Lyapunov / index / contrast function in different settings. The minimization of this function can be obtained by gradient descent on the weights m: dm/dt = -η ∂R_m/∂m.

Therefore we obtain the deterministic gradient-descent dynamics, and the stochastic analog replaces the expectation over inputs by the instantaneous samples. It can be shown that the stochastic ODE converges to the deterministic ODE.
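A toy illustration of the deterministic vs. stochastic updates, on a simple quadratic risk rather than the BCM objective itself; the distribution, learning rate, and risk are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
x0 = 2.0                              # minimizer of the risk R(w) = E[(w - x)^2], x ~ N(x0, 1)
samples = rng.normal(x0, 1.0, 5000)

w_det = -5.0                          # deterministic descent: uses the expectation E[x] = x0
w_sto = -5.0                          # stochastic analog: uses one random sample per step
eta = 0.01
for x in samples:
    w_det -= eta * 2.0 * (w_det - x0)   # dR/dw = 2*(w - E[x])
    w_sto -= eta * 2.0 * (w_sto - x)    # same gradient with E[x] replaced by a sample

print(w_det, w_sto)   # both end up close to x0 = 2.0; the stochastic one fluctuates around it
```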

Using the objective function formulation, fixed points in various cases have been derived, and a connection has been established with the statistical theory of Projection Pursuit (see chapter 3 of Theory of Cortical Plasticity, Cooper, Intrator, Blais, Shouval).

One way of looking at PCA is that we move to a basis that diagonalizes the correlation matrix, whereby each PC grows independently. From the N principal components we can form a rotation matrix whose columns are the corresponding eigenvectors, such that in the rotated space the correlation matrix becomes diagonal.
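A minimal numpy sketch of this diagonalization, with made-up correlated 2-D data (variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
# Correlated, zero-mean 2-D data (illustrative)
X = rng.normal(size=(1000, 2)) @ np.array([[2.0, 0.0], [1.2, 0.5]])
X -= X.mean(axis=0)

C = X.T @ X / len(X)              # correlation (covariance) matrix of the zero-mean data
eigvals, V = np.linalg.eigh(C)    # columns of V are the principal directions

Y = X @ V                         # rotate into the PC basis
C_rot = Y.T @ Y / len(Y)          # = V.T @ C @ V
print(np.round(C_rot, 3))         # approximately diagonal: the PCs are uncorrelated
```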

Graphically: [data cloud with the first and second PC directions marked]. What does PCA do? Dimensionality reduction (hierarchy); it eliminates correlations by diagonalizing the correlation matrix. Another thing PCA does is find the projection (direction) of maximal variance: it maximizes the projected variance var(w·x) = w^T C w, assuming |w| = 1 and E{x} = 0.
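A quick check of the maximal-variance claim, comparing w^T C w for the leading eigenvector against random unit directions (same kind of made-up data as in the sketch above):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 2)) @ np.array([[2.0, 0.0], [1.2, 0.5]])
X -= X.mean(axis=0)
C = X.T @ X / len(X)

eigvals, V = np.linalg.eigh(C)
w_pc = V[:, -1]                           # eigenvector with the largest eigenvalue
var_pc = w_pc @ C @ w_pc                  # projected variance along the first PC

ws = rng.normal(size=(500, 2))            # random unit directions for comparison
ws /= np.linalg.norm(ws, axis=1, keepdims=True)
var_rand = np.einsum('ij,jk,ik->i', ws, C, ws)
print(var_pc >= var_rand.max())           # True: the first PC has maximal variance
```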

ICA – Independent Component Analysis. Definition of independent components: variables whose joint density factorizes into the product of the marginal densities. In ICA we typically assume this can be achieved by a linear transformation of the data, x' = Wx, such that the components of x' are independent. The approach described here follows most closely the work of Hyvärinen and Oja (1996, 2000).

Cocktail party effect: original signals s, mixed signals x. [Figure: two source waveforms s_1, s_2 and the two observed mixtures x_1, x_2.] The task of ICA is to estimate s from x, or equivalently to estimate the mixing matrix A, or its inverse W, such that s = Wx; in matrix notation the mixtures are x = As.
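A minimal cocktail-party sketch, assuming scikit-learn's FastICA implementation and made-up sources and mixing matrix; it illustrates the x = As / s = Wx setup, not the slides' own code:

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
t = np.linspace(0, 8, 2000)
s1 = np.sign(np.sin(3 * t))            # square wave (non-Gaussian source)
s2 = rng.laplace(size=t.size)          # super-Gaussian noise source
S = np.c_[s1, s2]

A = np.array([[1.0, 0.5],              # mixing matrix (unknown in practice)
              [0.7, 1.0]])
X = S @ A.T                            # observed mixtures: x = A s

ica = FastICA(n_components=2, random_state=0)
S_hat = ica.fit_transform(X)           # estimated sources (up to order, scale, and sign)
W = ica.components_                    # estimated unmixing matrix, s ≈ W x
```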

Illustration of ICA: [scatter plots of the joint distribution of the sources (s_1, s_2) and of the mixtures (x_1, x_2)]. (Note: the ICA approach only makes sense if the data is indeed a superposition of independent sources.)

Definition of independence: the joint density factorizes, p(y_1, y_2) = p(y_1) p(y_2). This implies that E{g_1(y_1) g_2(y_2)} = E{g_1(y_1)} E{g_2(y_2)} for any functions g_1 and g_2, and it also implies the special case of decorrelation: E{y_1 y_2} = E{y_1} E{y_2}. But the converse is not true; decorrelation does not imply independence. Example: a pair of discrete variables (y_1, y_2) taking the values (0, 1), (0, -1), (1, 0), (-1, 0), each with probability ¼. These variables are uncorrelated, but show at home: 1. that they are uncorrelated, 2. that they are not independent.
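A quick numerical check of the two homework items, assuming the four points on the slide are (0, 1), (0, -1), (1, 0), (-1, 0), as in the Hyvärinen and Oja review that the slides follow:

```python
import numpy as np

# Four equiprobable points: uncorrelated but not independent
pts = np.array([(0, 1), (0, -1), (1, 0), (-1, 0)], dtype=float)
y1, y2 = pts[:, 0], pts[:, 1]

# 1. Uncorrelated: E{y1 y2} = E{y1} E{y2} (both sides are 0 here)
print(np.mean(y1 * y2), np.mean(y1) * np.mean(y2))                # 0.0  0.0

# 2. Not independent: E{y1^2 y2^2} != E{y1^2} E{y2^2}
print(np.mean(y1**2 * y2**2), np.mean(y1**2) * np.mean(y2**2))    # 0.0  0.25
```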

Non-Gaussian is independent (caveat: the signals cannot be Gaussian). Central limit theorem: a sum of many independent random variables approaches a Gaussian distribution as the number of variables increases. Consequently, a sum of two independent, non-Gaussian random variables is "more Gaussian" than each of the signals.
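A small numerical illustration of the "more Gaussian" claim, using uniform variables and excess kurtosis as an informal measure of Gaussianity; the sample size is arbitrary:

```python
import numpy as np
from scipy.stats import kurtosis   # excess kurtosis: 0 for a Gaussian

rng = np.random.default_rng(0)
u = rng.uniform(-1, 1, 100000)     # non-Gaussian (sub-Gaussian) variable
v = rng.uniform(-1, 1, 100000)     # independent copy

# The scaled sum has excess kurtosis closer to 0 (more Gaussian) than either term
print(kurtosis(u), kurtosis((u + v) / np.sqrt(2)))   # ~ -1.2  vs  ~ -0.6
```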

Example: here the independent components are sub-Gaussian (light tails); an exponential is super-Gaussian (heavy tails).

Contrast functions to measure "distance" from a Gaussian distribution. 1. Kurtosis – a standard, simple-to-understand measure based on the fourth moment. There are two forms of kurtosis; one is kurt(y) = E{y^4} - 3 (E{y^2})^2. We typically assume that E{y} = 0 and E{y^2} = 1, so kurt(y) = E{y^4} - 3. At home: calculate the kurtosis of a uniform distribution from -1 to 1 and of an exponential exp(-|x|).
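A numerical version of the homework, using the simplified form kurt(y) = E{y^4} - 3 after standardizing each sample; for reference, the analytic values are -1.2 for the uniform and +3 for the exponential (Laplace) density:

```python
import numpy as np

rng = np.random.default_rng(0)

def kurt(y):
    """kurt(y) = E{y^4} - 3 for zero-mean, unit-variance y."""
    y = (y - y.mean()) / y.std()
    return np.mean(y**4) - 3.0

uniform = rng.uniform(-1, 1, 200000)    # analytic value: -1.2 (sub-Gaussian)
laplace = rng.laplace(0, 1, 200000)     # density (1/2) exp(-|x|): analytic value +3 (super-Gaussian)
gauss   = rng.normal(0, 1, 200000)      # 0 by construction

print(kurt(uniform), kurt(laplace), kurt(gauss))
```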

Other options (cost functions):
1. Negentropy (HO paper, 2000)
2. KL distance between P(x_1, …, x_n) and P(x_1)···P(x_n) (HO paper, 2000)
3. BCM objective function (Theory of Cortical Plasticity book, ch. 3)

We will use here an approach similar to the objective function formulation of BCM: use gradient descent to maximize the kurtosis. (What does the sign of the update depend on?) However, this rule is not stable against growth of w, and therefore an additional constraint should be used to keep |w|^2 = 1. This can be done with a trick similar to Oja's rule. For another approach (FastICA), see HO, 2000.
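A minimal sketch of this recipe (gradient step on the kurtosis, then renormalize w), on made-up whitened data with two super-Gaussian sources; the learning rate, mixing matrix, and iteration count are illustrative assumptions, and this is a plain gradient version rather than the FastICA fixed-point iteration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two independent super-Gaussian sources, mixed and then whitened
S = rng.laplace(size=(5000, 2))
A = np.array([[1.0, 0.6], [0.4, 1.0]])
X = S @ A.T
X -= X.mean(axis=0)
d, E = np.linalg.eigh(np.cov(X.T))
X = X @ E / np.sqrt(d)                       # whitening: covariance becomes the identity

w = rng.normal(size=2)
w /= np.linalg.norm(w)
eta = 0.1
for _ in range(200):
    y = X @ w
    sign = np.sign(np.mean(y**4) - 3.0)      # + for super-Gaussian, - for sub-Gaussian projections
    grad = (X * y[:, None]**3).mean(axis=0)  # ~ E{x (w.x)^3}, the kurtosis gradient direction
    w += eta * sign * grad                   # gradient step that increases |kurtosis|
    w /= np.linalg.norm(w)                   # keep |w| = 1 (the constraint from the slide)

# w now points along one independent component: X @ w is proportional to one source (up to sign)
print(w)
```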

Example 1 – cash flow in retail stores: [figure showing the original (preprocessed) data and 5 independent components].

Example 2: ICA from natural images. Are these independent components of natural images? ICA, BCM and Projection Pursuit.

Projection Pursuit – find non-Gaussian projections.

Generating heterosynaptic and homosynaptic models: From kurtosis we obtained a learning rule; this alone is not stable against growth of w. We can use the same trick to keep w normalized as in the Oja rule; the normalization contributes a heterosynaptic term, and for this case we obtain a heterosynaptic rule. (Note the different uses of the term "heterosynaptic".)

There is another form for kurtosis, normalized by the variance: kurt(y) = E{y^4} / (E{y^2})^2 - 3. The gradient rule derived from this form produces a (more complex) homosynaptic rule with a sliding threshold.

What are the consequences of these different rules? General form: [rule equation, with the heterosynaptic term indicated].

Monocular deprivation, homosynaptic model (BCM): [figure panels for high noise and low noise].

Monocular deprivation, heterosynaptic model (K2): [figure panels for high noise and low noise].

Noise dependence of MD: two families of synaptic plasticity rules, homosynaptic (QBCM, K1, S1) and heterosynaptic (S2, K2, PCA). [Figure: normalized time vs. noise std for each rule.] Blais, Shouval, Cooper, PNAS, 1999.

What have we learned so far?