Discovering Cyclic Causal Models by Independent Components Analysis
Gustavo Lacerda, Peter Spirtes, Joseph Ramsey, Patrik O. Hoyer

Structural Equation Models (SEMs)
Graphical models that represent causal relationships. In model M, x3 = f3(x1, x2) and x4 = f4(x3). Manipulating x3 to a fixed value k replaces the equation for x3, giving the manipulated model M(do(x3 = k)), in which x3 = k and x4 = f4(x3).
[Figure: the graph x1, x2 → x3 → x4, before and after the intervention on x3]
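
As a minimal sketch of what "replacing an equation" means (not from the slides; f3, f4, and the intervention value are hypothetical stand-ins):

```python
import numpy as np

def simulate(do_x3=None, n=5, seed=0):
    """Simulate the SEM x3 = f3(x1, x2), x4 = f4(x3).
    If do_x3 is given, the equation for x3 is replaced by x3 = do_x3."""
    rng = np.random.default_rng(seed)
    x1 = rng.normal(size=n)
    x2 = rng.normal(size=n)
    f3 = lambda a, b: a + b        # hypothetical f3
    f4 = lambda c: 2.0 * c         # hypothetical f4
    x3 = np.full(n, float(do_x3)) if do_x3 is not None else f3(x1, x2)
    x4 = f4(x3)
    return x1, x2, x3, x4

observational = simulate()            # model M
manipulated = simulate(do_x3=1.0)     # model M(do(x3 = 1.0))
```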

Structural Equation Models (SEMs)
Can be acyclic… or cyclic. The data produced by cyclic models can be interpreted as equilibrium points of dynamical systems.
[Figure: an acyclic graph and a cyclic graph over x1, x2, x3, x4]

Linear Structural Equation Models (SEMs) (deterministic example)
The structural equations are linear, e.g.:
x3 = 1.2 x1 + (…) x2 − 3
x4 = −5 x3 + 1
Each edge weight tells us the corresponding coefficient.
[Figure: the graph x1, x2 → x3 → x4 with edge weights]

Linear Structural Equation Models (SEMs) (with randomness)
Now each variable has an additive noise term with non-zero variance:
x1 = e1
x2 = e2
x3 = 1.2 x1 + (…) x2 − 3 + e3
x4 = −5 x3 + 1 + e4
In matrix form: x = B x + e
[Figure: the graph with edge weights and error terms e1…e4]

Linear Structural Equation Models (SEMs) (with randomness)
x = B x + e. Solving for x, we get: x = (I − B)⁻¹ e. Let A = (I − B)⁻¹; then x = A e. A is called the “mixing matrix”.
[Figure: the graph with edge weights and error terms e1…e4]
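
To make the algebra concrete, here is a minimal simulation sketch. It uses the legible coefficients from the earlier slide (1.2 on x1 → x3, −5 on x3 → x4) and a hypothetical 0.5 for the x2 → x3 coefficient that is not legible in the transcript; intercepts are dropped since they don't affect the structure:

```python
import numpy as np

# B[i, j] = coefficient of x_{j+1} in the equation for x_{i+1}.
# 1.2 and -5 are from the slides; 0.5 is a hypothetical stand-in.
B = np.array([[0.0, 0.0,  0.0, 0.0],
              [0.0, 0.0,  0.0, 0.0],
              [1.2, 0.5,  0.0, 0.0],
              [0.0, 0.0, -5.0, 0.0]])

A = np.linalg.inv(np.eye(4) - B)             # the mixing matrix

rng = np.random.default_rng(0)
e = rng.uniform(-1.0, 1.0, size=(4, 1000))   # independent non-Gaussian noise
x = A @ e                                    # observed data: x = A e
```

Since the model is a DAG, (I − B) is always invertible here.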

Linear Structural Equation Models (SEMs) (with randomness)
The “mixing matrix” shows how the noise propagates.
[Figure: the graph with error terms e1…e4 feeding x1…x4]

Linear Structural Equation Models (SEMs) (with randomness)
The “mixing matrix” shows how the noise propagates: each entry of A gives the total effect of one error term e_j on one variable x_i.
[Figure: the mixing matrix laid out as a table, rows x1…x4, columns e1…e4]

What can we learn from observational data alone?
Until recently, the best we could do was identify the d-separation equivalence class. We couldn't tell the difference between:
M1: x1 → x2
M2: x1 ← x2

Why not?
Because it was assumed that the error terms are Gaussian… and when they are Gaussian, these two graphs (M1: x1 → x2 and M2: x1 ← x2) are distribution-equivalent.

Independent Components Analysis (ICA)
The cocktail party problem: x = A e. You want to get back the original signals e, but all you have are the mixtures x. What can you do?
[Figure: two microphones x1, x2 recording mixtures of two sources e1, e2]

Independent Components Analysis (ICA)
The equation x = A e has infinitely many solutions: for any invertible A, there is a solution! But if you assume that the signals are independent, it is possible to estimate A and e from just x. How?

Independent Components Analysis (ICA)
Any choice of A implies a list of samples of e.
Each list of implied samples of e has a degree of independence.
We want the A for which the implied e's are maximally independent.
e's maximally independent ↔ e's maximally non-Gaussian. Intuition: the Central Limit Theorem (sums of independent signals look more Gaussian than the signals themselves).
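
As a quick illustration (not from the slides), here is how one might run ICA on a synthetic two-microphone mixture using scikit-learn's FastICA; the mixing matrix is hypothetical:

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
e = rng.uniform(-1.0, 1.0, size=(2000, 2))   # two independent non-Gaussian sources

A_true = np.array([[1.0, 0.6],               # hypothetical mixing matrix
                   [0.4, 1.0]])
x = e @ A_true.T                             # observed mixtures: x = A e

ica = FastICA(n_components=2, random_state=0)
e_hat = ica.fit_transform(x)                 # estimated sources
A_hat = ica.mixing_                          # estimated mixing matrix
```

The estimated sources come back in arbitrary order and scale, which is exactly the indeterminacy described on the next slide.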

Independent Components Analysis (ICA)
Permutation: we don't know which source signal is which, i.e. which is Alex and which is Bob.
Scaling: when ICA is used with SEMs, the variance of each error term is confounded with its coefficients on each x.

The LiNGAM approach (Shimizu et al., 2006)
What happens if we generate data from this linear SEM… and then run ICA?
[Figure: a four-variable DAG over x1…x4 with error terms e1…e4 and edge weights]

The LiNGAM approach
We would expect to see the mixing matrix of the generating model… except that ICA doesn't know the scaling.
[Figure: the expected mixing matrix, rows x1…x4, columns e1…e4]

The LiNGAM approach
So we should expect to see something like a rescaled version… and we'd need to normalize by dividing all children of e1 by 2.
[Figure: the same matrix with e1's column scaled up by 2]

The LiNGAM approach
…getting us the correctly scaled matrix. Except that ICA doesn't know the order of the e's, i.e. which e's go with which x's…
[Figure: the matrix after rescaling]

The LiNGAM approach
Really, ICA gives us something like a row-permuted matrix. So first we need to find the right permutation of the e's, and then do the scaling. Note that, since the model is a DAG, there is exactly one valid way to permute the error terms.
[Figure: the matrix as returned by ICA, then after permutation, then after scaling]

The LiNGAM approach
After some matrix magic (B = I − A⁻¹), we get back the generating model.
[Figure: the recovered DAG over x1…x4 with edge weights and error terms e1…e4]
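
A sketch of that matrix magic, assuming we are given ICA's unmixing matrix W ≈ A⁻¹ with rows in arbitrary order and scale. The permutation step uses the Hungarian algorithm (SciPy's linear_sum_assignment), as mentioned in Appendix 2:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def recover_B(W):
    """Recover the coefficient matrix B from an ICA unmixing matrix W.

    W is A^{-1} up to row permutation and scaling. We (1) find the row
    permutation that makes the diagonal zeroless, here by minimizing
    sum(-log|W_ii|) via the Hungarian algorithm, (2) rescale each row so
    the diagonal is 1, and (3) read off B = I - W_tilde.
    """
    eps = 1e-12
    cost = -np.log(np.abs(W) + eps)              # small |W_ij| -> large cost
    rows, cols = linear_sum_assignment(cost)
    P = np.zeros_like(W)
    P[cols, rows] = 1.0                          # permutation matrix
    W_perm = P @ W                               # zeroless diagonal
    W_tilde = W_perm / np.diag(W_perm)[:, None]  # unit diagonal
    return np.eye(W.shape[0]) - W_tilde
```

For a DAG, exactly one permutation yields a zeroless diagonal, so this recovers B uniquely (up to estimation error).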

The LiNGAM approach
Discovers the full structure of the DAG… by assuming causal sufficiency (i.e. independence of the error terms).
“Causal sufficiency”: no latent variable is a cause of more than one observed variable.
In the linear case, causal sufficiency ↔ independence of the error terms.
In particular, M1 (x1 → x2) and M2 (x1 ← x2) can now be distinguished!

The LiNGAM approach
[Figure: scatterplots contrasting Gaussian and Uniform error distributions. Images by Patrik Hoyer et al., used with permission, from “Estimation of causal effects using linear non-Gaussian causal models with hidden variables”.]

The LiNGAM approach
Note that, once the valid permutation was found, there were no left-pointing arrows. This is because:
- the generating model was a DAG, and
- we wrote down the x's in an order compatible with it.
But it is possible for ICA to return a matrix that does not satisfy the acyclicity assumption; in that case, LiNGAM will pretend the offending (red) edge is not there.
[Figure: a coefficient matrix with one red entry violating acyclicity]

The LiNGAM approach
LiNGAM cannot discover cyclic models, because:
- it assumes the data was generated by a DAG, so
- it searches for a single valid permutation.
If we search for any number of valid permutations, then we can discover cyclic models too. That's exactly what we did!

The LiNG-DG approach
When the data looks acyclic, it works just like LiNGAM and returns a single model. When the data looks cyclic, more than one permutation is considered valid, so it returns a distribution-equivalent set containing more than one model. “Distribution-equivalent” means you can't do better, at least without experimental data or further assumptions.

The LiNG-DG approach
Let's simulate using this model. Error terms are generated by sampling from a Gaussian and squaring the samples (making them non-Gaussian). We test which ICA coefficients are zero by using bootstrap sampling followed by a quantile test. Ready?
[Figure: a five-variable cyclic model over x1…x5 with error terms e1…e5]
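
Here is a sketch of the two ingredients just described. The quantile test judges a coefficient nonzero when its bootstrap interval excludes 0; estimate_coeff stands for a user-supplied function (e.g. rerun ICA on the resample and read off one entry), and n_boot and alpha are illustrative settings, not the paper's exact choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def squared_gaussian_noise(size):
    # Non-Gaussian errors: sample from a Gaussian and square the samples
    return rng.normal(size=size) ** 2

def bootstrap_quantile_test(data, estimate_coeff, n_boot=200, alpha=0.05):
    """Return True if the coefficient is judged nonzero: the bootstrap
    (alpha/2, 1 - alpha/2) interval of estimate_coeff excludes 0."""
    n = data.shape[0]
    boots = []
    for _ in range(n_boot):
        resample = data[rng.integers(0, n, size=n)]
        boots.append(estimate_coeff(resample))
    lo, hi = np.quantile(boots, [alpha / 2, 1 - alpha / 2])
    return not (lo <= 0.0 <= hi)
```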

The LiNG-DG approach
LiNG-DG returns a set with 2 models:
[Figure: the two distribution-equivalent models, #1 and #2]

LiNG-DG + the stability assumption
Note that only one of these models is stable. If our data is a set of equilibria, then the true model must be stable. Under what conditions are we guaranteed to have a unique stable model?

LiNG-DG + the stability assumption
Theorem: if the true model's cycles don't intersect, then only one model in the equivalence class is stable.
For simple cycle models, cycle products come in reciprocal pairs: c1 = 1/c2. So at least one cycle product will be > 1 (in modulus), and the corresponding model is unstable.
When cycles don't intersect, each cycle works independently, and any valid permutation (except the identity permutation) will invert at least one cycle product, creating an unstable model.
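
A minimal way to check the stability of a candidate model, assuming the usual criterion that the spectral radius of B must be below 1 for the dynamics x(t+1) = B x(t) + e to converge to an equilibrium:

```python
import numpy as np

def is_stable(B):
    # Stable iff every eigenvalue of B has modulus < 1
    return np.max(np.abs(np.linalg.eigvals(B))) < 1.0

# Example: a 2-cycle x1 <-> x2 with cycle product 0.8 * 0.9 = 0.72 (stable);
# the reciprocal model has cycle product 1/0.72 > 1 (unstable).
B_stable = np.array([[0.0, 0.8], [0.9, 0.0]])
B_unstable = np.array([[0.0, 1 / 0.8], [1 / 0.9, 0.0]])
print(is_stable(B_stable), is_stable(B_unstable))   # True False
```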

What should one use?
- Gaussian errors, DAG: constraint-based methods, e.g. PC, CPC, SGS (or Geiger and Heckerman 1994 for a Bayesian alternative) → the d-separation equivalence class.
- Gaussian errors, directed graph (DG): Richardson's CCD → a very large class, not even covariance-equivalent.
- Non-Gaussian errors, DAG: LiNGAM → a unique model.
- Non-Gaussian errors, DG: LiNG-DG → 2 cases: acyclic → a unique model; cyclic → a distribution-equivalence class.
- Unknown, or both, or too little data: check out Hoyer, Hyvärinen, Glymour, Spirtes, Scheines, Ramsey, Lacerda, Shimizu (submitted).

The UAI paper is due soon! Please send me your comments:

Appendix 1: self-loops
The equilibrium equations usually correspond to the dynamical equations, EXCEPT when a self-loop has coefficient 1: then we will get the wrong structure, and the predicted results of interventions will be wrong! Self-loop coefficients are underdetermined by equilibrium data. Our stability results only hold if we assume there are no self-loops.

Appendix 2: search and pruning
Testing zeros: local vs. non-local methods. To estimate the variance of the estimated coefficients, we use bootstrap sampling, carefully.
How to find row-permutations of W that have a zeroless diagonal:
- Acyclic case: the Hungarian algorithm.
- General case: k-best linear assignments, or constrained n-Rooks (put rooks on the non-zero entries).
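
A brute-force sketch of the general-case search, enumerating all row orderings whose diagonal is zeroless (fine for small n; the k-best linear assignment or n-Rooks approaches above scale better):

```python
import numpy as np
from itertools import permutations

def zeroless_diagonal_permutations(W, tol=1e-3):
    """Yield all row orderings of W whose diagonal has no (near-)zero entry.
    In the acyclic case exactly one ordering survives; in the cyclic case
    several can, giving one candidate model per surviving ordering."""
    n = W.shape[0]
    for perm in permutations(range(n)):
        # Row perm[i] is placed at position i, so the diagonal is W[perm[i], i]
        if all(abs(W[perm[i], i]) > tol for i in range(n)):
            yield perm
```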