
Weakening the Causal Faithfulness Assumption
Peter Spirtes, Jiji Zhang

Faithfulness comes in several flavors; each is a kind of principle that selects simpler models (in a certain sense) over more complicated ones. We show how to weaken the standard Faithfulness Assumption so that it needs to be applied in fewer circumstances. We show how to weaken the strong (ε-)faithfulness assumption so that it does not prohibit the existence of weak edges. We show how to modify the causal search algorithms so that they make fewer mind changes as the sample size grows.

True Graph (X → Z ← Y, Z → W):
W = aZ + ε_W
Z = bX + cY + ε_Z
X = ε_X
Y = ε_Y
[Figure: the true DAG and alternative graphs over X, Y, Z, W.]
Entailed independencies: I_P(W,X|Z) = 0, I_P(W,Y|Z) = 0, I_P(X,Y|∅) = 0.
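The model above can be simulated directly. Below is a minimal Python sketch; the coefficient values and the numpy-based helpers are illustrative assumptions, not from the slides. It checks that the entailed independencies hold approximately in a sample:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
a, b, c = 0.8, 0.5, -0.7                 # hypothetical coefficient values

X = rng.normal(size=n)                   # X = eps_X
Y = rng.normal(size=n)                   # Y = eps_Y
Z = b * X + c * Y + rng.normal(size=n)   # Z = bX + cY + eps_Z
W = a * Z + rng.normal(size=n)           # W = aZ + eps_W

def residualize(target, given):
    """Remove the least-squares linear effect of `given` from `target`."""
    slope, intercept = np.polyfit(given, target, 1)
    return target - (slope * given + intercept)

print(np.corrcoef(X, Y)[0, 1])                                  # ~0: I_P(X,Y|∅)
print(np.corrcoef(residualize(W, Z), residualize(X, Z))[0, 1])  # ~0: I_P(W,X|Z)
print(np.corrcoef(residualize(W, Z), residualize(Y, Z))[0, 1])  # ~0: I_P(W,Y|Z)
```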

S1. Form the complete undirected graph H on the given set of variables V.
S2. For each pair of variables X and Y in V, search for a subset S of V\{X, Y} such that X and Y are independent conditional on S. Remove the edge between X and Y in H iff such a set is found.
S3. Let K be the graph resulting from S2. For each unshielded triple X, Y, Z (i.e., X and Y are adjacent, Y and Z are adjacent, but X and Z are not adjacent), if X and Z are independent conditional on some subset of V\{X, Z} that does not contain Y, then orient the triple as a collider: X → Y ← Z.
S4. Execute the entailed orientation rules.
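As a concrete illustration of S1-S3, here is a sketch in Python. The `ci_test` callback is a hypothetical user-supplied conditional-independence test (oracle or statistical); the orientation step uses the standard shortcut of consulting the separating set recorded in S2:

```python
from itertools import combinations

def sgs_skeleton(variables, ci_test):
    """S1-S2: start complete, remove an edge iff some subset separates the pair."""
    adj = {v: set(variables) - {v} for v in variables}   # S1: complete graph
    sepset = {}
    for x, y in combinations(variables, 2):
        rest = [v for v in variables if v not in (x, y)]
        for size in range(len(rest) + 1):
            hit = next((set(c) for c in combinations(rest, size)
                        if ci_test(x, y, set(c))), None)
            if hit is not None:
                adj[x].discard(y); adj[y].discard(x)
                sepset[frozenset((x, y))] = hit
                break
    return adj, sepset

def orient_colliders(adj, sepset):
    """S3: orient an unshielded triple x - y - z as x -> y <- z when the
    separating set found for (x, z) does not contain y."""
    return [(x, y, z)
            for y in adj
            for x, z in combinations(sorted(adj[y]), 2)
            if z not in adj[x] and y not in sepset[frozenset((x, z))]]
```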

Causal Markov Assumption: For a set of variables for which there are no unmeasured common causes, each variable is independent of its non-effects conditional on its direct causes.
Non-obvious equivalent formulation: If I_G(X,Y|Z) holds in causal DAG G with no unmeasured common causes, then I_P(X,Y|Z) = 0.
Converse of the Causal Markov Assumption: If I_P(X,Y|Z) = 0, then I_G(X,Y|Z) holds in causal DAG G.
If I_P(X,Y|Z) is a rational function of the parameters, then violations of the converse are of Lebesgue measure 0.
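The graphical relation I_G(X,Y|Z) here is d-separation, which can be checked with the usual moral-graph criterion. A minimal sketch, assuming a networkx DiGraph (the function name is ours):

```python
from itertools import combinations
import networkx as nx

def d_separated(G, x, y, z):
    """Return True iff I_G(x, y | z): restrict G to the ancestors of
    {x, y} ∪ z, moralize, delete z, and test whether x and y are
    disconnected in the result."""
    relevant = {x, y} | set(z)
    anc = set(relevant)
    for v in relevant:
        anc |= nx.ancestors(G, v)
    H = G.subgraph(anc)
    M = nx.Graph(H.edges())                    # drop edge directions
    M.add_nodes_from(H.nodes)                  # keep isolated vertices
    for v in H.nodes:                          # marry co-parents
        for p, q in combinations(list(H.predecessors(v)), 2):
            M.add_edge(p, q)
    M.remove_nodes_from(z)
    return not nx.has_path(M, x, y)
```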

Reduction of Underdetermination: If I(A,B|∅), then prefer A → C ← B to A → C → B.
Computational Efficiency: If A – C – B and I(A,B|∅), then one does not need to check I(A,B|C).
Statistical Efficiency: The Markov equivalence class can be found without testing independence conditional on any set larger than the maximum degree of any variable in the true causal graph.

If causal sufficiency and the Causal Markov and Causal Faithfulness Assumptions hold, then there exist pointwise consistent estimators of the Markov equivalence class: SGS, PC, GES (Gaussian, multinomial).
If one assumes only the Causal Markov Assumption and causal sufficiency, there are no pointwise consistent estimators of the Markov equivalence class, whether Gaussian, multinomial, or unrestricted.

Even given causal sufficiency and the Causal Markov and Causal Faithfulness Assumptions, there is no uniformly consistent estimator of the Markov equivalence class, whether Gaussian, multinomial, or unrestricted.

(A4: ε-faithfulness) The partial correlations between X(i) and X(j) given {X(r); r ∈ k} for some set k ⊆ {1,…,p_n}\{i,j} are denoted ρ_{n;i,j|k}. Their absolute values are bounded from below and above:
sup_{n, i≠j, k} |ρ_{n;i,j|k}| ≤ M < 1,   inf{ |ρ_{n;i,j|k}| : ρ_{n;i,j|k} ≠ 0 } ≥ c_n.
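For a Gaussian model, these partial correlations can be read off the covariance matrix by inverting the relevant submatrix. A small sketch (the helper names are ours) that scans for the smallest nonzero |ρ| such a lower bound would have to cover:

```python
from itertools import combinations
import numpy as np

def partial_corr(Sigma, i, j, k):
    """rho_{i,j|k} from covariance matrix Sigma via the precision matrix
    of the submatrix on {i, j} ∪ k."""
    idx = [i, j] + list(k)
    P = np.linalg.inv(np.asarray(Sigma)[np.ix_(idx, idx)])
    return -P[0, 1] / np.sqrt(P[0, 0] * P[1, 1])

def min_nonzero_partial_corr(Sigma, tol=1e-12):
    """Smallest |rho_{i,j|k}| over all pairs and conditioning sets,
    ignoring exact zeros (up to `tol`)."""
    p = Sigma.shape[0]
    smallest = 1.0
    for i, j in combinations(range(p), 2):
        rest = [r for r in range(p) if r not in (i, j)]
        for size in range(len(rest) + 1):
            for k in combinations(rest, size):
                r = abs(partial_corr(Sigma, i, j, k))
                if tol < r < smallest:
                    smallest = r
    return smallest
```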

Uhler et al.: (A4) tends to be violated fairly often if the parameter values are assigned randomly and ε is not very small. There are two ways to get very small partial correlations: almost-cancellations and very weak edges. (A4) forbids both; in particular, it entails that there are no very weak edges.

[Figure: alternative graphs over X, Y, Z, W.]

[Figure: search outputs on the X, Y, Z, W example at Small, Medium-, Medium+, and Large sample sizes, given tests of I_P(W,{X,Y}|Z), I_P(W,{X,Y}|∅), I_P(X,Y|∅), and I_P(W,Z|∅).]

True Graph: X → Y → Z → W
Small Sample output: X – Y – Z – W
Large Sample output: X – Y – Z → W
(Relevant tests: I_P(X,Z|Y), I_P(Y,W|{X,Z}), I_P(X,W|∅).)

[Figure: a graph over X, Y, Z, W.]

[Figure: the sample-size comparison over X, Y, Z, W, repeated: outputs at Small, Medium-, Medium+, and Large sample sizes, given tests of I_P(W,{X,Y}|Z), I_P(W,{X,Y}|∅), I_P(X,Y|∅), and I_P(W,Z|∅).]

S3*. Let K be the undirected graph resulting from S2. For each unshielded triple X, Y, Z:
If X and Z are not independent conditional on any subset of V\{X, Z} that contains Y, then orient the triple as a collider: X → Y ← Z.
If X and Z are not independent conditional on any subset of V\{X, Z} that does not contain Y, then mark the triple as a non-collider.
Otherwise, mark the triple as ambiguous (or unfaithful).
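A sketch of this triple classification, again with a hypothetical `ci_test` callback standing in for the independence oracle or test:

```python
from itertools import combinations

def classify_triple(x, y, z, variables, ci_test):
    """Classify the unshielded triple x - y - z per S3*: 'collider' if no
    separating set for (x, z) contains y, 'non-collider' if every
    separating set contains y, else 'ambiguous'."""
    rest = [v for v in variables if v not in (x, z)]
    seps = [set(c) for size in range(len(rest) + 1)
            for c in combinations(rest, size)
            if ci_test(x, z, set(c))]
    if seps and all(y not in s for s in seps):
        return 'collider'          # orient x -> y <- z
    if seps and all(y in s for s in seps):
        return 'non-collider'
    return 'ambiguous'
```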

Adjacency-Faithfulness: If X and Y are adjacent in the causal DAG, then I_P(X,Y|Z) ≠ 0 for every conditioning set Z.

Triangle-Faithfulness: For any three variables X, Y, Z that form a triangle in causal DAG G:
If Z is a non-collider on the path, then X and Y are not independent conditional on any subset of V\{X, Y} that does not contain Z.
If Z is a collider on the path, then X and Y are not independent conditional on any subset of V\{X, Y} that contains Z.
Suppose X → Y ← Z and I_P(X,Z|Y) = 0. This distribution is faithful to X → Y → Z, so the violation cannot be detected; its absence must be assumed.

For the triangle over X, Y, Z: ¬I(X,Z|∅), ¬I(Y,Z|∅), ¬I(X,Y|Z).
For four variables X, Y, Z, W:
¬I(X,Z|∅), ¬I(X,Z|W), ¬I(X,Z|{Y,W})
¬I(Y,Z|∅), ¬I(Y,Z|W), ¬I(Y,Z|{X,W})
¬I(X,Y|Z), ¬I(X,Y|W), ¬I(X,Y|{Z,W})
¬I(X,W|∅), ¬I(X,W|Z), ¬I(X,W|Y)
¬I(Y,W|∅), ¬I(Y,W|X), ¬I(Y,W|Z)
¬I(Z,W|∅), ¬I(Z,W|X), ¬I(Z,W|Y)

Causal Minimality: the population distribution is not Markov to any proper subDAG of the true causal DAG. Causal Minimality is entailed by the manipulation definition of causation if the distribution is positive. There is a weaker kind of causal minimality, P-minimality: the population distribution is not Markov to any DAG that entails a proper superset of the conditional independence relations. Is this sufficient for the correctness of VCSGS?

True Graph: X → Y → Z → W
Small Sample output: X – Y – Z – W
Large Sample output: X – Y – Z – W
(Relevant tests: I_P(X,Z|Y), I_P(Y,W|{X,Z}), I_P(X,W|∅).)

True Graph: X → Y → Z → W
Small Sample output: X – Y – Z – W
Large Sample output: X – Y – Z → W
(Relevant tests: I_P(X,Z|Y), I_P(Y,W|{X,Z}), I_P(X,W|∅).)

V1. Form the complete undirected graph H on the given set of variables V.
V2. For each pair of variables X and Y in V, search for a subset S of V\{X, Y} such that X and Y are independent conditional on S. Remove the edge between X and Y in H, and mark the pair as 'apparently non-adjacent', if and only if such a set is found.
V3. Let K be the graph resulting from V2. For each apparently unshielded triple X, Y, Z (i.e., X and Y are adjacent, Y and Z are adjacent, but X and Z are apparently non-adjacent):
If X and Z are not independent conditional on any subset of V\{X, Z} that contains Y, then orient the triple as a collider: X → Y ← Z.
If X and Z are not independent conditional on any subset of V\{X, Z} that does not contain Y, then mark the triple as a non-collider.
Otherwise, mark the triple as ambiguous (or unfaithful), and mark the pair as 'definitely non-adjacent'.

V4. Execute the same orientation rules as in S4, until none of them applies.
V5. Let M be the graph resulting from V4. For each consistent disambiguation of the ambiguous triples in M (i.e., each disambiguation that leads to a pattern), test whether each vertex V in the resulting pattern satisfies the Markov condition. If V and W satisfy the Markov condition in every pattern, then mark the 'apparently non-adjacent' pair as 'definitely non-adjacent'.
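The Markov check in V5 is local: each vertex should be independent of its non-descendants (other than its parents) given its parents. A sketch for one disambiguated pattern read as a DAG; networkx and the `ci_test` callback are our assumptions, not the slides':

```python
import networkx as nx

def satisfies_local_markov(G, v, ci_test):
    """Test the local Markov condition at v in DAG G: v independent of
    each non-descendant outside its parent set, given its parents."""
    parents = set(G.predecessors(v))
    nondesc = set(G.nodes) - nx.descendants(G, v) - parents - {v}
    return all(ci_test(v, u, parents) for u in nondesc)

def markov_everywhere(G, ci_test):
    """Apply the check at every vertex of the pattern."""
    return all(satisfies_local_markov(G, v, ci_test) for v in G.nodes)
```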

[Figure: logical relationships among Faithfulness, Adjacency-Faithfulness, Triangle-Faithfulness, and P-Minimality.]

If the Triangle-Faithfulness, Causal Minimality, and Causal Markov Assumptions hold, then VCSGS is a consistent estimator of the extended Markov equivalence class. Is it complete?

V5*. Let M be the graph resulting from V4. For each consistent disambiguation of the ambiguous triples in M (i.e., each disambiguation that leads to a pattern), test whether each vertex V in the resulting pattern satisfies the Markov condition. If V and W satisfy the Markov condition in some pattern, then mark the 'apparently non-adjacent' pair as 'definitely non-adjacent'.

Assumption NVV(J): [formula not preserved in the transcript]
Assumption UBC(C): [formula not preserved in the transcript]

Given a set of variables V, suppose the true causal model over V is M = ⟨P, G⟩, where P is a Gaussian distribution over V and G is a DAG with vertex set V. (k-Triangle-Faithfulness) For any three variables X, Y, Z that form a triangle in G (i.e., each pair of vertices is adjacent):
If Y is a non-collider on the path, then |r(X,Z|W)| ≥ k · |e_M(X – Z)| for all W ⊆ V that do not contain Y; and
If Y is a collider on the path, then |r(X,Z|W)| ≥ k · |e_M(X – Z)| for all W ⊆ V that do contain Y,
where e_M(X – Z) is the linear coefficient associated with the X – Z edge in M.
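A sketch of checking this condition for a concrete linear Gaussian model, reusing the `partial_corr` helper from the earlier sketch; `edge_coef` and `index` are our own bookkeeping, not notation from the slides:

```python
from itertools import combinations
import networkx as nx

def k_triangle_violations(G, Sigma, edge_coef, index, k):
    """List (X, Y, Z, W) tuples where |r(X, Z | W)| < k * |e_M(X - Z)| for
    a triangle X - Y - Z of DAG G, contrary to k-Triangle-Faithfulness.
    edge_coef[(a, b)] is the coefficient on the a -> b edge; index[v] is
    v's row in the covariance matrix Sigma."""
    violations = []
    for clique in nx.enumerate_all_cliques(G.to_undirected()):
        if len(clique) != 3:
            continue
        for y in clique:
            x, z = sorted(set(clique) - {y})
            collider = G.has_edge(x, y) and G.has_edge(z, y)  # x -> y <- z
            e = edge_coef.get((x, z), edge_coef.get((z, x), 0.0))
            others = [v for v in G.nodes if v not in (x, z)]
            for size in range(len(others) + 1):
                for W in combinations(others, size):
                    if (y in W) != collider:   # wrong family of sets for this case
                        continue
                    r = partial_corr(Sigma, index[x], index[z],
                                     [index[w] for w in W])
                    if abs(r) < k * abs(e):
                        violations.append((x, y, z, W))
    return violations
```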

S3* (sample version). Let K be the undirected graph resulting from the adjacency phase. For each unshielded triple X, Y, Z:
If there is a set W not containing Y such that the test of r(X,Z|W) = 0 returns 0 (i.e., accepts the hypothesis), and for every set U that contains Y, the test of r(X,Z|U) = 0 returns 1 (i.e., rejects the hypothesis) and the test of |r(X,Z|U) – r(X,Z|W)| ≥ L returns 0 (i.e., accepts the hypothesis), then orient the triple as a collider: X → Y ← Z.
If there is a set W containing Y such that the test of r(X,Z|W) = 0 returns 0 (i.e., accepts the hypothesis), and for every set U that does not contain Y, the test of r(X,Z|U) = 0 returns 1 (i.e., rejects the hypothesis) and the test of |r(X,Z|U) – r(X,Z|W)| ≥ L returns 0 (i.e., accepts the hypothesis), then mark the triple as a non-collider.
Otherwise, mark the triple as ambiguous.
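The zero-partial-correlation tests invoked here are typically Fisher z tests. A self-contained sketch, assuming SciPy is available; the function name and the α threshold are our choices:

```python
import numpy as np
from scipy.stats import norm

def test_zero_partial_corr(data, x, y, cond, alpha=0.05):
    """Return 1 to reject rho(x, y | cond) = 0, else 0, via Fisher's z.
    `data` is an (n, p) sample array; x, y, cond are column indices."""
    n = data.shape[0]
    idx = [x, y] + list(cond)
    P = np.linalg.inv(np.corrcoef(data[:, idx], rowvar=False))
    r = -P[0, 1] / np.sqrt(P[0, 0] * P[1, 1])   # sample partial correlation
    z = 0.5 * np.log((1 + r) / (1 - r))         # Fisher transform, atanh(r)
    stat = np.sqrt(n - len(cond) - 3) * abs(z)
    return int(stat > norm.ppf(1 - alpha / 2))
```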

Say that CSGS(L, n, M) errs if it contains (i) an adjacency not in G_M, (ii) a marked non-collider not in G_M, or (iii) an orientation not in G_M.
Theorem: Given causal sufficiency of the measured variables V and the Causal Markov, k-Triangle-Faithfulness, NVV(J), and UBC(C) Assumptions, the CSGS algorithm is uniformly consistent in the sense that
lim_{n→∞} sup_M P(CSGS(L, n, M) errs) = 0.

Estimation Algorithm:
For each vertex Z: if any vertex not adjacent to Z is not confirmed to be non-adjacent to Z, return 'Unknown' for every edge containing Z.
Otherwise:
For every non-adjacent pair in EP(G), let the estimate be 0.
For each vertex Z such that all of the edges containing Z are oriented in EP(G): if Y is a parent of Z in EP(G), let the estimate be the sample regression coefficient of Y in the regression of Z on its parents in EP(G).
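The per-vertex regression step looks like this in Python; the array bookkeeping is our own, with `index[v]` mapping a variable to its data column:

```python
import numpy as np

def parent_coefficients(data, z, parents, index):
    """Regress z on its parents and return {parent: sample coefficient},
    the edge estimates assigned when all edges at z are oriented and all
    non-adjacencies at z are confirmed."""
    X = np.column_stack([data[:, [index[p] for p in parents]],
                         np.ones(data.shape[0])])        # add intercept
    coefs, *_ = np.linalg.lstsq(X, data[:, index[z]], rcond=None)
    return dict(zip(parents, coefs[:-1]))
```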

Let M1 be an output of the Estimation Algorithm, and M2 be a causal model. We define the structural coefficient distance d[M1, M2] between M1 and M2 to be the maximum, over pairs of variables X and Y, of |e_M1(X – Y) – e_M2(X – Y)|, where by convention a term is 0 if e_M1(X – Y) = 'Unknown'.
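Under that reading of the definition, the distance is simple to compute. A sketch assuming `truth` lists a coefficient (possibly 0) for every variable pair:

```python
def structural_coefficient_distance(est, truth):
    """Max over pairs of |e_M1 - e_M2|, with an 'Unknown' estimate
    contributing 0 by the stated convention."""
    gaps = [0.0]
    for pair, e_true in truth.items():
        e_hat = est.get(pair, 0.0)
        if e_hat != "Unknown":
            gaps.append(abs(e_hat - e_true))
    return max(gaps)
```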

E1. Run the CSGS algorithm on an i.i.d. sample of size n from P_M.
E2. Let the output from E1 be CSGS(L, n, M). Apply step V5 of the VCSGS algorithm (from Section 3), using tests of zero partial correlations, and record which non-adjacencies are confirmed.
E3. Apply the Estimation Algorithm to CSGS(L, n, M), the confirmed non-adjacencies, and the sample of size n.

Given causal sufficiency of the measured variables V and the Causal Markov, k-Triangle-Faithfulness, NVV(J), and UBC(C) Assumptions, the Edge Estimation I algorithm is uniformly consistent in the sense that for every δ > 0
lim_{n→∞} sup_M P(d[M̂(L, n, M), M] > δ) = 0,
where M̂(L, n, M) is the estimated model output by the algorithm.
For a large enough and dense enough graph, this still allows for the possibility of large manipulation errors (due to many small edge errors).

[Figure: example models M1, M2, and M3 over X1, X2, X3.]

If k > 0.014, then the k-Triangle-Faithfulness Assumption is violated for models M2 and M3, but not for M1. If k is smaller, lying below 0.014 but above a lower threshold, then the k-Triangle-Faithfulness Assumption is violated for M3, but not for M1 or M2.

E1. Run Edge Estimation Algorithm I.
E2. Set ForbiddenOrientations = {}.
E3. For each maximal clique in CSGS(L, n, M) such that, if a vertex in the clique is not adjacent to some vertex not in the clique, it is definitely non-adjacent to it:
(i) For each possible orientation O of all of the unoriented edges in the maximal clique:
Apply the orientation O to each of the unoriented edges.
Apply Meek's orientation rules.
If application of the rules produces a cycle or a new unshielded collider, add O to ForbiddenOrientations.
Also add O to ForbiddenOrientations if, for some Y and W such that Y is a non-collider on the path and W ⊆ V does contain Y, the corresponding test of the k-Triangle-Faithfulness inequality fails.

E4. For each unoriented edge X – Y in CSGS(L, n, M): if there is only one orientation X → Y that does not occur in ForbiddenOrientations, and every vertex that Y is not adjacent to is definitely non-adjacent to Y, orient the edge as X → Y.
E5. For each vertex V such that some edge containing V in CSGS(L, n, M) is not oriented: if there is only one orientation of all of the edges containing V that is not in ForbiddenOrientations, and every vertex that V is not adjacent to is definitely non-adjacent to V, let the estimate of each edge be the corresponding sample regression coefficient in the regression of V on its parents under the non-forbidden orientation.
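A sketch of the orientation enumeration in E3: direct the unoriented clique edges every possible way and discard choices that create a cycle or a new unshielded collider. Meek's rules and the k-Triangle-Faithfulness test are omitted here, and networkx and the helper names are our assumptions:

```python
from itertools import combinations, product
import networkx as nx

def new_unshielded_collider(H, before):
    """True if H contains a collider p -> v <- q with p, q non-adjacent
    that was not already fully oriented in `before`."""
    for v in H.nodes:
        for p, q in combinations(list(H.predecessors(v)), 2):
            if H.has_edge(p, q) or H.has_edge(q, p):
                continue                               # shielded triple
            if not (before.has_edge(p, v) and before.has_edge(q, v)):
                return True
    return False

def allowed_orientations(directed_part, unoriented_edges):
    """Enumerate full orientations of `unoriented_edges` on top of the
    DiGraph `directed_part`, keeping only acyclic ones with no new
    unshielded collider."""
    keep = []
    for flips in product((False, True), repeat=len(unoriented_edges)):
        H = directed_part.copy()
        H.add_edges_from((b, a) if f else (a, b)
                         for (a, b), f in zip(unoriented_edges, flips))
        if (nx.is_directed_acyclic_graph(H)
                and not new_unshielded_collider(H, directed_part)):
            keep.append(H)
    return keep
```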

Theorem: Given causal sufficiency of the measured variables V and the Causal Markov, k-Triangle-Faithfulness, NVV(J), and UBC(C) Assumptions, the Edge Estimation II algorithm is uniformly consistent in the sense that for every δ > 0
lim_{n→∞} sup_M P(d[M̂(L, n, M), M] > δ) = 0,
where O(L, n, M) is the graphical output of the Edge Estimation II algorithm and M̂(L, n, M) is its estimated model output.

We weakened the assumption of faithfulness so that fewer inferences from conditional independence to d-separation need to be made. We strengthened the assumption so that it allows one to make inferences from "almost independence" in a probability distribution to d-separation in a causal graph, allowing for the existence of uniformly consistent estimation algorithms.

We changed the concept of correctness to allow for missing weak edges and for saying "don't know" about some features of Markov equivalence classes. The new simplicity assumption broke up the Markov equivalence class, in the sense that it considers some models in a Markov equivalence class simpler than other models in the same class. This allowed for uniformly consistent estimates of linear coefficients in a causal model, as well as of causal structure.

Can we get similar results for:
PC?
FCI?
non-linear models?
increasing numbers of variables and vertex degree, and decreasing k (analogous to Kalisch and Bühlmann)?
If parameter values are randomly assigned, how often is k-Triangle-Faithfulness violated, as a function of:
sample size?
clique size?
parameter distribution?
k?

Kalisch, M., and Bühlmann, P. (2007). Estimating high-dimensional directed acyclic graphs with the PC-algorithm. Journal of Machine Learning Research 8, 613–636.
Spirtes, P., and Zhang, J. (forthcoming). A Uniformly Consistent Estimator of Causal Effects Under the k-Triangle-Faithfulness Assumption. Statistical Science.
Spirtes, P., and Zhang, J. (submitted). Three Faces of Faithfulness. Synthese.