1 Some Comments on Sebastiani et al Nature Genetics 37(4)2005.


1 Some Comments on Sebastiani et al Nature Genetics 37(4)2005

2 Bayesian Classifiers & Structure Learners
They come in several varieties designed to balance the following properties to different degrees:
1. Expressiveness: can they learn/represent arbitrary functions, or only constrained ones?
2. Computational tractability: can they perform learning and inference fast?
3. Sample efficiency: how much sample is needed?
4. Structure discovery: can they be used to infer structural relationships, even causal ones?

3 Variants:
- Exhaustive Bayes
- Simple (aka Naïve) Bayes
- Bayesian Networks (BNs)
- TANs (Tree-Augmented Simple Bayes)
- BAN (Bayes Net Augmented Simple Bayes)
- Bayesian Multinets (TAN or BAN based)
- FAN
Several others exist but are not examined here (e.g., MB classifiers, model averaging)

4 Exhaustive Bayes
Bayes’ Theorem (or formula) says that:
$$P(D \mid F) = \frac{P(D)\,P(F \mid D)}{P(F)}$$
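As a worked illustration of the formula, the sketch below computes P(D | F) for a binary disease and a single positive finding. The prevalence, sensitivity and false-positive rate are hypothetical numbers chosen for the example, and the function name `posterior` is likewise an assumption, not something from the paper.

```python
def posterior(prior, sensitivity, false_positive_rate):
    """P(D | F) for a binary disease D and a positive finding F (Bayes' theorem)."""
    # P(F) via the law of total probability over D and not-D.
    p_finding = prior * sensitivity + (1.0 - prior) * false_positive_rate
    return prior * sensitivity / p_finding


# Hypothetical inputs: 1% prevalence, 90% sensitivity, 5% false-positive rate.
p = posterior(prior=0.01, sensitivity=0.90, false_positive_rate=0.05)
print(round(p, 4))
```

Even with a sensitive finding, the posterior stays modest when the prior is low, which is why the prior P(D) cannot be dropped from the formula.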

5 Exhaustive Bayes
1. Expressiveness: can learn any function
2. Computational tractability: exponential
3. Sample efficiency: exponential
4. Structure discovery: does not reveal structure

6 Simple Bayes Requires that findings are independent conditioned on the disease states (note: this does not mean that the findings are independent in general, but rather, that they are conditionally independent).

7 Simple Bayes: less sample and less computation, but more restricted in what it can learn than Exhaustive Bayes
Simple Bayes can be implemented by plugging the following into the main formula:
$$P(F \mid D_j) = \prod_i P(F_i \mid D_j)$$
where $F_i$ is the i-th (singular) finding and $D_j$ the j-th (singular) disease.
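A minimal sketch of how the product form plugs into Bayes' theorem, assuming binary findings. The disease labels `D1`/`D2`, the probabilities and the function name are all hypothetical and serve only to make the arithmetic concrete.

```python
from math import prod


def simple_bayes_posterior(priors, likelihoods, findings):
    """Posterior over diseases, assuming findings are conditionally independent.

    priors:      {disease: P(D_j)}
    likelihoods: {disease: [P(F_i present | D_j) for each finding i]}
    findings:    list of booleans, one per finding (True = present)
    """
    unnormalized = {}
    for disease, prior in priors.items():
        # P(F | D_j) as a product of per-finding terms.
        terms = [p if present else 1.0 - p
                 for p, present in zip(likelihoods[disease], findings)]
        unnormalized[disease] = prior * prod(terms)
    total = sum(unnormalized.values())  # this is P(F)
    return {d: v / total for d, v in unnormalized.items()}


# Hypothetical two-disease, two-finding example.
priors = {"D1": 0.1, "D2": 0.9}
likelihoods = {"D1": [0.9, 0.8], "D2": [0.2, 0.6]}
post = simple_bayes_posterior(priors, likelihoods, [True, True])
print(post)
```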

8 Naive Bayes
1. Expressiveness: can learn a small fraction of functions (that shrinks exponentially fast as # of dimensions grows)
2. Computational tractability: linear
3. Sample efficiency: needs a number of parameters linear in the # of variables; each parameter can be estimated fairly efficiently since it involves conditioning on one variable (the class node). E.g., in the diagnosis context one needs only the prevalence of the disease and the sensitivity of each finding for the disease.
4. Structure discovery: does not reveal structure
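The exponential versus linear contrast can be made concrete by counting free parameters. This is a rough sketch that assumes binary findings and a small number of disease states; the function names are illustrative.

```python
def exhaustive_bayes_parameters(n_findings, n_diseases=2):
    # One full joint table over all binary findings per disease, plus the disease prior.
    return n_diseases * (2 ** n_findings - 1) + (n_diseases - 1)


def simple_bayes_parameters(n_findings, n_diseases=2):
    # One P(F_i | D_j) per finding and disease, plus the disease prior.
    return n_diseases * n_findings + (n_diseases - 1)


for n in (5, 10, 20):
    print(n, exhaustive_bayes_parameters(n), simple_bayes_parameters(n))
```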

9 Bayesian Networks: Achieve a trade-off between the flexibility of Exhaustive Bayes and the tractability of Simple Bayes. They also allow discovery of structural relationships.

10 Bayesian Networks
1. Expressiveness: can represent any function
2. Computational tractability: depends on the dependency structure of the underlying distribution. It is worst-case intractable, but for sparse or tree-like networks it can be very fast. Representational tractability is excellent in sparse networks.
3. Sample efficiency: there is no formal characterization because (a) it depends heavily on the underlying structure of the distribution and (b) in most practical learners local errors propagate to remote areas in the network. Large-scale empirical studies show that very complicated structures (i.e., with hundreds or even thousands of variables and medium to small densities) can be learned accurately with relatively small samples (i.e., a few hundred samples).
4. Structure discovery: under well-defined and reasonable conditions, BNs are capable of revealing causal structure.

11 Bayesian Networks: The Bayesian Network Model and Its Uses
BN = Graph (variables (nodes), dependencies (arcs)) + Joint Probability Distribution + Markov Property
The graph has to be a DAG (directed acyclic graph) in the standard BN model.
[Figure: example graph over nodes A, B and C]
JPD:
P(A+, B+, C+) = 0.006
P(A+, B+, C-) = 0.014
P(A+, B-, C+) = 0.054
P(A+, B-, C-) = 0.126
P(A-, B+, C+) = 0.240
P(A-, B+, C-) = 0.160
P(A-, B-, C+) = 0.240
P(A-, B-, C-) = 0.160
Theorem: any JPD can be represented in BN form

12 Bayesian Networks: The Bayesian Network Model and Its Uses
Markov Property: the probability distribution of any node N given its parents P is independent of any subset W of the non-descendant nodes of N.
[Figure: example graph over nodes A, B, C, D, E, F, G, H, I and J]
e.g.:
$$D \perp \{B, C, E, F, G\} \mid A$$
$$F \perp \{A, D, E, F, G, H, I, J\} \mid B, C$$

13 Bayesian Networks: The Bayesian Network Model and Its Uses
Theorem: the Markov property enables us to decompose (factor) the joint probability distribution into a product of prior and conditional probability distributions:
$$P(V) = \prod_i P(V_i \mid Pa(V_i))$$
[Figure: example graph over nodes A, B and C]
The original JPD:
P(A+, B+, C+) = 0.006
P(A+, B+, C-) = 0.014
P(A+, B-, C+) = 0.054
P(A+, B-, C-) = 0.126
P(A-, B+, C+) = 0.240
P(A-, B+, C-) = 0.160
P(A-, B-, C+) = 0.240
P(A-, B-, C-) = 0.160
Becomes:
P(A+) = 0.2
P(B+ | A+) = 0.1
P(B+ | A-) = 0.5
P(C+ | A+) = 0.3
P(C+ | A-) = 0.6
Up to Exponential Saving in Number of Parameters!
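Multiplying the conditional tables back together is a quick check of the factorization. The sketch below assumes the graph A → B, A → C (the structure implied by the conditional tables) and takes P(A+) = 0.2, the value implied by the joint table, whose four A+ entries sum to 0.2. Variable and function names are illustrative.

```python
from itertools import product

p_a_true = 0.2                          # P(A+), implied by the joint table
p_b_given_a = {True: 0.1, False: 0.5}   # P(B+ | A)
p_c_given_a = {True: 0.3, False: 0.6}   # P(C+ | A)


def bernoulli(p_true, value):
    """Probability of `value` for a binary variable with P(True) = p_true."""
    return p_true if value else 1.0 - p_true


def joint(a, b, c):
    # P(A, B, C) = P(A) * P(B | A) * P(C | A)
    return (bernoulli(p_a_true, a)
            * bernoulli(p_b_given_a[a], b)
            * bernoulli(p_c_given_a[a], c))


table = {abc: joint(*abc) for abc in product([True, False], repeat=3)}
for (a, b, c), p in table.items():
    signs = "".join(f"{name}{'+' if v else '-'} " for name, v in zip("ABC", (a, b, c)))
    print(signs, round(p, 3))
```

Five numbers reproduce all eight joint entries (seven free parameters), which is the saving the slide refers to; the gap widens quickly as the number of variables grows.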

14 Bayesian Networks: The Bayesian Network Model and Its Uses
Once we have a BN model of some domain we can ask questions:
[Figure: example graph over nodes A, B, C, D, E, F, G, H, I and J]
Forward: P(D+, I- | A+) = ?
Backward: P(A+ | C+, D+) = ?
Forward & Backward: P(D+, C- | I+, E+) = ?
Arbitrary abstraction/Arbitrary predictors/predicted variables
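The same kinds of queries can be sketched on the small three-node example from the earlier slides, assuming the graph A → B, A → C and P(A+) = 0.2 as implied by its joint table. The sketch answers queries by brute-force enumeration of the joint, which is only practical for tiny networks; real BN engines use more efficient inference. The function names are illustrative.

```python
from itertools import product

VARS = ("A", "B", "C")


def joint(a, b, c):
    # P(A, B, C) = P(A) * P(B | A) * P(C | A) with the example's parameters.
    p_a = 0.2 if a else 0.8
    p_b = 0.1 if a else 0.5
    p_c = 0.3 if a else 0.6
    return p_a * (p_b if b else 1 - p_b) * (p_c if c else 1 - p_c)


def query(target, evidence):
    """P(target | evidence), both given as {variable: bool} dictionaries."""
    def mass(constraints):
        # Sum the joint over every assignment consistent with the constraints.
        total = 0.0
        for values in product([True, False], repeat=len(VARS)):
            world = dict(zip(VARS, values))
            if all(world[k] == v for k, v in constraints.items()):
                total += joint(*values)
        return total
    return mass({**target, **evidence}) / mass(evidence)


forward = query({"B": True, "C": False}, {"A": True})    # P(B+, C- | A+)
backward = query({"A": True}, {"B": True, "C": True})    # P(A+ | B+, C+)
print(round(forward, 4), round(backward, 4))
```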

15 Other Restricted Bayesian Classifiers: TANs, BANs, FANs, Multinets

16 Other Restricted Bayesian Classifiers: TANs, BANs, FANs, Multinets
1. Expressiveness: can represent limited classes of functions (more expressive than SB, less so than BNs)
2. Computational tractability: worse than Simple Bayes, often faster than BNs.
3. Sample efficiency: there is no formal characterization. Empirical studies so far are limited, but the results are promising.
4. Structure discovery: not designed to reveal causal structure.

17 TANs
The TAN classifier extends Naïve Bayes with “augmenting” edges among findings such that the resulting network among the findings is a tree.
[Figure: TAN with class node D and findings F1, F2, F3, F4]

18 TAN multinet
The TAN multinet classifier uses a different TAN for each value of D and then chooses the predicted class to be the value of D that has the highest posterior given the findings (over all TANs).
[Figure: three TANs over findings F1, F2, F3, F4, one each for D=1, D=2 and D=3]

19 BANs
The BAN classifier extends Naïve Bayes with “augmenting” edges among findings such that the resulting network among the findings is a graph.
[Figure: BAN with class node D and findings F1, F2, F3, F4]

20 FANs (Finite Mixture Augmented Naïve Bayes)
The FAN classifier extends Naïve Bayes by modeling extra dependencies among findings via an unmeasured hidden confounder (Finite Mixture model) parameterized via EM.
[Figure: FAN with class node D, hidden node H and findings F1, F2, F3, F4]

21 How feasible is it to learn structure accurately with Bayesian Network learners and realistic samples?
An abundance of empirical evidence shows that it is very feasible. A few examples:
- C.F. Aliferis, G.F. Cooper. "An Evaluation of an Algorithm for Inductive Learning of Bayesian Belief Networks Using Simulated Data Sets". In Proceedings of Uncertainty in Artificial Intelligence, 1994. Result: 67 random BNs with samples from <200 to 1500 and up to 50 variables obtained a mean sensitivity of 92% and a mean superfluous arcs ratio of 5%.
- I. Tsamardinos, L.E. Brown, C.F. Aliferis. "The Max-Min Hill-Climbing Bayesian Network Structure Learning Algorithm". Machine Learning Journal, 2006. Result: 22 networks from 20 to 5000 variables, and samples from 500 to 5000, yielding excellent Structural Hamming Distances (for details please see the paper).
- I. Tsamardinos, C.F. Aliferis, A. Statnikov. "Time and Sample Efficient Discovery of Markov Blankets and Direct Causal Relations". In Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 24-27, 2003, Washington, DC, USA, ACM Press, pages 673-678. Result: 8 networks with 27 to 5000 variables and 500 to 5000 samples yield an average sensitivity/specificity of 90%.
See www.dsl-lab.org for details.
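To make the kind of scoring behind these results concrete, the sketch below compares a learned arc set with a known true one. It is a simplified, DAG-level illustration: the cited papers define their own metrics (for instance, the Structural Hamming Distance in the 2006 paper is defined on equivalence-class graphs), so the numbers here need not match theirs. The function name and the toy arc sets are hypothetical.

```python
def structure_metrics(true_arcs, learned_arcs):
    """Compare two directed arc sets, each given as (parent, child) pairs."""
    true_arcs, learned_arcs = set(true_arcs), set(learned_arcs)
    true_skeleton = {frozenset(arc) for arc in true_arcs}
    learned_skeleton = {frozenset(arc) for arc in learned_arcs}

    missing = true_skeleton - learned_skeleton    # adjacencies not recovered
    extra = learned_skeleton - true_skeleton      # adjacencies that should not exist
    shared = true_skeleton & learned_skeleton
    # Right adjacency, wrong direction.
    reversed_arcs = {arc for arc in learned_arcs
                     if frozenset(arc) in shared and arc not in true_arcs}

    return {
        "sensitivity": len(true_arcs & learned_arcs) / len(true_arcs),
        "missing": len(missing),
        "extra": len(extra),
        "reversed": len(reversed_arcs),
        # Simplified Hamming-style distance: one unit per missing, extra or reversed arc.
        "hamming": len(missing) + len(extra) + len(reversed_arcs),
    }


# Hypothetical toy example.
true_arcs = {("A", "B"), ("A", "C"), ("B", "D")}
learned_arcs = {("A", "B"), ("C", "A"), ("C", "D")}
metrics = structure_metrics(true_arcs, learned_arcs)
print(metrics)
```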

22 Other comments on the paper
1. The paper aspires to build powerful classifiers and to reveal structure in one modeling step. Several important predictive modeling approaches and structure learners that a priori seem more suitable are ignored.
2. Analysis is conducted exclusively with a commercial product owned by one of the authors. The conflict is disclosed in the paper.
3. Using (approximately) a BAN may facilitate parameterization; however, it does not facilitate structure discovery.
4. Ordering SNPs is a good idea.
5. No more than 3 parents per node means that approx. 20 samples are used for each independent cell in the conditional probability tables. Experience shows that this number is more than enough for sufficient parameterization IF this density is correct.
6. The proposed classifier achieves accuracy close to what one gets by classifying everything to the class with higher prevalence (since the distribution is very unbalanced). However, close inspection shows that the classification is much more discriminatory. Accuracy is a very poor metric to show this.
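The point about accuracy on unbalanced data can be illustrated with a toy example. The labels and scores below are entirely hypothetical (not the paper's data), and the area under the ROC curve is used as one common discrimination measure; the comments above do not name a specific alternative metric.

```python
def accuracy(labels, predictions):
    return sum(y == p for y, p in zip(labels, predictions)) / len(labels)


def auc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count 1/2)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical unbalanced sample: 10 positives, 90 negatives.
labels = [1] * 10 + [0] * 90

# "Classify everything to the majority class": a constant low score.
majority_scores = [0.1] * 100
# A model that ranks most positives above most negatives, but rarely crosses 0.5.
model_scores = [0.4] * 8 + [0.6] * 2 + [0.1] * 88 + [0.55] * 2

majority_pred = [int(s >= 0.5) for s in majority_scores]
model_pred = [int(s >= 0.5) for s in model_scores]

print(accuracy(labels, majority_pred), auc(labels, majority_scores))
print(accuracy(labels, model_pred), auc(labels, model_scores))
```

Both classifiers score the same accuracy, yet only the second separates the classes, which is the distinction accuracy alone cannot show.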

23 Other comments on the paper
7. A very appropriate analysis not pursued here is to convert the graph to its equivalence class and examine structural dependencies there.
8. No examination of structure stability across the 5 folds of cross-validation, or via bootstrapping.
9. Table 1 confuses explanatory with predictive modeling. SNP contributions are estimated in the very small sample while they should be estimated in the larger sample (Table 1 offers an explanatory analysis).
10. It is not clear what set each SNP/gene SNP set is removed from to compute Table 1.
11. Mixing source populations in the evaluation set may have biased the evaluation.
12. Discretization has a huge effect on structure discovery algorithms. The applied discretization procedure of continuous variables is suboptimal.
13. When using selected cases and controls, artifactual dependencies are introduced among some of the variables. This is well known and corrections to the Bayesian metric have been devised to deal with this. The paper ignores this even though its purpose is precisely to infer such dependencies.
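The artifact described in comment 13 can be shown with exact arithmetic on a toy model. Everything in the sketch is hypothetical: two independent binary markers X and Y, an additive risk model for the disease, and a design that keeps every case but only a small fraction of controls. It illustrates the artifact only, not the corrections to the Bayesian metric mentioned above.

```python
from itertools import product

P_X, P_Y = 0.3, 0.3          # hypothetical marker frequencies, independent in the population
CONTROL_SAMPLING = 0.05      # hypothetical fraction of controls kept in the study


def p_disease(x, y):
    return 0.01 + 0.2 * x + 0.2 * y   # hypothetical risk model


def joint_xy(selected):
    """Joint distribution of (X, Y) in the population or in the selected sample."""
    weights = {}
    for x, y in product([0, 1], repeat=2):
        p_xy = (P_X if x else 1 - P_X) * (P_Y if y else 1 - P_Y)
        if selected:
            pd = p_disease(x, y)
            # Every case is kept; each control is kept with probability CONTROL_SAMPLING.
            p_xy *= pd + (1 - pd) * CONTROL_SAMPLING
        weights[(x, y)] = p_xy
    total = sum(weights.values())
    return {k: v / total for k, v in weights.items()}


def covariance(joint):
    p_x = joint[(1, 0)] + joint[(1, 1)]
    p_y = joint[(0, 1)] + joint[(1, 1)]
    return joint[(1, 1)] - p_x * p_y


cov_population = covariance(joint_xy(selected=False))
cov_selected = covariance(joint_xy(selected=True))
print(round(cov_population, 4), round(cov_selected, 4))
```

The markers are uncorrelated in the population but become negatively correlated in the selected sample, so a structure learner run on that sample could draw an arc between them that reflects the sampling design rather than biology.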

24 Other comments on the paper
14. The paper makes the argument that by enforcing that arcs go from the phenotype to SNPs the resulting model needs less sample to parameterize. While this may be true for the parameterization of the phenotype node, it is not true in general for the other nodes. In fact, by doing so genotype nodes have, in general, to be more densely connected and thus their parameterization becomes more sample-intensive. At the same time the validity of the inferred structure may be compromised.
15. There has been quite a bit of “simulations to evaluate heuristic choices”, parameter values chosen by “sensitivity analysis”, and other such pre-modeling steps that open up the possibility of some manual over-fitting.