1 Some Comments on Sebastiani et al Nature Genetics 37(4)2005.


1 Some Comments on Sebastiani et al Nature Genetics 37(4)2005

2 Bayesian Classifiers & Structure Learners
They come in several varieties, designed to balance the following properties to different degrees:
1. Expressiveness: can they learn/represent arbitrary or constrained functions?
2. Computational tractability: can they perform learning and inference fast?
3. Sample efficiency: how much sample is needed?
4. Structure discovery: can they be used to infer structural relationships, even causal ones?

3 Variants:
Exhaustive Bayes
Simple (aka Naïve) Bayes
Bayesian Networks (BNs)
TANs (Tree-Augmented Simple Bayes)
BANs (Bayes Net Augmented Simple Bayes)
Bayesian Multinets (TAN- or BAN-based)
FANs (Finite Mixture Augmented Naïve Bayes)
Several others exist but are not examined here (e.g., MB classifiers, model averaging)

4 Exhaustive Bayes
Bayes' theorem (or formula) says that:
P(D | F) = P(D) · P(F | D) / P(F)
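As a quick numeric illustration of the formula, a minimal sketch for a binary disease and a single finding (the prevalence, sensitivity, and false-positive rate below are invented illustrative numbers, not values from the paper):

```python
# Bayes' theorem for a binary disease D and a single finding F.
# All numbers below are made-up illustrative values.
def posterior(prior_d, p_f_given_d, p_f_given_not_d):
    """P(D+ | F+) = P(D+) * P(F+ | D+) / P(F+)."""
    p_f = prior_d * p_f_given_d + (1 - prior_d) * p_f_given_not_d  # P(F+)
    return prior_d * p_f_given_d / p_f

# 1% prevalence, 90% sensitivity, 5% false-positive rate:
print(posterior(0.01, 0.90, 0.05))  # ~0.154: a positive finding is still mostly a false alarm
```

Note how the low prior dominates: even a fairly accurate finding yields a modest posterior for a rare disease.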

5 Exhaustive Bayes
1. Expressiveness: can learn any function
2. Computational tractability: exponential
3. Sample efficiency: exponential
4. Structure discovery: does not reveal structure

6 Simple Bayes
Requires that the findings be independent conditioned on the disease states (note: this does not mean that the findings are independent in general, but rather that they are conditionally independent given the disease).

7 Simple Bayes: less sample and less computation, but more restricted in what it can learn than Exhaustive Bayes
Simple Bayes can be implemented by plugging into the main formula:
P(F | Dj) = ∏i P(Fi | Dj)
where Fi is the i-th (singular) finding and Dj the j-th (singular) disease.
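The product formula can be sketched in a few lines of Python; the priors and conditional probabilities below are invented for illustration:

```python
from math import prod

def naive_bayes_posteriors(priors, cpts, findings):
    """Simple Bayes: P(Dj | F) is proportional to P(Dj) * prod_i P(Fi | Dj).
    priors[j] = P(Dj); cpts[j][i] = P(Fi present | Dj);
    findings[i] is True if finding i is present."""
    scores = [prior * prod(cpts[j][i] if present else 1 - cpts[j][i]
                           for i, present in enumerate(findings))
              for j, prior in enumerate(priors)]
    z = sum(scores)  # P(F), by the law of total probability
    return [s / z for s in scores]

# Two diseases, two findings (all numbers assumed):
post = naive_bayes_posteriors(priors=[0.7, 0.3],
                              cpts=[[0.2, 0.9], [0.8, 0.4]],
                              findings=[True, False])
```

The predicted disease is simply the one with the highest posterior in `post`.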

8 Naïve Bayes
1. Expressiveness: can learn only a small fraction of functions (one that shrinks exponentially fast as the # of dimensions grows)
2. Computational tractability: linear
3. Sample efficiency: needs a number of parameters linear in the # of variables; each parameter can be estimated fairly efficiently since it involves conditioning on one variable (the class node). E.g., in the diagnosis context one needs only the prevalence of each disease and the sensitivity of each finding for the disease.
4. Structure discovery: does not reveal structure

9 Bayesian Networks: achieve a trade-off between the flexibility of Exhaustive Bayes and the tractability of Simple Bayes
Also allow discovery of structural relationships

10 Bayesian Networks
1. Expressiveness: can represent any function
2. Computational tractability: depends on the dependency structure of the underlying distribution. Inference is worst-case intractable, but for sparse or tree-like networks it can be very fast. Representational tractability is excellent in sparse networks.
3. Sample efficiency: there is no formal characterization, because (a) it highly depends on the underlying structure of the distribution and (b) in most practical learners local errors propagate to remote areas of the network. Large-scale empirical studies show that very complicated structures (i.e., with hundreds or even thousands of variables and medium to small densities) can be learned accurately with relatively small samples (i.e., a few hundred samples).
4. Structure discovery: under well-defined and reasonable conditions, capable of revealing causal structure

11 Bayesian Networks: The Bayesian Network Model and Its Uses
BN = Graph (variables (nodes), dependencies (arcs)) + Joint Probability Distribution + Markov Property
The graph has to be a DAG (directed acyclic graph) in the standard BN model.
Example graph: A → B, A → C
JPD:
P(A+, B+, C+) = 0.006
P(A+, B+, C-) = 0.014
P(A+, B-, C+) = 0.054
P(A+, B-, C-) = 0.126
P(A-, B+, C+) = 0.240
P(A-, B+, C-) = 0.160
P(A-, B-, C+) = 0.240
P(A-, B-, C-) = 0.160
Theorem: any JPD can be represented in BN form

12 Bayesian Networks: The Bayesian Network Model and Its Uses
Markov property: any node N, conditioned on its parents, is independent of any subset of its non-descendant nodes.
[Figure: example DAG over nodes A through J]
e.g.: D ⊥ {B, C, E, F, G} | A
F ⊥ {A, D, E, G, H, I, J} | B, C
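The property can also be checked numerically on the small three-node example (A → B, A → C) whose JPD is listed on the previous slide: given its parent A, node B is independent of the non-descendant C. A minimal sketch:

```python
# JPD over (A, B, C) from the three-node example; True means "+".
jpd = {
    (True,  True,  True):  0.006, (True,  True,  False): 0.014,
    (True,  False, True):  0.054, (True,  False, False): 0.126,
    (False, True,  True):  0.240, (False, True,  False): 0.160,
    (False, False, True):  0.240, (False, False, False): 0.160,
}

def p_b_given(jpd, given):
    """P(B+ | given), where given maps variable index (0=A, 1=B, 2=C) to a value."""
    den = sum(p for w, p in jpd.items()
              if all(w[i] == v for i, v in given.items()))
    num = sum(p for w, p in jpd.items()
              if w[1] and all(w[i] == v for i, v in given.items()))
    return num / den

# Conditioning additionally on C does not change P(B+ | A+): both equal 0.1.
assert abs(p_b_given(jpd, {0: True}) - p_b_given(jpd, {0: True, 2: True})) < 1e-9
```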

13 Bayesian Networks: The Bayesian Network Model and Its Uses
Theorem: the Markov property enables us to decompose (factor) the joint probability distribution into a product of prior and conditional probability distributions:
P(V) = ∏i P(Vi | Pa(Vi))
For the example network A → B, A → C, the original JPD:
P(A+, B+, C+) = 0.006
P(A+, B+, C-) = 0.014
P(A+, B-, C+) = 0.054
P(A+, B-, C-) = 0.126
P(A-, B+, C+) = 0.240
P(A-, B+, C-) = 0.160
P(A-, B-, C+) = 0.240
P(A-, B-, C-) = 0.160
becomes:
P(A+) = 0.2
P(B+ | A+) = 0.1
P(B+ | A-) = 0.5
P(C+ | A+) = 0.3
P(C+ | A-) = 0.6
Up to exponential savings in the number of parameters!
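The factorization can be verified numerically: recovering P(A), P(B | A), and P(C | A) from the listed JPD by marginalization (which gives P(A+) = 0.2) and multiplying them back reproduces all eight joint probabilities. A minimal sketch:

```python
from itertools import product

# The eight joint probabilities of the three-node example; True means "+".
jpd = {
    (True,  True,  True):  0.006, (True,  True,  False): 0.014,
    (True,  False, True):  0.054, (True,  False, False): 0.126,
    (False, True,  True):  0.240, (False, True,  False): 0.160,
    (False, False, True):  0.240, (False, False, False): 0.160,
}

p_a = sum(p for (a, _, _), p in jpd.items() if a)  # P(A+) = 0.2
# P(B+ | A) and P(C+ | A), recovered by marginalization:
p_b = {a: sum(p for (x, b, _), p in jpd.items() if x == a and b) /
          sum(p for (x, _, _), p in jpd.items() if x == a) for a in (True, False)}
p_c = {a: sum(p for (x, _, c), p in jpd.items() if x == a and c) /
          sum(p for (x, _, _), p in jpd.items() if x == a) for a in (True, False)}

# P(A, B, C) = P(A) * P(B | A) * P(C | A) holds for every assignment:
for a, b, c in product((True, False), repeat=3):
    factored = ((p_a if a else 1 - p_a)
                * (p_b[a] if b else 1 - p_b[a])
                * (p_c[a] if c else 1 - p_c[a]))
    assert abs(factored - jpd[(a, b, c)]) < 1e-9
```

Eight joint entries are replaced by five parameters here; in larger sparse networks the saving is exponential.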

14 Bayesian Networks: The Bayesian Network Model and Its Uses
Once we have a BN model of some domain we can ask questions:
Forward: P(D+, I- | A+) = ?
Backward: P(A+ | C+, D+) = ?
Forward & backward: P(D+, C- | I+, E+) = ?
Arbitrary abstraction / arbitrary predictor and predicted variables
[Figure: example DAG over nodes A through J]
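For the small three-node example (A → B, A → C), such questions can be answered by brute-force enumeration over the factored model; the parameters below are the ones recovered from that example's JPD. A minimal sketch:

```python
from itertools import product

# Parameters of the three-node network A -> B, A -> C (from the example JPD).
P_A = 0.2                          # P(A+)
P_B = {True: 0.1, False: 0.5}      # P(B+ | A)
P_C = {True: 0.3, False: 0.6}      # P(C+ | A)

def joint(a, b, c):
    """P(A=a, B=b, C=c) via the factorization P(A) P(B|A) P(C|A)."""
    return ((P_A if a else 1 - P_A)
            * (P_B[a] if b else 1 - P_B[a])
            * (P_C[a] if c else 1 - P_C[a]))

def query(target, evidence):
    """P(target | evidence); each maps a variable name ('A','B','C') to a value."""
    idx = {'A': 0, 'B': 1, 'C': 2}
    def match(w, cond):
        return all(w[idx[k]] == v for k, v in cond.items())
    worlds = list(product((True, False), repeat=3))
    num = sum(joint(*w) for w in worlds if match(w, target) and match(w, evidence))
    den = sum(joint(*w) for w in worlds if match(w, evidence))
    return num / den

print(query({'C': True}, {'A': True}))   # forward:  P(C+ | A+) = 0.3
print(query({'A': True}, {'C': True}))   # backward: P(A+ | C+) = 0.06/0.54, about 0.111
```

Enumeration is exponential in the number of variables, which is exactly why real BN inference engines exploit the graph structure instead.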

15 Other Restricted Bayesian Classifiers: TANs, BANs, FANs, Multinets

16 Other Restricted Bayesian Classifiers: TANs, BANs, FANs, Multinets
1. Expressiveness: can represent limited classes of functions (more expressive than Simple Bayes, less so than BNs)
2. Computational tractability: worse than Simple Bayes, often faster than BNs
3. Sample efficiency: there is no formal characterization; limited empirical studies so far, but results are promising
4. Structure discovery: not designed to reveal causal structure

17 TANs
The TAN classifier extends Naïve Bayes with "augmenting" edges among the findings, such that the resulting network among the findings is a tree.
[Figure: class node D with findings F1–F4]

18 TAN multinet
The TAN multinet classifier uses a different TAN for each value of D, then chooses the predicted class to be the value of D with the highest posterior given the findings (over all TANs).
[Figure: one TAN over findings F1–F4 for each class value D=1, D=2, D=3]

19 BANs
The BAN classifier extends Naïve Bayes with "augmenting" edges among the findings, such that the resulting network among the findings is a graph.
[Figure: class node D with findings F1–F4]

20 FANs (Finite Mixture Augmented Naïve Bayes)
The FAN classifier extends Naïve Bayes by modeling extra dependencies among the findings via an unmeasured hidden confounder (a finite mixture model) parameterized via EM.
[Figure: class node D and hidden node H with findings F1–F4]

21 How feasible is it to learn structure accurately with Bayesian network learners and realistic samples?
Abundance of empirical evidence shows that it is very feasible. A few examples:
C.F. Aliferis, G.F. Cooper. "An Evaluation of an Algorithm for Inductive Learning of Bayesian Belief Networks Using Simulated Data Sets." In Proceedings of Uncertainty in Artificial Intelligence. 67 random BNs with samples from <200 to 1500 and up to 50 variables obtained a mean sensitivity of 92% and a mean superfluous-arcs ratio of 5%.
I. Tsamardinos, L.E. Brown, C.F. Aliferis. "The Max-Min Hill-Climbing Bayesian Network Structure Learning Algorithm." Machine Learning Journal, 2006. 22 networks from 20 to 5000 variables, and samples from 500 to 5000, yielding excellent Structural Hamming Distances (for details please see the paper).
I. Tsamardinos, C.F. Aliferis, A. Statnikov. "Time and Sample Efficient Discovery of Markov Blankets and Direct Causal Relations." In Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 24-27, 2003, Washington, DC, USA, ACM Press. 8 networks with 27 to 5000 variables and 500 to 5000 samples yield average sensitivity/specificity of 90%.
See www.dsl-lab.org for details.

22 Other comments on the paper
1. The paper aspires to build powerful classifiers and to reveal structure in one modeling step. Several important predictive-modeling approaches and structure learners that a priori seem more suitable are ignored.
2. The analysis is conducted exclusively with a commercial product owned by one of the authors. The conflict is disclosed in the paper.
3. Using (approximately) a BAN may facilitate parameterization; however, it does not facilitate structure discovery.
4. Ordering SNPs is a good idea.
5. No more than 3 parents per node means that approx. 20 samples are used for each independent cell in the conditional probability tables. Experience shows that this number is more than enough for sufficient parameterization IF this density is correct.
6. The proposed classifier achieves accuracy close to what one gets by classifying everything to the class with the higher prevalence (since the distribution is very unbalanced). However, close inspection shows that the classification is much more discriminatory; accuracy is a very poor metric to show this.

23 Other comments on the paper
7. A very appropriate analysis not pursued here would be to convert the graph to its equivalence class and examine structural dependencies there.
8. No examination of structure stability across the 5 folds of cross-validation, or via bootstrapping.
9. Table 1 confuses explanatory with predictive modeling. SNP contributions are estimated in the very small sample while they should be estimated in the larger sample (Table 1 offers an explanatory analysis).
10. It is not clear what set each SNP/gene SNP set is removed from to compute the table.
11. Mixing source populations in the evaluation set may have biased the evaluation.
12. Discretization has a huge effect on structure-discovery algorithms. The applied discretization procedure for continuous variables is suboptimal.
13. When using selected cases and controls, artifactual dependencies are introduced among some of the variables. This is well known, and corrections to the Bayesian metric have been devised to deal with it. The paper ignores this even though its purpose is precisely to infer such dependencies.

24 Other comments on the paper
14. The paper makes the argument that by enforcing that arcs go from the phenotype to the SNPs, the resulting model needs less sample to parameterize. While this may be true for the parameterization of the phenotype node, it is not true in general for the other nodes. In fact, by doing so the genotype nodes have, in general, to be more densely connected, and thus their parameterization becomes more sample-intensive. At the same time, the validity of the inferred structure may be compromised.
15. There has been quite a bit of "simulations to evaluate heuristic choices" and parameter values chosen by "sensitivity analysis" and other such pre-modeling that open up the possibility for some manual over-fitting.