Artificial Intelligence Term Project #3 Kyu-Baek Hwang Biointelligence Lab School of Computer Science and Engineering Seoul National University

1 Artificial Intelligence Term Project #3 Kyu-Baek Hwang Biointelligence Lab School of Computer Science and Engineering Seoul National University kbhwang@bi.snu.ac.kr

2 Outline (Copyright (c) 2004 by SNU CSE Biointelligence Lab)
Bayesian network: revisit I
  - Properties of Bayesian network
  - Structural learning of Bayesian network
Project 3-1
  - Analysis of structural learning algorithms
  - ALARM dataset
Bayesian network: revisit II
  - Bayesian network classifiers (probabilistic inference)
Project 3-2
  - Classification of microarray gene expression data using Bayesian networks

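3 Bayesian Network
The joint probability distribution over all the variables in the Bayesian network factorizes into local distributions:
  $P(X_1, \dots, X_n) = \prod_{i=1}^{n} P(X_i \mid \mathrm{Pa}(X_i))$
where $P(X_i \mid \mathrm{Pa}(X_i))$ is the local probability distribution for $X_i$ given its parents $\mathrm{Pa}(X_i)$.
[Figure: example Bayesian network over the nodes A, B, C, D, E]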
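
The factorization into local distributions can be illustrated with a minimal Python sketch. The three-node network (A -> C <- B), its variable names, and all probability values are hypothetical and chosen only for illustration; they are not taken from the slide's figure.

```python
# Hypothetical binary network A -> C <- B (structure and numbers are assumptions).
parents = {"A": (), "B": (), "C": ("A", "B")}

# Local probability distributions: cpt[var][parent_values][value]
cpt = {
    "A": {(): {0: 0.7, 1: 0.3}},
    "B": {(): {0: 0.6, 1: 0.4}},
    "C": {
        (0, 0): {0: 0.9, 1: 0.1},
        (0, 1): {0: 0.5, 1: 0.5},
        (1, 0): {0: 0.4, 1: 0.6},
        (1, 1): {0: 0.1, 1: 0.9},
    },
}

def joint_probability(assignment):
    """Multiply the local distributions P(X_i | Pa(X_i)) for a full assignment."""
    p = 1.0
    for var, pa in parents.items():
        pa_values = tuple(assignment[q] for q in pa)
        p *= cpt[var][pa_values][assignment[var]]
    return p

print(joint_probability({"A": 1, "B": 0, "C": 1}))  # 0.3 * 0.6 * 0.6
```

Summing `joint_probability` over all eight assignments gives 1, which is a quick sanity check that the local tables define a valid joint distribution.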

4 Generative Model
From the underlying distribution, a set of data examples can be generated.
Conditional probabilities of interest can be calculated from the joint probability distribution (jpd).
[Figure: Bayesian network over Class and Gene A through Gene H]
This Bayesian network can classify examples by calculating the appropriate conditional probability.
  - P(Class | other variables)

5 Classification by Bayesian Networks I
Calculate the conditional probability of the 'Class' variable given the values of the other variables.
  - Infer the conditional probability from the joint probability distribution.
  - For example, with $\mathbf{x}$ denoting the observed values of the other variables,
    $P(\mathrm{Class} \mid \mathbf{x}) = \frac{P(\mathrm{Class}, \mathbf{x})}{\sum_{c} P(\mathrm{Class} = c, \mathbf{x})}$
    where the summation is taken over all possible class values.
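
As a rough illustration of inferring the class posterior from the joint distribution, the sketch below normalizes the joint probability over all class values. The two-gene structure (Class -> G1, Class -> G2), the names `G1` and `G2`, and every number are assumptions made for this example only.

```python
# Hypothetical structure: Class -> G1, Class -> G2, all variables binary.
p_class = {0: 0.5, 1: 0.5}
p_gene_given_class = {
    "G1": {0: {0: 0.8, 1: 0.2}, 1: {0: 0.3, 1: 0.7}},
    "G2": {0: {0: 0.6, 1: 0.4}, 1: {0: 0.1, 1: 0.9}},
}

def joint(c, genes):
    """Joint probability P(Class = c, genes) under the assumed structure."""
    p = p_class[c]
    for name, value in genes.items():
        p *= p_gene_given_class[name][c][value]
    return p

def class_posterior(genes):
    """P(Class | genes): divide the joint by its sum over all class values."""
    unnormalized = {c: joint(c, genes) for c in p_class}
    total = sum(unnormalized.values())
    return {c: p / total for c, p in unnormalized.items()}

posterior = class_posterior({"G1": 1, "G2": 1})
predicted = max(posterior, key=posterior.get)  # most probable class
```

The predicted class is simply the class value with the largest posterior probability.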

6 Knowing the Causal Structure
[Figure: Bayesian network over Class and Gene A through Gene H]
  - Gene C regulates Gene E and F.
  - Gene D regulates Gene G and H.
  - Class has an effect on Gene F and G.
A set of comprehensible rules (or knowledge)

7 Learning Bayesian Networks
Metric approach
  - Use a scoring metric to measure how well a particular structure fits an observed set of cases.
  - A search algorithm is used.
  - Find a canonical form of an equivalence class.
Independence approach
  - An independence oracle (approximated by some statistical test) is queried to identify the equivalence class that captures the independencies in the distribution from which the data were generated.
  - Search for a PDAG (partially directed acyclic graph).

8 Scoring Metrics for Bayesian Networks
Likelihood: $L(G, \Theta_G, C) = P(C \mid G^h, \Theta_G)$
  - $G^h$: the hypothesis that the given data ($C$) were generated by a distribution that can be factored according to $G$.
The maximum likelihood metric of $G$ (the entropy metric with opposite sign)
  - Prefers the complete graph structure.
  - $N$: data size; $x_j$: the $j$th example.

9 Information Criterion Scoring Metrics
  - The Akaike information criterion (AIC) metric
  - The Bayesian information criterion (BIC) metric
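
The slide's formulas did not survive in the transcript, so the following Python sketch uses the commonly cited textbook forms as an assumption: AIC as the maximized log-likelihood minus the number of free parameters d, and BIC as the maximized log-likelihood minus (d/2) log N. The function names and the data layout (a list of dicts) are hypothetical.

```python
import math
from collections import Counter

def log_likelihood(data, parents):
    """Maximized log-likelihood of a discrete network structure on the data."""
    ll = 0.0
    for var, pa in parents.items():
        joint = Counter((tuple(row[p] for p in pa), row[var]) for row in data)
        marginal = Counter(tuple(row[p] for p in pa) for row in data)
        for (pa_values, _), n in joint.items():
            ll += n * math.log(n / marginal[pa_values])
    return ll

def num_parameters(parents, arity):
    """Free parameters: (r_i - 1) per parent configuration of each variable."""
    d = 0
    for var, pa in parents.items():
        configs = 1
        for p in pa:
            configs *= arity[p]
        d += (arity[var] - 1) * configs
    return d

def aic(data, parents, arity):
    return log_likelihood(data, parents) - num_parameters(parents, arity)

def bic(data, parents, arity):
    penalty = 0.5 * num_parameters(parents, arity) * math.log(len(data))
    return log_likelihood(data, parents) - penalty

data = [{"X": 0, "Y": 0}, {"X": 0, "Y": 0}, {"X": 1, "Y": 1}, {"X": 1, "Y": 1}]
arity = {"X": 2, "Y": 2}
empty = {"X": (), "Y": ()}          # no edges
with_edge = {"X": (), "Y": ("X",)}  # X -> Y
```

Adding an edge never lowers the maximized likelihood, which is why the plain likelihood metric prefers the complete graph; the penalty terms in AIC and BIC counteract that by charging for each extra parameter.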

10 MDL Scoring Metrics
  - The minimum description length (MDL) metric 1
  - The minimum description length (MDL) metric 2

11 Bayesian Scoring Metrics
  - A Bayesian metric
  - The BDe (Bayesian Dirichlet & likelihood equivalence) metric
  - Prior on the network structure

12 Greedy Search Algorithm
Generate an initial Bayesian network structure $G_0$.
For m = 1, 2, 3, …, until convergence:
  - Among the possible local changes in $G_{m-1}$ (insertion, reversal, or deletion of an edge), perform the one that leads to the largest improvement in the score. The resulting graph is $G_m$.
Stopping criterion
  - Score($G_{m-1}$) == Score($G_m$).
At each iteration (when learning a Bayesian network of n variables)
  - $O(n^2)$ local changes must be evaluated to select the best one.
Random restarts are usually adopted to escape local maxima.
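
The greedy procedure can be sketched in Python as follows. This is a minimal sketch, not the implementation used in the project: the `score` argument is any caller-supplied function on edge sets, and the toy score in the usage lines (distance to a hypothetical target graph) stands in for a real scoring metric. Random restarts are omitted.

```python
from itertools import permutations

def is_acyclic(nodes, edges):
    """Repeatedly strip nodes with no incoming edge; a cycle leaves a remainder."""
    remaining = set(nodes)
    while remaining:
        roots = [n for n in remaining
                 if not any(v == n and u in remaining for u, v in edges)]
        if not roots:
            return False
        remaining -= set(roots)
    return True

def neighbors(nodes, edges):
    """Yield every graph one local change away (O(n^2) candidates)."""
    for u, v in permutations(nodes, 2):
        if (u, v) in edges:
            yield edges - {(u, v)}               # deletion
            yield (edges - {(u, v)}) | {(v, u)}  # reversal
        elif (v, u) not in edges:
            yield edges | {(u, v)}               # insertion

def greedy_search(nodes, score, initial=frozenset()):
    current = frozenset(initial)
    current_score = score(current)
    while True:
        best, best_score = current, current_score
        for candidate in neighbors(nodes, current):
            if is_acyclic(nodes, candidate):
                s = score(candidate)
                if s > best_score:
                    best, best_score = frozenset(candidate), s
        if best_score == current_score:  # no improving change: stop
            return current, current_score
        current, current_score = best, best_score

# Toy usage: the score counts disagreements with a hypothetical target graph.
target = frozenset({("A", "B"), ("B", "C")})
toy_score = lambda edges: -len(frozenset(edges) ^ target)
learned, final_score = greedy_search(["A", "B", "C"], toy_score)
```

With a real metric, `score` would evaluate the structure against data, and the search would be rerun from several random initial graphs to reduce the risk of stopping at a local maximum.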

13 Project 3-1
Analysis of structural learning algorithms
  - Data generation from the ALARM network
  - Various dataset sizes, e.g., 1000, 3000, 5000, 10000
  - Structural learning of Bayesian networks by greedy search (hill-climbing) with several kinds of scoring metrics
  - Compare the results with respect to edge errors across sample sizes and learning methods
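
The slide does not define "edge errors" precisely. One common convention, assumed here, counts missing, extra, and reversed edges of the learned graph relative to the true graph. The function name and the example edges below are hypothetical.

```python
def edge_errors(true_edges, learned_edges):
    """Count missing, extra, and reversed edges (assumed definition)."""
    true_edges, learned_edges = set(true_edges), set(learned_edges)
    # Learned edge whose opposite direction is in the true graph.
    reversed_edges = {(u, v) for (u, v) in learned_edges
                      if (v, u) in true_edges and (u, v) not in true_edges}
    # Learned edge between nodes that are not adjacent in the true graph.
    extra = {e for e in learned_edges
             if e not in true_edges and e not in reversed_edges}
    # True edge absent from the learned graph in either direction.
    missing = {(u, v) for (u, v) in true_edges
               if (u, v) not in learned_edges and (v, u) not in learned_edges}
    return {"missing": len(missing), "extra": len(extra),
            "reversed": len(reversed_edges)}

true_graph = [("A", "B"), ("B", "C"), ("C", "D")]
learned_graph = [("A", "B"), ("C", "B"), ("A", "D")]
errors = edge_errors(true_graph, learned_graph)
```

Tabulating these counts for each sample size and scoring metric gives the comparison the project asks for.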

14 ALARM Network
  - # of nodes: 37
  - # of edges: 46
  - # of possible values per variable: 2 to 4

15 Data Generation
Using Netica (http://www.norsys.com)

16 Structural Learning
WEKA (http://www.cs.waikato.ac.nz/ml/weka/)
  - http://www.cs.waikato.ac.nz/~remco/weka_bn/

17 Probabilistic Inference
Calculate the conditional probability given the values of the observed variables.
  - Junction tree algorithm
  - Sampling methods
General probabilistic inference is intractable (it is known to be NP-hard).
However, calculating the conditional probability for classification is rather straightforward because of a property of the Bayesian network structure (d-separation).

18 Markov Blanket
Variables of interest
  - $X = \{X_1, X_2, \dots, X_n\}$
For a variable $X_i$, its Markov blanket $MB(X_i)$ is a subset of $X \setminus \{X_i\}$ that satisfies
  $P(X_i \mid X \setminus \{X_i\}) = P(X_i \mid MB(X_i))$
Markov boundary
  - The minimal Markov blanket

19 Markov Blanket in Bayesian Network
Given the Bayesian network structure, determining the Markov blanket of a variable is straightforward.
  - It follows from the conditional independence assertions.
[Figure: Bayesian network over Class and Gene A through Gene H]
The Markov blanket of a node in the Bayesian network consists of all of its parents, spouses (the other parents of its children), and children.
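
Reading the Markov blanket off a structure can be sketched in a few lines of Python. The edge list below encodes only the relations stated in words on the "Knowing the Causal Structure" slide; the edges of Gene A and Gene B are not stated in the transcript, so they are left out, and the function name is an assumption.

```python
def markov_blanket(node, edges):
    """Parents, children, and spouses (other parents of the node's children)."""
    parents = {u for u, v in edges if v == node}
    children = {v for u, v in edges if u == node}
    spouses = {u for u, v in edges if v in children and u != node}
    return parents | children | spouses

# Edges stated in the text; Gene A and Gene B are omitted (edges unknown).
edges = [
    ("Gene C", "Gene E"), ("Gene C", "Gene F"),
    ("Gene D", "Gene G"), ("Gene D", "Gene H"),
    ("Class", "Gene F"), ("Class", "Gene G"),
]

class_blanket = markov_blanket("Class", edges)
```

Under this partial edge list, the blanket of Class contains its children Gene F and Gene G and their other parents Gene C and Gene D, so only those four genes would matter when classifying.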

20 Classification by Bayesian Networks II

21 Project 3-2
Classification using Bayesian networks
  - Evaluate the performance of the Bayesian network classifier (classification accuracy).
  - Try various parameter settings, e.g., scoring metrics and learning methods.
  - If possible, compare with other learning methods such as neural networks and decision trees.
  - Use leave-one-out cross validation.
Using WEKA
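
Leave-one-out cross validation itself is independent of the toolkit. The Python sketch below shows the procedure with a placeholder majority-class learner; the `fit`/`predict` interface and all names are hypothetical, and in the project the learner would be the Bayesian network classifier trained in WEKA.

```python
from collections import Counter

def fit_majority(X, y):
    """Placeholder learner (assumption): remembers the most frequent class."""
    return Counter(y).most_common(1)[0][0]

def predict_majority(model, x):
    """Placeholder predictor: always returns the stored majority class."""
    return model

def loocv_accuracy(X, y, fit, predict):
    """Hold out each example once, train on the rest, and average the hits."""
    correct = 0
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        model = fit(train_X, train_y)
        correct += predict(model, X[i]) == y[i]
    return correct / len(X)

X = [[0], [1], [0], [1]]
y = [0, 0, 0, 1]
accuracy = loocv_accuracy(X, y, fit_majority, predict_majority)
```

With 120 examples, this means 120 training runs, each on 119 examples, with the held-out example used once for testing.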

22 Molecular Biology: Central Dogma
[Figure: DNA microarray]

23 DNA Microarrays
  - Monitor thousands of gene expression levels simultaneously, in contrast to traditional one-gene experiments.
  - Fabricated by high-speed robotics.
  - Known probes

24 Types of DNA Microarrays
Oligonucleotide chips
  - An array of oligonucleotide (20 to 80-mer oligos) probes is synthesized.
cDNA microarrays
  - Probe cDNA (500 to 5,000 bases long) is immobilized to a solid surface.

25 Study
M. H. Cheok et al., "Treatment-specific changes in gene expression discriminate in vivo drug response in human leukemia cells," Nature Genetics 35, 2003.
  - 60 leukemia patients
  - Bone marrow samples
  - Affymetrix GeneChip arrays
  - Gene expression data

26 Gene Expression Data
# of data examples
  - 120 (60 before treatment, 60 after treatment)
# of genes measured
  - 12600 (Affymetrix HG-U95A array)
Task
  - Classification between "before treatment" and "after treatment" based on gene expression pattern

27 Affymetrix GeneChip Arrays
Use short oligos to detect gene expression levels.
Each gene is probed by a set of short oligos.
Each gene expression level is summarized by
  - Signal: a numerical value describing the abundance of mRNA
  - A/P call: denotes the statistical significance of the signal

28 Preprocessing
Remove the genes having more than 60 'A' calls
  - # of genes: 12600 → 3190
Discretization of gene expression level
  - Criterion: median gene expression value of each sample
  - 0 (low) and 1 (high)
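
The two preprocessing steps can be sketched in Python as below. The function names and data layout are hypothetical, and the handling of values exactly equal to the median (mapped to 0 here) is an assumption, since the slide does not specify it.

```python
from statistics import median

def filter_genes(calls, max_absent=60):
    """Keep indices of genes with at most `max_absent` 'A' calls.

    calls[g] is the list of 'A'/'P' calls of gene g over all samples.
    """
    return [g for g, gene_calls in enumerate(calls)
            if sum(c == "A" for c in gene_calls) <= max_absent]

def discretize_sample(expression):
    """Map one sample's expression values to 0 (low) or 1 (high).

    The threshold is the median over that sample's genes; ties map to 0
    (an assumption, not stated in the source).
    """
    m = median(expression)
    return [1 if value > m else 0 for value in expression]

kept = filter_genes([["A", "A", "P"], ["P", "P", "A"]], max_absent=1)
bits = discretize_sample([1.0, 5.0, 3.0, 9.0])
```

Using each sample's own median as the threshold makes the discretization insensitive to overall intensity differences between arrays.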

29 Gene Filtering
Using mutual information
  - Estimated probabilities were used.
  - # of genes: 3190 → 50
Final dataset
  - # of attributes: 51 (one for the class)
  - Class: 0 (after treatment), 1 (before treatment)
  - # of data examples: 120
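
A minimal Python sketch of mutual-information filtering follows, assuming probabilities are estimated by relative frequencies and that the genes with the highest mutual information with the class are kept. The function names, the use of base-2 logarithms, and the toy data are assumptions.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """I(X; Y) in bits, with probabilities estimated by relative frequencies."""
    n = len(xs)
    count_xy = Counter(zip(xs, ys))
    count_x = Counter(xs)
    count_y = Counter(ys)
    return sum(
        (c / n) * math.log2(c * n / (count_x[x] * count_y[y]))
        for (x, y), c in count_xy.items()
    )

def top_k_genes(genes, labels, k=50):
    """Indices of the k genes sharing the most information with the class."""
    scores = [mutual_information(g, labels) for g in genes]
    return sorted(range(len(genes)), key=lambda i: scores[i], reverse=True)[:k]

labels = [0, 0, 1, 1]
genes = [[0, 1, 0, 1],   # independent of the class
         [0, 0, 1, 1]]   # perfectly informative
selected = top_k_genes(genes, labels, k=1)
```

In the project's setting, `genes` would hold the 3190 discretized expression profiles and `k=50` would produce the final attribute set.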

30 Final Dataset
[Figure: data matrix of 120 examples × 51 attributes]

31 Submission
  - Deadline: 2004. 12. 2
  - Location: 301-419

