Efficient Learning using Constrained Sufficient Statistics


1 Efficient Learning using Constrained Sufficient Statistics
Nir Friedman (Hebrew University) and Lise Getoor (Stanford)

2 Learning Models from Data
Useful for pattern recognition, density estimation, and classification.
Problem: computational effort increases with the size of the dataset.
Goal: reduce computational cost without sacrificing the quality of learned models.

3 Bayesian Networks
Discrete random variables X1,…,Xn. A Bayesian network B = <G,Θ> encodes the joint probability distribution P(X1,…,Xn) = Πi P(Xi | PaG(Xi)), where PaG(Xi) are the parents of Xi in the graph G; together, the graph and parameters define a unique distribution.
[Speaker notes: the particular model we focus on here is the Bayesian network. Example network: P(alert | time of day, number of cups of coffee, stayed up late last night).]
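To make the factorization concrete, here is a minimal sketch in Python; the network structure and CPT values are invented for illustration (echoing the coffee/alertness example in the notes), not taken from the talk.

```python
# Illustrative sketch: a Bayesian network B = <G, Theta> defines a joint
# distribution via P(X1,...,Xn) = prod_i P(Xi | Pa(Xi)).
# Structure and CPT numbers below are assumptions for demonstration.

parents = {
    "coffee": [],
    "late_night": [],
    "alert": ["coffee", "late_night"],
}

# P(var = 1 | parent assignment), indexed by the tuple of parent values.
cpt = {
    "coffee": {(): 0.7},
    "late_night": {(): 0.3},
    "alert": {(0, 0): 0.2, (0, 1): 0.1, (1, 0): 0.9, (1, 1): 0.6},
}

def joint_prob(assignment):
    """Multiply P(xi | parents(xi)) over all (binary) variables."""
    p = 1.0
    for var, pas in parents.items():
        p1 = cpt[var][tuple(assignment[q] for q in pas)]
        p *= p1 if assignment[var] == 1 else 1.0 - p1
    return p

# P(coffee=1, late_night=0, alert=1) = 0.7 * (1 - 0.3) * 0.9
print(joint_prob({"coffee": 1, "late_night": 0, "alert": 1}))
```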

4 Learning Bayesian Networks
Given a training set D, find the network B that best matches D. This involves parameter estimation and model selection.
[Figure: an Inducer takes Data + Prior information and outputs a network over nodes E, R, B, A, C.]

5 Parameter Estimation
Relies on sufficient statistics. For a multinomial sample these are the counts N(xi, πi): the number of instances in which Xi takes value xi and its parents take the configuration πi.
[Figure: a node X with parents Y, Z and the table P(X|Y,Z) estimated from these counts.]
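A minimal sketch of computing these counts from complete data; the variable names and rows are illustrative only.

```python
from collections import Counter

def sufficient_stats(data, child, parents):
    """N(x, pa): how often the child takes value x while its parents take pa."""
    return Counter(
        (row[child], tuple(row[p] for p in parents)) for row in data
    )

# Illustrative complete data over three binary variables.
data = [
    {"X": 1, "Y": 0, "Z": 1},
    {"X": 1, "Y": 0, "Z": 1},
    {"X": 0, "Y": 1, "Z": 0},
]
counts = sufficient_stats(data, "X", ["Y", "Z"])
# counts == {(1, (0, 1)): 2, (0, (1, 0)): 1}
# Maximum-likelihood parameters: P(x | pa) = N(x, pa) / N(pa).
```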

6 Learning Bayesian Network Structure
Active area of research: Cooper & Herskovits 92; Lam & Bacchus 94; Heckerman, Geiger & Chickering 95; Pearl & Verma 91; Spirtes, Glymour & Scheines 93.
Optimization approach: scoring metric + heuristic search.
Scoring metrics: Minimum Description Length (MDL) and Bayesian scoring metrics.
[Speaker notes: we learn Bayesian networks from data; the heuristic search is greedy hill-climbing; we assume a multinomial sample.]

7 MDL Scoring Metric
ScoreMDL(G:D) = maxΘ log-likelihood minus a penalty for model complexity: ScoreMDL(G:D) = ℓ(Θ̂G : D) - (log N / 2) #(G), where #(G) is the number of parameters. If each instance in D is complete, the score decomposes into the general form shown on the next slide.
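As a sketch, the score of a single family (a variable and its parents) can be computed directly from the counts above; the (log N)/2-per-parameter penalty is the standard MDL penalty assumed here.

```python
import math
from collections import defaultdict

def family_mdl_score(counts, n_child_vals, n_parent_configs, N):
    """One family's MDL score: maximized log-likelihood minus (log N)/2
    per free parameter. `counts` maps (x, pa) -> N(x, pa)."""
    parent_totals = defaultdict(int)
    for (_, pa), n in counts.items():
        parent_totals[pa] += n
    loglik = sum(
        n * math.log(n / parent_totals[pa]) for (_, pa), n in counts.items()
    )
    n_params = (n_child_vals - 1) * n_parent_configs
    return loglik - 0.5 * math.log(N) * n_params
```

The network score is the sum of these family scores, which is why a single-arc change requires recomputing only one family's term.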

8 MDL Score Decomposition
ScoreMDL(G:D) = Σi FamScore(Xi, PaG(Xi) : D): a sum of local, per-family terms, each depending on the data only through the sufficient statistics N(xi, πi).

9 Local Search Methods
[Figure: a network over X, Y, Z, W and several candidate single-arc changes, each annotated with its local score change ΔXscore, ΔYscore, ΔWscore.]
We evaluate changes one edge at a time: adding, deleting, or reversing arcs (the figure shows just a few of the potential changes). To compute a candidate network's score we need only calculate the new local score, but doing so requires a pass through the database to compute the required sufficient statistics. Dominant cost: passes over the database.
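A minimal sketch of the greedy hill-climbing loop described here; `legal_moves` (which must preserve acyclicity) and `score_change` (which performs the database pass) are illustrative placeholders, not the authors' implementation.

```python
def hill_climb(variables, data, score_change, legal_moves):
    """Greedy hill-climbing: repeatedly apply the best single-arc change
    (add, delete, or reverse) until no change improves the score."""
    edges = set()  # current structure, as a set of directed arcs (u, v)
    while True:
        best_move, best_delta = None, 0.0
        for move in legal_moves(edges, variables):  # must keep the graph acyclic
            delta = score_change(edges, move, data)  # a pass over the database
            if delta > best_delta:
                best_move, best_delta = move, delta
        if best_move is None:
            return edges
        op, (u, v) = best_move
        if op == "add":
            edges.add((u, v))
        elif op == "delete":
            edges.discard((u, v))
        else:  # "reverse"
            edges.discard((u, v))
            edges.add((v, u))
```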

10 Using Bounds to Guide Local Search
[Figure: a candidate change involving X, Y, Z, marked with a "?".]
Some, if not most, local changes will not improve the score. What can we say about the score before calculating N(X,Y,Z)?

11 Geometric View of Constraints

12 Constrained Optimization Problem
Objective function F(X), where X stands for the vector of unknown sufficient statistics.
Problem: maxX F(X) subject to the constraints imposed by the statistics we already know.
Theorem: The global maximum of ScoreMDL is bounded by the global maximum of F.
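In symbols, for the three-variable example from slide 10, and assuming (as on slide 30) that the known statistics are N(X,Y) and N(Z):

```latex
\max_{X}\; F(X)
\quad\text{s.t.}\quad
\sum_{z} N(x,y,z) = N(x,y),\qquad
\sum_{x,y} N(x,y,z) = N(z),\qquad
N(x,y,z) \ge 0 .
```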

13 Characterization of Local Score
Lemma 1: The function F is convex over the positive quadrant.
Lemma 2: The global maximum of F(X) is achieved at an extreme point of the feasible region.
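The reasoning behind Lemma 2 is the standard convexity argument: any feasible point is a convex combination of the extreme points V_k of the feasible region, so

```latex
X = \sum_k \lambda_k V_k,\quad \lambda_k \ge 0,\quad \sum_k \lambda_k = 1
\;\Longrightarrow\;
F(X) \le \sum_k \lambda_k F(V_k) \le \max_k F(V_k).
```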

14 Not out of the woods yet...
Finding the global maximum is difficult.
We can find the global maximum using an NLP solver. Alternatively, find some extreme point of the feasible region and use its value as a heuristic.

15 Finding some extreme point
Repeat: pick a row i and column j with remaining sums r and c, assign min(r, c) to that cell, and subtract it from both sums.
Heuristic MAXMAX: pick the row and column with the largest remaining sums. Heuristic RANDOM: pick them at random.
[Figure: worked numeric examples of filling a two-way table with fixed margins (row sums such as 83 and 417, column sums such as 158 and 342) under each heuristic.]
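A minimal sketch of the MAXMAX construction; the min(r, c) assignment step is our reading of the procedure sketched on this slide.

```python
def maxmax_extreme_point(row_sums, col_sums):
    """Construct an extreme point of the set of tables with the given
    row and column sums, always picking the largest remaining row and
    column (the MAXMAX rule); picking at random gives the RANDOM rule."""
    rows, cols = list(row_sums), list(col_sums)
    table = [[0] * len(cols) for _ in rows]
    while any(rows) and any(cols):
        i = max(range(len(rows)), key=lambda k: rows[k])
        j = max(range(len(cols)), key=lambda k: cols[k])
        v = min(rows[i], cols[j])  # the largest value this cell can take
        table[i][j] = v
        rows[i] -= v
        cols[j] -= v
    return table

print(maxmax_extreme_point([83, 417], [158, 342]))
# -> [[83, 0], [75, 342]]
```

Each step zeroes out a row or a column, so the procedure terminates quickly and yields a table with few nonzero cells: an extreme point of the feasible region.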

16 Local Search Methods using Constrained Sufficient Statistics
[Figure: the same candidate arc changes as on slide 9, now annotated with upper bounds max ΔXscore, max ΔYscore, max ΔWscore.]
As before, we evaluate one arc change at a time (adding, deleting, or reversing), and computing a new local score exactly requires a pass through the database. Now, however, the constrained sufficient statistics give an upper bound on each candidate's score change before any pass is made, letting the search skip candidates that cannot improve the score.
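A minimal sketch of how such bounds can prune the move evaluation; `upper_bound` stands in for the constrained-sufficient-statistics bound and `exact_delta` for the full computation, both illustrative placeholders rather than the paper's exact algorithm.

```python
def best_move_with_bounds(candidates, upper_bound, exact_delta):
    """Pick the best move, using a cheap optimistic bound to avoid computing
    exact score changes (database passes) for hopeless candidates."""
    best_move, best_delta = None, 0.0
    # Visit candidates in decreasing order of their optimistic bound.
    for move in sorted(candidates, key=upper_bound, reverse=True):
        if upper_bound(move) <= best_delta:
            break  # no later candidate can beat the incumbent
        delta = exact_delta(move)  # the expensive pass over the data
        if delta > best_delta:
            best_move, best_delta = move, delta
    return best_move, best_delta
```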

17 Experimental Results
Compared two search techniques: greedy hill-climbing (HC) and heuristic search using bounds (CSS-HC).
Tested on two real Bayesian networks: Alarm, 37 variables [Beinlich et al. 89], and Insurance, 26 variables [Binder et al. 97].
Measured score vs. both time and the number of computed statistics.

18 Score vs. Time
[Plot: score improvement vs. time t for CSS-HC and HC on ALARM, 50K instances.]

19 Score vs. Cached Statistics
[Plot: score vs. number of cached statistics for CSS-HC and HC on ALARM, 50K instances, showing the speed-up.]

20 Performance Scaling
[Plot: CSS-HC vs. HC on ALARM with 10K, 50K, and 100K training instances.]

21 Performance Scaling
[Plot: CSS-HC vs. HC on INSURANCE.]

22 MAXMAX vs. Optimal Solution
Here we note something interesting: we had expected that using the more precise bounds computed by an NLP solver would yield an even greater improvement in the performance of our algorithm. While this data is *not* conclusive (we only have results for part of the performance curve), at least in this range it suggests that the MAXMAX heuristic is indeed sufficient: the NLP solver takes considerably more time, yet we see no decrease in the number of statistics in our cache.
[Plot: MAXMAX vs. OPT (NLP solver) on ALARM, 50K instances.]

23 Conclusions
Partial knowledge of the dataset can be used to find bounds on sufficient statistics.
Simple heuristics can approximate these bounds and be used effectively in local search algorithms.
These techniques are general and can be applied in other situations where sufficient statistics must be computed.

24 Future Work
Missing data: learning is complicated significantly, and the benefit from bounds may be more dramatic.
Global bounds rather than local bounds: develop a branch-and-bound algorithm.

25 Last slide: the slides that follow are extras.

26 Acknowledgements We would like to thank:
Walter Murray, Daphne Koller, Ronald Getoor, Peter Grunwald, Ron Parr and members of the Stanford DAGS research group

27 Our Approach
Even before calculating sufficient statistics, our current knowledge of the training set constrains their possible values. These constrained values either bound the change in score or provide heuristics. Using these values, we can improve our search strategy.

28 Learning Bayesian Networks
Active area of research: Cooper & Herskovits 92, Lam & Bacchus 94, Heckerman 95, Heckerman, Geiger & Chickering 95.
Common approach: scoring metric + heuristic search.
[Speaker notes: we learn Bayesian networks from data; the heuristic search is greedy hill-climbing; we assume a multinomial sample.]

29 Bayesian Scoring Metrics
Score(G:D) = log P(D | G) + log P(G), where the marginal likelihood averages over the parameters: P(D | G) = ∫ P(D | G, Θ) P(Θ | G) dΘ.

30 Constraints on Sufficient Statistics
For example, if we know the lower-order counts N(X,Y) and N(Z), we have the following constraints on N(X,Y,Z): Σz N(x,y,z) = N(x,y), Σx,y N(x,y,z) = N(z), and N(x,y,z) ≥ 0.

31 Constrained Optimization Problem
Objective function F(X). Problem: maxX F(X) subject to the constraints imposed by the known statistics. Theorem: The global maximum of ScoreMDL is bounded by the global maximum of F.

32 Characterization of Local Score cont.
Theorem: The global maximum of ScoreMDL is bounded by the global maximum of F, which is achieved at an extreme point of the feasible region.

33 Local Search Methods
Exploit the decomposition of scoring metrics; change one arc at a time. Example: greedy hill-climbing. Dominant cost: the number of passes over the database to compute counts.
[Speaker notes: greedy hill-climbing finds only a local maximum but performs well in practice; the approach also works for other local search methods such as beam search and simulated annealing.]

34 MDL Scoring Metric
ScoreMDL(G:D) = ℓ(B : D) - (log N / 2) #(G), where ℓ(B : D) is the log-likelihood of B given D and #(G) is the number of parameters in G.

35 Parameter Estimation
Relies on sufficient statistics. For multinomial data: the counts N(xi, πi).
[Figure: a node X with parents Y, Z and the table P(X|Y,Z) estimated from these counts.]

36 Learning Bayesian Networks is Hard...
Computationally intensive: the dominant cost is the time spent computing sufficient statistics. This is particularly true for large training sets and in the presence of missing data.

37 Score Decomposition
If each instance in D is complete, the scoring functions have the following general form: Score(G:D) = Σi FamScore(Xi, PaG(Xi) : D), where the N(xi, πi) are the counts of each instantiation of Xi and its parents.

