Sum-Product Networks: A New Deep Architecture

Name: Sum-Product Networks: A New Deep Architecture
Uploaded: 2017-08-12T07:50:14+00:00
Duration: PTM38S27
Channel: Ashton Watkins
Description: Sum-Product Networks: A New Deep Architecture

Sum-Product Networks: A New Deep Architecture
Hoifung Poon, Microsoft Research. Joint work with Pedro Domingos.

Graphical Models: Challenges
Examples: Bayesian network (Sprinkler, Rain, Grass Wet), Markov network, restricted Boltzmann machine (RBM). Advantage: compactly represent probability. Problem: inference is intractable. Problem: learning is difficult.
As we all know, graphical models (GMs) have many attractive properties for probabilistic modeling. However, there are major challenges in using them. In particular, inference is generally intractable. In turn, this makes learning the parameters or structure of a GM very difficult, since learning typically requires inference as a subroutine.

Deep Learning
Stack many layers. E.g.: DBN [Hinton & Salakhutdinov, 2006], CDBN [Lee et al., 2009], DBM [Salakhutdinov & Hinton, 2010]. Potentially much more powerful than shallow architectures [Bengio, 2009]. But inference is even harder, and learning requires extensive effort.
Recently, deep learning has attracted increasing interest. The standard approach is to stack many layers of graphical models (such as RBMs). Compared to shallow architectures, this is potentially much more powerful in terms of representation and statistical learning. Unfortunately, even learning a single RBM layer is intractable. When multiple layers are stacked, inference becomes even harder, and learning requires the black art of extensive engineering, such as preprocessing and tuning.

Graphical Models
Learning: requires approximate inference. Inference: still approximate.
To identify potential solutions, consider the following visualization. The red circle represents all distributions that can be compactly represented by a GM. As mentioned earlier, the challenge is that learning requires approximate inference, and even if a perfect model is learned, inference still requires approximation (such as Gibbs sampling or variational inference). There is no guarantee on how long it will take or on the quality of the results.

Existing Tractable Models
E.g., hierarchical mixture model, thin junction tree, etc. Problem: too restricted.
[Figure: regions labeled Graphical Models and Existing Tractable Models.]
An alternative approach is to do all the approximation up front, learning a tractable model, and then have peace of mind at inference time. This sounds very attractive, but unfortunately existing tractable models, such as hierarchical mixture models and thin junction trees, are too restricted.

This Talk: Sum-Product Networks
Compactly represent the partition function using a deep network.
[Figure: regions labeled Sum-Product Networks, Graphical Models, and Existing Tractable Models.]
In this talk, we ask the question: how general can a tractable model be? We found that there are surprisingly general conditions, and we introduce SPNs based on these conditions. The key insight is to incorporate computation into probabilistic modeling, and to compactly represent the partition function as a deep network.

Exact inference linear time in network size
Exact inference takes time linear in the size of an SPN.

Can compactly represent many more distributions
And we can compactly represent many more distributions than existing tractable models.

Learn optimal way to reuse computation, etc.
In particular, we can even compactly represent many general classes of distributions for which the GM is a completely connected graph. The reason is that SPNs can leverage structural properties such as determinism and context-specific independence (CSI), and learn the optimal way to form and reuse partial computation.

Outline
Sum-product networks (SPNs). Learning SPNs. Experimental results. Conclusion.
I will start by introducing SPNs. Then I will address the crucial problem of learning an SPN. Finally, …

Why Is Inference Hard?
Bottleneck: summing out variables. E.g., the partition function is a sum of exponentially many products.
Before introducing SPNs, let's consider why inference is difficult in a GM. Computing a probability is actually easy up to a proportionality constant; we simply multiply the potentials of the factors. What makes inference hard is summing out variables, which requires summing over an exponential number of combinations. The extreme case is the partition function Z, which requires summing out all variables.

Alternative Representation
Network polynomial [Darwiche, 2003].

X1  X2  P(X)
1   1   0.4
1   0   0.2
0   1   0.1
0   0   0.3

$P(X) = 0.4\, I[X_1{=}1]\, I[X_2{=}1] + 0.2\, I[X_1{=}1]\, I[X_2{=}0] + 0.1\, I[X_1{=}0]\, I[X_2{=}1] + 0.3\, I[X_1{=}0]\, I[X_2{=}0]$

To address this problem, we build on the idea of Darwiche [2003]. For simplicity, let's start with Boolean variables; later, we will see how to extend this to discrete and continuous variables. Suppose we have two variables, X1 and X2. The probability distribution can be represented as a table. Alternatively, we can represent it as a sum over the probabilities, each multiplied by the corresponding indicators. For example, this indicator is 1 if …, and 0 otherwise.

Alternative Representation
It is easy to verify that the probability of any state can be computed by setting the appropriate indicator values. For example, setting I[X1=1] and I[X2=0] to 1 and the other two indicators to 0 leaves only the term 0.2.

Shorthand for Indicators
Network polynomial [Darwiche, 2003].

$P(X) = 0.4\, X_1 X_2 + 0.2\, X_1 \bar{X}_2 + 0.1\, \bar{X}_1 X_2 + 0.3\, \bar{X}_1 \bar{X}_2$

To make things less cumbersome, we use the following shorthand, where $X_i$ stands for $I[X_i{=}1]$ and $\bar{X}_i$ for $I[X_i{=}0]$.

Sum Out Variables
Evidence e: X1 = 1.

$P(e) = 0.4\, X_1 X_2 + 0.2\, X_1 \bar{X}_2 + 0.1\, \bar{X}_1 X_2 + 0.3\, \bar{X}_1 \bar{X}_2$

Set $X_1 = 1$, $\bar{X}_1 = 0$, $X_2 = 1$, $\bar{X}_2 = 1$, which gives P(e) = 0.4 + 0.2 = 0.6. Easy: set both indicators of X2 to 1.
The good news is that summing out a variable in this representation is easy. Suppose we want to compute the marginal of X1 = 1, which requires summing out X2. We can do this by simply setting both indicators of X2 to 1. Intuitively, setting an indicator to 1 includes that value in the summation. In particular, to compute the partition function Z, we can just set all indicators to 1. This sounds great, except for one problem: there are exponentially many terms.
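The indicator trick above can be checked numerically. The following is a minimal Python sketch, not code from the talk: the table values come from the slide, while the function names and the `ind[i][v]` layout are illustrative assumptions.

```python
# Joint distribution from the slide's table: (x1, x2) -> P(x1, x2).
P = {(1, 1): 0.4, (1, 0): 0.2, (0, 1): 0.1, (0, 0): 0.3}

def network_polynomial(ind):
    """Evaluate sum_x P(x) * I[X1 = x1] * I[X2 = x2].

    ind[i][v] is the value of the indicator "X_(i+1) = v" (hypothetical layout).
    """
    total = 0.0
    for (x1, x2), p in P.items():
        total += p * ind[0][x1] * ind[1][x2]
    return total

def indicators(evidence):
    """Observed variable: one-hot indicators. Unobserved (None): both set to 1."""
    return [{0: 1.0, 1: 1.0} if v is None
            else {0: float(v == 0), 1: float(v == 1)}
            for v in evidence]

p_state = network_polynomial(indicators([1, 0]))        # P(X1=1, X2=0)
p_marginal = network_polynomial(indicators([1, None]))  # P(X1=1), X2 summed out
z = network_polynomial(indicators([None, None]))        # partition function Z
print(p_state, p_marginal, z)
```

Up to floating-point rounding, the three values should be 0.2, 0.6, and 1. The loop still visits every row of the table, which is exactly the exponential cost the following slides set out to remove.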

Graphical Representation
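[Figure: a sum node with weights 0.4, 0.2, 0.1, 0.3 over four product nodes, each multiplying one indicator of X1 and one indicator of X2.]
We can visualize this representation graphically, using a sum node for the weighted summation and product nodes for the multiplications.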

But … Exponentially Large
Example: parity, the uniform distribution over states with an even number of 1's.
[Figure: flat network over X1, …, X5 with $2^{N-1}$ product nodes and $N\,2^{N-1}$ edges to the indicators.]
The problem is that this graph is exponentially large. For example, consider parity, the uniform distribution over states with an even number of 1's. There are exponentially many even states. It is easy to show that there is no conditional independence under this distribution, so to represent it as a GM you need a fully connected graph.

But … Exponentially Large
Example: parity, the uniform distribution over states with an even number of 1's.
So the key question is: can we make this graph more compact?

Use a Deep Network
Example: parity, the uniform distribution over states with an even number of 1's. Size: O(N).
The answer is yes. In particular, this network represents exactly the same distribution with just linear size.

Use a Deep Network
Example: parity. Induce many hidden layers. Reuse partial computation.
The key is to introduce many hidden layers.
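To make the O(N) claim concrete, here is a hedged Python sketch of one linear-size network for parity. The layer layout (one "even" and one "odd" node per prefix, with a single normalizing weight at the root) is an assumption for illustration; the network drawn on the slide may arrange its weights differently. The flat version is included only for comparison.

```python
from itertools import product

def parity_deep(ind):
    """Evaluate a linear-size sum-product network for the parity distribution.

    ind[i] = (value of indicator "X_i = 0", value of indicator "X_i = 1").
    Each layer keeps two nodes: prefixes with an even / odd number of 1's.
    """
    even, odd = ind[0][0], ind[0][1]
    for x0, x1 in ind[1:]:
        # Two sum nodes per layer, each over two product nodes that reuse
        # the previous layer's even/odd values.
        even, odd = even * x0 + odd * x1, even * x1 + odd * x0
    n = len(ind)
    return even / 2 ** (n - 1)  # uniform over the 2^(N-1) even states

def parity_flat(ind):
    """Same distribution as an explicit sum over all 2^(N-1) even states."""
    n = len(ind)
    total = 0.0
    for state in product([0, 1], repeat=n):
        if sum(state) % 2 == 0:
            term = 1.0
            for i, v in enumerate(state):
                term *= ind[i][v]
            total += term
    return total / 2 ** (n - 1)

def one_hot(state):
    """Indicator settings for a fully observed state."""
    return [(1.0, 0.0) if v == 0 else (0.0, 1.0) for v in state]

print(parity_deep(one_hot((1, 1, 0, 0, 0))))  # an even state
print(parity_deep(one_hot((1, 0, 0, 0, 0))))  # an odd state
print(parity_deep([(1.0, 1.0)] * 5))          # all indicators 1: partition function
```

The deep version does a constant amount of work per variable, while the flat version enumerates every even state; both should agree on every input.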

Arithmetic Circuits (ACs)
Data structure for efficient inference [Darwiche, 2003]. Compilation target of Bayesian networks. Key idea: use ACs instead to define a new class of deep probabilistic models, and develop new deep learning algorithms for this class of models.
We call this class sum-product networks, since it represents distributions using DAGs of sums and products. As we will see, these enable much better deep learning than before.

Sum-Product Networks (SPNs)
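Rooted DAG. Nodes: sum, product, input indicator. Weights on edges from each sum node to its children.
[Figure: example SPN over X1 and X2. The root sum node has weights 0.7 and 0.3; the lower sum nodes have weights 0.4, 0.9, 0.7, 0.2, 0.6, 0.1, 0.3, 0.8 on edges to the indicators $X_1$, $\bar{X}_1$, $X_2$, $\bar{X}_2$.]
This leads us to introduce SPNs. An SPN is a DAG with a root. The leaves are the inputs, and the internal nodes are sums and products. A sum node outputs the weighted sum of its children, …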

Distribution Defined by SPN
P(X)  S(X)  0.7 0.3   We can define a prb dist by treating these values as unnormalize probs, just like a MN.     0.4 0.9 0.7 0.2 0.6 0.1 0.3 0.8   X1 X1 X2 X2 23 23

Can We Sum Out Variables?
= P(e)  Xe S(X) S(e)  0.7 0.3   So the key question is can we replace the exp sum for the true marginal by the linear eval of the ntw? If this is true, we are golden. But if we pick an arbitrary SPN, this may not be true.     0.4 0.9 0.7 0.2 0.6 0.1 0.3 0.8   X1 X1 X2 X2 e: X1 = 1 1 1 1 24

Valid SPN
An SPN is valid if $S(e) = \sum_{X \in e} S(X)$ for all e. Valid ⇒ can compute marginals efficiently. The partition function Z can be computed by setting all indicators to 1.
We thus define that an SPN is valid iff … If an SPN is valid, …

Valid SPN: General Conditions
Theorem: an SPN is valid if it is complete and consistent. Complete: under a sum, the children cover the same set of variables. Consistent: under a product, no variable appears in one child and its negation in another. Incomplete: $S(e) \le \sum_{X \in e} S(X)$. Inconsistent: $S(e) \ge \sum_{X \in e} S(X)$.
Fortunately, there are simple conditions for valid SPNs, which are both very general and very easy to validate or enforce. We can prove that an SPN is valid if it satisfies completeness and consistency. Complete means that … …
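Both conditions are easy to check mechanically. The sketch below is one possible way to do so for Boolean variables; the nested-tuple encoding of nodes and all function names are illustrative assumptions, not part of the talk.

```python
# Nodes as nested tuples (an illustrative encoding, not from the talk):
#   ("leaf", var, value)              indicator I[X_var = value]
#   ("sum", [(weight, child), ...])
#   ("prod", [child, ...])

def children(node):
    if node[0] == "sum":
        return [c for _, c in node[1]]
    return node[1] if node[0] == "prod" else []

def scope(node):
    """Set of variables appearing below a node."""
    if node[0] == "leaf":
        return {node[1]}
    return set().union(*(scope(c) for c in children(node)))

def literals(node):
    """Set of (var, value) indicators appearing below a node."""
    if node[0] == "leaf":
        return {(node[1], node[2])}
    return set().union(*(literals(c) for c in children(node)))

def is_complete(node):
    """Under every sum, all children cover the same set of variables."""
    kids = children(node)
    if node[0] == "sum" and any(scope(k) != scope(kids[0]) for k in kids):
        return False
    return all(is_complete(k) for k in kids)

def is_consistent(node):
    """Under every product, no child uses X where another child uses not-X."""
    kids = children(node)
    if node[0] == "prod":
        for i, a in enumerate(kids):
            for b in kids[i + 1:]:
                if any((var, 1 - val) in literals(b) for var, val in literals(a)):
                    return False
    return all(is_consistent(k) for k in kids)

x1, nx1 = ("leaf", "X1", 1), ("leaf", "X1", 0)
x2, nx2 = ("leaf", "X2", 1), ("leaf", "X2", 0)

incomplete = ("sum", [(0.5, x1), (0.5, x2)])   # children cover different variables
inconsistent = ("prod", [x1, nx1])             # X1 and its negation under one product
valid = ("sum", [(0.7, ("prod", [x1, x2])), (0.3, ("prod", [nx1, nx2]))])

print(is_complete(incomplete), is_consistent(inconsistent))
print(is_complete(valid) and is_consistent(valid))
```

The recursion recomputes scopes at every level for brevity; caching them per node would keep the check linear in the size of the network.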

Semantics of Sums and Products
Product  Feature  Form feature hierarchy Sum  Mixture (with hidden var. summed out) i i   wij wij Sum out Yi j ……  …… ……  …… The nodes in an SPN have very intuitive semantics. A prod node defines a feat by taking conjunction of its child. Together they form a feat hier. A sum can be viewed as a mixture with the hidden var sum out. The hidden var chooses the child; by setting all indicators to 1, the indicators can be dropped. SPN can be viewed as combining mix of experts and prod of expert with alternating mix/prods. j  I[Yi = j]

Inference
Probability: P(X) = S(X) / Z. Example: X: X1 = 1, X2 = 0.
[Figure: upward pass through the example SPN. The lower sum nodes evaluate to 0.6, 0.9, 0.7, 0.8; the product nodes to 0.42 = 0.6 × 0.7 and 0.72 = 0.9 × 0.8; the root to 0.7 × 0.42 + 0.3 × 0.72 = 0.51.]
To compute a probability …

Inference
If the weights sum to 1 at each sum node, then Z = 1 and P(X) = S(X). Example: X: X1 = 1, X2 = 0 gives the root value 0.51.
In addition, if the weights sum to 1 at each sum node, the partition function is 1, and the probability is simply the root value.

Inference
Marginal: P(e) = S(e) / Z. Example: e: X1 = 1.
[Figure: upward pass with both indicators of X2 set to 1. The lower sum nodes evaluate to 0.6, 0.9, 1, 1; the root to 0.7 × 0.6 + 0.3 × 0.9 = 0.69 = 0.51 + 0.18.]
Similarly for a partial state. Moreover, we can also simultaneously compute the marginals for all nodes and inputs with one upward and one downward pass.
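The numbers on these three slides can be reproduced with a single upward pass. In the sketch below, the assignment of weights to edges is inferred from the node values shown on the slides (0.6, 0.9, 0.7, 0.8, 0.42, 0.72), so treat the exact wiring as an assumption; the dictionary keys are illustrative names.

```python
def spn_value(ind):
    """Upward pass through the two-variable example SPN.

    ind maps "x1", "nx1", "x2", "nx2" to indicator values, where "nx1" is
    the indicator for X1 = 0. Edge weights are inferred from the slides.
    """
    a1 = 0.6 * ind["x1"] + 0.4 * ind["nx1"]   # sum nodes over X1
    a2 = 0.9 * ind["x1"] + 0.1 * ind["nx1"]
    b1 = 0.3 * ind["x2"] + 0.7 * ind["nx2"]   # sum nodes over X2
    b2 = 0.2 * ind["x2"] + 0.8 * ind["nx2"]
    return 0.7 * (a1 * b1) + 0.3 * (a2 * b2)  # root sum over two products

z = spn_value({"x1": 1, "nx1": 1, "x2": 1, "nx2": 1})    # all indicators 1
s_x = spn_value({"x1": 1, "nx1": 0, "x2": 0, "nx2": 1})  # X1 = 1, X2 = 0
s_e = spn_value({"x1": 1, "nx1": 0, "x2": 1, "nx2": 1})  # X1 = 1, X2 summed out
print(s_x / z, s_e / z)
```

Because the weights at every sum node add up to 1, `z` comes out as 1 (up to rounding) and the root values 0.51 and 0.69 are already probabilities. Each query costs one evaluation of the network, regardless of how many variables are summed out.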

Inference
MPE: replace sums with maxes [Darwiche, 2003]. Example: e: X1 = 1.
[Figure: upward pass with MAX nodes. The lower max nodes evaluate to 0.6, 0.9, 0.7, 0.8; the product nodes to 0.42 and 0.72; the root takes the max of 0.7 × 0.42 = 0.294 and 0.3 × 0.72 = 0.216.]
The MPE state can be computed by replacing sums with maxes. We then conduct an upward pass.

Inference
MAX: pick the child with the highest value [Darwiche, 2003]. Example: e: X1 = 1; the root picks 0.7 × 0.42 = 0.294 over 0.3 × 0.72 = 0.216.
The upward pass is followed by a downward pass, which recursively picks the most probable child at each sum node.
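The two passes can be sketched in a few lines of Python. The tuple encoding, function names, and the edge weights (inferred from the node values on the slides) are assumptions for illustration. For brevity the downward pass recomputes child values instead of caching the upward pass, so this version is not linear time.

```python
def leaf(var, val):
    return ("leaf", var, val)

def max_value(node, evidence):
    """Upward pass with every sum replaced by a max over weighted children."""
    kind = node[0]
    if kind == "leaf":
        _, var, val = node
        observed = evidence.get(var)
        return 1.0 if observed is None or observed == val else 0.0
    if kind == "prod":
        result = 1.0
        for child in node[1]:
            result *= max_value(child, evidence)
        return result
    return max(w * max_value(c, evidence) for w, c in node[1])

def mpe_state(node, evidence, state=None):
    """Downward pass: best child at each max node, all children at products."""
    state = dict(evidence) if state is None else state
    kind = node[0]
    if kind == "leaf":
        state.setdefault(node[1], node[2])  # never overrides observed values
    elif kind == "prod":
        for child in node[1]:
            mpe_state(child, evidence, state)
    else:
        _, best = max(node[1], key=lambda wc: wc[0] * max_value(wc[1], evidence))
        mpe_state(best, evidence, state)
    return state

# Example SPN; edge weights inferred from the values shown on the slides.
a1 = ("sum", [(0.6, leaf("X1", 1)), (0.4, leaf("X1", 0))])
a2 = ("sum", [(0.9, leaf("X1", 1)), (0.1, leaf("X1", 0))])
b1 = ("sum", [(0.3, leaf("X2", 1)), (0.7, leaf("X2", 0))])
b2 = ("sum", [(0.2, leaf("X2", 1)), (0.8, leaf("X2", 0))])
root = ("sum", [(0.7, ("prod", [a1, b1])), (0.3, ("prod", [a2, b2]))])

evidence = {"X1": 1}
print(max_value(root, evidence))  # the larger of 0.294 and 0.216, up to rounding
print(mpe_state(root, evidence))
```

With evidence X1 = 1, the root prefers the first product, and the sum node over X2 below it prefers the X2 = 0 indicator, so the completed state is X1 = 1, X2 = 0.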

Handling Continuous Variables
Sum  Integral over input Simplest case: Indicator  Gaussian SPN compactly defines a very large mixture of Gaussians So far, we assume vars are boolean. Extension to discrete var is trivial, we simply use the indicators for all discrete values. To handle cont var, as we will do in the experiments, we can replace the sum nodes over input by an integral. E.g., we can replace the indicators by a uni-var gaussian. In this case, SPN … We know that with enough comp, we can approx any cont var. So SPN is a potentially very powerful way to model cont distr.

SPNs Everywhere
Graphical models: existing tractable models and inference methods; determinism, context-specific independence, etc. Can potentially learn the optimal way to combine them.
Existing inference algorithms such as … can be viewed as special cases of SPNs. In addition, SPNs can represent nested combinations of these algorithms, and learn from data the optimal way to combine them.

SPNs Everywhere
Graphical models. Methods for efficient inference: e.g., arithmetic circuits, AND/OR graphs, case-factor diagrams. SPNs are a class of probabilistic models. SPNs have validity conditions. SPNs can be learned from data.
These methods are highly related, but there are three key contributions of our work that go beyond ACs and the like. In contrast to ACs and AND/OR graphs, which are data structures for efficient inference, SPNs are a class … SPNs have validity conditions, whereas ACs have none, because the question does not arise in the context of compilation. SPNs are the first to make a connection between efficient inference and deep learning. SPNs are more general than ACs and AND/OR graphs, and simpler than case-factor diagrams. Moreover, ACs and AND/OR graphs are considered compilation targets, whereas SPNs show that they are representations in their own right. Most importantly, SPNs represent the first approach for learning directly from data.

SPNs Everywhere
Graphical models. Models for efficient inference. General, probabilistic convolutional network: sum ⇒ average-pooling; max ⇒ max-pooling.
Convolutional networks are typically viewed as a vision-specific architecture. SPNs provide a general, probabilistic view of convolutional networks. A sum node can be viewed as implementing average-pooling. We can also do max-pooling by replacing it with a max node.

SPNs Everywhere
Graphical models. Models for efficient inference. General, probabilistic convolutional network. Grammars in vision and language: e.g., object detection grammar, probabilistic context-free grammar. Sum ⇒ non-terminal; product ⇒ production rule.
Finally, SPNs can easily implement grammars in vision and NLP. E.g., … More importantly, SPNs provide a principled way to learn grammars from data.

So far, we have seen all the good things about SPNs. But the crucial question is: can we learn them effectively?

General Approach
Start with a dense SPN. Find the structure by learning weights: zero weights signify absence of connections. Can learn with gradient descent or EM.
A general approach is to learn both the structure and the weights by starting with … and … Alternatively, we can also learn with EM …

The Challenge
Gradient diffusion: the gradient quickly dilutes. Similar problem with EM. Hard EM overcomes this problem.
SPNs lend themselves naturally to gradient descent. Gradients can be computed easily using backpropagation. Unfortunately, this suffers from the gradient diffusion problem. EM suffers from a similar problem, … assign fractional examples to children … This leads us to the second major contribution of this talk: we propose to use hard EM to address this problem. … pick the max child … maintain a unit update no matter how deep the network is.

Our Learning Algorithm
Online learning  Hard EM Sum node maintains counts for each child For each example Find MPE instantiation with current weights Increment count for each chosen child Renormalize to set new weights Repeat until convergence 41


Task: Image Completion
Methodology: learn a model from training images; complete unseen test images; measure mean squared error. Very challenging. Good for evaluating deep models.
To evaluate SPNs, we use the task of image completion, which is … and particularly good for … Specifically, … cover up a contiguous half of the image, such as the left or bottom half …
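As a small illustration of the metric, the sketch below computes the mean squared error over an occluded left half. Restricting the error to the occluded pixels, the list-of-rows image layout, and the function name are assumptions; the talk only states that mean squared errors are measured.

```python
def left_half_mse(original, completed):
    """Mean squared error over the occluded left half of an image.

    Images are lists of rows of pixel intensities (a hypothetical layout).
    """
    half = len(original[0]) // 2
    errors = [(o - c) ** 2
              for row_o, row_c in zip(original, completed)
              for o, c in zip(row_o[:half], row_c[:half])]
    return sum(errors) / len(errors)

original = [[1, 2, 3, 4], [5, 6, 7, 8]]
completed = [[0, 2, 3, 4], [5, 4, 7, 8]]  # left two columns were filled in by a model
print(left_half_mse(original, completed))  # (1 + 0 + 0 + 4) / 4 = 1.25
```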

Datasets
Main evaluation: Caltech-101 [Fei-Fei et al., 2004]: 101 categories, e.g., faces, cars, elephants; each category has 30 – 800 images. Also Olivetti [Samaria & Harter, 1994] (400 faces). Each category: last third for test. Test images: unseen objects.
Our main evaluation is done on the well-known … … unseen objects … For example, at test time we use the faces of Anna; at training time we may see Bob and Mary, but never any images of Anna. This is in contrast to many existing works, where the training set often contains images very similar to the test images, so that nearest neighbor can do quite well. By focusing on …

SPN Architecture
[Figure: SPN architecture, from the whole image through regions down to individual pixels (x, y).]
Sum node for the whole image. Regions are decomposed into smaller ones. The smallest region is a pixel, modeled with univariate Gaussians. Pixels are combined with other pixels by products, with mixtures added on top; the results are then combined with other regions to form bigger regions.

Decomposition
[Figure: decomposition of a region into smaller regions.]
Systems
SPN. DBM [Salakhutdinov & Hinton, 2010]. DBN [Hinton & Salakhutdinov, 2006]. PCA [Turk & Pentland, 1991]. Nearest neighbor [Hays & Efros, 2007].

Caltech: Mean-Square Errors
[Figure: mean squared errors on left and bottom completion for NN, PCA, DBN, DBM, and SPN.]

SPN vs. DBM / DBN
SPN is an order of magnitude faster. No elaborate preprocessing or tuning. Reduced errors by 30-60%. Learned up to 46 layers.

            SPN          DBM / DBN
Learning    2-3 hours    Days
Inference   < 1 second   Minutes or hours

Example Completions
[Figure: example completions; panels: Original, SPN, DBM, DBN, PCA, Nearest Neighbor.]

Open Questions
Other learning algorithms. Discriminative learning. Architecture. Continuous SPNs. Sequential domains. Other applications.
These results are quite promising, but this is only the first step and much remains to be explored.

End-to-End Comparison
Data → (approximate) → general graphical models → (approximate) → performance.
Data → (approximate) → sum-product networks → (exact) → performance.
Given the same computation budget, which approach has better performance?
Some distributions, such as the Ising model, can be compactly represented by GMs but not by SPNs. However, inference then requires approximations such as Gibbs sampling and variational inference, which can themselves be viewed as special cases of SPNs. For example, Gibbs sampling can be viewed as picking a random child rather than summing over the children. So, given the same computation budget, which approach will perform better? This is a fascinating question, both theoretically and practically.

Approximate Inference
[Figure: regions labeled True Model, Graphical Models, Sum-Product Networks, Optimal SPN, and Existing Tractable Models, with arrows labeled Approximate Inference.]

Conclusion
Sum-product networks (SPNs): DAG of sums and products; compactly represent the partition function; learn many layers of hidden variables. Exact inference: linear time in network size. Deep learning: online hard EM. Substantially outperform the state of the art on image completion.
… contribute to deep learning by …
