1 Sum-Product Networks: A New Deep Architecture
Hoifung Poon, Microsoft Research
Joint work with Pedro Domingos
2 Graphical Models: Challenges
Examples: Bayesian network (Sprinkler, Rain, Grass Wet), Markov network, restricted Boltzmann machine (RBM)
Advantage: Compactly represent probability
Problem: Inference is intractable
Problem: Learning is difficult
As we all know, graphical models (GMs) have many attractive properties for probabilistic modeling. However, there are major challenges in using GMs. In particular, inference is generally intractable. In turn, this makes learning the parameters or structure of a GM very difficult, since learning typically requires inference as a subroutine.
3 Deep Learning
Stack many layers
E.g.: DBN [Hinton & Salakhutdinov, 2006], CDBN [Lee et al., 2009], DBM [Salakhutdinov & Hinton, 2010]
Potentially much more powerful than shallow architectures [Bengio, 2009]
But …
Inference is even harder
Learning requires extensive effort
Recently, deep learning has attracted increasing interest. The standard approach for deep learning is to stack many layers of graphical models (such as RBMs). Compared to shallow architectures, this is potentially much more powerful in terms of representation and statistical learning. Unfortunately, even learning one layer of RBM is intractable. And when multiple layers are stacked, inference becomes even harder, and learning requires the black art of extensive engineering, such as preprocessing and tuning.
4 Learning: Requires approximate inference
Inference: Still approximate
[Diagram: Graphical Models]
To identify potential solutions, let’s consider the following visualization. Here, the red circle represents all distributions that can be compactly represented by a GM. As I mentioned earlier, the challenge is that learning requires approximate inference, and even if a perfect model is learned, inference still requires approximation (such as Gibbs sampling or variational inference). There is no guarantee on how long it will take or on the quality of the results.
5 Problem: Too restricted
E.g., hierarchical mixture model, thin junction tree, etc.
[Diagram: Graphical Models, Existing Tractable Models]
An alternative approach is to do all the approximation upfront to learn a tractable model, and then have peace of mind at inference time. This sounds very attractive, but unfortunately existing models such as … are too restricted.
6 This Talk: Sum-Product Networks
Compactly represent partition function using a deep network
[Diagram: Sum-Product Networks, Graphical Models, Existing Tractable Models]
In this talk, we ask the question: how general can a tractable model be? We found that there are surprisingly general conditions, and we introduce SPNs based on these conditions. The key insight is to incorporate computation into probabilistic modeling, and to compactly represent the partition function as a deep network.
7 Exact inference linear time in network size
[Diagram: Sum-Product Networks, Graphical Models, Existing Tractable Models]
Exact inference takes time linear in the size of an SPN.
8 Can compactly represent many more distributions
[Diagram: Sum-Product Networks, Graphical Models, Existing Tractable Models]
And we can compactly represent many more distributions compared to existing tractable models.
9 Learn optimal way to reuse computation, etc.
[Diagram: Sum-Product Networks, Graphical Models, Existing Tractable Models]
In particular, we can even compactly represent many general classes of distributions for which the GM is a completely connected graph. The reason is that SPNs can leverage many structural properties such as determinism and context-specific independence (CSI), and learn the optimal way to form and reuse partial computation.
10 Outline
Sum-product networks (SPNs)
Learning SPN
Experimental results
Conclusion
I will start by introducing SPNs. Then I will address the crucial problem of learning an SPN. Finally, …
11 Why Is Inference Hard?
Bottleneck: Summing out variables
E.g.: Partition function
Sum of exponentially many products
Before introducing SPNs, let’s consider why inference is difficult in a GM. Computing a probability is actually easy up to a proportionality constant; we simply multiply the potentials of the factors. What makes inference hard is summing out variables, which requires summing over an exponential number of combinations. The extreme case is the partition function Z, which requires summing out all variables.
12 Alternative Representation
X1  X2  P(X)
1   1   0.4
1   0   0.2
0   1   0.1
0   0   0.3
P(X) = 0.4 I[X1=1] I[X2=1] + 0.2 I[X1=1] I[X2=0] + 0.1 I[X1=0] I[X2=1] + 0.3 I[X1=0] I[X2=0]
Network Polynomial [Darwiche, 2003]
To address this problem, we’ll build on the idea of Darwiche (2003). For simplicity, let’s start with Boolean variables. Later, we will see how to extend this to discrete and continuous variables. Suppose we have two variables X1 and X2. The probability distribution can be represented as a table. Alternatively, we can represent it as a summation over the probabilities multiplied by the corresponding indicators. For example, this indicator is 1 if …, and 0 otherwise.
13 Alternative Representation
[Same table and network polynomial as slide 12]
Network Polynomial [Darwiche, 2003]
It’s easy to verify that the probability of a state can be computed by setting the indicators to the appropriate values.
14 Shorthand for Indicators
X1  X2  P(X)
1   1   0.4
1   0   0.2
0   1   0.1
0   0   0.3
Shorthand: Xi for I[Xi=1], ¬Xi for I[Xi=0]
P(X) = 0.4 X1 X2 + 0.2 X1 ¬X2 + 0.1 ¬X1 X2 + 0.3 ¬X1 ¬X2
Network Polynomial [Darwiche, 2003]
To make things less cumbersome, we can use the following shorthands, where …
15 Sum Out Variables
e: X1 = 1
P(e) = 0.4 X1 X2 + 0.2 X1 ¬X2 + 0.1 ¬X1 X2 + 0.3 ¬X1 ¬X2
Set indicators: X1 = 1, ¬X1 = 0, X2 = 1, ¬X2 = 1
Easy: Set both indicators to 1
The good news is that summing out a variable in this representation is easy. Suppose we want to compute the marginal of X1 = 1, which requires summing out X2. We can do this by simply setting both … to 1. Intuitively, setting an indicator to 1 includes that value in the summation. In particular, to compute the partition function Z, we can just set all indicators to 1. This sounds great, except for one problem: there are exponentially many terms.
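The indicator trick on this slide can be checked numerically. The following is a minimal Python sketch, not code from the talk: the probabilities come from the slide's table, while the names `TABLE`, `indicators`, and `network_polynomial` are hypothetical helpers introduced here for illustration.

```python
# Joint distribution from the slide's table: (x1, x2) -> P(X).
TABLE = {(1, 1): 0.4, (1, 0): 0.2, (0, 1): 0.1, (0, 0): 0.3}


def indicators(evidence):
    """Build indicator settings from partial evidence {variable: value}.

    An observed variable gets 1 only for its observed value.
    An unobserved variable gets 1 for both values, which sums it out.
    """
    return {(var, val): 1 if evidence.get(var, val) == val else 0
            for var in ("X1", "X2") for val in (0, 1)}


def network_polynomial(ind):
    """Sum over all table rows of P(row) times the row's two indicators."""
    return sum(p * ind[("X1", x1)] * ind[("X2", x2)]
               for (x1, x2), p in TABLE.items())


p_state = network_polynomial(indicators({"X1": 1, "X2": 0}))  # P(X1=1, X2=0)
p_marginal = network_polynomial(indicators({"X1": 1}))        # P(X1=1), X2 summed out
z = network_polynomial(indicators({}))                        # all indicators set to 1
```

Here `p_state` is 0.2, `p_marginal` is 0.4 + 0.2, and `z` is 1 up to floating-point rounding. The polynomial still has one term per joint state, which is the exponential blow-up the talk addresses next.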
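16 Graphical Representation
X1  X2  P(X)
1   1   0.4
1   0   0.2
0   1   0.1
0   0   0.3
[Figure: a sum node with weights 0.4, 0.2, 0.1, 0.3 over four product nodes; leaves X1, ¬X1, X2, ¬X2]
We can visualize this problem graphically by using a sum node for the weighted summation and product nodes for the multiplications.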
17 But … Exponentially Large
Example: Parity
Uniform distribution over states with even number of 1’s
[Figure: network over indicators X1, ¬X1, …, X5, ¬X5; size labels 2^(N-1) and N·2^(N-1)]
The problem is that this graph is exponentially large. E.g., consider parity, which corresponds to … There are exponentially many even states. It’s easy to show that there is no conditional independence under this distribution, so to represent it as a GM, you need a fully connected graph.
18 But … Exponentially Large
Example: Parity
Uniform distribution over states with even number of 1’s
Can we make this more compact?
[Figure: network over indicators X1, ¬X1, …, X5, ¬X5]
So the key question is: can we make this graph more compact?
19 Use a Deep Network
Example: Parity
Uniform distribution over states with even number of 1’s
O(N)
The answer is yes. In particular, this network represents exactly the same distribution with just linear size.
20 Use a Deep Network
Example: Parity
Uniform distribution over states with even number of 1’s
Induce many hidden layers
Reuse partial computation
The key is to introduce many hidden layers.
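One way to see why a deep network can represent parity in linear size is sketched below in Python. This is an illustrative construction, not necessarily the exact network drawn on the slide: each layer keeps two sum nodes ("even so far" and "odd so far") and reuses them in the next layer. The function name `parity_spn` and the 0.5 weights are assumptions chosen so that the weights at each sum node add up to 1.

```python
def parity_spn(pos, neg):
    """Evaluate a layered sum-product network for the even-parity distribution.

    pos[i] and neg[i] are the indicator values for X_i = 1 and X_i = 0.
    Each layer adds two sum nodes and four product nodes, so the size is O(N).
    """
    # Layer 1: zero ones seen so far is even, one is odd.
    even, odd = neg[0], pos[0]
    for p, n in zip(pos[1:], neg[1:]):
        # Both new sums reuse the partial results of the previous layer.
        even, odd = (0.5 * (even * n) + 0.5 * (odd * p),
                     0.5 * (even * p) + 0.5 * (odd * n))
    return even  # root: the total number of ones is even


n_vars = 5
ones = [1] * n_vars
z = parity_spn(ones, ones)                             # all indicators set to 1
p_even = parity_spn([1, 1, 0, 0, 0], [0, 0, 1, 1, 1])  # state 11000 (even)
p_odd = parity_spn([1, 0, 0, 0, 0], [0, 1, 1, 1, 1])   # state 10000 (odd)
```

With five variables, every even state gets probability 1/16, every odd state gets 0, and setting all indicators to 1 gives a partition function of 1.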
21 Arithmetic Circuits (ACs)
Data structure for efficient inference (Darwiche)
Compilation target of Bayesian networks
Key idea: Use ACs instead to define a new class of deep probabilistic models
Develop new deep learning algorithms for this class of models
We will call this class sum-product networks, since it represents distributions using DAGs of sums and products. As we’ll see, these will enable much better deep learning than before.
22 Sum-Product Networks (SPNs)
Rooted DAG
Nodes: Sum, product, input indicator
Weights on edges from sum to children
[Figure: example SPN over X1 and X2 with root weights 0.7 and 0.3; leaves X1, ¬X1, X2, ¬X2]
This leads us to introduce SPNs. An SPN is a DAG with a root. The leaves are inputs, and the internal nodes are sums and products. A sum outputs the weighted sum of its children, …
23 Distribution Defined by SPN
P(X) ∝ S(X)
[Figure: example SPN over X1 and X2 with root weights 0.7 and 0.3; leaves X1, ¬X1, X2, ¬X2]
We can define a probability distribution by treating these values as unnormalized probabilities, just like in a Markov network.
24 Can We Sum Out Variables?
P(e) ∝ Σ_{X∼e} S(X) =? S(e)
e: X1 = 1
[Figure: example SPN evaluated with the indicators set according to e]
So the key question is: can we replace the exponential sum for the true marginal by a linear-time evaluation of the network? If this is true, we are golden. But if we pick an arbitrary SPN, this may not be true.
25 Valid SPN
SPN is valid if S(e) = Σ_{X∼e} S(X) for all e
Valid ⇒ Can compute marginals efficiently
Partition function Z can be computed by setting all indicators to 1
We thus define that an SPN is valid iff … If an SPN is valid, …
26 Valid SPN: General Conditions
Theorem: SPN is valid if it is complete & consistent
Complete: Under sum, children cover the same set of variables
Consistent: Under product, no variable in one child and negation in another
[Figure: an incomplete SPN and an inconsistent SPN, for which S(e) and Σ_{X∼e} S(X) can differ]
Fortunately, there are simple conditions for valid SPNs, which are both very general and very easy to validate or enforce. We can prove that an SPN is valid if it satisfies completeness and consistency. Complete means that ……
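The two conditions are easy to check mechanically. The Python sketch below is an illustration only and assumes Boolean variables and a hypothetical tuple encoding that is not from the talk: `("leaf", var, val)`, `("prod", [children])`, and `("sum", [(weight, child), ...])`.

```python
def children(node):
    """Return the child nodes of a sum or product node (none for a leaf)."""
    if node[0] == "sum":
        return [child for _, child in node[1]]
    if node[0] == "prod":
        return list(node[1])
    return []


def literals(node):
    """All (variable, value) indicators appearing below a node."""
    if node[0] == "leaf":
        return {(node[1], node[2])}
    return set().union(*(literals(c) for c in children(node)))


def scope(node):
    """The set of variables appearing below a node."""
    return {var for var, _ in literals(node)}


def is_complete_and_consistent(node):
    kids = children(node)
    if not all(is_complete_and_consistent(c) for c in kids):
        return False
    if node[0] == "sum":
        # Complete: all children of a sum cover the same variables.
        return all(scope(c) == scope(kids[0]) for c in kids)
    if node[0] == "prod":
        # Consistent: no variable appears in one child and negated in another.
        for i, a in enumerate(kids):
            for b in kids[i + 1:]:
                if any((var, 1 - val) in literals(b) for var, val in literals(a)):
                    return False
    return True


x1, not_x1 = ("leaf", "X1", 1), ("leaf", "X1", 0)
x2, not_x2 = ("leaf", "X2", 1), ("leaf", "X2", 0)

valid = ("sum", [(0.7, ("prod", [x1, not_x2])), (0.3, ("prod", [not_x1, x2]))])
incomplete = ("sum", [(0.5, x1), (0.5, x2)])   # children cover different variables
inconsistent = ("prod", [x1, not_x1])          # X1 and its negation under a product
```

Calling `is_complete_and_consistent` on the three structures accepts `valid` and rejects the other two. The recursion recomputes scopes at every level, which is fine for a sketch but would be cached in a real implementation.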
27 Semantics of Sums and Products
Product: Feature; forms feature hierarchy
Sum: Mixture (with hidden variable summed out)
[Figure: sum node i with weight w_ij on the edge to child j; equivalently, child j multiplied by the indicator I[Yi = j], with Yi summed out]
The nodes in an SPN have very intuitive semantics. A product node defines a feature by taking the conjunction of its children. Together they form a feature hierarchy. A sum can be viewed as a mixture with the hidden variable summed out. The hidden variable chooses the child; by setting all indicators to 1, the indicators can be dropped. An SPN can be viewed as combining mixtures of experts and products of experts, with alternating mixtures and products.
29 Inference
If weights sum to 1 at each sum node, then Z = 1 and P(X) = S(X)
X: X1 = 1, X2 = 0
[Figure: example SPN with root weights 0.7 and 0.3; product node values 0.42 and 0.72; root value 0.51]
In addition, if the weights sum to 1 at each sum node, the partition function is 1, and the probability is simply the value of the root.
30 Inference
Marginal: P(e) = S(e) / Z
e: X1 = 1
0.69 = 0.51 + 0.18
[Figure: example SPN with root weights 0.7 and 0.3; product node values 0.6 and 0.9; root value 0.69]
Similarly for a partial state. Moreover, we can also simultaneously compute marginals for all nodes and inputs with an upward and a downward pass.
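The numbers on the two inference slides can be reproduced with a short bottom-up evaluation. The Python sketch below is an assumption-laden illustration: the root weights 0.7 and 0.3 are from the slides, but the lower weights (0.6/0.4, 0.9/0.1, 0.3/0.7, 0.2/0.8) and the tree-shaped structure are choices made here because they reproduce the quoted values 0.42, 0.72, 0.51, and 0.69. The tuple encoding and the names `evaluate` and `indicators` are hypothetical.

```python
def evaluate(node, ind):
    """Bottom-up evaluation of an SPN given indicator settings."""
    kind = node[0]
    if kind == "leaf":
        return ind[(node[1], node[2])]
    if kind == "prod":
        value = 1.0
        for child in node[1]:
            value *= evaluate(child, ind)
        return value
    return sum(w * evaluate(child, ind) for w, child in node[1])  # sum node


def indicators(evidence):
    """Observed variables get a one-hot setting; unobserved ones get all 1s."""
    return {(var, val): 1 if evidence.get(var, val) == val else 0
            for var in ("X1", "X2") for val in (0, 1)}


x1, not_x1 = ("leaf", "X1", 1), ("leaf", "X1", 0)
x2, not_x2 = ("leaf", "X2", 1), ("leaf", "X2", 0)

# Assumed weights below the root; every sum node's weights add up to 1.
left = ("prod", [("sum", [(0.6, x1), (0.4, not_x1)]),
                 ("sum", [(0.3, x2), (0.7, not_x2)])])
right = ("prod", [("sum", [(0.9, x1), (0.1, not_x1)]),
                  ("sum", [(0.2, x2), (0.8, not_x2)])])
root = ("sum", [(0.7, left), (0.3, right)])

p_state = evaluate(root, indicators({"X1": 1, "X2": 0}))  # about 0.51
p_marginal = evaluate(root, indicators({"X1": 1}))        # about 0.69
z = evaluate(root, indicators({}))                        # about 1.0
```

Each call visits every edge once, so the cost is linear in the size of the network. The marginal 0.69 is the sum of the two joint values 0.51 (X2 = 0) and 0.18 (X2 = 1), matching the slide.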
31 Inference
MPE: Replace sums with maxes (Darwiche)
e: X1 = 1
0.7 × 0.42 = 0.294
0.3 × 0.72 = 0.216
[Figure: example SPN with every sum node replaced by a MAX node; root weights 0.7 and 0.3; product node values 0.42 and 0.72]
The MPE state can be computed by replacing … We then conduct an upward pass …
32 Inference
MAX: Pick child with highest value (Darwiche)
e: X1 = 1
0.7 × 0.42 = 0.294
0.3 × 0.72 = 0.216
[Figure: example SPN with every sum node replaced by a MAX node; root weights 0.7 and 0.3; product node values 0.42 and 0.72]
This is followed by a downward pass, which recursively picks the most probable child at each sum node.
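The MPE computation can be sketched in Python as follows. This is a simplified illustration: it folds the upward and downward passes into one recursion that returns both the max value and the chosen assignment. As in the earlier evaluation example, only the root weights 0.7 and 0.3 are taken from the slides; the lower weights, the tuple encoding, and the name `mpe` are assumptions chosen to reproduce the quoted values 0.294 and 0.216.

```python
def mpe(node, evidence):
    """Return (max value, assignment) with every sum replaced by a max."""
    kind = node[0]
    if kind == "leaf":
        _, var, val = node
        if var in evidence and evidence[var] != val:
            return 0.0, {}           # indicator contradicts the evidence
        return 1.0, {var: val}       # indicator is 1; tentatively pick this value
    if kind == "prod":
        value, assignment = 1.0, {}
        for child in node[1]:
            v, a = mpe(child, evidence)
            value *= v
            assignment.update(a)
        return value, assignment
    # Max node: keep the child with the highest weighted value.
    best_value, best_assignment = -1.0, {}
    for weight, child in node[1]:
        v, a = mpe(child, evidence)
        if weight * v > best_value:
            best_value, best_assignment = weight * v, a
    return best_value, best_assignment


x1, not_x1 = ("leaf", "X1", 1), ("leaf", "X1", 0)
x2, not_x2 = ("leaf", "X2", 1), ("leaf", "X2", 0)
left = ("prod", [("sum", [(0.6, x1), (0.4, not_x1)]),
                 ("sum", [(0.3, x2), (0.7, not_x2)])])
right = ("prod", [("sum", [(0.9, x1), (0.1, not_x1)]),
                  ("sum", [(0.2, x2), (0.8, not_x2)])])
root = ("sum", [(0.7, left), (0.3, right)])

value, state = mpe(root, {"X1": 1})
```

With evidence X1 = 1, the left branch scores 0.7 × 0.42 = 0.294 and the right branch 0.3 × 0.72 = 0.216, so the root picks the left branch and the completed state sets X2 = 0.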
33 Handling Continuous Variables
Sum → Integral over input
Simplest case: Indicator → Gaussian
SPN compactly defines a very large mixture of Gaussians
So far, we have assumed that the variables are Boolean. Extension to discrete variables is trivial: we simply use indicators for all discrete values. To handle continuous variables, as we will do in the experiments, we can replace the sum nodes over inputs by an integral. E.g., we can replace the indicators by a univariate Gaussian. In this case, SPN … We know that with enough components, a mixture of Gaussians can approximate any continuous distribution. So SPNs are a potentially very powerful way to model continuous distributions.
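A minimal Python sketch of the Gaussian case follows. It is an illustration under stated assumptions, not the model used in the experiments: the two-component structure, the weights 0.7 and 0.3, and all means and standard deviations are made-up values, and `leaf_value` and `spn_density` are hypothetical names. An unobserved variable is passed as `None`, in which case its leaf integrates to 1, mirroring the "set all indicators to 1" rule for Boolean variables.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a univariate Gaussian at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def leaf_value(x, mean, std):
    """Gaussian leaf: its density if x is observed, 1 if x is integrated out."""
    return 1.0 if x is None else gaussian_pdf(x, mean, std)


def spn_density(x1, x2):
    """Root sum over two products; each product has one Gaussian leaf per variable."""
    component_a = leaf_value(x1, 0.0, 1.0) * leaf_value(x2, 0.0, 1.0)
    component_b = leaf_value(x1, 3.0, 0.5) * leaf_value(x2, -2.0, 2.0)
    return 0.7 * component_a + 0.3 * component_b


joint = spn_density(0.5, -1.0)      # joint density at (0.5, -1.0)
marginal = spn_density(0.5, None)   # density of X1 = 0.5 with X2 integrated out
total = spn_density(None, None)     # everything integrated out
```

Here `total` is 1 because the weights sum to 1 and every leaf is a normalized density. A deeper SPN with many such sums and products expands into a mixture with a very large number of Gaussian components.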
34 SPNs Everywhere
Graphical models
Existing tractable models & inference methods
Determinism, context-specific independence, etc.
Can potentially learn the optimal way
TO-DO: Three combined
Existing inference algorithms such as … can be viewed as special cases of SPNs. In addition, SPNs can represent nested combinations of these algorithms, and learn from data the optimal way to combine them.
35 SPNs Everywhere
Graphical models
Methods for efficient inference
E.g., arithmetic circuits, AND/OR graphs, case-factor diagrams
SPNs are a class of probabilistic models
SPNs have validity conditions
SPNs can be learned from data
Highly related, but these are three key contributions of our work that go beyond ACs etc. In contrast to ACs and AND/OR graphs, which are data structures for efficient inference, SPNs are a class … SPNs have val … whereas ACs have none, because the question doesn’t arise in the context of compilation. SPNs are the first to make a connection between efficient inference and deep learning.
SPNs are more general than ACs and AND/OR graphs, and simpler than case-factor diagrams (CFDs). Moreover, ACs and AND/OR graphs are considered compilation targets; SPNs show that they are representations in their own right. Most importantly, SPNs represent the first approach for learning directly from data.
36 SPNs Everywhere
Graphical models
Models for efficient inference
General, probabilistic convolutional network
Sum: Average-pooling
Max: Max-pooling
A convolutional network is typically viewed as a vision-specific architecture. SPNs provide a general and probabilistic view of convolutional networks. A sum node can be viewed as implementing average-pooling. We can also do max-pooling by replacing it with a max node.
37 SPNs Everywhere
Graphical models
Models for efficient inference
General, probabilistic convolutional network
Grammars in vision and language
E.g., object detection grammar, probabilistic context-free grammar
Sum: Non-terminal
Product: Production rule
Finally, SPNs can easily implement grammars in vision and NLP. E.g., … More importantly, SPNs provide a principled way to learn grammars from data.
38 Outline
Sum-product networks (SPNs)
Learning SPN
Experimental results
Conclusion
So far, we have seen all the good things about SPNs. But the crucial question is: can we learn them effectively?
39 General Approach
Start with a dense SPN
Find the structure by learning weights
Zero weights signify absence of connections
Can learn with gradient descent or EM
A general approach is to learn both the structure and the weights by starting with … and … Alternatively, we can also learn with EM …
40 The Challenge
Gradient diffusion: Gradient quickly dilutes
Similar problem with EM
Hard EM overcomes this problem
SPNs lend themselves naturally to gradient descent. The gradient can be computed easily using backpropagation. Unfortunately, this suffers from the gradient diffusion problem. EM suffers from a similar problem, … assign fractional examples to children … This leads us to the second major contribution of this talk: we propose to use hard EM to address this problem. … pick max child … maintain unit updates no matter how deep the network is.
41 Our Learning Algorithm
Online learning
Hard EM
Sum node maintains counts for each child
For each example:
  Find MPE instantiation with current weights
  Increment count for each chosen child
  Renormalize to set new weights
Repeat until convergence
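The steps above can be sketched in Python as follows. This is a simplified illustration rather than the authors' implementation: the class names, the tiny two-variable network, the initial count of 1 per child, and the fixed number of passes (in place of a convergence test) are all assumptions made here. Weights are derived from the counts on demand, so renormalization happens implicitly.

```python
class Leaf:
    def __init__(self, var, val):
        self.var, self.val = var, val


class Product:
    def __init__(self, children):
        self.children = children


class Sum:
    def __init__(self, children):
        self.children = children
        self.counts = [1] * len(children)   # assumed initial count per child

    def weights(self):
        total = sum(self.counts)
        return [c / total for c in self.counts]


def best_path(node, example):
    """Return (max value, [(sum node, chosen child index), ...]) for one example."""
    if isinstance(node, Leaf):
        return (1.0 if example[node.var] == node.val else 0.0), []
    if isinstance(node, Product):
        value, path = 1.0, []
        for child in node.children:
            v, p = best_path(child, example)
            value *= v
            path += p
        return value, path
    best = None
    for i, (w, child) in enumerate(zip(node.weights(), node.children)):
        v, p = best_path(child, example)
        if best is None or w * v > best[0]:
            best = (w * v, p + [(node, i)])
    return best


def hard_em(root, data, passes=1):
    """Online hard EM: one unit count update per chosen child per example."""
    for _ in range(passes):
        for example in data:
            _, path = best_path(root, example)   # MPE with current weights
            for sum_node, i in path:
                sum_node.counts[i] += 1          # weights renormalize on demand


def x1_sum():
    return Sum([Leaf("X1", 1), Leaf("X1", 0)])


def x2_sum():
    return Sum([Leaf("X2", 1), Leaf("X2", 0)])


s1a, s2a, s1b, s2b = x1_sum(), x2_sum(), x1_sum(), x2_sum()
root = Sum([Product([s1a, s2a]), Product([s1b, s2b])])
hard_em(root, [{"X1": 1, "X2": 0}] * 3)
```

After three identical examples, only the sum nodes on the winning path have changed: the root and `s1a` hold counts `[4, 1]`, `s2a` holds `[1, 4]`, and the untouched branch keeps `[1, 1]`. Every update is a full unit count regardless of depth, which is the property the talk credits for avoiding gradient diffusion.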
43 Task: Image Completion
Methodology:
Learn a model from training images
Complete unseen test images
Measure mean square errors
Very challenging
Good for evaluating deep models
To evaluate SPNs, we use the task of image completion, which is … and particularly good for … Specifically, … … cover up a contiguous half of the image, such as left/bottom …
44 Datasets
Main evaluation: Caltech-101 [Fei-Fei et al., 2004]
101 categories, e.g., faces, cars, elephants
Each category: 30 – 800 images
Also, Olivetti [Samaria & Harter, 1994] (400 faces)
Each category: Last third for test
Test images: Unseen objects
Our main evaluation is done on the well-known … … unseen objects … E.g., at test time we will use the faces of Anna; in training we may see Bob and Mary, but never any images of Anna at all. This is in contrast to many existing works, where the training set often contains images very similar to the test images, for which nearest neighbor can do quite well. By focusing on …
45 SPN Architecture
[Figure: Whole Image, Region, Pixel (x, y)]
Sum for the whole image. Regions are decomposed into smaller ones. The smallest region is a pixel, modeled by a univariate Gaussian. Combine with other pixels by product, add a mixture on top, then combine with other regions to form bigger regions …
50 SPN vs. DBM / DBN
SPN is order of magnitude faster
No elaborate preprocessing, tuning
Reduced errors by 30-60%
Learned up to 46 layers
            SPN          DBM / DBN
Learning    2-3 hours    Days
Inference   < 1 second   Minutes or hours
51-56 Example Completions
[Figures on six slides: image completions in columns Original, SPN, DBM, DBN, PCA, Nearest Neighbor]
TO-DO: Add animation
57 Open Questions
Other learning algorithms
Discriminative learning
Architecture
Continuous SPNs
Sequential domains
Other applications
These results are quite promising, but this is only the first step and much remains to be explored.
58 End-to-End Comparison
[Diagram: Data → (Approximate) → General Graphical Models → (Approximate) → Performance]
[Diagram: Data → (Approximate) → Sum-Product Networks → (Exact) → Performance]
Given same computation budget, which approach has better performance?
Some distributions, such as the Ising model, can be compactly represented by GMs but not by SPNs. However, inference then requires approximations such as Gibbs sampling and variational inference, which can also be viewed as special cases of SPNs. E.g., Gibbs sampling can be viewed as picking a random child rather than summing over them. So given the same computation budget, which approach will have better performance? This will be a fascinating question both theoretically and practically.
59 Approximate Inference
[Diagram: True Model, Approximate Inference, Optimal SPN; Sum-Product Networks, Graphical Models, Existing Tractable Models]
In this talk, we ask the question: how general can a tractable model be? We found that there are surprisingly general conditions, and we introduce SPNs based on these conditions. The key insight is to incorporate computation into probabilistic modeling, and to compactly represent the partition function as a deep network.
60 Conclusion
Sum-product networks (SPNs)
DAG of sums and products
Compactly represent partition function
Learn many layers of hidden variables
Exact inference: Linear time in network size
Deep learning: Online hard EM
Substantially outperform state of the art on image completion
… contribute to deep learning by …