Sum-Product Networks: A New Deep Architecture

Presentation transcript:

Sum-Product Networks: A New Deep Architecture. Hoifung Poon, Microsoft Research. Joint work with Pedro Domingos.

Graphical Models: Challenges. Examples: restricted Boltzmann machine (RBM), Bayesian network (e.g., the Sprinkler / Rain / Grass Wet network), Markov network. Advantage: they compactly represent probability distributions. As we all know, graphical models have many attractive properties for probabilistic modeling. However, there are major challenges in using them: in particular, inference is generally intractable. In turn, this makes learning the parameters or structure of a graphical model a very difficult task, since learning typically requires inference as a subroutine. Problem: inference is intractable. Problem: learning is difficult.

Deep Learning. Stack many layers, e.g.: DBN [Hinton & Salakhutdinov, 2006], CDBN [Lee et al., 2009], DBM [Salakhutdinov & Hinton, 2010]. Potentially much more powerful than shallow architectures [Bengio, 2009]. But inference is even harder, and learning requires extensive effort. Recently, deep learning has attracted increasing interest. The standard approach is to stack many layers of graphical models (such as RBMs). Compared to shallow architectures, this is potentially much more powerful in terms of representation and statistical learning. Unfortunately, even learning one layer of RBM is intractable. And when multiple layers are stacked, inference becomes even harder, and learning requires the black art of extensive engineering such as preprocessing and tuning.

Graphical Models: learning requires approximate inference, and inference is still approximate. To identify potential solutions, let's consider the following visualization. Here, the red circle represents all distributions that can be compactly represented by a graphical model. As I mentioned earlier, the challenge is that learning requires approximate inference, and even if a perfect model is learned, inference still requires approximation (such as Gibbs sampling or variational inference). There is no guarantee on how long it will take or on the quality of the results.

Existing Tractable Models (within Graphical Models). Problem: too restricted, e.g., hierarchical mixture models, thin junction trees, etc. An alternative approach is to do all the approximation upfront, learn a tractable model, and then have peace of mind at inference time. This sounds very attractive, but unfortunately existing tractable models such as hierarchical mixture models and thin junction trees are too restricted.

This Talk: Sum-Product Networks. Compactly represent the partition function using a deep network. In this talk, we ask the question: how general can a tractable model be? We found that there are surprisingly general conditions, and we introduce SPNs based on these conditions. The key insight is to incorporate computation into the probabilistic model and to compactly represent the partition function as a deep network.

Sum-Product Networks: exact inference takes time linear in the size of the SPN.

Sum-Product Networks can compactly represent many more distributions than existing tractable models.

Sum-Product Networks learn the optimal way to reuse computation. In particular, SPNs can compactly represent many general classes of distributions for which the graphical model is a completely connected graph. The reason is that SPNs can leverage structural properties such as determinism and context-specific independence, and learn the optimal way to form and reuse partial computations.

Outline: sum-product networks (SPNs); learning SPNs; experimental results; conclusion. I will start by introducing SPNs. Then I will address the crucial problem of learning an SPN. Finally, I will present experimental results and conclude.

Why Is Inference Hard? The bottleneck is summing out variables; the extreme example is the partition function, a sum of exponentially many products. Before introducing SPNs, let's consider why inference is difficult in a graphical model. Computing a probability is actually easy up to a proportionality constant: we simply multiply the potentials of the factors. What makes inference hard is summing out variables, which requires summing over an exponential number of combinations. The extreme case is the partition function Z, the sum over all joint states of the product of potentials, which requires summing out all variables.
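To make the bottleneck concrete, here is a minimal Python sketch (my own illustration, not from the talk) of computing a partition function by brute force for a toy Markov network; the loop over the 2^N joint states is exactly the exponential sum the slide refers to. The potentials used here are arbitrary placeholders.

```python
import itertools

def brute_force_Z(potentials, num_vars):
    """Sum the product of all potentials over every joint state -- 2^N terms."""
    Z = 0.0
    for state in itertools.product([0, 1], repeat=num_vars):
        value = 1.0
        for phi in potentials:
            value *= phi(state)
        Z += value
    return Z

# Toy example: two pairwise factors over three Boolean variables.
phi1 = lambda s: 2.0 if s[0] == s[1] else 1.0
phi2 = lambda s: 3.0 if s[1] == s[2] else 1.0
print(brute_force_Z([phi1, phi2], num_vars=3))  # enumerates all 8 states
```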

Alternative Representation. For two Boolean variables with P(X1=1, X2=1) = 0.4, P(X1=1, X2=0) = 0.2, P(X1=0, X2=1) = 0.1, and P(X1=0, X2=0) = 0.3, we can write P(X) = 0.4 · I[X1=1] · I[X2=1] + 0.2 · I[X1=1] · I[X2=0] + 0.1 · I[X1=0] · I[X2=1] + 0.3 · I[X1=0] · I[X2=0]. To address this problem, we build on an idea of Darwiche (2003). For simplicity, we start with Boolean variables; later we will see how to extend this to discrete and continuous variables. Suppose we have two variables X1 and X2. The probability distribution can be represented as a table. Alternatively, we can represent it as a summation of the probabilities multiplied by the corresponding indicators; for example, the indicator I[X1=1] is 1 if X1=1 and 0 otherwise. This is the network polynomial [Darwiche, 2003].

Alternative Representation (continued). It is easy to verify that the probability of any complete state can be computed from the network polynomial by setting the indicators to the values that match the state [Darwiche, 2003].

Shorthand for Indicators. To make things less cumbersome, we use the shorthand x1 for I[X1=1] and x̄1 for I[X1=0] (and similarly for X2), so the network polynomial becomes P(X) = 0.4 · x1 · x2 + 0.2 · x1 · x̄2 + 0.1 · x̄1 · x2 + 0.3 · x̄1 · x̄2 [Darwiche, 2003].

Sum Out Variables. The good news is that summing out a variable in this representation is easy. Suppose we want to compute the marginal of the evidence e: X1 = 1, which requires summing out X2. We can do this by evaluating P(e) = 0.4 · x1 · x2 + 0.2 · x1 · x̄2 + 0.1 · x̄1 · x2 + 0.3 · x̄1 · x̄2 with x1 = 1, x̄1 = 0, x2 = 1, x̄2 = 1: setting both indicators of X2 to 1 includes both of its values in the summation. In particular, to compute the partition function Z, we can set all indicators to 1. This sounds great, except for one problem: there are exponentially many terms.
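As a sanity check, here is a small Python sketch (my own, not from the talk) of the two-variable network polynomial above, evaluated as the slides describe: a full state sets the matching indicators to 1 and the rest to 0, a marginal sets both indicators of the summed-out variable to 1, and the partition function sets every indicator to 1. The names x1, nx1 (for x̄1), and so on are just my shorthand.

```python
def S(ind):
    """Network polynomial for the two-variable example; ind maps indicator names to 0/1."""
    return (0.4 * ind['x1'] * ind['x2'] +
            0.2 * ind['x1'] * ind['nx2'] +
            0.1 * ind['nx1'] * ind['x2'] +
            0.3 * ind['nx1'] * ind['nx2'])

print(S({'x1': 1, 'nx1': 0, 'x2': 0, 'nx2': 1}))  # full state X1=1, X2=0 -> 0.2
print(S({'x1': 1, 'nx1': 0, 'x2': 1, 'nx2': 1}))  # marginal P(X1=1)      -> 0.6
print(S({'x1': 1, 'nx1': 1, 'x2': 1, 'nx2': 1}))  # partition function Z  -> 1.0
```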

Graphical Representation. We can visualize the network polynomial as a graph: a sum node with weights 0.4, 0.2, 0.1, 0.3 performs the weighted summation, product nodes perform the multiplications, and the leaves are the indicators x1, x̄1, x2, x̄2.

But … Exponentially Large. Example: parity, the uniform distribution over states with an even number of 1's. The problem is that this graph is exponentially large: for N variables there are 2^(N-1) even states, so the naive representation has 2^(N-1) product nodes and on the order of N · 2^(N-1) edges. It is easy to show that there is no conditional independence under this distribution, so to represent it as a graphical model you need a fully connected graph.

But … Exponentially Large. Example: parity, the uniform distribution over states with an even number of 1's. So the key question is: can we make this graph more compact?

Use a Deep Network. Example: parity, the uniform distribution over states with an even number of 1's. The answer is yes: a deep network of size O(N) represents exactly the same distribution.

Use a Deep Network. Example: parity, the uniform distribution over states with an even number of 1's. The key is to introduce many hidden layers and reuse partial computation, as in the sketch below.
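Here is a minimal Python sketch of a linear-size parity network. This is my own construction rather than the exact network drawn on the slide, but it follows the same idea: at every layer, two sum nodes track "even parity so far" and "odd parity so far", each mixing (with weight 0.5) two products that extend the previous layer by one more variable. Evaluating it bottom-up on a full state returns 1/2^(N-1) for even states and 0 for odd ones.

```python
def parity_spn(indicators):
    """indicators: list of (x_i, not_x_i) indicator values, one pair per variable."""
    x1, nx1 = indicators[0]
    even, odd = nx1, x1                          # base layer: parity of X1 alone
    for xk, nxk in indicators[1:]:               # one layer per additional variable
        even, odd = (0.5 * (even * nxk + odd * xk),   # even parity so far
                     0.5 * (odd * nxk + even * xk))   # odd parity so far
    return even                                  # root: even number of 1's

def full_state(bits):
    return [(b, 1 - b) for b in bits]

print(parity_spn(full_state([0, 1, 1, 0, 0])))   # even state -> 1/16 = 0.0625
print(parity_spn(full_state([1, 0, 0, 0, 0])))   # odd state  -> 0.0
print(parity_spn([(1, 1)] * 5))                  # all indicators set to 1 -> Z = 1.0
```

Each added variable reuses the "even so far" and "odd so far" sub-networks instead of enumerating the 2^(N-1) even states, which is exactly the reuse of partial computation the slide points to.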

Arithmetic Circuits (ACs): a data structure for efficient inference [Darwiche, 2003], used as a compilation target for Bayesian networks. Key idea: use such circuits instead to define a new class of deep probabilistic models, and develop new deep learning algorithms for this class. We call this class sum-product networks, since it represents distributions using DAGs of sums and products. As we will see, these enable much better deep learning than before.

Sum-Product Networks (SPNs). An SPN is a rooted DAG whose leaves are input indicators and whose internal nodes are sums and products, with weights on the edges from each sum node to its children. A sum node outputs the weighted sum of its children, and a product node outputs the product of its children. The example network has a root sum with weights 0.7 and 0.3 over two product nodes; each product combines a sum over the X1 indicators (weights 0.6/0.4 and 0.9/0.1) with a sum over the X2 indicators (weights 0.3/0.7 and 0.2/0.8).
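The following Python sketch (class and method names are my own) captures this definition directly: indicator leaves, product nodes, and weighted sum nodes, each with a bottom-up eval. A variable missing from the evidence dictionary is treated as summed out, i.e., both of its indicators evaluate to 1.

```python
class Leaf:
    """Indicator I[var = value]."""
    def __init__(self, var, value):
        self.var, self.value = var, value
    def eval(self, evidence):
        if self.var not in evidence:             # variable summed out: indicator set to 1
            return 1.0
        return 1.0 if evidence[self.var] == self.value else 0.0

class Product:
    def __init__(self, children):
        self.children = children
    def eval(self, evidence):
        value = 1.0
        for child in self.children:
            value *= child.eval(evidence)
        return value

class Sum:
    def __init__(self, weighted_children):       # list of (weight, child) pairs
        self.weighted_children = weighted_children
    def eval(self, evidence):
        return sum(w * child.eval(evidence) for w, child in self.weighted_children)
```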

Distribution Defined by an SPN: P(X) ∝ S(X). We can define a probability distribution by treating the SPN's output values as unnormalized probabilities, just like in a Markov network.

Can We Sum Out Variables? That is, for evidence e (here e: X1 = 1, with both X2 indicators set to 1), is P(e) ∝ Σ_{X ~ e} S(X) equal to S(e)? In other words, can we replace the exponential sum for the true marginal by a single linear-time evaluation of the network? If so, we are golden; but for an arbitrary SPN this may not be true.

Valid SPN. We define an SPN to be valid if S(e) = Σ_{X ~ e} S(X) for all evidence e. Valid ⇒ marginals can be computed efficiently, and the partition function Z can be computed by setting all indicators to 1.

Valid SPN: General Conditions. Theorem: an SPN is valid if it is complete and consistent. Complete: under every sum node, the children cover the same set of variables. Consistent: under every product node, no variable appears in one child and its negation in another. Fortunately, these conditions are both very general and very easy to validate or enforce. The slide shows an incomplete example and an inconsistent example, in both of which S(e) differs from Σ_{X ~ e} S(X).
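A sketch of how one might check these two conditions, assuming the Leaf/Product/Sum classes from the earlier sketch; the helper names are my own. Completeness compares the scopes of a sum node's children; consistency checks that no variable appears with one value in one child of a product and the opposite value in another.

```python
def literals(node):
    """All (variable, value) indicator pairs reachable from a node."""
    if isinstance(node, Leaf):
        return {(node.var, node.value)}
    kids = node.children if isinstance(node, Product) else [c for _, c in node.weighted_children]
    return set().union(*(literals(k) for k in kids))

def scope(node):
    return {var for var, _ in literals(node)}

def is_complete(sum_node):
    # All children of the sum must cover the same set of variables.
    scopes = [scope(child) for _, child in sum_node.weighted_children]
    return all(s == scopes[0] for s in scopes)

def is_consistent(product_node):
    # No variable may appear with value v in one child and value 1-v in another child.
    lits = [literals(child) for child in product_node.children]
    return not any((var, 1 - val) in lits[j]
                   for i, li in enumerate(lits) for (var, val) in li
                   for j in range(len(lits)) if j != i)
```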

Semantics of Sums and Products. Product → feature → form a feature hierarchy. Sum → mixture (with the hidden variable summed out). The nodes in an SPN have very intuitive semantics. A product node defines a feature by taking the conjunction of its children; together the products form a feature hierarchy. A sum node i with weights w_ij can be viewed as a mixture in which a hidden variable Yi chooses the child; summing Yi out corresponds to setting all of its indicators I[Yi = j] to 1, after which the indicators can be dropped. An SPN can thus be viewed as combining mixtures of experts and products of experts with alternating sums and products.

Inference. Probability: P(X) = S(X) / Z. To compute the probability of the complete state X: X1 = 1, X2 = 0, we set the matching indicators to 1 and the others to 0 and evaluate bottom-up: the sum nodes over X1 output 0.6 and 0.9, the sum nodes over X2 output 0.7 and 0.8, the products output 0.42 and 0.72, and the root outputs 0.7 × 0.42 + 0.3 × 0.72 = 0.51.

Inference. If the weights sum to 1 at each sum node, then Z = 1 and P(X) = S(X): the probability is simply the value at the root, here S(X1 = 1, X2 = 0) = 0.51.

Inference. Marginal: P(e) = S(e) / Z. Similarly, for the partial state e: X1 = 1 we set both X2 indicators to 1; the sum nodes over X2 then output 1, the products output 0.6 and 0.9, and the root outputs 0.7 × 0.6 + 0.3 × 0.9 = 0.69 = 0.51 + 0.18, which is exactly P(X1=1, X2=0) + P(X1=1, X2=1). Moreover, we can simultaneously compute marginals for all nodes and inputs with one upward and one downward pass.
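Putting the pieces together, here is a sketch that rebuilds the small SPN from the figure with the Leaf/Product/Sum classes above (the weights are read off the slide) and reproduces the numbers on these inference slides.

```python
x1, nx1 = Leaf('X1', 1), Leaf('X1', 0)
x2, nx2 = Leaf('X2', 1), Leaf('X2', 0)

sum_x1_a = Sum([(0.6, x1), (0.4, nx1)])
sum_x1_b = Sum([(0.9, x1), (0.1, nx1)])
sum_x2_a = Sum([(0.3, x2), (0.7, nx2)])
sum_x2_b = Sum([(0.2, x2), (0.8, nx2)])

root = Sum([(0.7, Product([sum_x1_a, sum_x2_a])),
            (0.3, Product([sum_x1_b, sum_x2_b]))])

print(root.eval({'X1': 1, 'X2': 0}))  # 0.51: P(X1=1, X2=0), since Z = 1
print(root.eval({'X1': 1}))           # 0.69: marginal P(X1=1), X2 summed out
print(root.eval({}))                  # 1.0:  partition function Z
```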

Inference. MPE: replace sums with maxes [Darwiche, 2003]. The most probable explanation (MPE) state can be computed by replacing each sum node with a max node and conducting an upward pass: for e: X1 = 1, the root takes the max of 0.7 × 0.42 = 0.294 and 0.3 × 0.72 = 0.216.

Inference. MAX: pick the child with the highest value. The upward pass is followed by a downward pass that recursively picks the most probable child at each max node; the indicators reached in this way give the MPE state [Darwiche, 2003].
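A sketch of both passes, assuming the node classes and the root network built above; the function names are my own. The upward pass replaces every sum by a weighted max; the downward pass follows the argmax child at each (former) sum node and reads the chosen indicators off the leaves.

```python
def max_eval(node, evidence):
    """Upward pass: like eval(), but sum nodes take a weighted max over their children."""
    if isinstance(node, Leaf):
        return node.eval(evidence)
    if isinstance(node, Product):
        value = 1.0
        for child in node.children:
            value *= max_eval(child, evidence)
        return value
    return max(w * max_eval(child, evidence) for w, child in node.weighted_children)

def mpe_assign(node, evidence, state):
    """Downward pass: pick the best child at each sum, collect leaf assignments."""
    if isinstance(node, Leaf):
        if node.var not in evidence:
            state[node.var] = node.value
    elif isinstance(node, Product):
        for child in node.children:
            mpe_assign(child, evidence, state)
    else:
        _, best = max(((w * max_eval(child, evidence), child)
                       for w, child in node.weighted_children), key=lambda t: t[0])
        mpe_assign(best, evidence, state)

evidence, state = {'X1': 1}, {}
print(max_eval(root, evidence))   # 0.294, as on the slide
mpe_assign(root, evidence, state)
print(state)                      # {'X2': 0}: the most probable completion given X1 = 1
```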

Handling Continuous Variables. Sum → integral over the input; in the simplest case, indicator → Gaussian. So far we have assumed the variables are Boolean. The extension to discrete variables is trivial: we simply use one indicator per discrete value. To handle continuous variables, as we will do in the experiments, we replace the sum nodes over the input by integrals; e.g., we can replace the indicators by univariate Gaussians. In this case, the SPN compactly defines a very large mixture of Gaussians. With enough components, a mixture of Gaussians can approximate any continuous distribution, so SPNs are a potentially very powerful way to model continuous distributions.
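A sketch of a Gaussian leaf that plugs into the Sum/Product classes above; the means and standard deviations here are arbitrary placeholders. A missing variable integrates out to 1, so marginals and the partition function work as before.

```python
import math

class GaussianLeaf:
    """Univariate Gaussian leaf over a continuous variable."""
    def __init__(self, var, mean, std):
        self.var, self.mean, self.std = var, mean, std
    def eval(self, evidence):
        if self.var not in evidence:             # variable integrated out
            return 1.0                           # a density integrates to 1
        z = (evidence[self.var] - self.mean) / self.std
        return math.exp(-0.5 * z * z) / (self.std * math.sqrt(2 * math.pi))

# A sum over Gaussian leaves is an ordinary mixture; stacking sums and products
# over many variables yields a mixture with exponentially many components.
pixel = Sum([(0.6, GaussianLeaf('p0', 0.2, 0.1)),
             (0.4, GaussianLeaf('p0', 0.8, 0.1))])
print(pixel.eval({'p0': 0.25}))   # mixture density at 0.25
```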

SPNs Everywhere: graphical models. Existing tractable models and inference methods can be viewed as special cases of SPNs; they exploit structural properties such as determinism and context-specific independence. In addition, SPNs can represent nested combinations of these methods, and can potentially learn from data the optimal way to combine them.

SPNs Everywhere: methods for efficient inference, e.g., arithmetic circuits, AND/OR graphs, and case-factor diagrams. These are highly related, but our work makes three key contributions that go beyond them. In contrast to arithmetic circuits and AND/OR graphs, which are data structures for efficient inference and compilation targets, SPNs are a class of probabilistic models in their own right. SPNs have validity conditions, whereas ACs have none, because the question does not arise in the context of compilation. SPNs are more general than ACs and AND/OR graphs and simpler than case-factor diagrams. Most importantly, SPNs are the first to make a connection between efficient inference and deep learning, and the first such representation to be learned directly from data.

SPNs Everywhere: a general, probabilistic convolutional network. Convolutional networks are typically viewed as a vision-specific architecture; SPNs provide a general, probabilistic view of convolution. A sum node can be viewed as implementing average-pooling, and we can do max-pooling by replacing it with a max node.

SPNs Everywhere: grammars in vision and language, e.g., object detection grammars and probabilistic context-free grammars, with sum nodes as non-terminals and product nodes as production rules. Finally, SPNs can easily implement grammars in vision and NLP, and, more importantly, they provide a principled way to learn grammars from data.

Outline: sum-product networks (SPNs); learning SPNs; experimental results; conclusion. So far we have seen all the good things about SPNs, but the crucial question is: can we learn them effectively?

General Approach. Start with a dense SPN and find the structure by learning the weights: zero weights signify the absence of connections. The weights can be learned with gradient descent or, alternatively, with EM.

The Challenge. Gradient diffusion: the gradient quickly dilutes with depth, and EM has a similar problem; hard EM overcomes it. SPNs lend themselves naturally to gradient descent, and the gradient can be computed easily using backpropagation. Unfortunately, this suffers from the gradient diffusion problem. EM suffers from a similar problem, since it assigns fractional examples to children and the fractions dilute with depth. This leads to the second major contribution of this talk: we propose to use hard EM to address the problem. By picking the max child, hard EM maintains unit updates no matter how deep the network is.

Our Learning Algorithm: online learning with hard EM. Each sum node maintains a count for each child. For each example: find the MPE instantiation with the current weights; increment the count for each chosen child; renormalize the counts to set the new weights. Repeat until convergence.
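A sketch of this loop, building on the earlier node classes and the max_eval upward pass; mpe_choices is my own variant of the earlier downward pass that records which child each visited sum node selects. The counts are smoothed by a small prior, and the fixed epoch count stands in for the convergence test on the slide.

```python
def mpe_choices(node, evidence, chosen):
    """Downward MPE pass that records the argmax child index at every visited sum node."""
    if isinstance(node, Leaf):
        return
    if isinstance(node, Product):
        for child in node.children:
            mpe_choices(child, evidence, chosen)
        return
    values = [w * max_eval(child, evidence) for w, child in node.weighted_children]
    k = max(range(len(values)), key=values.__getitem__)
    chosen[node] = k
    mpe_choices(node.weighted_children[k][1], evidence, chosen)

def hard_em(root, data, epochs=10, smoothing=1.0):
    """Online hard EM: count MPE-selected children and renormalize into weights."""
    counts = {}
    for _ in range(epochs):
        for example in data:                     # example: dict variable -> value
            chosen = {}
            mpe_choices(root, example, chosen)   # E-step: MPE with current weights
            for s, k in chosen.items():          # M-step: update counts and weights
                c = counts.setdefault(s, [smoothing] * len(s.weighted_children))
                c[k] += 1
                total = sum(c)
                s.weighted_children = [(ci / total, child)
                                       for ci, (_, child) in zip(c, s.weighted_children)]
    return root
```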

Outline: sum-product networks (SPNs); learning SPNs; experimental results; conclusion.

Task: Image Completion. Methodology: learn a model from training images, complete unseen test images, and measure mean squared errors. To evaluate SPNs, we use the task of image completion, which is very challenging and particularly good for evaluating deep models. Specifically, we cover up a contiguous half of each test image, such as the left half or the bottom half, and ask the model to complete it.

Datasets. Main evaluation: Caltech-101 [Fei-Fei et al., 2004], 101 categories (e.g., faces, cars, elephants) with 30 to 800 images each; also Olivetti [Samaria & Harter, 1994] (400 faces). For each category, the last third of the images is used for testing, and the test images contain unseen objects: e.g., at test time we may use the faces of Anna, while at training time we may see Bob and Mary but never any image of Anna at all. This is in contrast to many existing works, where the training set often contains images very similar to the test images, for which nearest neighbor can do quite well. By focusing on unseen objects, the evaluation stresses generalization.

SPN Architecture. A sum node is placed over the whole image; regions are recursively decomposed into smaller regions; the smallest region is a single pixel, modeled with univariate Gaussians. Pixels are combined with other pixels by product nodes, mixtures (sum nodes) are added on top, and these are in turn combined to form bigger regions, up to the whole image.
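Here is a much-simplified sketch of building such an architecture with the classes above: every rectangular region gets a sum node mixing over its possible splits into two sub-rectangles, a product node joins the two halves of each split, and single pixels become small mixtures of Gaussians. Memoizing on region coordinates makes sub-networks shared, so the result is a DAG. The split scheme, mixture size, and Gaussian parameters here are placeholders, not the exact choices used in the experiments.

```python
def build_region(x0, y0, x1, y1, cache, n_components=2):
    """Recursively build an SPN over the rectangle [x0, x1) x [y0, y1)."""
    key = (x0, y0, x1, y1)
    if key in cache:
        return cache[key]                        # reuse shared sub-regions (DAG, not tree)
    if x1 - x0 == 1 and y1 - y0 == 1:            # single pixel: mixture of Gaussians
        var = f'pixel_{x0}_{y0}'
        node = Sum([(1.0 / n_components, GaussianLeaf(var, m / n_components, 0.25))
                    for m in range(n_components)])
    else:
        parts = []
        for x in range(x0 + 1, x1):              # every vertical split
            parts.append(Product([build_region(x0, y0, x, y1, cache),
                                  build_region(x, y0, x1, y1, cache)]))
        for y in range(y0 + 1, y1):              # every horizontal split
            parts.append(Product([build_region(x0, y0, x1, y, cache),
                                  build_region(x0, y, x1, y1, cache)]))
        node = Sum([(1.0 / len(parts), p) for p in parts])
    cache[key] = node
    return node

root = build_region(0, 0, 4, 4, cache={})        # SPN over a tiny 4x4 image
print(root.eval({}))                             # 1.0: weights are normalized, so Z = 1
```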

Decomposition (figure: a region split into two subregions).

Decomposition (continued; figure: a sum node mixing over alternative decompositions of a region into subregions).

Systems: SPN; DBM [Salakhutdinov & Hinton, 2010]; DBN [Hinton & Salakhutdinov, 2006]; PCA [Turk & Pentland, 1991]; nearest neighbor [Hays & Efros, 2007].

Caltech: Mean-Square Errors (figure: bar charts comparing NN, PCA, DBN, DBM, and SPN on left-half and bottom-half completion).

SPN vs. DBM/DBN. SPN is an order of magnitude faster, requires no elaborate preprocessing or tuning, reduced errors by 30-60%, and learned up to 46 layers. Learning time: 2-3 hours for SPN vs. days for DBM/DBN. Inference time: under 1 second for SPN vs. minutes or hours for DBM/DBN.

Example Completions (figures: original test images alongside completions by SPN, DBM, DBN, PCA, and nearest neighbor).

Open Questions: other learning algorithms; discriminative learning; architecture; continuous SPNs; sequential domains; other applications. These results are quite promising, but this is only the first step and much remains to be explored.

End-to-End Comparison. One pipeline goes from data through approximate learning of a general graphical model to approximate inference and, finally, performance; the other goes from data through approximate learning of a sum-product network to exact inference and performance. Some distributions, such as the Ising model, can be compactly represented by graphical models but not by SPNs; however, inference then requires approximations such as Gibbs sampling and variational inference, which can themselves be viewed as special cases of SPNs (e.g., Gibbs sampling can be viewed as picking a random child rather than summing over all children). So, given the same computation budget, which approach will perform better? This is a fascinating question, both theoretically and practically.

The space of distributions revisited (figure: graphical models, existing tractable models, and sum-product networks, with the true model, the optimal SPN, and approximate inference marked). In this talk, we asked how general a tractable model can be. We found that there are surprisingly general conditions, and we introduced SPNs based on these conditions; the key insight is to incorporate computation into the probabilistic model and compactly represent the partition function as a deep network.

Conclusion. Sum-product networks (SPNs) are DAGs of sums and products that compactly represent the partition function and learn many layers of hidden variables. Exact inference takes time linear in network size. Deep learning is done with online hard EM. SPNs substantially outperform the state of the art on image completion, and they contribute to deep learning by combining tractable exact inference with effective learning of very deep models.