Scalable Probabilistic Databases with Factor Graphs and MCMC
Michael Wick, Andrew McCallum, and Gerome Miklau
VLDB 2010

Outline
• Background of research
• Key contributions
• FACTORIE language
• Models for information extraction
• MCMC with database "assist"
• Experimental results
• Implications for information extraction more generally

Background of research
• McCallum: an ML researcher crossing the bridge to DB
  • Mostly tools and apps (incl. IE) for undirected models
• "Probabilistic databases" undergoing significant evolution (see survey by Dalvi et al., CACM, 2009):
  • Early PDB systems attached probabilities to tuples:
    • 0.7: Employs(IBM, John)
    • 0.95: Employs(IBM, Mary), etc.
  • Aggregation queries etc. under global independence
  • Around 2005, model-based approaches took over, but faced the same issues (expressive power, complexity) as in AI

Key contributions
• Increasingly sophisticated CRF-like models for extraction, entity resolution, schema mapping, etc.
• FACTORIE for model construction and inference
• Efficient MCMC inference on relational worlds
  • Handles very large models without blowing up
  • Efficient local computation for each MC step
• Integration with database technology:
  • Possible world = database, MC step = database update
  • Query evaluation directly on database
  • Incremental re-evaluation after each MC step

Factor graphs
• Nodes are variables and factors (potentials on sets of variables)
• Links connect variables to factors that include them
• $P(x_1, \dots, x_n) = \frac{1}{Z} \prod_j F_j(s_j)$ and (in this paper) $F_j(s_j) = \exp(\phi_j(s_j) \cdot \theta_j)$ with features $\phi_j$
• FACTORIE uses loops in a way analogous to BUGS (plates)
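The factorization above can be made concrete with a toy model. This is a minimal sketch, not FACTORIE code: the two binary variables, the feature functions `phi_unary` and `phi_pair`, and the weight values are all hypothetical, and Z is computed by brute-force enumeration, which is only feasible for tiny models.

```python
import itertools
import math

# Hypothetical features: one fires when a variable is 1, one when two variables agree.
def phi_unary(x):
    return [1.0 if x == 1 else 0.0]

def phi_pair(x, y):
    return [1.0 if x == y else 0.0]

# Each factor: (indices of its variables s_j, feature function phi_j, weights theta_j).
# The weights are made-up numbers for illustration.
FACTORS = [
    ((0,), phi_unary, [0.5]),
    ((0, 1), phi_pair, [1.0]),
]

def log_score(world):
    """Sum over factors of phi_j(s_j) . theta_j, i.e. the log of prod_j F_j(s_j)."""
    total = 0.0
    for indices, phi, theta in FACTORS:
        features = phi(*(world[i] for i in indices))
        total += sum(f * t for f, t in zip(features, theta))
    return total

def probability(world):
    """P(world) = exp(log_score(world)) / Z, with Z summed over all four worlds."""
    z = sum(math.exp(log_score(w)) for w in itertools.product([0, 1], repeat=2))
    return math.exp(log_score(world)) / z
```

Enumerating Z is exactly what stops scaling as the number of variables grows, which is why the deck turns to MCMC, where Z cancels out of the acceptance ratio.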

MCMC (Metropolis-Hastings)
• Worlds $x$, evidence $e$, posterior $\pi(x) = P(x \mid e) = P(x, e) / P(e)$
• Proposal distribution $q(x' \mid x)$ determines the neighborhood of $x$
• MH samples $x'$ from $q(x' \mid x)$ and accepts with probability
  $\alpha(x' \mid x) = \min\left(1, \frac{\pi(x')\, q(x \mid x')}{\pi(x)\, q(x' \mid x)}\right) = \min\left(1, \frac{P(x', e)\, q(x \mid x')}{P(x, e)\, q(x' \mid x)}\right)$
• For graphical models (and BLOG), $P(x, e)$ is a product of local conditional probabilities (or potentials)
• If the change from $x$ to $x'$ is local (e.g., a single tuple becomes true or false), almost all terms in $P(x, e)$ and $P(x', e)$ cancel out
• Hence the per-step computation cost is independent of model size
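The cancellation argument can be sketched on a small chain of binary variables. Everything here is an illustrative assumption rather than the paper's model: the chain structure, the weights `w_pair` and `w_unary`, and the single-variable flip proposal. Because that proposal is symmetric, the q terms cancel and the acceptance probability reduces to the exponentiated difference of the log-potentials that touch the flipped variable.

```python
import math
import random

def local_log_score(x, i, w_pair, w_unary):
    """Log-potentials of only the factors that touch variable i (at most three)."""
    score = w_unary[i] * x[i]
    if i > 0:
        score += w_pair * (x[i] == x[i - 1])
    if i < len(x) - 1:
        score += w_pair * (x[i] == x[i + 1])
    return score

def full_log_score(x, w_pair, w_unary):
    """Log of the unnormalized P(x, e); used only to check the local computation."""
    score = sum(w_unary[i] * x[i] for i in range(len(x)))
    score += sum(w_pair * (x[i] == x[i + 1]) for i in range(len(x) - 1))
    return score

def mh_step(x, w_pair, w_unary, rng):
    """One Metropolis-Hastings step: flip one variable, accept or undo in place."""
    i = rng.randrange(len(x))
    before = local_log_score(x, i, w_pair, w_unary)
    x[i] = 1 - x[i]
    after = local_log_score(x, i, w_pair, w_unary)
    # Symmetric proposal, so alpha = min(1, exp(after - before)).
    if rng.random() < math.exp(min(0.0, after - before)):
        return True
    x[i] = 1 - x[i]  # rejected: restore the previous world
    return False
```

Note that `mh_step` never calls `full_log_score`, so its cost does not depend on the length of the chain.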

Additional efficiency
• Pasula and Russell, IJCAI-01: "Approximate inference for first-order probabilistic languages"
  • MCMC over relational worlds with identity uncertainty
• One specific issue: how to sample a new object for a function value?
  • E.g., sample a Prof as the value of Advisor(Student_12), where the student's choice depends on the funding that each Prof has
  • Gibbs sampling: n Profs => compute probabilities for n networks!
  • Metropolis-Hastings: propose a new advisor, evaluate the ratio

MCMC on values
[Figure: four copies of a network with nodes B(H_1) to B(H_5), A(H_1) to A(H_5), Earthquake(R_a), and Earthquake(R_b)]

Integration with DB technology
• Databases are designed for
  • storing lots of data
  • efficient processing of queries on lots of data
• How much can we borrow from DB technology to help with probabilistic IE?

Optimizing query evaluation
• In databases, running a query can be expensive, especially if it involves scanning all the data:
  • Aggregation, e.g., #{x,y: R(x,y) ∧ R(y,x)}
  • Quantifier alternation, chains of literals, etc.
• A materialized view is a cached table holding the result of a query
• Incremental view maintenance updates the materialized view whenever a tuple changes, without recomputing it from scratch
  • E.g., if R(A,B) is set to true, check R(B,A) and add 1
• So the query can be re-evaluated much faster after each MC step
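The "check R(B,A) and add 1" rule can be sketched as a small in-memory view. This is an illustration under stated assumptions, not a database implementation: the class `SymmetricPairCount` and its methods are hypothetical names, and the view counts unordered pairs {x, y} (with a self-loop R(A,A) counted once) so that each update changes the count by exactly 1, as on the slide.

```python
class SymmetricPairCount:
    """Materialized view of the number of unordered pairs with R(x,y) and R(y,x)."""

    def __init__(self):
        self.tuples = set()  # the relation R in the current possible world
        self.count = 0       # cached query answer

    def set_true(self, a, b):
        """Insert R(a,b) and update the cached count with one set lookup."""
        if (a, b) in self.tuples:
            return
        self.tuples.add((a, b))
        if (b, a) in self.tuples:  # also true for a == b, since (a, a) was just added
            self.count += 1

    def set_false(self, a, b):
        """Delete R(a,b) and update the cached count with one set lookup."""
        if (a, b) not in self.tuples:
            return
        if (b, a) in self.tuples:
            self.count -= 1
        self.tuples.remove((a, b))

    def recompute(self):
        """Full scan of R; the expensive path the view is meant to avoid."""
        return sum(1 for (x, y) in self.tuples if x <= y and (y, x) in self.tuples)
```

A short usage example: after `view.set_true("A", "B")` the count stays 0, and after `view.set_true("B", "A")` it becomes 1, without ever scanning the relation. `recompute` is included only as a check that the incremental answer matches a full evaluation.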

Drawbacks of black-box DB technology
• Modifying tuples in a disk-resident DB is expensive
  • DB technology designed mostly for atomic transactions; 500/second on a $10K system
• Difficult to add new types of optimization, e.g., maintaining efficient summaries (min, etc.)
• Not suitable for some data types, e.g., images
• A "database" sounds like a "possible world", but only under Herbrand semantics

Experiments - NER
• Skip-chain CRF includes links between labels for identical tokens (but not across docs!!)

Experiments - NER
• Proposal distribution:
  • Choose up to five documents at random
  • Choose one label variable at random among these
  • Choose a label at random
• Data: 1788 NYT articles
• Query: number of B-PER labels (evaluated every 10k MC steps)
• 17650 ± 50
• Essentially each B-PER decision is independent; too many parameters, too little context, no parameter uncertainty!
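The three-step proposal listed above might look like the following sketch. Several details are assumptions, since the slide does not specify them: "up to five" is read as min(5, number of documents), the label variable is drawn uniformly over all tokens of the chosen documents, and the label set `LABELS` is hypothetical apart from B-PER. The accept/reject step is omitted.

```python
import random

# Hypothetical label set; only B-PER is named on the slide.
LABELS = ["O", "B-PER", "I-PER"]

def propose(docs, labels, rng, max_docs=5):
    """Return (doc index, token index, new label) for one proposed change.

    docs is a list of documents, each a non-empty list of current labels.
    """
    # Step 1: choose up to max_docs distinct documents at random.
    chosen = rng.sample(range(len(docs)), k=min(max_docs, len(docs)))
    # Step 2: choose one label variable uniformly among the chosen documents.
    positions = [(d, t) for d in chosen for t in range(len(docs[d]))]
    d, t = rng.choice(positions)
    # Step 3: choose the new label uniformly at random.
    return d, t, rng.choice(labels)
```

Under these assumptions each proposal touches a single label variable, so the B-PER count can change by at most one per MC step.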

Summary
• A serious attempt to create scalable, nontrivial probability models and inference technology for IE
• Experiments unconvincing, both for raw efficiency and reasonableness
• Not clear if FACTORIE is "elegantly" usable to create very complex models
• Some continuing work…
