Sampling and Soundness: Can We Have Both? Carla Gomes, Bart Selman, Ashish Sabharwal Cornell University Jörg Hoffmann DERI Innsbruck …and I am: Frank van Harmelen
Nov 11, 2007, ISWC'07

Slide 2: Talk Roadmap
A Sampling Method with a Correctness Guarantee
Can we apply this to the Semantic Web?
Discussion
Slide 3: How Might One Count?
How many people are present in the hall?
Problem characteristics:
Space naturally divided into rows, columns, sections, …
Many seats empty
Uneven distribution of people (e.g. more near the door, aisles, front, etc.)
Slide 4: #1: Brute-Force Counting
Idea: go through every seat; if occupied, increment counter
Advantages: simplicity, accuracy
Drawback: scalability
Slide 5: #2: Branch-and-Bound (DPLL-style)
Idea: split the space into sections (e.g. front/back, left/right/center, …); use smart detection of full/empty sections; add up all partial counts
Advantages: relatively faster, exact
Drawback: still accounts for every single person present, so extremely fine granularity is needed; scalability
Framework used in DPLL-based systematic exact counters, e.g. Relsat [Bayardo et al. 00], Cachet [Sang et al. 04]
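The DPLL-style scheme above can be sketched as a tiny exact model counter. The integer literal encoding below (v for "variable v is true", -v for "v is false") is a common SAT convention, not something from the talk:

```python
# Tiny DPLL-style exact model counter (illustrative sketch only).
def count_models(clauses, variables):
    """Count assignments of `variables` satisfying every clause.
    A clause is a list of integer literals; -v negates variable v."""
    if any(len(c) == 0 for c in clauses):
        return 0                       # empty clause: branch unsatisfiable
    if not variables:
        return 1                       # all variables fixed, all clauses satisfied
    v, rest = variables[0], variables[1:]
    total = 0
    for lit in (v, -v):                # branch: v true, then v false
        reduced = [[l for l in c if l != -lit]
                   for c in clauses if lit not in c]
        total += count_models(reduced, rest)
    return total

# (x1 or x2) and (not x1 or x2): models are exactly those with x2 = true
print(count_models([[1, 2], [-1, 2]], [1, 2]))  # prints 2
```

As the slide notes, this visits essentially every solution, which is why exact counters struggle to scale.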
Slide 6: #3: Naïve Sampling Estimate
Idea: randomly select a region; count within this region; scale up appropriately
Advantage: quite fast
Drawbacks: robustness (can easily under- or over-estimate); scalability in sparse spaces: when solutions make up only a tiny fraction of the space, the sampled region must cover a huge part of it to hit any solutions at all
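A minimal sketch of the naïve estimator on the hall example; the hall size, occupancy pattern, and sample count below are all made up for illustration:

```python
import random

# Naive sampling estimate: sample seats uniformly, scale the frequency up.
def estimate_count(is_occupied, n_seats, n_samples, seed=0):
    rng = random.Random(seed)
    hits = sum(is_occupied(rng.randrange(n_seats)) for _ in range(n_samples))
    return hits / n_samples * n_seats   # scale sampled frequency to the hall

# Toy hall: 1000 seats, the first 100 occupied. The true count is 100;
# the estimate fluctuates around it, and for very sparse halls most
# samples miss entirely, making the estimate wildly unreliable.
print(estimate_count(lambda seat: seat < 100, 1000, 500))
```

The estimator is unbiased but has no correctness guarantee, which is exactly the drawback the next slide addresses.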
Slide 7: Sampling with a Guarantee
Idea: identify a balanced row split or column split (roughly equal number of people on each side), using local search samples for the estimate; pick one side at random; count on that side recursively; multiply the result by 2
This provably yields the true count on average, even when an unbalanced row/column is accidentally picked for the split, e.g. even when samples are biased or insufficiently many
Surprisingly good in practice, using local search as the sampler
Slide 8: Algorithm SampleCount [Gomes-Hoffmann-Sabharwal-Selman IJCAI07]
Input: Boolean formula F
1. Set numFixed = 0, slack = some constant (e.g. 2, 4, 7, …)
2. Repeat until F becomes feasible for exact counting:
   a. Obtain s solution samples for F
   b. Identify the most balanced variable and variable pair
      [x is balanced: s/2 samples have x = 0, s/2 have x = 1;
       (x, y) is balanced: s/2 samples have x = y, s/2 have x = ¬y]
   c. If x is more balanced than (x, y), randomly set x to 0 or 1;
      else randomly replace x with y or ¬y; simplify F
   d. Increment numFixed
Output: model count ≥ 2^(numFixed − slack) · exactCount(simplified F), with confidence (1 − 2^(−slack))
Note: showing one trial
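The loop above can be caricatured in a self-contained way. This sketch simplifies to single-variable splits only (the (x, y) pair splits are omitted), and exhaustive enumeration stands in for both the local-search sampler and the base exact counter, which a real implementation obviously cannot afford:

```python
import random

def simplify(clauses, var, value):
    """Fix var to value (1 or 0) and simplify the clause list."""
    lit = var if value else -var
    return [[l for l in c if l != -lit] for c in clauses if lit not in c]

def models(clauses, variables):
    """Enumerate all satisfying assignments as dicts var -> 0/1."""
    if any(len(c) == 0 for c in clauses):
        return []
    if not variables:
        return [{}]
    v, rest = variables[0], variables[1:]
    return [{v: val, **m}
            for val in (1, 0)
            for m in models(simplify(clauses, v, val), rest)]

def sample_count_trial(clauses, variables, slack=2, max_exact=2, seed=0):
    """One SampleCount-style trial (toy version, single-variable splits)."""
    rng = random.Random(seed)
    num_fixed = 0
    while len(variables) > max_exact:
        samples = models(clauses, variables)       # stand-in for the sampler
        if not samples:
            return 0
        # most balanced variable: sampled frequency of 1 closest to 1/2
        x = min(variables, key=lambda v: abs(
            sum(s[v] for s in samples) / len(samples) - 0.5))
        clauses = simplify(clauses, x, rng.choice([0, 1]))
        variables = [v for v in variables if v != x]
        num_fixed += 1
    # lower bound, correct with probability >= 1 - 2**(-slack) per trial
    return 2 ** (num_fixed - slack) * len(models(clauses, variables))

# With a perfect sampler and perfectly balanced splits, slack = 0
# recovers the exact count (16 models over 4 unconstrained variables):
print(sample_count_trial([], [1, 2, 3, 4], slack=0))  # prints 16
```

With slack > 0 the output is deliberately shrunk, turning the unbiased estimate into a high-confidence lower bound.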
Slide 9: Correctness Guarantee
Theorem: SampleCount with t trials gives a correct lower bound with probability ≥ (1 − 2^(−slack·t)); e.g. slack = 2, t = 4 gives ≥ 99% correctness confidence
Key properties: holds irrespective of the quality of the local search estimates
No free lunch: bad estimates mean high variance of the trial outcome, so min(trials) is a high-confidence bound but not a tight one
Confidence grows exponentially with slack and t
Ideas used in the proof: the expected model count equals the true count (for each trial); use Markov's inequality Pr[X > k·E[X]] < 1/k to bound the error probability (X is the outcome of one trial)
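The confidence figure follows directly: each trial overestimates the true count with probability below 2^(−slack) (by Markov's inequality), so the minimum over t independent trials fails only if every trial fails:

```python
# Confidence that min over t independent trials is a correct lower bound.
def confidence(slack, t):
    return 1 - 2 ** (-slack * t)

print(confidence(2, 4))  # 0.99609375, i.e. the ">= 99%" on the slide
```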
Slide 10: Circuit Synthesis, Random CNFs
[Results table, garbled in this transcript. Columns: Instance, True Count, SampleCount (99% conf.), Relsat (exact), Cachet (exact). Instances include 2bitmax_6, 3bitadd circuits, and random wff formulas, with runtimes ranging from seconds and minutes to many hours.]
Slide 12: Talk Roadmap
A Sampling Method with a Correctness Guarantee
Can we apply this to the Semantic Web? [Highly speculative]
Discussion
Slide 13: Counting in the Semantic Web…
…should certainly be possible with this method
Example: given an RDF database D, count how many triples comply with a query q
Throw a constraint cutting the set of all triples in half; if the remainder is feasible for exact counting, count the n remaining triples exactly and return n · 2^(#constraints − slack); else, iterate
Merely technical challenges: What are constraints cutting the set of all triples in half? How to throw a constraint? When to stop throwing constraints? How to efficiently count the remaining triples?
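A speculative sketch of that loop. What a "constraint cutting the set of all triples in half" should look like for RDF is exactly the open question on this slide; below, a random parity test over a hash of the triple serves as a hypothetical stand-in:

```python
import random

def count_matching_triples(triples, query_matches, max_exact=1000,
                           slack=2, seed=0):
    """Hypothetical adaptation of SampleCount to triple counting."""
    rng = random.Random(seed)
    num_constraints = 0
    while len(triples) > max_exact:
        mask, parity = rng.getrandbits(32), rng.getrandbits(1)
        # keep roughly half the triples (hypothetical halving constraint)
        triples = [t for t in triples
                   if bin(hash(t) & mask).count("1") % 2 == parity]
        num_constraints += 1
    n = sum(1 for t in triples if query_matches(t))  # exact count of remainder
    # scale back up, as in SampleCount: a lower bound with high confidence
    return n * 2 ** (num_constraints - slack)
```

Here the stopping rule ("feasible for exact counting") is modeled crudely by a size threshold; all parameters are illustrative.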
Slide 14: What about Deduction?
Does φ follow from T?
Exploit the connection implication ↔ UNSAT, via upper bounds?
A similar theorem does NOT hold for upper bounds
In a nutshell: Markov's inequality Pr[X > k·E[X]] < 1/k has no symmetric counterpart bounding Pr[X < E[X]/k]
Slide 15: What about Deduction?
Does φ follow from T?
A much more distant adaptation: a constraint is now something that removes half of T!!
Throw some constraints, obtaining T′, and check whether T′ ⊨ φ
Confidence is problematic: can we draw any conclusions if NOT T′ ⊨ φ?
It may be that α1, α2 ∈ T with {α1, α2} ⊨ φ, but a constraint separated α1 from α2
It may be that all relevant α are thrown out
Are there interesting cases where we can bound the probability of these events??
Slide 16: Talk Roadmap
A Sampling Method with a Correctness Guarantee
Can we apply this to the Semantic Web? [Highly speculative]
Discussion
Slide 17: Discussion
In propositional CNF, one can efficiently obtain high-confidence lower bounds on the number of models by sampling
Application to the Semantic Web: adaptation to counting tasks should be possible
Adaptation to deduction, via upper bounds, is problematic
Promising: a heuristic method sacrificing the confidence guarantee
Alternative adaptation: weaken T instead of strengthening it, i.e. sample the knowledge base. Confidence guarantees??
Your feedback and thoughts are highly appreciated!!
Slide 18: What about Deduction?
Does φ follow from T?
Straightforward adaptation: there is a variant of this algorithm that computes high-confidence upper bounds instead
Throw large constraints, check if the formula is SAT; if SAT, there is no implication; if UNSAT in each of t iterations, we get confidence in an upper bound on #models
Many problems: Is the SAT check actually easier?? Large constraints are tough even in the propositional CNF context! (Large = involves half of the propositional variables; needed for the confidence guarantee)
An upper bound on #models is not confidence in UNSAT!
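The variant alluded to here can be caricatured with random XOR ("parity") constraints over roughly half the variables, in the spirit of XOR-streamlining approaches to model counting. The brute-force check below stands in for a real SAT solver; the whole setup is a sketch of the idea, not the actual algorithm:

```python
import itertools
import random

def sat_with_xors(clauses, variables, xors):
    """Brute force: is some assignment satisfying all clauses and XORs?
    Each XOR is a pair (vars, parity) demanding sum(vars) mod 2 == parity."""
    for bits in itertools.product((0, 1), repeat=len(variables)):
        a = dict(zip(variables, bits))
        if any(all(a[abs(l)] != (l > 0) for l in c) for c in clauses):
            continue                   # some clause falsified
        if all(sum(a[v] for v in vs) % 2 == p for vs, p in xors):
            return True
    return False

def unsat_under_random_xors(clauses, variables, k, trials, seed=0):
    """True if F plus k random XORs is UNSAT in every trial: evidence
    for an upper bound of roughly 2**k on the model count."""
    rng = random.Random(seed)
    for _ in range(trials):
        xors = []
        for _ in range(k):
            vs = [v for v in variables if rng.getrandbits(1)]
            xors.append((vs or [variables[0]], rng.getrandbits(1)))
        if sat_with_xors(clauses, variables, xors):
            return False               # still satisfiable: no evidence
    return True
```

This illustrates the slide's worry: the SAT checks now carry large parity constraints, which are hard for CNF solvers, and the resulting bound concerns #models, not unsatisfiability itself.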