Fusing Data with Correlations Ravali Pochampally, Anish Das Sarma, Luna Dong, Alexandra Meliou, Divesh Srivastava AT&T Research.


1 Fusing Data with Correlations Ravali Pochampally, Anish Das Sarma, Luna Dong, Alexandra Meliou, Divesh Srivastava AT&T Research

2 Fusion in web data extraction
[Figure: extractor]
Applications: building knowledge bases, answering questions, facilitating data mining
How can we purge wrong triples from the dataset?

3 The data fusion problem
[Table: knowledge triples against sources S1 to S4, with ✓ for each triple a source provides and ✗ on four triples; annotations: correlated, anti-correlated, good, bad]
Contribution: Fusion techniques that consider source quality and correlations

4 This talk
o Source quality: PrecRec (consider source quality)
o Correlations: PrecRecCorr (consider correlations), approximations
o Evaluation
o Future directions: extractor diagnosis
2 techniques: PrecRec and PrecRecCorr

5 High-level intuition
Researcher affiliation (values listed by sources S1, S2, S3): Jagadish: UMich, ATT, UMich; Dewitt: MSR, UWisc; Bernstein: MSR; Carey: UCI, ATT, BEA; Franklin: UCB, UMD

6 High-level intuition
Researcher affiliation (values listed by sources S1, S2, S3): Jagadish: UMich, ATT, UMich; Dewitt: MSR, UWisc; Bernstein: MSR; Carey: UCI, ATT, BEA; Franklin: UCB, UMD
Voting: Trust the majority

7 High-level intuition
Researcher affiliation (values listed by sources S1, S2, S3): Jagadish: UMich, ATT, UMich; Dewitt: MSR, UWisc; Bernstein: MSR; Carey: UCI, ATT, BEA; Franklin: UCB, UMD
Quality-based: More votes to accurate sources
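
The contrast between plain voting and quality-based voting can be illustrated with a minimal Python sketch. The claims for Jagadish and Carey are read from the researcher-affiliation table; the per-source weights are hypothetical values chosen only for illustration, and the weighted vote is a generic weighted-majority scheme, not necessarily the scoring used in the talk.

```python
from collections import defaultdict


def vote(claims, weights=None):
    """Return the set of values with the highest (weighted) vote total.

    claims  -- dict mapping source name -> claimed value
    weights -- dict mapping source name -> vote weight (1 per source if None)
    """
    totals = defaultdict(float)
    for source, value in claims.items():
        totals[value] += 1.0 if weights is None else weights[source]
    best = max(totals.values())
    # Return every value that reaches the top total, so ties stay visible.
    return {value for value, total in totals.items() if total == best}


# Claims as listed in the researcher-affiliation table.
jagadish = {"S1": "UMich", "S2": "ATT", "S3": "UMich"}
carey = {"S1": "UCI", "S2": "ATT", "S3": "BEA"}

# Hypothetical accuracy-based weights (an assumption, not from the slides).
weights = {"S1": 0.9, "S2": 0.5, "S3": 0.6}

majority_jagadish = vote(jagadish)      # plain voting: two sources agree
majority_carey = vote(carey)            # plain voting: three-way tie
weighted_carey = vote(carey, weights)   # the tie is broken by source weight
```

With equal votes the three claims for Carey tie, so majority voting cannot decide; once sources carry different weights, the claim from the most trusted source wins.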

8 Source quality in extraction
Actors/actresses in “Harry Potter” films, as extracted by S1, S2, S3: Daniel Radcliffe ✓✓; Emma Watson ✓✓✓; J. K. Rowling ✓; Daniel Craig ✓; Rupert Grint ✓✓ (two entries marked ✗)
Source labels: high recall, high precision, med prec/rec
Considering source quality:
-- More likely to be correct if extracted by high-precision source.
-- More likely to be wrong if not extracted by high-recall source.

9 Source quality metrics
o Recall r_i: probability to return a true triple
o False positive rate q_i: probability to return a false triple
A source is good if r_i > q_i
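
As a rough illustration of these two metrics, the sketch below estimates recall and false positive rate for one source from a small labeled set of triples. The triple identifiers, the labeled data, and the function name `recall_and_fpr` are all hypothetical; the slides do not say how the metrics are estimated in practice.

```python
def recall_and_fpr(provided, true_triples, all_triples):
    """Estimate a source's recall r and false positive rate q.

    provided     -- set of triples the source returns
    true_triples -- set of triples known to be true
    all_triples  -- set of all candidate triples (true and false)
    """
    false_triples = all_triples - true_triples
    r = len(provided & true_triples) / len(true_triples)    # share of true triples returned
    q = len(provided & false_triples) / len(false_triples)  # share of false triples returned
    return r, q


# Hypothetical labeled data: four true triples, two false ones.
all_triples = {"t1", "t2", "t3", "t4", "t5", "t6"}
true_triples = {"t1", "t2", "t3", "t4"}
s1 = {"t1", "t2", "t3", "t5"}  # returns three true triples and one false triple

r1, q1 = recall_and_fpr(s1, true_triples, all_triples)
is_good = r1 > q1  # the slide's criterion for a good source
```

Here the source returns 3 of 4 true triples and 1 of 2 false ones, so r = 0.75 exceeds q = 0.5 and the source counts as good.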

10 Accounting for quality
o Compute score for each triple:
-- If a source extracts it, multiply by [factor not preserved]: good source, higher score; bad source, lower score
-- If a source does not extract it, multiply by [factor not preserved]: good source, lower score; bad source, higher score
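
The scoring idea can be sketched in Python under stated assumptions. The multipliers did not survive in the transcript, so this sketch assumes r_i / q_i for a source that extracts the triple and (1 - r_i) / (1 - q_i) for one that does not, which matches the described behaviour (a good source with r_i > q_i raises the score when it extracts and lowers it when it does not). The prior, the conversion to a probability, and the quality values are also assumptions, and sources are treated as independent.

```python
def triple_probability(provides, recall, fpr, prior=0.5):
    """Score one triple from source quality, assuming independent sources.

    provides -- set of sources that extract the triple
    recall   -- dict source -> r_i, with 0 < r_i < 1
    fpr      -- dict source -> q_i, with 0 < q_i < 1
    prior    -- assumed prior probability that a triple is true
    """
    score = 1.0
    for source, r in recall.items():
        q = fpr[source]
        if source in provides:
            score *= r / q              # assumed multiplier when extracted
        else:
            score *= (1 - r) / (1 - q)  # assumed multiplier when not extracted
    odds = prior / (1 - prior) * score
    return odds / (1 + odds)


# Hypothetical quality values: S1 and S2 are good (r > q), S3 is bad (r < q).
recall = {"S1": 0.8, "S2": 0.6, "S3": 0.2}
fpr = {"S1": 0.2, "S2": 0.3, "S3": 0.4}

p_two_good = triple_probability({"S1", "S2"}, recall, fpr)
p_only_bad = triple_probability({"S3"}, recall, fpr)
p_all_three = triple_probability({"S1", "S2", "S3"}, recall, fpr)
```

A triple extracted only by the two good sources scores higher than one extracted by all three, because extraction by the bad source counts against it.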

11 Correlation scenarios
Triple provided by good sources with recall r and FPR q:
o Copying: [formula not preserved]
o Overlapping
-- On true triples: [formula not preserved]
-- On false triples: [formula not preserved]
o Complementary sources: [formula not preserved]
Correlations capture richer information than copying relationships

12 Correlation in web extraction
Significant negative correlation [Dong et al. PVLDB 2014]

13 Considering correlations
o Positive correlation: [formula not preserved; annotated term: joint recall]
o Negative correlation: [formula not preserved]
Exact solution: We can express these probabilities using an exponential number of correlation parameters
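
A small Python sketch can make the notion of joint recall concrete. The slide's formulas are not preserved, so the sketch assumes the usual reading: two sources are positively correlated on true triples when their joint recall exceeds the product of their individual recalls, and negatively correlated when it falls below. The source outputs and the helper `joint_recall` are hypothetical.

```python
from itertools import combinations


def joint_recall(outputs, true_triples):
    """Fraction of true triples provided by every source in `outputs`."""
    common = set(true_triples)
    for provided in outputs:
        common &= provided
    return len(common) / len(true_triples)


# Hypothetical outputs over four true triples.
true_triples = {"t1", "t2", "t3", "t4"}
outputs = {
    "S1": {"t1", "t2"},
    "S2": {"t1", "t2"},  # overlaps fully with S1
    "S3": {"t3", "t4"},  # complementary to S1
}

# Individual recalls (each source alone).
r = {name: joint_recall([out], true_triples) for name, out in outputs.items()}

r12 = joint_recall([outputs["S1"], outputs["S2"]], true_triples)
r13 = joint_recall([outputs["S1"], outputs["S3"]], true_triples)

positively_correlated = r12 > r["S1"] * r["S2"]  # 0.5 > 0.25
negatively_correlated = r13 < r["S1"] * r["S3"]  # 0.0 < 0.25

# One joint-recall parameter per non-empty subset of sources: 2^n - 1 in total.
n_subsets = sum(1 for k in range(1, len(outputs) + 1)
                for _ in combinations(outputs, k))
```

With three sources there are already 7 non-empty subsets, each with its own joint recall, which is why the exact solution needs a number of parameters exponential in the number of sources.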

14 Aggressive approximation
o Partial independence assumptions: linear number of parameters (correlation between S_i and the other sources)
But: low accuracy

15 Approximation levels
o Exact solution: no independence assumptions; high accuracy, exponential size
o Aggressive approximation: partial independence assumptions; low accuracy, linear size
o Elastic approximation: trade efficiency for accuracy; add parameters for a closer approximation

16 Elastic approximation
[Figure: iterations of the elastic approximation, 3 steps]
3 iterations achieve near-optimal accuracy

17 Comparisons
o Our techniques: PrecRec & PrecRecCorr
o Union-K
-- A triple is correct if at least K% of sources provide it
o 3-Estimate [Galland et al. WSDM 2010]
-- Iteratively computes trustworthiness
o LTM [Zhao et al. PVLDB 2012]
-- Uses graphical models and Gibbs sampling
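
The Union-K baseline is simple enough to sketch directly from its one-line description. The example data and the function name `union_k` are hypothetical.

```python
def union_k(triple_sources, num_sources, k_percent):
    """Accept a triple if at least k_percent of the sources provide it.

    triple_sources -- dict mapping triple -> set of sources that provide it
    num_sources    -- total number of sources
    k_percent      -- threshold K, as a percentage
    """
    accepted = set()
    for triple, sources in triple_sources.items():
        # Integer comparison, equivalent to len(sources) / num_sources >= K / 100.
        if 100 * len(sources) >= k_percent * num_sources:
            accepted.add(triple)
    return accepted


# Hypothetical data with four sources.
triple_sources = {
    "t1": {"S1", "S2", "S3", "S4"},
    "t2": {"S1", "S2"},
    "t3": {"S4"},
}

union_25 = union_k(triple_sources, 4, 25)  # one source is enough
union_50 = union_k(triple_sources, 4, 50)  # needs two sources
union_75 = union_k(triple_sources, 4, 75)  # needs three sources
```

Raising K trades recall for precision: fewer triples pass the threshold, but each accepted triple has more support. Every source counts equally, so source quality and correlations are ignored.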

18 Three real-world datasets
o Restaurant: [Marian et al. DE Bull, 2011]
-- 7 sources
-- 93 triples
o Book: [Dong et al. PVLDB, 2009]
-- 879 sources
-- 225 triples
o ReVerb: [Fader et al. EMNLP, 2011]
-- 6 extractors
-- 2407 triples

19 Restaurant [figure]

20 Book [figure]

21 ReVerb [figure]

22 Synthetic data: low precision [figure]

23 Synthetic data: high precision [figure]

24 Synthetic data: low recall [figure]

25 Synthetic data: correlations [figure]

26 Error diagnosis [figure]

27 Contributions
o Fusion techniques that consider source quality and correlations
o The number of correlation parameters grows exponentially, but we provide a scalable solution
o Evaluation on real-world and synthetic data shows that our techniques are more effective than the state of the art

28 The data fusion problem
Naïve approach: Simple majority voting achieves relatively low precision and recall
[Table: knowledge triples against sources S1 to S5, with ✓ for each triple a source provides and ✗ on four triples]

29 Extracting web data
[Figure: extractor]
Different extractors can extract different data from the same document

30 Semantics
o Triple independence
-- Whether a source provides triple t_1 is independent of whether it provides t_2.
o Open-world
-- If a triple is not provided by a source, it is considered unknown, rather than false.

31 Independence assumption
Assumes source independence! Do we need to worry about correlations?

32 Experimental evaluation
o Effectiveness:
-- Comparison with state-of-the-art techniques on real-world data
o Efficiency:
-- Evaluation of the approximation algorithms
o Pushing the limits with synthetic data

33 Execution time [figure]

