Published by Rosaline Watkins. Modified about 1 year ago.

1
Fusing Data with Correlations
Ravali Pochampally, Anish Das Sarma, Luna Dong, Alexandra Meliou, Divesh Srivastava
AT&T Research

2
Fusion in web data extraction
[Figure: extractor]
Applications: building knowledge bases, answering questions, facilitating data mining
How can we purge wrong triples from the dataset?

3
The data fusion problem
[Table: knowledge triples against sources S1, S2, S3, S4, with ✓ and ✗ marks; annotations on the slide: good, bad, correlated, anti-correlated]
Contribution: fusion techniques that consider source quality and correlations

4
This talk
o 2 techniques: PrecRec (consider source quality) and PrecRecCorr (consider correlations), with approximations
o Evaluation
o Future directions: extractor diagnosis

5
High-level intuition
Researcher affiliation, as reported by sources S1, S2, S3:
Jagadish: UMich, ATT, UMich
Dewitt: MSR, UWisc
Bernstein: MSR
Carey: UCI, ATT, BEA
Franklin: UCB, UMD
[Navigation bar: source quality | correlations | evaluation | future directions]

6
High-level intuition (researcher affiliation example, sources S1, S2, S3)
Voting: trust the majority
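
The voting rule on this slide can be sketched in a few lines. This is only an illustration: it uses the two rows of the affiliation table that survive in full (Jagadish and Carey, in source order S1, S2, S3), and it assumes "majority" means a strict majority of the sources, which the slide does not spell out. The function name `majority_vote` is hypothetical.

```python
from collections import Counter


def majority_vote(values):
    """Return the value claimed by a strict majority of sources, or None if there is none."""
    counts = Counter(values)
    value, votes = counts.most_common(1)[0]
    # Assumption: "majority" means more than half of the sources agree.
    return value if votes > len(values) / 2 else None


# Rows taken from the slide, values listed in source order S1, S2, S3.
claims = {
    "Jagadish": ["UMich", "ATT", "UMich"],
    "Carey": ["UCI", "ATT", "BEA"],
}

results = {name: majority_vote(values) for name, values in claims.items()}
print(results)
```

For Carey the three sources disagree, so plain voting has nothing to pick; this is the kind of case that motivates weighting sources by quality.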

7
High-level intuition (researcher affiliation example, sources S1, S2, S3)
Quality-based: more votes to accurate sources
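
A quality-based vote might look like the sketch below. The weights are hypothetical (the slide gives no accuracy numbers), and summing per-source weights is just one simple way to give "more votes to accurate sources"; it is not claimed to be the scheme used in the talk.

```python
from collections import defaultdict


def weighted_vote(claims, weights):
    """claims: {source: value}; weights: {source: vote weight}. Returns the heaviest value."""
    totals = defaultdict(float)
    for source, value in claims.items():
        totals[value] += weights[source]
    return max(totals, key=totals.get)


# Hypothetical accuracy weights, chosen only for illustration.
weights = {"S1": 0.9, "S2": 0.4, "S3": 0.5}

# Carey's row from the slide: every source reports a different affiliation.
carey = {"S1": "UCI", "S2": "ATT", "S3": "BEA"}
winner = weighted_vote(carey, weights)
print(winner)
```

With these assumed weights the three-way disagreement is resolved in favour of the source that is trusted most.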

8
Source quality in extraction
Actors/actresses in “Harry Potter” films, as extracted by sources S1, S2, S3: Daniel Radcliffe ✓✓, Emma Watson ✓✓✓, J. K. Rowling ✓, Daniel Craig ✓, Rupert Grint ✓✓; two entries are marked ✗
Source labels on the slide: high recall, high precision, med prec/rec
Considering source quality:
-- More likely to be correct if extracted by a high-precision source.
-- More likely to be wrong if not extracted by a high-recall source.

9
Source quality metrics
o Recall r_i: probability to return a true triple
o False positive rate q_i: probability to return a false triple
A source is good if r_i > q_i
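
To make the two metrics concrete, the sketch below estimates them for one source from a small labelled set of triples. The labelled data and the helper name `recall_and_fpr` are hypothetical; the estimate simply replaces each probability with an observed fraction and assumes the labelled set contains at least one true and one false triple.

```python
def recall_and_fpr(provided, truth):
    """provided: set of triples the source returns; truth: {triple: True/False}."""
    true_triples = {t for t, is_true in truth.items() if is_true}
    false_triples = set(truth) - true_triples
    # Recall: share of the true triples that the source returns.
    r = len(provided & true_triples) / len(true_triples)
    # False positive rate: share of the false triples that the source returns.
    q = len(provided & false_triples) / len(false_triples)
    return r, q


# Hypothetical labelled triples: four true, two false.
truth = {"t1": True, "t2": True, "t3": True, "t4": True, "t5": False, "t6": False}
provided = {"t1", "t2", "t3", "t5"}

r, q = recall_and_fpr(provided, truth)
is_good = r > q  # the slide's criterion for a good source
print(r, q, is_good)
```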

10
Accounting for quality
o Compute a score for each triple:
If a source extracts it, multiply by [factor not preserved]: good source, higher score; bad source, lower score
If a source does not extract it, multiply by [factor not preserved]: good source, lower score; bad source, higher score
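
The scoring rule can be sketched as follows. The multiplicative factors did not survive on the slide, so the ratios used here, r_i / q_i for a source that extracts the triple and (1 - r_i) / (1 - q_i) for one that does not, are an assumption: they are the likelihood-ratio factors one gets under source independence, and they agree with the slide's annotations (a good source, with r_i > q_i, raises the score when it extracts the triple and lowers it when it does not). The quality numbers, the prior, and the function names are hypothetical.

```python
def triple_score(extracted_by, quality):
    """Multiply one factor per source into the score of a single triple.

    extracted_by: set of source names that extract the triple.
    quality: {source: (recall r, false positive rate q)}, with 0 < q < 1.
    """
    score = 1.0
    for source, (r, q) in quality.items():
        if source in extracted_by:
            score *= r / q              # assumed factor for "extracts it"
        else:
            score *= (1 - r) / (1 - q)  # assumed factor for "does not extract it"
    return score


def probability(score, prior=0.5):
    """Convert the score into a probability with Bayes' rule; the prior is an assumption."""
    return prior * score / (prior * score + (1 - prior))


# Hypothetical qualities: S1 is good (r > q), S2 is neutral, S3 is bad (r < q).
quality = {"S1": (0.8, 0.2), "S2": (0.5, 0.5), "S3": (0.3, 0.6)}

only_good = triple_score({"S1"}, quality)
only_bad = triple_score({"S3"}, quality)
print(only_good, probability(only_good))
print(only_bad, probability(only_bad))
```

A triple extracted only by the good source ends up with a score above 1, while one extracted only by the bad source ends up below 1, matching the higher/lower pattern listed on the slide.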

11
Correlation scenarios (triple provided by good sources with recall r and FPR q)
o Copying: [formula not preserved]
o Overlapping, on true triples: [formula not preserved]; on false triples: [formula not preserved]
o Complementary sources: [formula not preserved]
Correlations capture richer information than copying relationships

12
Correlation in web extraction
Significant negative correlation [Dong et al. PVLDB 2014]

13
Considering correlations
o Positive correlation: [formula not preserved; stated in terms of joint recall]
o Negative correlation: [formula not preserved; stated in terms of joint recall]
Exact solution: we can express these probabilities using an exponential number of correlation parameters
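
One way to read "joint recall" is as the fraction of true triples that all sources in a set provide. The sketch below uses that reading and the standard notion of correlation: two sources are positively correlated when their joint recall exceeds the product of their individual recalls (the value expected under independence) and negatively correlated when it falls below. The slide's own formulas are not preserved, so this definition, the data, and the function names are assumptions.

```python
def joint_recall(provided_by, sources, true_triples):
    """Fraction of the true triples that every source in `sources` provides."""
    hits = [t for t in true_triples if all(t in provided_by[s] for s in sources)]
    return len(hits) / len(true_triples)


def correlation_sign(provided_by, s1, s2, true_triples):
    """Compare joint recall with the independence baseline r1 * r2."""
    r1 = joint_recall(provided_by, [s1], true_triples)
    r2 = joint_recall(provided_by, [s2], true_triples)
    r12 = joint_recall(provided_by, [s1, s2], true_triples)
    if r12 > r1 * r2:
        return "positive"
    if r12 < r1 * r2:
        return "negative"
    return "independent"


# Hypothetical data: four true triples and what each source provides.
true_triples = ["t1", "t2", "t3", "t4"]
provided_by = {
    "S1": {"t1", "t2"},
    "S2": {"t1", "t2"},  # same triples as S1
    "S3": {"t3", "t4"},  # complements S1
    "S4": {"t1", "t3"},  # overlaps S1 exactly as chance would predict
}

signs = {
    other: correlation_sign(provided_by, "S1", other, true_triples)
    for other in ("S2", "S3", "S4")
}
print(signs)
```

Extending this from pairs to every subset of sources is what makes the exact solution need an exponential number of parameters.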

14
Aggressive approximation
o Partial independence assumptions: linear number of parameters, each capturing the correlation between S_i and the other sources
But: low accuracy

15
Approximation levels
o Exact solution: no independence assumptions; high accuracy, exponential size
o Aggressive approximation: partial independence assumptions; low accuracy, linear size
o Elastic approximation: add parameters for a closer approximation; trade efficiency for accuracy

16
Elastic approximation
[Figure: iterations of the elastic approximation, 3 steps]
3 iterations achieve near-optimal accuracy

17
Comparisons
o Our techniques: PrecRec & PrecRecCorr
o Union-K: a triple is correct if at least K% of sources provide it
o 3-Estimate [Galland et al. WSDM 2010]: iteratively computes trustworthiness
o LTM [Zhao et al. PVLDB 2012]: uses graphical models and Gibbs sampling
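
Of the baselines listed, Union-K is simple enough to sketch directly from its one-line description. The data and the function name `union_k` are hypothetical; the rule itself (accept a triple when at least K% of the sources provide it) is taken from the slide.

```python
def union_k(provided_by, k_percent):
    """Return the triples provided by at least k_percent % of the sources."""
    n_sources = len(provided_by)
    counts = {}
    for triples in provided_by.values():
        for t in triples:
            counts[t] = counts.get(t, 0) + 1
    # Integer comparison avoids floating-point issues at the threshold.
    return {t for t, c in counts.items() if 100 * c >= k_percent * n_sources}


# Hypothetical sources and the triples each one provides.
provided_by = {
    "S1": {"a", "b", "c"},
    "S2": {"a", "b"},
    "S3": {"a", "d"},
}

accepted_25 = union_k(provided_by, 25)
accepted_50 = union_k(provided_by, 50)
accepted_75 = union_k(provided_by, 75)
print(accepted_25, accepted_50, accepted_75)
```

Raising K trades recall for precision: fewer triples pass, but each has more support. The rule treats all sources alike, which is the gap the quality-aware techniques aim to close.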

18
Three real-world datasets
o Restaurant [Marian et al. DE Bull, 2011]: 7 sources, 93 triples
o Book [Dong et al. PVLDB, 2009]: 879 sources, 225 triples
o ReVerb [Fader et al. EMNLP, 2011]: 6 extractors, 2407 triples

19
Restaurant

20
Book

21
ReVerb

22
Synthetic data: low precision

23
Synthetic data: high precision

24
Synthetic data: low recall

25
Synthetic data: correlations

26
Error diagnosis

27
Contributions
o Fusion techniques that consider source quality and correlations
o The number of correlation parameters grows exponentially, but we provide a scalable solution
o Evaluation on real-world and synthetic data shows that our techniques are more effective than the state of the art

28
The data fusion problem
Naïve approach: simple majority voting achieves relatively low precision and recall
[Table: knowledge triples against sources S1, S2, S3, S4, S5, with ✓ and ✗ marks]

29
Extracting web data
[Figure: extractor]
Different extractors can extract different data from the same document

30
Semantics
o Triple independence: whether a source provides triple t_1 is independent of whether it provides t_2.
o Open-world: if a triple is not provided by a source, it is considered unknown rather than false.

31
Independence assumption
Assumes source independence! Do we need to worry about correlations?

32
Experimental evaluation
o Effectiveness: comparison with state-of-the-art techniques on real-world data
o Efficiency: evaluation of the approximation algorithms
o Pushing the limits with synthetic data

33
Execution time
