Characterizing the Uncertainty of Web Data: Models and Experiences Lorenzo Blanco, Valter Crescenzi, Paolo Merialdo, Paolo Papotti Università degli Studi Roma Tre Dipartimento di Informatica ed Automazione {blanco,crescenz,merialdo,papotti}@dia.uniroma3.it

Outline
− Introduction and goals
− Probabilistic models to evaluate the accuracy of web data sources
− Experiencing the models on real-life web data
− Lessons learned

The Web as a Source of Information
Opportunities
− a huge amount of information is publicly available
− valuable data repositories can be built by aggregating information spread over many sources
− abundance of redundancy for data of many domains
[Blanco et al. WebDb2010, demo@WWW2011]
[Figure: sample table of stock quotes with columns Sy, La, Min, Max, Vol, Open and rows for Ibm, Cisc and Appl]

Limitations
− sources are inaccurate, uncertain and unreliable
− some sources reproduce the contents published by others
Data conflicts
− e.g. HRBN max price: 20.64, 20.49 or 20.88?

Popularity-based rankings
− Several ranking methods exist for web sources, e.g. Google PageRank, Alexa Traffic Rank
− They are mainly based on the popularity of the sources
− Several factors can compromise the quality of data even when extracted from authoritative sources:
  − errors in the editorial process
  − errors in the publishing process
  − errors in the data extraction process

Problem Definition
A set of sources (possibly with copiers) provides values of several attributes for a common set of objects.
We want to compute automatically
− a score of accuracy for each web source: score(w1)? ... score(w4)?
− the probability distribution for each value
[Figure: sample tables published by sources w1, w2, w3 and w4 (a copier), with errors in bold]

State-of-the-art
Probabilistic models to evaluate the accuracy of web data sources (i.e., algorithms to reconcile data from inaccurate sources)
− NAIVE (voting)
− ACCU [Yin et al, TKDE08; Wu&Marian, WebDb07; Galland et al, WSDM10]
− DEP [Dong et al, PVLDB09]
− M-DEP [Blanco et al, Caise10; Dong et al, PVLDB10]

Goals
The goal of our work is twofold:
− illustrate the state-of-the-art models
− compare the results of these models on the same real-world datasets

NAIVE
− Independent sources
− Consider a single attribute at a time
− Count the votes for each possible value
[Figure: sources vs. truth, with one case where voting works and one where it does not; in the example, 381 gets 2 votes and 380 gets 1 vote]
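The voting step can be sketched in a few lines of Python. This is only an illustration: the source names `w1`..`w3` and the assignment of values to sources are assumptions chosen to reproduce the vote counts on the slide (381 gets 2 votes, 380 gets 1 vote).

```python
from collections import Counter


def naive_vote(observations):
    """Count one vote per source for a single attribute of a single object.

    observations maps a (hypothetical) source name to the value it reports.
    Returns the most voted value and the vote count of every value.
    """
    votes = Counter(observations.values())
    winner, _ = votes.most_common(1)[0]
    return winner, dict(votes)


# Hypothetical observations reproducing the vote counts on the slide.
observations = {"w1": 381, "w2": 381, "w3": 380}
winner, votes = naive_vote(observations)
print(winner, votes)
```

Every source counts the same here, which is exactly the limitation discussed on the next slide.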

Limitations of the NAIVE Model
− Real sources can exhibit different accuracies
− Every source is considered equivalent, regardless of its authority and accuracy
− More accurate sources should weigh more than inaccurate sources

ACCU: a Model Considering the Accuracy of the Sources
− The vote of a source is weighted according to its accuracy with respect to that attribute
− Main intuition: sources are unlikely to agree on errors!
− Consensus on (many) true values allows the algorithm to compute accuracy
− Truth Discovery (consensus) and Source Accuracy Discovery depend on each other
[Figure: sources, truth and result, with source accuracies 3/3 and 1/3]
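A minimal sketch of an accuracy-weighted vote, assuming the weight of a vote is simply the accuracy of the source (the published ACCU models derive the weight from a Bayesian analysis, so this is a simplification). Source names, accuracies and reported values below are hypothetical.

```python
from collections import defaultdict
from fractions import Fraction


def weighted_vote(observations, accuracy):
    """Sum, for every reported value, the accuracies of the sources reporting it.

    Simplifying assumption: the weight of a vote equals the source accuracy.
    """
    scores = defaultdict(Fraction)  # missing values start at Fraction(0)
    for source, value in observations.items():
        scores[value] += accuracy[source]
    return dict(scores)


# Hypothetical data: one accurate source against two inaccurate ones.
observations = {"w1": 380, "w2": 381, "w3": 381}
accuracy = {"w1": Fraction(3, 3), "w2": Fraction(1, 3), "w3": Fraction(1, 3)}

scores = weighted_vote(observations, accuracy)
best = max(scores, key=scores.get)
print(best, scores)  # 380 scores 1, 381 scores 2/3
```

With these numbers the accurate source outweighs the majority, which plain voting would not allow.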

Limitations of the ACCU Model
− Misleading majorities might be formed by copiers
− Both values (380 and 381) get 3/3 as weighted vote
− Copiers have to be detected to neutralize the “copied” portion of their votes
[Figure: truth and result for independent sources and a copier, with accuracies 3/3, 2/3 and 1/3]

A Generative Model of Copiers
[Figure: Truth, Source 1, Source 2, Source 3 and a copier, whose objects are split into independently produced objects and copied objects; errors are labeled e, e1 and e2]

DEP: A Model to Consider Source Dependencies
− A source is copying 2/3 of its tuples: only the remaining “portion” of independent opinion counts for its vote
− 380 gets 3/3 as independent weighted vote
− 381 gets 2/3 x 3/3 + 1/3 x 1/3 = 7/9 as independent weighted vote
− Main intuition: copiers can be detected as they propagate false values (i.e., errors)
[Figure: truth and result for independent sources and a copier, with accuracies 3/3, 2/3 and 1/3]
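The arithmetic of the independent weighted vote can be checked with a short sketch. The mapping of accuracies to sources is an assumption read off the slide's numbers: an independent source with accuracy 3/3 reports 380, an independent source with accuracy 2/3 reports 381, and the copier (accuracy 1/3, independent portion 1/3) also reports 381. The source names are hypothetical.

```python
from fractions import Fraction


def independent_weighted_vote(observations, accuracy, independent_portion):
    """Weight every vote by accuracy times the independent portion of the source."""
    scores = {}
    for source, value in observations.items():
        contribution = accuracy[source] * independent_portion[source]
        scores[value] = scores.get(value, Fraction(0)) + contribution
    return scores


# Assumed reading of the slide example (names are hypothetical).
observations = {"independent_1": 380, "independent_2": 381, "copier": 381}
accuracy = {
    "independent_1": Fraction(3, 3),
    "independent_2": Fraction(2, 3),
    "copier": Fraction(1, 3),
}
independent_portion = {
    "independent_1": Fraction(3, 3),
    "independent_2": Fraction(3, 3),
    "copier": Fraction(1, 3),  # the copier copies 2/3 of its tuples
}

scores = independent_weighted_vote(observations, accuracy, independent_portion)
print(scores)  # 380 scores 1, 381 scores 7/9
```

Under this reading the tie produced by the ACCU-style vote is broken in favor of 380.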

Contextual Analysis of Truth, Accuracies, and Dependencies
− Truth Discovery
− Source Accuracy Discovery
− Dependence Detection

M-DEP: Improved Evidence from MULTIATT Analysis
− An analysis based only on the Volume would fail in this example: it would recognize w2 as a copier of w1, but it would not detect w4 as a copier of w3
− Actually, w1 and w2 are independent sources sharing a common format for volumes
[Figure: truth and tables of sources w1, w2, w3 and w4 (copier), analyzed with MULTIATT(3), with errors in bold]

Experiments with Web Data
− Soccer players. Truth: hand crafted from official pages. Stats: 976 objects and 510 symbols (on average)
− Videogames. Truth: www.esrb.com. Stats: 227 objects and 40 symbols (on average)
− NASDAQ Stock Quotes. Truth: www.nasdaq.com. Stats: 819 objects and 2902 symbols (on average)

Sample Accuracies of the Sources
− Sampled accuracy: the number of true values correctly reported over the number of objects
− The Pearson correlation coefficient shows that quality of data and popularity do not overlap
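Both measures are easy to compute; the following is a minimal sketch. It assumes that "number of objects" means the objects in the ground-truth sample, and all data, names and the popularity scores are hypothetical placeholders, not the figures from the experiments.

```python
from math import sqrt


def sampled_accuracy(reported, truth):
    """True values correctly reported over the number of objects in the truth sample."""
    correct = sum(1 for obj, value in truth.items() if reported.get(obj) == value)
    return correct / len(truth)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical truth sample and the values reported by three sources.
truth = {"o1": 10, "o2": 20, "o3": 30}
sources = {
    "w1": {"o1": 10, "o2": 20, "o3": 30},
    "w2": {"o1": 10, "o2": 21, "o3": 30},
    "w3": {"o1": 11, "o2": 21, "o3": 30},
}
popularity = {"w1": 0.2, "w2": 0.9, "w3": 0.5}  # hypothetical popularity scores

accuracies = [sampled_accuracy(sources[w], truth) for w in sources]
print(accuracies, pearson(accuracies, [popularity[w] for w in sources]))
```

A coefficient close to zero (or negative) on real data would support the claim that popular sources are not necessarily the accurate ones.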

Experiments with Models
− Probability Concentration measures the performance in computing probability distributions for the observed objects
− Low scores for Soccer: no authority on the Web
− Differences in VideoGames: number of distinct symbols (5 vs 75)
− High SA scores in Finance for every model: large number of distinct symbols

Global Execution Times

Attributes Execution Times

Lessons Learned
Three dimensions to decide which technique to use:
− Characteristics of the domain
  − domains where authoritative sources exist are much easier to handle
  − a large number of distinct symbols helps a lot too
− Requirements on the results
  − on average, more complex models return better results, especially for Probability Concentration
− Execution times
  − they depend on the number of objects and the number of distinct symbols; NAIVE always scales well

Thanks!

Bayesian Analysis (1)
− A random variable X models the possible values of the observed objects
− o is the observation of the value provided by a source for a single object
− Accuracy represents the event X = x_t, where x_t is the true value
− Observations for k objects
Goal:

Bayesian Analysis (2) Goal: According to the Bayes Rule, we need to know Main idea: computing based on a generative model of the sources
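The formulas of this slide did not survive extraction. As a reminder only, Bayes' rule for this kind of setting can be written as follows; the symbol \bar{o} for the set of observations and the summation variable x' are assumed notation, not necessarily the authors'.

```latex
% Assumed notation: x ranges over the possible values of X,
% \bar{o} denotes the set of available observations.
P(X = x \mid \bar{o}) =
  \frac{P(\bar{o} \mid X = x)\, P(X = x)}
       {\sum_{x'} P(\bar{o} \mid X = x')\, P(X = x')}
```

The likelihood term P(\bar{o} | X = x) is the quantity a generative model of the sources is meant to supply.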

A Simple Probabilistic Model for ACCU
− Independence assumptions: sources, values, attributes
− Uniform distribution assumption
− The accuracy of a source wrt an attribute is the average of the probabilities associated with the values that it provides

How to Compute the Independent Weighted Vote
Thanks to the independent copying assumption, the vote is computed from the probability that the source w is a copier of the source w'
− c: prior probability of copying a tuple
− 1-c: prior probability of providing a tuple independently
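The formula itself is missing from the transcript. One common way to turn these ingredients into an independent portion of a vote, assumed here and not confirmed by the slide, is to multiply the factor 1 - c * P(w copies w') over the other sources w'. The function and parameter names are hypothetical.

```python
from math import prod


def independent_portion(copy_probabilities, c):
    """Portion of a source's vote that counts as independent opinion.

    copy_probabilities: P(w is a copier of w') for every other source w'.
    c: prior probability of copying a tuple.
    Assumed form: product over w' of (1 - c * P(w copies w')).
    """
    return prod(1 - c * p for p in copy_probabilities)


# A source certainly copying from one other source, with c = 2/3,
# keeps 1/3 of its vote, in line with the DEP slide example.
print(independent_portion([1.0], 2 / 3))
# A source with no suspected dependence keeps its full vote.
print(independent_portion([0.0, 0.0], 2 / 3))
```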

Bayesian Analysis of the Relationships between Sources
− Bayes Rule on all observations, i.e. all tuples provided by all the sources
− We need to consider all possible relationships between w1 and w2 in our model
− A partition of the space of events:
  − w1 and w2 are independent
  − w1 is a copier of w2
  − w2 is a copier of w1

Dealing with Many Attributes
− Let us consider only two attributes A and B and two sources w1 and w2
− O: set of objects for which both sources provide a value, split into objects with the same values and objects with different values
− e.g.: w1 and w2 provide the same values; the value of A is true, but the value of B is false

Independent Sources: Both Sources are Correct
− error rate: ε = 1 - Accuracy
− notation: error rate of source w1 wrt the attribute A
− Thanks to the independent attributes assumption:
  − probability that w1 provides a correct value for both attributes
  − probability that w2 provides a correct value for both attributes
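The probabilities named above follow directly from the error rates once attributes and sources are treated as independent. The superscript/subscript notation below is an assumption, since the slide's own symbols were lost.

```latex
% Assumed notation: \epsilon_i^{A} is the error rate of source w_i wrt attribute A.
\begin{align*}
P(w_1 \text{ correct on } A \text{ and } B) &= (1 - \epsilon_1^{A})(1 - \epsilon_1^{B}) \\
P(w_2 \text{ correct on } A \text{ and } B) &= (1 - \epsilon_2^{A})(1 - \epsilon_2^{B}) \\
P(\text{both sources correct on } A \text{ and } B) &=
  (1 - \epsilon_1^{A})(1 - \epsilon_1^{B})(1 - \epsilon_2^{A})(1 - \epsilon_2^{B})
\end{align*}
```

The last line additionally uses the independence of the two sources.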

Independent Sources: Other Cases possible false values of attribute A possible false values of attribute B Remaining cases:

Wrap-Up Thanks to the independent values assumption

Dependent Sources
Thanks to the independent attributes assumption:
− w1 is acting independently: probability that w1 and w2 both independently provide a correct tuple
− w1 is acting like a copier: probability that w1 is copying a true tuple produced by w2

How to Compute the Accuracy?
Bayesian Analysis of the mutual dependence between consensus and accuracy
Iterate two steps:
− Consensus Analysis: based on the agreement of the sources among their observations on individual objects and on the current accuracy of sources, compute the probability for the attributes of every object
− Accuracy Analysis: based on the current probability distributions of the observed object attributes, evaluate the accuracy of the sources
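The two-step iteration can be sketched as follows for a single attribute. This is a simplified illustration, not the authors' Bayesian computation: the consensus step just normalizes accuracy-weighted votes, while the accuracy step uses the average of the probabilities of the values a source provides. The function name, the initial accuracy, the fixed number of rounds and the sample data are all assumptions.

```python
def iterate_accuracy(observations, rounds=20, initial_accuracy=0.8):
    """Alternate consensus analysis and accuracy analysis.

    observations: {source: {object: value}}, every source reporting at least one value.
    Returns (accuracy per source, probability distribution per object).
    """
    accuracy = {source: initial_accuracy for source in observations}
    objects = {obj for values in observations.values() for obj in values}
    probabilities = {}
    for _ in range(rounds):
        # Consensus analysis: normalized accuracy-weighted votes (a simplification).
        probabilities = {}
        for obj in objects:
            scores = {}
            for source, values in observations.items():
                if obj in values:
                    value = values[obj]
                    scores[value] = scores.get(value, 0.0) + accuracy[source]
            total = sum(scores.values())
            probabilities[obj] = {v: s / total for v, s in scores.items()}
        # Accuracy analysis: average probability of the values each source provides.
        for source, values in observations.items():
            accuracy[source] = sum(
                probabilities[obj][value] for obj, value in values.items()
            ) / len(values)
    return accuracy, probabilities


# Hypothetical observations of three sources on three objects.
observations = {
    "w1": {"o1": "a", "o2": "b", "o3": "c"},
    "w2": {"o1": "a", "o2": "b", "o3": "x"},
    "w3": {"o1": "a", "o2": "y", "o3": "z"},
}
accuracy, probabilities = iterate_accuracy(observations)
print(accuracy)
print(probabilities["o2"])
```

A fixed number of rounds keeps the sketch short; a real implementation would more likely stop when the accuracies change by less than a threshold.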
