
ESTEEM: Quality- and Privacy-Aware Data Integration Monica Scannapieco, Carola Aiello, Tiziana Catarci, Diego Milano Dipartimento di Informatica e Sistemistica.


1 ESTEEM: Quality- and Privacy-Aware Data Integration Monica Scannapieco, Carola Aiello, Tiziana Catarci, Diego Milano Dipartimento di Informatica e Sistemistica Università di Roma “La Sapienza”

2 Monica Scannapieco, ESTEEM Meeting, 7 May 2007, Ischia
Outline
• Privacy-aware integration
  – Privacy risk assessment
  – Private record linkage
• Quality-aware integration
  – Flexible and fully automatic record linkage
Summary
New!!!

3 Linkage of Anonymous Data
T1:
PrivateID  SSN  DOB       ZIP    Health_Problem
a               11/20/67  00198  Shortness of breath
b               02/07/81  00159  Headache
c               02/07/81  00156  Obesity
d               08/07/76  00198  Shortness of breath
T2:
PrivateID  SSN  DOB       ZIP    Employment        Marital Status
1          A    11/20/67  00198  Researcher        Married
5          E    08/07/76  00114  Private Employee  Married
3          C    02/07/81  00156  Public Employee   Widow
QUASI-IDENTIFIER: DOB, ZIP
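The linkage shown on this slide can be reproduced in a few lines. The following is only an illustrative sketch, not code from the talk: it joins the anonymous table T1 with the identified table T2 on the quasi-identifier (DOB, ZIP), using the rows from the slide. The dictionary layout and the function name `link_on_quasi_identifier` are assumptions made for the example.

```python
# Rows transcribed from the slide; only the columns needed for the join are kept.
T1 = [
    {"DOB": "11/20/67", "ZIP": "00198", "Health_Problem": "Shortness of breath"},
    {"DOB": "02/07/81", "ZIP": "00159", "Health_Problem": "Headache"},
    {"DOB": "02/07/81", "ZIP": "00156", "Health_Problem": "Obesity"},
    {"DOB": "08/07/76", "ZIP": "00198", "Health_Problem": "Shortness of breath"},
]
T2 = [
    {"SSN": "A", "DOB": "11/20/67", "ZIP": "00198", "Employment": "Researcher"},
    {"SSN": "E", "DOB": "08/07/76", "ZIP": "00114", "Employment": "Private Employee"},
    {"SSN": "C", "DOB": "02/07/81", "ZIP": "00156", "Employment": "Public Employee"},
]


def link_on_quasi_identifier(anonymous, identified, keys=("DOB", "ZIP")):
    """Pair each anonymous row with identified rows sharing the same key values."""
    # Index the identified table by its quasi-identifier values.
    index = {}
    for row in identified:
        index.setdefault(tuple(row[k] for k in keys), []).append(row)
    # Look up every anonymous row in that index.
    linked = []
    for row in anonymous:
        for match in index.get(tuple(row[k] for k in keys), []):
            linked.append((match["SSN"], row["Health_Problem"]))
    return linked


linked = link_on_quasi_identifier(T1, T2)
```

With these rows, two of the four anonymous health records are re-attached to an SSN, even though T1 carries no explicit identifier.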

4 Our Proposal
• A framework for assessing privacy risk that takes into account both facets of privacy
  – based on statistical decision theory
• Definition and analysis of:
  – disclosure policies modelled by disclosure rules
  – several privacy risk functions
• Estimated risk as an upper bound of the true risk, with the related complexity analysis
• Algorithm for finding the disclosure rule that minimizes the privacy risk

5 The Formal Framework
• Disclosure rule δ
• Loss function l(δ, θ), with θ representing the attacker's knowledge
• Risk R(δ, θ) = f(l(δ, θ))
• Facets of privacy: identification, sensitivity
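To make the roles of δ, θ, l and f concrete, here is a toy sketch. Everything specific in it is an assumption for illustration only: disclosure rules are modelled as sets of disclosed attributes, θ as the set of attributes the attacker already knows, the loss as a sum of hypothetical sensitivity weights, and f as the identity. The slides do not specify any of these choices.

```python
# Hypothetical sensitivity weights per attribute (not from the talk).
SENSITIVITY = {"HIV_result": 10, "last_doctor_visit": 1, "ZIP": 2}


def loss(rule, theta):
    """Toy loss l(delta, theta): weight of disclosed attributes the attacker did not know."""
    return sum(SENSITIVITY[attr] for attr in rule if attr not in theta)


def risk(rule, theta, f=lambda x: x):
    """Risk R(delta, theta) = f(l(delta, theta)); f is the identity by default."""
    return f(loss(rule, theta))


def minimize_risk(rules, theta):
    """Return the candidate disclosure rule with the smallest risk."""
    return min(rules, key=lambda rule: risk(rule, theta))


candidate_rules = [
    frozenset({"HIV_result"}),
    frozenset({"last_doctor_visit", "ZIP"}),
]
attacker_knowledge = {"ZIP"}
best_rule = minimize_risk(candidate_rules, attacker_knowledge)
```

Here the second rule wins because the attacker already knows the ZIP code and the remaining disclosed attribute carries a small loss.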

6 K-anonymity
• K-anonymity is simply a special case of our framework in which:
  1. θ true = relation T, a stricter assumption on the attacker's knowledge. We proved that, under some assumptions, the true risk can be bounded by our more general risk
  2. The loss is a constant, which is questionable: it does not depend on the type of disclosed attributes (an HIV result incurs the same loss as the last doctor visit)
  3. The set of disclosure rules is underspecified; it can be specified in several ways
• Our framework highlights some questionable hypotheses of k-anonymity!
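For readers unfamiliar with k-anonymity, the standard definition (every combination of quasi-identifier values occurs in at least k rows) can be checked as follows. This is a generic sketch of that definition, not the framework of the talk; the sample rows and the ZIP generalization to four digits are assumptions chosen for the example.

```python
from collections import Counter


def anonymity_level(rows, quasi_identifier):
    """Size of the smallest group of rows sharing the same quasi-identifier values."""
    groups = Counter(tuple(row[a] for a in quasi_identifier) for row in rows)
    return min(groups.values()) if groups else 0


def is_k_anonymous(rows, quasi_identifier, k):
    """True if every quasi-identifier combination occurs in at least k rows."""
    return anonymity_level(rows, quasi_identifier) >= k


def generalize_zip(rows, digits):
    """Return copies of the rows with the ZIP code truncated to its first digits."""
    return [{**row, "ZIP": row["ZIP"][:digits]} for row in rows]


rows = [
    {"ZIP": "00198", "Health_Problem": "Shortness of breath"},
    {"ZIP": "00159", "Health_Problem": "Headache"},
    {"ZIP": "00156", "Health_Problem": "Obesity"},
    {"ZIP": "00198", "Health_Problem": "Shortness of breath"},
]
level_raw = anonymity_level(rows, ["ZIP"])
level_generalized = anonymity_level(generalize_zip(rows, 4), ["ZIP"])
```

Note that the check treats every attribute alike, which is exactly the kind of assumption the slide questions: it says nothing about how sensitive the disclosed attributes are.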

7 Private Record Linkage
• Let P and Q be two peers owning the relations R_P(A1,…,An) and R_Q(B1,…,Bn), respectively. The privacy-preserving record matching problem is to perform record matching between R_P and R_Q such that, at the end of the process:
  – P will know only a set P_Match, consisting of the records in R_P that match records in R_Q. Similarly, Q will know only the set Q_Match.
• Of particular importance is that no information is revealed to P or Q about records that do not match each other
• Published at SIGMOD 07

8 Key Ideas and Solutions (1)
• We cannot just encrypt the data and then compute distances among them
  – by definition, encryption functions do not preserve distances
• Let's work on numbers instead of records!
• Mapping of records into a vector space, with record matching performed in that space
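The first point can be illustrated with a small experiment. This sketch uses a SHA-256 hash as a stand-in for an encryption function (an assumption made to keep the example dependency-free; a real cipher behaves similarly in this respect): two names that are one edit apart produce digests that bear no resemblance to each other.

```python
import hashlib


def edit_distance(a, b):
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def digest(s):
    """Hex SHA-256 digest, used here only as a stand-in for a ciphertext."""
    return hashlib.sha256(s.encode("utf-8")).hexdigest()


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


plain_distance = edit_distance("Smith", "Smyth")  # one substitution apart
digest_distance = hamming(digest("Smith"), digest("Smyth"))
```

The plaintext distance is 1, while the digests differ in most of their 64 hex positions, so a distance computed on the protected values says nothing about the similarity of the original records.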

9 Key Ideas and Solutions (2)
• Third-party based protocol in which:
  – The two parties together build the embedding space by using a method (SparseMap) with "secure" features
  – Each of the two parties embeds its own dataset and sends it to the third party
  – The third party W performs the intersection and sends the result back to the parties
• Mapping of records into a vector space, with record matching performed in that space
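The overall flow can be sketched as follows. This is a deliberately simplified stand-in, not the protocol from the talk: instead of SparseMap it uses a plain reference-string embedding (each record becomes the vector of its edit distances to a few agreed strings), and it omits every "secure" feature. The reference strings, the threshold, and the function names `embed` and `third_party_match` are assumptions for illustration.

```python
import math


def edit_distance(a, b):
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def embed(record, reference_strings):
    """Map a record to the vector of its distances from the reference strings."""
    return [edit_distance(record, ref) for ref in reference_strings]


def third_party_match(vectors_p, vectors_q, threshold):
    """Run by W: pair up vectors whose Euclidean distance is within the threshold."""
    return [(i, j)
            for i, vp in enumerate(vectors_p)
            for j, vq in enumerate(vectors_q)
            if math.dist(vp, vq) <= threshold]


# Step 1: P and Q agree on the embedding space (here: shared reference strings).
references = ["anna", "mark", "zoe"]
# Step 2: each party embeds its own records locally and sends only the vectors to W.
records_p = ["maria rossi", "john smith"]
records_q = ["maria rosi", "peter brown"]
vectors_p = [embed(r, references) for r in records_p]
vectors_q = [embed(r, references) for r in records_q]
# Step 3: W matches in the vector space and returns index pairs to the parties.
matches = third_party_match(vectors_p, vectors_q, threshold=2.0)
```

Because edit distance satisfies the triangle inequality, two records that are one edit apart differ by at most 1 in every coordinate, so similar records stay close in the vector space while W never sees the original strings.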

10 Key Ideas and Solutions (3)
• Theorem 1: Given the two relations R_P(D1,…,Ds) and R_Q(D1,…,Dx), the set of matching records RecMatch, and the database size DBSize, the record matching protocol finds the matched records between the two relations with the following assurances:
  – RecMatch is not disclosed to W
  – R_P - RecMatch is not disclosed to Q
  – R_Q - RecMatch is not disclosed to P
  – DBSize is disclosed to W and bounded by P and Q

11 Schema Matching Features
• Theorem 2: Given the schemas R_P and R_Q, owned by parties P and Q respectively, and the set of matching attributes AttrMatch, the schema matching protocol finds the attributes common to the two schemas with the following assurances:
  – AttrMatch is not disclosed to W
  – AttrMatch is not disclosed to P and Q
  – AttrMatchSize is not disclosed to P and Q

12 How good are we?
• Time: better than record linkage without privacy preservation
• Effectiveness: comparable with respect to recall and precision

13 Flexible and Automatic RL
• P2P systems are loosely coupled, dynamic, and open
• Manual phases of record linkage can be problematic:
  – Time-consuming vs. dynamic and open systems
  – Synchronous interactions vs. loosely coupled systems
• Need for flexible and automatic RL

14 Background: Record Linkage Techniques
• Search Space Reduction:
  – Sorted Neighborhood Method
  – Blocking
  – Hierarchical grouping
  – …
• Decision Rules:
  – Probabilistic: Fellegi & Sunter
  – Empirical
  – Knowledge-based
• Comparison Functions:
  – Edit distance
  – Smith-Waterman
  – Q-grams
  – Jaro string comparator
  – Soundex code
  – TF-IDF
  – …
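Two of the techniques listed above are easy to show in code. The sketch below implements a q-gram comparison function (Jaccard similarity over padded bigrams, one common variant) and the candidate generation step of the Sorted Neighborhood Method. The padding character, the window semantics, and the sample names are assumptions for illustration.

```python
def qgrams(s, q=2):
    """Set of q-grams of a lower-cased string, padded with '#' at both ends."""
    padded = "#" * (q - 1) + s.lower() + "#" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def qgram_similarity(a, b, q=2):
    """Jaccard similarity between the q-gram sets of two strings."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    return len(ga & gb) / len(ga | gb)


def sorted_neighborhood_pairs(records, key, window=3):
    """Sort records by key and pair each one with the next window - 1 records."""
    ordered = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs = set()
    for pos, i in enumerate(ordered):
        for j in ordered[pos + 1:pos + window]:
            pairs.add((min(i, j), max(i, j)))
    return pairs


names = ["rossi", "smith", "rosi", "smyth"]
# Only neighbours in sort order are compared, instead of all 6 possible pairs.
candidates = sorted_neighborhood_pairs(names, key=lambda name: name, window=2)
scores = {pair: qgram_similarity(names[pair[0]], names[pair[1]]) for pair in candidates}
```

With a window of 2, only three of the six possible pairs are compared, and the q-gram score separates the likely duplicates ("rossi"/"rosi") from the unrelated neighbours ("rossi"/"smith").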

15 Key Idea
• Record linkage is a complex process and should be decomposed as much as possible into its constituent phases
• For each phase, the most appropriate technique should be chosen depending on application and data requirements
• The goal is to dynamically build ad hoc record linkage workflows
• RELAIS: a toolkit serving this purpose
  – developed at Istat
  – UNIROMA contribution on data profiling (covered a couple of slides later)

16 RELAIS Toolkit
Diagram: RELAIS takes two kinds of input and produces a record linkage workflow.
• Application constraints: admissible error rates, privacy issues, cost, …
• Database features: size, quality, domain features, …
• Output: Record Linkage Workflow

17 RL Workflows
Diagram: available techniques per phase.
• Preprocessing: Normalization, UpperLowerCase, Schema reconciliation
• Search Space Reduction: Blocking, SNM
• Comparison Function: Edit Distance, Jaro, Equality
• Decision Model: Probabilistic, Empirical
Example workflows:
• RecLink WF Appl1: Normalization, UpperLowerCase, Blocking, Jaro, Empirical, Equality
• RecLink WF Appl2: SNM, Probabilistic
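The idea of assembling a workflow from one technique per phase can be sketched with plain functions. This is not the RELAIS API (which is not shown in the slides); the `Workflow` class and all component names are hypothetical, and the chosen components are the simplest possible ones for each phase.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass
class Workflow:
    """One pluggable component per record linkage phase (hypothetical design)."""
    preprocess: Callable[[str], str]
    candidate_pairs: Callable[[List[str], List[str]], Iterable[Tuple[int, int]]]
    compare: Callable[[str, str], float]
    decide: Callable[[float], bool]

    def run(self, left, right):
        a = [self.preprocess(r) for r in left]
        b = [self.preprocess(r) for r in right]
        return [(i, j) for i, j in self.candidate_pairs(a, b)
                if self.decide(self.compare(a[i], b[j]))]


def normalize(record):
    """Preprocessing: lower-case and collapse whitespace."""
    return " ".join(record.lower().split())


def block_on_first_letter(a, b):
    """Search space reduction: only pair records sharing the first character."""
    blocks = {}
    for j, rec in enumerate(b):
        blocks.setdefault(rec[:1], []).append(j)
    for i, rec in enumerate(a):
        for j in blocks.get(rec[:1], []):
            yield (i, j)


def equality(x, y):
    """Comparison function: 1.0 for identical strings, 0.0 otherwise."""
    return 1.0 if x == y else 0.0


workflow = Workflow(normalize, block_on_first_letter, equality, lambda score: score >= 1.0)
result = workflow.run(["  Rossi Maria", "Smith John"], ["rossi  maria", "Brown Peter"])
```

Swapping a phase (for example a different comparison function or decision rule) only requires passing a different callable, which is the kind of flexibility the slide describes.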

18 Making Automatic Some Phases
• Data profiling for choosing matching keys
• Automatic extraction of:
  – Completeness
  – Consistency
  – Identification power
• Ongoing
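A minimal sketch of how profiling could drive the choice of a matching key is given below. The definitions are assumptions, since the slides do not give formulas: completeness is taken as the fraction of non-missing values, identification power as the fraction of distinct values among the non-missing ones, and the key is chosen by their product. Consistency is left out because it requires domain rules.

```python
def is_present(value):
    """A value counts as present if it is neither None nor the empty string."""
    return value not in (None, "")


def completeness(rows, attr):
    """Fraction of rows in which the attribute has a value."""
    return sum(is_present(row.get(attr)) for row in rows) / len(rows)


def identification_power(rows, attr):
    """Fraction of distinct values among the present values (assumed definition)."""
    values = [row[attr] for row in rows if is_present(row.get(attr))]
    return len(set(values)) / len(values) if values else 0.0


def choose_matching_key(rows, attrs):
    """Pick the attribute with the best completeness x identification power score."""
    return max(attrs, key=lambda a: completeness(rows, a) * identification_power(rows, a))


rows = [
    {"name": "Rossi", "zip": "00198"},
    {"name": "Smith", "zip": "00198"},
    {"name": "Bruni", "zip": None},
    {"name": "Verdi", "zip": "00156"},
]
best_key = choose_matching_key(rows, ["name", "zip"])
```

In this sample the name column is both complete and fully distinguishing, so it is preferred over the ZIP code, which has a missing value and a repeated one.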

19 Status of RELAIS
• Currently: guided execution of RL workflows with all phases automatic
• Future:
  – Definition of the RELAIS architecture as a service-oriented, web-accessible architecture, with formal specification of (i) the inputs and outputs of services and (ii) pre- and post-conditions, using Semantic Web Services technologies
  – Automatic generation of RL workflows by reasoning on service specifications, using either automatic [Berardi et al. VLDB 2005] or semi-automatic [Bouguettaya et al. VLDBJ 2003] service composition techniques

20 Implementation View
Diagram: PQ-RELAIS produces the record linkage workflow and combines two components.
• Q-RELAIS: data source profiling (quality metadata), quality-based trust evaluation, automatic and flexible RL
• P-RELAIS: privacy risk assessment, private RL

