1 Searching and Integrating Information on the Web Seminar 4: Ranking Queries and Data Privacy Professor Chen Li UC Irvine.


1 Searching and Integrating Information on the Web. Seminar 4: Ranking Queries and Data Privacy. Professor Chen Li, UC Irvine

2 Seminar 3: Outline and readings
Ranking Queries
– Fagin, R., Combining Fuzzy Information from Multiple Systems, PODS 1996
– Fagin et al., Optimal Aggregation Algorithms for Middleware, PODS 2001
Data privacy:
– Database-as-service: Executing SQL over Encrypted Data in the Database-Service-Provider Model. Hakan Hacigumus, Bala Iyer, Chen Li, and Sharad Mehrotra. SIGMOD 2002.
– XML Data publishing: Secure XML Publishing without Information Leakage in the Presence of Data Inference. Xiaochun Yang and Chen Li. To appear in VLDB'04

3 Outline
Ranking Queries
Data privacy:
– XML Data publishing
– Database-as-service

4 Top-k queries
1. Finding multi-attribute tuples with the top-k highest scores.
2. Scoring function: aggregates the scores on the attributes, e.g., w1*A1 + … + wn*An, where wi is the weight for attribute Ai.
3. Monotone aggregation functions: if tuple A has a grade at least as high as tuple B on each attribute, then A's overall grade is at least as high as B's.
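A minimal sketch of the weighted-sum scoring function described above, using a brute-force scan to find the top-k tuples. The table `cars`, the integer grades, the weights, and the helper names are illustrative assumptions, not taken from the slides. With non-negative weights the weighted sum is monotone.

```python
from typing import Dict, List, Tuple


def weighted_sum(grades: List[int], weights: List[int]) -> int:
    """Aggregate per-attribute grades as w1*A1 + ... + wn*An."""
    return sum(w * g for w, g in zip(weights, grades))


def naive_top_k(table: Dict[str, List[int]], weights: List[int], k: int) -> List[Tuple[str, int]]:
    """Score every tuple and return the k best (ties broken by name)."""
    scored = [(name, weighted_sum(grades, weights)) for name, grades in table.items()]
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:k]


# Hypothetical grades on (price, mileage, year); higher is better.
cars = {"a": [90, 30, 80], "b": [20, 90, 40], "e": [70, 80, 60]}
weights = [5, 3, 2]
top1 = naive_top_k(cars, weights, 1)
```

The scan touches every tuple, which is the cost that FA and TA try to avoid.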

5 Applications
Multimedia databases
Web search queries:
– Restaurants
– Houses
– Cars
– …

6 Modes of Data Access (Fagin)
The underlying middleware (e.g., search engines, Garlic, QBIC) supports 2 modes:
1. Sorted access:
- Attribute Ai (a column) forms a list Li sorted by the score on Ai.
- The list is output one entry at a time.
2. Random access:
- Ask the system for the grade of any given object.
Goal: minimize the total cost to get the top-k results.
[Figure: three sorted lists labeled price, mileage, year, with entries a, c, e, ...; b, e, f, ...; a, d, e, ...]

7 FA: Fagin's algorithm [PODS96]
1. Do sorted access in parallel to each of the m sorted lists Li. Wait until there is a set H of at least k objects such that each of these objects has been seen in each of the m lists.
2. For each object R that has been seen, do random access as needed to each of the lists Li to find the i-th field xi of R.
3. Compute the aggregate results.
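The three steps above can be sketched as follows. This is a simplified, in-memory illustration and not the authors' implementation: the dictionary `db` stands in for the middleware, the integer grades are made up, and assigning the slide's list orders (a, c, e; b, e, f; a, d, e) to price, mileage, and year in that order is an assumption.

```python
from typing import Callable, Dict, List, Sequence, Set, Tuple

Grades = Tuple[int, ...]


def fagin_algorithm(
    db: Dict[str, Grades], agg: Callable[[Sequence[int]], int], k: int
) -> Tuple[List[str], int]:
    """Return the top-k objects and the sorted-access depth reached."""
    m = len(next(iter(db.values())))
    # Sorted access: list i ranks all objects by attribute i, best first.
    lists = [sorted(db, key=lambda o, i=i: -db[o][i]) for i in range(m)]
    seen_in: Dict[str, Set[int]] = {}  # object -> lists where it was seen
    depth = 0
    while depth < len(db):
        for i in range(m):
            seen_in.setdefault(lists[i][depth], set()).add(i)
        depth += 1
        # Step 1 stops once k objects have been seen in all m lists.
        if sum(1 for s in seen_in.values() if len(s) == m) >= k:
            break
    # Steps 2 and 3: random access (db[o]) for every seen object, then rank.
    ranked = sorted(seen_in, key=lambda o: (-agg(db[o]), o))
    return ranked[:k], depth


# Hypothetical grades on (price, mileage, year).
db = {
    "a": (90, 60, 90), "b": (30, 90, 30), "c": (80, 20, 20),
    "d": (20, 10, 80), "e": (70, 80, 70), "f": (10, 70, 10),
}
fa_top, fa_depth = fagin_algorithm(db, sum, 1)
```

In this data, 'e' is the first object seen in all three lists, yet 'a' has the higher sum, which is why the algorithm must score every seen object before answering.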

8 Example:
1. Suppose k = 1. Given the three partial lists retrieved so far, ‘e’ appears in all of them. We can say that the top-1 tuple must be in {a, b, c, d, e, f}.
2. Reason: since the function is monotonic, tuple ‘e’ “blocks” all tuples below the cut-off line, since they cannot have a higher overall grade than ‘e’.
3. The algorithm does random access for the other 5 tuples to get their grades, and picks the top 1.
4. Notice that we cannot say ‘e’ must be the top 1, since other tuples (e.g., ‘a’) may still have a higher overall score.
5. Minor point: one possible improvement is to note that ‘f’ can never be better than ‘e’.
[Figure: the sorted lists (labels price, mileage, year; entries a, c, e, ...; b, e, f, ...; a, d, e, ...) with a cut-off line]

9 General case
1. Once k tuples have appeared in all the partial lists, halt.
2. Reason: these k tuples block all the tuples below the cut-off line, which cannot be better than these k tuples.
3. Do random access for the retrieved tuples to get their overall grades, and find the top-k.
[Figure: the price, mileage, and year lists with k tuples marked in each list and a cut-off line]

10 FA's Properties
1. Can correctly find the top-k results for monotone aggregation functions.
2. Cost for a database with N objects: O(N^((m-1)/m) * k^(1/m)), with arbitrarily high probability.

11 FA's Drawbacks
The number of sorted accesses is still large.
Since all seen tuples must be buffered, the required buffer size is unbounded.
Does not exploit the bound given by the aggregation function to determine when to stop sorted access.

12 TA: Threshold Algorithm [PODS2001]
1. Do sorted access in parallel to each of the m sorted lists. As an object R is seen under sorted access in some list, do random access to the other lists to find the grade xi of R in each of them. Then compute the aggregate grade t(x1, …, xm) of R. If it is one of the k highest seen so far, remember it; otherwise discard it.
2. For each list Li, let xi be the grade of the last object seen under sorted access. Define the threshold value T to be t(x1, …, xm). As soon as at least k objects have been seen whose grade is at least T, halt.
3. Return the k objects that have been seen with the highest grades.
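A minimal in-memory sketch of the steps above, under the same assumptions as any toy example: `db` is a hypothetical stand-in for the middleware, the integer grades are made up, and the threshold is checked once per round of parallel sorted access.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Grades = Tuple[int, ...]


def threshold_algorithm(
    db: Dict[str, Grades], agg: Callable[[Sequence[int]], int], k: int
) -> Tuple[List[str], int]:
    """Return the top-k objects and the sorted-access depth reached."""
    m = len(next(iter(db.values())))
    lists = [sorted(db, key=lambda o, i=i: -db[o][i]) for i in range(m)]
    buffer: Dict[str, int] = {}  # at most k best objects after each round
    for depth in range(len(db)):
        last_seen = []
        for i in range(m):
            obj = lists[i][depth]          # sorted access to list i
            last_seen.append(db[obj][i])
            buffer[obj] = agg(db[obj])     # random access to the other lists
        best = sorted(buffer.items(), key=lambda p: (-p[1], p[0]))[:k]
        buffer = dict(best)                # discard everything else
        threshold = agg(last_seen)         # T = t(x1, ..., xm)
        if len(best) == k and all(g >= threshold for _, g in best):
            return [o for o, _ in best], depth + 1
    best = sorted(buffer.items(), key=lambda p: (-p[1], p[0]))[:k]
    return [o for o, _ in best], len(db)


# Hypothetical grades on (price, mileage, year).
db = {
    "a": (90, 60, 90), "b": (30, 90, 30), "c": (80, 20, 20),
    "d": (20, 10, 80), "e": (70, 80, 70), "f": (10, 70, 10),
}
ta_top, ta_depth = threshold_algorithm(db, sum, 1)
```

On this data the threshold drops to 240 after two rounds, which equals the grade of 'a', so the sketch halts at depth 2, one round before 'e' would have been seen in all three lists.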

13 Example:
1. A buffer keeps the top-k tuples that have been found so far.
2. For any tuple in a sorted list, do a random access to get its overall grade. Compare it with the tuples in the buffer, and decide whether to insert it or discard it.
3. The threshold window (the m most recently seen records, one per list) represents the “best” result we can still see, assuming we can combine the best values from different tuples.
4. Notice that this window may not be “horizontal” if we use different speeds to access different lists.
5. This window helps us decide when to stop: once we find k tuples whose grades are at least equal to the grade of the window tuple, we halt.
[Figure: the sorted lists (labels price, mileage, year; entries a, c, e, ...; b, e, f, ...; a, d, e, ...) with a buffer for the top-k and a threshold window]

14 TA's Properties
1. TA is instance optimal for all monotone aggregation functions and over every database.
2. Compared to FA, TA requires only a small, constant-size buffer.
3. TA allows early stopping
– Can show TA never stops later than FA. (Why?)
4. There are times when the user is satisfied with an approximate top-k list. TA can be modified to give such an approximation.
5. TA can be modified for the case where random access is impossible.

15 Instance Optimality
1. Algorithm b is instance optimal over an algorithm set A and a database instance set D if b is in A, and for every algorithm a in A and every instance d in D, we have: cost(b, d) = O(cost(a, d)).
2. Similar to “competitive ratio”
3. Essentially: b is the best algorithm in A.
4. Stronger than worst-case optimality
5. TA is instance optimal among all “correct algorithms” (nondeterministic algorithms).

16 Variations of TA
NRA: When no random access is possible
– Example: Web search engines, which typically do not allow you to enter a URL and get its ranking
TA_Z: When no sorted access is possible for some predicates
– Example: Find good restaurants near location x (sorted and random access for restaurant ratings, random access only for distances from a mapping site)
CA: When the relative costs of random and sorted accesses matter.
TA_θ: When only approximate answers are needed
– Example: Web search, with lots of good quality answers

17 Outline
Ranking Queries
Data privacy:
– XML Data publishing
– Database-as-service

18 Motivation
Privacy in publishing XML data
Applications:
– Web publishing
– Data sharing and exchange, e.g., in P2P systems

19 Example: Hospital XML data
[Figure: XML tree rooted at hospital. Physicians Smith and Walker have treat links to patients. Patients Alice, Betty, and Cathy each have ward W305 and disease leukemia; patient Tom has ward W403 and disease cancer.]
Goal: hide Alice's disease
Common knowledge: patients in the same ward have the same disease

20 Problem
Given:
– An XML document to be published
– Sensitive data in the document
– Common knowledge that public users can use to do data inference
Find:
– A partial document to be released so that users cannot infer the sensitive data

21 Research challenges
How to model data inference using common knowledge?
How to compute all possible inferred data?
How to compute a partial document to be published without leaking sensitive information?

22 Roadmap
→ Information Leakage
– Defining sensitive data
– Describing common knowledge
– Computing inferred documents
Prevent information leakage

23 Defining sensitive data
Using an XQuery, called a “regulating query”
A special node marked “*” indicates the sensitive data
[Figure: two regulating queries drawn as trees. A1: nodes patient, Alice, and disease (*). A2: nodes hospital, pname (*), and Cathy.]

24 Example 1
Map the query to the XML tree.
For each mapping, the target of the * node is sensitive.
[Figure: regulating query A1 (patient, Alice, disease *) mapped onto the hospital tree with patients Alice, Betty, and Cathy, each with pname, disease, and ward W305]

25 Example 2
[Figure: regulating query A2 (hospital, pname *, Cathy) mapped onto the hospital tree with patients Alice, Betty, and Cathy, each with pname, disease, and ward W305]

26 Common Knowledge
Represented as XML constraints
Could be obtained in various ways, e.g.,
– possible schema
– analysis from the published data

27 Common Constraints
Child constraints: //p → //p/c, e.g., //patient → //patient/pname
Descendant constraints: //p → //p//d, e.g., //patient → //patient//disease
Functional dependencies: //p/a → //p/b, e.g., //patient/ward → //patient/disease
– For two patients with wards w1, w2 and diseases d1, d2: if w1 = w2, then d1 = d2 (value equality)
[Figure: tree fragments for patient/pname, patient//disease, and two patients with ward and disease children]
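To make the functional dependency concrete, here is a small sketch of one inference step with //patient/ward → //patient/disease. It flattens each patient element into a dictionary, which is a simplifying assumption (the slides work on XML trees), and the function name `apply_fd` is hypothetical.

```python
from typing import Dict, List

Patient = Dict[str, str]


def apply_fd(patients: List[Patient], lhs: str, rhs: str) -> List[Patient]:
    """Fill in a missing rhs value when another record shares the lhs value."""
    # Collect the lhs -> rhs pairs that are visible in the partial document.
    known = {p[lhs]: p[rhs] for p in patients if lhs in p and rhs in p}
    inferred = []
    for p in patients:
        q = dict(p)  # copy, so the published document itself is unchanged
        if lhs in q and rhs not in q and q[lhs] in known:
            q[rhs] = known[q[lhs]]
        inferred.append(q)
    return inferred


# Partial document: Alice's disease was removed, but her ward was published.
partial = [
    {"pname": "Alice", "ward": "W305"},
    {"pname": "Betty", "ward": "W305", "disease": "leukemia"},
]
inferred = apply_fd(partial, "ward", "disease")
```

Because Alice and Betty share ward W305, a reader who knows the dependency recovers Alice's disease even though it was never published.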

28 Modify partial document using constraints
C1: //patient → //patient/pname
C2: //patient → //patient//disease
C3: //patient/ward → //patient/disease
[Figure: partial document P, a hospital tree with two patient nodes and disease (leukemia), ward (W305), and pname nodes]

29 Apply C1 on document P
C1: //patient → //patient/pname
[Figure: C1(P), the tree P with an added pname node]

30 Apply C2 on document P
C2: //patient → //patient//disease
Floating branch: exact location unknown
[Figure: C2(P), the tree P with an added floating disease branch]

31 Apply C3 on document P
C3: //patient/ward → //patient/disease
[Figure: C3(P), the tree P with an added disease node with value leukemia]

32 Apply a sequence of constraints:
C2: //patient → //patient//disease
C3: //patient/ward → //patient/disease
[Figure: the tree P with the added disease and leukemia nodes]

33 Another user applies a different sequence of constraints:
C2: //patient → //patient//disease
C3: //patient/ward → //patient/disease
After applying C3, we cannot use C2 to expand the tree.
No more floating branch!
[Figure: the tree P with the added disease and leukemia nodes]

34 They look different!
P1 is “m-contained” in P2:
– There is a mapping from P1 to P2.
– A floating branch can be mapped to a path.
– The m-containing document P2 has more information.
P2 is also “m-contained” in P1. Thus they are “m-equivalent”!
[Figure: the two inferred documents P1 and P2, each the result of a different sequence of constraints, side by side]

35 What documents can users infer?
Different users can use different sequences of constraints to do inference.
Thus they can infer different documents.
Questions:
– Can an inference process terminate?
– What inferred document should we consider to prevent leakage of sensitive data?

36 Theorem
Given a partial document P of an XML document D and a set of constraints C = {C1, …, Ck}, there is a document M that can be inferred from P using a sequence of constraints, such that:
– for any sequence of constraints, its resulting document is m-contained in M.
M can be computed using a greedy approach.
Such a document is unique under m-equivalence.

37 Information leakage
For a partial document P, if there exists a regulating query A such that the maximal inferred document M can produce a non-empty answer to the query A, then we say “P causes information leakage.”
[Figure: partial document P, followed by inference, then tested against regulating query A]
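The definition above can be sketched as a fixpoint computation followed by a query test. This is a simplified illustration under stated assumptions: patients are flattened into dictionaries, only functional dependencies are applied, the regulating query is hard-coded as "the disease of the patient with a given pname", and the names `maximal_inferred` and `causes_leakage` are hypothetical.

```python
from typing import Dict, List, Tuple

Record = Dict[str, str]


def maximal_inferred(records: List[Record], fds: List[Tuple[str, str]]) -> List[Record]:
    """Apply the dependencies repeatedly until nothing new can be inferred."""
    docs = [dict(r) for r in records]
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            known = {r[lhs]: r[rhs] for r in docs if lhs in r and rhs in r}
            for r in docs:
                if lhs in r and rhs not in r and r[lhs] in known:
                    r[rhs] = known[r[lhs]]
                    changed = True
    return docs


def causes_leakage(records: List[Record], fds: List[Tuple[str, str]], pname: str) -> bool:
    """True if the regulating query has a non-empty answer on the inferred document."""
    return any(
        r.get("pname") == pname and "disease" in r
        for r in maximal_inferred(records, fds)
    )


fds = [("ward", "disease")]
betty = {"pname": "Betty", "ward": "W305", "disease": "leukemia"}
cathy = {"pname": "Cathy", "ward": "W305", "disease": "leukemia"}
# Removing only Alice's disease still leaks it through her ward.
leaks = causes_leakage([{"pname": "Alice", "ward": "W305"}, betty, cathy], fds, "Alice")
# Removing her ward as well breaks the inference.
safe = causes_leakage([{"pname": "Alice"}, betty, cathy], fds, "Alice")
```

The loop terminates because every change adds a field to some record and the number of possible fields is finite.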

38 Roadmap
Information Leakage
→ Prevent information leakage

39 Formal Problem
Given an XML document D, a regulating query A, and common knowledge represented as constraints C1, …, Ck:
– How to find a partial document P without information leakage?
– Such a document is called a valid partial document.
The empty document is a trivial one.
We want the published document to have as much data as possible.

40 An algorithm
We develop an algorithm for solving this problem.
We use the running example to illustrate the algorithm.

41 Example
Functional dependency: //patient/ward → //patient/disease
[Figure: hospital document with patients Alice, Betty, and Cathy, each with ward W305 and disease leukemia; regulating query A marks the disease of patient Alice with *]

42 Remove sensitive data A(D)
Remaining document: D - A(D)
[Figure: the hospital document with Alice's leukemia (1) node separated from the rest; regulating query A shown alongside]

43 Compute the maximal inferred document M of D - A(D)
[Figure: maximal inferred document M for the hospital example, with regulating query A shown alongside]

44 Testing Information Leakage
There is a mapping from A to P, so information is leaked.
[Figure: regulating query A mapped onto the inferred hospital document]

45 Computing a valid partial document
How to break the mappings?
How to chase back the inference steps?
[Figure: starting from D - A(D), repeated rounds of inference, breaking the mappings of A, and chasing back the inference steps]

46 AND/OR Graphs
A structure representing how a goal can be reached by solving subproblems.
We use such graphs to formulate the process of finding a valid partial document.

47 Consider mapping images of the leaf nodes in A.
An “OR” connector shows that solving any of the subproblems can solve the parent problem.
[Figure: AND/OR graph with START linked by an OR connector to leukemia (1) and Alice (1), next to the hospital document and regulating query A]

48 Multiple ways to infer the sensitive data.
An “AND” connector shows that solving ALL the subproblems can solve the parent problem.
[Figure: the AND/OR graph expanded below leukemia (1) with an AND connector over OR connectors, whose successors are leukemia and W305 nodes of the other patients]

49 Continue expanding the AND/OR graph
[Figure: the AND/OR graph from the previous slide with further AND and OR connectors added]

50 AND/OR Graphs (cont)
A special START node represents the goal of computing a valid partial document.
The graph has nodes corresponding to nodes in the maximal inferred document M.
Such a node represents the subproblem of hiding its corresponding node n in M:
– This node n should be removed from M.
– It cannot be inferred using the constraints and other nodes in M.

51 Solution graphs
A connected subgraph (of M) including the START node.
For each node in the subgraph, its successor connectors are also in the subgraph.
If it contains an OR connector, it must also contain one of the connector's successors.
If it contains an AND connector, it must also contain all the successors of the connector.
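A rough sketch of how a solution graph translates into a set of nodes to remove. It is a greedy illustration, not the algorithm from the paper: the dictionary `ways` (each sensitive node mapped to the premise sets from which it can be inferred), the node labels, and the functions `hide` and `solve` are assumptions made for this example. Every inference way must be broken (AND), and breaking one way only needs one of its premises hidden (OR).

```python
from typing import Dict, FrozenSet, List, Set, Tuple

# node -> list of inference ways; each way is a non-empty tuple of premises.
Ways = Dict[str, List[Tuple[str, ...]]]


def hide(node: str, ways: Ways, visiting: FrozenSet[str] = frozenset()) -> Set[str]:
    """Greedily pick nodes to remove so that `node` is gone and not inferable."""
    if node in visiting:
        return set()  # already being removed higher up in the recursion
    removed = {node}
    for premises in ways.get(node, []):        # AND: break every way
        if removed & set(premises):
            continue                           # this way is already broken
        options = [hide(p, ways, visiting | {node}) for p in premises]
        removed |= min(options, key=lambda s: (len(s), sorted(s)))  # OR
    return removed


def solve(targets: List[str], ways: Ways) -> Set[str]:
    """START node: hiding any one mapping image of a leaf of A is enough."""
    return min((hide(t, ways) for t in targets), key=lambda s: (len(s), sorted(s)))


# Hypothetical inference ways for Alice's disease in the hospital example.
ways: Ways = {
    "leukemia(1)": [
        ("W305(1)", "W305(2)", "leukemia(2)"),
        ("W305(1)", "W305(3)", "leukemia(3)"),
    ]
}
via_disease = hide("leukemia(1)", ways)
smallest = solve(["Alice(1)", "leukemia(1)"], ways)
```

Hiding the disease node also requires hiding Alice's ward, while hiding her name alone already breaks the mapping of the regulating query. The greedy choice is not guaranteed to be globally minimal.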

52 Example solution graphs
[Figure: two solution graphs. One contains START, an OR connector, and Alice (1). The other contains START, OR and AND connectors, leukemia (1), and W305 (1).]

53 Computing a valid partial document using a solution graph
For a solution graph G, for each node in G, we remove the corresponding node in M to get a valid partial document.
[Figure: the two example solution graphs (one with Alice (1); one with leukemia (1) and W305 (1)) next to the hospital document]

54 Constructing an AND/OR Graph
We give an algorithm for computing an AND/OR graph.
It considers the inference steps of different constraints.
Many algorithms have been proposed for finding a solution graph, and they are applicable here.
No need to construct the entire AND/OR graph: search for a solution graph “on the fly.”

55 Related work
Different scenarios of database security based on trust domains:
A. Single-user DBMS
B. C/S access control
C. Database as a service
D. Data publishing (our work)
[Figure: for each scenario, how data, query, and execution are split across trust domains]

56 Summary of 2nd paper
Formulated the problem of publishing an XML document without information leakage due to data inference.
Showed the effect of constraints on inference.
Gave an algorithm for finding a valid partial document of a given document.

57 Outline
Ranking Queries
Data privacy:
– XML Data publishing
– Database-as-service (DAS) model

