# Detecting Data Leakage

Panagiotis Papadimitriou, Hector Garcia-Molina


Leakage Problem

[Figure: apps U_1 and U_2 and the profiles of Kathryn, Jeremy, Sarah and Mark (e.g. "Name: Mark, Sex: Male, …" and "Name: Sarah, Sex: Female, …"); other sources, e.g. Sarah’s network, are also shown.]

Stanford Infolab

Outline

- Problem Description
- Guilt Models
  - Pr{U_1 leaked data} = 0.7
  - Pr{U_2 leaked data} = 0.2
- Distribution Strategies


Problem Entities

| Entity | Dataset |
| --- | --- |
| Distributor: Facebook | T: set of all Facebook profiles |
| Agents: Facebook apps U_1, …, U_n | R_1, …, R_n, where R_i is the set of profiles of the people who have added the application U_i |
| Leaker | S: set of leaked profiles |

Agents’ Data Requests

- Sample
  - 100 profiles of Stanford people
- Explicit
  - All people who added the application (the example we used so far)
  - All Stanford profiles
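The two request types can be contrasted with a short sketch. Everything here is an assumption for illustration: the helper names `serve_sample_request` and `serve_explicit_request`, the toy profile table, and the attribute names are hypothetical and do not come from the talk.

```python
import random


def serve_sample_request(candidates, m, rng):
    """Sample request: the distributor may pick any m items from the candidates."""
    # sorted() gives a list, which random.sample requires on recent Python versions.
    return set(rng.sample(sorted(candidates), m))


def serve_explicit_request(profiles, condition):
    """Explicit request: the agent receives every profile that satisfies its condition."""
    return {name for name, attrs in profiles.items() if condition(attrs)}


# Toy profile table, invented for this example.
profiles = {
    "Kathryn": {"school": "Stanford", "apps": {"U1"}},
    "Jeremy": {"school": "Stanford", "apps": {"U1", "U2"}},
    "Sarah": {"school": "Stanford", "apps": {"U2"}},
    "Mark": {"school": "Other", "apps": set()},
}

# Explicit requests: the distributor has no choice about which items to return.
stanford = serve_explicit_request(profiles, lambda a: a["school"] == "Stanford")
added_u1 = serve_explicit_request(profiles, lambda a: "U1" in a["apps"])

# Sample request: any 2 Stanford profiles will do, so the distributor chooses.
sample = serve_sample_request(stanford, 2, random.Random(0))
```

The difference that matters later is the distributor's freedom: with a sample request it can choose which items each agent receives, with an explicit request it cannot.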


Guilt Models (1/3)

- p: posterior probability that a leaked profile comes from other sources (e.g. Sarah’s network)
- Guilty agent: an agent who leaks at least one profile
- Pr{G_i | S}: probability that agent U_i is guilty, given the leaked set of profiles S

Guilt Models (2/3)

Two alternative assumptions:

- Agents leak each of their data items independently, or
- Agents leak all their data items or nothing.

[Figure: outcome probabilities (1-p)^2, (1-p)p, p(1-p) and p^2.]
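To make the independent-leak assumption concrete, here is a minimal sketch of how Pr{G_i | S} could be computed. The transcript does not preserve a formula, so the model below is an assumption: each leaked item comes from other sources with probability p, and otherwise was leaked by exactly one of the agents holding it, each equally likely. The function name `guilt_probability` and the toy allocations are hypothetical.

```python
def guilt_probability(agent, allocations, leaked, p):
    """Sketch of Pr{G_i | S} when items are leaked independently (assumed model)."""
    prob_innocent = 1.0
    for item in leaked & allocations[agent]:
        # Number of agents that were given this item.
        holders = sum(1 for r in allocations.values() if item in r)
        # Assumed chance that this particular agent leaked this item.
        leaked_by_agent = (1 - p) / holders
        prob_innocent *= 1 - leaked_by_agent
    # Guilty means the agent leaked at least one of the leaked items.
    return 1 - prob_innocent


# Toy data, invented for this example.
allocations = {
    "U1": {"Kathryn", "Jeremy", "Sarah"},
    "U2": {"Sarah", "Mark"},
}
leaked = {"Kathryn", "Jeremy", "Sarah"}

g1 = guilt_probability("U1", allocations, leaked, p=0.2)  # about 0.976
g2 = guilt_probability("U2", allocations, leaked, p=0.2)  # about 0.4
```

In this toy case U1 holds every leaked item and is the only holder of two of them, so its guilt estimate is much higher than that of U2, which shares only one leaked item.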

Guilt Models (3/3)

[Figure: two panels, "Independently" and "NOT Independently", with axes labeled Pr{G_1} and Pr{G_2}.]


The Distributor’s Objective (1/2)

[Figure: agents U_1, U_2, U_3 and U_4 send requests and receive the sets R_1, R_2, R_3 and R_4; the leaked set S is shown, annotated with Pr{G_1 | S} >> Pr{G_2 | S} and Pr{G_1 | S} >> Pr{G_4 | S}.]

The Distributor’s Objective (2/2)

To achieve his objective, the distributor has to distribute sets R_1, …, R_n that minimize the overlap among them.

Intuition: minimized data sharing among agents makes leaked data reveal the guilty agents.
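As a rough sketch of this intuition, the overlap of a distribution can be scored by counting, for every pair of agents, how many items they share. The scoring function `total_overlap` and the toy allocations are assumptions for illustration; the transcript does not preserve the exact expression the slide minimizes.

```python
from itertools import combinations


def total_overlap(allocations):
    """Sum of |R_i ∩ R_j| over all pairs of agents (assumed scoring rule)."""
    return sum(len(a & b) for a, b in combinations(allocations.values(), 2))


# Both agents receive the same two profiles: a leak cannot tell them apart.
shared = {
    "U1": {"Kathryn", "Jeremy"},
    "U2": {"Kathryn", "Jeremy"},
}

# The agents receive disjoint sets: any leaked profile points to one agent.
disjoint = {
    "U1": {"Kathryn", "Jeremy"},
    "U2": {"Sarah", "Mark"},
}

shared_score = total_overlap(shared)
disjoint_score = total_overlap(disjoint)
```

Under this score the disjoint allocation gets 0 and the fully shared one gets 2, which matches the idea that less sharing makes a leak easier to attribute.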

Distribution Strategies – Sample (1/4)

- Set T has four profiles: Kathryn, Jeremy, Sarah and Mark
- There are 4 agents: U_1, U_2, U_3 and U_4
- Each agent requests a sample of any 2 profiles of T for a market survey
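This scenario is small enough to explore by brute force. The sketch below enumerates every way of giving each of the 4 agents 2 of the 4 profiles, scores each distribution by the sum of pairwise intersection sizes, and keeps the lowest-scoring distributions in which no two agents receive the same set. The scoring rule and the helper names are assumptions for illustration, not the objective stated in the talk.

```python
from itertools import combinations, product

profiles = ["Kathryn", "Jeremy", "Sarah", "Mark"]
agents = ["U1", "U2", "U3", "U4"]

# Every possible 2-profile sample an agent could receive (6 options).
pairs = [frozenset(c) for c in combinations(profiles, 2)]


def total_overlap(sets):
    """Sum of pairwise intersection sizes (assumed scoring rule)."""
    return sum(len(a & b) for a, b in combinations(sets, 2))


def has_full_overlap(sets):
    """True if at least two agents receive exactly the same set."""
    return len(set(sets)) < len(sets)


# All 6^4 = 1296 ways to hand one pair to each agent.
candidates = list(product(pairs, repeat=len(agents)))
best = min(total_overlap(c) for c in candidates)

# Lowest-overlap distributions that also avoid identical sets.
optimal = [
    c for c in candidates
    if total_overlap(c) == best and not has_full_overlap(c)
]

# A poor distribution for comparison: every agent gets the same pair.
poor_score = total_overlap([pairs[0]] * len(agents))
```

With 8 items handed out over 4 profiles, some sharing is unavoidable; the minimum score here is 4, reached for example by giving the agents {Kathryn, Jeremy}, {Jeremy, Sarah}, {Sarah, Mark} and {Mark, Kathryn}. Handing everyone the same pair scores 12.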

Distribution Strategies – Sample (2/4)

[Figure: two distributions of profiles to agents U_1, U_2, U_3 and U_4, one labeled "Poor" and one labeled "Minimize".]

Distribution Strategies – Sample (3/4)

[Figure: the optimal distribution of profiles to agents U_1, U_2, U_3 and U_4, captioned "Avoid full overlaps and minimize" (the expression being minimized is missing from the transcript).]

Distribution Strategies – Sample (4/4)

Distribution Strategies

Sample Data Requests

- The distributor has the freedom to select the data items to provide the agents with.
- General idea: provide agents with sets of data that are as disjoint as possible.
- Problem: there are cases where the distributed data must overlap, e.g. when |R_1| + … + |R_n| > |T|.

Explicit Data Requests

- The distributor must provide agents with the data they request.
- General idea: add fake data to the distributed data to minimize the overlap of the distributed data.
- Problem: agents can collude and identify fake data.

NOT COVERED in this talk
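For sample requests, one simple way to keep the distributed sets as disjoint as possible is a greedy pass that always hands out the items given to the fewest agents so far. This is only a sketch under that assumption, not the allocation algorithm from the talk; `allocate_samples` and the request sizes are hypothetical.

```python
def allocate_samples(items, request_sizes):
    """Greedy sketch: give each agent the items handed out least often so far."""
    counts = {item: 0 for item in items}
    allocations = {}
    for agent, m in request_sizes.items():
        # sorted() is stable, so ties are broken by the order of `items`.
        chosen = sorted(items, key=lambda item: counts[item])[:m]
        for item in chosen:
            counts[item] += 1
        allocations[agent] = set(chosen)
    return allocations


profiles = ["Kathryn", "Jeremy", "Sarah", "Mark"]

# Total requested size 4 = |T|: the sets can be fully disjoint.
fits = allocate_samples(profiles, {"U1": 2, "U2": 2})

# Total requested size 6 > |T| = 4: some overlap is unavoidable.
forced = allocate_samples(profiles, {"U1": 2, "U2": 2, "U3": 2})
```

Note the limitation: this greedy rule only balances how often each item is handed out. In the second case it gives U3 the same set as U1, so a real strategy would also need a tie-breaking rule that avoids such full overlaps.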

Conclusions

- Data leakage modeled as a maximum likelihood problem
- Data distribution strategies that help identify the guilty agents

Thank You!