Presentation on theme: "Enabling Privacy in Provenance- Aware Workflow Systems Susan B. Davidson 1 Joint work with Sanjeev Khanna, Sudeepa Roy, Julia Stoyaonovich, Val Tannen."— Presentation transcript:
Enabling Privacy in Provenance- Aware Workflow Systems Susan B. Davidson 1 Joint work with Sanjeev Khanna, Sudeepa Roy, Julia Stoyaonovich, Val Tannen 1 Yi Chen 2 Tova Milo 3 1 University of Pennsylvania 2 Arizona State University 3 Tel Aviv University
2 Science has been revolutionized floods of data High-throughput technologies generate floods of data, which must be analyzed to create knowledge. Scientists are turning to scientific workflow systems to help manage the analysis.
Scientific Workflow Split Entries Align Sequences Functional Data Curate Annotations Format-2 Format-1 Format-3 Construct Trees 3 Graphical representation of a sequence of actions to perform a task (e.g., a biological experiment) Vertex ≡ Module (program) Edge ≡ Dataflow Run: An execution of the workflow Actual data appears on the edges. TGCCGTG TGGCTAA ATG… CTGTGC … CTAAATG TCTGTGC … GGCTAAA TGTCTG TGCCGTG TGGCGTC … ATCCGTG TGGCTA..
Need for provenance 4 TGCCGTGT GGCTAAAT GTCTGTGC … CCCTTTCCG TGTGGCTA AATGTCTG TGC … TGCCGTGT GGCTAAAT GTCTGTGC GTCTGTGC … TGCCGTGT GGCTAAAT GTCTGTGC GTCTGTGC … TGCCGTGT GGCTAAAT GTCTGTGC … ATGGCCGT GTGGTCTG TGCCTAAC TAACTAA… … Biologist’s workspace Which sequences have been used to produce this tree? How has this tree been generated? ? s Split Entries Align Sequences Functional DataCurate Annotations Format Construct Trees t Need for privacy
5 Outline The need: Workflow provenance repositories The challenge of privacy Privacy-aware search and query
Current situation: Workflow repositories To enable sharing and reuse, repositories of workflow specifications are being created – e.g. myExperiment.org – Keyword search is used to find specifications of interest Several workflow systems are storing provenance information – Module executions, input parameters, input/output data – “Input-only” 7
The Vision “Workflow Provenance” repositories will store specifications as well as executions (i.e. provenance information) – Searchable – Queryable – Privacy protecting Searching/querying these repositories can be used to – Find/reuse workflows – Understand meaning of a workflow – Correct/debug erroneous specifications – Find “bad” data and what its downstream effect might be 8
“You are better off designing in security and privacy… from the start, rather than trying to add them later.” (Shapiro, CACM 53:6, June 2010) 9
Privacy Concerns in Scientific Workflows 10 Data, module, structural
Example: Module Privacy 11 P: (X1, X2, X3, X4, X5) (X1, X2, X3)(X1, X2, X4, X5) P has cancer? Split entries Check for Cancer Check for Infectious disease Create Report report Patient record: Gender, smoking habits, Familial environment, blood pressure, blood test report, … P has an infectious disease? If X1 > 60 OR (X2 < 800 AND X5=1) AND …. Module functionality should be kept secret (From patient’s standpoint): output should not be guessed given input data values (From module owner’s standpoint): no one should be able to simulate the module and use it elsewhere.
Example: Structural Privacy 12 M1 M2 Protein + Functional annotation Protein M2 compares domains of proteins (more precise but more time consuming) M1 compares the entire protein against already annotated genomes Relationships between certain data/module pairs should be kept secret
Privacy concerns at a glance P: (X1, X2, X3, X4) (X1, X2, X3)(X1, X2, X4) P has cancer? P has infectious disease? Data Privacy Data items are private Module Privacy Module functionality is private (x, f(x)) Structural Privacy Execution paths between certain data is private Split entries Check for cancer Check for infectious disease Create Report report smoking habits, blood pressure, blood test report, …… 13 Check for cancer DB
14 Module Privacy (a hint) A module f = a function For every input x to f, f(x) value should not be revealed – Enough equivalent possible f(x) values w.r.t. visible information According to required privacy guarantee There is a knife, a fork and a spoon in this figure arxiv.org/abs/1005.5543
Access Views Prefixes of workflow hierarchy can be used to define the finest level of granularity the user can “see” 17 W1 W3 W2W4W1W2W4 W1W2 W1: IO M1M2
I O Determine Genetic Susceptibility M1 M2 Evaluate Disorder Risk Consult External Databases M3 Expand SNP Set M4 Generate Database Queries Query OMIM Combine Disorder Sets M5 M6 M8 W1 W2 W4 Query PubMed M7 SNPs, ethnicity lifestyle, family history, physical symptoms disorders prognosis SNPs disorders query Search Keyword search: “Database, disorder risk” Result should be concise and informative W1W2W4 “WISE: a Workflow Information Search Engine”, (ICDE 2009)
I O M2 Evaluate Disorder Risk M3 Expand SNP Set Generate Database Queries Query OMIM Combine Disorder Sets M5 M6 M8 Query PubMed M7 SNPs, ethnicity lifestyle, family history, physical symptoms disorders prognosis SNPs disorders query Result of “database, disorder risk” W1W2W4
I O Determine Genetic Susceptibility M1 M2 Evaluate Disorder Risk Consult External Databases M3 Expand SNP Set M4 W1 W2 SNPs, ethnicity lifestyle, family history, physical symptoms disorders prognosis SNPs Result of “database, disorder risk” W1W2 I O M2 Evaluate Disorder Risk M3 Expand SNP Set SNPs, ethnicity lifestyle, family history, physical symptoms prognosis SNPs disorders M4 Consult External Databases
Queries “Workflows in which Expand SNP is performed before Query PubMed.” “Data items that were used to produce d19.” … Shares with search the need to match certain terms and give a view of the result and preserve privacy. 21
Research Challenges More ways of enabling scientists to construct workflows Formalizing privacy notions Provenance query language? Efficiently identifying data that users can access – users may have different privileges, yielding many different access views. … 22
Acknowledgments Members of the PennDB group – Susan Davidson – Sanjeev Khanna – Sudeepa Roy – Julia Stoyanovich – Val Tannen Friend of PennDB and Tel Aviv faculty – Tova Milo PennDB alum and ASU faculty – Yi Chen Bioinformatics collaborators – Sarah Cohen-Boulakia This work was supported in part by NSF IIS-0803524, IIS-0629846, IIS-0915438, CCF-0635084, and IIS-0904314; NSF CAREER award IIS-0845647; and CRA 0937060