1 Architecture Tutorial Security and privacy in provenance Simon Miles King’s College London

2 Architecture Tutorial Outline Provenance Models and Systems Illustrative Application Privacy and Security Issues

3 Architecture Tutorial Provenance

4 Architecture Tutorial What Provenance Is Oxford English Dictionary: –the fact of coming from some particular source or quarter; origin, derivation –the history or pedigree of a work of art, manuscript, rare book, etc.; –concretely, a record of the passage of an item through its various owners. Provenance is important for: –Interpretation –Judging value

5 Architecture Tutorial Causation Everything that is part of the provenance of an item is a cause of that item being as it is. For example, provenance of a bottle of wine includes: –Grapes from which it is made –Where those grapes grew –Steps in the wine’s preparation –How the wine was stored –Between which parties the wine was transported, e.g. producer to distributor to retailer

6 Architecture Tutorial Motivating Applications We and other projects interviewed and supported users with issues regarding provenance in a range of domains, including: Bioinformatics, Particle Physics, Proteomics, Organ transplant, Aircraft simulation, Police database integration, Social planning, Chemical analysis, Genetic diseases, Grid service fault tolerance, Brain image analysis, Astronomy

7 Architecture Tutorial Provenance Questions How did I (or someone else) come by this result? What was common and relevant in the history of this set of successful outcomes? Was the process claimed to be performed the one which was actually performed?

8 Architecture Tutorial Provenance Questions What inputs were used to derive this output? What software produced this data? Can I generalise from the process by which this result was produced to a reusable plan?

9 Architecture Tutorial Provenance Questions Were these regulations followed in producing this result? Are these two independent conclusions actually based on the same faulty assumption/input? What differed between the way these two results were produced?

10 Architecture Tutorial Shared Histories and Futures Multiple data can be produced by one process One process can use data from many sources as input The provenance (and futures) of data items overlap It is suspect to say that one data item = one provenance, provenance stored with data

11 Architecture Tutorial Causal Provenance Models Illustrative Application

12 Architecture Tutorial Causal graphs Donor Organ Decision: Yes

13 Architecture Tutorial Causal graphs Donor Organ Decision: Yes Family Consent Decision: Yes decision based on Blood Test Results: -ve

14 Architecture Tutorial Causal graphs Donor Organ Decision: Yes Family Consent Decision: Yes decision based on response to Blood Test Results: -ve Blood Test Request: 432 Family Consent Request: 432 response to

15 Architecture Tutorial Causal graphs Donor Organ Decision: Yes Family Consent Decision: Yes Patient Brain Death: PID 432 decision based on response to triggered by Blood Test Results: -ve Blood Test Request: 432 Family Consent Request: 432 response to

16 Architecture Tutorial Causal graphs Donor Organ Decision: Yes Family Consent Decision: Yes Patient Brain Death: PID 432 decision based on response to triggered by Blood Test Results: -ve Blood Test Request: 432 Family Consent Request: 432 response to triggered by
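
One way to read the completed graph on this slide is as a set of labelled cause edges. The sketch below is a minimal illustration, assuming the edge directions suggested by the labels (the decision is based on the consent decision and the test results, each of those responds to a request, and each request is triggered by the brain-death event). The names `CAUSES` and `provenance_of` are illustrative only and do not come from any system described in the talk.

```python
# Causal graph of the organ-donation example, stored as effect -> [(relation, cause)].
# The edge directions are an assumed reading of the slide's labels.
CAUSES = {
    "Donor Organ Decision: Yes": [
        ("decision based on", "Family Consent Decision: Yes"),
        ("decision based on", "Blood Test Results: -ve"),
    ],
    "Family Consent Decision: Yes": [("response to", "Family Consent Request: 432")],
    "Blood Test Results: -ve": [("response to", "Blood Test Request: 432")],
    "Family Consent Request: 432": [("triggered by", "Patient Brain Death: PID 432")],
    "Blood Test Request: 432": [("triggered by", "Patient Brain Death: PID 432")],
}


def provenance_of(item):
    """Return every occurrence that is a direct or indirect cause of `item`."""
    seen = set()
    stack = [item]
    while stack:
        current = stack.pop()
        for _relation, cause in CAUSES.get(current, []):
            if cause not in seen:
                seen.add(cause)
                stack.append(cause)
    return seen
```

Calling `provenance_of("Donor Organ Decision: Yes")` walks back through both branches to the brain-death event, which is the sense in which everything in an item's provenance is a cause of it.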

17 Architecture Tutorial Causal Connections Patient after donation with two kidneys Donation operation Causes and effects are occurrences –Occurrence of an event, or –Occurrence of a data item or physical object being in a particular state

18 Architecture Tutorial Documentation and Provenance We can distinguish –process documentation (the documentation recorded into a store about processes) –provenance (everything that caused an item to be as it is) Process documentation is recorded as processes are executed. The data items that a process will ultimately produce may not be known at that time. Provenance of an entity is obtained as the result of a query over process documentation. (Figure labels: Process documentation, Provenance)

19 Architecture Tutorial Process Documentation Documentation of one process comes from multiple, possibly independent, sources May share a store or use separate ones Family Testing Lab Doctor Blood Test Results Blood Test Request Family Consent Decision Family Consent Request Donor Organ Decision: Yes Patient Brain Death: PID 432

20 Architecture Tutorial Provenance Scope An item is caused to be as it is by previous events, which were themselves caused by other events. The causal graph could go back to the beginning of time. If all this information were provided as a result of a query, it would be unmanageable and mostly irrelevant to the querier. Therefore, the querier needs to scope the query to that which is relevant.
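
Scoping can be sketched as a bounded traversal of the causal graph. The example below assumes the simplest possible scope, a maximum number of causal steps back from the item. The talk does not prescribe a scoping criterion, so `max_depth`, `scoped_provenance` and the `chain` data are all hypothetical.

```python
from collections import deque


def scoped_provenance(causes, item, max_depth):
    """Breadth-first walk over cause edges, stopping `max_depth` steps from `item`.

    `causes` maps each effect to a list of its direct causes.
    Returns a dict of cause -> number of steps back from `item`.
    """
    found = {}
    queue = deque([(item, 0)])
    while queue:
        current, depth = queue.popleft()
        if depth == max_depth:
            continue  # outside the requested scope: do not expand further
        for cause in causes.get(current, []):
            if cause not in found:
                found[cause] = depth + 1
                queue.append((cause, depth + 1))
    return found


# Hypothetical causal chain: each item is caused by the one listed for it.
chain = {
    "result": ["analysis"],
    "analysis": ["sample"],
    "sample": ["collection"],
    "collection": ["funding decision"],
}
```

With `max_depth=2` the query returns only `analysis` and `sample`. Other scopes (by time window, by organisation, by relation type) would replace the depth test with a different predicate.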

21 Architecture Tutorial Open Data Model Organisation 1 Organisation 2 Organisation 3 Distributed processes involve functionality from multiple independent organisations Each needs to record documentation independently We need a common, open data model and interfaces for recording and querying data in that model Provenance Stores

22 Architecture Tutorial Digitally Controlled Process Inference Blood Test Results Blood Test Request

23 Architecture Tutorial Inferred Physical Process Digitally Controlled Process Inference Blood Test Results Blood Test Request Sent Blood Sample Received Blood Sample

24 Architecture Tutorial Privacy and Security Issues

25 Architecture Tutorial Anonymised User Actions Provenance records for healthcare will include documentation regarding the actions of patients (or samples of theirs) Going to see a particular (their) GP Undergoing surgery at a particular hospital Their blood sample being sent to a testing lab Even if the patient is anonymised within the records, the pattern of their actions can be enough to uniquely identify them
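
The re-identification risk can be shown with a small sketch. The records below are hypothetical and contain only pseudonyms and actions. The check flags any pseudonym whose combination of actions is shared by nobody else, which is the situation in which an observer who knows a person's movements could single out their record.

```python
from collections import Counter

# Hypothetical anonymised provenance records: (pseudonym, action). No names are stored.
records = [
    ("p1", "visit GP A"), ("p1", "surgery at hospital H"), ("p1", "blood sample to lab L"),
    ("p2", "visit GP A"), ("p2", "blood sample to lab L"),
    ("p3", "visit GP B"), ("p3", "blood sample to lab L"),
    ("p4", "visit GP A"), ("p4", "blood sample to lab L"),
]


def action_patterns(records):
    """Group the actions recorded for each pseudonym into one pattern."""
    patterns = {}
    for pseudonym, action in records:
        patterns.setdefault(pseudonym, set()).add(action)
    return {p: frozenset(actions) for p, actions in patterns.items()}


def uniquely_identifiable(records):
    """Return the pseudonyms whose action pattern no other pseudonym shares."""
    patterns = action_patterns(records)
    counts = Counter(patterns.values())
    return sorted(p for p, pattern in patterns.items() if counts[pattern] == 1)
```

Here `p2` and `p4` are indistinguishable from each other, while `p1` and `p3` each have a pattern that occurs only once.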

26 Architecture Tutorial Data and Metadata Rights Provenance is often viewed as metadata to the data of which it provides a history. Provenance information is usually generated automatically at runtime, so it is not known in advance what that information will be, yet appropriate rights still have to be applied to the provenance. How do access rights of the provenance metadata relate to those of the data?

27 Architecture Tutorial Multi-Data Metadata Furthermore, provenance is often metadata to multiple data items For example, a record of the process of a transplant operation is the provenance of The transplanted organ, The decision to transplant, Blood tests carried out to decide to transplant, etc. Each may be stored separately and have very different access control policies

28 Architecture Tutorial Necessary Distribution of Query It is sometimes necessary to distribute parts of the provenance data about a process into multiple stores For example, in the OTM case, by EU law the data regarding activity within each hospital had to remain within that hospital To answer a provenance question, we need to query across distributed stores
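
A query across distributed stores can be sketched as a traversal that asks every store about each item it reaches. The `ProvenanceStore` class, its `causes_of` method and the two hospital stores are hypothetical stand-ins. The talk does not specify a store interface here.

```python
class ProvenanceStore:
    """Hypothetical store holding only the cause edges recorded at one site."""

    def __init__(self, name, edges):
        self.name = name
        self._edges = edges  # effect -> list of direct causes

    def causes_of(self, item):
        return list(self._edges.get(item, []))


def federated_provenance(stores, item):
    """Follow cause edges across all stores, noting which store supplied each cause."""
    found = {}
    pending = [item]
    while pending:
        current = pending.pop()
        for store in stores:
            for cause in store.causes_of(current):
                if cause not in found:
                    found[cause] = store.name
                    pending.append(cause)
    return found


# Each site documents only the activity that happened within it.
hospital_a = ProvenanceStore("Hospital A", {"Donor Organ Decision": ["Blood Test Results"]})
hospital_b = ProvenanceStore("Hospital B", {"Blood Test Results": ["Blood Test Request"]})
```

Neither store alone can answer the question. The chain from the decision back to the test request only appears when the results of both are combined.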

29 Architecture Tutorial Automatic Capture Provenance is often viewed as metadata to the data of which it provides a history. Provenance information is usually generated automatically at runtime, so it is not known in advance what that information will be, yet appropriate rights still have to be applied to the provenance. How do access rights of the provenance metadata relate to those of the data?

30 Architecture Tutorial Traffic Confidentiality and Inference Traffic confidentiality means hiding the fact that a service was used by a client, even where transmitted data is encrypted. Example: a pharmaceutical company querying a small lab’s public database concerning a particular disease. Intermediaries who use multiple services can help achieve confidentiality. But the actual service used could still be inferred from provenance, which is set up precisely to allow inferences.

31 Architecture Tutorial Extra Material

32 Architecture Tutorial Extra Material Index Motivation for general provenance models Interoperability and the Open Provenance Model Provenance technologies in database research, digital libraries, semantic web Provenance in Tupelo (from NCSA) Provenance in Taverna (from Manchester) The Provenance Challenges Open research issues

33 Architecture Tutorial Motivation for Common, General Provenance Models

34 Architecture Tutorial Separately Documented Aspects Attribution and related events –Modified by Simon Miles, compressed by X –Created at time T1, deposited at T2 Documentation of the processing of data –Enactment of workflows –Chain of ownership Versioning Differing practice, technologies, emphasis: workflows, DB research, libraries, semweb

35 Architecture Tutorial Preparation for Questions Don’t know in advance of something being produced that it will be produced –When documenting events, can’t yet associate that documentation with what those events ultimately produce Don’t know in advance of being asked (about provenance) what will be asked –When documenting provenance, can’t restrict documentation to that you know will be used

36 Architecture Tutorial Shared Histories and Futures Multiple data can be produced by one process One process can use data from many sources as input The provenance (and futures) of data items overlap It is suspect to say that one data item = one provenance, provenance stored with data

37 Architecture Tutorial Alternative Accounts In some disciplines or for some kinds of data, provenance can be disputed. Even within a computer system, there can be multiple accounts of apparently the same event. (Figure: actors A and B, with two accounts: "A sent X to B" and "A sent Y to B")

38 Architecture Tutorial Common General Models Provide skeleton for documenting all aspects of provenance Record lots without (much) regard to particular questions... Then query as relevant to required usage System interoperation through common serialisation Can connect records from different systems involved in producing 1 data item

39 Architecture Tutorial Provenance Scope An item is caused to be as it is by previous events, which were themselves caused by other events. The causal graph could go back to the beginning of time. If all this information were provided as a result of a query, it would be unmanageable and mostly irrelevant to the querier. Therefore, the querier needs to scope the query to that which is relevant.

40 Architecture Tutorial Interoperability

41 Architecture Tutorial Open Data Model Organisation 1 Organisation 2 Organisation 3 Distributed processes involve functionality from multiple independent organisations Each needs to record documentation independently We need a common, open data model and interfaces for recording and querying data in that model Provenance Stores

42 Architecture Tutorial Open Provenance Model Can describe any process (not just workflow execution) Allows alternate accounts by different observers http://openprovenance.org

43 Architecture Tutorial OPM Requirements To allow provenance information to be exchanged between systems, by means of a compatibility layer based on a shared provenance model. To allow developers to build and share tools that operate on such a provenance model. To define the model in a precise, technology-agnostic manner. To support a digital representation of provenance for any “thing”, whether produced by computer systems or not.

44 Architecture Tutorial OPM Non-Requirements OPM does not specify the internal representations that systems have to adopt to store and manipulate provenance internally. OPM does not define a computer-parsable syntax for this model (but prototype RDF, XML schemas have been developed) OPM does not specify protocols to store such provenance information in provenance repositories. OPM does not specify protocols to query provenance repositories.

45 Architecture Tutorial Contributors Original contributors from: –Universities: Southampton, Indiana, King’s College, Manchester, Davis, Hasselt, Utah, Southern California –Microsoft, NCSA, PNNL Plus 3rd challenge participants including: –Universities: Harvard, Chicago, Santa Barbara, Amsterdam –SDSC

46 Architecture Tutorial Open Provenance Model 3 node types – artifact, process, agent 5 arc types – used, generated, triggered, derived, controlled – and inference rules Generic – extensibility via annotation Choice of granularity and focus (e.g., artifact or process-centric)

47 Architecture Tutorial Entities Artifact: Immutable piece of state, which may have a physical embodiment in a physical object, or a digital representation in a computer system. Process: Action or series of actions performed on or caused by artifacts, and resulting in new artifacts. Agent: Contextual entity acting as a catalyst of a process, enabling, facilitating, controlling, affecting its execution.

48 Architecture Tutorial Edges A process P used an artifact A. An artifact A was generated by a process P. A process was triggered by another process. An artifact was derived from another artifact. Role identifiers on edges specify in what way an artifact relates to a process

49 Architecture Tutorial Pegasus Example Process "Produce Sky Mosaic" used (inputSet) artifact "FITS DataSet" and used (size) artifact "Degree". Artifact "Mosaic" was generated by (output) the process. The process was controlled by (enactor) agent "Pegasus / Condor DAGMan".
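
The Pegasus example can be written down as data using the three node types and the arc types listed earlier. This is a minimal sketch, not the OPM toolkit: the `Node`, `Arc` and `make_arc` names are invented for illustration, while the arc names and the node types they connect follow the OPM definitions. The graph is read as one process that uses two artifacts, generates one, and is controlled by one agent.

```python
from dataclasses import dataclass

# Arc type -> (node type of the effect, node type of the cause), as defined by OPM.
ARC_TYPES = {
    "used": ("process", "artifact"),
    "wasGeneratedBy": ("artifact", "process"),
    "wasTriggeredBy": ("process", "process"),
    "wasDerivedFrom": ("artifact", "artifact"),
    "wasControlledBy": ("process", "agent"),
}


@dataclass(frozen=True)
class Node:
    kind: str  # "artifact", "process" or "agent"
    name: str


@dataclass(frozen=True)
class Arc:
    kind: str
    effect: Node
    cause: Node
    role: str = ""


def make_arc(kind, effect, cause, role=""):
    """Build an arc, rejecting node types the arc type does not allow."""
    expected = ARC_TYPES[kind]
    if (effect.kind, cause.kind) != expected:
        raise ValueError(f"{kind} must link a {expected[0]} to a {expected[1]}")
    return Arc(kind, effect, cause, role)


mosaic_process = Node("process", "Produce Sky Mosaic")
fits = Node("artifact", "FITS DataSet")
degree = Node("artifact", "Degree")
mosaic = Node("artifact", "Mosaic")
enactor = Node("agent", "Pegasus / Condor DAGMan")

pegasus_graph = [
    make_arc("used", mosaic_process, fits, role="inputSet"),
    make_arc("used", mosaic_process, degree, role="size"),
    make_arc("wasGeneratedBy", mosaic, mosaic_process, role="output"),
    make_arc("wasControlledBy", mosaic_process, enactor, role="enactor"),
]
```

Checking node types when an arc is created keeps malformed graphs, such as an artifact that "used" a process, from being recorded in the first place.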

50 Architecture Tutorial Mapping Attribution to OPM creation used A was generated by Simon Miles wasActionOf agent artifact process artifact A dc:creator “Simon Miles”

51 Architecture Tutorial Provenance Technologies in database research, digital libraries, semantic web

52 Architecture Tutorial Database Research In database research, the concept of provenance has been used for: –Inferring what database table values affected a query result (Buneman et al) –Tracking the changes in relational data structure between versions of a database –Tracking changes in database schemas (Chiticariu and Tan)

53 Architecture Tutorial Why & Where Provenance (Buneman et al.) Query: SELECT name, telephone FROM employee WHERE salary > (SELECT AVG(salary) FROM employee) Source table employee (columns name, telephone, salary): names Alfred, Bertha, Charlie, Denise, Eric; telephone 020 7848 ….; salaries 900, 800, 700, 600, 500. Result (columns name, telephone): Denise, Eric, 020 7848 …. Figure labels: where, why
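
The query on this slide can be run against an in-memory SQLite table to make the two notions concrete. This is an informal sketch rather than Buneman et al.'s formal definitions: "where" is recorded as the source cell each output value was copied from, and "why" as the set of source rows read to decide that the row qualifies (here every row, because the average depends on all salaries). The salaries and telephone numbers are made up, chosen so that the result matches the slide's Denise and Eric.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (name TEXT, telephone TEXT, salary INTEGER)")
# Hypothetical rows: the telephone numbers and the salary assignment are illustrative.
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?)",
    [
        ("Alfred", "555-0101", 500),
        ("Bertha", "555-0102", 600),
        ("Charlie", "555-0103", 700),
        ("Denise", "555-0104", 800),
        ("Eric", "555-0105", 900),
    ],
)

QUERY = """
SELECT rowid, name, telephone FROM employee
WHERE salary > (SELECT AVG(salary) FROM employee)
ORDER BY rowid
"""


def query_with_provenance(conn):
    """Run QUERY and attach informal where- and why-provenance to each result row."""
    all_rows = [r[0] for r in conn.execute("SELECT rowid FROM employee ORDER BY rowid")]
    results = []
    for rowid, name, telephone in conn.execute(QUERY):
        results.append({
            "row": (name, telephone),
            # where: the (table, row, column) each output value was copied from
            "where": {
                "name": ("employee", rowid, "name"),
                "telephone": ("employee", rowid, "telephone"),
            },
            # why: rows that influenced the outcome, all of them because of AVG
            "why": all_rows,
        })
    return results
```

A production system would derive this information from the query plan rather than hard-coding it per query. The point of the sketch is only the difference between where a value came from and why a row is in the result.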

54 Architecture Tutorial Digital Library Technologies In digital libraries, a set of standards (OAIS, METS, PREMIS...) is sometimes used to provide data structures to store metadata along with archived objects. An Archival Information Package (AIP) provides write-once data and metadata. AIP metadata can contain identifiers and relationships to connect one version to preceding versions, and record events relevant to the archived object, e.g. compression, integrity check

55 Architecture Tutorial Provenance in RDF Different schemes have been suggested for recording documentation on the provenance of statements in RDF Reified statements: A: http://...subj http://...isRelated http://...obj B: http://...hasCreator “Simon” Named graphs Causal graph explicit as part of data model
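
The reified-statement scheme can be sketched with plain tuples, without an RDF library. The `rdf:Statement`, `rdf:subject`, `rdf:predicate` and `rdf:object` terms are the standard RDF reification vocabulary. The `http://example.org/` namespace and the helper names `reify` and `creators_of` are assumptions standing in for the elided URIs on the slide.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
EX = "http://example.org/"  # hypothetical namespace replacing the slide's elided URIs


def reify(statement_id, triple):
    """Describe `triple` as a resource so that other triples can talk about it."""
    subject, predicate, obj = triple
    return [
        (statement_id, RDF + "type", RDF + "Statement"),
        (statement_id, RDF + "subject", subject),
        (statement_id, RDF + "predicate", predicate),
        (statement_id, RDF + "object", obj),
    ]


statement = (EX + "subj", EX + "isRelated", EX + "obj")
graph = reify(EX + "A", statement)
# Provenance of the statement itself: who created it.
graph.append((EX + "A", EX + "hasCreator", "Simon"))


def creators_of(graph, triple):
    """Find the creators recorded for any reification of `triple`."""
    subject, predicate, obj = triple
    facts = set(graph)
    ids = {s for (s, p, o) in graph if p == RDF + "type" and o == RDF + "Statement"}
    matching = [
        i for i in ids
        if (i, RDF + "subject", subject) in facts
        and (i, RDF + "predicate", predicate) in facts
        and (i, RDF + "object", obj) in facts
    ]
    return sorted(o for (s, p, o) in graph if s in matching and p == EX + "hasCreator")
```

Reification costs four extra triples per documented statement, which is one reason named graphs are offered as an alternative on the same slide.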

56 Architecture Tutorial Provenance as Bibliography Dublin Core can be used to express bibliography information: creator, publisher, subject, etc. –http://purl.org/dc/elements/1.1/creator Not as expressive as causal graphs and can be captured in a graph –e.g. who created something is part of the process by which it was created But DC metadata common across applications and easy to use Users can find it helpful to include both

57 Architecture Tutorial Provenance in Tupelo Thanks to Joe Futrelle, National Center for Supercomputing Applications, for the following slides

58 Architecture Tutorial Tupelo: semantic content Abstracts content from storage implementations (e.g., Sesame, Mulgara) Provides location-independent addressing of content and metadata Supports transparent mirroring, caching, failover, etc. (tupeloproject.org)

59 Architecture Tutorial Tupelo “Tupelo... provides a Web access protocol and Java API (Application Program Interface) that interface with an RDF (Resource Description Framework) mapping of the Open Provenance Model.” –Towards provenance-aware geographic information systems, ACM SIGSPATIAL 2008

60 Architecture Tutorial NCSA Provenance Infrastructure Open Provenance Model Tupelo Semantic Content Repository Context OPM toolkit Store OPM toolkit Visualization, interaction Tracking, modeling, presentation Abstraction, inference, storage desktop, portal, etc.

61 Architecture Tutorial Tupelo Provenance API Java API to record OPM data as RDF, e.g. Artifact artifact = graph.newArtifact("input file 1"); graph.assertArtifact(artifact);

62 Architecture Tutorial Tupelo Provenance API Query OPM graph by searching for patterns in RDF Unifier u = new Unifier(); u.setColumnNames("file", "path"); u.addPattern("file", Rdf.TYPE, PC3Utilities.ns("CSV_file")); u.addPattern("file", PC3Utilities.ns("PathToFile"), "path"); context.perform(u);

63 Architecture Tutorial Provenance in Taverna Thanks to Paolo Missier, University of Manchester, for the following slides

64 Architecture Tutorial Taverna “The Taverna workbench is a free software tool for designing and executing workflows, created by the myGrid project” –Taverna website

65 Architecture Tutorial Collections example: from genes to SNPs gene -> genomic region; extend region; retrieve SNPs in the region; rearrange SNP details See myexperiment.org: http://www.myexperiment.org/workflows/166 [ ENSG00000139618, ENSG00000083093 ] [[,,... ], [,,... ]

66 Architecture Tutorial Collections, iterations, and provenance l(s) → l(s) s → s s → l(s) s → s Processor signatures [139618, 83093] [23520984, 31786617][16,13] [16,13][23560179, 31871809] [,... ] [,... ] Dot product 139618 83093

67 Architecture Tutorial Capturing provenance with iterations Workflow processor (the elementary graph building block): P with input X:s and output Y:s. Iteration due to list depth mismatch: X is bound to [a_1 ... a_i ... a_n] and Y to [b_1 ... b_i ... b_n]. Semantics: Y = (map P [a_1 ... a_n]) = [ (P a_1) ... (P a_n) ] (extends to multiple inputs...). Unfolding during execution: P[1] ... P[n], where P[i] takes X = a_i and yields Y = b_i. OPM pattern: P[i] used a_i; b_i wasGeneratedBy P[i].
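
The unfolding described on this slide can be sketched as a map that records two OPM arcs per iteration. The function name `run_with_provenance` and the tuple layout of the arcs are illustrative choices, not Taverna's implementation. The sketch covers the single-input case only.

```python
def run_with_provenance(processor_name, func, inputs):
    """Apply `func` to each element and record `used` / `wasGeneratedBy` per invocation.

    Returns (outputs, arcs), where each arc is a (effect, arc type, cause) tuple.
    """
    outputs = []
    arcs = []
    for i, a in enumerate(inputs, start=1):
        invocation = f"{processor_name}[{i}]"  # one unfolded instance per list element
        b = func(a)
        outputs.append(b)
        arcs.append((invocation, "used", a))
        arcs.append((b, "wasGeneratedBy", invocation))
    return outputs, arcs


# Hypothetical processor that doubles its input.
outputs, arcs = run_with_provenance("P", lambda x: x * 2, [1, 2, 3])
```

Because each output element is tied to its own invocation, a later query about `b_i` can be answered with `a_i` alone rather than with the whole input list.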

68 Architecture Tutorial Querying provenance graphs Problem: –users are rarely interested in the complete provenance graph: noisy, possibly large, difficult to navigate Goal: let users identify –variables that carry interesting values for which provenance is sought –nodes in the graph where provenance information should be reported

69 Architecture Tutorial Provenance query - no semantics Provenance query syntax: SELECT merged_pathways AT get_pathways_by_genes1, mmusculus_gene_ensembl (merged_pathways: interesting value; get_pathways_by_genes1 and mmusculus_gene_ensembl: interesting processors)

70 Architecture Tutorial Role of semantics in provenance (Figure labels: Taverna runtime with processors P1 to P6; dataflow topology + raw lineage events; provenance capture and query processor; lineage database (RDB); query; current implementation; semantic overlays; semantic resource annotations; reference ontologies.) Example query: “describe the derivation of each pathway through Kegg, in which gene g is involved”

71 Architecture Tutorial The Provenance Challenges

72 Architecture Tutorial Provenance Challenges 1 & 2 IPAW 2006, HPDC 2007 20 teams, 1 workflow, 9 queries Interoperability? –lots of manual work required –call for standards (source: gridprovenance.org)

73 Architecture Tutorial Provenance Challenge 3 Ended with a workshop in Amsterdam, 10-11th June Specifically aimed at interoperability Each team: –Runs an astronomy data analysis process –Executes queries on provenance –Exports provenance as OPM –Imports other teams’ OPM provenance and re-runs queries

74 Architecture Tutorial Open Issues

75 Architecture Tutorial Intention and Reason OPM provides a mechanistic view of what has occurred It does not capture assertions such as: –X occurred because I aimed to achieve Y –X occurred because I believed that Y was true –X occurred because I had an obligation to ensure it did

76 Architecture Tutorial Digitally Controlled Process Inference and Physical Processes Blood Test Results Blood Test Request

77 Architecture Tutorial Inferred Physical Process Digitally Controlled Process Inference and Physical Processes Blood Test Results Blood Test Request Sent Blood Sample Received Blood Sample

