Layering in Provenance Systems
Margo Seltzer
May 13, 2009
Provenance in Secure and Advanced Computer Systems
PSACS: May 2009

The Vision: Provenance Everywhere
All data has provenance.
Applications generate provenance.
Systems generate provenance.
Users generate provenance.
Provenance is:
– Secure.
– Queryable.
– Globally searchable.
There are provenance-aware algorithms.

The Problem: Provenance Comes from Different Places
Depending on the source, provenance is attached to different kinds of objects:
– Operating system: files
– Database systems: tuples
– Workflow engines: objects
– Applications:
    Variables (from an interpreter)
    Links (from a browser)

Data are related
Tuples live in files.
Files comprise data sets.
Browsers write files.
Variables relate to each other.
Objects may be files, tuples, or data sets.
Must integrate provenance from different representations.

Why Integrate Provenance?

Outline
Provenance disclosure and integration
Layering and provenance
Parting remarks

Provenance Observation versus Disclosure
Disclosed provenance:
– Provenance that is explicitly provided.
– Provider understands semantics of the data referenced by provenance.
– Example: This image is the result of aligning these other two images.
Observed provenance:
– Provenance deduced by interpreting events.
– Observer translates event into a provenance relationship.
– Example: Process P wrote file F, therefore file F depends on process P.
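
The observer's translation step can be sketched in C. This is a minimal illustration, not code from PASS: the types `sys_event` and `dep_record` and the function `observe` are hypothetical names, and the rule for reads (the process comes to depend on the file) is an assumption added to complement the write example on the slide.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical event kinds an observer (e.g. a file system) might see. */
enum event_kind { EV_READ, EV_WRITE };

/* A raw event: process `proc` read or wrote file `file`. */
struct sys_event {
    enum event_kind kind;
    const char *proc;
    const char *file;
};

/* An observed provenance relationship: `object` depends on `ancestor`. */
struct dep_record {
    const char *object;
    const char *ancestor;
};

/* Translate one observed event into a dependency record. */
struct dep_record observe(struct sys_event ev)
{
    struct dep_record r;
    if (ev.kind == EV_WRITE) {
        /* P wrote F, therefore F depends on P. */
        r.object = ev.file;
        r.ancestor = ev.proc;
    } else {
        /* Assumed rule: P read F, therefore P depends on F. */
        r.object = ev.proc;
        r.ancestor = ev.file;
    }
    return r;
}
```

The observer never sees why P wrote F; it only records that the dependency exists. That gap is what disclosed provenance from a higher layer fills.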

Your observed provenance is my disclosed provenance.
The distinction between observed and disclosed provenance is one of vantage point.
A file system observes that the workflow engine produced the file atlas.x.gif.
The workflow engine can disclose that atlas.x.gif is the result of a 5-step process that began with reading warp.air.

Problem Overview
Systems capture provenance at different levels of abstraction:
– File systems: files and processes
– Database systems: tuples and queries
– Workflow engines: objects and operators
– Interpreters: variables and operations
– Browsers: URLs and traversals
Users want to query across these abstractions.

Use Case: PA-Browser
Browsers capture a user's search and traversal patterns.
Action: User inadvertently downloads a virus.
Without layering:
– Browser knows this came from virus.com.
– File system knows what files were affected.
With layering:
– How did user get to the virus?
– What else was downloaded from that site?
– Are there other files that might be similarly tainted?
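
The "similarly tainted" question is a reachability query over provenance edges from both layers. The sketch below shows the idea in C under stated assumptions: the edge list, the object names (`url:`, `download:`, `file:`, `proc:` prefixes), and the function `tainted_by` are all invented for illustration, and the graph is assumed to be acyclic.

```c
#include <assert.h>
#include <string.h>

/* One provenance edge: `object` depends on `ancestor`.
 * All names below are hypothetical, chosen only for this example. */
struct edge {
    const char *object;
    const char *ancestor;
};

/* Edges disclosed by the browser layer and observed by the file-system
 * layer, stored together so one query can cross both. */
static const struct edge graph[] = {
    { "download:setup.exe",     "url:virus.com" },       /* browser layer */
    { "download:tool.zip",      "url:virus.com" },       /* browser layer */
    { "file:/tmp/setup.exe",    "download:setup.exe" },  /* cross-layer link */
    { "file:/tmp/tool.zip",     "download:tool.zip" },   /* cross-layer link */
    { "proc:setup",             "file:/tmp/setup.exe" }, /* file-system layer */
    { "file:/home/u/notes.txt", "proc:setup" },          /* file-system layer */
    { "file:/home/u/todo.txt",  "proc:editor" },         /* unrelated */
};
#define NEDGES (sizeof graph / sizeof graph[0])

/* Return 1 if `object` transitively depends on `source`, else 0.
 * Plain recursion; assumes the graph has no cycles. */
int tainted_by(const char *object, const char *source)
{
    size_t i;
    if (strcmp(object, source) == 0)
        return 1;
    for (i = 0; i < NEDGES; i++)
        if (strcmp(graph[i].object, object) == 0 &&
            tainted_by(graph[i].ancestor, source))
            return 1;
    return 0;
}
```

Without the two cross-layer edges, the walk from `file:/home/u/notes.txt` would stop at `file:/tmp/setup.exe` and never reach `url:virus.com`. Those edges are the links that layering supplies.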

Use Case: PA-Python Applications
Python wrappers generate a trace of processing steps internal to Python.
Usage: Program reads 100 input files, uses two of them to produce a graph.
Without layering:
– Python knows which files were actually used to produce the graph.
– File system knows that Python read 100 files and produced an output file.
With layering:
– Can identify that two input files lead directly to the output file.

Integrating Requires Layering
Layering implies that provenance collection and tracking systems interact directly with one another.
Why not a centralized provenance repository?
– Requires a mechanism to translate names.
– Every participant must agree on a naming convention.
– Must be able to generate references to objects created by other participants.
– What happens when you add a new participant with a new naming mechanism?
Layering provides a natural way to transmit and integrate provenance.

Outline
Provenance disclosure and integration
Layering and provenance
Parting remarks

Provenance-Aware Agents
An agent that is provenance-aware:
– Accepts disclosed provenance from others.
– Observes events and generates provenance from them.
– Discloses provenance to others.
Implications:
– Both input and output are disclosed provenance.
– Participation in an integrated provenance-aware system requires an API for disclosed provenance.

DPAPI: The Disclosed Provenance API
Grew out of our experience designing and building PASS (Provenance-Aware Storage Systems).
Used as the universal internal API between components in the PASS architecture.
Used to extend PASS to NFS.
Used by provenance-aware applications.
Has evolved through three generations.

DPAPI Concepts
Pnode
– Unique ID assigned at object creation.
– Never recycled.
– Used to access an object's provenance.
Provenance record
– An attribute/value pair.
– Plain value or cross-reference.
Version
– Objects change; changes are reflected in versions.
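
One way to picture these three concepts is as C types. The sketch below is an assumption-laden illustration, not the DPAPI definitions: every `sk_` name, the integer widths, and the record layout are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t sk_pnode_t;    /* hypothetical stand-in for a pnode number */
typedef uint32_t sk_version_t;  /* hypothetical stand-in for a version */

/* Pnodes come from a counter that only grows, so no ID is ever recycled. */
static sk_pnode_t next_pnode = 1;
sk_pnode_t sk_alloc_pnode(void) { return next_pnode++; }

enum sk_value_kind { SK_PLAIN, SK_XREF };

/* A provenance record: an attribute paired with either a plain value
 * or a cross-reference to a specific version of another object. */
struct sk_record {
    const char *attribute;
    enum sk_value_kind kind;
    union {
        const char *plain;
        struct { sk_pnode_t pnode; sk_version_t version; } xref;
    } value;
};

/* Build a record holding a plain value. */
struct sk_record sk_plain(const char *attr, const char *val)
{
    struct sk_record r;
    r.attribute = attr;
    r.kind = SK_PLAIN;
    r.value.plain = val;
    return r;
}

/* Build a record that cross-references version `v` of object `p`. */
struct sk_record sk_xref(const char *attr, sk_pnode_t p, sk_version_t v)
{
    struct sk_record r;
    r.attribute = attr;
    r.kind = SK_XREF;
    r.value.xref.pnode = p;
    r.value.xref.version = v;
    return r;
}
```

A cross-reference names a (pnode, version) pair rather than a pnode alone, so a record can point at the exact state of an object that was used.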

DPAPI Functions
Pass_read: Reads data with a reference to its provenance.
Pass_write: Writes data with provenance.
Pass_freeze: Subsequent modifications to object create a new version.
Pass_mkobj: Create an object to represent something at a different abstraction layer.
Pass_reviveobj: Given a pnode number, obtain a reference to the appropriate object.
Pass_sync: Flush an object's provenance to disk.
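
The freeze semantics can be illustrated with a toy in-memory object. This is a sketch of the described behavior only, assuming that the first write after a freeze opens the new version. The struct `sk_object` and the functions `sk_freeze` and `sk_write` are hypothetical and are not the PASS implementation.

```c
#include <assert.h>

/* Hypothetical in-memory object; not the PASS implementation. */
struct sk_object {
    unsigned version;  /* current version number */
    int frozen;        /* set by freeze, cleared by the next write */
    unsigned writes;   /* writes applied to the current version */
};

/* Freeze: the current version becomes immutable history. */
void sk_freeze(struct sk_object *o)
{
    o->frozen = 1;
}

/* Write: the first write after a freeze starts a new version. */
void sk_write(struct sk_object *o)
{
    if (o->frozen) {
        o->version++;
        o->frozen = 0;
        o->writes = 0;
    }
    o->writes++;
}
```

In this sketch, freezing does not itself change the version number. The version advances only when a later modification arrives.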

Example Stack: NFS
[Diagram labels: Application; PASS; NFS; PA-Application; libpass; DPAPI; Syscall API; DPAPI]

Example 5-stack
[Diagram labels: PA-Python Application; PA Python Library; DPAPI; Syscall API; DPAPI; PA-Python Interpreter; PASS; NFS; DPAPI; lib API]

Benefits to Layering
Ability to query across layers.
Access objects by the name that is meaningful to the user.
Automatic association between names at different layers.
Associate related objects named differently.
Extensible data model.

Outline
Provenance disclosure and integration
Layering and provenance
Parting remarks

Lessons Learned (1)
Guidelines for making applications or systems provenance-aware:
– Identify what provenance you want to collect.
    Create objects as necessary using dpapi_mkobj.
    Accumulate provenance records for those objects.
– Replace read calls with dpapi_read calls.
– Replace write calls with dpapi_write calls.
– Use cross-references to relate objects.
– If necessary, export DPAPI to higher layers.
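
The read/write replacement and the cross-reference guideline can be sketched as a provenance-aware copy. This is a toy model under stated assumptions: `sk_paread` and `sk_pawrite` are in-memory stand-ins whose argument lists mirror the paread and pawrite prototypes on the DPAPI (detail) slide, while `struct sk_addition`, the `files` table, the "INPUT" attribute name, and `sk_copy` are hypothetical. The real layout of struct dpapi_addition is not shown in the slides.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/types.h>   /* ssize_t */

typedef uint64_t sk_pnode_t;
typedef uint32_t sk_version_t;

/* Hypothetical record layout: an attribute plus a cross-reference. */
struct sk_addition {
    const char *attribute;
    sk_pnode_t xref_pnode;
    sk_version_t xref_version;
};

/* Toy in-memory "files", indexed by descriptor, standing in for PASS. */
struct sk_file {
    char data[64];
    size_t len;
    sk_pnode_t pnode;
    sk_version_t version;
    struct sk_addition prov[4];
    unsigned nprov;
};
static struct sk_file files[2] = {
    { "warp.air contents", 17, 101, 3, {{0}}, 0 },
    { "", 0, 102, 0, {{0}}, 0 },
};

/* Stand-in for paread: returns data plus the identity of what was read. */
ssize_t sk_paread(int fd, void *data, size_t datalen,
                  sk_pnode_t *pnode_ret, sk_version_t *version_ret)
{
    struct sk_file *f = &files[fd];
    size_t n = f->len < datalen ? f->len : datalen;
    memcpy(data, f->data, n);
    *pnode_ret = f->pnode;
    *version_ret = f->version;
    return (ssize_t)n;
}

/* Stand-in for pawrite: stores data together with its provenance records. */
ssize_t sk_pawrite(int fd, const void *data, size_t datalen,
                   const struct sk_addition *records, unsigned numrecords)
{
    struct sk_file *f = &files[fd];
    unsigned i;
    if (datalen > sizeof f->data)
        return -1;
    memcpy(f->data, data, datalen);
    f->len = datalen;
    for (i = 0; i < numrecords && f->nprov < 4; i++)
        f->prov[f->nprov++] = records[i];
    return (ssize_t)datalen;
}

/* A copy made provenance-aware: the read also returns the source's
 * identity, and the write carries a cross-reference back to it. */
int sk_copy(int src_fd, int dst_fd)
{
    char buf[64];
    sk_pnode_t pnode;
    sk_version_t version;
    struct sk_addition rec;
    ssize_t n = sk_paread(src_fd, buf, sizeof buf, &pnode, &version);
    if (n < 0)
        return -1;
    rec.attribute = "INPUT";
    rec.xref_pnode = pnode;
    rec.xref_version = version;
    return sk_pawrite(dst_fd, buf, (size_t)n, &rec, 1) == n ? 0 : -1;
}
```

Calling `sk_copy(0, 1)` leaves descriptor 1 with the copied bytes and one record pointing at pnode 101, version 3. Because the data and its provenance travel in the same call, the two cannot drift apart between the read and the write.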

Lessons Learned (2)
Application architecture dictates how difficult this is.
– Firefox's modular architecture makes it difficult to have provenance and data flow together through the browser.
APIs are never done.
– DPAPI continues to evolve.
– Added two new calls early in 2009.

Lessons Learned (3)
Differentiating applications from substrates:
– We initially thought that our Python wrappers made Python provenance-aware.
– Instead they enabled provenance-aware Python applications.
– Making Python provenance-aware requires changes to the interpreter -- similar to those to make an operating system provenance-aware.

Making Provenance Ubiquitous
One size does not fit all.
Provenance is useful at all levels of the system:
– Capture semantics of applications.
– Capture execution mode of interpreter.
– Capture system dependencies.
Data and provenance live in a world with many names.

Layering Enables Interoperability
Data objects are the point of interoperability.
– Users exchange or share data, not provenance.
– Users query provenance.
The names people associate with their data must be available in provenance queries.
A layered approach associates names with one another.
Layering enables consistency between provenance and data.

New Layers
We have explored layering in:
– Operating system
– Network-attached storage
– Interpreters
– Language libraries
– Browsers
– Workflow engines (Kepler)
We welcome new layers to our stack:
– Database?

Thank You!
Margo Seltzer
May 13, 2009
Provenance in Secure and Advanced Computer Systems

DPAPI (detail)
int dpapi_freeze(int fd);
int dpapi_mkobj(int reference_fd);
int dpapi_revive_obj(int reference_fd, __pnode_t pnode, version_t version);
ssize_t paread(int fd, void *data, size_t datalen, __pnode_t *pnode_ret, version_t *version_ret);
ssize_t pawrite(int fd, const void *data, size_t datalen, const struct dpapi_addition *records, unsigned numrecords);
int dpapi_sync(int fd);