Provenance-aware Storage Systems
Kiran-Kumar Muniswamy-Reddy, David A. Holland, Uri Braun, Margo Seltzer
Harvard University


Provenance-aware storage systems (PASS)
- Provenance (lineage) is the ownership history of an object
- In a FS context, provenance is “a description of the execution history that produced a persistent object”
- Queries of provenance can answer questions like:
  - Who is using my dataset?
  - On whose data does my result depend?
- Two possible approaches
  - Disclosed provenance
    - Depend on apps and users to record provenance
    - Rich semantic knowledge
  - Observed provenance
    - System transparently records, maintains provenance data
    - Little semantic knowledge, but is gathered for all workloads
- Authors implemented a PASS filesystem (PASTA) which automatically gathers provenance in a UNIX environment

Provenance records
- For each file in the filesystem, record
  - The executable that created it
  - Any input files
  - “Complete” hardware platform description
  - Command line
  - Process environment
  - Other data such as random seeds
- Example: > sort a > b

  FILE      /b
  ARGV      sort a
  NAME      /bin/sort, /bin/cat
  INPUT     Pnode # of a
  OPENNAME  /lib/i686/libc.so.6, /usr/share/locale/locale.alias, /etc/mtab, /proc/meminfo, .#prov.mtab, …
  ENV       PWD=/pass USER=root …
  KERNEL    Linux …
  MODULE    …
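The record layout above can be pictured as a small data structure. The following is a minimal sketch, not the PASS on-disk format: the class name `ProvenanceRecord`, its field names, and the Python types are assumptions chosen to mirror the attribute names on the slide, and only a subset of the OPENNAME entries is filled in.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ProvenanceRecord:
    """Hypothetical in-memory view of one file's provenance record."""
    file: str                                            # FILE: object being described
    argv: List[str]                                      # ARGV: command line
    name: List[str]                                      # NAME: executables involved
    input: List[str]                                     # INPUT: references to input objects
    openname: List[str] = field(default_factory=list)    # OPENNAME: other files opened
    env: Dict[str, str] = field(default_factory=dict)    # ENV: process environment
    kernel: str = ""                                     # KERNEL: kernel description
    modules: List[str] = field(default_factory=list)     # MODULE: loaded modules


# Record for `sort a > b`, using only values visible on the slide.
record = ProvenanceRecord(
    file="/b",
    argv=["sort", "a"],
    name=["/bin/sort", "/bin/cat"],
    input=["pnode of a"],
    openname=["/lib/i686/libc.so.6", "/etc/mtab", "/proc/meminfo"],
    env={"PWD": "/pass", "USER": "root"},
    kernel="Linux",
)
```

Keeping INPUT as a reference to another object's pnode, rather than a copy of its record, is what lets ancestry be followed from file to file.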

Tasks enabled by PASS
- Script generation
  - Generate a Makefile that reproduces a file
- Detecting system changes
  - Compare provenance of two files to detect changes in environment, libraries, etc.
- Intrusion detection
  - Detailed logs of how objects have changed
- Retrieving compile-time flags
  - In case you forgot how you compiled something
- Build debugging
  - Avoid needing to “make clean” after any change
- Understand system dependencies
  - e.g., objects depended on /bin/mount because libc reads the mount table frequently
  - User can manually choose files to be ignored by PASTA
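Detecting system changes amounts to comparing two provenance records field by field. Here is a rough sketch of that comparison, assuming records are available as plain dictionaries; the function `diff_provenance` and the sample values (library paths, command line) are made up for illustration and are not output from PASS.

```python
def diff_provenance(old: dict, new: dict) -> dict:
    """Return {field: (old_value, new_value)} for every field that differs."""
    changes = {}
    for key in sorted(set(old) | set(new)):
        if old.get(key) != new.get(key):
            changes[key] = (old.get(key), new.get(key))
    return changes


# Two hypothetical records for the same build output, produced at different times.
before = {"ARGV": "cc -O2 -c a.c", "OPENNAME": ["/lib/libc.so.6"], "ENV": {"USER": "root"}}
after = {"ARGV": "cc -O2 -c a.c", "OPENNAME": ["/lib/libc.so.7"], "ENV": {"USER": "root"}}

changed = diff_provenance(before, after)
print(changed)
```

In this made-up case only OPENNAME differs, which would point at a library change rather than a change in the command line or environment.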

PASS Implementation
- Collector kernel module intercepts syscalls and generates provenance records
  - Per-process provenance information kept in memory
  - Records are written to disk
- Duplicate elimination
  - Coalesce entries from repeated syscalls
- Versions
  - Filesystem data is not versioned but provenance records are
- Node merging for cycle elimination
  - Merge the provenance of sets of processes that produce cycles
- Approx 5000 lines of in-kernel code
  - Not including in-kernel Berkeley DB
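One way to picture node merging for cycle elimination is to collapse every strongly connected component of the dependency graph into a single node. The sketch below does this with Kosaraju's algorithm. This is an assumption about how merging could work, not the collector's actual algorithm: the slides only say that the provenance of processes producing a cycle is merged, whereas this simplification merges every node on the cycle, files included. All function and node names are hypothetical.

```python
from collections import defaultdict


def strongly_connected_components(graph):
    """Kosaraju's algorithm. graph maps node -> set of nodes it depends on.
    Returns a dict mapping each node to a representative of its component."""
    nodes = set(graph) | {m for succs in graph.values() for m in succs}
    # Pass 1: record nodes in order of DFS completion (iterative DFS).
    order, seen = [], set()
    for root in sorted(nodes):
        if root in seen:
            continue
        seen.add(root)
        stack = [(root, iter(sorted(graph.get(root, ()))))]
        while stack:
            node, successors = stack[-1]
            nxt = next((m for m in successors if m not in seen), None)
            if nxt is None:
                order.append(node)
                stack.pop()
            else:
                seen.add(nxt)
                stack.append((nxt, iter(sorted(graph.get(nxt, ())))))
    # Pass 2: walk reversed edges in reverse completion order.
    reverse = defaultdict(set)
    for node, succs in graph.items():
        for m in succs:
            reverse[m].add(node)
    component = {}
    for root in reversed(order):
        if root in component:
            continue
        component[root] = root
        todo = [root]
        while todo:
            node = todo.pop()
            for m in reverse[node]:
                if m not in component:
                    component[m] = root
                    todo.append(m)
    return component


def merge_cycles(graph):
    """Collapse each cycle into one merged node; the result is acyclic."""
    component = strongly_connected_components(graph)
    members = defaultdict(set)
    for node, rep in component.items():
        members[rep].add(node)
    label = {rep: "+".join(sorted(group)) for rep, group in members.items()}
    merged = {label[rep]: set() for rep in members}
    for node, succs in graph.items():
        for m in succs:
            a, b = label[component[node]], label[component[m]]
            if a != b:
                merged[a].add(b)
    return merged


# Hypothetical cycle: P1 reads g, g was written by P2, P2 reads f, f was written by P1.
graph = {"P1": {"g"}, "g": {"P2"}, "P2": {"f"}, "f": {"P1"}, "out": {"P1"}}
merged = merge_cycles(graph)
print(merged)
```

After merging, `out` depends on a single combined node, so ancestry queries over the merged graph cannot loop.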

PASTA – the storage layer
- Stacked on ext2 using FiST
  - Not clear why a storage layer is needed
  - Maybe to guarantee that the metadata follows the data?
- In-kernel Berkeley DB (KBDB) stores five provenance tables
  - Provenance: main repository of records
  - Map: map inodes to pnodes
  - Argdata: assign sequence numbers to each command line and environment record
  - Argreverse: reverse index of Argdata
  - Argindex: secondary index of cmdline and environment components to sequence numbers
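The relationship between the Argdata, Argreverse and Argindex tables can be sketched with three dictionaries. This is an illustrative model only: the class `ArgTables`, the key and value directions, and the choice to split components on whitespace are assumptions, and the real tables live in Berkeley DB rather than in Python dicts.

```python
class ArgTables:
    """Hypothetical in-memory stand-in for three of the five provenance tables."""

    def __init__(self):
        self.argdata = {}      # sequence number -> full command line / environment string
        self.argreverse = {}   # full string -> sequence number (reverse of argdata)
        self.argindex = {}     # single component -> set of sequence numbers

    def intern(self, text: str) -> int:
        """Store text once and return its sequence number."""
        if text in self.argreverse:
            return self.argreverse[text]
        seq = len(self.argdata) + 1
        self.argdata[seq] = text
        self.argreverse[text] = seq
        for token in text.split():
            self.argindex.setdefault(token, set()).add(seq)
        return seq

    def lookup_component(self, token: str):
        """Return every stored string that contains the given component."""
        return sorted(self.argdata[s] for s in self.argindex.get(token, ()))


tables = ArgTables()
first = tables.intern("sort a")
second = tables.intern("PWD=/pass USER=root")
again = tables.intern("sort a")          # already stored, so the number is reused
third = tables.intern("sort b")
print(tables.lookup_component("sort"))
```

Storing each distinct string once and referring to it by number keeps repeated command lines and environments from bloating the main Provenance table, and the component index supports queries such as "every command that mentions sort".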

Queries
- Conventional attribute lookup
- Transitive closure of ancestry or descendancy information
- Query tools act on the provenance databases
  - Provenance Explorer allows users to browse the filesystem and make point queries
  - Makefile Generator produces the set of commands that led to a file’s current state
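A transitive-closure query over ancestry can be sketched as a breadth-first walk over INPUT edges, and descendancy is the same walk over the inverted edges. The graph representation (a dict from object to its direct inputs) and the function names are assumptions for illustration; `/bin/gen` and `raw` are hypothetical objects.

```python
from collections import deque


def closure(edges, start):
    """Return every node reachable from start by following edges."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def invert(edges):
    """Turn 'object -> its inputs' into 'object -> objects derived from it'."""
    children = {}
    for node, parents in edges.items():
        for parent in parents:
            children.setdefault(parent, []).append(node)
    return children


# Hypothetical provenance: c was derived from b, b from sort and a, a from gen and raw.
inputs = {"b": ["/bin/sort", "a"], "a": ["/bin/gen", "raw"], "c": ["b"]}

ancestry = closure(inputs, "c")            # "On whose data does my result depend?"
descendancy = closure(invert(inputs), "a")  # "Who is using my dataset?"
print(sorted(ancestry), sorted(descendancy))
```

The `seen` set also makes the walk terminate on graphs that still contain cycles.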

Evaluation: Performance
- Small file microbenchmark:
  - Create, read, write, sync, delete KB files in 100 directories
  - 2X time overhead for small files
- Large file microbenchmark
  - Write then read 100MB sequentially, write then read random 256KB chunks
  - 2-15% time overhead for large files
- Build Linux kernel then generate Makefiles for every resulting file
  - Fast: only 65ms per file thanks to DB index

Evaluation: Provenance growth
- After kernel build, append a comment to N random files and rebuild kernel
- Table columns: N, Size (MB), % files modified, % space growth, % record growth

Evaluation: One user’s experience
- Computational biologist who uses blast, a tool to find regions of similarity between biological sequences
  - One program generates database files, blast does the comparison, then some perl scripts clean it up
- After the workflow, the biologist uses PASS query tools to generate Makefiles with specific commands
- Reports runtime overhead of 1.65%
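Generating a Makefile from provenance can be sketched as a walk from the target back through its inputs, emitting one rule per object whose creating command is known. The record shape (`command`, `inputs`) and the `gen` command are hypothetical, and this is not the actual Makefile Generator from the paper.

```python
def make_rule(target, record):
    """Format one Makefile rule: 'target: deps' followed by a tab-indented command."""
    deps = " ".join(record["inputs"])
    header = f"{target}:" + (f" {deps}" if deps else "")
    return f"{header}\n\t{record['command']}\n"


def generate_makefile(records, target):
    """Emit a rule for target first, then rules for everything it was derived from."""
    rules, done = [], set()

    def visit(name):
        # Objects with no record (e.g. hand-written inputs) get no rule.
        if name in done or name not in records:
            return
        done.add(name)
        rules.append(make_rule(name, records[name]))
        for dep in records[name]["inputs"]:
            visit(dep)

    visit(target)
    return "\n".join(rules)


# Hypothetical provenance for the `sort a > b` example; `gen` is a made-up command.
records = {
    "b": {"command": "sort a > b", "inputs": ["a"]},
    "a": {"command": "gen > a", "inputs": []},
}
makefile = generate_makefile(records, "b")
print(makefile)
```

Emitting the requested target first makes it the default goal when the generated file is run with make.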

Prototype capabilities and limitations
- Collects and maintains provenance without a priori workload knowledge
- Cannot generate provenance for files from non-provenance-aware machines
- No security and access control
  - e.g., an employee review should be readable by the employee, but includes input from colleagues that should be private
  - Future work
- Simple query capabilities

Research challenges
- Security model
- Cycle-breaking
- Provenance pruning
  - e.g., when deleting a file with long chains of pnodes
- Integrate with other provenance-gathering apps, systems
- Network-aware PASS systems
- Integrate with file versioning

Lingering questions…
- Overhead for system files?
  - Collect provenance for system daemons?
- Deeper evaluation of provenance time/space costs over time
- Provenance in aging filesystems
- User studies
  - Who wants this?