Towards Low Overhead Provenance Tracking in Near Real-Time Stream Filtering
Nithya N. Vijayakumar, Beth Plale
DDE Lab, Indiana University {nvijayak,



Project Description
Provenance collection in stream filtering systems
Identify unique challenges posed by stream filtering systems to provenance tracking
Low overhead data model and collection model that addresses these challenges

Outline
Stream filtering systems
Challenges posed by stream filtering systems
Current provenance solutions applied to streams
Proposed provenance data model
Low overhead provenance collection model
Calder stream processing system
Implementation of provenance models in Calder
Application in LEAD
Future work

Stream filtering systems
Data driven systems that accept events in real time
– appropriate when data is continuously generated
– data stream is an indefinite sequence of time ordered events
Filter (query, user defined application)
– a processing unit that takes one or more event sequences as input and generates a new event sequence as output
– queries with well-defined language or customized application code
– long running and associated with a lifetime
Applications
– monitoring, stock ticks in financial applications, performance measurements in network monitoring and traffic management, sensor data, scientific datasets
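The filter definition above (event sequences in, a new event sequence out) can be sketched as a lazy generator. This is a minimal illustration only, not Calder code: the `Event` fields, the `stream_filter` name, and the sample readings are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator


@dataclass(frozen=True)
class Event:
    """Hypothetical event shape: a timestamp, a source name, and one value."""
    timestamp: float
    source: str
    value: float


def stream_filter(events: Iterable[Event],
                  predicate: Callable[[Event], bool]) -> Iterator[Event]:
    """Lazily yield the events that satisfy the predicate, in arrival order."""
    for event in events:
        if predicate(event):
            yield event


# A finite list stands in for an indefinite, time ordered stream.
readings = [
    Event(0.0, "sensor-a", 18.5),
    Event(1.0, "sensor-a", 31.2),
    Event(2.0, "sensor-a", 29.9),
    Event(3.0, "sensor-a", 35.0),
]
hot = list(stream_filter(readings, lambda e: e.value > 30.0))
```

Because the generator never materializes its input, the same function would work on an unbounded source; only the `list(...)` call in the usage lines assumes a finite stream.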

Challenges posed by stream filtering systems
Identifying provenance entities
– atomic unit? event / stream / source
Capturing stream filtering conditions with low overhead
– distributed environment
– environmental and configuration changes
Maintaining relevance with non-persistent data
– trace back source of events long after being derived
Dynamic accuracy estimation
– quality of service guarantees for derived streams
– provenance across streams
– deduce accuracy of derived streams

Current provenance solutions applied to streams: What is the challenge?
Representing provenance for stream entities using Virtual Data Grid system
– indefinite sequence of time ordered datasets
– non-persistent data events
– need accountability more than reproducibility
Provenance collection using PASOA or Karma
– provenance to be collected for each stream and for the filters executed on streams
– communication between components of the stream filtering system is less important than the entities themselves

Current provenance solutions applied to streams (contd…)
Logging environmental conditions using Log4j
– non-trivial load on the service
– aggregating provenance traces difficult
Augmenting accuracy and lineage using Trio
– lineage cannot be associated with datasets
– need to trace the accuracy of a set of events long after the stream is generated

Provenance data model: What to track?
Atomic units
– streams generated outside the system (base streams)
– declarative queries or application code that executes continuously (adaptive filters)
– streams generated by executing adaptive filters on base and derived streams (derived streams)
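The three atomic units can be written down as plain record types. The sketch below is an assumption-laden illustration: the class names, field names, and identifiers such as `B1`, `Q1`, and `D1` are hypothetical and do not come from the Calder implementation.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class BaseStream:
    """A stream generated outside the system."""
    stream_id: str
    source: str


@dataclass(frozen=True)
class AdaptiveFilter:
    """A continuously executing declarative query or application code."""
    filter_id: str
    definition: str  # query text, or a reference to application code


@dataclass(frozen=True)
class DerivedStream:
    """A stream produced by running an adaptive filter on other streams."""
    stream_id: str
    filter_id: str
    input_ids: Tuple[str, ...]  # ids of base and/or derived input streams


Entity = Union[BaseStream, AdaptiveFilter, DerivedStream]


def describe(entity: Entity) -> str:
    """Return a one-line description of any of the three atomic units."""
    if isinstance(entity, BaseStream):
        return f"base stream {entity.stream_id} from {entity.source}"
    if isinstance(entity, AdaptiveFilter):
        return f"adaptive filter {entity.filter_id}"
    inputs = ", ".join(entity.input_ids)
    return f"derived stream {entity.stream_id} = {entity.filter_id}({inputs})"


base = BaseStream("B1", "radar-site")
query = AdaptiveFilter("Q1", "SELECT * FROM B1 WHERE value > 30")
derived = DerivedStream("D1", "Q1", ("B1",))
```

Note that individual events are deliberately absent from the model: provenance attaches to streams and filters, which is what keeps the amount of tracked state small.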

Provenance data model: How to store it?
Provenance stack
– base provenance information and a list of changes
– latest information identified by timestamp and is current from that point onwards
Provenance tree
– derived stream refers to provenance of input streams (base and derived) + adaptive filters
– provenance can refer to annotations outside the system (SAM)
Store the provenance history (compressed or uncompressed) of streams and filters
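One possible reading of the stack and tree is sketched below: a stack holds base information plus timestamped changes, each current from its timestamp onwards, and a tree node links a derived stream to the provenance of its inputs and filter. All names (`ProvenanceStack`, `ProvenanceNode`, `lineage`) and the sample values are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class ProvenanceStack:
    """Base provenance plus a time ordered list of changes."""
    base: Dict[str, str]
    changes: List[Tuple[float, Dict[str, str]]] = field(default_factory=list)

    def push(self, timestamp: float, delta: Dict[str, str]) -> None:
        """Record a change that is current from `timestamp` onwards."""
        if self.changes and timestamp < self.changes[-1][0]:
            raise ValueError("changes must be pushed in timestamp order")
        self.changes.append((timestamp, delta))

    def at(self, timestamp: float) -> Dict[str, str]:
        """Return the provenance in effect at the given time."""
        state = dict(self.base)
        for changed_at, delta in self.changes:
            if changed_at > timestamp:
                break
            state.update(delta)
        return state


@dataclass
class ProvenanceNode:
    """Tree node: an entity, its stack, and the entities it was derived from."""
    entity_id: str
    stack: ProvenanceStack
    inputs: List["ProvenanceNode"] = field(default_factory=list)


def lineage(node: ProvenanceNode) -> List[str]:
    """Depth-first list of entity ids reachable from `node`, without repeats."""
    seen, order = set(), []

    def visit(current: ProvenanceNode) -> None:
        if current.entity_id in seen:
            return
        seen.add(current.entity_id)
        order.append(current.entity_id)
        for parent in current.inputs:
            visit(parent)

    visit(node)
    return order


b1 = ProvenanceNode("B1", ProvenanceStack({"owner": "foo"}))
b1.stack.push(100.0, {"status": "down"})
q1 = ProvenanceNode("Q1", ProvenanceStack({"kind": "query"}))
d1 = ProvenanceNode("D1", ProvenanceStack({"kind": "derived"}), inputs=[b1, q1])
```

Storing only deltas means a quiet stream costs one base record, while `at` can still reconstruct the state that applied to events seen at any earlier time.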

Low overhead provenance collection model
Base provenance
– collected from user when registering a stream/filter
– document the available information (inputs, filters, rate, sources, etc.)
– store system and user defined metadata as name value pairs in base provenance information
– base provenance can be updated by the user
Dynamic provenance
– subset of a stream identified by a starting timestamp and ending timestamp
– changes logged with starting timestamp current from then on
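A rough sketch of the two collection paths follows: base provenance is captured once at registration as name value pairs, and dynamic provenance is an append-only log of changes stamped with a starting time. The `ProvenanceCollector` class, its method names, and the sample metadata are hypothetical; the real service interface is not described in the slides.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class ProvenanceRecord:
    base: Dict[str, str]
    # Each change is (starting timestamp, name, value).
    changes: List[Tuple[float, str, str]] = field(default_factory=list)


class ProvenanceCollector:
    def __init__(self) -> None:
        self._records: Dict[str, ProvenanceRecord] = {}

    def register(self, entity_id: str, **metadata: object) -> None:
        """Collect base provenance once, when a stream or filter is registered."""
        pairs = {name: str(value) for name, value in metadata.items()}
        self._records[entity_id] = ProvenanceRecord(base=pairs)

    def update_base(self, entity_id: str, name: str, value: str) -> None:
        """Let the user revise a base provenance entry."""
        self._records[entity_id].base[name] = value

    def log_change(self, entity_id: str, start: float,
                   name: str, value: str) -> None:
        """Log a dynamic change that is current from `start` onwards."""
        self._records[entity_id].changes.append((start, name, value))

    def base(self, entity_id: str) -> Dict[str, str]:
        return dict(self._records[entity_id].base)

    def intervals(self, entity_id: str,
                  name: str) -> List[Tuple[float, Optional[float], str]]:
        """Turn the change log for `name` into (start, end, value) intervals.

        An end of None means the value is still current.
        """
        entries = sorted((start, value)
                         for start, changed, value
                         in self._records[entity_id].changes
                         if changed == name)
        result = []
        for index, (start, value) in enumerate(entries):
            end = entries[index + 1][0] if index + 1 < len(entries) else None
            result.append((start, end, value))
        return result


collector = ProvenanceCollector()
collector.register("B1", owner="foo", rate=2)
collector.update_base("B1", "owner", "bar")
collector.log_change("B1", 100.0, "status", "down")
collector.log_change("B1", 160.0, "status", "up")
```

The ending timestamp of a stream subset is never stored here; it is inferred from the next change, so the collector writes one small entry per change rather than one per event.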

A simple example
[Figure: provenance example for a temperature feed. Labels: Temperature Feed; D0010; Q0099; B0011; D0005; owner foo; permissions open to everyone; 13:00:00 Feb; :34:56 Feb; B0011 down; Sampling 0.85]

Calder stream processing system
Distributed processing of streams
Service oriented access to data streams
SQL based rule-action support
Extends OGSA-DAI v6 GDS to streaming resources
Synchronous and asynchronous data delivery
[Architecture diagram labels: Users/Application; Queries/Requests; Result data; Data Management Subsystem; Stream Grid Data Service; Query Planning Service; Stream Rowset Service; Provenance Service; Monitoring Service; Calder Pub-sub system; Data Streams; Computation Node Running Query Processing Engine]

Calder Query Execution

Provenance collection in Calder
[Diagram labels: Query Planner Service; Monitoring Service; Provenance Service; Computation nodes; XML Database; Query execution plan updates; Monitoring updates; Subscribe to receive event of interest; Provenance Queries/Updates; Provenance Results; Provenance Propagation]
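The diagram suggests that the provenance service subscribes to events of interest and receives monitoring and query plan updates over a publish-subscribe channel. The in-process sketch below illustrates that pattern only; the `PubSub` class, the topic names, and the event fields are assumptions, not the Calder pub-sub API.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List

Handler = Callable[[Dict[str, str]], None]


class PubSub:
    """Tiny in-process stand-in for a publish-subscribe system."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, event: Dict[str, str]) -> None:
        # .get avoids creating empty entries for topics nobody listens to.
        for handler in list(self._subscribers.get(topic, [])):
            handler(event)


class ProvenanceService:
    """Records only the updates it subscribed to (hypothetical topic names)."""

    def __init__(self, bus: PubSub) -> None:
        self.log: List[Dict[str, str]] = []
        bus.subscribe("monitoring.update", self._on_event)
        bus.subscribe("query.plan.update", self._on_event)

    def _on_event(self, event: Dict[str, str]) -> None:
        self.log.append(dict(event))


bus = PubSub()
service = ProvenanceService(bus)
bus.publish("monitoring.update", {"entity": "B1", "status": "down"})
bus.publish("query.plan.update", {"entity": "Q1", "node": "compute-2"})
bus.publish("stream.data", {"entity": "B1", "value": "31.2"})  # not subscribed
```

Subscribing to updates rather than to the data streams themselves is what keeps the load on the provenance service proportional to configuration changes, not to event volume.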

Application in LEAD
Radar meta-data is sent through pub-sub system
User submits filter query
Calder executes filter query on incoming data streams
Filtered datasets are processed using data mining algorithms (MDA & ADaM)
Triggers (WS-Notifications) sent to workflows that invoke forecast models
Provenance tracking will help in understanding why and when a trigger was sent

Future work
Complex Event Processing
– processing multiple streams
– identifying global behavior
Context Management
– informative search based on past usage
– predicting system characteristics
– managing profiles for users and dynamic system configuration

Thank you
Questions and Feedback Welcome!
Nithya Vijayakumar