Distributed Query Processing over Streaming and Stored Heterogeneous Data Sources Alasdair J G Gray Information Management Group University of Manchester.

Presentation transcript:

Distributed Query Processing over Streaming and Stored Heterogeneous Data Sources
Alasdair J G Gray, Information Management Group, University of Manchester
Seminar at University of Cardiff, 26 April 2010

Overview of the Talk
Motivation: SemSorGrid4Env
Forms of heterogeneity
  – Data source
  – Data semantics
Query processing
  – SNEEql
  – SNEE-DQP
  – SPARQL STR

SemSorGrid4Env
Semantic sensor Grids for rapid application development for environmental monitoring
  – Coastal and estuarine flood warning
  – Fire monitoring and warning

Estuarine Flood Warning
Sensors deployed along UK South coast
  – On and off shore
  – Bespoke hardware
  – Fixed functionality
  – Fixed data rate (no bursts)
  – Central distribution centre
Multitude of related data sources
  – Shipping
  – Flood defenses
  – Flooding models
  – …

Fire Detection
Deployment: Castilla y León, Northwest Spain
  – Forested region
Wireless sensor network
  – Off-the-shelf sensor nodes: TMoteSky, TinyOS
  – Configure dynamically: ad hoc queries
  – In-network query processing
  – Controlled rate variability
Satellite image data for the region

Abstract Problem
[Figure: architecture diagram with the components Integrator, Streaming data service, Stored data service, Sensor Network (two instances), and Stored data]

Types of Heterogeneity
Data source
  – Data stream
  – Query capabilities
  – Data access
Data semantics
[Figure: the same architecture diagram, with the components Integrator, Streaming data service, Stored data service, Sensor Network (two instances), and Stored data]

DATA SOURCE HETEROGENEITY

Data Source Characteristics
Traditional stored data
  – Data stored in a database
  – User observes a static data set
  – One-off query execution
Streaming data
  – Data processed on-the-fly (maybe stored for later access)
  – User observes changes in data set
  – Continuous or snapshot query execution

Types of Data Stream
Acquire-stream
  – Query controls data rate
  – Informs query planning
Receive-stream
  – Source controls data rate
  – Potentially: unknown rate, bursty data
[Figure: for an acquire-stream, the stream processor calls Acquire() on the source and the source returns data; for a receive-stream, the source sends data to the stream processor]
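The difference between the two stream types can be sketched in a few lines of Python. This is only an illustration, not SNEE code: the function names, the constant sensor reading, and the burst sizes are all assumptions made for the example.

```python
import random
from typing import Callable, Iterator, List


def run_acquire_stream(acquire: Callable[[], float], ticks: int) -> List[float]:
    """Acquire-stream: the processor calls acquire() once per tick,
    so the query (not the source) fixes the data rate."""
    return [acquire() for _ in range(ticks)]


def bursty_source(seed: int = 0) -> Iterator[List[float]]:
    """Hypothetical receive-stream source: it alone decides how many
    tuples (0, 1 or 5) it emits in each tick."""
    rng = random.Random(seed)
    while True:
        burst = rng.choice([0, 0, 1, 5])
        yield [rng.random() for _ in range(burst)]


def run_receive_stream(source: Iterator[List[float]], ticks: int) -> List[List[float]]:
    """Receive-stream: the processor can only take whatever arrives per tick."""
    return [next(source) for _ in range(ticks)]


# The acquire-stream yields exactly one reading per tick.
acquired = run_acquire_stream(lambda: 21.5, ticks=4)
# The receive-stream yields one batch per tick, of a size the processor cannot predict.
received = run_receive_stream(bursty_source(), ticks=4)
```

Because the acquire-stream rate is set by the query, a planner can know the input volume in advance; for the receive-stream it has to cope with whatever batch sizes turn up.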

Streaming Source Access
Pull access
  – Consumer periodically polls for new data
  – Introduces processing delay
Push access
  – Publisher sends data as it is produced
  – Minimises processing delays
Note:
  – Orthogonal to data stream type
  – Affects physical operator selection
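A minimal sketch of the two access modes, assuming a hypothetical in-memory Source class (none of these names come from SNEE or any real streaming API). A push subscriber sees each item as soon as it is produced, while a pull consumer sees nothing until its next poll.

```python
from typing import Callable, List


class Source:
    """Hypothetical streaming source that supports both access modes."""

    def __init__(self) -> None:
        self._pending: List[int] = []
        self._subscribers: List[Callable[[int], None]] = []

    def subscribe(self, callback: Callable[[int], None]) -> None:
        """Push access: register a callback invoked on every new item."""
        self._subscribers.append(callback)

    def produce(self, item: int) -> None:
        """Called by the source itself whenever it has new data."""
        self._pending.append(item)
        for callback in self._subscribers:
            callback(item)

    def poll(self) -> List[int]:
        """Pull access: return everything produced since the last poll."""
        items, self._pending = self._pending, []
        return items


source = Source()
pushed: List[int] = []
source.subscribe(pushed.append)

source.produce(1)
source.produce(2)
seen_by_push = list(pushed)   # both items were delivered as they were produced
seen_by_pull = source.poll()  # the puller only sees them now, in one batch
```

The delay a pull consumer adds is bounded by its polling period, which is one reason the access mode influences which physical operators a planner picks.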

Query Processing Challenges
Variety of data sources
  – Stored data
  – Receive-stream
  – Acquire-stream
No common query semantics
  – Streaming data languages
  – Stored data languages
Distributed data sources

Query Languages
Stored (relational) data
  – SQL
Streaming (relational) data
  – SQL extensions
  – Continuous Query Language (CQL)
  – Sensor NEtwork Engine query language (SNEEql)

Query Language: SQL
Designed for stored data
Contains blocking operators
  – Join
  – Aggregates
  – …
Example system: GSN (Aberer et al., 2007)
  – Key concept: virtual sensor, which wraps SQL execution and controls periodic evaluation
  – Limits expressiveness

Query Language: CQL (Arasu et al., 2006)
Designed for receive-streams
Windows used for blocking operators
Contains type conversion operators
  – Stream -> Window
  – Window -> Stream
Semantics defined by implementation
Example system: STREAM (Arasu et al., 2010?)
  – Data Stream Management System
  – No support for stored data
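Why windows unblock operators such as aggregates can be shown with a small sketch. It assumes integer timestamps and a window that covers the interval (t - width, t] at each evaluation time t; the function name and parameters are illustrative and are not CQL syntax.

```python
from typing import Iterable, List, Tuple


def sliding_window_avg(
    stream: Iterable[Tuple[int, float]], width: int, slide: int, end: int
) -> List[Tuple[int, float]]:
    """Average the values whose timestamp lies in (t - width, t],
    for each evaluation time t = slide, 2*slide, ... up to end.

    An average over an unbounded stream would never return; restricting
    it to a finite window gives one answer per evaluation time.
    """
    # Materialised here for brevity; a real engine would evict old tuples incrementally.
    tuples = list(stream)
    results = []
    for t in range(slide, end + 1, slide):
        window = [value for ts, value in tuples if t - width < ts <= t]
        if window:
            results.append((t, sum(window) / len(window)))
    return results


readings = [(1, 10.0), (2, 20.0), (3, 30.0), (4, 40.0)]
averages = sliding_window_avg(readings, width=2, slide=2, end=4)
```

With a width of 2 and a slide of 2 the windows do not overlap, so each reading contributes to exactly one average.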

Query Language: SNEEql (Brenninkmeijer et al., 2008)
Designed for acquire-streams, receive-streams, and stored data
  – Based on CQL ideas
Well-defined semantics
  – Independent of system
Example system: SNEE (Galpin et al., 2009)
  – In-network query evaluation
  – Acquire-streams
  – Reactive/periodic operators
  – Controls network behaviour

SNEEql Query Syntax
  SELECT {RSTREAM | DSTREAM | ISTREAM} + attribute list
  FROM extent list
  WHERE expression
*STREAM optional
  – Converts a window to a stream
Extent list:
  – Streams with windows of the form [FROM t1 TO t2 SLIDE s unit]
  – Relations with windows of the form [SCAN EVERY t1 unit]
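The three window-to-stream operators can be sketched as follows. The definitions used here are the ones CQL gives (Arasu et al., 2006): ISTREAM emits tuples that entered the window, DSTREAM emits tuples that left it, and RSTREAM emits the whole window. That SNEEql follows them exactly is an assumption, based only on the statement that it builds on CQL ideas.

```python
from collections import Counter
from typing import List


def istream(previous: List[int], current: List[int]) -> List[int]:
    """Tuples in the current window that were not in the previous one."""
    return sorted((Counter(current) - Counter(previous)).elements())


def dstream(previous: List[int], current: List[int]) -> List[int]:
    """Tuples in the previous window that are no longer in the current one."""
    return sorted((Counter(previous) - Counter(current)).elements())


def rstream(previous: List[int], current: List[int]) -> List[int]:
    """Every tuple in the current window, regardless of the previous one."""
    return sorted(current)


window_before = [1, 2, 3]
window_now = [2, 3, 4]

inserted = istream(window_before, window_now)
deleted = dstream(window_before, window_now)
everything = rstream(window_before, window_now)
```

Counter subtraction is used so that duplicates are handled as in a bag rather than a set.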

An Example from Hydrology
Investigating water drainage in the Peak District
  – Hilly terrain
  – Peat bogs
  – River in valley bottom
WSN measuring
  – Rainfall
  – River depth
WSN schema:
  – river (rain: int, depth: int), sites (5, 6, 7, 9)
  – hilltop (rain: int), sites (4)

Example Multi-source Query
Every 15 minutes, and within 24 hours of their being taken, we wish to obtain time-correlated measurements of the river depth now and the rainfall at the top of the hill 15 minutes before, provided that it is now raining less at the river than it was on the hilltop, that the rainfall on the hilltop was above 5 mm, and that it was at least the average rainfall for the region.

  SELECT RSTREAM r.time, h.rain, r.depth
  FROM River [NOW] r, Hilltop [AT NOW - 15 MINUTES] h
  WHERE h.rain > 5
    AND r.rain < h.rain
    AND h.rain >= (SELECT AVG(weather.rain)
                   FROM Weather [RESCAN EVERY DAY]
                   WHERE weather.region = 'Peak District');

Acquisition rate = 15 min; Max delivery time = 24 hours;
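What the query asks for can be traced by hand on toy data. The sketch below is not how SNEE evaluates it: it assumes one river reading per timestamp (the deployment has several river sites), timestamps in minutes, and made-up values throughout.

```python
from typing import Dict, List, Tuple


def evaluate(
    river: Dict[int, Tuple[int, int]],   # time -> (rain, depth)
    hilltop: Dict[int, int],             # time -> rain
    weather_rain: List[int],             # stored rainfall values for the region
) -> List[Tuple[int, int, int]]:
    """Return (r.time, h.rain, r.depth) for every river reading that joins
    with the hilltop reading taken 15 minutes earlier and passes all
    three predicates of the WHERE clause."""
    average = sum(weather_rain) / len(weather_rain)
    results = []
    for time in sorted(river):
        r_rain, r_depth = river[time]
        h_rain = hilltop.get(time - 15)   # Hilltop [AT NOW - 15 MINUTES]
        if h_rain is None:
            continue
        if h_rain > 5 and r_rain < h_rain and h_rain >= average:
            results.append((time, h_rain, r_depth))
    return results


river = {15: (2, 100), 30: (9, 120), 45: (1, 130)}
hilltop = {0: 8, 15: 4, 30: 7}
weather = [6, 6]

answers = evaluate(river, hilltop, weather)
```

The reading at minute 30 is dropped because the hilltop rainfall 15 minutes earlier was 4, which fails the h.rain > 5 test.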

SNEE-DQP Query Stack
Metadata
  – Logical schema
  – Physical schema
Source Allocation
  – Splitting the query into parts for each data source
Source Planning
  – Physical operator selection
  – Generate plan for source
[Figure: query stack. Inputs: SNEEql query + QoS, and metadata. Stages: Parsing, Logical Planning, Source Allocation, Source Planning. Output: Query Execution Plan]
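Source allocation and the first step of source planning can be sketched as a lookup in the metadata. Everything here is hypothetical: the catalogue layout, the source names, and the rule that acquire-stream extents get an ACQUIRE leaf while stored extents get a SCAN leaf are assumptions for illustration, not the SNEE-DQP implementation.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical physical schema: extent -> (data source, kind of source).
CATALOGUE: Dict[str, Tuple[str, str]] = {
    "river": ("peak-wsn", "acquire-stream"),
    "hilltop": ("peak-wsn", "acquire-stream"),
    "weather": ("met-db", "stored"),
}

# Assumed mapping from kind of source to the leaf operator of its plan.
LEAF_OPERATOR = {"acquire-stream": "ACQUIRE", "stored": "SCAN"}


def allocate(extents: List[str]) -> Dict[str, List[str]]:
    """Source allocation: group the query's extents by the source holding them."""
    parts: Dict[str, List[str]] = defaultdict(list)
    for extent in extents:
        source, _kind = CATALOGUE[extent]
        parts[source].append(extent)
    return dict(parts)


def leaf_operators(extents: List[str]) -> Dict[str, str]:
    """Source planning (first step): pick a leaf operator per extent."""
    return {extent: LEAF_OPERATOR[CATALOGUE[extent][1]] for extent in extents}


query_extents = ["river", "hilltop", "weather"]
allocation = allocate(query_extents)
leaves = leaf_operators(query_extents)
```

Each group in the allocation would then be planned separately for its own source.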

Example: Query Plan

  SELECT RSTREAM r.time, h.rain, r.depth
  FROM River [NOW] r, Hilltop [AT NOW - 15 MINUTES] h
  WHERE h.rain > 5
    AND r.rain < h.rain
    AND h.rain >= (SELECT AVG(weather.rain)
                   FROM Weather [RESCAN EVERY DAY]
                   WHERE weather.region = 'Peak District');

[Figure: query plan. Operators shown:
  DELIVER
  EXCHANGE (two instances)
  JOIN h.rain >= AVG(weather.rain)
  JOIN river.rain < hilltop.rain
  TIME_WINDOW [t-15, t-15, 15]
  ACQUIRE [time, rain] rain > 5 hilltop EVERY 15 min
  ACQUIRE [time, rain, depth] true river EVERY 15 min
  AVERAGE (rain)
  SCAN [rain] region = 'Peak District' weather EVERY DAY]

In-Network SNEE
Two-phase DQP
  – Single-site: Push /
  – Multi-site: Steiner tree routing to reduce energy
Operator placement
  – Location sensitive
  – Reduce transmission
Data buffering
Time division agenda
nesC code generated
[Figure: compilation pipeline, divided into a single-site phase and a multi-site phase. Steps shown: parsing/type checking, translation/rewriting, algorithm assignment, routing, partitioning, where-scheduling, when-scheduling, code generation. Intermediate forms shown: abstract-syntactic tree, logical-algebraic form, physical-algebraic form (PAF), routing tree (RT), fragmented-algebraic form, distributed-algebraic form (DAF), agenda, nesC code]

In-Network Query Planning

  SELECT r.time, h.rain, r.depth
  FROM River [NOW] r, Hilltop [AT NOW - 15 MINUTES] h
  WHERE h.rain > 5 AND r.rain < h.rain;

[Figure: query plan. Operators shown:
  EXCHANGE
  JOIN river.rain < hilltop.rain
  TIME_WINDOW [t-15, t-15, 15]
  ACQUIRE [time, rain] rain > 5 hilltop EVERY 15 min
  ACQUIRE [time, rain, depth] true river EVERY 15 min]

Query Routing Tree
[Figure: routing tree diagram]

Partitioning/Where-Scheduled
[Figure: the query plan split into fragments and assigned to sites.
  Fragment sites: F1:{0}, F2:{4}, F3:{4}, F4:{5,6,7,9}
  Operators shown: DELIVER; EXCHANGE; JOIN river.rain < hilltop.rain; TIME_WINDOW [t-15, t-15, 15]; ACQUIRE [time, rain] rain > 5 hilltop EVERY 15 min; ACQUIRE [time, rain, depth] true river EVERY 15 min
  Legend: intra-fragment exchanges, intra-node exchange, multi-hop exchange, relay node]

Query Agenda
Node columns in the original table, left to right: Node6, Node3, Node9, Node7, Node5, Node4, Node2, Node0

  Time          Scheduled tasks
  0:00:00.000   F4_1, F3_1
  0:15:00.000   F4_2, F3_2
  0:30:00.000   F4_3, F3_3
  …             …
  8:30:00.000   F4_35, F3_35
  8:45:00.000   F4_36, F3_36
  8:45:00.064   tx3, rx6, tx7, rx9
  8:45:00.658   tx5, rx3
  8:45:01.250   tx5, rx7
  8:45:02.439   tx4, rx5
  8:45:04.783   F2
  8:45:04.908   tx2, rx4
  8:45:07.095   tx0, rx2
  8:45:07.158   F1

buffering factor = 36
maximum delivery time = 8:45:
acquisition rate = 15 m
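The timing in the agenda can be checked with a little arithmetic. With a buffering factor of 36 and one acquisition every 15 minutes, starting at 0:00:00, the 36th acquisition falls at 35 × 15 min = 8:45:00, which is the last acquisition row before the radio transmissions begin. The helper name below is an assumption for the example.

```python
from datetime import timedelta
from typing import List


def acquisition_times(rate: timedelta, buffering_factor: int) -> List[timedelta]:
    """Offsets of the buffered acquisitions, the first one at time zero."""
    return [rate * k for k in range(buffering_factor)]


times = acquisition_times(timedelta(minutes=15), buffering_factor=36)

first_acquisition = times[0]
last_acquisition = times[-1]

# The oldest buffered tuple waits at least this long before anything is sent.
minimum_wait_of_oldest_tuple = last_acquisition - first_acquisition
```

Buffering trades delivery delay for fewer radio transmissions: the oldest tuple waits 8 hours 45 minutes before it is sent, which is still inside the 24 hour delivery bound of the example query.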

SNEE Now and Future
SNEE now
  In-network SNEE
    – Acquire-streams
    – Quality of service aware: expected lifetime, total energy consumption, delivery time, acquisition rate
    – Runs on: simulators, TMoteSkys, TinyNodes
  Out-of-network SNEE
    – Receive-streams
    – Pull-based data sources
SNEE future
  SNEE-DQP (within 2010)
    – Combine in-network and out-of-network versions
    – Stored relations
    – Push-based data sources
  Beyond 2010
    – Model building inside queries
    – Greater resilience for in-network execution

SEMANTIC HETEROGENEITY

A Data Integration Approach
Heterogeneous sources
  – Autonomous
  – Local schemas
Homogeneous view
  – Mediated global schema
Mapping
  – Local-as-View
  – Global-as-View
Relies on agreement of a common global schema
[Figure: queries 1 to n posed against a global schema, which is linked by mappings to wrappers 1 to k, each over its own database DB 1 to DB k]
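A Global-as-View mapping defines each global relation as a view over the sources, so a query on the global schema is answered by unfolding that view. The sketch below assumes two made-up sources with different local schemas and a single global relation called rainfall; none of these names come from the talk.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, object]

# Two autonomous sources with their own local schemas (hypothetical data).
SOURCES: Dict[str, List[Row]] = {
    "db1": [{"station": "A", "mm": 3}, {"station": "B", "mm": 9}],
    "db2": [{"site": "C", "rain_mm": 7}],
}

# Global-as-View: the global relation is the union of one view per source.
GAV_MAPPINGS: Dict[str, List[Tuple[str, Callable[[Row], Row]]]] = {
    "rainfall": [
        ("db1", lambda r: {"location": r["station"], "rain": r["mm"]}),
        ("db2", lambda r: {"location": r["site"], "rain": r["rain_mm"]}),
    ]
}


def query_global(relation: str, predicate: Callable[[Row], bool]) -> List[Row]:
    """Unfold the global relation into its source views, then filter."""
    answers = []
    for source, to_global in GAV_MAPPINGS[relation]:
        for local_row in SOURCES[source]:
            global_row = to_global(local_row)
            if predicate(global_row):
                answers.append(global_row)
    return answers


wet_locations = query_global("rainfall", lambda row: row["rain"] > 5)
```

The user only ever sees location and rain; the station/site and mm/rain_mm differences stay hidden behind the mappings, which is the point of agreeing on a global schema.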

P2P Integration Approach
Heterogeneous sources
  – Autonomous
  – Local schemas
Heterogeneous views
  – Multiple schemas
Mappings
  – From sources to common schema
  – Between pairs of schemas
Require common integration data model
[Figure: queries 1 to n posed against schemas 1 to j, which are linked by mappings to each other and to wrappers 1 to k, each over its own database DB 1 to DB k]

Semantic Integrator
[Figure: the Semantic Integrator, containing a Query Translator, Data Translator, Query Resolver, and Data Resolver, placed above SNEE-DQP, which accesses streaming sources and stored data. Labels shown: Q, [[Q]], Q', [[Q']], q, [[q]], Tuples, S2O Mappings, O2O Mappings]

P2P Integration Approach
Heterogeneous sources
  – Autonomous
  – Local schemas
Heterogeneous views
  – Multiple schemas
Mappings
  – From sources to common schema
  – Between pairs of schemas
Require common integration data model
Can RDF do this?
[Figure: queries 1 to n posed against schemas 1 to j, which are linked by mappings to each other and to wrappers 1 to k, each over its own database DB 1 to DB k]

CAN RDB2RDF TOOLS FEASIBLY EXPOSE LARGE SCIENCE ARCHIVES FOR DATA INTEGRATION?
A word of warning …

RDB2RDF: Two Approaches
Extract-Transform-Load
  – Data replicated as RDF (data can become stale)
  – Native SPARQL query support (limited optimisation mechanisms)
  – Existing RDF stores: Jena, Sesame
Query-driven Conversion
  – Data stored as relations
  – Native SQL query support (highly optimised access methods)
  – SPARQL queries must be translated
  – Existing translation systems: D2RQ, SquirrelRDF
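The Extract-Transform-Load route amounts to turning every non-key cell of a relation into one triple. The sketch below shows the idea only; the URI scheme, the example.org base, and the function name are assumptions and do not reflect how Jena or Sesame load data.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, object]


def rows_to_triples(
    table: str, key: str, rows: List[Dict[str, object]],
    base: str = "http://example.org/",
) -> List[Triple]:
    """Emit one (subject, predicate, object) triple per non-key cell.

    The subject is minted from the table name and the row's key value,
    the predicate from the table name and the column name.
    """
    triples: List[Triple] = []
    for row in rows:
        subject = f"{base}{table}/{row[key]}"
        for column, value in row.items():
            if column != key:
                triples.append((subject, f"{base}{table}#{column}", value))
    return triples


stars = [{"id": 1, "ra": 10.5, "mag": 4.2}, {"id": 2, "ra": 80.1, "mag": 6.0}]
triples = rows_to_triples("star", "id", stars)
```

A row with c non-key columns becomes c triples, and the copy goes stale as soon as the relational source changes, which is the drawback noted above.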

Experiment
Time query evaluation
  – Astronomy data set, ~500MB
  – 5 real queries
  – No joins
Systems compared:
  – Relational DB: MySQL v
  – RDB2RDF tools: D2RQ v0.5.2, SquirrelRDF v0.1
  – RDF triple stores: Jena v2.5.6 (SDB), Sesame v2.1.3 (Native)
[Figure: the three configurations compared. A SPARQL query sent through an RDB2RDF tool to a relational DB; a SPARQL query sent to a triple store; an SQL query sent to a relational DB]

Performance Results
[Chart: performance results for the systems compared; the plotted values are not recoverable from the transcript]

The Show Stopper: Query Translation
Each bound variable resulted in a self-join
  – RDBMS cannot optimize for this
  – RDBMSs perform badly with self-joins
Each row retrieved with a separate query
  – 1 query becomes n queries, where n is the cardinality of the relation
Predicate selection in RDB2RDF tool
  – No optimization possible
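How self-joins creep into a translated query can be seen with a deliberately naive translator. This is not the algorithm used by D2RQ or SquirrelRDF: it assumes the data is viewed as a single triples(s, p, o) table and emits one table alias per triple pattern, so a SPARQL query with no joins in the relational sense still becomes an SQL self-join. The table, prefixes, and data are made up, and the string interpolation is for illustration only (it is not safe against injection).

```python
import sqlite3
from typing import Dict, List, Tuple

Pattern = Tuple[str, str, str]  # (subject, predicate, object); variables start with '?'


def bgp_to_sql(patterns: List[Pattern]) -> str:
    """Translate a basic graph pattern into SQL with one alias per pattern."""
    selects: List[str] = []
    wheres: List[str] = []
    first_use: Dict[str, str] = {}
    for i, (s, p, o) in enumerate(patterns):
        alias = f"t{i}"
        wheres.append(f"{alias}.p = '{p}'")
        for column, term in (("s", s), ("o", o)):
            ref = f"{alias}.{column}"
            if not term.startswith("?"):
                wheres.append(f"{ref} = '{term}'")      # constant term
            elif term in first_use:
                wheres.append(f"{ref} = {first_use[term]}")  # shared variable: join condition
            else:
                first_use[term] = ref
                selects.append(f"{ref} AS {term[1:]}")
    tables = ", ".join(f"triples t{i}" for i in range(len(patterns)))
    return f"SELECT {', '.join(selects)} FROM {tables} WHERE {' AND '.join(wheres)}"


# SPARQL: SELECT ?star ?ra ?mag WHERE { ?star ex:ra ?ra . ?star ex:mag ?mag }
sql = bgp_to_sql([("?star", "ex:ra", "?ra"), ("?star", "ex:mag", "?mag")])

connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE triples (s TEXT, p TEXT, o TEXT)")
connection.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [("s1", "ex:ra", "10.5"), ("s1", "ex:mag", "4.2"), ("s2", "ex:ra", "80.1")],
)
rows = connection.execute(sql).fetchall()
connection.close()
```

Asking for two attributes of one star already joins the table with itself once; every further attribute adds another alias, which is the growth pattern the slide warns about.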

CAN RDB2RDF TOOLS FEASIBLY EXPOSE LARGE SCIENCE ARCHIVES FOR DATA INTEGRATION? NOT CURRENTLY!
More work needed on query translation… (Gray et al., 2009)

Conclusions
Query-based access to distributed data sources, both streaming and stored
SNEEql and SNEE-DQP overcome data source heterogeneity
SPARQL STR and Semantic Integrator overcome semantic heterogeneity

Manchester Team
RAs: Ixent Galpin, Alasdair J. G. Gray
PhDs: Christian Y. A. Brenninkmeijer, Farhana Jabeen
Academics: Alvaro A. A. Fernandes, Norman W. Paton
MSc Students: Jamil Naja, Varadarajan Rajagopalan

Acknowledgements
This work was/is funded by
  – UK EPSRC through the DIAS-MC project and the Explicator project (University of Glasgow)
  – European Commission as part of the SemSorGrid4Env project
SNEE is released under a permissive open source license; please visit:

References
1. K. Aberer, M. Hauswirth, and A. Salehi. Infrastructure for data processing in large-scale interconnected sensor networks. In MDM 2007, pp198–205, 2007.
2. A. Arasu, B. Babcock, S. Babu, J. Cieslewicz, M. Datar, K. Ito, R. Motwani, U. Srivastava, and J. Widom. STREAM: The Stanford data stream management system. In M. Garofalakis, J. Gehrke, and R. Rastogi, editors, Data Stream Management: Processing High-Speed Data Streams. Springer, (to appear).
3. A. Arasu, S. Babu, and J. Widom. The CQL continuous query language: Semantic foundations and query execution. VLDB Journal 15(2):121–142, 2006.
4. C. Y. A. Brenninkmeijer, I. Galpin, A. A. A. Fernandes, and N. W. Paton. A semantics for a query language over sensors, streams and relations. In BNCOD 25, pp87–99, 2008.
5. I. Galpin, C. Y. A. Brenninkmeijer, F. Jabeen, A. A. A. Fernandes, and N. W. Paton. Comprehensive optimization of declarative sensor network queries. In SSDBM 2009, pp339–360, 2009.
6. A. J. G. Gray, N. Gray, and I. Ounis. Can RDB2RDF tools feasibily expose large science archives for data integration? In ESWC 2009, pp491–505, 2009.