Distributed Query Processing over Streaming and Stored Heterogeneous Data Sources
  Alasdair J G Gray
  Information Management Group, University of Manchester
  Seminar at University of Cardiff, 26 April 2010
Overview of the Talk
  Motivation: SemSorGrid4Env
  Forms of heterogeneity
    –Data source
    –Data semantics
  Query processing
    –SNEEql
    –SNEE-DQP
    –SPARQL STR
SemSorGrid4Env
  Semantic sensor Grids for rapid application development for environmental monitoring
  Coastal and estuarine flood warning
  Fire monitoring and warning
Estuarine Flood Warning
  Sensors deployed along UK South coast
    –On and off shore
    –Bespoke hardware
    –Fixed functionality
    –Fixed data rate
        No bursts
    –Central distribution centre
  Multitude of related data sources
    –Shipping
    –Flood defenses
    –Flooding models
    –…
Fire Detection
  Deployment: Castilla y León, Northwest Spain
    –Forested region
  Wireless sensor network
    –Off-the-shelf sensor nodes
        TMoteSky
        TinyOS
    –Configure dynamically: ad hoc queries
    –In-network query processing
    –Controlled rate variability
  Satellite image data for the region
Abstract Problem
  [Diagram: an Integrator over sensor networks, stored data, a stored data service and a streaming data service]
Types of Heterogeneity
  [Diagram: the Integrator over sensor networks, stored data, a stored data service and a streaming data service, annotated with: data source, data stream, query capabilities, data access, data semantics]
DATA SOURCE HETEROGENEITY
Data Source Characteristics
  Traditional stored data
    –Data stored in a database
    –User observes a static data set
    –One-off query execution
  Streaming data
    –Data processed on-the-fly
        Maybe stored for later access
    –User observes changes in data set
    –Continuous or snapshot query execution
Types of Data Stream
  Acquire-Stream
    Query controls data rate
    Informs query planning
    [Diagram: Stream Processor calls Acquire() on the Source, which returns Data]
  Receive-Stream
    Source controls data rate
    Potentially:
      –Unknown rate
      –Bursty data
    [Diagram: Source sends Data to the Stream Processor]
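The distinction between the two stream types can be illustrated with a minimal Python sketch. The function names, the sample values and the 900-second period are assumptions made for illustration; they are not part of SNEE.

```python
from typing import Callable, Iterator, List, Tuple

Reading = Tuple[int, float]  # (timestamp in seconds, value)


def acquire_stream(sample: Callable[[int], float], period_s: int, n: int) -> Iterator[Reading]:
    """Acquire-stream: the stream processor calls sample() at a rate it chooses."""
    for i in range(n):
        t = i * period_s
        yield (t, sample(t))


def receive_stream(arrivals: List[Reading]) -> Iterator[Reading]:
    """Receive-stream: tuples arrive whenever the source emits them."""
    yield from arrivals


def gaps(readings: List[Reading]) -> List[int]:
    """Seconds between consecutive readings."""
    return [later[0] - earlier[0] for earlier, later in zip(readings, readings[1:])]


# Hypothetical sensor: the processor samples it every 900 s (15 min).
acquired = list(acquire_stream(lambda t: 20.0 + t / 900, period_s=900, n=4))
# Hypothetical bursty source: three tuples in quick succession, then a long gap.
received = list(receive_stream([(3, 1.0), (4, 1.1), (5, 0.9), (1200, 1.4)]))

print(gaps(acquired))  # [900, 900, 900]
print(gaps(received))  # [1, 1, 1195]
```

The acquire-stream has a regular spacing known in advance, which is what a planner can exploit; the receive-stream's spacing is only observable after the fact.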
Streaming Source Access
  Pull Access
    Consumer periodically polls for new data.
    Introduces processing delay
  Push Access
    Publisher sends data as it is produced.
    Minimises processing delays
  Note:
    Orthogonal to data stream type
    Affects physical operator selection
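A small Python sketch of the delay difference between the two access styles follows. The `Source` class, its method names and the logical clock values are hypothetical; they only illustrate why polling adds delay.

```python
from typing import Callable, List, Tuple

Delivery = Tuple[str, int, int]  # (item, produced_at, delivered_at)


class Source:
    """Hypothetical in-memory source offering both access styles."""

    def __init__(self) -> None:
        self._pending: List[Tuple[str, int]] = []
        self._subscribers: List[Callable[[Delivery], None]] = []

    def subscribe(self, callback: Callable[[Delivery], None]) -> None:
        """Push access: register a callback invoked as each item is produced."""
        self._subscribers.append(callback)

    def produce(self, item: str, now: int) -> None:
        """Record the item for pollers and deliver it to subscribers at once."""
        self._pending.append((item, now))
        for callback in self._subscribers:
            callback((item, now, now))

    def poll(self, now: int) -> List[Delivery]:
        """Pull access: hand over everything produced since the last poll."""
        batch = [(item, produced_at, now) for item, produced_at in self._pending]
        self._pending = []
        return batch


def delays(deliveries: List[Delivery]) -> List[int]:
    """Time each item waited between production and delivery."""
    return [delivered_at - produced_at for _, produced_at, delivered_at in deliveries]


source = Source()
pushed: List[Delivery] = []
source.subscribe(pushed.append)

source.produce("a", now=1)
source.produce("b", now=3)
pulled = source.poll(now=5)  # the pull consumer polls at time 5

print(delays(pushed))  # [0, 0]
print(delays(pulled))  # [4, 2]
```

The same source serves both consumers, which reflects the slide's note that the access style is independent of the stream type.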
Query Processing Challenges
  Variety of data sources
    –Stored data
    –Receive-stream
    –Acquire-stream
  No common query semantics
    –Streaming data languages
    –Stored data languages
  Distributed data sources
Query Languages
  Stored (relational) data
    –SQL
  Streaming (relational) data
    –SQL extensions
    –Continuous Query Language (CQL)
    –Sensor NEtwork Engine query language (SNEEql)
Query Language: SQL
  Designed for stored data
  Contains blocking operators
    –Join
    –Aggregates
    –…
  Example system: GSN (Aberer et al, 2007)
    –Key concept: Virtual sensor
        Wraps SQL execution
        Controls periodic evaluation
    –Limits expressiveness
Query Language: CQL (Arasu et al, 2006)
  Designed for receive-streams
  Windows used for blocking operators
  Contains type conversion operators
    –Stream -> Window
    –Window -> Stream
  Semantics defined by implementation
  Example system: STREAM (Arasu et al, 2010?)
    –Data Stream Management System
    –No support for stored data
Query Language: SNEEql (Brenninkmeijer et al, 2008)
  Designed for acquire-streams, receive-streams, and stored data
    –Based on CQL ideas
  Well-defined semantics
    –Independent of system
  Example system: SNEE (Galpin et al, 2009)
    –In-network query evaluation
    –Acquire-streams
    –Reactive/periodic operators
    –Controls network behaviour
SNEEql Query Syntax
  SELECT {RSTREAM | DSTREAM | ISTREAM} + attribute list
  FROM extent list
  WHERE expression
  *STREAM optional
    –Converts a window to a stream
  Extent list:
    –Streams with windows of the form [FROM t1 TO t2 SLIDE s unit]
    –Relations with windows of the form [SCAN EVERY t1 unit]
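The three window-to-stream operators can be sketched in Python using the definitions CQL gives them (Arasu et al, 2006), on which SNEEql is based; SNEEql's own definitions may differ in detail. Treating windows as sets rather than bags, and the type and function names below, are simplifying assumptions.

```python
from typing import List, Set, Tuple

Window = Tuple[int, Set[str]]   # (evaluation time, tuples in the window)
Stream = List[Tuple[int, str]]  # (timestamp, tuple)


def rstream(windows: List[Window]) -> Stream:
    """RSTREAM: emit every tuple in the window at each evaluation."""
    return [(t, x) for t, w in windows for x in sorted(w)]


def istream(windows: List[Window]) -> Stream:
    """ISTREAM: emit tuples that entered the window since the last evaluation."""
    out: Stream = []
    previous: Set[str] = set()
    for t, w in windows:
        out.extend((t, x) for x in sorted(w - previous))
        previous = w
    return out


def dstream(windows: List[Window]) -> Stream:
    """DSTREAM: emit tuples that left the window since the last evaluation."""
    out: Stream = []
    previous: Set[str] = set()
    for t, w in windows:
        out.extend((t, x) for x in sorted(previous - w))
        previous = w
    return out


# Three successive evaluations of a hypothetical window.
windows = [(0, {"a"}), (1, {"a", "b"}), (2, {"b"})]

print(rstream(windows))  # [(0, 'a'), (1, 'a'), (1, 'b'), (2, 'b')]
print(istream(windows))  # [(0, 'a'), (1, 'b')]
print(dstream(windows))  # [(2, 'a')]
```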
An Example from Hydrology
  Investigating water drainage in the Peak District
    –Hilly terrain
    –Peat bogs
    –River in valley bottom
  WSN measuring
    –Rainfall and
    –River depth
  WSN schema:
    –river (rain: int, depth: int)
        Sites (5, 6, 7, 9)
    –hilltop (rain: int)
        Sites (4)
Example Multi-source Query
  Every 15 minutes, and within 24 hours of their being taken, we wish to obtain time-correlated measurements of the river depth now and the rainfall at the top of the hill 15 minutes before, provided that it is now raining less at the river than it was on the hilltop, and that the rainfall on the hilltop was above 5mm and at least the average rainfall.
  SELECT RSTREAM r.time, h.rain, r.depth
  FROM River[NOW] r, Hilltop[AT NOW-15 MINUTES] h
  WHERE h.rain > 5
    AND r.rain < h.rain
    AND h.rain >= (SELECT AVG(weather.rain)
                   FROM Weather [rescan every day]
                   WHERE weather.region = 'Peak District');
  Acquisition rate = 15 min; Max delivery time = 24 hours;
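To make the time correlation in this query concrete, here is a minimal Python sketch that applies the same three predicates to toy data. All readings, the single river reading per tick (the real deployment has several river sites) and the fixed average standing in for the Weather subquery are assumptions for illustration.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical readings keyed by minutes since the start of the run.
river: Dict[int, Dict[str, int]] = {
    0: {"rain": 7, "depth": 40},
    15: {"rain": 3, "depth": 42},
    30: {"rain": 2, "depth": 47},
    45: {"rain": 1, "depth": 50},
}
hilltop: Dict[int, Dict[str, int]] = {0: {"rain": 9}, 15: {"rain": 4}, 30: {"rain": 8}}

# Stands in for the result of the subquery over the stored Weather relation.
average_rain = 6.0


def evaluate(now: int) -> Optional[Tuple[int, int, int]]:
    """Join River[NOW] with Hilltop[AT NOW-15 MINUTES] and apply the predicates."""
    r = river.get(now)
    h = hilltop.get(now - 15)
    if r is None or h is None:
        return None
    if h["rain"] > 5 and r["rain"] < h["rain"] and h["rain"] >= average_rain:
        return (now, h["rain"], r["depth"])  # r.time, h.rain, r.depth
    return None


results: List[Tuple[int, int, int]] = []
for now in range(0, 60, 15):  # one evaluation per 15-minute acquisition
    row = evaluate(now)
    if row is not None:
        results.append(row)

print(results)  # [(15, 9, 42), (45, 8, 50)]
```

At minute 0 there is no hilltop reading from 15 minutes earlier, and at minute 30 the earlier hilltop rainfall (4) fails the `h.rain > 5` test, so only two rows are produced.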
SNEE-DQP Query Stack
  Metadata
    –Logical schema
    –Physical schema
  Source Allocation
    –Splitting the query into parts for each data source
  Source Planning
    –Physical operator selection
    –Generate plan for source
  [Diagram: a SNEEql query + QoS, together with Metadata, passes through Parsing, Logical Planning, Source Allocation and Source Planning to produce a Query Execution Plan]
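As a rough illustration of the source allocation step, the following Python sketch groups the extents of a query by the data source that hosts them. The physical schema contents, the source names and the `allocate` function are hypothetical and do not reflect SNEE-DQP's actual data structures.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical physical schema: which source hosts each extent.
PHYSICAL_SCHEMA: Dict[str, str] = {
    "river": "peak-district-wsn",
    "hilltop": "peak-district-wsn",
    "weather": "weather-db",
}


def allocate(extents: List[str], physical_schema: Dict[str, str]) -> Dict[str, List[str]]:
    """Group the extents a query mentions by the source that must evaluate them."""
    parts: Dict[str, List[str]] = defaultdict(list)
    for extent in extents:
        parts[physical_schema[extent.lower()]].append(extent)
    return dict(parts)


allocation = allocate(["River", "Hilltop", "Weather"], PHYSICAL_SCHEMA)
print(allocation)
# {'peak-district-wsn': ['River', 'Hilltop'], 'weather-db': ['Weather']}
```

Each group would then be handed to source planning, where operators suited to that source are chosen.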
Example: Query Plan
  SELECT RSTREAM r.time, h.rain, r.depth
  FROM River [NOW] r, Hilltop [AT NOW - 15 MINUTES] h
  WHERE h.rain > 5
    AND r.rain < h.rain
    AND h.rain >= (SELECT AVG(weather.rain)
                   FROM Weather [RESCAN every day]
                   WHERE weather.region = 'Peak District');
  [Query plan diagram. Operators shown:
    DELIVER
    EXCHANGE
    JOIN h.rain >= AVG(weather.rain)
    JOIN river.rain < hilltop.rain
    TIME_WINDOW [t-15, t-15, 15]
    ACQUIRE [time, rain] rain > 5 hilltop EVERY 15 min
    ACQUIRE [time, rain, depth] true river EVERY 15 min
    AVERAGE (rain)
    SCAN [rain] region = 'Peak District' weather EVERY DAY]
In-Network SNEE
  Two-phase DQP
    –Single-site Push /
    –Multi-site
  Steiner tree routing to reduce energy
  Operator placement
    –Location sensitive
    –Reduce transmission
  Data buffering
  Time division agenda
  nesC code generated
  [Diagram: the SNEE compilation stack. Steps: parsing/type checking, translation/rewriting, algorithm assignment, routing, partitioning, where-scheduling, when-scheduling, code generation. Intermediate forms: abstract-syntactic tree, logical-algebraic form, physical-algebraic form (PAF), routing tree (RT), fragmented-algebraic form, distributed-algebraic form (DAF), agenda, nesC code. The steps are grouped into a single-site phase and a multi-site phase.]
In-Network Query Planning
  SELECT r.time, h.rain, r.depth
  FROM River [NOW] r, Hilltop [AT NOW - 15 MINUTES] h
  WHERE h.rain > 5
    AND r.rain < h.rain;
  [Query plan diagram. Operators shown:
    EXCHANGE
    JOIN river.rain < hilltop.rain
    TIME_WINDOW [t-15, t-15, 15]
    ACQUIRE [time, rain] rain > 5 hilltop EVERY 15 min
    ACQUIRE [time, rain, depth] true river EVERY 15 min]
Query Routing Tree
Partitioning/Where-Scheduled
  Fragments and the sites they are scheduled on:
    F1: {0}
    F2: {4}
    F3: {4}
    F4: {5, 6, 7, 9}
  [Query plan diagram split into fragments F1 to F4. Operators shown:
    DELIVER
    EXCHANGE
    JOIN river.rain < hilltop.rain
    TIME_WINDOW [t-15, t-15, 15]
    ACQUIRE [time, rain] rain > 5 hilltop EVERY 15 min
    ACQUIRE [time, rain, depth] true river EVERY 15 min
  Legend: intra-fragment exchanges, intra-node exchange, multi-hop exchange, relay node]
Query Agenda
  Time          Node6  Node3  Node9  Node7  Node5  Node4  Node2  Node0
  0:00:00.000   F4_1          F4_1   F4_1   F4_1   F3_1
  0:15:00.000   F4_2          F4_2   F4_2   F4_2   F3_2
  0:30:00.000   F4_3          F4_3   F4_3   F4_3   F3_3
  …             …      …      …      …      …      …      …      …
  8:30:00.000   F4_35         F4_35  F4_35  F4_35  F3_35
  8:45:00.000   F4_36         F4_36  F4_36  F4_36  F3_36
  8:45:00.064   tx3    rx6    tx7    rx9
  8:45:00.658          tx5                  rx3
  8:45:01.250                        tx5    rx7
  8:45:02.439                               tx4    rx5
  8:45:04.783                                      F2
  8:45:04.908                                      tx2    rx4
  8:45:07.095                                             tx0    rx2
  8:45:07.158                                                    F1
  buffering factor = 36
  maximum delivery time = 8:45:
  acquisition rate = 15 m
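The timings in this agenda follow from the acquisition rate and the buffering factor, as the short Python calculation below shows. The variable names are illustrative, and reading the F1 entry as the start of delivery is an assumption.

```python
from datetime import timedelta

acquisition_interval = timedelta(minutes=15)
buffering_factor = 36  # acquisitions buffered before results are sent

# The first acquisition is at 0:00:00, so the 36th starts 35 intervals later.
last_acquisition = (buffering_factor - 1) * acquisition_interval

# Start of fragment F1 (delivery at node 0), as read from the agenda.
f1_start = timedelta(hours=8, minutes=45, seconds=7, milliseconds=158)

# Time spent relaying and joining after the last acquisition.
transmission_phase = f1_start - last_acquisition

# QoS bound stated with the example query.
max_delivery_time = timedelta(hours=24)

print(last_acquisition)    # 8:45:00
print(transmission_phase)  # 0:00:07.158000
print(f1_start <= max_delivery_time)  # True
```

Buffering 36 readings means the radio is used for only about seven seconds in every cycle, while the oldest buffered reading still reaches the delivery point well within the 24-hour bound.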
SNEE now and future
  SNEE Now
    In-network SNEE
      –Acquire-streams
      –Quality of service aware
          Expected lifetime
          Total energy consumption
          Delivery time
          Acquisition rate
      –Runs on:
          Simulators
          TMoteSkys
          TinyNodes
    Out-of-network SNEE
      –Receive-streams
      –Pull-based data sources
  SNEE Future
    SNEE-DQP (within 2010)
      –Combine in-network and out-of-network versions
      –Stored relations
      –Push-based data sources
    Beyond 2010
      –Model building inside queries
      –Greater resilience for in-network execution
SEMANTIC HETEROGENEITY
A Data Integration Approach
  Heterogeneous sources
    –Autonomous
    –Local schemas
  Homogeneous view
    –Mediated global schema
  Mapping
    –Local-as-View
    –Global-as-View
  Relies on agreement of a common global schema
  [Diagram: Query 1 … Query n are posed against a Global Schema, which Mappings relate to wrapped databases DB 1 … DB i … DB k]
P2P Integration Approach
  Heterogeneous sources
    –Autonomous
    –Local schemas
  Heterogeneous views
    –Multiple schemas
  Mappings
    –From sources to common schema
    –Between pairs of schemas
  Require common integration data model
  [Diagram: Query 1 … Query n are posed against Schema 1 … Schema j, which Mappings relate to each other and to wrapped databases DB 1 … DB i … DB k]
Semantic Integrator
  [Diagram: the Semantic Integrator sits above SNEE-DQP. Components: Query Translator, Data Translator, Query Resolver, Data Resolver, S2O Mappings, O2O Mappings. Queries Q, Q' and q flow down to streaming sources and stored data; tuples flow back up as the answers [[q]], [[Q']] and [[Q]].]
P2P Integration Approach
  Heterogeneous sources
    –Autonomous
    –Local schemas
  Heterogeneous views
    –Multiple schemas
  Mappings
    –From sources to common schema
    –Between pairs of schemas
  Require common integration data model
  Can RDF do this?
  [Diagram: Query 1 … Query n are posed against Schema 1 … Schema j, which Mappings relate to each other and to wrapped databases DB 1 … DB i … DB k]
CAN RDB2RDF TOOLS FEASIBLY EXPOSE LARGE SCIENCE ARCHIVES FOR DATA INTEGRATION?
  A word of warning …
RDB2RDF: Two Approaches
  Extract-Transform-Load
    Data replicated as RDF
      –Data can become stale
    Native SPARQL query support
      –Limited optimisation mechanisms
    Existing RDF stores
        Jena
        Sesame
  Query-driven Conversion
    Data stored as relations
    Native SQL query support
      –Highly optimised access methods
    SPARQL queries must be translated
    Existing translation systems
        D2RQ
        SquirrelRDF
Experiment
  Time query evaluation
    –Astronomy data set ~500MB
    –5 real queries
    –No joins
  Systems compared:
    –Relational DB
        MySQL v
    –RDB2RDF tools
        D2RQ v0.5.2
        SquirrelRDF v0.1
    –RDF Triple stores
        Jena v2.5.6 (SDB)
        Sesame v2.1.3 (Native)
  [Diagram: three configurations. SQL query to Relational DB; SPARQL query to RDB2RDF tool to Relational DB; SPARQL query to Triple store]
Performance Results
  [Chart: query evaluation times for the compared systems]
The Show Stopper: Query Translation
  Each bound variable resulted in a self-join
    –RDBMS cannot optimize for this
    –RDBMS perform badly with self-joins
  Each row retrieved with a separate query
    –1 query becomes n queries, where n is cardinality of the relation
  Predicate selection in RDB2RDF tool
    –No optimization possible
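One way such self-joins can arise is sketched below in Python with the standard `sqlite3` module. The `star` table, the `triples` view and the `naive_translate` function are hypothetical; this is not how D2RQ or SquirrelRDF actually generate SQL, only an illustration of a translation that uses one alias of a triple view per triple pattern.

```python
import sqlite3
from typing import List

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE star (id INTEGER PRIMARY KEY, ra REAL, decl REAL, mag REAL);
INSERT INTO star VALUES (1, 10.0, -5.0, 12.3), (2, 11.5, 2.0, 9.8), (3, 200.1, 45.0, 15.0);
-- Expose the relation as (subject, predicate, object) triples.
CREATE VIEW triples AS
    SELECT id AS s, 'ra' AS p, ra AS o FROM star
    UNION ALL SELECT id, 'decl', decl FROM star
    UNION ALL SELECT id, 'mag', mag FROM star;
""")


def naive_translate(predicates: List[str]) -> str:
    """One alias of the triple view per triple pattern, joined on the subject."""
    columns = ", ".join(f"t{i}.o AS {p}" for i, p in enumerate(predicates))
    sql = f"SELECT t0.s AS id, {columns} FROM triples t0"
    for i in range(1, len(predicates)):
        sql += f" JOIN triples t{i} ON t{i}.s = t0.s"
    conditions = " AND ".join(f"t{i}.p = '{p}'" for i, p in enumerate(predicates))
    return f"{sql} WHERE {conditions} ORDER BY id"


translated_sql = naive_translate(["ra", "decl", "mag"])
translated_rows = conn.execute(translated_sql).fetchall()

# The equivalent hand-written SQL reads each row once, with no joins.
direct_rows = conn.execute("SELECT id, ra, decl, mag FROM star ORDER BY id").fetchall()

print(translated_sql.count(" JOIN "))  # 2
print(translated_rows == direct_rows)  # True
```

Both queries return the same rows, but the translated form needs two self-joins for three variables where the native SQL needs none.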
CAN RDB2RDF TOOLS FEASIBLY EXPOSE LARGE SCIENCE ARCHIVES FOR DATA INTEGRATION?
  NOT CURRENTLY!
  More work needed on query translation… (Gray et al, 2009)
Conclusions
  Query-based access to distributed data sources, both streaming and stored
  SNEEql and SNEE-DQP overcome data source heterogeneity
  SPARQL STR and Semantic Integrator overcome semantic heterogeneity
Manchester Team
  RAs
    Ixent Galpin
    Alasdair J. G. Gray
  PhDs
    Christian Y. A. Brenninkmeijer
    Farhana Jabeen
  Academics
    Alvaro A. A. Fernandes
    Norman W. Paton
  MSc Students
    Jamil Naja
    Varadarajan Rajagopalan
Acknowledgements
  This work was/is funded by
    –UK EPSRC through
        DIAS-MC project
        Explicator project (University of Glasgow)
    –European Commission as part of the SemSorGrid4Env project
  SNEE is released under a permissive open source license; please visit:
References
  1. K. Aberer, M. Hauswirth, and A. Salehi. Infrastructure for data processing in large-scale interconnected sensor networks. In MDM 2007, pp198–205, 2007.
  2. A. Arasu, B. Babcock, S. Babu, J. Cieslewicz, M. Datar, K. Ito, R. Motwani, U. Srivastava, and J. Widom. STREAM: The Stanford data stream management system. In M. Garofalakis, J. Gehrke, and R. Rastogi, editors, Data Stream Management: Processing High-Speed Data Streams. Springer, (to appear).
  3. A. Arasu, S. Babu, and J. Widom. The CQL continuous query language: Semantic foundations and query execution. VLDB Journal 15(2):121–142, 2006.
  4. C. Y. A. Brenninkmeijer, I. Galpin, A. A. A. Fernandes, and N. W. Paton. A semantics for a query language over sensors, streams and relations. In BNCOD 25, pp87–99, 2008.
  5. I. Galpin, C. Y. A. Brenninkmeijer, F. Jabeen, A. A. A. Fernandes, and N. W. Paton. Comprehensive optimization of declarative sensor network queries. In SSDBM 2009, pp339–360, 2009.
  6. A. J. G. Gray, N. Gray, and I. Ounis. Can RDB2RDF tools feasibily expose large science archives for data integration? In ESWC 2009, pp491–505, 2009.