Distributed Query Processing over Streaming and Stored Heterogeneous Data Sources Alasdair J G Gray Information Management Group University of Manchester.

1 Distributed Query Processing over Streaming and Stored Heterogeneous Data Sources Alasdair J G Gray Information Management Group University of Manchester Seminar at University of Cardiff 26 April 2010

2 Overview of the Talk
Motivation: SemSorGrid4Env
Forms of heterogeneity
–Data source
–Data semantics
Query processing
–SNEEql
–SNEE-DQP
–SPARQL STR

3 SemSorGrid4Env
Semantic Sensor Grids for Rapid Application Development for Environmental Monitoring
Coastal and estuarine flood warning
Fire monitoring and warning

4 Estuarine Flood Warning
Sensors deployed along UK South coast
–On and off shore
–Bespoke hardware
–Fixed functionality
–Fixed data rate (no bursts)
–Central distribution centre
Multitude of related data sources
–Shipping
–Flood defenses
–Flooding models
–…

5 Fire Detection
Deployment: Castilla y León, Northwest Spain
–Forested region
Wireless sensor network
–Off-the-shelf sensor nodes (TMoteSky, TinyOS)
–Configure dynamically: ad hoc queries
–In-network query processing
–Controlled rate variability
Satellite image data for the region

6 Abstract Problem
[Diagram: an integrator over stored data, sensor networks, a stored data service, and a streaming data service]

7 Types of Heterogeneity
[Diagram: the integrator diagram from slide 6, annotated with the labels Data source, Data stream, Query capabilities, Data access, and Data semantics]

8 DATA SOURCE HETEROGENEITY

9 Data Source Characteristics
Traditional stored data
–Data stored in a database
–User observes a static data set
–One-off query execution
Streaming data
–Data processed on-the-fly (maybe stored for later access)
–User observes changes in data set
–Continuous or snap-shot query execution

10 Types of Data Stream
Acquire-Stream
–Query controls data rate
–Informs query planning
Receive-Stream
–Source controls data rate
–Potentially: unknown rate, bursty data
[Diagram: for an acquire-stream, the stream processor calls Acquire() on the source and the source returns data; for a receive-stream, the source sends data to the stream processor]
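
A minimal Python sketch of the two stream types, using simulated timestamps rather than real sensors. The names `acquire_stream`, `receive_stream`, and `read_sensor` are hypothetical and are not part of SNEE; the sketch only shows which side decides when a tuple is produced.

```python
from typing import Callable, Iterable, Iterator, Tuple

Reading = Tuple[int, float]  # (timestamp in minutes, value)


def acquire_stream(acquire: Callable[[int], float],
                   ticks: Iterable[int]) -> Iterator[Reading]:
    """Acquire-stream: the query side picks the instants and calls acquire()."""
    for t in ticks:
        yield (t, acquire(t))


def receive_stream(arrivals: Iterable[Reading]) -> Iterator[Reading]:
    """Receive-stream: the source picks the instants; we take what arrives."""
    for reading in arrivals:
        yield reading


def read_sensor(t: int) -> float:
    """Hypothetical sensor returning a made-up value for time t."""
    return t / 10.0


# The query asks for one reading every 15 minutes, so the rate is known
# in advance and can inform planning.
acquired = list(acquire_stream(read_sensor, range(0, 45, 15)))

# The source delivers when it likes: irregular gaps and a small burst.
received = list(receive_stream([(1, 0.3), (7, 0.9), (8, 1.1), (30, 0.2)]))
```

In the first case the timestamps are fixed by the query; in the second they are whatever the source produced.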

11 Streaming Source Access
Pull Access
–Consumer periodically polls for new data
–Introduces processing delay
Push Access
–Publisher sends data as it is produced
–Minimises processing delays
Note:
–Orthogonal to data stream type
–Affects physical operator selection
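
A small worked example of the delay claim, assuming a consumer that polls at fixed multiples of a period and a push channel with no transmission cost. Both functions and the sample timestamps are illustrative assumptions, not measurements from any system in the talk.

```python
from typing import List


def pull_delays(produced_at: List[int], poll_period: int) -> List[int]:
    """Waiting time of each item until the next poll instant (pull access)."""
    delays = []
    for t in produced_at:
        # Round t up to the next multiple of poll_period (integer ceiling).
        next_poll = -(-t // poll_period) * poll_period
        delays.append(next_poll - t)
    return delays


def push_delays(produced_at: List[int]) -> List[int]:
    """With push access an item is forwarded as soon as it is produced."""
    return [0 for _ in produced_at]


produced = [1, 5, 10, 12]              # production times in seconds
pull = pull_delays(produced, poll_period=5)
push = push_delays(produced)
```

An item produced just after a poll waits almost a full period, which is the processing delay the slide attributes to pull access.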

12 Query Processing Challenges
Variety of data sources
–Stored data
–Receive-stream
–Acquire-stream
No common query semantics
–Streaming data languages
–Stored data languages
Distributed data sources

13 Query Languages
Stored (relational) data
–SQL
Streaming (relational) data
–SQL extensions
–Continuous Query Language (CQL)
–Sensor NEtwork Engine query language (SNEEql)

14 Query Language: SQL
Designed for stored data
Contains blocking operators
–Join
–Aggregates
–…
Example system: GSN (Aberer et al, 2007)
–Key concept: virtual sensor (wraps SQL execution, controls periodic evaluation)
–Limits expressiveness

15 Query Language: CQL (Arasu et al, 2006)
Designed for receive-streams
Windows used for blocking operators
Contains type conversion operators
–Stream -> Window
–Window -> Stream
Semantics defined by implementation
Example system: STREAM (Arasu et al, 2010?)
–Data Stream Management System
–No support for stored data

16 Query Language: SNEEql (Brenninkmeijer et al, 2008)
Designed for acquire-streams, receive-streams, and stored data
–Based on CQL ideas
Well-defined semantics
–Independent of system
Example system: SNEE (Galpin et al, 2009)
–In-network query evaluation
–Acquire-streams
–Reactive/periodic operators
–Controls network behaviour

17 SNEEql Query Syntax
SELECT {RSTREAM | DSTREAM | ISTREAM} + attribute list
FROM extent list
WHERE expression
*STREAM optional
–Converts a window to a stream
Extent list:
–Streams with windows of the form [FROM t1 TO t2 SLIDE s unit]
–Relations with windows of the form [SCAN EVERY t1 unit]
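
The *STREAM keywords correspond to the relation-to-stream operators of CQL, on whose ideas SNEEql is based. The sketch below shows their usual CQL meaning over two successive window contents. It assumes set rather than bag semantics and an inclusive time window; the helper names and sample readings are hypothetical, and the slide does not say whether SNEEql matches CQL in every detail.

```python
from typing import List, Set, Tuple

Row = Tuple[int, int]  # (timestamp in minutes, value)


def window(tuples: List[Row], now: int, start: int, end: int) -> Set[Row]:
    """Rows with timestamp in [now + start, now + end] (assumed inclusive)."""
    return {(ts, v) for ts, v in tuples if now + start <= ts <= now + end}


def rstream(prev: Set[Row], curr: Set[Row]) -> List[Row]:
    """RSTREAM: emit every row of the current window."""
    return sorted(curr)


def istream(prev: Set[Row], curr: Set[Row]) -> List[Row]:
    """ISTREAM: emit rows that entered the window since the last evaluation."""
    return sorted(curr - prev)


def dstream(prev: Set[Row], curr: Set[Row]) -> List[Row]:
    """DSTREAM: emit rows that left the window since the last evaluation."""
    return sorted(prev - curr)


readings = [(0, 3), (15, 6), (30, 4), (45, 8)]

# A window like [FROM NOW-30 TO NOW SLIDE 15 MINUTES], evaluated at 30 and 45.
prev = window(readings, now=30, start=-30, end=0)
curr = window(readings, now=45, start=-30, end=0)

r_out = rstream(prev, curr)
i_out = istream(prev, curr)
d_out = dstream(prev, curr)
```

Here RSTREAM re-emits the whole window at each slide, while ISTREAM and DSTREAM emit only the difference between consecutive windows.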

18 An Example from Hydrology
Investigating water drainage in the Peak District
–Hilly terrain
–Peat bogs
–River in valley bottom
WSN measuring
–Rainfall and
–River depth
WSN schema:
–river (rain: int, depth: int) Sites (5, 6, 7, 9)
–hilltop (rain: int) Sites (4)
[Diagram: network of sensor nodes 0 to 9]

19 Example Multi-source Query
Every 15 minutes, and within 24 hours of their being taken, we wish to obtain time-correlated measurements of the river depth now and the rainfall at the top of the hill 15 minutes before, provided that it is now raining less at the river than it was on the hilltop, and that the hilltop rainfall was above 5mm and no less than the average rainfall.
SELECT RSTREAM r.time, h.rain, r.depth
FROM River[NOW] r, Hilltop[AT NOW-15 MINUTES] h
WHERE h.rain > 5
AND r.rain < h.rain
AND h.rain >= (SELECT AVG(weather.rain) FROM Weather [rescan every day] WHERE weather.region = 'Peak District');
Acquisition rate = 15 min; Max delivery time = 24 hours;
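
To make the query's conditions concrete, here is a Python sketch that evaluates the same predicate over hand-made readings. It is not how SNEE executes the query; the dictionaries, the sample values, and the precomputed `avg_rain` (standing in for the Weather subquery) are all assumptions for illustration.

```python
from typing import Dict, Optional, Tuple

Readings = Dict[int, Dict[str, int]]  # minute -> attribute values


def evaluate(now: int, river: Readings, hilltop: Readings,
             avg_rain: float) -> Optional[Tuple[int, int, int]]:
    """Return (r.time, h.rain, r.depth) if the WHERE clause holds at `now`."""
    r = river.get(now)          # River[NOW]
    h = hilltop.get(now - 15)   # Hilltop[AT NOW-15 MINUTES]
    if r is None or h is None:
        return None
    if h["rain"] > 5 and r["rain"] < h["rain"] and h["rain"] >= avg_rain:
        return (now, h["rain"], r["depth"])
    return None


# Made-up readings, keyed by acquisition time in minutes.
river = {15: {"rain": 2, "depth": 40}, 30: {"rain": 9, "depth": 55}}
hilltop = {0: {"rain": 8}, 15: {"rain": 7}}
avg_rain = 6.0  # stands in for AVG(weather.rain) over the Peak District

results = [evaluate(t, river, hilltop, avg_rain) for t in (15, 30)]
```

At minute 15 all three conditions hold, so a tuple is produced; at minute 30 the river rainfall exceeds the earlier hilltop rainfall, so nothing is emitted.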

20 SNEE-DQP Query Stack
Metadata
–Logical schema
–Physical schema
Source Allocation
–Splitting the query into parts for each data source
Source Planning
–Physical operator selection
–Generate plan for source
[Diagram: a SNEEql query + QoS, together with Metadata, passes through Parsing, Logical Planning, Source Allocation, and Source Planning to yield a Query Execution Plan]

21 Example: Query Plan
SELECT RSTREAM r.time, h.rain, r.depth
FROM River [NOW] r, Hilltop [AT NOW - 15 MINUTES] h
WHERE h.rain > 5
AND r.rain < h.rain
AND h.rain >= (SELECT AVG(weather.rain) FROM Weather [RESCAN every day] WHERE weather.region = 'Peak District');
[Diagram: query plan. Operators shown:
DELIVER
JOIN h.rain >= AVG(weather.rain)
EXCHANGE
JOIN river.rain<hilltop.rain
TIME_WINDOW [t-15, t-15, 15]
ACQUIRE [time,rain] rain > 5 hilltop EVERY 15 min
ACQUIRE [time,rain,depth] true river EVERY 15 min
AVERAGE (rain)
SCAN [rain] region = ‘Peak District’ weather EVERY DAY]

22 In-Network SNEE
Two-phase DQP
–Single-site Push /
–Multi-site Steiner tree routing to reduce energy
Operator placement
–Location sensitive
–Reduce transmission
Data buffering
Time division agenda
nesC code generated
[Diagram: compilation pipeline in eight numbered steps, split into a single-site phase and a multi-site phase. Step labels: parsing/type checking, translation/rewriting, algorithm assignment, routing, partitioning, where-scheduling, when-scheduling, code generation. Artefact labels: abstract-syntactic tree, logical-algebraic form, physical-algebraic form (PAF), routing tree (RT), fragmented-algebraic form, distributed-algebraic form (DAF), agenda, nesC code]

23 In-Network Query Planning
SELECT r.time, h.rain, r.depth
FROM River [NOW] r, Hilltop [AT NOW - 15 MINUTES] h
WHERE h.rain > 5
AND r.rain < h.rain;
[Diagram: query plan. Operators shown:
EXCHANGE
JOIN river.rain<hilltop.rain
TIME_WINDOW [t-15, t-15, 15]
ACQUIRE [time,rain] rain > 5 hilltop EVERY 15 min
ACQUIRE [time,rain,depth] true river EVERY 15 min]

24 Query Routing Tree
[Diagram: routing tree over sensor nodes 0 to 9]

25 Partitioning/Where-Scheduled
Fragments and their sites: F1:{0} F2:{4} F3:{4} F4:{5,6,7,9}
[Diagram: the query plan partitioned into fragments F1 to F4 and overlaid on sensor nodes 0 to 9. Operators shown:
DELIVER
JOIN river.rain<hilltop.rain
TIME_WINDOW [t-15, t-15, 15]
ACQUIRE [time,rain] rain > 5 hilltop EVERY 15 min
ACQUIRE [time,rain,depth] true river EVERY 15 min
EXCHANGE
Legend: intra-fragment exchanges, intra-node exchange, multi-hop exchange, relay node]

26 Query Agenda
Columns: Time, Node 6, Node 3, Node 9, Node 7, Node 5, Node 4, Node 2, Node 0
Each row lists the entries scheduled at that time across the node columns.
0:00:00.000  F4 1, F3 1
0:15:00.000  F4 2, F3 2
0:30:00.000  F4 3, F3 3
…
8:30:00.000  F4 35, F3 35
8:45:00.000  F4 36, F3 36
8:45:00.064  tx3, rx6, tx7, rx9
8:45:00.658  tx5, rx3
8:45:01.250  tx5, rx7
8:45:02.439  tx4, rx5
8:45:04.783  F2
8:45:04.908  tx2, rx4
8:45:07.095  tx0, rx2
8:45:07.158  F1
buffering factor = 36
maximum delivery time = 8:45:07.158
acquisition rate = 15 min
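
The agenda figures are consistent with each other, as the following arithmetic sketch shows. It only re-derives numbers already on the slide (buffering factor 36, one acquisition every 15 minutes, delivery at 8:45:07.158); the function name is hypothetical.

```python
from datetime import timedelta
from typing import List


def acquisition_offsets(buffering_factor: int,
                        interval: timedelta) -> List[timedelta]:
    """Start times of the buffered acquisitions within one agenda cycle."""
    return [i * interval for i in range(buffering_factor)]


offsets = acquisition_offsets(36, timedelta(minutes=15))

# The 36th acquisition starts 35 intervals after the first one.
last_acquisition = offsets[-1]

# Time left for transmitting and processing the buffered tuples.
delivery = timedelta(hours=8, minutes=45, seconds=7, milliseconds=158)
flush_time = delivery - last_acquisition
```

With 36 readings buffered, the last acquisition falls at 8:45:00 and the remaining 7.158 seconds cover the radio transmissions and the F2 and F1 fragments.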

27 SNEE now and future
SNEE Now
In-network SNEE
–Acquire-streams
–Quality of service aware (expected lifetime, total energy consumption, delivery time, acquisition rate)
–Runs on: simulators, TMoteSkys, TinyNodes
Out-of-network SNEE
–Receive-streams
–Pull-based data sources
SNEE Future
SNEE-DQP (within 2010)
–Combine in-network and out-of-network versions
–Stored relations
–Push-based data sources
Beyond 2010
–Model building inside queries
–Greater resilience for in-network execution

28 SEMANTIC HETEROGENEITY

29 A Data Integration Approach
Heterogeneous sources
–Autonomous
–Local schemas
Homogeneous view
–Mediated global schema
Mapping
–Local-as-View
–Global-as-View
Relies on agreement of a common global schema
[Diagram: queries 1 to n are posed against a global schema, which is linked by mappings to wrappers 1 to k over databases DB 1 to DB k]

30 P2P Integration Approach
Heterogeneous sources
–Autonomous
–Local schemas
Heterogeneous views
–Multiple schemas
Mappings
–From sources to common schema
–Between pairs of schemas
Require common integration data model
[Diagram: queries 1 to n are posed against schemas 1 to j, which are linked by mappings to wrappers 1 to k over databases DB 1 to DB k]

31 Semantic Integrator
[Diagram: the Semantic Integrator, comprising a Query Translator, Data Translator, Query Resolver, and Data Resolver, sits on top of SNEE-DQP, which accesses streaming sources and stored data. Labels: queries Q, Q’, q; results [[Q]], [[Q’]], [[q]]; tuples returned by the sources; S2O mappings; O2O mappings]

32 P2P Integration Approach
Heterogeneous sources
–Autonomous
–Local schemas
Heterogeneous views
–Multiple schemas
Mappings
–From sources to common schema
–Between pairs of schemas
Require common integration data model
Can RDF do this?
[Diagram: queries 1 to n are posed against schemas 1 to j, which are linked by mappings to wrappers 1 to k over databases DB 1 to DB k]

33 CAN RDB2RDF TOOLS FEASIBLY EXPOSE LARGE SCIENCE ARCHIVES FOR DATA INTEGRATION?
A word of warning …

34 RDB2RDF: Two Approaches
Extract-Transform-Load
Data replicated as RDF
–Data can become stale
Native SPARQL query support
–Limited optimisation mechanisms
Existing RDF stores
–Jena
–Sesame
Query-driven Conversion
Data stored as relations
Native SQL query support
–Highly optimised access methods
SPARQL queries must be translated
Existing translation systems
–D2RQ
–SquirrelRDF

35 Experiment
Time query evaluation
–Astronomy data set (~500MB)
–5 real queries
–No joins
Systems compared:
–Relational DB: MySQL v5.1.25
–RDB2RDF tools: D2RQ v0.5.2, SquirrelRDF v0.1
–RDF triple stores: Jena v2.5.6 (SDB), Sesame v2.1.3 (Native)
[Diagram: three setups: an SQL query on a relational DB; a SPARQL query through an RDB2RDF tool onto a relational DB; a SPARQL query on a triple store]

36 Performance Results
[Chart of performance results. Data labels as extracted: 3,450 5,339 21,492 485,932 2,733 7,229 4,090 1,307 17,793 7,468 19,984 372,561]

37 The Show Stopper: Query Translation
Each bound variable resulted in a self-join
–RDBMS cannot optimize for this
–RDBMS perform badly with self-joins
Each row retrieved with a separate query
–1 query becomes n queries, where n is cardinality of the relation
Predicate selection in RDB2RDF tool
–No optimization possible

38 CAN RDB2RDF TOOLS FEASIBLY EXPOSE LARGE SCIENCE ARCHIVES FOR DATA INTEGRATION? NOT CURRENTLY!
More work needed on query translation… (Gray et al, 2009)

39 Conclusions
Query-based access to distributed data sources, both streaming and stored
SNEEql and SNEE-DQP overcome data source heterogeneity
SPARQL STR and Semantic Integrator overcome semantic heterogeneity

40 Manchester Team
RAs
–Ixent Galpin
–Alasdair J. G. Gray
PhDs
–Christian Y. A. Brenninkmeijer
–Farhana Jabeen
Academics
–Alvaro A. A. Fernandes
–Norman W. Paton
MSc Students
–Jamil Naja
–Varadarajan Rajagopalan

41 Acknowledgements
This work was/is funded by
–UK EPSRC through the DIAS-MC project and the Explicator project (University of Glasgow)
–European Commission as part of the SemSorGrid4Env project
SNEE is released under a permissive open source license; please visit: http://code.google.com/p/snee/

42 References
1. K. Aberer, M. Hauswirth, and A. Salehi. Infrastructure for data processing in large-scale interconnected sensor networks. In MDM 2007, pp198–205, 2007.
2. A. Arasu, B. Babcock, S. Babu, J. Cieslewicz, M. Datar, K. Ito, R. Motwani, U. Srivastava, and J. Widom. STREAM: The Stanford data stream management system. In M. Garofalakis, J. Gehrke, and R. Rastogi, editors, Data Stream Management: Processing High-Speed Data Streams. Springer, (to appear).
3. A. Arasu, S. Babu, and J. Widom. The CQL continuous query language: Semantic foundations and query execution. VLDB Journal 15(2):121–142, 2006.
4. C. Y. A. Brenninkmeijer, I. Galpin, A. A. A. Fernandes, and N. W. Paton. A semantics for a query language over sensors, streams and relations. In BNCOD 25, pp87–99, 2008.
5. I. Galpin, C. Y. A. Brenninkmeijer, F. Jabeen, A. A. A. Fernandes, and N. W. Paton. Comprehensive optimization of declarative sensor network queries. In SSDBM 2009, pp339–360, 2009.
6. A. J. G. Gray, N. Gray, and I. Ounis. Can RDB2RDF tools feasibily expose large science archives for data integration? In ESWC 2009, pp491–505, 2009.
