Presentation is loading. Please wait.

Presentation is loading. Please wait.

An Overview of Polystores

Similar presentations

Presentation on theme: "An Overview of Polystores"— Presentation transcript:

1 An Overview of Polystores
Patrick Valduriez Inria, Montpellier, France Joint work with Boyan Kolev, Carlyna Bondiombouy, Oleksandra Levchenko and Ricardo Jimenez-Peris

2 Cloud & Big Data Landscape
Easy to get lost No "one size fits all" No standard Keeps evolving Data Processing Frameworks NoSQL Databases

3 ? General Problem Very complex, ad-hoc development Hard to extend
NoSQLDS1 Client Application (e.g. in Java) SQL DS3 ? BI Analytics Query NoSQLDS2 Very complex, ad-hoc development Querying different data sources Managing intermediate results Delivering (e.g. sorting) the final results Hard to extend What if a new SQL DS appears?

4 Outline Polystores The CloudMdsQL polystore Query language
Distributed architecture Extending CloudMdsQL with MFR CloudMdsQL contributions

5 Origins of Polystores Multidatabase systems (or federated database systems) A few databases (e.g. less than 10) Corporate DBs Powerful queries (with updates and transactions) Web data integration systems Many data sources (e.g. 1000’s) DBs or files behind a web server Simple queries (read-only) Mediator/wrapper architecture Fourth Edition, 2018

6 Polystores Also called multistore systems
A major topic of research [The Case for Polystores. M. Stonebraker's blog. July 2015] Provide integrated access to multiple, heterogeneous cloud data stores such as NoSQL, HDFS, CEP and RDBMS Great for integrating structured (relational) data and big data But typically trade data store autonomy for performance or work only for certain categories of data stores (e.g. RDBMS and HDFS)

7 Taxonomy of Polystores*
Three kinds Loosely-coupled Similar to mediator/wrapper Common interface Autonomy of data stores, i.e. the ability to be locally controlled (independent of the multistore) Tightly-coupled Exploit local interfaces for efficiency Trade data store autonomy for performance Materialized views, indexes Hybrid Compromise between loosely- and tightly-coupled *C. Bondiombouy, P. Valduriez. Query Processing in Cloud Multistore Systems: an overview. Int. Journal of Cloud Computing,  5(4):  , 2016.

8 Comparisons: functionality
Polystore Objective Data model Query language Data stores Loosely-coupled BigIntegrator (Uppsala U.) Querying relational and cloud data Relational SQL-like BigTable, RDBMS Forward (UC San Diego) Unyfing relational and NoSQL JSON-based SQL++ RDBMS, NoSQL QoX (HP labs) Analytic data flows Graph XML based RDBMS, ETL Tightly-coupled Polybase (Microsoft) Querying Hadoop from RDBMS SQL HDFS, RDBMS HadoopDB (Yale U.) Querying RDBMS from Hadoop SQL-live (HiveQL) Estocada (Inria) Self-tuning No common model Native query languages Hybrid SparkSQL (UCB) SQL atop Spark Nested BigDAWG (MIT) Unifying relational and NoSQL Island query languages, with CAST and SCOPE operators RDBMS, NoSQL, Array DBMS, DSMSs Dans l'état de l'art que j'ai fait, j'ai sélectionné plusieurs systèmes représentatifs pour chaque catégorie

9 Comparisons: implementation
Polystore Special modules Schema mgt Query processing Query optimization Loosely-coupled BigIntegrator (Uppsala U.) Importer, absorber, finalizer LAV Access filters Heuristics Forward (UC San Diego) Query processor GAV Data store capabilities Cost-based QoX (HP Labs) Dataflow engine No Data/ function shipping, operation decomposition Tightly-coupled Polybase (Microsoft) HDFS bridge Query splitting HadoopDB (Yale U.) SMS planer, dbconnector Estocada (Inria) Storage advisor Materialized views View-based query rewriting Hybrid SparkSQL (UCB) Catalyst extensible optimizer Dataframes In-memory caching using columnar storage BigDAWG (MIT) Island query processors GAV within islands Function/ data shipping Deuxième comparaison: techniques d'implémentation

10 The CloudMdsQL Polystore*
A hybrid polystore Context: CoherentPaaS FP7 ( ) Objectives Design an SQL-like query language to query multiple data sources in a cloud Autonomous data stores Design a query engine for that language Fully distributed over a cluster’s nodes Compiler/optimizer To produce efficient query execution plans Design an ultra-scalable transaction manager Le contexte de ma thèse est plus précisément: le multistore CloudMdsQL C'est un multistore hybride *B. Kolev, C. Bondiombouy, P. Valduriez, R. Jiménez-Peris, R. Pau, J. Pereira. The CloudMdsQL Multistore System. SIGMOD 2016.

11 The CloudMdsQL Language*
Functional SQL-like query language Can represent all query building blocks as functions A function can be expressed in one of the DS languages, as a native function E.g. a breadth-first search on a graph DS Function results can be used as input to subsequent functions Functions can transform types and do data-metadata conversion Le langage CloudMdsQL (nous utilisons le même nom pour le langage et système) est fonctionnel Langage fonctionnel: langage où tout est fonction, le résultat d'une fonction devenant l'input de la fonction suivante. Avantage: pas d'effet de bord, facile à optimiser avec des techniques de réécriture. -Autoriser les transformations ( attribute ou relations en données et vice versa) *B. Kolev, P. Valduriez, C. Bondiombouy, R. Jiménez-Peris, R. Pau, J. Pereira. CloudMdsQL: Querying Heterogeneous Cloud Data Stores with a Common Language. Distributed and Parallel Databases, 34(4): , 2016.

12 CloudMdsQL Table Expressions
Named table expression Expression that returns a table representing a nested query [against a data store] Name and Signature (names and types of attributes) Query is executed with a schema on read No need for global schema 3 kinds of table expressions Native named tables Using a data store’s native query mechanism SQL named tables Regular SELECT statements Python named tables Embedded blocks of Python statements that produce relations Expression de table An important notion in Clmd is table expression inspired from XML table views, which is an expression that return as a table.

13 CloudMdsQL Query Example
SQL subquery on PostgreSQL T1(x int, y = ( SELECT x, y FROM A ) T2(x int, z = {* db.B.find( {$lt: {x, 10}}, {x:1, z:1, _id:0} ) *} SELECT T1.x, T2.z FROM T1, T2 WHERE T1.x = T2.x AND T1.y <= 3 Native subquery on MongoDB Integration subquery J'illustre le langage CloudMdsQL par un exemple de requête Dans cet ex. il y a 3 sous requêtes 1 relationnel qui produit une expression de table 1 document, exprimé dans le langage de MongoDB qui produit une expression de table 1 requête d'intégration des tables intermédiaires La ss req de l’expr T1 est réécrite :> intro condition de filtrage y<=3 (ds l’instruction select ) :> reduit la qtte d’info recup en augmentant la selectivite de la ss req

14 Centralized Query Engine
Wrapper DS1 DS1 Query Processor Straighforward M/W architecture => High communication cost DS – QE Little optimization opportunities User Application JDBC Client Query Mediator Wrapper DS2 DS2 CloudMdsQL Query Execution Engine Table Store Connector Table Store

15 Distributed Query Engine
Fully distributed architecture => Many optimization opportunities Let me quickly introduce the architecture of CloudMdsQL query engine in a computer cluster, without going into the details. This architecture is different from that of multidatabase or data integration systems with mediator and wrapper, which need to be centralized at some server. Since we are in the context of a cloud Platform as a Service, we have more control over where the components can be located. Here, the query engine has a fully distributed architecture, which allows query engine nodes to be collocated with data store nodes in a computer cluster. Afficher le carré bleu This fully distributed architecture provides many optimization opportunites, for instance, ship the smallest intermediate result to another node for doing a join.

16 Extending CloudMdsQL with MFR*
Objectives Integration of relational and HDFS data With autonomy of data stores, unlike e.g. Polybase Query data stored in HDFS using a data processing framework (DPF) like Spark or Flink Using powerful functions lile Map, Filter, Reduce, etc. Issues Execute joins between RDBMS and HDFS Extend the CloudMdsQL Query Engine to work with Spark Solution Map-Filter-Reduce (MFR) expression La contribution principale: l'extension de CLoudMdsQL avec les opérateurs MFR *C. Bondiombouy, B. Kolev, O. Levchenko, P. Valduriez. Integrating Big Data and Relational Data with a Functional SQL-like Query Language. DEXA Extended version in SpringerTLDKS journal 9940:48-74, 2016.

17 Motivating Example An editorial office needs to find appropriate reporters for a list of publications based on given keywords Publications (relational data) After MapReduce processing title author kw On cloud storage … charlie cloud storage keyword expert freq cloud alice 2 storage 1 virtual bob app Illustre le probleme -bureau de redaction Posts (HDFS) , alice, storage, cloud , bob, cloud, virtual, app , alice, cloud

18 MFR Expression Works on a dataset, i.e. an abstraction for a set of tuples in a DPF For instance, a Resilient Distributed Dataset in Spark Consists of key-value tuples Using Map-Filter-Reduce operations Map, Filter, Reduce : the main operations Other operations : Scan, FlatMap, Project, …

19 MFR Expression Example
Example: count the words that contain the string ‘cloud’ Dataset .MAP(KEY,1) SCAN(TEXT,’words.txt’) .FILTER( KEY LIKE ‘%cloud%’ ) .REDUCE (SUM) MAP (KEY, 1) : transforms each tuple (which has a single word as its only element) of the input dataset into a key-value pair where the word is mapped to a value of 1.

20 Example Query with MFR Query: retrieve data from RDBMS and HDFS
/* SQL subquery */ T1(title string, kw = ( SELECT title, kw FROM tbl ) /* MFR subquery */ T2(word string, count = {* SCAN(TEXT,'words.txt’) .MAP(KEY,1) .REDUCE(SUM) .PROJECT(KEY,VALUE) *} /* Integration subquery */ SELECT title, kw, count FROM T1 JOIN T2 ON = T2.word WHERE LIKE '%cloud%' L'opérateur PROJECT dans l'instruction MFR fournit un mappage entre les champs de jeu de données et les colonnes d'expression de table nommées.

21 Query Optimization We apply known optimization techniques to reduce execution time and communication costs Selection pushdown inside subqueries Bind join MFR operators reordering How to optimize query?

22 Bind Join - example Query Processor
T1(id int, x = ( SELECT id, x FROM A ) /* SQL subquery */ T2(id int, y = { /* DPF subquery */ SCAN(…).MAP(…).REDUCE(…).PROJECT(KEY, VALUE) *} SELECT T1.x, T2.y /* integration subquery */ FROM T1 BIND JOIN T2 ON = Query Processor RDBMS Wrapper HDFS Wrapper SCAN(…).MAP(…).REDUCE(…) .FILTER(KEY IN (1,3)) .PROJECT(…) SELECT id, x FROM A

23 MFR Rewrite Rules Rules for reordering MFR operators, based on their algebraic properties Focus on permuting FILTER with PROJECT REDUCE MAP Vise à transferer moins de doé et reduire le volume de données

24 Rule PROJECT / FILTER PROJECT (<expr_list>).SELECT(<predicate1>) => FILTER(<predicate2>).PROJECT(<expr_list>) T1(a int, b = {* … .PROJECT (KEY, VALUE[0]) *} SELECT a, b FROM T1 WHERE a > b VALUE is a pair and VALUE[0] is its first element. T1(a int, b ={* … . FILTER(KEY>VALUE[0]).PROJECT(KEY,VALUE[0])*} SELECT a, b FROM T1

25 Rule REDUCE / FILTER REDUCE(<transformation>).FILTER(<predicate>) => FILTER(<predicate>).REDUCE(<transformation>) REDUCE (SUM) .FILTER(KEY LIKE ‘%cloud%) FILTER (KEY LIKE ‘%cloud%’) . REDUCE (SUM)

Rule MAP / FILTER MAP(<expr_list>).FILTER(<predicate1>) => FILTER(<predicate2>).MAP(<expr_list>) MAP(VALUE [0], KEY) .FILTER(KEY > VALUE) FILTER (VALUE [0] > KEY) . MAP (VALUE [0], KEY)

27 Map/Filter/Reduce => Spark
We need to translate MFR operators to Spark operators map flatMap reduceByKey aggregateByKey filter -How to translate MFR operator (CloudMdsQL) to Spark? (done by the DPF wrapper)

28 CloudMdsQL Contributions
Advantage Relieves users from building complex client/server applications in order to access multiple data stores Innovation Adds value by allowing arbitrary code/native query to be embedded To preserve the expressivity of each data store’s query mechanism Provision for traditional distributed query optimization Validation With 10 different data stores, including SQL, NoSQL and Spark Transfer to the Leanxcale startup

Download ppt "An Overview of Polystores"

Similar presentations

Ads by Google