MapReduce VS Parallel DBMSs


1 MapReduce VS Parallel DBMSs
Presenter: Ran Ding

2 Outline
1. Introduction
2. Where MR wins
3. DBMS “sweet spot” tests
4. Why the parallel DBMS wins
5. Conclusion

3 Introduction---MR
The MapReduce (MR) paradigm has been hailed as a revolutionary new platform for large-scale, massively parallel data access. Hadoop is a well-known example.

4 Introduction---Parallel DBMS
Parallel DBMSs appeared in the mid-1980s, when the Teradata and Gamma projects pioneered a new architectural paradigm based on a cluster of commodity computers.

5 Introduction---Horizontal partitioning
Distributing the rows of a relational table across the nodes of the cluster so they can be processed in parallel.

6 Introduction---DBMS
One benefit is that the system automatically manages the alternative partitioning strategies for the tables involved in a query, such as hash, range, and round-robin.
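
The three partitioning strategies named on this slide can be sketched in a few lines of Python. This is only an illustration, not how any particular DBMS implements them: the function names, the use of `crc32` as the hash, and the sample rows are all assumptions made for the example.

```python
from bisect import bisect_right
from zlib import crc32


def hash_partition(rows, key, n_nodes):
    """Send each row to the node picked by a hash of its key value."""
    nodes = [[] for _ in range(n_nodes)]
    for row in rows:
        # crc32 is stable across runs, unlike the built-in hash() on strings.
        nodes[crc32(str(row[key]).encode()) % n_nodes].append(row)
    return nodes


def range_partition(rows, key, boundaries):
    """boundaries are sorted split points; node 0 holds keys below
    boundaries[0], and the last node holds keys >= boundaries[-1]."""
    nodes = [[] for _ in range(len(boundaries) + 1)]
    for row in rows:
        nodes[bisect_right(boundaries, row[key])].append(row)
    return nodes


def round_robin_partition(rows, n_nodes):
    """Deal rows out to the nodes in turn, ignoring their contents."""
    nodes = [[] for _ in range(n_nodes)]
    for i, row in enumerate(rows):
        nodes[i % n_nodes].append(row)
    return nodes


# Hypothetical table with six rows.
rows = [{"id": i, "amount": 10 * i} for i in range(6)]
by_range = range_partition(rows, "id", [2, 4])
by_turn = round_robin_partition(rows, 3)
by_hash = hash_partition(rows, "id", 3)
```

Hash and range partitioning keep rows with equal keys on the same node, which helps joins and grouping on that key; round-robin only balances row counts.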

7 Introduction---Mapping parallel DBMS onto MapReduce
It is not easy! UDFs (user-defined functions) help, and the grouping step is like GROUP BY in SQL.
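
To make the GROUP BY analogy concrete, here is a minimal single-process sketch of a SQL-style grouped aggregate written as map, shuffle, and reduce steps. The helper names, the `emp` data, and the query in the comment are hypothetical; a real MR framework distributes these phases across nodes.

```python
from collections import defaultdict


def map_phase(records, map_fn):
    """Apply the user's map function to every record, yielding (key, value) pairs."""
    for record in records:
        yield from map_fn(record)


def shuffle(pairs):
    """Group values by key; this plays the role of GROUP BY."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reduce_fn):
    """Apply the user's reduce function to each group; this plays the role of the aggregate."""
    return {key: reduce_fn(key, values) for key, values in groups.items()}


# Roughly: SELECT dept, SUM(salary) FROM emp WHERE salary > 0 GROUP BY dept
emp = [("eng", 100), ("ops", 50), ("eng", 70), ("ops", 0)]


def emit_dept_salary(record):
    dept, salary = record
    if salary > 0:  # the WHERE clause lives inside the map function
        yield dept, salary


totals = reduce_phase(
    shuffle(map_phase(emp, emit_dept_salary)),
    lambda dept, salaries: sum(salaries),
)
```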

8 Where MR wins
1. ETL and “read once” data sets
2. Complex analytics
3. Semi-structured data
4. Quick-and-dirty analyses
5. Limited-budget operations

9 ETL and “read once” data sets
ETL stands for extract-transform-load. An MR system can be considered a general-purpose parallel ETL system. DBMSs may perform the ETL.

10 Complex analytics
Some analyses cannot be structured as single SQL aggregate queries.
MR is a good candidate for them.

11 Semi-structured data
MR systems are good at processing data that is being prepared for loading into a back-end system.
A DBMS requires wide tables with many attributes to hold such data.
Plus, MR-style systems can easily store and process it.

12 Quick-and-dirty analyses
DBMS: the programmer must write the schema and then load the data.
MR: just copy the data!

13 Limited-budget operations
MR is basically open source and free.
Parallel DBMS: huge cost.

14 DBMS “Sweet Spot” Test

15 Why the parallel DBMS wins
1. Repetitive record parsing
2. Compression
3. Pipelining
4. Scheduling
5. Column-oriented storage

16 Repetitive record parsing
In MR, each Map and Reduce task must repeatedly parse and convert string fields into the appropriate type.
In a DBMS, records are parsed once, when the data is initially loaded.
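
The cost difference can be illustrated by counting parse calls. This sketch assumes a hypothetical pipe-delimited record format and stands in for both systems with plain Python functions; it is meant to show the pattern, not to measure real systems.

```python
stats = {"parses": 0}


def parse(line):
    """Convert one text record into typed fields, counting each call."""
    stats["parses"] += 1
    url, visits, revenue = line.split("|")
    return url, int(visits), float(revenue)


raw_lines = ["a.com|3|1.5", "b.com|7|4.0", "c.com|2|0.5"]


def mr_style_query(lines, predicate):
    """Every query starts from raw text, so every query re-parses every record."""
    return [rec for rec in map(parse, lines) if predicate(rec)]


def load(lines):
    """Parse once at load time, as a DBMS does."""
    return [parse(line) for line in lines]


def dbms_style_query(table, predicate):
    """Queries run over already-typed records; no parsing happens here."""
    return [rec for rec in table if predicate(rec)]


# Two queries against raw text.
mr_style_query(raw_lines, lambda r: r[1] > 2)
mr_style_query(raw_lines, lambda r: r[2] > 1.0)
parses_mr = stats["parses"]

# One load, then the same two queries.
stats["parses"] = 0
table = load(raw_lines)
dbms_style_query(table, lambda r: r[1] > 2)
dbms_style_query(table, lambda r: r[2] > 1.0)
parses_dbms = stats["parses"]
```

With three records and two queries, the raw-text path parses six times while the load-once path parses three times, and the gap grows with every additional query.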

17 Compression
It is hard to say...
Commercial DBMSs may use carefully tuned compression algorithms.

18 Pipelining
In a parallel DBMS, data is streamed from producer to consumer; the intermediate data is never written to disk.
In an MR system, the producer writes its result to a local data structure, and consumers read from it.
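
The two styles can be contrasted in a small sketch. Python generators stand in for a streaming operator pipeline, and a temporary JSON-lines file stands in for the materialized intermediate result; the operator names and the data are assumptions made for illustration.

```python
import json
import os
import tempfile


def scan(rows):
    """Producer: yield rows one at a time."""
    for row in rows:
        yield row


def select(rows, min_amount):
    """Filter operator that consumes and produces a stream."""
    return (r for r in rows if r["amount"] >= min_amount)


def pipelined_total(rows, min_amount):
    """Streaming style: each row flows from producer to consumer in memory."""
    return sum(r["amount"] for r in select(scan(rows), min_amount))


def materialized_total(rows, min_amount):
    """Materializing style: write the intermediate result out, then read it back."""
    with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False) as out:
        for r in select(scan(rows), min_amount):
            out.write(json.dumps(r) + "\n")
        path = out.name
    try:
        with open(path) as intermediate:
            return sum(json.loads(line)["amount"] for line in intermediate)
    finally:
        os.remove(path)


rows = [{"id": i, "amount": 10 * i} for i in range(1, 5)]
streamed = pipelined_total(rows, 20)
staged = materialized_total(rows, 20)
```

Both functions return the same answer; the second pays for extra writes and reads, though a materialized intermediate result can also be reused if a consumer fails and has to restart.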

19 Scheduling
In a parallel DBMS, every node knows what it should do.
In an MR system, tasks are scheduled on processing nodes one storage block at a time.

20 Column-oriented storage
Vertica reads only the attributes necessary for solving the user query.
DBMS-X and Hadoop are both row stores.
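
A toy comparison shows why reading only the needed attributes matters. The table, the attribute names, and the "fields read" counter are hypothetical; real column stores work on disk blocks rather than Python lists.

```python
# Row layout: one complete record per entry.
rows = [
    {"url": "a.com", "rank": 5, "duration": 12},
    {"url": "b.com", "rank": 3, "duration": 40},
    {"url": "c.com", "rank": 4, "duration": 7},
]

# Column layout: one list per attribute.
columns = {name: [r[name] for r in rows] for name in rows[0]}


def avg_rank_row_store(rows):
    """A row store fetches whole records even when one attribute is needed."""
    fields_read = 0
    total = 0
    for record in rows:
        fields_read += len(record)
        total += record["rank"]
    return total / len(rows), fields_read


def avg_rank_column_store(columns):
    """A column store touches only the column the query asks for."""
    ranks = columns["rank"]
    return sum(ranks) / len(ranks), len(ranks)


row_avg, row_fields = avg_rank_row_store(rows)
col_avg, col_fields = avg_rank_column_store(columns)
```

Both layouts give the same average, but in this example the row layout touches nine field values and the column layout touches three.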

21 What should MR learn from Parallel DBMS
MR advocates should learn from parallel DBMSs the technologies and techniques for efficient parallel query execution.

22 Conclusion
MR systems are powerful tools for ETL-style applications and for complex analytics.
If the application is query-intensive, whether semi-structured or rigidly structured, then a DBMS is probably the better choice.

23 Thank you~~ Questions?

