CMU SCS Carnegie Mellon Univ. Dept. of Computer Science 15-415/615 - DB Applications C. Faloutsos – A. Pavlo Lecture#28: Modern Database Systems.



Administrivia – Final Exam
Who: You
What: R&G Chapters
When: Tuesday May 6th, 5:30pm-8:30pm
Where: WEH 7500
Why: Databases will help your love life.

Administrivia – Final Exam
Handwritten Notes: ✔
Printed Notes: X

Today’s Class
Distributed OLAP
OldSQL vs. NoSQL vs. NewSQL
How to scale a database system

OLTP vs. OLAP
On-line Transaction Processing:
  – Short-lived txns.
  – Small footprint.
  – Repetitive operations.
On-line Analytical Processing:
  – Long-running queries.
  – Complex joins.
  – Exploratory queries.

Workload Characterization
[Figure: workloads plotted by Operation Complexity (Simple to Complex) against Workload Focus (Writes to Reads); labeled regions: OLTP, OLAP, Social Networks]
Source: Michael Stonebraker, “Ten Rules for Scalable Performance in ‘Simple Operation’ Datastores”

Relational Database Backlash
New Internet start-ups hit the limits of single-node DBMSs.
Early companies used custom middleware to shard databases across multiple DBMSs.
Google was a pioneer in developing non-relational DBMS architectures.

MapReduce
Simplified parallel computing paradigm for large-scale data analysis.
Originally proposed by Google.
Hadoop is the current leading open-source implementation.

MapReduce Example
Calculate total order amount per day after Jan 1st.
[Figure: input table (DATE, AMOUNT) → Map Workers → Map Output → Reduce Workers → Reduce Output]
MAP(key, value) {
  if (key >= “ ”) {
    output(key, value);
  }
}
REDUCE(key, values) {
  sum = 0;
  while (values.hasNext()) {
    sum += values.next();
  }
  output(key, sum);
}
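The MAP/REDUCE pseudocode on this slide can be tried out with a small single-process simulation. This is only a sketch: the cutoff date, the order rows, and the `run_mapreduce` helper are made up for illustration, and a real framework such as Hadoop would run the map and reduce calls on separate workers.

```python
from collections import defaultdict

# Hypothetical cutoff; the slide only says "Jan 1st" without a year.
CUTOFF = "2014-01-01"

def map_fn(key, value):
    """Emit (date, amount) only for orders on or after the cutoff."""
    if key >= CUTOFF:
        yield key, value

def reduce_fn(key, values):
    """Sum every amount that was grouped under the same date."""
    yield key, sum(values)

def run_mapreduce(records, mapper, reducer):
    """Simulate map, shuffle (group by key), and reduce in one process."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    results = {}
    for key in sorted(groups):
        for out_key, out_value in reducer(key, groups[key]):
            results[out_key] = out_value
    return results

# Made-up (DATE, AMOUNT) rows, amounts in whole dollars.
orders = [
    ("2013-12-30", 15),
    ("2014-01-01", 10),
    ("2014-01-01", 25),
    ("2014-01-02", 30),
    ("2014-01-02", 85),
    ("2014-01-03", 44),
]

totals = run_mapreduce(orders, map_fn, reduce_fn)
print(totals)
```

The December order is dropped in the map step, so only three dates reach the reducers.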

What MapReduce Does Right
Since all intermediate results are written to HDFS, if one node crashes the entire query does not need to be restarted.
Easy to load data and start running queries.
Great for semi-structured data sets.

What MapReduce Did Wrong
Have to parse/cast values every time:
  – Multi-attribute values handled by user code.
  – If data format changes, code must change.
Expensive execution:
  – Have to send data to executors.
  – A simple join requires multiple MR jobs.

Join Example
Find sourceIP that generated most adRevenue along with its average pageRank.

Join Example – SQL
SELECT INTO Temp sourceIP,
       AVG(pageRank) AS avgPageRank,
       SUM(adRevenue) AS totalRevenue
  FROM Rankings AS R, UserVisits AS UV
 WHERE R.pageURL = UV.destURL
   AND UV.visitDate BETWEEN “ ” AND “ ”
 GROUP BY UV.sourceIP;
SELECT sourceIP, totalRevenue, avgPageRank
  FROM Temp
 ORDER BY totalRevenue DESC LIMIT 1;
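To see the two SQL statements run end to end, here is a sketch using Python's built-in `sqlite3` module. Several things are assumptions: the toy rows, the date range, the column types, and the table name `TempResult`. SQLite has no `SELECT INTO`, so `CREATE TEMP TABLE ... AS SELECT` is swapped in for the first statement.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Rankings (pageURL TEXT, pageRank INTEGER)")
conn.execute(
    "CREATE TABLE UserVisits "
    "(sourceIP TEXT, destURL TEXT, visitDate TEXT, adRevenue INTEGER)"
)

# Made-up sample data.
conn.executemany(
    "INSERT INTO Rankings VALUES (?, ?)",
    [("a.com", 10), ("b.com", 20), ("c.com", 30)],
)
conn.executemany(
    "INSERT INTO UserVisits VALUES (?, ?, ?, ?)",
    [
        ("1.1.1.1", "a.com", "2000-01-16", 5),
        ("1.1.1.1", "b.com", "2000-01-17", 7),
        ("2.2.2.2", "c.com", "2000-01-18", 8),
        ("2.2.2.2", "a.com", "2000-02-01", 100),  # outside the date range
    ],
)

# Statement 1: join, filter by date, aggregate per sourceIP.
# The date literals are hypothetical; the slide's literals are blank.
conn.execute("""
    CREATE TEMP TABLE TempResult AS
    SELECT UV.sourceIP       AS sourceIP,
           AVG(R.pageRank)   AS avgPageRank,
           SUM(UV.adRevenue) AS totalRevenue
      FROM Rankings AS R, UserVisits AS UV
     WHERE R.pageURL = UV.destURL
       AND UV.visitDate BETWEEN '2000-01-15' AND '2000-01-22'
     GROUP BY UV.sourceIP
""")

# Statement 2: pick the sourceIP with the largest revenue sum.
best = conn.execute("""
    SELECT sourceIP, totalRevenue, avgPageRank
      FROM TempResult
     ORDER BY totalRevenue DESC
     LIMIT 1
""").fetchone()
conn.close()

print(best)
```

With this data the visit worth 100 falls outside the date range, so 1.1.1.1 wins with a total revenue of 12 and an average pageRank of 15.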

Join Example – MapReduce
Phase 1: Filter
  – Map: Emit all records for Rankings. Filter UserVisits data.
  – Reduce: Compute cross product.
Phase 2: Aggregation
  – Map: Emit all tuples (i.e., passthrough).
  – Reduce: Compute avg pageRank for each sourceIP.
Phase 3: Search
  – Map: Emit all tuples (i.e., passthrough).
  – Reduce: Scan entire input and emit the record with greatest adRevenue sum.
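The three phases can be sketched as three chained jobs in a single Python process. The `run_job` helper, the record tagging scheme, the sample rows, and the date range are all hypothetical; phase 2 here also sums adRevenue, since phase 3 needs that sum to pick a winner.

```python
from collections import defaultdict

def run_job(records, mapper, reducer):
    """Run one simulated MapReduce job: map, group by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    output = []
    for key, values in groups.items():
        output.extend(reducer(key, values))
    return output

DATE_LO, DATE_HI = "2000-01-15", "2000-01-22"  # assumed range

# Phase 1: emit Rankings, filter UserVisits, join on URL in the reducer.
def join_map(record):
    table, row = record
    if table == "Rankings":
        page_url, page_rank = row
        yield page_url, ("R", page_rank)
    else:
        source_ip, dest_url, visit_date, ad_revenue = row
        if DATE_LO <= visit_date <= DATE_HI:
            yield dest_url, ("UV", source_ip, ad_revenue)

def join_reduce(url, values):
    ranks = [v[1] for v in values if v[0] == "R"]
    visits = [v[1:] for v in values if v[0] == "UV"]
    for page_rank in ranks:          # cross product per URL
        for source_ip, ad_revenue in visits:
            yield (source_ip, page_rank, ad_revenue)

# Phase 2: passthrough map keyed by sourceIP, aggregate in the reducer.
def agg_map(row):
    source_ip, page_rank, ad_revenue = row
    yield source_ip, (page_rank, ad_revenue)

def agg_reduce(source_ip, values):
    total = sum(revenue for _, revenue in values)
    avg_rank = sum(rank for rank, _ in values) / len(values)
    yield (source_ip, total, avg_rank)

# Phase 3: send everything to one reducer, which scans for the maximum.
def search_map(row):
    yield "all", row

def search_reduce(_key, rows):
    yield max(rows, key=lambda r: r[1])

# Made-up input, tagged with the table each row came from.
inputs = [
    ("Rankings", ("a.com", 10)),
    ("Rankings", ("b.com", 20)),
    ("Rankings", ("c.com", 30)),
    ("UserVisits", ("1.1.1.1", "a.com", "2000-01-16", 5)),
    ("UserVisits", ("1.1.1.1", "b.com", "2000-01-17", 7)),
    ("UserVisits", ("2.2.2.2", "c.com", "2000-01-18", 8)),
    ("UserVisits", ("2.2.2.2", "a.com", "2000-02-01", 100)),
]

joined = run_job(inputs, join_map, join_reduce)
aggregated = run_job(joined, agg_map, agg_reduce)
winner = run_job(aggregated, search_map, search_reduce)
print(winner)
```

Each `run_job` call stands in for a full MR job, which illustrates why even a simple join plus aggregate costs several passes over the data.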

Join Example – Results
Find sourceIP that generated most adRevenue along with its average pageRank.

Distributed Joins Are Hard
Assume tables are horizontally partitioned:
  – Table1 Partition Key → table1.key
  – Table2 Partition Key → table2.key
Q: How to execute?
Naïve solution is to send all partitions to a single node and compute the join.
SELECT * FROM table1, table2 WHERE table1.val = table2.val

Semi-Joins
First distribute the join attributes between nodes and then recreate the full tuples in the final output.
  – Send just enough data from each table to compute which rows to include in output.
Lots of choices make this problem hard:
  – What to materialize?
  – Which table to send?
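A minimal sketch of the semi-join idea for two tables that live on different nodes and are joined on `val`. The node names, the rows, and the `semi_join` helper are hypothetical, and "shipping" is modeled as ordinary variable passing rather than real network traffic.

```python
# table1 lives on node A, table2 on node B; rows are (key, val).
table1 = [(1, "x"), (2, "y"), (3, "z"), (4, "w")]
table2 = [(10, "y"), (11, "y"), (12, "q")]

def semi_join(local_rows, remote_vals):
    """Keep only the local rows whose join attribute appears remotely."""
    return [row for row in local_rows if row[1] in remote_vals]

# Step 1: node B ships only its distinct join-attribute values to node A.
vals_from_b = {val for _, val in table2}

# Step 2: node A discards rows that cannot match anything on node B.
reduced_table1 = semi_join(table1, vals_from_b)

# Step 3: node A ships the surviving full rows to node B, which
# recreates the full output tuples of the join.
result = [
    (k1, v1, k2, v2)
    for (k1, v1) in reduced_table1
    for (k2, v2) in table2
    if v1 == v2
]

# Rough cost comparison: the naive plan ships every table1 row.
naive_rows_shipped = len(table1)
semi_join_items_shipped = len(vals_from_b) + len(reduced_table1)

print(result)
print(naive_rows_shipped, semi_join_items_shipped)
```

Here the semi-join ships two bare values and one full row instead of four full rows. The saving depends on how selective the join is, which is part of why choosing what to send is hard.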

MapReduce in 2014
SQL/Declarative Query Support
Table Schemas
Column-oriented storage

Column Stores
Store tables as sections of columns of data rather than as rows of data.

Column Stores
Row-oriented Storage
[Figure: student table with columns sid, name, login, age, gpa, sex, stored row by row]
SELECT sex, AVG(GPA) FROM student GROUP BY sex

Column Stores
Column-oriented Storage
[Figure: student table with columns sid, name, login, age, gpa, sex, stored column by column]
SELECT sex, AVG(GPA) FROM student GROUP BY sex

Column Stores
Only scan the columns that a query needs.
Allows for amazing compression ratios:
  – Values in the same column are usually similar.
Main goal is to delay materializing a record back to its row-oriented format for as long as possible inside of the DBMS.
Inserts/Updates/Deletes are harder…
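The difference between the two layouts can be sketched with plain Python lists. The student rows are made up, and run-length encoding is just one simple compression scheme chosen for illustration; the slides do not say which schemes real column stores use.

```python
from collections import defaultdict
from itertools import groupby

column_names = ("sid", "name", "login", "age", "gpa", "sex")

# Row-oriented layout: one tuple per student (made-up data).
rows = [
    (1, "Ann", "ann@cs", 20, 3.5, "F"),
    (2, "Bob", "bob@cs", 21, 3.0, "M"),
    (3, "Cat", "cat@cs", 22, 4.0, "F"),
    (4, "Dan", "dan@cs", 23, 2.0, "M"),
]

# Column-oriented layout: one list per attribute.
columns = {
    name: [row[i] for row in rows] for i, name in enumerate(column_names)
}

def avg_gpa_by_sex(cols):
    """SELECT sex, AVG(gpa) ... GROUP BY sex, touching only two columns."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for sex, gpa in zip(cols["sex"], cols["gpa"]):
        sums[sex] += gpa
        counts[sex] += 1
    return {sex: sums[sex] / counts[sex] for sex in sums}

def run_length_encode(values):
    """Collapse runs of equal adjacent values into (value, count) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]

averages = avg_gpa_by_sex(columns)
encoded_sex = run_length_encode(sorted(columns["sex"]))

print(averages)
print(encoded_sex)
```

The query never reads sid, name, login, or age, and the sorted sex column shrinks to two pairs because neighbouring values in one column repeat.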

Column Store Systems
Many column-store DBMSs:
  – Examples: Vertica, Sybase IQ, MonetDB
Hadoop storage libraries:
  – Examples: Parquet, RCFile

NoSQL
In addition to MapReduce, Google created a distributed DBMS called BigTable.
  – It used a GET/PUT API instead of SQL.
  – No support for txns.
Newer systems have been created that follow BigTable’s anti-relational spirit.
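To make the GET/PUT style of access concrete, here is a toy hash-partitioned key/value store. It is not BigTable's actual API: the class name, the method names, and the CRC32-based partitioning are assumptions made for this sketch, and each "partition" is only an in-memory dict.

```python
import zlib

class KeyValueStore:
    """Toy key/value store with hash partitioning and no transactions."""

    def __init__(self, num_partitions=4):
        # Each dict stands in for the data held by one node.
        self.partitions = [dict() for _ in range(num_partitions)]

    def _partition_for(self, key):
        # CRC32 gives a hash that is stable across runs.
        index = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        return self.partitions[index]

    def put(self, key, value):
        """Store a value under a key, overwriting any previous value."""
        self._partition_for(key)[key] = value

    def get(self, key, default=None):
        """Return the value for a key, or the default if it is absent."""
        return self._partition_for(key).get(key, default)

store = KeyValueStore(num_partitions=4)
store.put("user:1", {"name": "Ann"})
store.put("user:2", {"name": "Bob"})

print(store.get("user:1"))
print(store.get("user:3"))
```

Every operation touches exactly one key on one partition, which makes this design easy to scale out. Anything that spans keys, such as a join or a multi-row update, has to be written in application code.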

NoSQL Systems
Categories: Documents, Column-Family, Key/Value

NoSQL Drawbacks
Developers write code to handle eventually consistent data, lack of transactions, and joins.
Not all applications can give up strong transactional semantics.

NewSQL
Next generation of relational DBMSs that can scale like a NoSQL system but without giving up SQL or txns.

Aslett White Paper
“[Systems that] deliver the scalability and flexibility promised by NoSQL while retaining the support for SQL queries and/or ACID, or to improve performance for appropriate workloads.”
Matt Aslett, 451 Group (April 4th, 2011)

Wikipedia Article
“A class of modern relational database systems that provide the same scalable performance of NoSQL systems for OLTP workloads while still maintaining the ACID guarantees of a traditional database system.”
Wikipedia (April 2014)

NewSQL Systems
Categories: Middleware, New Design, MySQL Engines

NewSQL Implementations
Distributed Concurrency Control
Main Memory Storage
Hybrid Architectures
  – Support OLTP and OLAP in single DBMS.
Query Code Compilation