Analytics: SQL or NoSQL?
Richard Taylor, Chair, Business Intelligence SIG

The NoSQL Movement
- Meetup June in San Francisco
- NoSQL name proposed by Eric Evans
- 2004 BigTable (Google)
- 2007 Dynamo (Amazon)
- 2008 Cassandra (Facebook)
- Hadoop/HBase (Yahoo)
- Project Voldemort (LinkedIn)
- NoSQL Conferences

Relational Database/SQL

Database Timeline
- CODASYL: Network database, Schema, DDL/DML
- 1970 Codd: Relational Model
- 1974 SEQUEL
- 1979 Oracle
- 1980 Gray: Transaction
- Bernstein and Goodman: Multi-version Concurrency Control
- 1989 SQL
- SQL
- 1995 Bernstein et al: Critique of ANSI SQL Isolation Levels
- SQL:1999 Object Relational
- 2003 SQL:2003 Analytics extensions

Relational Model
- Table of n-tuples: Row, Column, Key
- Normalized data, "Atomic"
- Multi-column Key
- Primary Key, Foreign Key
- Relationship on key
- Operations on tables: select, project, join
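The three table operations named on this slide can be sketched in a few lines of Python. This is a minimal illustration, not how a database engine implements them: the `stores` and `sales` tables, their columns, and the helper names are all hypothetical, and `project` keeps duplicate rows (as SQL does without DISTINCT).

```python
# Hypothetical sample tables; names and values are illustrative only.
stores = [
    {"store_key": 1, "city": "Oakland"},
    {"store_key": 2, "city": "San Jose"},
]
sales = [
    {"receipt": 100, "store_key": 1, "revenue": 12.5},
    {"receipt": 101, "store_key": 2, "revenue": 8.0},
    {"receipt": 102, "store_key": 1, "revenue": 3.0},
]

def select(table, predicate):
    """Select: keep only the rows that satisfy the predicate."""
    return [row for row in table if predicate(row)]

def project(table, columns):
    """Project: keep only the named columns of every row (duplicates kept)."""
    return [{c: row[c] for c in columns} for row in table]

def join(left, right, key):
    """Equi-join: pair rows whose values for `key` match (nested loop)."""
    return [{**l, **r} for l in left for r in right if l[key] == r[key]]

# Receipts and revenue for sales made in Oakland stores.
oakland_sales = project(
    select(join(sales, stores, "store_key"), lambda r: r["city"] == "Oakland"),
    ["receipt", "revenue"],
)
```

The join relates the two tables on `store_key`, which plays the role of a primary key in `stores` and a foreign key in `sales`.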

SQL
- Designed for Transaction Processing
- Good
  - Easily handles simple cases
  - Everyone has a Query Language
- Bad
  - Data access language (not Turing complete)
  - Declarative Language (4GL) -> Impedance mismatch with procedural languages
  - Complicated cases get repetitive

Normalization
- Refine design of structured data
  - "Atomic": no repeating groups
  - Data item depends on key (and nothing else)
- Avoid modification anomalies
  - Ensure every data item is stored only once
- Avoid bias to any particular pattern of querying
  - Allow data to be accessed from every angle
- Denormalization

Star Schema Example
- Fact Table: Date_key, Store_key, Promotion_key, Product_key, Receipt_number, Quantity, Revenue, Unit_price
- Dimension tables: Product, Store, Promotion, Date
- Date: Date_key, Day_in_week, Day_in_month, Day_in_year, Day_name, Week_in_month, Week_in_year, Month_nbr, Month_name, Quarter, Year, Holiday, Holiday_desc, …
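A typical star schema query joins the fact table to a dimension and aggregates a measure. The sketch below uses Python's built-in `sqlite3` module with a cut-down version of the slide's columns; the table names `sales_fact` and `date_dim` and all inserted rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE date_dim (
    date_key   INTEGER PRIMARY KEY,
    month_name TEXT NOT NULL,
    year       INTEGER NOT NULL
);
CREATE TABLE sales_fact (
    date_key       INTEGER NOT NULL REFERENCES date_dim(date_key),
    product_key    INTEGER NOT NULL,
    receipt_number INTEGER NOT NULL,
    quantity       INTEGER NOT NULL,
    revenue        REAL NOT NULL
);
-- Hypothetical sample rows.
INSERT INTO date_dim VALUES (1, 'January', 2010), (2, 'January', 2010), (3, 'February', 2010);
INSERT INTO sales_fact VALUES
    (1, 10, 500, 2, 20.0),
    (2, 11, 501, 1, 5.0),
    (3, 10, 502, 3, 30.0);
""")

# Revenue per month: join the fact table to the Date dimension, then aggregate.
rows = conn.execute("""
    SELECT d.year, d.month_name, SUM(f.revenue)
    FROM sales_fact AS f
    JOIN date_dim AS d ON d.date_key = f.date_key
    GROUP BY d.year, d.month_name
    ORDER BY MIN(d.date_key)
""").fetchall()
```

Because descriptive attributes such as `month_name` live in the dimension table, the fact table stays narrow and the same join pattern works for Store, Promotion, and Product.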

Database Summary
- Costs
  - Fixed schema
  - Normalization
  - Transform data on load
  - Cost of scaling
  - Problems with large objects
  - Complicated software
- Benefits
  - Mature technology
  - Precise querying
  - Star Schema: historic data

Tuple Store/NoSQL

Tuple Storage Systems
- Google Database System
  - Chubby: Lock/metadata manager
  - Google File System: Distributed file system
  - Bigtable: Tuple storage on GFS
  - Map Reduce: Data processing on tuples
- Other tuple stores
  - Voldemort: Amazon Dynamo
  - Cassandra
  - HBase
  - Hypertable

Tuple Store Model
- One Table
- Operate on Map
- Set of (Key, Value)
- Structured Key
- Unstructured Value
- Operations: select, project
- Map Reduce
- Tuple Store: Key | Value, with the Key structured as Key, Column, Timestamp
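The model on this slide can be approximated with an ordinary Python dictionary. This is only a toy sketch under the assumption that the structured key is a (row key, column, timestamp) triple and the value is an opaque string; the helper names and data are hypothetical, and real tuple stores add distribution, persistence, and sorted storage.

```python
# Toy in-memory tuple store: structured key -> unstructured value.
# Key layout (row_key, column, timestamp) is an assumption for illustration.
store = {
    ("com.example/index", "contents", 1): "<html>v1</html>",
    ("com.example/index", "contents", 2): "<html>v2</html>",
    ("com.example/index", "language", 1): "en",
    ("org.sample/home", "contents", 1): "<html>home</html>",
}

def select(store, row_key):
    """Select: return every (key, value) pair stored under one row key."""
    return {k: v for k, v in store.items() if k[0] == row_key}

def project(store, column):
    """Project: return only the entries for one column, across all rows."""
    return {k: v for k, v in store.items() if k[1] == column}

def latest(store, row_key, column):
    """Return the value with the highest timestamp, or None if absent."""
    versions = [(k[2], v) for k, v in store.items()
                if k[0] == row_key and k[1] == column]
    return max(versions)[1] if versions else None
```

There is no join here: anything beyond select and project is left to Map Reduce processing over the tuples.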

Map Reduce
- Define two functions
  - Map
    - Input: tuple
    - Output: list of tuples
  - Reduce
    - Input: key, list of values
    - Output: list or tuple
- Specify a cluster
- Specify input and output tuple stores
- Framework does the rest

{ Map(k1, v1) } -> { list(k2, v2) }
{ list(k2, v2) } -> { (k2, list(v2)) }
{ Reduce(k2, list(v2)) } -> { list(v3) } -> { (k2, v3) }
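The part that the "framework does" can be sketched on a single machine as three phases: map, group by key, reduce. The `map_reduce` function below is a hypothetical stand-in for a real framework (no cluster, no tuple stores, no fault tolerance), and the sales data is invented for the usage example.

```python
from collections import defaultdict

def map_reduce(records, map_fn, reduce_fn):
    """Run map, group (shuffle), and reduce over an iterable of (k1, v1) tuples."""
    # Map phase: each input tuple yields a list of (k2, v2) tuples.
    mapped = []
    for k1, v1 in records:
        mapped.extend(map_fn(k1, v1))

    # Shuffle phase: collect all v2 values that share the same k2.
    groups = defaultdict(list)
    for k2, v2 in mapped:
        groups[k2].append(v2)

    # Reduce phase: each (k2, list(v2)) becomes one (k2, v3) entry.
    return {k2: reduce_fn(k2, values) for k2, values in groups.items()}

# Hypothetical usage: total quantity sold per product.
sales = [
    ("receipt-1", ("apple", 2)),
    ("receipt-2", ("pear", 1)),
    ("receipt-3", ("apple", 5)),
]
totals = map_reduce(
    sales,
    map_fn=lambda receipt, item: [(item[0], item[1])],
    reduce_fn=lambda product, quantities: sum(quantities),
)
```

The programmer supplies only `map_fn` and `reduce_fn`; in a real deployment the grouping step is where data moves between machines.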

Map Reduce Example
- For each web page count the number of pages that reference that page
- Input tuple store is WWW: { (URL, Web Page), … }
- Map Function: for each anchor on web page, emit (anchorURL, 1)
- Reduce Function: emit (anchorURL, sum(list))
- Output tuple store is { (URL, count) }

{ Map(k1, v1) } -> { list(k2, v2) }
{ list(k2, v2) } -> { (k2, list(v2)) }
{ Reduce(k2, list(v2)) } -> { (k2, v3) }
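A small Python sketch of this example follows. It assumes each web page has already been parsed into a list of anchor URLs, and the three-page `www` store is made up. One deliberate deviation: the map function emits each anchor once per page (via `set`), so the result counts referencing pages rather than individual anchors. The grouping step here is done by sorting, one of several ways a framework might bring equal keys together.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical input tuple store: URL -> anchor URLs found on that page.
www = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["c.example"],
    "c.example": ["a.example", "b.example"],
}

def map_page(url, anchors):
    """Map: emit (anchorURL, 1) for each distinct anchor on the page."""
    return [(anchor, 1) for anchor in set(anchors)]

def reduce_count(anchor_url, ones):
    """Reduce: emit (anchorURL, sum(list))."""
    return (anchor_url, sum(ones))

# Map every page, then sort so that equal keys are adjacent (sort-based shuffle).
emitted = sorted(
    pair for url, anchors in www.items() for pair in map_page(url, anchors)
)

# Reduce each group of identical keys into one (URL, count) tuple.
link_counts = dict(
    reduce_count(key, [one for _, one in group])
    for key, group in groupby(emitted, key=itemgetter(0))
)
```

Pages that nothing links to never appear as a key, so they are absent from the output rather than listed with a count of zero.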

Example in SQL
For each web page count the number of pages that reference that page

    CREATE TABLE links (
        page URL NOT NULL,
        ref_page URL NOT NULL,
        PRIMARY KEY (page, ref_page)
    );

    SELECT ref_page, count(DISTINCT page)
    FROM links
    GROUP BY ref_page;
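The query can be tried out with Python's built-in `sqlite3` module. This is a sketch under a few assumptions: the slide's `URL` column type is replaced by `TEXT`, the five link rows are invented, and an `ORDER BY` is added only to make the output order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE links (
        page     TEXT NOT NULL,   -- page that contains the anchor
        ref_page TEXT NOT NULL,   -- page the anchor points to
        PRIMARY KEY (page, ref_page)
    )
""")

# Hypothetical link data: (page, ref_page) pairs.
conn.executemany(
    "INSERT INTO links (page, ref_page) VALUES (?, ?)",
    [
        ("a.example", "b.example"),
        ("a.example", "c.example"),
        ("b.example", "c.example"),
        ("c.example", "a.example"),
        ("c.example", "b.example"),
    ],
)

# For each referenced page, count the distinct pages that link to it.
counts = conn.execute(
    "SELECT ref_page, COUNT(DISTINCT page) FROM links "
    "GROUP BY ref_page ORDER BY ref_page"
).fetchall()
```

The SQL version is shorter than a hand-written map and reduce pair, but it requires the links to have been extracted and loaded into a table first.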

Tuple Store Summary
- Semi-structured data
  - No need to normalize data
- Simple implementations
  - Cheap, fast, scalable
- Map Reduce Processing
  - Simple programming (for geeks)
- Issues
  - No guidance from schema
  - No model for historic data
- Hadoop wins Sort Benchmark

Synthesis

Summary
- SQL
  - Structured data
  - Precise
  - Historic data
  - Needs transformation
  - Scalability issues
- NoSQL
  - Cheap
  - Scalable
  - Handles large data

Enterprise Model
- Money, Content, Analytics
- Relational DB, NoSQL, ?
- Metadata?
- Issues:
  - Data volume
  - Query requirements

Analytics Architecture
- Tuple Store
- Map Reduce Processing: TB+/day
- RDB Data Warehouse: GB++/day
- Reports
- Cubes, Reports, etc.

Summary
- It is all about structured data
- How much do we want?
- How much can we afford?