Database Applications (15-415) NoSQL Databases Lecture 26, April 19, 2016 Mohammad Hammoud.

Slides:



Advertisements
Similar presentations
Distributed Systems CS
Advertisements

Data Management in the Cloud Paul Szerlip. The rise of data Think about this o For the past two decades, the largest generator of data was humans -- now.
C-Store: Data Management in the Cloud Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY Jun 5, 2009.
Jennifer Widom NoSQL Systems Overview (as of November 2011 )
NoSQL Databases: MongoDB vs Cassandra
Presented by Marie-Gisele Assigue Hon Shea Thursday, March 31 st 2011.
NoSQL and NewSQL Justin DeBrabant CIS Advanced Systems - Fall 2013.
NoSQL Database.
CS 405G: Introduction to Database Systems 24 NoSQL Reuse some slides of Jennifer Widom Chen Qian University of Kentucky.
Distributed Databases
An introduction to MongoDB Rácz Gábor ELTE IK, febr. 10.
Massively Parallel Cloud Data Storage Systems S. Sudarshan IIT Bombay.
A Study in NoSQL & Distributed Database Systems John Hawkins.
1 Yasin N. Silva Arizona State University This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Distributed Data Stores and No SQL Databases S. Sudarshan IIT Bombay.
Databases with Scalable capabilities Presented by Mike Trischetta.
Systems analysis and design, 6th edition Dennis, wixom, and roth
Distributed Data Stores and No SQL Databases S. Sudarshan Perry Hoekstra (Perficient) with slides pinched from various sources such as Perry Hoekstra (Perficient)
HBase A column-centered database 1. Overview An Apache project Influenced by Google’s BigTable Built on Hadoop ▫A distributed file system ▫Supports Map-Reduce.
Getting Biologists off ACID Ryan Verdon 3/13/12. Outline Thesis Idea Specific database Effects of losing ACID What is a NoSQL database Types of NoSQL.
Goodbye rows and tables, hello documents and collections.
Introduction to Hadoop and HDFS
Modern Databases NoSQL and NewSQL Willem Visser RW334.
VICTORIA UNIVERSITY OF WELLINGTON Te Whare Wananga o te Upoko o te Ika a Maui SWEN 432 Advanced Database Design and Implementation Trade-offs in Cloud.
NoSQL Databases Oracle - Berkeley DB. Content A brief intro to NoSQL About Berkeley Db About our application.
NOSQL DATABASES Please remember to read the NOSQL Distilled book and the Seven Databases book.
Database Systems: Design, Implementation, and Management Tenth Edition Chapter 12 Distributed Database Management Systems.
Data in the Cloud – I Parallel Databases The Google File System Parallel File Systems.
VICTORIA UNIVERSITY OF WELLINGTON Te Whare Wananga o te Upoko o te Ika a Maui SWEN 432 Advanced Database Design and Implementation Exam and Lecture Overview.
MapReduce and GFS. Introduction r To understand Google’s file system let us look at the sort of processing that needs to be done r We will look at MapReduce.
VICTORIA UNIVERSITY OF WELLINGTON Te Whare Wananga o te Upoko o te Ika a Maui SWEN 432 Advanced Database Design and Implementation MongoDB Architecture.
Lecture 8: Databases and Data Infrastructure CS 6071 Big Data Engineering, Architecture, and Security Fall 2015, Dr. Rozier.
Distributed Systems CS Consistency and Replication – Part I Lecture 10, September 30, 2013 Mohammad Hammoud.
Introduction.  Administration  Simple DBMS  CMPT 454 Topics John Edgar2.
Infrastructure for Data Warehouses. Basics Of Data Access Data Store Machine Memory Buffer Memory Cache Data Store Buffer Bus Structure.
Dynamo: Amazon’s Highly Available Key-value Store DAAS – Database as a service.
NoSQL Or Peles. What is NoSQL A collection of various technologies meant to work around RDBMS limitations (mostly performance) Not much of a definition...
NoSQL Systems Motivation. NoSQL: The Name  “SQL” = Traditional relational DBMS  Recognition over past decade or so: Not every data management/analysis.
NOSQL DATABASE Not Only SQL DATABASE
Data and Information Systems Laboratory University of Illinois Urbana-Champaign Data Mining Meeting Mar, From SQL to NoSQL Xiao Yu Mar 2012.
Department of Computer Science, Johns Hopkins University EN Instructor: Randal Burns 24 September 2013 NoSQL Data Models and Systems.
BIG DATA/ Hadoop Interview Questions.
Abstract MarkLogic Database – Only Enterprise NoSQL DB Aashi Rastogi, Sanket V. Patel Department of Computer Science University of Bridgeport, Bridgeport,
Database Applications (15-415) NoSQL Databases Lecture 24, November 27, 2016 Mohammad Hammoud.
CSCI5570 Large Scale Data Processing Systems
CS 405G: Introduction to Database Systems
and Big Data Storage Systems
Cloud Computing and Architecuture
CS122B: Projects in Databases and Web Applications Winter 2017
Introduction In the computing system (web and business applications), there are enormous data that comes out every day from the web. A large section of.
Trade-offs in Cloud Databases
MongoDB Er. Shiva K. Shrestha ME Computer, NCIT
Modern Databases NoSQL and NewSQL
NOSQL.
Chapter 19: Distributed Databases
NOSQL databases and Big Data Storage Systems
Massively Parallel Cloud Data Storage Systems
1 Demand of your DB is changing Presented By: Ashwani Kumar
NOSQL and CAP Theorem.
NoSQL Databases An Overview
Database Applications (15-415) DBMS Internals- Part XIII Lecture 25, April 15, 2018 Mohammad Hammoud.
NoSQL Databases Antonino Virgillito.
Distributed Systems CS
Distributed Systems CS
CSE 482 Lecture 5: NoSQL.
Distributed Systems CS
CMPE 280 Web UI Design and Development March 14 Class Meeting
Database Applications (15-415) DBMS Internals- Part XIII Lecture 24, April 14, 2016 Mohammad Hammoud.
NoSQL & Document Stores
Presentation transcript:

Database Applications (15-415) NoSQL Databases Lecture 26, April 19, 2016 Mohammad Hammoud

Today…  Last Session:  Recovery Management  Today’s Session:  NoSQL databases  Announcements:  PS5 (the “last” assignment) is due on Thursday, April 21st  The final exam is on Wednesday, April 27th, from 8:30AM to 11:30AM in room 1031 (all materials are included- open book, open notes)  We will conduct an overview session for the final exam on Thursday, April 21 st at 4:30PM in room 1031

Outline Types of Data Scaling Databases & the 2PC Protocol The CAP Theorem and the BASE Properties NoSQL Databases

Types of Data  Data can be broadly classified into four types: 1.Structured Data:  Have a predefined model, which organizes data into a form that is relatively easy to store, process, retrieve and manage  E.g., relational data 2.Unstructured Data:  Opposite of structured data  E.g., Flat binary files containing text, video or audio  Note: data is not completely devoid of a structure (e.g., an audio file may still have an encoding structure and some metadata associated with it)

Types of Data  Data can be broadly classified into four types: 3.Dynamic Data:  Data that changes relatively frequently  E.g., office documents and transactional entries in a financial database 4.Static Data:  Opposite of dynamic data  E.g., Medical imaging data from MRI or CT scans

Why Classifying Data?  Segmenting data into one of the following 4 quadrants can help in designing and developing a pertaining storage solution  Relational databases are usually used for structured data  File systems or NoSQL databases can be used for (static), unstructured data (more on these later) Media Production, eCAD, mCAD, Office Docs Media Archive, Broadcast, Medical Imaging Transaction Systems, ERP, CRM BI, Data Warehousing Dynamic Unstructured Structured Static

Outline Types of Data Scaling Databases & the 2PC Protocol The CAP Theorem and the BASE Properties NoSQL Databases

Scaling Traditional Databases  Traditional RDBMSs can be either scaled:  Vertically (or Up)  Can be achieved by hardware upgrades (e.g., faster CPU, more memory, or larger disk)  Limited by the amount of CPU, RAM and disk that can be configured on a single machine  Horizontally (or Out)  Can be achieved by adding more machines  Requires database sharding and probably replication  Limited by the Read-to-Write ratio and communication overhead

Why Sharding Data?  Data is typically sharded (or striped) to allow for concurrent/parallel accesses Input data: A large file Machine 1 Chunk1 of input data Machine 2 Chunk3 of input data Machine 3 Chunk5 of input data Chunk2 of input dataChunk4 of input dataChunk5 of input data E.g., Chunks 1, 3 and 5 can be accessed in parallel

Amdahl’s Law  How much faster will a parallel program run?  Suppose that the sequential execution of a program takes T 1 time units and the parallel execution on p processors/machines takes T p time units  Suppose that out of the entire execution of the program, s fraction of it is not parallelizable while 1-s fraction is parallelizable  Then the speedup (Amdahl’s formula): 10

Amdahl’s Law: An Example  Suppose that:  80% of your program can be parallelized  4 machines are used to run your parallel version of the program  The speedup you can get according to Amdahl’s law is: 11 Although you use 4 processors you cannot get a speedup more than 2.5 times!

Real Vs. Actual Cases  Amdahl’s argument is too simplified  In reality, communication overhead and potential workload imbalance exist upon running parallel programs Process 1 Process 2 Process 3 Process 4 Serial Parallel 1. Parallel Speed-up: An Ideal Case Cannot be parallelized Can be parallelized Process 1 Process 2 Process 3 Process 4 Serial Parallel 2. Parallel Speed-up: An Actual Case Cannot be parallelized Can be parallelized Load Unbalance Communication overhead

Some Guidelines  Here are some guidelines to effectively benefit from parallelization: 1.Maximize the fraction of your program that can be parallelized 2.Balance the workload of parallel processes 3.Minimize the time spent for communication 13

Why Replicating Data?  Replicating data across servers helps in:  Avoiding performance bottlenecks  Avoiding single point of failures  And, hence, enhancing scalability and availability

Why Replicating Data?  Replicating data across servers helps in:  Avoiding performance bottlenecks  Avoiding single point of failures  And, hence, enhancing scalability and availability Main Server Replicated Servers

But, Consistency Becomes a Challenge  An example:  In an e-commerce application, the bank database has been replicated across two servers  Maintaining consistency of replicated data is a challenge Bal=1000 Replicated Database Event 1 = Add $1000 Event 2 = Add interest of 5% Bal= Bal= Bal= Bal=2100

The Two-Phase Commit Protocol  The two-phase commit protocol (2PC) can be used to ensure atomicity and consistency Database Server 1 Participant 1 Coordinator Database Server 2 Participant 2 Database Server 3 Participant 3 VOTE_REQUEST Phase I: Voting VOTE_COMMIT

The Two-Phase Commit Protocol  The two-phase commit protocol (2PC) can be used to ensure atomicity and consistency Database Server 1 Participant 1 Coordinator Database Server 2 Participant 2 Database Server 3 Participant 3 GLOBAL_COMMIT Phase II: Commit LOCAL_COMMIT “Strict” consistency, which limits scalability!

Outline Types of Data Scaling Databases & the 2PC Protocol The CAP Theorem and the BASE Properties NoSQL Databases

The CAP Theorem  The limitations of distributed databases can be described in the so called the CAP theorem  Consistency: every node always sees the same data at any given instance (i.e., strict consistency)  Availability: the system continues to operate, even if nodes in a cluster crash, or some hardware or software parts are down due to upgrades  Partition Tolerance: the system continues to operate in the presence of network partitions CAP theorem: any distributed database with shared data, can have at most two of the three desirable properties, C, A or P

The CAP Theorem (Cont’d)  Let us assume two nodes on opposite sides of a network partition:  Availability + Partition Tolerance forfeit Consistency  Consistency + Partition Tolerance entails that one side of the partition must act as if it is unavailable, thus forfeiting Availability  Consistency + Availability is only possible if there is no network partition, thereby forfeiting Partition Tolerance

Large-Scale Databases  When companies such as Google and Amazon were designing large-scale databases, 24/7 Availability was a key  A few minutes of downtime means lost revenue  When horizontally scaling databases to 1000s of machines, the likelihood of a node or a network failure increases tremendously  Therefore, in order to have strong guarantees on Availability and Partition Tolerance, they had to sacrifice “strict” Consistency (implied by the CAP theorem)

Trading-Off Consistency  Maintaining consistency should balance between the strictness of consistency versus availability/scalability  Good-enough consistency depends on your application

Trading-Off Consistency  Maintaining consistency should balance between the strictness of consistency versus availability/scalability  Good-enough consistency depends on your application Strict Consistency Generally hard to implement, and is inefficient Loose Consistency Easier to implement, and is efficient

The BASE Properties  The CAP theorem proves that it is impossible to guarantee strict Consistency and Availability while being able to tolerate network partitions  This resulted in databases with relaxed ACID guarantees  In particular, such databases apply the BASE properties:  Basically Available: the system guarantees Availability  Soft-State: the state of the system may change over time  Eventual Consistency: the system will eventually become consistent

Eventual Consistency  A database is termed as Eventually Consistent if:  All replicas will gradually become consistent in the absence of updates

Eventual Consistency  A database is termed as Eventually Consistent if:  All replicas will gradually become consistent in the absence of updates Webpage-A Event: Update Webpage- A Webpage-A

Eventual Consistency: A Main Challenge  But, what if the client accesses the data from different replicas? Webpage-A Event: Update Webpage- A Webpage-A Protocols like Read Your Own Writes (RYOW) can be applied!

Outline Types of Data Scaling Databases & the 2PC Protocol The CAP Theorem and the BASE Properties NoSQL Databases

NoSQL Databases  To this end, a new class of databases emerged, which mainly follow the BASE properties  These were dubbed as NoSQL databases  E.g., Amazon’s Dynamo and Google’s Bigtable  Main characteristics of NoSQL databases include:  No strict schema requirements  No strict adherence to ACID properties  Consistency is traded in favor of Availability

Types of NoSQL Databases  Here is a limited taxonomy of NoSQL databases: NoSQL Databases Document Stores Graph Databases Key-Value Stores Columnar Databases

Document Stores  Documents are stored in some standard format or encoding (e.g., XML, JSON, PDF or Office Documents)  These are typically referred to as Binary Large Objects (BLOBs)  Documents can be indexed  This allows document stores to outperform traditional file systems  E.g., MongoDB and CouchDB (both can be queried using MapReduce)

Types of NoSQL Databases  Here is a limited taxonomy of NoSQL databases: NoSQL Databases Document Stores Graph Databases Key-Value Stores Columnar Databases

Graph Databases  Data are represented as vertices and edges  Graph databases are powerful for graph-like queries (e.g., find the shortest path between two elements)  E.g., Neo4j and VertexDB Id: 1 Name: Alice Age: 18 Id: 2 Name: Bob Age: 22 Id: 3 Name: Chess Type: Group Id:100 Label: knows Since: 2001/10/03 Id:101 Label: knows Since: 2001/10/03 Id:103 Label: Members Id:104 Label: Members Id:105 Label: is_member Since: 2011/02/14 Id:102 Label: is_member Since: 2005/07/01

Types of NoSQL Databases  Here is a limited taxonomy of NoSQL databases: NoSQL Databases Document Stores Graph Databases Key-Value Stores Columnar Databases

Key-Value Stores  Keys are mapped to (possibly) more complex value (e.g., lists)  Keys can be stored in a hash table and can be distributed easily  Such stores typically support regular CRUD (create, read, update, and delete) operations  That is, no joins and aggregate functions  E.g., Amazon DynamoDB and Apache Cassandra

Types of NoSQL Databases  Here is a limited taxonomy of NoSQL databases: NoSQL Databases Document Stores Graph Databases Key-Value Stores Columnar Databases

 Columnar databases are a hybrid of RDBMSs and Key- Value stores  Values are stored in groups of zero or more columns, but in Column-Order (as opposed to Row-Order)  Values are queried by matching keys  E.g., HBase and Vertica Alice 3 25Bob 4 19 Carol 0 45 Record 1 Row-Order Alice 3 25 Bob 4 19 Carol 0 45 Column A Columnar (or Column-Order) Alice 3 25 Bob 4 19 Carol 0 45 Columnar with Locality Groups Column A = Group A Column Family {B, C}

Summary  Data can be classified into 4 types, structured, unstructured, dynamic and static  Different data types usually entail different database designs  Databases can be scaled up or out  The 2PC protocol can be used to ensure strict consistency  Strict consistency limits scalability

Summary (Cont’d)  The CAP theorem states that any distributed database with shared data can have at most two of the three desirable properties:  Consistency  Availability  Partition Tolerance  The CAP theorem lead to various designs of databases with relaxed ACID guarantees

Summary (Cont’d)  NoSQL (or Not-Only-SQL) databases follow the BASE properties:  Basically Available  Soft-State  Eventual Consistency  NoSQL databases have different types:  Document Stores  Graph Databases  Key-Value Stores  Columnar Databases