NoSQL by Michael Britton, Mark McGregor, and Sam Howard

Name: NoSQL by Michael Britton, Mark McGregor, and Sam Howard
Uploaded: 2017-07-24T15:13:09+00:00
Duration: PTM21S31
Channel: Howard Randolf Ramsey
Description: NoSQL by Michael Britton, Mark McGregor, and Sam Howard

NoSQL by Michael Britton, Mark McGregor, and Sam Howard
Simplicity, Speed, Scalability

What is NoSQL?
Next-generation databases that mostly address some of these points: being non-relational, distributed, open-source, and horizontally scalable.
The term “NoSQL” is misleading. A more appropriate reading is “Not Only SQL”.

Origins
1998 - Carlo Strozzi. Still used the relational model. More accurately called “NoRel”.
2009 - Eric Evans and Johan Oskarsson. Organized an event to discuss open-source distributed databases. Originally a term to label non-ACID databases. It was meant to be a Twitter hashtag but went viral and stuck.

Why NoSQL
Databases today are required to process very large amounts of data. Data has been increasing exponentially in the fields of personal user information, social graphs, geolocation data, user-generated content, and machine logging data. Traditional SQL databases were not originally designed to handle data at this scale, and NoSQL systems were built to do so.

What You Are Giving Up With NoSQL
Relationships between entities (like tables) are limited to non-existent. For example, you usually can't join.
Limited ACID transactions.
No standard language for queries (such as SQL). The query interface varies with each NoSQL engine.
Less structured and less rigid data model.
NoSQL typically gives the developer more responsibility at the application layer to "do the right thing" when it comes to data relationships and consistency.
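
To illustrate what "more responsibility at the application layer" can look like, here is a minimal Python sketch of a join performed in application code over two collections of documents. The collection and field names (`users`, `orders`, `user_id`) are hypothetical and do not come from the slides.

```python
# Hypothetical document collections; in a real system these would come
# from two separate lookups against the data store.
users = [
    {"_id": 1, "name": "Ada"},
    {"_id": 2, "name": "Linus"},
]
orders = [
    {"_id": 100, "user_id": 1, "total": 25.0},
    {"_id": 101, "user_id": 1, "total": 10.5},
    {"_id": 102, "user_id": 3, "total": 99.0},  # dangling reference: no user 3
]


def join_orders_with_users(orders, users):
    """Emulate an inner join on orders.user_id = users._id in application code."""
    users_by_id = {user["_id"]: user for user in users}
    joined = []
    for order in orders:
        user = users_by_id.get(order["user_id"])
        if user is None:
            # Nothing in the store enforces the reference, so the
            # application has to decide what to do with orphaned orders.
            continue
        joined.append(
            {"order_id": order["_id"], "name": user["name"], "total": order["total"]}
        )
    return joined


result = join_orders_with_users(orders, users)
```

The orphaned order is silently dropped here. Whether to drop, log, or repair such records is exactly the kind of decision that moves from the database to the developer.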

RDBMS vs. NoSQL
RDBMS:
Structured and organized data
Structured Query Language (SQL)
Data and its relationships stored in separate tables
Data Manipulation Language, Data Definition Language
Tight consistency
ACID transactions
NoSQL:
No declarative query language
No predefined schema
Key-value pair storage, column store, document store, graph databases
Eventual consistency rather than ACID properties (BASE)
Unstructured and unpredictable data
CAP Theorem
Prioritizes high performance, high availability, and scalability

SQL VS NoSQL Queries NoSQL Query: SQL Query:

NoSQL vs. MySQL
MySQL, > 50 GB data: writes average ~300 ms, reads average ~350 ms
Cassandra, > 50 GB data: writes average 0.12 ms, reads average 15 ms

NoSQL Pros/Cons
Pros:
High scalability
Distributed computing
Lower cost
Schema flexibility, semi-structured data
No complicated relationships
Cons:
No standardization
Limited query capabilities
Eventually consistent model is not intuitive to program for

Non-Relational: The concept of joining tables together by relations is non-existent.
Distributed: A network of interconnected computers, controlled by a central Database Management System.
Open-Source: Anyone can make changes to the original source code.
Horizontally Scalable: Using multiple computers as one unit to increase productivity.

Non-Relational
Relational databases join tables together using Primary Key / Foreign Key relationships.
Non-relational databases have no such structure.
Items are aggregated into one file, much like a giant Excel spreadsheet.
Prone to data duplication.
Difficult to update records.

Distributed
Non-relational databases can easily be spread out over multiple machines on the same network.
Each machine in the distributed network can carry the information most relevant to its area.
Controlled by the DDBMS (Distributed Database Management System).

Open-Source Source code is generally available to the open public
Improve the software as needed Share with the community

Horizontally Scalable
Vertical

Other Important Terms
Denormalization - optimizing read performance by adding redundant data or grouping data, in order to improve scalability and performance.
It does NOT mean that the data has not been normalized. Denormalization should ideally take place after 3NF has been achieved.
Constraints are used to ensure that redundant copies of data are synchronized.
Materialized View - a database object that contains the results of a query.
The query result is cached but can be updated from the original query as necessary.

Other Important Terms
Keyspace - object that holds together all column families of a design.
Outermost grouping of data in the datastore. Resembles a schema in an RDBMS.
Column Family - tuple (pair) consisting of a key-value pair, where the key is mapped to a value that is a set of columns.
Object that contains columns of related data. Resembles a table in an RDBMS.

Other Important Terms
Super Column Family - tuple (pair) consisting of a key-value pair, where the key is mapped to a value that is a set of column families.
Similar to a view in an RDBMS.
Column (data store) - tuple (triplet) consisting of a unique name, a value, and a timestamp.
The timestamp distinguishes old data from new data.
Not to be confused with a standard relational database column.
Lowest-level object in a keyspace.
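
The (name, value, timestamp) triplet can be sketched in a few lines of Python. The reconciliation rule shown, where the newer timestamp wins, is one common way to use the timestamp and is assumed here for illustration. The `Column` and `reconcile` names and the sample values are hypothetical.

```python
from typing import NamedTuple


class Column(NamedTuple):
    name: str
    value: str
    timestamp: int  # supplied by the writer, e.g. microseconds since the epoch


def reconcile(a: Column, b: Column) -> Column:
    """Return the newer of two versions of the same column (last write wins)."""
    if a.name != b.name:
        raise ValueError("can only reconcile versions of the same column")
    return a if a.timestamp >= b.timestamp else b


old = Column("email", "sam@old.example", timestamp=1_000)
new = Column("email", "sam@new.example", timestamp=2_000)
winner = reconcile(old, new)
```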

Other Important Terms
Database Shard - a horizontal partition in a database or a search engine. Each partition is a separate shard.
Shards can be distributed to separate hardware, reducing the number of rows in each table.
Not to be confused with horizontal partitioning, which refers to splitting one or more tables by rows within a single schema or database server.
Sharding - the process of forming shards within the distributed database system.
Traditionally done by hand coding. Auto-sharding code is highly sought after.
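
A minimal sketch of how rows might be routed to shards, assuming a simple hash-modulo scheme. The shard count, the `shard_for` helper, and the sample keys are hypothetical. Real engines use their own partitioners.

```python
import hashlib

NUM_SHARDS = 4  # hypothetical number of shards


def shard_for(key: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a row key to a shard index with a stable hash."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# Each shard holds only its own subset of the rows.
shards = [dict() for _ in range(NUM_SHARDS)]
for user_id in ("alice", "bob", "carol", "dave", "erin"):
    shards[shard_for(user_id)][user_id] = {"id": user_id}
```

With plain modulo routing, changing `NUM_SHARDS` reassigns most keys, which is the problem consistent hashing is meant to reduce.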

Other Important Terms
Consistent Hashing - a special kind of hashing in which, when the hash table is resized, only K / n keys need to be remapped on average.
K is the number of keys (rows).
n is the number of slots.
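
The K / n behaviour can be sketched with a small hash ring. This is an illustrative implementation under stated assumptions (MD5 as the hash, 100 virtual points per node, hypothetical node names), not the partitioner of any particular database.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    """Stable 64-bit hash used to place nodes and keys on the ring."""
    return int.from_bytes(hashlib.md5(value.encode("utf-8")).digest()[:8], "big")


class HashRing:
    def __init__(self, nodes, vnodes=100):
        # Each node is placed at several points ("virtual nodes") on the ring.
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def node_for(self, key: str) -> str:
        # A key belongs to the first ring point after its own hash,
        # wrapping around at the end of the ring.
        idx = bisect.bisect_right(self._points, _hash(key)) % len(self._ring)
        return self._ring[idx][1]


keys = [f"row-{i}" for i in range(10_000)]
before = HashRing(["node-a", "node-b", "node-c"])
after = HashRing(["node-a", "node-b", "node-c", "node-d"])

# Keys whose owner changed when the fourth node joined.
moved = [k for k in keys if before.node_for(k) != after.node_for(k)]
moved_fraction = len(moved) / len(keys)
```

Every key that moves goes to the newly added node, and `moved_fraction` should land in the neighbourhood of 1/4 (K / n with n = 4) rather than near 1.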

All your BASE are belonging to NoSQL
A BASE system gives up on strong (immediate) consistency.
Basically Available indicates that the system does guarantee availability.
Soft state indicates that the state of the system may change over time, even without input.
Eventual consistency indicates that the system will become consistent over time, provided it doesn’t receive input during that time.
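
Eventual consistency can be sketched with two in-memory replicas that accept writes independently and converge after a background synchronization pass. The `Replica` class, the `sync` function, and the timestamp-based conflict rule are assumptions made for illustration.

```python
class Replica:
    def __init__(self):
        self.data = {}  # key -> (timestamp, value)

    def write(self, key, value, timestamp):
        current = self.data.get(key)
        # Keep an incoming value only if it is newer than what is stored.
        if current is None or timestamp > current[0]:
            self.data[key] = (timestamp, value)

    def read(self, key):
        entry = self.data.get(key)
        return None if entry is None else entry[1]


def sync(a: Replica, b: Replica) -> None:
    """Exchange entries in both directions; the newer timestamp wins."""
    for src, dst in ((a, b), (b, a)):
        for key, (timestamp, value) in list(src.data.items()):
            dst.write(key, value, timestamp)


r1, r2 = Replica(), Replica()
r1.write("cart", ["book"], timestamp=1)         # accepted by replica 1 only
stale = r2.read("cart")                          # replica 2 has not seen it yet
r2.write("cart", ["book", "pen"], timestamp=2)  # a later write lands on replica 2
sync(r1, r2)                                     # background synchronization pass
```

Before `sync`, the two replicas disagree (soft state). After it, with no further writes, both return the same value.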

CAP Theorem (Brewer’s Theorem)
There are three basic requirements which exist in a special relation when designing for a distributed architecture.
Consistency ‘C’ - the data in the database remains consistent after the execution of an operation.
Availability ‘A’ - the system is always on, with no downtime.
Partition Tolerance ‘P’ - the system continues to function even if the communication among the servers is unreliable.

CAP Theorem Cont.
The CAP theorem states that a distributed system can satisfy only 2 of the 3 requirements at the same time.
Current NoSQL databases follow different combinations of C, A, and P.
CA - Single-site cluster, therefore all the nodes are always in contact.
CP - Some data may not be accessible, but the rest is still consistent/accurate.
AP - System is still available under partitioning, but some of the data may be inaccurate.

Challenges of NoSQL
Maturity - In comparison, RDBMS systems have been around for a long time. Most NoSQL alternatives are in pre-production versions with many key features yet to be implemented.
Support - Most NoSQL systems are open-source projects, and the companies that offer support are small start-ups without the global reach, support services, or credibility of Oracle, Microsoft, or IBM.

Challenges of NoSQL
Analytics and Business Intelligence - NoSQL databases have evolved to meet the scaling demands of Web 2.0 applications.
Administration - The design goal for NoSQL is to provide a zero-admin solution, but as of today it requires a lot of skill to install and a lot of effort to maintain.
Expertise - Almost all NoSQL developers are still learning how to use and develop for NoSQL.

Advantages of NoSQL
Elastic Scaling - NoSQL databases are designed to expand transparently to take advantage of new nodes, and they are usually designed with low-cost commodity hardware in mind.
Big Data - The volumes of data that can be handled by NoSQL systems are greater than what can be handled by the biggest RDBMS.
No DBA - NoSQL databases are designed from the ground up to require less management: automatic repair, data distribution, and simpler data models lead to lower administration and tuning requirements.

Advantages of NoSQL
Economic - NoSQL databases typically use clusters of cheap commodity servers to manage the ever-expanding amount of data and transactions.
Flexible Data Models - NoSQL databases have more relaxed data model restrictions. Key-value stores and document databases allow the application to store virtually any structure it wants in a data element.

Taxonomy (Data Models)
Key-value stores are the simplest NoSQL databases. Every single item in the database is stored as an attribute name (or "key"), together with its value. Examples of key-value stores are Riak and Voldemort. Some key-value stores, such as Redis, allow each value to have a type, such as "integer", which adds functionality.
Document databases pair each key with a complex data structure known as a document. Documents can contain many different key-value pairs, or key-array pairs, or even nested documents.
Graph stores are used to store information about networks, such as social connections. Graph stores include Neo4J and HyperGraphDB.
Column stores such as Cassandra and HBase are optimized for queries over large datasets, and store columns of data together, instead of rows.

Key-Value Stores
Examples: Tokyo Cabinet/Tyrant, Redis, Voldemort, Oracle BDB
Typical applications: Content caching (focus on scaling to huge amounts of data, designed to handle massive load), logging, etc.
Strengths: Fast lookups
Weaknesses: Stored data has no schema
Accessing memory locations in assembly-level programming essentially follows the key-value pattern, with the memory location’s address serving as the key and the value stored at that address. The computer science concept of the hash table serves as another example, with a hash function transforming the key into an index used to find its associated value.
This pattern gets used again and again in programming and computer engineering because of one word: speed. It can be argued that computers at their lowest (and fastest) level depend on the key-value pattern, so it makes sense that the key-value database is much more suitable for the requirements of massive Big Data than the relational model. In short, key-value data stores are much faster.
The ACID (atomic, consistent, isolated, durable) principle normally present in relational databases is not supported at the massive throughput typical of many systems using a key-value database. Because of this, many key-value systems use an “eventually consistent” model, widely used in parallel processing and distributed systems.
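
The hash-table idea described above can be sketched as a tiny in-memory key-value store. The class name, the bucket count, and the sample keys are hypothetical. Production stores add persistence, replication, and resizing on top of the same basic pattern.

```python
class TinyKeyValueStore:
    """A fixed-size hash table with separate chaining."""

    def __init__(self, num_buckets: int = 8):
        self._buckets = [[] for _ in range(num_buckets)]

    def _index(self, key: str) -> int:
        # The hash function turns the key into a bucket index.
        return hash(key) % len(self._buckets)

    def put(self, key: str, value) -> None:
        bucket = self._buckets[self._index(key)]
        for i, (existing_key, _) in enumerate(bucket):
            if existing_key == key:
                bucket[i] = (key, value)  # overwrite the existing entry
                return
        bucket.append((key, value))

    def get(self, key: str, default=None):
        for existing_key, value in self._buckets[self._index(key)]:
            if existing_key == key:
                return value
        return default


store = TinyKeyValueStore()
store.put("session:42", {"user": "sam"})
store.put("session:42", {"user": "sam", "cart": 3})
```

The store never inspects the value, which is why lookups are fast and also why the stored data has no schema.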

Oracle Embraces NoSQL

Oracle Embraces NoSQL
Distributed key-value database.
Designed to provide highly reliable, scalable, and available data storage across a configurable set of systems that function as storage nodes.
Data is stored as key-value pairs, which are written to particular storage node(s) based on the hashed value of the primary key.
Storage nodes are replicated to ensure high availability, rapid failover in the event of a node failure, and optimal load balancing of queries.
Customer applications are written using an easy-to-use Java/C API to read and write data.

Oracle Embraces NoSQL
Utilizes storage nodes; more storage nodes provide greater throughput.
Storage Node Agent (SNA) monitors each node's behavior.
Replication nodes work in groups to serve the same data.
Replication factor of 3.
Single-master architecture: the master node replicates to the replication nodes.
An election system elects a new master in case of failure.

Column Stores
Examples: Cassandra, HBase, Riak
Typical applications: Distributed file systems
Data model: Columns → column families
Strengths: Fast lookups, good distributed storage of data
Weaknesses: Very low-level API

Apache Cassandra Project
Scalability and high availability without compromising performance
Uses column indexes
Denormalization
Materialized views
Built-in caching

Apache Cassandra Project
Used in over 1500 companies with large, active data sets.
Largest cluster has 300 TB of data on over 400 machines.
Replication across multiple data centers allows failed nodes to be replaced with no downtime.
Every node is identical, so there is no single point of failure.
Users can choose between synchronous and asynchronous replication.

Document Databases
Examples: CouchDB, MongoDB
Typical applications: Web applications (similar to key-value stores, but the DB knows what the value is)
Data model: Collections of key-value collections
Strengths: Tolerant of incomplete data
Weaknesses: Query performance, no standard query syntax

Hu - MongoDB - us
Stores data in the form of BSON (Binary JSON) documents with dynamic schemas, making the integration of data in certain types of applications easy and fast.
Most talked-about NoSQL DBMS technology because it features auto-sharding, replication, schemaless design, scalability, and more.

Hu - MongoDB - us
Full indexing support - index on any attribute
Replicable - mirror across WAN and LAN
Auto-sharding
Document-based querying
Flexible aggregation
GridFS allows for storage of data files larger than BSON allows.
Instead of storing a file in a single document, GridFS divides a file into parts, or chunks, and stores each of those chunks as a separate document. By default GridFS limits chunk size to 255 kB. GridFS uses two collections to store files. One collection stores the file chunks, and the other stores file metadata.
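
The chunking idea can be sketched in plain Python without a database. This is not the MongoDB driver API. The two returned structures stand in for the two collections (file metadata and chunks), the document field names are loosely modeled on GridFS, and the function names and sample file are hypothetical.

```python
CHUNK_SIZE = 255 * 1024  # 255 kB, the default chunk size quoted above


def split_into_chunks(file_id, filename, data: bytes, chunk_size: int = CHUNK_SIZE):
    """Return (file_document, chunk_documents) for one file."""
    chunks = [
        {"files_id": file_id, "n": n, "data": data[offset:offset + chunk_size]}
        for n, offset in enumerate(range(0, len(data), chunk_size))
    ]
    file_doc = {
        "_id": file_id,
        "filename": filename,
        "length": len(data),
        "chunkSize": chunk_size,
    }
    return file_doc, chunks


def reassemble(chunks) -> bytes:
    """Rebuild the file by concatenating chunks in sequence order."""
    return b"".join(c["data"] for c in sorted(chunks, key=lambda c: c["n"]))


payload = bytes(600 * 1024)  # 600 kB of zero bytes as a stand-in file
file_doc, chunk_docs = split_into_chunks(1, "video.bin", payload)
```

A 600 kB file yields three chunk documents (255 kB, 255 kB, and 90 kB) plus one metadata document.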

Graph Databases
Graph databases store data in graphs to easily represent relationships in the data.
A graph records data in nodes with properties.
Nodes can have unlimited properties, but are generally broken up into multiple nodes.
Useful for answering questions based on related information.
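
A small Python sketch of the idea: nodes carry properties, relationships connect them, and a question such as "who is within two hops of Ann?" becomes a graph traversal. The node names, the `knows` relationship, and the `within_hops` helper are hypothetical.

```python
from collections import deque

# Hypothetical nodes with properties, and undirected "knows" relationships.
nodes = {
    "ann": {"name": "Ann", "city": "Austin"},
    "bob": {"name": "Bob", "city": "Boston"},
    "cy": {"name": "Cy", "city": "Chicago"},
    "dee": {"name": "Dee", "city": "Denver"},
}
knows = {
    "ann": {"bob"},
    "bob": {"ann", "cy"},
    "cy": {"bob", "dee"},
    "dee": {"cy"},
}


def within_hops(start: str, max_hops: int) -> set:
    """Breadth-first search: nodes reachable from start in at most max_hops steps."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in knows[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, depth + 1))
    return seen - {start}


friends_of_friends = within_hops("ann", 2)
```

In a relational schema the same question would typically need a self-join per hop. In a graph model it is a walk along stored relationships.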

Neo4J
Highly scalable
Fully ACID
Intuitive graphical models
Custom disk-based native storage engine
Massively scalable, with potential for BILLIONS of nodes
Highly available

Neo4J Expressive, powerful, human readable graph query language EX:
MATCH (a:Actor { name:"Keanu Reeves" }) RETURN a

Other NoSQL DBMS Products Cont.
CouchDB - stores data in the form of a collection of documents. Each document is a bunch of ‘keys’ and corresponding ‘values’. CouchDB supports indices, queries, and views. It uses JSON to store data, JavaScript as its query language using MapReduce, and HTTP for the API.
Redis - an in-memory key-value data store. Mostly used as a caching mechanism because it stores data in RAM, making it extremely fast when retrieving data. It is a data structure server and not a replacement for a traditional database. Used in combination with products like MySQL to deliver high performance when data needs to be delivered rapidly.

Other NoSQL DBMS Products Cont.
Hadoop - an open-source framework, written in Java, that supports data-intensive distributed applications. It supports applications running on large clusters of computers and allows data to be analyzed across many different computers. Designed to scale up from single servers to thousands of machines.
There are currently 150 different NoSQL databases.

Companies That Implement NoSQL
Google - Bigtable
Facebook - Cassandra
Mozilla - HBase
Adobe - HBase
Foursquare - MongoDB
LinkedIn - Voldemort
Digg - Redis
Twitter - Hadoop, Pig, Cassandra

Questions? Tough!

Sources:
