Presentation is loading. Please wait.

Presentation is loading. Please wait.

Database Design (constructed by Hanh Pham based on slides from books and websites listed under reference section) NoSQL Databases.

Similar presentations


Presentation on theme: "Database Design (constructed by Hanh Pham based on slides from books and websites listed under reference section) NoSQL Databases."— Presentation transcript:

1 Database Design (constructed by Hanh Pham based on slides from books and websites listed under reference section) NoSQL Databases

2 Outline Introduction NoSQL Databases Key-Value Databases
Document Databases Column Databases Graph Databases

3 NoSQL Databases 1. Introduction

4 SQL Models NOT ALWAYS Good
End Users & SQL models: The end user is often interested in aggregated reporting information, not in separate data items and SQL pays a lot of attention to this aspect. No one can expect human users to explicitly control concurrency, integrity, consistency, or data type validity. That’s why SQL pays a lot of attention to transactional guaranties, schemas, and referential integrity. HOWEVER, non-human users (software applications) are not so often interested in in-database aggregation and able to control, at least in many cases, integrity and validity themselves. Moreover, elimination of these features had an extremely important influence on the performance and scalability of the stores.

5 SQL Models NOT ALWAYS Good
IMPEDANCE MISMATCH: the difference between the relational model and the in-memory data structures. The relational data model organizes data into a structure of tables and rows, or more properly, relations and tuples. This foundation on relations provides a certain elegance and simplicity, but it also introduces limitations. In particular, the values in a relational tuple have to be simple—they cannot contain any structure, such as a nested record or a list. This limitation isn’t true for in-memory data structures, which can take on much richer structures than relations. As a result, if you want to use a richer in memory data structure, you have to translate it to a relational representation to store it on disk. Hence the impedance mismatch—two different representations that require translation:

6 SQL Models NOT ALWAYS Good
An order, which looks like a single unit in an application user interface is split into many rows from many tables in a relational database

7 What is NOSQL? NOSQL (not only SQL) is a kind of DBMS (database management systems) where having relations and SQL(Structured Query Language) are not a must. Goals: simple design, horizontal scaling availability Use in: mostly big data and real-time web applications such as the ones at Google, Facebook, Amazon, Twitter, etc. to manage data for which SQL was not the best fit.

8 NOSQL vs. SQL systems While there are numerous characteristics that differentiate SQL and NOSQL the two most significant are Scaling and Modeling. Scaling – Traditionally SQL does not lend itself to massively parallel processing, which lead to larger computers (scale up) vs. distribution to numerous commodity servers, virtual machines or cloud instances (scale out). Modeling – SQL databases are highly normalized and require pre-defined data models prior to inserting data into the system. In contrast NOSQL databases do not require (although they support) pre-defined data model(s).

9 NOSQL vs. SQL systems Compare: SQL vs. NoSQL systems
Structured vs. Unstructured Tables, fields, pairs vs. unstructured text Transactions vs. Analytics Federated vs. Persisted Simpler & Smaller vs. Big Data (Volume, Variety and Velocity) Precision vs. Discovery

10 NOSQL vs. SQL systems ACID for SQL systems: VS.
Atomicity Consistency Isolation Durability VS. BASE for NoSQL systems: Basically Available Soft State Eventually Consistent

11 ACID of SQL A is for atomicity.
Atomicity, as the name implies, describes a unit that cannot be further divided. All the steps of a transaction/command have to complete or none of them completed. In essence, the set of steps is indivisible. You have to complete all of them as a single indivisible unit, or you complete none of them.

12 ACID of SQL C is for consistency.
In relational databases, this is known as strict consistency. In other words, a transaction does not leave a database in a state that violates the integrity of data. Transferring $100 from your savings account to your checking account must end with either (a) $100 more in your checking account and $100 less in your savings account or (b) both accounts have the same amount as they had at the start of the transaction. Consistency ensures no other possible state could result after a transfer operation.

13 ACID of SQL I is for isolation.
Isolated transactions are not visible to other users until transactions are complete. For example, in the case of a bank transfer from a savings account to a checking account, no one could not read your account balances while the funds are being deducted from your savings account but before they are added to your checking account. Databases can allow different levels of isolation. This can allow, for example, lost updates in which a query returns data that does not reflect the most recent update because the update operation has not completely finished

14 ACID of SQL D is for durability.
This means that once a transaction or operation is completed, it will remain even in the event of a power loss. In effect, this means that data is stored on disk, flash, or other persistent media.

15 BASE of SQL BA is for basically available. This means that there can be a partial failure in some parts of the distributed system and the rest of the system continues to function. For example, if a NoSQL database is running on 10 servers without replicating data and one of the servers fails, then 10% of the users’ queries would fail, but 90% would succeed. NoSQL databases often keep multiple copies of data on different servers. This allows the database to respond to queries even if one of the servers has failed.

16 BASE of SQL S is for soft state.
general, soft state means data will expire if it is not refreshed. Here, in NoSQL operations, it refers to the fact that data may eventually be overwritten with more recent data. This property overlaps with the third property of BASE transactions, eventually consistent.

17 BASE of SQL E is for eventually consistent. This means that there may be times when the database is in an inconsistent state. For example, some NoSQL databases keep multiple copies of data on multiple servers. There is, however, a possibility that the multiple copies may not be consistent for a short period of time. This can occur when a user or program updates one copy of the data and other copies continue to have the old version of the data. Eventually, the replication mechanism in the NoSQL database will update all copies, but in the meantime, the copies are inconsistent.

18 BASE of SQL Types of Eventual Consistency: • Casual consistency • Read-your-writes consistency • Session consistency • Monotonic read consistency • Monotonic write consistency

19 BASE of SQL Session Consistency
Session consistency ensures read-your-writes consistency during a session. You can think of a session as a conversation between a client and a server or a user and the database. As long as the conversation continues, the database “remembers” all writes you have done during the conversation. If the session ends and you start another session with the same server, there is no guarantee it will “remember” the writes you made in the previous session. A session may end if you log off an application using the database or if you do not issue commands to the database for so long that the database assumes you no longer need the session and abandons it.

20 BASE of SQL Monotonic Read Consistency

21 BASE of SQL Monotonic Read Consistency Monotonic read consistency ensures that if you issue a query and see a result, you will never see an earlier version of the value. Let’s assume Alice is yet again updating a customer’s outstanding balance. The outstanding balance is currently $1,500. She updates it to $2,500. Bob queries the database for the customer’s balance and sees that it is $2,500. If Bob issues the query again, he will see the balance is $2,500 even if all the servers with copies of that customer’s outstanding balance have not updated to the latest value.

22 BASE of SQL Monotonic Write Consistency
Monotonic write consistency ensures that if you were to issue several update commands, they would be executed in the order you issued them. Let’s consider a variation on the outstanding balance example. Alice is feeling generous today and decides to reduce all customers’ outstanding balances by 10%. Charlie, one of her customers, has a $1,000 outstanding balance. After the reduction, Charlie would have a $900 balance. Now imagine if Alice continues to process orders. Char- lie has just ordered $1,100 worth of material. His outstanding balance is now the sum of the previous outstanding balance ($900) and the amount of the new order ($1,100) or $2,000.

23 BASE of SQL Monotonic Write Consistency (cont.)
Now consider what would happen if the NoSQL database performed Alice’s operations in a different order. Charlie started with a $1,000 outstanding balance. Next, instead of having the discount applied, his record was first updated with the new order ($1,100). His outstanding balance becomes $2,100. Now, the 10% discount operation is executed and his outstanding balance is set to $2,100–$210 or $1890. Monotonic write consistency is obviously an important feature. If you cannot guarantee the order of operations in the database, you would have to build features into your program to guarantee operations exe- cute in the order you expect.

24 Aggregate Data Models Application developers use vastly different types of data structures. The aggregate models are driven by  Domain Driven Design (Eric Evans). An aggregate is a collection of data that we interact with as a unit. These units of data or aggregates form the boundaries for ACID operations with the database, Key-value, Document, and Column-family databases can all be seen as forms of aggregate-oriented database. Aggregates make it easier for the database to manage data storage over clusters, since the unit of data now could reside on any machine and when retrieved from the database gets all the related data along with it. Aggregate-oriented databases work best when most data interaction is done with the same aggregate, for example when there is need to get an order and all its details, it better to store order as an aggregate object but dealing with these aggregates to get item details on all the orders is not elegant. Aggregate-oriented databases make inter-aggregate relationships more difficult to handle than intra-aggregate relationships. Aggregate-ignorant databases are better when interactions use data organized in many different formations. Aggregate-oriented databases often compute materialized views to provide data organized differently from their primary aggregates. This is often done with map-reduce computations, such as a map-reduce job to get items sold per day.

25 Distribution Models Aggregate oriented databases make distribution of data easier, since the distribution mechanism has to move the aggregate and not have to worry about related data, as all the related data is contained in the aggregate. There are two styles of distributing data: Sharding: Sharding distributes different data across multiple servers, so each server acts as the single source for a subset of data. Replication: Replication copies data across multiple servers, so each bit of data can be found in multiple places. Replication comes in two forms, Master-slave replication makes one node the authoritative copy that handles writes while slaves synchronize with the master and may handle reads. Peer-to-peer replication allows writes to any node; the nodes coordinate to synchronize their copies of the data. Master-slave replication reduces the chance of update conflicts but peer-to-peer replication avoids loading all writes onto a single server creating a single point of failure. A system may use either or both techniques. Like Riak database shards the data and also replicates it based on the replication factor.

26 CAP Theorem In a distributed system, managing consistency(C), availability(A) and partition toleration(P) is important, Eric Brewer put forth the CAP theorem which states that in any distributed system we can choose only two of consistency, availability or partition tolerance. Many NoSQL databases try to provide options where the developer has choices where they can tune the database as per their needs. For example if you consider Riak a distributed key-value database. There are essentially three variables r, w, n where r=number of nodes that should respond to a read request before its considered successful. w=number of nodes that should respond to a write request before its considered successful. n=number of nodes where the data is replicated aka replication factor. In a Riak cluster with 5 nodes, we can tweak the r,w,n values to make the system very consistent by setting r=5 and w=5 but now we have made the cluster susceptible to network partitions since any write will not be considered successful when any node is not responding. We can make the same cluster highly available for writes or reads by setting r=1 and w=1  but now consistency can be compromised since some nodes may not have the latest copy of the data. The CAP theorem states that if you get a network partition, you have to trade off availability of data versus consistency of data. Durability can also be traded off against latency, particularly if you want to survive failures with replicated data.

27 Why choose NoSQL database ?
To improve programmer productivity by using a database that better matches an application's needs. To improve data access performance via some combination of handling larger data volumes, reducing latency, and improving throughput. It's essential to test your expectations about programmer productivity and/or performance before committing to using a NoSQL technology. Since most of the NoSQL databases are open source, testing them is a simple matter of downloading these products and setting up a test environment. Even if NoSQL cannot be used as of now, designing the system using service encapsulation supports changing data storage technologies as needs and technology evolve. Separating parts of applications into services also allows you to introduce NoSQL into an existing application.

28 NoSQL Databases 2. NoSQL Databases

29 Types of NOSQL Databases
Key-Value: Redis, Riak, Memcached, BerkeleyDB, LevelDB, Dynamo, FoundationDB, HyperDex, MemcacheDB, FairCom c-treeACE, Aerospike, OrientDB, MUMPS Document: MongoDB, CouchDB, OrientDB, Lotus Notes, Clusterpoint, Apache Couchbase, HyperDex, MarkLogic, Qizx, BaseX, Clusterpoint, eXist, Jackrabbit, OpenLink Virtuoso, RavenDB, SimpleDB, Terrastore Column: Amazon SimpleDB, Cassandra, Accumulo, Druid, HBase, Hypertable, Vertica. Graph: FlockDB, HyperGraphDB, OrientDB, Allegro, Neo4J, InfiniteGraph, Virtuoso, Stardog, DEX, OpenLink Virtuoso, Pregel, Sones GraphDB, OWLIM Multi-model: OrientDB, FoundationDB, ArangoDB, Alchemy Database, CortexDB

30 Types of NOSQL Databases
Key Value BigTable, CDB, Keyspace, LevelDB, membase, MemcacheDB, MongoDB, OpenLink Virtuoso, Tarantool, Tokyo Cabinet, TreapDB, Tuple space Eventually‐consistent - Apache Cassandra, Dynamo, Hibari, OpenLink Virtuoso, Project Voldemort, Riak Hierarchical - GT.M, InterSystems Caché Tabular – BigTable, Apache Hadoop, Apache Hbase, Hypertable, Mnesia, OpenLink Virtuoso Object Database - db4o, Eloquera, GemStone/S, InterSystems Caché, JADE, NeoDatis ODB, ObjectDB, Objectivity/DB, ObjectStore, OpenLink Virtuoso, Versant Object Database, Wakanda, ZODB Multivalue databases - Extensible Storage Engine (ESE/NT), jBASE, OpenQM, OpenInsight , Rocket U2, D3 Pick database, InterSystems Caché, InfinityDB Tuple store- Apache River, OpenLink Virtuoso, Tarantool

31 How NOSQL models Relate to Traditional ones ?
Source: CIO’s Guide to NOSQL, Dan McCreary, June 2012

32 NoSQL Databases 3. Key-Value Databases

33 Key-Value Databases Key-value pair databases are the simplest form of NoSQL databases. These databases are modeled on two components: keys and values. Example: Airline tags for checked bags are analogous to keys used to store data in a key-value database.

34 Key-Value Databases Keys are identifiers associated with values. They are analogous to tags you get when you check luggage at the airport. The tag you receive has an identifier associated with your luggage. With your tag, you can find your luggage more efficiently than without it. Imagine you have a connecting flight and your luggage did not make it to your connecting flight. If your luggage doesn’t have a tag, an airline employee searching for your bag would have to look through all undelivered bags to determine which is yours. Now imagine that the airline organizes undelivered bags by tag number. If the airline employee knows your ticket number, she or he could go right to that spot in the luggage area to retrieve your bag. Airlines generate luggage tags when you check a bag. If you were assigned the task of designing a ticket-generating program, you might decide to have tickets with two parts: a flight number and a sequential number. ❖ Note This is an oversimplified scheme because it does not account for flights with the same number that occurs on different days, but we will continue with it anyway.

35 Key-Value Databases You can use a similar approach when generating keys in a key-value database. Let’s assume you are building a key-generating program for an e-commerce website. You realize you need to track five pieces of information about each visitor to your site: the customer’s account number, name, address, number of items in the shopping cart, and customer type indicator. The customer type indicator identifies customers enrolled in the company’s loyalty program. All of these values are associated with a customer, so you can generate a sequential number for each customer. For each item you are storing, you create a new key by appending the name of the item you are storing to the customer number. For example, data about the first customer in the system would use keys 1.accountNumber, 1.name, 1.address, 1.numItems, and 1.custType (see Figure 2.11).

36 Key-Value Databases Figure 2.11 Key-value databases are modeled on a simple, two-part data structure consisting of an identifier and a data value.

37 Key-Value Databases Values are data stored along with keys. Like luggage, values in a key-value database can store many different things. Values can be as simple as a string, such as a name, or a number, such as the number of items in a customer’s shopping cart. You can store more complex values, such as images or binary objects, too.

38 Key-Value Databases Key-value databases give developers a great deal of flexibility when storing values: any type and any length. For example: +) Key=Cust123.address and Value= “543 N. Main St.” or “543 North Main St. Portland, OR ” +) Key=Emp328.photo and Value=a picture stored as a binary large object (BLOB) type or a string value such as “Not available”. Because key-value databases allow virtually any data type in values, it is important for software developers to implement checks in their programs. For example, a program that expects either a BLOB or a string with a value of “Not available” might not function as expected if the string “No photo” is used instead. A programmer might decide to support any BLOB or string as valid values, but it is up to the programmer to determine the range of valid values and enforce those choices as needed.

39 Key-Value Vs. Relational
Key-value databases are modeled on minimal principles for storing and retrieving data. Unlike in relational databases, there are no tables, so there are no features associated with tables, such as columns and constraints on columns. There is no need for joins in key-value databases, so there are no foreign keys. Key-value databases do not support a rich query language such as SQL. Some key-value databases support buckets, or collections, for creating separate namespaces within a database. This can be used to implement something analogous to a relational schema, especially when combined with a key-naming convention like the one described previously.

40 Key-Value Vs. Relational
The key-naming convention outlined above maps to patterns seen in relational database tables.

41 Key-Value Vs. Relational
If you have developed relational data models, you might have noticed parallels between the key-naming convention and tables, primary keys, and columns. The key-naming convention described previously basically uses the convention of concatenating a table name or symbol, a primary key, and a column name. For example, the key ‘cust123.address’ would be equivalent to a relational table named cust or customer, with a column called address, and a row identified by the primary key ID of 123.

42 Key-Value Databases: Conclusions
Key-value stores are the simplest NoSQL data stores to use from an API perspective. The client can either get the value for the key, put a value for a key, or delete a key from the data store. The value is a blob that the data store just stores, without caring or knowing what's inside; it's the responsibility of the application to understand what was stored. Since key-value stores always use primary-key access, they generally have great performance and can be easily scaled. Some of the popular key-value databases are Riak, Redis (often referred to as Data Structure server), Memcached and its flavors, Berkeley DB, HamsterDB (especially suited for embedded use), Amazon DynamoDB (not open-source), Project Voldemort and Couchbase. All key-value databases are not the same, there are major differences between these products, for example: Memcached data is not persistent while in Riak it is, these features are important when implementing certain solutions. Lets consider we need to implement caching of user preferences, implementing them in memcached means when the node goes down all the data is lost and needs to be refreshed from source system, if we store the same data in Riak we may not need to worry about losing data but we must also consider how to update stale data. Its important to not only choose a key-value database based on your requirements, it's also important to choose which key-value database.

43 NoSQL Databases 4. Document Databases

44 Document Databases Document databases, also called document-oriented databases, use a key-value approach where values are documents. In this case, documents are semistructured entities, typically in a standard format such as JavaScript Object Notation (JSON) or Extensible Markup Language (XML). Here, “document” does not refer to word processing or other office productivity files. It refers to data structures that are stored as strings or binary representations of strings. JSON = open standard format that uses human-readable text to represent data objects consisting of attribute–value pairs. It is an alternative to XML. JSON is a language-independent data format.

45 Document Databases JSON example: vs. XML:

46 Document Databases Documents Instead of storing each attribute of an entity with a separate key, docu- ment databases store multiple attributes in a single document. Here is a simple example of a document in JSON format: { firstName: "Alice", lastName: "Johnson", position: "CFO", officeNumber: "2-120", officePhone: " ", } One of the most important characteristics of document databases is you do not have to define a fixed schema before you add data to the database. Simply adding a document to the database creates the underlying data structures needed to support the document. The lack of a fixed schema gives developers more flexibility with document databases than they have with relational databases. For example, employees can have different attributes than the ones listed above.

47 Document Databases Another valid employee document is { firstName: "Bob", lastName: "Wilson", position: "Manager", officeNumber: "2-130", officePhone: " ", hireDate: "1-Feb-2010", terminationDate: "12-Aug-2014" } The attributes hireDate and terminationDate are in Bob’s document but not Alice’s. This is not a problem from the database perspective. Developers can add attributes as needed, but their programs are responsible for managing them. If you expect all employee documents to have first and last names, you should implement a check in your code that adds employee documents to ensure that the rule is enforced.

48 Document Databases Querying Documents You might be wondering, why couldn’t you store JSON or XML documents in key-value databases? Because key-value databases have few restrictions on the type of data stored as a value, you could store a JSON document as a value. The only way to retrieve such a document is by its key, however. Document databases provide application programming interfaces (APIs) or query languages that enable you to retrieve documents based on attribute values. For example, if you have a database with a collec- tion of employee documents called “employees,” you could use a state- ment such as the following to return the set of all employees with the position Manager: db.employees.find( { position:"Manager" }) As with relational databases, document databases typically support operators such as AND, OR, greater than, less than, and equal to.

49 Document vs. Relational Databases
A key distinction between document and relational databases is that document databases do not require a fixed, predefined schema. Another important difference is that documents can have embedded documents and lists of multiple values within a document. For example, the employee documents might include a list of previous positions an employee held within the company: { firstName: "Bob", lastName: "Wilson", positionTitle: "Manager", officeNumber: "2-130", officePhone: " ", hireDate: "1-Feb-2010", terminationDate: "12-Aug-2014" PreviousPositions: [ { position: "Analyst", StartDate:"1-Feb-2010", endDate:"10-Mar-2011" } { position: "Sr. Analyst", startDate: "10-Mar-2011" endDate:"29-May-2013" } ] } Embedding documents

50 Document vs. Relational Databases
Embedding documents or lists of values in a document eliminates the need for joining documents the way you join tables in a relational database. If there are cases where you stored a list of document identifiers in a document and want to look up attributes in the documents associated with those identifiers, then you would have to implement that operation in your program. Document databases are probably the most popular type of NoSQL database. They offer support for querying structures with multiple attributes, like relational databases, but offer more flexibility with regard to variation in the attributes used by each document. The next section discusses the column family database, which is another type of NoSQL database that shares some important characteristics with relational databases.

51 Document Databases: Conclusions
Documents are the main concept in document databases. The database stores and retrieves documents, which can be XML, JSON, BSON, and so on. These documents are self-describing, hierarchical tree data structures which can consist of maps, collections, and scalar values. The documents stored are similar to each other but do not have to be exactly the same. Document databases store documents in the value part of the key-value store; think about document databases as key-value stores where the value is examinable. Document databases such as MongoDB provide a rich query language and constructs such as database, indexes etc allowing for easier transition from relational databases. Some of the popular document databases we have seen are MongoDB, CouchDB , Terrastore, OrientDB, RavenDB, and of course the well-known and often reviled Lotus Notes that uses document storage.

52 NoSQL Databases 5. Column Databases

53 Column Databases Column family databases are perhaps the most complex of the NoSQL database types, at least in terms of the basic building block structures. Column family databases share some terms with relational databases, such as rows and columns, but you must be careful to understand important differences between these structures. A column is a basic unit of storage in a column family database. A column is a name and a value. (Some column family databases keep a time stamp along with a name and value) In this example, the column is named lastName and has a value of “Wilson”.

54 Column Databases A set of columns makes up a row. Rows can have the same columns, or they can have different columns. A row consists of one or more columns. Different rows can have different columns:

55 Column Databases When there are large numbers of columns, it can help to group them into collections of related columns. For example, first and last name are often used together, and office numbers and office phone numbers are frequently needed together. These can be grouped in collections called column families. As in document databases, column family databases do not require a predefined fixed schema. Developers can add columns as needed. Also, rows can have different sets of columns and super columns. Column family databases are designed for rows with many columns. It is not unusual for column family databases to support millions of columns.

56 Column vs. Relational Databases
Column family databases and relational databases are superficially similar. They both make use of rows and columns, for example. There are important differences in terms of data models and implementation details. One thing missing from column family databases is support for joining tables. You might have noticed that the term table has not been used when describing column family databases. This is intentional. Tables in relational databases have a relatively fixed structure, and the relational database management system can take advantage of that structure when optimizing the layout of data on drives and when retrieving data for read operations. Unlike a relational database table, however, the set of columns in a column family table can vary from one row to another.

57 Column vs. Relational Databases
In relational databases, data about an object can be stored in multiple tables. For example, a customer might have name, address, and contact information in one table; a list of past orders in another table; and a payment history in another. If you needed to reference data from all three tables at once, you would need to perform a join operation between tables. Column family databases are typically denormalized, or structured so that all relevant information about an object is in a single, possibly very wide, row. Query languages for column family databases may look similar to SQL. The query language can support SQL-like terms such as SELECT, INSERT, UPDATE, and DELETE as well as column family–specific operations, such as CREATE COLUMNFAMILY. The next section discusses a fourth type of NoSQL databases known as graph databases, which are well suited for addressing problems that require representing many objects and links between those objects. Social media, a transportation network, and an electric grid are just a few examples of areas where graph databases may be used.

58 Column Databases: Conclusions
Column-family databases store data in column families as rows that have many columns associated with a row key. Column families are groups of related data that is often accessed together. For a Customer, we would often access their Profile information at the same time, but not their Orders. Each column family can be compared to a container of rows in an RDBMS table where the key identifies the row and the row consists of multiple columns. The difference is that various rows do not have to have the same columns, and columns can be added to any row at any time without having to add it to other rows. When a column consists of a map of columns, then we have a super column. A super column consists of a name and a value which is a map of columns. Think of a super column as a container of columns. Cassandra is one of the popular column-family databases; there are others, such as HBase, Hypertable, and Amazon DynamoDB. Cassandra can be described as fast and easily scalable with write operations spread across the cluster. The cluster does not have a master node, so any read and write can be handled by any node in the cluster.

59 NoSQL Databases 6. Graph Databases

60 Graph Databases Graph databases are the most specialized of the four NoSQL databases discussed in this book. Instead of modeling data using columns and rows, a graph database uses structures called nodes and relationships (in more formal discussions, they are called vertices and edges). A node is an object that has an identifier and a set of attributes. A relationship is a link between two nodes that contain attributes about that relation. Nodes and Relationships There are many ways to use graph databases. Nodes can represent people, and relationships can represent their friendships in social networks. A node could be a city, and a relationship between cities could be used to store information about the distance and travel time between cities. Next figure includes an example of flying times between several cities.

61 Graph Databases Properties of relationships or nodes store attributes about relations between linked nodes. In this case, attributes include flying times between cities.

62 Graph Databases Both the nodes and relationships can have complex structures. For example, each city can have a list of airports along with demographic and geographic data about the city: Nodes can also have attributes to describe the node. In this case, attributes include information about the airports in the city along with population and geographic area. Graph databases get their name from a branch of mathematics called graph theory. Graph theory is the study of objects represented by vertices and relations represented by edges. Graph theory is not related to the study of charts and other visualizations sometimes referred to as graphs.

63 Graph vs. Relational Databases
Graph databases are designed to model adjacency between objects. Every node in the database contains pointers to adjacent objects in the database. This allows for fast operations that require following paths through a graph. For example, if you wanted to find all possible ways to fly from Montreal to Mexico City, you could start at the Montreal node and follow each of the adjacent nodes to Boston, Chicago, and Tokyo, and then to Mexico City. At the Boston node, you would find no relationship with Mexico City and assume that there are no direct flights available from Boston to Mexico City. From Chicago, a direct flight to Mexico City would take 3 hours and 50 minutes. That time, plus the 1 hour and 20 minutes to fly to Chicago would leave a total flying time of 5 hours and 10 minutes.

64 Graph vs. Relational Databases
Flying from Montreal to Tokyo to get to Mexico City is possible but hardly efficient. Because the relationship between Montreal and Tokyo shows a 13 hour 30 minute flight, over twice as long as the Montreal to Chicago to Mexico City route, you can safely stop following other routes from Tokyo to Mexico City. From Chicago, you could fly to Port- land, but like Boston, this does not lead to a direct flight to Mexico City. Finally, a direct flight from Montreal to Mexico City would take 5 hours, the fastest route available. Performing this same kind of analysis in a relational database would be more involved. We could easily represent the same data using a table:

65 Graph vs. Relational Databases

66 Graph vs. Relational Databases
Querying is more difficult for relational databases. You would have to write multiple SQL statements or use specialized recursive statements if they are provided (for example, Oracle’s CONNECT BY clause in SELECT statements) to find paths using the table representation of the data. Graph databases allow for more efficient querying when paths through graphs are involved. Many application areas are efficiently modeled as graphs and, in those cases, a graph database may streamline application development and minimize the amount of code you would have to write.

67 Graph Databases: Conclusions
Graph databases allow you to store entities and relationships between these entities. Entities are also known as nodes, which have properties. Think of a node as an instance of an object in the application. Relations are known as edges that can have properties. Edges have directional significance; nodes are organized by relationships which allow you to find interesting patterns between the nodes. The organization of the graph lets the data to be stored once and then interpreted in different ways based on relationships. Usually, when we store a graph-like structure in RDBMS, it's for a single type of relationship ("who is my manager" is a common example). Adding another relationship to the mix usually means a lot of schema changes and data movement, which is not the case when we are using graph databases. Similarly, in relational databases we model the graph beforehand based on the Traversal we want; if the Traversal changes, the data will have to change. In graph databases, traversing the joins or relationships is very fast. The relationship between nodes is not calculated at query time but is actually persisted as a relationship. Traversing persisted relationships is faster than calculating them for every query.

68 Graph Databases: Conclusions
Nodes can have different types of relationships between them, allowing you to both represent relationships between the domain entities and to have secondary relationships for things like category, path, time-trees, quad-trees for spatial indexing, or linked lists for sorted access. Since there is no limit to the number and kind of relationships a node can have, they all can be represented in the same graph database. Relationships are first-class citizens in graph databases; most of the value of graph databases is derived from the relationships. Relationships don't only have a type, a start node, and an end node, but can have properties of their own. Using these properties on the relationships, we can add intelligence to the relationship—for example, since when did they become friends, what is the distance between the nodes, or what aspects are shared between the nodes. These properties on the relationships can be used to query the graph. Since most of the power from the graph databases comes from the relationships and their properties, a lot of thought and design work is needed to model the relationships in the domain that we are trying to work with. Adding new relationship types is easy; changing existing nodes and their relationships is similar to data migration, because these changes will have to be done on each node and each relationship in the existing data.

69 References This slide is constructed using the slides and materials from the following books and websites: NoSQL Distilled, Pramod J. Sadalage and Martin Fowler NoSQL for Mere Mortals, Dan Sullivan OSEHRA Architecture Working Group, Ruoming Jin, Wikipedia


Download ppt "Database Design (constructed by Hanh Pham based on slides from books and websites listed under reference section) NoSQL Databases."

Similar presentations


Ads by Google