Aleksandar Talev & Vlatko Bojkovski

How to choose a data model: SQL Server or Azure Cosmos DB. Which, When and Where?

Thank you!

Speaker Info
- Aleksandar Talev: Director of RD, Microsoft MVP Data Platform, MCSE 2017, MCT, MCSA, MCPD, MCDBA. Semos DOO, Skopje, Macedonia
- Vlatko Bojkovski: Senior Engineer, MCSA, MCTS, MCSD

Agenda
- NoSQL vs SQL
- Cosmos DB Tour
- SQL Server vs Azure Cosmos DB
- Conclusion

What is NoSQL?
- A NoSQL database provides a mechanism for storing and retrieving data that is modeled by means other than the tabular relations used in relational databases.
- NoSQL actually stands for "Not Only SQL".
- The data structures used by NoSQL databases (e.g. key-value, wide column, graph, or document) differ from those used by default in relational databases, which makes some operations faster in NoSQL.

Advantages of NoSQL Databases
- Simpler to scale: inserting into an empty table takes about the same time as inserting into a table with billions of entries.
- Suitable for distributed systems.
- Can hold unstructured and semi-structured data.
- No need to map and maintain complex relationships.
- Fast to insert, because there is no need to take locks or check constraints.
- Fast to read, if the key is known.

Disadvantages of NoSQL Databases
- Consistency needs to be considered, as transactional updates across multiple entities are not guaranteed.
- Relationships between data need to be maintained externally to the data.
- Filtering and sorting on non-key data can be difficult and might result in full table scans.
- There is no familiar SQL.

NoSQL Databases Data Structures
Typical data structures:
- Key-value stores
- Document databases
- Columnar databases
- Graph databases
Example document:
{ customer: , orderid: 23dklnm, product: Dynamo, price: 3.40, currency: USD, discount: 0.05 }
Example column family:
CustomerID | Identity column family
1 | Title: Miss, FirstName: Sam, LastName: Coal
2 | Title: Mr, FirstName: Simon, MiddleName: Paul, LastName: Tindell
3 | Title: Dr, GivenName: Shreep, BirthName: Sandeshreep

Comparing documents and collections with relations

Overview of Cosmos DB

Cosmos DB - Global distribution
- Microsoft Azure is available in 54 regions, and more will become available.
- Azure Cosmos DB is available in all of them.
- It is classified as a Ring 0 Azure service: it will be available in any new region by default.
- Global distribution is a concept comparable to replication in an RDBMS.

Read and Write regions

Consistency model
- Consistency is a set of rules that define how distributed data becomes available to users.
- The factors involved are throughput and latency.
- Azure Cosmos DB offers comprehensive 99.99% SLAs which guarantee throughput, consistency, availability, and latency for Cosmos DB database accounts scoped to a single Azure region.

Consistency models types
- Strong consistency: guarantees that any item read returns the most recent version; does not work with several regions.
- Bounded staleness: reads may lag behind writes by at most K write operations or a time interval t. The administrator defines the lag threshold.
- Session (the default model): guarantees reading the most recent data within any session.
- Consistent prefix: similar to eventual consistency, but reads never see writes out of order. Good for retweets, likes and non-threaded comments. For writes (A,B,C) a reader may see (A), (A,B) or (A,B,C).
- Eventual consistency: guarantees that all of the replicas will eventually converge to reflect the most recent data.
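The consistent prefix rule can be illustrated with a small check. This is only a sketch of the guarantee as described on the slide, not Cosmos DB code; the function name isConsistentPrefix is made up for the example.

```javascript
// Returns true when `observed` is a prefix of `writes`, i.e. the reader
// saw the writes in their original order with no gaps.
function isConsistentPrefix(writes, observed) {
  if (observed.length > writes.length) return false;
  return observed.every((item, i) => item === writes[i]);
}

const writes = ["A", "B", "C"];
console.log(isConsistentPrefix(writes, ["A"]));           // true
console.log(isConsistentPrefix(writes, ["A", "B"]));      // true
console.log(isConsistentPrefix(writes, ["A", "B", "C"])); // true
console.log(isConsistentPrefix(writes, ["A", "C"]));      // false: B was skipped
console.log(isConsistentPrefix(writes, ["B", "A"]));      // false: out of order
```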

Throughput and latency
- Cosmos DB introduced a normalized quantity called the request unit (RU).
- The number of request units consumed by any given operation is deterministic.
- 1 RU is the cost of a GET (point read) of a 1 KB document.
- The number of request units per second is called throughput.
- Throughput can be estimated with the RU calculator.
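A rough throughput estimate multiplies the RU cost of each operation by how often it runs per second and adds up the results. The sketch below assumes a made-up workload; the RU costs are illustrative placeholders, not measured values, and estimateThroughput is a hypothetical helper rather than part of any SDK.

```javascript
// Hypothetical workload: RU cost per operation and expected operations per second.
// All numbers are placeholders chosen for the example.
const workload = [
  { name: "point read", ruPerOp: 1, opsPerSecond: 500 },
  { name: "insert", ruPerOp: 5, opsPerSecond: 100 },
  { name: "query", ruPerOp: 10, opsPerSecond: 20 },
];

// Sum of (RU cost x operations per second) over all operation types.
function estimateThroughput(operations) {
  return operations.reduce(
    (total, op) => total + op.ruPerOp * op.opsPerSecond,
    0
  );
}

console.log(estimateThroughput(workload)); // 500 + 500 + 200 = 1200 RU/s
```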

Monitoring performance
- Azure portal, Metrics blade: Overview, Throughput, Storage, Latency, Availability, Consistency, System
- Azure SDK: less detailed metrics from the response returned by ReadDocumentCollectionAsync
- Use the Azure Monitor SDK for detailed metrics

Security
- Encryption at rest
  - Primary databases are stored on SSD
  - Media attachments, replicas and backups are stored in Azure Blob storage, which uses HDD
- Firewall support
  - Policy-driven, IP-based access control for inbound connections
- Authentication
  - Master keys: primary, secondary
  - Resource tokens: more granular access to partition keys, documents, attachments, stored procedures

Hierarchy of a Cosmos DB
Azure Cosmos DB account
  Database
    Users
      Permissions
    Collections
      Documents
        Attachments
      Stored Procedures
      Triggers
      User-Defined Functions

Containers and Partitions
- Physical partitions are managed by internal resources in Cosmos DB.
- A logical partition is a partition within a physical partition that stores all the data associated with a single partition key value.
- Each document must have a partition key and a unique key, which together uniquely identify it.
- The partition key acts as a logical partition for the data and gives Azure Cosmos DB a boundary for distributing data across physical partitions.
- Containers are logical resources that group one or more physical partitions. What a container maps to depends on the API:
  - SQL API and MongoDB API accounts: a Collection
  - Cassandra API and Table API accounts: a Table
  - Gremlin API accounts: a Graph

Partitioning collections in Cosmos DB
Diagram: a container with partition key = /UserId. Each distinct UserId value ("a", "b", ..., "y", "z") forms one logical partition, and the logical partitions are grouped onto physical partitions (for example "a" and "b" on one, "y" and "z" on another).
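The mapping from partition key value to logical and physical partition can be sketched as follows. This is a simplified illustration and not the service's actual algorithm: the FNV-1a hash, the modulo placement and the names hashKey, physicalPartitionFor and placeDocuments are all assumptions made for the example.

```javascript
// FNV-1a 32-bit hash, chosen only for illustration.
function hashKey(value) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < value.length; i++) {
    hash ^= value.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash;
}

// Every document with the same key value lands on the same physical partition.
function physicalPartitionFor(partitionKeyValue, physicalPartitionCount) {
  return hashKey(partitionKeyValue) % physicalPartitionCount;
}

// Groups documents as: physical partition index -> logical partition (key value) -> documents.
function placeDocuments(documents, partitionKeyName, physicalPartitionCount) {
  const placement = new Map();
  for (const doc of documents) {
    const key = String(doc[partitionKeyName]);
    const physical = physicalPartitionFor(key, physicalPartitionCount);
    if (!placement.has(physical)) placement.set(physical, new Map());
    const logicalPartitions = placement.get(physical);
    if (!logicalPartitions.has(key)) logicalPartitions.set(key, []);
    logicalPartitions.get(key).push(doc);
  }
  return placement;
}

const sampleDocs = [
  { id: "1", UserId: "a" },
  { id: "2", UserId: "b" },
  { id: "3", UserId: "a" },
  { id: "4", UserId: "z" },
];
for (const [physical, logical] of placeDocuments(sampleDocs, "UserId", 2)) {
  console.log("physical partition", physical, "holds keys", [...logical.keys()]);
}
```

Documents sharing a UserId always end up in the same logical partition, which is why a query filtered on the partition key only needs to touch one partition.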

System topology
The Cosmos DB service is deployed worldwide across all Azure regions, including the sovereign and government clouds. We deploy and manage the Cosmos DB service on stamps of machines, each with dedicated local SSDs. The Cosmos DB service is layered on top of Azure Service Fabric, which is a foundational distributed systems infrastructure of Azure. Cosmos DB uses Service Fabric for naming, routing, cluster and container management, coordination of rolling upgrades, failure detection, leader election (within a resource partition) and load balancing capabilities.
Cosmos DB is deployed across one or more Service Fabric clusters, each potentially running multiple generations of hardware and a varying number of machines (currently, between machines). Machines within a cluster are typically spread across fault domains.
A resource partition is a logical concept. Physically, a resource partition is implemented as a group of replicas, called a replica-set. Each machine hosts replicas corresponding to various resource partitions within a fixed set of processes. Replicas corresponding to the resource partitions are placed and load balanced across these machines.
Each replica hosts an instance of Cosmos DB's schema-agnostic database engine, which manages the resources as well as the associated indexes. The database engine, in turn, consists of components including implementations of several coordination primitives, the JavaScript language runtime, the query processor, and the storage and indexing subsystems responsible for transactional storage and indexing of data, respectively.
To provide durability and high availability, the database engine persists its index on SSDs and replicates it among the database engine instances within the replica-set(s) respectively. While the index is always persisted on local SSDs, the log is persisted either locally, on another machine within the cluster, or remotely across a cluster or a datacenter within a region.
The proximity of the index and log is configurable based on the price and latency SLA. The ability to dynamically configure the proximity between the database engine (compute) and log (storage) at the granularity of replicas of a resource partition is crucial for allowing tenants to dynamically select various service tiers.

Document structure in document databases
- Documents are stored as JSON
- GeoJSON is supported for geometry
- Each document must be 2 MB or less
- For binary blobs > 2 MB, use attachments
  - Pointer to a URL outside the document
  - Use the Cosmos DB blob store (up to 2 GB total per database account) or an external store
- System properties added to all documents: _rid, _etag, _ts, _self
- Id is added automatically (as a GUID) if it is not supplied
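A client can check the size limit before writing and fall back to an attachment or external store for larger payloads. The sketch below is an illustration only: it assumes the 2 MB limit means 2 x 1024 x 1024 bytes of UTF-8 encoded JSON, and the helper names are hypothetical.

```javascript
// Assumption: "2 MB" is read as 2 * 1024 * 1024 bytes of serialized JSON.
const MAX_DOCUMENT_BYTES = 2 * 1024 * 1024;

// Size of the document as UTF-8 encoded JSON text.
function serializedSizeInBytes(doc) {
  return new TextEncoder().encode(JSON.stringify(doc)).length;
}

// True when the document can be stored inline; false means the large
// part should go to an attachment or an external store instead.
function fitsInDocument(doc) {
  return serializedSizeInBytes(doc) <= MAX_DOCUMENT_BYTES;
}

const smallDoc = { id: "1", note: "hello" };
const largeDoc = { id: "2", blob: "x".repeat(3 * 1024 * 1024) };
console.log(fitsInDocument(smallDoc)); // true
console.log(fitsInDocument(largeDoc)); // false
```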

Cosmos DB – indexes
Consistent:
- Queries on the collection follow the same consistency level as specified for point reads
- The index is updated synchronously
Lazy:
- The index is updated asynchronously, that is, when the collection's throughput capacity is not fully utilized to serve user requests
- Users might get inconsistent results because data is ingested and indexed slowly
- The index is generally in catch-up mode with ingested data

Moving data into a Cosmos DB database
Import Data
Method | Destination API
Data Migration tool (dtui.exe or dt.exe) | SQL
GraphSON import or the Gremlin Console | Gremlin
mongoimport or mongorestore | Mongo
AZCopy or dt.exe | Table
cqlsh COPY or the Spark connector | Cassandra

Programming Cosmos DB

Using SQL API
Diagram: clients access JSON documents through the REST API, using SQL(-like) queries, SDKs, API calls, and JavaScript logic.

Accessing data
Connection mode | Supported protocol | Supported SDKs | API/Service port
Gateway | HTTPS | All SDKs | SQL(443), Mongo(10250, 10255, 10256), Table(443), Cassandra(10350), Graph(443)
Direct | HTTPS | .NET and Java SDK | Ports within 10,000-20,000 range
Direct | TCP | .NET SDK | Ports within 10,000-20,000 range

Using SQL to find documents in a collection
- SELECT clause
- FROM clause (optional)
- JOIN clause (optional)
- WHERE clause (optional)
- ORDER BY clause (optional)

Using joins
- Use the JOIN clause
- Used for joining inside documents
- Joins between documents are not possible
- CROSS JOIN only (no ON clause)
- Useful for projecting/flattening arrays of subdocuments
  SELECT p.Name, p.CurrentAddress.City AS CurrentCity, h.City AS EarlierCity
  FROM p
  JOIN h IN p.AddressHistory
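The effect of the intra-document JOIN can be mimicked in plain JavaScript: each document is crossed with the elements of its own AddressHistory array. This is only a sketch of the query's semantics, run against made-up sample documents; it does not call Cosmos DB.

```javascript
// Hypothetical sample documents with the shape used by the query on the slide.
const people = [
  {
    Name: "Ana",
    CurrentAddress: { City: "Skopje" },
    AddressHistory: [{ City: "Ohrid" }, { City: "Bitola" }],
  },
  {
    Name: "Marko",
    CurrentAddress: { City: "Bitola" },
    AddressHistory: [],
  },
];

// Equivalent of:
//   SELECT p.Name, p.CurrentAddress.City AS CurrentCity, h.City AS EarlierCity
//   FROM p JOIN h IN p.AddressHistory
// Each document is paired with every element of its own array (cross product).
function joinAddressHistory(documents) {
  return documents.flatMap((p) =>
    p.AddressHistory.map((h) => ({
      Name: p.Name,
      CurrentCity: p.CurrentAddress.City,
      EarlierCity: h.City,
    }))
  );
}

console.log(joinAddressHistory(people));
```

A document with an empty array contributes no rows, so "Marko" does not appear in the result.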

Retrieving data using SQL queries
- Connect with DocumentClient
- Issue SQL queries with CreateDocumentQuery
- Use LINQ by casting the query to IEnumerable
- Fetching documents by Id
  - Use ReadDocumentAsync to retrieve a document by Id
  - Returns a DocumentResponse<T> (typed) or a ResourceResponse<Document> (untyped)

How Cosmos DB supports server-side operations
- Procedural logic: JavaScript as a high-level programming language
- Atomic transactions: Azure Cosmos DB guarantees that database operations performed within a single stored procedure or trigger are atomic
- Performance: the JSON data is intrinsically mapped to the JavaScript language type system
- Other performance benefits: batching, pre-compilation, sequencing
- Encapsulation: stored procedures can be used to group logic in one place

Stored procedures, triggers, and User Defined Functions (UDF)
- You can write application logic to run directly within a transaction inside of the database engine.
- The application logic can be written entirely in JavaScript and can be modeled as a stored procedure, trigger, or a UDF.
- The JavaScript code within a stored procedure or a trigger can insert, replace, delete, read, or query documents within a collection.
    var inputDocument = {id : "document1", author: "G. G. Marquez"};
    client.executeStoredProcedureAsync(createdStoredProcedure.resource._self, inputDocument)
        .then(function(executionResult) {
            assert.equal(executionResult, "success - created DOCUMENT1");
        }, function(error) {
            console.log("Error");
        });
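The slide shows only the client call, not the stored procedure it executes. The sketch below is a guess at a procedure body that would produce the asserted result; it uses the server-side getContext(), getCollection(), createDocument(), getSelfLink(), getResponse() and setBody() calls. The procedure name, the in-memory stand-in makeFakeContext and the collection link are hypothetical, added only so the example runs outside the database engine.

```javascript
// Stored procedure body (assumed, not taken from the slides): creates the
// document and reports its upper-cased id in the response body.
function createDocumentProcedure(doc) {
  var context = getContext();
  var collection = context.getCollection();
  var accepted = collection.createDocument(
    collection.getSelfLink(),
    doc,
    function (err, created) {
      if (err) throw new Error("Error creating document: " + err.message);
      context.getResponse().setBody("success - created " + created.id.toUpperCase());
    }
  );
  if (!accepted) throw new Error("The create request was not accepted");
}

// Hypothetical in-memory stand-in for the server-side context, for local runs only.
function makeFakeContext() {
  const stored = [];
  let body;
  const context = {
    getCollection: () => ({
      getSelfLink: () => "dbs/demo/colls/demo",
      createDocument: (link, doc, callback) => {
        stored.push(doc);
        callback(null, doc);
        return true;
      },
    }),
    getResponse: () => ({
      setBody: (value) => {
        body = value;
      },
    }),
  };
  return { context, stored, getBody: () => body };
}

const fake = makeFakeContext();
globalThis.getContext = () => fake.context; // inside Cosmos DB the engine provides getContext
createDocumentProcedure({ id: "document1", author: "G. G. Marquez" });
console.log(fake.getBody()); // success - created DOCUMENT1
```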

Understanding the Cosmos DB change feed
- Stream of JSON documents that record inserts and updates to a collection, ordered by time.
- Query through the SQL API; uses collection throughput.
Diagram (eCommerce scenario): orders written to Cosmos DB flow through the change feed to microservices such as tax calculation and order processing.
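The idea of reading a change feed with a continuation point can be sketched with an in-memory log. This is a simplified model, not the Cosmos DB API: the class ChangeFeedSketch, its methods and the numeric continuation value are invented for the illustration, and unlike the real feed it keeps every intermediate version.

```javascript
// Simplified in-memory model of a change feed: an append-only, ordered log
// of inserted or updated documents.
class ChangeFeedSketch {
  constructor() {
    this.log = []; // entries of the form { sequence, document }
  }

  // Records an insert or update as a new entry at the end of the log.
  upsert(document) {
    this.log.push({ sequence: this.log.length + 1, document: { ...document } });
  }

  // Returns every change recorded after `continuation`, plus the value to
  // pass on the next call so that only newer changes are returned.
  readChanges(continuation = 0) {
    const changes = this.log.filter((entry) => entry.sequence > continuation);
    const next = changes.length
      ? changes[changes.length - 1].sequence
      : continuation;
    return { documents: changes.map((c) => c.document), continuation: next };
  }
}

const feed = new ChangeFeedSketch();
feed.upsert({ id: "order-1", status: "new" });
feed.upsert({ id: "order-2", status: "new" });

const firstBatch = feed.readChanges();
console.log(firstBatch.documents.length); // 2

feed.upsert({ id: "order-1", status: "paid" });
const secondBatch = feed.readChanges(firstBatch.continuation);
console.log(secondBatch.documents); // [ { id: 'order-1', status: 'paid' } ]
```

A microservice such as order processing would keep its own continuation value and poll for changes it has not yet handled.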

Why Azure Cosmos DB is the recommended database for serverless computing
- Instant access to all your data: you have granular access to every value stored, because Azure Cosmos DB automatically indexes all data by default.
- Schemaless: Azure Cosmos DB is schemaless, so it is uniquely able to handle any data output from an Azure Function.
- Scalable throughput: throughput can be scaled up and down instantly in Azure Cosmos DB. All functions can work in parallel using your allocated RU/s, and your data is guaranteed to be consistent.
- Global replication: you can replicate Azure Cosmos DB data around the globe to reduce latency, geo-locating your data closest to where your users are.

Using the change feed with Azure Functions
- App Service: "serverless" functions
- Possible to trigger from a Cosmos DB change feed
- The function is fired on each change and receives the changed document(s)
Diagram: Cosmos DB change feed -> Cosmos DB trigger -> Azure Function (App Service) -> Output

Patterns for working with Cosmos DB
- Vanilla 1:1, 1:N, M:N
- Time-series data
- Write heavy (Event sourcing)
- Possible hot spots
© Microsoft Corporation. All rights reserved. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.

Vanilla 1:1
Gaming use case with 100s to billions of active players
- Operations: GetPlayerById, AddPlayer, RemovePlayer, UpdatePlayer
- Strawman: partition key = id
- Only GET, POST, PUT, and DELETE
- Provisioned throughput = Σ (RU_i × N_i), summed over the operation types i, where RU_i is the request unit cost of operation i and N_i is its rate per second
- Bonus: bulk inserts
- Bonus: bulk read (use read feed or change feed) for analytics

Vanilla 1:N
Supporting lookup of game state
- Operations: GetGameByIds(PlayerId, GameId), GetGamesByPlayerId(PlayerId), AddGame, RemoveGame
- Partition key is playerId
- GetGameByIds, AddGame, and RemoveGame are GET, POST, and DELETE
- GetGamesByPlayerId is a single-partition query: SELECT * FROM c WHERE c.playerId = 'p1'

Vanilla M:N
Multi-player gaming. Lookup by either gameId or playerId
- Operations: GetPlayerById(PlayerId), GetGameById(GameId), AddGame, RemoveGame
- Partition key = PlayerId is OK if the mix is skewed towards Player calls (because of the index on game ID)
- If the mix is 50:50, store two pivots of the same data, by Player Id and by Game Id
- Double-writes vs. change feed for keeping the copies up to date
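The "two pivots" idea with double-writes can be sketched with two in-memory maps standing in for two collections, one partitioned by player and one by game. All names below (addToPivot, addGame, getGamesByPlayerId, getPlayersByGameId) are hypothetical, and a real implementation would also have to handle a failure between the two writes.

```javascript
// Appends a record under the given key, creating the list on first use.
function addToPivot(pivot, key, record) {
  if (!pivot.has(key)) pivot.set(key, []);
  pivot.get(key).push(record);
}

// Double-write: the same record is stored once per pivot so that either
// lookup is a single-partition read.
function addGame(store, record) {
  addToPivot(store.byPlayerId, record.playerId, record);
  addToPivot(store.byGameId, record.gameId, record);
}

function getGamesByPlayerId(store, playerId) {
  return store.byPlayerId.get(playerId) || [];
}

function getPlayersByGameId(store, gameId) {
  return store.byGameId.get(gameId) || [];
}

const store = { byPlayerId: new Map(), byGameId: new Map() };
addGame(store, { playerId: "p1", gameId: "g1", score: 10 });
addGame(store, { playerId: "p2", gameId: "g1", score: 7 });
addGame(store, { playerId: "p1", gameId: "g2", score: 3 });

console.log(getGamesByPlayerId(store, "p1").length); // 2
console.log(getPlayersByGameId(store, "g1").length); // 2
```

With the change feed alternative, the application would write only one pivot and a separate consumer would populate the other.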

Time-series data
Ingest readings from sensors. Perform lookups by date time range
- Operations: AddSensorReading(SensorId), GetReadingsForTimeRange(StartTime, EndTime), GetSensorReadingsForTimeRange(SensorId, StartTime, EndTime)
- No natural partition key. Time is an anti-pattern!
- Set the partition key to Sensor ID, id to timestamp, and set TTL on the time window
- Bonus: create collections for per-minute, per-hour, per-day windows based on stream aggregation on time windows
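A reading document following this advice could look like the sketch below: the sensor ID is the partition key, the timestamp becomes the id, and the per-document ttl property (in seconds) expires old readings. The property names sensorId and value, the retention period and both helper functions are assumptions for the example.

```javascript
const SECONDS_PER_DAY = 24 * 60 * 60;

// Builds a reading document: partition key = sensorId, id = timestamp,
// ttl = retention window in seconds.
function toReadingDocument(sensorId, timestamp, value, retentionDays) {
  return {
    sensorId: sensorId,
    id: timestamp.toISOString(),
    value: value,
    ttl: retentionDays * SECONDS_PER_DAY,
  };
}

// In-memory stand-in for GetSensorReadingsForTimeRange: ISO 8601 strings
// sort in time order, so a plain string comparison works on the id.
function readingsInRange(documents, sensorId, startTime, endTime) {
  const start = startTime.toISOString();
  const end = endTime.toISOString();
  return documents.filter(
    (d) => d.sensorId === sensorId && d.id >= start && d.id <= end
  );
}

const readings = [
  toReadingDocument("s1", new Date(Date.UTC(2019, 5, 5, 9, 15, 0)), 21.5, 7),
  toReadingDocument("s1", new Date(Date.UTC(2019, 5, 5, 10, 15, 0)), 22.0, 7),
  toReadingDocument("s2", new Date(Date.UTC(2019, 5, 5, 9, 30, 0)), 18.2, 7),
];

console.log(readings[0].id);  // 2019-06-05T09:15:00.000Z
console.log(readings[0].ttl); // 604800
```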

Event sourcing pattern
- Write-heavy workloads. Store each event as an immutable document, instead of updating state in place
- Operations: AddEventForObject(ObjectId, EventType, Timestamp), GetEventsForObject(ObjectId, EventType), GetEventsSinceTimestamp(Timestamp)
- Why event-driven architectures?
  - Inserts are more efficient than updates at scale
  - Built-in audit log of events
  - Decoupled microservices that act on events
- GetEventsForObject is a single-partition query to get latest state on read
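Deriving the latest state on read can be sketched as a fold over an object's events. The in-memory array stands in for a collection partitioned by object ID; the data payload on each event and the latestState helper are assumptions added for the example, while the other two function names follow the operations listed on the slide.

```javascript
// Appends an immutable event; existing events are never updated.
function addEventForObject(store, objectId, eventType, timestamp, data) {
  store.push(Object.freeze({ objectId, eventType, timestamp, data }));
}

// Returns an object's events in time order, optionally filtered by type.
function getEventsForObject(store, objectId, eventType) {
  return store
    .filter(
      (e) =>
        e.objectId === objectId &&
        (eventType === undefined || e.eventType === eventType)
    )
    .sort((a, b) => a.timestamp - b.timestamp);
}

// Rebuilds the current state by applying each event's data in order.
function latestState(store, objectId) {
  return getEventsForObject(store, objectId).reduce(
    (state, e) => ({ ...state, ...e.data }),
    {}
  );
}

const events = [];
addEventForObject(events, "order-1", "Created", 1, { status: "new", total: 20 });
addEventForObject(events, "order-1", "Paid", 2, { status: "paid" });
addEventForObject(events, "order-2", "Created", 3, { status: "new", total: 5 });

console.log(latestState(events, "order-1")); // { status: 'paid', total: 20 }
```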

Hot spot - large partition keys
- Use cases
  - Multi-tenant applications where a few tenants are very large
  - Router publishes telemetry at a higher rate than sensors
  - Celebrity in a social networking app, viral gaming tournament
- Patterns to manage large partition keys
  - Have a surrogate partition key like tenant ID
  - Use hybrid partitioning scheme for small tenants, and large tenants = 0-100
  - Move large tenants to their own collections
  - If the per-document size is large, use the patterns for large documents
- GetEventsSinceTimestamp using change feed
- Bonus: create collections per hour (0-23), and set differential throughput based on the request rates
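One possible reading of the hybrid scheme is that small tenants use the tenant ID alone as the partition key, while large tenants get the tenant ID plus a bucket suffix between 0 and 100 so their data spreads over many logical partitions. That interpretation, the "_" separator, the simple string hash and the function names are all assumptions made for this sketch.

```javascript
const BUCKET_COUNT = 101; // buckets 0 to 100, assumed from the "0-100" on the slide

// Maps a document id to a stable bucket number using a simple polynomial hash.
function bucketFor(documentId) {
  let bucket = 0;
  for (let i = 0; i < documentId.length; i++) {
    bucket = (bucket * 31 + documentId.charCodeAt(i)) % BUCKET_COUNT;
  }
  return bucket;
}

// Small tenants: the tenant ID itself. Large tenants: tenant ID plus bucket suffix.
function partitionKeyFor(tenantId, documentId, largeTenants) {
  if (!largeTenants.has(tenantId)) return tenantId;
  return tenantId + "_" + bucketFor(documentId);
}

const largeTenants = new Set(["bigco"]);
console.log(partitionKeyFor("smallshop", "doc-1", largeTenants)); // smallshop
console.log(partitionKeyFor("bigco", "doc-1", largeTenants));
console.log(partitionKeyFor("bigco", "doc-2", largeTenants));
```

The trade-off is that reading all data for a large tenant becomes a fan-out over its buckets instead of a single-partition query.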

Demo : Building Serverless Application using Cosmos DB

Comparing Cosmos DB with SQL Server (1)

Comparing Cosmos DB with SQL Server (2)

Summary
- Cosmos DB is not a replacement for SQL Server
- You would very, very rarely, if ever, migrate your data from an existing SQL Server database to Cosmos DB
- The most common scenarios will be:
  - International retail chains and order processing
  - Multi-player gaming
  - IoT
  - Social media apps (comments, likes and retweets)
  - Fraud/anomaly detection
  - Metering: counting and regulating usage (API calls, transactions/second, minutes used)

Thank You
