MySQL to NoSQL: Data Modeling Challenges in Supporting Scalability. Harokopio University, Department of Informatics and Telematics, MSc Programme "Informatics and Telematics"


MySQL to NoSQL: Data Modeling Challenges in Supporting Scalability. Harokopio University, Department of Informatics and Telematics, MSc Programme "Informatics and Telematics". ΣΑΜΑΡΤΖΟΠΟΥΛΟΣ ΝΙΚΟΣ (itp12406)

1. Introduction and Background
Large data sets are driving the adoption of NoSQL technologies, but transitioning from relational persistence to NoSQL persistence is non-trivial. The challenges of making this transition arise in many areas: software architecture, data modeling, deployment, developer skill sets, system operations, and so on.
The authors report on making this transition within the context of a large research enterprise, Project EPIC. The paper focuses on data modeling issues but also touches on the other areas.

Project EPIC
▫ Investigates the use of social media to collaborate and coordinate during times of disaster.
▫ The project team consists of researchers with skills in human-centered computing, software engineering, natural language processing, and network privacy and policy.
▫ Focuses on collecting Twitter data, because Twitter is where people turn to ask for help, or to report what they can do to help, during these events.

Software Engineering Challenges
Crisis informatics places significant demands on software engineering:
▫ Quality: colleagues require high-quality data sets for their research, so collection for an event must run 24/7.
▫ Robustness: given the quality constraint, the data collection infrastructure must be robust in the face of network disconnects, system failures, rate limiting, etc.
▫ Scalability: a single event can generate millions of data points (tweets).
In its two years of deployment, the infrastructure has collected over 2B disaster-related status messages covering numerous mass emergency events, while maintaining 99% uptime.

Goal
Design, develop, and deploy a system capable of
▫ collecting,
▫ packaging, and
▫ analyzing research-quality data sets
▫ in real time and on demand,
since natural disasters and similar events can occur at any moment.

Software Architecture, Version 1
Persistence architecture based on relational technologies.
Highly decoupled four-tier architecture
▫ applications, services, persistence, database
Production-class software
▫ Hibernate, Spring, Spring MVC
Infrastructure components
▫ Tomcat, MySQL, and Lucene
Adopted MySQL because of its "one size fits all" nature
▫ Familiar data model plus great tool support
▫ Easy integration with Hibernate and Lucene

System Architecture (V.1)

The Problem
Relational databases are great when starting out
▫ Lots of tool support; well-understood technology
▫ Can be made to scale (mostly through $)
However, a relational-only approach to storage was not meeting their needs
▫ RDBMSs aren't flexible; schema updates are painful
▫ Availability is less than ideal: single point of failure
▫ Data replication isn't automatic; table scans are painful
This does not mean that relational databases cannot be made to scale: sharding of data, in-memory caches, and a good data center all help, but they cost money ($$$).

2. NoSQL: Not Only SQL
NoSQL (Not Only SQL) technologies
▫ Models based on Google's BigTable and Amazon's Dynamo
▫ Storage of "big data" sets across clusters of machines
▫ Enable analysis of large data sets via, e.g., the MapReduce framework (Hadoop)
Enable flexibility, availability, and scalability
▫ Flexibility via no enforced schema
▫ Availability via replication of data across the cluster
▫ Scalability via the ability to add machines to the cluster

Version 2: Moving to NoSQL
Added Apache Cassandra to the persistence layer
▫ Addressed the storage problems encountered in version 1
Analytics can now occur in a variety of ways
▫ Hadoop; Lucene; SQL
Challenges
▫ Moving from a relational schema to a schema-less model is non-trivial
▫ Little to no tool support
▫ Largely undocumented frameworks and APIs
▫ Distributed-systems expertise required; sysadmin skills a plus!

System Architecture (V.2): Hybrid Persistence Architecture (Relational + NoSQL)

3. Data Model: Before the Transition
Original Data Model (Simplified)
With ORM technology (Hibernate) this model is easily supported: the ORM framework handles all interactions with the database, so the client interacts only with objects and their relationships.
▫ But it runs into problems when the tweets returned across the associations number in the hundreds of millions (the system runs out of memory)
The benefit, however, is flexible queries.

4. Making the Transition
NoSQL technologies require "queries up front"
▫ Data retrieval is fast because query results are co-located
▫ This is by design; you have to write your data the way you want it to be retrieved
▫ You need to know up front which questions you are answering
Queries
▫ Find all tweets associated with an event
▫ Find all tweets associated with a user
▫ Find all tweets associated with an event in a given time range

Cassandra Data Model
To ensure that the new architecture could answer these queries, they needed to store data in a form that maps to Cassandra's data model.
▫ A column family consists of rows, each identified by a row key, that point to many columns.
▫ Each column has a column name and a column value.
▫ The design of the row key (which can be built from strings, dates, and numbers) is critical: it allows the client to index into the column family and retrieve columns.
▫ Each row can have a different set of columns (no schema).
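The column-family structure described on this slide can be pictured with a small in-memory sketch. This is only a toy stand-in, not a Cassandra client: the class name `ColumnFamily`, its methods, and the sample rows are all hypothetical and chosen for illustration.

```python
class ColumnFamily:
    """Toy in-memory stand-in for a column family (not a Cassandra client)."""

    def __init__(self):
        # row key -> {column name: column value}
        self._rows = {}

    def insert(self, row_key, column_name, value):
        # Rows are created on first write; no schema is declared anywhere.
        self._rows.setdefault(row_key, {})[column_name] = value

    def get(self, row_key):
        """Return the row's (name, value) pairs ordered by column name."""
        return sorted(self._rows.get(row_key, {}).items())


users = ColumnFamily()  # hypothetical sample data
users.insert("jsmith", "name", "John Smith")
users.insert("jsmith", "location", "Boulder")
users.insert("adoe", "name", "A. Doe")  # no "location" column: rows may differ

print(users.get("jsmith"))
print(users.get("adoe"))
```

The row key is the only entry point into the structure, which is why its design decides which questions can be answered cheaply.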

First Query: Find all tweets associated with an event
The event name acts as the row key; each tweet's unique id is used as the column name for the event's columns.
▫ Each column stores the full JSON representation of the tweet, so no information about the tweet is lost
Rows contain potentially millions of tweets
▫ This is problematic when Cassandra replicates keys and their associated data (columns) around a cluster of machines: all of a key's data is replicated as a unit, which leads to long delays or timeouts when adding nodes to the cluster
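As a rough illustration of this layout, the sketch below keeps one row per event and one column per tweet, with the tweet id as the column name and the full JSON as the value. It uses plain Python dictionaries rather than Cassandra, and the function names, event name, and tweet fields are assumptions made for the example.

```python
import json
from collections import defaultdict

# row key (event name) -> {column name (tweet id): full tweet JSON}
tweets_by_event = defaultdict(dict)


def store_tweet(event_name, tweet):
    # The whole tweet is serialized, so no field is lost.
    tweets_by_event[event_name][tweet["id"]] = json.dumps(tweet)


def tweets_for_event(event_name):
    # One row read answers the query; columns come back ordered by tweet id.
    row = tweets_by_event.get(event_name, {})
    return [json.loads(row[tweet_id]) for tweet_id in sorted(row)]


# Hypothetical sample data.
store_tweet("japan_earthquake", {"id": 2, "text": "We are safe", "screen_name": "jsmith"})
store_tweet("japan_earthquake", {"id": 1, "text": "Need help", "screen_name": "adoe"})

print([t["id"] for t in tweets_for_event("japan_earthquake")])
print(len(tweets_by_event["japan_earthquake"]), "columns in this single row")
```

Every tweet for an event lands in the same row, so the row grows without bound. That is the property the slide identifies as the source of replication delays.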

Second Query: Find all tweets associated with a user
Uses the secondary indexing feature provided by Cassandra. Secondary indexes allow the client to execute very simple queries against column values that Cassandra can index.
▫ Ex. screen_name = 'jsmith'
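Conceptually, a secondary index is a lookup from an indexed column value back to the rows that contain it. The sketch below shows that idea with plain dictionaries. It does not reproduce how Cassandra implements its indexes, and the names `insert` and `find_by_screen_name` and the sample rows are hypothetical.

```python
from collections import defaultdict

rows = {}  # row key (tweet id) -> columns
index_by_screen_name = defaultdict(set)  # indexed column value -> row keys


def insert(tweet_id, columns):
    rows[tweet_id] = columns
    # The index is maintained at write time, alongside the row itself.
    index_by_screen_name[columns["screen_name"]].add(tweet_id)


def find_by_screen_name(screen_name):
    # Equality lookup only: the kind of very simple query the slide mentions.
    keys = index_by_screen_name.get(screen_name, ())
    return [rows[key] for key in sorted(keys)]


# Hypothetical sample data.
insert(1, {"screen_name": "jsmith", "text": "Road closed on Main St"})
insert(2, {"screen_name": "adoe", "text": "Shelter open at the school"})
insert(3, {"screen_name": "jsmith", "text": "Power is back"})

print(find_by_screen_name("jsmith"))
```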

Third Query: Find all tweets associated with an event in a given date range
Uses composite (string) row keys (event name : day) to store tweets in chunks of time. This approach replaces the one used for the first query.
▫ Decreasing the number of columns stored with each key also decreases the amount of data that must be moved with that key when it is replicated across the cluster, which improves the speed of replication
▫ Data reads may also be more efficient: the client can now specify exactly the data it is interested in, instead of requesting all the data available
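A date-range query over composite keys amounts to computing one row key per day and reading only those rows. The sketch below shows the key construction. The exact key format (event name, a colon, an ISO date), the function names, and the event name are assumptions for illustration, since the slide gives only "event name : day".

```python
from datetime import date, timedelta


def row_key(event_name, day):
    # Assumed format: "<event name>:<YYYY-MM-DD>".
    return f"{event_name}:{day.isoformat()}"


def row_keys_for_range(event_name, start, end):
    """Row keys for every day from start to end, inclusive."""
    days = (end - start).days
    return [row_key(event_name, start + timedelta(days=offset))
            for offset in range(days + 1)]


keys = row_keys_for_range("2012_olympics", date(2012, 7, 27), date(2012, 7, 29))
print(keys)
```

Each key now bounds one day of tweets, so a read touches only the requested days and a row never holds more than a day of data. An event name that itself contains the separator would make keys ambiguous, so a real key scheme would need to rule that out.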

5. After the Transition
Diagram (labels shown: Hector, CassandraTwitterStatusService)

Lessons Learned
Cassandra meets their needs but is non-trivial to implement
Flexibility
▫ Immunity to changes in tweet metadata made by Twitter (no software change is needed every time Twitter changes the metadata)
Availability
▫ Always writeable
Scalability
▫ Need more storage? Add another node

Performance
The Twitter Streaming API can deliver tweets around the clock, 24/7 (5M / day).
The version 1 architecture struggled to keep up with collection.
The version 2 architecture (Cassandra) has no need to hold tweets in a queue while waiting for the persistence mechanism to update its records
▫ It can now handle 100+ tweets per second, ~8.6M a day (April 2011 Japan Earthquake)
▫ It easily handled collection for the 2012 Summer Olympics (712 users and keywords; 40M tweets (98.2GB) after two weeks)
Deployment
The Project EPIC software infrastructure can be deployed in a wide variety of configurations: from a single researcher storing Twitter data in JSON files, to a research group running the infrastructure on a single powerful server, to an even larger research group running a hybrid persistence architecture on a large cluster of machines (as Project EPIC does today).
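The throughput figures quoted on this slide can be cross-checked with a few lines of arithmetic. The helper names below are hypothetical, and treating 1 GB as 10^9 bytes is an assumption, since the slide does not say which convention it uses.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def tweets_per_day(tweets_per_second):
    return tweets_per_second * SECONDS_PER_DAY


def average_tweets_per_second(tweets_per_day_total):
    return tweets_per_day_total / SECONDS_PER_DAY


# 100 tweets/s sustained for a day matches the slide's "~8.6M a day".
print(tweets_per_day(100))

# 5M tweets/day expressed as an average per-second rate.
print(round(average_tweets_per_second(5_000_000), 1))

# Average stored size per tweet for the Olympics data set (assumes 1 GB = 10**9 bytes).
print(round(98.2e9 / 40e6), "bytes per tweet")
```

The first figure confirms that 100 tweets per second and roughly 8.6M tweets a day are the same rate. The second shows that 5M tweets a day averages a little under 60 per second.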

6. Conclusions
Challenges
▫ NoSQL is added alongside relational technologies; it does not replace them (the authors are not saying that MySQL can't scale)
▫ Data modeling is hard; it is difficult to change queries
▫ Skills gap: SQL is familiar; NoSQL is unfamiliar
▫ New skills needed: system administration; cluster management
Despite the challenges, it is possible to incorporate NoSQL into existing systems
▫ Requires good software architecture and software engineering practices
Possible but not trivial
▫ Determine whether your application really requires the combination of flexibility, availability, and scalability offered by these technologies. If not, look at other storage technologies and work out which ones meet your needs.

THANK YOU!!!