Data and Information Systems Laboratory University of Illinois Urbana-Champaign Data Mining Meeting Mar, From SQL to NoSQL Xiao Yu Mar 2012

RDBMS
The predominant choice for storing data
›Not so true for data mining PhDs, since we put everything in txt files.
The relational model was first formulated in 1969 by Codd
›We are using RDBMS everywhere

Slide from Neo Technology, “A NoSQL Overview and the Benefits of Graph Databases”

When RDBMS met Web 2.0
Slide from Lorenzo Alberton, "NoSQL Databases: Why, what and when"

What’s Wrong with Relational DB?
Nothing is wrong. You just need to use the right tool.
Relational is hard to scale.
›Easy to scale reads
›Hard to scale writes

The Death of RDBMS?

What’s NoSQL?
The misleading term “NoSQL” is short for “Not Only SQL”.
Non-relational, schema-free, non-(quite)-ACID
Horizontally scalable, distributed, easy replication support
Simple API

Four (emerging) NoSQL Categories
Key-value stores
›Based on DHTs / Amazon’s Dynamo paper *
›Data model: (global) collection of K-V pairs
›Example: Voldemort
Column Families
›BigTable clones **
›Data model: big table, column families
›Example: HBase, Cassandra, Hypertable
* G. DeCandia et al., Dynamo: Amazon's Highly Available Key-value Store, SOSP 07
** F. Chang et al., Bigtable: A Distributed Storage System for Structured Data, OSDI 06

Four (emerging) NoSQL Categories
Document databases
›Inspired by Lotus Notes
›Data model: collections of K-V collections
›Example: CouchDB, MongoDB
Graph databases
›Inspired by Euler & graph theory
›Data model: nodes, relationships, K-V on both
›Example: AllegroGraph, VertexDB, Neo4j

Focus of Different Data Models
Slide from Neo Technology, “A NoSQL Overview and the Benefits of Graph Databases”

CAP theorem
Consistency, Availability, Partition Tolerance
[Diagram: CAP triangle with RDBMS and (most) NoSQL systems placed on it]

When to use NoSQL?
Bigness
Massive write performance
›Twitter generates 7 TB per day (2010)
Fast key-value access
Flexible schema or data types
Schema migration
Write availability
›Writes need to succeed no matter what (CAP, partitioning)
Easier maintainability, administration and operations
No single point of failure
Generally available parallel computing
Programmer ease of use
Use the right data model for the right problem
Avoid hitting the wall
Distributed systems support
Tunable CAP tradeoffs
from

Key-Value Stores
Table in relational db:
id    hair_color  age  height
1923  Red         18   6'0"
3371  Blue        34   NA
…     …           …    …
Store/Domain in Key-Value db
Find users whose age is above 18?
Find all attributes of user 1923?
Find users whose hair color is Red and age is 19? (Join operation)
Calculate average age of all grad students?
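
The questions on this slide can be tried against a toy key-value store. The sketch below is only an illustration, not Voldemort's API: the store is a plain Python dict, and the names `store`, `get` and `scan` are made up for this example. It suggests why a lookup by key is cheap while any question about attribute values turns into a scan over every pair.

```python
# A key-value store modelled as a plain dict: key -> attribute map.
# (Hypothetical stand-in; a real store would be distributed across nodes.)
store = {
    "1923": {"hair_color": "Red", "age": 18, "height": "6'0\""},
    "3371": {"hair_color": "Blue", "age": 34},  # height is NA, so it is absent
}

def get(key):
    """Primary-key access: a single hash lookup."""
    return store.get(key)

def scan(predicate):
    """Any query on attribute values has to visit every pair."""
    return sorted(k for k, v in store.items() if predicate(v))

# Find all attributes of user 1923: one lookup.
all_attrs = get("1923")

# Find users whose age is above 18: full scan.
older_than_18 = scan(lambda v: v.get("age", 0) > 18)

# Find users whose hair color is Red and age is 19: full scan again.
red_and_19 = scan(lambda v: v.get("hair_color") == "Red" and v.get("age") == 19)

# Average age: the aggregation happens in the client, not in the store.
ages = [v["age"] for v in store.values() if "age" in v]
average_age = sum(ages) / len(ages)
```

Only the first question maps onto the store's native operation; the rest are answered by application code.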

Example of Voldemort

Voldemort in LinkedIn
Sid Anand, LinkedIn Data Infrastructure (QCon London 2012)

RO Store Usage Pattern
Sid Anand, LinkedIn Data Infrastructure (QCon London 2012)

Voldemort vs MySQL
Sid Anand, LinkedIn Data Infrastructure (QCon London 2012)

Column Families – BigTable Alike
F. Chang et al., Bigtable: A Distributed Storage System for Structured Data, OSDI 06

BigTable Data Model
The row name is a reversed URL. The contents column family contains the page contents, and the anchor column family contains the text of any anchors that reference the page.
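
The data model described on this slide can be approximated with nested dicts. This is a rough sketch, not Bigtable's storage format: the helper names (`reversed_host`, `latest`, `columns_in_family`) are hypothetical, and the integer timestamps and page contents are placeholder values chosen for illustration. Each cell is keyed by row, then by `family:qualifier` column, then by timestamp.

```python
def reversed_host(hostname):
    """Turn 'www.cnn.com' into 'com.cnn.www' so that pages from the
    same domain get neighbouring row keys."""
    return ".".join(reversed(hostname.split(".")))

row_key = reversed_host("www.cnn.com")

# row key -> column ("family:qualifier") -> timestamp -> value
webtable = {
    row_key: {
        "contents:": {3: "<html>v1", 5: "<html>v2", 6: "<html>v3"},
        "anchor:cnnsi.com": {9: "CNN"},
        "anchor:my.look.ca": {8: "CNN.com"},
    }
}

def latest(table, row, column):
    """Return the most recent version of a cell."""
    versions = table[row][column]
    return versions[max(versions)]

def columns_in_family(table, row, family):
    """List the columns of one column family in a row, sorted by name."""
    return sorted(c for c in table[row] if c.split(":", 1)[0] == family)

newest_page = latest(webtable, row_key, "contents:")
anchor_columns = columns_in_family(webtable, row_key, "anchor")
```

In this sketch the column qualifier of an anchor is the referring site and the cell value is the link text.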

More on Row and Column
Rows are stored in lexicographic order by row key
Table is dynamically split into “Tablets”
Each tablet contains the keys in [startKey, endKey)
Tablets are distributed on different nodes
All data in the same CF are usually of the same type
Data in the same CF are compressed and stored together
Data in a CF within a specific row is sorted
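
Because rows are sorted and each tablet covers a half-open key range, finding the tablet for a row key reduces to a binary search over the tablets' start keys. The sketch below assumes made-up split points and node names (`tablet_starts`, `tablet_nodes`); it is not how Bigtable's actual location hierarchy is implemented, only the range-lookup idea.

```python
import bisect

# Hypothetical split points: tablet i holds row keys in
# [tablet_starts[i], tablet_starts[i + 1]); the last tablet is open-ended.
tablet_starts = ["", "com.cnn.www", "org.apache"]
tablet_nodes = ["node-a", "node-b", "node-c"]

def locate_tablet(row_key):
    """Return (node, start_key, end_key) for the tablet holding row_key."""
    # bisect_right finds the first start key greater than row_key,
    # so the tablet we want is the one just before it.
    i = bisect.bisect_right(tablet_starts, row_key) - 1
    end = tablet_starts[i + 1] if i + 1 < len(tablet_starts) else None
    return tablet_nodes[i], tablet_starts[i], end

first = locate_tablet("com.aaa.www")     # falls before the first split point
boundary = locate_tablet("com.cnn.www")  # start key is inclusive
last = locate_tablet("org.wikipedia")    # falls in the open-ended last tablet
```

A scan over a contiguous key range touches only the tablets whose ranges overlap it, which is why the choice of row key matters so much in this model.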

BigTable API Examples
Adds one anchor to a row and deletes a different anchor
Uses a Scanner abstraction to iterate over all anchors in a particular row
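
The two operations named on this slide can be imitated in a few lines of Python. This is not Bigtable's client API (which is C++); `apply_mutation` and `scan_family` are hypothetical helpers over an in-memory dict, and the row and column names are illustrative. The point is the shape of the calls: a batch of sets and deletes applied to one row, followed by an ordered scan over one column family.

```python
# Hypothetical in-memory stand-in for one table: row key -> {column: value}.
table = {
    "com.cnn.www": {
        "contents:": "<html>...",
        "anchor:www.abc.com": "CNN news",
    }
}

def apply_mutation(table, row_key, sets=None, deletes=()):
    """Apply a batch of sets and deletes to one row in a single step."""
    row = dict(table.get(row_key, {}))  # work on a copy of the row
    row.update(sets or {})
    for column in deletes:
        row.pop(column, None)
    table[row_key] = row                # swap in the updated row

def scan_family(table, row_key, family):
    """Yield (column, value) pairs of one column family, in column order."""
    prefix = family + ":"
    row = table.get(row_key, {})
    for column in sorted(row):
        if column.startswith(prefix):
            yield column, row[column]

# Add one anchor and delete a different one.
apply_mutation(
    table,
    "com.cnn.www",
    sets={"anchor:www.c-span.org": "CNN"},
    deletes=["anchor:www.abc.com"],
)

# Iterate over all anchors in that row.
anchors = list(scan_family(table, "com.cnn.www", "anchor"))
```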

BigTable Performance

Document Database - mongoDB
Table in relational db
Documents in a collection
Initial release 2009
Open source, document db
JSON-like documents with dynamic schema
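
To make "JSON-like documents with dynamic schema" concrete, here is a small sketch that models a collection as a list of Python dicts. It deliberately does not use a MongoDB driver; the collection `posts`, its fields, and the `find` helper are all invented for this illustration. Two documents in the same collection carry different fields, and nested values sit inside the document instead of in a separate table.

```python
import json

# One collection, two documents with different shapes (no fixed schema).
posts = [
    {
        "_id": 1,
        "title": "From SQL to NoSQL",
        "author": {"name": "alice", "dept": "CS"},  # nested sub-document
        "tags": ["nosql", "rdbms"],                 # array-valued field
    },
    {
        "_id": 2,
        "title": "Graph databases",
        "views": 120,                               # field the first post lacks
    },
]

def find(collection, **criteria):
    """Return documents whose top-level fields equal all given criteria."""
    return [
        doc for doc in collection
        if all(doc.get(key) == value for key, value in criteria.items())
    ]

matches = find(posts, title="Graph databases")
with_views = [doc["_id"] for doc in posts if "views" in doc]

# Documents serialize directly to JSON text.
as_json = json.dumps(posts[0], sort_keys=True)
```

In a relational design the author and the tags would typically live in their own tables and be joined back in at query time.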

mongoDB Product Deployment
And much more…

mongoDB Features
Document-oriented storage
Full Index Support
Replication & High Availability
Auto-Sharding
Querying
Fast In-Place Updates ?
Map/Reduce
GridFS
Commercial Support
From

sum(checkout)
From Gabriele Lana, CouchDB Vs MongoDB

And mongoDB is fast
[Table: avg / med / dev / total for mongoDB and mySQL, Indexed Queries]
[Table: avg / med / dev / total for mongoDB and mySQL, Non-Indexed Queries]

Graph Database
Data Model Abstraction:
›Nodes
›Relations
›Properties
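
The three abstractions above fit in a short sketch. This is not Neo4j's API: the classes `Node`, `Rel` and `Graph`, the relation types, and the sample people are all hypothetical. Nodes and relations both carry key-value properties, and a query becomes a traversal along relations of a given type.

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Node:
    node_id: int
    props: dict = field(default_factory=dict)

@dataclass
class Rel:
    start: int
    end: int
    rel_type: str
    props: dict = field(default_factory=dict)

class Graph:
    def __init__(self):
        self.nodes = {}
        self.rels = []

    def add_node(self, node_id, **props):
        self.nodes[node_id] = Node(node_id, props)

    def add_rel(self, start, end, rel_type, **props):
        self.rels.append(Rel(start, end, rel_type, props))

    def traverse(self, start, rel_type):
        """Breadth-first walk along outgoing relations of one type."""
        seen, queue, order = {start}, deque([start]), []
        while queue:
            current = queue.popleft()
            for rel in self.rels:
                if (rel.start == current and rel.rel_type == rel_type
                        and rel.end not in seen):
                    seen.add(rel.end)
                    order.append(rel.end)
                    queue.append(rel.end)
        return order

g = Graph()
g.add_node(1, name="alice")
g.add_node(2, name="bob")
g.add_node(3, name="carol")
g.add_rel(1, 2, "KNOWS", since=2010)   # properties on the relation itself
g.add_rel(2, 3, "KNOWS")
g.add_rel(1, 3, "WORKS_WITH")

# Everyone reachable from alice through KNOWS relations.
reachable = [g.nodes[n].props["name"] for n in g.traverse(1, "KNOWS")]
```

The linear scan over `self.rels` keeps the sketch short; a real graph database would keep adjacency per node so that each hop costs the same regardless of graph size.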

Neo4j - Build a Graph
Slide from Neo Technology, “A NoSQL Overview and the Benefits of Graph Databases”

Neo4j – Traverse a Graph
Slide from Neo Technology, “A NoSQL Overview and the Benefits of Graph Databases”

A Debatable Performance Evaluation
Comparing apples to oranges

Conclusion
Use the right data model for the right problem