NoSQL
  "Not only SQL" or "No SQL" (no SQL support)
  Support for the full SQL language imposes constraints on datastores. So does ACID compliance. So does the need for a fixed database schema.
  Many applications need more specialised datastores.
  A movement for choice in database architecture
  CouchBase survey
  Mike Loukides at O'Reilly - an excellent overview
  Polyglot Persistence by Martin Fowler
  Wikipedia comparison
  nosql-databases.org - a rather terrifying set of resources
  Tim Anglade's compilation of interviews
NoSQL is not new
  Despite the widespread adoption of the relational data model for business applications, there has always been a wide variety of specialised databases:
  Geographic Information Systems - complex spatial relationships - ArcGIS, e.g. BCC KnowYourPlace
  OLAP - OnLine Analytical Processing - for analysis of transaction data
  Free-text databases - e.g. LexisNexis for legal documents
  Multi-dimensional sparse arrays - Pick and MUMPS
  Object-oriented databases - e.g. Zope for the Plone CMS
  These databases were a response to the need for complex and flexible data structures.
Forces for change
  Volume of data - Facebook has over 30 petabytes (30,000 terabytes, or 30 million gigabytes)
  Volume of transactions - of the order of 1 million writes/sec
  Changeability/flexibility of schema - constant beta
  Complexity of data - e.g. UK Legislation
Use case: Terabytes of data need to be stored reliably with no schema requirements
  Reliability is a big problem when volumes are large. In a farm of, say, 1000 servers, each with 8 spindles, there is a high probability that at least one disk will be down at any time.
  Random-access update is too slow - append new data and merge in batch.
  BigTable from Google
  HBase from Apache
  Dynamo from Amazon
  Doug Cutting on Apache's Hadoop
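The reliability claim above can be checked with a little arithmetic. The sketch below assumes disks fail independently; the farm size is taken from the slide, while the annual failure rate and replacement time are illustrative assumptions, not measured figures.

```python
def prob_any_disk_down(p_down: float, n_disks: int) -> float:
    """Probability that at least one of n independent disks is down."""
    return 1.0 - (1.0 - p_down) ** n_disks


# Farm size from the slide.
SERVERS = 1000
SPINDLES_PER_SERVER = 8

# Assumptions for illustration only.
ANNUAL_FAILURE_RATE = 0.03  # assumed: 3% of disks fail in a year
DAYS_TO_REPLACE = 3         # assumed: a failed disk stays down for 3 days

n_disks = SERVERS * SPINDLES_PER_SERVER
# Fraction of the year a single disk is expected to be down.
p_down = ANNUAL_FAILURE_RATE * DAYS_TO_REPLACE / 365
farm_risk = prob_any_disk_down(p_down, n_disks)

print(f"{n_disks} disks, P(at least one down) = {farm_risk:.2f}")
```

Under these assumptions the chance that some disk is down at any given moment is well above one half, even though each individual disk is down only a tiny fraction of the time. This is why stores at this scale replicate data across servers rather than relying on any single spindle.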
Use case: Batch data analysis
  Where very large transaction datasets need to be filtered and summarised, for example to analyse log files by IP location. In the past these could have been overnight jobs; now they need to be done in minutes at most.
  Map-Reduce is an architecture for large-scale distributed computation. MapReduce should really be called MapMergeReduce.
  Each MapReduce task is written in Java (or a high-level language like Pig). The framework (such as Hadoop) coordinates the distribution of the map, merge and reduce jobs and the dataflows between them.
  The input is a dataset of key/value pairs which is split ('sharded') over many spindles on many servers.
  The user's map operation runs on every server hosting a shard and transforms each key/value input into zero, one or more key/value outputs.
  Merge (shuffle) groups all pairs with the same key and distributes the groups (e.g. by hashing the keys) to multiple Reduce servers. This too can be user-configurable.
  The user's reduce takes each group of values for the same key and produces zero, one or more key/values for each group.
  Successive MapMergeReduce operations can be chained together in a pipeline.
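The map, merge and reduce steps above can be sketched in a single process. This is only an illustration of the dataflow, not Hadoop's API: the function names, the sample log lines and the two-reducer setup are all assumptions, and the example counts hits per IP address rather than resolving IP locations.

```python
from collections import defaultdict


def map_log_line(key, line):
    """Map: emit (ip, 1) per log line; blank lines produce zero outputs."""
    fields = line.split()
    if fields:
        yield fields[0], 1


def merge(pairs, n_reducers):
    """Merge (shuffle): group values by key, partitioned by a hash of the key."""
    partitions = [defaultdict(list) for _ in range(n_reducers)]
    for key, value in pairs:
        partitions[hash(key) % n_reducers][key].append(value)
    return partitions


def reduce_count(key, values):
    """Reduce: emit one (key, total) pair per group."""
    yield key, sum(values)


def map_merge_reduce(shards, mapper, reducer, n_reducers=2):
    """Run map over every shard, shuffle the outputs, then reduce each group."""
    mapped = (pair
              for shard in shards
              for key, value in shard
              for pair in mapper(key, value))
    result = {}
    for partition in merge(mapped, n_reducers):
        for key, values in partition.items():
            for out_key, out_value in reducer(key, values):
                result[out_key] = out_value
    return result


# Hypothetical input: two shards of (line number, log line) pairs.
shards = [
    [(0, "10.0.0.1 GET /index.html"), (1, "10.0.0.2 GET /about.html")],
    [(2, "10.0.0.1 GET /contact.html"), (3, "")],
]
hits_by_ip = map_merge_reduce(shards, map_log_line, reduce_count)
print(hits_by_ip)
```

In a real cluster each shard's map runs on the server that holds it and each partition goes to a different reduce server; here the partitions are just separate dictionaries in one process.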
Use case: Fast put/get of keyed data
  Key-value store
  Where complex data is to be stored but the database is not interested in its internal structure. For example, storing session data, user profiles, shopping carts.
  The only operations are:
    value = store.get(key)
    store.put(key, value)
    store.delete(key)
  Platforms: Project Voldemort, Rhino
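The three operations above can be illustrated with an in-memory stand-in. The class below is a hypothetical sketch, not the client API of Project Voldemort or any other platform; the point it makes is that the store only ever sees opaque bytes, and the application alone knows how to interpret them.

```python
import json


class KeyValueStore:
    """In-memory stand-in for a key-value store; values are opaque bytes."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        # Deleting a missing key is treated as a no-op in this sketch.
        self._data.pop(key, None)


store = KeyValueStore()

# The application serialises its own structure; the store never parses it.
cart = {"user": "alice", "items": ["book", "pen"]}
store.put("cart:alice", json.dumps(cart).encode("utf-8"))

restored = json.loads(store.get("cart:alice").decode("utf-8"))
store.delete("cart:alice")
```

Because the store cannot look inside a value, it cannot answer a question such as "which carts contain a pen?"; that is the trade-off accepted in exchange for fast put/get.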
Use case: Page Caching
  Key-value cache
  Where the generation of a page takes a significant time, it is better to cache the pages as key/value pairs where the key is a URI and the value is the HTML page.
  As much of the cache as possible is kept in RAM for rapid access.
  Issues: cache flushing
  For example, this site views summarised data from an eXist document store: AidView
  Platforms: Memcached
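A minimal sketch of such a page cache follows, assuming a least-recently-used eviction policy and an explicit flush call. Both choices, along with the class and function names, are illustrative assumptions; this is not how Memcached itself is implemented or called.

```python
from collections import OrderedDict


class PageCache:
    """URI -> HTML cache held in RAM, evicting the least recently used page."""

    def __init__(self, render, max_pages=100):
        self._render = render          # slow function: uri -> html
        self._max_pages = max_pages
        self._pages = OrderedDict()

    def get(self, uri):
        if uri in self._pages:
            self._pages.move_to_end(uri)      # mark as recently used
            return self._pages[uri]
        html = self._render(uri)              # cache miss: generate the page
        self._pages[uri] = html
        if len(self._pages) > self._max_pages:
            self._pages.popitem(last=False)   # evict the oldest entry
        return html

    def flush(self, uri=None):
        """Drop one page, or the whole cache when no URI is given."""
        if uri is None:
            self._pages.clear()
        else:
            self._pages.pop(uri, None)


render_calls = []


def slow_render(uri):
    """Hypothetical stand-in for an expensive page generation."""
    render_calls.append(uri)
    return f"<html><body>Report for {uri}</body></html>"


cache = PageCache(slow_render, max_pages=2)
first = cache.get("/report/1")    # miss: rendered
second = cache.get("/report/1")   # hit: served from RAM
cache.flush("/report/1")          # underlying data changed
third = cache.get("/report/1")    # miss again: re-rendered
```

The flush call is where the difficulty lies: something has to know that the underlying data changed and which cached URIs depend on it.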
Use case: Linked data
  Graph database
  Where data is composed of simple, highly interrelated facts. For example, there is an RDF version of Wikipedia called DBpedia.
  Some use general-purpose databases such as MySQL, but the specific form of the data and of the queries on it suggests native stores:
  Triple (usually quad) stores to support RDF - Jena, Sesame, Virtuoso - queried with SPARQL. RDF has a rigid data model: [graph] subject - predicate - object, and is widely used for linked data.
  Custom graph stores - Neo4J - non-standard interfaces
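The subject - predicate - object model can be illustrated with a toy in-memory triple store. This is a sketch only: the class, the `ex:` identifiers and the facts are made up for the example, and the pattern matching is merely in the spirit of a SPARQL triple pattern, not an implementation of SPARQL or of any of the stores named above.

```python
class TripleStore:
    """Toy store of (subject, predicate, object) facts."""

    def __init__(self):
        self._triples = set()

    def add(self, subject, predicate, obj):
        self._triples.add((subject, predicate, obj))

    def match(self, subject=None, predicate=None, obj=None):
        """Return the triples matching a pattern; None acts as a wildcard."""
        pattern = (subject, predicate, obj)
        return sorted(
            triple for triple in self._triples
            if all(want is None or want == have
                   for want, have in zip(pattern, triple))
        )


# Hypothetical facts, not taken from DBpedia.
facts = TripleStore()
facts.add("ex:alice", "ex:knows", "ex:bob")
facts.add("ex:alice", "ex:knows", "ex:carol")
facts.add("ex:bob", "ex:knows", "ex:carol")
facts.add("ex:carol", "ex:worksFor", "ex:acme")

# Who knows carol?  Pattern: (?, ex:knows, ex:carol)
knows_carol = [s for s, _, _ in facts.match(predicate="ex:knows", obj="ex:carol")]

# Everything stated about alice.  Pattern: (ex:alice, ?, ?)
about_alice = facts.match(subject="ex:alice")
```

A quad store adds a fourth element, the graph a triple belongs to, which is the optional [graph] in the model above. Real stores also index the triples rather than scanning them all on every query.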
XML/XQuery for graphs
  Tutorial for using Neo4j to compute relationships in a graph
  Friends relationship
  Some friends as XML
  A bit of XQuery
  The knows relationship expanded
  Permissions
  People
  Roles
  A bit of XQuery
  People and permissions
  Shortest path is difficult - Dijkstra's algorithm is tricky to implement in functional languages
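For comparison with the functional case, here is a sketch of Dijkstra's algorithm in imperative Python, where a mutable priority queue and a set of settled nodes make it fairly short. It is not the XQuery from the tutorial; the graph, the names and the unit weights on each "knows" link are assumptions for illustration.

```python
import heapq


def shortest_path(graph, start, goal):
    """Dijkstra's algorithm.

    graph maps node -> {neighbour: non-negative weight}.
    Returns (total cost, path) or None if goal is unreachable.
    """
    queue = [(0, start, [start])]   # (cost so far, node, path taken)
    settled = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == goal:
            return cost, path
        if node in settled:
            continue                # already reached by a cheaper route
        settled.add(node)
        for neighbour, weight in graph.get(node, {}).items():
            if neighbour not in settled:
                heapq.heappush(
                    queue, (cost + weight, neighbour, path + [neighbour]))
    return None


# Hypothetical directed "knows" graph; every link has weight 1.
knows = {
    "alice": {"bob": 1, "carol": 1},
    "bob": {"dave": 1},
    "carol": {"erin": 1},
    "dave": {"erin": 1},
}

route = shortest_path(knows, "alice", "erin")
no_route = shortest_path(knows, "erin", "alice")
```

The algorithm leans on in-place updates to the queue and the settled set, which is the part that has to be threaded through explicitly as state in a purely functional language such as XQuery.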
Dan McCreary's Overview
  The CIO's Guide to NoSQL
Risks
  Lack of standardisation
  New technology
  Design cul-de-sac - requirements change
  Lack of available developer skills
  RDBMSs like Oracle and SQL Server are changing too - but just get more complex.
  A dissenting view - warning - NSFW