NoSQL, XRX and Functional Programming Version 4.0 Dan McCreary Monday May 6th, 2013.

1 NoSQL, XRX and Functional Programming Version 4.0 Dan McCreary Monday May 6th, 2013

2 Topics NoSQL –History –Business Motivation XRX (XForms, REST, XQuery) Functional Programming NoSQL Concepts –MapReduce/Hadoop –Key-value stores –CAP Theorem –Object-Relational Mapping NoSQL Product Classifications –Key-value stores, graph stores, Bigtable stores and document stores Sampling of Solutions –Big Data –Search –High availability –Agility Analysis methods (ATAM) Kelly-McCreary & Associates, LLC

3 Background for Dan McCreary Bell Labs NeXT Computer (Steve Jobs) Owner of 75-person software consulting firm US Federal data integration (National Information Exchange Model) Native XML/XQuery for metadata management since 2006 Advocate of web standards, NoSQL and XRX systems

4 M D Making Sense of NoSQL Copyright Kelly-McCreary & Associates, LLC 4 Working with NoSQL since 2006 Co-founders of the NoSQL Now! conference Authors of Manning book on NoSQL (MEAP now, print July 2013) Guide for managers with a focus on business benefits Focus on NoSQL architectural tradeoff analysis

5 How I Got into NoSQL (My Story) Background as of 2006 –Worked for Steve Jobs at NeXT –Object-oriented development: Objective-C, Java –Strong focus on object-relational mapping: DBKit, WebObjects, Hibernate, Oracle, Sybase, MS-SQL Server –7 years doing XML (CriMNet), Metadata Registries (ISO/IEC 11179) and Semantics (RDF, OWL etc.)

9 XForms Rocks!

11 Four Translations T1 – HTML into Java Objects T2 – Java Objects into SQL Tables T3 – Tables into Objects T4 – Objects into HTML [Diagram: Web Browser, Object Middle Tier and Relational Database, linked by translations T1 to T4] Copyright 2010 Dan McCreary & Associates

12 M D Kurt's Suggestion store($collection, $file-name, $data) 12 Copyright 2010 Dan McCreary & Associates Web Browser Save Web Form Use a Native XML Database! Kurt Cagle eXist-db

13 Zero Translation XML lives in the web browser (XForms) REST interfaces XML in the database (Native XML, XQuery) XRX Web Application Architecture No translation! [Diagram: Web Browser (XForms Model), REST Interface, XML database (Native XML Database)]

14 Database Tradeoff Analysis My Way: 10,000 lines of code to break a CRV XML document into Java components and use Hibernate to store each element in columns of tables; 45 SQL inserts to store; 20 joins to extract; six months; four developers (Java, SQL, DBA, Project Manager). Kurt's Way: one line of code to store a CRV; one day to write; one week to test; 100x increase in agility.

15 M D The Paradigm Shift 15 Kelly-McCreary & Associates 2006 RDBMSs are the only way to store enterprise data There are many ways to store enterprise data and if you pick the right one your agility can go up x1,000

16 Reality Sets In What I wanted Objective architecture analysis Fairly weigh the pros and cons of each alternative Be surrounded by people who know the strengths and weaknesses of many alternatives What I got Architecture decisions driven by an RDBMS license Architecture decisions made by one person with limited exposure to alternatives Architecture decisions made by lack of knowledge Architecture by fear of the unknown

17 M D Evaluating Software Architectures: Methods and Case Studies by Paul Clements, Rick Kazman, and Mark Klein Addison-Wesley, 2001 Key Book for ATAM 17 Kelly-McCreary & Associates We need "Evaluating Database Architectures"

18 M D Three Eras of Databases RDBMS for transactions, Data Warehouse for analytics and NoSQL for scalability Copyright Kelly-McCreary & Associates, LLC 18 RDBMS Data Warehouse 1985-1995 1995-2010 2010-Now Data Warehouse RDBMS NoSQL

19 M D Sample of NoSQL Jargon Document orientation Schema free MapReduce Horizontal scaling Sharding and auto-sharding Brewer's CAP Theorem Consistency Reliability Partition tolerance Single-point-of-failure Object-Relational mapping Key-value stores Column stores Document stores Memcached 19 Copyright Kelly-McCreary & Associates, LLC Indexing B-Tree Configurable durability Documents for archives Functional programming Document Transformation Document Indexing and Search Alternate Query Languages Aggregates OLAP XQuery MDX RDF SPARQL Architecture Tradeoff Modeling ATAM Note that within the context of NoSQL many of these terms have different meanings!

20 NoSQL – The Big Tent "NoSQL" – a label for a "meme" that now encompasses a large body of innovative ideas on data management "Not Only SQL" Focus on non-relational databases and hybrids A community where new ideas are quickly recombined to create innovative new business solutions

21 M D Pressures on SQL Only Systems Copyright Kelly-McCreary & Associates, LLC 21 SQL OLAP/BI/Data Warehouse Social Networks Scalability Agile Schema Free Document-Data Large Data Sets Reliability Linked Data

22 Before NoSQL DB Selection Was Easy! [Flowchart: Start, then "Does it look like a document?" Yes: Use Microsoft Office. No: Use the RDBMS. Then Stop.]

23 M D An evolving tree of data types Copyright Kelly-McCreary & Associates, LLC 23 Read Mostly Read/Write Structured Unstructured Transactional RDBMS BI/DW Web Crawlers Documents Log Files XML JSON Binary Open Linked Data Graph

24 M D Many Uses of Data Copyright Kelly-McCreary & Associates, LLC 24 Transactions (OLTP) Analysis (OLAP) Search and Findability Enterprise Agility Discovery and Insight Speed and Reliability Consistency and Availability

25 Before NoSQL Relational, Analytical (OLAP)

26 After NoSQL Relational, Analytical (OLAP), Key-Value, Column-Family, Document, Graph

27 Three Eras of Enterprise Data NoSQL will not replace ERP or BI/DW systems – but they will complement them and also facilitate the integration of unstructured document data [Diagram: 1970s, 80s, 90s – Siloed Systems (HR, Inventory, Sales, Finance); 1990s, 2000s – ERP Drives BI/DW (ERP for OLTP, BI/Data Warehouse for OLAP); Today – NoSQL and Documents (ERP, BI/Data Warehouse, NoSQL, Documents)]

28 Simplicity is a Virtue Many modern systems derive their strength from dramatically limiting the features in their system and focusing on a specific task Simplicity allows database designers to focus on the primary business drivers Photo from flickr by PSNZ Images

29 M D Simplicity is a Design Style Focus only on simple systems that solve many problems in a flexible way Examples: –Touch screen interfaces –Key/Value data stores Copyright Kelly-McCreary & Associates, LLC 29

30 M D Historical Context Mainframe Era 1 CPU COBOL and FORTRAN Punchcards and flat files $10,000 per CPU hour Commodity Processors 10,000 CPUs Functional programming MapReduce "farms" Pennies per CPU hour Copyright Kelly-McCreary & Associates, LLC 30

31 Two Approaches to Computation (1930s and 40s) John von Neumann: Manage state with a program counter. Alonzo Church: Make computations act like math functions. Which is simpler? Which is cheaper? Which will scale to 10,000 CPUs?
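
The contrast between the two styles can be sketched in a few lines of Python. This is only an illustration, not code from the talk; the names `prices`, `total_imperative` and `double` are made up for the example. The point is that the pure function touches no shared state, so each call could run on a different CPU.

```python
from functools import reduce

prices = [3, 5, 8]  # made-up sample data


def total_imperative(values):
    """von Neumann style: a loop that mutates state step by step."""
    total = 0
    for v in values:
        total += v  # the result depends on the order of state changes
    return total


def double(x):
    """Church style: a pure function; output depends only on the input."""
    return 2 * x


# Each call to double() is independent, so the map could be split
# across many machines without any coordination.
doubled = list(map(double, prices))
total_functional = reduce(lambda a, b: a + b, prices, 0)
```

Because `double` has no side effects, the work can be divided in any order, which is the property MapReduce relies on.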

32 Standard vs. MapReduce Prices John's Way, Alonzo's Way

33 MapReduce CPUs Cost Less! Cut cost from 32 to 6 cents per CPU hour! Perhaps Alonzo was right! 82% Cost Reduction! Why? (hint: how "shareable" is this process)

34 It's about transformation!

35 Architectural Tradeoffs I want a fast car with good mileage. I want a scalable database with low cost that runs well on the 1000 CPUs in our data center.

36 Recent History The term NoSQL became re-popularized around 2009 Used for conferences of advocates of non-relational databases Became a contagious idea "meme" First of many "NoSQL meetups" in San Francisco organized by Johan Oskarsson Conversion from "No SQL" to "Not Only SQL" in recent years

37 Key Events on NoSQL Timeline 1998 Carlo Strozzi Intros NoSQL 2006 Google Bigtable 2009 Eric Evans Reintroduces Term 2005 Google MapReduce XQuery 1.0 Recommendation NoSQL Meetups Amazon Dynamo 2007 2008 2010 2004 2003 LiveJournal's Memcache Open Source Becomes Popular MDX

38 M D RDBMS vs. NoSQL NoSQL is real and it’s here to stay RDBMS NoSQL Google Trends 38 Copyright Kelly-McCreary & Associates, LLC

39 M D NoSQL and Web 2.0 Startups Many web 2.0 startups did not use Oracle or MySQL They built their own data stores influenced by Amazon’s Dynamo and Google’s Bigtable in order to store and process huge amounts of data In the social community or cloud computing applications most of these data stores became open source software 39 Kelly-McCreary & Associates, LLC

40 M D 2003 - Memcached Originally developed for LiveJournal as a "side-cache" Allowed databases to put data in a "RAM Cache" in front of a SQL server to speed up read-mostly operations Cache in all front-end servers can be shared via a simple protocol Dramatically lowers stress on SQL server Ideal "front end" to SQL systems to increase performance Copyright Kelly-McCreary & Associates, LLC 40
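
The "side-cache" idea can be sketched as follows. This is a minimal in-process illustration, not the memcached client API: `SideCache`, `query_database` and `get_user` are hypothetical names, and a real deployment would talk to memcached over the network.

```python
class SideCache:
    """Toy stand-in for a RAM cache (hypothetical, in-process only)."""

    def __init__(self):
        self._ram = {}

    def get(self, key):
        return self._ram.get(key)

    def set(self, key, value):
        self._ram[key] = value


stats = {"db_reads": 0}  # counts how often the "SQL server" is hit


def query_database(user_id):
    """Hypothetical slow SQL lookup."""
    stats["db_reads"] += 1
    return {"id": user_id, "name": f"user-{user_id}"}


def get_user(cache, user_id):
    key = f"user:{user_id}"
    row = cache.get(key)
    if row is None:          # cache miss: go to the SQL server once
        row = query_database(user_id)
        cache.set(key, row)  # later reads are served from RAM
    return row


cache = SideCache()
first = get_user(cache, 42)
second = get_user(cache, 42)  # served from the cache, no database read
```

Only the first read reaches the database, which is how the cache lowers stress on the SQL server for read-mostly data.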

41 M D memcachedb Partition Select and Updates Most frequently read data always stays in RAM Complex update transactions go directly to the database Copyright Kelly-McCreary & Associates, LLC 41 memcachedb Select SQL "System of Record" Insert Update Delete memcachedb Sync with memcached protocol

42 Google MapReduce 2004 paper that showed the entire community the huge impact of functional programming Copied by many organizations, including Yahoo
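
As a rough sketch of the map, shuffle and reduce steps, here is a single-machine word count in Python. It only mimics the shape of the model; the function names are illustrative, and a real framework such as Hadoop would run the map and reduce calls on many machines.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine the values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["NoSQL is here", "SQL is here to stay"])
```

Both `map_phase` and `reduce_phase` are pure functions, so each document and each key can be processed independently.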

43 Google Bigtable Paper 2006 paper that gave focus to scalable databases designed to reliably scale to petabytes of data and thousands of machines

44 M D Amazon's Dynamo Paper Werner Vogels CTO - October 2, 2007 Used to power Amazon's Web Sites One of the most influential papers in the NoSQL movement Copyright Kelly-McCreary & Associates, LLC 44 Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swami Sivasubramanian, Peter Vosshall and Werner Vogels, “Dynamo: Amazon's Highly Available Key-Value Store”, in the Proceedings of the 21st ACM Symposium on Operating Systems Principles, Stevenson, WA, October 2007.

45 2009: the NoSQL "Revolt" “NoSQLers came to share how they had overthrown the tyranny of slow, expensive relational databases in favor of more efficient and cheaper ways of managing data.” Computerworld magazine, July 1st, 2009 NoSQL!

46 Many Processes Today Are Driven By… The constraints of yesterday… Challenge: Ask ourselves the question… Does our current method of solving problems with tabular data… Reflect the storage of the 1950s… Or our actual business requirements? What structures best solve the actual business problem? Copyright 2008 Dan McCreary & Associates

47 M D Copyright 2008 Dan McCreary & Associates 47 No-Shredding! Relational databases take a single hierarchical document and shred it into many pieces so it will fit in tabular structures Document stores prevent this shredding My Form Data

48 M D Copyright 2008 Dan McCreary & Associates 48 Is Shredding Really Necessary? Every time you take hierarchical data and put it into a traditional database you have to put repeating groups in separate tables and use SQL “joins” to reassemble the data
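
To make "shredding" concrete, the sketch below splits one hierarchical form into two flat row sets and then joins them back, next to the document-store alternative of saving the form as a single unit. The form contents and the helper names `shred` and `join` are invented for the example; plain dictionaries stand in for tables and for the document store.

```python
import json

# One hierarchical document with a repeating group (made-up data).
form = {
    "id": "form-1",
    "title": "My Form Data",
    "items": [{"sku": "A", "qty": 2}, {"sku": "B", "qty": 1}],
}


def shred(doc):
    """Split the document into a header 'table' and an items 'table'."""
    header_rows = [{"id": doc["id"], "title": doc["title"]}]
    item_rows = [{"form_id": doc["id"], **item} for item in doc["items"]]
    return header_rows, item_rows


def join(header_rows, item_rows):
    """Reassemble the document, as a SQL join would."""
    header = header_rows[0]
    items = [
        {k: v for k, v in row.items() if k != "form_id"}
        for row in item_rows
        if row["form_id"] == header["id"]
    ]
    return {**header, "items": items}


# Document-store approach: the whole form is saved as one unit.
document_store = {form["id"]: json.dumps(form)}
```

The relational path needs a split on write and a join on read; the document path needs neither.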

49 The Addition of XML Web Services T1 – HTML into Java Objects T2 – Java Objects into SQL Tables T3 – Tables into Objects T4 – Objects into HTML T5 – Objects to XML T6 – XML to Objects [Diagram: Web Browser, Object Middle Tier, Relational Database and Web Service, linked by translations T1 to T6] Copyright 2011 Kelly-McCreary & Associates

50 M D "The Vietnam of Applications" Object-relational mapping has become one of the most complex components of building applications today A "Quagmire" where many projects get lost Many "heroic efforts" have been made to solve the problem –Java Hibernate Framework –Ruby on Rails But sometimes the best way to avoid complexity is to keep your architecture very simple Copyright Kelly-McCreary & Associates, LLC 50

51 "Schema Free" Systems that automatically determine how to index data as the data is loaded into the database No a priori knowledge of data structure No need for up-front logical data modeling –…but some modeling is still critical Adding new data elements or changing data elements is not disruptive Searching millions of records still has sub-second response time

52 M D Schema-Free Integration "We can easily store the data that we actually get, not the data we thought we would get." Copyright Kelly-McCreary & Associates, LLC 52 XML v1 XML v2 XML v3 Enterprise Messaging System NoSQL Database

53 M D Upfront ER Modeling is Not Required You do not have to finish modeling your data before you insert your first records No Data Definition Language "DDL" is needed Metadata is used to create indexes as data arrives Modeling becomes a statistical process – write queries to find exceptions and normalize data Exceptions make the rules but can still be used Data validation can still be done on documents using tools such as XML Schema and business rules systems like Schematron Copyright Kelly-McCreary & Associates, LLC 53

54 M D Monoculture and Mono-architecture 54 Copyright 2011 Kelly-McCreary & Associates Image Source: Wikipedia

55 M D Eric Evans “The whole point of seeking alternatives [to RDBMS systems] is that you need to solve a problem that relational databases are a bad fit for.” Eric Evans Rackspace 55 Kelly-McCreary & Associates, LLC

56 Finding the Right Match Schema-Free Mature Query Language Standards Compliant Use CMU's Architecture Tradeoff Analysis Method (ATAM) Process Copyright 2011 Dan McCreary & Associates

57 ACID Transactions Atomic – transactions either complete or fail – no in-between Consistent – no reports can ever catch data between states Isolated – never allow users to touch data that is incomplete Durable – recover committed transactions from partial system failures
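
The "Atomic" property can be demonstrated with Python's built-in `sqlite3` module. This is a small sketch with an invented `accounts` table and a hypothetical `transfer` helper; it shows that when a transaction fails partway, the first update is rolled back too.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("a", 100), ("b", 0)])
conn.commit()


def transfer(amount):
    """Move money from account a to b; all or nothing."""
    try:
        with conn:  # commits on success, rolls back on an exception
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = 'a'",
                (amount,),
            )
            if amount > 100:
                raise ValueError("insufficient funds")
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = 'b'",
                (amount,),
            )
        return True
    except ValueError:
        return False


ok = transfer(500)  # fails after the first UPDATE has already run
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```

After the failed transfer, both balances are unchanged: there is no in-between state.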

58 BASE Basically Available, Soft state, Eventually consistent "Eventually Consistent" is not appropriate for many financial reporting systems

59 M D Sharding and Partition Tolerance When a database gets full how easy is it for your system to automatically move half the data to new systems? 59 Kelly-McCreary & Associates, LLC Time to "Shard" New Server Before autoshard the node is near 100% capacity. After autoshard the nodes are "rebalanced" with each node taking half the load. copy 1/2 data
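
The rebalancing step in the diagram can be sketched as below. This is a simplified, assumption-laden illustration: `autoshard` is a hypothetical function, nodes are plain dictionaries, and "full" simply means the number of keys has reached `capacity`. Real systems split on key ranges or hash ranges and copy data over the network.

```python
def autoshard(node, capacity):
    """If the node is full, move the upper half of its sorted keys to a new node."""
    if len(node) < capacity:
        return node, {}          # not full yet: nothing to move
    keys = sorted(node)
    moved = keys[len(keys) // 2:]
    new_node = {k: node[k] for k in moved}
    old_node = {k: v for k, v in node.items() if k not in new_node}
    return old_node, new_node


full = {f"key{i}": i for i in range(8)}   # a node at 100% capacity
old, new = autoshard(full, capacity=8)    # each node now holds half the data
```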

60 Brewer's CAP Theorem Consistency, Availability, Partition Tolerance You cannot have all three, so pick two!

61 Proof of Brewer's CAP Theorem In 2002, Seth Gilbert and Nancy Lynch of MIT formally proved Brewer's Theorem to be correct Focus turned away from trying to get all three to trying to pick the right two for the job Many systems are becoming "CAP tunable"

62 CAP Systems CA – Consistent and Available –Example – a single site, single database, no auto-sharding –If you have to partition, stop, partition, restart CP – Consistent with Partition Tolerance –Some data not available when partitioning –What data you have is always consistent AP – Available with Partition Tolerance –Always on, but during partitioning, some data may be inconsistent –After sharding, eventual consistency

63 CAP Regions Consistency, Availability, Partition Tolerance CP AP CA NoSQL Innovation RDBMS

64 When does CAP Apply? If you have a single processor or many processors on a working network, then you get Consistency AND Availability. If you have many processors and network failures, then each transaction can select between Consistency OR Availability depending on the context.

65 M D Scale Up vs. Scale Out Scale Up Make a single CPU as fast as possible Increase clock speed Add RAM Make disk I/O go faster Scale Out Make Many CPUs work together Learn how to divide your problems into independent threads Copyright Kelly-McCreary & Associates, LLC 65

66 The Tunable SLA NoSQL systems that use many commodity processors can be precisely tuned to meet an organization's service level agreements Inputs: Max Read Time, Max Write Time, Reads Per Second, Writes Per Second, Duplicate Copies, Multiple Datacenters, Transaction Guarantees Output: Estimated Price/Month ($435.50 in the example)

67 M D Example: Amazon DynamoDB Tune your service to the business requirements (challenging to do with a RDBMS) Copyright Kelly-McCreary & Associates, LLC 67 Tuning Knobs

68 "One Size Fits…" "One Size Does Not Fit All" James Hamilton Nov. 3rd, 2009 ,guid,afe46691-a293-4f9a-8900-5688a597726a.aspx James works for Amazon Web Services

69 M D NoSQL and Cloud Computing High scalability –Especially in the horizontal direction (multi CPUs) Low administration overhead –Simple web page administration Lower fees for MapReduce transformations 69 Kelly-McCreary & Associates, LLC

70 Distribution Models Master-Slave, Peer-to-Peer Master Standby Master Standby Master Node requests Used only if primary master fails

71 The NO-SQL Universe Document Stores, Key-Value Stores, Graph/Triple Stores, Object Stores, Column-Family Stores, XML

72 Relational Data is usually stored in a row-by-row manner (row store) Standardized query language (SQL) Data model defined before you add data Joins merge data from multiple tables Results are tables Pros: mature ACID transactions with fine-grain security controls Cons: Requires up-front data modeling, does not scale well Examples: Oracle, MySQL, PostgreSQL, Microsoft SQL Server, IBM DB2

73 Analytical (OLAP) Based on "Star" schema with central fact table for each event Optimized for read-mostly analysis of historical data Use of MDX language to count and query "measures" for "categories" of data Pros: fast queries for large data Cons: not optimized for transactions and updates Examples: Cognos, Hyperion, MicroStrategy, Pentaho, Microsoft, Oracle, Business Objects

74 Key-Value Stores Keys used to access opaque blobs of data Values can contain any type of data (images, video) Pros: scalable, simple API (put, get, delete) Cons: no way to query based on the content of the value Examples: Berkeley DB, Memcached, DynamoDB, S3, Redis, Riak

75 Key Value Stores A table with two columns and a simple interface –Add a key-value –For this key, give me the value –Delete a key Blazingly fast and easy to scale (no joins) Key: string datatype Value: blob datatype
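
The three-operation interface is small enough to sketch in full. This is an in-memory illustration only; `KeyValueStore` is a hypothetical class, not the API of any product listed above. Note that the value is an opaque blob: the store never looks inside it, which is why there is no way to query by content.

```python
class KeyValueStore:
    """Minimal key-value interface: put, get, delete."""

    def __init__(self):
        self._data = {}

    def put(self, key: str, value: bytes) -> None:
        self._data[key] = value   # add or overwrite a key-value pair

    def get(self, key: str) -> bytes:
        return self._data[key]    # for this key, give me the value

    def delete(self, key: str) -> None:
        del self._data[key]       # delete a key


store = KeyValueStore()
store.put("image:1", b"\x89PNG...")   # the value is an opaque blob
fetched = store.get("image:1")
store.delete("image:1")
```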

76 M D The Locker Metaphor Copyright Kelly-McCreary & Associates, LLC 76 Key: Value: An arbitrary container data

77 M D No Subset Queries in Key-Value Stores Copyright Kelly-McCreary & Associates, LLC 77

78 Types of Key-Value Stores Eventually-consistent key-value stores Hierarchical key-value stores Key-value stores in RAM Key-value stores on disk High availability key-value stores Ordered key-value stores Values that allow simple list operations

79 M D Memcached Open source in-memory key-value caching system Make effective use of RAM on many distributed web servers Designed to speed up dynamic web applications by alleviating database load RAM resident key-value store for small chunks of arbitrary data (strings, objects) from results of database calls, API calls, or page rendering Simple interface for highly distributed RAM caches 30ms read times typical Designed for quick deployment, ease of development APIs in many languages Copyright Kelly-McCreary & Associates, LLC 79

80 Riak Open source distributed key-value store with support and commercial versions by Basho A "Dynamo-inspired" database Focus on availability, fault-tolerance, operational simplicity and scalability Support for replication and auto-sharding and rebalancing on failures Support for MapReduce, full-text search and secondary indexes of value tags Written in Erlang

81 M D Redis Open source in-memory key-value store with optional durability Focus on high speed reads and writes of common data structures to RAM Allows simple lists, sets and hashes to be stored within the value and manipulated Many features that developers like –expiration, transactions, pub/sub, partitioning Copyright Kelly-McCreary & Associates, LLC 81

82 M D Amazon DynamoDB Based around scalable key-value store Fastest growing product in Amazon's history SSD only database service Focus on throughput not storage and predictable read and write times Strong integration with S3 and Elastic MapReduce Copyright Kelly-McCreary & Associates, LLC 82

83 M D Column-Family Key includes a row, column family and column name Store versioned blobs in one large table Queries can be done on rows, column families and column names Pros: Good scale out, versioning Cons: Cannot query blob content, row and column designs are critical Copyright Kelly-McCreary & Associates, LLC 83 Examples: Cassandra, HBase, Hypertable, Apache Accumulo, Bigtable

84 Column Family (Bigtable) The champion of "Big Data" Excels at highly scalable systems Tightly coupled with MapReduce Technically a "sparse matrix" where most cells have no data Generating a list of all columns is non-trivial Examples: –Google Bigtable –Hadoop HBase –Hypertable

85 M D Spreadsheets Use a Row/Column as a Key Bigtable systems use a combination of row and column information as part of their key Copyright Kelly-McCreary & Associates, LLC 85 Column ID Row ID

86 Keys Include Family and Timestamps Bigtable systems have keys that include not just row and column ID but other attributes Column Families are created when a table is created Timestamps allow multiple versions of values Values are just ordered bytes and have no strongly typed data system
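
A composite key of this kind can be sketched with a Python tuple. The layout `(row, family, column, timestamp)` and the helpers `put` and `get_latest` are assumptions made for illustration; they do not reproduce the storage format or API of Bigtable, HBase or any other product.

```python
table = {}  # maps (row, family, column, timestamp) -> bytes


def put(row, family, column, timestamp, value: bytes):
    """Store one version of a cell; older versions are kept."""
    table[(row, family, column, timestamp)] = value


def get_latest(row, family, column):
    """Return the value with the highest timestamp for a cell, or None."""
    versions = [
        (ts, value)
        for (r, f, c, ts), value in table.items()
        if (r, f, c) == (row, family, column)
    ]
    if not versions:
        return None
    return max(versions, key=lambda pair: pair[0])[1]


put("user1", "profile", "email", 1, b"old@example.com")
put("user1", "profile", "email", 2, b"new@example.com")
latest = get_latest("user1", "profile", "email")
```

Both versions remain in the table; the timestamp in the key is what lets them coexist.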

87 Column Store Concepts Preserve the table structure familiar to RDBMS systems Not optimized for "joins" One row could have millions of columns but the data can be very "sparse" Ideal for high-variability data sets Column families allow queries on all columns that have a specific property or properties Allow new columns to be inserted without doing an "alter table" Trigger new columns on inserts [Diagram: one row with columns Col1 … Col100000]

88 Column Families Group columns into "Column families" Group column families into "Super-Columns" Be able to query all columns within a family or super family Similar data grouped together to improve speed [Diagram: Table containing Super Col X and Super Col Y, which contain Fam1 and Fam2, which contain Col-A and Col-B]

89 Hadoop/HBase Open source implementation of MapReduce algorithm written in Java Initially created by Yahoo –300 person-years development Column-oriented data store Java interface HBase designed specifically to work with Hadoop High-level query language (Pig) Strong support by many vendors

90 M D Cassandra Apache open source column family database supported by DataStax Peer-to-peer distribution model Strong reputation for linear scale out (millions of writes/second) Database side security Written in Java and works well with HDFS and MapReduce Copyright Kelly-McCreary & Associates, LLC 90

91 M D Netflix Copyright Kelly-McCreary & Associates, LLC 91

92 Graph Store Data is stored in a series of nodes, relationships and properties Queries are really graph traversals Ideal when relationships between data are key: –e.g. social networks Pros: fast network search, works with public linked data sets Cons: Poor scalability when graphs don't fit into RAM, specialized query languages (RDF uses SPARQL) Examples: Neo4j, AllegroGraph, Bigdata triple store, InfiniteGraph, Stardog

93 Graph Stores Used when the relationships and relationship types between items are critical Used for –Social networking queries: "friends of my friends" –Inference and rules engines –Pattern recognition –Working with open linked data Automate "joins" of public data

94 M D Nodes are "joined" to create graphs How do you know that two items reference the same object? Node identification – URI or similar structure Copyright Kelly-McCreary & Associates, LLC 94

95 M D Open Linked Data Copyright Kelly-McCreary & Associates, LLC 95

96 Neo4j Graph database designed to be easy to use by Java developers Dual license (community edition is GPL) Works as an embedded Java library in your application Disk-based (not just RAM) Full ACID

97 M D Document Store Data stored in nested hierarchies Logical data remains stored together as a unit Any item in the document can be queried Pros: No object-relational mapping layer, ideal for search Cons: Complex to implement, incompatible with SQL Copyright Kelly-McCreary & Associates, LLC 97 Examples: MarkLogic, MongoDB, Couchbase, CouchDB, eXist-db

98 M D Document Stores Store machine readable documents together as a single blob of data Use JSON or XML formats to store documents Similar to "object stores" in many ways No shredding of data into tables Sub-trees and attributes of documents can still be queried XQuery or other document query languages Quickly maturing to include ACID transaction support Lack of object-relational mapping permits agile development Fastest growing revenues (MarkLogic, MongoDB, Couchbase) Copyright Kelly-McCreary & Associates, LLC 98

99 M D Estimated Big Data and NoSQL Sales Copyright Kelly-McCreary & Associates, LLC 99 Document Stores

100 M D Document Structure Copyright Kelly-McCreary & Associates, LLC 100 is our root element contain a sequence of one to many elements Each contains the following sequence of elements Darker lines mean "required" and light lines mean optional elements Id and title are required Books have 0 to many author-names Format and license elements are codes that must be in a fixed list of choices Only valid URL characters Must be a valid decimal number

101 MarkLogic Native XML database designed to scale to petabyte data stores Leverages commodity hardware ACID compliant, schema-free document store Heavy use by federal agencies, document publishers and "high-variability" data Arguably the most successful NoSQL company

102 M D MongoDB Open Source JSON data store created by 10gen Master-slave scale out model Strong developer community Sharding built-in, automatic Implemented in C++ with many APIs (C++, JavaScript, Java, Perl, Python etc.) Copyright Kelly-McCreary & Associates, LLC 102

103 M D Couchbase Open source JSON document store Code base separate from CouchDB Built around memcached Peer to peer scale out model Written in C++ and Erlang Strengths in scale out, replication and high-availability Copyright Kelly-McCreary & Associates, LLC 103

104 CouchDB Apache CouchDB Open source JSON data store Document Model Written in Erlang RESTful JSON API Distributed, featuring robust, incremental replication with bi-directional conflict detection and management B-Tree based indexing Mobile version

105 M D eXist Open source native XML database Strong support for XQuery and XQuery extensions Heavily used by the Text Encoding Initiative (TEI) community and XRX/XForms communities Integrated Lucene search Collection triggers and versioning Extensive XQuery libs (EXPath) Version 2.0 has replication Copyright Kelly-McCreary & Associates, LLC 105

106 Structured Search What happens if you retain document structure when you index your documents? How can you use document structure to increase your precision and relevancy?

107 M D Information Retrieval Textbook Introduction to Information Retrieval by Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze Cambridge University Press, 2008 107

108 Table 10.1 RDB (relational database) search, unstructured information retrieval and structured information retrieval.
                    | RDB search       | unstructured retrieval | structured retrieval
objects             | records          | unstructured documents | trees with text at leaves
model               | relational model | vector space & others  | ?
main data structure | table            | inverted index         | ?
queries             | SQL              | free text queries      | ?

109 Table 10.1 - Revised RDB (relational database) search, unstructured information retrieval and structured information retrieval.
                    | RDB search       | unstructured retrieval | structured retrieval
objects             | records          | unstructured documents | trees with text at leaves
model               | relational model | vector space & others  | XML hierarchy
main data structure | table            | inverted index         | trees with node-ids for document ids
queries             | SQL              | free text queries      | XQuery fulltext

110 M D Two Models "Bag of Words" All keywords in a single container Only count frequencies are stored with each word "Retained Structure" Keywords associated with each sub-document component 110 'love' 'hate' 'new' 'fear' keywords doc-id Kelly-McCreary & Associates, LLC

111 M D Keywords and Node IDs Keywords in the reverse index are now associated with the node-id in every document 111 Node-id keywords document-id
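
The difference between the two index models can be sketched with two small dictionaries. The sample documents, the node names `title` and `body`, and the helper `search_in_node` are invented for illustration, and word frequencies are left out to keep the sketch short.

```python
from collections import defaultdict

# Two tiny documents, each with two structural nodes (made-up data).
docs = {
    "doc1": {"title": "new love", "body": "fear and hate"},
    "doc2": {"title": "new fear", "body": "love"},
}

bag_of_words = defaultdict(set)  # keyword -> {doc-id}
node_index = defaultdict(set)    # keyword -> {(doc-id, node-id)}

for doc_id, nodes in docs.items():
    for node_id, text in nodes.items():
        for word in text.split():
            bag_of_words[word].add(doc_id)           # structure is lost
            node_index[word].add((doc_id, node_id))  # structure is retained


def search_in_node(word, node_id):
    """Find documents where the word occurs inside a specific node."""
    return sorted(d for d, n in node_index[word] if n == node_id)
```

With the bag-of-words index, a search for "love" returns both documents. With node IDs retained, a search for "love" in the title returns only `doc1`, which is where the gain in precision comes from.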

112 M D Hybrid architectures Most real world implementations use some combination of NoSQL solutions Example: –Use document stores for data –Use S3 for image/pdf/binary storage –Use Apache Lucene for document index stores –Use MapReduce for real-time index and aggregate creation and maintenance –Use OLAP for reporting sums and totals Copyright Kelly-McCreary & Associates, LLC 112

113 Other Terms Hash (Merkle) Trees – a hash is computed for each leaf and combined up each branch to the root of a file system – a fast way to see if complex documents in tree structures have changed B-Trees, B+ Trees – balanced multi-way search trees that are used to create fast searches of large data sets Vector Clocks – an algorithm for tracking the causal order of updates across remote computer systems – used to detect which copies of data are stale or in conflict Bloom Filters – hashing tricks to avoid unnecessary disk operations on large data sets Consistent Hashing – a way of mapping keys to nodes so that adding or removing a node moves only a small fraction of the keys
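
Of these terms, consistent hashing is the easiest to show in code. The sketch below is a bare-bones hash ring under simplifying assumptions: `HashRing` is a hypothetical class, MD5 is used only as a convenient hash, and there are no virtual nodes, so the load may be uneven.

```python
import bisect
import hashlib


def _hash(text):
    """Map a string to a large integer position on the ring."""
    return int(hashlib.md5(text.encode()).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes):
        # Each node sits at the position given by the hash of its name.
        self._ring = sorted((_hash(n), n) for n in nodes)

    def node_for(self, key):
        """Pick the first node clockwise from the key's position."""
        positions = [h for h, _ in self._ring]
        i = bisect.bisect(positions, _hash(key)) % len(self._ring)
        return self._ring[i][1]


keys = [f"key{i}" for i in range(100)]
before = HashRing(["node-a", "node-b", "node-c"])
after = HashRing(["node-a", "node-b", "node-c", "node-d"])

# Keys whose owner changed when node-d joined the ring.
moved = [k for k in keys if before.node_for(k) != after.node_for(k)]
```

Every key in `moved` ends up on `node-d`; keys never shuffle between the existing nodes, which is what keeps rebalancing cheap.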

114 Tools to Help You Select A System ATAM – Architecture Tradeoff Analysis Method CMU-developed process to objectively select a system architecture based on business-driven use cases and quality metrics

115 M D ATAM Process Flow [Diagram] Inputs to Analysis: Business Drivers, Quality Attributes, User Stories; Architecture Plan, Architectural Approaches, Architectural Decisions. Outputs of Analysis: Tradeoffs, Sensitivity Points, Non-Risks, Risks. Risks lead to Risk Themes. Arrow labels: Distilled info, Impacts.

116 M D Sample Use Cases/User Stories Effort to Load New Data Effort to Query Data Effort to Transform Data Effort to Create RESTful Web Services Copyright Kelly-McCreary & Associates, LLC 116

117 M D Insert/Select/Publish Comparison [Chart: effort for SQL versus XQuery across three tasks (Insert, Query, Create Publishing Web Service) plus Total Effort. Other chart labels: WebDAV, Java, Tomcat, AXIS, JDBC, logical data modeling.]

118 M D Sample Quality Attribute Utility Tree Utility Searchability XML Importability Transformability Affordability Sustainability Interoperability Easy to add new XML data. (C, H) Use OpenSource Software. (H, H) Use long standing Standards. (VH, H) Use W3C Standards. (VH, H) Fulltext search on document data. (H, H) Easy to transform XML data. (H, H) Standards Ease of Change Use declarative languages. (VH, H) Security Prevents unauthorized access. (H, H) Fulltext Search XML Search Custom Scoring Drag-and-drop Bulk Import No License Fees XQuery XSLT Web Services Fine Grain Control Standards Based No Translation Key: (Importance, Score) Important to the Project C=Critical, VH=Very High, H=High, M=Medium Architectural Score H=High, M=Medium, L=Low Scoring via XQuery. (M, H) Role-based Works with W3C Forms. (H, M) Works with web standards. (VH, H) No complex languages to learn. (H, H) Fast searching. (H, H) Staff training. (H, M) Collection-based access. (H, M) Centralized security policy. (H, M) Mashups with REST Interfaces. (VH, H) Batch Import tools. (M, M) Transform to HTML or PDF. (VH, H) Customizable by non-programmers. (H, H)
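One way to turn the (Importance, Score) pairs in the utility tree into a single number for comparing candidate architectures is an importance-weighted average. The Python sketch below is an illustration only: the numeric weights are assumptions, ATAM itself does not prescribe them, and only a handful of scenarios from the tree are included.

```python
# Numeric weights are illustrative assumptions; ATAM does not prescribe them.
IMPORTANCE = {"C": 4, "VH": 3, "H": 2, "M": 1}  # Critical .. Medium
SCORE = {"H": 3, "M": 2, "L": 1}                # High .. Low

# A few scenarios from the utility tree as (description, importance, score).
SCENARIOS = [
    ("Easy to add new XML data", "C", "H"),
    ("Use W3C Standards", "VH", "H"),
    ("Fulltext search on document data", "H", "H"),
    ("Staff training", "H", "M"),
    ("Batch Import tools", "M", "M"),
]

def weighted_score(scenarios):
    """Return the importance-weighted average architectural score (1 to 3)."""
    total_weight = sum(IMPORTANCE[imp] for _, imp, _ in scenarios)
    weighted = sum(IMPORTANCE[imp] * SCORE[score] for _, imp, score in scenarios)
    return weighted / total_weight

print(round(weighted_score(SCENARIOS), 2))
```

Running the same scenarios against each candidate architecture, with its own score column, gives numbers that can be compared directly. The importance column stays fixed because it reflects the project, not the product.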

119 M D The Hard Part Assigning organizational "value" to –Ease of use –Standards –Vendor viability –Open source project viability –Portability Example: –WebDAV support Copyright Kelly-McCreary & Associates, LLC 119

120 M D Evolution of Ideas in OpenSource How quickly can new ideas be recombined into new database products? OpenSource software has proved to be the most efficient way to quickly recombine new ideas into new products. [Diagram labels: New Database Ideas, Schema-free, MapReduce, Auto-sharding, Cloud Computing, OpenSource, Proprietary Software, New Products, Product A, Product B] Source: The Nature of Technology: What It Is and How It Evolves by W. Brian Arthur

121 M D Start Finish Using the Wrong Architecture Credit: Isaac Homelund – MN Office of the Revisor 121 Kelly-McCreary & Associates, LLC

122 M D Using the Right Architecture Start Finish Find ways to remove barriers to empowering the non programmers on your team. 122 Kelly-McCreary & Associates, LLC

123 M D If You Give a Kid a Hammer… the whole world becomes a nail People solve problems using familiar tools People develop specific Cognitive Styles* based on training and experience What are we teaching the next generation of developers? * Source: Shoshana Zuboff: In the Age of the Smart Machine (1988)

124 M D Cognitive Bias in NoSQL
Anchoring bias - the tendency to produce an estimate near a cue amount - "Our managers were expecting an RDBMS solution so that’s what we gave them"
Availability heuristic - the tendency to estimate that what is easily remembered is more likely than that which is not - "I hear that NoSQL does not support ACID"
Bandwagon effect - the tendency to do or believe what others do or believe - "Everyone else here and in our local area uses RDBMS"
Confirmation bias - the tendency to seek out only that information that supports one's preconceptions - "We only read posts from the Oracle groups"
Framing effect - the tendency to react to how information is framed, beyond its factual content - "We know of some NoSQL projects that failed"
Sunk cost bias (related to, but distinct from, the gambler's fallacy) - the failure to reset one's expectations based on one's current situation - "We already paid for our Oracle license…"
Hindsight bias - the tendency to assess one's previous decisions as more efficacious than they were - "Our last five systems worked on RDBMS solutions"
Halo effect - the tendency to attribute unverified capabilities to a person based on an observed capability - "Oracle sells billions of dollars of licenses each year, how could so many people be wrong?"
Representativeness heuristic - the tendency to judge something as belonging to a class based on a few salient characteristics - "Our accounting systems work on RDBMS so why not our product search?"

125 M D The Future of the NoSQL Movement Will data sets continue to grow at exponential rates? Will new system options become more diverse? Will new markets have different demands? Will some ideas be "absorbed" into existing RDBMS vendors' products? Will the NoSQL community continue to be the place where new database ideas and products are incubated? Will the job of doing high-quality architectural tradeoff analysis become easier? [Diagram labels: Growth, Diversity]

126 M D Making Sense of NoSQL Copyright Kelly-McCreary & Associates, LLC 126

127 M D 2013 NoSQL Now! Dataversity's NoSQL Conference August 20-22 San Jose California Copyright Kelly-McCreary & Associates, LLC 127

128 M D Questions Thank You! Dan McCreary President, Kelly-McCreary & Associates twitter: dmccreary Copyright Kelly-McCreary & Associates, LLC 128
