

1 Big Data Tools Overview Avi Freedman ServerCentral Technology Executives Club November 13, 2013

2 What is Big Data? Canonical definition:
 Volume: billions or trillions of rows
 Variety: different schemas
 Velocity: hundreds of thousands of records/sec
Traditional systems have difficulty handling data in these dimensions, even with scale-up, partitioning, and sharding. Clustered/scale-out solutions are required to solve old problems in new ways. But that makes it hard to meet traditional database integrity requirements. Our focus will be on open-source and under-$1 million technology stacks, not on traditional BI or Teradata-style "make SQL work" solutions.

3 Big Data Search Trends [Google Trends chart comparing search interest in "Big Data", "Data Mining", and "Semantics". Source: Google Trends.]

4 Tech Background: Scale-up vs Scale-out To scale up, you buy a bigger machine, but there are limits to how far you can go. Scaling out with traditional software designed for single-machine architectures is typically done by:
 Making read replicas (doesn't help with volume or write-heavy workloads);
 Clustering with master/master architectures, which still doesn't help with volume and can increase latency;
 Sharding or partitioning (next slide).

5 Tech Background: Sharding/Partitioning When you shard a database, you split it by sets of the data, typically keyed on some attribute (so names starting with "A-C" go one place, "D-F" another, etc.); see the routing sketch below. This can be difficult to do manually. Partitioning is usually implemented by slicing the database into separate tables (often all on the same machine), typically by time.
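A minimal sketch of the range-based routing described above, in Python; the shard names and letter ranges are hypothetical:

```python
# Route each record to a shard by the first letter of its key.
# Shard names and ranges are made up for illustration.
SHARDS = {
    "shard-1": ("a", "c"),   # names starting A-C
    "shard-2": ("d", "f"),   # names starting D-F
    "shard-3": ("g", "z"),   # everything else
}

def shard_for_key(key: str) -> str:
    """Return the shard whose letter range covers the key's first letter."""
    first = key[0].lower()
    for shard, (lo, hi) in SHARDS.items():
        if lo <= first <= hi:
            return shard
    raise ValueError(f"no shard covers key {key!r}")

print(shard_for_key("Alice"))  # shard-1
print(shard_for_key("Eve"))    # shard-2
```

The hard part in practice is not the routing function but everything around it: rebalancing when one range grows hot, and querying across shards.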

6 Tech Background: ACID and CAP ACID:
 Atomicity (transactions are all or nothing)
 Consistency (the end result is checked against constraints)
 Isolation (transactions don't affect each other)
 Durability (transactions, once committed, are permanent)
The CAP theorem says a distributed system can't have all three of:
 Consistency (all nodes see the same data)
 Availability (every request gets a response)
 Partition tolerance (the system survives any part failing)
Big Data solutions typically "relax" ACID and are subject to the CAP theorem.
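To make atomicity concrete, here is a small sketch using Python's built-in sqlite3: inside a transaction, either both halves of a funds transfer commit or neither does. The account names and amounts are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()  # commit the setup before the demo transaction

try:
    with conn:  # one transaction: commits on success, rolls back on error
        conn.execute("UPDATE accounts SET balance = balance - 50 "
                     "WHERE name = 'alice'")
        raise RuntimeError("simulated crash mid-transfer")
        # the matching credit to bob never runs
except RuntimeError:
    pass

# Atomicity: alice still has 100; the partial debit was rolled back.
print(dict(conn.execute("SELECT name, balance FROM accounts")))
```

This is exactly the guarantee many Big Data systems relax: in an "eventually consistent" store, the debit might be visible on some nodes before the credit.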

7 Big Data Technologies
 Map/Reduce: Hadoop, HPCC
 (Emerging) Streaming databases
 NoSQL:
   Key/Value stores
   Document/schema-free databases
   Columnar (Dremel, Impala, Drill)
   Graph databases (e.g., for social media)
 NewSQL
 Revival of classic SQL DBs

8 NoSQL Introduction
 No stored procedures
 Partial to full SQL support
 Clustered
 High volume
 Not ACID (a problem for funds transfers, power failures, selling the last item twice)

9 NoSQL: Map/Reduce
 Currently used mainly for batch processing, but streaming is being grafted on.
 Older versions had single points of failure, but newer versions have implemented system-wide redundancy.
 Not ACID, though there is some basic "check and set" functionality in the underlying databases.
 There are SQL-like interfaces (Hive and Pig).
 Latency is typically VERY high: minutes to get query results.
 To be efficient, the map/reduce jobs are usually written in Java, which is an obstacle in many environments. (A sketch of the programming model follows below.)
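A conceptual sketch of the map/reduce programming model, run in-process in Python with no Hadoop cluster: map emits (word, 1) pairs, a shuffle step groups them by key, and reduce sums each group.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in the input line."""
    for word in line.split():
        yield word.lower(), 1

def reduce_phase(word, counts):
    """Reduce: collapse all counts for one word into a total."""
    return word, sum(counts)

lines = ["big data tools", "big data trends", "data mining"]

# Shuffle: group all mapped values by key.
groups = defaultdict(list)
for line in lines:
    for word, count in map_phase(line):
        groups[word].append(count)

results = dict(reduce_phase(w, c) for w, c in groups.items())
print(results)  # {'big': 2, 'data': 3, 'tools': 1, 'trends': 1, 'mining': 1}
```

On a real cluster, the map and reduce functions run in parallel across many machines and the shuffle moves data over the network, which is where the minutes of latency come from.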

10 NoSQL: Key/Value Store One of the first examples of "NoSQL" software was the set of systems developed to handle key/value lookups. In this kind of system, you can set, delete, or read a key (like "cloud services") with one value (like "are fun"). Values can be lists or even more complex data structures. Sample applications:
 Web cookies
 State for massively multiplayer online games
 Real-time ad placement
 Fraud and intrusion detection monitoring

11 NoSQL: Key/Value Store Leading packages that implement key/value stores:
 memcached (clusters, but isn't persistent)
 redis (clusters and is persistent to disk)
 riak (clusters, persistent to disk, goes up the chain a bit, but not as performant once disk I/O kicks in)
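A minimal sketch of the set/get/delete interface using the redis Python client (pip install redis); it assumes a redis server running on localhost, and the session key and value are made up.

```python
import redis

r = redis.Redis(host="localhost", port=6379)

r.set("session:42", "cloud services are fun")  # set a key
print(r.get("session:42"))  # read it back: b'cloud services are fun'
r.delete("session:42")      # delete it
```

The entire API surface really is about this small, which is why these systems can sustain very high request rates.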

12 NoSQL: Document DBs
 Related to key/value stores; typically a superset where you still get a key, but the value can be a large structured set of data (a "document").
 Usually have more sophisticated pattern-matching lookups.
 MongoDB is the thought leader, if not the market leader.
 Riak and Couchbase are arguably second in the space.
 All are still evolving, not perfect, and require some tuning.
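A short sketch of the document model using MongoDB's pymongo driver (pip install pymongo); it assumes a mongod instance on localhost, and the database, collection, and field names are made up.

```python
from pymongo import MongoClient

db = MongoClient("localhost", 27017).demo

# The value under one key is a whole structured document.
db.users.insert_one({"name": "avi", "tags": ["big data", "networks"]})

# Pattern-matching lookup inside the document structure:
# this matches because "big data" appears in the tags array.
print(db.users.find_one({"tags": "big data"}))
```

Contrast with a pure key/value store: here the query engine can look inside the value, not just fetch it by key.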

13 NoSQL: Columnar DBs
 Older-generation columnar databases like HBase (part of the Hadoop ecosystem) were clustered but not fast enough to 'move the needle'.
 Newer implementations inspired by Google's Dremel, like Apache Drill and Cloudera Impala, are orders of magnitude faster (seconds for some queries), also cluster, and can deal even better with large ingest volumes and variable schemas.
 Apache Drill is just out in alpha, and Impala has yet to achieve the performance of Google's hosted Dremel service.
 But these systems may be the closest to threatening the typical Aster and even core Teradata use cases.
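A conceptual sketch (plain Python, no database) of why columnar layout helps analytics: an aggregate over one field scans one contiguous array instead of touching every row. The field names and values are made up.

```python
rows = [  # row-oriented: each record stored together
    {"user": "a", "bytes": 120, "region": "us"},
    {"user": "b", "bytes": 300, "region": "eu"},
    {"user": "c", "bytes": 150, "region": "us"},
]

columns = {  # column-oriented: each field stored together
    "user":   ["a", "b", "c"],
    "bytes":  [120, 300, 150],
    "region": ["us", "eu", "us"],
}

# SUM(bytes): the row version touches every field of every row;
# the columnar version scans only the 'bytes' array.
print(sum(r["bytes"] for r in rows))  # 570
print(sum(columns["bytes"]))          # 570
```

At billions of rows, reading only the queried columns (plus per-column compression) is what turns minutes into seconds.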

14 NoSQL: Clustered SQL
 Cassandra offers SQL-like (CQL) access and clusters, but is not ACID.
 Used by many web-scale companies.
 Has a relatively steep learning curve, though there are commercial providers to assist.
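A sketch of what that SQL-like access looks like from the DataStax Python driver (pip install cassandra-driver); it assumes a node on localhost, and the keyspace and table names are made up.

```python
from cassandra.cluster import Cluster

# Connect to a (hypothetical) 'demo' keyspace on a local node.
session = Cluster(["127.0.0.1"]).connect("demo")

# CQL reads like SQL, but there are no joins or multi-row transactions.
session.execute(
    "INSERT INTO events (id, payload) VALUES (%s, %s)",
    (42, "page_view"),
)
for row in session.execute("SELECT id, payload FROM events"):
    print(row.id, row.payload)
```

The familiar syntax is part of the appeal; the learning curve comes from data modeling around partition keys rather than from the query language.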

15 NoSQL: Graph DBs
 Systems like neo4j have evolved to deal with problems that arise in heavily connected data, where one is looking for instances or patterns in the relationships between items.
 One key space where they are used is social networks (and evil government projects).
 Very specialized; we don't see many instances deployed in the enterprise.
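A conceptual sketch of the kind of relationship query a graph database answers natively (friends-of-friends), written here against a plain Python adjacency map; the names are made up.

```python
friends = {
    "ann":   {"bob", "carol"},
    "bob":   {"ann", "dave"},
    "carol": {"ann", "dave"},
    "dave":  {"bob", "carol", "eve"},
}

def friends_of_friends(person):
    """People two hops away who are not already direct friends."""
    direct = friends[person]
    two_hops = set().union(*(friends[f] for f in direct))
    return two_hops - direct - {person}

print(friends_of_friends("ann"))  # {'dave'}
```

In a relational database this query is a self-join per hop, which gets painful fast; a graph store keeps the hops cheap, which is the whole point of the category.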

16 NewSQL "NewSQL" was coined recently to describe databases that attempt to cluster (scale out) while maintaining ACID properties. Two leaders:
 FoundationDB (currently has a 96-core and 100 TB limit; does not require data to fit in RAM)
 VoltDB (doing complex work requires Java skills, and it is costly because all data must fit in RAM across the nodes)
People are watching this space with interest, but many are dubious about how fast these will develop into truly scalable offerings. Both offer easy download access for testing in customer environments.

17 Revival of Classic SQL DBs
 The Microsoft, Oracle/MySQL, Postgres, and MariaDB projects/companies are all thinking about and implementing more scale-out functionality.
 Most of the initial approaches seem to be automating the process of sharding and partitioning databases.
 We see people trying this most in the MySQL community, but the vast majority are still sharding and replicating to deal with scale.

18 Gotchas to Watch For ACID compliance:
 Do you get transactional 'correctness', or is the system 'eventually consistent'?
 At high sustained volumes, eventual consistency may never catch up.
Ease of use:
 Non-SQL systems (like Map/Reduce) can be difficult to learn and train for.
 Many Big Data systems can be difficult for DB admins to learn and install.
 Commercial solutions can address this, but can also cost 10x.

19 Gotchas to Watch For What are your applications' requirements of the data backends? Application support:
 If you don't write your own applications, will they support using a big data solution on the backend?
 It's sometimes possible to write 'adapters' underneath commercial applications, but this is risky: the applications may change their schemas or methods without notifying users.

20 Questions? Avi Freedman ServerCentral avi@servercentral.com Technology Executives Club November 13, 2013

