
Fast Data Made Easy with Kafka and Kudu (Ted Malaska and Jeff Holoman, Cloudera)


1 Fast Data Made Easy with Kafka and Kudu
Ted Malaska, Cloudera (@tedmalaska)
Jeff Holoman, Cloudera (@jeffholoman)

2 A Little History
First there was HDFS and MapReduce, and this did a lot of things amazingly well. To this day MapReduce, with the inclusion of Hive, is probably the most used processing framework on Hadoop. But other use cases emerged that made a lot of sense, so HBase was introduced into the stack. HBase does a couple of things amazingly well: random access by a single row, and highly performant inserts. So HBase can be good when you need to mutate data and store a lot of it very quickly. However, it is difficult to access, and it is terrible at the one thing Hadoop is prized for: large scans across datasets. You just can't realistically do analytics, or even very performant queries, against data that resides in HBase. A couple of years later we got Impala, and this was a huge step forward because it brought us closer to RDBMS-like query performance over massive amounts of data. As Impala has matured, it has shown us that relatively low-latency SQL access is possible.
Two notes for later: exactly-once keys -> if you write progressively with deterministic keys, replays de-duplicate without tracking entities. Counting -> increments are terrible because they carry no state; with puts that carry the full value we can count perfectly (see the sketch below). And Spark can do the windowing analysis without the shuffle.
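The counting point is easiest to see in code. A minimal sketch, assuming the HBase Java client called from Scala; the table, column family, and qualifier names are hypothetical. A replayed increment adds the delta again, while a replayed put of the full value just overwrites the cell with the same number.

  import org.apache.hadoop.hbase.client.{Increment, Put, Table}
  import org.apache.hadoop.hbase.util.Bytes

  // Non-idempotent: the delta lives only in the RPC, so retrying after an
  // unacknowledged success counts the same event twice.
  def countWithIncrement(table: Table, rowKey: String): Unit =
    table.increment(
      new Increment(Bytes.toBytes(rowKey))
        .addColumn(Bytes.toBytes("f"), Bytes.toBytes("count"), 1L))

  // Idempotent: the put carries the full running total, so replaying the
  // same write lands the same value on the same cell.
  def countWithPut(table: Table, rowKey: String, runningTotal: Long): Unit =
    table.put(
      new Put(Bytes.toBytes(rowKey))
        .addColumn(Bytes.toBytes("f"), Bytes.toBytes("count"), Bytes.toBytes(runningTotal)))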

3 Bank Ledger
A stream of XML transactions (100 insert, 101 insert, 100 update, 100 update, 102 insert) flows from the RDBMS into Hadoop and is queried with SQL. Requirements:
All txns must be queryable within 5 min
XML must be parsed and reformatted
In-process counting
100% correct
So what would a typical application architecture look like?

4 Distributed Systems: Things Fail
Systems are designed to tolerate failure. We must expect failures, and design our code and configure our systems to handle them.

5 Option 1: Sqoop
Pull from the RDBMS into Hadoop with Sqoop and query with SQL. The same requirements apply: all txns queryable within 5 min, XML parsed and reformatted, in-process counting, 100% correct.

6 Option 2
Stream the txns from the RDBMS straight into Hadoop and query with SQL. Same requirements. The catch: compaction and de-duplication are hard to do in-process.

7 “There are only two hard problems in distributed systems: 2. Exactly-once delivery 1. Guaranteed order of messages 2. Exactly-once delivery” -- Mathias Verraes (@mathiasverraes)

8 Option 3a
Stream the txns from the RDBMS into HBase, fronted by SQL plus application code, with Hadoop alongside. Same requirements, and this one is correct. Drawbacks: compaction from HBase to HDFS is complex, and HBase scans and joins are slow.

9 Option 3b
As 3a, but with a check step between HBase and the SQL + app layer. Same requirements, still correct, but compaction remains complex.

10 The New Option
Stream the txns from the RDBMS into a store that serves SQL or an app directly. Now all txns are queryable within seconds, not five minutes; XML is parsed and reformatted; in-process counting works; and it is correct. For free: exactly-once delivery, immediately available data, guaranteed ordering, and updates.

11 Apache Kudu (Incubating)
Columnar datastore: fast inserts/updates and efficient scans, with real-time access. Complements HDFS and HBase.
Diagram: row-based storage lays a table with columns A, B, C out row by row (A1 B1 C1 / A2 B2 C2 / A3 B3 C3); columnar storage lays it out column by column (A1 A2 A3 / B1 B2 B3 / C1 C2 C3).


13 Kudu Ledger Table
create table `ledger` (
  uuid STRING,
  transaction_id STRING,
  customer_id INT,
  source STRING,
  db_action STRING,
  time_utc STRING,
  `date` STRING,
  amount_dollars INT,
  amount_cents INT,
  local_timestamp BIGINT
)
DISTRIBUTE BY HASH(transaction_id) INTO 20 BUCKETS
TBLPROPERTIES(
  'storage_handler' = 'com.cloudera.kudu.hive.KuduStorageHandler',
  'kudu.table_name' = 'ledger',
  'kudu.master_addresses' = 'jhol-1.vpc.cloudera.com:7051',
  'kudu.key_columns' = 'transaction_id,uuid',
  'kudu.num_tablet_replicas' = '3'
);
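Once registered, the table can be queried like any other. A minimal sketch, mirroring the SparkSQL pattern the MLlib demo uses later in the deck; the aggregate itself is illustrative, not from the slides:

  // Hypothetical query: per-customer totals over the Kudu-backed ledger table
  val totals = sqlContext.sql(
    """SELECT customer_id,
      |       SUM(amount_dollars * 100 + amount_cents) AS total_cents
      |FROM ledger
      |GROUP BY customer_id""".stripMargin)
  totals.show()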

14 API
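A minimal sketch of a write against the ledger table, assuming the Apache Kudu Java client called from Scala; the package name follows the current Apache client, the master address comes from the DDL above, and the row values are hypothetical.

  import org.apache.kudu.client.KuduClient

  object LedgerWrite {
    def main(args: Array[String]): Unit = {
      val client = new KuduClient.KuduClientBuilder("jhol-1.vpc.cloudera.com:7051").build()
      try {
        val table = client.openTable("ledger")
        val session = client.newSession()

        // An insert keyed by (transaction_id, uuid): a replay collides on the
        // same key instead of creating a second row, which is what makes
        // de-duplication cheap.
        val insert = table.newInsert()
        val row = insert.getRow
        row.addString("transaction_id", "100")
        row.addString("uuid", "hypothetical-uuid-0001")
        row.addString("db_action", "insert")
        row.addInt("amount_dollars", 10)
        row.addInt("amount_cents", 99)
        // remaining columns omitted for brevity

        session.apply(insert)
        session.flush()
        session.close()
      } finally {
        client.close()
      }
    }
  }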

15 Kudu Aggregation Demo
CREATE EXTERNAL TABLE `gamer` (
  `gamer_id` STRING,
  `last_time_played` BIGINT,
  `games_played` INT,
  `games_won` INT,
  `oks` INT,
  `deaths` INT,
  `damage_given` INT,
  `damage_taken` INT,
  `max_oks_in_one_game` INT,
  `max_deaths_in_one_game` INT
)
TBLPROPERTIES(
  'storage_handler' = 'com.cloudera.kudu.hive.KuduStorageHandler',
  'kudu.table_name' = 'gamer',
  'kudu.master_addresses' = 'ip us-west-2.compute.internal:7051',
  'kudu.key_columns' = 'gamer_id'
);

16 Kudu Aggregation Architecture
Generator -> Kafka -> Spark Streaming -> Kudu, with Impala, SparkSQL, and Spark MLlib reading the results out of Kudu. A sketch of the streaming leg follows.
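A minimal sketch, assuming Spark 1.x with the direct Kafka integration of that era; the broker address, topic name, and per-event handling are hypothetical, and the Kudu write is left as a comment since the slides don't show it.

  import kafka.serializer.StringDecoder
  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}
  import org.apache.spark.streaming.kafka.KafkaUtils

  object GamerAggregation {
    def main(args: Array[String]): Unit = {
      val ssc = new StreamingContext(new SparkConf().setAppName("GamerAggregation"), Seconds(5))

      // Read generator events from Kafka as (key, value) strings
      val events = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
        ssc,
        Map("metadata.broker.list" -> "broker-1:9092"),
        Set("gamer-events"))

      events.map(_._2).foreachRDD { rdd =>
        rdd.foreachPartition { partition =>
          // Open one Kudu session per partition; parse each event and
          // upsert the aggregate row for its gamer_id (omitted here).
          partition.foreach(event => ())
        }
      }

      ssc.start()
      ssc.awaitTermination()
    }
  }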

17 Kudu Aggregation Demo

18 Kudu Aggregation MLlib
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Pull the current aggregates out of the Kudu-backed gamer table
val resultDf = sqlContext.sql("SELECT gamer_id, oks, games_won, games_played FROM gamer")

// Build a dense feature vector (oks, games_won, games_played) per gamer
val parsedData = resultDf.map(r => {
  val array = Array(r.getInt(1).toDouble, r.getInt(2).toDouble, r.getInt(3).toDouble)
  Vectors.dense(array)
})

// KMeans.train fails on an empty RDD, so guard first
val dataCount = parsedData.count()
if (dataCount > 0) {
  // 3 clusters, 5 iterations
  val clusters = KMeans.train(parsedData, 3, 5)
  clusters.clusterCenters.foreach(v => println(" Vector Center:" + v))
}

19 Kudu CDC Demo
CREATE EXTERNAL TABLE `gamer_cdc` (
  `gamer_id` STRING,
  `eff_to` STRING,
  `eff_from` STRING,
  `last_time_played` BIGINT,
  `games_played` INT,
  `games_won` INT,
  `oks` INT,
  `deaths` INT,
  `damage_given` INT,
  `damage_taken` INT,
  `max_oks_in_one_game` INT,
  `max_deaths_in_one_game` INT
)
TBLPROPERTIES(
  'storage_handler' = 'com.cloudera.kudu.hive.KuduStorageHandler',
  'kudu.table_name' = 'gamer_cdc',
  'kudu.master_addresses' = 'ip us-west-2.compute.internal:7051',
  'kudu.key_columns' = 'gamer_id, eff_to'
);

20 Kudu CDC Architecture
Get the row for the Gamer_Id with an empty Eff_To. Record found?
No: put a new row for the Gamer_Id with an empty Eff_To.
Yes: put the old record under the Gamer_Id with the new Eff_To, then update the empty-Eff_To row in place with the new data. A sketch of this branching follows.
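A minimal sketch of that branching in plain Scala, with a pure function standing in for the Kudu get/put/update operations; the row shape and helper are hypothetical, and only the key scheme (gamer_id plus eff_to, with an empty eff_to marking the current version) comes from the slides.

  // Hypothetical row shape; effTo == "" marks the current version.
  case class GamerRow(gamerId: String, effTo: String, effFrom: String, oks: Int)

  // Returns the rows to write back to Kudu for one incoming change.
  def applyChange(current: Option[GamerRow], incoming: GamerRow, now: String): Seq[GamerRow] =
    current match {
      case None =>
        // No record found: put the new row with an empty eff_to.
        Seq(incoming.copy(effTo = ""))
      case Some(old) =>
        Seq(
          old.copy(effTo = now),                    // put the old record under the new eff_to
          incoming.copy(effTo = "", effFrom = now)  // update the current row in place
        )
    }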

21 Kudu Bitemporality
Starting point:
Gamer_ID | Eff_To | Eff_Fr | Data
42 | (empty) | 3/20/16 | Foo
Insert new Eff_To (put a copy of the old record under the new Eff_To):
Gamer_ID | Eff_To | Eff_Fr | Data
42 | (empty) | 3/20/16 | Foo
42 | 3/31/16 | 3/20/16 | Foo
Update old record to new:
Gamer_ID | Eff_To | Eff_Fr | Data
42 | (empty) | 3/31/16 | Bar
42 | 3/31/16 | 3/20/16 | Foo

22 Kudu CDC Demo

