
Joe Caserta President Elliott Cordo Chief Architect September 30, 2015, Javits Center, New York City Building a Data Lake for Digital Music Dominance.


1 Building a Data Lake for Digital Music Dominance
Joe Caserta, President; Elliott Cordo, Chief Architect
September 30, 2015, Javits Center, New York City

2 Big Data Strategy Innovation Technical Implementation Awards and Recognition

3

4

5 The Music Maze

6

7 Build a Dynamic Platform – Paradigm Shift
OLD WAY: Structure → Ingest → Analyze; Fixed Capacity; Monolith
NEW WAY: Ingest → Analyze → Structure; Dynamic Capacity; Ecosystem
RECIPE: Cloud, Data Lake, Polyglot Warehouse
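The "Ingest → Analyze → Structure" ordering can be illustrated with a minimal sketch, assuming raw JSON lines as the landed format. The names `raw_lake`, `ingest`, and `read_with_schema` are hypothetical and not from the talk; the list stands in for raw object storage.

```python
import json

# Hypothetical stand-in for raw object storage: lines are kept exactly as received.
raw_lake = []

def ingest(line):
    """Land the record untouched; no schema is enforced at write time."""
    raw_lake.append(line)

def read_with_schema(lines, schema):
    """Apply structure only at read time: keep the fields this analysis
    needs and cast them, ignoring everything else in the raw record."""
    for line in lines:
        record = json.loads(line)
        yield {name: cast(record[name])
               for name, cast in schema.items()
               if name in record}

ingest('{"track": "a", "plays": "3", "device": "ios"}')
ingest('{"track": "b"}')

# One possible view over the raw data; another analysis could bind a different schema.
plays_view = list(read_with_schema(raw_lake, {"track": str, "plays": int}))
```

The raw lines are never rewritten, so a later question can bind a different schema to the same data.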

8 Move to the Cloud
Existing On-Premise Solution:
– Challenges with operations of Hadoop servers in the data center
– Increasing infrastructure complexity
– Keeping up with data growth
Cloud Advantages:
– Reduced upfront capital investment
– Faster speed to value
– Elasticity
“Those that go out and buy expensive infrastructure find that the problem scope and domain shift really quickly. By the time they get around to answering the original question, the business has moved on.” - Matt Wood, AWS

9 Cost savings of dynamic capacity
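As a rough worked example of why dynamic capacity can cost less, the sketch below compares paying for peak capacity around the clock with paying only for the node-hours actually used. The demand profile and the price per node-hour are made-up illustrative numbers, not figures from the talk.

```python
def fixed_cost(peak_nodes, hours, rate):
    """Fixed capacity: pay for the peak node count for every hour."""
    return peak_nodes * hours * rate

def elastic_cost(hourly_demand, rate):
    """Dynamic capacity: pay only for the nodes running in each hour."""
    return sum(hourly_demand) * rate

# Hypothetical day: 4 nodes for 20 quiet hours, 20 nodes for a 4-hour batch window.
demand = [4] * 20 + [20] * 4
rate = 0.50  # hypothetical price per node-hour

fixed = fixed_cost(max(demand), len(demand), rate)
elastic = elastic_cost(demand, rate)
savings = 1 - elastic / fixed
```

Under these assumed numbers the fixed fleet costs 240.0 per day and the elastic one 80.0, so the saving comes entirely from not paying for idle peak capacity.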

10 Elasticity not only saves money

11 Essentially, Servers Suck
But more importantly, think Infrastructure as Code:
– Your servers should be API calls
– Use stateless processes
– Make all resources ephemeral
– Make everything scalable and elastic!

12 Ephemeral?
Disposable:
– Processing Fleets
– Elastic MapReduce Clusters
– Redshift Clusters
Use distributed services and systems to maintain state and preserve your data:
– Cassandra, Dynamo
– S3

13 Anatomy of our Processing Fleet
Components: S3 Input Buckets, Auto-scaling, Queuing service, S3 Output Buckets
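One way to picture such a fleet is the sketch below, in which plain dictionaries stand in for the S3 buckets and a `deque` stands in for the queuing service. The scaling rule, the `msgs_per_worker` ratio, and every function name are assumptions for illustration, not the fleet described in the talk.

```python
import math
from collections import deque

def desired_workers(queue_depth, msgs_per_worker=10, min_workers=0, max_workers=50):
    """Hypothetical auto-scaling rule: size the fleet from the queue backlog,
    clamped between a floor and a ceiling."""
    if queue_depth <= 0:
        return min_workers
    needed = math.ceil(queue_depth / msgs_per_worker)
    return max(min_workers, min(max_workers, needed))

def process(key, body):
    """Stateless unit of work: everything it needs arrives as arguments."""
    return key.replace("input/", "output/", 1), body.upper()

def drain(queue, input_bucket, output_bucket):
    """A worker loop: take a key off the queue, read the input object,
    write the result. No state lives on the worker itself."""
    while queue:
        key = queue.popleft()
        out_key, out_body = process(key, input_bucket[key])
        output_bucket[out_key] = out_body

input_bucket = {"input/a.txt": "play", "input/b.txt": "skip"}
output_bucket = {}
queue = deque(input_bucket)  # one message per input object

workers_needed = desired_workers(len(queue))
drain(queue, input_bucket, output_bucket)
```

Because the worker keeps nothing between messages, any instance can be terminated and replaced without losing work that is still on the queue.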

14 Elastic MapReduce
– Hadoop on demand
– No operations: your cluster dies, so what?
– Bootstrap whatever processing engine makes sense
– Programmatically estimate instance type and cluster size
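Estimating cluster size programmatically might look like the following sketch. The throughput figure, the size threshold, and the "small"/"large" instance classes are placeholder assumptions; real values would come from benchmarking the actual jobs.

```python
import math

def estimate_cluster(input_gb, gb_per_node_hour=50.0, target_hours=1.0,
                     max_nodes=100, large_threshold_gb=200.0):
    """Return (node_count, instance_class) for a job.

    All defaults are hypothetical: gb_per_node_hour is an assumed per-node
    throughput, and the instance classes are illustrative labels only.
    """
    if input_gb < 0 or gb_per_node_hour <= 0 or target_hours <= 0:
        raise ValueError("sizes, throughput, and target time must be positive")
    # Nodes needed to finish within the target time, clamped to [1, max_nodes].
    nodes = math.ceil(input_gb / (gb_per_node_hour * target_hours))
    nodes = max(1, min(max_nodes, nodes))
    instance_class = "large" if input_gb >= large_threshold_gb else "small"
    return nodes, instance_class

nightly = estimate_cluster(500)   # a mid-sized batch
adhoc = estimate_cluster(10)      # a small exploratory job
```

Since the cluster is created per job and thrown away afterwards, the estimate only needs to be good enough for that one run.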

15 You May Need Some Persistent Servers
If at all possible, they should be inherently scalable, distributed, and elastic

16 Move to a Data Lake Paradigm
Technology:
– Scalable distributed storage → S3
– Pluggable fit-for-purpose processing → EMR
Functional Capabilities:
– Remove barriers from data ingestion and analysis
– Storage and processing for all data
– Tunable governance

17 Putting it together: The Big Data Pyramid
Column headings: Usage Pattern, Data Governance
Diagram labels: Ingest Raw Data; Organize, Define, Complete; Munging, Blending; Machine Learning; Data Quality and Monitoring; Metadata, ILM, Security; Data Catalog; Data Integration; Fully Governed (trusted); Arbitrary/Ad-hoc Queries and Reporting; Metadata, ILM, Security

18 Data Ingestion and Onboarding
Incoming to S3:
– Lightweight API wrapper
– Web front end
– Direct writes to S3
Ingest the data in a reasonable partitioning schema: Bucket and Keys
Turn analysts and data scientists loose → Late bind analytics
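A partitioning scheme expressed through object keys could be sketched as below. The layout (source, dataset, then `year=/month=/day=` segments) and the names `build_key` and `day_prefix` are assumptions chosen for illustration; the talk does not specify its actual key layout.

```python
from datetime import datetime, timezone

def day_prefix(source, dataset, event_time):
    """Key prefix covering one day of one dataset, so a reader can list or
    process a single partition without scanning the whole bucket."""
    return (f"{source}/{dataset}/"
            f"year={event_time:%Y}/month={event_time:%m}/day={event_time:%d}/")

def build_key(source, dataset, event_time, filename):
    """Full object key for a raw file landed in the hypothetical layout."""
    return day_prefix(source, dataset, event_time) + filename

when = datetime(2015, 9, 30, tzinfo=timezone.utc)
key = build_key("provider_a", "streams", when, "part-0001.json")
prefix = day_prefix("provider_a", "streams", when)
```

Keeping the partition values in the key means the raw files stay unmodified while analysts still get cheap pruning by source and date.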

19 But we need to feed the cash register
Data needs to be refined and mapped:
– Processing Fleet
– EMR
80/20 rule: metadata driven when possible
Abstract away “Big Data”
And make sure it’s right!
– Automated data quality checks using HAMBOT, soon to be open sourced
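The idea of metadata-driven data quality checks can be sketched as follows. This is not HAMBOT and does not reflect its API; the rule format, the check names, and `run_checks` are hypothetical, meant only to show rules living in data rather than in code.

```python
# Rules are plain metadata: adding a check for a new column means adding a
# dictionary, not writing new code. The format is an assumption.
RULES = [
    {"column": "track_id", "check": "not_null"},
    {"column": "plays", "check": "min", "value": 0},
]

# Small library of reusable check functions, looked up by name.
CHECKS = {
    "not_null": lambda value, rule: value is not None and value != "",
    "min": lambda value, rule: value is not None and value >= rule["value"],
}

def run_checks(rows, rules):
    """Return (row_index, column, check) for every rule a row violates."""
    failures = []
    for index, row in enumerate(rows):
        for rule in rules:
            value = row.get(rule["column"])
            if not CHECKS[rule["check"]](value, rule):
                failures.append((index, rule["column"], rule["check"]))
    return failures

rows = [
    {"track_id": "t1", "plays": 3},
    {"track_id": None, "plays": 5},
    {"track_id": "t3", "plays": -1},
]
failures = run_checks(rows, RULES)
```

Here the second row fails the `not_null` rule and the third fails the `min` rule, which is the kind of result an automated gate could act on before data reaches reporting.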

20 Think Data Ecosystem, Not Tech Stack
“…any decent sized enterprise will have a variety of different data technologies for different kinds of data. There will still be large amounts of it managed in relational stores, but increasingly we'll be first asking how we want to manipulate the data and only then figuring out what technology is the best bet for it.” - Martin Fowler

21 Polyglot in Practice
Best practices from traditional EDW:
– Consolidation
– Data Governance
– Master Data
– Tuned for analytics
Applied to:
– Fit-for-purpose technologies and approaches
– Relational, MPP, Graph, KV, Time-series DB, Data Lake
Apply “tunable governance” and traditional principles
Use the right tool for the job

22 The Landscape for Digital Dominance
Diagram labels: Landing Queue Data Lake BDW Data Science API Data Providers Near Real-time Batch Data Science Clusters EDW Graph RDS Metastore

23 Joe Caserta, President, Caserta Concepts, joe@casertaconcepts.com, @joe_Caserta
Elliott Cordo, Chief Architect, Caserta Concepts, elliott@casertaconcepts.com
Award-winning company; Transformative Data Strategies; Modern Data Engineering; Advanced Architecture
Innovation Partner; Strategic Consulting; Advanced Technical Design; Build & Deploy Solutions
BDW Meetup New York City; 3,000+ members; Knowledge sharing
Data is not important, it’s what you do with it that’s important!
Thank You

