Enabling data management in a big data world Craig Soules Garth Goodson Tanya Shastri.


1 Enabling data management in a big data world
Craig Soules, Garth Goodson, Tanya Shastri

2 The problem with data management
Hadoop is a collection of tools
– Not tightly integrated
– Everyone’s stack looks a little different
– Everything falls back to files

3 Agenda
Traditional data management
Hadoop’s eco-system
Natero’s approach to data management

4 What is data management?
What do you have?
– What data sets exist?
– Where are they stored?
– What properties do they have?
Are you doing the right thing with it?
– Who can access data?
– Who has accessed data?
– What did they do with it?
– What rules apply to this data?

5 Traditional data management
[Diagram labels: External Data Sources; Extract / Transform / Load; Data Warehouse; Integrated storage; Data processing; SQL; Users]

6 Key lessons of traditional systems
Data requires the right abstraction
– Schemas have value
– Tables are easy to reason about
Referenced by name, not location
Narrow interface
– SQL defines the data sources and the processing
But not where and how the data is kept!
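The "name, not location" lesson can be illustrated with a minimal sketch. Everything here is hypothetical and not taken from the talk: `TableEntry`, `Catalog`, `plan_scan`, and the `/warehouse/...` paths are illustrative names only. The point is that a request names a table and columns, the schema lets the request be validated, and the physical location can change without the request changing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableEntry:
    """Catalog record for one table (hypothetical structure)."""
    name: str
    columns: tuple   # schema: the column names callers may reference
    location: str    # physical location, never written into a request


class Catalog:
    """Maps table names to schema and location (illustrative only)."""

    def __init__(self):
        self._tables = {}

    def register(self, entry):
        self._tables[entry.name] = entry

    def relocate(self, name, new_location):
        # Moving the data changes the catalog, not the callers.
        old = self._tables[name]
        self._tables[name] = TableEntry(old.name, old.columns, new_location)

    def resolve(self, name):
        return self._tables[name]


def plan_scan(catalog, table, columns):
    """Validate a by-name request against the schema and return a scan plan."""
    entry = catalog.resolve(table)
    missing = [c for c in columns if c not in entry.columns]
    if missing:
        raise ValueError(f"unknown columns in {table}: {missing}")
    return {"read_from": entry.location, "columns": list(columns)}


catalog = Catalog()
catalog.register(
    TableEntry("orders", ("id", "customer", "total"), "/warehouse/a/orders")
)
plan_before = plan_scan(catalog, "orders", ["id", "total"])

# The same request still works after the data is moved.
catalog.relocate("orders", "/warehouse/b/orders")
plan_after = plan_scan(catalog, "orders", ["id", "total"])
```

The request `("orders", ["id", "total"])` is identical before and after the move; only the catalog entry differs.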

7 Hadoop eco-system
[Diagram labels: External Data Sources; Sqoop + Flume; HDFS storage layer; HBase; Processing Framework (Map-Reduce); Pig; HiveQL; Mahout; Hive Metastore (HCatalog); Oozie; Cloudera Navigator; Users]

8 Key challenges
More varied data sources with many more access / retention requirements
[Annotation on the Hadoop eco-system diagram from slide 7]

9 Key challenges
Data accessed through multiple entry points
[Annotation on the Hadoop eco-system diagram from slide 7]

10 Key challenges
Lots of new consumers of the data
[Annotation on the Hadoop eco-system diagram from slide 7]

11 Key challenges
One access control mechanism: files
[Annotation on the Hadoop eco-system diagram from slide 7]

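12 Steps to data management
– Provide access at the right level
– Limit the processing interfaces
– Schemas and provenance provide control
– Enforce policy
[The slide labels these steps 1 to 4; the transcript lists the numerals as "1 3 2 4", so the pairing of number to step is not recoverable.]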

13 Case study: Natero
Cloud-based analytics service
– Enable business users to take advantage of big data
– UI-driven workflow creation and automation
Single shared Hadoop eco-system
– Need customer-level isolation and user-level access controls
Goals:
– Provide the appropriate level of abstraction for our users
– Finer granularity of access control
– Enable policy enforcement
– Users shouldn’t have to think about policy
Source-driven policy management
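One way to read "source-driven policy management" is that policy is attached only to the original sources, and every derived data set inherits it through recorded provenance. The sketch below assumes that reading; it is not Natero's implementation. The `MetadataManager` class, its methods, and the role and data set names are all hypothetical.

```python
class MetadataManager:
    """Hypothetical store for source policies and provenance links."""

    def __init__(self):
        self._source_policy = {}   # source name -> roles allowed to read it
        self._inputs = {}          # derived data set -> its direct inputs

    def add_source(self, name, allowed_roles):
        self._source_policy[name] = frozenset(allowed_roles)

    def record_derivation(self, output, inputs):
        self._inputs[output] = list(inputs)

    def sources_of(self, dataset):
        """Walk provenance back to the original sources of a data set."""
        if dataset in self._source_policy:
            return {dataset}
        found = set()
        for parent in self._inputs[dataset]:
            found |= self.sources_of(parent)
        return found

    def allowed_roles(self, dataset):
        """A role may read a data set only if every source allows it."""
        role_sets = [self._source_policy[s] for s in self.sources_of(dataset)]
        return frozenset.intersection(*role_sets)


meta = MetadataManager()
meta.add_source("crm", {"sales", "analyst"})
meta.add_source("billing", {"finance", "analyst"})

# Users never set policy on derived data; it follows from provenance.
meta.record_derivation("joined", ["crm", "billing"])
meta.record_derivation("summary", ["joined"])

summary_sources = meta.sources_of("summary")
summary_roles = meta.allowed_roles("summary")
```

Under these assumptions, someone responsible for a source reasons about its policy once, and users who build `joined` or `summary` never state a policy themselves.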

14 Natero application stack
[Diagram labels: External Data Sources; Sqoop + Flume; HDFS storage layer; HBase; Processing Framework (Map-Reduce); Pig; HiveQL; Mahout; Access-aware workflow compiler; Schema Extraction; Policy and Metadata Manager; Provenance-aware scheduler; Users. Numbered callouts 1 to 4 also appear, as on slide 12.]

15 Natero execution example
[Diagram labels: Natero UI; Job Sources; Job Compiler; Metadata Manager; Scheduler]
– Fine-grain access control
– Auditing
– Enforceable policy
– Easy for users
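The execution flow on this slide can be sketched roughly as follows: a job names a table and columns, a compiler asks a metadata component whether the user may read each column, and every decision is logged. This is an assumption-laden illustration, not the system's actual design. `AccessManager`, `compile_job`, and the user, table, and column names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class AccessManager:
    """Hypothetical column-level grants plus an audit trail."""
    grants: dict                                  # (user, table) -> readable columns
    audit_log: list = field(default_factory=list)

    def check(self, user, table, columns):
        """Return the denied columns and record the decision."""
        allowed = self.grants.get((user, table), set())
        denied = sorted(set(columns) - allowed)
        self.audit_log.append((user, table, tuple(columns), not denied))
        return denied


def compile_job(manager, user, table, columns):
    """Refuse to produce a plan for any job that touches a denied column."""
    denied = manager.check(user, table, columns)
    if denied:
        raise PermissionError(f"{user} may not read {table} columns {denied}")
    return {"user": user, "scan": table, "project": list(columns)}


manager = AccessManager({("ana", "events"): {"ts", "page"}})

plan = compile_job(manager, "ana", "events", ["ts", "page"])

try:
    compile_job(manager, "ana", "events", ["ts", "email"])
    rejected = False
except PermissionError:
    rejected = True
```

Because the check happens at compile time, access control is finer than file permissions, and the audit log records both the allowed and the rejected attempt.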

16 The right level of abstraction
Our abstraction comes with trade-offs
– More control, compliance
– No more raw Map-Reduce
Possible to integrate with Pig/Hive
What’s the right level of abstraction for you?
– Kinds of execution

17 Hadoop projects to watch
HCatalog
– Data discovery / schema management / access
Falcon
– Lifecycle management / workflow execution
Knox
– Centralized access control
Navigator
– Auditing / access management

18 Lessons learned
If you want control over your data, you also need control over data processing
File-based access control is not enough
Metadata is crucial
Users aren’t motivated by policy
– Policy shouldn’t get in the way of use
– But you might get IT to reason about the sources

