Enabling data management in a big data world
Craig Soules, Garth Goodson, Tanya Shastri

The problem with data management
Hadoop is a collection of tools:
– Not tightly integrated
– Everyone's stack looks a little different
– Everything falls back to files

Agenda
– Traditional data management
– Hadoop's ecosystem
– Natero's approach to data management

What is data management?
What do you have?
– What data sets exist?
– Where are they stored?
– What properties do they have?
Are you doing the right thing with it?
– Who can access data?
– Who has accessed data?
– What did they do with it?
– What rules apply to this data?

Traditional data management
External data sources are pulled through an Extract-Transform-Load (ETL) pipeline into the data warehouse, which couples integrated storage with data processing; users reach it only through SQL.
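To make the ETL-into-warehouse path concrete, here is a minimal, self-contained sketch in Python. It uses the standard-library sqlite3 module as a stand-in for the warehouse and an invented events.csv source file; both are illustrative assumptions, not something from the talk.

```python
# Minimal ETL sketch: extract from a CSV source, transform, load into a
# SQL "warehouse" (sqlite3 stands in for a real data warehouse).
import csv
import sqlite3

def extract(path):
    # Extract: read raw rows from an external source (hypothetical events.csv).
    with open(path, newline="") as f:
        yield from csv.DictReader(f)

def transform(rows):
    # Transform: normalize types and drop malformed records.
    for row in rows:
        try:
            yield (row["user_id"], row["event"], float(row["amount"]))
        except (KeyError, ValueError):
            continue  # skip rows that don't fit the schema

def load(rows, conn):
    # Load: land the cleaned rows in a named, schema-bearing table.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS events (user_id TEXT, event TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)
    conn.commit()

if __name__ == "__main__":
    conn = sqlite3.connect("warehouse.db")
    load(transform(extract("events.csv")), conn)
    # Consumers only ever see a table name and SQL, never file locations.
    for row in conn.execute("SELECT event, SUM(amount) FROM events GROUP BY event"):
        print(row)
```

The last two lines preview the next slide's point: users reference the events table by name and query it with SQL, without knowing where or how the warehouse keeps the bytes.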

Key lessons of traditional systems
Data requires the right abstraction
– Schemas have value
– Tables are easy to reason about: referenced by name, not location
Narrow interface
– SQL defines the data sources and the processing
– But not where and how the data is kept!

Hadoop ecosystem
External data sources are ingested through Sqoop + Flume into the HDFS storage layer and HBase. A MapReduce processing framework runs work expressed in Pig, HiveQL, and Mahout, with Oozie coordinating workflows, the Hive Metastore (HCatalog) holding table definitions, and Cloudera Navigator providing auditing; users come in through any of these tools.
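As a rough illustration of the "everything falls back to files" theme, the sketch below drives two of those entry points from Python. It assumes a configured Hadoop and Hive client on the PATH; the clickstream.csv file and /data/raw path are invented for the example.

```python
# Sketch: the same bytes reached through two different entry points.
# Assumes the hdfs and hive CLIs are installed and configured; paths are invented.
import subprocess

def run(cmd):
    print("$", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Entry point 1: raw file operations against the HDFS storage layer.
run(["hdfs", "dfs", "-mkdir", "-p", "/data/raw"])
run(["hdfs", "dfs", "-put", "-f", "clickstream.csv", "/data/raw/"])

# Entry point 2: the same files exposed as a table through the Hive metastore.
run(["hive", "-e",
     "CREATE EXTERNAL TABLE IF NOT EXISTS clicks "
     "(user_id STRING, url STRING, ts BIGINT) "
     "ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' "
     "LOCATION '/data/raw/'"])

# Anyone with plain HDFS access can still read or delete the underlying
# files, bypassing whatever the table-level tools assume.
run(["hdfs", "dfs", "-ls", "/data/raw"])
```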

Key challenges
The same ecosystem diagram, viewed against four pressures:
– More varied data sources, with many more access / retention requirements
– Data accessed through multiple entry points
– Lots of new consumers of the data
– One access control mechanism: files (see the sketch below)
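The last item is the crux: on a stock cluster the only native lever is HDFS file permissions, plus HDFS ACLs on newer releases. A short sketch of how coarse that lever is, using standard hdfs dfs commands against invented users, groups, and paths:

```python
# Sketch: file-level access control is the only native mechanism in HDFS.
# Assumes an hdfs client is configured; users, groups, and paths are invented.
import subprocess

def hdfs(*args):
    subprocess.run(["hdfs", "dfs", *args], check=True)

# Classic POSIX-style bits: owner / group / other, on whole files or directories.
hdfs("-chown", "-R", "etl:analytics", "/data/raw")
hdfs("-chmod", "-R", "750", "/data/raw")

# HDFS ACLs (Hadoop 2.4+) add per-user grants, but still only at file
# granularity: no notion of columns, rows, lineage, or retention rules.
hdfs("-setfacl", "-R", "-m", "user:alice:r-x", "/data/raw")
hdfs("-getfacl", "/data/raw")
```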

Steps to data management
– Provide access at the right level
– Limit the processing interfaces
– Schemas and provenance provide control
– Enforce policy (a toy sketch of one way to tie these together follows below)
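The talk stays at the level of principles, so the code below is only a toy sketch of how those four steps might fit together: one catalog entry per data set carrying schema, provenance, and policy, and a single enforcement point that every processing interface has to pass through. Every name and field here is invented for illustration.

```python
# Toy sketch (invented, not from the talk): a catalog entry per data set,
# carrying schema, provenance, and policy, checked at one choke point.
from dataclasses import dataclass

@dataclass
class DataSet:
    name: str
    schema: dict          # column name -> type
    derived_from: list    # provenance: upstream data set names
    allowed_roles: set    # policy: which roles may read this data set
    retention_days: int   # policy: how long the data may be kept

CATALOG = {
    "clicks_clean": DataSet(
        name="clicks_clean",
        schema={"user_id": "string", "url": "string", "ts": "bigint"},
        derived_from=["clicks_raw"],
        allowed_roles={"analytics"},
        retention_days=90,
    )
}

def authorize(user_roles, dataset_name):
    """Single enforcement point: every job asks here before touching data."""
    ds = CATALOG[dataset_name]
    if not (user_roles & ds.allowed_roles):
        raise PermissionError(f"{dataset_name}: access denied")
    return ds

# An analytics user gets the schema back; anyone else gets an error.
print(authorize({"analytics"}, "clicks_clean").schema)
```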

Case study: Natero
Cloud-based analytics service
– Enable business users to take advantage of big data
– UI-driven workflow creation and automation
Single shared Hadoop ecosystem
– Need customer-level isolation and user-level access controls
Goals:
– Provide the appropriate level of abstraction for our users
– Finer granularity of access control
– Enable policy enforcement
– Users shouldn't have to think about policy
Source-driven policy management

Natero application stack
The familiar Hadoop base remains: external data sources ingested through Sqoop + Flume into the HDFS storage layer and HBase, with a MapReduce processing framework driven by Pig, HiveQL, and Mahout. On top of it, Natero adds schema extraction, a policy and metadata manager, an access-aware workflow compiler, and a provenance-aware scheduler; users go through these layers rather than hitting the cluster directly.

Natero execution example
A job and its sources, defined in the Natero UI, go to the job compiler, which consults the metadata manager before handing the plan to the scheduler. The payoff: fine-grain access control, auditing, enforceable policy, and a system that stays easy for users.
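The slide only names the moving parts, so the following is a guess at the shape of that flow rather than Natero's actual code: the compiler resolves every source against the metadata manager at compile time, refuses jobs the user may not run, and the scheduler leaves an audit record behind. All class and function names are invented.

```python
# Hypothetical sketch of the UI -> compile -> check -> schedule -> audit flow.
import time

AUDIT_LOG = []

class MetadataManager:
    def __init__(self, grants):
        self.grants = grants  # data set name -> set of users allowed to read it

    def check(self, user, dataset):
        if user not in self.grants.get(dataset, set()):
            raise PermissionError(f"{user} may not read {dataset}")

class JobCompiler:
    def __init__(self, metadata):
        self.metadata = metadata

    def compile(self, user, sources, transform):
        # Access is decided at compile time, before anything touches HDFS.
        for src in sources:
            self.metadata.check(user, src)
        return {"user": user, "sources": sources, "transform": transform}

def schedule(plan):
    # A real scheduler would submit work to the cluster; here we only audit.
    AUDIT_LOG.append({"ts": time.time(), **plan})
    print("running", plan["transform"], "over", plan["sources"])

mm = MetadataManager({"clicks_clean": {"alice"}})
plan = JobCompiler(mm).compile("alice", ["clicks_clean"], "daily_rollup")
schedule(plan)
print(AUDIT_LOG)
```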

The right level of abstraction
Our abstraction comes with trade-offs
– More control, compliance
– No more raw Map-Reduce
– Possible to integrate with Pig/Hive
What's the right level of abstraction for you?
– Kinds of execution

Hadoop projects to watch
– HCatalog: data discovery / schema management / access (sketched below)
– Falcon: lifecycle management / workflow execution
– Knox: centralized access control
– Navigator: auditing / access management
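HCatalog is the one most directly tied to the schema story above: it exposes the Hive metastore's table definitions to Pig and MapReduce jobs. A small sketch of poking at it from Python through the hcat command-line tool, assuming HCatalog is installed and that a table named clicks already exists:

```python
# Sketch: data discovery and schema lookup through HCatalog's hcat CLI.
# Assumes hcat is installed and on the PATH; the table name is invented.
import subprocess

def hcat(ddl):
    result = subprocess.run(["hcat", "-e", ddl], check=True,
                            capture_output=True, text=True)
    return result.stdout

print(hcat("SHOW TABLES"))       # data discovery
print(hcat("DESCRIBE clicks"))   # schema management
```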

Lessons learned
– If you want control over your data, you also need control over data processing
– File-based access control is not enough
– Metadata is crucial
– Users aren't motivated by policy
  – Policy shouldn't get in the way of use
  – But you might get IT to reason about the sources