Enabling Scalable and HA Ingestion and Real-Time Big Data Insights for the Enterprise OCJUG, 2014.

Hadoop Ecosystem

Real-Time Insight with In-Memory ETL
[Figure: batch versus real-time processing, plotted against application complexity. Batch (Map-Reduce): intermediate files; RDBMS, EDW, MPP; data accumulation 24 hours; data processing 8 hours. Real-time (in memory, stream, event driven): seconds; complex processing.]

[Architecture diagram. Source data: Feed 1, Feed 2, Feed…, Feed 400; XML files; sensor data; social data; MS queues; databases; events. Scalable ingestion: extract, transform, load; scale out. Enterprise repositories: RDBMS, EDW, NoSQL, Hive, HDFS raw archive. Consumers/applications: business analytics, business intelligence tools, visualization tools, ad hoc query, alerts, data access service.]

Stream Processing
  A Stream is a sequence of data events with a schema.
  An Operator takes input streams and computes output streams.
  Each Operator is YOUR business logic in Java, or an operator from our library.
  An Application is a Directed Acyclic Graph (DAG) of Operators connected by Streams.
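The stream, operator, and application definitions above can be sketched in plain Java. This is a minimal illustration, not the DataTorrent API: the names StreamDagSketch, Event, Operator, FILTER_NEGATIVE, DOUBLE_VALUE, and run are all hypothetical, and the application is a linear chain of operators, which is the simplest possible DAG.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Function;

// Hypothetical illustration only; none of these names come from the DataTorrent API.
final class StreamDagSketch {
    // A data event with a fixed schema: (key, value).
    static final class Event {
        final String key;
        final int value;

        Event(String key, int value) {
            this.key = key;
            this.value = value;
        }
    }

    // An operator maps one input event to zero or more output events.
    interface Operator extends Function<Event, List<Event>> {}

    // Business logic 1: drop events whose value is negative.
    static final Operator FILTER_NEGATIVE =
        e -> e.value >= 0 ? Collections.singletonList(e) : Collections.<Event>emptyList();

    // Business logic 2: double every value.
    static final Operator DOUBLE_VALUE =
        e -> Collections.singletonList(new Event(e.key, e.value * 2));

    // Runs a linear chain of operators over a finite stream, one stage at a time.
    static List<Event> run(List<Event> stream, List<Operator> chain) {
        List<Event> current = stream;
        for (Operator op : chain) {
            List<Event> next = new ArrayList<>();
            for (Event e : current) {
                next.addAll(op.apply(e));
            }
            current = next;
        }
        return current;
    }
}
```

Passing the events (a, 1), (b, -2), (c, 3) through FILTER_NEGATIVE and then DOUBLE_VALUE leaves two events with values 2 and 6. A real streaming platform would process unbounded input and run each operator in its own container; the sketch only shows how operators compose.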

DataTorrent Hadoop GRID
[Diagram: DT Console, dtCLI, DT Gateway, Resource Manager, and four Node Managers (NM). The StrAM and operators 1 through 6 run on the node managers alongside MapReduce jobs.]

DataTorrent Platform:
  High Performance
    Real-time data ingestion
    In-memory processing
    Billions of operations per second
  Extreme Scalability
    DataTorrent automatically scales out/in to changing loads
    Sub-second latency with linear scalability
    Complex big data applications
  Mission Critical
    Built-in fault tolerance
    24/7 uptime guaranteed
    Update your application while it's running!
  Hadoop 2.0 Native
    Runs on your existing Apache Hadoop cluster.
    Develop faster and support any business logic with our open-source framework.
    Integrate seamlessly with your existing data flow.

DataTorrent YARN Interaction
  DataTorrent is a Java interface-based API
    Default implementation: platform
    Custom implementation: application development
  Platform components have various configuration properties
    Container size (Hadoop dependent)
    Operator memory
    Max number of containers
    Locality of the operators and streams
    C-Group (coming soon, Hadoop dependent)
    Static and dynamic partitioning
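The slide names static partitioning without showing how events are routed. One common approach, assumed here purely for illustration and not taken from the DataTorrent API, is to hash each event's key onto a fixed number of operator partitions. PartitionSketch, partitionFor, and split are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; the class and method names are not DataTorrent APIs.
final class PartitionSketch {
    // Chooses a partition for a key; the same key always maps to the same partition.
    static int partitionFor(String key, int partitions) {
        // floorMod keeps the result in [0, partitions) even for negative hash codes.
        return Math.floorMod(key.hashCode(), partitions);
    }

    // Splits a stream of keys into a fixed (static) number of partitions.
    static List<List<String>> split(List<String> keys, int partitions) {
        List<List<String>> out = new ArrayList<>();
        for (int i = 0; i < partitions; i++) {
            out.add(new ArrayList<>());
        }
        for (String key : keys) {
            out.get(partitionFor(key, partitions)).add(key);
        }
        return out;
    }
}
```

Dynamic partitioning would change the partition count while the application runs, which also means redistributing any per-key operator state; the sketch does not attempt that.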

Checkpointing
  Transparent, distributed, and asynchronous
  Resource requirements are directly proportional to
    Size of the state
    Frequency of checkpointing
  Most operators have a small (a few KB) state footprint
  Techniques to lower the cost
    Identify the state with minimum footprint
    Use external storage
    Incremental checkpoints
    Faster media
    Stateless operators
    Less frequent
    Disable
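The cost relationship above (state size times checkpoint frequency) can be made concrete with a small sketch. It is a simplified, synchronous, in-memory stand-in and not the platform's asynchronous, distributed mechanism; CheckpointSketch and all of its members are hypothetical names, and a byte array plays the role of durable storage.

```java
import java.nio.ByteBuffer;

// Hypothetical illustration; not the DataTorrent checkpointing API.
final class CheckpointSketch {
    private long count;            // the operator's entire state: 8 bytes
    private byte[] lastCheckpoint; // stands in for durable storage
    private final int interval;    // checkpoint once every `interval` events
    private long bytesWritten;     // total checkpoint cost so far

    CheckpointSketch(int interval) {
        this.interval = interval;
        // Initial empty checkpoint; not counted toward the cost.
        this.lastCheckpoint = ByteBuffer.allocate(Long.BYTES).putLong(0L).array();
    }

    // Processes one event and checkpoints when the interval is reached.
    void process() {
        count++;
        if (count % interval == 0) {
            lastCheckpoint = ByteBuffer.allocate(Long.BYTES).putLong(count).array();
            bytesWritten += Long.BYTES;
        }
    }

    // Simulates recovery after a failure: state rolls back to the last checkpoint.
    void recover() {
        count = ByteBuffer.wrap(lastCheckpoint).getLong();
    }

    long count() {
        return count;
    }

    long bytesWritten() {
        return bytesWritten;
    }
}
```

With 25 events and an interval of 10, the operator writes two checkpoints (16 bytes) and recovers to a count of 20. Dropping the interval to 5 writes five checkpoints (40 bytes) for the same input, which is the trade-off behind the "less frequent" technique: lower cost, but more events to replay after a failure.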

DataTorrent vs Alternatives
  Developed from the ground up to do streaming natively in Hadoop
  Relieves application developers from fault tolerance
  High performance yet resource friendly
  Linearly scalable
  Hadoop native and co-exists with other Hadoop applications
  500+ open source operators
  UI dashboard widgets
  Preferred by enterprises after trying alternatives
  Enterprise-grade support

Real-Time Fault Tolerant Use Cases
  Big Data ETL Offload
  Predictive Analytics
  Scalable Ingestion
  Operational Monitoring and Alerts
  Real-Time Business Actions
  Internet of Things
  Security

Demos
  malhar-users@googlegroups.com
  https://www.datatorrent.com/developers/