Enabling Scalable and HA Ingestion and Real-Time Big Data Insights for the Enterprise OCJUG, 2014.

Presentation transcript:


Hadoop Ecosystem

Real-Time Insight with In-Memory ETL
[Diagram labels: Batch, Map-Reduce, Intermediate Files, RDBMS, EDW, MPP, Data Accumulation 24 Hours, Data Processing 8 Hours; In Memory, Stream, Real-Time, Event Driven, Seconds, Complex Processing; Application Complexity]

Scalable Ingestion
[Architecture diagram labels: Source Data (Feed 1, Feed 2, ... Feed 400, XML Files, Sensor Data, Social Data, MS Queues, Databases); Scalable Ingestion (Extract, Transform, Load, Scale Out); Enterprise Repositories (RDBMS, EDW, NoSQL, Hive, HDFS Raw Archive); Analytics, Alerts, Events, Ad Hoc Query, Data Access Service; Visualize (Business Analytics, Business Intelligence Tools, Visualization Tools); Feed Consumers/Applications]

Stream Processing
- A Stream is a sequence of data events with a schema.
- An Operator takes input streams and computes output streams.
- Each Operator is YOUR business logic in Java, or comes from our library.
- An Application is a Directed Acyclic Graph (DAG) of operators connected by streams.
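The stream, operator, and DAG vocabulary above can be pictured with a small, self-contained Java sketch. It does not use the DataTorrent API: the `Operator` interface, the `Node` wrapper, and the `connect` method are hypothetical names invented for illustration, and events are pushed synchronously in a single thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

class StreamDagSketch {

    /** Hypothetical operator: receives one event and emits zero or more events. */
    interface Operator<I, O> {
        void process(I event, Consumer<O> output);
    }

    /** A DAG node: wraps an operator and forwards what it emits to every subscriber. */
    static final class Node<I, O> implements Consumer<I> {
        private final Operator<I, O> operator;
        private final List<Consumer<O>> downstream = new ArrayList<>();

        Node(Operator<I, O> operator) {
            this.operator = operator;
        }

        /** Adds a stream (an edge of the DAG) from this node to the next consumer. */
        void connect(Consumer<O> next) {
            downstream.add(next);
        }

        @Override
        public void accept(I event) {
            operator.process(event, out -> downstream.forEach(d -> d.accept(out)));
        }
    }

    /** Builds a small DAG: parse fans out to an even-number filter and a running sum. */
    static List<String> run(List<String> input) {
        List<String> sink = new ArrayList<>();
        int[] total = {0}; // state owned by the running-sum operator

        Node<String, Integer> parse = new Node<>((s, out) -> out.accept(Integer.parseInt(s.trim())));
        Node<Integer, String> evens = new Node<>((n, out) -> {
            if (n % 2 == 0) {
                out.accept("even=" + n);
            }
        });
        Node<Integer, String> sum = new Node<>((n, out) -> {
            total[0] += n;
            out.accept("sum=" + total[0]);
        });

        parse.connect(evens);   // stream: parse -> evens
        parse.connect(sum);     // stream: parse -> sum
        evens.connect(sink::add);
        sum.connect(sink::add);

        input.forEach(parse);   // feed the source stream into the DAG
        return sink;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("1", "2", "3"))); // [sum=1, even=2, sum=3, sum=6]
    }
}
```

Each operator only sees its own input and output, so the business logic stays independent of how the graph is wired; a real platform would additionally distribute the nodes across containers.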

DataTorrent Hadoop Grid
[Diagram labels: DT Console, dtCLI, DT Gateway, Resource Manager, four NM nodes, StrAM, numbered nodes 1 to 6 placed alongside MapReduce]

DataTorrent Platform
High Performance
- Real-time data ingestion
- In-memory processing
- Billions of operations per second
Extreme Scalability
- DataTorrent automatically scales out/in to changing loads
- Sub-second latency with linear scalability
- Complex big data applications
Mission Critical
- Built-in fault tolerance
- 24/7 uptime guaranteed
- Update your application while it's running!
Hadoop 2.0 Native
- Runs on your existing Apache Hadoop cluster.
- Develop faster and support any business logic with our open-source framework.
- Integrate seamlessly with your existing data flow.

DataTorrent YARN Interaction
- DataTorrent is a Java interface-based API
  - Default implementation: platform
  - Custom implementation: application development
- Platform components have various configuration properties
  - Container size (Hadoop dependent)
  - Operator memory
  - Max number of containers
  - Locality of the operators and streams
  - C-Group (coming soon, Hadoop dependent)
  - Static and dynamic partitioning
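As a rough illustration of the partitioning and container-limit properties listed above, the following Java sketch routes events to a fixed number of partitions by key hash (static partitioning) and recomputes the partition count from the observed load, capped by a maximum (dynamic partitioning). The class, method, and parameter names are hypothetical and are not DataTorrent configuration properties.

```java
class PartitionSketch {

    /** Static partitioning: a key always maps to the same one of `partitions` instances. */
    static int partitionFor(String key, int partitions) {
        // floorMod keeps the result non-negative even for negative hash codes.
        return Math.floorMod(key.hashCode(), partitions);
    }

    /**
     * Dynamic partitioning: choose how many partitions the current load needs,
     * never fewer than one and never more than the container cap.
     */
    static int partitionsForLoad(long eventsPerSecond, long capacityPerPartition, int maxPartitions) {
        long needed = (eventsPerSecond + capacityPerPartition - 1) / capacityPerPartition; // ceiling division
        return (int) Math.min(Math.max(1, needed), maxPartitions);
    }

    public static void main(String[] args) {
        System.out.println(partitionFor("a", 4));               // 1, because "a".hashCode() is 97
        System.out.println(partitionsForLoad(2_500, 1_000, 8)); // 3
    }
}
```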

Checkpointing
- Transparent, Distributed, and Asynchronous
- Resource requirements are directly proportional to
  - Size of the state
  - Frequency of checkpointing
- Most operators have a small (a few KB) state footprint
- Techniques to lower the cost
  - Identify the state with minimum footprint
  - Use external storage
  - Incremental checkpoints
  - Faster media
  - Stateless operators
  - Less frequent
  - Disable
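The cost relationship above (state size times checkpoint frequency) can be sketched in plain Java. This is a simplified, synchronous, in-memory illustration using standard Java serialization, not the platform's asynchronous and distributed mechanism; `CountingOperator`, `Checkpointer`, and `simulate` are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.HashMap;

class CheckpointSketch {

    /** Hypothetical stateful operator: counts how often each key was seen. */
    static final class CountingOperator {
        HashMap<String, Integer> counts = new HashMap<>();

        void process(String key) {
            counts.merge(key, 1, Integer::sum);
        }
    }

    /** Snapshots operator state every `interval` windows; an interval of 0 disables checkpointing. */
    static final class Checkpointer {
        private final int interval;
        private byte[] lastSnapshot;
        private long bytesWritten;

        Checkpointer(int interval) {
            this.interval = interval;
        }

        void endWindow(long windowId, CountingOperator op) {
            if (interval <= 0 || windowId % interval != 0) {
                return;
            }
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
                out.writeObject(op.counts);
            } catch (IOException e) {
                throw new IllegalStateException("checkpoint failed", e);
            }
            lastSnapshot = buffer.toByteArray();
            bytesWritten += lastSnapshot.length; // cost grows with state size and frequency
        }

        /** Rebuilds an operator from the last snapshot, or an empty one if none exists. */
        @SuppressWarnings("unchecked")
        CountingOperator restore() {
            CountingOperator op = new CountingOperator();
            if (lastSnapshot == null) {
                return op;
            }
            try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(lastSnapshot))) {
                op.counts = (HashMap<String, Integer>) in.readObject();
            } catch (IOException | ClassNotFoundException e) {
                throw new IllegalStateException("restore failed", e);
            }
            return op;
        }

        long bytesWritten() {
            return bytesWritten;
        }
    }

    /** Processes one event per window (window ids start at 1) and returns the checkpointer. */
    static Checkpointer simulate(String[] events, int interval) {
        CountingOperator op = new CountingOperator();
        Checkpointer checkpointer = new Checkpointer(interval);
        for (int i = 0; i < events.length; i++) {
            op.process(events[i]);
            checkpointer.endWindow(i + 1, op);
        }
        return checkpointer;
    }

    public static void main(String[] args) {
        String[] events = {"a", "b", "a", "c", "a"};
        // With interval 2 the last checkpoint is taken after window 4, so the fifth event is lost on restore.
        System.out.println(simulate(events, 2).restore().counts);
        System.out.println(simulate(events, 1).bytesWritten() > simulate(events, 2).bytesWritten());
    }
}
```

Checkpointing less often writes fewer bytes but replays more work after a failure, which is the trade-off behind the "less frequent" and "disable" options.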

DataTorrent vs Alternatives
- Developed from the ground up to do streaming natively in Hadoop
- Relieves application developers from fault tolerance
- High performance yet resource friendly
- Linearly scalable
- Hadoop native and co-exists with other Hadoop applications
- 500+ open source operators
- UI dashboard widgets
- Preferred by enterprises after trying alternatives
- Enterprise-grade support

Real-Time Fault Tolerant Use Cases
- Big Data ETL Offload
- Predictive Analytics
- Scalable Ingestion
- Operational Monitoring and Alerts
- Real-Time Business Actions
- Internet of Things
- Security

Demos
malhar-users@googlegroups.com
https://www.datatorrent.com/developers/