Guavus Confidential – Do Not Distribute © 2014 Guavus, Inc. All rights reserved. Eric Carr (VP Core Systems Group) Spark Summit 2014 BUILDING BIG DATA.

Presentation transcript:

BUILDING BIG DATA OPERATIONAL INTELLIGENCE PLATFORM WITH APACHE SPARK
Eric Carr (VP Core Systems Group), Spark Summit 2014

Market & Technology Imperatives: Communication Service Providers & Big Data Analytics

Industry Context for Communication Service Providers (CSPs)
Big Data is at the heart of two core strategies for CSPs:
– Improve current revenue sources through greater operational efficiencies
– Create new revenue sources with the 4th wave
Source: Chetan Consulting

CSPs - Industry Value Chain Shift
Source: asymco

CSPs - A High Bar for Operational Intelligence
CSPs require solutions engineered to meet very stringent requirements

Different Platforms Target Different Questions
[Figure: platforms positioned by type of data (Data Lake, Data Streams) and type of analysis (Discovery, Decisioning). Platforms shown: Data Warehouses + Business Intelligence; Search Centric; Stream Analytics + Operational Intelligence.]

Streaming Analytics & Machine Learning to Action

Driving Streaming Analytics to Action
[Figure: progression toward Operational Intelligence]
Stages: Network Flow Analytics; Usage Awareness; Operational Interactions; Real-Time Actions
Capabilities shown: NetFlow, Routing Planning; Layer 7 Visibility; Content & CDN Analytics; Policy Profile Triggers; Care & Experience Mgmt; New Service Creation & Monetization; Small Cell / RAN / Backhaul Differentiation; SON / SDN / Virtualization; Content Optimization

Reflex 1.0 Pipeline – Timely Cube Reporting
Components: Collector; Hadoop Compute; Analytics Store; RAM Cache; UI
Reflex 1.5 Pipeline – Spark / YARN
Components: Collector; Spark / YARN; Analytics Store; RAM Cache; UI
Reflex 2.0 Pipeline – Spark Streaming core
Components: Collector; Msg Queue; Spark / YARN / HDFS2; ML; Store; SQL; Streams; Cache; UI

Stream Engine - Operational Intelligence Analytics
Inputs: Data Streams; Contextual Data Stores
Feature Engine: Data fusion; Metrics creation; Variables selection; Causality inference
Anomaly Engine: Multivariate analysis; Outlier detection
RCA Engine: Statistical learning; Clustering; Pattern identification; Item set mining
Output: Targeted Actions
– Optimized algorithmic support for common stream data processing & fusions
– Detect unusual events occurring in stream(s) of data. Once detected, isolate root cause
– Anomaly / outlier detection, commonalities / root cause forensics, prediction / forecasting, actions / alerts / notifications
– Record enrichment capabilities, e.g. URL categorization, device id, etc.
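
The slide names outlier detection on data streams without saying how it is done. As a minimal sketch only, and not the method used in the product, the following flags points that deviate strongly from a rolling window of recent values. The function name, window size and threshold are all assumptions chosen for illustration.

```python
from collections import deque
from math import sqrt


def rolling_zscore_outliers(values, window=5, threshold=3.0):
    """Return the indices of values that lie more than `threshold`
    standard deviations from the mean of the preceding `window` values.

    Illustrative only: a simple univariate z-score test, not the
    multivariate analysis named on the slide.
    """
    history = deque(maxlen=window)  # most recent `window` observations
    outliers = []
    for i, v in enumerate(values):
        if len(history) == window:
            mean = sum(history) / window
            std = sqrt(sum((h - mean) ** 2 for h in history) / window)
            if std > 0 and abs(v - mean) > threshold * std:
                outliers.append(i)
        history.append(v)
    return outliers


# A steady metric with one spike at index 6.
metric = [10, 11, 10, 11, 10, 11, 50, 10, 11]
print(rolling_zscore_outliers(metric))
```

Note that the spike itself enters the window afterwards and inflates the standard deviation, so points right after an anomaly are judged more leniently; a production detector would need to handle that.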

Example - State of the art causality analysis
Metrics by relationship type and time dependency:
Random variables (time independent)
– Linear: Correlation
– Linear + non-linear: Maximal Information Coefficient
Stochastic processes (time dependent)
– Linear: Granger Causality
– Linear + non-linear: Transfer Entropy
Ranking of metrics: 1) Transfer Entropy 2) Maximal Information Coefficient, Granger Causality 3) Correlation
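
To illustrate why the slide separates linear from non-linear measures, the sketch below compares Pearson correlation with plain discrete mutual information on a purely quadratic relationship. This is an assumption-laden toy: it computes ordinary mutual information, not the Maximal Information Coefficient itself, and the function names are made up for the example.

```python
from collections import Counter
from math import log2, sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def mutual_information(xs, ys):
    """Plug-in mutual information (in bits) of two discrete sequences."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum(
        (c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )


# y depends on x exactly, but not linearly.
xs = [-2, -1, 0, 1, 2]
ys = [x * x for x in xs]

print(pearson(xs, ys))             # no linear relationship is detected
print(mutual_information(xs, ys))  # clearly positive: y is determined by x
```

Correlation reports nothing here even though y is a deterministic function of x, which is the gap that mutual-information-based measures are meant to close.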

Example - Causality Techniques
Transfer entropy from a process X to another process Y is the amount of uncertainty reduced in future values of Y by knowing the past values of X, given the past values of Y.
CONS
– Challenging joint probability estimation
– Large amount of data needed for calculation
– Choice of time lags
Maximal Information Coefficient is a measure of the strength of the relationship between two random variables, based on mutual information. The methodology for empirical estimation is based on maximizing the mutual information over a set of grids.
CONS
– No time information
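
The two verbal definitions above correspond to standard information-theoretic formulas, written out here for reference. The history lengths k and l and the notation Y_t^{(k)} for the last k values of Y are notation introduced for this note, not taken from the slides.

```latex
\begin{align*}
% Transfer entropy from X to Y: reduction in uncertainty about Y_{t+1}
% from adding the past of X to the past of Y.
T_{X \to Y}
  &= H\!\left(Y_{t+1} \mid Y_t^{(k)}\right)
   - H\!\left(Y_{t+1} \mid Y_t^{(k)}, X_t^{(l)}\right) \\
  &= \sum p\!\left(y_{t+1}, y_t^{(k)}, x_t^{(l)}\right)
     \log \frac{p\!\left(y_{t+1} \mid y_t^{(k)}, x_t^{(l)}\right)}
               {p\!\left(y_{t+1} \mid y_t^{(k)}\right)} \\[1ex]
% Mutual information, the quantity MIC maximizes over grid partitions.
I(X;Y)
  &= \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)}
\end{align*}
```

The listed drawbacks follow from the first formula: the sum runs over a joint distribution of k + l + 1 variables, which is hard to estimate and needs a lot of data, and k and l are the time lags that must be chosen. The second formula contains no time index at all, hence "no time information" for MIC.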

Network Operations / Care Example: Identifying Commonalities, Anomalies, RCA
[Figure panels: Anomaly Detection; Root-Cause Analysis; Event Drivers; Event Chaining]

BinStream Details

Use Case
Use IP traffic records to calculate bandwidth usage as a time series (continuously …). This cannot be done based on the time the records are received by Spark.
– In general, for any record that has a timestamp, it is important to analyze based on the time of the event rather than the time the event record is received.
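
A minimal sketch of the idea in plain Python (no Spark): usage is grouped by the timestamp carried in each record, so the order in which records arrive does not matter. The record layout `(event_ts_seconds, bytes)`, the function name and the 300-second slot are assumptions made for the example.

```python
from collections import defaultdict

BIN_SECONDS = 300  # assumed slot width for illustration


def bandwidth_by_event_time(records, bin_seconds=BIN_SECONDS):
    """Sum bytes per time slot, keyed by the slot's start time.

    `records` is an iterable of (event_ts_seconds, bytes) pairs in
    arrival order; the slot is derived from the event timestamp only.
    """
    usage = defaultdict(int)
    for event_ts, nbytes in records:
        bin_start = (event_ts // bin_seconds) * bin_seconds
        usage[bin_start] += nbytes
    return dict(usage)


# The third record arrives last but belongs to the first slot.
arrivals = [(10, 100), (610, 50), (290, 25)]
print(bandwidth_by_event_time(arrivals))  # {0: 125, 600: 50}
```

Grouping by arrival time instead would have credited the late record to whichever batch happened to contain it.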

Challenges With
For one dataset: make the timestamp part of the key.
For continuously streamed datasets:
– You do not know whether you have received all data for a particular time slot. Caused by event delay or event duration.
– An event could span multiple time slots. Caused by event duration.

Data Processing – Timing (Map-Reduce world)
[Figure: timeline over the k th, (k+1) th and (k+2) th hours]
Pipeline stages: Collector / Adapter; Collector / Writer; MR Jobs / Cube Generator; MR Job / Cube Exporter; Store (columnar storage)
Data at each stage: Events; Bins; HDFS data files; HDFS cubes; available for visualization
Bin states: Past; Current; Future; Closed
Timing notes:
– Bin interval 5 mins
– Write on bin interval or size
– Job duration mins
– Delayed 2x bin interval (10 mins) to allow last bin to be closed
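
The bin states and the closing delay on this slide can be sketched as a small function. The 5-minute interval and the 2x delay come from the slide; measuring the delay from the end of the bin, and the exact boundaries between states, are assumptions made here for illustration.

```python
BIN_INTERVAL = 5 * 60           # 5-minute bins, from the slide
CLOSE_DELAY = 2 * BIN_INTERVAL  # wait 2x the bin interval before closing


def bin_state(bin_start, now):
    """Classify a bin relative to the current time (both in seconds).

    Assumption: the close delay is counted from the end of the bin.
    """
    bin_end = bin_start + BIN_INTERVAL
    if now < bin_start:
        return "future"
    if now < bin_end:
        return "current"
    if now < bin_end + CLOSE_DELAY:
        return "past"    # still accepting late events
    return "closed"      # safe to hand to the cube-generation job


print(bin_state(0, 100))  # current
print(bin_state(0, 600))  # past
print(bin_state(0, 900))  # closed
```

Under these assumptions a bin only becomes available for downstream processing 15 minutes after it starts, which is the latency the later slides try to reduce.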

Proposed Binning & Proration Solution

Solution (contd.)
The typical solutions are:
– For event delay: wait (buffer).
– For event duration: prorate events across time slots.
Introduce the concept of a BinStream: an abstraction over the DStream, which needs a
– Function to extract the time fields from the records.
– Function to prorate the records.
Note that this can be trivially achieved by using the ‘window’ functionality, by having the batch interval equal to the time series interval and the window size equal to the maximum possible delay.
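
The two functions a BinStream needs can be sketched in plain Python as follows. This is not the actual BinStream API, which the slides do not show: the class name `BinStreamSketch`, its methods, the record fields and the proportional-overlap proration rule are all hypothetical choices for illustration.

```python
from collections import defaultdict


def prorate(start, end, value, bin_seconds):
    """Split `value` across the bins overlapped by [start, end),
    in proportion to the time spent in each bin."""
    if end <= start:  # instantaneous event: everything goes to one bin
        return {(start // bin_seconds) * bin_seconds: value}
    shares = {}
    duration = end - start
    bin_start = (start // bin_seconds) * bin_seconds
    while bin_start < end:
        overlap = min(end, bin_start + bin_seconds) - max(start, bin_start)
        shares[bin_start] = value * overlap / duration
        bin_start += bin_seconds
    return shares


class BinStreamSketch:
    """Hypothetical stand-in for a BinStream: accumulates prorated
    values per time slot across successive batches."""

    def __init__(self, extract_times, prorate_fn, bin_seconds=300):
        self.extract_times = extract_times  # record -> (start, end, value)
        self.prorate_fn = prorate_fn
        self.bin_seconds = bin_seconds
        self.bins = defaultdict(float)

    def process_batch(self, records):
        for record in records:
            start, end, value = self.extract_times(record)
            shares = self.prorate_fn(start, end, value, self.bin_seconds)
            for bin_start, share in shares.items():
                self.bins[bin_start] += share
        return dict(self.bins)


stream = BinStreamSketch(
    extract_times=lambda r: (r["start"], r["end"], r["bytes"]),
    prorate_fn=prorate,
)
# A flow lasting from t=240s to t=360s straddles two 300-second slots.
print(stream.process_batch([{"start": 240, "end": 360, "bytes": 1200}]))
```

Passing the extractor and the proration rule in as functions mirrors the slide's description and keeps the binning logic independent of any particular record format.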

Problems / Solutions
window == wait & buffer.
– This has two issues:
A. Memory is needed for buffering.
B. Downstream needs to wait for the result (or any part of it).
BinStream provides two additional options:
– It gets rid of the delay in getting partial results by regularly sending the latest snapshots for the old time slots. This does not solve the memory issue, and it increases the processing load.
– If the client can handle partial results, i.e. if it can aggregate partial results, it can get updates to the old bins. This reduces the memory needed by the spark-streaming application.
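
The second option relies on the client being able to aggregate partial results. A minimal sketch of such a client, assuming each update is a mapping from bin start time to an additive delta (the function name and update format are assumptions, not part of the talk):

```python
def apply_updates(totals, updates):
    """Merge per-bin partial results (additive deltas) into the
    client's running totals and return them."""
    for bin_start, delta in updates.items():
        totals[bin_start] = totals.get(bin_start, 0) + delta
    return totals


totals = {}
apply_updates(totals, {0: 100, 300: 40})  # first batch
apply_updates(totals, {0: 25})            # late data for an old bin
print(totals)  # {0: 125, 300: 40}
```

Because the streaming job only emits deltas and the client holds the running totals, the job does not have to keep every old bin buffered. This only works for aggregates that can be merged incrementally, such as sums and counts.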

Limitations
The number of time series slots for which updates can be generated is fixed; it is governed by the event delay characteristics.
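
One way to read this limitation is that the number of updatable slots follows from the maximum expected delay and the slot width. The relation below is an assumption made for illustration, not a formula from the talk, and both function names are hypothetical.

```python
from math import ceil


def open_slot_count(max_delay_seconds, bin_seconds):
    """Assumed sizing rule: the current slot plus enough older slots
    to cover the maximum expected event delay."""
    return ceil(max_delay_seconds / bin_seconds) + 1


def accepts(event_ts, now, max_delay_seconds, bin_seconds):
    """True if the event's slot is still among the updatable slots."""
    age_in_slots = now // bin_seconds - event_ts // bin_seconds
    return 0 <= age_in_slots < open_slot_count(max_delay_seconds, bin_seconds)


print(open_slot_count(600, 300))    # 3
print(accepts(0, 700, 600, 300))    # True: two slots old
print(accepts(0, 900, 600, 300))    # False: the slot has aged out
```

Events that arrive later than the configured delay fall outside the fixed set of slots and cannot update their bin, which is the trade-off the slide points out.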

THANK YOU!