Building Scalable Big Data Infrastructure Using Open Source Software
Sam William

What is StumbleUpon?
Help users find content they did not expect to find
The best way to discover new and interesting things from across the Web

How StumbleUpon works
1. Register
2. Tell us your interests
3. Start Stumbling and rating web pages
We use your interests and behavior to recommend new content for you!

StumbleUpon

The Data Challenge
1. Data collection
2. Real-time metrics
3. Batch processing / ETL
4. Data warehousing & ad-hoc analysis
5. Business intelligence & reporting

Challenges in data collection
Different services deployed on different tech stacks
  – Site: Apache / PHP
  – Rec / Ad Server: Scala / Finagle
  – Other internal services: Java / Scala / PHP
Add minimal latency to the production services
Application DBs for analytics / batch processing
  – From HBase & MySQL

Data Processing and Warehousing
Raw Data → ETL → Warehouse (HDFS) tables → massively denormalized tables
Challenges/Requirements:
  – Scale over 100 TBs of data
  – End product works with easy querying tools/languages
  – Reliable and scalable: powers analytics and internal reporting

Real-time analytics and metrics
Atomic counters
Tracking product launches
Monitoring the health of the site
Latency – live metrics make sense
A/B tests

Open Source at SU

Data Collection at SU

Activity Streams and Logs
All messages are Protocol Buffers
  – Fast and efficient
  – Multiple language bindings (Java / C++ / PHP)
  – Compact
  – Very well documented
  – Extensible
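One reason Protocol Buffers messages are compact is that integers on the wire use a variable-length (base-128 "varint") encoding, so small values take fewer bytes. The sketch below is a hand-written illustration of that encoding for non-negative integers only; it is not the protobuf library's API, and the function names are hypothetical.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint.

    Each byte carries 7 payload bits, least-significant group first.
    The high bit of a byte is set when more bytes follow.
    """
    if value < 0:
        raise ValueError("this sketch handles non-negative integers only")
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(low7)         # last byte: continuation bit clear
            return bytes(out)


def decode_varint(data: bytes) -> int:
    """Decode the first varint found at the start of `data`."""
    result = 0
    shift = 0
    for byte in data:
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result
        shift += 7
    raise ValueError("truncated varint")


if __name__ == "__main__":
    for n in (1, 300, 3500):
        print(n, encode_varint(n).hex())
```

With this encoding the value 300 occupies two bytes rather than a fixed four or eight.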

Apache Kafka
Distributed pub-sub system (developed at LinkedIn)
Offers message persistence
Very high throughput
  – ~300K messages/sec
Horizontally scalable
Multiple subscribers for topics
  – Easy to rewind

Kafka
Near-real-time processing can be taken offline and done at the consumer level
Semantic partitioning through topics
Partitions for parallel consumption
High-level consumer API using ZK
Simple to deploy: only requires ZooKeeper
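The persistence, multiple-subscriber and rewind properties listed on the two Kafka slides all follow from one idea: a topic is an append-only log, and each subscriber tracks its own read offset. The sketch below is an in-memory model of that idea only. It is not the Kafka client API, it does not model partitions or brokers, and the class and subscriber names are hypothetical.

```python
from collections import defaultdict


class TopicLog:
    """Append-only log for one topic; every message gets a sequential offset."""

    def __init__(self):
        self._messages = []
        # Each subscriber has its own "next offset to read", so readers
        # never interfere with one another.
        self._positions = defaultdict(int)

    def publish(self, message: bytes) -> int:
        """Append a message and return its offset."""
        self._messages.append(message)
        return len(self._messages) - 1

    def poll(self, subscriber: str, max_messages: int = 10) -> list:
        """Return the next batch for this subscriber and advance its offset."""
        start = self._positions[subscriber]
        batch = self._messages[start:start + max_messages]
        self._positions[subscriber] = start + len(batch)
        return batch

    def rewind(self, subscriber: str, offset: int = 0) -> None:
        """Move a subscriber back (or forward) to an arbitrary retained offset."""
        if not 0 <= offset <= len(self._messages):
            raise ValueError("offset outside the retained log")
        self._positions[subscriber] = offset


if __name__ == "__main__":
    log = TopicLog()
    for payload in (b"a", b"b", b"c"):
        log.publish(payload)
    print(log.poll("hdfs-writer"))   # reads all three messages
    print(log.poll("billing"))       # an independent reader sees them too
    log.rewind("billing", 1)
    print(log.poll("billing"))       # re-reads from offset 1
```

Because messages stay in the log after being read, a consumer that was offline can catch up later, which is what lets near-real-time work move to the consumer side.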

Kafka at SU
4 broker nodes with RAID10 disks
25 topics
Peak of 3500 msg/s
350 bytes avg. message size
30 days of data retention
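The figures on this slide give a rough upper bound on ingest bandwidth and retained volume. The arithmetic below assumes, pessimistically, that the peak rate is sustained around the clock, and it counts a single copy of the data (no replication or RAID overhead), so real numbers would differ.

```python
PEAK_MSGS_PER_SEC = 3500      # from the slide
AVG_MSG_BYTES = 350           # from the slide
RETENTION_DAYS = 30           # from the slide
SECONDS_PER_DAY = 86_400

# Assumption: the peak rate holds all day, which overstates the real load.
peak_bytes_per_sec = PEAK_MSGS_PER_SEC * AVG_MSG_BYTES
bytes_per_day_at_peak = peak_bytes_per_sec * SECONDS_PER_DAY
retained_bytes_at_peak = bytes_per_day_at_peak * RETENTION_DAYS

print(f"peak ingest:      {peak_bytes_per_sec / 1e6:.3f} MB/s")
print(f"per day at peak:  {bytes_per_day_at_peak / 1e9:.2f} GB")
print(f"30-day retention: {retained_bytes_at_peak / 1e12:.2f} TB")
```

Under those assumptions the bound works out to about 1.2 MB/s, roughly 106 GB per day, and a little over 3 TB for the 30-day window.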

Sutro
Scala/Finagle
Generic Kafka message producer
Deployed on all prod servers
Local HTTP daemon
Publishes to Kafka asynchronously
Snowflake to generate unique IDs
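The slide does not say how Sutro uses Snowflake, so the following is only a sketch of the commonly described Snowflake ID layout: a millisecond timestamp in the high bits, then a 10-bit worker ID, then a 12-bit per-millisecond sequence. The class name, the custom epoch value and the injectable clock are assumptions made for illustration.

```python
import threading
import time

# Assumed custom epoch (2010-01-01T00:00:00Z); any fixed past instant works.
CUSTOM_EPOCH_MS = 1_262_304_000_000


class SnowflakeSketch:
    WORKER_BITS = 10
    SEQUENCE_BITS = 12

    def __init__(self, worker_id: int, clock_ms=None):
        if not 0 <= worker_id < (1 << self.WORKER_BITS):
            raise ValueError("worker_id must fit in 10 bits")
        self.worker_id = worker_id
        # The clock is injectable so the generator can be tested deterministically.
        self._clock_ms = clock_ms or (lambda: int(time.time() * 1000))
        self._last_ms = -1
        self._sequence = 0
        self._lock = threading.Lock()

    def next_id(self) -> int:
        with self._lock:
            now = self._clock_ms()
            if now < self._last_ms:
                raise RuntimeError("clock moved backwards")
            if now == self._last_ms:
                mask = (1 << self.SEQUENCE_BITS) - 1
                self._sequence = (self._sequence + 1) & mask
                if self._sequence == 0:
                    # Sequence exhausted for this millisecond: wait for the next one.
                    while now <= self._last_ms:
                        now = self._clock_ms()
            else:
                self._sequence = 0
            self._last_ms = now
            timestamp = now - CUSTOM_EPOCH_MS
            return (
                (timestamp << (self.WORKER_BITS + self.SEQUENCE_BITS))
                | (self.worker_id << self.SEQUENCE_BITS)
                | self._sequence
            )


if __name__ == "__main__":
    gen = SnowflakeSketch(worker_id=7)
    print([gen.next_id() for _ in range(3)])
```

IDs built this way sort roughly by creation time and need no coordination between producers, as long as each producer has a distinct worker ID.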

Sutro - Kafka
Diagram: each service (Site - Apache/PHP, Ad Server - Scala/Finagle, Rec Server - Scala/Finagle, Other Services) has a local Sutro that publishes to the Kafka broker.

Application Data for Analytics & Batch Processing
HBase
  – HBase inter-cluster replication (from production to batch cluster)
  – Near-real-time sync on batch cluster
  – Readily available in Hive for analysis
MySQL
  – MySQL replication to batch DB servers
  – Sqoop incremental data transfer to HDFS
  – HDFS flat files mapped to Hive tables & made available for analysis
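Incremental transfer, as in the Sqoop step on the MySQL side, generally means remembering the highest value of a monotonically increasing column and fetching only rows beyond it on the next run (Sqoop's append mode exposes this through its check-column and last-value options). The sketch below shows the idea with an in-memory SQLite table standing in for MySQL; the `ratings` table and its columns are hypothetical, and this is not how Sqoop itself is invoked.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """Create an in-memory stand-in for a source table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE ratings (id INTEGER PRIMARY KEY, url TEXT, rating INTEGER)"
    )
    return conn


def fetch_new_rows(conn: sqlite3.Connection, last_value: int):
    """Return rows with id > last_value, plus the new high-water mark."""
    rows = conn.execute(
        "SELECT id, url, rating FROM ratings WHERE id > ? ORDER BY id",
        (last_value,),
    ).fetchall()
    new_last_value = rows[-1][0] if rows else last_value
    return rows, new_last_value


if __name__ == "__main__":
    conn = make_demo_db()
    conn.executemany(
        "INSERT INTO ratings (url, rating) VALUES (?, ?)",
        [("a.example", 1), ("b.example", 0)],
    )
    rows, last = fetch_new_rows(conn, 0)       # first run: everything
    print(rows, last)
    conn.execute("INSERT INTO ratings (url, rating) VALUES (?, ?)", ("c.example", 1))
    rows, last = fetch_new_rows(conn, last)    # later run: only the new row
    print(rows, last)
```

The high-water mark has to be stored between runs; a run that finds no new rows leaves it unchanged.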

Real-time metrics
1. HBase – atomic counters
2. Asynchbase – coalesced counter increments
3. OpenTSDB (developed at SU)
  – A distributed time-series DB on HBase
  – Collects over 2 billion data points a day
  – Plotting time-series graphs
  – Tagged data points
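Coalescing counter increments means buffering many small increments to the same key on the client and sending one summed increment to the store, which cuts the number of round trips. The slide gives no detail beyond the name, so the sketch below only illustrates the idea: it is not Asynchbase's API, the class and key names are hypothetical, and thread safety and time-based flushing are left out.

```python
from collections import Counter


class CoalescingCounter:
    """Buffer increments per key and flush one summed increment per key."""

    def __init__(self, flush_fn, max_pending: int = 1000):
        # flush_fn(key, total) stands in for the real atomic-increment call.
        self._flush_fn = flush_fn
        self._max_pending = max_pending
        self._pending = Counter()
        self._buffered = 0

    def increment(self, key: str, amount: int = 1) -> None:
        self._pending[key] += amount
        self._buffered += 1
        if self._buffered >= self._max_pending:
            self.flush()

    def flush(self) -> None:
        """Send one increment per key, then clear the buffer."""
        for key, total in self._pending.items():
            self._flush_fn(key, total)
        self._pending.clear()
        self._buffered = 0


if __name__ == "__main__":
    sent = []
    counter = CoalescingCounter(lambda key, total: sent.append((key, total)))
    for _ in range(5):
        counter.increment("url:123:likes")
    counter.flush()
    print(sent)  # one call carrying the sum of five increments
```

The trade-off is that buffered increments are lost if the client dies before a flush, so the buffer size bounds how much can go missing.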

Real-time counters
Figure: real-time metrics from OpenTSDB

Kafka consumer framework, aka Postie
Distributed system for consuming messages
Scala/Akka, on top of Kafka's consumer API
Generic consumer: understands protobuf
Predefined sinks
  – HBase / HDFS (Text/Binary) / Redis
Consumers configured with configuration files
Distributed: uses ZK to coordinate
Extensible
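A consumer that is "configured with configuration files" and writes to "predefined sinks" can be pictured as a lookup from topic to a list of sink objects. The sketch below shows that shape only. Postie itself is Scala/Akka and its configuration format is not given in the slides, so the config layout, topic names and `ListSink` class here are all hypothetical.

```python
class ListSink:
    """Stand-in for a predefined sink (HBase, HDFS, Redis); it just records writes."""

    def __init__(self, name: str):
        self.name = name
        self.received = []

    def write(self, message: bytes) -> None:
        self.received.append(message)


def build_dispatcher(config: dict, sinks: dict):
    """Turn a topic -> [sink names] mapping into a dispatch function."""
    routes = {}
    for topic, sink_names in config.items():
        # An unknown sink name raises KeyError here, at start-up,
        # rather than when the first message arrives.
        routes[topic] = [sinks[name] for name in sink_names]

    def dispatch(topic: str, message: bytes) -> int:
        targets = routes.get(topic, [])
        for sink in targets:
            sink.write(message)
        return len(targets)

    return dispatch


if __name__ == "__main__":
    sinks = {"hdfs": ListSink("hdfs"), "redis": ListSink("redis")}
    config = {"page_views": ["hdfs", "redis"], "ad_clicks": ["hdfs"]}
    dispatch = build_dispatcher(config, sinks)
    dispatch("page_views", b"view-1")
    dispatch("ad_clicks", b"click-1")
    print(len(sinks["hdfs"].received), len(sinks["redis"].received))
```

Adding a new consumer then becomes a configuration change rather than new code, provided the sink type already exists.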

Postie

Akka
Building concurrent applications made easy!
The distributed nodes are behind Remote Actors
Load balancing through custom Routers
The predefined sinks and services are accessed through local actors
Fault tolerance through actor monitoring

Postie

Batch processing / ETL
Goal: create simplified data sets from complex data
Create highly denormalized data sets for faster querying
Power the reporting DB with daily stats
Output structured data for specific analysis
  – e.g. registration flow analysis
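Denormalizing for faster querying means doing the joins once, at ETL time, and storing wide rows that already carry the attributes analysts filter on. The toy example below shows the transformation on in-memory records; the field names (`user_id`, `country`, `topic` and so on) are invented for illustration and are not StumbleUpon's schema.

```python
def denormalize(ratings: list, users: dict, urls: dict) -> list:
    """Join each rating with its user and URL attributes into one wide row."""
    wide_rows = []
    for rating in ratings:
        user = users[rating["user_id"]]
        url = urls[rating["url_id"]]
        wide_rows.append({
            "user_id": rating["user_id"],
            "country": user["country"],   # copied from the user table
            "url_id": rating["url_id"],
            "topic": url["topic"],        # copied from the URL table
            "liked": rating["liked"],
        })
    return wide_rows


if __name__ == "__main__":
    users = {1: {"country": "US"}, 2: {"country": "DE"}}
    urls = {10: {"topic": "photography"}, 11: {"topic": "cooking"}}
    ratings = [
        {"user_id": 1, "url_id": 10, "liked": True},
        {"user_id": 2, "url_id": 11, "liked": False},
    ]
    for row in denormalize(ratings, users, urls):
        print(row)
```

A query such as "likes per country per topic" can then scan one table with no joins, at the cost of storing the copied attributes on every row.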

Our favourite ETL tools
Pig
  – Optional schema
  – Works on byte arrays
  – Many simple operations can be done without UDFs
  – Developing UDFs is simple (understand Bags/Tuples)
  – Concise scripts compared to the M/R equivalents
Scalding
  – Functional programming in Scala over Hadoop
  – Built on top of Cascading
  – Operating over tuples is like operating over collections in Scala
  – No UDFs: your entire program is in a full-fledged general-purpose language

Warehouse - Hive
Uses a SQL-like query language
All analysts and data scientists are versed in SQL
Supports Hadoop Streaming (Python/R)
UDFs and SerDes make it highly extensible
Supports partitioning, with many table properties configurable at the partition level

Hive at StumbleUpon
HBaseSerde
  – Reads binary data from HBase
  – Parses composite binary values into multiple columns in Hive (mainly on key)
ProtobufSerde
  – For creating Hive tables on top of binary protobuf files stored in HDFS
  – Serde uses Java reflection to parse and project columns
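Parsing a composite binary value into multiple columns amounts to slicing fixed-width fields out of a byte string. The slides do not describe StumbleUpon's key layouts, so the sketch below assumes a hypothetical 12-byte row key made of a big-endian 8-byte user ID followed by a 4-byte timestamp; a real SerDe would be written in Java against Hive's interfaces.

```python
import struct

# Hypothetical layout: big-endian unsigned 64-bit user id + unsigned 32-bit timestamp.
KEY_STRUCT = struct.Struct(">QI")


def parse_composite_key(raw: bytes) -> dict:
    """Split a fixed-width binary row key into named column values."""
    if len(raw) != KEY_STRUCT.size:
        raise ValueError(f"expected {KEY_STRUCT.size} bytes, got {len(raw)}")
    user_id, event_ts = KEY_STRUCT.unpack(raw)
    return {"user_id": user_id, "event_ts": event_ts}


if __name__ == "__main__":
    raw_key = KEY_STRUCT.pack(42, 1_350_000_000)
    print(parse_composite_key(raw_key))
```

Big-endian fixed-width fields keep the byte-wise sort order of keys aligned with the numeric order of the leading field, which matters for range scans in a sorted store.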

Data Infrastructure at SU

Data Consumption

End Users of Data
Data Pipeline → Warehouse
Who uses this data?
  – Data scientists / analysts
  – Offline Rec pipeline
  – Ads team
… all this work allows them to focus on querying and analysis, which is critical to the business.

Business Analytics / Data Scientists
Feature-rich set of data to work on
Enriched/denormalized tables reduce JOINs and simplify and speed up queries, shortening the path to analysis
R: our favorite tool for analysis post Hadoop/Hive

Recommendation platform
URL score pipeline
  – M/R and Hive on Oozie
  – Filter / classify into buckets
  – Score / loop
  – Load ES/HBase index
Keyword generation pipeline
  – Parse URL data
  – Generate tag mappings

URL score pipeline

Advertisement Platform
Billing service
  – RT Kafka consumer
  – Calculates skips
  – Bills customers
Audience estimation tool
  – Pre-crunched data into multiple dimensions
  – A UI tool for advertisers to estimate target audience
Sales team tools
  – Built with PHP leveraging Hive or pre-crunched ETL data in HBase

More stuff on the pipeline
Storm from Twitter
  – Scope for a lot more real-time analytics
  – Very high throughput and extensible
  – Applications in JVM languages
BI tools
  – Our current BI tools / dashboards are minimal
  – Google Charts powered by our reporting DB (HBase primarily)

Open Source FTW!!
Actively developed and maintained
Community support
Built with web-scale in mind
Distributed systems – easy with Akka/ZK/Finagle
Inexpensive
Only one major catch!!
  – Hire and retain good engineers!!

Thank You! Questions?