©2014 LinkedIn Corporation. All Rights Reserved. Gobblin’ Big Data with Ease. Lin Qiao, Data Analytics, LinkedIn.



Overview: Challenges. What does Gobblin provide? How does Gobblin work? Retrospective and lookahead.


Perception

Reality

Challenges at LinkedIn: a large variety of data sources; multiple paradigms (streaming data and batch data); different types of data (facts, dimensions, logs, snapshots, increments, changelogs); operational complexity of running multiple pipelines; data quality; data availability and predictability; engineering cost.

Open source solutions: Sqoop, Flume, Morphline, RDBMS vendor-specific connectors, Aegisthus, Logstash, Camus.

Goals. A unified and structured data ingestion flow – RDBMS -> Hadoop – event streams -> Hadoop. Higher-level abstractions – facts, dimensions – snapshots, increments, changelogs. ELT-oriented – minimize transformation in the ingest pipeline.

Central Ingestion Pipeline

Overview: Challenges. What does Gobblin provide? How does Gobblin work? Retrospective and lookahead.

Gobblin at LinkedIn. Business Analytics – source data for sales analysis, product sentiment analysis, etc. Engineering – source data for issue tracking, monitoring, product releases, security compliance, A/B testing. Consumer products – source data for acquisition integration; performance analysis for campaigns, ads campaigns, etc.

Key features: a horizontally scalable and robust framework; a unified computation paradigm; a turn-key solution; the ability to customize your own ingestion.

Scalable and Robust Framework. Scalable: jobs are partitioned into tasks that run concurrently. Centralized state management: state is carried over between jobs automatically, so metadata can be used to track offsets, checkpoints, watermarks, etc. Fault tolerant: the framework gracefully deals with machine and job failures. Quality assurance: quality checking is baked in throughout the flow.

Unified computation paradigm. Common execution flow: batch ingestion and streaming ingestion pipelines share the same execution flow. Shared infra components: job state management, job metrics store, metadata management.

Turn-Key Solution. Built-in exchange protocols: existing adapters can easily be re-used for sources with common protocols (e.g. JDBC, REST, SFTP, SOAP). Built-in source integration: fully integrated with commonly used sources (MySQL, SQL Server, Oracle, Salesforce, HDFS, filers, internal dropboxes). Built-in data ingestion semantics: covers full-dump and incremental ingestion for fact and dimension datasets. Policy-driven flow execution and tuning: flow owners just specify a pre-defined policy for handling job failure, degree of parallelism, what data to publish, etc.
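A policy-driven job might be declared in a small configuration file. The sketch below is purely illustrative: these key names are hypothetical placeholders, not Gobblin's actual property names.

```properties
# Hypothetical job configuration sketch; key names are illustrative only.
job.name=SalesforceAccountsIngest
source.type=salesforce
ingestion.mode=incremental
# Pre-defined policies the flow owner picks instead of writing code:
job.failure.policy=retry_then_alert
task.parallelism=8
publish.policy=COMMIT_ON_FULL_SUCCESS
```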

Customize Your Own Ingestion Pipeline. Extendable operators: operators for extraction, conversion, quality checking, data persistence, etc. can be implemented or extended against a common API. Configurable operator flow: configuration provides multiple plugin points for adding customized logic and code.

Overview: Challenges. What does Gobblin provide? How does Gobblin work? Lookahead.

Under the Hood

Computation Model. Gobblin standalone – single process, multi-threaded – for testing, small data, sampling. Gobblin on Map/Reduce – for large datasets, horizontally scalable. Gobblin on Yarn – better resource utilization – more scheduling flexibility.

Scalable Ingestion Flow. [Diagram: a Source produces multiple Work Units; each Task runs an Extractor -> Converter -> Quality Checker -> Writer chain; a Data Publisher commits the results.]

Sources. A source determines how to partition work – the partitioning algorithm can leverage source sharding – partitions can be grouped intelligently for performance. It creates the work units to be scheduled.
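The partition-then-group idea can be sketched as follows. This is an illustrative stand-in, not Gobblin's actual Source API: it shards a numeric key range and then groups adjacent shards so each task gets a reasonable amount of work.

```python
# Illustrative sketch (hypothetical helper, not Gobblin's actual API):
# partition a key range into shards, then group shards into work units.

def create_work_units(min_id, max_id, shard_size, group_size):
    """Split [min_id, max_id] into shards of at most shard_size ids,
    then group adjacent shards to limit the number of tasks."""
    shards = []
    lo = min_id
    while lo <= max_id:
        hi = min(lo + shard_size - 1, max_id)
        shards.append((lo, hi))
        lo = hi + 1
    # Each inner list becomes one schedulable work unit.
    return [shards[i:i + group_size] for i in range(0, len(shards), group_size)]

units = create_work_units(1, 100, shard_size=25, group_size=2)
```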

Job Management. Job execution state – watermark – task state, job state, quality-checker output, error code. Job synchronization. Job failure handling: policy driven. [Diagram: successive job runs read from and write to a shared state store.]
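The role of the state store can be sketched as below. The classes and field names are hypothetical, not Gobblin's actual state-management API; the point is that each run resumes from the watermark the previous run committed.

```python
# Illustrative sketch (hypothetical, not Gobblin's actual API): a state
# store carries the high watermark from one job run to the next.

class StateStore:
    def __init__(self):
        self._states = {}          # job_name -> latest committed state

    def get(self, job_name):
        return self._states.get(job_name, {"watermark": 0})

    def put(self, job_name, state):
        self._states[job_name] = state

def run_job(store, job_name, source_records):
    """Pull only records newer than the saved watermark, then commit it."""
    state = store.get(job_name)
    pulled = [r for r in source_records if r["offset"] > state["watermark"]]
    if pulled:
        store.put(job_name, {"watermark": max(r["offset"] for r in pulled)})
    return pulled
```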

Gobblin Operator Flow: Extract Schema -> Convert Schema -> (per record: Extract Record -> Convert Record -> Check Record Data Quality -> Write Record) -> Check Task Data Quality -> Commit Task Data.
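A task driving this operator flow can be sketched as below, with small stand-in classes. All interfaces here are hypothetical, not Gobblin's actual operator API; they only mirror the stages named on the slide.

```python
# Illustrative sketch of the operator flow (hypothetical stand-in classes,
# not Gobblin's actual API).

class ListExtractor:
    """Pulls records from an in-memory list; a real extractor pulls from a source."""
    def __init__(self, schema, records):
        self.schema, self.records = schema, records
    def extract_schema(self):
        return self.schema
    def __iter__(self):
        return iter(self.records)

class IdentityConverter:
    def convert_schema(self, schema):
        return schema
    def convert_record(self, schema, record):
        return record

class AlwaysPassChecker:
    def check(self, record_or_writer):
        return True

class ListWriter:
    def __init__(self):
        self.buffer, self.committed = [], []
    def write(self, record):
        self.buffer.append(record)
    def commit(self):
        self.committed += self.buffer
        self.buffer = []

def run_task(extractor, converter, record_checker, task_checker, writer):
    schema = converter.convert_schema(extractor.extract_schema())  # Extract + Convert Schema
    for record in extractor:                                       # Extract Record
        record = converter.convert_record(schema, record)          # Convert Record
        if record_checker.check(record):                           # Check Record Data Quality
            writer.write(record)                                   # Write Record
    if not task_checker.check(writer):                             # Check Task Data Quality
        raise RuntimeError("task-level quality check failed")
    writer.commit()                                                # Commit Task Data
```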

Extractors. An extractor specifies how to get the schema and pull data from the source. It returns a result-set iterator, tracks the high watermark, and tracks extraction metrics.

Converters. Allow for schema and data transformation – filtering – projection – type conversion – structural change. Composable: a list of converters can be specified and is applied in the given order.
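Converter composition can be sketched as below. This is an illustrative simplification (plain functions instead of Gobblin's converter classes, which are hypothetical here): each converter transforms a record, and a filtering converter may drop one by returning None.

```python
# Illustrative sketch (hypothetical interfaces, not Gobblin's actual API):
# converters are composable and applied to each record in the given order.

def compose(converters):
    def convert(record):
        for c in converters:
            record = c(record)
            if record is None:        # a filtering converter dropped the record
                return None
        return record
    return convert

# Example chain: project two fields, then filter out records with bad ids.
project = lambda r: {"id": r["id"], "country": r["country"]}
drop_bad_ids = lambda r: None if r["id"] < 0 else r

pipeline = compose([project, drop_bad_ids])
```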

Quality Checkers. Ensure the quality of any data produced by Gobblin; can run on a per-record, per-task, or per-job basis. A list of quality checkers can be specified – schema compatibility – audit check – sensitive fields – unique key. Policy driven: FAIL – if the check fails, so does the job; OPTIONAL – if the check fails, the job continues; ERR_FILE – the offending row is written to an error file.
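The three policies can be sketched in a few lines. This is a hypothetical helper, not Gobblin's actual quality-checker classes; the error file is stood in for by an in-memory list.

```python
# Illustrative sketch of the FAIL / OPTIONAL / ERR_FILE policies
# (hypothetical helper, not Gobblin's actual API).

FAIL, OPTIONAL, ERR_FILE = "FAIL", "OPTIONAL", "ERR_FILE"

def apply_checks(record, checks, err_rows):
    """checks is a list of (predicate, policy) pairs.
    Returns True if the record should be written, False to drop it."""
    for check, policy in checks:
        if check(record):
            continue
        if policy == FAIL:
            raise RuntimeError("quality check failed; failing the job")
        if policy == ERR_FILE:
            err_rows.append(record)   # stands in for writing an error file
            return False
        # OPTIONAL: record the failure and keep going
    return True
```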

Writers. Write data in Avro format onto HDFS – one writer per task. Flexibility – configurable compression codec (Deflate, Snappy) – configurable buffer size. Support for other data formats (Parquet, ORC) is planned.

Publishers. A publisher determines job success based on policy – COMMIT_ON_FULL_SUCCESS – COMMIT_ON_PARTIAL_SUCCESS – and commits data to final directories based on job success. [Diagram: task output files land in a tmp directory and are moved to the final directory on commit.]
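The tmp-to-final commit step can be sketched as below. This is a hypothetical helper, not Gobblin's actual publisher: under COMMIT_ON_FULL_SUCCESS nothing is published unless every task succeeded, while COMMIT_ON_PARTIAL_SUCCESS publishes whatever did succeed.

```python
# Illustrative sketch of the two commit policies (hypothetical helper,
# not Gobblin's actual publisher).

import shutil
from pathlib import Path

def publish(task_results, tmp_dir, final_dir, policy):
    """task_results maps a task's output file name to True/False (success).
    Returns the list of file names actually published."""
    succeeded = [f for f, ok in task_results.items() if ok]
    if policy == "COMMIT_ON_FULL_SUCCESS" and len(succeeded) < len(task_results):
        return []                                  # publish nothing
    Path(final_dir).mkdir(parents=True, exist_ok=True)
    for name in succeeded:                         # move tmp output to final dir
        shutil.move(str(Path(tmp_dir) / name), str(Path(final_dir) / name))
    return succeeded
```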

Gobblin Compaction. Dimensions: an initial full dump followed by incremental extracts in Gobblin; a consistent snapshot is maintained by regularly scheduled compaction. Facts: merge small files.
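Dimension compaction can be sketched as a last-write-wins merge of the base snapshot with the incremental extracts. The record shape (an `id` key and a `ts` timestamp) is an assumption for illustration, not Gobblin's actual compaction logic.

```python
# Illustrative sketch of snapshot + increments compaction (hypothetical
# record shape; not Gobblin's actual implementation).

def compact(snapshot, increments):
    """snapshot and each increment are lists of dicts with 'id' and 'ts';
    the newest 'ts' per id wins, yielding a consistent new snapshot."""
    latest = {r["id"]: r for r in snapshot}
    for delta in increments:
        for r in delta:
            cur = latest.get(r["id"])
            if cur is None or r["ts"] >= cur["ts"]:
                latest[r["id"]] = r
    return sorted(latest.values(), key=lambda r: r["id"])
```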

Overview: Challenges. What does Gobblin provide? How does Gobblin work? Retrospective and lookahead.

Gobblin in Production. Data volume: more than 350 datasets, about 60 TB per day. Production instances: Salesforce, Responsys, RightNow, Timeforce, Slideshare, Newsle, A/B testing, LinkedIn JIRA, data retention.

Lessons Learned. Data quality still needs a lot more work. The small-data problem is not small. There are performance-optimization opportunities. Operational traits matter.

Gobblin Roadmap: Gobblin on Yarn; streaming sources; Gobblin Workbench with an ingestion DSL; data profiling for richer quality checking; open source in Q4 ’14.
