Building LinkedIn’s Real-time Data Pipeline

Slides:



Advertisements
Similar presentations
From Startup to Enterprise A Story of MySQL Evolution Vidur Apparao, CTO Stephen OSullivan, Manager of Data and Grid Technologies April 2009.
Advertisements

HBase and Hive at StumbleUpon
Hadoop at ContextWeb February ContextWeb: Traffic Traffic – up to 6 thousand Ad requests per second. Comscore Trend Data:
The Big Data Ecosystem at LinkedIn
Media6. Who We Are Media6° is an Online Advertising Company Specializing in Social Graph Targeting –Birds of a feather flock together! –We build.
Data Freeway : Scaling Out to Realtime Author: Eric Hwang, Sam Rash Speaker : Haiping Wang
MAP REDUCE PROGRAMMING Dr G Sudha Sadasivam. Map - reduce sort/merge based distributed processing Best for batch- oriented processing Sort/merge is primitive.
Hui Li Pig Tutorial Hui Li Some material adapted from slides by Adam Kawa the 3rd meeting of WHUG June 21, 2012.
13,000 Jobs and counting…. Advertising and Data Platform Our System.
HDFS & MapReduce Let us change our traditional attitude to the construction of programs: Instead of imagining that our main task is to instruct a computer.
CS525: Special Topics in DBs Large-Scale Data Management MapReduce High-Level Langauges Spring 2013 WPI, Mohamed Eltabakh 1.
Working with pig Cloud computing lecture. Purpose  Get familiar with the pig environment  Advanced features  Walk though some examples.
LINKEDIN COMMUNICATION ARCHITECTURE Ruslan Belkin, Sean Dawson.
Kafka high-throughput, persistent, multi-reader streams
Building a Real-Time Data Pipeline: Apache Kafka at Linkedin Hadoop Summit 2013 Joel Koshy June 2013 LinkedIn Corporation ©2013 All Rights Reserved.
The Evolution of Data Infrastructure at Linkedin LinkedIn Confidential ©2013 All Rights Reserved.
Data Infrastructure at LinkedIn Shirshanka Das XLDB
Observation Pattern Theory Hypothesis What will happen? How can we make it happen? Predictive Analytics Prescriptive Analytics What happened? Why.
Distributed Computations
Report Distribution Report Distribution in PeopleTools 8.4 Doug Ostler & Eric Knapp 7264.
HOL9396: Oracle Event Processing 12c
Pulsar Realtime Analytics At Scale Tony Ng, Sharad Murthy June 11, 2015.
Distributed Computations MapReduce
CS525: Big Data Analytics MapReduce Languages Fall 2013 Elke A. Rundensteiner 1.
Take An Internal Look at Hadoop Hairong Kuang Grid Team, Yahoo! Inc
Hive: A data warehouse on Hadoop Based on Facebook Team’s paperon Facebook Team’s paper 8/18/20151.
Grid Computing Meets the Database Chris Smith Platform Computing Session #
Apache Spark and the future of big data applications Eric Baldeschwieler.
Taming the ETL beast How LinkedIn uses metadata to run complex ETL flows reliably Rajappa Iyer Strata Conference, London, November 12, 2013.
WORKFLOW IN MOBILE ENVIRONMENT. WHAT IS WORKFLOW ?  WORKFLOW IS A COLLECTION OF TASKS ORGANIZED TO ACCOMPLISH SOME BUSINESS PROCESS.  EXAMPLE: Patient.
USING HADOOP & HBASE TO BUILD CONTENT RELEVANCE & PERSONALIZATION Tools to build your big data application Ameya Kanitkar.
Netflix Data Pipeline with Kafka Allen Wang & Steven Wu.
Wrangling Customer Usage Data with Hadoop Clearwire – Thursday, June 27 th Carmen Hall – IT Director Mathew Johnson – Sr. IT Manager.
` tuplejump The data engineering platform. A startup with a vision to simplify data engineering and empower the next generation of data powered miracles!
Presented by CH.Anusha.  Apache Hadoop framework  HDFS and MapReduce  Hadoop distributed file system  JobTracker and TaskTracker  Apache Hadoop NextGen.
Our Experience Running YARN at Scale Bobby Evans.
WITSML Service Platform - Enterprise Drilling Information
Introduction to Hadoop and HDFS
Experiences with Big Data and Learning What’s Worth Keeping Roger Liew, CTO Orbitz Worldwide Wolfram Data Summit 2011.
1 The Instant Data Warehouse Released 15/01/ Hello and Welcome!! Today I am very pleased to announce the release of the 'Instant Data Warehouse'.
MapReduce High-Level Languages Spring 2014 WPI, Mohamed Eltabakh 1.
MapReduce M/R slides adapted from those of Jeff Dean’s.
Introduction to Hadoop Owen O’Malley Yahoo!, Grid Team
Right In Time Presented By: Maria Baron Written By: Rajesh Gadodia
Wellstorm Development Connecting Real Time Data to Everything Hugh Winkler May 11, 2006.
CERN IT Department CH-1211 Geneva 23 Switzerland t CF Computing Facilities Agile Infrastructure Monitoring CERN IT/CF.
Copyright 2007 SpringSource. Copying, publishing or distributing without express written permission is prohibited. Introduction to Data Access with Spring.
+ Logentries Is a Real-Time Log Analytics Service for Aggregating, Analyzing, and Alerting on Log Data from Microsoft Azure Apps and Systems MICROSOFT.
Log Shipping, Mirroring, Replication and Clustering Which should I use? That depends on a few questions we must ask the user. We will go over these questions.
Stream Processing with Tamás István Ujj
Internal developer tools and bug tracking Arabic / Hebrew Windows 3.1Win95 Japanese Word, OneNote, Outlook
Apache Kafka A distributed publish-subscribe messaging system
Handling Streaming Data in Spotify Using the Cloud
JAFKA A fast MQ. overview ● ● 271KB single jar ● 3.5MB with all dependencies and configurations.
CMSC 491 Hadoop-Based Distributed Computing Spring 2016 Adam Shook
Hadoop in the Wild CMSC 491 Hadoop-Based Distributed Computing Spring 2016 Adam Shook.
Solr Power FTW Alex #solrnosql. What Will I Cover? Who I am What Bazaarvoice does SOLR and NoSQL Can SOLR handle 20K queries per second?
Pilot Kafka Service Manuel Martín Márquez. Pilot Kafka Service Manuel Martín Márquez.
Scaling Big Data Mining Infrastructure: The Twitter Experience
Database Services Katarzyna Dziedziniewicz-Wojcik On behalf of IT-DB.
Introduction to Apache Kafka
Collecting heterogeneous data into a central repository
Messaging at CERN Lionel Cons – CERN IT/CM 18 Jan 2017
Enabling Scalable and HA Ingestion and Real-Time Big Data Insights for the Enterprise OCJUG, 2014.
Extraction, aggregation and classification at Web Scale
Central Florida Business Intelligence User Group
Lab #2 - Create a movies dataset
Evolution of messaging systems and event driven architecture
July, 2016 Fangshi Li Carl Steinbach Dr. LinkedIn July, 2016 Fangshi Li Carl Steinbach.
Learn. Imagine. Build. .NET Conf
Presentation transcript:

Building LinkedIn’s Real-time Data Pipeline Jay Kreps

What is a data pipeline?

What data is there? Database data Activity data Messaging Page Views, Ad Impressions, etc Messaging JMS, AMQP, etc Application and System Metrics Rrdtool, graphite, etc Logs Syslog, log4j, etc Thousands of data types being rapidly evolved by hundreds of engineers

Revolution in data systems: more scalable, but more specialized Systems: Data Warehouse, Stream Processing, Sensor networks, Text search, Scientific databases, Document stores 2005 - ICDE

Data Systems at LinkedIn Search Social Graph Recommendations Live Storage Hadoop Data Warehouse Monitoring Systems …

Problem: Data Integration

Point-to-Point Pipelines

Centralized Pipeline

How have companies solved this problem?

The Enterprise Data Warehouse “Data entry” IT-centric “Cleansed and Conformed” Central ETL

Problems Data warehouse is a batch system Central team that cleans all data? One person’s cleaning… Relational mapping is non-trivial…

My Experience 2008 ETL is a huge problem Everything most things that are batch will eventually become real-time Data integration is the value Good way to grow a team

LinkedIn’s Pipeline

LinkedIn Circa 2010 Messaging: ActiveMQ User Activity: In house log aggregation Logging: Splunk Metrics: JMX => Zenoss Database data: Databus, custom ETL

2010 User Activity Data Flow Problems: fragile, labor intensive, unscalable at the human layer

Problems Fragility Multi-hour delay Coverage Labor intensive Slow Does it work? Fundamentally unscalable in a human way “What is going on with X”

Four Ideas Central commit log for all data Push data cleanliness upstream O(1) ETL Evidence-based correctness

Four Ideas Central commit log for all data Push data cleanliness upstream O(1) ETL Evidence-based correctness

What kind of infrastructure is needed?

Very confused Messaging (JMS, AMQP, …) Log aggregation CEP, Streaming What did google do?

First Attempt: Don’t reinvent the wheel!

Problems With Messaging Systems Persistence is an afterthought Ad hoc distribution Odd semantics Featuritis

Second Attempt: Reinvent the wheel!

Idea: Central, Distributed Commit Log Different from messaging: log can be replayed logs are fast

What is a commit log?

Data Flow Two kinds of things: Just apply data to the system for a particular type of query (no transformation) Data processing

Apache Kafka

Some Terminology Producers send messages to Brokers Consumers read messages from Brokers Messages are sent to a Topic Each topic is broken into one or more ordered partitions of messages

APIs send(String topic, String key, Message message) Iterator<Message>

Distribution

Performance 50MB/sec writes 110MB/sec reads

Performance

Performance Tricks Batching Avoid large in-memory structures Producer Broker Consumer Avoid large in-memory structures Pagecache friendly Avoid data copying sendfile Batch Compression

Kafka Replication In 0.8 release Messages are highly available No centralized master Current status: works in the lab Replication factor for each topic Server can block until

Kafka Info http://incubator.apache.org/kafka

Usage at LinkedIn 10 billion messages/day Sustained peak: 367 topics 172,000 messages/second written 950,000 messages/second read 367 topics 40 real-time consumers Many ad hoc consumers 10k connections/colo 9.5TB log retained End-to-end delivery time: 10 seconds (avg)

Datacenters

Four Ideas Central commit log for all data Push data cleanliness upstream O(1) ETL Evidence-based correctness

Problem Hundreds of message types Thousands of fields What do they all mean? What happens when they change?

Make activity data part of the domain model

Schema free?

LOAD ‘student’ USING PigStorage() AS (name:chararray, age:int, gpa:float)

Schemas Structure can be exploited Compatibility Performance Size Compatibility Need a formal contract

Avro Schema Avro – data definition and schema Central repository of all schemas Reader always uses same schema as writer Programatic compatibility model Not part of kafka

Workflow Check in schema Code review Ship

Four Ideas Central commit log for all data Push data cleanliness upstream O(1) ETL Evidence-based correctness

Hadoop Data Load Map/Reduce job does data load One job loads all events Hive registration done automatically Schema changes handled transparently ~5 minute lag on average to HDFS

Four Ideas Central commit log for all data Push data cleanliness upstream O(1) ETL Evidence-based correctness

Does it work? Computer science versus engineering

All messages sent must be delivered to all consumers (quickly)

Audit Trail Each producer, broker, and consumer periodically reports how many messages it saw Reconcile these counts every few minutes Graph and alert

Audit Trail

Questions?