The Big Data Ecosystem at LinkedIn

Presentation transcript:

The Big Data Ecosystem at LinkedIn
Jay Kreps

Me
Background in data, not infrastructure
LinkedIn’s SNA team
Original co-author of some LinkedIn open source projects (Voldemort, Azkaban, Kafka)

This Talk
We are in a renaissance of data infrastructure.
How do all these pieces fit together?

Why the current obsession with “Big Data”?

The goal of modern data infrastructure is to make many small computers act like one big one.

The Old Picture

The New Picture

Polyglot persistence?

Infrastructure Icebergs
90k lines of tooling and monitoring, 30k lines of logic
Dedicated engineers, operations
Training
First three nines come from operations

This is (still) a very immature space.
Which systems should we have?
Good news for users, bad news for distributed systems nerds
Filesystems take a decade to mature. Don’t expect this to be easier.

Infrastructure is sculpted by applications and constraints
Projects are defined by trade-offs

Constraints
Hardware: Jeff Dean, Numbers everyone should know; David Patterson, Latency lags bandwidth; $$$
Other: Path dependence; Complexity; Resources
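
The hardware constraints on this slide can be made concrete with a little arithmetic. The sketch below compares a sequential 1 GB disk scan with fetching the same data as random 4 KB reads. The latency constants are assumptions: approximate, order-of-magnitude values from the commonly circulated "Numbers everyone should know" list, not figures stated in this talk, and they vary by hardware. The function names are hypothetical.

```python
# Approximate latencies in nanoseconds. These are ASSUMED order-of-magnitude
# values from the commonly cited list, not figures stated in this talk.
DISK_SEEK_NS = 10_000_000        # one random disk seek (~10 ms)
DISK_READ_1MB_NS = 20_000_000    # read 1 MB sequentially from disk (~20 ms)
MEMORY_READ_1MB_NS = 250_000     # read 1 MB sequentially from memory (~0.25 ms)

NS_PER_SECOND = 1_000_000_000


def sequential_read_seconds(megabytes, ns_per_mb):
    """Seconds to stream `megabytes` MB at a fixed per-megabyte cost."""
    return megabytes * ns_per_mb / NS_PER_SECOND


def random_read_seconds(total_bytes, read_size_bytes, seek_ns=DISK_SEEK_NS):
    """Seconds to fetch `total_bytes` as independent random reads.

    Counts one seek per read and ignores transfer time, so this is a
    lower bound on the real cost.
    """
    reads = total_bytes // read_size_bytes
    return reads * seek_ns / NS_PER_SECOND


if __name__ == "__main__":
    one_gb = 1024 ** 3
    print("sequential disk scan (s):", sequential_read_seconds(1024, DISK_READ_1MB_NS))
    print("random 4 KB reads (s):   ", random_read_seconds(one_gb, 4096))
    print("sequential memory (s):   ", sequential_read_seconds(1024, MEMORY_READ_1MB_NS))
```

Under these assumed figures the sequential scan takes about 20 seconds and the random-read version about 44 minutes, which is one way to see why access pattern, not just data size, shapes infrastructure design.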

Applications

Common categories of non-CRUD
Recommendations & Matching
Graphs
Search
Data Normalization
News feed
Analysis & Monitoring

Social Graph

Search

Recommendations: People

Recommendations: Jobs

Recommendations: Newsfeed

Data Normalization

Analytics

Infrastructure
Search: Lucene, Bobo (facets), Zoie (real-time indexing), Sensei (distribution)
Social Graph
Storage: Oracle, Voldemort, Espresso
Streams: Databus, Kafka
Offline: Hadoop & friends (Pig, Hive, Azkaban, etc.)

Three Major Paradigms
Request/Response: Search, Social Graph, Storage
Streams: Kafka
Batch: Hadoop

Most features are multi-paradigm

Request/Response
Search
Social Graph
Storage: Voldemort, Espresso

Request/Response Patterns
Broker, scatter-gather
Storage systems: only
Partitioning strategy
Latency oriented
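
The broker and scatter-gather pattern named on this slide can be sketched as follows. This is a toy, in-memory illustration, not the design of Sensei or any other LinkedIn system: the `Partition` and `Broker` classes, the CRC32-modulo partitioning strategy, and the term-count scoring are all assumptions made for the example.

```python
import heapq
import zlib
from concurrent.futures import ThreadPoolExecutor


class Partition:
    """Hypothetical shard holding a subset of the documents in memory."""

    def __init__(self):
        self.docs = {}  # doc_id -> text

    def add(self, doc_id, text):
        self.docs[doc_id] = text

    def search(self, term, limit):
        """Return up to `limit` local hits as (score, doc_id), best first."""
        hits = []
        for doc_id, text in self.docs.items():
            score = text.lower().split().count(term)
            if score > 0:
                hits.append((score, doc_id))
        return heapq.nlargest(limit, hits)


class Broker:
    """Routes writes to one partition and fans reads out to all of them."""

    def __init__(self, num_partitions):
        self.partitions = [Partition() for _ in range(num_partitions)]

    def partition_for(self, doc_id):
        # Assumed strategy: a stable hash modulo the partition count, so the
        # same id always lands on the same shard.
        return zlib.crc32(doc_id.encode("utf-8")) % len(self.partitions)

    def index(self, doc_id, text):
        self.partitions[self.partition_for(doc_id)].add(doc_id, text)

    def search(self, term, limit=3):
        term = term.lower()
        # Scatter: query every partition in parallel.
        with ThreadPoolExecutor(max_workers=len(self.partitions)) as pool:
            partials = list(pool.map(lambda p: p.search(term, limit), self.partitions))
        # Gather: merge the per-partition top hits into a global top list.
        merged = heapq.nlargest(limit, (hit for part in partials for hit in part))
        return [doc_id for _score, doc_id in merged]
```

Because every partition is queried on each search, response time is bounded by the slowest shard, which is one reason these systems are latency oriented.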

Batch: Hadoop
Uses: Ad hoc; Production batch
Ecosystem: Hive, Pig; Azkaban (workflow); Avro data
Data in: Kafka
Data out: Voldemort, Kafka
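
The shape of a production batch job (events in, derived key/value data out) can be sketched in plain Python. This is a minimal illustration of the map, shuffle, and reduce steps, not Hadoop's API: the event fields (`type`, `viewed_id`) and the function names are hypothetical, the input list stands in for events that would arrive via Kafka, and the returned dictionary stands in for output that would be loaded into a key/value store.

```python
from itertools import groupby
from operator import itemgetter


def map_phase(events):
    """Emit (key, 1) for every profile-view event; ignore other events."""
    for event in events:
        if event["type"] == "profile_view":
            yield event["viewed_id"], 1


def shuffle(pairs):
    """Group values by key, as the framework would between map and reduce."""
    ordered = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        yield key, [value for _key, value in group]


def reduce_phase(grouped):
    """Sum the values for each key."""
    for key, values in grouped:
        yield key, sum(values)


def run_batch(events):
    """Run the whole job and return a key -> count mapping."""
    return dict(reduce_phase(shuffle(map_phase(events))))


if __name__ == "__main__":
    sample_events = [
        {"type": "profile_view", "viewed_id": "alice"},
        {"type": "search", "query": "hadoop"},
        {"type": "profile_view", "viewed_id": "bob"},
        {"type": "profile_view", "viewed_id": "alice"},
    ]
    print(run_batch(sample_events))
```

The job is a pure function of its input, so it can be rerun from scratch on the full data set whenever the logic changes.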

Why do batch if you have real-time?
Batch advantages: Safety; Easy; Throughput; Simplicity; Economics
Tricky bit: engineering the data cycle

Why do streaming?
You have to glue all these systems together
Throughput as good as batch
Latency much better
Metaphor more natural for low latency than Hadoop
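
One way a stream can glue systems together is an append-only log that several readers consume independently, each tracking its own position. The sketch below is a toy, single-process illustration of that idea, not Kafka's API: the `Log` and `Consumer` classes and their methods are hypothetical names chosen for the example.

```python
class Log:
    """Append-only sequence of records addressed by integer offset."""

    def __init__(self):
        self._records = []

    def append(self, record):
        """Add a record and return the offset it was written at."""
        self._records.append(record)
        return len(self._records) - 1

    def read(self, offset, max_records=100):
        """Return up to `max_records` records starting at `offset`."""
        return self._records[offset:offset + max_records]


class Consumer:
    """A reader that remembers its own offset into a shared log."""

    def __init__(self, log, name):
        self.log = log
        self.name = name
        self.offset = 0

    def poll(self, max_records=100):
        """Fetch the next batch and advance this consumer's offset only."""
        batch = self.log.read(self.offset, max_records)
        self.offset += len(batch)
        return batch


if __name__ == "__main__":
    log = Log()
    for event in ("page_view", "search", "connection_added"):
        log.append(event)

    search_indexer = Consumer(log, "search-indexer")
    hadoop_loader = Consumer(log, "hadoop-loader")
    print(search_indexer.poll(max_records=2))
    print(hadoop_loader.poll())
    print(search_indexer.poll())
```

Records are never removed by a read, so a slow consumer (for example, a periodic batch load) and a fast one (for example, a real-time indexer) can share the same stream without coordinating with each other.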

What makes successful infrastructure systems?
Operability and Operations
Monitoring
Simplicity
Documentation
Broad adoption
Lazy users
Open source

Open Source
Data > Infrastructure
Open source creates better code, even with few outside contributors
Commercial infrastructure not interesting

Open Source Projects
We made:
Voldemort: Key/Value storage
Sensei, Bobo, Zoie: Elastic, faceted, real-time search with Lucene
Kafka: Persistent, distributed data streams
Norbert: Cluster aware RPC, load balancing, and group membership
And others…
We stole:
Hadoop, Pig, Hive
Lucene
Netty, Jetty
Zookeeper
Avro
Apache Traffic Server

The End
jay.kreps@gmail.com
http://www.linkedin.com/in/jaykreps
http://twitter.com/jaykreps
http://sna-projects.com