Taming the ETL beast How LinkedIn uses metadata to run complex ETL flows reliably Rajappa Iyer Strata Conference, London, November 12, 2013.



`whoami`
• Data LinkedIn since 2011
• Prior to that:
  – Director of Engineering at Digg
  – Enterprise Data Architect at eBay

Outline of talk
• Background and Context – The Why
• Challenges with Data Delivery – The What
• Metadata to the Rescue – The How
• Q&A

LinkedIn: The World’s Largest Professional Network
• 259M+ Members Worldwide
• 2 new Members Per Second
• 100M+ Monthly Unique Visitors
• 3M+ Company Pages
Connecting Talent with Opportunity. At scale…

Data Driven Products and Insights
• Insights (Analysts and Data Scientists)
• Products for Members (Professionals)
• Products for Enterprises (Companies)
• Data, Platforms, Analytics

Products for Members

Products for Enterprises
• Sell - Sales Navigator
• Market - Marketing Solutions
• Hire - Talent Solutions

Examples of Insights

Example of Deeper Insight: Job Migration After Financial Collapse

Data is critical to LinkedIn’s products. It needs to be delivered in a reliable and timely manner.
LinkedIn Confidential ©2013 All Rights Reserved

A Simplified Overview of Data Flow

Components of typical ETL jobs
• Ingress / Egress of message-oriented data
  – Logs and clickstream data
• Ingress / Egress of record-oriented data
  – Database data
• Transformations
  – Select, project, join
  – Aggregations
  – Partitioning
  – Cleansing and data normalization
  – Schema conversions – e.g., Nested JSON to Relational
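The schema conversion bullet (nested JSON to relational) can be illustrated with a minimal sketch. The event shape, field names, and the underscore-joined column naming are assumptions for illustration only; the talk does not describe how LinkedIn performs this conversion.

```python
import json


def flatten(record, parent_key="", sep="_"):
    """Flatten nested dicts into a single-level dict of column -> value."""
    row = {}
    for key, value in record.items():
        column = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested objects, prefixing child keys with the parent.
            row.update(flatten(value, column, sep))
        else:
            row[column] = value
    return row


# Hypothetical clickstream event, not taken from the talk.
event = json.loads(
    '{"member_id": 42, "page": {"key": "profile", "locale": "en_GB"}}'
)
row = flatten(event)
print(row)
```

Each flattened key maps to one column of a relational row. Arrays are left untouched here; a real conversion would need a policy for them (for example, one child table per array).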

An Example ETL Flow

Challenges
• Complex process dependencies
  – Some flows are over 30 levels deep
  – Flows may span multiple platforms (Hadoop, RDBMS etc.)
• Complex data dependencies
  – Multiple flows may consume a data element
  – Multiple data elements feed into a single flow
  – Can be viewed as “data sync barriers”
• Recovery
  – Restartable flows that pick up from last checkpoint
  – Catch up mode to compensate for downtime
• Monitoring and Alerting
  – Prioritization of “important” flows for ops attention
  – Who do you call when things fail?
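The recovery requirement (restartable flows that pick up from the last checkpoint) might look roughly like the sketch below. The JSON checkpoint file, the step names, and the `run_flow` helper are all hypothetical; the talk does not say how checkpoints are stored.

```python
import json
import os
import tempfile


def run_flow(steps, checkpoint_path):
    """Run steps in order, skipping those already recorded as done."""
    done = []
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            done = json.load(f)
    executed = []
    for name, action in steps:
        if name in done:
            continue  # completed in an earlier run
        action()  # may raise; the checkpoint keeps earlier progress
        done.append(name)
        with open(checkpoint_path, "w") as f:
            json.dump(done, f)  # persist progress after every step
        executed.append(name)
    return executed


# Simulate a step that fails on its first attempt only.
attempts = {"load": 0}


def flaky_load():
    attempts["load"] += 1
    if attempts["load"] == 1:
        raise RuntimeError("simulated outage")


steps = [
    ("extract", lambda: None),
    ("load", flaky_load),
    ("aggregate", lambda: None),
]
checkpoint = os.path.join(tempfile.mkdtemp(), "flow.checkpoint.json")

try:
    run_flow(steps, checkpoint)
except RuntimeError:
    pass  # first run dies in "load"; "extract" is already checkpointed

resumed = run_flow(steps, checkpoint)
print(resumed)
```

On the second run only the steps after the last checkpoint execute, so a restart does not repeat the work already done by `extract`.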

Metadata to the rescue
• What metadata is collected?
  – Process dependencies
  – Data dependencies
  – Execution history and data processing statistics
• How is it used?
  – Drives the ETL framework with lots of functionality
    • Check for data availability
    • Retries and restarts
    • Standardized error reporting / alerting
    • Prioritized view of business critical flows

Metadata: Process Dependencies
• Capture process dependency graph
  – Also capture metadata such as process owners, importance, SLA etc.
• Capture stats for each execution of a workflow
  – Time of execution
  – Execution status
  – Pointer to error logs
• Alert on delayed processes
  – Based on execution history
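A minimal sketch of the two ideas on this slide: a process dependency graph that yields a valid run order, and a delay check driven by execution history. The process names are made up, and the "mean plus k standard deviations" threshold is an assumed rule, since the talk only says alerts are based on execution history. `graphlib` requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter
from statistics import mean, stdev

# process -> set of processes it depends on (hypothetical names)
dependencies = {
    "load_page_views": set(),
    "load_members": set(),
    "sessionize": {"load_page_views"},
    "member_activity": {"sessionize", "load_members"},
}

# Any order in which every process runs after its dependencies.
order = list(TopologicalSorter(dependencies).static_order())


def is_delayed(history_minutes, elapsed_minutes, k=3.0):
    """Flag a run whose elapsed time exceeds mean + k * stdev of past runs."""
    threshold = mean(history_minutes) + k * stdev(history_minutes)
    return elapsed_minutes > threshold


# Durations of past runs in minutes (illustrative numbers).
history = [30, 32, 29, 31, 33]
print(order)
print(is_delayed(history, 50), is_delayed(history, 34))
```

Storing owner, importance, and SLA next to each node of the same graph would let an alert be routed to the right person and ranked by business priority.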

Metadata: Data Dependencies
• For each flow, capture input and output data elements
• For each flow execution, capture stats on data element
  • Number of records or messages processed
  • Error counts
  • Watermarks
    – Can be time based or sequence based
    – This can be per flow as more than one flow can consume a data element
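The per-flow watermark idea can be sketched as follows, assuming sequence-based watermarks and an in-memory store. `WatermarkStore`, `consume`, and the record layout are hypothetical names introduced for illustration, not part of any LinkedIn system described in the talk.

```python
from collections import defaultdict


class WatermarkStore:
    """Keeps one watermark per (flow, data element) pair."""

    def __init__(self):
        # (flow, element) -> highest sequence number processed so far
        self._marks = defaultdict(int)

    def get(self, flow, element):
        return self._marks[(flow, element)]

    def advance(self, flow, element, value):
        key = (flow, element)
        self._marks[key] = max(self._marks[key], value)  # never move backwards


def consume(store, flow, element, records):
    """Process only records above this flow's watermark, then advance it."""
    low = store.get(flow, element)
    fresh = [r for r in records if r["seq"] > low]
    if fresh:
        store.advance(flow, element, max(r["seq"] for r in fresh))
    return {
        "records_processed": len(fresh),
        "watermark": store.get(flow, element),
    }


records = [{"seq": n} for n in range(1, 6)]  # sequence numbers 1..5
store = WatermarkStore()

stats_a1 = consume(store, "flow_a", "page_views", records[:3])
stats_b = consume(store, "flow_b", "page_views", records)
stats_a2 = consume(store, "flow_a", "page_views", records)
print(stats_a1, stats_b, stats_a2)
```

Because the watermark is keyed by flow as well as by data element, `flow_a` and `flow_b` read the same element independently, and each resumes from its own position.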

Metadata: Data Elements
• Simple catalog of data elements
  – Name, physical location, owner etc.
• Data elements can have logical names
  – Names resolve to one or more physical entities
  – Logical names can represent useful collections
    • E.g., data as of a particular interval
• Data element availability can trigger processes
  – E.g., kick off hourly process when hourly data is complete and available
  – Enables data driven ETL scheduling
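A rough sketch of logical-name resolution and data-driven triggering, under stated assumptions: the catalog is a plain dictionary, the logical name and paths are invented, and "available" simply means every physical entity behind the name has landed.

```python
# logical name -> physical locations (hypothetical name and paths)
catalog = {
    "page_views.hourly.2013-11-12T10": [
        "/data/page_views/2013/11/12/10/part-0",
        "/data/page_views/2013/11/12/10/part-1",
    ],
}


def resolve(logical_name):
    """Resolve a logical data element name to its physical entities."""
    return catalog[logical_name]


def is_available(logical_name, landed):
    """A data element is available once all of its physical entities landed."""
    return all(path in landed for path in resolve(logical_name))


def ready_processes(triggers, landed):
    """Return the processes whose input data elements are all available."""
    return sorted(
        process
        for process, inputs in triggers.items()
        if all(is_available(name, landed) for name in inputs)
    )


# process -> logical data elements it waits for (hypothetical)
triggers = {"hourly_rollup": ["page_views.hourly.2013-11-12T10"]}

landed = {"/data/page_views/2013/11/12/10/part-0"}
before = ready_processes(triggers, landed)

landed.add("/data/page_views/2013/11/12/10/part-1")
after = ready_processes(triggers, landed)
print(before, after)
```

The hourly process starts when its data is complete rather than at a fixed clock time, which is the sense of "data driven ETL scheduling" on this slide.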

ETL Framework: Putting it all together
[Diagram. Components and labels: Metadata Management System; Scheduler; ETL applications; Name resolver; Log Parsers; Checkpoint; Execution State; Retry / Resume; Data Check; Statistics (process and data); Alerting / Monitoring; Dashboards, Reports; Data Availability Status; Execution History; Data Lineage]

Questions?
More at data.linkedin.com
Come Work on Challenging Data Infrastructure problems - We’re Hiring