The Evolution of Big Data Netflix

1 The Evolution of Big Data Platform @ Netflix
Eva Tse July 22, 2015

2

3

4

5

6 Our biggest challenge is scale

7 Netflix Key Business Metrics
65+ million members
50 countries
1000+ devices supported
10 billion hours / quarter

8 Global Expansion
200 countries by end of 2016

9 Big Data Size
Total ~20 PB DW on S3
Read ~10% of DW daily
Write ~10% of read data daily
~500 billion events daily
~350 active users
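The percentages on this slide translate into absolute daily volumes. The sketch below works through that arithmetic using only the slide's approximate figures; the decimal unit convention (1 PB = 1000 TB) is an assumption, so treat the results as rough orders of magnitude.

```python
# Back-of-the-envelope volumes implied by the "Big Data Size" slide.
# All inputs are the slide's approximate figures.
TB_PER_PB = 1000                    # assumption: decimal units
SECONDS_PER_DAY = 24 * 60 * 60

warehouse_pb = 20                   # total warehouse on S3, ~20 PB
read_percent = 10                   # ~10% of the warehouse is read daily
write_percent = 10                  # writes are ~10% of the data read daily
events_per_day = 500_000_000_000    # ~500 billion events daily

daily_read_tb = warehouse_pb * TB_PER_PB * read_percent // 100
daily_write_tb = daily_read_tb * write_percent // 100
events_per_second = events_per_day // SECONDS_PER_DAY

print(f"read per day:   ~{daily_read_tb} TB")
print(f"write per day:  ~{daily_write_tb} TB")
print(f"events per sec: ~{events_per_second:,}")
```

Under these assumptions the platform reads about 2 PB and writes about 200 TB per day, and ingests close to 5.8 million events per second on average.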

10 Our traditional BI stack is our competition

11 How do we meet the functionality bar and yet make it scale?
How do we make big data bite-size again?

12 Our North Star
Infrastructure: No undifferentiated heavy lifting
Architecture: Scalable and sustainable
Self-serve: Ecosystem of tools

13 Data Pipelines
Event Data: Cloud apps -> Suro/Kafka -> Ursula -> AWS S3 (15 min)
Dimension Data: Cassandra -> SS Tables -> Aegisthus -> AWS S3 (Daily)
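The 15-minute cadence for event data can be pictured as assigning each event to a fixed batch window before it lands in S3. The sketch below is illustrative only: the slide does not describe how Ursula organizes its output, so the key layout, the function names, and the `playback` event type are all hypothetical.

```python
from datetime import datetime, timezone

BATCH_MINUTES = 15  # batch interval taken from the slide


def batch_window_start(event_time: datetime) -> datetime:
    """Floor a timestamp to the start of its 15-minute batch window."""
    floored_minute = event_time.minute - event_time.minute % BATCH_MINUTES
    return event_time.replace(minute=floored_minute, second=0, microsecond=0)


def batch_prefix(event_type: str, event_time: datetime) -> str:
    """Build a hypothetical S3 key prefix for the batch holding this event."""
    start = batch_window_start(event_time)
    return (
        f"events/{event_type}/dateint={start:%Y%m%d}"
        f"/hour={start:%H}/batch={start:%M}"
    )


# Hypothetical event: a playback event at 13:47:05 UTC falls in the 13:45 window.
ts = datetime(2015, 7, 22, 13, 47, 5, tzinfo=timezone.utc)
prefix = batch_prefix("playback", ts)
print(prefix)
```

Grouping by window start means every event in the same quarter hour shares one prefix, which is one simple way a downstream job could pick up complete batches.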

14
[Diagram: platform architecture, by layer]
Storage: AWS S3 (Parquet FF)
Compute: (Hadoop clusters)
Service: Metacat (Federated metadata service); (Federated execution service)
Tools: Big Data API / Big Data Portal; Pig workflow visualization; Data movement; Data visualization; Job/Cluster perf; Data lineage; Data quality

15 Evolving Big Data Processing Needs
Analytics
ETL
Interactive data exploration
Interactive slice & dice
RT analytics & iterative/ML algo

16 Evolving Services/Tools Ecosystem
[Diagram: services and tools ecosystem]
Service: Metacat (Federated metadata service); (Federated execution service)
Tools: Big Data API / Big Data Portal; Data movement; Data visualization; Data lineage; Data quality; Pig workflow visualization; Job/Cluster perf visualization

17 AWS S3 as our DW Storage
S3 as single source of truth (not HDFS)
11 9’s durability and 4 9’s availability
Separate compute and storage
Key enablement to multiple clusters and easy upgrade via red/black (r/b) deployment
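To make the "nines" concrete, the sketch below converts them into downtime and expected object loss. It assumes that durability is read as an annual per-object loss probability and uses an illustrative count of 10 million stored objects; neither figure comes from the slide.

```python
from fractions import Fraction

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; ignores leap years


def failure_fraction(nines: int) -> Fraction:
    """Return 1 - 0.99...9 (with `nines` nines) as an exact fraction."""
    return Fraction(1, 10 ** nines)


# 4 nines of availability: downtime allowed per year.
downtime_minutes = float(MINUTES_PER_YEAR * failure_fraction(4))

# 11 nines of durability, treated as an annual per-object loss probability
# (an assumption made for this illustration).
objects_stored = 10_000_000  # illustrative object count, not from the slide
expected_losses_per_year = objects_stored * failure_fraction(11)
years_per_lost_object = 1 / expected_losses_per_year

print(f"allowed downtime: {downtime_minutes:.2f} minutes/year")
print(f"expected loss: one object every {years_per_lost_object} years")
```

With these inputs, four nines of availability allows roughly 52.56 minutes of downtime per year, and eleven nines of durability works out to about one lost object every 10,000 years across 10 million objects.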

18 Evolution of Big Data Processing Systems

19

20 Analytics
Hive-QL is close to ANSI SQL syntax
Hive metastore serves as single source of truth for metadata for big data

21 Better language construct for ETL
Contributions since 0.11
Customization
Integration with Metacat to Hive Metastore
Integration with S3

22 Interactive data exploration and experimentation
Why do we like Presto?
Integration with Hive metastore
Easy integration with S3
Works at petabyte scale
ANSI SQL for usability
Fast

23 Our contributions
S3 file system
Query optimizations
Complex types support
Parquet file format integration
Working on predicate pushdown

24 Parquet
Columnar file format
Supported across Hive, Pig, Presto, Spark
Performance benefits across different processing engines
Working on vectorized read, lazy load and lazy materialization
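The performance ideas named on these slides (columnar layout, predicate pushdown, lazy materialization) can be illustrated with a toy model. This is a minimal sketch in plain Python, not Parquet's actual format or API: `RowGroup` and `scan_greater_than` are hypothetical names, and the min/max statistics are computed on the fly here, whereas a real columnar format keeps them in file metadata.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class RowGroup:
    """A toy row group: each column is stored as its own list of values."""
    columns: Dict[str, List[int]]

    def stats(self, column: str) -> Tuple[int, int]:
        # Real formats store min/max in metadata; computed here for brevity.
        values = self.columns[column]
        return min(values), max(values)


def scan_greater_than(row_groups, filter_col, threshold, project_col):
    """Return project_col values where filter_col > threshold, and groups skipped."""
    results, skipped = [], 0
    for group in row_groups:
        _, col_max = group.stats(filter_col)
        if col_max <= threshold:
            skipped += 1  # predicate pushdown: no row can match, skip the group
            continue
        # Lazy materialization: evaluate the filter column first, then read
        # only the matching positions of the projected column.
        matches = [i for i, v in enumerate(group.columns[filter_col]) if v > threshold]
        projected = group.columns[project_col]
        results.extend(projected[i] for i in matches)
    return results, skipped


groups = [
    RowGroup({"day": [1, 2, 3], "hours": [10, 20, 30]}),
    RowGroup({"day": [4, 5, 6], "hours": [40, 50, 60]}),
    RowGroup({"day": [7, 8, 9], "hours": [70, 80, 90]}),
]
values, groups_skipped = scan_greater_than(groups, "day", 5, "hours")
print(values, groups_skipped)
```

The first row group is skipped without touching its data because its maximum `day` is below the threshold, which is the effect predicate pushdown aims for at much larger scale.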

25 Interactive dashboard for slicing and dicing
Column-based in-memory data store for time series data
Serves a specific use case very well

26 ETL, RT analytics, ML algorithms
Why do we like Spark?
Cohesive environment – batch and ‘stream’ processing
Multiple language support – Scala, Python
Performance benefits
Run on top of YARN for multi-tenancy
Community momentum

27 Evolution of Services/Tools Ecosystem
[Diagram: services and tools ecosystem]
Service: Metacat (Federated metadata service); (Federated execution service)
Tools: Big Data API / Big Data Portal; Data movement; Data visualization; Data lineage; Data quality; Pig workflow visualization; Job/Cluster perf visualization

28 Federated execution engine
Expose [your fave big data engine] as a service
Flexible data model to support future job types
Cluster configuration management
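One way to picture these three points together is a job request with a free-form job type, routed to a registered cluster by tags. The sketch below is a hypothetical illustration, not the actual service's API: `JobRequest`, `Cluster`, `ExecutionService`, and the cluster names are all invented for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class JobRequest:
    """Hypothetical job model; job_type is free-form so new engines fit in."""
    job_type: str
    command: str
    cluster_tags: List[str] = field(default_factory=list)
    parameters: Dict[str, str] = field(default_factory=dict)


@dataclass
class Cluster:
    """Hypothetical cluster registration: a name, tags, and supported job types."""
    name: str
    tags: Set[str]
    job_types: Set[str]


class ExecutionService:
    """Routes a job to the first registered cluster that satisfies it."""

    def __init__(self) -> None:
        self._clusters: List[Cluster] = []

    def register_cluster(self, cluster: Cluster) -> None:
        self._clusters.append(cluster)

    def route(self, job: JobRequest) -> str:
        for cluster in self._clusters:
            supports_type = job.job_type in cluster.job_types
            has_tags = set(job.cluster_tags) <= cluster.tags
            if supports_type and has_tags:
                return cluster.name
        raise LookupError(f"no cluster for job type {job.job_type!r}")


service = ExecutionService()
service.register_cluster(Cluster("adhoc-1", {"adhoc"}, {"hive", "presto"}))
service.register_cluster(Cluster("sla-1", {"sla", "prod"}, {"hive", "pig", "spark"}))

job = JobRequest(job_type="pig", command="etl.pig", cluster_tags=["sla"])
print(service.route(job))
```

Because clients name a job type and tags rather than a specific cluster, clusters can be added, replaced, or upgraded behind the service without changing the jobs themselves.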

29 Metacat
Federated metadata catalog for the whole data platform
Proxy service to different metadata sources
Data metrics, data usage, ownership, categorization and retention policy …
Common interface for tools to interact with metadata
To be open sourced in 2015 on Netflix OSS
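The proxy idea can be sketched as a catalog that forwards lookups to whichever metadata source owns a table and overlays business metadata such as ownership. This is a hypothetical illustration under stated assumptions, not Metacat's actual interface: the class names, the `catalog/database/table` naming scheme, and the sample table are all invented for the example.

```python
from abc import ABC, abstractmethod
from typing import Dict, Tuple


class MetadataSource(ABC):
    """Common interface that every backing metadata store must implement."""

    @abstractmethod
    def get_table(self, database: str, table: str) -> dict:
        ...


class InMemorySource(MetadataSource):
    """Stand-in for a real source such as a Hive metastore."""

    def __init__(self, tables: Dict[Tuple[str, str], dict]) -> None:
        self._tables = tables

    def get_table(self, database: str, table: str) -> dict:
        return self._tables[(database, table)]


class FederatedCatalog:
    """Proxies lookups to the owning source and adds business metadata."""

    def __init__(self) -> None:
        self._sources: Dict[str, MetadataSource] = {}
        self._business: Dict[str, dict] = {}

    def register_source(self, catalog: str, source: MetadataSource) -> None:
        self._sources[catalog] = source

    def annotate(self, qualified_name: str, **attributes: str) -> None:
        self._business.setdefault(qualified_name, {}).update(attributes)

    def get_table(self, qualified_name: str) -> dict:
        catalog, database, table = qualified_name.split("/")
        schema = self._sources[catalog].get_table(database, table)
        return {**schema, **self._business.get(qualified_name, {})}


hive = InMemorySource({("dw", "plays"): {"columns": ["member_id", "title_id"]}})
catalog = FederatedCatalog()
catalog.register_source("hive", hive)
catalog.annotate("hive/dw/plays", owner="data-eng", retention_days="365")
table_info = catalog.get_table("hive/dw/plays")
print(table_info)
```

Tools then talk to one interface regardless of where the schema lives, and attributes like ownership or retention policy sit alongside the schema rather than in each source.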

30
[Diagram: services and tools ecosystem]
Service: Metacat (Federated metadata service); (Federated execution service)
Tools: Big Data API / Big Data Portal; Data movement; Data visualization; Data lineage; Data quality; Pig workflow visualization; Job/Cluster perf visualization

31

32

33

34

35

36 Big Data API
Integration layer for our ecosystem of tools and services
Python library (called Kragle)
Building block for our ETL workflow
Building block for Big Data Portal

37

38

39 Big Data Portal
One stop shop for all big data related tools and services
Built on top of Big Data API

40

41

42

43

44 Open source is an integral part of our strategy to achieve scale

45 Big Data Processing Systems
Services/Tools Ecosystem

46 Why use Open Source?
Collaborate with other internet scale tech companies
Uncharted area/scale, lock-in is not desirable
Need the flexibility to achieve scalability
BUT…
Lots of choices
White box approach

47 Why contribute back?
Non IP or trade secret
Help shape direction of projects
Don’t want to fork and diverge
Attract top talent

48 Why contribute our own tool?
Share our goodness
Set industry standard
Community can help evolve the tool

49

50 Is open source right for you?

51

52 Measuring big data - understanding data by usage
By Charles Smith, Netflix 1:40-2:20pm

53 Eva Tse jobs.netflix.com

