MySQL Data Warehousing Survival Guide Marius Moscovici Steffan Mejia

Topics
  The size of the beast
  Evolution of a Warehouse
  Lessons Learned
  Survival Tips
  Q&A

Size of the beast
  43 servers
    o 36 active
    o 7 standby spares
  16 TB of data in MySQL
  12 TB archived (pre S3 staging)
  4 TB archived (S3)
  3.5B rows in main warehouse
  Largest table ~ 500M rows (MySQL)

Warehouse Evolution - First came slaving
  Problems:
    o Reporting slaves easily fall behind
    o Reporting limited to one-pass SQL

Warehouse Evolution - Then came temp tables
  Problems:
    o Easy to lock replication with temp table creation
    o Slaving becomes fragile

Warehouse Evolution - A Warehouse is Born
  Problems:
    o Warehouse workload limited by what can be performed by a single server

Warehouse Evolution - Workload Distributed
  Problems:
    o No real-time application integration support

Warehouse Evolution - Integrate Real Time Data

Lessons Learned - Warehouse Design
  Workload exceeds available memory

Lessons Learned - Warehouse Design
  Keep joins < available memory
  Heavily denormalize data for effective reporting
  Minimize joins between large tables
  Aggressively archive historical data

Lessons Learned - Data Movement
  mysqldump is your friend
  Sequence parent/child data loads based on ETL assumptions
    o Orders without order lines
    o Order lines without orders
  Data Movement Use Cases
    o Full
    o Incremental
    o Upsert (INSERT ... ON DUPLICATE KEY UPDATE)

Full Table Loads
  Good for small tables
  Works for tables with no primary key
  Data is fully replaced on each load

Incremental Loads
  Table contains new rows but no updates
  Good for insert-only tables
  High-water mark level included in mysqldump where clause
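The incremental-load bullet can be illustrated with a minimal sketch that assembles a mysqldump command line whose where clause carries the high-water mark. The database, table, and column names (`shop`, `event_log`, `event_id`) are hypothetical, and the quoting assumes the mark is a trusted value read back from the warehouse itself. Only the standard mysqldump options `--no-create-info` and `--where` are used.

```python
import shlex


def incremental_dump_command(database, table, hwm_column, high_water_mark):
    """Build a mysqldump argv that exports only rows above the high-water mark."""
    # Assumption: the mark is a trusted literal (number or date) taken from the
    # target table, so plain quoting is acceptable for this sketch.
    where = f"{hwm_column} > '{high_water_mark}'"
    return [
        "mysqldump",
        "--no-create-info",  # dump rows only; the target table already exists
        f"--where={where}",  # restrict the dump to rows past the mark
        database,
        table,
    ]


# Hypothetical names, for illustration only.
cmd = incremental_dump_command("shop", "event_log", "event_id", 1048576)
print(shlex.join(cmd))
```

Returning an argument list instead of a shell string keeps the where clause intact when the command is passed to a process launcher without a shell.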

Upsert Loads
  Table contains new and updated rows
  Table must have primary key
  Can be used to update only subset of columns
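As a rough illustration of the upsert case, the sketch below generates a MySQL `INSERT ... ON DUPLICATE KEY UPDATE` statement that refreshes only a chosen subset of columns. The helper name `build_upsert` and the table and column names are hypothetical. The `VALUES(col)` form shown is the long-standing MySQL syntax; recent MySQL 8.0 releases also offer a row-alias form.

```python
def build_upsert(table, key_columns, update_columns):
    """Return a parameterized MySQL upsert touching only update_columns."""
    columns = list(key_columns) + list(update_columns)
    placeholders = ", ".join(["%s"] * len(columns))
    # Only the listed columns are overwritten when the primary key already exists.
    assignments = ", ".join(f"{c} = VALUES({c})" for c in update_columns)
    return (
        f"INSERT INTO {table} ({', '.join(columns)}) "
        f"VALUES ({placeholders}) "
        f"ON DUPLICATE KEY UPDATE {assignments}"
    )


# Hypothetical dimension table, for illustration only.
sql = build_upsert("customer_dim", ["customer_id"], ["status", "type"])
print(sql)
```

The statement depends on the table having a primary (or unique) key, which is why the slide lists that as a requirement.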

Lessons Learned - ETL Design
  Avoid large joins like the plague
  Break out ETL jobs into bite-size pieces
  Ensure target data integrity on ETL failure
  Use memory staging tables to boost performance

ETL Design - Sample Problem Build a daily summary of customer event log activity

ETL Design - Sample Solution

ETL Pseudo code - Step 1
  1) Create staging table & find high-water mark:
    SELECT IFNULL(MAX(calendar_date),' ') FROM user_event_log_summary;
    set max_heap_table_size =
    CREATE TEMPORARY TABLE user_event_log_summary_staging (.....) ENGINE = MEMORY;
    CREATE INDEX user_idx USING HASH on user_event_log_summary_staging(user_id);

ETL Pseudo code - Step 2
  2) Summarize events:
    INSERT INTO user_event_log_summary_staging
      (calendar_date, user_id, event_type, event_count)
    SELECT DATE(event_time), user_id, event_type, COUNT(*)
    FROM event_log
    WHERE event_time > '23:59:59')
    GROUP BY 1,2,3;

ETL Pseudo code - Step 3
  3) Set denormalized user columns:
    UPDATE user_event_log_summary_staging log_summary, user
    SET log_summary.type = user.type,
        log_summary.status = user.status
    WHERE user.user_id = log_summary.user_id;

ETL Pseudo code - Step 4
  4) Insert into target table:
    INSERT INTO user_event_log_summary (...)
    SELECT ... FROM user_event_log_summary_staging;
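The four steps can be tried end to end with the small sketch below. It uses Python's built-in sqlite3 module as a stand-in for MySQL, so it is not the presenters' code: a TEMP table replaces the MEMORY engine, the multi-table UPDATE becomes correlated subqueries, and the `'1970-01-01'` default for an empty target is an assumption (the slide's value was lost). Table and column names follow the slides.

```python
import sqlite3


def build_daily_summary(conn):
    cur = conn.cursor()
    # Step 1: find the high-water mark and create an empty staging table.
    (hwm,) = cur.execute(
        "SELECT IFNULL(MAX(calendar_date), '1970-01-01') FROM user_event_log_summary"
    ).fetchone()
    cur.execute("DROP TABLE IF EXISTS temp.summary_staging")
    cur.execute(
        "CREATE TEMP TABLE summary_staging (calendar_date TEXT, user_id INTEGER,"
        " event_type TEXT, event_count INTEGER, type TEXT, status TEXT)"
    )
    # Step 2: summarize only events after the last fully loaded day.
    cutoff = hwm + " 23:59:59"
    cur.execute(
        "INSERT INTO summary_staging (calendar_date, user_id, event_type, event_count)"
        " SELECT DATE(event_time), user_id, event_type, COUNT(*)"
        " FROM event_log WHERE event_time > ?"
        " GROUP BY DATE(event_time), user_id, event_type",
        (cutoff,),
    )
    # Step 3: copy the denormalized user columns onto the small staging table.
    cur.execute(
        "UPDATE summary_staging SET"
        " type = (SELECT u.type FROM user u WHERE u.user_id = summary_staging.user_id),"
        " status = (SELECT u.status FROM user u WHERE u.user_id = summary_staging.user_id)"
    )
    # Step 4: append the finished rows to the target and commit once.
    cur.execute(
        "INSERT INTO user_event_log_summary"
        " SELECT calendar_date, user_id, event_type, event_count, type, status"
        " FROM summary_staging"
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE event_log (event_time TEXT, user_id INTEGER, event_type TEXT);
CREATE TABLE user (user_id INTEGER PRIMARY KEY, type TEXT, status TEXT);
CREATE TABLE user_event_log_summary (calendar_date TEXT, user_id INTEGER,
    event_type TEXT, event_count INTEGER, type TEXT, status TEXT);
INSERT INTO user VALUES (1, 'free', 'active');
INSERT INTO event_log VALUES ('2024-01-01 08:00:00', 1, 'login');
INSERT INTO event_log VALUES ('2024-01-01 09:30:00', 1, 'login');
INSERT INTO event_log VALUES ('2024-01-02 10:00:00', 1, 'purchase');
""")
build_daily_summary(conn)
rows = conn.execute(
    "SELECT * FROM user_event_log_summary ORDER BY calendar_date"
).fetchall()
print(rows)
```

Because the target is written only in the final statement, a failure in steps 1 to 3 leaves the summary table untouched. The high-water mark assumes the build runs after a day has closed; events arriving later on an already summarized day would be skipped.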

Functional Partitioning
  Benefits depend on
    o Partition execution times
    o Data move times
    o Dependencies between functional partitions
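One way to reason about these three factors is a small critical-path estimate: each partition finishes after its dependencies finish, plus its own build and data-move time. The sketch below is a simplified model, not something from the talk. The partition names and hour figures are made up, and it assumes every partition has its own server and that the dependencies contain no cycles.

```python
def parallel_build_hours(partitions):
    """partitions maps name -> (build_hours, move_hours, [dependency names])."""
    finish = {}

    def finish_time(name):
        # A partition can start only when all of its dependencies have finished.
        if name not in finish:
            build, move, deps = partitions[name]
            start = max((finish_time(d) for d in deps), default=0.0)
            finish[name] = start + build + move
        return finish[name]

    return max(finish_time(name) for name in partitions)


# Hypothetical figures, for illustration only.
partitions = {
    "orders": (4.0, 0.5, []),
    "events": (5.0, 1.0, []),
    "finance": (3.0, 0.5, ["orders"]),
}
serial_hours = sum(build for build, _move, _deps in partitions.values())
parallel_hours = parallel_build_hours(partitions)
print(serial_hours, parallel_hours)
```

In this made-up case the single-server build takes 12 hours of build time while the partitioned build is bounded by the orders-then-finance chain, so long data moves or deep dependency chains quickly eat into the gain.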

Functional Partitioning

Job Management
  Run everything single-threaded on a server
  Handle dependencies between jobs across servers
  Smart re-start key to survival
  Implemented 3-level hierarchy of processing
    o Process (collection of build steps and data moves)
    o Build Steps (ETL 'units of work')
    o Data Moves
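The "smart re-start" idea can be sketched as a process that records which build steps and data moves have completed and skips them on the next attempt. This is a minimal illustration under assumed names (`run_process`, `StepFailed`, the step names). The talk does not describe the actual implementation, and a real system would persist the completed set in a table rather than in memory.

```python
class StepFailed(Exception):
    """Raised with the name of the step that stopped the process."""


def run_process(steps, completed, run_step):
    """Run steps in order, single-threaded, skipping those already completed."""
    for name in steps:
        if name in completed:
            continue  # finished in an earlier attempt; do not redo the work
        try:
            run_step(name)
        except Exception as exc:
            raise StepFailed(name) from exc
        completed.add(name)


# Simulated units of work: the data move fails once, then succeeds.
calls = []
fail_once = {"move_orders"}


def run_step(name):
    calls.append(name)
    if name in fail_once:
        fail_once.discard(name)
        raise RuntimeError("simulated failure")


steps = ["build_orders", "move_orders", "build_summary"]
completed = set()
try:
    run_process(steps, completed, run_step)
except StepFailed as failed:
    print("stopped at", failed)
run_process(steps, completed, run_step)  # restart resumes at the failed step
print(calls)
```

On restart only the failed data move and the steps after it run again, which is the property that makes long builds recoverable without starting over.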

DW Replication
  Similar to other MySQL environments
    o Commodity hardware
    o Master-slave pairs for all databases
  Mixed environments can be difficult
    o Use rsync to create slaves
    o But not with ssh (on private network)
  Monitoring
    o Reporting queries need to be monitored
        - Beware of blocking queries
        - Only run reporting queries on slave (temp table issues)
    o Nagios
    o Ganglia
    o Custom scripts
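A custom monitoring script of the kind listed above might look like the following sketch, which maps one row of `SHOW SLAVE STATUS` output to a Nagios-style exit code (0 OK, 1 WARNING, 2 CRITICAL). The function name and the thresholds are assumptions rather than values from the talk; the field names `Seconds_Behind_Master`, `Slave_IO_Running`, and `Slave_SQL_Running` are the standard columns of that statement.

```python
def replication_check(status, warn_seconds=300, crit_seconds=1800):
    """Return (exit_code, message) for one SHOW SLAVE STATUS row given as a dict."""
    lag = status.get("Seconds_Behind_Master")
    threads_ok = (
        status.get("Slave_IO_Running") == "Yes"
        and status.get("Slave_SQL_Running") == "Yes"
    )
    # A NULL lag or a stopped thread means replication is not applying changes.
    if not threads_ok or lag is None:
        return 2, "CRITICAL: replication stopped"
    if lag >= crit_seconds:
        return 2, f"CRITICAL: {lag}s behind master"
    if lag >= warn_seconds:
        return 1, f"WARNING: {lag}s behind master"
    return 0, f"OK: {lag}s behind master"


# Hypothetical status rows, for illustration only.
healthy = {"Slave_IO_Running": "Yes", "Slave_SQL_Running": "Yes", "Seconds_Behind_Master": 12}
lagging = {"Slave_IO_Running": "Yes", "Slave_SQL_Running": "Yes", "Seconds_Behind_Master": 900}
broken = {"Slave_IO_Running": "Yes", "Slave_SQL_Running": "No", "Seconds_Behind_Master": None}
for row in (healthy, lagging, broken):
    print(replication_check(row))
```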

Infrastructure Planning
  Replication latency
    o Warehouse slave unable to keep up
    o Disk utilization > 95%
    o Required frequent re-sync
  Options evaluated
    o Higher speed conventional disks
    o RAM increase
    o Solid-state disks

Optimization
  Check / reset HW RAID settings
  Use general query log to track ETL / queries
  Application timing
    o Isolate poor-performing parts of the build
  Optimize data storage - automatic roll-off of older data

Infrastructure Changes
  Increased memory 32 GB -> 64 GB
  New servers have 96 GB RAM
  SSD Solution
    o 12 & 16 disk configurations
    o RAID6 vs. RAID10
    o 2.0 TB or 1.6 TB formatted capacity
    o SATA2 HW BBU RAID6
    o ~ 8 TB data on SSD

Results
  Sometimes it pays to throw hardware at a problem
    o 15-hour warehouse builds on old system
    o 6 hours on optimized system
    o No application changes

Finally...Archive
  Two-tiered solution
  Move data into archive tables in separate DB
  Use select to dump data - efficient and fast
  Archive server handles migration
    o Dump data
    o GPG
    o Push to S3

Survival Tips
  Efforts to scale are non-linear
    o As you scale, it becomes increasingly difficult to manage
    o Be prepared to supplement your warehouse strategy
        - Dedicated appliance
        - Distributed processing (Hadoop, etc.)
  You can gain a great deal of headroom by optimizing I/O
    o Optimize current disk I/O path
    o Examine SSD / Flash solutions
    o Be pragmatic about table designs
  It's important to stay ahead of the performance curve
    o Be proactive - monitor growth, scale early
  Monitor everything, including your users
    o Bad queries can bring replication down