MySQL Data Warehousing Survival Guide
Marius Moscovici
Steffan Mejia
Topics
The size of the beast
Evolution of a Warehouse
Lessons Learned
Survival Tips
Q&A
Size of the beast
43 servers
o 36 active
o 7 standby spares
16 TB of data in MySQL
12 TB archived (pre-S3 staging)
4 TB archived (S3)
3.5B rows in main warehouse
Largest table ~500M rows (MySQL)
Warehouse Evolution - First came slaving
Problems:
o Reporting slaves easily fall behind
o Reporting limited to one-pass SQL
Warehouse Evolution - Then came temp tables
Problems:
o Easy to lock replication with temp table creation
o Slaving becomes fragile
Warehouse Evolution - A Warehouse is Born
Problems:
o Warehouse workload limited by what can be performed by a single server
Warehouse Evolution - Workload Distributed
Problems:
o No real-time application integration support
Warehouse Evolution - Integrate Real Time Data
Lessons Learned - Warehouse Design
Workload exceeds available memory
Lessons Learned - Warehouse Design
Keep joins < available memory
Heavily denormalize data for effective reporting
Minimize joins between large tables
Aggressively archive historical data
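A minimal sketch of what "heavily denormalized" looks like for the daily summary table used later in this deck; the column types and the choice of denormalized user columns are assumptions:

-- Hypothetical DDL for the denormalized daily summary table.
-- user.type and user.status are copied onto every row so reports
-- never have to join back to the large user table.
CREATE TABLE user_event_log_summary (
    calendar_date DATE         NOT NULL,
    user_id       INT UNSIGNED NOT NULL,
    event_type    VARCHAR(32)  NOT NULL,
    event_count   INT UNSIGNED NOT NULL,
    type          VARCHAR(32),            -- denormalized from user.type
    status        VARCHAR(32),            -- denormalized from user.status
    PRIMARY KEY (calendar_date, user_id, event_type)
) ENGINE = InnoDB;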
Lessons Learned - Data Movement
mysqldump is your friend
Sequence parent/child data loads based on what the ETL assumes it may see:
o Orders without order lines
o Order lines without orders
Data movement use cases:
o Full
o Incremental
o Upsert (INSERT ... ON DUPLICATE KEY UPDATE)
Full Table Loads
Good for small tables
Works for tables with no primary key
Data is fully replaced on each load
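For reference, the essence of a full reload as mysqldump emits it by default: drop, recreate, re-insert (the table and rows below are illustrative):

-- Works even without a primary key because nothing is matched or updated;
-- the table is simply rebuilt from scratch on every load.
DROP TABLE IF EXISTS country_codes;
CREATE TABLE country_codes (
    country_code CHAR(2)     NOT NULL,
    country_name VARCHAR(64) NOT NULL
);
INSERT INTO country_codes VALUES
    ('US', 'United States'),
    ('CA', 'Canada');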
Incremental Loads
Table contains new rows but no updates
Good for insert-only tables
High-water mark included in the mysqldump WHERE clause
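A minimal sketch of the high-water-mark pattern, reusing the event_log table from the ETL example; on the command line the predicate is what gets passed to mysqldump's --where option (the timestamp value is illustrative):

-- On the warehouse: find the newest row already loaded (the high-water mark).
SELECT MAX(event_time) FROM event_log;

-- On the source: dump only rows past the mark.  mysqldump --where="..."
-- produces the same rows as this query:
SELECT * FROM event_log
WHERE event_time > '2010-03-31 23:59:59';   -- value from the query above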
Upsert Loads
Table contains new and updated rows
Table must have primary key
Can be used to update only a subset of columns
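A sketch of the upsert pattern (table and column names are illustrative); only the columns named in the UPDATE clause change on rows whose key already exists:

-- New rows are inserted; existing rows (matched on the primary key)
-- get only status and updated_at refreshed.
INSERT INTO orders (order_id, user_id, status, updated_at)
VALUES (1001, 42, 'shipped', NOW())
ON DUPLICATE KEY UPDATE
    status     = VALUES(status),
    updated_at = VALUES(updated_at);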
Lessons Learned - ETL Design
Avoid large joins like the plague
Break ETL jobs into bite-sized pieces
Ensure target data integrity on ETL failure
Use memory staging tables to boost performance
ETL Design - Sample Problem
Build a daily summary of customer event log activity
ETL Design - Sample Solution
ETL Pseudo code - Step 1
1) Create staging table & find high-water mark:

SELECT IFNULL(MAX(calendar_date), '0000-00-00')   -- sentinel for the very first load
FROM user_event_log_summary;

SET max_heap_table_size = ...;   -- sized so the staging table fits in memory

CREATE TEMPORARY TABLE user_event_log_summary_staging (.....) ENGINE = MEMORY;

CREATE INDEX user_idx USING HASH
ON user_event_log_summary_staging (user_id);
ETL Pseudo code - Step 2
2) Summarize events:

INSERT INTO user_event_log_summary_staging
    (calendar_date, user_id, event_type, event_count)
SELECT DATE(event_time), user_id, event_type, COUNT(*)
FROM event_log
WHERE event_time > '<high-water mark> 23:59:59'   -- value captured in step 1
GROUP BY 1, 2, 3;
ETL Pseudo code - Step 3
3) Set denormalized user columns:

UPDATE user_event_log_summary_staging log_summary, user
SET log_summary.type = user.type,
    log_summary.status = user.status
WHERE user.user_id = log_summary.user_id;
ETL Pseudo code - Step 4
4) Insert into target table:

INSERT INTO user_event_log_summary (...)
SELECT ...
FROM user_event_log_summary_staging;
Functional Partitioning
Benefits depend on:
o Partition execution times
o Data move times
o Dependencies between functional partitions
Functional Partitioning
Job Management
Run everything single-threaded on a server
Handle dependencies between jobs across servers
Smart re-start is key to survival
Implemented a 3-level hierarchy of processing:
o Process (collection of build steps and data moves)
o Build Steps (ETL 'units of work')
o Data Moves
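The job-control schema isn't shown in the deck; purely as an illustration, the process / build-step hierarchy with smart restart could be tracked with tables like these (all names and columns are hypothetical):

-- Hypothetical bookkeeping for processes and their build steps / data moves.
CREATE TABLE dw_process (
    process_id   INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    process_name VARCHAR(64) NOT NULL,
    run_date     DATE        NOT NULL,
    status       ENUM('running','done','failed') NOT NULL DEFAULT 'running'
) ENGINE = InnoDB;

CREATE TABLE dw_build_step (
    step_id    INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    process_id INT UNSIGNED NOT NULL,
    step_name  VARCHAR(64)  NOT NULL,    -- ETL 'unit of work' or data move
    status     ENUM('pending','running','done','failed') NOT NULL DEFAULT 'pending',
    FOREIGN KEY (process_id) REFERENCES dw_process (process_id)
) ENGINE = InnoDB;

-- Smart restart: re-run only the steps that never finished.
SELECT step_id, step_name
FROM dw_build_step
WHERE process_id = 123 AND status <> 'done'
ORDER BY step_id;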
DW Replication
Similar to other MySQL environments
o Commodity hardware
o Master-slave pairs for all databases
Mixed environments can be difficult
o Use rsync to create slaves
o But not over ssh (we run on a private network)
Monitoring
o Reporting queries need to be monitored
  - Beware of blocking queries
  - Only run reporting queries on the slave (temp-table issues)
o Nagios
o Ganglia
o Custom scripts
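The exact checks aren't shown in the deck; a custom monitoring script might run statements along these lines (the 600-second threshold is an assumption):

-- How far behind is the reporting slave?  Watch Seconds_Behind_Master.
SHOW SLAVE STATUS;

-- Spot long-running reporting queries that could block replication.
SELECT id, user, time, state, LEFT(info, 120) AS query_head
FROM information_schema.processlist
WHERE command = 'Query' AND time > 600
ORDER BY time DESC;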
Infrastructure Planning
Replication latency
o Warehouse slave unable to keep up
o Disk utilization > 95%
o Required frequent re-syncs
Options evaluated
o Higher-speed conventional disks
o RAM increase
o Solid-state disks
Optimization
Check / reset HW RAID settings
Use the general query log to track ETL / queries
Application timing
o Isolate poor-performing parts of the build
Optimize data storage - automatic roll-off of older data
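A sketch of both ideas, assuming MySQL 5.1+ for the table-based general log; the roll-off DELETE and its 13-month retention window are illustrative, not the production mechanism:

-- Capture exactly which statements an ETL run issues.
SET GLOBAL log_output  = 'TABLE';
SET GLOBAL general_log = 'ON';
-- ... run the build ...
SET GLOBAL general_log = 'OFF';
SELECT event_time, LEFT(argument, 120) AS statement_head
FROM mysql.general_log
ORDER BY event_time;

-- One way to roll off older data: delete rows past the retention window.
DELETE FROM user_event_log_summary
WHERE calendar_date < CURDATE() - INTERVAL 13 MONTH;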
Infrastructure Changes
Increased memory 32GB -> 64GB
New servers have 96GB RAM
SSD solution
o 12 & 16 disk configurations
o RAID6 vs. RAID10
o 2.0TB or 1.6TB formatted capacity
o SATA2, HW BBU, RAID6
o ~8 TB of data on SSD
Results
Sometimes it pays to throw hardware at a problem
o 15-hour warehouse builds on the old system
o 6 hours on the optimized system
o No application changes
Finally... Archive
Two-tiered solution
Move data into archive tables in a separate DB
Use SELECT to dump data - efficient and fast
Archive server handles the migration
o Dump data
o GPG-encrypt
o Push to S3
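The dump statement isn't shown in the deck; a minimal sketch of the SELECT-based archive dump, with hypothetical table names and file path (the GPG encryption and S3 push happen outside MySQL):

-- Copy rows past the retention window into an archive table in a separate DB,
-- dump them to a flat file, then remove them from the warehouse.
INSERT INTO archive_db.event_log_2009
SELECT * FROM warehouse.event_log
WHERE event_time < '2010-01-01';

SELECT * FROM archive_db.event_log_2009
INTO OUTFILE '/var/dumps/event_log_2009.tsv'    -- path is illustrative
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n';

DELETE FROM warehouse.event_log
WHERE event_time < '2010-01-01';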
Survival Tips
Efforts to scale are non-linear
o As you scale, it becomes increasingly difficult to manage
o Be prepared to supplement your warehouse strategy
  - Dedicated appliance
  - Distributed processing (Hadoop, etc.)
You can gain a great deal of headroom by optimizing I/O
o Optimize the current disk I/O path
o Examine SSD / flash solutions
o Be pragmatic about table designs
It's important to stay ahead of the performance curve
o Be proactive - monitor growth, scale early
Monitor everything, including your users
o Bad queries can bring replication down