Published by Gervase Simpson. Modified over 6 years ago.
1
Battle Scars: Porting a SQL Server application to Netezza
October 1, 2016 Aaron N. Cutshall, MSCIS, MSHI
2
Just who is this guy?
Sr. BI Solutions Architect
Chapter President
Speaker – various events
30 Years
B.S. Computer Science
M.S. Computer Information Systems
M.S. Health Informatics
10/01/2016 SQL Saturday #557 – Minnesota 2016
3
SQL Saturday #557: Thank you Sponsors!
Please visit the sponsors and enter their end-of-day raffles.
Event After Party
  Sky Deck Sports Grille and Lanes at the Mall of America at 7 PM.
Want More Free Training?
  PASSMN meets the 3rd Tuesday of every month.
4
Lunch Sponsor - Dell EMC
For those who paid for lunch already, we will refund you via PayPal. If you wish to donate to Rebecca CoderDojo, please drop your ticket in the bucket at registration.
5
You Rock Sponsor - Pyramid Analytics
Gold Sponsors: IDERA, Pragmatic Works, VMware, GNet, Tail Wind, Microsoft, Dell Software
6
Other Sponsors
Silver Sponsors: Improving, Experts Exchange, Pure Storage
Bronze Sponsors: SQL Sentry, COZYROC, PASS
Blog Sponsors: SQLVariant
7
Battle Scars Agenda
Got Big Data? What are your options?
  SMP – Symmetric Multi-Processor
  MPP – Massive Parallel Processor
  NRDS – Non-Relational Data Stores (NoSQL)
CoreANALYTICS Migration Project
  Challenges
  Lessons Learned
Migrating from SQL Server to Netezza
  What you gain (the good)
  What you lose (the bad)
  Common stumbling blocks (the ugly)
8
Databases on ACID
Atomicity
  Transaction commit/rollback must be complete (“All or nothing”)
  Significant overhead on shared-nothing systems
Consistency
  Data written must be valid according to all rules
  Includes constraints, cascades, and foreign keys
Isolation
  No transaction should be influenced by another
  Should be same as if transactions were serial
Durability
  Once transaction is committed it will remain so
  Transactions can be recovered if necessary
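The "all or nothing" rule can be illustrated with a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in database (an assumption made so the example runs anywhere; it is neither SQL Server nor Netezza), and the account table and transfer function are hypothetical.

```python
import sqlite3


def transfer(conn, src, dst, amount):
    """Credit dst and debit src as one transaction; undo both on failure."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE id = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE id = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


conn = sqlite3.connect(":memory:")
# The CHECK constraint is the consistency rule: no negative balances.
conn.execute(
    "CREATE TABLE account ("
    "id INTEGER PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO account VALUES (?, ?)", [(1, 100), (2, 50)])
conn.commit()

ok = transfer(conn, 1, 2, 30)       # both updates succeed and commit
failed = transfer(conn, 1, 2, 500)  # credit succeeds, debit violates CHECK
balances = dict(conn.execute("SELECT id, balance FROM account"))
print(ok, failed, balances)
```

In the second transfer the credit is applied first and then rolled back when the debit fails, so neither half of the transaction survives.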
9
Symmetric Multi-Processor
Single Systems
  Multiple processors (2-64 or more)
Shared Everything Architecture
  Memory
  Data Storage
  Operating System
ACID Compliant
Examples:
  SQL Server
  Oracle
  PostgreSQL
  MySQL
10
Symmetric Multi-Processor
Benefits
  Low cost for implementation/maintenance
  Highly structured relational database
  High speeds at low volumes
  Supports referential integrity
Drawbacks
  Hardware scalability is limited due to architecture
  Data volume scalability is somewhat limited
  Shared systems present several performance bottlenecks
11
Symmetric Multi-Processor
Primary Purpose
  Online Transaction Processing (OLTP)
  Very efficient performance for small volumes of data
  Highly reliable
12
Massive Parallel Processor
Multiple Nodes
  Each node with multiple processors (2-64 or more)
Shared Disk Architecture (Exadata)
Shared Nothing Architecture (Netezza, Teradata)
  Memory
  Data Storage
  Operating System
High speed connection
ACID Compliant
13
Massive Parallel Processor
Benefits
  Highly scalable for hardware growth
  Parallel storage/query for high volume
  Generally few hardware bottlenecks
  Single node failure does not bring entire system down
  Fast Inserts and Selects
Drawbacks
  Often a high cost for implementation/maintenance
  Loss of referential integrity across nodes
  No triggers (not a bad thing IMHO)
  Not optimal for small volume data processing
  Slow Updates and Deletes
14
Massive Parallel Processor
Primary Purpose
  Online Analytical Processing (OLAP)
  Very efficient performance for large volumes of data
  Highly reliable
15
Non-Relational Data Stores
Each server separate
  Usually governed by a master server (an exception is Riak, which uses a ring)
  Separate CPU, RAM, and storage
Not usually an appliance
  Often low-cost computers
  Cluster configuration
Open source options
  Paid support available
Non-relational (NoSQL)
  Many now have SQL options
  Effective on known structures
  Work with traditional RDBMS
16
Non-Relational Data Stores
Key-Value Stores
  Redis, CouchDB, MUMPS
Column Stores
  HBase, Cassandra, Vertica
Document Stores
  MongoDB, Lotus Notes
Graph Databases
  Neo4j, AllegroGraph, OrientDB
Multi-purpose
  Hadoop
17
Non-Relational Data Stores
Benefits
  Highly scalable for hardware growth
  Data sharded across servers (similar in concept to RAID striping)
  Parallel storage/query for high volume
  Single unit failure does not bring entire system down
  Does not require highly structured organization
Drawbacks
  No referential integrity (not optimal for OLTP)
  Data not usually organized for easy OLAP DW queries
  Not as transaction-oriented as SMP or MPP systems
  Not ACID compliant; uses BASE (Basically Available, Soft state, Eventual consistency) methods instead
  Hadoop with Hive SQL can be configured to behave similarly to MPP
18
Non-Relational Data Stores
Primary Purpose
  Handles both small and large loads
  Very efficient performance for mixed volumes of data
Store first, organize later
  Quick, no hassle storage
  Handles multiple data types
Great for research
  Heavily used by NASA
  Fast load – any structure
  Fast read – parallelism
19
Mid-point Review
Traditional configurations
  SMP alone (OLTP with perhaps some OLAP)
  MPP alone (OLAP with perhaps some non-OLAP)
  NRDS alone (often with some form of AP)
Some scenarios becoming more common
  SMP (OLTP) → MPP (OLAP)
  NRDS (Unstructured) → SMP (Structured)
  SMP (Extracts from multiple sources) → NRDS
General Characteristics
  SMP – high speed, low volume transactions
  MPP – high volume transactions
  NRDS – unstructured data (low or high volume)
20
Mid-point Review
Relational (SMP, MPP)
  Structure created BEFORE data is loaded
  Relies upon the relationships between different data
  Supports relational integrity among data
  Requires queries to be aware of the structure
Non-relational (NRDS)
  No structural restrictions when data is loaded
  Does not support relationships
  No relational integrity among data
  Requires applications to be aware of the structure
All data is structured at some point
  Before data is loaded (SMP, MPP)
  When data is queried (NRDS)
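The contrast between structure before load and structure at query time can be sketched in a few lines. This is an illustration only: sqlite3 stands in for a relational system, a plain list of JSON strings stands in for a non-relational store, and the vitals table and field names are hypothetical.

```python
import json
import sqlite3

# Structure BEFORE load: the table definition rejects rows that do not fit.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE vitals (patient_id INTEGER NOT NULL, systolic INTEGER NOT NULL)"
)
conn.execute("INSERT INTO vitals VALUES (?, ?)", (1, 120))
try:
    # Three values for a two-column table: refused at load time.
    conn.execute("INSERT INTO vitals VALUES (?, ?, ?)", (2, 98.6, "ok"))
    rejected = False
except sqlite3.Error:
    rejected = True

# Structure WHEN QUERIED: anything loads, and the reader imposes the shape.
documents = [
    '{"patient_id": 1, "bp": "120/80"}',
    '{"patient_id": 2, "temp_f": 98.6, "notes": "ok"}',
]


def systolic_readings(docs):
    """Interpret each document at read time; skip ones without a 'bp' field."""
    readings = {}
    for text in docs:
        doc = json.loads(text)
        if "bp" in doc:
            readings[doc["patient_id"]] = int(doc["bp"].split("/")[0])
    return readings


relational = dict(conn.execute("SELECT patient_id, systolic FROM vitals"))
schemaless = systolic_readings(documents)
print(rejected, relational, schemaless)
```

Both paths end with the same answer for patient 1; the difference is whether the database or the application had to know the structure.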
21
Mid-point Review
Some SMP systems include MPP type capabilities
  Microsoft SQL Server Analysis Services
  MySQL (combined with other tools like Ubiq)
Many MPP systems incorporating NRDS solutions
  Microsoft PDW with Hadoop (now Analytics Platform System)
  Teradata with Aster Data and Hadoop
Many NRDS systems becoming more like MPP
  Hadoop with Hive (SQL interface)
  Cassandra with CQL (Cassandra Query Language)
  NoSQL becoming more “Not Only SQL”
22
Mid-point Review
HUGE diversity in data options
23
Mid-point Review
Most of the division lines are becoming blurred
  No clear distinction anymore between the systems
  No well-defined selection parameters
  Multiple systems applicable for multiple use cases
Changes are happening!!
Alan Kay
24
CoreANALYTICS
Measure Calculation Engine
  Secondary use of clinical data
  Clinical Quality and Utilization Measures
  Used for attestation for Meaningful Use (HITECH Act)
Designed to run on SQL Server
  Has components of both transactional and analytical processing
Data pulled from existing EMR systems
  Stored in a relational data warehouse
  Measures calculated on all available data
Results are stored in typical DW type format
  Individual patient/encounter level measure results
  Aggregated facility results
25
CoreANALYTICS
Compiled Measures
  Created/edited by non-programmers in XML
  Compiled into stored procedures
Client is a major healthcare provider system
  Data for 70+ hospitals
  Uses IBM PureData System (Netezza) as their DW
Changes to implementation plans
  Want to capitalize on their DW investment
Implementation challenges
  Volume of data
  Modifications to stored procedures
  Differences from SQL Server
  Lack of available Netezza resources
26
CoreANALYTICS port to Netezza
Had to run natively on Netezza (NZ) v6
  Uses Aginity Workbench (64K limitation)
Will process more measures, with data for more hospitals and facilities, than the existing application
Must beat the performance of the existing custom-developed application, which was already tuned for performance
Firm that built the original custom app lost the project bid yet was retained to administer the database platform
27
CoreANALYTICS port to Netezza
Uses a variant of PostgreSQL
  Common roots with Oracle
  Functions and date calculations quite different
  CTEs not available in v6 (non-recursive CTEs available in v7.0.3)
No temporary database
No correlated sub-queries in joins
No identity columns, must use sequences
Data is sharded with horizontal partitioning over multiple nodes
Optimized for ultra fast inserts and selects
28
Lessons Learned: Data Distribution
No indexes!
  Netezza knows where NOT to look
  Depends upon data distribution key and organization
Netezza distributes data across multiple nodes
  Distribution key is based upon up to 4 fields in the data with the highest selectivity (similar to a primary key)
  Creates a hash value for the key
  Related tables with the same distribution key have the same hash value (Parent-Child relationships)
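The idea that related tables sharing a distribution key end up together can be simulated in a few lines. This is a toy model under stated assumptions: the MD5-based hash, the four data slices, and the patient/encounter tables are all hypothetical and do not reflect Netezza's actual hash function or storage layout.

```python
import hashlib

NUM_SLICES = 4  # hypothetical number of data slices


def data_slice(key, num_slices=NUM_SLICES):
    """Map a distribution-key value to a data slice with a stable hash."""
    digest = hashlib.md5(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_slices


def distribute(rows, key_column):
    """Place each row on the slice chosen by hashing its distribution key."""
    slices = {s: [] for s in range(NUM_SLICES)}
    for row in rows:
        slices[data_slice(row[key_column])].append(row)
    return slices


# Hypothetical parent and child tables, both distributed on patient_id.
patients = [{"patient_id": i} for i in range(1, 9)]
encounters = [
    {"encounter_id": 100 + i, "patient_id": (i % 8) + 1} for i in range(20)
]

patient_slices = distribute(patients, "patient_id")
encounter_slices = distribute(encounters, "patient_id")

# A join on patient_id stays local if every encounter on a slice
# finds its parent patient on that same slice.
colocated = all(
    {e["patient_id"] for e in encounter_slices[s]}
    <= {p["patient_id"] for p in patient_slices[s]}
    for s in range(NUM_SLICES)
)
print(colocated)
```

Because both tables hash the same key with the same function, parent and child rows always meet on the same slice, which is what makes a local join possible.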
29
Lessons Learned: Data Distribution
Random distribution
  Spreads data out uniformly across nodes
  Improves retrieval by allowing more nodes to process
Ordered distribution
  Places data for certain keys on specific nodes
  Allows related data to remain on the same node (data slice)
Data Organization
  Option to organize data according to a specific field
  Similar in concept to a clustered index
  Data reorganized when the groom process occurs
30
Lessons Learned: Data Distribution
Selecting a distribution key:
  The more distinct the distribution key values, the better
  The system distributes rows with the same distribution key value to the same data slice
  Tables used together should use the same columns for their distribution key (Parent-Child relationships)
  If a particular key is used largely in equi-join clauses (INNER JOIN), then that key is a good choice for the distribution key so the system can perform the join operation locally
  Avoid data skew: an imbalance in data distribution
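Why distinct key values matter can be shown by measuring skew for two candidate keys. The sketch below is a simulation only: the hash function, the slice count, and the sample columns are assumptions, and the skew ratio is just one simple way to express imbalance.

```python
import hashlib
from collections import Counter

NUM_SLICES = 4  # hypothetical number of data slices


def data_slice(key):
    """Map a distribution-key value to a data slice with a stable hash."""
    digest = hashlib.md5(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_SLICES


def skew(keys):
    """Ratio of the fullest slice to the average slice (1.0 is perfectly even)."""
    counts = Counter(data_slice(k) for k in keys)
    average = len(keys) / NUM_SLICES
    return max(counts.values()) / average


# Candidate 1: a nearly unique column (hypothetical encounter IDs).
encounter_ids = list(range(10_000))
# Candidate 2: a column with only two distinct values.
genders = ["F" if i % 2 else "M" for i in range(10_000)]

skew_unique = skew(encounter_ids)
skew_low_cardinality = skew(genders)
print(round(skew_unique, 2), skew_low_cardinality)
```

With only two distinct values, at most two of the four slices can ever hold data, so the fullest slice carries at least twice the average load while the others sit idle.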
31
Lessons Learned: Updates & Deletes
NZ built for selects & inserts, not updates or deletes
For updates, inserts a new record and flags the old one for deletion
Requires a “groom” process to physically delete records and reorganize remaining data as required
[Figure: sixteen playing cards (A, 2, 3, 4 of each suit) distributed across Node 1 to Node 4, plus Q♥ and Q♦]
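The update-as-insert behavior and the role of groom can be modeled with a small in-memory class. This is a toy model built only from the description above; the class name and its methods are hypothetical and say nothing about Netezza's real storage format.

```python
class AppendOnlyTable:
    """Toy table where updates append a new row and flag the old one."""

    def __init__(self):
        self.rows = []  # each entry: [key, value, deleted_flag]

    def insert(self, key, value):
        self.rows.append([key, value, False])

    def delete(self, key):
        # Logical delete only: the row stays in storage until groomed.
        for row in self.rows:
            if row[0] == key and not row[2]:
                row[2] = True

    def update(self, key, value):
        # An update is a delete of the old version plus an insert of the new.
        self.delete(key)
        self.insert(key, value)

    def select(self):
        return {key: value for key, value, deleted in self.rows if not deleted}

    def groom(self):
        """Physically remove flagged rows; return how many were reclaimed."""
        before = len(self.rows)
        self.rows = [row for row in self.rows if not row[2]]
        return before - len(self.rows)


table = AppendOnlyTable()
table.insert("card-1", "A♥")
table.insert("card-2", "A♦")
table.update("card-1", "Q♥")
table.delete("card-2")

stored_before = len(table.rows)  # flagged rows still occupy storage
visible = table.select()
reclaimed = table.groom()
stored_after = len(table.rows)
print(stored_before, visible, reclaimed, stored_after)
```

Until groom runs, the table holds three physical rows for a single visible record, which is why heavy update and delete activity degrades storage and scan performance.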
32
Lessons Learned: Updates & Deletes
To update large numbers of records
  Perform the same procedure manually
  Bulk-based method is more efficient
Basic procedure:
  1. Create new table
  2. Insert modified records into new table
  3. Insert unmodified records into new table
  4. Drop old table
  5. Rename new table to old name
Defer batch deletes to after hours if at all possible
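The five-step rebuild can be sketched end to end. The example runs against sqlite3 purely so it is executable anywhere (an assumption; Netezza syntax and options such as distribution clauses are not shown), and the measure_result table and its values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE measure_result (id INTEGER, status TEXT);
    INSERT INTO measure_result VALUES (1, 'pending'), (2, 'pending'), (3, 'void');

    -- 1. Create new table
    CREATE TABLE measure_result_new (id INTEGER, status TEXT);

    -- 2. Insert modified records into new table
    INSERT INTO measure_result_new (id, status)
    SELECT id, 'done' FROM measure_result WHERE status = 'pending';

    -- 3. Insert unmodified records into new table
    INSERT INTO measure_result_new (id, status)
    SELECT id, status FROM measure_result WHERE status <> 'pending';

    -- 4. Drop old table
    DROP TABLE measure_result;

    -- 5. Rename new table to old name
    ALTER TABLE measure_result_new RENAME TO measure_result;
    """
)

rows = conn.execute(
    "SELECT id, status FROM measure_result ORDER BY id"
).fetchall()
print(rows)
```

Every statement here is an insert or a select, so the bulk change never touches the slow update path and leaves no flagged rows behind for groom.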
33
Lessons Learned: Complex Queries
Complex queries require data to be collated from multiple nodes before it can be used
  Requires processing at the master node
MPP systems work closer to the data than SMP systems (no shared memory)
Often faster to run part of the query, copy the result to another table, and complete the query against that table than to run one very complex query
Multiple queries are sometimes faster than a single complex set-based query
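Splitting one complex query into a staged pair might look like the sketch below. It uses sqlite3 as a stand-in, so it shows only the shape of the technique and not the performance effect; the encounter table and the stage_patient_counts name are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE encounter (encounter_id INTEGER, patient_id INTEGER, facility TEXT);
    INSERT INTO encounter VALUES
        (1, 10, 'North'), (2, 10, 'North'),
        (3, 11, 'South'),
        (4, 12, 'South'), (5, 12, 'South');

    -- Step 1: run the first part of the query and persist its result.
    CREATE TABLE stage_patient_counts AS
    SELECT patient_id, facility, COUNT(*) AS encounter_count
    FROM encounter
    GROUP BY patient_id, facility;
    """
)

# Step 2: complete the query against the much smaller staged table.
rows = conn.execute(
    """
    SELECT facility, COUNT(*) AS repeat_patients
    FROM stage_patient_counts
    WHERE encounter_count > 1
    GROUP BY facility
    ORDER BY facility
    """
).fetchall()

# Clean up the intermediate table once the final result is captured.
conn.execute("DROP TABLE stage_patient_counts")
print(rows)
```

Each step is a simple statement that can be planned on its own, and the staged table gives a natural place to inspect intermediate results while tuning.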
34
Lessons Learned: No Correlated Sub-Queries
Problem for LEFT JOIN with multiple tables when more than one field is in the join:
  tbl_Element (ElementId, Date, Description, Taxonomy, Term)
  tbl_Property (PropertyId, ElementId, Taxonomy, Term)
  tbl_ValueSet (ValueSetId, ValueSetName, Taxonomy, Term)
Problem: tbl_Property needs to be restricted by tbl_ValueSet

  -- @SessionId is an assumed placeholder; the value is missing from the slide text
  INSERT INTO tbl_Evidence (SessionId, ElementId, DoesRequiredPropertyExist)
  SELECT @SessionId,
         e.ElementId,
         CASE WHEN p.PropertyId IS NULL THEN 0 ELSE 1 END
  FROM tbl_Element e
  LEFT JOIN tbl_Property p
         ON p.ElementId = e.ElementId
  LEFT JOIN tbl_ValueSet vs
         ON vs.ValueSetName = 'RequiredVSName'
        AND vs.Taxonomy = p.Taxonomy
        AND vs.Term = p.Term;
35
Lessons Learned: No Correlated Sub-Queries
Solution: Replace the Taxonomy/Term pair with a single value
  tbl_TaxTermPair (TaxTermPairId, Taxonomy, Term)
  tbl_Element (ElementId, Date, Description, TaxTermPairId)
  tbl_Property (PropertyId, ElementId, TaxTermPairId)
  tbl_ValueSet (ValueSetId, ValueSetName, TaxTermPairId)
Keeps value set restriction as needed
Eliminates two text-based fields in favor of a numeric ID

  -- @SessionId is an assumed placeholder; the value is missing from the slide text
  INSERT INTO tbl_Evidence (SessionId, ElementId, DoesRequiredPropertyExist)
  SELECT @SessionId,
         e.ElementId,
         CASE WHEN p.PropertyId IS NULL THEN 0 ELSE 1 END
  FROM tbl_Element e
  LEFT JOIN tbl_Property p
         ON p.ElementId = e.ElementId
        AND p.TaxTermPairId IN (SELECT TaxTermPairId
                                FROM tbl_ValueSet
                                WHERE ValueSetName = 'RequiredVSName');
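The restructured join can be exercised on a tiny data set to confirm that the value set restriction holds. The sketch below uses sqlite3 as a stand-in engine (an assumption), trims the tables to the columns the query needs, and uses made-up IDs and a made-up session value of 42.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE tbl_Element (ElementId INTEGER);
    CREATE TABLE tbl_Property (PropertyId INTEGER, ElementId INTEGER, TaxTermPairId INTEGER);
    CREATE TABLE tbl_ValueSet (ValueSetId INTEGER, ValueSetName TEXT, TaxTermPairId INTEGER);
    CREATE TABLE tbl_Evidence (SessionId INTEGER, ElementId INTEGER,
                               DoesRequiredPropertyExist INTEGER);

    INSERT INTO tbl_Element VALUES (1), (2), (3);
    -- Element 1 has a property in the required value set, element 2 has a
    -- property outside it, and element 3 has no property at all.
    INSERT INTO tbl_Property VALUES (100, 1, 7), (101, 2, 9);
    INSERT INTO tbl_ValueSet VALUES (1, 'RequiredVSName', 7), (2, 'OtherVS', 9);
    """
)

conn.execute(
    """
    INSERT INTO tbl_Evidence (SessionId, ElementId, DoesRequiredPropertyExist)
    SELECT ?, e.ElementId,
           CASE WHEN p.PropertyId IS NULL THEN 0 ELSE 1 END
    FROM tbl_Element e
    LEFT JOIN tbl_Property p
           ON p.ElementId = e.ElementId
          AND p.TaxTermPairId IN (SELECT TaxTermPairId
                                  FROM tbl_ValueSet
                                  WHERE ValueSetName = 'RequiredVSName')
    """,
    (42,),  # hypothetical session ID
)

evidence = conn.execute(
    "SELECT ElementId, DoesRequiredPropertyExist FROM tbl_Evidence ORDER BY ElementId"
).fetchall()
print(evidence)
```

Element 2 is the interesting case: it has a property, but not one in the required value set, so the restricted join correctly reports 0 for it.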
36
Lessons Learned: Alternative to JOIN
Use of JOIN in queries
  INNER JOIN used to identify existence
  LEFT JOIN for absence
JOIN used in set-based approaches
  Preferred for SMP
  Takes advantage of shared memory and storage
For Netezza, use (NOT) EXISTS or (NOT) IN
  Allows sub-query to run in parallel
  Counter-intuitive for SQL Server set-based approach
  Correlated sub-query OK in WHERE clause
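The two ways of testing for absence return the same rows, which a short check can confirm. The sketch runs on sqlite3 as a stand-in (an assumption), so it demonstrates equivalence of results only and says nothing about relative speed on Netezza; the tables are trimmed, hypothetical versions of those used earlier.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE tbl_Element (ElementId INTEGER);
    CREATE TABLE tbl_Property (PropertyId INTEGER, ElementId INTEGER);
    INSERT INTO tbl_Element VALUES (1), (2), (3);
    INSERT INTO tbl_Property VALUES (100, 1);
    """
)

# Set-based habit from SQL Server: LEFT JOIN and keep the unmatched rows.
via_left_join = conn.execute(
    """
    SELECT e.ElementId
    FROM tbl_Element e
    LEFT JOIN tbl_Property p ON p.ElementId = e.ElementId
    WHERE p.PropertyId IS NULL
    ORDER BY e.ElementId
    """
).fetchall()

# Alternative: a correlated sub-query in the WHERE clause.
via_not_exists = conn.execute(
    """
    SELECT e.ElementId
    FROM tbl_Element e
    WHERE NOT EXISTS (SELECT 1 FROM tbl_Property p
                      WHERE p.ElementId = e.ElementId)
    ORDER BY e.ElementId
    """
).fetchall()

print(via_left_join, via_not_exists)
```

Since the results match, switching an existing query from one form to the other is a safe refactoring to test when porting.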
37
Lessons Learned: Promote Parallelism
Previous SP used multiple inserts:
  Each insert had different criteria or came from different sources
  Inserts were performed serially (single threaded)

  INSERT INTO TargetTable (Column1, Column2, Column3)
  SELECT Column1, Column2, Column3
  FROM SourceTableA;

  INSERT INTO TargetTable (Column1, Column2, Column3)
  SELECT Column1, Column2, Column3
  FROM SourceTableB;

  INSERT INTO TargetTable (Column1, Column2, Column3)
  SELECT Column1, Column2, Column3
  FROM SourceTableC;
38
Lessons Learned: Promote Parallelism
Group inserts into the same table with UNION ALL:
  Encourages SELECT statements to be executed in parallel with each other, then collected for the final INSERT
  Works great on SQL Server too!

  INSERT INTO TargetTable (Column1, Column2, Column3)
  SELECT Column1, Column2, Column3
  FROM SourceTableA
  UNION ALL
  SELECT Column1, Column2, Column3
  FROM SourceTableB
  UNION ALL
  SELECT Column1, Column2, Column3
  FROM SourceTableC;
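One detail worth checking when combining inserts is the choice between UNION ALL and UNION. The sketch below, run on sqlite3 as a stand-in with hypothetical sample rows, compares the row counts of three serial inserts, a single UNION ALL insert, and a single UNION insert when two sources contain the same row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
columns = "(Column1 INTEGER, Column2 TEXT, Column3 INTEGER)"
for name in ("SourceTableA", "SourceTableB", "SourceTableC",
             "SerialTarget", "UnionAllTarget", "UnionTarget"):
    conn.execute(f"CREATE TABLE {name} {columns}")

conn.executescript(
    """
    INSERT INTO SourceTableA VALUES (1, 'a', 10);
    INSERT INTO SourceTableB VALUES (1, 'a', 10), (2, 'b', 20);  -- repeats A's row
    INSERT INTO SourceTableC VALUES (3, 'c', 30);

    -- Original pattern: three serial inserts.
    INSERT INTO SerialTarget SELECT Column1, Column2, Column3 FROM SourceTableA;
    INSERT INTO SerialTarget SELECT Column1, Column2, Column3 FROM SourceTableB;
    INSERT INTO SerialTarget SELECT Column1, Column2, Column3 FROM SourceTableC;

    -- Combined pattern with UNION ALL: same rows as the serial version.
    INSERT INTO UnionAllTarget
    SELECT Column1, Column2, Column3 FROM SourceTableA
    UNION ALL
    SELECT Column1, Column2, Column3 FROM SourceTableB
    UNION ALL
    SELECT Column1, Column2, Column3 FROM SourceTableC;

    -- Plain UNION removes duplicates, which changes the result.
    INSERT INTO UnionTarget
    SELECT Column1, Column2, Column3 FROM SourceTableA
    UNION
    SELECT Column1, Column2, Column3 FROM SourceTableB
    UNION
    SELECT Column1, Column2, Column3 FROM SourceTableC;
    """
)


def row_count(table):
    """Count rows in one of the fixed target tables created above."""
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]


counts = {t: row_count(t) for t in ("SerialTarget", "UnionAllTarget", "UnionTarget")}
print(counts)
```

Only UNION ALL reproduces what the serial inserts loaded; plain UNION silently drops the repeated row and also pays for a de-duplication step.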
39
Lessons Learned: Data Normalization
King for SQL Server
  Reduces data usage in shared storage
  Joins made easy with shared memory
Joker for Netezza
  Data is split across multiple nodes, making joins difficult
  Requires more data sent across inter-node bus
  No shared memory for correlation
De-normalization actually helps performance
  Replicate data across nodes
  Reduce need for joins across nodes
  Reduce traffic across inter-node bus
40
Migrating from SQL Server to Netezza
What you gain (the good)
  Massive parallel processing
  Fast bulk inserts
  Fast straightforward selects
  Huge data volume capabilities
  Expandable architecture for greater data volumes
41
Migrating from SQL Server to Netezza
What you lose (the bad)
  Not optimal for low count record processing
  High cost for complex query joins across nodes
  No recursive capabilities
  No referential integrity
42
Migrating from SQL Server to Netezza
Common stumbling blocks (the ugly)
  Differences in query language
  Changes in set-based approach
  Differences in data manipulation
  Optimal distribution keys
43
Final Tips
Test, test, test!!
  Test application against expected data volume
  Performance will differ between test and prod
  May need to make extensive stored procedure changes
Learn your new environment
  Avoid the pitfalls of treating it like SQL Server
  Take advantage of platform-specific features
  Be prepared to make substantial changes
NEVER ASSUME!!
44
Questions & Comments
BONUS:
  A TON of free MS eBooks can be found here, here and here!
TURN IN EVALUATIONS:
  Feedback for presenters
  Improve presentations
  Make SQLSaturday more valuable!!!
Aaron N. Cutshall
  @ancutshall
  aaron.n.cutshall