SAP IQ - Business Intelligence and vertical data processing with 8 GB RAM or less. Dipl.-Inform. Volker Stöffler, Volker.Stoeffler@DB-TecKnowledgy.info

Agenda
Introduction: What is SAP IQ - in a nutshell (Architecture, Idea, Background)
Exercise: Create a database and database objects
What makes SAP IQ eligible for Big Data scenarios; (un-)limits, scalability aspects
Exercise: Populate the database using bulk load
Ad-hoc queries: what IQ is good at
Exercise: Run predefined or your own queries against your database

Learning Objective
After completing this session, you will be able to:
Recognize the benefits of data compression mechanisms in Big Data scenarios.
Describe how ad-hoc queries against raw fact data give you the flexibility to evaluate these data along exactly the dimensions you want NOW.
Match evaluation patterns against the data structures offered by SAP IQ.

What is SAP IQ - in a nutshell Architecture, Idea, Background

Real-Time Evaluation on Very Large Tables with SAP IQ
SAP IQ is a pure-bred Data Warehouse engine designed for Very Large Databases.
Like SAP HANA, it utilizes a Columnar Data Store. Unlike SAP HANA, it stores data on disk and utilizes RAM to cache parts of it.
Data Compression multiplies the range of storage resources: dictionary compression for repeating column values, and storage compression for all data structures. Storage required for data can be 30% - 80% less than in a traditional RDBMS.
SAP IQ integrates seamlessly with core components of the Big Data ecosystem: SAP HANA via Smart Data Access / Extended Storage, and Hadoop via Component Integration Service or Table User Defined Functions.

SAP IQ Terms
Columnar Data Store: In a traditional (OLTP-style) RDBMS, the various column values of a data row are stored together, making access to a single complete row very efficient. In a columnar data store, the column values of many rows are stored together; a row is distributed over the various column vectors.
Row ID: Since a row does not exist as a contiguous storage entity, it is identified by a Row ID indicating its position in the various column vectors.
Cardinality: The number of unique / distinct values in a column.
Optimized Fast Projection: The SAP IQ term for dictionary compression.
Bitmap Index: Since a row exists as a Row ID only, columns of low cardinality can be reflected as (usually sparsely populated) bitmaps where each bit represents one row. There is one bitmap per unique value; a set bit indicates a row with that value.
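The columnar layout and the Row ID concept can be sketched in a few lines of Python. This is an illustrative model with made-up sample data, not SAP IQ internals: each column is a separate vector, and a "row" is nothing but a shared position in all vectors.

```python
# Illustrative model of a columnar data store (not SAP IQ internals):
# each column is stored as its own vector; a "row" exists only as a
# Row ID, i.e. a shared position in all column vectors.

columns = {
    "product": ["IQ", "ASE", "ESP"],
    "state":   ["TX", "OK", "MA"],
    "revenue": [600, 515, 780],
}

def fetch_row(row_id):
    """Reassemble a logical row from its position in each column vector."""
    return {name: vec[row_id] for name, vec in columns.items()}

# A query that only needs 'revenue' never touches the other column vectors:
total_revenue = sum(columns["revenue"])
```

Fetching Row ID 1 reassembles the logical row from three separate vectors, while the revenue scan reads only one of them.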

SAP IQ for Big Data Scenarios What makes SAP IQ eligible for Big Data Scenarios (Un-) Limits, Scalability Aspects

Data Acquisition
Data is acquired through bulk mechanisms:
Fast: SAP IQ holds the Guinness World Record of 34.3 TB / hour (2014)
Scalable: Parallel processing of load data streams
Cost efficient: Runs on standard hardware
Versatile: IQ can load from a wide variety of data sources, including leading RDBMSs and Hadoop

Procedure: SAP IQ Data Acquisition
Green blocks (in the slide diagram) are eligible for massively parallel execution.
Incoming data (row oriented: tabular result set, or data file / data pipe)
Transformation to vertical (columnar) layout
Dictionary compression (where applicable)
Storage compression as data is written to disk
Auxiliary indexes (incremental or non-incremental)
Incremental indexes are Fast Projection and bitmaps; non-incremental indexes are B-Tree style. Load time for incremental indexes is independent of the number of existing rows, while maintenance of non-incremental indexes becomes more expensive as the number of existing rows increases.
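Why incremental index maintenance is insensitive to table size can be illustrated with a small Python sketch, where ints stand in for bitmaps (a model of the idea, not IQ's implementation): loading a row appends one value and sets one bit, touching no existing rows.

```python
# Sketch of incremental (bitmap-style) index maintenance during load;
# Python ints stand in for bitmaps. Appending a row sets one bit and
# appends one value - no existing row is touched, so per-row load cost
# is independent of how many rows the table already holds.

column = []    # column vector of values
bitmaps = {}   # one bitmap (as int) per distinct value

def load_row(value):
    row_id = len(column)           # next free Row ID
    column.append(value)
    bitmaps[value] = bitmaps.get(value, 0) | (1 << row_id)
    return row_id

for v in ["current", "historic", "current", "pending", "current"]:
    load_row(v)

# Rows having value 'current': bits 0, 2 and 4 are set in its bitmap.
current_rows = [i for i in range(len(column)) if bitmaps["current"] >> i & 1]
```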

Optimized Fast Projection (Dictionary Compression)
Eligible columns have a metadata lookup table. Each distinct value is represented once in the lookup table; each column value is stored as its position in the lookup table.
Lookup table size depends on the column data type and cardinality: the number of rows in the lookup table equals the cardinality, and the lookup table row size is determined by the column data type. Supported up to cardinality 2^31 (2,147,483,647).
Column vector size depends on the number of rows and the column cardinality: each column value is represented by as few bits as required to store the cardinality in binary. E.g. a column with a cardinality of 9..16 requires 4 bits / row; a column with a cardinality of 513..1024 requires 10 bits / row.
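The sizing rule can be checked with a short Python sketch. This illustrates dictionary compression in general, not IQ's exact on-disk format:

```python
import math

# Sketch of the dictionary-compression sizing rule: each column value is
# stored as its position in a lookup table, using just enough bits to
# represent the column's cardinality in binary.

def bits_per_row(cardinality):
    """Bits needed to address one of `cardinality` lookup table rows."""
    return max(1, math.ceil(math.log2(cardinality)))

# Encoding a column: one lookup table entry per distinct value,
# one small position per row.
values = ["DE", "FR", "DE", "UK", "DE", "FR"]
lookup = sorted(set(values))                  # cardinality 3 -> 2 bits/row
encoded = [lookup.index(v) for v in values]
```

A cardinality of 9..16 yields 4 bits per row and 513..1024 yields 10, matching the figures above.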

Data Storage
SAP IQ can maintain as many containers (files / raw devices) as the OS allows, each up to 4 TB in size.
Life cycle: SAP IQ can organize the database for different kinds of storage, reflecting data life cycle or temperature.
Compression: Raw data size is typically reduced by 30 - 70%.
Cost efficient: Data compression reduces the disk footprint. The maximum IQ database size can currently be considered unlimited.
Integrated: SAP IQ can integrate with HANA to hold data no longer hot enough for in-memory storage, and with Hadoop to age out even colder data.

SAP IQ Storage (Un-)Limitations
Maximum database size: number of files times the maximum file size the OS allows. The maximum file size supported by IQ is 4 TB.
Organized as DBSpaces consisting of up to 2,000 files each.
Up to 2^48 - 1 rows per table (a 15-digit decimal number); table size is only limited by database size. Special declarations are required to extend a table beyond the size of a DBSpace.
Up to 2^32 indexes per table.
Up to 45,000 columns per table (recommended limit: 10,000).

Big Data Specific Features - Very Large Database Option
Semantic partitioning: Can control data location by data values.
Read-only DBSpaces: When fully populated with archive data, DBSpaces can be declared read-only and excluded from full backups.
I/O striping: Tables can be distributed over multiple devices by column and / or by partition; auxiliary indexes can be separated from raw data.
Data aging: Tables or partitions (through semantic partitioning) with cold data can be assigned to cheaper storage.

Background – What are we doing – System Storage Containers
First, we create the Catalog Store, System Main Store and Temporary Store (0CreateDB.SQL).
Catalog Store: ...database '...\FlightStats.db'... The database handle; one file system file (accompanied by a .log). Holds system tables; grows on demand.
System Main Store: ...IQ path '...\FlightStatsMain.IQ'... One or multiple file system files or raw devices. Holds system data; specified current and optionally reserved size for later extension.
Temp Store: ...temporary path '...\FlightStatsTemp_00.IQ'... Holds temporary data (work tables, temporary tables, processing data).

Background – What are we doing – User Storage Containers
Next, we create a User Data Store (1AdjustExtendDB.SQL).
User Data Store: ...create dbspace User_Store using... One or multiple file system files or raw devices (here: 2 file system files, ...file LabCenter_User_00... and ...file LabCenter_User_01...). Holds user data; specified current and optionally reserved size for later extension.
Multiple User Data Stores are possible; this requires the Very Large Database license option.

Background – What are we doing – Create Tables and Indexes
Then, we create tables and indexes (2TablesIndexes.SQL).
Table: ...create table FlightsOnTime... Standard SQL, except for the iq unique clause (here used to bypass dictionary compression).
Indexes: Various index types; several may apply to one column.
LF – Low Fast, for low cardinality columns
HNG – High Non Group, for parallel calculation of totals and averages
DATE – for the low cardinality elements of date values
... more and details to follow

Ad-hoc Queries What IQ is good at

Real-Time Evaluation on Very Large Tables with SAP IQ

Product  Acct. Rep  State  Year  Quarter  Revenue
IQ       Steve      TX     2013  1        600
ASE      Bill       OK                    515
ESP      Tom        MA                    780
HANA                AZ                    340
                    NJ                    375
                    PH                    410
         Greg       CA                    875
                                          724
                    CO           2        415
                                          655
                    UT                    820
                    NH                    570

(Blank cells repeat the value above them.) To evaluate the distribution of Revenue by Product, Acct. Rep or State, only the columns actually involved in the particular query are required.

Data Processing
Scalable: Server or query workload can be distributed across multiple machines (Multiplex / PlexQ). The columnar data store allows evaluation of very large numbers of rows; irrelevant columns have no impact on query performance. I/O striping across all eligible disk containers.
Efficient: Bitmap indexes allow complex aggregations through elementary binary operators.
Multiplex: Various server processes running on different machines access the same physical data store. Reports can execute on an arbitrary node, controlled by the database client connection, and utilize all resources (especially CPU cores and RAM) available to that node.
PlexQ: Multiplex plus the option to distribute the execution of a single query across multiple nodes, to utilize all resources on all nodes.
Pipeline processing: Subsequent query operators can start before completion of previous operators.

Showcase: Grouped Average Calculation in 2 Dimensions
We have a numeric fact value (like number or value of items sold) for which we want to calculate total or average values. Assumptions:
Every fact row has one out of 23 status values. We're only interested in status 'current' or 'historic'; these two make up ~98% of the stored data.
Every fact row is assigned to a geography. The geography dimension has a cardinality of ~100, but we're only interested in 8 of them (e.g. AT, BE, CH, DE, FR, IE, NL, UK).
Every fact row is assigned to a product line. There are 43 of them, and we'll evaluate them all.
See the blog post "Vertical, sure - but how lean can it be?" [http://scn.sap.com/community/services/blog/2013/10/17/vertical-sure-but-how-lean-can-it-be] for a similar showcase.

Showcase: Sample Data Excerpt – Low Fast (LF) Index
[Figure: a sample table with Status, Geo and PL columns shown alongside its bitmaps.]
There is one bitmap for each status (shown here: current, historic, pending), one bitmap for each geography (shown here: DE, DK, ES, UK) and one bitmap for each product line (shown here: 2, 3, 4, 5).
An index with one bitmap per unique value is called a Low Fast (LF) index.

Procedure: Showcase Initial Process Steps
Filter: Create a combined bitmap, current OR historic.
Permutation 1: Create an AND combination of this bitmap with each of AT, BE, CH, DE, FR, IE, NL, UK. Threads: 8.
Permutation 2: Create an AND combination of each resulting bitmap with each product line. Threads: 8*43.
Pipeline execution: Subsequent steps can pick up execution before the current step is complete. This enhances the possible parallelism and makes execution time less sensitive to data size.
Intermediate result: 8*43 bitmaps, each indicating the row set for one combination of Geo and PL.
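The filter and permutation steps can be modeled in Python with ints as bitmaps. The sample bitmaps (5 rows, two geographies) are made up for illustration:

```python
# Model of the filter and permutation steps, with Python ints as bitmaps.
# Bit r represents row r; the sample bitmaps are made up for illustration.

status = {"current": 0b10110, "historic": 0b00001, "pending": 0b01000}
geo    = {"DE": 0b10010, "UK": 0b01101}

# Filter: one combined bitmap, current OR historic
wanted = status["current"] | status["historic"]

# Permutation 1: AND the filter bitmap with each geography bitmap;
# each result marks the rows of one geography that pass the filter
per_geo = {g: wanted & bm for g, bm in geo.items()}
```

Permutation 2 would repeat the same AND step against each product line bitmap, one independent combination per thread.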

Showcase: Bit Slice Sum Calculation with HNG Index
As an auxiliary index structure, numeric values can be stored in bit slices. This is called a High Non Group (HNG) index. Every bit position is represented by its own bitmap. E.g. for an unsigned smallint (2 bytes; 0..65535), 16 bitmaps are stored, each representing a power of 2 (1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384, 32768).
Sample values and their bit slices:
Value  16   8   4   2   1
   23   1   0   1   1   1
   11   0   1   0   1   1
   17   1   0   0   0   1
    5   0   0   1   0   1
   15   0   1   1   1   1
   24   1   1   0   0   0
    7   0   0   1   1   1
   12   0   1   1   0   0
   25   1   1   0   0   1
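The bit-slice layout can be modeled in Python with ints as bitmaps (an illustration, not IQ's storage format). Building the slices for the sample values above and reconstructing each value verifies the encoding:

```python
# Model of the HNG bit-slice layout with Python ints as bitmaps: one
# bitmap per power of two; bit r of slice w is set when the value in
# row r contains binary weight w.

values = [23, 11, 17, 5, 15, 24, 7, 12, 25]   # sample column from above
weights = [16, 8, 4, 2, 1]

slices = {w: 0 for w in weights}
for row_id, v in enumerate(values):
    for w in weights:
        if v & w:
            slices[w] |= 1 << row_id

def value_at(row_id):
    """Reconstruct one value from the bit slices (sanity check)."""
    return sum(w for w in weights if slices[w] >> row_id & 1)
```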

Procedure: Showcase Final Process Steps
Input: 8*43 intermediate bitmaps, each indicating the row set for one combination of Geo and PL.
Permutation 3: AND-combine each bitmap with each HNG bit slice and count the resulting set bits. Threads: 8*43*16. Permutation 3 does not require the materialization of the resulting bitmaps, as they won't be further processed; only their set bits are counted.
Accumulation: Multiply the number of set bits by the weight of the bit and add up for each Geo / PL combination. Threads: 8*43.
Pipeline execution: Subsequent steps can pick up execution before the current step is complete. This enhances the possible parallelism and makes execution time less sensitive to data size.
Result: (Up to) 8*43 result rows.
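Permutation 3 plus the accumulation step can be sketched in Python, again with ints as bitmaps and made-up sample data: the sum over one Geo / PL group is obtained by ANDing each bit slice with the group bitmap, counting the surviving set bits and weighting them.

```python
# Sketch of permutation 3 plus accumulation: AND each HNG-style bit slice
# with the group's bitmap, count the set bits and weight them. No
# intermediate bitmap needs to be materialized. Sample data is made up.

values = [23, 11, 17, 5, 15, 24]   # fact column
group  = 0b101101                  # rows 0, 2, 3, 5 form one Geo/PL group

# Build the bit slices for the fact column
slices = {w: 0 for w in (16, 8, 4, 2, 1)}
for row_id, v in enumerate(values):
    for w in slices:
        if v & w:
            slices[w] |= 1 << row_id

# Group sum = sum over slices of weight * popcount(slice AND group)
group_sum = sum(w * bin(s & group).count("1") for w, s in slices.items())
```

The result equals summing the group's values directly (23 + 17 + 5 + 24), but it is computed from five elementary AND / popcount operations that parallelize trivially.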

Showcase: Summary – Why This Is Efficient
We're utilizing a very high number of threads. These can be executed in parallel if sufficient cores are available, but they don't have to be: they introduce no overhead, are completely independent of each other, and could even be executed on different nodes in a PlexQ setup.
The operations executed are technically trivial and highly efficient on any hardware; the intermediate results fit into hardware registers.
The persistent input bitmaps can be distributed over multiple disks for I/O striping. The intermediate bitmaps can be expected to fit in the cache: 128 MB per bitmap for 1G rows (uncompressed).

Scalability Aspects
I'm looking at scalability as the progressive behavior between two aspects, assuming everything else remains unchanged.
Load time vs. number of existing rows: Incremental indexes (for low cardinality data) are insensitive to the number of existing rows. For bitmap or FP indexes, loading new rows does not touch any existing rows; the new values are simply appended to the existing bitmaps and column vectors. Non-incremental (B-Tree) indexes are in principle sensitive to the number of existing rows; this impact is minimized using tiered B-Trees.
Query execution time vs. number of cores: Most analytics-style queries can efficiently scale out to a high number of CPU cores. Increasing processing power can be expected to produce an adequate gain in response time.
Query execution time vs. number of rows: Typically, query execution time rises linearly with the number of rows, or slower, since pipelined execution can achieve a higher degree of parallelism if the streams have a higher volume.
Multi-node setup (Multiplex / PlexQ): Processing power and RAM are not restricted to the capabilities of a single box.

Using SAP IQ
Standard SQL: SAP IQ is addressed using standard SQL, easy to use for developers familiar with other RDBMSs.
OLAP: The SQL dialect is enhanced by OLAP extensions, bringing analytics into the database server.
Standard APIs: ODBC, JDBC, OLE-DB, OpenClient. Simply use your preferred client (unless it's proprietary).
Reporting tools: All reporting tools supporting at least one of the standard APIs can retrieve data from SAP IQ.
Import / Export: ASCII files are the most versatile data exchange format; SAP IQ reads from and writes to them.

Consistency - Concurrency
Snapshot isolation: SAP IQ uses snapshot isolation.
No blocking: Read operations never get into lock conflicts. This minimizes the impact of data provisioning.
Full consistency: Data visible to a reader is always consistent; nothing like dirty reads, non-repeatable reads or phantom rows.
Parallel: If CPU cores are available, typical analytics operations can massively utilize them.

Integration into the SAP Big Data Landscape
HANA integration: Near-line storage for SAP BW systems; Smart Data Access / HANA Extended Storage.
Hadoop integration: User-defined functions in IQ to access Hadoop data, and Table Parametrized Functions (TPF).
Event stream processing: SAP ESP comes with a native adapter for SAP IQ.
Reporting / predictive data analysis: Standard APIs (ODBC / JDBC / ...) available for SAP and third-party products. OLAP in the database removes workload from the reporting systems.

Thank you!
Contact information: Volker Stöffler, DB-TecKnowledgy, Independent Consultant, Germany - 70771 Leinfelden-Echterdingen
mailto:Volker.Stoeffler@DB-TecKnowledgy.info
http://scn.sap.com/people/volker.stoeffler