DB Volume Emulator
Rekha Singhal, Amol Khanapurkar, TCS Mumbai.

Contents
- What is DB Volume Emulator
- Architecture of DB Volume Emulator
- Implementation Challenges
- Extension to Big Data Platforms
  - Hive
  - Spark
  - Impala

Database Volume Emulator
[Figure: SQL queries (SQL1, SQL2, ... SQLN) run against a dataless database, e.g. CoDD]
- Emulates SQL query execution based on data statistics
- SQL tuning in the development environment
- Used for:
  - Performance extrapolation
  - Scheduling of SQL queries
  - Capacity planning

Output of DB Volume Emulator

Use case- Performance Assurance Tool

SQL Query Execution Plan
SELECT STATEMENT
  SORT AGGREGATE
    NESTED LOOPS
      FULL TABLE SCAN (SUPPLIER)
      INDEX RANGE SCAN (PARTSUPP_SK)

Tool Output
Columns: ID | Operation | Name | Estimated Rows | Estimated Execution Time (secs), 128 GB
Select 1 696.43
Sort Aggregate
2 Nested Loop 102M 681.07
3 386.26
4 Table Access Full Supplier 128K 8.96
5 Table Access by Index Range Scan Partsupp_sk 354.5
6 Table Access by Unique Index Scan Pk_nationkey 200.96

- Validated for TPC-H and insurance applications within 10% average error.
- POC for Gspeed and NEST project.
- Granted patents and international publications.

Added by TCS Research

DB Volume Architecture

Inputs to DB Volume Emulator
- DB server details
  - Data processing engine with API (Postgres…)
  - Configuration parameters (working memory size, block size…)
- Metadata
  - Table catalog (number of blocks, rows, average row length..)
  - Index catalog (clustering factor, unique values, number of rows per value…)
  - Column catalog (min value, distinct values, density….)
- Data growth details
  - Tables with projected number of rows
  - Column values (max, distribution, unique values..)

Working of DB Volume Emulator
Given:
- Access to an instance of the database in a development/testing environment
- Growth details
Steps:
1. Create an empty instance of the database with new user credentials.
2. Transfer the schema and all statistics (metadata) from the Dev DB to the empty DB.
3. Collate the list of table, column, and index catalog statistics that are sensitive to data growth.
4. Extrapolate those statistics and set them back in the empty DB using the API.
5. The empty DB is now the emulated DB.
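As a rough illustration of setting extrapolated statistics back into the empty DB, the sketch below builds one catalog update statement. It assumes PostgreSQL as the engine and writes to the pg_class catalog, whose reltuples and relpages columns hold the planner's row and page estimates. The function name and its parameters are hypothetical, and the slides do not say that the tool issues this exact statement.

```python
import math

# Hypothetical helper, not the tool's confirmed mechanism. Assumes PostgreSQL:
# pg_class.reltuples and pg_class.relpages are the planner's row and page estimates.
# Updating a system catalog directly needs superuser rights and is only sensible
# on a throwaway emulation instance.
def scaled_pg_class_update(table: str, reltuples: float, relpages: int, growth: float) -> str:
    """Return an UPDATE that sets linearly scaled row and page counts for one table."""
    if growth <= 0:
        raise ValueError("growth must be positive")
    if not table.replace("_", "").isalnum():
        raise ValueError("unexpected characters in table name")
    new_tuples = int(round(reltuples * growth))   # projected row count
    new_pages = math.ceil(relpages * growth)      # projected page count, rounded up
    return (
        f"UPDATE pg_class SET reltuples = {new_tuples}, relpages = {new_pages} "
        f"WHERE relname = '{table}';"
    )

# Example: a supplier table measured in Dev, emulated at 10x the volume.
print(scaled_pg_class_update("supplier", 10_000, 220, 10))
```

Column and index statistics would need analogous statements against their own catalogs; those are left out here because their layout differs per engine.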

RDBMS Volume Emulator
- Capture database statistics either from application requirement specifications or from the production system.
- Linear extrapolation of database statistics:
  - Table statistics: num rows, num blocks, etc.
  - Column statistics: min, max, density, histograms
  - Index statistics: num rows, density, clustering factor, etc.
Refer: R. Singhal and M. Nambiar, "Extrapolation of SQL Query Elapsed Response Time at Application Development Stage," Indicon 2012, Kerala, India.
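A minimal sketch of linear extrapolation for table-level statistics follows. The TableStats class and extrapolate_table function are hypothetical names, and the choice to hold average row length constant while scaling blocks with rows is an assumption made for illustration; the cited paper describes the actual rules.

```python
import math
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class TableStats:
    """Hypothetical container for the table statistics named on the slide."""
    num_rows: int
    num_blocks: int
    avg_row_len: int  # bytes; assumed not to change with data volume

def extrapolate_table(stats: TableStats, projected_rows: int) -> TableStats:
    """Scale size-sensitive table statistics linearly to the projected row count."""
    if stats.num_rows <= 0 or projected_rows <= 0:
        raise ValueError("row counts must be positive")
    factor = projected_rows / stats.num_rows
    return replace(
        stats,
        num_rows=projected_rows,
        num_blocks=math.ceil(stats.num_blocks * factor),  # blocks grow with rows
    )

dev = TableStats(num_rows=1_000, num_blocks=20, avg_row_len=150)
print(extrapolate_table(dev, projected_rows=128_000))
```

Column statistics such as density or histograms, and index statistics such as clustering factor, may not scale by the same factor, which is why they are not folded into this sketch.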

Implementation
- CoDD (Prof. Haritsa, IISc Bangalore) is open source.
- Relational database engines (Postgres, Oracle, MySQL..)
- TCS has its own implementation specific to Oracle; however, we extended CoDD for SQL execution time estimation.
- The tool always transfers statistics from the Dev environment.
- Growth details are captured in a user-friendly manner: linear, constant, non-linear.
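One way to picture the three growth options is as per-table functions from a base row count and a period number to a projected row count. Everything in the sketch below is hypothetical: the slides name the categories but not their form, and the power law merely stands in for "non-linear".

```python
from typing import Callable, Dict

# (base_rows, period) -> projected rows; a hypothetical shape for a growth model.
GrowthModel = Callable[[int, int], int]

def constant(base_rows: int, period: int) -> int:
    """Reference tables that do not grow, e.g. a list of nations."""
    return base_rows

def linear(rows_per_period: int) -> GrowthModel:
    """A fixed number of rows is added in every period."""
    def model(base_rows: int, period: int) -> int:
        return base_rows + rows_per_period * period
    return model

def power_law(exponent: float) -> GrowthModel:
    """Assumed stand-in for non-linear growth: base_rows * (period + 1) ** exponent."""
    def model(base_rows: int, period: int) -> int:
        return round(base_rows * (period + 1) ** exponent)
    return model

# Growth details per table, as a user might enter them.
growth: Dict[str, GrowthModel] = {
    "nation": constant,
    "supplier": linear(10_000),
    "partsupp": power_law(2.0),
}
base_rows = {"nation": 25, "supplier": 10_000, "partsupp": 800_000}
for table, model in growth.items():
    print(table, model(base_rows[table], 3))
```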

Thought Leadership on DB Volume Emulator
Big Data Frameworks

Big Data Processing Stack

Apache Hive
- A data warehousing system to store structured data on the Hadoop file system
- Provides an easy way to query data by executing Hadoop MapReduce plans

Hive Architecture (from the Hive paper)
- Metastore: stores the system catalog.
- Driver: manages the life cycle of a HiveQL query as it moves through Hive; also manages the session handle and session statistics.
- Query compiler: compiles HiveQL into a directed acyclic graph of map/reduce tasks.
- Execution engine: executes the tasks in proper dependency order; interacts with Hadoop.
- HiveServer: provides a Thrift interface and JDBC/ODBC for integrating other applications.
- Client components: CLI, web interface, JDBC/ODBC interface.
- Extensibility interfaces include SerDe, User Defined Functions, and User Defined Aggregate Functions.

Hive Metadata
- Database namespace
- Table definitions: schema info, physical location in HDFS
- Partition data
- Object-relational mapping framework
- All the metadata can be stored in Derby by default
- Any database with JDBC can be configured

Data Model & Storage
- Hive structures data into well-understood database concepts: tables, rows, columns, partitions.
- Primitive types: integers, floats, doubles, and strings.
- Complex types:
  - Associative arrays: map<key-type, value-type>
  - Lists: list<element type>
  - Structs: struct<field name: field type…>
- SerDe: the serialize and deserialize API is used to move data in and out of tables.
- Tables are logical data units; table metadata associates the data in a table with HDFS directories.
- HDFS namespace: tables (HDFS directory), partitions (HDFS subdirectory), buckets (files within a partition directory).
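The mapping from tables and partitions to HDFS directories can be sketched as a small path builder. The function name is hypothetical and the table and column names are illustrative; the warehouse root shown is Hive's usual default location, and the column=value directory naming follows Hive's partition layout.

```python
from pathlib import PurePosixPath
from typing import Mapping

def partition_dir(warehouse: str, table: str, partition: Mapping[str, str]) -> PurePosixPath:
    """Build the directory for one partition: <warehouse>/<table>/<col>=<value>/..."""
    path = PurePosixPath(warehouse) / table           # table maps to a directory
    for column, value in partition.items():           # each partition key adds a subdirectory
        path = path / f"{column}={value}"
    return path

# Illustrative names only; this prints a path and touches no file system.
print(partition_dir("/user/hive/warehouse", "lineitem", {"ship_year": "1998"}))
```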

Query Language (HiveQL)
- Subset of SQL
- Metadata queries
- Limited equality and join predicates
- No inserts on existing tables (to preserve the WORM, write once read many, property)
- Can overwrite an entire table

[Figure: Hive architecture. Interfaces: Web UI, Hive CLI, JDBC/ODBC (browse, query, DDL). Hive QL layer: parser, planner, optimizer, execution. Extensions: user-defined map-reduce scripts; UDF/UDAF (substr, sum, average). MetaStore with Thrift API. SerDe: CSV, Thrift, Regex. File formats: TextFile, SequenceFile, RCFile. All running on Map Reduce and HDFS.]

HiveQL Explain Plan

Apache Calcite Optimizer
- Incubator project since 2014
- Query planning framework
- Embedded in Hive
- Adapters for Spark, MongoDB, Splunk, Phoenix..

Apache Spark
- Fast and general cluster computing system, interoperable with Hadoop, included in all major distros
- Improves efficiency through:
  - In-memory computing primitives
  - General computation graphs
- Improves usability through:
  - Rich APIs in Scala, Java, Python
  - Interactive shell
- Write programs in terms of transformations on distributed datasets
- Resilient Distributed Datasets (RDDs)
  - Collections of objects that can be stored in memory or on disk across a cluster
  - Parallel functional transformations (map, filter, …)
  - Automatically rebuilt on failure

Spark SQL
- Part of the core distribution since Spark 1.0 (April 2014)
- Runs SQL / HiveQL queries, optionally alongside or replacing existing Hive deployments

SQL Data Model
- Nested data model
- Supports both primitive SQL types (boolean, integer, double, decimal, string, date, timestamp) and complex types (structs, arrays, maps, and unions); also user-defined types.

SparkSQL Explain Plan

Catalyst Optimizer SQL

An Example Catalyst Transformation
Find filters on top of projections. Check that the filter can be evaluated without the result of the project. If so, switch the operators.
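The rule can be sketched on a toy plan tree. This is plain Python for illustration, not Catalyst's Scala API; the Scan, Project, and Filter classes and the computed field (names that the projection itself produces from expressions) are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple, Union

@dataclass(frozen=True)
class Scan:
    table: str

@dataclass(frozen=True)
class Project:
    outputs: Tuple[str, ...]   # column names the projection emits
    computed: FrozenSet[str]   # subset of outputs produced by expressions in the projection
    child: "Plan"

@dataclass(frozen=True)
class Filter:
    references: FrozenSet[str]  # column names the condition reads
    condition: str              # kept as text; this sketch does not evaluate it
    child: "Plan"

Plan = Union[Scan, Project, Filter]

def push_filter_below_project(plan: Plan) -> Plan:
    """Swap a Filter with the Project beneath it when the filter does not need the projection's results."""
    if isinstance(plan, Filter) and isinstance(plan.child, Project):
        project = plan.child
        # The filter is independent of the projection if it reads no computed column.
        if plan.references.isdisjoint(project.computed):
            pushed = Filter(plan.references, plan.condition, project.child)
            return Project(project.outputs, project.computed, pushed)
    return plan  # rule does not apply; leave the plan unchanged

plan = Filter(
    frozenset({"s_nationkey"}),
    "s_nationkey = 7",
    Project(("s_name", "s_nationkey"), frozenset(), Scan("supplier")),
)
print(push_filter_below_project(plan))
```

A filter that reads a column computed by the projection is returned unchanged, since it cannot be evaluated before the projection runs.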

Cloudera Impala
- General-purpose SQL query execution engine for Hadoop
- For analytical and transactional workloads
- Works directly with Hadoop:
  - Same storage managers
  - Same file formats
  - Collocated daemons
- Queries data in HDFS and HBase
- HiveQL (subset of ANSI 92)
- Uses the Hive Metastore
- Uses the Hive optimizer, extended for distributed query processing

Conclusions
- Analytic SQL queries access large data sizes in the production environment.
- There is a need to understand the SQL query execution plan on large volumes.
- A Database Volume Emulator with a contentless database is a solution.
- We have discussed the design, an example, and use cases of the DB Volume Emulator for RDBMS.
- We extended the thought process to next-generation data processing engines such as Spark, Hive, and Impala.
- A built-in volume emulator helps to create a tuning environment.
- The approach is general and can be applied to any data engine that provides an API to change the data statistics maintained by its optimizer.
