Interactive Queries in Data Warehouses


1 Interactive Queries in Data Warehouses
Shreya

2 What does Data Warehousing consist of?
Capturing data from diverse sources to analyze later, usually for business intelligence. Warehouses are usually housed on an enterprise mainframe server, though deployments are moving toward the cloud. User queries and analysis support management's decision-making process. Example reports: annual and quarterly comparisons of trends, daily sales analysis. Warehouses differ from standard databases, which focus on strict accuracy of data and real-time updates; data warehouses need a long-range view, trading transaction volume for data aggregation.

3 What is the problem? Data now comes from less predictable sources such as web applications, mobile devices, and sensors. This data is often in a schema-less, semi-structured format with self-defined categories, e.g. JSON or XML. Well-defined data types were important for ETL (Extract, Transform, Load) pipelines; handling the new data instead requires platforms like Hadoop or Spark, which are not efficient for data warehousing. Data warehouses traditionally received information from predictable sources such as transactional systems and CRM applications. They are expensive to scale, do not excel at handling unstructured or complicated data, and were designed for fixed resources, not taking advantage of cloud scalability.
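To make the schema-less point concrete, here is a minimal sketch (my own illustration, not from the slides) of two JSON records from the same feed carrying different, self-described fields, exactly the shape a fixed relational schema struggles with:

```python
import json

# Two "events" from the same source with different, self-described fields.
# Classic ETL would need one fixed schema; semi-structured stores accept both.
records = [
    '{"user": "a1", "device": "mobile", "geo": {"country": "US"}}',
    '{"user": "b2", "clicks": 3}',
]

parsed = [json.loads(r) for r in records]
# The set of keys varies per record: there is no single relational schema.
all_keys = sorted({k for rec in parsed for k in rec})
print(all_keys)  # ['clicks', 'device', 'geo', 'user']
```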

4 Snowflake Overview
Snowflake was the proposed solution: a pure software-as-a-service (SaaS) offering, designed as a pay-as-you-go service and running on AWS. It supports SQL as well as semi-structured and schema-less data (through extensions to handle it). As SaaS, there are no machines to manage, no DBAs, and no software to install.

5 Snowflake Separates Storage and Compute
Storage is through Amazon's S3. Tables are horizontally partitioned into files; within each file, the values of each column (attribute) are grouped together and compressed, a layout known as PAX or hybrid columnar. Compute is provided through Snowflake's shared-nothing engine. To reduce network traffic between the two layers, compute nodes cache some table data on local disk. Because of this partitioning, queries only need to download from S3 the file headers and the columns they are interested in. The metadata mapping tables to S3 files is kept in a key-value store in the cloud services layer.
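The PAX/hybrid-columnar idea above can be sketched as follows (the file structure and field names here are hypothetical illustrations, not Snowflake's actual on-disk layout): rows in a horizontal partition are regrouped column-wise, and a header records per-column metadata so a reader can fetch only the header plus the columns it needs.

```python
# Hypothetical sketch of a PAX-style (hybrid columnar) file: a horizontal
# partition of rows is stored with each column's values contiguous, and a
# header records per-column min/max metadata.
rows = [(1, "a"), (5, "b"), (9, "c")]

def make_pax_file(rows):
    cols = list(zip(*rows))                    # regroup row data column-wise
    header = [(min(c), max(c)) for c in cols]  # per-column min/max metadata
    return {"header": header, "columns": cols}

f = make_pax_file(rows)
# A reader interested only in column 0 fetches the header plus that column,
# never the other columns, mirroring Snowflake's partial reads from S3.
print(f["header"][0])  # (1, 9)
```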

6 Virtual Warehouses
A virtual warehouse (VW) is an abstraction presented to users. Virtual warehouses consist of clusters of EC2 (Elastic Compute) instances; each EC2 instance is a worker node of a VW. VWs can be created, destroyed, and resized on demand, and come in sizes from X-Small to XX-Large.
Each worker node keeps a cache of table data on local disk, holding S3 objects (headers and columns of files) that have been accessed; the cache is shared with the node's worker processes. The eviction policy is currently LRU but could be improved in the future. Hashing assigns files to nodes so that queries over the same table reach the same caches; when membership changes, reshuffling happens slowly, letting later requests amortize the cost.
Each query spawns new worker processes, which die after the query is done. If a query fails it is rerun in full, which is bad for long queries. All VWs have access to the same shared tables.
The abstraction lets the service and pricing evolve independently: changes to the underlying VWs do not affect the database. Users should shut down VWs that have no running queries. The pricing argument: 15 hours on 4 nodes costs about the same as 2 hours on 32 nodes, so users can pay for more nodes and finish sooner.
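The cache-locality hashing mentioned above can be sketched like this (a simplified stand-in for Snowflake's scheme; the node names and the modulo-based hash are illustrative assumptions, not the real implementation):

```python
import hashlib

# Simplified sketch of cache-friendly file placement: hash each table file
# to a worker node so that queries over the same table land on the same
# nodes, keeping their local-disk caches warm.
nodes = ["worker-0", "worker-1", "worker-2"]

def node_for(filename, nodes):
    # Deterministic hash of the file name, mapped onto the node list.
    h = int(hashlib.sha256(filename.encode()).hexdigest(), 16)
    return nodes[h % len(nodes)]

# Repeated queries touching the same file always go to the same node.
assert node_for("part-003", nodes) == node_for("part-003", nodes)
```

When the node set changes, a scheme like this would reshuffle assignments; per the slide, Snowflake lets that happen slowly (stale cache entries simply age out under LRU) so later requests amortize the cost.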

7 SQL Execution Engine Characteristics
Columnar: data is stored and processed column-wise.
Vectorized: avoids materializing intermediate results; data is processed in a pipelined fashion, in batches of column values.
Push-based: relational operators push their results to downstream operators. This improves cache efficiency because it removes control-flow logic from tight loops, and it enables Snowflake to efficiently process DAG-shaped plans (as opposed to trees), with more opportunity to share and pipeline intermediate results.
No transaction management is needed during execution because queries execute over immutable files. There is no buffer pool; instead, Snowflake allows all major operators (join, group by, sort) to spill to disk and recurse when main memory is exhausted.
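A toy example of the push-based, vectorized style described above (an illustration of the technique, not Snowflake's engine): each operator processes a whole batch of values and pushes its output to the downstream operator, so there is no per-row control flow.

```python
# Toy push-based pipeline: Filter pushes each processed batch downstream
# to Sum, instead of Sum pulling rows one at a time.
class Filter:
    def __init__(self, pred, downstream):
        self.pred, self.downstream = pred, downstream

    def push(self, batch):
        # Process the whole batch, then push the result downstream.
        self.downstream.push([v for v in batch if self.pred(v)])

class Sum:
    def __init__(self):
        self.total = 0

    def push(self, batch):
        self.total += sum(batch)

agg = Sum()
plan = Filter(lambda v: v > 10, agg)   # upstream operator feeds downstream
for batch in ([5, 12, 30], [8, 21]):   # data arrives in vectorized batches
    plan.push(batch)
print(agg.total)  # 63
```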

8 Shared Data Architecture
The cloud services layer is multi-tenant. VWs have their own worker nodes; this layer is shared-nothing. Data storage is shared across all VWs belonging to one data center.

9 Search optimizations Given an analytical workload, Snowflake uses multi- version concurrency control (MVCC), which means a copy of every changed database object is preserved for some duration. Maintaining indices is expensive. Snowflake employs a min-max based pruning. Files have some metadata indicated attributes about the data, so that it may be pruned. small storage overhead does for semi structured data storing columns inside Analytical workload = mostly reads. If writes use multi-version concurrency control (MVCC), which means a copy of every changed database object is preserved for some duration.

10 Technical Differentiators
Pure software as a service: users can issue SQL commands directly and work in an ELT manner, with no machines to manage, no DBAs, and no software to install.
Semi-structured data: Snowflake adds VARIANT (a native SQL type), ARRAY (of values), and OBJECT (JavaScript-like) types. Statistical analysis identifies frequently accessed paths, whose columns are stored separately in compressed columnar format like plain SQL data. Optimistic conversion of values is applied, and the approach was tested on schema-less tables.
Failure tolerance: for data storage, S3 spans availability zones (AZs) to handle storage failures. If VW nodes fail, the query is re-executed on a replacement node or on fewer nodes; standby nodes are usually available as replacements. If an entire AZ fails, the query must be reallocated to a different VW, which is to be addressed in future work. Online upgrade allows multiple versions of the software to run side by side.
Cloning and time travel: time travel provides access to previous versions of data. Cloning copies only the metadata of the source table, so both tables refer to the same set of files and can be modified independently afterwards.
Security: two-factor authentication (client side), encrypted data import and export, and secure data transfer and storage, using AES with 256-bit keys. Key rotation moves each key through a pre-operational creation phase, an operational phase for encrypting and decrypting, a post-operational phase, and a destroyed phase. File keys are derived from the table key and the unique file name, so when the table key changes, all file keys change with it; this is helpful because rotation does not require re-encrypting each file individually. Keys are organized hierarchically so that no single key reveals everything.
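The file-key derivation in the security discussion can be sketched as follows. The slide only says that file keys derive from the table key and the unique file name; the HMAC-based derivation here is my assumption for illustration, not Snowflake's actual key-management code.

```python
import hashlib
import hmac

# Hypothetical sketch of hierarchical key derivation: each file key is
# derived from the table key plus the unique file name, so rotating the
# table key changes every derived file key in one step.
def file_key(table_key: bytes, file_name: str) -> bytes:
    return hmac.new(table_key, file_name.encode(), hashlib.sha256).digest()

k1 = file_key(b"table-key-v1", "part-001")
k2 = file_key(b"table-key-v2", "part-001")
assert k1 != k2  # rotating the table key yields a different file key
assert file_key(b"table-key-v1", "part-001") == k1  # derivation is deterministic
```

Because file keys are derived rather than stored, a table-key rotation never needs to touch each file's key material individually, which matches the slide's point that re-encryption overhead is contained.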

11 Future Work Make Snowflake a fully self-service model, without developer involvement. If a query fails it is entirely rerun, which can be costly for a long query. Each worker node keeps a cache of table data that currently uses an LRU policy, which could be improved. Snowflake also does not yet handle an unavailable availability zone; currently that requires reallocating the query to another VW.

