1
https://docs.aws.amazon.com
Amazon Redshift is a fast, fully managed, petabyte-scale data warehouse service. 128 nodes * 16 TB (HDD) = 2 PB. Eralper YILMAZ
2
What Is Amazon Redshift?
Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. It is a fast and cost-effective data warehouse that gives you petabyte-scale data warehousing and exabyte-scale data lake analytics together in one service. Amazon Redshift is a column-based database designed for OLAP, allowing customers to combine multiple complex queries to provide answers. Speaker notes: node and slice counts for different configurations; full node, slice, and block-level details showing how data is stored; what compressed data looks like for different encoding methods. Scale: Megabyte -> Gigabyte -> Terabyte -> Petabyte -> Exabyte -> Zettabyte -> Yottabyte
3
Data Warehouse Concepts
A data warehouse is any system that collates data from a wide range of sources within an organization. Data warehouses are used as centralized data repositories for analytical and reporting purposes.
Relational databases: OLTP vs OLAP
Non-relational databases: NoSQL - schema-free, horizontally scalable, distributed across different nodes (Redis, DynamoDB, Cassandra, MongoDB, graph databases)
4
Petabyte Scale Data Warehouse
Cost Effective | Fast | Compatible | Simple | Petabyte Scale Data Warehouse | Fully Managed | Secure
1 - Cost-effective: Costs less than $1,000 per terabyte per year, about 10 times less than traditional data warehouse solutions (Google BigQuery ~$720, Microsoft Azure $700+). You pay for compute node hours (the leader node is not chargeable). Data transfers between S3 and Redshift within your VPC are not charged (load, unload, backup, snapshot); data transfers outside of your VPC are charged. Backups are free up to the provisioned amount of disk. For pricing, check the AWS pricing page.
2 - Fast: Columnar storage technology in a massively parallel processing (MPP) architecture parallelizes and distributes data and queries across multiple nodes, consistently delivering high performance at any volume of data.
3 - Compatible: Supports ODBC and JDBC connections, and existing BI tools are supported.
4 - Simple: Uses ANSI SQL for querying data and PL/pgSQL for stored procedures. To manage Redshift: the Amazon Redshift console, the AWS Command Line Interface (AWS CLI), the Amazon Redshift Query API, or an AWS Software Development Kit (SDK). Comparable dialects: Spark SQL (different), HiveQL (similar), GQL (similar).
5 - Petabyte-scale DW: 128 nodes * 16 TB disk = 2 PB of data on disk. Additionally, Spectrum enables querying data on S3 without limit, featuring exabyte-scale data lake analytics.
6 - Fully managed: A cloud SaaS data warehouse service. 6.1: Automates ongoing administrative tasks (backups, patches). 6.2: Continuously monitors the cluster. 6.3: Automatically recovers from disk and drive failures (from the data itself, a replica on another compute node, or S3 incremental backups). 6.4: Scales without downtime for read access (adds new nodes and redistributes data for maximum performance).
7 - Secure: Encryption (at rest and in transit, including backups); VPC isolation (compute nodes run in a separate VPC from the leader node, so data is separated seamlessly).
Note: Redshift is not highly available - a cluster runs in only one Availability Zone, but you can restore snapshots of Amazon Redshift databases in other AZs.
Cloud value: No Upfront Investment, Low Ongoing Cost, Flexible Capacity, Speed & Agility, Apps not Ops, Global Reach
5
Anti-Patterns: Amazon Redshift is not for
Small datasets: if your dataset is less than 100 gigabytes, you're not going to get all the benefits that Amazon Redshift has to offer, and Amazon RDS may be a better solution.
OLTP: Amazon Redshift is designed for data warehousing workloads, delivering extremely fast and inexpensive analytic capabilities. For a fast transactional system, a traditional relational database built on Amazon RDS or a NoSQL database such as Amazon DynamoDB can be a better option.
Unstructured data: data in Amazon Redshift must be structured by a defined schema; Amazon Redshift doesn't support an arbitrary schema structure for each row.
BLOB data: if you plan on storing binary large object (BLOB) files such as digital video, images, or music, consider storing the data in Amazon S3 and referencing its location in Amazon Redshift.
6
Magic Quadrant for Data Management Solutions for Analytics
7
Gartner Evaluation: Leader in Cloud DMSA
Strengths: number of resources and services; number of third-party resources; 40% more growth than the overall DMSA market; AWS Outposts for on-premises presence.
Cautions: integration complexity; value for money, pricing, and contract flexibility (high scores for evaluation and contract negotiation); product capabilities - relatively slow to adopt some key features expected by modern cloud DMSA environments, such as dynamic elasticity, automatic tuning, and separation of compute and storage resources.
Amazon Web Services portfolio:
Amazon Redshift, a data warehouse service in the cloud.
Amazon Redshift Spectrum, a serverless, metered query engine that uses the same optimizer as Amazon Redshift, but queries data in both Amazon S3 and Redshift's local storage.
Amazon S3, a cloud object store; AWS Lake Formation, a secure data lake service; AWS Glue, a data integration and metadata catalog service; Amazon Elasticsearch, a search engine based on the Lucene library.
Amazon Kinesis, a streaming data analytics service; Amazon Elastic MapReduce (EMR), a managed Hadoop service; Amazon Athena, a serverless, metered query engine for data residing in Amazon S3; Amazon QuickSight, a business intelligence (BI) visualization tool.
Amazon Neptune provides graph capabilities.
8
Amazon Redshift Architecture
9
Analytics features, MPP, and columnar storage added on top of PostgreSQL (PostgreSQL 8.0.2)
Connection string: presents itself as PostgreSQL 8.0.2
Similar to PostgreSQL; compatible with ANSI SQL
First release: February 2013
Biweekly updates
10
Compute nodes: dedicated CPU, memory, and attached disk storage. Increase the compute capacity and storage capacity of a cluster by increasing the number of nodes or upgrading the node type.
11
Leader Node: parser; execution planner and optimizer; code generator; C++ code compiler; task scheduler; WLM (automatic Workload Management); Postgres catalog tables (pg_* tables).
Compute Nodes: query execution processes; I/O and disk management; backup and restore process; replication process; local storage.
12
I/O Reduction Columnar Storage Data Compression Zone Maps
13
I/O Reduction - Columnar storage: effective when only a subset of columns is read
Only scans blocks for the relevant columns
14
I/O Reduction - Compression: the DDL shows that each column can be separately compressed with a different compression method. Compression (also called encoding) reduces storage requirements and I/O - data shrinks up to 4 times - and increases query performance by reducing I/O. The COPY command applies compression by default.
select "column", type, encoding from pg_table_def where tablename = '…';
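A minimal sketch of per-column encoding; the table name 'sales' and the chosen encodings are illustrative assumptions, and the inspection query mirrors the pg_table_def query on this slide:

    -- Hypothetical table with a different encoding per column
    CREATE TABLE sales (
        sale_id    BIGINT        ENCODE az64,
        sale_date  DATE          ENCODE az64,
        region     VARCHAR(32)   ENCODE bytedict,
        notes      VARCHAR(256)  ENCODE lzo,
        amount     DECIMAL(12,2) ENCODE az64
    );

    -- Inspect the encoding applied to each column
    SELECT "column", type, encoding
    FROM pg_table_def
    WHERE tablename = 'sales';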
15
I/O Reduction - Zone maps: in-memory block metadata that stores the min/max values of each block. Block: a 1 MB chunk of data. Zone maps eliminate unnecessary I/O by effectively pruning blocks that don't contain data relevant to the query. Blocks are immutable - blocks are not updated, only new blocks are written. Check the UPDATE process behind the scenes:
16
Zone maps are held in memory, so unrelated data blocks can be eliminated before hitting disk.
Redshift does not support Indexes. Instead, each table has a user-defined sort key which determines how rows are ordered
17
Data Sorting and SORTKEY
Determines the physical order of table row data. Picking a sort key reduces I/O by maximizing the effectiveness of zone maps. Multiple columns can be used as the sort key. The optimal SORTKEY depends on: query patterns, data profile, and business requirements. Redshift does not support traditional indexes; instead, data is physically stored to maximize query performance using sort keys (see the sketch below).
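A minimal sketch of a compound sort key, assuming a hypothetical 'events' table; a range predicate on the leading sort key column is what lets zone maps skip blocks:

    CREATE TABLE events (
        event_time TIMESTAMP,
        user_id    BIGINT,
        event_type VARCHAR(32)
    )
    COMPOUND SORTKEY (event_time, user_id);

    -- Blocks whose min/max event_time fall outside the range are pruned
    SELECT event_type, COUNT(*)
    FROM events
    WHERE event_time BETWEEN '2023-01-01' AND '2023-01-31'
    GROUP BY event_type;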
18
Data Distribution Distributes data evenly for parallel processing
Minimizes data movement during query processing. Data distribution is controlled by a distribution style and a distribution key. There are 3 distribution styles (see the sketch below):
Key: take a good-cardinality column, hash it, and write each row to a slice based on the hash
Even: round-robin the data among slices when no column has good cardinality or the key does not produce an even distribution
All: used for dimension tables; a full copy is placed on every node (in a single slice); for small tables (< 2-3 MB)
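A minimal sketch of the three distribution styles, using hypothetical table names:

    -- KEY: rows with the same customer_id hash to the same slice (co-locates joins)
    CREATE TABLE orders (
        order_id    BIGINT,
        customer_id BIGINT,
        order_date  DATE
    )
    DISTSTYLE KEY DISTKEY (customer_id);

    -- EVEN: round-robin across slices when no good key exists
    CREATE TABLE clickstream (
        event_id BIGINT,
        url      VARCHAR(2048)
    )
    DISTSTYLE EVEN;

    -- ALL: a full copy of a small dimension table on every node
    CREATE TABLE dim_region (
        region_id   INT,
        region_name VARCHAR(64)
    )
    DISTSTYLE ALL;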
19
3 additional system columns in each table for Multiversion Concurrency Control, MVCC
20
Not a good cardinality column - loc
Only 1 node is used, so the work is not very parallel. When in doubt, use EVEN distribution.
22
Complete copy of the table on each node (ALL distribution)
23
Storage
Disks: the nodes have more storage than advertised. Advertised = usable size by the customer. Data is mirrored - there are always 2 copies of data on disk, in local and remote compute-node partitions. When a commit is executed (e.g., after an INSERT), data is stored on two different nodes in the cluster. Raw storage is roughly 2.5-3 times the usable storage (temp area, system tables, etc.).
Slices: virtual compute nodes - compute nodes are divided virtually into slices, so parallelism is achieved within each compute node.
Blocks: 1 MB immutable blocks; 11 or more encodings (check); zone maps (min/max values); 8-16 million values per block (see the query sketch below).
Columns: a chain of blocks for each column on each slice. There are 2 chains if the table has a sort key: one for the sorted region and one for the unsorted region. Each table also has additional system columns (row id, deleted id, transaction id) that are not queryable and are used for MVCC. Column properties: distribution key, sort key, compression encoding.
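A minimal sketch of inspecting the 1 MB blocks described above; the system tables STV_BLOCKLIST and STV_TBL_PERM are standard, while the table name 'orders' is an assumption:

    -- Count blocks per column and slice, with the zone-map style min/max per column
    SELECT b.slice,
           b.col,
           COUNT(*)        AS block_count,
           MIN(b.minvalue) AS min_value,
           MAX(b.maxvalue) AS max_value
    FROM stv_blocklist b
    JOIN stv_tbl_perm  p
      ON b.tbl = p.id AND b.slice = p.slice
    WHERE p.name = 'orders'
    GROUP BY b.slice, b.col
    ORDER BY b.slice, b.col;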
24
2 petabytes of compressed data on disk (roughly 6 petabytes of uncompressed data in a single Redshift cluster)
UPDATE = DELETE + INSERT. Vacuum: the process of cleaning up deleted records (similar in purpose to the SAP HANA Delta Merge process).
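A minimal sketch, assuming the hypothetical 'orders' table from the earlier examples:

    -- Reclaim space left by DELETE/UPDATE and re-sort the table
    VACUUM FULL orders;
    -- Refresh the optimizer statistics afterwards
    ANALYZE orders;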
25
Performance tuning: DistKey then SortKey
Analyze Compression
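A minimal sketch of the compression analysis, again using the hypothetical 'orders' table:

    -- Samples the table's data and reports a suggested encoding per column
    ANALYZE COMPRESSION orders;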
26
Query Lifecycle
Planner: builds execution plans
Optimizer: uses table statistics
Step: an individual operation (scan, sort, hash, aggregate)
Segment: a compiled C++ binary made up of steps
Stream: a collection of segments; one stream must finish before the following stream can start (see the sketch below)
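A minimal sketch of looking at this lifecycle from SQL; EXPLAIN and SVL_QUERY_SUMMARY are standard, while the table name and query id are assumptions:

    -- Plan built by the leader node before C++ code generation
    EXPLAIN
    SELECT order_date, COUNT(*)
    FROM orders
    WHERE order_date >= '2023-01-01'
    GROUP BY order_date;

    -- After execution, streams (stm), segments (seg), and steps can be inspected
    -- (replace 12345 with the query id from STL_QUERY)
    SELECT stm, seg, step, label, rows
    FROM svl_query_summary
    WHERE query = 12345
    ORDER BY stm, seg, step;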
27
Query Lifecycle: results are returned at the end of each stream,
so a new cycle starts between the leader node and the compute nodes to process the following stream (of generated C++ code) based on the results of the previous stream.
28
Query Lifecycle
29
Query Lifecycle: results are sent to the client.
The first compilation causes a delay; subsequent executions use the compiled code from the cache. The cache is cleared during the maintenance cycle.
30
Move Data from On-Premises to Redshift
1. Export the on-premises data
2. Transfer the data files to Amazon S3
3. Load the data from S3 into Amazon Redshift using COPY (see the sketch below)
Alternatively: AWS DMS
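A minimal sketch of the COPY step; the bucket, prefix, and IAM role ARN are placeholder assumptions:

    COPY orders
    FROM 's3://example-bucket/exports/orders/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
    FORMAT AS CSV
    IGNOREHEADER 1
    REGION 'us-east-1';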
31
Multi-File Load with COPY Command
Load data in parallel: divide the CSV data into multiple files so the COPY command can load them across the slices (see the sketch below).
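A minimal sketch, assuming the CSV has been split into parts named orders/part_000.csv, orders/part_001.csv, ... in S3; COPY loads every object sharing the key prefix in parallel across the slices:

    COPY orders
    FROM 's3://example-bucket/exports/orders/part_'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
    FORMAT AS CSV;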
32
Redshift Spectrum
33
Redshift Spectrum
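A minimal sketch of how Redshift Spectrum queries data in S3; the Glue catalog database, schema name, S3 path, and IAM role are all assumptions:

    -- Register an external schema backed by the AWS Glue Data Catalog
    CREATE EXTERNAL SCHEMA spectrum_schema
    FROM DATA CATALOG
    DATABASE 'spectrum_db'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftSpectrumRole'
    CREATE EXTERNAL DATABASE IF NOT EXISTS;

    -- Define an external table over CSV files in S3
    CREATE EXTERNAL TABLE spectrum_schema.clicks_ext (
        event_time TIMESTAMP,
        url        VARCHAR(2048)
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    LOCATION 's3://example-bucket/clicks/';

    -- External S3 data can be queried (and joined to local tables) with plain SQL
    SELECT COUNT(*) FROM spectrum_schema.clicks_ext;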
34
How to Connect to Amazon Redshift
AWS Amazon Redshift Query Editor
Amazon Redshift API
SQL Workbench/J
SQL Server Management Studio
DataRow
pgAdmin
JackDB
C:\My\VSProjects\AmazonRedshiftAPISample\AmazonRedshiftAPISample.sln