1
MySQL Cluster overview and ndb-7.0 features demo
Presented By: Matthew Montgomery MySQL Meetup San Antonio, TX
2
Who am I? Matthew Montgomery, Senior Support Engineer working for Sun, MySQL Cluster team, based in San Antonio, TX
3
Interactivity If you have a question, ask it! (No matter how silly)
4
What is MySQL Cluster?
5
A Storage Engine
6
Storage Engines unique feature of MySQL
7
No one best way to store tables
8
Choice of Storage Engines
9
Different engine per table (if you want)
10
Just like a Virtual File System Layer
Application Application Application Application Kernel VFS ext3 ext4 vfat XFS
11
Just like a Virtual File System Layer
Application Application Application Application MySQL Server Storage Engine API MyISAM InnoDB Falcon NDB Cluster
12
What is MySQL Cluster?
13
What is MySQL Cluster? A High Availability
14
What is MySQL Cluster? A High Availability High Performance
15
What is MySQL Cluster? A High Availability High Performance In Memory (and disk in 5.1+)
16
What is MySQL Cluster? A High Availability High Performance In Memory (and disk in 5.1+) Shared Nothing
17
What is MySQL Cluster? A High Availability High Performance In Memory (and disk in 5.1+) Shared Nothing Clustered
18
What is MySQL Cluster? A High Availability High Performance In Memory (and disk in 5.1+) Shared Nothing Clustered Storage Engine
20
Designed for Five Nines (99.999%) Uptime
21
Sub-Second Failover
22
Sub-Second Failover High Availability mysqld mysqld Transactions
Data Nodes
25
Hot “Online” Backup
26
No Locks during Backup
27
Hot (Online) Backup High Availability mysqld mysqld Transactions
Data Nodes
29
Hot (Online) Compressed Backup
High Availability Hot (Online) Compressed Backup mysqld mysqld Transactions Compressed Compressed Data Nodes
30
Configurable Redundancy
NoOfReplicas
31
NoOfReplicas=1 Data
33
NoOfReplicas=1 Data. No surviving replica of this data
34
NoOfReplicas=2 Data Data
36
NoOfReplicas=2 Data Data. There is a copy of the data here
43
NoOfReplicas=2 Data Data. No surviving replicas for this data
45
NoOfReplicas=3 Data Data Data
47
NoOfReplicas=4 Data Data Data Data
48
For Production: NoOfReplicas=2. With NoOfReplicas=1 (bad), a node failure leaves no surviving replica of its data; with NoOfReplicas=2, there is still a copy of the data on another node.
49
What is MySQL Cluster? A High Availability High Performance In Memory (and disk in 5.1+) Shared Nothing Clustered Storage Engine
50
High Performance Not from BEGIN to COMMIT
51
High Performance ...but through Parallelism
52
High Performance Parallelism mysqld mysqld Transactions Data Nodes
53
What is MySQL Cluster? A High Availability High Performance In Memory (and disk in 5.1+) Shared Nothing Clustered Storage Engine
54
In Memory (and disk) Data and Indexes kept in main memory
55
What is MySQL Cluster? Non-Indexed attributes on disk (introduced in 5.1)
56
Row
57
Row in memory part
58
Row in memory part on disk part
59
In memory? What about machine/cluster failures?
60
Check point to disk
61
Check point to disk Frequent, Configurable
62
Check point to disk Not complete data loss after power outage
63
What is MySQL Cluster? A High Availability High Performance In Memory (and disk in 5.1+) Shared Nothing Clustered Storage Engine
64
Shared Nothing Commodity PCs
65
Shared Nothing Commodity Interconnects
66
Shared Nothing Commodity Interconnects Ethernet
67
Shared Nothing Commodity Interconnects Ethernet SCI
68
Shared Nothing No Expensive Shared Disk
69
Shared Nothing No Expensive Shared Disk (so no single point of failure)
70
What is MySQL Cluster? A High Availability High Performance In Memory (and disk) Shared Nothing Clustered Storage Engine
71
What is MySQL Cluster? Clustered
72
What is MySQL Cluster? A High Availability High Performance In Memory (and disk) Shared Nothing Clustered Storage Engine
73
What is MySQL Cluster? ENGINE=NDBCLUSTER
74
What is MySQL Cluster? A High Availability High Performance In Memory (and disk in 5.1+) Shared Nothing Clustered Storage Engine
75
What is MySQL Cluster?
76
What is MySQL Cluster? A Collection of Nodes
77
Node Types
78
Node Types Data Nodes (ndbd)
79
Data Nodes
80
Data Nodes Data nodes (running ndbd)
81
Data Nodes Data nodes (running ndbd) These store data
82
Data Nodes This cloud means a cluster Data nodes (running ndbd)
These store data
83
Data Nodes grouped
84
Data Nodes grouped into nodegroups
85
NoOfReplicas NoOfReplicas=2 Nodegroup 0 Nodegroup 1
86
NoOfReplicas=2: DATA in Nodegroup 0, DATA in Nodegroup 1
87
Data Nodes Nodegroup 0 Nodegroup 1
88
Data Nodes pk Nodegroup 0 Nodegroup 1
89
HASH(pk) pk
92
SELECT * from t1 pk
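The slides above show the primary key being hashed to pick where a row lives. A minimal sketch of the idea, using Python's md5 as a stand-in for NDB's internal MD5-based hash function (the function name and fragment counts here are illustrative, not the real implementation):

```python
import hashlib

def fragment_for_pk(pk: int, num_fragments: int) -> int:
    """Pick a fragment by hashing the primary key.

    NDB uses an internal MD5-based hash of the partition key;
    md5 here is a stand-in to illustrate the idea.
    """
    digest = hashlib.md5(str(pk).encode()).digest()
    return int.from_bytes(digest[:4], "little") % num_fragments

# Rows spread across two fragments (one per node group)
placement = {pk: fragment_for_pk(pk, 2) for pk in range(8)}
assert set(placement.values()) <= {0, 1}
```

Because the hash is deterministic, any SQL node can compute which data node holds a given primary key without consulting a central catalog.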
93
MySQL Servers talk to the Data Nodes
mysqld Nodegroup 0 Nodegroup 1
94
One used as a Transaction Coordinator
mysqld One storage node as Transaction Coordinator (TC) Nodegroup 0 Nodegroup 1
96
Many mysqld nodes can send data. One storage node acts as Transaction Coordinator (TC). Nodegroup 0 Nodegroup 1
97
mysqld Several nodes can be involved in processing a single query
98
mysqld Several nodes can be involved in processing a single query Parallelism=Better Performance
99
Node Types Data Nodes (ndbd) Up to 48 in one cluster
100
Node Types Management Server (ndb_mgmd)
101
Management Server: config.ini
[ndbd default]
NoOfReplicas= 2
DataMemory= 400M
IndexMemory= 32M
DataDir= /usr/local/mysql/cluster
[ndbd]
HostName=
[ndbd]
HostName=
[ndb_mgmd]
HostName=
[mysqld]
103
Management Server config.ini mysqld mysqld ndb_mgmd ndbd ndbd
108
Node Types Management Server (ndb_mgmd) also involved in arbitration, starting backups, issuing commands to nodes (start, stop, restart)
109
Node Types SQL Nodes (mysqld) also called API nodes
110
SQL Nodes mysqld mysqld mysqld
111
SQL and API Nodes mysqld mysqld mysqld NDB API NDB API
112
DELETE UPDATE INSERT mysqld mysqld mysqld NDB API NDB API
113
DELETE UPDATE INSERT update() update() mysqld mysqld mysqld NDB API
114
Node Types SQL Nodes (mysqld) Accessed like any other MySQL Server
115
Node Types API Nodes Talk NDB API directly to the Data Nodes
116
Node Types: Management Client
Talks to the Management Server. Used to administer the cluster
117
Perl Mono PHP .NET mysql Ruby DELETE UPDATE INSERT Management Server
mysqld mysqld mysqld Management Server Mgm client Data Nodes NDB API NDB API update() update()
118
Physical Requirements
119
A node is a process, not a computer
120
At least three physical machines for High Availability
122
Why?
123
Three machines minimum for HA
B A
124
Three machines minimum for HA
B A Can no longer see A
125
Three machines minimum for HA
B A Can no longer see A Did A Die?
126
Three machines minimum for HA
B A Or did the network link between A and B die? Can no longer see A
127
Three machines minimum for HA
B A Or did the network link between A and B die? Can no longer see B Can no longer see A
128
Who Is In Charge Now? B A Or did the network link between A and B die? Can no longer see B Can no longer see A
129
Split Brain = Bad B A Or did the network link between A and B die? Can no longer see B Can no longer see A
130
We detect possible Split Brain scenarios
Nodes will shut down instead
131
Three machines minimum for HA
B A Management server on 3rd machine
132
Three machines minimum for HA
B A Management server on 3rd machine Is Arbitrator
136
Physical Requirements
Management Server
137
Management Server Not CPU Intensive
138
Management Server Not CPU Intensive Not Memory Intensive
139
Management Server Not CPU Intensive Not Memory Intensive
Can have multiple for redundancy
140
Physical Requirements
Data Nodes
141
Data Node Requirements
Lots of Memory: all indexed data in memory; data in memory; cache for data on disk
142
Data Node Requirements
Disk IO and capacity: the IO rate can be calculated (with disk-based tables the calculation is harder); space usage can be calculated
143
Data Node Requirements
CPU: often not CPU bound (depends on queries). Before 7.0, ndbd was single threaded (apart from a few helper threads), so SMP does not buy you a lot. Multithreaded ndbmtd arrives in 7.0.
144
Physical Requirements
SQL Node: Many API/SQL nodes are needed to load Storage Nodes
145
Physical Requirements
SQL Node: MySQL is multi-threaded SMP can help
146
A Configuration
[ndbd default]
NoOfReplicas= 2
DataMemory= 400M
IndexMemory= 32M
DataDir= /usr/local/mysql/cluster
[ndbd]
HostName=
[ndbd]
HostName=
[ndb_mgmd]
HostName=
[mysqld]
147
A Configuration: the [ndbd default] section holds default settings for data nodes (ndbd)
148
A Configuration: each [ndbd] section holds settings for one data node
150
A Configuration: the [ndb_mgmd] section holds settings for a management server
151
A Configuration: the [mysqld] section holds settings for a SQL/API node
152
Demo Configuration: 2 replicas; 2n data nodes (2 or 4); 50MB for data; 5MB for indexes; 1 management server; 3 MySQL Servers/API nodes; no other special options
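A config.ini matching the demo above might look like the following. The host addresses are placeholders (the originals are not shown on the slides), and DataMemory/IndexMemory use the demo's 50MB/5MB:

```ini
[ndbd default]
NoOfReplicas= 2
DataMemory= 50M
IndexMemory= 5M
DataDir= /usr/local/mysql/cluster

# One [ndbd] section per data node (placeholder addresses)
[ndbd]
HostName= 192.168.0.10
[ndbd]
HostName= 192.168.0.11

[ndb_mgmd]
HostName= 192.168.0.1

# One empty [mysqld] slot per SQL/API node
[mysqld]
[mysqld]
[mysqld]
```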
153
A Configuration: the ndb_mgmd node
154
A Configuration: ndb_mgmd and the ndbd data nodes
155
A Configuration: ndb_mgmd, ndbd, and the mysqld nodes
158
A Configuration: applications connect through the mysqld nodes
159
Starting Nodes
160
A Starting node needs: Configuration Information
161
A Starting node needs: Location of Management Server
162
The Connect String
163
The Connect String Lists Management Servers
164
A Connect String:
165
A Connect String: :1186
166
A Connect String: :9310
167
A Connect String: ,
168
A Connect String: , ,nodeid=3
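A connect string is a comma-separated list of management servers, optionally with a port per host and a nodeid option. A small parser sketch (the host names below are made up for illustration):

```python
def parse_connectstring(cs: str):
    """Split an NDB connect string into management hosts and options.

    Handles the forms shown above: host, host:port, several hosts
    separated by commas, and a trailing nodeid=N option.
    """
    hosts, options = [], {}
    for part in cs.split(","):
        part = part.strip()
        if "=" in part:                      # e.g. nodeid=3
            key, value = part.split("=", 1)
            options[key] = value
        else:                                # e.g. mgmhost:1186
            host, _, port = part.partition(":")
            hosts.append((host, int(port) if port else 1186))
    return hosts, options

hosts, opts = parse_connectstring("mgm1:1186,mgm2:9310,nodeid=3")
assert hosts == [("mgm1", 1186), ("mgm2", 9310)]
assert opts == {"nodeid": "3"}
```

Note that 1186 is the default management server port when none is given.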
169
A Bad Connect String: mgmsrv1,mgmsrv2
170
DNS is Not Reliable Do Not Trust DNS to work, use IP Addresses (or hosts file)
171
DHCP is Not Reliable Do Not Trust DHCP to work, use static IP Addresses
172
Starting The Cluster
173
Starting the cluster
1. Management server
2. Data Nodes
3. MySQL Server Nodes
174
Starting the cluster
1. Management server: needs to be started first (so new nodes can get the configuration). On the management server: $ ndb_mgmd -f config.ini
2. Data Nodes
3. MySQL Server Nodes
175
Starting the cluster
1. Management server
2. Data Nodes: on each storage node: $ ndbd -c (the -c option takes the connect string)
3. MySQL Server Nodes
176
Starting the cluster
1. Management server
2. Data Nodes
3. MySQL Server Nodes: make sure the ndbcluster option is enabled (command line or my.cnf); make sure the connect string is specified (command line or my.cnf); start the MySQL server in your preferred way (e.g. /etc/init.d/mysql start)
177
MySQL Server Options
1. Create a my.cnf file
2. Add ndbcluster option
3. Add ndb-connectstring option
4. Set unique port, socket, datadir
5. mysql_install_db --defaults-file=/path/my.cnf
6. ./mysqld --defaults-file=/path/to/my.cnf
7. Repeat for each SQL node
178
Basic Monitoring Using the Management Client
179
Check the Cluster Log On Management Server, $DataDir/ndb_<id>_cluster.log
180
Using MySQL Cluster ENGINE=NDBCLUSTER
181
Let's CREATE TABLE
CREATE TABLE t1 (
  pk1 INT PRIMARY KEY AUTO_INCREMENT,
  v VARCHAR(100)
) ENGINE=NDBCLUSTER;
182
SELECT, INSERT, UPDATE, DELETE from all SQL Nodes and see the new and updated rows!
183
Isn't this just like Replication?
184
Isn't this just like Replication?
No.
185
MySQL Replication Asynchronous Read-only slaves
186
MySQL Cluster Synchronous, All nodes can perform reads/writes
187
MySQL Replication Changes made by a transaction are available on a slave after a small amount of time
188
MySQL Cluster Changes made by a transaction are instantly available from all nodes on commit
189
Two-Phase Commit Protocol
MySQL Cluster Two-Phase Commit Protocol
190
MySQL Cluster Two-Phase Commit Protocol
Ensures consistency in event of failure. (with a performance penalty)
191
Cluster vs Replication
With replication, a single transaction will be COMMITted quicker. But if the master fails before a slave retrieves the binary log, the transaction is lost.
193
Cluster vs Replication
With Cluster, COMMIT means transaction can survive node failures
194
Cluster and Replication
We'll cover later
195
What else does MySQL Cluster Support?
196
All the Standard 5.1 features
197
Views
198
Stored Procedures
199
Triggers
200
Triggers Implemented in the MySQL Server, so changes made with NDB API programs do not fire triggers
201
Standard Permissions GRANT/REVOKE
202
...and a caveat
203
The mysql database is per SQL node, not per cluster.
204
So GRANT/REVOKE, Triggers, Stored Procedures, Views have to be set up on each SQL node.
205
Also, no native FOREIGN KEYs support
206
You can emulate foreign keys on the SQL nodes using triggers.
207
Also, no FULLTEXT indexes
208
Distributed Metadata
209
Notice how the 2nd MySQL Server knew that there were tables in the Cluster
210
The MySQL Server uses .frm files to track table metadata
211
For MySQL Cluster, we store the FRM files in the Cluster
212
Retrieving them when needed
213
Distributed Metadata MySQL server MySQL server .frm files .frm files
Distributed database
214
Distributed Metadata create table t1 ... MySQL server MySQL server
.frm files MySQL server .frm files MySQL server Distributed database
215
Distributed Metadata create table t1 ... MySQL server MySQL server
.frm files MySQL server .frm files MySQL server copy .frm compressed .frm copies Distributed database
216
Distributed Metadata select * from t1 create table t1 ... MySQL server
.frm files MySQL server .frm files MySQL server copy .frm compressed .frm copies Distributed database
217
Distributed Metadata select * from t1 create table t1 ... MySQL server
.frm files MySQL server .frm files MySQL server autodiscover .frm copy .frm compressed .frm copies Distributed database
218
Data Distribution
219
MySQL Cluster implements horizontal partitioning
220
pk 2 Nodes
221
pk 2 Nodes F1 F2 Two Fragments
222
NoOfReplicas=2 pk 2 Nodes F1 F2 Two Fragments
223
NoOfReplicas=2 pk 2 Nodes F1 F1 F2 F2 Two Fragments
224
NoOfReplicas=2 pk 2 Nodes F1 F1 F2 F2
226
NoOfReplicas=2 pk 2 Nodes F1 F1 F1 F1 F2 F2 F2 F2
227
NoOfReplicas=2 pk 2 Nodes F1 F1 F1 F1 F2 F2 F2 F2 Primary Replica
228
NoOfReplicas=2 pk 2 Nodes F1 F1 F1 F1 F2 F2 F2 F2 Secondary Replica
229
Why two fragments for two nodes?
230
What is a Primary Replica responsible for?
231
What is a Primary Replica responsible for?
Locks
232
What is a Primary Replica responsible for?
Locks, Reads
233
What is a Primary Replica responsible for?
Locks, Reads (among other things)
234
Two fragments for a two node cluster
Load Balances
235
What about node failure?
236
NoOfReplicas=2 pk 2 Nodes F1 F1 F1 F1 F2 F2 F2 F2
238
NoOfReplicas=2 pk 2 Nodes F1 F1 F1 F1 F2 F2 F2 F2 Transparent Failover
239
NoOfReplicas=2 pk 2 Nodes F1 F1 F1 F1 F2 F2 F2 F2 Primary Replica
240
Surviving nodes take over
241
Surviving nodes have increased load
242
What about node recovery?
243
NoOfReplicas=2 pk 2 Nodes F1 F1 F1 F1 F2 F2 F2 F2 Primary Replica
244
NoOfReplicas=2 pk 2 Nodes F1 F1 F1 F1 F2 F2 F2 F2 Synchronize data
245
NoOfReplicas=2 pk 2 Nodes F1 F1 F1 F1 F2 F2 F2 F2
246
What about ongoing transactions during node failure?
247
Transactions using a failed node are aborted
248
What about MySQL Server node failure?
249
Application can connect to another MySQL Server
250
How does an Application connect to another MySQL Server?
251
Load Balancing system
252
Connector-based (JDBC) or a hardware load balancer
253
What about Management Server failure?
254
Continued operation of cluster not dependent on Management Server
255
Management Server required to start new nodes
256
Can have multiple Management Servers (but there is increased admin work)
257
So...
258
Let's kill things
259
kill -9 (Angel and NDB)
260
See the failure reported in the logs
261
See the failure from the management client
262
See that things still work
263
Run some SELECT, INSERT, UPDATE queries
264
Restart the failed data node
265
See it rejoin
266
Run more queries
267
See that all is good with the world
268
Two-Phase Commit Protocol
269
Two-phase Commit Protocol
Keeping DB nodes synchronized facilitates immediate failover (TC, DB 1, DB 2, DB 3, DB 4; Node group 1, Node group 2). Two-phase commit: in the prepare phase, both node groups get their information updated; in the commit phase, the change is committed.
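The prepare/commit rounds can be sketched as follows. This is a toy coordinator to illustrate the protocol shape, not the NDB implementation:

```python
class Node:
    """A toy participant that stages a change, then applies it."""
    def __init__(self):
        self.value, self.staged = None, None

    def prepare(self, value):
        self.staged = value       # phase 1: stage the change
        return True               # vote yes

    def commit(self):
        self.value = self.staged  # phase 2: make the change visible

def two_phase_commit(nodes, value):
    # Phase 1: every replica in every node group must acknowledge
    if not all(n.prepare(value) for n in nodes):
        return False              # any "no" vote aborts the transaction
    # Phase 2: only now is the change committed everywhere
    for n in nodes:
        n.commit()
    return True

replicas = [Node() for _ in range(4)]   # e.g. 2 node groups x 2 replicas
assert two_phase_commit(replicas, "row-v2")
assert all(n.value == "row-v2" for n in replicas)
```

The key property: no replica makes the change visible until all replicas have staged it, which is what lets a surviving replica take over consistently after a node failure.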
276
Transaction Over 3 Replicas
TC DB 1 DB 4 DB 2 DB 5 DB 3 DB 6 Node group 1 Node group 2
289
Two-phase commit enables recovery in a distributed system
290
Nodes communicate with each other over an interconnect
291
Ethernet is common/cheap
292
Ethernet isn't the fastest in the world
293
Performance of some queries is very latency dependent
294
MySQL Cluster abstracts away the communication method
295
TCP Transporter
296
SCI Transporter
297
SHM Transporter (alpha)
298
In reality: use TCP or SCI (with appropriate hardware)
299
In reality: use gigabit Ethernet, not 100Mbit, for TCP
300
Use private network for MySQL Cluster traffic
301
Inter-node communication is not authenticated and not encrypted
302
Other applications on the network may interfere with heartbeats
303
Heartbeats
304
Failure detection: Heartbeats, lost connections
Nodes are organized in a logical circle (DB Node 1, DB Node 2, DB Node 3, DB Node 4); heartbeat messages are sent to the next node in the circle. All nodes must have the same view of which nodes are alive.
305
Schema considerations for MySQL Cluster
306
Every table has a PRIMARY KEY
307
Every table has a PRIMARY KEY
Even if you don't explicitly set one
308
Three types of indexes
309
Three types of indexes: 1) Primary Hash Index 2) Unique Hash Index 3) Ordered T-tree Index
310
UNIQUE (SQL) is Unique Hash and Ordered Tree (NDB)
311
UNIQUE USING HASH (SQL) is Unique Hash (NDB)
312
PRIMARY KEY (SQL) is Primary Hash and Ordered Tree (NDB)
313
PRIMARY KEY USING HASH (SQL) is Primary Hash (NDB)
314
Q: What query can use a hash index?
A: Key lookup
315
Q: What query can use an ordered index?
A: Range scans and ORDER BY
316
So what happens in a table scan?
317
MySQL Server NDBCLUSTER Engine Data Nodes (ndbd)
318
MySQL Server NDBCLUSTER Engine TC TC TC TC Data Nodes (ndbd)
320
MySQL Server NDBCLUSTER Engine SCAN_TABREQ TC TC TC TC Data Nodes (ndbd)
321
MySQL Server NDBCLUSTER Engine SCAN_FRAGREQ SCAN_FRAGREQ TC TC TC TC SCAN_FRAGREQ Data Nodes (ndbd)
322
MySQL Server NDBCLUSTER Engine LQH LQH LQH LQH Data Nodes (ndbd)
323
MySQL Server NDBCLUSTER Engine TRANSID_AI LQH LQH LQH LQH Data Nodes (ndbd)
330
MySQL Server ORDER BY done here NDBCLUSTER Engine TRANSID_AI LQH LQH LQH LQH Data Nodes (ndbd)
331
MySQL Server WHERE done here NDBCLUSTER Engine TRANSID_AI LQH LQH LQH LQH Data Nodes (ndbd)
332
MySQL Server NDBCLUSTER Engine SCAN_FRAGCONF SCAN_FRAGCONF TC LQH LQH LQH SCAN_FRAGCONF LQH Data Nodes (ndbd)
333
MySQL Server NDBCLUSTER Engine SCAN_TABCONF TC Data Nodes (ndbd)
334
Engine Condition Pushdown
335
Evaluate conditions in parallel on data nodes
336
Only send matching rows to API
337
MySQL Server NDBCLUSTER Engine LQH LQH LQH LQH Data Nodes (ndbd)
342
MySQL Server NDBCLUSTER Engine TRANSID_AI LQH LQH LQH LQH Data Nodes (ndbd)
348
MySQL Server ORDER BY done here NDBCLUSTER Engine TRANSID_AI LQH LQH LQH LQH Data Nodes (ndbd)
349
MySQL Server WHERE already done NDBCLUSTER Engine Data Nodes (ndbd)
350
SET engine_condition_pushdown=ON|OFF;
351
Just about any normal comparisons can be pushed down
352
Common to see 5-10x improvement
353
Details in EXPLAIN [EXTENDED]
354
Batching
355
Rule: One large network packet is quicker than several small ones
356
Batching leads to improved performance
357
We do:
358
Batched Inserts
359
INSERT INTO t1 (a) values (1),(2),(3),(4),(5);
360
Batched Lookups
361
SELECT * FROM t1 WHERE pk1 IN (11,22,63,14,25,6,9,8);
362
All key lookups sent together
363
Batched lookups: 2-3x improvement
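The 2-3x figure falls out of round trips: sending N keys in one batch pays the network latency once instead of N times. A toy cost model with made-up latency numbers, just to show the shape of the saving:

```python
def lookup_cost_ms(num_keys, batched, latency_ms=0.2, per_key_ms=0.02):
    """Rough cost model: each round trip pays the network latency once.

    latency_ms and per_key_ms are illustrative values, not measurements.
    """
    round_trips = 1 if batched else num_keys
    return round_trips * latency_ms + num_keys * per_key_ms

one_by_one = lookup_cost_ms(8, batched=False)   # 8 round trips
batched    = lookup_cost_ms(8, batched=True)    # 1 round trip
assert batched < one_by_one
```

The larger the per-trip latency relative to per-key work, the bigger the win from batching, which is why it matters most on slower interconnects.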
364
Query Cache
365
Invalidated when table is changed
366
MySQL Cluster has slightly different semantics...
367
ndb_cache_check_time
368
Milliseconds to wait before checking the query cache
369
Ask Data nodes if other nodes have changed anything
370
If table changed, invalidate Query Cache
371
This means:
372
For up to ndb_cache_check_time milliseconds, result may be old
373
BACKUP
374
Data not backed up is data not wanted
375
Two options for backing up MySQL Cluster
376
mysqldump Backup
377
mysqldump: single connection to single MySQL Server
378
mysqldump: human readable
379
mysqldump: READ_COMMITTED
380
mysqldump: READ_COMMITTED i.e. not consistent
381
MySQL Cluster Native Backup
382
MySQL Cluster Native Backup non-blocking parallel consistent
383
ndb_mgm> START BACKUP
384
Each data node participates in BACKUP
385
Each node performs backup for its primary fragments
386
Backups stored in ndb_<id>_fs/BACKUP/
387
Backup API
388
Backup API updates updates
389
Backup API updates updates Data Log Data Log
401
Backup API updates updates Control Control Data Log Data Log
402
RESTORE
403
ndb_restore
404
Restore Control Control Data Log Data Log
405
Restore Filesystem Control Control Data Log Data Log
406
Restore Filesystem Filesystem Control Control Data Log Data Log
408
Restore - Metadata Control Control Data Log Data Log
409
Restore - Data Control Control Data Log Data Log
412
Restore – Data (Log) Control Control Data Log Data Log
414
Restore API updates updates Control Control Data Log Data Log
415
Some Configuration Parameters
416
Several different categories
417
Why specify resource limits?
418
We statically allocate memory on startup for some resources
419
Deterministic behavior for resource allocation at run time
420
1. Memory
421
DataMemory
422
DataMemory Limits amount of data that can be stored in the Cluster
423
DataMemory Per Node
424
DataMemory Allocated to tables in Pages
425
4 node, 2 replica DataMemory=100MB
426
4 node, 2 replica DataMemory=100MB
400MB total
427
4 node, 2 replica DataMemory=100MB
200MB of Data (2 copies)
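The arithmetic on the last few slides generalizes: DataMemory is per node, and every row is stored NoOfReplicas times, so usable capacity is nodes x DataMemory / NoOfReplicas. A quick sketch (the helper name is ours, for illustration):

```python
def usable_data_mb(num_nodes, data_memory_mb, no_of_replicas):
    """DataMemory is per node; each row is stored NoOfReplicas times."""
    return num_nodes * data_memory_mb // no_of_replicas

# 4 nodes, 2 replicas, DataMemory=100MB: 400MB allocated, 200MB of data
assert usable_data_mb(4, 100, 2) == 200
# Same idea for IndexMemory: 4 nodes, IndexMemory=10MB -> 20MB of indexes
assert usable_data_mb(4, 10, 2) == 20
```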
428
IndexMemory Memory used by hash indexes
429
4 node, 2 replica IndexMemory=10MB
20MB of Indexes (2 copies)
430
StringMemory
431
Memory used for table names, column names, FRM files etc
StringMemory Memory used for table names, column names, FRM files etc
432
(the default 5% is likely fine up to ~1000 tables)
StringMemory (the default 5% is likely fine up to ~1000 tables)
433
2. Transaction Parameters
434
MaxNoOfConcurrentTransactions
Maximum number of ongoing transactions
435
MaxNoOfConcurrentOperations
Maximum number of uncommitted changed rows (divided by the number of data nodes)
436
3. Scans and Buffering
437
MaxNoOfConcurrentScans
Maximum number of parallel scans (for each data node)
438
BatchSizePerLocalScan
Linked with ScanBatchSize – how many rows we batch for scans
439
4. Logging and Checkpointing
440
NoOfFragmentLogFiles
441
NoOfFragmentLogFiles
Sets number of REDO log files for each node
442
NoOfFragmentLogFiles
Each transaction written to REDO Log
443
NoOfFragmentLogFiles
REDO Log used in System Restart
444
NoOfFragmentLogFiles
REDO Log record exists for 2 local checkpoints
445
NoOfFragmentLogFiles
If no room, transactions aborted with: 410 Out of log file space temporarily
446
NoOfFragmentLogFiles
Allocated in units of 64MB (changed with FragmentLogFileSize)
447
NoOfFragmentLogFiles
Default is 8; 8x64MB = 512MB
448
NoOfFragmentLogFiles
Update heavy systems need large values... even up to 300
449
NoOfFragmentLogFiles
300 x 64MB = 19.2GB
450
NoOfFragmentLogFiles
Can be changed on a running cluster... rolling --initial restart
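Following the slides' 64MB units, the REDO log footprint per data node works out as below (a sketch; the unit size is the default and changes with FragmentLogFileSize):

```python
def redo_log_mb(no_of_fragment_log_files, unit_mb=64):
    """Total REDO log space per data node, allocated in 64MB units
    per the slides (the unit changes with FragmentLogFileSize)."""
    return no_of_fragment_log_files * unit_mb

print(redo_log_mb(8))    # default: 8 x 64MB = 512MB
print(redo_log_mb(300))  # update-heavy: 300 x 64MB = 19200MB (~19.2GB)
```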
451
5. Metadata Objects
452
MaxNoOfAttributes
453
Maximum number of columns
MaxNoOfAttributes Maximum number of columns for all tables
454
MaxNoOfTables
455
Maximum number of tables
MaxNoOfTables Maximum number of tables
456
MaxNoOfOrderedIndexes
457
MaxNoOfOrderedIndexes
Maximum number of ordered indexes
458
MaxNoOfUniqueHashIndexes
459
MaxNoOfUniqueHashIndexes
Maximum number of unique hash indexes
460
6. Behavior
461
LockPagesInMainMemory
462
LockPagesInMainMemory
Prevents memory allocated by ndbd from being swapped out by the Operating System
463
LockPagesInMainMemory
(This is a good idea)
464
Diskless
465
Nothing written to disk
Diskless Nothing written to disk
466
Diskless No checkpointing
467
If enabled, neither records nor tables survive a cluster crash
Diskless If enabled, neither records nor tables survive a cluster crash
468
...but requires much less (zero) disk space and bandwidth.
Diskless ...but requires much less (zero) disk space and bandwidth.
469
7. Timeouts, Intervals, Disk Paging
470
TimeBetweenWatchDogCheck
471
TimeBetweenWatchDogCheck
Remember the Angel process from before?
472
TimeBetweenWatchDogCheck
Every TimeBetweenWatchDogCheck milliseconds, check that the main thread isn't stuck
473
StartPartialTimeout
474
Normally, we wait for all data nodes before starting the cluster
StartPartialTimeout Normally, we wait for all data nodes before starting the cluster
475
StartPartialTimeout After StartPartialTimeout milliseconds (30s), we'll perform a partial start
476
0 means always wait for all the data nodes
StartPartialTimeout 0 means always wait for all the data nodes
477
StartPartitionedTimeout
478
StartPartitionedTimeout
If after StartPartialTimeout the cluster could be in a partitioned state, we wait an additional StartPartitionedTimeout milliseconds
479
HeartbeatIntervalDbDb
480
HeartbeatIntervalDbDb
Every HeartbeatIntervalDbDb, heartbeats sent between Data nodes
481
HeartbeatIntervalDbDb
Maximum time to discover node failure is 4 times HeartbeatIntervalDbDb
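The worst-case detection time follows directly (the 1500ms default shown here is from the MySQL Cluster documentation and is worth verifying for your version):

```python
def max_failure_detection_ms(heartbeat_interval_ms=1500):
    """A node is declared dead after 4 missed heartbeats, so the
    worst-case discovery time is 4 x HeartbeatIntervalDbDb."""
    return 4 * heartbeat_interval_ms

print(max_failure_detection_ms())     # 6000 ms with the default interval
print(max_failure_detection_ms(500))  # 2000 ms with a 500ms interval
```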
482
HeartbeatIntervalDbApi
483
HeartbeatIntervalDbApi
Each Data node sends heartbeats to each API node connected to it
484
TimeBetweenLocalCheckpoints
Wins the prize for the strangest units for a configuration parameter
485
TimeBetweenLocalCheckpoints
Not a time period
486
TimeBetweenLocalCheckpoints
Amount of updates before starting a local checkpoint
487
TimeBetweenLocalCheckpoints
...but not a value in bytes
488
TimeBetweenLocalCheckpoints
base-2 logarithm of the number of 4 byte words
489
TimeBetweenLocalCheckpoints
base-2 logarithm of the number of 4 byte words (sorry, not joking)
490
TimeBetweenLocalCheckpoints
Default value is 20: 4 x 2^20 = 4MB
491
TimeBetweenLocalCheckpoints
Value of 21: 4 x 2^21 = 8MB
492
TimeBetweenLocalCheckpoints
Value of 22: 4 x 2^22 = 16MB (and so on...)
493
TimeBetweenLocalCheckpoints
Maximum value of 31: 4 x 2^31 = 8GB
494
TimeBetweenLocalCheckpoints
Value of 6 or less: constant local checkpoints
495
TimeBetweenLocalCheckpoints
Designed to prevent checkpointing on mostly idle clusters
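The strange unit unpacks in one line, which is enough to check all the numbers on the slides:

```python
def lcp_write_threshold_bytes(value):
    """TimeBetweenLocalCheckpoints is the base-2 logarithm of the
    number of 4-byte words written before a local checkpoint starts."""
    return 4 * 2 ** value

print(lcp_write_threshold_bytes(20))  # default: 4MB
print(lcp_write_threshold_bytes(21))  # 8MB
print(lcp_write_threshold_bytes(31))  # maximum: 8GB
```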
496
TimeBetweenGlobalCheckpoints
497
TimeBetweenGlobalCheckpoints
A COMMITted transaction is in main memory of all replicas
498
TimeBetweenGlobalCheckpoints
A COMMITted transaction is not immediately flushed to disk
499
TimeBetweenGlobalCheckpoints
A global checkpoint is where a set of COMMITted transaction are flushed to disk
500
TimeBetweenGlobalCheckpoints
This is where we recover to after a System Restart
501
TimeBetweenGlobalCheckpoints
Default is every 2000 milliseconds
502
Checkpointing
503
COMMIT = txn survives node failure
504
COMMIT != Disk Persistence
505
The D in ACID is still covered
506
Durable to machine failure
507
Durable to disk failure
508
In event of cluster failure, want to be able to restore a consistent image of the database
509
We checkpoint to disk (except when in diskless mode)
510
Can't lock the database while we write a checkpoint
511
Checkpoint in background, while transactions continue
512
Write image of database (LCP) and REDO log (GCP)
513
Take a Local Check Point of the database, apply REDO from that point to Global Check Point
514
Space Usage
515
Fixed Size Rows (prior to 5.1)
Fixed Size Rows (prior to 5.1) Variable Sized Columns (5.1, with 4-byte alignment)
516
BLOBs and TEXT
517
BLOBs and TEXT 256 bytes in the row with remainder in 2000 byte chunks stored in separate table
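The 256-byte inline part plus 2000-byte chunks means the number of rows landing in the hidden parts table can be sketched as follows (a sketch of the storage scheme the slide describes):

```python
import math

def blob_part_rows(blob_bytes, inline=256, chunk=2000):
    """Rows needed in the separate parts table: everything past the
    first 256 inline bytes is stored in 2000-byte chunks."""
    overflow = max(0, blob_bytes - inline)
    return math.ceil(overflow / chunk)

print(blob_part_rows(200))    # fits inline -> 0 chunk rows
print(blob_part_rows(10000))  # 9744 overflow bytes -> 5 chunk rows
```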
518
Indexed Columns must be in main memory
519
Non-Indexed columns can be on disk
520
Disk Columns are fixed size
521
Disk Columns are fixed size VARCHAR(11) uses 12 bytes on disk
522
Columns are 4-byte aligned
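The 4-byte alignment rule is just rounding up to the next multiple of 4; the VARCHAR figure below assumes a single length byte precedes the data, which is an assumption (the slide only gives the 12-byte total):

```python
def align4(nbytes):
    """Round a column's storage up to the next 4-byte boundary."""
    return (nbytes + 3) // 4 * 4

def varchar_disk_bytes(max_len):
    # assumption: one length byte stored ahead of the data on disk
    return align4(max_len + 1)

print(varchar_disk_bytes(11))  # VARCHAR(11) -> 12 bytes on disk
print(align4(5))               # a 5-byte column occupies 8 bytes
```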
523
Calculate storage requirements
524
Know your dataset
525
Use ndb_size.pl (examines existing database, creates report)
526
ndb_size.pl output
527
Variable Sized Rows
528
4.1 and 5.0 1 2 hello int int VARCHAR or just saying hello
529
Wasted Space 1 2 hello Wasted Space int int VARCHAR
This means you could have a lot of wasted space on a lot of rows Wasted Space
530
5.1 1 2 hello int int VARCHAR However, in 5.1, we now have variable sized rows. This means that we don't waste space on fields that aren't full.
531
The Saving Which can save a lot of space when you have any reasonable number of rows
532
The Saving We just use the space needed by each particular row. Here we only have a few longer rows, saving us a lot of memory.
533
On line Add/Drop Index
534
ADD INDEX (4.1, 5.0) t1 Index Index Rows
In 4.1 and 5.0, to add an index to a table
535
ADD INDEX (4.1, 5.0) t1 temp table Index Index Index Index Index Rows
We first create a temporary table with a schema of what we want t1 to look like (with the new index)
536
ADD INDEX (4.1, 5.0) t1 temp table Index Index Index Index Index Rows
and then copy the data in t1 over to the temporary table, building all the indexes as we go. We keep a TABLE LOCK while we do this
540
ADD INDEX (4.1, 5.0) t1 temp table Index Index Index Index Index Rows
we then delete t1
541
ADD INDEX (4.1, 5.0) temp table Index Index Index Rows
and rename the temporary table
542
ADD INDEX (4.1, 5.0) t1 Index Index Index Rows
so we now have t1 with our new index
543
ADD INDEX in 5.1 t1 Index Index Rows In 5.1, we can do a lot better
544
ADD INDEX in 5.1 t1 Index Index Index Rows
We build a new index as an online operation – avoiding the copy.
545
ADD INDEX in 5.1 t1 Index Index Index Rows
so the only thing we have to build is one index, not copying all the data and rebuilding all the other indexes.
549
DROP INDEX in 5.1 t1 Index Index Index Rows
Delete is the same, we only drop the index we don't want
550
DROP INDEX in 5.1 t1 Index Index Rows
551
How much faster?
552
Online Add Index Before (copy the table):
mysql> create index b on t1(b); Query OK, 1356 rows affected (2.20 sec) Records: Duplicates: 0 Warnings: 0 mysql> drop index b on t1; Query OK, 1356 rows affected (2.03 sec)
553
Online Add Index Before (copy the table):
mysql> create index b on t1(b); Query OK, 1356 rows affected (2.20 sec) Records: Duplicates: 0 Warnings: 0 mysql> drop index b on t1; Query OK, 1356 rows affected (2.03 sec) Now (just add/drop an index): Query OK, 0 rows affected (0.58 sec) Records: 0 Duplicates: 0 Warnings: 0 Query OK, 0 rows affected (0.46 sec)
554
User Defined Partitioning
555
Since the dawn of time...
557
pk
558
pk
559
pk Nodegroup 0 Nodegroup 1
560
HASH(pk) pk
561
HASH(pk) pk
562
HASH(pk) pk
563
Perception Reality HASH(pk) pk
564
pk
565
pk Two Partitions
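A simplified sketch of how HASH(pk) maps rows to partitions (illustrative only: NDB uses its own MD5-based hash of the partition key internally, and the md5-mod scheme here is an assumption for demonstration):

```python
import hashlib

def partition_for(pk, num_partitions=2):
    """Map a primary key to a partition by hashing it; rows with the
    same key always land on the same partition (and thus nodegroup)."""
    digest = hashlib.md5(str(pk).encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# 1000 keys spread deterministically across both partitions
seen = {partition_for(pk) for pk in range(1000)}
print(sorted(seen))
```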
566
How the default looks (in 5.1 SHOW CREATE TABLE)
CREATE TABLE account (number INT UNSIGNED, location VARCHAR(20), amount INT, PRIMARY KEY (number)) [PARTITION BY KEY ()] [(PARTITION P0 NODEGROUP 0, PARTITION P1 NODEGROUP 1, …)] ENGINE=NDBCLUSTER;
570
Now, In 5.1
571
User Defined Partitioning
572
By Key
573
Partition by Key CREATE TABLE account (number INT UNSIGNED, location VARCHAR(20), amount INT, PRIMARY KEY (number)) [PARTITION BY KEY ()] [(PARTITION P0 NODEGROUP 0, PARTITION P1 NODEGROUP 1, …)] ENGINE=NDBCLUSTER;
574
MySQL Cluster Replication
575
Not Internal Mirroring between nodes
576
Replication from one Cluster to Another Cluster
577
Why?
578
the usual reasons
579
Why replicate between Clusters?
Geographical Redundancy
580
Why replicate between Clusters?
Geographical Redundancy Split the processing load
581
Why replicate between Clusters?
Geographical Redundancy Split the processing load e.g. for monthly reports
582
Quick Overview of MySQL Replication
583
MySQL Replication
584
MySQL Replication INSERT ...
585
MySQL Replication INSERT ... INSERT...
586
MySQL Replication INSERT ... INSERT...
587
MySQL Replication INSERT ... INSERT... INSERT...
588
MySQL Replication INSERT ... INSERT ... INSERT... INSERT...
589
MySQL Replication INSERT ... INSERT ... INSERT... INSERT... INSERT...
590
MySQL Replication INSERT ... UPDATE ... INSERT ... INSERT... UPDATE...
591
MySQL Replication INSERT ... UPDATE ... INSERT ... DELETE ...
592
MySQL Replication Master
593
MySQL Replication Slave Master Slave Slave
594
MySQL Replication Slave Master Slave Slave
595
MySQL Replication Slave Master Slave Slave
596
MySQL Replication Slave Master Slave Slave
597
MySQL Replication Slave Master Slave Slave
598
Back at Cluster
599
mysqld mysqld mysqld NDB API NDB API
600
mysqld mysqld mysqld UPDATE UPDATE UPDATE NDB API NDB API
601
mysqld mysqld mysqld NDB API NDB API update() update()
602
mysqld mysqld mysqld NDB API NDB API update() update()
603
UPDATE DELETE INSERT update() update() mysqld mysqld mysqld NDB API
604
UPDATE DELETE INSERT update() update() mysqld mysqld mysqld NDB API
605
UPDATE DELETE INSERT mysqld mysqld mysqld
606
INSERT UPDATE DELETE INSERT UPDATE DELETE INSERT UPDATE DELETE SLAVE
607
ORDER? SLAVE INSERT UPDATE DELETE INSERT UPDATE DELETE INSERT UPDATE
608
Serialization is in the storage nodes
609
NDB Injector Thread A thread inside the MySQL Server
Subscribes to events in NDB (the event of "Row was committed") and injects the rows into the binlog, producing a single, canonical binlog of your cluster. Not just one MySQL server: it contains EVERYTHING done by ALL NDB API programs (including mysqld) connected to the cluster.
610
UPDATE DELETE INSERT update() update() mysqld mysqld mysqld NDB API
611
A Closer Look...
612
MySQL Replication between Clusters
Application Application Application Application MySQL Server MySQL Server I/O thread Replication Master Apply thread Slave NdbCluster Handler NdbCluster Handler Binlog Relay Binlog Binlog NDB Kernel (Data nodes) ndbd NDB Kernel (Data nodes) ndbd
613
Who spotted the single point of failure?
One thing... Who spotted the single point of failure?
614
Redundant Replication Channels
MySQL Server MySQL Server Master Master I/O thread Apply thread MySQL Server Slave NdbCluster Handler NdbCluster Handler NdbCluster Handler Binlog Binlog Relay Binlog Binlog NDB Kernel (Data nodes) ndbd NDB Kernel (Data nodes) ndbd NdbCluster Handler Master MySQL Server Binlog Slave Relay Binlog Binlog Apply thread I/O thread NdbCluster Handler MySQL Server Replication
618
How do I make fail over happen?
619
But first...
620
Epoch A point of synchronization in the cluster
Everybody agrees on what transactions are disk persistent. In case of a system crash, this is where we'll recover to.
621
Okay, but fail over?
622
Currently manual, but only four simple steps
623
STEP 1 Find out where the Slave is up to
In the binary log produced by the injector, each Global Check Point (epoch) is a transaction. Where we are up to is recorded on the slave, in the mysql.ndb_apply_status table. It has two columns, server_id and epoch (both integers), and is ENGINE=NDB so it is available everywhere! So, on the slave: mysqlS> SELECT @latest:=MAX(epoch) FROM mysql.ndb_apply_status; (possibly with a WHERE clause for server_id)
624
STEP 2 Find the binlog position for this epoch
The mysql.ndb_binlog_index table will help us here. It maps binlog position to GCI and tells us the number of INSERTs, UPDATEs, DELETEs and SCHEMAOPS per GCI. It is MyISAM and is per-master. So, on the master (using @latest from the slave): mysqlM> SELECT @file:=SUBSTRING_INDEX(File, '/', -1), @pos:=Position FROM mysql.ndb_binlog_index WHERE epoch > @latest ORDER BY epoch ASC LIMIT 1;
625
STEP 3 Synchronize the second channel i.e. change the master
Run (using @file and @pos from the last query): mysqlS> CHANGE MASTER TO MASTER_LOG_FILE='@file', MASTER_LOG_POS=@pos;
626
STEP 4 mysqlS> START SLAVE; No, really, that's it.
627
Limitations Fail over of replication channels is manual
can be scripted. Since all updates are through one injector thread, there is a limit; this limit is much less than what you can pump through a good cluster. We are working to overcome this.
628
You can now... Have a 99.999% uptime cluster with great performance
and a cool name. Have replication between it and another Cluster (or single server): load balancing, geographical redundancy. Redundant replication channels between these setups. Redundancy up the wazoo!
629
Disk Data
630
Two Phase Implementation
1. Data on disk (5.1) 2. Indexes on disk (7.1?)
631
A few concepts...
632
Where we store things Table Space Data file
633
Where we store things Table Space Data file Data file Data file
634
Where we store things Table Space Table Space Data file Data file
635
Where we store things Table Space Table Space Log file group Data file
Undo file Undo file
636
Where we store things Table Space Table Space Log file group Data file
Undo file Undo file
637
Files are per node Node 2 Node 1 df1 df1
638
Let's look at the SQL
639
CREATE LOGFILE GROUP CREATE LOGFILE GROUP lg_1 ADD UNDOFILE 'undo1' INITIAL_SIZE 16M UNDO_BUFFER_SIZE 2M ENGINE=NDB; We can add another undo file: ALTER LOGFILE GROUP lg_1 ADD UNDOFILE 'undo2' INITIAL_SIZE 12M ENGINE=NDB; We currently don't auto-extend files; for more space, add files.
640
CREATE TABLESPACE CREATE TABLESPACE ts1 ADD DATAFILE 'datafile1' USE LOGFILE GROUP lg_1 INITIAL_SIZE 32M ENGINE=NDB; We can add another datafile too: ALTER TABLESPACE ts1 ADD DATAFILE 'datafile2' INITIAL_SIZE 48M ENGINE=NDB; We currently don't auto-extend, so just add another file.
641
CREATE TABLE CREATE TABLE t1 ( pk1 INT NOT NULL PRIMARY KEY, b INT NOT NULL, c INT NOT NULL) TABLESPACE ts1 STORAGE DISK ENGINE=NDB; b and c will be stored on disk pk1 in memory (as it's indexed)
642
I_S.FILES for Data files
Which tablespace it belongs to, Extent Size (bytes), Number of extents in file, Number of free extents. So, free extents multiplied by extent size = free bytes that can be allocated to tables.
643
A useful VIEW CREATE VIEW isf AS SELECT FILE_NAME, (TOTAL_EXTENTS * EXTENT_SIZE) AS 'Total', (FREE_EXTENTS * EXTENT_SIZE) AS 'Free', ( ((FREE_EXTENTS * EXTENT_SIZE)*100) / (TOTAL_EXTENTS * EXTENT_SIZE)) AS '% Free' FROM INFORMATION_SCHEMA.FILES WHERE ENGINE="ndbcluster" and FILE_TYPE = 'DATAFILE';
644
I_S.FILES for UNDO files
Free log space. If running out, you may need to add more undo files.
645
Optimized NR Traditional NR copies everything over the wire
PRO: easy to implement correctly. PRO: not too bad for a few gigs of data. CON: very, very bad for disk data; think 2TB going over the wire... ouch! Details on optimized node recovery for NDB in s1108-ronstrom.pdf: recovery from checkpoint, so we don't have to copy everything.
646
Thank You
647
Find out More! MySQL Online Documentation
Cluster Forum Cluster Mailing List