Horizontal Scalability with PostgreSQL Jack Orenstein Hitachi Data Systems

Horizontal Scalability with PostgreSQL Jack Orenstein Hitachi Data Systems jack.orenstein@hds.com

THANKS

Application HCAP: Hitachi Content Archive Platform. Cluster of Linux nodes. Fixed-content storage. Interfaces: HTTP, NFS, CIFS, SMTP, (CUPS?). Configurable levels of data protection. “Policies” look for and fix violations of constraints. Optional full-text search (FAST).

Background Created by Archivas, Inc., founded June 2003. Aimed at clusters of cheap Linux nodes with internal storage. (1-2 CPUs, 0.5G RAM, up to 1TB internal storage). Acquired by HDS, February 2007. Current hardware configuration: 8 CPUs, 8G RAM, up to 64 SAN volumes. Re-introducing internal disk product.

HCAP Architecture Front-end switch provides access to nodes. Node contains application stack and subset of data. Back-end switch for inter-node communication.

HCAP Architecture [Diagram, not drawn to scale. Components shown: Client; protocol interfaces (HTTP, WebDAV, NFS, CIFS, SMTP); Request Manager; Metadata Manager; Storage Manager; Messaging. Storage shown: pg_data + files; pg_xlog + files; files.]

Focus on Metadata Manager [Same architecture diagram as the previous slide.]

Outline Design goals; Data organization; Replication scheme; Failover and failback; Upgrade; Postgres issues.

DESIGN GOALS OF THE METADATA MANAGER

Design Goals Functionality: Record system configuration, admin messages, constraint violations. Record file metadata (think inodes).

Design Goals Access patterns: Read/write 1-2 records at a time. Lookup by key (directory + filename). Lookup by partial key (directory). Complete scan.

Design Goals System qualities: Reliability (no false positives, rare false negatives). Availability. Scalability. Upgradability.

DATA ORGANIZATION

Shared-nothing distributed system Partition objects into regions based on object key. Maintain synchronized copies of regions. Route requests to region copies.

Region Map Hash the object key. The last bits of the hash value give the region number (linear hashing). Example: hash("/2007/02/05/img1234.jpg") & 0x3ff selects one of regions 0 to 1023. [Diagram: table listing, for each region number (0, 1, 2, 3, ..., 1022, 1023), the nodes that hold its authoritative and backup copies.]
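The routing step on this slide can be sketched roughly as follows. This is a minimal illustration, not the HCAP implementation: the slides do not name the hash function, so CRC32 is used as a stand-in, and the round-robin placement in build_region_map is a hypothetical policy chosen only to make the example runnable.

```python
import zlib

MAP_LEVEL = 10                       # 0x3ff keeps the last 10 bits -> 1024 regions
REGION_MASK = (1 << MAP_LEVEL) - 1   # equals 0x3ff


def region_number(object_key: str, mask: int = REGION_MASK) -> int:
    """Hash the object key and keep the last bits as the region number."""
    # Assumption: CRC32 stands in for whatever hash the real system uses.
    return zlib.crc32(object_key.encode("utf-8")) & mask


def build_region_map(nodes, region_count):
    """Hypothetical placement: region -> (authoritative node, backup node)."""
    region_map = {}
    for region in range(region_count):
        authoritative = nodes[region % len(nodes)]
        backup = nodes[(region + 1) % len(nodes)]
        region_map[region] = (authoritative, backup)
    return region_map


def route(object_key, region_map):
    """Return (region, authoritative node, backup node) for an object key."""
    region = region_number(object_key)
    authoritative, backup = region_map[region]
    return region, authoritative, backup
```

A caller would build the map once (for example build_region_map(["node1", "node2", "node3"], 1024)) and call route() per request. The same key always lands in the same region, which is what lets any node route a request without coordination.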

Postgres schemas Each region copy stored in a Postgres schema. Schema name: mm1_8_a7. 8: “map level”, the number of bits in the region number. a7: region number in hex. Each region schema has the same table definitions.

Postgres schemas Naming scheme allows for adding regions (not yet implemented). E.g., mm1_8_a7 can be split to yield mm1_9_0a7 and mm1_9_1a7. These can coexist while splitting proceeds.
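The naming convention and the split can be sketched as below. The function names are hypothetical, and the zero-padding rule (one hex digit per four bits of map level, rounded up) is inferred only from the two examples on the slides, so treat it as an assumption.

```python
def schema_name(map_level: int, region: int, prefix: str = "mm1") -> str:
    """Build a region schema name such as mm1_8_a7."""
    # Assumption: pad the hex region number to ceil(map_level / 4) digits,
    # which reproduces both "a7" at level 8 and "0a7"/"1a7" at level 9.
    hex_digits = (map_level + 3) // 4
    return f"{prefix}_{map_level}_{region:0{hex_digits}x}"


def split_schema_names(map_level: int, region: int, prefix: str = "mm1"):
    """Names of the two schemas produced by splitting one region.

    Adding one bit to the map level sends each key either to the same
    region number (new top bit 0) or to region | (1 << map_level)
    (new top bit 1), as in linear hashing.
    """
    new_level = map_level + 1
    low = region
    high = region | (1 << map_level)
    return (schema_name(new_level, low, prefix),
            schema_name(new_level, high, prefix))
```

Because the map level is part of the name, the level-8 schema and its two level-9 successors never collide, which is what allows them to coexist during a split.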

Postgres schemas mm1 schema: Exists in every database. Replicated across all nodes. Small data volume, infrequently updated. E.g. cluster configuration.

Postgres schemas Connect to Postgres through JDBC. Connection bound to schema: set search_path = mm1_8_a7,mm1,public
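A small helper for producing that statement might look like this. It is only a sketch: the function name and the identifier check are assumptions, and the real application issues the statement over JDBC rather than from Python.

```python
import re

# Conservative check for unquoted identifiers (hypothetical safeguard,
# since schema names are interpolated into the statement text).
_SCHEMA_NAME = re.compile(r"^[a-z_][a-z0-9_]*$")


def search_path_statement(region_schema: str, shared_schema: str = "mm1") -> str:
    """Return the statement that binds a connection to one region schema."""
    for name in (region_schema, shared_schema):
        if not _SCHEMA_NAME.match(name):
            raise ValueError(f"unexpected schema name: {name!r}")
    # Region schema first, then the replicated mm1 schema, then public.
    return f"set search_path = {region_schema},{shared_schema},public"
```

Putting the region schema first means unqualified table names resolve to that region's tables, while shared tables such as cluster configuration are still found in mm1.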

REPLICATION SCHEME

Metadata Manager overview Java/JDBC application. Homegrown messaging layer. MM is used by Request Manager: Request Manager calls MM client API. MM client issues request. Routed to node and region using region map.

Processing of MM update request Authoritative (A): update local database; commit database update; for each backup region, send update request to backup region and wait for ack of update request; return control to caller. Backup (B): ack request; execute request (async).
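The flow on this slide can be approximated in a few lines. This is a single-process sketch, not the actual Java code: the class names are hypothetical, dictionaries stand in for the per-region Postgres schemas, and it assumes the "async" label means the backup acknowledges on receipt and executes the update afterwards.

```python
from collections import deque


class BackupCopy:
    """Backup region copy: acks on receipt, executes later."""

    def __init__(self):
        self.rows = {}
        self.pending = deque()

    def receive(self, key, value) -> bool:
        # Ack as soon as the request is queued (assumed reading of "async").
        self.pending.append((key, value))
        return True

    def execute_pending(self):
        # Asynchronous part: apply queued updates in arrival order.
        while self.pending:
            key, value = self.pending.popleft()
            self.rows[key] = value


class AuthoritativeCopy:
    """Authoritative region copy: commits locally, then informs backups."""

    def __init__(self, backups):
        self.rows = {}
        self.backups = backups

    def update(self, key, value) -> bool:
        # Update local database and commit (a dict write stands in for both).
        self.rows[key] = value
        # Send the request to each backup and wait for its ack.
        for backup in self.backups:
            if not backup.receive(key, value):
                raise RuntimeError("backup did not acknowledge update")
        # Only now does control return to the caller.
        return True
```

The ordering matters for the failure cases: because A commits before contacting B, a crash between the two steps can leave an update on A that the caller never saw succeed.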

Request processing – failure scenarios A crashes before commit: Request fails to caller. No update anywhere (not a false negative).

Request processing – failure scenarios A crashes after commit: Promote B to A. If B does not have update: consistent with request failure. Else: false negative.

Request processing – failure scenarios B crashes, cannot ack: New map contains new B copy on another node.

Request processing – failure scenarios B fails to execute request: B commits suicide. Region discarded. New B copy created elsewhere.

FAILOVER AND FAILBACK

Region lifecycle [State diagram with states I, A, and B.] A: authoritative. B: backup. I: incomplete (copying data from A).

Region map in normal cluster 8 regions, 2 copies of each, 3 nodes. [Diagram: regions 0 to 7, each with an A and a B copy placed on n1, n2, n3.]

Node 3 crashes Each region needs an A copy to resume service. Create copies for regions missing one. [Diagram: region map with the n3 copies removed.]

Promote B to A Cluster returns to service. [Diagram: region map with every region having an A copy on n1 or n2.]

Create new regions (state I) New regions copy data from A. Update requests are applied immediately in A and B region copies. Logged for later execution in I copy. [Diagram: region map with new I copies on n1 and n2.]

I regions finish loading Recovery is complete. [Diagram: region map with every region having A and B copies on n1 and n2.]
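The failover sequence on these slides (drop the failed node, promote B to A, create I copies, finish loading) can be sketched on an in-memory region map. This is an illustration under assumptions: the map layout, the function names, and the "least-loaded node" placement rule are hypothetical, and no data is actually copied.

```python
from collections import Counter


def node_load(region_map):
    """Count how many region copies each node currently hosts."""
    load = Counter()
    for copies in region_map.values():
        load.update(copies.keys())
    return load


def fail_over(region_map, failed_node, live_nodes, copies_per_region=2):
    """region_map: {region: {node: state}}, state is 'A', 'B', or 'I'."""
    # Node crashes: its region copies are gone.
    for copies in region_map.values():
        copies.pop(failed_node, None)
    # Promote B to A wherever the A copy was lost.
    for copies in region_map.values():
        if "A" not in copies.values():
            for node, state in list(copies.items()):
                if state == "B":
                    copies[node] = "A"
                    break
    # Create new copies in state I (assumed rule: least-loaded live node).
    for copies in region_map.values():
        if len(copies) < copies_per_region:
            load = node_load(region_map)
            candidates = [n for n in live_nodes if n not in copies]
            if candidates:
                copies[min(candidates, key=lambda n: load[n])] = "I"


def finish_loading(region_map):
    """I copies that have finished loading become B copies."""
    for copies in region_map.values():
        for node, state in list(copies.items()):
            if state == "I":
                copies[node] = "B"
```

With 8 regions on n1, n2, n3 and n3 failing, every region ends with one A and one I copy on n1 and n2, and finish_loading() turns each I into a B.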

Failback: n3 returns to service Assign n3 some regions, to load balance. [Diagram: region map with n3 holding new copies in states B, B, and I.]

I regions complete loading Too many B regions. Rebalance to balance A/node, (A+B)/node. [Diagram: region map with n3 holding three B copies.]

I regions complete loading Rebalance is complete. [Diagram: region map for regions 0 to 7 with A and B copies spread across n1, n2, n3.]

I region lifecycle Create schema with tables only (no indexes or triggers). Copy data from A region: remote psql copy out piped to local psql copy in. Recompute derived data. Add indexes and triggers.

I region lifecycle Apply updates that arrived during above steps. Updates may arrive during this step. Apply updates again under lock (blocking new updates). Announce conversion from I to B.
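The two-pass catch-up described on this slide can be sketched as follows. It is a simplified, in-memory illustration: the class and method names are hypothetical, a dict stands in for the region's tables, and the between_passes hook exists only so the example can simulate an update arriving mid-way.

```python
import threading


class IncompleteRegion:
    """Region copy that starts in state I and converts to B after catch-up."""

    def __init__(self):
        self.rows = {}
        self.log = []            # updates logged while the copy is loading
        self.state = "I"
        self.lock = threading.Lock()

    def load(self, snapshot):
        # Stands in for the bulk copy of data from the A region.
        self.rows.update(snapshot)

    def on_update(self, key, value):
        with self.lock:
            if self.state == "I":
                self.log.append((key, value))   # logged for later execution
            else:
                self.rows[key] = value          # B copies apply directly

    def catch_up(self, between_passes=None):
        # Pass 1: take the logged updates and apply them without blocking
        # writers; new updates may be logged while this runs.
        with self.lock:
            batch, self.log = self.log, []
        for key, value in batch:
            self.rows[key] = value
        if between_passes is not None:
            between_passes()                    # hypothetical test hook
        # Pass 2: apply the remainder under the lock, blocking new updates,
        # then announce the conversion from I to B.
        with self.lock:
            for key, value in self.log:
                self.rows[key] = value
            self.log = []
            self.state = "B"
```

The first pass does the bulk of the work while updates keep flowing, so the second, blocking pass only has to handle the few updates that arrived in the meantime.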

UPGRADE

Upgrade requirements Offline upgrade: Shut down all nodes. Upgrade software. Migrate data. Online upgrade: Shut down one node at a time (failover). Upgrade software. Restart node (failback). Data migrated as part of I region lifecycle.

Upgrade requirements Upgrade ½ cluster, twice: Not implemented yet.

Online upgrade: data migration during loading of I region Four possible data migrations: Old→Old, New→New, Old→New, New→Old. Old→Old and New→New: just works.

Old→New and New→Old Target: create staging tables matching source. Copy data into staging tables (as usual). Run conversion (pl/pgsql procedure).

Other upgrades Offline: Just need old→new conversion scripts. ½ cluster: Same.

POSTGRES ISSUES

State of the world 2004: 1/2G RAM, 1 CPU. Postgres 7.4. shared_buffers = 30000 (250M).

Indexing Table storing file metadata has columns for directory and filename. PK on these columns is wide. Needed to fit more of index in cache. So: Add columns for hashes of directory, filename. Index is on these hashes instead. Might not be so important now.
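The hashed-index idea can be tried out with a small, self-contained example. Everything here is an assumption made for illustration: SQLite (from the Python standard library) stands in for Postgres, CRC32 stands in for the unnamed hash, and the table and column names are invented.

```python
import sqlite3
import zlib


def key_hash(text: str) -> int:
    """Narrow integer hash of a directory or filename (CRC32 as a stand-in)."""
    return zlib.crc32(text.encode("utf-8"))


def open_db():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table file_meta ("
        " dir_hash integer, name_hash integer,"
        " directory text, filename text, size integer)"
    )
    # The index covers only the two narrow hash columns, not the wide strings.
    conn.execute("create index file_meta_hash_idx on file_meta (dir_hash, name_hash)")
    return conn


def insert_file(conn, directory, filename, size):
    conn.execute(
        "insert into file_meta values (?, ?, ?, ?, ?)",
        (key_hash(directory), key_hash(filename), directory, filename, size),
    )


def lookup(conn, directory, filename):
    """Lookup by key: hashes narrow the search, strings reject collisions."""
    return conn.execute(
        "select size from file_meta"
        " where dir_hash = ? and name_hash = ? and directory = ? and filename = ?",
        (key_hash(directory), key_hash(filename), directory, filename),
    ).fetchone()


def list_directory(conn, directory):
    """Lookup by partial key: uses the leading dir_hash column of the index."""
    rows = conn.execute(
        "select filename from file_meta"
        " where dir_hash = ? and directory = ? order by filename",
        (key_hash(directory), directory),
    ).fetchall()
    return [name for (name,) in rows]
```

Comparing the real directory and filename in addition to the hashes keeps results correct when two different strings hash to the same value.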

Dealing with hotspots On every file creation, need to maintain metadata on the parent directory (change time, file and subdir counts). Directory is a hotspot for updates. Needed to vacuum frequently to maintain performance. Couldn't afford to vacuum the whole metadata table that frequently.

Dealing with hotspots So: Moved directory records into a separate table. Reduces width by 80%. Reduces number of rows by 90-99+%. Can afford to vacuum very frequently (every 2000 updates).
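A counter-driven vacuum trigger of the kind implied here might look like the sketch below. The class name and callback are hypothetical; in a real deployment the callback would issue a VACUUM on the directory table, and the slides do not show how the actual schedule is implemented.

```python
class VacuumScheduler:
    """Invoke a vacuum callback once every `threshold` recorded updates."""

    def __init__(self, vacuum, threshold=2000):
        self.vacuum = vacuum          # callable that performs the vacuum
        self.threshold = threshold
        self.updates_since_vacuum = 0

    def record_updates(self, count=1) -> bool:
        """Record updates; return True if this call triggered a vacuum."""
        self.updates_since_vacuum += count
        if self.updates_since_vacuum >= self.threshold:
            self.vacuum()
            self.updates_since_vacuum = 0
            return True
        return False
```

Every update path has to call record_updates(); if one path forgets, the counter undercounts and the table is vacuumed too rarely.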

SPI A few columns have binary-encoded data. Read/written by Java layer. Also need human-readable form in SQL queries. Use SPI to render in python-friendly form. Allows for easy integration with python tools and tests.

Conclusion Postgres has just worked. Vacuuming is a bit of a pain: three vacuum schedules (didn't use autovacuum); a bug in update counting led to failure to vacuum, causing performance problems. Reliable, scalable.
