Horizontal Scalability with PostgreSQL Jack Orenstein Hitachi Data Systems
Application HCAP: Hitachi Content Archive Platform Cluster of Linux nodes Fixed-content storage Interfaces: HTTP, NFS, CIFS, SMTP, (CUPS?) Configurable levels of data protection “Policies” look for and fix violations of constraints. Optional full-text search (FAST).
Background Created by Archivas, Inc., founded June Aimed at clusters of cheap Linux nodes with internal storage. (1-2 CPUs, 0.5G RAM, up to 1TB internal storage). Acquired by HDS, February Current hardware configuration: 8 CPUs, 8G RAM, up to 64 SAN volumes. Re-introducing internal disk product.
HCAP Architecture Front-end switch provides access to nodes. Node contains application stack and subset of data. Back-end switch for inter-node communication.
Outline Design goals Data organization Replication scheme Failover and failback Upgrade Postgres issues
DESIGN GOALS OF THE METADATA MANAGER
Design Goals Functionality: Record system configuration, admin messages, constraint violations. Record file metadata (think inodes).
Design Goals Access patterns: Read/write 1-2 records at a time. Lookup by key (directory + filename). Lookup by partial key (directory). Complete scan.
Design Goals System qualities: Reliability: No false positives Rare false negatives Availability Scalability Upgradability
Shared-nothing distributed system Partition objects into regions based on object key. Maintain synchronized copies of regions. Route requests to region copies.
Region Map Example: object key /2007/02/05/img1234.jpg, region number = hash(key) & 0x3ff. [Diagram: region map table with columns Region, Authoritative, Backup; each row names the nodes holding that region's copies.] Hash object key. Last bits of hash value → region number. Linear hashing.
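A minimal sketch of the key-to-region step, assuming CRC32 as the hash (the slides do not name the hash function). The names `region_number` and `REGION_MASK` are illustrative, not taken from HCAP.

```python
import zlib

REGION_MASK = 0x3FF  # mask from the slide: keeps the last 10 bits of the hash


def region_number(object_key: str, mask: int = REGION_MASK) -> int:
    """Return the region number for an object key.

    CRC32 is an assumption made for this sketch; any stable hash would do.
    """
    hash_value = zlib.crc32(object_key.encode("utf-8"))
    # The last bits of the hash value select the region.
    return hash_value & mask


region = region_number("/2007/02/05/img1234.jpg")
```

Because only the low bits are used, a region number computed with a shorter mask is always the low-bit suffix of the number computed with a longer one, which is the property that lets regions be split later.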
Postgres schemas Each region copy stored in a Postgres schema. Schema name: mm1_8_a7. 8: “Map level”, the number of bits in the region number. a7: Region number in hex. Each region schema has the same table definitions.
Postgres schemas Naming scheme allows for adding regions (not yet implemented). E.g., mm1_8_a7 can be split to yield mm1_9_0a7, mm1_9_1a7. These can coexist while splitting proceeds.
Postgres schemas mm1 schema: Exists in every database. Replicated across all nodes. Small data volume, infrequently updated. E.g. cluster configuration.
Postgres schemas Connect to Postgres through JDBC. Connection bound to schema. set search_path = mm1_8_a7,mm1,public
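The naming scheme and search path above can be sketched as follows. The helper names are hypothetical, and the rule for the number of hex digits (enough digits to hold `map_level` bits) is inferred from the two examples on the slides rather than stated by them.

```python
def schema_name(map_level: int, region: int, prefix: str = "mm1") -> str:
    """Build a region schema name such as mm1_8_a7."""
    # Assumption: the hex field is wide enough for map_level bits.
    hex_digits = (map_level + 3) // 4
    return f"{prefix}_{map_level}_{region:0{hex_digits}x}"


def split_schema_names(map_level: int, region: int) -> tuple:
    """Names of the two schemas produced by splitting one region."""
    new_level = map_level + 1
    low = region                      # new high bit is 0
    high = region | (1 << map_level)  # new high bit is 1
    return schema_name(new_level, low), schema_name(new_level, high)


def search_path_statement(map_level: int, region: int) -> str:
    """SQL that binds a connection to one region schema."""
    return f"set search_path = {schema_name(map_level, region)},mm1,public"
```

For example, `split_schema_names(8, 0xa7)` yields the two names from the slide, and the old and new names never collide because the map level is part of the name.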
Metadata Manager overview Java/JDBC application. Homegrown messaging layer. MM is used by Request Manager: Request manager calls MM client API. MM client issues request. Routed to node and region using region map.
Processing of MM update request Authoritative (A) copy: (1) update local database; (2) commit database update; (3) for each backup region: send update request to backup region, wait for ack of update request; (4) return control to caller. Backup (B) copy: ack request, then execute request (async).
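The request flow above can be sketched in memory. This is an illustration only: dictionaries stand in for the Postgres databases, direct method calls stand in for the messaging layer, and the class and method names are hypothetical.

```python
class BackupCopy:
    """Backup (B) region copy: acks first, executes later."""

    def __init__(self):
        self.rows = {}
        self.pending = []  # requests that were acked but not yet executed

    def receive(self, key, value) -> bool:
        self.pending.append((key, value))
        return True  # the ack

    def execute_pending(self):
        # Runs asynchronously in the real system; called explicitly here.
        while self.pending:
            key, value = self.pending.pop(0)
            self.rows[key] = value


class AuthoritativeCopy:
    """Authoritative (A) region copy."""

    def __init__(self, backups):
        self.rows = {}
        self.backups = backups

    def update(self, key, value) -> bool:
        self.rows[key] = value          # update and commit the local database
        for backup in self.backups:
            backup.receive(key, value)  # send the request and wait for the ack
        return True                     # return control to the caller
```

Since the backup acks before it executes, the caller can regain control while the backup's database still lacks the update.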
Request processing – failure scenarios A crashes before commit: Request fails to caller. No update anywhere (not a false negative).
Request processing – failure scenarios A crashes after commit: Promote B to A. If B does not have update: consistent with request failure. Else: false negative.
Request processing – failure scenarios B crashes, cannot ack: New map contains new B copy on another node.
Request processing – failure scenarios B fails to execute request: B commits suicide. Region discarded. New B copy created elsewhere.
FAILOVER AND FAILBACK
Region lifecycle Region copy states: A: authoritative. B: backup. I: incomplete (copying data from A).
Region map in normal cluster 8 regions, 2 copies of each (A and B), 3 nodes (n1, n2, n3). [Diagram: region map table with columns Region, A, B.]
Node 3 crashes Each region needs an A copy to resume service. Create copies for regions missing one. [Diagram: region map with the copies on n3 removed.]
Promote B to A Cluster returns to service. [Diagram: region map with an A copy for every region on n1 or n2.]
Create new regions (state I) New regions copy data from A. Update requests are applied immediately in A and B region copies, and logged for later execution in the I copy. [Diagram: region map with new I copies on n1 and n2.]
I regions finish loading Recovery is complete. [Diagram: region map with two copies of each region on n1 and n2.]
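The recovery sequence on the preceding slides (drop the crashed node, promote B to A, create I copies, finish loading) can be sketched on a toy region map. The data layout and function names are assumptions, and node selection here is deliberately naive with no load balancing.

```python
def fail_node(region_map: dict, dead_node: str) -> None:
    """Remove a crashed node's copies and promote B to A where A was lost."""
    for copies in region_map.values():
        copies.pop(dead_node, None)
        if "A" not in copies.values():
            for node, state in copies.items():
                if state == "B":
                    copies[node] = "A"
                    break


def add_incomplete_copies(region_map: dict, live_nodes: list,
                          copies_per_region: int = 2) -> None:
    """Create I copies for regions that have too few copies."""
    for copies in region_map.values():
        for node in live_nodes:
            if len(copies) >= copies_per_region:
                break
            if node not in copies:
                copies[node] = "I"


def finish_loading(region_map: dict) -> None:
    """Convert every I copy to B once its data is loaded."""
    for copies in region_map.values():
        for node, state in copies.items():
            if state == "I":
                copies[node] = "B"
```

Here `region_map` maps a region number to a `{node: state}` dictionary, for example `{0: {"n1": "A", "n2": "B"}}`.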
Failback: n3 returns to service Assign n3 some regions, to balance load. [Diagram: region map with new n3 copies marked B, B, I.]
I regions complete loading Too many B regions. Rebalance to even out A copies per node and (A+B) copies per node. [Diagram: region map with n3 copies marked B, B, B.]
I regions complete loading Rebalance is complete. [Diagram: rebalanced region map across n1, n2, n3.]
I region lifecycle Create schema with tables only (no indexes or triggers). Copy data from A region: remote psql copy out piped to local psql copy in. Recompute derived data. Add indexes and triggers.
I region lifecycle Apply updates that arrived during above steps. Updates may arrive during this step. Apply updates again under lock (blocking new updates). Announce conversion from I to B.
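The two-pass catch-up described above can be sketched with a lock and an update log. This is a simplified in-memory model with hypothetical names; a dictionary stands in for the region's tables.

```python
import threading
from collections import deque


class IncompleteRegion:
    """An I region copy that logs updates until it becomes a B copy."""

    def __init__(self):
        self.rows = {}
        self.log = deque()
        self.lock = threading.Lock()
        self.state = "I"

    def receive_update(self, key, value):
        with self.lock:
            if self.state == "I":
                self.log.append((key, value))  # log for later execution
            else:
                self.rows[key] = value         # B copy: apply directly

    def _apply_logged(self):
        while True:
            try:
                key, value = self.log.popleft()
            except IndexError:
                return
            self.rows[key] = value

    def catch_up(self):
        # First pass: no lock held, so new updates may still be logged.
        self._apply_logged()
        # Second pass: holding the lock blocks receive_update, so the log
        # is guaranteed empty when the copy is announced as B.
        with self.lock:
            self._apply_logged()
            self.state = "B"
```

The first pass does most of the work without blocking writers, so the locked second pass only has to handle the few updates that arrived in the meantime.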
Upgrade requirements Offline upgrade: Shut down all nodes. Upgrade software. Migrate data. Online upgrade: Shut down one node at a time (failover). Upgrade software. Restart node (failback). Data migrated as part of I region lifecycle.
Upgrade requirements Upgrade ½ cluster, twice: Not implemented yet.
Online upgrade: data migration during loading of I region Four possible data migrations: Old→Old New→New Old→New New→Old Old→Old and New→New: just works
Old→New and New→Old Target: create staging tables matching source. Copy data into staging tables (as usual). Run conversion (pl/pgsql procedure).
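A rough sketch of the staging-table approach, using SQLite from the Python standard library in place of Postgres and a plain SQL statement in place of the pl/pgsql procedure. The table layouts and the seconds-to-milliseconds conversion are invented purely for illustration.

```python
import sqlite3


def migrate_old_to_new(source_rows):
    """Load old-format rows through a staging table into a new-format table."""
    db = sqlite3.connect(":memory:")
    # Target table in the hypothetical new format (mtime in milliseconds).
    db.execute(
        "CREATE TABLE file_meta (directory TEXT, filename TEXT, mtime_ms INTEGER)"
    )
    # Staging table matches the hypothetical old source format (seconds).
    db.execute(
        "CREATE TABLE staging_file_meta "
        "(directory TEXT, filename TEXT, mtime_s INTEGER)"
    )
    # Copy the data in unchanged, exactly as a same-version load would.
    db.executemany("INSERT INTO staging_file_meta VALUES (?, ?, ?)", source_rows)
    # Run the conversion from the staging format to the target format.
    db.execute(
        "INSERT INTO file_meta "
        "SELECT directory, filename, mtime_s * 1000 FROM staging_file_meta"
    )
    db.execute("DROP TABLE staging_file_meta")
    rows = db.execute(
        "SELECT directory, filename, mtime_ms FROM file_meta ORDER BY filename"
    ).fetchall()
    db.close()
    return rows
```

Keeping the copy step identical to the normal I region load means only the conversion step differs between upgrade directions.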
Other upgrades Offline: Just need old→new conversion scripts. ½ cluster: Same.
State of the world 2004: 1/2G RAM, 1 CPU. Postgres 7.4. shared_buffers = (250M).
Indexing Table storing file metadata has columns for directory and filename. PK on these columns is wide. Needed to fit more of index in cache. So: Add columns for hashes of directory, filename. Index is on these hashes instead. Might not be so important now.
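To illustrate the hashed-key index, here is a small sketch in which a sorted list plays the role of the index and a dictionary plays the role of the table. CRC32 and all names are assumptions; the slides do not say which hash was used.

```python
import bisect
import zlib


def key_hash(text: str) -> int:
    """Narrow stand-in for a wide text key (CRC32 is an assumption)."""
    return zlib.crc32(text.encode("utf-8"))


class HashedKeyIndex:
    def __init__(self):
        self.entries = []  # sorted (dir_hash, file_hash, directory, filename)
        self.rows = {}     # stands in for the table: full key -> metadata

    def insert(self, directory, filename, metadata):
        if (directory, filename) not in self.rows:
            entry = (key_hash(directory), key_hash(filename), directory, filename)
            bisect.insort(self.entries, entry)
        self.rows[(directory, filename)] = metadata

    def _scan(self, prefix):
        # Walk all index entries that start with the given hash prefix.
        i = bisect.bisect_left(self.entries, prefix)
        while i < len(self.entries) and self.entries[i][:len(prefix)] == prefix:
            yield self.entries[i]
            i += 1

    def lookup(self, directory, filename):
        """Lookup by full key; recheck the text in case hashes collide."""
        prefix = (key_hash(directory), key_hash(filename))
        for _, _, d, f in self._scan(prefix):
            if d == directory and f == filename:
                return self.rows[(d, f)]
        return None

    def list_directory(self, directory):
        """Lookup by partial key (directory only)."""
        prefix = (key_hash(directory),)
        return [f for _, _, d, f in self._scan(prefix) if d == directory]
```

Both access patterns work off the same ordering because the directory hash is the leading component, and the final text comparison guards against two different keys sharing a hash.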
Dealing with hotspots On every file creation, need to maintain metadata on parent directory (change time, file and subdir counts). Directory is a hotspot for updates. Needed to vacuum frequently to maintain performance. Can't afford to vacuum frequently.
Dealing with hotspots So: Moved directory records into a separate table. Reduces width by 80%. Reduces number of rows by %. Can afford to vacuum very frequently (every 2000 updates).
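A count-based vacuum trigger like the one described could look roughly like this. The class is hypothetical; in a real deployment the `vacuum` callback would presumably issue a VACUUM on the directory table over a database connection.

```python
class VacuumScheduler:
    """Calls `vacuum` once every `threshold` recorded updates."""

    def __init__(self, vacuum, threshold: int = 2000):
        self.vacuum = vacuum
        self.threshold = threshold
        self.updates_since_vacuum = 0

    def record_update(self):
        # Must be called for every update: a missed count delays vacuuming.
        self.updates_since_vacuum += 1
        if self.updates_since_vacuum >= self.threshold:
            self.vacuum()
            self.updates_since_vacuum = 0
```

The scheme depends entirely on the counter being accurate, so every code path that updates the directory table needs to call `record_update`.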
SPI A few columns have binary-encoded data. Read/written by Java layer. Also need human-readable form in SQL queries. Use SPI to render in Python-friendly form. Allows for easy integration with Python tools and tests.
Conclusion Postgres has just worked. Vacuuming is a bit of a pain. Three vacuum schedules (didn't use autovacuum). Bug in update counting led to failure to vacuum, causing performance problems. Reliable, scalable.