Achieving Zero-Downtime Postgres Major Version Upgrades

2025-01-31
ℹ️ Note on the source

This blog post was automatically generated (and translated). It is based on the following original, which I selected for publication on this blog:
A Major Postgres Upgrade with Zero Downtime.

Upgrading a database to a new major version can be a daunting task, particularly when minimizing downtime is critical. One team recounts their experience upgrading their Aurora Postgres instance, sharing valuable lessons learned and the innovative approach they took to achieve zero downtime.

The Challenge

Faced with performance issues, the team discovered that upgrading to Postgres 16 significantly improved query performance. However, the standard upgrade methods all involved unacceptable downtime.

Initial Attempts and Roadblocks

  • In-place Upgrades: While straightforward, this method involved significant downtime, making it unsuitable for their globally distributed user base.
  • Blue-Green Deployments: Aurora Postgres offers blue-green deployments that promise minimal downtime. However, the team hit compatibility issues caused by active replication slots, a detail not mentioned in the AWS documentation. This underscores the importance of thorough rehearsal in a production-like environment; a query that would have surfaced the conflicting slots is sketched after this list.
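
A pre-flight check along these lines (a sketch, not taken from the original post) lists the replication slots on the source instance before a blue-green deployment is attempted:

    -- Any active slots returned here are the kind of detail that blocked
    -- the team's blue-green deployment.
    SELECT slot_name, plugin, slot_type, active
    FROM pg_replication_slots;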

The Manual Approach

With managed options failing, the team opted for a manual upgrade built around a new replica running Postgres 16. The approach consisted of these key steps (a sketch of the core replication commands follows the list):

  1. Creating a new Aurora Postgres database running Postgres 16.
  2. Extracting the schema from the existing database.
  3. Importing the schema into the new database.
  4. Creating a publication on the existing database for all tables.
  5. Creating a subscription on the new database with copy_data = true.
  6. Confirming no data loss.
  7. Running VACUUM ANALYZE on the new database.
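
A minimal sketch of steps 4, 5, and 7. The publication and subscription names and the connection string are illustrative placeholders, not the team's actual values:

    -- Step 4, on the existing database: publish every table.
    CREATE PUBLICATION upgrade_pub FOR ALL TABLES;

    -- Step 5, on the new Postgres 16 database: subscribe and let the initial
    -- table sync copy all existing rows (copy_data = true).
    CREATE SUBSCRIPTION upgrade_sub
        CONNECTION 'host=old-db.internal dbname=app user=replicator password=replace-me'
        PUBLICATION upgrade_pub
        WITH (copy_data = true);

    -- Step 7, on the new database once the copy has caught up: refresh
    -- planner statistics.
    VACUUM ANALYZE;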

Overcoming Replication Issues

During replication, a custom Postgres function caused errors because of search_path issues. The fix was to schema-qualify references inside the function definitions by explicitly adding the public. prefix.
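
As an illustration (the function and table names here are hypothetical, not from the post), the corrected definitions qualify every reference with the schema instead of relying on the search_path:

    -- Hypothetical example of the fix: every object reference carries an
    -- explicit public. prefix, so the function still resolves when invoked
    -- with a restricted search_path.
    CREATE OR REPLACE FUNCTION public.order_total(p_order_id bigint)
    RETURNS numeric
    LANGUAGE sql
    AS $$
        SELECT sum(li.price * li.quantity)
        FROM public.line_items AS li
        WHERE li.order_id = p_order_id;
    $$;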

Ensuring Data Integrity

To ensure no data was lost during replication, the team performed sanity checks using a dedicated transactions table. This revealed initial data loss, prompting a revised approach.
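
The post does not show the exact queries; a check in this spirit, assuming the transactions table has a monotonically increasing id column, runs the same aggregate on both databases and compares the output:

    -- Run on the old database and on the new database, then diff the results.
    SELECT count(*) AS row_count,
           max(id)  AS max_id
    FROM transactions;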

Replicating from Scratch

To mitigate the data loss, they switched from cloning the existing database to replicating all of its data from scratch. What are the trade-offs between cloning and replicating from scratch?

Achieving Zero Downtime

To achieve zero downtime, the team implemented a failover routine that paused new transactions, waited for in-flight transactions to finish and for the new database to catch up, and then unpaused transactions against the new database. This required temporarily scaling down to a single machine so that all active connections could be controlled in one place.
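
The failover routine itself is not reproduced in the post; a rough SQL sketch of the checks it implies, using the standard Postgres monitoring views, might look like this:

    -- On the old database, once new transactions are paused: confirm no
    -- application transactions are still in flight (ignoring this session).
    SELECT count(*) AS in_flight
    FROM pg_stat_activity
    WHERE state = 'active'
      AND backend_type = 'client backend'
      AND pid <> pg_backend_pid();

    -- On the old database: note the current WAL position.
    SELECT pg_current_wal_lsn();

    -- On the new database: poll until the subscription has received at least
    -- the WAL position noted above, then direct traffic to the new database.
    SELECT subname, received_lsn, latest_end_lsn
    FROM pg_stat_subscription;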

Addressing Sequence Issues

Unique constraint violations arose because sequence state is not carried over by logical replication. This was resolved by incrementing the sequences on the new database as part of the failover function.
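
A hedged sketch of that fix, reusing the hypothetical transactions table from above (the headroom of 1000 is an arbitrary illustrative value):

    -- Advance each sequence on the new database past the highest value
    -- already in use, with some headroom, before accepting writes.
    SELECT setval('transactions_id_seq',
                  (SELECT max(id) FROM transactions) + 1000);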

Future Improvements

The team identified potential improvements for future upgrades, including:

  • Skipping the pause on read-only connections.
  • Implementing a two-phase-commit system for environments that cannot be scaled down to a single machine.

Key Takeaways

  • Upgrading Postgres versions can improve performance.
  • Thorough rehearsal in a production-like environment is crucial.
  • Manual upgrades offer finer control but require careful planning and execution.
  • Consider replicating from scratch to avoid data loss.
  • Zero-downtime upgrades are possible at modest scale with careful connection management.

This experience highlights the complexities of database upgrades and the ingenuity required to minimize disruption. As systems grow in scale and complexity, strategies like these will become increasingly vital. Which path do we want to take? How can we simplify the process?

