TECHNOLOGY

How we upgraded our 4TB PostgreSQL database


Peter Johnston

18 April 2022

8 min read

Retool’s cloud-hosted product is backed by a single large 4 TB Postgres database running in Microsoft’s Azure cloud. Last fall, we migrated this database from Postgres version 9.6 to version 13 with minimal downtime.

How did we do it? To be honest, it wasn’t an entirely straight path from point A to point B. In this post, we’ll tell the story and share tips to help you with a similar upgrade.

Motivation

For those of you new to Retool, we’re a platform for building internal tools quickly. You can use a drag-and-drop editor to build UIs, and easily hook them up to your own data sources, including databases, APIs, and third-party tools. You can use Retool as a cloud-hosted product (which is backed by the database we’re talking about in this post), or you can host it yourself. As the 4 TB database size suggests, many Retool customers are building many apps in the cloud.

Last fall, we decided to upgrade our main Postgres database for a compelling reason: Postgres 9.6 was reaching end-of-life on November 11, 2021, which meant it would no longer receive bug fixes or security updates. We didn’t want to take any chances with our customers’ data, so we couldn’t stay on that version. It was that simple.

Technical design

This upgrade involved a few high-level decisions:

  • What version of Postgres should we upgrade to?
  • What strategy do we use to perform the upgrade?
  • How do we test the upgrade?

Before we dive in, let’s review our constraints and goals. There were just a few.

  • Complete the upgrade before November 11, 2021.
  • Minimize downtime, especially during Monday-Friday business hours worldwide. This was the biggest consideration after the hard deadline, because Retool is critical to many of our customers.
  • Downtime is especially a factor when you’re operating on 4 TB. At this scale, simple things become harder.
  • We wanted our maintenance window to be about one hour max.
  • Maximize the amount of time this upgrade buys us before we have to upgrade again.

PostgreSQL version 13

We decided to upgrade to Postgres 13, because it fit all of the above criteria, and especially the last one: buying us the most time before the next upgrade.

Postgres 13 was the latest released version of Postgres when we began preparing for the upgrade, with a support window through November 2025. We anticipate that we’ll have sharded our database by the end of that support window, and will be performing our next major version upgrades incrementally.

Postgres 13 also comes with a number of features not available in prior versions. Here is the full list, and here are a few we were most excited about:

  • Significant performance improvements, including in parallel query execution.
  • The ability to add columns with non-null defaults safely, which eliminates a classic footgun. In earlier Postgres versions, adding a column with a non-null default causes Postgres to perform a table rewrite while blocking concurrent reads and writes, which can lead to downtime (see the short sketch after this list).
  • Parallelized vacuuming of indices. (Retool has a lot of tables with high write traffic, and we care a lot about vacuuming.)
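
As a concrete illustration of that second point, here is a minimal sketch of the statement in question, using psycopg2 with a placeholder connection string and a hypothetical audit_events table. On 9.6 this rewrites the table under an exclusive lock; on recent versions it is a quick metadata-only change.

    import psycopg2

    conn = psycopg2.connect("postgresql://user:pass@db-host:5432/retool")  # placeholder DSN
    with conn:
        with conn.cursor() as cur:
            # Newer Postgres records the default in the catalog instead of rewriting the
            # table, so this no longer blocks concurrent reads and writes during a rewrite.
            cur.execute(
                "ALTER TABLE audit_events "
                "ADD COLUMN source text NOT NULL DEFAULT 'unknown'"
            )
    conn.close()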

Upgrade strategy

Great, we’d picked a target version. Now, how were we going to get there?

In general, the easiest way to upgrade Postgres database versions is to do a pg_dump and pg_restore. You take down your app, wait for all connections to close, then take down the database. With the database in a frozen state, you dump its contents to disk, then restore the contents to a new database server running the target Postgres version. Once the restore is complete, you point your app at the new database and bring the app back up.
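
For reference, here is a rough sketch of that approach, driving pg_dump and pg_restore from Python’s subprocess module. The connection strings, dump path, and job counts are placeholders rather than our actual setup.

    import subprocess

    OLD_DB = "postgresql://user:pass@old-96-host:5432/retool"  # placeholder
    NEW_DB = "postgresql://user:pass@new-13-host:5432/retool"  # placeholder

    # Dump the frozen 9.6 database in directory format so the dump can run in parallel.
    subprocess.run(
        ["pg_dump", "--format=directory", "--jobs=8", "--file=/tmp/retool_dump", OLD_DB],
        check=True,
    )

    # Restore into the Postgres 13 server, again with parallel workers.
    subprocess.run(
        ["pg_restore", "--jobs=8", f"--dbname={NEW_DB}", "/tmp/retool_dump"],
        check=True,
    )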

This upgrade option was attractive because it is both simple and fully guarantees that data cannot be out of sync between the old database and the new database. But we eliminated this option right away because we wanted to minimize downtime: a dump and restore of 4 TB would require downtime measured in days, not hours or minutes.

We instead settled on a strategy based on logical replication. With this approach, you run two copies of your database in parallel: the primary database you’re upgrading, and a secondary “follower” database running the target Postgres version. The primary publishes changes to its persistent storage (by decoding its write-ahead log) to the secondary database, allowing the secondary to quickly replicate the primary’s state. This effectively eliminates the wait to restore the database on the target Postgres version: instead, the target database is always up to date.

Notably, this approach requires much less downtime than the “dump and restore” strategy. Instead of having to rebuild the entire database, we simply needed to stop the app, wait for all transactions on the old v9.6 primary to complete, wait for the v13 secondary to catch up, and then point the app at the secondary. Instead of days, this could take place within a few minutes.
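
In practice, “wait for the secondary to catch up” means polling replication lag on the primary until it reaches zero. Here is a minimal sketch of that check, assuming psycopg2 and a placeholder connection string; note the 9.6 function names (pg_current_xlog_location, pg_xlog_location_diff), which Postgres 10 renamed to their wal equivalents.

    import time

    import psycopg2

    conn = psycopg2.connect("postgresql://user:pass@old-96-host:5432/retool")  # placeholder
    conn.autocommit = True  # avoid holding a long-lived transaction open on the primary

    while True:
        with conn.cursor() as cur:
            cur.execute("""
                SELECT slot_name,
                       pg_xlog_location_diff(pg_current_xlog_location(),
                                             confirmed_flush_lsn) AS lag_bytes
                FROM pg_replication_slots
                WHERE slot_type = 'logical'
            """)
            lagging = [(name, lag) for name, lag in cur.fetchall() if lag and lag > 0]
        if not lagging:
            print("follower caught up; safe to cut over")
            break
        print("still catching up:", lagging)
        time.sleep(5)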

Testing strategy

We maintain a staging environment of our cloud Retool instance. Our testing strategy was to do a number of test runs in this staging environment, and to build and iterate on a detailed runbook through that process.

The test runs and runbook served us well. As you’ll see in the section below, we performed many manual steps during the maintenance window. During the final cutover, these steps went off largely without a hitch thanks to the multiple dress rehearsals we’d had in the prior weeks, which helped us build a really detailed runbook.

Our main initial oversight was not testing with a representative workload in staging. The staging database was smaller than the production one, and though the logical replication strategy should have enabled us to handle the larger production workload, we missed details that led to an outage for Retool’s cloud service. We’ll outline these details in the section below, but here is the biggest lesson we hope to convey: the importance of testing with a representative workload.

Plan in practice: technical details


Implementing logical replication

We ended up using Warp. Notably, Azure’s Single Server Postgres product doesn’t support the pglogical Postgres extension, which our research led us to believe is the best-supported option for logical replication on Postgres versions before version 10.

One early detour we took was trying out Azure’s Database Migration Service (DMS). Under the hood, DMS first takes a snapshot of the source database and then restores it into the target database server. Once the initial dump and restore completes, DMS turns on logical decoding, a Postgres feature that streams persistent database changes to external subscribers.

However, on our 4 TB production database, the initial dump and restore never completed: DMS encountered an error but did not report that error to us. Meanwhile, despite making no forward progress, DMS held transactions open on our 9.6 primary. These long-running transactions in turn blocked Postgres’s autovacuum feature, as vacuum processes cannot clean up dead tuples created after a long-running transaction begins. As dead tuples piled up, the 9.6 primary’s performance began to suffer. This led to the outage we referenced above. (We have since added monitoring to keep track of Postgres’s unvacuumed tuple count, allowing us to proactively detect dangerous situations.)
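
The checks behind that monitoring are straightforward to reproduce. Here is a minimal sketch, with a placeholder connection string: one query surfaces the tables with the most dead tuples, the other surfaces the oldest open transactions that keep vacuum from cleaning them up.

    import psycopg2

    conn = psycopg2.connect("postgresql://user:pass@primary-host:5432/retool")  # placeholder
    conn.autocommit = True

    with conn.cursor() as cur:
        # Tables with the most dead (unvacuumed) tuples.
        cur.execute("""
            SELECT relname, n_dead_tup, last_autovacuum
            FROM pg_stat_user_tables
            ORDER BY n_dead_tup DESC
            LIMIT 10
        """)
        for relname, n_dead_tup, last_autovacuum in cur.fetchall():
            print(relname, n_dead_tup, last_autovacuum)

        # Long-running transactions, which prevent vacuum from reclaiming newer dead tuples.
        cur.execute("""
            SELECT pid, now() - xact_start AS xact_age, state, left(query, 80)
            FROM pg_stat_activity
            WHERE xact_start IS NOT NULL
            ORDER BY xact_start
            LIMIT 10
        """)
        for row in cur.fetchall():
            print(row)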

Warp functions similarly to DMS but offers many more configuration options. In particular, Warp supports parallel processing to speed up the initial dump and restore.

We had to do a bit of finagling to coax Warp into processing our database. Warp expects all tables to have a single-column primary key, so we had to convert compound primary keys into unique constraints and add scalar primary keys. Otherwise, Warp was very easy to use.
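
Here is a hypothetical example of that reshaping for a single table; the table, column, and constraint names are made up, not our real schema.

    import psycopg2

    conn = psycopg2.connect("postgresql://user:pass@old-96-host:5432/retool")  # placeholder
    with conn:
        with conn.cursor() as cur:
            # Replace the compound primary key with an equivalent unique constraint...
            cur.execute("ALTER TABLE app_permissions DROP CONSTRAINT app_permissions_pkey")
            cur.execute(
                "ALTER TABLE app_permissions "
                "ADD CONSTRAINT app_permissions_user_app_uniq UNIQUE (user_id, app_id)"
            )
            # ...then add the scalar primary key Warp expects. Note that on 9.6 adding a
            # serial column rewrites the table, so large tables need their own planning.
            cur.execute("ALTER TABLE app_permissions ADD COLUMN id BIGSERIAL PRIMARY KEY")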

Skipping replication of large tables

We further optimized our approach by having Warp skip two particularly large tables that dominated the dump and restore runtime. We did this because pg_dump can’t operate in parallel on a single table, so the largest table determines the shortest possible migration time.

To handle the two large tables we skipped in Warp, we wrote a Python script to bulk transfer data from the old database server to the new one. The larger 2 TB table, an append-only table of audit events in the app, was easy to transfer: we waited until after the cutover to migrate the contents, as the Retool product functions just fine even if that table is empty. We also chose to move very old audit events to a backup storage solution, to cut down on the table size.

The other table, a hundreds-of-gigabytes append-only log of all edits to all Retool apps called page_saves, was trickier. This table serves as the source of truth for all Retool apps, so it needed to be up to date the moment we came back from maintenance. To solve this, we migrated most of its contents in the days leading up to our maintenance window, and migrated the remainder during the window itself. Though this worked, we note that it did add extra risk, since we now had more work to complete during the short maintenance window.
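
We can’t share the real script, but here is a simplified sketch of the shape it took: keyset pagination over the old table, batched inserts into the new one. The connection strings, batch size, and column names are illustrative assumptions.

    import psycopg2

    BATCH_SIZE = 10_000

    src = psycopg2.connect("postgresql://user:pass@old-96-host:5432/retool")  # placeholder
    dst = psycopg2.connect("postgresql://user:pass@new-13-host:5432/retool")  # placeholder
    src.autocommit = True  # keep read transactions short so autovacuum isn't blocked

    last_id = 0
    with src.cursor() as s, dst.cursor() as d:
        while True:
            # Walk the append-only table in primary key order, one batch at a time.
            s.execute(
                "SELECT id, page_id, data, created_at FROM page_saves "
                "WHERE id > %s ORDER BY id LIMIT %s",
                (last_id, BATCH_SIZE),
            )
            rows = s.fetchall()
            if not rows:
                break
            d.executemany(
                "INSERT INTO page_saves (id, page_id, data, created_at) "
                "VALUES (%s, %s, %s, %s)",
                rows,
            )
            dst.commit()
            last_id = rows[-1][0]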

Developing a runbook

These were, at a high level, the steps in our runbook during the maintenance window:

  • Stop the Retool service and let all outstanding database transactions commit.
  • Wait for the follower Postgres 13 database to catch up on logical decoding.
  • In parallel, copy over the remaining page_saves rows.
  • Once all data is in the Postgres 13 server, enable primary key constraint enforcement. (Warp requires these constraints to be disabled.)
  • Enable triggers. (Warp requires triggers to be disabled.)
  • Reset all sequence values, so that sequential integer primary key allocation would work once the app came back online (see the sketch after this list).
  • Slowly bring back the Retool service, pointing at the new database instead of the old one, performing health checks along the way.
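
The sequence-reset step, sketched here as promised, walks every serial-backed column and bumps its sequence past the highest existing value. The connection string is a placeholder and identifier handling is simplified.

    import psycopg2

    conn = psycopg2.connect("postgresql://user:pass@new-13-host:5432/retool")  # placeholder
    with conn:
        with conn.cursor() as cur:
            # Find every column whose default comes from a sequence (serial/bigserial columns).
            cur.execute("""
                SELECT table_schema, table_name, column_name
                FROM information_schema.columns
                WHERE column_default LIKE 'nextval(%'
                  AND table_schema NOT IN ('pg_catalog', 'information_schema')
            """)
            for schema, table, column in cur.fetchall():
                qualified = f"{schema}.{table}"
                # Bump the sequence past the current max so new inserts don't collide.
                # (Sketch only: identifiers are interpolated directly, assuming sane names.)
                cur.execute(
                    f"SELECT setval(pg_get_serial_sequence('{qualified}', '{column}'), "
                    f"COALESCE((SELECT MAX({column}) FROM {qualified}), 1))"
                )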

Enabling foreign key constraint enforcement

As you can see from the runbook above, one of the steps we had to do was to turn off and then re-enable foreign key constraint checks. The complication is that, by default, Postgres runs a full table scan when enabling foreign key constraints, to verify that all existing rows are valid under the new constraint. For our large database, this was a problem: Postgres simply couldn’t scan terabytes of data in our one-hour maintenance window.

To work around this, we ended up choosing to leave foreign key constraints unenforced on a few large tables. We reasoned this was likely safe, as Retool’s product logic performs its own consistency checks, and moreover doesn’t delete from the referenced tables, which means it was unlikely we’d be left with a dangling reference. Still, this was a risk; if our reasoning was wrong, we’d end up with a pile of invalid data to clean up.

Later, during post-maintenance cleanup where we restored the missing foreign key constraints, we learned that Postgres offers a neat solution to our problem: the NOT VALID option to ALTER TABLE. Adding a constraint with NOT VALID causes Postgres to enforce the constraint against new data but not existing data, thus bypassing the expensive full table scan. Later, you just have to run ALTER TABLE … VALIDATE CONSTRAINT, which runs the full table scan and removes the “not valid” flag from the constraint. When we did so, we found no invalid data in our tables, which was a huge relief! We wish we had known about this option before the maintenance window.
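
Here is a minimal sketch of that pattern, again via psycopg2, with hypothetical table and constraint names.

    import psycopg2

    conn = psycopg2.connect("postgresql://user:pass@new-13-host:5432/retool")  # placeholder
    conn.autocommit = True

    with conn.cursor() as cur:
        # During the maintenance window: enforce the constraint for new rows only,
        # skipping the full scan of existing rows.
        cur.execute("""
            ALTER TABLE page_saves
                ADD CONSTRAINT page_saves_page_id_fkey
                FOREIGN KEY (page_id) REFERENCES pages (id)
                NOT VALID
        """)

        # Later, outside the window: validate existing rows. This scans the table but
        # takes only a SHARE UPDATE EXCLUSIVE lock, so normal reads and writes continue.
        cur.execute("ALTER TABLE page_saves VALIDATE CONSTRAINT page_saves_page_id_fkey")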

Results

We scheduled the maintenance window late on Saturday, October 23rd, during the lowest period of Retool cloud traffic. With the configuration described above, we were able to bring up a new database server at version 13 in around 15 minutes, subscribed to changes at our 9.6 primary with logical decoding.

To recap, a logical replication strategy (aided by Warp), along with dress rehearsals in a staging environment during which we built a sturdy runbook, enabled us to migrate our 4 TB database from Postgres 9.6 to 13. In the process, we learned the importance of testing on real workloads, made creative use of skipping large, less-critical tables, and learned (a little late) that Postgres lets you enforce foreign key constraints selectively on new data instead of all data. We hope you learned something from our journey too.
