Hi,
I’m wondering how to achieve zero-downtime deployments of new Hydra versions.
I wasn’t able to find any documentation on this. The only thing I found was this issue, which doesn’t offer any insight into possible solutions: https://github.com/ory/hydra/issues/1236
As far as I can see, running old and new versions side by side is possible as long as the new version of Hydra doesn’t come with breaking datamodel changes (dropping/renaming columns/tables, changing column definitions etc.), doesn’t handle the data stored in the existing columns differently, and doesn’t fill any newly added columns/tables with data during migration. In that case you can theoretically spin up instances of the new version of Hydra in parallel with the old version, both connecting to the same datastore, and once the instances of the new version are up and running, shut down the instances of the old version, thus achieving a zero-downtime deployment of the new version. Would you say this works, within the mentioned limitations?
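To make the first of those conditions a bit more concrete, here is a rough sketch of how a migration script could be screened for statements that would break an old version still running against the same schema. The patterns, the table and column names, and the naive split on `;` are all my own assumptions for illustration and are not taken from Hydra's migrations. It also cannot detect the other two conditions (changed handling of existing data, or backfilling of new columns).

```python
import re

# Statement patterns that usually break an old application version that is
# still running against the same schema. Illustrative only, not exhaustive.
BREAKING_PATTERNS = [
    r"\bDROP\s+TABLE\b",
    r"\bDROP\s+COLUMN\b",
    r"\bRENAME\b",
    r"\bALTER\s+COLUMN\b.*\b(TYPE|SET\s+NOT\s+NULL)\b",
]


def breaking_statements(migration_sql: str) -> list[str]:
    """Return the statements in a migration that match a breaking pattern."""
    # Naive split on ';' (assumption: no semicolons inside string literals
    # or function bodies).
    statements = [s.strip() for s in migration_sql.split(";") if s.strip()]
    return [
        s
        for s in statements
        if any(re.search(p, s, re.IGNORECASE | re.DOTALL) for p in BREAKING_PATTERNS)
    ]


# Hypothetical migration, made up for this example.
example = """
ALTER TABLE example_client ADD COLUMN note TEXT NULL;
ALTER TABLE example_client DROP COLUMN legacy_flag;
CREATE INDEX example_idx ON example_client (note);
"""

print(breaking_statements(example))
# → ['ALTER TABLE example_client DROP COLUMN legacy_flag']
```

An empty result would only mean that none of these patterns matched, not that the migration is safe to run alongside the old version.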
However, if any of the conditions mentioned above isn’t met, how can zero-downtime deployment be achieved?
I understand you’re working on providing the Ory stack as a service, so you must have tackled this.
The only thing I can think of is creating a new database, restoring a backup of the current database into it, and running the migration on that. In parallel, you would track all changes being made in the old DB from the moment the backup starts (in Postgres using the WAL/logical decoding/…), transform those changes to match the new datamodel structure and then, once the restore into the new database has finished, apply all (cached) changes to the new database as well. This “track changes > transform > apply to new DB” loop needs to keep running until all instances of the old version of Hydra are shut down and no more mutations are being made on the old DB.
Besides being quite cumbersome, this process has a lot of overhead. More importantly, we’d need to figure out how to do the proper transforms, more or less by reverse engineering the migration scripts. And I haven’t even thought about any race conditions between changes made to the same data simultaneously in both the old and new database and the sync of those changes from the old to the new database.
I guess the latter could be minimized by keeping the window in which both versions of Hydra are live very short (by spinning up the new version without sending it any traffic yet, and then making the cutover from old to new on the network layer).
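To illustrate the “track changes > transform > apply to new DB” idea, here is a minimal sketch. Everything in it is a stand-in: the change records are plain dicts rather than real logical decoding output, the new database is a Python dict, and the column rename (`subject` to `sub`) is a made-up datamodel change, not something from an actual Hydra migration.

```python
from collections import deque

# Hypothetical change in the new datamodel: column "subject" became "sub".
COLUMN_RENAMES = {"subject": "sub"}


def transform(change: dict) -> dict:
    """Map one change captured from the old schema onto the new schema."""
    row = {COLUMN_RENAMES.get(k, k): v for k, v in change["row"].items()}
    return {"op": change["op"], "key": change["key"], "row": row}


def apply(new_db: dict, change: dict) -> None:
    """Apply a transformed change to the new database (a dict stands in here)."""
    if change["op"] == "delete":
        new_db.pop(change["key"], None)
    else:  # insert or update
        new_db.setdefault(change["key"], {}).update(change["row"])


def drain(backlog: deque, new_db: dict) -> int:
    """Replay buffered changes in capture order; return how many were applied."""
    applied = 0
    while backlog:
        apply(new_db, transform(backlog.popleft()))
        applied += 1
    return applied


# New database as restored from the backup and already migrated.
new_db = {1: {"sub": "alice"}}

# Changes captured from the old database while the restore was running.
backlog = deque([
    {"op": "update", "key": 1, "row": {"subject": "alice2"}},
    {"op": "insert", "key": 2, "row": {"subject": "bob"}},
    {"op": "delete", "key": 1, "row": {}},
])

drain(backlog, new_db)
print(new_db)
# → {2: {'sub': 'bob'}}
```

In a real setup `drain` would have to be called repeatedly until the old instances are shut down, and it only handles changes flowing from old to new. The race conditions with writes made directly to the new database are not addressed at all.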
So, am I missing something? Is there a better strategy to achieve zero-downtime deployment when the migration is such that you cannot run the old and new version in parallel? Any generic pointers on achieving zero-downtime deployments, or on making the process easier, would be much appreciated.
On a side note: I looked at the (Postgres) SQL migration scripts of previous releases and noticed that some of the best practices for making datamodel changes in live systems weren’t being used, for example creating indexes concurrently. Have a look at https://www.braintreepayments.com/blog/safe-operations-for-high-volume-postgresql/ for some more best practices.
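As an example of what I mean by creating indexes concurrently, here is a small sketch that rewrites a plain `CREATE INDEX` statement into its `CONCURRENTLY` form. The helper name and the index/table names are made up for illustration. Keep in mind that Postgres does not allow `CREATE INDEX CONCURRENTLY` inside a transaction block, so a migration runner that wraps each script in a transaction would need to execute such statements separately.

```python
import re

# Matches CREATE [UNIQUE] INDEX that is not already followed by CONCURRENTLY.
_CREATE_INDEX = re.compile(
    r"\bCREATE\s+(UNIQUE\s+)?INDEX\b(?!\s+CONCURRENTLY\b)\s+",
    re.IGNORECASE,
)


def make_concurrent(statement: str) -> str:
    """Rewrite a plain CREATE INDEX into CREATE INDEX CONCURRENTLY."""
    return _CREATE_INDEX.sub(
        lambda m: f"CREATE {'UNIQUE ' if m.group(1) else ''}INDEX CONCURRENTLY ",
        statement,
        count=1,
    )


# Hypothetical index and table names.
print(make_concurrent("CREATE INDEX example_idx ON example_table (example_col)"))
# → CREATE INDEX CONCURRENTLY example_idx ON example_table (example_col)
```

Statements that already use `CONCURRENTLY` are left unchanged.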
Paul