Possible strategies for Zero Downtime Upgrades of Hydra

Hi,

I’m wondering how to achieve zero-downtime deployments of new Hydra versions.

I wasn’t able to find any documentation on this; the only thing I found was this issue, which doesn’t offer any insight into possible solutions: https://github.com/ory/hydra/issues/1236

As far as I can see, as long as the new version of Hydra doesn’t come with breaking datamodel changes (dropping/renaming columns/tables, changing column definitions etc.), doesn’t handle the data stored in the existing columns differently, and doesn’t fill any newly added columns/tables with data during migration, you can theoretically spin up instances of the new version of Hydra in parallel to the old version, both connecting to the same datastore. Once the instances of the new version are up and running, you shut down the instances of the old version, thus achieving a zero-downtime deployment of the new version. Would you say this works, within the mentioned limitations?

However, if any of these conditions isn’t met, how can zero-downtime deployment be achieved?

I understand you’re working on providing the Ory stack as a Service, so you must have tackled this :slight_smile:

The only thing I can think of is creating a new database, restoring a backup of the current database into it, and running the migration on that, while in parallel tracking all changes made to the old DB from the moment the backup was started (in Postgres using the WAL/logical decoding/…), transforming those changes to match the new datamodel structure and then, once the restore into the new database is finished, applying all (cached) changes to the new database as well. And the “track changes > transform > apply to new DB” loop needs to keep running until all instances of the old version of Hydra are shut down and no mutations are being made to the old DB anymore.
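For concreteness, a rough sketch of that process on Postgres could look like the following. The database names, the replication slot name, and the use of the built-in `test_decoding` plugin are all placeholders for illustration, and the hard part (transforming the decoded changes to the new datamodel) is deliberately not shown:

```shell
# Hypothetical sketch of the change-tracking idea using Postgres logical
# decoding. Database and slot names are made up for illustration.

# 1. Create a logical replication slot on the old database before taking
#    the backup, so every change from this point on is retained.
psql -d hydra_old -c \
  "SELECT pg_create_logical_replication_slot('hydra_upgrade', 'test_decoding');"

# 2. Copy the data into the new database, then run the new Hydra
#    version's migrations against it.
pg_dump -d hydra_old | psql -d hydra_new
hydra migrate sql "$NEW_DSN"   # NEW_DSN points at hydra_new

# 3. Repeatedly drain the slot: read the changes made to the old database,
#    transform them to the new datamodel, and apply them to hydra_new
#    (the transform/apply step is the part you'd have to build yourself).
psql -d hydra_old -c \
  "SELECT data FROM pg_logical_slot_get_changes('hydra_upgrade', NULL, NULL);"

# 4. Once the old Hydra instances are shut down and the slot is drained,
#    drop the slot and cut traffic over to the new instances.
psql -d hydra_old -c "SELECT pg_drop_replication_slot('hydra_upgrade');"
```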

Besides this being quite a cumbersome process with a lot of overhead, we’d, more importantly, need to figure out how to do the proper transforms, essentially by reverse engineering the migration scripts. And I haven’t even thought about race conditions between changes made to the same data simultaneously in both the old and new database and the sync of those changes from the old to the new database.

I guess the latter could be minimized by keeping the window in which both versions of Hydra are live very short (by spinning up the new version but not sending it any traffic yet, and then making the cutover from old to new on the network layer).

So, am I missing something? Is there a better strategy to achieve zero-downtime deployment when the migration is such that you cannot run the old and new version in parallel? Any generic pointers you have on achieving zero-downtime deployments, or on making the process easier, would be much appreciated.

On a side note: I looked at the (Postgres) SQL migration scripts of previous releases and noticed that some of the best practices for making datamodel changes in live systems weren’t being followed, for example creating indexes concurrently. Have a look at https://www.braintreepayments.com/blog/safe-operations-for-high-volume-postgresql/ for some more best practices.
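To illustrate what I mean: the difference is a single keyword, but the locking behavior is very different (the table and index names below are made up for illustration, not taken from Hydra’s actual schema):

```shell
# Illustrative only; table and index names are placeholders.

# Plain CREATE INDEX takes a lock that blocks writes to the table for
# the entire duration of the index build (reads are still allowed):
psql -d hydra -c \
  "CREATE INDEX tokens_client_id_idx ON tokens (client_id);"

# The CONCURRENTLY variant builds the index without blocking writes,
# at the cost of a slower build and not being runnable inside a
# transaction block (which matters for migration tooling):
psql -d hydra -c \
  "CREATE INDEX CONCURRENTLY tokens_client_id_idx ON tokens (client_id);"
```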

Paul

Yes, but there are no plans to make this public knowledge as of now. Maybe some things will go upstream but no guarantees.

The linked blog post is from 2014, PostgreSQL has evolved quite a lot since then. Also, since version 1.0 we haven’t made any changes that could cause long locks.

In general, blue/green deployments when upgrading is the way to go.
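For instance, on Kubernetes a blue/green cutover could be sketched roughly like this. The deployment/service names and labels are placeholders, and it assumes the datamodels of both versions are compatible (no pending SQL migrations):

```shell
# Hypothetical blue/green cutover on Kubernetes; all names are made up.
# Both deployments point at the same database DSN.

# 1. Deploy the new ("green") version alongside the old one, without
#    routing any traffic to it yet.
kubectl apply -f hydra-green-deployment.yaml
kubectl rollout status deployment/hydra-green

# 2. Switch the service selector from blue to green; in-flight requests
#    to blue finish, new requests hit green.
kubectl patch service hydra \
  -p '{"spec":{"selector":{"app":"hydra","track":"green"}}}'

# 3. Once green is confirmed healthy, remove the old deployment.
kubectl delete deployment hydra-blue
```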

Hi arekkas,

Thanks for the response. I understand you don’t want to make the proprietary architecture of your upcoming service offering public, but some pointers on how to achieve zero-downtime deployment for Hydra would be much appreciated.

Also, if the answer is that it’s not realistically feasible at the moment to achieve this, then that is also an answer :slight_smile:

I agree that, in general, blue/green deployments are the way to go, but can they be achieved with Hydra?

Is my assumption correct that you can do a simple blue/green deployment with Hydra on the same datastore as long as none of the three conditions below is met?

  1. the new version comes with breaking datamodel changes (dropping/renaming columns/tables, changing column definitions etc.)
  2. the new version handles the data stored in the existing columns differently
  3. the new version fills any newly added columns/tables with data during migration

And how would we know that a new version is safe for a simple blue/green deployment? Do we have to go through the migration scripts ourselves to figure that out?
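One way I can think of doing that ourselves is diffing the SQL migration files between two release tags (this assumes the migrations live as .sql files in the repo, and the tag names below are just examples):

```shell
# Hypothetical sketch: compare migration SQL between two Hydra releases.
# Pick the tags matching the versions you're upgrading between.
git clone https://github.com/ory/hydra.git && cd hydra

# An empty diff would suggest no datamodel changes between the versions;
# a non-empty one needs to be reviewed for locking/breaking changes.
git diff v1.0.0 v1.0.1 -- '*.sql'
```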

And in case any of the three conditions is met by a new version of Hydra: then what? Is the process I outlined above in the topic somewhat accurate, or is there a better way?

The fact that there haven’t been any changes causing long locks since version 1.0 is great, but not exactly a guarantee for the future :wink: And I need to know what I’m signing our company/dev/DevOps team up for when choosing Hydra as our OAuth implementation.

So, I hope you can find the time to elaborate a bit more on this topic

Paul

For clarification: While we won’t provide guidance on this ourselves at the moment, we obviously would not discourage community contributions in that regard. The major problem is that this highly depends on the db type and version used, on the deployment infrastructure, and so on. This is really down to the “ops” figuring out how to run these things in their specific environment.

Regarding your question, I believe almost no open source software that uses datastores with potential locks on certain update types has guidance on zero-downtime upgrades. At least I haven’t seen any, because, as I said, this is highly dependent on your env. In fact, many systems inflict serious downtimes when upgrading, I’ve heard a lot of terrible stories here.

In some cases, such as table rewrites or other operations that cause locks, downtime is unavoidable; that’s just the world we live in. For MySQL this is a bigger issue than for Postgres, but it exists for any datastore. We will obviously try to avoid it, but if we have to choose between a new key feature or bugfix and preserving zero-downtime upgrades, we will choose the former, because you will always have a maintenance window defined in your SLA. You can rest assured that we won’t do any table rewrites just for the fun of it. So that’s what you’re signing up for :wink:

Maybe for further clarification, we will try our best to document when changes require table locks. However, as this is specific to the DB type and version, your own due diligence is always necessary.

Hi Arekkas,

Looks like the community doesn’t have many pointers to give.

Regarding your comment that the major problem is that it highly depends on db type/version/deployment infrastructure etc., I agree that the details of any approach to zero-downtime deployment are highly dependent on those things.

However, I think that regardless of db vendor/version and deployment architecture, there could be some basic guidelines/pointers and answers. For example:

  1. Is it correct that when the new version of Hydra operates on the exact same datamodel, you can easily run the old and new version of Hydra in parallel connected to the same datastore?
  2. In case of (breaking) datamodel changes, is my assumption correct that you either have to incur some downtime, or figure out a strategy to copy your existing data over to a new database, run the new version of Hydra (including migration) against that new db, run the old and new versions in parallel (for a short time) connected to different databases, and work out yourself how to sync modifications made through the old version/db to the new version/db?

> we will try our best to document when changes require table locks

That would be great, but I’d also like to ask if you could state with each release whether the datamodel of the new version is compatible with the previous version (or list all versions it’s compatible with), so we’d know that we can ‘just’ run the previous and new version in parallel connected to the same datastore, achieving zero-downtime deployment.

Regards,
Paul

> Is it correct that when the new version of Hydra operates on the exact same datamodel, you can easily run the old and new version of Hydra in parallel connected to the same datastore?

Yes

> In case of (breaking) datamodel changes, is my assumption correct that you either have to incur some downtime, or figure out a strategy to copy your existing data over to a new database, run the new version of Hydra (including migration) against that new db, run the old and new versions in parallel (for a short time) connected to different databases, and work out yourself how to sync modifications made through the old version/db to the new version/db?

Unfortunately there is no simpler approach, so yes; and this will probably not change unless a “standard” way of approaching it emerges that is maintainable with reasonable effort.

Thank you for the answers!

Just noticed a formatting issue in my previous comment, due to which you might not have seen my last request, so I repeat it below:

> we will try our best to document when changes require table locks

That would be great, but I’d also like to ask if you could state with each release whether the datamodel of the new version is compatible with the previous version (or list all versions it’s compatible with), so we’d know that we can ‘just’ run the previous and new version in parallel connected to the same datastore, achieving zero-downtime deployment.

Unless the upgrade guide says that SQL migrations need to be run (which we always point out there), the datamodels are compatible. If we expect slow migrations, we will add a dedicated warning.