Engineering · May 7, 2026 · 12 min read

How OEC.sh Achieves Zero-Downtime Odoo Deploys

If your Odoo restarts during an upgrade, you have downtime. Most managed platforms still do this. We don't. Here is how the platform keeps your workers serving traffic while new code rolls out, what happens when a deploy fails, and the boring infrastructure choices that prevent weird outages.

The problem with restart-based deploys

The traditional Odoo deploy flow looks roughly like this. Pull new code from git, run any pending module upgrades, restart the Odoo service. Anywhere from 30 to 90 seconds later your workers are back up. During that window your reverse proxy returns 502 Bad Gateway to every request. Sessions reset. CSRF tokens become invalid. In-flight POST requests fail with no clean way to retry, which is especially painful for payment confirmations, inventory moves, and webhook receivers.

For a small internal tool with five users, this is annoying. For a 24/7 operation with a warehouse picking flow, an ecommerce checkout, or an integration receiving webhooks from Stripe and Shopify, this is a daily incident waiting to happen. We have seen customers stack cron jobs at 3am on Sundays just to deploy small fixes, then spend the rest of the week gun-shy about touching production.

Restart-based deploys also make rollbacks ugly. If the new code panics on startup, you now have a longer outage while you figure out whether to revert the commit, restore from backup, or roll forward with a hotfix. The pressure scales with the length of the outage and so does the chance of making it worse.

Most managed Odoo platforms still use this model. Odoo.sh restarts. Self-hosted setups running systemctl restart odoo restart. The simple Docker Compose recipes you find on GitHub restart. We picked a different approach because we wanted customers to ship multiple times a day without scheduling a maintenance window every time.

Blue-green deploys, explained

The platform runs two pools of Odoo workers per environment. Call them blue and green. At any given time only one pool serves traffic. The other pool sits idle or, more often, does not exist yet. When you trigger a deploy, the platform spins up a fresh pool alongside the live one, waits for it to pass health checks, then flips the load balancer.

Concretely, here is what happens on a deploy of an environment running 4 Odoo workers.

  1. Blue pool is live with workers web-blue-1 through web-blue-4. All traffic flows through them.
  2. The platform starts web-green-1 through web-green-4 with the new code, the new dependencies, and any updated environment variables. Blue keeps serving.
  3. Each green container exposes a /ready probe. The probe response includes the container hostname so the load balancer can verify it is talking to the new pool, not accidentally hitting a stale blue worker through some DNS quirk.
  4. The probe runs the same checks Odoo runs at startup, plus a database connection check, plus a confirmation that all required modules loaded without error. If anything fails the container does not become ready and the deploy stalls before any traffic switches.
  5. Once all 4 green workers are ready, the load balancer flips. New connections go to green. Existing in-flight requests keep flowing through blue until they complete.
  6. Blue workers enter drain mode. They stop accepting new connections, finish what they are working on, then terminate. Drain timeout is configurable. The default is 60 seconds, which covers normal Odoo requests. Long-running exports get a longer window.
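The six steps above reduce to a small orchestration loop. This is a minimal sketch, not the platform's actual code: LoadBalancer, start_pool, wait_ready, and drain are hypothetical stand-ins for the real container and routing machinery.

```python
class LoadBalancer:
    """Stand-in for the real load balancer: tracks which pool gets new connections."""
    def __init__(self, active_pool):
        self.active_pool = active_pool

def blue_green_deploy(lb, start_pool, wait_ready, drain):
    blue = lb.active_pool
    green = start_pool("green")        # new code and deps; blue keeps serving
    if not wait_ready(green):          # readiness probes gate the flip
        drain(green)                   # tear down the failed pool
        return False                   # traffic never moved off blue
    lb.active_pool = green             # flip: new connections go to green
    drain(blue)                        # blue finishes in-flight work, then parks
    return True

# Stubbed usage: a deploy whose green pool passes its probes.
lb = LoadBalancer("blue")
ok = blue_green_deploy(lb, start_pool=lambda name: name,
                       wait_ready=lambda pool: True, drain=lambda pool: None)
```

Note that the failure path never touches `lb.active_pool`, which is the property that makes a failed deploy invisible to users.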

From the user's perspective, nothing happens. They click a button, the request goes to a blue worker, the response comes back, the next click goes to a green worker, the response comes back. Sessions stay valid because session storage lives in PostgreSQL or Redis, not in worker memory. The only signal something changed is that the new feature is suddenly there.

Rollback is the same flip in reverse. If the green pool starts logging errors, you click rollback and the load balancer points back at blue. The blue pool was never killed, just drained, so the rollback is instant. We keep the previous pool around for a configurable window, default 15 minutes, after which it gets reaped to free up RAM.

The cost of this approach is RAM. During a deploy you are running two pools at once, so peak memory roughly doubles. We sized the default Odoo memory floors with this in mind, and the runtime resource update mechanism in the next section means you can grow into a bigger box without restarting if you outgrow the original sizing.

Rolling deploys for the platform itself

The OEC.sh control plane (the part that manages your servers, runs deploys, ships notifications, and serves the dashboard) also uses rolling deploys. We do not turn off the platform to update the platform. There is no maintenance window banner that goes up at 2am UTC on Sundays.

This matters more than it sounds. If your monitoring depends on the OEC.sh API being up to fetch health checks, and we go down for 5 minutes, your monitoring goes blind for 5 minutes. If your CI pipeline triggers a deploy via webhook and the webhook endpoint is offline, your pipeline fails and your engineers find out about it on Slack. If your dashboard is the source of truth for what version is in production and the dashboard is down, you make decisions based on stale memory.

Our backend rolls one container at a time: six replicas across three regions, with request routing draining the outgoing instance for 30 seconds while its replacement warms up. The frontend uses Cloudflare Pages, which is already a rolling deploy by design. New visitors get the new bundle. Active sessions keep the old bundle until they reload, at which point service workers swap them in.
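The one-at-a-time pattern is simple enough to sketch. Here start_new and drain_old are hypothetical hooks standing in for the real scheduler; the point is that capacity never drops below replicas minus one.

```python
def rolling_deploy(replicas, start_new, drain_old, drain_seconds=30):
    """Replace one replica at a time so capacity never drops to zero."""
    updated = []
    for old in replicas:
        new = start_new(old)            # warm up the replacement first
        drain_old(old, drain_seconds)   # let in-flight requests finish
        updated.append(new)
    return updated

# Stubbed usage across six replicas.
drained = []
result = rolling_deploy(
    [f"api-{i}" for i in range(1, 7)],
    start_new=lambda old: old + "-v2",
    drain_old=lambda old, secs: drained.append(old),
)
```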

The practical result is that we ship platform changes during the work week, in the middle of the day, multiple times a day, without anyone noticing. No status page incident, no email, no warning banner. The change just appears in the changelog at the end of the day.

Runtime resource changes without restart

Here is a scenario that used to mean downtime. A customer realizes their Odoo instance is running close to the memory limit during end-of-month reporting. They want to bump RAM from 2GB to 4GB. The textbook approach is to stop the container, edit the Docker Compose file or Kubernetes manifest, restart the container with the new limit. That is downtime, even if short.

We use docker update. The command takes a running container and applies new resource constraints to its cgroup without stopping it. Memory limit, CPU shares, CPU quota, blkio weight, all of it can be changed in place. The container keeps serving while the kernel reapplies the new limits. From the workload's perspective, the only thing that happened is that it now has more memory available.

In the dashboard this shows up as a slider. You drag, you confirm, the change applies in under a second. We log the previous and new values in the audit log so you can correlate with any performance changes after the fact.

There are limits. You cannot shrink memory below current usage without OOM-killing something, so the platform refuses that operation if current consumption is too close to the new ceiling. You also cannot change networking or volume mounts at runtime, since those require a container recreate. For everything CPU and memory related, runtime updates work.
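That shrink guard can be sketched as follows. The docker update flags are real; the 256MB safety margin and the helper shape are illustrative, and the real platform reads live usage from the container's cgroup rather than taking it as an argument.

```python
import subprocess

SAFETY_MARGIN = 256 * 1024**2  # bytes; refuse shrinks within 256MB of live usage

def can_resize(current_usage, new_limit):
    """Pure guard: shrinking below usage plus margin invites the OOM killer."""
    return new_limit >= current_usage + SAFETY_MARGIN

def resize_memory(container, current_usage, new_limit):
    if not can_resize(current_usage, new_limit):
        raise ValueError("new limit too close to current usage; refusing shrink")
    # docker update applies the new cgroup limit in place, no restart.
    # --memory-swap is kept equal to --memory so the container gets no extra swap.
    subprocess.run(
        ["docker", "update",
         f"--memory={new_limit}", f"--memory-swap={new_limit}", container],
        check=True,
    )
```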

Deploy pipeline resilience

The deploy itself is a pipeline of shell commands, git operations, container builds, and health checks. Every step can fail in interesting ways. A naive pipeline that aborts on the first error spends a lot of time half-deployed when the real cause was a transient network blip.

We built a wrapper called aexec_safe that every shell command in the deploy pipeline goes through. It does three things. First, it runs the command with a timeout that escalates on retry, so a slow command gets more time the second try. Second, it captures stdout, stderr, and exit code into the audit log so you can see exactly what happened. Third, it knows the difference between transient and permanent failures. SIGPIPE on a curl to GitHub gets retried. Exit 1 with stderr saying "module not found" does not.
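In outline, the wrapper looks something like this. The marker list and the linear timeout escalation are assumptions for illustration; the real classifier is more nuanced.

```python
import asyncio

# Substrings that suggest a transient failure worth retrying (illustrative list).
TRANSIENT_MARKERS = ("connection reset", "timed out", "sigpipe", "500 internal")

async def aexec_safe(cmd, base_timeout=30, retries=3):
    record = (-1, "", "timed out on every attempt")
    for attempt in range(1, retries + 1):
        timeout = base_timeout * attempt      # a slow command gets more time next try
        proc = await asyncio.create_subprocess_shell(
            cmd, stdout=asyncio.subprocess.PIPE, stderr=asyncio.subprocess.PIPE)
        try:
            out, err = await asyncio.wait_for(proc.communicate(), timeout)
        except asyncio.TimeoutError:
            proc.kill()
            await proc.wait()
            continue                          # a timeout is always worth a retry
        record = (proc.returncode, out.decode(), err.decode())  # audit log material
        if proc.returncode == 0:
            return record
        if not any(m in record[2].lower() for m in TRANSIENT_MARKERS):
            return record                     # permanent failure: fail fast
    return record
```

The key design choice is returning the captured output in every path, success or failure, so the audit log always has something to show.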

Timeouts are dynamic, sized to the work. Cloning a 10MB addon repo gets 30 seconds. Cloning a 2GB enterprise repo with submodules gets 5 minutes. The platform inspects the git remote to estimate size before picking a timeout, so small repos do not wait forever and big repos do not get killed midway. Pulling a Docker image works the same way, sized by the image manifest.
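A heuristic in that spirit might look like this. The rate and bounds are assumptions fitted to the examples above (roughly 30 seconds for a 10MB repo, a few minutes for a 2GB one), not the platform's actual constants.

```python
def clone_timeout(repo_size_mb):
    """Scale the git clone timeout with repo size: 30s floor, 10 min cap."""
    seconds = 30 + repo_size_mb * 0.15   # ~0.15s per MB, an illustrative rate
    return min(max(seconds, 30), 600)
```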

Git operations specifically get extra retry logic. GitHub returns transient 500s often enough that any pipeline running for years will hit one eventually. Self-hosted GitLab instances on overloaded boxes hiccup more. Bitbucket has occasional clone failures during their maintenance windows. We retry git operations up to 3 times with exponential backoff, and we report the underlying error each time so you know whether it was a network thing or a real problem.
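A minimal sketch of that retry loop, with the runner injected so it can be exercised without a network (the function name and delay constants are illustrative):

```python
import subprocess
import time

def git_with_retry(args, attempts=3, base_delay=2.0, run=subprocess.run):
    """Retry a git command with exponential backoff, reporting each failure."""
    for attempt in range(attempts):
        result = run(["git", *args], capture_output=True, text=True)
        if result.returncode == 0:
            return result
        # Surface the underlying error each time: network blip vs. real problem.
        print(f"git {args[0]} failed (attempt {attempt + 1}): {result.stderr.strip()}")
        if attempt < attempts - 1:
            time.sleep(base_delay * 2 ** attempt)   # 2s, 4s, 8s...
    return result
```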

Health checks have backoff too. A container that takes 90 seconds to load its first request is normal Odoo behavior with a lot of modules. Hammering /web/health at 1 request per second during that window just adds load. We start at 5-second intervals, back off to 15 seconds if the container is not ready yet, and only fail the deploy if the container has not become ready within its full timeout window.
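That polling schedule can be sketched with the clock and sleep injected, so the schedule itself is testable. The probe callable and the parameter names are illustrative; the 5s/15s intervals come from the text.

```python
import time

def wait_until_ready(probe, timeout=300, fast_interval=5, slow_interval=15,
                     clock=time.monotonic, sleep=time.sleep):
    """Poll a readiness probe: short intervals at first, backing off after the
    first miss, failing only when the full timeout window is exhausted."""
    deadline = clock() + timeout
    interval = fast_interval
    while clock() < deadline:
        if probe():
            return True
        sleep(interval)
        interval = slow_interval       # back off: the container is still booting
    return False
```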

The boring stuff that prevents weird outages

Most outages are not caused by the dramatic things. They are caused by silent misconfigurations that compound over time. A few examples of the unglamorous defaults we set so customers do not have to think about them.

Version-aware memory floors. Odoo 14 boots on 1GB. Odoo 18 needs at least 1.5GB before the OOM killer starts visiting. Odoo 19 with the new web framework wants 2GB minimum. The platform sets min_memory based on the Odoo major version detected in your repo. If you try to size below the floor, the dashboard warns you. If you bypass the warning, the platform still pads by 256MB so you do not get a container that boots once and then dies on the first heavy request.
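The lookup is a small table plus a pad. Only the three floors named above come from the text; the default for other versions is an assumption here.

```python
PAD_MB = 256  # padding applied even when a user bypasses the floor warning

MEMORY_FLOORS_MB = {14: 1024, 18: 1536, 19: 2048}  # from the versions named above

def effective_memory_mb(odoo_major, requested_mb):
    """Warn-then-pad policy: a bypassed floor still gets 256MB of headroom."""
    floor = MEMORY_FLOORS_MB.get(odoo_major, 1024)  # default floor is an assumption
    if requested_mb < floor:
        return requested_mb + PAD_MB   # user bypassed the warning: pad, don't block
    return requested_mb
```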

queue_job auto-detection. If your repo has the OCA queue_job module checked in, the platform notices and adds server_wide_modules = web,queue_job to your odoo.conf automatically. Otherwise queue_job is installed but never loaded, and customers spend hours wondering why their async jobs sit in "pending" forever. We saw this support ticket enough times that we just made it automatic.
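The detection amounts to a directory scan plus a config edit. A rough sketch, with hypothetical names; the real implementation parses odoo.conf properly instead of appending to it.

```python
from pathlib import Path

def ensure_queue_job(repo_root, conf_path):
    """If the OCA queue_job module is in the repo, make sure odoo.conf loads it
    server-wide; otherwise its jobrunner never starts and jobs sit in pending."""
    has_module = any(p.is_dir() for p in Path(repo_root).rglob("queue_job"))
    if not has_module:
        return False
    conf = Path(conf_path)
    text = conf.read_text() if conf.exists() else "[options]\n"
    if "queue_job" not in text:
        text += "server_wide_modules = web,queue_job\n"
        conf.write_text(text)
    return True
```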

Inotify watch limits. Linux has a per-user limit on how many file watches inotify can register. Odoo's dev mode auto-reload uses inotify. Hit the limit and new files stop triggering reloads, but no error appears anywhere visible. Code changes silently stop applying. The platform raises the limit on container start so this does not happen.
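The fix is a one-line sysctl write at container start. The sysctl path is the real Linux interface; the target value here is a commonly used raised limit, not necessarily the platform's default.

```python
from pathlib import Path

WATCH_LIMIT = 524288  # a commonly raised value; the platform's default is an assumption
SYSCTL = Path("/proc/sys/fs/inotify/max_user_watches")

def raise_inotify_limit(sysctl=SYSCTL, target=WATCH_LIMIT):
    """Raise the per-user inotify watch limit if it is below target.
    Requires privilege; returns True if a change was made."""
    current = int(sysctl.read_text())
    if current < target:
        sysctl.write_text(str(target))
        return True
    return False
```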

Pre-flight checks before deploys. Before any deploy starts, the platform verifies the external port is reachable from the internet, that SSH pubkey auth still works, and that the firewall has the rules it needs. If any of those fail, the deploy aborts with a clear error before touching production. Better to fail fast than to start a deploy and discover halfway through that port 22 was firewalled off this morning during a security review.
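The reachability part of those checks is a plain TCP connect with a timeout. A sketch of that one check; the SSH and firewall verifications are separate and not shown.

```python
import socket

def port_reachable(host, port, timeout=5.0):
    """Pre-flight sketch: verify a TCP port answers before the deploy starts."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def preflight(host, ports=(22, 443)):
    """Fail fast with a clear error before the deploy touches production."""
    failures = [p for p in ports if not port_reachable(host, p)]
    if failures:
        raise RuntimeError(f"pre-flight failed: ports {failures} unreachable on {host}")
```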

High-priority queue. Deployments and server provisioning use a separate worker queue from background jobs like nightly backups and metric aggregation. A backup of a 50GB database does not block a customer waiting to deploy. The deploy worker pool is always ready.

What happens when a deploy fails?

Deploys do fail. The most common cause is a Python import error in newly added code. Second most common is a database migration that hits a schema constraint. Third is a runaway memory spike during module install. The platform handles each gracefully because the blue pool is still serving traffic the entire time the green pool is being built.

When a green container fails its readiness probe, it does not get added to the load balancer. After the configured timeout, the deploy is marked failed. All green containers are stopped and removed. Blue keeps serving. The audit log captures the full output of the failed step, including stderr, exit code, and which check failed.

In the dashboard this shows up as a clear failure card with a one-click retry button. You see what failed, you see what to fix, you push the fix, you click retry. The previous version stays live the entire time. There is no scramble, no incident channel, no explanation to the team about why the site was down for 8 minutes.

We also do not auto-retry failed deploys. If a deploy fails because of a real bug, retrying it will just fail again, and a retry loop on a broken deploy can mask the actual problem. You see the failure, you decide what to do.

How does Odoo.sh compare?

Odoo.sh restarts. When you push to your production branch, the platform pulls the new code, runs migrations if needed, and restarts the Odoo service. During the restart you see 502s. Most users do not notice because Odoo.sh deploys are fast on small instances, but the downtime is real and it grows with module count and database size.

You also do not pick the deploy strategy on Odoo.sh. There is one mode, take it or leave it. You cannot run blue-green for a high-traffic ecommerce store and rolling for a low-traffic internal tool. You cannot apply a runtime memory increase without a redeploy. You cannot configure drain timeouts or readiness probe behavior.

For the full breakdown including pricing, control, and what you can self-host, see our OEC.sh vs Odoo.sh comparison.

TL;DR

  • Blue-green pools mean old workers serve traffic while new workers spin up. Traffic flips after the new pool passes health checks.
  • The platform itself uses rolling deploys, so your dashboard, monitoring, and automations stay online when we ship.
  • Runtime resource changes via docker update mean RAM and CPU bumps without restarts.
  • The deploy pipeline retries transient failures, sizes timeouts to the work, and fails fast on real bugs.
  • Version-aware memory floors, queue_job auto-detection, and pre-flight checks prevent the silent misconfigurations that cause weird outages.
  • If a deploy fails, the previous version keeps serving. You fix it and retry on your schedule.

Deploy Odoo Without the Maintenance Window

The Free plan includes 1 server, 3 environments, and the same blue-green deploy engine the bigger plans use. Starter at $19/mo adds more servers and environments. Pro at $39/mo unlocks unlimited everything. See pricing or wire it into your existing CI with integrations.