Feature · March 18, 2026 · 5 min read

OEC.sh Now Tracks 4 Server Health States — And Auto-Recovers

Most monitoring tools tell you a server is "online" or "offline" and nothing more. OEC.sh now uses four distinct health states with configurable thresholds, circuit breaker logic, and automatic recovery.

Beyond Online or Offline

Most platforms give you a binary status: the server is “online” or “offline.” That's it. A 2-second network hiccup triggers the same alert as a server that has been unreachable for an hour. Your phone buzzes at 2am, you log in, and the server is already back. The next time it buzzes, you ignore it — and miss a real outage.

OEC.sh replaces the binary model with four distinct states. Each state has configurable thresholds. Alerts fire only when the situation warrants human attention, and the system recovers automatically when the server comes back.

Four Health States

HEALTHY

All checks passing. Server operational.

The server is responding to every health check on schedule. All automated tasks (backups, deployments, replica sync) run normally.

DEGRADED

Some failures detected, below threshold. Early warning.

The server has failed 2 or more consecutive checks but has not yet crossed the disconnected threshold. This is an early warning — often caused by transient network issues that resolve on their own.

DISCONNECTED

Threshold reached. Active problem confirmed.

After 5 consecutive failures, the server is marked disconnected. The circuit breaker opens: automated tasks are paused to avoid wasting resources on a server that cannot respond. Alerts fire at this point.

DEAD

Extended outage. Human intervention needed.

After 20 consecutive failures, the server is marked dead. The probe continues running every 5 minutes, but at this point the issue is almost certainly not a network hiccup. Someone needs to investigate the server directly.

State transitions

Degradation path:

HEALTHY → DEGRADED (after 2 failures) → DISCONNECTED (after 5) → DEAD (after 20)

Recovery path:

Any state → HEALTHY after 2 consecutive successes
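
The transition rules above can be sketched as a small state machine. This is an illustrative sketch, not OEC.sh's actual implementation; the class and method names are hypothetical, and the threshold constants mirror the defaults described in this post (2 / 5 / 20 failures, 2 successes to recover).

```python
# Hypothetical sketch of the four-state health model. Names and structure
# are illustrative; only the threshold values come from the post's defaults.

DEGRADED_AFTER = 2       # consecutive failures before DEGRADED
DISCONNECTED_AFTER = 5   # consecutive failures before DISCONNECTED
DEAD_AFTER = 20          # consecutive failures before DEAD
RECOVERY_AFTER = 2       # consecutive successes before returning to HEALTHY

class ServerHealth:
    def __init__(self):
        self.state = "HEALTHY"
        self.consecutive_failures = 0
        self.consecutive_successes = 0

    def record_check(self, passed: bool) -> str:
        if passed:
            self.consecutive_failures = 0
            self.consecutive_successes += 1
            # Recovery path: any state returns to HEALTHY after
            # enough consecutive successes.
            if self.consecutive_successes >= RECOVERY_AFTER:
                self.state = "HEALTHY"
        else:
            self.consecutive_successes = 0
            self.consecutive_failures += 1
            # Degradation path: thresholds checked from worst to best.
            if self.consecutive_failures >= DEAD_AFTER:
                self.state = "DEAD"
            elif self.consecutive_failures >= DISCONNECTED_AFTER:
                self.state = "DISCONNECTED"
            elif self.consecutive_failures >= DEGRADED_AFTER:
                self.state = "DEGRADED"
        return self.state
```

Note that a single failure leaves the server in Healthy; only the second consecutive failure crosses into Degraded, which is what keeps one-off hiccups silent.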

The Circuit Breaker

The old behavior: when a server goes down, the platform keeps retrying backups, health checks, and replica sync against a server that cannot respond. This wastes resources, generates noise in your logs, and can cause cascading timeouts across your infrastructure.

The new behavior uses a circuit breaker pattern. When a server crosses the failure threshold and enters Disconnected status, the circuit opens. All automated tasks for that server are paused — no more failed backup attempts, no more timeout errors piling up.

How the circuit breaker works

1. Circuit closed (normal). Server is Healthy. All automated operations (backups, deployments, replica sync, monitoring) run on schedule.

2. Circuit opens (failure threshold reached). Server reaches Disconnected after 5 consecutive failures. Automated tasks are paused. A lightweight probe continues every 5 minutes to detect recovery.

3. Circuit half-open (probe detects response). The probe gets a successful response. The system waits for a second consecutive success to confirm the server is truly back.

4. Circuit closes (recovery confirmed). Two consecutive successful checks. Server returns to Healthy. All automated operations resume automatically. A recovery notification is sent.
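
The four steps above can be sketched as a minimal breaker that gates automated tasks. This is a hypothetical illustration of the pattern, not OEC.sh code; `allow_tasks` and `on_probe` are made-up names.

```python
# Minimal circuit-breaker sketch for the open/half-open/closed cycle
# described above. All names are illustrative; thresholds use the
# post's defaults (open after 5 failures, close after 2 successes).

class CircuitBreaker:
    def __init__(self, open_after: int = 5, close_after: int = 2):
        self.open_after = open_after    # failures before the circuit opens
        self.close_after = close_after  # successes before it closes again
        self.failures = 0
        self.successes = 0
        self.is_open = False

    def allow_tasks(self) -> bool:
        # Backups, deployments, and replica sync run only while closed.
        return not self.is_open

    def on_probe(self, ok: bool) -> None:
        if ok:
            self.failures = 0
            self.successes += 1
            # One success while open = "half-open": wait for confirmation.
            if self.is_open and self.successes >= self.close_after:
                self.is_open = False  # recovery confirmed; tasks resume
        else:
            self.successes = 0
            self.failures += 1
            if self.failures >= self.open_after:
                self.is_open = True   # pause automated tasks
```

The key design choice is that the probe keeps running while the circuit is open: the expensive work stops, but the cheap recovery check never does.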

What You See in the Dashboard

The server list and server detail views surface health information at the right level of detail.

Server list

Each server shows a colored status badge next to its name. Green dot for Healthy, yellow dot for Degraded, red dot for Disconnected, gray dot for Dead. You can scan the list and immediately see which servers need attention.


Server detail

The detail page shows the full monitoring context:

  • Last check time — when the most recent probe ran
  • Last success time — when the server last responded successfully
  • Consecutive failure counter — how many checks have failed in a row
  • Consecutive success counter — how many checks have passed in a row (during recovery)
  • "Check Now" button — trigger an immediate health check without waiting for the next scheduled probe

Configurable Per Organization

Every organization has different tolerance for risk and different infrastructure constraints. Settings > Monitoring lets you tune the monitoring behavior:

  • Check interval (default: 5 minutes). How often OEC.sh probes each server. Shorter intervals detect issues faster but increase network overhead.
  • Failure thresholds (default: 2 / 5 / 20). Consecutive failures needed to transition to Degraded, Disconnected, and Dead respectively.
  • Recovery confirmations (default: 2). Consecutive successful checks needed before a server is marked Healthy again.
  • Notification channels (default: email + in-app). Choose email, in-app notifications, or both. Notifications go to organization admins.
  • Alert states (default: Disconnected + Dead + Recovery). Which state transitions trigger alerts. By default, you are notified when a server reaches Disconnected, Dead, or recovers back to Healthy.
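
As a rough sketch, an organization's settings with the defaults above could be represented like this. The field names are hypothetical, chosen for illustration; they are not OEC.sh's actual configuration schema.

```python
# Hypothetical shape of per-organization monitoring settings, mirroring
# the defaults listed above. Field names are illustrative only.
monitoring_settings = {
    "check_interval_minutes": 5,
    "failure_thresholds": {
        "degraded": 2,
        "disconnected": 5,
        "dead": 20,
    },
    "recovery_confirmations": 2,
    "notification_channels": ["email", "in_app"],
    "alert_states": ["disconnected", "dead", "recovery"],
}
```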

Check Types

OEC.sh supports four check types. Choose based on what level of validation you need for each server:

SSH_PING

Lightweight SSH echo

Sends a minimal SSH command to verify the server is reachable and the SSH daemon is responding. Fastest check type with the least overhead.

NETDATA_API

HTTP check to Netdata

Makes an HTTP request to the Netdata monitoring agent running on the server. Verifies both network connectivity and that the monitoring stack is operational.

BOTH

Both must pass

Runs SSH_PING and NETDATA_API. The check only passes if both succeed. Use this when you want maximum confidence that the server is fully operational.

ANY

Either passing is sufficient

Runs SSH_PING and NETDATA_API. The check passes if at least one succeeds. Use this for servers where one check method may intermittently fail without indicating a real problem.
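
The combination logic for the four check types boils down to a few lines. This sketch assumes the two underlying probe results are already available as booleans; the function name is made up for illustration.

```python
# Sketch of how the four check types combine the two probe results.
# "ssh_ok" and "netdata_ok" stand in for the real SSH_PING and
# NETDATA_API probe outcomes; "evaluate" is a hypothetical name.

def evaluate(check_type: str, ssh_ok: bool, netdata_ok: bool) -> bool:
    if check_type == "SSH_PING":
        return ssh_ok
    if check_type == "NETDATA_API":
        return netdata_ok
    if check_type == "BOTH":
        # Maximum confidence: both probes must succeed.
        return ssh_ok and netdata_ok
    if check_type == "ANY":
        # Tolerant mode: one passing probe is enough.
        return ssh_ok or netdata_ok
    raise ValueError(f"unknown check type: {check_type}")
```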

Deploy with Built-in Monitoring

Every server you connect to OEC.sh gets four-state health monitoring, circuit breaker protection, and automatic recovery out of the box. No agents to install, no configuration required.

Frequently Asked Questions

How often does OEC.sh check server health?

By default, OEC.sh checks every 5 minutes. You can configure the interval per organization from Settings > Monitoring. Shorter intervals mean faster detection but slightly more network overhead. Most teams find 5 minutes strikes the right balance between responsiveness and resource usage.

Will I get alerts for brief network hiccups?

No. A single failed check leaves the server in Healthy status, and even the second consecutive failure only moves it to Degraded, which does not trigger alerts by default. Alerts fire only when the server reaches Disconnected (after 5 consecutive failures) or Dead (after 20 consecutive failures). This means a brief network hiccup that resolves within a few minutes will never wake you up at 2am.

Can I customize monitoring thresholds?

Yes. In Settings > Monitoring, you can configure the failure thresholds for each state transition (Degraded, Disconnected, Dead), the number of consecutive successes needed for recovery, the check interval, and which states trigger notifications. All settings apply per organization.

Does monitoring work on BYOS (bring your own server)?

Yes. Server health monitoring works on every server connected to OEC.sh, whether provisioned through a cloud provider integration or added as a BYOS server. The check types (SSH_PING, NETDATA_API, BOTH, ANY) are available regardless of how the server was connected.