
SessionDO & Sandbox State Machines

Two cooperating state machines drive the agent runtime. SessionDO owns conversation state and the agent loop; OmaSandbox (the container DO) owns the workspace and exec surface. They communicate over RPC and have independent lifecycles — knowing which one is in which state is the fastest way to debug “why didn’t my session resume.”

Three states. Status is derived, not stored — there’s no state.status field for code paths to mutate inconsistently.

                          +-----------+
first /init               |   idle    |
------------------------> |           | <----------------+
                          +-----------+                  |
                                |                        |
user.message arrives,           |                        |
drainEventQueue starts          |                        |
runFiber INSERT                 v                        |
                          +-----------+                  |
                          |  running  |                  |
                          |           |                  |
                          +-----------+                  |
                                |                        |
runFiber DELETE                 |                        |
(turn done OR aborted           +------------------------+
 OR onFiberRecovered)
                                |
/destroy                        |
                                v
                         +-------------+
                         | terminated  |  (sticks; rejects all events)
                         +-------------+

deriveStatus(): "idle" | "running" | "terminated" {
  if (state.terminated_at != null
      || state.status === "terminated") return "terminated";  // legacy back-compat
  if (cf_agents_runs has any row) return "running";
  return "idle";
}
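
The derivation above is pseudocode. A runnable sketch of the same logic might look like the following, assuming a hypothetical `SessionState` shape and a `runRowCount` argument standing in for a row count over cf_agents_runs (neither name comes from the actual codebase):

```typescript
type SessionStatus = "idle" | "running" | "terminated";

// Hypothetical shape of the persisted state; only the fields the derivation reads.
interface SessionState {
  terminated_at?: number | null;
  status?: string; // legacy field, read only for back-compat
}

// runRowCount stands in for the number of rows currently in cf_agents_runs.
function deriveStatus(state: SessionState, runRowCount: number): SessionStatus {
  if (state.terminated_at != null || state.status === "terminated") {
    return "terminated";
  }
  return runRowCount > 0 ? "running" : "idle";
}
```

Because the function only reads, no code path can leave the status out of sync with the fiber table.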
| Trigger | What changes in storage | Resulting derived status |
| --- | --- | --- |
| POST /init | cf_agents_state row written, terminated_at: null | idle |
| POST /event user.message arrives | events row appended. If drain reaches runFiber, it INSERTs a row into cf_agents_runs | idle → running |
| Turn finishes cleanly | runFiber.finally DELETEs the cf_agents_runs row + appends session.status_idle event | running → idle |
| Turn errors out | session.error event appended; runFiber.finally DELETEs row | running → idle |
| DO evicted mid-turn | Nothing; eviction does no writes | derived value continues to read running (the row is still there) |
| _checkRunFibers finds orphan | DELETEs the row, then calls onFiberRecovered | running → idle until recovery’s own runFiber re-INSERTs |
| recoverAgentTurn re-issues turn | Fresh runFiber INSERTs new row | idle → running |
| 5 recoveries exceeded | forceIdle callback emits session.status_idle event; recovery counter cleared | running → idle |
| user.interrupt | currentAbortController.abort() fires; runFiber’s try catches → finally DELETEs row | running → idle |
| DELETE /destroy | setState({...state, terminated_at: Date.now()}); sandbox.snapshotWorkspaceNow + sandbox.destroy | any → terminated |

Earlier the code wrote setState({status: "running"}) before runFiber INSERTed its row into cf_agents_runs. Those writes were separated by ≥3 await boundaries. An eviction landing in that window left state.status="running" with no orphan fiber to recover, so _checkRunFibers had nothing to do, drainEventQueue saw “running” and returned, and the session deadlocked until a manual user.interrupt. With derivation the divergence is impossible — the fiber row is the status.
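
To illustrate why the fiber row can serve as the status, here is a minimal sketch of the INSERT/DELETE bracket. It uses an in-memory `Set` as a stand-in for cf_agents_runs and a synchronous `runFiber` for brevity; both are assumptions for illustration, and the real turn is asynchronous.

```typescript
// In-memory stand-in for the cf_agents_runs table (illustrative only).
const runs = new Set<string>();

function derivedStatus(): "idle" | "running" {
  return runs.size > 0 ? "running" : "idle";
}

// Synchronous for brevity; the real turn is asynchronous.
function runFiber<T>(id: string, turn: () => T): T {
  runs.add(id); // INSERT: status reads "running" from this point on
  try {
    return turn();
  } finally {
    runs.delete(id); // DELETE: runs on success, error, or abort
  }
}

const seen: string[] = [];
runFiber("turn:1", () => {
  seen.push(derivedStatus()); // observed while the row exists
});
try {
  runFiber("turn:2", () => {
    throw new Error("tool failed");
  });
} catch {
  // The error propagates, but the row has already been deleted.
}
seen.push(derivedStatus()); // observed after both turns
```

There is no second write that could land on the wrong side of an eviction: the only durable fact is whether the row exists.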

A SessionDO that’s been evicted resumes when any of these arrive:

  • HTTP fetch into the DO (user message, console events poll, verify exec)
  • Alarm dispatch from cf_agents_schedules (next due time)
  • WebSocket frame from a hibernated client
  • Direct RPC from another worker

On wake the constructor reloads state from SQL. The alarm path also runs _checkRunFibers on every fire — that’s the recovery hook.
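
The recovery hook can be sketched roughly as follows. This is an approximation built from the table rows above, not the actual implementation: `SessionStore`, `liveFibers`, `MAX_RECOVERIES`, and the outcome names are hypothetical, and treating "5 recoveries exceeded" as "the sixth attempt forces idle" is an assumption.

```typescript
const MAX_RECOVERIES = 5; // cap from the table; the constant name is hypothetical

interface SessionStore {
  runs: Set<string>;     // stand-in for cf_agents_runs
  recoveryCount: number; // stand-in for the persisted recovery counter
  events: string[];      // stand-in for the events log
}

type RecoveryOutcome = "nothing-to-do" | "reissue-turn" | "forced-idle";

// liveFibers: ids of fibers actually executing in this isolate.
function checkRunFibers(store: SessionStore, liveFibers: Set<string>): RecoveryOutcome {
  const orphans = Array.from(store.runs).filter((id) => !liveFibers.has(id));
  if (orphans.length === 0) return "nothing-to-do";

  // DELETE the orphaned rows first: derived status drops to "idle".
  for (const id of orphans) store.runs.delete(id);

  store.recoveryCount += 1;
  if (store.recoveryCount > MAX_RECOVERIES) {
    // Give up: emit the idle event and clear the counter.
    store.events.push("session.status_idle");
    store.recoveryCount = 0;
    return "forced-idle";
  }
  return "reissue-turn"; // caller starts a fresh runFiber, which re-INSERTs a row
}
```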

OmaSandbox is the container Durable Object (extends Sandbox from @cloudflare/sandbox). Its lifecycle is managed by Cloudflare, not by SessionDO directly — SessionDO just makes RPC calls (exec, readFile, setOutboundHandler, etc.) and the platform handles boot/sleep/destroy.

                +-----------+
first RPC       |   cold    |
------------->  |    (no    |
                | container)|
                +-----------+
                      |
spawn                 | container boot
container             v
              +---------------+
              |  warming-up   |
              | - PID 1 boot  |
              | - 5s cert     |
              |   polling     |
              +---------------+
                      |
                      | trustRuntimeCert resolves (cert
                      | pushed within 5s by setOutboundHandler)
                      v
              +---------------+
SessionDO     |    active     | <-----------------------+
RPCs          | /workspace    |                         | renewActivityTimeout
hit it,       | populated by  |                         | (each RPC, plus our
every RPC     | restore or    |                         | alarm-driven keepalive
resets        | empty         |                         | while bg tasks live)
the idle      +---------------+                         |
timer                 |                                 |
                      |                                 |
no RPC for            |                                 |
sleepAfter            v                                 |
                      | onActivityExpired() override:   |
                      |   1. snapshotWorkspaceNow       |
                      |      (createBackup → R2 +       |
                      |       recordBackup → D1)        |
                      |   2. super.onActivityExpired()  |
                      |      -> super.stop()
                      v
              +---------------+
              |   stopping    |
              +---------------+
                      |
                      | onStop({exitCode: 0, reason: "exit"})
                      v
                +-----------+
                |   cold    |
                +-----------+

When interceptHttps = true, the container’s PID 1 polls /etc/cloudflare/certs/cloudflare-containers-ca.crt for 5 seconds at startup. The cert is only pushed by the Cloudflare platform once SessionDO calls sandbox.setOutboundHandler(...). If you skip that call, the cert never lands and PID 1 exits with “Certificate not found.” This is the bisected behavior on @cloudflare/sandbox versions prior to the fix; track upstream changes in the sandbox SDK release notes.

The warmup sequence in SessionDO’s warmUpSandbox:

  1. Probe sandbox.exec("true") until container responds (10 attempts).
  2. Probe cat /tmp/.oma-warm — if a non-empty marker exists, the container’s /workspace is already populated; skip restore.
  3. Otherwise, look up the latest backup in D1 by (tenant, env, session) and call sandbox.restoreWorkspaceBackup if found.
  4. Mount any memory_store / env / github_repository resources.
  5. Call sandbox.setOutboundContext({tenantId, sessionId}) — triggers cert push and registers the vault-injection handler.
  6. Call sandbox.setBackupContext({tenantId, environmentId, sessionId}) — gives OmaSandbox the tuple it needs at sleepAfter time.
  7. Write a fresh /tmp/.oma-warm marker so future warmups can detect that this container is still alive.
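
The steps above can be sketched as a single function. This is a simplified illustration under stated assumptions: `SandboxLike` and `WarmupContext` are hypothetical interfaces (the real RPC types may differ), `findLatestBackup` stands in for the D1 lookup, the marker content is arbitrary, and retry delays and resource mounts (step 4) are omitted.

```typescript
// Hypothetical minimal surface of the sandbox stub; the real RPC types may differ.
interface SandboxLike {
  exec(cmd: string): Promise<{ success: boolean; stdout: string }>;
  restoreWorkspaceBackup(handle: string): Promise<void>;
  setOutboundContext(ctx: { tenantId: string; sessionId: string }): Promise<void>;
  setBackupContext(ctx: { tenantId: string; environmentId: string; sessionId: string }): Promise<void>;
}

interface WarmupContext {
  tenantId: string;
  environmentId: string;
  sessionId: string;
  // Stand-in for the D1 lookup by (tenant, env, session); resolves to a handle or null.
  findLatestBackup(): Promise<string | null>;
}

type WarmupOutcome = "reused" | "restored" | "empty";

async function warmUpSandbox(sandbox: SandboxLike, ctx: WarmupContext): Promise<WarmupOutcome> {
  // 1. Wait for the container to answer (10 attempts, no backoff in this sketch).
  let alive = false;
  for (let attempt = 0; attempt < 10 && !alive; attempt++) {
    alive = await sandbox.exec("true").then((r) => r.success, () => false);
  }
  if (!alive) throw new Error("sandbox did not respond after 10 attempts");

  // 2. A non-empty marker means /workspace is already populated.
  const marker = await sandbox.exec("cat /tmp/.oma-warm");
  let outcome: WarmupOutcome = "reused";
  if (!marker.success || marker.stdout.trim() === "") {
    // 3. Otherwise restore the latest backup, if one exists.
    const handle = await ctx.findLatestBackup();
    if (handle !== null) await sandbox.restoreWorkspaceBackup(handle);
    outcome = handle !== null ? "restored" : "empty";
  }

  // 5. Triggers the cert push and registers the vault-injection handler.
  await sandbox.setOutboundContext({ tenantId: ctx.tenantId, sessionId: ctx.sessionId });
  // 6. Gives the container the tuple it needs when sleepAfter fires.
  await sandbox.setBackupContext({
    tenantId: ctx.tenantId,
    environmentId: ctx.environmentId,
    sessionId: ctx.sessionId,
  });
  // 7. Fresh marker so later warmups can detect a live container.
  await sandbox.exec(`echo ${Date.now()} > /tmp/.oma-warm`);
  return outcome;
}
```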
container's libcurl issues HTTPS request
  -> CF Sandbox MITM
  -> outboundByHost lookup (static):
       "*.r2.cloudflarestorage.com" -> raw fetch (passthrough)
       other host                   -> catch-all handler
                                         -> injectVaultCredsHandler
                                         -> RPC main worker
                                              lookupOutboundCredential
                                         -> Authorization injected
                                              (or null = passthrough)
                                         -> upstream fetch

R2 traffic must bypass the catch-all because the materialize-and-re-PUT path corrupts the squashfs blob — cloudflare/sandbox-sdk#619.
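
A minimal sketch of the static host lookup, assuming a simple wildcard rule where `*.example.com` matches any subdomain. The `Route` type, `matchesHost`, and `routeFor` are hypothetical names for illustration; the SDK's actual matching rules may differ.

```typescript
type Route = "passthrough" | "inject-credentials";

// Static table mirroring the diagram: R2 bypasses the catch-all.
const outboundByHost: Array<[string, Route]> = [
  ["*.r2.cloudflarestorage.com", "passthrough"],
];

// Hypothetical matcher: "*.example.com" matches any subdomain; otherwise exact match.
function matchesHost(pattern: string, host: string): boolean {
  if (pattern.startsWith("*.")) {
    return host.endsWith(pattern.slice(1)); // compare against ".example.com"
  }
  return host === pattern;
}

function routeFor(host: string): Route {
  for (const [pattern, route] of outboundByHost) {
    if (matchesHost(pattern, host)) return route;
  }
  return "inject-credentials"; // catch-all handler
}
```

For example, `routeFor("acct.r2.cloudflarestorage.com")` returns `"passthrough"`, while `routeFor("api.github.com")` falls through to the credential-injecting catch-all.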

The 5-minute idle timer (sleepAfter = "5m") fires onActivityExpired. We override it to call snapshotWorkspaceNow first, because that is the last point in the idle path where /workspace is guaranteed to still exist before the SDK tears the container down. The snapshot writes to R2 (BACKUP_BUCKET) and records the handle in the workspace_backups table in D1 (keyed by tenant_id + environment_id + source_session_id). Backups have a 7-day TTL by default.
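
The ordering of the override can be sketched as below. `BaseSandbox` is a stand-in for the SDK base class so the example is self-contained, and the `log` array replaces the real R2 and D1 writes; both are assumptions for illustration only.

```typescript
// Stand-in for the SDK base class; only the hook this sketch overrides.
class BaseSandbox {
  log: string[] = [];

  async onActivityExpired(): Promise<void> {
    this.log.push("stop"); // the SDK tears the container down here
  }
}

class SnapshottingSandbox extends BaseSandbox {
  async snapshotWorkspaceNow(): Promise<void> {
    this.log.push("createBackup"); // would upload the workspace to R2
    this.log.push("recordBackup"); // would write the handle row to D1
  }

  async onActivityExpired(): Promise<void> {
    // Snapshot first: /workspace still exists at this point.
    await this.snapshotWorkspaceNow();
    // Then let the base class stop the container.
    await super.onActivityExpired();
  }
}
```

How the real override behaves when the snapshot itself fails is not covered by this sketch.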

renewActivityTimeout is RPCed by SessionDO’s alarm any time it sees a row in background_tasks — that keeps the container alive past the 5-minute window for bash run_in_background jobs that the agent is waiting on. Capped at 30 minutes total per task (see pollBackgroundTasks’s BG_TASK_MAX_LIFETIME_MS).
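
A sketch of the keepalive decision, assuming each background_tasks row carries a start timestamp (`startedAt` is a hypothetical column name) and that the alarm checks the cap per task before renewing:

```typescript
const BG_TASK_MAX_LIFETIME_MS = 30 * 60 * 1000; // 30-minute cap per task

interface BackgroundTask {
  id: string;
  startedAt: number; // epoch milliseconds; hypothetical column name
}

// True if at least one task is still inside its lifetime budget,
// meaning the alarm should RPC renewActivityTimeout on this tick.
function shouldRenew(tasks: BackgroundTask[], now: number): boolean {
  return tasks.some((task) => now - task.startedAt < BG_TASK_MAX_LIFETIME_MS);
}
```

With no tasks, or only tasks older than 30 minutes, `shouldRenew` returns `false` and the normal 5-minute idle timer takes over.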

SessionDO.fetch DELETE /destroy calls sandbox.snapshotWorkspaceNow() eagerly (the workspace won’t survive sandbox.destroy() otherwise), then sandbox.destroy(). Apart from the idle-timer path above, this is the only path that takes a final snapshot before the container goes away.

SessionDO                                   OmaSandbox (container DO)
=========                                   =========================
warmUpSandbox()
 ├─ sandbox.exec("true") ─────────────────> cold → warming-up → active
 ├─ sandbox.exec("cat /tmp/.oma-warm")
 │     marker absent → restore needed
 ├─ sandbox.restoreWorkspaceBackup(h) ────> mount squashfs into /workspace
 ├─ sandbox.setOutboundContext() ─────────> cert push + register handler
 ├─ sandbox.setBackupContext() ───────────> stash (tenant, env, session)
 └─ sandbox.exec("echo $gen > /tmp/.oma-warm")

drainEventQueue()
 └─ runFiber("turn:1")                ┐
     └─ streamText                    │ agent runs tools via
         └─ sandbox.exec / writeFile  ┘ sandbox RPC; container
                                        stays active

(turn ends; runFiber DELETEs cf_agents_runs row)

(no traffic for 5 min)
                                            sleepAfter fires
                                            OmaSandbox.onActivityExpired()
                                             ├─ snapshotWorkspaceNow()
                                             │   ├─ createBackup → R2
                                             │   └─ recordBackup → D1 row
                                             └─ super.onActivityExpired() → stop
                                            OmaSandbox.onStop({exitCode: 0})

(next turn arrives)
warmUpSandbox()
 ├─ sandbox.exec("true") ─────────────────> cold → warming-up → active
 ├─ sandbox.exec("cat /tmp/.oma-warm")
 │     marker absent (new container, /tmp wiped) → restore needed
 ├─ findLatestBackup() in D1 (the row sleepAfter just wrote)
 └─ sandbox.restoreWorkspaceBackup(h) ────> /workspace restored

The container can die independently of SessionDO and vice versa. SessionDO’s wrapSandboxWithLazyWarmup proxy detects container recycle by re-probing /tmp/.oma-warm before each call — if the marker is gone, the cached sandboxWarmupPromise is reset and the next call re-runs warmUpSandbox.
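
A minimal sketch of the lazy-warmup idea, wrapping only `exec` for brevity. `ExecSandbox` and `withLazyWarmup` are hypothetical names, the real wrapper proxies the full sandbox surface, and concurrent callers during a warmup are not handled here.

```typescript
interface ExecSandbox {
  exec(cmd: string): Promise<{ success: boolean; stdout: string }>;
}

function withLazyWarmup(sandbox: ExecSandbox, warmUp: () => Promise<void>): ExecSandbox {
  // Cached across calls; reset when the marker disappears.
  let warmupPromise: Promise<void> | null = null;

  return {
    async exec(cmd) {
      if (warmupPromise !== null) {
        // Re-probe the marker: a recycled container starts with a fresh /tmp.
        const probe = await sandbox.exec("cat /tmp/.oma-warm");
        if (!probe.success || probe.stdout.trim() === "") {
          warmupPromise = null;
        }
      }
      if (warmupPromise === null) {
        warmupPromise = warmUp(); // first call, or container was recycled
      }
      await warmupPromise;
      return sandbox.exec(cmd);
    },
  };
}
```

The extra probe costs one RPC per call; in exchange, callers never need to know whether the container behind the stub is the one they warmed up earlier.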