How to Make a Dependency Resumable

A normal injected dependency lives and dies with a single request. When the agent finishes its turn, the cleanup phase closes it and its state is gone. That is the right default for most dependencies.

Some turns run for minutes. An agent that searches, reads documents, loads dataframes, and writes a report can have its process restarted mid-turn by a deploy, a crash, or a node going away. If the dependency holding the model/tool transcript loses its working state, the whole turn restarts from scratch.

A resumable dependency avoids that. It teaches the framework how to serialize its working state to a JSON blob and how to rebuild itself from one, so a turn can be checkpointed on one process and continued on another.

Note: This is an SDK extension point, like creating a custom source. It builds on the dependency injection system, so read that first. You do not need it for a normal custom agent; reach for it only when a dependency holds working state that must outlive the process.

When State Is Checkpointed

On a stateful turn the framework runs a background worker alongside the agent. The worker calls dump() on every tracked resumable dependency and persists the result to the agent state store:

  • Every 5 seconds while the turn runs, so progress is never more than a few seconds behind.
  • Once more when the turn finishes (or the stream drains), so the final state is always saved.

Each blob is keyed by the dependency's state_key. On the next turn — including a resumed turn after a restart — the framework loads the saved blob back into a freshly constructed dependency by calling load() during DI resolution, before the agent runs. The agent then re-drives the model/tool loop from the restored transcript instead of starting over.

The blob is persisted as JSON and read back in a different process. That single fact drives every rule below.
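
The checkpoint cycle described above can be pictured with a small asyncio sketch. This is not the framework's implementation: `Resumable`, `save_all`, `run_with_checkpoints`, and `restore` are hypothetical names, and a plain dict stands in for the agent state store. It only illustrates the timing (periodic saves, one final save, load only when a blob exists).

```python
import asyncio
import contextlib
import json
from typing import Any, Awaitable, Dict, List, Protocol


class Resumable(Protocol):
    """Stand-in for the parts of ResumableAgentDependency this sketch uses."""

    state_key: str

    async def dump(self) -> Dict[str, Any]: ...

    async def load(self, state: Dict[str, Any]) -> None: ...


async def save_all(deps: List[Resumable], store: Dict[str, str]) -> None:
    # Persist each blob as a JSON string under the dependency's state_key.
    for dep in deps:
        store[dep.state_key] = json.dumps(await dep.dump())


async def run_with_checkpoints(
    turn: Awaitable[None],
    deps: List[Resumable],
    store: Dict[str, str],
    interval: float = 5.0,
) -> None:
    async def worker() -> None:
        # Periodic checkpoint while the turn is still running.
        while True:
            await asyncio.sleep(interval)
            await save_all(deps, store)

    task = asyncio.create_task(worker())
    try:
        await turn
    finally:
        task.cancel()
        with contextlib.suppress(asyncio.CancelledError):
            await task
        # One last save so the final state is always persisted.
        await save_all(deps, store)


async def restore(dep: Resumable, store: Dict[str, str]) -> None:
    raw = store.get(dep.state_key)
    if raw is not None:  # cold first turn: nothing saved, load() is skipped
        await dep.load(json.loads(raw))
```

Because the blob passes through `json.dumps` on one side and `json.loads` on the other, anything `dump()` returns has to be JSON-native, which is what the rules later in this page enforce.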

The ResumableAgentDependency Interface

Subclass ResumableAgentDependency and implement two async methods plus one class attribute:

from typing import Any, ClassVar, Dict

from zav.agents_sdk import ResumableAgentDependency


class MyDependency(ResumableAgentDependency):
    state_key: ClassVar[str] = "my_dependency"

    async def dump(self) -> Dict[str, Any]:
        """Serialize the working state to a JSON-safe dict."""
        ...

    async def load(self, state: Dict[str, Any]) -> None:
        """Rebuild the working state from a previously dumped dict."""
        ...

| Member | Role |
| --- | --- |
| state_key | The storage key for this dependency's blob. Stable across turns, unique among the resumable dependencies in one agent. A singleton dependency is keyed by state_key alone; a non-singleton is additionally namespaced by its path in the dependency tree, so the same class injected in two places never collides. |
| dump() | Returns a JSON-serializable dict of everything needed to reconstruct the state. Called on every checkpoint. |
| load(state) | Restores state in place from a dict previously returned by dump(). Called during construction (before the agent runs) only when a saved blob exists for this state_key, i.e. on a resumed turn. On a cold first turn there is no saved state, so load() is not called and the freshly constructed dependency must already be valid from its constructor defaults. When it is called, the dict may be {} or carry unfamiliar keys (blobs written by an older or newer version), so load() must tolerate both. |

There is nothing to register beyond your usual AgentDependencyFactory, same as for sources. While resolving the agent's constructor, the framework tracks any resolved value that is a ResumableAgentDependency, then checkpoints and restores it automatically.

A Minimal Worked Example

The SDK's own ResumableZAVChatCompletionClient (state_key = "zav_chat_completion_client") is the reference implementation: it wraps the model/tool turn runner and persists the running transcript so a long mission resumes on another pod. The pattern below is the same shape, reduced to a dependency that accumulates events during a turn and must not lose them on restart.

from typing import Any, ClassVar, Dict, List

from zav.agents_sdk import (
    AgentDependencyFactory,
    AgentDependencyRegistry,
    ResumableAgentDependency,
)


class ProgressLog(ResumableAgentDependency):
    """Accumulates step descriptions the agent emits during a long turn."""

    state_key: ClassVar[str] = "progress_log"

    def __init__(self) -> None:
        self._steps: List[str] = []

    def record(self, step: str) -> None:
        self._steps.append(step)

    async def dump(self) -> Dict[str, Any]:
        # Return a plain JSON-safe snapshot. Do not mutate self here.
        return {"steps": list(self._steps)}

    async def load(self, state: Dict[str, Any]) -> None:
        # Tolerate a missing key: an empty or older-version blob may have no "steps".
        self._steps = list(state.get("steps", []))


class ProgressLogFactory(AgentDependencyFactory):
    @classmethod
    def create(cls) -> ProgressLog:
        return ProgressLog()


AgentDependencyRegistry.register(ProgressLogFactory)

Any agent that declares progress_log: ProgressLog in its constructor now gets an instance whose state is checkpointed every few seconds and restored on resume, without the agent code knowing anything about persistence.

Tip: dump() runs on every checkpoint. The SDK client caches its last serialization and rebuilds it only when the transcript grew, returning the same object when nothing changed so the state saver skips an unchanged write with a cheap identity check. If your dump() is expensive, do the same.
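
One way to apply that caching idea is sketched below. `CachedLog` is a hypothetical stand-in written without the SDK import, and using the list length as the change marker is an assumption that only holds for append-only state. The SDK client's actual caching logic is not shown in this page.

```python
import asyncio
from typing import Any, Dict, List, Optional


class CachedLog:
    """Stand-in for a ResumableAgentDependency subclass (no SDK import)."""

    state_key = "cached_log"

    def __init__(self) -> None:
        self._steps: List[str] = []
        self._cached: Optional[Dict[str, Any]] = None
        self._cached_len = -1

    def record(self, step: str) -> None:
        self._steps.append(step)

    async def dump(self) -> Dict[str, Any]:
        # Rebuild only when the append-only list grew since the last dump;
        # otherwise hand back the very same object.
        if self._cached is None or self._cached_len != len(self._steps):
            self._cached = {"steps": list(self._steps)}
            self._cached_len = len(self._steps)
        return self._cached

    async def load(self, state: Dict[str, Any]) -> None:
        self._steps = list(state.get("steps", []))
        self._cached = None  # restored state invalidates any old snapshot


async def demo() -> bool:
    log = CachedLog()
    log.record("search")
    first = await log.dump()
    second = await log.dump()
    return first is second  # unchanged state yields the identical object
```

The cache is bookkeeping only: the logical state in `_steps` is never touched by `dump()`, so the snapshot stays a pure read. Callers must treat the returned dict as read-only, since it is shared between calls.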

The Three Contract Rules

Independently written resumable dependencies share one store, so they must all obey the same three invariants.

1. JSON round-trip safe

Whatever dump() returns must survive json.loads(json.dumps(...)) unchanged, because the store persists it as JSON across processes. This is where state that round-trips fine in memory breaks: a set, datetime, tuple, pandas DataFrame, or Pydantic model is not JSON-native. Serialize to JSON-native types in dump() and reconstruct in load().
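
As an illustration of rule 1, the sketch below keeps a set, a datetime, and a tuple in memory and converts each to a JSON-native form in dump(). `VisitTracker` and its fields are hypothetical, and the class is written as a plain stand-in rather than a real ResumableAgentDependency subclass so that it runs without the SDK.

```python
import json
from datetime import datetime, timezone
from typing import Any, Dict, Set, Tuple


class VisitTracker:
    """Stand-in dependency holding state that is not JSON-native."""

    state_key = "visit_tracker"

    def __init__(self) -> None:
        self.seen_urls: Set[str] = set()
        self.started_at: datetime = datetime.now(timezone.utc)
        self.cursor: Tuple[int, int] = (0, 0)

    async def dump(self) -> Dict[str, Any]:
        return {
            # set -> sorted list (json.dumps raises TypeError on a set)
            "seen_urls": sorted(self.seen_urls),
            # datetime -> ISO 8601 string
            "started_at": self.started_at.isoformat(),
            # tuple -> list (a tuple would come back from JSON as a list)
            "cursor": list(self.cursor),
        }

    async def load(self, state: Dict[str, Any]) -> None:
        self.seen_urls = set(state.get("seen_urls", []))
        raw_started = state.get("started_at")
        if raw_started is not None:
            self.started_at = datetime.fromisoformat(raw_started)
        page, offset = state.get("cursor", [0, 0])
        self.cursor = (page, offset)


def survives_round_trip(blob: Dict[str, Any]) -> bool:
    # The exact condition rule 1 asks for.
    return json.loads(json.dumps(blob)) == blob
```

The tuple is the subtle case: `json.dumps` accepts it without complaint, but it comes back as a list, so the round-trip comparison fails even though no exception is raised.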

2. Cold-start tolerant

The first turn has no saved state, and rolling deploys mean a blob written by an older version may be missing keys you now expect or carry keys you no longer know. load({}) and load({"unknown_key": "..."}) must both succeed. Read with state.get(key, default), never index a key you assume is present, and ignore keys you do not recognize.
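
A tolerant load() might look like the sketch below. `SearchBudget` and its fields are hypothetical, and the older blob shape with a single "note" string is an invented example of version drift, not something the SDK defines.

```python
from typing import Any, Dict, List


class SearchBudget:
    """Stand-in dependency whose blob shape changed between versions."""

    state_key = "search_budget"

    def __init__(self) -> None:
        # Constructor defaults must already be a valid cold-start state.
        self.used: int = 0
        self.notes: List[str] = []

    async def dump(self) -> Dict[str, Any]:
        return {"used": self.used, "notes": list(self.notes)}

    async def load(self, state: Dict[str, Any]) -> None:
        # Read every key with a default; never index a key directly.
        self.used = int(state.get("used", 0))

        notes = state.get("notes")
        if notes is None and "note" in state:
            # Hypothetical older blob that stored one string under "note".
            notes = [state["note"]]
        self.notes = list(notes or [])
        # Keys not read above are simply ignored.
```

Reading only the keys it knows about means the same load() handles an empty blob, an older blob, and a blob written by a newer version with extra keys.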

3. Non-destructive dump()

dump() is a pure read of the current state. Calling it twice in a row returns equal results and does not mutate the dependency. The background worker calls it many times during a turn; a dump() with side effects would corrupt the state it is supposed to snapshot.
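
The contrast below shows how a side effect in dump() loses data. Both classes are hypothetical stand-ins written without the SDK: `DrainingBuffer` empties itself while snapshotting, and `SnapshotBuffer` copies instead.

```python
from typing import Any, Dict, List


class DrainingBuffer:
    """Anti-pattern: dump() empties the buffer it is supposed to snapshot."""

    def __init__(self) -> None:
        self._pending: List[str] = []

    def add(self, item: str) -> None:
        self._pending.append(item)

    async def dump(self) -> Dict[str, Any]:
        items, self._pending = self._pending, []  # side effect: state is gone
        return {"pending": items}


class SnapshotBuffer:
    """Correct: dump() copies the state and leaves it untouched."""

    def __init__(self) -> None:
        self._pending: List[str] = []

    def add(self, item: str) -> None:
        self._pending.append(item)

    async def dump(self) -> Dict[str, Any]:
        return {"pending": list(self._pending)}
```

With `DrainingBuffer`, the second checkpoint overwrites the first with an empty list, so a restart after that point restores nothing. `SnapshotBuffer` returns equal results on consecutive calls, which is what rule 3 requires.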

A tiny self-check

Sanity-check all three against your own dependency:

import json


async def check(make_populated, make_empty, key):
    populated = await make_populated()

    # Rule 1: dump survives a JSON round-trip unchanged.
    dumped = await populated.dump()
    assert json.loads(json.dumps(dumped)) == dumped

    # Rule 3: dump is non-destructive.
    assert await populated.dump() == dumped

    # Rule 2: a fresh instance tolerates empty and unknown-key states...
    empty = await make_empty()
    await empty.load({})
    await empty.load({"unknown_key": "ignored"})

    # ...and a real blob restores to an equivalent instance.
    restored = await make_empty()
    await restored.load(json.loads(json.dumps(dumped)))
    assert (await restored.dump())[key] == dumped[key]

For the ProgressLog above, make_populated builds one with a few recorded steps, make_empty builds a fresh one, and key is "steps".

Gotchas

  • Keep state_key stable. Renaming it orphans every saved blob (those turns cold-start). Keep it unique within an agent — the framework raises a collision error if two resumable dependencies resolve to the same key.
  • Resumability is opt-in per dependency. Only the dependencies you make ResumableAgentDependency are checkpointed; everything else is rebuilt fresh on resume. Persist exactly the state that must survive a restart, and make sure the rest of the dependency can reconstruct itself from it.