The Harness Is the Platform: A Distributed Agent Runtime on Temporal
How we designed our production harness for enterprise agents
Claude Code and Codex are excellent single-process harnesses. Production enterprise agents are a different problem – distributed across machines, across users, across time.
TL;DR
Deploying an agent in a persistent sandbox is like giving every user their own single-tenant EC2 server. It does not scale – not to thousands of runs, not even to a few hundred users. Nobody would ship an application that re-deploys the entire stack for each user session. Agents should not be deployed that way either.
The harness is the layer that holds state, routes tools, injects credentials, manages context, streams to the UI, and survives the death of any one node. Single-process harnesses like Claude Code and Codex do this well for developer tools. Enterprise agents need the same job done across machines, across users, and across long horizons – and done efficiently, on shared infrastructure.
This is a piece about what that distributed harness actually has to do, and what it looks like when you build it on Temporal.
Two shapes of harness
Temporal handles durable execution extremely well: state that survives a crash, retries with backoff, waits that cost nothing. That is the base every serious agent system needs – but it is not, by itself, an agent runtime. The part that turns durable workflows into something an agent actually runs on is where most of the work is. (I made the broader version of that argument in an earlier piece on durable execution.)
The shape of that work depends on who you are building for.
Claude Code, Codex, Cursor – these are excellent harnesses. They are single-process: one user, one machine, one session, one identity. The whole agent run lives inside a single binary, holds state in RAM, talks to one model at a time, and disappears cleanly when the user closes the terminal. For developer tools, this is the right shape. It is fast, simple, and reliable.
It is not what an enterprise platform looks like.
For enterprise use cases, the agent has to run for many users, across many integrations, often for hours, days, or even weeks, with the user not at a keyboard. It has to survive a node failure mid-run, fan out child work, get queried and signaled from outside, hold encrypted credentials it never exposes, and leave an audit trail per action. An agent run is not a process. It is a workflow. The harness has to coordinate that workflow across a fleet, not inside a single process.
The two shapes look superficially similar – an agent loop, some tools, a model – but the engineering underneath them is profoundly different.
What “distributed” means here
When we say the agent is distributed, we mean something specific.
The agent run is durable. It is checkpointed. If the process running it dies, another process resumes from the last decision. State is not in memory, it is in history.
The agent run is long. A turn may take 30 seconds. A run may take several hours, or days when it sits in wait states. A workflow may pause for a day waiting for a callback or a user signal. The harness has to assume nothing about elapsed time.
The agent run can be queried and signaled from outside. You can ask, in the middle of a run, what the agent is doing. You can send it new context. You can cancel it. You can wake it up. The state is reachable from the outside, by design.
The agent run can fan out. A parent workflow can spawn child workflows that run independently, return results, or stream progress back. A single user-facing run can be many actual workflows underneath.
The agent run is multi-tenant. Many users, many projects, many concurrent runs share the same fleet. Isolation is enforced at the activity boundary, not by separate processes.
None of these are exotic distributed-systems requirements. Each one is, however, something a single-process harness simply does not have to think about. That is the gap we are talking about.
Why Temporal as the distributed execution framework
If you set out to build this without a durable-execution runtime, you discover quickly that you are reinventing one. Workflow ID, history, deterministic replay, signal queues, retries with idempotency, child workflow semantics – these are not features you add later. They are the bones.
Temporal is the most mature primitive for this shape today. We chose it because we did not want to reinvent it. The specific properties we lean on:
Workflow as a durable function. The conversation workflow is the agent run. Stable ID, append-only history, deterministic replay. Restart Temporal, restart our workers, the agent picks up where it left off.
A distributed state machine, for free. Deterministic replay means the workflow is a state machine with the state living in history rather than in memory. Any worker in the fleet can advance any workflow, because the history fully defines its state. There is no separate state store to keep consistent.
Retries with idempotency baked in. Every activity gets configurable retries with exponential backoff, attempt limits, non-retryable error types. The activity author writes the function once; the runtime handles failure semantics. This matters a lot when half your activities are model calls that occasionally rate-limit, and the other half are external APIs that occasionally 500.
Hot deploys without losing in-flight runs. Long-running workflows survive code deployments via Temporal’s versioning model. You can ship a new workflow version while old runs are still executing, route new starts to the new version, and let old runs complete on the old version without interruption. For agents that may run for a day, this is not optional.
Elastic scalability through task queues. Workers pull from queues. Add workers, capacity scales. Specialize workers – one queue for model calls, another for browser sandboxes, another for cheap activities – and you get tier-based isolation for free. We run several worker pools today, and adding a new one is configuration, not a redesign.
Long pauses are free. A workflow can sleep for hours, days, or months waiting for a callback, a scheduled time, or a user signal. The pause is a state transition in history, not a timer in memory, so restarts and migrations are invisible to it. Agents that fan out work and check back tomorrow are routine, not a feature you have to engineer.
Activities for I/O. Anything that talks to the world – the model API, the secret store, the audit pipeline – runs as a Temporal activity. Activities can fail and be retried, and the workflow stays consistent.
Tool execution rides on the same primitives. Every tool call the agent makes is also a Temporal activity, so every tool inherits the retry, timeout, idempotency, heartbeat, and crash-recovery semantics of the platform. A browser action that takes ten minutes is normal. A tool that crashes mid-run gets resumed on another worker. The agent does not need to know.
Forced separation of orchestration and execution. Workflows must be deterministic – no I/O, no clocks, no randomness in the orchestration path. Anything with side effects goes into an activity. The constraint turns into a discipline: the workflow is pure planning and dispatch, the activities are the parts that touch the world. Each is easier to read, test, and change in isolation.
Signals and queries. The frontend streams from the workflow via signals. External systems push new input via signals. The UI reads progress via queries. All of this is built in.
Child workflows. A multi-step plan spawns child workflows for sub-tasks. Each can have its own retry policy, its own timeout, its own isolation. The parent composes the results.
History as audit. Every decision the workflow made is in history, deterministically replayable. That is most of an audit log for free, before we even add our own.
Could we have built all of this on raw Kubernetes plus a message queue plus a state store? Yes. Many teams have. Most of them stopped halfway and rebuilt on Temporal a year later. Durable execution stops being optional the moment your agent runs are real.
What the harness builds on top
Temporal gives us durable execution. It does not, on its own, know what an agent is. The harness is the layer that turns Temporal primitives into an agent runtime. At Vertesia, that layer includes the following pieces.
The conversation workflow. The agent run is a single durable workflow. It owns the turn loop: gather context, call the model, parse tool calls, dispatch them, accumulate results, decide whether to continue. It is the deterministic core. Everything non-deterministic – the model call, the tool call, the I/O – happens in activities.
Tools as activities. Every tool is a Temporal activity. Each one is a small, self-contained unit: a typed parameter schema, an implementation that touches the world, and a definition that registers it with the runtime. The workflow calls tools through a typed proxy. Adding a new tool means writing that unit and registering it; the workflow itself does not change.
Credential injection at the activity boundary. When a tool needs an external credential, the activity decrypts it just-in-time – held in memory for the duration of one call, never returned to the model. The model never sees the credential. This is the cardinal rule from the identity and secrets design I cover separately, enforced concretely at the activity edge.
Streaming via signals. As the agent works, it pushes incremental updates into a signal stream. The frontend subscribes and renders them in real time. Behind the scenes the workflow is still deterministic; the streaming is layered on as a side-channel that does not affect replay.
Skill registry. A skill is a tool the model can invoke to load a manual, a set of operating procedures, and a set of otherwise-hidden tools. Until the skill is loaded, the tools it gates are not in the model’s catalog. This is progressive tool discovery: the model only sees what is relevant for the work it has signalled. The skill registry lives in the platform, not in the prompt.
Context engine. The model does not get a static context. It gets a context constructed per turn, dynamically, from the conversation history, the loaded skills, the current task, and whatever the agent has discovered along the way. The context engine is the layer that constructs and ranks this material – which is itself an architectural piece deep enough to deserve its own essay: Intelligence Is Contextual.
Audit pipeline. Every privileged action – tool call, secret access, credential fill, model invocation – emits an event with the principal, the resource, the credential reference, and the result. Events land in a queryable analytics store. The delegation chain is preserved: user → agent run → tool → external API. This is the second cardinal rule. Nothing happens without an audit trail.
Each of these is a layer the model itself does not see. The model sees its context and a tool catalog. Everything below that is platform.
The hard parts of distribution
Building this on Temporal sounds clean. It is clean conceptually. Some of it is genuinely hard in practice.
Streaming and deterministic replay. Workflows must be deterministic. Streaming output naturally is not – tokens arrive in unpredictable order, partial results race, the frontend is impatient. We resolve this by treating streaming as a side-effect of the model activity: the activity streams to a signal channel, and the workflow only sees the final result on completion. Replay does not stream; live runs do. This works, but the discipline to keep streaming out of the deterministic path has to be enforced everywhere.
Cancellation propagation. A user cancels a run. We have to propagate that across child workflows, in-flight activities, pending tool calls, possibly external systems holding open connections. Temporal makes this possible (cancellation signals, heartbeats, contexts) but the harness has to wire it through every layer. Get one layer wrong and you have a workflow that keeps running after the user thinks they killed it.
Memory consistency across runs. Long-term memory is its own architecture problem (a piece I have queued for later in the series). Within the harness, the question is how a workflow on machine A sees what a previous workflow on machine B wrote. The answer is that the memory layer is external to the workflow, written through activities, read through activities, with the harness coordinating consistency at the application level.
Multi-tenant isolation. Every activity carries the principal token. Every call goes through the permission system. Every secret is keyed by project. Every audit event is scoped. The harness has to ensure that no piece of state, no credential, no model call can ever leak across tenants. This is a discipline more than an algorithm. Every activity has to do its part.
Cost attribution. Token usage, activity time, model calls per user – all of it has to be attributable. Inside a distributed runtime where many agents run concurrently across many users, the metering layer cannot be optional. We track usage at the activity boundary and aggregate per principal, per project, per workflow.
None of these is impossible. All of them have to be solved correctly.
What single-process harnesses don’t have to solve
It is worth being fair about what we are taking on. A single-process harness has a much easier job in several specific ways.
State can live in RAM. The conversation is right there, in memory, the entire time the user is using the tool. No serialization, no replay, no checkpointing required.
There is one user. There is one identity. There is one set of credentials. The notion of multi-tenant isolation does not arise.
Streaming is trivial. The model streams tokens; the terminal renders them. No signal channels, no replay discipline, no separate observation path.
Failure modes are simpler. If the process crashes, the session is over. The user restarts. No partial state to recover, no delegated cleanup, no audit obligation.
This is why single-process harnesses are so much easier to build well, and why Claude Code and Codex feel so good to use. They are the right shape for what they are.
The distributed harness is harder for the simple reason that it has to do all the things a single-process harness gets for free, but in a way that survives nodes dying, users disconnecting, runs lasting a day, and a hundred other tenants sharing the fleet at the same time.
What Temporal doesn’t give you
Temporal is excellent at what it is. It does not pretend to be an agent runtime.
It does not know what a model is. It has no notion of tokens, of context windows, of streaming completions. We had to model all of that ourselves.
It does not know what a tool is. Activities are typed functions. Tools, with their parameter schemas, permission contracts, credential dependencies, and audit events – those are our layer.
It does not give you memory across runs. Each workflow has its history; persistent memory across workflows is something the application has to design.
It does not give you cost tracking. Usage attribution is application-level work.
It does not give you a frontend protocol. Streaming, progress, cancellation buttons – all of that is layered on top of Temporal’s signal-and-query model.
We built all of these. They are not glamorous. They are the parts that turn durable execution into a working platform.
The harness is the platform
The model is one component of the system. It is the most visible one, the one everybody talks about. It is also the most replaceable one. The model your agent uses today is unlikely to be the model it uses in twelve months.
The harness is what does not change. The conversation workflow, the tool registry, the credential injection, the audit pipeline, the skill system, the streaming protocol, the multi-tenant isolation – these are the parts that compound. Every new tool we ship works in every existing run. Every new model we adopt slots into the same activity layer. Every new customer comes with the same isolation guarantees.
Single-process harnesses are excellent for a developer at a keyboard. Distributed harnesses are what an enterprise platform needs when there are hundreds of users, dozens of integrations, agents running on schedules and on triggers and on behalf of someone who walked away an hour ago.
The two are different problems. They need different primitives.
We chose Temporal because durable execution is hard and it solves that completely. We built the harness because durable execution is not, by itself, the platform.
That is the work. The model is downstream of it.


