Preview, test, and run

Telbox gives an agent three execution modes that form a safety gradient. A dry-run predicts what an agent would do without spending a token. A test-run executes it for real against the live model, in a sandbox where write tools only propose and never auto-execute. The production trigger path is the only one where the authority gradient actually auto-executes granted reversible writes and posts the agent's reply back to the thread. This page covers all three, the run-trace shape they share, and how to call the two previewable modes from the SDKs.

The three modes at a glance

| Mode | LLM? | Side effects | Write tools | Reply posted? | How you invoke it |
| --- | --- | --- | --- | --- | --- |
| Dry-run | No | None (pure) | Predicted only; shows each write's real authority decision | Predicted (would_auto_reply) | POST /v1/agents/dry-run (IR) · POST /v1/agents/{id}/dry-run (installed) |
| Test-run | Yes (live) | None to your data; runs persist a durable AgentRun | Propose: mint a confirm token, never auto-execute | Never (no thread_id) | POST /v1/agents/{id}/test-run |
| Production trigger | Yes (live) | Real: granted reversible writes auto-execute | Auto-execute when granted auto_act_limited | Yes, when granted thread_replies: auto_act_limited | A message_arrival trigger fires (no direct endpoint) |

The gradient is intentional

Dry-run is authoritative but free — it reads the persisted IR + guards and computes the same authority decision the runtime would, with no model call. Test-run is live but safe — the model actually runs, but it can't touch your data. Only the production trigger path crosses the line into auto-execution. Build with dry-run, validate with test-run, ship to triggers.


Dry-run — predict, don't execute

A dry-run is a deterministic, side-effect-free preview computed from an agent's compiled IR and its authority guards. There is no LLM call and no execution. For each step it tells you which tool runs, whether it reads or acts, and — crucially — the real authority decision each write would get, so a leash set to "auto" that won't actually auto-act (e.g. an empty-limits capability that resolves to ask) surfaces as a warning instead of a runtime surprise.

It works for any agent: an installed one, an IR posted before creation, or a raw template. Because it never touches the model it is instant and free.

Two endpoints:

  • POST /v1/agents/dry-run — preview a raw IR before you create the agent.
  • POST /v1/agents/{agent_id}/dry-run — preview an installed agent from its persisted IR + guards.

Response shape

DryRunView
{
  "name": "VIP Watcher",
  "triggers": ["When a message arrives in a thread"],
  "steps": [
    {
      "id": "s1",
      "kind": "tool",
      "tool": "read_thread",
      "description": "Read the recent messages in a thread.",
      "side_effects": "none",
      "capability": null,
      "authority": null
    },
    {
      "id": "s2",
      "kind": "tool",
      "tool": "create_task",
      "description": "Create a task from a message.",
      "side_effects": "write",
      "capability": "task_management",
      "authority": {
        "level": "auto_act_limited",
        "mode": "auto",
        "phrase": "acts automatically (within limits)",
        "reason": "within_limits"
      }
    }
  ],
  "effects": { "reads": 1, "drafts": 0, "asks": 0, "auto": 1, "blocked": 0 },
  "would_auto_reply": false,
  "warnings": [
    "“send_email” is set to act automatically but will ask instead (external_always_confirms) — finish the leash to let it auto-act."
  ]
}

| Field | Meaning |
| --- | --- |
| name, triggers | The agent's name and human-phrased triggers. |
| steps[] | Per-step preview. kind is tool, say, or if. |
| steps[].side_effects | none (read), write (reversible), or external (irreversible). |
| steps[].authority | The resolved decision for a write/external step (null for reads). mode is one of auto / ask / draft / refuse. |
| effects | Counts of reads / drafts / asks / auto / blocked across the steps. |
| would_auto_reply | Whether the thread_replies guard resolves to auto. |
| warnings[] | Plain-English mismatches, e.g. an auto_act_limited leash that won't auto-act, no trigger, or no tools. |
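
To make the relationship between steps[] and effects concrete, here is a minimal sketch that recomputes the counters from a DryRunView's steps. It works on plain dicts shaped like the sample response above. The mapping from mode to counter (in particular refuse to blocked) and the choice to skip say / if steps are assumptions, not documented behaviour, and tally_effects is a hypothetical helper rather than part of the SDK.

```python
# Assumed mapping from a step's resolved authority mode to an `effects` counter.
# "refuse" -> "blocked" is a guess; the page does not spell this mapping out.
MODE_TO_COUNTER = {"auto": "auto", "ask": "asks", "draft": "drafts", "refuse": "blocked"}


def tally_effects(steps):
    """Recompute effects-style counters from a DryRunView `steps` list (hypothetical helper)."""
    counts = {"reads": 0, "drafts": 0, "asks": 0, "auto": 0, "blocked": 0}
    for step in steps:
        if step.get("kind") != "tool":
            continue  # assumption: `say` and `if` steps are not counted
        authority = step.get("authority")
        if authority is None:
            counts["reads"] += 1  # reads carry no authority decision
        else:
            counts[MODE_TO_COUNTER[authority["mode"]]] += 1
    return counts


# The two steps from the sample DryRunView, trimmed to the fields used here.
steps = [
    {"id": "s1", "kind": "tool", "tool": "read_thread",
     "side_effects": "none", "authority": None},
    {"id": "s2", "kind": "tool", "tool": "create_task", "side_effects": "write",
     "authority": {"level": "auto_act_limited", "mode": "auto", "reason": "within_limits"}},
]

print(tally_effects(steps))
```

For the sample steps this yields one read and one auto, which agrees with the effects object in the sample response.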

External tools always confirm

A step with side_effects: "external" (e.g. request_uber_ride) is hard-gated out of auto-execution by the runner before the authority resolver is consulted. Dry-run mirrors this exactly: such a step always shows mode: "ask" with reason: "external_always_confirms", whatever grant you configured. No leash setting alone can make an irreversible tool auto-execute.
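
One way to catch these downgrades before shipping is to scan a dry-run response for writes that were granted auto but resolve to something else. The sketch below is a hypothetical helper over a plain dict; it assumes authority.level carries the configured grant and authority.mode the resolved decision, which the sample response suggests but the page does not state outright.

```python
def downgraded_steps(view):
    """List steps granted auto_act_limited whose resolved mode is not auto.

    Hypothetical helper: assumes `authority.level` is the configured grant
    and `authority.mode` is what the runtime would actually do.
    """
    found = []
    for step in view.get("steps", []):
        authority = step.get("authority")
        if not authority:
            continue  # reads have no authority decision
        if authority.get("level") == "auto_act_limited" and authority.get("mode") != "auto":
            found.append((step["id"], step.get("tool"),
                          authority.get("mode"), authority.get("reason")))
    return found


# Illustrative view: one reversible write that auto-acts, one external tool that cannot.
view = {
    "steps": [
        {"id": "s1", "tool": "create_task", "side_effects": "write",
         "authority": {"level": "auto_act_limited", "mode": "auto",
                       "reason": "within_limits"}},
        {"id": "s2", "tool": "request_uber_ride", "side_effects": "external",
         "authority": {"level": "auto_act_limited", "mode": "ask",
                       "reason": "external_always_confirms"}},
    ]
}

for step_id, tool, mode, reason in downgraded_steps(view):
    print(f"{step_id} ({tool}) will {mode}: {reason}")
```

A check like this could run in CI against the dry-run output, failing the build when a leash you expected to auto-act would ask instead.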


Test-run — live model, safe sandbox

A test-run executes the agent once against a sample prompt using the live model, and persists a durable AgentRun you can read back from the run history. It reuses the same agent loop the production runtime uses — but as a safe sandbox:

  • Write tools propose, never auto-execute. The test-run path deliberately passes no authority resolver, so a write tool mints a confirm token instead of acting. You can re-run it as many times as you like without creating real tasks, reminders, or calendar events.
  • No reply is ever posted. A test-run carries no thread_id, so the auto-reply helper is a no-op — a test-run can never post into a real thread. The agent's answer stays in the run trace.
  • Failures are recorded, not raised. A failed run (e.g. no LLM configured) is persisted with its error and shows up in history as failed rather than returning a 500.

POST /v1/agents/{agent_id}/test-run
Content-Type: application/json

{ "prompt": "A VIP just messaged about the Q3 contract — what should I do?" }

The endpoint returns a RunDetail (the same shape GET /v1/agents/{id}/runs/{run_id} returns). Run history is paginated at GET /v1/agents/{id}/runs. See Pagination & Versioning.

Test-run is your debug loop

Every test-run is durable. Iterate on the persona and tools, fire a test-run, then read its trace — args → result/error → duration per tool — to see exactly what the model chose and why. The run is in the history either way.


Production trigger — the autonomy gradient

The production path is the only one that auto-executes. When a message_arrival trigger fires, the runtime runs the agent through the same loop as test-run, but with the agent's authority gradient wired in (build_agent_authority_resolver over the grant's permissions). On a trigger there is no human in the loop, so the gradient is what decides:

  • A capability granted auto_act_limited auto-executes — but only for reversible, own-workspace tools (create_task, create_reminder, create_calendar_event, mute_thread).
  • An ungranted capability resolves to ask — and since no human is present on a trigger, the result stays in the trace for the owner.
  • An external / irreversible tool always confirms — it is hard-gated out of auto-execution regardless of the grant.
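
The three rules above can be sketched as a small decision function. This is an illustration of the ordering only, not the runtime's build_agent_authority_resolver. The grants dict shape (capability name to level), the function name decide, and the reason strings other than within_limits and external_always_confirms are all assumptions.

```python
# Reversible, own-workspace tools named on this page.
REVERSIBLE_TOOLS = {"create_task", "create_reminder", "create_calendar_event", "mute_thread"}


def decide(tool, side_effects, capability, grants):
    """Return (mode, reason) for one tool call on a triggered run (illustrative sketch)."""
    if side_effects == "none":
        return ("read", "no_authority_needed")  # hypothetical label for reads
    # Hard gate first: external tools never reach the grant lookup.
    if side_effects == "external":
        return ("ask", "external_always_confirms")
    level = grants.get(capability)  # assumed shape: {capability: level}
    if level == "auto_act_limited" and tool in REVERSIBLE_TOOLS:
        return ("auto", "within_limits")
    # Ungranted capabilities resolve to ask. Other levels such as draft_only
    # are not modelled here; this sketch conservatively treats them as ask.
    return ("ask", "not_granted")  # hypothetical reason string


grants = {"task_management": "auto_act_limited"}
print(decide("create_task", "write", "task_management", grants))
print(decide("create_reminder", "write", "reminders", grants))
print(decide("request_uber_ride", "external", "rides", {"rides": "auto_act_limited"}))
```

Checking the external gate before the grant lookup mirrors the ordering described for the runner: the grant is never consulted for an irreversible tool, so no grant value can change the outcome.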

Then, if (and only if) the agent is granted thread_replies: auto_act_limited, its answer is posted back to the watched thread, signed with the agent's own Ed25519 key, so the recipient sees a cryptographically verified badge. The reply commits atomically with the run completion and is followed by a best-effort APNs push. It is published without a MessageSent event, so an agent's own reply can never re-trigger the dispatcher. A draft_only / ask_before_action agent never auto-posts; its answer is recorded as reply_skipped for the owner to review.

There is no taint gate on reversible auto-acts

A message_arrival trigger fires on content the agent did not author. A reversible tool granted auto_act_limited can act on instructions embedded in that message. This is by design ("turn a VIP message into a task"), and the blast radius is bounded to reversible, own-workspace, deduped, rate-limited actions. The default is draft_only; an agent auto-executes nothing until you explicitly grant auto_act_limited. For the full posture see the runtime spec.

For the durability, dedupe, feedback-loop, and isolation guarantees of the trigger path, read the Agents runtime spec.


The run trace

Both test-run and the production trigger path persist an AgentRun.trace — a compact, JSON-safe, size-bounded timeline. The first entry is always the agent's answer; each subsequent tool entry records the arguments the tool ran with, ok/error, a bounded result summary, and duration.

AgentRun.trace
[
  { "kind": "answer", "text": "Flagged the contract message and created a follow-up task." },
  {
    "kind": "tool",
    "name": "read_thread",
    "ok": true,
    "args": { "thread_id": "…", "limit": 20 },
    "duration_ms": 142,
    "result": { "messages": 18 }
  },
  {
    "kind": "tool",
    "name": "create_task",
    "ok": true,
    "args": { "title": "Follow up on Q3 contract" },
    "duration_ms": 31,
    "result": { "task_id": "…" }
  }
]

| Trace kind | When | Key fields |
| --- | --- | --- |
| answer | Always first | text |
| tool | One per tool call | name, ok, args, result (when ok), error (when not), duration_ms |
| error | Run failed before/within the loop | text |
| reply | Production reply posted | message_id, thread_id |
| reply_skipped | Production reply gated off | reason (e.g. rate_limited, thread_not_found, an authority reason) |
| reply_failed | Production reply delivery errored | error |

reply* entries only appear on triggered runs

A test-run has no thread_id, so it never produces reply, reply_skipped, or reply_failed entries — only answer, tool, and (on failure) error. The reply outcome is a production-only signal.

The trace is exposed as the trace field on a RunDetail, alongside the convenience answer (the text of the answer entry).
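
As a rough illustration of reading a trace, the sketch below walks the entries and collects the answer, tool timing, failed tools, and the reply outcome. It assumes the trace is a list of plain dicts with the kinds and fields documented above; summarize_trace is a hypothetical helper, not an SDK function.

```python
REPLY_KINDS = ("reply", "reply_skipped", "reply_failed")


def summarize_trace(trace):
    """Condense an AgentRun.trace (list of dicts) into a small summary dict."""
    summary = {"answer": None, "tool_calls": 0, "total_tool_ms": 0,
               "failed_tools": [], "reply": None}
    for entry in trace or []:
        kind = entry.get("kind")
        if kind == "answer":
            summary["answer"] = entry.get("text")
        elif kind == "tool":
            summary["tool_calls"] += 1
            summary["total_tool_ms"] += entry.get("duration_ms", 0)
            if not entry.get("ok"):
                summary["failed_tools"].append((entry.get("name"), entry.get("error")))
        elif kind in REPLY_KINDS:
            summary["reply"] = kind  # only expected on triggered runs
    return summary


# The sample trace from this page (elided ids left as-is).
trace = [
    {"kind": "answer", "text": "Flagged the contract message and created a follow-up task."},
    {"kind": "tool", "name": "read_thread", "ok": True,
     "args": {"thread_id": "…", "limit": 20}, "duration_ms": 142,
     "result": {"messages": 18}},
    {"kind": "tool", "name": "create_task", "ok": True,
     "args": {"title": "Follow up on Q3 contract"}, "duration_ms": 31,
     "result": {"task_id": "…"}},
]

print(summarize_trace(trace))
```

On a test-run the reply key should stay None, since reply entries are produced only on triggered runs.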


Calling it from the SDKs

Authenticate with a scoped developer API key (see Authentication). Both SDKs default to https://api.telbox.ai.

Dry-run

# Preview a raw IR before creating the agent
curl -X POST https://api.telbox.ai/v1/agents/dry-run \
  -H "Authorization: Bearer $TELBOX_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{ "name": "VIP Watcher", "triggers": [...], "steps": [...], "guards": {...} }'

# Preview an installed agent
curl -X POST https://api.telbox.ai/v1/agents/$AGENT_ID/dry-run \
  -H "Authorization: Bearer $TELBOX_API_KEY"

# Python SDK
from telbox import TelboxClient

tb = TelboxClient(api_key="tb_live_…")

# Preview a raw IR (a dict) before creating the agent
preview = tb.dry_run(ir)
print(preview.effects)          # {"reads": 1, "auto": 1, ...}
print(preview.would_auto_reply)
for w in preview.warnings:
    print("⚠", w)

# Or preview an installed agent
preview = tb.dry_run_agent(agent_id)

// TypeScript SDK
import { TelboxClient } from "@telbox/sdk";

const tb = new TelboxClient({ apiKey: "tb_live_…" });

// Preview a raw IR before creating the agent
const preview = await tb.dryRun(ir);
console.log(preview.effects);          // { reads: 1, auto: 1, ... }
console.log(preview.would_auto_reply);
preview.warnings.forEach((w) => console.warn(w));

// Or preview an installed agent
const installed = await tb.dryRunAgent(agentId);

Test-run

curl -X POST https://api.telbox.ai/v1/agents/$AGENT_ID/test-run \
  -H "Authorization: Bearer $TELBOX_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{ "prompt": "A VIP just messaged about the Q3 contract — what should I do?" }'

# Python SDK
from telbox import TelboxClient

tb = TelboxClient(api_key="tb_live_…")

run = tb.test_run(agent_id, "A VIP just messaged about the Q3 contract.")
print(run.status, run.latency_ms, run.model_used)
print(run.answer)
for entry in run.trace or []:
    print(entry["kind"], entry.get("name", ""), entry.get("duration_ms", ""))

# Read it (or any prior run) back from history
again = tb.get_run(agent_id, run.id)

// TypeScript SDK
import { TelboxClient } from "@telbox/sdk";

const tb = new TelboxClient({ apiKey: "tb_live_…" });

const run = await tb.testRun(agentId, "A VIP just messaged about the Q3 contract.");
console.log(run.status, run.latency_ms, run.model_used);
console.log(run.answer);
(run.trace ?? []).forEach((e) => console.log(e.kind, e.name, e.duration_ms));

// Read it (or any prior run) back from history
const again = await tb.getRun(agentId, run.id);

Rate limits

Agent write actions — including test-run and dry-run — are metered per developer (~30/hour). A 429 carries a Retry-After header; back off and retry. See Rate Limits & Quotas.
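
A minimal sketch of the back-off loop, independent of any particular HTTP client. Here send is a hypothetical callable that performs one request and returns (status, headers, body); neither it nor call_with_retry is part of the Telbox SDKs. The sketch assumes Retry-After is given in seconds and falls back to a default delay when the header is missing or in HTTP-date form.

```python
import time


def retry_after_seconds(headers, default=60.0):
    """Parse a Retry-After value given in seconds; fall back to `default` otherwise."""
    raw = headers.get("Retry-After")
    try:
        return max(0.0, float(raw))
    except (TypeError, ValueError):
        return default  # header missing, or an HTTP-date this sketch does not parse


def call_with_retry(send, max_attempts=3, sleep=time.sleep):
    """Call `send()` until it stops returning 429 or attempts run out.

    `send` is a hypothetical callable returning (status, headers, body).
    """
    for attempt in range(1, max_attempts + 1):
        status, headers, body = send()
        if status != 429:
            return status, body
        if attempt < max_attempts:
            sleep(retry_after_seconds(headers))
    return status, body  # still rate-limited after the final attempt


# Simulated exchange: one 429 with Retry-After: 2, then success. No real sleeping.
responses = [(429, {"Retry-After": "2"}, None), (200, {}, {"ok": True})]
slept = []
result = call_with_retry(lambda: responses.pop(0), sleep=slept.append)
print(result, slept)
```

Passing sleep as a parameter keeps the loop testable; in real use the default time.sleep applies, and max_attempts should stay small given the hourly budget.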


Next steps

  • Authentication — get a scoped developer API key.
  • Errors — the machine-readable error codes these endpoints return.
  • Agents runtime — the durability, dedupe, feedback-loop, and injection posture of the production trigger path.