v0.1 · Production capture + replay

Stress-test agents. Capture production.
Replay incidents on demand.

Tool Pouch is the reliability layer for AI agents. Catch silent failures pre-deploy with tool-pouch scan. Capture every production request with tool_pouch.wrap_openai. Replay any incident on demand under chaos so you know whether it reproduces, before you ship a fix.

Apache 2.0 · Sub-ms capture overhead · PII-redacted by default · No account, no telemetry · Works with OpenAI & Anthropic
The 3am problem

It's 3am. Your agent did something stupid. Now what?

Your agent worked yesterday. Today, a customer says it told them their refund had been processed when nothing actually happened. Your eval suite is green. Your dashboard shows healthy latency. The model returns "everything looks good." So what broke?

The bug lives between the surfaces your existing tools watch. The model called a tool, the tool returned an empty object, and the model confidently pretended otherwise. You can't repro it. You can't even see it. By the time it shows up in your observability stack, the customer has already left.

This is what Tool Pouch is for. Stress-test the adversarial cases pre-deploy. Capture every production request so you have the trace when something does slip through. Replay that trace under chaos in one command, and find out whether it's a one-time fluke or a 78%-of-the-time hazard waiting for its next victim.

Three paths. Five-minute integration.

Pick the layer your agent needs.

01

Pre-deploy

tool-pouch scan injects 12+ tool failures (timeouts, malformed JSON, rate limits, prompt injection, unicode chaos) and reports the silent hallucinations. Auto-discovers @tool_pouch.tool functions, generates test inputs from your docstrings.

tool-pouch init && tool-pouch scan --quick
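
As a sketch of what scan picks up: the @tool_pouch.tool decorator is Tool Pouch's, but the function itself is purely illustrative.

import tool_pouch

@tool_pouch.tool
def check_refund_status(order_id: str) -> dict:
    """Look up the refund status for an order by its ID."""
    # Illustrative body. scan generates test inputs from the
    # docstring, then re-runs the call under injected failures.
    return {"order_id": order_id, "status": "refunded"}
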
02

Production

tool_pouch.wrap_openai(client) intercepts every chat.completions.create call. Captures input, output, tool calls, latency. Sub-millisecond overhead. PII-redacted at capture. Pipes to your existing log aggregator out of the box.

client = tool_pouch.wrap_openai(OpenAI())
03

Incident response

tool-pouch replay <trace_id> re-runs a captured trace under chaos. Pair with --repeat 100 to surface flaky failure rates. Use --frozen for a deterministic walk-through, --frozen-tools to isolate model behavior.

tool-pouch replay abc12345 --repeat 100
Production capture

One line. Every request. Zero surprises.

Wrap your OpenAI or Anthropic client and Tool Pouch captures every subsequent call to a destination of your choice. No framework lock-in. No agent rewrite.

The proxy enqueues to a background writer thread on a bounded, non-blocking queue. The hot path is the queue put; serialization, redaction, truncation, and destination IO all run off the request thread.
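
In spirit, the hot path looks something like this sketch (the internal names here are illustrative, not Tool Pouch's actual code):

import json
import queue
import sys
import threading

_q: "queue.Queue[dict]" = queue.Queue(maxsize=10_000)  # bounded: memory can't grow without limit

def capture(event: dict) -> None:
    """Hot path: a single non-blocking enqueue on the request thread."""
    try:
        _q.put_nowait(event)
    except queue.Full:
        pass  # fail-open: drop the event rather than stall the request

def _writer() -> None:
    """Background thread: serialization, redaction, and sink IO live here."""
    while True:
        event = _q.get()
        record = json.dumps(event)  # stand-in for serialize/redact/truncate
        try:
            sys.stderr.write(record + "\n")  # stand-in for a destination
        except Exception:
            pass  # a misbehaving sink never propagates to the app

threading.Thread(target=_writer, daemon=True).start()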

PII redaction runs at capture time by default (emails, phone numbers, SSNs, credit cards, IPs, API keys) and extends with your own regex.

import tool_pouch
from openai import OpenAI

client = tool_pouch.wrap_openai(
    OpenAI(),
    agent_name="support_bot",
    request_id=lambda **kw: kw.get("user"),
    redact=tool_pouch.redact.builtin(extra_patterns=[
        r"acct_\d{6}",
    ]),
    destinations=[tool_pouch.JSONLogger()],
)

# Use it exactly like before.
client.chat.completions.create(...)
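
The Anthropic path mirrors this. wrap_anthropic is named in the FAQ below; assuming it takes the same keyword arguments as wrap_openai (that mirroring is an assumption):

import tool_pouch
from anthropic import Anthropic

client = tool_pouch.wrap_anthropic(
    Anthropic(),
    agent_name="support_bot",
)

# Captured the same way as the OpenAI client above.
client.messages.create(...)
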
Replay

"Would it reproduce?" Answered in one command.

Most production incidents look obvious in hindsight and impossible to reproduce. Tool Pouch turns the captured trace into a runnable scenario and lets you ask, against your real model and tools, what happens this time.

Layered modes give you control: --frozen for review, --frozen-tools to isolate model variability, full chaos for a true re-run with fault injection. --repeat N aggregates verdicts across runs so you can see flaky failure rates as percentages.

# Walk through what actually happened. No API calls.
tool-pouch replay abc12345 --frozen

# Re-call your model; stub tools with captured outputs.
tool-pouch replay abc12345 --frozen-tools

# Default: full chaos. Real model, real tools, injected scenarios.
tool-pouch replay abc12345

# 100 chaos replays → "this fails 78% of the time on timeouts."
tool-pouch replay abc12345 --repeat 100

Destinations

Capture once. Pipe anywhere.

LocalStore

SQLite at ~/.tool_pouch/tool_pouch.db. Default. Ideal for dev and staging. WAL mode + multi-process safe.

JSONLogger

NDJSON to any writable stream; defaults to stderr. Pipes natively into Datadog, Honeycomb, Loki, CloudWatch via your existing log agent.

HTTPSink

Batched POST to a URL you control. Configurable batch size, flush interval, retries. Drop-in for in-house observability backends.
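
A sketch wiring all three together. LocalStore, JSONLogger, and HTTPSink are the destinations above; the HTTPSink keyword names (batch_size, flush_interval, max_retries) are assumptions read off its description:

import tool_pouch
from openai import OpenAI

client = tool_pouch.wrap_openai(
    OpenAI(),
    destinations=[
        tool_pouch.LocalStore(),    # SQLite at ~/.tool_pouch/ (the default)
        tool_pouch.JSONLogger(),    # NDJSON, defaults to stderr
        tool_pouch.HTTPSink(        # assumed kwargs, per the description above
            "https://obs.internal.example/traces",
            batch_size=100,
            flush_interval=5.0,
            max_retries=3,
        ),
    ],
)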

A future CloudStore destination will arrive with Tool Pouch Cloud. The wrap API stays unchanged. Pass the new destination, traces flow.

Cloud: coming soon

Replay incidents from anywhere. Triage with your team.

Tool Pouch Cloud is the optional hosted layer. Push captured traces from any environment, search by request_id, replay across your team, retain for compliance. Until it ships, the OSS path is already production-ready via JSONLogger and HTTPSink.

One email at launch. No spam, no sharing your address.

FAQ

Common questions.

How is this different from observability tools I already pay for?
Datadog and Honeycomb watch your traffic and tell you when latency or error rates spike. They can't replay a captured request through your agent under chaos to test whether a fix actually fixes anything. Tool Pouch captures the same telemetry (and pipes to those tools) but adds the replay primitive on top of it.
Does the wrap slow my requests down?
The proxy hot path is a non-blocking queue put. Sub-millisecond p99 in our perf suite. All serialization, redaction, truncation, and destination IO happen on the writer thread, off the request path entirely.
What happens if a destination crashes?
The writer logs to stderr and continues. A misbehaving sink cannot stop captures from flowing or affect your application. Fail-open is the contract.
How does PII redaction work?
tool_pouch.redact.builtin() ships with regex for emails, phones, SSNs, credit cards, IPs, OpenAI/Anthropic keys, AWS/GitHub tokens, and bearer tokens. Pass extra_patterns=[r"acct_\d{6}"] to extend, or pass any callable as redact= to fully customize. Defaults to redaction at capture time so PII never enters the queue.
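A minimal sketch of a fully custom redactor, assuming the callable receives and returns the text being captured (that signature is an assumption):

import re
import tool_pouch
from openai import OpenAI

def redact_account_ids(text: str) -> str:
    # Hypothetical custom redactor: mask internal account IDs
    # before anything enters the capture queue.
    return re.sub(r"acct_\d{6}", "[REDACTED_ACCT]", text)

client = tool_pouch.wrap_openai(OpenAI(), redact=redact_account_ids)
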
Can I disable wrap in CI or tests?
Set TOOL_POUCH_DISABLE_WRAP=1 and every wrap_openai / wrap_anthropic call becomes a no-op passthrough.
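For example, in a CI job (pytest here is just an illustration):

TOOL_POUCH_DISABLE_WRAP=1 pytest tests/
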
Which Python versions are supported?
Python 3.10, 3.11, 3.12, and 3.13. CI matrix runs against the pinned and latest OpenAI & Anthropic SDKs.
What about MCP, LangGraph, or my custom orchestrator?
For pre-deploy testing, tool-pouch run takes any agent that exposes agent_fn + real_tool_fn + tools + test_inputs. For production capture, point your underlying OpenAI/Anthropic client through wrap_openai / wrap_anthropic. The orchestration layer doesn't matter.
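
A sketch of that harness shape. The four names (agent_fn, real_tool_fn, tools, test_inputs) come from Tool Pouch; the module layout and bodies are assumptions:

# agent_harness.py: hypothetical module handed to tool-pouch run
import tool_pouch

@tool_pouch.tool
def get_order(order_id: str) -> dict:
    """Fetch an order record by ID."""
    return {"order_id": order_id, "status": "shipped"}

def real_tool_fn(name: str, args: dict):
    # Dispatch a model-issued tool call to the real implementation.
    return {"get_order": get_order}[name](**args)

def agent_fn(user_input: str, tools: list):
    # Your orchestration goes here: LangGraph, MCP, a hand-rolled loop.
    ...

tools = [get_order]
test_inputs = ["Where is order 81723?"]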