Testing, Observability, and Iteration - Duckie

Duckie agents improve through a loop: test, deploy carefully, observe real runs, then update the agent. Treat this as an operating model, not just a debugging process: validate behavior before Duckie replies publicly, watch early performance closely, and keep improving based on real conversations.

Key Concepts

Concept	Meaning
Run	One agent execution triggered by a customer message, replay, batch test, schedule, or deployment
Testing	Deployment mode for reviewing behavior before Duckie responds live
Live	Deployment mode where responses and actions can affect real customer systems
Rubric	Scoring criteria for Batch Test results
Resolution	How Duckie determines whether a conversation was resolved, deflected, escalated, or still pending
Categories and Attributes	Labels that make performance easier to analyze by topic, priority, product area, or outcome

The Testing and Rollout Loop

Configure the agent with knowledge, guidelines, guardrails, runbooks, workflows, and tools.
Use Test > Playground for fast, interactive scenario testing.
Use Test > Replay Chats to compare Duckie against real historical conversations.
Turn important scenarios into Test > Batch Test suites.
Run Batch Tests with a Rubric and optional Agent test instructions.
Create a deployment in Testing mode with Internal notes only and No write actions when needed.
Review Analyze > Runs and fix gaps in knowledge, guidelines, guardrails, runbooks, workflows, or tool access.
Switch the deployment to Live only after quality is consistent.
Monitor Performance, Breakdown, Runs, and Alerts after launch.
Repeat the loop after major product, policy, or workflow changes.
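
The "switch to Live only after quality is consistent" step can be made concrete with a small gate over recent Batch Test scores. The following is a minimal sketch in Python, not a Duckie feature or API: the function name `ready_for_live`, the 0-to-1 score scale, and all thresholds are assumptions chosen for illustration.

```python
from statistics import mean


def ready_for_live(batch_scores, min_score=0.8, min_runs=3, max_spread=0.1):
    """Return True when the most recent Batch Test scores are high and stable.

    Hypothetical helper: the score scale (0 to 1) and every threshold here
    are illustrative assumptions, not Duckie defaults.
    """
    recent = batch_scores[-min_runs:]
    if len(recent) < min_runs:
        return False  # not enough Batch Test runs to judge consistency
    spread = max(recent) - min(recent)
    # "Consistent quality" here means a high average and little variation.
    return mean(recent) >= min_score and spread <= max_spread


# One weak early run followed by three stable, strong runs.
print(ready_for_live([0.6, 0.85, 0.9, 0.88]))
```

Checking the spread as well as the average keeps one strong run from masking an unstable agent; a team would tune both thresholds to its own Rubric.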

What to Observe

Surface	Use it for
Analyze > Runs	Inspect Conversation, Agent Steps, Agent Calls, Attributes, Category, Resolution, Event Source, Tool Input, and Tool Output
Analyze > Performance	Track volume, deflection, resolution, escalation, response time, and time to resolution
Analyze > Breakdown	Review Category Breakdown and Attribute Breakdown, then drill into matching runs
Analyze > Alerts	Notify the team when escalation rate, response time, error rate, volume, or resolution rate changes unexpectedly
Train > Knowledge > Gaps	Turn unanswered questions into better knowledge coverage
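
To show how the Performance metrics relate to individual runs, here is a rough sketch that derives outcome rates from a list of run records and flags an unexpected change, similar in spirit to an Alert. The record shape (a dict with a `resolution` key), the function names, and the tolerance are assumptions for illustration; Duckie's own data model and alert logic are not described in this document.

```python
from collections import Counter

# Outcome labels taken from the Resolution concept above; the lowercase
# string values are an assumption of this sketch.
OUTCOMES = ("resolved", "deflected", "escalated", "pending")


def outcome_rates(runs):
    """Return the share of runs that ended in each outcome."""
    counts = Counter(run["resolution"] for run in runs)
    total = len(runs)
    return {o: (counts[o] / total if total else 0.0) for o in OUTCOMES}


def rate_changed(baseline, current, tolerance=0.10):
    """Return True when a rate moved more than `tolerance` from its baseline."""
    return abs(current - baseline) > tolerance


runs = [
    {"resolution": "resolved"},
    {"resolution": "resolved"},
    {"resolution": "escalated"},
    {"resolution": "deflected"},
]
rates = outcome_rates(runs)
print(rates["escalated"], rate_changed(0.10, rates["escalated"]))
```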

Examples

Situation	Loop
Launching a new support agent	Playground, Replay Chats, Batch Test, Testing deployment, then Live
Updating a refund policy	Add Batch Test cases, use Agent test instructions, compare old and new results
Investigating an escalation spike	Start in Performance, filter Runs, inspect Agent Steps and Resolution, then update guardrails or knowledge
Filling knowledge gaps	Identify repeated unanswered questions, create or link knowledge, replay the original conversation
Testing tool-heavy workflows	Use Testing mode with No write actions before allowing real updates
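
For the refund policy example, "compare old and new results" could look like the sketch below, which lists Batch Test cases whose score fell after a change. It assumes scores have been copied out by hand into dicts keyed by a case name, on a 0-to-100 scale; the case names, the scale, and `find_regressions` itself are hypothetical.

```python
def find_regressions(old_scores, new_scores, min_drop=10):
    """Return (case, old, new) tuples for cases whose score dropped.

    Hypothetical helper: cases missing from the new results are skipped,
    and the largest drops are listed first.
    """
    regressions = []
    for case_id, old in old_scores.items():
        new = new_scores.get(case_id)
        if new is not None and old - new >= min_drop:
            regressions.append((case_id, old, new))
    return sorted(regressions, key=lambda r: r[1] - r[2], reverse=True)


old = {"refund-window": 90, "partial-refund": 80, "gift-card": 70}
new = {"refund-window": 50, "partial-refund": 85, "gift-card": 55, "new-case": 90}
for case_id, before, after in find_regressions(old, new):
    print(f"{case_id}: {before} -> {after}")
```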

Signs You Should Iterate

The agent escalates too often or too rarely.
Runs show repeated failed tool calls.
Customers ask questions that knowledge does not answer.
Batch Test scores drop after a policy, product, or prompt change.
Resolution rates vary sharply by category or attribute.
Reviewers frequently reject approval requests.
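
One of these signals, resolution rates that vary sharply by category, can be checked with a short script. This is a sketch under stated assumptions: run records are plain dicts with `category` and `resolution` keys, and a gap of more than 25 percentage points from the best category counts as "sharp". Neither the record shape nor the threshold comes from Duckie.

```python
from collections import defaultdict


def resolution_rate_by_category(runs):
    """Return the fraction of resolved runs for each category."""
    totals = defaultdict(int)
    resolved = defaultdict(int)
    for run in runs:
        totals[run["category"]] += 1
        if run["resolution"] == "resolved":
            resolved[run["category"]] += 1
    return {c: resolved[c] / totals[c] for c in totals}


def categories_to_review(runs, max_gap=0.25):
    """Return categories whose rate trails the best category by more than max_gap."""
    rates = resolution_rate_by_category(runs)
    if not rates:
        return []
    best = max(rates.values())
    return sorted(c for c, r in rates.items() if best - r > max_gap)


runs = (
    [{"category": "billing", "resolution": "resolved"}] * 4
    + [{"category": "shipping", "resolution": "resolved"}] * 1
    + [{"category": "shipping", "resolution": "escalated"}] * 3
)
print(categories_to_review(runs))
```

Categories returned by a check like this are natural starting points for drilling into matching runs from the Breakdown view.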

Related Docs

Testing

Learn about Playground, Replay Testing, and Batch Testing.

Runs

Inspect execution details and outcomes.

Performance Metrics

Track volume, resolution, deflection, escalation, and timing.

Knowledge Gaps

Find and close unanswered questions.
