Duckie agents improve through a loop: test, deploy carefully, observe real runs, then update the agent. This is an operating model, not just a debugging process. Validate behavior before the agent replies publicly, watch early performance closely, and keep improving based on real conversations.

Key Concepts

Concept                    Meaning
Run                        One agent execution triggered by a customer message, replay, batch test, schedule, or deployment
Testing                    Deployment mode for reviewing behavior before Duckie responds live
Live                       Deployment mode where responses and actions can affect real customer systems
Rubric                     Scoring criteria for Batch Test results
Resolution                 How Duckie determines whether a conversation was resolved, deflected, escalated, or still pending
Categories and Attributes  Labels that make performance easier to analyze by topic, priority, product area, or outcome

The Testing and Rollout Loop

  1. Configure the agent with knowledge, guidelines, guardrails, runbooks, workflows, and tools.
  2. Use Test > Playground for fast, interactive scenario testing.
  3. Use Test > Replay Chats to compare Duckie against real historical conversations.
  4. Turn important scenarios into Test > Batch Test suites.
  5. Run Batch Tests with a Rubric and optional Agent test instructions.
  6. Create a deployment in Testing mode with Internal notes only and No write actions when needed.
  7. Review Analyze > Runs and fix gaps in knowledge, guidelines, guardrails, runbooks, workflows, or tool access.
  8. Switch the deployment to Live only after quality is consistent.
  9. Monitor Performance, Breakdown, Runs, and Alerts after launch.
  10. Repeat the loop after major product, policy, or workflow changes.
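
Step 8 asks for quality to be "consistent" before switching to Live. One way to make that concrete is to gate the switch on recent Batch Test scores. The following is a minimal sketch, not a Duckie feature: the function name, the 0.0 to 1.0 score scale, and every threshold are assumptions chosen for illustration.

```python
from statistics import mean, pstdev


def ready_for_live(batch_scores, min_mean=0.9, max_spread=0.05, min_runs=3):
    """Return True when the most recent Batch Test scores are high and stable.

    batch_scores: overall scores (assumed 0.0-1.0) from Batch Test runs, oldest first.
    The thresholds are illustrative assumptions, not Duckie defaults.
    """
    if len(batch_scores) < min_runs:
        return False  # not enough evidence yet
    recent = batch_scores[-min_runs:]
    # Require a high average and a small population standard deviation.
    return mean(recent) >= min_mean and pstdev(recent) <= max_spread


# An early weak score no longer counts once three strong runs follow it.
print(ready_for_live([0.72, 0.91, 0.93, 0.94]))
# One bad run among the latest three blocks the switch.
print(ready_for_live([0.95, 0.70, 0.96]))
```

Looking only at the latest few runs means earlier fixes are not penalized forever, while a single recent regression still keeps the deployment in Testing.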

What to Observe

Surface                   Use it for
Analyze > Runs            Inspect Conversation, Agent Steps, Agent Calls, Attributes, Category, Resolution, Event Source, Tool Input, and Tool Output
Analyze > Performance     Track volume, deflection, resolution, escalation, response time, and time to resolution
Analyze > Breakdown       Review Category Breakdown and Attribute Breakdown, then drill into matching runs
Analyze > Alerts          Notify the team when escalation rate, response time, error rate, volume, or resolution rate changes unexpectedly
Train > Knowledge > Gaps  Turn unanswered questions into better knowledge coverage
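
To illustrate what a category breakdown of Resolution amounts to, here is a small sketch that computes outcome rates per category from a list of run records. The record shape (dicts with "category" and "resolution" keys) is a hypothetical stand-in, since this page does not describe how run data is exported; the outcome values mirror the four Resolution states named above.

```python
from collections import Counter, defaultdict


def resolution_rates(runs):
    """Return {category: {outcome: share of that category's runs}}.

    runs: iterable of dicts with hypothetical "category" and "resolution" keys.
    """
    counts = defaultdict(Counter)
    for run in runs:
        counts[run["category"]][run["resolution"]] += 1

    rates = {}
    for category, by_outcome in counts.items():
        total = sum(by_outcome.values())
        rates[category] = {outcome: n / total for outcome, n in by_outcome.items()}
    return rates


# Made-up sample data for illustration only.
sample_runs = [
    {"category": "billing", "resolution": "resolved"},
    {"category": "billing", "resolution": "escalated"},
    {"category": "billing", "resolution": "resolved"},
    {"category": "billing", "resolution": "resolved"},
    {"category": "login", "resolution": "deflected"},
    {"category": "login", "resolution": "pending"},
]
rates = resolution_rates(sample_runs)
print(rates["billing"])  # {'resolved': 0.75, 'escalated': 0.25}
print(rates["login"])    # {'deflected': 0.5, 'pending': 0.5}
```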

Examples

Situation                          Loop
Launching a new support agent      Playground, Replay Chats, Batch Test, Testing deployment, then Live
Updating a refund policy           Add Batch Test cases, use Agent test instructions, compare old and new results
Investigating an escalation spike  Start in Performance, filter Runs, inspect Agent Steps and Resolution, then update guardrails or knowledge
Filling knowledge gaps             Identify repeated unanswered questions, create or link knowledge, replay the original conversation
Testing tool-heavy workflows       Use Testing mode with No write actions before allowing real updates

Signs You Should Iterate

  • The agent escalates too often or too rarely.
  • Runs show repeated failed tool calls.
  • Customers ask questions that knowledge does not answer.
  • Batch Test scores drop after a policy, product, or prompt change.
  • Resolution rates vary sharply by category or attribute.
  • Reviewers frequently reject approval requests.
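
Two of these signs, a Batch Test score drop and resolution rates that vary sharply by category, are easy to check mechanically. The sketch below shows one possible check; the function name, signal labels, and thresholds are all assumptions for illustration and do not correspond to any built-in Duckie alert.

```python
def iteration_signals(prev_score, curr_score, resolution_by_category,
                      max_drop=0.05, max_gap=0.25):
    """Return a list of signal names suggesting the agent needs another pass.

    prev_score, curr_score: Batch Test scores (assumed 0.0-1.0) before and after a change.
    resolution_by_category: {category: resolution rate}, also assumed 0.0-1.0.
    Thresholds are illustrative, not recommended values.
    """
    signals = []
    if prev_score - curr_score > max_drop:
        signals.append("batch_score_drop")
    if resolution_by_category:
        rates = resolution_by_category.values()
        # A wide gap between the best and worst category hints at uneven coverage.
        if max(rates) - min(rates) > max_gap:
            signals.append("uneven_resolution")
    return signals


print(iteration_signals(0.92, 0.81, {"billing": 0.85, "login": 0.40}))
# ['batch_score_drop', 'uneven_resolution']
print(iteration_signals(0.90, 0.91, {"billing": 0.80, "login": 0.75}))
# []
```

The remaining signs, such as repeated failed tool calls or rejected approval requests, depend on details visible in Analyze > Runs and are better judged by reading the runs themselves.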

Testing

Learn about Playground, Replay Testing, and Batch Testing.

Runs

Inspect execution details and outcomes.

Performance Metrics

Track volume, resolution, deflection, escalation, and timing.

Knowledge Gaps

Find and close unanswered questions.