Key Concepts
| Concept | Meaning |
|---|---|
| Run | One agent execution, triggered by a customer message, a replay, a Batch Test, a schedule, or a deployment |
| Testing | Deployment mode for reviewing behavior before Duckie responds live |
| Live | Deployment mode where responses and actions can affect real customer systems |
| Rubric | Scoring criteria for Batch Test results |
| Resolution | How Duckie determines whether a conversation was resolved, deflected, escalated, or still pending |
| Categories and Attributes | Labels that make performance easier to analyze by topic, priority, product area, or outcome |
The Testing and Rollout Loop
- Configure the agent with knowledge, guidelines, guardrails, runbooks, workflows, and tools.
- Use Test > Playground for fast, interactive scenario testing.
- Use Test > Replay Chats to compare Duckie against real historical conversations.
- Turn important scenarios into Test > Batch Test suites.
- Run Batch Tests with a Rubric and optional Agent test instructions.
- Create a deployment in Testing mode, and enable Internal notes only and No write actions when needed.
- Review Analyze > Runs and fix gaps in knowledge, guidelines, guardrails, runbooks, workflows, or tool access.
- Switch the deployment to Live only after Batch Test scores and reviewed runs show consistent quality.
- Monitor Performance, Breakdown, Runs, and Alerts after launch.
- Repeat the loop after major product, policy, or workflow changes.
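The loop does not define how to judge whether quality is consistent enough for Live. One way to make that decision explicit is a small gate over your own Batch Test history. The sketch below is hypothetical and not a Duckie feature: it assumes you record one overall rubric score per Batch Test run of the same suite on a 0 to 100 scale, and the thresholds (`MIN_MEAN_SCORE`, `MIN_SINGLE_SCORE`, `MAX_SPREAD`) are illustrative values to tune.

```python
from statistics import mean, pstdev

# Hypothetical thresholds, assuming rubric scores on a 0-100 scale.
MIN_MEAN_SCORE = 85.0    # average score across recent runs
MIN_SINGLE_SCORE = 75.0  # no single run may fall below this floor
MAX_SPREAD = 5.0         # max population standard deviation across runs


def ready_for_live(recent_suite_scores: list[float]) -> bool:
    """Return True when recent Batch Test scores are high and stable.

    `recent_suite_scores` holds one overall score per Batch Test run of the
    same suite. At least three runs are required so one lucky run cannot
    justify switching to Live.
    """
    if len(recent_suite_scores) < 3:
        return False
    return (
        mean(recent_suite_scores) >= MIN_MEAN_SCORE
        and min(recent_suite_scores) >= MIN_SINGLE_SCORE
        and pstdev(recent_suite_scores) <= MAX_SPREAD
    )


print(ready_for_live([88.0, 90.0, 87.5]))  # → True (high and stable)
print(ready_for_live([95.0, 70.0, 92.0]))  # → False (one run below the floor)
```

Requiring several runs and a bounded spread keeps a single strong result from being mistaken for consistent quality.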
What to Observe
| Surface | Use it for |
|---|---|
| Analyze > Runs | Inspect Conversation, Agent Steps, Agent Calls, Attributes, Category, Resolution, Event Source, Tool Input, and Tool Output |
| Analyze > Performance | Track volume, deflection, resolution, escalation, response time, and time to resolution |
| Analyze > Breakdown | Review Category Breakdown and Attribute Breakdown, then drill into matching runs |
| Analyze > Alerts | Notify the team when escalation rate, response time, error rate, volume, or resolution rate changes unexpectedly |
| Train > Knowledge > Gaps | Turn unanswered questions into better knowledge coverage |
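Analyze > Alerts sends the notifications itself; the sketch below only illustrates the kind of comparison an "unexpected change" check implies, which can help when you choose alert thresholds. It is a hypothetical example, not Duckie's alerting logic: it assumes you have each run's Resolution value as a lowercase string, and `changed_unexpectedly` and its 50% default are illustrative names and values.

```python
# Hypothetical resolution label, mirroring the "escalated" outcome under Resolution.
ESCALATED = "escalated"


def escalation_rate(resolutions: list[str]) -> float:
    """Share of runs whose resolution is 'escalated' (0.0 for an empty window)."""
    if not resolutions:
        return 0.0
    return sum(1 for r in resolutions if r == ESCALATED) / len(resolutions)


def changed_unexpectedly(baseline: list[str], current: list[str],
                         max_relative_change: float = 0.5) -> bool:
    """Flag the current window when its escalation rate moves more than
    max_relative_change (50% by default) away from the baseline, up or down."""
    base = escalation_rate(baseline)
    now = escalation_rate(current)
    if base == 0.0:
        # With no escalations in the baseline, any escalation counts as a change.
        return now > 0.0
    return abs(now - base) / base > max_relative_change


last_week = ["resolved"] * 8 + ["escalated"] * 2  # 20% escalated
today = ["resolved"] * 6 + ["escalated"] * 4      # 40% escalated
print(changed_unexpectedly(last_week, today))     # → True
```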
Examples
| Situation | Loop |
|---|---|
| Launching a new support agent | Playground, Replay Chats, Batch Test, Testing deployment, then Live |
| Updating a refund policy | Add Batch Test cases, use Agent test instructions, compare old and new results |
| Investigating an escalation spike | Start in Performance, filter Runs, inspect Agent Steps and Resolution, then update guardrails or knowledge |
| Filling knowledge gaps | Identify repeated unanswered questions, create or link knowledge, then replay the original conversation |
| Testing tool-heavy workflows | Use Testing mode with No write actions before allowing real updates |
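For the refund policy example, comparing old and new results comes down to matching Batch Test cases by name and listing the ones whose score fell. This is a hypothetical helper over scores you have recorded yourself; the case names and scores are made-up sample data, and `find_regressions` is not part of Duckie.

```python
def find_regressions(old: dict[str, float], new: dict[str, float],
                     tolerance: float = 0.0) -> list[str]:
    """Return case names whose new score is lower than the old score by more
    than `tolerance`, sorted for stable output. Cases present in only one
    of the two runs are skipped."""
    shared = old.keys() & new.keys()
    return sorted(case for case in shared if old[case] - new[case] > tolerance)


# Hypothetical per-case rubric scores before and after a refund policy change.
old_scores = {"refund_within_30_days": 92.0, "refund_after_30_days": 88.0, "partial_refund": 80.0}
new_scores = {"refund_within_30_days": 94.0, "refund_after_30_days": 71.0, "partial_refund": 79.0}

print(find_regressions(old_scores, new_scores, tolerance=2.0))  # → ['refund_after_30_days']
```

A small `tolerance` ignores run-to-run noise; setting it to 0.0 reports every drop, however slight.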
Signs You Should Iterate
- The agent escalates too often or too rarely.
- Runs show repeated failed tool calls.
- Customers ask questions that knowledge does not answer.
- Batch Test scores drop after a policy, product, or prompt change.
- Resolution rates vary sharply by category or attribute.
- Reviewers frequently reject approval requests.
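To check whether resolution rates vary sharply by category, one option is to compare each category's rate with the overall rate, as a numeric counterpart to reviewing Category Breakdown. The sketch assumes a hypothetical list of `(category, resolution)` pairs taken from your runs, counts only `resolved` as resolved, and uses an illustrative gap of 0.2 (20 percentage points) as the cutoff.

```python
from collections import defaultdict

# Hypothetical run export: (category, resolution) pairs. Only "resolved" counts
# as resolved here; decide separately whether "deflected" should count too.


def resolution_rates(runs: list[tuple[str, str]]) -> dict[str, float]:
    """Share of runs per category whose resolution is 'resolved'."""
    totals: dict[str, int] = defaultdict(int)
    resolved: dict[str, int] = defaultdict(int)
    for category, resolution in runs:
        totals[category] += 1
        if resolution == "resolved":
            resolved[category] += 1
    return {category: resolved[category] / totals[category] for category in totals}


def outlier_categories(runs: list[tuple[str, str]], max_gap: float = 0.2) -> list[str]:
    """Categories whose resolution rate differs from the overall rate by more than max_gap."""
    if not runs:
        return []
    overall = sum(1 for _, resolution in runs if resolution == "resolved") / len(runs)
    rates = resolution_rates(runs)
    return sorted(c for c, rate in rates.items() if abs(rate - overall) > max_gap)


runs = (
    [("billing", "resolved")] * 9 + [("billing", "escalated")] * 1
    + [("shipping", "resolved")] * 3 + [("shipping", "escalated")] * 7
    + [("account", "resolved")] * 6 + [("account", "pending")] * 4
)
print(outlier_categories(runs))  # → ['billing', 'shipping']
```

Flagging deviations in both directions is deliberate: an unusually high rate can be as worth inspecting as a low one, for example when it hides runs that should have escalated.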
Related Docs
Testing
Learn about Playground, Replay Testing, and Batch Testing.
Runs
Inspect execution details and outcomes.
Performance Metrics
Track volume, resolution, deflection, escalation, and timing.
Knowledge Gaps
Find and close unanswered questions.