# Testing, Observability, and Iteration

> Use Duckie's testing and analytics surfaces to launch agents safely, monitor quality, and improve over time

Duckie agents improve through a loop: test, deploy carefully, observe real runs, then update the agent.

This is an operating model, not just a debugging process: validate behavior before Duckie replies to customers publicly, watch early performance after launch, and keep improving based on real conversations.

## Key Concepts

| Concept                       | Meaning                                                                                           |
| ----------------------------- | ------------------------------------------------------------------------------------------------- |
| **Run**                       | One agent execution triggered by a customer message, replay, batch test, schedule, or deployment  |
| **Testing**                   | Deployment mode for reviewing behavior before Duckie responds live                                |
| **Live**                      | Deployment mode where responses and actions can affect real customer systems                      |
| **Rubric**                    | Scoring criteria for Batch Test results                                                           |
| **Resolution**                | How Duckie determines whether a conversation was resolved, deflected, escalated, or still pending |
| **Categories and Attributes** | Labels that make performance easier to analyze by topic, priority, product area, or outcome       |

## The Testing and Rollout Loop

1. Configure the agent with knowledge, guidelines, guardrails, runbooks, workflows, and tools.
2. Use **Test > Playground** for fast, interactive scenario testing.
3. Use **Test > Replay Chats** to compare Duckie against real historical conversations.
4. Turn important scenarios into **Test > Batch Test** suites.
5. Run Batch Tests with a **Rubric** and optional Agent test instructions.
6. Create a deployment in **Testing** mode, enabling **Internal notes only** and **No write actions** where you need to review behavior without affecting customers or real systems.
7. Review **Analyze > Runs** and fix gaps in knowledge, guidelines, guardrails, runbooks, workflows, or tool access.
8. Switch the deployment to **Live** only after quality is consistent.
9. Monitor **Performance**, **Breakdown**, **Runs**, and **Alerts** after launch.
10. Repeat the loop after major product, policy, or workflow changes.
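
The "quality is consistent" check in step 8 can be made explicit as a small gate over recent Batch Test scores. The following is a minimal sketch, not a Duckie feature or API: the function name `ready_for_live`, the 0 to 1 score scale, and all thresholds are assumptions chosen for illustration, and the scores are presumed to be recorded by your team from Batch Test results.

```python
from statistics import mean, pstdev


def ready_for_live(scores, min_mean=0.9, max_stdev=0.05, min_runs=3):
    """Return True when recent Batch Test scores are both high and stable.

    scores: rubric scores from recent Batch Test runs, assumed to be in [0, 1].
    min_mean, max_stdev, min_runs: hypothetical thresholds; tune to your rubric.
    """
    if len(scores) < min_runs:
        return False  # too few runs to judge consistency
    # Require a high average and a small spread across runs.
    return mean(scores) >= min_mean and pstdev(scores) <= max_stdev


# Hypothetical scores from three consecutive Batch Test runs.
recent_scores = [0.92, 0.95, 0.93]
print(ready_for_live(recent_scores))
```

Checking the spread as well as the average keeps one strong run from hiding an unstable agent.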

## What to Observe

| Surface                  | Use it for                                                                                                                  |
| ------------------------ | --------------------------------------------------------------------------------------------------------------------------- |
| Analyze > Runs           | Inspect Conversation, Agent Steps, Agent Calls, Attributes, Category, Resolution, Event Source, Tool Input, and Tool Output |
| Analyze > Performance    | Track volume, deflection, resolution, escalation, response time, and time to resolution                                     |
| Analyze > Breakdown      | Review Category Breakdown and Attribute Breakdown, then drill into matching runs                                            |
| Analyze > Alerts         | Notify the team when escalation rate, response time, error rate, volume, or resolution rate changes unexpectedly            |
| Train > Knowledge > Gaps | Turn unanswered questions into better knowledge coverage                                                                    |
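
To make the Performance and Alerts rows concrete, the sketch below computes outcome rates from a list of run records and flags a rate that has drifted from a baseline. It is illustrative only: the record shape (a dict with a `resolution` key), the outcome labels, the function names, and the tolerance are all assumptions, and it does not describe how Duckie calculates these metrics or evaluates alerts.

```python
from collections import Counter


def outcome_rates(runs):
    """Return the share of runs per resolution outcome.

    runs: list of dicts with a hypothetical "resolution" key, for example
    "resolved", "deflected", "escalated", or "pending".
    """
    counts = Counter(run["resolution"] for run in runs)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {outcome: n / total for outcome, n in counts.items()}


def changed_unexpectedly(current, baseline, tolerance=0.10):
    """Return True when a rate moves more than `tolerance` from its baseline."""
    return abs(current - baseline) > tolerance


# Hypothetical run records, as if copied from a review of recent runs.
sample_runs = [
    {"resolution": "resolved"},
    {"resolution": "resolved"},
    {"resolution": "escalated"},
    {"resolution": "deflected"},
]
rates = outcome_rates(sample_runs)
print(rates)
print(changed_unexpectedly(rates["escalated"], baseline=0.10))
```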

## Examples

| Situation                         | Loop                                                                                                       |
| --------------------------------- | ---------------------------------------------------------------------------------------------------------- |
| Launching a new support agent     | Playground, Replay Chats, Batch Test, Testing deployment, then Live                                        |
| Updating a refund policy          | Add Batch Test cases, use Agent test instructions, compare old and new results                             |
| Investigating an escalation spike | Start in Performance, filter Runs, inspect Agent Steps and Resolution, then update guardrails or knowledge |
| Filling knowledge gaps            | Identify repeated unanswered questions, create or link knowledge, replay the original conversation         |
| Testing tool-heavy workflows      | Use Testing mode with No write actions before allowing real updates                                        |

## Signs You Should Iterate

* The agent escalates too often or too rarely.
* Runs show repeated failed tool calls.
* Customers ask questions that knowledge does not answer.
* Batch Test scores drop after a policy, product, or prompt change.
* Resolution rates vary sharply by category or attribute.
* Reviewers frequently reject approval requests.
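
One of the signs above, resolution rates that vary sharply by category, can be checked with a simple comparison against the overall rate. This is a minimal sketch under stated assumptions: the record fields `category` and `resolution`, the `"resolved"` label, the function names, and the 0.2 gap are hypothetical and do not reflect a Duckie export format or API.

```python
from collections import defaultdict


def resolution_rate_by_category(runs):
    """Return the fraction of resolved runs for each category."""
    totals = defaultdict(int)
    resolved = defaultdict(int)
    for run in runs:
        totals[run["category"]] += 1
        if run["resolution"] == "resolved":
            resolved[run["category"]] += 1
    return {category: resolved[category] / totals[category] for category in totals}


def lagging_categories(runs, gap=0.2):
    """Return categories whose resolution rate trails the overall rate by more than `gap`."""
    if not runs:
        return []
    overall = sum(1 for run in runs if run["resolution"] == "resolved") / len(runs)
    rates = resolution_rate_by_category(runs)
    return sorted(c for c, rate in rates.items() if overall - rate > gap)


# Hypothetical records: billing resolves well, refunds does not.
sample_runs = (
    [{"category": "billing", "resolution": "resolved"}] * 3
    + [{"category": "billing", "resolution": "escalated"}]
    + [{"category": "refunds", "resolution": "resolved"}]
    + [{"category": "refunds", "resolution": "escalated"}] * 3
)
print(lagging_categories(sample_runs))
```

Categories returned by `lagging_categories` are reasonable starting points for filtering runs and reviewing knowledge, guardrails, or tool access for that topic.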

## Related Docs

<CardGroup cols={2}>
  <Card title="Testing" icon="flask" href="/testing/overview">
    Learn about Playground, Replay Chats, and Batch Test.
  </Card>

  <Card title="Runs" icon="clock-rotate-left" href="/analytics/runs">
    Inspect execution details and outcomes.
  </Card>

  <Card title="Performance Metrics" icon="chart-line" href="/analytics/performance-metrics">
    Track volume, resolution, deflection, escalation, and timing.
  </Card>

  <Card title="Knowledge Gaps" icon="circle-question" href="/knowledge/knowledge-gaps">
    Find and close unanswered questions.
  </Card>
</CardGroup>
