Testing ensures your agent behaves correctly before it interacts with real customers. Duckie provides two testing approaches: interactive playground testing and automated batch testing.

Why Testing Matters

Before deploying to production:
Risk                     Prevention
Inaccurate responses     Test with real questions
Guardrails not working   Test escalation triggers
Poor tone or format      Review against guidelines
Missing knowledge        Identify gaps before customers do
Broken tools             Verify actions execute correctly

Testing Methods

Playground Testing

Interactive, real-time testing:
  • Chat directly with your agent
  • See responses immediately
  • View full execution details
  • Iterate quickly on configuration
Best for:
  • Development and debugging
  • Exploring agent behavior
  • Quick validation

Batch Testing

Automated test suites:
  • Define test cases with expected outcomes
  • Run all tests automatically
  • Compare results over time
  • Catch regressions
Best for:
  • Pre-deployment validation
  • Regression testing
  • Consistent quality checks
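
To make the batch-testing idea concrete, here is a minimal sketch in Python. It is not Duckie's batch test format or API, which this page does not describe: `BatchCase`, `run_batch`, `find_regressions`, and `fake_agent` are hypothetical names, and the agent is a canned stand-in. It shows test cases with expected outcomes, an automatic run, and a comparison against a previous run to catch regressions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class BatchCase:
    name: str
    message: str
    expected_phrases: List[str]  # phrases the reply must contain (case-insensitive)


def run_batch(agent: Callable[[str], str], cases: List[BatchCase]) -> Dict[str, bool]:
    """Run every case and return {case name: passed}."""
    results = {}
    for case in cases:
        reply = agent(case.message).lower()
        results[case.name] = all(p.lower() in reply for p in case.expected_phrases)
    return results


def find_regressions(previous: Dict[str, bool], current: Dict[str, bool]) -> List[str]:
    """Names of cases that passed in the previous run but fail in the current one."""
    return sorted(
        name for name, ok in current.items()
        if not ok and previous.get(name, False)
    )


# Hypothetical stand-in for a real agent: a canned lookup, for illustration only.
def fake_agent(message: str) -> str:
    if "refund" in message.lower():
        return "Refunds are processed within 5 business days."
    return "I'm not sure, let me connect you with a teammate."


cases = [
    BatchCase("refund_timing", "How long do refunds take?", ["5 business days"]),
    BatchCase("password_reset", "How do I reset my password?", ["reset link"]),
]

previous_run = {"refund_timing": True, "password_reset": True}  # assumed earlier results
current_run = run_batch(fake_agent, cases)
regressions = find_regressions(previous_run, current_run)
```

Phrase matching is a deliberately crude expected-outcome check; a real suite would likely use richer criteria, but the run-then-compare structure stays the same.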

Testing Workflow

┌─────────────────────┐
│ Configure Agent     │
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│ Test in Playground  │ ←─┐
└──────────┬──────────┘   │
           │              │ Iterate
           ▼              │
┌─────────────────────┐   │
│ Issues Found?       │───┘
└──────────┬──────────┘
           │ No
           ▼
┌─────────────────────┐
│ Save as Batch Tests │
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│ Deploy (Shadow)     │
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│ Run Batch Before    │
│ Going Live          │
└─────────────────────┘

What to Test

Response Quality

  • Are answers accurate?
  • Is the tone appropriate?
  • Is the format correct?
  • Are guidelines being followed?

Knowledge Access

  • Does the agent find relevant information?
  • Are the right sources being searched?
  • Is knowledge used correctly in responses?

Guardrails

  • Do escalation rules trigger correctly?
  • Are restrictions enforced?
  • Does the agent respond appropriately when a guardrail is triggered?
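
The guardrail questions above can be turned into paired positive and negative cases. The sketch below is an illustration under assumptions, not Duckie's guardrail mechanism: `AgentResult`, `toy_agent`, its keyword rule, and `check_guardrails` are all hypothetical. It checks both that escalation fires when it should and that it stays quiet when it should not.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class AgentResult:
    reply: str
    escalated: bool


def toy_agent(message: str) -> AgentResult:
    # Hypothetical rule: escalate on legal threats or cancellation requests.
    triggers = ("lawyer", "lawsuit", "cancel my account")
    if any(t in message.lower() for t in triggers):
        return AgentResult("I'm handing this to a human teammate.", True)
    return AgentResult("Happy to help with that.", False)


def check_guardrails(
    agent: Callable[[str], AgentResult],
    cases: List[Tuple[str, bool]],
) -> List[str]:
    """Return the messages whose escalation outcome differs from the expectation."""
    return [msg for msg, should_escalate in cases
            if agent(msg).escalated != should_escalate]


guardrail_cases = [
    ("I will call my lawyer about this charge", True),
    ("Please cancel my account today", True),
    ("Where can I download my invoice?", False),  # must NOT escalate
    ("I might sue you over this", True),          # paraphrase the toy rule misses
]

failures = check_guardrails(toy_agent, guardrail_cases)
```

Here the last case fails because the toy rule has no match for "sue", which is the kind of gap that paraphrased trigger messages are meant to expose.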

Tools and Actions

  • Do tool calls succeed?
  • Are parameters passed correctly?
  • Do actions have the expected results?
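
One common way to check tool calls without touching a real system is a recording test double. The following sketch assumes a hypothetical `issue_refund` tool and a toy agent that parses "refund order <id>"; none of these names come from Duckie. It verifies that the call happened, that the parameters were passed correctly, and that the result reached the reply.

```python
from typing import Any, Dict, List


class RecordingTool:
    """Test double that records each call instead of touching a real system."""

    def __init__(self, name: str, result: Any):
        self.name = name
        self.result = result
        self.calls: List[Dict[str, Any]] = []

    def __call__(self, **params: Any) -> Any:
        self.calls.append(params)
        return self.result


def toy_agent(message: str, refund_tool: RecordingTool) -> str:
    # Hypothetical behaviour: find "refund order <id>" and call the tool.
    words = message.split()
    lowered = [w.lower() for w in words]
    if "refund" in lowered and "order" in lowered:
        idx = lowered.index("order") + 1
        if idx < len(words):
            outcome = refund_tool(order_id=words[idx], reason="customer_request")
            return f"Refund status: {outcome['status']}"
    return "No action taken."


refund_tool = RecordingTool("issue_refund", {"status": "approved"})
reply = toy_agent("Please refund order A123", refund_tool)

untouched_tool = RecordingTool("issue_refund", {"status": "approved"})
other_reply = toy_agent("What are your opening hours?", untouched_tool)
```

Asserting on `calls` for the second message matters as much as the first: a tool that fires when it should not is also a failure.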

Edge Cases

  • Unusual or ambiguous inputs
  • Very long or very short messages
  • Multiple questions in one message
  • Off-topic requests
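
The edge cases listed above can be generated from a single base question and run as a smoke check. This is a sketch under assumptions: `edge_case_messages`, `smoke_check`, and `toy_agent` (including its made-up 2000-character limit) are hypothetical, and the sample messages are illustrative only.

```python
from typing import Callable, Dict


def edge_case_messages(base_question: str) -> Dict[str, str]:
    """Build one sample message per edge-case category."""
    return {
        "ambiguous": "It doesn't work.",
        "very_short": "?",
        "very_long": (base_question + " ") * 200,
        "multi_question": base_question
        + " Also, what are your hours? And can I change my email?",
        "off_topic": "What's a good recipe for banana bread?",
    }


def smoke_check(agent: Callable[[str], str], messages: Dict[str, str]) -> Dict[str, str]:
    """Return {category: problem} for inputs that crash the agent or get an empty reply."""
    problems = {}
    for name, text in messages.items():
        try:
            reply = agent(text)
        except Exception as exc:
            problems[name] = f"raised {type(exc).__name__}"
            continue
        if not reply.strip():
            problems[name] = "empty reply"
    return problems


def toy_agent(message: str) -> str:
    # Hypothetical limitation: this stand-in rejects inputs over 2000 characters.
    if len(message) > 2000:
        raise ValueError("message too long")
    return f"You asked: {message[:50]}"


messages = edge_case_messages("How do I reset my password?")
problems = smoke_check(toy_agent, messages)
```

A smoke check only shows that the agent survives the input; whether the reply is actually appropriate still needs review against the response-quality questions earlier in this section.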

Test Coverage Checklist

1. Happy Paths
   Test common, expected scenarios that should work smoothly.

2. Edge Cases
   Test unusual inputs, missing information, and boundary conditions.

3. Guardrails
   Test messages that should trigger escalation or restrictions.

4. Knowledge Gaps
   Test questions the agent might not be able to answer.

5. Tool Execution
   Test scenarios that require tool calls.
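
One way to keep the checklist honest is to tag each test case with a category and report any category that has no cases. The sketch below assumes a hypothetical `category` field and category names chosen for illustration; it is not a feature of Duckie.

```python
from collections import Counter
from typing import Dict, List

# Hypothetical category tags mirroring the five checklist items.
REQUIRED_CATEGORIES = [
    "happy_path",
    "edge_case",
    "guardrail",
    "knowledge_gap",
    "tool_execution",
]


def coverage_report(cases: List[Dict[str, str]]) -> Dict[str, int]:
    """Count test cases per required category (zero if none exist)."""
    counts = Counter(case["category"] for case in cases)
    return {category: counts.get(category, 0) for category in REQUIRED_CATEGORIES}


def missing_categories(cases: List[Dict[str, str]]) -> List[str]:
    """Categories from the checklist that have no test cases yet."""
    return [cat for cat, n in coverage_report(cases).items() if n == 0]


suite = [
    {"name": "refund_timing", "category": "happy_path"},
    {"name": "order_status", "category": "happy_path"},
    {"name": "legal_threat", "category": "guardrail"},
    {"name": "refund_order", "category": "tool_execution"},
]

report = coverage_report(suite)
gaps = missing_categories(suite)
```

A count per category says nothing about the quality of the cases, but an empty category is a quick signal that part of the checklist has been skipped.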
