Batch testing lets you run a saved set of conversations against an agent, compare the agent’s response to the expected response, and score the result with a rubric. Use batch tests before deployment, after configuration changes, and whenever you want a repeatable way to catch regressions.
How Batch Tests Work
Batch tests are organized around real support conversations:
| Item | What it means |
|---|---|
| Batch | A reusable test suite. A batch can be manual or created from a connected source. |
| Ticket | One conversation in the batch. Imported tickets can keep a link back to the source conversation. |
| Turn | One test inside a ticket: the current customer message, prior conversation history, and the expected agent response. |
| Run | One execution of the batch against a selected agent. Each run snapshots the tickets and scoring rubric used at the time. |
| Score | A 1-5 quality rating, with optional per-metric scores and notes. |
The expected response is a reference for what good behavior looks like. Scoring is not a strict text diff.
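The hierarchy in the table can be pictured as nested records. The following is a minimal sketch, not Duckie's actual schema: every class, field, and function name here is hypothetical and chosen only to mirror the table. It also illustrates what "snapshot" means for a run: the run keeps its own copy of the tickets and rubric, so later edits to the batch do not change past runs.

```python
import copy
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Turn:
    history: list            # prior messages, oldest first
    customer_message: str    # the message the agent must answer
    expected_response: str   # reference answer, not an exact-match target


@dataclass
class Ticket:
    turns: list
    source_url: Optional[str] = None  # only imported tickets link back to a source


@dataclass
class Batch:
    name: str
    tickets: list = field(default_factory=list)


@dataclass
class Score:
    overall: int                                  # 1-5 quality rating
    metrics: dict = field(default_factory=dict)   # optional per-metric scores
    notes: str = ""


@dataclass
class Run:
    agent_id: str
    tickets: list   # copy of the batch tickets taken when the run starts
    rubric: str     # copy of the rubric text taken when the run starts


def start_run(batch: Batch, agent_id: str, rubric: str) -> Run:
    """Snapshot the batch so later edits do not affect this run (hypothetical helper)."""
    return Run(agent_id=agent_id, tickets=copy.deepcopy(batch.tickets), rubric=rubric)
```

In this sketch, editing a ticket's expected response after `start_run` leaves the run's copy untouched, which is the property the table describes for runs.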
Create a Batch
Choose how to add tickets
Leave Fetch tickets from a source on to pull real conversations from a connected source. Turn it off to create an empty batch and author tickets manually.
Configure imported tickets
If you fetch from a source, choose the source connection, select channels or groups when available, add tags when available, and set the number of tickets and message limits.
Import From Connected Sources
Duckie can generate batch tickets from connected messaging and ticketing sources that provide conversation history.
| Source | Import options |
|---|---|
| Slack | Select channels. |
| Discord | Select channels. |
| Zendesk | Select groups and filter by tags. |
| Freshdesk | Select groups and filter by tags. |
| HubSpot | Pull recent support conversations. |
| Intercom | Pull recent conversations. |
| Plain | Pull recent threads. |
| Pylon | Pull recent issues. |
During import, Duckie:
- looks for recent conversations
- applies your selected channels, groups, tags, and message limits
- filters out conversations below the minimum message count
- clips conversations above the maximum message count
- removes duplicate-looking conversations
- splits each conversation into turn-based tests
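The import steps above can be sketched as a small pipeline. This is an illustration under stated assumptions, not Duckie's implementation: the function names are hypothetical, conversations are modeled as lists of `(role, text)` pairs, clipping is assumed to keep the earliest messages, and "duplicate-looking" is approximated by hashing case- and whitespace-normalized text.

```python
from hashlib import sha256


def normalize(messages):
    """Collapse case and whitespace so near-identical conversations hash the same."""
    return "\n".join(f"{role}:{' '.join(text.lower().split())}" for role, text in messages)


def prepare_conversations(conversations, min_messages, max_messages):
    """Drop short conversations, clip long ones, and remove duplicates."""
    seen = set()
    prepared = []
    for messages in conversations:
        if len(messages) < min_messages:
            continue  # below the minimum message count
        clipped = messages[:max_messages]  # assumption: keep the earliest messages
        digest = sha256(normalize(clipped).encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # duplicate-looking conversation
        seen.add(digest)
        prepared.append(clipped)
    return prepared


def split_into_turns(messages):
    """Emit one turn for each agent reply that directly follows a customer message."""
    turns = []
    for i in range(1, len(messages)):
        role, text = messages[i]
        prev_role, prev_text = messages[i - 1]
        if role == "agent" and prev_role == "customer":
            turns.append({
                "history": messages[: i - 1],      # everything before the customer message
                "customer_message": prev_text,
                "expected_response": text,
            })
    return turns
```

A four-message conversation that alternates customer and agent yields two turns in this sketch: the first with empty history, the second with the first exchange as history.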
Author Tickets Manually
Manual tickets are useful for edge cases, new product behavior, and scenarios that do not already exist in your support history.
Build the conversation
Add customer and agent messages. The ticket must include at least one customer message and one agent message.
Run a Batch
Choose a rubric
Select the scoring rubric for this run. The selected rubric is snapshotted on the run.
Add run instructions
Optionally add agent test instructions. Use these to define special behavior for this test run.
Use Run Instructions
Run instructions let you test behavior that is specific to the test, even when the agent is not normally designed to behave that way. They are added when you start a run and are passed to the agent as test-time instructions. Use run instructions when you want to temporarily change how the agent should behave for the batch, such as:
- do not call a specific tool
- do not perform a specific type of action
- treat each ticket as if no previous escalation happened
- follow a draft policy or experimental process
- answer with a specific tone, format, or level of detail
- assume a feature, plan, or customer state that is not part of normal production behavior
Test Mode Safety
Batch test runs execute in testing mode. This prevents tests from sending customer-visible replies or performing external writes. In testing mode:
- write app tools, custom tools, and MCP tools are skipped
- responder output is converted to internal notes where the source supports internal notes
- Slack and Discord responder delivery is skipped because they do not have an internal note target
- escalation delivery is skipped for batch tests
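The rules above amount to a guard in front of every side effect. The sketch below shows one way such a guard could look; it is a hypothetical illustration, not Duckie's code. The `Tool` class, both function names, and the returned status strings are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Tool:
    name: str
    writes: bool  # True if calling the tool changes external state


def call_tool(tool: Tool, testing_mode: bool) -> dict:
    """Skip write tools in testing mode; read-only tools still run."""
    if testing_mode and tool.writes:
        return {"tool": tool.name, "status": "skipped"}
    return {"tool": tool.name, "status": "executed"}


def deliver_response(supports_internal_notes: bool, testing_mode: bool) -> str:
    """Decide where a responder's output goes."""
    if not testing_mode:
        return "public_reply"
    # In testing mode nothing is customer-visible: use an internal note if the
    # source has one, otherwise skip delivery entirely.
    return "internal_note" if supports_internal_notes else "skipped"
```

For a source without an internal note target, `deliver_response(False, True)` returns `"skipped"`, which matches the Slack and Discord behavior described in the list.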
Score Results
Completed runs can be scored manually or with AI scoring.
Default Rubric
The default rubric scores overall quality on a 1-5 scale and evaluates:
- accuracy
- completeness
- helpfulness
- tone and professionalism
- guideline adherence
Custom Rubrics
Create and manage custom rubrics from Settings → Rubrics. A custom rubric can define:
- evaluator instructions
- skip conditions for an entire result
- a scoring scale from 1 to 10
- custom metrics
- skip conditions for individual metrics
Auto-Score and Manual Score
After a run completes, you can:
- auto-score the full run
- re-score one ticket
- re-score one turn
- manually set overall and metric scores
- add notes
- cancel scoring while it is running
Review Results
Open a run to inspect the tickets and turns it tested. The run detail view shows:
- the expected response
- Duckie’s actual response
- the agent run behind each result
- turn-level scores and scoring reasoning
- ticket-level scores
- run-level average score
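The three score levels in the list above (turn, ticket, run) suggest a simple roll-up. The exact aggregation Duckie uses is not stated here, so the sketch below is an assumption: it averages scored turns into a ticket score, averages ticket scores into a run score, and treats `None` as a skipped or unscored turn. The function names are hypothetical.

```python
from statistics import mean


def ticket_score(turn_scores):
    """Average the scored turns of one ticket; None marks a skipped or unscored turn."""
    scored = [s for s in turn_scores if s is not None]
    return mean(scored) if scored else None


def run_average(tickets):
    """Average ticket-level scores, ignoring tickets where nothing was scored."""
    scores = [s for s in (ticket_score(t) for t in tickets) if s is not None]
    return mean(scores) if scores else None
```

Under these assumptions, a run with turn scores `[[4, 5], [3, None], [None]]` has ticket scores of 4.5 and 3 (the third ticket is ignored) and a run average of 3.75.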
Export Results
For completed runs, use Export to download an Excel workbook. The export includes:
- run metadata
- conversation-level scores
- turn-level inputs, expected outputs, actual outputs, scores, reasoning, skip status, and notes
Best Practices
Start From Real Conversations
Use imported tickets to cover common questions, real phrasing, and multi-turn context that is hard to recreate by hand. Use replay testing when you want to inspect a specific historical conversation interactively before adding it to a batch.
Add Manual Edge Cases
Add custom tickets for policy boundaries, guardrails, tool-heavy workflows, and scenarios you have not seen in production yet.
Keep Expected Responses Fresh
Update expected responses when your product, policy, or desired agent behavior changes.
Use Rubrics For Stable Evaluation
Create custom rubrics when the default quality metrics are too generic for your team. Keep rubric changes intentional, because each run snapshots the rubric used for scoring.
Compare Runs Before Deploying
Run the same batch before and after agent configuration changes. Investigate low-scoring tickets, re-run the specific turns after fixes, then re-score.
Next Steps
Replay Testing
Inspect real conversations interactively
Playground
Test individual scenarios interactively