Batch testing lets you run a saved set of conversations against an agent, compare the agent’s response to the expected response, and score the result with a rubric. Use batch tests before deployment, after configuration changes, and whenever you want a repeatable way to catch regressions.

How Batch Tests Work

Batch tests are organized around real support conversations:
Item    What it means
Batch   A reusable test suite. A batch can be manual or created from a connected source.
Ticket  One conversation in the batch. Imported tickets can keep a link back to the source conversation.
Turn    One test inside a ticket: the current customer message, prior conversation history, and the expected agent response.
Run     One execution of the batch against a selected agent. Each run snapshots the tickets and scoring rubric used at the time.
Score   A 1-5 quality rating, with optional per-metric scores and notes.
The expected response is a reference for what good behavior looks like. Scoring is not a strict text diff.

Create a Batch

1. Open Batch Test
   Go to Test → Batch Test.

2. Create a new batch
   Click New batch and enter a name.

3. Choose how to add tickets
   Leave the Fetch tickets from a source option on to pull real conversations from a connected source. Turn it off to create an empty batch and author tickets manually.

4. Configure imported tickets
   If you fetch from a source, choose the source connection, select channels or groups when available, add tags when available, and set the number of tickets and the message limits.

5. Create the batch
   Click Create & fetch for imported tickets, or Create for a manual batch.

Import From Connected Sources

Duckie can generate batch tickets from connected messaging and ticketing sources that provide conversation history.
Source     Import options
Slack      Select channels.
Discord    Select channels.
Zendesk    Select groups and filter by tags.
Freshdesk  Select groups and filter by tags.
HubSpot    Pull recent support conversations.
Intercom   Pull recent conversations.
Plain      Pull recent threads.
Pylon      Pull recent issues.
When Duckie imports conversations, it:
  • looks for recent conversations
  • applies your selected channels, groups, tags, and message limits
  • filters out conversations below the minimum message count
  • clips conversations above the maximum message count
  • removes duplicate-looking conversations
  • splits each conversation into turn-based tests
For Slack, Discord, Zendesk, and Freshdesk, you can select specific channels or groups. For Zendesk and Freshdesk, tags are matched as an all-tags filter: a conversation is imported only if it has every tag you selected.
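The import steps above can be pictured with a small sketch. This is not Duckie's implementation: the `Conversation` type, the `select_conversations` function, the choice to keep the first messages when clipping, and the text-based duplicate check are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Conversation:
    """Hypothetical stand-in for one imported conversation."""
    id: str
    messages: list[str]
    tags: set[str] = field(default_factory=set)


def select_conversations(conversations, required_tags, min_messages, max_messages):
    """Apply an all-tags filter, message limits, and a naive duplicate check."""
    selected = []
    seen = set()
    for conv in conversations:
        # All-tags filter: every required tag must be present.
        if not set(required_tags) <= conv.tags:
            continue
        # Drop conversations below the minimum message count.
        if len(conv.messages) < min_messages:
            continue
        # Clip conversations above the maximum (assumption: keep the first messages).
        clipped = conv.messages[:max_messages]
        # Naive duplicate check (assumption): identical text after normalization.
        fingerprint = tuple(m.strip().lower() for m in clipped)
        if fingerprint in seen:
            continue
        seen.add(fingerprint)
        selected.append(Conversation(conv.id, clipped, conv.tags))
    return selected


conversations = [
    Conversation("a", ["Hi", "Hello", "Refund?", "Sure"], {"billing", "vip"}),
    Conversation("b", ["Hi"], {"billing", "vip"}),                                # too short
    Conversation("c", ["hi ", "hello", "refund?", "sure"], {"billing", "vip"}),  # duplicate of "a"
    Conversation("d", ["Hi", "Hello", "Thanks"], {"billing"}),                   # missing "vip"
]
kept = select_conversations(conversations, {"billing", "vip"}, min_messages=2, max_messages=3)
print([c.id for c in kept])  # → ['a']
```

Only conversation "a" survives, and it is clipped to three messages. The filtering you configure in the UI has the same effect on which conversations reach the batch.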

Author Tickets Manually

Manual tickets are useful for edge cases, new product behavior, and scenarios that do not already exist in your support history.
1. Open Manage tickets
   Open a batch and click Manage tickets.

2. Add a custom ticket
   Click Add custom ticket.

3. Build the conversation
   Add customer and agent messages. The ticket must include at least one customer message and one agent message.

4. Create the ticket
   Click Create Ticket. Duckie splits the conversation into test turns automatically.

In Manage tickets, you can edit customer messages, edit expected responses, add new customer messages, delete turns, delete tickets, search tickets, and open imported tickets in their source integration when a source link exists.
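To make the automatic split into test turns concrete, here is a minimal sketch. It assumes that each customer message directly followed by an agent message becomes one turn, with everything earlier kept as history. The `Message`, `Turn`, and `split_into_turns` names are hypothetical, and Duckie's actual splitting rules may differ.

```python
from dataclasses import dataclass


@dataclass
class Message:
    role: str  # "customer" or "agent"
    text: str


@dataclass
class Turn:
    history: list[Message]   # everything before the current customer message
    customer_message: str    # the message the agent must answer
    expected_response: str   # the reference agent reply


def split_into_turns(messages):
    """Create one Turn per customer message that is directly followed by an agent reply."""
    turns = []
    for i in range(len(messages) - 1):
        current, reply = messages[i], messages[i + 1]
        if current.role == "customer" and reply.role == "agent":
            turns.append(
                Turn(
                    history=list(messages[:i]),
                    customer_message=current.text,
                    expected_response=reply.text,
                )
            )
    return turns


ticket = [
    Message("customer", "My invoice is wrong."),
    Message("agent", "Sorry about that. Which invoice number?"),
    Message("customer", "INV-1001."),
    Message("agent", "Thanks, I have corrected it."),
]
turns = split_into_turns(ticket)
print(len(turns))             # → 2
print(len(turns[1].history))  # → 2
```

The second turn carries the first exchange as history, which is why editing an early message in a ticket can affect later turns.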

Run a Batch

1. Open the batch
   Select the batch you want to run.

2. Start a new run
   Click New run.

3. Choose an agent
   Select the agent to test.

4. Choose a rubric
   Select the scoring rubric for this run. The selected rubric is snapshotted on the run.

5. Add run instructions
   Optionally add agent test instructions. Use these to define special behavior for this test run.

6. Run the tests
   Click Run tests. Duckie runs each turn and shows live progress.

You can start another run while one is active, but Duckie asks you to confirm first. Each run creates a separate set of results.

Use Run Instructions

Run instructions let you test behavior that is specific to the test, even when the agent is not normally designed to behave that way. They are added when you start a run and are passed to the agent as test-time instructions. Use run instructions when you want to temporarily change how the agent should behave for the batch, such as:
  • do not call a specific tool
  • do not perform a specific type of action
  • treat each ticket as if no previous escalation happened
  • follow a draft policy or experimental process
  • answer with a specific tone, format, or level of detail
  • assume a feature, plan, or customer state that is not part of normal production behavior
For example:

    Do not call the refund tool. Explain refund eligibility only.

    Assume the new cancellation policy is already live. Use that policy when deciding what the customer can do.

Run instructions are separate from the selected scoring rubric. The rubric defines how results are evaluated; run instructions define how the agent should behave during the test. Duckie also includes the run instructions in AI scoring context so the evaluator understands what behavior the agent was asked to follow.

You can edit run instructions from the run toolbar. Changes do not rewrite existing responses; they apply to later re-runs and re-scores.
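The sketch below illustrates the idea that the same run instructions reach both the agent and the evaluator. The function names, prompt layout, and sample policy text are invented for illustration and do not describe Duckie's internal prompt format.

```python
def build_agent_input(agent_instructions, run_instructions, customer_message):
    """Combine the agent's normal instructions with optional test-time ones."""
    parts = [agent_instructions]
    if run_instructions:
        parts.append("Test-time instructions:\n" + run_instructions)
    parts.append("Customer message:\n" + customer_message)
    return "\n\n".join(parts)


def build_scoring_context(rubric, run_instructions, expected, actual):
    """Give the evaluator the rubric plus the instructions the agent was under."""
    parts = ["Rubric:\n" + rubric]
    if run_instructions:
        parts.append("The agent was asked to follow:\n" + run_instructions)
    parts.append("Expected response:\n" + expected)
    parts.append("Actual response:\n" + actual)
    return "\n\n".join(parts)


run_instructions = "Do not call the refund tool. Explain refund eligibility only."
agent_input = build_agent_input(
    "You are a support agent.", run_instructions, "Can I get a refund?"
)
scoring_context = build_scoring_context(
    "Score accuracy from 1 to 5.",
    run_instructions,
    "Refunds are available within 30 days.",           # made-up sample text
    "You are eligible for a refund within 30 days.",   # made-up sample text
)
print(run_instructions in agent_input and run_instructions in scoring_context)  # → True
```

Keeping the evaluator informed matters because an agent that correctly avoids the refund tool would otherwise look like it failed to resolve the request.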

Test Mode Safety

Batch test runs execute in testing mode. This prevents tests from sending customer-visible replies or performing external writes. In testing mode:
  • write app tools, custom tools, and MCP tools are skipped
  • responder output is converted to internal notes where the source supports internal notes
  • Slack and Discord responder delivery is skipped because they do not have an internal note target
  • escalation delivery is skipped for batch tests
This lets you test agent reasoning, tools, and response quality without modifying customer systems.
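The testing mode rules above can be summarized as a small decision sketch. It assumes each tool carries a hypothetical `writes` flag and each source a hypothetical `supports_internal_notes` flag; how Duckie actually classifies tools and sources is not described here.

```python
from dataclasses import dataclass


@dataclass
class Tool:
    name: str
    writes: bool  # hypothetical flag: True if the tool changes an external system


def tool_outcome(tool, testing_mode):
    """Skip write tools in testing mode; everything else still executes."""
    if testing_mode and tool.writes:
        return "skipped"
    return "executed"


def reply_outcome(supports_internal_notes, testing_mode, is_escalation=False):
    """Decide what happens to agent output for one result."""
    if not testing_mode:
        return "delivered"
    if is_escalation:
        return "skipped"  # escalation delivery is skipped for batch tests
    if supports_internal_notes:
        return "internal_note"
    return "skipped"  # e.g. Slack and Discord have no internal note target


print(tool_outcome(Tool("issue_refund", writes=True), testing_mode=True))   # → skipped
print(tool_outcome(Tool("lookup_order", writes=False), testing_mode=True))  # → executed
print(reply_outcome(supports_internal_notes=True, testing_mode=True))       # → internal_note
print(reply_outcome(supports_internal_notes=False, testing_mode=True))      # → skipped
```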

Score Results

Completed runs can be scored manually or with AI scoring.

Default Rubric

The default rubric scores overall quality on a 1-5 scale and evaluates:
  • accuracy
  • completeness
  • helpfulness
  • tone and professionalism
  • guideline adherence

Custom Rubrics

Create and manage custom rubrics from Settings → Rubrics. A custom rubric can define:
  • evaluator instructions
  • skip conditions for an entire result
  • a scoring scale from 1 to 10
  • custom metrics
  • skip conditions for individual metrics
When you start a run, Duckie stores a snapshot of the selected rubric on that run. Later changes to the rubric do not rewrite historical run results.
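As an illustration of why a snapshot protects historical results, the sketch below copies a rubric at run start. The `Rubric` and `Run` types and the `start_run` function are hypothetical; the only point is that the run keeps its own copy.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class Rubric:
    name: str
    scale_max: int
    metrics: list[str] = field(default_factory=list)


@dataclass
class Run:
    agent: str
    rubric_snapshot: Rubric


def start_run(agent, rubric):
    # Deep copy so later edits to the live rubric do not reach this run.
    return Run(agent=agent, rubric_snapshot=copy.deepcopy(rubric))


rubric = Rubric("Support quality", scale_max=5, metrics=["accuracy", "tone"])
run = start_run("billing-agent", rubric)
rubric.metrics.append("empathy")    # edit made after the run started
print(run.rubric_snapshot.metrics)  # → ['accuracy', 'tone']
```

A practical consequence: to score old responses against an updated rubric, start a new run rather than expecting existing runs to pick up the change.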

Auto-Score and Manual Score

After a run completes, you can:
  • auto-score the full run
  • re-score one ticket
  • re-score one turn
  • manually set overall and metric scores
  • add notes
  • cancel scoring while it is running
Scores are shown at the run, ticket, and turn levels. Duckie also shows averages across the run and per-rubric metric breakdowns when available.
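One way to think about the three score levels is sketched below. It assumes a ticket score is the mean of its scored turns and the run average is the mean of its scored tickets, with skipped or unscored turns ignored. Duckie's exact aggregation may differ, and the function names are hypothetical.

```python
from statistics import mean


def ticket_score(turn_scores):
    """Average the scored turns of a ticket; None marks a skipped or unscored turn."""
    scored = [s for s in turn_scores if s is not None]
    return mean(scored) if scored else None


def run_average(tickets):
    """Average ticket-level scores across a run, ignoring unscored tickets."""
    scores = [ticket_score(turns) for turns in tickets.values()]
    scored = [s for s in scores if s is not None]
    return mean(scored) if scored else None


tickets = {
    "T-1": [5, 4, None],  # third turn skipped
    "T-2": [2],
    "T-3": [None],        # nothing scored yet
}
print(ticket_score(tickets["T-1"]))  # → 4.5
print(run_average(tickets))          # → 3.25
```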

Review Results

Open a run to inspect the tickets and turns it tested. The run detail view shows:
  • the expected response
  • Duckie’s actual response
  • the agent run behind each result
  • turn-level scores and scoring reasoning
  • ticket-level scores
  • run-level average score
You can search tickets, filter by score, filter to tickets with notes, re-run a ticket, re-run a single turn, re-score a ticket, or re-score a single turn.

Use review mode to move through results quickly: it lets you step across tickets and turns and set 1-5 ratings from the keyboard.

Export Results

For completed runs, use Export to download an Excel workbook. The export includes:
  • run metadata
  • ticket-level (conversation-level) scores
  • turn-level inputs, expected outputs, actual outputs, scores, reasoning, skip status, and notes

Best Practices

Start From Real Conversations

Use imported tickets to cover common questions, real phrasing, and multi-turn context that is hard to recreate by hand. Use replay testing when you want to inspect a specific historical conversation interactively before adding it to a batch.

Add Manual Edge Cases

Add custom tickets for policy boundaries, guardrails, tool-heavy workflows, and scenarios you have not seen in production yet.

Keep Expected Responses Fresh

Update expected responses when your product, policy, or desired agent behavior changes.

Use Rubrics For Stable Evaluation

Create custom rubrics when the default quality metrics are too generic for your team. Keep rubric changes intentional, because each run snapshots the rubric used for scoring.

Compare Runs Before Deploying

Run the same batch before and after agent configuration changes. Investigate low-scoring tickets, re-run the specific turns after fixes, then re-score.
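If you copy ticket-level scores from two runs (for example, from their exports), a short script can flag regressions. This is a sketch: the `find_regressions` helper, the ticket IDs, and the one-point threshold are illustrative choices, not part of Duckie.

```python
def find_regressions(before, after, threshold=1.0):
    """Return tickets whose score dropped by at least `threshold` between runs."""
    regressions = []
    for ticket_id, old_score in before.items():
        new_score = after.get(ticket_id)
        if new_score is None:
            continue  # ticket not scored in the later run
        if old_score - new_score >= threshold:
            regressions.append((ticket_id, old_score, new_score))
    # Largest drops first.
    return sorted(regressions, key=lambda r: r[1] - r[2], reverse=True)


before = {"T-1": 4.5, "T-2": 4.0, "T-3": 3.0}
after = {"T-1": 4.5, "T-2": 2.0, "T-3": 1.5}
print(find_regressions(before, after))
# → [('T-2', 4.0, 2.0), ('T-3', 3.0, 1.5)]
```

The flagged tickets are the ones worth opening in the run detail view first.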

Next Steps

Replay Testing: Inspect real conversations interactively

Playground: Test individual scenarios interactively