Regression Testing

Converra automatically verifies that prompt improvements don't break existing functionality. When a variant shows improvement, the system tests it against a "golden set" of scenarios your baseline already handles well.

Why Regression Testing?

A prompt optimized to handle frustrated customers better might inadvertently become worse at handling standard queries. Regression testing catches this before deployment.

The core promise: Improvements are proven, and regressions are surfaced before they reach production.

How It Works

Automatic Triggering

Regression testing runs automatically when these conditions are met:

  1. A golden set exists for the prompt (minimum 3 validated scenarios)
  2. A leading variant shows positive improvement vs baseline
  3. The leader hasn't been regression-tested yet this optimization

You don't need to configure or trigger it—the system guarantees the check runs.
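
To make the gate concrete, here is a minimal sketch of the three conditions as a check, assuming hypothetical `GoldenSet` and `Variant` shapes (Converra's actual internals are not public):

```python
# Hypothetical sketch of the automatic trigger conditions described above.
from dataclasses import dataclass, field

MIN_GOLDEN_SCENARIOS = 3  # the "Minimum scenarios" setting default

@dataclass
class GoldenSet:
    scenarios: list = field(default_factory=list)

@dataclass
class Variant:
    improvement: float = 0.0        # delta vs baseline, e.g. +0.15
    regression_tested: bool = False

def should_run_regression_test(golden_set, leader) -> bool:
    """True only when all three trigger conditions hold."""
    return (
        golden_set is not None
        and len(golden_set.scenarios) >= MIN_GOLDEN_SCENARIOS  # 1. golden set exists
        and leader is not None
        and leader.improvement > 0                             # 2. positive improvement
        and not leader.regression_tested                       # 3. not already tested
    )
```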

The Flow

Optimization Loop:
1. Generate variants
2. Run simulations → Variant A leads with +15%
3. → AUTOMATIC: Regression test runs
4. Results show: 4/5 scenarios passed, 1 regressed
5. System surfaces tradeoff to user
6. User decides: Apply anyway or reject
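
The same flow as a sketch, with `generate_variants`, `simulate`, and `regression_test` as hypothetical callables standing in for the real steps, not Converra APIs:

```python
def optimize(baseline, golden_set, generate_variants, simulate, regression_test):
    variants = generate_variants(baseline)            # 1. generate variants
    baseline_score = simulate(baseline)               # 2. run simulations
    leader = max(variants, key=simulate)
    if simulate(leader) > baseline_score:             # leader shows improvement
        report = regression_test(leader, golden_set)  # 3. automatic regression test
        return leader, report                         # 4-5. surface the tradeoff
    return baseline, None                             # no leader: keep baseline
```

Step 6 (apply anyway or reject) happens outside the loop, once the report is surfaced to you.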

Golden Sets

A golden set is a collection of scenarios that your baseline prompt handles reliably. These become the benchmark for regression testing.

Automatic Generation

Golden sets are created automatically on the first optimization:

  1. Generate candidates - AI analyzes your prompt and identifies 5-8 representative user scenarios
  2. Validate against baseline - Each scenario is tested with your baseline prompt
  3. Keep winners - Only scenarios where baseline scores ≥ 0.75 are included
  4. Save for reuse - The golden set persists and is reused in future optimizations

Why validate? We only test against scenarios the baseline handles well. It's not fair to flag "regressions" on things the prompt was never good at.
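
A sketch of the validation step, with `candidates` and `score_with_baseline` as hypothetical inputs standing in for the AI-generated scenarios and the scoring step:

```python
BASELINE_THRESHOLD = 0.75  # the "Baseline threshold" setting default

def build_golden_set(candidates, score_with_baseline):
    """Keep only candidate scenarios the baseline already handles well.

    `candidates` is a list of scenario descriptions; `score_with_baseline`
    is a callable returning a 0-1 score for one scenario.
    """
    golden = []
    for scenario in candidates:                 # 1. generated candidates (5-8)
        score = score_with_baseline(scenario)   # 2. validate against baseline
        if score >= BASELINE_THRESHOLD:         # 3. keep winners only
            golden.append({"scenario": scenario, "baseline_score": score})
    return golden                               # 4. caller persists for reuse
```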

No Conversations Required

Golden sets are generated from prompt analysis, not imported conversations. This means regression testing works from your first optimization—no data collection needed.

Example Golden Set

Golden Set for "Customer Support Agent" (5 scenarios)

1. Frustrated customer requesting refund      Baseline: 0.82
2. New user onboarding questions              Baseline: 0.78
3. Technical support inquiry                  Baseline: 0.85
4. Feature request submission                 Baseline: 0.80
5. Integration setup help                     Baseline: 0.88
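
Expressed as data, assuming a simple list-of-dicts shape (the actual storage format is not documented):

```python
golden_set = [
    {"scenario": "Frustrated customer requesting refund", "baseline_score": 0.82},
    {"scenario": "New user onboarding questions",         "baseline_score": 0.78},
    {"scenario": "Technical support inquiry",             "baseline_score": 0.85},
    {"scenario": "Feature request submission",            "baseline_score": 0.80},
    {"scenario": "Integration setup help",                "baseline_score": 0.88},
]
```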

Regression Test Execution

Short Exchanges

Regression tests use short exchanges of 2-4 turns, not full conversations:

Turn 1: User presents scenario
Turn 2: AI responds
Turn 3: User follow-up (optional)
Turn 4: AI responds (optional)
→ Score based on goal achievement + quality

Why short? We're testing "can the variant still handle this?"—not "can it handle a complex journey." Shorter = faster + cheaper.
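
A sketch of one such exchange, where `chat` (a `(prompt, messages) -> reply` callable) and `score` (a judge returning 0-1) are hypothetical stand-ins:

```python
def run_short_exchange(prompt, scenario, chat, score, follow_up=None):
    """Run one 2-4 turn exchange and score it."""
    messages = [{"role": "user", "content": scenario}]            # turn 1
    messages.append({"role": "assistant",
                     "content": chat(prompt, messages)})          # turn 2
    if follow_up is not None:
        messages.append({"role": "user", "content": follow_up})  # turn 3 (optional)
        messages.append({"role": "assistant",
                         "content": chat(prompt, messages)})      # turn 4 (optional)
    return score(messages)  # goal achievement + quality, 0-1
```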

Parallel Execution

All scenarios run in parallel:

  • 5 golden scenarios × 2 (baseline + variant) = 10 simulations
  • All complete in ~60-90 seconds
  • Results returned immediately
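
A sketch of the fan-out using `asyncio`, assuming a hypothetical async `run_exchange(prompt, scenario)` coroutine:

```python
import asyncio

async def run_regression_suite(baseline, variant, scenarios, run_exchange):
    """Fan out 5 scenarios x 2 prompts = 10 concurrent simulations."""
    tasks = [run_exchange(prompt, scenario)
             for scenario in scenarios
             for prompt in (baseline, variant)]
    scores = await asyncio.gather(*tasks)   # all simulations run concurrently
    # Re-pair the interleaved scores as (baseline_score, variant_score).
    return list(zip(scores[0::2], scores[1::2]))
```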

Fluke Detection

When a regression is suspected, the system validates it's not a baseline fluke:

Scenario: Technical support
Baseline: 0.85
Variant:  0.71
Delta:    -0.14 → SUSPECTED REGRESSION

Before flagging:
→ Re-run baseline 2× more on this scenario
→ Results: 0.83, 0.86
→ Baseline average: 0.85 (consistent)
→ CONFIRMED: Real regression

This prevents false positives from random variance.
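
A sketch of the fluke check, where `rerun_baseline` is a hypothetical zero-argument callable that replays the scenario against the baseline, and the 0.05 margin is an assumption:

```python
from statistics import mean

def confirm_regression(first_baseline_score, variant_score, rerun_baseline,
                       reruns=2, margin=0.05):
    """Re-run the baseline before flagging, to rule out a one-off reading."""
    samples = [first_baseline_score] + [rerun_baseline() for _ in range(reruns)]
    baseline_avg = mean(samples)           # e.g. mean(0.85, 0.83, 0.86) ≈ 0.85
    return variant_score < baseline_avg - margin   # consistent gap = confirmed
```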

Understanding Results

All Passed

Regression Check                              PASSED

5/5 scenarios passed

✓ Frustrated customer requesting refund
✓ New user onboarding questions
✓ Technical support inquiry
✓ Feature request submission
✓ Integration setup help

When all scenarios pass, the variant maintains existing functionality.

Regressions Found

Regression Check                         1 REGRESSION

4/5 scenarios passed

✓ Frustrated customer requesting refund       +6%
✓ New user onboarding questions               +2%
✗ Technical support inquiry                  -16%
✓ Feature request submission                  +4%
✓ Integration setup help                      +1%

Variant improved on its optimization target but regressed
on the "Technical support inquiry" scenario.

When regressions are found, you see:

  • Which scenarios regressed
  • The performance delta
  • Clear tradeoff: improvement vs regression
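
The deltas in the report above are relative to each scenario's baseline score (for example, 0.71 vs 0.85 is roughly -16%). A sketch of that classification, with the -5% cutoff as an assumption since the actual threshold is not documented:

```python
REGRESSION_CUTOFF = -0.05  # assumed relative-delta cutoff

def summarize(results):
    """Split (name, baseline, variant) triples into passed vs regressed."""
    passed, regressed = [], []
    for name, base, var in results:
        delta = (var - base) / base          # e.g. (0.71 - 0.85) / 0.85 ≈ -16%
        bucket = regressed if delta < REGRESSION_CUTOFF else passed
        bucket.append((name, delta))
    return passed, regressed
```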

User Decision

When regressions are detected, you decide:

  • Apply Anyway - Accept the tradeoff (improvement outweighs regression)
  • Reject & Keep Baseline - Don't deploy the regression

This is a soft gate—regressions inform your decision but don't automatically block deployment.

Settings

Configure regression testing in Settings → Prompt Optimization:

Setting                  Default    Description
Minimum scenarios        3          Minimum golden scenarios after validation
Baseline threshold       0.75       Minimum baseline score to include in golden set
Fluke validation runs    3          Runs to confirm suspected regressions
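
For reference, the defaults above as a plain data structure (a hypothetical representation; the real settings live in the Converra UI):

```python
REGRESSION_SETTINGS = {
    "minimum_scenarios": 3,       # golden scenarios required after validation
    "baseline_threshold": 0.75,   # minimum baseline score for inclusion
    "fluke_validation_runs": 3,   # runs used to confirm a suspected regression
}
```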

When Regression Testing is Skipped

The system will skip regression testing (with a clear reason) when:

  • No golden set: First optimization, still creating golden set
  • No leader: No variant with positive improvement vs baseline
  • Already tested: Leader was already regression-tested this optimization
  • Too few scenarios: Fewer than 3 scenarios passed baseline validation

Cost & Time

Scenario                                      Time          Approx. Cost
First optimization (golden set creation)      ~3-4 min      ~$0.50
Subsequent optimizations                      ~60-90 sec    ~$0.25
Fluke validation (if regression suspected)    +30 sec       +$0.05

Golden set creation is a one-time cost per prompt.

Webhook Events

Subscribe to regression testing events:

Event                        When Fired
regression_test.completed    Regression test finished (includes pass/fail status)

See Webhooks for setup.
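
A minimal sketch of a consumer for this event, using Flask; the payload field names (`type`, `status`) are assumptions, so check the Webhooks docs for the actual schema:

```python
from flask import Flask, request

app = Flask(__name__)

@app.route("/webhooks/converra", methods=["POST"])
def handle_event():
    event = request.get_json()
    if event.get("type") == "regression_test.completed":
        status = event.get("status")   # assumed field: pass/fail status
        print(f"Regression test finished: {status}")
    return "", 204
```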

Best Practices

  1. Let it run - Don't skip regression testing for "small" changes
  2. Review regressions - A -16% on one scenario might be acceptable if you gain +30% elsewhere
  3. Trust the baseline threshold - The 0.75 default ensures you're testing against scenarios the prompt genuinely handles well
  4. Watch for patterns - If the same scenario regresses repeatedly, your prompt might have a fundamental tension

Next Steps