Regression Testing

Converra automatically verifies that prompt improvements don't break existing functionality. When a variant shows improvement, the system tests it against a "golden set" of scenarios your baseline already handles well.

Why Regression Testing?

A prompt optimized to handle frustrated customers better might inadvertently become worse at handling standard queries. Regression testing catches this before deployment.

The core promise: Improvements are proven, and regressions are surfaced before they reach production.

How It Works

Automatic Triggering

Regression testing runs automatically when these conditions are met:

  1. A golden set exists for the prompt (minimum 3 validated scenarios)
  2. A leading variant shows positive improvement vs baseline
  3. The leader hasn't been regression-tested yet this optimization

You don't need to configure or trigger it—the system guarantees the check runs.
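
To make the gate concrete, here is a minimal sketch of the three conditions as a check, assuming hypothetical `GoldenSet` and `Variant` shapes (Converra's actual internals are not public):

```python
# Hypothetical sketch of the automatic trigger conditions described above.
from dataclasses import dataclass, field

MIN_GOLDEN_SCENARIOS = 3  # the "Minimum scenarios" setting default

@dataclass
class GoldenSet:
    scenarios: list = field(default_factory=list)

@dataclass
class Variant:
    improvement: float = 0.0        # delta vs baseline, e.g. +0.15
    regression_tested: bool = False

def should_run_regression_test(golden_set, leader) -> bool:
    """True only when all three trigger conditions hold."""
    return (
        golden_set is not None
        and len(golden_set.scenarios) >= MIN_GOLDEN_SCENARIOS  # 1. golden set exists
        and leader is not None
        and leader.improvement > 0                             # 2. positive improvement
        and not leader.regression_tested                       # 3. not already tested
    )
```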

The Flow

Optimization Loop:
1. Generate variants
2. Run simulations → Variant A leads with +15%
3. → AUTOMATIC: Regression test runs
4. Results show: 4/5 scenarios passed, 1 regressed
5. System surfaces tradeoff to user
6. User decides: Apply anyway or reject
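
The same flow as a sketch, with `generate_variants`, `simulate`, and `regression_test` as hypothetical callables standing in for the real steps, not Converra APIs:

```python
def optimize(baseline, golden_set, generate_variants, simulate, regression_test):
    variants = generate_variants(baseline)            # 1. generate variants
    baseline_score = simulate(baseline)               # 2. run simulations
    leader = max(variants, key=simulate)
    if simulate(leader) > baseline_score:             # leader shows improvement
        report = regression_test(leader, golden_set)  # 3. automatic regression test
        return leader, report                         # 4-5. surface the tradeoff
    return baseline, None                             # no leader: keep baseline
```

Step 6 (apply anyway or reject) happens outside the loop, once the report is surfaced to you.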

Golden Sets

A golden set is a collection of scenarios that your baseline prompt handles reliably. These become the benchmark for regression testing.

Automatic Generation

Golden sets are created automatically on the first optimization:

  1. Generate candidates - AI analyzes your prompt and identifies 5-8 representative user scenarios
  2. Validate against baseline - Each scenario is tested with your baseline prompt
  3. Keep winners - Only scenarios where baseline scores ≥ 0.75 are included
  4. Save for reuse - The golden set persists and is reused in future optimizations

Why validate? We only test against scenarios the baseline handles well. It's not fair to flag "regressions" on things the prompt was never good at.
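
A sketch of the validation step, with `candidates` and `score_with_baseline` as hypothetical inputs standing in for the AI-generated scenarios and the scoring step:

```python
BASELINE_THRESHOLD = 0.75  # the "Baseline threshold" setting default

def build_golden_set(candidates, score_with_baseline):
    """Keep only candidate scenarios the baseline already handles well.

    `candidates` is a list of scenario descriptions; `score_with_baseline`
    is a callable returning a 0-1 score for one scenario.
    """
    golden = []
    for scenario in candidates:                 # 1. generated candidates (5-8)
        score = score_with_baseline(scenario)   # 2. validate against baseline
        if score >= BASELINE_THRESHOLD:         # 3. keep winners only
            golden.append({"scenario": scenario, "baseline_score": score})
    return golden                               # 4. caller persists for reuse
```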

No Conversations Required

Golden sets are generated from prompt analysis, not imported conversations. This means regression testing works from your first optimization—no data collection needed.

Example Golden Set

Golden Set for "Customer Support Agent" (5 scenarios)

1. Frustrated customer requesting refund      Baseline: 0.82
2. New user onboarding questions              Baseline: 0.78
3. Technical support inquiry                  Baseline: 0.85
4. Feature request submission                 Baseline: 0.80
5. Integration setup help                     Baseline: 0.88
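
Expressed as data, assuming a simple list-of-dicts shape (the actual storage format is not documented):

```python
golden_set = [
    {"scenario": "Frustrated customer requesting refund", "baseline_score": 0.82},
    {"scenario": "New user onboarding questions",         "baseline_score": 0.78},
    {"scenario": "Technical support inquiry",             "baseline_score": 0.85},
    {"scenario": "Feature request submission",            "baseline_score": 0.80},
    {"scenario": "Integration setup help",                "baseline_score": 0.88},
]
```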

Regression Test Execution

Short Exchanges

Regression tests use short exchanges of 2-4 turns, not full conversations:

Turn 1: User presents scenario
Turn 2: AI responds
Turn 3: User follow-up (optional)
Turn 4: AI responds (optional)
→ Score based on goal achievement + quality

Why short? We're testing "can the variant still handle this?"—not "can it handle a complex journey." Shorter = faster + cheaper.
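
A sketch of one such exchange, where `chat` (a `(prompt, messages) -> reply` callable) and `score` (a judge returning 0-1) are hypothetical stand-ins:

```python
def run_short_exchange(prompt, scenario, chat, score, follow_up=None):
    """Run one 2-4 turn exchange and score it."""
    messages = [{"role": "user", "content": scenario}]            # turn 1
    messages.append({"role": "assistant",
                     "content": chat(prompt, messages)})          # turn 2
    if follow_up is not None:
        messages.append({"role": "user", "content": follow_up})  # turn 3 (optional)
        messages.append({"role": "assistant",
                         "content": chat(prompt, messages)})      # turn 4 (optional)
    return score(messages)  # goal achievement + quality, 0-1
```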

Parallel Execution

All scenarios run in parallel:

  • 5 golden scenarios × 2 (baseline + variant) = 10 simulations
  • All complete in ~60-90 seconds
  • Results returned immediately
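
A sketch of the fan-out using `asyncio`, assuming a hypothetical async `run_exchange(prompt, scenario)` coroutine:

```python
import asyncio

async def run_regression_suite(baseline, variant, scenarios, run_exchange):
    """Fan out 5 scenarios x 2 prompts = 10 concurrent simulations."""
    tasks = [run_exchange(prompt, scenario)
             for scenario in scenarios
             for prompt in (baseline, variant)]
    scores = await asyncio.gather(*tasks)   # all simulations run concurrently
    # Re-pair the interleaved scores as (baseline_score, variant_score).
    return list(zip(scores[0::2], scores[1::2]))
```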

Fluke Detection

When a regression is suspected, the system validates it's not a baseline fluke:

Scenario: Technical support
Baseline: 0.85
Variant:  0.71
Delta:    -0.14 → SUSPECTED REGRESSION

Before flagging:
→ Re-run baseline 2× more on this scenario
→ Results: 0.83, 0.86
→ Baseline average: 0.85 (consistent)
→ CONFIRMED: Real regression

This prevents false positives from random variance.
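
A sketch of the fluke check, where `rerun_baseline` is a hypothetical zero-argument callable that replays the scenario against the baseline, and the 0.05 margin is an assumption:

```python
from statistics import mean

def confirm_regression(first_baseline_score, variant_score, rerun_baseline,
                       reruns=2, margin=0.05):
    """Re-run the baseline before flagging, to rule out a one-off reading."""
    samples = [first_baseline_score] + [rerun_baseline() for _ in range(reruns)]
    baseline_avg = mean(samples)           # e.g. mean(0.85, 0.83, 0.86) ≈ 0.85
    return variant_score < baseline_avg - margin   # consistent gap = confirmed
```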

Understanding Results

All Passed

Regression Check                              PASSED

5/5 scenarios passed

✓ Frustrated customer requesting refund
✓ New user onboarding questions
✓ Technical support inquiry
✓ Feature request submission
✓ Integration setup help

When all scenarios pass, the variant maintains existing functionality.

Regressions Found

Regression Check                         1 REGRESSION

4/5 scenarios passed

✓ Frustrated customer requesting refund       +6%
✓ New user onboarding questions               +2%
✗ Technical support inquiry                  -16%
✓ Feature request submission                  +4%
✓ Integration setup help                      +1%

Variant improved on its optimization target but regressed
on the "Technical support inquiry" scenario.

When regressions are found, you see:

  • Which scenarios regressed
  • The performance delta
  • Clear tradeoff: improvement vs regression
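
The deltas in the report above are relative to each scenario's baseline score (for example, 0.71 vs 0.85 is roughly -16%). A sketch of that classification, with the -5% cutoff as an assumption since the actual threshold is not documented:

```python
REGRESSION_CUTOFF = -0.05  # assumed relative-delta cutoff

def summarize(results):
    """Split (name, baseline, variant) triples into passed vs regressed."""
    passed, regressed = [], []
    for name, base, var in results:
        delta = (var - base) / base          # e.g. (0.71 - 0.85) / 0.85 ≈ -16%
        bucket = regressed if delta < REGRESSION_CUTOFF else passed
        bucket.append((name, delta))
    return passed, regressed
```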

User Decision

When regressions are detected, you decide:

  • Apply Anyway - Accept the tradeoff (improvement outweighs regression)
  • Reject & Keep Baseline - Don't deploy the regression

This is a soft gate—regressions inform your decision but don't automatically block deployment.

Settings

Configure regression testing in Settings → Prompt Optimization:

Setting                  Default    Description
Minimum scenarios        3          Minimum golden scenarios after validation
Baseline threshold       0.75       Minimum baseline score to include in golden set
Fluke validation runs    3          Runs to confirm suspected regressions
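
For reference, the defaults above as a plain data structure (a hypothetical representation; the real settings live in the Converra UI):

```python
REGRESSION_SETTINGS = {
    "minimum_scenarios": 3,       # golden scenarios required after validation
    "baseline_threshold": 0.75,   # minimum baseline score for inclusion
    "fluke_validation_runs": 3,   # runs used to confirm a suspected regression
}
```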

When Regression Testing is Skipped

The system will skip regression testing (with a clear reason) when:

  • No golden set: First optimization, still creating golden set
  • No leader: No variant with positive improvement vs baseline
  • Already tested: Leader was already regression-tested this optimization
  • Too few scenarios: Fewer than 3 scenarios passed baseline validation

Cost & Time

Scenario                                      Time          Approx. Cost
First optimization (golden set creation)      ~3-4 min      ~$0.50
Subsequent optimizations                      ~60-90 sec    ~$0.25
Fluke validation (if regression suspected)    +30 sec       +$0.05

Golden set creation is a one-time cost per prompt.

Webhook Events

Subscribe to regression testing events:

Event                        When Fired
regression_test.completed    Regression test finished (includes pass/fail status)

See Webhooks for setup.
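
A minimal sketch of a consumer for this event, using Flask; the payload field names (`type`, `status`) are assumptions, so check the Webhooks docs for the actual schema:

```python
from flask import Flask, request

app = Flask(__name__)

@app.route("/webhooks/converra", methods=["POST"])
def handle_event():
    event = request.get_json()
    if event.get("type") == "regression_test.completed":
        status = event.get("status")   # assumed field: pass/fail status
        print(f"Regression test finished: {status}")
    return "", 204
```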

Best Practices

  1. Let it run - Don't skip regression testing for "small" changes
  2. Review regressions - A -16% on one scenario might be acceptable if you gain +30% elsewhere
  3. Trust the baseline threshold - The 0.75 default ensures you're testing against scenarios the prompt genuinely handles well
  4. Watch for patterns - If the same scenario regresses repeatedly, your prompt might have a fundamental tension

Next Steps