Regression Testing
Converra automatically verifies that prompt improvements don't break existing functionality. When a variant shows improvement, the system tests it against a "golden set" of scenarios your baseline already handles well.
Why Regression Testing?
A prompt optimized to handle frustrated customers better might inadvertently become worse at handling standard queries. Regression testing catches this before deployment.
The core promise: Improvements are proven, and regressions are surfaced before they reach production.
How It Works
Automatic Triggering
Regression testing runs automatically when these conditions are met:
- A golden set exists for the prompt (minimum 3 validated scenarios)
- A leading variant shows positive improvement vs baseline
- The leading variant hasn't already been regression-tested in this optimization run
You don't need to configure or trigger it—the system guarantees the check runs.
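For intuition, here's a minimal sketch of that gating check in Python; the data model and names are illustrative, not Converra's actual API:

```python
from dataclasses import dataclass

# Illustrative data model; field names are assumptions for this sketch.
@dataclass
class GoldenSet:
    scenarios: list

@dataclass
class Variant:
    id: str
    improvement: float  # delta vs baseline, e.g. +0.15

MIN_GOLDEN_SCENARIOS = 3  # default "Minimum scenarios" setting

def should_run_regression_test(golden_set, leader, tested_ids) -> bool:
    """Mirrors the three trigger conditions listed above."""
    return (
        golden_set is not None
        and len(golden_set.scenarios) >= MIN_GOLDEN_SCENARIOS  # golden set exists
        and leader is not None
        and leader.improvement > 0                             # leader beats baseline
        and leader.id not in tested_ids                        # not yet tested this run
    )
```

The same conditions, inverted, give the skip reasons listed under "When Regression Testing is Skipped" below.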
The Flow
Optimization Loop:
1. Generate variants
2. Run simulations → Variant A leads with +15%
3. → AUTOMATIC: Regression test runs
4. Results show: 4/5 scenarios passed, 1 regressed
5. System surfaces tradeoff to user
6. User decides: Apply anyway or reject
Golden Sets
A golden set is a collection of scenarios that your baseline prompt handles reliably. These become the benchmark for regression testing.
Automatic Generation
Golden sets are created automatically on the first optimization:
- Generate candidates - AI analyzes your prompt and identifies 5-8 representative user scenarios
- Validate against baseline - Each scenario is tested with your baseline prompt
- Keep winners - Only scenarios where baseline scores ≥ 0.75 are included
- Save for reuse - The golden set persists and is reused in future optimizations
Why validate? We only test against scenarios the baseline handles well. It's not fair to flag "regressions" on things the prompt was never good at.
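Conceptually, validation works like the sketch below; `score_with_baseline` is a hypothetical callable standing in for the real simulation pipeline:

```python
BASELINE_THRESHOLD = 0.75  # default "Baseline threshold" setting

def build_golden_set(candidates, score_with_baseline):
    """Sketch of golden-set validation: test each AI-proposed scenario
    against the baseline prompt and keep only the ones it handles well."""
    golden = []
    for scenario in candidates:  # the 5-8 generated candidates
        score = score_with_baseline(scenario)
        if score >= BASELINE_THRESHOLD:  # keep winners only
            golden.append({"scenario": scenario, "baseline_score": score})
    return golden  # persisted and reused in future optimizations
```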
No Conversations Required
Golden sets are generated from prompt analysis, not imported conversations. This means regression testing works from your first optimization—no data collection needed.
Example Golden Set
Golden Set for "Customer Support Agent" (5 scenarios)
1. Frustrated customer requesting refund - Baseline: 0.82
2. New user onboarding questions - Baseline: 0.78
3. Technical support inquiry - Baseline: 0.85
4. Feature request submission - Baseline: 0.80
5. Integration setup help - Baseline: 0.88
Regression Test Execution
Short Exchanges
Regression tests use 2-3 turn exchanges, not full conversations:
Turn 1: User presents scenario
Turn 2: AI responds
Turn 3: User follow-up (optional)
Turn 4: AI responds (optional)
→ Score based on goal achievement + quality
Why short? We're testing "can the variant still handle this?", not "can it handle a complex journey." Shorter = faster + cheaper.
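One way to picture an exchange and its score; the 60/40 weighting below is an assumption for illustration, not Converra's documented formula:

```python
from dataclasses import dataclass, field

@dataclass
class Exchange:
    """A short regression-test exchange, per the flow above."""
    scenario: str
    turns: list = field(default_factory=list)  # alternating user/AI messages
    goal_achieved: float = 0.0                 # 0.0-1.0
    quality: float = 0.0                       # 0.0-1.0

def score(exchange, goal_weight=0.6):
    # Hypothetical blend of goal achievement and response quality.
    return goal_weight * exchange.goal_achieved + (1 - goal_weight) * exchange.quality
```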
Parallel Execution
All scenarios run in parallel:
- 5 golden scenarios × 2 (baseline + variant) = 10 simulations
- All complete in ~60-90 seconds
- Results returned immediately
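A sketch of how that fan-out might look with asyncio; `simulate` is a hypothetical async callable wrapping a single simulation:

```python
import asyncio

async def run_regression_suite(golden_set, baseline, variant, simulate):
    """Launch every (prompt, scenario) pair concurrently and wait once."""
    tasks = [
        simulate(prompt, scenario)
        for scenario in golden_set          # e.g. 5 golden scenarios
        for prompt in (baseline, variant)   # x2 prompts = 10 simulations
    ]
    return await asyncio.gather(*tasks)     # all finish together (~60-90s)
```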
Fluke Detection
When a regression is suspected, the system validates it's not a baseline fluke:
Scenario: Technical support
Baseline: 0.85
Variant: 0.71
Delta: -0.14 → SUSPECTED REGRESSION
Before flagging:
→ Re-run baseline 2× more on this scenario
→ Results: 0.83, 0.86
→ Baseline average: 0.85 (consistent)
→ CONFIRMED: Real regression
This prevents false positives from random variance.
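In code, the fluke check might look like this; the -0.10 regression threshold is an assumption, and `rerun_baseline` is a hypothetical callable that re-scores the baseline on one scenario:

```python
from statistics import mean

REGRESSION_DELTA = -0.10  # assumed cutoff for "suspected regression"
EXTRA_BASELINE_RUNS = 2   # 2 re-runs + the original = 3 total, the default setting

def confirm_regression(scenario, baseline_score, variant_score, rerun_baseline):
    """Only flag a regression if it survives baseline re-runs."""
    if variant_score - baseline_score > REGRESSION_DELTA:
        return False  # delta too small to count as a suspected regression
    reruns = [rerun_baseline(scenario) for _ in range(EXTRA_BASELINE_RUNS)]
    stable_baseline = mean([baseline_score] + reruns)
    # Confirmed only if the variant still trails the averaged baseline
    return variant_score - stable_baseline <= REGRESSION_DELTA
```

With the example above (baseline 0.85, re-runs 0.83 and 0.86, variant 0.71), the averaged baseline stays near 0.85, so the regression is confirmed as real.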
Understanding Results
All Passed
Regression Check PASSED
5/5 scenarios passed
✓ Frustrated customer requesting refund
✓ New user onboarding questions
✓ Technical support inquiry
✓ Feature request submission
✓ Integration setup help
When all scenarios pass, the variant maintains existing functionality.
Regressions Found
Regression Check: 1 REGRESSION
4/5 scenarios passed
✓ Frustrated customer requesting refund +6%
✓ New user onboarding questions +2%
✗ Technical support inquiry -16%
✓ Feature request submission +4%
✓ Integration setup help +1%
Variant improved on its agenda target but regressed on the "Technical support inquiry" scenario.
When regressions are found, you see:
- Which scenarios regressed
- The performance delta
- Clear tradeoff: improvement vs regression
User Decision
When regressions are detected, you decide:
- Apply Anyway - Accept the tradeoff (improvement outweighs regression)
- Reject & Keep Baseline - Don't deploy the regression
This is a soft gate—regressions inform your decision but don't automatically block deployment.
Settings
Configure regression testing in Settings → Prompt Optimization:
| Setting | Default | Description |
|---|---|---|
| Minimum scenarios | 3 | Minimum golden scenarios after validation |
| Baseline threshold | 0.75 | Minimum baseline score to include in golden set |
| Fluke validation runs | 3 | Runs to confirm suspected regressions |
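If you manage these values in code, the shape is roughly as follows; the field names mirror the table but are assumptions, not a documented schema:

```python
# Illustrative settings payload; see Settings → Prompt Optimization for
# the authoritative names and ranges.
regression_settings = {
    "min_golden_scenarios": 3,    # skip testing below this count
    "baseline_threshold": 0.75,   # min baseline score to enter the golden set
    "fluke_validation_runs": 3,   # runs used to confirm a suspected regression
}
```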
When Regression Testing is Skipped
The system will skip regression testing (with a clear reason) when:
- No golden set: First optimization, still creating golden set
- No leader: No variant with positive improvement vs baseline
- Already tested: Leader was already regression-tested this optimization
- Too few scenarios: Fewer than 3 scenarios passed baseline validation
Cost & Time
| Scenario | Time | Approx. Cost |
|---|---|---|
| First optimization (golden set creation) | ~3-4 min | ~$0.50 |
| Subsequent optimizations | ~60-90 sec | ~$0.25 |
| Fluke validation (if regression suspected) | +30 sec | +$0.05 |
Golden set creation is a one-time cost per prompt.
Webhook Events
Subscribe to regression testing events:
| Event | When Fired |
|---|---|
| `regression_test.completed` | Regression test finished (includes pass/fail status) |
See Webhooks for setup.
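A minimal consumer sketch using Flask; the payload fields shown are assumptions, so check the Webhooks docs for the actual schema and signature verification:

```python
from flask import Flask, request

app = Flask(__name__)

@app.route("/webhooks/converra", methods=["POST"])
def handle_converra_event():
    event = request.get_json()
    if event.get("type") == "regression_test.completed":
        passed = event.get("passed")                # hypothetical field
        regressions = event.get("regressions", [])  # hypothetical field
        if not passed:
            notify_team(regressions)  # plug in your own alerting here
    return "", 204

def notify_team(regressions):
    print(f"Regression(s) detected: {regressions}")
```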
Best Practices
- Let it run - Don't skip regression testing for "small" changes
- Review regressions - A -16% on one scenario might be acceptable if you gain +30% elsewhere
- Trust the baseline threshold - The 0.75 default ensures you're testing against scenarios the prompt genuinely handles well
- Watch for patterns - If the same scenario regresses repeatedly, your prompt might have a fundamental tension
Next Steps
- How Optimization Works - Understand the full loop
- Understanding Results - Interpret optimization outcomes
- Settings - Configure regression testing thresholds
