Feature · Bench corpus
Lock in good behaviour. Detect regressions before you ship.
The trajectory recorder logs every observation, action, and verification result from every agent run. The bench runner uses those recordings: take a known-good run, declare it the canonical bench flow, then compare every subsequent run against it. If the agent deviates, the bench fails. If it passes, you ship with confidence.
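As a minimal sketch of that record-then-compare loop (all names here are hypothetical, not the Automator's real types, and the real comparison is assertion-based rather than exact step equality):

```kotlin
// Hypothetical sketch of the record-then-compare loop described above.
// The real bench compares via assertions, not exact step equality.
data class Run(val steps: List<String>)

fun matchesCanonical(canonical: Run, candidate: Run): Boolean =
    canonical.steps == candidate.steps

fun main() {
    val canonical = Run(listOf("open login page", "type credentials", "submit"))
    val good = Run(listOf("open login page", "type credentials", "submit"))
    val drifted = Run(listOf("open login page", "click forgot password", "type credentials", "submit"))
    println(matchesCanonical(canonical, good))     // true: bench passes
    println(matchesCanonical(canonical, drifted))  // false: agent deviated, bench fails
}
```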
Why a bench gate exists
Tanvrit Automator is an autonomous agent. A small change to PromptBuilder, the planner model, the perception strategy, or the executor can subtly change which actions the agent picks — even if every unit test still passes.
Unit tests cannot catch "the agent now takes 8 steps instead of 5 to log into GitHub". The bench can. Every PR that modifies `agent/`, `llm/`, `perception/`, or `execution/` must keep the bench green. This is a hard rule.
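The step-count case above can be illustrated with a tolerance check in the spirit of the StepCountWithin assertion described later on this page (a sketch, not the real evaluator):

```kotlin
// Sketch only: mirrors the idea of the StepCountWithin assertion.
fun stepCountWithin(recorded: Int, tolerance: Int, observed: Int): Boolean =
    observed in (recorded - tolerance)..(recorded + tolerance)

fun main() {
    // A 5-step login recording with tolerance 1 still passes at 4-6 steps...
    println(stepCountWithin(recorded = 5, tolerance = 1, observed = 5)) // true
    // ...but an agent that now takes 8 steps fails the bench.
    println(stepCountWithin(recorded = 5, tolerance = 1, observed = 8)) // false
}
```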
How a bench flow is recorded
- Run the scenario with `automator run --scenario my-flow.yaml` until the agent reaches `Done`.
- Inspect the trajectory in the right panel. Confirm every step is doing what you expected.
- Click Save as bench, give it a name, and pick assertions: URL match, DOM hash match, visible-text match, or a custom predicate.
- The flow is now part of `SeedBenchFlows`. Future runs replay it and compare to the recording.
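What a saved flow carries can be sketched roughly like this (the type and field names are illustrative assumptions; the real `SeedBenchFlows` entries may differ):

```kotlin
// Illustrative only: a plausible shape for a recorded bench flow.
data class BenchStep(val action: String, val observationHash: String)
data class BenchFlow(val name: String, val steps: List<BenchStep>)

fun main() {
    val flow = BenchFlow(
        name = "github-login",
        steps = listOf(
            BenchStep(action = "navigate login page", observationHash = "a1b2"),
            BenchStep(action = "submit credentials", observationHash = "c3d4"),
        ),
    )
    println("${flow.name}: ${flow.steps.size} recorded steps") // github-login: 2 recorded steps
}
```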
AssertionEvaluator
Assertions live in `bench/AssertionEvaluator.kt`. They are pure: same recording, same new run, same verdict. The built-in assertion types:
```kotlin
sealed class Assertion {
    // Final URL must match (exact or regex)
    data class UrlMatches(val pattern: Regex) : Assertion()

    // Final DOM hash must match the recorded hash
    object DomHashStable : Assertion()

    // Visible text must contain these phrases
    data class VisibleTextContains(val phrases: List<String>) : Assertion()

    // Step count must be within ±N of the recorded count
    data class StepCountWithin(val recorded: Int, val tolerance: Int) : Assertion()

    // Custom predicate (Kotlin lambda)
    data class Custom(val name: String, val check: (RunSnapshot) -> Boolean) : Assertion()
}
```

Running the bench
```shell
# Run the entire bench corpus
./gradlew :composeApp:desktopTest --tests "*BenchRunnerTest"

# Run a single bench flow from CLI
automator bench run github-login

# Diff a failing run against the canonical recording
automator bench diff github-login --run-id <run-id>
```
Bench runs are also part of CI: `./gradlew :composeApp:desktopTest --no-daemon --no-configuration-cache --stacktrace` executes them on every PR.
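As a sketch of the Custom assertion type from the AssertionEvaluator section (the RunSnapshot fields here are assumptions for illustration, not the real type):

```kotlin
// Hypothetical minimal RunSnapshot; the real type lives in the Automator codebase.
data class RunSnapshot(val finalUrl: String, val stepCount: Int)

// A custom predicate in the spirit of Assertion.Custom: pure and deterministic,
// so the same snapshot always yields the same verdict.
val landedOnDashboard = { snapshot: RunSnapshot ->
    snapshot.finalUrl.endsWith("/dashboard")
}

fun main() {
    println(landedOnDashboard(RunSnapshot("https://github.com/dashboard", stepCount = 5))) // true
    println(landedOnDashboard(RunSnapshot("https://github.com/login", stepCount = 5)))     // false
}
```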
Cross-platform bench (in progress)
Per AUTOMATOR-CROSSPLATFORM-PLAN-2026-05-01, Tanvrit Automator's bench corpus is being extended to run against the full Tanvrit fleet — all 11 platforms' /app/ URLs. Each platform contributes representative flows so that any regression in the Tanvrit SDK or any individual app surfaces as a bench failure for the whole fleet.
This is an active workstream. The single-app bench (your scenarios against your app) is fully functional today.