Feature · Bench corpus
Lock in good behaviour. Detect regressions before you ship.
The trajectory recorder logs every observation, action, and verification result from every agent run. The bench runner uses those recordings: take a known-good run, declare it the canonical bench flow, then compare every subsequent run against it. If the agent deviates, the bench fails. If it passes, you ship with confidence.
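As a minimal sketch of that record-then-compare loop (all names here are hypothetical, not the Automator's real types, and the real comparison is assertion-based rather than exact step equality):

```kotlin
// Hypothetical sketch of the record-then-compare loop described above.
// The real bench compares via assertions, not exact step equality.
data class Run(val steps: List<String>)

fun matchesCanonical(canonical: Run, candidate: Run): Boolean =
    canonical.steps == candidate.steps

fun main() {
    val canonical = Run(listOf("open login page", "type credentials", "submit"))
    val good = Run(listOf("open login page", "type credentials", "submit"))
    val drifted = Run(listOf("open login page", "click forgot password", "type credentials", "submit"))
    println(matchesCanonical(canonical, good))     // true: bench passes
    println(matchesCanonical(canonical, drifted))  // false: agent deviated, bench fails
}
```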
Why a bench gate exists
Tanvrit Automator is an autonomous agent. A small change to PromptBuilder, the planner model, the perception strategy, or the executor can subtly change which actions the agent picks — even if every unit test still passes.
Unit tests cannot catch "the agent now takes 8 steps instead of 5 to log into GitHub". The bench can. Every PR that modifies `agent/`, `llm/`, `perception/`, or `execution/` must keep the bench green. This is a hard rule.
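The step-count case above can be illustrated with a tolerance check in the spirit of the StepCountWithin assertion described later on this page (a sketch, not the real evaluator):

```kotlin
// Sketch only: mirrors the idea of the StepCountWithin assertion.
fun stepCountWithin(recorded: Int, tolerance: Int, observed: Int): Boolean =
    observed in (recorded - tolerance)..(recorded + tolerance)

fun main() {
    // A 5-step login recording with tolerance 1 still passes at 4-6 steps...
    println(stepCountWithin(recorded = 5, tolerance = 1, observed = 5)) // true
    // ...but an agent that now takes 8 steps fails the bench.
    println(stepCountWithin(recorded = 5, tolerance = 1, observed = 8)) // false
}
```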
How a bench flow is recorded
- Run the scenario with `automator run --scenario my-flow.yaml` until the agent reaches `Done`.
- Inspect the trajectory in the right panel. Confirm every step is doing what you expected.
- Click Save as bench, give it a name, and pick assertions: URL match, DOM hash match, visible-text match, or a custom predicate.
- The flow is now part of `SeedBenchFlows`. Future runs replay it and compare to the recording.
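What a saved flow carries can be sketched roughly like this (the type and field names are illustrative assumptions; the real `SeedBenchFlows` entries may differ):

```kotlin
// Illustrative only: a plausible shape for a recorded bench flow.
data class BenchStep(val action: String, val observationHash: String)
data class BenchFlow(val name: String, val steps: List<BenchStep>)

fun main() {
    val flow = BenchFlow(
        name = "github-login",
        steps = listOf(
            BenchStep(action = "navigate login page", observationHash = "a1b2"),
            BenchStep(action = "submit credentials", observationHash = "c3d4"),
        ),
    )
    println("${flow.name}: ${flow.steps.size} recorded steps") // github-login: 2 recorded steps
}
```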
AssertionEvaluator
Assertions live in `bench/AssertionEvaluator.kt`. They are pure: same recording, same new run, same verdict. The built-in assertion types:
```kotlin
sealed class Assertion {
    // Final URL must match (exact or regex)
    data class UrlMatches(val pattern: Regex) : Assertion()

    // Final DOM hash must match the recorded hash
    object DomHashStable : Assertion()

    // Visible text must contain these phrases
    data class VisibleTextContains(val phrases: List<String>) : Assertion()

    // Step count must be within ±N of the recorded count
    data class StepCountWithin(val recorded: Int, val tolerance: Int) : Assertion()

    // Custom predicate (Kotlin lambda)
    data class Custom(val name: String, val check: (RunSnapshot) -> Boolean) : Assertion()
}
```

Running the bench
```shell
# Run the entire bench corpus
./gradlew :composeApp:desktopTest --tests "*BenchRunnerTest"

# Run a single bench flow from CLI
automator bench run github-login

# Diff a failing run against the canonical recording
automator bench diff github-login --run-id <run-id>
```
Bench runs are also part of CI: `./gradlew :composeApp:desktopTest --no-daemon --no-configuration-cache --stacktrace` executes them on every PR.
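As a sketch of the Custom assertion type from the AssertionEvaluator section (the RunSnapshot fields here are assumptions for illustration, not the real type):

```kotlin
// Hypothetical minimal RunSnapshot; the real type lives in the Automator codebase.
data class RunSnapshot(val finalUrl: String, val stepCount: Int)

// A custom predicate in the spirit of Assertion.Custom: pure and deterministic,
// so the same snapshot always yields the same verdict.
val landedOnDashboard = { snapshot: RunSnapshot ->
    snapshot.finalUrl.endsWith("/dashboard")
}

fun main() {
    println(landedOnDashboard(RunSnapshot("https://github.com/dashboard", stepCount = 5))) // true
    println(landedOnDashboard(RunSnapshot("https://github.com/login", stepCount = 5)))     // false
}
```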
Cross-platform bench (in progress)
Per AUTOMATOR-CROSSPLATFORM-PLAN-2026-05-01, Tanvrit Automator's bench corpus is being extended to run against the full Tanvrit fleet — all 11 platforms' /app/ URLs. Each platform contributes representative flows so that any regression in the Tanvrit SDK or any individual app surfaces as a bench failure for the whole fleet.
This is an active workstream. The single-app bench (your scenarios against your app) is fully functional today.