Feature · Perception
DOM first. Vision when DOM stops working.
Tanvrit Automator prefers DOM perception because it is cheaper, faster, and more accurate than asking a vision model what is on screen. Vision is the fallback for the small but real population of pages where the DOM is mostly canvas, mostly a single iframe, or mostly empty.
DOM perception (default)
DOMPerception asks Playwright for the page's accessibility tree. The tree carries every interactive node (buttons, links, inputs, comboboxes), each with its ARIA role, accessible name, computed bounds, and visibility.
We canonicalise the tree into a list of UIElement records, prune nodes that are off-screen or aria-hidden, and emit a PageSummary containing URL, title, viewport size, and the visible UIElement list. PageSummary is what the planner sees in its prompt.
```kotlin
data class UIElement(
    val id: Int,
    val role: String,
    val name: String,
    val value: String?,
    val bounds: Bounds,
    val ancestorChain: List<String>,
    val isInteractive: Boolean,
)

data class PageSummary(
    val url: String,
    val title: String,
    val viewport: Size,
    val elements: List<UIElement>,
)
```
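To make the canonicalisation step concrete, here is a minimal sketch of the pruning and mapping, assuming the raw accessibility nodes have already been fetched from Playwright. RawAXNode, its fields, and the intersects helper are hypothetical stand-ins for whatever shape your extraction returns; Bounds is assumed to carry x/y/width/height and Size to carry width/height.

```kotlin
// Hypothetical raw node shape; the real data comes from Playwright's a11y tree.
data class RawAXNode(
    val role: String,
    val name: String,
    val value: String?,
    val bounds: Bounds?,           // null when the node has no layout box
    val ariaHidden: Boolean,
    val ancestorRoles: List<String>,
)

private val INTERACTIVE_ROLES = setOf("button", "link", "textbox", "combobox", "checkbox")

// Assumed Bounds(x, y, width, height) and Size(width, height).
private fun intersects(b: Bounds, vp: Size): Boolean =
    b.x < vp.width && b.y < vp.height && b.x + b.width > 0 && b.y + b.height > 0

fun canonicalise(nodes: List<RawAXNode>, viewport: Size, url: String, title: String): PageSummary {
    val visible = nodes
        .filter { !it.ariaHidden }                                        // prune aria-hidden nodes
        .filter { it.bounds != null && intersects(it.bounds, viewport) }  // prune off-screen nodes
    val elements = visible.mapIndexed { i, n ->
        UIElement(
            id = i,
            role = n.role,
            name = n.name,
            value = n.value,
            bounds = n.bounds!!,
            ancestorChain = n.ancestorRoles,
            isInteractive = n.role in INTERACTIVE_ROLES,
        )
    }
    return PageSummary(url = url, title = title, viewport = viewport, elements = elements)
}
```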
Why DOM is preferred
- Latency. Extracting the a11y tree takes 50–200 ms. Vision inference takes 2–10 seconds even on a fast GPU.
- Determinism. The same page produces the same tree. Vision models are stochastic.
- Selectors. A DOM-derived UIElement has a stable CSS selector or ARIA path the executor can target (see the sketch after this list). A vision caption has only pixel coordinates that drift on resize.
- Cost. DOM perception is free. Vision inference consumes RAM, VRAM, and (if you opt into cloud LLMs) tokens.
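To make the selectors point concrete, here is what targeting a DOM-derived element can look like through Playwright's Java bindings, which Kotlin calls directly. The clickByAria helper is a hypothetical example, not part of Tanvrit Automator's shipped API.

```kotlin
import com.microsoft.playwright.Page
import com.microsoft.playwright.options.AriaRole

// A role plus accessible name survives re-renders and window resizes,
// unlike the pixel coordinates a vision caption yields.
fun clickByAria(page: Page, element: UIElement) {
    page.getByRole(
        AriaRole.valueOf(element.role.uppercase()),
        Page.GetByRoleOptions().setName(element.name),
    ).click()
}
```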
When vision activates
The agent counts consecutive empty DOM perceptions. Once the count reaches DOM_EMPTY_THRESHOLD (2 by default), VisionPerception takes over for the next step.
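A sketch of that counter logic, under the assumption that perception strategies share a common interface; PerceptionRouter and PerceptionStrategy are illustrative names, not the shipped classes.

```kotlin
import com.microsoft.playwright.Page

const val DOM_EMPTY_THRESHOLD = 2

interface PerceptionStrategy {
    fun perceive(page: Page): PageSummary
}

class PerceptionRouter(
    private val dom: PerceptionStrategy,
    private val vision: PerceptionStrategy,
) {
    private var emptyDomStreak = 0

    fun perceive(page: Page): PageSummary {
        // Once the streak hits the threshold, hand the next step to vision.
        // Whether the router later retries DOM is a policy choice elided here.
        if (emptyDomStreak >= DOM_EMPTY_THRESHOLD) return vision.perceive(page)
        val summary = dom.perceive(page)
        emptyDomStreak = if (summary.elements.none { it.isInteractive }) emptyDomStreak + 1 else 0
        return summary
    }
}
```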
Vision uses qwen2.5-vl via Ollama to caption the screenshot, plus Tesseract OCR to extract any visible text the vision model missed. The combined output is mapped back into PageSummary form so the planner sees the same shape regardless of which strategy produced it.
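A sketch of the caption call, assuming a local Ollama server on its default port (11434). The request shape follows Ollama's /api/generate endpoint; the prompt text is illustrative, and response parsing is elided (the caption arrives in the JSON body's "response" field).

```kotlin
import java.net.URI
import java.net.http.HttpClient
import java.net.http.HttpRequest
import java.net.http.HttpResponse
import java.util.Base64

// Ask qwen2.5-vl to caption a screenshot via Ollama's generate endpoint.
fun captionScreenshot(screenshotPng: ByteArray): String {
    val image = Base64.getEncoder().encodeToString(screenshotPng)
    val body = """
        {"model": "qwen2.5-vl",
         "prompt": "Describe the interactive regions of this UI.",
         "images": ["$image"],
         "stream": false}
    """.trimIndent()
    val request = HttpRequest.newBuilder()
        .uri(URI.create("http://localhost:11434/api/generate"))
        .header("Content-Type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString(body))
        .build()
    return HttpClient.newHttpClient()
        .send(request, HttpResponse.BodyHandlers.ofString())
        .body()
}
```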
Worked example — a SPA with mostly-canvas DOM
Consider a Figma-style design tool. The page is a single <canvas> element with no a11y children, so DOM perception returns an effectively empty list of interactive nodes. The agent observes:
```
Step 0: PERCEIVE — DOMPerception → 0 interactive nodes
Step 1: PERCEIVE — DOMPerception → 0 interactive nodes
        DOM_EMPTY_THRESHOLD reached → switch strategy
Step 2: PERCEIVE — VisionPerception
        qwen2.5-vl: "Top toolbar with File, Edit, View menus.
                     Left sidebar with shape tools. Centre canvas
                     with a blue rectangle. Right panel with
                     properties for the selected shape."
        Tesseract:  ["File", "Edit", "View", "Width: 240"]
Step 3: PLAN — choose action targeting vision-derived bounds
Step 4: EXECUTE — VisionExecutor clicks at (x, y)
```

The planner now has a usable representation. The executor switches to VisionExecutor for actions that target pixel coordinates. The rest of the loop is unchanged.
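The pixel-coordinate click itself is small; here is a sketch of what VisionExecutor does at step 4, assuming (x, y) comes from the vision-derived bounds.

```kotlin
import com.microsoft.playwright.Page

// Vision targets carry only pixel coordinates, so the executor clicks
// the viewport directly instead of resolving a DOM selector.
fun clickAt(page: Page, x: Double, y: Double) {
    page.mouse().click(x, y)
}
```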
Honest caveats
Vision is the deep fallback, not the default. We do not recommend setting DOM_EMPTY_THRESHOLD to 0: vision is slower, less accurate, and consumes more memory. If your target app is mostly DOM, the default threshold will keep vision dormant and the agent fast.
Tesseract requires its language pack to be installed. Garbled OCR output usually means the traineddata file for your language is missing. See Troubleshooting.
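If you reach Tesseract through Tess4J, pointing it at the tessdata directory explicitly is a quick way to confirm the pack is wired up; the path below is illustrative.

```kotlin
import net.sourceforge.tess4j.Tesseract
import java.io.File

// Garbled output usually means <lang>.traineddata is absent from tessdata.
fun ocr(image: File): String {
    val tess = Tesseract()
    tess.setDatapath("/usr/share/tesseract-ocr/5/tessdata") // illustrative path
    tess.setLanguage("eng")                                  // needs eng.traineddata present
    return tess.doOCR(image)
}
```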