[YOUR VOICE] The Claim
Coupling detection and intelligence into a single VLM call is the default in every UI agent framework. It's also why most UI agents are slow, expensive, and unreliable. The layers have different failure modes and should be engineered separately.
The Mechanism
Detection asks: what elements exist on this screen? Intelligence asks: which element should I interact with, and how?
These are different problems with different error profiles:
- Detection errors are spatial: missed elements, wrong bounding boxes, overlapping regions
- Intelligence errors are semantic: clicking the wrong button, misunderstanding context, hallucinating elements
When both run in a single VLM call, you can't diagnose which layer failed. A missed click could be a detection miss or a reasoning error. Separating the layers makes each independently testable and improvable.
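A minimal sketch of what that diagnosis could look like once the layers are split. Everything here is an illustrative assumption, not code from uitag or Leith: the `Box` type, the `attribute_failure` helper, and the 0.5 IoU threshold are all hypothetical. The idea is only that, given a known target box, the detector's output, and the index the reasoning layer picked, a failure can be pinned to one layer.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Box:
    """Axis-aligned box: top-left corner plus width and height (hypothetical format)."""
    x: float
    y: float
    w: float
    h: float


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two boxes; 0.0 when they do not overlap."""
    ix = max(0.0, min(a.x + a.w, b.x + b.w) - max(a.x, b.x))
    iy = max(0.0, min(a.y + a.h, b.y + b.h) - max(a.y, b.y))
    inter = ix * iy
    union = a.w * a.h + b.w * b.h - inter
    return inter / union if union > 0 else 0.0


def attribute_failure(
    target: Box,
    detected: Sequence[Box],
    chosen: Optional[int],
    min_iou: float = 0.5,  # assumed threshold, tune per benchmark
) -> str:
    """Classify one interaction as 'ok', 'detection_miss', or 'reasoning_error'."""
    # Indices of detected boxes that plausibly correspond to the target.
    matches = [i for i, box in enumerate(detected) if iou(target, box) >= min_iou]
    if not matches:
        # The target never reached the reasoning layer: spatial failure.
        return "detection_miss"
    if chosen not in matches:
        # The target was available but something else was picked: semantic failure.
        return "reasoning_error"
    return "ok"


target = Box(10, 10, 100, 40)
detected = [Box(12, 11, 98, 39), Box(200, 200, 50, 50)]
print(attribute_failure(target, detected, chosen=1))  # reasoning_error
```

With a single VLM call there is no `detected` list to inspect, so the two failure branches collapse into one.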
uitag handles detection. Leith handles intelligence. The interface between them is a structured JSON manifest: bounding boxes, labels, coordinates. Leith never sees raw pixels; it reasons over structured data.
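To make the interface concrete, here is a sketch of what such a manifest and its consumption might look like. The schema is assumed for illustration: the field names (`screen`, `elements`, `id`, `label`, `bbox`), the `[x, y, width, height]` box layout, and both helper functions are hypothetical and may not match what uitag actually emits or what Leith actually reads.

```python
import json

# Hypothetical manifest; the real uitag output format may differ.
MANIFEST_JSON = """
{
  "screen": {"width": 1440, "height": 900},
  "elements": [
    {"id": 0, "label": "Save",   "bbox": [1200, 40, 80, 32]},
    {"id": 1, "label": "Cancel", "bbox": [1290, 40, 80, 32]}
  ]
}
"""


def center(bbox):
    """Center point of an [x, y, width, height] box."""
    x, y, w, h = bbox
    return (x + w / 2, y + h / 2)


def to_prompt_lines(manifest):
    """Render elements as compact text lines a language model can reason over."""
    lines = []
    for el in manifest["elements"]:
        cx, cy = center(el["bbox"])
        lines.append(f'[{el["id"]}] "{el["label"]}" at ({cx:.0f}, {cy:.0f})')
    return lines


def click_point(manifest, element_id):
    """Map the element id chosen by the reasoning layer back to screen coordinates."""
    for el in manifest["elements"]:
        if el["id"] == element_id:
            return center(el["bbox"])
    raise KeyError(f"no element with id {element_id}")


manifest = json.loads(MANIFEST_JSON)
for line in to_prompt_lines(manifest):
    print(line)
```

The reasoning layer only has to return an element id. Coordinates come from the manifest, so the model cannot produce a click location for an element that was never detected.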
The Evidence
Detection layer (uitag)
90.8% element coverage on ScreenSpot-Pro. Sub-5-second processing. On-device. The detection layer is fast, deterministic, and benchmarkable.
Source: uitag README
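For readers unfamiliar with coverage-style metrics, one way such a number could be computed is sketched below. This is an assumption about the general shape of the metric, not the uitag README's definition: here a ground-truth target counts as covered when the center of its box falls inside at least one detected box, and the `covers` and `element_coverage` names are hypothetical.

```python
def covers(detected_bbox, target_bbox):
    """True if the target's center lies inside the detected [x, y, w, h] box."""
    dx, dy, dw, dh = detected_bbox
    tx, ty, tw, th = target_bbox
    cx, cy = tx + tw / 2, ty + th / 2
    return dx <= cx <= dx + dw and dy <= cy <= dy + dh


def element_coverage(samples):
    """Fraction of targets covered by at least one detection.

    `samples` is a list of (target_bbox, detected_bboxes) pairs, one per
    benchmark instance. Returns 0.0 for an empty list.
    """
    if not samples:
        return 0.0
    hits = sum(
        1 for target, detections in samples
        if any(covers(d, target) for d in detections)
    )
    return hits / len(samples)


samples = [
    ((10, 10, 20, 20), [(0, 0, 50, 50)]),    # center (20, 20) is inside the detection
    ((100, 100, 20, 20), [(0, 0, 50, 50)]),  # center (110, 110) is outside
]
print(element_coverage(samples))  # 0.5
```

A metric like this measures the detection layer alone, with no reasoning in the loop.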
Intelligence layer (Leith)
MISSING: Leith performance data on structured input vs. raw screenshot input. CoT suppression results. Multi-signal verification accuracy.
The cost argument
MISSING: Comparative cost analysis: single VLM call (detection + intelligence) vs. split architecture (Vision + YOLO + LLM reasoning on structured data).
[YOUR VOICE] Implications
MISSING: Why this matters beyond this specific project. The broader argument for perceptual separation in agent architectures.
Open Questions
- At what complexity threshold does the split architecture lose its advantage?
- Can the structured manifest format become a standard interchange format for UI agents?
- What's the right abstraction boundary between detection and intelligence?
Reference Documents
| Document | What it covers |
|---|---|
| uitag | Detection layer implementation |
| Leith _docs/ | MISSING: Intelligence layer architecture and decisions |
| Architecture decision record | MISSING: Why the split was chosen |