
Hybrid UI Detection: Why We Split Vision and Intelligence


The Claim

VLM-only UI detection is the wrong default. Splitting text detection (Apple Vision) from icon detection (fine-tuned YOLO) produces better accuracy at roughly 1/100th the cost, and the architecture explains why.


The Mechanism

uitag processes screenshots through a seven-stage pipeline:

  1. Apple Vision for text and rectangle detection
  2. YOLO tiled detection for icons and non-text UI elements
  3. Overlap deduplication across both detection sources
  4. OCR correction for misread labels
  5. Text block grouping for multi-line elements
  6. Set-of-Mark annotation with numbered markers
  7. JSON manifest generation with bounding boxes, labels, and coordinates

The key architectural insight: text detection and icon detection are fundamentally different problems. Apple Vision already solves text with near-perfect accuracy and sub-second latency. Routing only the unsolved problem (icons) to a heavier model keeps total inference time under 5 seconds while covering 90.8% of UI elements.
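Stage 3, overlap deduplication, is the glue between the two detectors. A minimal sketch follows; this is an illustrative simplification, not uitag's actual implementation, and both the 0.5 IoU threshold and the choice to prefer Apple Vision boxes on overlap are assumptions.

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def dedup(vision_boxes, yolo_boxes, thresh=0.5):
    """Merge detections from both sources, dropping any YOLO box
    that substantially overlaps an Apple Vision box."""
    kept = list(vision_boxes)
    for yb in yolo_boxes:
        if all(iou(yb, vb) < thresh for vb in kept):
            kept.append(yb)
    return kept
```

A YOLO box sitting on top of a Vision text box is treated as a duplicate and dropped, so each UI element appears once in the final manifest.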


The Evidence

ScreenSpot-Pro benchmark

Tested against 1,581 annotations across 26 professional macOS applications. Metric: center-hit (does any detection’s bounding box contain the center of the ground-truth target?).

| Pipeline | Text | Icon | Overall |
|---|---|---|---|
| Apple Vision + YOLO | 92.7% | 87.6% | 90.8% |
| YOLO only | 82.4% | 75.7% | 80.1% |
| Apple Vision only | 66.4% | 42.5% | 57.3% |
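The center-hit metric is straightforward to implement. A minimal sketch, with boxes as (x1, y1, x2, y2) tuples and names that are illustrative rather than uitag's API:

```python
def center_hit(detections, gt_box):
    """True if any detected box contains the ground-truth center point."""
    cx = (gt_box[0] + gt_box[2]) / 2
    cy = (gt_box[1] + gt_box[3]) / 2
    return any(x1 <= cx <= x2 and y1 <= cy <= y2
               for (x1, y1, x2, y2) in detections)
```

Center-hit is deliberately looser than IoU: it asks whether a click at the element’s center would land inside some detection, which mirrors how an agent actually uses the boxes.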

Additional out-of-distribution benchmarks (YOLO only, no Apple Vision):

| Benchmark | Metric | Score |
|---|---|---|
| GroundCUA (500 images, 30K GT elements) | Recall@IoU≥0.5 | 94.0% |
| GroundCUA | Precision@IoU≥0.5 | 83.6% |
| UI-Vision (1,181 images) | Recall@IoU≥0.5 | 83.5% |
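Recall and precision at IoU ≥ 0.5 can be computed with greedy one-to-one matching of predictions to ground truth. A sketch under that assumption; real evaluators typically sort predictions by confidence first, which this omits:

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def recall_precision(preds, gts, thresh=0.5):
    """Greedily match each prediction to its best unmatched GT box."""
    matched, tp = set(), 0
    for p in preds:
        best, best_iou = None, thresh
        for i, g in enumerate(gts):
            if i not in matched and iou(p, g) >= best_iou:
                best, best_iou = i, iou(p, g)
        if best is not None:
            matched.add(best)
            tp += 1
    return (tp / len(gts) if gts else 0.0,
            tp / len(preds) if preds else 0.0)
```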

What the YOLO model detects

Nine element classes derived from GroundCUA’s annotation taxonomy:

| Class | Examples |
|---|---|
| Button | Toolbar buttons, dialog buttons, toggles |
| Menu | Menu bars, context menus, dropdowns |
| Input_Elements | Text fields, search boxes, spinners |
| Navigation | Tabs, breadcrumbs, tree nodes |
| Information_Display | Status bars, tooltips, labels |
| Sidebar | Side panels, nav rails |
| Visual_Elements | Icons, thumbnails, separators |
| Others | Scrollbars, handles, dividers |
| Unknown | Ambiguous elements |

Training details

| Parameter | Value |
|---|---|
| Base model | YOLO11s (pretrained) |
| Dataset | GroundCUA tiled (224K train, 25K val tiles) |
| Tile size | 640x640, 20% overlap |
| Epochs | 100 |
| Hardware | 2x H100 PCIe 80GB (DDP) |
| Wall clock | 19.75 hours |
| mAP@0.5 (val) | 0.792 |
| Model size | 18 MB |

Augmentation choices reflect the domain: UI elements are axis-aligned, so rotation, flipping, and mosaic are disabled. Removing geometric augmentation caused overfitting within 6 epochs in diagnostic runs.
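The tiling named in the training table (640x640 tiles, 20% overlap) amounts to enumerating tile origins over the screenshot. A sketch of one reasonable scheme; the edge handling (snapping final tiles to the image border) is an assumption, not necessarily how the GroundCUA tiles were cut:

```python
def tile_origins(width, height, tile=640, overlap=0.2):
    """Top-left corners of tile x tile crops covering the image,
    stepping by 512 px so adjacent tiles share a 128 px band."""
    stride = int(tile * (1 - overlap))  # 640 * 0.8 = 512
    xs = list(range(0, max(width - tile, 0) + 1, stride))
    ys = list(range(0, max(height - tile, 0) + 1, stride))
    # Snap a final tile to the right/bottom edge if uncovered.
    if xs[-1] + tile < width:
        xs.append(width - tile)
    if ys[-1] + tile < height:
        ys.append(height - tile)
    return [(x, y) for y in ys for x in xs]
```

At inference time, detections from each tile are mapped back to screenshot coordinates by adding the tile origin before the deduplication stage.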

On-device, no API dependency

The entire pipeline runs locally on macOS. No API calls, no cloud inference, no data leaving the device.

Source: github.com/swaylenhayes/uitag
Model: huggingface.co/swaylenhayes/uitag-yolo11s-ui-detect-v1


Implications

MISSING — What this means for the VLM agent ecosystem. Who should care about the detection/intelligence split.


Open Questions

  • How does coverage degrade on non-standard UI frameworks (Electron with custom components, game UIs)?
  • What’s the minimum training set size for domain-specific YOLO fine-tuning?
  • Can the architecture generalize to mobile screenshots (iOS, Android) with a different Vision backend?

Reference Documents

| Document | What it covers |
|---|---|
| uitag README | Full pipeline documentation, installation, benchmarks |
| YOLO model card | Model architecture, training data, performance metrics |
| ScreenSpot-Pro methodology | MISSING — per-application breakdown and evaluation protocol |