[YOUR VOICE] The Claim
Confidence scores from VLMs are unreliable for UI interaction tasks. A model that reports 95% confidence in a click target is wrong often enough to be dangerous. Trust needs to be calibrated from behavioral signals instead: consistency across attempts, agreement between detection methods, and success/failure history.
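The claim names three behavioral signals. As a placeholder until the mechanism section is written, here is a minimal sketch of what combining them into a single trust score could look like. The names (`BehavioralSignals`, `behavioral_trust`) and the weights are hypothetical, not Leith's actual design:

```python
from dataclasses import dataclass

@dataclass
class BehavioralSignals:
    """Behavioral evidence for a candidate UI action, each in [0, 1]."""
    consistency: float       # same target chosen across repeated detection attempts
    method_agreement: float  # fraction of detection methods agreeing on the target
    success_rate: float      # historical success rate for similar interactions

def behavioral_trust(s: BehavioralSignals,
                     weights: tuple[float, float, float] = (0.3, 0.3, 0.4)) -> float:
    """Weighted blend of behavioral signals; weights are illustrative only.

    Note the VLM's self-reported confidence appears nowhere in this score.
    """
    w_c, w_a, w_h = weights
    return w_c * s.consistency + w_a * s.method_agreement + w_h * s.success_rate
```

A linear blend is the simplest possible aggregator; the point is that the inputs are observed behavior, not the model's self-report.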
The Mechanism
MISSING — How Leith’s trust calibration works: multi-signal verification, behavioral consistency checks, episodic memory for tracking past success rates per element type
MISSING — Why confidence scores fail: specific examples of high-confidence misclicks and low-confidence correct actions
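The episodic-memory component mentioned above (tracking past success rates per element type) is still to be written up. A minimal sketch of one way such tracking could work, assuming a simple counter store with Laplace smoothing so unseen element types start at a neutral 0.5 rather than 0 or 1 (the class name `EpisodicMemory` and the smoothing choice are assumptions, not Leith's implementation):

```python
from collections import defaultdict

class EpisodicMemory:
    """Tracks interaction outcomes per UI element type."""

    def __init__(self) -> None:
        self.successes: dict[str, int] = defaultdict(int)
        self.attempts: dict[str, int] = defaultdict(int)

    def record(self, element_type: str, succeeded: bool) -> None:
        """Log one attempted interaction and its outcome."""
        self.attempts[element_type] += 1
        if succeeded:
            self.successes[element_type] += 1

    def success_rate(self, element_type: str) -> float:
        """Laplace-smoothed success rate: (s + 1) / (n + 2)."""
        return (self.successes[element_type] + 1) / (self.attempts[element_type] + 2)
```

The smoothing matters for calibration: a single lucky click should not produce a trust of 1.0, and an element type never seen before should not be trusted at 0.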
The Evidence
MISSING — Comparative data: confidence-based trust vs. behavioral trust calibration on UI task accuracy
[YOUR VOICE] Implications
MISSING — Broader lesson for any system that needs to know when to trust LLM output.
Open Questions
- Can behavioral trust signals transfer across applications (does learning to trust in Safari help in Figma)?
- What’s the minimum interaction history needed for reliable calibration?
- How does trust decay over time or across UI updates?
Reference Documents
| Document | What it covers |
|---|---|
| Leith _docs/ | MISSING — Trust calibration implementation |