Human Expertise for Artificial Workers

Screens are just another environment. We're teaching AI to navigate them.

VLA (vision-language-action) models transformed robotics by learning from video instead of static images. The same paradigm shift is coming for computer use. We're building the training data for that generation.

[ Visual: Photoshop workflow demo (Select Subject → Remove Background → Continue editing) shown as a screen recording alongside a real-time semantic action stream ]

The Paradigm Shift

What VLA did for robotics, we're doing for computer use.

Early robot learning processed static images. It didn't work. VLA models—video in, actions out—changed everything. Robots learned temporal patterns, error correction, fluid motion.

Current computer use agents process screenshots. Same problem. Same solution.

Current Paradigm

Screenshot → Reason → Click

Static frames. No state. 3-10 seconds per action. Breaks on anything dynamic.

Next Paradigm

Video → Continuous Action

Temporal perception. State tracking. Real-time. Learns from human demonstrations.
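To make the contrast concrete, here is a minimal sketch of the two interfaces in Python. The names (Action, ScreenshotPolicy, AgentState, VideoPolicy) are illustrative assumptions for this sketch, not any lab's actual API.

```python
# Illustrative sketch only: hypothetical interfaces, not a real agent implementation.
from dataclasses import dataclass, field
from typing import Iterable, List, Protocol


@dataclass
class Action:
    kind: str                # e.g. "click", "type", "scroll"
    target: str              # UI element identifier or screen coordinates
    payload: str = ""        # text to type, scroll delta, etc.


class ScreenshotPolicy(Protocol):
    """Current paradigm: one static frame in, one action out, then wait for the next screenshot."""

    def step(self, task: str, screenshot: bytes) -> Action: ...


@dataclass
class AgentState:
    """Persistent state a video-native agent carries across frames."""

    actions_so_far: List[Action] = field(default_factory=list)


class VideoPolicy(Protocol):
    """Next paradigm: a continuous frame stream in, a continuous action stream out."""

    def run(self, task: str, frames: Iterable[bytes], state: AgentState) -> Iterable[Action]: ...
```

The structural difference is the point: the screenshot policy is stateless and blocks between actions, while the video policy consumes a stream, keeps state, and can act at the cadence of the UI itself.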

The models will come from Anthropic, OpenAI, and the other frontier labs. The training data will come from us. We're not competing on models. We're building the data infrastructure for the entire category.

The Data Gap

Models can't automate what they've never seen.

The internet has documentation and tutorials. It doesn't have expert demonstrations. The gap isn't model capability—it's training data format and coverage.

No Long-Horizon Data

Tutorials show 5-step tasks. Real work is 200-step sessions with context, decisions, corrections.

No Internal Workflows

Your Salesforce config, admin panels, enterprise tools—zero coverage on the internet.

Wrong Format

Current datasets are screenshot-based. Video-native models need video-native data.

Procedural ≠ Declarative

You can't learn workflows from documentation. You need demonstrations. That's what we collect.

What We Collect

Screen recording is pixels. We add the semantic layer.

Raw video isn't training data. We capture the full signal: what happened, what it means, and why.

  • Semantics: what was clicked—element type, hierarchy, state changes via platform APIs
  • Video: 30fps screen recording with cursor position, clicks, keystrokes
  • Intent: voice annotation explaining reasoning and decision-making
[ Visual: Data stack diagram — Video + UI Tree + Action Sequence + Intent ]
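As a rough illustration, one captured step could be represented along these lines. The field names below are assumptions made for the sketch, not our production schema.

```python
# Illustrative training-record sketch: hypothetical field names, not a production format.
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class UIElement:
    role: str                                    # e.g. "button", "menu_item", from platform accessibility APIs
    name: str                                    # e.g. "Remove Background"
    path: List[str]                              # ancestor chain in the UI tree
    state_before: Dict[str, str] = field(default_factory=dict)
    state_after: Dict[str, str] = field(default_factory=dict)


@dataclass
class DemonstrationStep:
    video_span: Tuple[int, int]                  # (start_frame, end_frame) into the 30fps recording
    cursor_track: List[Tuple[float, int, int]]   # (time_s, x, y) samples
    input_events: List[Dict[str, str]]           # clicks and keystrokes with timestamps
    target: Optional[UIElement]                  # semantic layer: the element actually acted on
    intent: str                                  # transcribed voice annotation: why this step was taken


@dataclass
class Demonstration:
    task: str                                    # e.g. "Remove the background from a product photo"
    steps: List[DemonstrationStep] = field(default_factory=list)
```

Because all three layers index the same timeline, a model trained on these records sees not just the pixels but the element that changed and the reason the human acted.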

We scale collection through a global contributor network — professionals earn tokens for recording their workflows. Expert demonstrations, not Mechanical Turk.

The Endgame

Every job on a computer can be automated.

Full digital automation requires two things: models that can operate computers reliably, and training data for specific workflows. The models are coming. The data doesn't exist.

Computer Use Models + Workflow-Specific Data = Full Automation

We're building the data layer for the automation of all digital work.

Useful today: workflow documentation, internal knowledge, onboarding. Essential tomorrow: training data for automation.

Team

We built computer use agents. We know what's missing.

Shirley Yan

Cofounder & CEO
  • Built HoverGPT — computer use agent before Anthropic's demo
  • AI dashcam & sensor data pipelines at Motive
  • UC Berkeley Computer Science
  • Angel investor — YC co-investments

Eric Liang

Cofounder & CTO
  • Anthropic — RLHF Training for Claude Skills
  • Apple Vision Pro — Spatial Understanding Tech Lead
  • Top 10 globally in macOS system internals
  • 2X Founder — YC W24, Aesop Labs

Get Started

The window is now.

We're partnering with AI labs and enterprises building the next generation of computer use. Early access available.

The last human demonstrations.