OpenAdapt-ML
The ML engine for OpenAdapt -- open-source desktop automation with demo-conditioned AI agents.
OpenAdapt-ML provides the GUI-specific ML layer for training and running vision-language model (VLM) agents that automate desktop tasks. It handles everything between raw screen recordings and a production policy API: canonical schemas for GUI trajectories, VLM adapters, supervised fine-tuning with TRL + Unsloth, grounding, and demo-conditioned inference.
Demos
Synthetic Login -- Qwen3-VL-2B fine-tuned on synthetic UI scenarios.
Key Features
- GUI trajectory schemas -- Pydantic models for Episodes, Steps, Actions, and Observations with JSON Schema export and format converters (WAA, WebArena)
- VLM adapters -- Unified interface for Qwen3-VL, Qwen2.5-VL, Claude, GPT, and Gemini with automatic device selection (CUDA / MPS / CPU)
- Supervised fine-tuning -- TRL SFTTrainer with Unsloth optimizations for 2x faster training and 50% less VRAM via LoRA adapters
- Runtime policy API -- AgentPolicy, which predicts the next GUI action (CLICK, TYPE, DONE) from a screenshot and goal
- Demo-conditioned inference -- Retrieval-augmented prompting with recorded demonstrations for trajectory-conditioned disambiguation (see the conceptual sketch after this list)
- Grounding module -- Locate UI elements via Gemini vision API, oracle bounding boxes, or Set-of-Marks (SoM) overlays
- Cloud GPU training -- One-command training pipelines for Lambda Labs and Azure
- Synthetic data generation -- Configurable UI scenarios (login, registration) with layout jitter for rapid iteration
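To make demo-conditioned inference concrete: retrieval here means embedding the current screenshot, finding the most similar step in a recorded demonstration, and conditioning the prompt on it. The minimal sketch below is purely conceptual, not the retrieval/ module's actual API; all names are illustrative.

import math

# Conceptual sketch of demo retrieval -- not the retrieval/ module's API.
def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve_demo_step(query_emb: list[float], demo_embs: list[list[float]]) -> int:
    """Index of the recorded demo step most similar to the current screen."""
    return max(range(len(demo_embs)), key=lambda i: cosine(query_emb, demo_embs[i]))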
Installation
pip install openadapt-ml
# With training dependencies (TRL + datasets)
pip install openadapt-ml[training]
# With API-backed VLMs (Claude, GPT)
pip install openadapt-ml[api]
# Development (from source)
git clone https://github.com/OpenAdaptAI/openadapt-ml.git
cd openadapt-ml
uv sync
Quick Start
Run a smoke test
uv run python -m openadapt_ml.scripts.demo_policy --backend dummy
Train on synthetic data
uv run python -m openadapt_ml.scripts.train \
--config configs/qwen3vl_synthetic.yaml
Train on real recordings
uv run python -m openadapt_ml.scripts.train \
--config configs/qwen3vl_capture.yaml \
--capture ~/captures/my-workflow \
--open # Opens training dashboard in browser
End-to-end benchmark (train + eval + plot)
--config configs/qwen3vl_synthetic_dev.yaml \
--out-dir experiments/qwen_login/2b_dev
Use the policy API
from openadapt_ml.models.qwen_vl import QwenVLAdapter
from openadapt_ml.runtime.policy import AgentPolicy
adapter = QwenVLAdapter(model_name="Qwen/Qwen3-VL-2B-Instruct")
policy = AgentPolicy(adapter)
# Given an SFT-style sample (screenshot + goal + chat history):
output = policy.predict(sample)
print(output.action) # Action(type=CLICK, coordinates={"x": 0.45, "y": 0.71})
print(output.thought) # "Click the Login button"
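The predicted coordinates above are normalized to [0, 1] (note the 0.45 / 0.71 values), so mapping them to pixels only requires the display size. A short sketch; the 2560x1440 resolution is illustrative:

# Map normalized policy coordinates to absolute pixels.
screen_w, screen_h = 2560, 1440  # illustrative; query the real display size at runtime
x_px = round(output.action.coordinates["x"] * screen_w)
y_px = round(output.action.coordinates["y"] * screen_h)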
Use the schema
from openadapt_ml.schema.episode import Action, ActionType, Episode, Observation, Step

episode = Episode(
episode_id="demo_001",
instruction="Open Notepad and type Hello World",
steps=[
Step(
step_index=0,
observation=Observation(screenshot_path="step_0.png"),
action=Action(type=ActionType.CLICK, coordinates={"x": 100, "y": 200}),
),
Step(
step_index=1,
observation=Observation(screenshot_path="step_1.png"),
action=Action(type=ActionType.TYPE, text="Hello World"),
),
],
success=True,
)
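Since these are Pydantic models, the JSON Schema export advertised in the feature list and JSON round-tripping should follow Pydantic's standard surface; a minimal sketch, assuming Pydantic v2:

import json

# Export the JSON Schema for Episode (Pydantic v2 API).
print(json.dumps(Episode.model_json_schema(), indent=2))

# Round-trip the episode through JSON.
restored = Episode.model_validate_json(episode.model_dump_json())
assert restored.instruction == episode.instruction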
Architecture
openadapt_ml/
+-- schema/ # Episode, Step, Action, Observation (Pydantic models)
| +-- episode.py # Core dataclasses + JSON Schema export
| +-- converters.py # WAA/WebArena format converters
+-- models/ # VLM adapters
| +-- base_adapter.py # BaseVLMAdapter ABC
| +-- qwen_vl.py # Qwen3-VL, Qwen2.5-VL
| +-- api_adapter.py # Claude, GPT (inference-only)
| +-- dummy_adapter.py # Fake adapter for testing
+-- training/ # Fine-tuning pipeline
| +-- trl_trainer.py # TRL SFTTrainer + Unsloth
| +-- trainer.py # Training orchestration
| +-- viewer.py # Training dashboard (HTML)
+-- runtime/ # Inference
| +-- policy.py # AgentPolicy (screenshot -> action)
| +-- safety_gate.py # Action safety checks
+-- datasets/ # Data loading
| +-- next_action.py # Episodes -> SFT chat samples
+-- ingest/ # Data ingestion
| +-- synthetic.py # Synthetic UI generation
| +-- capture.py # openadapt-capture loader
| +-- loader.py # Generic episode loader
+-- grounding/ # UI element localization
| +-- base.py # OracleGrounder, GroundingModule ABC
| +-- detector.py # GeminiGrounder, SoM overlays
+-- retrieval/ # Demo-conditioned inference
| +-- retriever.py # Demo retrieval for RAG prompting
| +-- embeddings.py # Screenshot/action embeddings
+-- benchmarks/ # ML-specific benchmark agents
| +-- agent.py # PolicyAgent, APIBenchmarkAgent, UnifiedBaselineAgent
+-- cloud/ # Cloud GPU training
| +-- lambda_labs.py # Lambda Labs integration
| +-- local.py # Local training (CUDA/MPS)
| +-- ssh_tunnel.py # SSH tunnel management
+-- segmentation/ # Recording segmentation pipeline
+-- evals/ # Evaluation metrics (grounding, trajectory matching)
+-- config.py # Settings via pydantic-settings
+-- scripts/ # CLI entry points (train, eval, compare, demo)
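As one example of how the runtime layer composes, runtime/safety_gate.py sits between the policy's predicted action and execution. Its actual interface is not documented here; the sketch below is purely conceptual, with illustrative names and checks:

# Conceptual action safety gate -- not the actual safety_gate.py interface.
DESTRUCTIVE_SNIPPETS = ("rm -rf", "del /f", "format ")

def gate_action(action_type: str, text: str | None = None) -> bool:
    """Return True if the action looks safe to execute."""
    if action_type == "TYPE" and text:
        lowered = text.lower()
        return not any(s in lowered for s in DESTRUCTIVE_SNIPPETS)
    return True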
Benchmark Results
Synthetic Login (Qwen3-VL-2B with Set-of-Marks)
| Metric | Score |
|---|---|
| Action Type Accuracy | 100% |
| Element Accuracy | 100% |
| Episode Success Rate | 100% |
Multi-Model Comparison (Synthetic Login, coordinate mode)
| Model | Action Accuracy | Coord Error (normalized) | Click Hit Rate |
|---|---|---|---|
| Qwen3-VL-2B FT | 0.469 | 0.051 | 0.850 |
| Qwen3-VL-8B FT | 0.286 | 0.004 | 1.000 |
| Claude Sonnet 4.5 | 0.121 | 0.757 | 0.000 |
| GPT-5.1 | 0.183 | 0.057 | 0.600 |
These results come from a controlled synthetic benchmark with roughly three UI elements. They validate that the training pipeline works; they do not indicate real-world performance. Evaluation on standard benchmarks (WAA, WebArena) is ongoing via openadapt-evals.
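For reference, Click Hit Rate reads most naturally as the fraction of predicted clicks that land inside the target element's bounding box; a conceptual definition in normalized coordinates (the actual metric in evals/ may differ):

# Conceptual click-hit test in normalized coordinates; the metric
# implemented in evals/ may differ.
def click_hit(x: float, y: float, bbox: tuple[float, float, float, float]) -> bool:
    x0, y0, x1, y1 = bbox  # (left, top, right, bottom), all in [0, 1]
    return x0 <= x <= x1 and y0 <= y <= y1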
Cloud GPU Training
Lambda Labs
# One-command: launch, train, download, terminate
uv run python -m openadapt_ml.cloud.lambda_labs train \
--capture ~/captures/my-workflow \
--goal "Turn off Night Shift in System Settings"
Local (CUDA / Apple Silicon)
--capture ~/captures/my-workflow --open
Ecosystem
OpenAdapt-ML is one component in the OpenAdapt stack:
| Package | Purpose |
|---|---|
| openadapt-ml | ML engine: schemas, VLM adapters, training, inference, grounding |
| openadapt-evals | Evaluation infrastructure: VM management, pool orchestration, benchmark runners, oa-vm CLI |
| openadapt-capture | Lightweight GUI recording and demo sharing |
| OpenAdapt | Desktop automation platform (end-user application) |
Looking for benchmark evaluation, Azure VM management, or the oa-vm CLI? Those live in openadapt-evals.
Documentation
- docs/design.md -- System design (schemas, adapters, training, runtime)
- docs/cloud_gpu_training.md -- Lambda Labs and Azure training guide
- docs/qwen_login_experiment.md -- Synthetic benchmark reproduction
- docs/gemini_grounding.md -- Grounding module documentation
Contributing
git clone https://github.com/OpenAdaptAI/openadapt-ml.git
cd openadapt-ml
uv sync --extra dev --extra training
# Run tests
uv run pytest
# Lint
uv run ruff check .
We use Angular-style commits (feat:, fix:, docs:, etc.) with Python Semantic Release for automated versioning and PyPI publishing.