Dark Mode

Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

Latest commit

History

History
183 lines (139 loc) * 5.1 KB

cloud_gpu_training.md

File metadata and controls

183 lines (139 loc) * 5.1 KB

Cloud GPU Training

This guide covers using cloud GPUs for training OpenAdapt-ML models. Two providers are supported:

Provider Setup Time Cost Best For
Lambda Labs ~1 minute $0.60-2.00/hr Quick experiments, no approval needed
Azure Hours-days $0.90-3.00/hr Free credits ($200), enterprise use

Lambda Labs (Recommended)

Lambda Labs provides affordable GPU instances with no quota approval process.

Setup

  1. Create account at cloud.lambdalabs.com
  2. Get API key at cloud.lambdalabs.com/api-keys
  3. Set environment variable:
    export LAMBDA_API_KEY=your_key_here

Usage

"># List available GPU instances and prices
uv run python -m openadapt_ml.cloud.lambda_labs list

# Launch an A100 instance (~$1.10/hr)
uv run python -m openadapt_ml.cloud.lambda_labs launch --type gpu_1x_a100

# Check running instances
uv run python -m openadapt_ml.cloud.lambda_labs status

# Get SSH command for your instance
uv run python -m openadapt_ml.cloud.lambda_labs ssh

# IMPORTANT: Terminate when done (you're billed by the hour!)
uv run python -m openadapt_ml.cloud.lambda_labs terminate <instance_id>

Instance Types

Type GPU VRAM Price/hr Use Case
gpu_1x_a10 1x A10 24GB ~$0.60 2B-8B models, fine-tuning
gpu_1x_a100 1x A100 40GB ~$1.10 Best value, larger models
gpu_1x_h100 1x H100 80GB ~$2.00 Fastest, very large models

Training on Lambda Labs

Once your instance is running:

# SSH into the instance
ssh ubuntu@<instance_ip>

# Clone and set up the repo
git clone https://github.com/OpenAdaptAI/openadapt-ml.git
cd openadapt-ml
curl -LsSf https://astral.sh/uv/install.sh | sh
uv sync

# Run training
uv run python -m openadapt_ml.scripts.train \
--config configs/qwen3vl_capture.yaml \
--capture /path/to/capture

Direct TRL Trainer Usage

For more control, use the TRL trainer directly:

CLI:

# Train from Parquet (recommended for cloud)
python -m openadapt_ml.training.trl_trainer \
--parquet /path/to/episodes.parquet \
--output checkpoints/my_model \
--model unsloth/Qwen2.5-VL-7B-Instruct \
--epochs 3

# With Set-of-Marks DSL
python -m openadapt_ml.training.trl_trainer \
--parquet /path/to/episodes.parquet \
--output checkpoints/my_model \
--use-som

Python API:

from openadapt_ml.training.trl_trainer import (
train_with_trl,
train_from_parquet,
TRLTrainingConfig,
)

# Configure training
config = TRLTrainingConfig(
model_name="unsloth/Qwen2.5-VL-7B-Instruct",
output_dir="checkpoints/my_model",
num_epochs=3,
batch_size=1,
gradient_accumulation_steps=4,
learning_rate=2e-4,
# LoRA settings
lora_r=16,
lora_alpha=32,
)

# Train from Parquet (simpler for cloud workflows)
checkpoint = train_from_parquet(
parquet_path="/path/to/episodes.parquet",
config=config,
use_som=False,
)

# Or train from Episode objects
from openadapt_ml.ingest import load_episodes
episodes = load_episodes("/path/to/workflow_exports/")
checkpoint = train_with_trl(episodes=episodes, config=config)

Key benefits of TRL + Unsloth:

  • 2x training speed, 50% less VRAM
  • SFTTrainer for production-grade training
  • Automatic fallback to standard transformers if Unsloth unavailable

Azure

Azure offers $200 free credits for new accounts, but requires GPU quota approval.

Setup

# Full automated setup (includes GPU quota request)
python scripts/setup_azure.py

# Check GPU quota status (may take hours to approve)
az vm list-usage --location eastus -o table | grep -i nc

The setup script:

  1. Creates resource group and ML workspace
  2. Sets up service principal
  3. Creates container registry
  4. Automatically requests GPU quota (NCv3/V100)

GPU Quota

Azure requires quota approval for GPU VMs. Small requests (6-8 vCPUs) sometimes auto-approve within minutes. If not approved automatically:

  1. Go to Azure Quota Portal
  2. Select your subscription
  3. Search for "Standard NCSv3 Family"
  4. Request increase to 6+ vCPUs

Cost Comparison

VM Size GPU Price/hr Notes
NC6s_v3 1x V100 ~$0.90 Good for training
NC12s_v3 2x V100 ~$1.80 Faster training
ND96asr_v4 8x A100 ~$27.00 Large scale only

Recommendations

  1. For quick experiments: Use Lambda Labs - no approval wait, simple pricing
  2. For sustained training: Azure with free credits if you can wait for quota
  3. For production: Either works, Lambda is simpler, Azure has more enterprise features

Troubleshooting

Lambda Labs: "No regions available"

GPU capacity fluctuates. Try:

  • Different instance type
  • Wait and retry
  • Check status page

Azure: Quota not approved

  • Small requests (6 vCPUs) usually auto-approve
  • Larger requests need business justification
  • Try a different region (westus2, eastus2)