
Vision-Language Model (VLM) Training

A PyTorch implementation for training vision-language models from scratch using a Q-Former architecture and small language models. This project demonstrates how to build a complete VLM pipeline that can understand and describe images.

Overview

This project implements a vision-language model training pipeline that:

Trains a Q-Former to align visual and text embeddings using contrastive learning (CLIP-style)
Adapts a pretrained language model (SmolLM or Qwen) to understand visual inputs through the trained Q-Former
Enables image captioning and visual question answering with lightweight models suitable for consumer hardware

The architecture uses:

Vision Transformer (ViT) for encoding images
Q-Former (Query Transformer) to compress and align visual features with text
Small Language Models (SmolLM-135M or Qwen3-0.6B) with LoRA fine-tuning
Conceptual Captions dataset for training

Architecture

The diagram above shows the complete end-to-end architecture:

Image Processing: Input images are encoded by ViT into patch embeddings
Q-Former: Compresses ViT embeddings into a fixed number of query tokens
MLP Adapter: Projects Q-Former outputs into the language model's embedding space
Text Input: User prompt (e.g., "Describe this image") is tokenized
Concatenation: Image tokens and text tokens are combined into a single sequence
Language Model: Decoder layers (with LoRA adapters) generate the caption autoregressively

This 2-stage training approach first aligns vision and language representations (Q-Former), then teaches the LLM to understand visual inputs (full VLM).
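
The adapter and concatenation steps above can be sketched at the shape level with plain PyTorch. This is an illustration only: the hidden sizes (768 for the Q-Former, 576 for the language model), the two-layer `MLPAdapter`, the image-first token order, and the random tensors standing in for Q-Former outputs and prompt embeddings are all assumptions, not values read from this repository.

```python
import torch
import torch.nn as nn

# Illustrative sizes (assumptions, not read from the repository's config).
NUM_QUERIES = 32   # query tokens produced by the Q-Former
QFORMER_DIM = 768  # assumed Q-Former hidden size
LM_DIM = 576       # assumed language-model embedding size


class MLPAdapter(nn.Module):
    """Hypothetical two-layer MLP projecting Q-Former outputs into LM space."""

    def __init__(self, in_dim: int, out_dim: int) -> None:
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, out_dim),
            nn.GELU(),
            nn.Linear(out_dim, out_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


def build_lm_inputs(
    query_out: torch.Tensor, text_emb: torch.Tensor, adapter: nn.Module
) -> torch.Tensor:
    """Project image tokens and place them before the text token embeddings."""
    image_tokens = adapter(query_out)                   # (B, 32, LM_DIM)
    return torch.cat([image_tokens, text_emb], dim=1)   # (B, 32 + T, LM_DIM)


torch.manual_seed(0)
adapter = MLPAdapter(QFORMER_DIM, LM_DIM)
query_out = torch.randn(2, NUM_QUERIES, QFORMER_DIM)  # stand-in for Q-Former output
text_emb = torch.randn(2, 10, LM_DIM)                 # stand-in for embedded prompt tokens
lm_inputs = build_lm_inputs(query_out, text_emb, adapter)
print(lm_inputs.shape)  # torch.Size([2, 42, 576])
```

The language model then sees a single sequence of 32 image tokens followed by the prompt tokens, which is why the Q-Former's fixed query count keeps the visual prefix short regardless of image resolution.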

Project Structure

vlm/
+-- vlm_train/                     # Main training directory
|   +-- datasets/                  # Data loading modules
|   |   +-- cc_dataloader.py       # Conceptual Captions dataloader
|   |   +-- lm_dataloader.py       # Language model dataloader
|   +-- networks/                  # Neural network architectures
|   |   +-- q_former.py            # Q-Former implementation
|   |   +-- lm_to_vlm.py           # VLM wrapper with LoRA
|   +-- utils/                     # Utility functions
|   |   +-- calculate_recall.py    # Retrieval metrics (I2T, T2I)
|   |   +-- utils.py               # Visualization utilities
|   |   +-- filter_dataset.py      # Dataset preparation script
|   +-- q_former_train.py          # Q-Former training script
|   +-- lm_train.py                # VLM fine-tuning script
|   +-- test_generation.py         # Generation evaluation script
|   +-- basic_inference.py         # Inference and metrics visualization
+-- inference_results/             # Output directory for results
|   +-- similarity_grid.jpg        # Image-text retrieval visualization
+-- pyproject.toml                 # Project dependencies

Installation

Prerequisites

Python >= 3.12
CUDA-capable GPU recommended (can also run on MPS/CPU)
~10GB disk space for dataset

Steps

Clone the repository

git clone https://github.com/avbiswas/vlm.git
cd vlm

Install dependencies

Using uv (recommended):

uv pip install -e .

Or using pip:

pip install -e .

Download the dataset

Run the filter script to download Conceptual Captions:

uv run vlm_train/utils/filter_dataset.py

This saves ~200k image-caption records (image URLs and captions) to dataset/conceptual-captions-200k.parquet. The images themselves are fetched in the next step.

Download images using img2dataset

img2dataset --url_list dataset/conceptual-captions-200k.parquet \
  --input_format "parquet" \
  --url_col "image_url" \
  --caption_col "caption" \
  --output_folder dataset/cc_images \
  --processes_count 16 \
  --thread_count 64 \
  --image_size 224 \
  --resize_mode center_crop

Training

Stage 1: Train Q-Former (Vision-Language Alignment)

The Q-Former learns to align visual features from ViT with text embeddings from DistilBERT using contrastive learning:

uv run vlm_train/q_former_train.py

Key hyperparameters:

Learning rate: 1e-4 (default), 1e-3 (cross-attention & queries)
Batch size: 8
Loss: CLIP-style contrastive loss
Device: Auto-detected (CUDA/MPS/CPU)

The trained model will be saved to models/trained_qformer/best/.
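
For readers unfamiliar with the CLIP-style objective used in this stage, a minimal sketch follows. It assumes one pooled embedding per image and per caption and a fixed temperature of 0.07; the function name, the temperature value, and the pooling are illustrative assumptions, and the loss in q_former_train.py may differ in detail.

```python
import torch
import torch.nn.functional as F


def clip_contrastive_loss(
    image_emb: torch.Tensor, text_emb: torch.Tensor, temperature: float = 0.07
) -> torch.Tensor:
    """Symmetric cross-entropy over the pairwise similarities of a batch of pairs."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature      # (B, B) scaled cosine similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)          # each image should pick its caption
    loss_t2i = F.cross_entropy(logits.t(), targets)      # each caption should pick its image
    return 0.5 * (loss_i2t + loss_t2i)


torch.manual_seed(0)
emb = torch.randn(8, 64)
aligned = clip_contrastive_loss(emb, emb)                     # pairs match perfectly
mismatched = clip_contrastive_loss(emb, emb.roll(1, dims=0))  # every pair is wrong
print(aligned.item() < mismatched.item())  # True
```

Every other item in the batch acts as a negative, so the number of negatives per example grows with the batch size.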

Stage 2: Train Vision-Language Model

Fine-tune a language model to understand visual inputs through the trained Q-Former:

uv run vlm_train/lm_train.py

Key hyperparameters:

Model: SmolLM-135M-Instruct (default) or Qwen3-0.6B
Learning rate: 1e-4 (Q-Former), 5e-4 (adapter + LLM LoRA)
Batch size: 8 with gradient accumulation (4 steps), for an effective batch size of 32
LoRA config: r=64, alpha=128
Mixed precision: bfloat16
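
To make the LoRA settings concrete, the sketch below shows what r and alpha control in a from-scratch low-rank layer. It is not the project's implementation (the output path models/vlm_peft suggests a PEFT-based setup); the class name `LoRALinear` and the 576-wide layer are hypothetical and chosen only for illustration.

```python
import math

import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update scaled by alpha / r."""

    def __init__(self, base: nn.Linear, r: int = 64, alpha: int = 128) -> None:
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # pretrained weights stay frozen
        self.lora_a = nn.Parameter(torch.empty(r, base.in_features))
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, r))
        nn.init.kaiming_uniform_(self.lora_a, a=math.sqrt(5))
        self.scaling = alpha / r  # 128 / 64 = 2.0

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        update = (x @ self.lora_a.t()) @ self.lora_b.t()
        return self.base(x) + self.scaling * update


torch.manual_seed(0)
layer = LoRALinear(nn.Linear(576, 576))
x = torch.randn(4, 576)
# lora_b starts at zero, so the wrapped layer initially matches the frozen base layer.
print(torch.allclose(layer(x), layer.base(x)))  # True
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 73728
```

With r=64 the update adds 2 x 64 x 576 = 73,728 trainable parameters to a layer that has 576 x 576 = 331,776 frozen weights, and alpha=128 doubles the contribution of the low-rank update.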

Models are saved to models/vlm_peft/best/ and models/vlm_peft/latest/.
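
The per-component learning rates and gradient accumulation listed above can be combined as in the following sketch. The tiny `nn.Linear` stand-ins, the AdamW optimizer, and the toy classification loss are assumptions made so the example runs anywhere; mixed precision is left out to keep it device-independent.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Tiny stand-ins for the three trainable parts (hypothetical names and sizes).
q_former = nn.Linear(16, 16)
adapter = nn.Linear(16, 16)
lm_lora = nn.Linear(16, 4)

# One parameter group per learning rate.
optimizer = torch.optim.AdamW(
    [
        {"params": q_former.parameters(), "lr": 1e-4},
        {"params": list(adapter.parameters()) + list(lm_lora.parameters()), "lr": 5e-4},
    ]
)

BATCH_SIZE = 8
ACCUM_STEPS = 4
optimizer_steps = 0

optimizer.zero_grad()
for step in range(8):  # 8 micro-batches -> 2 optimizer steps
    x = torch.randn(BATCH_SIZE, 16)
    target = torch.randint(0, 4, (BATCH_SIZE,))
    logits = lm_lora(adapter(q_former(x)))
    loss = nn.functional.cross_entropy(logits, target)
    (loss / ACCUM_STEPS).backward()  # scale so accumulated gradients average over micro-batches
    if (step + 1) % ACCUM_STEPS == 0:
        optimizer.step()
        optimizer.zero_grad()
        optimizer_steps += 1

print(optimizer_steps, BATCH_SIZE * ACCUM_STEPS)  # 2 32
```

Accumulating over 4 micro-batches of 8 gives the gradient of a batch of 32 while only ever holding 8 samples' activations in memory.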

Evaluation and Testing

Generate Captions

Test the model on evaluation samples:

uv run vlm_train/test_generation.py

This generates captions for samples 130-160 and saves results to test_generation_results.csv.

Compute Retrieval Metrics

Run inference to compute image-to-text and text-to-image retrieval metrics:

uv run vlm_train/basic_inference.py

This generates:

Recall@K metrics (K=1, 5, 10) for both I2T and T2I retrieval
Similarity grid visualization saved to inference_results/similarity_grid.jpg
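
As a reference for how Recall@K is typically computed from a similarity matrix, here is a minimal sketch. It assumes exactly one correct caption per image, with matching pairs sharing the same index; the function name `recall_at_k` and the toy matrix are illustrative, and calculate_recall.py may be implemented differently.

```python
import torch


def recall_at_k(similarity: torch.Tensor, k: int) -> float:
    """Fraction of rows whose matching column (same index) is among the top-k scores."""
    topk = similarity.topk(k, dim=1).indices                  # (N, k)
    targets = torch.arange(similarity.size(0)).unsqueeze(1)   # (N, 1)
    return (topk == targets).any(dim=1).float().mean().item()


# Toy 4x4 similarity matrix: rows are captions, columns are images.
sim = torch.tensor(
    [
        [0.9, 0.1, 0.2, 0.0],
        [0.3, 0.8, 0.1, 0.2],
        [0.7, 0.2, 0.6, 0.1],  # correct image (index 2) is ranked second
        [0.1, 0.0, 0.2, 0.5],
    ]
)
print(recall_at_k(sim, 1))      # 0.75 (text-to-image)
print(recall_at_k(sim, 2))      # 1.0
print(recall_at_k(sim.t(), 1))  # 1.0 (image-to-text)
```

Transposing the matrix switches the retrieval direction, so the same function covers both T2I and I2T.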

Results

Q-Former Retrieval Performance

Below is an example similarity grid showing image-text retrieval performance on 8 test samples:

The grid shows:

Rows: Text captions
Columns: Images
Cell values: Cosine similarity between image and text embeddings
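
A grid with this layout can be computed as in the sketch below. The random embeddings are stand-ins (captions are simulated as noisy copies of their images), and the function name `similarity_grid` is hypothetical rather than taken from basic_inference.py.

```python
import torch
import torch.nn.functional as F


def similarity_grid(text_emb: torch.Tensor, image_emb: torch.Tensor) -> torch.Tensor:
    """Return a (num_texts, num_images) matrix of cosine similarities."""
    text_emb = F.normalize(text_emb, dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)
    return text_emb @ image_emb.t()


torch.manual_seed(0)
image_emb = torch.randn(8, 64)                    # stand-in image embeddings
text_emb = image_emb + 0.1 * torch.randn(8, 64)   # stand-in captions: noisy copies of the images
grid = similarity_grid(text_emb, image_emb)
print(grid.shape)                   # torch.Size([8, 8])
print(grid.argmax(dim=1).tolist())  # [0, 1, 2, 3, 4, 5, 6, 7]
```

A well-aligned model produces a grid whose largest values sit on the diagonal, where each caption meets its own image.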

Retrieval Performance:

Image-to-Text: High recall indicates the model can find correct captions for images
Text-to-Image: High recall indicates the model can find correct images for captions

VLM Caption Generation

Below are examples of captions generated by the trained Vision-Language Model on test images:

The model demonstrates the ability to:

Describe visual content: Objects, scenes, and settings in images
Generate coherent captions: Natural language descriptions that relate to the image content
Handle diverse images: From landscapes and architecture to products and people

Each image is paired with a generated caption (shown in green), demonstrating the model's vision-language understanding capabilities after the 2-stage training process.

Note: This project was created for an accompanying YouTube tutorial and prioritizes educational clarity over scale. The model was trained on just 50,000 image-caption pairs (a small subset of Conceptual Captions), and the entire 2-stage training process took approximately 4 hours on a single GPU. Despite the limited data and training time, the model demonstrates meaningful vision-language understanding, making this an accessible starting point for learning VLM development.

Model Architecture

Q-Former

Based on DistilBERT architecture
32 learnable query tokens to compress visual information
Cross-attention layers to attend to ViT features
Outputs aligned visual and text embeddings for contrastive learning
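
The query mechanism described above can be sketched as a single cross-attention block. This is a simplification under stated assumptions: the actual Q-Former is DistilBERT-based with multiple layers, whereas the class `QueryCrossAttention`, the hidden size of 768, the 12 heads, and the 197-token ViT output used here are illustrative choices only.

```python
import torch
import torch.nn as nn


class QueryCrossAttention(nn.Module):
    """Learnable query tokens that attend to ViT patch features."""

    def __init__(self, num_queries: int = 32, dim: int = 768, num_heads: int = 12) -> None:
        super().__init__()
        # One shared set of queries, learned during training.
        self.queries = nn.Parameter(torch.randn(1, num_queries, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, vit_features: torch.Tensor) -> torch.Tensor:
        batch = vit_features.size(0)
        q = self.queries.expand(batch, -1, -1)  # (B, 32, dim)
        attended, _ = self.cross_attn(query=q, key=vit_features, value=vit_features)
        return self.norm(q + attended)          # residual connection + layer norm


torch.manual_seed(0)
block = QueryCrossAttention()
vit_features = torch.randn(2, 197, 768)  # assumed: 196 patches + 1 class token
out = block(vit_features)
print(out.shape)  # torch.Size([2, 32, 768])
```

Whatever the number of ViT tokens, the output always has 32 tokens per image, which is what lets the Q-Former act as a fixed-size bottleneck between the vision encoder and the language model.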