# ImageReasoning
Visual question answering and image reasoning -- multimodal AI that understands images and answers natural language questions about them.
Topics: image-understanding * step-wise-reasoning * visual-reasoning
## Overview
ImageReasoning is a multimodal AI application that combines computer vision and natural language understanding to answer free-form questions about images -- a task known as Visual Question Answering (VQA). It integrates state-of-the-art vision-language models (BLIP-2, LLaVA, or GPT-4 Vision) to handle questions ranging from simple object recognition ('what colour is the car?') to complex spatial reasoning ('which object is between the chair and the window?') to abstract inference ('what is the person in the image likely feeling?').
The application provides two distinct modes. In the standard VQA mode, users upload an image and ask any natural language question, receiving a direct answer with an attention map overlay showing which image regions the model focused on. In the visual reasoning chain mode, the model is prompted to reason step by step before answering -- first identifying relevant objects, then their relationships, then inferring the answer -- providing an interpretable reasoning trace that makes complex spatial and causal questions more accurate and verifiable.
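The standard VQA mode maps onto a few lines of HuggingFace code. The sketch below is a minimal, illustrative wiring of BLIP-2 inference, not the application's actual implementation; the model name matches the documented default, and the `Question: ... Answer:` prompt wrapper follows BLIP-2's usual VQA format.

```python
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

def pick_device() -> str:
    """Choose the best available inference device."""
    if torch.cuda.is_available():
        return "cuda"
    if torch.backends.mps.is_available():
        return "mps"
    return "cpu"

def ask(image_path: str, question: str,
        model_name: str = "Salesforce/blip2-opt-2.7b") -> str:
    """Answer a free-form natural language question about an image."""
    device = pick_device()
    processor = Blip2Processor.from_pretrained(model_name)
    # For GPU inference, pass torch_dtype=torch.float16 to halve memory use.
    model = Blip2ForConditionalGeneration.from_pretrained(model_name).to(device)
    image = Image.open(image_path).convert("RGB")
    # BLIP-2 expects the question wrapped in its VQA prompt format.
    inputs = processor(images=image,
                       text=f"Question: {question} Answer:",
                       return_tensors="pt").to(device)
    out = model.generate(**inputs, max_new_tokens=30)
    return processor.batch_decode(out, skip_special_tokens=True)[0].strip()

if __name__ == "__main__":
    print(ask("photo.jpg", "How many people are in this image?"))
```

Swapping in `blip2-flan-t5-xl` or a LLaVA checkpoint changes only `model_name`; the surrounding code stays the same.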
A benchmark evaluation module allows systematic comparison of different VLMs (BLIP-2, LLaVA-1.5, GPT-4V) on standard VQA datasets (VQA v2, GQA, TextVQA) with configurable question categories, providing reproducible performance metrics for model selection in specific visual reasoning tasks.
## Motivation
Vision-language understanding is one of the most practically powerful capabilities of modern AI -- enabling applications from medical image Q&A to accessibility tools for visually impaired users to autonomous systems that can answer questions about their visual field. Making these capabilities accessible through a clean, well-documented application reduces the barrier for researchers and developers to experiment with and build on multimodal AI.
## Architecture

```
Image Input + Natural Language Question
        |
Vision Encoder (ViT-L/14 or CLIP)
  +-- Patch embeddings: (224x224) -> 256 tokens
  +-- [CLS] token: global image representation
        |
Querying Transformer (BLIP-2) or
Visual Instruction Tuning (LLaVA)
        |
Language Model Decoder
(OPT-6.7B / Vicuna-13B / GPT-4)
        |
Free-form answer generation
        |
Optional: attention map visualisation
          chain-of-thought reasoning trace
```
## Features
### Free-Form Visual Question Answering
Answer any natural language question about an uploaded image using BLIP-2 or LLaVA -- open-ended, multiple choice, yes/no, counting, and colour/attribute questions all supported.
### Visual Reasoning Chain
Chain-of-thought visual reasoning mode prompts the model to explicitly state its reasoning steps before answering, improving accuracy on spatial relationship and multi-hop inference questions.
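One way to realise this mode is a prompt template that forces the identify-relate-infer steps, plus a small parser that recovers the final answer from the trace. The exact wording below is illustrative, not a fixed API:

```python
def build_reasoning_prompt(question: str) -> str:
    """Wrap a VQA question in an explicit step-by-step reasoning template."""
    return (
        "Answer the question about the image by reasoning step by step.\n"
        "Step 1: List the objects relevant to the question.\n"
        "Step 2: Describe the spatial or causal relationships between them.\n"
        "Step 3: State the final answer on its own line, prefixed 'Answer:'.\n"
        f"Question: {question}"
    )

def extract_answer(reasoning_trace: str) -> str:
    """Pull the final answer line out of a model's reasoning trace."""
    for line in reversed(reasoning_trace.splitlines()):
        if line.strip().lower().startswith("answer:"):
            return line.split(":", 1)[1].strip()
    return reasoning_trace.strip()  # fall back to the raw trace
```

Keeping the trace and the parsed answer separate is what makes the chain verifiable: the UI can show the full trace while downstream code consumes only the `Answer:` line.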
### Attention Map Visualisation
Grad-CAM or attention rollout overlay on the input image showing which spatial regions most influenced the model's answer -- making VQA predictions interpretable.
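For reference, attention rollout (Abnar & Zuidema, 2020) reduces to a short NumPy routine: average each layer's attention over heads, mix in the residual connection, multiply through the layers, and read off the [CLS] row. This sketch assumes a ViT-style token layout with [CLS] at index 0:

```python
import numpy as np

def attention_rollout(attentions: list) -> np.ndarray:
    """Attention rollout over a stack of transformer layers.

    attentions: per-layer arrays of shape (heads, tokens, tokens),
    where token 0 is [CLS] and the rest are image patches.
    Returns a per-patch relevance map of shape (grid, grid).
    """
    n_tokens = attentions[0].shape[-1]
    rollout = np.eye(n_tokens)
    for layer_attn in attentions:
        attn = layer_attn.mean(axis=0)                     # average over heads
        attn = 0.5 * attn + 0.5 * np.eye(n_tokens)         # add residual connection
        attn = attn / attn.sum(axis=-1, keepdims=True)     # re-normalise rows
        rollout = attn @ rollout                           # compose layers
    cls_to_patches = rollout[0, 1:]                        # [CLS] attention to patches
    grid = int(np.sqrt(cls_to_patches.size))
    return cls_to_patches.reshape(grid, grid)
```

The resulting grid is upsampled to the input resolution and alpha-blended over the image to produce the overlay.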
### GPT-4 Vision Integration
Optional GPT-4V backend for highest-accuracy complex reasoning questions -- particularly useful for reading text in images (OCR-VQA) and detailed scene understanding.
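The GPT-4V backend boils down to building a chat-completions request with a text part and a base64-encoded image part. A hedged sketch of the payload construction (the default model name here is an assumption; substitute whichever vision-capable model your account provides):

```python
import base64

def build_gpt4v_payload(image_path: str, question: str,
                        model: str = "gpt-4o") -> dict:
    """Build a chat-completions payload for a single image question."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                # Images are passed inline as a base64 data URL.
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
        "max_tokens": 300,
    }
```

With the official client this would be sent as roughly `client.chat.completions.create(**build_gpt4v_payload("photo.jpg", "What does the sign say?"))`, which is where the per-image API cost mentioned in the Notes section is incurred.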
### Multi-Image Comparison Mode
Upload two images and ask comparison questions ('which image has more people?', 'how do these two charts differ?') using a multi-image prompt template.
### Document and Chart Q&A
Specialised mode for structured images: tables, charts, diagrams, and document pages -- using layout-aware processing for accurate numerical and relationship extraction.
### VQA Benchmark Evaluation
Systematic evaluation on VQA v2, GQA, and TextVQA benchmark subsets with per-category accuracy breakdown (yes/no, number, other) and model comparison table.
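The per-category breakdown is a straightforward grouped accuracy computation. The sketch below uses simple exact-match scoring and an illustrative record schema; note the official VQA v2 metric is a softer consensus score over ten human answers, so treat this as a simplification:

```python
from collections import defaultdict

def category_accuracy(records: list) -> dict:
    """Per-category exact-match accuracy.

    records: dicts with 'category' (e.g. 'yes/no', 'number', 'other'),
    'prediction', and 'answer' keys -- an assumed schema.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for r in records:
        totals[r["category"]] += 1
        if r["prediction"].strip().lower() == r["answer"].strip().lower():
            hits[r["category"]] += 1
    return {cat: hits[cat] / totals[cat] for cat in totals}
```

Running this once per model produces the rows of the comparison table.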
### Batch Processing Pipeline
Process a JSONL file of image-question pairs for bulk VQA inference, with parallel execution and results CSV export.
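A minimal version of this pipeline, assuming JSONL records of the form `{"image": ..., "question": ...}` (field names are an assumption) and any `answer_fn(image, question) -> str` backend. Threads are a reasonable parallelism choice when the backend is I/O-bound, such as an HTTP API:

```python
import csv
import json
from concurrent.futures import ThreadPoolExecutor

def run_batch(input_jsonl: str, output_csv: str, answer_fn,
              workers: int = 4) -> int:
    """Run VQA over a JSONL file of image-question pairs; write a CSV."""
    with open(input_jsonl) as f:
        rows = [json.loads(line) for line in f if line.strip()]
    # Answer all questions in parallel, preserving input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        answers = list(pool.map(
            lambda r: answer_fn(r["image"], r["question"]), rows))
    with open(output_csv, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["image", "question", "answer"])
        for r, a in zip(rows, answers):
            writer.writerow([r["image"], r["question"], a])
    return len(rows)
```

For a local GPU-bound model, batched `generate` calls on one worker would likely beat thread parallelism; the structure above fits the API-backed case.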
## Tech Stack
| Library / Tool | Role | Why This Choice |
|---|---|---|
| BLIP-2 (Salesforce) | Primary VLM | Bootstrapped Language-Image Pre-training for efficient VQA |
| LLaVA (optional) | Visual instruction model | Large Language and Vision Assistant for complex reasoning |
| OpenAI GPT-4V (optional) | Highest accuracy | GPT-4 Vision for complex and text-in-image questions |
| Transformers (HuggingFace) | Model loading | Unified interface for BLIP-2, LLaVA, and CLIP models |
| PyTorch | Deep learning backend | Model inference and attention extraction |
| Pillow / OpenCV | Image processing | Upload handling, preprocessing, attention overlay |
| Streamlit | Application UI | Image upload, question input, answer display with overlay |
## Getting Started

### Prerequisites

- Python 3.9+
- A virtual environment manager (`venv`, `conda`, or equivalent)
- API keys as listed in the Configuration section
### Installation

```bash
cd ImageReasoning
python -m venv venv && source venv/bin/activate
pip install streamlit transformers torch pillow accelerate

# BLIP-2 weights download automatically from HuggingFace on first run (~5GB for full model)
# Optional: GPU recommended for <3s inference
echo 'OPENAI_API_KEY=sk-...' > .env  # only if using GPT-4V backend

streamlit run app.py
```
## Usage

```bash
# Launch the Streamlit app
streamlit run app.py

# Single question from the CLI
python ask.py --image photo.jpg --question 'How many people are in this image?'

# Batch VQA evaluation
python batch_vqa.py --input questions.jsonl --output answers.csv --model blip2

# Benchmark evaluation
python benchmark.py --dataset vqa_v2 --split val --model blip2 --samples 1000
```
## Configuration

| Variable | Default | Description |
|---|---|---|
| `DEFAULT_MODEL` | `blip2-opt-2.7b` | VLM to use: `blip2-opt-2.7b`, `blip2-flan-t5-xl`, or `llava-1.5-7b` |
| `OPENAI_API_KEY` | (optional) | For the GPT-4 Vision backend |
| `DEVICE` | `auto` | Inference device: `auto`, `cpu`, `cuda`, or `mps` |
| `ATTENTION_VIZ` | `True` | Show attention map overlay on answers |
| `CHAIN_OF_THOUGHT` | `False` | Enable reasoning chain mode |
Copy `.env.example` to `.env` and populate required values before running.
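After a tool such as python-dotenv has loaded `.env` into the environment, reading the documented settings with their defaults might look like this (the function name is illustrative):

```python
import os

def load_config() -> dict:
    """Read settings from the environment, applying the documented defaults."""
    return {
        "DEFAULT_MODEL": os.getenv("DEFAULT_MODEL", "blip2-opt-2.7b"),
        "DEVICE": os.getenv("DEVICE", "auto"),
        # Boolean flags arrive as strings; parse them explicitly.
        "ATTENTION_VIZ": os.getenv("ATTENTION_VIZ", "True").lower() == "true",
        "CHAIN_OF_THOUGHT": os.getenv("CHAIN_OF_THOUGHT", "False").lower() == "true",
        "OPENAI_API_KEY": os.getenv("OPENAI_API_KEY"),  # optional, may be None
    }
```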
## Project Structure

```
ImageReasoning/
+-- README.md
+-- requirements.txt
+-- app.py
+-- ...
```
## Roadmap
- Video VQA: temporal question answering across video frames (what happens after the person sits down?)
- Medical imaging VQA: fine-tuned model for radiology image Q&A (not for clinical use)
- Multilingual VQA: answer questions posed in any language about any image
- Active grounding: draw bounding box around the object referenced in the answer
- Comparative benchmarking dashboard: automated weekly comparison of new VLM releases on standard benchmarks
## Contributing

Contributions, issues, and suggestions are welcome.

- Fork the repository
- Create a feature branch: `git checkout -b feature/your-idea`
- Commit your changes: `git commit -m 'feat: add your idea'`
- Push to your branch: `git push origin feature/your-idea`
- Open a Pull Request with a clear description
Please follow conventional commit messages and add documentation for new features.
## Notes
BLIP-2 model weights require ~5GB of disk space and benefit significantly from GPU inference (A100 or equivalent for full model; T4 or equivalent for the smaller 2.7B variant). CPU inference is functional for testing but slow (~30-60s per query). GPT-4 Vision requires an OpenAI API key and incurs per-image API costs.
## Author
Devanik Debnath
B.Tech, Electronics & Communication Engineering
National Institute of Technology Agartala
## License
This project is open source and available under the MIT License.
Built with curiosity, depth, and care -- because good projects deserve good documentation.