# CodeAssistBench

A benchmark for evaluating AI coding assistants on real GitHub issues. This project includes a curated dataset of GitHub issues with Dockerfiles for reproducible evaluation, plus tools for dataset creation and AI agent evaluation.
## Quick Run (5 minutes)

Get started immediately with our pre-built dataset:

```bash
# 1. Clone and install
git clone https://github.com/your-org/CodeAssistBench.git
cd CodeAssistBench
pip install -r requirements.txt && pip install -e .

# 2. Set AWS credentials (for Bedrock)
export AWS_ACCESS_KEY_ID=your_key
export AWS_SECRET_ACCESS_KEY=your_secret
export AWS_DEFAULT_REGION=us-west-2

# 3. Run evaluation on Python issues (using human-verified dataset)
python -m cab_evaluation.cli generation-dataset \
    dataset/cab_verified_v2.jsonl \
    --output results/quick_test.jsonl \
    --agent-models '{"maintainer": "haiku", "user": "haiku"}' \
    --language python

# 4. Judge the results
python -m cab_evaluation.cli evaluation-dataset \
    results/quick_test.jsonl \
    --output results/quick_eval.jsonl \
    --agent-models '{"judge": "haiku"}'

# 5. View results
python -c "
import json
with open('results/quick_eval.jsonl') as f:
    for line in f:
        r = json.loads(line)
        print(f\"{r['issue_id']}: {r['verdict']}\")
"
```
**What this does:**

- Generates maintainer responses for Python issues using Claude Haiku (fast & cheap)
- Evaluates responses with a judge agent
- Outputs verdicts: `CORRECT`, `PARTIALLY_CORRECT`, `INCORRECT`, or `ERROR`

For production evaluation, use `sonnet4` or `opus` models instead of `haiku`.
## Dataset Overview

CodeAssistBench provides four ready-to-use datasets. We recommend `cab_verified_v2.jsonl` for evaluation -- it contains 274 human-verified, high-quality issues (scored 4+ out of 5 by annotators):

| Dataset | Issues | Languages | Description |
|---|---|---|---|
| `dataset/cab_verified_v2.jsonl` | 274 | 7 | **Recommended** -- Human-verified subset from annotation |
| `dataset/cab_recent_v2.jsonl` | 771 | 7 | Full dataset -- June 2025 - Jan 2026 (with satisfaction conditions & classification) |
| `dataset/cab_recent.jsonl` | 308 | 7 | Earlier recent issues (June 2025 - Jan 2026) |
| `dataset/cab_verified.jsonl` | 149 | 7 | Legacy verified subset with tested Dockerfiles |
### Dataset Fields

Each issue in the dataset contains:

```json
{
  "number": 1234,
  "title": "Bug: Memory leak in parser",
  "created_at": "2025-07-15T10:30:00Z",
  "closed_at": "2025-07-20T14:22:00Z",
  "commit_id": "abc123def456...",
  "labels": ["bug", "parser"],
  "url": "https://github.com/owner/repo/issues/1234",
  "body": "When parsing large files, memory usage grows unbounded...",
  "author": "user123",
  "comments": [
    {
      "user": "maintainer",
      "created_at": "2025-07-16T08:00:00Z",
      "body": "Thanks for reporting! Can you share the file?"
    }
  ],
  "satisfaction_conditions": [
    "Memory usage remains stable when parsing files >100MB",
    "Parser handles all edge cases mentioned in the issue",
    "No regression in parsing speed for normal files"
  ],
  "_classification": {
    "category": "Can be dockerized without any issue",
    "timestamp": "2025-04-14 01:01:54"
  },
  "dockerfile": "FROM python:3.11-slim\n...",
  "language": "python"
}
```
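The fields above can be sanity-checked with a short loader. This is an illustrative sketch, not part of the toolkit: the `REQUIRED` list is an assumption based on the example record shown above, and you may want to tighten or loosen it for your own use.

```python
import json

# Fields we expect on every record, per the example above (assumption)
REQUIRED = ["number", "title", "url", "body", "language", "satisfaction_conditions"]

def check_dataset(path):
    """Load a JSONL dataset and report records missing any expected field."""
    missing = []
    with open(path) as f:
        for line_no, line in enumerate(f, 1):
            issue = json.loads(line)
            absent = [k for k in REQUIRED if k not in issue]
            if absent:
                missing.append((line_no, issue.get("number"), absent))
    return missing

# Example:
# problems = check_dataset("dataset/cab_verified_v2.jsonl")
# print(f"{len(problems)} records missing fields")
```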
## Step-by-Step: Generate Your Own Dataset

This section walks through how we generated the dataset from scratch using AWS Bedrock and Strands AI agents.

### Prerequisites

```bash
# 1. Clone the repository and create a virtual environment
git clone https://github.com/your-org/CodeAssistBench.git
cd CodeAssistBench
python3 -m venv venv
source venv/bin/activate

# 2. Install dependencies
pip install -r requirements.txt
pip install -e .

# 3. Install Strands SDK (required for Dockerfile generation)
pip install strands-agents strands-agents-tools
pip install -e tools/

# 4. Set up LLM credentials (choose ONE option)
# Option A: AWS Bedrock (Claude models)
export AWS_ACCESS_KEY_ID=your_key
export AWS_SECRET_ACCESS_KEY=your_secret
export AWS_DEFAULT_REGION=us-west-2

# Option B: OpenAI (GPT-5 models)
export OPENAI_API_KEY=your_openai_api_key

# 5. Set up GitHub token (for API access)
export GITHUB_TOKEN=your_github_personal_access_token
```
### Step 1: Collect GitHub Issues

Collect closed issues from popular repositories. The script uses interactive prompts:

```bash
python script/get_github_issue.py
# Enter CSV path when prompted (see script/python_repos*.csv for examples)
# Choose label-based filtering (y/n)
```

Or use the bulk collection script:

```bash
# Edit the script to set: language, min_stars, date range
```

Output: `github_issues_`

```json
[
  {
    "number": 1234,
    "title": "Bug: Memory leak in parser",
    "url": "https://github.com/owner/repo/issues/1234",
    "body": "When parsing large files...",
    "comments": [...]
  }
]
```
### Step 2: Get Commit IDs

Find the commit hash at the time each issue was closed:

```bash
python script/get_github_commit.py \
    --input-dir my_data/collected_issues \
    --output-dir my_data/with_commits

# Or using short options:
python script/get_github_commit.py -i my_data/collected_issues -o my_data/with_commits
```

Arguments:

| Argument | Required | Description |
|---|---|---|
| `--input-dir`, `-i` | Yes | Directory containing JSON files with issues |
| `--output-dir`, `-o` | No | Output directory (default: `github_commits`) |

Output: Creates commit data files in the output directory.
### Step 3: Generate Satisfaction Conditions (Uses LLM)

Use an LLM to generate explicit criteria for issue resolution:

```bash
python script/scon_filter.py \
    --input-dir my_data/collected_issues \
    --output-dir my_data/with_scon

# With custom model and region:
python script/scon_filter.py \
    -i my_data/collected_issues \
    -o my_data/with_scon \
    --model us.anthropic.claude-sonnet-4-5-20250929-v1:0 \
    --region us-west-2
```

Arguments:

| Argument | Required | Default | Description |
|---|---|---|---|
| `--input-dir`, `-i` | Yes | - | Directory containing JSON files with issues |
| `--output-dir`, `-o` | Yes | - | Output directory for issues with satisfaction conditions |
| `--model`, `-m` | No | `claude-sonnet-4.5` | Bedrock model ID |
| `--region`, `-r` | No | `us-west-2` | AWS region for Bedrock |

Output: Adds a `satisfaction_conditions` field:

```json
{
  "satisfaction_conditions": [
    "Memory usage remains stable when parsing files >100MB",
    "Parser handles all edge cases mentioned in the issue"
  ]
}
```
### Step 4: Classify Dockerizability (Uses LLM)

Classify issues by whether they need a Docker environment:

```bash
python script/docker_filter.py \
    --input-dir my_data/with_scon \
    --output-dir my_data/classified

# With custom region:
python script/docker_filter.py \
    -i my_data/with_scon \
    -o my_data/classified \
    --region us-east-1
```

Arguments:

| Argument | Required | Default | Description |
|---|---|---|---|
| `--input-dir`, `-i` | Yes | - | Directory containing JSON files with issues |
| `--output-dir`, `-o` | Yes | - | Output directory for classified issues |
| `--region`, `-r` | No | `us-west-2` | AWS region for Bedrock |

Output structure:

```
my_data/classified/
+-- need_docker/              # Issues that need Docker environment
+-- no_need_docker/           # Documentation/config changes
+-- need_docker_but_cannot/   # Hardware-specific issues
+-- llm_responses/            # Raw LLM responses for debugging
+-- processed_issues.json     # Resume checkpoint
```
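To see how your issues were distributed across the buckets above, you can tally the output directories. This sketch assumes each bucket holds JSON files containing lists of issues, matching the collection format shown in Step 1; it is a convenience helper, not part of the pipeline.

```python
import json
from pathlib import Path

def classification_summary(classified_dir):
    """Count issues in each classification bucket produced by Step 4."""
    counts = {}
    for bucket in ["need_docker", "no_need_docker", "need_docker_but_cannot"]:
        total = 0
        for path in Path(classified_dir, bucket).glob("*.json"):
            with open(path) as f:
                total += len(json.load(f))  # each file is a list of issues
        counts[bucket] = total
    return counts

# Example:
# print(classification_summary("my_data/classified"))
```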
### Step 5: Generate Dockerfiles (Uses Strands + LLM)

```bash
# Option A: Using AWS Bedrock (default)
STRANDS_NON_INTERACTIVE=true BYPASS_TOOL_CONSENT=true \
python script/generate_dockerfile_with_strands.py \
    --input-dir my_data/classified/need_docker \
    --languages python \
    --max-attempts 3 \
    --parallel 2 \
    --agent-timeout 180 \
    --issue-timeout 600

# Option B: Using OpenAI (GPT-5)
STRANDS_NON_INTERACTIVE=true BYPASS_TOOL_CONSENT=true \
python script/generate_dockerfile_with_strands.py \
    --input-dir my_data/classified/need_docker \
    --languages python \
    --max-attempts 3 \
    --parallel 2 \
    --agent-timeout 180 \
    --issue-timeout 600 \
    --model-id gpt5 \
    --provider openai
```

What happens:

- The Strands agent reads the issue and repository structure
- The agent generates a Dockerfile based on the repo's build system
- Docker builds the image to verify it works
- If the build fails, the agent iterates with error feedback
- On success, the Dockerfile is saved to the issue JSON
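The generate-build-retry loop described above can be sketched as follows. `generate_dockerfile` and `docker_build` are hypothetical stand-ins for the Strands agent call and the Docker invocation; only the loop shape, not the exact implementation, reflects what the script does.

```python
def build_with_retries(issue, generate_dockerfile, docker_build, max_attempts=3):
    """Ask the agent for a Dockerfile, verify it builds, and feed build
    errors back into the next attempt. Returns a working Dockerfile,
    or None if all attempts fail."""
    feedback = None
    for attempt in range(1, max_attempts + 1):
        # Agent sees the issue plus the previous build error, if any
        dockerfile = generate_dockerfile(issue, feedback)
        ok, error_log = docker_build(dockerfile)  # e.g. run `docker build`
        if ok:
            return dockerfile
        feedback = error_log  # iterate with error feedback
    return None
```

Injecting the two callables keeps the control flow testable without Docker or an LLM in the loop.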
Output: Adds a `dockerfile` field:

```json
{
  "dockerfile": "FROM python:3.11-slim\n\nWORKDIR /workspace\n\nRUN apt-get update && apt-get install -y git\n\nRUN git clone https://github.com/owner/repo.git . && \\\n git checkout abc123def456\n\nRUN pip install -r requirements.txt\n\nCMD [\"pytest\", \"tests/\"]\n"
}
```
### Step 6: Convert to Final Dataset

Combine all processed issues into a single JSONL file:

```bash
    --input-dir my_data/classified/need_docker \
    --output my_data/my_dataset.jsonl
```
## End-to-End Example

Here's a complete walkthrough processing test issues through the entire pipeline:

### Setup

```bash
# Set up credentials (AWS Bedrock + GitHub)
export AWS_ACCESS_KEY_ID=your_key
export AWS_SECRET_ACCESS_KEY=your_secret
export AWS_DEFAULT_REGION=us-west-2
export GITHUB_TOKEN=your_github_token
```

### Step 1: Create Test Data

Create a directory for the sample issues:

```bash
mkdir -p test_pipeline/step1_raw
```

Create `test_pipeline/step1_raw/test_issues.json`:

```json
[
  {
    "number": 1234,
    "title": "How to handle async operations in Python?",
    "created_at": "2025-07-15T10:30:00Z",
    "url": "https://github.com/python/cpython/issues/1234",
    "body": "I'm trying to use async/await but get 'RuntimeWarning: coroutine was never awaited'.",
    "author": "user123",
    "comments": [
      {"user": "maintainer", "created_at": "2025-07-16T08:00:00Z", "body": "Use asyncio.run() to execute your coroutine."},
      {"user": "user123", "created_at": "2025-07-17T09:00:00Z", "body": "That worked perfectly!"}
    ]
  }
]
```
### Step 2: Generate Satisfaction Conditions

```bash
python script/scon_filter.py \
    --input-dir test_pipeline/step1_raw \
    --output-dir test_pipeline/step2_scon
```

Expected output:

```
Processing directory: test_pipeline/step1_raw
Found 1 JSON files
Processing conversation 1/1 (ID: 1234)
Added satisfaction conditions for conversation 1234
Saved 1 processed conversations to test_pipeline/step2_scon/test_issues.json
```
### Step 3: Classify Issues

```bash
python script/docker_filter.py \
    --input-dir test_pipeline/step2_scon \
    --output-dir test_pipeline/step3_classified
```

Expected output:

```
Input directory: test_pipeline/step2_scon
Output directory: test_pipeline/step3_classified
Found 1 JSON files to process.
Classified issue #1234 as: Does not need build environment

--- Classification Summary ---
Total issues processed: 1
Does not need build environment: 1 issues (100.0%)
```
### Final Directory Structure

```
test_pipeline/
+-- step1_raw/
|   +-- test_issues.json          # Original issues
+-- step2_scon/
|   +-- test_issues.json          # + satisfaction_conditions
|   +-- test_issues_prompts_responses.json
+-- step3_classified/
    +-- no_need_docker/
    |   +-- test_issues.json      # + _classification
    +-- need_docker/              # (empty for this example)
    +-- llm_responses/            # Raw LLM outputs
    +-- classification_summary.json
```
### View Results

```bash
# Check satisfaction conditions
cat test_pipeline/step2_scon/test_issues.json | jq '.[0].satisfaction_conditions'

# Check classification
cat test_pipeline/step3_classified/no_need_docker/test_issues.json | jq '.[0]._classification'
```
## Example Outputs

See `examples/` for sample outputs at each pipeline stage:

| File | Description |
|---|---|
| `examples/sample_dataset.jsonl` | Complete issues with all fields |
| `examples/sample_docker_based_issues.jsonl` | Issues requiring Docker |
| `examples/sample_non_docker_based_issues.jsonl` | Documentation/config issues |
| `examples/sample_pipeline_output.json` | Single issue showing all fields |
## Quick Start

### Using the Dataset

```python
import json

# Load the recommended human-verified dataset
with open('dataset/cab_verified_v2.jsonl', 'r') as f:
    issues = [json.loads(line) for line in f]

# Filter by language
python_issues = [i for i in issues if i.get('language') == 'python']
print(f"Total issues: {len(issues)}")
print(f"Python issues: {len(python_issues)}")
```
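Beyond filtering by language, you can group issues by repository using the `url` field shown in Dataset Fields. This helper is an illustrative sketch, assuming issue URLs follow the `https://github.com/owner/repo/issues/<n>` pattern:

```python
from collections import Counter
from urllib.parse import urlparse

def repo_of(issue):
    """Extract 'owner/repo' from an issue URL like
    https://github.com/owner/repo/issues/1234."""
    parts = urlparse(issue["url"]).path.strip("/").split("/")
    return "/".join(parts[:2])

# Count issues per repository (assumes `issues` was loaded as above):
# by_repo = Counter(repo_of(i) for i in issues)
# for repo, n in by_repo.most_common(5):
#     print(f"{repo}: {n}")
```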
### Running Evaluation

The evaluation framework has two phases: **Generation** (maintainer answers issues) and **Evaluation** (judge scores responses).

#### Workflow Overview

```
+-----------------+      +----------------+      +-----------------+
|     Dataset     | ---> |   Generation   | ---> |   Evaluation    |
|     (JSONL)     |      |    Workflow    |      |    Workflow     |
+-----------------+      +----------------+      +-----------------+
                                 |                       |
                       Maintainer <-> User          Judge Agent
                       Multi-round chat            Scores answers
```
#### Step 1: Generation (Maintainer <-> User conversation)

```bash
python -m cab_evaluation.cli generation-dataset \
    dataset/cab_recent.jsonl \
    --output results/generation_results.jsonl \
    --agent-models '{"maintainer": "sonnet4", "user": "haiku"}' \
    --language python \
    --resume
```

Arguments:

| Argument | Description |
|---|---|
| `--output`, `-o` | Output file (default: auto-generated with timestamp) |
| `--agent-models` | JSON mapping of models: `{"maintainer": "sonnet4", "user": "haiku"}` |
| `--language`, `-l` | Filter by language (python, javascript, etc.) |
| `--resume` | Skip already-processed issues |
| `--max-conversation-rounds` | Max rounds between maintainer/user (default: 2) |
#### Step 2: Evaluation (Judge scores responses)

```bash
python -m cab_evaluation.cli evaluation-dataset \
    results/generation_results.jsonl \
    --output results/evaluation_results.jsonl \
    --agent-models '{"judge": "sonnet4"}' \
    --resume
```

Arguments:

| Argument | Description |
|---|---|
| `--output`, `-o` | Output file for evaluation results |
| `--agent-models` | JSON with judge model: `{"judge": "sonnet4"}` |
| `--resume` | Skip already-evaluated issues |
| `--iterative` | Enable multi-iteration judge with repo exploration |
#### Verdict Types

The judge assigns one of these verdicts:

| Verdict | Description |
|---|---|
| `CORRECT` | Response fully addresses the issue and satisfies all conditions |
| `PARTIALLY_CORRECT` | Response addresses some aspects but misses key elements |
| `INCORRECT` | Response doesn't address the issue or provides wrong information |
| `ERROR` | Processing failed (timeout, API error, etc.) |
#### Output Format

Each result in the JSONL file contains:

```json
{
  "issue_id": "1234",
  "question_title": "How to handle async operations?",
  "verdict": "CORRECT",
  "judgment": "The maintainer correctly identified the issue...",
  "key_issues": ["Clear explanation provided", "Code example included"],
  "alignment_score": {
    "satisfied": 3,
    "total": 3,
    "percentage": 100.0,
    "conditions": [
      {"number": 1, "satisfied": true, "description": "Explains async pattern"},
      {"number": 2, "satisfied": true, "description": "Provides working example"},
      {"number": 3, "satisfied": true, "description": "Addresses RuntimeWarning"}
    ]
  },
  "generation_metadata": {
    "user_satisfied": true,
    "total_conversation_rounds": 2
  }
}
```
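The `alignment_score` fields are internally consistent: `percentage` is just `satisfied / total` over the per-condition verdicts. A small sketch of how you could recompute (or cross-check) it from the `conditions` list above:

```python
def alignment_percentage(conditions):
    """Recompute the alignment score from per-condition verdicts,
    matching the shape of the `alignment_score` field above."""
    total = len(conditions)
    satisfied = sum(1 for c in conditions if c["satisfied"])
    return {
        "satisfied": satisfied,
        "total": total,
        "percentage": round(100.0 * satisfied / total, 1) if total else 0.0,
    }
```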
#### Analyzing Results

```python
import json
from collections import Counter

# Load evaluation results
with open('results/evaluation_results.jsonl', 'r') as f:
    results = [json.loads(line) for line in f]

# Count verdicts
verdicts = Counter(r['verdict'] for r in results)
print(f"Total: {len(results)}")
print(f"CORRECT: {verdicts['CORRECT']} ({verdicts['CORRECT']/len(results)*100:.1f}%)")
print(f"PARTIALLY_CORRECT: {verdicts['PARTIALLY_CORRECT']} ({verdicts['PARTIALLY_CORRECT']/len(results)*100:.1f}%)")
print(f"INCORRECT: {verdicts['INCORRECT']} ({verdicts['INCORRECT']/len(results)*100:.1f}%)")
print(f"ERROR: {verdicts.get('ERROR', 0)}")

# Average alignment score
valid_results = [r for r in results if r.get('alignment_score')]
avg_alignment = sum(r['alignment_score']['percentage'] for r in valid_results) / len(valid_results)
print(f"Average alignment: {avg_alignment:.1f}%")
```
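Since each result also carries `generation_metadata.user_satisfied` (see Output Format), it can be interesting to cross-tabulate judge verdicts against simulated-user satisfaction. This is an optional analysis sketch using only the fields documented above:

```python
from collections import Counter

def verdict_by_satisfaction(results):
    """Cross-tabulate judge verdicts against whether the simulated user
    was satisfied during generation. Keys are (verdict, user_satisfied)."""
    table = Counter()
    for r in results:
        satisfied = r.get("generation_metadata", {}).get("user_satisfied")
        table[(r["verdict"], satisfied)] += 1
    return table

# Example: a large count at ("INCORRECT", True) would suggest the simulated
# user is easier to satisfy than the judge.
```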
### Model Aliases

Available model shortcuts for `--agent-models`:

| Alias | Full Model ID |
|---|---|
| `sonnet4` | `us.anthropic.claude-sonnet-4-20250514-v1:0` |
| `sonnet45` | `us.anthropic.claude-sonnet-4-5-20250929-v1:0` |
| `haiku` | `us.anthropic.claude-3-5-haiku-20241022-v1:0` |
| `opus` | `us.anthropic.claude-opus-4-20250514-v1:0` |

See `examples/USAGE_GUIDE.md` for more detailed instructions.
## Project Structure

```
CodeAssistBench/
+-- dataset/                      # Final datasets
|   +-- cab_recent_v2.jsonl       # 771 latest issues
|   +-- cab_verified_v2.jsonl     # 274 human-verified issues (quality 4+)
|   +-- cab_recent.jsonl          # 308 recent issues
|   +-- cab_verified.jsonl        # 149 verified issues
|   +-- recent/                   # Additional samples
+-- src/cab_evaluation/           # Evaluation framework
|   +-- agents/                   # Agent implementations
|   +-- core/                     # Core models and config
|   +-- prompts/                  # Prompt templates
|   +-- utils/                    # Utilities
|   +-- workflows/                # Evaluation workflows
+-- script/                       # Data collection scripts
|   +-- get_github_issue.py       # Step 1: Issue collection
|   +-- get_github_commit.py      # Step 2: Commit ID lookup
|   +-- scon_filter.py            # Step 3: Satisfaction conditions
|   +-- docker_filter.py          # Step 4: Classification
|   +-- generate_dockerfile_with_strands.py  # Step 5: Dockerfiles
+-- tools/                        # Custom Strands tools (required)
+-- examples/                     # Sample data and guides
|   +-- USAGE_GUIDE.md            # Detailed usage guide
|   +-- sample_*.jsonl            # Sample datasets
+-- prompts/                      # Prompt templates
+-- docs/                         # Documentation
    +-- DATA_PIPELINE.md          # Detailed pipeline docs
```
## Installation

```bash
git clone https://github.com/your-org/CodeAssistBench.git
cd CodeAssistBench

# Create virtual environment
python3 -m venv venv
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

# Install in development mode
pip install -e .

# Install Strands SDK (REQUIRED for Dockerfile generation)
pip install strands-agents strands-agents-tools
pip install -e tools/
```
### AWS Credentials (Required for Bedrock)

```bash
export AWS_ACCESS_KEY_ID=your_key
export AWS_SECRET_ACCESS_KEY=your_secret
export AWS_DEFAULT_REGION=us-west-2
```
## Documentation

- **Usage Guide** (`examples/USAGE_GUIDE.md`) - Detailed evaluation instructions
- **Data Pipeline** (`docs/DATA_PIPELINE.md`) - Complete pipeline documentation
- **Development** - Contributing and development setup
## Features

- **Automated Dockerfile Generation**: Uses Strands AI agents with AWS Bedrock
- **Multi-language Support**: Python, JavaScript, TypeScript, Java, Go, C, C++
- **Satisfaction Conditions**: LLM-generated criteria for issue resolution
- **Docker-based Evaluation**: Reproducible evaluation environment
- **Multiple Agent Frameworks**: Supports Strands, OpenHands, and Q-CLI
## Citation

If you use CodeAssistBench in your research, please cite our paper:

```bibtex
@inproceedings{
kim2025codeassistbench,
title={CodeAssistBench ({CAB}): Dataset \& Benchmarking for Multi-turn Chat-Based Code Assistance},
author={Myeongsoo Kim and Shweta Garg and Baishakhi Ray and Varun Kumar and Anoop Deoras},
booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
year={2025},
url={https://openreview.net/forum?id=2R6y4Ku9kG}
}
```
## License

This project is licensed under the Apache 2.0 License - see the LICENSE file for details.

The underlying GitHub issues are subject to their respective repository licenses.

## Contributing

We welcome contributions! Please see CONTRIBUTING.md for guidelines.
## Appendix: Dockerfile Generation Options

### Environment Variables

| Variable | Description |
|---|---|
| `STRANDS_NON_INTERACTIVE=true` | **Required.** Disables interactive prompts |
| `BYPASS_TOOL_CONSENT=true` | **Required.** Bypasses tool confirmation |

### Command Line Arguments

| Argument | Default | Description |
|---|---|---|
| `--input-dir`, `-i` | (required) | Directory with classified issues |
| `--output-dir`, `-o` | `logs/dockerfile_generation_strands` | Output directory |
| `--languages` | (all) | Specific languages to process |
| `--max-attempts` | `10` | Max retry attempts per issue |
| `--docker-timeout` | `600` | Docker build timeout (seconds) |
| `--agent-timeout` | `300` | Agent attempt timeout (seconds) |
| `--issue-timeout` | `1800` | Total timeout per issue (seconds) |
| `--parallel`, `-p` | `1` | Parallel processing count |
| `--model-id` | `claude-sonnet-4-5` | AWS Bedrock model ID |