CodeAssistBench Usage Guide

This guide shows how to use CodeAssistBench with the default Strands agent framework and the optional OpenHands and Q-CLI frameworks.

Important: Virtual Environment Setup

Because many systems ship an externally managed Python environment (PEP 668), you MUST use a virtual environment:

# 1. Create virtual environment in CodeAssistBench directory
python3 -m venv venv

# 2. Activate virtual environment
source venv/bin/activate

# 3. Install dependencies
pip install -r requirements.txt

# 4. Install CodeAssistBench in development mode
pip install -e .

# 5. Install Strands framework (if available locally)
# Navigate to the Strands-Science directory and install:
pip install -e strands-1.12.0/
pip install -e tools/

# 6. Verify installation
python -c "from cab_evaluation.agents.agent_factory import AgentFactory; print('Setup complete!')"

Note: Always activate the virtual environment before running CodeAssistBench:

source venv/bin/activate # Run this each time you start a new terminal session

Sample Files Included

  1. sample_issue.json - Single issue for individual evaluation (previewed in the sketch below)
  2. sample_dataset.jsonl - Multiple issues for batch processing
  3. strands_agent_example.py - Code examples and demonstrations
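
To see what a sample issue contains before running anything, you can print its key fields (a minimal sketch; the examples/ path matches the CLI commands below, and the field names follow the Issue JSON Structure later in this guide):

import json

# Load the single-issue sample and show what the agents will work from.
with open("examples/sample_issue.json") as f:
    issue = json.load(f)

print(f"Issue #{issue['number']}: {issue['title']}")
for condition in issue["satisfaction_conditions"]:
    print(" -", condition)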

CLI Usage Examples

1. Single Issue Evaluation (Complete Workflow)

Default model (sonnet37 for all agents):

python -m cab_evaluation.cli single examples/sample_issue.json

Custom model for all agents:

python -m cab_evaluation.cli single examples/sample_issue.json \
--agent-models '{"maintainer": "sonnet37", "user": "haiku", "judge": "sonnet37"}'

Output: cab_result_<issue_id>.json

2. Generation Workflow Only (Maintainer & User Agents)

Default models:

python -m cab_evaluation.cli generation examples/sample_issue.json

Custom models for generation:

python -m cab_evaluation.cli generation examples/sample_issue.json \
--agent-models '{"maintainer": "sonnet37", "user": "haiku"}'

Output: generation_result_<issue_id>.json

3. Evaluation Workflow Only (Judge Agent)

Default model:

python -m cab_evaluation.cli evaluation generation_result_1234.json

Custom model for evaluation:

python -m cab_evaluation.cli evaluation generation_result_1234.json \
--agent-models '{"judge": "sonnet37"}'

Output: evaluation_result_<issue_id>.json

4. Dataset Processing (All Agents)

Process entire dataset:

python -m cab_evaluation.cli dataset examples/sample_dataset.jsonl

Filter by language and custom models:

python -m cab_evaluation.cli dataset examples/sample_dataset.jsonl \
--language python \
--agent-models '{"maintainer": "sonnet37", "user": "sonnet37", "judge": "sonnet37"}'

Batch processing with resume:

python -m cab_evaluation.cli dataset examples/sample_dataset.jsonl \
--batch-size 5 \
--resume \
--output-dir results_strands

Model Selection Options

Available Models

  • sonnet37 - Claude 3.7 Sonnet (default) - Best for complex reasoning
  • haiku - Claude 3.5 Haiku - Fast and cost-effective
  • sonnet - Claude 3.7 Sonnet (alias)
  • thinking - Sonnet with thinking capabilities
  • deepseek - DeepSeek R1 model
  • llama - Meta Llama 3.3 70B

Workflow-Specific Model Selection

Generation Workflow (Maintainer + User):

{
  "maintainer": "sonnet37",  // Complex technical responses
  "user": "haiku"            // Faster user simulation
}

Evaluation Workflow (Judge):

{
  "judge": "sonnet37"  // Detailed evaluation and reasoning
}

Complete Workflow (All Agents):

{
  "maintainer": "sonnet37",  // Technical problem solving
  "user": "haiku",           // User interaction simulation
  "judge": "sonnet37"        // Comprehensive evaluation
}

Strands Framework Features

Enhanced Tool Capabilities

All agents now have access to:

  • File Operations: Read, write, and modify files
  • Command Execution: Safe bash command execution
  • Repository Analysis: Advanced code exploration
  • AWS Integration: Direct AWS service interaction
  • Advanced Reasoning: Enhanced thinking capabilities

Performance Optimizations

  • Prompt Caching: Automatic caching for cost reduction
  • Cost Tracking: Detailed token usage and cost analysis
  • Metrics Collection: Performance and efficiency monitoring

Safety Controls

  • Read-Only Mode: Safe operations only
  • Command Restrictions: Built-in safety for bash execution
  • Graceful Fallback: Works even without Strands framework

Agent Frameworks

CodeAssistBench supports three agent frameworks. Strands drives all three agents, while OpenHands and Q-CLI can replace Strands for the Maintainer agent only:

Strands (Default)

  • Built-in, no extra installation
  • Supports all 3 agents (Maintainer, User, Judge)
  • Optimized with prompt caching
  • Fast and cost-effective

OpenHands (Optional 3rd Party)

  • Alternative framework for Maintainer agent only
  • Specialized for complex code generation tasks
  • Uses OpenHands SDK for repository exploration

Q-CLI (Optional Amazon Q)

  • Alternative framework for Maintainer agent only
  • Uses Amazon Q CLI for AWS-native workflows
  • Subprocess-based integration (see the sketch below)
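
To illustrate what subprocess-based integration means in practice, here is a minimal sketch (not CodeAssistBench's actual implementation; it simply shells out to the q binary the way this guide's own test command does):

import subprocess

# Send one prompt to Q-CLI and capture the reply, mirroring the
# "q chat --model claude-sonnet-4.5" test command shown under Installation.
result = subprocess.run(
    ["q", "chat", "--model", "claude-sonnet-4.5", "Hello"],
    capture_output=True,
    text=True,
    timeout=600,  # seconds; compare the qcli_timeout option under Troubleshooting
)
print("return code:", result.returncode)
print(result.stdout)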

Installation

OpenHands:

# Install OpenHands support
pip install -e .[openhands]

# Set API key
export ANTHROPIC_API_KEY="your-key-here"

Q-CLI:

# Install Q-CLI following AWS documentation
# Verify installation
q --version

# Configure AWS credentials (if not already done)
aws configure

# Test Q-CLI
q chat --model claude-sonnet-4.5 "Hello"

# No additional Python packages needed for Q-CLI

CLI Usage

Strands (default):

python -m cab_evaluation.cli dataset examples/sample_dataset.jsonl \
--agent-models '{"maintainer": "sonnet37", "user": "haiku", "judge": "sonnet37"}'

OpenHands maintainer:

python -m cab_evaluation.cli dataset examples/sample_dataset.jsonl \
--agent-models '{"maintainer": "claude-sonnet-4-5-20250929", "user": "haiku", "judge": "sonnet37"}' \
--agent-framework '{"maintainer": "openhands"}'

Q-CLI maintainer:

python -m cab_evaluation.cli dataset examples/sample_dataset.jsonl \
--agent-models '{"maintainer": "claude-sonnet-4.5", "user": "haiku", "judge": "sonnet37"}' \
--agent-framework '{"maintainer": "qcli"}'

Python API Usage

from cab_evaluation import create_cab_evaluator

# Strands (default)
evaluator = create_cab_evaluator(
    agent_model_mapping={"maintainer": "sonnet37", "user": "haiku", "judge": "sonnet37"}
)

# OpenHands maintainer
evaluator = create_cab_evaluator(
    agent_model_mapping={"maintainer": "claude-sonnet-4-5-20250929", "user": "haiku", "judge": "sonnet37"},
    agent_framework_mapping={"maintainer": "openhands"}
)

# Q-CLI maintainer
evaluator = create_cab_evaluator(
    agent_model_mapping={"maintainer": "claude-sonnet-4.5", "user": "haiku", "judge": "sonnet37"},
    agent_framework_mapping={"maintainer": "qcli"}
)

Model Name Formats

| Framework | Format            | Example                                 |
|-----------|-------------------|-----------------------------------------|
| Strands   | Short names       | "sonnet37", "haiku"                     |
| OpenHands | Full model ID     | "claude-sonnet-4-5-20250929"            |
| Q-CLI     | Q-CLI model names | "claude-sonnet-4.5", "claude-haiku-4.5" |

Q-CLI Available Models:

  • claude-sonnet-4.5
  • claude-sonnet-4
  • claude-haiku-4.5
  • qwen3-coder-480b
  • Auto (default)

Check available models: q chat --model invalid 2>&1 | grep Available

Framework Comparison

| Feature       | Strands             | OpenHands              | Q-CLI              |
|---------------|---------------------|------------------------|--------------------|
| Installation  | Included            | pip install openhands  | External CLI       |
| Agent Support | All 3 agents        | Maintainer only        | Maintainer only    |
| Speed         | Fast                | Moderate               | Fast               |
| Metrics       | Full token tracking | Full token tracking    | Basic metrics only |
| Best For      | General evaluation  | Complex code tasks     | AWS Q integration  |

Troubleshooting

OpenHands not found:

pip install openhands openhands-sdk openhands-tools

OpenHands API key not set:

export ANTHROPIC_API_KEY="your-key-here"

Q-CLI not found:

# Install Q-CLI following AWS documentation
# Verify installation
q --version

# If command not found, ensure Q-CLI is in PATH
which q

Q-CLI timeout errors:

# Increase timeout in Python API
evaluator = create_cab_evaluator(
    agent_model_mapping={"maintainer": "claude-sonnet-4.5"},
    agent_framework_mapping={"maintainer": "qcli"},
    agent_framework_config={"qcli_timeout": 600}  # 10 minutes
)

Q-CLI model not available:

# Check available models
q chat --model invalid 2>&1 | grep Available

# Use a supported model
python -m cab_evaluation.cli dataset examples/sample_dataset.jsonl \
--agent-models '{"maintainer": "claude-sonnet-4.5"}' \
--agent-framework '{"maintainer": "qcli"}'

Import errors:

  • System automatically falls back to Strands
  • No action needed

Q-CLI Quick Reference

Setup

# Install Q-CLI (follow AWS docs), then verify
q --version
q chat --model claude-sonnet-4.5 "test"

Usage

# CLI
python -m cab_evaluation.cli dataset examples/sample_dataset.jsonl \
--agent-models '{"maintainer": "claude-sonnet-4.5", "user": "haiku", "judge": "sonnet37"}' \
--agent-framework '{"maintainer": "qcli"}'

# Python API
evaluator = create_cab_evaluator(
    agent_model_mapping={"maintainer": "claude-sonnet-4.5", "user": "haiku", "judge": "sonnet37"},
    agent_framework_mapping={"maintainer": "qcli"}
)

Models

  • claude-sonnet-4.5 (recommended)
  • claude-haiku-4.5 (fast)
  • Auto (default)

Limitations

  • No token usage tracking
  • No cache metrics
  • Basic metadata only (execution time, return code)

Advanced Usage

Configuration File

Create custom configuration:

python -m cab_evaluation.cli config examples/custom_config.json

Read-Only Mode

For safe repository analysis:

from cab_evaluation.agents.agent_factory import AgentFactory

factory = AgentFactory()
maintainer = factory.create_maintainer_agent(
    model_name="sonnet37",
    read_only=True  # Only safe operations
)

Logging and Debugging

Enable detailed logging:

python -m cab_evaluation.cli single examples/sample_issue.json \
--log-level DEBUG \
--log-file strands_debug.log

Sample File Structure

Issue JSON Structure

{
  "number": 1234,
  "title": "[Bug]: Issue description",
  "created_at": "ISO timestamp",
  "closed_at": "ISO timestamp",
  "commit_id": "git commit hash",
  "labels": ["bug", "docker"],
  "url": "GitHub issue URL",
  "body": "Detailed issue description with markdown",
  "author": "username",
  "comments": [
    {
      "user": "maintainer_username",
      "created_at": "ISO timestamp",
      "body": "Response content with code examples"
    }
  ],
  "satisfaction_conditions": [
    "Condition 1: What the user needs to be satisfied",
    "Condition 2: Technical requirements",
    "Condition 3: Additional expectations"
  ],
  "_classification": {
    "category": "Needs Docker build environment | Does not need build environment",
    "timestamp": "completion timestamp"
  }
}
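
Since satisfaction_conditions drive both the user simulation and the judge's verdict, it can save a failed run to validate an issue file up front. A minimal sketch using only the standard library (the required-field list is inferred from the structure above, not taken from CodeAssistBench's own validation):

import json

# Top-level fields the workflows rely on, per the structure shown above.
REQUIRED_FIELDS = ["number", "title", "body", "comments", "satisfaction_conditions"]

def check_issue(path):
    """Raise if an issue JSON file is missing any required top-level field."""
    with open(path) as f:
        issue = json.load(f)
    missing = [field for field in REQUIRED_FIELDS if field not in issue]
    if missing:
        raise ValueError(f"{path} is missing fields: {missing}")
    return issue

check_issue("examples/sample_issue.json")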

Dataset JSONL Structure

Each line contains one complete issue JSON object as shown above.
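
Reading a dataset is then one json.loads per line; a minimal sketch:

import json

def load_dataset(path):
    """Yield one issue dict per non-empty line of a JSONL dataset."""
    with open(path) as f:
        for line in f:
            if line.strip():
                yield json.loads(line)

for issue in load_dataset("examples/sample_dataset.jsonl"):
    print(issue["number"], issue["title"])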

Expected Results

Generation Result Structure

{
  "issue_id": "1234",
  "user_satisfied": true/false,
  "satisfaction_status": "FULLY_SATISFIED|PARTIALLY_SATISFIED|NOT_SATISFIED",
  "total_conversation_rounds": 3,
  "conversation_history": [...],
  "agent_model_mapping": {...}
}
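
After a batch run you will typically have many of these files; a quick tally only needs the user_satisfied flag (a minimal sketch, assuming generation_result_*.json files in the current directory):

import glob
import json

# Count how many generation runs ended with a satisfied simulated user.
results = []
for path in sorted(glob.glob("generation_result_*.json")):
    with open(path) as f:
        results.append(json.load(f))

satisfied = sum(1 for r in results if r["user_satisfied"])
print(f"{satisfied}/{len(results)} runs ended with a satisfied user")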

Evaluation Result Structure

{
  "issue_id": "1234",
  "verdict": "CORRECT|PARTIALLY_CORRECT|INCORRECT",
  "alignment_score": {
    "satisfied": 3,
    "total": 3,
    "percentage": 100.0
  },
  "docker_results": {...},
  "agent_model_mapping": {...}
}
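
The percentage field is simply satisfied / total expressed as a percent (3/3 gives 100.0 above). A minimal sketch that aggregates verdicts and alignment scores across results (the evaluation_result_*.json glob is an assumption based on the output naming described earlier):

import glob
import json

for path in sorted(glob.glob("evaluation_result_*.json")):
    with open(path) as f:
        result = json.load(f)
    score = result["alignment_score"]
    # percentage == satisfied / total * 100
    print(f"{result['issue_id']}: {result['verdict']} "
          f"({score['satisfied']}/{score['total']} = {score['percentage']:.1f}%)")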

Performance Tips

  1. Use appropriate models: sonnet37 for complex reasoning, haiku for speed
  2. Enable caching: Automatic with Strands framework
  3. Monitor costs: Check logs for token usage and cost tracking
  4. Batch processing: Use reasonable batch sizes for large datasets
  5. Resume capability: Use --resume for interrupted dataset processing

General Troubleshooting

Common Issues

  • Import errors: Ensure Strands framework is available in Python path
  • Model access: Verify AWS credentials for Bedrock model access
  • Permission errors: Check file permissions for read/write operations
  • Tool failures: Review logs for specific tool execution errors

Fallback Behavior

If the Strands framework is unavailable, agents automatically fall back to the standard LLM service, with reduced capabilities but full compatibility.
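
The mechanism follows the usual optional-dependency pattern; a minimal sketch of the idea (class and function names below are illustrative, not CodeAssistBench's internals):

# Illustrative sketch only: names are hypothetical, not CodeAssistBench internals.
try:
    from strands import Agent  # Strands framework, if importable
    HAS_STRANDS = True
except ImportError:
    HAS_STRANDS = False

class StandardLLMAgent:
    """Minimal stand-in for the reduced-capability standard LLM service."""
    def __init__(self, model_name):
        self.model_name = model_name

def make_maintainer(model_name):
    """Prefer a tool-enabled Strands agent; fall back to the plain LLM service."""
    if HAS_STRANDS:
        return Agent(model=model_name)  # hypothetical constructor signature
    return StandardLLMAgent(model_name)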

Next Steps

  1. Try the sample files with different model combinations
  2. Monitor the enhanced logging and metrics
  3. Experiment with read-only mode for safe operations
  4. Use the tool capabilities for repository analysis and code exploration