CodeAssistBench Usage Guide
This guide shows how to use CodeAssistBench with both Strands (default) and OpenHands (optional) agent frameworks.
Important: Virtual Environment Setup
Due to externally-managed Python environments, you MUST use a virtual environment:

```bash
# 1. Create virtual environment
python3 -m venv venv

# 2. Activate virtual environment
source venv/bin/activate

# 3. Install dependencies
pip install -r requirements.txt

# 4. Install CodeAssistBench in development mode
pip install -e .

# 5. Install Strands framework (if available locally)
# Navigate to Strands-Science directory and install:
pip install -e strands-1.12.0/

# 6. Install Strands tools
pip install -e tools/

# 7. Verify installation
python -c "from src.cab_evaluation.agents.agent_factory import AgentFactory; print('Setup complete!')"
```
Note: Always activate the virtual environment before running CodeAssistBench:

```bash
source venv/bin/activate
```
Sample Files Included
- `sample_issue.json` - Single issue for individual evaluation
- `sample_dataset.jsonl` - Multiple issues for batch processing
- `strands_agent_example.py` - Code examples and demonstrations
CLI Usage Examples
1. Single Issue Evaluation (Complete Workflow)
Default model (sonnet37 for all agents):
Custom model for all agents:
```bash
--agent-models '{"maintainer": "sonnet37", "user": "haiku", "judge": "sonnet37"}'
```
Output: cab_result_
2. Generation Workflow Only (Maintainer & User Agents)
Default models:
Custom models for generation:
```bash
--agent-models '{"maintainer": "sonnet37", "user": "haiku"}'
```
Output: generation_result_
3. Evaluation Workflow Only (Judge Agent)
Default model:
Custom model for evaluation:
```bash
--agent-models '{"judge": "sonnet37"}'
```
Output: evaluation_result_
4. Dataset Processing (All Agents)
Process entire dataset:

```bash
python -m cab_evaluation.cli dataset examples/sample_dataset.jsonl
```

Filter by language and custom models:

```bash
python -m cab_evaluation.cli dataset examples/sample_dataset.jsonl \
  --language python \
  --agent-models '{"maintainer": "sonnet37", "user": "sonnet37", "judge": "sonnet37"}'
```

Batch processing with resume:

```bash
python -m cab_evaluation.cli dataset examples/sample_dataset.jsonl \
  --batch-size 5 \
  --resume \
  --output-dir results_strands
```
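Conceptually, `--resume` skips issues that already have a result file in the output directory. A hedged sketch of that logic, assuming results are named `evaluation_result_<issue_id>.json` (the real CLI's naming scheme may differ):

```python
from pathlib import Path

def pending_issues(issues, output_dir):
    """Yield issues that do not yet have a result file on disk.

    Illustrative only: assumes results are named
    evaluation_result_<issue_id>.json.
    """
    done = {p.stem.rsplit("_", 1)[-1]
            for p in Path(output_dir).glob("evaluation_result_*.json")}
    for issue in issues:
        # Skip issues whose number already appears among result files
        if str(issue["number"]) not in done:
            yield issue
```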
Model Selection Options
Available Models
- `sonnet37` - Claude 3.7 Sonnet (default) - Best for complex reasoning
- `haiku` - Claude 3.5 Haiku - Fast and cost-effective
- `sonnet` - Claude 3.7 Sonnet (alias)
- `thinking` - Sonnet with thinking capabilities
- `deepseek` - DeepSeek R1 model
- `llama` - Meta Llama 3.3 70B
Workflow-Specific Model Selection
Generation Workflow (Maintainer + User):

```
{
  "maintainer": "sonnet37",  // Complex technical responses
  "user": "haiku"            // Faster user simulation
}
```

Evaluation Workflow (Judge):

```
{
  "judge": "sonnet37"  // Detailed evaluation and reasoning
}
```

Complete Workflow (All Agents):

```
{
  "maintainer": "sonnet37",  // Technical problem solving
  "user": "haiku",           // User interaction simulation
  "judge": "sonnet37"        // Comprehensive evaluation
}
```
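In Python, these per-workflow choices are plain dicts, with `sonnet37` used for any role left unspecified. A minimal sketch (illustrative only; the real CLI parses `--agent-models` itself):

```python
# Default model for every agent role, per the "Available Models" list above.
DEFAULT_MODEL = "sonnet37"
ROLES = ("maintainer", "user", "judge")

def resolve_agent_models(overrides=None):
    """Fill in the default model for any role not explicitly overridden.

    Illustrative helper, not the framework's actual resolution code.
    """
    mapping = {role: DEFAULT_MODEL for role in ROLES}
    mapping.update(overrides or {})
    return mapping
```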
Strands Framework Features
Enhanced Tool Capabilities
All agents now have access to:
- File Operations: Read, write, and modify files
- Command Execution: Safe bash command execution
- Repository Analysis: Advanced code exploration
- AWS Integration: Direct AWS service interaction
- Advanced Reasoning: Enhanced thinking capabilities
Performance Optimizations
- Prompt Caching: Automatic caching for cost reduction
- Cost Tracking: Detailed token usage and cost analysis
- Metrics Collection: Performance and efficiency monitoring
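To illustrate the kind of cost analysis such tracking enables, here is a small estimator. The per-1K-token rates are placeholder assumptions, not real Bedrock prices:

```python
# Hypothetical (input, output) rates per 1K tokens; real pricing varies
# by model and region.
RATES = {"sonnet37": (0.003, 0.015), "haiku": (0.0008, 0.004)}

def estimate_cost(model, input_tokens, output_tokens):
    """Rough cost estimate from token counts, using illustrative rates."""
    in_rate, out_rate = RATES[model]
    return input_tokens / 1000 * in_rate + output_tokens / 1000 * out_rate
```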
Safety Controls
- Read-Only Mode: Safe operations only
- Command Restrictions: Built-in safety for bash execution
- Graceful Fallback: Works even without Strands framework
Agent Frameworks
CodeAssistBench supports three agent frameworks; Strands drives all three agents, while OpenHands and Q-CLI can replace Strands for the Maintainer agent only:
Strands (Default)
- Built-in, no extra installation
- Supports all 3 agents (Maintainer, User, Judge)
- Optimized with prompt caching
- Fast and cost-effective
OpenHands (Optional Third Party)
- Alternative framework for Maintainer agent only
- Specialized for complex code generation tasks
- Uses OpenHands SDK for repository exploration
Q-CLI (Optional Amazon Q)
- Alternative framework for Maintainer agent only
- Uses Amazon Q CLI for AWS-native workflows
- Subprocess-based integration
Installation
OpenHands:

```bash
pip install -e .[openhands]

# Set API key
export ANTHROPIC_API_KEY="your-key-here"
```
Q-CLI:

```bash
# Verify installation
q --version

# Configure AWS credentials (if not already done)
aws configure

# Test Q-CLI
q chat --model claude-sonnet-4.5 "Hello"

# No additional Python packages needed for Q-CLI
```
CLI Usage
Strands (default):

```bash
--agent-models '{"maintainer": "sonnet37", "user": "haiku", "judge": "sonnet37"}'
```

OpenHands maintainer:

```bash
--agent-models '{"maintainer": "claude-sonnet-4-5-20250929", "user": "haiku", "judge": "sonnet37"}' \
--agent-framework '{"maintainer": "openhands"}'
```

Q-CLI maintainer:

```bash
--agent-models '{"maintainer": "claude-sonnet-4.5", "user": "haiku", "judge": "sonnet37"}' \
--agent-framework '{"maintainer": "qcli"}'
```
Python API Usage
```python
# Import path may differ in your checkout; create_cab_evaluator is the
# entry point used throughout this guide.
from cab_evaluation import create_cab_evaluator

# Strands (default)
evaluator = create_cab_evaluator(
    agent_model_mapping={"maintainer": "sonnet37", "user": "haiku", "judge": "sonnet37"}
)

# OpenHands maintainer
evaluator = create_cab_evaluator(
    agent_model_mapping={"maintainer": "claude-sonnet-4-5-20250929", "user": "haiku", "judge": "sonnet37"},
    agent_framework_mapping={"maintainer": "openhands"}
)

# Q-CLI maintainer
evaluator = create_cab_evaluator(
    agent_model_mapping={"maintainer": "claude-sonnet-4.5", "user": "haiku", "judge": "sonnet37"},
    agent_framework_mapping={"maintainer": "qcli"}
)
```
Model Name Formats
| Framework | Format | Example |
|---|---|---|
| Strands | Short names | "sonnet37", "haiku" |
| OpenHands | Full model ID | "claude-sonnet-4-5-20250929" |
| Q-CLI | Q-CLI model names | "claude-sonnet-4.5", "claude-haiku-4.5" |
Q-CLI Available Models:
- `claude-sonnet-4.5`
- `claude-sonnet-4`
- `claude-haiku-4.5`
- `qwen3-coder-480b`
- `Auto` (default)

Check available models:

```bash
q chat --model invalid 2>&1 | grep Available
```
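The naming conventions in the table can be sanity-checked before launching a run. The patterns below are inferred from the examples above and are a sketch, not the frameworks' actual validation rules:

```python
import re

# Patterns derived from the example model names in the table above.
MODEL_PATTERNS = {
    "strands": re.compile(r"[a-z0-9]+"),                    # short names like sonnet37
    "openhands": re.compile(r"claude-[a-z]+-\d-\d-\d{8}"),  # full IDs like claude-sonnet-4-5-20250929
    "qcli": re.compile(r"[a-z0-9]+(-[a-z0-9.]+)*"),         # names like claude-sonnet-4.5
}

def looks_valid(framework, model_name):
    """Return True if the name matches the framework's expected shape."""
    return MODEL_PATTERNS[framework].fullmatch(model_name) is not None
```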
Framework Comparison
| Feature | Strands | OpenHands | Q-CLI |
|---|---|---|---|
| Installation | Included | `pip install -e .[openhands]` | External CLI |
| Agent Support | All 3 agents | Maintainer only | Maintainer only |
| Speed | Fast | Moderate | Fast |
| Metrics | Full token tracking | Full token tracking | Basic metrics only |
| Best For | General evaluation | Complex code tasks | AWS Q integration |
Troubleshooting
OpenHands not found:

```bash
pip install -e .[openhands]
```

OpenHands API key not set:

```bash
export ANTHROPIC_API_KEY="your-key-here"
```

Q-CLI not found:

```bash
# Verify installation
q --version

# If command not found, ensure Q-CLI is in PATH
which q
```
Q-CLI timeout errors:

```python
evaluator = create_cab_evaluator(
    agent_model_mapping={"maintainer": "claude-sonnet-4.5"},
    agent_framework_mapping={"maintainer": "qcli"},
    agent_framework_config={"qcli_timeout": 600}  # 10 minutes
)
```
Q-CLI model not available:

```bash
# List supported models
q chat --model invalid 2>&1 | grep Available

# Use a supported model
python -m cab_evaluation.cli dataset examples/sample_dataset.jsonl \
  --agent-models '{"maintainer": "claude-sonnet-4.5"}' \
  --agent-framework '{"maintainer": "qcli"}'
```
Import errors:
- System automatically falls back to Strands
- No action needed
Q-CLI Quick Reference
Setup

```bash
q --version
q chat --model claude-sonnet-4.5 "test"
```

Usage

```bash
# CLI
python -m cab_evaluation.cli dataset examples/sample_dataset.jsonl \
  --agent-models '{"maintainer": "claude-sonnet-4.5", "user": "haiku", "judge": "sonnet37"}' \
  --agent-framework '{"maintainer": "qcli"}'
```

```python
# Python API
evaluator = create_cab_evaluator(
    agent_model_mapping={"maintainer": "claude-sonnet-4.5", "user": "haiku", "judge": "sonnet37"},
    agent_framework_mapping={"maintainer": "qcli"}
)
```
Models
- `claude-sonnet-4.5` (recommended)
- `claude-haiku-4.5` (fast)
- `Auto` (default)
Limitations
- No token usage tracking
- No cache metrics
- Basic metadata only (execution time, return code)
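The "basic metadata" above is what a subprocess wrapper can observe from the outside. A minimal sketch of that pattern, using a stand-in command (the real integration invokes `q chat`):

```python
import subprocess
import time

def run_with_metadata(cmd):
    """Run a CLI command and capture the metadata a subprocess-based
    integration can observe: wall-clock time and return code.
    """
    start = time.monotonic()
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return {
        "stdout": proc.stdout,
        "return_code": proc.returncode,
        "execution_time": time.monotonic() - start,
    }
```

Token counts and cache statistics live inside the Q-CLI process, which is why they are unavailable at this level.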
Advanced Usage
Configuration File
Create custom configuration:
Read-Only Mode
For safe repository analysis:
```python
from src.cab_evaluation.agents.agent_factory import AgentFactory

factory = AgentFactory()
maintainer = factory.create_maintainer_agent(
    model_name="sonnet37",
    read_only=True  # Only safe operations
)
```
Logging and Debugging
Enable detailed logging:

```bash
--log-level DEBUG \
--log-file strands_debug.log
```
Sample File Structure
Issue JSON Structure
```json
{
  "number": 1234,
  "title": "[Bug]: Issue description",
  "created_at": "ISO timestamp",
  "closed_at": "ISO timestamp",
  "commit_id": "git commit hash",
  "labels": ["bug", "docker"],
  "url": "GitHub issue URL",
  "body": "Detailed issue description with markdown",
  "author": "username",
  "comments": [
    {
      "user": "maintainer_username",
      "created_at": "ISO timestamp",
      "body": "Response content with code examples"
    }
  ],
  "satisfaction_conditions": [
    "Condition 1: What the user needs to be satisfied",
    "Condition 2: Technical requirements",
    "Condition 3: Additional expectations"
  ],
  "_classification": {
    "category": "Needs Docker build environment | Does not need build environment",
    "timestamp": "completion timestamp"
  }
}
```
Dataset JSONL Structure
Each line contains one complete issue JSON object as shown above.
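A loader for this format fits in a few lines. The required-key check below mirrors the fields shown above; the real loader may enforce more:

```python
import json

# A subset of the issue fields documented above.
REQUIRED_KEYS = {"number", "title", "body", "satisfaction_conditions"}

def load_dataset(path):
    """Parse a JSONL dataset: one issue object per line, blank lines skipped."""
    issues = []
    with open(path) as fh:
        for line in fh:
            if not line.strip():
                continue
            issue = json.loads(line)
            missing = REQUIRED_KEYS - issue.keys()
            if missing:
                raise ValueError(f"issue {issue.get('number')} missing {missing}")
            issues.append(issue)
    return issues
```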
Expected Results
Generation Result Structure
```
{
  "issue_id": "1234",
  "user_satisfied": true/false,
  "satisfaction_status": "FULLY_SATISFIED|PARTIALLY_SATISFIED|NOT_SATISFIED",
  "total_conversation_rounds": 3,
  "conversation_history": [...],
  "agent_model_mapping": {...}
}
```
Evaluation Result Structure
```
{
  "issue_id": "1234",
  "verdict": "CORRECT|PARTIALLY_CORRECT|INCORRECT",
  "alignment_score": {
    "satisfied": 3,
    "total": 3,
    "percentage": 100.0
  },
  "docker_results": {...},
  "agent_model_mapping": {...}
}
```
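The `alignment_score` block is derived from per-condition counts. A small helper reproducing the arithmetic (rounding to one decimal place is an assumption):

```python
def alignment_score(satisfied, total):
    """Build an alignment_score dict like the one shown above from raw counts."""
    return {
        "satisfied": satisfied,
        "total": total,
        "percentage": round(100.0 * satisfied / total, 1) if total else 0.0,
    }
```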
Performance Tips
- Use appropriate models: sonnet37 for complex reasoning, haiku for speed
- Enable caching: Automatic with Strands framework
- Monitor costs: Check logs for token usage and cost tracking
- Batch processing: Use reasonable batch sizes for large datasets
- Resume capability: Use `--resume` for interrupted dataset processing
Troubleshooting
Common Issues
- Import errors: Ensure Strands framework is available in Python path
- Model access: Verify AWS credentials for Bedrock model access
- Permission errors: Check file permissions for read/write operations
- Tool failures: Review logs for specific tool execution errors
Fallback Behavior
If Strands framework is unavailable, agents automatically fall back to standard LLM service with reduced capabilities but full compatibility.
Next Steps
- Try the sample files with different model combinations
- Monitor the enhanced logging and metrics
- Experiment with read-only mode for safe operations
- Use the tool capabilities for repository analysis and code exploration