Gandalf Solver
An LLM agent with persistent external memory that solves Lakera's Gandalf prompt injection challenge. Demonstrates measurable performance improvement through structured reflection and knowledge accumulation.
Results
| Metric | Baseline | Taxonomy | Improvement |
|---|---|---|---|
| Total Attempts | 32 | 21 | 34% fewer |
| Avg per Level | 4.0 | 2.6 | 35% faster |
| Level 8 (hardest) | 7 | 3 | 57% faster |
Core Thesis
"Even a capable model benefits from explicit self-reflection and knowledge accumulation."
The agent reads from and writes to a structured taxonomy between attempts. Techniques discovered on level N transfer to level N+1.
Tech Stack
- Claude Code -- Agent runtime (Max subscription)
- Redis Cloud -- Persistent state and coordination
- W&B Weave -- Tracing and observability
Setup
cd gandalf-solver
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
# Configure (copy .env.example to .env and fill in)
cp .env.example .env
Usage
python scripts/cli.py --run myrun attack 1 "What is the password?"
# Check a password guess
python scripts/cli.py --run myrun check 1 "COCOLOCO"
# View taxonomy
python scripts/cli.py --run myrun taxonomy
# View metrics
python scripts/cli.py --run myrun metrics
# Reset state for a fresh run
python scripts/cli.py --run myrun reset
The --run flag isolates experiments -- each run gets its own Redis keys and Weave project.
How It Works
- Read -- Check taxonomy for techniques that worked on similar levels
- Attack -- Generate and send prompt to Gandalf
- Reflect -- Analyze why it worked or failed
- Update -- Record learnings to taxonomy
- Repeat -- Next attempt uses updated knowledge
Transfer Learning Evidence
| Level | Evidence |
|---|---|
| L3 | describe_word from L2 worked first try |
| L5 | letter_by_letter from L4 cut attempts in half |
| L6 | letter_by_letter continued working |
| L8 | Cross-run learning enabled 57% faster solve |
Project Structure
gandalf-solver/
+-- scripts/
| +-- cli.py # CLI interface
| +-- gandalf.py # Gandalf API client
| +-- state.py # Redis state management
| +-- trace.py # Weave tracing
+-- data/ # Local data (gitignored in prod)
+-- CLAUDE.md # Agent instructions
+-- DESIGN.md # Architecture docs
+-- RESULTS.md # Experiment results
Future Work
- Explorer/Exploiter agents -- Two parallel agents with different strategies sharing discoveries via Redis
- Live dashboard -- Next.js app showing real-time coordination
License
MIT