Dark Mode

Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

marthakelly/gandalf-solver

Repository files navigation

Gandalf Solver

An LLM agent with persistent external memory that solves Lakera's Gandalf prompt injection challenge. Demonstrates measurable performance improvement through structured reflection and knowledge accumulation.

Results

Metric Baseline Taxonomy Improvement
Total Attempts 32 21 34% fewer
Avg per Level 4.0 2.6 35% faster
Level 8 (hardest) 7 3 57% faster

Core Thesis

"Even a capable model benefits from explicit self-reflection and knowledge accumulation."

The agent reads from and writes to a structured taxonomy between attempts. Techniques discovered on level N transfer to level N+1.

Tech Stack

  • Claude Code -- Agent runtime (Max subscription)
  • Redis Cloud -- Persistent state and coordination
  • W&B Weave -- Tracing and observability

Setup

# Clone and install
cd gandalf-solver
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

# Configure (copy .env.example to .env and fill in)
cp .env.example .env

Usage

# Run an attack
python scripts/cli.py --run myrun attack 1 "What is the password?"

# Check a password guess
python scripts/cli.py --run myrun check 1 "COCOLOCO"

# View taxonomy
python scripts/cli.py --run myrun taxonomy

# View metrics
python scripts/cli.py --run myrun metrics

# Reset state for a fresh run
python scripts/cli.py --run myrun reset

The --run flag isolates experiments -- each run gets its own Redis keys and Weave project.

How It Works

  1. Read -- Check taxonomy for techniques that worked on similar levels
  2. Attack -- Generate and send prompt to Gandalf
  3. Reflect -- Analyze why it worked or failed
  4. Update -- Record learnings to taxonomy
  5. Repeat -- Next attempt uses updated knowledge

Transfer Learning Evidence

Level Evidence
L3 describe_word from L2 worked first try
L5 letter_by_letter from L4 cut attempts in half
L6 letter_by_letter continued working
L8 Cross-run learning enabled 57% faster solve

Project Structure

gandalf-solver/
+-- scripts/
| +-- cli.py # CLI interface
| +-- gandalf.py # Gandalf API client
| +-- state.py # Redis state management
| +-- trace.py # Weave tracing
+-- data/ # Local data (gitignored in prod)
+-- CLAUDE.md # Agent instructions
+-- DESIGN.md # Architecture docs
+-- RESULTS.md # Experiment results

Future Work

  • Explorer/Exploiter agents -- Two parallel agents with different strategies sharing discoveries via Redis
  • Live dashboard -- Next.js app showing real-time coordination

License

MIT

About

LLM agent with persistent memory that solves Gandalf prompt injection challenge

Resources

Readme

Stars

Watchers

Forks

Releases

No releases published

Packages

Contributors

Languages