Python Evaluation Framework¶
LLM test suite for prompt templates and AgenticGuru using G-Eval (LLM-as-judge) and deterministic validation.
Quick Start¶
# Setup
cd python_eval
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
# Run backend
cd .. && ENABLE_TEST_ROUTES=true npm run dev
# Run tests
python test_runner.py --stats # View all tests
python test_runner.py all -f sdf3 # All tests for sdf3
python test_runner.py --all-fixtures # Full regression
python test_runner.py <test_id> -f <fix> -v # Single test, verbose
Test Suite¶
Tests are split into 7 collections:
- script_to_scenes (9) - Asset/scene extraction from scripts
- story_operations (9) - Story guru: add/remove/merge scenes, navigation
- scene_auto_actions (2) - Scene guru: proactive layout/shot generation
- scene_operations (4) - Scene guru: add/reorder/merge shots, regenerate
- scene_next_steps (3) - Scene guru: context-aware suggestions
- shot_tools (5) - Template validation (no hallucinations)
- shot_tweak (3) - Shot edit guru: image edits, clarifications
Test types:
- Deterministic (10) - Rule-based validation
- G-Eval (4) - LLM semantic evaluation
- Conversational (21) - Multi-step guru conversations
Project Structure¶
python_eval/
├── test_runner.py # Gurubench-style test runner
├── generate_fixtures.py # Auto-generate fixtures from scripts
├── src/prompt_client/ # API client for testDebug routes
└── tests/
├── test_definitions/ # Modular test organization
│ ├── script_to_scenes.py
│ ├── story_operations.py
│ ├── scene_auto_actions.py
│ ├── scene_operations.py
│ ├── scene_next_steps.py
│ ├── shot_tools.py
│ ├── shot_tweak.py
│ └── utils/
│ ├── g_eval_metrics.py
│ └── deterministic_funcs.py
├── fixtures/ # Test data (3 complete sets)
│ ├── {prefix}_assets.json
│ ├── {prefix}_scenes.json
│ ├── {prefix}_story_data.json
│ ├── {prefix}_chat_story_operations.json
│ ├── {prefix}_scene{N}_*.json (5 states)
│ └── {prefix}_shot_rendered.json
└── series_scripts/ # Source scripts
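The src/prompt_client/ package wraps the backend's testDebug routes so tests never build HTTP requests by hand. As a rough, hypothetical sketch only (the route path, payload shape, and function name below are illustrative assumptions, not the real client API):

import os
import requests

BACKEND_URL = os.environ.get("BACKEND_URL", "http://localhost:8081")

def run_template(template_id: str, payload: dict) -> dict:
    # POST a template invocation to a (hypothetical) testDebug route and return the JSON body.
    resp = requests.post(
        f"{BACKEND_URL}/testDebug/{template_id}",  # assumed path layout
        json=payload,
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()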
Environment¶
Loads from ../.env:
- OPENAI_API_KEY - Required for G-Eval
- BACKEND_URL - Optional (default: http://localhost:8081)
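One way the runner could pick up these variables, assuming python-dotenv is available (check requirements.txt); this is an illustration, not necessarily what test_runner.py actually does:

import os
from pathlib import Path
from dotenv import load_dotenv  # assumes python-dotenv is installed

# ../.env relative to python_eval/ is the repository root .env file.
load_dotenv(Path(__file__).resolve().parent.parent / ".env")

OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]                          # required for G-Eval
BACKEND_URL = os.environ.get("BACKEND_URL", "http://localhost:8081")   # optional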
Usage¶
# View tests and fixtures
python test_runner.py --stats
# Run by collection
python test_runner.py --collection story_operations -f sdf3
# Run single test with verbose output
python test_runner.py guru_remove_scene -f sdf3 -v
# Full regression (all tests × all fixtures)
python test_runner.py --all-fixtures
Fixtures¶
Each fixture set has 10 files:
Story-level (4 files):
- {prefix}_assets.json - Expected characters/environments
- {prefix}_scenes.json - Expected scene breakdown
- {prefix}_story_data.json - Full story with assets
- {prefix}_chat_story_operations.json - Hello exchange
Scene-level (5 files for one random scene):
- {prefix}_scene{N}_empty.json - No layout/shots
- {prefix}_scene{N}_layout.json - Has layout
- {prefix}_scene{N}_keys.json - Has key shots
- {prefix}_scene{N}_money_rendered.json - Money shots rendered
- {prefix}_scene{N}_keys_rendered.json - All keys rendered
Shot-level (1 file):
- {prefix}_shot_rendered.json - Single shot with image
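Because a fixture set is addressed purely by its prefix, loading one reduces to globbing tests/fixtures/{prefix}_*.json. A hypothetical helper along these lines (the function name and return shape are not part of the suite):

import json
from pathlib import Path

FIXTURES_DIR = Path("tests/fixtures")

def load_fixture_set(prefix: str) -> dict:
    # Map e.g. "sdf3" -> {"assets": {...}, "scenes": {...}, "scene2_layout": {...}, ...}.
    fixtures = {}
    for path in sorted(FIXTURES_DIR.glob(f"{prefix}_*.json")):
        key = path.stem[len(prefix) + 1:]  # strip "{prefix}_" from the filename
        fixtures[key] = json.loads(path.read_text())
    return fixtures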
Minimatics Fixtures¶
This part is a work in progress. For now, you can use tests/minimatics_fixtures/story_status_*/<metadata_set>/*.json as templates to speed up the tests.
Example: streamline testing shotThinker and shotProcessor¶
Start a new chat from the minimatics front-end. Connect to the minimaticsProject MongoDB collection, locate your new minimatics project, and replace everything from chatSession downwards in the project record with the corresponding content from tests/minimatics_fixtures/story_status_ongoing/char_env_set/tarnished_chat_history.json.
This way, you can keep the characters, environments, and chat history fixed while testing your minimatics production multiple times.
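If you prefer to script the swap instead of editing the record by hand, a rough pymongo sketch is below. The connection URI, database name, and the assumption that the fixture JSON holds chatSession plus the fields stored after it are guesses about the schema; adjust to your setup.

import json
from pymongo import MongoClient

fixture_path = (
    "tests/minimatics_fixtures/story_status_ongoing/char_env_set/tarnished_chat_history.json"
)
with open(fixture_path) as f:
    fixture = json.load(f)

client = MongoClient("mongodb://localhost:27017")   # assumed connection URI
projects = client["your_db"]["minimaticsProject"]   # assumed database name

project_id = "<your new minimatics project _id>"
# Assumes the fixture contains chatSession and everything that follows it in the record.
projects.update_one({"_id": project_id}, {"$set": fixture})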
Generate Fixtures¶
python generate_fixtures.py tests/series_scripts/my-script.txt --prefix myscript
Auto-creates a complete fixture set ready for testing.
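Conceptually, the generator runs the story-level templates against the script and writes each result under the chosen prefix. A hypothetical sketch of that flow (template ids other than generateAssetsV2, the call_backend() helper, and the payload shapes are assumptions, not the real script):

import json
from pathlib import Path

def generate_fixture_set(script_path: str, prefix: str, out_dir: str = "tests/fixtures") -> None:
    script = Path(script_path).read_text()
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    assets = call_backend("generateAssetsV2", {"script": script})   # hypothetical helper
    scenes = call_backend("generateScenes", {"script": script})     # assumed template id

    (out / f"{prefix}_assets.json").write_text(json.dumps(assets, indent=2))
    (out / f"{prefix}_scenes.json").write_text(json.dumps(scenes, indent=2))
    # ...plus the story_data, chat, scene-state, and shot fixtures listed above.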
How It Works¶
Efficient Testing¶
Tests are grouped by template → one API call, multiple evaluations:
API CALL: generateAssetsV2
├─ assets_characters_consistency (G-Eval)
├─ assets_environments_consistency (G-Eval)
├─ assets_has_characters (deterministic)
└─ assets_has_environments (deterministic)
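In code terms, the runner can issue one call per template and fan the response out to every evaluation registered for it. A minimal sketch of that idea (the registry shape and the call_backend/evaluate_with_llm helpers are assumptions):

from collections import defaultdict

def run_grouped(tests: dict, fixture: dict) -> dict:
    by_template = defaultdict(list)
    for test_id, spec in tests.items():
        by_template[spec["template_id"]].append((test_id, spec))

    results = {}
    for template_id, group in by_template.items():
        response = call_backend(template_id, fixture)   # one API call per template
        for test_id, spec in group:
            if spec["type"] == "deterministic":
                results[test_id] = spec["check"](response)            # rule-based validation
            else:
                results[test_id] = evaluate_with_llm(spec, response)  # G-Eval judge
    return results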
Multi-Step Conversations¶
Guru tests support conversations with LLM-generated responses:
"guru_remove_scene": {
"steps": [
{
"user_message": "Remove scene 3",
"evaluations": [{"asks_confirmation": "g_eval"}, ...]
},
{
"user_message": "LLM_GENERATE", # GPT generates appropriate response
"evaluations": [{"tool_called": "deterministic"}, ...]
}
]
}
LLM_GENERATE adapts to any guru question ("why?", "are you sure?", etc.).
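One way to implement the LLM_GENERATE placeholder is to ask a model for a plausible user reply to the guru's last message. The sketch below uses the OpenAI chat completions client; the prompt wording and model choice are illustrative assumptions, not the suite's actual implementation.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_user_reply(guru_message: str) -> str:
    # Ask a model to play the user and keep the conversation moving.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model choice
        messages=[
            {
                "role": "system",
                "content": "You are a user of a storyboarding tool. Answer the assistant's "
                           "question briefly and cooperatively so the flow continues.",
            },
            {"role": "user", "content": guru_message},
        ],
    )
    return resp.choices[0].message.content.strip()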
Adding Tests¶
Find test definitions in tests/test_definitions/ (see NOTE at top of each file).
Template test:
"my_test": {
"type": "g_eval", # or "deterministic"
"collection": "script_to_scenes",
"template_id": "generateAssetsV2",
"g_eval_metric": create_my_metric,
}
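The g_eval_metric field points at a factory in tests/test_definitions/utils/g_eval_metrics.py. If the suite builds on the deepeval library's GEval metric (an assumption; adapt to whatever g_eval_metrics.py actually uses), a factory could look like:

from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

def create_my_metric() -> GEval:
    return GEval(
        name="My metric",
        criteria="The output should only reference characters that appear in the input script.",
        evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
        threshold=0.7,
    )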
Guru conversation test:
"my_guru_test": {
"type": "guru_conversation", # or "scene_guru_conversation", "shot_edit_guru"
"collection": "story_operations",
"guru_type": "story",
"steps": [
{"user_message": "...", "evaluations": [...]},
{"user_message": "LLM_GENERATE", "evaluations": [...]},
]
}
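The deterministic evaluations named in each step resolve to plain Python checks in utils/deterministic_funcs.py. A hypothetical example of such a check (the response shape it inspects is an assumption):

def tool_called(response: dict, expected_tool: str | None = None) -> bool:
    # Pass if the guru response contains a tool call (optionally a specific one).
    calls = response.get("tool_calls") or []
    if expected_tool is None:
        return len(calls) > 0
    return any(call.get("name") == expected_tool for call in calls)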
Troubleshooting¶
Connection refused: Make sure the backend is running with ENABLE_TEST_ROUTES=true npm run dev
OpenAI API key error: Set OPENAI_API_KEY in ../.env
Import errors: Activate the virtualenv: source venv/bin/activate