
Python Evaluation Framework

LLM test suite for prompt templates and AgenticGuru using G-Eval (LLM-as-judge) and deterministic validation.

Quick Start

# Setup
cd python_eval
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

# Run backend
cd .. && ENABLE_TEST_ROUTES=true npm run dev

# Run tests
python test_runner.py --stats                    # View all tests
python test_runner.py all -f sdf3                # All tests for sdf3
python test_runner.py --all-fixtures             # Full regression
python test_runner.py <test_id> -f <fix> -v      # Single test, verbose

Test Suite

Tests are split into 7 collections:
- script_to_scenes (9) - Asset/scene extraction from scripts
- story_operations (9) - Story guru: add/remove/merge scenes, navigation
- scene_auto_actions (2) - Scene guru: proactive layout/shot generation
- scene_operations (4) - Scene guru: add/reorder/merge shots, regenerate
- scene_next_steps (3) - Scene guru: context-aware suggestions
- shot_tools (5) - Template validation (no hallucinations)
- shot_tweak (3) - Shot edit guru: image edits, clarifications

Test types:
- Deterministic (10) - Rule-based validation
- G-Eval (4) - LLM semantic evaluation
- Conversational (21) - Multi-step guru conversations

Project Structure

python_eval/
├── test_runner.py            # Gurubench-style test runner
├── generate_fixtures.py      # Auto-generate fixtures from scripts
├── src/prompt_client/        # API client for testDebug routes
└── tests/
    ├── test_definitions/     # Modular test organization
    │   ├── script_to_scenes.py
    │   ├── story_operations.py
    │   ├── scene_auto_actions.py
    │   ├── scene_operations.py
    │   ├── scene_next_steps.py
    │   ├── shot_tools.py
    │   ├── shot_tweak.py
    │   └── utils/
    │       ├── g_eval_metrics.py
    │       └── deterministic_funcs.py
    ├── fixtures/             # Test data (3 complete sets)
    │   ├── {prefix}_assets.json
    │   ├── {prefix}_scenes.json
    │   ├── {prefix}_story_data.json
    │   ├── {prefix}_chat_story_operations.json
    │   ├── {prefix}_scene{N}_*.json (5 states)
    │   └── {prefix}_shot_rendered.json
    └── series_scripts/       # Source scripts

Environment

Loads from ../.env:
- OPENAI_API_KEY - Required for G-Eval
- BACKEND_URL - Optional (default: http://localhost:8081)
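
If a standalone script needs the same configuration, a minimal sketch with python-dotenv (an assumption; the test runner may load the file differently) looks like this:

import os
from dotenv import load_dotenv

# Load the repository-level .env one directory above python_eval/.
load_dotenv("../.env")

OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]                    # required for G-Eval
BACKEND_URL = os.getenv("BACKEND_URL", "http://localhost:8081")  # optional, with default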

Usage

# View tests and fixtures
python test_runner.py --stats

# Run by collection
python test_runner.py --collection story_operations -f sdf3

# Run single test with verbose output
python test_runner.py guru_remove_scene -f sdf3 -v

# Full regression (all tests × all fixtures)
python test_runner.py --all-fixtures

Fixtures

Each fixture set has 10 files:

Story-level (4 files):
- {prefix}_assets.json - Expected characters/environments
- {prefix}_scenes.json - Expected scene breakdown
- {prefix}_story_data.json - Full story with assets
- {prefix}_chat_story_operations.json - Hello exchange

Scene-level (5 files for one random scene):
- {prefix}_scene{N}_empty.json - No layout/shots
- {prefix}_scene{N}_layout.json - Has layout
- {prefix}_scene{N}_keys.json - Has key shots
- {prefix}_scene{N}_money_rendered.json - Money shots rendered
- {prefix}_scene{N}_keys_rendered.json - All keys rendered

Shot-level (1 file):
- {prefix}_shot_rendered.json - Single shot with image
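
For reference, loading one of these files in a standalone check is straightforward; the helper below is hypothetical and not part of the runner:

import json
from pathlib import Path

FIXTURES_DIR = Path("tests/fixtures")

def load_fixture(prefix: str, suffix: str):
    """Hypothetical helper: load e.g. tests/fixtures/sdf3_assets.json."""
    path = FIXTURES_DIR / f"{prefix}_{suffix}.json"
    with path.open() as f:
        return json.load(f)

assets = load_fixture("sdf3", "assets")   # expected characters/environments
scenes = load_fixture("sdf3", "scenes")   # expected scene breakdown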

Minimatics Fixtures

This part is a work in progress. For now, you can use the files under tests/minimatics_fixtures/story_status_*/<metadata_set>/*.json as templates to speed up testing.

Example: streamline testing shotThinker and shotProcessor

Start a new chat from the minimatics front-end. Connect to the minimaticsProject MongoDB collection and locate your new minimatics project. In that project record, replace everything from chatSession downwards with the corresponding content of tests/minimatics_fixtures/story_status_ongoing/char_env_set/tarnished_chat_history.json.

This way, you can keep the characters, environments and chat history fixed while testing your minimatics production multiple times.
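
The same replacement can be scripted. The sketch below uses pymongo and assumes a local MongoDB, a database named minimatics, and that the fixture's top-level keys (chatSession and everything below it) map one-to-one onto the project document; adjust the connection string, database name, and project id for your setup:

import json
from bson import ObjectId
from pymongo import MongoClient

# Assumed connection details; replace with your environment's values.
client = MongoClient("mongodb://localhost:27017")
projects = client["minimatics"]["minimaticsProject"]

PROJECT_ID = ObjectId("64f0c0ffee0ddba11ba5eba1")  # the freshly created project's _id

with open("tests/minimatics_fixtures/story_status_ongoing/char_env_set/"
          "tarnished_chat_history.json") as f:
    fixture = json.load(f)

# Overwrite everything from chatSession downwards with the fixture content,
# leaving the characters and environments in the record untouched.
projects.update_one({"_id": PROJECT_ID}, {"$set": fixture})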

Generate Fixtures

python generate_fixtures.py tests/series_scripts/my-script.txt --prefix myscript

Auto-creates complete fixture set ready for testing.

How It Works

Efficient Testing

Tests grouped by template → one API call, multiple evaluations:

API CALL: generateAssetsV2
  ├─ assets_characters_consistency (G-Eval)
  ├─ assets_environments_consistency (G-Eval)
  ├─ assets_has_characters (deterministic)
  └─ assets_has_environments (deterministic)
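
In runner terms this is a group-by on template_id: one call per template, then every registered evaluation runs against the shared response. The sketch below is illustrative and not the runner's actual API:

from collections import defaultdict

def run_grouped(tests: dict, call_template):
    """Group test specs by template_id, call each template once,
    and evaluate every grouped test against the shared response."""
    by_template = defaultdict(list)
    for test_id, spec in tests.items():
        by_template[spec["template_id"]].append((test_id, spec))

    results = {}
    for template_id, grouped in by_template.items():
        response = call_template(template_id)              # one API call
        for test_id, spec in grouped:
            results[test_id] = spec["evaluate"](response)  # many evaluations
    return results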

Multi-Step Conversations

Guru tests support conversations with LLM-generated responses:

"guru_remove_scene": {
    "steps": [
        {
            "user_message": "Remove scene 3",
            "evaluations": [{"asks_confirmation": "g_eval"}, ...]
        },
        {
            "user_message": "LLM_GENERATE",  # GPT generates appropriate response
            "evaluations": [{"tool_called": "deterministic"}, ...]
        }
    ]
}

LLM_GENERATE adapts to any guru question ("why?", "are you sure?", etc.).
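
Under the hood this amounts to asking a model to answer in the user's place, given the conversation so far. A hedged sketch of such a step with the OpenAI Python client (the runner's actual prompt, model, and role handling may differ):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_user_reply(conversation: list[dict], intent: str) -> str:
    """Ask a model to answer the guru's last question as the user would,
    staying consistent with the test's intent (e.g. 'confirm the removal')."""
    messages = [
        {"role": "system",
         "content": f"You are the user in this conversation. Reply briefly and {intent}."},
        # Passed as-is for brevity; a real implementation might swap roles so
        # the model speaks from the user's side of the exchange.
        *conversation,
    ]
    response = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
    return response.choices[0].message.content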

Adding Tests

Find test definitions in tests/test_definitions/ (see NOTE at top of each file).

Template test:

"my_test": {
    "type": "g_eval",  # or "deterministic"
    "collection": "script_to_scenes",
    "template_id": "generateAssetsV2",
    "g_eval_metric": create_my_metric,
}
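
create_my_metric is just a factory; the real ones live in tests/test_definitions/utils/g_eval_metrics.py. As an illustration, here is a bare-bones LLM-as-judge scorer built directly on the OpenAI client (not the suite's actual metric interface):

from openai import OpenAI

client = OpenAI()

def create_my_metric():
    """Return a callable that scores a template output between 0 and 1."""
    def metric(template_input: str, template_output: str) -> float:
        prompt = (
            "Rate from 0 to 10 how consistent the output is with the input.\n\n"
            f"Input:\n{template_input}\n\nOutput:\n{template_output}\n\n"
            "Answer with a single number."
        )
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
        )
        return float(response.choices[0].message.content.strip()) / 10
    return metric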

Guru conversation test:

"my_guru_test": {
    "type": "guru_conversation",  # or "scene_guru_conversation", "shot_edit_guru"
    "collection": "story_operations",
    "guru_type": "story",
    "steps": [
        {"user_message": "...", "evaluations": [...]},
        {"user_message": "LLM_GENERATE", "evaluations": [...]},
    ]
}
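
Deterministic evaluations are plain Python checks over the response payload (see tests/test_definitions/utils/deterministic_funcs.py). The examples below are hypothetical, including the toolCalls field name:

def tool_called(response: dict, expected_tool: str) -> bool:
    """Pass if the guru invoked the expected tool in this step."""
    calls = response.get("toolCalls", [])  # assumed response field
    return any(call.get("name") == expected_tool for call in calls)

def no_hallucinated_tools(response: dict, allowed: set) -> bool:
    """Pass if every tool the guru called belongs to the allowed template set."""
    return all(call.get("name") in allowed
               for call in response.get("toolCalls", []))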

Troubleshooting

Connection refused: Ensure the backend is running with ENABLE_TEST_ROUTES=true npm run dev

OpenAI API key error: Set OPENAI_API_KEY in ../.env

Import errors: Activate venv: source venv/bin/activate